THE QUARTERLY JOURNAL OF ECONOMICS
Vol. CXXV, Issue 1, February 2010
FREE DISTRIBUTION OR COST-SHARING? EVIDENCE FROM A RANDOMIZED MALARIA PREVENTION EXPERIMENT∗

JESSICA COHEN AND PASCALINE DUPAS

It is often argued that cost-sharing—charging a subsidized, positive price—for a health product is necessary to avoid wasting resources on those who will not use or do not need the product. We explore this argument through a field experiment in Kenya, in which we randomized the price at which prenatal clinics could sell long-lasting antimalarial insecticide-treated bed nets (ITNs) to pregnant women. We find no evidence that cost-sharing reduces wastage on those who will not use the product: women who received free ITNs are not less likely to use them than those who paid subsidized positive prices. We also find no evidence that cost-sharing induces selection of women who need the net more: those who pay higher prices appear no sicker than the average prenatal client in the area in terms of measured anemia (an important indicator of malaria). Cost-sharing does, however, considerably dampen demand. We find that uptake drops by sixty percentage points when the price of ITNs increases from zero to $0.60 (i.e., from a 100% to a 90% subsidy), a price still $0.15 below the price at which ITNs are currently sold to pregnant women in Kenya. We combine our estimates in a cost-effectiveness analysis of the impact of ITN prices on child mortality that incorporates both private and social returns to ITN usage. Overall, our results suggest that free distribution of ITNs could save many more lives than cost-sharing programs have achieved so far, and, given the large positive externality associated with widespread usage of ITNs, would likely do so at a lesser cost per life saved.

∗ We thank Larry Katz, the editor, and four anonymous referees for comments that significantly improved the paper. We also thank David Autor, Moshe Bushinsky, Esther Duflo, William Easterly, Greg Fischer, Raymond Guiteras, Sendhil Mullainathan, Mead Over, Dani Rodrik, and numerous seminar participants for helpful comments and suggestions. We thank the Mulago Foundation for its financial support, and the donors to TAMTAM Africa for providing the free nets distributed in this study. Jessica Cohen was funded by a National Science Foundation Graduate Research Fellowship. We are very grateful to the Kenya Ministry of Health and its staff for their collaboration. We thank Eva Kaplan, Nejla Liias, and especially Katharine Conn, Carolyne Nekesa, and Moses Baraza for the smooth implementation of the project and the excellent data collection. All errors are our own.
I. INTRODUCTION

Standard public finance analysis implies that health goods generating positive externalities should be publicly funded, or even subsidized at more than 100% if the private nonmonetary costs (such as side effects) are high. Although this analysis applies to goods whose effectiveness is independent of the behavior of the recipients (e.g., vaccines, deworming pills administered to schoolchildren), it does not necessarily apply to goods that require active usage (adherence) by their owner for the public health benefits to be realized (e.g., bed nets for reduced malaria transmission, pit latrines for reduced water contamination). For such goods, charging nonzero prices ("cost-sharing") could improve the efficacy of public subsidies by reducing wastage from giving products to those who will not use them.

There are three possible effects of positive prices on the likelihood that people who acquire the product use it appropriately. First, a selection effect: charging a positive price could select out those who do not value the good and place it only in the hands of those who are likely to use it (Oster 1995; Population Services International [PSI] 2003; Ashraf, Berry, and Shapiro forthcoming). Second, a psychological effect: paying a positive price for a good could induce people to use it more if they exhibit "sunk cost" effects (Thaler 1980; Arkes and Blumer 1985). Third, higher prices may encourage usage if they are interpreted as a signal of higher quality (Bagwell and Riordan 1991; Riley 2001).

Although cost-sharing may lead to higher usage intensity than free distribution, it may also reduce program coverage by dampening demand. A number of experimental and field studies indicate that there may be special psychological properties to a zero financial price and that demand may drop precipitously when the price is raised slightly above zero (Ariely and Shampan'er 2007; Kremer and Miguel 2007). Beyond dampening demand, positive prices also have ambiguous selection effects in the presence of credit and cash constraints: if people who cannot afford to pay a positive price are more likely to be sick and to need the good, then charging a positive price would screen out the neediest and could significantly reduce the health benefits of the partial subsidy.

In the end, the relative benefits of various levels of subsidization of health products depend on a few key factors: (1) the elasticity of demand with respect to price, (2) the elasticity of usage with respect to price (which potentially includes selection, psychological, and signaling effects), (3) the impact of price variation on the vulnerability (i.e., need) of the marginal consumer, and, finally, (4) the presence of nonlinearities or externalities in the health production function.1

1. There are other potential channels running from the price of a health product to its health impact. For example, the price could influence how the product is cared for (e.g., a more expensive bed net could be washed too frequently, reducing the efficacy of its insecticide) or could have spillover effects on other health behaviors. We focus on the four channels described above because they are the most commonly cited in the debate over the pricing of public health products and are likely to have first-order impacts on the relationship between prices and health outcomes.
This paper estimates the first three parameters and explores the trade-offs between free distribution and cost-sharing for a health product with a proven positive externality: insecticide-treated bed nets (ITNs). ITNs are used to prevent malaria infection and have proven highly effective in reducing maternal anemia and infant mortality, both directly for users and indirectly for nonusers when the share of users in their vicinity is large enough. ITNs are expensive to manufacture, and the question of how much to subsidize them is at the center of a lively debate in the international community, pitting proponents of free distribution (Sachs 2005; World Health Organization [WHO] 2007) against advocates of cost-sharing (PSI 2003; Easterly 2006).

In a field experiment in Kenya, we randomized the price at which 20 prenatal clinics could sell long-lasting ITNs to pregnant women. Four clinics served as a control group, and four price levels were used among the other 16 clinics, ranging from 0 (free distribution) to 40 Kenyan shillings (Ksh) ($0.60). ITNs were thus heavily subsidized, with the highest price corresponding to a 90% subsidy, comparable to the subsidies offered by the major cost-sharing interventions operating in the area and in many other malaria-endemic African countries. To check whether women who need the ITN most are willing to pay more for it, we measured hemoglobin levels (a measure of anemia and an important indicator of malaria in pregnancy) at the time of the prenatal visit. To estimate the impact of price variation on usage, we visited a subsample of women at home a few months later to check whether they still had the nets and whether they were using them.

The relationship between prices and usage that we estimate based on follow-up home visits is the combined effect of selection and sunk cost effects.2

2. The correlation between prices and usage is also potentially the product of signaling effects of prices, but this is unlikely in our context. Qualitative evidence suggests that the great majority of households in Kenya know that ITNs are heavily subsidized for pregnant women and young children and that the "true" price of ITNs (i.e., the signal of their value) is in the $4–$6 range. This is likely due to the fact that retail shops sell unsubsidized ITNs at these prices.
To isolate these separate channels, we follow Karlan and Zinman (forthcoming) and Ashraf, Berry, and Shapiro (forthcoming) and implement a randomized two-stage pricing design. In clinics charging a positive price, a subsample of women who decided to buy the net at the posted price were surprised with a lottery for an additional discount; for the women sampled for this second-stage lottery, the actual price paid ranged from 0 to the posted price. Among these women, any variation in usage with the actual price paid should be the result of psychological sunk cost effects. Taken together, the two stages of this experimental design enable us to estimate the relative merits of free distribution and varying degrees of cost-sharing with respect to uptake, selection, and usage intensity.

We find that uptake of ITNs drops significantly at modest cost-sharing prices. Demand drops by 60% when the price is increased from zero to 40 Ksh ($0.60). This latter price is still 10 Ksh ($0.15) below the prevailing cost-sharing price offered to pregnant women through prenatal clinics in this region. Our estimates suggest that of 100 pregnant women receiving an ITN under a full subsidy, only 25 would purchase an ITN at the prevailing cost-sharing price.

Given the very low uptake at higher prices, the sample of women for whom usage could be measured is much smaller than the initial sample of women included in the experiment, limiting the precision of our estimates of the effect of price on usage. Keeping this caveat in mind, we find no evidence that usage intensity increases with the offer price of ITNs. Women who paid the highest price were slightly (though not statistically significantly) more likely to be using the net than women who received the net for free, but at intermediate prices the opposite was true; there is no clear relationship between the price paid and the probability of usage, and no discontinuity in usage rates between zero and positive prices. Further, when we look only at women coming for their first prenatal care visit (the relevant group to consider in the long run), usage is highest among women receiving the fully subsidized net. Women who received a net free were also no more likely to have resold it than women paying higher prices. Finally, we did not observe the development of a second-hand market: among both buyers of ITNs and recipients of free ITNs, the retention rate was above 90%.

The finding that ITN prices have no overall effect on usage suggests that any psychological effects of prices on usage are minor in this context, unless they are counteracted by opposite selection effects, which is unlikely.
The second-stage randomization enables us to formally test for the presence of sunk cost effects (without potentially confounding selection effects); it yields no significant effect of the actual price paid (holding the posted price constant) on usage. This result is consistent with a recent test of the sunk-cost fallacy for usage of a water purification product in Zambia (Ashraf, Berry, and Shapiro forthcoming).

To explore whether higher prices induce selection of women who need the net more, we measured baseline hemoglobin levels (anemia rates) for women buying or receiving nets at each price. Anemia is an important indicator of malaria, reflecting repeated infection with malaria parasites, and is a common symptom of the disease in pregnant women in particular. We find that prenatal clients who pay positive prices for an ITN are no sicker, at baseline, than the clients at the control clinics. On the other hand, we find that recipients of free nets are healthier at baseline than the average prenatal population observed at control clinics. We suspect this is driven by the incentive effect of the free net on returning for follow-up prenatal care before the benefits of the previous visit (e.g., iron supplementation) had worn off.

Taken together, our results suggest that cost-sharing ITN programs may have difficulty reaching a large fraction of the populations most vulnerable to malaria. Although our estimates of usage rates among buyers suffer from small-sample imprecision, effective coverage (i.e., the fraction of the population using a program net) can be precisely estimated and appears significantly (and considerably) higher under free distribution than under a 90% subsidy. In other words, we can confidently reject the possibility that the drop in demand induced by higher prices is offset by an increase in usage. Because effective coverage declines as the price increases, the level of coverage under cost-sharing is likely to be too low to achieve the strong social benefits that ITNs can confer.

When we combine our estimates of demand elasticity and usage elasticity in a model of cost-effectiveness that incorporates both the private and the social benefits of ITNs for child mortality, we find that, for reasonable parameters, free distribution is at least as cost-effective as partially (but still highly) subsidized distribution, such as the cost-sharing program for ITNs that was under way in Kenya at the time of this study. We also find that, for the full range of parameter values, the number of child lives saved is highest when ITNs are distributed free.
Our results have to be considered in their context: ITNs have been advertised heavily in Kenya over the past few years, both by the Ministry of Health and by the social-marketing nongovernmental organization Population Services International (PSI); pregnant women and parents of young children have been particularly targeted by malaria prevention messages; and most people (even in rural areas) are aware that the unsubsidized price of ITNs is high, which reduces the risk that low prices resulting from large subsidies are taken as a signal of bad quality. Our results thus do not speak to the debate on optimal pricing for health products that are unknown to the public. But if widespread awareness about ITNs explains why price does not seem to affect usage among owners, it makes the price sensitivity we observe all the more puzzling. Although large effects of prices on uptake have been observed in other contexts, they were found for less well-known products, such as deworming medication (Kremer and Miguel 2007) and contraceptives (Harvey 1994). Given the high private returns to ITN use and the absence of a detected effect of price on usage, the price sensitivity of demand we observe suggests that pregnant women in rural Kenya are credit- or savings-constrained.

The remainder of the paper proceeds as follows. Section II presents the conceptual framework. Section III provides background information on ITNs and describes the experiment and the data. Section IV describes the results on the price elasticity of demand, the price elasticity of usage, and selection effects on health. Section V presents a cost-effectiveness analysis, and Section VI concludes.

II. A SIMPLE MODEL OF PIGOUVIAN SUBSIDIES

This section develops a simple model to highlight the parameters that the experiment must identify in order to determine the optimal subsidy level. Assume that ITNs have two uses: a health use, in which the net is hung, and a nonhealth use, for which the net is not hung.3 Nonhealth uses include using the net for fishing, or simply leaving it in its bag for later use, for example, when a previous net wears out. Health use of ITNs generates positive health externalities; nonhealth uses do not. Purchasing a net costs the household the same whether it is intended for health or nonhealth purposes: the price of a net to the household is the marginal cost C minus a subsidy T. We call h the number of nets the household uses for health purposes and n the number of nets it uses for nonhealth purposes.

3. We thank an anonymous referee for suggesting this formalization.
The household's utility is

U = u(h) + v(n) − (C − T)(h + n) + kH,

where u(h) is the utility from hanging nets, with u′ ≥ 0 and u″ ≤ 0; v(n) is the utility from nonhanging nets, with v′ ≥ 0 and v″ ≤ 0; H is the average number of nets used for health purposes per household; and the constant k represents the positive health externality.4 When choosing how many nets to invest in, the household ignores the health externality and chooses h and n such that u′(h) = v′(n) = C − T.

4. For simplicity we assume that the positive health externality is linear in the share of the population that is covered with a net. In reality the health externality for malaria seems to be S-shaped.

Increasing the size of the subsidy T increases households' investment in nets for health use, and thus the health externality. Because the subsidy is common to all nets, however, increasing T might also affect households' investment in nets for nonhealth use. Call N the average number of nets used for nonhealth purposes per household. The marginal cost of increasing the health externality is T × [d(H + N)/dT], whereas the marginal benefit is only k × (dH/dT). The efficient subsidy level equates the marginal cost of increasing the externality to the marginal benefit of increasing it:

T = [k × (dH/dT)] / [d(H + N)/dT].

If N does not respond to the subsidy (dN/dT = 0), the optimal subsidy is k, the level of the externality, as in Pigou's standard theory. But if subsidizing H distorts the amount of N consumed upward, the optimal subsidy is lower than the level of the externality. The gap between the level of the externality and the optimal subsidy depends on how sensitive the hanging of nets is to price, relative to total ownership of nets. In other words, what we need to learn from the experiment is the following: when we increase the price, by how much do we reduce the number of hanging nets (nets put to health use), and how does this compare to the reduction in the total number of nets acquired?
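To make the steps above explicit, the household's first-order conditions and the planner's trade-off can be collected in a single display. This is only a restatement of the derivation in the text, using the objects already defined (u, v, h, n, C, T, H, N, k); no new assumptions are introduced.

```latex
% Household: takes the subsidized price (C - T) and average coverage H as given.
\max_{h,\,n}\; u(h) + v(n) - (C - T)(h + n) + kH
\;\;\Longrightarrow\;\; u'(h) = v'(n) = C - T .

% Planner: a marginal increase in T raises average holdings H(T) and N(T).
% Equate the marginal cost of the extra subsidized nets to the marginal
% external benefit of the extra hanging nets:
T\,\frac{d(H+N)}{dT} \;=\; k\,\frac{dH}{dT}
\;\;\Longrightarrow\;\;
T^{*} \;=\; k\,\frac{dH/dT}{\,d(H+N)/dT\,},
\qquad T^{*} = k \;\text{ when }\; \frac{dN}{dT} = 0 .
```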
This simple model could be augmented to incorporate imperfect information (for the household) on the true returns to hanging nets, especially on the relative curvature of u(·) and v(·). The lack of information could concern the effectiveness or the quality of ITNs. In this context, households could use the subsidy level as a signal of effectiveness or quality (i.e., households might interpret the size of the subsidy as the government's willingness to pay to increase coverage, and thus as a measure of the net's likely effectiveness). In such a case, subsidizing H would distort the amount of N consumed downward, and the optimal subsidy would be greater than the level of the externality.

Alternatively, households could lack information on the nonmonetary transaction cost of hanging the net and underestimate this cost when they invest in nets for health use. Once households realize how much effort is required to hang the net (hanging it every evening and taking it down every morning can be cumbersome for households that sleep in their living rooms), they might decide to reallocate a net from health use to nonhealth use. Households that suffer from the sunk-cost fallacy, however, would be less likely to reallocate a net from health use to nonhealth use if they had paid a higher price for it. This could be formalized, for example, by adding an effort cost to the function u(·) and assuming that the disutility of the effort needed to hang the net is weighted by the relative importance of the nonmonetary cost (effort) in the total cost of the net (nonmonetary cost plus monetary cost). Increasing the subsidy level (decreasing the price) would then increase the disutility of putting forth effort to hang the net and increase the likelihood that households do not use the net. This sunk cost effect would lead to an upward distortion of N and imply a subsidy level lower than the level of the externality.

For a quick preview of our findings, Figure I plots the demand curve and the "hanging curve" observed in our experiment. The slope of the top curve is an estimate of −d(H + N)/dT, and the slope of the bottom curve estimates −dH/dT. We find no systematic effect of the price on the ratio of these two slopes. When the price decreases from 10 Ksh to 0, the ratio of hanging nets to acquired nets actually increases, suggesting that the full subsidy (a price of zero) distorts the demand for nonhanging nets downward. At higher price levels, however, the effect of changing the subsidy is different: the ratio increases when the price decreases from 40 to 20 Ksh and from 20 to 10 Ksh. Overall, however, the ratio remains quite close to 1 over the price range we study.
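As a concrete illustration, the slope ratios just described can be reproduced from the clinic-level group means implied by Table II below (the free-group mean minus the estimated difference from the free group, taken from the specifications without clinic controls); these are also the quantities plotted in Figure I. The script is only an illustrative sketch, not the authors' estimation code.

```python
# Sketch: ratio of the drop in effective coverage (H, "hanging") to the drop in
# acquisition (H + N) when the price rises from 0 to each positive level.
# Group means are derived from Table II (columns without clinic-level controls).

prices = [0, 10, 20, 40]                 # posted ITN price in Ksh
acquired = [0.99, 0.92, 0.82, 0.41]      # share of prenatal clients acquiring a program ITN
hanging = [0.70, 0.52, 0.43, 0.15]       # share acquiring AND using the ITN at follow-up

for p, a, h in zip(prices[1:], acquired[1:], hanging[1:]):
    d_total = acquired[0] - a            # drop in acquisition (H + N) from 0 to p Ksh
    d_health = hanging[0] - h            # drop in effective coverage (H) over the same change
    # Ratio of the two drops, an estimate of (dH/dT) / (d(H+N)/dT):
    print(f"0 -> {p} Ksh: ratio = {d_health / d_total:.3f}")

# Expected output, matching the bottom rows of Table II, Panel B:
# 0 -> 10 Ksh: ratio = 2.571
# 0 -> 20 Ksh: ratio = 1.588
# 0 -> 40 Ksh: ratio = 0.948
```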
III. BACKGROUND ON ITNS AND EXPERIMENTAL SETUP

III.A. Background on Insecticide-Treated Nets

ITNs have been shown to reduce overall child mortality by at least 20% in regions of Africa where malaria is the leading cause of death among children under five (Lengeler 2004).
FIGURE I
Ownership vs. Effective Coverage
[Figure: shares of prenatal clients who acquired a program ITN, and who acquired one and were using it, by ITN price (Free, 10 Ksh, 20 Ksh, 40 Ksh), with 95% confidence intervals.]
Notes: Sample includes women sampled for the baseline survey during the clinic visit, and who either did not acquire an ITN or acquired one and were later randomly sampled for the home follow-up. Usage of the program ITN is zero for those who did not acquire a program ITN. Error bars represent ±2.14 standard errors (95% confidence interval with fourteen degrees of freedom). At the time this study was conducted, ITNs in Kenya were social-marketed through prenatal clinics at a price of 50 Ksh.
ITN coverage protects pregnant women and their children from the serious detrimental effects of maternal malaria. In addition, ITN use can help avert some of the substantial direct costs of treatment and the indirect costs of malaria infection through impaired learning and lost income. Lucas (forthcoming) estimates that the gains to education from a malaria-free environment alone more than compensate for the cost of an ITN.

Despite the proven efficacy and the increasing availability of ITNs on the retail market, the majority of children and pregnant women in sub-Saharan Africa do not use ITNs.5 At $5–$7 a net (in PPP US$), ITNs are unaffordable for most families, and so governments and NGOs distribute them at heavily subsidized prices.

5. According to the World Malaria Report (2008), which compiled results from surveys in 18 African countries, 23% of children and 27% of pregnant women sleep under ITNs.
However, the price charged for the net varies greatly across distributing organizations, countries, and consumers. The failure to achieve higher ITN coverage rates despite repeated pledges by governments and the international community (such as the Abuja Declaration of 2000) has put ITNs at the center of a lively debate over how to price vital public health products in developing countries (Lengeler et al. 2007).

Proponents of cost-sharing ITN distribution programs argue that a positive price is needed to screen out people who will not use the net, and thus to avoid wasting the subsidy on nonusers. Cost-sharing programs often have a "social marketing" component, which uses mass-media communication strategies and branding to increase consumers' willingness to pay (Schellenberg et al. 2001; PSI 2003). The goal is to shore up demand and usage by making the value of ITN use salient to consumers. Proponents of cost-sharing programs also point out that positive prices are necessary to ensure the development of a commercial market, considered key to a sustainable supply of ITNs.

Proponents of full subsidization argue that, although the private benefits of ITN use can be substantial, ITNs also have important positive health externalities deriving from reduced disease transmission.6,7 In a randomized trial of an ITN distribution program at the village level in western Kenya, the positive impacts of ITN distribution on child mortality, anemia, and malaria infection were as strong among nonbeneficiary households within 300 meters of beneficiary villages as they were among households in the beneficiary villages themselves (Gimnig et al. 2003).8 Although ITNs may have positive externalities at low levels of coverage (e.g., for unprotected children in the same household), it is estimated that at least 50% coverage is required to achieve strong community effects on mortality and morbidity (Hawley et al. 2003). To date, no cost-sharing distribution program is known to have reached this threshold (WHO 2007).

6. The external effects of ITN use derive from three sources: (1) fewer mosquitoes due to contact with insecticide, (2) a reduction in the infective mosquito population due to the decline in the available blood supply, and (3) fewer malaria parasites to be passed on to others.
7. The case for fully subsidizing ITNs has also been made on the basis of the substantial costs to the government of hospital admissions and outpatient consultations due to malaria (Evans et al. 1997).
8. In a similar study in Ghana, Binka, Indome, and Smith (1998) find that child mortality increases by 6.7% with each 100-meter shift away from the nearest household with an ITN.
III.B. Experimental Setup

The experiment was conducted in twenty communities in western Kenya, spread across four districts: Busia, Bungoma, Butere, and Mumias. Malaria is endemic in this region: transmission occurs throughout the year, with two peaks corresponding to the periods of heavy rain in May/June/July and October/November. In two nearby districts, a study by the CDC and the Kenya Medical Research Institute found that pregnant women may receive as many as 230 infective bites during their forty weeks of gestation and, as a consequence of the resulting high levels of maternal anemia, up to a third of all infants are born premature, small for gestational age, or with low birth weight (Ter Kuile et al. 2003).

The latest published data on net ownership and usage available for the region come from the 2003 Kenya Demographic and Health Survey. It estimated that 19.8% of households in Western Kenya had at least one net and 6.7% had a treated net (an ITN); 12.4% of children under five slept under a net and 4.8% under an ITN; and 6% of pregnant women had slept under a net the night before and 3% under an ITN. Net ownership is very likely to have risen since then, however. In July 2006, the Measles Initiative ran a one-week campaign throughout western Kenya to vaccinate children between nine months and five years of age and distributed a free long-lasting ITN to each mother who brought her children to be vaccinated. The 2008 World Malaria Report uses ITN distribution figures to estimate that 65% of Kenyan households now own an ITN. A 2007 survey conducted (for a separate project) in the study area among households with school-age children found a rate of long-lasting ITN ownership of around 30% (Dupas 2009b).

Our experiment targeted ITN distribution to pregnant women visiting health clinics for prenatal care.9 We worked with 20 rural public health centers chosen from a total of 70 health centers in the region, 17 of which were private and 53 public. The 20 health centers we sampled were chosen based on their public status, their size, the services offered, and their distance from each other.

9. The ITNs distributed in our experiment were PermaNets, sold by Vestergaard Frandsen. They are circular polyester bed nets treated with the insecticide deltamethrin and maintain efficacy without retreatment for about three to five years (or about twenty washes).
We then randomly assigned the clinics to one of five groups: four clinics formed the control group; five clinics were provided with ITNs and instructed to give them free of charge to all expectant mothers coming for prenatal care; five clinics were provided with ITNs to be sold at 10 Ksh (corresponding to a 97.5% subsidy); three clinics were provided with ITNs to be sold at 20 Ksh (a 95.0% subsidy); and the last three clinics were provided with ITNs to be sold at 40 Ksh (a 90% subsidy). The highest price is 10 Ksh below the prevailing subsidized price of ITNs in this region, offered through PSI to pregnant women at prenatal clinics.10

10. Results from a preprogram clinic survey suggest that it is perhaps not appropriate to interpret our results in the context of widely available ITNs for pregnant women at 50 Ksh, as many of the clinics reported that the supply of PSI nets was erratic and frequently out of stock.

Table I presents summary statistics on the main characteristics of the health centers in each group. Although the relatively small number of clinics leads to imperfect balancing of characteristics, the clinics appear reasonably similar across ITN price assignments, and we show below that controlling for clinic characteristics does not change our estimates except to add precision.

Clinics were provided with financial incentives to carry out the program as designed. For each month of implementation, clinics received a cash bonus (or a piece of equipment of their choice) worth 5,000 Ksh (approximately $75) if no evidence of "leakage" or mismanagement of the ITNs or funds was observed. Clinics were informed that random spot checks of their record books would be conducted, as would visits to a random subsample of beneficiaries to confirm the price at which the ITNs had been sold and to confirm that they had indeed purchased ITNs (if the clinic's records indicated so). Despite this, we observed leakage and mismanagement of the ITNs in four of the eleven clinics that were asked to sell ITNs at a positive price. We did not observe any evidence of mismanagement in the five clinics instructed to give out the ITNs for free. None of the four clinics that mismanaged the ITNs altered the price at which ITNs were made available to prenatal clients, but they sold some of the program ITNs to ineligible recipients (i.e., nonprenatal clients).

The ITN distribution program was phased into the program clinics between March and May 2007 and was kept in place for at least three months in each clinic, covering the peak "long rains" malaria season and subsequent months. Posters were put up in clinics to inform prenatal clients of the price at which the ITNs were sold. Other than offering a free hemoglobin test to each woman on survey days, we did not interfere with the normal procedures these clinics used at prenatal care visits, which in principle included a discussion of the importance of bed net usage.
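As a quick consistency check (an implied figure, not one taken from the program documents), the stated subsidy rates all point to a common reference cost per net of roughly 400 Ksh, or about $6 at the exchange rate of roughly 67 Ksh per US$ used in this paper, broadly in line with the $5–$7 retail range cited in Section III.A:

```latex
C \;\approx\; \frac{10\ \text{Ksh}}{1 - 0.975} \;=\; 400\ \text{Ksh} \;\approx\; \$6
\quad (\text{at} \approx 67\ \text{Ksh per US\$}),
\qquad
\frac{20}{400} = 5\% \;\Rightarrow\; 95\%\ \text{subsidy},
\qquad
\frac{40}{400} = 10\% \;\Rightarrow\; 90\%\ \text{subsidy}.
```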
TABLE I
CHARACTERISTICS OF PRENATAL CLINICS IN THE SAMPLE, BY TREATMENT GROUP

                                                         Control      Treatment groups (ITN price)                                  p-value,       p-value,
                                                         group        0 Ksh (free)  10 Ksh ($0.15)  20 Ksh ($0.30)  40 Ksh ($0.60)  joint test 1   joint test 2
                                                         (1)          (2)           (3)             (4)             (5)             (6)            (7)

Number of clinics                                        4            5             5               3               3
Distance (in km) to closest prenatal clinic in sample    11.3 [2.6]   13.3          13.0            12.1            11.4            .743           .593
Total other prenatal clinics within 10 kilometers (km)   3.8 [2.9]    3.4           3.6             4.3             5.0             .769           .758
Fraction of clinics with HIV testing services            0.50 [0.58]  0.40          0.80            0.67            0.33            .507           .713
Prenatal enrollment fee (in Ksh)                         10.0 [8.2]   12.0          4.0             13.3            10.0            .292           .619
Average monthly attendance in 2006 (first + subsequent)  114 [69.4]   117           164             106             122             .565           .847
Average monthly attendance in 2006 (first visits only)   67 [46.3]    63            75              54              62              .769           .965

Notes: Standard deviations presented in brackets. At the time of the program, US$1 was equivalent to around 67 Kenyan shillings (Ksh). Prenatal clinics were sampled from a pool of seventy prenatal clinics over four districts in Kenya's Western Province: Busia, Bungoma, Butere, and Mumias. Joint test 1: test of equality of means across the four treatment groups. Joint test 2: joint test that the means in the treatment groups are equal to the mean in the control group.
Within clinics where the posted price was positive, a second-stage randomization was conducted on unannounced, random days. On those days, women who had expressed their willingness and demonstrated their ability to purchase an ITN at the posted price (by putting the required amount of money on the counter) were surprised with the opportunity to participate in a lottery for an additional discount by picking an envelope from a basket. All women given the opportunity to participate in the lottery agreed to pick an envelope. The final price paid by these women was the initial offer price if they picked an empty envelope; zero if they picked a "free net" envelope; or a positive price below the initial offer price if the initial price was 40 Ksh. This second-stage randomization started at least five weeks after the program had begun in a given clinic and took place no more than once a week, on varying weekdays, to avoid biasing women's decisions to purchase the ITN on the expectation of a second-stage discount.11

11. By comparing days with and without the lottery, we can test whether women heard about the lottery on the days we conducted it. We do not find evidence that uptake was higher on the days we performed the lottery; nor do we observe a significant increase in the uptake of nets after the first lottery day (data not shown).

III.C. Data

Three types of data were collected. First, we collected the administrative records on ITN sales kept by the clinics. Second, each clinic was visited three or four times on random days, and on those days enumerators surveyed all pregnant women who came for a prenatal visit. Women were asked basic background questions and whether they had purchased a net, and their hemoglobin levels were recorded. In total, these measures were collected from 545 pregnant women. Third, a random sample of 246 prenatal clients who had purchased or received a net through the program was selected to be visited at home three to ten weeks after the net purchase. All home visits were conducted within three weeks in July 2007 to ensure that all respondents faced the same environment (especially in terms of malaria seasonality) at the time of the follow-up. Of this subsample, 92% (226 women) were found and consented to be interviewed. During the home visits, respondents were asked to show the net, to report whether they had started using it, and to say who was sleeping under it.
Surveyors checked whether the net had been taken out of its packaging, whether it was hanging, and the condition of the net.12 Note that, at the time of the baseline survey and ITN purchase, women were not told that follow-up visits might be made at their homes. What is more, neither the clinic staff nor the enumerators conducting the baseline surveys knew that usage would be checked. This limits the risk that usage behavior was abnormally high during the study period. Also note that we do not observe an increase in reported or observed usage over the three weeks during which the home surveys were conducted, which suggests that the spread of information about the usage checks was limited and unlikely to have altered usage behavior.

12. The nets distributed through the program were easily recognizable by their tags. Enumerators were instructed to check the tags to confirm the origin of the nets.

III.D. Clinic-Level Randomization

The price at which ITNs were sold was randomized at the clinic level, but our outcomes of interest are at the individual level: uptake, usage rates, and health. When regressing individual-level dependent variables on clinic-level characteristics, we are likely to overstate the precision of our estimators if we ignore the fact that observations within the same clinic (cluster) are not independent (Moulton 1990; Donald and Lang 2007). We compute cluster-robust standard errors using the cluster-correlated Huber–White covariance matrix method. In addition, because the number of clusters is small (sixteen treatment clinics), the critical values for the tests of significance are drawn from a t-distribution with fourteen (= 16 − 2) degrees of freedom (Cameron, Miller, and Gelbach 2007). The critical values for the 1%, 5%, and 10% significance levels are thus 2.98, 2.14, and 1.76, respectively.
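The inference procedure just described (clinic-clustered Huber–White standard errors, with critical values taken from a t-distribution with 16 − 2 = 14 degrees of freedom) can be sketched as follows. This is an illustrative implementation using statsmodels, not the authors' code; the data set and variable names (uptake, price, clinic_id) are placeholders, and the paper's actual specifications also include district fixed effects.

```python
# Sketch: OLS of an individual-level outcome on the clinic-level ITN price, with
# standard errors clustered by clinic and p-values from a t(14) reference
# distribution, as described in Section III.D.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def cluster_ols_small_sample(df, formula, cluster_col, df_t=14):
    """OLS with clinic-clustered (Huber-White) SEs; p-values use a t(df_t) reference."""
    fit = smf.ols(formula, data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df[cluster_col]}
    )
    tstats = fit.params / fit.bse
    pvals = 2 * stats.t.sf(np.abs(tstats), df=df_t)
    return pd.DataFrame({"coef": fit.params, "se": fit.bse, "t": tstats, "p(t14)": pvals})

# Hypothetical usage on an individual-level survey data set:
# results = cluster_ols_small_sample(data, "uptake ~ price", "clinic_id")
# The 5% critical value with 14 degrees of freedom is about 2.14:
# print(stats.t.ppf(0.975, df=14))   # ~2.1448
```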
Another approach to credibly assessing causal effects with a limited number of randomization units is to use (nonparametric) randomization inference, first proposed by Fisher (1935), later developed by Rosenbaum (2002), and recently used by Bloom et al. (2006). Hypothesis testing under this method proceeds as follows. For each clinic, we observe the share of prenatal clients who purchased a net (or were using a net). Let y_i denote the observed purchase rate for clinic i. For each clinic i = 1, 2, . . . , 16, Y_i(P_i) represents the purchase rate at clinic i when the ITN price at clinic i is P_i, with P_i ∈ {0, 10, 20, 40}. The outcome variable is a function of the treatment variable and the potential outcomes:

    y_i = Σ_{k ∈ {0, 10, 20, 40}} 1{P_i = k} · Y_i(k).

The effect of charging price k in clinic i (relative to free distribution) is E_{ki} = Y_i(k) − Y_i(0). To make causal inferences for a price level k via Fisher's exact test, we use the null hypothesis that the effect of charging k is zero for all clinics:

    H_0 : E_{ki} = 0 for all i = 1, . . . , 16.

Under this null hypothesis, all potential outcomes are known exactly. For example, although we do not observe the outcome under price 0 for a clinic i subject to price k > 0, the null hypothesis implies that the unobserved outcome is equal to the observed outcome, Y_i(0) = y_i. For a given price level k, we can test the null hypothesis against the alternative hypothesis that E_{ki} ≠ 0 for some clinics by using the difference in average outcomes by treatment status as a test statistic:

    T_k = [Σ_i 1{P_i = k} · y_i] / [Σ_i 1{P_i = k}] − [Σ_i 1{P_i = 0} · y_i] / [Σ_i 1{P_i = 0}].

Under the null hypothesis, only the price variable P is random, and thus the distribution of the test statistic (generated by taking all possible treatment assignments of clinics to prices) is completely determined by the distribution of P. By checking whether T_k^obs, the statistic for the "true" assignment of prices (the actual assignment in our experiment), falls in the tails of this distribution, we can test the null hypothesis. We can reject the null hypothesis with a confidence level of 1 − α if the test statistic for the true assignment falls in the two α/2 tails of the distribution. The test is nonparametric because it makes no distributional assumptions. We call the p-values computed this way "randomization inference p-values."
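The permutation test can be made concrete with a short script: enumerate every possible assignment of the clinics in a given comparison (free versus price k) to the two groups, recompute the difference in mean clinic-level outcomes T_k for each assignment, and report the share of assignments whose statistic is at least as extreme as the observed one. The clinic outcomes below are hypothetical placeholders; the assignment counts, however, follow directly from the design (with 5 free clinics and 3 at price k there are C(8, 3) = 56 possible assignments, and with 5 and 2 there are C(7, 2) = 21, matching the counts reported in Table II).

```python
# Sketch of the randomization-inference p-value for one price comparison,
# following the logic of Section III.D. Outcome values are placeholders.
from itertools import combinations

def randomization_p_value(y_free, y_price):
    """Two-sided Fisher permutation p-value for T_k = mean(price group) - mean(free group)."""
    y_all = y_free + y_price
    n_total, n_price = len(y_all), len(y_price)
    t_obs = sum(y_price) / n_price - sum(y_free) / len(y_free)

    # Enumerate every possible assignment of clinics to the "price k" group.
    t_dist = []
    for idx in combinations(range(n_total), n_price):
        grp = [y_all[i] for i in idx]
        rest = [y_all[i] for i in range(n_total) if i not in idx]
        t_dist.append(sum(grp) / len(grp) - sum(rest) / len(rest))

    # Share of assignments at least as extreme as the observed statistic.
    extreme = sum(abs(t) >= abs(t_obs) for t in t_dist)
    return extreme / len(t_dist), len(t_dist)

# Hypothetical example: 5 free clinics vs. 3 clinics charging 40 Ksh.
p, n_assignments = randomization_p_value(
    y_free=[0.98, 1.00, 0.97, 0.99, 1.00],   # placeholder uptake shares
    y_price=[0.45, 0.38, 0.40],              # placeholder uptake shares
)
print(n_assignments, p)   # 56 possible assignments, as in Table II, columns (11)-(12)
```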
TABLE II
CLINIC-LEVEL ANALYSIS: FISHERIAN PERMUTATION TESTS

Panel A: Takeup

Outcome: Average weekly ITN sales over first 6 weeks
                                            (1)        (2)        (3)        (4)        (5)        (6)
Mean in free group                          41.03
Difference with free group:
  ITN price = 10 Ksh                        −3.43      −13.77
  S.E.                                      (18.60)    (13.98)
  Randomization inference p-value           .824       .460
  ITN price = 20 Ksh                                              −13.87     −10.75
  S.E.                                                            (20.36)    (18.16)
  Randomization inference p-value                                 .64        .61
  ITN price = 40 Ksh                                                                    −32.12     −34.03
  S.E.                                                                                  (25.05)    (22.00)
  Randomization inference p-value                                                       .23        .19
Clinic-level controls                                  X                     X                     X
Number of clinics                           10         10         8          8          7          7
R2                                          .00        .54        .07        .39        .25        .54
# of possible assignments for
  randomization inference                   252        252        56         56         21         21

Outcome: Share of prenatal clients who acquired program ITN
                                            (7)        (8)        (9)        (10)       (11)       (12)
Mean in free group                          0.99
Difference with free group:
  ITN price = 10 Ksh                        −0.07      −0.07
  S.E.                                      (0.03)**   (0.03)*
  Randomization inference p-value           .125       .091
  ITN price = 20 Ksh                                              −0.17      −0.18
  S.E.                                                            (0.02)***  (0.02)***
  Randomization inference p-value                                 .000       .036
  ITN price = 40 Ksh                                                                    −0.58      −0.58
  S.E.                                                                                  (0.06)***  (0.05)***
  Randomization inference p-value                                                       .000       .018
Clinic-level controls                                  X                     X                     X
Number of clinics                           10         10         8          8          8          8
R2                                          .45        .45        .93        .96        .95        .97
# of possible assignments for
  randomization inference                   252        252        56         56         56         56

Panel B: Effective coverage

Outcome: Share using program ITN at follow-up (unconditional on takeup)
                                            (13)       (14)       (15)       (16)       (17)       (18)
Mean in free group                          0.70
Difference with free group:
  ITN price = 10 Ksh                        −0.18      −0.17
  S.E.                                      (0.13)     (0.14)
  Randomization inference p-value           .173       .206
  ITN price = 20 Ksh                                              −0.27      −0.27
  S.E.                                                            (0.13)*    (0.14)
  Randomization inference p-value                                 .071       .143
  ITN price = 40 Ksh                                                                    −0.55      −0.54
  S.E.                                                                                  (0.14)***  (0.15)**
  Randomization inference p-value                                                       .018       .054
Clinic-level controls                                  X                     X                     X
Number of clinics                           10         10         8          8          8          8
R2                                          .20        .22        .42        .42        .71        .73
# of possible assignments for
  randomization inference                   252        252        56         56         56         56
Ratio [(dH/dT) from Panel B /
  (d(H + N)/dT) from Panel A]               2.571      2.429      1.588      1.500      0.948      0.931
Standard error of ratio                     2.107      1.418      0.598      0.822      0.185      0.153

Notes: Panel A, columns (1)–(6): sales data from clinics' records; data missing for one clinic due to misreporting of sales. Panel A, columns (7)–(12), and Panel B: individual data collected by the research team, averaged at the clinic level (the level of randomization). "Using program ITN" is equal to 1 only for those who (1) acquired the ITN and (2) had the ITN hanging in the home during an unannounced visit. Standard errors in parentheses, estimated through linear regressions. P-values for treatment effects computed by randomization inference. ***, **, * Significance at the 1%, 5%, and 10% levels, respectively.
IV. RESULTS

IV.A. Clinic-Level Analysis: Randomization Inference Results

Table II presents the results of randomization inference tests of the hypotheses that the three positive prices in our experiment had no effect on demand and coverage. The data used in Table II were collapsed at the clinic level. (The raw data on clinic-level outcomes are provided in the Online Appendix.) We have two indicators of demand, presented in Panel A: average weekly sales of ITNs (recorded by the clinics) in columns (1)–(6), and the share of surveyed pregnant women who acquired an ITN in columns (7)–(12). Panel B shows the rate of effective coverage: the share of surveyed pregnant women in the clinic who not only acquired the ITN but also reported using it at follow-up. For each outcome (sales, uptake, effective coverage), we present the estimated effect of prices both without and with clinic-level controls. We present the standard errors estimated through parametric linear regressions, as well as the randomization inference p-values.

Results in columns (1)–(6) suggest that, although ITN sales were lower on average in clinics charging a higher price for the ITN, none of the differences between clinics can be attributed to the price. Even the 32/41 = 78% lower sales in the clinics charging 40 Ksh are not significant. Note, however, that the sales data are missing for one of the three 40 Ksh clinics, and as a consequence the power of the randomization inference test in columns (5) and (6) is extremely low: there are only 21 possible assignments of seven clinics to two groups of sizes five and two, and each of them has a 1/21 = 4.76% chance of being selected. This means that even the largest effect cannot fall within the 2.5% tails of the distribution, and randomization inference would thus fail to reject the null hypothesis of no price effect with 95% confidence no matter how large the difference in uptake between the 0 Ksh and 40 Ksh clinics is (Bloom et al. 2006).

The power is higher for the tests performed on the survey data (columns (7)–(12) of Panel A, and Panel B), but still lower than that of tests that impose some structure on the error term. Nevertheless, the p-values in columns (9)–(12) suggest that we can reject the hypothesis that charging either 20 or 40 Ksh for nets has no effect on uptake with 95% confidence. In particular, uptake of the net is 58 percentage points lower in the 40 Ksh group than in the free distribution group, and the confidence level for this effect is 98%. The results on effective coverage (usage of the net unconditional on uptake) are weaker for the 20 Ksh treatment but still significant for the 40 Ksh treatment: effective coverage is 54 percentage points lower in the 40 Ksh group than in the free distribution group, and the confidence level for this effect is 94%.
As shown in Section II, the key parameter of interest in determining the optimal subsidy level is the ratio (dH/dT)/(d(H + N)/dT). We compute this ratio for price increases from 0 to 10 Ksh, 0 to 20 Ksh, and 0 to 40 Ksh at the bottom of Panel B in Table II. The ratio is greater than 1 for the price changes from 0 to 10 Ksh and from 0 to 20 Ksh, but the standard errors are massive and there is little informational content in those numbers. For the change to 40 Ksh, the ratio is more precisely estimated, at 0.95 (= 0.55/0.58), still quite close to 1. The standard error of this ratio is 0.18 in the absence of covariates and implies a 95% confidence interval of [0.58, 1.31]. When we control for clinic-level covariates in the estimation of the two effects, the confidence interval on the ratio narrows somewhat, to [0.63, 1.23].

The finding in Table II that effective coverage is statistically significantly lower, by 54 percentage points, in the 40 Ksh group (the group that proxies for the cost-sharing program in place in Kenya at the time of the study) than in the free distribution group is the main result of the paper. In the remainder of the analysis, we investigate the effects in more detail by conducting parametric analysis on the disaggregated data, with clustered standard errors adjusted for the small number of clusters.

IV.B. Micro-Level Analysis: Price Elasticity of Demand for ITNs

Table III presents coefficient estimates from OLS regressions of weekly ITN sales on price. The coefficient estimate on ITN price from the most basic specification in column (1) is −0.797. This estimate implies that weekly ITN sales drop by about eight nets for each 10 Ksh increase in price. Because clinics distributing ITNs free to their clients distribute an average of 41 ITNs per week, these estimates imply that a 10 Ksh increase in ITN price leads to a 20% decline in weekly ITN sales. The specification in column (4) regresses weekly ITN sales on indicator variables for each ITN price (0 Ksh is excluded). Raising the price from 0 to 40 Ksh reduces demand by 80% (from 41 ITNs per week to 9), a substantial decline, though somewhat smaller than the decline implied by the linear estimate in column (1). These results are not sensitive to adding controls for time effects (columns (2) and (5)). Columns (3) and (6) present robustness checks conducted by including various characteristics of the clinics as controls.
TABLE III
WEEKLY ITN SALES ACROSS PRICES: CLINIC-LEVEL DATA

Dependent variable: Weekly ITN sales
                                                   (1)               (2)               (3)               (4)               (5)               (6)
ITN price in Kenyan shillings (Ksh)                −0.797 (0.403)*   −0.797 (0.401)*   −0.803 (0.107)***
ITN price = 10 Ksh ($0.15)                                                                               −0.33 (16.81)     −0.33 (16.92)     1.52 (4.37)
ITN price = 20 Ksh ($0.30)                                                                               −9.50 (16.04)     −9.50 (16.14)     −14.08 (5.00)**
ITN price = 40 Ksh ($0.60)                                                                               −32.42 (15.38)*   −32.42 (15.47)*   −33.71 (2.88)***
Number of weeks since program started                                −5.08 (1.41)***   −5.08 (1.46)***                     −5.08 (1.42)***   −5.08 (1.48)***
Average attendance in 2006 (first visits)                                              1.48 (0.21)***                                        1.56 (0.22)***
Average attendance in 2006 (total)                                                     −0.46 (0.15)***                                       −0.50 (0.15)***
Prenatal enrollment fee (in Ksh)                                                       −0.77 (0.27)**                                        −0.54 (0.32)
ANC clinic offers HIV testing services                                                 14.08 (7.44)*                                         7.07 (7.65)
Distance to the closest ANC clinic                                                     −1.08 (0.77)                                          −1.84 (0.68)**
Distance to the closest ANC clinic in the sample                                       −8.85 (2.89)***                                       −9.63 (2.70)***
Observations (clinic-weeks)                        90                90                90                90                90                90
R2                                                 .13               .21               .64               .14               .23               .65
Mean of dep. var. in clinics with free ITNs        41.03

Notes: Each column is an OLS regression of weekly ITN sales on ITN price or on a set of indicator variables for each price (0 Ksh is excluded). All regressions include district fixed effects. The sample includes fifteen clinics in three districts over six weeks after program introduction. (One 40 Ksh clinic is not included because of problems with net sales reporting.) Standard errors in parentheses are clustered at the clinic level. Given the small number of clusters (fifteen), the critical values for t-tests were drawn from a t-distribution with 13 (= 15 − 2) degrees of freedom. ***, **, * Significance at the 1%, 5%, and 10% levels, respectively.
Because net sales are conditional on enrollment at prenatal clinics, one concern is that our demand estimates are confounded by variation in the level of prenatal attendance across clinics. Subsidized ITNs may provide an incentive to seek prenatal care, so the level of prenatal enrollment after the introduction of the program is itself an endogenous variable of interest (Dupas 2005). Any impact of the ITN price on total enrollment should be captured by total ITN sales (which reflect both the change in the number of patients and the change in the fraction of patients willing to buy ITNs at each price). However, our demand estimates could be biased if total attendance prior to program introduction is correlated with the assigned ITN price. To check whether this is the case, the specifications in columns (3) and (6) control for monthly prenatal attendance at each clinic in 2006, as well as for additional clinic characteristics that could potentially influence attendance, such as any fee charged for prenatal care, whether the clinic offers counseling and/or testing for HIV, the distance to the closest other clinic or hospital in our sample, and the distance to the closest other clinic or hospital in the area. The coefficient estimates on ITN price are essentially unchanged when clinic controls are included, but their precision improves.

One might be concerned that our net sales data are biased by (a moderate amount of) mismanagement, theft, and misreporting by clinics. Further, because the number of observations in Table III is small, the demand estimates are not precisely estimated. For these reasons, it is important to check that the demand estimates based on net sales are consistent with those based on our survey data. Table IV presents additional estimates of demand based on individual-level data from surveys conducted among all prenatal clients who visited the clinics on the randomly chosen days when baseline surveys were conducted. These specifications correspond to linear probability models in which the dependent variable is a dummy equal to one if the prenatal client bought or received an ITN; the independent variables are the price at which ITNs were sold, or dummies for each price.

The coefficient estimate of −0.015 on ITN price in column (1) implies that a 10 Ksh ($0.15) increase in the price of ITNs reduces demand by fifteen percentage points (or roughly 20% at the mean purchase probability of .81). This is very consistent with the results based on net sales and corresponds to a price elasticity (at the mean price and purchase probability) of −.37. These results imply that demand for ITNs is 75% lower at the cost-sharing price prevailing in Kenya at the time of the study (50 Ksh, or $0.75) than under a free distribution scheme.
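For readers who want to see the shape of these specifications, a minimal sketch of the dummy-variable linear probability model (in the spirit of Table IV, column (3)) is given below, with clinic-clustered standard errors as described in Section III.D. The data set and variable names (bought_itn, price, district, clinic_id) are placeholders, not the authors'.

```python
# Sketch: linear probability model of ITN uptake on price-group dummies with
# district fixed effects and clinic-clustered SEs (cf. Table IV, column (3)).
import statsmodels.formula.api as smf

def demand_lpm(df):
    model = smf.ols(
        "bought_itn ~ C(price) + C(district)",   # price = 0 Ksh is the omitted category
        data=df,
    )
    return model.fit(cov_type="cluster", cov_kwds={"groups": df["clinic_id"]})

# Hypothetical usage:
# fit = demand_lpm(survey_data)
# fit.params["C(price)[T.40]"] would then correspond to the 40 Ksh uptake gap
# relative to free distribution (about -0.6 in Table IV).
```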
TABLE IV
DEMAND FOR ITNS ACROSS PRICES: INDIVIDUAL-LEVEL DATA

Dependent variable: Bought/received an ITN during prenatal visit
                                                           (1)                 (2)                 (3)                 (4)                 (5)                 (6)                 (7)                 (8)
ITN price in Kenyan shillings (Ksh)                        −0.015 (0.002)***   −0.017 (0.001)***                                           −0.018 (0.001)***   −0.012 (0.002)***   −0.016 (0.002)***
ITN price = 10 Ksh ($0.15)                                                                         −0.073 (0.018)***   −0.058 (0.037)                                                                  0.046 (0.034)
ITN price = 20 Ksh ($0.30)                                                                         −0.172 (0.035)***   −0.331 (0.102)***                                                               −0.350 (0.142)**
ITN price = 40 Ksh ($0.60)                                                                         −0.605 (0.058)***   −0.656 (0.037)***                                                               −0.635 (0.061)***
Time controls                                                                  X                                       X                   X                   X                   X                   X
Clinic controls                                                                X                                       X                   X                   X                   X                   X
Restricted sample: first prenatal visit                                                                                                    X
Restricted sample: first pregnancy                                                                                                                             X
Restricted sample: did not receive free ITN previous year                                                                                                                          X                   X
Observations                                               424                 424                 424                 424                 201                 134                 266                 266
R2                                                         .26                 .28                 .32                 .32                 .42                 .24                 .32                 .33
Mean of dep. var.                                          0.81                0.81                0.81                0.81                0.77                0.84                0.84                0.84
Intracluster correlation                                   .23

Notes: Data are from clinic-based surveys conducted in April–June 2007, throughout the first six weeks of the program. All regressions include district fixed effects. Standard errors in parentheses are clustered at the clinic level. Given the small number of clusters (sixteen), the critical values for t-tests were drawn from a t-distribution with 14 (= 16 − 2) degrees of freedom. All specifications are OLS regressions of an indicator variable equal to one if the respondent bought or received an ITN for free on the price of the ITN, except columns (3), (4), and (8), in which the regressors are indicator variables for each price (price = 0 is excluded). Time controls include fixed effects for the day of the week the survey was administered and a variable indicating how much time had elapsed between the day the survey was administered and the program introduction. Clinic controls include total monthly first prenatal care visits between April and June of 2006, the fee charged for a prenatal care visit, whether or not the clinic offers voluntary counseling and testing for HIV or prevention-of-mother-to-child-transmission of HIV services, the distance between the clinic and the closest other clinic or hospital, and the distance between the clinic and the closest other clinic or hospital in the program. ***, **, * Significance at the 1%, 5%, and 10% levels, respectively.
In column (2) of Table IV, we add controls for when the survey was administered, including day-of-the-week fixed effects and the time elapsed since program introduction, as well as controls for the clinic characteristics used in Table III, column (3). The coefficient estimate for price remains very close to that obtained in the basic specification. Columns (3) and (4) present estimates of demand at each price point. In the absence of clinic or time controls, the decrease in demand for an increase in price from 0 to 10 Ksh is estimated at seven percentage points (larger than suggested by the clinic-level ITN sales in Table III). An increase in price from 20 to 40 Ksh leads to a 43-percentage-point drop in demand.

Column (5) presents demand estimates for the restricted sample of women who were making their first prenatal care visit for their current pregnancy. It is important to separate first visits from revisits because women who return may do so because they are sick. Alternatively, women coming for a second or third visit may be healthier, because they have already received the benefits of the earlier visit(s), some of which can directly affect their immediate need for an ITN (such as malaria prophylaxis and iron supplementation). The coefficient estimate in column (5) is larger than that for the entire sample, implying that women coming for the first time are more sensitive to price than women coming for a revisit. This could be because women learn about the subsidized ITN program at their first visit and bring the cash to purchase the net at their second visit.

Access to free ITNs from other sources could also have dampened demand for ITNs distributed through the program. This is a real concern, because the Measles Initiative ran a campaign in July 2006 (nine months before the start of our experiment) throughout Kenya to vaccinate children between nine months and five years of age, distributing free ITNs to the mothers of these children in western Kenya. To examine the demand response among women who are less likely to have had access to free ITNs in the past, column (6) estimates the impact of ITN price on demand for women in their first pregnancy only. When we restrict the sample in this way, the coefficient on ITN price drops to −0.012. This implies that women in their first pregnancies are indeed less sensitive to ITN price differences, but their demand still drops by 55 percentage points when the ITN price is raised from 0 to 50 Ksh.

Our baseline survey asked respondents whether they had received a free ITN in the previous year, and 37.3% said they had.
In columns (7) and (8), we focus on the 63% who reported not having received a free ITN and estimate how their demand for an ITN in our program was affected by price. We find a coefficient on price very similar to that obtained with the full sample (−0.016), and the specification with dummies for each price group generates estimates that are also indistinguishable from those obtained with the full sample.

Nearly three-quarters of prenatal clients walked to the clinic for prenatal care. Because the clinics included in our sample were at least 13 kilometers from one another, it is unlikely that prenatal clients would switch from one of our program clinics to another. It is likely, however, that our program generated some crowd-out of prenatal clients at nonprogram clinics in the vicinity, particularly in the case of free nets. Because these "switchers" are driven by price differences in ITNs that would not exist in a nationwide distribution program, we should look at the demand response of those prenatal clients who, at the time of the interview, were attending the same clinic that they had attended in the past. In Online Appendix Table A.1, we replicate Table IV for this subsample of prenatal clients who did not switch clinics. The results are nearly unchanged, suggesting that the same degree of price sensitivity would prevail in a program with a uniform price across all clinics.

In sum, our findings suggest that demand for ITNs is not sensitive to small increases in price from zero, but that even a moderate degree of cost-sharing leads to large decreases in demand. At the mean, a 10 Ksh ($0.15) increase in ITN price decreases demand by 20%. These estimates suggest that the majority of pregnant women are either unable or unwilling to pay the prevailing cost-sharing price, which is itself still far below the manufacturing cost of ITNs.

IV.C. Price Elasticity of the Usage of ITNs

Usage Conditional on Ownership. Let us start this section with an important caveat: our sample size for studying usage conditional on uptake is considerably hampered by the fact that uptake was low in the higher-priced groups; only a small fraction of the respondents interviewed at baseline in the 40 Ksh group purchased an ITN and could be followed up at home for a usage check.

Keeping this caveat in mind, Figure II shows the average usage rate of program-issued ITNs across price groups. The top panel shows self-reported usage rates, and the bottom panel shows the likelihood that the ITN was found hanging, both measured during an unannounced home visit by an enumerator.
[FIGURE II. Program ITN Usage Rates (Conditional on Uptake) by ITN Price. Top panel: share of respondents who declare using the program ITN; bottom panel: share of program ITNs seen visibly hanging. Each panel plots the average and 95% CI by ITN price (free, 10 Ksh, 20 Ksh, 40 Ksh). Error bars represent ±2.14 standard errors (95% confidence interval with fourteen degrees of freedom). Number of observations: 226.]
On average, 62% of women visited at home claimed to be using the ITN they acquired through the program, a short-term usage rate that is very consistent with previous usage studies (D'Alessandro 1994; Alaii et al. 2003). The observed hanging rate was only slightly lower, at 57%. However, we find little variation in usage across price groups, and no systematic pattern.

This is confirmed by the regression estimates of selection effects on usage, presented in Table V. Our coefficient estimate on ITN price in column (1) is positive but insignificant, suggesting that a price increase of 10 Ksh increases usage by four percentage points, an increase of 6% at the mean. The confidence interval is large, however, and the true coefficient could be on either side of zero (the 95% confidence interval is −0.004 to 0.012). These estimates correspond to a price elasticity of usage (at the mean price and usage rate) of 0.097. Adding controls in column (2) does not improve precision but reduces the size of the estimated effect. The results also hold when the sample is restricted to the subsample of women coming for their first prenatal visit, to women in their first pregnancy, or to those who reported not having received a free ITN the previous year (data not shown). Estimates using indicators for each price in column (3) are also very imprecise, but show no pattern of increasing use with price. Women who pay 10 or 20 Ksh are less likely to be using their ITNs than women receiving them for free, but women who pay 40 Ksh appear close to ten percentage points more likely to be using their ITNs. In none of the cases, however, can we reject the null hypothesis that price has no effect on intensity of usage.

We cannot observe whether the net is actually used at night, but it is reasonable to believe that, if the ITN has been taken out of its packaging and hung from the ceiling, it is being used.13 Of those women who claimed to be using the ITN, 95% had the net hanging. Results for whether or not the net is hanging (columns (5) and (6)) are very similar to those using self-reported usage.

13. Having the insecticide-treated net hanging from the ceiling creates health benefits even if people do not sleep under the net, because it repels, disables, and/or kills mosquitoes coming into contact with the insecticide on the netting material (WHO 2007).
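As a check on the price elasticity of usage of 0.097 reported above, the elasticity is the price coefficient scaled by the ratio of the mean price to the mean usage rate. The mean transaction price in this sample is not stated in the text; a value of roughly 15 Ksh, which is what the reported elasticity implies, is assumed here purely for illustration:

\varepsilon_{\text{usage}} = \hat{\beta}_{\text{price}} \times \frac{\bar{p}}{\bar{u}} \approx 0.004 \times \frac{15}{0.62} \approx 0.097.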
TABLE V
ITN USAGE RATES ACROSS PRICES, CONDITIONAL ON OWNERSHIP
Dependent variable: respondent is currently using the ITN acquired through the program, columns (1)-(4); ITN is visibly hanging, columns (5)-(6).

ITN price: (1) 0.004 (0.004); (2) 0.003 (0.003); (5) 0.003 (0.003)
ITN price = 10 Ksh: (3) −0.125 (0.120); (4) −0.094 (0.103); (6) −0.154 (0.129)
ITN price = 20 Ksh: (3) −0.017 (0.107); (4) −0.017 (0.119); (6) −0.088 (0.124)
ITN price = 40 Ksh: (3) 0.098 (0.135); (4) 0.125 (0.123); (6) 0.071 (0.131)
Time and clinic controls: included in columns (2) and (4)
Observations: 226 in columns (1)-(4); 222 in columns (5)-(6)
Sample mean of dep. var.: 0.62 in columns (1)-(4); 0.57 in columns (5)-(6)
R2: (1) .01; (2) .06; (3) .03; (4) .07; (5) .01; (6) .03
Intracluster correlation: .04
Joint F-test (Prob > F): (3) 1.14 (.37); (4) 1.16 (.36); (6) 1.87 (.18)

Notes: Data are from home visits to a random sample of patients who bought nets at each price or received a net for free. Home visits were conducted for a subsample of patients roughly three to six weeks after their prenatal visit. Each column is an OLS regression of the dependent variable indicated by column on either the price of the ITN or an indicator variable for each price. All regressions include district fixed effects. Standard errors in parentheses are clustered at the clinic level. Given the small number of clusters (sixteen), the critical values for T-tests were drawn from a t-distribution with 14 (16 − 2) degrees of freedom. The specifications in columns (2) and (4) control for the number of days that have elapsed since the net was purchased, the number of days that have elapsed since the program was introduced at the clinic in which the net was purchased, and whether the woman has given birth already, is still pregnant, or miscarried, as well as the clinic controls in Table III.
One might be concerned that usage rates among prenatal clients receiving a free net are higher than they would be under a one-price policy, because pregnant women who value an ITN highly may have switched clinics in order to get a free net. We show in Online Appendix Table A.2 that, as with our demand estimates, usage rates among the subsample of women who did not switch clinics (i.e., attended the same prenatal clinic after our program was introduced as before it) are not different from those in the sample as a whole.

Overall, one might be surprised that the level of net usage is not higher than about 60%. This result might come from the fact that usage was measured a relatively short time after the net was purchased. In the usage regressions, the coefficients on time controls (not shown) suggest that usage increases as time passes after the ITN purchase. Among women not using the net, the most common reasons given for not using it were waiting for the birth of the child and waiting for another net (typically untreated with insecticide) to wear out. Dupas (2009a) finds that, among the general population, usage among both buyers and recipients of free ITNs is around 90% a year after the ITNs were acquired.

Unconditional Usage: "Effective Coverage." Although our estimates of usage rates among buyers suffer from small sample size imprecision, effective coverage (i.e., the fraction of the population using a program net) can be precisely estimated. Figure I presents effective coverage with program ITNs across ITN prices. The corresponding regression is presented in Table VI, column (1). The coefficient on price is −0.012, significant at the 1% level. This corresponds to a price elasticity of effective coverage of −0.44. The share of prenatal clients that are protected by an ITN under the free distribution scheme is 65%, versus 15% when ITNs are sold for 40 Ksh; this difference is significant at the 1% level (column (3)). The results are robust to the addition of clinic controls (columns (2) and (4)), and hold for all subgroups (data not shown).

Overall, our results suggest that, at least in the Kenyan context, positive prices do not help generate higher usage intensity than free distribution. The absence of a selection effect on usage could be due to the nature of the good studied, which is probably valued very highly in areas of endemic malaria, particularly among pregnant women who want to protect their babies. The context in which the evaluation took place also probably contributed to the high valuation among those who did not have to pay.
TABLE VI
EFFECTIVE COVERAGE: ITN USAGE RATES ACROSS PRICES, UNCONDITIONAL ON OWNERSHIP
Dependent variable: respondent is currently using an ITN acquired through the program.

ITN price: (1) −0.012 (0.003)***; (2) −0.010 (0.002)***
ITN price = 10 Ksh: (3) −0.188 (0.123); (4) 0.020 (0.145)
ITN price = 20 Ksh: (3) −0.203 (0.097)*; (4) −0.143 (0.104)
ITN price = 40 Ksh: (3) −0.504 (0.112)***; (4) −0.389 (0.095)***
Time and clinic controls: included in columns (2) and (4)
Observations: 259 in all columns
Sample mean of dep. var.: 0.42
Mean in (ITN price = 0) group: 0.65
Intracluster correlation: .02
Joint F-test (Prob > F): (3) 12.71 (.00); (4) 8.12 (.00)

Notes: Data are from a random sample of patients who visited program clinics. Usage for those who acquired the ITNs was measured through home visits conducted roughly three to six weeks after their prenatal visit. Each column is an OLS regression of the dependent variable indicated by column on either the price of the ITN or an indicator variable for each price. All regressions include district fixed effects. Standard errors in parentheses are clustered at the clinic level. Given the small number of clusters (sixteen), the critical values for T-tests were drawn from a t-distribution with 14 (16 − 2) degrees of freedom. ***, **, * Significance at the 1%, 5%, and 10% levels, respectively.
In particular, women had to travel to the health clinic for the prenatal visit and were told at the check-up about the importance of protection against malaria. In addition, PSI has been conducting a very intense advertising campaign for ITN use throughout Kenya over the past five years. Last, the evaluation took place in a very poor region of Kenya, in which many households do not have access to credit and have difficulty affording even modest prices for health goods. Thus, a large number of prenatal clients may value ITNs but be unable to pay higher prices for them.

IV.D. Are There Psychological Effects of Prices on Usage of ITNs?

In this section, we test whether the act of paying itself can stimulate higher product use by triggering a sunk cost effect, when willingness to pay is held constant. We use data from the ex post price randomization conducted with a subset of women who had expressed their willingness to pay the posted price (in clinics charging a positive price). For those women, the transaction price ranged from "free" to the posted price they had initially agreed to pay.

Table VII presents estimates of the effect of price (columns (1) and (2)) and of the act of paying (columns (3)–(6)) on the likelihood of usage and the likelihood that the ITN has been hung. These coefficients are from linear probability models with clinic fixed effects, estimated on the sample of women who visited a clinic where ITNs were sold at a positive price, decided to buy an ITN at the posted price, and were sampled to participate in the ex post lottery determining the transaction price they eventually had to pay to take the net home. Because the uptake of ITNs decreased sharply with the price, the sample we have at hand to test for the presence of sunk cost effects is small, and the precision of the estimates we present below is therefore limited.

We find no psychological effect of price or of the act of paying on usage, as expected from the earlier result that there is no overall effect of prices on usage. In column (1), the coefficient for price is negative, suggesting that higher prices could discourage usage, but the effect is not significant and cannot be distinguished from zero. The 95% confidence interval is (−0.0158; 0.0098), suggesting that a 10 Ksh increase in price could lead to anything from a decrease of sixteen to an increase of ten percentage points in usage. Larger effects on either side can be confidently rejected, however. Adding controls, including a dummy for having received a free ITN from the government in the previous year, does not reduce the standard error but decreases the coefficient of price further, enabling us to rule out sunk cost effects of more than seven percentage points per 10 Ksh increase in price (column (2)).
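The following sketch illustrates the logic of this test on simulated data: among women who agreed to pay the posted price, the transaction price is re-randomized between zero and the posted price, and usage is then regressed on the transaction price (or on an indicator for having paid anything) with clinic fixed effects. This is not the authors' code; the variable names, the set of lottery prices, and the simulated parameters are all hypothetical.

```python
# Illustrative sketch (not the authors' code) of the ex post price lottery used to
# test for sunk-cost effects, holding willingness to pay constant.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
posted = np.repeat([10, 20, 40], 4)              # 12 hypothetical positive-price clinics
rows = []
for clinic_id, p in enumerate(posted):
    n_buyers = rng.integers(5, 15)               # few buyers, especially at higher prices
    # Lottery: transaction price drawn between 0 and the posted price the woman agreed to pay.
    transaction = rng.choice(np.arange(0, p + 1, 10), size=n_buyers)
    usage = rng.binomial(1, 0.6, n_buyers)       # simulated under the null: usage unrelated to price paid
    rows.append(pd.DataFrame({"usage": usage, "transaction_price": transaction,
                              "paid_anything": (transaction > 0).astype(int),
                              "clinic": clinic_id}))
df = pd.concat(rows, ignore_index=True)

# Usage on the transaction price with clinic fixed effects; a sunk-cost effect
# would show up as a positive coefficient.
fit = smf.ols("usage ~ transaction_price + C(clinic)", data=df).fit()
print(fit.params["transaction_price"], fit.bse["transaction_price"])
```

Because willingness to pay is held fixed by conditioning on agreement to the posted price, a positive coefficient here would isolate a sunk-cost effect rather than a selection effect.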
TABLE VII
SUNK COST EFFECTS? ITN USAGE RATES ACROSS PRICES (CONDITIONAL ON OWNERSHIP), HOLDING WILLINGNESS TO PAY CONSTANT
Dependent variable: respondent is currently using the ITN acquired through the program, columns (1)-(5); ITN is visibly hanging, column (6).

Transaction price: (1) −0.003 (0.006); (2) −0.006 (0.006)
Transaction price > 0: (3) −0.017 (0.100); (4) −0.072 (0.101); (5) −0.065 (0.100); (6) −0.084 (0.099)
Got a free ITN the previous year: (2) −0.192 (0.100)*; (5) −0.191 (0.101)*; (6) −0.165 (0.102)
Still pregnant at time of follow-up: (2) −0.234 (0.121)*; (4) −0.195 (0.122); (5) −0.231 (0.122)*; (6) −0.213 (0.125)*
First prenatal visit: (2) 0.202 (0.102)**; (4) 0.199 (0.103)*; (5) 0.202 (0.104)*; (6) 0.121 (0.107)
First pregnancy: (2) 0.148 (0.104); (4) 0.184 (0.100)*; (5) 0.153 (0.104); (6) 0.063 (0.106)
Time to clinic: (2) 0.000 (0.001); (4) 0.000 (0.001); (5) 0.000 (0.001); (6) 0.000 (0.001)
Time elapsed since ITN purchase: (2) 0.015 (0.006)***; (4) 0.014 (0.006)**; (5) 0.015 (0.006)***; (6) 0.011 (0.005)**
Observations: (1) 132; (2) 123; (3) 132; (4) 124; (5) 123; (6) 121
Sample mean of dep. var.: 0.58 in columns (1)-(5); 0.53 in column (6)
F stat (Prob > F): (2) 3.23 (.00); (4) 2.99 (.01); (5) 3.60 (.00); (6) 1.97 (.07)

Notes: Standard errors in parentheses. Estimates are from linear probability models with clinic fixed effects, estimated on the sample of women who (1) visited a clinic where ITNs were sold at a positive price; (2) decided to buy an ITN at the posted price; and (3) were sampled to participate in the ex post lottery determining the transaction price they eventually had to pay to take the net home. The transaction prices ranged from 0 (free) to the posted price. Some of the individual control variables are missing for some respondents. ***, **, * Significance at the 1%, 5%, and 10% levels, respectively.
In column (3), the coefficient for the act of paying a positive price is also negative, suggesting that if the act of paying had any effect, it would decrease usage rather than increase it, but here again the coefficient cannot be confidently distinguished from zero. The 95% confidence interval for this estimate is quite large and suggests that the act of paying a positive price (rather than receiving the net for free) could lead to anything from a decrease of 22 to an increase of 20 percentage points in usage. Overall, these results suggest that, in the case of ITNs marketed through health clinics, there is no large positive psychological effect of price on usage. We do not have data on baseline time preferences to check whether certain subgroups are more likely to exhibit a "sunk cost" effect. We also do not have data on what women perceived ex post as the price they paid for the ITN; we thus cannot verify whether those who received a discount mentally "integrated" the two events (payment and discount) to "cancel" the loss, in the terms of Thaler (1985), or whether they "segregated" the two events and perceived the payment as a cash loss and the discount as a cash gain.

If usage does not increase with price, what about the private benefits to the users? Is it the case that the users reached through the 40 Ksh distribution system are those who really need the ITN, whereas the additional users obtained through free distribution will not benefit from using the ITN because they do not need it as much (i.e., they are healthier, or can afford other means to protect themselves against malaria)? From a public health point of view, this issue might be irrelevant in the case of ITNs, given the important community-wide effects of ITN use documented in the medical literature cited earlier. Nevertheless, it is interesting to test the validity of the argument advanced by cost-sharing programs with respect to the private returns of ITN use. This is what we attempt to do in the next section.

IV.E. Selection Effects of ITN Prices

This section presents results on the selection effects of positive prices on the health of the patients who buy ITNs. The argument that cost-sharing targets those who are more vulnerable, by screening out women who appear to need the ITN less, assumes that willingness to pay is the main factor in the decision to buy an ITN.
In the presence of extreme poverty and weak credit markets, however, it is possible that people are not able (do not have the cash) to pay what they would be willing to pay in the absence of credit constraints. The optimal subsidy level will have to be low enough to discourage women who do not need the product from buying it, while at the same time high enough to enable credit-constrained women to buy it if they need it.

We focus our analysis on an objective measure of health among prenatal clients: their hemoglobin levels. Women who are anemic (i.e., with low hemoglobin levels) are likely the women with the most exposure and least resistance to malaria, and are likely the consumers that a cost-sharing program would want to target. To judge whether higher prices encourage sicker women to purchase nets, we study the impact of price on the health of "takers" (i.e., buyers and recipients of free nets) relative to the health of the prenatal clients attending control clinics.

Figure III plots the cumulative distribution functions (CDFs) of hemoglobin levels for women buying/receiving a net at each price relative to women in the control group. The surprising result in Figure III is that the distribution of hemoglobin levels for women receiving free nets stochastically dominates the distribution in the control group, implying that women who get free nets are healthier than the average prenatal woman (Panel A). In contrast, the CDFs of hemoglobin levels of women who pay a positive price (whether 10, 20, or 40 Ksh) are indistinguishable from the CDFs of women in the control clinics (Panels B, C, and D). In other words, women who pay a higher price do not appear to be sicker than the average prenatal clients in the area.14

14. For each price level, we test the significance of the differences in CDFs (compared to the control group) with the Kolmogorov–Smirnov equality-of-distributions test. Following Præstgaard (1995), we use the bootstrap method to adjust the p-values for clustering at the clinic level. The results of the tests are presented in the notes to Figure III. We can reject the null hypothesis of equality of distributions between women who receive free nets and those attending control clinics at the 10% significance level. We cannot reject the equality of distributions for women in the control population and those paying 10, 20, or 40 Ksh for an ITN.

Why would it be that women who receive free nets appear substantially healthier, even though higher prices do not appear to induce selection of women who are sicker than the general prenatal population? Dupas (2005) shows that there is a strong incentive effect of free ITNs on enrollment for prenatal care. To test whether such an effect was at play in our experiment, Table VIII presents the average characteristics of prenatal clients in control clinics (column (1)), and, for each price group, how the average buyer diverges from the average woman in the control group (columns (2)–(5)).
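Footnote 14 describes adjusting the Kolmogorov–Smirnov p-values for clinic-level clustering by bootstrap, following Præstgaard (1995). The sketch below conveys the same idea with a simpler clinic-level permutation test rather than the authors' exact bootstrap procedure; the data, clinic assignments, and function names are all simulated and hypothetical.

```python
# Illustrative sketch: Kolmogorov–Smirnov comparison of hemoglobin distributions
# with inference at the clinic level via a permutation test (a stand-in for the
# Præstgaard (1995) bootstrap adjustment the paper uses).
import numpy as np
from scipy.stats import ks_2samp

def cluster_permutation_ks(values, clinic_ids, treated_clinics, n_perm=2000, seed=0):
    rng = np.random.default_rng(seed)
    clinics = np.unique(clinic_ids)
    k = len(treated_clinics)

    def ks_stat(treat_set):
        mask = np.isin(clinic_ids, list(treat_set))
        return ks_2samp(values[mask], values[~mask]).statistic

    observed = ks_stat(set(treated_clinics))
    exceed = sum(
        ks_stat(set(rng.choice(clinics, size=k, replace=False))) >= observed
        for _ in range(n_perm)
    )
    return observed, (exceed + 1) / (n_perm + 1)

# Toy example: hemoglobin levels at 4 "free" clinics vs. 4 "control" clinics.
rng = np.random.default_rng(2)
clinic_ids = np.repeat(np.arange(8), 25)
hb = rng.normal(10.4, 1.8, size=clinic_ids.size) + 0.9 * (clinic_ids < 4)
stat, pval = cluster_permutation_ks(hb, clinic_ids, treated_clinics=[0, 1, 2, 3])
print(f"KS statistic = {stat:.3f}, clinic-level permutation p-value = {pval:.3f}")
```

The point of either approach is that inference respects the fact that prices were randomized across sixteen clinics rather than across individual women.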
[FIGURE III. Cumulative Density of Hemoglobin Levels among ITN Recipients/Buyers. Panel A: clients at control clinics vs. clients receiving a free net; Panel B: vs. clients buying a 10 Ksh net; Panel C: vs. clients buying a 20 Ksh net; Panel D: vs. clients buying a 40 Ksh net. Horizontal axis: hemoglobin level (g/dL). The p-values for Kolmogorov–Smirnov tests of equality of distribution (adjusted for clustering at the clinic level by bootstrap) are .091 (Panel A), .385 (Panel B), .793 (Panel C), and .781 (Panel D). Number of observations: 198 (Panel A), 217 (Panel B), 208 (Panel C), and 139 (Panel D).]
TABLE VIII
CHARACTERISTICS OF PRENATAL CLIENTS BUYING/RECEIVING ITN RELATIVE TO CLIENTS OF CONTROL CLINICS
Column (1): mean in control clinics, with standard deviation in brackets. Columns (2)-(5): differences with control clinics for buyers/recipients at 0 Ksh (free), 10 Ksh ($0.15), 20 Ksh ($0.30), and 40 Ksh ($0.60), respectively, with standard errors in parentheses.

Panel A. Characteristics of visit to prenatal clinic
First prenatal visit for current pregnancy: (1) 0.48 [0.50]; (2) −0.12 (0.06)**; (3) −0.02 (0.04); (4) 0.03 (0.06); (5) 0.02 (0.04)
Walked to the clinic: (1) 0.73 [0.45]; (2) −0.12 (0.13); (3) 0.04 (0.07); (4) 0.07 (0.06); (5) −0.16 (0.08)*
If took transport to clinic: price paid (Ksh): (1) 4.58 [10.83]; (2) 3.52 (3.29); (3) 0.79 (1.78); (4) −1.17 (1.37); (5) 4.27 (1.94)**
Can read Swahili: (1) 0.81 [0.40]; (2) 0.10 (0.03)***; (3) 0.05 (0.05); (4) 0.00 (0.04); (5) 0.09 (0.02)***
Wearing shoes: (1) 0.61 [0.49]; (2) 0.06 (0.12); (3) 0.07 (0.12); (4) −0.11 (0.12); (5) 0.11 (0.12)
Respondent owns animal assets: (1) 0.19 [0.39]; (2) 0.00 (0.06); (3) 0.01 (0.05); (4) 0.12 (0.05)**; (5) 0.07 (0.09)

Panel B. Health status
Hemoglobin level (Hb), in g/dL: (1) 10.44 [1.77]; (2) 0.94 (0.34)**; (3) 0.49 (0.49); (4) 0.22 (0.47); (5) 0.48 (0.78)
Moderate anemia (Hb < 11.5 g/dL): (1) 0.69 [0.46]; (2) −0.18 (0.07)**; (3) −0.09 (0.12); (4) −0.08 (0.10); (5) −0.05 (0.19)
Severe anemia (Hb ≤ 9 g/dL): (1) 0.16 [0.37]; (2) −0.10 (0.06); (3) −0.01 (0.07); (4) 0.07 (0.09); (5) −0.06 (0.11)

Observations: (1) 110; (2) 98; (3) 120; (4) 99; (5) 28

Notes: For each variable, column (1) shows the mean observed among prenatal clients enrolling in control clinics, with standard deviations in brackets. Columns (2), (3), (4), and (5) show the differences between "buyers" in the clinics providing ITNs at 0, 10, 20, and 40 Ksh and prenatal clients enrolling in control clinics. Standard errors in parentheses are clustered at the clinic level; given the small number of clusters (sixteen), the critical values for T-tests were drawn from a t-distribution with 14 (16 − 2) degrees of freedom. ***, **, * Significance at the 1%, 5%, and 10% levels, respectively.
The results provide some evidence that the incentive effect of free ITNs was strong: women who came for free nets were 12 percentage points more likely to be coming for a repeat visit and 12 percentage points less likely to have come by foot (i.e., more likely to have come by public transportation), and they paid about 3.5 Ksh more to travel to the clinic than women in the control group (Panel A). These results suggest that the free ITN distribution induced women who had come to the clinic before the introduction of the program to come back for a revisit earlier than scheduled, and therefore before the health benefits of their first prenatal visit had worn off.15

15. In Kenya, pregnant women are typically given free iron supplements, as well as free presumptive malaria treatment, when they come for prenatal care. Both of these "treatments" have a positive impact on hemoglobin levels.
As a result, as seen in Figure III, women receiving free nets are substantially less likely to be anemic (eighteen percentage points off of a base of 69% in Panel B of Table VIII).16

16. Because some of the women who received free nets appear to have traveled farther and spent more money on travel to the clinic, one might expect that this group was composed of many switchers from nonprogram clinics. However, we find that the effects of price on selection in terms of health are unchanged for the subsample of women staying with the same clinic (Online Appendix Table A.3).

In absolute terms, however, the number of anemic women covered by an ITN is substantially greater under free distribution than under cost-sharing. As shown in Table VIII, the great majority of pregnant women in Kenya are moderately anemic (71%). All of them receive ITNs under free distribution, but only 40% of them invest in ITNs when the price is 40 Ksh (Table IV). Given that usage of the ITN (conditional on ownership) is similar across price groups, effective coverage of the anemic population is thus 60% lower under cost-sharing.17

17. The usage results in Table V hold when the sample is restricted to moderately anemic women (data not shown).

Finally, it is interesting to note in Table VIII that women who bought nets for 40 Ksh were more likely to pay for transportation and paid more to come to the clinic than the control group. Women who paid 40 Ksh were also more likely to be literate, more likely to be wearing shoes, and more likely to report owning animal assets. Not all of these differences are statistically different from zero, given the small-sample problem, but overall these results suggest that selection under cost-sharing happened at least partially along wealth lines.18

18. This hypothesis is supported by the fact that, when we compare the average client at 40 Ksh clinics (rather than the average buyer at these clinics) to the average control client, the average client is no more likely to have paid for transportation, and paid no more for transportation, than the control group (results not shown).
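As a rough check on the claim above that effective coverage of the anemic population is about 60% lower under cost-sharing: assuming, per Table V, that usage conditional on ownership (denoted \bar{u}) is similar across price groups, coverage of anemic women scales with uptake, which falls from essentially 100% under free distribution to about 40% at 40 Ksh (Table IV):

\frac{\text{coverage of anemic women at 40 Ksh}}{\text{coverage of anemic women at 0 Ksh}} \approx \frac{0.40 \times \bar{u}}{1.00 \times \bar{u}} \approx 0.40,

that is, roughly 60% lower under cost-sharing.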
V. COST-EFFECTIVENESS ANALYSIS

This section presents estimates of the cost-effectiveness of each pricing strategy in terms of children's lives saved. There are many benefits to preventing malaria transmission in addition to saving children's lives, and restricting ourselves to child mortality will lead to conservative estimates of cost-effectiveness.

An important dimension to keep in mind in the cost-effectiveness analysis is the nonlinearity in the health benefits associated with ITN use: high-density ITN coverage reduces overall transmission rates and thus positively affects the health of both nonusers and users. The results of a 2003 medical trial of ITNs in western Kenya imply that "in areas with intense malaria transmission with high ITN coverage, the primary effect of insecticide-treated nets is via area-wide effects on the mosquito population and not, as commonly supposed, by simple imposition of a physical barrier protecting individuals from biting" (Hawley et al. 2003, p. 121).

In this context, we propose the following methodology to measure the health impact of each ITN pricing scheme: we create a "protection index for nonusers" (a logistic function of the share of users in the total population) and a "protection index for users" (a weighted sum of a "physical barrier" effect of the ITN and the externality effect, the weights depending on the share of users). This enables us to compute the health impact of each pricing scheme on both users and nonusers and to (roughly) approximate the total number of child lives saved, as well as the cost per life saved. Because the relative importance of the "physical barrier" effect and of the externality is uncertain, we consider three possible values for the parameter of the logistic function predicting the protection index for nonusers (the "threshold externality parameter") and three possible values for the effectiveness of ITNs as physical barriers. This gives us a total of 3 × 3 = 9 different scenarios and 9 different cost-per-life-saved estimates for each of the four pricing strategies.

The cost-effectiveness estimates are presented in Table IX. These estimates are provided to enable comparisons across distribution schemes, but their absolute values should be taken with caution, as they rely on a number of coarse assumptions (the details of the calculations are provided in the Online Appendix). In particular, two key assumptions are the following: (1) We assume that the only difference in cost per ITN between free distribution and cost-sharing is the difference in the subsidy. That is, we assume that an ITN given for free costs 40 Ksh more to the social planner than an ITN sold for 40 Ksh. We thus ignore the money management costs associated with cost-sharing schemes. (2) We assume that 65% of households will experience a pregnancy within five years and be eligible for the ITN distribution program.19

19. Making less conservative assumptions would increase the relative cost-effectiveness of free distribution programs.

The estimates in Table IX suggest that, under all nine scenarios we study, child mortality is reduced more under free distribution than under any cost-sharing strategy (Panel A).
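To fix ideas about the construction described above, the following is one possible parameterization consistent with the verbal description; the authors' exact functional forms and parameter values are given in their Online Appendix and may differ. Here s is the share of ITN users in the population, \kappa and s_0 govern the externality threshold, b is the physical-barrier effectiveness, and the weight w(s) is illustrative:

P_{\text{nonuser}}(s) = \frac{1}{1 + e^{-\kappa (s - s_0)}}, \qquad P_{\text{user}}(s) = w(s)\, b + \bigl(1 - w(s)\bigr)\, P_{\text{nonuser}}(s).

Under a parameterization of this kind, overall protection in the population is s\,P_{\text{user}}(s) + (1 - s)\,P_{\text{nonuser}}(s), and varying the externality-threshold parameter and b over three values each generates the 3 × 3 grid of scenarios reported in Table IX.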
TABLE IX
COST-EFFECTIVENESS COMPARISONS
Columns (1)-(9) correspond to the nine scenarios: a low externality threshold with high, medium, and low physical barrier effectiveness (columns (1)-(3)); a medium externality threshold with high, medium, and low physical barrier effectiveness (columns (4)-(6)); and a high externality threshold with high, medium, and low physical barrier effectiveness (columns (7)-(9)).

Panel A. Child lives saved per 1,000 prenatal clients
ITN price 0 Ksh (100% subsidy): (1) 38; (2) 37; (3) 36; (4) 30; (5) 27; (6) 24; (7) 22; (8) 17; (9) 11
ITN price 10 Ksh (97.5% subsidy): (1) 29; (2) 28; (3) 26; (4) 20; (5) 16; (6) 13; (7) 15; (8) 11; (9) 7
ITN price 20 Ksh (95% subsidy): (1) 32; (2) 30; (3) 28; (4) 22; (5) 19; (6) 15; (7) 17; (8) 12; (9) 8
ITN price 40 Ksh (90% subsidy): (1) 16; (2) 14; (3) 12; (4) 11; (5) 8; (6) 6; (7) 9; (8) 7; (9) 4

Panel B. Cost per child life saved (US$)
ITN price 0 Ksh (100% subsidy): (1) 200; (2) 206; (3) 212; (4) 255; (5) 284; (6) 321; (7) 352; (8) 460; (9) 662
ITN price 10 Ksh (97.5% subsidy): (1) 234; (2) 251; (3) 270; (4) 348; (5) 421; (6) 531; (7) 448; (8) 609; (9) 949
ITN price 20 Ksh (95% subsidy): (1) 189; (2) 200; (3) 213; (4) 274; (5) 325; (6) 399; (7) 361; (8) 487; (9) 748
ITN price 40 Ksh (90% subsidy): (1) 175; (2) 201; (3) 235; (4) 261; (5) 339; (6) 483; (7) 302; (8) 418; (9) 678

Notes: Each cell corresponds to a separate state of the world. To this date, existing medical evidence on the relative importance of the physical barrier provided by an ITN and on the externality threshold is insufficient to know which cells are closest to the actual state of the world. See the Online Appendix for details on how these estimates were computed and the hypotheses they rely on.
This result is not surprising considering the large negative effect of cost-sharing on the share of ITN users in the population. Under the low threshold assumption for the externality effect, in terms of cost per life saved, we find that charging 40 Ksh is more cost-effective than free distribution if the physical barrier effect of ITNs is high (Panel B, column (1)). When the assumptions about the effectiveness of ITNs as physical barriers for their users are less optimistic, we find that free distribution becomes at least as cost-effective as, if not more cost-effective than, cost-sharing. Under the assumption of a "medium" externality threshold level, we find that free distribution could dominate cost-sharing in terms of cost-effectiveness (Panel B, columns (4)–(6)). Last, in the scenario where a large share of ITN users is necessary for a substantial externality to take place, we find that cost-sharing is again slightly cheaper than free distribution, unless the physical barrier effectiveness is very low. This is due to the fact that under the high threshold hypothesis, even free distribution to pregnant women is not enough to generate significant community-wide effects, because not all households experience a pregnancy. That said, given the very large standard errors on the usage estimates, the differences observed across schemes in cost per life saved typically cannot be distinguished from zero. The general conclusion of this cost-effectiveness exercise is thus that cost-sharing is at best marginally more cost-effective than free distribution, but free distribution leads to many more lives saved.

VI. DISCUSSION AND CONCLUSION

The argument that charging a positive price for a commodity is necessary to ensure that it is effectively used has recently gained prominence in the debate on the efficiency of foreign aid. The cost-sharing model of selling nets for $0.50 to mothers through prenatal clinics is believed to reduce waste because "it gets the nets to those who both value them and need them" (Easterly 2006, p. 13). Our randomized pricing experiment in western Kenya finds no evidence to support this assumption.

We find no evidence that cost-sharing reduces wastage by sifting out those who would not use the net: pregnant women who receive free ITNs are no less likely to put them to intended use than pregnant women who pay for their nets. This suggests that cost-sharing does not increase usage intensity in this context. Although it does not increase usage intensity, cost-sharing does considerably
dampen demand: we find that the cost-sharing scheme ongoing in Kenya at the time of this study results in a coverage rate 75 percentage points lower than with a full subsidy. In terms of getting nets to those who need them, our results on selection based on health imply that women who purchase nets at cost-sharing prices are no more likely to be anemic than the average prenatal woman in the area. We also find that localized, short-lived free distribution programs disproportionately benefit healthier women who can more easily travel to the distribution sites. Although our results speak to the ongoing debate regarding the optimal subsidization level for ITNs—one of the most promising health tools available in public health campaigns in sub-Saharan Africa—they may not be applicable to other public health goods that are important candidates for subsidization. In particular, it is important to keep in mind that this study was conducted when ITNs were already highly valued in Kenya, thanks to years of advertising by both the Ministry of Health and Population Services International. This high ex ante valuation likely diminished the risk that a zero or low price be perceived as a signal of bad quality. Our findings are consistent with previous literature on the value of free products: in a series of lab experiments, both hypothetical and real, Ariely and Shampan’er (2007) found that when people have to choose between two products, one of which is free, charging zero price increases consumers’ valuation of the product itself, in addition to reducing its cost. In a recent study in Uganda, Hoffmann (2007) found that households that are told about the vulnerability of children to malaria on the day they acquire an ITN are more likely to use the ITN to protect their children when they receive it for free than when they have to pay for it. In a study conducted with the general Kenyan population, Dupas (2009b) randomly varied ITN prices over a much larger range (between $0 and $4), and also found no evidence that charging higher prices leads to higher usage intensity. Dupas (2009b) also found that the demand curve for ITNs remains unaffected by common marketing techniques derived from psychology (such as the framing of marketing messages, the gender of the person targeted by the marketing, or verbal commitment elicitation), further suggesting that the high price-elasticity of the demand for ITNs is driven mostly by budget constraints. Our finding that usage of ITNs is insensitive to the price paid to acquire them contrasts with the finding of Ashraf, Berry, and Shapiro (forthcoming), in which Zambian households that paid a
higher price for a water-treatment product were more likely to report treating their drinking water two weeks later. Their experimental design departs from ours in multiple ways that could explain the difference in findings. First, because the range of prices at which the product was offered in their experiment did not include zero, Ashraf, Berry, and Shapiro do not measure usage under a free distribution scheme. Second, in contrast to a bed net that can be used for three years before it wears out, the bottle of water disinfectant used in Ashraf, Berry, and Shapiro lasts for only about one month if used consistently to treat the drinking water of an average family; in this context, it is possible that households that purchased the water disinfectant but were not using it two weeks later had stored the bottle for later use (e.g., for the next sickness episode in their household or the next cholera outbreak), and therefore the evidence on usage in Ashraf, Berry, and Shapiro has a different interpretation from ours. In addition, the baseline level of information about the product (its effectiveness, how to use it) might have differed across experiments. Although ITN distribution programs that use cost-sharing are less effective and not more cost-effective than free distribution in terms of health impact, they might have other benefits. Indeed, they often have the explicit aim of promoting sustainability. The aim is to encourage a sustainable retail sector for ITNs by combining public and private sector distribution channels (Mushi et al. 2003; Webster, Lines, and Smith 2007). Our experiment does not enable us to quantify the potentially negative impact of free distribution on the viability of the retail sector and therefore our analysis does not consider this externality. Another important dimension of the debate on free distribution versus cost-sharing is the effect of full subsidies on the distribution system. In particular, the behavior of agents on the distribution side, notably health workers in our context, could depend on the level of subsidy. Although user fees can be used to incentivize providers (World Bank 2004), free distribution schemes have been shown to be plagued by corruption (in the form of diversion) among providers (Olken 2006). Our experiment focused on the demand side and was not powered to address this distribution question. As with most randomized experiments, we are unable to characterize or quantify the impact of the various possible distribution schemes when they have been scaled up and general equilibrium effects have set in. Our experimental results should thus be seen as one piece in the puzzle of how to increase uptake of effective, externality-generating health products in resource-poor settings.
HARVARD SCHOOL OF PUBLIC HEALTH UNIVERSITY OF CALIFORNIA, LOS ANGELES
REFERENCES
Alaii, Jane A., William A. Hawley, Margarette S. Kolczak, Feiko O. Ter Kuile, John E. Gimnig, John M. Vulule, Amos Odhacha, Aggrey J. Oloo, Bernard L. Nahlen, and Penelope A. Phillips-Howard, "Factors Affecting Use of Permethrin-Treated Bed Nets during a Randomized Controlled Trial in Western Kenya," American Journal of Tropical Medicine and Hygiene, 68 (2003), 137–141.
Ariely, Dan, and Kristina Shampan'er, "How Small Is Zero Price? The True Value of Free Products," Marketing Science, 26 (2007), 742–757.
Arkes, Hal R., and Catherine Blumer, "The Psychology of Sunk Cost," Organizational Behavior and Human Decision Processes, 35 (1985), 124–140.
Ashraf, Nava, James Berry, and Jesse Shapiro, "Can Higher Prices Stimulate Product Use? Evidence from a Field Experiment in Zambia," American Economic Review, forthcoming.
Bagwell, Kyle, and Michael H. Riordan, "High and Declining Prices Signal Product Quality," American Economic Review, 81 (1991), 224–239.
Binka, F. N., F. Indome, and T. Smith, "Impact of Spatial Distribution of Permethrin-Impregnated Bed Nets on Child Mortality in Rural Northern Ghana," American Journal of Tropical Medicine and Hygiene, 59 (1998), 80–85.
Bloom, Erik, Indu Bhushan, David Clingingsmith, Elizabeth King, Michael Kremer, Benjamin Loevinsohn, Rathavuth Hong, and J. Brad Schwartz, "Contracting for Health: Evidence from Cambodia," Brookings Institution Report, 2006.
Cameron, A. Colin, Douglas Miller, and Jonah B. Gelbach, "Bootstrap-Based Improvements for Inference with Clustered Errors," Review of Economics and Statistics, 90 (2007), 414–427.
D'Alessandro, Umberto, "Nationwide Survey of Bednet Use in Rural Gambia," Bulletin of the World Health Organization, 72 (1994), 391–394.
Donald, Stephen, and Kevin Lang, "Inference with Differences-in-Differences and Other Panel Data," Review of Economics and Statistics, 89 (2007), 221–233.
Dupas, Pascaline, "Short-Run Subsidies and Long-Term Adoption of New Health Products: Evidence from a Field Experiment," Mimeo, UCLA, 2009a.
——, "What Matters (and What Does Not) in Households' Decision to Invest in Malaria Prevention?" American Economic Review: Papers and Proceedings, 99 (2009b), 224–230.
——, "The Impact of Conditional In-Kind Subsidies on Preventive Health Behaviors: Evidence from Western Kenya," unpublished manuscript, 2005.
Easterly, William, The White Man's Burden: Why the West's Efforts to Aid the Rest Have Done So Much Ill and So Little Good (New York: Penguin Press, 2006).
Evans, David B., Girma Azene, and Joses Kirigia, "Should Governments Subsidize the Use of Insecticide-Impregnated Mosquito Nets in Africa? Implications of a Cost-Effectiveness Analysis," Health Policy and Planning, 12 (1997), 107–114.
Fisher, Ronald A., The Design of Experiments (London: Oliver and Boyd, 1935).
Gimnig, John E., Margarette S. Kolczak, Allen W. Hightower, John M. Vulule, Erik Schoute, Luna Kamau, Penelope A. Phillips-Howard, Feiko O. Ter Kuile, Bernard L. Nahlen, and William A. Hawley, "Effect of Permethrin-Treated Bed Nets on the Spatial Distribution of Malaria Vectors in Western Kenya," American Journal of Tropical Medicine and Hygiene, 68 (2003), 115–120.
Harvey, Philip D., "The Impact of Condom Prices on Sales in Social Marketing Programs," Studies in Family Planning, 25 (1994), 52–58.
Hawley, William A., Penelope A. Phillips-Howard, Feiko O. Ter Kuile, Dianne J. Terlouw, John M. Vulule, Maurice Ombok, Bernard L. Nahlen, John E. Gimnig, Simon K. Kariuki, Margarette S. Kolczak, and Allen W. Hightower, "Community-Wide Effects of Permethrin-Treated Bed Nets on Child Mortality and Malaria Morbidity in Western Kenya," American Journal of Tropical Medicine and Hygiene, 68 (2003), 121–127.
Hoffmann, Vivian, "Psychology, Gender, and the Intrahousehold Allocation of Free and Purchased Mosquito Nets," Mimeo, Cornell University, 2007.
Karlan, Dean, and Jonathan Zinman, "Observing Unobservables: Identifying Information Asymmetries with a Consumer Credit Field Experiment," Econometrica, forthcoming.
Kremer, Michael, and Edward Miguel, "The Illusion of Sustainability," Quarterly Journal of Economics, 122 (2007), 1007–1065.
Lengeler, Christian, "Insecticide-Treated Bed Nets and Curtains for Preventing Malaria," Cochrane Database of Systematic Reviews 2: CD000363, 2004.
Lengeler, Christian, Mark Grabowsky, David McGuire, and Don deSavigny, "Quick Wins versus Sustainability: Options for the Upscaling of Insecticide-Treated Nets," American Journal of Tropical Medicine and Hygiene, 77 (2007), 222–226.
Lucas, Adrienne, "Economic Effects of Malaria Eradication: Evidence from the Malarial Periphery," American Economic Journal: Applied Economics, forthcoming.
Moulton, Brent R., "An Illustration of a Pitfall in Estimating the Effects of Aggregate Variables on Micro Units," Review of Economics and Statistics, 72 (1990), 334–338.
Mushi, Adiel K., Jonna R. Schellenberg, Haji Mponda, and Christian Lengeler, "Targeted Subsidy for Malaria Control with Treated Nets Using a Discount Voucher System in Southern Tanzania," Health Policy and Planning, 18 (2003), 163–171.
Olken, Benjamin, "Corruption and the Costs of Redistribution: Micro Evidence from Indonesia," Journal of Public Economics, 90 (2006), 853–870.
Oster, Sharon, Strategic Management for Nonprofit Organizations: Theory and Cases (Oxford, UK: Oxford University Press, 1995).
Population Services International [PSI], "What Is Social Marketing?" available online at http://www.psi.org/resources/pubs/what is smEN.pdf, 2003.
Præstgaard, Jens P., "Permutation and Bootstrap Kolmogorov-Smirnov Test for the Equality of Two Distributions," Scandinavian Journal of Statistics, 22 (1995), 305–322.
Riley, John G., "Silver Signals: Twenty-Five Years of Screening and Signaling," Journal of Economic Literature, 39 (2001), 432–478.
Rosenbaum, Paul R., Observational Studies (New York: Springer-Verlag, 2002).
Sachs, Jeffrey, The End of Poverty: Economic Possibilities for Our Time (New York: Penguin, 2005).
Schellenberg, Joanna A., Salim Abdulla, Rose Nathan, Oscar Mukasa, Tanya Marchant, Nassor Kikumbih, Adiel Mushi, Haji Mponda, Happiness Minja, and Hassan Mshinda, "Effect of Large-Scale Social Marketing of Insecticide-Treated Nets on Child Survival in Rural Tanzania," Lancet, 357 (2001), 1241–1247.
Ter Kuile, Feiko O., Dianne J. Terlouw, Penelope A. Phillips-Howard, William A. Hawley, Jennifer F. Friedman, Simon K. Kariuki, Ya Ping Shi, Margarette S. Kolczak, Altaf A. Lal, John M. Vulule, and Bernard L. Nahlen, "Reduction of Malaria during Pregnancy by Permethrin-Treated Bed Nets in an Area of Intense Perennial Malaria Transmission in Western Kenya," American Journal of Tropical Medicine and Hygiene, 68 (2003), 50–60.
Thaler, Richard, "Toward a Positive Theory of Consumer Choice," Journal of Economic Behavior and Organization, 1 (1980), 39–60.
——, "Mental Accounting and Consumer Choice," Marketing Science, 4 (1985), 199–214.
Webster, Jayne, Jo Lines, and Lucy Smith, "Protecting All Pregnant Women and Children under Five Years Living in Malaria Endemic Areas in Africa with Insecticide Treated Mosquito Nets," World Health Organization Working Paper, available at http://www.who.int/malaria/docs/VulnerableGroupsWP.pdf, 2007.
World Bank, World Development Report 2004: Making Services Work for Poor People (Washington, DC: World Bank and Oxford University Press, 2004).
World Health Organization [WHO], "WHO Global Malaria Programme: Position Statement on ITNs," available at http://www.who.int/malaria/docs/itn/ITNspospaperfinal.pdf, 2007.
World Malaria Report, available at http://www.who.int/malaria/wmr2008/malaria2008.pdf, 2008.
SOPHISTICATED MONETARY POLICIES∗

ANDREW ATKESON
VARADARAJAN V. CHARI
PATRICK J. KEHOE

In standard monetary policy approaches, interest-rate rules often produce indeterminacy. A sophisticated policy approach does not. Sophisticated policies depend on the history of private actions, government policies, and exogenous events and can differ on and off the equilibrium path. They can uniquely implement any desired competitive equilibrium. When interest rates are used along the equilibrium path, implementation requires regime-switching. These results are robust to imperfect information. Our results imply that the Taylor principle is neither necessary nor sufficient for unique implementation. They also provide a direction for empirical work on monetary policy rules and determinacy.

∗ The authors thank the National Science Foundation for financial support and Kathleen Rolfe and Joan Gieseke for excellent editorial assistance. The views expressed herein are those of the authors and not necessarily those of the Federal Reserve Bank of Minneapolis or the Federal Reserve System.
I. INTRODUCTION

The now-classic Ramsey (1927) approach to policy analysis under commitment specifies the set of instruments available to policy makers and finds the best competitive equilibrium outcomes given those instruments. This approach has been adapted to situations with uncertainty, by Barro (1979) and Lucas and Stokey (1983), among others, by specifying the policy instruments as functions of exogenous events.1

1. The Ramsey approach has been used extensively to discuss optimal monetary policy. See, among others, the work of Chari, Christiano, and Kehoe (1996); Schmitt-Grohé and Uribe (2004); Siu (2004); and Correia, Nicolini, and Teles (2008).

Although the Ramsey approach has been useful in identifying the best outcomes, it needs to be extended before it can be used to guide policy. Such an extension must describe what would happen for every history of private agent actions, government policies, and exogenous events. It should also structure policy in such a way that policy makers can ensure that their desired outcomes occur.

Here, we provide such an extended approach. To construct it, we extend the language of Chari and Kehoe (1990) in a natural fashion by describing private agent actions and government policies as functions of the histories of those actions and policies as well as of exogenous events. The key to our approach is our
requirement that for all histories, including those in which private agents deviate from the equilibrium path, the continuation outcomes constitute a continuation competitive equilibrium.2 We label such policy functions sophisticated policies and the resulting equilibrium a sophisticated equilibrium. If policies can be structured to ensure that the desired outcomes occur, then we say that the policies uniquely implement the desired outcome.

2. This requirement is the natural analog of subgame perfection in an environment in which private agents are competitive. In this sense, our equilibrium concept is the obvious one for our macroeconomic environment.

Here we describe this approach and use it to analyze an important outstanding question in monetary economics: How should policy be designed in order to avoid indeterminacy and achieve unique implementation? It has been known, at least since the work of Sargent and Wallace (1975), that when interest rates are the policy instrument, many ways of specifying policy lead to indeterminate outcomes, including multiple equilibria. Indeterminacy is risky because some of those outcomes can be bad, including hyperinflation. Researchers thus agree that designing policies that achieve unique implementation is desirable. Here we demonstrate that our sophisticated policy approach does that for monetary policy.

We illustrate our approach in two standard monetary economies: a simple sticky-price model with one-period price-setting and a sticky-price model with staggered price-setting (often referred to as the New Keynesian model). For both, we show that, under sufficient conditions, any outcome of a competitive equilibrium can be uniquely implemented by appropriately constructed sophisticated policies. In particular, the Ramsey equilibrium can be uniquely implemented.

In the two model economies, we construct central bank policies that uniquely implement a desired competitive equilibrium in the same basic way. Along the equilibrium path, we choose the policies to be those given by the desired competitive equilibrium. We structure the policies off the equilibrium path, the reversion policies, to discourage deviations. Specifically, if the average choice of private agents deviates from that in the desired equilibrium, then we choose the reversion policies so that the optimal choice, or best response, of each individual agent is different from the average choice.

One way to see why such reversion policies can eliminate multiplicity is to recall how multiple equilibria arise in the first
place. At an intuitive level, they arise if, when each agent believes that all other agents will choose some particular action other than the desired one, each agent finds it optimal to go along with the deviation by also picking that particular action. Our construction of reversion policies breaks the self-fulfilling nature of such deviations. It does so by ensuring that even if an agent believes that all other agents are choosing a particular action that differs from the desired action, the central bank policy makes it optimal for that agent not to go along with that deviation. When such reversion policies can be found, we say that the best responses are controllable.

A sufficient condition for controllability is that policies can be found such that after a deviation the continuation equilibrium is unique and varies with policy. Variation with policy typically holds, so if policies can be found under which the continuation equilibrium is unique (somewhere), then we have unique implementation (everywhere). This sufficient condition suggests a simple way to state our message in a general way: uniqueness somewhere generates uniqueness everywhere.

One concern with our construction of sophisticated policies is that it apparently relies on the idea that the central bank perfectly observes private agents' actions and thus can detect any deviation. We show that this concern is unwarranted: our results are robust to imperfect information about private agents' actions. Specifically, with imperfect detection of deviations, sophisticated policies can be designed that have unique equilibria that are close to the desired outcomes when the detection error is small and that converge to the desired equilibria as the detection error goes to zero.

The approach proposed here suggests an operational guide to policy making: First use the Ramsey approach to determine the best competitive equilibrium, and then check whether, in that situation, best responses are controllable. If they are, then sophisticated policies of the kind we have constructed can uniquely implement the Ramsey outcome. If best responses are not controllable, then the only option is to accept indeterminacy.

Our work here is related to previous work on the problem of indeterminacy in monetary economies (Wallace 1981; Obstfeld and Rogoff 1983; King 2000; Benhabib, Schmitt-Grohé, and Uribe 2001; Christiano and Rostagno 2001; Svensson and Woodford 2005). The previous work pursues an approach different from ours (and from that in the microeconomic literature on implementation); we call it unsophisticated implementation. The basic idea of that approach is to specify policies as functions of the history
and check only to see whether the period-zero competitive equilibrium is unique. Unsophisticated implementation has been criticized in the macroeconomic and the microeconomic literature. For example, in the macroeconomic literature, Kocherlakota and Phelan (1999), Bassetto (2002), Buiter (2002), and Ljungqvist and Sargent (2004) criticize this general idea in the context of the fiscal theory of the price level; Bassetto (2005) criticizes it in the context of a simple tax example; and Cochrane (2007) criticizes it in the context of the literature on monetary policy rules. In the microeconomic literature, Jackson (2001) criticizes a related approach to implementation. In our view, unsophisticated implementation is deficient because it does not describe how the economy will behave after a deviation by private agents from the desired outcome. This deficiency leaves open the possibility that the approach achieves implementation via nonexistence. By this phrase, we mean an approach that specifies policy actions under which no continuation equilibrium exists after private agent deviations. We agree with those who argue that implementation via nonexistence trivializes the implementation problem. To see why it does, consider the following policy rule: If private agents choose the desired outcome, then continue with the desired policy; if private agents deviate from the desired outcome, then forever after set government spending at a high level and taxes at zero. Clearly, under this policy rule, any deviation from the desired outcome leads to nonexistence of equilibrium, and hence, we trivially have implementation via nonexistence. We find this way of achieving implementation unpalatable. Our approach, in contrast, insists that policies be specified such that a competitive equilibrium exists after any deviation. We achieve implementation in the traditional microeconomic sense— by discouraging deviations, not by nonexistence. In our approach, policies are specified so that even if an individual agent believes that all other agents will deviate to some specific action, that individual agent finds it optimal to choose a different action. Our approach not only ensures that the continuation equilibria always exist, but also has the desirable property that the reversion policies are not extreme in any sense. That is, after deviations, our reversion policies do not threaten the private economy with dire outcomes such as hyperinflation; they simply bring inflation back to the desired path.
Despite the shortcomings of the unsophisticated implementation approach, this literature has made two contributions that we find useful. One is the idea of regime-switching. This idea dates back at least to Wallace (1981) and has been used by Obstfeld and Rogoff (1983), Benhabib, Schmitt-Grohé, and Uribe (2001), and Christiano and Rostagno (2001). The basic idea in, say, Benhabib, Schmitt-Grohé, and Uribe (2001) is that if the economy embarks on an undesirable path, then the monetary and fiscal policy regime switches in such a way that the government's budget constraint is violated, and the undesirable path is not an equilibrium.

The other useful contribution of the literature on unsophisticated implementation is what Cochrane (2007) calls the King rule. This rule seeks to implement a desired equilibrium through an interest-rate policy that makes the difference between the interest rate and its desired equilibrium level a linear function of the difference between inflation and its desired equilibrium level, with a coefficient greater than 1. This idea dates back to at least King (2000) and has been used by Svensson and Woodford (2005). As we show here, the King rule, like other rules that use interest rates for all histories, namely, pure interest-rate rules, always leads to indeterminacy in our simple model and does so for a large class of parameters in our staggered price-setting model as well.

We build on these two contributions by considering a King–money hybrid rule: When private agents deviate from the equilibrium path, the central bank uses the King rule for small deviations and switches regimes (from interest rates to money) for large deviations. Notice that with this rule, under our definition of equilibrium, outcomes return to the desired outcome path in the period after the deviation. In this sense, our hybrid rule achieves unique implementation without threatening agents with dire outcomes.
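A minimal sketch of the mechanics of such a hybrid rule is given below. It is not the authors' formal specification: the threshold defining a "small" deviation, the response coefficient, and the money-growth target are all hypothetical placeholders chosen only to illustrate the regime switch.

```python
# Illustrative sketch (not the paper's formal rule) of a King–money hybrid rule:
# an interest-rate response with coefficient greater than one for small inflation
# deviations, and a switch to a money instrument for large deviations.
def hybrid_rule(inflation, inflation_star, interest_star,
                phi=1.5, band=0.02, money_growth_target=0.0):
    gap = inflation - inflation_star
    if abs(gap) <= band:
        # King rule: respond more than one-for-one to the inflation deviation.
        return ("interest_rate", interest_star + phi * gap)
    # Large deviation: regime switch from interest rates to money.
    return ("money_growth", money_growth_target)

print(hybrid_rule(0.025, 0.02, 0.01))   # small deviation -> interest-rate rule
print(hybrid_rule(0.10, 0.02, 0.01))    # large deviation -> money regime
```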
implement any desired competitive equilibrium outcome, including outcomes in which, along the equilibrium path, the central bank follows an interest-rate rule that violates the Taylor principle. It is not sufficient because pure interest-rate rules may lead to indeterminacy even if they satisfy the Taylor principle. Notwithstanding these considerations, our analysis of the King–money hybrid rule does lend support to the idea that adherence to the Taylor principle can sometimes help achieve unique implementation. Specifically, this is true within the class of King–money hybrid rules when the Taylor principle is used in the region where the King part of the rules applies. Our findings also cast light on empirical investigations of determinacy based on the Taylor principle. We argue that, under the set of assumptions made explicit in the literature, inferences about determinacy based on existing estimation procedures should be treated skeptically. For our simple model economies, we provide assumptions under which such inferences can be confidently made. Although there is some hope that such inference may be possible in more interesting applied examples using variants of our assumptions, difficult challenges remain. Using sophisticated policies is our proposed way to eliminate indeterminacy when setting monetary policy. For some other recent proposals, see the work of Bassetto (2002) and Adão, Correia, and Teles (2007).
II. A SIMPLE MODEL WITH ONE-PERIOD PRICE-SETTING
We begin by illustrating the basic idea of our construction of sophisticated policies using a simple model with one-period price-setting. The dynamical system associated with the competitive equilibrium of this model is straightforward, which lets us focus on the strategic aspects of sophisticated policies. With this model, we demonstrate that any desired outcome of a competitive equilibrium can be uniquely implemented by sophisticated policies with reversion to a money regime. We show that pure interest-rate rules, which exclusively use interest rates as the policy instrument, cannot achieve unique implementation. Finally, we show that reversion to a particular hybrid rule, which uses interest rates as the policy instrument for small deviations and money for large deviations, can achieve unique implementation. The model we analyze here is a modified version of the basic sticky-price model with a New Classical Phillips curve (as in
Woodford [2003, Chap. 3, Sect. 1.3]). In order to make our results comparable to those in the literature, we here describe a simple, linearized version of the model. In Atkeson, Chari, and Kehoe (2009), we describe the general equilibrium version that, when linearized, produces the equilibrium conditions studied here. II.A. The Determinants of Output and Inflation Consider a monetary economy populated by a large number of identical, infinitely lived consumers, a continuum of producers, and a central bank. Each producer uses labor to produce a differentiated good on the unit interval. A fraction of producers j ∈ [0, α) are flexible-price producers, and a fraction j ∈ [α, 1] are sticky-price producers. In this economy, the timing within a period t is as follows. At the beginning of the period, sticky-price producers set their prices, after which the central bank chooses its monetary policy by setting one of its instruments, either interest rates or the quantity of money. Two shocks, ηt and νt , are then realized. We interpret the shock ηt as a flight to quality shock that affects the attractiveness of government debt relative to private claims and the shock νt as a velocity shock. At the end of the period, flexible-price producers set their prices, and consumers make their decisions. Now we develop necessary conditions for a competitive equilibrium in this economy and then, in the next section, formally define a competitive equilibrium. Here and throughout, we express all variables in log-deviation form. This way of expressing variables implies that none of our equations will have constant terms. Consumer behavior in this model is summarized by an intertemporal Euler equation and a cash-in-advance constraint. We can write the linearized Euler equation as (1)
yt = Et [ yt+1 ] − ψ (it − Et [πt+1 ]) + ηt ,
where yt is aggregate output, it is the nominal interest rate, ηt (the flight to quality shock) is an i.i.d. mean-zero shock with variance var(η), and πt+1 = pt+1 − pt is the inflation rate from time period t to t + 1 , where pt is the aggregate price level. The parameter ψ determines the intertemporal elasticity, and Et denotes the expectations of a representative consumer given that consumer’s information in period t, which includes the shock ηt .
The cash-in-advance constraint, when first-differenced, implies that the relationships among inflation πt , money growth μt , and output growth yt − yt−1 are given by a quantity equation of the form (2)
πt = μt − (yt − yt−1 ) + νt ,
where νt (the velocity shock) is an i.i.d. mean-zero shock with variance var(ν). We turn now to producer behavior. The optimal price set by an individual flexible-price producer j satisfies
(3) pft(j) = pt + γ yt,
where the parameter γ is the elasticity of the equilibrium real wage with respect to output (often referred to in the literature as Taylor's γ). The optimal price set by a sticky-price producer j satisfies
(4) pst(j) = Et−1[pt + γ yt],
where Et−1 denotes expectations at the beginning of period t before the shocks ηt and νt are realized. The aggregate price level pt is a linear combination of the prices pft set by the flexible-price producers and the prices pst set by the sticky-price producers and is given by
(5) pt = ∫_0^α pft(j) dj + ∫_α^1 pst(j) dj.
Using language from game theory, we can think of equations (3) and (4) as akin to the best responses of the flexible- and sticky-price producers given their beliefs about the aggregate price level and aggregate output. In this model, the flexible-price producers are strategically uninteresting. Their expectations about the future have no influence on their decisions; their prices are set mechanically according to the static considerations reflected in (3). Thus, in all that follows, equation (3) will hold on and off the equilibrium path, and we can think of pft(j) as being residually determined by (3) and substitute out for pft(j). To do so, substitute (3) into (5) and solve for pt to get
(6) pt = κ yt + (1/(1 − α)) ∫_α^1 pst(j) dj,
where κ = αγ/(1 − α).
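The algebra behind (6) is short; spelling it out (this only restates, in LaTeX notation, the substitution described above):

```latex
p_t = \int_0^\alpha \bigl(p_t + \gamma y_t\bigr)\,dj + \int_\alpha^1 p_{st}(j)\,dj
    = \alpha\bigl(p_t + \gamma y_t\bigr) + \int_\alpha^1 p_{st}(j)\,dj
\;\Longrightarrow\;
(1-\alpha)\,p_t = \alpha\gamma\,y_t + \int_\alpha^1 p_{st}(j)\,dj,
```

so dividing by 1 − α and using κ = αγ/(1 − α) gives (6).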
We follow the literature and express the sticky-price producers’ decisions in terms of inflation rates rather than price levels. To do so, let xt ( j) = pst ( j) − pt−1 , and rewrite (4) as (7)
xt ( j) = Et−1 [πt + γ yt ] .
For convenience, we define
(8) xt = (1/(1 − α)) ∫_α^1 xt(j) dj
to be the average price set by the sticky-price producers relative to the aggregate price level in period t − 1, so that we can rewrite (7) as (9)
xt = Et−1 [πt + γ yt ] .
We can also rewrite (6) as (10)
πt = κ yt + xt .
Consider now the setting of monetary policy in this model. When the central bank sets its policy, it has to choose to operate under either a money regime or an interest-rate regime. In the money regime, the central bank’s policy instrument is money growth μt ; it sets μt , and the nominal interest rate it is residually determined from the Euler equation (1) after the realization of the shock ηt . In the interest-rate regime, the central bank’s instrument is the interest rate; it sets it , and money growth μt is residually determined from the cash-in-advance constraint (2) after the realization of the shock νt . Of course, in both regimes, the Euler equation and the cash-in-advance constraint both hold. II.B. Competitive Equilibrium Now we define a notion of competitive equilibrium for the simple model in the spirit of the work of Barro (1979) and Lucas and Stokey (1983). In this equilibrium, allocations, prices, and policies are all defined as functions of the history of exogenous events, or shocks, st = (s0 , . . . , st ), where st = (ηt , νt ). Sticky-price producer decisions and aggregate inflation and output levels can be summarized by {xt (st−1 ), πt (st ), yt (st )}. In terms of the policies, we let the regime choice and the policy choice within the regime be δt (st−1 ) = (δ1t (st−1 ), δ2t (st−1 )), where the first argument δ1t (st−1 ) ∈ {M, I} denotes the regime choice, either money (M) or the interest rate (I), and the second argument
denotes the policy choice within the regime, either money growth μt (st−1 ) or the interest rate it (st−1 ). If the money regime is chosen in t, then the interest rate is determined residually at the end of that period, whereas if the interest-rate regime is chosen in t, then the money growth rate is determined residually at the end of the period. Let {at (st )} = {xt (st−1 ), δt (st−1 ), πt (st ), yt (st )} denote a collection of allocations, prices, and policies in this competitive equilibrium. Such a collection is a competitive equilibrium given y−1 if it satisfies (i) consumer optimality, namely, (1) and (2) for all st ; (ii) optimality by sticky-price producers, namely, (9) for all st−1 ; and (iii) optimality by flexible-price producers, namely, (10) for all st . We also define a continuation competitive equilibrium starting from any point in time. For example, consider the beginning of period t with state variables st−1 and yt−1 . A collection of allocations, prices, and policies {a(st−1 , yt−1 )}r≥t = {xr (sr−1 | st−1 , yt−1 ), δr (sr−1 | st−1 , yt−1 ), πr (sr | st−1 , yt−1 ), yr (sr | st−1 , yt−1 )}r≥t is a continuation competitive equilibrium from (st−1 , yt−1 ) if it satisfies the three conditions of a competitive equilibrium above for all periods starting from (st−1 , yt−1 ). In this definition, we effectively drop the equilibrium conditions from period 0 through period t − 1. This notion of a continuation competitive equilibrium from the beginning of period t onward is very similar to that of a competitive equilibrium from the beginning of period 0 onward, except that the initial conditions are now given by (st−1 , yt−1 ). We define a continuation competitive equilibrium that starts at the end of period t from (st−1 , yt−1 , xt , δt , st ) in a similar way. This latter definition requires optimality by consumers and flexibleprice producers from st onward and optimality by sticky-price producers from st+1 onward. Note that this equilibrium must satisfy all the conditions of a continuation competitive equilibrium that starts at the beginning of period t, except for the sticky-price optimality condition in period t, namely, (9) in period t. Finally, a continuation competitive equilibrium starting at the beginning of period 0 is simply a competitive equilibrium. The following lemma proves that any competitive equilibrium gives rise to a New Classical Phillips curve along with some other useful properties of such an equilibrium.
LEMMA 1 (New Classical Phillips Curve and Other Useful Properties). Any competitive equilibrium must satisfy (11)
πt (st ) = κ yt (st ) + E[πt (st ) | st−1 ],
which is often referred to as the New Classical Phillips curve;
(12) E[yt(st) | st−1] = 0 and xt(st−1) = E[πt(st) | st−1];
and
(13) E[xt+1(st) | st−1] = E[πt+1(st+1) | st−1] = it,
where it = it(st−1) if the central bank uses an interest-rate regime in period t and it = it(st) if the central bank uses a money regime in period t.
Proof. To see that E[yt(st) | st−1] = 0, take expectations of (10) as of st−1 and substitute into (9). Using this result in (10), we obtain xt(st−1) = E[πt(st) | st−1]. Substituting this result into (10) yields (11). To show (13), take expectations of the Euler equation (1) with respect to st−1 and use E[yt(st) | st−1] = 0 along with the law of iterated expectations to get (13). QED
A similar argument establishes that (11)–(13) hold for any continuation competitive equilibrium.
II.C. Sophisticated Equilibrium
We now turn to what we call sophisticated equilibrium. The definition of this concept is very similar to that for competitive equilibrium, except that here we allow allocations, prices, and policies to be functions of more than just the history of exogenous events; they are also functions of the history of both aggregate private actions and central bank policies. For sophisticated equilibrium, we require as well that for every history, the continuation of allocations, prices, and policies from that history onward constitutes a continuation competitive equilibrium.
Setup and Definition. Before turning to our formal definition, we note that our definition of sophisticated equilibrium simply specifies policy rules that the central bank must follow; it does not require that the policy rules be optimal. We specify sophisticated policies in this way in order to show that our unique implementation result does not depend on the objectives of the central bank. We think of sophisticated policies as being specified at the beginning of period 0 and of the central bank as being committed to following them.
We turn now to defining the histories that private agents and the central bank confront when they make their decisions. The public events that occur in a period are, in chronological order, qt = (xt ; δt ; st ; yt , πt ). Letting ht denote the history of these events from period −1 up to and including period t, we have that ht = (ht−1 , qt ) for t ≥ 0. The history h−1 = y−1 is given. For notational convenience, we focus on perfect public equilibria in which the central bank’s strategy (choice of regime and policy) is a function only of the public history. The public history faced by the sticky-price producers at the beginning of period t when they set their prices is ht−1 . A strategy for the sticky-price producers is a sequence of rules σx = {xt (ht−1 )} for choosing prices for every possible public history. The public history faced by the central bank when it chooses its regime and sets either its money-growth or interest-rate policy is hgt = (ht−1 , xt ). A strategy for the central bank {δt (hgt )} is a sequence of rules for choosing the regime as well as the policy within the regime, either μt (hgt ) or it (hgt ). Let σg denote that strategy. At the end of period t, then, output and inflation are determined as functions of the relevant history hyt according to the rules yt (hyt ) and πt (hyt ). We let σ y = {yt (hyt )} and σπ = {πt (hyt )} denote the sequence of output and inflation rules. Notice that for any history, the strategies σ induce continuation outcomes in the natural way. For example, starting at some history ht−1 , these strategies recursively induce outcomes {ar (sr | ht−1 ; σ )}. We illustrate this recursion for period t. The sticky-price producer’s decision in t is given by xt ( j, st−1 | ht−1 ; σ ) = xt (ht−1 ), where xt (ht−1 ) is obtained from σx . The central bank’s decision in t is given by δt (st−1 | ht−1 ; σ ) = δt (hgt ), where hgt = (ht−1 , xt (ht−1 )) and δt (hgt ) is obtained from σg . The consumer and flexible-price producer decisions in t are given by yt (st | ht−1 ; σ ) = yt (hyt ) and πt (st | ht−1 ; σ ) = πt (hyt ), where hyt = (ht−1 , xt (ht−1 ), δt (ht−1 , xt (ht−1 ))) and yt (hyt ) and πt (hyt ) are obtained from σ y and σπ . Continuing in a similar way, we can recursively define continuation outcomes for subsequent periods. We can likewise define continuation outcomes {ar (sr | hgt ; σ )} and {ar (sr | hyt ; σ )} following histories hgt and hyt , respectively. We now use these strategies and continuation outcomes to formally define our notion of equilibrium. A sophisticated equilibrium given the policies here is a collection of strategies (σx , σg ) and allocation rules (σ y , σπ ) such that (i) given any history ht−1 , the continuation outcomes {ar (sr | ht−1 ; σ )} induced by σ constitute
a continuation competitive equilibrium and (ii) given any history hyt , so do the continuation outcomes {ar (sr | hyt ; σ )}.3 Associated with each sophisticated equilibrium σ = (σg , σx , σ y , σπ ) are the particular stochastic processes for outcomes that occur along the equilibrium path, which we call sophisticated outcomes. These outcomes are competitive equilibrium outcomes. We will say a policy σg∗ uniquely implements a desired competitive equilibrium {at∗ (st )} if the sophisticated outcome associated with any sophisticated equilibrium of the form (σg∗ , σx , σ y , σπ ) coincides with the desired competitive equilibrium. A central feature of our definition of sophisticated equilibrium is our requirement that for all histories, including deviation histories, the continuation outcomes constitute a continuation competitive equilibrium. We think of this requirement as analogous to the requirement that in a subgame perfect equilibrium, the continuation strategies constitute a Nash equilibrium. This requirement constitutes the most important difference between our approach to determinacy and that in the macroeconomic literature. Technically, one way of casting that literature’s approach into our language of strategies and allocation rules is to consider the following notion of equilibrium. An unsophisticated equilibrium is a strategy for the central bank σg and allocations, policies, and prices {at (st )} = {xt (st−1 ), δt (st−1 ), πt (st ), yt (st )} such that {at (st )} is a period-zero competitive equilibrium and the policies induced by σg from {at (st )} coincide with {δt (st−1 )}. In our view, unsophisticated equilibrium is a deficient guide to policy. Although an unsophisticated equilibrium does tell policy makers what to do for every history, it does not specify what will happen under their policies for every history, in particular for deviation histories. Achieving implementation using the notion of unsophisticated equilibrium is, in general, trivial. As we explained earlier, one way of achieving implementation is via nonexistence: simply specify policies so that no competitive equilibrium exists after deviation histories. We find this way of achieving implementation uninteresting. 3. In general, a sophisticated equilibrium would require that for every history (including histories in which the government acts, hgt ), the continuation outcomes from that history onward constitute a competitive equilibrium. Here, that requirement would be redundant because the conditions for a competitive equilibrium for hgt are the same as those for hyt .
Finally, to help avoid a common confusion, we stress that our definition does not require that, when there is a deviation in period t, the entire sequence starting from period 0, including the deviation in period t, constitute a period-zero competitive equilibrium. Indeed, if we achieve unique implementation, then such a sequence will not constitute a period-zero equilibrium. Implementation with Sophisticated Policies. We focus on implementing competitive equilibria with sophisticated policies in which the central bank uses interest rates along the equilibrium path. This focus is motivated in part by the observation that most central banks seem to use interest rates as their policy instruments. Another motivation is that if the variance of the velocity shock νt is large, then all of the outcomes under the money regime are undesirable. To set up our construction of sophisticated policies, recall that in our economy the only strategically interesting agents are the sticky-price producers. Their choices must satisfy a key property, that (14)
xt (ht−1 ) = E[πt (hyt ) + γ yt (hyt ) | ht−1 ],
where hyt = (ht−1 , xt (ht−1 ), δt (ht−1 , xt (ht−1 )), st ). Notice that xt (ht−1 ) shows up on both sides of equation (14), so we require that the optimal choice xt (ht−1 ) satisfy a fixed point property. To get some intuition for this property, suppose that each sticky-price producer believes that all other sticky-price producers will choose some value, say, xˆt . This choice, together with the central bank’s strategy ˆ yt ) and the inflation and output rules, induces the outcomes πt (h ˆ ˆ and yt (hyt ), where hyt = (ht−1 , xˆt , δt (ht−1 , xˆt ), st ). The fixed point property requires that for xˆt to be part of an equilibrium, each sticky-price producer’s best response must coincide with xˆt . The basic idea behind our sophisticated policy construction is that the central bank starts by picking any desired competitive equilibrium allocations and sets its policy on the equilibrium path consistent with them. The central bank then constructs its policy off the equilibrium path so that even if an individual agent believes that all other agents will deviate to some specific action, that individual agent finds it optimal to choose a different action. In this sense, the policies are specified so that the fixed point property is satisfied at only the desired allocations.
We now analyze several possible ways for a central bank to attempt the implementation of competitive equilibria in which it uses interest rates as its monetary policy instrument.
With reversion to a money regime. We show first that in the simple sticky-price model, any competitive equilibrium in which the central bank uses the interest rate as its instrument in all periods can be uniquely implemented with sophisticated policies that involve a one-period reversion to money. Under these policies, after a deviation, the central bank switches to a money regime for one period.
More precisely, fix a desired competitive equilibrium outcome path (xt∗(st−1), πt∗(st), yt∗(st)) together with central bank policies it∗(st−1). Consider the following trigger-type policy: If sticky-price producers choose xt in period t to coincide with the desired outcomes xt∗(st−1), then let central bank policy in t be it∗(st−1). If not, and these producers deviate to some x̂t ≠ xt∗(st−1), then for that period t, let the central bank switch to a money regime with a suitably chosen level of money growth. This level of money growth makes it not optimal for any individual sticky-price setter to cooperate with the deviation. If such a level of money growth exists, we say that the best responses of the sticky-price setters are controllable. The following lemma shows that this property holds for our model.
LEMMA 2 (Controllability of Best Responses with One-Period Price-Setting). For any history (ht−1, x̂t), if the central bank chooses the money regime, then there exists a choice for money growth μt such that
(15) x̂t ≠ E[πt(ĥyt) + γ yt(ĥyt)],
where ĥyt = (ht−1, x̂t, M, μt).
Proof. Substituting (2) into (10), we have a result showing that if the central bank chooses the money regime with money growth μt, then output yt and inflation πt are uniquely determined and given by
(16) yt = (μt + νt + yt−1 − x̂t)/(1 + κ),
(17) πt = κ yt + x̂t.
Hence,
E[πt(ĥyt) + γ yt(ĥyt)] = ((κ + γ)/(1 + κ))(μt + yt−1 − x̂t) + x̂t.
Clearly, then, any choice of μt ≠ x̂t − yt−1 will ensure that (15) holds. QED
We use this lemma to guide our choice of the suitable money growth rate after deviations. We choose this growth rate to generate the same expected inflation as in the original equilibrium. (Of course, we could have chosen many other values that also would discourage deviations, but we found this value to be the most intuitive.4) In particular, if the producers deviate to some x̂t ≠ xt∗(st−1), then for that period t, let the central bank switch to a money regime with money growth set so that
(18) μt = x̂t − yt−1 + ((1 + κ)/κ)(xt∗(st−1) − x̂t).
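To make the mechanics of (16)–(18) concrete, here is a small numerical sketch in Python. The parameter values α, γ, ψ and the numbers chosen for xt∗, x̂t, and yt−1 are illustrative assumptions, not values from the paper; the point is only that with μt set by (18), expected inflation in the reversion period equals xt∗, so no deviation x̂t is a best response to itself.

```python
# Numerical sketch of the one-period money reversion (equations (16)-(18)).
# All parameter values below are illustrative assumptions.

alpha, gamma = 0.5, 1.0              # flexible-price share and Taylor's gamma (assumed)
kappa = alpha * gamma / (1 - alpha)  # kappa = alpha*gamma/(1 - alpha), as in (6)

def expected_outcomes(mu, x_hat, y_prev):
    """Expected output and inflation under the money regime, from (16)-(17) with E[nu_t] = 0."""
    y = (mu + y_prev - x_hat) / (1 + kappa)
    pi = kappa * y + x_hat
    return y, pi

def reversion_money_growth(x_hat, x_star, y_prev):
    """Money growth after a deviation, equation (18)."""
    return x_hat - y_prev + (1 + kappa) / kappa * (x_star - x_hat)

x_star, y_prev = 0.02, 0.0           # desired x_t* and lagged output (assumed numbers)
for x_hat in (0.05, -0.01, 0.10):    # candidate deviations, all different from x_star
    mu = reversion_money_growth(x_hat, x_star, y_prev)
    y, pi = expected_outcomes(mu, x_hat, y_prev)
    best_response = pi + gamma * y   # E[pi_t + gamma*y_t], the object appearing in (15)
    print(f"x_hat={x_hat:+.3f}  mu={mu:+.3f}  E[pi_t]={pi:+.3f}  best response={best_response:+.3f}")
# Expected inflation equals x_star in every case, and the best response never
# coincides with x_hat, so (15) holds for every deviation.
```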
Note that μt ≠ x̂t − yt−1. With such a money growth rate, expected inflation is the same in the reversion period as it would have been in the desired outcome. From Lemma 1, such a choice of x̂t cannot be part of an equilibrium. It is also easy to see that if a deviation occurs in period t, the economy returns to the desired outcomes in period t + 1. We have established the following proposition.
PROPOSITION 1 (Unique Implementation with Money Reversion). Any competitive equilibrium outcome in which the central bank uses interest rates as its instrument can be implemented as a unique equilibrium with sophisticated policies with one-period reversion to a money regime. Moreover, under this rule, after any deviation in period t, the equilibrium outcomes from period t + 1 are the desired outcomes.
A simple way to describe our unique implementation result is that controllability of best responses under some regime guarantees unique implementation of any desired outcome. We obtain controllability by reversion to a money regime. Note that even though the money regime is not used on the equilibrium path, it is useful as an off-equilibrium commitment that helps support
4. We choose this part of the policy as a clear demonstration that after a deviation, the central bank is not doing anything exotic, such as producing a hyperinflation. Rather, in an intuitive sense, the central bank is simply getting the economy back on the track it had been on before the deviation threatened to shift it in another direction.
desired outcomes in which the central bank uses interest rates on the equilibrium path. Notice also that the proposition implies that deviations lead to only very transitory departures from desired outcomes. In particular, we do not achieve implementation by threatening the economy with dire outcomes after deviations. (Note that the particular result, that the economy returns exactly to the desired outcomes in the period after the deviation, would not hold in a version of this model with state variables, such as capital.) So far we have focused on uniquely implementing competitive outcomes when the central bank uses interest rates as its instrument. Equations (16) and (17) imply that the equilibrium outcome under a money regime is unique, so that implementing desired outcomes is trivial when the central bank uses money as its instrument. Clearly, we can use a simple generalization of Proposition 1 to uniquely implement a competitive equilibrium in which the central bank uses interest rates in some periods and money in others. With pure interest-rate rules. Now, as a second possible way for a central bank to implement competitive equilibria, we analyze pure interest-rate rules. We find that this way cannot achieve unique implementation. We begin with a pure interest-rate rule of the form (19)
it (st−1 ) = it∗ (st−1 ) + φ(xt (st−1 ) − xt∗ (st−1 )),
where it∗(st−1) and xt∗(st−1) are the interest rates and the sticky-price producer choices associated with a competitive equilibrium that the central bank wants to implement uniquely, and the parameter φ represents how aggressively the central bank changes interest rates when private agents deviate from the desired equilibrium. Notice that this rule (19) specifies policy both on and off the equilibrium path. On the equilibrium path, xt(st−1) = xt∗(st−1), and the rule yields it(st−1) = it∗(st−1). Off the equilibrium path, the rule specifies how it(st−1) should differ from it∗(st−1) when xt(st−1) differs from xt∗(st−1). Pure interest-rate rules of the form (19) have been discussed by King (2000) and Svensson and Woodford (2005). We follow Cochrane (2007) and call (19) the King rule. Note from Lemma 1 that xt(st−1) = E[πt(st) | st−1], so that the King rule can be thought of as targeting expected inflation, in the
sense that (19) is equivalent to (20)
it (st−1 ) = it∗ (st−1 ) + φ(E[πt (st ) | st−1 ] − E[πt∗ (st ) | st−1 ]).
We now show that if the central bank follows the King rule (19), it cannot ensure unique implementation of the desired outcome. Indeed, under this rule, the economy has a continuum of equilibria. More formally: PROPOSITION 2 (Indeterminacy of Equilibrium under the King Rule). Suppose the central bank sets interest rates it according to the simple economy’s King rule (19). Then any of the continuum of sequences indexed by the initial condition x0 and the parameter c that satisfies (21)
xt+1 = it + cηt , πt = xt + κ(1 + ψc)ηt , and yt = (1 + ψc)ηt
is a sophisticated outcome.
Proof. In order to verify that the multiple outcomes that satisfy (21) are part of a period-zero competitive equilibrium, we need to check that they satisfy (1), (9), and (10). That they satisfy (9) follows by taking expectations of the second and third equations in (21). Substituting for it from (19) and for xt+1 from (21) into (1), we obtain that yt = (1 + ψc)ηt, as required by (21). Inspecting the expressions for πt and yt in (21) shows that they satisfy (10). Clearly, any such period-zero competitive equilibrium can be supported by a government strategy, σg, of the King rule form and appropriately chosen σx, σy, and σπ. QED
The intuitive idea behind the multiplicity of equilibria associated with the initial condition x0 is that interest-rate rules, including the King rule, induce nominal indeterminacy and do not pin down the initial price level. The intuitive idea behind the multiplicity of stochastic equilibria associated with c ≠ 0 is that interest rates pin down only expected inflation and not the state-by-state realizations indexed by the parameter c.
Note that Proposition 2 implies that even if the King rule parameter φ > 1, the economy has a continuum of equilibria. In that case, all but one of the equilibria has exploding inflation, in the sense that inflation eventually becomes unbounded. In the literature, researchers often restrict attention to bounded equilibria. We argue that, in this model, equilibria with exploding inflation
cannot be dismissed on logical grounds. Indeed, these equilibria are perfectly reasonable because the inflation explosion is associated with a money supply explosion. To see this association, suppose that the economy has no stochastic shocks and the desired outcomes are πt = 0 and yt = 0 in all periods. Then, from the cash-in-advance constraint (2), we know that the growth of the money supply is given by
(22) μt = xt = φ^t x0.
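The following short simulation is a sketch of this deterministic continuum; the value of φ and the initial conditions are arbitrary choices made for the example, and the only structure used is the law of motion xt+1 = φ xt implied by the King rule together with the relation μt = πt = xt from (22).

```python
# Sketch of the deterministic continuum under the King rule (Proposition 2 / (22)).
# phi and the initial conditions x0 are illustrative assumptions.
import numpy as np

phi, T = 1.5, 12
for x0 in (0.0, 0.01, -0.01):
    x = x0 * phi ** np.arange(T)   # x_{t+1} = i_t = phi * x_t, since x_t* = 0
    pi = x                         # inflation equals x_t when y_t = 0 and there are no shocks
    mu = x                         # money growth from the cash-in-advance constraint, as in (22)
    print(f"x0 = {x0:+.2f}:  pi_10 = {pi[10]:+.5f},  mu_10 = {mu[10]:+.5f}")
# Every x0 generates a valid equilibrium path; for any x0 != 0, inflation and
# money growth explode together at rate phi.
```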
Thus, in these equilibria, inflation explodes because money growth explodes. Each equilibrium is indexed by a different initial value of the endogenous variable x0 . This endogenous variable depends solely on expectations of future policy and is not pinned down by any initial condition or transversality condition. Such equilibria are reasonable because at the core of most monetary models is the idea that the central bank’s printing of money at an ever-increasing rate leads to a hyperinflation. In these equilibria, inflation does not arise from the speculative reasons analyzed by Obstfeld and Rogoff (1983), but from the conventional money-printing reasons analyzed by Cagan (1956). In this sense, our model predicts, for perfectly standard and sensible reasons, that the economy can suffer from any one of a continuum of very undesirable paths for inflation. (Cochrane [2007] makes a similar point for a flexible-price model.) The same proposition obviously applies to more general interest-rate rules that are restricted to be the same on and off the equilibrium path. For example, Proposition 2 applies to linear feedback rules of the form (23)
it = īt + Σ_{s=0}^{∞} φ_{x,s} x_{t−s} + Σ_{s=1}^{∞} φ_{y,s} y_{t−s} + Σ_{s=1}^{∞} φ_{π,s} π_{t−s},
where the intercept term īt can depend on the history of stochastic events.
With reversion to a hybrid rule. Analysis of a third possible way to implement competitive equilibria is a bit more complicated. In Proposition 1, we have shown how reversion to a money regime can achieve unique implementation. In Proposition 2 and the subsequent discussion, we have shown that pure interest-rate rules, such as the King rule, cannot. Notice that in our money reversion policies, even tiny deviations trigger a reversion to a money
regime. A natural question arises: Can unique implementation be achieved using a combination of these two strategies, or a hybrid rule, specifying, for example, that the central bank continue to use interest rates unless the deviations are very large and then revert to a money regime? The answer is yes.
To see this, consider a particular hybrid rule that is intended to implement a bounded competitive equilibrium {xt∗(st−1), πt∗(st), yt∗(st)} with an associated interest rate it∗(st−1). Fix some x̄ and x which satisfy x̄ > maxt xt∗(st−1) and x < mint xt∗(st−1). What we will call the King–money hybrid rule specifies that if xt(st−1) is within the interest-rate interval [x, x̄], then the central bank follows a King rule of the form (19); and if xt(st−1) falls outside this interval, then the central bank reverts to a money regime and chooses the money growth rate that produces an expected inflation rate π̄ ∈ [x, x̄]. That the money growth rate can be so chosen follows from (16) and (17).
We show that an attractive feature of outcomes under this hybrid rule is that deviations from the desired path lead only to very transitory movements away from the desired path. More precisely, after any deviation in period t, even though inflation and output in period t may differ from the desired outcomes, those in subsequent periods coincide with the desired outcomes. More formally:
PROPOSITION 3 (Unique Implementation with a Hybrid Rule). In the simple economy, the King–money hybrid rule with φ > 1 uniquely implements any bounded competitive equilibrium. Moreover, under this rule, after any deviation in period t, the equilibrium outcomes from period t + 1 are the desired outcomes.
We prove this proposition in the Appendix. Here we simply sketch the argument for a deterministic version of the model. The key to the proof is a preliminary result that shows that no equilibrium outcome xt can be outside the interval [x, x̄]. To see that this is true, suppose that in some period t, xt is outside that interval. But when this is true, the hybrid rule specifies a money growth rate in that period that yields expected inflation inside the interval. Because xt equals expected inflation, this gives a contradiction and proves the preliminary result.
To establish uniqueness, suppose that there is some sophisticated equilibrium with x̂r ≠ xr∗ for some r. From the preliminary result, x̂r must be in the interval [x, x̄], where the King rule
is operative. From Lemma 1, we know that in any equilibrium, it = xt+1, so that the King rule implies that x̂t+1 − xt+1∗ = φ(x̂t − xt∗), and hence x̂t − xt∗ = φ^(t−r)(x̂r − xr∗) for all t ≥ r. Because φ > 1 and xt∗ is bounded, eventually x̂t+1 must leave the interval [x, x̄], which is a contradiction.
Extension to Interest-Elastic Money Demand. So far, to keep the exposition simple, we have assumed a cash-in-advance setup in which money demand is interest-inelastic. This feature of the model implies that if a money regime is adopted in some period t, then the equilibrium outcomes in that period are uniquely determined by the money growth rate in that period. This uniqueness under a money regime is what allows the central bank to switch to a one-period money regime in order to support any desired competitive equilibrium.
Now we consider economies with interest-elastic money demand. We argue that under appropriate conditions, our unique implementation result extends to such economies. When economies have interest-elastic money demand, sophisticated policies that specify reversion to money or to a hybrid rule can uniquely implement any desired outcome if best responses are controllable. A sufficient condition for such controllability is that competitive equilibria are unique under a suitably chosen money regime. Here, as with inelastic money demand, the uniqueness under a money regime is what enables unique implementation.
A sizable literature has analyzed the uniqueness of competitive equilibria under money growth policies with interest-elastic money demand. Obstfeld and Rogoff (1983) and Woodford (1994) provide sufficient conditions for this uniqueness. For example, Obstfeld and Rogoff (1983) consider a money-in-the-utility-function model with preferences of the form u(c) + v(m), where c is consumption and m is real money balances, and show that a sufficient condition for uniqueness under a money regime is lim_{m→0} m v′(m) > 0.
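As a quick illustration of how this condition sorts specifications of v, the following sketch checks the limit symbolically for two example utility functions; the particular functional forms are our own illustrative choices, not ones used in the papers cited here.

```python
# Symbolic check of the Obstfeld-Rogoff condition lim_{m -> 0} m * v'(m) > 0
# for two illustrative money-in-the-utility specifications.
import sympy as sp

m = sp.symbols("m", positive=True)
examples = {
    "v(m) = log(m)": sp.log(m),
    "v(m) = 2*sqrt(m)": 2 * sp.sqrt(m),
}
for label, v in examples.items():
    lim = sp.limit(m * sp.diff(v, m), m, 0, dir="+")
    verdict = "satisfies" if lim > 0 else "does not satisfy"
    print(f"{label}: lim m*v'(m) = {lim}  ->  {verdict} the sufficient condition")
```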
Obstfeld and Rogoff (1983) focus attention on flexible-price models, but their results can be readily extended to our simple sticky-price model. Indeed, their sufficient conditions apply unchanged to a deterministic version of that model because our model without shocks is effectively identical to a flexible-price model. Hence, under appropriate sufficient conditions, our unique
implementation result extends to environments with interest-elastic money demand. More generally, for our hybrid rule to uniquely implement desired outcomes, we need a reversion policy that has a unique equilibrium. An alternative to a money regime is a commodity standard such as those in the work of Wallace (1981) and Obstfeld and Rogoff (1983). With this type of standard, the government promises to redeem money for goods for some arbitrarily low price and finances the supply of goods with taxation. An alternative to our hybrid rule with money reversion is, therefore, a hybrid rule with reversion to a commodity standard.
III. A MODEL WITH STAGGERED PRICE-SETTING
We turn now to a version of our simple model with staggered price-setting, often referred to as the New Keynesian model. We show that, along the lines of the argument developed above, policies with infinite reversion to either a money regime or a hybrid rule can uniquely implement any desired outcome under an interest-rate regime. We also show that for a large class of economies, pure interest-rate rules of the King form still lead to indeterminacy. To make our points in the simplest way, we abstract from aggregate uncertainty.
III.A. Setup and Competitive Equilibrium
We begin by setting up the model with staggered price-setting. In the model, prices are set in a staggered fashion as in the work of Calvo (1983). At the beginning of each period, a fraction 1 − α of producers are randomly chosen and allowed to reset their prices. After that, the central bank makes its decisions, and then, finally, consumers make theirs. This economy has no flexible-price producers.
The linearized equations in this model are similar to those in the simple model. The Euler equation (1) and the quantity equation (2) are unchanged, except that here they have no shocks. The price set by a producer permitted to reset its price is given by the analog of (4), which is
(24) pst(j) = (1 − αβ) Σ_{r=t}^{∞} (αβ)^{r−t} (γ yr + pr),
where β is the discount factor. Here, again, Taylor’s γ is the elasticity of the equilibrium real wage with respect to output. Letting pst denote the average price set by producers permitted to reset their prices in period t, we can recursively rewrite this equation as (25)
pst ( j) = (1 − αβ) (γ yt + pt ) + αβpst+1 ,
together with a type of transversality condition limT →∞ (αβ)T psT ( j) = 0. The aggregate price level can then be written as (26)
pt = αpt−1 + (1 − α) pst .
To make our analysis parallel to the literature, we again translate the decisions of the sticky-price producers from price levels to inflation rates. Letting xt ( j) = pst ( j) − pt−1 and letting xt denote the average of xt ( j), with some manipulation we can rewrite (25) as (27)
xt = (1 − αβ)γ yt + πt + αβxt+1 .
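The manipulation that turns (25) into (27) is brief; for convenience it can be spelled out as follows (this is only the algebra implicit in the text, written in LaTeX notation):

```latex
x_t(j) = p_{st}(j) - p_{t-1}
       = (1-\alpha\beta)\,(\gamma y_t + p_t) + \alpha\beta\,p_{st+1}(j) - p_{t-1}
       = (1-\alpha\beta)\,\gamma y_t + (p_t - p_{t-1}) + \alpha\beta\,\bigl(p_{st+1}(j) - p_t\bigr)
       = (1-\alpha\beta)\,\gamma y_t + \pi_t + \alpha\beta\,x_{t+1}(j),
```

and averaging over the producers who reset their prices in period t gives (27).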
We can also rewrite (26) as (28)
πt = (1 − α)xt
and the transversality condition as lim_{T→∞} (αβ)^T x_T(j) = 0. Using (28) and the fact that xt is the average of xt(j), this condition is equivalent to
(29) lim_{t→∞} (αβ)^t πt = 0.
In addition to these conditions, we now argue that in this staggered price-setting model, a competitive equilibrium must satisfy two boundedness conditions. In general, boundedness conditions are controversial in the literature. Standard analyses of New Keynesian models impose strict boundedness conditions: in any reasonable equilibrium, both output and inflation must be bounded both above and below. Cochrane (2007) has forcefully criticized this practice, arguing that any boundedness condition must have a solid economic rationale. Here we provide rationales for two such conditions: output yt must be bounded above, so that (30)
yt ≤ ȳ for some ȳ,
and interest rates must be bounded below, so that
(31) it ≥ i for some i.
The rationale for output being bounded above is that the economy has a finite amount of labor to produce the output. The rationale for requiring that interest rates be bounded below comes from the restriction that the nominal interest rate must be nonnegative.5 These bounds allow outcomes in which (the log of) output, yt , falls without bound (so that the level of output converges to zero). The bounds also allow for outcomes in which inflation rates explode upward without limit. Here, then, a collection of allocations, prices, and policies at = {xt , δt , πt , yt } is a competitive equilibrium if it satisfies (i) consumer optimality, namely, the deterministic versions of (1) and (2); (ii) sticky-price producer optimality, (27)–(29); and (iii) the boundedness conditions, (30) and (31). Note that any allocations that satisfy (27)–(29) also satisfy the New Keynesian Phillips curve, (32)
πt = κ yt + βπt+1 ,
where now κ = (1 − α)(1 − αβ)γ /α. To see this result, use (28) to substitute for xt and xt+1 in (27) and collect terms. Here, as we did in the simple-sticky price model, we define continuation competitive equilibria. For example, consider the beginning of period t with a state variable yt−1 . A collection of allocations a(yt−1 ) = {xr (yt−1 ), δr (yt−1 ), πr (yt−1 ), yr (yt−1 )}r≥t is a continuation competitive equilibrium with yt−1 if it satisfies the three conditions of a competitive equilibrium above in all periods r ≥ t. A continuation competitive equilibrium that starts at the end of period t given (yt−1 , xt , δt ) is defined similarly. This definition requires optimality by consumers from t onward and optimality by sticky-price producers from t + 1 onward. III.B. Sophisticated Equilibrium We turn now to sophisticated equilibrium in the staggered price-setting model, its definition and how it can be implemented. 5. Note that even though the real value of consumer holdings of bonds must satisfy a transversality condition, this condition does not impose any restrictions on the paths of yt and πt . The reason is that in our nonlinear model, the government has access to lump-sum taxes, so that government debt can be arbitrarily chosen to satisfy any transversality condition.
Definition. The definition of a sophisticated equilibrium in the staggered price-setting model parallels that in the simple sticky-price model. The elements needed for that definition are basically the same. The public events that occur in a period are, in chronological order, qt = (xt ; δt ; yt , πt ). We let ht−1 denote the history of these events up until the beginning of period t. A strategy for the sticky-price producers is a sequence of rules σx = {xt (ht−1 )}. The public history faced by the central bank is hgt = (ht−1 , xt ) and its strategy, {δt (hgt )}. The public history faced by consumers in period t is hyt = (ht−1 , xt , δt ). We let σ y = {yt (hyt )} and σπ = {πt (hyt )} denote the sequences of output and inflation rules. Strategies and allocation rules induce continuation outcomes written as {ar (ht−1 ; σ )}r≥t or {a(hyt ; σ )}r≥t in the obvious recursive fashion. Formally, then, a sophisticated equilibrium given the policies here is a collection of strategies (σx , σg ) and allocation rules (σ y , σπ ) such that (i) given any history ht−1 , the continuation outcomes {ar (ht−1 ; σ )}r≥t induced by σ constitute a continuation competitive equilibrium and (ii) given any history hyt , so do the continuation outcomes {ar (hyt ; σ )}r≥t . In this model, as in the simple sticky-price model, the choices of the sticky-price producers must satisfy a key fixed point property, that (33)
xt (ht−1 ) = (1 − αβ)γ yt (hyt ) + πt (hyt ) + αβxt+1 (ht ),
where hyt = (ht−1 , xt (ht−1 ), δt (ht−1 , xt (ht−1 ))) and ht = (hyt , πt (hyt ), yt (hyt )). Here, as in the simple sticky-price model, xt (ht−1 ) shows up on both sides of the fixed point equation—on the right side, through its effect on the histories hyt and ht . Implementation with Sophisticated Policies. We now show that in the staggered price-setting model, any competitive equilibrium can be uniquely implemented with sophisticated policies. The basic idea behind our construction is, again, that the central bank starts by picking any competitive equilibrium allocations and sets its policy on the equilibrium path consistent with those allocations. The central bank then constructs its policy off the equilibrium path so that any deviations from these allocations would never be a best response for any individual price-setter. In so doing, the constructed sophisticated policies support the chosen allocations as the unique equilibrium allocations. As we did with the simple model, here we show that, under sufficient conditions, policies that specify infinite reversion
to a money regime can achieve unique implementation, a pure interest-rate rule of the King rule form cannot, and a King–money hybrid rule can.
With reversion to a money regime. We start with sophisticated policies that specify reversion to a money regime after deviations. In our construction of sophisticated policies, we assume that the best responses of sticky-price producers are controllable in that if they deviate by setting x̂t ≠ xt∗, then by infinitely reverting to the money regime, the central bank can set money growth rate policies so that the profit-maximizing value of xt(j) is such that xt(j) ≠ x̂t. The sophisticated policy that supports a desired outcome is to follow the chosen monetary policy as long as private agents have not deviated from the desired outcome. If sticky-price producers ever deviate to some choice x̂t, the central bank switches to a money regime set such that xt(j) ≠ x̂t. The following proposition follows immediately:
PROPOSITION 4 (Unique Implementation with Money Reversion). If the best responses of the sticky-price producers are controllable, then any competitive equilibrium outcome in which the central bank uses interest rates as its instrument can be implemented as a unique equilibrium by sophisticated policies which specify reversion to a money regime.
A sufficient condition for best responses to be controllable is that in the nonlinear economy, preferences are given by U(c, l) = log c + b(1 − l), where c is consumption and l is labor supply, so that in the linearized economy, Taylor's γ equals one. To demonstrate controllability, suppose that after a deviation, the central bank reverts to a constant money supply m = log M. With a constant money supply, it is convenient to use the original formulation of the economy with price levels rather than inflation rates. With that translation, the cash-in-advance constraint implies that yr + pr = m for all r, so that (24) implies that the producer's price is simply to set
(34) pst(j) = (1 − αβ) Σ_{r=t}^{∞} (αβ)^{r−t} m = m.
That is, if after a deviation the central bank chooses a constant level of the money supply m, then sticky-price producers optimally
choose their prices to be m. Clearly, (34) implies that the best responses of these producers are controllable. For example, consider a history in which price-setters in period t deviate from pst∗ to p̂st. Obviously, the central bank can choose the level of the money supply so that the optimal choice for an individual price-setter becomes pst(j) ≠ p̂st, so that xt(j) = m − pt−1 ≠ x̂t.
With pure interest-rate rules. Now, as with the simple model, we turn to pure interest-rate rules such as the King rule. For the staggered price-setting model, we ask, can such rules uniquely implement a bounded competitive equilibrium? We find that for a large class of parameter values, the answer is, again, no. We arrive at this answer by first showing that under the King rule, the economy has a continuum of period-zero competitive equilibria. We then argue that associated with each competitive equilibrium is a sophisticated equilibrium. Here, we write the King rule as (35)
it = it∗ + φ(1 − α)(xt − xt∗ ),
where it∗ and πt∗ are the interest rates and the inflation rates associated with the desired (bounded) competitive equilibrium. From (28), it follows that in all periods, inflation and the aggregate price-setting choice are mechanically linked by πt = (1 − α)xt . This mechanical link means that we can equally well think of policy as feeding back on either inflation or the price-setting choice, so that (35) is equivalent to (36)
it = it∗ + φ(πt − πt∗ ).
Now we show that the economy has a continuum of competitive equilibria by showing that there is a continuum of solutions to (1), (32), and (36) and that these solutions do not violate the transversality and boundedness conditions (29), (30), and (31). Expressing the variables as deviations from the desired equilibrium is convenient. To that end, let π˜ t = πt − πt∗ and y˜t = yt − yt∗ . Subtracting the equations governing {πt∗ , yt∗ } from those governing {πt , yt } gives a system governing {π˜ t , y˜t } that satisfies (1), (32), and (36). Substituting for ˜ıt in (1), using (36), we get that (37)
ỹt+1 + ψ π̃t+1 = ỹt + ψφ π̃t,
and from (32) we have that (38)
π̃t = κ ỹt + β π̃t+1.
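Before the closed-form treatment that follows, a small numerical sketch may help. It stacks (37) and (38) into the matrix form used below and inspects the eigenvalues for one illustrative parameterization; the numbers are assumptions chosen for the sketch, not calibrated values.

```python
# Sketch: stack (37)-(38) as z_{t+1} = A z_t with z_t = (y~_t, pi~_t) and
# inspect the eigenvalues. Parameter values are illustrative assumptions.
import numpy as np

beta, alpha, gamma, psi, phi = 0.99, 0.10, 0.5, 0.5, 1.3   # assumed
kappa = (1 - alpha) * (1 - alpha * beta) * gamma / alpha   # slope of (32)

a = 1 + kappa * psi / beta
b = psi * (phi - 1 / beta)
A = np.array([[a, b],
              [-kappa / beta, 1 / beta]])

lam1, lam2 = np.sort(np.linalg.eigvals(A).real)
print(f"kappa = {kappa:.3f}, eigenvalues = ({lam1:.3f}, {lam2:.3f})")
print(f"(alpha*beta)*lam2 = {alpha * beta * lam2:.3f}")
# For this parameterization both roots are real, lam2 > 1, and
# (alpha*beta)*lam2 < 1, so paths that explode at rate lam2 still satisfy the
# transversality condition (29) -- the situation characterized below.
```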
Equations (37) and (38) define a dynamical system. Letting zt = (ỹt, π̃t)′, with some manipulation we can stack these equations to give zt+1 = Azt, where
A = [[a, b], [−κ/β, 1/β]],
and where a = 1 + κψ/β and b = ψ(φ − 1/β). This system has a continuum of solutions of the form
(39) ỹt = λ1^t ω1 + λ2^t ω2 and π̃t = λ1^t ((λ1 − a)/b) ω1 + λ2^t ((λ2 − a)/b) ω2,
where λ1 < λ2, the eigenvalues of A, are given by
(40) λ1, λ2 = (1/2)((1 + κψ)/β + 1) ± (1/2) [((1 + κψ)/β − 1)^2 − 4(φ − 1)κψ/β]^(1/2),
and ω1 = [((λ2 − a)/b) ỹ0 − π̃0]/Δ and ω2 = [((a − λ1)/b) ỹ0 + π̃0]/Δ, where Δ is the determinant of A.6 This continuum of solutions is indexed by ỹ0 and π̃0. In the Appendix, we show that for a class of economies that satisfy the restriction
(41)
1 − κψ < β and α(1 + κψ) < 1,
equilibrium is indeterminate under the King rule. We can think of (41) as requiring that the period length is sufficiently short, in the sense that β is close enough to 1, and that the price stickiness is not too large, in the sense that α is sufficiently small. Formally, in the Appendix, we prove the following proposition: PROPOSITION 5 (Indeterminacy of Equilibrium under the King Rule). Suppose that the central bank sets interest rates it according to the King rule (35) with φ > 1 and that (41) is 6. Here and throughout, we restrict attention to values of φ ∈ [0, φmax ], where φmax is the largest value of φ that yields real eigenvalues. That is, at φmax , the discriminant in (40) is zero.
satisfied. Then the economy has a continuum of competitive equilibria indexed by y0 ≤ y0∗ , (42)
yt = yt∗ + λ2^t (y0 − y0∗) and πt = πt∗ + λ2^t c (y0 − y0∗),
where λ2 > 1 and c = (λ2 − a)/b < 0 are constants.
It is immediate to construct a sophisticated equilibrium for each of the continuum of competitive equilibria in (42). Notice that under the King rule, there is one equilibrium with yt = yt∗ and πt = πt∗ for all t, and in the rest, yt goes to minus infinity and πt to plus infinity. All of these equilibria satisfy the boundedness conditions (30) and (31) and, under (41), the transversality condition (29).
It turns out that if the inequality in the second part of (41) is reversed, then the set of solutions to the New Keynesian dynamical system, (1), (28), (32), and (35), has the form (42), but the transversality condition rules out all solutions except the one with yt = yt∗ and πt = πt∗ for all t. We find this way of ruling out solutions unappealing because it hinges critically on the idea that sticky-price producers may be unable to change their prices for extremely long periods, even in the face of exploding inflation.
With reversion to a hybrid rule. We now show that in the staggered price-setting model, as in the simple model, a King–money hybrid rule can uniquely implement any bounded competitive equilibrium. To do so in this model, we will assume boundedness under money, namely, that for any state variable yt−1 there exists a money regime from period t onward such that a continuation competitive equilibrium exists, and for all such equilibria, inflation in period t, πt, is uniformly bounded. Here uniformly bounded means that there exist constants π and π̄ such that for all yt−1, πt ∈ [π, π̄]. It is immediate that a sufficient condition for boundedness under money is that preferences in the nonlinear economy are given by U(c, l) = log c + b(1 − l).
In an economy that satisfies boundedness under money, the King–money hybrid rule that implements a competitive equilibrium {xt∗, πt∗, yt∗} with an associated interest rate it∗ is defined as follows. Set x̄ to be greater than both maxt xt∗ and π̄, and set x to be lower than both mint xt∗ and π. This rule specifies that if xt ∈ [x, x̄], then the central bank follows a King rule of the form
(35) with φ > 1. If xt falls outside the interval [x, x̄], then the central bank reverts to a money regime forever.
PROPOSITION 6 (Unique Implementation with a Hybrid Rule). Suppose the staggered price-setting economy satisfies boundedness under money. Then the King–money hybrid rule implements any desired bounded competitive equilibrium. Moreover, under this rule, after any deviation in period t, the equilibrium outcomes from period t + 1 are the desired outcomes.
The formal proof of this proposition is in the Appendix. The key idea of this proof is the same as that for the proof of Proposition 3. The idea is that under the King rule, any x̂t that does not equal xt∗ leads subsequent price-setting choices to eventually leave the interval [x, x̄]. But given boundedness under money, price-setting choices outside of the interval [x, x̄] cannot be part of an equilibrium. Note that with the staggered price-setting model, as with the simple model, under a hybrid rule, deviations lead to only very transitory departures from desired outcomes.
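As a schematic illustration of how such a rule can be written down as a policy function, consider the following sketch; the reversion bounds, the money-growth setting after large deviations, and all numerical values are placeholders chosen for the example rather than objects defined in the paper.

```python
# Schematic King-money hybrid rule: interest-rate (King) response to small
# deviations, reversion to a money regime for large ones. All parameter
# values and bounds below are placeholders for illustration.

def hybrid_policy(x_t, x_star, i_star, phi=1.5, alpha=0.3,
                  x_low=-0.10, x_high=0.10, mu_reversion=0.0):
    """Return (regime, instrument setting) for the current period."""
    if x_low <= x_t <= x_high:
        # King region: the interest rate moves more than one for one (phi > 1)
        # with the gap between the observed price-setting choice and its target.
        return "interest rate", i_star + phi * (1 - alpha) * (x_t - x_star)
    # Outside the band: switch to money, with growth set so that expected
    # inflation lands back inside the band (here simply a fixed placeholder).
    return "money", mu_reversion

print(hybrid_policy(0.02, x_star=0.0, i_star=0.01))   # small deviation -> King rule
print(hybrid_policy(0.25, x_star=0.0, i_star=0.01))   # large deviation -> money regime
```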
IV. TREMBLES AND IMPERFECT INFORMATION
We have shown that in both of the models we have analyzed—a simple one-period price-setting model and a staggered price-setting model—any equilibrium outcome can be implemented as a unique equilibrium with sophisticated policies. In our equilibria, deviations in private actions lead to changes in the regime. This observation leads to the question of how to construct sophisticated policies if trembles in private actions occur or if deviations in private actions can be detected only imperfectly, say, with measurement error. We show that we can achieve unique implementation with trembles. We show that, with imperfect detection, the King–money hybrid rule leads to a unique equilibrium. This equilibrium is arbitrarily close to the desired equilibrium when the detection error is small. In this sense, our results are robust to trembles and imperfect information.
IV.A. Trembles
Unique implementation is not a problem if trembles in private actions occur.
To see that, consider allowing for trembles in private decisions by supposing that the actual price chosen by a price-setter, xt(j), differs from the intended price, x̃t(j), by an additive error εt(j), so that xt(j) = x̃t(j) + εt(j). Trembles are clearly a trivial consideration. If εt(j) is independently distributed across agents, then it simply washes out in the aggregate; it is irrelevant. Even if εt(j) is correlated across agents, say, because it has both aggregate and idiosyncratic components, our argument goes through unchanged if the central bank can observe the aggregate component, for example, with a random sample of prices.

IV.B. Imperfect Information

Not as trivial is a situation in which the central bank has imperfect information about prices. But even in that situation, the King–money hybrid rule leads to a unique equilibrium; and when the detection error is small, this equilibrium is arbitrarily close to the desired equilibrium.

To see that, consider a formulation in which the central bank observes the actions of price-setters with measurement error. Of course, if the central bank could see some other variable perfectly, such as output or interest rates on private debt, then it could infer what the private agents did. We think of this formulation as giving the central bank minimal amounts of information relative to what actual central banks have. We show here that with this sort of imperfect information, we can implement outcomes that are close to the desired outcomes when the measurement error is small. Here the central bank observes the price-setters' choices with error, so that

(43)  x̂t = xt + εt,

where the error εt is i.i.d. over time with mean zero and bounded support [ε, ε̄]. Consider using the King–money hybrid rule to support some desired competitive equilibrium. Choose the interest-rate interval [x, x̄] such that xt∗ + εt is contained in this interval for all t. Here, the King rule is of the form

(44)  it(hgt) = it∗ + φ(1 − α)(x̂t − xt∗),

with φ > 1.
In this economy with measurement error, the best response of any individual price-setter is identical to that in the economy without measurement error. This result follows because the best response depends on only the expected values of future variables. Because the measurement error εt has mean zero, these expected values are unchanged. Therefore, the unique equilibrium in this economy with measurement error has xt = xt∗; thus, πt = πt∗. The realized values of the interest rate it and output yt, however, fluctuate around their desired values it∗ and yt∗. Using (43) and (44), we know that the realized value of the interest rate is given by

(45)  it = it∗ + φ(1 − α)εt,

whereas using the Euler equation, we know that the realized value of output is given by

(46)  yt = yt∗ − ψφ(1 − α)εt.
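The following minimal simulation is our own illustration of equations (43)–(46), with arbitrary placeholder parameter values: in the unique equilibrium price-setters choose xt = xt∗, while realized interest rates and output fluctuate around their desired values in proportion to the measurement error.

```python
import numpy as np

# Illustrative parameter values (placeholders, not calibrated).
phi, alpha, psi = 1.5, 0.75, 1.0
T = 1000
rng = np.random.default_rng(0)

# Desired outcomes along the equilibrium path (normalized to zero for simplicity).
i_star = np.zeros(T)
x_star = np.zeros(T)
y_star = np.zeros(T)

# Measurement error: mean zero, bounded support, i.i.d. over time (equation (43)).
eps = rng.uniform(-0.01, 0.01, size=T)

# In the unique equilibrium price-setters choose x_t = x_t*, so the observed
# choice is x_hat_t = x_t* + eps_t and the King rule (44) delivers (45) and (46).
x_hat = x_star + eps
i_real = i_star + phi * (1 - alpha) * (x_hat - x_star)   # equation (45)
y_real = y_star - psi * phi * (1 - alpha) * eps          # equation (46)

# Realized interest rates and output fluctuate around their desired values,
# with fluctuations proportional to the size of the measurement error.
print(i_real.mean(), i_real.std())
print(y_real.mean(), y_real.std())
```

As the support of the measurement error shrinks, the simulated outcomes collapse to the desired ones, which is the content of Proposition 7 below.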
Notice that when the central bank observes private actions imperfectly, the King–money hybrid rule does not exactly implement any desired competitive equilibrium. Rather, this rule implements an equilibrium in which output fluctuates around its desired level. These fluctuations are proportional to the size of the measurement error. Clearly, as the size of the measurement error εt goes to zero, the outcomes converge to the desired outcomes. We have thus established a proposition:

PROPOSITION 7 (Approximate Implementation with Measurement Error). Suppose the sophisticated policy is described by the King–money hybrid rule described above. Then the economy has a unique equilibrium with xt = xt∗ and yt given by (46). As the variance of the measurement error approaches zero, the economy's outcomes converge to the desired outcomes.

Note that although the central bank never reverts to a money regime when it is on the equilibrium path, the possibility that it will do so off the equilibrium path plays a critical role in this implementation.

V. IMPLICATIONS FOR THE TAYLOR PRINCIPLE

The sophisticated policy approach we have just described has implications for the use of the Taylor principle as a device to
ensure determinacy and to guide inferences from empirical investigations about whether central bank policy has led the economy into a determinate or indeterminate region. (Recall that the Taylor principle is the notion that interest rates should rise more than one for one with inflation rates, both compared to some exogenous, possibly stochastic, levels.)

V.A. Setup

In order to show what the sophisticated policy approach implies for our discussion of the Taylor principle, we consider a popular specification of the Taylor rule of the form

(47)  it = ı̄t + φ Et−1 πt + b Et−1 yt,
where ı̄t is an exogenously given, possibly stochastic, sequence. (See Taylor [1993] for a similar specification.) In our simple model, from (12), policies of the Taylor rule form (47) can be written as

(48)  it = ı̄t + φ(xt − x̄t).
When the parameter φ > 1, such policies are said to satisfy the Taylor principle: The central bank should raise its interest rate more than one for one with increases in inflation. When φ < 1, such policies are said to violate that principle. Notice that when ı̄t and x̄t coincide with the desired competitive equilibrium outcomes it∗ and xt∗ for all periods, the Taylor rule (48) reduces to the simple model's King rule (19).

V.B. Implications for Determinacy

Many economists have argued that central banks must adhere to the Taylor principle in order to ensure unique implementation. Our results clearly imply that if the central bank is following a pure interest-rate rule, then adherence to the Taylor principle is neither necessary nor sufficient for unique implementation. If, however, the central bank is following a King–money hybrid rule, then adherence to this principle after deviations between observed outcomes and desired outcomes can help ensure unique implementation.

Note that policies of the Taylor rule form (48) are linear feedback rules of the form (23) and lead to indeterminacy, regardless of the value of φ. In this sense, if the central bank is following a pure interest-rate rule, then adherence to the Taylor principle is not sufficient for unique implementation. A similar argument implies
that, under (41), it is not sufficient in the staggered price-setting model either. Clearly, under pure interest-rate rules, adherence to the Taylor principle is also not necessary for unique implementation. Propositions 1 and 4 imply that, in both models, the central bank can uniquely implement any competitive equilibrium, including those that violate the Taylor principle along the equilibrium path.

V.C. Implications for Estimation

Many economists have estimated monetary policy rules and then inferred that these rules have led the economy to be in the determinate region if and only if they satisfy the Taylor principle. Indeed, one branch of this literature argues that the undesirable inflation experiences of the 1970s in the United States occurred in part because monetary policy led the economy to be in the indeterminate region. See, for example, the work of Clarida, Galí, and Gertler (2000). We provide a set of stark assumptions under which such inferences can be made more confidently. Nonetheless, finding appropriate assumptions in more interesting applied examples remains a challenge.

Perfect Information. In economies in which the central bank and private agents have the same information, observations of variables along the equilibrium path shed no light on the properties of policies off that path, and it is these properties that govern the determinacy of equilibrium. Of course, any estimation procedure can rely only on data along the equilibrium path; it cannot uncover the properties of policies off that path. In this sense, estimation procedures in economies with perfect information cannot determine whether monetary policy is leading the economy to be in the determinate or the indeterminate region. (See Cochrane [2007] for a related point.)

To see this general point in the context of our models, note that any estimation procedure can only uncover relationships between the equilibrium interest rate it∗ and the equilibrium inflation rate πt∗. These relationships have nothing whatsoever to do with the off-equilibrium-path policies that govern determinacy. For example, in the context of the King–money hybrid rule with the King rule of the form (35), neither it∗ nor πt∗ depends on the parameter φ, but the size of this parameter plays a key role in ensuring determinacy. In this sense, without trivial identifying
assumptions, no estimation procedure can uncover the key parameter for determinacy. For example, suppose that along the equilibrium path, interest rates satisfy

(49)  it∗ = ı̄ + φ∗(xt∗ − x̄),
where it∗ and xt∗ are the desired equilibrium outcomes and ı̄ and x̄ are some constants that differ from those desired outcomes. This equilibrium can be supported in many ways, including reversion after deviations to a money regime or some sort of hybrid rule. Notice that in (49) the parameter φ∗ simply describes the relation between the equilibrium outcomes it∗ and xt∗ and has no connection to the behavior of policy after deviations. Obviously, with a policy that specifies reversion to a money regime, the size of φ∗ (whether it is smaller or larger than one) has no bearing on the determinacy of equilibrium. That is also true with a policy that reverts to a hybrid rule after deviations, though perhaps not as obviously. Suppose that for small deviations, the hybrid rule specifies the King rule (20) with φ > 1. The parameter φ of this King rule has no connection to the parameter φ∗ in (49). The former governs the behavior of policies after deviations, whereas the latter simply describes a relationship that holds along the equilibrium path. Furthermore, although φ > 1 ensures determinacy, the size of φ∗ (whether it is smaller or larger than one) has no bearing on determinacy.

These arguments clearly generalize to situations in which the constants ı̄ and x̄ are replaced by exogenous, possibly stochastic, sequences ı̄t and x̄t that differ from the desired outcomes, so that along the equilibrium path, interest rates satisfy

(50)  it∗ = ı̄t + φ∗(xt∗ − x̄t).
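To preview the estimation point made next, the sketch below is our own illustration with arbitrary placeholder values: it generates equilibrium-path data satisfying (50) for a given φ∗ and shows that a regression on those data recovers φ∗. The data are identical for any off-equilibrium-path φ, so no such regression can speak to determinacy.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500
phi_star = 0.8        # on-path relationship in (50); smaller than one here
phi_offpath = 2.0     # off-path King-rule parameter; never appears in the data

# Exogenous sequences and desired outcomes (placeholders).
x_bar = rng.normal(0.0, 0.01, T)
i_bar = rng.normal(0.02, 0.005, T)
x_star = x_bar + rng.normal(0.0, 0.01, T)
i_star = i_bar + phi_star * (x_star - x_bar)      # equation (50) holds on path

# A Taylor-rule-style regression of i* - i_bar on x* - x_bar recovers phi_star...
slope = np.polyfit(x_star - x_bar, i_star - i_bar, 1)[0]
print(round(slope, 3))   # ~ 0.8: the on-path parameter phi*

# ...but the simulated data would be exactly the same for any value of phi_offpath,
# so equilibrium-path data cannot reveal whether the off-path rule satisfies phi > 1.
```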
We interpret most of the current estimation procedures of the Taylor rule variety as estimating φ ∗ , the parameter governing desired outcomes in (50) or its analog in more general setups. To use these estimates to draw inferences about determinacy, researchers implicitly assume that the parameter φ (the parameter describing off-equilibrium path behavior) is the same as φ ∗ (the parameter describing on-equilibrium path behavior). Researchers also restrict attention to bounded solutions. As we have discussed, with perfect information, theory imposes no connection between φ and φ ∗ , so the assumption that φ = φ ∗ is not grounded in theory.
Also, the rationale for restricting attention to bounded solutions is not clear. With perfect information, then, current estimation procedures simply cannot uncover whether the economy is in the determinate or the indeterminate region.

Imperfect Information. With imperfect information, however, there is some hope that variants of current procedures may be able to uncover some of the key parameters for determinacy, provided researchers are willing to make some quite strong assumptions. Here we provide a stark example in which a variant of current procedures can uncover one of the key parameters governing determinacy.

Consider our staggered price-setting economy, in which the central bank observes the price-setters' choices with error. Recall that in this economy, the equilibrium outcomes for interest rates and output, (45) and (46), depend on the parameter φ in the King–money hybrid rule and that this parameter plays a key role in ensuring determinacy. Note the contrast with the perfect-information economy, in which the equilibrium outcomes do not depend on the parameter φ. The fact that equilibrium outcomes depend on the key determinacy parameter here offers some hope that researchers will be able to estimate it. For our stark example, we assume that researchers observe the same data as the central bank and that along the equilibrium path, the central bank follows a King rule of the form

(51)  it = it∗ + φ(1 − α)(x̂t − xt∗).
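Assuming, as in the stark example spelled out in the next paragraph, that researchers observe x̂t and know xt∗, it∗, and α, the King-rule parameter φ can be backed out of (51). The sketch below is ours, with placeholder values, and simply makes that calculation explicit.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 500
phi_true, alpha = 1.8, 0.75

# Desired outcomes (placeholders) and bounded, mean-zero measurement error.
x_star = np.zeros(T)
i_star = np.full(T, 0.02)
eps = rng.uniform(-0.01, 0.01, T)

# Observed data generated by the King rule (51): x_hat differs from x* because
# the central bank (and the researcher) observe prices with error.
x_hat = x_star + eps
i_obs = i_star + phi_true * (1 - alpha) * (x_hat - x_star)

# With x*, i*, and alpha known, (51) can be solved for phi whenever x_hat != x*;
# equivalently, a regression through the origin recovers it exactly.
z = (1 - alpha) * (x_hat - x_star)
phi_hat = np.sum(z * (i_obs - i_star)) / np.sum(z * z)
print(round(phi_hat, 3))   # ~ 1.8
```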
If researchers know the desired outcomes xt∗ and it∗, as well as the parameter α, then they can simply solve (51) for φ as long as x̂t does not identically equal xt∗. To go from this solution for φ to an inference about determinacy requires more assumptions. One set of assumptions is that the data are generated by our staggered price-setting model, in which the central bank observes x̂t = xt + εt, where εt is i.i.d. over time and has mean zero and bounded support [ε, ε̄], and the central bank follows the King–money hybrid rule, with the King rule given by (51). The key feature of the formulation that allows this inference is that x̂t does not identically equal xt∗ as it does in the economies with perfect information.

Note that in our stark example, this procedure can uncover the King rule parameter φ, but not the hybrid rule parameters π and π̄. More generally, no procedure can uncover what behavior would be in situations that are never reached in equilibrium, even
if the specification of such behavior plays a critical role in unique implementation. This observation implies that even in our stark example, we cannot distinguish between a pure interest-rate rule and the King–money hybrid rule.

Although we have offered some hope for uncovering some of the key parameters for determinacy, applying our insight to a broader class of environments is apt to be hard. In practice, after all, the desired outcomes are not known, the other parameters of the economy are not known, the measurement error is likely to be serially correlated, and the interest-rate rule is subject to stochastic shocks. Quite beyond these practical issues is a theoretical one: drawing inferences about determinacy requires confronting a subtle identification issue. This issue stems from the fact that characterizing the equilibrium is relatively easy if the economy is in the determinate region, but extremely hard if it is not. Specifically, if the economy is in the determinate region, then the probability distribution over observed variables is a relatively straightforward function of the primitive parameters. If the economy is in the indeterminate region, however, then this probability distribution (which must take account of the possibility of sunspots) is more complicated. One way to proceed is to tentatively assume that the economy is in the determinate region and estimate the key parameters governing determinacy. Suppose that under this tentative assumption, we find that the parameters fall in the determinate region. Can we then conclude that the economy is in the determinate region? Not yet. We must still show that the data could not have been generated by one of the indeterminate equilibria—not an easy task.

VI. CONCLUSIONS

We have here described our sophisticated policy approach and illustrated its use as an operational guide to policy that achieves unique implementation of any competitive equilibrium outcome. We have demonstrated that using a pure interest-rate rule leads to indeterminacy. We have also constructed policies that avoid this by switching regimes: they use interest rates until private agents deviate and then revert to a money regime or a hybrid rule.

Our work has strong implications for the use of the Taylor principle as a guide to policy. We have shown that if a central bank
follows a pure interest-rate rule, then adherence to the Taylor principle is neither necessary nor sufficient for unique implementation. Adherence to that principle may ensure determinacy, however, if monetary policy includes a reversion to the King–money hybrid rule after deviations.

We have also argued that existing empirical procedures used to draw inferences about the relationship between adherence to the Taylor principle and determinacy should be treated with caution. We have provided a set of stark assumptions that can be more confidently used in applied work to draw inferences regarding the relationship between central bank policy and determinacy. Using this method, however, requires solving multiple difficult identification problems.

Finally, although we have here focused exclusively on monetary policy, the use of our operational guide is not necessarily limited to that application. The logic behind the construction of the guide should be applicable as well to other governmental policies—for example, to fiscal policy and to policy responses to financial crises—or to any application that aims to uniquely implement a desired outcome.
APPENDIX: THE PROOFS OF PROPOSITIONS 3, 5, AND 6

A. Proof of Proposition 3: A Unique Implementation with a Hybrid Rule in the Simple Model

Given that the central bank follows the King–money hybrid rule, say, σg∗, we will show here that there are unique strategies σx, σy, and σπ for private agents that, together with σg∗, constitute a sophisticated equilibrium. We then show that this sophisticated equilibrium implements the desired outcomes.

The strategies σx, σy, and σπ are as follows. The strategy σx specifies that xt(ht−1) = xt∗(st−1) for all histories. The strategies σy and σπ specify yt(hyt) and πt(hyt) as the unique solutions to the conditions defining consumer optimality, (1) and (2); flexible price–producer optimality, (10); and the King–money hybrid rule with yt+1(st+1) = yt+1∗(st+1) and xt+1(st+1) = xt+1∗(st+1). Note that the value of xt in the history hyt = (ht−1, xt, δt, st) determines the regime in the current period and, hence, determines whether the Euler equation (1) or the cash-in-advance constraint (2) is used to solve for yt(hyt) and πt(hyt).
We now show that (σg∗, σx, σy, σπ) is a sophisticated equilibrium. Given that {xt∗(st−1), πt∗(st), yt∗(st)} is a period-zero competitive equilibrium and that xt∗(st−1) ∈ [x, x̄], so that the central bank is following an interest-rate regime, we know that any tail of these outcomes {xt∗(st−1), πt∗(st), yt∗(st)}t≥r is a continuation competitive equilibrium starting in period r regardless of the history hr−1. On the equilibrium path, this claim follows immediately because the continuation of any competitive equilibrium is also a competitive equilibrium. Off the equilibrium path, for histories ht−1, the tail is a period-zero competitive equilibrium (with periods suitably relabeled) and is, therefore, a continuation competitive equilibrium. A similar argument shows that the tail of the outcomes starting from the end of period r, namely, πr(hyr) and yr(hyr), together with the outcomes {xt∗(st−1), πt∗(st), yt∗(st)}t≥r+1, constitutes a continuation competitive equilibrium. Note that our construction implies that after any deviation in period t, the equilibrium outcomes from period t + 1 are the desired outcomes.

We now establish uniqueness of the sophisticated equilibrium of the form (σg∗, σx, σy, σπ). We begin with a preliminary result that shows that for any st−1, in any equilibrium, xt(st−1) ∈ [x, x̄]. This argument is by contradiction. Suppose that at st−1, xt(st−1) ∉ [x, x̄]. Under the hybrid rule, the central bank reverts to a money regime with expected inflation equal to π̄ ∈ [x, x̄]. From Lemma 1, xt(st−1) = π̄ ∈ [x, x̄], which contradicts xt(st−1) ∉ [x, x̄]. This result implies that along the equilibrium path, the central bank never reverts to money, so that interest rates are given by the King rule (19).

With this preliminary result, we establish uniqueness by another contradiction argument. Suppose that the economy has a sophisticated equilibrium in which in some history hr−1, xr(hr−1) = x̂r, which differs from xr∗(sr−1). Without loss of generality, suppose that x̂r − xr∗(sr−1) = ε > 0. Let {x̂t(st−1), π̂t(st), ŷt(st)}t≥r denote the associated continuation competitive equilibrium outcomes. Our preliminary result implies that the central bank follows the King rule in all periods. Let {ı̂t(st−1)}t≥r denote the associated interest rates. From (13), using the law of iterated expectations, we have that

(52)  E[it∗(st−1) | sr−1] = E[xt+1∗(st) | sr−1]  and  E[ı̂t(st−1) | sr−1] = E[x̂t+1(st) | sr−1].
Substituting (52) into the King rule (19) gives that

E[x̂t+1(st) − xt+1∗(st) | sr−1] = φ^(t−r) ε.

Because φ > 1 and xt+1∗(st) is bounded, for every ε there exists some T such that

E[x̂T+1(sT) | sT−1] > x̄.

But this contradicts our preliminary result that xt(st−1) ≤ x̄ for all t and st−1. QED

B. Proof of Proposition 5: Indeterminacy of Equilibrium under the King Rule in the Staggered Price-Setting Model

It is straightforward to verify that output and inflation satisfying (42) satisfy all equilibrium conditions except the model's transversality condition (29) and its two boundedness conditions (30) and (31). Here we verify these conditions.

Consider first the transversality condition. Under (40) it follows that the larger eigenvalue λ2(φ) is a decreasing function of φ and that λ2(1) = (1 + κψ)/β. From (41) it then follows that βαλ2(φ) < 1 for all φ ≥ 1. Hence, limt→∞ (αβ)^t π̃t = 0. Because πt∗ is bounded, it follows that πt satisfies the transversality condition (29).

Consider next the output and interest-rate boundedness conditions. We first show that [λ2(φ) − a]/b < 0 for all φ ≥ 1. To do so, we show that λ2(φ) − a is positive for φ ∈ [1, 1/β), zero at φ = 1/β, and negative for φ ∈ (1/β, φmax]. From (40) we know that

(53)  λ2(1/β) = (1/2)[(1 + κψ)/β + 1] + (1/2)√{[κψ/β + (1/β − 1)]² − 4(κψ/β)(1/β − 1)}.

Note that the term in the radical is a perfect square. Then using that and the first part of (41) turns (53) into

λ2(1/β) = 1 + κψ/β = a.

Because λ2(φ) is decreasing, it follows that λ2(φ) − a has the desired sign pattern. Because b = ψ(φ − 1/β), the numerator and the denominator of [λ2(φ) − a]/b have opposite signs for all φ ≥ 1, so that [λ2(φ) − a]/b is negative. Thus, the boundedness conditions
are satisfied for all ω2 ≤ 0. In the resulting equilibria, inflation goes to plus infinity and output goes to minus infinity (so that the level of output goes to zero). QED

C. Proof of Proposition 6: Unique Implementation with a Hybrid Rule in the Staggered Price-Setting Model

Let {xt∗, πt∗, yt∗} be the desired bounded competitive equilibrium. The strategies that implement this competitive equilibrium are as follows. The strategy σg∗ is the King–money hybrid rule. The strategy σx specifies that xt(ht−1) = xt∗ for all histories. The strategies σy and σπ specify yt(hyt) and πt(hyt) that are the unique solutions to the deterministic versions of the conditions defining consumer optimality, (1), (2), (28), (32), and the King–money hybrid rule with yt+1 = yt+1∗ and xt+1 = xt+1∗. The proof that (σg∗, σx, σy, σπ) is a sophisticated equilibrium closely parallels that of Proposition 3.

We now establish uniqueness of the sophisticated equilibrium of the form (σg∗, σx, σy, σπ). We begin by showing that given σg∗, xt(ht−1) = xt∗ for all histories. (Clearly, given σg∗ and σx, σy and σπ are unique.) For reasons similar to those underlying the preliminary result in Proposition 3, for any history ht−1, xt(ht−1) must be in the interval [x, x̄], so that for any history, interest rates are given by the King rule (35). Under an interest-rate rule, the state yt−1 is irrelevant; therefore, a continuation competitive equilibrium starting at the beginning of any period t solves the same equations as a competitive equilibrium (starting from period 0). For notational simplicity, we focus on a competitive equilibrium starting from period 0.

Suppose by way of contradiction that {x̂t, π̂t, ŷt} is an equilibrium that does not coincide with {xt∗, πt∗, yt∗}. Let x̃t = x̂t − xt∗, and use similar notation for π̃t and ỹt. Then, subtracting the equations governing the systems denoted with an asterisk from those denoted with a caret, we have a system governing {x̃t, π̃t, ỹt} that satisfies (the analogs of) (1), (32), and (35). The resulting system, given by (37) and (38), coincides with that in the proof of Proposition 5. Hence, the solution is given by (39) with eigenvalues given by (40). It is easy to check that φ > 1 implies that both eigenvalues λ1 and λ2 are greater than one. Furthermore, at least one of (λ1 − a)/b and (λ2 − a)/b is nonzero. Because both of the eigenvalues are greater than one, (39) implies that if the two equilibria ever differ, then π̃t becomes unbounded, so that x̃t does as well. Because
xt∗ is bounded, x̂t must eventually leave the interval [x, x̄], which cannot happen in equilibrium. So we have a contradiction, and the first part of Proposition 6 is established. Note that our construction implies that after any deviation in period t, the equilibrium outcomes from period t + 1 are the desired outcomes. Thus, we have also established the second part of the proposition. QED

UNIVERSITY OF CALIFORNIA, LOS ANGELES, FEDERAL RESERVE BANK OF MINNEAPOLIS, AND NATIONAL BUREAU OF ECONOMIC RESEARCH
UNIVERSITY OF MINNESOTA AND FEDERAL RESERVE BANK OF MINNEAPOLIS
FEDERAL RESERVE BANK OF MINNEAPOLIS, UNIVERSITY OF MINNESOTA, AND NATIONAL BUREAU OF ECONOMIC RESEARCH
REFERENCES

Adão, Bernardino, Isabel Correia, and Pedro Teles, "Unique Monetary Equilibria with Interest Rate Rules," manuscript, Bank of Portugal, 2007.
Atkeson, Andrew, V. V. Chari, and Patrick J. Kehoe, "Sophisticated Monetary Policies," Federal Reserve Bank of Minneapolis, Research Department Staff Report 419, 2009.
Barro, Robert J., "On the Determination of the Public Debt," Journal of Political Economy, 87 (1979), 940–971.
Bassetto, Marco, "A Game-Theoretic View of the Fiscal Theory of the Price Level," Econometrica, 70 (2002), 2167–2195.
——, "Equilibrium and Government Commitment," Journal of Economic Theory, 124 (2005), 79–105.
Benhabib, Jess, Stephanie Schmitt-Grohé, and Martín Uribe, "Monetary Policy and Multiple Equilibria," American Economic Review, 91 (2001), 167–186.
Buiter, Willem H., "The Fiscal Theory of the Price Level: A Critique," Economic Journal, 112 (2002), 459–480.
Cagan, Phillip, "The Monetary Dynamics of Hyperinflation," in Studies in the Quantity Theory of Money, Milton Friedman, ed. (Chicago: University of Chicago Press, 1956).
Calvo, Guillermo A., "Staggered Prices in a Utility-Maximizing Framework," Journal of Monetary Economics, 12 (1983), 383–398.
Chari, Varadarajan V., Lawrence J. Christiano, and Patrick J. Kehoe, "Optimality of the Friedman Rule in Economies with Distorting Taxes," Journal of Monetary Economics, 37 (1996), 203–223.
Chari, Varadarajan V., and Patrick J. Kehoe, "Sustainable Plans," Journal of Political Economy, 98 (1990), 783–802.
Christiano, Lawrence J., and Massimo Rostagno, "Money Growth Monitoring and the Taylor Rule," NBER Working Paper No. 8539, 2001.
Clarida, Richard, Jordi Galí, and Mark Gertler, "Monetary Policy Rules and Macroeconomic Stability: Evidence and Some Theory," Quarterly Journal of Economics, 115 (2000), 147–180.
Cochrane, John H., "Inflation Determination with Taylor Rules: A Critical Review," NBER Working Paper No. 13409, 2007.
Correia, Isabel, Juan Pablo Nicolini, and Pedro Teles, "Optimal Fiscal and Monetary Policy: Equivalence Results," Journal of Political Economy, 116 (2008), 141–170.
Jackson, Matthew O., "A Crash Course in Implementation Theory," Social Choice and Welfare, 18 (2001), 655–708.
King, Robert G., "The New IS-LM Model: Language, Logic, and Limits," Federal Reserve Bank of Richmond Economic Quarterly, 86 (2000), 45–103.
Kocherlakota, Narayana, and Christopher Phelan, "Explaining the Fiscal Theory of the Price Level," Federal Reserve Bank of Minneapolis Quarterly Review, 23 (1999), 14–23.
Ljungqvist, Lars, and Thomas J. Sargent, Recursive Macroeconomic Theory, 2nd ed. (Cambridge, MA: MIT Press, 2004).
Lucas, Robert E., Jr., and Nancy L. Stokey, "Optimal Fiscal and Monetary Policy in an Economy without Capital," Journal of Monetary Economics, 12 (1983), 55–93.
McCallum, Bennett T., "Price Level Determinacy with an Interest Rate Policy Rule and Rational Expectations," Journal of Monetary Economics, 8 (1981), 319–329.
Obstfeld, Maurice, and Kenneth Rogoff, "Speculative Hyperinflations in Maximizing Models: Can We Rule Them Out?" Journal of Political Economy, 91 (1983), 675–687.
Ramsey, Frank P., "A Contribution to the Theory of Taxation," Economic Journal, 37 (1927), 47–61.
Sargent, Thomas J., and Neil Wallace, "'Rational' Expectations, the Optimal Monetary Instrument, and the Optimal Money Supply Rule," Journal of Political Economy, 83 (1975), 241–254.
Schmitt-Grohé, Stephanie, and Martín Uribe, "Optimal Fiscal and Monetary Policy under Sticky Prices," Journal of Economic Theory, 114 (2004), 198–230.
Siu, Henry E., "Optimal Fiscal and Monetary Policy with Sticky Prices," Journal of Monetary Economics, 51 (2004), 575–607.
Svensson, Lars E. O., and Michael Woodford, "Implementing Optimal Policy through Inflation-Forecast Targeting," in The Inflation-Targeting Debate, Ben S. Bernanke and Michael Woodford, eds. (Chicago: University of Chicago Press, 2005).
Taylor, John B., "Discretion Versus Policy Rules in Practice," Carnegie–Rochester Conference Series on Public Policy, 39 (1993), 195–214.
Wallace, Neil, "A Hybrid Fiat–Commodity Monetary System," Journal of Economic Theory, 25 (1981), 421–430.
Woodford, Michael, "Monetary Policy and Price Level Determinacy in a Cash-in-Advance Economy," Economic Theory, 4 (1994), 345–380.
——, Interest and Prices: Foundations of a Theory of Monetary Policy (Princeton, NJ: Princeton University Press, 2003).
EARNINGS INEQUALITY AND MOBILITY IN THE UNITED STATES: EVIDENCE FROM SOCIAL SECURITY DATA SINCE 1937∗

WOJCIECH KOPCZUK
EMMANUEL SAEZ
JAE SONG

This paper uses Social Security Administration longitudinal earnings micro data since 1937 to analyze the evolution of inequality and mobility in the United States. Annual earnings inequality is U-shaped, decreasing sharply up to 1953 and increasing steadily afterward. Short-term earnings mobility measures are stable over the full period except for a temporary surge during World War II. Virtually all of the increase in the variance in annual (log) earnings since 1970 is due to an increase in the variance of permanent earnings (as opposed to transitory earnings). Mobility at the top of the earnings distribution is stable and has not mitigated the dramatic increase in annual earnings concentration since the 1970s. Long-term mobility among all workers has increased since the 1950s but has slightly declined among men. The decrease in the gender earnings gap and the resulting substantial increase in upward mobility over a lifetime for women are the driving force behind the increase in long-term mobility among all workers.
I. INTRODUCTION

Market economies are praised for creating macroeconomic growth but blamed for the economic disparities among individuals they generate. Economic inequality is often measured using high-frequency economic outcomes such as annual income. However, market economies also generate substantial mobility in earnings over a working lifetime. As a result, annual earnings inequality might substantially exaggerate the extent of true economic disparity among individuals. To the extent that individuals can smooth changes in earnings using savings and credit markets, inequality based on longer periods than a year is a better measure

∗ We thank Tony Atkinson, Clair Brown, David Card, Jessica Guillory, Russ Hudson, Jennifer Hunt, Markus Jantti, Alan Krueger, David Lee, Thomas Lemieux, Michael Leonesio, Joyce Manchester, Robert Margo, David Pattison, Michael Reich, Jonathan Schwabish, numerous seminar participants, and especially the editor, Lawrence Katz, and four anonymous referees for very helpful comments and discussions. We also thank Ed DeMarco, Linda Maxfield, and especially Joyce Manchester for their support, Bill Kearns, Joel Packman, Russ Hudson, Shirley Piazza, Greg Diez, Fred Galeas, Bert Kestenbaum, William Piet, Jay Rossi, and Thomas Mattson for help with the data, and Thomas Solomon and Barbara Tyler for computing support. Financial support from the Sloan Foundation and NSF Grant SES-0617737 is gratefully acknowledged. All our series are available in electronic format in the Online Appendix.
© 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology.
The Quarterly Journal of Economics, February 2010
of economic disparity. Thus, a comprehensive analysis of disparity requires studying both inequality and mobility. A large body of academic work has indeed analyzed earnings inequality and mobility in the United States. A number of key facts from the pre–World War II years to the present have been established using five main data sources:1 (1) Decennial Census data show that earnings inequality decreased substantially during the “Great Compression” from 1939 to 1949 (Goldin and Margo 1992) and remained low over the next two decades; (2) the annual Current Population Surveys (CPS) show that earnings inequality has increased substantially since the 1970s and especially during the 1980s (Katz and Murphy 1992; Katz and Autor 1999); (3) income tax statistics show that the top of the annual earnings distribution experienced enormous gains over the last 25 years (Piketty and Saez 2003); (4) panel survey data, primarily the Panel Study of Income Dynamics (PSID), show that short-term rank-based mobility has remained fairly stable since the 1970s (Gottschalk 1997); and (5) the gender gap has narrowed substantially since the 1970s (Goldin 1990, 2006; Blau 1998). There are, however, important questions that remain open due primarily to lack of homogeneous and longitudinal earnings data covering a long period of time. First, no annual earnings survey data covering most of the U.S. workforce are available before the 1960s, so that it is difficult to measure overall earnings inequality on a consistent basis before the 1960s, and in particular to analyze the exact timing of the Great Compression. Second, studies of mobility have focused primarily on short-term mobility measures due to lack of longitudinal data with large sample size and covering a long time period. Therefore, little is known about earnings mobility across an entire working life, let alone how such long-term mobility has evolved over time. Third and related, there is a controversial debate on whether the increase in inequality since the 1970s has been offset by increases in earnings mobility, and whether consumption inequality has increased to the same extent as income inequality.2 In particular, the development of performance pay such as bonuses and stock options for highly compensated employees might have increased year-to-year earnings variability substantially among 1. A number of studies have also analyzed inequality and mobility in America in earlier periods (see Lindert [2000] for a survey on inequality and Ferrie [2008] for an analysis of occupational mobility). 2. See, for example, Cutler and Katz (1991), Slesnick (2001), Krueger and Perri (2006), and Attanasio, Battistin, and Ichimura (2007).
top earners, so that the trends documented in Piketty and Saez (2003) could be misleading. The goal of this paper is to use the Social Security Administration (SSA) earnings micro data available since 1937 to make progress on those questions. The SSA data we use combine four key advantages relative to the data that have been used in previous studies on inequality and mobility in the United States. First, the SSA data we use for our research purposes have a large sample size: a 1% sample of the full US covered workforce is available since 1957, and a 0.1% sample since 1937. Second, the SSA data are annual and cover a very long time period of almost seventy years. Third, the SSA data are longitudinal balanced panels, as samples are selected based on the same Social Security number pattern every year. Finally, the earnings data have very little measurement error and are fully uncapped (with no top code) since 1978.3 Although Social Security earnings data have been used in a number of previous studies (often matched to survey data such as the Current Population Survey), the data we have assembled for this study overcome three important previous limitations. First, from 1946 to 1977, we use quarterly earnings information to extrapolate earnings up to four times the Social Security annual cap.4 Second, we can match the data to employer and industry information starting in 1957, allowing us to control for expansions in Social Security coverage that started in the 1950s. Finally, to our knowledge, the Social Security annual earnings data before 1951 have not been used outside the SSA for research purposes since Robert Solow’s unpublished Harvard Ph.D. thesis (Solow 1951). Few sociodemographic variables are available in the SSA data relative to standard survey data. Date of birth, gender, place of birth (including a foreign country birthplace), and race are available since 1937. Employer information (including geographic location, industry, and size) is available since 1957. Because we do not have information on important variables such as family 3. A number of studies have compared survey data to matched administrative data to assess measurement error in survey data (see, e.g., Abowd and Stinson [2005]). 4. Previous work using SSA data before the 1980s has almost always used data capped at the Social Security annual maximum (which was around the median of the earnings distribution in the 1960s), making it impossible to study the top half of the distribution. Before 1946, the top code was above the top quintile, allowing us to study earnings up to the top quintile over the full period.
structure, education, and hours of work, our analysis will focus only on earnings rather than on wage rates and will not attempt to explain the links between family structure, education, labor supply, and earnings, as many previous studies have done. In contrast to studies relying on income tax returns, the whole analysis is also based on individual rather than family-level data. Furthermore, we focus only on employment earnings and hence exclude self-employment earnings as well as all other forms of income such as capital income, business income, and transfers. We further restrict our analysis to employment earnings from commerce and industry workers, who represent about 70% of all U.S. employees, as this is the core group always covered by Social Security since 1937. This is an important limitation when analyzing mobility as (a) mobility within the commerce and industry sector may be different than overall mobility and (b) mobility between the commerce and industry sector and all other sectors is eliminated. We obtain three main findings. First, our annual series confirm the U-shaped evolution of earnings inequality since the 1930s. Inequality decreases sharply up to 1953 and increases steadily and continuously afterward. The U-shaped evolution of inequality over time is also present within each gender group and is more pronounced for men. Percentile ratio series show that (1) the compression in the upper part of the distribution took place from 1942 to 1950 and was followed by a steady and continuous widening ever since the early 1950s, and (2) the compression in the lower part of the distribution took place primarily in the postwar period from 1946 to the late 1960s and unraveled quickly from 1970 to 1985, especially for men, and has been fairly stable over the last two decades. Second, we find that short-term relative mobility measures such as rank correlation measures and Shorrocks indices comparing annual vs. multiyear earnings inequality have been quite stable over the full period, except for a temporary surge during World War II.5 In particular, short-term mobility has been remarkably stable since the 1950s, for a variety of mobility measures and also when the sample is restricted to men only. Therefore, the
5. Such a surge is not surprising in light of the large turnover in the labor market generated by the war.
evolution of annual earnings inequality over time is very close to the evolution of inequality of longer term earnings. Furthermore, we show that most of the increase in the variance of (log) annual earnings is due to increases in the variance of (log) permanent earnings, with modest increases in the variance of transitory (log) earnings. Finally, mobility at the top of the earnings distribution, measured by the probability of staying in the top percentile after one, three, or five years, has also been very stable since 1978 (the first year in our data with no top code). Therefore, in contrast to the stock-option scenario mentioned above, the SSA data show very clearly that mobility has not mitigated the dramatic increase in annual earnings concentration.

Third, we find that long-term mobility measures among all workers, such as the earnings rank correlations from the early part of a working life to the late part of a working life, display significant increases since 1951 either when measured unconditionally or when measured within cohorts. However, those increases mask substantial heterogeneity across gender groups. Long-term mobility among males has been stable over most of the period, with a slight decrease in recent decades. The decrease in the gender earnings gap and the resulting substantial increase in upward mobility over a lifetime for women is the driving force behind the increase in long-term mobility among all workers.

The paper is organized as follows. Section 2 presents the conceptual framework linking inequality and mobility measures, the data, and our estimation methods. Section 3 presents inequality results based on annual earnings. Section 4 focuses on short-term mobility and its effect on inequality, whereas Section 5 focuses on long-term mobility and inequality. Section 6 concludes. Additional details on the data and our methodology, as well as extensive sensitivity analysis and the complete series, are presented in the Online Appendix.

II. FRAMEWORK, DATA, AND METHODOLOGY

II.A. Conceptual Framework

Our main goal is to document the evolution of earnings inequality. Inequality can be measured over short-term earnings (such as annual earnings) or over long-term earnings (such as earnings averaged over several years or even a lifetime). When there is mobility in individual earnings over time, long-term
inequality will be lower than short-term inequality, as moving up and down the distribution of short-term earnings will make the distribution of long-term earnings more equal. Therefore, conceptually, a way to measure mobility (Shorrocks 1978) is to compare inequality of short-term earnings to inequality of long-term earnings and define mobility as a coefficient between zero and one (inclusive) as follows: (1)
Long-term earnings inequality = Short-term earnings inequality × (1 − Mobility).
Alternatively, one can define mobility directly as changes or "shocks" in earnings.6 In our framework, such shocks are defined broadly as any deviation from long-term earnings. Those shocks could indeed be real shocks such as unemployment, disability, or an unexpected promotion. Changes could also be the consequence of voluntary choices such as reducing (or increasing) hours of work, voluntarily changing jobs, or obtaining an expected pay raise. Such shocks can be transitory (such as working overtime in response to a temporarily increased demand for an employer's product, or a short unemployment spell in the construction industry) or permanent (being laid off from a job in a declining industry). In that framework, both long-term inequality and the extent of shocks contribute to shaping short-term inequality:

(2) Short-term earnings inequality = Long-term earnings inequality + Variability in earnings.

Equations (1) and (2) are related by the formula

(3) Variability in earnings = Short-term earnings inequality × Mobility = Long-term earnings inequality × Mobility/(1 − Mobility).

Thus, equation (3) shows that a change in mobility with no change in long-term inequality is due to an increase in variability in earnings. Conversely, an increase in inequality (either short-term or long-term) with no change in mobility implies an increased
6. See Fields (2007) for an overview of different approaches to measuring income mobility.
variability in earnings. Importantly, our concept of mobility is relative rather than absolute.7

Formally, we consider a situation where a fixed group of individuals i = 1, . . . , I have short-term earnings zit > 0 in each period t = 1, . . . , K. For example, t can represent a year. We can define long-term earnings for individual i as average earnings across all K periods: z̄i = Σt zit/K. We normalize earnings so that average earnings (across individuals) are the same in each period.8 From a vector of individual earnings z = (z1, . . . , zI), an inequality index can be defined as G(z), where G(.) is convex in z and homogeneous of degree zero (multiplying all earnings by a given factor does not change inequality). For example, G(.) can be the Gini index or the variance of log earnings. Shorrocks (1978, Theorem 1, p. 381) shows that

G(z̄) ≤ Σ_{t=1}^{K} G(zt)/K,

where zt is the vector of earnings in period t and z̄ the vector of long-term earnings (the average across the K periods). This inequality result captures the idea that movements in individual earnings up and down the distribution reduce long-term inequality (relative to short-term inequality). Hence we can define a related Shorrocks mobility index 0 ≤ M ≤ 1 as

1 − M = G(z̄) / [Σ_{t=1}^{K} G(zt)/K],

which is a formalization of equation (1) above. M = 0 if and only if individuals' incomes (relative to the mean) do not change over time. The central advantage of the Shorrocks mobility index is that it formally links short-term and long-term inequality, which is perhaps the primary motivation for analyzing mobility. The disadvantage of the Shorrocks index is that it is an indirect measure of mobility.
7. Our paper focuses exclusively on relative mobility measures, although absolute mobility measures (such as the likelihood of experiencing an earnings increase of at least X% after one year) are also of great interest. Such measures might produce different time series if economic growth or annual inequality changed over time.
8. In our empirical analysis, earnings will be indexed to the nominal average earnings index.
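As a concrete illustration of the Shorrocks index just defined, the sketch below is our own, run on simulated earnings rather than the SSA data: it computes the average annual Gini, the Gini of K-year average earnings, and the implied mobility index M.

```python
import numpy as np

def gini(x):
    """Gini coefficient of a vector of positive earnings."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    return 2 * np.sum(np.arange(1, n + 1) * x) / (n * np.sum(x)) - (n + 1) / n

rng = np.random.default_rng(3)
I, K = 50_000, 5
# Simulated log earnings: a permanent component plus i.i.d. transitory deviations.
permanent = rng.normal(10.0, 0.6, size=I)
z = np.exp(permanent[:, None] + rng.normal(0.0, 0.3, size=(I, K)))

# Normalize so that average earnings are the same in each period (as in the text).
z = z / z.mean(axis=0)

short_term = np.mean([gini(z[:, t]) for t in range(K)])   # average annual Gini
long_term = gini(z.mean(axis=1))                          # Gini of K-year averages
M = 1 - long_term / short_term                            # Shorrocks mobility index
print(round(short_term, 3), round(long_term, 3), round(M, 3))
```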
Therefore, it is also useful to define direct mobility indices such as the rank correlation in earnings from year t to year t + p (or quintile mobility matrices from year t to year t + p). Such mobility indices are likely to be closely related to the Shorrocks indices, as reranking from one period to another is precisely what creates a wedge between long-term inequality and (the average of) short-term inequality. The advantage of direct mobility indices is that they are more concrete and transparent than Shorrocks indices. In our paper, we will therefore use both and show that they evolve very similarly over time.

One specific measure of inequality—the variance of log earnings—has received substantial attention in the literature on inequality and mobility. Introducing yit = log zit and ȳi = Σt log zit/K, we can define deviations in (log) earnings as εit = yit − ȳi. It is important to note that εit may reflect both transitory earnings shocks (such as an i.i.d. process) and permanent earnings shocks (such as a Brownian motion). The deviation εit could either be uncertain ex ante from the individual perspective, or predictable.9 The Shorrocks theorem applied to the inequality index variance of log earnings implies that vari(ȳi) ≤ varit(yit), where the variance varit(yit) is taken over both i = 1, . . . , I and t = 1, . . . , K. If, for illustration, we make the statistical assumption that εit ⊥ ȳi and we denote var(εit) = σ²ε, then we have

varit(yit) = vari(ȳi) + σ²ε,

which is a formalization of equation (2) above. The Shorrocks inequality index in that case is

M = σ²ε / varit(yit) = σ²ε / [vari(ȳi) + σ²ε].

This shows that short-term earnings variance can increase because of an increase in long-term earnings variance or an increase in the variance of earnings deviations. Alternatively and
9. Uncertainty is important conceptually because individuals facing no credit constraints can fully smooth predictable shocks, whereas uncertain shocks can only be smoothed with insurance. We do not pursue this distinction in our analysis, because we cannot observe the degree of uncertainty in the empirical earnings shocks.
equivalently, short-term inequality can increase while long-term inequality remains stable if mobility increases. This simple framework can help us understand the findings from the previous literature on earnings mobility in the United States. Rank-based mobility measures (such as year-to-year rank correlation or quintile mobility matrices) are stable over time (Gottschalk 1997), whereas there has been an increase in the variance of transitory earnings (Gottschalk and Moffitt 1994). Such findings can be reconciled if the disparity in permanent earnings has simultaneously widened to keep rank-based mobility of earnings stable.

In the theoretical framework we just described, the same set of individuals are followed across the K short-term periods. In practice, because individuals leave or enter the labor force (or the "commerce and industry" sector we will be focusing on), the set of individuals with positive earnings varies across periods. As the number of periods K becomes large, the sample will become smaller. Therefore, we will mostly consider relatively small values of K such as K = 3 or K = 5. When a period is a year, that allows us to analyze short-term mobility. When a period is a longer period of time such as twelve consecutive years, with K = 3, we cover 36 years, which is almost a full lifetime of work, allowing us to analyze long-term mobility, that is, mobility over a full working life.

Our analysis will focus on the time series of various inequality and mobility statistics. The framework we have considered can be seen as an analysis at a given point in time s. We can recompute those statistics for various points in time to create time series.

II.B. Data and Methodology

Social Security Administration Data. We use primarily data sets constructed in SSA for research and statistical analysis, known as the continuous work history sample (CWHS) system.10 The annual samples are selected based on a fixed subset of digits of (a transformation of) the Social Security number (SSN). The same digits are used every year so that the sample is a balanced panel and can be treated as a random sample of the full population data. We use three main SSA data sets.

(1) The 1% CWHS file contains information about taxable Social Security earnings from 1951 to 2004, basic demographic
10. Detailed documentation of these data sets can be found in Panis et al. (2000).
characteristics such as year of birth, sex, and race, type of work (farm or nonfarm, employment or self-employment), self-employment taxable income, insurance status for the Social Security programs, and several other variables. Because Social Security taxes apply up to a maximum level of annual earnings, however, earnings in this data set are effectively top-coded at the annual cap before 1978. Starting in 1978, the data set also contains information about full compensation derived from the W2 forms, and hence earnings are no longer top-coded. Employment earnings (either FICA employment earnings before 1978 or W2 earnings from 1978 on) are defined as the sum of all wages and salaries, bonuses, and exercised stock options, exactly as wage income reported on individual income tax returns.11

(2) The second file is known as the employee–employer file (EE-ER), and we will rely on its longitudinal version (LEED), which covers 1957 to date. Although the sampling approach based on the SSN is the same as the 1% CWHS, individual earnings are reported at the employer level so that there is a record for each employer a worker is employed by in a year. This data set contains demographic characteristics, compensation information subject to top-coding at the employer–employee record level (and with no top code after 1978), and information about the employer, including geographic information and industry at the three-digit (major group and industry group) level. The industry information allows us to control for expansion in coverage over time (see below). Importantly, the LEED (and EE-ER) data set also includes imputations based on quarterly earnings structure from 1957 to 1977, which allows us to handle earnings above the top code (see below).12

(3) Third, we use the so-called 0.1% CWHS file (one-tenth of 1%) that is constructed as a subset of the 1% file but covers 1937–1977. This file is unique in its covering the Great Compression of the 1940s. The 0.1% file contains the same demographic variables as well as quarterly earnings information starting with 1951 (and quarter at which the top code was reached for 1946–1950), thereby extending our ability to deal with top-coding problems (see below).
11. FICA earnings include elective employee contributions for pensions (primarily 401(k) contributions), whereas W2 earnings exclude such contributions. However, before 1978, such contributions were almost nonexistent.
12. To our knowledge, the LEED has hardly ever been used in academic publications. Two notable exceptions are Schiller (1977) and Topel and Ward (1992).
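As a rough sketch of how employer-level records of the LEED type can be collapsed to the person-year level used in the analysis, the snippet below is ours; the column names and the main-job rule are hypothetical illustrations, not the actual SSA file layout or procedure. It sums earnings across employers within a worker-year and carries along an industry code for later filtering.

```python
import pandas as pd

# Hypothetical employer-level records: one row per worker-employer-year.
leed = pd.DataFrame({
    "ssn_id":   [1, 1, 2, 2, 2],
    "year":     [1980, 1980, 1980, 1980, 1981],
    "sic":      ["3711", "5411", "8221", "3711", "3711"],
    "earnings": [12000.0, 3000.0, 9000.0, 4000.0, 15000.0],
})

# Collapse to person-year totals; keep the SIC code of the highest-paying job
# (an assumption made here purely for illustration).
main_job = leed.sort_values("earnings").groupby(["ssn_id", "year"]).tail(1)
person_year = (
    leed.groupby(["ssn_id", "year"], as_index=False)["earnings"].sum()
        .merge(main_job[["ssn_id", "year", "sic"]], on=["ssn_id", "year"])
)
print(person_year)
```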
Top Coding Issues. From 1937 to 1945, no information above the taxable ceiling is available. From 1946 to 1950, the quarter at which the ceiling is reached is available. From 1951 to 1977, we rely on imputations based on quarterly earnings (up to the quarter at which the annual ceiling is reached). Finally, since 1978, the data are fully uncapped. To our knowledge, the exact quarterly earnings information seems to have been retained only in the 0.1% CWHS sample since 1951. The LEED 1% sample since 1957 contains imputations that are based on quarterly earnings, but the quarterly earnings themselves were not retained in the data available to us. The imputation method is discussed in more detail in Kestenbaum (1976, his method II) and in the Online Appendix. It relies on earnings for quarters when they are observed to impute earnings in quarters that are not observed (when the taxable ceiling is reached after the first quarter). Importantly, this imputation method might not be accurate if individual earnings were not uniform across quarters. We extend the same procedure to 1951–1956 using the 0.1% file and because of the overlap of the 0.1% file and 1% LEED between 1957 and 1977 are able to verify that this is indeed the exact procedure that was applied in the LEED data. For 1946–1950, the imputation procedure (see the Online Appendix and Kestenbaum [1976, his method I]) uses Pareto distributions and preserves the rank order based on the quarter when the taxable maximum was reached. For individuals with earnings above the taxable ceiling (from 1937 to 1945) or who reach the taxable ceiling in the first quarter (from 1946 to 1977), we impute earnings assuming a Pareto distribution above the top code (1937–1945) or four times the top code (1946–1977). The Pareto distribution is calibrated from wage income tax statistics published by the Internal Revenue Service to match the top wage income shares series estimated in Piketty and Saez (2003). The number of individuals who were top-coded in the first quarter and whose earnings are imputed based on the Pareto imputation is less than 1% of the sample for virtually all years after 1951. Consequently, high-quality earnings information is available for the bottom 99% of the sample, allowing us to study both inequality and mobility up to the top percentile. From 1937 to 1945, the fraction of workers top-coded (in our sample of interest defined below) increases from 3.6% in 1937 to 19.5% in 1944 and 17.4% in 1945. The number of top-coded observations increases
to 32.9% by 1950, but the quarter when a person reached the taxable maximum helps in classifying people into broad income categories. This implies that we cannot study groups smaller than the top percentile from 1951 to 1977 and we cannot study groups smaller than the top quintile from 1937 to 1950. To assess the sensitivity of our mobility and multiyear inequality estimates with respect to top code imputation, we use two Pareto imputation methods (see the Online Appendix). In the first or main method, the Pareto imputation is based on draws from a uniform distribution that are independent across individuals but also across time periods. As there is persistence in ranking even at the top of the distribution, this method generates an upward bias in mobility within top-coded individuals. In the alternative method, the uniform distribution draws are independent across individuals but fixed over time for a given individual. As there is some mobility in rankings at the top of the distribution, this method generates a downward bias in mobility. We always test that the two methods generate virtually the same series (see Online Appendix Figures A.5 to A.9 for examples).13 Changing Coverage Issues. Initially, Social Security covered only “commerce and industry” employees, defined as most private for-profit sector employees, and excluding farm and domestic employees as well as self-employed workers. Since 1951, there has been an expansion in the workers covered by Social Security and hence included in the data. An important expansion took place in 1951 when self-employed workers and farm and domestic employees were included. This reform also expanded coverage to some government and nonprofit employees (including large parts of the education and health care industries), with coverage increasing significantly further in 1954 and then slowly expanding since then. We include in our sample only commerce and industry employment earnings in order to focus on a consistent definition of workers. Using SIC classification in the LEED, we define commerce and industry as all SIC codes excluding agriculture, forestry, and fishing (01–09), hospitals (8060–8069), educational services (82), social services (83), religious organizations and nonclassified membership organizations (8660–8699), private households (88), and public administration (91–97). 13. This is not surprising because, starting with 1951, imputations matter for just the top 1% of the sample and mobility measures for the full population are not very sensitive to what happens within the very top group.
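The two Pareto imputation variants compared above can be sketched as follows; this is our own illustration, and the Pareto shape parameter and the cap are arbitrary placeholders rather than the values calibrated to the IRS wage statistics.

```python
import numpy as np

def impute_above_cap(n_workers, n_years, cap, pareto_a=2.0, fixed_draws=False, seed=0):
    """Impute earnings above a taxable cap from a Pareto tail (illustrative sketch).

    Main method: uniform draws independent across workers and across years.
    Alternative method (fixed_draws=True): one draw per worker, reused every year,
    which preserves each worker's rank in the imputed tail over time.
    """
    rng = np.random.default_rng(seed)
    if fixed_draws:
        u = np.tile(rng.uniform(size=(n_workers, 1)), (1, n_years))
    else:
        u = rng.uniform(size=(n_workers, n_years))
    # Inverse CDF of a Pareto distribution with scale `cap` and shape `pareto_a`.
    return cap * (1.0 - u) ** (-1.0 / pareto_a)

# Example: imputed earnings for 1,000 top-coded workers over 5 years.
imputed = impute_above_cap(1_000, 5, cap=25_000.0)
print(imputed.min() >= 25_000.0, imputed.shape)
```

The main method overstates year-to-year mobility within the imputed group, whereas the fixed-draw variant understates it, which is why the text checks that both yield virtually the same series.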
Between 1951 and 1956, we do not have industry information, as the LEED starts in 1957. Therefore, we impute "commerce and industry" classification using 1957–1958 industrial classification as well as discontinuities in covered earnings from 1950 to 1951 (see the Online Appendix for complete details). In 2004, commerce and industry employees are about 70% of all employees, and this proportion has declined only very modestly since 1937. Using only commerce and industry earnings is a limitation for our study for two reasons. First, inequality and mobility within the commerce and industry sector may be different from those in the full population. Second and more important, mobility between the commerce and industry sector and all other sectors is eliminated. Because in recent decades Social Security covers over 95% of earnings, we show in the Online Appendix that our mobility findings for recent decades are robust to including all covered workers. However, we cannot perform such a robustness check for earlier periods when coverage was much less complete. Note also that, throughout the period, the data include immigrant workers only if they have valid SSNs.

Sample Selection. For our primary analysis, we restrict the sample to adult individuals aged 25 to 60 (by January 1 of the corresponding year). This top age restriction allows us to concentrate on the working-age population.14 Second, we consider for our main sample only workers with annual (commerce and industry) employment earnings above a minimum threshold defined as one-fourth of a full year–full time minimum wage in 2004 ($2,575 in 2004), and then indexed by nominal average wage growth for earlier years. For many measures of inequality, such as log-earnings variance, it is necessary to trim the bottom of the earnings distribution. We show in Online Appendix Figures A.2 to A.9 that our results are not sensitive to choosing a higher minimum threshold such as a full year–full time minimum wage. We cannot analyze the transition into and out of the labor force satisfactorily using our sample because the SSA data cover only about 70% of employees in the early decades. From now on, we refer to our main sample of interest, namely "commerce and industry" workers aged 25 to 60 with earnings above the indexed minimum threshold (of $2,575 in 2004), as the "core sample."

14. Kopczuk, Saez, and Song (2007) used a wider age group from 18 to 70 and obtained the same qualitative findings.
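For concreteness, the wage indexation of the minimum threshold can be sketched as follows; the average wage index values used here are illustrative placeholders, not the official series.

```python
# Minimum threshold: one-fourth of a full year-full time minimum wage in 2004
# ($2,575), carried back to earlier years by nominal average wage growth.
THRESHOLD_2004 = 2_575.0

awi = {1980: 12_500.0, 2004: 35_650.0}   # placeholder average wage index values

def min_threshold(year, awi, base_year=2004):
    """Earnings threshold applied in `year`, indexed by average wage growth."""
    return THRESHOLD_2004 * awi[year] / awi[base_year]

print(round(min_threshold(1980, awi)))
```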
104
QUARTERLY JOURNAL OF ECONOMICS
FIGURE I
Annual Gini Coefficients
The figure displays the Gini coefficients from 1937 to 2004 for earnings of individuals in the core sample, men in the core sample, and women in the core sample. The core sample in year t is defined as all employees with commerce and industry earnings above a minimum threshold ($2,575 in 2004 and indexed using average wage for earlier years) and aged 25 to 60 (by January 1 of year t). Commerce and industry are defined as all industrial sectors excluding government employees, agriculture, hospitals, educational services, social services, religious and membership organizations, and private households. Self-employment earnings are fully excluded. Estimations are based on the 0.1% CWHS data set for 1937 to 1956, the 1% LEED sample from 1957 to 1977, and the 1% CWHS (matched to W-2 data) from 1978 on. See the Online Appendix for complete details.
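All of the inequality series that follow rest on Gini coefficients computed on large earnings samples. As a point of reference, here is a minimal sketch of such a computation (illustrative only, run on simulated lognormal earnings rather than the SSA data):

```python
import numpy as np

def gini(earnings):
    """Gini coefficient of a nonnegative earnings array."""
    x = np.sort(np.asarray(earnings, dtype=float))
    n = x.size
    # G = 2*sum(i * x_(i)) / (n * sum(x)) - (n + 1)/n, with x sorted ascending
    return 2.0 * np.sum(np.arange(1, n + 1) * x) / (n * x.sum()) - (n + 1.0) / n

rng = np.random.default_rng(1)
print(round(gini(rng.lognormal(mean=10.0, sigma=0.9, size=200_000)), 3))
```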
III. ANNUAL EARNINGS INEQUALITY

Figure I plots the annual Gini coefficient from 1937 to 2004 for the core sample of all workers, and for men and women separately in lighter gray. The Gini series for all workers follows a U-shape over the period, which is consistent with previous work based on decennial Census data (Goldin and Margo 1992), wage income from tax return data for the top of the distribution (Piketty and Saez 2003), and CPS data available since the early 1960s (Katz and Autor 1999). The series displays a sharp decrease of the Gini coefficient from 0.44 in 1938 down to 0.36 in 1953 (the Great Compression) followed by a steady increase since 1953 that accelerates in the 1970s and especially the 1980s. The Gini coefficient surpassed the prewar level in the late 1980s and was highest in 2004 at 0.47. Our series shows that the Great Compression is indeed the period of most dramatic change in inequality since the late 1930s
and that it took place in two steps. The Gini coefficient decreased sharply during the war from 1942 to 1944, rebounded very slightly from 1944 to 1946, and then declined again from 1946 to 1953. Among all workers, the increase in the Gini coefficient over the five decades from 1953 to 2004 is close to linear, which suggests that changes in overall inequality were not limited to an episodic event in the 1980s. Figure I shows that the series for males and females separately display the same U-shaped evolution over time. Interestingly, the Great Compression as well as the upward trend in inequality is much more pronounced for men than for all workers. This shows that the rise in the Gini coefficient since 1970 cannot be attributed to changes in gender composition of the labor force. The Gini for men shows a dramatic increase from 0.35 in 1979 to 0.43 in 1988, which is consistent with the CPS evidence extensively discussed in Katz and Autor (1999).15 On the other hand, stability of the Gini coefficients for men and for women from the early 1950s through the late 1960s highlights that the overall increase in the Gini coefficient in that period has been driven by a widening of the gender gap in earnings (i.e., the between- rather than within-group component). Strikingly, there is more earnings inequality among women than among men in the 1950s and 1960s, whereas the reverse is true before the Great Compression and since the late 1970s. Finally, the increase in the Gini coefficient has slowed since the late 1980s in the overall sample. It is interesting to note that a large part of the 3.5 point increase in the Gini from 1990 to 2004 is due to a surge in earnings within the top percentile of the distribution. The series of Gini coefficients estimated excluding the top percentile increases by less than 2 points since 1990 (see Online Appendix Figure A.3).16 It should also be noted that, since the 1980s, the Gini coefficient has increased faster for men and women separately than for all workers.

15. There is a controversial debate in labor economics about the timing of changes in male wage inequality, due in part to discrepancies across different data sets. For example, Lemieux (2006), using May CPS data, argues that most of the increase in inequality occurs in the 1980s, whereas Autor, Katz, and Kearney (2008), using March CPS data, estimate that inequality starts to increase in the late 1960s. The Social Security data also point to an earlier increase in earnings inequality among males.

16. Hence, results based on survey data such as official Census Bureau inequality statistics, which do not measure the top percentile well, can give an incomplete view of inequality changes even when using global indices such as the Gini coefficient.
FIGURE II
Percentile Ratios log(P80/P50) and log(P50/P20)
Sample is the core sample (commerce and industry employees aged 25 to 60; see Figure I). The figure displays the log of the 50th to 20th percentile earnings ratio (upper part of the figure) and the log of the 80th to 50th percentile earnings ratio (lower part of the figure) among all workers, men only (in lighter gray), and women only (in lighter gray).
This has been driven by an increase in the earnings of women relative to men, especially at the top of the distribution, as we shall see. Most previous work in the labor economics literature has focused on gender-specific measures of inequality. As men and women share a single labor market, it is also valuable to analyze the overall inequality generated in the labor market (in the “commerce and industry” sector in our analysis). Our analysis for all workers and by gender provides clear evidence of the importance of changes in women's labor market behavior and outcomes for understanding overall changes in inequality, a topic we will return to. To understand where in the distribution the changes in inequality displayed in Figure I are occurring, Figure II displays the (log) percentile annual earnings ratios P80/P50—measuring inequality in the upper half of the distribution—and P50/P20—measuring inequality in the lower half of the distribution. We also depict the series for men only and women only separately in lighter gray.17

17. We choose P80 (instead of the more usual P90) to avoid top-coding issues before 1951 and P20 (instead of the more usual P10) so that our low percentile estimate is not too closely driven by the average wage-indexed minimum threshold we have chosen ($2,575 in 2004).
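The percentile-ratio series in Figure II are direct functions of three quantiles of the annual earnings distribution; a minimal sketch (illustrative only, on simulated data):

```python
import numpy as np

def log_percentile_ratios(earnings):
    """Return log(P80/P50) and log(P50/P20) for an array of annual earnings."""
    p20, p50, p80 = np.percentile(earnings, [20, 50, 80])
    return np.log(p80 / p50), np.log(p50 / p20)

rng = np.random.default_rng(2)
upper_half, lower_half = log_percentile_ratios(rng.lognormal(10.0, 0.9, 100_000))
```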
The P80/P50 series (depicted in the bottom half of the figure) are also U-shaped over the period, with a brief but substantial Great Compression from 1942 to 1947 and a steady increase starting in 1951, which accelerates in the 1970s. Interestingly, P80/P50 is virtually constant from 1985 to 2000, showing that the gains at the top of the distribution occurred above P80. The series for men is similar except that P80/P50 increases sharply in the 1980s and continues to increase in the 1990s. The P50/P20 series (depicted in the upper half of the figure) display a fairly different time pattern from the P80/P50 series. First, the compression happens primarily in the postwar period from 1946 to 1953. There are large swings in P50/P20 during the war, especially for men, as many young low-income earners leave and enter the labor force because of the war, but P50/P20 is virtually the same in 1941 and 1946 or 1947.18 After the end of the Great Compression in 1953, the P50/P20 series for all workers remains fairly stable to the present, alternating periods of increase and decrease. In particular, it decreases smoothly from the mid-1980s to 2000, implying that inequality in the bottom half shrank in the last two decades, although it started increasing after 2000. The series for men only is quite different and displays an overall U shape over time, with a sharper Great Compression that extends well into the postwar period, with an absolute minimum in 1969 followed by a sharp increase up to 1983 and relative stability since then (consistent with recent evidence by Autor, Katz, and Kearney [2008]). For women, the P50/P20 series display a secular and steady fall since World War II.

Table I summarizes the annual earnings inequality trends for all (Panel A), men (Panel B), and women (Panel C) with various inequality measures for selected years (1939, 1960, 1980, and 2004). In addition to the series depicted in the figures, Table I contains the variance of log-earnings, which also displays a U-shaped pattern over the period, as well as the shares of total earnings going to the bottom quintile group (P0–20), the top quintile group (P80–100), and the top percentile group (P99–100). Those last two series also display a U shape over the period.

18. In the working paper version (Kopczuk, Saez, and Song 2007), we show that compositional changes during the war are strongly influencing the bottom of the distribution during the early 1940s.
TABLE I
ANNUAL EARNINGS INEQUALITY

Year    Gini    Var. log   log(P80/  log(P50/  log(P80/  P0–20   P80–100  P99–100  Avg. earnings  #Workers
                earnings    P20)      P20)      P50)     share    share    share     (2004 $)      ('000s)

A. All
1939    0.433   0.826      1.43      0.88      0.55      3.64    46.82     9.55     15,806        20,404
1960    0.375   0.681      1.24      0.79      0.46      4.54    41.66     5.92     27,428        35,315
1980    0.408   0.730      1.33      0.76      0.57      4.34    44.98     7.21     35,039        50,129
2004    0.471   0.791      1.39      0.76      0.63      3.91    51.41    12.28     44,052        75,971

B. Men
1939    0.417   0.800      1.32      0.85      0.47      3.82    45.52     9.58     17,918        15,493
1960    0.326   0.533      0.94      0.58      0.35      5.89    38.80     5.55     32,989        24,309
1980    0.366   0.618      1.06      0.64      0.43      5.25    42.02     6.85     44,386        30,564
2004    0.475   0.797      1.34      0.73      0.61      3.92    51.83    13.44     52,955        42,908

C. Women
1939    0.380   0.635      1.36      0.87      0.49      4.49    42.25     6.11      9,145         4,911
1960    0.349   0.570      1.31      0.82      0.50      4.98    39.18     4.05     15,148        11,006
1980    0.354   0.564      1.22      0.74      0.49      5.15    40.38     4.37     20,439        19,566
2004    0.426   0.693      1.34      0.74      0.59      4.45    47.36     8.00     32,499        33,063

Notes. The table displays various annual earnings inequality statistics for selected years, 1939, 1960, 1980, and 2004 for all workers in the core sample (Panel A), men in the core sample (Panel B), and women in the core sample (Panel C). The core sample in year t is defined as all employees with commerce and industry earnings above a minimum threshold ($2,575 in 2004 and indexed using average wage for earlier years) and aged 25 to 60 (by January 1 of year t). Commerce and industry are defined as all industrial sectors excluding government employees, agriculture, hospitals, educational services, social services, religious and membership organizations, and private households. Self-employment earnings are fully excluded. Estimates are based on the 0.1% CWHS data set for 1937 to 1956, the 1% LEED sample from 1957 to 1977, and the 1% CWHS from 1978 on. See the Online Appendix for complete details. Columns (2) and (3) report the Gini coefficient and variance of log earnings. Columns (4), (5), and (6) report the percentile log ratios P80/P20, P50/P20, and P80/P50. P80 denotes the 80th percentile, etc. Columns (7), (8), and (9) report the share of total earnings accruing to P0–20 (the bottom quintile), P80–100 (the top quintile), and P99–100 (the top percentile). Column (10) reports average earnings in 2004 dollars using the CPI index (the new CPI-U-RS index is used after 1978). Column (11) reports the number of workers in thousands.
In particular, the top percentile share has almost doubled from 1980 to 2004 in the sample of men only and the sample of women only and accounts for over half of the increase in the top quintile share from 1980 to 2004.

IV. THE EFFECTS OF SHORT-TERM MOBILITY ON EARNINGS INEQUALITY

In this section, we apply our theoretical framework from Section II.A to analyze multiyear inequality and relate it to the annual earnings inequality series analyzed in Section III. We will consider each period to be a year and the longer period to be five years (K = 5).19 We will compare inequality based on annual earnings and earnings averaged over five years. We will then derive the implied Shorrocks mobility indices and decompose annual inequality into permanent and transitory inequality components. We will also examine some direct measures of mobility such as rank correlations.

Figure III plots the Gini coefficient series for earnings averaged over five years20 (the numerator of the Shorrocks index) and the five-year average of the Gini coefficients of annual earnings (the denominator of the Shorrocks index). For a given year t, the sample for both the five-year Gini and the annual Ginis is defined as all individuals with “commerce and industry” earnings above the minimum threshold in all five years, t − 2, t − 1, t, t + 1, t + 2 (and aged 25 to 60 in the middle year t). We show the average of the five annual Gini coefficients between t − 2 and t + 2 as our measure of the annual Gini coefficient, because it matches the Shorrocks approach. Because the sample is the same for both series, Shorrocks' theorem implies that the five-year Gini is always smaller than the average of the annual Gini (over the corresponding five years), as indeed displayed in the figure.21 We also display the same series for men only (in lighter gray). The annual Gini displays the same overall evolution over time as in Figure I.

19. Series based on three-year averages instead of five-year averages display a very similar time pattern. Increasing K beyond five would reduce sample size substantially, as we require earnings to be above the minimum threshold in each of the five years, as described below.

20. The average is taken after indexing annual earnings by the average wage index.

21. Alternatively, we could have defined the sample as all individuals with earnings above the minimum threshold in any of the five years, t − 2, t − 1, t, t + 1, t + 2. The time pattern of those series is very similar. We prefer to use the positive-earnings in all five years criterion because this is a necessity when analyzing variability in log-earnings, as we do below.
FIGURE III
Gini Coefficients: Annual Earnings vs. Five-Year Earnings
The figure displays the Gini coefficients for annual earnings and for earnings averaged over five years from 1939 to 2002. In year t, the sample for both series is defined as all individuals aged 25 to 60 in year t, with commerce and industry earnings above the minimum threshold in all five years t − 2, t − 1, t, t + 1, t + 2. Earnings are averaged over the five-year span using the average earnings index. The Gini coefficient for annual earnings displayed for year t is the average of the Gini coefficient for annual earnings in years t − 2, . . . , t + 2. The same series are reported in lighter gray for the sample restricted to men only.
The level is lower, as there is naturally less inequality in the group of individuals with positive earnings for five consecutive years than in the core sample. The Gini coefficient estimated for the five-year earnings average follows a very similar evolution over time and is actually extremely close to the annual Gini, especially in recent decades. Interestingly, in this sample, the Great Compression takes place primarily during the war from 1940 to 1944. The war compression is followed by a much more modest decline till 1952. This suggests that the postwar compression observed in annual earnings in Figure I was likely due to entry (of young men in the middle of the distribution) and exit (likely of wartime working women in the lower part of the distribution). Since the early 1950s, the two Gini series are remarkably parallel, and the five-year earnings average Gini displays an accelerated increase during the 1970s and especially the 1980s, as did our annual Gini series. The five-year average earnings Gini series for men show that the Great Compression is concentrated during the war, with little change in the Gini from 1946 to 1970, and a very sharp increase over the next three decades, especially the 1980s.
FIGURE IV
Short-Term Mobility: Shorrocks' Index and Rank Correlation
The figure displays the Shorrocks mobility coefficient based on annual earnings Gini vs. five-year average earnings Gini and the rank correlation between earnings in year t and year t + 1. The Shorrocks mobility coefficient in year t is defined as the ratio of the five-year earnings (from t − 2 to t + 2) Gini coefficient to the average of the annual earnings Gini for years t − 2, . . . , t + 2 (those two series are displayed in Figure III). The rank correlation in year t is estimated on the sample of individuals present in the core sample (commerce and industry employees aged 25 to 60; see Figure I) in both year t and year t + 1. The same series are reported in lighter gray for the sample restricted to men only.
Figure IV displays two measures of mobility (in black for all workers and in lighter gray for men only). The first measure is the Shorrocks measure, defined as the ratio of the five-year Gini to (the average of) the annual Gini. Mobility decreases with the index, and an index equal to one implies no mobility at all. The Shorrocks index series is above 0.9, except for a temporary dip during the war. The increased earnings mobility during the war is likely explained by the large movements into and out of the labor force of men serving in the army and women temporarily replacing men in the civilian labor force. The Shorrocks series have very slightly increased since the early 1970s, from 0.945 to 0.967 in 2004.22 This small change in the direction of reduced mobility further confirms that, as we expected from Figure III, short-term mobility has played a minor role in the surge in annual earnings inequality documented in Figure I.

22. The increase is slightly more pronounced for the sample of men.
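To fix ideas, the Shorrocks index plotted in Figure IV can be computed as follows; this is an illustrative sketch on a simulated balanced five-year panel, not the authors' code.

```python
import numpy as np

def gini(x):
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    return 2.0 * np.sum(np.arange(1, n + 1) * x) / (n * x.sum()) - (n + 1.0) / n

def shorrocks_index(panel):
    """panel: (n_individuals, 5) array of indexed annual earnings for years
    t-2,...,t+2. Returns the five-year Gini divided by the average annual Gini."""
    five_year_gini = gini(panel.mean(axis=1))
    avg_annual_gini = np.mean([gini(panel[:, k]) for k in range(panel.shape[1])])
    return five_year_gini / avg_annual_gini   # at most one by Shorrocks' theorem

# Simulated example: a permanent component plus transitory noise
rng = np.random.default_rng(3)
perm = rng.lognormal(10.0, 0.6, size=(50_000, 1))
panel = perm * rng.lognormal(0.0, 0.2, size=(50_000, 5))
print(round(shorrocks_index(panel), 3))
```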
The second mobility measure displayed in Figure IV is the straight rank correlation in earnings between year t and year t + 1 (computed in the sample of individuals present in our core sample in both years t and t + 1).23 As with the Shorrocks index, mobility decreases with the rank correlation and a correlation of one implies no year-to-year mobility. The rank mobility series follows the same overall evolution over time as the Shorrocks mobility index: a temporary but sharp dip during the war followed by a slight increase. Over the last two decades, the rank correlation in year-to-year earnings has been very stable and very high, around .9. As with the Shorrocks index, the increase in rank correlation is slightly more pronounced for men (than for the full sample) since the late 1960s.

Figure V displays (a) the average of the variance of annual log-earnings from t − 2 to t + 2 (defined on the stable sample as in the Shorrocks index analysis before), (b) the variance of five-year average log-earnings, $\operatorname{var}\bigl(\frac{1}{5}\sum_{s=t-2}^{t+2}\log z_{is}\bigr)$, and (c) the variance of log-earnings deviations, estimated as
$$D_t = \operatorname{var}\Bigl(\log z_{it} - \frac{1}{5}\sum_{s=t-2}^{t+2}\log z_{is}\Bigr),$$
where the variance is taken across all individuals i with earnings above the minimum threshold in all five years t − 2, . . . , t + 2.

23. More precisely, within the sample of individuals present in the core sample in both years t and t + 1, we measure the rank rt and rt+1 of each individual in each of the two years, and then compute the correlation between rt and rt+1 across individuals.
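In code, the decomposition underlying Figure V can be sketched as follows (illustrative; `log_panel` is a hypothetical (n, 5) array of log-earnings for years t − 2, . . . , t + 2 on the balanced sample):

```python
import numpy as np

def variance_decomposition(log_panel):
    """log_panel: (n, 5) array of log-earnings for years t-2,...,t+2.
    Returns (annual, permanent, transitory) variance measures as in the text."""
    annual = np.var(log_panel, axis=0).mean()        # average of annual variances
    five_year_avg = log_panel.mean(axis=1)           # (1/5) * sum_s log z_is
    permanent = np.var(five_year_avg)                # variance of five-year average
    transitory = np.var(log_panel[:, 2] - five_year_avg)   # D_t for the middle year
    return annual, permanent, transitory

# Simulated example: permanent component plus i.i.d. transitory shocks
rng = np.random.default_rng(4)
log_panel = rng.normal(10.0, 0.7, size=(50_000, 1)) + rng.normal(0.0, 0.25, (50_000, 5))
print(variance_decomposition(log_panel))
```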
FIGURE V
Variance of Annual, Permanent, and Transitory (log) Earnings
The figure displays the variance of (log) annual earnings, the variance of (log) five-year average earnings (permanent variance), and the transitory variance, defined as the variance of the difference between (log) annual earnings and (log) five-year average earnings. In year t, the sample for all three series is defined as all individuals aged 25 to 60 in year t, with commerce and industry earnings above the minimum threshold in all five years t − 2, t − 1, t, t + 1, t + 2. The (log) annual earnings variance is estimated as the average (across years t − 2, . . . , t + 2) of the variance of (log) annual earnings. The same series are reported in lighter gray for the sample restricted to men only.
As with the previous two mobility measures, those series, displayed in black for all workers and in lighter gray for men only, show a temporary surge in the variance of transitory earnings during the war, and are stable after 1960. In particular, it is striking that we do not observe an increase in earnings variability over the last twenty years, so that all the increase in the log-earnings variance can be attributed to the increase in the variance of permanent (five-year average) log-earnings.

Our results differ somewhat from those of Gottschalk and Moffitt (1994), using PSID data, who found that over one-third of the increase in the variance of log-earnings from the 1970s to the 1980s was due to an increase in transitory earnings (Table 1, row 1, p. 223). We find a smaller increase in transitory earnings in the 1970s and we find that this increase reverts in the late 1980s and 1990s so that transitory earnings variance is virtually identical in 1970 and 2000. To be sure, our results could differ from those of Gottschalk and Moffitt (1994) for many reasons, such as measurement error and earnings definition consistency issues in the PSID or the sample definition. Gottschalk and Moffitt focus exclusively on white males, use a different age cutoff, take out age-profile effects, and include earnings from all industrial sectors. Gottschalk and Moffitt also use nine-year earnings periods (instead of five as we do) and include all years with positive annual earnings (instead of requiring positive earnings in all nine years as we do).24

24. The recent studies of Dynan, Elmendorf, and Sichel (2008) and Shin and Solon (2008) revisit mobility using PSID data. Shin and Solon (2008) find an increase in mobility in the 1970s followed by stability, which is consistent with our results. Dynan, Elmendorf, and Sichel (2008) find an increase in mobility in recent decades, but they focus on household total income instead of individual earnings.
FIGURE VI
Top Percentile Earnings Share and Mobility
In Panel A, the sample in year t is all individuals aged 25 to 60 in year t and with commerce and industry earnings above the minimum threshold in all five years t − 2, t − 1, t, t + 1, t + 2. In year t, Panel A displays (1) the share of total year t annual earnings accruing to the top 1% earners in that year t and (2) the share of total five-year average earnings (from year t − 2, . . . , t + 2) accruing to the top 1% earners (defined as top 1% in terms of average five-year earnings). Panel B displays the probability of staying in the top 1% annual earnings group after X years (where X = 1, 3, 5). The sample in year t is all individuals present in the core sample (commerce and industry employees aged 25 to 60; see Figure I) in both year t and year t + X. Series in both panels are restricted to 1978 and on because the sample has no top code since 1978.
The absence of top-coding since 1978 allows us to zoom in on top earnings, which, as we showed in Table I, have surged in recent decades. Figure VI.A uses the uncapped data since 1978 to plot the share of total annual earnings accruing to the top 1% (those with
earnings above $236,000 in 2004). The top 1% annual earnings share doubles from 6.5% in 1978 to 13% in 2004.25 Figure VI.A then compares the share of earnings of the top 1% based on annual data with shares of the top 1% defined based on earnings averaged at the individual level over five years. The five-year average earnings share series naturally smooths short-term fluctuations but shows the same time pattern of robust increase as the annual measure.26 This shows that the surge in top earnings is not due to increased mobility at the top. This finding is confirmed in Figure VI.B, which shows the probability of staying in the top 1% earnings group after one, three, and five years (conditional on staying in our core sample) starting in 1978. The one-year probability is between sixty and seventy percent and it shows no overall trend. Therefore, our analysis shows that the dramatic surge in top earnings has not been accompanied by a similar surge in mobility into and out of top earnings groups. Hence, annual earnings concentration measures provide a very good approximation to longer-term earnings concentration measures. In particular, the development of performance-based pay such as bonuses and profits from exercised stock options (both included in our earnings measure) does not seem to have increased mobility dramatically.27

Table II summarizes the key short-term mobility trends for all (Panel A) and men (Panel B) with various mobility measures for selected years (1939, 1960, 1980, and 2002). In sum, the movements in short-term mobility series appear to be much smaller than changes in inequality over time. As a result, changes in short-term mobility have had no significant impact on inequality trends in the United States.
25. The closeness of our SSA-based (individual-level) results and the tax return–based (family-level) results of Piketty and Saez (2003) shows that changes in assortative mating played at best a minor role in the surge of family employment earnings at the top of the earnings distribution.

26. Following the framework from Section II.A (applied in this case to the top 1% earnings–share measure of inequality), we have computed such shares (in year t) on the sample of all individuals with minimum earnings in all five years, t − 2, . . . , t + 2. Note also that, in contrast to Shorrocks' theorem, the series cross because we do not average the annual income share in year t across the five years t − 2, . . . , t + 2.

27. Conversely, the widening of the gap in annual earnings between the top 1% and the rest of the workforce has not affected the likelihood of top-1% earners falling back into the bottom 99%.
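For reference, the persistence probability plotted in Figure VI.B can be sketched as follows (illustrative; `earn_t` and `earn_tx` stand for hypothetical arrays of annual earnings in years t and t + X for the same individuals present in the core sample in both years):

```python
import numpy as np

def prob_stay_top1(earn_t, earn_tx):
    """Probability that a top 1% earner in year t is still in the top 1%
    in year t + X."""
    top_t = earn_t >= np.percentile(earn_t, 99)
    top_tx = earn_tx >= np.percentile(earn_tx, 99)
    return (top_t & top_tx).sum() / top_t.sum()

# Simulated example with persistent and transitory components
rng = np.random.default_rng(5)
base = rng.lognormal(10.0, 1.0, 100_000)
print(round(prob_stay_top1(base * rng.lognormal(0.0, 0.3, 100_000),
                           base * rng.lognormal(0.0, 0.3, 100_000)), 2))
```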
TABLE II
FIVE-YEAR AVERAGE EARNINGS INEQUALITY AND SHORT-TERM MOBILITY

Year    5-year    Annual Gini      Rank corr.   Permanent      Annual         Transitory     #Workers
        average   (avg. t − 2,     after        (5-yr avg.)    log-earnings   log-earnings    ('000s)
        Gini      . . . , t + 2)   1 year       log-earnings   variance       variance
                                                variance       (average)

A. All
1939    0.357     0.380            0.859        0.416          0.531          0.085          14,785
1960    0.307     0.324            0.883        0.371          0.447          0.054          26,479
1980    0.347     0.364            0.885        0.426          0.513          0.061          35,500
2002    0.421     0.435            0.897        0.514          0.594          0.058          55,108

B. Men
1939    0.340     0.365            0.853        0.373          0.494          0.091          11,700
1960    0.272     0.291            0.855        0.288          0.362          0.052          19,577
1980    0.310     0.329            0.869        0.337          0.425          0.062          23,190
2002    0.426     0.440            0.898        0.509          0.591          0.061          32,259

Notes. The table displays various measures of 5-year average earnings inequality and short-term mobility measures centered around selected years, 1939, 1960, 1980, and 2002 for all workers (Panel A) and men (Panel B). In all columns (except (4)), the sample in year t is defined as all employees with commerce and industry earnings above a minimum threshold ($2,575 in 2004 and indexed using average wage for earlier years) in all five years t − 2, t − 1, t, t + 1, and t + 2, and aged 25 to 60 (by January 1 of year t). Column (2) reports the Gini coefficients based on average earnings from year t − 2 to year t + 2 (averages are computed using indexed wages). Column (3) reports the average across years t − 2, . . . , t + 2 of the Gini coefficients of annual earnings. Column (4) reports the rank correlation between annual earnings in year t and annual earnings in year t + 1 in the sample of workers in the core sample (see Table I footnote for the definition) in both years t and t + 1. Column (5) reports the variance of average log-earnings from year t − 2 to year t + 2. Column (6) reports the average across years t − 2, . . . , t + 2 of the variance of annual log-earnings. Column (7) reports the variance of the difference between log earnings in year t and the average of log earnings from year t − 2 to t + 2. Column (8) reports the number of workers in thousands.
Those findings are consistent with previous studies for recent decades based on PSID data (see, e.g., Gottschalk [1997] for a summary) as well as the most recent SSA data–based analysis of the Congressional Budget Office (2007)28 and the tax return–based analysis of Carroll, Joulfaian, and Rider (2007). They are more difficult to reconcile, however, with the findings of Hungerford (1993) and especially Hacker (2006), who find great increases in family income variability in recent decades using PSID data. Our finding of stable transitory earnings variance is also at odds with the findings of Gottschalk and Moffitt (1994), who decompose transitory and permanent variance in log-earnings using PSID data and show an increase in both components. Our decomposition using SSA data shows that only the variance of the relatively permanent component of earnings has increased in recent decades.

28. The CBO study focuses on probabilities of large earnings increases (or drops).

V. LONG-TERM MOBILITY AND LIFETIME INEQUALITY

The very long span of our data allows us to estimate long-term mobility. Such mobility measures go beyond the issue of transitory
earnings analyzed above and instead describe mobility across a full working life. Such estimates have not yet been produced for the United States in any systematic way because of the lack of panel data with large sample size and covering a long time period.

V.A. Unconditional Long-Term Inequality and Mobility

We begin with the simplest extension of our previous analysis to a longer horizon. In the context of the theoretical framework from Section II.A, we now assume that a period is eleven consecutive years. We define the “core long-term sample” in year t as all individuals aged 25–60 in year t with average earnings (using the standard wage indexation) from year t − 5 to year t + 5 above the minimum threshold. Hence, our sample includes individuals with zeros in some years as long as average earnings are above the threshold.29 Figure VII displays the Gini coefficients for all workers, and for men and women separately, based on those eleven-year average earnings from 1942 to 1999. The overall picture is actually strikingly similar to our annual Figure I. The Gini coefficient series for all workers displays an overall U shape with a Great Compression from 1942 to 1953 and an absolute minimum in 1953, followed by a steady increase that accelerates in the 1970s and 1980s and slows down in the 1990s. The U-shaped evolution over time is also much more pronounced for men than for women and shows that, for men, the inequality increase was concentrated in the 1970s and 1980s.30

After exploring base inequality over those eleven-year spells, we turn to long-term mobility. Figure VIII displays the rank correlation between the eleven-year earnings spell centered in year t and the eleven-year earnings spell after T years (i.e., centered in year t + T) in the same sample of individuals present in the “long-term core sample” in both year t and year t + T. The figure presents such correlations for three choices of T: ten years, fifteen years, and twenty years. Given our 25–60 age restriction (which applies in both year t and year t + T), for T = 20, the sample in year t is aged 25 to 40 (and the sample in year t + 20 is aged 45 to 60). Thus, this measure captures mobility from early career to late career.

29. This allows us to analyze large and representative samples, as the number of individuals with positive “commerce and industry” earnings in eleven consecutive years is only between 35% and 50% of the core annual samples.

30. We show in Online Appendix Figures A.8 and A.9 that these results are robust to using a higher minimum threshold.
FIGURE VII
Long-Term Earnings Gini Coefficients
The figure displays the Gini coefficients from 1942 to 1999 for eleven-year average earnings for all workers, men only, and women only. The sample in year t is defined as all employees aged 25 to 60 in year t, alive in all years t − 5 to t + 5, and with average commerce and industry earnings (averaged using the average wage index) from year t − 5 to t + 5 above the minimum threshold. Gini coefficient in year t is based on average (indexed) earnings across the eleven-year span from year t − 5 to t + 5.
The figure also displays the same series for men only in lighter gray, in which case rank is defined within the sample of men. Three points are worth noting. First, the correlation is unsurprisingly lower as T increases, but it is striking to note that even after twenty years, the correlation is still substantial (in the vicinity of .5). Second, the series for all workers shows that rank correlation has actually significantly decreased over time: for example, the rank correlation between 1950s and 1970s earnings was around .57, but it is only .49 between 1970s and 1990s earnings. This shows that long-term mobility has increased significantly over the last five decades. This result stands in contrast to our short-term mobility results displaying substantial stability. Third, however, Figure VIII shows that this increase in long-term mobility disappears in the sample of men. The series for men displays a slight decrease in rank correlation in the first part of the period followed by an increase in the last part of the period. On net, the series for men displays almost no change in rank correlation and hence no change in long-term mobility over the full period.
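A minimal sketch of the rank-correlation measure behind Figure VIII (illustrative; the two arrays are hypothetical eleven-year average earnings for the same individuals, both above the minimum threshold):

```python
import numpy as np

def rank_correlation(earn_base, earn_later):
    """Spearman-type rank correlation between eleven-year average earnings
    centered around year t and around year t + T, for the same individuals."""
    rank_base = np.argsort(np.argsort(earn_base))
    rank_later = np.argsort(np.argsort(earn_later))
    return np.corrcoef(rank_base, rank_later)[0, 1]
```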
FIGURE VIII
Long-Term Mobility: Rank Correlation in Eleven-Year Earnings Spans
The figure displays in year t the rank correlation between eleven-year average earnings centered around year t and eleven-year average earnings centered around year t + X, where X = ten, fifteen, twenty. The sample is defined as all individuals aged 25 to 60 in year t and t + X, with average eleven-year earnings around years t and t + X above the minimum threshold. Because of small sample size, series including earnings before 1957 are smoothed using a weighted three-year moving average with weight of 0.5 for cohort t and weights of 0.25 for t − 1 and t + 1. The same series are reported in lighter gray for the sample restricted to men only (in which case, rank is estimated within the sample of men only).
V.B. Cohort-Based Long-Term Inequality and Mobility

The analysis so far ignored changes in the age structure of the population as well as changes in the wage profiles over a career. We turn to cohort-level analysis to control for those effects. In principle, we could control for age (as well as other demographic changes) using a regression framework. In this paper, we focus exclusively on series without controls because they are more transparent, easier to interpret, and less affected by imputation issues. We defer a more comprehensive structural analysis of earnings processes to future work.31

31. An important strand of the literature on income mobility has developed covariance structure models to estimate such earnings processes. The estimates of such models are often difficult to interpret and sensitive to the specification (see, e.g., Baker and Solon [2003]). As a result, many recent contributions in the mobility literature have also focused on simple measures without using a complex framework (see, e.g., Congressional Budget Office [2007] and in particular the discussion in Shin and Solon [2008]).
FIGURE IX
Long-Term Earnings Gini Coefficients by Birth Cohort
Sample is career sample defined as follows for each career stage and birth cohort: all employees with average commerce and industry earnings (using average wage index) over the twelve-year career stage above the minimum threshold ($2,575 in 2004 and indexed on average wage for earlier years). Note that earnings can be zero for some years. Early career is from age 25 to 36, middle career is from age 37 to 48, late career is from age 49 to 60. Because of small sample size, series including earnings before 1957 are smoothed using a weighted three-year moving average with weight of 0.5 for cohort t and weights of 0.25 for t − 1 and t + 1.
We divide working lifetimes from age 25 to 60 into three stages. Early career is defined as from the calendar year the person reaches 25 to the calendar year the person reaches 36. Middle and late careers are defined similarly, from age 37 to 48 and age 49 to 60, respectively. For example, for a person born in 1944, the early career is calendar years 1969–1980, the middle career is 1981–1992, and the late career is 1993–2004. For a given year-of-birth cohort, we define the “core early career sample” as all individuals with average “commerce and industry” earnings over the twelve years of the early career stage above the minimum threshold (including zeros and using again the standard wage indexation). The “core mid-career” and “core late career” samples are defined similarly for each birth cohort. The earnings in early, mid-, and late career are defined as average “commerce and industry” earnings during the corresponding stage (always using the average wage index).

Figure IX reports the Gini coefficient series by year of birth for early, mid-, and late career. The Gini coefficients for men only are also displayed in lighter gray.
The cohort-based Gini coefficients are consistent with our previous findings and display a U shape over the full period. Three results are notable. First, there is much more inequality in late career than in middle career, and in middle career than in early career, showing that long-term inequality fans out over the course of a working life. Second, the Gini series show that long-term inequality has been stable for the baby-boom cohorts born after 1945 in the sample of all workers (we can observe only early- and mid-career inequality for those cohorts, as their late-career earnings are not complete by 2004). Those results are striking in light of our previous results showing a worsening of inequality in annual and five-year average earnings. Third, however, the Gini series for men only show that inequality has increased substantially across baby-boom cohorts born after 1945. This sharp contrast between series for all workers versus men only reinforces our previous findings that gender effects play an important role in shaping the trends in overall inequality. We also find that cohort-based rank mobility measures display stability or even slight decreases over the last five decades in the full sample, but that rank mobility has decreased substantially in the sample of men (figure omitted to save space). This confirms that the evolution of long-term mobility is heavily influenced by gender effects, to which we now turn.

V.C. The Role of Gender Gaps in Long-Term Inequality and Mobility

As we saw, there are striking differences in the long-term inequality and mobility series for all workers vs. for men only: long-term inequality has increased much less in the sample of all workers than in the sample of men only, and long-term mobility has increased over the last four decades in the sample of all workers, but not in the sample of men only. Such differences can be explained by the reduction in the gender gap that has taken place over the period.

Figure X plots the fraction of women in our core sample and in various upper earnings groups: the fourth quintile group (P60–80), the ninth decile group (P80–90), the top decile group (P90–100), and the top percentile group (P99–100). As adult women aged 25 to 60 are about half of the adult population aged 25 to 60, those fractions should be approximately 0.5 if there were no gender differences in earnings.
FIGURE X
Gender Gap in Upper Earnings Groups
Sample is core sample (commerce and industry employees aged 25 to 60; see Figure I). The figure displays the fraction of women in various groups. P60–80 denotes the fourth quintile group from percentile 60 to percentile 80, P90–100 denotes the top 10%, etc. Because of top-coding in the micro data, estimates from 1943 to 1950 for P80–90 and P90–100 are estimated using published tabulations in Social Security Administration (1937–1952, 1967) and reported in lighter gray.
Those representation indices with no adjustment capture the total realized earnings gap, including labor supply decisions.32 We use those representation indices instead of the traditional ratio of mean (or median) female earnings to male earnings because such representation indices remain meaningful in the presence of differential changes in labor force participation or in the wage structure across genders, and we do not have covariates to control for such changes, as is done in survey data (see, e.g., Blau, Ferber, and Winkler [2006]).

Two elements in Figure X are worth noting. First, the fraction of women in the core sample of commerce and industry workers has increased from around 23% in 1937 to about 44% in 2004. World War II generated a temporary surge in women's labor force participation, two-thirds of which was reversed immediately after the war.33 Women's labor force participation has been steadily and continuously increasing since the mid-1950s and has been stable at around 43%–44% since 1990.

32. As a result, they combine not only the traditional wage gap between males and females but also the labor force participation gap (including the decision to work in the commerce and industry sector rather than other sectors or self-employment).

33. This is consistent with the analysis of Goldin (1991), who uses unique micro survey data covering women's workforce history from 1940 to 1951.
Second, Figure X shows that the representation of women in upper earnings groups has increased significantly over the last four decades and in a staggered time pattern across upper earnings groups.34 For example, the fraction of women in P60–80 starts to increase in 1966 from around 8% and reaches about 34% in the early 1990s, and has remained roughly stable since then. The fraction of women in the top percentile (P99–100) does not really start to increase significantly before 1980. It grows from around 2% in 1980 to almost 14% in 2004 and is still quickly increasing. Those results show that the representation of women in top earnings groups has increased substantially over the last three to four decades. They also suggest that the economic progress of women is likely to impact measures of upward mobility significantly, as many women are likely to move up the earnings distribution over their lifetimes. Indeed, we have found that such gender effects are strongest in upward mobility series such as the probability of moving from the bottom two quintile groups (those earning less than $25,500 in 2004) to the top quintile group (those earning over $59,000 in 2004) over a lifetime.

Figure XI displays such upward mobility series, defined as the probability of moving from the bottom two quintile groups to the top quintile group after twenty years (conditional on being in the “long-term core sample” in both year t and year t + 20) for all workers, men, and women.35 The figure shows striking heterogeneity across groups. First, men have much higher levels of upward mobility than women. Thus, in addition to the annual earnings gap we documented, there is an upward mobility gap across groups as well. Second, the upward mobility gap has also been closing over time: the probability of upward mobility among men has been stable overall since World War II, with a slight increase up to the 1960s and a decline after the 1970s. In contrast, the probability of upward mobility of women has continuously increased from a very low level of less than 1% in the 1950s to about 7% in the 1980s.

34. There was a surge in women in P60–80 during World War II, but this was entirely reversed by 1948. Strikingly, women were better represented in upper groups in the late 1930s than in the 1950s.

35. Note that quintile groups are always defined based on the sample of all workers, including both male and female workers.
FIGURE XI
Long-Term Upward Mobility: Gender Effects
The figure displays in year t the probability of moving to the top quintile group (P80–100) for eleven-year average earnings centered around year t + 20 conditional on having eleven-year average earnings centered around year t in the bottom two quintile groups (P0–40). The sample is defined as all individuals aged 25 to 60 in year t and t + 20, with average eleven-year “commerce and industry” earnings around years t and t + 20 above the minimum threshold. Because of small sample size, series including earnings before 1957 are smoothed using a weighted three-year moving average with weight of 0.5 for cohort t and weights of 0.25 for t − 1 and t + 1. The series are reported for all workers, men only, and women only. In all three cases, quintile groups are defined based on the sample of all workers.
The increase in upward mobility for women compensates for the stagnation or slight decline in mobility for men, so that upward mobility among all workers is slightly increasing.36 Figure XI also suggests that the gains in female annual earnings we documented above were in part due to earnings gains of women already in the labor force rather than entirely due to the entry of new cohorts of women with higher earnings. Such gender differential results are robust to conditioning on birth cohort, as series of early- to late-career upward mobility display a very similar evolution over time (see Online Appendix Figure A.10). Hence, our upward mobility results show that the economic progress of women since the 1960s has had a large impact on long-term mobility series among all U.S. workers. Table III summarizes the long-term inequality and mobility results for all (Panel A), men (Panel B), and women (Panel C) by reporting measures for selected eleven-year spans (1950–1960, 1973–1983, and 1994–2004).

36. It is conceivable that upward mobility is lower for women because, even within P0–40, they are more likely to be in the bottom half of P0–40 than men. Kopczuk, Saez, and Song (2007) show that controlling for those differences leaves the series virtually unchanged. Therefore, controlling for base earnings does not affect our results.
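As an illustration of the upward-mobility measure reported in Figure XI and in column (4) of Table III below, a minimal sketch (the earnings arrays are hypothetical eleven-year averages for the same individuals; quintile cutoffs are always taken from the all-worker distribution):

```python
import numpy as np

def upward_mobility(earn_base, earn_later, all_base, all_later):
    """Probability of moving from the bottom two quintile groups (P0-40) of
    eleven-year average earnings to the top quintile group (P80-100) twenty
    years later, with quintile cutoffs from the all-worker distributions."""
    in_bottom_40 = earn_base <= np.percentile(all_base, 40)
    in_top_20 = earn_later >= np.percentile(all_later, 80)
    return (in_bottom_40 & in_top_20).sum() / in_bottom_40.sum()
```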
TABLE III
LONG-TERM INEQUALITY AND MOBILITY

Year    11-year earnings   Rank correlation   Upward mobility   #Workers
        average Gini       after 20 years     after 20 years     ('000s)

A. All
1956    0.437              0.572              0.037             42,753
1978    0.477              0.494              0.053             61,828
1999    0.508                                                   94,930

B. Men
1956    0.376              0.465              0.084             27,952
1978    0.429              0.458              0.071             37,187
1999    0.506                                                   52,761

C. Women
1956    0.410              0.361              0.008             14,801
1978    0.423              0.358              0.041             24,641
1999    0.459                                                   42,169

Notes. The table displays various measures of eleven-year average earnings inequality and long-term mobility centered around selected years, 1956, 1978, and 1999, for all workers (Panel A), men (Panel B), and women (Panel C). The sample in year t is defined as all employees with commerce and industry earnings averaged across the eleven-year span from t − 5 to t + 5 above a minimum threshold ($2,575 in 2004 and indexed using average wage for earlier years) and aged 25 to 60 (by January 1 of year t). Column (2) reports the Gini coefficients for those eleven-year earnings averages. Column (3) reports the rank correlation between eleven-year average earnings centered around year t and eleven-year average earnings centered around year t + 20 in the sample of workers (1) aged between 25 and 60 in both years t and t + 20, and (2) with eleven-year average earnings above the minimum threshold in both earnings spans t − 5 to t + 5 and t + 15 to t + 25. Column (4) reports the probability of moving to the top quintile group (P80–100) for eleven-year average earnings centered around year t + 20 conditional on having eleven-year average earnings centered around year t in the bottom two quintile groups (P0–40). The sample is the same as in column (3). Column (5) reports the number of workers in thousands.
VI. CONCLUSIONS

Our paper has used U.S. Social Security earnings administrative data to construct series of inequality and mobility in the United States since 1937. The analysis of these data has allowed us to start exploring the evolution of mobility and inequality over a lifetime as well as to complement the more standard analysis of annual inequality and short-term mobility in several ways. We found that changes in short-term mobility have not substantially affected the evolution of inequality, so that annual snapshots of the distribution provide a good approximation of the evolution of the longer-term measures of inequality. In particular, we find that increases in annual earnings inequality are driven almost entirely by increases in permanent earnings inequality, with much more modest changes in the variability of transitory earnings.
However, our key finding is that although the overall measures of mobility are fairly stable, they hide heterogeneity by gender groups. Inequality and mobility among male workers have worsened along almost any dimension since the 1950s: our series display sharp increases in annual earnings inequality, slight reductions in short-term mobility, and large increases in long-term inequality with slight reduction or stability of long-term mobility. Against those developments stand the very large earnings gains achieved by women since the 1950s, due to increases in labor force attachment as well as increases in earnings conditional on working. Those gains have been so great that they have substantially reduced long-term inequality in recent decades among all workers, and actually almost exactly compensate for the increase in inequality for males.

COLUMBIA UNIVERSITY AND NATIONAL BUREAU OF ECONOMIC RESEARCH
UNIVERSITY OF CALIFORNIA BERKELEY AND NATIONAL BUREAU OF ECONOMIC RESEARCH
SOCIAL SECURITY ADMINISTRATION
THE ROLE OF THE STRUCTURAL TRANSFORMATION IN AGGREGATE PRODUCTIVITY∗

MARGARIDA DUARTE AND DIEGO RESTUCCIA

We investigate the role of sectoral labor productivity in explaining the process of structural transformation—the secular reallocation of labor across sectors—and the time path of aggregate productivity across countries. We measure sectoral labor productivity across countries using a model of the structural transformation. Productivity differences across countries are large in agriculture and services and smaller in manufacturing. Over time, productivity gaps have been substantially reduced in agriculture and industry but not nearly as much in services. These sectoral productivity patterns generate implications in the model that are broadly consistent with the cross-country data. We find that productivity catch-up in industry explains about 50% of the gains in aggregate productivity across countries, whereas low productivity in services and the lack of catch-up explain all the experiences of slowdown, stagnation, and decline observed across countries.
I. INTRODUCTION

It is a well-known observation that over the last fifty years countries have experienced remarkably different paths of economic performance.1 Looking at the behavior of GDP per hour in individual countries relative to that in the United States, we find experiences of sustained catch-up, catch-up followed by a slowdown, stagnation, and even decline. (See Figure I for some illustrative examples.2) Consider, for instance, the experience of Ireland. Between 1960 and 2004, GDP per hour in Ireland relative to that of the United States rose from about 35% to 75%.3 Spain also experienced a period of rapid catch-up to the United States from 1960 to around 1990, a period during which relative GDP per hour rose from about 35% to 80%. Around 1990, however, this process slowed down dramatically and relative GDP per hour in Spain stagnated and later declined. Another remarkable growth experience is that of New Zealand, where GDP per hour fell from about 70% to 60% of that of the United States between 1970 and 2004.

∗ We thank Robert Barro, three anonymous referees, and Francesco Caselli for very useful and detailed comments. We also thank Tasso Adamopoulos, John Coleman, Mike Dotsey, Gary Hansen, Gueorgui Kambourov, Andrés Rodríguez-Clare, Richard Rogerson, Marcelo Veracierto, Xiaodong Zhu, and seminar participants at several conferences and institutions for comments and suggestions. Andrea Waddle provided excellent research assistance. All errors are our own. We gratefully acknowledge support from the Connaught Fund at the University of Toronto (Duarte) and the Social Sciences and Humanities Research Council of Canada (Restuccia). [email protected], [email protected].
1. See Chari, Kehoe, and McGrattan (1996), Jones (1997), Prescott (2002), and Duarte and Restuccia (2006), among many others.
2. We use GDP per hour as our measure of economic performance. Throughout the paper we refer to labor productivity, output per hour, and GDP per hour interchangeably.
3. All numbers reported refer to data trended using the Hodrick–Prescott filter. See Section II for details.
4. See Baumol (1967) for a discussion of the implications of structural change on aggregate productivity growth.

FIGURE I
Relative GDP per Hour—Some Countries
GDP per hour in each country relative to that of the United States.

Along their modern paths of development, countries undergo a process of structural transformation by which labor is reallocated among agriculture, industry, and services. Over the last fifty years many countries have experienced substantial amounts of labor reallocation across sectors. For instance, from 1960 to 2004 the share of hours in agriculture in Spain fell from 44% to 6%, while the share of hours in services rose from 25% to 64%. In about the same period, the share of hours in agriculture in Belgium fell from 7% to 2%, while the share in services rose from 43% to 72%.

In this paper we study the behavior of GDP per hour over time from the perspective of sectoral productivity and the structural transformation.4 Does a sectoral analysis contribute to the
understanding of aggregate productivity paths? At a qualitative level the answer to this question is clearly yes. Because aggregate labor productivity is the sum of labor productivity across sectors weighted by the share of hours in each sector, the structural transformation matters for aggregate productivity. At a quantitative level the answer depends on whether there are substantial differences in sectoral labor productivity across countries. Our approach in this paper is to first develop a simple model of the structural transformation that is calibrated to the growth experience of the United States. We then use the model to measure sectoral labor productivity differences across countries at a point in time. These measures, together with data on growth in sectoral labor productivity, imply time paths of sectoral labor productivity for each country. We use these measures of sectoral productivity in the model to assess their quantitative effect on labor reallocation and aggregate productivity outcomes across countries. We find that there are large and systematic differences in sectoral labor productivity across countries. In particular, differences in labor productivity levels between rich and poor countries are larger in agriculture and services than in manufacturing. Moreover, over time, productivity gaps have been substantially reduced in agriculture and industry but not nearly as much in services. To illustrate the implications of these sectoral differences for aggregate productivity, imagine that productivity gaps remain constant as countries undergo the structural transformation. Then as developing countries reallocate labor from agriculture to manufacturing, aggregate productivity can catch up as labor is reallocated from a low–relative productivity sector to a high–relative productivity sector. Countries further along the structural transformation can slow down, stagnate, and decline as labor is reallocated from industry (a high–relative productivity sector) to services (a low–relative productivity sector). When the time series of sectoral productivity are fed into the model of the structural transformation, we find that high growth in labor productivity in industry relative to that of the United States explains about 50% of the catch-up in relative aggregate productivity across countries. Although there is substantial catch-up in agricultural productivity, we show that this factor contributes little to aggregate productivity gains in our sample countries. In addition, we show that low relative productivity in services and the lack of catch-up explain all the experiences of slowdown, stagnation, and decline in relative aggregate productivity observed across countries.
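To make the weighting explicit, the identity behind this statement can be written out (in the paper's own notation, and restated in Section IV), with Y and L denoting aggregate value added and hours and Y_i, L_i their sectoral counterparts:

\frac{Y}{L} \;=\; \sum_{i\in\{a,m,s\}} \frac{Y_i}{L_i}\,\frac{L_i}{L} .

The aggregate therefore moves either because sectoral productivities Y_i/L_i change or because hours shares L_i/L shift toward sectors with higher or lower relative productivity, which is the reallocation margin at work in the thought experiment above.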
We construct a panel data set on PPP-adjusted real output per hour and disaggregated output and hours worked for agriculture, industry, and services. Our panel data include 29 countries with annual data covering the period from 1956 to 2004 for most countries.5 From these data, we document three basic facts. First, countries follow a common process of structural transformation characterized by a declining share of hours in agriculture over time, an increasing share of hours in services, and a hump-shaped share of hours in industry. Second, there is substantial lag in the process of structural transformation for some countries, and this lag is associated with the level of relative income. Third, there are sizable and systematic differences in sectoral growth rates of labor productivity across countries. In particular, most countries observe higher growth rates of labor productivity in agriculture and manufacturing than in services. In addition, countries with high rates of aggregate productivity growth tend to have much higher productivity growth in agriculture and manufacturing than the United States, but this strong relative performance is not observed in services. Countries with low rates of aggregate labor productivity growth tend to observe low labor productivity growth in all sectors. We develop a general equilibrium model of the structural transformation with three sectors—agriculture, industry, and services. Following Rogerson (2008), labor reallocation across sectors is driven by two channels: income effects due to nonhomothetic preferences and substitution effects due to differential productivity growth across sectors.6 We calibrate the model to the structural transformation of the United States between 1956 and 2004. A model of the structural transformation is essential for the purpose of this paper for two reasons. First, we use the calibrated model to measure sectoral productivity differences across countries at one point in time. This step is needed because of the lack of comparable (PPP-adjusted) sectoral output data across a large set of countries. Second, the process of structural transformation is endogenous to the level and changes over time in sectoral labor productivity. As a result, a quantitative assessment of the aggregate implications of sectoral productivity differences requires that 5. Our sample does not include the poorest countries in the world: the labor productivity ratio between the richest and poorest countries in our data is only 10:1. 6. For recent models of the structural transformation emphasizing nonhomothetic preferences, see Kongsamut, Rebelo, and Xie (2001), and emphasizing substitution effects see Ngai and Pissarides (2007).
changes in the distribution of labor across sectors be consistent with sectoral productivity paths.7 The model implies that sectoral productivity levels in the first year in the sample tend to be lower in poor than in rich countries, particularly in agriculture and services, and the model implies low dispersion in productivity levels in manufacturing across countries. We argue that these differences in sectoral labor productivity levels implied by the model are consistent with the available evidence from studies using producer and micro data for specific sectors, for instance, Baily and Solow (2001) for manufacturing and service sectors and Restuccia, Yang, and Zhu (2008) for agriculture. These productivity levels together with data on sectoral labor productivity growth for each country imply time paths for sectoral productivity. Given these time paths, the model reproduces the broad patterns of labor reallocation and aggregate productivity growth across countries. The model also has implications for sectoral output and relative prices that are broadly consistent with the cross-country data. This paper is related to a large literature studying income differences across countries. Closely connected is the literature studying international income differences in the context of models with delay in the start of modern growth.8 Because countries in our data set have started the process of structural transformation well before the first year in the sample period, our focus is on measuring sectoral productivity across countries at a point in time and on assessing the role of their movement over time in accounting for the patterns of structural transformation and aggregate productivity growth across countries.9 Our paper is also closely related to a literature that emphasizes the sectoral composition of the economy in aggregate outcomes, for instance, Caselli and Coleman (2001), C´ordoba and Ripoll (2004), Coleman (2007), Chanda and Dalgaard (2008), Restuccia, Yang, and Zhu (2008), Adamopoulos and Akyol (2009), and Vollrath (2009).10 In studying the role of the structural transformation for cross-country aggregate productivity catch-up, our paper is closest to that of Caselli 7. This is in sharp contrast to the widely followed shift-share analysis approach where aggregate productivity changes are decomposed into productivity changes within sectors and labor reallocation. 8. See, for instance, Lucas (2000), Hansen and Prescott (2002), Ngai (2004), and Gollin, Parente, and Rogerson (2002). 9. Herrendorf and Valentinyi (2006) also consider a model to measure sectoral productivity levels across countries but instead use expenditure data from the Penn World Table. 10. See also the survey article by Caselli (2005) and the references therein.
and Tenreyro (2006). We differ in that we use a model of the structural transformation to measure sectoral productivity levels and to assess the contribution of sectoral productivity for aggregate growth. In studying labor productivity over time, our paper is related to a literature studying country episodes of slowdown and depression.11 Most of this literature focuses on the effect of exogenous movements in aggregate total factor productivity and aggregate distortions on GDP relative to trend. We differ from this literature by emphasizing the importance of sectoral productivity in the structural transformation and the secular movements in relative GDP per hour across countries. The paper is organized as follows. In the next section we document some facts about the process of structural transformation and sectoral labor productivity growth across countries. Section III describes the economic environment and calibrates a benchmark economy to U.S. data for the period between 1956 and 2004. In Section IV we discuss the quantitative experiment and perform counterfactual analysis. We conclude in Section V.
II. SOME FACTS

In this section we document the process of structural transformation and labor productivity growth in agriculture, industry, and services for the countries in our data set. Because we focus on long-run trends, data are trended using the Hodrick–Prescott filter with a smoothing parameter λ = 100. The Appendix provides a detailed description of the data.

11. See Kehoe and Prescott (2002) and the references therein.
12. See, for instance, Kuznets (1966) and Maddison (1980), among others.

II.A. The Process of Structural Transformation

The reallocation of labor across sectors over time is typically referred to in the economic development literature as the process of structural transformation. This process has been extensively documented.12 The structural transformation is characterized by a systematic fall over time in the share of labor allocated to agriculture, by a steady increase in the share of labor in services, and by a hump-shaped pattern for the share of labor in manufacturing. That is, the typical process of sectoral reallocation involves an increase in the share of labor in manufacturing in the early
stages of the reallocation process, followed by a decrease in the later stages.13 We document the processes of structural transformation in our data set by focusing on the distribution of labor hours across sectors. We note, however, that this characterization is very similar to the one obtained by looking at shares of employment. Our panel data cover countries at very different stages in the process of structural transformation. For instance, our data include countries that in 1960 allocated about 70% of their labor hours to agriculture (e.g., Turkey and Bolivia), as well as countries that in the same year had shares of hours in agriculture below 10% (e.g., the United Kingdom). Despite this diversity, all countries in the sample follow a common process of structural transformation. First, all countries exhibit declining shares of hours in agriculture, even the most advanced countries in this process, such as the United Kingdom and the United States. Second, countries at an early stage of the process of structural transformation exhibit a hump-shaped share of hours in industry, whereas this share is decreasing for countries at a more advanced stage. Finally, all countries exhibit an increasing share of hours in services. To illustrate these features, Figure II plots sectoral shares of hours for Greece, Ireland, Spain, and Canada. The processes of structural transformation observed in our sample suggest two additional observations. First, the lag in the structural transformation observed across countries is systematically related to the level of development: poor countries have the largest shares of hours in agriculture, while rich countries have the smallest shares.14 Second, our data suggest the basic tendency for countries that start the process of structural transformation later to accomplish a given amount of labor reallocation faster than those countries that initiated this process earlier.15 13. In this paper we refer to manufacturing and industry interchangeably. In the Appendix we describe in detail our definition of sectors in the data. 14. See, for instance, Gollin, Parente, and Rogerson (2007) and Restuccia, Yang, and Zhu (2008) for a detailed documentation of this fact for shares of employment across a wider set of countries. 15. According to the U.S. Census Bureau (1975), Historical Statistics of the United States, the distribution of employment in the United States circa 1870 resembles that of Portugal in 1950. By 1948 the sectoral shares in the United States were 0.10, 0.34, and 0.56, levels that Portugal reached sometime during the 1990s. Although Portugal is lagging behind the process of structural transformation of the United States, it has accomplished about the same reallocation of labor across sectors in less than half the time (39 years as opposed to 89 years in the United States). See Duarte and Restuccia (2007) for a detailed documentation of these observations.
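Since all series in this section are trended with the Hodrick–Prescott filter (λ = 100 for annual data, as stated above), a minimal sketch of that step is given below; the synthetic series and its numbers are placeholders rather than the actual data.

import numpy as np
import pandas as pd
from statsmodels.tsa.filters.hp_filter import hpfilter

# Placeholder annual series for a share of hours in agriculture, 1960-2004.
years = np.arange(1960, 2005)
rng = np.random.default_rng(0)
raw = 0.44 * np.exp(-0.045 * (years - 1960)) + 0.01 * rng.standard_normal(years.size)
share_agr = pd.Series(raw, index=years)

# lamb = 100 is the smoothing parameter for annual data used in the text.
cycle, trend = hpfilter(share_agr, lamb=100)

The trend component is what the figures and growth-rate calculations in this section are based on.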
FIGURE II Shares of Hours—Some Countries
II.B. Sectoral Labor Productivity Growth

For the United States, the annualized growth rate of labor productivity between 1956 and 2004 has been highest in agriculture (3.8%), second in industry (2.4%), and lowest in services (1.3%).16 This ranking of growth rates of labor productivity across sectors is observed in 23 of the 29 countries in our sample, and in all countries but Venezuela, the growth rate in services is the smallest. Nevertheless, there is an enormous variation in sectoral labor productivity growth across countries.

16. The annualized percentage growth rate of variable x over the period t to t + T is computed as [(x_{t+T}/x_t)^{1/T} − 1] × 100.

Figure III plots the annualized growth rate of labor productivity in each sector against the annualized growth rate of aggregate labor productivity for all countries in our data set. The sectoral growth rate of the United States in each panel is identified by the horizontal dashed line, whereas the vertical dashed
FIGURE III
Sectoral Growth Rates of Labor Productivity (%)
Aggregate labor productivity is GDP per hour, whereas sectoral labor productivity is value added per hour in each sector. Annualized percentage growth rates during the sample period are given for each country (horizontal axis: annualized growth rate of aggregate labor productivity). The horizontal lines indicate the sectoral growth rates observed in the United States, and the vertical line indicates the aggregate growth rate of the United States.
line marks the growth rate of aggregate productivity of the United States. This figure documents the tendency for countries to feature higher growth rates of labor productivity in agriculture and manufacturing than in services. For instance, in our panel, the average growth rates in agriculture and manufacturing are 4.0% and 3.1%, whereas the average growth rate in services is 1.3%. Figure III also illustrates that countries with low aggregate labor productivity growth relative to the United States tend to have low productivity growth in all sectors (e.g., Latin American countries), whereas countries with high relative aggregate labor productivity growth tend to have higher productivity growth than the United States in agriculture and, especially, industry (e.g., European countries, Japan, and Korea). For the countries that grew faster than the United States in aggregate productivity,
labor productivity growth exceeds that for the United States by, on average, 1 percentage point in agriculture and 1.5 percentage points in industry. In contrast, labor productivity growth in services for these countries exceeds that for the United States by only 0.4 percentage point. The fact is that few countries have observed a much higher growth rate of labor productivity in services than the United States. These features of the data motivate some of the counterfactual exercises we perform in Section IV.
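A minimal sketch of the growth-rate calculation in footnote 16 is given below; the sector endpoints are made-up values chosen only so that the output roughly matches the U.S. rates quoted at the start of this subsection.

def annualized_growth(x_start, x_end, n_years):
    # Footnote 16: [(x_{t+T} / x_t)**(1/T) - 1] * 100, in percent per year.
    return ((x_end / x_start) ** (1.0 / n_years) - 1.0) * 100.0

# Hypothetical value added per hour in 1956 and 2004 (48 years apart).
endpoints = {"agriculture": (10.0, 60.0), "industry": (20.0, 62.0), "services": (25.0, 47.0)}
for sector, (v1956, v2004) in endpoints.items():
    print(sector, round(annualized_growth(v1956, v2004, 2004 - 1956), 1))
# Prints roughly 3.8, 2.4, and 1.3 for agriculture, industry, and services.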
III. ECONOMIC ENVIRONMENT

We develop a simple model of the structural transformation of an economy where at each date three goods are produced: agriculture, industry, and services. Following Rogerson (2008), labor reallocation across sectors is driven by two forces—an income effect due to nonhomothetic preferences and a substitution effect due to differential productivity growth between industry and services. We calibrate a benchmark economy to U.S. data and show that this basic framework captures the salient features of the structural transformation in the United States from 1956 to 2004.

III.A. Description

Production. At each date three goods are produced—agriculture (a), manufacturing (m), and services (s)—according to the following constant returns to scale production functions:

(1)    Yi = Ai Li,    i ∈ {a, m, s},

where Yi is output in sector i, Li is labor input in sector i, and Ai is a sector-specific technology parameter.17 When mapping the model to data, we associate the labor input Li with hours allocated to sector i. We assume that there is a continuum of homogeneous firms in each sector that are competitive in goods and factor markets. At each date, given the price of good i, pi, and wages w, a representative firm in sector i solves

(2)    max_{Li ≥ 0} { pi Ai Li − w Li }.

17. We note that labor productivity in each sector is summarized in the model by the productivity parameter Ai. There are many features that can explain differences over time and across countries in labor productivity, such as capital intensity and factor endowments. Accounting for these sources can provide a better understanding of labor productivity facts. Our analysis abstracts from the sources driving labor productivity observations.
Households. The economy is populated by an infinitely lived representative household of constant size. Without loss of generality we normalize the population size to one. The household is endowed with L units of time each period, which are supplied inelastically to the market. We associate L with total hours per capita in the data. The household has preferences over consumption goods as follows:

Σ_{t=0}^{∞} β^t u(ca,t, ct),    β ∈ (0, 1),

where ca,t is the consumption of agricultural goods at date t and ct is the consumption of a composite of manufacturing and service goods at date t. The per-period utility is given by

u(ca,t, ct) = a log(ca,t − ā) + (1 − a) log(ct),    a ∈ [0, 1],

where ā > 0 is a subsistence level of agricultural goods below which the household cannot survive. This feature of preferences has a long tradition in the development literature and it has been emphasized as a quantitatively important feature leading to the movement of labor away from agriculture in the process of structural transformation.18 The composite nonagricultural consumption good ct is given by

ct = [b c_{m,t}^ρ + (1 − b)(c_{s,t} + s̄)^ρ]^{1/ρ},

where s̄ > 0, b ∈ (0, 1), and ρ < 1. For s̄ > 0, these preferences imply that the income elasticity of service goods is greater than one. We note that s̄ works as a negative subsistence consumption level—when the income of the household is low, less resources are allocated to the production of services, and when the income of the household increases, resources are reallocated to services. The parameter s̄ can also be interpreted as a constant level of production of service goods at home.

18. See, for instance, Echevarria (1997), Laitner (2000), Caselli and Coleman (2001), Kongsamut, Rebelo, and Xie (2001), Gollin, Parente, and Rogerson (2002), and Restuccia, Yang, and Zhu (2008).

Our approach to modeling the
home sector for services is reduced-form. Rogerson (2008) considers a generalization of this feature where people can allocate time to market and nonmarket production of service goods. However, we argue that our simplification is not as restrictive as it may first appear, because we abstract from the allocation of time between market and nonmarket activities. Our focus is on the determination of aggregate productivity from the allocation of time across market sectors.

Because we abstract from intertemporal decisions, the problem of the household is effectively a sequence of static problems.19 At each date and given prices, the household chooses consumption of each good to maximize the per-period utility subject to the budget constraint. Formally,

(3)    max_{ci ≥ 0}  a log(ca − ā) + (1 − a) log [b cm^ρ + (1 − b)(cs + s̄)^ρ]^{1/ρ}

subject to pa ca + pm cm + ps cs = wL.

Market Clearing. The demand for labor from firms must equal the exogenous supply of labor by households at every date:

(4)    La + Lm + Ls = L.

Also, at each date, the market for each good produced must clear:

(5)    ca = Ya,    cm = Ym,    cs = Ys.

19. Because we are abstracting from intertemporal decisions such as investment, our analysis is not crucially affected by alternative stochastic assumptions on the time path for labor productivity.
III.B. Equilibrium

A competitive equilibrium is a set of prices {pa, pm, ps}, allocations {ca, cm, cs} for the household, and allocations {La, Lm, Ls} for firms such that (i) given prices, the firm's allocations {La, Lm, Ls} solve the firm's problem in (2); (ii) given prices, the household's allocations {ca, cm, cs} solve the household's problem in (3); and (iii) markets clear: equations (4) and (5) hold.

The first-order condition from the firm's problem implies that the benefit and cost of a marginal unit of labor must be equal. Normalizing the wage rate to one, this condition implies that prices of goods are inversely related to productivity:

(6)    pi = 1/Ai.

Note that in the model, price movements are driven solely by labor productivity changes. The first-order conditions for consumption imply that the labor input in agriculture is given by

(7)    La = (1 − a) ā/Aa + a (L + s̄/As).
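For completeness, here is one way to obtain (7) from the first-order conditions of problem (3); this is a sketch of the algebra in the paper's notation, with λ the multiplier on the budget constraint, c the composite good defined above, the wage normalized to one, and prices given by (6):

\frac{a}{c_a-\bar{a}}=\lambda p_a,\qquad
(1-a)\,\frac{b\,c_m^{\rho-1}}{c^{\rho}}=\lambda p_m,\qquad
(1-a)\,\frac{(1-b)\,(c_s+\bar{s})^{\rho-1}}{c^{\rho}}=\lambda p_s .

Multiplying the last two conditions by $c_m$ and $(c_s+\bar{s})$ and adding gives $(1-a)=\lambda\,[\,p_m c_m+p_s(c_s+\bar{s})\,]$, while the first gives $a=\lambda\,(p_a c_a-p_a\bar{a})$. Summing and using the budget constraint $p_a c_a+p_m c_m+p_s c_s=L$ yields

\lambda=\frac{1}{L+p_s\bar{s}-p_a\bar{a}},\qquad
p_a c_a=(1-a)\,p_a\bar{a}+a\,(L+p_s\bar{s}).

With $p_i=1/A_i$ and $c_a=A_a L_a$ from market clearing, the last expression is exactly (7).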
When a = 0, the household consumes ā of agricultural goods each period, and labor allocation in agriculture depends only on the level of labor productivity in that sector. When productivity in agriculture increases, labor moves away from the agricultural sector. This restriction on preferences implies that output and consumption per capita of agricultural goods are constant over time, implications that are at odds with data. When a > 0 and productivity growth is positive in all sectors, the share of labor allocated to agriculture converges asymptotically to a and the nonhomothetic terms in preferences become asymptotically irrelevant in the determination of the allocation of labor. In this case, output and consumption per capita of agricultural goods grow at the rate of labor productivity.

The first-order conditions for consumption of manufacturing and service goods imply that

[b/(1 − b)] (cm/(cs + s̄))^{ρ−1} = pm/ps.

This equation can be rewritten as

(8)    Lm = (L − La + s̄/As) / (1 + x),

where

x ≡ (b/(1 − b))^{1/(ρ−1)} (Am/As)^{ρ/(ρ−1)},
and La is given by (7).20 Equation (8) reflects the two forces that drive labor reallocation between manufacturing and services in the model. First, suppose that preferences are homothetic (i.e., s̄ = 0). In this case, Lm/Ls = 1/x and differential productivity growth in manufacturing relative to services is the only source of labor reallocation between these sectors (through movements in x) as long as ρ is not equal to zero. In particular, when s̄ = 0, the model can be consistent with the observed labor reallocation from manufacturing into services as labor productivity grows in the manufacturing sector relative to services if the elasticity of substitution between these goods is low (ρ < 0). Second, suppose that s̄ > 0 (i.e., preferences are nonhomothetic) and that either labor productivity grows at the same rate in manufacturing and services, or ρ = 0, so that x is constant. Then, for a given La, productivity improvements lead to the reallocation of labor from manufacturing into services (services are more income-elastic). The model allows both channels to be operating during the structural transformation.

20. When the growth rates of sectoral labor productivity are positive, the model implies that, in the long run, the shares of hours in manufacturing and services asymptote to constants that depend on the preference parameters a, b, ρ and any permanent level difference in labor productivity between manufacturing and services. If productivity growth in manufacturing is higher than in services, then the share of hours in manufacturing asymptotes to 0 and the share of hours in services to (1 − a).

III.C. Calibration

We calibrate a benchmark economy to U.S. data for the period from 1956 to 2004. Our calibration strategy involves selecting parameter values so that the equilibrium of the model matches the salient features of the structural transformation for the United States during this period. We assume that a period in the model is one year. We need to select parameter values for a, b, ρ, ā, s̄, and the time series of productivity for each sector Ai,t for t from 1956 to 2004 and i ∈ {a, m, s}. We proceed as follows. First, we normalize productivity levels across sectors to one in 1956; that is, Ai,1956 = 1 for all i ∈ {a, m, s}. Then we use data on the growth rate of sectoral value added per hour in the United States to obtain the time paths of sectoral labor productivity. In particular, denoting as γi,t the growth rate of labor productivity in sector i at date t, we obtain the time path of labor productivity in each sector as Ai,t+1 = (1 + γi,t)Ai,t. Second, with positive productivity growth in all sectors, the share of hours
TABLE I
PARAMETER VALUES AND U.S. DATA TARGETS

Parameter                    Value    Target
Ai,1956                      1.0      Normalization
{Aa,t}, t = 1957–2004        {·}      Productivity growth in agriculture
{Am,t}, t = 1957–2004        {·}      Productivity growth in industry
{As,t}, t = 1957–2004        {·}      Productivity growth in services
a                            0.01     Long-run share of hours in agriculture
ā                            0.11     Share of hours in agriculture 1956
s̄                            0.89     Share of hours in industry 1956
b                            0.04     Share of hours in industry 1957–2004
ρ                            −1.5     Aggregate productivity growth
in agriculture converges to a in the long run. Because the share of hours in agriculture has been falling systematically and was about 3% in 2004, we assume a long-run share of 1%. Although this target is somewhat arbitrary, our main results are not sensitive to this choice. Third, given values for ρ and b, ā and s̄ are chosen to match the shares of hours in agriculture and manufacturing in the United States in 1956 using equations (7) and (8). Finally, b and ρ are jointly chosen to match as closely as possible the share of hours in manufacturing over time and the annualized growth rate of aggregate productivity. The annualized growth rate in labor productivity in the United States between 1956 and 2004 is roughly 2%. Table I summarizes the calibrated parameters and targets.

The shares of hours implied by the model are reported in Figure IV (dotted lines), together with data on the shares of hours in the United States (solid lines). The equilibrium allocation of hours across sectors in the model closely matches the process of structural transformation in the United States during the calibrated period. The model implies a fall in the share of hours in manufacturing from about 39% in 1956 to 24% in 2004, whereas the share of hours in services increases from about 49% to 73% during this period.21 Notice that even though the calibration only targets the share of hours in agriculture in 1956 (13%), the model implies a time path for the equilibrium share of hours in agriculture that is remarkably close to the data, declining to about 3% in 2004.

21. We emphasize that the model can deliver a hump-shaped pattern for labor in manufacturing for less developed economies even though during the calibrated period the U.S. economy is already in the second stage of the structural transformation, whereby labor is being reallocated away from manufacturing.
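A minimal sketch of the static part of this calibration is given below, under the stated normalization A_{i,1956} = 1 and with the time endowment L set to one. The 1956 hours shares are the targets quoted in the text, and the function and variable names are illustrative rather than the authors' actual code.

# Parameters from Table I and 1956 targets from the text; L normalized to 1.
a, b, rho, L = 0.01, 0.04, -1.5, 1.0
share_a_1956, share_m_1956 = 0.13, 0.39      # U.S. shares of hours, 1956

# With A_{i,1956} = 1, invert equations (7) and (8) for abar and sbar.
x_1956 = (b / (1.0 - b)) ** (1.0 / (rho - 1.0))            # A_m = A_s = 1
sbar = share_m_1956 * (1.0 + x_1956) - (L - share_a_1956)  # from (8)
abar = (share_a_1956 - a * (L + sbar)) / (1.0 - a)         # from (7)

def labor_allocation(Aa, Am, As):
    """Hours by sector implied by equations (7) and (8) for given productivity levels."""
    La = (1.0 - a) * abar / Aa + a * (L + sbar / As)
    x = (b / (1.0 - b)) ** (1.0 / (rho - 1.0)) * (Am / As) ** (rho / (rho - 1.0))
    Lm = (L - La + sbar / As) / (1.0 + x)
    return La, Lm, L - La - Lm

Feeding the U.S. productivity paths A_{i,t} (built from the growth rates of value added per hour) through labor_allocation traces out model series like those plotted in Figure IV; the recovered abar and sbar come out close to the values reported in Table I, with small differences reflecting rounding of the targets used here.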
FIGURE IV Share of Hours by Sector—Model vs. U.S. Data
The model also has implications for sectoral output and for relative prices. Sectoral output is given by labor productivity times labor input. Because the model matches closely the time path of sectoral labor allocation for the U.S. economy, the output implications of the model over time for the United States are very close to the data. In particular, the model implies that output growth in agriculture is 2.08% per year (versus 2.29% in the data), whereas output growth in manufacturing and services in the model is 2.74% and 3.60% (versus 2.70% and 3.61% in the data).

The model implies that the producer price of good i relative to good i′ is given by the ratio of labor productivity in these sectors:

(9)    pi / pi′ = Ai′ / Ai.

We assess the price implications of the model against data on sectoral relative prices.22

22. Data for sectoral relative prices are available from 1971 to 2004. See the Appendix for details.

The model implies that the producer price of
services relative to industry increases by 0.94% per year between 1971 and 2004, very close to the increase in the data for the relative price of services from the implicit price deflators (0.87% per year). The price of agriculture relative to manufacturing declines in the model at a rate of 1.04% per year from 1971 to 2004. This fall in the relative price of agriculture is consistent with the data, although the relative price of agriculture falls somewhat more in the data (3.12% per year) than in the model.23 Because productivity growth across sectors is the driving force in the model, it is reassuring that this mechanism generates implications that are broadly consistent with the data. For this reason, we also discuss the relative price implications of the model when assessing the relevance of sectoral productivity growth for labor reallocation in the cross-country data in Section IV.

23. We note that in the context of our model, distortions to the price of agriculture would not substantially affect the equilibrium allocation of labor in agriculture because this is mainly determined by labor productivity in agriculture relative to the subsistence constraint (a is close to zero in the calibration). In this context, it would be possible to introduce price distortions to match the faster decline in the relative price of agriculture in the data without affecting our main quantitative results.

IV. QUANTITATIVE ANALYSIS

In this section, we assess the quantitative effect of sectoral labor productivity on the structural transformation and aggregate productivity outcomes across countries. In this analysis we maintain preference parameters as in the benchmark economy and proceed in three steps. First, we use the model to restrict the level of sectoral labor productivity in the first period for each country. Second, using these levels and data on sectoral labor productivity growth in each country as the exogenous time-varying factors, the model implies time paths for the allocation of hours across sectors and aggregate labor productivity for each country. We assess the cross-country implications of the model with data for labor reallocation across sectors, aggregate productivity, and relative prices. Third, we perform counterfactual exercises to assess the quantitative importance of sectoral analysis in explaining aggregate productivity experiences across countries.

IV.A. Relative Sectoral Productivity Levels

We use the model to restrict the levels of labor productivity in agriculture, industry, and services relative to those in the
United States for the first year in the sample for each country. This step is needed because of the lack of comparable (PPP-adjusted) sectoral output data across a large set of countries. Because our data on sectoral value added are in constant local currency units, some adjustment is needed. Using market exchange rates would be problematic for arguments well discussed in the literature, such as Summers and Heston (1991). Another approach would be to apply the national currency shares of value added to the PPP-adjusted measure of real aggregate output from the Penn World Tables (PWT). This is problematic because it assumes that the PPP-conversion factor for aggregate output applies to all sectors in that country, whereas there is strong evidence that the PPP-conversion factors differ systematically across sectors in development.24 Using detailed categories from the International Comparisons Program (ICP) benchmark data in the PWT would also be problematic for inferences at the sector level because these data are based on the expenditure side of national accounts. For instance, it would not be advisable to use food expenditures and their PPP-conversion factor to adjust units of agricultural output across countries because food expenditures include charges for goods and services not directly related to agricultural production.

Our approach is to use the model to back out sector-specific PPP-conversion factors for each country and to use the constant-price value-added data in local currency units to calculate growth rates of labor productivity in each sector for each country. In particular, we use the model to restrict productivity levels in the initial period and use the data on growth rates of labor productivity to construct the time series for productivity that we feed into the model. The underlying assumption is that the growth rate of value added in constant domestic prices is a good measure of real changes in output. This approach of using growth rates as a measure of changes in "quantities" is similar to the approach followed in the construction of panel data of comparable output across countries, such as the PWT.25

24. See, for instance, the evidence on agriculture relative to nonagriculture in Restuccia, Yang, and Zhu (2008).
25. In particular, in the PWT, the growth rates of expenditure categories such as consumption and investment are the growth rates of constant domestic price expenditures from national accounts.

We proceed as follows. For each country j, we choose the three labor productivity levels Aa^j, Am^j, and As^j to match three targets
FIGURE V Relative Labor Productivity across Sectors—First Year Labor productivity relative to the level of the United States.
from the data in the first year in the sample: (1) the share of hours in agriculture, (2) the share of hours in manufacturing (therefore the model matches the share of hours in services by labor market clearing), and (3) aggregate labor productivity relative to that of the United States.26 Figure V plots the average level of sectoral labor productivity relative to the level of the United States for countries in each quintile of aggregate productivity in the first year. The model implies that relative sectoral productivity in the first year tends to be lower in poorer countries than in richer countries, but particularly so in agriculture and services. In fact, the model implies 26. We adjust s¯ by the level of relative productivity in services in the first period for each country so that s¯ /As is constant across countries in the first period of the sample. Although it is not modeled explicitly, one interpretation of s¯ is as service goods produced at home. Therefore, s¯ cannot be invariant to large changes in productivity levels in services.
FIGURE VI Relative Labor Productivity across Sectors—First and Last Years Labor productivity relative to the level of the United States.
that the dispersion of relative productivity in agriculture and services is much larger than in manufacturing. In the first year, the six poorest countries have relative productivity in agriculture and services of around 20% and 10%, whereas the six richest countries have relative productivity in these sectors of around 86% and 84%. In contrast, for manufacturing, average relative productivity of the six poorest countries in the first year is 31% and that of the six richest countries is 70%.

The levels of sectoral labor productivity implied by the model for the first year, together with data on growth rates of sectoral value added per hour in local currency units, imply time paths for sectoral labor productivity in each country. In particular, letting γ^j_{i,t} denote the growth rate of labor productivity in country j, sector i, at date t, we obtain sectoral productivity as A^j_{i,t+1} = (1 + γ^j_{i,t}) A^j_{i,t}. Figure VI plots the average level of sectoral labor productivity relative to the level in the United States in the first and last years
for countries in each quintile of aggregate productivity in the first year. We note that, on average, countries have experienced substantial gains in productivity in agriculture and industry relative to the United States (from an average relative productivity level of 48% and 51% in the first period to 71% and 75% in the last period). In sharp contrast, countries experienced, on average, much smaller gains in productivity in services relative to the United States (from an average relative productivity level of 46% to 49%). These features are particularly pronounced for countries in the top three quintiles of the productivity distribution. For these countries, average relative labor productivity in agriculture and industry increased from 66% and 59% to 100% and 85%, whereas average productivity in services increased from 63% to only 66%. We emphasize that the low levels of relative productivity in services in the first period together with the lack of catch-up over time imply that, for most countries, relative productivity levels in services are lower than those in agriculture and industry at the end of the sample period. Therefore, as these economies allocate an increasing share of hours to services, low relative labor productivity in this sector dampens aggregate productivity growth. These relative productivity patterns are suggestive of the results we discuss in Section IV.C, where we show that productivity catchup in industry explains a large portion of the gains in aggregate productivity across countries. In addition, we show that low relative productivity levels in services and the lack of catch-up play a quantitatively important role in explaining the growth episodes of slowdown, stagnation, and decline in aggregate productivity across countries. We argue that our productivity-level results are consistent with the available evidence from studies using producer and micro data. Empirical studies provide internationally comparable measures of labor productivity for some sectors and some countries. These studies typically provide estimates for narrow sectoral definitions at a given point in time. One such study for agriculture is from the Food and Agriculture Organization (FAO) of the United Nations. This study uses producer data (prices of detailed categories at the farm gate) to calculate international prices and comparable measures of output in agriculture using a procedure similar to that of Summers and Heston (1991) for the construction of the PWT. We find that the labor productivity differences in agriculture implied by the model are qualitatively consistent with the differences in GDP per worker in agriculture between
rich and poor countries from the FAO for 1985.27 Baily and Solow (2001) have compiled a number of case studies from the McKinsey Global Institute (MGI) documenting labor productivity differences in some sectors and countries. Their findings are broadly consistent with our results. In particular, Baily and Solow emphasize a pattern that emerges from the micro studies where productivity differences across countries in services are not only large but also larger than the differences for manufacturing. The Organization for Economic Cooperation and Development (OECD) and MGI provide studies at different levels of sectoral disaggregation for manufacturing. These studies report relative productivity for a relatively small set of countries, and most studies report estimates only at one point in time. One exception is Pilat (1996). This study reports relative labor productivity levels in manufacturing for 1960, 1973, 1985, and 1995 for thirteen countries. Although the implied relative labor productivity levels in industry in our model tend to be higher than those reported in this study, the patterns of relative productivity are consistent for most countries. Finally, consistent with our findings, several studies report that the United States has higher levels of labor productivity in service sectors than other developed countries and that lower labor productivity in service sectors compared to manufacturing is pervasive.28

27. See Restuccia, Yang, and Zhu (2008) for a detailed documentation of the cross-country differences in labor productivity in agriculture.
28. Baily, Farrell, and Remes (2005), for instance, estimate that, relative to the United States, France and Germany had lower relative productivity levels in 2000 and had lower growth rates of labor productivity between 1992 and 2000 for a set of narrowly defined service sectors, with the exception of mobile telecommunications.

IV.B. The Structural Transformation across Countries

Given paths for sectoral labor productivity, the model has time-series implications for the allocation of labor hours and output across sectors, aggregate labor productivity, and relative prices for each country. In this section we evaluate the implications of the model against the available cross-country data. Overall, the model reproduces the salient features of the structural transformation and aggregate productivity across countries. Figures VII and VIII illustrate this performance. Figure VII reports the shares of hours in each sector and relative aggregate productivity in the last period of the sample for each country in the model and in the data. Figure VIII reports the change in
FIGURE VII Model vs. Data across Countries—Levels in the Last Year Each plot reports the value for each variable in the last period for the model and the data.
these variables (in percentage points) between the last and first periods in the model and in the data. As these figures illustrate, the model replicates well the patterns of the allocation of hours across sectors and relative aggregate productivity observed in the data, particularly so for the share of hours in agriculture and relative aggregate productivity. This performance attests to the ability of the model to replicate the basic trends observed for the share of hours in agriculture across a large sample of countries. Regarding the share of hours in industry, the model tends to imply a smaller increase over time compared to the data, particularly for less developed economies where the share of hours in industry increased over the sample period. Conversely, the model tends to imply a larger increase in the share of hours in services over the sample period than that observed in the data. This implication of the model suggests that, especially for some less developed countries, distortions or frictions in labor reallocation between industry and services may be important in accounting for their
FIGURE VIII Model vs. Data across Countries—Changes Each plot reports the change between the last and first period (in percentage points) of each variable during the sample period in the data and in the model.
structural transformation.29 As a summary statistic for the performance of the model in replicating the time-series properties of the data, we compute the average absolute deviation (over time and across countries) in percentage points (p.p.) between a given time series in the model and in the data.30

29. Although in most cases the model does well in reproducing the time series in the data, in some countries modifications to the simple model would be required in order to better account for the process of structural transformation and aggregate productivity growth—see Duarte and Restuccia (2007) for an application of wedges across sectors in Portugal. These richer environments, however, would require country-specific analysis. We instead maintain our simple model specification and leave these interesting country-specific experiences for future research.
30. We measure the average absolute deviation in percentage points between the time series in the model and the data across countries as Υ = [1/(J T_j)] Σ_{j=1}^{J} Σ_{t=1}^{T_j} abs(x^d_{j,t} − x^m_{j,t}) × 100, where j is the country index and T_j is the sample size for country j.

The average absolute deviations for the shares of hours in agriculture, industry, and
services are 2, 6, and 7 p.p., respectively, and 4 p.p. for relative aggregate productivity. We conclude that the model captures the bulk of the labor reallocation and aggregate productivity experiences across countries.

To better understand our finding about aggregate productivity, recall that aggregate labor productivity is the sum of labor productivity in each sector weighted by the share of labor in that sector, that is,

Y/L = Σ_{i∈{a,m,s}} (Yi/Li)(Li/L).

As a result, the behavior of aggregate productivity arises from the behavior of sectoral labor productivity and the allocation of labor across sectors over time.31 Because the model reproduces the salient features of labor reallocation across countries, aggregate productivity growth in the model is also broadly consistent with the cross-country data.

31. Note that in the above equation, sectoral labor productivity is measured at a common set of prices across countries. We use the prices of the benchmark economy in 1956.

The model has implications for sectoral output in each country. Sectoral output is given by the product of labor productivity and labor hours. As a result, the growth rate of output in sector i is the sum of the growth rates of labor productivity Ai (which we take from the data) and the growth in labor hours Li. The fact that the model reproduces well the cross-country patterns of the structural transformation implies that sectoral output growth is also well captured by the model.

The model also has implications for levels and changes over time in relative prices across countries. We first discuss the implications for changes in relative prices. Figure IX plots the annualized percentage change in the producer prices of agriculture and services relative to manufacturing in the model and in the data. The figure shows that the model captures the broad patterns of price changes in the data—because productivity growth tends to be faster in agriculture than in industry and in industry than in services in most countries, the tendency is for the relative price of agriculture to fall and the relative price of services to increase over time. The direction of changes in the relative price of agriculture in the model matches the data for 23 of 29 countries in the sample (80%). For the relative price of services, the model is consistent
FIGURE IX Changes in Relative Prices (%) Each figure reports the annualized percentage change of the variable in the time series in the data and in the model. Relative prices of agriculture and services refer to the prices of agriculture and services relative to industry. Data on relative prices cover the period 1971 to 2004.
with the data in 25 countries (86%). We note that in the model, the only factors driving relative price changes over time are the growth in labor productivity across sectors. Of course, many other factors can affect the magnitude of price changes over time, so the model cannot capture all the changes. Now we turn to the implications of the model for price-level differences across countries. Recall that the prices of agriculture and services relative to industry are given by the inverse of labor productivity ( pa / pm = Am/Aa and ps / pm = Am/As ). The fact that the dispersion in productivity across rich and poor countries is large in agriculture and services relative to industry implies that the relative prices of agriculture and services are higher in poor than in rich countries. These implications may seem inconsistent at first with conventional wisdom about price-level differences
across countries. We emphasize that this view stems from observations about expenditure prices (often from ICP or PWT data) instead of producer prices. Our model, however, is better characterized as having implications for producer prices across countries. To see why the distinction between producer and expenditure prices is important, consider first the conventional wisdom that food is cheap in poor countries. This observation arises when the PPP-expenditure price of food is compared across countries using market exchange rates. For the sample of countries in Restuccia, Yang, and Zhu (2008), the dollar price of food is 60% higher in rich than in poor countries and the elasticity with respect to GDP per worker is positive and significant at 0.23.32 Food expenditures, however, include distribution and other charges— in the United States, for every dollar of food expenditure, only 20 cents represents payments to the farmer for the agricultural product—and the distinction between producer and expenditure prices may differ systematically across countries. In fact, producer-price data reveal a striking conclusion about the relative price of agriculture across countries: the evidence from FAO and PWT data in Restuccia, Yang, and Zhu (2008) is that the price of agricultural goods relative to nonagricultural goods is much lower in rich than in poor countries (a ratio of 0.22) and the elasticity of this relative price with respect to GDP per worker is negative and statistically significant at −0.34. This evidence is consistent with the price implications of the model for agriculture.33 Regarding the relative price of services, the conventional wisdom is that the price of services is higher in rich than in poor countries. This view stems again from observations about expenditure prices; see Summers and Heston (1991, pp. 338 and 339). We argue that this evidence is not necessarily inconsistent with the producer-price implications of the model, because the gap between expenditure and producer price-levels may be affected by many factors that can be systematically related to development.34 32. Note, however, that when the price of food is compared relative to the price of all goods, food appears expensive in poor countries. See Summers and Heston (1991, p. 338). 33. The distinction between food and agricultural goods prices is also important for the implications of price changes through time. For example, in the United States, the annualized growth rate of food prices from the Consumer Price Index relative to the price of manufacturing goods is positive, about 1% per year from 1971 to 2005, whereas the growth rate of the price of agriculture relative to manufacturing is negative, at roughly −2.5%. 34. Nevertheless, it is an interesting question for future research to assess the factors explaining higher expenditure price levels of services in rich countries.
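To fix ideas, a small numerical sketch of the producer-price implications discussed above; the relations p_a/p_m = A_m/A_a and p_s/p_m = A_m/A_s are from the text, while the productivity numbers below are invented for illustration.

```python
# Illustrative sketch of the model's producer-price implications (numbers are made up).
# With competitive pricing, relative producer prices are inverse relative labor
# productivities: p_a/p_m = A_m/A_a and p_s/p_m = A_m/A_s.

def relative_prices(A_a, A_m, A_s):
    """Producer prices of agriculture and services relative to industry."""
    return {"p_a/p_m": A_m / A_a, "p_s/p_m": A_m / A_s}

# Hypothetical productivity levels (rich country vs. poor country), common units:
rich = relative_prices(A_a=10.0, A_m=20.0, A_s=15.0)
poor = relative_prices(A_a=1.0,  A_m=5.0,  A_s=2.0)

# Because the rich-poor productivity gap is larger in agriculture and services than in
# industry, both relative producer prices come out higher in the poor country:
print(rich)   # {'p_a/p_m': 2.0, 'p_s/p_m': 1.33...}
print(poor)   # {'p_a/p_m': 5.0, 'p_s/p_m': 2.5}

# Price changes follow the same logic: the growth rate of p_a/p_m equals the productivity
# growth differential g(A_m) - g(A_a), so faster agricultural productivity growth pushes
# the relative price of agriculture down over time.
```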
Because there are no systematic producer price-level data for services that can be compared with the price implications of the model, we focus instead on the indirect evidence from productivity measurements found in micro studies. The lower relative price of services in rich countries in the model stems from a higher relative productivity in services than in manufacturing compared to poor countries. Thus, we use the available sectoral productivity measurements to indirectly assess the price implications of the model for services. The evidence presented by Baily and Solow (2001) and other OECD studies discussed earlier suggests that labor productivity differences between rich and poor countries in services are larger than those for manufacturing sectors. This evidence is consistent with our productivity findings and therefore indirectly provides some assurance of the price implications of the model for services. IV.C. Counterfactuals We construct a series of counterfactuals aimed at assessing the quantitative importance of sectoral labor productivity on the process of structural transformation and aggregate productivity experiences across countries. We focus on two sets of counterfactuals. The first set is designed to illustrate the mechanics of positive sectoral productivity growth for labor reallocation and the contribution of productivity growth differences across sectors and countries for labor reallocation and aggregate productivity. The second set of counterfactuals focuses on explaining aggregate productivity growth experiences of catch-up, slowdown, stagnation, and decline by assessing the contribution of specific cross-country sectoral productivity patterns, such as productivity catch-up in agriculture and industry and low productivity levels and the lack of catch-up in services. The Mechanics of Sectoral Productivity Growth. We start by considering counterfactuals where we set the growth rate of labor productivity in one sector to zero in all countries, leaving the remaining growth rates as in the data. These counterfactuals illustrate the importance of productivity growth in each sector for labor reallocation and aggregate productivity. Summary statistics are reported in Figure X and Table II. In Figure X we report, for each country, the change in the time series of the share of hours in each sector and relative aggregate productivity between the last and first periods (in percentage points) in the counterfactual
FIGURE X The Mechanics of Sectoral Productivity Growth Counterfactuals (1) to (3) set the growth rate of labor productivity in a sector to zero in all countries, leaving the other sectors as in the data, for agriculture (first column), industry (second column), and services (third column). Counterfactual (4) sets labor productivity growth in each sector to aggregate productivity growth in the United States. Each panel plots the change between the last and the first period in the time series (in percentage points) of the share of hours in each sector and relative aggregate productivity in the model and in the counterfactual.
and in the model. In Table II we report the average change in the model for all countries, for countries that catch up, and for countries that decline relative to the United States. Consider first the counterfactual for agriculture (γa = 0). No productivity growth in agriculture generates no labor reallocation away from agriculture: there is an average increase in the share of hours in agriculture of 2 p.p. in the counterfactual instead of a decrease of 26 p.p. in the model. As a result, much less labor is reallocated to services. This counterfactual has important negative implications for relative aggregate productivity for most countries regardless of their level
TABLE II
SECTORAL GROWTH, LABOR REALLOCATION, AND AGGREGATE PRODUCTIVITY

                            Change in share of hours          Change in relative
                        Agriculture  Industry  Services    aggregate productivity

All countries
  Model                     −25.5      −10.3      35.8              12.8
  Counterfactual:
  (1) γa = 0                  2.1      −13.7      11.6              −0.5
  (2) γm = 0                −25.5        7.3      18.2              −7.0
  (3) γs = 0                −25.2      −11.8      36.9              −2.2
  (4) γi = γ^US             −16.8       −4.7      21.5               0.4

Catch-up countries
  Model                     −24.3      −13.5      37.8              25.8
  Counterfactual:
  (1) γa = 0                  4.9      −17.3      12.4               7.9
  (2) γm = 0                −24.3        9.5      14.8              −1.5
  (3) γs = 0                −23.8      −15.6      39.4               4.0
  (4) γi = γ^US             −13.3       −4.5      17.8               1.6

Decline countries
  Model                     −27.6       −4.5      32.1             −10.5
  Counterfactual:
  (1) γa = 0                 −2.9       −7.2      10.1             −15.7
  (2) γm = 0                −27.6        3.3      24.3             −16.8
  (3) γs = 0                −27.6       −4.9      32.5             −13.3
  (4) γi = γ^US             −23.2       −5.1      28.2              −1.9

Notes. The table reports the average change between the last and first periods in the time series (in percentage points) of each variable for the model and the counterfactuals. Counterfactuals (1) to (3) assume zero growth in labor productivity in a sector, leaving the other sectoral growth rates as in the data. Counterfactual (4) assumes labor productivity growth in each sector equal to the aggregate productivity growth in the United States.
of development: there is an average decline in relative aggregate productivity of 1 p.p. in the counterfactual instead of the 13 p.p. increase in the model. Next we turn to the counterfactual for industry (γm = 0). This counterfactual has no effect on the share of hours in agriculture (see equation (7)). With no productivity growth in industry there is much less reallocation of labor away from industry into services compared to the model and thus industry represents a larger share of output in the counterfactual. The result is a process for relative aggregate productivity that is sharply diminished across countries: an average decline of 7 p.p. in the counterfactual instead of the catch-up of 13 p.p. in the model. And indeed the largest negative impact is on countries that observed the most
catch-up in relative aggregate productivity in the model. Finally, having no productivity growth in services (γs = 0) has a very small impact on labor reallocation across sectors.35 Relative aggregate productivity declines by an average of 2 p.p. in this counterfactual. The negative impact of this counterfactual on relative aggregate productivity is smaller than that in the case with no productivity growth in industry for all countries but three (Japan, Portugal, and Venezuela), even though services account for a larger share of hours than industry in most countries.
We end this set of counterfactuals by assessing the quantitative importance of differences in labor productivity growth across sectors and countries. We set labor productivity growth in each sector to the growth rate of aggregate labor productivity in the United States (γi = γ^US) and document the results in Table II and in the fourth column in Figure X. The counterfactual has a substantial impact on the process of structural transformation. In particular, much less labor is reallocated away from agriculture and industry toward services. For instance, over the sample period, the share of hours in agriculture fell, on average, 26 p.p. in the model and 17 p.p. in the counterfactual. In turn, the share of hours in services increased, on average, 36 p.p. in the model and 22 p.p. in the counterfactual. And indeed this different reallocation process, together with the assumption about sectoral labor productivity growth, explains a large portion of the experiences of catch-up and decline in aggregate productivity. For countries that catch up in aggregate productivity to the United States in the model over the sample period, the average catch-up is 26 p.p. in the model and only 2 p.p. in the counterfactual. For countries that decline in relative aggregate productivity, the average decline is 11 p.p. in the model and only 2 p.p. in the counterfactual.36 We conclude from these counterfactuals that sectoral productivity growth generates substantial effects on labor reallocation, which in turn are important in understanding aggregate productivity growth across countries.
35. This is due to two opposing effects of productivity growth in services on the labor allocation between industry and services, which roughly cancel each other in the model. See Duarte and Restuccia (2007, p. 42) for a detailed discussion of these effects.
36. Notice that this counterfactual does not eliminate all aggregate productivity growth differences across countries, even though productivity growth rates are identical across sectors and countries and labor reallocation is much diminished as a result. For instance, in the counterfactual, relative aggregate productivity in Finland increases by 8 p.p. over the sample period, and it decreases by 6 p.p. in Mexico. These movements in relative aggregate productivity in the counterfactual stem solely from labor reallocation across sectors (due to positive productivity growth) that have different labor productivity levels.
TABLE III
CHANGE IN RELATIVE AGGREGATE PRODUCTIVITY

                              All countries   Catch-up countries   Decline countries

Model                              12.8              25.8               −10.5
Counterfactual:
(1) γi = γi^US
    (1a) Agriculture               11.5              23.2                −9.4
    (1b) Industry                   6.0              13.9                −8.4
    (1c) Services                  10.4              18.3                −3.7
(2) γi = γi^US ∀i                   3.9               5.8                 0.5
(3) Catch-up in services           30.7              46.9                 1.6

Notes. The table reports the average change between the last and first periods in the time series (in percentage points) of relative aggregate productivity for the model and the counterfactuals. Counterfactuals (1a) to (1c) set the growth rate in a sector to the rate in the United States in that sector. Counterfactual (2) sets the growth rate of all sectors to the sectoral growth rates in the United States. Counterfactual (3) sets the productivity growth in services such that in the last period in the sample relative productivity in services is the same as relative productivity in industry in each country.
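As a rough sketch of how a counterfactual like (3) in the table above can be constructed, the snippet below backs out the services productivity growth rate that would equate end-of-sample relative productivity in services with that in industry. All inputs are hypothetical illustrations, not the paper's estimates.

```python
# Back-of-the-envelope for a counterfactual in the spirit of Table III(3) (made-up numbers).

def counterfactual_services_growth(rel_s_first, rel_m_last, g_s_us, years):
    """
    rel_s_first : services productivity relative to the U.S. in the first period
    rel_m_last  : industry productivity relative to the U.S. in the last period (the target)
    g_s_us      : annual growth rate of U.S. services productivity (taken as given)
    years       : length of the sample period
    Returns the annual services productivity growth rate that hits the target.
    """
    # Relative productivity evolves as rel_last = rel_first * ((1+g)/(1+g_us))**years,
    # so solve ((1+g)/(1+g_us))**years = rel_m_last / rel_s_first for g.
    ratio = (rel_m_last / rel_s_first) ** (1.0 / years)
    return ratio * (1.0 + g_s_us) - 1.0

# Hypothetical country: services start at 46% of the U.S. level, industry ends at 75%,
# U.S. services productivity grows 1.3% per year, over a 48-year sample.
g_cf = counterfactual_services_growth(rel_s_first=0.46, rel_m_last=0.75, g_s_us=0.013, years=48)
print(f"counterfactual services growth: {g_cf:.3%} per year")   # about 2.3% per year
```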
Sectoral Productivity Patterns and Cross-Country Experiences. We now turn to the second set of counterfactuals, where we assess the role of specific labor productivity patterns across sectors in explaining cross-country episodes of catch-up, slowdown, stagnation, and decline in relative aggregate productivity. In Figure VI we documented a substantial catch-up across countries in labor productivity in agriculture and industry but not in services. To assess the importance of sectoral catch-up for aggregate productivity, we compute counterfactuals where we set the growth rate of labor productivity in one sector to the growth rate in that sector in the United States, leaving the other sectoral growth rates as in the data (γi = γi^US for each i ∈ {a, m, s}). For completeness we also compute a counterfactual where all sectoral growth rates are set to the ones in the United States (γi = γi^US ∀i). Table III summarizes the results for these counterfactuals.
Although there has been substantial catch-up of labor productivity in agriculture during the sample period (from an average relative productivity of 48% in the first period to 71% in the last period of the sample), this factor contributes little, about 10%, to catch-up in aggregate productivity across countries (1.3 p.p. of 12.8 p.p. in the model). The substantial catch-up in agricultural productivity produces a reallocation of labor away from this
FIGURE XI Change in Relative Aggregate Productivity—The Importance of Industry This counterfactual sets the growth of labor productivity in industry in each country to the rate in the United States. The figure plots the difference between the last and first periods (in percentage points) of relative aggregate productivity during the sample period in the model and in the counterfactual.
sector, which dampens its positive effect on aggregate productivity growth.37 The catch-up in industry productivity has also been substantial. Unlike agriculture, this catch-up has a significant impact on relative aggregate productivity. Given that most countries have observed higher growth rates of labor productivity in industry than the United States, labor reallocation away from industry and toward services is diminished in the counterfactual for industry. On average, the share of hours in industry decreases 6.5 p.p. in the counterfactual, compared to a decrease of 10.3 p.p. in the model. Figure XI summarizes our findings for the effect of this counterfactual on relative aggregate productivity by reporting 37. The effect of labor reallocation on relative aggregate productivity depends on the normalized end-of-period sectoral labor productivity. Because there is substantial catch-up in agricultural productivity but not in services, the effect of reallocation from agriculture to services is negative.
the difference in relative aggregate productivity between the last and the first period in the time series for each country in the model and in the counterfactual. Industry productivity growth is important for countries that catch up in aggregate productivity to the United States, because these countries are substantially below the 45° line. In fact, we draw in this figure a dash-dotted line indicating half the gains in aggregate productivity in the counterfactual relative to the model. Many countries are in this category and some countries substantially below it, such as Australia, Sweden, and the United Kingdom. For all countries, the average change in relative aggregate productivity is only 6 p.p. in the counterfactual instead of 12.8 p.p. in the model.38 We conclude from this counterfactual that productivity catch-up in industry explains about 50% (6.8 p.p. of 12.8 in the model) of the relative aggregate productivity gains observed during the sample period.
Recall that, in contrast to agriculture and industry, there has been no substantial catch-up in services across countries and, as reported in Figure VI, there has been a decline in relative productivity in services for the richer countries. As a result, even though services represent an increasing share of output in the economy, we do not expect services to contribute much to catch-up in the model. This is confirmed in the third counterfactual, as productivity catch-up in services contributes about 15% of the catch-up in relative aggregate productivity (2.4 p.p. of 12.8 p.p. in the model). We note, however, that for countries that decline in relative aggregate productivity, lower growth in services than in the United States contributes substantially to this decline (−6.8 p.p. of −10.5 p.p. in the model; see Table III). Among the developed economies—which feature a large share of hours in services—Canada, New Zealand, and Sweden had lower productivity growth rates in services than the United States. In the model, Canada and New Zealand declined in relative aggregate productivity by 9 p.p. and 8 p.p. over the sample period, whereas Sweden observed a substantial catch-up in relative aggregate productivity but stagnated at around 82% during the mid-1970s. In the counterfactual, relative aggregate productivity increases by 3 p.p. in Canada, remains constant for New Zealand, and increases by 9 p.p. from the stagnated level in Sweden. Low productivity growth in services is
38. Note that among countries that decline in relative productivity the effect of industry growth is not systematic and the gaps are not as large.
essential for understanding these growth experiences of stagnation and decline among rich economies. Figure VI also documents that the level of relative productivity in services is lower than that of industry and that most countries failed to catch up in services to the relative level of industry. For instance, the average relative productivity in services increased from 46% in the first period to 49% in the last period in the sample, whereas the average relative productivity in industry increased from 51% to 75%. In the last period of the sample, all countries except Austria, France, Denmark, the United Kingdom, and New Zealand feature lower relative productivity in services than in industry. Moreover, in many instances the differences in productivity between services and industry are substantial: around 40% lower in services in Spain, Finland, and Norway, around 60% lower in Portugal, and around 80% lower in Korea and Ireland. These features imply that the service sector represents an increasing drag on aggregate productivity as resources are reallocated to this sector in the process of structural transformation. To illustrate the role of low productivity in services and the lack of catch-up in accounting for the growth experiences of slowdown, stagnation, and decline, we compute a counterfactual where we let productivity growth in services be such that in the last period in the sample relative productivity in services is the same as relative productivity in industry in each country. Although the impact of these different productivity growth rates in services on labor reallocation is somewhat limited, the impact on growth experiences across countries is quite striking: for countries that catch up to the United States during the sample period, the average catch-up increases by almost 80% to 46 p.p., whereas for countries that decline there is instead a catch-up of 1.6 p.p. during the sample period. (See Table III.) More important, these summary statistics hide the impact of productivity in services in explaining experiences of slowdown, stagnation, and decline observed in the time series. For this reason, Figure XII plots the time path of relative aggregate productivity for all country experiences of slowdown, stagnation, and decline in relative aggregate productivity. The solid lines represent the model and the dash-dotted lines represent the counterfactual. This figure clearly indicates the extent to which low productivity in services and the lack of catch-up account for all these poor growth experiences. To summarize, although productivity convergence in industry (and agriculture) are essential in the first stages of the process
FIGURE XII Relative Aggregate Productivity—The Importance of Services This counterfactual sets the productivity growth in services such that in the last period in the sample relative productivity in services is the same as relative productivity in industry in each country. Each panel plots aggregate labor productivity relative to that of the United States in the model and the counterfactual for each country which, during the sample period, experienced an episode of slowdown, stagnation, or decline. The solid line represents the model and the dash-dotted line the counterfactual.
of structural transformation, poor relative performance in services has determined a slowdown, stagnation, and decline in aggregate productivity. In fact, in the last period of the sample, almost all countries observe a lower relative labor productivity in services than in aggregate. (See Figure XIII.) Because growth rate differences across countries in the service sector tend to be small and services represent a large and increasing share of hours in most countries, this suggests an increasing role of services in determining cross-country aggregate productivity outcomes.
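A stylized numerical illustration of this drag, holding hours shares fixed for simplicity (in the model the allocation responds as well); all numbers are invented, and sectoral levels are to be read at common benchmark prices.

```python
# Illustrative sketch of the drag from low relative productivity in services (made-up numbers).

def aggregate(shares, levels):
    """Hours-share-weighted aggregate labor productivity."""
    return sum(shares[i] * levels[i] for i in shares)

shares    = {"a": 0.05, "m": 0.25, "s": 0.70}            # late-stage hours shares
shares_us = {"a": 0.03, "m": 0.22, "s": 0.75}
A_us      = {"a": 10.0, "m": 40.0, "s": 40.0}

# Country at 70% of the U.S. level in agriculture, 75% in industry, but only 50% in services:
A = {"a": 0.70 * A_us["a"], "m": 0.75 * A_us["m"], "s": 0.50 * A_us["s"]}

baseline = aggregate(shares, A) / aggregate(shares_us, A_us)

# Counterfactual in the spirit of Table III(3): lift relative productivity in services
# to the industry level (0.75), holding everything else fixed.
A_cf = dict(A, s=0.75 * A_us["s"])
counterfactual = aggregate(shares, A_cf) / aggregate(shares_us, A_us)

print(round(baseline, 2), round(counterfactual, 2))      # 0.56 0.74
# With services at 70% of hours, closing the services gap toward the industry level moves
# relative aggregate productivity far more than an equal improvement in the small
# agriculture sector would.
```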
FIGURE XIII
Labor Productivity in Services across Countries—Last Period
This figure plots relative labor productivity in services (vertical axis, "Relative labor productivity in services—last year") against relative aggregate labor productivity (horizontal axis, "Relative aggregate labor productivity—last year") in the last period of the sample for all countries.
IV.D. Discussion Our analysis of structural transformation and aggregate productivity growth relies on a collection of closed economies. It is of interest to discuss the limitations and implications of this assumption for the results. Openness and trade can have two important effects in an economy. First, competition from trade can affect domestic productivity. Second, for an open economy, prices of traded goods reflect world market conditions and not just domestic factors. Regarding the effect of trade on productivity, we argue that the closed-economy assumption is not as restrictive for our analysis as it may first appear. To see this point, notice that the effect of openness on labor allocations and aggregate productivity is already embedded in the measures of labor productivity growth by sector, which the analysis takes as given. For instance, we found that the growth rate of labor productivity in manufacturing for Korea was almost three times that of the United States. It is
likely that openness to trade during this period can help explain this fact. Moreover, openness would imply that productivity differences across countries for those goods that are most tradable would tend to be small relative to the differences for those goods that are less traded. The productivity implications of the model are consistent with this broad prediction, because differences in manufacturing productivity are smaller than productivity differences in services (mostly nontraded goods). It is an interesting question for future research to assess the importance of trade for productivity convergence in manufacturing across countries and the lack of convergence in services. Regarding the effect of trade on relative prices, recall that the closed-economy assumption implies a one-to-one mapping from sectoral productivity growth to relative prices. An open-economy version of the model would tend to produce a weaker link between domestic productivity growth and relative prices. In fact, in a small open economy, relative prices are invariant to domestic productivity. As we discussed earlier, the relative price implications of the model are broadly consistent with the data, which suggests that domestic productivity growth is a substantial component of the movements in relative prices. To put it differently, we found a strong correlation between changes in relative prices and labor productivity growth across countries, as documented in Figure IX. As a result, the labor allocations implied by the model are broadly consistent with the incentives that consumers face in these economies. We found that not all differences in relative prices are captured by the model. In particular, we found that the price of services relative to manufacturing increased faster in the model than in the data for many countries. This departure of the model from the data may arise not only from the closedeconomy assumption, but also from other features of the data, such as price distortions and barriers to labor reallocation across sectors. Finally, note that standard open-economy models imply that the prices of traded goods are equalized across countries. The evidence, however, suggests large departures from the law of one price. For instance, the price exercise on agricultural goods from the FAO suggests large price differences across countries and the international macro literature documents large deviations in prices across countries even for highly tradable goods. Another potential avenue to assessing the limitations of the closed-economy assumption of the model would be to compare the consumption and production implications relative to data. For
instance, in the closed economy, output and consumption shares are equal, but in the open economy, they would differ. Unfortunately, this implication cannot be tested directly, because consumption is measured as expenditures in final goods and any gap between production and consumption of goods may also be due to processing, distribution, and marketing services and other charges. But because for the more developed countries most of the trade occurs intra-industry, consumption and production shares of broad sectors tend not to differ greatly. V. CONCLUSIONS This paper highlights the role of sectoral labor productivity for the structural transformation and aggregate productivity experiences across countries. Using a model of the structural transformation that is calibrated to the growth experience of the United States, we showed that sectoral differences in labor productivity levels and growth explain the broad patterns of labor reallocation and aggregate productivity experiences across countries. We found that sectoral labor productivity differences across countries are large and systematic both at a point in time and over time. In particular, labor productivity differences between rich and poor countries are large in agriculture and services and smaller in manufacturing. Moreover, most countries have experienced substantial productivity catch-up in agriculture and industry, but productivity in services has remained low relative to the United States. An implication of these findings is that, as countries move through the process of structural transformation, relative aggregate labor productivity can first increase and later stagnate or decline. We find that labor productivity catch-up in manufacturing explains about 50% of the gains in aggregate productivity across countries and that low labor productivity in services and the lack of catch-up explain all the experiences of slowdown, stagnation, and decline in relative aggregate productivity across countries. Our findings suggest that understanding the sources of sectoral differences in labor productivity levels and growth across countries is crucial in understanding the relative performance of countries. In analyzing sectoral labor productivity levels and growth rates across countries, a number of interesting questions arise. What factors contribute to cross-country differences in labor productivity across sectors? Why were countries able to catch up in manufacturing productivity but not in services? What are the
barriers that prevent other developed economies from sustaining growth rates of labor productivity in services as high as in the United States? How are trade openness and regulation related to these productivity differences across countries? Although there may not be a unifying explanation for all these observations, a recurrent theme in productivity studies at the sectoral level is that the threat or actual pressure of competition is crucial for productivity performance; see, for instance, Schmitz ´ (2005) and Gald´on-Sanchez and Schmitz (2002). Because services are less traded than manufacturing goods, there is a tendency for services to be less subject to competitive pressure, which may explain the larger productivity gaps observed in services relative to manufacturing across countries. Moreover, protected domestic sectors may be the explanation for poor productivity performance in some countries. Because openness to trade would not generally have the desired competitive-pressure impact in services, other factors such as the regulatory environment may prove useful in explaining productivity differences across countries in this sector. For instance, the role of land and size regulations on productivity in retail services is often emphasized; see, for instance, Baily and Solow (2001). As a first pass at providing some empirical support for this potential explanation for productivity differences across countries, we have correlated labor productivity differences in industry and services derived from our model to measures of trade openness and government regulation. We find that trade openness is strongly correlated with industry productivity but less so with services productivity, whereas measures of regulation (such as that from the World Bank’s Doing Business) are strongly correlated with productivity in services. We leave a detailed investigation of these important issues for future research. APPENDIX: DATA SOURCES AND DEFINITIONS We build a panel data set with annual observations for aggregate GDP per hour and value added per hour and shares of hours for agriculture, industry, and services for 29 countries. The countries covered in our data set are, with sample period in parentheses, Argentina (1950–2004), Australia (1964–2004), Austria (1960–2004), Belgium (1956–2004), Bolivia (1950–2002), Brazil (1950–2003), Canada (1956–2004), Chile (1951–2004), Colombia (1950–2003), Costa Rica (1950–2002), Denmark (1960–2004), Finland (1959–2004), France (1969–2003), Greece (1960–2004),
Ireland (1958–2004), Italy (1956–2004), Japan (1960–2004), Korea (1972–2003), Mexico (1950–2004), the Netherlands (1960– 2004), New Zealand (1971–2004), Norway (1956–2004), Portugal (1956–2004), Spain (1960–2004), Sweden (1960–2004), Turkey (1960–2003), the United Kingdom (1956–2004), the United States (1956–2004), and Venezuela (1950–2004). All series are trended using the Hodrick–Prescott filter with a smoothing parameter λ = 100 before any ratios are computed. A. Aggregate Data We obtain data on PPP-adjusted real GDP per capita in constant prices (RGDPL) and population (POP) from Penn World Tables version 6.2; see Heston, Summers, and Aten (2006). We obtain data on employment (EMP) and annual hours actually worked per person employed (HOURS) from the Total Economy Database; see the Conference Board (2008). With these data we construct annual time series of PPP-adjusted GDP per hour in constant prices for each country as Y Lh = RGDPL × POP/(EMP × HOURS). B. Sectoral Data We obtain annual data on employment, hours worked, and constant domestic-price value added for agriculture, industry, and services for the countries listed above. The sectors are defined by the International Standard Industrial Classification, revision 3 (ISIC III) definitions, with agriculture corresponding to ISIC divisions 1–5 (agriculture, forestry, hunting, and fishing), industry to ISIC divisions 10–45 (mining, manufacturing, construction, electricity, water, and gas), and services to ISIC divisions 50– 99 (wholesale and retail trade—including hotels and restaurants, transport, and government, financial, professional, and personal services such as education, health care, and real estate services). Value Added by Sector. Value added by sector is obtained by combining data from the World Bank (2008) World Development Indicators online and historical data from the OECD National Accounts publications for the following countries: Australia, Austria, Belgium, Canada, Denmark, Finland, France, Greece, Ireland, Italy, Japan, Korea, the Netherlands, New Zealand, Norway, Portugal, Spain, Sweden, Turkey, the United Kingdom, and the United States. The data series from the World Bank’s World Development Indicators are agriculture value added, industry
value added, and services value added. All series are measured in constant local currency units, base year 2000 (with the exception of Turkey, 1987). These series are extended backward using historical data from the OECD National Accounts publications, except for Korea. A combination of three OECD publications was used: National Accounts of OECD Countries (1950–1968), National Accounts of OECD Countries (1950–1961), and National Accounts of OECD Countries (1960–1977); see OECD (1963, 1970, 1979). The primary resource was the book covering the period from 1950 to 1968. We compute growth rates of the OECD data for corresponding variables for years prior to those available through the World Bank and apply them to the World Bank series. Data on value added by sector for all Latin American countries in our data set (Argentina, Bolivia, Brazil, Chile, Colombia, Costa Rica, Mexico, and Venezuela) are obtained from the 10-Sector Database; see Timmer and de Vries (2009). This database has data on value added in constant local prices for ten sectors. These data are aggregated into value added in agriculture, industry, and services using the ISIC III definitions above. Employment by Sector. The sectoral employment data are obtained from a variety of sources as well. We obtain data on civilian employment in each broad sector from The OECD (2008) Labor Force Statistics database online for Australia, Austria, Belgium, Canada, Finland, France, Ireland, Italy, Japan, the Netherlands, New Zealand, Norway, Spain, Turkey, the United Kingdom, and the United States. Data for Portugal on sectoral employment are obtained from the Banco de Portugal (2006). The data are aggregated into the same three broad sectors. We extend this series forward to 2005 by using growth rates for each variable computed from the EU KLEMS database; see O’Mahony and Timmer (2009). Data for Korea and all Latin American countries are obtained from the 10-Sector Database. We aggregate these data into the three broad sectors using the ISIC III definitions above. Hours Worked by Sector. We obtain data on hours of work per worker from the EU KLEMS database for Australia, Austria, Belgium, Denmark, Finland, France, Ireland, Italy, Japan, the Netherlands, Portugal, Spain, Sweden, the United Kingdom, and the United States. These data cover the period 1970 to 2005. Data for Brazil, Canada, Chile, Colombia, Costa Rica, Greece, Mexico, Norway, New Zealand, and Turkey are obtained from the International Labour Office (2008) Laborsta database. These series are
much shorter; the time period covered varies by country, but it starts after 1990 for all countries. From these data, we compute the ratio of per-worker hours by sector relative to per-worker aggregate hours. In analyzing these ratios, we find that relative sectoral hours are remarkably stable over time for most countries and that these ratios are very close to one for many countries. Moreover, any deviations from one in relative hours across countries are not systematically related to the level of development. For each country, we use the average value of each of these ratios, denoted as hi , i ∈ {a, m, s}, to calculate shares of hours by sector and value added per hour by sector. Because the time series of sectoral hours are shorter than those of sectoral employment and value added, this simplification allows us to compute sectoral shares of total hours and value added per hour without shortening the time series. We do not have data on sectoral hours for Argentina, Bolivia, Korea, and Venezuela, and we assume that hi = 1 for these countries. Total hours by sector are computed by multiplying employment with hours per worker in each sector. We construct value added per hour by dividing the series of value added with the corresponding series of total hours for each sector. Shares of hours by sector are simply the ratio of total hours by sector relative to total aggregate hours. Prices by Sector. We compute implicit producer price deflators for each sector using data on sectoral value added at constant and current prices from the World Development Indicators. The price data are consistent with the sectoral definitions for value added. They cover the period from 1971 to 2004. DEPARTMENT OF ECONOMICS, UNIVERSITY OF TORONTO DEPARTMENT OF ECONOMICS, UNIVERSITY OF TORONTO
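Before the references, a minimal sketch of the aggregate-series construction from Section A of the Appendix above, with toy numbers. The variable names follow the text (RGDPL, POP, EMP, HOURS), the data are invented, access to PWT 6.2 and the Total Economy Database is not shown, and trending each input with the HP filter before forming the ratio is one reading of the trending step.

```python
# Sketch of the aggregate GDP-per-hour construction described in the Appendix (toy data).
import numpy as np

def hp_trend(y, lam=100.0):
    """Hodrick-Prescott trend of an annual series (smoothing parameter 100, as in the text)."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    K = np.zeros((T - 2, T))                 # second-difference operator
    for t in range(T - 2):
        K[t, t:t + 3] = [1.0, -2.0, 1.0]
    return np.linalg.solve(np.eye(T) + lam * (K.T @ K), y)

# Toy annual series for one country:
RGDPL = [20000, 20800, 21500, 22800, 23500, 24900]   # PPP GDP per capita, constant prices
POP   = [10.0, 10.1, 10.2, 10.3, 10.4, 10.5]         # population, millions
EMP   = [4.0, 4.1, 4.1, 4.2, 4.3, 4.3]               # employment, millions
HOURS = [1900, 1890, 1880, 1870, 1860, 1850]         # annual hours worked per worker

# Trend the series, then form Y/Lh = RGDPL * POP / (EMP * HOURS):
rgdpl, pop, emp, hours = (hp_trend(s) for s in (RGDPL, POP, EMP, HOURS))
y_lh = rgdpl * pop / (emp * hours)
print(np.round(y_lh, 2))                     # PPP-adjusted GDP per hour, from trended inputs
```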
REFERENCES
Adamopoulos, Tasso, and Ahmet Akyol, "Relative Underperformance Alla Turca," Review of Economic Dynamics, 12 (2009), 697–717.
Baily, Martin, Diana Farrell, and Jaana Remes, "Domestic Services: The Hidden Key to Growth," McKinsey Global Institute, 2005.
Baily, Martin, and Robert Solow, "International Productivity Comparisons Built from the Firm Level," Journal of Economic Perspectives, 15 (2001), 151–172.
Banco de Portugal, "Séries Longas para a Economia Portuguesa pós II Guerra Mundial," 2006. Available at http://www.bportugal.pt/publish/serlong/serlong p.htm.
Baumol, William, "Macroeconomics of Unbalanced Growth: The Anatomy of Urban Crisis," American Economic Review, 57 (1967), 415–426.
Caselli, Francesco, "Accounting for Cross-Country Income Differences," in Handbook of Economic Growth, Philippe Aghion and Steven Durlauf, eds. (New York: North Holland Elsevier, 2005).
Caselli, Francesco, and Wilbur J. Coleman II, "The U.S. Structural Transformation and Regional Convergence: A Reinterpretation," Journal of Political Economy, 109 (2001), 584–616.
Caselli, Francesco, and Silvana Tenreyro, "Is Poland the Next Spain?" in NBER International Seminar on Macroeconomics 2004, Richard Clarida, Jeffrey Frankel, Francesco Giavazzi, and Kenneth West, eds. (Cambridge, MA: The MIT Press, 2006).
Chanda, Areendam, and Carl-Johan Dalgaard, "Dual Economies and International Total Factor Productivity Differences: Channelling the Impact from Institutions, Trade, and Geography," Economica, 75 (2008), 629–661.
Chari, Varadarajan V., Patrick Kehoe, and Ellen McGrattan, "The Poverty of Nations: A Quantitative Exploration," NBER Working Paper No. 5414, 1996.
Coleman, Wilbur J., "Accommodating Emerging Giants," Mimeo, Duke University, 2007.
Conference Board, Total Economy Database, 2008. Available at www.conference-board.org/economics/.
Córdoba, Juan, and Marla Ripoll, "Agriculture, Aggregation, and Development Accounting," Mimeo, University of Pittsburgh, 2004.
Duarte, Margarida, and Diego Restuccia, "The Productivity of Nations," Federal Reserve Bank of Richmond Economic Quarterly, 92 (2006), 195–223.
——, "The Structural Transformation and Aggregate Productivity in Portugal," Portuguese Economic Journal, 6 (2007), 23–46.
Echevarria, Cristina, "Changes in Sectoral Composition Associated with Growth," International Economic Review, 38 (1997), 431–452.
Galdón-Sánchez, José, and James Schmitz Jr., "Competitive Pressure and Labor Productivity: World Iron-Ore Markets in the 1980's," American Economic Review, 92 (2002), 1222–1235.
Gollin, Douglas, Stephen Parente, and Richard Rogerson, "The Role of Agriculture in Development," American Economic Review Papers and Proceedings, 92 (2002), 160–164.
——, "The Food Problem and the Evolution of International Income Levels," Journal of Monetary Economics, 54 (2007), 1230–1255.
Hansen, Gary, and Edward C. Prescott, "From Malthus to Solow," American Economic Review, 92 (2002), 1205–1217.
Herrendorf, Berthold, and Ákos Valentinyi, "Which Sectors Make the Poor Countries So Unproductive?" Mimeo, Arizona State University, 2006.
Heston, Alan, Robert Summers, and Bettina Aten, "Penn World Table Version 6.2," Center for International Comparisons of Production, Income and Prices at the University of Pennsylvania, 2006. Available at http://pwt.econ.upenn.edu.
International Labour Office, "LABORSTA Database," Bureau of Statistics, 2008. Available at http://laborsta.ilo.org/.
Jones, Charles, "On the Evolution of the World Income Distribution," Journal of Economic Perspectives, 11 (1997), 19–36.
Kehoe, Timothy, and Edward C. Prescott, "Great Depressions of the 20th Century," Review of Economic Dynamics, 5 (2002), 1–18.
Kongsamut, Piyabha, Sérgio Rebelo, and Danyang Xie, "Beyond Balanced Growth," Review of Economic Studies, 68 (2001), 869–882.
Kuznets, Simon, Modern Economic Growth (New Haven, CT: Yale University Press, 1966).
Laitner, John, "Structural Change and Economic Growth," Review of Economic Studies, 67 (2000), 545–561.
Lucas, Robert, "Some Macroeconomics for the 21st Century," Journal of Economic Perspectives, 14 (2000), 159–168.
Maddison, Angus, "Economic Growth and Structural Change in the Advanced Countries," in Western Economies in Transition, Irving Leveson and Jimmy Wheeler, eds. (London: Croom Helm, 1980).
Ngai, Rachel, "Barriers and the Transition to Modern Growth," Journal of Monetary Economics, 51 (2004), 1353–1383.
Ngai, Rachel, and Christopher Pissarides, "Structural Change in a Multisector Model of Growth," American Economic Review, 97 (2007), 429–443.
OECD, National Accounts of OECD Countries: Detailed Tables, Volume II, 1950–1961 (Paris, France: OECD, 1963).
——, National Accounts of OECD Countries: Detailed Tables, Volume II, 1950–1968 (Paris, France: OECD, 1970).
——, National Accounts of OECD Countries: Detailed Tables, Volume II, 1960–1977 (Paris, France: OECD, 1979).
——, Labor Force Statistics, 2008. Available at http://hermia.sourceoecd.org/vl=718832/cl=16/nw=1/rpsv/outlookannuals.htm.
O'Mahony, Mary, and Marcel P. Timmer, "Output, Input and Productivity Measures at the Industry Level: The EU KLEMS Database," Economic Journal, 119 (2009), F374–F403. Available at www.euklems.net.
Pilat, Dirk, "Labour Productivity Levels in OECD Countries: Estimates for Manufacturing and Selected Service Sectors," OECD Working Paper No. 169, 1996.
Prescott, Edward C., "Prosperity and Depression," American Economic Review, 92 (2002), 1–15.
Restuccia, Diego, Dennis Yang, and Xiaodong Zhu, "Agriculture and Aggregate Productivity: A Quantitative Cross-Country Analysis," Journal of Monetary Economics, 55 (2008), 234–250.
Rogerson, Richard, "Structural Transformation and the Deterioration of European Labor Market Outcomes," Journal of Political Economy, 116 (2008), 235–259.
Schmitz, James Jr., "What Determines Productivity? Lessons from the Dramatic Recovery of the U.S. and Canadian Iron Ore Industries Following Their Early 1980s Crisis," Journal of Political Economy, 113 (2005), 582–625.
Summers, Robert, and Alan Heston, "The Penn World Table: An Expanded Set of International Comparisons, 1950–1988," Quarterly Journal of Economics, 106 (1991), 327–368.
Timmer, Marcel P., and Gaaitzen J. de Vries, "Structural Change and Growth Accelerations in Asia and Latin America: A New Sectoral Data Set," Cliometrica, 3 (2009), 165–190. Available at www.ggdc.net.
U.S. Census Bureau, Department of Commerce, Historical Statistics of the United States: Colonial Times to 1970 (Part I) (Washington, DC: U.S. Government Printing Office, 1975).
Vollrath, Dietrich, "How Important Are Dual Economy Effects for Aggregate Productivity?" Journal of Development Economics, 88 (2009), 325–334.
World Bank, World Development Indicators, 2008. Available at http://devdata.worldbank.org/dataonline/.
TEACHER QUALITY IN EDUCATIONAL PRODUCTION: TRACKING, DECAY, AND STUDENT ACHIEVEMENT∗ JESSE ROTHSTEIN Growing concerns over the inadequate achievement of U.S. students have led to proposals to reward good teachers and penalize (or fire) bad ones. The leading method for assessing teacher quality is “value added” modeling (VAM), which decomposes students’ test scores into components attributed to student heterogeneity and to teacher quality. Implicit in the VAM approach are strong assumptions about the nature of the educational production function and the assignment of students to classrooms. In this paper, I develop falsification tests for three widely used VAM specifications, based on the idea that future teachers cannot influence students’ past achievement. In data from North Carolina, each of the VAMs’ exclusion restrictions is dramatically violated. In particular, these models indicate large “effects” of fifth grade teachers on fourth grade test score gains. I also find that conventional measures of individual teachers’ value added fade out very quickly and are at best weakly related to long-run effects. I discuss implications for the use of VAMs as personnel tools.
I. INTRODUCTION Parallel literatures in labor economics and education adopt similar econometric strategies for identifying the effects of firms on wages and of teachers on student test scores. Outcomes are modeled as the sum of firm or teacher effect, individual heterogeneity, and transitory, orthogonal error. The resulting estimates of firm effects are used to gauge the relative importance of firm and worker heterogeneity in the determination of wages. In education, so-called “value added” models (hereafter, VAMs) have been used to measure the importance of teacher quality to educational production, to assess teacher preparation and certification programs, and as important inputs to personnel evaluations and merit pay programs.1 ∗ Earlier versions of this paper circulated under the title “Do Value Added Models Add Value?” I am grateful to Nathan Wozny and Enkeleda Gjeci for exceptional research assistance. I thank Orley Ashenfelter, Henry Braun, David Card, Henry Farber, Bo Honor´e, Brian Jacob, Tom Kane, Larry Katz, Alan Krueger, Sunny Ladd, David Lee, Lars Lefgren, Austin Nichols, Amine Ouazad, Mike Rothschild, Cecilia Rouse, Diane Schanzenbach, Eric Verhoogen, Tristan Zajonc, anonymous referees, and conference and seminar participants for helpful conversations and suggestions. I also thank the North Carolina Education Data Research Center at Duke University for assembling, cleaning, and making available the confidential data used in this study. Financial support was generously provided by the Princeton Industrial Relations Section and Center for Economic Policy Studies and the U.S. Department of Education (under Grant R305A080560).
[email protected]. 1. On firm effects, see, for example, Abowd and Kramarz (1999). For recent examinations of teacher effects modeling, see McCaffrey et al. (2003); Wainer (2004); Braun (2005a, 2005b); and Harris and Sass (2006). C 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of
Technology. The Quarterly Journal of Economics, February 2010
All of these applications suppose that the estimates can be interpreted causally. But observational analyses can identify causal effects only under unverifiable assumptions about the correlation between treatment assignment—the assignment of students to teachers, or the matching of workers to firms—and other determinants of test scores and wages. If these assumptions do not hold, the resulting estimates of teacher and firm effects are likely to be quite misleading. Anecdotally, assignments of students to teachers incorporate matching to take advantage of teachers’ particular specialties, intentional separation of children who are known to interact badly, efforts on the principal’s part to reward favored teachers through the allocation of easy-to-teach students, and parental requests (see, e.g., Monk [1987]; Jacob and Lefgren [2007]). These are difficult to model statistically. Instead, VAMs typically assume that teacher assignments are random conditional on a single (observed or latent) factor. In this paper, I develop and implement tests of the exclusion restrictions of commonly used value added specifications. My strategy exploits the fact that future teachers cannot have causal effects on past outcomes, whereas violations of model assumptions may lead to apparent counterfactual “effects” of this form. Test scores, like wages, are serially correlated, and as a result an association between the current teacher and the lagged score is strong evidence against exogeneity with respect to the current score. I examine three commonly used VAMs, two of which have direct parallels in the firm effects literature. In the simplest, most widely used VAM—which resembles the most common specification for firm effects—the necessary exclusion restriction is that teacher assignments are orthogonal to all other determinants of the so-called “gain” score, the change in a student’s test score over the course of the year. If this restriction holds, fifth grade teacher assignments should not be correlated with students’ gains in fourth grade. Using a large microdata set describing North Carolina elementary students, I find that there is in fact substantial within-school dispersion of students’ fourth grade gains across fifth grade classrooms. Sorting on past reading gains is particularly prominent, though there is clear evidence of sorting on math gains as well. Because test scores exhibit strong mean reversion— and thus gains are negatively autocorrelated—sorting on past gains produces bias in the simple VAM’s estimates.
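A stripped-down illustration of this falsification logic on synthetic data; the paper's actual tests are richer (within school, with additional controls) and use the North Carolina microdata, not the toy data-generating process below.

```python
# Synthetic example: grade-5 classroom indicators should not "explain" grade-4 gains if
# assignment is as good as random with respect to them. All names here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n_classes, class_size = 20, 25
n = n_classes * class_size

gain4 = rng.normal(size=n)                       # grade-4 gain score (standardized)
signal = gain4 + rng.normal(size=n)              # noisy signal a school might sort on
tracked = np.argsort(np.argsort(signal)) // class_size    # sorted (tracked) assignment
random_ = rng.permutation(n) // class_size                # random assignment

def f_stat(assignment, y):
    """F-test that classroom means of y are equal (classroom dummies, no other controls)."""
    X = np.zeros((len(y), n_classes))
    X[np.arange(len(y)), assignment] = 1.0
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss_u = ((y - X @ beta) ** 2).sum()
    rss_r = ((y - y.mean()) ** 2).sum()
    df1, df2 = n_classes - 1, len(y) - n_classes
    return ((rss_r - rss_u) / df1) / (rss_u / df2)

print("tracked assignment: F =", round(f_stat(tracked, gain4), 1))   # far above 1: past gains differ by classroom
print("random assignment:  F =", round(f_stat(random_, gain4), 1))   # near 1: restriction not rejected
```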
The other VAMs that I consider rely on different exclusion restrictions, namely that classroom assignments are as good as random conditional on either the lagged test score or the student’s (unobserved, but permanent) ability. I discuss how similar strategies can be used to test these restrictions as well. I find strong evidence in the data against each. Evidently, classroom assignments respond dynamically to annual achievement in ways that are not captured by the controls typically included in VAM specifications. To evaluate the magnitude of the biases that assignments produce, I compare common VAMs to a richer model that conditions on the complete achievement history. Estimated teacher effects from the rich model diverge importantly from those obtained from the simple VAMs in common use. I discuss how selection on unobservables is likely to produce substantial additional biases. I use a simple simulation to explore the sensitivity of teacher rankings to these biases. Under plausible assumptions, simple VAMs can be quite misleading. The rich VAM that controls for all observables does better, but still yields rankings that diverge meaningfully from the truth. My estimates also point to an important substantive result. To the extent that any of the VAMs that I consider identify causal effects, they indicate that teachers’ long-run effects are at best weakly proxied by their immediate impacts. A teacher’s effect in the year of exposure—the universal focus of value added analyses—is correlated only .3 to .5 with her cumulative effect over two years, and even less with her effect over three years. Accountability policies that rely on measures of short-term value added would do an extremely poor job of rewarding the teachers who are best for students’ longer-run outcomes. An important caveat to the empirical results is that they may be specific to North Carolina. Students in other states or in individual school districts might be assigned to classrooms in ways that satisfy the assumptions required for common VAMs. But at the least, VAM-style analyses should attempt to evaluate the model assumptions, perhaps with methods like those used here. Models that rely on incorrect assumptions are likely to yield misleading estimates, and policies that use these estimates in hiring, firing, and compensation decisions may reward and punish teachers for the students they are assigned as much as for their actual effectiveness in the classroom. Section II reviews the use of preassignment variables to test exogeneity assumptions. Section III introduces the three VAMs,
discusses their implicit assumptions, and describes my proposed tests. Section IV describes the data. Section V presents results. Section VI attempts to quantify the biases that nonrandom classroom assignments produce in VAM-based analyses. Section VII presents evidence on teachers’ long-run effects. I conclude, in Section VIII, by discussing some implications for the design of incentive pay systems in education. II. USING PANEL DATA TO TEST EXCLUSION RESTRICTIONS A central assumption in all econometric studies of treatment effects is that the treatment is uncorrelated with other determinants of the outcome, conditional on covariates. Although the assumption is ultimately untestable—the “fundamental problem of causal inference” (Holland 1986)—the data can provide indications that it is unlikely to hold. In experiments, for example, significant correlations between treatment and preassignment variables are interpreted as evidence that randomization was unsuccessful.2 Panel data can be particularly useful. A correlation between treatment and some preassignment variable X need not indicate bias in the estimated treatment effect if X is uncorrelated with the outcome variable of interest. But outcomes are typically correlated within individuals over time, so an association between treatment and the lagged outcome strongly suggests that the treatment is not exogenous with respect to posttreatment outcomes. This insight has been most fully explored in the literature on the effect of job training on wages and employment. Today’s wage or employment status is quite informative about tomorrow’s, even controlling for all observables. Evidence that assignment to job training is correlated with lagged wage dynamics indicates that simple specifications for the effect of training on outcomes are likely to yield biased estimates (Ashenfelter 1978). Richer models of the training assignment process may absorb this correlation while permitting identification (Heckman, Hotz, and Dabos 1987). But even these models may impose testable restrictions on the relationship between treatment and the outcome history 2. Similar tests are often used in nonexperimental analyses: Researchers conducting propensity score matching studies frequently check for “balance” of covariates conditional on the propensity score (Rosenbaum and Rubin 1984), and Imbens and Lemieux (2008) recommend analogous tests for regression discontinuity analyses.
(Ashenfelter and Card 1985; Card and Sullivan 1988; Jacobson, LaLonde, and Sullivan 1993).3 In value added studies, the multiplicity of teacher “treatments” can blur the connection to program evaluation methods. But the utility of past outcomes for specification diagnostics carries over directly. Identification of a teacher’s effect rests on assumptions about the relationship between the teacher assignment and the other determinants of future achievement, and the relationship with past achievement can be informative about the plausibility of these assumptions. Only a few studies have attempted to validate VAMs. Harris and Sass (2007) and Jacob and Lefgren (2008) show that value added coefficients are weakly but significantly correlated with principals’ ratings of teacher performance. Of course, if principal decisions about classroom assignments created bias in the VAMs, causality could run from principal opinions to estimated value added rather than the reverse. More relevant to the current analysis, Kane and Staiger (2008) demonstrate that VAM estimates from observational data are approximately unbiased predictors of teachers’ effects when students are randomly assigned. Although I examine a question closely related to that considered by Kane and Staiger, my larger and more representative sample permits me to extend their analysis in two ways. First, I have much more statistical power. This enables me to identify biases that are substantively important but that lie well within Kane and Staiger’s confidence intervals. Second, my sample resembles the sort that would be used for any VAM intended as a teacher compensation or retention tool. In particular, it includes teachers specializing in students (e.g., late readers) who cannot be readily identified and excluded from large-scale analyses. The likely exclusion of such teachers from Kane and Staiger’s sample quite plausibly avoids the most severe biases in observational VAM estimates.4 3. Of course, these sorts of tests cannot diagnose all model violations. If treatment assignments depend on unobserved determinants of future outcomes that are uncorrelated with the outcome history, the treatment effect estimator may be biased even though treatment is uncorrelated with past outcomes. 4. In the Kane and Staiger experiment, principals were given the name of one teacher and asked to identify a comparison teacher such that it would be appropriate to randomly assign students within the pair. One imagines that principals generally chose a comparison who was assigned similar students as the focal teacher in the preexperimental data. Moreover, a substantial majority of principals declined to participate, perhaps because the initial teacher was a specialist for whom no similar comparison could be found.
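To make the logic of this section concrete, the following minimal sketch (in Python; not drawn from the paper, with purely illustrative variable names and parameter values) regresses a pre-treatment outcome on a treatment indicator in simulated data where assignment responds to past performance. A significant coefficient is exactly the kind of evidence against exogeneity discussed above.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000
    ability = rng.normal(size=n)                 # persistent unobserved determinant of outcomes
    y_lag = ability + rng.normal(size=n)         # pre-treatment outcome

    # Nonrandom assignment: units with low past outcomes are more likely to be treated
    treated = (y_lag + rng.normal(size=n) < 0).astype(float)

    # Regress the lagged outcome on treatment; a nonzero coefficient signals that
    # treatment is correlated with persistent determinants of future outcomes.
    X = np.column_stack([np.ones(n), treated])
    coef, *_ = np.linalg.lstsq(X, y_lag, rcond=None)
    resid = y_lag - X @ coef
    se = np.sqrt(np.diag(np.linalg.inv(X.T @ X) * resid.var(ddof=2)))
    print(f"coefficient on treatment: {coef[1]:.2f} (SE {se[1]:.3f})")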
III. STATISTICAL MODEL AND METHODS

This section develops the statistical framework for VAM analysis and introduces my tests. I begin by defining the parameters of interest in Section III.A. In Section III.B, I introduce the three VAMs that I consider. Section III.C describes the exclusion restrictions that the VAM requires to permit identification of the causal effects of interest and develops the implications of these restrictions for the relationship between the current teacher and lagged outcome. Section III.D discusses the implementation of the tests.

III.A. Defining the Problem

I take the parameter of interest in value added modeling to be the effect on a student's test score at the end of grade g of being assigned to a particular grade-g classroom rather than to another classroom at the same school. Later, I extend this to look at dynamic treatment effects (that is, the effect of the grade-g classroom on the g + s score). I do not distinguish between classroom and teacher effects, and use the terms interchangeably. In the Online Appendix, I consider this distinction, defining a teacher's effect as the time-invariant component of the effects of the classrooms taught by the teacher over several years. The basic conclusions are unaffected by this redefinition.
I am interested in whether common VAMs identify classroom effects with arbitrarily large samples. I therefore sidestep small-sample issues by considering the properties of VAM estimates as the number of students grows with the number of teachers (and classrooms) fixed.5 If classroom effects are identified under these unrealistic asymptotics, VAMs may be usable in compensation and retention policy with appropriate allowances for the sampling errors that arise with finite class sizes;6 if not, these corrections are likely to go awry.
A final important distinction is between identification of the variance of teacher quality and identification of individual teachers' effects. I focus exclusively on the latter. It is impractical

5. Under realistic asymptotics, the number of classrooms should rise in proportion to the number of students. If so, classroom effects are not identified under any exogeneity restrictions: Even in the asymptotic limit, the number of students per teacher remains finite and the sampling error in an individual teacher's effect remains nontrivial.

6. A typical approach shrinks a teacher's estimated effect toward the population mean in proportion to the degree of imprecision in the estimate. The resulting empirical Bayes estimate is the best linear predictor of the teacher's true effect, given the noisy estimate. See McCaffrey et al. (2003, pp. 63–68).
to report each of several thousand teachers’ estimated effects, however. I therefore report only the implied standard deviations (across teachers) of teachers’ actual and counterfactual effects, along with tests of the hypothesis that the teacher effects are all zero.7 III.B. Data Generating Process and the Three VAMs I develop the three VAMs and the associated tests in the context of a relatively general educational production function, modeled on those used by Todd and Wolpin (2003) and Harris and Sass (2006), that allows student achievement to depend on the full history of inputs received to date plus the student’s innate ability. Separating classroom effects from other inputs, I assume that the test score of student i at the end of grade g, Aig , can be written as (1)
A_{ig} = \alpha_g + \sum_{h=1}^{g} \beta_{hgc(i,h)} + \mu_i \tau_g + \sum_{h=1}^{g} \varepsilon_{ih} \phi_{hg} + v_{ig}.
Here, βhgc is the effect of being in classroom c in grade h on the grade-g test score, and c(i, h) ∈ {1, . . . , Jh} indexes the classroom to which student i is assigned in grade h. μi is individual ability. We might expect the achievement gap between high-ability and low-ability students to grow over time; this would correspond to τk > τg > 0 for each k > g. εih captures all other inputs in grade h, including those received from the family, nonclassroom peers, and the community. It might also include developmental factors: A precocious child might have positive εs in early grades and negative εs in later grades as her classmates caught up. As this example shows, ε is quite likely to be serially correlated within students across grades. Finally, vig represents measurement error in the grade-g test relative to the student's "true" grade-g achievement. This is independent across grades within students.8
A convenient restriction on the time pattern of classroom effects is uniform geometric decay, βhg′c = βhgc λ^{g′−g} for some 0 ≤ λ ≤ 1 and all h ≤ g ≤ g′. A special case is λ = 1, corresponding to perfect persistence. Although my results do not depend on these restrictions, I impose them as needed for notational simplicity.

7. Rivkin, Hanushek, and Kain (2005) develop a strategy for identifying the variance of teachers' effects, but not the effect of individual teachers, under weaker assumptions than are required by the VAMs described below.

8. I define the β parameters to include any classroom-level component of vig and assume that vig is independent across students in the same classroom.
I consider nonuniform decay in Section VII. Note that there is no theoretical basis for restrictions on the decay of nonclassroom effects (i.e., on φhg).
It will be useful to adopt some simplifying notation. Let ωig ≡ Σ_{h=1}^{g} εih φhg be the composite grade-g residual achievement, and let Δ indicate first differences across student grades: Δβhgc ≡ βhgc − βh,g−1,c, Δτg ≡ τg − τg−1, Δωig ≡ ωig − ωig−1, and so on. Tractable VAMs amount to decompositions of Aig (or, more commonly, of ΔAig ≡ Aig − Aig−1) into the current teacher's effect βggc(i,g), a student heterogeneity component, and an error assumed to be orthogonal to the classroom assignment. Models differ in the form of this decomposition. In this paper I consider three specifications: a simple regression of gain scores on grade and contemporaneous classroom indicators,

VAM1:   \Delta A_{ig} = \Delta\alpha_g + \beta_{ggc(i,g)} + e_{1ig};

a regression of score levels (or, equivalently, of gains) on classroom indicators and the lagged score,

VAM2:   A_{ig} = \alpha_g + A_{ig-1}\lambda + \beta_{ggc(i,g)} + e_{2ig};

and a regression that stacks gain scores from several grades and adds student fixed effects,

VAM3:   \Delta A_{ig} = \Delta\alpha_g + \beta_{ggc(i,g)} + \mu_i + e_{3ig}.

All three are widely used.9 VAM2 and VAM3 can both be seen as generalizations of VAM1: Constraining λ = 1 converts VAM2 to VAM1, whereas constraining μi ≡ 0 converts VAM3.

9. The most widely used VAM, the Tennessee Value Added Assessment System (TVAAS; see Sanders, Saxton, and Horn [1997]), is specified as a mixed model for level scores that depend on the full history of classroom assignments, but this model implies an equation for annual gain scores of the form used in VAM1. VAM2 is more widely used in the recent economics literature. See, for example, Aaronson, Barrow, and Sander (2007); Goldhaber (2007); Jacob and Lefgren (2008); and Kane, Rockoff, and Staiger (2008). VAM3 was proposed by Boardman and Murnane (1979) and has been used recently by Rivkin, Hanushek, and Kain (2005); Harris and Sass (2006); Boyd et al. (2007); and Jacob and Lefgren (2008).
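As a concrete illustration of these specifications, the sketch below simulates a simplified version of the production function above and estimates VAM1, VAM2, and VAM3 by OLS. It is not the paper's code: the data generating process, sample sizes, parameter values, and the use of pandas and statsmodels are assumptions made for illustration only.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n, J = 500, 10                                        # students, classrooms per grade
    mu = rng.normal(size=n)                               # permanent student ability
    beta = {g: rng.normal(0, 0.15, J) for g in (4, 5)}    # true classroom effects

    df = pd.DataFrame({"student": np.arange(n), "A3": mu + rng.normal(size=n)})
    for g in (4, 5):
        df[f"class{g}"] = rng.integers(J, size=n)         # random assignment in this sketch
        df[f"A{g}"] = (df[f"A{g-1}"] + 0.1 * mu
                       + beta[g][df[f"class{g}"]] + 0.5 * rng.normal(size=n))
    df["gain5"], df["gain4"] = df["A5"] - df["A4"], df["A4"] - df["A3"]

    vam1 = smf.ols("gain5 ~ C(class5)", data=df).fit()          # gain on classroom indicators
    vam2 = smf.ols("A5 ~ A4 + C(class5)", data=df).fit()        # level on lagged score and indicators

    # VAM3: stack gains across grades and add student fixed effects; classroom labels are
    # made grade-specific so that grade-4 and grade-5 classrooms get separate coefficients.
    long = pd.concat([
        df.assign(gain=df["gain4"], cls="g4_" + df["class4"].astype(str)),
        df.assign(gain=df["gain5"], cls="g5_" + df["class5"].astype(str)),
    ])
    vam3 = smf.ols("gain ~ C(cls) + C(student)", data=long).fit()
    print(vam1.params.filter(like="class5").head())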
The Gain Score Model (VAM1). First-differencing the production function (1), we can write the grade-g gain score as

(2)   \Delta A_{ig} = \Delta\alpha_g + \sum_{h=1}^{g-1} \Delta\beta_{hgc(i,h)} + \beta_{ggc(i,g)} + \mu_i \Delta\tau_g + \Delta\omega_{ig} + \Delta v_{ig}.
If we assume that teacher effects do not decay, Δβhgc = 0 for all h < g. The error term e1ig from VAM1 then has three components: e1ig = μi Δτg + Δωig + Δvig. VAM1 will yield consistent estimates of the grade-g classroom effects only if, for each c,

(3)   E[e_{1ig} \mid c(i, g) = c] = 0.
The most natural model that is consistent with (3) is for assignments to depend only on student ability, μi, and for ability to have the same effect on achievement in grades g and g − 1 (i.e., Δτg = 0). With these restrictions, VAM1 can be seen as the first-difference estimator for a fixed effects model, with strict exogeneity of classroom assignments conditional on μi. By contrast, (3) is not likely to hold if c(i, g) depends, even in part, on ωig−1, vig−1, or Aig−1.
Differences in last year's gains across this year's classrooms are informative about the exclusion restriction. Using (2), the average g − 1 gain in classroom c is

(4)   E[\Delta A_{ig-1} \mid c(i, g) = c] = \Delta\alpha_{g-1} + E[\beta_{g-1,g-1,c(i,g-1)} \mid c(i, g) = c] + E[e_{1ig-1} \mid c(i, g) = c].

The first term is constant across c and can be neglected. The second term might vary with c if (for example) a principal compensated for a bad teacher in grade g − 1 by assignment to a better-than-average teacher in grade g. This can be absorbed by examining the across-c(i, g) variation in ΔAig−1 controlling for c(i, g − 1). I estimate specifications of this form below.10 Any

10. This is a test of the hypothesis that students are randomly assigned to grade-g classrooms conditional on the g − 1 classroom. This test is uninformative unless there is independent variation in c(i, g − 1) and c(i, g). To take one example, Nye, Konstantopoulos, and Hedges (2004) use data from the Tennessee STAR class size experiment to study teacher effects. In STAR, "streaming" was quite common, and in many schools there is zero independent variation in third grade classroom assignments controlling for second grade assignments. In this case, identification of teacher effects rests entirely on the assumption that past teachers' effects do not decay.
remaining variation across grade-g classrooms in g − 1 gains, after controlling for g − 1 classroom assignments, must indicate that students are sorted into grade-g classrooms on the basis of e1ig−1.
Sorting on e1ig−1 would not necessarily violate (3) if e1ig were not serially correlated. But the definition of e1ig above indicates four sources of potential serial correlation. First, ability μi appears in both e1ig and e1ig−1 (unless Δτg = 0). Second, the εig process may be serially correlated. Third, even if ε is white noise, Δωig is a moving average of order g − 1 (absent strong restrictions on the φ coefficients). Finally, Δvig is an MA(1), degenerate only if var(v) = 0.11 Thus, (3) is not likely to hold if E[e1ig−1 | c(i, g)] is nonzero.

The Lagged Score Model (VAM2). VAM2 frees up the coefficient on the lagged test score. If teacher effects decay geometrically at uniform rate 1 − λ, the grade-g score can be written in terms of the g − 1 score,

(5)   A_{ig} = \check{\alpha}_g + A_{ig-1}\lambda + \beta_{ggc(i,g)} + e_{2ig},

where \check{\alpha}_g = \alpha_g - \alpha_{g-1}\lambda. This can equivalently be expressed as a model for the grade-g gain, by subtracting Aig−1 from each side of (5). In either case, the error is

(6)   e_{2ig} = \mu_i(\tau_g - \tau_{g-1}\lambda) + \sum_{h=1}^{g-1} \varepsilon_{ih}(\phi_{hg} - \phi_{h,g-1}\lambda) + \varepsilon_{ig} + (v_{ig} - v_{ig-1}\lambda).
As before, each of the terms in (6) is likely to be serially correlated. The exclusion restriction for VAM2 is that e2ig is uncorrelated with c (i, g) conditional on Aig−1 . This would hold if c (i, g) were randomly assigned conditional on Aig−1 . It is unlikely to hold if assignments depend on e2ig−1 or on any of its components (including μi ).12 As with the VAM1, I test the VAM2 exclusion restriction by 11. In Rothstein (2008), I conclude that vig accounts for as much as 80% of the variance of Aig . 12. Alternatively, if τg − τg−1 λ is constant across g, (5) can be seen as a fixed effects model with a lagged dependent variable. λ and βgg can be identified via IV or GMM (instrumenting for Aig−1 in a model for Aig ) if c (i, g) depends on μi but is strictly exogenous conditional on this (Anderson and Hsiao 1981; Arellano and Bond 1991). See, for example, Koedel and Betts (2007). Value added researchers typically apply OLS to (5). This is inconsistent for λ and identifies βggc only if c (i, g) is random conditional on Aig−1 .
reestimating the model with the g − 1 gain as the dependent variable. By rearranging the lag of (5), we can write the g − 1 gain as

(7)   \Delta A_{ig-1} = \lambda^{-1}\bigl(\check{\alpha}_{g-1} + A_{ig-1}(\lambda - 1) + \beta_{g-1,g-1,c(i,g-1)} + e_{2ig-1}\bigr).

Thus, the grade-g classroom assignment will have predictive power for the gain in grade g − 1, controlling for the g − 1 achievement level, if grade-g classrooms are correlated either with the g − 1 teacher's effect (i.e., with βg−1,g−1,c(i,g−1)) or with e2ig−1.13 As in VAM1, the former can be ruled out by controlling for g − 1 classroom assignments; the latter would indicate a violation of the VAM2 exclusion restriction if e2 is serially correlated.

The Fixed Effects in Gains Model (VAM3). For the final VAM, we return to equation (2) and to the earlier assumption of zero decay of teachers' effects.14 The student fixed effects used in VAM3 absorb any variation in μi (assuming that Δτg = 1 for each g). Thus, the VAM3 error term is e3ig = Δωig + Δvig. The reliance on fixed effects, combined with the small time dimension of student data sets, means that VAM3 requires stronger assumptions than the earlier models. To avoid bias in the teacher effects βggc, even in large samples, teacher assignments must be strictly exogenous conditional on μi: E[e3ih | c(i, g)] = 0 for all g and all h (Wooldridge 2002, p. 253).15 Conditional strict exogeneity means that the same information, μi or some function of it, is used to make teacher assignments in each grade. This requires, in effect, that principals decide on classroom assignments for the remainder of a child's career before she starts kindergarten. If teacher assignments are updated each year in response to the student's performance during the previous year, strict exogeneity is violated.

13. The test can alternatively be expressed in terms of a model for the score level in g − 2. (Simply rearrange terms in (7).) The VAM2 exclusion restriction of random assignment conditional on Aig−1 will be rejected if the grade-g classroom predicts Aig−2 conditional on Aig−1.

14. Although VAM1 and VAM2 can easily be generalized to allow for nonuniform decay, VAM3 cannot.

15. For practical value added implementations, it is rare to have more than three or four student grades, so asymptotics based on the g dimension are infeasible. One approach if strict exogeneity does not hold is to focus on the first difference of (2). OLS estimation of the first-differenced equation requires that c(i, g) be uncorrelated with e3ig−1, e3ig, and e3ig+1. Though this is weaker than strict exogeneity, it is difficult to imagine an assignment process that would satisfy one but not the other. If the OLS requirements are not satisfied, the only option is IV/GMM (see note 12), instrumenting for both the g and g − 1 classroom assignments. Satisfactory instruments are not apparent.
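A small simulation makes the last point concrete: if grade-5 classrooms are formed by tracking students on their grade-4 gains, the grade-5 assignment is correlated with the prior-year residual, so strict exogeneity fails even conditional on student ability. The sketch below is not drawn from the paper; the sample size, number of classrooms, and parameter values are purely hypothetical.

    import numpy as np

    rng = np.random.default_rng(2)
    n, J = 3000, 6
    mu = rng.normal(size=n)                       # permanent ability
    gain4 = 0.2 * mu + rng.normal(size=n)         # grade-4 gain (no true teacher effects here)

    # Dynamic tracking: grade-5 classrooms formed by ranking students on last year's gain
    class5 = np.argsort(np.argsort(gain4)) * J // n

    # If assignments were strictly exogenous given mu, grade-5 classrooms would not differ
    # systematically in their grade-4 residuals.  Under tracking they clearly do:
    for c in range(J):
        print(f"grade-5 class {c}: mean grade-4 gain = {gain4[class5 == c].mean():+.2f}")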
As before, my test is based on analyses of the apparent effects of grade-g teachers on gains in prior grades. Consider estimation of VAM1, without the student fixed effects that are added in VAM3. If teacher assignments depend on ability, this will bias the VAM coefficients and will lead me to reject the VAM1 exclusion restriction. But the conditional strict exogeneity assumption imposes restrictions on the coefficients from the VAM1 falsification test. Under this assumption, the only source of bias in VAM1 is the omission of controls for μi. As μi enters into every grade's gain equation, grade-g teachers should have the same apparent effects on g − 2 gains as they do on g − 1 gains. Evidence that these differ would indicate that omitted time-varying determinants of gains are correlated with teacher assignments, and therefore that assignments are not strictly exogenous.
Following Chamberlain (1984), consider a projection of μ onto the full sequence of classroom assignments in grades 1 through G:

(8)   \mu_i = \xi_{1c(i,1)} + \cdots + \xi_{Gc(i,G)} + \eta_i.
ξhc is the incremental information about μi provided by the knowledge that the student was in classroom c in grade h, conditional on classroom assignments in all other grades. Substituting (8) into (2), we obtain

(9)   \Delta A_{ig} = \Delta\alpha_g + \sum_{h=1}^{G} \pi_{hgc(i,h)} + \eta_i\,\Delta\tau_g + e_{3ig},
where πggc = ξgc Δτg + βggc and πhgc = ξhc Δτg for h ≠ g. Under conditional strict exogeneity, E[e3ih | c(i, 1), . . . , c(i, G)] = 0 for each h, and the fact that (8) is a linear projection ensures that ηi is uncorrelated with the regressors as well. An OLS regression of grade-g gains onto classroom indicators in grades 1 through G thus estimates the πhgc coefficients without bias.
When G ≥ 3, the underlying parameters are overidentified. To see this, note that

(10)   \pi_{hgc} = \xi_{hc}\,\Delta\tau_g = \xi_{hc}\,\Delta\tau_{g-1}\,\frac{\Delta\tau_g}{\Delta\tau_{g-1}} = \pi_{h,g-1,c}\,\frac{\Delta\tau_g}{\Delta\tau_{g-1}}

for all h > g: The coefficient for grade-h classroom c in a model of gains in grade g is proportional to the same coefficient in a model of gains in g − 1. If there are Jh grade-h classrooms in the sample, this represents Jh − 1 overidentifying restrictions on
the 2Jh elements of the vectors Πhg = {πhg1 . . . πhgJh}′ and Πhg−1 = {πh,g−1,1 . . . πh,g−1,Jh}′.16 To test these restrictions, I estimate the Jh-vector πh and the scalars Δτg−1 and Δτg that minimize

(11)   D = \left( \begin{pmatrix} \hat{\Pi}_{hg-1} \\ \hat{\Pi}_{hg} \end{pmatrix} - \begin{pmatrix} \pi_h\,\Delta\tau_{g-1} \\ \pi_h\,\Delta\tau_g \end{pmatrix} \right)' W^{-1} \left( \begin{pmatrix} \hat{\Pi}_{hg-1} \\ \hat{\Pi}_{hg} \end{pmatrix} - \begin{pmatrix} \pi_h\,\Delta\tau_{g-1} \\ \pi_h\,\Delta\tau_g \end{pmatrix} \right),

using the sampling variance of (\hat{\Pi}_{hg-1}', \hat{\Pi}_{hg}')' as W. Under the null hypothesis of strict exogeneity, the minimized value D is distributed χ2 with Jh − 1 degrees of freedom.17 If D is above the 95% critical value from this distribution, the null is rejected. Intuitively, the correlation between corresponding elements of the coefficient vectors Πhg−1 and Πhg, representing apparent "effects" of grade-h teachers on gains in grades g − 1 and g (g < h), should be 1 or −1 under the null; a correlation far from this would suggest that the exclusion restriction is violated.

III.D. Implementation

To put the three VAMs in the best possible light, I focus on estimation of within-school differences in classroom effects. For many purposes, one might want to make across-school comparisons. But students are not randomly assigned to schools, and those at one school may gain systematically faster than those at another for reasons unrelated to teacher quality. Random assignment to classrooms within schools is at least somewhat plausible. To isolate within-school variation, I augment each of the estimating equations discussed above with a set of indicators for the school attended.18 The tests for VAM1 and VAM2 then amount to tests of whether students are (conditionally) randomly assigned to

16. When G > 3, there are many such pairs of vectors that must be proportional. Even when G = 3, there are additional overidentifying restrictions created by similar proportionality relationships for teachers' effects on future gains. These restrictions might fail either because strict exogeneity is violated or because teachers' effects decay (that is, βhhc ≠ βhgc for some g > h). I therefore focus on restrictions on the coefficients for teachers' effects on past gains, as these provide sharper tests of strict exogeneity.

17. Although there are Jh + 2 unknown parameters, they are underidentified: Multiplying πh by a constant and dividing Δτg−1 and Δτg by the same constant does not change the fit.

18. This makes W singular in (11). For the OMD analysis of VAM3, I drop the elements of Πhg that correspond to the largest class at each school.
classrooms within schools. They resemble tests of successful randomization in stratified experiments, treating schools as strata. Intuitively, I will reject random assignment if replacing a set of school indicators with grade-g classroom indicators adds more explanatory power for g − 1 gains than would be expected by chance alone.
Let Sg and Tg be matrices of indicators for grade-g schools and classrooms. These are collinear, so to eliminate this I define T̃g as the submatrix of Tg that results from excluding the columns corresponding to one classroom per school. The VAM1 test is based on a simple regression:

(12)   \Delta A_{g-1} = \alpha + S_g \delta + \tilde{T}_g \beta + e.

The identifying assumption of VAM1 is rejected if β ≠ 0. I use a heteroscedasticity-robust score test (Wooldridge 2002, p. 60) to evaluate this. I also estimate versions of (12) that include controls for grade-(g − 1) classroom assignments. To test VAM2, I simply add a control for Ag−1 on the right-hand side of (12).
It is clear from the definition of T̃g that only schools with multiple classrooms per grade can contribute to the analysis. One might be concerned that schools with only two or three classrooms will be misleading, as even with random assignment of students to classrooms there will be substantial overlap in the composition of a student's grade-g and grade-(g − 1) classrooms. The Online Appendix presents a Monte Carlo analysis of the VAM1 and VAM2 tests in schools of varying sizes. The VAM1 test has appropriate size even with just two classrooms per school, so long as the number of students per classroom is large. (Recall that I focus on large-class asymptotics.) With small classes, the asymptotic distribution of the test statistic is an imperfect approximation, and as a result the test over-rejects slightly. When there are twenty students per class, the test of VAM1 has size around 10%. With empirically reasonable parameter values, the VAM2 test performs similarly.19,20

19. When students are assigned to classrooms based on the lagged score and when this score incorporates implausibly high degrees of clustering at the fourth grade classroom level, the VAM2 test rejects at high rates even with large classes. This reflects my use of a test that assumes independence of residuals within schools. Unfortunately, it is not possible to allow for dependence, as clustered variance–covariance matrices are consistent only if the number of clusters grows with the number of parameters fixed (Kezdi 2004), and in my application the number of parameters grows with the number of clusters.

20. Kinsler (2008) claims that the VAM3 test also overrejects in simulations. In personal communication, he reports that the problem disappears with large classes.
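The sketch below illustrates the falsification regression in (12) on simulated data with within-school tracking. It is not the paper's code: the school and class counts are hypothetical, and a heteroscedasticity-robust Wald test stands in for the Wooldridge score test used in the actual analysis.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    schools, classes_per, per_class = 20, 3, 25
    rows = []
    for s in range(schools):
        gain4 = rng.normal(size=classes_per * per_class)
        # dynamic tracking: grade-5 classrooms formed by ranking last year's gain
        class5 = np.argsort(np.argsort(gain4)) // per_class
        rows.append(pd.DataFrame({
            "school": f"s{s}",
            "class5": [f"s{s}c{c}" for c in class5],
            "gain4": gain4,
            "A4": gain4 + rng.normal(size=gain4.size),
        }))
    df = pd.concat(rows, ignore_index=True)

    # School indicators plus grade-5 classroom indicators, dropping one class per school
    # (the T-tilde matrix described in the text)
    S = pd.get_dummies(df["school"], dtype=float)
    T = pd.get_dummies(df["class5"], dtype=float).drop(columns=[f"s{s}c0" for s in range(schools)])
    X = pd.concat([S, T], axis=1)

    res = sm.OLS(df["gain4"], X).fit(cov_type="HC1")   # VAM1 version of the test
    R = np.zeros((T.shape[1], X.shape[1]))
    R[:, S.shape[1]:] = np.eye(T.shape[1])             # restrict all classroom coefficients to zero
    print(res.wald_test(R))                            # a small p-value rejects random assignment
    # For the VAM2 version of the test, add df["A4"] as an extra column of X before refitting.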
I also report the standard deviation of the teacher coefficients (the βs in (12)) themselves. The standard deviation of the estimated coefficients necessarily exceeds that of the true coefficients (those that would be identified with large samples of students per teacher, even if these are biased estimates of teachers' true causal effects). Aaronson, Barrow, and Sander (2007) propose a simple estimator for the variance of the true coefficients across teachers. Let β be a mean-zero vector of true projection coefficients and let β̂ be an unbiased finite-sample estimate of β, with E[β′(β̂ − β)] = 0. The variance (across elements) of β can be written as

(13)   E[\beta'\beta] = E[\hat{\beta}'\hat{\beta}] - E[(\hat{\beta} - \beta)'(\hat{\beta} - \beta)].

E[β̂′β̂] is simply the variance across teachers of the coefficient estimates.21 E[(β̂ − β)′(β̂ − β)] is the average heteroscedasticity-robust sampling variance. I weight each by the number of students taught.
Specifications that include indicators for classroom assignments in several grades simultaneously—such as that used for the test of VAM3—introduce two complications. First, the coefficients for teachers in different grades can only be separately identified when there is sufficient shuffling of students between classrooms. If students are perfectly streamed—if a student's classmates in third grade are also his or her classmates in fourth grade—the third and fourth grade classroom indicators are collinear. I exclude from my samples a few schools where inadequate shuffling leads to perfect collinearity. Second, these regressions are difficult to compute, due to the presence of several overlapping sets of fixed effects. As discussed in the Online Appendix, this difficulty is avoided by restricting the samples to students who do not switch schools during the grades for which classroom assignments are controlled.

IV. DATA AND SAMPLE CONSTRUCTION

The specifications described in Section III require longitudinal data that track students' outcomes across several grades, linked to classroom assignments in each grade. I use administrative data on elementary students in North Carolina public schools, assembled and distributed by the North Carolina

21. β̂ is normalized to have mean zero across teachers at the same school, and its variance is adjusted for the degrees of freedom that this consumes.
Education Research Data Center. These data have been used for several previous value added analyses (see, e.g., Clotfelter, Ladd, and Vigdor [2006]; Goldhaber [2007]). I examine end-of-grade math and reading tests from grades 3 through 5, plus “pretests” from the beginning of third grade (which I treat as second grade tests). I standardize the scale scores separately for each subject–grade–year combination.22 The North Carolina data identify the school staff member who administered the end-of-grade tests. In the elementary grades, this was usually the regular teacher. Following Clotfelter, Ladd, and Vigdor (2006), I count a student–teacher match as valid if the test administrator taught a “self-contained” (i.e., all day, all subject) class for the relevant grade in the relevant year, if that class was not designated as special education or honors, and if at least half of the tests that the teacher administered were to students in the correct grade. Using this definition, 73% of fifth graders can be matched to teachers. In each of my analyses, I restrict the sample to students with valid teacher matches in all grades for which teacher assignments are controlled. I focus on the cohort of students who were in fifth grade in 2000–2001. Beginning with the population (N = 99,071), I exclude students who have inconsistent longitudinal records (e.g., gender changes between years); who were not in fourth grade in 1999– 2000; who are missing fourth or fifth grade test scores; or who cannot be matched to a fifth grade teacher. I additionally exclude fifth grade classrooms that contain fewer than twelve sample students or are the only included classroom at the school. This leaves my base sample, consisting of 60,740 students from 3,040 fifth grade classrooms and 868 schools. My analyses all use subsets of this sample that provide sufficient longitudinal data. In analyses of fourth grade gains, for example, I exclude students who have missing third grade scores or who were not in third grade in 1998–1999. In specifications that include identifiers for teachers in multiple grades, I further exclude students who changed schools between grades, plus a few schools where streaming produces perfect collinearity. Table I presents summary statistics. I show statistics for the population, for the base sample, and for my most restricted sample 22. The original score scale is meant to ensure that one point corresponds to an equal amount of learning at each grade and at each point in the within-grade distribution. Rothstein (2008) and Ballou (2009) emphasize the importance of this property for value added modeling. All of the results here are robust to using the original scale.
(used for estimation of equation (9)). The last is much smaller than the others, largely because I require students to have attended the same school in grades 3 through 5 and to have valid teacher matches in each grade. Table I indicates that the restricted sample has higher mean fifth grade scores than the full population. This primarily reflects the lower scores of students who switch schools frequently.23 Average fifth grade gains are similar across samples. The Online Appendix describes each sample in more detail.

23. Table I shows that average third and fourth grade scores in the "population" are well above zero. The norming sample that I use to standardize scores in each grade consists of all students in that grade in the relevant year (i.e., of all third graders in 1999), whereas only those who make normal progress to fifth grade in 2001 are included in the sample for columns (1) and (2). The low scores of students who repeat grades account for the discrepancy.

TABLE I
SUMMARY STATISTICS

                                                   Population          Base sample         Most restricted sample
                                                   Mean      SD        Mean      SD        Mean      SD
                                                   (1)       (2)       (3)       (4)       (5)       (6)
# of students                                      99,071              60,740              23,415
# of schools                                       1,269               868                 598
  1 fifth grade teacher                            122                 0                   0
  2 fifth grade teachers                           168                 207                 122
  3–5 fifth grade teachers                         776                 602                 440
  >5 fifth grade teachers                          203                 59                  36
# of fifth grade classrooms                        4,876               3,040               2,116
# of fifth grade classrooms w/ valid
  teacher match                                    3,315               3,040               2,116
Female (%)                                         49                  50                  51
Black (%)                                          29                  28                  23
Other nonwhite (%)                                 8                   7                   6
Consistent student record (%)                      99                  100                 100
Complete test score record, G4–5 (%)               88                  99                  100
  G3–5 (%)                                         81                  91                  100
  G2–5 (%)                                         72                  80                  100
Changed schools between G3 and G5 (%)              30                  27                  0
Valid teacher assignment in grade 3 (%)            68                  78                  100
  grade 4 (%)                                      70                  86                  100
  grade 5 (%)                                      72                  100                 100
Fr. of students in G5 class in same G4 class       0.22      [0.19]    0.22      [0.17]    0.30      [0.19]
Fr. of students in G5 class in same G3 class       0.15      [0.15]    0.15      [0.13]    0.28      [0.18]

Math scores
  Third grade (beginning of year)                  0.11      [0.97]    0.14      [0.96]    0.20      [0.96]
  Third grade (end of year)                        0.09      [0.94]    0.11      [0.94]    0.19      [0.91]
  Fourth grade (end of year)                       0.04      [0.97]    0.07      [0.97]    0.20      [0.93]
  Fifth grade (end of year)                        0.00      [1.00]    0.09      [0.98]    0.20      [0.94]
  Third grade gain                                 −0.02     [0.70]    −0.02     [0.69]    0.00      [0.69]
  Fourth grade gain                                −0.02     [0.58]    −0.01     [0.58]    0.01      [0.56]
  Fifth grade gain                                 −0.01     [0.55]    0.01      [0.55]    −0.01     [0.53]
Reading scores
  Third grade (beginning of year)                  0.08      [0.98]    0.12      [0.98]    0.17      [0.98]
  Third grade (end of year)                        0.08      [0.95]    0.11      [0.94]    0.19      [0.91]
  Fourth grade (end of year)                       0.04      [0.98]    0.07      [0.97]    0.18      [0.93]
  Fifth grade (end of year)                        0.00      [1.00]    0.07      [0.97]    0.17      [0.94]
  Third grade gain                                 0.01      [0.76]    0.00      [0.75]    0.01      [0.75]
  Fourth grade gain                                −0.02     [0.59]    −0.02     [0.59]    0.00      [0.57]
  Fifth grade gain                                 −0.01     [0.59]    0.00      [0.58]    −0.02     [0.57]

Notes. Summary statistics are computed over all available observations. Test scores are standardized using all third graders in 1999, fourth graders in 2000, and fifth graders in 2001, regardless of grade progress. "Population" in columns (1) and (2) is students enrolled in fifth grade in 2001, merged with third and fourth grade records (if present) for the same students in 1999 and 2000, respectively. Columns (3) and (4) describe the base sample discussed in the text; it excludes students with missing fourth and fifth grade test scores, students without valid fifth grade teacher matches, fifth grade classes with fewer than twelve sample students, and schools with only one fifth grade class. Columns (5) and (6) further restrict the sample to students with nonmissing scores in grades 3–5 (plus the third grade beginning-of-year tests) and valid teacher assignments in each grade, at schools with multiple classes in each school in each grade and without perfect collinearity of classroom assignments in different grades.

As discussed above, my tests can be applied only if there is sufficient reshuffling of classrooms between grades. Table A2 in the Online Appendix shows the fraction of students' fifth grade classmates who were also in the same fourth grade classes, by the number of fourth grade classes at the school. Complete reshuffling (combined with equal-sized classes) would produce 0.5 with two classes, 0.33 with three, and so on. The actual fractions are larger than this, but only slightly. In schools with exactly three fifth grade teachers, for example, 35% of students' fifth grade classmates were also their classmates in fourth grade. In only 7% of multiple-classroom schools do the fourth and fifth grade classroom indicators have deficient rank.

TABLE II
CORRELATIONS OF TEST SCORES AND SCORE GAINS ACROSS GRADES

                       Summary statistics     Correlation with          Correlation with
                                              fifth grade score         fifth grade gain
                       Mean       SD          Math        Reading       Math        Reading      N
                       (1)        (2)         (3)         (4)           (5)         (6)          (7)
Math scores
  G5                   0.02       1.00        1           .78           .29         .08          70,740
  G4                   0.07       0.97        .84         .73           −.27        −.07         61,535
  G3                   0.09       0.95        .80         .70           −.02        −.03         57,382
  G3 pretest           0.08       0.97        .71         .64           .00         −.03         50,661
Reading scores
  G5                   0.01       1.00        .78         1             .10         .31          70,078
  G4                   0.06       0.97        .73         .82           −.05        −.29         61,535
  G3                   0.09       0.95        .70         .78           −.01        −.05         57,344
  G3 pretest           0.08       0.99        .59         .65           .00         −.05         50,629
Math gains
  G4–G5                0.01       0.55        .29         .10           1           .25          61,349
  G3–G4                −0.01      0.58        .11         .07           −.41        −.07         56,171
  G2–G3                0.02       0.70        .08         .05           −.02        .01          50,615
Reading gains
  G4–G5                0.00       0.58        .08         .31           .25         1            60,987
  G3–G4                −0.02      0.59        .08         .10           −.08        −.41         56,159
  G2–G3                0.02       0.75        .09         .10           −.01        .02          50,558

Notes. Each statistic is calculated using the maximal possible sample of valid student records with observations on all necessary scores and normal grade progress between the relevant grades. Column (7) lists the sample size for each row variable; correlations use smaller samples for which the column variable is also available. Italicized correlations are not different from zero at the 5% level.

Table II presents the correlation of test scores and gains across grades and subjects. The table indicates that fifth grade scores are correlated above .8 with fourth grade scores in the same subject, whereas correlations with scores in earlier grades or other subjects are somewhat lower. Fifth grade gains are strongly negatively correlated with fourth grade levels and gains in the same subject and weakly negatively correlated with those in the other subject. The correlations between fifth and third grade gains are small but significant both within and across subjects.
VAM3 is predicated on the notion that student ability is an important component of annual gains. Assuming that high-ability students gain faster, this would imply positive correlations between gains in different years. There is no indication of this in Table II. One potential explanation is that noise in the annual tests introduces negative autocorrelation in gains, but I conclude elsewhere (Rothstein 2008) that even true gains are negatively
autocorrelated. This strongly suggests that VAM3 is poorly suited to the test score data generating process.

V. RESULTS

Tables III, IV, and V present results for the three VAMs in turn. I begin with VAM1, in Table III. I regress fifth grade math and reading gains (in columns (1) and (2), respectively) on indicators for fifth grade schools and classrooms, excluding one classroom per school. In each case, the hypothesis that all of the classroom coefficients are zero (i.e., that classroom indicators have no explanatory power beyond that provided by school indicators) is decisively rejected. The VAM indicates that the within-school standard deviations of fifth grade teachers' effects on math and reading are 0.15 and 0.11, respectively. This is similar to what
TABLE III
EVALUATION OF VAM1: REGRESSION OF GAIN SCORES ON TEACHER INDICATORS

                                       Fifth grade gain      Fourth grade gain     Fifth grade gain      Fourth grade gain
                                       Math      Reading     Math      Reading     Math      Reading     Math      Reading
                                       (1)       (2)         (3)       (4)         (5)       (6)         (7)       (8)
Teacher coefficients
  Fifth grade teachers
    Unadjusted SD                      0.179     0.160       0.134     0.142       0.197     0.181       0.151     0.168
    Adjusted SD                        0.149     0.113       0.077     0.084       0.163     0.126       0.090     0.105
    p-value                            <.001     <.001       .016      .002        <.001     <.001       .035      <.001
  Fourth grade teachers
    Unadjusted SD                                                                  0.188     0.181       0.220     0.193
    Adjusted SD                                                                    0.150     0.125       0.182     0.140
    p-value                                                                        <.001     <.001       <.001     <.001
Exclude invalid fourth grade teacher
  assignments & fifth grade movers?    n         n           n         n           y         y           y         y
# of students                          55,142    55,142      55,142    55,142      40,661    40,661      40,661    40,661
# of fifth grade teachers              3,038     3,038       3,038     3,038       2,761     2,761       2,761     2,761
# of schools                           868       868         868       868         783       783         783       783
R2                                     .195      .100        .132      .086        .297      .176        .254      .174
Adjusted R2                            .148      .047        .081      .033        .203      .066        .154      .064

Notes. Dependent variables are as indicated at the top of each column. Regressions include school indicators, fifth grade teacher indicators, and (in columns (5)–(8)) fourth grade teacher indicators, with one teacher per school per grade excluded. p-values are for test of the hypothesis that all teacher coefficients equal zero, using the heteroscedasticity-robust score test proposed by Wooldridge (2002, p. 60). Standard deviations are of teacher coefficients, normalized to have mean zero at each school and weighted by the number of students taught. Adjusted standard deviations are computed as described in Online Appendix B2. Sample for columns (1)–(4) includes students from the base sample (see text) with nonmissing scores in each subject in grades 3–5. Columns (5)–(8) exclude students without valid fourth grade teacher matches and those who switched schools between fourth and fifth grade.
TABLE IV
EVALUATION OF VAM2: REGRESSIONS WITH CONTROLS FOR LAGGED SCORE LEVELS

                                       Fifth grade gain      Fourth grade gain     Fifth grade gain      Fourth grade gain
                                       Math      Reading     Math      Reading     Math      Reading     Math      Reading
                                       (1)       (2)         (3)       (4)         (5)       (6)         (7)       (8)
Teacher coefficients
  Fifth grade teachers
    Unadjusted SD                      0.176     0.150       0.120     0.129       0.191     0.169       0.138     0.150
    Adjusted SD                        0.150     0.109       0.067     0.076       0.161     0.121       0.079     0.091
    p-value                            <.001     <.001       .040      .007        <.001     <.001       .162      .001
  Fourth grade teachers
    Unadjusted SD                                                                  0.160     0.162       0.182     0.175
    Adjusted SD                                                                    0.121     0.109       0.142     0.126
    p-value                                                                        <.001     <.001       <.001     <.001
Continuous controls
  Fourth grade math score              −0.317    0.239       0.368     −0.213      −0.292    0.255       0.332     −0.229
                                       (0.004)   (0.004)     (0.004)   (0.004)     (0.004)   (0.005)     (0.005)   (0.005)
  Fourth grade reading score           0.195     −0.383      −0.218    0.380       0.189     −0.387      −0.206    0.379
                                       (0.004)   (0.004)     (0.004)   (0.004)     (0.004)   (0.005)     (0.005)   (0.005)
Exclude invalid fourth grade teacher
  assignments & fifth grade movers?    n         n           n         n           y         y           y         y
# of students                          55,142    55,142      55,142    55,142      40,661    40,661      40,661    40,661
# of fifth grade teachers              3,038     3,038       3,038     3,038       2,761     2,761       2,761     2,761
# of schools                           868       868         868       868         783       783         783       783
R2                                     .313      .249        .274      .237        .385      .315        .354      .307
Adjusted R2                            .273      .206        .231      .193        .302      .224        .268      .215

Notes. Dependent variables are as indicated at the top of each column. Regressions include school indicators, fourth grade math and reading scores, fifth grade teacher indicators, and (in columns (5)–(8)) fourth grade teacher indicators, with one teacher per school per grade excluded. p-values are for test of the hypothesis that all teacher coefficients equal zero, using the heteroscedasticity-robust score test proposed by Wooldridge (2002, p. 60). Standard deviations are of teacher coefficients, normalized to have mean zero at each school and weighted by the number of students taught. Adjusted standard deviations are computed as described in Online Appendix B2. Samples correspond to those in Table III.
TABLE V
CORRELATED RANDOM EFFECTS EVALUATION OF VAM3: GAIN SCORE SPECIFICATION WITH STUDENT FIXED EFFECTS

                                                        Math                               Reading
                                             Third      Fourth     Corr          Third      Fourth     Corr
                                             grade      grade      ((1),(2))     grade      grade      ((4),(5))
                                             (1)        (2)        (3)           (4)        (5)        (6)
Unrestricted model
  Standard deviation of teacher effects, adjusted
    Fifth grade teacher                      0.135      0.099      −.04          0.144      0.123      −.06
    Fourth grade teacher                     0.136      0.193      −.07          0.160      0.163      −.08
    Third grade teacher                      0.228      0.166      −.36          0.183      0.145      −.24
  Fit statistics
    R2                                       .314       .376                     .245       .284
    Adjusted R2                              .129       .209                     .042       .092
Restricted model (optimal minimum distance)
  Ratio, effect on G4/effect on G3                 0.14                                1.17
  SD of G5 teacher effects                   0.126      0.018                    0.088      0.103
  Objective function                               2,136                              2,174
  95% critical value                               1,684                              1,684
  p-value                                          <.001                              <.001

Notes. N = 25,974. Students who switched schools between third and fifth grade, who are missing test scores in third or fourth grade (or on the third grade beginning-of-year tests), or who lack valid teacher assignments in any of grades 3–5 are excluded. Schools with only one included teacher per grade or where teacher indicators are collinear across grades are also excluded. "Unrestricted model" reports estimates from a specification with school indicators and indicators for classrooms in grades 3, 4, and 5. Restricted model reports optimal minimum distance estimates obtained from the coefficients from the unrestricted models for third and fourth grade gains, excluding the largest class in each grade in each school. Restriction is that the fourth grade effects are a scalar multiple of the third grade effects. The weighting matrix is the inverse of the robust sampling variance–covariance matrix for the unrestricted estimates, allowing for cross-grade covariances.
has been found in other studies (e.g., Aaronson, Barrow, and Sander [2007]; Rivkin, Hanushek, and Kain [2005]). Columns (3) and (4) present falsification tests in which fourth grade gains are substituted for the fifth grade gains as dependent variables, with the specification otherwise unchanged. The standard deviation of fifth grade teachers’ “effects” on fourth grade gains is 0.08 in each subject, and the hypothesis of zero association is rejected in each specification.24 In both the standard deviation and statistical significance senses, fifth grade classroom assignments are slightly more strongly associated with fourth grade reading gains than with math gains. One potential explanation for these counterfactual effects is that they represent omitted variables bias deriving from my failure to control for fourth grade teachers. Columns (5)–(8) present estimates that do control for fourth grade classroom assignments, using a sample of students who attended the same school in fourth and fifth grades and can be matched to teachers in each grade. Two aspects of the results are of interest. First, fourth grade teachers have strong independent predictive power for fifth grade gains. This is at least suggestive that the “zero decay” assumption is violated. I return to this in Section VII. Second, the coefficients on fifth grade classroom indicators in models for fourth grade gains remain quite variable—even more so than in the sparse specifications in columns (3) and (4)—and are significantly different from zero. Evidently, the correlation between fifth grade teachers and fourth grade gains derives from sorting on the basis of the fourth grade residual, not merely from between-grade correlation of teacher assignments. These results strongly suggest that the exclusion restrictions for VAM1 are violated. To demonstrate this conclusively, however, we need to show that the residual in VAM1, e1ig , is serially correlated. To examine this, I reestimated VAM1 for fourth grade teachers’ effects on fourth grade gains. The correlation between eˆ1i4 and eˆ1i5 is −.38 in math and −.37 in reading. The negative serial correlation of e1 implies that students with high gains in fourth grade will tend to have low gains in fifth grade, and vice versa. Because VAM1 evidently does not 24. The table shows analytic p-values based on the F distribution. As noted earlier, simulations suggest that my tests over-reject slightly. When I use the empirical distribution of test statistics from an appropriately calibrated Monte Carlo simulation (discussed in the Online Appendix) to construct p-values, these are .031 and .004, respectively.
adequately control the determinants of classroom assignments, it gives unearned credit to teachers who are assigned students who did poorly in fourth grade, as these students will predictably post unusually high fifth grade gains when they revert toward their long-run means. Similarly, teachers whose students did unusually well in fourth grade will be penalized by the students’ fall back toward their long-run means in fifth grade. Indeed, an examination of the VAM1 coefficients indicates that fifth grade teachers whose students have above average fourth grade gains have systematically lower estimated value added than teachers whose students underperformed in the prior year. Importantly, this pattern is stronger than can be explained by sampling error in the estimated teacher effects; it reflects true mean reversion and not merely measurement error. Table IV repeats the falsification exercise for VAM2. The structure is identical to that of Table III. Columns (1) and (2) present estimates of the basic VAM for fifth grade teachers’ effects on fifth grade gains, controlling for fourth grade math and reading scores. The standard deviations of fifth grade teachers’ effects are nearly identical to those in Table III. Columns (3) and (4) substitute fourth grade gains as the dependent variable. Once again, we see that fifth grade teachers are strongly predictive, more so in reading than in math.25 Columns (5)–(8) augment the specification with controls for fourth grade teachers. The fifth grade teacher coefficients are no longer jointly significant in the fourth grade math gain specification, though they remain quite large in magnitude. They are still highly significant in the specification for fourth grade reading gains. The VAM2 residuals, like those from VAM1, are nontrivially correlated between fourth and fifth grades, −.21 for math gains and −.19 for reading. They are also correlated across subjects: −.14 between fourth grade reading and fifth grade math. Thus, the evidence that fifth grade teacher assignments are correlated with the fourth grade residuals indicates that the VAM2 exclusion restriction is violated, regardless of whether the dependent variable is the math or the reading score. As before, fifth grade teachers’ effects on fifth grade scores are negatively correlated with their counterfactual “effects” on fourth grade gains, suggesting that mean reversion in student achievement—combined 25. p-values based on Monte Carlo simulations (see note 24) are .086 and .018 in columns (3) and (4), respectively.
with nonrandom classroom assignments—is an important source of bias in VAM2.
To implement the VAM3 falsification test, I begin by selecting the subsample with nonmissing third and fourth grade gains; valid teacher assignments in grades 3, 4, and 5; and continuous enrollment at the same school in all three grades. I exclude 26 schools where the three sets of indicators for teachers in grades 3, 4, and 5 (dropping one teacher in each grade from each school) are collinear. I then regress both the third and fourth grade gains on school indicators and on each of the three sets of teacher indicators.26
Table V reports estimates for math gains, in columns (1) and (2), and for reading gains, in columns (4) and (5). The first panel shows the standard deviations (adjusted for sampling error) of the coefficients for each grade's teachers. Gains in each subject and in each grade are substantially correlated with classroom assignments in all three grades. Although p-values are not shown, in all twelve cases the hypothesis of zero effects is rejected.
Columns (3) and (6) report the across-teacher correlations between the coefficients in the models for third and fourth grade gains (i.e., between Πh3 and Πh4). The most important correlation is that for fifth grade teachers, −.04 for math and −.06 for reading. Recall that strict exogeneity implies that the fifth grade teacher coefficients in the model for fourth grade gains should be proportional to the corresponding coefficients in the model for third grade gains, Π54 = (Δτ4/Δτ3) Π53, implying a correlation of ±1. The near-zero correlations strongly suggest that a single ability factor is unable to account for the apparent "effects" of fifth grade teachers on gains in earlier grades. Indeed, these correlations are direct evidence against the VAM3 identifying assumption of conditional strict exogeneity.
The lower panel of Table V presents OMD estimates of the restricted model.27 For math scores, the estimated ratio Δτ4/Δτ3 is 0.14, implying that student ability is much more important to third grade than to fourth grade gains. Thus, the constrained estimates

26. It is not essential to the correlated random effects test that the full sequence of teacher assignments back to grade 1 be observed, but the test may over-reject if classroom assignments in grades 3–5 are correlated with those in first and second grade and if the latter have continuing effects on third and fourth grade gains. Recall, however, that VAM3 assumes such lagged effects away.

27. The OMD analysis uses a variance–covariance matrix W that is robust to arbitrary heteroscedasticity and within-student, between-grade clustering. See the Online Appendix.
imply negligible coefficients for fifth grade teachers in the equation for fourth grade gains and do a very poor job of fitting the unconstrained estimate of the standard deviation of these coefficients, 0.099. The test statistic D is 2,136, and the overidentifying restrictions are overwhelmingly rejected. In the reading specification, the Δτ4/Δτ3 ratio is close to one, and the restricted model allows meaningful coefficients on fifth grade teachers in both the third and fourth grade gain equations, albeit with much less variability than is seen in the unconstrained model. But the test statistic is even larger here, and the restricted model is again rejected. We can thus conclude that fifth grade teacher assignments are not strictly exogenous with respect to either math or reading gains, even conditional on single-dimensional (subject-specific) student heterogeneity. The identifying assumption for VAM3 is thus violated.
The results in Tables III, IV, and V indicate that all three of the VAMs considered here rely on incorrect exclusion restrictions—teacher assignments evidently depend on the past learning trajectory even after controlling for student ability or the prior year's test score. It is possible, however, that slight modifications of the VAMs could eliminate the endogeneity. I have explored several alternative specifications to gauge the robustness of the results. I have reestimated VAM1 and VAM2 with controls for student race, gender, free lunch status, fourth grade absences, and fourth grade TV viewing; these have no effect on the tests. The three VAMs also continue to fail falsification tests when I use the original score scales or score percentiles in place of standardized-by-grade scores, or when I use data from other cohorts. As a final investigation, I have extended the tests to evaluate VAM analyses that use data from multiple cohorts of students to distinguish between permanent and transitory components of a teacher's "effect." As discussed in the Online Appendix, the assumptions under which this can avoid the biases identified here do not appear to hold in the data.

VI. HOW MUCH DOES THIS MATTER?

The results in Section V indicate that the identifying assumptions for all three VAMs are violated in the North Carolina data. However, if classroom assignments nearly satisfied the assumptions underlying the VAMs, the models might yield almost unbiased estimates of teachers' causal effects. In this section, I
use the degree of sorting on prior outcomes to quantify the magnitude of the biases resulting from nonrandom assignments. I focus on VAM1 and VAM2, as the lack of correlation between third and fifth grade gains (Table II) strongly suggests that the additional complexity and strong maintained assumptions of VAM3 are unnecessary.
In general, classroom assignments may depend both on variables observed by the econometrician and on unobserved factors. The former can in principle be incorporated into VAM specifications. Accordingly, the first part of my investigation focuses on the role of observable characteristics that are omitted from VAM1 and VAM2. I compare VAM1 and VAM2 to a richer specification, VAM4, that controls for teacher assignments in grades 3 and 4, end-of-grade scores in both subjects in both grades, and scores from the tests given at the beginning of third grade. This would identify fifth grade teachers' effects if assignments were random conditional on the test score and teacher assignment history. It is thus more general than VAM2. It does not strictly nest VAM1, however: Assignment of teachers based purely on student ability (μi) would satisfy the VAM1 exclusion restriction but not that for VAM4. If assignments depend on both ability and lagged scores, VAM1, VAM2, and VAM4 are all misspecified.
Table VI presents the comparisons. The first rows show the estimated standard deviations of teachers' effects obtained from VAM1 and VAM2, as applied to the subset of students with complete test score histories and valid teacher assignments in each prior grade. The unadjusted estimates are somewhat higher than those in Tables III and IV, as the smaller sample yields noisier estimates, but the sampling-adjusted estimates are quite similar to those seen earlier. The next two rows of the table show estimates from the richer specification. Standard deviations are somewhat larger, but not dramatically so. The final two rows describe the bias in the simpler VAMs relative to VAM4 (that is, β55^VAM1 − β55^VAM4 and β55^VAM2 − β55^VAM4). I again show both the raw standard deviation of the point estimates and an adjusted standard deviation that removes the portion due to sampling error. For VAM1, the bias has a standard deviation over one-third as large as that of the VAM4 effects. For VAM2, which already includes a subset of the controls in VAM4, the bias is somewhat smaller. For both VAMs, the bias is more important in estimates of teachers' value added for math scores than for reading scores.
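The comparison in Table VI can be summarized with a short calculation of the kind sketched below: take the per-teacher difference between the simple-VAM and VAM4 coefficients and remove the sampling-error component from its dispersion, in the spirit of equation (13). This is not the paper's code, and every input shown (teacher counts, coefficient magnitudes, sampling variances, class sizes) is a hypothetical value used only to illustrate the calculation.

    import numpy as np

    def adjusted_sd(estimates, sampling_vars, weights):
        """SD of the underlying coefficients: raw variance minus average sampling variance."""
        w = weights / weights.sum()
        centered = estimates - np.average(estimates, weights=w)
        raw_var = np.average(centered ** 2, weights=w)
        return np.sqrt(max(raw_var - np.average(sampling_vars, weights=w), 0.0))

    # Hypothetical inputs: per-teacher fifth grade coefficients from VAM1 and VAM4,
    # the robust sampling variance of their difference, and class sizes as weights.
    rng = np.random.default_rng(4)
    n_teachers = 300
    beta_vam4 = rng.normal(0, 0.17, n_teachers)
    bias = rng.normal(0, 0.06, n_teachers)                            # systematic divergence from VAM4
    beta_vam1 = beta_vam4 + bias + rng.normal(0, 0.10, n_teachers)    # plus sampling noise
    var_diff = np.full(n_teachers, 0.10 ** 2)
    class_size = rng.integers(15, 30, n_teachers).astype(float)

    print("adjusted SD of (VAM1 - VAM4):",
          round(adjusted_sd(beta_vam1 - beta_vam4, var_diff, class_size), 3))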
TABLE VI
MAGNITUDE OF BIAS IN VAM1 AND VAM2 RELATIVE TO A RICHER SPECIFICATION THAT CONTROLS FOR ALL PAST OBSERVABLES

                                                               VAM1                 VAM2
                                                          Math     Reading     Math     Reading
                                                          (1)      (2)         (3)      (4)
Standard deviation of fifth grade teachers' estimated effects from traditional VAM
  Unadjusted for sampling error                           0.203    0.189       0.197    0.176
  Adjusted for sampling error                             0.162    0.127       0.162    0.121
SD of fifth grade teachers' estimated effects from rich specification (VAM4)
  Unadjusted for sampling error                           0.206    0.200       0.206    0.200
  Adjusted for sampling error                             0.172    0.148       0.172    0.148
SD of bias in traditional VAMs relative to the rich specification
  Unadjusted for sampling error                           0.118    0.130       0.097    0.106
  Adjusted for sampling error                             0.060    0.054       0.037    0.028

Notes. N = 23,415. Sample is that used in Table V, less observations with missing fifth grade scores and those in schools rendered unusable (i.e., only one valid classroom or collinearity between third, fourth, and fifth grade classroom indicators) by this exclusion. "Rich" specification controls for classroom assignments in grades 3 and 4 and for scores in math and reading in grades 2, 3, and 4. "Bias" is the difference between the VAM1/VAM2 estimates and those from the rich specification. Unadjusted estimates summarize the estimated coefficients. Adjustments for sampling error are described in Online Appendix B.
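The table's sampling-error adjustment is described in Online Appendix B. A common way to net estimation noise out of a set of estimated effects is sketched below, under the assumption that the sampling variance can be approximated by the average squared standard error; this may differ in detail from the paper's procedure.

```python
# Variance-decomposition sketch: Var(estimates) = Var(true effects) + mean(SE^2),
# so an estimate of the underlying SD subtracts the average squared standard error.
import numpy as np

def adjusted_sd(point_estimates: np.ndarray, standard_errors: np.ndarray) -> float:
    """SD of a set of estimated effects after removing the part due to sampling noise."""
    raw_var = np.var(point_estimates, ddof=1)
    noise_var = np.mean(standard_errors ** 2)
    return float(np.sqrt(max(raw_var - noise_var, 0.0)))

# Example with made-up numbers: a raw SD of about 0.118 combined with an average
# SE of 0.10 implies an adjusted SD of roughly 0.06, similar in spirit to the
# relationship between the last two rows of Table VI.
rng = np.random.default_rng(0)
true_bias = rng.normal(0, 0.06, size=2000)
ses = np.full(2000, 0.10)
estimates = true_bias + rng.normal(0, ses)
print(round(adjusted_sd(estimates, ses), 3))
```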
Of course, the exercise carried out here can only diagnose bias in VAM1 and VAM2 from selection on observables—variables that can easily be included in the VAM specification. In a companion paper (Rothstein 2009), I attempt to quantify the bias that is likely to result from selection on unobservables. Following the intuition of Altonji, Elder, and Taber (2005) that the weight of observable (to the econometrician) and unobservable variables in classroom assignments is likely to mirror their relative weights in predicting achievement, one can use the degree of sorting on observables to estimate the importance of unobservables and therefore the magnitude of the bias in estimated teacher effects. Under varying assumptions about the amount of information that parents and principals have, I find that the bias from nonrandom assignments is quite plausibly 75% as large (in standard deviation terms) as the estimates of teachers’ effects in VAM1, and perhaps half this large in VAM2.28 To provide a better sense of the import of nonrandom classroom assignments for the value of VAMs in teacher compensation 28. Kane and Staiger’s (2008) comparison of experimental and nonexperimental value added estimates would be unlikely to detect biases of this magnitude.
and retention decisions, I simulate true and estimated teacher effects with joint distributions resembling those reported in Table VI and in Rothstein (2009). For each of several scenarios characterizing the assignment of students to classrooms, I generate 10,000 teachers' true effects and coefficients from VAMs 1, 2, and 4.29 I assume that true effects and biases are both normally distributed, and that the VAM coefficients are free of sampling error. I then compute three statistics to summarize the relationship of the VAM estimates to teachers' true effects: the correlation between teachers' true effects and the VAM coefficients, the rank correlation, and the fraction of teachers with true effects in the top quintile who are indicated to be in the top quintile by the VAMs.

Results are presented in Table VII. Each panel corresponds to a distinct assumption about the classroom assignment process. In the first panel, I assume that selection is solely on the basis of the observed test score history. Using the model for reading scores from Table VI, the standard deviation of teachers' true effects is 0.148, and the standard deviations of the biases in VAM1 and VAM2 are 0.054 and 0.028, respectively. Columns (4)–(6) show how reliably the VAMs recover teacher quality under different metrics. True effects and ranks are very highly correlated with the effects and ranks indicated by VAMs 1 and 2. From 79% to 90% of teachers who are in the top quintile of the actual quality distribution are judged to be so by the simple VAMs.

But this analysis assumes, implausibly, that selection is solely on observables. Panels B–E present alternative estimates that allow variables that are not controlled even in VAM4 to play a role in classroom assignments, as in Rothstein (2009). In Panel B, I assume that classroom assignments depend both on the test score history that is reported in my data and on a second, unobserved history (e.g., student grades) that provides an independent, equally noisy measure of the student's trajectory through grades 2–4. Allowing for this moderate degree of selection on unobservables notably degrades the performance of VAM1, but VAM2 and VAM4 continue to perform reasonably well. In Panel C, I assume that there are two separate unobserved achievement measures. Performance degrades still further; although the correlations between true effects and the VAM2 and VAM4 estimates

29. It is not possible to use the estimates from Table VI directly because I wish to abstract from the role of sampling error. The simulation is described in greater detail in the Online Appendix.
TABLE VII
SIMULATIONS OF THE EFFECTS OF STUDENT SELECTION AND HETEROGENEOUS DECAY ON TEACHER QUALITY ESTIMATES

                     Data generating process               Simulation: comparisons between true effects
                                                           and those indicated by VAM
            SD of      SD of      (2) as %       Correlation    Rank           Reliability of top
            truth      bias       of (1)                        correlation    quintile ranking
            (1)        (2)        (3)            (4)            (5)            (6)

Panel A: Selection is on observables
VAM1        0.148      0.054      36%            .93            .93            0.79
VAM2        0.148      0.028      19%            .98            .98            0.90
VAM4        0.148      0          0%             1.00           1.00           1.00
Panel B: Selection is on history of two tests, one observed
VAM1        0.148      0.124      84%            .77            .75            0.62
VAM2        0.148      0.049      33%            .95            .94            0.82
VAM4        0.148      0.028      19%            .98            .98            0.89
Panel C: Selection is on history of three tests, one observed
VAM1        0.148      0.137      92%            .74            .73            0.60
VAM2        0.148      0.060      40%            .93            .92            0.78
VAM4        0.148      0.041      28%            .96            .96            0.85
Panel D: Selection is on true and observed achievement history
VAM1        0.148      0.166      112%           .64            .63            0.52
VAM2        0.148      0.089      60%            .86            .85            0.70
VAM4        0.148      0.078      53%            .89            .88            0.73
Panel E: Selection on unobservables is like selection on observables
VAM1        0.148      0.212      143%           .57            .56            0.49
VAM2        0.148      0.140      95%            .73            .71            0.59
VAM4        0.148      0.147      99%            .71            .70            0.58
Panel F: Selection conforms to VAM assumptions, but effects of interest are those on the following year's score
VAM1        0.118      0.148      125%           .42            .40            0.38
VAM2        0.110      0.147      133%           .33            .32            0.34

Notes. Estimates in column (1) are taken from the rich specification for reading in Table VI (Panels A–E) and from columns (2) and (4) of Table VIII (Panel F). Column (2) is from Table VI, columns (2) and (4) in Panel A, and is computed from the models reported in Table VIII in Panel F. In Panels B–E, estimates from Table 10 of Rothstein (2009) are used, with an adjustment for the different test scale used here. See the Online Appendix for details. Columns (4)–(6) are computed by drawing 10,000 teachers from normal distributions with the standard deviations described in columns (1) and (2). Estimates of the correlation between teachers' true effects and the bias in their estimated effects (−.33 for VAM1 and −.43 for VAM2) are used in Panel A. In Panels B–E, this correlation is constrained to zero. In Panel F, the estimated correlation is used again; this is −.38 for VAM1 and −.43 for VAM2. "Reliability of top quintile" in column (6) is the fraction of teachers whose true effects are in the top quintile who are estimated to be in the top quintile by the indicated VAM.
remain large, only about four-fifths of top-quintile teachers are judged to be so by the two VAMs. Panel D allows even more unobserved information to be used in classroom assignments: I assume that the principal knows the
student's true achievement in grades 2–4. Now, even VAM4 is correlated less than .9 with teachers' true effects, and less than three-fourths of true top-quintile teachers get top-quintile ratings from any of the VAMs. Finally, Panel E presents an extreme scenario corresponding to Altonji, Elder, and Taber's (2005) assumption that selection on unobservables is like selection on observables. This is not realistic, as principals cannot perfectly predict student achievement, but it provides a useful bound for the degree of bias that nonrandom classroom assignments might produce in VAM-based estimates. This bound is tight enough to be informative: Even in this worst case, the VAMs retain some signal, and VAM2 and VAM4 continue to classify correctly over half of top-quintile teachers.

It is difficult to know which of the scenarios is the most accurate. Panel E likely assumes too much sorting on unobservables, whereas Panel A almost certainly assumes too little. The truth almost certainly lies in between, perhaps resembling the scenarios depicted in Panels B and C. These suggest that VAMs that control only for past test scores—typically the only available variables—have substantial signal but nevertheless introduce important misclassification into any assessment of teacher quality. Only 60%–80% of the highest quality teachers will receive rewards given on the basis of high VAM scores.

Moreover, Table VII omits three major sources of error in VAM-based quality measures that would magnify the misclassification rates seen there. First, I have suppressed the role of sampling error that would inevitably arise in VAM-based estimates. It is well documented (Lockwood, Louis, and McCaffrey 2002; McCaffrey et al. 2009) that this alone produces high misclassification rates. Second, all of the analyses in this paper are based on comparisons of teachers within schools. As in most other value added studies, I make no effort to measure across-school differences in teacher quality. But most policy applications of value added would require comparisons across as well as within schools. Because students are not even approximately randomly assigned to schools, these comparisons are likely to be less informative about causal effects than are the within-school comparisons considered here. Finally, I have assumed that teachers' effects on their students' end-of-grade scores are the sole outcome of interest. This may be incorrect. In particular, if teachers can allocate effort between teaching to the test and raising students' long-run learning
trajectories (e.g., by working to instill a love of reading), one would like to reward the second rather than the first. This suggests that the effects that matter may be those on students' long-run outcomes rather than on their end-of-grade scores. I consider this issue in the next section.

VII. SHORT-RUN VS. LONG-RUN EFFECTS

Recall from columns (5)–(6) of Tables III and IV that fourth grade teachers appear to have large effects on students' fifth grade gains. Given the results for fourth grade gains, these "effects" cannot be treated as causal. But setting this issue aside, we can use the lagged teacher coefficients to evaluate restrictions on the time pattern of teachers' effects (that is, on the relationship between βgg and βg,g+s in the production function (1)) that are universally imposed in value added analyses. When only a single grade's teacher assignment is included, VAM2 implicitly assumes that teachers' effects decay at a uniform, geometric rate ($\beta_{g,g+s} = \beta_{gg}\lambda^{s}$ for $\lambda \in [0,1]$), whereas VAM1 assumes zero decay ($\lambda = 1$). It is not clear that either restriction is reasonable.30 Although several studies have estimated λ,31 all have done so under the restriction that decay is uniform. As a final investigation, I analyze the validity of this restriction by comparing a grade-g teacher's initial effect in grade g with her longer-run effect on scores in grade g + 1 or g + 2, without restricting the relationships among them.32 If in fact teachers' effects decay uniformly, the initial and longer-run effects should be perfectly correlated (except for sampling error).

I begin by estimating VAM1 and VAM2 for third, fourth, and fifth grade scores or gains, augmenting each specification with controls for past teachers back to third grade. I then compute third

30. Although a full discussion is beyond the scope of this paper, assumptions about "decay" are closely related to issues of test scaling and content coverage (Martineau 2006; Rothstein 2008; Ballou 2009). To illustrate, consider a third grade teacher who focuses on addition and subtraction. This will raise her students' third grade scores but may do little for their performance on a fifth grade multiplication test.
31. See, for example, Sanders and Rivers (1996), Konstantopoulos (2007), and Andrabi et al. (2009).
32. For VAM1, the effect of being in classroom c in grade g on achievement in grade g + s is simply $\sum_{t=0}^{s} \beta_{g,g+t,c}$. In VAM2, the presence of a lagged dependent variable complicates the calculation of cumulative effects. If only the same-subject score is controlled, the effect of third grade teacher c on fifth grade achievement is $(\beta_{33c}\lambda + \beta_{34c})\lambda + \beta_{35c}$. A similar but more complex expression characterizes the effects when lagged scores in both math and reading are controlled, as in my estimates.
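The cumulative-effect formulas in footnote 32 are simple enough to state in code. The sketch below uses hypothetical coefficient values and, as in the footnote, assumes that only the same-subject lagged score is controlled in VAM2.

```python
# Illustrative implementation of the cumulative-effect formulas in footnote 32.
# Inputs are a given teacher's estimated coefficients in the grade-3, -4, and -5
# equations; lam is the coefficient on the lagged score in VAM2. All numbers are
# hypothetical placeholders.

def cumulative_effect_vam1(betas: list[float]) -> float:
    # VAM1: effects are assumed permanent, so the cumulative effect through
    # grade g + s is simply the sum of the per-grade coefficients.
    return sum(betas)

def cumulative_effect_vam2(beta_33: float, beta_34: float, beta_35: float, lam: float) -> float:
    # VAM2 (same-subject lagged score only): the grade-3 effect decays by a
    # factor lam each year, so the effect on grade-5 achievement is
    # (beta_33 * lam + beta_34) * lam + beta_35.
    return (beta_33 * lam + beta_34) * lam + beta_35

# Example: a third grade teacher with a 0.20 SD immediate effect, small
# coefficients in later grades, and persistence lam = 0.3.
print(cumulative_effect_vam1([0.20, 0.02, 0.01]))
print(cumulative_effect_vam2(0.20, 0.02, 0.01, lam=0.3))
```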
TABLE VIII
PERSISTENCE OF TEACHER EFFECTS IN VAMS WITH LAGGED TEACHERS

                                                                  VAM1                 VAM2
                                                             Math     Reading     Math     Reading
                                                             (1)      (2)         (3)      (4)
Cumulative effect of fourth grade teachers over two years
  Standard deviation of fourth grade teacher effects, adjusted
    On fourth grade scores                                   0.184    0.150       0.188    0.140
    On fifth grade scores                                    0.108    0.118       0.118    0.110
  Average persistence of fourth grade teacher's
    immediate effect one year later                          0.269    0.325       0.320    0.262
  Correlation (effect on fourth grade, effect on
    fifth grade), adjusted                                   .455     .413        .511     .334
Cumulative effect of third grade teachers over three years
  Standard deviation of third grade teacher effects, adjusted
    On third grade scores                                    0.218    0.172       0.209    0.167
    On fourth grade scores                                   0.136    0.126       0.120    0.130
    On fifth grade scores                                    0.185    0.199       0.129    0.147
  Average persistence of third grade teacher's
    immediate effect two years later                         0.335    0.394       0.277    0.394
  Correlation (effect on third grade, effect on
    fifth grade), adjusted                                   .395     .341        .450     .447

Notes. N = 23,415. Sample is identical to that used in Table VI. Effects of fourth grade teachers on fifth grade scores and of third grade teachers on fourth and fifth grade scores are cumulative effects. For VAM1, the specification for gains in grade g includes controls for teachers in grades 3 through g, and the cumulative effect of the grade h teacher on the grade g gain is the sum of the effects in h, h + 1, . . . , g. For VAM2, the specification is augmented with controls for math and reading scores in grade g − 1. The calculation of cumulative effects is described in footnote 32. "Average persistence" is the coefficient from a regression of effects on fifth grade scores on effects on fourth (Panel A) or third (Panel B) scores, and indicates the expected effect on fifth grade scores for a teacher whose initial effect was +1. All standard deviations, correlations, and persistence parameters are adjusted for the influence of sampling error, as described in Online Appendix B.
and fourth grade teachers' cumulative effects over one, two, and (for third grade teachers) three years. Table VIII presents summary statistics for these cumulative effects. I show their standard deviation; the implied average persistence of teachers' first-year effects (computed as $\lambda = \operatorname{cov}(\beta_{44}, \beta_{45})/\operatorname{var}(\beta_{44})$); and the correlation between the initial and cumulative effects. All statistics are adjusted for sampling error in the β coefficients.

Three aspects of the results are of note. First, there is much more variation in fourth grade teachers' effects on fourth grade scores than in those same teachers' effects on fifth grade scores. With uniform decay at rate (1 − λ), $\operatorname{var}(\beta_{g,g+s}) = \lambda^{2s}\operatorname{var}(\beta_{gg})$, so this is consistent with the mounting recent evidence that teachers' effects decay importantly in the year after contact (Kane and Staiger 2008; Andrabi et al. 2009; Jacob, Lefgren, and Sims forthcoming). Second, the average
persistence of fourth grade teachers' effects one year later is only around 0.3, again consistent with recent evidence.33 Third, the data are not even approximately consistent with the notion that this persistence rate is uniform across teachers: The correlation between teachers' first-year effects and their two-year cumulative effects is much less than one, ranging between .33 and .51 depending on the model and subject. Three-year cumulative effects show a similar pattern, correlated around .4 with the immediate effect. Even if we assume that the VAM-based estimates can be treated as causal, a teacher's first-year effect is a poor proxy for his or her longer-run impact.

The final panel of Table VII explores the implications of this analysis for teacher quality measurement. I use the estimates in Table VIII as parameters for my simulation to compare traditional end-of-year VAM coefficients to teachers' longer-run (two-year) effects, treating the latter as the "truth." The results are not encouraging. Correlations are well below .5, and only about a third of teachers in the top quintile of the distribution of two-year cumulative effects are also in the top quintile of the one-year effect distribution. It is apparent that misspecification of the outcome variable produces extreme amounts of misclassification. Note, moreover, that this analysis assumes that the VAM1 and VAM2 exclusion restrictions are valid. A full account of the utility of VAMs for identifying good teachers would need to combine the analyses of lagged effects and endogenous classroom assignments. This would imply even higher rates of misclassification than are produced by either on its own.

VIII. DISCUSSION

Panel data allow flexible controls for individual heterogeneity, but even panel data models can identify treatment effects only if assignment to treatment satisfies strong exclusion restrictions. This has long been recognized in the literature on program evaluation, but has received relatively little attention in the literature on the estimation of teachers' effects on student achievement. In this paper, I have shown how the availability of lagged outcome measures can be used to evaluate common value added specifications.

33. In other contexts, experiments have shown short-term effects on test scores that do not persist, as well as long-term effects on other outcomes (see, e.g., Schweinhart et al. [2005]). If teachers' effects had this form, we might wish to focus on short-run rather than long-run test score effects. But there is no direct evidence that teacher effects follow this pattern.
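For readers who want to apply a falsification exercise of this kind to their own data, a minimal sketch is below. The column names (gain_g4, teacher_g5, school) are hypothetical, and the paper's actual tests in Tables III–V impose considerably more structure and handle inference differently; the sketch only illustrates the core idea that next year's teacher assignments should not predict this year's gains.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def falsification_test(df: pd.DataFrame):
    """Joint F-test that fifth grade teacher dummies do not predict fourth grade gains."""
    fit = smf.ols("gain_g4 ~ C(teacher_g5) + C(school)", data=df).fit()
    names = list(fit.params.index)
    cols = [i for i, n in enumerate(names) if n.startswith("C(teacher_g5)")]
    # Restriction matrix selecting every future-teacher coefficient.
    R = np.zeros((len(cols), len(names)))
    for row, col in enumerate(cols):
        R[row, col] = 1.0
    # A small p-value indicates that next year's teacher assignment "predicts"
    # this year's gain, i.e., the exclusion restriction fails.
    return fit.f_test(R)
```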
The results presented here show that the assumptions underlying common VAMs are substantially incorrect, at least in North Carolina. Classroom assignments are not exogenous conditional on the typical controls, and estimates of teachers' effects based on these models cannot be interpreted as causal. Clear evidence of this is that each VAM indicates that fifth grade teachers have quantitatively important "effects" on students' fourth grade learning.

These results have important implications for educational research, for research in a variety of related areas, and for education policy. I discuss these in turn.

First, it is clear that an important priority in educational research should be to build richer VAMs that can accommodate dynamic sorting of students to classrooms. In contrast, there is little apparent need to allow for permanent heterogeneity in students' rates of growth. One approach might be to assume that classroom assignments depend on the principal's best prediction of students' unobserved ability, with predictions updated each year based on student grades and test scores. None of the VAMs considered here can accommodate assignments of this form, which on its face seems quite plausible, but approaches like those taken by Altonji, Elder, and Taber (2005) and Rothstein (2009) may be useful. I am skeptical, however, that purely econometric solutions will be adequate. There is likely to be important heterogeneity across schools in both information structures and principal objectives. Thus, there would be large returns to incorporating information about the actual school-level assignment process—perhaps gathered from surveys of principals, as in Monk (1987)—into the value added specification.

In addition, more attention to the specification of the outcome variable is needed. Are we interested in measuring a teacher's short-run effect or his or her impact on test scores in later grades? The former is evidently a poor proxy for the latter.

Any proposed VAM should be subjected to thorough validation and falsification analyses. The tests implemented here suggest a starting point, and may be adaptable to richer models. Failure to reject the exclusion restrictions need not indicate that the restrictions are correct, as my tests can identify only sorting based on past observables. But rejection does indicate that the VAM-based estimates are likely to be misleading about teachers' causal effects.

The present analysis also has implications beyond the specific application to measuring teacher productivity. Estimates of the
quality of schools and of the effects of firms on workers' wages use identical econometric models, and rely on similar exclusion restrictions. Evidence about the "effects" of future schools and employers on current outcomes would be informative about the validity of both sets of estimates.

Finally, the results here have important implications for the use of existing VAMs in education policy. My results indicate that policies based on these VAMs will reward or punish teachers who do not deserve it and fail to reward or punish teachers who do. The literature on pay-for-performance suggests some consequences of this result. First, and most clearly, the stakes attached to VAM-based measures should be relatively small. Baker (1992, 2002) considers a performance measure that is less than perfectly correlated with the worker's contribution to firm output. He notes that high-stakes compensation will create incentives for workers to direct excess effort to the unproductive component of the performance measure. In education, this might take the form of teachers lobbying their principals to be assigned the "right" students who will yield predictably high value added scores. In Baker's model, misallocation of effort can be kept to a tolerable level by keeping the variable component of compensation small.34 Another argument for low stakes in VAM-based compensation is provided by Holmström and Milgrom (1991), who discuss implications of the results presented in Section VII above: If short-term test scores are poor proxies for the dimensions of achievement that really matter, it may be better to forgo or limit incentive pay rather than encourage excessive teaching to the test.

A second and more speculative suggestion is that VAM-based estimates should be used as only one among several inputs into an accountability system that also incorporates principals' subjective ratings (see, e.g., Baker, Gibbons, and Murphy [1994]). There are two reasons for this. First, principals may have information about the direction of the bias in a particular teacher's VAM-based estimate that is not otherwise available to the econometrician, so incorporation of their opinions might lead to better-targeted incentives (Holmström 1979). Second, use of the VAM as the sole basis for teacher compensation and/or retention would permit principals to reward or punish teachers only through the assignment of desirable or undesirable students. Anecdotally, this

34. See also Milgrom (1988), who argues that an important goal of organizational design should be to limit the incentive for workers to devote their time to "influence activities," and Lazear (1989), who argues that tournament stakes should be kept small to limit the incentive for "sabotage."
is an important management tool for principals, who may induce disfavored teachers to resign by assigning them difficult students. But there is evidence that teacher–student matching is an important determinant of student learning (Dee 2005; Clotfelter, Ladd, and Vigdor 2006), so manipulation of matches can have real efficiency consequences. If the principal’s subjective judgment is incorporated directly into the incentive scheme, he or she will be able to allocate students to teachers to maximize output without sacrificing his or her ability to influence rewards and sanctions. Of course, this suggestion presumes high-quality principals who have enough time to observe teachers’ classrooms and enough training to distinguish good from bad teachers. Without this, neither subjective evaluations nor VAM-based estimates that depend importantly on classroom assignments are likely to provide much useful information. GOLDMAN SCHOOL OF PUBLIC POLICY, UNIVERSITY OF CALIFORNIA, BERKELEY, AND NATIONAL BUREAU OF ECONOMIC RESEARCH
REFERENCES

Aaronson, Daniel, Lisa Barrow, and William Sander, "Teachers and Student Achievement in the Chicago Public High Schools," Journal of Labor Economics, 25 (2007), 95–135. Abowd, John M., and Francis Kramarz, "The Analysis of Labor Markets Using Matched Employer-Employee Data," in Handbook of Labor Economics, Vol. 3B, Orley C. Ashenfelter and David Card, eds. (Amsterdam: North-Holland, 1999). Altonji, Joseph G., Todd E. Elder, and Christopher R. Taber, "Selection on Observed and Unobserved Variables: Assessing the Effectiveness of Catholic Schools," Journal of Political Economy, 113 (2005), 151–184. Anderson, T. W., and Cheng Hsiao, "Estimation of Dynamic Models with Error Components," Journal of the American Statistical Association, 76 (1981), 598–609. Andrabi, Tahir, Jishnu Das, Asim I. Khwaja, and Tristan Zajonc, Do Value-Added Estimates Add Value? Accounting for Learning Dynamics, unpublished manuscript, Harvard, 2009. Arellano, Manuel, and Stephen Bond, "Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations," Review of Economic Studies, 58 (1991), 277–297. Ashenfelter, Orley, "Estimating the Effect of Training Programs on Earnings," Review of Economics and Statistics, 60 (1978), 47–57. Ashenfelter, Orley, and David Card, "Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs," Review of Economics and Statistics, 67 (1985), 648–660. Baker, George P., "Incentive Contracts and Performance Measurement," Journal of Political Economy, 100 (1992), 598–614. ——, "Distortion and Risk in Optimal Incentive Contracts," Journal of Human Resources, 37 (2002), 728–751. Baker, George P., Robert Gibbons, and Kevin J. Murphy, "Subjective Performance Measures in Optimal Incentive Contracts," Quarterly Journal of Economics, 109 (1994), 1125–1156.
Ballou, Dale, "Test Scaling and Value-Added Measurement," Education Finance and Policy, 4 (2009), 351–383. Boardman, Anthony E., and Richard J. Murnane, "Using Panel Data to Improve Estimates of the Determinants of Educational Achievement," Sociology of Education, 52 (1979), 113–121. Boyd, Donald, Hamilton Lankford, Susanna Loeb, Jonah E. Rockoff, and James Wyckoff, "The Narrowing Gap in New York City Teacher Qualifications and Its Implications for Student Achievement in High-Poverty Schools," Center for Analysis of Longitudinal Data in Education Research, Working Paper 10, 2007. Braun, Henry I., "Using Student Progress To Evaluate Teachers: A Primer on Value-Added Models," ETS Policy Information Center, Manuscript, 2005a. ——, "Value-Added Modeling: What Does Due Diligence Require?" in Value Added Models in Education: Theory and Applications, Robert W. Lissitz, ed. (Maple Grove, MN: JAM Press, 2005b). Card, David, and Daniel Sullivan, "Measuring the Effect of Subsidized Training Programs on Movements in and out of Employment," Econometrica, 56 (1988), 497–530. Chamberlain, Gary, "Panel Data," in Handbook of Econometrics, Vol. II, Z. Griliches and M. D. Intriligator, eds. (Amsterdam: Elsevier North-Holland, 1984). Clotfelter, Charles T., Helen F. Ladd, and Jacob L. Vigdor, "Teacher–Student Matching and the Assessment of Teacher Effectiveness," Journal of Human Resources, 41 (2006), 778–820. Dee, Thomas S., "A Teacher like Me: Does Race, Ethnicity, or Gender Matter?" American Economic Review, 95 (2005), 158–165. Goldhaber, Dan, "Everyone's Doing It, but What Does Teacher Testing Tell Us about Teacher Effectiveness?" Journal of Human Resources, 42 (2007), 765–794. Harris, Douglas N., and Tim R. Sass, Value-Added Models and the Measurement of Teacher Quality, unpublished manuscript, 2006. ——, What Makes for a Good Teacher and Who Can Tell? unpublished manuscript, 2007. Heckman, James J., V. Joseph Hotz, and Marcelo Dabos, "Do We Need Experimental Data to Evaluate the Impact of Manpower Training on Earnings?" Evaluation Review, 11 (1987), 395–427. Holland, Paul W., "Statistics and Causal Inference," Journal of the American Statistical Association, 81 (1986), 945–960. Holmström, Bengt, "Moral Hazard and Observability," Bell Journal of Economics, 10 (1979), 74–91. Holmström, Bengt, and Paul Milgrom, "Multitask Principal–Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, 7 (1991), 24–52. Imbens, Guido W., and Thomas Lemieux, "Regression Discontinuity Designs: A Guide to Practice," Journal of Econometrics, 142 (2008), 615–635. Jacob, Brian A., and Lars Lefgren, "What Do Parents Value in Education? An Empirical Examination of Parents' Revealed Preferences for Teachers," Quarterly Journal of Economics, 122 (2007), 1603–1637. ——, "Can Principals Identify Effective Teachers? Evidence on Subjective Performance Evaluation in Education," Journal of Labor Economics, 25 (2008), 101–136. Jacob, Brian A., Lars Lefgren, and David Sims, "The Persistence of Teacher-Induced Learning Gains," Journal of Human Resources, forthcoming. Jacobson, Louis S., Robert J. LaLonde, and Daniel G. Sullivan, "Earnings Losses of Displaced Workers," American Economic Review, 83 (1993), 685–709. Kane, Thomas J., Jonah E. Rockoff, and Douglas O. Staiger, "What Does Certification Tell Us about Teacher Effectiveness? Evidence from New York City," Economics of Education Review, 27 (2008), 615–631. Kane, Thomas J., and Douglas O. Staiger, "Estimating Teacher Impacts on Student Achievement: An Experimental Evaluation," National Bureau of Economic Research Working Paper No. 14607, 2008. Kezdi, Gabor, "Robust Standard Error Estimation in Fixed Effects Panel Models," Hungarian Statistical Review, 9 (2004), 95–116. Kinsler, Josh, Estimating Teacher Value-Added in a Cumulative Production Function, unpublished manuscript, University of Rochester, 2008.
Koedel, Cory, and Julian R. Betts, “Re-Examining the Role of Teacher Quality in the Educational Production Function,” University of Missouri Department of Economics, Working Paper 07-08, 2007. Konstantopoulos, Spyros, “How Long Do Teacher Effects Persist?” IZA Discussion Paper No. 2893, 2007. Lazear, Edward P., “Pay Equality and Industrial Politics,” Journal of Political Economy, 97 (1989), 561–580. Lockwood, J. R., Thomas A. Louis, and Daniel F. McCaffrey, “Uncertainty in Rank Estimation: Implications for Value-Added Modeling Accountability Systems,” Journal of Educational and Behavioral Statistics, 27 (2002), 255. Martineau, Joseph A., “Distorting Value Added: The Use of Longitudinal, Vertically Scaled Student Achievement Data for Growth-Based, Value-Added Accountability,” Journal of Educational and Behavioral Statistics, 31 (2006), 35– 62. McCaffrey, Daniel F., J. R. Lockwood, Daniel M. Koretz, and Laura S. Hamilton, “Evaluating Value-Added Models for Teacher Accountability,” RAND, Report, 2003. McCaffrey, Daniel F., Tim R. Sass, J. R. Lockwood, and Kata Mihaly, “The Intertemporal Stability of Teacher Effect Estimates,” Education Finance and Policy, 4 (2009), 572–606. Milgrom, Paul R., “Employment Contracts, Influence Activities, and Efficient Organization Design,” Journal of Political Economy, 96 (1988), 42–60. Monk, David H., “Assigning Elementary Pupils to Their Teachers,” Elementary School Journal, 88 (1987), 167–187. Nye, Barbara, Spyros Konstantopoulos, and Larry V. Hedges, “How Large Are Teacher Effects?” Educational Evaluation and Policy Analysis, 26 (2004), 237– 257. Rivkin, Steven G., Eric A. Hanushek, and John F. Kain, “Teachers, Schools, and Academic Achievement,” Econometrica, 73 (2005), 417–458. Rosenbaum, Paul R., and Donald B. Rubin, “Reducing Bias in Observational Studies Using Subclassification on the Propensity Score,” Journal of the American Statistical Association, 79 (1984), 516–524. Rothstein, Jesse, “Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement,” Princeton University Education Research Section, Working Paper 25, 2008. ——, “Student Sorting and Bias in Value-Added Estimation: Selection on Observables and Unobservables,” Education Finance and Policy, 4 (2009), 537–571. Sanders, William L., and June C. Rivers, “Cumulative and Residual Effects of Teachers on Future Student Academic Achievement,” University of Tennessee Value-Added Research and Assessment Center, Research Progress Report, 1996. Sanders, William L., Arnold M. Saxton, and Sandra P. Horn, “The Tennessee Value-Added Assessment System: A Quantitative, Outcomes-Based Approach to Educational Assessment,” in Grading Teachers, Grading Schools: Is Student Achievement a Valid Evaluation Measure? Jason Millman, ed. (Thousand Oaks, CA: Corwin, 1997). Schweinhart, L. J., J. Montie, Z. Xiang, W. S. Barnett, C. R. Belfield, and M. Nores, Lifetime Effects: The High/Scope Perry Preschool Study Through Age 40 (Ypsilanti, MI: High/Scope Press, 2005). Todd, Petra E., and Kenneth I. Wolpin, “On the Specification and Estimation of the Production Function for Cognitive Achievement,” Economic Journal, 113 (2003), F3–F33. Wainer, Howard, “Introduction to a Special Issue of the Journal of Educational and Behavioral Statistics on Value-Added Assessment,” Journal of Educational and Behavioral Statistics, 29 (2004), 1–3. Wooldridge, Jeffrey M., Econometric Analysis of Cross Section and Panel Data (Cambridge, MA: MIT Press, 2002).
THE VALUE OF SCHOOL FACILITY INVESTMENTS: EVIDENCE FROM A DYNAMIC REGRESSION DISCONTINUITY DESIGN∗ STEPHANIE RIEGG CELLINI FERNANDO FERREIRA JESSE ROTHSTEIN Despite extensive public infrastructure spending, surprisingly little is known about its economic return. In this paper, we estimate the value of school facility investments using housing markets: standard models of local public goods imply that school districts should spend up to the point where marginal increases would have zero effect on local housing prices. Our research design isolates exogenous variation in investments by comparing school districts where referenda on bond issues targeted to fund capital expenditures passed and failed by narrow margins. We extend this traditional regression discontinuity approach to identify the dynamic treatment effects of bond authorization on local housing prices, student achievement, and district composition. Our results indicate that California school districts underinvest in school facilities: passing a referendum causes immediate, sizable increases in home prices, implying a willingness to pay on the part of marginal homebuyers of $1.50 or more for each $1 of capital spending. These effects do not appear to be driven by changes in the income or racial composition of homeowners, and the impact on test scores appears to explain only a small portion of the total housing price effect.
I. INTRODUCTION Federal, state, and local governments invest more than $420 billion in infrastructure projects every year, and the American Recovery and Reinvestment Act of 2009 is funding substantial temporary increases in capital spending.1 School facilities may be among the most important public infrastructure investments: $50 billion is spent on public school construction and repairs each year ∗ We thank Janet Currie, Joseph Gyourko, Larry Katz, David Lee, Chris Mayer, Tom Romer, Cecilia Rouse, Tony Yezer, and anonymous referees, as well as seminar participants at Brown; Chicago GSB; Duke; George Washington; Haas School of Public Policy; IIES; University of Oslo; NHH; Penn; Princeton; UMBC; Wharton; Yale; and conferences of the American Education Finance Association, National Tax Association, NBER (Labor Economics and Public Economics), and Southern Economic Association for helpful comments and suggestions. We are also grateful to Eric Brunner for providing data on California educational foundations. Fernando Ferreira would like to thank the Research Sponsor Program of the Zell/Lurie Real Estate Center at Wharton for financial support. Jesse Rothstein thanks the Princeton University Industrial Relations Section and Center for Economic Policy Studies. We also thank Igar Fuki, Scott Mildrum, Francisco Perez Arce, Michela Tincani, and Moises Yi for excellent research assistance.
1. Council of Economic Advisers (2009, Table B-20). The annual total includes gross investment in structures, equipment, and software for both military and nonmilitary uses.
(U.S. Department of Education 2007, Table 167), yet many of the more than 97,000 public elementary and secondary schools in the United States are in need of renovation, expansion, and repair. One-third of public schools rely on portable or temporary classrooms and one-fourth report that environmental factors, such as air conditioning and lighting, are “moderate” or “major” obstacles to instruction (U.S. Department of Education 2007, Table 98). Despite the importance of capital spending, little is known about the overall impact of public infrastructure investment on economic output,2 and even less is known about the effects of school facilities investments.3 Two central barriers to identification have been difficult to overcome. First, resources may be endogenous to local outcomes. Variation in capital spending is typically confounded with other factors (e.g., the state of the local economy or the socioeconomic status of students) that also determine outcomes.4 Second, even causal estimates of the effects of investments may miss benefits that do not appear in measured output. This is likely to be a particular problem for school facilities, which may yield difficult-to-measure nonacademic benefits such as aesthetic appeal or student health and safety. Housing markets can be used to overcome the challenge of measuring outputs. If homebuyers value a local project more than they value the taxes they will pay to finance it, spending increases should lead to increases in housing prices.5 Indeed, in standard models, a positive effect of tax increases on local property values is direct evidence that the initial tax rate was inefficiently low. But this strategy does not avoid the challenge of obtaining causal effects, which can be difficult when localities are free to endogenously choose their spending levels. In this paper we implement a new research design that isolates exogenous variation in school investments. School capital
2. Aschauer (1989) is an early participant in this literature. Reviews by Munnell (1992) and Gramlich (1994) highlight a number of unresolved endogeneity issues. Pereira and Flores de Frutos (1999) address some of the endogeneity issues and find sizable returns to infrastructure investments. 3. See Jones and Zimmer (2001) and Schneider (2002). Also closely related is the long literature on the effects of school spending more generally. Hanushek (1996) reviews more than ninety studies and concludes that “[s]imple resource policies hold little hope for improving student outcomes,” but Card and Krueger (1996) dispute Hanushek’s interpretation of the literature. 4. Angrist and Lavy (2002) and Goolsbee and Guryan (2006) exploit credibly exogenous variation in school technology investments. Neither study finds shortrun effects on student achievement. 5. See, for example, Oates (1969).
projects are frequently financed via local bond issues, repaid from future property tax receipts. In many states, bonds can be issued only with voter approval. Although school districts that issue bonds are likely to differ in both observable and unobservable ways from those that do not, these differences can be minimized by focusing on very close elections: a district where a proposed bond passes by one vote is likely to be similar to one where the proposal fails by the same margin, though their “treatment” statuses will be quite different. Thus, a regression discontinuity (RD) framework can be used to identify the causal impact of bond funding on district outcomes.6 Several previous papers have used elections as sources of identification in RD models.7 Our analysis is complicated by the dynamic nature of the bond proposal process. A district that narrowly rejects an initial proposal is likely to consider and pass a new proposal shortly thereafter. Moreover, bond effects may occur with nontrivial and unknown lags, both because new bond-financed facilities do not come online until several years after the initial authorization and because sticky housing markets may respond slowly to new information. Traditional experimental and quasi-experimental analytical techniques cannot fully accommodate the presence of both types of dynamics, in treatment assignment and in treatment effects.8 When treatment dynamics are important, researchers usually either restrict treatment effects to be constant or focus on the so-called “intent-to-treat” (ITT) effects of the initial treatment assignment. We develop methods for identifying dynamic “treatment-on-the-treated” (TOT) effects in the presence of dynamics in treatment assignment. To our knowledge, our proposed estimators are new to the literature. They might be fruitfully applied in a variety of other settings.9 6. For recent overviews, see Imbens and Lemieux (2008) and Lee and Lemieux (2009). 7. See, for example, DiNardo and Lee (2004); Lee, Moretti, and Butler (2004); Pettersson-Lidbom (2008); Cellini (2009); and Ferreira and Gyourko (2009). 8. Ham and LaLonde (1996) model the dynamic treatment effects of job training in experimental data. They focus, however, on the impact of initial treatment assignment (i.e., of the intention to treat) and do not exploit noncompliance with this assignment. See also Card and Hyslop (2005). 9. Examples in the RD literature include studies of the effect of incumbency on electoral outcomes (Lee 2008); the effect of unionization on employer survival and profitability (DiNardo and Lee 2004; Lee and Mas 2009); the effect of passing a high school graduation exam (Martorel 2005); and the effect of access to payday loans (Skiba and Tobacman 2008). It would also be straightforward to extend our strategy to experimental and quasi-experimental settings where agents have multiple opportunities to be assigned to treatment.
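As a point of reference for the design just described, a minimal sketch of the static (cross-sectional) RD building block is below: it compares outcomes in districts where a measure barely passed with those where it barely failed, with a local linear control for the vote share. The column names (outcome, vote_share, threshold, district_id) and the five-point bandwidth are hypothetical, and the dynamic ITT/TOT estimators developed in the paper add considerably more structure.

```python
# Minimal static RD sketch in the spirit of the close-elections design: the
# coefficient on "passed" is the discontinuity in the outcome at the required
# vote-share threshold. Variable names and the bandwidth are illustrative only.
import pandas as pd
import statsmodels.formula.api as smf

def static_rd_itt(df: pd.DataFrame, bandwidth: float = 5.0):
    d = df.copy()
    d["margin"] = d["vote_share"] - d["threshold"]     # points above the requirement
    d["passed"] = (d["margin"] >= 0).astype(int)
    close = d[d["margin"].abs() <= bandwidth]
    # Local linear regression with separate slopes on each side of the cutoff,
    # clustering standard errors by district.
    fit = smf.ols("outcome ~ passed + margin + passed:margin", data=close).fit(
        cov_type="cluster", cov_kwds={"groups": close["district_id"]})
    return fit.params["passed"], fit.bse["passed"]
```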
We apply our estimators to a rich data set combining information on two decades of California school bond referenda with annual measures of school district spending, housing prices, district-level demographics, and student test scores. We focus on California because it provides a large sample of close elections, but it is important to emphasize that California’s school finance system is unique. Nearly all school spending in California is determined centrally, and the “general obligation” bonds we study are essentially the only source of local discretion. As in other states, bond revenue is restricted to capital projects. Although the theoretical literature emphasizes the futility of restricted funding (see, e.g., Bradford and Oates [1971]), it seems to be effective in our data: as we show below, bond revenues remain in the capital account. We therefore interpret the impact of bond passage on home prices and test scores as reflecting the effects of school facility investments. We find that passage of a bond measure causes house prices in a district to rise by about 6%. This effect appears gradually over the two or three years following the election and persists for at least a decade. Our preferred estimates indicate that marginal homebuyers are willing to pay, via higher purchase prices and expected future property taxes, $1.50 or more for an additional dollar of school facility spending, and even our most conservative estimates indicate a willingness to pay (WTP) of $1.13. We find little evidence of changes in the income or racial composition of local homebuyers following the passage of a bond. Estimated effects on student achievement are extremely imprecise and provide, at best, ambiguous evidence for positive effects at long lags. Even our largest point estimates for the achievement effects are too small to fully explain the impact of bond authorization on housing prices, however. Evidently, prices reflect dimensions of school output that are not captured in student test scores. This highlights the importance of using housing markets—rather than simply test score gains—to evaluate school investments. Although much of the public choice literature emphasizes the potential for overspending by “Leviathan” governments, our results suggest that the opposite is the case. They provide clear evidence that school districts in our sample underinvest in school facilities even with (limited) local control.10 Caution is required, however, in attempting to generalize this result beyond our 10. This is consistent with Matsusaka’s (1995) conclusion that public spending is lower in states with initiatives.
sample. Returns to marginal school spending may be lower in districts where the referendum election is not close or in states that allow more local control. The remainder of the paper is organized as follows: Section II describes the California school finance system; Section III develops simple economic models of resource allocation and capitalization; Section IV describes our research design and introduces our estimators of dynamic treatment effects; Section V describes the data; Section VI validates our regression discontinuity strategy; Section VII presents our estimates; and Section VIII concludes. II. CALIFORNIA SCHOOL FINANCE California was known in the postwar era for its high-quality, high-spending school system. By the 1980s and 1990s, however, California schools were widely considered underfunded. In 1995, per-pupil current spending was 13% below the national average, ranking the state 35th in the country despite its relatively high costs. Capital spending was particularly stingy, 30% below the national average.11 California schools became notorious for their overcrowding, poor physical conditions, and heavy reliance on temporary, modular classrooms (see, e.g., New York Times [1989]). Much of the decline in school funding has been attributed to the state’s shift to a centralized system of finance under the 1971 Serrano v. Priest decision and to the passage of Proposition 13 in 1978. In the regime that resulted, the property tax rate was fixed at 1% and the state distributed additional revenues using a highly egalitarian formula.12 Districts were afforded no flexibility and there was little provision for capital investments. In 1984, voters approved Proposition 46, which allowed school districts to issue general obligation bonds to finance capital projects.13 Bonds are proposed by the school district board and must be approved by 11. Statistics in this paragraph are computed from U.S. Department of Education (1998, Tables 165 and 42) and U.S. Department of Education (2007, Table 174). 12. See Sonstelie, Brunner, and Ardon (2000) for further details and discussion of California’s school finance reforms. 13. Noneducational public entities (e.g., cities, sanitation districts) can also issue general obligation bonds using a similar procedure. An alternative source of funds is a parcel tax, which also requires voter approval but imposes fewer restrictions (Orrick, Herrington & Sutcliffe, LLP, 2004). These are comparatively rare. Although we focus on general obligation bonds in the analysis below, we present some specifications that incorporate parcel taxes as well.
a local referendum.14 Initially, a two-thirds vote was required, but beginning in 2001 proposals that adhered to certain restrictions could qualify for a reduced threshold of 55%. Brunner and Reuben (2001) attribute 32% of California school facility spending between 1992–1993 and 1998–1999 to local bond referenda. The leading alternative source of funds was state transfers. Authorized bonds are paid off over twenty or thirty years through an increment—typically 0.25 percentage points—to the local property tax rate. Under Proposition 13, assessed home values are based on the purchase price rather than the current market value. As property values in California have risen substantially in recent decades, homeowners with long tenure face low tax shares and recent homebuyers bear disproportionate shares of the burden. Districts must specify in advance how the bond revenues will be spent. The ballot summary for a representative proposal reads: Shall Alhambra Unified School District repair, upgrade and equip all local schools, improve student safety conditions, upgrade electrical wiring for technology, install fire safety, energy efficient heating/cooling systems, emergency lighting, fire doors, replace outdated plumbing/sewer systems, repair leaky rundown roofs/bathrooms, decaying walls, drainage systems, repair, construct, acquire, equip classrooms, libraries, science labs, sites and facilities, by issuing $85,000,000 of bonds at legal rates, requiring annual audits, citizen oversight, and no money for administrators’ salaries? (Institute for Social Research 2006)
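A rough sense of the tax burden implied by a successful bond can be had from a back-of-the-envelope present-value calculation. The 0.25-percentage-point increment and the twenty-to-thirty-year repayment horizon come from the text; the house value, discount rate, and the simplification of a flat assessed value are illustrative assumptions, not the paper's own calculation.

```python
# Back-of-the-envelope: present discounted value of the property tax increment
# a homebuyer expects to pay after a bond passes. Assumes a constant assessed
# value (a simplification of Proposition 13 rules) and a hypothetical discount rate.

def pdv_tax_increment(assessed_value: float = 500_000.0,
                      increment: float = 0.0025,   # 0.25 percentage points
                      years: int = 25,
                      discount_rate: float = 0.05) -> float:
    annual_payment = assessed_value * increment
    return sum(annual_payment / (1 + discount_rate) ** t for t in range(1, years + 1))

# For a $500,000 purchase this comes to roughly $17,600. Weighing a burden of
# this kind against the price response to bond passage is related to the logic
# behind the paper's willingness-to-pay calculations, which are more detailed.
print(round(pdv_tax_increment()))
```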
Anecdotally, bonds are frequently used to build new permanent classrooms that replace temporary buildings (e.g., Sebastian [2006]), although repair, maintenance, and modernization are common uses as well. Of the 1,035 school districts in California, 629 voted on at least one bond measure between 1987 and 2006. The average number of measures considered (conditional on any) was slightly more than two.15 Elections were frequently close, with 35% decided by less than 5% of the vote. Table I shows the number of measures proposed and passed in each year, along with the average bond amount (in dollars per pupil), the distribution of required vote

14. Balsdon, Brunner, and Rueben (2003) model the board's decision to propose a bond issue.
15. These data come from the California Education Data Partnership. More details are provided in Section V. Between 1987 and 2006, 264 districts had exactly one measure on the ballot whereas 189 districts had 2, 99 districts had 3, 53 districts had 4, and 30 districts had 5 or more measures. The maximum was 10 measures.
TABLE I
SCHOOL BOND MEASURE SUMMARY STATISTICS

                                                                       Vote share in favor (%)
        Number of    Avg. amount      Fraction 55%       Fraction
Year    measures     per pupil ($)    req. (vs. 2/3)     approved      Mean     SD
(1)     (2)          (3)              (4)                (5)           (6)      (7)
1987    29           3,134            0                  0.52          64.6     12.0
1988    33           5,081            0                  0.61          67.8     8.2
1989    28           3,103            0                  0.50          66.4     9.7
1990    31           7,096            0                  0.42          61.4     15.2
1991    55           7,612            0                  0.40          64.0     10.3
1992    57           7,467            0                  0.40          62.2     10.8
1993    45           7,305            0                  0.47          62.1     11.7
1994    50           7,365            0                  0.42          65.1     9.6
1995    84           6,266            0                  0.48          65.0     10.9
1996    50           5,780            0                  0.70          70.3     7.9
1997    110          7,244            0                  0.64          68.9     8.7
1998    116          6,762            0                  0.60          68.7     9.3
1999    82           9,425            0                  0.62          69.6     9.7
2000    86           6,307            0                  0.65          69.4     8.7
2001    50           8,338            0.48               0.84          68.7     9.2
2002    146          6,004            0.89               0.79          63.4     8.5
2003    18           6,542            0.50               0.56          61.6     9.6
2004    106          8,130            0.93               0.82          65.1     8.6
2005    35           10,157           0.74               0.86          64.7     6.5
2006    109          9,748            0.96               0.72          61.0     7.9

Notes. Data obtained from California Data Partnership. Sample includes all general obligation bond measures proposed by California school districts from 1987 to 2006. Dollar amounts in column (3) are measured in constant year-2000 dollars.
shares for bond approval, and the mean and standard deviation of observed vote shares. III. THEORETICAL FRAMEWORK Education researchers and reformers often cite overcrowded classrooms; poor ventilation, indoor air quality, temperature control, or lighting; inadequate computer hardware or wiring; and broken windows or plumbing as problems that can interfere with student learning. Mitigating such environmental conditions may bring substantial gains to student achievement in the short run by reducing distractions and missed school days.16 It may also benefit teachers by improving morale and reducing absenteeism and 16. See Earthman (2002) and Mendell and Heath (2004) for reviews.
turnover, with indirect impacts on student achievement (Buckley, Schneider, and Shang 2005). However, student achievement is not the only potential benefit of improved infrastructure. Capital investments may also lead to enhancements in student safety, athletic and art training, the aesthetic appeal of the campus, or any number of other nonacademic outputs. A full evaluation of investment decisions must capture all of these potential impacts. But rather than investigating each outcome separately, one can use parents' location decisions to identify their revealed preferences over spending levels. Any shift in the desirability of a district—along either academic or nonacademic dimensions—will be reflected in equilibrium housing prices.

Bond-funded investments are accompanied by an increased tax burden with an approximately equal present value. Thus, if funds are misspent or simply yield smaller benefits than the consumption foregone due to increased taxes, bond authorization will make a district less attractive, leading to reduced pretax housing prices. By contrast, if the effect on school output is valued more than the foregone consumption, home prices will rise when bonds are passed. It can be shown that the efficient choice of spending levels will equate the aggregate marginal utilities of consumption and school spending (Samuelson 1954), so positive effects on prices indicate inefficiently low spending.

We sketch a simple model to support this intuition.17 We assume that the utility of family i living in district j depends on local school output Aj, exogenous amenities Xj, and other consumption ci: uij = Ui(Aj, Xj, ci). The family has income wi and faces the budget constraint ci ≤ wi − rj − pj, where rj represents taxes and pj is the (rental) price of local housing. Service quality depends on tax revenues, Aj = A(rj); if districts use funds inefficiently, A′(r) will be low.18 We consider first the household location decision with predetermined spending. A family chooses the community that provides the highest utility, taking into account housing prices, taxes, and service quality. When the family's indirect utility in district j is written as U(A(rj), Xj, wi − rj − pj), the implicit function theorem

17. The basic model is due to Tiebout (1956). We draw heavily on Brueckner (1979) and Barrow and Rouse (2004).
18. If residents do not trust district management, A′(r) may be larger for restricted bond funds—which require that the projects that will be funded are specified before the bond referendum—than it would be for other forms of revenue. If so, bonds will have larger price effects than would unrestricted tax increases.
yields the family's bid for housing in district j as a function of amenities and taxes, gij = gi(Xj, rj).19 Holding prices, amenities, and tax rates in all other communities in the family's choice set constant, community j will provide higher utility than any alternative community if pj < gij. The family's WTP for a marginal increase in rj in its chosen district is ∂gi(Xj, rj)/∂rj. It can be shown that

(1)   $\dfrac{\partial g_i(X_j, r_j)}{\partial r_j} = \left(\dfrac{\partial U}{\partial c}\right)^{-1}\left[A'(r_j)\,\dfrac{\partial U}{\partial A}\right] - 1.$
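For completeness, a short derivation of equation (1) from the definition of the bid function in footnote 19, holding fixed the utility available in the best alternative community and applying the implicit function theorem (notation as in the text):

```latex
% Sketch of the derivation of (1). \bar{U} denotes the (fixed) utility in the
% best alternative community from footnote 19.
\begin{align*}
  &U\!\big(A(r_j),\, X_j,\, w_i - r_j - g_i(X_j, r_j)\big) = \bar{U} \\[4pt]
  &\text{Differentiating with respect to } r_j:\quad
    \frac{\partial U}{\partial A}\,A'(r_j)
    - \frac{\partial U}{\partial c}\left(1 + \frac{\partial g_i}{\partial r_j}\right) = 0 \\[4pt]
  &\Longrightarrow\quad
    \frac{\partial g_i(X_j, r_j)}{\partial r_j}
    = \left(\frac{\partial U}{\partial c}\right)^{-1}
      \left[A'(r_j)\,\frac{\partial U}{\partial A}\right] - 1 .
\end{align*}
```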
This WTP is positive if the marginal product of school revenues multiplied by the marginal utility of school outputs (in brackets) exceeds the marginal utility of consumption. Ignoring momentarily the effect of spending on local housing prices, the family's optimal tax and service level satisfies ∂g_i(X_j, r_j)/∂r_j = 0. If ∂g_i(X_j, r_j)/∂r_j > 0, the district's spending is below the family's preferred level; if ∂g_i(X_j, r_j)/∂r_j < 0, the family would prefer that taxes and services be cut.

In equilibrium, the price of housing in district j, p_j = p*(X_j, r_j), equals the bid of the marginal consumer, who must be indifferent between this district and another alternative. Thus, p_j will respond positively to increases in r_j if and only if the prior level of school spending was below the preferred level of the marginal resident.

Tax changes are not exogenous but depend on election outcomes. Many models of voting focus on landlords who are not local residents. Because they do not directly consume services, these absentee landlords will vote to maximize net-of-tax housing rents. At the maximum, the first-order effect of an exogenous change in tax rates will be zero for net rents and one for gross rents. Sale prices of rental units should reflect the present discounted value of net rents, so they will be invariant to the tax rate change. But absentee landlords do not vote. Residents do, and many will not vote to maximize the rental values of their homes. Most obviously, any renter who values spending less than the marginal resident—for whom ∂g_i(X_j, r_j)/∂r_j < ∂p*(X_j, r_j)/∂r_j—will vote against a proposed spending increase, as the utility he or she will derive from higher spending will not compensate for the increased rent that he or she will pay. Similarly, a homeowner who does not wish to move will vote on the basis of his or her own

19. g_i(·) is defined implicitly by U(A(r_j), X_j, w_i − r_j − g_i(X_j, r_j)) = max_{k≠j} U(A(r_k), X_k, w_i − r_k − p_k).
bid-rent, not the community's price function, and will oppose a tax increase if ∂g_i(X_j, r_j)/∂r_j < 0. This group may be particularly important in California: under Proposition 13, "empty nesters" face incentives to remain in their houses after their children are grown (Ferreira 2008). These families derive little direct utility from school spending and, if they do not plan to move, will not be motivated by the prospect of increased home values. Thus, in general, we should expect that even price-increasing proposals will attract some opposition and therefore that ∂p*(X_j, r_j)/∂r_j may be larger than zero even in political equilibrium.20

A final issue concerns timing. Capital projects take time to plan, initiate, and carry out, so bonds issued today will take several years to translate into improved capital services. Direct measures of school outputs will reflect the effects of bond passage only with long lags. House prices reflect the present discounted value (PDV) of all future services less all future taxes, so they should rise or fall as soon as the outcome of the election is known. This may happen well before the election if the outcome is easy to predict, but when the election is close important information is likely revealed on Election Day. Price effects may therefore be immediate. However, if house prices are sticky or homebuyers have imperfect information, it may take a few years for prices to fully reflect the impact of bond passage. We are thus interested in measuring the full sequence of dynamic treatment effects on each of our outcomes.

IV. EMPIRICAL RESEARCH DESIGN

In this section we describe our dynamic regression discontinuity design in six steps. First, we show in a cross-sectional framework how an RD design approximates a randomized experiment. Second, we extend the framework to incorporate the presence of multiple elections in the same district. We also discuss two interpretations of the causal effect of measure passage, corresponding to the ITT and TOT effects that arise in experiments with imperfect compliance. Third, we describe our implementation of the RD estimator for the ITT, which exploits panel data to enhance precision. Fourth, we describe our two new estimators for the dynamic

20. Exogenous increases in r may increase prices even if the pivotal voter's WTP is one (as must be the case for close elections in the median voter model) or negative (as in Romer and Rosenthal's [1979] agenda-setter model), if marginal homebuyers' preferences diverge sufficiently from those of inframarginal residents.
(TOT) treatment effects of bond authorization. Fifth, we discuss complications that arise in analyses of housing prices. Finally, we discuss how estimates of the effect of bond passage can be interpreted in terms of the marginal WTP for $1 of school facilities investment.

IV.A. Regression Discontinuity in Cross Section

Suppose that district j considers a bond measure and that this proposal receives vote share ν_j (relative to the required threshold ν*). Let b_j = 1(ν_j ≥ ν*) be an indicator for authorization of the bond. Suppressing time-related considerations, we can write some outcome y_j (capital spending or the price of local houses at some later date, for example) as
(2)   y_j = κ + b_j θ + u_j,
where θ is the causal effect of bond authorization and u_j represents all other determinants of the outcome (with E[u_j] = 0).21 In general, the election outcome may be correlated with other district characteristics that influence spending, so E[u_j b_j] ≠ 0. If so, a simple regression of y_j on b_j will yield a biased estimate of θ. However, as Lee (2008) points out, as long as there is some unpredictable random component of the vote, a narrowly decided election approximates a randomized experiment. In other words, the correlation between the election outcome and unobserved district characteristics can be kept arbitrarily close to zero by focusing on sufficiently close elections. One can therefore identify the causal effect of measure passage by comparing districts that barely passed a measure (the "treatment group") with others that barely rejected a bond measure (the "control group").

We focus on an implementation of the RD strategy that retains all of the data in the sample but absorbs variation coming from nonclose elections using flexible controls for the vote share.22 Assuming that E[u_j | ν_j], the conditional expectation of the unobserved determinants of y given the realized vote share, is continuous, we can approximate it by a polynomial of order g with

21. When y_j is district spending, one might expect that θ would equal the size of the authorized bond. But this need not be so, as districts where the proposal fails may make up some of the shortfall via other means. In practice, however, the appropriately estimated θ turns out to be quite close to the average proposed bond amount.
22. For a detailed comparison of this approach with an approach that uses data only from close elections, see Imbens and Lemieux (2008).
coefficients γ_u, P^g(ν_j, γ_u), and the approximation will become arbitrarily accurate as g → ∞. Under this assumption we can rewrite (2) as
(3)   y_j = κ + b_j θ + P^g(ν_j, γ_u) + ũ_j,
where ũ_j ≡ u_j − P^g(ν_j, γ_u) = (u_j − E[u_j | ν_j]) + (E[u_j | ν_j] − P^g(ν_j, γ_u)) is asymptotically uncorrelated with ν_j (and therefore with b_j). A regression of realized outcomes on the bond approval indicator, controlling for a flexible polynomial in the vote share, thus consistently estimates θ.23

IV.B. Panel Data and Multiple Treatments

We now extend the framework to allow multiple bond measures in the same district. We redefine b_{jt} to equal one if district j approved a measure in calendar year t and zero otherwise (i.e., if there was no election in year t or if a proposed bond was rejected). We assume that the partial effect of a bond authorization in one year on outcomes in some later year (holding bond issues in all intermediate years constant) depends only on the elapsed time. We can then write spending in any year t as a function of the full history of bond authorizations:
(4)   y_{jt} = \sum_{τ=0}^{∞} b_{j,t−τ} θ_τ + u_{jt}.
There are two sensible definitions of the causal effect of a measure's passage in t − τ on spending in year t, corresponding to different potential interventions. First, one can examine the effect of exogenously authorizing a bond issue in district j in year t − τ and prohibiting the district from authorizing bonds in any subsequent year. By equation (4), this is θ_τ, because we are controlling for all other bond measures. It is commonly known as the effect of the "treatment on the treated," or TOT, and we hereafter refer to it as θ_τ^{TOT}. By isolating the impact of $1 of debt authorization with no subsequent changes in the district's budget constraint, estimates of the TOT effect on house prices will allow us to examine homebuyers' WTP for additional school spending. Alternatively, one can focus on the impact of exogenously authorizing a bond issue and thereafter leaving the district to make

23. If there is heterogeneity in θ across districts, the RD estimator identifies the average of θ_j among districts with close elections (Imbens and Angrist 1994).
subsequent bond issuance decisions as its voters wish. This interpretation, known as the "intent-to-treat" (ITT) effect, incorporates effects of b_{j,t−τ} operating through the intermediate variables {b_{j,t−τ+1}, . . . , b_{jt}}. It is arguably the effect of interest for evaluations of a particular bond proposal. The ITT effect of b_{j,t−τ} on y_{jt} is

(5)   θ_τ^{ITT} ≡ dy_{jt}/db_{j,t−τ} = ∂y_{jt}/∂b_{j,t−τ} + \sum_{h=1}^{τ} (∂y_{jt}/∂b_{j,t−τ+h}) · (db_{j,t−τ+h}/db_{j,t−τ})
              = θ_τ^{TOT} + \sum_{h=1}^{τ} θ_{τ−h}^{TOT} π_h,
where π_h ≡ db_{j,t−τ+h}/db_{j,t−τ} represents the effect of authorizing the first bond on the probability of authorizing another bond measure h years later. We show in Section VI that districts that approve a bond are less likely to propose and approve other bonds in the next few years: π_h < 0 for h ≤ 4 and π_h = 0 for h > 4. Assuming that θ_{τ−h}^{TOT} ≥ 0 for all h, this implies that θ_τ^{ITT} ≤ θ_τ^{TOT}.

IV.C. Intent-to-Treat Effects

We begin by describing how the RD strategy can be used to identify the ITT effects, and then return to the TOT effects in Section IV.D. Recall that the ITT corresponds to the effect of experimentally manipulating one election outcome without controlling the district's behavior in subsequent years. The nonexperimental RD analogue is straightforward: we simply examine outcomes in later years for districts that pass or fail a specified initial election, controlling flexibly for the vote share in that election but not for any subsequent votes or other variables.

It is most natural to reorient our time index around the focal election. Thus, consider a district j that had an election in year t. We can write the district's outcome τ years later as

(6)   y_{j,t+τ} = b_{jt} θ_τ^{ITT} + P^g(ν_{jt}, γ_τ) + ũ_{j,t+τ},

where P^g(ν_{jt}, γ_τ) is a polynomial in ν_{jt} with coefficients γ_τ, and ũ_{j,t+τ} = u_{j,t+τ} − P^g(ν_{jt}, γ_τ). By the logic in Section IV.A, ũ_{j,t+τ} asymptotes to u_{j,t+τ} − E[u_{j,t+τ} | ν_{jt}], which is uncorrelated with
b_{jt} conditional on the vote share controls, it nevertheless reduces precision. More precise estimates of the θ_τ^{ITT} parameters can be obtained by pooling data from multiple τ (including τ < 0, corresponding to periods preceding the focal election) and including controls to absorb district-level heterogeneity.

To implement this, we begin by identifying each (j, t) combination with an election. We then select observations from district j in years t − 2 through t + 6. Where a district has multiple elections in close succession, the same calendar year observation is used more than once. For example, if a district had elections in 1995 and 1997, the [t − 2, t + 6] windows are [1993, 2001] and [1995, 2003], respectively, and the 1995–2001 observations are included in each. Observations in the resulting data set are uniquely identified by the district, j, the date of the focal election, t, and the number of years elapsed between the focal election and the time at which the outcome was measured, τ. We use this sample to estimate the following regression:
(7)   y_{jtτ} = b_{jt} θ_τ^{ITT} + P^g(ν_{jt}, γ_τ) + α_τ + κ_t + λ_{jt} + e_{jtτ}.
Here, α_τ, κ_t, and λ_{jt} represent fixed effects for years relative to the election, for calendar years, and for focal elections, respectively. Note that the λ_{jt} effects absorb any across-district variation. P^g(ν_{jt}, γ_τ) is a polynomial in the focal election vote share. Both the γ_τ and θ_τ^{ITT} coefficients are allowed to vary freely with τ for τ ≥ 0, but are constrained to zero for τ < 0. We cluster standard errors by district (i.e., by j) to account for dependence created by the use of multiple (j, t) observations in the sample or by serial correlation in the e_{jtτ}.24

IV.D. Treatment-on-the-Treated Effects

In traditional experimental designs with a single opportunity for randomization and imperfect compliance, the TOT is readily identified by using the random treatment assignment as an instrument for the actual treatment status. The "fuzzy" RD design (Hahn, Todd, and Van der Klaauw 2001) uses the same strategy: even when some subjects with ν_{jt} < ν* are treated and/or subjects with ν_{jt} > ν* are untreated, the discontinuous indicator for measure passage, b_{jt}, can be used as an instrument for the realized treatment status.

24. In the empirical application we also include an indicator for a measure with a 55% (as opposed to two-thirds) threshold.
In our study, each election is a sharp RD, but the possibility of dynamics in the b_{jt} variable introduces fuzziness: a district in the "control" group—one where the focal election narrowly failed—might approve a bond in a subsequent election and therefore be treated. However, the usual fuzzy RD strategy cannot be applied. With dynamic treatment effects, a bond authorization in year t + h does not have the same effect on the outcome in t + τ as the initial authorization in t would have.

Our "recursive" estimator for the TOT effects extends the fuzzy RD design to the case of dynamic treatment effects. We assume (as has been implicit in our notation thus far) that the TOT effects of bond authorization on later authorizations and outcomes depend only on the time elapsed since the focal treatment (τ) and not on the time at which the treatment occurred or on the treatment history. That is, although ∂y_{j,t+τ}/∂b_{jt} and ∂b_{j,t+τ}/∂b_{jt} depend on τ, they do not depend on t or on {b_{j1}, . . . , b_{j,t−1}, b_{j,t+1}, . . . , b_{j,t+τ−1}}. Recall that equation (5) related θ_τ^{ITT} to θ_τ^{TOT} and {θ_{τ−h}^{TOT}, π_h}_{h=1,...,τ}. We can simply invert that equation to obtain recursive formulas for the TOT effects in terms of the ITTs and the π's:
(8)    θ_0^{TOT} = θ_0^{ITT};

(9)    θ_1^{TOT} = θ_1^{ITT} − π_1 θ_0^{TOT};

(10)   θ_2^{TOT} = θ_2^{ITT} − π_1 θ_1^{TOT} − π_2 θ_0^{TOT};

and, in general,

(11)   θ_τ^{TOT} = θ_τ^{ITT} − \sum_{h=1}^{τ} π_h θ_{τ−h}^{TOT}.
Our recursive estimator thus proceeds in two steps. First, we estimate the coefficients θ_τ^{ITT} and π_τ using the methods discussed in Section IV.C.25 Second, we solve for the dynamic TOT effects using the recursive equation (11). Standard errors are obtained by the delta method.

25. Note that π_τ is defined as an ITT, so it can be estimated via equation (7) by simply redefining y_{jtτ} to equal b_{j,t+τ}. We modify the approach discussed in Section IV.C in two ways. First, we use all available relative years: τ ranges from −19 to +18 rather than just from −2 to +6. This permits us to estimate the TOT over a longer postelection period. Second, to obtain the covariance between the θ^{ITT} and π parameters we stack observations (for each election and each relative year τ) on the outcomes y_{jtτ} and b_{jtτ} and fully interact (7) with an indicator for the outcome type. As before, we cluster standard errors at the district level.
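To make the two steps concrete, the following minimal sketch (in Python) illustrates the second, recursive step, assuming first-step estimates of the θ_τ^{ITT} and π_τ coefficients are already in hand. The numerical inputs are illustrative placeholders, not our estimates, and the delta-method standard errors are omitted.

```python
import numpy as np

def recursive_tot(theta_itt, pi):
    """Apply equation (11): theta_TOT[t] = theta_ITT[t] - sum_{h=1..t} pi[h] * theta_TOT[t-h].

    theta_itt[t] is the estimated ITT effect of bond passage on the outcome t years
    after the focal election; pi[t] is the estimated ITT effect on the probability
    of passing another measure t years later (pi[0] is unused).
    """
    theta_tot = np.zeros(len(theta_itt))
    for t in range(len(theta_itt)):
        correction = sum(pi[h] * theta_tot[t - h] for h in range(1, t + 1))
        theta_tot[t] = theta_itt[t] - correction
    return theta_tot

# Illustrative placeholder inputs for tau = 0, ..., 4 (not estimates from this paper).
theta_itt_hat = np.array([0.0, 300.0, 900.0, 1200.0, 900.0])
pi_hat = np.array([0.0, -0.15, -0.20, -0.15, -0.10])
print(recursive_tot(theta_itt_hat, pi_hat))
```

Because π_h < 0 in the first few years, the correction term is negative, so the recursion scales the ITT estimates up toward the TOT, consistent with θ_τ^{ITT} ≤ θ_τ^{TOT} above.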
This recursive strategy has an important drawback. Equation (11) indicates that θ_τ^{TOT} depends on θ_τ^{ITT} as well as on θ_{τ−h}^{TOT} and π_h for all 1 ≤ h ≤ τ. As a result, the estimates become extremely imprecise at long lags.

Our second TOT estimator obtains greater precision by applying additional restrictions on the election dynamics. We return to equation (4), which specifies the outcome in year t as depending on the full history of bond authorizations in the district. An OLS estimate of (4) would yield biased estimates of the TOT effects, as bond authorizations (both past and current) are likely to be correlated with other determinants of outcomes. However, under the standard RD assumption—that measure passage is as good as randomly assigned conditional on a smooth function of the measure vote share—this endogeneity can be absorbed via the inclusion of a flexible polynomial in the vote share. Thus, to bring the RD methodology to the "structural" equation (4), we augment each of the lagged bond authorization indicators b_{j,t−τ} with an indicator for the presence of a measure on the ballot in year t − τ, m_{j,t−τ}, and a polynomial in the vote share, P^g(ν_{j,t−τ}, γ_τ).26 Both the m_{j,t−τ} coefficient and the polynomial coefficients are allowed to vary freely with τ (for τ ≥ 0). We also add fixed effects for each district and for each calendar year. The estimating equation then becomes

(12)   y_{jt} = \sum_{τ=0}^{τ̄} [ b_{j,t−τ} θ_τ^{TOT} + m_{j,t−τ} α_τ + P^g(ν_{j,t−τ}, γ_τ) ] + λ_j + κ_t + u_{jt}.
We estimate this on a conventional panel of school districts over calendar years, with each observation used exactly once. Standard errors are clustered on the school district. It is instructive to compare this "one-step" estimator with the recursive approach. Where the recursive strategy extends experimental techniques to accommodate dynamic treatment effects, the one-step estimator imports the RD strategy for isolating exogenous variation into an observational analysis. With the inclusion of controls for the election and vote share history in (12), the θ_τ^{TOT} coefficients are identified from the contrast between districts where an election in t − τ narrowly passed and those where the election narrowly failed but the sequence of prior and subsequent elections, votes, and bond authorizations is similar.

26. We set ν_{j,t−τ} = 0 if district j did not hold an election in year t − τ.
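As an illustration only, the sketch below shows how the regressors in (12) might be constructed from a district-by-calendar-year panel. The column names (district, year, passed, measure, voteshare, y) and the pandas/statsmodels implementation are our own illustrative choices for exposition, not a description of the code actually used.

```python
import pandas as pd
import statsmodels.formula.api as smf

def build_one_step_regressors(panel, max_lag=6, poly_order=3):
    """Create lagged passage/measure indicators and vote-share polynomials for equation (12)."""
    df = panel.sort_values(["district", "year"]).copy()
    grp = df.groupby("district")
    terms = []
    for tau in range(max_lag + 1):
        df[f"b_lag{tau}"] = grp["passed"].shift(tau).fillna(0)    # bond approved tau years earlier
        df[f"m_lag{tau}"] = grp["measure"].shift(tau).fillna(0)   # measure on the ballot tau years earlier
        vote = grp["voteshare"].shift(tau).fillna(0)              # vote share (0 if no election, per footnote 26)
        for p in range(1, poly_order + 1):
            df[f"v_lag{tau}_p{p}"] = vote ** p
            terms.append(f"v_lag{tau}_p{p}")
        terms += [f"b_lag{tau}", f"m_lag{tau}"]
    return df, terms

# df, terms = build_one_step_regressors(panel)
# rhs = " + ".join(terms) + " + C(district) + C(year)"
# fit = smf.ols("y ~ " + rhs, data=df).fit(cov_type="cluster",
#                                          cov_kwds={"groups": df["district"]})
# The coefficients on b_lag0, ..., b_lag6 correspond to the one-step TOT estimates.
```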
An important limitation on the one-step estimator is that it involves controlling for intermediate outcomes. The RD design does not permit a causal interpretation of the α_τ or γ_τ coefficients in (12). If the outcome of an initial election influences m or ν in subsequent years, biases in their coefficients relative to the true causal effects will lead to bias in the estimated bond authorization effects θ_τ^{TOT}. For example, the one-step estimator will be inconsistent if the outcome of an initial election affects the composition of the electorate in subsequent elections. We will see below that the one-step estimator yields quite similar estimates to those obtained from the recursive estimator, which does not suffer from the intermediate outcomes problem. Moreover, the one-step estimates are substantially more precise.

IV.E. Forward-Looking Housing Prices

We have not yet specified the "outcome" variable. Below, we present estimates for school district spending, student test scores, and district demographics, but our primary dependent variable is the average sale price of homes in the district. This outcome adds some complexity, as prices depend in part on expectations of future events. If the discount rate is r, standard no-arbitrage conditions ensure that the discounted TOT effect of a bond issue that will be authorized (with probability one) in period t + h on prices in t is tied to the TOT effect of an authorization in period t: θ_{−h}^{TOT} = θ_0^{TOT}(1 + r)^{−h}. Moreover, uncertainty about future election outcomes is priced at the expected value. Thus, we can write house prices in year t as
(13)   y_{jt} = \sum_{τ=0}^{∞} b_{j,t−τ} θ_τ^{TOT} + \sum_{h=1}^{∞} E_t[b_{j,t+h}] θ_0^{TOT}(1 + r)^{−h} + u_{jt},
where E_t[·] is the expectation as of date t and the second summation reflects the influence of the expected future path of the b_{jt} series. We assume homebuyers cannot predict future election outcomes any better than we can. With this assumption, dE_t[b_{j,t+h}]/db_{j,t−τ} = π_{τ+h}. The ITT effect of b_{j,t−τ} on y_{jt} then becomes

(14)   θ_τ^{ITT} = θ_τ^{TOT} + \sum_{h=1}^{τ} π_h θ_{τ−h}^{TOT} + \sum_{h=1}^{∞} π_{τ+h} θ_0^{TOT}(1 + r)^{−h}.
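Continuing the illustrative sketch from Section IV.D, the modified recursion can be inverted under (14) as follows. The truncation horizon H (in place of the infinite sum) and the discount rate r are assumptions of the sketch, and the inputs are again placeholders.

```python
import numpy as np

def recursive_tot_forward(theta_itt, pi, r=0.0733, H=18):
    """Invert equation (14), in which prices also capitalize expected future passage.

    theta_itt and pi are as in recursive_tot(); pi should extend far enough that
    pi[t + h] is available for the horizons used. H truncates the infinite sum.
    """
    def discounted_future_pi(t):
        return sum(pi[t + h] * (1 + r) ** (-h)
                   for h in range(1, H + 1) if t + h < len(pi))

    theta_tot = np.zeros(len(theta_itt))
    # At tau = 0, (14) gives theta_ITT[0] = theta_TOT[0] * (1 + sum_h pi[h] (1+r)^-h).
    theta_tot[0] = theta_itt[0] / (1.0 + discounted_future_pi(0))
    for t in range(1, len(theta_itt)):
        lagged = sum(pi[h] * theta_tot[t - h] for h in range(1, t + 1))
        theta_tot[t] = theta_itt[t] - lagged - discounted_future_pi(t) * theta_tot[0]
    return theta_tot
```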
Again, the final summation in (14) reflects the portion of the effect of the t − τ treatment that operates through its influence on the expectation of post-t treatments. Our recursive estimator is readily modified for this case. We present these "forward-looking" housing price estimates in addition to the recursive and one-step estimates below.

IV.F. Willingness to Pay for School Spending

We have described methods for identifying the causal effect of authorizing a bond in one year on house prices in future years. Our estimators identify the bond authorization effect based on the discontinuity in the relationship between later house prices and the election vote share at the threshold required for passage. They therefore are local to close elections, and can be interpreted as the average effect of the bonds for which the elections are close. In our sample (discussed below), the average proposal that passed by less than two percentage points was for a bond issue of $6,309 per pupil, so the effect per dollar of bonds authorized in τ is simply θ̃_τ^{TOT} ≡ θ_τ^{TOT}/6,309.

To convert this into an estimate of the WTP for additional school spending, it is useful to think of a bond authorization as a bundle of several "programs." First, authorization to issue $1 in bonds per pupil means that spending in future years can rise by an amount equal in present value to $1. Second, property tax rates are raised in each of the next thirty years by an amount sufficient to pay the bond principal and interest. Assuming that the district borrows and saves at the residents' discount rate, the present value of the increment to future taxes is also $1. As homebuyers are committing to the purchase price of the house plus the stream of future property taxes, their implied WTP for the additional spending is $1 + θ̃_0^{TOT}. The WTP will be greater than one if the marginal homebuyer values $1 in school spending more than $1 in other consumption.

This simplified presentation ignores many complexities: sticky house prices; the income tax treatment of property taxes, mortgage interest, and municipal bond interest; the ratio of pupils to homes; and heterogeneity in tax shares within districts can all lead $1 + θ̃_0^{TOT} to diverge from the marginal WTP for $1 in school spending. We discuss a simple WTP calculation in Section VII, and then add the various complexities in the Online Appendix.
V. DATA

We obtained bond data from a database maintained by the California Education Data Partnership. For each proposed bond, the data include the amount, intended purpose, vote share, required vote share for passage, and voter turnout. Our sample includes all general obligation bond measures sponsored by school districts between 1987 and 2006. We merged these with annual district-level enrollment and financial data from the Common Core of Data (CCD).

We obtained calendar-year averages, at the census block group level, of the sale prices, square footages, and lot sizes of transacted homes from a proprietary database compiled from public records by the real estate services firm DataQuick. The underlying data describe all housing transactions in California from 1988 to 2005.27 We used geographic information system (GIS) mapping software to assign census block groups to school districts.

If the mix of houses that transact changes from one year to the next (for example, one might expect sales of houses that can accommodate families with children to react differently to school spending than do smaller houses), this will bias our house price measure relative to the quantity of interest, the average market value of houses in the district. We take two steps to minimize this bias. First, when we average block groups to the district level, we weight them by their year-2000 populations rather than by the number of transactions. This holds constant the location of transactions within the district. Second, we include in our models of housing prices controls for the average square footage and lot size of transacted homes and for the number of sales to absorb any remaining selection. These adjustments have little effect on the results, and estimates based on unadjusted data are presented in Section VII.C.

We constructed a panel of average student achievement by merging data from several different tests (listed in the Online Appendix) given in California at various times. We focused on third and fourth graders, for whom the longest panel is available,

27. The majority of housing transactions happen from May through August. We assign measures occurring after October to housing data from the following calendar year. This means that a few of the housing transactions assigned to year 0 in fact occurred before the election. To merge measures to academic year data from the CCD, we treat any measure between May 2005 and April 2006 as occurring during the 2005–2006 academic year.
and standardized the scaled scores each year using school-level means and standard deviations. Finally, we obtained the racial composition and average family income of homebuyers in each district between 1992 and 2006 from data collected under the Home Mortgage Disclosure Act. We treat this measure as characterizing in-migrants to the district, though we are unable to exclude intradistrict movers from the calculation. Renters are not represented. The Online Appendix provides more detail on data and sources.

Table II presents descriptive statistics. Column (1) shows the means and standard deviations computed over all district–year observations in our data. Columns (2) and (3) divide the sample between districts that proposed at least one bond between 1987 and 2006 and those that did not. Districts that proposed bonds are larger and have higher test scores, incomes, and housing prices, but smaller lot sizes. Columns (4) and (5) focus on districts that approved and rejected school bonds, using data from the year just before the bond election, whereas column (6) presents differences between them. Districts that passed measures had 25% higher enrollment, $206 higher current instructional spending, and $349 higher total spending. Districts that passed measures also had much higher incomes and house prices, as well as more housing transactions. However, these districts also had homes with smaller lots.

VI. EVALUATING THE BOND REFERENDUM QUASI-EXPERIMENT

Our empirical strategy is to use close elections to approximate a true experiment. This requires that bond authorization be as good as randomly assigned, conditional on having a close election. In this section, we consider tests of this assumption. We also demonstrate that bond authorization in fact leads to increased capital spending in subsequent years.

VI.A. Balance of Treatment and Control Groups

We examine three diagnostics for the validity of the RD quasi-experiment, based on the distribution of vote shares, preelection differences in mean characteristics, and differences in preelection trends. Tests of the balance of outcome variable means and trends before the election are possible only because of the panel structure of our data and provide particularly convincing evidence regarding the approximate randomness of measure passage.
TABLE II
SCHOOL DISTRICT DESCRIPTIVE STATISTICS FOR FISCAL, HOUSING MARKET, AND ACADEMIC VARIABLES

Columns: (1) All school districts; (2) Never proposed a measure; (3) Proposed at least one measure; (4) Passed a measure (time t − 1); (5) Failed a measure (time t − 1); (6) Diff (4)–(5) (t stat).

A. Fiscal variables
  Number of districts:                        (1) 948; (2) 319; (3) 629
  Number of observations:                     (1) 10,197; (2) 3,306; (3) 6,891; (4) 626; (5) 218
  Log enrollment:                             (1) 7.43 [1.69]; (2) 6.18 [1.43]; (3) 8.03 [1.47]; (4) 8.34 [1.48]; (5) 8.09 [1.48]; (6) 0.25 (2.13)
  Total expenditures PP ($):                  (1) 7,466 [2,177]; (2) 7,410 [2,293]; (3) 7,493 [2,119]; (4) 7,290 [1,898]; (5) 6,941 [1,921]; (6) 349 (2.32)
  Capital outlays PP ($):                     (1) 922 [1,100]; (2) 679 [905]; (3) 1,038 [1,164]; (4) 882 [1,005]; (5) 935 [1,112]; (6) −53 (0.62)
  Current instructional expenditures PP ($):  (1) 3,905 [808]; (2) 4,034 [941]; (3) 3,844 [728]; (4) 3,824 [703]; (5) 3,618 [677]; (6) 206 (3.82)

B. Housing market variables
  Number of observations:                     (1) 15,151; (2) 4,578; (3) 10,573; (4) 731; (5) 382
  House prices ($):                           (1) 241,537 [198,618]; (2) 190,337 [149,691]; (3) 263,706 [212,612]; (4) 285,857 [240,439]; (5) 210,499 [178,766]; (6) 75,358 (5.91)
  Log house prices:                           (1) 12.16 [0.65]; (2) 11.95 [0.62]; (3) 12.26 [0.65]; (4) 12.33 [0.66]; (5) 12.08 [0.55]; (6) 0.25 (6.71)
  Square footage:                             (1) 1,603 [407]; (2) 1,572 [456]; (3) 1,615 [386]; (4) 1,625 [401]; (5) 1,637 [363]; (6) −11 (0.47)
  Lot size:                                   (1) 56,772 [81,652]; (2) 97,604 [111,614]; (3) 39,797 [57,266]; (4) 32,342 [48,933]; (5) 49,388 [60,891]; (6) −17,047 (4.73)
  Sales volume:                               (1) 881 [1,966]; (2) 316 [951]; (3) 1,126 [2,225]; (4) 1,519 [3,568]; (5) 1,134 [1,445]; (6) 385 (2.54)
  Income of homebuyers ($):                   (1) 96,482 [59,094]; (2) 84,753 [45,204]; (3) 101,674 [63,606]; (4) 107,689 [70,382]; (5) 90,339 [58,903]; (6) 17,350 (4.36)
  Log income of homebuyers:                   (1) 11.36 [0.46]; (2) 11.25 [0.43]; (3) 11.40 [0.47]; (4) 11.45 [0.49]; (5) 11.31 [0.41]; (6) 0.14 (5.13)

C. Achievement variables
  Number of observations:                     (1) 9,748; (2) 3,240; (3) 6,508; (4) 460; (5) 170
  Reading, grade 3:                           (1) 0.17 [0.91]; (2) 0.10 [0.96]; (3) 0.21 [0.88]; (4) 0.16 [0.93]; (5) 0.19 [0.81]; (6) −0.028 (0.37)
  Math, grade 3:                              (1) 0.07 [0.90]; (2) −0.06 [0.99]; (3) 0.13 [0.85]; (4) 0.12 [0.88]; (5) 0.10 [0.82]; (6) 0.020 (0.27)

Notes. Columns (1)–(5) show averages and standard deviations (in square brackets). Column (6) reports the difference between columns (4) and (5), with t statistics in parentheses; bold coefficients are significant at the 5% level. Samples in columns (1), (2), and (3) include all available observations in all years. Fiscal variables are available for years 1995–2005, housing market variables for 1988–2005 (except income and log income, which are only from 1992 to 2006), and test scores for 1992–1993 and 1997–2006. Columns (4) and (5) include only observations for the year prior to a bond referendum. All dollar figures are measured in real year-2000 dollars. Achievement scores are standardized to mean zero and standard deviation one across schools in each year, then averaged to the district. Housing market variables are averages across all transacted homes, reweighted to match the population distribution across block groups in 2000.
[Figure I here. Two histograms of bond measure vote shares, one for measures with a 55% threshold (N = 382) and one for measures with a 66.7% threshold (N = 848); horizontal axis: vote share in favor; vertical axis: frequency.]

FIGURE I
Distribution of Bond Measures by Vote Share
Sample includes all school district general obligation bond measures in California from 1987 to 2006. Vote shares are censored at 40 and 90.
Figure I shows histograms of bond measure vote shares, separately for measures that required two-thirds and 55% of the vote for approval. Discontinuous changes in density around the threshold can be an indication of endogenous sorting around this threshold, which would violate the RD assumptions (McCrary 2008). We see no evidence of such changes. Columns (1)–(4) of Table III present regressions of fiscal, housing, and academic variables measured in the year before a bond referendum, on an indicator for whether the bond proposal was approved. The specifications in columns (1) and (2) are estimated from a sample that includes only observations from the year before the election. The first column controls for year effects and the required threshold. Like Table II, it reveals large premeasure differences in several outcomes. The second column adds a cubic polynomial in the measure vote share. Comparing districts that barely passed a bond with districts that barely failed eliminates the significant estimates, shrinking two of the point estimates substantially.
TABLE III
PRE–BOND MEASURE BALANCE OF TREATMENT AND CONTROL GROUPS

Columns (1)–(4): outcome levels in the year before the election (t − 1). Columns (5)–(7): change from t − 2 to t − 1.

A. Fiscal outcomes
  Total expenditures PP:           (1) 6 (123); (2) −363 (191); (3) −262 (187); (4) 28 (177); (5) −10 (102); (6) 50 (149); (7) 98 (151)
  Capital outlays PP:              (1) −179 (86); (2) −220 (133); (3) −154 (126); (4) −44 (145); (5) 0 (87); (6) 54 (121); (7) 95 (123)
  Current instructional exp. PP:   (1) 91 (44); (2) −24 (63); (3) −12 (62); (4) 35 (36); (5) −7 (19); (6) −6 (31); (7) −2 (31)

B. Housing market outcomes
  Log house prices:                (1) 0.184 (0.029); (2) 0.043 (0.044); (3) 0.040 (0.043); (4) 0.013 (0.011); (5) 0.015 (0.007); (6) 0.020 (0.010); (7) 0.017 (0.010)

C. Achievement outcomes
  Reading, grade 3:                (1) −0.040 (0.088); (2) 0.147 (0.120); (3) 0.185 (0.117); (4) −0.010 (0.054); (5) −0.022 (0.034); (6) −0.032 (0.058); (7) −0.022 (0.057)
  Math, grade 3:                   (1) 0.042 (0.089); (2) 0.180 (0.112); (3) 0.214 (0.109); (4) 0.054 (0.062); (5) −0.054 (0.039); (6) −0.002 (0.059); (7) 0.004 (0.056)

Specification
  Year effects and threshold control:     (1) Y; (2) Y; (3) Y; (4) Y; (5) Y; (6) Y; (7) Y
  Cubic in vote share:                    (1) N; (2) Y; (3) Y; (4) Y; (5) N; (6) Y; (7) Y
  Sample pools relative years [−2, 6]:    (1) N; (2) N; (3) Y; (4) Y; (5) N; (6) N; (7) Y
  Bond measure fixed effects:             (1) N; (2) N; (3) N; (4) Y; (5) N; (6) N; (7) N

Notes. Each entry comes from a separate regression. Dollar values are measured in constant year-2000 dollars. Columns (1)–(4) report estimated bond effects on outcome levels the year before the election; columns (5)–(7) report estimated effects on the annual growth rate that year. Samples in columns (1)–(2) and (5)–(6) include observations from the year before each bond measure election. Samples in columns (3), (4), and (7) consist of observations from two years before to six years after each bond election. The specification in these columns is equation (7), with indicators for each calendar year and each relative year (−2 through +6), plus interactions of the −1 through +6 relative year indicators with a cubic in the vote share, an indicator for measure passage, and an indicator for an election with a 55% threshold. The interaction between the relative year −1 indicator and the measure passage indicator is reported. Column (4) also includes measure fixed effects. Models for house prices include controls for square footage, lot size, and sales volume in all columns. Sample sizes vary with availability of the dependent variable; for fiscal outcomes, N = 845 in columns (1)–(2), 6,970 in (3)–(4), 780 in (5)–(6), and 5,815 in (7). Standard errors (in parentheses) are robust to heteroscedasticity and, in columns (3), (4), and (7), clustered at the school district level. Bold coefficients are significant at the 5% level.
Columns (3) and (4) turn to panels pooling observations from two years before through six years after the election, as discussed in Section IV. We generalize equation (7) by freeing the coefficients corresponding to outcomes in the year of and the year before the election, and report in the table the "effect" of bond passage in the year before the election, θ_{−1}. Column (3) reports estimates from a specification without measure fixed effects (λ_{jt} in (7)), whereas column (4) includes them. Pooling the data does not substantially change the estimates. The specification in column (4), however, has much smaller (in absolute value) point estimates and
standard errors, particularly for housing prices and test scores. The fixed effects evidently absorb a great deal of variation in these outcomes that is unrelated to election results.

Columns (5)–(7) in Table III repeat our first three specifications, taking as the dependent variable the change in each outcome between years t − 2 and t − 1. Although the model without controls shows some differences in trends between districts that pass and fail measures, these are eliminated when we include controls for the vote share.

Overall, there seems to be little cause for concern about the approximate randomness of the measure passage indicator in our RD framework. Once we control for a cubic in the measure vote share, measure passage is not significantly correlated with pretreatment trends of any of the outcomes we examine.28 Further, in similar specifications (not reported in Table III), we find no evidence of "effects" on sales volume, housing characteristics, the income of homebuyers, or other covariates.29

VI.B. Intent-to-Treat Effects on School Spending

Figure II presents graphical analyses of mean district spending per pupil by the margin of victory or defeat, in the year before the election and three years after it. We show average outcomes (controlling for calendar year effects) in two-percentage-point bins defined by the vote share relative to the threshold.30 Thus, the leftmost point represents measures that failed by between eight and ten percentage points, the next measures that failed by six to eight points, and so on. The left panel shows total district spending, whereas the right panel shows capital outlays. As expected, there is no sign of a discontinuity in either total or capital spending in the year before the election. By contrast, in the third year

28. The estimated effect of bond authorization on house price changes in columns (5)–(7) is reasonably large, though not significant when the vote share is controlled. If bond passage were indeed correlated with preexisting trends in district house prices, even after controlling flexibly for the vote share, this could confound our estimates of the effect of passage on postelection prices. To investigate this issue further, we have estimated a variety of additional specifications, reported in the Online Appendix. The point estimates here seem to reflect a transitory blip in housing prices in year t − 1 rather than any long-run trend.
29. We have also examined other election outcomes for evidence that bond authorization is nonrandomly assigned in close elections. Bond authorizations are not associated with the number of county and municipal measures that pass in the same year or in previous years nor with the probability that an incumbent mayor is reelected.
30. The bin corresponding to measures that failed by less than two percentage points is the category excluded from the regression used to control for year effects, so estimates may be interpreted as differences relative to that bin. Results are robust to exclusion of the year controls.
[Figure II here. Two panels plotting mean total expenditures per pupil (left) and mean capital outlays per pupil (right) against the vote share relative to the threshold, in two-percentage-point bins, one year before and three years after the election.]

FIGURE II
Total Spending and Capital Outlays per Pupil, by Vote Share, One Year before and Three Years after Election
Graph shows average total expenditures (left panel) and capital outlays (right panel) per pupil, by the vote share in the focal bond election. Focal elections are grouped into bins two percentage points wide: measures that passed by between 0.001% and 2% are assigned to the 1 bin; those that failed by similar margins are assigned to the −1 bin. Averages are conditional on year fixed effects, and the −1 bin is normalized to zero.
after the election, districts where the measure just passed spend about $1,000 more per pupil, essentially all of it in the capital account.31 Panel A of Table IV presents estimates of the intent-to-treat effect of bond passage on district spending and on state and federal transfers (all in per-pupil terms) over the six years following the election, using equation (7).32 Bond passage has no significant effect on any of the fiscal variables in the first year. We see large increases in capital expenditures in years 2, 3, and 4. These increases fade by the fifth year following the election. There is no indication of any effect on current spending in any year, and confidence intervals rule out effects amounting to more than about 31. It is possible that districts use bond revenues for operating expenses but report these expenditures in their capital accounts. The CCD data are not used for financial oversight, so districts have no obvious incentive to misreport. 32. We make one modification to equation (7): We constrain the τ = 0 coefficients to zero. It is not plausible that bond passage can have effects on that year’s district budget, which will typically have been set well before the election. In any case, results are insensitive to removing this constraint.
TABLE IV
THE IMPACT OF BOND PASSAGE ON FISCAL OUTCOMES: ITT AND TOT EFFECT ESTIMATES

Columns: (1) 1 yr later; (2) 2 yrs later; (3) 3 yrs later; (4) 4 yrs later; (5) 5 yrs later; (6) 6 yrs later.

A. ITT
  Total expenditures PP:                  (1) 335 (177); (2) 936 (216); (3) 1,271 (273); (4) 961 (305); (5) 200 (316); (6) −333 (335)
  Capital outlays PP:                     (1) 255 (151); (2) 802 (191); (3) 1,121 (244); (4) 841 (277); (5) 219 (276); (6) −360 (279)
  Current instructional expenditures PP:  (1) 35 (39); (2) 8 (43); (3) 3 (45); (4) −26 (56); (5) −20 (71); (6) −19 (74)
  State and federal transfers PP:         (1) 100 (129); (2) 41 (149); (3) −98 (177); (4) 79 (175); (5) 157 (175); (6) −13 (193)

B. TOT
 Recursive estimator
  Total expenditures PP:                  (1) 306 (166); (2) 920 (225); (3) 1,424 (297); (4) 1,405 (358); (5) 940 (404); (6) 452 (455)
  Capital outlays PP:                     (1) 250 (143); (2) 822 (193); (3) 1,303 (257); (4) 1,281 (308); (5) 924 (341); (6) 381 (372)
  Current instructional expenditures PP:  (1) 44 (41); (2) 13 (54); (3) 1 (59); (4) −20 (77); (5) −20 (100); (6) −27 (115)
  State and federal transfers PP:         (1) 67 (120); (2) −22 (148); (3) −142 (190); (4) −19 (207); (5) 11 (227); (6) −148 (261)
 One-step estimator
  Total expenditures PP:                  (1) 198 (188); (2) 853 (235); (3) 1,688 (337); (4) 1,841 (417); (5) 1,169 (374); (6) 701 (389)
  Capital outlays PP:                     (1) 220 (157); (2) 792 (228); (3) 1,549 (299); (4) 1,660 (308); (5) 1,091 (268); (6) 554 (267)
  Current instructional expenditures PP:  (1) 22 (46); (2) −28 (52); (3) −33 (49); (4) −64 (64); (5) −80 (77); (6) −82 (80)
  State and federal transfers PP:         (1) 41 (133); (2) −50 (185); (3) 184 (311); (4) 104 (218); (5) 91 (203); (6) −6 (227)

Notes. Each row represents a separate specification, and reports effects of measure passage on outcomes 1 year later (column (1)), 2 years later (column (2)), and so on. Dependent variables are measured in constant year-2000 dollars per pupil. Panel A presents estimates of the ITT effects of bond passage. The sample consists of all bond elections and all outcome measures from years relative to the election −2 through +6. Some fiscal measures appear in the sample several times for different relative years. N = 6,970. The specification corresponds to equation (7), and includes bond measure fixed effects; indicators for calendar years and years relative to the bond measure; and interactions of the relative year fixed effects (for relative years 1 through 6) with a cubic in the vote share, an indicator for passage, and an indicator for a 55% threshold. The table reports the relative year–passage interaction coefficients. Panel B presents estimates of TOT effects, first using the recursive estimator and second using the one-step estimator. The recursive estimator uses equation (11), applied to ITT estimates as in Panel A but with all available relative years included in the sample. N = 13,405. The one-step estimator uses a conventional panel of districts-by-calendar years. The specification is equation (12). It includes calendar year effects; indicators for the presence of an election t years ago for t = 1, . . . , 18; indicators for measure approval t years ago; cubics in the vote shares of the election t years ago (if any); and indicators for a 55% threshold in the election t years ago. N = 7,038. Standard errors (in parentheses) are clustered on the school district, and bold coefficients are significant at the 5% level.
$100 per pupil in every year. Essentially all of the funds made available by the bond authorization are kept in the capital account.

One might be concerned that bond issues will crowd out other types of educational revenues. Table IV indicates that there is no meaningful crowding out of state or federal transfers (indeed, most of the point estimates are positive). We have also examined whether bond authorization crowds out donations to local education foundations, which often provide cash or in-kind transfers to California schools (Brunner and Sonstelie 1997; Brunner and Imazeki 2005). We find no evidence of such an effect.33

VI.C. School Bond Dynamics and TOT Effects on Spending

School districts where an initial measure fails are more likely to pass a subsequent measure than districts where the initial measure passes. Figure III plots estimates of the π_τ coefficients, the ITT effect of measure passage in year t on the probability of passing a measure in year t + τ. These are estimated via equation (7), using b_{jtτ} as the dependent variable. There are negative effects in each of the first four years, but there is no sign of any effect thereafter. The cumulative effect after bond passage is around −0.6, indicating that a close loss in an initial election reduces the expected total number of bonds ever passed by about 0.4.

As discussed in Section IV, the dynamics in treatment assignment imply that the ITT effects of bond authorization on spending understate the true TOT effects. Panel B of Table IV presents estimates of the TOT effects from our recursive and one-step estimators. The spending effects are larger and more persistent than in Panel A, but there is still no indication that current spending or intergovernmental transfers respond to bond passage. In particular, the effects on current spending are tightly estimated zeros in every year.

Figure IV presents estimates of the recursive and one-step dynamic treatment effects of bond passage on district spending over the longer term. Both indicate that effects on spending are exhausted by year 6. The one-step estimator indicates somewhat larger effects than the recursive estimator, but the differences are small. As expected, it also yields substantially smaller

33. A regression of total foundation revenue per pupil in the district in 2001 on an indicator for having approved a bond proposal before 2001 (controlling for a cubic in the vote share) yields a coefficient of 15 (s.e. 42).
[Figure III here. Plot of reduced-form RD estimates and 95% confidence intervals of the effect on the probability of passing a later bond, by year relative to the focal election (0 to 15).]

FIGURE III
Estimates of the Effect of Bond Passage on the Probability of Passing a Later Bond, by Years since the Focal Election
Graph shows coefficients and 95% confidence intervals for the effect of measure passage in year t on the probability of passing a measure in year t + τ. The specification is the ITT regression described in equation (7). Sample includes relative years −19 through +19, excluding relative year 0 (when the effect is mechanically one).
confidence intervals, particularly at long lags. When we discount all of the estimated effects from the one-step estimator back to the date of the election, using a discount rate of 7.33% as in Barrow and Rouse (2004), the effect of authorizing a bond is to increase the present value of future spending by $5,671. This is quite similar to the size of the average bond proposal in close elections, $6,309.

VII. RESULTS

VII.A. Housing Prices

Figure V provides a graphical analysis of the impact of bond passage on log housing prices corresponding to the analyses of fiscal outcomes in Figure II. Two important patterns emerge. First, housing prices in the year before the election are positively correlated with vote shares, indicating that higher priced districts are more likely to pass bond measures with larger margins of
[Figure IV here. Plot of recursive and one-step estimates, with 95% confidence intervals, of the dynamic TOT effects of bond passage on total expenditures per pupil, by year relative to the election (0 to 15).]

FIGURE IV
Recursive and One-Step Estimates of Dynamic TOT Effects of Bond Passage on Total Expenditures per Pupil, by Years since Election
Graph shows coefficients and 95% confidence intervals for the "recursive" and "one-step" estimates of the treatment-on-the-treated (TOT) effects of measure passage at each lag on expenditures per pupil. The specifications are as in equations (11) and (12), respectively. CIs are based on standard errors clustered at the district level.
victory. Second, in districts where bond measures were approved, housing prices appear to shift upward by six or seven percentage points by the third year after the election relative to the preelection prices. There is no such shift in districts where bonds failed. Panel A of Table V presents estimates of the effects of bond passage on log housing prices.34 The first row presents the ITT analysis, using equation (7). House prices increase by 2.1% in the year of bond passage, though this is not significantly different from zero. The estimated effects rise slightly thereafter, reaching 5.8% and becoming significant three years after the election. Point estimates fade somewhat thereafter and cease to be significant. The next rows show estimates of the TOT effects from our two estimators. As expected, these are somewhat larger and are 34. We augment each of our house price specifications with controls for the average characteristics of transacted homes. In contrast to the analysis in Table IV, we allow for bond effects in the year of the election, as housing markets may respond immediately to the election outcome.
[Figure V here. Plot of mean log housing prices by the vote share relative to the threshold, in two-percentage-point bins from −10 to +10, one year before and three years after the election.]

FIGURE V
Log Housing Prices by Vote Share, One Year before and Three Years after Election
Graph shows average log housing prices by the vote share in the focal bond election. Focal elections are grouped into bins two percentage points wide: measures that passed by between 0.001% and 2% are assigned to the 1 bin; those that failed by similar margins are assigned to the −1 bin. Averages are conditional on year fixed effects, and the −1 bin is normalized to zero.
uniformly significant after year 0. The estimates indicate that the TOT effect of bond approval in year t is to increase average prices by 2.8%–3.0% that year, 3.6%–4.1% in year t + 1, 4.2%–8.6% in years t + 2 through t + 5, and 6.7%–10.1% in t + 6. Figure VI plots the coefficients and confidence intervals from the two dynamic specifications, showing estimates out to year 15. The recursive estimator shows growing effects through almost the entire period, whereas the one-step estimator yields a flatter profile. Confidence intervals are wide, particularly for the recursive estimator in later periods, and a zero effect is typically at or near the lower bound of these intervals.35 As discussed in Section IV, the TOT estimators assume that house prices are unaffected by the likelihood of a future bond 35. We have also estimated models that constrain the TOT to be constant over time. With our one-step estimator, we obtain a point estimate of 4.9% and a standard error of 1.7%.
TABLE V
THE IMPACT OF BOND PASSAGE AND BOND AMOUNTS ON LOG HOUSING PRICES: ITT AND TOT EFFECT ESTIMATES

Columns: (1) Yr of elec.; (2) 1 yr later; (3) 2 yrs later; (4) 3 yrs later; (5) 4 yrs later; (6) 5 yrs later; (7) 6 yrs later.

A. Effect of authorizing a bond
  ITT:                        (1) 0.021 (0.015); (2) 0.027 (0.017); (3) 0.036 (0.020); (4) 0.058 (0.022); (5) 0.038 (0.024); (6) 0.038 (0.027); (7) 0.047 (0.035)
  TOT, recursive estimator:   (1) 0.028 (0.017); (2) 0.041 (0.021); (3) 0.050 (0.025); (4) 0.077 (0.030); (5) 0.075 (0.035); (6) 0.086 (0.041); (7) 0.101 (0.050)
  TOT, one-step estimator:    (1) 0.030 (0.017); (2) 0.036 (0.018); (3) 0.042 (0.020); (4) 0.062 (0.021); (5) 0.052 (0.022); (6) 0.054 (0.026); (7) 0.067 (0.034)

B. Effect of authorizing $1,000 per pupil in bonds
  ITT:                        (1) 0.0047 (0.0032) [$1,105]; (2) 0.0059 (0.0037) [$1,403]; (3) 0.0078 (0.0041) [$1,853]; (4) 0.0136 (0.0050) [$3,213]; (5) 0.0083 (0.0050) [$1,959]; (6) 0.0081 (0.0056) [$1,914]; (7) 0.0104 (0.0080) [$2,460]
  TOT, recursive estimator:   (1) 0.0076 (0.0036) [$1,788]; (2) 0.0090 (0.0044) [$2,137]; (3) 0.0104 (0.0045) [$2,449]; (4) 0.0188 (0.0058) [$4,441]; (5) 0.0146 (0.0059) [$3,441]; (6) 0.0153 (0.0070) [$3,611]; (7) 0.0174 (0.0102) [$4,108]
  TOT, one-step estimator:    (1) 0.0059 (0.0033) [$1,387]; (2) 0.0076 (0.0036) [$1,796]; (3) 0.0081 (0.0036) [$1,908]; (4) 0.0120 (0.0042) [$2,841]; (5) 0.0091 (0.0041) [$2,147]; (6) 0.0095 (0.0048) [$2,246]; (7) 0.0130 (0.0068) [$3,073]

Notes. Each row represents a separate specification and reports effects of measure passage on log house prices in the year of the election (column (1)), 1 year later (column (2)), and so on. The dependent variable is the log of the block-group-population weighted average sale price of all transacted homes in the school district, measured in constant year-2000 dollars. Panel A presents estimates of the effect of authorizing a bond. The ITT estimates use a sample of all bond elections and all log house prices from years relative to the election −2 through +6. N = 7,968. The specification is equation (7), and includes bond measure fixed effects; indicators for calendar years and years relative to the bond measure; average square footage and lot size of transacted homes; the number of transactions; and interactions of the relative year fixed effects (for relative years 0 through 6) with a cubic in the vote share, an indicator for passage, and an indicator for a 55% threshold. The table reports the relative year–passage interaction coefficients. The recursive TOT estimates are based on equation (11) applied to ITT estimates, estimated as in row (1) but with all available relative years included in the sample. N = 20,070. The one-step TOT estimates are based on a conventional panel of districts-by-calendar years. The specification is equation (12). It includes calendar year effects; indicators for the presence of an election t years ago for t = 1, . . . , 18; indicators for measure approval t years ago; cubics in the vote shares of the election t years ago (if any); indicators for a 55% threshold in the election t years ago; average square footage and lot size of transacted homes; and the number of transactions. N = 10,227. Panel B reports the effect of authorizing $1,000 per pupil in bonds. All specifications are similar to Panel A, except that bond passage indicators are replaced by continuous measures of the size of the bond authorized (set to zero if no bond is authorized), with the passage indicators used as instrumental variables for the bond amounts. Numbers in square brackets represent the effect on home price levels, evaluated at the mean price in districts with close elections, $236,433. Standard errors are clustered on the school district, and bold coefficients are significant at the 5% level.
[Figure VI here. Plot of recursive, one-step, and forward-looking estimates (with 95% confidence intervals for the first two) of the dynamic TOT effects of bond passage on log house prices, by year relative to the election (−1 to 15).]

FIGURE VI
Recursive, One-Step, and Forward-Looking Estimates of Dynamic TOT Effects of Bond Passage on Log House Prices, by Years since Election
Graph shows coefficients and 95% confidence intervals for the "recursive" and "one-step" estimates of the TOT effects of measure passage at each lag on log house prices. Specifications are described in equations (11) and (12), respectively. The graph also shows recursive estimates of the forward-looking TOT effect of measure passage on log house prices, using the alternative recursion formula (14).
authorization. To relax this assumption, we estimate a modified version of the recursive estimator that allows for perfectly forward-looking prices, as described in Section IV.E. In this specification, the immediate effect of bond passage is larger and the profile in the first few years is flatter than in our myopic specification. Point estimates in years 0 through 6 are 6.1%, 6.8%, 7.4%, 9.5%, 8.8%, 9.5%, and 10.4%, respectively. These are shown as hollow circles in Figure VI. Because our expectation is that housing markets are neither fully myopic nor subject to perfect no-arbitrage conditions, we think that the true effect is likely to lie between the two sets of estimates.

VII.B. Willingness to Pay for School Facility Investments

As discussed in Section III, a substantial effect of bond passage on prices indicates that the marginal resident's WTP for school services exceeds the cost of providing those services and
therefore that school capital spending is inefficiently low. It is thus instructive to compute the WTP implied by our estimated price effects. This calculation requires assumptions about interest and discount rates, the speed with which new facilities are brought into service, property tax shares, and the income tax deductibility of property taxes and mortgage interest payments. We outline our baseline calculations here and describe the details and alternative calculations in the Online Appendix.
The average house in districts with close elections (margins of victory or defeat less than 2%) is worth $236,433, so a 3.0% effect on house prices raises the value of the average house by approximately $7,100. The average bond proposal in close elections is about $6,300 per pupil, and there are 2.4 owner-equivalent housing units per pupil. With a typical municipal bond interest rate of 4.6%, this implies a property tax increment of $163 per house per year, for a present discounted value of about $1,950. Thus, the effect of passing a bond on the total cost of owning a home in the district, combining the house price effect with the PDV of future taxes, is approximately $9,050. That homebuyers are willing to pay this implies that their WTP for $1 in per-pupil spending is about $1.44 (= 9,050/6,300).36 When we account (in the Online Appendix) for the deductibility of mortgage interest and property taxes and for the higher tax share borne by new homebuyers, we can drive the WTP estimate as low as $1.13, but never down to $1.
As Figure VI suggests, the WTP is generally higher when we measure the price effects several years after the election. WTPs based on the price effect in year 4, for example, range from $1.31 (one-step estimator, fully accounting for taxes) to $1.89 (recursive estimator, without taxes, using a discount rate of 5.24%). The sensitivity of WTP calculations to the year in which price effects are measured may indicate that capitalization is not immediate.37
36. Our comparison of the cost per home to the bond amount per pupil is appropriate if the marginal homebuyer has one school-aged child. This almost exactly matches the average number of children in owner-occupied California households in the 2000 census that moved in 1999.
37. If capitalization is immediate, a simpler WTP calculation could be based on the ITT effects of bond passage on year-0 housing prices and on the PDV of future spending. Applying this, we estimate a WTP around $1.77. But there are several drawbacks to this method, most notably that we observe a long panel of postelection spending for only the earliest referenda in our sample and that our "immediate" house price measure—average sales prices in the year of the election—may be contaminated by sales occurring before the election.
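The chain of arithmetic in the baseline calculation can be reproduced roughly as follows. The 30-year amortization horizon and the homeowner discount rate used here are illustrative assumptions; the Online Appendix describes the parameters actually used.

```python
# Back-of-the-envelope reproduction of the baseline WTP arithmetic quoted in the text.
# The 30-year bond term and the homeowner discount rate are assumptions for illustration.

avg_house_price  = 236_433   # average house in districts with close elections ($)
price_effect     = 0.030     # estimated effect of bond passage on house prices
bond_per_pupil   = 6_300     # average bond proposal in close elections ($/pupil)
houses_per_pupil = 2.4       # owner-equivalent housing units per pupil
bond_rate        = 0.046     # typical municipal bond interest rate
term             = 30        # assumed bond maturity (years)

# Capital gain from passage
price_gain = price_effect * avg_house_price                        # ~ $7,100

# Each house's share of the bond, amortized at the municipal rate
principal_per_house = bond_per_pupil / houses_per_pupil            # $2,625
annual_tax = principal_per_house * bond_rate / (1 - (1 + bond_rate) ** -term)
# ~ $163 per house per year, matching the figure in the text

# Present value of the tax stream at an assumed homeowner discount rate
discount_rate = 0.073   # assumption; chosen so the PDV lands near the ~$1,950 quoted above
pdv_taxes = annual_tax * (1 - (1 + discount_rate) ** -term) / discount_rate

total_cost = price_gain + pdv_taxes                                # ~ $9,050
wtp_per_dollar = total_cost / bond_per_pupil                       # ~ $1.44
print(round(annual_tax), round(pdv_taxes), round(total_cost), round(wtp_per_dollar, 2))
```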
However, the forward-looking price estimates indicate a WTP that is largely invariant to the year in which the price effect is measured and is around $2.
Additional specifications reported in Panel B of Table V use an alternative strategy to identify the WTP. We reestimate the ITT and TOT effects, this time using the dollar value of bonds authorized as the "treatment" variable and the indicator for bond authorization as an instrument for it.38 These estimates indicate a $1.39–$1.79 increase in house prices in the year of the election per dollar of bonds issued, with even larger estimates in later years. The implied WTP depends on assumptions about interest rates, tax shares, and income tax deductibility, but under reasonable assumptions exceeds these coefficients by around $0.31.39

VII.C. Robustness

Table VI presents a variety of alternative specifications meant to probe the robustness of the housing price results. To conserve space we report only the estimates from our one-step specification of the TOT effect of bond approval on prices four years later. Row (1) reports the baseline estimates. Rows (2)–(4) vary the vote share controls: row (2) includes only a linear control; row (3) allows for three linear segments, with kinks at 55% and 67% vote shares; and row (4) allows separate cubic vote share–outcome relationships in the [0, 55%], [55%, 67%], and [67%, 100%] ranges. None of these yields evidence contrary to our main results.
Rows (5)–(7) report estimated discontinuities at locations other than the threshold required for passage. In each of these specifications, we also allow a discontinuity at the actual threshold. In row (5), we estimate the discontinuity in our outcomes at the counterfactual threshold, 55% when ν ∗ = 2/3 and 2/3 when ν ∗ = 55%, whereas rows (6) and (7) show estimates for placebo thresholds ten percentage points above or below the true threshold.
38. To implement this, we replace b jt in equations (7) and (12) with the dollar values of the authorized bonds (set to zero if the proposal is rejected) and instrument these with b jt . The π coefficients in the recursion formula (11) are similarly redefined as the effect of authorizing $1 in bonds in year t on the expected value of the bond authorization in t + τ . Note that this incorporates any differences in the size of initial and subsequent proposals. See the Online Appendix for further details.
39. The $0.31 figure reflects a ratio of 2.4 houses per pupil and a wedge between the district's borrowing rate and residents' discount rates. See the Online Appendix for more detail. Overall, our WTP estimates are somewhat larger than, but not out of line with, the WTPs implied by estimates of the effect of unrestricted spending on house prices from Bradbury, Mayer, and Case (2001); Barrow and Rouse (2004); and Hilber and Mayer (2004).
TABLE VI
ALTERNATIVE SPECIFICATIONS FOR LOG HOUSING PRICES: ONE-STEP ESTIMATES OF TOT EFFECTS
Log housing price effects 4 yrs after election (standard errors in parentheses):
(1) Baseline (cubic in vote share): 0.052 (0.022)
A. Vote share controls
(2) Linear: 0.061 (0.021)
(3) 3-part linear: 0.048 (0.024)
(4) 3-part cubic: 0.109 (0.042)
B. Placebo thresholds
(5) Switch 55% and 67% thresholds: −0.078 (0.033)
(6) Actual threshold minus 10: 0.031 (0.033)
(7) Actual threshold plus 10: −0.017 (0.034)
C. Additional specifications
(8) Including parcel tax referenda: 0.058 (0.020)
(9) No weights and no housing controls: 0.051 (0.022)
Notes. Each cell represents a separate regression. Only the coefficients for housing prices four years after the election are shown. The baseline specification presents the estimated effect of measure passage in the fourth year after the election from the one-step specification in Panel A of Table V. Remaining cells derive from slight modifications to this sample or specification. The "linear" specification replaces the cubics in the vote share of each past election with linear controls; "3-part linear" uses linear segments in the [0, 55], [55, 66.7], and [66.7, 100] ranges; and "3-part cubic" uses separate cubic segments in each range. The "placebo thresholds" specifications in Panel B include both the actual measure passage indicators and counterfactual indicators that reflect vote shares in excess of alternative thresholds; the coefficients shown are those on the counterfactual indicators in the fourth year after the election. In Panel C, the estimate labeled "including parcel tax referenda" adds controls for the presence of a parcel tax measure on the ballot in each past year, cubics in the parcel tax vote shares, and indicators for parcel tax passage. In row (9), the dependent variable is the raw average price of houses transacted during the calendar year, without reweighting, and housing characteristic controls are excluded. All standard errors are clustered at the district level and bold coefficients are significant at the 5% level.
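To make the counterfactual indicators in Panel B concrete, here is a minimal sketch of how a placebo passage dummy can be constructed alongside the true one. The column names are hypothetical; the full specification is equation (12) with all of its lagged election controls.

```python
# Hypothetical data: each row is a bond measure with its vote share and required threshold.
import pandas as pd

df = pd.DataFrame({"vote_share": [0.54, 0.58, 0.70], "threshold": [0.55, 0.55, 2 / 3]})

# True discontinuity: did the measure clear its actual threshold?
df["passed"] = (df["vote_share"] >= df["threshold"]).astype(int)

# Placebo discontinuity, e.g. "actual threshold plus 10": nothing real changes at this cutoff,
# so its coefficient in the house price regression should be indistinguishable from zero.
df["placebo_pass"] = (df["vote_share"] >= df["threshold"] + 0.10).astype(int)
print(df)
```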
Only one of the coefficients measuring discontinuities at counterfactual thresholds is statistically significant, and it has the opposite sign from the estimated effect at the actual threshold.
Our TOT effects hold constant school bond authorizations that are subsequent to an initial authorization, but do not hold constant other forms of district responses, such as parcel taxes. If bond authorization raises the probability that other revenue increases will be approved, our calculations will overstate the WTP
for $1 in additional spending. To examine this, we add indicators for the presence of a parcel tax measure in each relative year τ and for its passage. Row (8) reports the bond passage coefficient when parcel taxes are controlled. The estimated bond effects are unchanged. The parcel tax coefficients (not reported) are statistically indistinguishable both from zero and from the bond coefficients.40
Finally, row (9) reports estimates from a specification for the raw mean of the log prices of homes that transacted, without adjusting for changes in the distribution of transactions across block groups or controlling for home characteristics. The bond effect is again similar to that obtained with our preferred price measure.

VII.D. Academic Achievement

The first two rows of Table VII report estimates of the effect of bond passage on third grade reading and mathematics scores from our one-step estimator.41 The effects are small and insignificant for the first several years. This result is expected given the time it takes to execute capital projects; the flow of academic benefits (if any) should not begin for several years. However, the point estimates are generally positive and seem to gradually trend upward, at least for the first few years. This pattern is easier to see in Figure VII, which plots the point estimates and confidence intervals from the math specification. By year six, we see large, marginally significant effects, corresponding to about one-sixth of a school-level standard deviation. Point estimates fall back to zero thereafter, and are quite imprecise. Confidence intervals include large positive effects, but we cannot reject zero effects in every year.
The year-six point estimates correspond to effects of roughly 0.067 student-level standard deviations for reading and 0.077 for mathematics. If taken literally, these imply that bond-financed improvements to existing facilities raise achievement by about one-third as much as a reduction in class sizes from 22 to 15 students (Krueger 1999).42
40. We have also estimated the effect of bond authorization on fiscal outcomes in the district's municipality. Effects on municipal revenues and on a variety of categories of spending are precisely estimated zeros.
41. Estimates from our other estimators are similar. See Cellini, Ferreira, and Rothstein (2008).
42. We find no evidence that bond passage affects teacher–pupil ratios, or that the results could be attributable to the construction of new, smaller schools. Results are available upon request.
TABLE VII
THE EFFECT OF BOND PASSAGE ON ACADEMIC ACHIEVEMENT, HOUSING MARKET TRANSACTIONS, AND HOMEBUYER AND SCHOOL DISTRICT CHARACTERISTICS: ONE-STEP ESTIMATES OF TOT EFFECTS
Columns (1)–(6) report effects 1–6 years after the election; column (7) reports N (standard errors in parentheses).
A. Academic achievement
Reading, grade 3: −0.023 (0.051); 0.058 (0.053); −0.026 (0.058); 0.039 (0.061); −0.010 (0.054); 0.103 (0.064); N = 6,660
Math, grade 3: −0.034 (0.054); 0.030 (0.062); 0.026 (0.062); 0.058 (0.069); −0.012 (0.057); 0.160 (0.075); N = 6,660
B. Housing market transactions
Sales volume: 93 (78); 207 (89); 282 (93); 213 (98); 241 (116); 325 (115); N = 10,857
Log sales volume: −0.001 (0.056); 0.041 (0.062); 0.031 (0.066); −0.005 (0.068); 0.032 (0.070); 0.039 (0.073); N = 10,857
Average square footage: −6 (13); 2 (16); 6 (17); 15 (17); 14 (16); 18 (19); N = 10,273
Average lot size: −4,159 (3,962); 3,008 (3,041); 3,980 (3,903); 10,874 (5,198); 3,455 (5,026); 6,586 (3,777); N = 10,579
TABLE VII (CONTINUED)
Columns (1)–(6) report effects 1–6 years after the election; column (7) reports N (standard errors in parentheses).
C. Homebuyer characteristics
Income: 3,245 (2,624); 1,042 (2,645); −2,384 (3,113); 4,212 (3,435); 1,486 (3,218); 450 (3,391); N = 9,921
Log income: 0.027 (0.019); 0.017 (0.021); −0.005 (0.023); 0.035 (0.025); −0.008 (0.025); 0.004 (0.024); N = 9,921
Fraction white and Asian: 0.017 (0.009); 0.000 (0.009); 0.001 (0.010); 0.004 (0.010); 0.001 (0.011); −0.008 (0.011); N = 9,921
D. School district characteristics
Log enrollment: −0.012 (0.017); −0.011 (0.020); 0.010 (0.025); 0.001 (0.039); 0.001 (0.035); −0.007 (0.042); N = 7,038
Fraction white and Asian: −0.002 (0.005); 0.004 (0.006); 0.002 (0.008); 0.005 (0.008); 0.008 (0.009); 0.004 (0.010); N = 7,035
Avg. parental education: −0.060 (0.096); −0.218 (0.170); −0.018 (0.156); −0.169 (0.218); −0.163 (0.161); 0.005 (0.149); N = 6,978
Notes. Each row represents a separate specification and reports effects of measure passage on outcomes one year later (column (1)), two years later (column (2)), and so on. All regressions use the one-step TOT specification to estimate the effect of bond passage on outcomes. The specification is equation (12) in the text. It includes calendar year effects; indicators for the presence of an election t years ago for t = 1, . . . , 18; indicators for measure approval t years ago; cubics in the vote shares of the election t years ago (if any); and indicators for a 55% threshold in the election t years ago. Variation in sample sizes reflects differences in availability of dependent variables. Kindergarten and first grade enrollments and racial shares exclude districts with grade-level enrollment below 20 in 1987. Bold coefficients are significant at the 5% level.
FIGURE VII
Recursive and One-Step Estimates of Dynamic TOT Effects of Bond Passage on Average Mathematics Test Scores, by Years since Election
[Figure: mean math scores (in school-level standard deviations) plotted against year relative to the election, from 0 to 15.] The graph shows coefficients and 95% confidence intervals for the "recursive" and "one-step" estimates of the TOT effects of measure passage at each lag on mathematics test scores. Specifications are as in equations (11) and (12), respectively. CIs are based on standard errors clustered at the district level.
But even this maximal interpretation of the test score results can explain only a small share of the full house price effects seen earlier. Previous research on school quality capitalization (see, e.g., Black [1999]; Kane, Riegg, and Staiger [2006]; and Bayer, Ferreira, and McMillan [2007]) has found that a one school-level standard deviation increase in test scores raises housing prices by between 4% and 6%. This implies that our estimated year-six effect on test scores would explain only about one-sixth of the effect of bond passage on house prices.43
43. An increase of 0.185 school-level standard deviation in test scores multiplied by an effect of 6 percentage points would yield a price increase of just 1.1 percentage point.
Third grade test scores are a limited measure of academic outcomes. School facilities improvements may have larger effects on achievement in later grades or in other subjects (e.g., science, where lab facilities may be important inputs). Nevertheless, it seems likely that a sizable portion of the hedonic value of school
facilities reflects nonacademic outputs. Parents may value new playgrounds or athletic facilities for the recreational opportunities they provide, enhanced safety from a remodeled entrance or drop-off area, and improved child health from asbestos abatement and the replacement of drafty temporary classrooms, even if these do not contribute to academic achievement. New facilities may also be aesthetically appealing. Any improvements in these dimensions of school output will lead to housing price effects that exceed those reflected in test scores. The potential relevance of these channels underscores the importance of using housing markets to value school investments.

VII.E. Household Sorting

Recent empirical studies of the capitalization of school quality emphasize the importance of social multiplier effects deriving from preferences for wealthy neighbors (see, e.g., Bayer, Ferreira, and McMillan [2004]). If wealthy families have higher WTP for school output, passage of a bond may lead to increases in the income of in-migrants to the district, generating follow-on increases in the desirability of the district, in house prices, and in test scores.
In Panel B of Table VII we report dynamic RD estimates for the impact of bond approval on sales volumes. Volumes would be expected to rise if passage leads to changes in the sort of families that prefer the school district. The estimates show that sales volumes increase by 200–300 units per year. An analysis of log volumes indicates about a 3% increase in sales, though this is not statistically significant.44 The next two rows show estimated effects on the average size of transacted homes and lots. The estimated effects on home size are precisely estimated zeros. Those for lot size—which is far more heterogeneous—are less precise but offer no indication of systematic effects.
The remainder of Table VII examines effects on population composition directly. In Panel C, we report effects on the characteristics of new homebuyers. We find no distinguishable effect on average income or on racial composition. Panel D reports effects on the student population. We find no impact on enrollment, racial composition, or average parental education.45
44. Sales volume effects could represent either an increase in the local supply of homes or an increased turnover rate of existing homes. Yearly data on housing construction are unavailable, so these cannot be disentangled.
45. We have also looked at effects on enrollment in early grades, where composition effects may appear first. We find no change in kindergarten or first grade enrollment. We do find a small, permanent increase in the fraction of white and Asian students in kindergarten, though not in first grade even several years later. One potential explanation is that some families switched from private to public kindergartens after bond passage; some bond proposals specify building additional classrooms to permit conversion from half-day to full-day kindergarten.
Because we have only limited data on population changes, there may be sorting on characteristics that we do not measure (e.g., tastes for education or the presence of children). Even so, sorting is not likely to account for our full price effect. The literature indicates that social multiplier effects on house prices could be as large as 75% of the direct effect of school quality (Bayer, Ferreira, and McMillan 2004). This would imply that at most 2.5 percentage points of the estimated 6% price effect in year 3 could be due to sorting, still leaving a large portion that must be attributed to increased school output.

VIII. CONCLUSIONS

Infrastructure investments have been and will remain important components of government budgets, yet we have few tools to assess their effectiveness. In this paper we use a "dynamic" regression discontinuity design to estimate the value of school facility investments to parents and homeowners. We identify the effects of capital investments on housing prices by comparing districts in which school bond referenda passed or failed by narrow margins. Unlike districts where bond referenda garnered overwhelming voter support or opposition, districts with close votes are likely to be similar to each other in both observable and unobservable characteristics.
Our analysis is complicated by the tendency for districts where proposed bonds are rejected to propose and pass additional measures in future years and by the likely importance of dynamics in the treatment effects. We propose two new "dynamic RD" estimators that accommodate these complexities, bringing the identification power of a traditional RD design into a dynamic panel data context. These estimators are likely to prove useful in other experimental and quasi-experimental settings where there are multiple opportunities for treatment and where the treatment effect dynamics are of interest. In RD settings, the methods require that each treatment opportunity be characterized by a discontinuity in treatment probability as a running variable exceeds a threshold. Repeated referenda are an ideal example: our methods
can easily be used to assess the causal effects of other policies decided by elections.
Turning back to our substantive application, our primary analyses are of the impact of passing a bond on house prices. We find treatment effects of 6% or more, and implied valuations of $1.50 or more for $1 in school capital spending. As theory predicts, most of the price effect appears well in advance of the completion of the funded projects. We find some evidence of effects on student achievement several years after bond passage, but no sign of effects on the racial composition or average incomes of district residents. The home price effects presumably reflect the anticipation of increased school output, though it appears that much of the effect derives from dimensions of output (such as safety or aesthetics) that are not captured by test scores.
Our results provide clear evidence that California districts at the margin of passing a bond are spending well below the economically efficient level, with returns to additional spending far in excess of the cost. Evidently, the referendum process erects too large a barrier to the issuance of bonds and prevents many worthwhile projects. As Hoxby (2001) argues, a loosening of California's constraints on local spending would yield substantial economic benefits. More generally, our results suggest that well-targeted funds for school construction may raise social welfare, particularly in states and localities with low levels of capital investment and highly centralized systems of school finance.

THE TRACHTENBERG SCHOOL OF PUBLIC POLICY AND PUBLIC ADMINISTRATION, GEORGE WASHINGTON UNIVERSITY
THE WHARTON SCHOOL, UNIVERSITY OF PENNSYLVANIA, AND NATIONAL BUREAU OF ECONOMIC RESEARCH
GOLDMAN SCHOOL OF PUBLIC POLICY, UNIVERSITY OF CALIFORNIA, BERKELEY, AND NATIONAL BUREAU OF ECONOMIC RESEARCH
REFERENCES Angrist, Joshua, and Victor Lavy, “New Evidence on Classroom Computers and Pupil Learning,” The Economic Journal, 122 (2002), 735–765. Aschauer, David Alan, “Is Public Expenditure Productive?” Journal of Monetary Economics, 23 (1989), 177–200. Balsdon, Ed, Eric J. Brunner, and Kim Rueben, “Private Demands for Public Capital: Evidence from School Bond Referenda,” Journal of Urban Economics, 54 (2003), 610–638. Barrow, Lisa, and Cecilia Rouse, “Using Market Valuation to Assess Public School Spending,” Journal of Public Economics, 88 (2004), 1747–1769. Bayer, Patrick, Fernando Ferreira, and Robert McMillan, “Tiebout Sorting, Social Multipliers, and the Demand for School Quality,” NBER Working Paper No. w10871, 2004.
——, “A Unified Framework for Measuring Preferences for Schools and Neighborhoods,” Journal of Political Economy, 115 (2007), 588–638. “Births and Immigration Squeeze California Classroom Space,” New York Times, October, 8, 1989, Section 1, 24. Black, Sandra E., “Do Better Schools Matter? Parental Valuation of Elementary Education,” Quarterly Journal of Economics, 114 (1999), 577–599. Bradbury, Katharine, Christopher Mayer, and Karl Case, “Property Tax Limits, Local Fiscal Behavior, and Property Values: Evidence from Massachusetts under Proposition 2 1/2,” Journal of Public Economics, 80 (2001), 287– 311. Bradford, David, and Wallace Oates, “The Analysis of Revenue Sharing in a New Approach to Collective Fiscal Decisions,” Quarterly Journal of Economics, 85 (1971), 416–439. Brueckner, Jan K., “Property Values, Local Public Expenditure and Economic Efficiency,” Journal of Public Economics, 11 (1979), 223–245. Brunner, Eric J., and Jennifer Imazeki, “Fiscal Stress and Voluntary Contributions to Public Schools,” in Developments in School Finance, 2004: Fiscal Proceedings from the Annual State Data Conference of July 2004, W.J. Fowler Jr., ed. (Washington, DC: U.S. Department of Education, National Center for Education Statistics/Government Printing Office, 2005). Brunner, Eric J., and Kim Rueben, “Financing New School Construction and Modernization: Evidence from California,” National Tax Journal, 54 (2001), 527– 539. Brunner, Eric J., and Jon Sonstelie, “Coping with Serrano: Voluntary Contributions to California’s Local Public Schools,” in Proceedings of the Eighty-Ninth Annual Conference on Taxation, Boston, Massachusetts, November 10–12, 1996 (Columbus, OH: National Tax Association, 1997). Buckley, Jack, Mark Schneider, and Yi Shang, “Fix It and They Might Stay: School Facility Quality and Teacher Retention in Washington, DC,” Teachers College Record, 107 (2005), 1107–1123. Card, David, and Dean R. Hyslop, “Estimating the Effects of a Time-Limited Earnings Subsidy for Welfare-Leavers,” Econometrica, 73 (2005), 1723–1770. Card, David, and Alan B. Krueger, “School Resources and Student Outcomes: An Overview of the Literature and New Evidence from North and South Carolina,” Journal of Economic Perspectives, 10 (1996), 31–50. Cellini, Stephanie Riegg, “Crowded Colleges and College Crowd-Out: The Impact of Public Subsidies on the Two-Year College Market,” American Economic Journal: Economic Policy, 1 (2009), 1–30. Cellini, Stephanie Riegg, Fernando Ferreira, and Jesse Rothstein, “The Value of School Facilities: Evidence from a Dynamic Regression Discontinuity Design,” NBER Working Paper No. w14516, 2008. Council of Economic Advisers, Economic Report of the President (Washington, DC: Government Printing Office, 2009). DiNardo, John, and David S. Lee, “Economic Impacts of New Unionization on Private Sector Employers: 1984–2001,” Quarterly Journal of Economics, 119 (2004), 1383–1441. Earthman, Glen I., “School Facility Conditions and Student Academic Achievement,” UCLA’s Institute for Democracy, Education, and Access (IDEA) Paper No. wws-rr008-1002, 2002. Ferreira, Fernando, “You Can Take It with You: Proposition 13 Tax Benefits, Residential Mobility, and Willingness to Pay for Housing Amenities,” U.S. Census Bureau Center for Economic Studies Working Paper No. 08-15, 2008. Ferreira, Fernando, and Joseph Gyourko, “Do Political Parties Matter? Evidence from U.S. Cities,” Quarterly Journal of Economics, 124 (2009), 399–402. 
Goolsbee, Austan, and Jonathan Guryan, “The Impact of Internet Subsidies in Public Schools,” Review of Economics and Statistics, 88 (2006), 336–347. Gramlich, Edward M., “Infrastructure Investment: A Review Essay,” Journal of Economic Literature, 32 (1994), 1176–1196. Hahn, Jinyong, Petra Todd, and Wilbert Van der Klaauw, “Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design,” Econometrica, 69 (2001), 201–209.
Ham, John C., and Robert J. LaLonde, “The Effect of Sample Selection and Initial Conditions in Duration Models: Evidence from Experimental Data on Training,” Econometrica, 64 (1996), 175–205. Hanushek, Eric A., “School Resources and Student Performance,” in Does Money Matter? The Effect of School Resources on Student Achievement and Adult Success, Gary Burtless, ed. (Washington, DC: Brookings Institution, 1996). Hilber, Christian A. L., and Christopher J. Mayer, “School Funding Equalization and Residential Location for the Young and the Elderly,” in Brookings– Wharton Papers on Urban Affairs 2004, William G. Gale and Janet R. Pack, eds. (Washington, DC: Brookings Institution, 2004). Hoxby, Caroline M., “All School Finance Equalizations Are Not Created Equal,” Quarterly Journal of Economics, 116 (2001), 1189–1231. Imbens, Guido W., and Joshua D. Angrist, “Identification and Estimation of Local Average Treatment Effects,” Econometrica, 62 (1994), 467–475. Imbens, Guido W., and Thomas Lemieux, “Regression Discontinuity Designs: A Guide to Practice,” Journal of Econometrics, 142 (2008), 615–635. Institute for Social Research, California Elections Data Archive (http://www.csus .edu/calst/cal studies/CEDA.html, 2006). Jones, John T., and Ron W. Zimmer, “Examining the Impact of Capital on Academic Achievement,” Economics of Education Review, 20 (2001), 577–588. Kane, Thomas J., Stephanie K. Riegg, and Douglas O. Staiger, “School Quality, Neighborhoods, and Housing Prices,” American Law and Economics Review, 8 (2006), 183–212. Krueger, Alan B., “Experimental Estimates of Education Production Functions,” Quarterly Journal of Economics, 114 (1999), 497–532. Lee, David S., “Randomized Experiments from Non-random Selection in U.S. House Elections,” Journal of Econometrics, 142 (2008), 675–697. Lee, David S., and Thomas Lemieux, “Regression Discontinuity Designs in Economics,” NBER Working Paper No. w14723, 2009. Lee, David S., and Alexandre Mas, “Long-Run Impacts of Unions on Firms: New Evidence from Financial Markets, 1961–1999,” NBER Working Paper No. w14709, 2009. Lee, David S., Enrico Moretti, and Matthew J. Butler, “Do Voters Affect or Elect Policies? Evidence from the U.S. House,” Quarterly Journal of Economics, 119 (2004), 807–859. Martorell, Paco, “Does Failing a High School Graduation Exam Matter?” RAND, Mimeo, 2005. Matsusaka, John G., “Fiscal Effects of the Voter Initiative: Evidence from the Last 30 Years,” Journal of Political Economy, 103 (1995), 587–623. McCrary, Justin, “Manipulation of the Running Variable in the Regression Discontinuity Design: A Density Test,” Journal of Econometrics, 142 (2008), 698–714. Mendell, Mark J., and Garvin A. Heath, “Do Indoor Environments in Schools Influence Student Performance? A Critical Review of the Literature,” Indoor Air, 15 (2004), 27–52. Munnell, Alicia H., “Infrastructure Investment and Economic Growth,” Journal of Economic Perspectives, 6 (1992), 189–198. Oates, Wallace E., “The Effects of Property Taxes and Local Public Spending on Property Values: An Empirical Study of Tax Capitalization and the Tiebout Hypothesis,” Journal of Political Economy, 77 (1969), 957–971. Orrick, Herrington & Sutcliffe, LLP, School Finance Bulletin (San Francisco, CA: Public Finance Department of Orrick, Herrington, & Sutcliffe, http://www.orrick.com/fileupload/259.pdf, 2004). Pereira, Alfredo M., and Rafael Flores de Frutos, “Public Capital Accumulation and Private Sector Performance,” Journal of Urban Economics, 46 (1999), 300–322. 
Pettersson-Lidbom, Per, "Do Parties Matter for Economic Outcomes? A Regression-Discontinuity Approach," Journal of the European Economic Association, 6 (2008), 1037–1056. Romer, Thomas, and Howard Rosenthal, "Bureaucrats versus Voters: On the Political Economy of Resource Allocation by Direct Democracy," Quarterly Journal of Economics, 93 (1979), 563–587.
Samuelson, Paul A., “The Pure Theory of Public Expenditure,” Review of Economics and Statistics, 36 (1954), 387–389. Schneider, Mark, “Do School Facilities Affect Academic Outcomes?” National Clearinghouse for Educational Facilities, 2002. Sebastian, Simone, “Schools Measure Proposed,” San Francisco Chronicle, March 8, 2006, p. B-1. Skiba, Paige M., and Jeremy Tobacman, “Do Payday Loans Cause Bankruptcy?” University of Pennsylvania, Wharton School of Business, Mimeo, 2008. Sonstelie, Jon, Eric Brunner, and Kenneth Ardon, For Better or for Worse? School Finance Reform in California (San Francisco: Public Policy Institute of California, 2000). Tiebout, Charles, “A Pure Theory of Local Public Expenditures,” Journal of Political Economy, 64 (1956), 416–424. U.S. Department of Education, Digest of Education Statistics 1998 (Washington, DC: National Center for Education Statistics, 1998). ——, Digest of Education Statistics 2007 (Washington, DC: National Center for Education Statistics, 2007).
WHAT’S ADVERTISING CONTENT WORTH? EVIDENCE FROM A CONSUMER CREDIT MARKETING FIELD EXPERIMENT∗ MARIANNE BERTRAND DEAN KARLAN SENDHIL MULLAINATHAN ELDAR SHAFIR JONATHAN ZINMAN Firms spend billions of dollars developing advertising content, yet there is little field evidence on how much or how it affects demand. We analyze a direct mail field experiment in South Africa implemented by a consumer lender that randomized advertising content, loan price, and loan offer deadlines simultaneously. We find that advertising content significantly affects demand. Although it was difficult to predict ex ante which specific advertising features would matter most in this context, the features that do matter have large effects. Showing fewer example loans, not suggesting a particular use for the loan, or including a photo of an attractive woman increases loan demand by about as much as a 25% reduction in the interest rate. The evidence also suggests that advertising content persuades by appealing “peripherally” to intuition rather than reason. Although the advertising content effects point to an important role for persuasion and related psychology, our deadline results do not support the psychological prediction that shorter deadlines may help overcome time-management problems; instead, demand strongly increases with longer deadlines.
I. INTRODUCTION Firms spend billions of dollars each year on advertising consumer products to influence demand. Economic theories emphasize the informational content of advertising: Stigler (1987, p. 243), for example, writes that “advertising may be defined as the provision of information about the availability and quality of a commodity.” But advertisers also spend resources trying to ∗ Previous title: “What’s Psychology Worth? A Field Experiment in the Consumer Credit Market.” Thanks to Rebecca Lowry, Karen Lyons, and Thomas Wang for providing superb research assistance. Also, thanks to many seminar participants and referees for comments. We are especially grateful to David Card, Stefano DellaVigna, Larry Katz, and Richard Thaler for their advice and comments. Thanks to the National Science Foundation, the Bill and Melinda Gates Foundation, and USAID/BASIS for funding. Much of this paper was completed while Zinman was at the Federal Reserve Bank of New York (FRBNY); he thanks the FRBNY for research support. Views expressed are those of the authors and do not necessarily represent those of the funders, the Federal Reserve System, or the Federal Reserve Bank of New York. Special thanks to the Lender for generously providing us with the data from its experiment.
persuade consumers with “creative” content that does not appear to be informative in the Stiglerian sense.1 Although laboratory studies in marketing have shown that noninformative content may affect demand, and sophisticated firms use randomized experiments to optimize their advertising content strategy (Stone and Jacobs 2001; Day 2003; Agarwal and Ambrose 2007), academic researchers have rarely used field experiments to study advertising content effects.2 Chandy et al. (2001) review evidence of advertising effects on consumer behavior and find that “research to date can be broadly classified into two streams: laboratory studies of the effects of ad cues on cognition, affect, or intentions and econometric observational field studies of the effects of advertising intensity on purchase behavior . . . each has focused on different variables and operated largely in isolation of the other” (p. 399).3 Thus, although we know that attempts to persuade consumers with noninformative advertising are common, we know little about how, and how much, such advertising influences consumer choice in natural settings. In this paper, we use a large-scale direct-mail field experiment to study the effects of advertising content on real decisions, involving nonnegligible sums, among experienced decision makers. A consumer lender in South Africa randomized advertising content and the interest rate in actual offers to 53,000 former clients (Figures I and II show example mailers).4 The variation in advertising content comes from eight “features” that varied the presentation of the loan offer. We worked together with the lender to create six features relevant to the extensive literature (primarily from laboratory experiments in psychology and decision sciences) on how “frames” and “cues” may affect choices. Specifically, 1. For example, see Mullainathan, Schwartzstein, and Shleifer (2008) for evidence on the prevalence of persuasive content in mutual fund advertisements. 2. Levitt and List (2007) discuss the importance of validating laboratory findings in the field. 3. Bagwell’s (2007) extensive review of the economics of advertising covers both laboratory and field studies and cites only one randomized field experiment (Krishnamurthi and Raj 1985); only 5 of the 232 empirical papers cited in Bagwell’s review address advertising content effects. DellaVigna (2009) reviews field studies in psychology and economics and does not cite any studies on advertising other than an earlier version of this paper. Simester (2004) laments the “striking absence” of randomized field experimentation in the marketing literature. For some exceptions see, for example, Ganzach and Karsahi (1995) and Anderson and Simester (2008), and the literature on direct mail charitable fundraising (e.g., List and Lucking-Reiley [2002]). Several other articles in the marketing literature call for greater reliance on field studies more generally: Stewart (1992), Wells (1993), Cook and Kover (1997), and Winer (1999). 4. The Online Appendix contains additional example mailers.
WHAT’S ADVERTISING CONTENT WORTH?
265
FIGURE I Example Letter 1
mailers varied in whether they included a person’s photograph on the letter, suggestions for how to use the loan proceeds, a large or small table of example loans, information about the interest rate as well as the monthly payments, a comparison to competitors’ interest rates, and mention of a promotional raffle for a cell
FIGURE II Example Letter 2
phone. Mailers also included two features that were the lender’s choice, rather than motivated by a body of psychological evidence: reference to the interest rate as “special” or “low,” and mention of speaking the local language. Our research design enables us to
WHAT’S ADVERTISING CONTENT WORTH?
267
estimate demand sensitivity to advertising content and to compare it directly to price sensitivity.5 An additional randomization of the offer expiration date also allows us to study demand sensitivity to deadlines. Our interest in deadline effects is motivated by the fact that firms often promote time-limited offers and by the theoretically ambiguous effect of such time limits on demand. Under neoclassical models, shorter deadlines should reduce demand, because longer deadlines provide more option value; in contrast, some behavioral models and findings suggest that shorter deadlines will increase demand by overcoming limited attention or procrastination. Our analysis uncovers four main findings. First, we ask whether advertising content affects demand. We use joint F-tests across all eight content randomizations and find significant effects on loan takeup (the extensive margin) but not on loan amount (the intensive margin). We do not find any evidence that the extensive margin demand increase is driven by reductions in the likelihood of borrowing from other lenders, nor do we find evidence of adverse selection on the demand response to advertising content: repayment default is not significantly correlated with advertising content. This first finding suggests that traditional demand estimation, which focuses solely on price and ignores advertising content, may produce unstable estimates of demand. Second, we ask how much advertising content affects demand, relative to price. As one would expect, demand is significantly decreasing in price; for example, each 100–basis point (13%) reduction in the interest rate increases loan takeup by 0.3 percentage points (4%). The statistically significant advertising content effects are large relative to this price effect. Showing one example of a possible loan (instead of four example loans) has the same 5. The existing field evidence on the effects of framing and cues does not simultaneously vary price. A large marketing literature using conjoint analysis does this comparison, but is essentially focused on hypothetical choices with no consumption consequences for the respondents; see Krieger, Green, and Wind (2004) for an overview of this literature. In a typical conjoint analysis, respondents are shown or described a set of alternative products and asked to rate, rank or select products from that set. Conjoint analysis is widely applied in marketing to develop and position new products and help with the pricing of products. As discussed in Rao (2008, p. 34), “an issue in the data collection in conjoint studies is whether respondents experience strong incentives to expend their cognitive resources (or devote adequate time and effort) in providing responses (ratings or choices) to hypothetical stimuli presented as profiles or in choice sets.” Some recent conjoint analyses have tried to develop more incentive-aligned elicitation methods that provide better estimates of true consumer preferences; see, for example, Ding, Grewal, and Liechty (2005).
estimated effect as a 200–basis point reduction in the interest rate. This finding of a strong positive effect on demand of displaying fewer example loans provides novel evidence consistent with the hypothesis that presenting consumers with larger menus can trigger choice avoidance and/or deliberation that makes the advertised product less appealing. We also find that showing a female photo, or not suggesting a particular use for the loan, increases demand by about as much as a 200–basis point reduction in the interest rate. Third, we provide suggestive evidence on the channels through which persuasive advertising content operates. We classify our content treatments into those that aim to trigger “peripheral” or “intuitive” responses (effortless, quick, and associative) along the lines of Kahneman’s (2003) System I, and those that aim to trigger more “deliberative” responses (effortful, conscious, and reasoned) along the lines of Kahneman’s (2003) System II. The System II content does not have jointly significant effects on takeup. The System I content does have jointly significant effects on loan takeup. Hence, in our context at least, advertising content appears to be more effective when it aims to trigger an intuitive rather than a deliberative response. However, because the classification of some of our treatments into System I or System II is open to debate, we view this evidence as more suggestive than definitive. Finally, we report the effects of deadlines on demand. In contrast with the view that shorter deadlines help overcome limited attention or procrastination, we do not find any evidence that shorter deadlines increase demand; rather, we find that demand increases dramatically as deadlines randomly increase from two to six weeks. Nor do we find that shorter deadlines increase the probability of applying early, or that they increase the probability of applying after the deadline. So although our advertising content results point to an important role for persuasion and related psychology, our deadline results tell another story. The option value of longer deadlines seems to dominate in our setting: there is no evidence that shorter deadlines spur action by providing salience or commitment to overcome procrastination. Overall, our results suggest that seemingly noninformative advertising may play a large role in real consumer decisions. Moreover, insights from controlled laboratory experiments in psychology and decision sciences on how frames and cues affect choice can
WHAT’S ADVERTISING CONTENT WORTH?
269
be leveraged to guide the design of effective advertising content. It is sobering, though, that we had only modest success predicting (based on the prior evidence) which specific content features would significantly impact demand. One interpretation of this failure is that we lacked the statistical power to identify anything other than large effects of any single content treatment, but it is also likely that some of the findings generated in other contexts did not carry over to ours. This fits with a central premise of psychology—that context matters—and suggests that pinning down which effects matter most in particular market settings will require systematic field experimentation.
The paper proceeds as follows. Section II describes the market and our cooperating lender. Section III details the experimental and empirical strategies. Section IV provides a conceptual framework for interpreting the results. Section V presents the empirical results. Section VI concludes.

II. THE MARKET SETTING

Our cooperating consumer lender (the "Lender") had operated for over twenty years as one of the largest, most profitable lenders in South Africa.6 The Lender competed in a "cash loan" market segment that offers small, high-interest, short-term, uncollateralized credit with fixed monthly repayment schedules to the working poor population. Aggregate outstanding loans in the cash loan market segment equal about 38% of nonmortgage consumer debt.7 Estimates of the proportion of the South African working-age population currently borrowing in the cash loan market range from below 5% to around 10%.8
Cash loan borrowers generally lack the credit history and/or collateralizable wealth needed to borrow from traditional institutional sources such as commercial banks. Data on how borrowers use the loans are scarce, because lenders usually follow the "no questions asked" policy common to consumption loan markets. The available data suggest a range of consumption smoothing
6. The Lender was merged into a bank holding company in 2005 and no longer exists as a distinct entity.
7. Cash loan disbursements totaled approximately 2.6% of all household consumption and 4% of all household debt outstanding in 2005. (Sources: reports by the Department of Trade and Industry, Micro Finance Regulatory Council, and South African Reserve Bank.)
8. Sources: reports by Finscope South Africa and the Micro Finance Regulatory Council.
and investment uses, including food, clothing, transportation, education, housing, and paying off other debt.9
Cash loan sizes tend to be small relative to the fixed costs of underwriting and monitoring them, but substantial relative to a typical borrower's income. For example, the Lender's median loan size of 1,000 rand (about $150) was 32% of its median borrower's gross monthly income. Cash lenders focusing on the highest-risk market segment typically make one-month maturity loans at 30% interest per month. Informal sector moneylenders charge 30%–100% per month. Lenders targeting lower-risk segments charge as little as 3% per month, and offer longer maturities (twelve months or more).10
Our cooperating Lender's product offerings were somewhat differentiated from those of competitors. It had a "medium-maturity" product niche, with a 90% concentration of four-month loans, and longer loan terms of six, twelve, and eighteen months available to long-term clients with good repayment records. Most other cash lenders focus on one-month or twelve-plus-month loans. The Lender's standard four-month rates, absent this experiment, ranged from 7.75% to 11.75% per month depending on assessed credit risk, with 75% of clients in the high-risk (11.75%) category. These are "add-on" rates, where interest is charged up front over the original principal balance, rather than over the declining balance. The implied annual percentage rate (APR) of the modal loan is about 200%.
The Lender did not pursue collection or collateralization strategies such as direct debit from paychecks, or physically keeping bank books and ATM cards of clients, as is the policy of some other lenders in this market. The Lender's pricing was transparent, with no surcharges, application fees, or insurance premiums.
Per standard practice in the cash loan market, the Lender's underwriting and transactions were almost always conducted in person, in one of over 100 branches. Its risk assessment technology combined centralized credit scoring with decentralized loan officer discretion. Rejection was common for new applicants (50%) but less so for clients who had repaid successfully in the past (14%).
9. Sources: data from this experiment (survey administered to a sample of borrowers following finalization of the loan contract); household survey data from other studies on different samples of cash loan market borrowers (FinScope 2004; Karlan and Zinman forthcoming a).
10. There is essentially no difference between these nominal rates and corresponding real rates. For instance, South African inflation was 10.2% per year from March 2002 to March 2003 and 0.4% per year from March 2003 to March 2004.
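To see how the "add-on" pricing described above maps into the roughly 200% APR quoted for the modal loan, consider a stylized four-month, high-risk loan. The APR convention used below (internal monthly rate times twelve) is an assumption made for illustration.

```python
# Worked example: a four-month loan of 1,000 rand at an 11.75% monthly add-on rate,
# with interest charged up front on the original principal.

principal, addon_rate, months = 1000.0, 0.1175, 4
total_interest = addon_rate * principal * months          # 470 rand, charged up front
payment = (principal + total_interest) / months           # 367.50 rand per month

# Solve for the monthly rate on a declining-balance basis by bisection.
def pv(rate):
    return sum(payment / (1 + rate) ** t for t in range(1, months + 1))

lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if pv(mid) > principal else (lo, mid)

monthly_irr = (lo + hi) / 2
print(round(monthly_irr, 4), round(12 * monthly_irr, 2))   # ~0.175 per month, i.e., an APR around 210%
```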
WHAT’S ADVERTISING CONTENT WORTH?
271
Reasons for rejection include inability to document steady wage employment, suspicion of fraud, credit rating, and excessive debt burden. Borrowers had several incentives to repay despite facing high interest rates. Carrots included decreasing prices and increasing future loan sizes following good repayment behavior. Sticks included reporting to credit bureaus, frequent phone calls from collection agents, court summons, and wage garnishments. Repeat borrowers had default rates of about 15%, and first-time borrowers defaulted twice as often.
Policymakers and regulators in South Africa encouraged the development of the cash loan market as a less expensive substitute for traditional "informal sector" moneylenders. Since deregulation of the usury ceiling in 1992, cash lenders have been regulated by the Micro Finance Regulatory Council. The regulation requires that monthly repayment not exceed a certain proportion of monthly income, but no interest rate ceilings existed at the time of this experiment.

III. EXPERIMENTAL DESIGN, IMPLEMENTATION, AND EMPIRICAL STRATEGY

III.A. Overview

We identify and price the effects of advertising content and deadlines, using randomly and independently assigned variation in the description and price of loan offers presented in direct mailers. The Lender sent direct mail solicitations to 53,194 former clients offering each a new loan, at a randomly assigned interest rate, with a randomly assigned deadline for taking up the offer. The offers were presented with randomly assigned variations on eight advertising content "features" detailed below and summarized in Table I.

III.B. Sample Frame Characteristics

The sample frame consisted entirely of experienced clients. Each of the 53,194 solicited clients had borrowed from the Lender within 24 months of the mailing date, but not within the previous 6 months.11 The mean (median) number of prior loans from the
11. This sample is slightly smaller than the samples analyzed in two companion papers because a subset of mailers did not include the advertising content treatments. See Appendix 1 of Karlan and Zinman (2008) for details.
TABLE I
EXPERIMENTAL SUMMARY
For each randomized feature, the table lists the creative content and its hypothesized effect on demand, the treatment values and their frequencies, and the eligible sample.

Features 1–3: System I (intuitive processing) treatments
Feature 1: Photo. Sample: all; photo race and gender assigned conditional on the client's race to produce the targeted ratio of client–photo matches. Hypothesized effects: a female photo increases demand through an affective response; a race or gender match increases demand through affinity/similarity. Treatment values: no photo (frequency 0.20); black photo (0.48); non-black photo: Indian (0.13), white (0.12), or colored (0.07); female photo (0.53) or male photo (0.27); photo race matched or mismatched to the client's race; photo gender matched or mismatched to the client's gender.
Feature 2: Number of example loans. Sample: all; only low- and medium-risk clients were eligible for the four-amount, three-maturity table. Hypothesized effect: showing one loan increases demand by simplifying choice and avoiding the "choice overload" problem. Treatment values: one loan amount shown in the example table (for low- and medium-risk clients; for high-risk clients); four loan amounts shown in the example table; four loan amounts, one maturity (high-risk clients); four loan amounts, one maturity (low/medium-risk clients); four loan amounts, three maturities (low/medium-risk clients).
Feature 3: Interest rate shown in example(s)? Sample: all. Hypothesized effect indeterminate: several potentially counteracting channels (see Section III.F of the text for details). Treatment values: interest rate shown (and monthly payments); interest rate not shown (just monthly payments).

Features 4–6: System II (deliberative processing) treatments
Feature 4: Suggested loan uses. Sample: all. Hypothesized effect: no suggested use maximizes demand, because suggesting particular uses triggers deliberation and reinforces the status quo (not borrowing). Treatment values: "You can use this loan for anything you want" (0.20); "You can use this loan to X, or for anything else you want," where X is pay off a more expensive debt (0.20), buy an appliance (0.20), pay for school (0.20), or repair your home (0.20).
Feature 5: Comparison to outside rate. Sample: all. Hypothesized effects: a comparison increases demand by inducing choice of the dominating (Lender's) option; a loss frame increases demand by triggering loss aversion. Treatment values: no comparison to competitor rates; comparison presented with a gain frame; comparison presented with a loss frame.
Feature 6: Cell phone raffle. Sample: all. Hypothesized effect indeterminate: mentioning the raffle increases demand if clients overestimate small probabilities, but decreases it if choice is reason-based and an irrelevant good cannot be justified. Treatment values: cell phone raffle mentioned; cell phone raffle not mentioned.

Features 7 and 8: Lender-imposed treatments
Feature 7: Client's language. Sample: eligible if non-English primary language (0.44 of the full sample). Treatment values: no mention of language; "We speak [client's language]."
Feature 8: "A 'special' or 'low' rate for you." Sample: all. Treatment values: interest rate labeled as "special" or "low"; no mention of "special" or "low."

Other treatments
Interest rate: monthly rates randomly assigned from a smooth distribution, conditional on risk: high risk [3.25, 11.75]; medium risk [3.25, 9.75]; low risk [3.25, 7.75].
Deadline: short (approx. 2 weeks); short with an option to extend 2 weeks by calling in; medium (approx. 4 weeks); long (approx. 6 weeks). All of the sample was eligible for the medium deadline; 0.79 of the sample was eligible for the long deadline (certain branches excluded by the Lender); and 0.14 was eligible for the short deadlines (certain branches excluded by the Lender, and all P.O. boxes excluded).
TABLE II
SUMMARY STATISTICS (MEANS OR PROPORTIONS, WITH STANDARD DEVIATIONS IN PARENTHESES)

                                                   Full sample       Obtained a loan     Did not obtain a loan
Applied before deadline                            0.085             1                   0.01
Obtained a loan before deadline                    0.074             1                   0
Loan amount in rand                                110 (536)         1,489 (1,351)       0 (0)
Loan in default                                    —                 0.12                —
Got outside loan and did not apply with Lender     0.22              0.00                0.24
Maturity = 4 months                                —                 0.81                —
Offer rate                                         7.93              7.23                7.98
Last loan amount in rand                           1,118 (829)       1,158 (835)         1,115 (828)
Last maturity = 4 months                           0.93              0.91                0.93
Low risk                                           0.14              0.30                0.12
Medium risk                                        0.10              0.21                0.10
High risk                                          0.76              0.50                0.78
Female                                             0.48              0.49                0.48
Predicted education (years)                        6.85 (3.25)       7.08 (3.30)         6.83 (3.25)
Number previous loans with Lender                  4.14 (3.77)       4.71 (4.09)         4.10 (3.74)
Months since most recent loan with Lender          10.4 (6.80)       6.19 (5.81)         10.8 (6.76)
Race = African                                     0.85              0.85                0.85
Race = Indian                                      0.03              0.03                0.03
Race = White                                       0.08              0.08                0.08
Race = Mixed ("Colored")                           0.03              0.04                0.03
Gross monthly income in rand                       3,416 (19,657)    3,424 (2,134)       3,416 (20,420)
Number of observations                             53,194            3,944               49,250
Lender was four (three). The mean and median time elapsed since the most recent loan from the Lender was 10 months. Table II presents additional descriptive statistics on the sample frame. These clients had received mail and advertising solicitations from the Lender in the past.12 The Lender sent monthly statements to clients and periodic reminder letters to former clients who had not borrowed recently. But prior to our experiment none 12. Mail delivery is generally reliable and quick in South Africa. Two percent of the mailers in our sample frame were returned as undeliverable.
of the solicitations had varied interest rates, systematically varied advertising content, or included any of the content or deadline features we tested other than the cell phone raffle.

III.C. Identification and Power

We estimate the impact of advertising content on client choice using empirical tests of the form

(1)    Y_i = f(r_i, c_i1, c_i2, . . . , c_i13, d_i, X_i),

where Y is a measure of client i's loan demand or repayment behavior, r is the client's randomly assigned interest rate, and c_1, . . . , c_13 are categorical variables in the vector C_i of randomly assigned variations on the eight different content features displayed (or not) on the client's mailer (we need thirteen categorical variables to capture the eight features because several of the features are categorical, not binary). Most interest rate offers were discounted relative to standard rates, and clients were given a randomly assigned deadline d_i for taking up the offer. All randomizations were assigned independently, and hence orthogonal to each other by construction, after controlling for the vector of randomization conditions X_i. We ignore interaction terms, given that we did not have any strong priors on the existence of interaction effects across treatments. Below, we motivate and detail our treatment design and priors on the main effects and groups of main effects.
Our inference is based on several different statistics obtained from estimating equation (1). Let β_r be the probit marginal effect or OLS coefficient for r, and β_1, . . . , β_13 be the marginal effects or OLS coefficients on the advertising content variables from the same specification. We estimate whether content affects demand by testing whether the β_n's are jointly different from zero. We estimate the magnitude of content effects by scaling each β_n by the price effect β_r. (A schematic sketch of this specification appears at the end of Section III.)
Our sample of 53,194 offers, which was constrained by the size of the Lender's pool of former clients, is sufficient to identify only economically large effects of individual pieces of advertising content on demand. To see this, note that each 100–basis point reduction in r (which represents a 13% reduction relative to the sample mean interest rate of 793 basis points) increased the client's application likelihood by 3/10 of a percentage point. The Lender's standard takeup rate following mailers to inactive former clients
was 0.07. Standard power calculations show that identifying a content feature effect that was equivalent to the effect of a 100–basis point price reduction (i.e., that increased takeup from 0.07 to 0.073) would require over 300,000 observations. So in fact we can only distinguish individual content effects from zero if they are equivalent to a price reduction of 200 to 300 basis points (i.e., a price reduction of 25% to 38%).

III.D. Measuring Demand and Other Outcomes

Clients revealed their demand by their takeup decision, that is, by whether they applied for a loan at their local branch before their deadlines. Loan applications were assessed and processed using the Lender's normal procedures. Clients were not required to bring the mailer with them when applying, and branch personnel were trained and monitored to ignore the mailers. To facilitate this, each client's randomly assigned interest rate was hard-coded ex ante into the computer system the Lender used to process applications.
Alternative measures of demand include obtaining a loan and the amount borrowed. The solicitations were "pre-approved" based on the client's prior record with the Lender, and hence 87% of applications resulted in a loan.13 Rejections were due to adverse changes in the client's work status, ease of contact by phone, or other indebtedness.
13. All approved clients actually took loans. This is not surprising given the short application process (45 minutes or less), the favorable interest rates offered in the experiment, and the clients' prior experience and hence familiarity with the Lender.
We consider two other outcomes. We measure outside borrowing using credit bureau data. We also examine loan repayment behavior by setting Y = 1 if the account was in default (i.e., in collection or written off as uncollectable as of the latest date for which we had repayment data), and = 0 otherwise. The motivating question for this outcome variable is whether any demand response to advertising content produces adverse selection by attracting clients who are induced to take loans they cannot afford. Note that we have less power for this outcome variable, because we only observe repayment behavior for the 4,000 or so individuals who obtained a loan.

III.E. Interest Rate Variation

The interest rate randomization was stratified by the client's preapproved risk category because risk determined the loan price
under standard operations. The standard schedule for four-month loans was low-risk = 7.75% per month; medium-risk = 9.75%; high-risk = 11.75%. The randomization program established a target distribution of interest rates for four-month loans in each risk category and then randomly assigned each individual to a rate based on the target distribution for his or her category.14,15 Rates varied from 3.25% per month to 11.75% per month, and the target distribution varied slightly across two "waves" (bunched for operational reasons) mailed September 29–30 and October 29–31, 2003. At the Lender's request, 97% of the offers were at lower-than-standard rates, with an average discount of 3.1 percentage points on the monthly rate (the average rate on prior loans was 11.0%). The remaining offers in this sample were at the standard rates.
14. Rates on other maturities in these data were set with a fixed spread from the offer rate conditional on risk, so we focus exclusively on the four-month rate.
15. Actually three rates were assigned to each client: an "offer rate" included in the direct mail solicitation and noted above, a "contract rate" (r_c) that was weakly less than the offer rate and revealed only after the borrower had accepted the solicitation and applied for a loan, and a dynamic repayment incentive (D) that extended preferential contract rates for up to one year, conditional on good repayment performance, and was revealed only after all other loan terms had been finalized. This multitiered interest rate randomization was designed to identify specific information asymmetries (Karlan and Zinman forthcoming b). Because D and r_c were surprises to the client, and hence did not affect the decision to borrow, we exclude them from most analysis in this paper. In principle, r_c and D might affect the intensive margin of borrowing, but in practice adding these interest rates to our loan size demand specifications does not change the results. Mechanically, what happened was that very few clients changed their loan amounts after learning that r_c < r.

III.F. Mailer Design: Content Treatments, Motivation, and Priors

Figures I and II show example mailers. The Lender designed the mailers in consultation with its South African–based marketing consulting firm and us. Each mailer contained some boilerplate content; for example, the Lender's logo, its slogan "the trusted way to borrow cash," instructions for how to apply, and branch hours. Each mailer also contained mail merge fields that were populated (or could be left blank in some cases) with randomized variations on the eight different advertising content features. Some randomizations were conditional on preapproved characteristics, and each of these conditions is included in the empirical models we estimate. The content and variations for each of the features are summarized in Table I. We detail the features below along with some prior work and hypotheses underlying these treatments.
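To make the rate-assignment procedure in Section III.E concrete, here is a minimal sketch in Python. It is an illustration only, not the Lender's randomization program: the rate ranges follow Table I, but the uniform target weights, the seed, and the names (RATE_GRID, assign_offer_rate) are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(seed=2003)

# Hypothetical rate grids (% per month) by preapproved risk category.
# Offers ranged from 3.25 up to the standard rate for each category
# (7.75 low, 9.75 medium, 11.75 high); the true target weights over
# each grid are not reproduced here.
RATE_GRID = {
    "low":    np.arange(3.25, 7.75 + 0.5, 0.5),
    "medium": np.arange(3.25, 9.75 + 0.5, 0.5),
    "high":   np.arange(3.25, 11.75 + 0.5, 0.5),
}

def assign_offer_rate(risk_category: str) -> float:
    """Draw a four-month offer rate from the category's target distribution
    (uniform weights are used here purely for illustration)."""
    grid = RATE_GRID[risk_category]
    return float(rng.choice(grid))

# Example: assign rates to a toy client list, stratified by risk.
clients = [("client_1", "high"), ("client_2", "low"), ("client_3", "medium")]
offers = {cid: assign_offer_rate(risk) for cid, risk in clients}
print(offers)
```

Stratifying the draw by risk category in this way keeps the offer-rate variation orthogonal to risk, which is why the empirical models control for risk as a randomization condition.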
We group the content treatments along two thematic lines. The first, and more important, thematic grouping is based on whether the content is more likely to trigger an intuitive or reasoned response. Such a distinction between intuitive and deliberative modes is common in much of the decision research on cognitive functions.16 The deliberative or reasoning mode (Kahneman’s [2003] System II) is what we do when we carry out a mathematical computation, or plan our travel to an upcoming conference. The peripheral or intuitive mode (Kahneman’s [2003] System I) is at work when we smile at a picture of puppies playing, or recoil at the thought of eating a cockroach (Rozin and Nemeroff 2002). Intuition is relatively effortless and automatic, whereas reasoning requires greater processing capacity and attention. Research on persuasion suggests that the effect of content will depend on which System(s) the content triggers, and on the underlying intentions of the consumer (Petty and Cacioppo 1986; Petty and Wegener 1999). Content that triggers “central processing,” or conscious deliberation, may be more effective when the product offer is consistent with the consumer’s intentions; for example, a consumer who is actively shopping for a loan may be persuaded most by quantitative cost or location comparisons. Content that triggers “peripheral processing,” or intuition, may be more effective when the offer is less aligned with intentions; for example, a consumer may be more persuaded to order a beer by a poster showing beautiful people sipping beer at sunset than by careful arguments about beer’s merits. We group the content treatments below by whether they were more likely to trigger System I or System II responses, and highlight where our classification is debatable. The second thematic grouping is based on whether the treatment was motivated more by a body of prior evidence (and hence the researchers’ priors) or by the Lender’s priors. System I Treatments Feature 1: photo. Visual (largely uninformative) images tend to be processed through intuitive cognitive systems. This may explain why visuals play such a large role in advertising. Mandel and Johnson (2002), for example, find that randomly manipulated background images affect hypothetical student choices in a 16. See, for example, Chaiken and Trope (1999), Slovic et al. (2002), and Stanovich and West (2002). Kahneman (2003) refers to the intuitive and deliberative modes as System I and System II in his Nobel lecture.
simulated Internet shopping environment. Our mailers test the effectiveness of visual cues by featuring a photo of a smiling person in the bottom right-hand corner in 80% of the mailers. There was one photo subject for each combination of gender and race represented in our sample (for a total of eight different photos).17 All subjects were deemed attractive and professional-looking by the marketing firm. The overall target frequency for each photo was determined by the racial and gender composition of the sample, and the objective was to obtain a 2-to-1 ratio of photo race that matched the client's race and a 1-to-1 ratio of photo gender that matched the client's gender.18
17. For mailers with a photo, the employee named at the bottom of the mailer was that of an actual employee of the same race and gender featured in the photo. In cases where no employee in the client's branch had the matched race and gender, an employee from the regional office was listed instead.
18. If the client was assigned randomly to "match," then the race of the client matched that of the model on the photograph. For those assigned to mismatch, we randomly selected one of the other races. To determine a client's race, we used the race most commonly associated with his/her last name (as determined by employees of the Lender). The gender of the photo was then randomized unconditionally at the individual level.
Several prior studies suggested that matching photos to client race or gender would increase takeup by triggering intuitive affinity between the client and Lender. Evans (1963) finds that demographic similarity between client and salesperson can drive choice, and several studies find that similarity can outweigh even expertise or credibility (see, e.g., Lord [1996]; Cialdini [2001]; Mobius and Rosenblat [2006]). We also predicted that a photo of an attractive woman would (weakly) increase takeup. This prior was based on casual empiricism (e.g., of beer and car ads) and a field experiment on door-to-door charitable fundraising in which attractive female solicitors secured significantly more donations (Landry et al. 2006).
Feature 2: number of example loans. The middle of each mailer prominently featured a table that was randomly assigned to display one or four example loans. Each example showed a loan amount and maturity based on the client's most recent loan and a monthly payment based on the assigned interest rate.19
19. High risk clients were not eligible for six- or twelve-month loans, and hence their four-example tables featured four loan amounts based on small increments above the client's last loan amount. When the client was eligible for longer maturities we randomly assigned whether the four-example table featured different maturities. See Table II and Karlan and Zinman (2008) for additional details.
The rate itself was also displayed in randomly chosen mailers (see
Feature 3). Small tables were nested in the large tables, to ensure that large tables contained more information. Every mailer stated “Loans available in other amounts . . . ” directly below the example(s) table. Our motivation for experimenting with a small vs. large table of loans comes from psychology and marketing papers on “choice overload.” In strict neoclassical models demand is (weakly) increasing in the number of choices. In contrast, the choice overload literature has found that demand can decrease with menu size. Large menus can “demotivate” choice by creating feelings of conflict and indecision that lead to procrastination or total inaction (Shafir, Simonson, and Tversky 1993). Overload effects have been found in field settings including physician prescriptions (Redelmeier and Shafir 1995) and 401k plans (Iyengar, Huberman, and Jiang 2004). An influential field experiment shows that grocery store shoppers who stopped to taste jam were much more likely to purchase if there were 6 choices rather than 24 (Iyengar and Lepper 2000). Prior studies suggest that demotivation happens largely beyond conscious awareness, and hence largely through intuitive processing. (In fact, the same people who are demotivated by choice overload often state an a priori preference for larger choice sets.) We therefore group our number of loans feature with System I. (There may be other contexts where menu size triggers conscious deliberation, e.g., where a single loan may signal a customized offer, or where multiple loans may signal full disclosure. But this was unlikely to be the case here, given the sample’s prior experience with the Lender and common knowledge on the nature and availability of different loan amounts.) Feature 3: interest rate(s) shown in example(s)? Example loan tables also randomly varied whether the interest rate was shown.20 In cases where the interest rate was suppressed, the information presented in the table (loan amount, maturity, and monthly payment) was sufficient for the client to impute the rate. This point was emphasized with the statement below the table that “There are no hidden costs. What you see is what you pay.” Displaying the interest rate has ambiguous effects on demand in rich models of consumer choice. Displaying the rate may depress demand by overloading boundedly rational consumers (see 20. South African law did not require interest rate disclosure, in contrast to the U.S. Truth-in-Lending Act.
Feature 2), or by debiasing consumers who tend to underestimate rates when inferring them from other loan terms (Stango and Zinman 2009). Displaying the rate may have no effect if consumers do not understand interest rates and use decision rules based on other loan terms (this was the Lender’s prior). Finally, displaying the rate may induce demand by signaling that the Lender indeed has “no hidden costs,” reducing computational burden, and/or clarifying that the rate is, indeed, low. Despite the potential for offsetting effects (and hence our lack of strong priors), we thought that testing this feature would be thought-provoking nonetheless, given policy focus on interest rate disclosure (Kroszner 2007). Given the Lender’s prior that interest rate disclosure would not affect demand, and its branding strategy as a “trusted” source for cash, it decided to err on the side of full disclosure and display the interest rate on the mailers with 80% probability. The interest rate feature is perhaps the most difficult one to categorize. Although it could trigger a “System II” type computation, the Lender’s prior suggests that any effect would operate mostly as an associative or emotional signal of openness and trust. So we group rate disclosure with System I and also show below that the results are robust to dropping it from the System I grouping. System II Treatments Feature 4: suggested uses. After the salutation and deadline, the mailer said something about how the client could use the loan. This “suggested use” appeared in boldface type and was one of five variations on “You can use this loan to X, or for anything else you want.” X was one of four common uses for cash loans indicated by market research and detailed in Table I. The most general phrase simply stated, “You can use this cash for anything you want.” Each of the five variations was randomly assigned with equal probabilities. We group this treatment with System II on the presumption that highlighting intended use would trigger client deliberation about potential uses and whether to take a loan. Because clients had revealed a preference for not taking up a loan in recent months, we presume that conscious deliberation would not likely change this preference. Hence we predicted that takeup would be maximized by not suggesting a particular use.21 21. We cannot rule out other cognitive mechanisms that could affect the response to suggested loan uses or the interpretation of an effect here. Suggesting a
Feature 5: comparison to outside rate. Randomly chosen mailers included a comparison of the offered interest rate to a higher outside market rate. When included, the comparison appeared in boldface in the field below “Loans available in other amounts. . . . ” Half of the comparisons used a “gain frame”; for example, “If you borrow from us, you will pay R100 rand less each month on a four month loan.” Half of the comparisons used a “loss frame”; for example, “If you borrow elsewhere, you will pay R100 rand more each month on a four month loan.”22 Several papers have found that such frames can influence choice by manipulating “reference points” that enter decision rules or preferences. There is evidence that the presence of a dominated alternative can induce choice of the dominating option (Huber, Payne, and Puto 1982; Doyle et al. 1999). This suggests that mailers with our dominated comparison rate should produce (weakly) higher takeup rates than mailers without mention of a competitor’s rate. Any dominance effect probably operates by inducing greater deliberation (Priester et al. 2004), and presenting a reason for choosing the dominating option (Shafir, Simonson, and Tversky 1993), particularly because the comparison is presented in text. Invoking potential losses may be a particularly powerful stimulus for demand if it triggers loss aversion (Kahneman and Tversky 1979; Tversky and Kahneman 1991), and indeed Ganzach and Karsahi (1995) find that a loss-framed message induced significantly higher credit card usage than a gain-framed message in a direct marketing field experiment in Israel. This suggests that the loss-framed comparison should produce (weakly) higher takeup rates than either the gain-frame or the no-comparison conditions. Feature 6: cell phone raffle. Many firms, including the Lender and many of its competitors, use promotional giveaways as part particular use might make consumption salient and serve as a cue to take up the loan (although this sort of associative response may be difficult to achieve with text, which typically triggers more deliberative processing). Yet another possibility is that suggesting a particular use creates dissonance with the Lender’s “no questions asked” policy regarding loan uses, a policy designed to counteract the stigma associated with high-interest borrowing. In any case, it is unlikely that suggesting a particular use provided information by (incorrectly) signaling a policy change regarding loan uses, because each variation ended with “or for anything else you want.” 22. The mailers also randomized the unit of comparison (rand per month, rand per loan, percentage point differential per month, percentage point differential per loan), but the resulting cell sizes are too small to statistically distinguish any differential effects of units on demand.
of their marketing. Our experiment randomized whether a cell phone raffle was prominently featured in the bottom right margin of the mailer: “WIN 10 CELLPHONES UP FOR GRABS EACH MONTH!” Per common practice in the cash loan market, the mailers did not detail the odds of winning or the value of the prizes. In fact, the expected value of the raffle for any individual client was vanishingly small.23 This implies that the raffle should not change the takeup decision based on strictly economic factors.24 Yet marketing practice suggests that promotional raffles may increase demand despite not providing any material increase in the expected value of taking up the offer. A possible channel is a tendency for individuals to overestimate the frequency of low-probability events. In contrast, several papers have reached the surprising conclusion that promotional giveaways can backfire and reduce demand. The channel seems to be “reason-based choice” (Shafir, Simonson, and Tversky 1993): many consumers feel the need to justify their choices and find it more difficult to do so when the core product comes with an added feature they do not value. This holds even when subjects understand that the added option comes at no extra pecuniary or time cost (Simonson, Carmon, and O’Curry 1994). Given the conflicting prior evidence, we had no strong prior on whether promoting the cell phone raffle would affect demand. Because both postulated cognitive channels seem to operate through conscious (if faulty) reasoning, we classify the raffle as a System II treatment. Lender-Imposed Treatments. Two additional treatments were motivated by the Lender’s choices and the low-cost nature of content testing, rather than by a body of prior evidence on consumer decision making. 23. The 10 cell phones were each purchased for R300 and randomly assigned within the pool of approximately 10,000 individuals who applied at the Lender’s branches during the 3 months spanned by the experiment. The pool was much larger than the number of applicants who received a mailer featuring the raffle, because by law all applicants (including first-time applicants, and former clients excluded from our sample frame) were eligible for the raffle. 24. The raffle could be economically relevant if the Lender’s market were perfectly competitive. In that case, and where raffles are part of the equilibrium offer, then not offering the small-value raffle could produce a sharp drop in demand (because potential clients would be indifferent on the margin between borrowing from the Lender or from competitors when offered the raffle, but would weakly prefer a competitor’s offer when the Lender did not offer the raffle). But the cash loan market seems to be imperfectly competitive: see Section II, and the modest response to price reductions in Section V.A.
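The following is a minimal sketch of how the mailer-level content randomization described in this subsection (and by the two Lender-imposed features discussed next) could be drawn for a single client. It is illustrative only: the probabilities documented in the text are used where stated (photo on 80% of mailers, a 2-to-1 race match and 1-to-1 gender match, rate shown with 80% probability, "special" 50% / "low" 25% / blank 25%, language blurb for 63% of eligible clients), while the remaining splits, the suggested-use labels, and all function names are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(2003)

def draw_mailer_content(client_race: str, client_gender: str,
                        non_english: bool) -> dict:
    """Draw one mailer's content vector (illustrative sketch)."""
    content = {}

    # Feature 1: photo on 80% of mailers; race matched 2:1, gender matched 1:1.
    if rng.random() < 0.80:
        photo_race = client_race if rng.random() < 2 / 3 else "other"
        photo_gender = client_gender if rng.random() < 0.5 else "other"
        content["photo"] = (photo_race, photo_gender)
    else:
        content["photo"] = None

    # Feature 2: one vs. four example loans (a 50/50 split is illustrative).
    content["one_example"] = rng.random() < 0.5

    # Feature 3: interest rate shown in the example table (80%).
    content["rate_shown"] = rng.random() < 0.80

    # Feature 4: suggested use, five variations with equal probability
    # (the specific use labels here are placeholders).
    content["suggested_use"] = str(rng.choice(
        ["none", "use_1", "use_2", "use_3", "use_4"]))

    # Feature 7: language blurb for 63% of non-English-primary clients.
    content["language_blurb"] = bool(non_english and rng.random() < 0.63)

    # Feature 8: rate blurb -- "special" 50%, "low" 25%, blank 25%.
    content["rate_blurb"] = str(rng.choice(
        ["special", "low", "none"], p=[0.5, 0.25, 0.25]))

    return content

print(draw_mailer_content("African", "female", non_english=True))
```

Because every feature is drawn independently of the interest rate and deadline, the resulting content dummies are orthogonal to price by construction, conditional on the randomization conditions.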
Feature 7: language affinity. Some mailers featured a blurb “We speak (client’s language)” for a random subset of the clients who were not primarily English speakers (44% of the sample). When present, the matched language blurb was directly under the “business hours” box in the upper right of the mailer. The rest of the mailer was always in English. The Lender was particularly confident that the language affinity treatment would increase takeup and insisted that most eligible clients get it; hence the 63–37 split noted in Table I. In contrast to matched photos, we did not think that the “language affinity” was well motivated by laboratory evidence or that it would increase takeup. The difference is one of medium. The language blurb was in text, and hence more likely to be processed through deliberative cognitive systems, where linguistic affinity was unlikely to prove particularly compelling. Photos are more likely to be processed through intuitive and emotional systems. The laboratory evidence suggests that affinities work through intuitive associations (System I) rather than through reasoning (System II). Feature 8: “special” rate vs. “low” rate vs. no blurb. As discussed above, nearly all of the interest rate offers were at discounted rates, and the Lender had never offered anything other than its standard rates prior to the experiment. So the Lender decided to highlight the unusual nature of the promotion for a random subset of the clients: 50% of clients received the blurb “A special rate for you,” and 25% of clients received “A low rate for you.” The mail merge field was left blank for the remaining clients. When present, the blurb was inserted just below the field for the language match. Our prior was that this treatment would not influence takeup, although there may be models with very boundedly rational consumers and credible signaling by firms where showing one of these blurbs would (weakly) increase takeup. III.G. Deadlines Each mailer also contained a randomly assigned deadline by which the client had to respond to obtain the offered interest rate. Deadlines ranged from “short” (approximately two weeks) to “long” (approximately six weeks). Short deadlines were assigned only among clients who lived in urban areas with a non–P.O. Box mailing address and hence were likely to receive their mail quickly
(see Table I for details). Some clients eligible for the short deadline were randomly assigned a blurb showing a phone number to call for an extension (to the medium deadline). Our deadline randomization was motivated by advertising practices, which often promote limited-time offers, and by decision research on time management. Some behavioral models predict that shorter deadlines will boost demand by overcoming a tendency to procrastinate and postpone difficult decisions or tasks. Indeed, the findings in Ariely and Wertenbroch (2002), and introspection, suggest that many individuals choose to impose shorter deadlines on themselves even when longer ones are in the choice set. In contrast, standard economic models predict that consumers will always (weakly) prefer the longest available deadline, all else equal, due to the option value of waiting.
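Because all of the treatments above enter a single estimating equation, it may help to see the form of equation (1) from Section III.C spelled out in code (this is the sketch referenced there). It is an illustration only, not the authors' code: it builds a small synthetic data set with hypothetical column names (applied, offer_rate, a subset of the content dummies, deadline_long, risk, wave) and fits a probit with Huber–White standard errors, reporting marginal effects as in Table III.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000  # toy sample; the actual experiment mailed 53,194 offers

# Synthetic stand-in data with hypothetical column names (illustration only).
df = pd.DataFrame({
    "offer_rate": rng.choice(np.arange(3.25, 11.75, 0.5), n),  # % per month
    "female_photo": rng.integers(0, 2, n),
    "one_example": rng.integers(0, 2, n),
    "no_use_mentioned": rng.integers(0, 2, n),
    "deadline_long": rng.integers(0, 2, n),
    "risk": rng.choice(["low", "medium", "high"], n),
    "wave": rng.choice(["sep", "oct"], n),
})
# Toy outcome: takeup falls with the offer rate and rises with a long deadline.
latent = (-1.4 - 0.03 * df["offer_rate"] + 0.3 * df["deadline_long"]
          + rng.normal(0, 1, n))
df["applied"] = (latent > 0).astype(int)

content = ["female_photo", "one_example", "no_use_mentioned"]  # subset of the 13 dummies
deadlines = ["deadline_long"]
controls = ["C(risk)", "C(wave)"]  # randomization conditions
formula = "applied ~ offer_rate + " + " + ".join(content + deadlines + controls)

# Probit marginal effects with heteroskedasticity-robust standard errors.
probit_fit = smf.probit(formula, data=df).fit(disp=False, cov_type="HC1")
print(probit_fit.get_margeff().summary())
```

In the paper's application, the joint test of the content coefficients and the scaling of each content coefficient by the offer-rate coefficient are then computed from this fitted model.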
IV. CONCEPTUAL FRAMEWORK: INTERPRETING THE EFFECTS OF ADVERTISING CONTENT

As discussed above, the advertising content treatments in our experiment were motivated primarily by findings from psychology and marketing that are most closely related to theories of persuasive advertising. Here we formalize definitions of persuasion and other mechanisms through which advertising content might affect consumer choice. We also speculate on the likely relevance of these different mechanisms in our research context.
As a starting point, consider a simple decision rule where consumers purchase a product if and only if the marginal cost of the product is less than the expected marginal return (in utility terms) of consuming the product. A very simple way to formalize this is to note that the consumer purchases (loan) product (or consumption bundle) l iff

(2)    u_i(l) − p_i > 0,
where u_i is the consumer's (discounted) utility gain from purchasing l and p is the price.25
25. In our context p is a summary statistic capturing the cost of borrowing. Without liquidity constraints the discounted sum of any fees + the periodic interest rate captures this cost. Under liquidity constraints, loan maturity affects the effective price as well (Karlan and Zinman 2008).
Advertising has no effect on either u or p, and the model predicts that we will not reject the hypothesis
of null effects of advertising content on demand when estimating equation (1). One might wonder whether a very slightly enriched model would predict that consumers who are just indifferent about borrowing (from the Lender) might be influenced by advertising content (say by changing the consumer from indifference to “go with the choice that has the attractive mailer”). This would be a more plausible interpretation in our setting if the experiment’s prices were more uniform and standard, given that everyone in the sample had borrowed recently at the Lender’s standard rates. But the experimental prices ranged widely, with a density almost entirely below the standard rates. Thus if consumers were indifferent on average in our sample then price reductions should have huge positive effects on takeup on average. This is not the case; Section V.A shows that takeup elasticities for the price reductions are substantially below one in absolute value. Models in the “behavioral” decision-making and economics-ofadvertising literatures enrich the simple decision rule in equation (2) and allow for the possibility that advertising affects consumer behavior, that is, for the possibility that the average effect of the advertising content variables in equation (1) is different from 0. Following Bagwell’s (2007) taxonomy, we explore three distinct mechanisms. A first possible mechanism is informative advertising content. Here the consumer has some uncertainty about the utility gain and/or price (which could be resolved by a consumer at a search and/or computational cost), and advertising operates through the consumer’s expectations about utility and price. Now the consumer buys the product if (3)
(3)    E_t^u(C_it)[u_i(l)] − E_t^p(C_it)[p_i] > 0,
where expectations E at time t are influenced by the vector of advertising content C that consumer i receives. In our setting, for example, announcing that the firm speaks Zulu might provide information. The content treatments might also affect expected utility through credible signaling. Seeing a photo on the mailer might increase the client’s expectation of an enjoyable encounter with an attractive loan officer at the Lender’s branch. Our experimental design does not formally rule out these sorts of informative effects, but we do not find them especially
plausible in this particular implementation. Recall that the mailers were sent exclusively to clients who successfully repaid prior loans from the Lender. Most had been to a branch within the past year and hence were familiar with the loan product, the transaction process, the branch’s staff and general environment, and the fact that loan uses are unrestricted. A second possibility is that advertising is complementary to consumption: consumers have fixed preferences, and advertising makes the consumer “believe—correctly or incorrectly—that it [sic] gets a greater output of the commodity from a given input of the advertised product” (Stigler and Becker 1977). In reduced form, this means that advertising affects net utility by interacting with enjoyment of the product. So the consumer purchases if (4)
(4)    u_i(l, l∗C_i) − p_i > 0.
Our design does not formally rule out complementary mechanisms, but their relevance might be limited in our particular implementation. Complementary models tend to be motivated by luxury or prestige goods (e.g., cool advertising content makes me enjoy wearing a Rolex more, all else equal), and the product here is an intermediate good that is used most commonly to pay for necessities. Moreover, the first-hand prior experience our sample frame had with consumer borrowing makes it unlikely that marketing content would change perceptions of the loan product in a complementary way. Finally, a third mechanism is persuasive advertising content. A simple model of persuasion would be that the true utility of purchase is given by ui (l) – pi . But individuals decide to purchase or not based on (5)
(5)    D_i(u_i(l), C_i) − p_i > 0,
where D_i(u_i(l), C_i) is the effective decision, rather than hedonic utility. Persuasion can operate directly on preferences by manipulating reference points, providing cues that increase the marginal utility of consumption, providing motivation to make (rather than procrastinate) choices, or simplifying the complexity of decision making. Other channels for persuasion arise if perceptions of key decision parameters are biased and can be manipulated by advertising content. As discussed above (in Section III.F), content may work through these channels by triggering intuitive and/or deliberative cognitive processes. Note that (5) does not allow content to
affect demand by affecting price sensitivity: D_i(·) does not include p_i as an argument. To clarify the distinction from the informative view, note that allowing for biased expectations or biased perceptions of choice parameters is equivalent to allowing a distinction between hedonic utility (i.e., true, experienced utility) and choice utility (perceived/expected utility at the time of the decision). Under a persuasive view of advertising, consumers decide based on choice utility.
Finally, note that, as in the traditional model, price will continue to affect overall demand. In this sense, there may appear to be a stable demand curve. But the demand curve may shift as content C_i varies. Thus demand estimation that ignores persuasive content may produce a misleading view of underlying utility.

V. RESULTS

This section presents results from estimating equation (1) detailed in Section III.C.

V.A. Interest Rates

Consumer sensitivity to the price of the loan offer will provide a useful way to scale the magnitude of any advertising content effects. The first row of econometric results in Table III shows the estimated magnitude of loan demand price sensitivities in our sample. Our main result on price is that the probability of applying before the deadline (8.5% in the full sample) rose 3/10 of a percentage point for every 100–basis point reduction in the monthly interest rate (column (1)). This implies a 4% increase in takeup for every 13% decrease in the interest rate, and a takeup price elasticity of −0.28.26 Column (4) shows a nearly identical result when the outcome is obtaining a loan instead of applying for a loan. Column (5) shows that the total loan amount borrowed (unconditional on borrowing) also responded negatively to price. The implied elasticity here is −0.34.27
26. Clients were far more elastic with respect to offers at rates greater than the Lender's standard ones (Karlan and Zinman 2008). This small subsample (632 offers) is excluded here because it was part of a pilot wave of mailers that did not include the content randomizations.
27. See Karlan and Zinman (2008) for additional results on price sensitivity on the intensive margin.
Column (6) shows that default rose with price;
TABLE III
EFFECTS OF ADVERTISING CONTENT ON BORROWER BEHAVIOR

Dependent variable, sample, estimator, and mean of the dependent variable, by column:
(1) Applied for loan before mailer deadline; full sample; probit; 0.0850
(2) Applied for loan before mailer deadline; males; probit; 0.0824
(3) Applied for loan before mailer deadline; females; probit; 0.0879
(4) Obtained loan before mailer deadline; full sample; probit; 0.0741
(5) Loan amount obtained before mailer deadline; full sample; OLS; 110.4363
(6) Loan in collection status; obtained a loan; probit; 0.1207
(7) Borrowed from other lender; full sample; probit; 0.2183

Monthly interest rate in percentage point units (e.g., 8.2):
(1) −0.0029∗∗∗ (0.0005); (2) −0.0025∗∗∗ (0.0007); (3) −0.0034∗∗∗ (0.0008); (4) −0.0026∗∗∗ (0.0005); (5) −4.7712∗∗∗ (0.8238); (6) 0.0071∗∗∗ (0.0022); (7) 0.0009 (0.0008)

1 = no photo:
(1) 0.0013 (0.0040); (2) −0.0050 (0.0048); (3) 0.0021 (0.0055); (4) 0.0029 (0.0037); (5) 3.9316 (7.6763); (6) 0.0013 (0.0166); (7) −0.0024 (0.0060)

1 = female photo (System I: affective response):
(1) 0.0057∗∗ (0.0026); (2) 0.0079∗∗ (0.0034); (3) 0.0032 (0.0038); (4) 0.0056∗∗ (0.0024); (5) 8.3292 (5.0897); (6) −0.0076 (0.0107); (7) −0.0047 (0.0040)

1 = photo gender matches client's (System I: affinity/similarity):
(1) −0.0026 (0.0026); (2) —; (3) —; (4) −0.0033 (0.0024); (5) −7.1773 (5.0850); (6) −0.0059 (0.0107); (7) 0.0041 (0.0040)

1 = photo race matches client's (System I: affinity/similarity):
(1) −0.0056 (0.0048); (2) −0.0014 (0.0064); (3) −0.0099 (0.0070); (4) −0.0035 (0.0044); (5) 9.0638 (10.4079); (6) 0.0181 (0.0176); (7) −0.0018 (0.0072)

1 = one example loan shown (System I: avoid choice overload):
(1) 0.0068∗∗ (0.0028); (2) 0.0099∗∗∗ (0.0038); (3) 0.0031 (0.0040); (4) 0.0075∗∗∗ (0.0026); (5) 2.4394 (4.8383); (6) 0.0073 (0.0117); (7) −0.0043 (0.0042)

1 = interest rate shown (System I: several, potentially offsetting, channels):
(1) 0.0025 (0.0030); (2) −0.0017 (0.0042); (3) 0.0073 (0.0044); (4) 0.0043 (0.0028); (5) 2.8879 (6.7231); (6) 0.0140 (0.0123); (7) 0.0007 (0.0049)

1 = cell phone raffle mentioned (System II: overestimate small probabilities vs. conflict from reason-based choice):
(1) −0.0023 (0.0026); (2) −0.0001 (0.0036); (3) −0.0049 (0.0039); (4) −0.0013 (0.0025); (5) 4.0850 (5.6266); (6) −0.0050 (0.0109); (7) −0.0015 (0.0041)

1 = no specific loan use mentioned (System II: mentioning specific use, via text, triggers deliberation):
(1) 0.0059∗∗ (0.0029); (2) 0.0084∗∗ (0.0040); (3) 0.0031 (0.0043); (4) 0.0043 (0.0027); (5) −9.4384∗ (5.1200); (6) 0.0086 (0.0121); (7) −0.0033 (0.0045)

1 = comparison to competitor rate (System II: makes dominating option salient):
(1) −0.0002 (0.0031); (2) −0.0012 (0.0043); (3) 0.0010 (0.0046); (4) 0.0012 (0.0029); (5) −2.6021 (6.2961); (6) −0.0027 (0.0133); (7) 0.0054 (0.0049)

1 = loss frame comparison (System II: triggers loss aversion):
(1) −0.0024 (0.0026); (2) −0.0018 (0.0035); (3) −0.0029 (0.0038); (4) −0.0021 (0.0024); (5) 3.0925 (5.0678); (6) −0.0137 (0.0128); (7) 0.0027 (0.0040)

1 = we speak "your language" (Lender imposed):
(1) −0.0043 (0.0036); (2) −0.0016 (0.0049); (3) −0.0073 (0.0053); (4) −0.0036 (0.0033); (5) −11.3556∗ (6.2935); (6) 0.0032 (0.0108); (7) 0.0133∗∗ (0.0059)

1 = a "low" or "special" rate for you (Lender imposed):
(1) 0.0001 (0.0031); (2) −0.0022 (0.0043); (3) 0.0027 (0.0045); (4) 0.0010 (0.0028); (5) 3.3864 (5.9209); (6) −0.0031 (0.0152); (7) −0.0002 (0.0047)

N:
(1) 53,194; (2) 27,848; (3) 25,346; (4) 53,194; (5) 53,194; (6) 3,944; (7) 53,194

(Pseudo-) r-squared:
(1) .0456; (2) .0481; (3) .0438; (4) .0534; (5) .0361; (6) .0674; (7) .0048

p-Value, F-test on all advertising content variables:
(1) .0729; (2) .0623; (3) .5354; (4) .0431; (5) .2483; (6) .7485; (7) .4866

Absolute value lower bound of range of joint content effect for which F-test rejects null:
(1) 0.0010; (2) 0.0021; (3) —; (4) 0.0026; (5) —; (6) —; (7) —

Absolute value upper bound of range of joint content effect for which F-test rejects null:
(1) 0.0448; (2) 0.0388; (3) —; (4) 0.0498; (5) —; (6) —; (7) —

p-Value, F-test on Lender-imposed content ("low" or "special"; language):
(1) .5064; (2) .8217; (3) .3337; (4) .5254; (5) .1695; (6) .5382; (7) .0785

p-Value, F-test on psychology-motivated content (all other features):
(1) .0522; (2) .0300; (3) .5541; (4) .0286; (5) .3420; (6) .7262; (7) .7583

Split psychology-motivated content:
p-Value, F-test on System II (reasoning) content (suggested use, comparison, cell phone raffle):
(1) .1946; (2) .2643; (3) .6200; (4) .4499; (5) .3399; (6) .9360; (7) .4947
p-Value, F-test on System I (intuitive) content (photo, # loans shown, rate shown):
(1) .0355; (2) .0211; (3) .3929; (4) .0127; (5) .4362; (6) .4346; (7) .7675
p-Value, F-test on System I, dropping rate shown:
(1) .0598; (2) .0288; (3) .5130; (4) .0072; (5) .3288; (6) .4196; (7) .7169

Notes: Huber–White standard errors. Probit results are marginal effects. All models include controls for randomization conditions: risk, race, gender, language, and mailer wave (September or October). Treatment variable labels: parentheses contain a summary description of our prior on why each ad content treatment would increase demand (or of the reason(s) why we had no strong prior). Omitted categories: male photo, no photo gender match, no photo race match, four example loans shown, no interest rate shown, no cell phone raffle mentioned, specific loan use mentioned, no comparison to competitor rate, gain frame comparison, no mention of speaking local language, no mention of low or special rate. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01.
this result indicates adverse selection and/or moral hazard with respect to interest rates.28 Column (7) shows that more expensive offers did not induce significantly more substitution to other formal sector lenders (as measured from credit bureau data). This result is a precisely estimated zero relative to a sample mean outside borrowing proportion of 0.22. The lack of substitution is consistent with the descriptive evidence discussed in Section II on the dearth of close substitutes for the Lender. V.B. Advertising Content Treatments Table III also presents the results on advertising content variations for the full sample. The F-tests reported near the bottom of the table indicate whether the content features had an effect on demand that was jointly significantly different from zero.29 The applied (or “takeup”) model has a p-value of .07 (column (1)), and the “obtained a loan” model has a p-value of .04 (column (4)), implying that advertising content did influence the extensive margin of loan demand with at least 90% confidence. Column (5) shows that the joint effect of content on loan amount is insignificant (p-value = .25). Column (6) shows an insignificant effect on default; that is, we do not find evidence of adverse selection on response to content. Column (7) shows an insignificant effect on outside borrowing; that is, the positive effect on demand for credit from the Lender in columns (1) and (4) does not appear to be driven by balance-shifting from other lenders. The results on the individual content variables give some insight into which features affected demand (although some inferential caution is warranted here, because with thirteen content variables we would expect one to be significant purely by chance). Three variables show significant increases in takeup: one example loan, no suggested loan use, and female photo. The result on one example loan strikes us as noteworthy. It is a clear departure from strict neoclassical models, where more choice and more information weakly increase demand. It replicates prior findings and moreover suggests that choice 28. The finding here is reduced-form evidence of information asymmetries; see Karlan and Zinman (forthcoming b) for additional results that separately identify adverse selection and moral hazard effects. 29. Results are nearly identical if we omit the cell phone raffle from the joint test of content effects on the grounds that the raffle has some expected pecuniary value.
overload can matter even when the amount of content in the "more" condition is small: we test across two small menus, for a product that everyone in our sample has used before.
The effect of the female photo motivates consideration of whether advertising content effects differ by consumer gender; for example, in Landry et al. (2006), male charitable donor prospects respond more to female solicitor attractiveness than female prospects do.30 Columns (2) and (3) of Table III show that male clients receiving the female photo took up significantly more, but female clients did not. In fact, female clients did not respond significantly to any of the content treatments. Males responded to example loans and loan uses, as well as to the female photo. Unsurprisingly, then, the joint F-test for all content variables is significant for male but not for female clients. Note that takeup rates and sample sizes are quite similar across client genders, so these findings are not driven purely by power issues. However, as with other results, the insignificant results for female clients are imprecise, and do not rule out economically large effects of advertising content.
30. The Online Appendix presents results for subsamples split by income, education, and number of prior transactions with the Lender. We do not feature these results because we are underpowered even in the full sample and also lacked strong priors that treatment effects should vary with these other characteristics.
Another notable finding on the individual content variables is the disconnect between our priors and our findings. Several treatments we predicted would have significant effects did not (comparisons, and the other photo variables).
Results on the individual content variables also provide some insight into how much advertising content affects demand, relative to price. For our preferred outcome (1 = applied), the statistically significant point estimates imply large magnitudes: a mailer with one example loan (or no suggested use, or a female photo) increased takeup by at least as much as a 200–basis point (25%) reduction in the interest rate. Table IV reports the results of this scaling calculation for each of the content point estimates in Table III; that is, it takes the point estimate on a content variable, divides it by the coefficient on the offer rate for that specification, and multiplies the result by 100 to get an estimate of the interest rate reduction needed to obtain the increase in demand implied by the point estimate on the content variable.
The bottom rows of Table III show results for our thematic groupings of content treatments. These results shed some light
TABLE IV
EFFECTS OF ADVERTISING CONTENT ON BORROWER BEHAVIOR: POINT ESTIMATES IN TABLE III, SCALED BY PRICE EFFECT

Dependent variable: applied for loan before mailer deadline. Columns: (1) full sample (mean 0.0850); (2) males (mean 0.0824); (3) females (mean 0.0879). Entries are equivalent interest-rate reductions in basis points.

1 = no photo: (1) 45; (2) −200; (3) 62
1 = female photo (System I: affective response): (1) 197; (2) 316; (3) 94
1 = photo gender matches client's (System I: affinity/similarity): (1) −90; (2) —; (3) —
1 = photo race matches client's (System I: affinity/similarity): (1) −193; (2) −56; (3) −291
1 = one example loan shown (System I: avoid choice overload): (1) 234; (2) 396; (3) 91
1 = interest rate shown (System I: several, potentially offsetting, channels): (1) 86; (2) −68; (3) 215
1 = cell phone raffle mentioned (System II: overestimate small probabilities vs. conflict from reason-based choice): (1) −79; (2) −4; (3) −144
1 = no specific loan use mentioned (System II: mentioning specific use, via text, triggers deliberation): (1) 203; (2) 336; (3) 91
1 = comparison to competitor rate (System II: makes dominating option salient): (1) −7; (2) −48; (3) 29
1 = loss frame comparison (System II: triggers loss aversion): (1) −83; (2) −72; (3) −85
1 = we speak "your language" (Lender imposed): (1) −148; (2) −64; (3) −215
1 = a "low" or "special" rate for you (Lender imposed): (1) 3; (2) −88; (3) 79

Notes: Cells divide the coefficient on the content variable from Table III by the offer rate (i.e., the price) coefficient, and multiply by −100, to estimate the interest rate drop (in basis points) that would be required to achieve the same effect on demand that was achieved by the content treatment. So negative numbers indicate the equivalent interest rate increase needed to generate the drop in demand implied by a negative point estimate on a content variable. Note that we calculate this for all content treatments here, including the ones that are not statistically significant in Table III. Treatment variable labels: parentheses contain summary descriptions of our prior on why each ad content treatment would increase demand (or of the reason(s) that we had no strong priors).
on the mechanisms through which advertising content affects demand. F-tests show that the six content features motivated by prior evidence significantly affected takeup, whereas the two features imposed by the Lender did not. The last rows of F-tests show
that our grouping of System I (intuitive processing) treatments significantly affected takeup; in contrast, our System II (deliberative processing) treatments did not significantly affect takeup. Hence, in this context, advertising content appears more effective when it is aimed at triggering an intuitive response rather than a deliberative response.
There are, however, two important caveats that lead us to view this finding as mainly suggestive, and not definitive, evidence on the cognitive mechanisms through which advertising content affects demand. The first caveat is that our confidence intervals do not rule out economically significant effects of System II content. The second caveat is that the classification of some of the treatments as System I or System II is debatable.

V.C. Deadlines

Recall that the mailers also included randomly assigned deadlines designed to test the relative importance of option value (longer deadlines make the offer more valuable and induce takeup) versus time management problems (longer deadlines induce procrastination and perhaps forgetting, and depress takeup). Table V presents results from estimating our usual specification with the deadline variables included.31
31. We omit the advertising content variables from the specification for expositional clarity in the table, but recall that all randomizations were done independently. So including the full set of treatments does not change the results.
The results in Table V, Panel A, columns (1)–(3), suggest that option value dominates any time management problem in our context: takeup and loan amount increased dramatically with deadline length. Lengthening the deadline by approximately two weeks (i.e., moving from the omitted short deadline to the extension option or medium deadline, or from medium to long deadline) increases takeup by about three percentage points. This is a large effect relative to the mean takeup rate of 0.085, and enormous relative to the price effect. Shifting the deadline by two weeks had about the same effect as a 1,000–basis point reduction in the interest rate. This large effect could be due to time-varying costs of getting to the branch (e.g., transportation cost, opportunity cost of missing work) and/or to borrowing opportunities or needs that vary stochastically (e.g., bad shocks). Columns (4) and (5) show that we do not find any significant effects of deadline on default or on borrowing from other lenders.
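As a minimal illustration of the scaling used in Table IV and in the deadline comparison above, the snippet below divides a treatment's estimated effect on takeup by the offer-rate effect and expresses it as an equivalent interest-rate change in basis points. The inputs are point estimates reported in Tables III and V; the function name is a hypothetical placeholder.

```python
def equivalent_basis_points(effect: float, price_effect_per_100bp: float) -> float:
    """Interest-rate reduction (in basis points) that would move takeup by
    as much as the given treatment effect, per the Table IV scaling."""
    return 100 * effect / abs(price_effect_per_100bp)

# Change in P(apply) per 100-basis point rate increase (Table III, column (1)).
price_effect = -0.0029

print(equivalent_basis_points(0.0057, price_effect))  # female photo      -> roughly 197 bp
print(equivalent_basis_points(0.0068, price_effect))  # one example loan  -> roughly 234 bp
print(equivalent_basis_points(0.03, price_effect))    # two-week deadline shift -> over 1,000 bp
```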
TABLE V
EFFECTS OF DEADLINE ON BORROWER BEHAVIOR

Panel A: Predeadline demand
Dependent variable, sample, estimator, and mean of the dependent variable, by column:
(1) Applied before own deadline; full sample; probit; 0.0850
(2) Obtained loan before own deadline; full sample; probit; 0.0741
(3) Loan amount obtained before own deadline; full sample; OLS; 110.4363
(4) Loan obtained before own deadline in collection status; obtained a loan; probit; 0.1207
(5) Borrowed from other lender; full sample; probit; 0.2183
(6) Applied within 2 weeks (short deadline length); full sample; probit; 0.0360

Monthly interest rate in percentage point units (e.g., 8.2):
(1) −0.0029∗∗∗ (0.0005); (2) −0.0026∗∗∗ (0.0005); (3) −4.7768∗∗∗ (0.8237); (4) 0.0075∗∗∗ (0.0023); (5) 0.0009 (0.0008); (6) −0.0009∗∗∗ (0.0003)

Short deadline, extended:
(1) 0.0322∗∗∗ (0.0118); (2) 0.0240∗∗ (0.0107); (3) 31.1321∗ (17.2858); (4) 0.0236 (0.0424); (5) −0.0104 (0.0131); (6) −0.0019 (0.0047)

Medium deadline:
(1) 0.0300∗∗∗ (0.0068); (2) 0.0270∗∗∗ (0.0061); (3) 38.0335∗∗∗ (13.8228); (4) 0.0205 (0.0300); (5) −0.0065 (0.0119); (6) −0.0046 (0.0047)

Long deadline:
(1) 0.0603∗∗∗ (0.0118); (2) 0.0563∗∗∗ (0.0112); (3) 70.1119∗∗∗ (15.0945); (4) 0.0138 (0.0363); (5) −0.0054 (0.0123); (6) −0.0055 (0.0042)

(Pseudo-) r-squared: (1) .0461; (2) .0538; (3) .0351; (4) .0597; (5) .0007; (6) .0471
N: (1) 53,194; (2) 53,194; (3) 53,194; (4) 3,944; (5) 53,194; (6) 53,194
F-test of joint significance of all deadlines (p-value): (1) 0.0000; (2) 0.0000; (3) 0.0000; (4) 0.8487; (5) 0.8813; (6) 0.6570

Panel B: Postdeadline applications (dependent variable = applied)
Columns: (1) After short deadline; full sample; probit; mean 0.1830. (2) After medium deadline; full sample; probit; mean 0.1477. (3) After long deadline; full sample; probit; mean 0.1184.

Offer interest rate: (1) −0.0010 (0.0008); (2) 0.0005 (0.0007); (3) 0.0009 (0.0006)
Short deadline, extended: (1) −0.0224∗ (0.0117); (2) −0.0052 (0.0113); (3) −0.0030 (0.0102)
Medium deadline: (1) −0.0058 (0.0112); (2) −0.0035 (0.0102); (3) −0.0047 (0.0092)
Long deadline: (1) −0.0089 (0.0114); (2) 0.0019 (0.0108); (3) −0.0014 (0.0095)
Pseudo r-squared: (1) .0560; (2) .0448; (3) .0369
N: (1) 53,194; (2) 53,194; (3) 53,194
F-test of joint significance of all deadlines (p-value): (1) .2518; (2) .6332; (3) .8262

Notes: Huber–White standard errors. Probit results are marginal effects. All models include controls for randomization conditions: risk, mailer wave (September or October), and deadline eligibility. Short deadline is the omitted category; "short deadline, extended" gave customers a number to call and get an extension (to the medium deadline). Panel A, column (6): tests whether short deadlines spur action by inducing early applications. The dependent variable here is defined regardless of the individual's deadline length; that is, the dependent variable = 1 if the individual applied within two weeks of the mailer date, unconditional on her own deadline. Panel B: Testing three alternative measures of postdeadline takeup helps ensure that our results here are not driven by mechanical timing differences, because we have a finite amount of postdeadline data (6 months). We measure postdeadline takeup using takeup after the short deadline (2 weeks), after the medium deadline (4 weeks), and after the long deadline (6 weeks). We define these outcomes for each member of the sample, regardless of their own deadline length, in order to ensure that everyone in the sample has the same takeup window. Otherwise those with the short deadline mechanically have a longer postdeadline window, and if there is a positive secular probability of hazard into takeup status within the range our deadlines produce (5 to 6 months), then this would mechanically push toward a decreasing relationship between deadline length and postdeadline takeup. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01.
It is theoretically possible that the strength of the longer-deadline effect may be due in part to the nature of direct mail. Although we took precautions to ensure that the mailers arrived well before the assigned deadline, it may be the case that clients did not open the mailer until after the deadline expired. For example, if clients only opened their mail every two weeks, then the short deadline would mechanically produce a very low takeup rate (in fact, the mean rate for those offered the short deadline was 0.057, versus 0.085 for the full sample). It is also theoretically possible that by capping the deadline variation at six weeks, we missed important nonlinearities over longer horizons. Note, however, that longer deadlines were arguably empirically irrelevant in our context, as the Lender deemed deadlines beyond six weeks operationally impractical.
Panel A, column (6), and Panel B explore whether the large increase in demand with deadline length obscures a smaller, partially offsetting time management effect, that is, whether there is a channel through which longer deadlines depress demand (by triggering procrastination and/or limited attention) that is swamped by the larger, positive effect of option value. Specifically, Panel A, column (6), tests whether short deadlines spur action by inducing early applications (here "applying within two weeks"—the short deadline length—is the dependent variable). The negative signs on the deadline coefficients are consistent with a time management effect, but the deadline variables are neither individually nor jointly significant, and the estimates are imprecise.
In Panel B, we test whether longer deadlines increase the likelihood of takeup after deadlines pass. Postdeadline takeup is an interesting outcome to study because the price of loans rose, substantially on average, postdeadline. So postdeadline takeup could be an indicator of costly time management problems, and if short deadlines help consumers overcome such problems, we might expect postdeadline takeup to increase in deadline length. Panel B tests this hypothesis using three alternative measures of postdeadline takeup.32
32. Testing three alternative measures of postdeadline takeup helps ensure that our results here are not driven by mechanical timing differences, because we have a finite amount of postdeadline data (six months). We measure postdeadline takeup using takeup after the short deadline (two weeks), after the medium deadline (four weeks), and after the long deadline (six weeks). We define these outcomes for each member of the sample, regardless of their own deadline length, to ensure that everyone in the sample has the same takeup window. Otherwise those with the short deadline mechanically have a longer postdeadline window, and if there is a positive secular probability of hazard into takeup status within the range our deadlines produce (five to six months), then this would mechanically push toward a decreasing relationship between deadline length and postdeadline takeup.
The deadline variables
The deadline variables are not jointly significant for any of the three measures. Across all three specifications only one of the nine deadline variables is significant at the 90% level. So there is little support for the hypothesis that deadlines affect postdeadline takeup. Again, though, our confidence intervals do not rule out economically significant effects. All in all, the results suggest that the demand-inducing option value of longer deadlines dominates in this setting. But our design is not sharp enough to rule out economically meaningful time management effects.

VI. CONCLUSIONS

Theories of advertising, and prior studies on framing, cues, and product presentation, suggest that advertising content can have important effects on consumer choice. Yet there is remarkably little field evidence on how much and what types of advertising content affect demand. We analyze a direct mail field experiment that simultaneously and independently randomized the advertising content, price, and deadline for actual loan offers made to former clients of a consumer lender in South Africa. We find that advertising content had statistically significant effects on takeup. There is some evidence that these content effects are economically large relative to price effects. Consumer response to advertising content does not seem to have been driven by substitution across lenders, and there is no evidence that it produced adverse selection. Deadline length trumped both advertising content and price in economic importance, and we found no systematic evidence of time management problems.

Our design and results leave many questions unanswered and suggest directions for future research. First, we found it difficult to predict ex ante which advertising content or deadline treatments would affect demand, and some prior findings did not carry over to the present context. This fits with a central premise of psychology—context matters—and suggests that pinning down the effects that will matter in various market contexts might
require systematic field experimentation on a broad scale. But our paper also highlights a weakness of field experiments: real-world settings can mean low takeup rates, and hence a high cost for obtaining the statistical power needed to test some hypotheses of interest. Future advertising experiments should strive for larger sample sizes (as in Ausubel [1999]) and/or settings with higher takeup rates, and use the additional power to design tests for combinations of treatments, including interactions between advertising and price.

Another unresolved question is why advertising (“creative”) content matters. In the taxonomy of the economics of advertising literature, the question is whether content is informative, complementary to preferences, and/or persuasive. A related question from psychology is how advertising affects consumers cognitively. In our setting, we speculate that advertising content operated via intuitive rather than deliberative processes. This fits with the nature of our advertised product (an intermediate good), the fact that little new information or novel arguments were likely in this context, and the experience level of consumers in the sample. But we emphasize that our design was not sufficiently rich to sharply identify the mechanisms underlying the content effects.

It will also be fruitful in the future to study consumer choice in conjunction with the strategies of firms that provide and frame choice sets. A literature on industrial organization with “behavioral” or “boundedly rational” consumers is just beginning to (re-)emerge (DellaVigna and Malmendier 2004; Ellison 2006; Gabaix and Laibson 2006; Barr, Mullainathan, and Shafir 2008), and there should be gains from trade between this literature and related ones on the economics of advertising and the psychology of consumer choice.

UNIVERSITY OF CHICAGO GRADUATE SCHOOL OF BUSINESS, NBER, AND CEPR
YALE UNIVERSITY, INNOVATIONS FOR POVERTY ACTION, AND THE JAMEEL POVERTY ACTION LAB
HARVARD UNIVERSITY, INNOVATIONS FOR POVERTY ACTION, AND THE JAMEEL POVERTY ACTION LAB
PRINCETON UNIVERSITY AND INNOVATIONS FOR POVERTY ACTION
DARTMOUTH COLLEGE AND INNOVATIONS FOR POVERTY ACTION
REFERENCES
Agarwal, Sumit, and Brent Ambrose, “Does It Pay to Read Your Junk Mail? Evidence of the Effect of Advertising on Financial Decisions,” Federal Reserve Bank of Chicago Working Paper, 2007.
Anderson, Eric, and Duncan Simester, “Does Demand Fall When Customers Perceive That Prices Are Unfair? The Case of Premium Pricing for Large Sizes,” Marketing Science, 27 (2008), 492–500.
Ariely, Dan, and Klaus Wertenbroch, “Procrastination, Deadlines, and Performance: Self-Control by Pre-Commitment,” Psychological Science, 13 (2002), 219–224.
Ausubel, Lawrence M., “Adverse Selection in the Credit Card Market,” Working Paper, University of Maryland, 1999.
Bagwell, Kyle, “The Economic Analysis of Advertising,” in Handbook of Industrial Organization, Vol. 3, Mark Armstrong and Rob Porter, eds. (Amsterdam: North-Holland, 2007).
Barr, Michael, Sendhil Mullainathan, and Eldar Shafir, “Behaviorally Informed Financial Services Regulation,” New America Foundation White Paper, 2008.
Chaiken, Shelly, and Yaacov Trope, eds., Dual-Process Theories in Social Psychology (New York: Guilford Press, 1999).
Chandy, Rajesh, Gerard Tellis, Deborah Macinnis, and Pattana Thaivanich, “What to Say When: Advertising Appeals in Evolving Markets,” Journal of Marketing Research, 38 (2001), 399–414.
Cialdini, Robert, Influence: Science and Practice, 4th ed. (Needham Heights, MA: Allyn and Bacon, 2001).
Cook, William, and Arthur Kover, “Research and the Meaning of Advertising Effectiveness: Mutual Misunderstandings,” in Measuring Advertising Effectiveness, W. Wells, ed. (Hillsdale, NJ: Lawrence Erlbaum Associates, 1997).
Day, George S., “Creating a Superior Customer-Relating Capability,” MIT Sloan Management Review, 44 (2003), 77–82.
DellaVigna, Stefano, “Psychology and Economics: Evidence from the Field,” Journal of Economic Literature, 47 (2009), 315–372.
DellaVigna, Stefano, and Ulrike Malmendier, “Contract Design and Self-Control: Theory and Evidence,” Quarterly Journal of Economics, 119 (2004), 353–402.
Ding, Min, Rajdeep Grewal, and John Liechty, “Incentive-Aligned Conjoint Analysis,” Journal of Marketing Research, 42 (2005), 67–82.
Doyle, Joseph, David O’Connor, Gareth Reynolds, and Paul Bottomley, “The Robustness of the Asymmetrically Dominated Effect: Buying Frames, Phantom Alternatives, and In-Store Purchases,” Psychology and Marketing, 16 (1999), 225–243.
Ellison, Glenn, “Bounded Rationality in Industrial Organization,” in Advances in Economics and Econometrics: Theory and Applications, Richard Blundell, Whitney Newey, and Torsten Persson, eds. (Cambridge, UK: Cambridge University Press, 2006).
Evans, Franklin B., “Selling as a Dyadic Relationship,” American Behavioral Scientist, 6 (1963), 76–79.
FinScope, “Survey Highlights: FinScope South Africa 2004,” in FinMark Trust: Making Financial Markets Work for the Poor (Vorna Valley, South Africa: FinMark Trust, 2004).
Gabaix, Xavier, and David Laibson, “Shrouded Attributes, Consumer Myopia, and Information Suppression in Competitive Markets,” Quarterly Journal of Economics, 121 (2006), 505–540.
Ganzach, Yoav, and Nili Karsahi, “Message Framing and Buying Behavior: A Field Experiment,” Journal of Business Research, 32 (1995), 11–17.
Huber, Joel, John Payne, and Christopher Puto, “Adding Asymmetrically Dominated Alternatives: Violations of Regularity and the Similarity Hypothesis,” Journal of Consumer Research, 9 (1982), 90–98.
Iyengar, Sheena, Gal Huberman, and Wei Jiang, “How Much Choice Is Too Much? Contributions to 401(k) Retirement Plans,” in Pension Design and Structure: New Lessons from Behavioral Finance, Olivia Mitchell and Stephen Utkus, eds. (Oxford, UK: Oxford University Press, 2004).
Iyengar, Sheena, and Mark Lepper, “When Choice Is Demotivating: Can One Desire Too Much of a Good Thing?” Journal of Personality and Social Psychology, 79 (2000), 995–1006.
Kahneman, Daniel, “Maps of Bounded Rationality: Psychology for Behavioral Economics,” American Economic Review, 93 (2003), 1449–1475.
Kahneman, Daniel, and Amos Tversky, “Prospect Theory: An Analysis of Decision under Risk,” Econometrica, 47 (1979), 263–291.
Karlan, Dean, and Jonathan Zinman, “Credit Elasticities in Less Developed Economies: Implications for Microfinance,” American Economic Review, 98 (2008), 1040–1068.
——, “Expanding Credit Access: Using Randomized Supply Decisions to Estimate the Impacts,” Review of Financial Studies, forthcoming a.
——, “Observing Unobservables: Identifying Information Asymmetries with a Consumer Credit Field Experiment,” Econometrica, forthcoming b.
Krieger, Abba, Paul Green, and Yoram Wind, Adventures in Conjoint Analysis: A Practitioner’s Guide to Trade-off Modeling and Applications (Philadelphia, PA: The Wharton School, 2004).
Krishnamurthi, Lakshman, and S.P. Raj, “The Effect of Advertising on Consumer Price Sensitivity,” Journal of Marketing Research, 22 (1985), 119–129.
Kroszner, Randall, “Creating More Effective Consumer Disclosures,” Speech at the George Washington University, May 23, 2007.
Landry, Craig, Andreas Lange, John List, Michael Price, and Nicholas Rupp, “Towards an Understanding of the Economics of Charity: Evidence from a Field Experiment,” Quarterly Journal of Economics, 121 (2006), 747–782.
Levitt, Steven, and John List, “What Do Laboratory Experiments Measuring Social Preferences Reveal about the Real World?” Journal of Economic Perspectives, 21 (2007), 153–174.
List, John, and David Lucking-Reiley, “The Effects of Seed Money and Refunds on Charitable Giving: Experimental Evidence from a University Capital Campaign,” Journal of Political Economy, 110 (2002), 215–233.
Lord, Charles G., Social Psychology (Fort Worth, TX: Harcourt Brace College Publishers, 1996).
Mandel, Naomi, and Eric Johnson, “When Web Pages Influence Choice: Effects of Visual Primes on Experts and Novices,” Journal of Consumer Research, 29 (2002), 235–245.
Mobius, Markus, and Tanya Rosenblat, “Why Beauty Matters,” American Economic Review, 96 (2006), 222–235.
Mullainathan, Sendhil, Joshua Schwartzstein, and Andrei Shleifer, “Coarse Thinking and Persuasion,” Quarterly Journal of Economics, 123 (2008), 577–619.
Petty, R.E., and J.T. Cacioppo, The Elaboration Likelihood Model of Persuasion (New York: Academic Press, 1986).
Petty, R.E., and D.T. Wegener, “The Elaboration Likelihood Model: Current Status and Controversies,” in Dual-Process Theories in Social Psychology, S. Chaiken and Y. Trope, eds. (New York: Guilford Press, 1999).
Priester, J.R., J. Godek, D.J. Nayakankuppum, and K. Park, “Brand Congruity and Comparative Advertising: When and Why Comparative Advertisements Lead to Greater Elaboration,” Journal of Consumer Psychology, 14 (2004), 115–123.
Rao, Vithala, “Developments in Conjoint Analysis,” in Handbook of Marketing Decision Models, Berend Wieranga, ed. (New York: Springer, 2008).
Redelmeier, Donald, and Eldar Shafir, “Medical Decision Making in Situations That Offer Multiple Alternatives,” Journal of the American Medical Association, 273 (1995), 302–305.
Rozin, P., and C. Nemeroff, “Sympathetic Magical Thinking: The Contagion and Similarity ‘Heuristics,’ ” in Heuristics and Biases: The Psychology of Intuitive Judgment, T. Gilovich, D. Griffin, and D. Kahneman, eds. (Cambridge, UK: Cambridge University Press, 2002).
Shafir, Eldar, Itamar Simonson, and Amos Tversky, “Reason-Based Choice,” Cognition, 49 (1993), 11–36.
Simester, Duncan, “Finally, Market Research You Can Use,” Harvard Business Review, 82 (2004), 20–21.
Simonson, Itamar, Ziv Carmon, and Sue O’Curry, “Experimental Evidence on the Negative Effect of Product Features and Sales Promotions on Brand Choice,” Marketing Science, 13 (1994), 23–40.
Slovic, Paul, Melissa Finucane, Ellen Peters, and Donald G. MacGregor, “The Affect Heuristic,” in Heuristics and Biases: The Psychology of Intuitive Judgment, T. Gilovich, D. Griffin, and D. Kahneman, eds. (Cambridge, UK: Cambridge University Press, 2002).
Stango, Victor, and Jonathan Zinman, “Fuzzy Math, Disclosure Regulation, and Credit Market Outcomes,” unpublished, 2009.
Stanovich, K.E., and R.F. West, “Individual Differences in Reasoning: Implications for the Rationality Debate,” in Heuristics and Biases: The Psychology of Intuitive Judgment, T. Gilovich, D. Griffin, and D. Kahneman, eds. (Cambridge, UK: Cambridge University Press, 2002).
Stewart, David, “Speculations on the Future of Advertising Research,” Journal of Advertising, 21 (1992), 1–46.
Stigler, George, The Theory of Price, 4th ed. (New York: MacMillan, 1987).
Stigler, George, and Gary Becker, “De Gustibus Non Est Disputandum,” American Economic Review, 67 (1977), 76–90.
Stone, Bob, and Ron Jacobs, Successful Direct Marketing Methods, 7th ed. (New York: McGraw-Hill, 2001).
Tversky, Amos, and Daniel Kahneman, “Loss Aversion in Riskless Choice: A Reference Dependent Model,” Quarterly Journal of Economics, 106 (1991), 1039–1061.
Wells, William, “Discovery-Oriented Consumer Research,” Journal of Consumer Research, 19 (1993), 489–504.
Winer, Russell, “Experimentation in the 21st Century: The Importance of External Validity,” Journal of the Academy of Marketing Sciences, 27 (1999), 349–358.
DID SECURITIZATION LEAD TO LAX SCREENING? EVIDENCE FROM SUBPRIME LOANS∗ BENJAMIN J. KEYS TANMOY MUKHERJEE AMIT SERU VIKRANT VIG A central question surrounding the current subprime crisis is whether the securitization process reduced the incentives of financial intermediaries to carefully screen borrowers. We examine this issue empirically using data on securitized subprime mortgage loan contracts in the United States. We exploit a specific rule of thumb in the lending market to generate exogenous variation in the ease of securitization and compare the composition and performance of lenders’ portfolios around the ad hoc threshold. Conditional on being securitized, the portfolio with greater ease of securitization defaults by around 10%–25% more than a similar risk profile group with a lesser ease of securitization. We conduct additional analyses to rule out differential selection by market participants around the threshold and lenders employing an optimal screening cutoff unrelated to securitization as alternative explanations. The results are confined to loans where intermediaries’ screening effort may be relevant and soft information about borrowers determines their creditworthiness. Our findings suggest that existing securitization practices did adversely affect the screening incentives of subprime lenders.
I. INTRODUCTION

Securitization, converting illiquid assets into liquid securities, has grown tremendously in recent years, with the universe of securitized mortgage loans reaching $3.6 trillion in 2006.

∗ We thank Viral Acharya, Effi Benmelech, Patrick Bolton, Daniel Bergstresser, Charles Calomiris, Douglas Diamond, John DiNardo, Charles Goodhart, Edward Glaeser, Dwight Jaffee, Chris James, Anil Kashyap, Jose Liberti, Gregor Matvos, Chris Mayer, Donald Morgan, Adair Morse, Daniel Paravisini, Karen Pence, Guillaume Plantin, Manju Puri, Mitch Petersen, Raghuram Rajan, Uday Rajan, Adriano Rampini, Joshua Rauh, Chester Spatt, Steve Schaefer, Henri Servaes, Morten Sorensen, Jeremy Stein, James Vickery, Annette Vissing-Jorgensen, Paul Willen, three anonymous referees, and seminar participants at Boston College, Columbia Law, Duke, the Federal Reserve Bank of Philadelphia, the Federal Reserve Board of Governors, the London Business School, the London School of Economics, Michigan State, NYU Law, Northwestern, Oxford, Princeton, Standard and Poor’s, the University of Chicago Applied Economics Lunch, and the University of Chicago Finance Lunch for useful discussions. We also thank numerous conference participants for their comments. Seru thanks the Initiative on Global Markets at the University of Chicago for financial support. The opinions expressed in the paper are those of the authors and do not reflect the views of the Board of Governors of the Federal Reserve System or Sorin Capital Management. Shu Zhang provided excellent research assistance. All remaining errors are our responsibility.
[email protected], tmukherjee@sorincapital.com, [email protected], [email protected]. © 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, February 2010
The option to sell loans to investors has transformed the traditional role of financial intermediaries in the mortgage market from “buying and holding” to “buying and selling.” The perceived benefits of this financial innovation, such as improving risk sharing and reducing banks’ cost of capital, are widely cited (e.g., Pennacchi [1988]). However, delinquencies in the heavily securitized subprime housing market increased by 50% from 2005 to 2007, forcing many mortgage lenders out of business and setting off a wave of financial crises, which spread worldwide. In light of the central role of the subprime mortgage market in the current crisis, critiques of the securitization process have gained increased prominence (Blinder 2007; Stiglitz 2007).

The rationale for concern over the “originate-to-distribute” model during the crisis derives from theories of financial intermediation. Delegating monitoring to a single lender avoids the duplication, coordination failure, and free-rider problems associated with multiple lenders (Diamond 1984). However, for a lender to screen and monitor, it must be given appropriate incentives (Holmström and Tirole 1997), and this is provided by the illiquid loans on its balance sheet (Diamond and Rajan 2003). By creating distance between a loan’s originator and the bearer of the loan’s default risk, securitization may have reduced lenders’ incentives to carefully screen and monitor borrowers (Petersen and Rajan 2002). On the other hand, proponents of securitization argue that reputation concerns, regulatory oversight, or sufficient balance sheet risk may have prevented moral hazard on the part of lenders. What effect existing securitization practices had on screening thus remains an empirical question.

This paper investigates the relationship between securitization and screening standards in the context of subprime mortgage loans. The challenge in making a causal claim is the difficulty of isolating differences in loan outcomes independent of contract and borrower characteristics. First, in any cross section of loans, those that are securitized may differ on observable and unobservable risk characteristics from loans that are kept on the balance sheet (not securitized). Second, in a time-series framework, simply documenting a correlation between securitization rates and defaults may be insufficient. This inference relies on establishing the optimal level of defaults at any given point in time. Moreover, this approach ignores macroeconomic factors and policy initiatives that may be independent of lax screening and yet may induce compositional differences in mortgage borrowers over time. For instance,
house price appreciation and the changing role of government-sponsored enterprises (GSEs) in the subprime market may also have accelerated the trend toward originating mortgages to riskier borrowers in exchange for higher payments.

We overcome these challenges by exploiting a specific rule of thumb in the lending market that induces exogenous variation in the ease of securitization of a loan compared to another loan with similar observable characteristics. This rule of thumb is based on the summary measure of borrower credit quality known as the FICO score. Since the mid-1990s, the FICO score has become the credit indicator most widely used by lenders, rating agencies, and investors. Underwriting guidelines established by the GSEs, Fannie Mae and Freddie Mac, standardized purchases of lenders’ mortgage loans. These guidelines cautioned against lending to risky borrowers, the most prominent rule of thumb being not lending to borrowers with FICO scores below 620 (Avery et al. 1996; Loesch 1996; Calomiris and Mason 1999; Freddie Mac 2001, 2007; Capone 2002).1 Whereas the GSEs actively securitized loans when the nascent subprime market was relatively small, since 2000 this role has shifted entirely to investment banks and hedge funds (the nonagency sector). We argue that persistent adherence to this ad hoc cutoff by investors who purchase securitized pools from nonagencies generates a differential increase in the ease of securitization for loans. That is, loans made to borrowers who fall just above the 620 credit cutoff have a higher unconditional likelihood of being securitized and are therefore more liquid than loans below this cutoff. To evaluate the effect of securitization on screening decisions, we examine the performance of loans originated by lenders around this threshold.

As an example of our design, consider two borrowers, one with a FICO score of 621 (620+) and the other with a FICO score of 619 (620−), who approach the lender for a loan. Screening to evaluate the quality of the loan applicant involves collecting both “hard” information, such as the credit score, and “soft” information, such as a measure of future income stability of the borrower. Hard information, by definition, is something that is easy to contract upon (and transmit), whereas the lender has to exert an unobservable effort to collect soft information (Stein 2002).

1. We discuss the 620 rule of thumb in more detail in Section III and in reference to other cutoffs in the lending market in Section IV.G.
We argue that the lender has a weaker incentive to base origination decisions on both hard and soft information, less carefully screening the borrower, at 620+, where there is an increase in the relative ease of securitization. In other words, because investors purchase securitized loans based on hard information, the cost of collecting soft information is internalized by lenders when screening borrowers at 620+ to a lesser extent than at 620−. Therefore, by comparing the portfolio of loans on either side of the credit score threshold, we can assess whether differential access to securitization led to changes in the behavior of lenders who offered these loans to consumers with nearly identical risk profiles.

Using a sample of more than one million home purchase loans during the period 2001–2006, we empirically confirm that the number of loans securitized varies systematically around the 620 FICO cutoff. For loans with a potential for significant soft information—low documentation loans—we find that there are more than twice as many loans securitized above the credit threshold at 620+ as below the threshold at 620−. Because the FICO score distribution in the population is smooth (constructed from a logistic function; see Figure I), the underlying creditworthiness and demand for mortgage loans (at a given price) are the same for prospective buyers with a credit score of either 620− or 620+. Therefore, these differences in the number of loans confirm that the unconditional probability of securitization is higher above the FICO threshold; that is, it is easier to securitize 620+ loans. Strikingly, we find that although 620+ loans should be of slightly better credit quality than those at 620−, low-documentation loans that are originated above the credit threshold tend to default within two years of origination at a rate 10%–25% higher than the mean default rate of 5% (which amounts to roughly a 0.5%–1% increase in delinquencies). As this result is conditional on observable loan and borrower characteristics, the only remaining difference between the loans around the threshold is the increased ease of securitization. Therefore, the greater default probability of loans above the credit threshold must be due to a reduction in screening by lenders.

Because our results are conditional on securitization, we conduct additional analyses to address selection on the part of borrowers, lenders, or investors as explanations for differences in the performance of loans around the credit threshold. First, we rule out borrower selection on observables, as the loan terms and borrower characteristics are smooth across the FICO score threshold. Next, selection of loans by investors is mitigated because the
[Figure I appears here: density of FICO scores in the U.S. population (horizontal axis: FICO, 400–900; vertical axis: density, 0–0.003).]
FIGURE I FICO Distribution (U.S. Population) The figure presents the FICO distribution in the U.S. population for 2004. The data are from an anonymous credit bureau, which assures us that the data exhibit similar patterns during the other years of our sample. The FICO distribution across the population is smooth, so the number of prospective borrowers in the local vicinity of a given credit score is similar.
decisions of investors (special purpose vehicles, SPVs) are based on the same (smooth through the threshold) loan and borrower variables as in our data (Kornfeld 2007). Finally, strategic adverse selection on the part of lenders may also be a concern. However, lenders offer the entire pool of loans to investors, and, conditional on observables, SPVs largely follow a randomized selection rule to create bundles of loans out of these pools, suggesting that securitized loans would look similar to those that remain on the balance sheet (Comptroller’s Handbook 1997; Gorton and Souleles 2006). Furthermore, if at all present, this selection will tend to be more severe below the threshold, thereby biasing the results against our finding any screening effect. We also constrain our analysis to a subset of lenders who are not susceptible to strategic securitization of loans. The results for these lenders are qualitatively similar to the findings using the full sample, highlighting that screening is the driving force behind our results.
Could the 620 threshold be set by lenders as an optimal cutoff for screening that is unrelated to differential securitization? We investigate further using a natural experiment in the passage and subsequent repeal of antipredatory laws in New Jersey (2002) and Georgia (2003) that varied the ease of securitization around the threshold. If lenders used 620 as an optimal cutoff for screening unrelated to securitization, we would expect the passage of these laws to have no effect on the differential screening standards around the threshold. However, if these laws affected the differential ease of securitization around the threshold, our hypothesis would predict an impact on the screening standards. Our results confirm that the discontinuity in the number of loans around the threshold diminished during a period of strict enforcement of antipredatory lending laws. In addition, there was a rapid return of a discontinuity after the law was revoked. Importantly, our performance results follow the same pattern, that is, screening differentials attenuated only during the period of enforcement. Taken together, this evidence suggests that our results are indeed related to differential securitization at the credit threshold and that lenders did not follow the rule of thumb in all instances. Importantly, the natural experiment also suggests that prime-influenced selection is not at play.

Once we have confirmed that lenders are screening more rigorously at 620− than at 620+, we assess whether borrowers were aware of the differential screening around the threshold. Although there is no difference in contract terms around the cutoff, borrowers may have an incentive to manipulate their credit scores in order to take advantage of differential screening around the threshold (consistent with our central claim). Aside from outright fraud, it is difficult to strategically manipulate one’s FICO score in a targeted manner, and any actions to improve one’s score take relatively long periods of time, on the order of three to six months (Fair Isaac). Nonetheless, we investigate further, using the same natural experiment to evaluate the performance effects over a relatively short time horizon. The results reveal a rapid return of a discontinuity in loan performance around the 620 threshold, which suggests that rather than manipulation, our results are largely driven by differential screening on the part of lenders.

As a test of the role of soft information in screening incentives of lenders, we investigate the full documentation loan market. These loans have potentially significant hard information
because complete background information about the borrower’s ability to repay is provided. In this market, we identify another credit cutoff, a FICO score of 600, based on the advice of the three credit repositories. We find that twice as many full documentation loans are securitized above the credit threshold at 600+ as below the threshold at 600− . Interestingly, however, we find no significant difference in default rates of full documentation loans originated around this credit threshold. This result suggests that despite a difference in ease of securitization across the threshold, differences in the returns to screening are attenuated due to the presence of more hard information. Our findings for full documentation loans suggest that the role of soft information is crucial to understanding what worked and what did not in the existing securitized subprime loan market. We discuss this issue in more detail in Section VI. This paper connects several strands of the literature. Our evidence sheds new light on the subprime housing crisis, as discussed in the contemporaneous work of Doms, Furlong, and Krainer (2007), Gerardi, Shapiro, and Willen (2007), Dell’Ariccia, Igan, and Laeven (2008), Mayer, Piskorski, and Tchistyi (2008), Rajan, Seru, and Vig (2008), Benmelech and Dlugosz (2009), Mian and Sufi (2009), and Demyanyk and Van Hemert (2010).2 This paper also speaks to the literature that discusses the benefits (Kashyap and Stein 2000; Loutskina and Strahan 2007), and the costs (Morrison 2005; Parlour and Plantin 2008) of securitization. In a related line of research, Drucker and Mayer (2008) document how underwriters exploit inside information to their advantage in secondary mortgage markets, and Gorton and Pennacchi (1995), Sufi (2006), and Drucker and Puri (2009) investigate how contract terms are structured to mitigate some of these agency conflicts.3 The rest of the paper is organized as follows. Section II provides a brief overview of lending in the subprime market and describes the data and sample construction. Section III discusses the framework and empirical methodology used in the paper, whereas Sections IV and V present the empirical results in the paper. Section VI concludes. 2. For thorough summaries of the subprime mortgage crisis and the research which has sought to explain it, see Mayer and Pence (2008) and Mayer, Pence, and Sherlund (2009). 3. Our paper also sheds light on the classic liquidity/incentives trade-off that is at the core of the financial contracting literature (see Coffee [1991], Diamond and Rajan [2003], Aghion, Bolton, and Tirole [2004], and DeMarzo and Urosevic [2006]).
II. LENDING IN THE SUBPRIME MORTGAGE MARKET

II.A. Background

Approximately 60% of outstanding U.S. mortgage debt is traded in mortgage-backed securities (MBS), making the U.S. secondary mortgage market the largest fixed-income market in the world (Chomsisengphet and Pennington-Cross 2006). The bulk of this securitized universe ($3.6 trillion outstanding as of January 2006) is composed of agency pass-through pools—those issued by Freddie Mac, Fannie Mae, and Ginnie Mae. The remainder, approximately $2.1 trillion as of January 2006, has been securitized in nonagency securities. Although the nonagency MBS market is relatively small as a percentage of all U.S. mortgage debt, it is nevertheless large on an absolute dollar basis. The two markets are separated based on the eligibility criteria of loans that the GSEs have established. Broadly, agency eligibility is established on the basis of loan size, credit score, and underwriting standards.

Unlike the agency market, the nonagency (referred to as “subprime” in the paper) market was not always this size. This market gained momentum in the mid- to late 1990s. Inside B&C Lending—a publication that covers subprime mortgage lending extensively—reports that total subprime lending (B&C originations) grew from $65 billion in 1995 to $500 billion in 2005. Growth in mortgage-backed securities led to an increase in securitization rates (the ratio of the dollar value of loans securitized to the dollar value of loans originated) from less than 30% in 1995 to over 80% in 2006.

From the borrower’s perspective, the primary feature distinguishing between prime and subprime loans is that the up-front and continuing costs are higher for subprime loans.4

4. Up-front costs include application fees, appraisal fees, and other fees associated with originating a mortgage. The continuing costs include mortgage insurance payments, principal and interest payments, late fees for delinquent payments, and fees levied by a locality (such as property taxes and special assessments).

The subprime mortgage market actively prices loans based on the risk associated with the borrower. Specifically, the interest rate on the loan depends on credit scores, debt-to-income ratios, and the documentation level of the borrower. In addition, the exact pricing may depend on loan-to-value ratios (the amount of equity of the borrower), the length of the loan, the flexibility of the interest rate (adjustable, fixed, or hybrid), the lien position, the property
type, and whether stipulations are made for any prepayment penalties.5 For investors who hold the eventual mortgage-backed security, credit risk in the agency sector is mitigated by an implicit or explicit government guarantee, but subprime securities have no such guarantee. Instead, credit enhancement for nonagency deals is in most cases provided internally by means of a deal structure that bundles loans into “tranches,” or segments of the overall portfolio (Lucas, Goodman, and Fabozzi 2006).

II.B. Data

Our primary data set contains individual loan data leased from LoanPerformance. The database is the only source that provides a detailed perspective on the nonagency securities market. The data include information on issuers, broker dealers/deal underwriters, servicers, master servicers, bond and trust administrators, trustees, and other third parties. As of December 2006, the database included more than eight thousand home equity and nonprime loan pools (over seven thousand active), comprising 16.5 million loans (more than seven million active) with over $1.6 trillion in outstanding balances. LoanPerformance estimates that as of 2006, the data cover over 90% of the subprime loans that are securitized.6 The data set includes all standard loan application variables such as the loan amount, term, LTV ratio, credit score, and interest rate type—all data elements that are disclosed and form the basis of contracts in nonagency securitized mortgage pools.

We now describe some of these variables in more detail. For our purpose, the most important piece of information about a particular loan is the creditworthiness of the borrower. The borrower’s credit quality is captured by a summary measure called the FICO score.

5. For example, the rate and underwriting matrix of Countrywide Home Loans Inc., a leading lender of prime and subprime loans, shows how the credit score of the borrower and the loan-to-value ratio are used to determine the rates at which different documentation-level loans are made (www.countrywide.com).

6. Note that only loans that are securitized are reported in the LoanPerformance database. Communication with the database provider suggests that the roughly 10% of loans that are not reported are excluded because of privacy concerns on the part of lenders. Importantly for our purpose, the exclusion is not based on any selection criteria that the vendor follows (e.g., loan characteristics or borrower characteristics). Moreover, based on estimates provided by LoanPerformance, the total number of nonagency loans securitized relative to all loans originated has increased from about 65% in early 2000 to over 92% since 2004.

FICO scores are calculated using various measures of credit history, such as types of credit in use and
amount of outstanding debt, but do not include any information about a borrower’s income or assets (Fishelson-Holstein 2005). The software used to generate the score from individual credit reports is licensed by the Fair Isaac Corporation to the three major credit repositories—TransUnion, Experian, and Equifax. These repositories, in turn, sell FICO scores and credit reports to lenders and consumers. FICO scores provide a ranking of potential borrowers by the probability of having some negative credit event in the next two years. Probabilities are rescaled into a range of 400–900, though nearly all scores are between 500 and 800, with a higher score implying a lower probability of a negative event. The negative credit events foreshadowed by the FICO score can be as small as one missed payment or as large as bankruptcy. Borrowers with lower scores are proportionally more likely to have all types of negative credit events than are borrowers with higher scores. FICO scores have been found to be accurate even for low-income and minority populations (see Fair Isaac website www.myfico.com; also see Chomsisengphet and Pennington-Cross [2006]).

More importantly, the applicability of scores available at loan origination extends reliably up to two years. By design, FICO measures the probability of a negative credit event over a two-year horizon. Mortgage lenders, on the other hand, are interested in credit risk over a much longer period of time. The continued acceptance of FICO scores in automated underwriting systems indicates that there is a level of comfort with their value in determining lifetime default probability differences.7 Keeping this as a backdrop, most of our tests of borrower default will examine the default rates up to 24 months from the time the loan is originated.

7. An econometric study by Freddie Mac researchers showed that the predictive power of FICO scores drops by about 25% once one moves to a three to five–year performance window (Holloway, MacDonald, and Straka 1993). FICO scores are still predictive, but do not contribute as much to the default rate probability equation after the first two years.

Borrower quality can also be gauged by the level of documentation collected by the lender when taking the loan. The documents collected provide historical and current information about the income and assets of the borrower. Documentation in the market (and reported in the database) is categorized as full, limited, or no documentation. Borrowers with full documentation provide verification of income as well as assets. Borrowers with limited documentation provide no information about their income but do
provide some information about their assets. “No-documentation” borrowers provide no information about income or assets, which is a very rare degree of screening lenience on the part of lenders. In our analysis, we combine limited and no-documentation borrowers and call them low-documentation borrowers. Our results are unchanged if we remove the very small portion of loans that are no-documentation.

Finally, there is also information about the property being financed by the borrower, and the purpose of the loan. Specifically, we have information on the type of mortgage loan (fixed rate, adjustable rate, balloon, or hybrid) and the loan-to-value (LTV) ratio of the loan, which measures the amount of the loan expressed as a percentage of the value of the home. Typically loans are classified as either for purchase or refinance, though for convenience we focus exclusively on loans for home purchases.8 Information about the geography where the dwelling is located (ZIP code) is also available in the database.9

Most of the loans in our sample are for owner-occupied single-family residences, townhouses, or condominiums (single-unit loans account for more than 90% of the loans in our sample). Therefore, to ensure reasonable comparisons, we restrict the loans in our sample to these groups. We also drop nonconventional properties, such as those that are FHA- or VA-insured or pledged properties, and also exclude buy-down mortgages. We also exclude Alt-A loans, because the coverage for these loans in the database is limited. Only those loans with valid FICO scores are used in our sample. We conduct our analysis for the period January 2001 to December 2006, because the securitization market in the subprime market grew to a meaningful size post-2000 (Gramlich 2007).

8. We find similar rules of thumb and default outcomes in the refinance market.

9. See Keys et al. (2009) for a discussion of the interaction of securitization and variation in regulation, driven by the geography of loans and the type of lender.

III. FRAMEWORK AND METHODOLOGY

When a borrower approaches a lender for a mortgage loan, the lender asks the borrower to fill out a credit application. In addition, the lender obtains the borrower’s credit report from the three credit bureaus. Part of the background information on the application and report could be considered “hard” information (e.g.,
the FICO score of the borrower), whereas the rest is “soft” (e.g., a measure of future income stability of the borrower, how many years of documentation were provided by the borrower, joint income status) in the sense that it is less easy to summarize on a legal contract. The lender expends effort to process the soft and hard information about the borrower and, based on this assessment, offers a menu of contracts to the borrower. Subsequently, the borrower decides to accept or decline the loan contract offered by the lender. Once a loan contract has been accepted, the loan can be sold as part of a securitized pool to investors. Notably, only the hard information about the borrower (FICO score) and the contractual terms (e.g., LTV ratio, interest rate) are used by investors when buying these loans as part of a securitized pool.10 In fact, the variables about the borrowers and the loan terms in the LoanPerformance database are identical to those used by investors and rating agencies to rate tranches of the securitized pool. Therefore, although lenders are compensated for the hard information about the borrower, the incentive for lenders to process soft information critically depends on whether they have to bear the risk of loans they originate (Gorton and Pennacchi 1995; Parlour and Plantin 2008; Rajan, Seru, and Vig 2008). The central claim in this paper is that lenders are less likely to expend effort to process soft information as the ease of securitization increases. We exploit a specific rule of thumb at the FICO score of 620 that makes securitization of loans more likely if a certain FICO score threshold is attained. Historically, this score was established as a minimum threshold in the mid-1990s by Fannie Mae and Freddie Mac in their guidelines on loan eligibility (Avery et al. 1996; Capone 2002). Guidelines by Freddie Mac suggest that FICO scores below 620 are placed in the Cautious Review Category, and Freddie Mac considers a score below 620 “as a strong indication that the borrower’s credit reputation is not acceptable” (Freddie Mac 2001, 2007).11 This is also reflected in Fair Isaac’s statement, “. . . those agencies [Fannie Mae and Freddie Mac], which buy mortgages from banks and resell them to investors, 10. See Testimony of Warren Kornfeld, Managing Director of Moodys Investors Service, before the subcommittee on Financial Institutions and Consumer Credit, U.S. House of Representatives, May 8, 2007. 11. These guidelines appeared at least as far back as 1995 in a letter by the Executive Vice President of Freddie Mac (Michael K. Stamper) to the CEOs and credit officers of all Freddie Mac sellers and servicers (see Online Appendix Exhibit 1).
have indicated to lenders that any consumer with a FICO score above 620 is good, while consumers below 620 should result in further inquiry from the lender. . . . ” Although the GSEs actively securitized loans when the nascent subprime market was relatively small, this role shifted entirely to investment banks and hedge funds (the nonagency sector) in recent times (Gramlich 2007). We argue that adherence to this cutoff by subprime MBS investors, following the advice of GSEs, generates an increase in demand for securitized loans that are just above the credit cutoff relative to loans below this cutoff. There is widespread evidence that is consistent with 620 being a rule of thumb in the securitized subprime lending market. For instance, rating agencies (Fitch and Standard and Poor’s) used this cutoff to determine default probabilities of loans when rating mortgage-backed securities with subprime collateral (Loesch 1996; Temkin, Johnson, and Levy 2002). Similarly, Calomiris and Mason (1999) survey the high-risk mortgage loan market and find 620 as a rule of thumb for subprime loans. We also confirmed this view by conducting a survey of origination matrices used by several of the top fifty originators in the subprime market (a list obtained from Inside B&C Lending; these lenders amount to about 70% of loan volume). The credit threshold of 620 was used by nearly all the lenders. Because investors purchase securitized loans based on hard information, our assertion is that the cost of collecting soft information is internalized by lenders when screening borrowers at 620− to a greater extent than at 620+ . There is widespread anecdotal evidence that lenders in the subprime market review both soft and hard information more carefully for borrowers with credit scores below 620. For instance, the website of Advantage Mortgage, a subprime securitized loan originator, claims that “. . . all loans with credit scores below 620 require a second level review. . . . There are no exceptions, regardless of the strengths of the collateral or capacity components of the loan.”12 By focusing on the lender as a unit of observation, we attempt to learn about the differential impact ease of securitization had on the behavior of lenders around the cutoff. To begin with, our tests empirically identify a statistical discontinuity in the distribution of loans securitized around the credit threshold of 620. In order to do so, we show that the number 12. This position for loans below 620 is reflected in lending guidelines of numerous other subprime lenders.
of loans securitized dramatically increases when we move along the FICO distribution from 620− to 620+. We argue that this is equivalent to showing that the unconditional probability of securitization increases as one moves from 620− to 620+. To see this, denote N_s^{620+} and N_s^{620−} as the numbers of loans securitized at 620+ and 620−, respectively. Showing that N_s^{620+} > N_s^{620−} is equivalent to showing that N_s^{620+}/N_p > N_s^{620−}/N_p, where N_p is the number of prospective borrowers at 620+ or 620−. If we assume that the numbers of prospective borrowers at 620+ and 620− are similar, that is, N_p^{620−} ≈ N_p^{620+} = N_p (a reasonable assumption, as discussed below), then the unconditional probability of securitization is higher at 620+. We refer to the difference in these unconditional probabilities as the differential ease of securitization around the threshold. Notably, our assertion of differential screening by lenders does not rely on knowledge of the proportion of prospective borrowers that applied, were rejected, or were held on the lenders’ balance sheet. We simply require that lenders are aware that a prospective borrower at 620+ has a higher likelihood of eventual securitization.

We measure the extent of the jump by using techniques that are commonly used in the literature on regression discontinuity (e.g., see DiNardo and Lee [2004]; Card, Mas, and Rothstein [2008]). Specifically, we collapse the data on each FICO score (500–800) i and estimate equations of the form

(1)    Y_i = α + β T_i + θ f(FICO(i)) + δ T_i · f(FICO(i)) + ε_i,

where Y_i is the number of loans at FICO score i, T_i is an indicator that takes a value of 1 at FICO ≥ 620 and a value of 0 if FICO < 620, and ε_i is a mean-zero error term. f(FICO) and T · f(FICO) are flexible seventh-order polynomials, with the goal of these functions being to fit the smoothed curves on either side of the cutoff as closely to the data presented in the figures as possible.13 f(FICO) is estimated from 620− to the left, and T · f(FICO) is estimated from 620+ to the right. The magnitude of the discontinuity, β, is estimated by the difference in these two smoothed functions evaluated at the cutoff.

13. We have also estimated these functions of the FICO score using third-order and fifth-order polynomials in FICO, as well as relaxing parametric assumptions and estimating using local linear regression. The estimates throughout are not sensitive to the specification of these functions. In Section IV, we also examine the size and power of the test using the seventh-order polynomial specification following the approach of Card, Mas, and Rothstein (2008).

The data are re-centered such that FICO = 620 corresponds to “0”; thus at the
cutoff the polynomials are evaluated at 0 and drop out of the calculation, which allows β to be interpreted as the magnitude of the discontinuity at the FICO threshold. This coefficient should be interpreted locally in the immediate vicinity of the credit score threshold.

After documenting a large jump at the ad hoc credit thresholds, we focus on the performance of the loans around these thresholds. We evaluate the performance of the loans by examining the default probability of loans—that is, whether or not the loan defaulted t months after it was originated. If lenders screen similarly for the loan of 620+ credit quality and the loan of 620− credit quality, there should not be any discernible differences in default rates of these loans. Our maintained claim is that any differences in default rates on either side of the cutoff, after controlling for hard information, should be due only to the impact that securitization has on lenders’ screening standards.

This claim relies on several identification assumptions. First, as we approach the cutoff from either side, any differences in the characteristics of prospective borrowers are assumed to be random. This implies that the underlying creditworthiness and the demand for mortgage loans (at a given price) are the same for prospective buyers with a credit score of 620− or 620+. This seems reasonable as it amounts to saying that the calculation Fair Isaac performs (using a logistic function) to generate credit scores has a random error component around any specific score. Figure I shows the FICO distribution in the U.S. population in 2004. These data are from an anonymous credit bureau that assures us that the data exhibit similar patterns during the other years of our sample. Note that the FICO distribution across the population is smooth, so the number of prospective borrowers across a given credit score is similar (in the example above, N_p^{620−} ≈ N_p^{620+} = N_p).

Second, we assume that screening is costly for the lender. The collection of information—hard systematic data (e.g., FICO score) as well as soft information (e.g., joint income status) about the creditworthiness of the borrower—requires time and effort by loan officers. If lenders did not have to expend resources to collect information, it would be difficult to argue that the differences in performance we estimate are a result of ease of securitization around the credit threshold affecting banks’ incentives to screen and monitor. Again, this seems to be a reasonable assumption (see Gorton and Pennacchi [1995]).

Note that our discussion thus far has assumed that there is no explicit manipulation of FICO scores by the lenders or borrowers.
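To make the estimation in equation (1) concrete, the sketch below shows one way the discontinuity β could be computed from loan-level data. It is our illustration rather than the authors' code: the DataFrame loans and the column name fico are assumptions, and the running variable is rescaled purely for numerical stability (rescaling does not change the size of the jump at the cutoff).

```python
# Sketch of equation (1): collapse securitized loans to counts per FICO score,
# then fit flexible seventh-order polynomials on either side of the 620 cutoff
# and read off the discontinuity beta as the jump in the fitted values at 620.
import numpy as np
import pandas as pd

def estimate_discontinuity(loans: pd.DataFrame, cutoff: int = 620, order: int = 7) -> float:
    fico_grid = np.arange(500, 801)          # FICO scores assumed to be integers
    counts = loans.loc[loans["fico"].between(500, 800)].groupby("fico").size()
    y = counts.reindex(fico_grid, fill_value=0).to_numpy(dtype=float)  # Y_i

    x = (fico_grid - cutoff) / 100.0         # re-centered (and rescaled) FICO
    T = (fico_grid >= cutoff).astype(float)  # T_i = 1{FICO >= cutoff}

    # Columns: constant, T, x, ..., x^7, T*x, ..., T*x^7
    X = np.column_stack([np.ones_like(x), T]
                        + [x ** k for k in range(1, order + 1)]
                        + [T * x ** k for k in range(1, order + 1)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]  # at x = 0 the polynomials vanish, so this is the jump

# e.g., beta_2004 = estimate_discontinuity(loans[loans["year"] == 2004])
```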
TABLE I
SUMMARY STATISTICS

Panel A: Summary statistics by year

                     Low documentation                     Full documentation
          Number of    Mean loan-    Mean       Number of    Mean loan-    Mean
Year      loans        to-value      FICO       loans        to-value      FICO
2001       35,427        81.4        630         101,056       85.7        604
2002       53,275        83.9        646         109,226       86.4        613
2003      124,039        85.2        657         194,827       88.1        624
2004      249,298        86.0        658         361,455       87.0        626
2005      344,308        85.5        659         449,417       86.9        623
2006      270,751        86.3        655         344,069       87.5        621

Panel B: Summary statistics of key variables

                                   Low documentation         Full documentation
                                   Mean      Std. dev.       Mean      Std. dev.
Average loan size ($000)          189.4       132.8         148.5       116.9
FICO score                        656.0        50.0         621.5        51.9
Loan-to-value ratio                85.6         9.8          87.1         9.9
Initial interest rate               8.3         1.8           8.2         1.9
ARM (%)                            48.5        50.0          52.7        49.9
Prepayment penalty (%)             72.1        44.8          74.7        43.4

Notes. Information on subprime home purchase loans comes from LoanPerformance. Sample period is 2001–2006. See text for sample selection.
However, the borrower may have incentives to do so if loan contracts or screening differ around the threshold. Our analysis in Section IV.F focuses on a natural experiment and shows that the effects of securitization on performance are not being driven by strategic manipulation. IV. MAIN EMPIRICAL RESULTS IV.A. Descriptive Statistics As noted earlier, the nonagency market differs from the agency market on three dimensions: FICO scores, loan-to-value ratios, and the amount of documentation asked of the borrower. We next look at the descriptive statistics of our sample, with special emphasis on these dimensions. Our analysis uses more than one million loans across the period 2001 to 2006. As mentioned earlier, the nonagency securitization market has grown dramatically since 2000, which is apparent in Panel A of Table I, which
shows the number of subprime loans securitized across years. These patterns are similar to those described in Gramlich (2007) and Demyanyk and Van Hemert (2010). The market has witnessed an increase in the number of loans with reduced hard information in the form of limited or no documentation. Note that whereas limited documentation provides no information about income but does provide some information about assets, a no-documentation loan provides information about neither income nor assets. In our analysis we combine both types of loans (limited and no documentation) and denote them as low-documentation loans. The full-documentation market grew by 445% from 2001 to 2005, whereas the number of low-documentation loans grew by 972%.

We find similar trends for loan-to-value ratios and FICO scores in the two documentation groups. LTV ratios have gone up over time, as borrowers have put less and less equity into their homes when financing loans. This increase is consistent with a greater willingness of market participants to absorb risk. In fact, this is often considered the bright side of securitization—borrowers are able to borrow at better credit terms because risk is being borne by investors who can bear more risk than individual banks. Panel A also shows that average FICO scores of individuals who access the subprime market have been increasing over time. The mean FICO score among low-documentation borrowers increased from 630 in 2001 to 655 in 2006. This increase in average FICO scores is consistent with the rule of thumb leading to a larger expansion of the market above the 620 threshold. Average LTV ratios are lower and FICO scores higher for the low-documentation as compared to the full-documentation sample. This possibly reflects the additional uncertainty lenders have about the quality of low-documentation borrowers.

Panel B compares the low- and full-documentation segments of the subprime market on a number of the explanatory variables used in the analysis. Low-documentation loans are on average larger and are given to borrowers with higher credit scores than loans where full information on income and assets is provided. However, the two groups of loans have similar contract terms such as interest rate, loan-to-value, prepayment penalties, and whether the interest rate is adjustable or not. Our analysis below focuses first on the low-documentation segment of the market; we explore the full-documentation market in Section V.
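For readers who want to reproduce this kind of summary from the loan-level data, a minimal sketch in the spirit of Table I is given below; the DataFrame loans and its column names (year, low_doc, ltv, fico, loan_amount, initial_rate) are assumptions, not fields the authors name.

```python
# Sketch: Table I-style summaries. Panel A collapses loans by origination year
# and documentation group; Panel B reports means and standard deviations of a
# few key loan characteristics by documentation group.
import pandas as pd

def summarize(loans: pd.DataFrame):
    panel_a = (loans.groupby(["year", "low_doc"])
                    .agg(n_loans=("fico", "size"),
                         mean_ltv=("ltv", "mean"),
                         mean_fico=("fico", "mean"))
                    .unstack("low_doc"))
    key_vars = ["loan_amount", "fico", "ltv", "initial_rate"]
    panel_b = loans.groupby("low_doc")[key_vars].agg(["mean", "std"])
    return panel_a, panel_b
```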
IV.B. Establishing the Rule of Thumb
We first present results that show that large differences exist in the number of low-documentation loans that are securitized around the credit threshold we described earlier. We then examine whether this jump in securitization has any consequences for the subsequent performance of the loans above and below this credit threshold.
As mentioned in Section III, the rule of thumb in the lending market impacts the ease of securitization around a credit score of 620. We therefore expect to see a substantial increase in the number of loans just above this credit threshold as compared to the number of loans just below it. In order to examine this, we start by plotting the number of loans at each FICO score in the two documentation categories around the credit cutoff of 620 across years, starting with 2001 and ending in 2006. As can be seen from Figure II, there is a marked increase in the number of low-documentation loans at 620+ relative to the number of loans at 620−. We do not find any such jump for full-documentation loans at a FICO score of 620.14 Given this evidence, we focus on the 620 credit threshold for low-documentation loans.
From Figure II, it is clear that the number of loans sees roughly a 100% jump in 2004 for low-documentation loans across the credit score of 620—there are twice as many loans securitized at 620+ as at 620−. Clearly, this is consistent with the hypothesis that the ease of securitization is higher at 620+ than at scores just below this credit cutoff. To estimate the jumps in the number of loans, we use the methods described in Section III and the specification provided in equation (1). As reported in Table II, we find that low-documentation loans see a dramatic increase above the credit threshold of 620. In particular, the coefficient estimate (β) is significant at the 1% level and is on average around 110% (from 73% to 193%) higher for 620+ as compared to 620− for loans during the sample period. For instance, in 2001, the estimated discontinuity in Panel A is 85. The average number of low-documentation loans at each FICO score in 2001 is 117. The ratio is around 73%. These jumps are plainly visible from the yearly graphs in Figure II.
In addition, we conduct permutation tests (or “randomization” tests), where we vary the location of the discontinuity (Ti)
14. We will elaborate more on full-documentation loans in Section V.
FIGURE II
Number of Loans (Low-Documentation)
The figure presents the data for the number of low-documentation loans (in ’00s). We plot the average number of loans at each FICO score between 500 and 800. As can be seen from the graphs, there is a large increase in the number of loans around the 620 credit threshold (i.e., more loans at 620+ as compared to 620−) from 2001 onward. Data are for loans originated between 2001 and 2006.
TABLE II
DISCONTINUITY IN NUMBER OF LOW-DOCUMENTATION LOANS

Year     FICO ≥ 620 (β)    t-stat     Observations    R2      Mean
2001           36.83       (2.10)         299          .96      117
2002          124.41       (6.31)         299          .98      177
2003          354.75       (8.61)         299          .98      413
2004          737.01       (7.30)         299          .98      831
2005        1,721.64      (11.78)         299          .99    1,148
2006        1,716.49       (6.69)         299          .97      903
Pooled estimate (t-stat) [permutation test p-value]: 781.87 (4.14) [.003]
Notes. This table reports estimates from a regression that uses the number of low-documentation loans at each FICO score as the dependent variable. In order to estimate the discontinuity (FICO ≥ 620) for each year, we collapse the number of loans at each FICO score and estimate flexible seventh-order polynomials on either side of the 620 cutoff, allowing for a discontinuity at 620. We report t-statistics in parentheses. Permutation tests, which allow for a discontinuity at every point in the FICO distribution, confirm that jumps for each year are significantly larger than those found elsewhere in the distribution (see Section IV.B for more details). For brevity, we report a permutation test estimate from pooled regressions with time fixed effects removed to account for vintage effects. FICO = 620 has the smallest permutation test p-value (and is thus the largest outlier) among all the visible discontinuities in our sample.
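For readers who want to see the mechanics, the sketch below illustrates the two estimation steps summarized in the notes to Table II: the equation (1) regression of loan counts on an indicator for FICO ≥ 620 with flexible seventh-order polynomials on either side of the cutoff, and the permutation test that re-estimates the jump at every other FICO score to form a counterfactual distribution. The column names, the candidate-cutoff range, and the use of pandas/statsmodels are assumptions made for the illustration; this is a sketch of the approach, not the authors' code.

```python
# Illustrative sketch (not the authors' code) of the Table II estimates:
# (i) the equation (1) discontinuity regression at FICO = 620, and
# (ii) the permutation test that treats every FICO score as a candidate cutoff.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def estimate_jump(counts: pd.DataFrame, cutoff: int) -> float:
    """counts: one row per FICO score, with columns `fico` and `n_loans`.
    Fits flexible seventh-order polynomials on either side of `cutoff` and
    returns the coefficient on the indicator for FICO >= cutoff."""
    d = counts.assign(above=(counts["fico"] >= cutoff).astype(int),
                      x=(counts["fico"] - cutoff) / 100.0)  # rescaled for conditioning
    poly = " + ".join(f"I(x**{k}) + I(above*x**{k})" for k in range(1, 8))
    return smf.ols(f"n_loans ~ above + {poly}", data=d).fit().params["above"]

def permutation_pvalue(counts, true_cutoff=620, candidates=range(520, 781)):
    """Re-estimate the jump at every candidate cutoff and compare the jump at
    620 to the counterfactual distribution, using its asymptotic normality."""
    jumps = {c: estimate_jump(counts, c) for c in candidates}
    counterfactual = np.array([v for c, v in jumps.items() if c != true_cutoff])
    z = (jumps[true_cutoff] - counterfactual.mean()) / counterfactual.std()
    return 2 * stats.norm.sf(abs(z))  # two-sided p-value

# Hypothetical usage on securitized low-documentation loans for one year:
# counts = loans.groupby("fico").size().rename("n_loans").reset_index()
# beta = estimate_jump(counts, 620)
# p_value = permutation_pvalue(counts)  # the paper pools years and removes year effects first
```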
across the range of all possible FICO scores and reestimate equation (1). The test treats every value of the FICO distribution as a potential discontinuity and estimates the magnitude of the observed discontinuity at each point, forming a counterfactual distribution of discontinuity estimates. This is equivalent to a bootstrapping procedure that varies the cutoff but does not resample the order of the points in the distribution (Johnston and DiNardo 1996). We then compare the estimated discontinuity at 620 to this counterfactual distribution, construct a test statistic based on the asymptotic normality of the counterfactual distribution, and report the p-value from this test. The null hypothesis is that the estimated discontinuity at a FICO score of 620 is the mean of the 300 possible discontinuities.15
The precision of the permutation test is limited by the number of observations used at each FICO score. As a result, regressions that pool across years provide the greatest power for statistical testing. While constructing the counterfactuals, we therefore use pooled specifications with year fixed effects removed to account for differences in vintage. The result of this test, shown in Table II, confirms that the estimate at 620 for low-documentation loans is a strong outlier relative to the estimated jumps at other locations in the distribution. The estimated discontinuity when the years are pooled together is 780 loans, with a permutation test p-value of .003. In summary, if the underlying creditworthiness and the demand for mortgage loans are the same for potential buyers with a credit score of 620− or 620+, this result confirms that it is easier to securitize loans above the FICO threshold.
15. In unreported tests, we also conduct a falsification simulation exercise following Card, Mas, and Rothstein (2008). In particular, we apply our specification to data generated by a continuous process. We reject the null hypothesis of no effect (using a two-sided 5% test) in 6.0% of the simulations, indicating that the size of our test is reasonable. A similar test with data generated by a discontinuous process suggests that the power of our test is also reasonable. We reject the null of no effect in about 92% of the simulations (in a two-sided 5% test) in this case.
IV.C. Contract Terms and Borrower Demographics
Before examining the subsequent performance of loans around the credit threshold, we first assess whether there are any differences in hard information—either in contract terms or in other borrower characteristics—around this threshold. The endogeneity of contractual terms based on the riskiness of borrowers may lead to different contracts and hence different types of borrowers obtaining loans around the threshold in a systematic way.
FIGURE III
Interest Rates (Low-Documentation)
The figure presents the data for the interest rate (in %) on low-documentation loans. We plot average interest rates on loans at each FICO score between 500 and 800. As can be seen from the graphs, there is no change in interest rates around the 620 credit threshold (i.e., rates at 620+ are similar to those at 620−) from 2001 onward. Data are for loans originated between 2001 and 2006.
Though we control for possible contract differences when we evaluate the performance of loans, it is nevertheless instructive to examine whether borrower and contract terms also systematically differ around the credit threshold. We start by examining the contract terms—LTV ratios and interest rates—across the credit threshold. Figures III and IV show the distributions of interest rates and LTV ratios offered on low-documentation loans across the FICO spectrum. As is apparent, these loan terms are very similar—that is, we find no differences in contract terms for low-documentation loans above and below the 620 credit score. We test this formally using an approach equivalent to equation (1), replacing the dependent variable Yi in the regression framework with the contract terms (loan-to-value ratios and interest rates), and present the results in Appendix I.A. Our results suggest that there is no difference in loan terms across the credit threshold. For instance, for low-documentation loans originated in 2006, the average loan-to-value ratio across the
FIGURE IV
Loan-to-Value Ratio (Low-Documentation)
The figure presents the data for the loan-to-value ratio (in %) on low-documentation loans. We plot average loan-to-value ratios on loans at each FICO score between 500 and 800. As can be seen from the graphs, there is no change in loan-to-value ratios around the 620 credit threshold (i.e., ratios at 620+ are similar to those at 620−) from 2001 onward. Data are for loans originated between 2001 and 2006.
collapsed FICO spectrum is 85%, whereas our estimated discontinuity is only −1.05%, a 1.2% difference. Similarly for the interest rate: for low-documentation loans originated in 2005, the average interest rate is 8.2%, and the difference on either side of the credit score cutoff is only about −0.091%, a roughly 1% difference. Permutation tests reported in Appendix I.D confirm that these differences are not outliers relative to the estimated jumps at other locations in the distribution. Additional contract terms, such as the presence of a prepayment penalty and whether the loan is an ARM, FRM, or interest-only/balloon loan, are also similar across the 620 threshold (results not shown). In addition, if loans have second liens, then a combined LTV (CLTV) ratio is calculated. We find no difference in the CLTV ratios around the threshold for those borrowers with more than one lien on the home. Finally, low-documentation loans often do not require that borrowers provide information about their
FIGURE V
Median Household Income (Low-Documentation)
The figure presents the median household income (in ’000s) of the ZIP codes in which loans are made, at each FICO score between 500 and 800. As can be seen from the graphs, there is no change in median household income around the 620 credit threshold (i.e., incomes at 620+ are similar to those at 620−) from 2001 onward. We plotted similar distributions for the average percentage of minorities taking loans and average house size and found no differences around the credit threshold. Data are for loans originated between 2001 and 2006.
income, so only a subset of our sample provides a debt-to-income (DTI) ratio for the borrowers. Among this subsample, there is no difference in DTI across the 620 threshold in low-documentation loans. For brevity, we report only the permutation tests for these contract terms in Appendix I.D. Next, we examine whether the characteristics of borrowers differ systematically across the credit threshold. In order to evaluate this, we look at the distribution of the population of borrowers across the FICO spectrum for low-documentation loans. The data on borrower demographics come from Census 2000 and are at the ZIP code level. As can be seen from Figure V, median household incomes of the ZIP codes of borrowers around the credit thresholds look very similar for low-documentation loans. We plotted similar distributions for average percent minorities residing in the ZIP code and average house value in the ZIP code across the FICO
spectrum (unreported) and again find no differences across the credit threshold.16 We use the same specification as equation (1), this time with the borrower demographic characteristics as dependent variables, and present the results formally in Appendix I.B. Consistent with the patterns in the figures, permutation tests (unreported) reveal no differences in borrower demographic characteristics around the credit score threshold. Overall, our results indicate that observable characteristics of loans and borrowers are not different around the credit threshold.
16. Of course, because the census data are at the ZIP code level, we are to some extent smoothing our distributions. We note, however, that when we conduct our analysis on differences in the number of loans (from Section IV.B), aggregated at the ZIP code level, we still find jumps across the credit threshold within each individual ZIP code.
IV.D. Performance of Loans
We now focus on the performance of loans that are originated close to the credit score threshold. Note that our analysis in Section IV.C suggests that there is no difference in observable hard information, either about contract terms or about borrower demographic characteristics, across the credit score threshold. Nevertheless, we will control for these differences when evaluating the subsequent performance of loans in our logit regressions. If there is any remaining difference in the performance of the loans above and below the credit threshold, it can be attributed to differences in unobservable soft information about the loans.
We estimate the differences in default rates on either side of the cutoff using the same framework as equation (1), using the dollar-weighted fraction of loans defaulted within ten to fifteen months of origination as the dependent variable, Yi. This fraction is calculated as the dollar amount of unpaid loans in default divided by the total dollar amount originated in the same cohort. We classify a loan as under default if any of the following conditions is true: (a) payments on the loan are 60+ days late, as defined by the Office of Thrift Supervision; (b) the loan is in foreclosure; or (c) the loan is real estate owned (REO), that is, the bank has retaken possession of the home.17
17. Although two different definitions of delinquency are used in the industry (the Mortgage Bankers Association (MBA) definition and the Office of Thrift Supervision (OTS) definition), we have followed the more stringent OTS definition. Whereas the MBA starts counting the days a loan has been delinquent from the time a payment is missed, the OTS counts the days a loan is delinquent starting one month after the first payment is missed.
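As a concrete illustration of how this dependent variable might be constructed (a sketch under assumed column names, not the authors' code), the snippet below flags a loan as in default under the definition above and computes the dollar-weighted default fraction by one-point FICO bin from a snapshot taken ten to fifteen months after origination.

```python
# Illustrative sketch (not the authors' code): the dollar-weighted fraction of
# loans in default, by FICO score, from a snapshot of each loan observed
# ten to fifteen months after origination.
import pandas as pd

def dollar_weighted_default(snapshot: pd.DataFrame) -> pd.Series:
    """Assumed columns: fico, days_late, in_foreclosure, is_reo,
    unpaid_balance, orig_amount (one row per loan)."""
    in_default = ((snapshot["days_late"] >= 60)   # 60+ days late (OTS definition)
                  | snapshot["in_foreclosure"]    # loan in foreclosure
                  | snapshot["is_reo"])           # real estate owned
    defaulted_balance = snapshot["unpaid_balance"].where(in_default, 0.0)
    grouped = snapshot.assign(defaulted_balance=defaulted_balance).groupby("fico")
    # Unpaid balance of defaulted loans divided by total dollars originated
    # in the same cohort, for each one-point FICO bin.
    return grouped["defaulted_balance"].sum() / grouped["orig_amount"].sum()

# Hypothetical usage: y = dollar_weighted_default(cohort_2005_snapshot)
```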
We collapse the data into one-point FICO bins and estimate seventh-order polynomials on either side of the threshold for each year. By estimating the magnitude of β in each year separately, we ensure that no single cohort (or vintage) of loans drives our results. As shown in Figures VI.A to VI.F, the low-documentation loans exhibit discontinuities in default rates at the FICO score of 620. Year-by-year estimates are presented in Panel A of Table III. Contrary to what one might expect, around the credit threshold we find that loans with higher credit scores actually default more often than loans with lower credit scores in the post-2000 period. In particular, for loans originated in 2005, the estimate of β is 0.023 (t-stat = 2.10) and the mean delinquency rate is 0.078, suggesting a 29% increase in defaults to the right of the credit score cutoff. Similarly, in 2006, the estimated size of the jump is 0.044 (t-stat = 2.68) and the mean delinquency rate for all FICO bins is 0.155, which is again a 29% increase in defaults around the FICO score threshold.
Panel B presents results of permutation tests, estimated on the residuals obtained after pooling delinquency rates across years and removing year effects. Besides the 60+ late delinquency definition used in Panel A, we also classify a loan as in default if it is 90+ days late in payments and, separately, if it is in foreclosure or REO. Our approach yields similar, if not stronger, results. Compared to 620− loans, 620+ loans are on average 2.8 percentage points more likely to be 90+ days in arrears and 2.5 percentage points more likely to be in foreclosure or REO. Permutation test p-values confirm that the jumps in defaults at 620 under all the definitions of default are extreme outliers relative to the rest of the delinquency distribution. For instance, with default defined as foreclosure/REO, the p-value for the discontinuity at 620 is .004. That we find similar results using different default definitions is consistent with high levels of rollover, whereby loans that are delinquent continue to reach deeper levels of delinquency. As shown in Online Appendix Table 1, more than 80% of loans that are 60 days delinquent reach 90+ days delinquent within a year, and 66% of loans that are 90 days delinquent reach foreclosure twelve months later in the low-documentation market.
Although the previous default definitions were dollar-weighted, we also use the raw number of loans in default to estimate the magnitude of the discontinuity in loan performance around the FICO threshold. The unweighted results with 60+ delinquency are also presented in Panel B and continue to exhibit a pattern of higher credit scores leading to higher default rates across the
FIGURE VI
Annual Delinquencies for Low-Documentation Loans Originated in 2001–2006
The figures present the percentage of low-documentation loans originated in 2001 (A), 2002 (B), 2003 (C), 2004 (D), 2005 (E), and 2006 (F) that became delinquent. We plot the dollar-weighted fractions of the pools that become delinquent for one-point FICO bins between scores of 500 and 750. The vertical lines denote the 620 cutoff, and a seventh-order polynomial is fitted to the data on either side of the threshold. Delinquencies are reported between ten and fifteen months for loans originated in the year.
TABLE III
DELINQUENCIES IN LOW-DOCUMENTATION LOANS AROUND THE CREDIT THRESHOLD

Panel A: Dollar-weighted fraction of loans defaulted (60+ delinquent)
Year     FICO ≥ 620 (β)    t-stat     Observations    R2      Mean
2001          0.005        (0.44)         254          .58    0.053
2002          0.010        (2.24)         254          .75    0.051
2003          0.022        (3.47)         254          .83    0.043
2004          0.013        (1.86)         254          .79    0.049
2005          0.023        (2.10)         254          .81    0.078
2006          0.044        (2.68)         253          .57    0.155

Panel B: Permutation tests for alternative default definitions (pooled 2001–2006 with time fixed effects)
Dependent variable (default definition)   FICO ≥ 620 (β)   t-stat   Permutation test p-value   Observations   R2     Mean
60+ (dollar-weighted)                          0.019        (3.32)            .020                 1523        .66    0.072
90+ (dollar-weighted)                          0.028        (4.67)            .006                 1525        .70    0.065
Foreclosure+ (dollar-weighted)                 0.025        (6.25)            .004                 1525        .71    0.048
60+ (unweighted)                               0.025        (5.00)            .004                 1525        .65    0.073

Panel C: Delinquency status of loans, Pr(delinquency) = 1
                                    (1)           (2)           (3)           (4)
FICO ≥ 620                         0.12          0.48          0.12          0.48
                                 [0.004]       [0.011]       [0.004]       [0.011]
                                  (3.42)        (2.46)        (2.10)        (2.48)
Observations                   1,393,655     1,393,655     1,393,655     1,393,655
Pseudo R2                         .088          .116          .088          .116
Other controls                     Yes           Yes           Yes           Yes
FICO ≥ 620 ∗ other controls         No           Yes            No           Yes
Time fixed effects                  No           Yes            No           Yes
Clustering unit                  Loan ID       Loan ID       Vintage       Vintage
Mean delinquency (%)                                4.45
Notes. In Panel A, we estimate the differences in default rates using a flexible seventh-order polynomial on either side of the 620 cutoff, allowing for a discontinuity at 620. The 60+ dollar-weighted fraction of loans defaulted within 10–15 months is the dependent variable. In Panel B, we present estimates from permutation tests from pooled regressions with time fixed effects removed to account for vintage effects using specification similar to Panel A. Permutation tests confirm that the discontinuity at 620 has the smallest p-value (and is thus largest outlier) in our sample. We use alternative definitions of defaults as the dependent variable. In Panel C, we estimate differences in default rates on either side of the 620 FICO cut off using a logit regression. The dependent variable is the delinquency status of a loan in a given month that takes a value 1 if the loan is classified as under default, as defined in the text. Controls include borrower and loan terms discussed in Section IV. t-statistics are reported in parentheses (marginal effects are reported in square brackets).
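A minimal sketch of the loan-month logit behind Panel C, assuming hypothetical variable names (T, delinquent, loan_id, and so on) rather than the authors' actual data layout: the delinquency indicator is regressed on the FICO ≥ 620 indicator, the controls and their interactions with that indicator, and origination-year fixed effects, with standard errors clustered at the loan level.

```python
# Illustrative sketch (not the authors' code) of the Panel C logit on
# loan-month observations for FICO scores 615-624. T = 1 for 620-624 and
# 0 for 615-619; `delinquent` = 1 in months when the loan is 60+ days late,
# in foreclosure, or REO. Assumes no missing values in the listed columns.
import statsmodels.formula.api as smf

def panel_c_logit(panel):
    controls = ["fico", "interest_rate", "ltv", "is_arm",
                "age_11_20", "age_over_20"]          # loan-age dummies (0-10 months omitted)
    rhs = " + ".join(["T"] + controls
                     + [f"T:{c}" for c in controls]  # interactions with the indicator
                     + ["C(orig_year)"])             # origination-year fixed effects
    model = smf.logit(f"delinquent ~ {rhs}", data=panel)
    # Standard errors clustered at the loan level (multiple months per loan).
    return model.fit(cov_type="cluster", cov_kwds={"groups": panel["loan_id"]})

# Hypothetical usage:
# result = panel_c_logit(loan_months)
# result.get_margeff().summary()  # marginal effects, as reported in square brackets
```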
FIGURE VII
Delinquencies for Low-Documentation Loans (2001–2006)
The figure presents the percentage of low-documentation loans (dollar-weighted) originated between 2001 and 2006 that subsequently became delinquent. We track loans in two FICO buckets—615–619 (620−), dashed, and 620–624 (620+), solid—from their origination date and plot the average percentage of loans that become delinquent each month after the origination date. As can be seen, the higher credit score bucket defaults more than the lower credit score bucket for the post-2000 period. For brevity, we do not report plots separately for each year of origination. The effects shown here in the pooled 2001–2006 plot are apparent in every year.
620 threshold. In fact, the results are statistically stronger than the 60+ weighted results, with a permutation test p-value based on the pooled estimates of .004 and a discontinuity estimate that is significant in all years (unreported; see Online Appendix Figure 4).
To show how delinquency rates evolve over the age of the loan, in Figure VII we plot the delinquency rates of 620+ and 620− low-documentation loans (dollar-weighted) by loan age. As discussed earlier, we restrict our analysis to about two years after the loan has been originated. As can be seen from the figure, the differences in the delinquency rates are stark. The differences begin around four months after the loans have been originated and persist up to two years. The differences in default rates are also large in magnitude. Those with a credit score of
620− are about 20% less likely to default after a year as compared to loans with credit score 620+.18
18. Note that Figure VII does not plot cumulative delinquencies. As loans are paid out, say after a foreclosure, the unpaid balance for these loans falls relative to the time when they entered into a 60+ state. This explains the dip in delinquencies in the figure after about twenty months. Our results are similar if we plot cumulative delinquencies, or delinquencies that are calculated using the unweighted number of loans. Also note that the fact that we find no delinquencies early on in the duration of the loan is not surprising, given that originators are required to take back loans on their books if the loans default within three months.
An alternative methodology is to measure the performance of each loan (unweighted) by tracking whether or not it became delinquent and to estimate logit regressions of the following form:
(2) Y_ikt = α + β T_it + γ_1 X_ikt + δ_1 (T_it × X_ikt) + μ_t + ε_ikt.
This logistic approach complements the regression discontinuity framework, as we restrict the sample to the ten FICO points in the immediate vicinity of 620 in order to maintain the same local interpretation of the RD results. Moreover, we are also able to directly control for the possibly endogenous loan terms around the threshold. The dependent variable is an indicator variable (Delinquency) for loan i originated in year t that takes a value of 1 if the loan is classified as under default in month k after origination, as defined above. We drop the loan from the regression once it is paid out after reaching the REO state. T takes the value 1 if FICO is between 620 and 624, and 0 if it is between 615 and 619 for low-documentation loans, thus restricting the analysis to the immediate vicinity of the cutoff. Controls include FICO scores, the interest rate on the loan, the loan-to-value ratio, and borrower demographic variables, as well as interactions of these variables with T. We also include a dummy variable for the type of loan (adjustable or fixed rate mortgage). We control for the possible nonlinear effect of the age of the loan on defaults by including three dummy variables, which take a value of 1 if the months since origination are 0–10, 11–20, and more than 20, respectively. Year of origination fixed effects are included in the estimation, and standard errors are clustered at the loan level to account for multiple loan delinquency observations in the data.
As can be seen from the logit coefficients in Panel C of Table III, results from this regression are qualitatively similar to those reported in the figures. In particular, we find that β is positive when we estimate the regressions for low-documentation loans. The economic magnitudes are similar to those in the
figures as well. For instance, keeping all other variables at their mean levels, low-documentation loans with credit score 620− are about 10%–25% less likely to default after a year than low-documentation loans with credit score 620+. These are large magnitudes—for instance, note that the mean delinquency rate for low-documentation loans is around 4.45%; the economic magnitude of the effects in column (2) suggests that the difference in the absolute delinquency rate between loans around the credit threshold is around 0.5%–1% for low-documentation loans.19
19. Our logistic specification is equivalent to a hazard model if we drop loans as soon as they hit the first indicator of delinquency (sixty days in default) and include a full set of duration dummies. Doing so does not change the nature of our results.
To account for the possibility that lax screening might be correlated across different loans within the same vintage, we cluster the loans for each vintage and report the results in columns (3) and (4). Note that the RD regressions (Panel A), estimated separately by year, also alleviate concerns about correlated errors across different loans within the same vintage.
In the mortgage market, the other way for loans to leave the pool is to be repaid in full through refinancing or outright purchase, known as prepayment. This prepayment risk decreases the return to investing in mortgage-backed securities in a manner similar to default risk (see, e.g., Gerardi, Shapiro, and Willen [2007] and Mayer, Piskorski, and Tchistyi [2008]). To assess whether there are any differences in actual prepayments around the 620 threshold, we plot the prepayment seasoning curve for all years 2001–2006 in Figure VIII. As can be observed, prepayments of 620+ and 620− borrowers in the low-documentation market are similar (also see the permutation test in Appendix I.D). Nevertheless, to formally account for prepayment rates, we also estimate a competing risk model using both prepayment and default as means of exiting the sample. We use the Cox proportional hazard model based on the econometric specification of Deng, Quigley, and Van Order (2000). In unreported tests (Online Appendix Table 6), we find results that are similar to our logistic specification.
Finally, the reported specification uses five-point bins of FICO scores around the threshold, but the results are similar (though less precise) if we restrict the bins to fewer FICO scores on either side of 620 (Online Appendix Table 2). This issue is also fully addressed by the regression discontinuity results reported in Panels A and B, which use individual FICO score bins as the units of
FIGURE VIII
Actual Prepayments for Low-Documentation Loans (2001–2006)
The figure presents the percentage of low-documentation loans (dollar-weighted) originated between 2001 and 2006 that subsequently were prepaid. We track loans in two FICO buckets—615–619 (620−), dashed, and 620–624 (620+), solid—from their origination dates and plot the average loans that prepaid each month after the origination date. As can be seen, there are no differences in prepayments between the higher and lower credit score buckets. For brevity, we do not report plots separately for each year of origination. The effects shown here in the pooled 2001–2006 plot are apparent in every year.
observation. In sum, we find that even after controlling for all observable characteristics of the loan contracts and borrowers, loans made to borrowers with higher FICO scores perform worse around the credit threshold.
IV.E. Selection Concerns
Because our results are conditional on securitization, we conduct additional analyses to address the possibility that selection on the part of borrowers, investors, or lenders explains the differences in the performance of loans around the credit threshold. First, contract terms offered to borrowers above the credit threshold might differ from those below the threshold and attract a riskier pool of borrowers. If this were the case, it would not be surprising if the loans above the credit threshold performed worse than those below it. As shown in Section IV.C, loan terms are smooth through the FICO score threshold. We also investigate the loan terms in
more detail than in Section IV.C by examining the distribution of interest rates and loan-to-value ratios of contracts offered around 620 for low-documentation loans. Figure IX.A depicts the Epanechnikov kernel density of the interest rate on low-documentation loans in the year 2004 for two FICO groups—620− (615–619) and 620+ (620–624). The distributions of interest rates observed in the two groups lie directly on top of one another. A Kolmogorov–Smirnov test cannot reject the equality of the two distribution functions at the 1% level. Similarly, Figure IX.B depicts the density of LTV ratios on low-documentation loans in the year 2004 for the 620− and 620+ groups. Again, a Kolmogorov–Smirnov test cannot reject the equality of the two distribution functions at the 1% level. The fact that borrower characteristics are similar around the threshold (Section IV.C) also confirms that selection based on observables is unlikely to explain our results.20
Second, there might be concerns about selection of loans by investors. In particular, our results could be explained if investors could cherry-pick better loans below the threshold. The loan and borrower variables in our data are identical to the data upon which investors base their decisions (Kornfeld 2007). Furthermore, as shown in Section IV.C, these variables are smooth through the threshold, mitigating any concerns about selection by investors.21
Finally, strategic adverse selection on the part of lenders may also be a concern. Lenders could, for instance, keep loans of better quality on their balance sheets and offer only loans of worse quality to the investors. This concern is mitigated for several reasons.
20. The equality of interest rate distributions also rules out differences in the expected cost of capital across the threshold as an alternative explanation. For instance, lenders could originate riskier loans above the threshold only because the expected cost of capital was lower due to easier securitization. However, in a competitive market, the interest rates charged for these loans should reflect the riskiness of the borrowers. In that case, as mean interest rates above and below the threshold would be the same (Section IV.C), lenders must have added riskier borrowers above the threshold—resulting in a more dispersed interest rate distribution above the threshold. Our analysis in Figure IX.A shows that this is not the case.
21. An argument might also be made that banks screen similarly around the credit threshold but are able to sell portfolios of loans above and below the threshold to investors with different risk tolerance. If this were the case, it could potentially explain our results in Section IV.D. This does not seem likely. Because all the loans in our sample are securitized, our results on the performance of loans around the credit threshold are conditional on securitization. Moreover, securitized loans are sold to investors in pools that contain a mix of loans from the entire credit score spectrum. As a result, it is difficult to argue that loans of 620− are purchased by different investors as compared to loans of 620+.
FIGURE IX
Dispersion of (A) Interest Rates and (B) Loan-to-Value (Low-Documentation)
The figure depicts the Epanechnikov kernel density of the interest rate (A) and the loan-to-value ratio (B) for two FICO groups of low-documentation loans—620− (615–619) as the solid line and 620+ (620–624) as the dashed line. The bandwidth for the density estimation is selected using the plug-in formula of Sheather and Jones (1991). The figures show that the densities are similar for the two groups. A Kolmogorov–Smirnov test cannot reject the equality of the distribution functions at the 1% level. Data for loans originated in 2004 are reported here. We find similar patterns for 2001–2006 originations. We do not report those graphs, for brevity.
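The comparison in Figure IX can be approximated with standard tools, as in the sketch below; it substitutes SciPy's Gaussian kernel for the Epanechnikov kernel with a Sheather–Jones bandwidth used in the figure, and it assumes a loan-level data frame with fico and interest_rate columns, so it is an approximation of the procedure rather than a replication.

```python
# Illustrative sketch (not the authors' code): kernel densities and a
# two-sample Kolmogorov-Smirnov test for interest rates in the 620- and
# 620+ groups, in the spirit of Figure IX. SciPy's Gaussian kernel is used
# here instead of the Epanechnikov kernel with a Sheather-Jones bandwidth.
import numpy as np
from scipy import stats

def compare_rate_distributions(loans):
    """loans: DataFrame of low-documentation loans with columns `fico`
    and `interest_rate`."""
    below = loans.loc[loans["fico"].between(615, 619), "interest_rate"].to_numpy()
    above = loans.loc[loans["fico"].between(620, 624), "interest_rate"].to_numpy()

    grid = np.linspace(min(below.min(), above.min()),
                       max(below.max(), above.max()), 200)
    dens_below = stats.gaussian_kde(below)(grid)   # density for 620- loans
    dens_above = stats.gaussian_kde(above)(grid)   # density for 620+ loans

    ks_stat, p_value = stats.ks_2samp(below, above)  # equality of distributions
    return grid, dens_below, dens_above, ks_stat, p_value

# Hypothetical usage: grid, d_lo, d_hi, ks, p = compare_rate_distributions(lowdoc_2004)
```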
First, the securitization guidelines suggest that lenders offer the entire pool of loans to investors and that conditional on observables, SPVs largely follow a randomized selection rule to create bundles of loans. This suggests that securitized loans would look similar to those that remain on the balance sheet (Comptroller’s Handbook 1997; Gorton and Souleles 2006).22 In addition, this selection, if at all present, will tend to be more severe below the credit threshold, thereby biasing us against finding any effect of screening on performance. We conduct an additional test that also suggests that our results are not driven by selection on the part of lenders. Although banks may screen and then strategically hold loans on their balance sheets, independent lenders do not keep a portfolio of loans on their books. These lenders finance their operations entirely out of short-term warehouse lines of credit, have limited equity capital, and have no deposit base to absorb losses on loans that they originate (Gramlich 2007). Consequently, they have limited motives for strategically choosing which loans to sell to investors. However, because loans below the threshold are more difficult to securitize and thus are less liquid, these independent lenders still have strong incentives to differentially screen these loans to avoid losses. We focus on these lenders to isolate the effects of screening in our results on defaults (Section IV.D). To test this, we classify the lenders into two categories—banks (banks, subsidiaries, thrifts) and independents—and examine the performance results only for the sample of loans originated by independent lenders. It is difficult to identify all the lenders in the database because many of the lender names are abbreviated. In order to ensure that we are able to cover a majority of our sample, we classify the top fifty lenders (by origination volume) across the years in our sample period, based on a list from the publication “Inside B&C Mortgage.” In unreported results, we confirm that independent lenders also follow the rule of thumb for low-documentation loans. Moreover, low-documentation loans securitized by independents with credit scores of 620− are about 15% less likely to default after a year as 22. We confirmed this fact by examining a subset of loans held on the lenders’ balance sheets. The alternative data set covers the top ten servicers in the subprime market (more than 60% of the market) with details on performance and loan terms of loans that are securitized or held on the lenders’ balance sheet. We find no differences in the performance of loans that are securitized relative to those kept by lenders, around the 620 threshold. Results of this analysis are available upon request.
compared to low-documentation loans securitized by them with credit scores of 620+.23 Note that the results in the sample of loans originated by lenders without a strategic selling motive are similar in magnitude to those in the overall sample (which includes other lenders that screen and then may strategically sell). This finding highlights that screening is the driving force behind our results.
23. More specifically, in a specification similar to column (2) in Panel C of Table III, we find that the coefficient on the indicator T(FICO ≥ 620) is 0.67 (t = 3.21).
IV.F. Additional Variation from a Natural Experiment
Unrelated Optimal Rule of Thumb. So far we have worked under the assumption that the 620 threshold is related to securitization. One could plausibly argue, in the spirit of Baumol and Quandt (1964), that this rule of thumb could have been set by lenders as an optimal cutoff for screening, unrelated to differential securitization. Ruling this alternative out requires an examination of the effects of the threshold when the ease of securitization varies, everything else equal. To achieve this, we exploit a natural experiment involving the passage of anti–predatory lending laws in two states, which drastically reduced securitization in the subprime market. Subsequent to protests by market participants, the laws were substantially amended and the securitization market reverted to prelaw levels. We use these laws to examine how the main effects vary with the time-series variation in the ease of securitization around the threshold in the two states.
In October 2002, the Georgia Fair Lending Act (GFLA) went into effect, imposing anti–predatory lending restrictions that at the time were considered the toughest in the United States. The law allowed unlimited punitive damages when lenders did not comply with its provisions, and that liability extended to holders in due course. Once the GFLA was enacted, the market response was swift. Fitch, Moody's, and S&P were reluctant to rate securitized pools that included Georgia loans. In effect, the demand for the securitization of mortgage loans from Georgia fell drastically during this period. In response to these actions, the Georgia legislature amended the GFLA in early 2003. The amendments removed many of the GFLA's ambiguities and eliminated covered loans. Subsequent to April 2003, the market revived in Georgia.
Similarly, New Jersey enacted its law, the New Jersey Homeownership Security Act of 2002, with many provisions similar to those of the Georgia law. As in Georgia, lenders and rating agencies expressed concerns when the New Jersey law was passed and decided to substantially reduce the number of loans that were securitized in these markets. The Act was later amended in June 2004 in a way that relaxed requirements and eased lenders' concerns.
If lenders use 620 as an optimal cutoff for screening unrelated to securitization, we expect the passage of these laws to have no effect on the differential screening standards around the threshold. However, if these laws affect the differential ease of securitization around the threshold, our hypothesis would predict an impact on screening standards. As 620+ loans became relatively more difficult to securitize, lenders would internalize the cost of collecting soft information for these loans to a greater degree. Consequently, the screening differentials we observed earlier should attenuate during the period of enforcement. Moreover, we expect the results described in Section IV.D to appear only during the periods when the differential ease of securitization around the threshold was high, that is, before the law was passed and after the law was amended.
Our experimental design examines the ease of securitization and performance of loans above and below the credit threshold in both Georgia and New Jersey during the period when the securitization market was affected and compares it with the period before the law was passed and the period after the law was amended. To do so, we estimate equations (1) and (2) with an additional dummy variable that captures whether or not the law is in effect (NoLaw). We also include time fixed effects to control for any macroeconomic factors independent of the laws.
The results are striking. Panel A of Table IV confirms that the discontinuity in the number of loans around the threshold diminishes during the period of strict enforcement of the anti–predatory lending laws. In particular, the difference in the number of loans securitized around the credit threshold fell by around 95% during the period when the law was in force in Georgia and New Jersey. This effectively nullified any meaningful difference in the ease of securitization above the FICO threshold. Another intuitive way to see this is to compare these jumps in the number of loans with jumps in states that had housing profiles similar to those of Georgia and New Jersey before the law was passed (e.g., Texas in 2001). For instance, relative to the discontinuity in Texas, the jump during
TABLE IV
NUMBER OF LOANS AND DELINQUENCIES IN LOW-DOCUMENTATION LOANS ACROSS THE CREDIT THRESHOLD: EVIDENCE FROM A NATURAL EXPERIMENT

Panel A: Number of low-documentation loans
Year                 FICO ≥ 620 (β)    t-stat    Observations    R2     Mean
During law                10.71        (2.30)        294          .90      16
Pre and post law         211.50        (5.29)        299          .96     150

Panel B: Delinquency status of low-documentation loans, Pr(delinquency) = 1
                                Entire period 2001–2006      During law and six months after
                                   (1)           (2)               (3)           (4)
FICO ≥ 620                       −0.91         −0.91             −1.02         −1.02
                                [0.043]       [0.043]           [0.030]       [0.030]
                                 (1.78)        (2.00)            (1.69)        (2.12)
FICO ≥ 620 ∗ NoLaw                0.88          0.88              1.13          1.13
                                [0.040]       [0.040]           [0.034]       [0.034]
                                 (1.90)        (1.94)            (1.79)        (1.93)
Observations                   109,536       109,536            14,883        14,883
Other controls                    Yes           Yes               Yes           Yes
FICO ≥ 620 ∗ other controls       Yes           Yes               Yes           Yes
Time fixed effects                Yes           Yes               Yes           Yes
Pseudo R2                         .06           .06               .05           .05
Clustering unit                 Vintage       Loan ID           Vintage       Loan ID
Mean delinquency (%)                     6.1                              4.2
Notes. This table reports estimates of the regressions on differences in the number of loans and the performance of loans across the credit threshold. We use specifications similar to those in Table II, Panel A, to estimate the number-of-loans regressions and Table III, Panel C, to estimate the delinquency regressions. We restrict our analysis to loans made in Georgia and New Jersey. NoLaw is a dummy that takes a value of 1 if the anti–predatory lending law was not passed in a given year or was amended and a value of 0 during the time period when the law was in effect. Permutation tests confirm that the discontinuity in the number of loans at 620 when the law is not in effect has the smallest p-value (and is thus the largest outlier) in the Georgia and New Jersey sample. We report t-statistics in parentheses (marginal effects are reported in square brackets).
the period when the law was in effect is about 5%, whereas the jumps are of comparable size both before the law was passed and after the law was amended. In addition, the results also indicate the rapid return of a discontinuity after the law is amended. It is notable that this time horizon is too brief for any meaningful change in the housing stock (Glaeser and Gyourko 2005) or in the underlying demand for home ownership.
Importantly, our performance results follow the same pattern as well. Columns (1) and (2) of Panel B show that the default rates for 620+ loans were below those of 620− loans in both Georgia and New Jersey only when the law was in effect. In addition, when the
law was either not passed or was amended, we find that default rates for loans above the credit threshold are similar to those for loans below the credit threshold. This upward shift in the default curve above the 620 threshold is consistent with the results reported in Section IV.D. Taken together, these results suggest that our findings are indeed related to differential securitization at the credit threshold and that lenders were not blindly following the rule of thumb in all instances.
Manipulation of Credit Scores. Having confirmed that lenders are screening more at 620− than at 620+, we assess whether borrowers were aware of the differential screening around the threshold. Even though there is no difference in contract terms around the cutoff, screening is weaker above the 620 score than below it, and this may create an incentive for borrowers to manipulate their credit scores. If FICO scores could be manipulated, lower-quality borrowers might artificially appear at higher credit scores. This behavior would be consistent with our central claim of differential screening around the threshold. Note that, per Fair Isaac, it is difficult to strategically manipulate one's FICO score in a targeted manner. Nevertheless, to examine the response of borrowers more closely, we exploit the variation generated by the same natural experiment.
If FICO scores tend to be quite sticky and it takes relatively long periods of time (more than three to six months) to improve credit scores, as Fair Isaac claims, we should observe that the difference in performance around the threshold takes time to appear after the laws are reversed. Restricting our analysis to loans originated within six months after the laws were reversed, columns (3) and (4) of Panel B (Table IV) show that the reversal of the anti–predatory lending laws has immediate effects on the performance of loans that are securitized. This result suggests that borrowers might not have been aware of the differential screening around the threshold, or were unable to quickly manipulate their FICO scores. Overall, the evidence in this section is consistent with Mayer and Pence (2008), who find no evidence of manipulation of FICO scores in their survey of the subprime market.24
24. As a further check, we obtained another data set of subprime loans that continues to track the FICO scores of borrowers after loan origination. Borrowers who manipulate their FICO scores before loan issuance should experience a decline in FICO score shortly after receiving a loan (because a permanent change in the credit score cannot be considered manipulation). Consistent with evidence for no manipulation around the threshold, we find that both 620+ and 620− borrowers are as likely to experience such a reduction within a quarter of obtaining a loan. Results of this analysis are available upon request.
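Before turning to the confirmatory tests, a sketch of the difference-in-discontinuity logit reported in Panel B of Table IV may be useful: it is the Panel C specification augmented with the NoLaw dummy and its interaction with the FICO ≥ 620 indicator, estimated on Georgia and New Jersey loans. Variable names are again hypothetical, and this is an illustration of the specification rather than the authors' code.

```python
# Illustrative sketch (not the authors' code) of the Table IV, Panel B
# specification: the delinquency logit for Georgia and New Jersey loans
# (FICO 615-624) with a NoLaw dummy (= 1 before passage or after amendment
# of the anti-predatory lending law) interacted with the FICO >= 620 indicator.
import statsmodels.formula.api as smf

def natural_experiment_logit(panel, cluster_col="orig_year"):
    controls = ["fico", "interest_rate", "ltv", "is_arm"]
    rhs = " + ".join(["T * no_law"]                    # T, no_law, and T:no_law
                     + controls
                     + [f"T:{c}" for c in controls]    # interactions with T
                     + ["C(orig_year)"])               # time fixed effects
    model = smf.logit(f"delinquent ~ {rhs}", data=panel)
    return model.fit(cov_type="cluster",
                     cov_kwds={"groups": panel[cluster_col]})

# The coefficient on T is the 620+/620- default gap while the law binds; the
# T:no_law interaction measures how the gap changes once securitization around
# the threshold is again differentially easy.
# result = natural_experiment_logit(gnj_loan_months)             # vintage clusters
# result = natural_experiment_logit(gnj_loan_months, "loan_id")  # loan-level clusters
```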
IV.G. Additional Confirmatory Tests
GSE Selection. Although the subprime market is dominated by the nonagency sector, one might worry that the GSEs may differentially influence the selection of borrowers into the subprime market through their actions in the prime market. For instance, the very best borrowers above the 620 threshold might select out of the subprime market in search of better terms in the prime market. We establish several facts to confirm that this is not the case.
First, the natural experiment we discuss in Section IV.F suggests that prime-influenced selection is not at play. The anti–predatory lending laws were targeted primarily toward the subprime part of the market (Bostic et al. 2008), leaving the prime part of the market relatively unaffected. To confirm the behavior of the prime market during the enforcement of the anti–predatory lending laws, we rely on another data set of mortgages in the United States that covers the agency loan market. The data are collected from the top U.S. servicers, are primarily focused on the agency market, and cover the period 2001 to 2006. As reported in Panel A of Table V, during the natural experiment it was no more difficult to obtain an agency loan (comparable to a subprime loan in our sample) than before or after the law was in effect. Similarly, in unreported tests we find that contractual terms (such as LTV ratios and interest rates) around 620 see no change across time periods. Furthermore, in the prime market there were no differences in defaults around the 620 threshold across the time periods (Table V, Panel B). Because borrower quality in the prime market did not change around the 620 threshold across the two time periods, if there was indeed selection, the very best 620+ subprime borrowers should have selected out into the prime market even while the laws were in place. As a result, we should have found that 620+ borrowers in the subprime market continued to default more than 620− borrowers even when the law was in place. As we showed earlier in Table IV, this is not the case.
Second, the data set confirms that Freddie Mac and Fannie Mae generally do not buy subprime loans (especially low-documentation loans) with credit scores around a FICO score of 620. This is consistent with anecdotal evidence that the role of active
TABLE V
NUMBER OF LOANS AND DELINQUENCIES IN AGENCY (GSE/PRIME) LOANS ACROSS THE CREDIT THRESHOLD: EVIDENCE FROM A NATURAL EXPERIMENT

Panel A: Number of prime loans
Year                 FICO ≥ 620 (β)    t-stat    Observations    R2     Mean
During law                4.80         (2.70)        249          .88    20.30
Pre and post law          2.33         (1.02)        268          .92    22.80

Panel B: Delinquency status of prime loans, Pr(delinquency) = 1
                                 60+ delinquent 2001–2006    90+ delinquent 2001–2006
                                           (1)                         (2)
FICO ≥ 620                               −0.026                      −0.029
                                         [0.001]                     [0.001]
                                         (0.19)                      (0.10)
FICO ≥ 620 ∗ NoLaw                       −0.004                      −0.003
                                        [0.0004]                    [0.0004]
                                         (0.03)                      (0.05)
Observations                             56,300                      56,300
Other controls                             Yes                         Yes
FICO ≥ 620 ∗ other controls                Yes                         Yes
Time fixed effects                         Yes                         Yes
Clustering unit                          Vintage                     Vintage
Pseudo R2                                  .01                         .02
Mean delinquency (%)                       5.2                         3.1
Notes. This table reports estimates of the regressions on differences in number of loans and performance of loans across the credit thresholds. The analysis is restricted to prime loans made in Georgia and New Jersey. The data are for GSE loans that are first mortgages, that are either single-family or condo or a townhouse, that are only purchase loans, that are conventional mortgages without private insurance, and that are primary residents for the borrower. NoLaw is a dummy that takes a value 1 if the anti–predatory lending law was not passed in a given year or was amended and a value 0 during the time period when the law was passed. Permutation tests confirm that the discontinuity in number of loans at 620 when the law is not passed or passed is no different from estimated jumps at other locations in the distribution in the Georgia and New Jersey sample. We report t-statistics in parentheses (marginal effects are reported in square brackets).
subprime securitization in recent years had shifted to the nonagency sector (Gramlich 2007). In unreported permutation tests (see Online Appendix Table 4, Panel A), we also find a very small jump in the number of loans in the agency market across the 620 threshold. In addition, the loan terms and default rates are also smooth. Together these results suggest that, in general, there seems to be no differential selection in terms of number of loans or quality of loans across the 620 cutoff. Third, if our results in the low-documentation market around the 620 threshold are driven by differential GSE selection, we
should observe no differences in defaults when we combine the loans from the agencies with low-documentation subprime loans around the 620 threshold. If it were purely selection, lower performance above the threshold among the low-documentation subprime loans would be offset by differentially higher-quality loans selected into the agency market. Unreported results (Online Appendix Table 5) show that there are still differences in default rates across the 620 threshold when we examine the agency loans and low-documentation subprime loans together.
Finally, we examine the set of borrowers in the subprime market (around 620) who are offered contractual terms similar to those offered in the prime market. If there is indeed selection into the prime market, it is likely based on the contractual terms offered to borrowers. By examining borrowers who are offered similar contractual terms in the subprime market, we are able to restrict our analysis to borrowers of quality similar to those who are possibly attracted by the GSEs (i.e., the good-quality borrowers). For this subset of subprime borrowers, we are able to show that 620+ loans still default more than 620− loans (Online Appendix Table 4, Panel B). This evidence further suggests that selection by GSEs is unlikely to explain our results.
Other Thresholds. In the data, we also observe smaller jumps in other parts of the securitized-loan FICO distribution, as other ad hoc cutoffs have appeared in the market in the past three years (e.g., 600 for low-documentation loans in 2005 and 2006). We remain agnostic as to why or how these other cutoffs have appeared: they may be due to a greater willingness to lend to riskier borrowers, or to the changing use of automated underwriting, which generally included a matrix of qualifications and loan terms including FICO buckets. Several comments about why we focus on the 620 threshold are therefore in order.
First, the 620 cutoff is the only threshold that is actively discussed by the GSEs in their lending guidelines where the ease of securitization is higher on the right side of the threshold (see Online Appendix Exhibit 1). This feature is essential for us to disentangle the effect of lax screening on defaults from what a change in FICO score alone would predict. As increasing FICO scores predict decreasing default rates, performing our analysis at any cutoff where the ease of securitization is lower on the right side of the threshold would not allow us to use this identification. For instance, consider the cutoff of 660 that is also discussed in the
GSE guidelines and where we observe a jump in securitization. The ease of securitization is lower on the right-hand side of this cutoff; that is, the unconditional probability of securitization is lower at 660+ relative to 660−, suggesting that 660+ loans would be more intensively screened and would default less frequently than 660− loans. However, it would be impossible to disentangle this effect from the purely mechanical effect of 660+ loans being more creditworthy and thus defaulting less often than 660− loans (by construction). This subtle advantage of the 620 cutoff is crucial to our identification strategy and rules out the use of several other ad hoc thresholds. In general, our methodology could extend to any cutoff that had greater ease of securitization on the right side of the threshold.
Moreover, to identify the effects of securitization on screening by lenders, the liquidity differential for the loan portfolios around the threshold has to be large enough. Because 620 is the largest jump we observe in the loan distribution, it is a natural choice. This is confirmed by the permutation tests, which show that FICO = 620 has the smallest p-value (and is thus the largest outlier) among all the visible discontinuities for each year in our sample. Although other cutoffs may also induce slight differences in screening effort in some years, these differences may be too small to support any meaningful inferences. In results not shown, we analyzed some of these other thresholds and found results for delinquencies that are consistent with those reported for the predominant cutoff (620), but that are indeed quite small in magnitude.
Other Tests. We also conduct several falsification tests, repeating our analysis at other credit scores where there is no jump in securitization. In sharp contrast to the results reported in Section IV.D, the higher credit score bucket defaults less than the lower credit score bucket. This is consistent with the results of the permutation tests reported above, which estimate every false discontinuity and compare it to the discontinuity at 620. Moreover, as we will show in Section V, full-documentation loans do not see any jump at this threshold. We plot the delinquency rates of 620+ and 620− full-documentation loans (2001–2006) in Figure X and find that loans made at lower credit scores are more likely to default.25
25. This test can also provide insight into the issue of GSE selection discussed earlier. Because 620+ full-documentation loans do not default more than 620− loans, differential selection into the agency market must account for this fact as well. One possibility is selection on the basis of debt-to-income ratios. To examine this, we compare DTI ratios in the full- and low-documentation markets. Unreported tests (Online Appendix Table 3) show that the DTI ratios are similar around the threshold and thus cannot entirely explain results across the two types of loans.
[Figure X here. Y-axis: Delinquency (%); X-axis: Loan age (months); series: 615–619 (620−, dashed) and 620–624 (620+, solid).]

FIGURE X
Falsification Test—Delinquencies for Full-Documentation Loans around FICO of 620
The figure presents the falsification test by examining the percentage of full-documentation loans (dollar-weighted) originated between 2001 and 2006 that became delinquent. We track loans in two FICO buckets—615–619 (620−), dashed, and 620–624 (620+), solid—from their origination date and plot the average loans that become delinquent each month after the origination date. As can be seen, the higher credit score bucket defaults less than the lower credit score bucket for the post-2000 period. For brevity, we do not report plots separately for each year of origination. The effects shown here in the pooled 2001–2006 plot show up for every year.
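For concreteness, the dollar-weighted delinquency curves in Figures X and XIII could be tabulated along the following lines. This is a sketch under assumed names: `perf`, `fico`, `orig_amount`, `age`, and `delinquent` are hypothetical columns of a loan-month performance table, not identifiers from the paper's data set.

```python
import pandas as pd

def delinquency_curve(perf: pd.DataFrame, lo: int, hi: int, max_age: int = 24) -> pd.Series:
    """Dollar-weighted delinquency rate by loan age for one FICO bucket.
    `perf` is loan-month performance data with hypothetical columns: 'fico',
    'orig_amount', 'age' (months since origination), and 'delinquent' (0/1)."""
    bucket = perf[(perf["fico"] >= lo) & (perf["fico"] <= hi) & (perf["age"] <= max_age)]
    weighted = bucket.assign(delinq_dollars=bucket["delinquent"] * bucket["orig_amount"])
    by_age = weighted.groupby("age").agg(delinq=("delinq_dollars", "sum"),
                                         total=("orig_amount", "sum"))
    # Share of origination dollars that is delinquent at each month since origination, in percent.
    return 100 * by_age["delinq"] / by_age["total"]

low_bucket = delinquency_curve(perf, 615, 619)   # 620-
high_bucket = delinquency_curve(perf, 620, 624)  # 620+
print(pd.concat({"620-": low_bucket, "620+": high_bucket}, axis=1).round(2))
```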
As further tests of our hypothesis, we also conducted our tests in the refinance market, and find a similar rule of thumb and similar default outcomes around the 620 threshold in this market. Finally, we reestimated our specifications with state, lender, and pool fixed effects to account for multiple levels of potential variation in the housing market and find qualitatively similar results.26

25. This test can also provide insight into the issue of GSE selection discussed earlier. Because 620+ full-documentation loans do not default more than 620− loans, differential selection into the agency market must account for this fact as well. One possibility is selection on the basis of debt-to-income ratios. To examine this, we compare DTI ratios in the full- and low-documentation markets. Unreported tests (Online Appendix Table 3) show that the DTI ratios are similar around the threshold and thus cannot entirely explain results across the two types of loans.
26. For additional information on tests across types of lenders and states, see Keys et al. (2009).
V. DID HARD INFORMATION MATTER?

The results presented above are for low-documentation loans, which necessarily have an unobserved component of borrowers' creditworthiness. In the full-documentation loan market, on the other hand, there is no omission of hard information on the borrower's ability to repay. In this market, we identify a credit threshold at the FICO score of 600, the score that Fair Isaac (and the three credit repositories) advises lenders as a bottom cutoff for low-risk borrowers. They note that "anything below 600 is considered someone who probably has credit problems that need to be addressed..." (see www.myfico.com). Similarly, Fannie Mae in its guidelines notes that "a borrower with credit score of 600 or less has a high primary risk..." (see www.allregs.com/efnma/doc/). The Consumer Federation of America along with Fair Isaac (survey report in March 2005) suggests that "FICO credit scores range from 300–850, and a score above 700 indicates relatively low credit risk, while scores below 600 indicate relatively high risk which could make it harder to get credit or lead to higher loan rates." Einav, Jenkins, and Levin (2008) make a similar observation when they note that "a FICO score above 600 [is] a typical cut-off for obtaining a standard bank loan."

Figure XI reveals that there is a substantial increase in the number of full-documentation loans above the credit threshold of 600. This pattern is consistent with the notion that lenders are more willing to securitize at a lower credit threshold (600 vs. 620) for full-documentation loans because there is less uncertainty about these borrowers relative to those who provide less documentation. The magnitudes are again large—around 100% higher at 600+ than at 600− in 2004—for full-documentation loans. In Panel A of Table VI, we estimate regressions similar to equation (1) and find that the coefficient estimate is also significant at 1% and is on average around 100% (from 80% to 141%) higher for 600+ as compared to 600− for post-2000 loans. Again, if the underlying creditworthiness and the demand for mortgage loans (at a given price) are the same for potential buyers with a credit score of 600− or 600+, as the credit bureaus claim, this result confirms that it is easier to securitize full-documentation loans above the 600 FICO threshold.

We repeated a similar analysis for loan characteristics (LTV and interest rates) and borrower demographics and find no differences for full-documentation loans above and below the credit score of 600. Appendix I.C presents the estimates from the
[Figure XI here. Six panels, one per origination year (2001–2006); X-axis: FICO (500–800); Y-axis: number of loans (in '00s).]

FIGURE XI
Number of Loans (Full-Documentation)
The figure presents the data for the number of full-documentation loans (in '00s). We plot the average number of loans at each FICO score between 500 and 800. As can be seen from the graphs, there is a large increase in number of loans around the 600 credit threshold (i.e., more loans at 600+ as compared to 600−) from 2001 onward. Data are for loans originated between 2001 and 2006.
regressions (Appendix I.D provides permutation test estimates corresponding to these loan terms).

Interestingly, we find that full-documentation loans with credit scores of 600− (FICO between 595 and 599) are about as likely to default after a year as loans with credit scores of 600+ (FICO between 601 and 605) for the post-2000 period. Both Figures XII and XIII and results in Panels B, C, and D of Table VI support this conjecture. Following the methodology used in Figures VI and VII, we show the default rates annually across the FICO distribution (Figure XII) and across the age of the loans (Figure XIII). The estimated effects of the ad hoc rule on defaults are negligible in all specifications.

The absence of differences in default rates across the credit threshold, while the same magnitude of the jump in the number of loans is maintained, is consistent with the notion that the pattern of delinquencies around the low-documentation threshold is primarily due to the soft information of the borrower. With so much
TABLE VI
NUMBER OF LOANS AND DELINQUENCIES ACROSS THE CREDIT THRESHOLD FOR FULL-DOCUMENTATION LOANS

Panel A: Number of full-documentation loans
Year    FICO ≥ 600 (β)    t-stat     Observations    R²     Mean
2001    306.85            (5.70)     299             .99    330
2002    378.49            (9.33)     299             .99    360
2003    780.72            (11.73)    299             .99    648
2004    1,629.82          (8.91)     299             .99    1,205
2005    1,956.69          (4.72)     299             .98    1,499
2006    2,399.48          (6.97)     299             .98    1,148
Pooled estimate: 1,241.75 (t-stat 3.23) [permutation test p-value .000]

Panel B: Dollar-weighted fraction of loans defaulted
Year    FICO ≥ 600 (β)    t-stat     Observations    R²     Mean
2001    0.005             (0.63)     250             .87    0.052
2002    0.018             (1.74)     250             .87    0.041
2003    0.013             (1.93)     250             .94    0.039
2004    0.006             (1.01)     254             .94    0.040
2005    0.008             (1.82)     254             .96    0.059
2006    0.010             (0.89)     254             .86    0.116

Panel C: Permutation tests for alternative default definitions (pooled 2001–2006 with time fixed effects)
Dependent variable (default definition)    FICO ≥ 600 (β)    t-stat    Permutation test p-value    Observations    R²     Mean
60+ (dollar-weighted)                      0.010             (1.66)    .240                        1,512           .84    0.058
90+ (dollar-weighted)                      0.006             (1.00)    .314                        1,525           .75    0.046
Foreclosure+ (dollar-weighted)             0.005             (1.25)    .265                        1,525           .77    0.032
60+ (unweighted)                           0.011             (1.50)    .150                        1,525           .70    0.056

Panel D: Delinquency status of loans; dependent variable Pr(delinquency) = 1
                                 (1)          (2)          (3)          (4)
FICO ≥ 600                       −0.06        −0.02        −0.06        −0.02
  [marginal effect]              [0.002]      [0.0006]     [0.002]      [0.0006]
  (t-stat)                       (2.30)       (0.15)       (1.21)       (0.18)
Observations                     3,125,818    3,125,818    3,125,818    3,125,818
Pseudo R²                        .073         .084         .073         .084
Other controls                   Yes          Yes          Yes          Yes
FICO ≥ 600 ∗ other controls      No           Yes          No           Yes
Time fixed effects               No           Yes          No           Yes
Clustering unit                  Loan ID      Loan ID      Vintage      Vintage
Mean delinquency (%)             4.54

Notes. This table reports estimates of the regressions on differences in number of loans and performance of loans around the credit threshold of 600 for full-documentation loans. We use specifications similar to these in Table II, Panel A, to estimate the number of loan regressions and Table III, Panels A, B, and C, to estimate delinquency regressions. Permutation tests confirm that FICO = 600 has the smallest permutation test p-value (and is thus the largest outlier) among all the visible discontinuities in the full-documentation loan sample. We report t-statistics in parentheses (marginal effects are reported in square brackets).
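A Panel D-style delinquency specification could be sketched as follows. The frame `loans` and the column names (`documentation`, `delinquent`, `fico`, `ltv`, `interest_rate`, `vintage`, `loan_id`) are placeholders rather than the paper's variable names, and the control set is only an illustrative stand-in for the "other controls" referred to in the table.

```python
import statsmodels.formula.api as smf

# `loans` is a hypothetical loan-level DataFrame; all column names below are assumptions.
full_doc = loans[loans["documentation"] == "full"].copy()
full_doc["above"] = (full_doc["fico"] >= 600).astype(int)

# Logit of delinquency on the threshold indicator, illustrative controls, and vintage
# (time) fixed effects, roughly in the spirit of column (2) of Panel D.
model = smf.logit("delinquent ~ above + fico + ltv + interest_rate + C(vintage)",
                  data=full_doc)

# Standard errors clustered at the loan level, as in columns (1)-(2); pass
# full_doc["vintage"] as the groups to cluster by vintage, as in columns (3)-(4).
result = model.fit(cov_type="cluster", cov_kwds={"groups": full_doc["loan_id"]}, disp=0)
print(result.summary())
print(result.get_margeff(at="overall").summary())  # marginal effects, shown in brackets in the table
```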
[Figure XII here. Six panels, one per origination year (2001–2006); X-axis: FICO (500–750); Y-axis: dollar-weighted fraction of loans delinquent; vertical line at the 600 cutoff.]

FIGURE XII
Annual Delinquencies for Full-Documentation Loans
The figure presents the percentage of full-documentation loans originated between 2001 and 2006 that became delinquent. We plot the dollar-weighted fraction of the pool that becomes delinquent for one-point FICO bins between scores of 500 and 750. The vertical line denotes the 600 cutoff, and a seventh-order polynomial is fitted to the data on either side of the threshold. Delinquencies are reported between 10 and 15 months for loans originated in all years.
information collected by the lender for full-documentation loans, there is less value to collecting soft information. Consequently, for full-documentation loans there is no difference in how the loans perform subsequently after hard information has been controlled for. Put another way, differences in returns to screening are attenuated due to the presence of more hard information.

VI. DISCUSSION

In the wake of the subprime mortgage crisis, a central question confronting market participants and policy makers is whether securitization had an adverse effect on the ex ante screening effort of loan originators. Comparing characteristics of the loan market above and below the ad hoc credit threshold, we show that a doubling of securitization volume is on average associated with about a 10%–25% increase in defaults. Notably, our empirical strategy delivers only inferences on differences in the performance
[Figure XIII here. Y-axis: Delinquency (%); X-axis: Loan age (months); series: 595–599 (600−, dashed) and 600–604 (600+, solid).]

FIGURE XIII
Delinquencies for Full-Documentation Loans (2001–2006)
The figure presents the percentage of full-documentation loans (dollar-weighted) originated between 2001 and 2006 that became delinquent. We track loans in two FICO buckets—595–599 (600−), dashed, and 600–604 (600+), solid—from their origination date and plot the average loans that become delinquent each month after the origination date. As can be seen, the higher credit score bucket defaults more than the lower credit score bucket for the post-2000 period. For brevity, we do not report plots separately for each year of origination. The effects shown here in the pooled 2001–2006 plot show up for every year.
of loans around this threshold. Although we cannot infer what the optimal level of screening at each credit score ought to be, we conclude from our empirical analysis that there was a causal link between ease of securitization and screening. That we find any effect on default behavior in one portfolio compared to another with virtually identical risk profiles, demographic characteristics, and loan terms suggests that the ease of securitization may have had a direct impact on incentives elsewhere in the subprime housing market. Understanding whether the ease of securitization had a similar impact on other securitized markets requires more research.

The results of this paper, in particular from the anti-predatory lending laws' natural experiment, confirm that lender behavior in the subprime market did change based on the ease of securitization. This suggests that existing securitization practices did not ensure that a decline in screening standards would be counteracted by requiring originators to hold more of the loans' default risk. If lenders were in fact holding on to optimal risk
where it was easier to securitize, there should have been no differences in defaults around the threshold. This finding resonates well with concerns surrounding the subprime crisis that, in an environment with limited disclosure on who holds what in the originate-to-distribute chain, there may have been insufficient "skin in the game" for some lenders (Blinder 2007; Stiglitz 2007). At the same time, the results further suggest that the breakdown in the process only occurred for loans where soft information was particularly important. With enough hard information, as in the full-documentation market, there may be less value in requiring market participants to hold additional risk to counteract the potential moral hazard of reduced screening standards.

In a market as competitive as the market for mortgage-backed securities, our results on interest rates are puzzling. Lenders' compensation on either side of the threshold should reflect differences in default rates, and yet we find that the interest rates to borrowers are similar on either side of 620. The difference in defaults, despite similar compensation around the threshold, suggests that there may have been some efficiency losses. Of course, it is possible that from the lenders' perspective, a higher propensity to default above the threshold could have exactly offset the benefits of additional liquidity—resulting in identical interest rates around the threshold.

Our analysis remains agnostic about whether investors priced the moral hazard aspects of securitization accurately. It may have been the case that moral hazard existed in this market, though investors appropriately priced persistent differences in performance around the threshold (see Rajan, Seru, and Vig [2008]). On the other hand, developing an arbitrage strategy for exploiting this opportunity may have been prohibitively difficult, given that loans are pooled across the FICO spectrum before they are traded. In addition, these fine differences in performance around the FICO threshold could have been obscured by the performance of other complex loan products in the pool. Understanding these aspects of investor behavior warrants additional investigation.

It is important to note that we refrain from making any welfare claims. Our conclusions should be directed at securitization practices, as they were during the subprime boom, rather than at the optimally designed originate-to-distribute model. We believe securitization is an important innovation and has several merits. It is often asserted that securitization improves the efficiency of credit markets. The underlying assumption behind this assertion
is that there is no information loss in transmission, even though securitization increases the distance between borrowers and investors. The benefits of securitization are limited by information loss, and in particular the costs we document in the paper. More generally, what types of credit products should be securitized? We conjecture that the answer depends crucially on the information structure: loans with more hard information are likely to benefit from securitization relative to loans that involve soft information. A careful investigation of this question is a promising area for future research.

More broadly, our findings caution against policy that emphasizes excessive reliance on default models. Our research suggests that by relying entirely on hard information variables such as FICO scores, these models ignore essential elements of strategic behavior on the part of lenders, which are likely to be important. The formation of a rule of thumb, even if optimal (Baumol and Quandt 1964), has an undesirable effect on the incentives of lenders to collect and process soft information. As in Lucas (1976), this strategic behavior can alter the relationship between observable borrower characteristics and default likelihood, rather than moving along the previous predicted relationship. Incorporating these strategic elements into default models, although challenging, is another important direction for future research.

APPENDIX I.A
LOAN CHARACTERISTICS AROUND DISCONTINUITY IN LOW-DOCUMENTATION LOANS

              Loan to value                                    Interest rate
Year   FICO ≥ 620 (β)  t-stat  Obs.  R²   Mean (%)      FICO ≥ 620 (β)  t-stat  Obs.  R²   Mean (%)
2001        0.67       (0.93)  296   .76    80.3             0.06       (0.59)  298   .92     9.4
2002        1.53       (2.37)  299   .91    82.6             0.15       (1.05)  299   .89     8.9
2003        2.44       (4.27)  299   .96    83.4             0.10       (1.50)  299   .97     7.9
2004        0.30       (0.62)  299   .96    84.5             0.03       (0.39)  299   .97     7.8
2005       −0.33       (0.96)  299   .95    84.1            −0.09       (1.74)  299   .98     8.2
2006       −1.06       (2.53)  299   .96    84.8            −0.21       (2.35)  299   .98     9.2

Notes. This table reports estimates from a regression that uses the mean interest rate and LTV ratio of low-documentation loans at each FICO score as the dependent variables. In order to estimate the discontinuity (FICO ≥ 620) for each year, we collapse the interest rate and LTV ratio at each FICO score and estimate flexible seventh-order polynomials on either side of the 620 cutoff, allowing for a discontinuity at 620. Because the measures of the interest rate and LTV are estimated means, we weight each observation by the inverse of the variance of the estimate. We report t-statistics in parentheses. Permutation tests, which allow for a discontinuity at every point in the FICO distribution, confirm that these jumps are not significantly larger than those found elsewhere in the distribution. For brevity, we report permutation test estimates from pooled regressions (with time fixed effects removed to account for vintage effects) and report them in Appendix I.D.
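The weighted discontinuity regressions described in these notes could be sketched as follows; `low_doc` and its columns are hypothetical names, and the code mirrors the stated procedure (collapse to means by FICO score, seventh-order polynomials on each side of 620, inverse-variance weights) rather than reproducing the authors' code.

```python
import pandas as pd
import statsmodels.api as sm

# `low_doc`: hypothetical loan-level DataFrame of low-documentation loans for one
# origination year, with columns 'fico' and 'interest_rate' (assumed names).
g = low_doc.groupby("fico")["interest_rate"]
collapsed = pd.DataFrame({"mean_rate": g.mean(),
                          "var_of_mean": g.var() / g.count()}).reset_index()
collapsed = collapsed[collapsed["fico"].between(500, 800) & (collapsed["var_of_mean"] > 0)]

x = (collapsed["fico"] - 620) / 100.0          # rescaled to keep the polynomial stable
d = (collapsed["fico"] >= 620).astype(float)   # the discontinuity (FICO >= 620) regressor
X = {"d": d}
for p in range(1, 8):                          # flexible seventh-order polynomial, each side
    X[f"x{p}"] = x ** p
    X[f"dx{p}"] = d * x ** p
X = sm.add_constant(pd.DataFrame(X))

# Weight each collapsed observation by the inverse variance of the estimated mean.
wls = sm.WLS(collapsed["mean_rate"], X, weights=1.0 / collapsed["var_of_mean"]).fit()
print(round(wls.params["d"], 3), round(wls.tvalues["d"], 2))  # beta and t-stat, as in the table
```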
APPENDIX I.B
BORROWER DEMOGRAPHICS AROUND DISCONTINUITY IN LOW-DOCUMENTATION LOANS

Year   FICO ≥ 620 (β)   t-stat   Observations   R²     Mean
Panel A: Percent black in ZIP code
2001        1.54        (1.16)       297        .79     11.2
2002        0.32        (0.28)       299        .63     10.6
2003        1.70        (2.54)       299        .70     11.1
2004        0.42        (0.53)       299        .72     12.2
2005       −0.50        (0.75)       299        .69     13.1
2006        0.25        (0.26)       299        .59     14.7
Panel B: Median income in ZIP code
2001    1,963.23        (2.04)       297        .33   49,873
2002     −197.21        (0.13)       299        .35   50,109
2003      154.93        (0.23)       299        .50   49,242
2004      699.90        (1.51)       299        .46   48,221
2005      662.71        (1.08)       299        .64   47,390
2006     −303.54        (0.34)       299        .68   46,396
Panel C: Median house value in ZIP code
2001    3,943.30        (0.44)       297        .66  163,151
2002     −599.72        (0.11)       299        .79  165,049
2003   −1,594.51        (0.36)       299        .89  160,592
2004   −2,420.01        (1.03)       299        .91  150,679
2005     −342.04        (0.14)       299        .93  143,499
2006   −3,446.06        (1.26)       299        .92  138,556

Notes. This table reports estimates from a regression that uses the mean demographic characteristics of low-documentation borrowers at each FICO score as the dependent variables. In order to estimate the discontinuity (FICO ≥ 620) for each year, we collapse the demographic variables at each FICO score and estimate flexible seventh-order polynomials on either side of the 620 cutoff, allowing for a discontinuity at 620. Because the demographic variables are estimated means, we weight each observation by the inverse of the variance of the estimate. We obtain the demographic variables from Census 2000, matched using the ZIP code of each loan. Permutation tests, which allow for a discontinuity at every point in the FICO distribution, confirm that these jumps are not significantly larger than those found elsewhere in the distribution. We report t-statistics in parentheses.
APPENDIX I.C
LOAN CHARACTERISTICS AND BORROWER DEMOGRAPHICS AROUND DISCONTINUITY IN FULL-DOCUMENTATION LOANS

Panel A: Loan characteristics
              Loan to value                                    Interest rate
Year   FICO ≥ 600 (β)  t-stat  Obs.  R²   Mean (%)      FICO ≥ 600 (β)  t-stat  Obs.  R²   Mean (%)
2001       0.820       (2.09)  299   .73    85.1            −0.097       (0.87)  299   .97     9.5
2002      −0.203       (0.65)  299   .86    85.8            −0.279       (3.96)  299   .97     8.6
2003       1.012       (3.45)  299   .95    86.9            −0.189       (3.42)  299   .99     7.7
2004       0.755       (2.00)  299   .96    86              −0.244       (6.44)  299   .99     7.3
2005       0.354       (1.82)  299   .93    86.2            −0.308       (5.72)  299   .99     7.7
2006      −0.454       (1.96)  299   .94    86.7            −0.437       (9.93)  299   .99     8.6

Panel B: Percent black in ZIP code
Year   FICO ≥ 600 (β)   t-stat   Observations   R²    Mean (%)
2001        2.32        (2.03)       299        .86     13.6
2002       −0.79        (1.00)       299        .82     12.5
2003        0.40        (0.48)       299        .87     12.5
2004        0.54        (0.96)       299        .92     12.9
2005       −0.38        (0.85)       299        .86     13.4
2006       −0.86        (1.40)       299        .81     14.3

Notes. This table reports the estimates of the regressions on loan characteristics and borrower demographics around the credit threshold of 600 for full-documentation loans. We use specifications similar to Appendices I.A and I.B for estimation. We report t-statistics in parentheses. Permutation tests, which allow for a discontinuity at every point in the FICO distribution, confirm that these jumps are not significantly larger than those found elsewhere in the distribution. For brevity, we report permutation test estimates from pooled regressions (with time fixed effects removed to account for vintage effects) and report them in Appendix I.D.
APPENDIX I.D
PERMUTATION TEST RESULTS FOR LOAN CHARACTERISTICS IN LOW- AND FULL-DOCUMENTATION LOANS

Loan characteristics examined: Interest rate; Loan-to-value ratio; Debt-to-income ratio; Prepayment penalty; Actual prepayments; CLTV ratio.

Panel A: Low-documentation loan characteristics (reported as Pooled FICO ≥ 620 (β), t-stat, permutation test p-value):
0.02, 0.32, .90; 0.17, 0.39, .62; 0.05, 0.73, .96; 0.54, 1.40, .46; 0.42, 1.25, .32; −0.016, 1.23, .55.

Panel B: Full-documentation loan characteristics (reported as Pooled FICO ≥ 600 (β), t-stat, permutation test p-value):
0.39, 1.63, .20; −0.0009, 1.51, .35; −0.0004, 0.44, .84; −0.26, 1.91, .07; 0.68, 1.83, .11; 0.008, 0.73, .45.

Notes. This table reports the estimates of the regressions on loan characteristics across the credit threshold of 620 for low-documentation loans and credit threshold of 600 for full-documentation loans. We pool the loans across all years and remove year effects to account for vintage effects. We use specifications similar to Appendix I.A for estimation. Permutation tests, which allow for a discontinuity at every point in the FICO distribution, confirm that these jumps are not significantly larger than those found elsewhere in the distribution. We report p-values from these tests in the table.
FEDERAL RESERVE BOARD OF GOVERNORS
SORIN CAPITAL MANAGEMENT
UNIVERSITY OF CHICAGO, BOOTH SCHOOL OF BUSINESS
LONDON BUSINESS SCHOOL
REFERENCES
Aghion, Philippe, Patrick Bolton, and Jean Tirole, "Exit Options in Corporate Finance: Liquidity versus Incentives," Review of Finance, 8 (2004), 327–353.
Avery, Robert, Raphael Bostic, Paul Calem, and Glenn Canner, "Credit Risk, Credit Scoring and the Performance of Home Mortgages," Federal Reserve Bulletin, 82 (1996), 621–648.
Baumol, William J., and Richard E. Quandt, "Rules of Thumb and Optimally Imperfect Decisions," American Economic Review, 54 (1964), 23–46.
Benmelech, Efraim, and Jennifer Dlugosz, "The Alchemy of CDO Credit Ratings," Journal of Monetary Economics, 56 (2009), 617–634.
Blinder, Alan, "Six Fingers of Blame in the Mortgage Mess," New York Times, September 30, 2007.
Bostic, Raphael, Kathleen Engel, Patricia McCoy, Anthony Pennington-Cross, and Susan Wachter, "State and Local Anti-Predatory Lending Laws: The Effect of Legal Enforcement Mechanisms," Journal of Economics and Business, 60 (2008), 47–66.
Calomiris, Charles, and Joseph Mason, High Loan-to-Value Mortgage Lending: Problem or Cure (Washington, DC: The AEI Press, 1999).
Capone, Charles A., "Research into Mortgage Default and Affordable Housing: A Primer," Prepared for LISC Center for Home Ownership Summit 2001, Congressional Budget Office, 2002.
Card, David, Alexandre Mas, and Jesse Rothstein, "Tipping and the Dynamics of Segregation," Quarterly Journal of Economics, 123 (2008), 177–218.
Chomsisengphet, Souphala, and Anthony Pennington-Cross, "The Evolution of the Subprime Mortgage Market," Federal Reserve Bank of St. Louis Review, 88 (2006), 31–56.
Coffee, John, "Liquidity Versus Control: The Institutional Investor as Corporate Monitor," Columbia Law Review, 91 (1991), 1277–1368.
Comptroller's Handbook, "Asset Securitization," November 1997.
Dell'Ariccia, Giovanni, Deniz Igan, and Luc A. Laeven, "Credit Booms and Lending Standards: Evidence from the Subprime Mortgage Market," IMF Working Paper No. 08/106, 2008.
DeMarzo, Peter, and Branko Urosevic, "Optimal Trading and Asset Pricing with a Large Shareholder," Journal of Political Economy, 114 (2006), 774–815.
Demyanyk, Yuliya, and Otto Van Hemert, "Understanding the Subprime Mortgage Crisis," Review of Financial Studies, forthcoming 2010.
Deng, Yongheng, John Quigley, and Robert Van Order, "Mortgage Terminations, Heterogeneity and the Exercise of Mortgage Options," Econometrica, 68 (2000), 275–307.
Diamond, Douglas, "Financial Intermediation and Delegated Monitoring," Review of Economic Studies, 51 (1984), 393–414.
Diamond, Douglas, and Raghuram Rajan, "Liquidity Risk, Liquidity Creation and Financial Fragility: A Theory of Banking," Journal of Political Economy, 109 (2003), 287–327.
DiNardo, John, and David S. Lee, "Economic Impacts of New Unionization on Private Sector Employers: 1984–2001," Quarterly Journal of Economics, 119 (2004), 1383–1441.
Doms, Mark, Fred Furlong, and John Krainer, "Subprime Mortgage Delinquency Rates," FRB San Francisco Working Paper 2007-33, 2007.
Drucker, Steven, and Christopher Mayer, "Inside Information and Market Making in Secondary Mortgage Markets," Columbia Business School Working Paper, 2008.
Drucker, Steven, and Manju Puri, "On Loan Sales, Loan Contracting, and Lending Relationships," Review of Financial Studies, 22 (2009), 2835–2872.
Einav, Liran, Mark Jenkins, and Jonathan Levin, "Contract Pricing in Consumer Credit Markets," Stanford University Working Paper, 2008.
Fishelson-Holstein, Hollis, "Credit Scoring Role in Increasing Homeownership for Underserved Populations," in Building Assets, Building Credit: Creating Wealth in Low-Income Communities, Nicolas P. Retsinas and Eric S. Belsky, eds. (Washington, DC: Brookings Institution Press, 2005).
Freddie Mac, "Single-Family Seller/Servicer Guide," Chapter 37, Section 37.6: Using FICO Scores in Underwriting, 2001.
——, "Automated Underwriting Report," Chapter 6, 2007.
Gerardi, Kristopher, Adam Hale Shapiro, and Paul S. Willen, "Subprime Outcomes: Risky Mortgages, Homeownership Experiences and Foreclosures," Federal Reserve Bank of Boston Working Paper 0715, 2007.
Glaeser, Edward L., and Joseph Gyourko, "Urban Decline and Durable Housing," Journal of Political Economy, 113 (2005), 345–375.
Gorton, Gary, and George Pennacchi, "Banks and Loan Sales: Marketing Nonmarketable Assets," Journal of Monetary Economics, 35 (1995), 389–411.
Gorton, Gary B., and Nicholas S. Souleles, "Special Purpose Vehicles and Securitization," in The Risks of Financial Institutions, Mark Carey and Rene M. Stultz, eds. (Chicago: University of Chicago Press, 2006).
Gramlich, Edward, Subprime Mortgages: America's Latest Boom and Bust (Washington, DC: The Urban Institute Press, 2007).
Holloway, Thomas, Gregor MacDonald, and John Straka, "Credit Scores, Early-Payment Mortgage Defaults, and Mortgage Loan Performance," Freddie Mac Working Paper, 1993.
Holmström, Bengt, and Jean Tirole, "Financial Intermediation, Loanable Funds, and the Real Sector," Quarterly Journal of Economics, 52 (1997), 663–692.
Johnston, Jack, and John DiNardo, Econometric Methods (New York: McGraw-Hill, 1996).
Kashyap, Anil K., and Jeremy C. Stein, "What Do a Million Observations on Banks Have To Say about the Monetary Transmission Mechanism?" American Economic Review, 90 (2000), 407–428.
Keys, Benjamin J., Tanmoy Mukherjee, Amit Seru, and Vikrant Vig, "Financial Regulation and Securitization: Evidence from Subprime Mortgage Loans," Journal of Monetary Economics, 56 (2009), 700–720.
Kornfeld, Warren, "Testimony before the Subcommittee on Financial Institutions and Consumer Credit," U.S. House of Representatives, May 8, 2007.
Loesch, Michele, "A New Look at Subprime Mortgages," Fitch Research, 1996.
Loutskina, Elena, and Philip Strahan, "Securitization and the Declining Impact of Bank Finance on Loan Supply: Evidence from Mortgage Acceptance Rates," Boston College Working Paper, 2007.
Lucas, Douglas, Laurie Goodman, and Frank Fabozzi, Collateralized Debt Obligations: Structures and Analysis (Hoboken, NJ: Wiley Finance, 2006).
Lucas, Robert E., Jr., "Econometric Policy Evaluation: A Critique," in The Phillips Curve and Labor Markets, Carnegie-Rochester Conferences on Public Policy, K. Brunner and A.H. Meltzer, eds. (Amsterdam: North-Holland, 1976).
Mayer, Christopher, and Karen Pence, "Subprime Mortgages: What, Where, and to Whom?" Board of Governors of the Federal Reserve System, Finance and Economics Discussion Series 2008-29, 2008.
Mayer, Christopher, Karen Pence, and Shane Sherlund, "The Rise in Mortgage Defaults," Journal of Economic Perspectives, 23 (2009), 27–50.
Mayer, Christopher, Tomasz Piskorski, and Alexei Tchistyi, "The Inefficiency of Refinancing: Why Prepayment Penalties Are Good for Risky Borrowers," Columbia Business School Working Paper, 2008.
Mian, Atif, and Amir Sufi, "The Consequences of Mortgage Credit Expansion: Evidence from the U.S. Mortgage Default Crisis," Quarterly Journal of Economics, 124 (2009), 1449–1496.
Morrison, Alan D., "Credit Derivatives, Disintermediation and Investment Decisions," Journal of Business, 78 (2005), 621–647.
Parlour, Christine, and Guillaume Plantin, "Loan Sales and Relationship Banking," Journal of Finance, 63 (2008), 1291–1314.
Pennacchi, George, "Loan Sales and the Cost of Bank Capital," Journal of Finance, 43 (1988), 375–396.
Petersen, Mitchell A., and Raghuram G. Rajan, "Does Distance Still Matter? The Information Revolution in Small Business Lending," Journal of Finance, 57 (2002), 2533–2570.
Rajan, Uday, Amit Seru, and Vikrant Vig, "The Failure of Models That Predict Failure: Distance, Incentives and Defaults," University of Chicago Booth School of Business Working Paper No. 08-19, 2008.
Sheather, Simon, and Chris Jones, "A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation," Journal of the Royal Statistical Society, 53 (1991), 683–690.
Stein, Jeremy, "Information Production and Capital Allocation: Decentralized versus Hierarchical Firms," Journal of Finance, 57 (2002), 1891–1921.
Stiglitz, Joseph, "Houses of Cards," The Guardian, October 9, 2007.
Sufi, Amir, "Information Asymmetry and Financing Arrangements: Evidence from Syndicated Loans," Journal of Finance, 62 (2006), 629–668.
Temkin, Kenneth, Jennifer Johnson, and Diane Levy, "Subprime Markets, The Role of GSEs and Risk-Based Pricing," Prepared for U.S. Department of Housing and Urban Development Office of Policy Development and Research, 2002.
MONETARY POLICY BY COMMITTEE: CONSENSUS, CHAIRMAN DOMINANCE, OR SIMPLE MAJORITY?∗

ALESSANDRO RIBONI AND FRANCISCO J. RUGE-MURCIA

This paper studies the theoretical and empirical implications of monetary policy making by committee under four different voting protocols. The protocols are a consensus model, where a supermajority is required for a policy change; an agenda-setting model, where the chairman controls the agenda; a dictator model, where the chairman has absolute power over the committee; and a simple majority model, where policy is determined by the median member. These protocols give preeminence to different aspects of the actual decision-making process and capture the observed heterogeneity in formal procedures across central banks. The models are estimated by maximum likelihood using interest rate decisions by the committees of five central banks, namely the Bank of Canada, the Bank of England, the European Central Bank, the Swedish Riksbank, and the U.S. Federal Reserve. For all central banks, results indicate that the consensus model fits actual policy decisions better than the alternative models. This suggests that despite institutional differences, committees share unwritten rules and informal procedures that deliver observationally equivalent policy decisions.
“I try to forge a consensus . . . . If a discussion were to lead to a narrow majority, then it is more likely that I would postpone a decision.” Wim Duisenberg, former President of the European Central Bank, The New York Times, 27 June 2001.
I. INTRODUCTION

An important question in economics concerns the implications of collective decision making on policy outcomes. Prominent examples of decisions made by a group of individuals, rather than by a single agent, include fiscal and monetary policy. Decisions concerning public spending, taxation, and debt are made by legislatures,1 whereas the target for the key nominal interest rate is selected by a committee in most central banks.

∗ We received helpful comments and suggestions from Jan Marc Berk, Pohan Fong, Guillaume Fréchette, Petra Gerlach-Kristen, the editor (Robert Barro), the co-editor (Elhanan Helpman), anonymous referees, and participants in the Research Workshop on Monetary Policy Committees held at the Bank of Norway in September 2007, the 2008 SED Meeting in Boston, and the 2008 European Meeting of the Econometric Society in Milan. This research received the financial support of the Social Sciences and Humanities Research Council of Canada.
1. Among the theoretical contributions that examine the consequences of legislative bargaining on fiscal policy decisions are Baron (1991), Chari and Cole (1993), Persson, Roland, and Tabellini (2000), Battaglini and Coate (2007, 2008), and Volden and Wiseman (2007).
This paper develops a model where members of a monetary policy committee have different views regarding the optimal interest rate and resolve those differences through voting. First, the equilibrium outcome is solved under various voting (or bargaining) protocols. Then, their respective theoretical and empirical implications are derived, and all models are estimated by the method of maximum likelihood using data on interest rate decisions by the committees in five central banks. The central banks are the Bank of Canada, the Bank of England, the European Central Bank (ECB), the Swedish Riksbank, and the U.S. Federal Reserve. Finally, model selection criteria are used to compare the predictive power of the competing theories of collective choice.2

Because there is institutional evidence of heterogeneity in the formal procedures employed by monetary committees to arrive at a decision (see, for example, Fry et al. [2000]), we study four voting protocols. Each protocol gives preeminence to different aspects of the actual decision-making process. The first protocol is a consensus-based model where a supermajority (that is, a level of support that exceeds a simple majority) is required to adopt a new policy, and no committee member has proposal power. The second protocol is the agenda-setting model, originally proposed by Romer and Rosenthal (1978), where decisions are made by simple majority, and the chairman is assumed to control the agenda. These two protocols yield outcomes different from the median policy and predict an inaction region (or gridlock interval) where the committee keeps the interest rate unchanged. In addition, they deliver endogenous autocorrelation and regime switches in the interest rate. However, these protocols generate different implications for the size of interest rate adjustments: the agenda-setting model predicts larger interest rate increases (decreases) than the consensus model when the chairman is more hawkish (dovish) than the median member.

The third protocol is a dictator model where the chairman has absolute power over the committee. Hence, in the absence of frictions in the political decision-making process, the chairman is able to impose his preferred policy on the committee. Finally, the fourth protocol is the median model, where no member controls the agenda, all alternatives are put to a vote, decisions are made by simple majority, and, consequently, the interest rate selected

2. Previous literature on monetary committees usually relies on the Median Voter Theorem and focuses on features other than the voting procedure. For instance, Waller (1992, 2000) studies the implications of the length of the term of office and of the committee size, respectively.
by the committee is the one preferred by the median. These two protocols predict that, regardless of the initial status quo, the committee adjusts the interest rate to the value preferred by a key member (the chairman or the median), and are thus observationally equivalent to a setup where that member is the single central banker. Because a common feature of these protocols is the absence of decision-making frictions, we jointly refer to them as "frictionless" models.

According to survey results reported in Fry et al. (2000), a little more than half of the monetary policy committees in their sample (43 out of 79) make decisions by consensus, whereas the rest hold a formal vote. Of the five central banks studied in this paper, only one (the Bank of Canada) explicitly operates on a consensus basis, whereas the remaining ones hold a formal vote by simple majority rule. However, the distinction is ambiguous in practice because consensus appears to play a role in the decision-making processes of committees that (on paper) make decisions by simple majority rule.3

Committees also seem to differ with respect to the role played by the chairman. In the committees that hold a formal vote, the chairman can exert leadership, in particular, by deciding the content of the proposal that is put to a vote.4 For example, a prevalent view is that under the mandate of Alan Greenspan, agreement within the Federal Open Market Committee (FOMC) was dictated by the chairman.5 A consequence of the agenda-setting power of the chairman is that the final voting outcome may sometimes be different from the policy favored by a majority of members.6 In the committees that formally operate by consensus, the power of the chairman may also be substantial despite the presumption that these committees behave in a

3. See, for example, the interview of Wim Duisenberg with The New York Times cited at the beginning of this paper. In contrast, the ECB's Statutes state that "the Governing Council shall act by a simple majority of the members having a voting right." See Article 10 in www.ecb.int/ecb/legal/pdf/en statute 2.pdf.
4. This is the case, for instance, in the monetary committees of the Bank of England, the Norges Bank, the Reserve Bank of Australia, the Swedish Riksbank, and the U.S. Federal Reserve (see Maier [2007]).
5. Blinder (2004, p. 47) remarks that "each member other than Alan Greenspan has had only one real choice when the roll was called: whether to go on record as supporting or opposing the chairman's recommendation, which will prevail in any case."
6. Blinder (2004, p. 47) cites two instances where this appears to have been true for the FOMC. First, transcripts for the meeting on 4 February 1994 indicate that most members wanted to raise the Federal Funds rate by 50 basis points, whereas Greenspan wanted a 25 point increase. Nonetheless, the committee eventually passed Greenspan's preferred policy. Second, Blinder reports the general opinion that in the late 1990s Greenspan was able to maintain the status quo although most committee members were in favor of an interest rate increase.
collegial manner.7 Overall, this discussion suggests that institutional and anecdotal evidence alone cannot unambiguously reveal how monetary policy committees reach decisions.

As an alternative strategy for understanding central banking by committee, we study actual policy decisions in the context of simple, but fully specified, decision-making protocols. Because the statistical models of the interest rate implied by these protocols are non-nested, standard tests do not have conventional distributions and cannot be used to discriminate across models. For this reason, we use model selection criteria and the comparison between predicted implications and observed features of the data to shed light on the relative merits of the models.

Results show that despite the heterogeneity in the formal procedures followed by the monetary committees in our sample, the consensus-based model describes interest rate decisions more precisely than the alternative protocols for all five central banks studied. These results suggest that, in addition to the formal framework under which committees operate, their decision making is also the result of unwritten rules and informal procedures that deliver observationally equivalent policy decisions. These behavioral similarities mirror the ones obtained in laboratory experiments. For instance, Fréchette, Kagel, and Morelli (2005) examine the predictions of alternative models of legislative bargaining and also conclude that actual behavior under different voting procedures is more similar than theory would predict.

Also, the data give limited support to the frictionless model(s). The two reasons are, first, that this model predicts lower interest rate autocorrelation than found in the data, and, second, that it implies continuous changes in the interest rate and, consequently, cannot account for the empirical observation that the distribution of actual interest rate changes has a mode of zero. Instead, by introducing a status quo bias, the consensus-based model is better able to capture these features of the data. This means, in particular, that consensus can provide a politicoeconomic explanation for the well-known observation that central banks adjust interest rates more cautiously than is predicted by standard models.

Finally, regarding the agenda-setting model, the data indicate that even if the chairman has proposal power, his policy

7. For example, Blinder (2007, p. 114) ranks the decision-making process of the Bank of Canada as less democratic than those of most central banks in his sample.
recommendation will take into account the preferences of the other committee members. Furthermore, there is very limited empirical support for agenda control on the part of the chairman, as modeled by Romer and Rosenthal (1978). This confirms, using actual data, earlier findings based on laboratory studies (see, for example, Eavey and Miller [1984]).

In addition to the politicoeconomic frictions that are the focus of this paper, other explanations for the large interest rate autocorrelation in the data have been proposed in the literature, including policy maker uncertainty (Sack 2000; Orphanides 2003), better control over long-term interest rates (Woodford 2003), and reduction of financial stress (Cukierman 1991; Goodfriend 1991). In order to evaluate the quantitative contribution of those frictions to our results, we study interest rate decisions by the Bank of Israel, where policy is selected by one individual. Results show that other frictions, besides the ones associated with committee decision making, are also present in the data and partially explain interest rate persistence. However, results also show that (i) politicoeconomic frictions are likely to be the main factor behind differences across monetary committees and (ii) decision making by a single individual is less inertial than that by committees.

Besides monetary economists, the results of this paper should also be of interest to political economists and political scientists interested in testing the implications of competing theories of committee bargaining. A few empirical contributions in the area of legislative studies compare different theories of committee decision making using actual data. For example, Krehbiel, Meirowitz, and Woon (2005) test the predictive power of the pivot and cartel theories of law making in the U.S. Senate using frequencies of cutpoint estimates, that is, estimates of the roll-call-specific location that splits the "yea" side from the "nay" side. However, that literature faces the problems that bargaining is potentially multidimensional, that bill and status quo locations are difficult to observe and measure, and that ideal point estimates are not on the same scale as bill locations.8 In contrast to the complexities associated with legislative bargaining, monetary policy making provides an ideal setup for studying group decisions because voting outcomes and status quo locations are well-defined in the data and decisions concern only one dimension (i.e., the interest rate),

8. Ideal points are usually measured using ideology scores obtained with the Poole–Rosenthal NOMINATE algorithm.
thereby reducing the possibility of logrolling.9 Our empirical analysis does not require the individual voting record of every committee member, which in many central banks is not public.10 We use instead data on final voting decisions by the committee and data on status quo locations. In addition, we employ a tractable economic model and data on inflation and unemployment at the time of the meeting to estimate the members' preferred interest rates.

The remainder of the paper is organized as follows: Sections II and III describe the committee and its decision making under different voting procedures. Section IV estimates the models and compares their empirical results. Section V reviews the literature on interest rate smoothing and extends the analysis to the case where committee members internalize the effect of current policy decisions on future voting outcomes via the status quo. Section VI concludes.

II. THE COMMITTEE

II.A. Composition and Preferences

Consider a monetary policy committee composed of N members, labeled j = 1, . . . , N, where N is an odd integer.11 The committee is concerned with selecting the value of the policy instrument in every meeting. The policy instrument is assumed to be the nominal interest rate, it. The alternatives that can be considered by the committee belong to the continuous and bounded set I = [0, ī]. The relation between the policy instrument and economic outcomes is spelled out below in Section II.B. The utility function of member j is

(1)   Eτ ∑_{t=τ}^{∞} δ^{t−τ} Uj(πt),
9. Given the difficulties of analyzing field data, many researchers have turned to laboratory experiments to test theories of committee bargaining. See Palfrey (2005) for a review of the experimental literature.
10. See, however, Riboni and Ruge-Murcia (2008b), where we use the individual votes of the Bank of England to study heterogeneity in policy preferences among committee members. A difficulty of looking at the voting records is the possible discrepancy between the extent of actual disagreement in the meeting and the dissenting frequencies obtained from the published votes. On this point, see Meade (2005) and Swank, Swank, and Visser (2008).
11. The assumption that N is odd allows us to uniquely pin down the median committee member and eliminates the complications associated with tie votes. Erhart and Vasquez-Paz (2007) find that in a sample of 79 monetary policy committees, 57 have an odd number of members, with five, seven, and nine the most common observations.
where Eτ denotes the expectation conditional on information available at time τ, δ ∈ (0, 1) is the discount factor, πt is the rate of inflation, and Uj(·) is the instantaneous utility function. The information set at time τ is assumed to be the same for all members.12 The instantaneous utility function is represented by the asymmetric linex function (Varian 1974),

(2)   Uj(πt) = [−exp{μj(πt − π*)} + μj(πt − π*) + 1] / μj²,
where π* is an inflation target and μj is a member-specific preference parameter. This functional form generalizes the standard quadratic function used in earlier literature and permits different weights for positive and negative inflation deviations from the target.13 For example, when μj > 0, a positive deviation from π* causes a larger decrease in utility than a negative deviation of the same magnitude. The reason is that for inflation rates above π*, the exponential term dominates and utility decreases exponentially, whereas for inflation rates below π*, the linear term dominates and utility decreases linearly. Intuitively, when μj > 0, committee member j is more averse to positive than to negative inflation deviations from the target even if their size (in absolute value) is identical. To develop the readers' intuition, the quadratic and asymmetric utility functions are plotted in Figure I.

For this asymmetric utility function, the coefficient of relative prudence (Kimball 1990) is μj(πt − π*), which is directly proportional to the inflation deviation from its desired value with coefficient of proportionality μj. The assumption that committee members differ in their relative prudence is a tractable way to motivate disagreement over preferred interest rates despite the fact that members share the same inflation target.14

12. This assumption allows us to focus on preference, rather than on information, aggregation in committees. Contrary to preference aggregation, which usually leads to moderation, information aggregation may lead to decisions that are more extreme than the average of the individual positions (see Sobel [2006]).
13. To see that the function (2) nests the quadratic case, take the limit of Uj(πt) as μj → 0 and use L'Hôpital's rule twice.
14. We use this modeling strategy because, except for the United States, all countries in our sample follow inflation targeting regimes. It is easy to show that our results are robust to assuming instead quadratic utility and heterogeneous inflation targets. In preliminary work, we considered models where utility is a function of both output and inflation, and the relative output weight is heterogeneous across members. However, depending on the specific model, some analytical or econometric tractability is lost. For example, the ordering of preferred interest rates may depend on whether inflation is above or below target, or the reaction to shocks may depend on the preference parameter so that it is no longer possible to derive a closed-form expression for the likelihood function.
[Figure I here: the quadratic and asymmetric (linex) utility functions plotted against the deviation of inflation from target; utility on the vertical axis, deviation from target on the horizontal axis.]

FIGURE I
Utility Functions
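As a quick numerical check on the asymmetry plotted in Figure I, the snippet below evaluates the linex utility in equation (2) for symmetric deviations around the target; the parameter values are purely illustrative and are not estimates from the paper.

```python
import numpy as np

def linex_utility(pi, pi_star, mu):
    """Linex utility of equation (2); for mu -> 0 it reduces to the quadratic -0.5*(pi - pi_star)**2
    (the limit obtained by applying L'Hopital's rule twice)."""
    x = pi - pi_star
    if abs(mu) < 1e-8:                 # quadratic limit, avoids numerical cancellation
        return -0.5 * x**2
    return (-np.exp(mu * x) + mu * x + 1) / mu**2

pi_star, mu_j = 2.0, 0.5               # illustrative numbers only
for dev in (+1.0, -1.0):
    u = linex_utility(pi_star + dev, pi_star, mu_j)
    print(f"deviation {dev:+.1f}: utility {u:+.4f}")   # the +1 deviation costs more than the -1 deviation
print("mu -> 0:", linex_utility(3.0, 2.0, 1e-12))       # returns the quadratic limit, -0.5
```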
We order the N committee members so that member 1 (N) is the one with the smallest (largest) value of μ. That is, μ1 ≤ μ2 ≤ . . . ≤ μN. The median member, denoted by M, is the one with index (N + 1)/2 and, for simplicity, his preference parameter is normalized to be zero (that is, μM = 0).15 We will see below in Section II.C that heterogeneity in preference parameters implies that indirect utilities, expressed as a function of the policy instrument it, attain different maxima depending on μj.

II.B. Economic Environment

As in Svensson (1997), the behavior of the private sector is described in terms of a Phillips curve and an aggregate demand
15. For the empirical part of this project, we will also require that the cross-sectional distribution of μ is time-invariant, meaning that even when there are changes in the composition of the committee (for example, as a result of alternation of voting members), the preference parameters remain unchanged.
curve,

(3)   πt+1 = πt + α1 yt + εt+1,
(4)   yt+1 = β1 yt − β2 (it − πt − ι) + ηt+1,

where yt is the deviation of an output measure from its natural level, ι is the equilibrium real interest rate, α1, β2 > 0 and 0 < β1 < 1 are constant parameters, and εt and ηt are disturbances. The persistence of the disturbances is modeled by means of moving average (MA) processes εt = γ ut−1 + ut and ηt = ς vt−1 + vt, where γ, ς ∈ (−1, 1), so that the processes are invertible, and ut and vt are mutually independent innovations. The innovations are normally distributed white noises with zero mean and constant conditional variances σu² and σv², respectively.

This model embodies a stylized mechanism for the transmission of monetary policy and is widely used in the literature on monetary policy committees (see, among others, Bhattacharjee and Holly [2006]; Gerlach-Kristen [2007]; Weber [2007]). A version of this model is estimated by Rudebusch and Svensson (1999) using U.S. data, and results show that it compares well statistically with other widely used models (for example, a vector autoregression and the MPS model of the Federal Reserve). After some algebra, one can write

(5)   πt+2 = (1 + α1β2)πt + α1(1 + β1)yt + α1β2 ι − α1β2 it + εt+1 + α1 ηt+1 + εt+2.
As a result of the control lag in this model, the interest rate selected by the committee at time t affects inflation only after two periods via its effect on the output gap after one period.

II.C. Policy Preferred by Individual Members

Because monetary policy takes two periods to have an effect on inflation, consider the member-specific interest rate i*j,t chosen at time t to maximize the expected utility of member j at time t + 2. That is,

i*j,t = arg max_{it} δ² Et Uj(πt+2),
subject to equation (5). Equation (5) combines the Phillips and aggregate demand curves and summarizes the constraints imposed by the private sector on the policy choices of the committee. The first-order necessary condition of this problem is

δ²(α1β2) Et [ (μj exp{μj(πt+2 − π*)} − μj) / μj² ] = 0,

which implies that

(6)   Et exp{μj(πt+2 − π*)} = 1.
Under the assumption that innovations are normally distributed and conditionally homoscedastic, the rate of inflation at time t + 2 (conditional on the information set at time t) is normally distributed and conditionally homoscedastic as well. Then, exp{μj(πt+2 − π*)} is distributed lognormal with mean exp{μj(Et πt+2 − π*) + μj²σπ²/2}, where σπ² is the conditional variance of πt. Then, substituting in (6) and taking logs,

(7)   Et πt+2 = π* − μj σπ²/2.
Finally, taking conditional expectations as of time t in both sides of (5) and using (7) deliver member j's preferred interest rate,

(8)   i*j,t = aj + b πt + c yt + ζt,

where

aj = ι − (1/(α1β2)) π* + (μj/(2α1β2)) σπ²,
b = 1 + 1/(α1β2),
c = (1 + β1)/β2,
ζt = (γ/(α1β2)) ut + (ς/β2) vt.
This reaction function implies that if the current output gap or inflation increases, the nominal interest rate should be raised in order to keep the inflation forecast close to the inflation target. Note, however, that ex post inflation will typically differ from π ∗
because of the disturbances that occur during the control lag period. As a result of this uncertainty, the asymmetry in the utility function will induce a prudence motive in the conduct of monetary policy and i*j,t will also depend on inflation volatility in proportion to μj. Hence, the intercept term in the reaction function is member-specific (and hence the subscript j in aj). Notice, for example, that committee members who weight positive deviations from the inflation target more heavily than negative deviations (i.e., those with μj > 0) will generally favor higher interest rates. On the other hand, the coefficients of inflation (b) and the output gap (c) and the disturbance ζt depend only on aggregate parameters (and aggregate shocks in the latter case) and, consequently, they are common to all members.

Because individual reaction functions differ in their intercepts only, it is easy to see that ordering members according to μ, that is, μ1 ≤ μ2 ≤ . . . ≤ μN, translates into an ordering of preferred interest rates, i*1,t ≤ i*2,t ≤ . . . ≤ i*N,t. Finally, because the innovations ut and vt are white noise, ζt is also white noise and its constant conditional variance is σ² = γ²σu²/(α1β2)² + ς²σv²/β2².
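To illustrate how heterogeneity in μj maps into preferred interest rates through equation (8), the following snippet evaluates the reaction function for a five-member committee; all parameter values are made up for illustration and are not the paper's estimates.

```python
import numpy as np

# Illustrative parameter values (not estimates from the paper).
alpha1, beta1, beta2 = 0.4, 0.8, 0.6      # Phillips curve / aggregate demand parameters
iota, pi_star, sigma_pi2 = 2.0, 2.0, 1.0  # real rate, inflation target, conditional variance of inflation

def preferred_rate(mu_j, pi_t, y_t, zeta_t=0.0):
    """Member j's preferred interest rate from equation (8)."""
    a_j = iota - pi_star / (alpha1 * beta2) + mu_j * sigma_pi2 / (2 * alpha1 * beta2)
    b = 1 + 1 / (alpha1 * beta2)
    c = (1 + beta1) / beta2
    return a_j + b * pi_t + c * y_t + zeta_t

mus = np.array([-0.4, -0.2, 0.0, 0.2, 0.4])   # ordered committee; the median member has mu = 0
rates = [preferred_rate(mu, pi_t=3.0, y_t=0.5) for mu in mus]
print(np.round(rates, 2))   # increasing in mu_j, matching the ordering i*_{1,t} <= ... <= i*_{N,t}
```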
III. DECISION MAKING

III.A. The Consensus Model

To model the idea of consensus, this protocol assumes that proposals to adopt a new interest rate require a supermajority to pass. A supermajority is a majority greater than 50% plus one of the votes, or simple majority. Under this protocol, no committee member controls the agenda: the set of alternatives that are put to a vote is chosen according to a predetermined rule.

Let qt denote the status quo policy in the current meeting and assume that the initial status quo is the interest rate, it−1, that was selected in the previous meeting. The state of the economy, which is known and predetermined at the beginning of the meeting, is given by st ≡ (πt, yt, ζt). There are two stages in each meeting. In the first stage, members vote by simple majority rule whether the debate in the second stage will involve an increase or a decrease of the interest rate with respect to the status quo. If the committee votes for an interest rate increase (decrease), all alternatives that are strictly smaller (larger) than qt are immediately discarded.
374
QUARTERLY JOURNAL OF ECONOMICS
In the second stage, the committee selects the interest rate among the remaining alternatives through a binary agenda, with a supermajority required for a proposal to pass.16 A binary agenda is a procedure where the final outcome chosen by the committee is the result of a sequence of pairwise votes (see Austen-Smith and Banks [2005, Ch. 4] for a discussion). Let (N + 1)/2 + K denote the size of the smallest supermajority required for a proposal to pass. The size of the supermajority increases in K, where the integer K ∈ [0, (N − 1)/2] is the minimum number of favorable votes beyond simple majority that are necessary for a proposal to pass. This specification includes as special cases unanimity, when K = (N − 1)/2, and simple majority, when K = 0. The voting procedure is as follows. Suppose, for example, that in the first stage the committee decided to consider an increase of the interest rate. Then the alternative qt + is put to a vote against the status quo qt , where > 0. If qt + does not meet the approval of a supermajority of members, then the proposal does not pass, the meeting ends, and the status quo is implemented. If the proposal passes, then qt + displaces qt (= it−1 ) as the default policy and the meeting continues with the alternative qt + 2 voted against qt + . If qt + 2 does not pass, the final decision becomes qt + . If the proposal passes, then qt + 2 displaces qt + as default policy, the alternative qt + 3 is voted against qt + 2, and so on. For the sake of the exposition and to avoid unnecessary complications, assume that all alternatives and all member-specific preferred interest rates (from equation (8)) are integer multiples of . Note that the number of possible rounds in the second stage of each meeting is finite. Let r denote the round, with r = 1, . . . , R. If the committee keeps accepting further increases, there will be a final round R where the committee will have to choose between i and i − .17 We study pure strategy subgame perfect equilibria with the property that, in each period, individuals vote as if they are pivotal. This refinement is standard in the voting literature and rules out equilibria where a committee member votes contrary to his preferences simply because changing his vote would not alter the voting outcome. Furthermore, it is 16. The voting protocol used here is similar to the one studied by Dal B´o (2006). 17. Alternatively, if in the first stage of the meeting the committee decides to consider interest rate decreases only, then at the final round of the second stage the vote would be between policies and 0.
MONETARY POLICY BY COMMITTEE
375
assumed that committee members are forward-looking within each meeting (that is, they vote strategically in each round of the meeting, foreseeing the effect of their vote on future rounds), but they abstract from the consequences of their voting decision on future meetings via the status quo.18 Let (st , qt , ωt ) denote the political aggregator under a consensus-based voting protocol, where ωt is the vector of induced policy preferences over the interest rate for all committee members. For all states st and qt , the stationary function (·) aggregates the induced policy preferences into a policy outcome. The next proposition shows that the protocol described above delivers a simple equilibrium outcome. For status quo policies that are located close to the median’s preferred policy, the committee does not change the interest rate. For status quo policies that are sufficiently extreme, compared with the values preferred by most members, the committee adopts a new policy that is closer to the median outcome. PROPOSITION 1. The policy outcome in the consensus model is given by ⎧ ∗ ∗ ⎪ ⎨ iM+K,t , if qt > iM+K,t , ∗ ∗ if iM−K,t ≤ qt ≤ iM+K,t , (9) it = (st , qt , ωt ) = qt , ⎪ ⎩ ∗ ∗ iM−K,t , if qt < iM−K,t . Proof. Note that for each committee member, the induced preferences over the interest rate are strictly concave and, consequently, single-peaked, with peak given by (8). The proof consists of the following steps. Step 1. Define the undominated set U(st , ωt ) of the supermajority relation in set I as the set of alternatives that are not defeated in a direct vote against any alternative in I. The set ∗ ∗ , iM+K,t ]. U(st , ωt ) contains all alternatives in the interval [iM−K,t Step 2. We claim that if any policy in U(st , ωt ) is the default in any round r, that policy must be the final outcome of the meeting at time t. This is obviously true in the final round R. We prove that this is true in any round by induction. Suppose that this is true at round r + 1; we show that this is true at round r as well. Suppose that at round r an interest rate i belonging to U(st , ωt ) is the default and, nevertheless, another policy i passes and moves 18. In Section V.B, we relax this assumption.
376
QUARTERLY JOURNAL OF ECONOMICS
to round r + 1. There are two cases: either i also belongs to the undominated set or it does not. In the former case, we know that i will be the final decision according to our inductive hypothesis. But this would mean that a supermajority preferred i to i. This contradicts the fact that i belongs to the undominated set. Suppose instead that i does not belong to the undominated set. Notice that this implies that the alternatives that will be considered in future rounds, including R, will not belong to U(st , ωt ). This is the case because the undominated set is an interval. Then the final outcome must not belong to U(st , ωt ). This contradicts the hypothesis that i belongs to the undominated set. Step 3. By Step 2, we know that if qt ∈ U(st , ωt ), qt will be the ∗ ∗ ≤ qt ≤ iM+K,t . If final outcome. This explains why it = qt if iM−K,t / U(st , ωt ), we know that there is only one direction instead qt ∈ (either an increase or a decrease from the status quo) that allows the committee to eventually reach an alternative in U(st , ωt ). It is easy to see that the committee chooses that direction in the first stage. By doing so, in a finite number of rounds, the committee ∗ ∗ ∗ and iM+K,t + , or between iM−K,t will vote between either iM+K,t ∗ and iM−K,t − . At that round, the alternative in the undominated set will pass and will be the final outcome. This explains why, if ∗ ∗ (or respectively, qt > iM+K,t ), the committee decides to qt < iM−K,t consider an increase (respectively, a decrease) of the interest rate ∗ ∗ (respectively, iM+K,t ) as the final that will eventually lead to iM−K,t outcome. ∗ Intuitively, for an initial status quo qt > iM+K,t , members initially agree on decreasing the nominal rate and successive pro∗ ∗ . When it = iM+K,t , a further posals are passed until it = iM+K,t -decrease will not receive the approval of a supermajority of ∗ . Similarly, for members and the final decision will be it = iM+K,t ∗ an initial status quo qt < iM−K,t , members initially agree on increasing the nominal rate and successive proposals are passed ∗ ∗ . When it = iM−K,t , a further -increase will not until it = iM−K,t receive the approval of a supermajority of members and the fi∗ . Finally, for an initial status quo nal decision will be it = iM−K,t ∗ ∗ iM−K,t ≤ qt ≤ iM+K,t and regardless of the result in the first stage of the meeting, no proposal will pass in the second stage and the interest rate will remain unchanged, it = it−1 . Notice that this protocol features a gridlock interval, that is, a set of status quo policies where policy changes are not possible. The gridlock interval includes all status quo policies
377
MONETARY POLICY BY COMMITTEE A. Consensus
B. Hawkish chairman
i(M+K )
i(A)
i(M-K )
i(M )
i(M-K )
i(M+K )
i(M )
C. Dovish chairman
i (A)
D. Frictionless
i(M ) i(S) i(A)
i (A)
i(M )
i (S )
FIGURE II Policy Outcomes
∗ ∗ qt ∈ [iM−K,t , iM+K,t ] and its width is increasing in the size of the supermajority, K. Note that when K = 0, this model predicts no gridlock interval and delivers the median’s preferred interest rate regardless of the initial status quo. Instead, when K > 0 the median outcome occurs only in the special case where the status quo coincides with the policy preferred by the median (that is, ∗ ). The supermajority requirement induces a status quo qt = iM,t bias because it demands the agreement of most members for a policy change.19 The policy outcome as a function of qt under this protocol is plotted in Panel A of Figure II. Policies on the 45◦ line correspond to it = qt , meaning that the interest rate remains unchanged after the committee meeting.
19. In related work, Gerlach-Kristen (2005) shows that in an economy where policy makers are uncertain about the optimal interest rate, consensus in monetary committees induces relatively small and delayed responses to changes in the state of the economy.
378
QUARTERLY JOURNAL OF ECONOMICS
III.B. The Agenda-Setting Model In this model, proposals are passed by simple majority rule, but members differ in their institutional roles. In particular, the chairman sets the agenda and makes a policy proposal to the other committee members at every meeting. This represents the idea that chairmen have more power and influence than their peers, stemming, for instance, from prestige or additional responsibilities. In what follows, the chairman is denoted by A and the median by M. It is also assumed that μ A = μ M .20 Thus, there are two possible cases: either μ A > μ M or μ A < μ M . In the case where μ A > μ M (μ A < μ M ), the chairman is hawkish (dovish) in the sense that, conditional on inflation, inflation volatility, and the output gap, he prefers a higher (lower) interest rate than M. Note that this specification does not require the chairman to be the most hawkish (or dovish) member of the committee, and there may be members that systematically prefer higher (lower) interest rates than the chairman. The identity of the chairman is assumed to be fixed over time. For the sake of exposition, this section focuses on the case of the hawkish chairman only, but the dovish case is perfectly symmetric. The voting protocol is the following. In each meeting, given the current status quo qt = it−1 , the chairman proposes an interest rate it under closed rule. That is, the other committee members can either accept or reject the chairman’s proposal. If the proposal passes (i.e., it obtains at least (N + 1)/2 votes), then the proposed policy is implemented and becomes the status quo for next meeting. If the proposal is rejected, then the status quo is maintained and it = it−1 . This procedure is repeated at the next meeting. As in the consensus model, individuals vote as if they were pivotal and disregard the consequences of their voting decisions for future meetings via the status quo. Thus, members accept a proposal whenever the current utility from the proposal is larger than or equal to the utility from the current status quo, and the chairman picks the policy closest to his ideal point among those that are acceptable to a majority of (N + 1)/2 members. This voting game is well-known in the political economy literature and was originally derived by Romer and Rosenthal (1978) under the assumption 20. The case μ A = μ M is trivial in that it always delivers the median outcome, and it is therefore observationally equivalent to the protocols studied in the next section.
MONETARY POLICY BY COMMITTEE
379
of symmetric preferences.21 Here, instead, the induced utilities of all members other than the median are single-peaked but not symmetric. In principle, this lack of symmetry may imply that proposals are accepted by a coalition that excludes the median. The proof of Proposition 2 ensures that this is not the case. Define ϒ(st , qt , ωt ) to be the political aggregator in the agendasetting game. The following proposition establishes the policy outcome under this protocol. PROPOSITION 2. The policy outcome in the agenda-setting model with μ A > μ M is given by (10)
it = ϒ(st , qt , ωt ) ⎧ ∗ ⎪ i A,t , ⎪ ⎪ ⎨q , t = ∗ ⎪ 2i ⎪ M,t − qt , ⎪ ⎩ ∗ i A,t ,
if qt > i ∗A,t , ∗ if iM,t ≤ qt ≤ i ∗A,t ,
∗ ∗ if 2iM,t − i ∗A,t ≤ qt < iM,t , ∗ ∗ if qt < 2iM,t − i A,t .
Proof. The proof consists of the following steps. Step 1. Let V j (.) denote the indirect utility of member j as a function of the interest rate and let it denote the current proposal. We show that V j (it ) − V j (qt ) is increasing in μ j for all it and qt ∗ such that qt ≤ iM,t ≤ it . The difference of the expected payoff of committee member j associated with interest rates it and qt is V j (it ) − V j (qt ) exp(t )(exp(−μ j α1 β2 qt ) − exp(−μ j α1 β2 it )) + μ j α1 β2 (qt − it ) , = μ2j where t = (1 + α1 β2 )πt + α1 β2 ι + α1 (1 + β1 )yt + μ2j σπ2 /2 + γ ut + ∗ ≤ it , it can be shown that a sufficient, but ς α1 vt . When qt ≤ iM,t not necessary, condition for V j (it ) − V j (qt ) to be increasing in μ j is that μ2j σπ2 ≥ 2 for all j > M. In the rest of the proof we will assume that this condition is verified. Step 2. First, when qt ∈ (i ∗A, i], the agenda setter proposes i ∗A,t , which is accepted by all members j such that μ j ≤ μ A. This follows ∗ , i ∗A,t ], from the indirect utility being single-peaked. When qt ∈ [iM,t the agenda setter cannot increase the interest rate. The best proposal among the acceptable ones is the status quo, which 21. The agenda-setting model has been used to study monetary policy making by Riboni and Ruge-Murcia (2008c) and Montoro (2006).
380
QUARTERLY JOURNAL OF ECONOMICS
∗ ∗ is always accepted. When qt ∈ [2iM,t − i ∗A,t , iM,t ), the set of policies ∗ − qt ]. By that the median accepts is given by the interval [qt , 2iM,t Step 1, we know that these proposals are accepted by all members ∗ − qt , j such that μ j ≥ μ M and that any proposal greater than 2iM,t which is rejected by the median, is also rejected by all members j ∗ − i ∗A,t ), the agenda such that μ j ≤ μ M . Finally, when qt ∈ [0, 2iM,t ∗ setter is again able to propose i A, which is accepted by the median. By Step 1 this proposal is also accepted by all members j such that μ j ≥ μ M .
The political aggregator as a function of qt is plotted in Panel B of Figure II. The policy aggregator for the case of a dovish chairman is plotted in Panel C of the same figure, and it is easy to see that it is the mirror image of the one derived here for the hawkish chairman. Control over the agenda of the part of the chairman implies deviations from the median outcome. This is due to the fact that the chairman can propose the policy he prefers, among those alternatives that at least a majority of committee members (weakly) prefer to the status quo. Among the acceptable alternatives, there is no reason to expect the chairman to propose the median outcome. Moreover, deviations from the median outcome are systematically in one direction. That is, they will always bring the policy outcome closer to the policy preferred by the chairman. As before, there is an interval of status quo policies for which policy change is not possible (i.e., a gridlock interval). This interval ∗ , i ∗A,t ], that is, all policies between the interest is given by [iM,t rate preferred by the median and the chairman. If the status quo falls within this interval, policy changes are blocked by either the chairman or a majority of committee members. To see this, ∗ , i ∗A,t ], a majority would veto any increase note that when qt ∈ [iM,t of the instrument value towards i ∗A and proposing the status quo is then the best option for the chairman. The width of the gridlock interval is increasing in the distance between the chairman’s and the median’s preferred interest rates. A policy change occurs only if the status quo is sufficiently extreme, compared with the members’ preferred policies. In par∗ ∗ − i ∗A,t , iM,t ), the chairticular, when qt falls in the interval [2iM,t man chooses the policy closest to his or her ideal point subject to the constraint that M will accept it. This constraint is binding at equilibrium, meaning that M will be indifferent between the status quo and the interest rate that A proposes. Because the median has a symmetric induced utility (recall that μ M = 0), this proposal
MONETARY POLICY BY COMMITTEE
381
∗ is the reflection point of qt with respect to iM,t . When the status ∗ ∗ quo policy is either lower than 2iM,t − i A,t or higher than i ∗A,t , the chairman is able to offer and pass the proposal that coincides with his ideal point. In the rest of this section, we compare the theoretical predictions of the consensus and agenda-setting models. First, both models deliver a gridlock interval where it is not possible to change the status quo. However, it is difficult to predict a priori which voting procedure features the largest gridlock interval, because the comparison depends on the degree of consensus that the committee requires (summarized by K) and on the extent of disagreement between the chairman and the median. The intersection of the two ∗ must belong to both gridlock intervals is nonempty, given that iM,t intervals. In principle, the gridlock interval in the agenda-setting model could be a strict subset of the one in the consensus model if |μ A − μ M | were sufficiently small and K sufficiently large, but the converse cannot happen. Second, whenever the committee decides to change the status quo, the models deliver different predictions with respect to the size of the policy change. The agenda-setting model with hawkish (dovish) chairman yields more aggressive interest rate increases (decreases) than the other two models. For example, suppose that ∗ , the agenda-setting the chairman is a hawk. Then, when qt < iM,t model unambiguously predicts a larger policy change than the ∗ , the comparison is amconsensus model. Instead, when qt ≥ iM,t biguous, and the size of the interest rate decrease depends on the ∗ . location of i ∗A,t versus iM+K,t Finally, note that under both protocols, the endpoints of the gridlock interval are stochastic and depend on the current state of the economy. An implication of the predicted local inertia is that the relation between changes in the state of nature and in policy is nonlinear. In particular, small changes in the state of economy are less likely to produce policy changes compared with larger ones. Empirically, this would mean, for example, that small variations in the rates of inflation and unemployment would be less likely to result in a change in the key nominal interest rate, compared with large movements in these variables.
III.C. The Frictionless Model This model is used to describe two protocols, namely the median and the dictator models, which involve different decisionmaking processes, but deliver essentially the same empirical
382
QUARTERLY JOURNAL OF ECONOMICS
implications for the nominal interest rate. In particular, both protocols predict that regardless of the initial status quo, the committee will adopt the interest rate preferred by one key individual: the median member (in the median model) or the chairman (in the dictator model). The protocols are, therefore, frictionless in the sense that the status quo plays no role in determining the current interest rate. Consider first the median model, which is the standard framework of analysis in political economy. Under standard conditions, which are all satisfied in our setting, the Median Voter Theorem (Black 1958) implies a unique core outcome represented by the alternative preferred by the individual whose ideal point constitutes the median of the set of ideal points. Although in its original formulation the Median Voter Theorem lacks a noncooperative underpinning, notice that the median outcome may be obtained as a special case of the consensus-based model when a simple majority is needed to pass a proposal. Applying Proposition 1 to the case where K = 0 (that is, when the required majority equals (N + 1)/2), it is easy to see that, starting from any status quo policy, the interest rate preferred by the median is always selected. Consider now the dictator model. Under this protocol, the chairman, denoted by C, has absolute power over the committee and is able to impose his or her views at every meeting. Hence, the interest rate selected by the committee is the one preferred by the chairman. In this respect, the chairman has much greater power than in the agenda-setting model, where the chairman fully controls the agenda but is subject to an acceptance constraint because a majority is required to pass a proposal. Absent any friction in the political process, both the median and dictator models predict that, within each meeting and starting from any status quo, the committee adopts (11)
∗ = aS + bπt + cyt + ζt , iS,t
where S equals M or C, depending on the model. (Recall that M and C, respectively, stand for the median and chairman.) It is important to note that in a frictionless model there is neither inertia nor path dependence. Having a committee is then equivalent to having either the median or the chairman as single central banker and, therefore, the reaction function is observationally indistinguishable from a standard Taylor rule derived under the assumption that monetary policy is selected by one individual. This
MONETARY POLICY BY COMMITTEE
383
model predicts a proportional adjustment of the policy instrument in response to any change in inflation and unemployment, regardless of their size, and generates interest rate autocorrelation only from the serial correlation of the fundamentals. The policy outcome predicted by the frictionless model is plotted in Panel D of Figure II. In all the panels of this figure, the size of the policy change may be inferred from the vertical distance between the policy rule and the 45◦ line. IV. EMPIRICAL ANALYSIS IV.A. The Data The data set consists of interest rate decisions by monetary policy committees in five central banks, namely the Bank of Canada, the Bank of England, the ECB, the Swedish Riksbank, and the U.S. Federal Reserve, along with measures of inflation and the output gap in their respective countries. Inflation is measured by the twelve-month percentage change of the Consumer Price Index (Canada and Sweden), the Retail Price Index excluding mortgage-interest payments or RPIX (United Kingdom), the Harmonized Consumer Price Index (European Union), and the Consumer Price Index for All Urban Consumers (United States).22 The output gap is measured by the deviation of the seasonally adjusted unemployment rate from a trend computed using the Hodrick–Prescott filter. Interest rate decisions concern the target values for the Overnight Rate (Canada), the Repo Rate (United Kingdom and Sweden), the Rate for Main Refinancing Operations (European Union), and the Federal Funds Rate (United States). For the Federal Reserve, the sources are Chappell, McGregor, and Vermilyea (2005) and the minutes of the FOMC meetings, which are available at www.federalreserve.gov. For the Riksbank, the source is the minutes of the meetings of the Executive Board, which are available at www.riksbank.com. For the other central banks, the sources are official press releases compiled by the authors. The sample for Canada starts with the first preannounced date for monetary policy decisions in December 2000 and ends in March 2007. The sample for the United Kingdom starts with the 22. Since December 2003, the inflation target in the United Kingdom applies to the consumer price index (CPI) rather than to the RPIX. However, results using the CPI are similar to the ones reported below and are available upon request.
384
QUARTERLY JOURNAL OF ECONOMICS
first meeting of the Monetary Policy Committee in June 1997 and ends in June 2007. The sample for the European Union starts on January 1999, when the ECB officially took over monetary policy from the national central banks, and ends in March 2007. The sample for Sweden starts with the first meeting of the Executive Board on January 1999 and ends in June 2007. The sample for the United States starts in August 1988 and ends in January 2007. This period corresponds to the chairmanship of Alan Greenspan, with a small number of observations from the chairmanship of Ben Bernanke.23 The number of scheduled meetings per year varies from seven or eight (Bank of Canada, Riksbank, and Federal Reserve) to eleven (ECB) and twelve (Bank of England). There is substantial heterogeneity in the formal procedures followed by the monetary policy committees in our sample. The Governing Council of the Bank of Canada consists of the Governor and five Deputy Governors and explicitly operates on a consensus basis. This means that the discussion at the meeting is expected to move the committee toward a shared view. The Monetary Policy Committee of the Bank of England consists of nine members of whom five are internal (that is, chosen from within the ranks of bank staff) and four are external appointees. Meetings are chaired by the Governor of the Bank of England, decisions are made by simple majority, and dissenting votes are public. The decision-making body of the ECB consists of six members of the Executive Board and thirteen governors of the national central banks. According to the statutes (see footnote 3 above), monetary policy is decided by simple majority rule. The ECB issues no minutes and, consequently, dissenting opinions are not made public. Under the Riksbank Act of 1999, the Swedish Riksbank is governed by an Executive Board, which includes the Governor and five Deputy Governors, and decisions concerning the Repo Rate are made by majority vote, but formal reservations against the majority decision are recorded in the minutes. Finally, the FOMC takes decisions by majority rule among the seven members of the Board of Governors, the president of the New York Fed, and four members of the remaining district banks, chosen according to an annual rotation scheme. The minutes of FOMC meetings are made public. However, unlike the Riksbank and 23. The working paper version of this article (Riboni and Ruge-Murcia 2008a) also reports results for a U.S. subsample from February 1970 to February 1978, which corresponds to the chairmanship of Arthur Burns. The conclusions drawn from that subsample are the same as those reported here.
MONETARY POLICY BY COMMITTEE
385
the Bank of England, dissenting members in the FOMC do not always state the exact interest rate they would have preferred, but only the direction of dissent (either tightening or easing). IV.B. Formulation of the Likelihood Functions This section shows that the political aggregators derived in Section II imply particular time-series processes for the nominal interest rate and presents their log-likelihood functions under the maintained assumption that shocks are normally distributed.24 First, consider the consensus-based model. The political aggregator (9) in Proposition 1 means that the nominal interest rate follows a nonlinear process whereby each observation may belong to one of three possible regimes depending on whether the sta∗ ∗ , smaller than iM−K,t , or in tus quo (qt = it−1 ) is larger than iM+K,t between these two values. In the first case, the committee cuts ∗ ; in the second case, it raises the inthe interest rate to iM+K,t ∗ terest rate to iM−K,t ; in the third case, it keeps the interest rate unchanged. Because the data clearly show the instances where the committee takes each of these three possible actions, it follows that the sample separation is perfectly observable and each interest rate observation can be unambiguously assigned to its respective regime. Define the set t = {it−1 , πt , yt } with the predetermined variables at time t, and the sets 1 , 2 and 3 that contain the observations where the interest rate was cut, left unchanged, and raised, respectively. Denote by T1 , T2 , and T3 the number of observations in each of these sets and by T (= T1 + T2 + T3 ) the total number of observations. Then the log likelihood function of the T available interest rate observations is simply L(θ ) = −(T1 + T3 )σ + log φ(zM+K,t ) +
it ∈2
it ∈1
log((zM−K,t ) − (zM+K,t )) +
log φ(zM−K,t ),
it ∈3
where θ = {aM+K , aM−K , b, c, σ } is the set of unknown parameters, zM+K,t = (it−1 − aM+K − bπt − cyt )/σ , zM−K,t = (it−1 − aM−K − bπt − cyt )/σ , and φ(·) and (·) are the probability density and cumulative distribution functions of the standard normal variable, respectively. The maximization of this function with respect 24. For the detailed derivation of these functions, see Section 4.2 in the working paper version of this article (Riboni and Ruge-Murcia 2008a).
386
QUARTERLY JOURNAL OF ECONOMICS
to θ delivers consistent maximum likelihood (ML) estimates of the parameters of the interest rate process under the consensus model. The log likelihood function of the consensus model is similar to the one studied by Rosett (1959), who generalizes the two-sided Tobit model to allow the mass point to be anywhere in the conditional cumulative distribution function. In both models, the dependent variable reacts only to large changes in the fundamentals. However, whereas Rosett’s frictional model is static and the mass point is concentrated around a fixed value, the consensus-based model is dynamic and the mass point is concentrated around a time-varying and endogenous value, albeit predetermined at the beginning of the meeting. Second, consider the agenda-setting model with a hawkish chairman. (The case of the dovish chairman is isomorphic and not presented here to save space.) The political aggregator (10) in Proposition 2 means that the nominal interest rate follows a nonlinear process where each realization belongs to one of four possible regimes, rather than three, as in the consensus model. In the case where it−1 is larger than i ∗A,t , the committee cuts the interest rate to i ∗A,t , and the observation can be unambiguously ∗ and assigned to the set 1 . In the case where it−1 is between iM,t ∗ i A,t , the committee keeps the interest rate unchanged and the observation clearly belongs to 2 . However, in the case where it−1 ∗ is smaller than iM,t (for example, as a result of a sufficiently large realization of ζt ), the agenda setter may propose an interest rate ∗ − it−1 or i ∗A,t depending on whether the increase to either 2iM,t acceptance constraint is binding or not. Although the observation can be assigned to 3 , one cannot be sure which of the two regimes ∗ − it−1 or i ∗A,t ) has generated it . The reason is simply (whether 2iM,t that on the basis of interest rate data alone, it is not possible to know ex ante whether the acceptance constraint is binding or not. Hence, in the agenda-setting model, the sample separation is imperfect. The log likelihood function of the T available observations is
L(ϑ) = −(T1 + T3 )σ + +
it ∈3
it ∈1
log φ(z A,t ) +
log((zM,t ) − (z A,t ))
it ∈2
log(φ(z A,t )I(wt ) + (1/2)φ(zD,t )(1 − I(wt ))),
MONETARY POLICY BY COMMITTEE
387
where ϑ = {aA, aM , b, c, σ } is the set of unknown parameters, z A,t = (it−1 − aA − bπt − cyt )/σ, zM,t = (it−1 − aM −bπt − cyt )/σ, zD,t = (it − 2(aM + bπt + cyt ) + it−1 )/σ , wt is short-hand for the condition it − it−1 − 2(aA − aM ) < 0, and I(·) is an indicator function that takes the value 1 if its argument is true and zero otherwise. The terms in the latter summation show that for interest rate increases, the density is a mixture of the two normal distributions ∗ − it−1 and i ∗A,t . The weights associated with the processes 2iM,t of this mixture take either the value zero or one because the disturbance term is the same in both processes and, hence, these distributions are perfectly correlated. By maximizing this function with respect to ϑ, it is possible to obtain consistent ML estimates of the parameters of the interest rate process under the agenda-setting model.25 Finally, consider the frictionless model where it = aS + bπt + cyt + ζt and the log likelihood function of the T available observations is just L(ϕ) = −T σ + log φ(ZS,t ), ∀it
where ϕ = {aS , b, c, σ } is the set of unknown parameters and ZS,t = (it − aS − bπt − cyt )/σ . The maximization of this function with respect to ϕ delivers consistent ML estimates of the parameters of the interest rate process under the frictionless model. Notice, however, that with data on interest rate decisions alone, it is not possible to distinguish between the two possible interpretations of the frictionless model. IV.C. Empirical Results Tables I through V report empirical results for the monetary committees of the Bank of Canada, the Bank of England, the ECB, the Swedish Riksbank, and the U.S. Federal Reserve. Panel A in these tables reports maximum likelihood estimates of the parameters of the interest rate process under each protocol. Although some coefficients are not statistically significant, estimates for all protocols are generally in line with the theory in the sense that 25. The indicator function I(·) induces a discontinuity in the likelihood function and, consequently, this maximization requires either the use of a non-gradientbased optimization algorithm or a smooth approximation to the indicator function. We followed the latter approach here, but using the simulated annealing algorithm in Corana et al. (1987), which does not require numerical derivatives but is more time-consuming, delivers the same results.
388
QUARTERLY JOURNAL OF ECONOMICS TABLE I BANK OF CANADA Dominant chairman Consensus Hawkish
aM+K aM−K
3.175∗ (0.422) 1.357∗ (0.471)
aM aA aS b c σ
0.386∗ (0.174) −0.120 (0.304) 0.965∗ (0.124)
L(·) −58.76 AIC 127.53 RMSE 0.506 MAE 0.388 Chairman extracts all rents (p-value) Autocorrelation Standard deviation Proportion of Cuts Increases No changes Policy reversals
Dovish
With size Frictionless friction Data
A. Parameter estimates
1.314∗ (0.462) 3.162∗ (0.413)
3.257∗ (0.414) 1.377∗ (0.462)
2.618∗ (0.343)
2.604∗ (0.343)
0.381∗ (0.171) −0.112 (0.297) 0.942∗ (0.120)
0.388∗ (0.171) −0.135 (0.297) 0.941∗ (0.120)
0.231 (0.143) −0.041 (0.256) 0.845∗ (0.085)
0.237 (0.143) −0.036 (0.246) 0.845∗ (0.086)
B. Criteria for model selection −67.07 −67.53 −62.82 −77.38 144.14 145.06 133.64 162.75 0.631 0.867 0.850 0.992 0.502 0.561 0.715 0.844 <.001 <.001
.546 0.637
C. Quantitative predictions .397 .365 .014 0.898 0.977 1.260
.024 1.247
.873 0.240
0.204 0.253 0.544 0.107
0.165 0.301 0.534 0.129
0.406 0.429 0.164 0.487
0.280 0.300 0.420 0
0.267 0.194 0.539 0.130
0.486 0.513 0 0.637
Notes. Superscripts ∗ and † denote rejection of the hypothesis that the true parameter value is zero at the 5% and 10% significance levels. Autocorrelation is the first-order autocorrelation of the nominal interest rate. The standard deviation is that of the change in the nominal interest rate. Policy reversals are interest rate changes of opposite sign in two consecutive meetings.
they imply that committees tend to raise (cut) interest rates when inflation (unemployment) increases. Panels B and C, respectively, compare the protocols in terms of model-selection criteria and in terms of their quantitative predictions.26 The model selection criteria in Panel B are the root 26. We follow this model evaluation approach because the protocols imply non-nested statistical processes for the interest rate and, consequently, the usual
389
MONETARY POLICY BY COMMITTEE TABLE II BANK OF ENGLAND Dominant chairman Consensus Hawkish aM+K aM−K
6.511∗ (1.480) 2.055 (1.474)
aM aA
Dovish
A. Parameter estimates
2.070 (1.445) 6.451∗ (1.451)
6.496∗ (1.450) 2.101 (1.443)
0.281 (0.598) −3.366∗ (1.229) 1.638∗ (0.217)
0.289 (0.597) −3.352∗ (1.231) 1.636∗ (0.216)
aS b
0.279 (0.609) −3.390∗ (1.261) 1.686∗ (0.226)
c σ
With size Frictionless friction Data
4.262∗ (0.704) 0.339 (0.299) −3.303∗ (0.619) 1.013∗ (0.069)
4.256∗ (0.704) 0.341 (0.299) −3.294∗ (0.619) 1.007∗ (0.069)
B. Criteria for model selection L(·) −98.20 −105.23 −108.27 −154.66 −209.49 AIC 206.40 220.27 226.55 317.33 437 RMSE 0.329 0.426 0.536 1.013 1.195 MAE 0.251 0.335 0.369 0.885 1.043 Chairman extracts .001 <.001 all rents (p-value) Autocorrelation Standard deviation Proportion of Cuts Increases No changes Policy reversals
.766 0.623
C. Quantitative predictions .641 .638 .179 0.957 0.988 1.451
.182 0.152
.977 0.153
0.113 0.135 0.752 0.039
0.096 0.173 0.731 0.050
0.423 0.438 0.139 0.521
0.111 0.157 0.731 0
0.157 0.113 0.731 0.052
0.492 0.508 0 0.648
Notes. See notes to Table I.
mean squared error (RMSE), the mean absolute error (MAE), and the Akaike information criteria (AIC), and are computed as tests do not have a standard distribution. Two advantages of using model selection criteria are that they treat the competing models equally and deliver a definite outcome, while under hypothesis testing one model is given a privileged status as the null, and its rejection does not amount to accepting the alternative. On the other hand, model selection criteria are not as well suited as hypothesis testing for inferential problems. For a complete discussion, see Pesaran and Weeks (2001).
390
QUARTERLY JOURNAL OF ECONOMICS TABLE III EUROPEAN CENTRAL BANK Dominant chairman Consensus Hawkish 4.926∗ (0.885) 0.325 (0.808)
aM+K aM−K aM aA
Dovish
A. Parameter Estimates
0.351 (0.773) 4.789∗ (0.845)
4.859∗ (0.844) 0.413 (0.773)
0.237 (0.361) −1.801∗ (0.418) 1.291∗ (0.195)
0.231 (0.360) −1.812∗ (0.416) 1.290∗ (0.195)
aS b
0.220 (0.376) −1.877∗ (0.440) 1.363∗ (0.208)
c σ
L(·) −75.36 AIC 160.73 RMSE 0.225 MAE 0.138 Chairman extracts all rents (p-value) Autocorrelation Standard deviation Proportion of Cuts Increases No changes Policy reversals
With size Frictionless friction Data
2.123∗ (0.305) 0.437∗ (0.151) −1.109∗ (0.194) 0.809∗ (0.050)
2.155∗ (0.306) 0.422 (0.152) −1.173∗ (0.195) 0.801∗ (0.051)
B. Criteria for model selection −83.07 −78.79 −159.38 −235.97 176.13 167.58 326.75 479.94 0.358 0.240 0.809 0.795 0.203 0.152 0.678 0.655 <.001 .005
.865 0.322
C. Quantitative predictions .806 .817 .248 0.493 0.444 1.150
.270 1.133
.988 0.145
0.076 0.056 0.867 0.010
0.060 0.080 0.860 0.016
0.418 0.408 0.174 0.494
0.106 0.061 0.833 0.008
0.096 0.047 0.858 0.013
0.504 0.496 0 0.660
Notes. See notes to Table I.
RMSE =
T t=1 (ii
T MAE =
t=1
− E(ii |t ))2 T
|ii − E(ii |t )| , T
1/2 , and
AIC = 2k − 2L(·),
391
MONETARY POLICY BY COMMITTEE TABLE IV SWEDISH RIKSBANK Dominant chairman Consensus Hawkish aM+K aM−K
3.578∗ (0.249) 1.366∗ (0.260)
aM aA
Dovish
A. Parameter estimates
1.375∗ (0.244) 3.530∗ (0.233)
3.716∗ (0.285) 1.406∗ (0.270)
0.528∗ (0.119) −0.614∗ (0.228) 0.804∗ (0.113)
0.348∗ (0.136) −0.473† (0.260) 0.936 (0.137)
aS 0.541∗ (0.126) −0.603∗ (0.241) 0.859∗ (0.122)
b c σ
L(·) −63.24 AIC 136.48 RMSE 0.255 MAE 0.180 Chairman extracts all rents (p-value) Autocorrelation Standard deviation Proportion of Cuts Increases No changes Policy reversals
With size Frictionless friction Data
2.515∗ (0.110) 0.469∗ (0.073) −0.477∗ (0.145) 0.593∗ (0.047)
2.516∗ (0.109) 0.467∗ (0.073) −0.483∗ (0.143) 0.577∗ (0.048)
B. Criteria for model selection −70.19 −76.57 −70.80 −107.99 150.37 163.15 149.60 223.97 0.354 0.374 0.593 0.675 0.230 0.272 0.457 0.551 <.001 .001
.821 0.400
C. Quantitative predictions .719 .593 .369 0.587 0.677 0.899
.390 0.879
.972 0.181
0.155 0.148 0.697 0.047
0.122 0.190 0.688 0.061
0.391 0.387 0.223 0.425
0.177 0.139 0.684 0
0.204 0.129 0.668 0.072
0.502 0.498 0 0.632
Notes. See notes to Table I.
where t = {it−1 , πt , yt }, k is the number of free parameters, and L(·) is the maximized value of the log likelihood function. Note that the AIC penalizes the less parsimonious models linearly.27 27. In unreported work, we also used the Bayesian information criteria, but conclusions are the same as those obtained using the AIC.
392
QUARTERLY JOURNAL OF ECONOMICS TABLE V U.S. FEDERAL RESERVE Dominant chairman Consensus Hawkish
aM+K aM−K
4.355∗ (0.575) 0.359 (0.529)
aM aA
Dovish
A. Parameter estimates
0.310 (0.522) 4.316∗ (0.564)
4.326∗ (0.560) 0.325 (0.515)
0.758∗ (0.158) −4.465∗ (0.672) 1.811∗ (0.141)
0.787∗ (0.158) −4.477∗ (0.670) 1.805∗ (0.140)
aS 0.759∗ (0.162) −4.554∗ (0.685) 1.853∗ (0.145)
b c σ
With size Frictionless friction Data
1.775∗ (0.392) 0.985∗ (0.125) −2.794∗ (0.501) 1.538∗ (0.087)
1.772∗ (0.392) 0.986∗ (0.124) −2.791∗ (0.501) 1.534∗ (0.087)
B. Criteria for model selection L(·) −218.65 −241.14 −243.50 −292.18 −347.64 AIC 447.30 492.28 497.01 592.35 703.28 RMSE 0.745 1.207 1.046 1.538 1.846 MAE 0.547 0.766 0.771 1.291 1.558 Chairman extracts <.001 <.001 all rents (p-value) Autocorrelation Standard deviation Proportion of Cuts Increases No changes Policy reversals
.840 1.061
C. Quantitative predictions .711 .710 .442 1.588 1.626 2.247
.450 2.238
.989 0.270
0.193 0.202 0.605 0.082
0.155 0.254 0.591 0.104
0.451 0.460 0.089 0.559
0.234 0.259 0.506 0.013
0.246 0.160 0.594 0.104
0.496 0.504 0 0.645
Notes. See notes to Table I.
The quantitative predictions in Panel C are derived by means of simulations as follows. Given current inflation and unemployment and the previous observation of the nominal interest rate, we draw a realization of ζt from a normal distribution with zero mean and standard deviation equal to its ML estimate. Then, for each protocol, we compute the preferred policies of the key member(s) using the ML estimates of their reaction function parameters and
MONETARY POLICY BY COMMITTEE
393
use the political aggregators in Section II to derive the interest rate selected by the “artificial” committee. This algorithm is repeated as many times as the actual sample size to deliver one simulated path of the nominal interest rate. Then, from the simulated sample, we compute the autocorrelation function and the proportion of interest rate cuts, increases, no changes, and policy reversals. We define policy reversals as interest rate changes of opposite sign in two consecutive meetings. The numbers reported in Panel C are the averages of these statistics over 240 replications of this procedure. As we discuss in the following two sections, the results in Panels B and C indicate that for all committees in the sample, the consensus model fits the data better than the other models of collective decision making. Comparison between the Consensus and Frictionless Models. Let us start by comparing the consensus and frictionless models. Panel B in Tables I through V shows that for all monetary committees, the frictionless model features larger RMSE, MAE, and AIC than the consensus model. The comparatively poor performance of the frictionless model can be seen in Figures III–VII, which plot the interest rate decisions and the fitted values under all protocols for each committee in the sample. Although the frictionless model tracks interest rate decisions in broad terms, the quantitative difference between the actual and predicted values is the largest among the protocols studied. There are two reasons for this result. First, the frictionless model generates interest rate autocorrelation only from the serial correlation of inflation, unemployment, and the disturbance term. However, the serial correlation of these variables is not large enough to account for the large autocorrelation of the nominal interest rate. To illustrate this point, Figure VIII plots the autocorrelation functions (up to ten lags) implied by each protocol and the one computed from the data, and Panel C reports the firstorder autocorrelations. From this figure and panel, it is clear that the frictionless model generally predicts less interest rate persistence than the consensus model and than that found in the data. Second, the frictionless model delivers a linear reaction function whereby the interest rate changes whenever inflation, unemployment, or the shock realization change. Hence, this model cannot explain the relatively large number of instances where the committee keeps the interest rate unchanged despite the fact that
394
QUARTERLY JOURNAL OF ECONOMICS
FIGURE III Model Fit: Bank of Canada
fundamentals have changed. From the simulations reported in Panel C, we can see that this model predicts that the proportion of observations where the nominal interest rate remains unchanged is exactly zero, while the consensus model predicts a proportion similar to that found in the data. Moreover, the consensus model predicts substantially fewer policy reversals than the frictionless model. This result also contributes to the empirical success of the consensus model because policy reversals are rare events in the actual data. To understand why the consensus model predicts fewer policy reversals than the frictionless model, suppose that following a positive shock at time t, the consensus-based committee increases ∗ . Moreover, suppose that at time t + 1 a the interest rate to iM−K,t
MONETARY POLICY BY COMMITTEE
395
FIGURE IV Model Fit: Bank of England
shock of the same (positive) sign occurs. This likely implies that ∗ ∗ iM−K,t+1 > qt+1 , where qt+1 = iM−K,t . It is easy to see that a supermajority that included all members that were more hawkish than M − K would be willing to pass a further interest rate increase. Suppose instead that at time t + 1 a shock of opposite sign (that is, negative) occurs. Although member M − K would be in favor of an interest rate reduction, notice that, unless the shock is large, ∗ > qt+1 and the committee will keep the interest rate uniM+K,t+1 changed at time t + 1. In contrast, under the frictionless model, two consecutive shocks of opposite sign are more likely to push the desired interest rate in different directions, implying a policy reversal.
396
QUARTERLY JOURNAL OF ECONOMICS
FIGURE V Model Fit: European Central Bank
The finding that the frictionless model is only partly successful in characterizing actual monetary policy decisions is important because this model underpins the derivation of the Taylor rules frequently used in the literature. This finding also suggests that the ad hoc smoothing term that is usually appended to theoretically derived rules plays a nontrivial role in their reported empirical success. For example, the R2 s of the frictionless model are .05, .22, .27, .46, and .49 for the Bank of Canada, the Bank of England, the ECB, the Swedish Riksbank, and the Federal Reserve, respectively. Once an ad hoc smoothing term is added to the regression, the R2 s increase to .94, .98, .98, .95, and .99, respectively.
MONETARY POLICY BY COMMITTEE
397
FIGURE VI Model Fit: Swedish Riksbank
One simple amendment to the frictionless model involves costs of small policy changes as assumed by Eijffinger, Schaling and Verhagen (1999) and Guthrie and Wright (2004). These unspecified costs generate an inaction range around the previous interest rate and may help explain the fact that most central banks do not make policy changes smaller than 25 basis points. A simple way to represent these ideas is the following statistical model: ⎧ ∗ ∗ ⎨ iM,t , if it−1 ≥ iM,t + 0.25, ∗ − it−1 < 0.25, it = it−1 , if − 0.25 < iM,t ⎩ ∗ ∗ iM,t , if it−1 ≤ iM,t − 0.25.
398
QUARTERLY JOURNAL OF ECONOMICS
FIGURE VII Model Fit: U.S. Federal Reserve
In this case, the median (or the chairman) keeps the interest rate unchanged whenever the distance between his preferred policy and the status quo is less that one-fourth point. Otherwise, he increases or decreases the interest rate to his preferred point. Compared with the consensus and agenda-setting models, this specification also generates an interval around the status quo where policy does not change. However, the width of this interval is fixed at fifty basis points. Also, this specification predicts the same time series process for interest rate increases and decreases. This model is estimated by ML for all central banks in our sample and the results are reported under a self-explanatory heading in
MONETARY POLICY BY COMMITTEE
399
FIGURE VIII Autocorrelations
Tables I through V.28 The results indicate that this ad hoc friction delivers a small increase in the predicted autocorrelation of the model, fewer policy reversals, and some observations with no interest changes. However, the predicted proportion of these observations is still very different from the ones in the data and the overall fit of the model is somewhat worse than that of the frictionless model in almost all cases. This suggests that frictions in 28. We stress that the intention here is not to perform a full-fledged econometric analysis of the models in Eijffinger, Schaling, and Verhagen (1999) and Guthrie and Wright (2004). Instead, the goal is to examine whether ad-hoc frictions in the size of interest rate changes are a promising venue to remedy the empirical limitations of the standard model.
400
QUARTERLY JOURNAL OF ECONOMICS
the size of interest rate adjustments alone are unlikely to explain the high autocorrelation and frequent observations of no interest rate changes found in the data. The finding that the consensus model provides a better fit than the frictionless model hints at the existence of politicoeconomic frictions that may make changes to the status quo more difficult to implement in committees. Suggestive evidence that these frictions may be partly related to the decision-making process is the positive correlation between the improvement in statistical fit of the consensus over the frictionless model29 and the committee size and the democracy index constructed by Alan Blinder (Blinder 2007).30 See the top two panels in Figure IX. These panels show that the fit improvement is larger for central banks that are very democratic according to Blinder and/or have relatively larger committees. The fit improvement is smallest for the Bank of Canada, which has a low democracy index and a relatively small committee. Recalling the dual interpretation of the frictionless model as either the dictator or the median model, results in this section are consistent with two strands of evidence. First, to the extent that the frictionless model represents the dictator model, results are consistent with various evidence showing that chairmen are not de facto single decision makers. For instance, Chappell, McGregor, and Vermilyea (2005, p. 109) find that the voting weight of Arthur Burns within the FOMC was approximately 40% to 50%. Because median and mean voters were also found to be significant in explaining policy outcomes, they refute the view that Burns was a dictator. Along the same lines, former FOMC members Sherman Maisel (1973, p. 124) and Laurence Meyer (2004, p. 52) respectively observe that the Chairman “does not make policy alone” and “does not necessarily always get his way.” Second, to the extent that the frictionless model represents the median (or simple-majority) model, the better fit of the consensus model is consistent with empirical and experimental evidence indicating that committee decisions typically involve qualified, 29. We measure fit improvement as ln(RMSE(frictionless)/RMSE(consensus)). The results using the MAE or the difference in log likelihood functions are the same as those reported here and support the same conclusions. 30. This index ranks monetary policy committees according to the level of “democracy” in their decision-making process. Committees with a low index are those where, according to Alan Blinder, decisions are to a large extent dictated by the chairman, whereas committees with a high index are those where disagreement is accepted and committee members can freely voice their dissent.
MONETARY POLICY BY COMMITTEE
401
FIGURE IX Relation with Committee Characteristics
rather than simple, majorities. For example, the voting records of the committees of the Bank of England, the Riksbank, and the Federal Reserve show that split decisions are extremely infrequent. The experimental study by Blinder and Morgan (2005) finds than even though their artificial monetary committee is supposed to make decisions by majority rule, in reality most decisions are unanimous. Experimental runs of the divide-the-dollar game show that despite the simple-majority requirement necessary to pass a proposal, the agenda setter does not always select a minimum winning coalition: in some cases (roughly 30% to 40% of the experiments in McKelvey [1991] and Diermeier and Morton [2005]), agenda setters allocate money to all players. It remains an open question why a committee formally operating under simple majority rule adopts a supermajority rule in practice. The search for consensus might be motivated by the belief that split votes lead to aggrieved minorities and undermine future cooperation. Alternatively, it might be the case
402
QUARTERLY JOURNAL OF ECONOMICS
that the initial status quo determines a basic form of entitlement. In a laboratory study, Diermeier and Gailmard (2006) find that subjects are more willing to accept less generous offers if the proposer’s reservation value is high, thereby supporting the idea that a favorable initial status quo determines an entitlement to a larger amount of resources. In our context, this would explain why a simple majority of committee members who dislike the status quo are less willing to push for a policy change that would hurt a minority that prefers the status quo to the new alternative. Another explanation is proposed by Bullard and Waller (2004) and Dal B´o (2006), who argue that a supermajority rule may help overcome time-consistency problems by inducing policy stickiness. Finally, Janis (1972) describes the psychological drive for consensus in small groups, and argues that the emphasis on consensus may suppress dissent and restrict the range of options considered in group decision making. Comparison between the Consensus and Agenda-Setting Models. Now let us compare the consensus and agenda-setting models. These two protocols deliver outcomes different from the median policy, predict a time-varying gridlock interval where the committee chooses to keep the interest rate unchanged, and endogenously generate interest rate autocorrelation from the role of the status quo as the default policy. However, these protocols generate different implications for the size of interest rate adjustments. More precisely, the agenda-setting model predicts larger interest rate increases (decreases) than the consensus model when the chairman is more hawkish (dovish) than the median. The reason for this is that the chairman controls the agenda in the agenda-setting model whereas no member does so in the consensus model. Hence, deviations from the median’s preferred policy are systematically in the chairman’s favor in the former model. From Panel B in Tables I through V, we can see that for all monetary committees, the two versions of the agenda-setting model feature larger RMSE, MAE, and AIC than the consensus model. The poorer performance of the agenda-setting model in terms of fit may also be observed in Figures III–VII. Note, in particular, that the agenda-setting model with a hawkish (dovish) chairman tends to overpredict the magnitude of interest rate increases (decreases) compared with the consensus model. In terms of the statistics reported in Panel C of all tables and the
MONETARY POLICY BY COMMITTEE
403
autocorrelation functions plotted in Figure VIII, it also clear that the agenda-setting model is not as successful as the consensus model in replicating the persistence of interest rates and the proportions of interest rate cuts, increases, no changes, and policy reversals observed in the data.31 Recall that in the agenda-setting model with a hawkish (dovish) chairman, when the committee decreases (increases) the interest rate, the chairman’s preferred alternative is selected. On the other hand, when the committee increases (decreases) the interest rate, the new policy belongs to either of two possible regimes depending on whether the acceptance constraint is binding or not. As noted above in Section III.B, it is not possible to know ex ante whether this constraint is binding. However, given the ML estimates, it is possible to construct ex post probability estimates that a given interest rate observation belongs to either regime. We found that in all cases and for all committees in the sample, interest rate increases (decreases) by the hawkish (dovish) chairman are generated by the regime where the acceptance constraint binds. This suggests that even though the chairman has proposal power, he often has to compromise and choose a policy proposal that takes into account the preferences of the other committee members. Based on FOMC transcripts for the period 1987 to 1996, Chappell, McGregor, and Vermilyea (2005, p. 186) conclude that “there are at least suggestions that Greenspan’s proposals were crafted with knowledge of what other members might find acceptable.” In the rest of this section, we investigate in more detail the reasons that the agenda-setting model is less empirically successful than the consensus-based model. A feature of the agendasetting model is that the chairman is able to use his proposal power to extract all rents when the status quo is sufficiently far from the median’s preferred interest rate. For example, when ∗ ∗ − i ∗A,t , iM,t ), the hawkish chairman proposes the policy it−1 ∈ [2iM,t ∗ 2iM,t − it−1 , which is the reflection point of the status quo with respect to the median’s preferred interest rate. The proposed policy leaves the median as well off as under the status quo and delivers 31. When we compare both versions of the agenda-setting model, results in the tables show that, except for the ECB, the version with a hawkish chairman meets somewhat greater empirical success than the version with a dovish chairman. In particular, the former delivers smaller RMSE, MAE, and AIC and larger interest rate autocorrelation than the latter. Notice that in both cases the predicted distribution of interest rate changes is asymmetric, with proportionally more increases than decreases (decreases than increases) in the hawkish (dovish) case.
all surplus to the chairman, who has fully exploited his proposal power. In order to test this feature of the agenda-setting model, consider the statistical model
$$
(12)\qquad i_t =
\begin{cases}
i^*_{A,t}, & \text{if } i_{t-1} > i^*_{A,t},\\
i_{t-1}, & \text{if } i^*_{M,t} \le i_{t-1} \le i^*_{A,t},\\
(1+\xi)\, i^*_{M,t} - \xi\, i_{t-1}, & \text{if } \bigl((1+\xi)\, i^*_{M,t} - i^*_{A,t}\bigr)/\xi \le i_{t-1} < i^*_{M,t},\\
i^*_{A,t}, & \text{if } i_{t-1} < \bigl((1+\xi)\, i^*_{M,t} - i^*_{A,t}\bigr)/\xi,
\end{cases}
$$
where $\xi \in (0, 1]$. Note that when $\xi = 1$, (12) is the interest rate process implied by the agenda-setting model with a hawkish chairman. Consider relaxing the restriction $\xi = 1$ so that now $\xi = 1 - \varepsilon$, for a "small" $\varepsilon > 0$. Then, when $i_{t-1} \in [((1+\xi)\, i^*_{M,t} - i^*_{A,t})/\xi,\; i^*_{M,t})$, the proposed policy would be $(1+\xi)\, i^*_{M,t} - \xi\, i_{t-1} < 2 i^*_{M,t} - i_{t-1}$.32 This means that, given the status quo, the proposal is now closer to the median's ideal point than under the original protocol and, consequently, the median collects part of the rents. This simple argument suggests that the implication that the chairman uses his agenda-setting power to extract all rents may be statistically tested via a Lagrange multiplier (LM) test of the restriction $\xi = 1$. The p-values of this test are reported in the last row of Panel B and indicate that the restriction is strongly rejected by the data from all committees. (A similar argument and test deliver the same result in the case of the dovish chairman.) Thus, the data reject the strong form of agenda control embodied in the agenda-setting model by Romer and Rosenthal (1978), where counterproposals are not allowed. It is interesting to note that this result also holds for the FOMC despite the common view that its chairman exerts almost undisputed power, and evidence from the transcripts and minutes that other committee members usually do not bring forward counterproposals once the chairman has made his policy recommendation. In what follows, we test whether the chairman is able to at least partially exploit his proposal power. In particular, notice that if we were to empirically find an ML estimate of $\xi$ inside the range $(0, 1)$, this would suggest a form of cooperative bargaining between the median member and the agenda-setting chairman. On the other hand, if we were to find that $\xi \to 0$, then the specification

32. To see this, substitute in $\xi = 1 - \varepsilon$, write the inequality as $(2 - \varepsilon)\, i^*_{M,t} - (1 - \varepsilon)\, i_{t-1} < 2 i^*_{M,t} - i_{t-1}$, and simplify to obtain $-\varepsilon\,(i^*_{M,t} - i_{t-1}) < 0$, which is satisfied for the range of status quo policies we are concerned with.
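As an aside, the piecewise rule in (12) for the hawkish-chairman case can be written out directly. The Python sketch below uses purely illustrative values of $i^*_{M,t}$, $i^*_{A,t}$, and $\xi$; the function name and numbers are ours and do not come from the paper's estimation. Setting $\xi = 1$ reproduces the full agenda-setting proposal, while a small $\xi$ pulls proposals toward the median's preferred rate.

```python
# Sketch of specification (12), hawkish chairman (i_A above i_M).
# Illustrative only: i_M, i_A, and xi are placeholders, not estimates.

def rule_12(i_prev: float, i_M: float, i_A: float, xi: float) -> float:
    lower = ((1.0 + xi) * i_M - i_A) / xi   # boundary of the binding-constraint region
    if i_prev > i_A:                        # status quo above chairman's desired rate
        return i_A
    if i_M <= i_prev <= i_A:                # gridlock interval: status quo prevails
        return i_prev
    if lower <= i_prev < i_M:               # acceptance constraint binds
        return (1.0 + xi) * i_M - xi * i_prev
    return i_A                              # constraint slack: chairman's desired rate

for xi in (1.0, 0.1):
    # With i_M = 3 and i_A = 5, a status quo of 2 yields 4 when xi = 1
    # (the reflection point 2*i_M - i_prev) but only 3.1 when xi = 0.1.
    print(xi, [round(rule_12(i, 3.0, 5.0, xi), 2) for i in (2.0, 3.5, 6.0)])
```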
(12) would approach the interest rate process implied by the consensus model (with the thresholds appropriately relabeled), where no member controls the agenda. This discussion shows that the specification (12) encompasses both the agenda-setting and consensus models and, consequently, provides a means for directly comparing these models. In particular, one could imagine estimating (12) using actual data and then testing, for example, whether ξ is statistically closer to 0 or 1. We attempted this strategy but found that when the encompassing model was estimated, the ML estimate of ξ would converge to 0 in all cases. Because ξ = 0 is the lower bound of the parameter space, regularity conditions are violated, and standard tests do not have the usual distributions. However, the fact the data strongly prefer ξ ≈ 0 compared with ξ = 1 constitutes further evidence against the notion that the chairman is able to control the agenda by excluding some alternatives from the vote. Thus, the existence of an inefficient status quo does not appear to represent a sufficient threat that the chairman can exploit to pass a large interest rate change towards his preferred policy. Our results are in line with laboratory experiments on committee decision making. For example, Eavey and Miller (1984) test the prediction of the agenda-setting model in a one-dimensional policy space and find that the agenda setter does influence committee decisions and bias the policy outcome away from the policy preferred by the median member. However, contrary to the strong implication of the model, the agenda setter does not seem to select his most preferred policy from among the set of alternatives that dominate the status quo. Recent experimental studies test the predictions of Baron and Ferejohn’s (1989) extension of the agenda-setting model and find that the individual selected to make a proposal enjoys a first-mover advantage, as predicted by the theory, but does not fully exploit his proposal power (see McKelvey [1991]; Diermeier and Morton [2005]; Fr´echette, Kagel, and Morelli [2005]; Diermeier and Gailmard [2006]). In the empirical literature, Knight (2005) analyzes U.S. data on the distribution of transportation projects across congressional districts. His results support the qualitative prediction that legislators with proposal power secure higher spending in their districts but, in quantitative terms, the estimated value of proposal power is in some cases lower than implied by the theory. In summary, a robust finding of this paper is that, among the protocols examined, the consensus-based model describes interest
rate decisions more accurately than the alternative models for all committees in our sample. This means that, in addition to the formal rules under which monetary committees operate, their decision making is also the result of unwritten rules and informal procedures that deliver observationally equivalent policy decisions. Two words of caution about this result are the following. First, although the consensus model fits the data better in all cases, the level of consensus in interest rate decisions may vary across central banks, as suggested by Figure IX. Second, although our findings reject the notion that the chairman controls the agenda in the strong form assumed by Romer and Rosenthal (1978), and in the weaker form examined above, they do not necessarily imply that the chairman has the same power as his peers. For example, a proposal rule similar to the one in Panel A of Figure II may arise from a committee that makes decisions according to the consensus-based protocol plus the requirement that the supermajority must include the chairman. V. TWO CAVEATS V.A. Other Frictions In addition to the politicoeconomic frictions examined in this paper, the literature has proposed other explanations to account for the large serial correlation of the key interest rate under central bank control. The large and statistically significant coefficient of the lagged interest rate in reaction functions is usually interpreted as evidence of interest rate smoothing. This term refers to the tendency of central banks to adjust interest rates gradually in response to changes in economic conditions and to reverse the direction of interest rates movements only infrequently.33 To explain why central banks seem reluctant to make a change of interest rates that might need to be subsequently reversed, Goodhart (1999) argues that such movements would be interpreted by outside commentators as evidence of inconsistency and irresolution, and damage the central bank’s reputation. However, from a theoretical point of view, it is not obvious that introducing reputational considerations into the standard framework would help explain the persistence of interest rates and the lack of policy 33. Detailed evidence of interest rate smoothing on the part of the Federal Reserve and other central banks is presented by Rudebusch (1995, 2002) and Goodhart (1997), respectively.
reversals. Among others, Prendergast and Stole (1996) and Levy (2004) have shown that decision makers who are motivated by career concerns may implement more aggressive policy changes and go against public information to signal their ability, even when they believe that the public prior is correct. Cukierman (1991) and Goodfriend (1991) argue that interest rate smoothing minimizes financial market stress resulting from interest rate prediction errors. The aversion to financial market volatility would then motivate the introduction of an additional interest rate stabilization term into the central bank’s loss function, in addition to the usual output and inflation stabilization (see, for example, Svensson [2000]). Woodford (2003) suggests that interest rate smoothing may indeed be optimal even when the central bank is not explicitly concerned with minimizing the size of interest rate changes. In a model where the private sector is forwardlooking, inertial central bank behavior allows substantial effects on aggregate demand by influencing expectations of future policy movements without large variation in the level of short-term interest rates. Rotemberg and Woodford (1998) find that interest rate inertia in a generalized Taylor rule is optimal, and show that the coefficient of the lagged interest rate should be greater than one to obtain a unique equilibrium. That is, the feedback rule must be superinertial. Alternatively, others have argued that policy conservatism can be partly explained by the fact that the central banks observe key macroeconomic variables with noise (see Orphanides and Williams [2002]; Aoki [2003]; Orphanides [2003]) and do not perfectly know the dynamic structure of the economy (Sack 2000). In particular, Sack (2000) shows that the optimal policy under data and parameter uncertainty can account for a portion of the gradualism in the observed Federal Funds rate movements, although the latter has fewer reversals and more no-change observations. In order to evaluate the quantitative importance of frictions other than politicoeconomic ones in our results, we econometrically study interest rate decisions in a central bank where policy is selected by one individual. In particular, we consider the Bank of Israel (BOI), where the Governor alone makes decisions on the key interest rate. Although the BOI Governor receives input from an advisory committee before reaching his decision, its role is more limited than, for example, the advisory committee of the Reserve
Bank of New Zealand (RBNZ).34 In the case of the BOI, the single central banker model should then be a reasonable approximation to actual policy making. We estimate all protocols using Israeli data from January 1995 to June 2007. The policy decision concerns the Headline interest rate announced by the governor at the end of every month. Inflation is measured by the percentage change in the CPI and, as before, the output gap is measured by the deviation of the seasonally adjusted unemployment rate from a trend computed using the Hodrick–Prescott filter. (Because the Labor Force Survey is carried out on a quarterly basis, monthly unemployment observations were computed using linear interpolation.) The data were taken from the website of the Bank of Israel (www.bankisrael.gov.il).

Results for all protocols are reported in Table VI and show that, despite the fact that decisions are made by one individual, the consensus model dominates the frictionless one in terms of RMSE, MAE, and AIC. This finding suggests that some of the frictions captured by our consensus model are likely not associated with collective decision making.35 In what follows, we compute the improvement in statistical fit of the consensus over the frictionless model for Israel and incorporate it into the top two panels in Figure IX, where this variable is correlated with the committee size and Blinder's democracy index. The committee size is measured by the number of voting members. The size of the BOI "committee" is set to one, and because Blinder does not classify the democracy level in the BOI, we set its index to zero, below the RBNZ, to which Blinder assigns the lowest value (in his sample) of one. From these panels, it is clear that the fit improvement for the BOI is generally less than for central banks that use sizeable and/or democratic committees to formulate monetary policy. It is interesting to note that the only exception is the Bank of Canada, which has the lowest democracy index and the smallest committee in the sample. These observations indicate that although politicoeconomic frictions alone do not

34. Svensson (2007, p. 2005) argues that decisions in the RBNZ are "normally made in a very collegial manner." Also, Blinder (2007, p. 110) points out that the RBNZ's advisory committee takes recorded votes and keeps minutes of its meetings. The government-appointed Monetary Policy Advisory Committee of the BOI was discontinued in 1989 and revived in 2000. For additional institutional details, see Barkai and Liviatan (2007, p. 100).
35. On the other hand, because the frictionless model dominates the model with frictions in the size of interest rate adjustments, results also suggest that the latter is not a quantitatively important friction in Israeli policy making.
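As a rough illustration of the output-gap construction described above (quarterly unemployment interpolated to a monthly series, then detrended with the Hodrick–Prescott filter), the Python sketch below uses simulated data; the column names and the monthly smoothing parameter of 129,600 are our assumptions, not values taken from the paper.

```python
# Illustrative preprocessing sketch: interpolate a quarterly unemployment
# series to monthly frequency and measure the gap as the deviation from an
# HP trend. The series is simulated; nothing here comes from the BOI data.
import numpy as np
import pandas as pd
from statsmodels.tsa.filters.hp_filter import hpfilter

months = pd.date_range("1995-01-31", "2007-06-30", freq="M")
unemp = pd.Series(np.nan, index=months)
unemp.iloc[::3] = 8.0 + np.random.default_rng(0).normal(0.0, 0.3, len(unemp.iloc[::3]))

unemp_monthly = unemp.interpolate(method="linear").dropna()   # fill non-survey months
cycle, trend = hpfilter(unemp_monthly, lamb=129600)           # assumed monthly lambda
unemployment_gap = unemp_monthly - trend                      # same series as `cycle`
```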
TABLE VI
BANK OF ISRAEL

                                          Dominant chairman
                      Consensus       Hawkish         Dovish          Frictionless    With size       Data
                                                                                      friction

A. Parameter estimates
a_M+K                 6.593∗ (0.244)
a_M−K                 1.747∗ (0.382)
a_M                                   1.625∗ (0.377)  8.496∗ (0.586)
a_A                                   6.608∗ (0.238)  4.491∗ (0.375)
a_S                                                                   5.363∗ (0.238)  5.366∗ (0.238)
b                     0.955∗ (0.041)  0.776∗ (0.065)  0.966∗ (0.041)  0.880∗ (0.043)  0.880∗ (0.043)
c                    −0.632 (0.410)   0.464 (0.673)  −0.677 (0.420)  −0.079 (0.402)  −0.079 (0.449)
σ                     1.788∗ (0.124)  3.011∗ (0.241)  1.836∗ (0.128)  2.049∗ (0.119)  2.048∗ (0.119)

B. Criteria for model selection
L(·)                 −220.69         −230.14         −312.87         −318.34         −357.86
AIC                   451.38          470.27          635.75          644.67          723.72
RMSE                  1.058           1.235           2.131           2.049           2.469
MAE                   0.701           0.791           1.724           1.699           2.057
Chairman extracts
  all rents (p-value)                 <.001           <.001

C. Quantitative predictions
Autocorrelation       .765            .654            .642            .192            .178            .977
Standard deviation    1.008           1.336           3.666           2.961           2.967           0.531
Proportion of
  Cuts                0.141           0.134           0.104           0.492           0.456           0.363
  Increases           0.194           0.483           0.228           0.508           0.476           0.261
  No changes          0.664           0.383           0.668           0               0.068           0.375
  Policy reversals    0.035           0.013           0.044           0.642           0.576           0.267

Notes. See notes to Table I.
completely explain inertia in policy decisions, they are likely to be the main factor behind differences across monetary committees. Moreover, it is clear from the direct analysis of the data that decision making by a single individual may involve fewer frictions than those associated with committees. To see this, note in the bottom four panels in Figure IX that the BOI features much higher volatility and a smaller proportion of no interest rate changes
compared with central banks where decisions are made by committee. Although we cannot rule out these results being driven by idiosyncrasies specific to Israel, the data do suggest that committees are relatively more inertial in the sense that they change interest rates less frequently and in smaller steps than a single central banker would. V.B. Dynamics One assumption that makes our model analytically and econometrically tractable is that committee members abstract from the dynamic implications of their voting decisions for future meetings via the status quo. Unfortunately, relaxing this assumption is technically difficult. Among the papers that tackle this theoretical issue are Baron (1996), Baron and Herron (2003), Kalandrakis (2004), and Diermeier and Fong (2008).36 In earlier work, Riboni and Ruge-Murcia (2008c) numerically solve this problem for the agenda-setting game, but in a framework simpler than the one developed here. It is shown that decision rules derived under fully strategic behavior imply more inertia in policy making than those obtained under myopic behavior, but that they are qualitatively similar otherwise. We will see below that the same result holds here. To quantify the effect of dynamic considerations regarding the status quo, we solve numerically a two-period version of the consensus and agenda-setting models. The economy parameters are those estimated by Rudebusch and Svensson (1999) using U.S. data and a model similar to the one used here. That is, α1 = 0.15, β1 = 0.90, β2 = 0.10, ι = 0, and shocks are assumed to be normal white noises (that is, γ = ς = 0) with σε2 = ση2 = 0.01. The latter assumption conveniently reduces the number of state variables to three (the status quo, current inflation, and output gap). The discount factor, δ, is set to 0.9, and the inflation target is set to 2% per year. For the consensus model, we assume a committee of five members and a supermajority of four members. Preferences 36. Notice, however, that our setup is less tractable than the one considered by most of the literature. First, in contrast to the papers cited above, the economy assumed here is dynamic because of equations (3) and (4). This implies that besides the political state variable (the status quo), current output and current inflation are two additional state variables. Moreover, our economy is also subject to shocks. Because shocks are assumed to be persistent, the current shock is another state variable. Finally, whereas most of the literature uses either linear or quadratic utility, our model features a utility function that is more general but less tractable for the purpose of modeling dynamics. Because of all of the above, finding a stationary solution in our setup would be considerably more difficult.
FIGURE X Comparison of Policy Outcomes
parameters (the μs) are quantitatively similar to those estimated by Ruge-Murcia (2003) and are equally spaced in the range from −4 to 4, so that μ M−K = −2 and μ M+K = 2. For the agenda-setting model, μ A = 3 and μ M = 0, and the chairman is therefore more hawkish than the median. The solution is found by backward induction for a threedimensional grid of status quo policies, inflation rates, and output gaps. First, the problem in the “last” period (say, τ ) is solved analytically, exploiting the observation that this problem is identical to the one studied in Sections III.A and III.B. As we saw, the gridlock interval is constant (because it is only a function of structural parameters) and is independent of current output and inflation. Then, the problem in the next to last period (τ − 1) is solved using the decision rules for the last period in Propositions 1 and 2 and Monte Carlo simulation to compute the conditional expectations of future utility. The gridlock interval in period τ − 1 depends on current output and inflation and so we plot it in Figure X for
the case where πτ −1 = π ∗ and yτ −1 = 0. (Results for the other grid points are qualitatively the same as those reported here.) From this figure it is clear that the gridlock interval in period τ − 1 is larger than in period τ. The precise magnitude of this difference depends on the model parameters and on the values of current output and inflation. On the other hand, the policy rules are otherwise identical and feature the same slopes (either 0, 1 or −1). These results lead us to conclude that our model may understate the range of policy inaction in actual monetary policy committees. Hence, a model where committee members internalize their role in setting the future status quo may predict more, not less, policy inertia than our model. To understand this result, we focus on the consensus model.37 Consider, for instance, the incentives of individual M − K to vote in favor of an interest rate cut when a negative shock occurs in the first period, and compare them with the incentives to vote in favor of an interest rate increase when, instead, a positive shock occurs. In the model with endogenous status quo, M − K realizes that a change of the current interest rate would affect the default option in the next meeting. Notice, however, that cutting the interest rate has less irreversible consequences for M − K than an interest rate increase. Being among the most dovish members in the committee, were M − K to prefer a higher interest rate than the current one in the next meeting, most committee members would also agree. Conversely, were he to prefer a lower interest rate in the next meeting, M − K would likely face the opposition of most committee members. This asymmetry explains why M − K may not vote in favor of an interest rate increase, even if a positive shock occurs, because this choice, by changing the default policy, may prove irreversible in the following meeting. The same argument explains why M + K favors in the first period an interest rate larger than the one he or she would have chosen in the model with myopic behavior. Because the preferred interest rates of individuals M − K and M + K move away from each other, the gridlock interval expands. VI. CONCLUSIONS To our knowledge, all existing studies that estimate interest rate rules abstract from the voting process that leads to policy 37. For the agenda-setting model, we refer the reader to the discussion in Riboni and Ruge-Murcia (2008c).
decisions. A large body of anecdotal evidence hints instead at the importance of strategic considerations in the decision-making process. Committee members differ along various dimensions and, consequently, are likely to have different preferred interest rates. The way committees resolve these differences depends crucially on the particular voting protocol (implicitly or explicitly) adopted. In this paper, we consider various voting protocols that capture some relevant aspects of actual monetary policy making by committee: the consensus, the agenda-setting, and two frictionless models. The four protocols have distinct time series implications for the nominal interest rate. These different implications allow us to empirically distinguish among the protocols using actual data from the policy decisions by committees in five central banks. A robust empirical conclusion is that the consensus model fits actual policy decisions better than the other voting protocols. This is observed even though none of the central banks in our sample (except the Bank of Canada) formally operates under a consensus (or supermajority) rule. This result is consistent with a large experimental literature on committee decision making that indicates a preference for oversized or nearly unanimous coalitions even in strict-majority rule games. The empirical analysis of policy decisions by the Bank of Israel, for which the single central banker model constitutes a reasonable approximation, indicates that politicoeconomic frictions alone cannot completely explain policy inertia, but that they are, nonetheless, the main factor behind differences across monetary policy committees. Moreover, the data provide direct evidence that policy making by a single individual and by committee are fundamentally different in that the former features more interest rate changes and larger adjustments than the latter. Overall, our research points to the importance of politicoeconomic considerations in central banking.

DÉPARTEMENT DE SCIENCES ÉCONOMIQUES, UNIVERSITÉ DE MONTRÉAL
DÉPARTEMENT DE SCIENCES ÉCONOMIQUES, UNIVERSITÉ DE MONTRÉAL
REFERENCES Aoki, Kosuke, “On the Optimal Monetary Policy Response to Noisy Indicators,” Journal of Monetary Economics, 50 (2003), 501–523. Austen-Smith, David, and Jeffrey S. Banks, Positive Political Theory II: Strategy & Structure (Ann Arbor: University of Michigan Press, 2005). Barkai, Haim, and Nissan Liviatan, eds. The Bank of Israel: Volume 2: Selected Topics in Israel’s Monetary Policy (Oxford, UK: Oxford University Press, 2007).
Baron, David P., “Majoritarian Incentives, Pork Barrel Programs, and Procedural Control,” American Journal of Political Science, 35 (1991), 57–90. ——, “A Dynamic Theory of Collective Goods Programs,” American Political Science Review, 90 (1996), 316–330. Baron, David P., and John A. Ferejohn, “Bargaining in Legislatures,” American Political Science Review, 89 (1989), 1181–1206. Baron, David P., and Michael Herron, “A Dynamic Model of Multidimensional Collective Choice,” in Computational Models of Political Economy, Ken Kollman, John H. Miller, and Scott E. Page, eds. (Cambridge, MA: MIT Press, 2003). Battaglini, Marco, and Stephen Coate, “ Inefficiency in Legislative Policymaking: A Dynamic Analysis,” American Economic Review, 97 (2007), 118–149. ——, “A Dynamic Theory of Public Spending, Taxation and Debt,” American Economic Review, 98 (2008), 201–236. Bhattacharjee, Arnab, and Sean Holly, “Taking Personalities out of Monetary Policy Decision Making? Interactions, Heterogeneity and Committee Decisions in the Bank of England’s MPC,” University of St. Andrews Mimeo, 2006. Black, Duncan, The Theory of Committees and Elections (Cambridge, UK: Cambridge University Press, 1958). Blinder, Alan S., The Quiet Revolution: Central Banking Goes Modern (New Haven, CT: Yale University Press, 2004). ——, “Monetary Policy by Committee: Why and How?” European Journal of Political Economy, 23 (2007), 106–123. Blinder, Alan S., and John Morgan, “Are Two Heads Better Than One? Monetary Policy by Committee,” Journal of Money, Credit and Banking, 37 (2005), 789– 812. Bullard, James, and Christopher Waller, “Central Bank Design in General Equilibrium,” Journal of Money, Credit and Banking, 36 (2004), 95–113. Chappell, Henry W., Rob Roy McGregor, and Todd Vermilyea, Committee Decisions on Monetary Policy (Cambridge, MA: MIT Press, 2005). Chari, V. V., and Harold Cole, “A Contribution to the Theory of Pork Barrel Spending,” Federal Reserve Bank of Minneapolis Staff Report 156, 1993. Corana, Angelo, Michele Marchesi, Claudio Martini, and Sandro Ridella, “Minimizing Multimodal Functions of Continuous Variables with the Simulated Annealing Algorithm,” ACM Transactions on Mathematical Software, 13 (1987), 262–280. Cukierman, Alex, “Why Does the Fed Smooth Interest Rates?” in Monetary Policy on the 75th Anniversary of the Federal Reserve System, Michael Belongia, ed. (Boston: Kluwer Academic, 1991). Dal B´o, Ernesto, “Committees with Supermajority Voting Yield Commitment with Flexibility,” Journal of Public Economics, 90 (2006), 573–599. Diermeier, Daniel, and Pohan Fong, “Endogenous Limits on Proposal Power,” Northwestern University Mimeo, 2008. Diermeier, Daniel and Sean Gailmard, “ Self-Interest, Inequality, and Entitlement in Majoritarian Decision-Making,” Quarterly Journal of Political Science, 4 (2006), 327–350. Diermeier, Daniel, and Rebecca B. Morton, “Experiments in Majoritarian Bargaining,” in Social Choice and Strategic Behavior Essays in Honor of Jeffrey Banks, David Austen-Smith and John Duggan, eds. (New York: SpringerVerlag, 2005). Eavey, Cheryl, and Gary Miller, “Bureaucratic Agenda Control: Imposition or Bargaining,” American Political Science Review, 78 (1984), 719–733. Eijffinger, Sylvester, Eric Schaling, and Willem Verhagen, “A Theory of Interest Rate Stepping: Inflation Targeting in a Dynamic Menu Cost Model,” Tilburg University Mimeo, 1999. ´ Erhart, Szilard, and Jos´e Vasques-Paz, “Optimal Monetary Policy Committee Size: Theory and Cross Country Evidence,” Magyar Nemzeti Bank Mimeo, 2007. 
Fréchette, Guillaume R., John H. Kagel, and Massimo Morelli, "Behavioral Identification in Coalitional Bargaining: An Experimental Analysis of Demand Bargaining and Alternating Offers," Econometrica, 73 (2005), 1893–1938. Fry, Maxwell, DeAnne Julius, Lavan Mahadeva, Sandra Roger, and Gabriel Sterne, "Key Issues in the Choice of Monetary Policy Framework," in Monetary
Frameworks in a Global Context, Lavan Mahadeva and Gabriel Sterne, eds. (London: Routledge, 2000). Gerlach-Kristen, Petra, “Too Little, Too Late: Interest Rate Setting and the Costs of Consensus,” Economics Letters, 88 (2005), 376–381. ——, “Outsiders at the Bank of England’s MPC,” Swiss National Bank Mimeo, 2007. Goodfriend, Marvin, “Interest Rate Smoothing in the Conduct of Monetary Policy,” Carnegie–Rochester Conference Series on Public Policy, 37 (1991), 7–30. Goodhart, Charles A., “Why Do the Monetary Authorities Smooth Interest Rates,” in European Monetary Policy, Stefan Collignon, ed. (London: Pinter, 1997). ——, “Central Bankers and Uncertainty,” Bank of England Quarterly Review (February 1999), 102–121. Guthrie, Graeme, and Julian Wright, “The Optimal Design of Interest Rate Target Changes,” Journal of Money, Credit and Banking, 36 (2004), 115–137. Janis, Irving L., Victim of Group Think: A Psychological Study of Foreign Policy Decisions and Fiascos (Boston: Houghton Mifflin, 1972). Kalandrakis, Anastassios, “A Three-Player Dynamic Majoritarian Bargaining Game,” Journal of Economic Theory, 16 (2004), 294–322. Kimball, Miles S., “Precautionary Saving in the Small and in the Large,” Econometrica, 58 (1990), 53–73. Knight, Brian, “Estimating the Value of Proposal Power,” American Economic Review, 95 (2005), 1639–1652. Krehbiel, Keith, Adam Meirowitz, and Jonathan Woon, “Testing Theories of Lawmaking,” in Social Choice and Strategic Behavior Essays in Honor of Jeffrey Banks, David Austen-Smith and John Duggan, eds. (New York: SpringerVerlag, 2005). Levy, Gilat, “Anti-Herding and Strategic Consultation,” European Economic Review, 48 (2004), 503–525. Maier, Philipp, “Monetary Policy Committees in Action: Is There Room for Improvement?” Bank of Canada Working Paper 6, 2007. Maisel, Sherman, Managing the Dollar (New York: Norton, 1973). McKelvey, Richard D., “An Experimental Test of a Stochastic Game Model of Committee Bargaining,” in Contemporary Laboratory Research in Political Economy, Thomas R. Palfrey, ed. (Ann Arbor: University of Michigan Press, 1991). Meade, Ellen, “The FOMC: Preferences, Voting, and Consensus,” Federal Reserve Bank of St. Louis Review, 87 (2005), 93–101. Meyer, Laurence H., A Term at the Fed (New York: Harper Collins, 2004). Montoro, Carlos, “Monetary Policy Committees and Interest Rate Smoothing,” London School of Economics Mimeo, 2006. Orphanides, Athanasios, “Monetary Policy Evaluation with Noisy Information,” Journal of Monetary Economics, 50 (2003), 605–31. Orphanides, Athanasios, and John C. Williams, “Robust Monetary Policy Rules with Unknown Natural Rates,” Brookings Papers on Economic Activity, 2 (2002), 63–145. Palfrey, Thomas R., “Laboratory Experiments in Political Economy,” CEPS Working Paper No. 111, 2005. Persson, Torsten, Gerard Roland, and Guido Tabellini, “Comparative Politics and Public Finance,” Journal of Political Economy, 108 (2000), 1121–1161. Pesaran, M. Hashem, and Melvin Weeks, “Non-nested Hypothesis Testing: An Overview,” in Companion to Theoretical Econometrics, Badi Baltagi, ed. (Oxford, UK: Basil Blackwell, 2001). Prendergast, Canice, and Lars Stole, “Impetuous Youngsters and Jaded OldTimers: Acquiring a Reputation for Learning,” Journal of Political Economy, 104 (1996), 1105–1134. Riboni, Alessandro, and Francisco Ruge-Murcia, “Monetary Policy by Committee: Consensus, Chairman Dominance or Simple Majority?” CIREQ Working Paper 02, 2008a. 
——, “Preference Heterogeneity in Monetary Policy Committees,” International Journal of Central Banking, 4 (2008b), 213–233. ——, “The Dynamic (In)efficiency of Monetary Policy by Committee,” Journal of Money, Credit and Banking, 40 (2008c), 1001–1032.
Romer, Thomas, and Howard Rosenthal, “Political Resource Allocation, Controlled Agendas, and the Status Quo,” Public Choice, 33 (1978), 27–43. Rosett, Richard N., “A Statistical Model of Frictions in Economics,” Econometrica, 27 (1959), 263–267. Rotemberg, Julio, and Michael Woodford, “Interest-Rate Rules in an Estimated Sticky Price Model,” NBER Working Paper No. 6618, 1998. Rudebusch, Glenn, “Federal Reserve Interest Rate Targeting, Rational Expectations, and the Term Structure,” Journal of Monetary Economics, 24 (1995), 245–274. ——, “Term Structure Evidence on Interest Rate Smoothing and Monetary Policy Inertia,” Journal of Monetary Economics, 49 (2002), 1161–1187. Rudebusch, Glenn, and Lars Svensson, “Policy Rules for Inflation Targeting,” in Monetary Policy Rules, John B. Taylor, ed. (Chicago: University of Chicago Press, 1999). Ruge-Murcia, Francisco, “Inflation Targeting under Asymmetric Preferences,” Journal of Money, Credit and Banking, 35 (2003), 763–785. Sack, Brian, “Does the Fed Act Gradually? A VAR Analysis,” Journal of Monetary Economics, 46 (2000), 229–256. Sobel, Joel, “Information Aggregation and Group Decisions,” UCSD Mimeo, 2006. Svensson, Lars, “Optimal Inflation Targeting: Further Developments of Inflation Targeting,” in Monetary Policy under Inflation Targeting, Frederic Mishkin and Klaus Schmidt-Hebbel, eds. (Santiago, Chile: Banco Central de Chile, 2007). Svensson Lars E.O., “Inflation Forecast Targeting: Implementing and Monitoring Inflation Targets,” European Economic Review, 41 (1997), 1111–1146. ——, “Open-Economy Inflation Targeting,” Journal of International Economics, 50 (2000), 155–183. Swank, Job, Otto Swank, and Bauke Visser, “How Committees of Experts Interact with the Outside World: Some Theory, and Evidence from the FOMC,” Journal of the European Economic Association, 6 (2008), 478–486. Varian, Hal, “A Bayesian Approach to Real Estate Assessment,” in Studies in Bayesian Economics in Honour of L. J. Savage, S. E. Feinberg and A Zellner, eds. (Amsterdam: North-Holland, 1974). Volden, Craig, and Alan E. Wiseman, “Bargaining in Legislatures over Particularistic and Collective Goods,” American Political Science Review, 101 (2007), 79–92. Waller, Christopher J., “A Bargaining Model of Partisan Appointments to the Central Bank,” Journal of Monetary Economics, 29 (1992), 411–428. ——, “Policy Boards and Policy Smoothing,” Quarterly Journal of Economics, 115 (2000), 305–339. Weber, Anke, “Communication, Decision-Making and the Optimal Degree of Transparency of Monetary Policy Committees,” University of Cambridge Mimeo, 2007. Woodford, Michael, “Optimal Interest-Rate Smoothing,” Review of Economic Studies, 70, (2003), 861–886.
WAS POSTWAR SUBURBANIZATION “WHITE FLIGHT”? EVIDENCE FROM THE BLACK MIGRATION∗ LEAH PLATT BOUSTAN Residential segregation by jurisdiction generates disparities in public services and education. The distinctive American pattern—in which blacks live in cities and whites in suburbs—was enhanced by a large black migration from the rural South. I show that whites responded to this black influx by leaving cities and rule out an indirect effect on housing prices as a sole cause. I instrument for changes in black population by using local economic conditions to predict black migration from southern states and assigning predicted flows to northern cities according to established settlement patterns. The best causal estimates imply that each black arrival led to 2.7 white departures.
I. INTRODUCTION American metropolitan areas are segregated by race, both by neighborhood and across jurisdiction lines. In 1980, after a century of suburbanization, 72% of metropolitan blacks lived in central cities, compared to 33% of metropolitan whites. Because many public goods are locally financed, segregation between the central city and the suburbs can generate disparities in access to education and other public services (Benabou 1996; Bayer, McMillan, and Rueben 2005). These local disparities have motivated large policy changes over the past fifty years, including school finance equalization plans within states and federal expenditures on education. Racial segregation by jurisdiction has historical roots in two population flows: black migration from the rural South and white relocation from central cities to suburban rings. Both flows peaked during World War II and the subsequent decades. Between 1940 and 1970, four million black migrants left the South, increasing ∗ I appreciate helpful suggestions from Edward Glaeser (the editor), two anonymous referees, my dissertation committee (Claudia Goldin, Caroline Hoxby, Lawrence Katz, and Robert A. Margo), and numerous colleagues at UCLA. I enjoyed productive conversations with Lee Alston, David Clingingsmith, William J. Collins, Carola Frydman, Christopher Jencks, Jesse Rothstein, Albert Saiz, and Raven Saks. I received useful comments from seminar participants at the All-UC Conference for Labor Economics, the Federal Reserve Bank of Philadelphia, the KALER group at UCLA, New York University’s Wagner School of Public Service, the Society of Labor Economists, the University of British Columbia, UCBerkeley’s Goldman School of Public Policy, the University of Chicago Booth School of Business, and the Wharton School. Michael Haines generously shared some of the data used in this study. Financial support was provided by the National Science Foundation Graduate Research Fellowship and the Multi-disciplinary Program on Inequality and Social Policy at Harvard University.
[Figure I scatter plot: change in black population (horizontal axis) against change in white population (vertical axis) in central cities, 1950–1960.]
FIGURE I Change in Black and White Population in Central City, 1950–1960 Each point in the scatter diagram represents the residual change in a city’s black and white populations after controlling for region fixed effects and changes in the metropolitan area’s population over the decade. The slope of a regression line through these points is −2.010 (s.e. = 0.291). Although the four largest cities—Chicago, IL; Detroit, MI; Los Angeles, CA; and New York City, NY—are omitted for reasons of scale, they fall close to the regression line. With these cities included, the slope is −2.465 (s.e. = 0.132).
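The slope reported in the caption is a regression coefficient conditional on region fixed effects and metropolitan-area growth. A minimal sketch of that calculation follows; the input file and column names are hypothetical placeholders, not the paper's actual data set.

```python
# Sketch of the Figure I regression: change in white population on change in
# black population, controlling for region fixed effects and the change in
# metropolitan-area population. File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

cities = pd.read_csv("city_changes_1950_1960.csv")  # hypothetical city-level data

fit = smf.ols("d_white_pop ~ d_black_pop + C(region) + d_msa_pop", data=cities).fit()
print(fit.params["d_black_pop"], fit.bse["d_black_pop"])  # caption reports -2.010 (s.e. 0.291)
```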
the black population share in northern and western cities from 4% in 1940 to 16% in 1970. Over the same period, the median nonsouthern city lost 10% of its white population. This paper shows that white departures from central cities were, in part, a response to black in-migration.1 In every decade, cities that received a larger flow of black migrants also lost a larger number of white residents. Figure I provides an initial look at the relationship between black arrivals and white departures in nonsouthern cities over the 1950s. The slope of the regression line through these points suggests that each black arrival was associated with two white departures. The relationship between black arrivals and white departures provides suggestive evidence of “white flight,” a process by which 1. An extensive literature argues that white households have a preference for white neighbors. See Crowder (2000), Ellen (2000), Emerson, Chai, and Yancey (2001), and the references contained therein. Boustan (2007) shows that demand for urban residence is also affected by citywide demographics.
white households left central cities to avoid living in racially diverse neighborhoods or jurisdictions. However, the correlation between black arrivals and white departures could also be driven by the potentially endogenous location decisions of southern black migrants. If whites left particular northern cities for other reasons (for example, due to the construction of a new interstate highway), migrants might have been attracted by lower housing prices left in the wake of white departures (Gabriel, Shack-Marquez, and Wascher 1992; Saiz 2007).2 Alternatively, migrants might have flocked to areas with high wages or centrally located manufacturing jobs, factors that also underlie the demand for suburban residence (Steinnes 1977; Margo 1992; Thurston and Yezer 1994). I employ an instrumental variables procedure to address these potential alternatives. The instrument makes use of the fact that black migrants from given southern states clustered in particular northern cities. As a result, northern cities received exogenous flows of black migrants when their traditional southern states of origin underwent agricultural and economic change. In particular, I use variation in local agricultural conditions to predict black out-migration from southern states and assign these predicted migrant flows to northern cities using settlement patterns established by an earlier wave of black migration. These predicted changes in black population serve as an instrument for actual black in-migration. After adjusting for migrant location choices, I estimate that each black arrival was associated with 2.7 white departures. The median city, which had 200,000 white residents, absorbed 19,000 black migrants over this period. My estimates imply that these arrivals prompted the departure of 52,000 white residents, resulting in a 17% net decline in the urban population. Although primarily driven by household mobility, I find that the decline in white population is also partly due to a reduction in the size of the remaining white households. Observing white departures in response to black arrivals is not sufficient evidence to demonstrate the presence of white flight. White departures may be prompted by the fact that black migrants bid up the price of city housing units. In a simple spatial model, I demonstrate that if white households have no distaste 2. Gamm (1999) argues that black migrants were attracted to the Dorchester and Roxbury neighborhoods of Boston by the decline in housing prices following a wave of Jewish suburbanization.
for racial diversity, each black arrival will lead to one white departure with no long-run effect on housing prices. In contrast, if white households have a distaste for racial diversity, black migration will be associated with more than one white departure for every black arrival, declining urban population, and, in some cases, falling housing prices. I show that in otherwise declining areas, black migration leads to an increase in the vacancy rate and an associated decline in housing prices. In growing areas, black migration instead slows the rate of new home construction, leading to a smaller housing stock with no effect on housing prices. Early studies of urban population loss suggest that households left cities to escape mounting urban problems, including a rising crime rate, fiscal mismanagement, and a growing concentration of racial minorities and the poor (Bradford and Kelejian 1973; Guterbock 1976; Frey 1979; Marshall 1979; Grubb 1982; Mills and Price 1984; Mieszkowski and Mills 1993). These papers find mixed evidence for the relationship between urban racial diversity and suburbanization in 1960 and 1970 cross sections. Recent studies put more emphasis on transportation improvements, including the automobile and new road building, which reduce the time cost of commuting from bedroom communities (LeRoy and Sonstelie 1983; Baum-Snow 2007; Kopecky and Suen 2007).3 The decline in urban population following the typical black inmigration found here is equivalent to Baum-Snow’s (2007) estimates of the decline in urban population after the construction of one new highway through the central city. This paper documents that black arrivals reduced the overall demand for city residence in the mid-twentieth century, leading to white out-migration and, in some cases, falling housing prices. However, the mechanisms by which cities lost their luster are less clear. Because poverty and race are highly correlated, I cannot distinguish here between a distaste for the race and for the income level of southern arrivals. Moreover, with a metropolitan area– level analysis, I cannot separate changes to local neighborhoods and schools from changes to citywide characteristics, including the property tax rate and local spending priorities. Card, Mas, and Rothstein (2008) demonstrate that neighborhoods can “tip” from predominantly white to predominantly minority areas after 3. An exception is Cullen and Levitt (1999), which studies the relationship between crime rates in the central city and suburbanization. Historians continue to emphasize the connection between racial diversity and suburbanization (Jackson 1985; Sugrue 1996; Meyer 2000).
reaching a critical minority share. However, because cities were highly segregated by neighborhood, few neighborhoods fell into the range in which they would be at risk for tipping. I find that at most 20% of the estimated white departures can be traced to neighborhoods in the tipping range. Exploring other mechanisms for white departures is a fruitful area for future research. II. WHITE FLIGHT IN A SIMPLE SPATIAL MODEL In the postwar period, black migrants settled disproportionately in central cities. This section illustrates potential channels by which black arrivals may have affected both the number of white residents and housing prices in receiving cities. The model demonstrates that, as long as the housing supply is not perfectly elastic, some white departures will occur even without a distaste for racial diversity due to the effect of new arrivals on housing prices.4 However, if whites have some distaste for living near blacks, black migration will be associated with declining urban population and, in some cases, falling housing prices. Consider a central city in the North with a given number of white households. With free mobility, utility in this city cannot fall below u, the utility level for a white household in the suburban ring of the city’s own metropolitan area and in other metropolitan areas around the country. Household utility can be written (1)
U ( p, b, z) = u.
U is decreasing both in the price of housing ( p) and (weakly) in the share of the city residents who are black (b = B/(B + W), where B and W are the numbers of black and white households, respectively). z is a demand shifter representing either local amenities or productivity. The price of housing is a function of the number of households in the city, N (N = W + B). The sensitivity of price to the number of households is determined by ϕ, the price elasticity of housing supply. Initially, all blacks live in the South. Blacks will migrate to the North if their utility level in the northern city is higher than some reservation southern utility. Southern utility s is determined by the wage rate in southern agriculture (w), which is decreasing in 4. A distaste for racial diversity could arise either directly from racist attitudes or indirectly from concerns about local amenities such as crime rates or school quality.
the number of blacks in the South. The utility function of a black household in the North is identical to that of a white household, except that black utility may be increasing in the number of blacks in the city: (2)
U ( p, b, z) = s(w).
The price elasticity of housing supply (ϕ) is determined by the decisions of a profit-maximizing construction sector. For prices below construction cost (c), each unit built yields negative profits. In this region, firms will not build new units and the housing supply elasticity is zero. In the simplest case, housing supply will be perfectly elastic at a price equal to construction cost. Alternatively, we could imagine that the city rations building permits. To build an additional unit, firms must incur a lobbying cost L(N), which is increasing with the size of the city. In this case, housing supply elasticity will be positive but not infinite at prices above construction cost. This kinked supply curve generates an asymmetric response to changes in demand: increasing demand leads to new construction, but declining demand does not lead to an (immediate) reduction in the housing stock (Glaeser and Gyourko 2005). The city is in spatial equilibrium when all white and black residents weakly prefer their own locations to the alternatives and when firms in the construction sector earn zero profits. Spatial equilibrium determines a city housing price p∗ , which will be equal to or below construction costs, and the share of the city residents who are black (b∗ ). How will the city respond to an influx of black arrivals? Consider a decline in southern wages following mechanization in the agricultural sector, prompting black migration to the city. This case corresponds to the instrument for black migration described in the next section, which relies on exogenous variation in southern agricultural conditions. When s falls, black migrants move to the city. Migration continues until the southern wage rises sufficiently to make blacks indifferent between the South and the North. The city’s construction sector responds to the new arrivals. If housing supply is perfectly elastic at prices above construction costs, firms will build new units until prices return to p∗ = c and no white households will leave the city. If housing supply is less than perfectly elastic, housing prices will increase somewhat with
black in-migration, encouraging some white households to leave the city in response. How many whites will leave the city in this scenario? To begin with, assume that whites have no distaste for black residents (Ub = 0). According to equation (1), spatial equilibrium for white households will only be restored when city prices return to p∗ . Given that prices are a function of the total number of households in the city, this relationship holds when each black arrival displaces exactly one white resident. From this reasoning, we can conclude that if whites exhibit no distaste for racial diversity (and housing supply is not perfectly elastic), black migration to a central city will lead to (a) exactly one white departure for every black arrival and (b) no long-run change in city housing prices. Black migration increases both housing prices and the black population share in the city. If white households dislike racial diversity (Ub < 0), black migration will prompt more white departures than in the previous case. This decline in city population will lead housing prices to fall below construction costs. I assume that the housing stock will decline at some rate λ until prices eventually return to p∗ .5 From this reasoning, we can conclude that if whites exhibit a distaste for racial diversity, black migration to a central city will lead to (a) more than one white departure for every black arrival and (b) a short-run decline in city housing prices. Define λ as the (exogenous) speed with which city housing prices return to p∗ , either through depreciation of the existing housing stock or a slowdown in new construction. In cities that are otherwise expanding, the housing stock can easily decline (in a relative sense) through a slowing of the rate of new construction. That is, expanding cities are characterized by a high λ. However, in cities that are otherwise shrinking, a decline must occur through a slower process of depreciation of the existing housing stock. This distinction generates an additional prediction: In declining areas, white departures will be coupled with a high vacancy rate and falling prices, whereas in growing areas, white departures will lead to a decline in the rate of new construction and housing prices will remain at construction costs. The model suggests a set of empirical relationships to be explored in the data. First, white departures from the central city 5. In the meantime, low housing prices in the city will induce additional black migration, which, in turn, will prompt more white departures. The city will not tip from all white to all black because the loss of black population from the South will increase southern wages, eventually bringing migration to a halt.
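A numerical toy version of this logic is sketched below. It imposes functional forms the section leaves general: utility is linear in the housing price and in the black population share (with distaste parameter theta), and the city price varies linearly with total population, abstracting from the kinked supply curve above. With theta equal to zero the sketch delivers exactly one white departure per black arrival; with positive theta it delivers more than one.

```python
# Toy comparative statics for the displacement argument. Functional forms and
# parameter values are assumptions made for this sketch only: utility is
# z - p - theta*b and the price satisfies p = p0 + k*(N - N0) locally.
from scipy.optimize import brentq

def departures_per_arrival(B, N0=200_000, k=1e-4, theta=4.0):
    # White free mobility requires k*(N - N0) + theta*B/N = 0, which pins down
    # total population N (we take the root nearest N0; parameters are chosen
    # so that such an interior equilibrium exists).
    if theta == 0.0:
        N = N0                              # prices must return to p0: one-for-one
    else:
        lower = (theta * B / k) ** 0.5      # below this point the condition is decreasing in N
        N = brentq(lambda n: k * (n - N0) + theta * B / n, lower, N0)
    whites_remaining = N - B
    return (N0 - whites_remaining) / B

print(departures_per_arrival(19_000, theta=0.0))   # = 1.0
print(departures_per_arrival(19_000, theta=4.0))   # > 1.0
```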
will respond to the number of black arrivals, rather than the percentage change in the black population. However, spatial equilibrium for white households indicates that housing prices will respond to the black share of the city’s population, rather than to the number of black arrivals. If the number of white departures with every black arrival is statistically greater than one, we can rule out housing prices as a sole cause of the white outflow. Thus far, I have considered how the urban equilibrium is affected by black migration pushed from the South by a decline in southern wages. However, changes to the northern city itself may also attract black migrants. An increase in northern productivity (z) could simultaneously attract black migrants and encourage some white households to move to the suburbs.6 This process could generate a spurious correlation between these two population flows. Alternatively, if whites leave the city for any other reason (modeled as an increase in u), housing prices may fall, encouraging black in-migration from the South. In this case, an association between black arrivals and white departures would not be driven by white racism but rather by black location choice (reverse causality). The spatial model helps to demonstrate the importance of focusing on southern conditions as a source of exogenous variation in black population growth in the North. In the next section, I introduce an instrument for black migration using factors that exogenously change the utility of southern blacks. III. USING SOUTHERN BLACK MIGRATION TO INSTRUMENT FOR BLACK ARRIVALS TO NORTHERN CITIES III.A. Historical Context and Conceptual Approach Rural blacks were attracted northward by economic opportunities in the manufacturing and service sectors. The demand-pull component of this migrant flow is undoubtedly correlated with economic conditions in destination cities. Southern push factors can be used to create an instrument for changes in urban diversity in the North. I use local economic conditions to predict black migrant flows from each southern state. These local factors are 6. A productivity-driven increase in wages may encourage some white households to move to the suburbs. Living in the suburbs involves a trade-off between the price of housing services and the distance to work. An increase in income will prompt households to move to the suburbs as long as the elasticity of housing services with respect to income is greater than the income elasticity of the opportunity cost of time (Becker, 1965).
[Figure II bar chart: share of migrants (vertical axis) from Alabama and Mississippi settling in Chicago, IL; St. Louis, MO; Detroit, MI; Los Angeles, CA; Cleveland, OH; New York, NY; Pittsburgh, PA; and other northern cities.]
FIGURE II Top Destinations of Northern Black Migrants from Alabama and Mississippi, 1935–1940 Data on migration flows are calculated from aggregate mobility tables from the 1940 Census (U.S. Bureau of the Census, Internal Migration, 1935–1940).
unlikely to be correlated with aspects of the northern economy. I assign predicted flows to northern destinations using settlement patterns established by an earlier wave of black migration.7 The predicted black population in a northern city is used to instrument for the actual black population. Key to this procedure is the fact that blacks leaving particular southern states settled in certain northern cities. These settlement patterns were highly persistent, in part due to the stability of train routes and community networks.8 Much of the variation in source/destination pairs occurs between regions, with migrants simply moving due north—say, from the Mississippi Delta to industrial cities in the Midwest. However, there is also considerable variation within regions. Consider the case of Alabama and Mississippi, two neighboring, cotton-producing states in the traditional “black belt.” Figure II displays the shares of northern black 7. The first wave of black migration was prompted by growth in industrial employment during World War I and the imposition of strict immigration quotas in 1924, which slowed migration from Europe (Collins 1997). 8. Carrington, Detragiache, and Vishwanath (1996) model this type of chain migration as a reduction in the uncertainty costs of migration.
migrants from these two states that settled in various cities between 1935 and 1940. Migration from Mississippi to the North was overwhelmingly concentrated in two destinations, Chicago and St. Louis. By contrast, Detroit received the largest flow from Alabama, followed by Chicago and Cleveland. The difference in migration patterns between these neighboring states is consistent with disparities in their railroad infrastructure, which were in place long before 1940. The black population in Mississippi was clustered along the Mississippi River, a region served by only one interstate railroad (the Illinois Central), whose main hubs were St. Louis and Chicago. In contrast, the large cities in Alabama, Mobile and Birmingham, were each served by two major railroads—the Gulf, Mobile, and Ohio railroad, which connected to the Illinois Central network in St. Louis, and the Alabama Great Southern Railroad, which brought riders east to Cleveland and Detroit.9
III.B. Building an Instrument from Historical Data
The instrument for northern black population is made up of two components: predicted migrant flows from southern states and the settlement pattern established by blacks leaving these states in an earlier wave of migration. To predict black migration from a southern state, I start by estimating net black migration rates at the county level as a function of agricultural and industrial conditions:
(3)   mig_rate_{c,t→t+10} = α + γ(push factors)_{ct} + ε_{ct}.
I use county characteristics at the beginning of a decade to predict migration over the subsequent ten-year period because contemporaneous changes in southern economic conditions could be a response to, rather than a cause of, migration (Fligstein 1981). For instance, planters may scale back cotton production as agricultural wages rise with out-migration. I also present results using only 1940 county characteristics to predict migration in each of the three following decades. 9. Grossman (1989, p. 99) writes that “the first [migrant from Mississippi] to leave for Chicago probably chose the city because of its position at the head of the Illinois Central.” A map of rail links from the South c. 1915 can be found at http://alabamamaps.ua.edu/historicalmaps/railroads/. See Gottlieb (1987, pp. 39– 62) and Grossman (1989, pp. 66–119) for a broader discussion of the role of train routes and information networks in black migration.
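As a concrete illustration, the prediction step in equation (3) might be implemented as follows. This is only a minimal sketch in Python: the data frame counties and its column names (mig_rate, share_cotton, black_pop, and so on) are hypothetical placeholders, not the variable names used in the paper.

import statsmodels.formula.api as smf

# Push factors measured at the start of each decade (the Table I specification);
# '*' expands to main effects plus the interaction term.
push = ("share_cotton + share_tenant + share_agric * tobacco_state"
        " + share_mining * oil_state + defense_pc")

fits = {}
for decade in (1940, 1950, 1960):
    sub = counties[counties["decade"] == decade]
    # Net black migration rate over the FOLLOWING ten years regressed on
    # beginning-of-decade county characteristics.
    fits[decade] = smf.ols("mig_rate ~ " + push, data=sub).fit()
    counties.loc[sub.index, "pred_mig_rate"] = fits[decade].predict(sub)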
TABLE I
DETERMINANTS OF NET BLACK MIGRATION RATES BY SOUTHERN COUNTY, 1940–1970

                                            1940–1950            1950–1960            1960–1970
Share land planted in cotton               −63.575 (13.519)      −9.695 (7.064)      −49.886 (19.863)
Share farmers as tenants                   −73.290 (31.404)     −22.836 (15.778)     −76.232 (46.834)
Share agriculture                           96.909 (27.776)    −144.440 (100.353)    159.350 (47.875)
=1 if tobacco state                         20.390 (26.614)     −60.438 (58.781)      45.501 (20.783)
Share agriculture × (=1 if tobacco state) −119.379 (49.753)     185.865 (169.730)   −230.003 (81.407)
Share mining                                16.750 (82.892)     −63.233 (36.631)      59.030 (73.275)
=1 if oil state                             58.331 (11.040)       8.919 (7.680)       21.538 (12.750)
Share mining × (=1 if oil state)           146.970 (182.76)     267.268 (78.670)    −126.308 (98.638)
$ in defense pc, 1940–1945                  19.806 (7.042)        2.151 (4.077)        2.720 (8.566)
Constant                                    16.377 (14.330)      40.695 (33.557)      −2.801 (11.489)
N                                            1,378                1,352                1,350

Notes. See Data Appendix for source details. Table A.2 contains summary statistics. The dependent variable for each regression is the net black migration rate by southern county.
Table I contains coefficients from the regression of net migration rates on county characteristics.10 The results from this exercise coincide with predictions from southern economic history. A county’s cotton share strongly predicts black out-migration in the 1940s, as the planting and weeding components of cotton production were mechanized, and again in the 1960s, when a viable cotton harvester diffused throughout the South—but not in the 1950s (Grove and Heinicke 2003, 2005).11 A ten–percentage 10. Source details are contained in the Data Appendix, and the associated summary statistics are presented in Table A.2. 11. Federal cotton policy may have spurred the first wave of cotton mechanization in the late 1930s and 1940s. The Agricultural Adjustment Act (AAA) of 1933 encouraged cotton growers to leave fields fallow, a burden they often imposed on their tenants. This policy inadvertently increased the average size of cotton farms, thus providing an incentive to invest in high fixed cost capital goods. See Fligstein (1981, pp. 137–151), Whatley (1983), and Wright (1986, pp. 226–238). Correspondingly, tenancy rates are an important predictor of out-migration in the 1940s, when the traditional sharecropping system was giving way to wage labor arrangements (Alston 1981).
point increase in the share of land planted in cotton predicts six additional out-migrants per 100 black residents in the 1940s and five additional out-migrants in the 1960s. In contrast, agricultural counties in tobacco-growing states, which were slow to mechanize, lost black population only in the 1960s (Wright 1986). Counties that received federal funds for war-related industry in the 1940s attracted black migrants in that decade, though the effect of this wartime spending dissipated by the 1950s. The discovery of major oil fields and the expansion of natural gas attracted black entrants to mining counties in Oklahoma and Texas in the 1940s and 1950s.
I generate a predicted migration flow from each county by multiplying the fitted migration rate by the county's initial black population. These predicted flows are aggregated to the state level (pred_mig_{st}) and allocated to northern cities according to the settlement patterns of blacks who left the state between 1935 and 1940. Let w_{ns} be the share of blacks who left state s after 1935 and resided in city n in 1940.12 The number of black migrants predicted to arrive in city n at time t is the weighted sum over the fourteen southern states of migrants leaving state s and settling in city n:
(4)   pred_mig_{nt} = Σ_{s=1,...,14} w_{ns} · pred_mig_{st}.
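Continuing the sketch above, the allocation step in equation (4) could look like the following; weights is assumed to be a data frame holding w_{ns}, the 1935–1940 settlement shares, black_pop_1940 a series of 1940 city black populations, and all names remain illustrative. Sign conventions (whether out-migration is recorded as negative or positive) are glossed over here.

# Predicted county out-flow: fitted rate (per 100 black residents) times the
# county's initial black population, summed to the sending-state level.
counties["pred_mig"] = counties["pred_mig_rate"] / 100 * counties["black_pop"]
state_flows = (counties.groupby(["state", "decade"], as_index=False)["pred_mig"]
               .sum()
               .rename(columns={"pred_mig": "pred_mig_st"}))

# Equation (4): pred_mig_nt = sum over the 14 southern states of w_ns * pred_mig_st.
alloc = weights.merge(state_flows, on="state")
alloc["flow"] = alloc["w_ns"] * alloc["pred_mig_st"]
pred_inflow = alloc.groupby(["city", "decade"])["flow"].sum()

# The instrument: the 1940 black population advanced forward by the predicted
# in-flows (rows assumed sorted by decade within city before cumulating).
pred_black_pop = (pred_inflow.groupby(level="city").cumsum()
                  .add(black_pop_1940, level="city"))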
I use this predicted in-flow to advance a city’s black population forward from 1940, with the predicted black population serving as the instrument for the actual population. Card (2001), Lewis (2005), and Doms and Lewis (2006) use a similar approach to study the effect of immigration on local labor markets.13 One important difference, however, is that these papers allocate the actual inflow of immigrants to cities rather than predicting the inflow from a set of local push factors. As a result, the method assumes that the “total number of immigrants from a given source country who enter the United States is independent of . . . demand conditions in any particular city” (Card 2001, p. 43). However, given that migrants cluster, a positive economic shock in a destination city could stimulate additional migration 12. The 1940 Census is the first to collect systematic data on internal migration. Aggregate mobility tables are available by race for 53 cities in the sample. The mobility data provide the city and state of residence in 1935 for residents of a given city in 1940. 13. In a related method, Munshi (2003) uses rainfall in Mexican villages as an instrument for the size of different migrant networks in the United States.
flows from source areas. I present results using both actual and predicted migration flows.
IV. THE CAUSAL RELATIONSHIP BETWEEN BLACK ARRIVALS AND WHITE DEPARTURES FROM CENTRAL CITIES
IV.A. Data and Estimation Framework
I compile a data set of population and household counts from 1940 to 1970 in seventy large metropolitan areas (SMSAs) in the North and West.14 Stacking data from the four Census years, I begin by estimating the relationship between the number of nonblack ("white") residents (W_CITY) and the number of black residents (B_CITY) in the central cities of these metropolitan areas (m),
(5)   W_CITY_{mrt} = α_m + β_1 B_CITY_{mrt} + γ_1 POP_METRO_{mrt} + υ_{rt} + ε_{mrt},
where t and r indicate Census decades and regions, respectively.15 υ_{rt} are Census region-by-decade fixed effects.16 β_1 is thus estimated from changes in black population within a city over time, compared to other cities in the region. I control for the size of the metropolitan area (POP_METRO) because growing areas will attract a large flow of both black and white in-migrants. The instrument discussed above is only available for 53 of the sample cities. Earlier work on the role of race in the suburbanization process compares cross sections of cities with different black population shares at a point in time. The benefit of a panel is twofold: first, the size of a city's black population may be correlated with fixed aspects of an area's industrial base, transportation network, or housing stock. Such characteristics may also encourage suburban development, leading to a spurious correlation in the cross 14. I exclude the South because the vast majority of black migrants into southern cities came from the surrounding state, making it difficult to separate changes in urban diversity from periods of local economic change. Sample selection is discussed in more detail in the Data Appendix. 15. Although the model relates the number of white households to the number of black households in a central city, I begin by estimating the relationship between black and white population for two reasons. First, the instrument generates variation at the individual, rather than the household, level. Second, I am unable to correct the household counts for possibly endogenous annexation. Table III contains household-level results in OLS. 16. I combine the Western and Mountain Census regions and the New England and Mid-Atlantic Census regions into Pacific and Northeastern regions, respectively.
section. Second, the size of central cities—in land area—relative to their metropolitan areas varies widely. Although this variation can obscure comparisons of suburbanization across metropolitan areas, city size is largely unchanging within a metropolitan area over time. Cities can expand in land area over time by annexing nearby unincorporated land (or, less commonly, neighboring suburbs). My preferred measure of the central city fixes city boundaries according to their 1940 definition, foreclosing the possibility of an endogenous annexation response to changes in racial diversity (Austin 1999; Alesina, Baqir, and Hoxby 2004).17 The Data Appendix discusses alternative definitions of the central city and assesses the robustness of the results to the choice of measure. The mean city is 9.2% black and is located in a metropolitan area with 1.3 million residents, 41% of whom live in the city itself.
IV.B. First-Stage Results
The stability of migrant settlement patterns generates a strong association between actual changes in black population and changes due to predicted black in-migration alone. The first column of Table II reports results from a series of first-stage regressions. In the first row, the instrument is generated by allocating actual southern out-flows to the North, akin to the approach of Card (2001) and others. The subsequent rows use predicted migrant flows based on southern push factors. Not surprisingly, the relationship between actual and simulated changes in black population is stronger when actual rather than predicted migrant flows are assigned. Each predicted black arrival is associated with 4.4 actual new black residents when real migrant flows are assigned (row (1)) and 3.5 new black residents when predicted migrant flows are assigned (row (2)). The coefficient is highly significant in both cases. The magnitudes suggest that, over a decade, each migrant arrival leads to the equivalent of one new black household (assuming a mean household size of 3.5 residents) in the central city, a process that presumably occurs through family formation and childbearing in the North. Figure III graphs the first-stage relationship using predicted migrant flows in the 1950s, again controlling for region fixed 17. Only five cities in the sample annexed enough territory to expand their populations by at least five percent. These are Phoenix, AZ; Fresno, Sacramento, and San Bernardino, CA; and Wichita, KS.
TABLE II
BLACK MIGRATION TO CENTRAL CITIES AND WHITE POPULATION LOSS

                                            Dependent variable:
                                        Actual black population      White population in city
                                              in city
Instrument type                             First stage             OLS               IV
Assign actual migrants                      4.442 (0.652)       −2.099 (0.549)   −2.365 (0.805)
Assign predicted migrants, 1940–1970        3.466 (0.671)       −2.099 (0.549)   −2.627 (0.782)
Assign predicted migrants, 1950–1970        4.488 (0.968)       −2.278 (0.604)   −2.983 (0.768)
Predict with 1940 variables, 1950–1970      4.365 (0.799)       −2.278 (0.604)   −3.085 (0.708)
Long-run changes, 1940–2000                 6.800 (0.421)       −0.771 (0.166)   −1.050 (0.199)
Long-run changes, white foreign-born
  population in the city                        —                0.264 (0.066)    0.169 (0.078)

Notes: Standard errors are clustered by SMSA and reported in parentheses. Standard errors are bootstrapped when using the generated instrument (rows (2)–(6)). The sample includes 53 SMSAs with published 1935–1940 mobility counts by race from 1940–1970 (N = 212) or 1950–1970 (N = 159). The OLS results report estimates of β_1 from equation (5) in the text. The instrument in the first row assigns actual migration flows out of southern states to northern cities according to the 1935–1940 settlement patterns. The instrument in the second through sixth rows assigns predicted migration flows. Section III.B contains a detailed description of the instrument's construction. The fourth row uses county characteristics from 1940 to predict out-migration in the 1950s and 1960s. The fifth (sixth) row estimates the relationship between the change in white (foreign-born white) and black populations in the central city from 1940 to 2000.
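A hedged sketch of the OLS column of Table II (equation (5)): fixed effects entered as dummy variables and standard errors clustered by SMSA. The city-by-decade panel df and its columns (w_city, b_city, pop_metro, smsa, region, decade) are assumed names, not the paper's.

import statsmodels.formula.api as smf

ols5 = smf.ols(
    "w_city ~ b_city + pop_metro + C(smsa) + C(region):C(decade)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["smsa"]})

# beta_1: change in white population per black arrival, identified from
# within-city changes relative to other cities in the same region.
print(ols5.params["b_city"], ols5.bse["b_city"])
# In practice one may prefer to absorb the fixed effects (e.g., by demeaning)
# rather than include a full set of dummies.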
effects and metropolitan area growth. Larger positive deviations from the regression line correspond to cities such as Baltimore, MD, that experienced more black population growth than would be predicted by migration from their typical sending states, perhaps due to positive economic shocks that attracted arrivals from new source areas. The reverse is true of cities such as St. Louis, MO, that fall below the regression line. In general, the positive relationship between actual and predicted black population growth is strong and is not driven by any obvious outliers.
IV.C. Second-Stage Results
The remainder of Table II conducts the IV analysis. If migrant location choice were driving the correlation between black arrivals and white departures, the IV estimates would be smaller (less negative) than OLS. A comparison between columns (2) and (3) reveals that the IV point estimates are never markedly different from their OLS counterparts. If anything, the IV coefficients are slightly more negative than OLS, suggesting that black migrants
FIGURE III
First Stage: Predicted versus Actual Change in Black Population, 1950–1960
[Figure: scatter plot of the actual change in black population in the central city (vertical axis) against the predicted change in black population (horizontal axis); Baltimore, MD, lies well above the fitted line and St. Louis, MO, below it.] The sample includes the 53 SMSAs with available mobility counts by race in 1940 (without the four largest cities, for reasons of scale). The predicted change in black population is calculated by assigning predicted migration flows from southern states to northern cities using 1935–1940 settlement patterns. See Section III.B for a detailed description of the instrument's construction. The slope of a regression line through these points is 3.187 (s.e. = 0.419).
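The IV column can be sketched as two-stage least squares done by hand, instrumenting the actual black population with the predicted black population. Variable names are again illustrative; note that the paper bootstraps the standard errors to account for the generated instrument, which this sketch omits, and that naive second-stage OLS standard errors are not the correct 2SLS ones.

import statsmodels.formula.api as smf

controls = "pop_metro + C(smsa) + C(region):C(decade)"

# First stage: actual black population on the shift-share instrument
# (compare Table II, first column).
first = smf.ols(f"b_city ~ pred_b_city + {controls}", data=df).fit()
df["b_city_hat"] = first.fittedvalues

# Second stage: white population on the fitted black population
# (compare Table II, IV column).
second = smf.ols(f"w_city ~ b_city_hat + {controls}", data=df).fit()
print(first.params["pred_b_city"], second.params["b_city_hat"])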
avoided cities that were otherwise losing white population. Interestingly, the results are nearly identical whether I use actual or predicted migrant flows to generate the instrument.18 If economic shocks are serially correlated, migrants’ destination choices in the late 1930s may be related to local economic conditions in subsequent decade(s). The third row presents IV results for 1950–1970, leaving a full decade between the pre- and post-periods. The fourth row uses 1940 county characteristics to predict out-migration from the South in every decade to avoid changes in the southern economy that could be a response to, rather than a cause of, migration. The results are similar in both cases. There is no evidence that the correlation between black arrivals and white departures from central cities is due to 18. Although intrastate migration will net out when actual county-level migration is aggregated to the state level, the same may not be true with predicted migration. Thus, the predicted state aggregates may erroneously include and assign to the North some internal migrants.
TABLE III
BLACK HOUSEHOLDS, WHITE HOUSEHOLDS, AND THE NUMBER OF HOUSING UNITS IN CENTRAL CITIES: COEFFICIENT ON # OF BLACK HOUSEHOLDS (IN 1,000S)

Dependent variables      Full sample               Low-growth metro          High-growth metro
# white households       −1,602.495 (178.513)      −1,715.816 (271.964)      −1,790.906 (433.305)
White household size     −0.003 (0.0007)           −0.0009 (0.0006)          −0.004 (0.001)
                         (448 residents)           (164 residents)           (475 residents)
# housing units          −559.562 (211.192)        −202.652 (237.212)        −747.981 (414.455)
# of vacant units        46.192 (168.318)          513.163 (61.391)          47.328 (24.982)
N                        280                       140                       140

Notes. Standard errors are clustered by SMSA and are reported in parentheses. The number of black and white households and the number of housing units are from the Census of Housing for relevant years. The second and third columns split the sample by the metropolitan area growth rate from 1940 to 1970 (median = 58%). In the second row, household size is translated into the number of white residents lost using the average number of white households (149,400, 182,200, and 118,750 in the three columns, respectively).
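The translation of the household-size coefficient into residents (second row of Table III) is a simple back-of-envelope calculation; the figures below are those quoted in the table notes for the full sample.

coef_hh_size = -0.003          # change in white household size per 1,000 black households
avg_white_households = 149_400
residents_lost = -coef_hh_size * avg_white_households   # about 448 residents
print(round(residents_lost))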
the endogenous location choices of black migrants. Even after constraining black migrants to follow settlement patterns established in the 1930s, I find that each black entrant leads to 2.3–3.0 white departures. The final two rows of Table II examine the long-run implications of black migration for urban population growth. I estimate the relationship between the sixty-year change in the black and nonblack populations of central cities from 1940 to 2000, instrumenting for changes in the black population with migration from 1940 to 1970. In the long run, each black arrival leads to only one nonblack departure and, therefore, has no effect on the overall urban population. Over time, some nonblack residents without a distaste for racial diversity may have been attracted to these central cities by lower housing prices. The last row of Table II shows that the foreign-born, whose numbers have increased greatly since 1970, have contributed to this trend. Each black arrival increased the number of white foreign-born residents in these central cities by 0.2 persons, accounting for around 20% of the long-run renewal of urban population. Thus far, I have examined the relationship between black and white residents in central cities, whereas the model focused on households. The population and household effects could be different if black and white households are systematically different in size. Table III contains OLS regressions relating black household entry to the number of white households in the central city
and the average size of the remaining white households. The arrival of one black household led to the departure of 1.6 white households; we can statistically rule out a displacement rate of one for one. Black arrivals also led to a reduction in the size of the remaining white households, perhaps because larger households with children were more concerned about racial diversity. However, the change in household composition is small, resulting in a reduction of 0.13 white residents for every new black arrival.19 Black in-migration led to a net reduction in the number of households in receiving cities. This decline could either result in vacancies in the existing housing stock or a decline in the housing stock itself as units depreciate and/or fewer new units are built. The model predicts that in otherwise declining areas, white departures will be coupled with a high vacancy rate and falling prices, whereas in growing areas, white departures will lead to a decline in the rate of new construction and housing prices will remain at construction costs. The second and third columns separate the sample into low- and high-growth metropolitan areas (above or below the median growth rate of 58% from 1940 to 1970). Consistent with this prediction, the arrival of 1,000 black households in a high-growth area, which results in a net decline of 800 households, leads to 750 fewer housing units being built and only 50 units standing vacant. In contrast, 1,000 new black households in a low-growth area (a net decline of 700 households) are associated with 500 additional vacancies. I will show a similar pattern with respect to housing prices below.
IV.D. Assessing the Quantitative Role of White Flight
The estimated number of white departures for every black arrival allows us to calculate the likely effect of black migration on urban population loss. Let's begin with an extreme thought experiment: What if the four million black migrants had not left the South during this period? The median northern and western city received 19,000 black migrants from 1940 to 1970. The estimated response implies that 52,000 whites left the city as a result, translating into a 27% decline in the city's white population and a 17% decline in the total urban population. To put this magnitude 19. The arrival of 1,000 black households (= 3,500 residents) leads to a decline of 0.003 residents in the average white household. In the typical city, this decline in household size translates into the loss of 448 residents. These figures imply that each new black resident results in the loss of 0.13 white residents through the household size channel.
into context, consider that Baum-Snow (2007) estimates that the construction of one new interstate highway through a central city leads to a similar 16% decline in urban population. Although this “no-migration” counterfactual is large, it is not entirely out of sample. The effect of shutting off the flow of black migrants is equivalent to imposing the growth rate of Pittsburgh’s black population rather than that of Detroit’s black population on the typical city (150% versus 440%). If instead one considers the difference in the black inflow between Chicago and Detroit (400% versus 440%), the median city would have experienced an 8% decline in its white population.20 Can the estimated response to the black migration be wholly explained by the tipping of certain neighborhoods from majority white to majority black (Schelling 1971)? In 1970, Card, Mas, and Rothstein (2008) estimate that neighborhoods tipped after reaching a 9%–12% minority share. The estimated tipping point has increased over time, so the tipping point in 1950 might have been as low as, say, 5%. To assess the quantitative importance of this phenomenon, imagine that, in 1940, before the wartime migration, no neighborhood in sample metropolitan areas had yet reached the tipping point. By 1950, 5.8% of Census tracts in sample cities fell within the candidate range (5%–12% black). Card, Mas, and Rothstein document that neighborhoods directly above the tipping point lose 10%–16% of their white population over the next decade relative to neighborhoods directly below. Let’s take the case of the median city with 200,000 white residents, which received 6,000 black arrivals over the 1940s. If all candidate neighborhoods lost 16% of their white population over the next decade, this would translate into the departure of 1,856 white residents (= 200,000 · 0.058 · 0.16). The paper’s causal estimates suggest that a total of 16,200 white residents would have left the city in response to these black arrivals (= 6,000 · 2.7). Of these departures, 6,000 residents, or one white departure for every black arrival, may be in direct response to higher housing prices. At most 20% of the remainder can be explained by neighborhood tipping (= 1,856/10,200). Other departures may have been in response to more continuous shifts in neighborhood composition or to changes in citywide attributes. 20. Some blacks were attracted to the North by the availability of manufacturing work. If blacks had not filled these positions, others may have. One possibility is that blacks would have been replaced by Mexicans through an expansion of the Bracero guest worker program into urban areas. The white response to this alternative set of migrants is unknown.
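The back-of-envelope numbers in Section IV.D can be reproduced directly from the quantities quoted in the text; the snippet below is only a rough check (rounding the coefficient to 2.7 explains the small gap between 51,300 here and the 52,000 cited above).

# No-migration counterfactual for the median city.
black_inflow_1940_70 = 19_000
departures_per_arrival = 2.7                               # rounded IV estimate
white_departures = departures_per_arrival * black_inflow_1940_70   # ~51,300

# Tipping bound for the 1940s.
white_pop, tract_share, loss_rate = 200_000, 0.058, 0.16
tipping_departures = white_pop * tract_share * loss_rate   # 1,856
total_response = 6_000 * departures_per_arrival            # 16,200
beyond_prices = total_response - 6_000                     # 10,200
print(tipping_departures / beyond_prices)                  # ~0.18, i.e., at most ~20%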
TABLE IV
BLACK POPULATION SHARE AND THE VALUE OF OWNER-OCCUPIED HOUSING IN THE CITY, 1950–1970

                                   OLS              OLS              IV              Low growth       High growth
                                   (1)              (2)              (3)             (4)              (5)
Black population share in city     −0.610 (0.227)   −0.470 (0.194)   −0.689 (0.108)  −0.618 (0.266)   0.030 (0.295)
Housing controls                   N                Y                Y               Y                Y
N                                  159              159              159             99               102

Notes. Standard errors are clustered by SMSA and are reported in parentheses. Housing quality controls include the median number of rooms, the share of housing units that are in detached, single-family buildings, and the share of housing units that were built in the previous ten years. The fourth and fifth columns split the sample by the metropolitan area growth rate from 1940 to 1970 (median = 58%).
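The specification reported in Table IV (written out formally as equation (6) in the text that follows) can be sketched in the same way as equation (5). Column names are assumed, and treating the dependent variable as the log of mean owner-occupied housing value is an assumption made here for illustration, consistent with the elasticity-style interpretation given in the text rather than stated explicitly.

import numpy as np
import statsmodels.formula.api as smf

sub = df.query("1950 <= decade <= 1970").copy()
sub["log_price_city"] = np.log(sub["price_city"])
sub["log_price_metro"] = np.log(sub["price_metro"])

eq6 = smf.ols(
    "log_price_city ~ perb_city + log_price_metro + med_rooms"
    " + share_detached + share_new + C(smsa) + C(region):C(decade)",
    data=sub,
).fit(cov_type="cluster", cov_kwds={"groups": sub["smsa"]})
print(eq6.params["perb_city"])   # compare with Table IV, columns (2)-(3)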
V. THE EFFECT OF RACIAL DIVERSITY ON HOUSING PRICES
Thus far, I have shown that each black arrival to a central city at midcentury prompted more than one white departure. This pattern suggests that white mobility not only was a response to higher housing prices but also reflected a distaste for racial diversity. I can test this proposition directly by looking for a negative association between the black population share in the central city and the price of urban housing, again using the southern push instrument to predict black arrivals. Aggregate data on housing values are available from 1950 to 1970. For these years, I estimate
(6)   PRICE_CITY_{mrt} = α_m + β_2 PERB_CITY_{mrt} + γ_2 PRICE_METRO_{mrt} + X_{mrt} + υ_{rt} + ε_{mrt},
where PERB_CITY measures the city's black population share. β_2 estimates the effect of urban diversity on the prices of city housing relative to metropolitan areawide trends. The vector X_{mrt} contains average housing quality measures, including the median number of rooms in city housing units, the share of units that are in detached, single-family structures, and the share of units that were built in the previous ten years. Table IV examines the relationship between the black population share and the mean value of owner-occupied housing in the central city. The first column of Table IV contains the basic specification, whereas the second adds housing quality controls from the Census of Housing. In both cases, an increase in the black population share of the central city reduces housing prices. In the
raw data, a ten-percentage-point increase in the black population share is associated with a 6% decline in housing prices. Twenty percent of this decline can be explained by a limited set of housing quality controls. It is unlikely that the observed price decline was driven by lower prices paid by new black arrivals. Cutler, Glaeser, and Vigdor (1999) show that, in this period, blacks actually paid more than whites for equivalent housing units, perhaps because blacks faced a supply constraint created by white households unwilling to sell to black buyers. Again, one may be concerned that black migrants were attracted to areas with falling housing prices. Instrumenting with predicted migrant flows strengthens the negative relationship between racial diversity and urban housing prices (compare columns (2) and (3)).21 If anything, black migrants seem to be attracted to cities with higher wages or amenities that translate into higher city housing prices. Falling housing prices, together with the decline in urban population, are suggestive of a drop in the demand for cities that experience black in-migration. However, we would not expect housing prices to fall in all cities. In otherwise declining cities, falling demand may lead some existing units to stand vacant and housing prices to fall. In growing cities, a decline in urban demand may instead slow the rate of new construction until housing prices return to construction costs. As before, I split the sample by the rate of metropolitan area growth from 1940 to 1970. Consistent with this reasoning, I find that increasing racial diversity has no effect on housing prices in growing cities, where, as we have already seen, the net decline in urban households resulted in fewer housing units being built (Table III). In declining areas, by contrast, increasing racial diversity is associated with falling housing prices alongside a higher vacancy rate.
VI. CONCLUSIONS
Black migration from the rural South to industrial cities in the North and West coincided with the development of postwar suburbs. Did black migrants happen to arrive in cities at the wrong time, just as suburbanization got underway? Or was their 21. To instrument for the black population share, I use the city's population in 1940 as the denominator of the predicted black population share in all years to prevent a mechanical correlation arising between the instrument and the endogenous black population share.
arrival an important explanation for suburban growth? This paper shows that cities that received more black migrants from 1940 to 1970 lost a greater number of white residents. I rule out explanations for this pattern based on the endogenous location decisions of black migrants or the effect of migration on urban housing prices alone. My estimates suggest that the change in racial diversity associated with black migration resulted in a 17% decline in urban population. An ancillary goal of the paper has been to develop an instrument for changes in urban diversity in American cities over time. The instrument exploits shocks to southern industry and agriculture and the persistence of black migration patterns between southern states and northern cities. This method has many additional applications to questions in urban and public economics as well as to the economic history of American cities in the twentieth century. Although this paper quantifies the relationship between black arrivals and white departures from postwar cities, it has less to say about the mechanisms by which racial diversity affected the demand for urban residence. Some white residents were undoubtedly concerned about the changing racial and socioeconomic composition of their immediate neighborhoods. However, many others lived in all-white enclaves far from burgeoning black ghettos. These residents may have been motivated by changes in local policy accompanying a shift in the racial and socioeconomic composition of the urban electorate. The desegregation of public schools in the 1960s and 1970s provided another reason to leave the city. Exploring these mechanisms offers a promising direction for future research.
DATA APPENDIX
A. Northern Data
The sample includes all nonsouthern SMSAs that (1) were anchored by one or more of the hundred largest cities in 1940 or (2) had at least 250,000 residents by 1970. Only two SMSAs that meet the first criterion fall short of the latter population benchmark (Bridgeport, CT, and New Bedford, MA). The second criterion adds ten metropolitan areas to the sample, including growing western cities (e.g., Phoenix, AZ) and smaller areas in Pennsylvania, Ohio, and upstate New York (e.g., Harrisburg, PA). Excluding
TABLE A.1
SUMMARY STATISTICS FOR 1940–1970, 70 NONSOUTHERN METROPOLITAN AREAS

                                      Mean          Standard deviation
Population
  Whites in city                      457,107           919,030
  Δ whites in city                    −16,158           100,509
  Blacks in city                       70,877           182,963
  Δ blacks in city                     28,209            68,553
  Share black                           0.092             0.093
  Total in SMSA                     1,289,456         3,238,178
Instrument
  Predicted black population           48,834           102,440
  Δ predicted black population          5,703            12,687
Households
  Whites in city                      149,491           295,826
  Δ whites in city                     10,232            34,177
  Blacks in city                       20,552            54,172
  Δ blacks in city                      8,440            22,097
  Vacant units                          7,724            16,429

Notes: Statistics are presented for the 70 SMSAs in the North or West that either (1) were anchored by one of the 100 largest cities in 1940 or (2) had at least 250,000 residents by 1970. The white and black population are calculated for counterfactual city borders. The borders are created by reassigning residents who would have lived in the suburbs if not for annexation back to the suburbs, under the assumption that the population living in the annexed area had the same white share as the suburban area as a whole.
these ten areas has no discernable effect on the main results (compare a coefficient of −2.110 (s.e. = 0.548) to the coefficient of interest in Table II, column (2)). For consistency, I apply the 1970 county-based definition of a metropolitan area in every year. I use the New England County Metropolitan Area (NECMA) classifications for the New England region to avoid divided counties. See Table A.1 for summary statistics for nonsouthern metropolitan areas. City boundaries can expand through the annexation of neighboring territory (Dye 1964; Jackson 1985, pp. 138–156). The direction of any bias created by annexation activity is unknown. Austin (1999) argues that politicians in diversifying cities have a stronger incentive to annex neighboring land in order to retain a majority-white electorate. In contrast, Alesina, Baqir, and Hoxby (2004) find that racial diversity reduces the number of successful school district consolidations, particularly in states that require both districts to agree to consolidate. To adjust for annexation, I create a parallel set of population counts that define central cities according to their 1940 borders.
That is, I reassign residents who would have lived in the suburban ring if not for annexation back to the suburbs.22 Each measure involves a trade-off. Counts based on actual borders might conceal patterns of individual mobility erased by annexation activity. However, counts based on consistent borders will misclassify moves from annexed city territory to the suburbs as suburb-to-suburb moves. The tables in the paper are based on the fixed-border population counts. Using actual city boundaries instead produces an estimate of 2.317 (s.e. = 0.609) white departures for every black arrival. This coefficient is qualitatively similar to the comparable estimate in the second column of Table II.
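A minimal sketch of the fixed-border adjustment described above, assuming (as in the notes to Table A.1) that the annexed population had the same white share as the rest of the suburban ring; all inputs are illustrative.

def fixed_border_counts(city_total, city_white, annexed_pop, suburb_white_share):
    """Return 1940-border city population and white population, reassigning
    residents added through annexation back to the suburban ring."""
    white_annexed = annexed_pop * suburb_white_share
    return city_total - annexed_pop, city_white - white_annexed

# Example with made-up numbers: a city of 500,000 (400,000 white) that annexed
# 30,000 people from a suburban ring that is 95% white.
print(fixed_border_counts(500_000, 400_000, 30_000, 0.95))   # (470000, 371500.0)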
B. Southern Data
Black migration rates are approximated from population counts in race–sex–age cohorts in two Censuses, adjusted by national survival ratios (Gardner and Cohen 1971; Bowles et al. 1990). That is, the actual population in a cohort in county c at time t is compared to a predicted population count determined by multiplying that cohort's population at time t − 10 by the national survival ratio. The differences between the actual and predicted population counts are attributed to in- or out-migration. Even when measured by race, the national survival ratio may understate mortality in the South, leading to an overestimate of out-migration (Fishback, Horrace, and Kantor 2006). As long as this bias is not systematically related to economic factors across counties, it should simply attenuate the coefficients in equation (3). All southern county-level variables are drawn from the electronic County and City Data Books, with the exception of cotton acreage. Information on cotton acreage is available electronically for some states at the National Agricultural Statistical Service's historical data website (http://www.usda.gov/nass/pubs/histdata.htm) and for others at the website of the Population and Environment in the U.S. Great Plains project of the ICPSR (http://www.icpsr.umich.edu/PLAINS/). The remainder were collected by hand from the Censuses of Agriculture. See Table A.2 for summary statistics for southern counties. 22. The Census Bureau estimated the number of individuals drawn into the central city through annexation from block-level data (Bogue 1953; U.S. Census 1960, 1970).
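The forward census survival method described above amounts to the following calculation for each race–sex–age cohort within a county. The per-100 scaling matches the migration rates used in Table I and Table A.2; the choice of the initial cohort as the denominator is one common convention rather than something stated explicitly here.

def net_migration(pop_start, pop_end, national_survival_ratio):
    """Net migrants over the decade: actual end-of-decade cohort size minus
    the size predicted by survival alone (positive = net in-migration)."""
    return pop_end - pop_start * national_survival_ratio

def net_migration_rate(pop_start, pop_end, national_survival_ratio):
    # Rate per 100 initial cohort members.
    return 100 * net_migration(pop_start, pop_end, national_survival_ratio) / pop_start

# Example with made-up numbers: a cohort of 2,000 expected to shrink to 1,900
# by survival alone but observed at 1,500 implies net out-migration of 400
# people, or -20 per 100 initial residents.
print(net_migration_rate(2_000, 1_500, 0.95))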
TABLE A.2
SUMMARY STATISTICS FOR 1940–1960, 1,350 SOUTHERN COUNTIES

                               Mean       Std. dev.      Min        Max
Net black migration rate       1.811      147.253       −100      4,400
Share land in cotton           0.329        0.397          0          1
Share farmers as tenant        0.312        0.195          0      0.942
Share LF in agriculture        0.335        0.183      0.001      0.885
Share LF in mining             0.028        0.074          0      0.818
$ defense pc, 1940–1945        0.162        0.599          0      9.025

Note. See Data Appendix for source details. Spending on defense contracts in current dollars.
UNIVERSITY OF CALIFORNIA, LOS ANGELES, AND NATIONAL BUREAU OF ECONOMIC RESEARCH
REFERENCES Alesina, Alberto, Reza Baqir, and Caroline Hoxby, “Political Jurisdictions in Heterogeneous Communities,” Journal of Political Economy, 112 (2004), 348–396. Alston, Lee J., “Tenure Choice in Southern Agriculture, 1930–1960,” Explorations in Economic History, 18 (1981), 211–232. Austin, D. Andrew, “Politics vs. Economics: Evidence from Municipal Annexation,” Journal of Urban Economics, 45 (1999), 501–532. Baum-Snow, Nathaniel, “Did Highways Cause Suburbanization?” Quarterly Journal of Economics, 122 (2007), 775–805. Bayer, Patrick, Robert McMillan, and Kim S. Rueben, “Residential Segregation in General Equilibrium,” NBER Working Paper No. 11095, 2005. Becker, Gary S., “A Theory of the Allocation of Time,” Economic Journal, 75 (1965), 493–508. Benabou, Roland, “Equity and Efficiency in Human Capital Investments: The Local Connection,” Review of Economic Studies, 63 (1996), 237–264. Bogue, Donald J., Population Growth in Standard Metropolitan Areas, 1900–1950 (Washington, DC: Housing and Home Finance Agency, 1953). Boustan, Leah Platt, “Escape from the City? The Role of Race, Income, and Local Public Goods in Postwar Suburbanization,” NBER Working Paper No. 13311, 2007. Bowles, Gladys K., James D. Tarver, Calvin L. Beale, and Everette S. Lee, “Net Migration of the Population by Age, Sex, and Race, 1950–1970” [computer file], ICPSR ed., Study No. 8493, 1990. Bradford, David F., and Harry H. Kelejian, “An Econometric Model of the Flight to the Suburbs,” Journal of Political Economy, 81 (1973), 566–589. Card, David, “Immigrant Inflows, Native Outflows, and the Local Market Impacts of Higher Immigration,” Journal of Labor Economics, 19 (2001), 22–64. Card, David, Alexandre Mas, and Jesse Rothstein, “Tipping and the Dynamics of Segregation,” Quarterly Journal of Economics, 123 (2008), 177–218. Carrington, William J., Enrica Detragiache, and Tara Vishwanath, “Migration with Endogenous Moving Costs,” American Economic Review, 86 (1996), 909– 930. Collins, William J., “When the Tide Turned: Immigration and the Delay of the Great Black Migration,” Journal of Economic History, 57 (1997), 607–632. Crowder, Kyle, “The Racial Context of White Mobility: An Individual-Level Assessment of the White Flight Hypothesis,” Social Science Research, 29 (2000), 223–257. Cullen, Julie Berry, and Steven D. Levitt, “Crime, Urban Flight, and the Consequences for Cities,” Review of Economics and Statistics, 81 (1999), 159–169. Cutler, David M., Edward L. Glaeser, and Jacob Vigdor, “The Rise and Decline of the American Ghetto,” Journal of Political Economy, 107 (1999), 455–506.
Doms, Mark, and Ethan Lewis, “Labor Supply and Personal Computer Adoption,” Federal Reserve Bank of Philadelphia Working Paper No. 06-10, 2006. Dye, Thomas R., “Urban Political Integration: Conditions Associated with Annexation in American Cities,” Midwest Journal of Political Science, 8 (1964), 430–446. Ellen, Ingrid Gould, Sharing America’s Neighborhoods: The Prospects for Stable Racial Integration (Cambridge, MA: Harvard University Press, 2000). Emerson, Michael O., Karen J. Chai, and George Yancey, “Does Race Matter in Residential Segregation? Exploring the Preferences of White Americans,” American Sociological Review, 66 (2001), 922–935. Fishback, Price, William Horrace, and Shawn Kantor, “The Impact of New Deal Expenditures on Mobility During the Great Depression,” Explorations in Economic History, 43 (2006), 179–222. Fligstein, Neil, Going North: Migration of Blacks and Whites from the South, 1900–1950 (New York: Academic Press, 1981). Frey, William H., “Central City White Flight: Racial and Nonracial Causes,” American Sociological Review, 44 (1979), 425–448. Gabriel, Stuart A., Janice Shack-Marquez and William L. Wascher, “Regional House-Price Dispersion and Interregional Migration,” Journal of Housing Economics, 2 (1992), 235–256. Gamm, Gerald H., Urban Exodus: Why the Jews Left Boston and the Catholics Stayed (Cambridge, MA: Harvard University Press, 1999). Gardner, John, and William Cohen, “County Level Demographic Characteristics of the Population of the United States: 1930–1950” [computer file]. Compiled by University of Chicago Center for Urban Studies. ICPSR ed., Study No. 0020, 1971. Glaeser, Edward L., and Joseph Gyourko, “Urban Decline and Durable Housing,” Journal of Political Economy, 113 (2005), 345–375. Gottlieb, Peter, Making Their Own Way: Southern Blacks’ Migration to Pittsburgh: 1916–30 (Urbana: University of Illinois Press, 1987). Great Plains Research Project, “Population and Environment in the U.S. Great Plains” [website], Inter-University Consortium for Political and Social Research, http://www.icpsr.umich.edu/PLAINS/, 2005. Grossman, James R., Land of Hope: Chicago, Black Southerners, and the Great Migration (Chicago: University of Chicago Press, 1989). Grove, Wayne A., and Craig Heinicke, “Better Opportunities or Worse? The Demise of Cotton Harvest Labor, 1949–1964,” Journal of Economic History, 63 (2003), 736–767. ——, “Labor Markets, Regional Diversity, and Cotton Harvest Mechanization in the Post–World War II United States,” Social Science History, 29 (2005), 269– 297. Grubb, W. Norton., “The Flight to the Suburbs of Population and Employment, 1960–1970,” Journal of Urban Economics, 11 (1982), 348–367. Guterbock, Thomas M., “The Push Hypothesis: Minority Presence, Crime, and Urban Deconcentration,” in The Changing Face of the Suburbs, Barry Schwartz, ed. (Chicago: University of Chicago Press, 1976). Jackson, Kenneth T., Crabgrass Frontier: The Suburbanization of the United States (New York: Oxford University Press, 1985). Kopecky, Karen, and Richard M.H. Suen, “A Quantitative Analysis of Suburbanization and the Diffusion of the Automobile,” International Economic Review, forthcoming. LeRoy, Stephen, and John Sonstelie, “Paradise Lost and Regained: Transportation Innovation, Income and Residential Location,” Journal of Urban Economics, 13 (1983), 67–89. Lewis, Ethan, “Immigration, Skill Mix, and the Choice of Technique,” Federal Reserve Bank of Philadelphia Working Paper No. 05-08, 2005. 
Margo, Robert A., “Explaining the Postwar Suburbanization of Population in the United States: The Role of Income,” Journal of Urban Economics, 31 (1992), 301–310. Marshall, Harvey, “White Movement to the Suburbs: A Comparison of Explanations,” American Sociological Review, 44 (1979), 975–994.
Meyer, Stephen Grant, As Long As They Don’t Move Next Door: Segregation and Racial Conflict in American Neighborhoods (New York: Rowman and Littlefield, 2000). Mieszkowski, Peter, and Edwin S. Mills, “The Causes of Metropolitan Suburbanization,” Journal of Economic Perspectives, 7 (1993), 135–147. Mills, Edwin S., and Richard Price, “Metropolitan Suburbanization and Central City Problems,” Journal of Urban Economics, 15 (1984), 1–17. Munshi, Kaivan, “Networks in the Modern Economy: Mexican Migrants in the U.S. Labor Market,” Quarterly Journal of Economics, 118 (2003), 549–599. National Agricultural Statistical Service, “Historical Data” [website], U.S. Department of Agriculture, http://www.usda.gov/nass/pubs/histdata.htm, 2005. Saiz, Albert, “Immigration and Housing Rents in American Cities,” Journal of Urban Economics, 61 (2007), 345–371. Schelling, Thomas C., “Dynamic Models of Segregation,” Journal of Mathematical Sociology, 1 (1971), 143–186. Steinnes, Donald N., “Causality and Intraurban Location,” Journal of Urban Economics, 4 (1977), 69–79. Sugrue, Thomas J., The Origins of Urban Crisis: Race and Inequality in Postwar Detroit (Princeton, NJ: Princeton University Press, 1996). Thurston, Lawrence, and Anthony M. J. Yezer, “Causality in the Suburbanization of Population and Employment,” Journal of Urban Economics, 35 (1994), 105– 118. U.S. Bureau of the Census, 16 th –19 th Censuses of the United States: 1940–1970, Housing (Washington, DC: Government Printing Office, 1942, 1952, 1962, 1972). ——, 16 th Censuses of the United States: 1940, Internal Migration, 1935–40 (Washington, DC: Government Printing Office, 1943). ——, 18 th and 19th Censuses of the United States: 1960, and 1970, Geographic Mobility for Metropolitan Areas (Washington, DC: Government Printing Office, 1962, 1972). ——, County and City Data Book, Consolidated File: City/County Data, 1947– 1977, ICPSR Study No. 7735–7736, 1977. Whatley, Warren, “Labor for the Picking: The New Deal in the South,” Journal of Economic History, 43 (1983), 905–929. Wright, Gavin, Old South, New South: Revolutions in the Southern Economy since the Civil War (New York: Basic Books, 1986).
PAYING FOR PROGRESS: CONDITIONAL GRANTS AND THE DESEGREGATION OF SOUTHERN SCHOOLS∗ ELIZABETH CASCIO NORA GORDON ETHAN LEWIS SARAH REBER This paper examines how a large conditional grants program influenced school desegregation in the American South. Exploiting newly collected archival data and quasi-experimental variation in potential per-pupil federal grants, we show that school districts with more at risk in 1966 were more likely to desegregate just enough to receive their funds. Although the program did not raise the exposure of blacks to whites like later court orders, districts with larger grants at risk in 1966 were less likely to be under court order through 1970, suggesting that tying federal funds to nondiscrimination reduced the burden of desegregation on federal courts.
I. INTRODUCTION Because the U.S. Constitution reserves powers not explicitly delegated to the federal government to the states, conditional grants are key levers for federal policymakers seeking to affect a broad range of state and local policies. States must implement federally approved speed limits and drinking ages to receive highway funding; universities must provide gender parity in athletic offerings to receive research funding; and states can lose funding if they do not comply with the Clean Air Act. More recently, states and school districts have risked losing federal grants for failure to comply with the accountability requirements of the No Child Left Behind Act. In this paper, we examine whether the threat ∗ For their helpful comments and questions, we are grateful to seminar participants at Duke, Georgetown, Northwestern, Stanford, UBC, UCD, UCI, UCSD, UCR, the All-UC Labor Workshop, the University of Virginia, the NBER Economics of Education and Development of the American Economy program meetings, and the annual meetings of the AEA, SSHA, and SOLE. We would especially like to thank Patty Anderson, Sandra Black, Leah Platt Boustan, Ken Chay, Julie Cullen, Jon Guryan, Larry Katz, Sean Reardon, and Doug Staiger, as well as four anonymous referees. Jeremy Gerst, Maria Kahle, Farah Kaiksow, Allison Kidd, Cyrus Kosar, Eric Larsen, Patricia Tong, and Courtney Wicher provided indispensable research assistance. This research was supported by grants from the National Science Foundation (Award Number 0519126), the Spencer Foundation (Award Number 200600131), and the University of Kentucky Center for Poverty Research Regional Small Grants Program (Award Number 2U01PE000002-04). Cascio gratefully acknowledges support from a Junior Faculty Research Grant from the Institute of Governmental Affairs at UC Davis. Gordon and Reber gratefully acknowledge support from the National Academy of Education/Spencer Foundation postdoctoral fellowship. The data presented, the statements made, and the views expressed are solely the responsibility of the authors. C 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of
Technology. The Quarterly Journal of Economics, February 2010
FIGURE I
District-Level Trends in Desegregation and Court Orders in the Former Confederacy
[Figure: trends from 1956 to 1976 in the share of districts with student desegregation, the share of blacks in desegregated schools, and the share of districts under court supervision, with the passage of the CRA/ESEA marked.] Authors' calculations based on Southern Education Reporting Service (SERS), Department of Health, Education, and Welfare, and Office of Civil Rights data. Sample includes unbalanced panel of school districts in all states of the former Confederacy, except Texas (Alabama, Arkansas, Florida, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee, and Virginia), where between 3% and 97% of student enrollment was black on average between 1961 and 1963. Trends for a balanced panel are broadly similar. A school is considered desegregated if it had any blacks in school with whites; a district is desegregated if it contained any desegregated schools. A district is considered under court supervision if it was on SERS's list of districts desegregating under court order (1956 to 1964) or if it complied with the Civil Rights Act by submitting a court-ordered plan (1966 to 1976). All school districts are given equal weight. Trend breaks between 1964 and 1966 are less dramatic but still apparent when tabulations are weighted by average black enrollment between 1961 and 1963. See Appendix II for details.
of withdrawal of this same source of federal education funding induced Southern school boards to make an extremely unpopular decision four decades ago—to desegregate public schools. Dismantling the dual system of education in the South was, to say the least, contentious. Particularly salient cases, such as those in Little Rock and New Orleans, highlighted extreme white resistance and the need for court intervention, enforced by police or the National Guard, as a “stick” to implement the mandate of Brown. The literature has established that the courts played a critical role in desegregating Southern schools, especially after 1968 (e.g., Welch and Light [1987]; Reber [2005]), and the dotted line in Figure I shows that about half of Southern districts were
ultimately under court order to desegregate. There was, however, a historic shift away from segregation in the mid-1960s, when the extent of court supervision was far more limited. Through the mid-1960s, the likelihood that the average Southern school district was desegregated—had one or more black students in any school with any white students (solid line in Figure I)—outpaced the likelihood that it was under court supervision. There was a particularly noticeable burst of “voluntary” desegregation—that is, desegregation not mandated by a court—between 1964 and 1966, along with a significant uptick in the share of black students attending desegregated schools (dashed line). We explore whether the “carrot” of federal funding contributed to voluntary desegregation in the mid-1960s and ultimately reduced the burden that desegregation placed on courts. To receive federal funds, Southern school districts had to comply with the nondiscrimination provisions of the Civil Rights Act of 1964 (CRA) by desegregating their schools. Title I of the Elementary and Secondary Education Act of 1965 (ESEA) created large grants for schools, generating significant costs of noncompliance. Researchers have speculated that these policies caused the abrupt rise in desegregation witnessed in aggregate data for the mid-1960s.1 In previous work, we have shown that high-poverty districts—which stood to gain the most from Title I—were particularly likely to desegregate around this time (Cascio et al. 2008). However, past work has not been able to separate the effect of conditional grants from the effects of concurrent policy changes, such as the Voting Rights Act of 1965 and the heightened threat of litigation resulting from other provisions of the CRA. To address this identification problem, we exploit idiosyncratic variation across school districts in the amount of federal funding at risk from noncompliance with the CRA. The amount of Title I funding a compliant district would have received was based on district-level child poverty and state-level spending; the gap in expected Title I receipts between poor and rich districts was larger in states with higher per-pupil spending prior to the ESEA. We examine whether the relatively large difference in funding was matched by a relatively large difference in the likelihood of 1. See, for example, Rosenberg (1991), Boozer, Krueger, and Wolkon (1992), Clotfelter (2004), and Ashenfelter, Collins, and Yoon (2006). In a different context, Almond, Chay, and Greenstone (2006) argue that the fund-withholding provisions of the Civil Rights Act, combined with the introduction of Medicare, reduced the black–white gap in infant mortality by desegregating Southern hospitals.
a district exerting the minimum desegregation effort required to collect federal funds—an intuitive prediction of the simple theoretical framework presented below. The credibility of our inferences is supported by the fact that differences between poor and rich districts in other factors influencing desegregation and in preprogram desegregation outcomes did not vary systematically with a state’s prior spending. We investigate the effects of conditional funding for 1966, the first year of the policy for which appropriate data exist. Districts with larger grants were more likely to desegregate on the margins required for compliance with the CRA. The probability of having only token desegregation (which we define as less than 2% of blacks in desegregated schools) fell by over eight percentage points for each additional hundred dollars in potential per-pupil Title I funding (in constant 2007 dollars), with districts moving to slightly higher levels of desegregation (2%–6% of blacks in desegregated schools). Our preferred estimates imply that on average a district would have needed to be paid $1,200 per pupil—72% of average per-pupil spending in the South in the early 1960s—to move beyond token desegregation. We find suggestive evidence of similar willingness to pay for teacher segregation. Districts with larger potential grants in 1966 also were less likely to have been under court order both then and through 1970. Our findings thus suggest that conditional federal grants helped prompt a shift away from the minimal desegregation characteristic of the mid-1960s, thereby reducing the burden placed on federal courts in the years that followed. On the other hand, in 1966, districts were not required to desegregate on margins that would have produced substantial increases in exposure of blacks to whites—particularly in comparison with what court orders required after 1968—and we do not find effects of conditional grants on such margins. But our estimates capture the marginal effects of conditional grants in 1966 only, leaving out any aggregate effects of the program’s existence or contemporaneous effects in other years, when more intensive desegregation would have been required to receive funding.2 Further, the historical record suggests that establishing consistent guidelines for desegregation plans—a critical result of ESEA and CRA implementation during the Johnson administration—promoted a strong judicial role in 2. We cannot examine the contemporaneous effects of conditional federal funding in later years, when more was required for CRA compliance, due to changes in program rules.
desegregation in the years that followed. Overall, our analysis shows that districts responded to financial pressure to desegregate in a historically meaningful way and—together with the existing empirical and historical literature—supports the view that dismantling segregation in Southern schools was facilitated by all three branches of the federal government.
II. CONDITIONAL FUNDING AND DESEGREGATION IN THEORY
We begin by presenting a framework for understanding the effects of conditional federal funding on school desegregation using a modified version of the model Margo (1990) used to understand black–white school spending gaps prior to Brown. Because spending on black and white students had greatly converged before the period covered by our study (Margo 1990; Card and Krueger 1992; Donohue, Heckman, and Todd 2002), we depart from Margo and assume that expenditure per pupil did not vary by race within districts.3 We then assume that districts faced a trade-off between expenditure per pupil, e, and student segregation, s, measured as the fraction of black students attending all-black schools. We further assume that decision-making rested in the hands of Southern whites, as few Southern school boards had any black members at this time (U.S. Commission on Civil Rights 1968). White school boards chose e and s to maximize their utility, U = U(e, s), where the marginal utilities of both spending and segregation are assumed to be positive and diminishing (∂U/∂e, ∂U/∂s > 0, and ∂²U/∂e², ∂²U/∂s² < 0).4 Maximization was subject to the constraint that per-pupil expenditure not exceed net per-pupil revenue, e ≤ l + m + f − τ(s), where l, m, and f, respectively, represent revenue per pupil from local, state, and federal sources. For simplicity, we assume that local and state revenue were fixed, though the substantive 3. Differences in spending per pupil in black and white schools did persist through the mid-1960s in Louisiana (Reber 2007). However, allowing for differential spending by race would not change the model's implications. 4. To the extent that school boards care about exposure of whites to blacks, ∂U/∂s will be larger in districts with a higher black share in enrollment. For simplicity, we do not incorporate racial composition into the model, but we control for a district's initial racial composition in the empirical analysis.
FIGURE II
Theoretical Predictions
e represents expenditure per pupil; l, m, and f represent local, state, and federal revenue, respectively; s̃ represents the threshold level of student segregation at or below which federal funds are received; and λ represents the cost to the district per unit of segregation s, excluding the loss of federal funds.
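The comparison that Figure II illustrates can be stated compactly. The condition below is a reader's restatement in LaTeX using only the objects defined in the surrounding text (U, e, s, l, m, f, λ, s̃); it is a sketch of the threshold logic, not an equation reproduced from the paper.

\[
U\bigl(l + m + f - \lambda \tilde{s},\; \tilde{s}\bigr)
\;\ge\;
\max_{s > \tilde{s}} U\bigl(l + m - \lambda s,\; s\bigr)
\]

A district complies (chooses s = s̃) whenever this inequality holds. Because the left-hand side is increasing in f, districts with larger potential grants are more likely to cross the threshold; in Figure II the relevant noncompliant alternative is the fully segregated corner, s = 1.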
implications of the model are unchanged if we relax this assumption by introducing local control over taxation. τ(s) is the per-pupil expense to the district of segregationist policy s. τ(s) has three components. First, segregation may have entailed foregone economies of scale and additional transportation costs. Second, maintaining higher levels of segregation entailed costs associated with deterring and fighting litigation. Finally, requirements for compliance with the CRA made receipt of federal funds conditional on reaching some threshold level of student segregation. Federal funds per pupil received by a district can therefore be characterized as f(s) = f if s ≤ s̃ and f(s) = 0 if s > s̃, where s̃ represents this threshold. Let the first two categories of costs be denoted by λ. The cost of segregation was thus τ(s) = sλ if the threshold was reached and τ(s) = sλ + f if not, generating a discontinuity in the budget constraint at s̃:

e ≤ l + m + f − sλ if s ≤ s̃,
e ≤ l + m − sλ if s > s̃.

The value of s̃ changed over time and was district-specific, as discussed below. The model suggests a simple test of whether the conditional nature of federal funding influenced districts' segregation policy choices. Figure II provides the intuition, plotting the budget constraints of two hypothetical school districts. Both districts have the same preferences for segregation and spending, represented by their identical indifference curves, but the district
represented in Panel A has a smaller potential grant and therefore a smaller increase in funding at s̃. The graphs show that the district facing a sufficiently large federal grant (Panel B) would have desegregated—just to the point required by the CRA, s = s̃—whereas the district facing the smaller grant would have remained fully segregated (s = 1). All else being equal, districts with larger grants would have been more likely to cross the threshold to receive their federal funds. Our empirical models are thus designed to test for an effect of conditional federal funding around s̃. In practice, however, not all districts will be observed at exactly s = s̃ or s = 1. Although districts could target a level of s with a particular desegregation policy, they could not completely control its realized value. Further, a district with sufficiently weak tastes for segregation and/or sufficiently high costs of segregation might have chosen s < s̃.5 More generally, variation in s arose from heterogeneity across districts in preferences, in the costs of segregation, and in the available budget; we describe the data we have collected on these district characteristics below. This observation points to the importance of using variation in federal funding that is not correlated with these other key determinants of segregation to identify the effect of conditional funding. The Title I formula generated such exogenous variation.6 III. EMPIRICAL STRATEGY In the earliest years of the program, the Title I allocation for district d in state j was equal to its count of poor children in the 1960 Census (poor_d1960)7 multiplied by one-half of average per-pupil expenditure in its state two years prior (stategrant_jt).8 The program was thus compensatory: within a state, districts with more poor children were due more Title I funding. However, two 5. It is also possible that conditional federal funding had a perverse effect on segregation for districts that would have otherwise been segregated less than s̃. If segregation is a normal good, such a district would have consumed more segregation due to the income effect of receiving the grant. 6. The federal government intended Title I funding to be used for compensatory programs only, but in practice it was used to finance all types of current education spending (Washington Research Project 1969), suggesting it was as fungible as f in our model. 7. Specifically, poor_d1960 is the count of five- to seventeen-year-olds living in families with incomes less than $2,000 in the 1960 Census. There were Title I eligibles in other categories, but these categories were relatively small in the South. See Appendix II for more detail. 8. Appendix I.A gives the values of stategrant_jt by state for 1966–1967, the year used in our analysis. Because the Title I program was not fully funded in 1966–1967, the figures reported reflect ratable reductions by state-specific multiplicative constants.
districts with the same poverty count in different states would have had different amounts of Title I funding at risk. The funding formula motivates a difference-in-differences (DD) estimation strategy, where we compare outcomes for higher- and lower-poverty districts in higher- and lower-spending states. Following the logic of a DD framework, we include functions of the district- and state-specific components of per-pupil Title I funding as controls in our baseline model, as both are strongly correlated with funding, and either may be independently related to segregation outcomes. Our analysis focuses on potential federal funding and school desegregation during the 1966–1967 academic year. The model of interest is

(1)   y_dj = α + θ·ppti_dj1966 + g(poor_d1960/enr_d1966) + h(stategrant_j1966) + ε_dj.

The outcome, y_dj, is an indicator set to one if district d in state j met a particular desegregation target, and ppti_dj1966 represents potential Title I funding per pupil in the district in 1966:

(2)   ppti_dj1966 ≡ (stategrant_j1966 × poor_d1960) / enr_d1966,

where enr_d1966 is the district's fall 1966 enrollment.9 g(·) and h(·) are functions of the district's child poverty rate, poor_d1960/enr_d1966, and the 1966–1967 state factor in Title I funding, stategrant_j1966, respectively, and ε_dj captures unobserved determinants of the segregation decision. If the requirement that districts meet desegregation targets to receive federal funding affected segregation decisions, the parameter of interest, θ, should be positive. Although g(·) and h(·) account for many potential confounding factors, OLS estimates of θ may be biased. In particular, although the total Title I grant was determined based only on preprogram 9. All federal funding was on the line, but we only use Title I funding in the analysis due to data constraints. The parameter θ is thus appropriately interpreted as the reduced-form effect of the conditionality of the Title I program on desegregation. As long as the other categories of federal funding were uncorrelated with the identifying variation in Title I funding, our empirical strategy will produce unbiased estimates of the effect of an additional dollar of federal funding overall. We cannot test this assumption but believe it is likely to hold. ESEA funding was the largest category of aid to elementary and secondary education administered by the Office of Education and was about three times as large as each of the two next-largest categories—Aid to Federally Impacted Areas and the National Defense Education Act programs. Neither of these programs distributed funds based on the interaction of poverty with average state-level spending.
district characteristics, the per-pupil amount depended on 1966 enrollment, which may have been directly affected by desegregation policy. For example, districts with an unobserved taste for segregation might have both desegregated less and experienced more "white flight." This would generate a negative correlation between ppti_dj1966 and the error term in equation (1), biasing OLS estimates of θ downward. Conversely, holding preferences constant, school desegregation may have increased white flight, biasing OLS estimates of θ upward. Even if neither of these conditions holds, OLS estimates of θ will be attenuated if current enrollment is measured with error. We therefore instrument for the actual per-pupil Title I grant with the district's "simulated" per-pupil Title I grant, which holds enrollment constant at pre-Title I levels:

(3)   ppti^SIM_dj1966 ≡ (stategrant_j1966 × poor_d1960) / enr_d,pre.

enr_d,pre represents average enrollment in district d in years prior to Title I's introduction (specifically, between 1961 and 1963). The simulated grant is thus based entirely on preprogram district characteristics and is itself another noisy measure of Title I funding per pupil, allowing us to address biases from both the endogeneity of enrollment and measurement error. Because the current-year value of the child poverty rate in equation (1) is also potentially endogenous, we use the preprogram child poverty rate, poor_d1960/enr_d,pre, as a control in our primary estimating equation,

(1')   y_dj = α + θ·ppti_dj1966 + g(poor_d1960/enr_d,pre) + h(stategrant_j1966) + ε_dj.10

We also use the least restrictive functions for g(·) and h(·) that our data can accommodate while maintaining reasonable precision: dummies for quantiles of the preprogram child poverty rate and state fixed effects. The former account for segregation shocks shared by similar-poverty districts in different states, whereas the latter account for common state-level determinants of segregation, such as state policies that affected all districts equally. 10. We estimate a linear probability model for ease of implementation and interpretation. The reduced-form marginal effects of the simulated per-pupil grant on our dichotomous outcomes are similar when estimated using probit.
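To make the construction in equations (2), (3), and (1') concrete, the following is a minimal Python sketch of how the per-pupil grant, the simulated instrument, and a manual two-stage least squares fit could be computed from district-level data. The data frame and column names (df, poor_1960, enr_1966, and so on) are hypothetical illustrations, not the authors' code or data; the toy example also omits the clustered standard errors, state fixed effects, and poverty-quantile dummies used in the actual specifications.

import numpy as np
import pandas as pd

def two_stage_ls(y, endog, instrument, exog):
    """Naive 2SLS: regress the endogenous variable on the instrument and
    controls, then regress y on the fitted value and the same controls.
    Returns the second-stage coefficient vector (position 1 is theta)."""
    n = len(y)
    Z = np.column_stack([np.ones(n), instrument, exog])   # first-stage regressors
    pi, *_ = np.linalg.lstsq(Z, endog, rcond=None)
    endog_hat = Z @ pi                                     # fitted per-pupil grant
    X = np.column_stack([np.ones(n), endog_hat, exog])     # second-stage regressors
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Hypothetical district-level data (a real sample would have hundreds of rows).
df = pd.DataFrame({
    "poor_1960":       [1200, 450, 900, 300],        # poor 5- to 17-year-olds, 1960 Census
    "enr_1966":        [6000, 2500, 5200, 1800],     # fall 1966 enrollment
    "enr_pre":         [6100, 2600, 5000, 1900],     # average enrollment, 1961-1963
    "stategrant_1966": [300.0, 450.0, 380.0, 300.0], # state grant per eligible child
    "token_deseg":     [1, 0, 1, 0],                 # = 1 if <2% of blacks in desegregated schools
    "poverty_ctrl":    [0.20, 0.17, 0.18, 0.16],     # stand-in for the poverty-rate controls
})

# Equation (2): potential Title I funding per pupil, using 1966 enrollment.
df["ppti"] = df["stategrant_1966"] * df["poor_1960"] / df["enr_1966"]
# Equation (3): simulated grant, holding enrollment at preprogram levels.
df["ppti_sim"] = df["stategrant_1966"] * df["poor_1960"] / df["enr_pre"]

theta_hat = two_stage_ls(
    df["token_deseg"].to_numpy(float),
    df["ppti"].to_numpy(float),
    df["ppti_sim"].to_numpy(float),
    df[["poverty_ctrl"]].to_numpy(float),
)[1]
print(theta_hat)

In practice one would use an estimator that reports instrument-robust, cluster-adjusted standard errors; the point of the sketch is only the order of operations: build the grant and its simulated counterpart from preprogram inputs, then let the simulated grant instrument for the realized per-pupil grant.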
Two-stage least squares (TSLS) estimates of θ in equation (1 ) will be consistent if the instrument, pptiSIM dj1966 —the interaction between preprogram state average per-pupil expenditure and the preprogram district child poverty rate—is uncorrelated with εdj . Put differently, it must be the case that unobserved differences in segregation between rich and poor districts do not vary systematically with average state spending on education. This assumption would be violated if state policies in lower-spending states such as Mississippi or South Carolina affected the gap in segregation outcomes between high- and low-poverty districts differently than those of governments in higher-spending and more “progressive” states, such as Florida (see Appendix I.A).11 Although we cannot entirely rule out this source of bias, historical accounts suggest that the importance of state policies to discourage desegregation had diminished by 1966 and, where present, such policies were not applied differentially to richer and poorer districts.12 Empirical evidence also supports the identifying assumption. For example, we show below that our instrument, pptiSIM dj1966 , is uncorrelated with several observed proxies for segregationist preferences and the threat of litigation in our chosen specification, and that the TSLS estimates of θ are not sensitive to the addition of these observables to (1 ). We also find no significant “effect” of the Title I funding on desegregation before the program existed, suggesting that pptiSIM dj1966 is not correlated with unobserved propensities to desegregate. IV. DATA We have compiled comprehensive school-district-level data for this analysis from a variety of sources. This section provides a brief overview of these data; see Appendix II for more detail. 11. For example, the correlation between stategrant j 1966 and the share of a state’s electorate that voted for Strom Thurmond in the 1948 presidential election—one measure of segregationist preferences—is negative and statistically significant, suggesting that voters in higher spending states were more progressive on race relations. 12. One exception is Alabama, where Governor George Wallace pressured districts to flout the CRA, providing special assistance funds (to offset Title I losses) to districts that did not comply. When we drop Alabama school districts from our sample, we arrive at very similar estimates (available on request). Outside of Alabama, the most active area of state-level policy toward desegregation was legislation aimed at facilitating the development of all-white private schools. We have not found evidence that these policies were differentially applied to higherand lower-poverty districts. We have also investigated the empirical relevance of this competing hypothesis by estimating (1 ) separately for districts in states with high and low Thurmond vote shares. We fail to reject the hypothesis that estimates across subsamples are identical, but the estimates are quite imprecise.
TABLE I
DESCRIPTIVE STATISTICS: POTENTIAL TITLE I FUNDING IN 1966 AND OTHER DISTRICT AND COUNTY CHARACTERISTICS

                                                               Mean      Std. dev.
A. Potential Title I funding, 1966
  Title I per pupil (1966 enrollment, $2007)                    277         149
  Simulated Title I per pupil (early 1960s enrollment, $2007)   274         137
B. Preexisting district and county characteristics
  Early 1960s child poverty %                                  33.8        17.5
  Early 1960s black enrollment %                               37.3        20.2
  1948 Thurmond vote %                                         35.6        28.8
  Early 1960s enrollment                                      6,121      10,977
  Early 1960s expenditure per pupil ($2007)                   1,671         427
  1960 county characteristics:
    % with high school degree                                 26.36        7.35
    % employed in agriculture                                 37.32       24.03
    Median family income ($2007)                             23,308       6,877
    = 1 if urban                                               0.20        0.40
  Number of districts                                           916

Notes. Table gives descriptive statistics on key explanatory variables for the full estimation sample in 1966. The unit of observation is the school district. The sample includes school districts in ten states of the former Confederacy (Alabama, Arkansas, Florida, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee, and Virginia) and is restricted to districts in these states that had black enrollment shares between 0.03 and 0.97 on average between 1961 and 1963, were not under court order to desegregate in 1964, have complete data on the characteristics listed, and have the percent of blacks in desegregated schools observed in 1966. For more information, see text and Appendix II. "Early 1960s" corresponds to an average taken over 1961 to 1963.
IV.A. Title I and Other Explanatory Variables Table I shows summary statistics for the explanatory variables used in our analysis. The key variable of interest is Title I funding per pupil in 1966, the numerator of which was collected from Congressional reports. In 1966, this figure was $277 for the average district, about 17% of the average per-pupil current expenditure of $1,671 in the early 1960s (both in 2007 dollars). Recall that the simulated Title I grant per pupil is the product of the grant per eligible child, which varied across states (see Appendix I.A), and the district’s preprogram child poverty rate, which was on average 33.8%. It is this poverty rate that enters directly and flexibly into equation (1 ). Our model also incorporates controls for district and county characteristics that may have been related to segregation. Annual state administrative reports from 1961 to 1963 provide districtlevel data on average preprogram expenditure per pupil and black
share in enrollment.13 Preprogram expenditure both proxies for a district’s potential budget and reflects preferences for spending and segregation. Black share in enrollment would have affected white children’s exposure to blacks for any given share of blacks in desegregated schools, thereby affecting white preferences for segregation. County voting records provide data on the share of votes cast for Strom Thurmond in the 1948 Presidential election, another proxy for segregationist preferences. Because larger districts were significantly more likely to have been litigated before 1964 (Cascio et al. 2008), we use average district enrollment between 1961 and 1963, from the state reports noted above, as a measure of the threat of litigation. Several characteristics of the county population in 1960, taken from the City and County Data Book—the percentage of the population with a high school degree, the share of employment in agriculture, median family income, and an urban indicator (equal to one if more than half the county’s population was urban)—round out our list of controls. Table I shows that, in the early 1960s, the average district in our sample enrolled just over 6,100 students and was 37.3% black. It was in a county with a predominately rural and poorly educated population, where 37.3% of workers were agricultural and 35.6% of votes were cast for Thurmond in 1948. Recall the identifying assumption in our model: in a specification with sufficient controls for its state- and district-level components, the simulated Title I grant per pupil should not be correlated with unobserved determinants of segregation policy. Table II shows that, with the exception of preprogram expenditure per pupil and the county urban indicator, the observed district characteristics described above are not significantly related to the instrument in the two specifications employed in our analysis. To mitigate any potential remaining biases and to reduce residual variation, we control flexibly for all district characteristics in the specifications estimated below. IV.B. Outcomes The main prediction of the model presented in Section II is that school districts with larger potential federal grants would have been more likely to choose levels of student desegregation 13. In some cases, we do not have data on enrollment by race for these years, so we use data from later in the 1960s; see Appendix II.
TABLE II
POTENTIAL TITLE I FUNDING IN 1966 AND OTHER DETERMINANTS OF SEGREGATION POLICY

Two-stage least squares coefficient (standard error) on Title I funding per pupil (in hundreds of $2007)

Dependent variable                                         (1)                 (2)
A. Proxies for preferences
  1948 Thurmond %                                    −1.573 (1.741)      −0.610 (1.494)
  Early 1960s black enrollment %                     −1.443 (1.636)       0.329 (1.452)
B. Proxy for litigation threat
  Ln early 1960s enrollment                          −0.153∗ (0.079)     −0.117 (0.078)
C. Potential school budget
  Early 1960s expenditure per pupil
    (hundreds of $2007)                               0.805∗∗∗ (0.230)    0.485∗∗ (0.243)
D. County characteristics
  1960 % with high school degree                     −0.129 (0.958)      −0.741 (0.932)
  1960 % employed in agriculture                     −2.880 (3.280)      −0.692 (3.150)
  1960 median family income ($2007)                   515.2 (787.2)      −359.6 (823.3)
  1960 urban indicator                                0.123∗∗ (0.054)     0.086∗ (0.051)
Controls:
  State fixed effects                                      X                   X
  Early 1960s child poverty %:
    Dummies for 20 quantiles                               X
    Restricted quantile effects(a)                                             X

Notes. Each entry gives the TSLS coefficient (standard error) on Title I funding per pupil (hundreds of $2007) in a model predicting the district or county characteristic listed. The instrument for Title I funding per pupil is simulated Title I funding per pupil (also in hundreds of $2007); see text. All regressions contain 916 district-level observations and also include as an explanatory variable whether the district had any student desegregation in 1964. Standard errors are clustered on county. (a) Dummies for the bottom nine deciles and the top two of the twenty quantiles in the first specification. ∗∗∗ p < .01. ∗∗ p < .05. ∗ p < .1.
at or above the threshold for receiving federal funds. To identify where this threshold was—and to develop outcome variables accordingly—it is critical to understand the specific requirements of the law.
Districts with court-supervised desegregation plans were automatically in compliance with the CRA. Other districts were required to submit so-called “voluntary” desegregation plans satisfying policy guidelines set out by the Department of Health, Education, and Welfare (DHEW). Most desegregation before 1967 involved transferring black students to formerly all-white schools, and the guidelines were specified in terms of the share of blacks that had to be transferred. The first guidelines, for 1965, were vague and ultimately required the transfer of a handful of black students. Many districts did desegregate on the extensive margin—moving at least one black student districtwide into a school with any white students—for the first time in 1965, giving up the principle of separate schools. In 1966, the year of our main analysis, the guidelines were more stringent and more specific, requiring higher growth in black “transfer rates” for districts that had transferred fewer blacks the prior year.14 In theory, we could identify the threshold level of desegregation for each district in 1966 based on its transfer rate in the prior year and use an indicator for exceeding that threshold as our dependent variable. However, we do not have the data on 1965 transfer rates needed to calculate the 1966 thresholds for districts in most states. Even if we had these data, construction of the above variable would be impossible because the guidelines did not specify clear targets for all districts. Moreover, the DHEW lacked the black enrollment data—the denominator of the transfer rate—to enforce its own guidelines literally. Despite these challenges, the district-level data we do have suggest that enforcement was generally accurate, but not all noncompliant districts were pursued.15 14. Districts with transfer rates (in practice, shares of blacks in desegregated schools) of 8% to 9% in 1965 were expected to double their transfer rates; taking the guidelines literally, in 1966 these districts would have been required to have 16% to 18% of blacks in desegregated schools. A tripling of transfer rates was expected from districts that had transferred 4% to 5% of blacks in 1965, a “proportionally larger” change for districts that had transferred less than 4% of blacks in 1965, and a “substantial start” from districts with no transfers in 1965 (U.S. Department of Health, Education, and Welfare March 1966, p. 8). 15. We have data on both 1965 and 1966 transfer rates for South Carolina. Using these data, we find that all districts in South Carolina where changes in transfer rates from 1965 to 1966 met the criteria outlined in the guidelines were deemed compliant and received their federal funds on time. About 28% of South Carolina districts that appeared noncompliant had their funds deferred. Orfield (1969) explains why not all cases in violation of the guidelines would be pursued: the DHEW had to submit its enforcement actions to the Justice Department’s Civil Rights Division, whose strategy was to enforce only those cases in which the guidelines were violated and the district’s free-choice plan was either flawed by administrative design or rendered irrelevant due to local intimidation.
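The guideline categories quoted in footnote 14 can be read as a rough mapping from a district's 1965 transfer rate to an approximate 1966 target. The function below is an illustrative paraphrase of that description only; the multipliers used for the vaguer categories ("proportionally larger," "substantial start") are placeholder assumptions, not numbers from the DHEW guidelines.

def approx_1966_target(transfer_rate_1965):
    """Very rough 1966 target share of blacks in desegregated schools (percent),
    paraphrasing the 1966 guideline categories described in the text."""
    if transfer_rate_1965 >= 8.0:
        return 2.0 * transfer_rate_1965   # 8%-9% in 1965: expected to double
    if transfer_rate_1965 >= 4.0:
        return 3.0 * transfer_rate_1965   # 4%-5% in 1965: expected to triple
    if transfer_rate_1965 > 0.0:
        # "Proportionally larger" increase for rates below 4%; the multiplier
        # here is a placeholder assumption, not a published figure.
        return 3.5 * transfer_rate_1965
    # "Substantial start" for districts with no transfers in 1965; the
    # guidelines attached no specific number, so this level is assumed.
    return 1.0

# Example: a district at 2.5% in 1965 faces a target below 10%, consistent with
# the margins the analysis focuses on.
print(approx_1966_target(2.5))

The gaps this paraphrase has to fill also illustrate why the analysis treats the exact district-specific thresholds as unobservable and works with the ad hoc outcome bins described next.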
Thus, school districts probably had a general idea of their thresholds in 1966 but faced some uncertainty about exactly what was required to receive their funding. Our analysis of student desegregation therefore focuses on a series of dependent variables that are ultimately ad hoc, but arguably capture the relevant margin for the average school district in our sample. We expect that most districts in our sample would have needed more than “token” levels of desegregation, but less than about 10% of blacks in desegregated schools, to meet the targets set out in the guidelines.16 Our key dependent variables are therefore indicators for whether a district fell into each of the following categories: less than 2% of blacks in desegregated schools (our measure of token desegregation), 2% to 6%, 6% to 10%, 10% to 20%, 20% to 30%, 30% to 50%, and 50% to 100%.17 We expect to see most of the response to conditional funding in the lower tail of the distribution, from token desegregation to slightly higher levels. Consistent with this idea, Table III shows that 64% of districts had less than 10% of blacks in desegregated schools in 1966, and over half had less than 6%. We calculate the fraction of blacks in desegregated schools using data from two sources. The number of black students in desegregated schools—any school enrolling at least one student of each race—was published by the Southern Education Reporting Service (SERS), an organization of Southern newspaper editors funded by the Ford Foundation. We estimate the total number of blacks in the district using current-year fall enrollment and percent black in enrollment in the early 1960s, both from the state administrative reports referenced above. Table III shows that, in the average district in our sample, roughly 18% of blacks were in desegregated schools in 1966, compared to less than 1% in 1964.18 SERS also recorded data on teacher desegregation, which 16. According to the 1966 guidelines, districts with transfer rates greater than zero but less than 4% needed to more than triple their level of desegregation activity. Ninety percent of districts in South Carolina—the one state with data available for 1965 that appears representative of the region—had transfer rates above zero but less than 3% in 1965, implying they would have had to transfer less than 10% of black students in 1966 to meet their targets. Half of South Carolina districts had 1965 transfer rates less than 1%, implying 1966 target transfer rates of around 3% would have been sufficient. 17. Although we modeled school board preferences as a positive function of segregation (Section II), we specify our dependent variables in terms of desegregation so that our empirical work matches the policy guidance. 18. Theoretically, a district could have a large share of black students in desegregated schools by moving only one white to a black school, but the guidelines did not contemplate such behavior, and we do not see it in the data. For example,
TABLE III
DESCRIPTIVE STATISTICS: MEASURES OF SEGREGATION

                                                            1964            1966
A. Segregation outcomes
  Percent of blacks in desegregated schools                  0.8            18.1
    (std. dev.)                                             (5.4)          (28.6)
  = 1 if % black students in desegregated schools is:
    Less than 2%                                            0.947           0.305
    At least 2% but less than 6%                            0.021           0.247
    At least 6% but less than 10%                           0.012           0.087
    At least 10% but less than 20%                          0.011           0.122
    At least 20% but less than 30%                          0.003           0.052
    At least 30% but less than 50%                          0.002           0.060
    50% or more                                             0.003           0.127
  Number of districts                                        905             916
  = 1 if any black teachers work with white teachers        0.000           0.725
  Number of districts                                        916             881
B. CRA compliance
  = 1 if funds deferred or terminated                   Not applicable       0.204
  = 1 if under court order                              Not applicable       0.096
  Number of districts                                   Not applicable       916

Notes. See notes to Table I for description of 1966 sample; 1964 sample is limited to districts in the 1966 sample. See Appendix II for a description of how the variables are constructed.
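The segregation measures summarized in Table III are assembled from the SERS counts and the state enrollment reports as described in the surrounding text. A hypothetical pandas sketch of that construction follows; the column names are invented for illustration, and the binning simply mirrors the categories listed in the table.

import pandas as pd

def build_outcomes(df):
    """df is assumed to contain, per district: blacks_in_deseg_schools (SERS),
    enr_1966 (fall enrollment), and pct_black_early (early-1960s % black)."""
    out = df.copy()
    # Estimated number of black students: current enrollment times the
    # early-1960s black share, as described in the text.
    out["black_enr_est"] = out["enr_1966"] * out["pct_black_early"] / 100.0
    # Percent of blacks in desegregated schools.
    out["pct_deseg"] = 100.0 * out["blacks_in_deseg_schools"] / out["black_enr_est"]
    # Indicator variables for the categories used as outcomes.
    bins = [0, 2, 6, 10, 20, 30, 50, float("inf")]
    labels = ["lt2", "2to6", "6to10", "10to20", "20to30", "30to50", "50plus"]
    cats = pd.cut(out["pct_deseg"], bins=bins, labels=labels, right=False)
    for lab in labels:
        out["deseg_" + lab] = (cats == lab).astype(int)
    return out

# Example: build_outcomes(districts_df) would add pct_deseg and one 0/1 column
# per Table III category (deseg_lt2 is the "token desegregation" indicator).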
was required on the extensive margin by the guidelines starting in 1966. By this point, nearly three-quarters of districts had at least one black teacher in the same school with a white teacher, compared to none two years prior. By 1966, DHEW reported that over 20% of districts had had their funds deferred or terminated, confirming that the law was not an empty threat. Conditional grants may have also reduced the average district’s likelihood of resisting desegregation, and therefore the chances that it would be sued and ultimately end up under court supervision. To investigate this, we gathered information on the type of plan submitted (court-ordered or voluntary) to comply with the CRA from a 1966 DHEW report. As shown in Table III, only 9.6% of Southern districts complied with the CRA via court order by the fall of 1966. This share rose over the years that followed (Figure I), and below, we investigate whether having more funding in 1967, in the average district, less than 0.04% of white students attended schools that were more than 90% black, and the maximum share of whites in schools that were more than 98% black was less than 1%. (Data from 1966 do not allow this calculation.)
on the line early (in 1966) slowed this trend. The data for this analysis come from comparable DHEW surveys of CRA compliance in later years. IV.C. Sample Our sample of school districts is drawn from the states of the former Confederacy, except Texas (which we had to exclude due to incomplete data): Alabama, Arkansas, Florida, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee, and Virginia. Examining the effects of financial incentives in border states, which also enforced a dual system before Brown, would be interesting; unfortunately, the data are less complete for these states. Because school districts both consolidate and split apart during our sample period, we use the state records referenced above to establish a history of reorganizations and aggregate the raw data to the largest unit to which a district was ever a party (see Online Appendix Section I for details). Of these “aggregated” districts, we exclude those for which desegregation was not relevant because they were one-race or nearly one-race.19 We also exclude districts that were automatically in compliance with the CRA in 1966 because they were supervised by a court in 1964 or were missing data. Our main estimation sample includes 916 districts comprising 84% of districts that were not one-race or nearly one-race (see Online Appendix Table I).20 V. THE EFFECTS OF CONDITIONAL FEDERAL FUNDING IN 1966 V.A. Student and Teacher Desegregation Table IV presents estimates of the effect of potential Title I funding per pupil on two measures of student segregation in 1966: the indicator for a “token” level of desegregation (less than 2% of black students attending desegregated schools) and the indicator for having moved just beyond that token level (2% to 6% of blacks in desegregated schools). As described above, for the average 19. In particular, we drop districts that were less than 3% or more than 97% black in the early 1960s. The cutoffs are arbitrary; results are not sensitive to using alternative cutoffs. 20. Using aggregated district as our unit of analysis, we assume that two districts that may have already split or not yet consolidated in any given year are behaving as one jurisdiction, potentially biasing our estimates downward. Consistent with this, our point estimates tend to be larger and no less precise when we restrict attention to districts that did not split or consolidate over the sample period (results available on request).
TABLE IV
THE EFFECT OF POTENTIAL TITLE I FUNDING ON STUDENT DESEGREGATION, 1966

= 1 if % black students in desegregated schools is:
                                 Less than 2%, 1966                 At least 2% but less than 6%, 1966
                             (1)      (2)      (3)      (4)        (5)       (6)       (7)       (8)
Mean of dependent variable            0.305                                  0.247

A. Two-stage least squares
Title I per pupil, 1966    −0.101∗∗ −0.0747∗ −0.0866∗ −0.0837∗   0.134∗∗∗  0.151∗∗∗  0.140∗∗∗  0.129∗∗∗
  (in hundreds of $2007)   (0.0483) (0.0442) (0.0449) (0.0449)   (0.0471)  (0.0432)  (0.0438)  (0.0434)
First stage partial F-stat
  for excluded instrument   306.2    408.5    433.0    400.1      306.2     408.5     433.0     400.1
RMSE                        0.370    0.370    0.364    0.365      0.419     0.420     0.403     0.402

B. Ordinary least squares
Title I per pupil, 1966    −0.0660∗ −0.0541  −0.0606∗ −0.0577     0.0256    0.0416    0.0531    0.0456
  (in hundreds of $2007)   (0.0384) (0.0365) (0.0352) (0.0354)   (0.0345)  (0.0326)  (0.0328)  (0.0321)
RMSE                        0.370    0.370    0.364    0.365      0.417     0.418     0.402     0.401
R²                          .375     .370     .401     .402       .094      .083      .168      .176

Controls:
State fixed effects           X        X        X        X          X         X         X         X
Early 1960s child poverty %:
  Dummies for 20 quantiles    X                                      X
  Restricted quantile effects(a)       X        X        X                    X         X         X
Early 1960s black enrollment %
  (decile dummies)                              X        X                              X         X
1948 Thurmond vote %
  (quintile dummies)                            X        X                              X         X
Ln early 1960s enrollment                       X        X                              X         X
Early 1960s exp. per pupil
  (quintile dummies)                            X        X                              X         X
1960 county characteristics(b)                           X                                        X
Number of districts          916      916      916      916        916       916       916       916

Notes. Each column in each panel gives results from a different regression. In Panel A, the instrument for Title I funding per pupil is simulated Title I funding per pupil (in hundreds of $2007); see text. The unit of observation is a school district; see text and Appendix II for descriptions of the sample. In addition to the controls listed, all models include as an explanatory variable an indicator for whether the district had any student desegregation in 1964. Standard errors, in parentheses, are clustered on county. (a) Dummies for the bottom nine deciles and the top two of the twenty quantiles in the first specification. (b) % with high school degree, % employed in agriculture, median family income ($2007), indicator for urban. ∗∗∗ p < .01. ∗∗ p < .05. ∗ p < .1.
Southern district, conditional federal funding was likely to have mattered most for student desegregation in this part of the distribution. The four specifications presented for each outcome differ in the choice of function to control for child poverty and the inclusion of additional preexisting district and county characteristics.21 All specifications include state fixed effects. Note that the first stage relationship between the actual and simulated per-pupil Title I grants is strong, with a partial F-statistic on the excluded instrument of over 300 across specifications (see Appendix I.B). Because the same specifications are shown below for other outcomes, we discuss them here in some detail. We begin by estimating θ in equation (1 ), controlling flexibly for the stateand district-specific components of the simulated per-pupil Title I grant but omitting other preexisting district characteristics. The first specification, shown in columns (1) and (5), includes state fixed effects and dummies for twenty quantiles of the district preprogram child poverty rate. To improve the precision of our estimates moving forward, all subsequent models include a more parsimonious set of quantile dummies for the child poverty rate (“restricted” quantiles).22 We first show the more parsimonious model without additional controls (columns (2) and (6)). We then add controls that capture preferences and components of the budget constraint (percentage of votes cast for Thurmond in 1948 and early 1960s expenditure per pupil, both using quintile indicators, and early 1960s percent black in enrollment, using decile indicators) and the litigation threat (log of early 1960s enrollment) (columns (3) and (7)). The final specification adds the other socioeconomic indicators available at the county level in the 1960 Census (columns (4) and (8)). The TSLS estimates, shown in Panel A, suggest that the requirement that districts meet desegregation targets to receive 21. Here and in all tables below, standard errors are clustered on counties because some of our control variables vary at the county level. Note that our data are a cross section, and the state fixed effects will account for any unobservable state-specific component of the error term. Of course, the error terms may still be correlated across districts within a state. When we cluster standard errors on state and use the critical values from the t-distribution with eight degrees of freedom to establish statistical significance (following Monte Carlo simulations done in Cameron, Gelbach, and Miller [2007]), the statistical significance of our key results is largely unchanged. All estimates give each district equal weight; weighting by early 1960s black enrollment yields similar results (available from the authors upon request). 22. In the “restricted quantile model,” we retain dummies for the top two (of twenty) quantiles from the first specification, but replace the rest of the quantile indicators with decile indicators. Estimates with the full set of poverty dummies from the first specification tend to be similar in magnitude, but less precise.
their federal funds did affect behavior, shifting districts from tokenism to somewhat more meaningful desegregation. In the specification with the full set of controls, the TSLS estimates imply that a hundred-dollar increase in Title I funding per pupil was associated with a 12.9 percentage-point increase in the likelihood of having 2% to 6% of blacks in desegregated schools (column (8)) and an 8.4 percentage-point decline in the likelihood of having less than 2% of blacks in desegregated schools (column (4)). We cannot reject the hypothesis that these coefficients are equal in magnitude but opposite in sign. By comparison, the OLS estimates, shown in Panel B, are the same sign, but smaller in magnitude and mostly not statistically significant. The TSLS estimates may be larger than the OLS estimates because districts that desegregated less experienced larger enrollment declines, or because the denominator of pptidj1966 is measured with error. For all outcomes discussed below, this general pattern of differences between the OLS and TSLS estimates persists. If our instrumental variables approach is valid, the inclusion of district and county characteristics should not substantively change our point estimates. Comparison of the TSLS estimates across specifications suggests that this is the case. The coefficients on the controls (not shown) are generally in line with our expectations.23 Notably, however, the controls for per-pupil spending in the early 1960s—the one district characteristic strongly correlated with the instrument (Table II)—do not significantly improve the fit of the model. Furthermore, across specifications, the coefficients on the poverty dummies indicate that higher poverty districts desegregated less, all else equal. The direction of this correlation works against finding any effect of financial incentives on desegregation. Our empirical approach thus tests whether the relationship between poverty and desegregation was less negative in higher spending states, where districts had larger grants holding poverty constant. Panel A of Table V presents TSLS estimates of the effect of potential Title I funding per pupil on the full distribution of student segregation in 1966 based on the specification with the most complete set of controls. Columns (1) and (2) repeat columns (4) and (8) of Table IV. The estimated effects of Title I funding 23. For example, consistent with the findings of Cascio et al. (2008), districts with higher early 1960s black enrollment were significantly more likely to have engaged in only token desegregation in 1966.
TABLE V
TSLS ESTIMATES OF THE EFFECT OF POTENTIAL TITLE I FUNDING IN 1966 ON THE DISTRIBUTION OF STUDENT DESEGREGATION, 1964 AND 1966

= 1 if % black students in desegregated schools is:                                                    % of blacks in
                           <2%      2 to <6%  6 to <10%  10 to <20%  20 to <30%  30 to <50%   50%+     desegregated schools
                           (1)        (2)       (3)         (4)         (5)         (6)        (7)          (8)

A. 1966
Mean of dependent variable 0.305      0.247     0.0873      0.122       0.0524      0.0600     0.127        18.1
Title I per pupil, 1966   −0.0837∗    0.129∗∗∗  0.00154    −0.00139    −0.0210     −0.00794   −0.0168       0.951
  (in hundreds of $2007)  (0.0449)   (0.0434)  (0.0227)    (0.0338)    (0.0203)    (0.0311)   (0.0201)     (1.480)
RMSE                       0.365      0.402     0.278       0.317       0.218       0.231      0.244        0.178
Number of districts        916        916       916         916         916         916        916          916

B. 1964
Mean of dependent variable 0.947      0.0210    0.0122      0.0110      0.00331     0.00221    0.00331      0.8
Title I per pupil, 1966    0.00244    0.00523  −0.00600    −0.00434     0.00539     0.00192   −0.00464      0.247
  (in hundreds of $2007)  (0.0140)   (0.00958) (0.00653)   (0.00777)   (0.00364)   (0.00193)  (0.00414)    (0.300)
RMSE                       0.197      0.138     0.106       0.103       0.0576      0.0469     0.0569       0.0509
Number of districts        905        905       905         905         905         905        905          905

Notes. Each column in each panel gives results from a different TSLS regression. The instrument for Title I funding per pupil is simulated Title I funding per pupil (in hundreds of $2007); see text. The unit of observation is a school district; see text and Appendix II for descriptions of the sample. The specification is the same as that shown in columns (4) and (8) of Table IV. The partial F-statistic on the excluded instrument in the underlying first-stage regressions is 400.1 in both Panels A and B. Standard errors, in parentheses, are clustered on county. ∗∗∗ p < .01. ∗∗ p < .05. ∗ p < .1.
per pupil are small and statistically insignificant for the rest of the distribution—bins at or above 6% (columns (3) through (7)). Redefining the dependent variables as indicators for each two percentage-point bin of student desegregation over its entire support also shows that conditional federal funding significantly affected behavior only in the lower tail of the distribution, as we would expect if districts were desegregating just enough to be in compliance (see Online Appendix Figure I). The final column of Table V shows that the effect of Title I funding on the average percentage of blacks in desegregated schools is positive, but small and not statistically significant.24 This suggests that although marginal financial incentives induced changes on the regulated margin, they do not account for the large overall reduction in the share of blacks in desegregated schools by 1966 shown in Figure I. A substantial minority of districts contributed to this overall decline by desegregating more than required by the guidelines, consistent with pressure to desegregate mounting from multiple sources during this period; for example, other provisions of the CRA increased the threat of litigation and the Voting Rights Act came into effect. It is therefore not surprising that some districts were inframarginal with respect to the financial incentives.25 School boards may have even used these policy changes as political cover to desegregate more than required to receive federal funding, to take advantage of economies of scale, for example.26 Consistent with the framework presented in Section II, districts that clearly exceeded the guidelines’ requirements appear to have faced higher costs of maintaining segregation and to have had weaker preferences for segregation. For example, consider the 13% of districts with more than half of blacks in desegregated 24. The estimates in the first two columns suggest that a $100 larger grant per pupil shifted about 8% of districts from less than 2% to 2% to 6% of blacks in desegregated schools; using the mid-points of the ranges, this amounts to an increase in the percent of blacks in desegregated schools of about 0.2 ((4 − 1) × 0.08) percentage points. The estimate in column (8) is not precise enough to pick up such an effect. 25. Similarly, districts that already had more than 2% of blacks in desegregated schools in 1964 were unlikely to have responded to the conditional funding on this margin, and our estimates are essentially unchanged when we omit such districts from our estimation sample (results available on request). 26. This is similar to Heckman and Payner’s (1989) suggestion that South Carolina manufacturers “seized on the new federal legislation and decrees to do what they wanted to do anyway” (p. 174). Put another way, some school boards may have been above their “optimal” level of segregation prior to the CRA but felt constrained (for example, by a vocal and politically active minority of whites) to maintain a segregated school system.
schools in 1966 (first row of Table V), only two of which were under court order. These districts tended to be relatively small and were therefore more likely to benefit from economies of scale when desegregating. Seventy-seven percent were in the bottom two deciles of black share (less than 17.6% black), so that increases in the share of blacks in desegregated schools would have translated into smaller increases in whites’ exposure to blacks and would have therefore been less costly. Eighty-two percent of these districts were in the bottom two quintiles of Thurmond vote share. Although conditional federal funding did not yield large effects on blacks’ overall exposure to whites in 1966, it did move districts across the regulated margin—just beyond tokenism. The magnitude of our estimates for this margin might be interpreted in two different ways. First, a simple rescaling of the key TSLS coefficients suggests that the average Southern district would have required $1,200 per pupil in 1966 (in 2007 dollars) to move beyond token desegregation.27 This suggests a substantial willingness to pay for segregated schools, equal to over 70% of the average perpupil budget in the South in the early 1960s (Table I). This estimated willingness to pay for segregation is similar to previous estimates for the South based on preferences revealed through the housing market, both historically (Clotfelter 1975) and in more recent data (Kane, Riegg, and Staiger 2006).28 Second, our estimates suggest that conditional grants account for about 36% of the shift away from token desegregation between 1964 and 1966.29 As noted above, receipt of federal funding at this time rested not only on meeting threshold levels of student desegregation, 27. Because of the sizable uncertainty surrounding future DHEW policy guidelines for CRA compliance, the size of future Title I grants, and what the courts would require in future years, we interpret our results as identifying the effect of one year’s potential grant amount on one year’s segregation policy, rather than as the effect of an expected stream of future payments. 28. Our estimate from the model with all controls is consistent with house prices in a school district with “just enough” (2% to 6% of blacks in desegregated schools) desegregation being about 1.6% lower compared to those in a district with token desegregation. (See Online Appendix Section III.) Similarly, using data from Atlanta, Clotfelter (1975) found that a three percentage-point increase in black enrollment share in the assigned high school (roughly comparable to the change in black share on the margin we examine) was associated with a 1.4% decline in house prices between 1960 and 1970. Investigating willingness to pay for school segregation in Mecklenburg County, North Carolina in the 1990s, Kane, Riegg, and Staiger (2006) find that a three percentage-point increase in the percent of black students at the assigned high school was associated with a 1.3% decline in house prices. 29. We use our estimates to arrive at this figure by comparing the likelihood of having zero to 2% of blacks in desegregated schools with the average potential Title I grant as opposed to no such grant.
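The $1,200 figure, the 72% share of early-1960s spending, and the 36% share in footnote 29 follow from simple rescalings of numbers reported in Tables I, III, and IV. The arithmetic below is a reader's approximate reconstruction, under the assumption that the column (4) coefficient is the one being rescaled; rounding may differ slightly from the authors' exact calculation.

# Inputs reported in Tables I, III, and IV (full-controls specification).
theta_token = -0.0837        # TSLS effect on P(<2% of blacks desegregated) per $100
avg_grant_hundreds = 2.77    # average potential Title I grant, $277 per pupil (Table I)
p_token_1964, p_token_1966 = 0.947, 0.305   # token-desegregation shares (Table III)
spending_early_1960s = 1671  # average per-pupil expenditure, early 1960s ($2007)

# (1) Grant at which the linear-probability estimate implies a sure move
# beyond token desegregation: roughly $1,200 per pupil.
required_per_pupil = 100 / abs(theta_token)                   # ~ 1195
share_of_budget = required_per_pupil / spending_early_1960s   # ~ 0.72

# (2) Share of the 1964-1966 decline in token desegregation attributable to
# the average potential grant (the comparison described in footnote 29).
effect_of_avg_grant = abs(theta_token) * avg_grant_hundreds   # ~ 0.23
share_explained = effect_of_avg_grant / (p_token_1964 - p_token_1966)  # ~ 0.36

print(round(required_per_pupil), round(share_of_budget, 2), round(share_explained, 2))

A parallel calculation with the Table VI teacher-desegregation coefficient (0.0673) gives roughly $1,500 per pupil and about a quarter of the 1964 to 1966 rise in teacher desegregation, matching the figures cited below.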
but also on desegregating teaching faculties. If conditional funding mattered, we would therefore expect that districts with more funding at risk were more likely to have desegregated faculties. Table VI shows the results of estimating the same models as in Table IV with an indicator equal to one if the district had any black teachers on faculties with white teachers as the dependent variable. The TSLS estimates are stable and positive across the four specifications. Our preferred point estimate is significant at the 11% level, suggesting that all else equal, each additional $100 in potential per-pupil Title I grant increased the probability that a district’s teaching faculty would be desegregated by 6.7 percentage points. This estimate implies a willingness to pay to avoid teacher desegregation ($1,500 per pupil) similar to that to avoid moving beyond token desegregation of students and suggests that conditional federal funding explains roughly a quarter of the rise in teacher desegregation between 1964 and 1966.30 If our empirical strategy has uncovered the causal effects of conditional funding on desegregation, we should find no relationship between potential Title I funding and desegregation prior to the program’s introduction. Indeed, we find no significant relationship between Title I funds at risk and student desegregation across its entire distribution in 1964, as shown in Panel B of Table V.31 However, because all but the first bin were nearly empty in 1964, a stronger test examines whether the potential grant predicted whether a district had any black students in school with whites in 1964. This test reveals that districts with larger perpupil Title I grants were less inclined to have desegregated at all by 1964, but insignificantly so, if anything likely biasing against the effects we find.32 V.B. Court Supervision The results above show that school districts with more federal funding on the line were more likely to meet the DHEW’s 30. We also find a negative but statistically insignificant impact of potential Title I funding on the likelihood that a district had its federal funding deferred or terminated (see Online Appendix Table II). We attribute the relative weakness of this finding to the incomplete and uncertain nature of enforcement, discussed in Section IV. 31. Unsurprisingly, we find similar TSLS point estimates, but with larger standard errors, when we estimate a model regressing 1964 to 1966 changes in desegregation indicators on the per-pupil Title I grant. 32. See Online Appendix Table II. We cannot perform a similar robustness check for teacher desegregation because no school districts in our estimation sample had any teacher desegregation in 1964 (see Table II).
TABLE VI
THE EFFECT OF POTENTIAL TITLE I FUNDING ON TEACHER DESEGREGATION AND COURT SUPERVISION, 1966

                             = 1 if any black teachers in same school      = 1 if under court order, 1966
                             as white teachers, 1966
                             (1)      (2)      (3)      (4)            (5)        (6)        (7)        (8)
Mean of dependent variable            0.725                                       0.0961

A. Two-stage least squares
Title I per pupil, 1966     0.0633   0.0694   0.0645   0.0673        −0.0552∗   −0.0627∗∗  −0.0643∗∗  −0.0658∗∗
  (in hundreds of $2007)   (0.0462) (0.0422) (0.0432) (0.0421)       (0.0288)   (0.0265)   (0.0273)   (0.0267)
First stage partial F-stat
  for excluded instrument   289.2    382.7    409.1    383.0          306.2      408.5      433.0      400.1
RMSE                        0.371    0.370    0.368    0.368          0.255      0.255      0.248      0.248

B. Ordinary least squares
Title I per pupil, 1966    −0.0138  −0.00502  0.00277  0.00570       −0.0324    −0.0394    −0.0263    −0.0280
  (in hundreds of $2007)   (0.0382) (0.0357) (0.0363) (0.0355)       (0.0188)   (0.0177)   (0.0184)   (0.0182)
RMSE                        0.369    0.369    0.368    0.368          0.255      0.255      0.248      0.247
R²                          .339     .334     .352     .356           .275       .269       .324       .329

Controls:
State fixed effects           X        X        X        X              X          X          X          X
Early 1960s child poverty %:
  Dummies for 20 quantiles    X                                          X
  Restricted quantile effects(a)       X        X        X                          X          X          X
Early 1960s black enrollment %
  (decile dummies)                              X        X                                     X          X
1948 Thurmond vote %
  (quintile dummies)                            X        X                                     X          X
Ln early 1960s enrollment                       X        X                                     X          X
Early 1960s exp. per pupil
  (quintile dummies)                            X        X                                     X          X
1960 county characteristics(b)                           X                                                X
Number of districts          881      881      881      881            916        916        916        916

Notes. Each column in each panel gives results from a different regression. In Panel A, the instrument for Title I funding per pupil is simulated Title I funding per pupil (in hundreds of $2007); see text. The unit of observation is a school district; see text and Appendix II for descriptions of the sample. In addition to the controls listed, all models also include as an explanatory variable an indicator for whether the district had any student desegregation in 1964. Standard errors, in parentheses, are clustered on county. (a) Dummies for the bottom nine deciles and the top two of the twenty quantiles in the first specification. (b) % with high school degree, % employed in agriculture, median family income ($2007), indicator for urban. ∗∗∗ p < .01. ∗∗ p < .05. ∗ p < .1.
desegregation requirements. In 1966, DHEW guidelines required at least as much desegregation as the typical court-ordered plan.33 By increasing the probability of meeting DHEW requirements, we expect that larger grants would have made districts less likely to become targets of litigation. Table VI shows that conditional Title I funding indeed reduced the probability of being under court order in 1966. The coefficient changes little in magnitude across specifications and implies that each additional $100 in per-pupil Title I funding reduced the probability of being under court order by 6.6 percentage points (column (4)). The fact that districts with larger grants were no more likely to have been under court order in 1964—prior to the introduction of the program—again helps rule out the possibility that our identifying variation in grants was correlated with unobserved tastes for segregation (see Online Appendix Table II).34 These findings suggest that the CRA and ESEA reduced the burden of school desegregation on federal courts. Our estimates imply that without conditional Title I funding, nearly 28% of Southern districts not already under court order by 1964 would have required court supervision to achieve the observed shift away from token desegregation between 1964 and 1966—triple the actual rate of court supervision. We show below that conditional funding continued to reduce the courts’ burden through 1970. VI. LONG-RUN EFFECTS OF CONDITIONAL FUNDING The results so far show that conditional federal funding mattered for segregation policy choices but did not substantially increase black exposure to whites by 1966. The dual system of education in the South was not eliminated—and dramatic increases in black exposure to whites not achieved—until after 1966, perhaps diminishing the historical importance of the CRA and ESEA relative to court-ordered plans. 33. Comprehensive data on the specific requirements of court-ordered plans are not available, but as discussed below, desegregation requirements were typically strengthened first by DHEW and then the courts during the Johnson Administration. The median court-ordered district had only 2.5% of blacks in desegregated schools in 1966, compared to 5.6% for the median district not under court order. In December 1966, the Fifth Circuit noted that “The announcement in HEW regulations that the Commissioner would accept a final school desegregation order as proof of the school’s eligibility for federal aid prompted a number of schools to seek refuge in the federal courts. Many of these had not moved an inch toward desegregation” (United States v. Jefferson County Board of Education (372 F.2d 836), 1966). Orfield (1969, 2000) makes a similar point. 34. For this robustness check, we add back into our estimation sample districts that were court supervised in 1964. Our finding therefore also implies that our baseline estimates are not biased by sample selection.
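The "nearly 28%" counterfactual and the "triple the actual rate" comparison cited above can be reconstructed, approximately, from the Table I average grant, the Table III court-order rate, and the Table VI court-order coefficient. The snippet below is a reader's check under that reading of the calculation, not the authors' exact computation.

# Inputs from Tables I, III, and VI (full-controls specification).
effect_per_100 = -0.0658        # change in P(under court order, 1966) per $100 per pupil
avg_grant_hundreds = 2.77       # average potential grant, $277 per pupil
actual_court_rate_1966 = 0.096  # share of sample districts under court order, 1966

# Counterfactual: remove the grant's estimated (negative) effect from the
# observed court-order rate.
counterfactual_rate = actual_court_rate_1966 - effect_per_100 * avg_grant_hundreds
print(round(counterfactual_rate, 3))                             # ~ 0.278, i.e. nearly 28%
print(round(counterfactual_rate / actual_court_rate_1966, 1))    # ~ 2.9, roughly triple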
On the other hand, any assessment of the full impact of the CRA, rather than the marginal effects of conditional funding, must emphasize how it changed the role of the courts. Prior to the CRA, few Southern districts had been sued and even those under court order had made little progress. Initially, the DHEW chose relatively weak guidelines to avoid conflicts with existing court orders. During the remaining years of the Johnson administration, the DHEW strengthened its standards in advance of the courts. The guidelines then helped courts coordinate on more consistent remedies and gave them cover to adopt more stringent requirements. The landmark 1966 Fifth Circuit opinion in Jefferson noted that “the HEW Guidelines offer, for the first time, the prospect that the transition from a de jure segregated dual system to a unitary integrated system may be carried out effectively, promptly, and in an orderly manner.” Back-and-forth among the judiciary, executive, and legislature and the resulting case law laid the foundation for later Supreme Court decisions. For example, in the landmark 1968 Green35 decision, the Court drew directly from the 1968 DHEW guidelines, which required school boards to adopt plans so that “there are no Negro or other minority group schools and no white schools—just schools.” In this way, the CRA and ESEA may have indirectly contributed to later and more dramatic reductions in school segregation. That the guidelines became more stringent over time suggests that conditional federal funding may have had direct impacts on policy margins that did matter for black exposure to whites. Unfortunately, we are unable to estimate any such impacts, because a change to the Title I funding formula in 1967 eliminated the cross-state variation in grants per poor child central to our identification strategy.36 However, we can ask whether having a bigger grant in 1966 affected outcomes in later years. That is, did districts with more conditional federal funding early—which we emphasize is not a proxy for a continued stream of bigger grants—follow a permanently different desegregation trajectory? Table VII examines this possibility, showing TSLS estimates from the preferred specification (with full controls) for segregation and court supervision outcomes for several years between 1968 and 1976. We now measure segregation using the dissimilarity index, which captures the margins of desegregation relevant in 35. Green v. New Kent County (391 U.S. 430) 36. Similarly, it would be interesting to examine the extensive margin of student desegregation in 1965, which was all the guidelines required at the time, but we have district-level data for only two states for 1965.
TABLE VII
TSLS ESTIMATES OF THE EFFECT OF POTENTIAL TITLE I FUNDING IN 1966 ON LONG-TERM OUTCOMES

                                          Dissimilarity index                    = 1 if under court order
                                    1968     1970     1972     1976       1968       1970       1972      1976
                                     (1)      (2)      (3)      (4)        (5)        (6)        (7)       (8)
Mean of dependent variable          0.693    0.247    0.214    0.204      0.304      0.327      0.516     0.553
Title I per pupil, 1966
  (in hundreds of $2007)           0.0106  −0.00760  −0.0196  −0.0193   −0.0907∗∗  −0.0963∗∗∗  −0.0578   −0.0203
                                  (0.0184) (0.0173)  (0.0143) (0.0149)  (0.0369)   (0.0368)   (0.0368)  (0.0429)
First-stage partial F-stat
  for excluded instrument           453.7    335.3    339.3    318.1      453.7      335.3      339.3     318.1
RMSE                                0.180    0.154    0.129    0.123      0.310      0.346      0.377     0.401
Number of districts                  914      993      996      916        914        993        996       916

Notes. Each column gives results from a different TSLS regression. The instrument for Title I funding per pupil is simulated Title I funding per pupil (in hundreds of $2007); see text. The unit of observation is a school district. For columns (5)–(8), the sample is limited to districts for which the dissimilarity index is observed in the same year. The specification is the same as that shown in columns (4) and (8) of Table IV. Districts that experienced a boundary change between 1969 and 1976 are excluded from the analysis. Standard errors, in parentheses, are clustered on county. ∗∗∗ p < .01, ∗∗ p < .05, ∗ p < .1.
The estimates suggest that districts with larger 1966 grants were no more desegregated by this measure in 1968, 1970, 1972, or 1976 (columns (1)–(4)). However, through 1970, districts with more conditional federal funding in 1966 were less likely to require a court order to achieve the higher levels of desegregation shown in the first row of Table VII. The coefficients indicate that each additional $100 of conditional funding reduced the probability of court supervision by nine percentage points in 1968 and ten percentage points in 1970 (columns (5) and (6)). Nevertheless, the power of early conditional funding to promote voluntary desegregation faded over time; by 1972, the estimated effect of Title I funding was roughly half of its magnitude in 1970 and no longer statistically significant.

That the importance of courts increased and the role of conditional funding diminished in the early 1970s is not necessarily surprising. As part of the "Southern Strategy," the Nixon administration stopped enforcing the fund-withholding provisions of the CRA, eliminating the potential for marginal financial incentives to matter starting in 1969 (Halpern 1995; Orfield 2000). The 1971 Supreme Court decision in Swann38 also strengthened the desegregation requirements for districts under court supervision and specifically sanctioned the use of busing to achieve racial balance. The rate of court supervision increased substantially after 1970 (Figure I and first row of Table VII). The Swann standard and court supervision more generally were particularly important in desegregating larger school districts (Reber 2005; Cascio et al. 2008).

38. Swann v. Charlotte-Mecklenburg (401 U.S. 1).
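For concreteness, the block below is a minimal sketch of the kind of TSLS regression reported in Table VII, written with the linearmodels package on a hypothetical district-level data set. The file name, all variable names, and the control set are placeholders standing in for the full specification described in the text, not the paper's actual data or code.

```python
import pandas as pd
from linearmodels.iv import IV2SLS

# Hypothetical district-level data; the file and all variable names are
# placeholders, not the paper's actual data set.
df = pd.read_csv("districts_long_run.csv")

# One Table VII-style column: a long-run outcome (here, the 1970 dissimilarity
# index) regressed on 1966 Title I funding per pupil, instrumented by simulated
# funding per pupil, with state fixed effects and a stand-in set of controls.
controls = ["pct_black_pre", "thurmond_1948", "ln_enroll_pre", "exp_pp_pre"]
formula = (
    "dissimilarity_1970 ~ 1 + C(state) + "
    + " + ".join(controls)
    + " + [ppti_1966 ~ ppti_sim_1966]"
)

# Standard errors clustered on county, as in the table notes.
res = IV2SLS.from_formula(formula, data=df).fit(
    cov_type="clustered", clusters=df["county"]
)
print(res.summary)
print(res.first_stage)  # includes the partial F-statistic for the instrument
```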
VII. CONCLUSIONS

Today, the federal government uses conditional grants—as complements or substitutes for other policy instruments—in a variety of contexts. This paper shows that making receipt of the substantial new federal funds offered through Title I of the ESEA contingent on nondiscrimination under the CRA played a role historically in desegregating Southern schools. Districts with more federal funding on the line were more likely to change from behavior that would clearly have been out of compliance with the CRA in 1966—having less than 2% of their black students in desegregated schools—to behavior for which most districts would have been judged compliant—having 2% to 6% of their black students in desegregated schools. The CRA and ESEA also contributed to faculty desegregation and reduced the burden that desegregation had long placed on the courts: districts with larger conditional grants in 1966 were less likely to be under court supervision—but were no less desegregated—through 1970.

Although the extent of desegregation directly induced by conditional funding in 1966 was small compared to what court-ordered plans would achieve in later years, the desegregation that the ESEA and CRA induced appears to have been on a margin that whites cared about, as evidenced by Southern school boards' high willingness to pay to avoid it. The policies also represented a historic break from the past and from the decade of inaction following Brown, giving the courts the much-needed backing of the executive and legislative branches for their interventions in the years that followed.
APPENDIX I.A
THE STATE GRANT COMPONENT OF TITLE I FUNDING, 1966

State              stategrant_{1966} ($2007)    Number of districts in 1966 estimation sample
Alabama                      798                                    83
Arkansas                     871                                   121
Florida                    1,161                                    56
Georgia                      906                                   146
Louisiana                    891                                    59
Mississippi                  578                                    99
North Carolina               862                                   126
South Carolina               647                                    86
Tennessee                    840                                    67
Virginia                     844                                    73
Total                                                              916
Notes: See Section III and Appendix II for a description of the Title I funding formula and the text and Appendix II for a description of the estimation sample.
APPENDIX I.B
FIRST-STAGE REGRESSIONS: ALL SPECIFICATIONS, 1966 ESTIMATION SAMPLE

Dependent variable: per-pupil Title I grant, 1966

                                               (1)         (2)         (3)         (4)
Simulated per-pupil Title I grant, 1966     1.129∗∗∗    1.122∗∗∗    1.127∗∗∗    1.125∗∗∗
                                           (0.0645)    (0.0555)    (0.0542)    (0.0563)
R²                                            .973        .972        .974        .974
Partial F-stat for excluded instrument       306.2       408.5       433.0       400.1
Number of districts                           916         916         916         916

Controls (included in varying combinations across columns (1)–(4)): state fixed effects; early 1960s child poverty % (dummies for 20 quantiles, or restricted quantile effects (note a)); early 1960s black enrollment % (decile dummies); 1948 Thurmond vote % (quintile dummies); ln early 1960s enrollment; early 1960s exp. per pupil (quintile dummies); 1960 county characteristics (note b).
Notes: Each column in each panel gives results from a different regression. Both the simulated and actual per-pupil Title I grants are in hundreds of $2007. The unit of observation is a school district; see text and Appendix II for descriptions of the 1966 estimation sample. In addition to the controls listed, all models include as an explanatory variable an indicator for whether the district had any student desegregation in 1964. Standard errors, in parentheses, are clustered on county. a Dummies for nine deciles and the top two of the twenty quantiles in the first specification. b % with high school degree, % employed in agriculture, median family income ($2007), indicator for urban. ∗∗∗ p < .01, ∗∗ p < .05, ∗ p < .1.
APPENDIX II: DATA APPENDIX

A. Data on Title I Funding and Child Poverty

Title I funding allocations were made at the county level. States then allocated grants to districts within each county. We do not know the data sources used for this purpose, but we do observe district-level Title I allocations in the first year of the program, 1965–1966. Using these data, we estimated district-level Title I allocations for 1966–1967, assuming that a district was entitled to a constant share of its county allocation. That is, we defined a district's 1966–1967 allocation (potential grant) as the share of the county-level allocation that it received in 1965–1966 times its 1966–1967 county-level allocation.
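Written as a formula (the notation alloc and c(d) is introduced here only for exposition), the imputation is
\[
\mathit{alloc}_{d,\,1966\text{--}67} \;=\; \frac{\mathit{alloc}_{d,\,1965\text{--}66}}{\mathit{alloc}_{c(d),\,1965\text{--}66}} \times \mathit{alloc}_{c(d),\,1966\text{--}67},
\]
where c(d) denotes the county containing district d.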
Data on 1965–1966 district-level Title I allocations and 1966–1967 county-level Title I allocations were entered from U.S. Senate (1967).

U.S. Senate (1965, 1967) give county-level counts of five- to seventeen-year-olds eligible for Title I in 1965–1966 and 1966–1967, respectively. By 1966–1967, there were five categories of eligibles: (1) children in families with incomes less than $2,000 in 1960 (poor_{d1960}); (2) children in families receiving AFDC in excess of $2,000; (3) delinquent children; (4) neglected children; and (5) children in foster homes. We estimated district-level counts of Title I eligibles for 1965–1966 and 1966–1967 (eligibles_{d1965} and eligibles_{d1966}, respectively) as the number of county-level eligibles in the relevant year times the share of the county Title I allocation received by the district in 1965–1966 (see above). In 1965–1966, only counts under categories (1) and (2) were relevant, and these were based entirely on data collected prior to the introduction of Title I. In 1966–1967, only category (1) was based on prior data. For this reason, we calculate ppti^{SIM}_{d1966} using (predetermined) eligibles_{d1965} and (endogenous) ppti_{d1966} using eligibles_{d1966}:
\[
\mathit{ppti}^{\mathit{SIM}}_{d1966} \;\equiv\; \frac{\mathit{stategrant}_{j1966}\,\mathit{eligibles}_{d1965}}{\mathit{enr}_{d,\mathit{pre}}}
\qquad \text{and} \qquad
\mathit{ppti}_{d1966} \;\equiv\; \frac{\mathit{stategrant}_{j1966}\,\mathit{eligibles}_{d1966}}{\mathit{enr}_{d1966}}.^{39}
\]
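A minimal sketch of this construction is below; the data frame, its values, and the scaling into hundreds of $2007 follow the definitions above, but the column names and numbers are invented for illustration.

```python
import pandas as pd

# Minimal sketch of the per-pupil grant construction above. The data frame and
# its values are invented for illustration; column names are hypothetical.
df = pd.DataFrame({
    "stategrant_1966": [798.0, 871.0],   # state grant per eligible child ($2007)
    "eligibles_1965": [1200, 800],       # predetermined district eligibles
    "eligibles_1966": [1210, 805],       # 1966-67 district eligibles
    "enr_pre": [5000, 3500],             # preprogram average fall enrollment
    "enr_1966": [5100, 3600],            # fall 1966 enrollment
})

# Simulated (instrument) and actual per-pupil Title I grants, hundreds of $2007.
df["ppti_sim_1966"] = df["stategrant_1966"] * df["eligibles_1965"] / df["enr_pre"] / 100
df["ppti_1966"] = df["stategrant_1966"] * df["eligibles_1966"] / df["enr_1966"] / 100
print(df[["ppti_sim_1966", "ppti_1966"]].round(3))
```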
In practice, this choice makes little difference in the numerators of the actual and simulated Title I grants per pupil, as in nearly all Southern counties, eligibles_{d1966} ≈ eligibles_{d1965} ≈ poor_{d1960}.40 We then define the preprogram child poverty rate as eligibles_{d1965}/enr_{d,pre}; we refer to it as poor_{d1960}/enr_{d,pre} in the text for ease of explanation.

B. Data on Other District- and County-Level Covariates

District-level data on total enrollment, enrollment by race, and current expenditure prior to 1964 were entered from annual reports of state departments or superintendents of education.41

39. Notice that the numerator of ppti_{d1966} is equivalent to the estimated district-level grant for 1966–1967 described in the first paragraph of this section.

40. In our estimation sample, the average, median, and minimum ratios of poor_{d1960} to eligibles_{d1966} are 0.984, 0.995, and 0.865, respectively.

41. Alabama Department of Education (various years), Arkansas Department of Education (various years), Florida State Superintendent of Public Instruction (various years), Georgia State Department of Education (various years), Mississippi State Department of Education (various years), North Carolina Education Association (various years), South Carolina State Department of Education (various years), State Department of Education of Louisiana (various years), Tennessee Department of Education (various years), Virginia State Board of Education (various years).
Fall 1966 enrollment (enr_{d1966}) was drawn from the same source. Preprogram enrollment (enr_{d,pre}) is average fall enrollment based on data from all years reported between 1961 and 1963. Preprogram per-pupil current expenditure and preprogram percent black in enrollment are these variables averaged across all years reported between 1961 and 1963.42 For states where enrollment by race was not reported, we estimated preprogram percent black using either district-level data from SERS (1964, 1967) (for North Carolina), county-level data on the racial breakdown of the five- to seventeen-year-old population from special tabulations of the 1960 Census (for Florida, where district boundaries correspond to counties), or school-level data on enrollment by race for 1967 from U.S. DHEW (1969) (for Arkansas).

The county-level percentage of votes cast for Strom Thurmond in the 1948 presidential election was drawn from ICPSR Study No. 8611 (Clubb, Flanigan, and Zingale 2006). Data on 1960 median family income, share with a high school degree, share employed in agriculture, and urban status at the county level were drawn from ICPSR Study No. 7736 (U.S. Department of Commerce 1999).

C. Data on Desegregation Outcomes, 1966 and Later

District-level data on the number of blacks in desegregated public schools and the presence of any teacher desegregation for fall 1966 were entered from SERS (1967). Most data were from computer printouts provided by the Office of Education from its first survey of Southern school desegregation. The survey response rate was 80%; SERS was able to fill in data for some missing districts. For districts listed, we set the student desegregation indicator equal to one if any blacks were reported to be in school with whites. We estimated the total number of blacks in the district (not reported) as fall 1966 enrollment times the preprogram fraction black. We then used this measure along with the number of blacks in desegregated schools to construct the percent of blacks in desegregated schools.

42. We use the enrollment measure most consistently reported within the state over time. All states except Arkansas, Georgia, and North Carolina report fall enrollment or registration or average daily membership. To make these states' enrollment figures more comparable to those for other states, we multiply the enrollment concept reported (average daily attendance, or ADA) by the statewide average ratio of fall enrollment to ADA reported in U.S. DHEW (1967).
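Written out, with notation introduced here only for exposition and the district's total black enrollment proxied as just described,
\[
\text{pct. blacks desegregated}_{d,1966} \;=\; 100 \times \frac{\text{blacks in desegregated schools}_{d,1966}}{\mathit{enr}_{d1966} \times \text{preprogram fraction black}_{d}}.
\]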
Using the information on teacher desegregation, we also constructed an indicator equal to one if any blacks taught on desegregated faculties.

Of the 995 districts meeting all other data requirements for the analysis, 131 did not have SERS data sufficient to directly calculate the percent of blacks in desegregated schools for fall 1966. Of these, 39 had their funds deferred or terminated in that year (see below) and were assumed to have had less than 2% of blacks in desegregated schools.43 Thirteen districts did not appear in SERS (1967) and had not desegregated at all in later years (see below); we assume these districts had not desegregated in 1966 and assign them zero for the percentage of blacks in desegregated schools. Our estimation sample for models using data on the percentage of blacks in desegregated schools therefore includes 916 school districts.44

U.S. DHEW (1966) provides fall 1966 data for all Southern school districts on the type of desegregation plan submitted to comply with the CRA and whether the plan was approved by DHEW. We set the court order indicator to one for districts with approved court-ordered plans and zero otherwise. Using other information reported, we created an indicator for whether federal funds to the district had been deferred or terminated by fall 1966. For fall 1968 and later, data on student desegregation and status of compliance with the CRA (type of plan) were drawn from school-level surveys conducted by the Office for Civil Rights. See Cascio et al. (2008) for more detail on these data and sources.

D. Data on Desegregation Outcomes, 1964 and Earlier

For 1956 through 1964, we have entered district-level data on desegregation and type of plan from SERS (various years).45

43. Where observed, almost three quarters of districts with funds deferred or terminated had desegregated less than 2% of blacks; nonreporting districts likely had less desegregation. We impute the fraction of black students in desegregated schools to be 0.001 for these districts. Because it would require stronger assumptions given the 1966 DHEW guidelines in effect at the time, we do not impute values for the dichotomous indicators of any student or teacher desegregation based on having funds deferred or terminated.

44. The estimates tend to be stronger when we drop districts with imputed outcomes (available on request).

45. We use data presented in the following versions of this publication: April 15, 1957 (for fall 1956), November 1957 (for fall 1957), October 1958 (for fall 1958), May 1960 (for fall 1959), November 1960 (for fall 1960), November 1961 (for fall 1961), November 1962 (for fall 1962), 1963–64 (for fall 1963), and November 1964 (for fall 1964).
These publications give, for all districts desegregated "in policy or in practice,"46 the number of blacks attending public school with whites, the total number of black children enrolled in public schools, and whether desegregation was court-ordered or undertaken voluntarily by the local school board. Using these data, we are able to construct the percentage of blacks attending desegregated schools and indicators for whether the district had a court-ordered desegregation plan or any blacks enrolled in public schools with whites for 1964 and earlier. For districts not listed in these publications, we have set all of these variables to zero.47

It is difficult to assess the credibility of this assumption, because no other agencies collected data on desegregation over the period of interest. SERS's data collection strategy is also unclear. However, because there were such low rates of desegregation during the period, it was most likely not very onerous to collect the data. SERS was also a trusted source, as it supplied desegregation data to the U.S. Commission on Civil Rights by contractual agreement in 1964 (U.S. Commission on Civil Rights 1966, p. 30). The SERS state-level summaries of desegregation activity are also considered the best available data by social scientists and have been previously cited in academic research (e.g., Rosenberg [1991]; Orfield [2000]).

46. Districts desegregated in policy but not in practice had freedom of choice plans, where blacks' option to apply to white schools was not exercised, or court orders that had not yet taken effect.

47. We compiled lists of districts by state and year from the annual reports cited in Section B of this Appendix.

DARTMOUTH COLLEGE AND NATIONAL BUREAU OF ECONOMIC RESEARCH
UNIVERSITY OF CALIFORNIA, SAN DIEGO, AND NATIONAL BUREAU OF ECONOMIC RESEARCH
DARTMOUTH COLLEGE
UNIVERSITY OF CALIFORNIA, LOS ANGELES, AND NATIONAL BUREAU OF ECONOMIC RESEARCH
REFERENCES

Alabama Department of Education, Annual Report for the Scholastic Year Ending June 30, 1966 and for the Fiscal Year Ending September 30, 1966: Statistical and Financial Data (Montgomery, AL: Reports covering 1960/61–1965/66 school years).
Almond, Douglas, Kenneth Y. Chay, and Michael Greenstone, "Civil Rights, the War on Poverty, and Black–White Convergence in Infant Mortality in the Rural South and Mississippi," MIT Department of Economics Working Paper No. 07-04, 2006.
Arkansas Department of Education, Report on House Concurrent Resolution No. 58 of the 63rd General Assembly (Little Rock, AR: Reports covering 1960/61–1965/66 school years).
Ashenfelter, Orley, William J. Collins, and Albert Yoon, "Evaluating the Role of Brown v. Board of Education in School Equalization, Desegregation, and the Income of African Americans," American Law and Economics Review, 8 (2006), 213–248.
Boozer, Michael A., Alan B. Krueger, and Shari Wolkon, "Race and School Quality since Brown v. Board of Education," Brookings Papers on Economic Activity, Microeconomics, 1992 (1992), 269–326.
Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller, "Bootstrap-Based Improvements for Inference with Clustered Errors," NBER Technical Working Paper No. 344, 2007.
Card, David, and Alan B. Krueger, "School Quality and Black–White Relative Earnings: A Direct Assessment," Quarterly Journal of Economics, 107 (1992), 151–200.
Cascio, Elizabeth, Nora Gordon, Ethan Lewis, and Sarah Reber, "From Brown to Busing," Journal of Urban Economics, 64 (2008), 296–325.
Clotfelter, Charles T., "The Effect of School Desegregation on Housing Prices," Review of Economics and Statistics, 57 (1975), 395–404.
——, After Brown: The Rise and Retreat of School Desegregation (Princeton, NJ: Princeton University Press, 2004).
Clubb, Jerome M., William H. Flanigan, and Nancy H. Zingale, Electoral Data for Counties in the United States: Presidential and Congressional Races, 1840–1972 [Computer file], Compiled by Jerome M. Clubb, William H. Flanigan, and Nancy H. Zingale, ICPSR08611-v1 (Ann Arbor, MI: Inter-university Consortium for Political and Social Research, 2006).
Donohue, John J. III, James J. Heckman, and Petra E. Todd, "The Schooling of Southern Blacks: The Roles of Legal Activism and Private Philanthropy, 1910–1960," Quarterly Journal of Economics, 117 (2002), 225–268.
Florida State Superintendent of Public Instruction, Division of Research, Ranking of the Counties (Tallahassee, FL: Reports covering 1962/63, 1964/65, 1965/66 school years).
Georgia State Department of Education, Annual Reports of the Department of Education to the General Assembly of the State of Georgia (Atlanta, GA: Reports covering 1961/62, 1963/64, 1965/66 school years).
Halpern, Stephen C., On the Limits of the Law: The Ironic Legacy of Title VI of the 1964 Civil Rights Act (Baltimore, MD: Johns Hopkins University Press, 1995).
Heckman, James J., and Brook S. Payner, "Determining the Impact of Federal Antidiscrimination Policy on the Economic Status of Blacks: A Study of South Carolina," American Economic Review, 79 (1989), 138–177.
Kane, Thomas J., Stephanie K. Riegg, and Douglas O. Staiger, "School Quality, Neighborhoods, and Housing Prices," American Law and Economics Review, 9 (2006), 183–212.
Margo, Robert A., Race and Schooling in the South, 1880–1950: An Economic History (Chicago: University of Chicago Press, 1990).
Mississippi State Department of Education, Biennial Report and Recommendations of the State Superintendent of Public Education to the Legislature of Mississippi for the Scholastic Year (Jackson, MS: State Superintendent of Public Education, various years).
North Carolina Education Association, Per Pupil Expenditures for Current Expense: Information Provided by Division of Statistical Services, State Department of Public Instruction (Raleigh, NC: Reports covering 1961/62–1965/66 school years).
Orfield, Gary, The Reconstruction of Southern Education: The Schools and the 1964 Civil Rights Act (New York: Wiley-Interscience, 1969).
——, "The 1964 Civil Rights Act and American Education," in Legacies of the 1964 Civil Rights Act, Bernard Grofman, ed. (Charlottesville and London: University of Virginia Press, 2000).
Reber, Sarah J., "Court-Ordered Desegregation: Successes and Failures in Integration since Brown," Journal of Human Resources, 40 (2005), 559–590.
——, "From Separate and Unequal to Integrated and Equal? School Desegregation and School Finance in Louisiana," NBER Working Paper No. w13192, 2007.
Rosenberg, Gerald N., The Hollow Hope: Can Courts Bring about Social Change? (Chicago: University of Chicago Press, 1991).
South Carolina State Department of Education, Annual Report of the State Superintendent of Education of the State of South Carolina (Columbia, SC: Reports covering 1960/61–1965/66 school years).
Southern Education Reporting Service [SERS], A Statistical Summary, State-by-State, of Segregation–Desegregation Activity Affecting Southern Schools from 1954 to Present, Together with Pertinent Data on Enrollment, Teacher Pay, Etc. (Nashville, TN: Southern Education Reporting Service, April 15, 1957).
——, A Statistical Summary, State-by-State, of Segregation–Desegregation Activity Affecting Southern Schools from 1954 to Present, Together with Pertinent Data on Enrollment, Teacher Pay, Etc., Second Revision (Nashville, TN: Southern Education Reporting Service, November 1, 1957).
——, A Statistical Summary, State-by-State, of Segregation–Desegregation Activity Affecting Southern Schools from 1954 to Present, Together with Pertinent Data on Enrollment, Teacher Pay, Etc., Fifth Printing (Nashville, TN: Southern Education Reporting Service, October 15, 1958).
——, A Statistical Summary, State-by-State, of Segregation–Desegregation Activity Affecting Southern Schools from 1954 to Present, Together with Pertinent Data on Enrollment, Teachers, Colleges, Litigation, and Legislation, Seventh Revision (Nashville, TN: Southern Education Reporting Service, May 1960).
——, A Statistical Summary, State-by-State, of Segregation–Desegregation Activity Affecting Southern Schools from 1954 to Present, Together with Pertinent Data on Enrollment, Teachers, Colleges, Litigation, and Legislation, Eighth Revision (Nashville, TN: Southern Education Reporting Service, November 1960).
——, A Statistical Summary, State by State, of Segregation–Desegregation Activity Affecting Southern Schools from 1954 to the Present, Together with Pertinent Data on Enrollment, Teachers, Colleges, Litigation, and Legislation, Tenth Revision (Nashville, TN: Southern Education Reporting Service, November 1961).
——, A Statistical Summary, State by State, of Segregation–Desegregation Activity Affecting Southern Schools from 1954 to the Present, Together with Pertinent Data on Enrollment, Teachers, Colleges, Litigation, and Legislation, Eleventh Revision (Nashville, TN: Southern Education Reporting Service, November 1962).
——, A Statistical Summary, State by State, of School Segregation–Desegregation in the Southern and Border Area from 1954 to the Present, Thirteenth Revision (Nashville, TN: Southern Education Reporting Service, 1963–1964).
——, A Statistical Summary, State by State, of School Segregation–Desegregation in the Southern and Border Area from 1954 to the Present, Fourteenth Revision (Nashville, TN: Southern Education Reporting Service, November 1964).
——, A Statistical Summary, State by State, of School Segregation–Desegregation in the Southern and Border Area from 1954 to the Present, Sixteenth Revision (Nashville, TN: Southern Education Reporting Service, February 1967).
State Department of Education of Louisiana, Financial and Statistical Report (Baton Rouge, LA: Reports covering 1960/61–1965/66 school years).
Tennessee Department of Education, Annual Statistical Report of the Department of Education (Nashville, TN: Reports covering 1960/61–1965/66 school years).
U.S. Commission on Civil Rights, Political Participation: A Study of the Participation by Negroes in the Electoral and Political Processes in 10 Southern States since Passage of the Voting Rights Act of 1965 (Washington, DC: U.S. Commission on Civil Rights, 1968).
U.S. Department of Commerce, Bureau of the Census, County and City Data Book [United States] Consolidated File: County Data, 1947–1977 [Computer file], ICPSR version (Washington, DC: U.S. Department of Commerce, Bureau of the Census [producer], 1978; Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 1999).
U.S. Department of Health, Education, and Welfare [U.S. DHEW], Equal Educational Opportunities Programs, Status of Compliance Public School Districts Seventeen Southern and Border States, Report No. 1 (Washington, DC: U.S. Government Printing Office, December 1966).
——, Office of Education, Revised Statement of Policies for School Desegregation Plans under Title VI of the Civil Rights Act of 1964 (Washington, DC: U.S. Government Printing Office, 1966).
——, Office of Education, Statistics of State School Systems 1963–64 (Washington, DC: U.S. Government Printing Office, 1967).
——, Office of Education, Directory, Public Elementary and Secondary Schools in Large School Districts with Enrollment and Instructional Staff by Race: Fall 1967 (Washington, DC: U.S. Government Printing Office, 1969).
U.S. Senate, Committee on Labor and Public Welfare, Subcommittee on Education, Maximum Basic Grants—Elementary and Secondary Education Act of 1965 (Public Law 81-874, Title II, and Public Law 89-10, Title I) (Washington, DC: U.S. Government Printing Office, September 1965).
——, Notes and Working Papers Concerning the Administration of Programs Authorized under Title I of Public Law 89-10, The Elementary and Secondary Education Act of 1965 as Amended by Public Law 89-750 (Washington, DC: U.S. Government Printing Office, May 1967).
Virginia State Board of Education, Annual Report of the Superintendent of Public Instruction of the Commonwealth of Virginia (Richmond, VA: Reports covering 1960/61–1965/66 school years).
Washington Research Project of the Southern Center for Studies in Public Policy and the NAACP Legal Defense and Education Fund, Inc., Title I of ESEA: Is It Helping Poor Children? (Washington, DC: Washington Research Project, 1969).
Welch, Finis, and Audrey Light, New Evidence on School Desegregation (Washington, DC: U.S. Commission on Civil Rights, 1987).