THE QUARTERLY JOURNAL OF ECONOMICS

Vol. CXXV   May 2010   Issue 2
A THEORY OF FIRM SCOPE∗

OLIVER HART AND BENGT HOLMSTROM

The formal literature on firm boundaries has assumed that ex post conflicts are resolved through bargaining. In reality, parties often simply exercise their decision rights. We develop a model, based on shading, in which the use of authority has a central role. We consider two firms deciding whether to adopt a common standard. Nonintegrated firms may fail to coordinate if one firm loses. An integrated firm can internalize the externality, but puts insufficient weight on employee benefits. We use our approach to understand why Cisco acquired StrataCom, a provider of new transmission technology. We also analyze delegation.
∗ This is an extensively revised version of two earlier papers that circulated as “A Theory of Firm Scope” and “Vision and Firm Scope.” Some of the material presented here formed part of the first author’s Munich Lectures (University of Munich, November 2001), Arrow Lectures (Stanford University, May 2002), Karl Borch Lecture (Bergen, May 2003), and Mattioli Lectures (Milan, November 2003). We are especially grateful to Andrei Shleifer for insightful comments. We would also like to thank Philippe Aghion, George Baker, Lucian Bebchuk, Patrick Bolton, Pablo Casas-Arce, Mathias Dewatripont, Douglas Diamond, Aaron Edlin, Florian Englmaier, Robert Gibbons, Richard Holden, Bob Inman, Louis Kaplow, Bentley MacLeod, Meg Meyer, Enrico Perotti, David Scharfstein, Chris Snyder, Jeremy Stein, Lars Stole, Eric van den Steen, and seminar audiences at CESifo, University of Munich, Harvard University, London School of Economics, George Washington University, Stanford University, the Summer 2002 Economic Theory Workshop at Gerzensee, Switzerland, and the University of Zurich for helpful discussions. Finally, we have benefited from the very constructive suggestions of the editor and three referees. Research support from the National Science Foundation is gratefully acknowledged. © 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology.

I. INTRODUCTION

In the last twenty years or so, a theoretical literature has developed that argues that the boundaries of firms—and the allocation of asset ownership—can be understood in terms of incomplete contracts and property rights. The basic idea behind the literature is that firm boundaries define the allocation of residual control rights, and these matter in a world of incomplete contracts. In the
standard property rights model, parties write contracts that are ex ante incomplete but can be completed ex post. The ability to exercise residual control rights improves the ex post bargaining position of an asset owner and thereby increases his or her incentive to make relationship-specific investments. As a consequence, it is optimal to assign asset ownership to those who have the most important relationship-specific investments.1 Although the property rights approach provides a clear explanation of the costs and benefits of integration, the theory has a number of features that have limited its applicability.2 One that we focus on here is the assumption that ex post conflicts are resolved through bargaining with side payments. Although direct empirical evidence on this topic is not readily available, casual inspection suggests that bargaining with unrestricted side payments is not ubiquitous. Many decisions made in a firm will be carried out without consultation or negotiation with other firms even when these decisions impact the other firms in a major way. It is rare, for instance, for a firm to go to a competitor with the intention of extracting side payments for avoiding aggressive moves.3 We present a new model of firm boundaries, which is designed to deal with strategic decisions that are taken in the absence of ex post bargaining. To justify the use of authority rather than bargaining, we adopt the “contracts as reference points” approach of Hart and Moore (2008). According to this approach, a contract (in our model, an organizational form), negotiated under competitive conditions, circumscribes or delineates parties’ senses of entitlement. Parties do not feel entitled to outcomes outside the contract, but may have different views of what they are entitled to within the contract. More specifically, each side interprets the contract in a way that is most favorable to him. When he does not get the most favored outcome within the contract, he feels aggrieved and shades by performing in a perfunctory rather than a consummate fashion, creating deadweight losses. Given these assumptions, a more open-ended contract leads to more aggrievement, implying 1. See Grossman and Hart (1986), Hart and Moore (1990), and Hart (1995). This literature builds on the earlier transaction cost literature of Williamson (1975, 1985) and Klein, Crawford, and Alchian (1978). 2. For a discussion of this, see Holmstrom and Roberts (1998) and Holmstrom (1999). 3. Of course, where there is an opportunity for mutual gains, a firm may approach another firm to explore various ways of cooperating, either through the market or through a joint venture or merger. However, it is also possible that the parties will simply do what is unilaterally in their best interest.
that ex post bargaining with side payments is costly.4 We rule out renegotiation on these grounds. Our model comprises two units that have a lateral relationship (this is another departure from the literature, which has focused on vertical integration). We think of a unit as an irreducible set of activities that it would be meaningless to break up further. Each unit is operated by a manager and has a decision that affects the other unit; that is, there are externalities. We have in mind strategic decisions that are so significant that they warrant consideration of an organizational structure that best supports them. For example, the units may be deciding whether to adopt a common standard or platform for their technology or product. As an application, we will use the model to understand Cisco’s approach to acquisitions, especially its decision to purchase StrataCom. Cisco’s Internet Operating System (IOS) is a platform that came to dominate the network industry in the 1990s. StrataCom emerged as the leading provider of a small, but rapidly expanding, new transmission technology, Asynchronous Transfer Mode (ATM). The question for Cisco and StrataCom was whether to coordinate their technologies. Initially they tried to do this as separate firms, but apparently this did not work out. Cisco then acquired StrataCom.5 Each unit has a binary decision: it can choose “Yes” or “No.” Moreover, we simplify matters further by supposing that there are only two aggregate outcomes, which we term “coordination” or “noncoordination.” Coordination occurs if and only if both units choose Yes. That is, each party can veto coordination by choosing No. The decision in each unit is ex ante noncontractible, but ex post contractible. Each unit has a boss. The boss has the right to make the decision in that unit ex post; that is, the boss has residual control rights. In the simplest version of our model the boss is equivalent to an owner; however, in extensions, the boss and owner can be different. We will compare two leading organizational forms. In the first, nonintegration, the units are separate firms, and the unit managers are the bosses. In this case the unit managers make the Yes/No decisions. In the second, integration,

4. For a discussion, see Hart (2008).
5. There is thus a parallel between Cisco–StrataCom and the famous case of General Motors and Fisher Body. General Motors and Fisher Body initially transacted as separate firms, but General Motors then acquired Fisher Body. See, for example, Klein (2007).
the units are part of a single firm, and an outside manager is the boss. In this case the boss instructs the managers whether to choose Yes or No, and the managers must follow these instructions (they are contractible); however, the managers may shade on performance.6 A key ingredient in our model is the assumption that each unit generates two kinds of benefit: monetary profit, which is transferable with ownership, and private benefits, which are nontransferable. Private benefits represent job satisfaction, broadly defined. They may arise from various sources. Employees often have their human capital tied to particular technologies. They like to work with technologies with which they are familiar. If a new technology is introduced the employees need to learn new skills, which is costly. Also, the future wages and career prospects of employees may depend on how well their human capital fits the firm’s needs: the firm’s choices will therefore affect them. In sum, employees care about the decisions of the firm they work for. The evidence that smaller companies pay less on average than larger companies (see, e.g., Schoar [2002] on pay in conglomerate versus stand-alone plants) is consistent with the idea that employees are affected by the size and scope of their companies. Private benefits can also be viewed as a way of capturing different beliefs held by managers and workers about the consequences of strategic choices (for an explicit analysis of differences in beliefs with organizational implications, see Van den Steen [2005]). In high-tech industries, different visions about the future path of particular technologies are held with passion and influence both the costs of hiring and the decisions undertaken. Our discussion of the Cisco case suggests that private benefits were very important to Cisco and influenced its decision making. The role of the two types of benefits in our analysis can be illustrated as follows. Denote the pair of profits and private benefits (measured in money) accruing to each unit by (v A, w A) and (v B, w B), respectively. To simplify the analysis, assume that the manager is the only worker and hence private benefits refer to his job satisfaction.7 As well, assume that the boss of a unit can use her residual rights of control to divert all the profit from that 6. These are not the only possibilities. For example, one could consider another form of integration where one of the unit managers is the boss. We discuss this in Section III. 7. The interpretation that private benefits are enjoyed by a single manager is restrictive. In the Conclusions we discuss briefly the case where the units are large companies, and private benefits refer to the aggregate job satisfaction of workers.
unit to herself. This rules out profit sharing as a way to influence incentives. Profit sharing would alleviate, but not eliminate, the effects we describe.8 If the units are nonintegrated, manager A is the boss of unit A, and manager B the boss of unit B; manager A’s payoff will be v A + w A, because he diverts the profit from unit A and cares about his own private benefits, and manager B’s payoff will be v B + w B, for similar reasons. In contrast, if units A and B are integrated, then, if a (professional) outsider is the boss, her payoff will be v A + v B, because she diverts all the profit and does not care about private benefits. As a benchmark, note that social surplus is given by v A + v B + w A + w B. The key point is that integration results in less weight being placed on private benefits than under nonintegration. Under nonintegration, w A, w B each appears in one boss’s objective function. In contrast, under integration the w’s fail to appear in the overall objective function. However, this diminished influence of private benefits is offset by the fact that, under integration, total profits, rather than individual unit profits, are maximized. The actual analysis is more complicated because the deadweight losses from shading must be taken into account. Shading causes some internalization of externalities: a boss puts some weight on the payoffs of other parties, given their ability to shade. We assume that the opportunity to shade under nonintegration also depends on the nature of the relationship between the parties. We make a distinction between two forms of nonintegration. In one, “nonintegration without cooperation,” the relationship between the units is a limited one that terminates if noncoordination occurs; the units cannot shade against each other in this eventuality. In the other, “nonintegration with cooperation,” the relationship persists; shading can occur under noncoordination. In contrast, we assume that shading is always possible under integration: the parties continue to have a relationship. In summary, under nonintegration, bosses have the right balance between private benefits and profits, but are parochial (they do not take into account their effect on the other unit), whereas, under integration, they have the right balance between units, but ignore private benefits. In our model, where the only issue is whether the units coordinate, we show that nonintegration and integration make opposite kinds of mistakes. Nonintegration can lead to too little coordination when the benefits from coordination 8. We return to this issue briefly in Section V.
are unevenly divided across the units. One unit may then veto coordination even though it is collectively beneficial. In contrast, under a weak assumption—specifically, that coordination represents a reduction in “independence” and therefore causes a fall in private benefits—integration leads to too much coordination.9,10 We analyze the above model in Sections II and III. In Section IV, we generalize the model to allow delegation of decisionmaking authority under integration. We argue that it is hard to make sense of delegation in much of the literature, because it is unclear why the boss cannot change her mind ex post and take back the decision rights that she has delegated. The presence of aggrievement can help here. We assume that reversing delegation is regarded by subordinates as a “breach of promise” and leads to increased levels of aggrievement. This makes delegation a credible commitment device: the boss will reverse herself only in “extreme” states of the world. We show that integration with delegation can be a valuable intermediate organizational form between nonintegration and integration. Under delegation, managers get their way in states of the world where decisions matter significantly more to them than to the boss. However, in states of the world where the boss cares a lot about the outcome, either managers will do what the boss wants of their own accord, given the threat of shading by the boss, or the boss will take back the decision rights. Our paper is related to a number of ideas that have appeared in the literature. First, there is an overlap with the literature on internal capital markets; see particularly Stein (1997, 2002), Rajan, Servaes, and Zingales (2000), Scharfstein and Stein (2000), Brusco and Panunzi (2005), and Inderst and Laux (2005). This 9. In our model the boss of an integrated firm has relatively broad objectives because he diverts (all of) the profit from the units under his control. We believe that a boss may have broad objectives for other reasons: he may be judged according to how well the units under his control perform, or obtain job satisfaction from their success. 10. In a previous version of the paper we assumed that decisions were noncontractible both ex ante and ex post, and did not adopt the “contracts as reference points” approach. We obtained a similar trade-off between nonintegration and integration, but our approach raised some questions. (In independent work, Baker, Gibbons, and Murphy [2008] also obtain a trade-off similar to ours under the assumption that decisions are ex post noncontractible.) First, if a decision is ex post noncontractible, how does a boss get it carried out except by doing it herself? Second, even if decisions are ex post noncontractible, as long as decision rights can be traded ex post, it is unclear why ex ante organizational form matters (in the absence of noncontractible investments). The parties could just rely on ex post bargaining of decision rights to achieve an optimum. Finally, the “ex post noncontractibility” approach by itself does not yield an analysis of delegation (see below).
literature emphasizes the idea that the boss of a conglomerate firm, even if she is an empire builder, is interested in the overall profit of the conglomerate, rather than the profits of any particular division. As a result, the conglomerate boss will do a good job of allocating capital to the most profitable project (“winner-picking”). Our idea that the professional boss of an integrated firm maximizes total profit is similar to this; the main differences are that the internal capital markets literature does not stress the same cost of integration as we do—the boss’s insufficient emphasis on private benefits—or allow for the possibility that the allocation of capital can be done through the market (in our model, the market is always an alternative to centralized decision making), or consider standard-setting. Second, the idea that it may be efficient for the firm to have narrow scope and/or choose a boss who is biased toward particular workers is familiar from the work of Shleifer and Summers (1988), Rotemberg and Saloner (1994, 2000), and Van den Steen (2005). These papers emphasize the effect of narrow scope and bias on worker incentives rather than on private benefits or wages, but the underlying premise, that workers care about the boss’s preferences, is the same. However, none of these papers analyzes firm boundaries. Third, several recent works explore firm boundaries and internal organization using the idea that some actions are noncontractible ex ante and ex post but may be transferable through ownership; see, for example, Holmstrom (1999), Aghion, Dewatripont, and Rey (2004), Mailath, Nocke, and Postlewaite (2004), Bolton and Dewatripont (2005), Hart and Moore (2005), Alonso, Dessein, and Matouschek (2008), Baker, Gibbons, and Murphy (2008), and Rantakari (2008). We discuss in footnote 10 some reasons that we have not followed the “ex post noncontractibility” approach here. We should point out how our analysis of delegation differs from the treatment of authority in Aghion and Tirole (1997) (see also Baker, Gibbons, and Murphy [1999]). In Aghion and Tirole, a boss defers to a subordinate in situations where the subordinate has superior information. In this case, even though the boss has “formal” authority, the subordinate has “real” authority. In contrast, we are interested in situations where allocating authority to someone inside a firm has meaning. As Baker, Gibbons, and Murphy (1999) point out, this corresponds to real rather than formal authority: if the boss appoints someone as unit head, say, she can legally change her mind and take the authority back. In our model, allocating authority inside a firm nonetheless has
[FIGURE I. Timeline: organizational form chosen; decisions made; payoffs realized]
meaning. The reason is that there is a friction: designating someone as unit head and then reversing the decision is costly, given that reversal increases aggrievement (by the unit manager, and possibly by unit workers to the extent that the new boss’s preferences are less aligned with theirs).11 The paper is organized as follows. The basic model is presented in Sections II and III. In Section IV we analyze delegation. Section V illustrates the model using Cisco’s approach to platform leadership through acquisitions. Finally, Section VI concludes. II. A BASIC MODEL OF COORDINATION Our model concerns two units, A and B, that have a lateral relationship: they operate in the same output or input markets. A unit has a manager and no workers. Each unit makes a decision that affects the other unit. For example, the units may be deciding whether to adopt a common standard or platform for their technology or products. It is natural to model such a strategic coordination decision as a binary choice. Each unit can choose “Yes” (Y ) or “No” (N). There are two aggregate outcomes: “coordination” or “noncoordination.” Coordination occurs if and only if both units choose Y . The timeline is as in Figure I. At the beginning, an organizational form is selected—specifically, whether the units should be separate firms (nonintegration, i.e., there are two bosses) or should merge into one firm (integration, i.e., there is one boss). Next, each unit chooses Y or N. Finally, the payoffs are realized. Each unit generates two kinds of benefit: monetary profit v and private (nontransferable) benefits w in the form of job satisfaction for the manager working in the unit (private benefits are measured in money). We assume that the boss of the unit can divert all the profit from that unit to herself.12 In contrast, the private benefits always reside with the managers. We represent payoffs from different outcomes in the matrix in Table I. We assume that these payoffs are nonverifiable and, for simplicity, 11. In Baker, Gibbons, and Murphy (1999), reversal is also costly given that it is a breach of a relational contract. 12. One justification is that the boss can use her residual control rights to authorize side-deals with other companies she owns, and this enables her to siphon profit out of the unit.
TABLE I
PAYOFFS

                                   Unit B
                         Y                          N
Unit A   Y   A: v_A, w_A;  B: v_B, w_B    A: 0, 0;  B: 0, 0
         N   A: 0, 0;  B: 0, 0            A: 0, 0;  B: 0, 0
perfectly certain. Without loss of generality we normalize so that monetary profit and private benefits under noncoordination are zero in both units. Unit A is the row player, and unit B is the column player. Subscripts refer to units, with v representing profit and w private benefits. It will be convenient to introduce the notation

(1)    z_A ≡ v_A + w_A,    z_B ≡ v_B + w_B.
Here, z_A (resp. z_B) refers to the change in total surplus in unit A (resp. unit B) from coordination, and z_A + z_B equals the change in aggregate social surplus. Note that (1) does not account for the costs of aggrievement, which depend on the ex ante contract as well as the ex post decision. As discussed in the Introduction, private benefits refer (broadly) to job satisfaction or on-the-job consumption. It is reasonable to assume that part of job satisfaction stems from the ability to pursue an independent course or agenda. Thus, we will assume that coordination leads to a reduction in private benefits:

(2)    w_A ≤ 0,    w_B ≤ 0.13
We put no restrictions on whether coordination increases or decreases profits; moreover, even if coordination increases total profits, profits may rise by more or less than the fall in private benefits. We will focus on two leading organizational forms:

1. Nonintegration:14 Manager A is the boss of unit A and manager B is the boss of unit B. Each manager diverts

13. Our main results generalize to the case w_A + w_B ≤ 0. We make the stronger assumption (2) for expositional simplicity.
14. We will actually consider two subcases of nonintegration, one without cooperation and one with cooperation, as discussed below.
profit and receives private benefits from his unit, and so manager A’s payoff is v A + w A, and manager B’s is v B + w B. 2. Integration: A professional manager (an outsider) is the boss of both units and managers A and B are subordinates. The boss receives v A + v B. The unit managers are under fixed-wage employment contracts and each manager receives the sum of the wage and private benefit in his unit. Organizational form and contracts are determined ex ante. We will assume, as in the standard incomplete contracts literature, that at this stage the coordination decisions are too complicated to specify; however, authority over these decisions can be allocated. We will take the view that the boss of each unit has residual rights of control, which gives her the legal authority to make the Y /N decisions in her unit. Ex post the Y /N decisions can be contracted on. Under nonintegration each unit manager chooses Y or N in his unit. Under integration, the overall boss instructs the unit managers to choose Y or N. We will assume that the unit managers must follow these instructions— they are contractible— but the managers may choose to shade.15 Shading may also occur under nonintegration. As discussed in the Introduction, we use the “contracts as reference points” approach of Hart and Moore (2008) to justify the particular contracting assumptions that we make. According to this approach a contract—an organizational form in this case— negotiated under ex ante competitive conditions delineates or circumscribes parties’ feelings of entitlement ex post. In particular, a contracting party does not feel entitled to an outcome outside those specified by the contract or organizational form. However, parties may feel entitled to different outcomes within the contract or organizational form. A party who does not receive what he feels entitled to is aggrieved and shades on performance. We assume that shading reduces the payoff of the shaded against party but does not affect the payoff of the party doing the shading. Shading creates deadweight losses.16 Specifically, following Hart and Moore (2008), we assume that each party feels entitled to his most preferred outcome or decision within the contract, and that a party who receives ki less than his 15. We do not allow managers to quit within a period; see footnote 22. 16. The reference points approach resembles in some respects relational contracting (see, e.g., Baker, Gibbons, and Murphy [2008]). Shading is like punishment in relational contracting models, but shading does not hurt the person doing the shading.
maximum payoff will be aggrieved by k_i and will shade to the point where the other parties’ payoffs fall by θ k_i. Here θ is an exogenous shading parameter, assumed to be the same for all parties, and 0 < θ < 1. Thus the total deadweight loss from shading is θ Σ_i k_i. The assumption that contracts are reference points provides a natural reason for parties to pin things down in an initial contract. A contract that is too flexible, that is, that specifies too little, can lead to a lot of aggrievement and shading ex post. The downside of a rigid contract is that it is harder for the parties to adjust to new circumstances. Even though there is no payoff uncertainty in our model, our assumption that decisions become contractible only ex post implies a change in circumstances that makes the ex ante choice of organizational form relevant for the deadweight losses from aggrievement, as will become clear below. There is a further consideration about shading: the ability of a party to shade may depend on the nature of the transaction that the party is engaged in. For example, under nonintegration, if the units fail to coordinate on a standard or platform, they may no longer have dealings with each other, which will reduce shading possibilities. For this reason, we will distinguish between two forms of nonintegration. In one, “nonintegration without cooperation,” the parties’ relationship ends in the absence of adoption of a standard and so shading is not possible under noncoordination. In the second, “nonintegration with cooperation,” the parties have a broader relationship that continues beyond the standardization decision and so shading is possible even under noncoordination. In contrast, under integration, we assume that shading is always possible: the parties continue to have a relationship.17 Under the shading assumption, ex post renegotiation is not costless because each party will feel entitled to the best possible outcome in the renegotiation, and they cannot all be satisfied and will shade. Moreover, to the extent that renegotiation reopens consideration of the terms and entitlements underlying existing contracts, renegotiation can make all parties worse off. In the analysis below, we will rule out ex post renegotiation on these grounds. However, we believe that our results could be generalized to ex post renegotiation along the lines of Hart (2009). We assume that bargaining at the ex ante stage ensures that organizational form is chosen to maximize expected future surplus net of ex post shading costs (lump sum transfers are used to

17. In our discussion of the Cisco–StrataCom relationship in Section V we suggest that, before StrataCom was acquired, their relationship was probably best described as “nonintegration with cooperation.”
redistribute surplus). In particular, we assume that at least one side of the market is competitive ex ante, so that each side achieves the best outcome it can get in the negotiation. Therefore there is no shading at the ex ante stage. In contrast, there is the potential for shading at the ex post stage, because the parties are then locked in. The ex ante bargaining also determines managerial wages. In the special case where there is a competitive market for managers, wages plus expected private benefits will equal the reservation utility for managers. An implication of this is that an organizational change that reduces private benefits will lead to an increase in wages.18

III. OPTIMAL ORGANIZATIONAL FORM

In this section we analyze optimal organizational form. We compare “nonintegration without cooperation,” “nonintegration with cooperation,” and “integration.”19 In each case we assume that the ex ante incomplete contract that the parties write fixes prices or wages and allocates authority.20 Also, there is no renegotiation. From now on, we will use S to denote the social surplus net of shading costs, that is, the relevant payoff from Table I less any costs of shading. For simplicity, we refer to S as social surplus. First-best refers to cases where aggregate surplus is maximized and shading costs are zero. Similarly, we say that a decision is first-best efficient if it maximizes total surplus ignoring shading costs.

III.A. Nonintegration without Cooperation

Under nonintegration, manager A’s payoff is v_A + w_A, manager B’s payoff is v_B + w_B, and either manager can veto coordination by choosing N. It is useful to distinguish three cases.

18. There is some evidence consistent with this. Schoar (2002), in a study of the effects of corporate diversification on plant-level productivity, finds that diversified firms have on average 7% more productive plants, but also pay their workers on average 8% more, than comparable stand-alone firms.
19. We take the view that both forms of nonintegration are feasible choices. In reality, past and expected future interactions between the parties may dictate the nature of their relationship under nonintegration. In other words, whenever nonintegration is chosen, its type is determined.
20. We do not consider contracts that specify a price range rather than a single price. For a discussion of such contracts, see Hart and Moore (2008).
Case 1: z_A ≤ 0, z_B ≤ 0. The managers’ preferences are aligned. Coordination does not occur because nobody wants it, and given that there is no disagreement, there is no aggrievement. Social surplus is given by

(3)    S = 0.

Case 2: z_A ≥ 0, z_B ≥ 0. The managers’ preferences are aligned. This time both parties want coordination and so coordination occurs without aggrievement.21 Social surplus is given by

(4)    S = z_A + z_B.

Case 3: z_i < 0, z_j > 0 (i ≠ j). Now there is a conflict. Manager i does not want coordination and can veto it by choosing N. Because under “nonintegration without cooperation” shading by manager j is infeasible if the parties do not coordinate, manager i will not hesitate to exercise his veto, and the outcome will be noncoordination. Social surplus is given by

(5)    S = 0.

We see that the first-best, coordinate if and only if

(6)    z_A + z_B ≥ 0,
is achieved in Cases 1 and 2, but may not be achieved in Case 3. This is the critical problem of winners and losers. Even though aggregate surplus may rise, the distribution of the gains may be such that one party loses out, and this party will veto coordination. In summary, there is too little coordination under “nonintegration without cooperation.” Whenever coordination occurs it is first-best efficient (Case 2 implies (6)); but coordination may not occur when it is first-best efficient ((6) does not imply Case 2). Finally, there is no shading in equilibrium under “nonintegration without cooperation,” whether the outcome is coordination or noncoordination.

III.B. Nonintegration with Cooperation

Now shading is possible even under noncoordination. Cases 1 and 2 remain the same and achieve first-best (in particular, no

21. Note that, in Case 2, (N, N) is a Nash equilibrium along with (Y, Y); however, we will assume that parties do not pick a Pareto-dominated equilibrium.
shading). However, under Case 3, manager i may choose not to veto coordination, given that manager j will be aggrieved if i does this—by the difference between manager j’s payoff under his preferred outcome, coordination, and what he actually gets—and will shade in proportion to this difference. That is, manager j will be aggrieved by z_j and will shade by θ z_j. Coordination will occur if manager i’s utility from coordination exceeds the costs of shading imposed on i by manager j, z_i ≥ −θ z_j, that is,

(7)    z_i + θ z_j ≥ 0.

If (7) holds, manager i is a reluctant coordinator and will be aggrieved by −z_i because the best outcome for him would have been not to coordinate. Thus manager i will shade by −θ z_i, and there will be deadweight losses of that amount. Note that (7) implies

(8)    z_j + θ z_i > 0,

and so manager j still wants to coordinate in spite of this shading. On the other hand, if (7) does not hold, coordination will not occur but manager j will shade by θ z_j. Social surplus is thus given by

(9)    S = z_A + z_B + θ z_i   if (7) holds (coordination),
       S = −θ z_j              if (7) does not hold (noncoordination).
Whereas first-best is achieved in Cases 1 and 2, Case 3 does not lead to first-best. It is easy to see that (7) ⇒ (6), so there is too little coordination relative to first-best. In addition, social surplus, given in (9), always entails a strictly positive cost of shading; regardless of the decision, one side will be unhappy. It is evident that “nonintegration with cooperation” is potentially desirable (to the extent that it is a choice) only if coordination is the outcome (i.e., (7) holds). When (7) does not hold, the parties are better off with “nonintegration without cooperation.” In the case where there is uncertainty (to be discussed later) it is possible that parties attempt “nonintegration with cooperation,” only to find that (7) fails.

III.C. Integration

We divide the analysis into two cases.

Case 1: v_A + v_B ≤ 0. The managers’ and bosses’ preferences are aligned (given (2)). Coordination does not occur because
no one wants it, and, given that there is no disagreement, there is no shading. Social surplus is given by

(10)    S = 0.

Case 2: v_A + v_B > 0. Now the boss wants coordination, but the managers do not, and they will be aggrieved by w_A + w_B and will shade by θ(w_A + w_B) if it occurs. The boss will coordinate if and only if her payoff net of shading costs is higher:

(11)    v_A + v_B + θ(w_A + w_B) ≥ 0.

In other words, the boss partly internalizes the wishes of her subordinates. If (11) does not hold, the boss will go along with what the managers want and will not coordinate. In this case, the boss is aggrieved by v_A + v_B because she is not getting her preferred outcome, and so she will shade to the point where the unit managers’ payoffs fall by θ(v_A + v_B). Social surplus is thus given by

(12)    S = z_A + z_B + θ(w_A + w_B)   if (11) holds (coordination),
        S = −θ(v_A + v_B)              if (11) does not hold (noncoordination).
The first-best is achieved in Case 1 but not in Case 2. In Case 2, there is too much coordination relative to the first-best ((6) ⇒ (11) but not vice versa) and too much shading. We have established PROPOSITION 1. Nonintegration errs on the side of too little coordination (when coordination occurs it is first-best efficient, but it may be first-best efficient and not occur), whereas integration errs on the side of too much coordination (when coordination is first-best efficient it occurs, but it may occur even when it is not first-best efficient). If noncoordination is first-best efficient, “nonintegration without cooperation” achieves the first-best. If coordination is first-best efficient then (a) integration leads to coordination, but may not be optimal given the deadweight losses from shading; (b) integration is optimal if the changes in private benefits from coordination are
sufficiently small; and (c) integration is uniquely optimal if in addition the distribution of profits is sufficiently uneven.22

An extension: So far we have assumed that the integrated firm is run by a professional manager. We now consider whether it might be better to put manager A, say, in charge. Case 1 remains unchanged. However, Case 2 will be different. Instead of (11), manager A’s decision rule will be to coordinate if and only if

(13)    v_A + v_B + w_A + θ w_B ≥ 0.

So manager A, like the professional manager, coordinates too often. However, because (13) implies (11), manager A is less biased toward coordination. This is an improvement. The social surplus in the event that manager A coordinates will be

(14)    S = z_A + z_B + θ w_B,

which is greater than the social surplus when the professional manager coordinates (see (12)). The reason is that when manager A coordinates, he does not shade against himself. The upshot is that it is always at least as good to have manager A (or manager B by symmetry) run the integrated enterprise as to have a professional boss. One way to rationalize our assumption that the boss of the integrated firm is a professional manager is to assume that as well as the strategic decision that we have focused on, there are additional 0–1 decisions that need to be taken, which will be chosen in an inefficient way if manager A or manager B becomes the boss in the integrated firm. To illustrate, suppose that there is an auxiliary decision that has no financial consequences, just private ones. Specifically, let the effects of going ahead with the decision be

(15)    ŵ_A > 0 > ŵ_B   and   ŵ_A + ŵ_B < 0.
22. We assume that unit managers are locked in for a period and cannot quit, that is, we assume that their employment contract is binding for one period. (See Hart and Moore [2008] and Van den Steen [2009] for discussions of the employment contract, and Hart and Moore [2008] for a model where quitting can occur within a period.) If quitting were possible, then under integration the boss would be forced to internalize some of the managers’ private benefits because if she pursued profit too much at the expense of private benefits, managers would leave. Obviously, quitting becomes more of an issue in a multiperiod model where decisions are longterm, and a decision that reduces managerial independence might force the boss to pay higher wages to retain workers. In many interesting situations, however, it is plausible that managers and workers are not on the margin of quitting, perhaps because they have made relationship-specific investments or they are paid efficiency wages.
Thus, manager A would like to see the decision taken, even though it is inefficient. As the boss, he will go ahead with the decision whenever

(16)    ŵ_A + θ ŵ_B > 0.

The social payoff of going ahead is

(17)    ŵ_A + ŵ_B + θ ŵ_B < 0.

A professional manager would never go ahead with the decision. Manager A, but not manager B, will feel aggrieved by this, which results in a social payoff −θ ŵ_A < 0. Comparing this with (17), we see that social surplus from the auxiliary decision is strictly higher when a professional manager is in charge than when manager A is in charge. Manager B would make the same auxiliary choice as the professional manager and be more effective than the professional manager with respect to the strategic decision, as we argued earlier. So, when both the strategic decision and the auxiliary decisions are considered together, manager B would be the best boss. To avoid this conclusion, we can add a second auxiliary decision, with the payoffs for A and B reversed. This decision would be just as inefficient, but favors manager B rather than A. With both decisions thrown in, it is easy to see that the professional manager can be the best boss. The benefit of a professional boss is that she will not make decisions that are inefficient and exclusively favor one or the other manager. This is an economically plausible argument for having a professional boss run the integrated firm, though obviously there are interesting cases where manager A or manager B would do better. Finally, we note that instead of introducing auxiliary decisions, we can add uncertainty about private benefits into our original model, allowing them to be negatively correlated as in the discussion above. This requires that we replace our earlier assumption that both A’s and B’s private benefits suffer from coordination, condition (2), with the assumption that the sum of the changes in private benefits is negative. With uncertainty and negatively correlated private benefits, a professional manager can be the optimal choice, exactly for the reasons illustrated by considering auxiliary decisions.
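To fix ideas, here is a minimal numerical sketch (ours, not part of the paper's formal analysis) of the comparison in this section: it applies conditions (7) and (11) and the surplus expressions (3)–(5), (9), and (12) to a hypothetical set of payoffs. The function names and parameter values are purely illustrative assumptions.

```python
# Illustrative sketch (not from the paper): equilibrium decisions and social
# surplus S under the organizational forms of Section III. All parameter
# values below are hypothetical.

def nonintegration_without_cooperation(vA, wA, vB, wB, theta):
    zA, zB = vA + wA, vB + wB
    if zA >= 0 and zB >= 0:              # Case 2: both managers want to coordinate
        return "coordination", zA + zB   # eq. (4)
    return "noncoordination", 0.0        # eqs. (3), (5): at least one manager chooses N; no shading

def nonintegration_with_cooperation(vA, wA, vB, wB, theta):
    zA, zB = vA + wA, vB + wB
    if zA >= 0 and zB >= 0:              # Case 2
        return "coordination", zA + zB
    if zA <= 0 and zB <= 0:              # Case 1
        return "noncoordination", 0.0
    zi, zj = (zA, zB) if zA < 0 else (zB, zA)   # i = loser, j = winner (Case 3)
    if zi + theta * zj >= 0:             # condition (7): reluctant coordination
        return "coordination", zA + zB + theta * zi    # eq. (9), coordination branch
    return "noncoordination", -theta * zj              # eq. (9), noncoordination branch

def integration(vA, wA, vB, wB, theta):
    if vA + vB <= 0:                     # Case 1: the boss does not want coordination
        return "noncoordination", 0.0
    if vA + vB + theta * (wA + wB) >= 0: # condition (11): the boss coordinates
        return "coordination", (vA + vB + wA + wB) + theta * (wA + wB)  # eq. (12)
    return "noncoordination", -theta * (vA + vB)                        # eq. (12)

# A "winners and losers" example: coordination is first-best (zA + zB = 11 > 0),
# but unit B loses (zB = -12) and vetoes under nonintegration.
params = dict(vA=30, wA=-7, vB=-5, wB=-7, theta=0.5)
for form in (nonintegration_without_cooperation,
             nonintegration_with_cooperation,
             integration):
    print(form.__name__, form(**params))
```

With these (assumed) numbers, both forms of nonintegration end in noncoordination, with S = 0 and S = −11.5, respectively, whereas integration coordinates with S = 4, the pattern described in Proposition 1.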
IV. DELEGATION

We now consider delegation, a form of governance that is intermediate between integration and nonintegration, where a professional boss delegates her formal authority over decision rights to the unit managers.23 However, because the boss is legally in charge, there is nothing to stop her from changing her mind and taking back the decision rights ex post. We refer to the taking back of decision rights as a reversal: we assume that the timing is such that a reversal takes place ex post before managers make their decisions. We assume that the subordinates regard a reversal as a “breach of promise,” and this leads to increased levels of aggrievement and shading: the shading parameter rises from θ to θ̄, where 1 ≥ θ̄ ≥ θ. If θ̄ > θ, and there is uncertainty, we will see that delegation can have value as a partial commitment device. As in our discussion of integration in Section III, there are two cases:

Case 1: v_A + v_B ≤ 0. Preferences are aligned, and no one wants coordination. So coordination does not occur, and there is no shading. Social surplus is given by S = 0.

Case 2: v_A + v_B > 0. Now there is a conflict. Ignore reversal for the moment. If the managers do not coordinate, the boss will be aggrieved. Suppose that the boss divides her shading 50:50 between the two parties.24 Then the managers’ payoffs are given by −(θ/2)(v_A + v_B), i = A, B. So the managers will choose to coordinate if

(18)    w_A + (θ/2)(v_A + v_B) ≥ 0,    w_B + (θ/2)(v_A + v_B) ≥ 0.

When (18) holds, the managers coordinate reluctantly. They feel aggrieved and will shade, reducing the social surplus to

(19)    S = z_A + z_B + θ(w_A + w_B).
Suppose next that (18) does not hold. Then coordination will not occur unless the boss reverses the decision and forces coordination. Forced coordination leads to aggrievement levels of w_A + w_B for the managers. Shading costs equal θ̄(w_A + w_B), given that the shading parameter rises from θ to θ̄. Thus, the boss reverses if and only if

(20)    v_A + v_B + θ̄(w_A + w_B) ≥ 0.

So if neither (18) nor (20) holds, coordination does not occur and

(21)    S = −θ(v_A + v_B),

whereas, if (18) does not hold but (20) does, coordination occurs, and

(22)    S = z_A + z_B + θ̄(w_A + w_B).

We summarize this discussion in the following proposition.

PROPOSITION 2. In the delegation model,
A. If v_A + v_B ≤ 0, coordination does not occur and social surplus is given by S = 0.
B. If v_A + v_B > 0 and (18) holds, managers will coordinate reluctantly and S = z_A + z_B + θ(w_A + w_B).
C. If v_A + v_B > 0 and (18) does not hold but (20) does, the boss forces coordination and S = z_A + z_B + θ̄(w_A + w_B).
D. If v_A + v_B > 0 and neither (18) nor (20) holds, then coordination does not occur, but the boss is aggrieved and S = −θ(v_A + v_B).

It is useful to compare the outcome under delegation with that under integration. It is easy to see that (18) implies (11), given that θ < 1. Also, (20) implies (11). It follows that, whenever coordination occurs under delegation, that is, in case B or C above, coordination occurs under integration too. However, because (6) implies (20) (given that θ̄ ≤ 1), there is still too much coordination under delegation relative to the first-best; that is, coordination occurs whenever it is efficient, but also sometimes when it is inefficient.

23. Although the boss delegates the right to make Y/N decisions, we assume that she retains the ability to divert unit profit.
24. This is a simplifying assumption and other possibilities could be explored.
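Continuing the illustrative sketch from Section III (again, ours rather than the paper's), the case logic of Proposition 2 can be written out in the same style; θ̄ (theta_bar) and all numbers are hypothetical assumptions.

```python
# Illustrative sketch (not from the paper): Proposition 2's case logic under
# delegation. theta_bar >= theta is the elevated shading parameter that
# applies if the boss reverses the delegation. Values are hypothetical.

def delegation(vA, wA, vB, wB, theta, theta_bar):
    if vA + vB <= 0:                                   # case A: no conflict
        return "noncoordination", 0.0
    if (wA + 0.5 * theta * (vA + vB) >= 0 and
            wB + 0.5 * theta * (vA + vB) >= 0):        # condition (18)
        return "reluctant coordination", (vA + vB + wA + wB) + theta * (wA + wB)   # case B
    if vA + vB + theta_bar * (wA + wB) >= 0:           # condition (20): boss reverses
        return "forced coordination", (vA + vB + wA + wB) + theta_bar * (wA + wB)  # case C
    return "noncoordination", -theta * (vA + vB)       # case D

# Same parameters as in the Section III sketch: (18) fails, but (20) holds,
# so the boss reverses and forces coordination at the higher shading cost.
print(delegation(vA=30, wA=-7, vB=-5, wB=-7, theta=0.5, theta_bar=0.8))
```

With the same parameters as before, (18) fails but (20) holds, so the boss reverses and forces coordination with S ≈ −0.2, compared with S = 4 under integration; the extra shading cost from reversal is what delegation must make up for through the commitment benefit discussed below.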
PROPOSITION 3. Under delegation there is (weakly) less coordination than under integration, but still too much coordination relative to the first-best. Proposition 3 is intuitive. If unit managers reluctantly coordinate under delegation, that is, reversal is not required, then a professional manager would also coordinate under integration. And if a professional manager would reverse delegation to achieve coordination, incurring higher aggrievement and shading costs, then she would surely coordinate if reversal were not required. Finally, because θ¯ ≤ 1, if coordination is efficient, the boss will be prepared to incur the costs of reversal to achieve it. Thus, the trade-off between integration and delegation is the following: both yield coordination too much of the time, but delegation yields it less of the time and therefore comes closer to the first-best. However, to the extent that the boss reverses delegation to achieve coordination, the deadweight losses from shading are higher under delegation than under integration. The next proposition shows that delegation is never strictly optimal under certainty. PROPOSITION 4. Under perfect certainty, “nonintegration without cooperation” or integration can be strictly optimal, but delegation is never strictly optimal. Proof. Suppose first that the equilibrium outcome under delegation is (N, N). Then the equilibrium outcome under “nonintegration without cooperation” cannot be worse than this: either it is (N, N) with less shading, or it is (Y, Y ), which is Pareto superior. Suppose next that the equilibrium outcome under delegation is (Y, Y ). If (18) holds, so does (11), and so coordination occurs under integration with the same shading costs. On the other hand, if (18) does not hold, then (20) must hold, because otherwise the outcome would be (N, N). But if (20) holds, then (11) holds, and so coordination again occurs under integration with lower shading costs. Finally, it is easy to find parameters such that (N, N) is socially optimal, and “nonintegration without cooperation” yields (N, N), whereas integration and delegation yield (Y, Y ); and parameters such that (Y, Y ) is socially optimal, and integration yields (Y, Y ), whereas “nonintegration without cooperation” and delegation yield (N, N). In other words, nonintegration and integration can each be uniquely optimal. QED
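To illustrate the existence claims in the proof with concrete numbers (ours, purely for illustration): take θ = 0.5 and θ̄ = 0.8. With v_A = v_B = 10 and w_A = w_B = −12, noncoordination is first-best (z_A + z_B = −4); "nonintegration without cooperation" yields (N, N) with S = 0, whereas integration coordinates (condition (11): 20 − 12 = 8 ≥ 0) with S = −16 and delegation coordinates by reversal with S = −23.2, so nonintegration is uniquely optimal. With v_A = 30, v_B = −5, and w_A = w_B = −7 (the parameters of the sketches above), coordination is first-best, integration yields S = 4, both forms of nonintegration yield noncoordination (S = 0 and S = −11.5), and delegation coordinates only by costly reversal (S ≈ −0.2), so integration is uniquely optimal. In both configurations delegation is dominated, consistent with Proposition 4.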
FIGURE II
Delegation may, however, be superior to either nonintegration or integration in a world of uncertainty. For delegation to be better, it is important that θ̄ > θ. To see this, note that if θ̄ = θ, (18) implies (20), and (20) and (11) are equivalent. Thus, cases B and C above are both ones where (11) holds. A comparison of cases B–D and (12) then shows that the outcome under integration with delegation is identical to that under integration. From now on, therefore, we assume that θ̄ > θ. Assume that payoffs are drawn from a commonly known probability distribution and are observed by both parties ex post (there is symmetric information). To understand how delegation can be strictly optimal, it is useful to focus on the special case where w_A = w_B = w. Also, write v = (1/2)(v_A + v_B). Then the first-best condition for coordination, (6), is v ≥ |w|, where |·| denotes absolute value. If v ≤ 0, all organizational forms—nonintegration, integration, and delegation—yield the same outcome: noncoordination. So assume that v > 0. Then the condition for coordination without reversal under delegation (reluctant coordination) becomes θ v ≥ |w|, whereas the condition for coordination with reversal under delegation (forced coordination) becomes v > θ̄|w|. In contrast, the condition for coordination under integration can be written as v ≥ θ|w|. The situation is illustrated in Figure II, where w is fixed and v varies. For low values of v, v ≤ θ|w|, there is no coordination under integration or delegation. For values of v above θ|w|, there is coordination under integration. In contrast, under
delegation, v has to reach θ̄|w| before coordination occurs. The good news about delegation relative to integration, then, is that, in the range θ|w| ≤ v ≤ θ̄|w|, it achieves a more efficient outcome. The bad news is that, in the range θ̄|w| ≤ v ≤ |w|/θ, delegation achieves coordination, but with higher shading costs because reversal is required. It is fairly clear when delegation will dominate integration. Suppose that the probability distribution of v is such that v is either in the range θ|w| ≤ v ≤ θ̄|w| or in the range v ≥ |w|/θ. Then delegation achieves noncoordination when this is efficient, and coordination when this is efficient; moreover, the shading costs are low when coordination occurs because reversal is not required. In contrast, under integration coordination would occur also when it is inefficient—that is, in the range θ|w| ≤ v ≤ θ̄|w|. The intuition is simple. Delegation can be a good way for the boss to commit not to intervene when this is inefficient, given that the costs of intervening, that is, reversal, are high. Note, finally, that over the range where integration with delegation is superior to integration without delegation, integration with delegation will also be superior to nonintegration if, when the gains from coordination are large, they are unevenly divided.

V. PLATFORM LEADERSHIP AND STANDARDS—CISCO’S PURCHASE OF STRATACOM

In this section we describe a context where we think our approach, broadly interpreted, is particularly relevant—the struggle for platform leadership in the network industry. We use Cisco as an example, because Bunnell (2000) (as well as Gawer and Cusumano [2002]) provides a detailed, informative account of Cisco’s acquisition strategy. We illustrate this strategy with Cisco’s acquisition of StrataCom. Standards are very important in rapidly evolving industries such as information and communication technology. The social benefits from a common standard can be huge, but getting independent parties to agree to a standard is often difficult, because the benefits from adopting a single standard tend to be unevenly distributed. Instead, standards are often supported through self-enforcing, multilateral cross-licensing agreements and industry consortia. Naturally, the players owning key technological platforms have a disproportionate say in the determination of standards,
sometimes to the extent that they may be able to dominate the evolution of the industry. Therefore, the rewards from winning the battle for platform leadership are huge (Gawer and Cusumano 2002) and result in complex strategic games among the contenders. In these games, acquisition strategies play an important role, for reasons that our model captures at least in part. Cisco’s IOS is a technological platform that came to dominate the network industry in the course of the 1990s. Cisco had originally been successful and grown rapidly, thanks to its router technology, which served the core network of the Internet. Over time, IOS, designed to run the routers, became the de facto technology platform on which Cisco built its industry dominance (Gawer and Cusumano 2002, pp. 164–176). This was no accident. When John Chambers became the CEO of Cisco in 1992, his goal was to make Cisco “the architect of a new worldwide communication system for the twenty-first century” (Bunnell 2000, p. xv). The value of controlling the architecture of the network ecosystem was accentuated by the customers’ desire to buy end-to-end solutions that integrated the underlying technologies into a seamless user experience. Acquisitions played a key role in achieving Cisco’s goal. Under Chambers’s leadership, Cisco became a serial acquirer. Between 1993 and 2000, it bought a total of 71 companies—23 companies in 2000 alone. Most of the acquired companies were start-ups, bought to fill gaps in the expanding technological space that Cisco wanted to control. Arguably, the most critical acquisition that Cisco made in this period was the purchase in 1996 of StrataCom, the leading provider of a small, but rapidly expanding, new transmission technology, ATM. It is instructive to look at this acquisition in some detail. ATM was a new, cheaper non–router based technology that was very different from the packet-based router technology (Internet protocol) that IOS was built for. For ATM to work with Cisco equipment, IOS and ATM had to be made compatible. Integrating ATM into IOS meant a major change in Cisco’s leading industry platform. Deciding what to do about ATM became a big strategic decision for Cisco. The main concern was that ATM might eventually displace significant pieces of Cisco’s own router-based technology. Customers were keen to get ATM into their networks, because it was a more cost-effective technology. Even though the major ATM players (including StrataCom) were still small, they were growing fast. Cisco concluded that ATM had the potential to derail its
plans to be the architect of the networking industry and felt it had to respond. In terms of our model, Cisco had three main ways to respond to the ATM threat: a. Nonintegration without coordination. Cisco could decide not to make IOS and ATM compatible and hope that ATM would not take hold. ATM’s incompatibility with IOS would make it tough for ATM players to grow very large given IOS’s significant customer base, but Cisco could face a risky and costly battle that it might lose. b. Nonintegration with coordination. Cisco could make IOS and ATM compatible without a major acquisition such as the purchase of StrataCom. (Cisco had already bought Lightstream, a smaller ATM player, as a safety play, but this had worked out poorly, because of skeptical customer reception; Lightstream’s size was too insignificant and customers were not sure that Cisco would support the technology in the long run—a valid concern, as it turns out.) This strategy would require Cisco to work with the leading ATM firms, making it much easier for ATM to grow and usurp Cisco’s technology. In fact, three years earlier, Cisco had made an agreement with StrataCom and AT&T to collaborate on the definition of standards and the development of products for ATM, but evidently these efforts did not work out. (In the context of our model, this agreement is probably best interpreted as “nonintegration with cooperation.”) c. Integration with coordination. Cisco could buy StrataCom (or some other major ATM player), make IOS and ATM compatible internally, and become an industry leader in the ATM market. This would support Cisco’s ambitions to be the architect of the network industry. By holding the decision rights to both technologies, Cisco could determine how the two technologies should be integrated to provide a seamless customer experience and maximize overall surplus—much of which would flow into Cisco’s pockets, of course, if it could win the platform game. Cisco chose option c, the same strategy that it had successfully followed when the switching technology became a threat and it bought Crescendo. Cisco paid $4.7 billion for StrataCom— by far the most expensive acquisition that it had made until then and an incredibly high price for a start-up with modest earnings.
Nevertheless, Cisco’s stock price jumped 10% on the announcement of the deal. (It seems plausible that Cisco had the bargaining power in the acquisition—Cisco had several alternatives to StrataCom, whereas StrataCom had few alternatives to Cisco.) How well does this case fit our model? The value of the deal makes clear that significant joint benefits from coordination were anticipated. Integrating ATM and IOS seamlessly, and in a way that maximized the joint benefits of Cisco and StrataCom rather than those of the whole industry, would give Cisco and StrataCom a much better shot at winning the platform game. Next one has to ask whether coordination would have been feasible across the market. As noted in the description of option b, coordination across the market appeared difficult. We surmise that the reason was the reluctance of StrataCom, the dwarf in the relationship, to choose Y , because this would have tilted the playing field too much in favor of the giant Cisco. Arguably, option b failed because of an uneven split of the surplus, a key driver in our model. 25 Our analysis emphasizes that private benefits also should be considered in making strategic decisions. Embracing the new ATM technology met with much internal resistance at Cisco, because Cisco had been “emphatically biased toward IP [technology]” (Bunnell 2000, p. 84). Also, Cisco’s sales force disliked ATM, because it was a less sophisticated, cheaper technology, which resulted in lower commissions (Bunnell 2000, p. 85). The private losses on StrataCom’s side were probably small, and there may even have been private gains (in contrast to (2)), given that StrataCom’s technology was adopted. One common reason that entrepreneurial firms sell out to a large player like Cisco (besides the money they get from selling their shares) is that access to a huge customer base brings their projects onto a large stage quickly, enhancing the private benefits enjoyed from the development and increased recognition of their product. Seeing their product succeed on a large scale can be a big source of satisfaction for entrepreneurs. 25. One possibility that we have not considered is that Cisco and StrataCom could have entered into some sort of profit sharing agreement to align incentives. Given that Cisco and StrataCom were both public companies at the time, profit sharing was obviously feasible. We ruled out profit sharing in our basic model by supposing that there is 100% diversion of monetary profit. In reality, profit sharing may not have been a very effective way of aligning the incentives of Cisco and StrataCom, because of the big difference in company size and substantial uncertainty about payoffs.
Cisco’s acquisition strategy, and the rules that Cisco used to select its favored partners, make clear that Cisco was sensitive to the issue of private benefits. Chambers’ five criteria for partners were these: a common vision; cultural compatibility; a quick win for the shareholders; a long-term win for all constituencies; and geographic proximity (Bunnell 2000, p. 65). Chambers also went to great length to avoid alienating employees of the acquired company, partly, we may assume, to minimize shading.26 His strategy was to allow acquired firms to stay as independent as possible within Cisco to retain the spirit of entrepreneurship. Typically, a newly acquired firm only had to make its products compatible with IOS and submit to the purchase and sales systems in Cisco. Otherwise it was largely free to pursue its own agenda. The commitment worked: Cisco had a reputation for being a benevolent, well-liked acquirer. The Mario rule illustrates Chambers’ efforts to protect employees from the acquired company (Bunnell 2000, p. 37). The rule, named after the CEO of Crescendo, Mario Mazzola, stated that no employee of a newly acquired company could be terminated without the consent of Chambers and the CEO of the acquired company. We interpret the Mario rule as a form of delegation (regarding decision rights other than coordination). Interestingly, Cisco abandoned this rule after the dot-com crash in 2000, when it was forced to lay off thousands of employees because of the deep recession in the IT industry. Evidently, delegated rights are not as secure as ownership rights, but they are not valueless either, a distinction that fits our delegation model well. It is worth asking whether traditional, holdup–based property rights theories fit the Cisco story as well or better than ours. In hold-up models as well as in our model, there is concern about being locked in and becoming unduly dependent on an outsider— for a service or a key element in one’s strategy. It is clear that there are hold-up concerns in this broad sense also in the Cisco– StrataCom deal. But we do think the essence of the deal was less about hold-ups in the sense of financial extraction—the hallmark of traditional hold-up models—and much more about the 26. Another important motive for not alienating employees is to prevent them from quitting. Employees may quit because they are disgruntled or because they have better prospects elsewhere, or for a combination of these reasons. Although quitting is not part of our formal model, it could be incorporated into a multiperiod version (see also footnote 22).
ability to control the path of the ATM-IOS integration and its successful development. This is supported by the whole rationale for Cisco’s acquisition strategy. In Chambers’s own words: “With a combination of IP (Internet protocol) routing and ATM we can define the Internet of the future” (Bunnell 2000, p. 88). Also, the five key criteria for acquisitions seem to have little to do with traditional hold-up stories, but they, together with the meticulous attention to employees in acquired firms, bear witness to the great significance of private benefits. VI. CONCLUSIONS In the traditional property rights model, asset ownership affects incentives to invest in human capital, but not ex post outcomes conditional on these investments. In our model, decision rights directly affect what happens ex post. Our structure is in many ways close to the traditional view of the firm as a technologically defined entity that makes decisions about inputs, outputs, and prices. The difference is that our firm does not necessarily maximize profits, either because a boss cares directly about nontransferable private benefits or because the boss is forced to internalize them given that employees can shade. It is this relatively small wrinkle in the traditional model that opens the door to a discussion of boundaries. The aggrievement approach of Hart and Moore (2008) has two important benefits relative to models based on ex post noncontractibility. First, aggrievement plays a central role in explaining the need for an initial choice of ownership: without aggrievement costs (i.e., setting θ = 0), one could equally well choose the optimal ownership structure ex post. Second, in a dynamic model with uncertainty, one would expect to see continuous reallocations of decision rights in the absence of aggrievement. Aggrievement brings a natural source of inertia into dynamic models. That this source of inertia is empirically relevant is suggested by Cisco’s concern for cultural fit—reorganization can make employees aggrieved, sometimes so much that acquisitions will not happen. Inertia is also what makes delegation distinct from ownership. How one allocates decision rights within the firm will make a difference. Firms do a lot of internal restructuring and many carry out major restructurings several times a decade in response to changes in their strategic situation. These restructurings have
powerful effects not only on how the organization operates, but also on how employees feel. Restructurings do not come without a cost. Our approach could be fruitful for analyzing internal organization and restructurings. One of the features of our current model is that the outcome of integration does not depend on whether firm A takes over firm B or the other way around. But this is true only because of our assumption that the integrated firm is always run by a professional manager. As we discussed in Section III, this is not the only possibility. If firm A acquires firm B and the manager of firm A becomes the boss of the integrated firm the integrated firm’s decisions and direction will undoubtedly reflect manager A’s preferences, private benefits, and views of the world, and vice versa if the manager of firm B becomes the boss. Because a boss with skewed preferences is likely to take decisions that will cause aggrievement for employees with different preferences, our theory suggests that the cultural compatibility and fit of an acquisition partner may be of first-order importance, something that we saw in Section V is consistent with Cisco’s strategy and experience. Our model does not currently have workers. However, we could interpret a manager’s private benefits as reflecting an alignment of preferences with the workers resulting either from shared interests or from a concern for the workers’ well-being. To pursue this line further, it would be worthwhile thinking about what makes bosses biased toward their workers. One force is that sustained contact with workers fosters friendship and empathy. Wrestling with the same problems, sharing the same information, and having a similar professional background are all conducive to a common vision that aligns interests, particularly on issues such as the strategic direction of the firm. Shleifer and Summers (1988) argue that it may be an efficient long-run strategy for a firm to bring up or train prospective bosses to be committed to workers and other stakeholders (on this, see also Rotemberg and Saloner [1994, 2000]; Blair and Stout [1999]). Milgrom and Roberts (1988) argue that frequent interaction gives workers the opportunity to articulate their views and influence the minds of their bosses, sometimes to the detriment of the firm. All these explanations are consistent with our assumption that the boss of a firm with broad scope will put less weight on private benefits than the boss of a firm with narrow scope. With a broader range of activities, the firm’s workforce will be more heterogeneous, making the boss experience less empathy for any given group. The intensity of
contact with any particular group will go down, reducing the ability of that group’s workers to influence the boss.27 Let us observe, finally, that giving private benefits a pivotal role in the analysis moves the focus of attention away from assets toward activities in the determination of firm boundaries. It is remarkable how few practitioners, organizational consultants, or researchers studying organizations within disciplines other than economics (e.g., sociology and organizational behavior) ever talk about firms in terms of asset ownership. For most of them a firm is defined by the things it does and the knowledge and capabilities it possesses. Coase (1988) makes clear that he too is looking for “a theory which concerns itself with the optimum distribution of activities, or functions, among firms” (p. 64). He goes on to say that “the costs of organizing an activity within any given firm depend on what other activities the firm is engaged in. A given set of activities will facilitate the carrying out of some activities but hinder the performance of others” (p. 63). The model we have proposed is in this spirit. In our analysis, asset ownership is the means for acquiring essential control rights, but the underlying reason that such control rights are acquired in the first place is that activities need to be brought together under the authority of one boss in order to accomplish strategic goals, such as sharing the same technological platform. HARVARD UNIVERSITY AND NATIONAL BUREAU OF ECONOMIC RESEARCH MASSACHUSETTS INSTITUTE OF TECHNOLOGY AND NATIONAL BUREAU OF ECONOMIC RESEARCH
REFERENCES Aghion, Philippe, Mathias Dewatripont, and Patrick Rey, “Transferable Control,” Journal of the European Economic Association, 2 (2004), 115–138. Aghion, Philippe, and Jean Tirole, “Formal and Real Authority in Organizations,” Journal of Political Economy, 105 (1997), 1–29. Alonso, Ricardo, Wouter Dessein, and Niko Matouschek, “When Does Coordination Require Centralization?” American Economic Review, 98 (2008), 145–179. Baker, George, Robert Gibbons, and Kevin J. Murphy, “Informal Authority in Organizations,” Journal of Law, Economics, and Organization, 15 (1999), 56–73. ——, “Strategic Alliances: Bridges between ‘Islands of Conscious Power,”’ Journal of the Japanese and International Economies, 22 (2008), 146–163. Blair, Margaret M., and Lynn A. Stout, “A Team Production Theory of Corporate Law,” Virginia Law Review, 85 (1999), 247–328. Bolton, Patrick, and Mathias Dewatripont, Contract Theory (Cambridge, MA: MIT Press, 2005). 27. Note that a boss who can divert less than 100% of profits for private gains will put relatively more weight on worker preferences in all cases discussed above.
Brusco, Sandro, and Fausto Panunzi, “Reallocation of Corporate Resources and Managerial Incentives in Internal Capital Markets,” European Economic Review, 49 (2005), l659–1681. Bunnell, David, Making the Cisco Connection: The Story behind the Real Internet Superpower (New York: Wiley, 2000). Coase, Ronald Harry, “Industrial Organization: A Proposal for Research,” in The Firm, the Market and the Law, Ronald Coase, ed. (Chicago: University of Chicago Press, 1988). Gawer, Annabelle, and Michael A. Cusumano, Platform Leadership: How Intel, Microsoft, and Cisco Drive Industry Innovation (Boston: Harvard Business School Press, 2002). Grossman, Sanford J., and Oliver D. Hart, “The Costs and Benefits of Ownership: A Theory of Vertical and Lateral Integration,” Journal of Political Economy, 94 (1986), 691–719. Hart, Oliver D., Firms, Contracts, and Financial Structure (Oxford, UK: Oxford University Press, 1995). ——, “Reference Points and the Theory of the Firm,” Economica, 75 (2008), 404– 411. ——, “Hold-up, Asset Ownership, and Reference Points,” Quarterly Journal of Economics, 124 (2009), 301–348. Hart, Oliver D., and John Moore, “Property Rights and the Nature of the Firm,” Journal of Political Economy, 98 (1990), 1119–1158. ——, “On the Design of Hierarchies: Coordination versus Specialization,” Journal of Political Economy, 113 (2005), 675–702. ——, “Contracts as Reference Points,” Quarterly Journal of Economics, 123 (2008), 1–48. Holmstrom, Bengt, “The Firm as a Subeconomy,” Journal of Law, Economics, and Organization, 15 (1999), 74–102. Holmstrom, Bengt, and John Roberts, “Boundaries of the Firm Revisited,” Journal of Economic Perspectives, 12 (1998), 73–94. Inderst, Roman, and Christian Laux, “Incentives in Internal Capital Markets: Capital Constraints, Competition, and Investment Opportunities,” RAND Journal of Economics, 36 (2005), 215–228. Klein, Benjamin, “The Economic Lessons of Fisher Body–General Motors,” International Journal of the Economics of Business, 14 (2007), 1–36. Klein, Benjamin, Robert G. Crawford, and Armen A. Alchian, “Vertical Integration, Appropriable Rents, and the Competitive Contracting Process,” Journal of Law and Economics, 21 (1978), 297–326. Mailath, George J., Volker Nocke, and Andrew Postlewaite, “Business Strategy, Human Capital and Managerial Incentives,” Journal of Economics and Management Strategy, 13 (2004), 617–633. Milgrom, Paul, and John Roberts, “An Economic Approach to Influence Activities in Organizations,” American Journal of Sociology, 94 (Supplement) (1988), S154–S179. Rajan, Raghuram, Henri Servaes, and Luigi Zingales, “The Cost of Diversity: The Diversification Discount and Inefficient Investment,” Journal of Finance, 55 (2000), 35–80. Rantakari, Heikki, “Governing Adaptation,” Review of Economic Studies, 75 (2008), 1257–1285. Rotemberg, Julio J., and Garth Saloner, “Benefits of Narrow Business Strategies,” American Economic Review, 84 (1994), 1330–1349. ——, “Visionaries, Managers and Strategic Direction,” RAND Journal of Economics, 31 (2000), 693–716. Scharfstein, David S., and Jeremy C. Stein, “The Dark Side of Internal Capital Markets: Divisional Rent-Seeking and Inefficient Investment,” Journal of Finance, 55 (2000), 2537–2564. Schoar, Antoinette, “Effects of Corporate Diversification on Productivity,” Journal of Finance, 57 (2002), 2379–2403. Shleifer, Andrei, and Lawrence H. Summers, “Breach of Trust in Hostile Takeovers,” in Corporate Takeovers: Causes and Consequence, A. J. Auerbach, ed. (Chicago: University of Chicago Press, 1988).
Stein, Jeremy C., “Internal Capital Markets and the Competition for Corporate Resources,” Journal of Finance, 52 (1997), 111–133. ——, “Information Production and Capital Allocation: Decentralized vs. Hierarchical Firms,” Journal of Finance, 57 (2002), 1891–1921. Van den Steen, Eric, “Organizational Beliefs and Managerial Vision,” Journal of Law, Economics, and Organization, 21 (2005), 256–283. ——, “Interpersonal Authority in a Theory of the Firm,” American Economic Review, forthcoming, 2009. Williamson, Oliver E., Markets and Hierarchies: Analysis and Antitrust Implications (New York: Free Press, 1975). ——, The Economic Institutions of Capitalism (New York: Free Press, 1985).
THE (PERCEIVED) RETURNS TO EDUCATION AND THE DEMAND FOR SCHOOLING∗ ROBERT JENSEN Economists emphasize the link between market returns to education and investments in schooling. Though many studies estimate these returns with earnings data, it is the perceived returns that affect schooling decisions, and these perceptions may be inaccurate. Using survey data for eighth-grade boys in the Dominican Republic, we find that the perceived returns to secondary school are extremely low, despite high measured returns. Students at randomly selected schools given information on the higher measured returns completed on average 0.20–0.35 more years of school over the next four years than those who were not.
I. INTRODUCTION

How important are the returns to education in determining schooling decisions? Do students have accurate information about these returns when they choose whether to continue schooling? Becker's canonical model of human capital views education as an investment, where costs are compared to the discounted stream of expected future benefits, primarily in the form of greater wages. However, although there is a large literature estimating the returns to schooling with earnings data, as pointed out by Manski (1993), it is the returns perceived by students and/or their parents that will influence actual schooling decisions. Given the great difficulties in estimating the returns encountered even by professional economists using large data sets and advanced econometric techniques, it seems likely that typical students make their schooling decisions on the basis of limited or imperfect information. In this setting, there is little reason to expect the level of education chosen to be either individually or socially efficient. This possibility is particularly important to consider for developing countries, where educational attainment remains persistently low despite high measured returns. For example, in the Dominican Republic, although 80%–90% of youths complete
∗ I would like to thank Christopher Avery, Pascaline Dupas, Eric Edmonds, Andrew Foster, Alaka Holla, Geoffrey Kirkman, Nolan Miller, Kaivan Munshi, Meredith Pearson, Richard Zeckhauser, Jonathan Zinman, Larry Katz, and four anonymous referees for useful comments. I would also like to thank Eric Driggs, Jason Fiumara, Zachary Jefferson, Magali Junowicz, Yesilernis Peña, Louisa Ramirez, Rosalina Gómez, Alexandra Schlegel, and Paul Wassenich for valuable research assistance. Assistance and financial support from the Fundación Global Democracia y Desarrollo (FUNGLODE) and President Leonel Fernández are gratefully acknowledged.
© 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, May 2010
(compulsory) primary schooling, only about 25%–30% complete secondary school (Oficina Nacional de Estadística, República Dominicana 2002). Yet the mean earnings of workers who complete secondary school are over 40% greater than those of workers who only complete primary school1 (estimates are from a survey conducted by the author in 2001, explored in more detail below). There are of course many potential explanations for this puzzle, such as poverty and credit constraints, high discount rates, or simply mismeasured returns on the part of the researcher (i.e., education is low because the true returns are low). However, if underestimates of the returns and thus a low demand for schooling are limiting factors for at least some subset of individuals, simply providing information on the returns may be the most cost-effective strategy for increasing their education. In this paper, we conduct an experimental intervention among eighth-grade male students in the Dominican Republic to test this hypothesis. A handful of studies for the United States have found that high school seniors and college students are relatively well informed of the returns to a college education (Smith and Powell 1990; Betts 1996; Dominitz and Manski 1996; Avery and Kane 2004; Rouse 2004).2 However, there is reason to believe that students and/or their parents in low-income countries may not be as well informed about the returns. For example, the decision to drop out of school is often made at a much younger age, when students have less information about the returns. And schools typically do not have guidance counselors to provide information about the returns.3 Further, in general there may just be little or no information available at all on earnings, because labor market data may not be collected as regularly or comprehensively by governments or private organizations, or because the results may not be as widely disseminated.4 As a result, the only data on earnings available to youths may come from the individuals
1. Assuming a discount rate of 0.05, the net present value of expected lifetime earnings, including forgone wages and the direct costs of schooling, is over 15% greater with secondary schooling.
2. Despite these apparently accurate perceptions, there is evidence that adolescents in the United States, as well as the United Kingdom and Canada, in effect "drop out too soon," forgoing substantial monetary and nonmonetary returns, perhaps because they ignore or heavily discount these returns (Oreopoulos 2007).
3. Betts (1996), for example, finds that over 60% of the U.S. college seniors surveyed reported using their school's career services center to obtain information about job prospects by field of study.
4. For example, for this study we had to conduct our own labor force survey to estimate the returns in the Dominican Republic because no data were available at the time, nor were there any available published estimates of the returns.
they can observe around them,5 which could lead to inaccuracies. For example, youths in rural communities or small towns where few or no adults have any education will have little information from which to infer the returns, including the potential returns in the urban sector. In addition, if students rely almost exclusively on the earnings of workers in their own communities in forming their expectations of earnings, residential segregation by income could lead to underestimates of the returns to schooling. Although all these factors make it unlikely that youths in low-income countries have accurate information on the returns to schooling, until recently there has been no evidence on the perceived returns for such countries; the present paper, alongside Attanasio and Kaufmann (2008) and Kaufmann (2008) for Mexico and Nguyen (2008) for Madagascar, has begun to fill this gap. The possibility that decision makers may not be well informed has been explored in several other areas of economic behavior.6 However, only a handful of studies have examined whether providing information in these settings can change behavior. Dupas (2009) finds that providing age-disaggregated information on HIV prevalence rates affects the incidence of risky sexual behavior among girls in Kenya. Duflo and Saez (2003) find that retirement plan decisions respond to being given incentives to attend a session providing benefits information, and Hastings and Weinstein (2008) find that providing parents with simpler, more transparent and relevant information such as average test scores and admissions probabilities can affect school choice. Finally, applying a strategy similar to that used in the present paper,7 Nguyen (2008) finds that providing parents in Madagascar with information on the returns to schooling improves their children's school performance and attendance in the first few months following the
5. For example, over 70% of students in our survey reported that their main source of information about earnings was the people they knew in their community. By contrast, Betts (1996) reports that the most widely used source of information on employment prospects among college students was newspapers and magazines.
6. For example, many studies find that individuals underestimate the costs of borrowing (see Stango and Zinman [2007] for examples) or are poorly informed of their own pension or social security benefits (Mitchell 1988; Gustman and Steinmeier 2005; Chan and Stevens 2008). Viscusi (1990) finds that individuals overstate the risks of lung cancer from smoking, and that these misperceptions actually reduce smoking behavior. McKenzie, Gibson, and Stillman (2007) find that potential emigrants underestimate the returns to migration.
7. Nguyen's paper extends the approach by considering the potential value of role models instead of, or in addition to, simply providing information on the returns, as in the present study (though she ultimately concludes that information alone appears in general to be the most effective strategy).
intervention. An advantage of the present study is that we follow students over a four-year period, so we can assess the long-term impact of this kind of intervention. Using data from a panel survey of boys in the Dominican Republic in the eighth grade, the last year of compulsory schooling and the point at which most students terminate their education, we find that perceptions of the returns to secondary schooling are extremely low for most students, especially relative to returns measured with earnings data. Although many factors may affect or limit school attendance, such as poverty and credit constraints, these results raise the possibility that for at least some youths, school dropout may be the result of low demand due to low perceived returns. Thus, students at a randomly selected subset of schools were provided information on the returns estimated from earnings data. Relative to those not provided with information, these students reported dramatically increased perceived returns when re-interviewed four to six months later, and on average completed 0.20 more years of schooling over the next four years. And, consistent with the hypothesis that poverty and credit constraints limit schooling even when there is demand, we find that the program had a large effect among the least poor students, increasing schooling by 0.33 years, but no effect for the poorest students, despite the fact that both groups increased perceived returns by the same amount. The remainder of this paper proceeds as follows: Section II discusses the data and experimental design and explores both the accuracy of student perceptions and whether measured perceptions predict actual schooling. Section III presents the results of the experiment, and Section IV discusses the policy implications and concludes. II. DATA AND METHODOLOGY II.A. Data To estimate the returns to education, we conducted a household-based income survey in January 2001.8 The survey of 1,500 households was conducted nationwide, but only in nonrural areas (comprising about two-thirds of the population) because of the greater difficulty in estimating earnings for agricultural 8. At the time the study began, there were no publicly available microdata on income.
households. The household sample was drawn in two stages. First, from the thirty largest cities and towns, we chose 150 sampling clusters at random,9 with the number of clusters chosen in each town approximately proportional to that town’s share of the combined population of the thirty cities/towns.10 A listing of all dwellings in the cluster was then made, and twenty households were drawn at random from each cluster. The questionnaire gathered information on education, employment and earnings, and background demographic and socioeconomic characteristics for all adult household members. We will discuss the estimated returns to schooling using these data in more detail in Section II.D and the Online Appendix; for now, we note that the mean monthly earnings (including both workers and nonworkers)11 among men thirty to forty years old (the group whose earnings will form the basis of our experiment) expressed in nominal 2001 Dominican pesos (RD$; RD$1 ≈ 0.06US$ in January 2001) are RD$4,479 for those who completed secondary school (only) and RD$3,180 for those who completed primary school (only). The RD$1,299 difference represents an approximately 41% overall return, or about 8% per additional year of schooling (provided there is no “sheepskin effect” or discrete jump at year twelve). For the student survey, for each of the 150 household sample clusters, we selected the school where students from that cluster attend eighth grade.12 From each school, during April and May of 2001, we interviewed fifteen randomly selected boys13 enrolled in eighth grade, the final year of primary school and therefore the point right before the very large declines in enrollment.14 All 9. Cities and towns were divided into a set of clusters with the help of community leaders and government officials. 10. For greater geographic variation, we undersampled the capital, Santo Domingo. The city contains roughly 45% of the total population of the thirty cities/towns but is only about 25% of our sample. 11. About 8%–10% of both groups (slightly higher for the primary school group) reported no earnings in the past month. However, the earnings gap by education is not substantially different if we focus on employed workers. 12. In six cases, two clusters primarily used the same school; for these cases, we also chose the nearest alternate school. 13. We did not interview girls because of difficulties in eliciting expected earnings. Due to a low female labor force participation rate in the Dominican Republic (about 40%), in focus groups most girls were unwilling to estimate their expected earnings because they felt they would never work. 14. Students were randomly selected from lists of currently enrolled students and interviewed individually at the school. If a student was not present on the day of the interview, enumerators returned to the school the following day, and then contacted the student at home if he was still not available. Fifty-eight students were interviewed in their homes, primarily due to extended illness. Students were not compensated for their participation.
2,250 students in the study were administered a survey gathering information on a variety of individual and household characteristics, as well as some simple questions on expected earnings by education (discussed below). A second survey of the students was conducted after the beginning of the next academic year (October 2001), with respondents interviewed again (at home, school, or work) about perceived returns to education and current enrollment status. In addition, at this time parents were also interviewed to gather additional information on socioeconomic status, including household income. A third-round follow-up survey on schooling was also conducted in May and June of 2005, by which time students should have been finishing their last years of secondary school; for the approximately 120 students who were still enrolled in 2005 but not yet through their final years of school (due primarily to grade repetition), we conducted follow-ups for each of the next two years. For all follow-up surveys, if the respondents could not be found after two attempts, their parents, siblings, or other relatives were interviewed about the youths' enrollment status. If these relatives also could not be located, neighbors were interviewed about the youths. Overall, we were able to obtain follow-up information in the October 2001 survey directly from 93% of youths, with 2% from relatives and 5% from neighbors. By the 2005 survey, this had changed to 89% from youths, 4% from relatives, and 7% from neighbors. In all cases, we attempted to verify educational attainment by contacting the school that students were reported to be attending or had attended. We were able to do so for 97% of students in the second-round survey and 91% in the third round. Quantifying perceptions of the returns to education is difficult, especially with young respondents (valuable summaries of methods for eliciting expectations for a range of outcomes can be found in Manski [2004] and Delavande, Giné, and McKenzie [2008], the latter focusing more on approaches applied in low-income countries). Therefore, the survey asked only some simple questions about perceived earnings, based on Dominitz and Manski (1996), though much more limited. In particular, students were asked to estimate what they expected they might earn under three alternative education scenarios: Suppose, hypothetically, you were to complete [this school year/secondary school/university], and then stop attending school. Think about the kinds of
jobs you might be offered and that you might accept. How much do you think you will earn in a typical week, month or year when you are about 30 to 40 years old?
Students were also asked to estimate the earnings of current thirty- to forty-year-old workers with different levels of education: Now, we would like you to think about adult men who are about 30 to 40 years old and who have completed only [primary school/secondary school/university]. Think not just about the ones you know personally, but all men like this throughout the country. How much do you think they earn in a typical week, month or year?
Although own expected earnings are likely to be the relevant criterion for decision making, this second set of questions was included to measure perceptions of earnings that are purged of any beliefs students may have about themselves, their households, or their communities, such as the quality of their school or their own ability, or beliefs about factors such as race in determining earnings. The two sets of questions can therefore be used to determine in part whether students' perceptions differ from measured returns because (a) they have poor or inaccurate information on prevailing wages in the labor market (as captured by the second set of questions), or (b) they have information or beliefs about themselves (correct or incorrect) that influence what they expect earnings will be for them personally (as captured by the first set of questions). For example, even if some students believe (perhaps correctly) that they would not gain from education because of labor market discrimination based on race or because they attend a low-quality school, these beliefs about their personal returns should not be reflected in their perceptions of the average earnings of other workers. These simple questions have several obvious and significant limitations.15 First, they are not precise in specifying the meaning of "expected" earnings, such as referring to the mean, median, or mode.16 In addition, they do not elicit perceived uncertainty (unlike, for example, Dominitz and Manski [1996], Attanasio and Kaufmann [2008], and Kaufmann [2008]) or the lifetime profile of earnings, nor do they address expectations of inflation.17 Finally, the questions deal with abstract, hypothetical situations, are stated in fairly formal language, and are slightly lengthy and complicated; as a result, about 10% of students did not provide responses to these questions, or responded "don't know." Given the ages of the students and their degree of math literacy, these various limitations could not be overcome in our field testing.18 Thus, we do not view these as perfect measures of youths' actual decision-making criteria, nor will we rely on them for our primary analysis (the impact of the information intervention on schooling outcomes). We present these data simply as a way of quantifying as well as possible the impressions of low perceived returns revealed in prestudy focus group discussions and to provide motivation for the intervention.
15. Our approach to eliciting expectations is similar to that of Nguyen (2008), but differs from those of Attanasio and Kaufmann (2008) and Kaufmann (2008). The latter two instead ask what individuals expect is the maximum and minimum they might earn under different education scenarios, as well as the probability of earning more than the midpoint of these two. With an assumption on the distribution of expectations, these data can be used to estimate various moments of the distribution.
16. Though even if these more precise definitions could have been elicited, it is unclear which quantity students actually use in decision making. The wording was intended to elicit as well as possible the level of earnings students expect or associate with different levels of schooling. Delavande, Giné, and McKenzie (2008) discuss the weaknesses of this approach relative to more sophisticated strategies that, for example, elicit information on the distribution of expected earnings.
17. We are thus implicitly assuming that students are not taking inflation into consideration when providing expected future earnings. In focus groups during survey design, students did not reveal any awareness of the possibility of inflation. Further, the high correspondence between students' expected future earnings and their perceptions of current earnings (shown below) is consistent with ignoring inflation (though we can't rule out that they are considering inflation but also expecting other factors to lower the returns to schooling). Finally, we note that if students' responses incorporated expected inflation, this would lead them to report greater absolute differences in earnings by education than what they feel prevails at present; thus, inflation expectations could not account for students underestimating the returns to schooling, and would in fact lead to the opposite outcome (though it would of course leave the percent returns unchanged).
18. More recently, a number of studies have made progress on this problem by using visual or physical methods for eliciting expectations, such as asking respondents to assign a fixed number of objects (e.g., stones or beans) to a number of bins or categories representing different outcomes, with more objects to be allocated for outcomes perceived to be more likely. See Delavande, Giné, and McKenzie (2008) for a review.

II.B. The Intervention

At the end of the student survey, each respondent at a randomly selected subset of schools was given information on earnings by education from the household survey and the absolute and percent return implied by those values, as reported above:

Before we end, I would like to provide you with some information from our study. In January, we interviewed adults living in this community and all over the country. We asked them about many things, including their earnings and education. We found that the average earnings of a man 30 to 40 years old with only a primary school education was about 3,200 pesos per month. And the average income of a man the same age who completed secondary school, but did not attend university, was about 4,500 pesos per month. So the difference between workers with and without secondary school is about 1,300 pesos per month; workers who finish secondary school earn about 41 percent more than those who don't. And people who go to university earn about 5,900 pesos per month, which is about 85 percent more than those who only finish primary school.
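The percent returns quoted in the script can be reproduced directly from the survey means reported in Section II.A. The short Python sketch below is ours and purely illustrative; in particular, treating secondary school as four additional years of schooling for the per-year figure is our assumption, not a statement from the survey.

    from math import log

    # Measured means from the January 2001 household survey (RD$ per month)
    primary, secondary = 3180, 4479
    print(secondary - primary)              # 1,299: the absolute gap
    print((secondary - primary) / primary)  # ~0.41, the "41 percent" overall return
    print(log(secondary / primary) / 4)     # ~0.086, roughly 8% per assumed year of secondary school

    # Rounded figures read to students in the script above
    print((4500 - 3200) / 3200)             # ~0.41
    print((5900 - 3200) / 3200)             # ~0.84, quoted as "about 85 percent"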
Although the statement is again perhaps a bit lengthy, formal, and complicated, the training of enumerators stressed that it was essential to emphasize the key elements of the statement, namely the earnings levels by education and the difference between them, by repeating them a second time after the statement was read to make sure students understood the findings (students were then also invited to ask any questions about the data and results that they might have). We chose to provide the simple difference in mean earnings by education rather than estimates adjusted for other controls or using instrumental variables. As shown in Section II.D, these other approaches yield similar estimates of the returns, which are also broadly comparable to those found in other studies of the Dominican Republic and similar countries. Therefore, we chose to provide the information that would be easiest for students to understand. Randomization was conducted blindly by the author, with each school having an equal likelihood of selection into the treatment and control groups. Compliance with randomization was ensured by providing enumerators with treatment-specific questionnaires (i.e., the questionnaires provided to enumerators visiting treatment schools included the paragraph above, and those provided to enumerators visiting control schools did not) and random auditing through visits during the survey process. Assignment of the treatment was done at the school level rather than for individual students within schools because students in the same school are likely to communicate, which would contaminate the control group. We cannot rule out that communication across schools occurred, though to the extent that such contamination took place, the true effect of the treatment would likely be even greater than what we estimate.
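As an illustration of the assignment procedure described above, the following sketch draws a school-level treatment indicator with equal probability for each school. It is a hypothetical reconstruction: the school identifiers, the seed, and the exact even split are our assumptions, since the text states only that randomization was blind and conducted at the school level.

    import random

    def assign_schools(school_ids, seed=2001):
        """Shuffle schools and split them evenly into treatment and control."""
        rng = random.Random(seed)
        shuffled = list(school_ids)
        rng.shuffle(shuffled)
        cutoff = len(shuffled) // 2
        return {s: ("treatment" if i < cutoff else "control")
                for i, s in enumerate(shuffled)}

    schools = ["school_%03d" % i for i in range(1, 151)]   # one school per sampling cluster
    assignment = assign_schools(schools)
    print(sum(status == "treatment" for status in assignment.values()))  # 75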
524
QUARTERLY JOURNAL OF ECONOMICS
Table I provides means and standard deviations for key variables from the baseline survey (plus income, which was measured in Round 2); all estimates in this and subsequent tables are weighted to be representative of the thirty largest cities and towns. The average eighth-grade youth in our sample is just over 14 years old in the baseline survey and performs about average at school (as reported by teachers on a scale of 1 to 5—1. Much worse than average; 2. Worse than average; 3. Average; 4. Above average; 5. Much better than average). The average household income is approximately RD$3,500 per month, and 38% of youths have fathers who finished high school. At baseline, students expect earnings at age 30–40 of RD$3,516 if they only finish primary school and RD$3,845 if they finish secondary school, both of which are slightly greater on average than what they believe current workers aged 30–40 with those levels of education earn. Table I also shows that there were no systematic differences in these baseline covariates for treatment and control groups. The differences are all small (less than 3% in all cases), and none are statistically significant. Thus, the randomization appears to have been successful in creating comparable samples of students in the treatment and control groups with respect to observable characteristics. II.C. Do Expectations Predict Schooling? Despite the limitations of the measures of expected earnings noted above, it is worth exploring whether they predict schooling. The first columns of Table II show regressions where the dependent variables are three measures of educational outcomes: whether the child returned to school for the academic year following the Round 1 survey (i.e., entered secondary school), whether he finished high school, and years of schooling (the latter two are measured as of Round 3, four years later). The independent variable of interest is the baseline implied perceived returns to secondary schooling for Round 1 (expressed here in thousands of 2001 Dominican pesos, RD$1,000), constructed as the difference between own expected earnings with secondary (only) and own expected earnings with primary (only) at age 30–40. For now, we use only the control group, because the information given later as part of the experimental intervention may cause students to update their expectations, weakening the link between baseline perceptions and eventual schooling.
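To make the construction of the key regressor concrete, the sketch below computes the implied perceived return for a few invented student records, taking the difference between own expected earnings with secondary (only) and with primary (only) and expressing it in RD$1,000, with nonresponses dropped. The records and the handling of "don't know" answers are illustrative assumptions, not the study's data.

    def implied_perceived_return(own_primary, own_secondary):
        """Perceived gain from completing secondary school, in RD$1,000."""
        if own_primary is None or own_secondary is None:
            return None                      # "don't know" or missing response
        return (own_secondary - own_primary) / 1000.0

    # Invented example records (expected monthly earnings at age 30-40, RD$)
    students = [
        {"own_primary": 3500, "own_secondary": 3500},   # perceives no return
        {"own_primary": 3000, "own_secondary": 4500},
        {"own_primary": 4000, "own_secondary": None},   # nonresponse, dropped
    ]
    returns = [implied_perceived_return(s["own_primary"], s["own_secondary"]) for s in students]
    valid = [r for r in returns if r is not None]
    print(valid)                      # [0.0, 1.5]
    print(sum(valid) / len(valid))    # sample mean of the implied perceived return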
TABLE I
MEANS, STANDARD DEVIATIONS, AND TEST OF TREATMENT–CONTROL COVARIATE BALANCE

                                       All        Control    Treatment   Difference
Age                                    14.3       14.3       14.4        0.02
                                       [0.79]     [0.79]     [0.79]      (0.04)
School performance                     2.64       2.66       2.62        −0.04
                                       [1.45]     [1.46]     [1.45]      (0.06)
Father finished secondary              0.38       0.39       0.38        −0.01
                                       [0.49]     [0.49]     [0.49]      (0.05)
Log (income per capita)                8.16       8.17       8.15        −0.04
                                       [0.32]     [0.31]     [0.32]      (0.05)
Round 1 expected earnings (self)
  Primary (only)                       3,516      3,548      3,484       −64
                                       [884]      (116)      (124)       (165)
  Secondary (only)                     3,845      3,884      3,806       −78
                                       [1,044]    (132)      (145)       (191)
  Implied perceived returns (self)     329        336        322         −14
                                       [403]      (25)       (27)        (36)
Round 1 expected earnings (others)
  Primary (only)                       3,478      3,509      3,447       −62
                                       [863]      (112)      (120)       (160)
  Secondary (only)                     3,765      3,802      3,728       −73
                                       [997]      (126)      (143)       (185)
  Implied perceived returns (others)   287        293        281         −12
                                       [373]      (23)       (29)        (36)

Notes. Standard deviations in brackets in columns (1)–(3); heteroscedasticity-consistent standard errors accounting for clustering in parentheses in column (4). Data are from a survey of eighth-grade male students, conducted by the author. Data on age, school performance, and whether the father finished high school were gathered in Round 1 (April–May 2001); the number of observations is 1,125 for both treatment and control. School performance is teacher assessment of the student's performance, on a scale of 1 to 5 (much worse than average, worse than average, average, above average, much better than average). Implied perceived returns is the difference between own expected earnings at age 30–40 with primary and with secondary, measured in Round 1. Income per capita was gathered in Round 2 (October 2001), where there are 1,054 observations for the control group and 1,057 observations for the treatment group. All monetary figures are measured in 2001 Dominican pesos (RD$). Returned next year is measured in Round 2; finished school and years of schooling are measured in Round 3. "Difference" is a t-test of difference between treatment and control groups. * Significant at 10%. ** Significant at 5%. *** Significant at 1%.
TABLE II
IMPLIED PERCEIVED RETURNS AND SCHOOLING

Panel A. Round 1 implied perceived returns (control group only)

                             (1)         (2)         (3)        (4)        (5)        (6)
                             Returned    Returned    Finished   Finished   Years of   Years of
                             next year   next year   school     school     schooling  schooling
Implied perceived returns    0.11***     0.083**     0.14***    0.092**    0.53***    0.37**
                             (0.030)     (0.034)     (0.036)    (0.038)    (0.13)     (0.14)
Log (inc. per capita)                    0.090                  0.25***               0.76***
                                         (0.062)                (0.063)               (0.24)
School performance                       0.015                  0.015                 0.093**
                                         (0.014)                (0.011)               (0.045)
Father finished secondary                0.036                  −0.014                0.045
                                         (0.041)                (0.044)               (0.16)
Age                                      −0.017                 0.006                 −0.045
                                         (0.024)                (0.025)               (0.093)
R2                           .008        .016        .017       .048       .016       .042
Observations                 1,003       1,003       1,003      1,003      918        918

Panel B. Round 2 implied perceived returns (full sample)

                                                                Instrumental variables
                             (7)         (8)         (9)        (10)       (11)       (12)
                             Returned    Finished    Years of   Returned   Finished   Years of
                             next year   school      schooling  next year  school     schooling
Implied perceived returns    0.095***    0.088***    0.37***    0.16**     0.096*     0.63***
                             (0.021)     (0.019)     (0.075)    (0.071)    (0.055)    (0.22)
Log (inc. per capita)        0.044       0.18***     0.61***    0.023      0.18***    0.52***
                             (0.045)     (0.048)     (0.17)     (0.049)    (0.051)    (0.17)
School performance           0.014       0.021**     0.087**    0.013      0.021**    0.086**
                             (0.010)     (0.008)     (0.034)    (0.010)    (0.008)    (0.034)
Father finished secondary    0.067**     0.045       0.21*      0.066**    0.045      0.20*
                             (0.032)     (0.029)     (0.12)     (0.032)    (0.029)    (0.12)
Age                          −0.011      0.004       −0.006     −0.011     0.004      −0.003
                             (0.019)     (0.016)     (0.066)    (0.019)    (0.016)    (0.067)
R2                           .027        .050        .053       .022       .050       .046
Observations                 1,899       1,899       1,809      1,899      1,899      1,809

Notes. Heteroscedasticity-consistent standard errors accounting for clustering at the school level in parentheses. Data are from a survey of eighth-grade male students, conducted by the author. Returned next year is measured in Round 2; finished school and years of schooling are measured in Round 3. Implied perceived returns is the difference between own expected earnings at age 30–40 with primary and with secondary schooling, measured in thousands of 2001 Dominican pesos (RD$1,000). Columns (1)–(6) (Panel A) use Round 1 implied perceived returns as an independent variable and columns (7)–(12) (Panel B) use Round 2 implied perceived returns. Columns (1), (3), and (5) use no other control variables; all other columns add age, school performance, whether father finished secondary school, and log income per capita as additional controls. School performance is teacher assessment of the student's performance, on a scale of 1 to 5 (much worse than average, worse than average, average, above average, much better than average). Age, school performance, and whether father finished secondary were gathered in Round 1; income was measured in Round 2. Regressions also include an indicator for whether income data were unavailable (these households are assigned the median sample income). In columns (10)–(12), implied perceived returns is instrumented using an indicator for having received the treatment. * Significant at 10%. ** Significant at 5%. *** Significant at 1%.

Overall, these baseline implied perceived returns do predict subsequent schooling. Regressions using only the implied perceived returns without additional controls (columns (1), (3), and (5)) show positive and statistically significant associations between perceptions and all three schooling outcomes. The point estimates decline considerably (25%–35%) but remain statistically significant even when we control for characteristics that may be correlated with both schooling and perceived returns,
including the student’s age and eighth-grade school performance, his household income,19 and whether his father completed high school (income is measured in Round 2, and all other variables are measured at baseline). A RD$1,000 increase in implied perceived returns increases the likelihood of returning to school the next year by eight percentage points, the likelihood of completing high school by nine percentage points, and years of schooling by 0.37, with all coefficients significant at the 5% level. These results are consistent with Kaufmann (2008) and Attanasio and Kaufmann (2008), who find that measures of adolescents’ perceived returns are also correlated with high school and college enrollment in Mexico.20 Of course, these regressions may be plagued by omitted variables bias (e.g., those with low perceived returns may attend lower-quality schools) or reverse causality (e.g., a “sour grapes” effect whereby those who want to go to school but are constrained from doing so by poor grades or low income report low returns). Although we would not want to attach a strong causal interpretation to these results, they provide an initial impression that measured perceptions do have some predictive value (though it should be noted that the perception measure alone can account for only 1%–2% of the total variation in the various schooling outcomes) and are at least consistent with the possibility that an intervention that increases perceived returns might lead to increases in schooling. However, we note that the magnitude of the effects suggests that information alone would not lead to universal high school completion; increasing perceptions by RD$1,000, which we will see below would close the gap between perceived and measured returns, would only increase secondary completion rates by nine percentage points, whereas 70% of students do not complete secondary school. Thus, perhaps not surprisingly, other factors certainly limit secondary school completion. 19. Data on income are missing for 139 observations, almost evenly split between treatment and control groups. We assign the median sample income to these observations, and include a dummy for these observations in the regressions. Dropping these observations from the regression instead does not change the results appreciably. 20. However, with their single cross section, they can only compare perceived returns with schooling decisions already made (i.e., the perceived returns to college among those who are of college age and already either in college or not). However, the authors argue that these results are still informative, because, for example, individuals who are the age of college freshmen (whether in college or not) should have roughly the same information they had at the time they made their college decisions a few months earlier. Further, they show that the distribution of perceived returns for adolescents of high school senior age is very similar to that of the college freshman–aged individuals.
TABLE III
MEASURED AND PERCEIVED MONTHLY EARNINGS, MALES AGED 30–40

                          (1)             (2)               (3)
                          Measured mean   Perceived (self)  Perceived (others)
Primary                   3,180           3,516             3,478
                          [1,400]         [884]             [863]
Secondary                 4,479           3,845             3,765
                          [1,432]         [1,044]           [997]
Tertiary                  9,681           5,127             5,099
                          [3,107]         [1,629]           [1,588]
Secondary − primary       1,299           329               287
                                          [403]             [373]
Tertiary − secondary      5,202           1,282             1,334
                                          [1,341]           [1,272]

Notes. All figures in 2001 Dominican pesos (RD$). Standard deviations in brackets. Column (1) provides the mean earnings among men aged 30–40 from a household survey conducted by the author in January 2001. The number of observations is 1,278 primary, 339 secondary, and 83 tertiary. Columns (2) and (3) provide data from the Round 1 survey of eighth-grade male students, conducted by the author in April/May 2001. Column (2) refers to what current students expect to earn themselves under different education scenarios when they are 30–40. Column (3) refers to what current students believe current workers 30–40 years old with different education levels earn. For both columns, there are 2,025 observations with responses for primary and secondary, and 1,847 responses for tertiary.
II.D. How Accurate Are Student Perceptions? Table III provides data from the household and student surveys on measured and expected or perceived earnings by education. As noted above, the simple mean difference in earnings for those with primary only and those with secondary only is RD$1,299, or 41% (8% per year). We will use this benchmark for assessing the accuracy of students’ perceptions; in the Online Appendix, we show that the estimated returns decline only slightly (about five percentage points) when additional covariates are controlled for, and then become 10%–20% larger when we use distance to school in childhood as part of an instrumental variables strategy due to Card (1995) and Kane and Rouse (1993), in an attempt to account for potential omitted variables and measurement error (see Tables A.1 and A.2 in the Online Appendix). Other studies of the Dominican Republic and broadly comparable countries have also found high returns to completing secondary school, typically in the range of 20%–80%; for example, data from the World Bank’s Socioeconomic Database for Latin America and the Caribbean (SEDLAC) reports returns of 20%–30% from 2000– 2006 in the Dominican Republic. Thus, although our estimates may not be purged of all econometric concerns, the best available evidence suggests that there are large returns to schooling in the
country, consistent with what has been found almost universally in other low-income countries (see Psacharopoulos and Patrinos [2004], who in fact report the highest returns on average are found in Latin American and Caribbean countries), even in studies with more plausibly exogenous sources of variation in schooling with which to measure the returns (such as Duflo [2001]). By contrast, in presurvey focus groups, it was evident that few students perceived significant returns to secondary school.21 Column (2) of Table III shows that eighth-grade boys report on average that if they were to leave school at the end of the current year and not complete any more schooling, their (own) expected monthly wage at age 30–40 would be RD$3,516, which is greater than that actually measured in the household survey. By contrast, students on average expect monthly earnings of RD$3,845 if they complete secondary school, which is much lower than that observed in the earnings data. Thus, comparing to column (1), students overestimate earnings with primary schooling (by about RD$330, or 11%) and underestimate earnings with secondary schooling (by about RD$700, or 14%). Although they were not directly asked for the expected difference in earnings or the expected returns to schooling, the average implied perceived return is RD$329 (9%), which is only one-fourth as large as the estimate from the earnings data. About 42% of students report no difference in own expected earnings for the two levels of education, and 12% have implied returns that exceed those measured in the data. Using these expectations, if we assume that students expect to work until they are 65, and have a discount rate of 0.05, even if there were no direct costs of schooling, the implied net present value of the lifetime expected stream of earnings without secondary school is 11% greater than with secondary school. Thus, unless students believe there are high nonwage returns, completing secondary school would only be worthwhile for students with these expectations if they were extremely patient (i.e., a discount rate of 0.005 or less). As noted above, any discrepancy between measured and own expected earnings could arise because students feel they have information about themselves that influences where they will fall in the earnings distribution, for example, because they attend poor-quality schools or because of their race. Thus, column (3) presents
21. Though most believed there were significant returns to completing primary school.
data on what students think current adult workers aged 30–40 earn. The means here are lower than own expected earnings for both levels of education, consistent with a general optimism bias. About 55%–60% of students report the same value for current workers as they expect for themselves for both levels of schooling, with about 25%–30% expecting higher wages for themselves and 10%–15% expecting lower wages. As with own expected earnings, the perceived mean difference in earnings for other workers by education is much lower (only about one-fifth as large) than that measured in the earnings data (and 13% lower than what they expect their own personal returns would be). The fact that this measure of perceptions is not influenced by beliefs about personal characteristics that affect earnings but instead just reflects general knowledge of labor market conditions suggests that students do not have accurate information on earnings and appear to underestimate the returns to schooling. Table III also provides data on the measured and expected returns to completing college. As with secondary schooling, students’ perceptions of earnings, and the implied perceived returns, are much lower than those measured in the household survey. Overall, students on average reported expected earnings of RD$5,127 for themselves and RD$5,099 for others with college education, implying returns over completing secondary school of 33% and 35%, respectively, compared to actual measured mean earnings of RD$9,681, implying a 116% return. However, it should be noted that because college is a rare outcome (less than 10% of adult males in our household survey have a college degree), this estimate of earnings is based on only 83 observations and is therefore likely to be fairly imprecise. Though for comparison, we also note that SEDLAC estimates returns to completing college of between 70% and 80% in the Dominican Republic from 2000–2006, which is also much greater than the difference in expected earnings reported by students. A final caveat is that, perhaps because college was perceived to be so unlikely an outcome or because so few students personally knew someone with a college degree, approximately 18% of students reported “don’t know” or refused to answer this question. And those who do respond may not be a representative sample. Because only 13% of students at baseline reported planning to attend college, and only 6% had actually enrolled by the final survey, for the remainder of the paper we will focus on secondary schooling.
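The net-present-value comparison described above can be reproduced with a short calculation. The sketch below (in Python) uses the expected earnings from Table III and the 0.05 discount rate from the text; the timing assumptions (work begins immediately upon leaving school, secondary school takes four additional years with no earnings while enrolled, and work continues to about age 65 from roughly age 15) are ours rather than the text's, so the result is only approximate.

# Back-of-envelope NPV comparison of leaving school now vs. completing
# secondary school, using students' own expected monthly earnings from
# Table III. Timing assumptions (start work at roughly age 15, four years
# of secondary school, work through age 65) are illustrative only.

def npv_of_earnings(monthly_wage, start_year, end_year, r):
    """Present value of a constant monthly wage earned from start_year to
    end_year (years counted from today), discounted annually at rate r."""
    return sum(12 * monthly_wage / (1 + r) ** t
               for t in range(start_year, end_year))

r = 0.05                # annual discount rate used in the text
years_of_work = 50      # roughly age 15 to age 65

wage_primary = 3516     # own expected monthly wage, leave school now (RD$)
wage_secondary = 3845   # own expected monthly wage, complete secondary (RD$)

npv_primary = npv_of_earnings(wage_primary, 0, years_of_work, r)
npv_secondary = npv_of_earnings(wage_secondary, 4, years_of_work, r)

# Roughly 0.13 under these timing assumptions; the text reports 11%
# under its own (unstated) timing assumptions.
print(npv_primary / npv_secondary - 1)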
The fact that students have such low perceived returns to schooling raises the possibility that providing information on the higher measured returns may improve schooling. Of course, given the challenges both in estimating the true returns and in eliciting student perceptions of those returns, we cannot definitively conclude that students are “incorrect.” For example, our measure of the returns may still be biased; alternatively, even if our estimates are the correct average returns for current workers, students might have reason to expect different returns for themselves22 (though, again this would not explain why they perceive low returns among current workers). However, there are two final points worth making. First, students’ implied estimates of the returns are so low (about 2% per year of secondary schooling) that unless we believe our estimates of the market returns are highly biased and that the true returns in the Dominican Republic are dramatically lower than the returns estimated for almost every other country (and in net present value terms, actually negative), it seems likely that students do in fact underestimate the returns to schooling. Second, our experimental intervention does not per se rely on estimating either the true returns or students’ perceptions correctly. The expected effect does depend on whether the estimates provided are above or below the returns perceived by students; but again, focus groups consistently revealed that most students believed there was little or no return to schooling, so this was not a major concern for the study.23 However, both the 22. And although there is likely to be heterogeneity in the returns (say by school quality or race) and students may be aware of that heterogeneity, this alone could not explain why students on average have low expectations for the returns they would personally face. For example, for every youth from a below-average school who knows he or she has low personal returns, there should be one from an above average school who knows that he or she has higher than average personal returns. The (correct) high estimates for those from good schools should offset the (correct) low estimates for those from bad schools, so the average perceived return should not be lower than the measured return. The same would hold for other factors, such as race: black youths may believe the returns are lower for them, but white youths should then also believe that the returns for them are higher than average, so the average across a representative sample of youths should still hit the correct average return. This may not hold, however, if only youths with the attributes that lower returns are aware that those attributes matter or if, for example, all youths think they go to a below-average school (which of course can’t be true and thus would still suggest some youths have incorrectly low perceived returns). Further, this general hypothesis is not consistent with students on average expecting higher returns for themselves than for the average current worker, as in Table III. 23. We were concerned about providing misleading information, such as grossly overstating the returns, especially if they vary by race, region, or family background. However, the intervention was justified on the grounds of simply
appropriateness of such an intervention from a policy perspective and the long-term potential effectiveness of such a policy may well depend on the ability to provide accurate information to students. We discuss these issues further in Section IV. Finally, we note that within the Becker human capital framework, there are reasons other than low returns for which specific individuals may receive low levels of education, such as the combination of poverty and credit constraints. Such constraints have long been considered significant impediments to schooling, especially in poorer countries. We therefore view the provision of information as an intervention that is likely to have an impact only on the specific subset of individuals for whom low perceived returns and correspondingly low demand for schooling are the only limiting factor, rather than on all students.

III. RESULTS

III.A. Perceived Returns to Schooling

Table IV provides data on key outcome variables for the treatment and control groups in the pre- and postintervention survey rounds. As expected given randomization, in the baseline survey there was little difference between the two groups in own expected earnings with or without a secondary school degree, and thus little difference in the implied perceived returns (Table I shows that none of these baseline differences are statistically significant).

providing students the best available information, as well as informing them of the methodology and its limitations (as best as possible), and making it clear that the earnings data were national averages, not necessarily what they could expect for themselves: "We also used statistical methods to try to account for the fact that different kinds of people get different amounts of education; the results were similar. However, no method is perfect, and people differ in many ways that affect their earnings, and statistics can't always capture those differences. And of course, there is no way to predict anyone's future, so our results don't signify that this is what you yourself will earn, these are only averages over the population." Though the returns may vary by race, for example, so the returns are not as great for some students in our sample, we would only believe the intervention was potentially harmful to those students if we believed their current level of schooling was efficient, which we find unlikely. We also view our intervention as consistent with the numerous efforts under way in the country aimed at increasing educational attainment, especially for the most disadvantaged groups. Finally, we also note that it is even possible that the returns given to students may be an underestimate of the true returns, because we provided the OLS rather than the larger IV estimates and ignored the value of benefits (which we note in the Online Appendix adds RD$212 or six percentage points to the returns to secondary education) and other nonwage returns such as reduced variability in earnings or less hazardous conditions, plus the fact that the returns appear to increase with age, as shown in Table A.3 in the Online Appendix.
However, in the follow-up survey four to six months later, the treatment group reported on average greater expected earnings associated with secondary school completion, and lower expected earnings with only primary school. For the control group, there was an increase in expected earnings for both levels of schooling, though more so for secondary.24 Thus, the treatment group experienced a large relative decrease (RD$284) in expected earnings with only primary school and a smaller relative increase in expected earnings with secondary school (RD$80). Based on a simple difference-in-difference calculation in column (5), the intervention on average raised own perceived returns by a statistically significant RD$366. Overall, 54% of the treatment group had increased implied own expected returns between the two rounds, compared to about 27% for the control group. However, there was heterogeneity in response to the treatment. About 28% of the treatment group had increased implied returns of RD$1,000 or more, compared to 7% for the control group. The changes in students' estimates of the earnings of current workers by education are very similar to those for own expected earnings, with again a large and statistically significant increase in the implied perceived returns to schooling. In light of these results, we can reestimate the relationship between perceived returns and schooling in Table II, using the treatment indicator as an instrument for perceptions. Provided there is no channel other than perceptions through which this intervention might influence schooling, this exercise can help validate that measured perceptions can serve as predictors of schooling. For this analysis, we use Round 2 perceptions for the full sample (in contrast to the earlier results using Round 1 perceptions, just for the control group).25 The results are presented in the last six columns of Table II. For returning to school and years of schooling, the IV estimates are much larger than the corresponding OLS estimates (0.095 vs. 0.16 for

24. Although there may just have been an overall general increase in expected earnings due to changes in labor market or macroeconomic factors or because students grew older between the rounds, sample selection is also likely to cause an increase in the mean implied expected return to schooling for both treatment and controls. Students who returned to school in Round 2 (and thus who presumably had higher expected returns to schooling) were slightly more likely to be interviewed in that round than students who did not return, and thus we are more likely to have second-round data on expected earnings for these students.
25. The regressions where "returned to school" is the dependent variable reflect a decision already made at the time the perceptions used in these regressions were elicited, and may, for example, exhibit a greater degree of endogeneity (e.g., justification bias).
TABLE IV
EFFECT OF THE INTERVENTION ON EXPECTED RETURNS AND SCHOOLING: NO COVARIATES

Panel A. Perceived returns to school
                                        Round 1                        Round 2                  Difference-
                                 Control        Treatment       Control        Treatment        in-difference
Expected earnings (self)
  Primary (only)                 3,548 (116)    3,484 (124)     3,583 (118)    3,230 (92)       −284∗∗∗ (43)
  Secondary (only)               3,884 (132)    3,806 (145)     4,001 (132)    3,995 (114)      82∗ (44)
  Implied perceived returns      336 (25)       322 (27)        418 (24)       765 (34)         366∗∗∗ (29)
Expected earnings (others)
  Primary (only)                 3,509 (112)    3,447 (120)     3,546 (113)    3,204 (92)       −274∗∗∗ (41)
  Secondary (only)               3,802 (126)    3,728 (143)     3,892 (120)    3,916 (111)      102∗∗ (45)
  Implied perceived returns      293 (23)       281 (29)        346 (22)       712 (31)         377∗∗∗ (26)
Number of observations           1,003          1,022           922            977              1,859

Panel B. Schooling
Round 2                          Control        Treatment       Difference
  Returned to school?            0.55 (0.02)    0.59 (0.02)     0.042∗ (0.025)
  Number of observations         1,118          1,123           2,241
Round 3                          Control        Treatment       Difference
  Completed secondary school?    0.30 (0.02)    0.32 (0.02)     0.020 (0.024)
  Years of schooling completed   9.75 (0.070)   9.93 (0.073)    0.18∗ (0.098)
  Number of observations         1,033          1,041           2,074

Notes. Standard errors, corrected for clustering at the school level, in parentheses. All measures of expected earnings are for earnings at 30–40, measured in nominal (2001) Dominican pesos (RD$). Data are from a survey of eighth-grade male students, conducted by the author. Round 1 was conducted in April and May of 2001; Round 2 was conducted in October of 2001; Round 3 was conducted in May and June of 2005. ∗ Significant at 10%. ∗∗ Significant at 5%. ∗∗∗ Significant at 1%.
returned and 0.37 vs. 0.63 for years), with both IV estimates statistically significant at the 5% level or better. Instrumenting also increases the standard errors dramatically; as a result, we cannot reject that the OLS and IV coefficients are equal. For completing secondary school, the coefficients are identical for the two regressions, but the standard errors are again significantly greater. Overall, these results are further confirmation that survey measures of perceptions are useful predictors of schooling outcomes, supporting the conclusions of Kaufmann (2008) and Attanasio and Kaufmann (2008).

III.B. Schooling Outcomes

It is the large changes in expected returns observed above that we predict will affect schooling behavior, especially for students not constrained by other factors such as poverty and credit constraints. It is worth noting that because the change in the expected returns is driven to a great extent by a decline in expected earnings with only primary schooling, the intervention not only increased the expected future wage gap, but also lowered the opportunity cost of schooling, which is borne much sooner and thus not reduced as much through discounting. Thus, we might expect a bigger effect than if the increase in implied expected returns was driven more by an increase in expected earnings with secondary schooling. As stated earlier, because schooling is compulsory only through the eighth grade, the students in our sample were not required to return to school in the academic year following the first survey. The bottom panel of Table IV provides data on subsequent school attainment; for now, we present data on reported schooling (by the student, their family, or neighbors); below, we focus on results using only verified schooling data. The table shows that the treatment group was four percentage points (7%) more likely to have returned to school the following year, though the difference is only marginally statistically significant (p-value of .091). They also achieved on average 0.18 more years of schooling over the next four years. Finally, the difference in the likelihood of completing secondary school is positive, but small (two percentage points) and not statistically significant. Table V presents regression estimates of the effects of the intervention, where we have regressed the schooling outcomes for individual i, S_i, on an indicator for having received the treatment,

S_i = β_0 + β_1 Treatment_i + β Z_i + ε_i,

controlling for other variables, Z, that are baseline predictors of schooling outcomes, as discussed above (child's age and eighth-grade school performance, household income, and whether the father completed high school).26
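Two of the calculations just described can be written out compactly. The sketch below, in Python with the pandas and statsmodels libraries, first recomputes the simple difference-in-difference in perceived returns from the Table IV cell means and then illustrates a linear probability model with standard errors clustered at the school level, as in Table V. The data frame, the variable names, and the randomly generated placeholder data are hypothetical stand-ins added here for illustration; they are not the study's data or code, and the published estimates may rely on details not shown.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# (i) Difference-in-difference in own implied perceived returns, using the
#     cell means reported in Table IV, Panel A: the Round 2 minus Round 1
#     change for the treatment group minus the same change for controls.
did = (765 - 322) - (418 - 336)   # = 361, close to the RD$366 in the table,
                                  # which is computed at the student level

# (ii) Linear probability model S_i = b0 + b1*Treatment_i + b*Z_i + e_i with
#      standard errors clustered at the school level. The data below are
#      random placeholders so that the snippet runs; the variable names
#      mirror the controls listed in the text but are hypothetical.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "school_id": rng.integers(0, 150, n),
    "treatment": rng.integers(0, 2, n),
    "log_income_pc": rng.normal(8.0, 0.5, n),
    "school_performance": rng.integers(1, 6, n),
    "father_finished_secondary": rng.integers(0, 2, n),
    "age": rng.integers(13, 17, n),
    "returned_next_year": rng.integers(0, 2, n),
})

lpm = smf.ols(
    "returned_next_year ~ treatment + log_income_pc + school_performance"
    " + father_finished_secondary + age",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(did, lpm.params["treatment"], lpm.bse["treatment"])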
TABLE V
EFFECTS OF THE INTERVENTION ON EXPECTED RETURNS AND SCHOOLING

Full sample
                              (1) Returned       (2) Finished      (3) Years of      (4) Perceived
                              next year          school            schooling         returns
Treatment                     0.041∗ (0.023)     0.023 (0.020)     0.20∗∗ (0.082)    367∗∗∗ (28)
Log (inc. per capita)         0.095∗∗ (0.040)    0.23∗∗∗ (0.044)   0.79∗∗∗ (0.16)    29.0 (47)
School performance            0.011 (0.010)      0.019∗∗ (0.009)   0.086∗∗ (0.034)   0.74 (14)
Father finished sec.          0.074∗∗ (0.030)    0.050∗ (0.030)    0.26∗∗ (0.12)     −24 (32)
Age                           −0.010 (0.016)     0.004 (0.015)     −0.006 (0.059)    −42∗ (21)
R2                            .016               .040              .049              .090
Observations                  2,241              2,205             2,074             1,859

Poor households
                              (5) Returned       (6) Finished      (7) Years of      (8) Perceived
                              next year          school            schooling         returns
Treatment                     0.006 (0.034)      −0.01 (0.026)     0.037 (0.11)      344∗∗∗ (41)
Log (inc. per capita)         0.054 (0.068)      0.26∗∗∗ (0.062)   0.69∗∗∗ (0.23)    188∗∗ (87)
School performance            0.001 (0.014)      0.015 (0.012)     0.064 (0.048)     −9.5 (13.5)
Father finished sec.          0.056 (0.045)      0.019 (0.043)     0.16 (0.18)       −29.1 (62)
Age                           0.002 (0.030)      −0.071 (0.019)    −0.042 (0.088)    −46 (32)
R2                            .007               .019              .014              .094
Observations                  1,055              1,055             1,007             920

Least poor households
                              (9) Returned       (10) Finished     (11) Years of     (12) Perceived
                              next year          school            schooling         returns
Treatment                     0.072∗ (0.038)     0.054∗ (0.031)    0.33∗∗∗ (0.12)    386∗∗∗ (41)
Log (inc. per capita)         0.047 (0.12)       0.10 (0.13)       0.51 (0.45)       23 (133)
School performance            0.025∗ (0.013)     0.024∗ (0.012)    0.10∗∗ (0.048)    8.2 (22)
Father finished sec.          0.096∗∗ (0.038)    0.096∗∗ (0.038)   0.36∗∗ (0.14)     −3.8 (40)
Age                           0.005 (0.025)      0.005 (0.035)     0.025 (0.087)     −35 (29)
R2                            .020               .020              .029              .090
Observations                  1,056              1,056             1,002             939

Notes. Heteroscedasticity-consistent standard errors accounting for clustering at the school level in parentheses. Data are from a survey of eighth-grade male students, conducted by the author. Returned next year is measured in Round 2; finished school and years of schooling are measured in Round 3. Perceived returns in columns (4), (8), and (12) is the change between Round 2 and Round 1 in the difference between what students expect to earn themselves with primary and secondary schooling when they are 30–40, measured in 2001 Dominican pesos (RD$). All regressions also include an indicator for whether income data were unavailable (these households are assigned the median sample income). In columns (5)–(12), youths are split according to whether they live in a household that is below (poor) or above (least poor) the median household income per capita; households with missing income data are excluded from both categories. School performance is teacher assessment of the student's performance, on a scale of 1 to 5 (much worse than average, worse than average, average, above average, much better than average). Age, school performance, and whether the father finished secondary were gathered in the first round; income was gathered in the second round. ∗ Significant at 10%. ∗∗ Significant at 5%. ∗∗∗ Significant at 1%.
All control variables were gathered in the first round, except income, which was gathered in the second round. Regressions for having returned to school the next year and having finished secondary school are estimated with linear probability models, though results using logits yield nearly identical conclusions (Table A.5 in the Online Appendix). As with the simple treatment–control differences above, the results in columns (1)–(3) are somewhat mixed in terms of statistical significance. Overall, for the four-year period over which students were followed, the treatment caused a statistically significant 0.20 increase in years of schooling on average. However, the impact on the likelihood of returning to school the following year, although large, is only marginally statistically significant (p-value of .08), and the impact on completing secondary school, although positive, is not statistically significant. Most other variables have the expected sign, with higher socioeconomic status (income and whether the father finished secondary) and better school performance associated with increases in schooling. As noted above, within the standard human capital framework, demand is not always sufficient for schooling. For some youths, even if they wanted to attend school, a combination of costs, low family income, and credit constraints will limit the effectiveness of the intervention. This is especially likely to be the case for completing secondary school, which requires a longer-term and more costly investment. Therefore, columns (5)–(12) of Table V present separate regressions for youths in households below ("poor") and above ("least poor") the median household income per capita. Cases where the student's family was not interviewed in Round 2 lack income data and are excluded from this analysis (reclassifying households with missing income data as either all poor or all least poor does not change the results appreciably). It should be pointed out, however, that although the role of credit constraints was recognized as part of the study's conceptualization, the experiment itself was not explicitly designed to account for this (for example, by randomizing the intervention within wealth strata).27 Therefore, the results of this

26. Specifications controlling for baseline perceived returns to schooling yield nearly identical results; see Table A.4 in the Online Appendix.
27. The ability to stratify the experiment by income or wealth was limited by the fact that the initial survey had to be conducted at schools so that we could
stratification, although potentially informative, and motivated by the considerable literature documenting the role of poverty and credit constraints in limiting schooling in low-income countries, should be interpreted with somewhat more caution. Means and tests of covariate balance for treatment and control groups within the poor and least poor subsamples are provided in Table A.6 in the Online Appendix (this table also contains the estimated treatment effects excluding other covariates, which yield very similar results). Overall, despite not being explicitly stratified, the randomization still appears to have achieved covariate balance between the treatment and control groups within these subsamples. For the poorest households, the effect of the treatment is extremely small and not statistically significant for all three measures of schooling. This is despite the fact that in column (8), the treatment appears to have had a large effect on perceived returns to schooling for these students. By contrast, for youths from wealthier (though still quite poor) households, the effects are large, and statistically significant at the 10% level or better for all three education measures (though the effect for finished secondary is not statistically significant without the additional covariates (Table A.6 in the Online Appendix), and only marginally significant with them). For this group, the intervention increased the average years of schooling over the four-year period by 0.33. There was a seven–percentage point (11%, from a base of 56%) increase in the likelihood of returning to school the academic year following the intervention, and a five–percentage point (13%, relative to a base of 40%) increase in secondary school completion. The differences between the poor and least poor are all the more notable given that the intervention had a similar impact on perceived returns for the two groups. Though we would only reject equality of the schooling treatment effects for the poor and least poor samples for years of schooling, the fact that the point estimates for the poor sample are so small (0.006 for returning to school, −0.01 for finishing secondary, and 0.037 for years) is consistent with the treatment being limited in impact for poor households, despite having increased potential demand just as
get a large enough sample of eighth-grade boys. Surveying at schools meant we could not measure students’ household income at baseline. A survey of their home households was possible only for the second round, when more resources became available.
much as for the least poor group. This suggests at least some role for poverty and credit constraints in limiting schooling.28 Overall, the effects for the least poor students are large and striking. The magnitudes compare favorably with large-scale programs implemented elsewhere, such as Mexico’s PROGRESA, which provided direct cash incentives to increase school attendance.29 And many of these other programs are extremely expensive,30 whereas in the present case, information could potentially be provided at low cost. Though, again, we only expect information to have an impact when students are misinformed about the returns and when no other constraints prevent students from attending school, whereas other programs may be effective for a wider group of students. 28. Though we can’t rule out that because perceived returns for poor youths are lower on average (231 vs. 417), increasing them by the same amount does not move as many over the margin to where it is worthwhile to go to school. However, the results in Table A.4 in the Online Appendix show that perceived returns have much smaller impacts on schooling for the poor sample, supportive of the conclusion that perceived returns are delinked from schooling for the poor, consistent with poverty and credit constraints explaining the poor vs. least poor treatment differences (though in Table A.4 we would not reject equality of the coefficients for least poor and poor, and only one of the coefficients for the least poor sample is statistically significant). This conclusion is also consistent with Attanasio and Kaufmann (2008) and Kaufmann (2008), who find that perceived returns only predict schooling for the least poor students in Mexico. Further, estimates of the education impacts using logit models (Table A.5 in the Online Appendix), which do not force the effects of the treatment to be small for individuals far from the margin, yield nearly identical estimates to the least squares estimates above. 29. PROGRESA, whose payments also were conditioned on other requirements and also provided other benefits, increased enrollments for ninth grade boys from 60 to 66 percentage points (Schultz 2004), close to what was found here for wealthier students. For other comparisons, Duflo (2001) finds that a program in Indonesia that built approximately 61,000 primary schools (effectively doubling the stock) resulted in a 0.25–0.40 increase in years of schooling, or 0.12–0.19 years (comparable to the results found here for the full sample) for each additional school built per 1,000 students. Angrist, Bettinger, and Kremer (2006) find that a large voucher program in Colombia increased secondary school completion rates by five to seven percentage points (a 15%–20% gain), similar to what we find for wealthier students. Of course, these results are not directly comparable; for example, Indonesia was in 1973 (and still is) a much poorer country than the Dominican Republic today, PROGRESA started from a much higher enrollment base, and both it and the Colombian voucher program targeted the poorest students, so improvements in schooling may have been harder to achieve in these other cases. 30. For example, PROGRESA cost nearly 0.2% of Mexico’s GDP to provide benefits to about one-ninth of all households. Indonesia’s program cost about 1.5% of 1973 GDP, or about 750 million 2007 dollars. 
And the Colombian vouchers came at a cost of about $190 per year of attendance (though for the government some of the cost would likely be offset by savings in expenditures for public schools). There are of course other interventions that have also been shown to be very cost-effective, such as the deworming program studied by Miguel and Kremer (2004), which achieved gains at a cost of about $3.50 per additional year of schooling.
III.C. Robustness To this point, we have used data on education as reported by the students (or their families or neighbors). The primary concern is that students may inflate the amount of education they achieved, especially if they received the treatment. A second concern is a general decline in accuracy when students or their relatives could not be interviewed (typically because the family had moved) and schooling data were obtained instead from neighbors. As stated, we attempted to verify schooling data for all students, but were unable to do so for 3% of students in the second round and 9% in the third round. Most of the cases where data could not be verified were due to obtaining schooling information from neighbors or more distant relatives, because they often did not know which school the youth attended. Before turning to these results, we make two observations. First, there were very few cases (27) where a youth reported schooling that differed from what his school reported. This is largely because students were typically interviewed during the daytime on school days (at home, work, or school), so students not in school would be unlikely to misreport that they did attend school. Second, to an extent, the results in columns (5)–(12) of Table V already eliminated many of the nonverified households, because if a neighbor had to report on the youth’s schooling, we would also not have income data for that household and it would have been dropped from the analysis. However, the overlap is not perfect, as there are some households where neighbors provided schooling data that could be verified. Table A.7 in the Online Appendix reveals that using only the verified data reduces the sample sizes slightly, but does not change the results appreciably. The effect of the treatment for wealthier households is still positive for all three measures of education, though slightly smaller for years of schooling and having completed secondary school; and in the latter case, the significance level declines (p-value of .12) so that it no longer falls within conventional levels. However, in terms of both returning for ninth grade and total years of schooling, the results indicate that the schooling gains were real, rather than reporting bias. However, we must maintain the assumption that enrollment among students whose data could not be verified is not negatively correlated with the treatment.31 31. For example, if we make the strong assumption that all control students whose data could not be verified had the best educational outcome (returned;
A second issue we consider is whether just by students being asked to form their expectations of earnings for various levels of schooling, they acquire information or begin to think about the schooling decision in a way they would not have otherwise; alternatively, there may be an effect of just being interviewed by a research team as part of a project from an American university. Because both treatment and controls were administered the same survey except for whether they were provided with information on returns at the end, this does not affect our interpretation of the effect of the treatment.32 However, one issue to consider is whether the control group was influenced by the interview. Therefore, in column (1) of Table VI, we compare the full-sample control group to a “shadow” control group of fifteen randomly selected students at each of thirty randomly selected nonsample schools (chosen to obtain approximately the same population distribution as the original student sample). These students were identified but not interviewed until the second round (unfortunately, they were not followed after this round). However, we only gathered data on enrollment status for this group, so in the regression we only include an indicator for being in the control group that was interviewed. The results show that the original, interviewed control group experienced no differential change in enrollment relative to the noninterviewed control group; the coefficient is positive, but small and not statistically significant. Thus, the provision of information on the returns to schooling appears to be the critical factor for achieving schooling gains. Finally, although the results suggest that the increased schooling was due to the impact of the intervention on perceived returns to schooling, we are unable to rule out that some of the effect was due to other factors, such as reducing the uncertainty
finished; twelve years of school) whereas all treatment students whose data could not be verified had the worst educational outcome (not returned; not finished; eight years of school), the treatment effects are smaller, and not statistically significant (columns (7)–(9) of Table A.8 in the Online Appendix, for the full sample). Although we have no reason to believe nonverified treatment students are less likely to be enrolled than nonverified control students, this assumption is not testable. If we instead assume either all students whose data could not be verified had the worst outcomes or all had the best outcomes, there are slight increases in the magnitudes and statistical significance of the treatment effects (columns (1)–(3) and (4)–(6) in Table A.8 in the Online Appendix). This result is expected, because attrition is slightly higher for the control group (column (10)). 32. Unless we believe that the intervention would not have been effective without students first going through the interview, or without the presence of our research team.
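The bounding exercise described in footnote 31 amounts to re-coding the unverified observations against the treatment effect and re-estimating. The short Python/pandas sketch below makes that logic concrete; the data frame, the variable names, and the imputation routine are hypothetical stand-ins that simply mirror the worst-case assumption in the footnote, not the study's code or data.

import pandas as pd

def worst_case_recode(df):
    """Assign the best outcomes (returned, finished, 12 years of school) to
    unverified control students and the worst outcomes (not returned, not
    finished, 8 years) to unverified treatment students. `df` is a
    hypothetical student-level DataFrame with a boolean `verified` column."""
    out = df.copy()
    unverified_control = (~out["verified"]) & (out["treatment"] == 0)
    unverified_treated = (~out["verified"]) & (out["treatment"] == 1)
    out.loc[unverified_control, ["returned", "finished"]] = 1
    out.loc[unverified_control, "years_of_schooling"] = 12
    out.loc[unverified_treated, ["returned", "finished"]] = 0
    out.loc[unverified_treated, "years_of_schooling"] = 8
    return out

# The bounded treatment effect is then the treatment-control difference
# (or the Table V regression) on the re-coded data, e.g.:
#   bounded = worst_case_recode(df)
#   effect = (bounded.loc[bounded["treatment"] == 1, "returned"].mean()
#             - bounded.loc[bounded["treatment"] == 0, "returned"].mean())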
TABLE VI
ADDITIONAL TESTS

                              Shadow controls     Change in implied return (self) < RD$1,000
                              (1) Returned        (2) Returned       (3) Finished       (4) Years of
                              next year           next year          school             schooling
Treatment                                         0.028 (0.026)      0.012 (0.021)      0.13 (0.084)
Log (inc. per capita)                             0.091∗ (0.047)     0.20∗∗∗ (0.047)    0.77∗∗∗ (0.17)
School performance                                0.020∗ (0.012)     0.021∗∗ (0.009)    0.091∗∗ (0.037)
Father finished secondary                         0.064∗∗ (0.031)    0.044 (0.030)      0.22∗ (0.12)
Age                                               −0.014 (0.019)     0.006 (0.017)      −0.008 (0.066)
Interviewed                   0.014 (0.027)
R2                            .00                 .014               .031               .036
Observations                  1,575               1,664              1,664              1,577

Notes. Heteroscedasticity-consistent standard errors accounting for clustering at the school level in parentheses. Data are from a survey of eighth-grade male students, conducted by the author. Returned next year is measured in Round 2; finished school and years of schooling are measured in Round 3. Column (1) uses only students from the main sample control group, plus a secondary administrative control group consisting of students from schools where no interviews took place. Columns (2)–(4) focus on the subsample of students in the study whose reported changes in implied perceived returns at age 30–40 were less than RD$1,000. School performance is teacher assessment of the student's performance, on a scale of 1 to 5 (much worse than average, worse than average, average, above average, much better than average). Age, school performance, and whether the father finished secondary were gathered in the first round; income was measured in the second round. All regressions also include an indicator for whether income data were unavailable (these households are assigned the median sample income). ∗ Significant at 10% level. ∗∗ Significant at 5% level. ∗∗∗ Significant at 1% level.
of students’ estimates,33 or that when providing information on the returns, enumerators provided additional information or encouragement to students to remain in school. However, we can provide a limited exploration of whether increased perceptions of the returns played at least some role in schooling improvements by considering whether there was any effect for those youths who did not significantly update their beliefs. Columns (2)–(4) of Table VI restrict the sample to youths whose perceptions of 33. For example, if students were initially more uncertain of their estimates for earnings with secondary school than their estimates for earnings with primary school, reduced uncertainty due to the treatment might in itself have had an independent effect on the decision to stay in school, even if estimates of the returns were unchanged.
the returns increased between the first and second rounds by less than RD$1,000. For this group, the coefficients are positive, but extremely small and not statistically significant. Although this is only a limited test, and even though we could not reject the hypothesis that the effects for this sample are equal to those for the full sample, the very low point estimates are consistent with the effect of the intervention being limited to only those youths who significantly updated their beliefs about the returns to schooling.

IV. DISCUSSION AND CONCLUSIONS

We find that despite high measured returns to secondary schooling in the Dominican Republic, the returns perceived by students are low. This finding suggests a possible inefficiency and may even reflect a potential development trap, as the relative skill composition demanded by the labor market is not transmitted to youths in the form of greater perceived returns, resulting in an undersupply of skilled labor, which in turn inhibits the development of domestic skill-intensive industries or the ability to attract foreign direct investment. An intervention that provided information on the measured returns increased both perceived returns and schooling. The results suggest that demand is a limiting factor in schooling attainment in the Dominican Republic. The effects of the treatment on schooling are large and striking; there are few examples of policies or interventions that result in a 0.20- to 0.35-year increase in schooling, much less interventions that are as potentially inexpensive as this one. An additional advantage of information-based programs such as that applied here is that they may result in students who are more committed to school and provide greater effort than under other programs, because they stimulate the demand for schooling itself, rather than, say, the cash incentives to be obtained through attendance. For example, Nguyen's (2008) findings that providing information on the returns to schooling improves school performance in Madagascar support this hypothesis. And in a regression for the sample of students enrolled at the time of the second-round survey in our study, we find that the treatment increased time spent on homework by about 11 minutes per week on average. However, the intervention undertaken here may be limited in its potential scope and applicability, as it will only be effective in cases where the perceived returns are low relative to the
true returns, and no other constraints such as poverty limit investment in schooling. Thus, the optimal strategy may involve a combination of stimulating demand by providing information on the returns and lowering the barriers to attendance by reducing school fees or providing financial support. Another limitation on our study is that we focused on boys, so we have no information on how accurate girls’ perceptions of the returns to schooling are or what impact the intervention might have on them. However, given the large and striking results found here, studies in other settings are worth consideration. Already in this spirit, Nguyen (2008) finds significant effects on children’s attendance and test scores in the months after parents in Madagascar are provided with information on the returns to schooling. As noted in the Introduction, there are several potential explanations for why students might underestimate the returns to schooling. In the Online Appendix, we explore the hypothesis that residential segregation by income, coupled with residential mobility (akin to the argument of Wilson [1987]), may be playing a role in the Dominican Republic. In particular, if youths are only able to observe the earnings of workers who live in their neighborhoods, residential segregation will lead to lower estimates of the returns due to differential selection by education. For example, poor neighborhoods may contain most of the workers with low levels of education, but only those more highly educated workers who had the worst income draws, so that within these neighborhoods the more highly educated workers do not earn much more than those with less education. The opposite form of selection will arise in rich neighborhoods, with the net result that within all segregated communities, the local mean difference in earnings by education will be less than the difference by education for the country as a whole, obscuring the returns to schooling. Such segregation would present a case where a strong argument could be made for the type of information intervention undertaken in the present study. In the Online Appendix, we find some supportive evidence of this hypothesis, though our tests are limited and we are unable to rule out alternative interpretations of the results. Further research could explore this hypothesis and its implications in more detail. Of course, the desirability of such information-based programs will depend on the ability to provide accurate information on the returns to schooling, which may often be difficult. Further, even with accurate estimates, there may be reasons that the
returns for the marginal child may not be as large as the currently measured average return.34 This is made all the more complicated by the fact that even if current estimates of the returns are correct, we would in effect need to forecast the returns to be expected in the future.35 Although there may be some public good or spillover effects of education that make the social returns higher than the private returns, it is unlikely to be desirable public policy to provide information known to be incorrect, even if it leads to outcomes deemed socially desirable. Further, doing so may undermine the effectiveness of the program in the short run (i.e., students may be less likely to believe information that differs markedly from the available evidence) and in the long run (i.e., if younger cohorts of students see older cohorts invest in schooling and not achieve the gains they are told to expect, they will no longer believe the information given). Such effects could even spill over and undermine other government interventions or institutions. But even though it may not be possible to provide students with the absolute certain value of the returns they will personally face, there may still 34. For example, if those with the highest returns are the ones who currently finish schooling, the marginal child may have lower returns than the current average. However, if other factors also influence schooling (such as income, distance to school, costs, geography, rates of time preference, attitudes toward risk, or knowledge of the returns), then the prediction is more ambiguous. To take one extreme example, if high school completion rates were near-universal in the big cities but low elsewhere, the marginal youth from outside of the cities may have higher ability (and thus potentially, higher returns) than the average person who currently completes secondary school in the cities, because the former is still likely to be high in the ability distribution, whereas the latter includes almost the full distribution of ability, high and low. This is not to say that marginal students may not have lower returns on average, only that the prediction depends on a number of factors. We note here also, however, that even if the returns for the marginal student were lower, this could not explain why students underestimate the returns to schooling in our study, because Table III shows that their estimates of the earnings of current workers with different levels of schooling are also low. Of course, this does not rule out that the returns for the marginal student may actually differ from the current average, which reinforces the difficulty in providing students with accurate estimates of the returns they will face. 35. Though rates of completion of secondary school have been increasing over time, which all else equal might be expected to lower the returns, the evolution of future returns for the Dominican Republic is uncertain. As Goldin and Katz (2008) note in their analysis of the United States, a great deal depends on the growth of the supply of skilled labor relative to the demand for skilled labor, as well as technological change (potentially, in a nonmonotonic way). It is also worth noting that the supply of workers with primary school has increased even more rapidly, which might at least in the near term depress the wages of such workers (or more generally, workers with just the basic skills of literacy and numeracy). 
Further, there may be spillover effects whereby the supply of educated workers actually increases the demand for skilled labor, such as by spurring greater innovation, making firms more competitive, or attracting foreign investment. Unfortunately, there are not sufficient data to examine how the returns have been changing over time in the Dominican Republic. And the estimates for the few points in time that are available are not comparable, due to differences in data and methodology.
be a value to providing the best available current estimate of the returns, which students can use as the basis for forming their own expectations, especially if provided alongside the appropriate caveats about the uncertainty over how returns will evolve in the future.

SCHOOL OF PUBLIC AFFAIRS, UNIVERSITY OF CALIFORNIA, LOS ANGELES, NATIONAL BUREAU OF ECONOMIC RESEARCH, AND WATSON INSTITUTE FOR INTERNATIONAL STUDIES, BROWN UNIVERSITY
REFERENCES

Angrist, Joshua, Eric Bettinger, and Michael Kremer, "Long-Term Educational Consequences of Secondary School Vouchers: Evidence from Administrative Records in Colombia," American Economic Review, 96 (2006), 847–862.
Attanasio, Orazio P., and Katja Maria Kaufmann, "School Choices, Subjective Expectations and Credit Constraints," Bocconi University and University College London Working Paper, 2008.
Avery, Christopher, and Thomas J. Kane, "Student Perceptions of College Opportunities: The Boston COACH Program," in College Choices: The Economics of Where to Go, When to Go, and How to Pay for It, Caroline M. Hoxby, ed. (Chicago: University of Chicago Press, 2004).
Betts, Julian, "What Do Students Know about Wages? Evidence from a Survey of Undergraduates," Journal of Human Resources, 31 (1996), 27–56.
Card, David, "Using Geographic Variation in College Proximity to Estimate the Return to Schooling," in Aspects of Labour Market Behaviour, Louis N. Christofides, E. Kenneth Grant, and Robert Swidinsky, eds. (Toronto: University of Toronto Press, 1995).
Chan, Sewin, and Ann Huff Stevens, "What You Don't Know Can't Help You: Pension Knowledge and Retirement Decision-Making," Review of Economics and Statistics, 90 (2008), 253–266.
Delavande, Adeline, Xavier Giné, and David McKenzie, "Measuring Subjective Expectations in Developing Countries: A Critical Review and New Evidence," RAND Working Paper, 2008.
Dominitz, Jeff, and Charles F. Manski, "Eliciting Student Expectations of the Returns to Schooling," Journal of Human Resources, 31 (1996), 1–26.
Duflo, Esther, "Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment," American Economic Review, 91 (2001), 795–813.
Duflo, Esther, and Emmanuel Saez, "The Role of Information and Social Interactions in Retirement Plan Decisions: Evidence from a Randomized Experiment," Quarterly Journal of Economics, 118 (2003), 815–842.
Dupas, Pascaline, "Relative Risks and the Market for Sex: Teenagers, Sugar Daddies and HIV in Kenya," NBER Working Paper No. 14707, 2009.
Goldin, Claudia, and Lawrence F. Katz, The Race between Education and Technology (Cambridge, MA: The Belknap Press of Harvard University Press, 2008).
Gustman, Alan L., and Thomas L. Steinmeier, "Imperfect Knowledge of Social Security and Pensions," Industrial Relations, 44 (2005), 373–397.
Hastings, Justine S., and Jeffrey M. Weinstein, "Information, School Choice, and Academic Achievement: Evidence from Two Experiments," Quarterly Journal of Economics, 123 (2008), 1373–1414.
Kane, Thomas J., and Cecilia Rouse, "Labor Market Returns to Two- and Four-Year Colleges: Is a Credit a Credit and Do Degrees Matter?" NBER Working Paper No. 4268, 1993.
Kaufmann, Katja Maria, "Understanding the Income Gradient in College Attendance in Mexico: The Role of Heterogeneity in Expected Returns to College," SIEPR Discussion Paper No. 07–40, 2008.
Manski, Charles F., "Adolescent Econometricians: How Do Youth Infer the Returns to Education?" in Studies of Supply and Demand in Higher Education, Charles T. Clotfelter and Michael Rothschild, eds. (Chicago: University of Chicago Press, 1993).
——, "Measuring Expectations," Econometrica, 72 (2004), 1329–1376.
McKenzie, David, John Gibson, and Steven Stillman, "A Land of Milk and Honey with Streets Paved with Gold: Do Emigrants Have Over-optimistic Expectations about Incomes Abroad?" World Bank Policy Research Working Paper 4141, 2007.
Miguel, Edward, and Michael Kremer, "Worms: Identifying Impacts of Education and Health in the Presence of Treatment Externalities," Econometrica, 72 (2004), 159–217.
Mitchell, Olivia, "Worker Knowledge of Pension Provisions," Journal of Labor Economics, 6 (1988), 21–39.
Nguyen, Trang, "Information, Role Models and Perceived Returns to Education: Experimental Evidence from Madagascar," MIT Working Paper, 2008.
Oficina Nacional de Estadística, República Dominicana, VIII Censo Nacional de Población y Vivienda 2002 (Santo Domingo, DR: Secretariado Técnico de la Presidencia, 2002).
Oreopoulos, Philip, "Do Dropouts Drop Out Too Soon? Wealth, Health and Happiness from Compulsory Schooling," Journal of Public Economics, 91 (2007), 2213–2229.
Psacharopoulos, George, and Harry Anthony Patrinos, "Returns to Investment in Education: A Further Update," Education Economics, 12 (2004), 111–134.
Rouse, Cecilia Elena, "Low-Income Students and College Attendance: An Exploration of Income Expectations," Social Science Quarterly, 85 (2004), 1299–1317.
Schultz, T. Paul, "School Subsidies for the Poor: Evaluating the Mexican Progresa Poverty Program," Journal of Development Economics, 74 (2004), 199–250.
Smith, Herbert L., and Brian Powell, "Great Expectations: Variations in Income Expectations among College Seniors," Sociology of Education, 63 (1990), 194–207.
Stango, Victor, and Jonathan Zinman, "Fuzzy Math and Red Ink: When the Opportunity Cost of Consumption Is Not What It Seems," Dartmouth College Working Paper, 2007.
Viscusi, W. Kip, "Do Smokers Underestimate Risks?" Journal of Political Economy, 98 (1990), 1253–1269.
Wilson, William Julius, The Truly Disadvantaged (Chicago: University of Chicago Press, 1987).
SUPERSTAR EXTINCTION PIERRE AZOULAY JOSHUA S. GRAFF ZIVIN JIALAN WANG We estimate the magnitude of spillovers generated by 112 academic “superstars” who died prematurely and unexpectedly, thus providing an exogenous source of variation in the structure of their collaborators’ coauthorship networks. Following the death of a superstar, we find that collaborators experience, on average, a lasting 5% to 8% decline in their quality-adjusted publication rates. By exploring interactions of the treatment effect with a variety of star, coauthor, and star/coauthor dyad characteristics, we seek to adjudicate between plausible mechanisms that might explain this finding. Taken together, our results suggest that spillovers are circumscribed in idea space, but less so in physical or social space. In particular, superstar extinction reveals the boundaries of the scientific field to which the star contributes—the “invisible college.”
“Greater is the merit of the person who facilitates the accomplishments of others than of the person who accomplishes himself.” Rabbi Eliezer, Babylonian Talmud, Tractate Baba Bathra 9a
I. INTRODUCTION Although the production of ideas occupies a central role in modern theories of economic growth (Romer 1990), the creative process remains a black box for economists (Weitzman [1998] and Jones [2009] are notable exceptions). How do innovators actually generate new ideas? Increasingly, discoveries result from the voluntary sharing of knowledge through collaboration, rather than individual efforts (Wuchty, Jones, and Uzzi 2007). The growth of scientific collaboration has important implications for the optimal allocation of public R&D funds, the apportionment of credit among scientists, the formation of scientific reputations, and ultimately ∗ Part of the work was performed while the first author was an Alfred P. Sloan Industry Studies Fellow. We thank the editor Larry Katz and the referees for their constructive comments, as well as various seminar audiences for their feedback, and we gratefully acknowledge the financial support of the National Science Foundation through its SciSIP program (Award SBE-0738142) and the Merck Foundation through the Columbia–Stanford Consortium on Medical Innovation. The project would not have been possible without Andrew Stellman’s extraordinary programming skills (http://www.stellman-greene.com/). The authors also express gratitude to the Association of American Medical Colleges for providing licensed access to the AAMC Faculty Roster, and acknowledge the stewardship of Dr. Hershel Alexander (AAMC Director of Medical School and Faculty Studies). The National Institutes of Health partially supports the AAMC Faculty Roster under Contract HHSN263200900009C. The usual disclaimer applies.
© 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, May 2010
the design of research incentives that foster innovation and continued economic growth. Yet, we know surprisingly little about the role of collaboration among peers as a mechanism to spur the creation of new technological or scientific knowledge. This paucity of evidence is due largely to the empirical challenges inherent in this line of inquiry. Individual-level data on the contributors to a particular innovation are generally unavailable. Furthermore, the formation of collaborative teams is the outcome of a purposeful matching process (Mairesse and Turner 2005; Fafchamps, Goyal, and van de Leij 2008), making it difficult to uncover causal effects. The design of our study tackles both of these challenges. To relax the data constraint, we focus on the academic life sciences, where a rich tradition of coauthorship provides an extensive paper trail of collaboration histories and research output. To overcome the endogeneity of the collaboration decision, we make use of the quasi-experimental variation in the structure of coauthorship networks induced by the premature and sudden death of active “superstar” scientists.1 Specifically, we analyze changes in the research output of collaborators for 112 eminent life scientists who died suddenly and unexpectedly. We assess eminence based on the combination of seven criteria, and our procedure is flexible enough to capture established scientists with extraordinary career achievement, as well as promising young and mid-career scientists. Using the Association of American Medical Colleges (AAMC) Faculty Roster as a data source—a comprehensive, longitudinal, matched employee–employer database pertaining to 230,000 faculty members in all U.S. medical schools between 1975 and 2006—we construct a panel data set of 5,267 collaborator–star pairs, and we examine how coauthors’ scientific output (as measured by publications, citations, and National Institutes of Health (NIH) grants) changes when the superstar passes away.2 1. Other economists have used the death of prominent individuals as a source of exogenous variation in leadership, whether in the context of business firms ´ (Bennedsen, P´erez-Gonzalez, and Wolfenzon 2008) or even entire countries (Jones and Olken 2005). To our knowledge, however, we are the first to use this strategy to estimate the impact of scientific collaboration. Oettl (2008) builds on our approach by incorporating helpfulness as implied by acknowledgements to generate a list of eminent immunologists. Aizenman and Kletzer (2008) study the citation “afterlife” of 16 economists who died prematurely, shedding light on the survival of scientific reputation. 2. To be clear, our focus is on faculty peers rather than trainees, and thus our results should be viewed as capturing inter-laboratory spillovers rather than mentorship effects. For evidence on the latter, see Azoulay, Liu, and Stuart (2009).
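The design just described lends itself to a panel comparison of collaborator output before and after the star's death. The sketch below, in Python with pandas and statsmodels, shows one minimal version of such a specification, with dyad and year fixed effects and standard errors clustered by star; the specification, the variable names, and the randomly generated placeholder data are assumptions made purely for illustration and are not the estimator or data used in the paper.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative collaborator-by-year panel. "after_death" switches on in the
# years following the superstar's death; log_output is a crude stand-in for
# quality-adjusted publications. All values are random placeholders so the
# snippet runs.
rng = np.random.default_rng(0)
n = 5000
panel = pd.DataFrame({
    "dyad_id": rng.integers(0, 300, n),
    "star_id": rng.integers(0, 100, n),
    "year": rng.integers(1980, 2006, n),
    "after_death": rng.integers(0, 2, n),
    "pubs": rng.poisson(2.0, n),
})
panel["log_output"] = np.log1p(panel["pubs"])

# Two-way fixed effects: dyad and year dummies, errors clustered by star.
fe = smf.ols(
    "log_output ~ after_death + C(dyad_id) + C(year)",
    data=panel,
).fit(cov_type="cluster", cov_kwds={"groups": panel["star_id"]})
print(fe.params["after_death"], fe.bse["after_death"])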
The study's focus on the scientific elite can be justified on both substantive and pragmatic grounds. The distribution of publications, funding, and citations at the individual level is extremely skewed (Lotka 1926; de Solla Price 1963) and only a tiny minority of scientists contribute through their published research to the advancement of science (Cole and Cole 1972). Stars also leave behind a corpus of work and colleagues with a stake in the preservation of their legacy, making it possible to trace back their careers, from humble beginnings to wide recognition and acclaim. Our results reveal a lasting 5% to 8% decrease in the quality-adjusted publication output of coauthors in response to the sudden and unexpected loss of a superstar. Though close and recent collaborators see their scientific output fall even more, these differential effects are small in magnitude and statistically insignificant. Therefore, the process of replacing missing skills within ongoing collaborative teams cannot, on its own, explain our core result. The importance of learning through on-the-job social interactions can be traced back to the Talmudic era (as evidenced by the epigraph to this paper), as well as canonical writings by Alfred Marshall (1890) and Robert Lucas (1988).3 Should the effects of exposure to superstar talent be interpreted as laying bare the presence of knowledge spillovers? Because we identify 47 coauthors per superstar on average, we exploit rich variation in the characteristics of collaborative relationships to assess the relative importance of several mechanisms that could plausibly account for our main finding. A jaundiced view of the academic reward system provides the backdrop for a broad class of stories. Their common thread is that collaborating with superstars deepens social connections that might make researchers more productive in ways that have little to do with scientific knowledge, for example, by connecting coauthors to funding resources, editorial goodwill, or potential coauthors. Yet we find no differential impact on coauthors of stars well-connected to the NIH funding apparatus, on coauthors of stars more central in the collaboration network, or on former trainees. These findings do not jibe with explanations stressing the gatekeeping role of eminent scientists. 3. A burgeoning empirical literature examines the influence of peer effects on shirking behavior in the workplace (Costa and Kahn 2003; Bandiera, Barankay, and Rasul 2005; Mas and Moretti 2009). Because "exposure" does not involve the transmission of knowledge, these spillovers are conceptually distinct from those that concern us here.
Rather, the effects of superstar extinction appear to be driven by the loss of an irreplaceable source of ideas. We find that coauthors proximate to the star in intellectual space experience a sharper decline in output, relative to coauthors who work on less related topics. Furthermore, the collaborators of stars whose work was heavily cited at the time of their death also undergo steeper decreases, relative to collaborators of superstars of less renown. Together, these results paint a picture of an invisible college of coauthors bound together by interests in a fairly specific scientific area, which suffers a permanent and reverberating intellectual loss when it loses its star. The rest of the paper proceeds as follows. In the next section, we describe the construction of the sample of matched superstars and collaborators, as well as our empirical strategy. Section III provides descriptive statistics at the coauthor and dyad level. We report the results in Section IV. Section V concludes. II. SETTING, DATA, AND MATCHED SAMPLE CONSTRUCTION The setting for our empirical work is the academic life sciences. This sector is an important one to study for several reasons. First, there are large public subsidies for biomedical research in the United States. With an annual budget of $29.5 billion in 2008, support for the NIH dwarfs that of other national funding agencies in developed countries (Cech 2005). Deepening our understanding of knowledge production in this sector will allow us to better assess the return to these public investments. Second, technological change has been enormously important in the growth of the health care economy, which accounts for roughly 15% of U.S. GDP. Much biomedical innovation is science-based (Henderson, Orsenigo, and Pisano 1999), and interactions between academic researchers and their counterparts in industry appear to be an important determinant of research productivity in the pharmaceutical industry (Cockburn and Henderson 1998; Zucker, Darby, and Brewer 1998). Third, academic scientists are generally paid through soft money contracts. Salaries depend on the amount of grant revenue raised by faculty, thus providing researchers with high-powered incentives to remain productive even after they secure a tenured position. Last, introspective accounts by practicing scientists indicate that collaboration plays a large role in both the creation and
diffusion of new ideas (Reese 2004). Knowledge and techniques often remain partially tacit until long after their initial discovery, and are transmitted within the confines of tightly knit research teams (Zucker and Darby 2008). II.A. Superstar Sample Our basic approach is to rely on the death of superstar scientists to estimate the magnitude of knowledge spillovers onto colleagues. From a practical standpoint, it is more feasible to trace back the careers of eminent scientists than to perform a similar exercise for less eminent ones. We began by delineating a set of 10,349 elite life scientists (roughly 5% of the entire relevant labor market), who are so classified if they satisfy at least one of the following criteria for cumulative scientific achievement: (1) highly funded scientists; (2) highly cited scientists; (3) top patenters; and (4) members of the National Academy of Sciences. These four criteria will tend to select seasoned scientists, because they correspond to extraordinary achievement over an entire scientific career. We combine these measures with three others that capture individuals who show great promise at the early and middle stages of their scientific careers, whether or not these episodes of productivity endure for long periods of time: (5) NIH MERIT awardees; (6) Howard Hughes medical investigators; and (7) early career prize winners. Appendix I provides additional details regarding these seven metrics of “superstardom.” We trace these scientists’ careers from the time they obtained their first positions as independent investigators (typically after a postdoctoral fellowship) until 2006. We do so through a combination of curriculum vitaes, NIH biosketches, Who’s Who profiles, accolades/obituaries in medical journals, National Academy of Sciences biographical memoirs, and Google searches. For each one of these individuals, we record employment history, degree held, date of degree, gender, and up to three departmental affiliations. We also cross-reference the list with alternative measures of scientific eminence. For example, the elite subsample contains every U.S.-based Nobel Prize winner in medicine and physiology since 1975, and a plurality of the Nobel Prize winners in chemistry over the same time period. Though we apply the convenient moniker of “superstar” to the entire group, it should be clear that there is substantial heterogeneity in intellectual stature within the elite sample. This
variation provides a unique opportunity to examine whether the effects we estimate correspond to vertical effects (spillovers from the most talented agents onto those who are less distinguished) or peer effects (spillovers between agents of roughly comparable stature). The scientists who are the focus of this paper constitute a subset of this larger pool of 10,349. We impose several additional criteria to derive the final list. First, the scientist's death must occur between 1979 and 2003. This will enable us to observe at least four years' (resp. three years') worth of scientific output for every colleague before (resp. after) the death of the superstar collaborator. Second, he or she must be 67 years of age or less at the time of death (we will explore the sensitivity of our results to this age cutoff later). Third, we require evidence, in the form of published articles and/or NIH grants, that these scientists have not entered a preretirement phase of their careers prior to the time of death. This constraint is somewhat subjective, but we validate in the Online Appendix our contention that the final set is limited to scientists who are "research-active" at the time of their death. These sequential screens delineate a set of 248 scientists. Finally, we limit our attention to the subset of stars who died suddenly and unexpectedly. This is less difficult than it might seem, because the vast majority of obituaries mention the cause of death explicitly.4 After eliminating 136 scientists whose deaths could have been anticipated by their colleagues, we are left with 112 extinct superstars (their names, cause of death, and institutional affiliations are listed in Table W1 in the Online Appendix). Table I provides descriptive statistics for the superstar sample. The average star received his degree in 1963, died at 57 years old, and worked with 47 coauthors during his lifetime. On the output side, the stars each received an average of roughly 11 million dollars in NIH grants (excluding center grants) and published 139 papers that had garnered 8,190 citations as of early 2008. II.B. The Universe of Potential Colleagues Information about the superstars' colleagues stems from the Faculty Roster of the Association of American Medical Colleges, 4. We exclude from the sample one scientist who took his own life, and a further two for whom suicide could not be ruled out. In ten other instances, the cause of death could not be ascertained from the obituaries and we contacted former collaborators individually to clarify the circumstances of the superstar's passing.
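To make the sequential screens concrete, the short sketch below applies them to a toy roster of elite scientists. It is only an illustration of the selection logic described above; the data frame and all column names (death_year, age_at_death, research_active_at_death, death_was_sudden) are hypothetical placeholders, not fields of the actual data set.

```python
# Illustrative application of the sequential sample screens; toy data only.
import pandas as pd

elite = pd.DataFrame({
    "scientist_id": [1, 2, 3, 4],
    "death_year": [1985, 2005, 1990, 1995],
    "age_at_death": [55, 60, 70, 62],
    "research_active_at_death": [True, True, True, False],
    "death_was_sudden": [True, True, True, True],
})

screened = elite[
    elite["death_year"].between(1979, 2003)   # at least 4 pre / 3 post years observed
    & (elite["age_at_death"] <= 67)           # age cutoff
    & elite["research_active_at_death"]       # no preretirement phase
]
extinct_superstars = screened[screened["death_was_sudden"]]  # sudden, unexpected deaths
print(extinct_superstars)
```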
TABLE I
SUMMARY STATISTICS FOR SUPERSTAR SCIENTISTS (N = 112)

                                                     Mean          Median       Std. dev.     Min.    Max.
Age at death                                         57.170        58           7.042         37      67
Degree year                                          1962.741      1964         10.193        1942    1984
M.D.                                                 0.420         0            0.496         0       1
Ph.D.                                                0.438         0            0.498         0       1
M.D./Ph.D.                                           0.143         0            0.351         0       1
Female                                               0.063         0            0.243         0       1
U.S. born                                            0.786         1            0.412         0       1
No. of collaborators                                 47.027        37           34.716        3       178
NIH review panel membership (past 5 yrs)             0.045         0            0.207         0       1
No. of collabs. in NIH review panels (past 5 yrs)    1.330         1            1.657         0       7
Career no. of publications                           139.607       121          91.371        25      473
Career no. of citations                              8,190         6,408        7,593         435     38,941
Career NIH funding                                   $10,722,590   $8,139,397   $12,057,638   $0      $70,231,584

Notes. Sample consists of 112 superstar life scientists who died suddenly while still actively engaged in research. See Appendix I and Section II.A for details on sample construction. Degree year denotes the year of the most recent degree attained by the superstar. Number of collaborators is defined as the number of distinct coauthors within the scientists' cumulative stock of publications. NIH review panel membership denotes stars who were members of an NIH review panel in the five years prior to their death, and the number of collaborators in NIH review panels refers to the number of coauthors of each superstar who sat on NIH review panels in the five years prior to the star's death. We use the terms "star" and "superstar" interchangeably.
to which we secured licensed access for the years 1975 through 2006. The roster is an annual census of all U.S. medical school faculty in which each faculty member is linked across yearly cross sections by a unique identifier.5 When all cross sections are pooled, we obtain a matched employee/employer panel data set. For each of the 230,000 faculty members who appear in the roster, we know the full name, the type of degrees received and the years they were awarded, gender, up to two departments, and medical school affiliation. An important implication of our reliance on the AAMC Faculty Roster is that the interactions we can observe in the data take place between faculty members, rather than between faculty members and trainees (graduate students or postdoctoral fellows).6 Because the roster only lists medical school faculty, however, it is not a complete census of the academic life sciences. For instance, it does not list information for faculty at institutions such as MIT, the University of California at Berkeley, Rockefeller University, the Salk Institute, or the Bethesda campus of the NIH; it also ignores faculty members in arts and sciences departments—such as biology and chemistry—if they do not hold joint appointments at local medical schools.7 Our interest lies in assessing the benefits of exposure to superstar talent that accrue through collaboration. Therefore, we focus on the one-degree, egocentric coauthorship network for the sample of 112 extinct superstars. To identify coauthors, we have developed a software program, the Stars/Colleague Generator, or S/CGEN.8 The source of the publication data is PubMED, an online resource from the National Library of Medicine that provides fast, free, and reliable access to the biomedical research literature. In a first step, S/CGEN downloads from the Internet the entire set of English-language articles for a superstar, provided they are not letters to the editor, comments, or other "atypical" articles. From this set of publications, S/CGEN strips out the list of coauthors, eliminates duplicate names, matches each coauthor with the Faculty Roster, and stores the identifier of every coauthor for whom a match is found. In a final step, the software queries PubMED for each validated coauthor, and generates publication counts as well as coauthorship variables for each superstar/colleague dyad, in each year. In the Online Appendix, we provide details on the matching procedure, how we guard against the inclusion of spurious coauthors, and our approach to addressing measurement error when tallying the publication output of coauthors with common names.
5. AAMC does not collect data from each medical school with a fixed due date. Instead, it collects data on a rolling basis, with each medical school submitting on a time frame that best meets its reporting needs. Nearly all medical schools report once a year, whereas many medical schools update once a semester. 6. To the extent that former trainees go on to secure faculty positions, they will be captured by our procedure even if the date of coauthorship predates the start of their independent career. 7. This limitation is less important than might appear at first glance. First, we have no reason to think that colleagues located in these institutions differ in substantive ways from those based in medical schools. Second, all our analyses focus on changes in research productivity over time for a given scientist. Therefore, the limited coverage is an issue solely for the small number of faculty who transition in and out of medical schools from (or to) other types of research employment. For these faculty, we were successful in filling career gaps by combining the AAMC Faculty Roster with the NIH data. 8. The software can be used by other researchers under an open-source (GNU) license. It can be downloaded, and detailed specifications accessed from the Web site http://stellman-greene.com/SCGen/. Note that the S/CGEN takes the AAMC Faculty Roster as an input; we are not authorized to share these data with third parties. However, they can be licensed from AAMC, provided a local IRB gives its approval and a confidentiality agreement protects the anonymity of individual faculty members.
II.C. Identification Strategy A natural starting point for identifying the effect of superstar death is to examine changes in collaborator research output after the superstar passes away, relative to when he or she was still alive, using a simple collaborator fixed effects specification. Because the extinction effect is mechanically correlated with the passage of time, as well as with the coauthor's age, our specifications must include life-cycle and period effects, as is the norm in studies of scientific productivity (Levin and Stephan 1991). In this framework, the control group that pins down the counterfactual age and calendar time effects for the coauthors that currently experience the death of a superstar consists of coauthors whose associated superstars died in earlier periods, or will die in future periods. Despite its long pedigree in applied economics (e.g., Grogger [1995]; Reber [2005]), this approach may be problematic in our setting. First, coauthors observed in periods after the deaths of their associated superstars are not appropriate controls if the event negatively affected the trend in their output; if this is the case, fixed effects will underestimate the true effect of superstar extinction. Second, collaborations might be subject to idiosyncratic life-cycle patterns, with their productive potential first increasing over time, eventually peaking, and thereafter slowly declining; if
this is the case, fixed effects will overestimate the true effect of superstar extinction, at least if we rely on collaborators treated in earlier or later periods as an "implicit" control group. To mitigate these threats to identification, our preferred empirical strategy relies on the selection of a matched control for each scientist who experiences the death of a superstar collaborator. These control scientists are culled from the universe of coauthors for the 10,000 superstars who do not die (see Section II.A). Combining the treated and control samples enables us to estimate the effect of superstar extinction in a difference-in-differences (DD) framework. Using a "coarsened exact matching" procedure detailed in Appendix II, the control coauthors are chosen so that (1) treated scientists exhibit no differential output trends relative to controls up to the time of superstar death; (2) the distributions of career age at the time of death are similar for treated and controls; (3) the time paths of output for treated and control coauthors are similar up to the time of death; and (4) the dynamics and main features of collaboration (number of coauthorships at the time of death; time elapsed since first and last coauthorship; status of the superstar collaborator as summarized by cumulative citations in the year of death) are balanced between treated and control groups. However, adding this control group to the basic regression does not, by itself, yield a specification where the control group consists exclusively of matched controls. Figure A.1 displays the trends in average and median number of quality-adjusted publications, for treated and control collaborators respectively, without any adjustment for age or calendar time effects. This raw comparison is not without its problems, because it involves centering the raw data around the time of death, thus ignoring the lack of congruence between experimental and calendar time. Yet it is completely nonparametric, and provides early evidence that the loss of a superstar coauthor leads to a decrease in collaborators' publication output. Furthermore, the magnitudes of the estimates presented below are very similar whether or not control scientists are added to the estimation sample. Another potential concern with the addition of this "explicit" control group is that control coauthors could be affected by the treatment of interest. No scientist is an island. The set of coauthors for our 10,349 elite scientists comprises 65% of the labor market, and the remaining 35% corresponds in large part to clinicians who hold faculty appointments but do not publish
regularly. Furthermore, the death of a prominent scientist could affect the productivity of noncoauthors if meaningful interactions take place in “idea space,” as we propose. Thus, in robustness checks, we check whether eliminating from the estimation sample treated and control collaborators separated by small path lengths in the coauthorship network matters for the substance, or even the magnitude, of our main results. III. DESCRIPTIVE STATISTICS When applied to our sample of 112 extinct superstars, S/CGEN identifies 5,267 distinct coauthors with unique PubMED names.9 Our matching procedure can identify a control scientist for 5,064 (96%) of the treated collaborators. The descriptive statistics in Table II pertain to the set of 2 × 5,064 = 10,128 matched treated and control scientists. The covariates of interest are measured in the (possibly counterfactual) year of death for the superstar. We distinguish variables that are inherently dyadic (e.g., colocation at time of death) from variables that characterize the coauthor at a particular point of time (e.g., NIH R01 funding at the time of death). III.A. Dyadic Variables Of immediate interest is the distribution of coauthorship intensity at the dyad level. Although the average number of coauthorships is slightly less than three, the distribution is extremely skewed (Figure I). We define “casual” dyads as those that have two or fewer coauthorships with the star, “regular” dyads as those with three to nine coauthorships, and “close” dyads as those with ten or more coauthorships. Using these cutoffs, regular dyads correspond to those between the 75th and the 95th percentile of coauthorship intensity, whereas close dyads correspond to those above the 95th percentile. We focus next on collaboration age and recency. On average, collaborations begin eleven years before the star’s death, and time since last coauthorship is slightly more than nine years. In other words, most of the collaborations in the sample do not involve active research projects at the time of death. Recent collaborations 9. Whenever a scientist collaborates with more than one extinct superstar (this is relevant for 10% of the sample), we take into account only the first death event. We have verified that limiting the estimation sample to collaborators with one and only one tie to a superstar who dies does not change the substance, or even the magnitudes, of our core result.
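As a concrete illustration of these intensity cutoffs, the brief sketch below bins a hypothetical column of dyad-level coauthorship counts into the casual, regular, and close categories; the data frame and column name are invented for the example and are not the study's data.

```python
# Classify dyads by coauthorship intensity: casual (<= 2), regular (3-9), close (>= 10).
# Toy data; the column name is a hypothetical placeholder.
import pandas as pd

dyads = pd.DataFrame({"n_coauthorships_at_death": [1, 2, 3, 7, 10, 25]})
dyads["intensity"] = pd.cut(
    dyads["n_coauthorships_at_death"],
    bins=[0, 2, 9, float("inf")],
    labels=["casual", "regular", "close"],
)
print(dyads)
```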
TABLE II
SUMMARY STATISTICS FOR COLLABORATORS IN THE YEAR OF SUPERSTAR DEATH

Control collaborators (N = 5,064)
                                        Mean       Median    Std. dev.   Min.    Max.
No. of weighted publications            18.314     8         27.917      0       342
Cum. no. of weighted publications       327.330    187       409.098     0       3,968
Holds R01 grant                         0.559      1         0.497       0       1
Colocated                               0.144      0         0.351       0       1
Career age                              23.698     23        9.963       1       59
Elite                                   0.093      0         0.290       0       1
Cum. no. of coauthorships               2.734      1         4.339       1       69
No. of other superstar collaborators    2.746      2         3.516       0       31
Years since first coauthorship          10.949     10        7.901       0       42
Years since last coauthorship           9.275      8         7.774       0       41
Former trainee of the star              0.070      0         0.255       0       1
"Accidental" collaborator               0.076      0         0.265       0       1
MeSH keyword overlap                    0.265      0         0.162       0       1
Superstar citation count                10,083     7,245     8,878       99      90,136

Treated collaborators (N = 5,064)
                                        Mean       Median    Std. dev.   Min.    Max.
No. of weighted publications            19.068     8         31.656      0       491
Cum. no. of weighted publications       334.905    187       436.927     0       4,519
Holds R01 grant                         0.571      1         0.495       0       1
Colocated                               0.123      0         0.328       0       1
Career age                              23.761     23        9.969       0       59
Elite                                   0.077      0         0.266       0       1
Cum. no. of coauthorships               2.835      1         4.894       1       75
No. of other superstar collaborators    3.087      2         4.255       0       44
Years since first coauthorship          11.022     10        7.896       0       39
Years since last coauthorship           9.255      8         7.728       0       38
Former trainee of the star              0.084      0         0.278       0       1
"Accidental" collaborator               0.075      0         0.264       0       1
MeSH keyword overlap                    0.259      0         0.157       0       1
Superstar citation count                10,228     7,239     7,952       397     34,746

Notes. The samples consist of faculty collaborators of 112 deceased superstar life scientists and an equal number of matched control coauthors. See Sections II.B and III for details on the sample construction and variable definitions and Appendix II for details on the matching procedure. All variables are measured as of the year of superstar death. Publications are JIF-weighted.
FIGURE I
Distribution of Coauthorship Intensity
[Figure: histogram of the proportion of collaborators (vertical axis, 0 to 60%) by the number of coauthorships with the star (horizontal axis, 1 to 100); N = 5,267 collaborators.]
(those that involve at least one coauthorship in the three years preceding the passing of the superstar) map into the top quartile of collaboration recency at the dyad level. The research collaborations studied here occur between faculty members, who often run their own labs (a conjecture reinforced by the large proportion of coauthors with independent NIH funding). Yet it is interesting to distinguish collaborators who trained under a superstar (either in graduate school or during a postdoctoral fellowship) from those whose collaborations were initiated at a time in which both nodes in the dyad already had a faculty appointment. Although there is no roster of mentor/mentee pairs, coauthorship norms in the life sciences provide an opportunity to identify former trainees. Specifically, we flag first-authored articles published within a few years of receipt of the coauthor's degree in which the superstar appears in last position on the authorship roster.10 Using this method, we find that roughly 8% of treated collaborators were former trainees of the associated superstar.
10. The purported training period runs from three years before graduation to four years after graduation for Ph.D.'s and M.D./Ph.D.'s, and from the year of graduation to six years after graduation for M.D.'s. Recall that we do not observe the population of former trainees, but only those trainees who subsequently went on to get full-time faculty positions in the United States. One concern is selection bias for the set of former trainees associated with superstars who died when they had just completed training. To guard against this potential source of bias, we eliminated all former trainees from the sample with career age less than five at the time of death.
We now examine the spatial distribution of collaborations. Slightly more than 12% of collaborations correspond to scientists who were colocated at the time of superstar extinction; though this is not the focus of the paper, the proportion of local collaborations has declined over time, as many previous authors have documented (e.g., Rosenblat and Möbius [2004]). We also provide a measure of collaborators' proximity in ideas space. Every publication indexed by PubMED is tagged by a large number of descriptors, selected from a dictionary of approximately 25,000 MeSH (Medical Subject Headings) terms. Our measure of intellectual proximity between members of a dyad is simply the number of unique MeSH terms that overlap in their noncoauthored publications, normalized by the total number of MeSH terms used by the superstar's coauthor. The time window for the calculation is the five years that precede the passing of the superstar. The distribution of this variable is displayed in Figure II.11 Finally, we create a measure of social proximity that relies not on the quantity of coauthored output, but on the degree of social interaction it implies. We focus on the pairs involving coauthors who, whenever they collaborate, find themselves in the middle of the authorship list. Given the norms that govern the allocation of credit in the life sciences, these coauthors are likely to share the least amount of social contact. Of the dyads in the sample, 7.5% correspond to this situation of "accidental coauthorship"—the most tenuous form of collaboration. III.B. Coauthor Variables We briefly mention demographic characteristics that do not play a role in the econometric results but are nonetheless informative. The sample is 20% female (only 10% of the superstars are women); approximately half of all coauthors are M.D.'s, 40% are Ph.D.'s, and the remainder are M.D./Ph.D.'s; and a third are affiliated with basic science departments (as opposed to clinical or public health departments). The coauthors are about eight years younger than the superstars on average (1971 vs. 1963 for the year of highest degree). Coauthors lag behind superstars in terms of publication output at the time of death, but the difference is not dramatic (88
11. Further details on its construction are provided in the Online Appendix, Section II.
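A minimal sketch of the proximity-in-ideas-space measure described above is given below. It assumes each member of the dyad has already been reduced to the set of MeSH descriptors attached to his or her non-coauthored articles from the five years preceding the star's death; the function name and the example terms are invented for illustration.

```python
# Normalized MeSH overlap: unique shared terms divided by the coauthor's total terms.
def mesh_overlap(coauthor_terms: set, star_terms: set) -> float:
    if not coauthor_terms:
        return 0.0
    return len(coauthor_terms & star_terms) / len(coauthor_terms)

# Hypothetical example: 2 of the coauthor's 4 descriptors also appear in the star's work.
print(mesh_overlap({"Apoptosis", "Mice", "Neurons", "RNA, Messenger"},
                   {"Apoptosis", "RNA, Messenger", "Zebrafish"}))  # -> 0.5
```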
FIGURE II
Proximity in Ideas Space
[Figure: histogram of the number of collaborators (vertical axis, 0 to 300) by distance in ideas space, i.e., normalized MeSH keyword overlap in the year of star death, from 0.00 to 1.00 (horizontal axis); N = 5,267 collaborators.]
Measure of distance in ideas space is defined as the number of unique MeSH terms that overlap between the colleague's and superstar's publications (excluding coauthored output), normalized by the total number of MeSH terms used in the colleague's total publications. This measure is calculated for articles published in the five years preceding superstar death. Calculation excludes coauthored publications.
vs. 140 articles, on the average). Assortative matching is present in the market for collaborators, as reflected by the fact that 2,852 (28.16%) of our 10,128 coauthors belong to the elite sample of 10,349 scientists. Of collaborators, 55% had served as PI on at least one NIH R01 grant when the superstar passed away, whereas about 8% of the treated collaborators (and 9% of the controls) belong to a more exclusive elite: Howard Hughes medical investigators, members of the NAS, or MERIT awardees. The estimation sample pools observations between 1975 and 2006 for the dyads described above. The result is an unbalanced panel data set with 153,508 collaborator × year observations (treated collaborators only) or 294,943 collaborator × year observations (treated and control collaborators). IV. RESULTS The exposition of the econometric results proceeds in three stages. After a brief review of methodological issues, we provide results that pertain to the main effect of superstar exposure on
publication rates. Second, we examine whether this effect merely reflects the adverse impact of losing important skills within ongoing collaborative teams. Third, we attempt to explicate the mechanism, or set of mechanisms, responsible for the results. We do so by exploring heterogeneity in the treatment through the interaction of the postdeath indicator variable below with various attributes of the superstar, colleague, and dyad. IV.A. Econometric Considerations Our estimating equation relates colleague j's output in year t to characteristics of j, superstar i, and dyad ij:

(1)   $E[y_{jt} \mid X_{ijt}] = \exp\!\left[\beta_0 + \beta_1\,\text{AFTER DEATH}_{it} + f(\text{AGE}_{jt}) + \delta_t + \gamma_{ij}\right]$,
where y is a measure of research output, AFTER DEATH denotes an indicator variable that switches to one the year after the superstar dies, $f(\text{AGE}_{jt})$ corresponds to a flexible function of the colleague's career age, the $\delta_t$'s stand for a full set of calendar year indicator variables, and the $\gamma_{ij}$'s correspond to dyad fixed effects, consistent with our approach to analyze changes in j's output following the passing of superstar i. The dyad fixed effects control for many individual characteristics that could influence research output, such as gender or degree. Academic incentives depend on the career stage; given the shallow slope of posttenure salary increases, Levin and Stephan (1991) suggest that levels of investment in research should vary over the career life cycle. To flexibly account for life-cycle effects, we include seventeen indicator variables corresponding to different career age brackets, where career age measures the number of years since a scientist earned his/her highest degree (M.D. or Ph.D.).12 In specifications that include an interaction between the treatment effect and some covariates, the models also include a set of interactions between the life-cycle effects and these covariates. Estimation. The dependent variables of interest, including weighted or unweighted publication counts and NIH grants awarded, are skewed and nonnegative. For example, 24.80% of the 12. The omitted category corresponds to faculty members in the very early years of their careers (before age −3). It is not possible to separately identify calendar year effects from age effects in the "within" dimension of a panel in a completely flexible fashion, because one cannot observe two individuals at the same point in time who have the same (career) age but earned their degrees in different years (Hall, Mairesse, and Turner 2007).
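The sketch below builds the main regressors of equation (1) for a collaborator-year panel. It is a minimal illustration under assumed column names (year, star_death_year, degree_year); the exact boundaries of the paper's seventeen career-age brackets are not reproduced here, so the bins shown are placeholders.

```python
# Construct AFTER DEATH, career age, and coarse career-age brackets for equation (1).
# Column names and bracket boundaries are illustrative only.
import pandas as pd

def build_regressors(panel: pd.DataFrame) -> pd.DataFrame:
    out = panel.copy()
    # Treatment switches to one the year after the superstar dies
    out["after_death"] = (out["year"] > out["star_death_year"]).astype(int)
    # Career age = years since the scientist's highest degree
    out["career_age"] = out["year"] - out["degree_year"]
    # Placeholder brackets; the omitted category would be the earliest bracket
    out["age_bracket"] = pd.cut(
        out["career_age"],
        bins=[-100, -4, 0, 5, 10, 15, 20, 25, 30, 35, 100],
    )
    return out
```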
collaborator/year observations in the data correspond to years of no publication output; the figure climbs to 87.40% if one focuses on the count of successful grant applications. Following a longstanding tradition in the study of scientific and technical change, we present conditional quasi–maximum likelihood (QML) estimates based on the fixed-effect Poisson model developed by Hausman, Hall, and Griliches (1984). Because the Poisson model is in the linear exponential family, the coefficient estimates remain consistent as long as the mean of the dependent variable is correctly specified (Gouriéroux, Monfort, and Trognon 1984).13 Inference. QML (i.e., "robust") standard errors are consistent even if the underlying data-generating process is not Poisson. In fact, the Hausman et al. estimator can be used for any nonnegative dependent variables, whether integer or continuous (Santos Silva and Tenreyro 2006), as long as the variance/covariance matrix is computed using the outer product of the gradient vector (and therefore does not rely on the Poisson variance assumption). Further, QML standard errors are robust to arbitrary patterns of serial correlation (Wooldridge 1997), and hence immune to the issues highlighted by Bertrand, Duflo, and Mullainathan (2004) concerning inference in DD estimation. We cluster the standard errors around superstar scientists in the results presented below. Dependent Variables. Our primary outcome variable is a coauthor's number of publications. Because S/CGEN matches the entire authorship roster for each article, we can separate those publications coauthored with the superstar from those produced independent of him/her. We perform a quality adjustment by weighting each publication by its journal impact factor (JIF)—a measure of the frequency with which the average article in a journal has been cited in a particular year. One obvious shortcoming of this adjustment is that it does not account for differences in impact within a given journal. In the Online Appendix (Section V), we present additional results based on article-level citation outcomes. 13. In the Online Appendix (Section IV), we show that OLS yields results very similar to QML Poisson estimation for our main findings.
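For readers who want to see the estimator in code, the following self-contained sketch simulates a small dyad-year panel and fits a Poisson model with dyad dummies, year effects, and standard errors clustered on the (simulated) superstar. It only illustrates the QML logic and the elasticity interpretation of the coefficients; it is not the paper's estimation code, and at the true sample size one would use a dedicated conditional fixed-effects Poisson or high-dimensional PPML routine rather than explicit dummies.

```python
# Illustrative QML Poisson with dyad fixed effects (as dummies) and clustered SEs.
# Simulated data; not the study's code or data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_dyads, n_years = 150, 10
df = pd.DataFrame({
    "dyad_id": np.repeat(np.arange(n_dyads), n_years),
    "year": np.tile(np.arange(1990, 1990 + n_years), n_dyads),
})
df["star_id"] = df["dyad_id"] % 30                     # several dyads share one star
df["death_year"] = 1994 + df["star_id"] % 3            # staggered (simulated) deaths
df["after_death"] = (df["year"] > df["death_year"]).astype(int)
dyad_effect = rng.normal(0, 0.3, n_dyads)[df["dyad_id"]]
mu = np.exp(1.0 + dyad_effect - 0.09 * df["after_death"])  # true effect about -0.09
df["pubs"] = rng.poisson(mu)

res = smf.poisson("pubs ~ after_death + C(year) + C(dyad_id)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["star_id"]}, disp=False
)
b = res.params["after_death"]
print(f"after_death: {b:.3f} -> {1 - np.exp(b):.1%} decline in the publication rate")
```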
TABLE III
IMPACT OF SUPERSTAR DEATH ON COLLABORATORS' PUBLICATION RATES

                            Panel A                             Panel B
                            All JIF-weighted publications       JIF-weighted publications written with others
                            Without ctrls    With ctrls         Without ctrls    With ctrls
                            (1a)             (1b)               (2a)             (2b)
After death                 −0.092∗∗         −0.086∗∗           −0.057∗∗         −0.054∗
                            (0.022)          (0.025)            (0.022)          (0.024)
Log pseudo-likelihood       −974,285         −1,832,594         −950,864         −1,783,958
No. of observations         153,508          294,943            153,508          294,943
No. of collaborators        5,267            10,128             5,267            10,128

Notes. Estimates stem from conditional quasi–maximum likelihood Poisson specifications. Dependent variable is the total number of JIF-weighted articles authored by a collaborator of a superstar life scientist in the year of observation. All models incorporate a full suite of year effects as well as seventeen age category indicator variables (career age less than −3 is the omitted category). Exponentiating the coefficients and differencing from one yield numbers interpretable as elasticities. For example, the estimates in column (1a) imply that collaborators suffer on average a statistically significant (1 − exp[−0.092]) = 8.79% decrease in the rate of publication after their superstar coauthor passes away. Robust (QML) standard errors in parentheses, clustered at the level of the superstar. ∗ p < .05. ∗∗ p < .01.
publication output. We find a sizable and significant 8.8% decrease in the yearly number of quality-adjusted publications coauthors produce after the star dies. Column (1b) adds the set of control coauthors to the estimation sample. This reduces our estimate of the treatment effect only slightly, to a statistically significant 8.2% decline. Columns (2a) and (2b) provide the results for an identical set of specifications, except that we modify the dependent variable to exclude publications coauthored with the superstar when computing the JIF-weighted publication counts. The contrast between the results in Panels A and B elucidates scientists' ability to substitute toward new collaborative relationships upon the death of their superstar coauthor. The effects are now smaller, but they remain statistically significant. We also explore the dynamics of the effects uncovered in Table III. We do so by estimating a specification in which the treatment effect is interacted with a set of indicator variables corresponding to a particular year relative to the superstar's death, and then graphing the effects and the 95% confidence interval around them (Figures IIIA and IIIB, corresponding to Table III, columns (1b) and (2b)). Following the superstar's death, the treatment effect increases monotonically in absolute value, becoming statistically significant three to four years after death. Two aspects of this result are worthy of note. First, we find no evidence of recovery—the effect of superstar extinction appears permanent. Though we will explore mechanisms in more detail below, this seems inconsistent with a bereavement-induced loss in productivity. Second, the delayed onset of the effect makes sense because it plausibly takes some time to exhaust the productive potential of the star's last scientific insights. In addition, the typical NIH grant cycle is three to five years, and the impact of a superstar's absence may not really be felt until it becomes time to apply for a new grant. In all specifications, the results with and without controls are quite similar. In the remainder of the paper, the estimation sample always includes the "explicit" control group, though the results without it are qualitatively similar.
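The sketch below shows one way to construct the event-time indicators behind these dynamic estimates. The relative-year binning loosely mirrors the description in the Figure III notes (years well before death pooled into the earliest bin, the year of death omitted, years well after death pooled into the last bin); the data frame and its columns (year, star_death_year, treated) are hypothetical placeholders.

```python
# Event-time indicators for the dynamics specification (interacted with treatment).
# Column names are hypothetical; relative year 0 (year of death) is the omitted bin.
import pandas as pd

def event_time_dummies(df: pd.DataFrame) -> pd.DataFrame:
    rel = (df["year"] - df["star_death_year"]).clip(lower=-11, upper=15)
    d = pd.get_dummies(rel, prefix="rel_year")
    d = d.drop(columns="rel_year_0", errors="ignore")  # omit the year of death
    d = d.multiply(df["treated"], axis=0)              # interact with treatment status
    return pd.concat([df, d], axis=1)
```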
FIGURE III
Dynamics of the Treatment Effect
A. All publications. B. Publications without superstar collaborator.
[Figure: two event-time plots of coefficient estimates (solid lines) with 95% confidence intervals (dashed lines); horizontal axis, time to death, from −10 to 15 years; vertical axis, treatment effect, from −0.75 to 0.25.]
The solid lines in the above plots correspond to coefficient estimates of conditional fixed effects quasi–maximum likelihood Poisson specifications in which the weighted publication output of a collaborator is regressed onto year effects, seventeen indicator variables corresponding to different age brackets, and interactions of the treatment effect with 27 indicator variables corresponding to eleven years before the year of death and prior, ten years before the year of death, nine years before the year of death, . . . , fourteen years after the year of death, and fifteen years after the year of death and above (the indicator variable for treatment status interacted with the year of death is omitted). The 95% confidence interval (corresponding to robust standard errors, clustered around superstars) around these estimates is plotted with dashed lines. The graph for all publications (A) uses column (1b) of Table III as a baseline (i.e., treated and control collaborators; the dependent variable includes all of the collaborator's publications); the graph for publications without superstar collaborator (B) uses column (2b) of Table III as a baseline (i.e., treated and control collaborators; the dependent variable is limited to the collaborator's publications in which the superstar does not appear on the authorship list).
a key collaborator, other team members might struggle to suitably replace the pieces of knowledge that were embodied in the star. Viewed in this light, the effects uncovered in Table III could be considered unsurprising—a mechanical reflection of the skill substitution process. The fact that publications with coauthors other than the superstar are adversely affected and the permanence of the treatment effect already suggest that other forces are at play. The imperfect skill substitution (ISS) story carries additional testable implications. First, one would expect coauthors with closer relationships with the star to suffer steeper decreases in output; the same would be expected for recent or new collaborations, which are more likely to involve ongoing research efforts at the time of death. Table IV examines these implications empirically. We find that regular and, to a lesser extent, close collaborators are indeed more negatively affected than casual collaborators, but these differential losses are relatively small in magnitude and statistically insignificant (column (1a)). The same holds true for recent collaborations (column (2a), at least one joint publication in the three years preceding the star's death) and for young collaborations (those for which the first coauthored publication appeared in the five years preceding the star's death, unreported results available from the authors). Columns (1b) and (2b) provide results for an identical set of specifications, but excluding publications coauthored with the superstar. The contrast between the results in columns (1a) and (1b) (resp. (2a) and (2b)) elucidates scientists' ability to substitute toward new collaborative relationships upon the death of their superstar coauthor. The estimates imply that close and, to a lesser extent, recent coauthors do manage to find replacement collaborators (or to intensify already existing collaborations). Close collaborators experience an imprecisely estimated 6.18% average increase in their quality-adjusted publications written independent of the star, but this is only a partial offset for the overall loss documented in column (1a). We find that casual collaborators and collaborators without a recent coauthorship see their independent output decline respectively by 5.54% (column (1b)) and 8.25% (column (2b)). Very similar results are obtained when all these covariates are combined into one specification (columns (3a) and (3b)). Although the differential impacts on the closest and most recent collaborators are not statistically significant, they do appear to move in the direction that supports the skill substitution hypothesis. However, the inability of scientists to compensate fully
TABLE IV
COLLABORATOR PUBLICATION RATES AND IMPERFECT SKILL SUBSTITUTION

                                        Coauthorship intensity        Coauthorship recency          Coauthorship intensity & recency
                                        All pubs.    Pubs. written    All pubs.    Pubs. written    All pubs.    Pubs. written
                                                     with others                   with others                   with others
                                        (1a)         (1b)             (2a)         (2b)             (3a)         (3b)
After death                             −0.076∗∗     −0.057∗          −0.087∗∗     −0.074∗∗         −0.080∗∗     −0.075∗∗
                                        (0.026)      (0.025)          (0.024)      (0.024)          (0.024)      (0.024)
After death × regular collaborator      −0.044       −0.020                                         −0.039       −0.018
                                        (0.041)      (0.042)                                        (0.042)      (0.043)
After death × close collaborator        −0.026       0.117                                          −0.014       0.119
                                        (0.068)      (0.073)                                        (0.069)      (0.074)
After death × at least one coauthorship
  in the three years preceding
  star's death                                                        −0.022       0.032            −0.021       0.028
                                                                      (0.038)      (0.039)          (0.039)      (0.039)
Log pseudo-likelihood                   −1,831,987   −1,781,742       −1,822,664   −1,775,680       −1,821,791   −1,774,167
No. of observations                     294,943      294,943          294,943      294,943          294,943      294,943
No. of collaborators                    10,128       10,128           10,128       10,128           10,128       10,128

Notes. Estimates stem from conditional quasi–maximum likelihood Poisson specifications. Dependent variable is the total number of JIF-weighted articles authored by a collaborator of a superstar life scientist in the year of observation. Regular and close collaborator are indicator variables for the number of publications coauthored by the superstar and colleague at the time of death (regular collaborations correspond to between three and nine coauthored publications; close collaborations correspond to ten or more coauthored publications; casual collaborations—the omitted category—corresponds to one or two coauthored publications). All models incorporate year effects and seventeen age category indicator variables (career age less than −3 is the omitted category), as well as seventeen interaction terms between the age effects and each covariate of interest (i.e., column (3b) includes a total of 3 × 17 = 51 age-specific interaction terms). Robust (QML) standard errors in parentheses, clustered at the level of the superstar. ∗ p < .05. ∗∗ p < .01.
for the loss of expected future collaborations through alternative relationships, as well as the permanence of the extinction effect, demonstrates that something more than the star's skills disappears upon their death. Taken as a whole, these results suggest that the treatment effect from Table III cannot be fully explained by imperfect skill substitution within ongoing teams. IV.D. Disentangling Mechanisms We exploit the fine-grained level of detail in the data to sort among the mechanisms that might underlie the extinction effect. Are collaborative ties with superstars conduits for tangible resources, or for knowledge and ideas? These two broad classes of explanations are not mutually exclusive, but ascertaining their relative importance matters because their welfare implications differ sharply. If superstars merely act as gatekeepers, then their deaths will lead to a reallocation of resources away from former collaborators, but may have little impact on social welfare. Conversely, if spillovers of knowledge were enabled by collaboration, their passing might result in significant welfare losses. Superstars as Gatekeepers. Superstars may matter for their coauthors because they connect them to important resources either within their institution or in the scientific world at large. These resources might include funding, administrative clout, editorial goodwill, or other potential collaborators. We attempt to evaluate the validity of three particular implications of this story in Table V. First, we examine whether the superstar's ties to the NIH funding apparatus moderate the magnitude of the extinction effect. Whereas social scientists sometimes emphasize the role that journal editors can have in shaping individual careers, life scientists are often more concerned that the allocation of grant dollars deviates from the meritocratic ideal. Therefore, we investigate whether the treatment effect is of larger magnitude when the star either sat on NIH review panels in the last five years, or has coauthorship ties with other scientists who sat on study sections in the recent past. In column (1), we find that this is not the case. The differential impacts are relatively small, positive in magnitude, and not statistically significant. Second, we address the hypothesis that superstars matter because they broker relationships between scientists who would otherwise remain unaware of each other's expertise. We do so by
TABLE V
COLLABORATOR PUBLICATION RATES AND ACCESS TO RESOURCES

                                                   Star's ties to NIH    Quartile of            Former        All covariates
                                                   funding process       betweenness centrality trainee       combined
                                                   (1)                   (2)                    (3)           (4)
After death                                        −0.105∗∗              −0.067∗                −0.086∗∗      −0.089∗
                                                   (0.037)               (0.028)                (0.025)       (0.035)
After death × star sat on NIH review panel         0.042                                                      0.024
                                                   (0.064)                                                    (0.070)
After death × star's no. of coauthor ties
  to NIH review panelists                          0.011                                                      0.014
                                                   (0.013)                                                    (0.015)
After death × star in fourth quartile of
  betweenness centrality                                                 −0.031                               −0.040
                                                                         (0.046)                              (0.051)
After death × coauthor is former trainee                                                        0.056         0.048
                                                                                                (0.069)       (0.069)
Log pseudo-likelihood                              −1,831,339            −1,831,779             −1,830,582    −1,828,754
No. of observations                                294,943               294,943                294,943       294,943
No. of collaborators                               10,128                10,128                 10,128        10,128

Notes. Estimates stem from conditional quasi–maximum likelihood Poisson specifications. Dependent variable is the total number of JIF-weighted articles authored by a collaborator of a superstar life scientist in the year of observation. Betweenness centrality is measured using the network of 10,349 superstar life scientists, former trainee indicates that the colleague was a graduate student or postdoctoral fellow in the laboratory of the superstar (7.69% of the collaborators). All models incorporate year effects and seventeen age category indicator variables (career age less than −3 is the omitted category), as well as seventeen interaction terms between the age effects and each covariate of interest. Robust (QML) standard errors in parentheses, clustered at the level of the superstar. ∗ p < .05. ∗∗ p < .01.
computing the betweenness centrality for the extinct superstars in the coauthorship network formed by the 10,349 elite scientists.14 We then rank the superstars according to quartile of betweenness and look for evidence that collaborators experience a more pronounced decline in output if their superstar coauthor was more central (column (2)). We find that collaborators with stars in the top quartile suffer additional losses, relative to collaborators of less central superstars, but this differential effect is statistically insignificant. Finally, in column (3), we look for a differential effect of superstar death for coauthors who were also former trainees. It is possible that mentors continue to channel resources to their former associates even after they leave their laboratories, in which case one would expect these former trainees to exhibit steeper and more precipitous declines following the passing of their adviser. In fact, the differential effect is large and positive, though not statistically significant. The evidence presented in Table V appears broadly inconsistent with the three particular gatekeeping stories whose implications we could test empirically. Our assessment of the gatekeeping mechanism must remain guarded for two reasons. First, the effects of the variables used to proxy for the strength of social ties are subject to alternative interpretations. For instance, a former trainee effect could also be interpreted as providing evidence of knowledge spillovers, because mentorship can continue into the early faculty career and be extremely important for a young scholar's intellectual development. Furthermore, it is possible to think of alternative versions of the gatekeeping mechanism; as an example, superstars might be able to curry favors with journal editors on behalf of their protégés, or they might be editors themselves. We prefer to frame the findings contrapositively: it is hard to look at the evidence presented so far and conclude that access to resources is a potent way in which superstars influence their collaborators' scientific output. Knowledge Spillovers. We now examine the possibility that stars generate knowledge spillovers onto their coauthors. In Table VI, we build a circumstantial case for the spillover view by 14. Betweenness is a measure of the centrality of a node in a network, and is calculated as the fraction of shortest paths between dyads that pass through the node of interest. In social network analysis, it is often interpreted as a measure of the influence a node has over the spread of information through the network.
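The measure described in footnote 14 is straightforward to compute with standard network tools. The sketch below does so for a toy coauthorship graph (the edge list is invented) and flags nodes in the top quartile of betweenness, in the spirit of column (2) of Table V; it is an illustration, not the paper's procedure.

```python
# Betweenness centrality in a (toy) elite coauthorship network.
import networkx as nx

edges = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E"), ("E", "F")]
G = nx.Graph(edges)

btw = nx.betweenness_centrality(G)  # share of shortest paths passing through each node
cutoff = sorted(btw.values())[int(0.75 * (len(btw) - 1))]
top_quartile = {node for node, b in btw.items() if b >= cutoff}
print(btw, top_quartile)
```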
TABLE VI
COLLABORATOR PUBLICATION RATES AND PROXIMITY IN GEOGRAPHIC AND INTELLECTUAL SPACE

                                              (1)          (2)          (3)          (4)          (5)
After death                                   −0.092∗∗     −0.067∗∗     −0.094∗∗     −0.081∗∗     −0.074∗∗
                                              (0.027)      (0.023)      (0.022)      (0.024)      (0.026)
After death × colocated                       0.042                                  0.037        0.042
                                              (0.043)                                (0.043)      (0.044)
After death × kwd. overlap in top quartile                 −0.115∗                   −0.114†      −0.127∗
                                                           (0.059)                   (0.059)      (0.057)
After death × "accidental" collaborator                                 0.104†       0.111†       0.077
                                                                        (0.060)      (0.058)      (0.055)
After death × regular collaborator                                                                −0.030
                                                                                                   (0.044)
After death × close collaborator                                                                   0.002
                                                                                                   (0.072)
After death × recent collaborator                                                                  −0.022
                                                                                                   (0.038)
% of collaborators affected                   13.33        25.35        7.53
Log pseudo-likelihood                         −1,831,900   −1,830,305   −1,831,787   −1,828,805   −1,817,667
No. of observations                           294,943      294,943      294,943      294,943      294,943
No. of collaborators                          10,128       10,128       10,128       10,128       10,128

Notes. Estimates stem from conditional quasi–maximum likelihood Poisson specifications. Dependent variable is the total number of JIF-weighted articles authored by a collaborator of a superstar life scientist in the year of observation. Colocated indicates that the colleague and superstar were employed at the same institution at the time of superstar death. Keyword overlap is the normalized number of MeSH keywords that appear on both the colleague's and superstar's nonjoint publications. Accidental collaborators are those who only appear on coauthored publications with the superstar when both are in the middle of the authorship list. Regular and close collaborator are indicator variables for the number of publications coauthored by the superstar and colleague at the time of death (regular collaborations correspond to between three and nine coauthored publications; close collaborations correspond to ten or more coauthored publications; casual collaborations—the omitted category—correspond to one or two coauthored publications). All models incorporate year effects and seventeen age category indicator variables (career age less than −3 is the omitted category), as well as seventeen interaction terms between the age effects and the covariate of interest. Robust (QML) standard errors in parentheses, clustered at the level of the superstar. † p < .10. ∗ p < .05. ∗∗ p < .01.
documenting evidence of additional output losses for collaborators who were more proximate to the superstar at the time of death, using two different meanings of proximity: physical and intellectual. In column (1), we investigate the impact of physical proximity by interacting the treatment effect with an indicator variable for those collaborators who were colocated with the superstar at the time of death. We find essentially no difference between the fates of these coauthors and those of coauthors located further away—the interaction term is positive, small in magnitude, and imprecisely estimated. At first blush, this finding appears consistent with some recent work suggesting a fading role for geographic distance, both as a factor influencing the formation of teams (Rosenblat and Möbius 2004; Agrawal and Goldfarb 2008), and as a factor circumscribing the influence of peers (Kim, Morse, and Zingales 2009; Griffith, Lee, and Van Reenen 2007; Waldinger 2008). However, our estimate of the colocation interaction term conflates the effect of the loss of knowledge spillovers, the effect of the loss of help and protection provided by the star in the competition for internal resources (such as laboratory space), and the effect of any measure taken by the institution to compensate for the death of the superstar. As a result, it is unclear whether our results contradict the more conventional view that spillovers of knowledge are geographically localized (Zucker, Darby, and Brewer 1998; Ham and Weinberg 2008).15 In column (2), we investigate whether the death of a superstar coauthor has a disparate impact on the group of scientists who work on similar research problems. We proxy intellectual distance between the superstar and his/her coauthors with our measure of normalized keyword overlap. Coauthors in the top quartile of this measure at the time of death suffer output decreases that are particularly large in magnitude (−12.2%).16 This evidence is consistent with the existence of an "invisible college"—an elite of productive scientists highly visible in a research area, combined with a "scatter" of less eminent ones, whose attachment to the field may be more tenuous (de Solla Price and Beaver 1966; Crane 1972). Superstar scientists make their field of inquiry visible to others of lesser standing who might enter it; they replenish their field with 15. We thank an anonymous reviewer for making this point. 16. Specifications that include four different interactions corresponding to the four quartiles show that the treatment effect is monotonically increasing in intellectual distance, but we do not have enough statistical power to reject the hypothesis that the four coefficients are equal to one another.
fresh ideas, and their passing causes the processes of knowledge accumulation and diffusion to slow down, or even decline. In this view, important interactions for the production of new scientific knowledge are not rigidly constrained by geographic or social space, but also take place in an ethereal “idea space.”

But is the act of formal coauthorship necessary for a scientist to be brought into a superstar’s intellectual orbit? Because our sample is composed exclusively of coauthors, we cannot definitively answer this question. Yet one can use the norms of authorship in the life sciences to try to isolate collaborators whose coauthorship tie to the star is particularly tenuous: “accidental” collaborators—those who always find themselves in the middle of the authorship list. As seen in column (3), these accidental collaborators do not appear to experience net losses after the superstar’s death. This suggests that full membership in the invisible college may be difficult to secure in the absence of a preexisting social tie. Column (4) provides evidence that the effects of physical and intellectual proximity are independent, because combining them in the same specification does not alter their magnitudes or statistical significance. Finally, column (5) demonstrates that these effects are robust to the inclusion of controls for coauthorship intensity and recency.

Table VII provides additional evidence in favor of the spillover view by examining the relationship between the magnitude of the treatment effect and the accomplishments of the star. We rank superstars according to two metrics of achievement: cumulative citations and cumulative NIH funding, and we focus on superstars in the top quartile of either distribution (where these quartiles are calculated using the population of 10,349 superstars in a given year). Column (1) shows that collaborators of heavily cited superstars suffer more following the superstar’s death, whereas column (2) shows that this is not true for collaborators of especially well-funded superstars. Column (3) puts the two effects in a single specification. Once again, it appears that it is the star’s citation impact that matters in shaping collaborators’ postextinction outcomes, rather than his/her control over a funding empire.17 We interpret these findings as buttressing our argument that it is the quality of ideas emanating from the stars, rather than simply the availability of the research funding they control, that goes missing

17. Table VII eliminates from the estimation sample the collaborators of eleven superstars who are NIH intramural scientists, and as such not eligible for extramural NIH funding.
TABLE VII
IMPACT OF SUPERSTAR STATUS ON COLLABORATORS’ PUBLICATION RATES

                                                Superstar       Superstar       Superstar status
                                                 status          status         citations & NIH
                                                citations      NIH funding          funding
                                                   (1)             (2)               (3)

After death                                     −0.034          −0.070∗           −0.026
                                                (0.036)         (0.035)           (0.039)
After death × star in top quartile of cites     −0.082†                           −0.080†
                                                (0.047)                           (0.048)
After death × star in top quartile of NIH $                     −0.026            −0.016
                                                                (0.050)           (0.051)
Log pseudo-likelihood                           −1,715,929      −1,716,213        −1,715,916
No. of observations                             275,776         275,776           275,776
No. of collaborators                            9,470           9,470             9,470

Notes. Estimates stem from conditional quasi–maximum likelihood Poisson specifications. Dependent variable is the total number of JIF-weighted articles authored by a collaborator of a superstar life scientist in the year of observation. Top quartiles of citations and career NIH funding are defined using the population of 10,009 superstar scientists with appointments compatible with extramural NIH funding. We exclude from the estimation sample the collaborators of eleven “intramural” NIH scientists who are not eligible to receive extramural funding. All models incorporate year effects and seventeen age category indicator variables (career age less than −3 is the omitted category), as well as seventeen (columns (1) and (2)) or thirty-four (column (3)) interaction terms between the age effects and the “top quartile” indicator variable. Robust (QML) standard errors in parentheses, clustered at the level of the superstar. † p < .10. ∗ p < .05.
after their deaths. Furthermore, these results suggest that using the same empirical strategy, but applying it to a sample of “humdrum” coauthors who die, would not uncover effects similar in magnitude to those we observed in Table III. As such, they validate ex post our pragmatic focus on the effect of superstars.

The overall collection of results presented above helps build a circumstantial case in favor of interpreting the effects of superstar extinction as evidence of missing spillovers. However, these results do not enable us to reject some potentially relevant versions of the gatekeeping story—such as influence over the editorial process in important journals—nor do they allow us to learn about the effect on noncollaborators.

IV.E. Robustness and Sensitivity Checks

The Online Appendix provides additional evidence probing the robustness of these results. In Table W7, we interact the treatment effect with three indicators of collaborator status to ascertain whether some among them are insulated from the effects of superstar extinction. Figure W3 provides evidence that the effect of superstar extinction decreases monotonically with the age of the collaborator at the time of death, becoming insignificantly different from zero after 25 years of career age. Table W8 performs a number of sensitivity checks. We verify that the effect (1) is not driven by a few stars with a large number of coauthors; (2) is robust to the inclusion of indicator variables for the age of the star; (3) is not overly sensitive to our arbitrary cutoff for the superstars’ age at death; and (4) is not sensitive to the problem of leakage through the coauthorship network between treated and control collaborators.

Finally, we perform a small simulation study to validate the quasi-experiment exploited in the paper. We generate placebo dates of death for the control collaborators, where those dates are drawn at random from the empirical distribution of death events across years for the 112 extinct superstars. We then replicate the specification in Table III, column (1a), but we limit the estimation sample to the set of 5,064 control collaborators. Reassuringly, the effect of superstar extinction in this manufactured data is a precisely estimated zero.

V. CONCLUSIONS

We examine the role of collaboration in spurring the creation of new scientific knowledge. Using the premature and unexpected
deaths of eminent academic life scientists as a quasi-experiment, we find that their collaborators experience a sizable and permanent decline in quality-adjusted publication output following these events. Exploiting the rich heterogeneity in these collaborative relationships, we attempt to adjudicate between plausible mechanisms that could give rise to the extinction effect. Neither a mechanical story whereby ongoing collaborative teams struggle to replace the skills that have gone missing, nor a gatekeeping story in which stars merely serve as conduits for tangible resources, is sufficient to explain our results. Rather, these effects appear to be driven, at least in part, by the existence of knowledge spillovers across members of the research team. When a superstar dies, part of the scientific field to which he or she contributed dies along with him or her, perhaps because the fount of scientific knowledge from which coauthors can draw is greatly diminished. The permanence and magnitude of these effects also suggest that even collaborations that produce a small number of publications may have long-term repercussions for the pace of scientific advance.

In the end, this paper raises as many questions as it answers. It would be interesting to know whether superstar extinction also impacts the productivity of noncoauthors proximate in intellectual space, and in which direction. The degree to which exposure to superstar talent benefits industrial firms is also potentially important and represents a fruitful area that we are pursuing in ongoing research. Future work could also usefully focus on identifying quasi-experiments in intellectual space. For instance, how do scientists adjust to sudden changes in scientific opportunities in their field? Finally, collaboration incentives and opportunities may be different when scientific progress relies more heavily on capital equipment; an examination of the generalizability of our findings to other fields therefore merits further attention.

Our results shed light on a heretofore neglected causal process underlying the growth of scientific knowledge, but they should be interpreted with caution. Although we measure the impact of losing a star collaborator, a full accounting of knowledge spillovers would require information on the benefits that accrued to the field while the star was alive. We can think of no experiment, natural or otherwise, that would encapsulate this counterfactual. Moreover, the benefits of exposure to star talent constitute only part of a proper welfare calculation. Scientific coauthorships also entail costs. These costs could be borne by low-status collaborators
in the form of lower wages, or by the stars, who might divert some of their efforts toward mentorship activities. Though some of these costs might be offset by nonpecuniary benefits, we suspect that the spillovers documented here are not fully internalized by the scientific labor market. Finally, for every invisible college that contracts following superstar extinction, another might expand to slowly take its place. Viewed in this light, our work does little more than provide empirical support for Max Planck’s famous quip: “science advances one funeral at a time.”

APPENDIX I: CRITERIA FOR DELINEATING THE SET OF 10,349 “SUPERSTARS”

We present additional details regarding the criteria used to construct the sample of 10,349 superstars.

Highly funded scientists. Our first data source is the Consolidated Grant/Applicant File (CGAF) from the U.S. National Institutes of Health (NIH). This data set records information about grants awarded to extramural researchers funded by the NIH since 1938. Using the CGAF and focusing only on direct costs associated with research grants, we compute individual cumulative totals for the decades 1977–1986, 1987–1996, and 1997–2006, deflating the earlier years by the Biomedical Research Producer Price Index.18 We also recompute these totals excluding large center grants that usually fund groups of investigators (M01 and P01 grants). Scientists whose totals lie in the top ventile (i.e., above the 95th percentile) of either distribution constitute our first group of superstars. In this group, the least well-funded investigator garnered $10.5 million in career NIH funding, and the most well-funded $462.6 million.19

Highly cited scientists. Despite the preeminent role of the NIH in the funding of public biomedical research, the above indicator of “superstardom” biases the sample toward scientists conducting relatively expensive research. We complement this first group with a second composed of highly cited scientists identified by the

18. http://officeofbudget.od.nih.gov/UI/GDPFromGenBudget.htm.
19. We perform a similar exercise for scientists employed by the intramural campus of the NIH. These scientists are not eligible to receive extramural funds, but the NIH keeps records of the number of “internal projects” each intramural scientist leads. We include in the elite sample the top ventile of intramural scientists according to this metric.
Institute for Scientific Information. A Highly Cited listing means that an individual was among the 250 most cited researchers for their published articles between 1981 and 1999, within a broad scientific field.20

Top patenters. We add to these groups academic life scientists who belong in the top percentile of the patent distribution among academics—those who were granted seventeen patents or more between 1976 and 2004.

Members of the National Academy of Sciences. We add to these groups academic life scientists who were elected to the National Academy of Sciences between 1975 and 2007.

MERIT Awardees of the NIH. Initiated in the mid-1980s, the MERIT Award program extends funding for up to five years (but typically three years) to a select number of NIH-funded investigators “who have demonstrated superior competence, outstanding productivity during their previous research endeavors and are leaders in their field with paradigm-shifting ideas.” The specific details governing selection vary across the component institutes of the NIH, but the essential feature of the program is that only researchers holding an R01 grant in its second or later cycle are eligible. Further, the application must be scored in the top percentile in a given funding cycle.

Former and current Howard Hughes medical investigators. Every three years, the Howard Hughes Medical Institute selects a small cohort of mid-career biomedical scientists with the potential to revolutionize their respective subfields. Once selected, HHMIs continue to be based at their institutions, typically leading a research group of 10 to 25 students, postdoctoral associates, and technicians. Their appointment is reviewed every five years, based solely on their most important contributions during the cycle.21

Early career prize winners. We also included winners of the Pew, Searle, Beckman, Rita Allen, and Packard scholarships for the years 1981 through 2000. Every year, these charitable foundations provide seed funding to between twenty and forty young academic life scientists. These scholarships are the most prestigious

20. The relevant scientific fields in the life sciences are microbiology, biochemistry, psychiatry/psychology, neuroscience, molecular biology and genetics, immunology, pharmacology, and clinical medicine.
21. See Azoulay, Zivin, and Manso (2009) for more details and an evaluation of this program.
accolades that young researchers can receive in the first two years of their careers as independent investigators.

APPENDIX II: CONSTRUCTION OF THE CONTROL GROUP

We detail the procedure implemented to identify the control collaborators that help pin down the life-cycle and secular time effects in our DD specification. Because it did not prove possible to perfectly match treated and control collaborators on all covariates, the procedure is guided by the need to guard against two specific threats to identification.

First, collaborators observed in periods before the death of their associated superstar are more likely to work with a younger superstar; thus, they are not ideal as a control if research trends of collaborators differ by the age of the superstar. Collaborators observed in periods after the death of their associated superstar are only appropriate controls if the death of their superstar only affected the level of their output; if the death also negatively affected the trend, fixed effects estimates will be biased toward zero.

Second, fixed effects estimates might be misleading if collaborations with superstars are subject to idiosyncratic dynamic patterns. Happenstance might yield a sample of stars clustered in decaying scientific fields. More plausibly, collaborations might be subject to specific life-cycle patterns, with their productive potential first increasing over time, eventually peaking, and thereafter slowly declining. Relying solely on collaborators treated earlier or later as an implicit control group entails that this dyad-specific, time-varying omitted variable will not be fully captured by collaborator age controls.

To address these threats, the sample of control collaborators (to be recruited from the universe of collaborators for the 10,000 stars who do not die prematurely, regardless of cause) should be constructed such that the following four conditions are met:

1. treated collaborators exhibit no differential output trends relative to control collaborators up to the time of superstar death;
2. the distributions of career age at the time of death are similar for treated and controls;
3. the time paths of output for treated and controls are similar up to the time of death;
4. the dynamics of collaboration up to the time of death—number of coauthorships, time elapsed since first/last
coauthorship, superstar’s scientific standing as proxied by his cumulative citation count—are similar for treated and controls.

Coarsened exact matching. To meet these goals, we have implemented a “Coarsened Exact Matching” (CEM) procedure (Iacus, King, and Porro 2008) to identify a control for each treated collaborator. As opposed to methods that rely on the estimation of a propensity score, CEM is a nonparametric procedure. This seems appropriate given that we observe no covariates that predict the risk of being associated with a superstar scientist who dies in a particular year. The first step is to select a relatively small set of covariates on which the analyst wants to guarantee balance. In our example, this choice entails judgment but is strongly guided by the threats to identification mentioned above. The second step is to create a large number of strata to cover the entire support of the joint distribution of the covariates selected in the previous step. In a third step, each observation is allocated to a unique stratum, and for each observation in the treated group, a control observation is selected from the same stratum; if there are multiple choices possible, ties are broken randomly. The procedure is coarse because we do not attempt to precisely match on covariate values; rather, we coarsen the support of the joint distribution of the covariates into a finite number of strata, and we match a treated observation if and only if a control observation can be recruited from this stratum. An important advantage of CEM is that the analyst can guarantee the degree of covariate balance ex ante, but this comes at a cost: the more fine-grained the partition of the support for the joint distribution (i.e., the larger the number of strata), the larger the number of unmatched treated observations.

Implementation. We identify controls based on the following set of covariates (t denotes the year of death): collaborator’s degree year, number of coauthorships with the star at t, number of years elapsed since last coauthorship with the star at t, JIF-weighted publication flow in year t, cumulative stock of JIF-weighted publications up to year t − 1, and the star’s cumulative citation count at t. We then coarsen the joint distributions of these covariates by creating 51,200 strata. The distribution of degree years is coarsened using three-year intervals; the distribution of coauthorship intensity is coarsened to map into our taxonomy of casual (one
and two coauthorships), regular (between three and nine coauthorships), and close collaborators (ten or more coauthorships); the distribution of coauthorship recency is coarsened into quartiles (the first quartile corresponds to recent relationships, i.e., less than three years since the last coauthorship); the flow of publications in the year of death is coarsened into five strata (the three bottom quartiles; from the 75th to the 95th percentile; and above the 95th percentile); the stock of publications is coarsened into eleven strata (0 to 5th; 5th to 10th; 10th to 25th; 25th to 35th; 35th to 50th; 50th to 65th; 65th to 75th; 75th to 90th; 90th to 95th; 95th to 99th; and above the 99th percentile); and the distribution of citation count for the star is coarsened into quartiles.

We implement the CEM procedure year by year, without replacement. Specifically, in year t, we

1. eliminate from the set of potential controls all superstars who die, all coauthors of superstars who die, and all control coauthors identified for years of death k, 1979 ≤ k < t;
2. create the strata (the values for the cutoff points will vary from year to year for some of the covariates mentioned above);
3. identify within strata a control for each treated unit and break ties at random;
4. repeat these steps for year t + 1.

We match 5,064 of 5,267 treated collaborators (96.15%). In the sample of 5,064 treated + 5,064 controls = 10,128 collaborators, there is indeed no evidence of preexisting trends in output (Figure A.1); nor is there evidence of differential age effects in the years leading up to the death event (Figure A.2). As seen in Table II, treated and controls are very well balanced on all covariates that pertain to the dynamics of the collaboration: number of coauthorships, time since last and first coauthored publication, and superstar’s citation count. The age distributions are very similar as well. Furthermore, the CEM procedure balances a number of covariates that were not used as inputs, such as normalized keyword overlap and R01 NIH grantee status. For some covariates, we can detect statistically significant mean differences, though they do not appear to be substantively meaningful (e.g., 7% of controls vs. 8.4% of treated collaborators were former trainees of their associated superstars).

Sensitivity analyses. The analyst’s judgment matters for the outcome of the CEM procedure insofar as she must draw a list
FIGURE A.1 Publication Trends for Treated and Control Collaborators
of “reasonable” covariates to match on, as well as decide on the degree of coarsening to impose. Therefore, it is reasonable to ask whether seemingly small changes in the details have consequences for how one should interpret our results. Nonparametric matching procedures such as CEM are prone to a version of the “curse of dimensionality” whereby the proportion of matched units decreases rapidly with the number of strata. For instance, requiring a match on an additional indicator variable (i.e., doubling the number of strata from around 50,000 to 100,000) results in a match rate of about 70%, which seems unacceptably low. Conversely, focusing solely on degree age and the flow and stock of the outcome variables would enable us to achieve pairwise balance (as opposed to global balance, which ignores the one-to-one nature of the matching procedure) on this narrower set of covariates, but at the cost of large differences in the features of collaborations (such as recency and intensity) between treated and controls. This would result in a control sample that could address the first threat to identification mentioned above, but not the second. However, we have verified that slight variations in the details of the implementation (e.g., varying slightly the number of cutoff points for the stock of publications; focusing on collaboration
FIGURE A.2
Differential Age Trends for Treated vs. Control Collaborators
Each dot corresponds to the coefficient estimate for the interaction between an age indicator variable and treatment status in a Poisson regression of weighted publications onto a full suite of year effects, a full suite of age effects, and age by treatment status interaction terms. The population includes control and treated collaborators, but the estimation sample is limited to the years before the death of the associated superstar. The vertical brackets denote the 95% confidence interval (corresponding to robust standard errors, clustered around collaborators) around these estimates.
age as opposed to collaboration recency; or matching on superstar funding as opposed to superstar citations) have little impact on the results presented in Table III. To conclude, we feel that CEM enables us to identify a population of control collaborators appropriate to guard against the specific threats to identification mentioned in Section II.C.

MIT SLOAN SCHOOL OF MANAGEMENT AND NATIONAL BUREAU OF ECONOMIC RESEARCH
UNIVERSITY OF CALIFORNIA, SAN DIEGO, AND NATIONAL BUREAU OF ECONOMIC RESEARCH
MIT SLOAN SCHOOL OF MANAGEMENT
REFERENCES

Agrawal, Ajay K., and Avi Goldfarb, “Restructuring Research: Communication Costs and the Democratization of University Innovation,” American Economic Review, 98 (2008), 1578–1590.
Aizenman, Joshua, and Kenneth M. Kletzer, “The Life Cycle of Scholars and Papers in Economics—The ‘Citation Death Tax,’” NBER Working Paper No. 13891, 2008.
Azoulay, Pierre, Christopher Liu, and Toby Stuart, “Social Influence Given (Partially) Deliberate Matching: Career Imprints in the Creation of Academic Entrepreneurs,” MIT Sloan School Working Paper, 2009.
Azoulay, Pierre, Joshua Graff Zivin, and Gustavo Manso, “Incentives and Creativity: Evidence from the Academic Life Sciences,” NBER Working Paper No. 15466, 2009.
Bandiera, Oriana, Iwan Barankay, and Imran Rasul, “Social Preferences and the Response to Incentives: Evidence from Personnel Data,” Quarterly Journal of Economics, 120 (2005), 917–962.
Bennedsen, Morten, Francisco Pérez-González, and Daniel Wolfenzon, “Do CEOs Matter?” New York University Working Paper, 2008.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan, “How Much Should We Trust Differences-in-Differences Estimates?” Quarterly Journal of Economics, 119 (2004), 249–275.
Cech, Thomas R., “Fostering Innovation and Discovery in Biomedical Research,” Journal of the American Medical Association, 294 (2005), 1390–1393.
Cockburn, Iain M., and Rebecca M. Henderson, “Absorptive Capacity, Coauthoring Behavior, and the Organization of Research in Drug Discovery,” Journal of Industrial Economics, 46 (1998), 157–182.
Cole, Jonathan R., and Stephen Cole, “The Ortega Hypothesis,” Science, 178 (1972), 368–375.
Costa, Dora L., and Matthew E. Kahn, “Cowards and Heroes: Group Loyalty in the American Civil War,” Quarterly Journal of Economics, 118 (2003), 519–548.
Crane, Diana, Invisible Colleges: Diffusion of Knowledge in Scientific Communities (Chicago: University of Chicago Press, 1972).
de Solla Price, Derek J., Little Science, Big Science (New York: Columbia University Press, 1963).
de Solla Price, Derek J., and Donald D. Beaver, “Collaboration in an Invisible College,” American Psychologist, 21 (1966), 1011–1018.
Fafchamps, Marcel, Sanjeev Goyal, and Marco van de Leij, “Matching and Network Effects,” University of Oxford Working Paper, 2008.
Gouriéroux, Christian, Alain Monfort, and Alain Trognon, “Pseudo Maximum Likelihood Methods: Applications to Poisson Models,” Econometrica, 52 (1984), 701–720.
Griffith, Rachel, Sokbae Lee, and John Van Reenen, “Is Distance Dying at Last? Falling Home Bias in Fixed Effects Models of Patent Citations,” NBER Working Paper No. 13338, 2007.
Grogger, Jeffrey, “The Effect of Arrests on the Employment and Earnings of Young Men,” Quarterly Journal of Economics, 110 (1995), 51–71.
Hall, Bronwyn H., Jacques Mairesse, and Laure Turner, “Identifying Age, Cohort and Period Effects in Scientific Research Productivity: Discussion and Illustration Using Simulated and Actual Data on French Physicists,” Economics of Innovation and New Technology, 16 (2007), 159–177.
Ham, John C., and Bruce A. Weinberg, “Geography and Innovation: Evidence from Nobel Laureates,” Ohio State University Working Paper, 2008.
Hausman, Jerry, Bronwyn H. Hall, and Zvi Griliches, “Econometric Models for Count Data with an Application to the Patents–R&D Relationship,” Econometrica, 52 (1984), 909–938.
Henderson, Rebecca, Luigi Orsenigo, and Gary P. Pisano, “The Pharmaceutical Industry and the Revolution in Molecular Biology: Interactions Among Scientific, Institutional, and Organizational Change,” in Sources of Industrial Leadership, David C. Mowery and Richard R. Nelson, eds. (New York: Cambridge University Press, 1999).
Iacus, Stefano M., Gary King, and Giuseppe Porro, “Matching for Causal Inference without Balance Checking,” Harvard University Working Paper, 2008.
Jones, Benjamin F., “The Burden of Knowledge and the ‘Death of the Renaissance Man’: Is Innovation Getting Harder?” Review of Economic Studies, 76 (2009), 283–317.
Jones, Benjamin F., and Benjamin A. Olken, “Do Leaders Matter? National Leadership and Growth since World War II,” Quarterly Journal of Economics, 120 (2005), 835–864.
Kim, E. Han, Adair Morse, and Luigi Zingales, “Are Elite Universities Losing Their Competitive Edge?” Journal of Financial Economics, 93 (2009), 353–381.
Levin, Sharon G., and Paula E. Stephan, “Research Productivity over the Life Cycle: Evidence for Academic Scientists,” American Economic Review, 81 (1991), 114–132.
Lotka, Alfred J., “The Frequency Distribution of Scientific Productivity,” Journal of the Washington Academy of Sciences, 16 (1926), 317–323.
Lucas, Robert E., “On the Mechanics of Economic Development,” Journal of Monetary Economics, 22 (1988), 3–42.
Mairesse, Jacques, and Laure Turner, “Measurement and Explanation of the Intensity of Co-publication in Scientific Research: An Analysis at the Laboratory Level,” NBER Working Paper No. 11172, 2005.
Marshall, Alfred, Principles of Economics (New York: MacMillan, 1890).
Mas, Alexandre, and Enrico Moretti, “Peers at Work,” American Economic Review, 99 (2009), 112–145.
Oettl, Alexander, “Productivity, Helpfulness and the Performance of Peers: Exploring the Implications of a New Taxonomy for Star Scientists,” University of Toronto Working Paper, 2008.
Reber, Sarah J., “Court-Ordered Desegregation,” Journal of Human Resources, 40 (2005), 559–590.
Reese, Thomas S., “My Collaboration with John Heuser,” European Journal of Cell Biology, 83 (2004), 243–244.
Romer, Paul M., “Endogenous Technological Change,” Journal of Political Economy, 98 (1990), S71–S102.
Rosenblat, Tanya S., and Markus M. Möbius, “Getting Closer or Drifting Apart?” Quarterly Journal of Economics, 119 (2004), 971–1009.
Santos Silva, J. M. C., and Silvana Tenreyro, “The Log of Gravity,” Review of Economics and Statistics, 88 (2006), 641–658.
Waldinger, Fabian, “Peer Effects in Science: Evidence from the Dismissal of Scientists in Nazi Germany,” London School of Economics Working Paper, 2008.
Weitzman, Martin L., “Recombinant Growth,” Quarterly Journal of Economics, 113 (1998), 331–360.
Wooldridge, Jeffrey M., “Quasi-likelihood Methods for Count Data,” in Handbook of Applied Econometrics, M. Hashem Pesaran and Peter Schmidt, eds. (Oxford, UK: Blackwell, 1997).
Wuchty, Stefan, Benjamin F. Jones, and Brian Uzzi, “The Increasing Dominance of Teams in Production of Knowledge,” Science, 316 (2007), 1036–1039.
Zucker, Lynne G., and Michael R. Darby, “Defacto and Deeded Intellectual Property Rights,” NBER Working Paper No. 14544, 2008.
Zucker, Lynne G., Michael R. Darby, and Marilynn B. Brewer, “Intellectual Human Capital and the Birth of U.S. Biotechnology Enterprises,” American Economic Review, 88 (1998), 290–306.
ESTIMATING MARGINAL RETURNS TO MEDICAL CARE: EVIDENCE FROM AT-RISK NEWBORNS∗

DOUGLAS ALMOND
JOSEPH J. DOYLE, JR.
AMANDA E. KOWALSKI
HEIDI WILLIAMS

A key policy question is whether the benefits of additional medical expenditures exceed their costs. We propose a new approach for estimating marginal returns to medical spending based on variation in medical inputs generated by diagnostic thresholds. Specifically, we combine regression discontinuity estimates that compare health outcomes and medical treatment provision for newborns on either side of the very low birth weight threshold at 1,500 grams. First, using data on the census of U.S. births in available years from 1983 to 2002, we find that newborns with birth weights just below 1,500 grams have lower one-year mortality rates than do newborns with birth weights just above this cutoff, even though mortality risk tends to decrease with birth weight. One-year mortality falls by approximately one percentage point as birth weight crosses 1,500 grams from above, which is large relative to mean infant mortality of 5.5% just above 1,500 grams. Second, using hospital discharge records for births in five states in available years from 1991 to 2006, we find that newborns with birth weights just below 1,500 grams have discontinuously higher charges and frequencies of specific medical inputs. Hospital costs increase by approximately $4,000 as birth weight crosses 1,500 grams from above, relative to mean hospital costs of $40,000 just above 1,500 grams. Under an assumption that observed medical spending fully captures the impact of the “very low birth weight” designation on mortality, our estimates suggest that the cost of saving a statistical life of a newborn with birth weight near 1,500 grams is on the order of $550,000 in 2006 dollars.
∗ We thank Christine Pal and Jean Roth for assistance with the data, Christopher Afendulis and Ciaran Phibbs for data on California neonatal intensive care units, and doctors Christopher Almond, Burak Alsan, Munish Gupta, Chafen Hart, and Katherine Metcalf for helpful discussions regarding neonatology. David Autor, Amitabh Chandra, Janet Currie, David Cutler, Dan Fetter, Amy Finkelstein, Edward Glaeser, Michael Greenstone, Jonathan Gruber, Jerry Hausman, Guido Imbens, Lawrence Katz, Michael Kremer, David Lee, Ellen Meara, Derek Neal, Joseph Newhouse, James Poterba, Jesse Rothstein, Gary Solon, Tavneet Suri, the editor and four referees, and participants in seminars at Harvard, the Harvard School of Public Health, MIT, Princeton, and the Fall 2008 NBER Labor Studies meeting provided helpful comments and feedback. We use discharge data from the Healthcare Cost and Utilization Project (HCUP), Agency for Healthcare Research and Quality, provided by the Arizona Department of Health Services, the Maryland Health Services Cost Review Commission, the New Jersey Department of Health and Senior Services, and the New York State Department of Health. Funding from the National Institute on Aging, Grant T32AG000186 to the National Bureau of Economic Research, is gratefully acknowledged (Doyle, Kowalski, Williams).
© 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology.
The Quarterly Journal of Economics, May 2010
I. INTRODUCTION

Medical expenditures in the United States are high and increasing. Do the benefits of additional medical expenditures exceed their costs? The tendency for patients in worse health to receive more medical inputs complicates empirical estimation of the returns to medical expenditures. Observational studies have used cross-sectional, time-series, and panel data techniques to attempt to identify patients who are similar in terms of underlying health status but who for some reason receive different levels of medical spending. The results of such studies are mixed. On one hand, time-series and panel data studies that compare increases in spending and improvements in health outcomes over time have argued that increases in costs have been less than the value of the associated benefits, at least for some technologies.1 On the other hand, cross-sectional studies that compare “high-spending” and “low-spending” geographic areas tend to find large differences in spending yet remarkably similar health outcomes.2

The lack of consensus may not be surprising, as these studies have measured returns on many different margins of care. The return to a dollar of medical spending likely differs across medical technologies and across patient populations, and in any given context the return to the first dollar of medical spending likely differs from the return to the last dollar of spending. The time-series studies often estimate returns to large changes in treatments that occur over long periods of time. The cross-sectional studies, on the other hand, estimate returns to additional, incremental spending that occurs in some areas but not others. Although estimates of returns to large changes in medical spending are useful summaries of changes over time, estimates of marginal returns are needed to inform policy decisions over whether to increase or decrease the level of care in a given context.

The main innovation of this paper is a novel research design that, under explicit assumptions, permits direct estimation of the marginal returns to medical care. Implementation of our research design requires a setting with an observable, continuous

1. See, for example, McClellan (1997); Cutler et al. (1998); Cutler and McClellan (2001); Nordhaus (2002); Murphy and Topel (2003); Cutler, Rosen, and Vijan (2006); and Luce et al. (2006).
2. See, for example, Fisher et al. (1994); Pilote et al. (1995); Kessler and McClellan (1996); Tu et al. (1997); O’Connor et al. (1999); Baicker and Chandra (2004); Fuchs (2004); and Stukel, Lucas, and Wennberg (2005).
measure of health risk and a diagnostic threshold (based on this risk variable) that generates a discontinuous probability of receiving additional treatment.3 In such settings, we can use a regression discontinuity framework: as long as other factors are smooth across the threshold (an assumption we investigate in several empirical tests), individuals within a small bandwidth on either side of the threshold should differ only in their probability of receiving additional health-related inputs and not in their underlying health. This research design allows us to estimate marginal returns to medical care for patients near such thresholds in the following sense: conditional on estimating that, on average, patients on one side of the threshold incur additional medical costs, we can estimate the associated benefits by examining average differences in health outcomes across the threshold. Under the assumption that observed medical spending fully captures the impact of a “higher risk” designation on mortality, combining these cost and benefit estimates allows us to calculate the return to this increment of additional spending, or “average marginal returns.”

We apply our research design to study “at-risk” newborns, a population that is of interest for several reasons. First, the welfare implications of small reductions in mortality for newborns can be magnified in terms of the total number of years of life saved. Second, technologies for treating at-risk newborns have expanded tremendously in recent years, at very high cost. Third, although existing estimates suggest that the benefits associated with large changes in spending on at-risk newborns over time have been greater than their costs (Cutler and Meara 2000), there is a dearth of evidence on the returns to incremental spending in this context. Fourth, studying newborns allows us to focus on a large portion of the health care system, as childbirth is one of the most common reasons for hospital admission in the United States. This patient population also provides samples large enough to detect effects of additional treatment on infant mortality.

3. Such criteria are common in clinical medicine. For example, diabetes diagnoses are frequently made based on a threshold fasting glucose level, hypertension diagnoses based on a threshold systolic blood pressure level, hypercholesterolemia diagnoses based on a threshold cholesterol level, and overweight diagnoses based on a threshold body mass index. Nevertheless, there is “little evidence” that the regression discontinuity framework has been used to evaluate triage criteria in clinical medicine (Linden, Adams, and Roberts 2006). Similarly, Zuckerman et al. (2005, p. 561) note that “program evaluation in health services research has lacked a formal application” of the regression discontinuity approach.
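The arithmetic behind these "average marginal returns" is simple once the two discontinuity estimates are in hand: the jump in spending at the threshold is divided by the drop in mortality at the threshold. The sketch below is illustrative only and is not the authors' code; the function name and inputs are hypothetical. Plugging in the rounded $4,000 cost jump and one-percentage-point mortality decline quoted in the abstract gives a back-of-the-envelope figure of about $400,000, of the same order as the paper's preferred two-sample estimate of roughly $550,000, which uses the exact point estimates.

```python
def cost_per_statistical_life_saved(delta_cost, delta_mortality):
    """Dollars of additional spending per newborn death averted at the threshold.

    delta_cost      -- estimated discontinuity in medical spending ($ per newborn)
    delta_mortality -- estimated discontinuity in one-year mortality
                       (negative when mortality falls, e.g., -0.01 for a 1 pp drop)
    """
    if delta_mortality >= 0:
        raise ValueError("mortality does not fall across the threshold")
    return delta_cost / abs(delta_mortality)


# Illustrative inputs taken from the rounded figures in the abstract.
print(cost_per_statistical_life_saved(delta_cost=4_000, delta_mortality=-0.01))
# -> 400000.0
```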
We focus on the “very low birth weight” (VLBW) classification at 1,500 g (just under 3 pounds, 5 ounces)—a designation frequently referenced in the medical literature. We also consider other classifications based on birth weight and alternative measures of newborn health. From an empirical perspective, birth weight–based thresholds provide an attractive basis for a regression discontinuity design for several reasons. First, they are unlikely to represent breaks in underlying health risk. A 1985 Institute of Medicine report (p. 23), for example, notes that “designation of very low birth weight infants as those weighing 1,500 grams or less reflected convention rather than biologic criteria.” Second, it is generally agreed that birth weight cannot be predicted in advance of delivery with the accuracy needed to change (via birth timing) the classification of a newborn from being just above 1,500 g to being just below 1,500 g. Thus, although we empirically investigate our assumption that the position of a newborn just above 1,500 g relative to just below 1,500 g is “as good as random,” the medical literature also suggests that this assumption is reasonable.

To preview our main results, using data on the census of U.S. births in available years from 1983 to 2002, we find that one-year mortality decreases by approximately one percentage point as birth weight crosses the VLBW threshold from above, which is large relative to mean one-year mortality of 5.5% just above 1,500 g. This sharply contrasts with the overall increase in mortality as birth weight falls, and to the extent that lighter newborns are less healthy in unobservable ways, the mortality change we observe is all the more striking. Second, using hospital discharge records for births in five states in available years from 1991 to 2006, we estimate a $4,000 increase in hospital costs for infants just below the 1,500-g threshold, relative to mean hospital costs of $40,000 just above 1,500 g. As we discuss in Section VIII, this estimated cost difference may not capture all of the relevant mortality-reducing inputs, but it is our best available summary measure of health inputs. Under the assumption that hospital costs fully capture the impact of the VLBW designation on mortality, our estimates suggest that the cost of saving a statistical life for newborns near 1,500 g is on the order of $550,000—well below most value-of-life estimates for this group of newborns.

The remainder of the paper is organized as follows. Section II discusses the available evidence on the costs and benefits of medical care for at-risk newborns and gives a brief
background on the at-risk newborn classifications we study. Section III describes our data and analysis sample, and Section IV outlines our empirical framework and bandwidth selection. Section V presents our main results, and Section VI discusses several robustness and specification checks. Section VII examines variation in our estimated treatment effects across hospital types. In Section VIII we combine our main estimates to calculate two-sample estimates of marginal returns, and Section IX concludes.

II. BACKGROUND

II.A. Costs and Benefits of Medical Care for At-Risk Newborns

Medical treatments for at-risk newborns have been expanding tremendously in recent years, at high cost. For example, in 2005 the U.S. Agency for Healthcare Research and Quality estimated that the two most expensive hospital diagnoses (regardless of age) were “infant respiratory distress syndrome” and “premature birth and low birth weight.”4 Russell et al. (2007) estimated that in the United States in 2001, preterm and low–birth weight diagnoses accounted for 8% of newborn admissions, but 47% of costs for all infant hospitalizations (at $15,100 on average). Despite their high and highly concentrated costs, use of new neonatal technologies has continued to expand.5

These high costs motivate the question of what these medical advances have been “worth” in terms of improved health outcomes. Anspach (1993) and others discuss the paucity of randomized controlled trials that measure the effectiveness of neonatal intensive care. In the absence of such evidence, some have questioned the effectiveness of these increasingly intensive treatment patterns (Enthoven 1980; Goodman et al. 2002; Grumbach 2002). On the other hand, Cutler and Meara (2000) examine time-series variation in birth weight–specific treatment costs and mortality outcomes and argue that medical advances for newborns have had large returns.6

4. See http://www.ahrq.gov/data/hcup/factbk6/factbk6.pdf (accessed 29 October 2008).
5. An example related to our threshold of interest is provided by the Oxford Health Network’s 362 hospitals, where the use of high-frequency ventilation among VLBW infants tripled between 1991 and 1999 (Horbar et al. 2002).
6. Cutler and Meara’s empirical approach assumes that all within–birth weight changes in survival have been due to improvements in medical technologies. This approach is motivated by the argument that conditional on birth weight, the overwhelming factor influencing survival for low–birth weight newborns is medical care in the immediate postnatal period (Williams and Chen 1982; Paneth 1995). However, others have noted that it is possible that underlying changes in the health status of infants within each weight group (due to, for example, improved maternal nutrition) are responsible for neonatal mortality independent of newborn medical care (United States Congress, Office of Technology Assessment 1981). For comparison to the results obtained with our methodology, we present results based on the Cutler and Meara methodology in our data in Section VIII.A.

II.B. “At-Risk” Newborn Classifications

Birth weight and gestational age are the two most common metrics of newborn health, and continuous measures of these variables are routinely collapsed into binary classifications. We focus on the VLBW classification at 1,500 g (just under 3 pounds, 5 ounces). We also examine other birth weight classifications—including the “extremely low birth weight” (ELBW) classification at 1,000 g (just over 2 pounds, 3 ounces) and the “low birth weight” (LBW) classification at 2,500 g (just over 5 pounds, 8 ounces)—as well as gestational age–based measures such as the “prematurity” classification at 37 weeks, where gestation length is usually based on the number of weeks since the mother’s last menstrual period. Below, we briefly describe the evolution of these classifications.7

Physicians had begun to recognize and assess the relationships among inadequate growth (LBW), shortened gestation (prematurity), and mortality by the early 1900s. The 2,500-g LBW classification, for example, has existed since at least 1930, when a Finnish pediatrician advocated 2,500 g as the birth weight below which infants were at high risk of adverse neonatal outcomes. Over time, interest increased in the fate of the smallest infants, and “very low birth weight” infants were conventionally defined as those born weighing less than 1,500 g (United States Institute of Medicine 1985).8

Key to our empirical strategy is that these cutoffs appear to truly reflect convention rather than strict biologic criteria. For example, the 1985 IOM report notes (p. 22):

Birth weight is a continuous variable and the limit at 2,500 grams does not represent a biologic category, but simply a point on a continuous curve. The infant born at 2,499 grams does not differ significantly from one born at

7. The discussion in this section draws heavily from United States Institute of Medicine (1985).
8. In our empirical work, to define treatment of observations occurring exactly at the relevant cutoffs, we rely on definitions listed in the International Statistical Classification of Diseases and Related Health Problems (ICD-9) codes. According to the ICD-9 codes, VLBW is defined as having birth weight strictly less than 1,500 g, and analogously (with a strict inequality) for the other thresholds we examine.
2,501 grams on the basis of birth weight alone . . . As with the 2,500 grams limit, designation of very low birth weight infants as those weighing 1,500 grams or less reflected convention rather than biologic criteria.
Gestational age classifications, such as the “prematurity” classification at 37 gestational weeks, have also been emphasized. Although gestational age is a natural consideration when determining treatment for newborns with low birth weights, applying our research design to gestational age introduces some additional complications. Gestational age is known to women in advance of giving birth, and women can choose to time their births (for example, through an induced vaginal birth or through a C-section) based on gestational age. Thus, we would expect that mothers who give birth prior to 37 gestational weeks may be different from mothers who give birth after 37 gestational weeks on the basis of factors other than gestational age. It is thought that birth weight, on the other hand, cannot be predicted in advance of birth with the accuracy needed to change (via birth timing) the classification of a newborn from being just above 1,500 g to being just below 1,500 g; this assertion has been confirmed from conversations with physicians,9 as well as from studies such as Pressman et al. (2000).

Of course, birth weight and gestational age are not the only factors used to assess newborn health.10 This implies that we should expect our cutoffs of interest to be “fuzzy” rather than “sharp” discontinuities (Trochim 1984): that is, we do not expect the probability of a given treatment to fall from 1 to 0 as one moves from 1,499 to 1,501 g, but rather expect a change in the likelihood of treatment for newborns classified into a given risk category.

9. We use the phrase “conversations with physicians” somewhat loosely throughout the text of the paper to reference discussions with several physicians as well as readings of the relevant medical literature and references such as the Manual of Neonatal Care for the Joint Program in Neonatology (Harvard Medical School, Beth Israel Deaconess Medical Center, Brigham and Women’s Hospital, Children’s Hospital Boston) (Cloherty and Stark 1998). The medical doctors we spoke with include Dr. Christopher Almond from Children’s Hospital Boston (Boston, MA); Dr. Burak Alsan from Harvard Brigham and Women’s/Children’s Hospital Boston (Boston, MA); Dr. Munish Gupta from Beth Israel Deaconess Medical Center (Boston, MA); Dr. Chafen Hart from the Tufts Medical Center (Boston, MA); and Dr. Katherine Metcalf from Saint Vincent Hospital (Worcester, MA). We are very grateful for their time and feedback, but they are of course not responsible for any errors in our work.
10. For example, respiratory rate, color, APGAR score (an index of newborn health), head circumference, and presence of congenital anomalies could also affect physicians’ initial health assessments of infants (Cloherty and Stark 1998).
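Such a "fuzzy" first stage can be gauged directly by estimating the jump at 1,500 g in the probability of receiving a given treatment, using the same kind of local-linear, triangle-kernel regression the paper applies to outcomes in Section IV. The sketch below is a minimal illustration under those assumptions, not the authors' code; the array names are hypothetical and the 85-g bandwidth is taken from the description in Section IV.

```python
import numpy as np

def rd_gap(birth_weight, y, cutoff=1500.0, bandwidth=85.0):
    """Local-linear estimate of the discontinuity in E[y] at the cutoff (sketch only)."""
    g = np.asarray(birth_weight, dtype=float) - cutoff
    y = np.asarray(y, dtype=float)
    keep = np.abs(g) < bandwidth
    g, y = g[keep], y[keep]
    below = (g < 0).astype(float)              # VLBW indicator: strictly below 1,500 g
    w = 1.0 - np.abs(g) / bandwidth            # triangle kernel weights
    X = np.column_stack([np.ones_like(g), below, below * g, (1.0 - below) * g])
    XtW = X.T * w                              # apply kernel weights
    beta = np.linalg.solve(XtW @ X, XtW @ y)   # weighted least squares
    return beta[1]                             # estimated jump just below the cutoff

# Usage: pass a binary treatment indicator as y to check the "first stage,"
# or one-year mortality to estimate the reduced-form discontinuity.
```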
Discussions with physicians suggest that these potential discontinuities are well-known, salient cutoffs below which newborns may be at increased consideration for receiving additional treatments. From an empirical perspective, the fact that we will observe a discontinuity in treatment provision around 1,500 g suggests that hospitals or physicians do use these cutoffs to determine treatment, either through hospital protocols or as rules of thumb. As an example of a relevant hospital protocol, the 1,500-g threshold is commonly cited as a point below which diagnostic ultrasounds should be used.11 Such classifications could also affect treatment provision through use as more informal “rules of thumb” by physicians.12

As we discuss below, it is likely that VLBW infants receive a bundle of mortality-reducing health inputs, not all of which we can observe.13 Moreover, because several procedures are given simultaneously, our research design does not allow us to measure marginal returns to specific procedures. This motivates our focus on summary measures—such as charges and length of stay—that are our best available measures of differences in health inputs.

Differential reimbursement by birth weight is another potential source of observed discontinuities in summary spending measures. For example, some Current Procedural Terminology (CPT) billing codes and ICD-9 diagnosis codes are categorized by birth weight (ICD-9 codes V21.30–V21.35 denote birth weights of 0–500, 500–999, 1,000–1,499, 1,500–1,999, 2,000–2,500, etc.). If prices differ across our threshold of interest, then any discontinuous jump in charges could in part be due to mechanical changes in the “prices” of services rather than to changes in the “quantities” of the services performed. In practice, we argue that a

11. Diagnostic ultrasounds (also known as cranial ultrasounds) are used to check for bleeding or swelling of the brain as signs of intraventricular hemorrhages (IVH)—a major concern for at-risk newborns. The neonatal care manual used by medical staff at the Longwood Medical Area (Boston, MA) notes: “We perform routine ultrasound screens in infants with birth weight <1500gm” (Cloherty and Stark 1998, p. 508). We investigate differences in the use of diagnostic ultrasounds and other procedures below.
12. For a recent contribution on this point in the economics literature, see Frank and Zeckhauser (2007). In the medical literature, see McDonald (1996) and Andre et al. (2002). Medical malpractice environments could also be one force affecting adherence to either formal rules or informal rules of thumb.
13. A recent review article (Angert and Adam 2009) on care for VLBW infants offered several examples of health inputs that we would likely not be able to detect in our hospital claims data. For example, the authors note (p. 32): “To decrease the risk for intraventricular hemorrhage and brain injury during resuscitation, the baby should be handled gently and not placed in a head down or Trendelenburg position.”
substantial portion of our observed jump in charges is a quantity effect rather than a price effect, for three reasons. First, the limited qualitative evidence available to us suggests that prices do not vary discontinuously across the VLBW threshold for many of the births in our data.14 Second, we empirically observe a discontinuity in charges within California, a state where the Medicaid reimbursement scheme does not explicitly utilize birth weight during the time period of our study. Third, we find evidence of discontinuities in a summary quantity measure—length of stay—as well as in quantities of specific procedures. Together, these patterns suggest that the jump in charges largely reflects quantities rather than prices. Furthermore, if the pricing effect were purely mechanical, we should not observe the empirical discontinuity in mortality.

14. We unfortunately do not observe prices directly in any of our data sets. A recent study of Medicaid payment systems (Quinn 2008) found that although some states rely on payment systems that explicitly incorporate birth weight categories into the reimbursement schedules, most states—including California—rely on either a per diem system or the CMS-DRG system, neither of which explicitly utilizes birth weight. More precisely, because birth weight is thought to be the best predictor of neonatal resource use (Lichtig et al. 1989), some newer DRG-based (that is, Diagnosis Related Group) systems explicitly incorporate birth weight categories into the reimbursement schedules.
III. DATA

III.A. Data Description

Our empirical analysis requires data on birth weight and some welfare-relevant outcome, such as medical care expenditures or health outcomes. Our primary analysis uses three data sets: first, the National Center for Health Statistics (NCHS) birth cohort–linked birth/infant death files; second, a longitudinal research database of linked birth record–death certificate–hospital discharge data from California; and third, hospital discharge data from several states in the Healthcare Cost and Utilization Project (HCUP) state inpatient databases.

The NCHS birth cohort–linked birth/infant death files, hereafter the “nationwide data,” include data for a complete census of births occurring each year in the United States for the years 1983–1991 and 1995–2002—approximately 66 million births.15 The data include information reported on birth certificates linked to information reported on death certificates for infants who die

15. NCHS did not produce linked birth/infant death files from 1992 to 1994.
within one year of birth. The birth certificate data offer a rich set of covariates (for example, mother’s age and education), and the death certificate data include a cause-of-death code. Beginning in 1989, these data include some treatment variables—namely, indicators for use of a ventilator for less than or (separately) more than thirty minutes after birth. Our other two data sources offer treatment variables beyond ventilator use.

The California research database is the same data set used in Almond and Doyle (2008). These data were collected by the California Office of Statewide Health Planning and Development and include all live births in California from 1991 to 2002—approximately 6 million births. The data include hospital discharge records linked to birth and death certificates for infants who die within one year of birth. The hospital discharge data include diagnosis, course of treatment, length of hospital stay, and charges incurred during the hospitalization. The data are longitudinal in nature and track hospital readmissions for up to one year from birth as long as the infants are admitted to a California hospital. This longitudinal aspect of the data allows us to examine charges and length of stay even if the newborn is transferred to another hospital.16

The HCUP state inpatient databases allow us to analyze the universe of hospital discharge abstracts in four other states that include the birth weight variable necessary for our analysis.17 Specifically, we use HCUP data from Arizona for 2001–2006, New Jersey for 1995–2006, Maryland for 1995–2006, and New York for 1995–2000—approximately 10.5 million births (see Table A1 in the Online Appendix for the number of births by state and year within our pilot bandwidth of three ounces of the VLBW cutoff). The HCUP data include variables similar to those available in the California discharge data but, unlike the California data, are linked neither to mortality records nor to hospital records for readmissions or transfers. Although we cannot link the HCUP data with mortality records directly, we can examine mortality outcomes

16. The treatment measures that include transfers described below include treatment at the birth hospital and the hospital where the newborn was initially transferred.
17. The State Inpatient Data (SID) we analyze contain the universe of inpatient discharge records from participating states. (Other HCUP databases, such as the National Inpatient Sample, are a subsample of the SID data.) At present, 39 states participate in the SID. Of these 39 states, 10 report the birth weights of newborns. We have obtained HCUP data for 4 of the 10 states with birth weight. With the exception of North Carolina, we have discharge data for the top four states by number of births: New York, New Jersey, Maryland, and Arizona.
for these newborns using a subsample of our nationwide data, as our nationwide data and the HCUP discharge data relate to the same births.18 In much of our analysis, we pool the California and HCUP data to create a “five-state sample.” Both the California and the HCUP data report hospital charges. These charges are used in negotiations for reimbursement and are typically inflated well over costs. We consider these charges our best available summary of the difference in treatment that the VLBW classification affords. When calculating the returns to medical spending, we adjust hospital charges by a costto-charge ratio.19 The main text focuses on charges rather than costs because charges are available for all years of data, whereas cost-to-charge ratios are available for only a subset of years and are known to introduce noise into the results. III.B. Analysis Sample Sample selection issues are minimal. In our main specifications, we pool data from all available years, although in the Online Appendix we separately examine results across time periods. For the main results, we limit the sample to those observations with nonmissing, nonimputed birth weight information.20 Fortunately, given the demands of our empirical approach, these data provide relatively large samples: over 200,000 newborns fall within our pilot bandwidth of three ounces around the 1,500-g threshold in the nationwide data, and we have approximately 30,000 births in the same interval when we consider the five-state sample. We discuss bandwidth selection below. 18. Note that our nationwide data include births that took place outside of hospitals, whereas our California and HCUP discharge data by construction only capture deliveries taking place in hospitals. In practice, 99.2% of deliveries in our national sample occurred in hospitals. In some robustness checks we limit our nationwide data to the sample of hospitalized births, for greater comparability. 19. The Centers for Medicare and Medicaid Services (CMS) report cost-tocharge ratios for each hospital in each year beginning in 1996 and continuing through 2005. When we use the cost-to-charge ratios, so that we can include information from all years, we use the 2000 cost-to-charge ratios in all states but New York—where the first year of data is 2001 and the 2001 cost-to-charge ratio is used. Further, we follow a CMS suggestion to replace the hospital’s cost-to-charge ratio with the state median if the cost-to-charge ratio is beyond the 5th or 95th percentile of the state’s distribution. Results were similar, though noisier, when the sample was restricted to 1996–2005 and each hospital-year cost-to-charge ratio was employed. 20. This sample selection criteria excludes a very small number of our observations. For the full NCHS data, for example, dropping observations with missing or imputed birth weights drops only 0.12% of the sample. We also exclude a very small number of observations in early years of our data that lack information on the time of death.
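When we do use the cost-to-charge ratios, the adjustment described in footnote 19 is mechanical. The following Python fragment is a minimal sketch of that adjustment under stated assumptions; the data-frame and column names (charges, ccr, hospital_id, state, ratio) are illustrative placeholders rather than the variable names in the underlying files.

```python
import pandas as pd

def costs_from_charges(charges: pd.DataFrame, ccr: pd.DataFrame) -> pd.Series:
    """Approximate costs from charges using hospital cost-to-charge ratios.

    charges: one row per discharge, columns ['hospital_id', 'state', 'charges']
    ccr:     one row per hospital,  columns ['hospital_id', 'state', 'ratio']
    (column names are illustrative placeholders)
    """
    df = charges.merge(ccr, on=["hospital_id", "state"], how="left")

    # Per the CMS suggestion described in the text, replace a hospital's ratio
    # with its state median when it lies outside the 5th-95th percentile of the
    # state's distribution of ratios.
    grouped = df.groupby("state")["ratio"]
    lo = grouped.transform(lambda s: s.quantile(0.05))
    hi = grouped.transform(lambda s: s.quantile(0.95))
    state_median = grouped.transform("median")
    df["ratio"] = df["ratio"].where(df["ratio"].between(lo, hi), state_median)

    # Approximate cost = charge x cost-to-charge ratio.
    return df["charges"] * df["ratio"]
```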
IV. EMPIRICAL FRAMEWORK AND ESTIMATION

IV.A. Empirical Framework

To estimate the size of the discontinuity in outcomes and treatment, we follow standard methods for regression discontinuity analysis (as in, for example, Imbens and Lemieux [2008] and Lee and Lemieux [forthcoming]). First, we restrict the data to a small window around our threshold (85 g) and estimate a local-linear regression. We describe the selection of this bandwidth in the next section. We use a triangle kernel so that the weight on each observation decays with the distance from the threshold, and we report asymptotic standard errors (Cheng, Fan, and Marron 1997; Porter 2003).21 Second, within our bandwidth, we estimate the following model for infant i weighing g grams in year t:

(1)  Yi = α0 + α1 VLBWi + α2 VLBWi × (gi − 1500) + α3 (1 − VLBWi) × (gi − 1500) + αt + αs + δXi + εi,

where Y is an outcome or treatment measure such as one-year mortality or costs, and VLBW is an indicator that the newborn was classified as VLBW (that is, strictly less than 1,500 g). We include separate gram trend terms above and below the cutoff, parameterized so that α2 = α3 if the trend is the same above and below the cutoff. In some specifications, we include indicators for each year of birth t, indicators for each state of birth s, and newborn characteristics, Xi. The newborn characteristics that are available for all of the years in the nationwide data include an indicator that the mother was born outside the state where the infant was born, as well as indicators for mother's age, education, father's age, the newborn's sex, gestational age, race, and plurality. We estimate this model by OLS, and we report two sets of standard errors.22 First, we report heteroscedastic-robust standard errors. Second, to address potential concerns about discreteness in birth weight, we perform the standard error correction suggested by Card and Lee (2008). In our application, this
21. We are grateful to Doug Miller for providing code from Ludwig and Miller (2007).
22. Probit results for our binary dependent variables give very similar results, as described below.
correction amounts to clustering at the gram level. Estimation of our outcome and treatment results with quadratic (or higherorder) rather than linear trends in birth weight gives similar results (see Online Appendix Table A4). In Section V, we report outcome and treatment estimates separately. Our reduced-form estimate of the direct impact of our VLBW indicator on mortality is itself interesting and policy relevant, as this estimate includes the effects of all relevant inputs. Under an additional assumption, we can combine our outcome and treatment estimates into two-sample estimates of the return to an increment of additional spending in terms of health benefits. In the language of instrumental variables, the discontinuity in mortality is the reduced-form estimate and the discontinuity in health inputs is the first-stage estimate.23 In this framework, the instrument is the VLBW indicator. For our VLBW indicator to be a valid instrument, the two usual instrumental variables conditions must hold. First, there must exist a first-stage relationship between our VLBW indicator and our measure of health inputs; note that this relationship will be conditional on our running variable (birth weight). Second, the exclusion restriction requires that the only mechanism through which the instrument VLBW affects the mortality outcome, conditional on birth weight falling within the bandwidth, is through its effect on our measure of health inputs. If our summary measures allow us to observe and capture all relevant health inputs, then we can argue for the validity of this exclusion restriction. However, for any given measure of health inputs that we observe, it is likely that there exists some additional health-related input that we do not observe (see Section II.B). It is unclear how important such unobserved inputs are in practice, but to the extent that they are important, a violation of the exclusion restriction would occur. We present two-sample estimates in Section VIII, using our most policy-relevant available summary measure of treatment (hospital costs) as our first-stage variable, but we stress that the interpretation of these estimates relies on an assumption that 23. Without covariates, the two-sample estimate is equivalent to the Wald and two-stage least-squares estimates, given our binary instrumental variable. Even though the first stage and reduced-form estimates come from different data sources, we can standardize the samples and covariates to produce the same estimates that we would attain from a single data source.
hospital charges capture all relevant medical inputs. We also attempt to gauge the magnitude of unobserved inputs by testing for effects on short-run mortality. To the extent that medical inputs are much more important than parental or other unobserved inputs in the very short run after birth (say, within 24 hours of birth), we can test for impacts on short-run mortality measures and be somewhat assured that unobserved parental or other inputs are not likely to affect these estimates. As we will discuss in Section V, we do indeed find effects on short-run mortality measures.
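For concreteness, the estimation just described—a local linear regression with triangle-kernel weights within the bandwidth, and OLS with heteroscedastic-robust and gram-clustered standard errors—can be sketched in a few lines of Python. This is a simplified illustration of equation (1) rather than the code used for the paper; the column names (grams and the outcome) are assumptions, and the year and state indicators and newborn characteristics are omitted for brevity.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def rd_estimates(df: pd.DataFrame, outcome: str, cutoff: int = 1500, h: int = 85):
    """Local linear and OLS discontinuity estimates in the spirit of equation (1).

    df is assumed to have a gram birth weight column 'grams' and the outcome
    column; these names are illustrative, not those of the underlying files.
    """
    d = df[np.abs(df["grams"] - cutoff) <= h].copy()
    d["vlbw"] = (d["grams"] < cutoff).astype(int)       # strictly below 1,500 g
    d["dist"] = d["grams"] - cutoff
    X = sm.add_constant(pd.DataFrame({
        "vlbw": d["vlbw"],
        "below_trend": d["vlbw"] * d["dist"],            # gram trend below the cutoff
        "above_trend": (1 - d["vlbw"]) * d["dist"],      # gram trend above the cutoff
    }))
    y = d[outcome]

    # Local linear regression: triangle-kernel weights that decay with distance.
    w = 1 - np.abs(d["dist"]) / h
    local_linear = sm.WLS(y, X, weights=w).fit()

    # OLS within the bandwidth, with (i) heteroscedasticity-robust and
    # (ii) gram-clustered standard errors (Card and Lee 2008).
    ols_robust = sm.OLS(y, X).fit(cov_type="HC1")
    ols_cluster = sm.OLS(y, X).fit(cov_type="cluster",
                                   cov_kwds={"groups": d["grams"]})
    return local_linear, ols_robust, ols_cluster
```

In this sketch the coefficient on vlbw plays the role of α1 in equation (1); additional controls would simply be appended as further columns of X.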
IV.B. Bandwidth Selection Our pilot bandwidth includes newborns with birth weights within 3 ounces (85 g) of 1,500 g, or from 1,415 to 1,585 g. We chose this bandwidth by a cross-validation procedure where the relationships between the main outcomes of interest and birth weight were estimated with local linear regressions and compared to a fourth-order polynomial model. These models were estimated separately above and below the 1,500-g threshold. The bandwidth that minimized the sum of squared errors between these two estimates between 1,200 and 1,800 g tended to be between 60 and 70 g for the mortality outcomes. For the treatment measures, the bandwidth tended to be closer to 40 g. Given that we are estimating the relationship at a boundary, a larger bandwidth is generally warranted. We chose to use a pilot bandwidth of 85 g—three ounces24 —for the main results. This larger bandwidth incorporates more information, which can improve precision, but of course, including births further from the threshold departs from the assumption that newborns are nearly identical on either side of the cutoff. That said, our local linear estimates allow the weight on observations to decay with the distance from the threshold. In addition, the results are qualitatively similar across a wide range of bandwidths (see Online Appendix Table A3). To give a clearer sense of how our data look graphically, our figures report means for a slightly wider bandwidth—namely, the five ounces above and below the threshold. 24. As discussed in the next section, our birth weight variable has pronounced reporting heaps at gram equivalents of ounce intervals. We specify the bandwidth in ounces to ensure that the sample sizes are comparable above and below the discontinuity, given these trends in reporting.
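The cross-validation idea can be illustrated with a stylized sketch: for each candidate bandwidth, compare local linear fits to a fourth-order polynomial fit, separately above and below the threshold, and keep the bandwidth that minimizes the summed squared differences between 1,200 and 1,800 g. The implementation details below (the evaluation grid, the candidate set, and the edge handling) are our assumptions for the illustration, not a description of the exact procedure used; grams and outcome are assumed to be numpy arrays.

```python
import numpy as np

def local_linear_fit(x, y, x0, h):
    """Triangle-kernel local linear prediction at x0 using points within h."""
    w = np.clip(1 - np.abs(x - x0) / h, 0, None)
    keep = w > 0
    sw = np.sqrt(w[keep])
    X = np.column_stack([np.ones(keep.sum()), x[keep] - x0])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y[keep] * sw, rcond=None)
    return beta[0]

def cv_bandwidth(grams, outcome, cutoff=1500, candidates=range(30, 151, 5)):
    """Pick the bandwidth whose local linear fits are closest (in summed squared
    error) to a fourth-order polynomial fit, evaluated between 1,200 and 1,800 g,
    separately above and below the cutoff."""
    grid = np.arange(1200, 1801)
    sse = {}
    for h in candidates:
        total = 0.0
        for side in (grams < cutoff, grams >= cutoff):
            x, y = grams[side], outcome[side]
            quartic = np.poly1d(np.polyfit(x, y, 4))
            pts = grid[(grid >= x.min()) & (grid <= x.max())]
            ll = np.array([local_linear_fit(x, y, g, h) for g in pts])
            total += np.sum((ll - quartic(pts)) ** 2)
        sse[h] = total
    return min(sse, key=sse.get)
```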
V. RESULTS V.A. Frequency of Births by Birth Weight Figure I is a histogram of births between 1,350 and 1,650 g in the nationwide sample, which has several notable characteristics.25 First, there are pronounced reporting heaps at the gram equivalents of ounce intervals. Although there are also reporting heaps at “round” gram numbers (such as multiples of 100), these heaps are much smaller than those observed at gram equivalents of ounce intervals. Discussions with physicians suggest that birth weight is frequently measured in ounces, although typically also measured in grams for purposes of billing and treatment recommendations. Given the nature of the variation inherent in the reporting of our birth weight variable, our graphical results will focus on data that have been collapsed into one-ounce bins.26 Second, we do not observe irregular reporting heaps around our 1,500-g threshold of interest, consistent with women being unable to predict birth weight in advance of birth with the accuracy necessary to move their newborn (via birth timing) from just above 1,500 g to just below 1,500 g. The lack of heaping also suggests that physicians or hospitals do not manipulate reported birth weight so that, for example, more newborns fall below the 1,500-g cutoff and justify higher reimbursements. In particular, the frequency of births at 1,500 g is very similar to the frequency of births at 1,400 g and at 1,600 g, and the ounce markers surrounding 1,500 g have frequencies similar to those of other ounce markers. More formally, McCrary (2008) suggests a direct test for possible manipulation of the running variable. We implement his test by collapsing our nationwide data to the gram level—keeping count of the number of newborns classified at each gram—and then regressing that count as the outcome variable in the same framework as our regression discontinuity estimates. Using this test, we find no evidence of manipulation of the running variable around the VLBW threshold.27 Fetal deaths are not included in the birth records data, and hence one possible source of sample selection is the possibility that 25. See Online Appendix Figure A1 for a wider set of births. 26. Specifically, we construct one-ounce bins radiating out from our threshold of interest (e.g., 0–28 g from the threshold, 29–56 g from the threshold). 27. For 1,500 g we estimate a coefficient of −2,100 (s.e. = 1,500).
[Figure I: histogram of the number of births at each gram of birth weight between 1,350 and 1,650 g; the vertical axis runs from 0 to 30,000 births.]
FIGURE I
Frequency of Births by Gram: Population of U.S. Births between 1,350 and 1,650 g
NCHS birth cohort linked birth/infant death files, 1983–1991 and 1995–2003, as described in the text.
very sick infants are discontinuously reported more frequently as fetal deaths across our cutoff of interest (and are thus excluded from our sample). We test for this type of sample selection directly using a McCrary test with data on fetal death reports from the National Center for Health Statistics (NCHS) perinatal mortality data for 1985 to 2002. We would be concerned if we observed a positive jump in fetal deaths for VLBW infants, but in fact the estimated coefficient is negative and not statistically significant.28 Graphical analysis of the data is consistent with this formal test. More complicated manipulations of birth weight could in theory be consistent with Figure I. For example, if doctors relabeled unobservably sicker newborns weighing just above 1,500 g as being below 1,500 g (to receive additional treatments, for example) and symmetrically “switched” the same number of other newborns weighing just below 1,500 g to be labeled as being above 1,500 g, this could be consistent with the smooth distribution in Figure I. This seems unlikely, particularly given that we will later show that other covariates (such as gestational age) are smooth across our 1,500-g cutoff—implying that doctors would need to not only “symmetrically switch” newborns but symmetrically switch 28. As above, we implement this test by collapsing the NCHS perinatal data to the gram level—keeping count of the number of fetal deaths classified at each gram—and then regressing that count as the outcome variable in the same framework as our regression discontinuity estimates. We estimate a coefficient of −106.59 (s.e. = 78.32).
newborns who are identical on all of the covariates we observe. We therefore view the assumption that such switching does not occur as plausible.29
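The manipulation test described above—collapsing the data to counts of births at each gram and estimating the same discontinuity specification with the count as the outcome—can be sketched as follows. This is a simplified stand-in for McCrary's (2008) formal test, and the column names are illustrative; it reuses the specification from the sketch in Section IV.A.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def gram_count_discontinuity(grams: pd.Series, cutoff: int = 1500, h: int = 85):
    """Collapse the data to counts of births at each gram and test for a jump
    in the counts at the cutoff, using the same trend specification as the
    outcome regressions."""
    within = grams[np.abs(grams - cutoff) <= h]
    counts = within.value_counts().sort_index().rename("births").reset_index()
    counts.columns = ["grams", "births"]

    counts["vlbw"] = (counts["grams"] < cutoff).astype(int)
    counts["dist"] = counts["grams"] - cutoff
    X = sm.add_constant(pd.DataFrame({
        "vlbw": counts["vlbw"],
        "below_trend": counts["vlbw"] * counts["dist"],
        "above_trend": (1 - counts["vlbw"]) * counts["dist"],
    }))
    fit = sm.OLS(counts["births"], X).fit(cov_type="HC1")
    # A coefficient on vlbw that is small relative to its standard error is
    # consistent with no manipulation of the running variable at the cutoff.
    return fit.params["vlbw"], fit.bse["vlbw"]
```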
V.B. Health Outcomes Figure II reports mean mortality for all infants in one-ounce bins close to the VLBW threshold. Note that the one-year mortality rate is relatively high for this at-risk population: close to 6%. The figure shows that even within our relatively small bandwidth, there is a general reduction in mortality as birth weight increases, reflecting the health benefits associated with higher birth weight. The increase in mortality observed just above 1,500 g appears to be a level shift, with the slope slightly less steep below the threshold.30 The mean mortality rate in the ounce bin just above the threshold is 6.15%, which is 0.46 percentage points higher than mean mortality just below the threshold of 5.69%. We see a similar 0.48–percentage point difference for 28-day mortality—between 4.39% above the threshold and 3.91% below the threshold. This suggests that most of the observed gains in 28-day mortality persist to one year. Table I reports the main results that account for trends and other covariates. The first reported outcome is one-year mortality, and the local-linear regression estimate is −0.0121. This implies a 22% reduction in mortality compared to a mean mortality rate of 5.53% in the three ounces above the threshold (the “untreated” group in this regression discontinuity design). The OLS estimate in the second column mimics the local linear regression but now places equal weight on the observations up to three ounces on 29. Note also that to the extent that hospitals or physicians may have an incentive to categorize relatively costly newborns as VLBW to justify greater charge amounts, such gaming would tend to lead to higher mortality rates just prior to the threshold, contrary to our main findings. 30. Note that in this graph there is also a smaller, visible “jump” in mortality around 1,600 g, an issue we address in several ways. First, if we construct graphs analogous to Figure II that focus on 1,600 g as a potential discontinuity, there is no visible jump at 1,600 g. Exploration of this issue reveals that the slightly different groupings that occur when one-ounce bins are radiated out from 1,500 g relative to when one-ounce bins are radiated out from 1,600 g explain this difference, implying that small-sample variation is producing this visible “jump” at 1,600 g in Figure II. Reassuringly, the “jump” at 1,500 g is also visible in the graph which radiates one-ounce bins from 1,600 g, suggesting that small-sample variation is not driving the visible discontinuity at 1,500 g. Finally, when we estimate a discontinuity in a formal regression framework at 1,600 g, we find no evidence of either a first-stage or a reduced-form effect at 1,600 g.
TABLE I
INFANT MORTALITY BY VERY-LOW-BIRTH-WEIGHT STATUS, NATIONAL DATA, 1983–2002 (AVAILABLE YEARS)

One-year mortality
                                                    Local linear model   OLS                            OLS                            OLS
Birth weight < 1,500 g                              −0.0121 (0.0023)**   −0.0095 (0.0022)** [0.0048]*   −0.0067 (0.0022)** [0.0040]    −0.0072 (0.0022)** [0.0040]
Birth weight < 1,500 g × grams from cutoff (100s)                        −0.0136 (0.0032)** [0.0062]*   −0.0119 (0.0032)** [0.0024]**  −0.0111 (0.0032)** [0.0018]**
Birth weight ≥ 1,500 g × grams from cutoff (100s)                        −0.0224 (0.0029)** [0.0081]**  −0.0196 (0.0029)** [0.0074]**  −0.0184 (0.0029)** [0.0074]*
Year controls                                                            No                             Yes                            Yes
Main controls                                                            No                             No                             Yes
Mean of dependent variable above cutoff             0.0553

28-day mortality
                                                    Local linear model   OLS                            OLS                            OLS
Birth weight < 1,500 g                              −0.0107 (0.0019)**   −0.0088 (0.0018)** [0.0038]*   −0.0074 (0.0018)** [0.0031]*   −0.0073 (0.0018)** [0.0031]*
Birth weight < 1,500 g × grams from cutoff (100s)                        −0.0114 (0.0027)** [0.0055]*   −0.0102 (0.0027)** [0.0027]**  −0.0097 (0.0027)** [0.0022]**
Birth weight ≥ 1,500 g × grams from cutoff (100s)                        −0.0199 (0.0024)** [0.0060]**  −0.0179 (0.0024)** [0.0056]**  −0.0172 (0.0024)** [0.0055]**
Year controls                                                            No                             Yes                            Yes
Main controls                                                            No                             No                             Yes
Mean of dependent variable above cutoff             0.0383

TABLE I (CONTINUED)

7-day mortality
                                                    Local linear model   OLS                            OLS                            OLS
Birth weight < 1,500 g                              −0.0068 (0.0017)**   −0.0060 (0.0016)** [0.0032]    −0.0049 (0.0016)** [0.0027]    −0.0047 (0.0016)** [0.0027]
Birth weight < 1,500 g × grams from cutoff (100s)                        −0.0078 (0.0024)** [0.0047]    −0.0068 (0.0024)** [0.0026]**  −0.0066 (0.0024)** [0.0023]**
Birth weight ≥ 1,500 g × grams from cutoff (100s)                        −0.0137 (0.0022)** [0.0049]**  −0.0120 (0.0022)** [0.0046]*   −0.0116 (0.0022)** [0.0046]*
Year controls                                                            No                             Yes                            Yes
Main controls                                                            No                             No                             Yes
Mean of dependent variable above cutoff             0.0301

24-hour mortality
                                                    Local linear model   OLS                            OLS                            OLS
Birth weight < 1,500 g                              −0.0068 (0.0017)**   −0.0043 (0.0013)** [0.0023]    −0.0036 (0.0013)** [0.0020]    −0.0035 (0.0013)** [0.0020]
Birth weight < 1,500 g × grams from cutoff (100s)                        −0.0042 (0.0019)* [0.0031]     −0.0036 (0.0019) [0.0018]*     −0.0036 (0.0019) [0.0015]*
Birth weight ≥ 1,500 g × grams from cutoff (100s)                        −0.0098 (0.0017)** [0.0036]**  −0.0088 (0.0017)** [0.0034]*   −0.0086 (0.0017)** [0.0034]*
Year controls                                                            No                             Yes                            Yes
Main controls                                                            No                             No                             Yes
Mean of dependent variable above cutoff             0.0191
Observations                                        202,071

Notes. Local linear regressions use a bandwidth of 3 ounces (85 g). OLS models are estimated on a sample within 3 ounces above and below the VLBW threshold. "Main controls" are listed in Online Appendix Table A5, in addition to indicators for five-year intervals of mother's age, five-year intervals of father's age, gestational week, state of residence, year, as well as missing-information indicators for the prenatal, birth order, gestational age, and mother's race categories. Local linear models report asymptotic standard errors. OLS models report heteroscedastic-robust standard errors in parentheses and standard errors clustered at the gram level in brackets. * Significant at 5%; ** significant at 1%.
[Figure II: Panel A, one-year mortality; Panel B, 28-day mortality. Horizontal axis: birth weight (g), 1,350–1,650; vertical axis: mortality rate.]
FIGURE II
One-Year and 28-Day Mortality around 1,500 g
NCHS birth cohort–linked birth/infant death files, 1983–1991 and 1995–2003, as described in the text. Points represent gram-equivalents of ounce intervals, with births grouped into one-ounce bins radiating from 1,500 g; the estimates are plotted at the median birth weight in each bin.
either side of the threshold. The point estimate is slightly smaller, but still large: −0.0095. The probit model estimate is similar.31 31. A probit model with no controls other than the trend terms predicts a difference of −0.0095 evaluated at the cutoff. A probit model with full controls
The trend terms reflect the overall downward slope in mortality. The point estimates suggest a steeper slope above the threshold. This trend difference could result from greater treatment levels that extend below the cutoff at a diminishing rate. Our estimate of the discontinuity in models that account for trends will not take treatment of inframarginal VLBW infants into account. In terms of the covariates, the largest impact on our main coefficient of interest is found when we introduce year indicators, likely because medical treatments, levels of associated survival rates, and trends in survival rates have changed so much over time. The estimated change in mortality around the threshold in the specification with the year indicators decreases to −0.0076. When we include the full set of covariates, the results are largely unchanged.32 To be conservative, in the rest of our analysis, we always report a specification that includes the full set of covariates.
The remaining outcomes in Table I are mortality measures at shorter time intervals. The 28-day mortality coefficient is similar in magnitude to the one-year mortality coefficient, despite a smaller mean mortality rate of 3.83%. Given different mean mortality rates, the estimate implies a 23% reduction in 28-day mortality as compared to a 17% reduction in one-year mortality. As discussed above, the similarity between the one-year and 28-day mortality coefficients implies that any effects of being categorized as VLBW are seen in the first month of life—a time when these infants are largely receiving medical care (as described more below in our length-of-stay results). Within the first month of life, timing of the mortality gains varies, but the percentage reduction in mortality for VLBW infants relative to the rate above the threshold is consistent with that at 28 days. The estimated reductions in 7-day and 24-hour mortality are 16% and 19%, respectively, of the corresponding mean mortality rates for infants above the threshold. Finally, 1-hour mortality rates (not shown) are also lower for those born just below the threshold.33
predicts an average difference across the actual values of the covariates of −0.0069 evaluated at the cutoff. 32. The estimated coefficients on many of these covariates are reported in Online Appendix Table A5. 33. In a probit model with no controls other than the trend terms, the main marginal effect of interest, evaluated at the cutoff, is −0.0018 (s.e. = 0.0007) compared to a mean 1-hour mortality rate of 0.0055 just above the threshold. In a model with full controls, the average marginal effect evaluated at the cutoff is −0.0016.
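As described in Section IV.A, reduced-form mortality discontinuities like those in Table I are later combined with first-stage treatment discontinuities into two-sample estimates (Section VIII). As a purely arithmetic illustration—not the authors' estimator, which standardizes samples and covariates across data sources—the ratio and a delta-method standard error that treats the two samples as independent can be computed as follows; the numbers in the usage line are hypothetical.

```python
def two_sample_ratio(reduced_form, rf_se, first_stage, fs_se):
    """Wald-type ratio of a reduced-form effect (e.g., the change in one-year
    mortality) to a first-stage effect (e.g., the change in hospital costs),
    with a delta-method standard error that treats the two estimates as
    independent."""
    ratio = reduced_form / first_stage
    se = abs(ratio) * ((rf_se / reduced_form) ** 2 + (fs_se / first_stage) ** 2) ** 0.5
    return ratio, se

# Illustrative only: a 1 percentage point mortality reduction per (hypothetical)
# $4,000 of additional costs would imply roughly 0.0025 lives saved per $1,000.
print(two_sample_ratio(-0.010, 0.002, 4000.0, 1500.0))
```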
The following two sections consider the extent to which newborns classified as VLBW receive discontinuously more medical treatments than newborns just above 1,500 g. Although the universe of births in the natality file allows us to consider mortality effects with a large sample, these data do not include summary measures of treatment. As described above, we are able to examine summary measures of treatment in our hospital discharge data from five states (Arizona, California, Maryland, New Jersey, and New York), which appear to have broadly representative mortality outcomes.34 When we replicate the results in Table I limiting our nationwide data to these five states (a sample of nearly 50,000 births), we estimate that mortality falls by 1.1 percentage points (s.e. = 0.42) compared to a mean of 5.4% (as reported in Online Appendix Table A7). V.C. Differences in Summary Measures of Treatment Figure IIIA reports mean hospital charges in one-ounce bins. The measure appears fairly flat at $94,000 for the three ounces prior to the threshold, then falls discontinuously to $85,000 after the threshold, and continues on a downward trend, consistent with fewer problems among relatively heavier newborns.35 Table II reports the regression results.36 The first column reports estimates from the local linear regression, which suggests that hospital charges are $9,450 higher just before the threshold— relatively large compared to the mean charges of $82,000 above the threshold. The remaining columns report the OLS results. Without controls, the estimate decreases somewhat to $9,022; with full controls the estimated increase in charges for infants categorized as VLBW is largely unchanged ($9,065, s.e. = $2,297). 34. When we estimate our mortality results separately within each state and rank them by the estimated coefficient scaled by mean mortality just above the threshold, each of the states in our five-state sample falls toward the middle of the distribution. Further, Online Appendix Table A7 also considers mortality outcomes in these five states in the (smaller) overlap of the years between the HCUP data and the nationwide data. As expected, the results are more imprecisely estimated with the smaller sample, and the point estimates are lower as well. 35. This flattening before the threshold is suggestive that newborns who are up to three ounces from the threshold may receive additional treatment due to the VLBW categorization. 36. Results are similar when we estimate alternative models, such as count models for length of stay. Note that there are fewer controls in the five-state sample than there are in the nationwide sample, as the discharge data do not include the birth certificate data. Results (not shown) are qualitatively similar in a separate analysis of California, which allows for a wider set of controls from the linked birth certificate data.
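The robustness checks discussed just below—median comparisons and a log transformation of charges—are mechanical. The following is a minimal sketch, assuming the design matrix X from the discontinuity specification sketched in Section IV.A (with a vlbw column) and a charges outcome from which zero and missing values have already been dropped; it is an illustration of the idea, not the authors' code.

```python
import numpy as np
import statsmodels.api as sm

def charges_robustness(y_charges, X):
    """Re-estimate the charges discontinuity (i) at the median and (ii) in logs,
    to reduce the influence of very large charge amounts."""
    # (i) Median (quantile) regression of charges on the discontinuity terms.
    median_fit = sm.QuantReg(y_charges, X).fit(q=0.5)

    # (ii) OLS with log charges as the outcome (observations with zero or
    # missing charges are assumed to have been dropped beforehand).
    log_fit = sm.OLS(np.log(y_charges), X).fit(cov_type="HC1")
    return median_fit.params["vlbw"], log_fit.params["vlbw"]
```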
[Figure III: Panel A, hospital charges (dollars); Panel B, hospital length of stay (days). Horizontal axis: birth weight (g), 1,350–1,650.]
FIGURE III
Summary Treatment Measures around 1,500 g
Data are all births in the five-state sample (AZ, CA, MD, NY, and NJ), as described in the text. Charges are in 2006 dollars. Points represent gram-equivalents of ounce intervals, with births grouped into one-ounce bins radiating from 1,500 g; the estimates are plotted at the median birth weight in each bin.
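The construction used in Figures II–V—one-ounce bins radiating out from 1,500 g, with each bin's mean plotted at the bin's median birth weight—can be sketched as follows. The grams-per-ounce conversion and the handling of the bin containing the cutoff are our assumptions for the illustration, and the column names are placeholders.

```python
import numpy as np
import pandas as pd

OUNCE = 28.35  # grams per ounce (approximate)

def binned_means(df: pd.DataFrame, outcome: str, cutoff: int = 1500, n_bins: int = 5):
    """Group births into one-ounce bins radiating out from the cutoff and
    return the mean outcome and the median birth weight in each bin."""
    dist = df["grams"] - cutoff
    below = dist < 0
    # Bins ..., -2, -1 lie strictly below the cutoff; bins 1, 2, ... start at the cutoff.
    bin_index = np.where(below,
                         -np.ceil(-dist / OUNCE),
                         np.floor(dist / OUNCE) + 1)
    d = df.assign(bin=bin_index)
    d = d[np.abs(d["bin"]) <= n_bins]
    out = d.groupby("bin").agg(mean_outcome=(outcome, "mean"),
                               median_grams=("grams", "median"))
    return out.sort_index()
```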
These estimates imply a difference of approximately 11% compared to the charges accrued by infants above the threshold. As the large mean charges suggest, this measure is right-skewed. The results are similar, however, when we estimate the relationship using median comparisons and when the charges are transformed by the natural logarithm to place less weight on
TABLE II
SUMMARY TREATMENT MEASURES BY VERY-LOW-BIRTH-WEIGHT STATUS, FIVE-STATE SAMPLE, 1991–2006

Hospital charges
                                                    Local linear model   OLS                           OLS                           OLS
Birth weight < 1,500 g                              9,450 (2,710)**      9,022 (2,448)** [3,538]*      8,205 (2,416)** [3,174]*      9,065 (2,297)** [5,094]
Birth weight < 1,500 g × grams from cutoff (100s)                        −1,728 (3,700) [8,930]        −3,176 (3,647) [7,937]        617.4876 (3,463) [8,447]
Birth weight ≥ 1,500 g × grams from cutoff (100s)                        −7,331 (3,018)* [5,022]       −8,684 (2,978)** [4,337]*     −7,951 (2,823)** [7,562]
Year controls                                                            No                            Yes                           Yes
Main controls                                                            No                            No                            Yes
Mean of dependent variable above cutoff             81,566
Observations                                        28,928

Length of stay
                                                    Local linear model   OLS                           OLS                           OLS
Birth weight < 1,500 g                              1.97 (0.451)**       1.7768 (0.4165)** [1.0024]    1.7600 (0.4166)** [0.9775]    1.4635 (0.4107)** [0.7928]
Birth weight < 1,500 g × grams from cutoff (100s)                        −0.1012 (0.6482) [1.9397]     −0.1356 (0.6467) [1.8419]     −0.5766 (0.6366) [1.4858]
Birth weight ≥ 1,500 g × grams from cutoff (100s)                        −2.3130 (0.5245)** [1.4366]   −2.3779 (0.5250)** [1.4117]   −2.5993 (0.5174)** [1.1464]*
Year controls                                                            No                            Yes                           Yes
Main controls                                                            No                            No                            Yes
Mean of dependent variable above cutoff             24.68
Observations                                        30,935

Notes. Local linear regressions use a bandwidth of 3 ounces (85 g). OLS models are estimated on a sample within three ounces above and below the VLBW threshold. Five states include AZ, CA, MD, NY, and NJ (various years). Charges are in 2006 dollars. "Main controls" are listed in Online Appendix Table A5, as well as indicators for each year. Some observations have missing charges, as described in the text. Local linear models report asymptotic standard errors. OLS models report heteroscedastic-robust standard errors in parentheses and standard errors clustered at the gram level in brackets. * Significant at 5%; ** significant at 1%.
large charge amounts, as shown in Online Appendix Figure A2 and Online Appendix Table A6.37 As noted in Section II.B, if prices differ across our threshold of interest, then any discontinuous jump in charges could in part be due to changes in prices rather than changes in quantities. One way to test whether differences in quantities of care are driving the main results is to consider a quantity measure that is consistently measured across hospitals: length of stay in the hospital.38 Figure IIIB shows that average length of stay drops from just over 27.3 days immediately prior to the threshold to 25.7 days immediately after the threshold. Corresponding regression results shown in Table II show that newborns weighing just under 1,500 g have stay lengths that are between 1.5 and 2 days longer, depending on the model, representing a difference of 6%–8% compared to the mean length of stay of 25 days above the threshold. Of course, length of stay and charges are not independent measures, as longer stays accrue higher charges both in terms of room charges and as associated with a greater number of services provided. We further investigate the differences in such service provision measures below. The first-stage variables in the five-state sample could be censored from above if newborns were transferred to another hospital, because we do not observe charges and procedures across hospital transfers in the HCUP data. This censoring is only problematic insofar as newborns on either side of the cutoff are more likely to be transferred to another hospital. In the discharge data, we do observe hospital transfers, and we do not find a statistically significant difference in transfers across the threshold. The first-stage results are also similar when we use the longitudinal data available from California to consider treatment provided at both the hospital of birth and any care provided in a subsequent hospital following a transfer (Online Appendix Table A6). It can also be argued that if treatment is effective at reducing mortality, newborns just below 1,500 g will receive more medical treatment than newborns just above 1,500 g because they are 37. Our sample sizes vary somewhat when looking at charges variables in levels or in logs due to observations with missing or zero charges. Graphing the mean probability that charges are missing or zero across 1,500 g does not reveal a discontinuous change across this threshold. 38. We define our length of stay variable so that the smallest value is 1—a value of 2 indicates that the stay continued beyond the first day, and so forth. This definition allows us to include observations in our log length of stay variable that are less than one full day.
more likely to be alive. Such treatment is unlikely to drive the first-stage results, however, as it is provided to only 1% of newborns below the cutoff, who appear to have longer lives due to their VLBW status (as in Figure II). Nevertheless, any additional care provided to these newborns is part of the total cost of treatment. Our two-sample instrumental variable estimate of the returns to care discussed in Section VIII.B takes into account these additional costs. To the extent that some of this additional care does not contribute to an improvement in mortality, then our estimate will attribute the reduction in mortality to both care that is effective and care that is ineffective. This will lead to estimated returns that are smaller than they otherwise would be if the ineffective care were excluded. Taken together, these results show differences in summary treatment measures of approximately 10%–15% with some variation in the estimate depending on the treatment measure. In terms of charges, the difference across the discontinuity is approximately $9,000. V.D. Mechanisms: Differences in Types of Care The hospital discharge data include procedure codes that can be used to investigate the types of care that differ for infants on either side of the VLBW threshold. We explore the data for such differences, with a special focus on common perinatal procedures.39 As in the mortality analysis in the smaller five-state sample, however, such differences are difficult to find. Often, for the same procedures, statistically significant regression results do not correspond to convincing graphical results, and convincing graphical results do not correspond to convincing regression results. Table III and Figure IV present regression and graphical results for four common types of treatment. One of the most common procedures is some form of ventilation.40 Although Table III provides some evidence of a 39. Specifically, we searched for differences in procedures used to define NICU quality levels in California (Phibbs et al. 2007), as well as five categories of procedures that were among the top 25 most common primary and secondary procedures in our data: injection of medicines, excision of tissue, repair of hernia, and two additional diagnostic procedures. 40. We observe several measures of assisted ventilation, but found little support for any discontinuous change in any of the measures. Some oxygen may be provided before birth weight is measured, although to the best of our knowledge we are not able to separate this from ventilation provided after birth weight is measured in our data. As noted above, the nationwide data include some
[Figure IV: Panel A, any ventilation; Panel B, NICU > 24 hours; Panel C, diagnostic ultrasound; Panel D, operations on the heart. Horizontal axis: birth weight (g), 1,350–1,650; vertical axis: share of newborns receiving each treatment.]
FIGURE IV
Specific Treatment Measures around 1,500 g
Data are all births in the five-state sample (AZ, CA, MD, NY, and NJ), as described in the text. Points represent gram-equivalents of ounce intervals, with births grouped into one-ounce bins radiating from 1,500 g; the estimates are plotted at the median birth weight in each bin.
discontinuous increase in ventilation for VLBW infants, Figure IVA does not offer compelling evidence of a meaningful difference. Another common measure of resource utilization that we observe in our summary treatment measures is admission to a neonatal intensive care unit (NICU). Because care provided in such units is costly, it seems plausible that the threshold could be used to gain entry into such a unit. However, our data reveal little difference on this margin. First, we examine the California data, which includes a variable on whether or not the infant spent at least 24 hours in a NICU or died in the NICU in less than 24 hours. We include newborns born in hospitals that did not have a NICU for comparability to our main results, which also include such newborns. Although Table III suggests a modest increase in this NICU use measure (approximately 3 percentage points as compared to a mean just above the threshold of
ventilation measures, but we also find little evidence of an increase in ventilation among VLBW newborns in those data.
TABLE III
SPECIFIC TREATMENT MEASURES BY VERY-LOW-BIRTH-WEIGHT STATUS: FIVE-STATE SAMPLE, 1991–2006

Ventilation (various methods)
                                            Local linear model   OLS                          OLS
Birth weight < 1,500 g                      0.0357 (0.0125)**    0.0380 (0.0115)** [0.0263]   0.0274 (0.0112)* [0.0191]
Controls                                                         No                           Yes
Mean of dependent variable above cutoff     0.511
Observations                                30,935

California: >24 hours in NICU
                                            Local linear model   OLS                          OLS
Birth weight < 1,500 g                      0.0372 (0.0170)*     0.0282 (0.0157) [0.0214]     0.0265 (0.0156) [0.0204]
Controls                                                         No                           Yes
Mean of dependent variable above cutoff     0.444
Observations                                16,528

Diagnostic ultrasound of infant
                                            Local linear model   OLS                          OLS
Birth weight < 1,500 g                      0.0196 (0.0109)      0.0166 (0.0101) [0.0128]     0.0297 (0.0095)** [0.0125]*
Controls                                                         No                           Yes
Mean of dependent variable above cutoff     0.260
Observations                                30,935

Operations on the heart
                                            Local linear model   OLS                          OLS
Birth weight < 1,500 g                      0.0147 (0.0112)      0.0155 (0.0104) [0.0338]     0.0236 (0.0100)* [0.0146]
Controls                                                         No                           Yes
Mean of dependent variable above cutoff     0.244
Observations                                30,935

Notes. Local linear regressions use a bandwidth of three ounces (85 g). OLS models are estimated on a sample within three ounces above and below the VLBW threshold, and include linear trends in grams (coefficients not reported). Five states include AZ, CA, MD, NY, and NJ (various years). "Main controls" are listed in Online Appendix Table A5, as well as indicators for each year. The dependent variable in the NICU models is only available in our California data, and equals one if the infant spent more than 24 hours in a NICU or died in the NICU at less than 24 hours. Local linear models report asymptotic standard errors. OLS models report heteroscedastic-robust standard errors in parentheses and standard errors clustered at the gram level in brackets. * Significant at 5%; ** significant at 1%.
44 percentage points), Figure IVB shows little evidence of a discontinuous change. Second, we examine the Maryland HCUP data, which record the number of days in a NICU, but again we find little evidence of a difference at the threshold.41 Our results are consistent with a study of NICU referrals (Zupancic and Richardson 1998), in which VLBW was not listed among the common reasons for triage to a NICU. We find some weak evidence of differences for two relatively common procedures: diagnostic ultrasound of the infant and operations on the heart. As noted above, diagnostic ultrasounds are used to check for bleeding or swelling of the brain, and some physician manuals cite 1,500 g as a threshold below which diagnostic ultrasounds are suggested. Figure IVC suggests a jump in ultrasounds of roughly two percentage points compared to a mean of approximately 25%. Table III suggests estimates of similar size, although only the OLS estimates with controls are statistically significant at conventional levels. The pattern of the “operations on the heart” indicator in Figure IVD shows an upward pretrend in the procedures prior to the threshold and what appears to be a discontinuous drop after the threshold.42 Table III suggests that the jump is between 1.5 and 2.4 percentage points, or roughly 8% higher than the mean rate for those born above the threshold in this sample, although again only the OLS estimates with controls are statistically significant at conventional levels. In summary, we examine several possible treatment mechanisms at the discontinuity. We find some weak evidence of differences for operations on the heart and diagnostic ultrasounds, for which we estimate an approximate 10% increase in usage just prior to the VLBW threshold.43 These differences are often not 41. The New Jersey HCUP data include a field for NICU charges, but this variable proves unreliable: the fraction of newborns with nonmissing NICU charges for this at-risk population is only 2%. Recent nationwide birth certificate data include an indicator for NICU admission for a handful of states. We do not see a visible discontinuity in these data, albeit potentially due to the relatively small sample of births in the years for which we observe this variable. 42. LBW is associated with failure of the ductus arteriosis to close, in which case surgery may be necessary. Investigating the surgical code for this particular procedure on its own as used in Phibbs et al. (2007) suggested a low mean (4.4 of 1,000 births) and no visible jump. 43. Although these differences are at best suggestive, it is worth noting that our best estimate based on limited pricing data is that these differences would not account for the majority of the measured difference in total charges. On the basis of 2007 Medi-Cal rates, we estimate that the charge for a diagnostic ultrasound is relatively inexpensive (approximately $450) and various heart operations range
statistically significant, and would be even less so if the standard errors were corrected with a Bonferroni correction to account for search across procedures. We find little evidence of differences in NICU usage or other common procedures such as ventilation. In the end, differences in our summary measures are consistent with medical care driving the mortality results, but we likely lack the statistical power to detect differences in particular procedures in our five-state sample (as evidenced by relatively noisy procedure rates across birth weight bins). VI. ROBUSTNESS AND SPECIFICATION CHECKS In this section, we test for evidence of differences in covariates across our VLBW threshold (Section VI.A), discuss the sensitivity of our results to alternative bandwidths (Section VI.B), examine our mortality results by cause of death (Section VI.C), and discuss evidence of discontinuities at alternative birth weight and gestational age thresholds (Sections VI.D and VI.E). VI.A. Testing for Evidence of Differences in Covariates across 1,500 Grams As discussed above, it is thought that birth weight cannot be predicted in advance of birth with the accuracy needed to change (via birth timing) the classification of a newborn from being just above 1,500 g to being just below 1,500 g. Moreover, as discussed in Section V.A, most forms of strategic recategorization of newborns based on birth weight around 1,500 g should be detectable in our histograms of birth frequencies by gram birth weight. As such, we expect that the newborns will be similar above and below the threshold in both observable and unobservable characteristics. That said, it is still of interest to directly compare births on either side of our threshold based on observable characteristics. Online Appendix Table A2 compares means of observable characteristics above and below the threshold, controlling for from $200 to $2,200. Without more systematic data on prices, it is difficult to pin down an accurate estimate of what share of charges these two procedures could account for, but they do not appear to be able to explain most of our measured difference in charges. Another approach controlled for common procedures in our charges regression. With their inclusion in the model, the estimated difference in hospital charges falls from our main estimate of $9,000 to $5,100. That is, the procedures appear to explain some, but not all of the effect. Length of stay is our other summary measure of treatment, although we find that charges are higher for VLBW newborns even when controlling for length of stay (by $2,184 (s.e. = 1,587)).
linear trends in grams from the threshold as in the main analysis. The table also includes a summary measure—the predicted mortality rate from a probit model of mortality on all of the controls (specifically, the newborn characteristics Xi described above, together with year indicators). Most of the comparisons show similar levels across the threshold, with few that appear to be meaningfully different. Given the large sample size, however, some of the differences are statistically significant. To further consider these differences, Figure V compares covariates of interest in the 5 ounces around the VLBW threshold.44 Here, the comparisons appear even more stable across the threshold. In particular, gestational age—which is particularly related to birth weight and shows a statistically significant difference in Online Appendix Table A2—is generally smooth through the threshold. Similarly, Figure VJ, which is on the same scale as actual mortality in Figure II, suggests little difference in predicted mortality across the threshold. It thus appears that newborns are nearly identical based on observable variables regardless of whether they weigh just below or just above the VLBW threshold.45 VI.B. Bandwidth Sensitivity The local-linear regression results are qualitatively similar for a wide range of bandwidths (see Online Appendix Table A3). The magnitude of the mortality estimates decreases with the bandwidth, suggesting that our relatively large bandwidth is conservative. When the bandwidth includes only one ounce on either side of the threshold (h = 30 g), the difference in one-year mortality is −2.7 percentage points; when h = 150 g, the estimate decreases to −0.8 percentage points, which is similar to our main estimate at a bandwidth of 85 g. In fact, we find qualitatively similar results for bandwidths as large as 700 g. In terms of the treatment measures in the five-state sample, the discontinuity in hospital charges is largest in magnitude for our 44. The list was selected for ease of presentation and includes the major covariates of interest. Similar results were found for additional covariates as well. 45. We also investigate the possibility that newborns in our data reported as exactly 3 pounds 5 ounces (1,503 g) were treated as VLBW newborns and only appear above the threshold in our data due to rounding when the birth weight was reported to Vital Statistics. Although we prefer not to exclude the information here (the two-sample IV estimates should correct for the misclassification), when we exclude newborns at 1,503 g, we find a larger discontinuity in one-year mortality (−0.011, s.e. = 0.0025) and continue to find a meaningful discontinuity in charges ($5,600, s.e. = 2,400).
[Figure V: Panel A, gestational age; Panel B, mother's age; Panel C, mother's education: less than high school; Panel D, mother's race: white; Panel E, singleton birth; Panel F, vaginal delivery; Panel G, fewer than 7 prenatal visits; Panel H, tocolysis; Panel I, year of birth; Panel J, predicted one-year mortality. Horizontal axis in each panel: birth weight (g), 1,350–1,650.]
FIGURE V
Covariates around 1,500 g
NCHS birth cohort–linked birth/infant death files, 1983–1991 and 1995–2003, as described in the text. Points represent gram-equivalents of ounce intervals, with births grouped into one-ounce bins radiating from 1,500 g; the estimates are plotted at the median birth weight in each bin.
benchmark bandwidth, although qualitatively similar across the range from h = 30 g to h = 150 g. VI.C. Causes of Death If our mortality effect were driven by so-called “external” causes of death (such as accidents), this would be of concern, because it would be difficult to link deaths from those causes to differences in medical inputs. Reassuringly, we find no statistically significant change in external deaths across our cutoffs of interest (see Online Appendix Table A8). Examination of our mortality results by cause of death may also be of interest from a policy perspective. When we group causes of death into broad, mutually exclusive categories, we find (see Online Appendix Table A8) effects of the largest magnitude for perinatal conditions (such as jaundice and respiratory distress syndrome), as well as for nervous system and sense organ disorders—the latter of which is a statistically significant effect at conventional levels. We also examine a few individual causes of death, and find a modestly statistically significant reduction in deaths due to jaundice for VLBW infants.46 These results support the notion that differences in care received in the hospital are likely driving our mortality results. VI.D. Alternative Birth Weight Thresholds A main limitation of our analysis is that the returns are estimated at a particular point in the birth weight distribution. For two reasons, we also examine other points in the birth weight distribution. First, other discontinuities could provide an opportunity to trace out marginal returns for wider portions of the overall birth weight distribution. Second, at points in the distribution where we do not anticipate treatment differences, economically and statistically significant jumps of magnitudes similar to our VLBW treatment effects could suggest that the discontinuity we observe at 1,500 g may be due to natural variation in treatment and mortality in our data. As noted in Section II.B, discussions with physicians and readings of the medical literature suggest that other cutoffs may be relevant. To investigate other potential thresholds, we 46. Jaundice is a common neonatal problem that should be detected during the initial hospital stay for newborns in our bandwidth. According to Behrman, Kliegman, and Jenson (2000, p. 513), “Jaundice is observed during the first week of life in approximately 60% of term infants and 80% of preterm infants.”
estimate differences in mortality and hospital charges for each 100-g interval between 1,000 and 3,000 g. We use local linear regression estimates because they are less sensitive to observations far from the thresholds, and we use our pilot bandwidth of 3 ounces for comparability. In terms of the mortality differences, the largest difference in mortality compared to the mean at the cutoff is found at 1,500 g (23%), other than one found at 1,800 g (27%).47 A 5% reduction in mortality (relative to the mean) is found at 1,000 g and a 16% reduction in mortality is found at 2,500 g, but graphs do not reveal convincing discontinuities in mortality at these or other cutoffs. When we consider hospital charges, again 1,500 g stands out with a relatively large discontinuity, especially compared to discontinuities at birth weights between 1,100 and 2,500 g. A 12% increase in charges (relative to the mean) is found for newborns classified as extremely low birth weight (1,000 g), with similarly large differences for the 800- and 900-g thresholds. However, differences at and below 1,000 g are not robust to alternative specifications (such as the transformation of charges by the natural logarithm), possibly because there are fewer newborns to study at these lower thresholds and the spending levels are thus particularly susceptible to outliers given the large charge amounts. In summary, we find striking discontinuities in treatment and mortality at the VLBW threshold, but less convincing differences at other points of the distribution.48 These results support the validity of our main findings, but they do not allow us to trace out marginal returns across the distribution.
VI.E. Gestational Age Thresholds
As motivated by the discussion in Section II.B, as an alternative to birth weight thresholds, we also examine heterogeneity in outcomes and treatment by gestational age across the 37-week threshold. In graphical analyses using the nationwide sample, measures of average mortality by gestational week appear smooth
across the 37-week threshold.49 Corresponding regression results yield statistically significant coefficients of the expected sign, but we do not emphasize them here, given the lack of a visibly discernable discontinuity in the graphical analysis.50 We also investigated the interaction between birth weight and gestational age through the “small for gestational age” (SGA) classification: newborns below the tenth percentile of birth weight for a given gestational age. Conversations with physicians suggest that doctors use SGA charts such as that established by Fenton (2003), updating the previous work of Babson and Benda (1976). On this chart, 2,500 g is almost exactly the tenth percentile of birth weight for a gestational age of 37 weeks. If physicians treat based on SGA cutoffs, we expect discontinuities in outcomes and treatment at 2,500 g to be most pronounced exactly at 37 weeks and less pronounced at other values of gestational weeks, although we are agnostic about the pattern of decline. In regression results (not shown) we do find evidence consistent with treatment being based on SGA around 2,500 g. For 1,500 g, analogous results are not clearly consistent with treatment based on the Fenton (2003) definition of SGA around 1,500 g.51 VII. VARIATION IN TREATMENT EFFECTS ACROSS HOSPITAL TYPES Our regression discontinuity design allows us to assess potential heterogeneity in outcomes and treatment across hospitals.52 49. Similarly, in graphical analyses using the California data, which report gestation in days, measures of average mortality, charges, and length of stay by gestational day appear smooth across this threshold. 50. Specifically, the coefficient on an indicator variable for “below 37 gestational weeks” is −0.00070 (s.e. = 0.0001277) in a specification that includes linear trends, run on an estimation sample of 21,562,532 observations within a 3-week bandwidth around 37 weeks. Mean mortality above the threshold is 0.0032. To address the concern that discontinuities could be obscured in cases where gestational age can be manipulated, we also estimate a specification that includes only vaginal births that are not induced or stimulated and find similar results. 51. Specifically, if we run separate specifications for each value of gestational weeks, we estimate a coefficient of −.0025 (s.e. = .0009) in the 37-week specification, and the coefficient declines in magnitude in specifications that move away from 37 weeks in both directions (at 35 weeks: −.0002 (s.e. = .0012), at 39 weeks: −.0007 (s.e. = .0009)). These coefficients are not directly comparable to our main estimates because they allow separate trends by gestational week. In the Fenton (2003) chart, 1,500 g is considered SGA for newborns with between 32 and 33 gestational weeks, whereas we find that discontinuities in mortality around 1,500 g are most pronounced at 29 weeks and decrease on either side of 29 weeks. 52. We also examined how our estimated treatment effects vary over time and across subgroups of newborns (results not shown). The trends over time are not consistent with any clear medical technology story of which we are aware (see Online Appendix Table A7), such as a “surfactant effect.” The more recent birth
In contexts without a regression discontinuity, an estimated relationship between hospital quality and newborn health could be biased: on one hand, a positive correlation could arise if healthier mothers choose to give births at better hospitals; on the other hand, a negative correlation could arise if riskier mothers choose to give birth at better hospitals, knowing that their infants will need more care than an average newborn. However, as discussed above, because birth weight should not be predictable in advance of birth with the accuracy needed to move a birth from just above to just below our 1,500-g threshold of interest, selection should not be differential across our discontinuity—implying that we can calculate internally valid estimates for different types of hospitals and consider how the quality of the hospital affects the results. One natural grouping of hospitals, given our population under study, is the level of neonatal care available in an infant’s hospital of birth. For our California data, classifications of neonatal care availability by hospital by year are available during our time period due to analysis by Phibbs et al. (2007).53 In the sample of newborns within our bandwidth, 10% of births occur at hospitals with no NICU, just over 12% at hospitals with a Level 0–2 NICU, and the remainder at hospitals with Level 3A–3D NICUs.54 Although we can examine our reduced-form estimates by NICU quality level, it is worth noting that we expect to lack sufficient sample size within these NICU quality-level subsamples to give precise estimates of these effects for our one-year mortality outcome. Perhaps unsurprisingly, regression estimates that interact with our regression discontinuity variable as well as our
certificate data referenced above include an indicator for the use of artificial surfactant which we can use to test directly for this type of effect, and we do not see a visible discontinuity in this variable—again potentially due to the small sample of births. In examining our mortality effects by subgroups, we find statistically significant differences for less educated mothers; newborns with missing father’s information (a proxy for single parenthood in our data, which otherwise lacks a stable marital status indicator); single births (where LBW may point to greater developmental problems); and male patients (who are known to be more vulnerable). The first-stage estimates by subgroup exhibit similar differences, with a larger first stage for male newborns and singleton births. 53. We are grateful to Christopher Afendulis and Ciaran Phibbs for sharing these data with us. Phibbs et al. (2007) used the same California data we study to identify the quality level of NICUs (Levels 1 to 3D) by hospital by year, in part based on NICU quality definitions from the American Academy of Pediatrics (definitions that in turn are primarily based on whether hospitals offer specific types of procedures, such as specific types of ventilation and surgery). 54. Because of the small number of births observed in Level 0 or Level 1 NICUs, we create a combined category for births in Level 0, 1, and 2 NICU hospitals.
linear birth weight trends with indicators for the NICU quality level available in a newborn's hospital of birth generally do not give statistically significant estimates for our one-year mortality outcome, with the exception of Level 0/1/2 NICU hospitals—for which we estimate a negative, statistically significant coefficient. Using charges as a first-stage outcome in the same regression framework, we estimate economically and statistically significant positive coefficients for non-NICU hospitals as well as Level 0/1/2 and Level 3B hospitals; estimates for the other hospital types are not statistically significant. We can only offer a cautious interpretation of these results, given that many of our estimates are not statistically significant at conventional levels. That said, Figure VI provides one descriptive analysis—plotting first-stage estimates by hospital type against reduced-form estimates by hospital type, normalizing each coefficient by the mean outcome for newborns above 1,500 g within our bandwidth for that type of hospital. Hospitals with larger first-stage estimates have larger reduced-form estimates, providing further evidence that treatment differences are driving the outcome differences. In addition, this analysis provides suggestive evidence that the non-NICU and Level 0/1/2 NICU hospitals are the hospitals where our estimated effects are largest.55 VIII. ESTIMATING RETURNS TO MEDICAL SPENDING In this section, for comparability to the existing literature, we present a time-series estimate of the returns to large changes in spending over time for newborns in our bandwidth (Section VIII.A). We then combine our first-stage and reduced-form estimates to derive two-sample estimates of the marginal returns to 55. Similarly, we find larger first-stage results when we consider hospitals in the five-state sample that had no NICU compared to hospitals that have a NICU—a wider set of states that does not allow an investigation by NICU quality level. We also considered hospital size (calculated using the number of births in a hospital-year in our sample). Larger hospitals had higher levels of charges, so we compared log charges across quartiles in our hospital size variable. The bulk of the data are in the larger hospitals, and we find treatment differences in the second, third, and top quartiles (the bottom quartile contained fewer births (n = 2,110) and was less precisely estimated). Another way to consider treatment intensity in the nationwide data is to compare states that have higher end-of-life spending levels according to the Dartmouth Atlas of Health Care, a resource that considers Medicare spending. When the 1996 state rankings are used (the earliest year available, although the rankings are remarkably stable over the years 1996–2005), the mortality effects are found in the bottom two and top two quintiles, suggesting that the results are fairly robust across different types of hospital systems that vary by spending levels.
[Figure VI is a scatter plot: for each hospital group (no NICU, Level 0/1/2 NICU, and Level 3A–3D NICUs), the hospital charges coefficient divided by the mean above the threshold (horizontal axis, 2006$) is plotted against the one-year mortality coefficient divided by the mean above the threshold (vertical axis).]
FIGURE VI First Stage versus Reduced Form, by NICU Quality Level Plot of first-stage coefficients (for 2006 charges, in levels) and reduced-form coefficients (for one-year mortality) by NICU level in our California data. See text for details on the NICU classifications.
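As an illustration of the interacted specification described in the text, the sketch below runs the reduced-form regression on synthetic stand-in data; the variable names, bandwidth, and data-generating process are our own assumptions, not the authors' code or estimates.

```python
# A hedged sketch of an RD specification by NICU level: an indicator for
# birth weight below 1,500 g and linear birth weight trends on each side,
# fully interacted with NICU-level indicators. All names and values here
# are hypothetical; the actual analysis uses the California birth records.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "birth_weight": rng.integers(1400, 1601, n),               # illustrative bandwidth
    "nicu_level": rng.choice(["none", "0/1/2", "3A", "3B", "3C", "3D"], n),
})
df["below"] = (df.birth_weight < 1500).astype(int)
df["dist"] = df.birth_weight - 1500
df["one_year_mortality"] = rng.binomial(1, 0.05 - 0.01 * df.below)  # fake outcome

# Reduced form; the first stage would use hospital charges (or costs) as the
# outcome in the same specification. Dividing each group's first-stage and
# reduced-form coefficients by that group's mean outcome above the threshold
# gives the kind of points plotted in Figure VI.
model = smf.ols("one_year_mortality ~ C(nicu_level) * (below + dist + below:dist)",
                data=df).fit()
print(model.params.filter(like="below"))
```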
medical spending for newborns near 1,500 g (Section VIII.B). As noted in Section III.A, all of our spending figures in this section are hospital costs (that is, hospital charges deflated by a cost-to-charge ratio) because costs most closely approximate the true social costs of resource use. VIII.A. Comparison to Time-Series Estimates of Returns to Medical Spending As one benchmark, we can compare our marginal return estimate to the type of return estimate calculated by Cutler and Meara (2000). The spirit of the Cutler–Meara calculation is to assume that within–birth weight changes in survival over time are primarily due to improvements in medical technologies in the immediate postnatal period (Williams and Chen 1982; Paneth 1995), and thus to value medical improvements by looking at changes over time in within–birth weight expenditures and health outcomes. We undertake this calculation in our California data as a "long difference" in costs (in 2006 dollars) and one-year mortality from 1991 to 2002. Within our bandwidth, we estimate a $30,000 increase in costs and a 0.0295 decline in one-year mortality over this period, which implies a cost per newborn life under
the Cutler–Meara assumptions of approximately $1 million. By this metric, as we will see below, our marginal return estimates appear to be similar to or slightly more cost-effective than time-series returns to large changes in spending for newborns in our bandwidth. VIII.B. Two-Sample Estimates of Marginal Returns to Medical Spending As discussed in Section IV.A, we can combine our results to produce two-sample estimates of the effect of treatments on health outcomes around the VLBW threshold. To do so, we need to invoke the exclusion restriction that the VLBW designation only affects mortality through treatments captured by our treatment measure—an assumption that is most plausible for costs, our best available summary treatment measure. Because we examine health outcomes and summary treatments in different data sources, additional assumptions are required to combine our estimates. To be conservative, we can focus on mortality and cost estimates based solely on states in the five-state sample. We obtain the one-year mortality estimate on the nationwide data, restricted to newborns in the five-state sample in available years and standardize covariates across the two samples.56 If we had the exact same newborns in the two samples, our two-sample estimate would be identical to a one-sample estimate on the complete data.57 Coefficients are shown in the last column of Online Appendix Table A7, where $4,553 in additional costs are associated with a 0.74-percentage point reduction in mortality. If we are willing to assume that cost differences in the five-state sample in the available years (1991–2006) are broadly representative of what we would observe in the full national sample in available years (1983–2002), we can compare our main results: a difference of $3,795 in costs and a one-year mortality reduction of 0.72 percentage points as birth weight approaches the VLBW threshold from above. Equivalently, we can compute a measure of dollars per newborn life saved. In such a calculation, the numerator is our hospital costs estimate: $3,795 for each VLBW newborn in the full 56. Specifically, we restrict the national data to the five states in the years 1991 and 1995–2002. Also, for comparability with the five-state sample, we restrict the national sample to contain only in-hospital births. 57. Because we do not have individual-level identifiers, we cannot restrict the national sample to contain the exact same newborns as the five-state sample, but the agreement is very good. The restricted national sample contains 23,698 infants, and the five-state sample contains 21,479.
five-state sample. The denominator is our mortality estimate: a 0.72-percentage point reduction in mortality among VLBW newborns in the full sample. These estimates imply that the cost per newborn life saved is $527,083 ($3,795/0.0072). In the five-state sample over the years that overlap with the nationwide data, we obtain a slightly higher estimate of costs per newborn life saved of $615,270 ($4,553/0.0074). Following Inoue and Solon (2005), we calculate an asymptotic 95% confidence interval on this estimate of approximately $30,000 to $1.20 million. Note that this confidence interval for the estimate from the restricted sample is conservative relative to the analogous confidence interval for the more precise estimate we obtain from the full samples: $30,000 to $1.05 million. We can compare these estimates of the cost per newborn life saved to a variety of potential benchmarks. Using data on disabilities and life expectancy, Cutler and Meara (2000) calculate a quality-adjusted value of a newborn life for newborns born in 1990 near 1,500 g to be approximately $2.7 million. If we take the less conservative view that newborns who are saved do not experience decreases in lifespan or quality of life, the relevant benchmark is approximately $3 million to $7 million (Cutler 2004). Comparison with this benchmark suggests that the treatments that we observe are very cost-effective. IX. CONCLUSION Medical inputs can vary discontinuously across plausibly smooth measures of health risk—in our case, birth weight—inviting evaluation using a regression discontinuity design. The treatment threshold and estimated effects are relevant to a "marginally untreated" subpopulation. The relatively frequent use of clinical triage criteria (as discussed in Section I) and availability of micro-level data on health treatments and health outcomes imply that this type of regression discontinuity analysis may be fruitfully applied to a number of other contexts. This approach offers a useful complement to conventional approaches to estimating the returns to medical expenditures—which have generally focused on time-series, cross-sectional, or panel variation in medical treatments and health outcomes—yet has not been widely applied in either the economics literature or the health services literature to date (Zuckerman et al. 2005; Linden, Adams, and Roberts 2006).
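For readers who want to retrace the arithmetic, the cost-per-life figures in Section VIII follow directly from the point estimates quoted above; the snippet below simply reproduces the divisions (only the rounding is ours).

```python
# Cost per newborn life saved, from the estimates reported in the text.
long_difference = 30_000 / 0.0295   # Cutler-Meara style long difference: ~ $1.0 million
full_samples = 3_795 / 0.0072       # two-sample estimate, full samples: ~ $527,000
restricted = 4_553 / 0.0074         # five-state years overlapping the national data: ~ $615,000
print(round(long_difference), round(full_samples), round(restricted))
```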
In the universe of all births in the United States over twenty years, we estimate that newborns weighing just below 1,500 g have substantially lower mortality rates than newborns who weigh just over 1,500 g, despite a general decline in health associated with lower birth weight. Specifically, one-year mortality falls by approximately one percentage point as birth weight crosses 1,500 g from above, which is large relative to mean one-year mortality of 5.5% just above 1,500 g. Robustness tests suggest some variation around this point estimate, but we generally find a reduction in mortality of close to 0.7 percentage points for newborns just below the threshold. It appears that infants categorized as VLBW have a lower mortality rate because they receive additional treatment. Using all births from five states that report treatment measures and birth weight—states that have a mortality discontinuity similar to that for the nationwide sample—we find that treatment differences are on the order of $9,500 in hospital charges, or $4,000 when these charges are converted into costs. Although these costs may not represent social costs for such care—the nurses, physicians, and capital expenditures may not be affected by the births of a small number of VLBW infants—they represent our best summary measurement of the difference in treatment that the VLBW classification affords. Taken together, our estimates suggest that the cost of saving a statistical life for newborns near 1,500 g is on the order of $550,000, with an upper bound of approximately $1.2 million in 2006 dollars. Although the cost measures may not fully capture the additional care provided to VLBW newborns, the magnitude of the cost-effectiveness estimates suggests that returns to medical care are large for this group. COLUMBIA UNIVERSITY AND NATIONAL BUREAU OF ECONOMIC RESEARCH MIT AND NATIONAL BUREAU OF ECONOMIC RESEARCH YALE UNIVERSITY AND NATIONAL BUREAU OF ECONOMIC RESEARCH HARVARD UNIVERSITY
REFERENCES Almond, Douglas, and Joseph Doyle, “After Midnight: A Regression Discontinuity Design in Length of Postpartum Hospital Stays,” NBER Working Paper No. 13877, 2008. Andre, Malin, Lars Borgquist, Mats Foldevi, and Sigvard Molstad, “Asking for ‘Rules of Thumb’: A Way to Discover Tacit Knowledge in Medical Practice,” Family Medicine, 19 (2002), 617–622.
Angert, Robert, and Henry Adam, “Care of the Very Low-Birthweight Infant,” Pediatrics Review, 30 (2009), 1–32. Anspach, Renee, Deciding Who Lives: Fateful Choices in the Intensive-Care Nursery (Berkeley, CA: University of California Press, 1993). Babson, S. Gorham, and Gerda Benda, “Growth Graphs for the Clinical Assessment of Infants of Varying Gestational Age,” Journal of Pediatrics, 89 (1976), 814–820. Baicker, Katherine, and Amitabh Chandra, “Medicare Spending, the Physician Workforce, and the Quality of Care Received by Medicare Beneficiaries,” Health Affairs, W4 (2004), 184–197. Behrman, Richard, Robert Kliegman, and Hal Jenson, Nelson Textbook of Pediatrics, 16th ed. (Philadelphia, PA: W.B Saunders Company, 2000). Card, David, and David Lee, “Regression Discontinuity Inference with Specification Error,” Journal of Econometrics, 142 (2008), 655–674. Cheng, Ming-Yen, Jianqing Fan, and J.S. Marron, “On Automatic Boundary Corrections,” Annals of Statistics, 25 (1997), 1691–1708. Cloherty, John, and Ann Stark, Manual of Neonatal Care: Joint Program in Neonatology (Harvard Medical School, Beth Israel Deaconess Medical Center, Brigham and Women’s Hospital, Children’s Hospital Boston), 4th ed. (Philadelphia, PA: Lippincott-Raven, 1998). Cutler, David, Your Money or Your Life: Strong Medicine for America’s Health Care System (New York: Oxford University Press, 2004). Cutler, David, and Mark McClellan, “Is Technological Change in Medicine Worth It?” Health Affairs, 20 (2001), 11–29. Cutler, David, Mark McClellan, Joseph Newhouse, and Dahlia Remler, “Are Medical Prices Declining? Evidence for Heart Attack Treatments,” Quarterly Journal of Economics, 113 (1998), 991–1024. Cutler, David, and Ellen Meara, “The Technology of Birth: Is It Worth It?” Frontiers in Health Policy Research, 3 (2000), 33–68. Cutler, David, Allison Rosen, and Sandeep Vijan, “The Value of Medical Spending in the United States, 1960–2000,” New England Journal of Medicine, 355 (2006), 920–927. Enthoven, Alain, Health Plan: The Only Practical Solution to the Soaring Cost of Medical Care (Reading, MA: Addison Wesley, 1980). Fenton, Tanis, “A New Growth Chart for Preterm Babies: Babson and Benda’s Chart Updated with Recent Data and a New Format,” BMC Pediatrics, 3 (2003), 13. Fisher, Elliott, John Wennberg, Therese Stukel, and Sandra Sharp, “Hospital Readmission Rates for Cohorts of Medicare Beneficiaries in Boston and New Haven,” New England Journal of Medicine, 331 (1994), 989–995. Frank, Richard, and Richard Zeckhauser, “Custom-Made versus Ready-to-Wear Treatments: Behavioral Propensities in Physicians’ Choices,” Journal of Health Economics, 26 (2007), 1101–1127. Fuchs, Victor, “More Variation in Use of Care, More Flat-of-the-Curve Medicine,” Health Affairs, 23 (2004), 104–107. Goodman, David, Elliott Fisher, George Little, Therese Stukel, Chiang Hua Chang, and Kenneth Schoendorf, “The Relation between the Availability of Neonatal Intensive Care and Neonatal Mortality,” New England Journal of Medicine, 346 (2002), 1538–1544. Grumbach, Kevin, “Specialists, Technology, and Newborns—Too Much of a Good Thing?” New England Journal of Medicine, 346 (2002), 1574–1575. Horbar, Jeffrey, Gary Badger, Joseph Carpenter, Avroy Fanaroff, Sarah Kilpatrick, Meena LaCorte, Roderic Phibbs, and Roger Soll, “Trends in Mortality and Morbidity for Very Low Birth Weight Infants, 1991–1999,” Pediatrics, 110 (2002), 143–151. 
Imbens, Guido, and Thomas Lemieux, “Regression Discontinuity Designs: A Guide to Practice,” Journal of Econometrics, 142 (2008), 615–635. Inoue, Atsushi, and Gary Solon, “Two-Sample Instrumental Variables Estimators,” NBER Technical Working Paper No. 311, (2005). Kessler, Daniel, and Mark McClellan, “Do Doctors Practice Defensive Medicine?” Quarterly Journal of Economics, 111 (1996), 353–390.
Lee, David, and Thomas Lemieux, “Regression Discontinuity Designs in Economics,” Journal of Economic Literature, forthcoming. Lichtig, Leo, Robert Knauf, Albert Bartoletti, Lynn-Marie Wozniak, Robert Gregg, John Muldoon, and William Ellis, “Revising Diagnosis-Related Groups for Neonates,” Pediatrics, 84 (1989), 49–61. Linden, Ariel, John Adams, and Nancy Roberts, “Evaluating Disease Management Programme Effectiveness: An Introduction to the Regression Discontinuity Design,” Journal of Evaluation in Clinical Practice, 12 (2006), 124–131. Luce, Brian, Josephine Mauskopf, Frank Sloan, Jan Ostermann, and L. Clark Paramore, “The Return on Investment in Health Care: From 1980 to 2000,” Value in Health, 9 (2006), 146–156. Ludwig, Jens, and Douglas L. Miller, “Does Head Start Improve Children’s Life Chances? Evidence from a Regression Discontinuity Design,” Quarterly Journal of Economics, 122 (2007), 159–208. McClellan, Mark, “The Marginal Cost-Effectiveness of Medical Technology: A Panel Instrumental-Variables Approach,” Journal of Econometrics, 77 (1997), 39–64. McCrary, Justin, “Manipulation of the Running Variable in the Regression Discontinuity Design: A Density Test,” Journal of Econometrics, 142 (2008), 698–714. McDonald, Clement, “Medical Heuristics: The Silent Adjudicators of Clinical Practice,” Annals of Internal Medicine, 124 (1996), 56–62. Murphy, Kevin, and Robert Topel, “The Economic Value of Medical Research,” in Measuring the Gains from Medical Research, Kevin M. Murphy and Robert H. Topel, eds. (Chicago: University of Chicago Press, 2003). Nordhaus, William, “The Health of Nations: The Contribution of Improved Health to Living Standards,” NBER Working Paper No. 8818, 2002. O’Connor, Gerald, Hebe Quinton, Neal Traven, Lawrence Ramunno, Andrew Dodds, Thomas Marciniak, and John Wennberg, “Geographic Variation in the Treatment of Acute Myocardial Infarction: The Cooperative Cardiovascular Project,” Journal of the American Medical Association, 281 (1999), 627–633. Paneth, Nigel, “The Problem of Low Birth Weight,” Future of Children, 5 (1995), 19–34. Phibbs, Ciaran, Laurence Baker, Aaron Caughey, Beate Danielson, Susan Schmitt, and Roderic Phibbs, “Level and Volume of Neonatal Intensive Care and Mortality in Very-Low-Birth-Weight Infants,” New England Journal of Medicine, 356 (2007), 2165–2175. Pilote, Louise, Robert Califf, Shelly Sapp, Dave Miller, Daniel Mark, Douglas Weaver, Joel Gore, Paul Armstrong, Magnus Ohman, and Eric Topol, “Regional Variation across the United States in the Management of Acute Myocardial Infarction,” New England Journal of Medicine, 333 (1995), 565–572. Porter, Jack, “Estimation in the Regression Discontinuity Model,” University of Wisconsin-Madison Working Paper, 2003. Pressman, Eva, Jessica Bienstock, Karin Blakemore, Shari Martin, and Nancy Callan, “Prediction of Birth Weight by Ultrasound in the Third Trimester,” Obstetrics and Gynecology, 95 (2000), 502–506. Quinn, Kevin, “New Directions in Medicaid Payment for Hospital Care,” Health Affairs, 27 (2008), 269–280. Russell, Rebecca, Nancy Green, Claudia Steiner, Susan Meikle, Jennifer Howse, Karalee Poschman, Todd Dias, Lisa Potetz, Michael Davidoff, Karla Damus, and Joann Petrini, “Cost of Hospitalization for Preterm and Low Birth Weight Infants in the United States,” Pediatrics, 120 (2007), e1–e9. 
Stukel, Therese, Lee Lucas, and David Wennberg, "Long-Term Outcomes of Regional Variations in the Intensity of Invasive versus Medical Management of Medicare Patients with Acute Myocardial Infarction," Journal of the American Medical Association, 293 (2005), 1329–1337. Trochim, William, Research Design for Program Evaluation: The Regression-Discontinuity Design (Beverly Hills, CA: Sage Publications, 1984). Tu, Jack, Chris Pashos, David Naylor, Erluo Chen, Sharon-Lise Normand, Joseph Newhouse, and Barbara McNeil, "Use of Cardiac Procedures and Outcomes in Elderly Patients with Myocardial Infarction in the United States and Canada," New England Journal of Medicine, 336 (1997), 1500–1505.
United States Congress, Office of Technology Assessment, The Implications of Cost-Effectiveness Analysis of Medical Technology, Background Paper 2: Case Studies of Medical Technologies; Case Study 10: The Costs and Effectiveness of Neonatal Intensive Care (Washington, DC: United States Congress, Office of Technology Assessment, 1981). United States Institute of Medicine, Preventing Low Birthweight (Washington, DC: National Academies Press, 1985). Williams, Ronald, and Peter Chen, “Identifying the Source of the Recent Decline in Perinatal Mortality Rates in California,” New England Journal of Medicine, 306 (1982), 207–214. Zuckerman, Ilene H., Euni Lee, Anthony Wutoh, Zhenyi Xue, and Bruce Stuart, “Application of Regression-Discontinuity Analysis in Pharmaceutical Health Services Research,” Health Services Research, 41 (2005), 550–563. Zupancic, John, and Douglas Richardson, “Characterization of the Triage Process in Neonatal Intensive Care,” Pediatrics, 102 (1998), 1432–1436.
PROGRESSIVE ESTATE TAXATION∗ EMMANUEL FARHI AND IVÁN WERNING We present a model with altruistic parents and heterogeneous productivity. We derive two key properties for optimal estate taxation. First, the estate tax should be progressive, so that parents leaving a higher bequest face a lower net return on bequests. Second, marginal estate taxes should be negative, so that all parents face a marginal subsidy on bequests. Both properties can be implemented with a simple nonlinear tax on bequests, levied separately from the income tax. These results apply to other intergenerational transfers, such as educational investments, and are robust to endogenous fertility choices. Either estate or inheritance taxes can implement the optimal allocation, but we show that the inheritance tax has some advantages. Finally, when we impose an ad hoc constraint requiring marginal estate taxes to be nonnegative, the optimum features a zero tax up to an exemption level, and a progressive tax thereafter.
I. INTRODUCTION One of the biggest risks in life is the family one is born into. We partly inherit the luck, good or bad, of our parents through the wealth they accumulate. Behind the veil of ignorance, future generations value insurance against this risk. At the same time, parents are partly motivated by the impact their efforts can have on their children's well-being through bequests. This paper studies optimal estate taxation in an economy that captures the trade-off between insurance for newborns and incentives for parents. We begin with a simple economy with two generations. Parents live during the first period. In the second period each is replaced by a single child. Parents are altruistic toward their child, and they work, consume, and bequeath; children simply consume. Following Mirrlees (1971), parents first observe a random productivity draw and then exert work effort. Both productivity and work effort are private information; only output, the product of the two, is publicly observable. ∗ Farhi is grateful for the hospitality of the University of Chicago. This work benefited from useful discussions and comments by Manuel Amador, George-Marios Angeletos, Robert Barro, Peter Diamond, Michael Golosov, Jonathan Gruber, Chad Jones, Narayana Kocherlakota, Robert Lucas, Greg Mankiw, Chris Phelan, James Poterba, Emmanuel Saez, Rob Shimer, Aleh Tsyvinski, and seminar and conference participants at Austin, Brown, Rochester, Cornell, the University of Chicago, the University of Iowa, the Federal Reserve Bank of Minneapolis, MIT, Harvard, Northwestern, New York University, IDEI (Toulouse), the Stanford Institute for Theoretical Economics (SITE), the Society of Economic Dynamics (SED) at Budapest, the Minnesota Workshop in Macroeconomic Theory, and the NBER Summer Institute. All remaining errors are our own. © 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of
Technology. The Quarterly Journal of Economics, May 2010
Our first objective is to study the constrained efficient allocations and derive their implications for marginal tax rates. For this economy, if one takes the expected utility for parents as the social welfare objective, then Atkinson and Stiglitz's (1976) celebrated uniform-taxation result applies. It implies that the parent's intertemporal consumption choice should not be distorted. Thus, when no direct weight is placed on the welfare of children, labor income should be taxed nonlinearly, but bequests should remain untaxed. In terms of the allocation, this tax system induces the consumption of parent and child to vary one for one. In this sense, the luck of the parent's productivity is perfectly inherited by the child. There is no mean reversion across generations. In effect, from the perspective of the children's generation, their consumption is manipulated to provide their parents with incentives. They are offered no insurance against the risk of their parents' productivity. The resulting consumption inequality lowers their expected welfare, but this is of no direct concern to the planner. Although this describes one efficient arrangement, the picture is incomplete. In this economy, parent and child are distinct individuals, albeit linked by altruism. In positive analyses it is common to subsume both in a single fictitious "dynastic agent." However, a complete normative analysis must distinguish the welfare of parents and children (Phelan 2006; Farhi and Werning 2007). Figure I depicts our economy's Pareto frontier, plotting the ex ante expected utility for the child on the horizontal axis, and that of the parent on the vertical axis. The arrangement discussed in the preceding paragraph corresponds to the peak, marked as point A, which is an interior point due to parental altruism. This paper explores other efficient allocations, represented by points on the downward-sloping section of the Pareto frontier. To the right of point A, a role for estate taxation emerges with two critical properties. The first property concerns the shape of marginal taxes: we show that estate taxation should be progressive. That is, more fortunate parents with larger bequests should face a higher marginal estate tax. Because more fortunate parents get a lower after-tax return on bequests than the less fortunate, this induces bequests to become more similar. Our stark conclusion regarding the progressivity of estate taxation contrasts with the well-known lack of sharp results regarding the shape of the optimal income tax
FIGURE I
Pareto Frontier between Ex Ante Utility for Parent, $v_p$, and Child, $v_c$
[The figure plots $v_p$ on the vertical axis against $v_c$ on the horizontal axis; point A marks the peak of the frontier.]
schedule (Mirrlees 1971; Seade 1982; Tuomala 1990; Ebert 1992; Diamond 1998; Saez 2001).1 In terms of the allocation, as we move to the right of point A, the consumption inequality for children falls, which increases their expected welfare. The child’s consumption still varies with the parent’s consumption, but the relationship is now less than one-for-one. Consumption mean reverts across generations. In this sense, luck is only imperfectly inherited. Children are partly insured against the risk of their parents’ productivity. The second property concerns the level of marginal taxes. We find that estate taxation should be negative, imposing a marginal subsidy that declines with the size of bequests. A subsidy encourages bequests, which improves the consumption of newborns. This highlights that, in order to improve the average welfare of newborns, it is efficient to combine a reduction in inequality with an increase in average consumption. In a way, the first generation buys inequality, to improve incentives, from the second generation in exchange for higher average bequests. 1. Mirrlees’s (1971) seminal paper established that for bounded distributions of skills the optimal marginal income tax rates are regressive at the top (see also Seade [1982]; Tuomala [1990]; Ebert [1992]). More recently, Diamond (1998) has shown that the opposite—progressivity at the top—is possible if the skill distribution is unbounded (see also Saez [2001]). In contrast, our results on the progressivity of the estate tax do not depend on any assumptions regarding the distribution of skills.
The second main objective of the paper is to derive an explicit tax system that implements these efficient allocations. We prove that a simple system, which confronts parents with separate nonlinear schedules for income and estate taxes, works. The optimal estate tax schedule is decreasing and convex, reflecting our results for the sign and progressivity of the marginal tax. Thus, our results are not simply about implicit taxes or wedges, but also about marginal taxes in an explicit tax system. Of course, tax implementations are rarely unique and our model is no exception. For example, one other possible implementation combines labor income taxation with a regressive consumption tax on parents. We discuss this alternative later and argue why, in our view, our estate tax implementation seems more natural. We illustrate the flexibility of our basic model by extending it in a number of directions. We start by considering more general welfare criteria. Although our results rely on a utilitarian welfare function for the children's generation, the welfare criterion for the parents' generation is irrelevant. We also explore a Rawlsian criterion for the children's generation. To implement the optimal allocation in this case, the estate tax can be replaced by a no-debt constraint preventing parents from leaving negative bequests. This type of constraint is common throughout the world. Interestingly, we show that a no-debt constraint induces implicit tax rates that are negative and progressive, and that it corresponds to a limiting case of our earlier estate tax results. In other words, our estate tax results can be viewed as generalizing the principle of noninheritable debt. We also consider a simple extension with human capital investments. In the model, it is never optimal to distort the choice between bequests and this alternative source of intergenerational transfers. Thus, our estate tax results carry over, implying that human capital should be subsidized, with a higher marginal subsidy on lower investments. This is broadly consistent with actual educational policies. Finally, we compare estate and inheritance taxes by considering heterogeneous fertility. We show that the optimal estate tax must condition on the number of children, whereas the optimal inheritance tax does not. In this sense, inheritance taxes are simpler. These results apply even when fertility is endogenous, as in Becker and Barro (1988). Our results highlight two properties of optimal marginal tax rates on estates: they should be progressive and negative. To determine whether there is a separate role for each of these
features, we impose an ad hoc constraint that rules out negative marginal taxes on estates. We show that the progressivity result survives. In particular, the optimal marginal tax on estates is zero up to an exemption level, and positive and increasing above this level. Interestingly, the exemption level depends on the degree to which the rate of return to capital responds to the capital stock. In the limiting case where the rate of return to capital is fixed, the exemption level tends to infinity and estate taxes converge to zero. We close by studying an infinite-horizon version of our model. This framework provides a motivation for weighing the welfare of future generations. Indeed, allocations that maximize the expected utility for the very first generation are disastrous for the average welfare of distant generations. In a related model, Atkeson and Lucas (1992) prove an immiseration result of this kind, showing that inequality rises steadily over time, with everyone's consumption converging to zero. In contrast, as shown by Farhi and Werning (2007), with a positive weight on future generations, a steady state exists where inequality is bounded and constant.2 Tax implementations are necessarily more involved in our infinite-horizon setting, but our main results extend. We provide an implementation where taxes on estates are linear, but the rates depend on current and past labor income. When future generations are not valued, the expected tax rate is zero, as in Kocherlakota (2005). However, when future generations are valued, the expected tax rate is strictly increasing in the parent's consumption and it is negative. This progressivity induces mean reversion across generations and plays a key role in moderating the evolution of inequality over generations. Although our approach is normative, it is interesting to compare our prescriptions with actual policies. There are both similarities and differences. On the one hand, progressivity of marginal tax rates is, broadly speaking, a feature of actual estate tax policy in developed economies. For example, in the United States bequests are exempt up to a certain level, and then taxed linearly at a positive rate. Our paper provides the first theoretical justification, to the best of our knowledge, for this common feature of policy. On the other hand, the explicit marginal tax on estates is typically positive or zero, not negative. One interpretation is that our normative model stresses a connection between progressive 2. The model in Atkeson and Lucas (1992) and Farhi and Werning (2007) is an endowment economy with a single consumption good where taste shocks affect the marginal utility of consumption. In some cases, one can show that similar conclusions apply in a Mirrlees setting with labor as we have here (see Section VI.C).
and negative marginal tax rates that may be overlooked in current thinking on estate tax policy. However, the comparison with actual policies is more nuanced. First, a large fraction of bequests may lie below the exemption level and face a zero marginal tax rate. Second, as explained above, restrictions on debt inheritability constitute an implicit marginal subsidy on bequests. Finally, educational policies constitute an explicit subsidy to intergenerational transfers. It is worth stressing that, although we find that marginal estate taxes should be progressive, we do not attempt to derive the overall progressivity of the tax system, nor the extent of redistribution within the first generation. In particular, we do not characterize the shape of labor income taxes. In principle, the redistributive effect of a more progressive estate tax could be counterbalanced by adjusting the income tax schedule.3 Cremer and Pestieau (2001) also study optimal estate taxation in a two-period economy, but their results are quite different from ours. In particular, they find that marginal tax rates may be regressive and positive over some regions. These results are driven by their implicit assumption that parental consumption and work are complements, departing from the Atkinson–Stiglitz benchmark of separability, which is our starting point.4 Kaplow (1995, 2000) discusses estate and gift taxation in an optimal taxation framework with altruistic donors or parents. These papers make the point that gifts or estates should be subsidized, but assume away unobserved heterogeneity and are therefore silent on the issue of progressivity. Our work also relates to a number of recent papers that have explored the implications of including future generations in the welfare criterion. Phelan (2006) considered a planning problem that weighted all generations equally, which is equivalent to not discounting the future at all. Farhi and Werning (2007) considered intermediate cases, where future generations receive a geometrically declining weight. This is equivalent to a social discount factor that is less than one and higher than the private one. Sleet and Yeltekin (2006) have studied how such a higher social discount 3. Indeed, our proofs use such a readjustment to describe a set of feasible perturbations that leave work incentives unchanged. 4. In the main body of their paper, Cremer and Pestieau (2001) study a model without work effort, with an exogenous wealth shock that is privately observed by parents. However, in their appendix, they develop a more standard Mirrlees model with the assumption that parental consumption and work are complements.
factor may arise from a utilitarian planner without commitment. However, none of these papers consider implications for estate taxation.
II. PARENT AND CHILD: A TWO-PERIOD ECONOMY In our two-period economy a continuum of parents live during period t = 0. Each parent produces a single descendant, or child, that lives in period t = 1. Parents work and consume, whereas children simply consume. Each parent is altruistic toward his or her child. At the beginning of period t = 0, parents first learn their productivity θ0 , and then produce n0 efficiency units of labor. This requires n0 /θ0 units of work effort. The utility of a parent with productivity θ0 is given by
$$v_0(\theta_0) = u(c_0(\theta_0)) - h\!\left(\frac{n_0(\theta_0)}{\theta_0}\right) + \beta v_1(\theta_0), \tag{1}$$
with $\beta < 1$. The child's utility is simply
$$v_1(\theta_0) = u(c_1(\theta_0)). \tag{2}$$
The utility function u(c) is increasing, concave, and differentiable and satisfies Inada's conditions $u'(0) = \infty$ and $u'(\infty) = 0$; the disutility function h(n) is increasing, convex, and differentiable. In addition, we denote by $\bar{n}$ the possibly infinite maximum number of hours worked. Combining equations (1) and (2) gives v0(θ0) = u(c0) + βu(c1) − h(n0/θ0). In addition to production, there is an endowment e0 of goods in period 0 and an endowment e1 of goods in period 1. Moreover, goods can be transferred between periods t = 0 and t = 1 with a linear savings technology with rate of return R > 0. An allocation is resource feasible if
$$K_1 + \int_0^\infty c_0(\theta_0)\,dF(\theta_0) \le e_0 + \int_0^\infty n_0(\theta_0)\,dF(\theta_0),$$
$$\int_0^\infty c_1(\theta_0)\,dF(\theta_0) \le e_1 + R K_1,$$
where K1 is capital. Combining these two inequalities yields the present-value resource constraint
$$\int_0^\infty c_0(\theta_0)\,dF(\theta_0) + \frac{1}{R}\int_0^\infty c_1(\theta_0)\,dF(\theta_0) \le e_0 + \frac{1}{R}e_1 + \int_0^\infty n_0(\theta_0)\,dF(\theta_0). \tag{3}$$
We assume that productivity is privately observed by the parent. By the revelation principle, we can restrict attention to direct mechanisms, where agents report their productivity and receive an allocation as a function of this report. An allocation is incentive compatible if truthful revelation is optimal:
$$u(c_0(\theta_0)) + \beta u(c_1(\theta_0)) - h\!\left(\frac{n_0(\theta_0)}{\theta_0}\right) \ge u(c_0(\theta_0')) + \beta u(c_1(\theta_0')) - h\!\left(\frac{n_0(\theta_0')}{\theta_0}\right) \quad \forall\, \theta_0, \theta_0'. \tag{4}$$
An allocation is feasible if it satisfies the resource constraint (3) and the incentive constraints (4). Next, we define two utilitarian welfare measures:
$$V_0 \equiv \int_0^\infty v_0(\theta_0)\,dF(\theta_0) \quad\text{and}\quad V_1 \equiv \int_0^\infty v_1(\theta_0)\,dF(\theta_0).$$
Note that
$$V_0 = \int_0^\infty \bigl(u(c_0(\theta_0)) - h(n_0(\theta_0)/\theta_0)\bigr)\,dF(\theta_0) + \beta V_1,$$
so that the utilitarian welfare of the second generation, V1, enters that of the first generation, V0, through the altruism of parents. In addition to this indirect channel, we will allow the welfare of the second generation, V1, to enter our planning problem directly. Consider the following planning problem: max V0 subject to the resource constraint (3), the incentive-compatibility constraints (4), and
$$V_1 \ge \underline{V}_1. \tag{5}$$
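To fix ideas, the planning problem can be solved numerically once the continuum of types is replaced by a small number of discrete types. The sketch below is only an illustration: the functional forms, parameter values, and the restriction to a single incentive constraint are our assumptions, not the paper's.

```python
# Discrete-type version of the planning problem: max V0 subject to the
# resource constraint (3), one incentive constraint from (4), and V1 >= Vbar1 (5).
import numpy as np
from scipy.optimize import minimize

beta, R, sigma = 0.9, 1.05, 2.0
e0, e1 = 1.0, 1.0
theta = np.array([1.0, 2.0])        # low and high productivity types
prob = np.array([0.5, 0.5])         # type probabilities (the analogue of F)
Vbar1 = -0.9                        # illustrative bound; whether (5) binds depends on parameters

u = lambda c: c ** (1 - sigma) / (1 - sigma)
h = lambda n: 0.5 * n ** 2          # disutility of effort

def unpack(x):
    return x[:2], x[2:4], x[4:6]    # c0, c1, n0 by type

def V0(x):
    c0, c1, n0 = unpack(x)
    return prob @ (u(c0) + beta * u(c1) - h(n0 / theta))

def V1(x):
    _, c1, _ = unpack(x)
    return prob @ u(c1)

def resources(x):                   # present-value resource constraint, >= 0 when feasible
    c0, c1, n0 = unpack(x)
    return e0 + e1 / R + prob @ n0 - prob @ (c0 + c1 / R)

def ic_high(x):                     # high type should not prefer the low type's bundle
    c0, c1, n0 = unpack(x)
    truth = u(c0[1]) + beta * u(c1[1]) - h(n0[1] / theta[1])
    lie = u(c0[0]) + beta * u(c1[0]) - h(n0[0] / theta[1])
    return truth - lie

res = minimize(lambda x: -V0(x), np.full(6, 0.8), method="SLSQP",
               bounds=[(1e-3, None)] * 6,
               constraints=[{"type": "ineq", "fun": resources},
                            {"type": "ineq", "fun": ic_high},
                            {"type": "ineq", "fun": lambda x: V1(x) - Vbar1}])
c0, c1, _ = unpack(res.x)
wedge = beta * R * c1 ** (-sigma) / c0 ** (-sigma) - 1   # intergenerational wedge by type
print(res.success, wedge)           # if (5) binds, the wedge is negative and increasing in theta
```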
The planning problem is indexed by $\underline{V}_1$. For low enough values of $\underline{V}_1$, constraint (5) is not binding, and the planning problem then maximizes parental welfare V0 subject to feasibility. Let V1∗ be the corresponding level of welfare obtained by the second generation in the planning problem when constraint (5) is not imposed. This corresponds to the peak on the Pareto frontier illustrated in Figure I. Constraint (5) is not binding for all $\underline{V}_1$ ≤ V1∗. The second generation obtains a finite level of welfare V1∗ because they are valued indirectly, through the altruism of the first generation. For values of $\underline{V}_1$ > V1∗, constraint (5) binds and the solution corresponds to the downward sloping section in the figure. III. THE MAIN RESULT: PROGRESSIVE ESTATE TAXATION In this section we derive two main results for the two-period economy laid out in the preceding section. For any allocation with interior consumption, define the implicit estate tax τ(θ0) by
$$(1 + \tau(\theta_0))\,u'(c_0(\theta_0)) \equiv \beta R\, u'(c_1(\theta_0)). \tag{6}$$
This identity defines a distortion so that the intergenerational Euler equation holds. Similarly, we can define the implicit inheritance tax τ̂(θ0) by
$$u'(c_0(\theta_0)) \equiv \beta R\,(1 - \hat{\tau}(\theta_0))\,u'(c_1(\theta_0)). \tag{7}$$
Each of these wedges can be expressed as a function of the other: $\hat{\tau}(\theta_0) = \frac{\tau(\theta_0)}{1+\tau(\theta_0)}$ or $\tau(\theta_0) = \frac{\hat{\tau}(\theta_0)}{1-\hat{\tau}(\theta_0)}$. In the exposition, we choose to focus mostly on the implicit estate tax. We first derive properties for this implicit tax. We then construct an explicit tax system that implements efficient allocations. III.A. Implicit Tax Rates To derive an intertemporal-optimality condition, let ν be the multiplier on constraint (5) and μ be the multiplier on the resource constraint (3), and form the corresponding Lagrangian,
$$\mathcal{L} \equiv \int_0^\infty \left[v_0(\theta_0) + \nu v_1(\theta_0)\right] dF(\theta_0) - \mu \int_0^\infty \left[c_0(\theta_0) + c_1(\theta_0)/R - n_0(\theta_0)\right] dF(\theta_0),$$
so that the planning problem is equivalent to maximizing $\mathcal{L}$ subject to incentive constraints (4). Suppose an allocation is optimal and has strictly positive consumption. Consider the following perturbation, at a particular point θ0. Let $c_0^\varepsilon(\theta_0) = c_0(\theta_0) + \varepsilon$ and define $c_1^\varepsilon(\theta_0)$ as the solution to $u(c_0^\varepsilon(\theta_0)) + \beta u(c_1^\varepsilon(\theta_0)) = u(c_0(\theta_0)) + \beta u(c_1(\theta_0))$. This construction ensures that the incentive constraints are unaffected by ε. A first-order necessary condition is that the derivative of $\mathcal{L}$ with respect to ε be equal to zero. This yields
$$\frac{\beta R}{u'(c_0(\theta_0))} = \frac{1}{u'(c_1(\theta_0))} - R\frac{\nu}{\mu},$$
which shows that c0(θ0) and c1(θ0) are increasing functions of each other. Incentive compatibility implies that utility from consumption, u(c0(θ0)) + βu(c1(θ0)), is nondecreasing in productivity θ0. It follows that consumption of both parent and child, c0(θ0) and c1(θ0), are nondecreasing in θ0. This equation can be rearranged in the following two useful ways:
$$\left(1 - R\frac{\nu}{\mu}u'(c_1(\theta_0))\right)u'(c_0(\theta_0)) = \beta R\, u'(c_1(\theta_0)) \tag{8}$$
and
$$u'(c_0(\theta_0)) = \beta R\left(1 + \frac{1}{\beta}\frac{\nu}{\mu}u'(c_0(\theta_0))\right)u'(c_1(\theta_0)). \tag{9}$$
Our first result regarding taxes, derived from equation (8) with ν = 0, simply echoes the celebrated Atkinson–Stiglitz uniform-commodity taxation result for our economy. PROPOSITION 1. The optimal allocation with $\underline{V}_1$ ≤ V1∗ has a zero implicit estate tax τ(θ0) = 0 for all θ0. Atkinson and Stiglitz (1976) showed that if preferences over a group of consumption goods are separable from work effort, then the tax rates on these goods can be set to zero. In our context, this result applies to the consumption (c0, c1) and implies a zero implicit estate tax. The Euler equation $u'(c_0) = \beta R\, u'(c_1)$ implies that dynastic consumption is smoothed. As a result, the optimum features perfect inheritability of welfare across generations. For example, if the utility function is CRRA, $u(c) = c^{1-\sigma}/(1-\sigma)$, then
$$c_1(\theta_0) = (\beta R)^{1/\sigma} c_0(\theta_0), \quad\text{or equivalently}\quad \log c_1(\theta_0) - \log c_0(\theta_0) = \frac{1}{\sigma}\log(\beta R).$$
Thus, the consumption of parent and child vary, across dynasties with different θ0, one for one in logarithmic terms. Making the child's consumption depend on the parent's productivity θ0 provides the parent with added incentives. The child's welfare is not valued directly in the planning problem. As a result, the children are used to provide incentives. From their point of view, no insurance for the risk of their parent's productivity is provided. In contrast, when $\underline{V}_1$ > V1∗, so that ν > 0, equation (8) implies that the ratio of marginal utilities is not equalized across agents and the marginal estate tax must be nonzero. Indeed, because consumption increases with θ0, estate taxation must be progressive: the implicit marginal estate tax rate τ(θ0) increases with the productivity θ0 of the parent. PROPOSITION 2. Suppose $\underline{V}_1$ > V1∗ and that the optimal allocation has strictly positive consumption. Then the implicit estate tax is strictly negative and increasing in the parent's productivity θ0:
$$\tau(\theta_0) = -R\frac{\nu}{\mu}u'(c_1(\theta_0)). \tag{10}$$
The proposition provides an expression for the implicit estate tax that relates it to the child's consumption. The progressivity of the estate tax is implied by the fact that c1(θ0) is increasing in θ0. From equation (9) one can also derive the following formula for the implicit marginal inheritance tax τ̂(θ0):
$$\hat{\tau}(\theta_0) = \frac{\tau(\theta_0)}{1+\tau(\theta_0)} = -\frac{1}{\beta}\frac{\nu}{\mu}u'(c_0(\theta_0)). \tag{11}$$
This alternative expression is sometimes useful. Returning to the CRRA example, equation (9) now implies
$$\log c_1(\theta_0) - \log c_0(\theta_0) = \frac{1}{\sigma}\log\!\left(1 + \frac{1}{\beta}\frac{\nu}{\mu}\,c_0(\theta_0)^{-\sigma}\right) + \frac{1}{\sigma}\log(\beta R). \tag{12}$$
As long as ν/μ > 0, the right-hand side of equation (12) is strictly decreasing in c0(θ0). Thus, the child's consumption still varies with that of the parent, but less than one for one in logarithmic terms.
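A small numerical illustration of equations (10) and (12), assuming CRRA utility and parameter values chosen purely for illustration (σ, β, R, and ν/μ are our assumptions, not calibrated to anything in the paper):

```python
# Mean reversion under CRRA utility, equation (12), and the implied implicit
# estate tax from equation (10). All parameter values are illustrative.
import numpy as np

sigma, beta, R, nu_over_mu = 2.0, 0.9, 1.05, 0.05

c0 = np.linspace(0.5, 5.0, 10)                   # parents' consumption across dynasties
gap = (np.log(1.0 + nu_over_mu / beta * c0 ** (-sigma)) + np.log(beta * R)) / sigma
c1 = c0 * np.exp(gap)                            # implied child's consumption
tau = -R * nu_over_mu * c1 ** (-sigma)           # implicit estate tax, equation (10)

print(np.all(np.diff(gap) < 0))                  # gap falls with c0: mean reversion
print(np.all(tau < 0), np.all(np.diff(tau) > 0)) # negative and increasing: a progressive subsidy
```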
In this way, the intergenerational transmission of welfare is imperfect, with consumption mean reverting across generations. Mean reversion serves to reduce inequality in the second generation's consumption. When the expected welfare of the second generation is considered in the planning problem, insurance is provided to reduce inequality. The progressivity of the implicit estate tax reflects this mean reversion. Fortunate parents, with higher productivities, must face a lower net-of-tax return on bequests, so their dynastic consumption slopes downward. Likewise, poorer parents, with lower productivities, require higher net-of-tax returns on bequests, so their dynastic consumption slopes upward. Another intuition is based on interpreting our economy with altruism as an economy with an externality. In the presence of externalities, corrective Pigouvian taxes are optimal. One difference is that, typically, externalities are modeled as being a function of the average consumption of a good, such as the pollution produced from gasoline consumption. As a result, the corrective Pigouvian tax is linear. In contrast, in our model the externality enters nonlinearly, resulting in an optimal tax that is also nonlinear. To see this, think of c1 as a good that the parent enjoys and chooses, but that happens to have a positive externality on the child. Because the externality is positive, a Pigouvian subsidy is called for. However, according to the utilitarian welfare metric, the externality is not a function of aggregate consumption $\int c_1(\theta_0)\,dF(\theta_0)$. Instead, it equals $\int u(c_1(\theta_0))\,dF(\theta_0)$. Because the utility function u(c1) is concave, the externality is stronger for children with lower consumption. Indeed, the subsidy is directly proportional to $u'(c_1)$. This explains the progressivity of the implicit tax τ(θ0). Private information is not crucial for our results. In our model, private information creates inequality in the utility parents obtain from consumption goods, u(c0) + βu(c1). Our results would also obtain if such inequality were simply assumed, or possibly derived for other reasons. III.B. Explicit Tax Implementations An allocation is said to be implemented by a nonlinear labor income tax $T^y(n_0)$ and estate tax $T^b(b)$ if, for all θ0, (c0(θ0), c1(θ0), n0(θ0)) solves
$$\max_{c_0,\,c_1,\,n_0}\; u(c_0) + \beta u(c_1) - h\!\left(\frac{n_0}{\theta_0}\right)$$
subject to
$$c_0 + b \le e_0 + n_0 - T^b(b) - T^y(n_0),$$
$$c_1 \le e_1 + Rb.$$
The first-order condition for this problem, assuming $T^b(b)$ is differentiable, gives $(1 + T^{b\prime}(b))\,u'(c_0) = \beta R\, u'(c_1)$. To find a candidate tax schedule, we match this first-order condition with equation (8) and use the budget constraint to substitute out c1 = e1 + Rb, to obtain
$$T^{b\prime}(b) = -R\frac{\nu}{\mu}u'(e_1 + Rb). \tag{13}$$
For any arbitrary value $T^b(0)$, this gives $T^b(b) = T^b(0) + \frac{\nu}{\mu}u(e_1) - \frac{\nu}{\mu}u(e_1 + Rb)$.5 Indeed, this candidate does implement the optimal allocation. The proof, contained in the Appendix, exploits the fact that marginal tax rates are progressive. As a result, the parent's problem is convex in the bequest choice, ensuring that the first-order condition, which we used above to define marginal taxes, is sufficient for the parent's optimal bequest choice. PROPOSITION 3. Suppose that $\underline{V}_1$ > V1∗ and that the optimal allocation has strictly positive consumption. Then the optimal allocation is implementable with a nonlinear income tax and an estate tax. The estate tax $T^b$ is strictly decreasing and convex. Under this implementation, parents face negative marginal tax rates simply because $T^b$ is decreasing. In equilibrium, parents with higher productivity face higher tax rates because they choose to leave larger bequests and $T^b$ is convex. Note that $b + T^b(b)$ is not monotone and has a minimum at a bequest level $\underline{b}$ where $T^{b\prime}(\underline{b}) = -1$; parents never leave bequests below $\underline{b}$. For a given utility function u(c) and return R, the optimal estate tax schedule $T^b$ belongs to a space of functions indexed by a single parameter ν/μ. It is interesting to note that this space of functions is independent of the distribution of productivity F(θ0) and the disutility function h.6 5. Note that our implementation features one degree of freedom in the level of income and estate taxes. Marginal tax rates are entirely pinned down by the optimal allocation, which is unique. However, only the sum of the income and estate tax schedules, $T^y(0) + T^b(0)$, is determined. Thus, the model does not uniquely pin down the sign of average estate taxes, nor the revenue generated by the estate tax.
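To see what such a schedule looks like, the sketch below evaluates the level and the marginal rate implied by (13) under CRRA utility; the parameter values and the normalization $T^b(0) = 0$ are illustrative assumptions, not taken from the paper.

```python
# Estate tax schedule implied by equation (13) under CRRA utility.
import numpy as np

R, nu_over_mu, e1, sigma, Tb0 = 1.05, 0.05, 1.0, 2.0, 0.0
u = lambda c: c ** (1 - sigma) / (1 - sigma)
u_prime = lambda c: c ** (-sigma)

b = np.linspace(0.0, 5.0, 11)
Tb = Tb0 + nu_over_mu * (u(e1) - u(e1 + R * b))       # level of the tax schedule
Tb_marginal = -R * nu_over_mu * u_prime(e1 + R * b)   # marginal rate, equation (13)

# The schedule is decreasing (a marginal subsidy) and convex: the subsidy
# shrinks in absolute value as the bequest grows, i.e. it is progressive.
print(np.all(np.diff(Tb) < 0), np.all(np.diff(Tb_marginal) > 0))
```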
Of course these primitives may affect the relevant value of ν/μ, which pins down the tax function from this space. A higher value of ν/μ gives higher marginal taxes $T^{b\prime}(b)$. Interestingly, it has no impact on the ratio $T^{b\prime\prime}(b)/T^{b\prime}(b)$, a measure of local relative progressivity. For any value ν/μ, explicit marginal taxes $T^{b\prime}(b)$ have a full range because $T^{b\prime}(\underline{b}) = -1$ and $\lim_{b\to\infty} T^{b\prime}(b) = 0$ (assuming Inada conditions for u(c)). However, in equilibrium, the range of implicit marginal taxes $\tau(\theta_0) = T^{b\prime}\!\left(\frac{c_1(\theta_0) - e_1}{R}\right)$ is typically more confined, because parents may stay away from entire sections of the $T^b(b)$ schedule. In particular, as long as the allocation has consumption bounded away from zero for parents, so that inf c0(θ0) > 0, equilibrium implicit taxes τ(θ0) are bounded away from −1. Inheritance Taxes. It is also possible to implement the allocation using inheritance taxes paid by the child. Here, the difference between an estate and inheritance tax is minor, but a starker contrast emerges with the extensions considered in Section IV.D. An allocation is implementable by nonlinear income and inheritance taxes, $\hat{T}^y(n_0)$ and $\hat{T}^b(Rb)$, if (c0(θ0), c1(θ0), n0(θ0)) maximizes the utility for a parent with productivity θ0 subject to
$$c_0 + b \le e_0 + n_0 - \hat{T}^y(n_0),$$
$$c_1 \le e_1 + Rb - \hat{T}^b(Rb).$$
Proceeding similarly, the first-order condition is now $u'(c_0) = \beta R\,(1 - \hat{T}^{b\prime}(Rb))\,u'(c_1)$, so that matching this expression with equation (8) leads to a differential equation,
$$\frac{\hat{T}^{b\prime}(Rb)}{1 - \hat{T}^{b\prime}(Rb)} = -R\frac{\nu}{\mu}u'\bigl(e_1 + Rb - \hat{T}^b(Rb)\bigr), \tag{14}$$
with any arbitrary value $\hat{T}^b(0)$. As it turns out, one can show that with this inheritance tax, the budget set of affordable (c0, c1, n0) is identical to that affordable with the proposed estate tax implementation. Thus, parents choose the same allocation in both implementations, and Proposition 3 implies that this allocation is the optimum. 6. In contrast, the tax on labor is very sensitive to the specification of these elements (Saez 2001).
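Equation (14) determines the inheritance tax schedule up to its level. A minimal numerical sketch, assuming CRRA utility and illustrative parameter values (none taken from the paper), solves it as an initial value problem in the gross inheritance x = Rb:

```python
# Solve the differential equation (14) for the inheritance tax schedule.
import numpy as np
from scipy.integrate import solve_ivp

R, nu_over_mu, e1, sigma = 1.05, 0.05, 1.0, 2.0
u_prime = lambda c: c ** (-sigma)

def rhs(x, That):
    # Equation (14): That'(x) / (1 - That'(x)) = -R (nu/mu) u'(e1 + x - That(x)),
    # solved here for That'(x), with x = R*b the gross inheritance.
    g = -R * nu_over_mu * u_prime(e1 + x - That[0])
    return [g / (1.0 + g)]

sol = solve_ivp(rhs, t_span=(0.0, 5.0), y0=[0.0], dense_output=True)  # normalization That(0) = 0
xs = np.linspace(0.0, 5.0, 6)
print(sol.sol(xs)[0])   # the schedule should be decreasing and convex in x
```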
Interestingly, both $T^b$ and $\hat{T}^b$ are defined without reference to the allocation as a function of productivity θ0. Thus, unlike τ(θ0) and τ̂(θ0), these tax schedules do not directly depend on the optimal allocation c0(θ0) or c1(θ0), except indirectly through the ratio of the multipliers ν/μ. Other Implementations. We have stated our results in terms of the implicit marginal tax rates, as well as a particular tax implementation. As is commonly the case, tax implementations are not unique. Two other implementations are worth briefly mentioning. First, the optimal allocation can also be implemented by a nonlinear income tax and a progressive consumption tax, $T^{c_1}(c_1)$, in the second period. Second, the optimal allocation can also be implemented by combining a nonlinear income tax with a regressive consumption tax, $T^{c_0}(c_0)$, in the first period. In this two-period version of the model all these implementations seem equally plausible. However, with more periods, implementations relying on consumption taxes require marginal consumption tax rates to grow without bound. Although, formally, this is feasible, it seems unappealing for considerations outside the scope of the model, such as tax evasion.7 In any case, all possible implementations share the feature that the intertemporal choice of consumption will be distorted, so that the implicit marginal tax rate on estates is progressive and given by τ(θ0).
IV. EXTENSIONS In this section, we consider some extensions that can address a number of relevant issues. They also help in a comparison of optimal policies, within the model, to actual real-world policies. IV.A. General Welfare Functions Our first extension is the most straightforward. In our basic setup, we adopted utilitarian social welfare measures for both generations. We now generalize the welfare measures considered. 7. Moreover, in a multiperiod extension where each agent lives for more than one period, a consumption tax on annual consumption would not work, because the progressive intertemporal distortions should be introduced only across generations, not across a lifetime.
Define two welfare measures W0 and W1 for parents and children, respectively, by
$$W_0 = \int_0^\infty \hat{W}_0(v_0(\theta_0), \theta_0)\,dF(\theta_0),$$
$$W_1 = \int_0^\infty \hat{W}_1(v_1(\theta_0))\,dF(\theta_0),$$
where v0(θ0) = u(c0(θ0)) + βu(c1(θ0)) − h(n0(θ0)/θ0) and v1(θ0) = u(c1(θ0)). Assume that $\hat{W}_1$ is increasing, concave, and differentiable and that $\hat{W}_0(\cdot, \theta_0)$ is increasing and differentiable for all θ0. The utilitarian case considered before corresponds to the identity functions $\hat{W}_0(v) = v$ and $\hat{W}_1(v) = v$. Because the welfare function $\hat{W}_0$ may depend on θ0, it allows $\hat{W}_0 = \pi(\theta_0)v_0(\theta_0)$ with arbitrary Pareto weights π(θ0). Thus, we only require Pareto efficiency in evaluating the welfare of the first generation. In contrast, our results do depend on the welfare criterion for the second generation. Importantly, the generalized utilitarian criterion W1 captures a preference for equality. The planning problem maximizes W0 subject to (3), (4), and $W_1 \ge \underline{W}_1$ for some $\underline{W}_1$. Using the same perturbation argument developed for the utilitarian case, we find that if the optimal allocation features strictly positive consumption, the implicit estate tax is given by
$$\tau(\theta_0) = -R\frac{\nu}{\mu}\hat{W}_1'\bigl(u(c_1(\theta_0))\bigr)\,u'(c_1(\theta_0)).$$
Because $\hat{W}_1$ is increasing and concave, it follows that τ(θ0) is negative and increasing in θ0. The estate tax is progressive and negative. Interestingly, the marginal tax does not depend directly on the parent's welfare function $\hat{W}_0$, except indirectly through the ratio of multipliers ν/μ. In contrast, as the formula above reveals, the welfare function for children $\hat{W}_1$ has a direct impact on the shape of estate taxes. In particular, for given ν/μ, more concave welfare functions imply more progressive tax schedules.8 As in Proposition 3, the optimal allocation, as long as it features strictly positive consumption, is implementable with a nonlinear income tax $T^y$ and either an estate tax $T^b$ or an inheritance 8. Of course, the welfare function for parents $\hat{W}_0$ still plays an important role in determining the income tax schedule and, hence, the overall progressivity of the tax system.
tax T̂^b. The estate and inheritance tax schedules are decreasing and convex.
IV.B. Noninheritable Debt
In most countries, children are not liable for their parents' debts. Interestingly, our main results for estate taxation can be seen as a generalization of this widely accepted constraint that parents cannot borrow against their children. First, a no-debt constraint creates implicit marginal taxes that are progressive and negative. This is because parents with lower productivity find the constraint more binding. Second, when the welfare criterion for children is Rawlsian, instead of utilitarian, a no-debt constraint implements the optimal allocation.
When the welfare criterion for children is Rawlsian, the planning problem maximizes W0 subject to the resource constraint (3), the incentive-compatibility constraints (4), and

(15)    u1(θ0) ≥ u̲1    for all θ0,
where u̲1 parameterizes a minimum level of utility for children. Let c̲1 be the corresponding consumption level: c̲1 = u⁻¹(u̲1). For a high enough value of u̲1, the solution to this problem features a threshold θ̄0 such that constraint (15) is binding for all θ0 < θ̄0 and slack for θ0 > θ̄0. Moreover, for θ0 ≥ θ̄0 the implicit estate tax is zero. For θ0 < θ̄0 we have c1(θ0) = c̲1, so that

(16)    τ(θ0) = βR u′(c̲1)/u′(c0(θ0)) − 1.
Because c0(θ0) is nondecreasing in θ0, it follows that τ(θ0) is nondecreasing and nonpositive.
In our implementation, an agent of type θ0 faces the borrowing constraints

    c0 + b ≤ e0 + n0 − T^y(n0),
    c1 ≤ e1 + Rb − T1^y,
    b ≥ 0.

Under this implementation children pay a lump-sum tax T1^y ≡ e1 − c̲1, so that when b = 0 they can consume c̲1.
PROPOSITION 4. Suppose that the welfare function for the children's generation is Rawlsian. Then the optimal allocation can be implemented with an income tax for parents, T^y, a
lump-sum tax for the child, T1^y, and a no-debt constraint, b ≥ 0.
When the debt constraint is strictly binding, b = 0, the intergenerational Euler equation holds with strict inequality, u′(c0(θ0)) > βRu′(c̲1). Thus, parents face an implicit estate subsidy τ(θ0) < 0. These parents would like to borrow against their children, but the implementation precludes it. The lower the productivity θ0, the lower is c0(θ0) and the stronger is this borrowing motive. As a result, the implicit subsidy −τ(θ0) is strictly decreasing in θ0 over the range of parents at the debt limit. This implementation highlights a feature of policy that is often overlooked: in most countries, children are not liable for their parents' debts, and this alone contributes to progressive and negative implicit estate taxes, as in our model.
The Rawlsian case, and its no-debt constraint solution, can be obtained as a limit of the previous analysis. To see this, consider a sequence of concave and continuously differentiable welfare functions {Ŵ1,k} that becomes infinitely concave around u̲1 in the sense that

    lim_{k→∞} lim_{u1↓u̲1} Ŵ′1,k(u1) = 0    and    lim_{k→∞} lim_{u1↑u̲1} Ŵ′1,k(u1) = ∞.
In the limit, the solution with this sequence of welfare functions converges to that of the Rawlsian case. Similarly, along this sequence, the estate tax schedule T^{b,k}(b) is convex and decreasing, as implied by our results. However, it converges to a schedule with an infinite tax on bequests below some threshold, effectively imposing a no-debt constraint, and a zero marginal tax rate above this same threshold. In this sense, our results extend the logic of a no-debt constraint to smoother welfare functions: an estate tax that is progressive and negative is a smoother version of a no-debt constraint.
We have taken the welfare function for children, Ŵ1, as Rawlsian, but do not make any assumptions on the welfare function for parents, Ŵ0(·, θ0). However, this does not formally cover the case where the welfare criterion for parents is also Rawlsian, W0 = min_{θ0} v0(θ0). This case is more easily handled by considering the dual problem of minimizing the net present value of resources subject to the constraint that the utility v0(θ0) of every parent be above some bound v̲0. A very similar analysis then
shows that Proposition 4 also applies when the welfare criterion for parents is Rawlsian. The details are contained in the Appendix.
IV.C. Educational Subsidies
In our model, bequests were the only transfer between one generation and the next. However, in reality, educational investments are an important form of giving by parents. We now explore the applicability of our results to these transfers by incorporating the simplest form of human capital.
Let x denote investment and H(x) denote human capital, where H is a differentiable, increasing, and concave function with Inada conditions H′(0) = ∞ and H′(∞) = 0. Each unit of human capital produces a unit of the consumption good, so that the resource constraint becomes

    ∫₀^∞ [c0(θ0) + c1(θ0)/R] dF(θ0) ≤ e0 + e1/R + ∫₀^∞ [n0(θ0) + H(x(θ0))/R − x(θ0)] dF(θ0).

Preferences are
    v0(θ0) = u(c0(θ0)) − h(n0(θ0)/θ0) + βv1(θ0),
    v1(θ0) = U(c1(θ0), H(x(θ0))),
where U is differentiable, increasing, and concave in both arguments and satisfies standard Inada conditions. This structure of preferences preserves the weak separability assumption required for the Atkinson–Stiglitz benchmark result. The assumption that H enters the utility function U is a convenient way of ensuring that parents do not all make the exact same choice for H.9 Indeed, we will assume that H is a normal good, so that richer parents invest more.
9. Similar results would hold if instead H entered the parent's utility function directly.
Following the same perturbation arguments as in Section III, one finds that the formula for the implicit estate tax is unaffected and given by

    τ(θ0) = −R (ν/μ) U_{c1}(c1(θ0), H(x(θ0)))
as long as the optimal allocation features strictly positive consumption.
Turning to the implicit tax on human capital, consider the following perturbation. Fix some θ0. Increase investment to x^ε(θ0) = x(θ0) + ε, leave parental consumption unchanged, c0^ε(θ0) = c0(θ0), and set c1^ε(θ0) so that utility is unchanged:
    U(c1^ε(θ0), H(x^ε(θ0))) = U(c1(θ0), H(x(θ0))).

This perturbation leaves utility for both the parent and the child unchanged, but impacts the resource constraint. This leads to the following first-order condition:

(17)    R = H′(x(θ0)) [1 + U_H(c1(θ0), H(x(θ0))) / U_{c1}(c1(θ0), H(x(θ0)))].

This equation equalizes the rate of return on saving, R, to that on human capital, which features the purely monetary component, H′(x), as well as a term due to the appreciation for human capital in utility. Equation (17) is also the first-order condition of

(18)    V1(e) = max_{c1,x} U(c1, H(x))    s.t.  c1 − e1 + Rx − H(x) ≤ e.
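As a quick numerical check of condition (17), the sketch below solves program (18) by grid search under assumed functional forms (log utility over consumption and human capital, H(x) = A√x) and made-up values for R, e1, and the transfer budget e; it then verifies that the solution equates the return on saving to the full return on human capital.

```python
import numpy as np

# Assumed illustrative functional forms: U(c1, h) = log(c1) + alpha*log(h), H(x) = A*sqrt(x)
alpha, A = 0.5, 2.0
R, e1, e = 1.5, 0.5, 3.0   # hypothetical return, child endowment, transfer budget

H = lambda x: A * np.sqrt(x)
H_prime = lambda x: A / (2.0 * np.sqrt(x))

def child_utility(x):
    # Budget constraint of (18) holds with equality: c1 = e + e1 - R*x + H(x)
    c1 = e + e1 - R * x + H(x)
    return np.log(c1) + alpha * np.log(H(x))

# Solve program (18) by a fine grid search over x
x_grid = np.linspace(1e-4, 2.0, 50_000)
x_star = x_grid[np.argmax(child_utility(x_grid))]
c1_star = e + e1 - R * x_star + H(x_star)

# Check the no-distortion condition (17): R = H'(x) * (1 + U_H / U_c1)
U_c1 = 1.0 / c1_star
U_H = alpha / H(x_star)
rhs = H_prime(x_star) * (1.0 + U_H / U_c1)
print(f"x* = {x_star:.4f}, c1* = {c1_star:.4f}, R = {R:.3f}, RHS of (17) = {rhs:.3f}")
```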
The quantity c1 − e1 − H(x) is the financial bequest received by the child, and x is the human capital investment. Equation (17) implies that it is optimal not to distort the choice between these two forms of transfer from parent to child. In what follows, we assume that the financial bequest and human capital investment are both normal goods; that is, the optimal c1 − e1 − H(x) and x in the maximization (18) are increasing in e.
We consider an implementation with three separate nonlinear tax schedules: a nonlinear income tax T^y, a nonlinear estate tax T^b, and a nonlinear human capital tax T^x. The parent maximizes u(c0) − h(n0/θ0) + βU(c1, H(x)) subject to

    c0 + b + x ≤ e0 + n0 − T^y(n0) − T^b(b) − T^x(x),
    c1 ≤ e1 + Rb + H(x).

As shown above, the choice between bequests and human capital should be undistorted. This requires equalizing marginal
estate and human capital tax rates, which suggests looking for candidate tax schedules with

    T^x′(x(θ)) = T^b′((c1(θ) − e1 − H(x(θ)))/R) = τ(θ).

The next proposition establishes that this construction indeed works.10
10. If human capital does not enter utility, equation (17) reduces to H′(x(θ0)) = R. Thus, the optimum requires all parents to make the exact same human capital investment. Such an allocation cannot be implemented with three separate tax schedules. An implementation that does work in this case is to tax total wealth jointly, so that the child pays taxes as a function of total wealth Rb + H(x).
PROPOSITION 5. Assume that financial bequests and human capital investment are normal goods in (18) and that the optimal allocation features strictly positive consumption. There exist three separate nonlinear tax schedules T^y, T^b, and T^x that implement the optimal allocation. In addition, T^b and T^x are decreasing and convex. Moreover, the marginal tax rates on bequests and human capital investment are equalized.
Many countries have policies toward education and other forms of human capital acquisition that are broadly consistent with these prescriptions. Governments help finance human capital investments. Typically, basic education is provided for free, whereas higher levels of education and other forms of training may be only partially subsidized. Furthermore, higher levels of education have an important opportunity cost component, which is not typically subsidized. In sum, these policies subsidize human capital investments, but provide a smaller marginal subsidy to those making larger investments.
Alternative arguments for subsidizing education have relied on a "good citizen" externality. As we explained in Section III, our economy with altruism can be interpreted as an economy with an externality. However, the "externality" in this case runs through the average welfare of the second generation, rather than through civic attitudes. Thus, we emphasize completely distinct issues. Interestingly, educational subsidies are often defended by appealing to "equality of opportunity." Our model captures a desire for equality through the central role played by the utilitarian welfare of the next generation.
The generality of the results in Proposition 5 should not be overstated. Our simple model relies on strong assumptions. For
example, it ignores labor supply choices by children, as well as uncertainty, heterogeneity, and private information with respect to human capital returns. No doubt, in its current form, Proposition 5 is unlikely to survive all such extensions. Nevertheless, our model provides a simple benchmark that delivers sharp results, illustrating a mechanism that is likely to be at work in richer environments.11
IV.D. Estate versus Inheritance Taxes
In this section, we allow fertility differences across households and show that our results are robust. Moreover, this richer setup allows interesting comparisons of estate and inheritance taxes.
Exogenous Fertility Differences. We first assume that fertility is exogenous. Let m denote the number of children in a household, with joint distribution F̂(θ0, m) over fertility and productivity. We assume that m is observable.
Preferences are as in Becker and Barro (1988). A parent with productivity and fertility given by (θ0, m) has utility u(c0) − h(n0/θ0) + Σ_{j=1}^m βm u(c1,j), where βm is an altruism factor that may depend on m. Optimal allocations will be symmetric across children within a family, so that c1,j = c1 for all j. The welfare measures are
    V0 = ∫₀^∞ v0(θ0, m) dF̂(θ0, m),
    V1 = ∫₀^∞ m v1(θ0, m) dF̂(θ0, m),

with v0(θ0, m) = u(c0(θ0, m)) + mβm u(c1(θ0, m)) − h(n0(θ0, m)/θ0) and v1(θ0, m) = u(c1(θ0, m)).
We assume child-rearing costs of κ ≥ 0 per child. The present-value resource constraint becomes

(19)    ∫₀^∞ [c0(θ0, m) + m(κ + c1(θ0, m)/R)] dF̂(θ0, m) ≤ e0 + e1/R + ∫₀^∞ n0(θ0, m) dF̂(θ0, m).
11. Some recent treatments of optimal taxation in environments with endogenous human capital incorporating some of these features include Kapicka (2006, 2008) and Grochulski and Piskorski (2010). These papers focus on taxation within a lifetime.
The incentive-compatibility constraints are

(20)    u(c0(θ0, m)) + mβm u(c1(θ0, m)) − h(n0(θ0, m)/θ0)
            ≥ u(c0(θ0′, m)) + mβm u(c1(θ0′, m)) − h(n0(θ0′, m)/θ0)    ∀ m, θ0, θ0′.
Note that m is on both sides of the inequality, reflecting the fact that m is observable. The planning problem maximizes V0 subject to (19), (20), and V1 ≥ V̄1, for some V̄1. The same variational argument used before can be applied conditioning on (θ0, m), giving the following expression for the implicit estate tax if the optimal allocation features strictly positive consumption:

(21)    τ(θ0, m) = −R (ν/μ) u′(c1(θ0, m)).
In this context, it is not possible to implement the optimal allocation with a nonlinear income tax T^{y,m} and an estate tax T^b that is independent of the number of children m. To see this, suppose parents faced such a system. Parents choose an estate of total size b and divide it equally among children to provide each with consumption c1 = e1 + Rb/m. But then families with different numbers of children m and the same estate b would face the same marginal tax rate, contradicting equation (21), which says that the marginal tax rate should be a function of child consumption c1 and, thus, lower (i.e., a greater subsidy) for the larger family.
It is possible to implement the optimal allocation if the estate tax schedule is allowed to depend on family size m, so that parents face a tax schedule T^{b,m}. However, because the implicit tax in equation (21) depends on θ0 and m only through c1(θ0, m), it is possible to do the same with an inheritance tax that is independent of family size m. In this implementation, a parent with m children faces the budget constraint c0 + Σ_{j=1}^m b_j + mκ ≤ e0 + n0 − T̂^{y,m}(n0). Each child is then subject to the budget constraint c1,j ≤ e1 + Rb_j − T̂^b(Rb_j). The proof of the next proposition is omitted, but proceeds exactly like that of Proposition 3.12
12. We also explored an extension where parents care more about one child than the other; we omit the details but discuss the main features briefly. In this model, we assumed parents had two children indexed by j ∈ {L, H} and let the altruism coefficient for child j be βj, with βH ≥ βL. The preference for one child over another may reflect the effects of birth order, gender, beauty, or physical or intellectual resemblance. Our results are easily extended to this model. Once again, the model favors inheritance taxes over estate taxes because, with an estate tax, the marginal tax on the two children would be equalized even when their consumption is not. However, just as in equation (21), marginal tax rates should depend on the child's consumption.
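A back-of-the-envelope illustration of this point, with invented numbers: two families with the same total estate b but different numbers of children imply different per-child consumption levels and hence, by equation (21), different required marginal rates, which a single schedule T^b(b) cannot deliver, whereas an inheritance tax assessed on each child's receipt Rb_j depends only on c1.

```python
import numpy as np

# Hypothetical parameters (for illustration only)
sigma, R, nu_over_mu, e1 = 2.0, 1.5, 0.05, 0.5
u_prime = lambda c: c ** (-sigma)

b = 2.0                      # same total estate in both families
for m in (1, 3):             # number of children
    c1 = e1 + R * b / m      # per-child consumption when the estate is split equally
    tau = -R * nu_over_mu * u_prime(c1)   # required marginal rate from equation (21)
    print(f"m = {m}: per-child consumption = {c1:.3f}, required marginal rate = {tau:+.4f}")

# The two required marginal rates differ even though b is the same, so no single
# estate tax schedule T^b(b) can be consistent with (21) for both family sizes.
# An inheritance tax on each child's receipt R*b_j depends only on c1 and works.
```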
PROPOSITION 6. Suppose that the optimal allocation features strictly positive consumption. Then there exist two separate nonlinear tax schedules, an income tax T̂^{y,m} that depends on family size and an inheritance tax T̂^b independent of family size, that implement the optimal allocation. In addition, T̂^b is decreasing and convex.
Endogenous Fertility Choice. Consider now endogenous fertility choice, as in Becker and Barro (1988). Normative models with endogenous fertility may raise some conceptual problems, such as taking a stand on the utility of an unborn child. We are able to sidestep this issue by solving a subproblem over (c0(θ0), c1(θ0), n0(θ0)) for any given m(θ0).
Most of the economy's elements are the same as in the exogenous fertility case. Utility, social welfare measures, and the resource constraint are exactly the same, except for taking into account that the joint distribution over (θ0, m) has support on the locus (θ0, m(θ0)). The only difference is that the incentive-compatibility constraint becomes

(22)    u(c0(θ0)) + m(θ0)β_{m(θ0)} u(c1(θ0)) − h(n0(θ0)/θ0)
            ≥ u(c0(θ0′)) + m(θ0′)β_{m(θ0′)} u(c1(θ0′)) − h(n0(θ0′)/θ0)    ∀ θ0, θ0′,

to reflect the fact that fertility m is chosen, just like consumption c0, c1 and labor n. The planning problem maximizes V0 subject to (19), (22), and V1 ≥ V̄1, for some V̄1.
For each θ0, the same perturbation argument over consumption applies, with no change in n(θ0) or m(θ0). As a result, the optimal allocation features the same implicit taxes on estates. The implementation also works just as in the exogenous fertility case. As long as the optimal allocation features strictly positive consumption, it can be implemented with an income tax that depends on m and an inheritance tax that is independent of the number of children m. This might seem surprising given that inheritance taxes or subsidies are bound to affect the trade-off
between the quantity and the quality of children. The key insight is that the impact of progressive inheritance taxes on fertility choice can be undone by the income tax T̂^{y,m}, which depends on both income n0 and the number of children m.13
13. The argument runs as follows. Think of the parent's problem in two stages. In the second stage, given a choice for (m, n), the parent maximizes over bequests b_j = b. In the first stage, the parent chooses over (m, n). Now, the results from the preceding section apply directly to the second stage. In particular, the optimal choice of bequests will be independent of θ0 and depend only on (m, n). As for the first stage, the income tax schedule T̂^{y,m}(n) can tax prohibitively any combination of (m, n) that is not prescribed by the optimal allocation. Combining both observations, the parent faces a problem in the first stage that is essentially equivalent to the incentive constraints.
V. IMPOSING POSITIVE MARGINAL ESTATE TAXES
Our model delivers two prescriptions: marginal taxes should be progressive and negative. To study the role of the former without the latter, we now impose an ad hoc restriction that rules out negative marginal tax rates. This requires adding the inequality constraints

(23)    u′(c0(θ0)) ≤ βRu′(c1(θ0))    ∀ θ0,
to the planning problem. Although the constraint set is no longer convex, the first-order conditions are still necessary for optimality and can be used to obtain formulas for the optimal implicit tax on estates. As long as the optimal allocation features strictly positive consumption, equation (11) for the implicit marginal estate tax becomes
    τ(θ0)/(1 + τ(θ0)) = max{0, −(1/β)(ν/μ) u′(c0(θ0))}.

Of course, this implies that τ(θ0) = 0 for all θ0. Thus, when marginal tax rates are restricted to be nonnegative, the constraint binds and they are optimally set to zero. This illustrates a connection between our two results, the progressivity and the negativity of marginal tax rates: the former is not optimal without the latter.
However, we now show that such a tight connection depends crucially on the simplifying assumption of a linear savings technology. Suppose instead that if K1 goods are invested at t = 0, then G(K1) goods are available at t = 1, with G weakly concave and
twice differentiable. Constraint (3) must be replaced by

    ∫₀^∞ c1(θ0) dF(θ0) ≤ e1 + G(e0 + ∫₀^∞ n0(θ0) dF(θ0) − ∫₀^∞ c0(θ0) dF(θ0)),

where K1 = e0 + ∫₀^∞ n0(θ0) dF(θ0) − ∫₀^∞ c0(θ0) dF(θ0) ≥ 0. Additionally, instead of taking R as a parameter in the nonnegativity constraint (23), we must impose R = G′(K1). Letting φ(θ0)dF(θ0) denote the multiplier on inequality (23), and assuming that the optimal allocation features strictly positive consumption, the first-order conditions now imply

(24)    τ(θ0)/(1 + τ(θ0)) = max{0, −(1/β)(ν/μ) u′(c0(θ0)) − β (G″(K1)/λ) ∫₀^∞ [φ(θ0)/u′(c0(θ0))] dF(θ0)}.
The new term in condition (24) reflects the fact that an increase in τ(θ0), which discourages bequests for a parent with productivity θ0, now has a spillover effect on all parents. When one parent lowers bequests, this contributes to lowering aggregate capital K1, which in turn raises the pretax return R = G′(K1) and relaxes the new constraints (23) for all θ0. This effect is present only if G″(K1) < 0. The new term is positive and independent of θ0. The next proposition follows immediately from these observations.
PROPOSITION 7. Consider the planning problem with the additional constraint that τ(θ0) ≥ 0 for all θ0 and the more general savings technology G(K1), and suppose that the optimal allocation features strictly positive consumption. Then there exists a threshold θ0* such that τ(θ0) = 0 for all θ0 ≤ θ0*, and τ(θ0) is strictly positive and increasing in θ0 for all θ0 > θ0*.14
14. In general, as long as G″ < 0, a nontrivial threshold, F(θ0*) < 1, is possible. Indeed, no matter how close to zero G″ is, if the optimal allocation has c1(θ0) → ∞ as θ0 → ∞ and if u′(c) → 0 when c → ∞, then we necessarily have F(θ0*) < 1.
To gain some intuition for this result, it is useful to discuss briefly the extreme case of an economy with no savings technology. This corresponds to the limit where G(K) is infinitely concave. In an economy with no savings technology, we can still consider a
market where parents can borrow and save at a pretax return R. Given taxes, market clearing determines the pretax return R. Parents care only about the after-tax return R(1 − τ(θ0)). Thus, the overall level of marginal taxes is irrelevant in the following sense. If we change (1 − τ(θ0)) proportionally across θ0, then the new equilibrium pretax level of R changes by the inverse of this proportion, so that the after-tax return is unchanged; thus, the allocation is completely unaffected. By this logic, only the difference in marginal taxes across agents affects the allocation. As a result, progressive taxation is still optimal, but the level of taxation is not pinned down.15 In particular, imposing τ(θ0) ≥ 0 is not constraining: an equilibrium with high R achieves the same after-tax returns R(1 − τ(θ0)) with positive marginal tax rates.
15. In an earlier version of this paper, the model had no savings technology throughout. Thus, the only result we reported was the progressivity of estate taxation, not the sign of marginal estate taxes.
When the savings technology is concave, the situation is intermediate between the linear technology case and the case without a savings technology. Now, imposing τ(θ0) ≥ 0 is constraining, but this constraint is not binding for all θ0. As explained above, a positive marginal tax on a subset of high-θ0 parents increases the return R. This then raises the after-tax return on bequests for low-θ0 parents, even without subsidies.
The implementation is the same as before, so we omit the details. A tax system with a nonlinear income tax and an estate tax (or an inheritance tax) works. The estate (or inheritance) tax schedule remains convex but is now weakly increasing, due to the new constraint of nonnegative marginal taxes. The schedule is flat below some bequest level, associated with the threshold θ0*, but strictly increasing and strictly convex above that level.
In the basic model of Section II, assuming a linear savings technology had no effect on any of our results. This is consistent with the often-made observation that only first derivatives of technology appear in optimal tax formulae. Because of this, a linear technology is commonly adopted as a simplifying assumption in public finance.16
16. For example, Mirrlees (1976, pp. 329–330) adopts a linear technology and defends this assumption: "This is not a serious restriction. The linear constraint can be thought of as a linear approximation to production possibilities in the neighborhood of the optimum . . . So long as first-order necessary conditions are at issue, it does not matter . . ."
It is therefore noteworthy that, in the present context, with the nonnegativity constraint (23), the second derivative of technology G″(K1) does appear in the tax formula (24). Of
course, this nonstandard result is driven by the nonstandard constraint (23), which features the first derivative R = G′(K1).
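The threshold structure in Proposition 7 can be read directly off formula (24). The sketch below plugs assumed values for the multipliers, the (positive) spillover term, and an increasing consumption schedule c0(θ0) into the formula; none of these numbers come from the paper.

```python
import numpy as np

sigma = 2.0
beta, nu_over_mu = 0.95, 0.05
u_prime = lambda c: c ** (-sigma)

# Hypothetical increasing parental consumption schedule c0(theta0)
theta0 = np.linspace(0.5, 5.0, 12)
c0 = 0.8 + 0.6 * theta0

# First term of (24): negative, and increasing in theta0 since u'(c0) falls
first_term = -(1.0 / beta) * nu_over_mu * u_prime(c0)

# Second term of (24): -beta*(G''(K1)/lambda)*Integral(phi/u'(c0) dF) > 0 when G'' < 0.
# Here it is simply assumed to be a positive constant.
spillover = 0.03

tau_wedge = np.maximum(0.0, first_term + spillover)   # this equals tau/(1+tau)
tau = tau_wedge / (1.0 - tau_wedge)

for th, t in zip(theta0, tau):
    print(f"theta0 = {th:4.2f}   tau = {t:.4f}")
# Low-theta0 parents face tau = 0; above a threshold theta0*, tau is strictly
# positive and increasing, as in Proposition 7.
```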
VI. A MIRRLEESIAN ECONOMY WITH INFINITE HORIZON
In this section we extend the model to an infinite horizon. We state the results and sketch the proofs. A detailed analysis is available in an Online Appendix.
VI.A. An Infinite-Horizon Planning Problem
An individual born into generation t has ex ante welfare vt with
(25)    vt = E_{t−1}[u(ct) − h(nt/θt) + βv_{t+1}] = Σ_{s=0}^∞ β^s E_{t−1}[u(c_{t+s}) − h(n_{t+s}/θ_{t+s})],
where θt indexes the agent's productivity type and β < 1 is the coefficient of altruism.17 We assume that types θt are independently and identically distributed across dynasties and across generations t = 0, 1, . . . . Because innate talents are assumed noninheritable, the intergenerational transmission of welfare is not mechanically linked through the environment but may arise to provide incentives for altruistic parents. Productivity shocks are assumed to be privately observed by individuals and their descendants.
We identify dynasties by their initial utility entitlement v, with distribution ψ in the population. An allocation is a sequence of capital stocks {Kt} and, for each v, a sequence of functions {c_t^v, n_t^v} that represent consumption and effective units of labor as functions of a history of reports θ̂^t ≡ (θ̂0, θ̂1, . . . , θ̂t). For any given initial distribution of entitlements ψ, we say that an allocation ({c_t^v, n_t^v}, {Kt}) is feasible if (i) {c_t^v, n_t^v} is incentive compatible (that is, truth telling is optimal) and delivers expected utility v, and (ii) it satisfies the resource constraints
(26)    Ct + K_{t+1} ≤ F(Kt, Nt),    t = 0, 1, . . . ,
17. We assume that the utility function satisfies the Inada conditions u′(0) = ∞, u′(∞) = 0, h′(0) = 0, and h′(n̄) = ∞, where n̄ is the (possibly infinite) upper bound on work effort.
where Ct ≡ ∫ Σ_{θ^t} c_t^v(θ^t) Pr(θ^t) dψ(v) and Nt ≡ ∫ Σ_{θ^t} n_t^v(θ^t) Pr(θ^t) dψ(v) are aggregate consumption and labor, respectively.18
Given ψ and V, efficient allocations minimize the required initial capital stock K0 over feasible allocations ({c_t^v, n_t^v}, {Kt}) that verify a sequence of admissibility constraints requiring that average continuation utility across the population for every future generation be higher than V. This is a Pareto problem between current and future generations. Let V* ≡ (u(0) − E[h(n̄/θ)])/(1 − β) be the welfare associated with misery. When V = V*, the admissibility constraints are slack and future generations are taken into account only through the altruism of the first generation. This is the case studied by Atkeson and Lucas (1995) and Kocherlakota (2005). When V > V*, the admissibility constraints are binding at times.
Let β^t μt and β^t νt denote the multipliers on the resource constraint and the admissibility constraint at date t. At an interior solution, the first-order necessary conditions for consumption and capital can be rearranged to give
(27)    1/u′(c_t^v(θ^t)) = (1/(β F_K(K_{t+1}, N_{t+1}))) E_t[1/u′(c_{t+1}^v(θ^{t+1}))] − ν_{t+1}/μt.
When ν_{t+1} = 0, this optimality condition is known as the Inverse Euler equation.19 Consequently, we refer to equation (27) as the Modified Inverse Euler equation. It generalizes equation (8) to incorporate uncertainty regarding the descendants' consumption.
18. We assume that the production function F(K, N) is strictly increasing and continuously differentiable in both of its arguments, exhibits constant returns to scale, and satisfies the usual Inada conditions.
19. This condition is familiar in dynamic Mirrleesian models (Diamond and Mirrlees 1978; Rogerson 1985; Golosov, Kocherlakota, and Tsyvinski 2003; Albanesi and Sleet 2006).
VI.B. Linear Inheritance Taxes
A simple implementation proceeds along the lines of Kocherlakota (2005) and features linear taxes on inherited wealth. Consider an efficient interior allocation {c_t^v(θ^t), n_t^v(θ^t)}. The tax implementation works as follows. In each period, conditional on the history of their dynasty's reports θ̂^{t−1} and any inherited wealth, individuals report their current shock θ̂t, produce, consume, pay
664
QUARTERLY JOURNAL OF ECONOMICS
taxes, and bequeath wealth subject to the budget constraints

(28)    ct(θ^t) + bt(θ^t) ≤ Wt n_t^v(θ̂^t) − T_t^v(θ̂^t) + (1 − τ̂_t^v(θ̂^t)) R_{t−1,t} b_{t−1}(θ^{t−1}),
where Wt = F_N(Kt, Nt) is the wage, R_{t−1,t} = F_K(Kt, Nt) is the interest rate, and initially b_{−1} = K0. Individuals are subject to two forms of taxation: a labor income tax T_t^v(θ̂^t) and a proportional tax on inherited wealth R_{t−1,t}b_{t−1} at rate τ̂_t^v(θ̂^t).20 The idea is to devise a tax policy that induces all agents to be truthful and to bequeath bt = Kt. Following Kocherlakota (2005), set the linear tax on inherited wealth to
    τ̂_t^v(θ^t) = 1 − (1/(β R_{t−1,t})) u′(c_{t−1}^v(θ^{t−1})) / u′(c_t^v(θ^t)).

Choose the labor income tax so that the budget constraint holds with equality:

    T_t^v(θ^t) = Wt n_t^v(θ^t) + (1 − τ̂_t^v(θ^t)) R_{t−1,t} Kt − c_t^v(θ^t) − K_{t+1}.

These choices work because, for any reporting strategy, the agent's consumption Euler equation holds. Because the budget constraints hold with equality, the bequest choice bt = Kt is optimal regardless of the reporting strategy. The allocation is incentive compatible by hypothesis, so it follows that truth telling is optimal. Resource feasibility ensures that markets clear.
The assignment of consumption and labor in any period depends on the history of reports in a way that can be summarized by the continuation utility vt(θ^{t−1}). Therefore, the inheritance tax τ̂_t^v(θ^{t−1}, θt) can be expressed as a function of vt(θ^{t−1}) and θt; abusing notation, we denote this by τ̂t(vt, θt). Similarly, write c_{t−1}(vt) for c_{t−1}^v(θ^{t−1}). The average inheritance tax rate τ̄t(vt) is then defined by τ̄t(vt) ≡ Σ_θ τ̂t(vt, θ) Pr(θ). Using the modified inverse Euler equation (27), we obtain

(29)    τ̄t(vt) = −(νt/μ_{t−1}) u′(c_{t−1}(vt)).

20. In this formulation, taxes are a function of the entire history of reports, and labor income nt is mandated given this history. However, if the labor income histories n_t: Θ^t → R being implemented are invertible, then by the taxation principle we can rewrite T and τ as functions of this history of labor income and avoid having to mandate labor income. Under this arrangement, individuals do not make reports on their shocks, but instead simply choose a budget-feasible allocation of consumption and labor income, taking prices and the tax system as given. See Kocherlakota (2005).
Formula (29) is the exact analog of equation (11). Note that in the Atkeson–Lucas benchmark, where the welfare of future generations is taken into account only through the altruism of the first generation, the average inheritance tax is equal to zero, exactly as in Kocherlakota (2005). Both the negative sign and the progressivity of average inheritance taxes derive directly from the desire to insure future generations against the risk of being born to a poor family.
VI.C. Discussion: Long-Run Inequality and Estate Taxation
We now turn to the implications of our results for the dynamics of inequality. For this purpose, it is useful to organize the discussion around the concept of steady states. We specialize to the logarithmic utility case, u(c) = log(c). This simplifies things because 1/u′(c) = c, which is the expression that appears in the first-order optimality condition (27).
A steady state consists of a distribution of utility entitlements ψ* and a welfare level V* such that the solution to the planning problem features, in each period, a cross-sectional distribution of continuation utilities vt that is also distributed according to ψ*. We also require the cross-sectional distribution of consumption and work effort to replicate itself over time. As a result, all aggregates are constant in a steady state. In particular, Kt = K*, Nt = N*, Rt = R*, etc.
Consider first the case where V = −∞. Suppose that there exists an invariant distribution ψ, and let R be the associated interest rate. The admissibility constraints are slack and νt = 0, giving the standard Inverse Euler equation,
(30)    c_t^v(θ^t) = (1/(βR*)) E_t[c_{t+1}^v(θ^{t+1})].
Integrating over v and θ^t, it follows that C_{t+1} = βR* Ct, which is consistent with a steady state only if βR* = 1. However, equation (30) then implies that consumption is a positive martingale. By the Martingale Convergence Theorem, consumption must converge almost surely to a finite constant. Indeed, one can argue that ct → 0 and vt → −∞.21
21. This follows because consumption ct is a monotonic function of v_{t+1}. However, if v_{t+1} converges to a finite value, then the incentive constraints must be slack. This can be shown to contradict optimality when h′(0) = 0, as we have assumed here.
We conclude that no steady state exists in
this case, which echoes the immiseration result in Atkeson and Lucas (1992).
Now suppose that V > −∞. In a steady state, the admissibility constraints are binding and μt/νt is equal to a strictly positive constant. To be compatible with some constant average consumption c̄, equation (27) requires R* < 1/β and can be rewritten as

    E_t[c_{t+1}^v] = βR* c_t^v + (1 − βR*) c̄.

Consumption is an autoregressive process, mean reverting toward average consumption c̄ at rate βR* < 1. Just as in the two-period case, the intergenerational transmission of welfare is imperfect: the impact of dynasties' initial entitlements dies out over generations, and lim_{j→∞} E_t[c_{t+j}] = c̄. Indeed, one can show that a steady state may exist with bounded inequality. Moreover, at the steady state there is a strong form of social mobility in that, regardless of their ancestor's welfare position vt, the conditional distribution at t for v_{t+j} of distant descendants converges to ψ* as j → ∞.
VII. CONCLUDING REMARKS
Our analysis delivers sharp results for the optimal estate tax. We explored a number of extensions. We conjecture that the mechanism we isolate here will remain important in other settings. We close by briefly mentioning two important issues omitted in the present paper. First, in our model, the lifetimes of parents and children do not overlap. If this simplifying assumption were dropped, inter vivos transfers would have to be considered alongside bequests. Second, the focus in this paper was entirely normative. However, in an intergenerational context, questions of political economy and lack of commitment arise naturally. Farhi and Werning (2008) explore such a model and find that taxation remains progressive but that the marginal tax may be positive.
APPENDIX
A. Proof of Proposition 3
Equation (13) implies that T^b is decreasing and convex and that
(31)    T^b′(R⁻¹(c1(θ0) − e1)) = τ(θ0).
Furthermore, because c0(θ0) and c1(θ0) are nondecreasing in θ0, it follows from incentive compatibility that there are functions ĉ0(n) and ĉ1(n) such that ĉ0(n0(θ0)) = c0(θ0) and ĉ1(n0(θ0)) = c1(θ0). Let N = {n : ∃θ0 s.t. n = n0(θ0)} be the equilibrium set of labor choices, in efficiency units. Next, define the income tax function as

    T^y(n) ≡ n + e0 − ĉ0(n) + R⁻¹(e1 − ĉ1(n)) − T^b(R⁻¹(ĉ1(n) − e1))

if n ∈ N and T^y(n) = ∞ if n ∉ N.
We now show that the constructed tax functions, T^y(n) and T^b(b), implement the optimal allocation. Clearly, parents cannot choose n ∉ N. For any given n ∈ N, the parent's subproblem over consumption choices is

    V(n) ≡ max_{c0,c1} {u(c0) + βu(c1)}

subject to c0 + R⁻¹(c1 − e1) + T^b(R⁻¹(c1 − e1)) ≤ n − T^y(n). Using the fact that T^b is convex, it follows that the constraint set is convex. The objective function is concave. Thus, the first-order condition (1 + T^b′(R⁻¹(c1 − e1)))u′(c0) = βRu′(c1) is sufficient for an interior optimum. Combining equations (6) and (31), it follows that ĉ0(n), ĉ1(n) are optimal. Hence V(n) = u(ĉ0(n)) + βu(ĉ1(n)).
Next, consider the parent's maximization over n given by max_{n∈N} {V(n) − h(n/θ0)}.
We need to show that n0(θ0) solves this problem, which implies that the allocation is implemented, because consumption would then be given by ĉ0(n0(θ0)) = c0(θ0) and ĉ1(n0(θ0)) = c1(θ0). Now, from the preceding paragraph and our definitions it follows that

    n0(θ0) ∈ arg max_{n∈N} {V(n) − h(n/θ0)}
    ⇔ n0(θ0) ∈ arg max_{n∈N} {u(ĉ0(n)) + βu(ĉ1(n)) − h(n/θ0)}
    ⇔ θ0 ∈ arg max_{θ} {u(c0(θ)) + βu(c1(θ)) − h(n0(θ)/θ0)}.
Thus, the first line follows from the last, which is guaranteed by the assumed incentive compatibility of the allocation, conditions (4). Hence, n0 (θ0 ) is optimal and it follows that (c0 (θ0 ), c1 (θ0 ), n0 (θ0 )) is implemented by the constructed tax functions.
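A sketch of the construction in this proof, using an invented allocation: the slope of T^b at the equilibrium bequest level b(θ0) = R⁻¹(c1(θ0) − e1) is set equal to τ(θ0) = −R(ν/μ)u′(c1(θ0)), as in (31), and the resulting schedule is checked to be decreasing and convex. All parameter values are hypothetical.

```python
import numpy as np

# Assumed primitives for illustration: CRRA utility and a linear allocation in theta0
sigma, R, nu_over_mu, e1 = 2.0, 1.5, 0.05, 0.5
u_prime = lambda c: c ** (-sigma)

theta0 = np.linspace(0.5, 5.0, 400)
c1 = 1.0 + 0.8 * theta0                  # child consumption, increasing in theta0
b = (c1 - e1) / R                        # equilibrium bequest b(theta0)
tau = -R * nu_over_mu * u_prime(c1)      # implicit marginal estate tax

# Build T^b on the equilibrium bequest grid so that (T^b)'(b(theta0)) = tau(theta0),
# as required by equation (31); the intercept at the lowest bequest is normalized to 0.
Tb = np.concatenate(([0.0], np.cumsum(0.5 * (tau[1:] + tau[:-1]) * np.diff(b))))

slopes = np.diff(Tb) / np.diff(b)
print("T^b decreasing:", np.all(np.diff(Tb) < 0))
print("T^b convex (slopes increasing):", np.all(np.diff(slopes) > 0))
```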
B. Proof of Proposition 4
We can implement this allocation with an income tax T^y(n), a lump-sum tax T1^y = e1 − c̲1, and a no-debt constraint mandating that b ≥ 0. Exactly as in the proof of Proposition 3, we can define the functions ĉ0(n) and ĉ1(n) such that ĉ0(n0(θ0)) = c0(θ0) and ĉ1(n0(θ0)) = c1(θ0). Let N = {n : ∃θ0 s.t. n = n0(θ0)} be the equilibrium set of labor choices, in efficiency units. Next, define the income tax function as

    T^y(n) ≡ n + e0 − ĉ0(n) + R⁻¹(c̲1 − ĉ1(n))

if n ∈ N and T^y(n) = ∞ if n ∉ N. Clearly, parents cannot choose n ∉ N. For any given n ∈ N, the parent's subproblem over consumption choices is

    V(n) = max_{c0,c1} {u(c0) + βu(c1)}
subject to

    c0 + (c1 − c̲1)/R ≤ n − T^y(n),
    c1 ≥ c̲1.

This is a concave problem with solution (ĉ0(n), ĉ1(n)), so that V(n) = u(ĉ0(n)) + βu(ĉ1(n)). The parent's maximization problem over n is W(θ0) = max_{n∈N} {V(n) − h(n/θ0)}.
We need to prove that n0(θ0) solves this problem, that is, that

    n0(θ0) ∈ arg max_{n∈N} {V(n) − h(n/θ0)}
    ⇔ n0(θ0) ∈ arg max_{n∈N} {u(ĉ0(n)) + βu(ĉ1(n)) − h(n/θ0)}
    ⇔ θ0 ∈ arg max_{θ} {u(c0(θ)) + βu(c1(θ)) − h(n0(θ)/θ0)}.
Thus, the first line follows from the last, which is guaranteed by the assumed incentive compatibility of the allocation, conditions (4). Hence, n0 (θ0 ) is optimal and it follows that (c0 (θ0 ), c1 (θ0 ), n0 (θ0 )) is implemented by the constructed tax functions and the no-debt constraint.
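For intuition on the corner b = 0, the following sketch (assumed CRRA utility and made-up budget numbers) solves the parent's subproblem from this proof for a low-resource parent and checks the strict Euler inequality and the implied negative τ at the constrained optimum.

```python
import numpy as np

# Assumed illustrative parameters
sigma, beta, R = 2.0, 0.95, 1.5
u = lambda c: c ** (1 - sigma) / (1 - sigma)
u_prime = lambda c: c ** (-sigma)

c1_floor = 1.0          # guaranteed child consumption (the floor consumption)
w = 0.8                 # after-tax resources n - T^y(n) of a low-productivity parent

# Parent chooses c1 >= c1_floor; the budget c0 + (c1 - c1_floor)/R <= w pins down c0.
c1_grid = np.linspace(c1_floor, c1_floor + R * w - 1e-6, 10_000)
c0_grid = w - (c1_grid - c1_floor) / R
values = u(c0_grid) + beta * u(c1_grid)
i = np.argmax(values)
c0_star, c1_star = c0_grid[i], c1_grid[i]

print(f"optimal c1 = {c1_star:.4f} (floor = {c1_floor})")
print(f"u'(c0) = {u_prime(c0_star):.3f}  vs  beta*R*u'(c1) = {beta * R * u_prime(c1_star):.3f}")
tau = beta * R * u_prime(c1_star) / u_prime(c0_star) - 1.0
print(f"implicit estate tax tau = {tau:+.3f}  (negative: an implicit subsidy)")
```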
C. Proof of Proposition 5
We can separate the planning problem into two steps: first, solve the optimal allocation in terms of the reduced allocation {c0(θ0), e(θ0), n0(θ0)}; second, solve for c1(θ0) and x(θ0) using the program (18). The reduced allocation {c0(θ0), e(θ0), n0(θ0)} is the solution of the planning program

    max ∫₀^∞ [u(c0(θ0)) − h(n0(θ0)/θ0) + βV1(e(θ0))] dF(θ0)
subject to the resource constraint

    ∫₀^∞ [c0(θ0) + e(θ0)/R] dF(θ0) ≤ e0 + ∫₀^∞ n0(θ0) dF(θ0),

the incentive-compatibility constraints

    u(c0(θ0)) + βV1(e(θ0)) − h(n0(θ0)/θ0) ≥ u(c0(θ0′)) + βV1(e(θ0′)) − h(n0(θ0′)/θ0)    ∀ θ0, θ0′,

and the promise-keeping constraint

    ∫₀^∞ V1(e(θ0)) dF(θ0) ≥ V̄1.
This problem is the exact analog of our original planning problem with c1(θ0) replaced by e(θ0) and child utility u(c1(θ0)) replaced by V1(e(θ0)). Therefore we know that c0(θ0) and e(θ0) are increasing in θ0. We also know that c1(θ0) − e1 − H(x(θ0)) and x(θ0) are increasing in θ0.
We use the generalized inverse of x(θ), namely x⁻¹(x) = inf{θ0 : x(θ0) ≥ x}, to define T^x′(x) = τ(x⁻¹(x)), and set any value T^x(x*) for the intercept at some x* > 0. We use the generalized inverse of (c1 − e1 − H(x))(θ), namely (c1 − e1 − H(x))⁻¹(z) = inf{θ0 : (c1 − e1 − H(x))(θ0) ≥ z}, to define T^b′(b) = τ((c1 − e1 − H(x))⁻¹(Rb)), and set any value T^b(0) for the intercept at b = 0. Note that by the monotonicity of τ(θ), x(θ), and (c1 − e1 − H(x))(θ), the functions T^b and T^x are convex.
Recall that c0(θ0), c1(θ0), x(θ0), and n(θ0) are increasing functions of θ0. Moreover, these functions are constant on the same intervals, if such intervals exist. As a result, exactly as in the proof of Proposition 3, we can define functions ĉ0(n), ĉ1(n), and x̂(n) such that ĉ0(n0(θ0)) = c0(θ0), ĉ1(n0(θ0)) = c1(θ0), and x̂(n0(θ0)) = x(θ0). Let N = {n : ∃θ0 s.t. n = n0(θ0)} be the equilibrium set of labor choices, in efficiency units. Next, define the income tax function as

    T^y(n) ≡ n + e0 − x̂(n) − ĉ0(n) + R⁻¹(e1 + H(x̂(n)) − ĉ1(n)) − T^b((ĉ1(n) − e1 − H(x̂(n)))/R) − T^x(x̂(n))

if n ∈ N and T^y(n) = ∞ if n ∉ N.
We now show that the constructed tax functions implement the allocation. Clearly, parents cannot choose n ∉ N. For any given n ∈ N, the parent's subproblem over consumption choices is

    V(n) ≡ max_{c0,c1,x} {u(c0) + βU(c1, H(x))}

subject to

    c0 + R⁻¹(c1 − e1 − H(x)) + x + T^b((c1 − e1 − H(x))/R) + T^x(x) ≤ n − T^y(n).

This problem is convex: the objective is concave, and the constraint set is convex because T^b and T^x are convex. It follows that the first-order conditions

    1 = (βR / (1 + T^b′((c1 − e1 − H(x))/R))) · U_{c1}(c1, H(x)) / u′(c0),
    1 = (β / (1 + T^x′(x))) · U_H(c1, H(x)) / u′(c0)
are sufficient for an interior optimum. It follows from the construction of the tax functions T^b and T^x that these conditions for optimality are satisfied by ĉ0(n), ĉ1(n), x̂(n). Hence V(n) = u(ĉ0(n)) + βU(ĉ1(n), H(x̂(n))).
Next, consider the parent's maximization over n given by max_{n∈N} {V(n) − h(n/θ0)}.
We need to show that n0(θ0) solves this problem, that is, that

    n0(θ0) ∈ arg max_{n∈N} {V(n) − h(n/θ0)}
    ⇔ n0(θ0) ∈ arg max_{n∈N} {u(ĉ0(n)) + βU(ĉ1(n), H(x̂(n))) − h(n/θ0)}
    ⇔ θ0 ∈ arg max_{θ} {u(c0(θ)) + βU(c1(θ), H(x(θ))) − h(n0(θ)/θ0)}.
Thus, the first line follows from the last, which is guaranteed by the assumed incentive compatibility of the allocation. Hence, n0(θ0) is optimal, and it follows that the optimal allocation (c0(θ0), c1(θ0), n0(θ0), x(θ0)) is implemented by the constructed tax functions.
D. Proof of Result with Rawlsian Welfare Function for Parents
The corresponding planning problem is to maximize min_{θ0} v0(θ0) subject to the resource constraint (3), the Rawlsian constraint for children (15), and the incentive-compatibility constraints (4). The difficulty is that there is no representation of the objective function W0 = min_{θ0} v0(θ0) of the form W0 = ∫₀^∞ Ŵ0(v0(θ0), θ0) dF(θ0). This difficulty is easily overcome by noting that the planning problem is concave. There is a one-to-one correspondence between the solutions of this problem and those of the dual problem of minimizing
    ∫₀^∞ c0(θ0) dF(θ0) + (1/R) ∫₀^∞ c1(θ0) dF(θ0)
subject to the Rawlsian constraint for parents

(32)    v0(θ0) ≥ v̲0    for all θ0,
the Rawlsian constraint for children (15), and the incentive-compatibility constraints (4). The dual problem is more tractable because the objective function is differentiable. Exactly as above, it can be shown that there exists θ̄0 > 0 such that constraint (15) is binding for all θ0 ≤ θ̄0. Then, for all θ0 ≥ θ̄0, the implicit estate tax is zero. When θ0 < θ̄0, the implicit estate tax is given by (16). The rest of the analysis follows.
The dual problem also allows us to tackle the case where v̲0 = u̲1, which would lead to a welfare function that is Rawlsian both across and within generations. There again, Proposition 4 applies.
E. Proof of Equation (24)
Let φ(θ0)dF(θ0) denote the multiplier on inequality (23). The first-order conditions can then be rearranged to obtain the implicit
marginal estate tax rate:

    τ(θ0)/(1 + τ(θ0)) = −(1/β)(ν/μ) u′(c0(θ0))
        + (φ(θ0)/λ) [βR c′(u(c0(θ0))) + β⁻¹ c′(u(c1(θ0)))] / c′(u(c0(θ0)))
        − β (G″(K1)/λ) ∫₀^∞ c′(u(c0(θ0))) φ(θ0) dF(θ0),

where the function c(u) is the inverse of the utility function u(c), implying that c′(u) > 0 and c″(u) > 0; R ≡ G′(K1); and K1 = e0 + ∫₀^∞ n0(θ0) dF(θ0) − ∫₀^∞ c0(θ0) dF(θ0). Together with τ(θ0) ≥ 0, φ(θ0) ≥ 0, and the complementary slackness condition φ(θ0)τ(θ0) = 0, this implies the formula in the text.
HARVARD UNIVERSITY AND TOULOUSE SCHOOL OF ECONOMICS
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
REFERENCES
Albanesi, Stefania, and Christopher Sleet, "Dynamic Optimal Taxation with Private Information," Review of Economic Studies, 73 (2006), 1–30.
Atkeson, Andrew, and Robert E. Lucas, Jr., "On Efficient Distribution with Private Information," Review of Economic Studies, 59 (1992), 427–453.
———, "Efficiency and Equality in a Simple Model of Unemployment Insurance," Journal of Economic Theory, 66 (1995), 64–88.
Atkinson, Anthony B., and Joseph E. Stiglitz, "The Design of Tax Structure: Direct vs. Indirect Taxation," Journal of Public Economics, 6 (1976), 55–75.
Becker, Gary S., and Robert J. Barro, "A Reformulation of the Economic Theory of Fertility," Quarterly Journal of Economics, 103 (1988), 1–25.
Cremer, Helmuth, and Pierre Pestieau, "Non-linear Taxation of Bequests, Equal Sharing Rules and the Tradeoff between Intra- and Inter-family Inequalities," Journal of Public Economics, 79 (2001), 35–53.
Diamond, Peter A., and James A. Mirrlees, "A Model of Social Insurance with Variable Retirement," Journal of Public Economics, 10 (1978), 295–336.
Diamond, Peter A., "Optimal Income Taxation: An Example with a U-Shaped Pattern of Optimal Marginal Tax Rates," American Economic Review, 88 (1998), 83–95.
Ebert, Udo, "A Reexamination of the Optimal Nonlinear Income Tax," Journal of Public Economics, 49 (1992), 47–73.
Farhi, Emmanuel, and Iván Werning, "Inequality and Social Discounting," Journal of Political Economy, 115 (2007), 365–402.
———, "The Political Economy of Non-linear Capital Taxation," Harvard University and MIT, Mimeo, 2008.
Golosov, Mikhail, Narayana Kocherlakota, and Aleh Tsyvinski, "Optimal Indirect and Capital Taxation," Review of Economic Studies, 70 (2003), 569–587.
Grochulski, Borys, and Tomasz Piskorski, "Risky Human Capital and Deferred Capital Income Taxation," Journal of Economic Theory, forthcoming, 2010.
Kapicka, Marek, "Optimal Income Taxation with Human Capital Accumulation and Limited Record Keeping," Review of Economic Dynamics, 9 (2006), 612–639.
———, "The Dynamics of Optimal Taxation When Human Capital Is Endogenous," UCSB, Mimeo, 2008.
Kaplow, Louis, "A Note on Subsidizing Gifts," Journal of Public Economics, 58 (1995), 469–477.
———, "A Framework for Assessing Estate and Gift Taxation," NBER Working Paper No. 7775, 2000.
Kocherlakota, Narayana, "Zero Expected Wealth Taxes: A Mirrlees Approach to Dynamic Optimal Taxation," Econometrica, 73 (2005), 1587–1621.
Mirrlees, James A., "Optimal Tax Theory: A Synthesis," Journal of Public Economics, 6 (1976), 327–358.
———, "An Exploration in the Theory of Optimum Income Taxation," Review of Economic Studies, 38 (1971), 175–208.
Phelan, Christopher, "Opportunity and Social Mobility," Review of Economic Studies, 73 (2006), 487–505.
Rogerson, William P., "Repeated Moral Hazard," Econometrica, 53 (1985), 69–76.
Saez, Emmanuel, "Using Elasticities to Derive Optimal Income Tax Rates," Review of Economic Studies, 68 (2001), 205–229.
Seade, Jesus, "On the Sign of the Optimum Marginal Income Tax," Review of Economic Studies, 49 (1982), 637–643.
Sleet, Christopher, and Sevin Yeltekin, "Credibility and Endogenous Social Discounting," Review of Economic Dynamics, 9 (2006), 410–437.
Tuomala, Matti, Optimal Income Taxation and Redistribution (New York: Oxford University Press/Clarendon Press, 1990).
FREQUENCY OF PRICE ADJUSTMENT AND PASS-THROUGH∗
GITA GOPINATH AND OLEG ITSKHOKI
We empirically document, using U.S. import prices, that on average goods with a high frequency of price adjustment have a long-run pass-through that is at least twice as high as that of low-frequency adjusters. We show theoretically that this relationship should follow because variable mark-ups that reduce long-run pass-through also reduce the curvature of the profit function when expressed as a function of cost shocks, making the firm less willing to adjust its price. We quantitatively evaluate a dynamic menu-cost model and show that the variable mark-up channel can generate significant variation in frequency, equivalent to 37% of the observed variation in the data. On the other hand, the standard workhorse model with constant elasticity of demand and Calvo or state-dependent pricing has difficulty matching the facts.
∗ We wish to thank the International Price Program of the Bureau of Labor Statistics (BLS) for access to unpublished micro data. We owe a huge debt of gratitude to our project coordinator, Rozi Ulics, for her invaluable help on this project. The views expressed here do not necessarily reflect the views of the BLS. We are grateful to Robert Barro, Elhanan Helpman, and three anonymous referees for detailed comments. We thank participants at several venues for comments and Loukas Karabarbounis for excellent research assistance. This research is supported by NSF Grant SES 0617256.
© 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, May 2010
I. INTRODUCTION
There is a current surge in research that investigates the behavior of prices using micro data with the goal of comprehending key aggregate phenomena such as the gradual adjustment of prices to shocks. A common finding of these studies is that there is large heterogeneity in the frequency of price adjustment even within detailed categories of goods. However, there is little evidence that this heterogeneity is meaningfully correlated with other measurable statistics in the data.1 This makes it difficult to discern what the frequency measure implies for the transmission of shocks and which models of price setting best fit the data, both of which are important for understanding the effects of monetary and exchange rate policy.
1. It is clearly the case that raw/homogeneous goods display a higher frequency of adjustment than differentiated goods, as documented in Bils and Klenow (2004) and Gopinath and Rigobon (2008). But outside of this finding, there is little that empirically correlates with frequency. Bils and Klenow (2004) and Kehoe and Midrigan (2007) are recent papers that make this point.
In this paper we exploit the open economy environment to shed light on these questions. The advantage of the international data over the closed-economy data is that they provide a
well-identified and sizable cost shock, namely the exchange rate shock. We find that there is indeed a systematic relation between the frequency of price adjustment and long-run exchange rate pass-through. First, we document empirically that on average high-frequency adjusters have a long-run pass-through that is significantly higher than that of low-frequency adjusters. Next, we show theoretically that long-run pass-through is determined by primitives that shape the curvature of the profit function and hence also affect the frequency of price adjustment, and theory predicts a positive relation between the two in an environment with variable mark-ups. Last, we calibrate a dynamic menu-cost model and show that the variable mark-up channel can generate significant variation in frequency, equivalent to 37% of the observed variation in the data. The standard workhorse model with constant elasticity of demand and Calvo or state-dependent pricing generates long-run pass-through that is uncorrelated with frequency, contrary to the data.
We document the relation between frequency and long-run pass-through using micro data on U.S. import prices at the dock.2 Long-run pass-through is a measure of pass-through that does not compound the effects of nominal rigidity. We divide goods imported into the United States into frequency bins and use two nonstructural approaches to estimate long-run exchange rate pass-through within each bin. First, we regress the cumulative change in the price of the good over its life in the sample, referred to as its lifelong price change, on the exchange rate movement over the same period. Second, we estimate an aggregate pass-through regression and compute the cumulative impulse response of the average monthly change in import prices within each bin to a change in the exchange rate over a 24-month period. Either procedure generates similar results: when goods are divided into two equal-sized frequency bins, goods with frequency higher than the median frequency of price adjustment display, on average, long-run pass-through that is at least twice as high as that for goods with frequency less than the median frequency.
2. The advantage of using prices at the dock is that they do not compound the effect of local distribution costs, which play a crucial role in generating low pass-through into consumer prices.
For the sample of firms in the manufacturing sector, high-frequency adjusters have a pass-through of 44% compared to low-frequency adjusters with a pass-through of 21%. In the subsample
of importers in the manufacturing sector from high-income OECD countries, high-frequency adjusters have a pass-through of 59% compared to 25% for the low-frequency adjusters. This result similarly holds for the subsample of differentiated goods based on the Rauch (1999) classification. When we divide goods into frequency deciles, so that frequency ranges between 3% and 100% per month, long-run pass-through increases from around 18% to 75% for the subsample of imports from high-income OECD countries. Therefore, the data are characterized not only by a positive relationship between frequency and long-run pass-through, but also by a wide range of variation in both variables.
Both frequency and long-run pass-through depend on primitives that affect the curvature of the profit function. In Section III we show that it is indeed the case that higher long-run pass-through should be associated with a higher frequency of price adjustment. We analyze a static price-setting model where long-run pass-through is incomplete and firms pay a menu cost to adjust preset prices in response to cost shocks.3 We allow for two standard channels of incomplete long-run exchange rate pass-through: (i) variable mark-ups and (ii) imported intermediate inputs. A higher mark-up elasticity raises the curvature of the profit function with respect to prices; that is, it reduces the region of nonadjustment. However, it also reduces the firm's desired price adjustment, so that the firm's price is more likely to stay within the bounds of nonadjustment. We show that this second effect dominates, implying that a higher mark-up elasticity lowers both pass-through and frequency. Put differently, it reduces the curvature of the profit function when expressed as a function of the cost shocks, generating lower frequency.
3. Our price-setting model is closest in spirit to Ball and Mankiw (1994), whereas the analysis of the determinants of frequency relates closely to the exercise in Romer (1989), who constructs a model with complete pass-through (CES demand) and Calvo price setting with optimization over the Calvo probability of price adjustment. Other theoretical studies of frequency include Barro (1972), Sheshinski and Weiss (1977), Rotemberg and Saloner (1987), and Dotsey, King, and Wolman (1999). Finally, Devereux and Yetman (2008) study the relationship between frequency of price adjustment and short-run exchange rate pass-through in an environment with complete long-run pass-through.
The positive relationship between frequency and long-run pass-through implies the existence of a selection effect, wherein firms that infrequently adjust prices are typically not as far from their desired prices, due to their lower desired pass-through of cost shocks. On the other hand, firms that have high desired
pass-through drift farther away from their optimal price and, therefore, make more frequent adjustments. This potentially has important implications for the strength of nominal rigidities given the median duration of prices in the economy. It is important to stress that this selection effect is different from the classical selection effect of state-dependent models forcefully demonstrated by Caplin and Spulber (1987), and it will also be present in time-dependent models with optimally chosen periods of nonadjustment, as in Ball, Mankiw, and Romer (1988).
In Section IV we quantitatively solve for the industry equilibrium in a dynamic price-setting model. The standard model of sticky prices in the open economy assumes CES demand and Calvo price adjustment.4 These models predict incomplete pass-through in the short run, when prices are rigid and set in the local currency, but perfect pass-through in the long run. To fit the data, we depart from this standard setup. First, we allow endogenous frequency choice via a menu cost model of state-dependent pricing. Second, we allow variable mark-ups, à la Dornbusch (1987) and Krugman (1987), which generate incomplete long-run pass-through. This source of incomplete pass-through has received considerable support in the open economy empirical literature, as we discuss in Section IV. We examine how variation in the mark-up elasticity across firms affects the frequency of price adjustment.
4. See the seminal contribution of Obstfeld and Rogoff (1995) and the subsequent literature surveyed in Lane (2001). Midrigan (2007) analyzes an environment with state-dependent pricing, but assumes constant mark-ups and complete pass-through; Bergin and Feenstra (2001) allow for variable mark-ups in an environment with price stickiness, but they assume exogenous periods of nonadjustment.
We present four sets of results. First, variation in mark-up elasticity can indeed generate a strong positive relation between frequency and long-run pass-through (LRPT) and can generate significant variation in frequency, equivalent to 37% of the observed variation in the data. The model generates a standard deviation in frequency across goods of 11%, as compared to 30% in the data. Second, a menu cost model that allows for joint cross-sectional variation in mark-up elasticity and menu costs can quantitatively account for both the positive slope between LRPT and frequency and the close-to-zero slope between the absolute size of price adjustment and frequency in the data. The model generates a slope of 0.55 between frequency and LRPT, whereas in
the data it is 0.56. Similarly, the slope coefficient for the relation between frequency and size is −0.05 in the model, close to the data estimate of −0.01. Further, it generates dispersion in frequency equivalent to 60% of the dispersion in the data. In both simulations the model matches the median absolute size of price adjustment of 7%. Third, we show that the nonstructural passthrough regressions estimated in the empirical section recover the true underlying LRPT. Fourth, we verify that the observed correlation between frequency and LRPT cannot be explained by standard sticky price models with only exogenous differences in the frequency of price adjustment and no variation in LRPT. Section II presents the empirical evidence. Section III presents the static model of frequency and LRPT, and Section IV describes the calibration of a dynamic model and its ability to match the facts. Section V concludes. All proofs are relegated to the Appendix.
II. EMPIRICAL EVIDENCE In this section we empirically evaluate the relation between the frequency of price adjustment of a good and the long-run response of the price of the good to an exchange rate shock. The latter, referred to as long-run exchange rate pass-through (LRPT), is defined to capture pass-through beyond the period when nominal rigidities in price setting are in effect. In the presence of strategic complementarities in price setting or other forms of real rigidities, this can require multiple rounds of price adjustment.5 We use two nonstructural approaches to estimate LRPT from the data. Our main finding is that goods whose prices adjust more frequently also have a higher exchange rate pass-through in the long run than low-frequency adjusters. In Section IV.C we estimate the same regressions on data simulated from conventional sticky price models and verify that both of these regressions indeed deliver estimates close to the true theoretical LRPT. II.A. Data and Methodology We use micro data on the prices of goods imported into the United States provided to us by the Bureau of Labor Statistics 5. Other sources of sluggish adjustment could include the presence of informational frictions or convex adjustment costs in price setting.
(BLS) for the period 1994–2005. The details regarding this data set are provided in Gopinath and Rigobon (2008). We focus on a subset of the data that satisfies the following criteria. First, we restrict attention to market transactions and exclude intrafirm transactions, as we are interested in price setting driven mainly by market forces.6 Second, we require that a good have at least one price adjustment during its life. This is because the goal of the analysis is to relate the frequency of price adjustment to the flexible price passthrough of the good, and this requires observing at least one price change. In this database 30% of goods have a fixed price during their life. For the purpose of our study these goods are not useful and they are excluded from the analysis. We revisit this issue at the end of this section, when we comment on item substitution. Third, we restrict attention to dollar-priced imports in the manufacturing sector.7 The restriction to manufactured goods allows us to focus on price-setting behavior where firms have market power and goods are not homogeneous. We restrict attention to dollar-priced goods, in order to focus on the question of frequency choice, setting aside the question of currency choice. This restriction does not substantially reduce the sample size because 90% of goods imported are priced in dollars. For the analysis of the relation between currency choice and pass-through see Gopinath, Itskhoki, and Rigobon (2009). The relation between the two papers is discussed in Section II.D. For each of the remaining goods we estimate the frequency of price adjustment following the procedure in Gopinath and Rigobon (2008). We then sort goods into high- and low-frequency bins, depending on whether the good’s frequency is higher or lower than the median frequency, and estimate LRPT within each bin. The first approach estimates exchange rate pass-through over the life of the good in the BLS sample. Specifically, for each good, we measure the cumulative change in the price of the good from its first observed new price to its last observed new price in the BLS data. We refer to this as the lifelong change in price. We then relate it to the cumulative change in the exchange rate over this 6. A significant fraction of trade takes place intrafirm and these transactions constitute about 40% of the BLS sample. For empirical evidence on the difference between intrafirm and arms-length transactions, using this data set, see Neiman (2007) and Gopinath and Rigobon (2008). 7. That is, goods that have a one-digit SIC code of 2 or 3. We exclude any petrol classification codes.
[Figure I: time path of the log price of good i ($p_{it}$) and the log bilateral real exchange rate ($RER_t$) over a 30-month window.]
FIGURE I
Illustration for Lifelong Pass-Through Regression (1)
In this hypothetical example we observe the price of good i from t = 0 to t = 30 months. The figure plots the observed price and the corresponding bilateral real exchange rate for the same period, both in logs. The first observed new price is set at $t_1 = 3$ and the last observed new price is set at $t_2 = 22$. Therefore, for this good we have $\Delta p_L^{i,c} = p_{i t_2} - p_{i t_1}$ and $\Delta RER_L^{i,c} = RER_{t_2} - RER_{t_1}$. In the baseline specification we additionally adjust $\Delta p_L^{i,c}$ by U.S. CPI inflation over the same period.
period. Specifically, lifelong pass-through, $\beta_L$, is estimated from the following micro-level regression:

(1)   $\Delta p_L^{i,c} = \alpha_c + \beta_L\, \Delta RER_L^{i,c} + \epsilon^{i,c}.$

$\Delta p_L^{i,c}$ is equal to the lifelong change in the good's log price relative to U.S. inflation, where i indexes the good and c the country. $\Delta RER_L^{i,c}$ refers to the cumulative change in the log of the bilateral real exchange rate for country c over this same period.8 The construction of these variables is illustrated in Figure I. The real exchange rate is calculated using the nominal exchange rate and the consumer price indices in the two countries. An increase in the RER is a real depreciation of the dollar. Finally, $\alpha_c$ is a country fixed effect. The second approach measures LRPT by estimating a standard aggregate pass-through regression. 8. The index i on the RER is to highlight that the particular real exchange rate change depends on the period when the good i is in the sample.
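To make the two steps of this first approach concrete, here is a minimal sketch in Python of (i) computing a good's frequency of price adjustment from its monthly price history and (ii) estimating the lifelong regression (1) with country fixed effects. It is an illustration only: the function and column names (dp_life, drer_life, psl, and so on) are hypothetical, and the actual BLS data handling, which follows Gopinath and Rigobon (2008), is more involved.

```python
import pandas as pd
import statsmodels.formula.api as smf

def adjustment_frequency(prices: pd.Series) -> float:
    """Share of months with a price change for one good; prices is a monthly
    series with NaN for months in which no price is reported."""
    p = prices.dropna()
    if len(p) < 2:
        return float("nan")
    changes = (p.diff() != 0) & p.diff().notna()
    return changes.sum() / (len(p) - 1)

def lifelong_passthrough(goods: pd.DataFrame) -> float:
    """goods: one row per good with columns dp_life (lifelong log price change,
    deflated by U.S. CPI), drer_life (log RER change over the same window),
    country, and psl (BLS Primary Strata Lower, used for clustering)."""
    df = goods.dropna(subset=["dp_life", "drer_life", "country", "psl"])
    res = smf.ols("dp_life ~ drer_life + C(country)", data=df).fit(
        cov_type="cluster",
        cov_kwds={"groups": df["country"].astype(str) + "_" + df["psl"].astype(str)},
    )
    return res.params["drer_life"]  # beta_L in regression (1)
```

Splitting the sample at the median of adjustment_frequency and running lifelong_passthrough on each half reproduces the low- versus high-frequency comparison reported in Table I.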
For each frequency bin, each country c, and each month t, we calculate the average price change relative to U.S. inflation, $\Delta p_t^c$, and the monthly bilateral real exchange rate movement vis-à-vis the dollar for that country, $\Delta RER_t^c$. We then estimate a stacked regression where we regress the average monthly change in prices on monthly lags of the real exchange rate change,

(2)   $\Delta p_t^c = \alpha_c + \sum_{j=0}^{n} \beta_j\, \Delta RER_{t-j}^c + \epsilon_t^c,$

where $\alpha_c$ is a country fixed effect and n varies from 1 to 24 months. The aggregate long-run pass-through is then defined to be the cumulative sum of the coefficients, $\sum_{j=0}^{n}\beta_j$, at n = 24 months. Before we proceed to describe the results, we briefly comment on the two approaches. First, we use the real specification in both regressions to be consistent with the regressions run later on the model-generated data in Section IV.C. However, the empirical results from both the micro and aggregate regressions are insensitive to using a nominal specification, not surprisingly, given that the real and the nominal exchange rates move closely together at the horizons we consider.9 Second, a standard assumption in the empirical pass-through literature is that movements in the real or nominal exchange rate are orthogonal to other shocks that affect the firm's pricing decision and are not affected by firm pricing. This assumption is motivated by the empirical finding that exchange rate movements are disconnected from most macro variables at the frequencies studied in this paper. Although this assumption might be problematic for commodities such as oil or metals and for some commodity-exporting countries such as Canada, it is far less restrictive for most differentiated goods and most developed countries. Moreover, our main analysis is to rank pass-through across frequency bins as opposed to estimating the true pass-through number. For this reason, our analysis is less sensitive to concerns about the endogeneity of the real exchange rate. Third, the lifelong approach has an advantage in measuring LRPT in that it ensures that all goods have indeed changed their price. In the case of the second approach it is possible that even after 24 months some goods have yet to change price and consequently pass-through estimates are low. A concern with the 9. In the nominal specification, we regress the lifelong change in the log of the nominal price of the good on the cumulative change in the log of the nominal exchange rate.
first approach, however, is that because it conditions on a price change, estimates can be biased, because although the exchange rate may be orthogonal to other shocks, when the decision to adjust is endogenous, conditioning on a price change induces a correlation across shocks. The lifelong regression addresses this selection issue by increasing the window of the pass-through regression to include a number of price adjustments that reduces the size of the selection bias. In Section IV.C we confirm this claim via simulations. II.B. Lifelong Pass-Through In Table I we report the results from estimating the lifelong equation (1). In Panel A the first price refers to the first observed price for the good, and in Panel B the first price refers to the first new price for the good. In both cases, the last price is the last new price. For the hypothetical item in Figure I, Panel A would use observations in [0, t2 ], whereas Panel B would use observations only in [t1 , t2 ]. The main difference between the results in the two tables relates to the number of observations, because there are goods with only one price adjustment during their life. Otherwise, the results are the same. The first column of Table I reports the subsample of the analysis. The next six columns report the median frequencies (Freq) within the low- and high-frequency bins, the point estimates for LRPT (β LLo and β LHi ) and the robust standard errors (s.e.(β LLo ) and s.e.(β LHi )) for the estimates clustered at the level of country interacted with the BLS-defined Primary Strata Lower (PSL) of the good (mostly two- to four-digit harmonized codes). The next two columns report the difference in LRPT between high and lowfrequency adjusters and the t-statistic associated with this difference. The number of observations, Nobs , and R2 are reported in the last two columns. The main finding is that high-frequency adjusters have a lifelong pass-through that is at least twice as high as for lowfrequency adjusters. Consider first the specification in Panel A. In the low-frequency subsample, goods adjust prices on average every fourteen months and pass-through only 21% in the long run. At the same time, in the high-frequency subsample, goods adjust prices every three months and pass-through 44% in the long run. This is more strongly evident when we restrict attention to the high-income OECD sample: LRPT increases from 25% to 59% as we move from the low- to the high-frequency subsample.
TABLE I
FREQUENCY AND LIFELONG PASS-THROUGH

                          Low frequency               High frequency              Difference
                          Freq   β_L^Lo   s.e.        Freq   β_L^Hi   s.e.        β_Hi − β_Lo   t-stat    N_obs     R²

Panel A: Baseline specification
All countries
  Manufacturing           0.07   0.21     0.03        0.38   0.44     0.06        0.23          4.18      14,227    .07
  Differentiated          0.07   0.20     0.04        0.29   0.46     0.08        0.26          3.89       7,870    .08
High-income OECD
  Manufacturing           0.07   0.25     0.04        0.40   0.59     0.05        0.34          4.43       5,988    .08
  Differentiated          0.07   0.25     0.07        0.33   0.59     0.06        0.34          3.50       2,982    .10

Panel B: Starting with a new price
All countries
  Manufacturing           0.10   0.20     0.03        0.50   0.46     0.06        0.26          3.71       9,002    .09
  Differentiated          0.09   0.16     0.05        0.40   0.48     0.09        0.32          3.68       4,708    .10
High-income OECD
  Manufacturing           0.10   0.19     0.05        0.47   0.67     0.06        0.49          6.00       4,079    .10
  Differentiated          0.09   0.17     0.08        0.33   0.60     0.08        0.43          3.73       1,942    .11

Note. Robust standard errors clustered by country × PSL pair, where PSL is the BLS-defined Primary Strata Lower that corresponds to 2- to 4-digit sectoral harmonized codes. The baseline specification in Panel A computes the lifelong change in price starting with the first observed price of the good, whereas the specification in Panel B starts from the first new price of the good.
We also examine the subsample of manufactured goods that can be classified as being in the differentiated goods sector, following Rauch’s classification.10 For differentiated goods, moving from the low- to the high-frequency bin raises LRPT from 20% to 46% for goods from all source countries and from 25% to 59% in the high-income OECD sample. In all cases, the difference in passthrough across frequency bins is strongly statistically significant. Similarly, the higher pass-through of high-frequency adjusters is evident in Panel B where the first price is a new price. Because the results are similar for the case where we start with the first price as opposed to the first new price, for the remainder of the analysis we report the results for the former case, as it preserves a larger number of goods in the sample.11 All results in Table I also hold for the nominal specification, which is not reported for brevity. As a sensitivity check, we also restrict the sample to goods that have at least three or more price adjustments during their life. Results for this specification are reported in Table II. As expected, the median frequency of price adjustment is now higher, but the result that long-run pass-through is at least twice as high for the high-frequency bin as for the low-frequency bin still holds strongly and significantly. We also estimate median quantile regressions to limit the effect of outliers and find that the results hold just as strongly. In the case of the all-country sample, β LLo = 0.19 and β LHi = 0.41, with the difference having a t-statistic of 14.1. For the high-income OECD subsample, the difference is 0.30, with a t-statistic of 13.0. We also verify that the results are not driven by variable pass-through rates across countries unrelated to frequency, by 10. Rauch (1999) classified goods on the basis of whether they were traded on an exchange (organized), had prices listed in trade publications (reference), or were brand name products (differentiated). Each good in our database is mapped to a 10-digit harmonized code. We use the concordance between the 10-digit harmonized code and the SITC2 (Rev 2) codes to classify the goods into the three categories. We were able to classify around 65% of the goods using this classification. Consequently, it must not be interpreted that the difference in the number of observations between all manufactured and the subgroup of differentiated represent nondifferentiated goods. In fact, using Rauch’s classification, only 100 odd goods are classified as nondifferentiated. 11. Because there can be months during the life of the good when there is no price information, as a sensitivity test, we exclude goods for whom the last new price had a missing price observation in the previous month to allow for the case that the price could have changed in an earlier month but was not reported. This is in addition to keeping only prices that are new prices (as in Panel B). We find that the results hold just as strongly in this case. The median frequency for the high (low)-frequency goods is 0.35 (0.08) and the long-run pass-through is 0.61 (0.11) respectively. The t-stat of the difference in LRPT is 5.3.
TABLE II
FREQUENCY AND LIFELONG PASS-THROUGH: GOODS WITH THREE OR MORE PRICE CHANGES

                          Low frequency               High frequency              Difference
                          Freq   β_L^Lo   s.e.        Freq   β_L^Hi   s.e.        β_Hi − β_Lo   t-stat    N_obs     R²
All countries
  Manufacturing           0.13   0.25     0.07        0.57   0.48     0.07        0.23          3.06       6,111    .11
  Differentiated          0.11   0.19     0.11        0.42   0.56     0.09        0.38          3.99       3,031    .16
High-income OECD
  Manufacturing           0.12   0.32     0.04        0.60   0.70     0.08        0.38          4.00       2,856    .11
  Differentiated          0.11   0.28     0.06        0.50   0.71     0.09        0.43          3.20       1,312    .16

Note. Robust standard errors clustered by country × PSL pair (as in Table I). The subsample contains only goods with at least three price adjustments during their life in the sample.
controlling for differential levels of pass-through across countries. We estimate the difference in the coefficient between high- and low-frequency adjusters, within country, to be 27 percentage points with a t-statistic of 5.6. In Table III we additionally allow for variation across countries in the difference ($\beta_L^{Hi} - \beta_L^{Lo}$) and again find that the relation between LRPT and frequency holds for goods from the same country/region.

Alternative Specifications. We now verify that the documented positive relationship between frequency and pass-through is not an artifact of splitting the items into two bins by frequency. First, we address this nonstructurally by increasing the number of frequency bins. Specifically, we estimate the same regression across ten frequency bins (deciles). The point estimates and 10% robust standard error bands are reported in Figure II for all manufactured goods and all manufactured goods from high-income OECD countries, respectively. The positive relationship is evident in these graphs. For the high-income OECD subsample, long-run pass-through increases from around 18% to 75% as frequency increases from 0.03 to 1.12 This wide range of pass-through estimates covers almost all of the relevant range of theoretical pass-through, which for most specifications lies between 0 and 1. Furthermore, the positive relation between long-run pass-through and frequency is most evident for the higher frequency range, specifically among the goods that adjust every eight months or more frequently and constitute half of our sample. This fact assuages concerns that the relation between frequency and pass-through is driven by an insufficient number of price adjustments for the very-low-frequency goods. As opposed to increasing the number of frequency bins, our second approach estimates the effect of frequency on long-run pass-through using a more structured specification. We estimate the regression13

(3)   $\Delta p_L^{i,c} = \alpha_c + \beta_L\, \Delta RER_L^{i,c} + \delta_L\, \tilde f_{i,c} + \gamma_L\, \tilde f_{i,c}\cdot\Delta RER_L^{i,c} + \epsilon^{i,c},$

12. For the all-country sample the long-run pass-through ranges between 14% and 45%.
13. This specification results from the following two-stage econometric model:
$\Delta p_L^{i,c} = \alpha_c + \beta_L^{i,c}\, \Delta RER_L^{i,c} + \delta_L\, \tilde f_{i,c} + v^{i,c},$
$\beta_L^{i,c} = \beta_L + \gamma_L\, \tilde f_{i,c} + u^{i,c}.$
Regression (3) consistently estimates $\gamma_L$ provided that $u^{i,c}$ and $v^{i,c}$ are independent from $\Delta RER_L^{i,c}$ and $\tilde f_{i,c}$.
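A hedged sketch of specification (3), with the same hypothetical column names as above: the coefficient on the interaction of the demeaned frequency with the exchange rate change is the slope $\gamma_L$.

```python
import statsmodels.formula.api as smf

def frequency_slope(goods):
    """goods: one row per good with columns dp_life, drer_life, freq, country, psl."""
    df = goods.dropna(subset=["dp_life", "drer_life", "freq", "country", "psl"]).copy()
    df["f_dm"] = df["freq"] - df["freq"].mean()      # demeaned frequency f~
    df["f_x_rer"] = df["f_dm"] * df["drer_life"]     # interaction term f~ * dRER
    res = smf.ols("dp_life ~ drer_life + f_dm + f_x_rer + C(country)", data=df).fit(
        cov_type="cluster",
        cov_kwds={"groups": df["country"].astype(str) + "_" + df["psl"].astype(str)},
    )
    return res.params["drer_life"], res.params["f_x_rer"]  # (beta_L, gamma_L)
```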
0.07 0.07 0.10 0.07
0.31 0.24 0.38 0.17
0.07 0.09 0.12 0.03
s.e.(β LLo ) 0.27 0.31 0.87 0.36
Freq
Note. Robust standard errors clustered by country × PSL pair (as in Table I).
Japan Euro area Canada Non-OECD
Freq
β LLo
Low frequency
0.62 0.49 0.74 0.34
β LHi 0.15 0.10 0.23 0.06
s.e.(β LHi )
High frequency −β LLo
1.81 2.03 1.58 2.44
t-stat
Difference
0.31 0.25 0.36 0.17
β LHi
TABLE III FREQUENCY AND LIFELONG PASS-THROUGH: COUNTRIES AND REGIONS
1,418 1,802 1,150 8,239
Nobs
.07 .08 .07 .06
R2
688 QUARTERLY JOURNAL OF ECONOMICS
[Figure II: two panels (All countries; High-income OECD) plotting estimated lifelong pass-through against the ten frequency deciles, with frequency (0.03 to 1) on the horizontal axis and lifelong pass-through on the vertical axis.]
FIGURE II
Lifelong Pass-Through across Frequency Deciles
For each frequency decile we estimate LRPT from the lifelong regression (1). Dashed lines correspond to 10% confidence bands.
where $\tilde f_{i,c} \equiv f_{i,c} - \bar f$ is the demeaned frequency of the good relative to other goods in the sample. Therefore, coefficient $\beta_L$ captures the average pass-through in the sample, whereas $\gamma_L$ estimates the effect of frequency on long-run pass-through. The results from estimating this regression using both OLS and median quantile regression are reported in Table IV. In the case of the OLS estimates, robust standard errors clustered by country × PSL pair are reported. The lower panel presents the results for goods with at least three or more price changes. As is evident from the table, $\gamma_L > 0$ in all specifications. That is, goods that adjust prices more frequently also have higher LRPT. The reason the slope estimates vary across samples is partly driven by the fact that the relationship is nonlinear, as is evident in Figure II. These results are also robust to including controls for differential pass-through rates across countries. Between- and Within-Sector Evidence. Does the relation between frequency and LRPT arise across aggregate sectors or is this a within-sector phenomenon? To answer this we first perform a standard variance decomposition (see Theorem 3.3 in Greene [2000, p. 81]) for frequency: $S_f^T = S_f^B + S_f^W$. $S_f^T$ is the total variance of frequency across all goods in the sample. $S_f^B$ is the between-sector component of the variance, measured as the variance of frequency across the average goods
0.04 0.06
0.06 0.07
0.51 0.52
0.04 0.05
0.42 0.43
0.37 0.38
0.03 0.04
0.33 0.33
s.e.(β L)
0.15 0.15
0.14 0.18
s.e.(γ L)
βL
4.35 6.17
2.79 3.96 0.35 0.39
0.31 0.32
Panel A: All goods
t-stat
0.02 0.02
0.01 0.01
s.e.(β L)
0.57 0.84
0.43 0.72
γL
0.60 0.94
0.36 0.88 0.18 0.19
0.17 0.20 3.42 4.86
2.12 4.35 0.45 0.51
0.37 0.40 0.02 0.04
0.02 0.02
0.47 0.90
0.37 0.92
Panel B: Goods with three or more price changes
0.63 0.95
0.40 0.70
γL
0.08 0.13
0.06 0.08
0.04 0.08
0.03 0.05
s.e.(γ L)
Median quantile regression
6.30 6.83
6.43 10.94
13.27 10.83
14.09 12.57
t-stat
2,856 1,312
6,111 3,031
5,988 2,982
14,227 7,870
Nobs
Note. The table reports OLS and median quantile regression estimates of coefficients in specification (3): β L estimates average LRPT for the subsample and γ L estimates the slope coefficient for the frequency–LRPT relation. Robust standard errors clustered by country × PSL pair are reported for the OLS regressions (as in Table I). t-statistics are reported for the hypothesis γ L = 0.
All countries Manufacturing Differentiated High-income OECD Manufacturing Differentiated
All countries Manufacturing Differentiated High-income OECD Manufacturing Differentiated
βL
OLS
TABLE IV SLOPE COEFFICIENT FOR FREQUENCY–PASS-THROUGH RELATION
TABLE V
FREQUENCY–LRPT RELATION: WITHIN- VERSUS BETWEEN-SECTOR EVIDENCE

              Freq      OLS                                        Median quantile regression
              within    γ_L^W         γ_L^B         Within         γ_L^W         γ_L^B         Within
Two-digit     85%       0.49 (0.09)   0.17 (0.12)   98%            0.40 (0.03)   0.51 (0.05)   78%
Four-digit    70%       0.48 (0.07)   0.30 (0.08)   86%            0.40 (0.04)   0.48 (0.04)   62%

Note. The second column reports the contribution of the within-sector component to total variation in frequency according to the standard variance decomposition. The estimates γ_L^W and γ_L^B are from the unrestricted version of regression (3), where we allow separate coefficients on average sectoral frequency (γ_L^B) and on the deviation of the good's frequency from the sectoral average (γ_L^W), as described in footnote 14. Robust standard errors in parentheses. Columns (5) and (8) report the contribution of the within-sector component to the relation between frequency and pass-through according to the formula in the text.
from each sector. Finally, $S_f^W$ is the within-sector component of variance, measured as the average variance of frequency across goods within sectors. We perform the analysis at both the two-digit and four-digit sector levels. At the two-digit sector level (88 sectors), the fraction of total variance in frequency (equal to 0.073) explained by variation across two-digit sectors is 15%, whereas the remaining 85% is explained by variation across goods within two-digit sectors. At the four-digit level (693 sectors), the between-sector component accounts for 30% of variation in frequency, whereas the within-sector variation accounts for the remaining 70%. This evidence suggests that variation in frequency is driven largely by variation at highly disaggregated levels. The second exercise we perform is to estimate the counterpart to equation (3) allowing for separate within- and between-sector effects of frequency on pass-through.14 The results are reported in Table V. The within-sector estimates ($\gamma_L^W$) are positive and statistically significant in all specifications. The between-sector estimates ($\gamma_L^B$) are positive, but the level of significance varies across specifications. 14. Specifically, instead of $\gamma_L (\tilde f_{i,c} \cdot \Delta RER_L^{i,c})$, we include two terms, $\gamma_L^B (\bar f_{j(i),c} \cdot \Delta RER_L^{i,c})$ and $\gamma_L^W (f_{i,c} - \bar f_{j(i),c}) \cdot \Delta RER_L^{i,c}$, where j indicates the sector that contains good i and $\bar f_{j(i),c}$ is the average frequency in sector j. Note that our earlier specification (3) is the restricted version of this regression under the assumption that $\gamma_L^W = \gamma_L^B$. Furthermore, the unconstrained specification allows a formal decomposition of the effect of frequency on pass-through into within- and between-sector contribution as discussed in the text.
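The decomposition of the variance of frequency into between- and within-sector components can be written compactly; the sketch below is an assumed implementation, not the authors' code, and the column names are illustrative.

```python
import pandas as pd

def freq_variance_decomposition(goods: pd.DataFrame):
    """goods: one row per good with columns 'freq' and 'sector' (e.g., a 2- or
    4-digit code). Returns the between- and within-sector shares of variance."""
    total = goods["freq"].var(ddof=0)
    sector_mean = goods.groupby("sector")["freq"].transform("mean")
    between = sector_mean.var(ddof=0)                   # S_f^B: variance of sector means
    within = (goods["freq"] - sector_mean).var(ddof=0)  # S_f^W: within-sector variance
    return between / total, within / total              # shares sum to one
```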
We can now quantify the contribution of the within-sector component to the relation between LRPT and frequency using the formula

$\frac{\gamma_L^{W\,2}\, S_f^W}{\gamma_L^{W\,2}\, S_f^W + \gamma_L^{B\,2}\, S_f^B},$

where the denominator is the total variance in LRPT explained by variation in frequency. Using the OLS (quantile regression) estimates, the within-sector contribution is 98% (78%) at the two-digit level and 86% (62%) at the four-digit level. Therefore the relation between frequency and LRPT is largely a within-sector phenomenon, consistent with the evidence that most variation in frequency arises within sectors and not across aggregated sectors.15 II.C. Aggregate Regressions The next set of results relates to the estimates from the aggregate pass-through regressions defined in (2). We again divide goods into two bins based on the frequency of price adjustment and estimate the aggregate pass-through regressions separately for each of the bins. We again report the results only for the real specification, because the nominal specification delivers very similar results. The results are plotted in Figure III. The solid line plots the cumulative pass-through coefficient, $\sum_{j=0}^{n}\beta_j$, as the number of monthly lags increases from 1 to 24. The dashed lines represent the 10% robust standard-error bands. The left-column figures are for the all-country sample and the right-column figures are for the high-income OECD subsample; the top figures correspond to all manufactured goods, whereas the bottom figures correspond to the differentiated good subsample. Although pass-through at 24 months is lower than lifelong estimates, it is still the case that high-frequency adjusters have a pass-through that is at least twice as high as for low-frequency adjusters, and this difference is typically significant. The results from this approach are therefore very much in line with the results 15. This is not to say that there is no variation in frequency and pass-through across sectors. More homogeneous sectors, such as “Animal and Vegetable Products,” “Wood and Articles of Wood,” and “Base Metals and Articles of Base Metals,” on average have higher frequency and higher long-run pass-through. More differentiated sectors have lower average frequency (with little variation across sectors) and lower long-run pass-through. However, the amount of variation across sectors is insufficient to establish a strong empirical relationship.
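As a check on this formula, plugging in the two-digit OLS estimates from Table V ($\gamma_L^W = 0.49$, $\gamma_L^B = 0.17$) together with the 85%/15% within/between split of the variance of frequency gives

$\frac{0.49^2 \times 0.85}{0.49^2 \times 0.85 + 0.17^2 \times 0.15} \approx 0.98,$

the 98% within-sector contribution reported in the first row of the table.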
[Figure III: four panels of cumulative aggregate pass-through against the number of monthly lags (1 to 24), for the high- and low-frequency bins. Left column: all countries; right column: high-income OECD; top row: all manufactured goods; bottom row: differentiated goods.]
FIGURE III
Aggregate Pass-Through
Solid lines plot the cumulative pass-through coefficients for the two frequency bins estimated from aggregate regression (2) with different number of lags. The dashed lines correspond to 10% confidence bands.
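For completeness, a minimal sketch of how the aggregate regression (2) behind this figure can be estimated, again with illustrative variable names rather than the originals: stack country-month observations, include the monthly lags of the RER change and country fixed effects, and cumulate the lag coefficients.

```python
import pandas as pd
import statsmodels.formula.api as smf

def aggregate_lrpt(panel: pd.DataFrame, n_lags: int = 24) -> float:
    """panel: one row per country-month with columns country, month,
    dp (average log price change net of U.S. inflation), drer (log RER change)."""
    df = panel.sort_values(["country", "month"]).copy()
    lag_cols = []
    for j in range(n_lags + 1):
        col = f"drer_l{j}"
        df[col] = df.groupby("country")["drer"].shift(j)
        lag_cols.append(col)
    df = df.dropna(subset=lag_cols + ["dp"])
    res = smf.ols("dp ~ " + " + ".join(lag_cols) + " + C(country)", data=df).fit()
    return sum(res.params[c] for c in lag_cols)  # cumulative pass-through at n lags
```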
from the lifelong specification. Similar aggregate pass-through results hold for the subsample of goods with at least three price adjustments and for specific countries and regions (as in Table III); however, estimates in these smaller subsamples become a lot noisier, especially for the non-OECD countries. We do not report these results here, for brevity, but the reader can refer to the working paper version (Gopinath and Itskhoki 2008). II.D. Additional Facts In closing the empirical section, we discuss a number of additional relevant findings in the data. Product Replacement. For the previous analysis we estimate LRPT for a good using price changes during the life of the good. Because goods get replaced frequently, one concern could be that
TABLE VI
SUBSTITUTIONS

Decile   Freq   Life 1   Life 2   Eff freq 1   Eff freq 2
  1      0.03     59       42       0.05         0.05
  2      0.05     50       34       0.07         0.08
  3      0.07     52       32       0.09         0.10
  4      0.10     55       36       0.12         0.13
  5      0.13     52       33       0.15         0.16
  6      0.18     49       32       0.20         0.21
  7      0.29     50       26       0.30         0.31
  8      0.44     51       34       0.46         0.46
  9      0.67     52       33       0.67         0.68
 10      1.00     43       30       1.00         1.00

Note. Effective frequency (Eff freq) corrects the measure of frequency during the life of the good (Freq) for the probability of the good's discontinuation/replacement according to [Freq + (1 − Freq)/Life] for the two measures of the life of the good in the sample (Life).
goods that adjust infrequently have shorter lives and get replaced often, and because we do not observe price adjustments associated with substitutions, we might underestimate the true pass-through for these goods.16 To address this concern, we report in Table VI the median life of goods within each frequency bin for the highincome OECD sample. Very similar results are obtained for other subsamples. For each of the ten frequency bins we estimate two measures of the life of the good. For the first measure we calculate for each good the difference between the discontinuation date and initiation date to capture the life of the good in the sample. “Life 1” then reports the median of this measure for each bin. Goods get discontinued for several reasons. Most goods get replaced during routine sampling and some get discontinued due to lack of reporting. As a second measure, we examine only those goods that were replaced either because the firm reported that the particular good was not being traded any more and had/had not been replaced with another good in the same category, or because the firm reported that it was going out of business.17 This captures most closely the kind of churning one might be interested in and does not suffer from 16. Note that substitutions pose a bigger concern only if there is reason to believe that pass-through associated with substitutions is different from that associated with regular price changes. Otherwise, our measures that condition on multiple rounds of price adjustment capture LRPT. 17. Specifically this refers to the following discontinuation reasons reported in the BLS data: “Out of Business,” “Out of Scope, Not Replaced,” and “Out of Scope, Replaced.”
[Figure IV: median, 25th-, and 75th-percentile absolute size of price adjustment (vertical axis, 0 to 0.16) plotted against the ten frequency deciles.]
FIGURE IV
Frequency and Absolute Size of Price Adjustment
right censoring in measuring the life of the good. “Life 2” is then the median of this measure within each bin. As can be seen, if anything, there is a negative relation between frequency and life: that is, goods that adjust infrequently have longer lives in the sample. In the last two columns we report [Freq + (1 − Freq)/Life] for the two measures of life. This corrects the frequency of price adjustment to include the probability of discontinuation. As is evident, the frequency ranking does not change when we include the probability of being discontinued using either measure. As mentioned earlier, there are several goods that do not change price during their life and get discontinued. We cannot estimate passthrough for these goods. The median life of these goods is twenty months (using the second measure), which implies a frequency of 0.05. What this section highlights is that even allowing for the probability of substitution, the benchmark frequency ranking is preserved. Size of Price Adjustment. Figure IV plots the median absolute size of price adjustment across the ten frequency bins for the high-income OECD subsample. Median size is effectively the
same across frequency bins, ranging between 6% and 7%.18 This feature is not surprising given that size, unlike pass-through, is not scale-independent and, for example, depends on the average size of the shocks. This illustrates the difficulty of using measures such as size in the analysis of frequency. We discuss this issue later in the paper. Long-Run versus Medium-Run Pass-Through. In this paper we estimate the long-run pass-through for a good. A separate measure of pass-through is pass-through conditional on only the first price adjustment to an exchange rate shock. In Gopinath, Itskhoki, and Rigobon (2009), we refer to this as medium-run pass-through (MRPT). As is well known, estimating pass-through conditional on only the first price adjustment may not be sufficient to capture LRPT due to staggered price adjustment by competitors, among other reasons. These effects can be especially pronounced for goods that adjust prices more frequently than their average competitors. In Gopinath and Rigobon (2008), we sort goods into different frequency bins and estimate MRPT within each bin, which is distinct from estimating LRPT. Second, we use both dollar (90% of the sample) and non–dollar (10% of the sample) priced goods. We document that goods that adjust less frequently have higher MRPT than goods that adjust more frequently. This result, relating to MRPT, was driven by the fact that goods that adjust less frequently were goods that were priced in a nondollar currency. If the nondollar goods are excluded from the sample, there is no well-defined pattern in the relation between MRPT and frequency. This is further demonstrated in Figure V, where we plot both LRPT and MRPT against frequency. Unlike LRPT, there is no relation between MRPT and frequency for dollar-priced goods. We also estimate equation (3) for the case where the left hand–side variable conditions on first price adjustment instead of lifelong price change. The coefficient that estimates the effect of frequency on MRPT is −0.04 with a t-statistic of −0.9, confirming the result in Figure V that MRPT is unrelated to frequency in the dollar sample. 18. We also plot in this figure the 25% and 75% quantiles of the size of price adjustment distribution. Just as for median size, we find no pattern for the twenty-fifth quantile, which is roughly stable at 4% across the ten frequency bins. On the opposite, the seventy-fifth quantile decreases from 15% to 10% as we move from low-frequency to high-frequency bins.
[Figure V: LRPT and MRPT estimates plotted against frequency deciles, for all countries (left panel) and high-income OECD (right panel).]
FIGURE V
Long-Run versus Medium-Run Pass-Through
LRPT is as in Figure II. MRPT is estimated for each frequency decile from a counterpart to regression (1), which conditions on the first price change instead of the lifelong price change.
In Gopinath, Itskhoki, and Rigobon (2009) we present further systematic evidence on the relation between the currency in which goods are priced and MRPT. We argue theoretically that one should expect to find that goods priced in nondollars indeed have a higher MRPT. In addition, they will have longer price durations, conditioning on the same LRPT. To clarify again, the measure of pass-through we estimate in this paper is a different concept from the main pass-through measures reported in Gopinath and Rigobon (2008) and Gopinath, Itskhoki, and Rigobon (2009). The evidence we find about the relation between frequency and pass-through relates to the long-run pass-through for dollar-priced goods. As we argue below theoretically, the relevant concept relating frequency to the structural features of the profit function is indeed long-run pass-through, and that is why it is the focus of the current paper. III. A STATIC MODEL OF FREQUENCY AND PASS-THROUGH In this section we investigate theoretically the relation between LRPT and frequency. Before constructing in the next section a full-fledged dynamic model of staggered price adjustment, we use a simple static model to illustrate the theoretical relationship between frequency of price adjustment and flexible price pass-through of cost shocks. The latter is the equivalent of LRPT in a dynamic environment and we will refer to it simply as pass-through. We show that, all else equal, higher pass-through
is associated with higher frequency of price adjustment. This follows because the primitives that reduce pass-through also reduce the curvature of the profit function in the space of the cost shock, making the firm less willing to adjust its price. We consider the problem of a single monopolistic firm that sets its price before observing the cost shock.19 Upon observing the cost shock, the firm has an option to pay a menu cost to reset its price. The frequency of adjustment is then the probability with which the firm decides to reset its price upon observing the cost shock. We introduce two standard sources of incomplete pass-through into the model: variable mark-ups and imported inputs. III.A. Demand and Costs Consider a single price-setting firm that faces a residual demand schedule Q = ϕ(P|σ, ε), where P is its price and σ > 1 and ε ≥ 0 are two demand parameters.20 We denote the price elasticity of demand by σ˜ ≡ σ˜ (P|σ, ε) = −
$\frac{\partial \ln \varphi(P\,|\,\sigma,\varepsilon)}{\partial \ln P}$
and the superelasticity of demand (in the terminology of Klenow and Willis [2006]), or the elasticity of elasticity, as ε˜ ≡ ε˜ (P|σ, ε) =
$\frac{\partial \ln \tilde{\sigma}(P\,|\,\sigma,\varepsilon)}{\partial \ln P}.$
Here σ˜ is the effective elasticity of demand for the firm, which takes into account both direct and indirect effects from price adjustment.21 Note that we introduce variable mark-ups into the model by means of variable elasticity of demand. This should be viewed as a reduced-form specification for variable mark-ups that would arise in a richer model due to strategic interactions between firms.22 19. Our modeling approach in this section is closest to that of Ball and Mankiw (1994), whereas the motivation of the exercise is closest to that in Romer (1989). References to other related papers can be found in the Introduction. 20. Because this is a partial equilibrium model of the firm, we do not explicitly list the prices of competitors or the sectoral price index in the demand functions. An alternative interpretation is that P stands for the relative price of the firm. 21. For example, in a model with large firms, price adjustment by the firm will also affect the sectoral price index, which may in turn indirectly affect the elasticity of demand. 22. The Atkeson and Burstein (2008) model is an example: in this model the effective elasticity of residual demand for each monopolistic competitor depends
We impose the following normalization on the demand parameters: When the price of the firm is unity (P = 1), elasticity and superelasticity of demand are given by σ and ε, respectively (that is, σ˜ (1|σ, ε) = σ and ε˜ (1|σ, ε) = ε). Moreover, σ˜ (·) is increasing in σ and ε˜ (·) is increasing in ε for any P. Additionally, we normalize the level of demand ϕ(1|σ, ε) to equal 1 independent of the demand parameters σ and ε (see Section IV for an example of such a demand schedule). These normalizations prove to be useful later when we approximate the solution around P = 1. The firm operates a production technology characterized by a constant marginal cost, MC ≡ MC(a, e; φ) = (1 − a)(1 + φe)c, where a is an idiosyncratic productivity shock and e is a real exchange rate shock. We will refer to the pair (a, e) as the cost shock to the firm. We further assume that a and e are independently distributed with Ea = Ee = 0 and standard deviations denoted by σa and σe respectively. Parameter φ ∈ [0, 1] determines the sensitivity of the marginal cost to the exchange rate shock and can be less than 1 due to the presence of imported intermediate inputs in the cost function of the firm (see Section IV). We normalize the marginal cost so that MC = c = (σ − 1)/σ when there is no cost shock (a = e = 0). Under this normalization, the optimal flexible price of the firm when a = e = 0 is equal to 1, because the marginal cost is equal to the inverse of the markup. This normalization is therefore consistent with a symmetric general equilibrium in which all firms’ relative prices are set to 1 (for a discussion see Rotemberg and Woodford [1999]). Finally, the profit function of the firm is given by (4)
$\Pi(P\,|\,a,e) = \varphi(P)\,\bigl(P - MC(a,e)\bigr),$
where we suppress the explicit dependence on parameters σ, ε, and φ. We denote the desired price of the firm by $P(a,e) \equiv \arg\max_P \Pi(P\,|\,a,e)$ and the maximal profit by $\Pi(a,e) \equiv \Pi(P(a,e)\,|\,a,e)$.
on the primitive constant elasticity of demand, the market share of the firm, and the details of competition between firms.
III.B. Price Setting For a given cost shock (a, e), the desired flexible price maximizes profits (4), so that23 (5)
$P_1 \equiv P(a,e) = \frac{\tilde\sigma(P_1)}{\tilde\sigma(P_1) - 1}\,(1-a)(1+\phi e)\,c,$
and the corresponding maximized profit is $\Pi(a,e)$. Denote by $\bar P_0$ the price that the firm sets prior to observing the cost shocks (a, e). If the firm chooses not to adjust its price, it will earn $\Pi(\bar P_0\,|\,a,e)$. The firm will decide to reset the price if the profit loss from not adjusting exceeds the menu cost, κ: $L(a,e) \equiv \Pi(a,e) - \Pi(\bar P_0\,|\,a,e) > \kappa$. Define the set of shocks upon observing which the firm decides not to adjust its price by $\Omega \equiv \{(a,e) : L(a,e) \le \kappa\}$. Note that the profit-loss function L(a, e) and, hence, $\Omega$ depend on the preset price $\bar P_0$. The firm sets its initial price, $\bar P_0$, to maximize expected profits, where the expectation is taken conditional on the realization of the cost shocks (a, e) upon observing which the firm does not reset its price,24

$\bar P_0 = \arg\max_P \int_{(a,e)\in\Omega} \Pi(P\,|\,a,e)\, dF(a,e),$
where F(·) denotes the joint cumulative distribution function of the cost shock (a, e). Using the linearity of the profit function in costs, we can rewrite the ex ante problem of the firm as (6)
$\bar P_0 = \arg\max_P \bigl\{\varphi(P)\bigl(P - E_\Omega\{(1-a)(1+\phi e)\}\cdot c\bigr)\bigr\},$
where E {·} denotes the expectation conditional on (a, e) ∈ . We prove the following: LEMMA 1. P¯0 ≈ P(0, 0) = 1, up to second-order terms. 23. The sufficient condition for maximization is σ˜ (P1 ) > 1 provided that ε˜ (P1 ) ≥ 0. We assume that these inequalities are satisfied for all P. 24. We implicitly assume, as is standard in a partial equilibrium approach, that the stochastic discount factor is constant for the firm.
Proof. See the working paper version, Gopinath and Itskhoki (2008). Intuitively, a firm sets its ex ante price as if it anticipates the cost shock to be zero (a = e = 0), that is, equal to its unconditional expected value. This will be an approximately correct expectation of the shocks (a, e) over the region , if this region is nearly symmetric around zero and the cost shocks have a symmetric distribution, as we assume. The optimality condition (5) implies that, given our normalization of the marginal cost and elasticity of demand, P(0, 0) = 1. III.C. Pass-Through Using Lemma 1 we can prove (see the Appendix) PROPOSITION 1. i. The following first-order approximation holds: (7)
$\frac{P(a,e) - \bar P_0}{\bar P_0} \approx \Psi \cdot (-a + \phi e),$

where

$\Psi \equiv \frac{1}{1 + \frac{\varepsilon}{\sigma - 1}}.$
ii. Exchange rate pass-through equals (8)
$\Psi_e = \phi\,\Psi = \frac{\phi}{1 + \frac{\varepsilon}{\sigma - 1}}.$
Lemma 1 allows us to replace $\bar P_0$ with P(0, 0) = 1. Thus, a and φe constitute proportional shocks to the marginal cost, and the desired price of the firm responds to them with elasticity $\Psi$. This pass-through elasticity can be smaller than one because mark-ups adjust to limit the response of the price to the shock. The mark-up elasticity is given by

$\frac{\partial \tilde\mu(P)}{\partial \ln P}\Big|_{P=1} = -\frac{\tilde\varepsilon(P)}{\tilde\sigma(P) - 1}\Big|_{P=1} = -\frac{\varepsilon}{\sigma - 1},$

where $\tilde\mu(P) \equiv \ln[\tilde\sigma(P)/(\tilde\sigma(P) - 1)]$ is the log mark-up. A higher price increases the elasticity of demand, which, in turn, leads to a lower optimal mark-up. Mark-up elasticity depends on both the superelasticity and elasticity of demand: it is increasing in the superelasticity of demand ε and decreasing in the elasticity of demand σ provided that ε > 0. Exchange rate pass-through, $\Psi_e$, is the elasticity of the desired price of the firm with respect to the exchange rate shock.
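As a minimal numerical illustration of (7) and (8) (example parameter values only, not a statement of the paper's calibration): pass-through falls with the mark-up elasticity ε/(σ − 1) and rises with the cost sensitivity φ.

```python
# Illustration of equations (7)-(8); the parameter values below are examples.
def passthrough(sigma: float, eps: float, phi: float):
    psi = 1.0 / (1.0 + eps / (sigma - 1.0))  # elasticity w.r.t. the cost shock, Psi
    return psi, phi * psi                    # (Psi, exchange rate pass-through Psi_e)

for eps in (0.0, 2.0, 6.0):
    psi, psi_e = passthrough(sigma=5.0, eps=eps, phi=0.75)
    print(f"eps={eps}: Psi={psi:.2f}, Psi_e={psi_e:.2f}")
# eps = 0 (CES demand) gives complete pass-through of the cost shock (Psi = 1);
# a higher superelasticity lowers both Psi and Psi_e.
```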
It is increasing in cost sensitivity to the exchange rate, φ, and decreasing in the mark-up elasticity, ε/(σ − 1).25 III.D. Frequency In this static framework, we interpret the probability of resetting price in response to a cost shock (a, e) as the frequency of price adjustment. Formally, frequency is defined as (9)
$\Lambda \equiv 1 - \Pr\{(a,e) \in \Omega\} = \Pr\{L(a,e) > \kappa\},$
where the probability is taken over the distribution of the cost shock (a, e). To characterize the region of nonadjustment, $\Omega$, we take a second-order approximation to the profit-loss function. In the Appendix we prove

LEMMA 2. The following second-order approximation holds:

$L(a,e) \equiv \Pi(a,e) - \Pi(\bar P_0\,|\,a,e) \approx \frac{1}{2}\,\frac{\sigma - 1}{\Psi}\left(\frac{P(a,e) - \bar P_0}{\bar P_0}\right)^2,$
where $\Psi$ is again as defined in (7). Note that Lemma 2 implies that the curvature of the profit function with respect to prices is proportional to

$\frac{\sigma - 1}{\Psi} = (\sigma - 1)\left(1 + \frac{\varepsilon}{\sigma - 1}\right),$

and increases in both σ and ε. That is, higher elasticity of demand and higher mark-up elasticity increase the curvature of the profit function. Holding pass-through (i.e., the response of desired price to shocks) constant, this should lead to more frequent price adjustment. However, greater mark-up elasticity also limits desired pass-through, which, as we show below, more than offsets the first effect. This is seen when we combine the results of Proposition 1 and Lemma 2 and arrive at the final approximation to the profit-loss 25. In the working paper version (Gopinath and Itskhoki 2008) we also allowed for variable marginal costs as an additional channel of incomplete pass-through. In this case, the effect of σ on $\Psi$ can be nonmonotonic. Although greater elasticity of demand limits the variable mark-up channel, it amplifies the variable marginal cost channel.
function:

(10)   $L(a,e) \approx \frac{1}{2}\,(\sigma - 1)\,\Psi\,(-a + \phi e)^2,$

which again holds up to third-order terms. This expression makes it clear that forces that reduce pass-through (i.e., decrease $\Psi$ and φ) also reduce the profit loss from not adjusting prices and, as a result, lead to lower frequency of price adjustment. Note that in the space of the cost shock, the curvature of the profit-loss function decreases as the pass-through elasticity decreases. Alternatively, primitives that lower $\Psi$ reduce the region of nonadjustment in the price space (Lemma 2). However, a lower $\Psi$ implies that the desired price adjusts by less and therefore is more likely to remain within the bounds of nonadjustment, thus reducing the frequency of price adjustment. This second effect always dominates (equation (10)). Combining (9) and (10), we have

(11)   $\Lambda \approx \Pr\left\{|X| > \sqrt{\frac{2\kappa}{(\sigma - 1)\,\Psi\,\Delta}}\right\},$

where $X \equiv \Delta^{-1/2}\cdot(-a + \phi e)$ is a standardized random variable with zero mean and unit variance and $\Delta \equiv \sigma_a^2 + \phi^2\sigma_e^2$ is the variance of the cost shock $(-a + \phi e)$. This leads us to

PROPOSITION 2. The frequency of price adjustment decreases with mark-up elasticity and increases with the sensitivity of costs to exchange rate shocks. It also decreases with the menu cost and increases with the elasticity of demand and the size of shocks.

Taken together, the results on pass-through and frequency in Propositions 1 and 2 imply that

PROPOSITION 3. (i) Higher mark-up elasticity as well as lower sensitivity of cost to exchange rate shocks reduces both frequency of price adjustment and pass-through; (ii) higher menu costs and smaller cost shocks decrease frequency, but have no effect on pass-through.

Proposition 3 is the central result of this section. It implies that as long as mark-up elasticity varies across goods, we should observe a positive cross-sectional correlation between frequency
and pass-through. Similarly, variation across goods in cost sensitivity to exchange rate, φ, can also account for the positive relationship between frequency and pass-through.26 Furthermore, other sources of variation in frequency do not affect pass-through and hence cannot account for the observed empirical relationship between the two variables. As for the absolute size of price adjustment, conditional on adjusting price, the effect of mark-up elasticity and the volatility of cost shocks works through two channels. The direct effect of lower mark-up elasticity or more volatile shocks is to increase the change in the desired price, whereas their indirect effect is to increase the frequency of price adjustment. The first effect increases the average size of price adjustment whereas the second reduces it. In Gopinath and Itskhoki (2008) we discuss conditions under which the direct effect dominates, such that lower mark-up elasticity (ε/(σ − 1)) or larger cost shocks () are associated with a larger absolute size of price adjustment. On the other hand, size of price adjustment increases with the size of the menu cost (κ) as frequency of price adjustment decreases. Consequently, as long as there is variation across goods in both κ and ε or , one should not expect to see a robust correlation between frequency and size (see further discussion in Section IV.C). IV. DYNAMIC MODEL We now consider a fully dynamic specification with statedependent pricing and variable mark-ups and quantitatively solve for the industry equilibrium in the U.S. market. First, we show that cross-sectional variation in mark-up elasticity can generate a positive relation between frequency and LRPT in a dynamic setting and can generate significant variation in frequency, equivalent to 37% of the observed variation in the data. Further, a menu cost model that allows for joint cross-sectional variation in mark-up elasticity and menu costs can quantitatively account for both the positive slope between LRPT and frequency and the close-to-zero slope between size and frequency in the data. Second, we show that the pass-through regressions estimated in Section II recover the true underlying LRPT. Third, we verify that 26. Quantitatively, however, the effect of φ on frequency is limited by the ratio of the variances of the exchange rate and idiosyncratic shocks, σe2 /σa2 , because φ affects frequency through = σa2 (1 + φ 2 σe2 /σa2 ). We calibrate this ratio in Section IV and show that the effect of φ on frequency is negligible.
the observed correlation between frequency and LRPT cannot be explained by standard sticky price models with only exogenous differences in frequency of price adjustment and no variation in LRPT. The importance of the variable mark-up channel of incomplete pass-through, as argued for theoretically by Dornbusch (1987) and Krugman (1987), has been documented in the empirical evidence of Knetter (1989), Goldberg and Knetter (1997), Fitzgerald and Haller (2008), and Burstein and Jaimovich (2009) among others.27 We model the variable mark-up channel of incomplete passthrough using Kimball (1995) kinked demand. Our setup is most comparable to that in Klenow and Willis (2006), with two distinctions. First, we have exchange rate shocks that are more idiosyncratic than the aggregate monetary shocks typically considered in the closed economy literature. Second, we examine how variation in mark-up elasticity across goods affects the frequency of price adjustment.28 IV.A. Setup of the Model In this section we lay out the ingredients of the dynamic model. Specifically, we describe demand, the problem of the firm, and the sectoral equilibrium. Industry Demand Aggregator. The industry is characterized by a continuum of varieties indexed by j. There is a unit measure of U.S. varieties and a measure ω < 1 of foreign varieties available for domestic consumption. The smaller fraction of foreign varieties 27. Unlike the evidence for import prices, Bils, Klenow, and Malin (2009) find that variable mark-ups play a limited role in consumer/retail price data. These two sets of findings can be potentially reconciled by the evidence in Goldberg and Hellerstein (2007), Burstein and Jaimovich (2009), and Gopinath et al. (2009), who find that variable mark-ups play an important role at the wholesale cost level and a limited role at the level of retail prices. Atkeson and Burstein (2008) also argue for the importance of variable mark-ups in matching empirical features of wholesale traded goods prices. 28. Klenow and Willis (2006) argue that the levels of mark-up elasticity required to generate sufficient aggregate monetary nonneutrality generate price and quantity behavior that is inconsistent with the micro data on retail prices. We differ from this analysis in the following respects. First, we match facts on import prices, which have different characteristics such as the median frequency of price adjustment. Second, we calibrate the mark-up elasticity to match the evidence on the micro-level relationship between frequency and pass-through for wholesale prices, and as discussed in footnote 34, this implies a lower mark-up elasticity for the median good, and the model-generated data are consistent with the micro facts.
captures the fact that not all varieties of the differentiated good are internationally traded in equilibrium. The technology of transforming the intermediate varieties into the final good is characterized by the Kimball (1995) aggregator, ||C j 1 dj = 1 ϒ (12) || C with ϒ(1) = 1, ϒ (·) > 0, and ϒ (·) < 0. C j is the quantity demanded of the differentiated variety j ∈ , where is the set of available varieties in the home country with measure || = 1 + ω. Individual varieties are aggregated into sectoral final good demand, C, which is implicitly defined by (12). The associated demand schedules with aggregator (12) are given by Pj C · , where ψ(·) ≡ ϒ −1 (·), (13) Cj = ψ D P || Pj is the price of variety j, P is the sectoral price index, and C D ≡ ϒ (||C j /C) Cj dj. The sectoral price index satisfies (14)
PC =
Pj C j dj
because the aggregator in (12) is homothetic. Firm’s Problem. Consider a home firm producing variety j. Everything holds symmetrically for foreign firms and we superscript foreign variables with an asterisk. The firm faces a constant marginal cost: (Wt∗ )φ . Ajt
1−φ
(15)
MC jt =
Wt
Aj denotes the idiosyncratic productivity shock that follows an autoregressive process in logs:29 a jt = ρaa j,t−1 + σa u jt ,
u jt ∼ iid N (0, 1).
29. In what follows, corresponding small letters denote the logs of the respective variables.
Wt and Wt∗ denote the prices of domestic and foreign inputs, respectively, and we will interpret them as wage rates. Parameter φ measures the share of foreign inputs in the cost of production.30 The profit function of a firm from sales of variety j in the domestic market in period t is 1−φ Wt (Wt∗ )φ
jt (Pjt ) = Pjt − C jt , Ajt where demand C jt satisfies (13). Firms are price setters and must satisfy demand at the posted price. To change the price, both domestic and foreign firms must pay a menu cost κ j in local currency (dollars). Define the state vector of firm j by S jt = (Pj,t−1 , Ajt ; Pt , Wt , Wt∗ ). It contains the past price of the firm, the current idiosyncratic productivity shock and the aggregate state variables, namely, sectoral price level and domestic and foreign wages. The system of Bellman equations for the firm is given by31 ⎧ N V (S jt ) = jt (Pj,t−1 ) + E Q(S j,t+1 )V (S j,t+1 ) S jt , ⎪ ⎪ ⎨ V A(S jt ) = max{ jt (Pjt ) + E{Q(S j,t+1 )V (S j,t+1 )|S jt }}, (16) ⎪ Pjt ⎪ ⎩ V (S jt ) = max{V N (S jt ), V A(S jt ) − κ j }, where V N is the value function if the firm does not adjust its price in the current period, V A is the value of the firm after it adjusts its price, and V is the value of the firm making the optimal price adjustment decision in the current period. Q represents the stochastic discount factor. Conditional on price adjustment, the optimal resetting price is given by ¯ jt ) = arg max{ jt (Pjt ) + E{Q(S j,t+1 )V (S j,t+1 )|S jt }}. P(S Pjt
Therefore, the policy function of the firm is Pj,t−1 , if V N (S jt ) > V A(S jt ) − κ j , (17) P(S jt ) = ¯ jt ), otherwise. P(S 30. The marginal cost in (15) can be derived from a constant returns to scale production function that combines domestic and foreign inputs. 31. In general, one should condition expectations in the Bellman equation on the whole history (S jt , S j,t−1 , . . .), but in our simulation procedure we assume that S jt is a sufficient statistic, following Krusell and Smith (1998).
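To make the price-setting problem concrete, the following sketch iterates a stripped-down version of the Bellman system (16) and the policy rule (17) for a single firm: the sectoral price level, sectoral consumption, and the exchange rate are held fixed, demand is CES, the productivity transition matrix is a crude kernel approximation, and the grids, menu cost, and scaling are illustrative assumptions rather than the paper's calibration.

```python
# A stripped-down menu-cost problem in the spirit of (16)-(17): one firm, CES
# demand, fixed sectoral price level (= 1), fixed sectoral consumption, no
# exchange rate movements. All values below are illustrative assumptions.
import numpy as np

beta = 0.96 ** (1 / 12)        # monthly discount factor
sigma = 5.0                    # demand elasticity
kappa = 0.02                   # menu cost (in profit units; a simplification)
rho_a, sigma_a = 0.95, 0.085   # idiosyncratic productivity process

p_grid = np.linspace(-0.7, 0.7, 141)                                     # log price grid
a_grid = np.linspace(-2.5, 2.5, 11) * sigma_a / np.sqrt(1 - rho_a ** 2)  # +/-2.5 uncond. std

# Crude transition matrix for a_t (Gaussian kernel over the grid points).
T = np.zeros((len(a_grid), len(a_grid)))
for i, a in enumerate(a_grid):
    w = np.exp(-0.5 * ((a_grid - rho_a * a) / sigma_a) ** 2)
    T[i] = w / w.sum()

# Static profit (P - MC) * P^(-sigma); MC scaled so the flexible log price is near -a.
P = np.exp(p_grid)[:, None]
MC = (sigma - 1) / sigma / np.exp(a_grid)[None, :]
profit = (P - MC) * P ** (-sigma)        # shape (n_prices, n_productivity)

V = np.zeros_like(profit)
for _ in range(20000):                   # value-function iteration
    EV = V @ T.T                         # E[V(p, a') | a]
    V_N = profit + beta * EV             # value of keeping last period's price
    V_A = (profit + beta * EV).max(axis=0)          # value of resetting to the best price
    V_new = np.maximum(V_N, V_A[None, :] - kappa)   # adjust only if it pays the menu cost
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

adjust = V_N <= V_A[None, :] - kappa     # policy (17): states in which the firm resets its price
print("share of (p, a) states with adjustment:", adjust.mean())
```

The adjustment region produced by this sketch has the familiar Ss shape: the firm resets its price only when the inherited price is far enough from the flexible-price optimum to justify paying the menu cost.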
Sectoral Equilibrium. Sectoral equilibrium is characterized by a path of the sectoral price level, {P_t}, consistent with the optimal pricing policies of firms given the exogenous paths of their idiosyncratic productivity shocks and wage rates in the two countries, {W_t, W_t^*}. We define E_t ≡ W_t^*/W_t to be the wage-based (real) exchange rate. We assume that all prices are set in the domestic unit of account, consistent with the evidence of dollar (local currency) pricing documented in the data. We further assume that the value of the domestic unit of account is stable relative to movements in the exchange rate and that the domestic real wage is also stable. These assumptions, namely that domestic price and real wage inflation are negligible relative to real exchange rate fluctuations, are a good approximation for the United States. From the modeling perspective this amounts to setting W_t ≡ 1 and assuming that all shocks to the (real) exchange rate arise from movements in W_t^*. The simulation procedure of the model is discussed in detail in the Appendix, whereas below we discuss the calibration of the model and simulation results.

IV.B. Calibration

We adopt the Klenow and Willis (2006) specification of the Kimball aggregator (12), which results in

(18)   \psi(x_j) = \left[ 1 - \varepsilon \ln\!\left( \frac{\sigma}{\sigma - 1} x_j \right) \right]^{\sigma/\varepsilon}, \quad \text{where } x_j \equiv D \frac{P_j}{P}.

This demand specification is conveniently governed by two parameters, σ > 1 and ε > 0, and the elasticity and superelasticity are given by

\tilde{\sigma}(x_j) = \frac{\sigma}{1 - \varepsilon \ln\!\left( \frac{\sigma}{\sigma - 1} x_j \right)} \quad \text{and} \quad \tilde{\varepsilon}(x_j) = \frac{\varepsilon}{1 - \varepsilon \ln\!\left( \frac{\sigma}{\sigma - 1} x_j \right)}.

Note that this demand function satisfies all normalizations assumed in Section III. When ε goes to 0, it results in CES demand with elasticity σ.
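As a quick illustration of this specification, the snippet below evaluates ψ, the elasticity, and the superelasticity at the steady-state normalized relative price and checks the CES limit as ε → 0. The parameter values mirror the calibration below (σ = 5, ε = 4); the relative price used in the limit check is arbitrary.

```python
# The Klenow-Willis (2006) demand specification (18): psi, elasticity, and
# superelasticity, plus the CES limit as the superelasticity goes to zero.
import numpy as np

sigma, eps = 5.0, 4.0

def psi(x, sigma, eps):
    """Demand as a function of the normalized relative price x = D * Pj / P."""
    return (1.0 - eps * np.log(sigma * x / (sigma - 1.0))) ** (sigma / eps)

def elasticity(x, sigma, eps):
    return sigma / (1.0 - eps * np.log(sigma * x / (sigma - 1.0)))

def superelasticity(x, sigma, eps):
    return eps / (1.0 - eps * np.log(sigma * x / (sigma - 1.0)))

x_bar = (sigma - 1.0) / sigma             # steady-state normalized relative price
print(psi(x_bar, sigma, eps))             # = 1 by the normalization
print(elasticity(x_bar, sigma, eps))      # = sigma
print(superelasticity(x_bar, sigma, eps)) # = eps

# CES limit: for small eps, psi(x) approaches (sigma/(sigma-1) * x)**(-sigma).
x = x_bar * 1.02                          # a relative price 2% above its steady-state level
print(psi(x, sigma, 1e-6), (sigma * x / (sigma - 1.0)) ** (-sigma))
```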
The calibrated values for the parameters {β, κ, σ_a, ρ_a, σ_e, ρ_e, ω, φ, ε, σ} are reported in Table VII.
TABLE VII
PARAMETER VALUES

Parameter                             | Symbol      | Value       | Source
Discount factor                       | β           | 0.96^{1/12} | Annualized interest rate of 4%
Fraction of imports                   | ω/(1 + ω)   | 16.7%       | BEA input–output tables and U.S. import data
Cost sensitivity to ER shock:
  Foreign firms                       | φ*          | 0.75        | OECD input–output tables
  U.S. firms                          | φ           | 0           |
Menu cost                             | κ           | 2.5%        | Price duration of 5 months when ε = 4
Demand elasticity                     | σ           | 5           | Broda and Weinstein (2006)
Exchange rate process, e_t:
  Std. dev. of ER shock               | σ_e         | 2.5%        | U.S.–Euro bilateral RER
  Persistence of ER                   | ρ_e         | 0.985       | Rogoff (1996)
Idiosyncratic productivity process, a_t:
  Std. dev. of shock to a_t           | σ_a         | 8.5%        | Absolute size of price adjustment of 7%
  Persistence of a_t                  | ρ_a         | 0.95        | Autocorrelation of new prices of 0.77
The period of the model corresponds to one month and, as is standard in the literature, we set the discount rate to equal 4% on an annualized basis (β = 0.96^{1/12}). To calibrate the share of imports, ω/(1 + ω), we use measures of U.S. domestic output in manufacturing from the Bureau of Economic Analysis input–output tables and subtract U.S. exports in manufacturing. We calculate U.S. imports in manufacturing using U.S. trade data available from the Center for International Data at UC Davis website. The four-year average import share for the period 1998–2001 is close to 17%, implying ω = 0.2.

Calibrating the cost sensitivity to the exchange rate shock, φ and φ*, requires detailed information on the fraction of imports used as inputs in production by destination of output and the currency in which these costs are denominated. Such information is typically unavailable. For our purposes we use the OECD input–output tables to calculate the ratio of imports to industrial output and find that the share for manufacturing varies between 8% and 27% across high-income OECD countries. For our benchmark calibration we use a value of 1 − φ* = 0.25 for foreign firms and φ = 0 for domestic firms, because almost all imports into the United States are priced in dollars.
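The import-share arithmetic above, together with the benchmark cost-sensitivity values, pins down ω and the average sectoral sensitivity of marginal cost to the exchange rate used in the next paragraph. A minimal check of this arithmetic, with values taken from the calibration described here:

```python
# Back-of-the-envelope check of the calibration arithmetic above (17% import
# share; phi* = 0.75 for foreign firms, phi = 0 for domestic firms).
import_share = 0.167                        # four-year average U.S. import share in manufacturing
omega = import_share / (1 - import_share)   # import share = omega / (1 + omega)  =>  omega ~ 0.2
phi_star = 0.75                             # cost sensitivity of foreign firms to the exchange rate
phi_bar = phi_star * omega / (1 + omega)    # average sectoral cost sensitivity, ~12.5%
print(round(omega, 2), round(phi_bar, 3))
```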
This implies that for an average firm in the industry the sensitivity of the marginal cost to the exchange rate is φ̄ = φ* · ω/(1 + ω) = 12.5%.

The log of the real exchange rate, e_t ≡ ln(W_t^*/W_t), is set to follow a very persistent process with an autocorrelation of around 0.985, and the monthly innovation to the real exchange rate is calibrated to equal 2.5%. These parameter values are consistent with the empirical evidence for developed countries. For instance, for countries in the euro zone, the standard deviation of the change in the bilateral monthly real exchange rate ranges between 2.6% and 2.8%. The persistence parameter is in the midrange of estimates reported in Rogoff (1996).

We set the steady state elasticity of demand σ = 5, which implies a mark-up of 25%. This value is in the middle of the range for mean elasticity estimated by Broda and Weinstein (2006) using U.S. import data for the period 1990–2001. They estimate the mean elasticity for SITC-5 to be 6.6 and for SITC-3 to be 4.0.

We simulate the model for different values of the superelasticity of demand, ε. The range of values is chosen to match the range of LRPT for the high-income OECD sample of 10% to 75%. Note that φ* = 0.75 bounds the long-run pass-through from above. The specific values used are ε ∈ {0, 2, 4, 6, 10, 20, 40}. Our baseline good, which matches the middle of the LRPT range, has ε = 4 (footnote 32). Given the value σ = 5, this implies that the range of mark-up elasticity is between 0 and 10, with a baseline value of 1.

The parameters κ, σ_a, ρ_a are jointly calibrated to match moments of the data on the median duration of price adjustment, the median size of price adjustment, and the autocorrelation of new prices in the data. This is done holding other parameters constant and with ε = 4, as for the baseline good that has a price duration of five months in the data. The implied menu cost, κ, in the baseline calibration is equal to 2.5% of the revenues conditional on adjustment. We set the standard deviation of the innovation in productivity to 8.5% and the persistence of the idiosyncratic shock to 0.95. These parameter values allow us to match the median absolute size of price change of 7% conditional on adjustment and the autocorrelation of new prices in the data.

32. See Dossche, Heylen, and Van den Poel (2006) for a discussion of calibrations of ε in the literature and for empirical evidence on the importance of nonzero ε.
More precisely, we estimate a pooled regression in the data where we regress the new price of each good on its lagged new price, allowing for a fixed effect for each good. The autocorrelation coefficient is estimated to be 0.77 (footnote 33). For the model to match this, we need a high persistence rate for the idiosyncratic shock of at least 0.95 (footnote 34). Note that the data moments imply that the ratio of the variance of the innovation to the exchange rate shock relative to that of the idiosyncratic shock is low: σ_e²/σ_a² = (2.5%/8.5%)² < 0.1.

IV.C. Quantitative Results

The first set of simulation results concerns the relation between frequency and LRPT in the model-simulated data. We simulate the dynamic stationary equilibrium of the model for each value of the superelasticity of demand and compute the frequency of price adjustment and LRPT for all firms in the industry and then separately for domestic and foreign firms. Figure VI plots the resulting relationship between frequency and LRPT for these three groups of firms. The LRPT estimates are computed using the lifelong regression (1). An exchange rate shock has two effects: a direct effect that follows from the firm's costs changing and an indirect effect that follows from the change in the sectoral price index, as other firms respond to the cost shock. The magnitude of the second effect depends on the fraction of firms that are affected by the exchange rate shock, and consequently the extent of pass-through depends also on how widespread the shock is. In our calibration this is determined by φ̄.

In Figure VI the line marked "Foreign" captures the strong positive relation between LRPT and frequency in the model-simulated import prices into the United States. The variation in ε that matched the empirical range in LRPT generates a frequency range of approximately 0.10 to 0.35, equivalent to durations of three to ten months.

33. We also estimate the regression with a fixed effect for every two-digit sector × date pair to control for sectoral trends, in addition to the fixed effect for each good. The autocorrelation coefficient in this case is 0.71.

34. Note that in our calibration we need to assume neither very large menu costs nor very volatile idiosyncratic shocks, as opposed to Klenow and Willis (2006). There are a few differences between our calibration and that of Klenow and Willis (2006). First, we assume a smaller mark-up elasticity: our baseline good has a superelasticity of 4, as opposed to 10 in their calibration. Second, they assume a much less persistent idiosyncratic shock process and match the standard deviation of relative prices rather than the average size of price adjustment.
FIGURE VI
Frequency and LRPT in the Model (Variation in ε)
[Figure: pass-through (vertical axis, 0 to 0.8) plotted against frequency of price adjustment (horizontal axis, 0.1 to 0.35) for importers, all firms, and domestic firms.]
On the other hand, the relation between frequency and pass-through is effectively absent for domestic firms, as well as for the sample of all firms in the industry: as frequency varies between 0.1 and 0.35, the LRPT estimate for domestic firms fluctuates between 0% and 5%, whereas for the full sample of firms it lies between 5% and 12%. For domestic firms the nonzero LRPT arises because of the indirect effect working through the sectoral price index. The overall low LRPT estimates for the industry are consistent with empirical estimates of low pass-through into consumer prices. Goldberg and Campa (2009) estimate exchange rate pass-through into import prices and consumer prices for a large sample of developed countries and document that pass-through into consumer prices is far lower than pass-through into import prices across all countries. For the case of the United States, for the period 1975–2003, they estimate LRPT into import prices to be 42% and into consumer prices to be 1%. This supports our emphasis on at-the-dock prices and international data as a meaningful environment for our study.

A second set of results relates to the performance of the two LRPT estimators employed in the empirical section, the lifelong regression (1) and the aggregate regression (2), when applied to the simulated panel of firm prices. Figure VII plots the relationship between frequency and three different measures of long-run pass-through for the exercise described above. The first measure of long-run pass-through ("Aggregate") is the 24-month cumulative pass-through coefficient from the aggregate pass-through regression.
FIGURE VII
Measures of LRPT
[Figure: pass-through (vertical axis, 0 to 0.8) plotted against frequency of price adjustment (horizontal axis, 0.1 to 0.35) for the "Aggregate," "Lifelong 1," and "Lifelong 2" measures of long-run pass-through.]
The second measure ("Lifelong 1") corresponds to the lifelong micro-level regression in which we control for firm idiosyncratic productivity. This ensures that the long-run pass-through estimates do not compound the selection effect present in menu cost models. This type of regression is, however, infeasible to run empirically because firm-level marginal costs are not observed. The third measure ("Lifelong 2") corresponds to the same lifelong micro-level regression, but without controlling for firm idiosyncratic productivity. This estimate is the counterpart to the empirical lifelong estimates we presented in Section II.

We observe from the figure that all three measures of LRPT produce the same qualitative patterns and very similar quantitative results. In addition, all the estimates are close to the theoretical long-run (flexible-price) pass-through.35 We conclude that within our calibration both estimators produce accurate measures of long-run pass-through. Specifically, 24 months is enough for the aggregate regression to capture the long-run response of prices, and lifelong regressions do not suffer from significant selection bias.

35. One can show that the theoretical flexible-price pass-through is approximated by φ̄ + Ψ(φ − φ̄), where φ̄ is the average sectoral sensitivity of the firms' marginal cost to the exchange rate and Ψ is the pass-through elasticity as defined in Section III. Note the close relation between this expression and the one in equation (8).
FIGURE VIII
Frequency and LRPT: Variation in φ and κ
[Figure: pass-through (vertical axis, 0 to 0.8) plotted against frequency of price adjustment (horizontal axis, 0.1 to 0.4) for three experiments: variation in ε, variation in φ, and variation in κ.]
In a third set of simulation results, we argue that variation in φ or κ alone cannot quantitatively explain the findings in the data. Thus far we have considered only variation in ε. For this exercise we instead fix ε = 4 and first vary φ between 0 and 1 for the baseline value of κ = 2.5%, and then vary κ between 0.5% and 7.5% for the baseline value of φ = 0.75. Figure VIII plots the results.

First, observe that variation in φ indeed generates a positive relationship between frequency and LRPT; however, the range of variation in frequency is negligible, as predicted in Section III (see footnote 26). As φ increases from 0 to 1, LRPT increases from 0 to 55%, whereas frequency increases only from 0.20 to 0.23. An important implication of this is that the frequency of price adjustment should not be very different across domestically produced and imported goods, despite their φ being very different. This is indeed the case in the data. Gopinath and Rigobon (2008) document that, for categories of goods in the import price index that could be matched with the producer price index, and using the duration measures from Nakamura and Steinsson (2008) for producer prices, the mean duration for the import price index is 10.3 months, and it is 10.6 months for the producer price index.

Next, observe that the assumed range of variation in the menu cost, κ, easily delivers a large range of variation in frequency, as expected. However, it produces almost no variation in LRPT, which is stable around 39%.
FIGURE IX
Frequency and LRPT in a Calvo Model
[Figure: aggregate pass-through (vertical axis, 0 to 0.8) plotted against the horizon in months (0 to 36) for a high-frequency sector (frequency 0.28) and a low-frequency sector (frequency 0.07). LL PT refers to LRPT estimated from the lifelong micro regression (1) (circle for high-frequency goods and square for low-frequency goods).]
Not surprisingly, in a menu cost model, exogenous variation in the frequency of price adjustment cannot generate a robust positive relationship between frequency and measured long-run pass-through.36

The fourth set of results, reported in Figure IX, presents a robustness check by examining whether a Calvo model with large exogenous differences in the flexibility of prices can induce a positive correlation between frequency and measured LRPT even though the true LRPT is the same. We simulate two panels of firm prices, one for a sector with a low probability of price adjustment (0.07) and another for a sector with a high probability of price adjustment (0.28), the same as in Table I. We set ε = 0 (CES demand) and keep all other parameters of the model as in the baseline calibration. The figure plots pass-through estimates from aggregate regressions at different horizons, as well as lifelong pass-through estimates (depicted as the circle and the square at the 36-month horizon mark). Aggregate pass-through at 24 months is 0.52 for low-frequency adjusters, whereas it is 0.70 for high-frequency adjusters. At the same time, the lifelong pass-through estimates are 0.61 and 0.70, respectively.

36. A similar pattern emerges for variation in the elasticity of demand σ when the superelasticity ε is set to zero (i.e., CES demand): variation in σ leads to variation in frequency with long-run pass-through stable at φ.
TABLE VIII
QUANTITATIVE RESULTS: MODEL AGAINST THE DATA

                                         Variation in
                     Data   Superelasticity, ε   Menu cost, κ   ε and κ
Slope(freq, LRPT)    0.56         1.86               0.03         0.55
Min(LRPT)            0.06         0.13               0.44         0.22
Max(LRPT)            0.72         0.76               0.46         0.57
Slope(freq, size)   −0.01         0.23              −0.15        −0.05
Min(size)            5.4%         3.8%               4.8%         5.8%
Max(size)            7.4%        11.8%              12.2%         8.2%
Std. dev.(freq)      0.30         0.11               0.17         0.18
Min(freq)            0.03         0.07               0.06         0.05
Max(freq)            1.00         0.44               0.59         0.61
As is well known, the Calvo model generates much slower dynamics of price adjustment than the menu cost model. This generates a significant difference in aggregate pass-through even at the 24-month horizon; however, this difference is far smaller than the one documented in Section II. The difference in pass-through is smaller still for the Calvo model if we consider the lifelong estimates of long-run pass-through. Further, as Figure II suggests, the steep relation between frequency and pass-through arises once frequency exceeds 0.13. A Calvo model calibrated to match a frequency of at least 0.13 converges sufficiently rapidly and there is no bias in the estimates. Therefore, we conclude that standard sticky price models with exogenous differences in the frequency of adjustment would have difficulty matching the empirical relationship between frequency and pass-through.

In the last set of simulation results, we evaluate the quantitative performance of the model in matching the following three moments of the data: the standard deviation of frequency, and the slope coefficients in the regressions of LRPT on frequency and of size on frequency. The last two represent the slopes in Figures II (for the high-income OECD subsample) and IV, respectively. We estimate these moments in the model-simulated series just as we do in the data, by sorting goods into ten bins based on their frequency of adjustment and then estimating LRPT within each bin. Table VIII describes the results.
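The binning procedure just described can be sketched as follows. The snippet sorts good-level observations into frequency deciles and computes the bin-level slope statistics reported in Table VIII; the good-level inputs are random placeholders rather than the model-simulated (or BLS) data, and within each bin the placeholder simply averages good-level LRPT, whereas in the paper LRPT is estimated from the lifelong regression within the bin.

```python
# Sort good-level observations into frequency deciles and compute the slope
# statistics of the kind reported in Table VIII. The inputs are placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_goods = 2000
freq = rng.uniform(0.03, 1.0, n_goods)            # frequency of price adjustment per good
lrpt = 0.5 * freq + rng.normal(0, 0.1, n_goods)   # placeholder good-level LRPT
size = rng.uniform(0.05, 0.08, n_goods)           # placeholder absolute size of adjustment

edges = np.quantile(freq, np.linspace(0, 1, 11))  # decile edges
bins = np.clip(np.digitize(freq, edges[1:-1]), 0, 9)

bin_freq = np.array([freq[bins == b].mean() for b in range(10)])
bin_lrpt = np.array([lrpt[bins == b].mean() for b in range(10)])   # paper: lifelong regression per bin
bin_size = np.array([np.median(size[bins == b]) for b in range(10)])

slope_lrpt = np.polyfit(bin_freq, bin_lrpt, 1)[0]  # Slope(freq, LRPT)
slope_size = np.polyfit(bin_freq, bin_size, 1)[0]  # Slope(freq, size)
print(slope_lrpt, slope_size, freq.std())
```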
The second column of Table VIII reports the moments in the data. The third column presents the results for the case with variation in ε only. Variation in ε alone can generate 37% of the empirical dispersion in frequency. The LRPT–frequency slope coefficient, although large and positive (1.86), is much steeper than in the data (0.56). The range in pass-through is, however, comparable by design. The model with only variation in ε generates a positive relation between size and frequency, as discussed in Section III, with a wide range of sizes that varies from 3.8% to 11.8% as ε decreases. In the data, however, this slope is close to 0 and the size range is small, between 5.4% and 7.4%. This is not surprising, given that in the data we sort goods based on frequency, and consequently a high-frequency bin combines goods that adjust frequently either because they have a low superelasticity of demand, ε, or because they have a low menu cost, κ. A calibration that allows for such additional sources of variation across goods should improve the fit of the model.

We show that this is indeed the case by introducing variation in κ alongside variation in ε. Specifically, for each ε, we have firms facing a range of menu cost parameters from 0.1% to 20% of steady state revenues conditional on adjustment.37 The cross-sectional distribution of menu costs is independent of the distribution of ε. This additional variation in κ makes it possible to match the two slope statistics almost perfectly, as reported in column (5) of Table VIII. The model generates a slope of 0.55 between frequency and LRPT, whereas in the data it is 0.56, and it also matches the close-to-zero slope between size and frequency in the data, −0.05 versus −0.01. The fraction of the standard deviation in frequency explained by the model increases to 60%. Last, for the fourth column, we shut down variation in ε, so all of the variation is in κ. The model with only κ performs poorly for reasons described earlier, as it predicts no slope between frequency and pass-through (as emphasized in Figure VIII) and a sizable negative slope for the size–frequency relationship (as seen in Figure XI).

In Figures X and XI we plot the relationships between frequency and LRPT and between frequency and size for the case with variation in ε alone and for the case with joint variation in ε and κ, against the data series from Figures II and IV, respectively.

37. Note that a firm with a 20% menu cost adjusts at most once a year, implying an annual cost of adjustment of less than 2% of revenues.
FIGURE X
Frequency and LRPT: Model against the Data
[Figure: pass-through (vertical axis, 0 to 0.8) plotted against frequency of price adjustment (horizontal axis, 0.03 to 1.00) for the model with variation in ε only, the model with variation in ε and κ, and the smoothed data.]
FIGURE XI
Frequency and Size: Model against the Data
[Figure: median absolute size of price adjustment (vertical axis, 0.02 to 0.14) plotted against frequency of price adjustment (horizontal axis, 0.03 to 0.44) for the model with κ only, the model with ε only, the model with ε and κ, and the data.]
As follows from the results in Table VIII, a menu cost model that allows for joint cross-sectional variation in mark-up elasticity and menu costs can quantitatively account for both the LRPT–frequency and size–frequency relationships and generate a large dispersion in frequency across goods. Further, the variable mark-up channel is essential to matching the relation between LRPT and frequency and can generate a significant fraction of the variation in frequency observed in the data.

V. CONCLUSION

We exploit the open economy environment with an observable and sizable cost shock, namely the exchange rate shock, to shed light on the mechanism behind the sluggish response of aggregate prices to cost shocks. We find that firms that adjust prices infrequently also pass through less, even after several periods and multiple rounds of price adjustment, as compared to high-frequency adjusters. In other words, firms that adjust prices infrequently are typically not as far from their desired prices because of their lower desired pass-through of cost shocks. On the other hand, firms that have high pass-through drift farther away from their optimal price and, therefore, make more frequent adjustments.

We also show evidence that there is interesting variation within sectors in the frequency of price adjustment that is linked to LRPT. This within-sector variation is not surprising in models with variable mark-ups. In a model where the strategic interactions between firms are explicitly modeled, as in Feenstra, Gagnon, and Knetter (1996) and Atkeson and Burstein (2008), mark-up elasticity will, among other things, depend on the market share of the firm, and this relationship is in general nonmonotonic. Hence firms in a sector with the same elasticity of substitution across goods will differ in their mark-up elasticity depending on their market share. Consequently, one should expect to see differences in mark-up elasticity, pass-through, and frequency of price adjustment even within the same disaggregated sector.

We have evaluated the empirical evidence through the lens of standard dynamic pricing models and find that a menu cost model with variation in the mark-up elasticity can match the facts in the data, whereas models with exogenous frequency of price adjustment and no variation across goods in LRPT have a difficult time matching the facts.
APPENDIX

A. Proofs for Section III

Proof of Proposition 1. The desired flexible price of the firm (5) can be rewritten in logs as

\ln P(a, e) = \tilde{\mu}(P(a, e)) + \ln(1 - a) + \ln(1 + \phi e),

where \tilde{\mu} \equiv \tilde{\mu}(P) = \ln[\tilde{\sigma}(P)/(\tilde{\sigma}(P) - 1)] is the log mark-up. Taking a first-order Taylor approximation around the point a = e = 0 gives

\left[ 1 - \frac{\partial \tilde{\mu}(P(0, 0))}{\partial \ln P} \right] \cdot [\ln P(a, e) - \ln P(0, 0)] + O\big((P(a, e) - P(0, 0))^2\big) = (-a + \phi e) + O(\|(a, e)\|^2),

where O(x) denotes the same order of magnitude as x and \|\cdot\| is some norm in R^2. Using the definitions of \tilde{\mu}, \tilde{\sigma}, and \tilde{\varepsilon}, we obtain \partial \tilde{\mu}(P)/\partial \ln P = -\tilde{\varepsilon}(P)/(\tilde{\sigma}(P) - 1). Given our demand and cost normalization, we have P(0, 0) = 1 and hence \tilde{\sigma}(P(0, 0)) = \sigma and \tilde{\varepsilon}(P(0, 0)) = \varepsilon. This allows us to rewrite the Taylor expansion as

\ln P(a, e) - \ln P(0, 0) + O\big((P(a, e) - P(0, 0))^2\big) = \Psi(-a + \phi e) + O(\|(a, e)\|^2), \quad \text{where } \Psi \equiv \left( 1 + \frac{\varepsilon}{\sigma - 1} \right)^{-1}.

This immediately implies that O(P(a, e) - P(0, 0)) = O(a, e), so that we can rewrite

\ln P(a, e) - \ln P(0, 0) = \Psi(-a + \phi e) + O(\|(a, e)\|^2).

The final step makes use of Lemma 1, which states that \bar{P}_0 = P(0, 0) + O(a, e)^2 and allows us to substitute \bar{P}_0 for P(0, 0) in the Taylor expansion above without affecting the order of approximation:

\ln P(a, e) - \ln \bar{P}_0 = \Psi(-a + \phi e) + O(\|(a, e)\|^2).

Last, for small shocks, the difference in logs is approximately equal to the percentage change, which results in expression (7) in the text of the proposition. Exchange rate pass-through is defined as \Psi_e \equiv \partial \ln P / \partial e = \Psi \phi, which follows from the above approximation.
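As an illustration of the proposition, the following numerical check solves the static flexible-price condition for the Klenow–Willis demand used in Section IV and compares the implied pass-through with Ψφ. It assumes SciPy is available, normalizes the steady-state mark-up so that P(0, 0) = 1, and uses illustrative parameter values (σ = 5, ε = 4, φ = 0.75).

```python
# Numerical check of Proposition 1: solve
#   ln P = [mu(P) - mu(1)] + ln(1 - a) + ln(1 + phi*e)
# with sigma_tilde(P) = sigma / (1 - eps*ln P) (the Klenow-Willis elasticity
# under the normalizations P_bar = 1, D = (sigma-1)/sigma), and compare the
# numerical pass-through with Psi*phi, Psi = (1 + eps/(sigma-1))**(-1).
import numpy as np
from scipy.optimize import brentq

sigma, eps, phi = 5.0, 4.0, 0.75

def log_price(a, e):
    def foc(lp):
        s = sigma / (1.0 - eps * lp)               # elasticity at this log price
        return lp - (np.log(s / (s - 1.0)) - np.log(sigma / (sigma - 1.0))
                     + np.log(1.0 - a) + np.log(1.0 + phi * e))
    return brentq(foc, -0.2, 0.2)                  # root of the flexible-price condition

h = 1e-4
numeric_pt = (log_price(0.0, h) - log_price(0.0, -h)) / (2 * h)
Psi = 1.0 / (1.0 + eps / (sigma - 1.0))
print(numeric_pt, Psi * phi)                       # both close to 0.375
```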
Proof of Lemma 2. Taking a second-order Taylor approximation to the profit-loss function around the desired price P(a, e) results in

L(a, e) \equiv \Pi(a, e) - \Pi(\bar{P}_0 \mid a, e) = -\frac{1}{2} \frac{\partial^2 \Pi(a, e)}{\partial P^2} (P(a, e) - \bar{P}_0)^2 + O(\|(a, e)\|^3),

where the first-order term is zero due to the first-order condition (FOC) of profit maximization and O(P(a, e) - \bar{P}_0) = O(a, e) from Lemma 1 and the proof of Proposition 1. The second derivative of the profit function with respect to price is

\frac{\partial^2 \Pi(P \mid a, e)}{\partial P^2} = \frac{\varphi'(P)}{\varphi(P)} \frac{\partial \Pi(P \mid a, e)}{\partial P} - \frac{\tilde{\sigma}(P)\varphi(P)}{P} \left[ \tilde{\varepsilon}(P) + \big(1 - \tilde{\varepsilon}(P)\big) \frac{MC}{P} \right].

Evaluating this expression at P_1 = P(a, e), we have

\frac{\partial^2 \Pi(a, e)}{\partial P^2} = -\frac{(\tilde{\sigma}(P_1) - 1)\,\varphi(P_1)}{P_1} \left[ 1 + \frac{\tilde{\varepsilon}(P_1)}{\tilde{\sigma}(P_1) - 1} \right],

where we used the FOC, which implies that MC/P_1 = (\tilde{\sigma}(P_1) - 1)/\tilde{\sigma}(P_1). Note that \tilde{\sigma}(P_1) > 1 and \tilde{\varepsilon}(P_1) \geq 0 are indeed sufficient conditions for profit maximization at P_1. Assuming that \tilde{\varepsilon}(\cdot) is a smooth function, we can use the approximation

\frac{\partial^2 \Pi(a, e)}{\partial P^2} = \frac{\partial^2 \Pi(0, 0)}{\partial P^2} + O(a, e) = -\frac{\sigma - 1}{\Psi} + O(a, e),

where the second equality evaluates the second derivative of the profit function at a = e = 0, taking into account that P(0, 0) = 1 and \varphi(1) = 1, \tilde{\sigma}(1) = \sigma, and \tilde{\varepsilon}(1) = \varepsilon due to our normalizations. Combining the above results and the implication of Lemma 1 that \bar{P}_0 = P(0, 0) + O(a, e)^2 = 1 + O(a, e)^2, we can rewrite the approximation to the profit-loss function as

L(a, e) \equiv \Pi(a, e) - \Pi(\bar{P}_0 \mid a, e) = \frac{1}{2} \frac{\sigma - 1}{\Psi} \left( \frac{P(a, e) - \bar{P}_0}{\bar{P}_0} \right)^2 + O(\|(a, e)\|^3).
B. Simulation Procedure

To simulate the dynamic model of Section IV, we first need to make a few approximations. As described in the text, the demand for a variety j is a function of the normalized relative price, x_jt = D_t P_jt/P_t, where the general expression for the normalization parameter D_t was provided in the text. In the case of the Klenow and Willis (2006) demand specification, this expression becomes

(19)   D_t = \frac{\sigma - 1}{\sigma} \int_{\Omega} \exp\left\{ \frac{1}{\varepsilon} \left[ 1 - \left( \frac{|\Omega| C_{jt}}{C_t} \right)^{\varepsilon/\sigma} \right] \right\} \frac{C_{jt}}{C_t}\, dj,

where

\frac{\sigma - 1}{\sigma} \exp\left\{ \frac{1}{\varepsilon} \left[ 1 - \left( \frac{|\Omega| C_{jt}}{C_t} \right)^{\varepsilon/\sigma} \right] \right\} = \psi^{-1}\!\left( \frac{|\Omega| C_{jt}}{C_t} \right).
In a symmetric steady state P_j = P = \bar{P} and C_j = \bar{C}/|\Omega| for all j, so that x_j \equiv \bar{D} = (\sigma - 1)/\sigma, and the elasticity and superelasticity of demand equal σ and ε, respectively. Equations (14) and (13) in the text, together with our demand specification, imply the following (implicit) expression for the sectoral price level:

(20)   P_t = \frac{1}{|\Omega|} \int_{\Omega} P_{jt} \left[ 1 - \varepsilon \ln\!\left( \frac{\sigma}{\sigma - 1} \frac{D_t P_{jt}}{P_t} \right) \right]^{\sigma/\varepsilon} dj.

We can now prove

LEMMA A.1. (i) The first-order deviation of D_t from \bar{D} = (\sigma - 1)/\sigma is nil. (ii) The geometric average provides an accurate first-order approximation to the sectoral price level:

\ln P_t \approx \frac{1}{|\Omega|} \int_{\Omega} \ln P_{jt}\, dj.

In both cases the order of magnitude of the approximation error is O(\|\{\hat{P}_{jt}\}_{j \in \Omega}\|^2), where \|\cdot\| is some vector norm in L^\infty and the circumflex denotes the log-deviation from the steady state value, \hat{P}_{jt} \equiv \ln P_{jt} - \ln \bar{P}.
Now using the definition of the Kimball aggregator (12) and the fact that (·) is a smooth function, we have 1 (Cˆ jt − Cˆ t )dj = 0 || up to second-order terms; specifically, the approximation error has order O({Cˆ jt } j∈ 2 ). Combining the two results immediately ˆ t = 0 and Pˆt = ||−1 ˆ implies that D P jt dj up to the second-order 2 ˆ terms, O({C jt } j∈ ). Taking the log-differential of the demand equation (18), we have ˆ t + Pˆ jt − Pˆt ), Cˆ jt − Cˆ t = −σ ( D which allows us to conclude that O({Cˆ jt } j∈ ) = O({ Pˆ jt } j∈ ). Finally, note that the expression for Pˆt is equivalent to 1 ln Pt = ln Pjt dj || ¯ because in a symmetric steady state Pj = P = P. This result motivates us to make the following assumptions: ¯ = (σ − 1)/σ and comIn our simulation procedure we set Dt ≡ D pute the sectoral price index as the geometric average of individual prices. Lemma A.1 ensures that these are accurate first-order approximations to the true expressions and using them speeds up the simulation procedure considerably as we avoid solving for another layer of fixed point problems.38 We verify, however, that computing the sectoral price index according to the exact expression (20) does not change the results. Additionally, we introduce two more approximations. First, we set the stochastic discount factor to be constant and equal to the discount factor β < 1. Second, we set the sectoral consumption index to Ct ≡ 1. Both of these assumptions are in line with our partial equilibrium approach, and they introduce only secondorder distortions to the price-setting problem of the firm. The assumptions in Section IV.A allow us to reduce the state space for each firm to S jt = (Pj,t−1 , Ajt ; Pt , et ), where et = ln(Wt∗ /Wt ) = ln Wt∗ . We iterate the Bellman operator (16) on a logarithmic grid for each dimension of S jt . Specifically, the grid for individual price Pjt is chosen so that an increment is no greater 38. This is also an assumption adopted by Klenow and Willis (2006).
The assumptions in Section IV.A allow us to reduce the state space for each firm to S_jt = (P_{j,t-1}, A_jt; P_t, e_t), where e_t = ln(W_t^*/W_t) = ln W_t^*. We iterate the Bellman operator (16) on a logarithmic grid for each dimension of S_jt. Specifically, the grid for the individual price P_jt is chosen so that an increment is no greater than a 0.5% change in price (typically, around 200 grid points). The grid for the idiosyncratic shock A_jt contains eleven grid points and covers ±2.5 unconditional standard deviations of the stochastic process. The grid for the sectoral price level P_t is such that an increment is no greater than a 0.2% change in the price level (typically, around thirty grid points). Finally, the grid for the real exchange rate e_t has fifteen grid points with increments equal to σ_e, covering ±7σ_e (footnote 39).

To iterate the Bellman operator (16), a firm needs to form expectations about the future path of the exogenous state variables (A_jt, e_t, P_t). Because A_jt and e_t follow the exogenous stochastic processes specified above, the conditional expectations for these variables are immediate.40 The path of the sectoral price index, P_t, however, is an endogenous equilibrium outcome, and to set prices the firm needs to form expectations about this path. This constitutes a fixed point problem: the optimal decision of a firm depends on the path of the price level, and this optimal decision feeds into the determination of the equilibrium path of the price level.41 Following Krusell and Smith (1998), we assume that firms base their forecast on a restricted set of state variables; specifically,

E_t \ln P_{t+1} = \gamma_0 + \gamma_1 \ln P_t + \gamma_2 e_t.

In principle, the lags of (ln P_t, e_t) could also be useful for forecasting ln P_{t+1}; however, in practice, (ln P_t, e_{t+1}) alone explains over 95% of the variation in ln P_{t+1}, and e_t is a sufficient statistic for forecasting e_{t+1}. The firms use the forecasting vector (γ_0, γ_1, γ_2) consistent with the dynamics of the model. This reflects the fact that they form rational expectations given the restricted set of state variables on which they condition.

39. We let the log of the real exchange rate, e_t, follow a binomial random walk process within wide boundaries. Specifically, its value each period either increases or decreases by σ_e with equal probabilities and reflects from the boundaries of the grid. We use this procedure to numerically generate a highly persistent process for the exchange rate, which is harder to obtain using the Tauchen routine.

40. Recall that A_jt follows a first-order autoregressive process; we discretize it using the Tauchen routine.

41. In fact, there are two distinct fixed point problems, one static and one dynamic. The price that the firm sets today, P_jt, depends both on the price level today, P_t, and on the expectation of the price level in the future, E_t P_{t+1}. The static problem is easy to solve: holding expectations constant, we find P_t consistent with

\ln P_t = \frac{1}{|\Omega|} \int_{\Omega} \ln P_{jt}(P_t)\, dj,

where P_jt(P_t) underlines the dependence of the individual prices on the sectoral price level.
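Footnote 40 mentions that the productivity process is discretized with the Tauchen routine. A generic textbook version of that routine, matched to the grid described above (eleven points spanning ±2.5 unconditional standard deviations), is sketched below; it is not the authors' code.

```python
# Standard Tauchen discretization of a_t = rho*a_{t-1} + sigma_u*u_t, u_t ~ N(0,1).
import numpy as np
from scipy.stats import norm

def tauchen(rho, sigma_u, n=11, m=2.5):
    std_uncond = sigma_u / np.sqrt(1.0 - rho ** 2)
    grid = np.linspace(-m * std_uncond, m * std_uncond, n)   # +/- m unconditional std
    step = grid[1] - grid[0]
    T = np.empty((n, n))
    for i in range(n):
        mean = rho * grid[i]
        cdf = norm.cdf((grid - mean + step / 2) / sigma_u)   # CDF at each bin's upper edge
        T[i] = np.diff(np.concatenate(([0.0], cdf[:-1], [1.0])))
    return grid, T

a_grid, T = tauchen(rho=0.95, sigma_u=0.085)
print(a_grid[[0, -1]], T.sum(axis=1))                        # rows of T sum to one
```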
To implement this, we:

i. start with an initial forecasting vector (γ_0^{(0)}, γ_1^{(0)}, γ_2^{(0)});
ii. given the forecasting vector, iterate the Bellman equation till convergence to obtain policy functions for price setting;
iii. using the policy functions, simulate M dynamic paths of the sectoral price level {P_t}_{t=0}^{T}, where in every period we make sure that P_t is consistent with the price setting of firms;
iv. for each simulation m, estimate \hat{\gamma}^{(0,m)} \equiv (\hat{\gamma}_0^{(0,m)}, \hat{\gamma}_1^{(0,m)}, \hat{\gamma}_2^{(0,m)}) by regressing ln P_{t+1} on a constant, ln P_t, and e_t, and obtain (γ_0^{(1)}, γ_1^{(1)}, γ_2^{(1)}) by taking the median of \hat{\gamma}^{(0,m)} across simulations;
v. iterate this procedure till joint convergence of the forecasting equation coefficients.

This constitutes a reasonable convergence procedure in a stochastic environment.
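A schematic of the outer loop in steps i–v is sketched below. The function simulate_paths() is a stand-in for steps ii–iii (solving the Bellman equation and simulating the sectoral equilibrium), which require the full model; here it generates synthetic paths so that the estimation and convergence logic in steps iv–v can run end to end.

```python
# Krusell-Smith-style outer loop: simulate paths given a forecasting vector,
# re-estimate ln P_{t+1} = g0 + g1*ln P_t + g2*e_t by OLS, iterate to convergence.
import numpy as np

rng = np.random.default_rng(2)

def simulate_paths(gamma, M=20, T=240, sigma_e=0.025):
    """Stand-in for steps ii-iii: returns M synthetic (ln P_t, e_t) paths."""
    paths = []
    for _ in range(M):
        e = np.cumsum(rng.choice([-sigma_e, sigma_e], T))   # binomial random walk for e_t
        lnP = np.zeros(T)
        for t in range(T - 1):                               # placeholder price-level dynamics
            lnP[t + 1] = gamma[0] + gamma[1] * lnP[t] + gamma[2] * e[t] + rng.normal(0, 1e-3)
        paths.append((lnP, e))
    return paths

gamma = np.array([0.0, 0.9, 0.05])                           # step i: initial forecasting vector
for _ in range(50):                                          # step v: outer iteration
    estimates = []
    for lnP, e in simulate_paths(gamma):                     # steps ii-iii (stand-in)
        X = np.column_stack([np.ones(len(lnP) - 1), lnP[:-1], e[:-1]])
        coef, *_ = np.linalg.lstsq(X, lnP[1:], rcond=None)   # step iv: OLS per simulated path
        estimates.append(coef)
    gamma_new = np.median(np.array(estimates), axis=0)       # median across simulations
    if np.max(np.abs(gamma_new - gamma)) < 1e-4:
        break
    gamma = gamma_new
print(gamma)
```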
Once the forecasting vector is established, we iterate the Bellman operator to find policy functions for domestic and foreign firms in every state. This then allows us to simulate a panel of individual prices similar to the one we use in the empirical section. Specifically, we simulate a stationary equilibrium with 12,000 domestic and 2,400 foreign firms operating in the local market. We simulate the economy for 240 periods and then take the last 120 periods. During this time interval each firm appears in the sample for, on average, 35 consecutive months (on average, four price adjustments per firm), which generates an unbalanced panel of firm price changes, as we observe in the data.42 On these simulated data, we estimate the same regressions (1) and (2) as we do on the BLS data set. We repeat the simulation B times and take the median of all statistics across these B simulations to exclude noise in the estimates.

In the final simulation exercise we allow firms to be heterogeneous in both their menu cost κ and superelasticity of demand ε. Specifically, for every value of ε, we have firms with different values of κ distributed uniformly on {0.1%, 0.5%, 1.25%, 2.5%, 5%, 10%, 20%}. We simulate price data for a panel of firms using the procedure described above. We then estimate frequency for every firm and sort firms into frequency deciles. For each frequency decile we estimate lifelong pass-through and the median size of price adjustment. This parallels the procedure that we used in our empirical analysis of Section II.

42. Specifically, for each firm we choose a random time interval during which its price is observed by the econometrician, although the good exists in all time periods. This captures the feature that in the data the BLS only observes price changes when the good is in the sample and only a few price changes are observed.

DEPARTMENT OF ECONOMICS, HARVARD UNIVERSITY, AND NATIONAL BUREAU OF ECONOMIC RESEARCH
DEPARTMENT OF ECONOMICS, PRINCETON UNIVERSITY
REFERENCES

Atkeson, Andrew, and Ariel Burstein, "Pricing-to-Market, Trade Costs, and International Relative Prices," American Economic Review, 98 (2008), 1998–2031.
Ball, Laurence, and N. Gregory Mankiw, "Asymmetric Price Adjustment and Economic Fluctuations," Economic Journal, 104 (1994), 247–261.
Ball, Laurence, N. Gregory Mankiw, and David Romer, "The New Keynesian Economics and the Output–Inflation Trade-Off," Brookings Papers on Economic Activity, 19 (1988), 1–65.
Barro, Robert J., "A Theory of Monopolistic Price Adjustment," Review of Economic Studies, 39 (1972), 17–26.
Bergin, Paul R., and Robert C. Feenstra, "Pricing to Market, Staggered Contracts, and Real Exchange Rate Persistence," Journal of International Economics, 54 (2001), 333–359.
Bils, Mark J., and Peter J. Klenow, "Some Evidence on the Importance of Sticky Prices," Journal of Political Economy, 112 (2004), 947–985.
Bils, Mark J., Peter J. Klenow, and Benjamin Malin, "Reset Price Inflation and the Impact of Monetary Policy Shocks," NBER Working Paper No. 14787, 2009.
Broda, Christian, and David E. Weinstein, "Globalization and the Gains from Variety," Quarterly Journal of Economics, 121 (2006), 541–585.
Burstein, Ariel, and Nir Jaimovich, "Understanding Movements in Aggregate and Product-Level Real Exchange Rates," Stanford University Working Paper, 2009.
Caplin, Andrew S., and Daniel F. Spulber, "Menu Costs and the Neutrality of Money," Quarterly Journal of Economics, 102 (1987), 703–726.
Devereux, Michael B., and James Yetman, "Price Adjustment and Exchange Rate Pass-Through," University of Hong Kong Working Paper, 2008.
Dornbusch, Rudiger, "Exchange Rate and Prices," American Economic Review, 77 (1987), 93–106.
Dossche, Maarten, Freddy Heylen, and Dirk Van den Poel, "The Kinked Demand Curve and Price Rigidity: Evidence from Scanner Data," National Bank of Belgium Research Paper No. 200610-11, 2006.
Dotsey, Michael, Robert G. King, and Alexander L. Wolman, "State-Dependent Pricing and the General Equilibrium Dynamics of Money and Output," Quarterly Journal of Economics, 114 (1999), 655–690.
Feenstra, Robert C., Joseph E. Gagnon, and Michael M. Knetter, "Market Share and Exchange Rate Pass-Through in World Automobile Trade," Journal of International Economics, 40 (1996), 187–207.
Fitzgerald, Doireann, and Stefanie Haller, "Exchange Rates and Producer Prices: Evidence from Micro Data," Stanford University Working Paper, 2008.
Goldberg, Linda S., and José Manuel Campa, "The Sensitivity of the CPI to Exchange Rates: Distribution Margins, Imported Inputs, and Trade Exposure," Review of Economics and Statistics, forthcoming, 2009.
Goldberg, Pinelopi K., and Rebecca Hellerstein, "A Framework for Identifying the Sources of Local-Currency Price Stability with an Empirical Application," NBER Working Paper No. 13183, 2007.
Goldberg, Pinelopi K., and Michael M. Knetter, "Goods Prices and Exchange Rates: What Have We Learned?" Journal of Economic Literature, 35 (1997), 1243–1272.
Gopinath, Gita, Pierre-Olivier Gourinchas, Chang-Tai Hsieh, and Nicholas Li, "Estimating the Border Effect: Some New Evidence," NBER Working Paper No. 14938, 2009.
Gopinath, Gita, and Oleg Itskhoki, "Frequency of Price Adjustment and Pass-Through," NBER Working Paper No. 14200, 2008.
Gopinath, Gita, Oleg Itskhoki, and Roberto Rigobon, "Currency Choice and Exchange Rate Pass-Through," American Economic Review, forthcoming, 2009.
Gopinath, Gita, and Roberto Rigobon, "Sticky Borders," Quarterly Journal of Economics, 123 (2008), 531–575.
Greene, William H., Econometric Analysis, 4th ed. (Upper Saddle River, NJ: Prentice-Hall, 2000).
Kehoe, Patrick J., and Virgiliu Midrigan, "Sticky Prices and Real Exchange Rates in the Cross-Section," New York University Working Paper, 2007.
Kimball, Miles, "The Quantitative Analytics of the Basic Neomonetarist Model," Journal of Money, Credit and Banking, 27 (1995), 1241–1277.
Klenow, Peter J., and Jonathan L. Willis, "Real Rigidities and Nominal Price Changes," Federal Reserve Bank of Kansas City Research Working Paper 06-03, 2006.
Knetter, Michael M., "Price Discrimination by U.S. and German Exporters," American Economic Review, 79 (1989), 198–210.
Krugman, Paul R., "Pricing to Market When the Exchange Rate Changes," in Real Financial Linkages among Open Economies, Swen W. Arndt and J. David Richardson, eds. (Cambridge, MA: MIT Press, 1987).
Krusell, Per, and Anthony A. Smith, Jr., "Income and Wealth Heterogeneity in the Macroeconomy," Journal of Political Economy, 106 (1998), 867–896.
Lane, Philip R., "The New Open Economy Macroeconomics: A Survey," Journal of International Economics, 54 (2001), 235–266.
Midrigan, Virgiliu, "International Price Dispersion in State-Dependent Pricing Models," Journal of Monetary Economics, 54 (2007), 2231–2250.
Nakamura, Emi, and Jón Steinsson, "Five Facts about Prices: A Reevaluation of Menu Cost Models," Quarterly Journal of Economics, 123 (2008), 1415–1464.
Neiman, Brent, "Multinationals, Intrafirm Trades, and International Macroeconomic Dynamics," University of Chicago Working Paper, 2007.
Obstfeld, Maurice, and Kenneth S. Rogoff, "Exchange Rate Dynamics Redux," Journal of Political Economy, 103 (1995), 624–660.
Rauch, James E., "Networks versus Markets in International Trade," Journal of International Economics, 48 (1999), 7–35.
Rogoff, Kenneth S., "The Purchasing Power Parity Puzzle," Journal of Economic Literature, 34 (1996), 647–668.
Romer, David, "Staggered Price Setting with Endogenous Frequency of Adjustment," NBER Working Paper No. 3134, 1989.
Rotemberg, Julio J., and Garth Saloner, "The Relative Rigidity of Prices," American Economic Review, 77 (1987), 917–926.
Rotemberg, Julio J., and Michael Woodford, "The Cyclical Behavior of Prices and Costs," in Handbook of Macroeconomics, John B. Taylor and Michael Woodford, eds. (Amsterdam, Netherlands: Elsevier North Holland, 1999).
Sheshinski, Eytan, and Yoram Weiss, "Inflation and Costs of Price Adjustment," Review of Economic Studies, 44 (1977), 287–303.
PRICE STICKINESS AND CUSTOMER ANTAGONISM∗

ERIC T. ANDERSON AND DUNCAN I. SIMESTER

Managers often state that they are reluctant to vary prices for fear of "antagonizing customers." However, there is no empirical evidence that antagonizing customers through price adjustments reduces demand or profits. We use a 28-month randomized field experiment involving over 50,000 customers to investigate how customers react if they buy a product and later observe the same retailer selling it for less. We find that customers react by making fewer subsequent purchases from the firm. The effect is largest among the firm's most valuable customers: those whose prior purchases were most recent and at the highest prices.

∗ We thank Nathan Fong for valuable research assistance and the anonymous retailer for generously providing the data used in this study. We gratefully acknowledge valuable comments from Martin Eichenbaum, Nir Jaimovich, Aviv Nevo, Julio Rotemberg, Catherine Tucker, Birger Wernerfelt, Juanjuan Zhang, and numerous other colleagues. The paper has also benefited from comments by seminar participants at Chicago, Columbia, Cornell, Erasmus, INSEAD, LBS, Minnesota, MIT, NYU, Northeastern, Northwestern, Penn State, UCLA, UCSD, UTD, Yale, the University of Houston, the 2006 Economic Science Association conference, the 2007 QME conference, and the 2008 NBER Price Dynamics conference.

© 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology.
The Quarterly Journal of Economics, May 2010
“It seems essential, therefore, to gain a better understanding of precisely what firms mean when they say that they hesitate to adjust prices for fear of antagonizing customers.” Blinder et al. (1998, p. 313)
I. INTRODUCTION

The assumption that nominal prices are sticky is fundamental to Keynesian economics and forms a basic premise of many models of monetary policy. A leading explanation for why prices are slow to adjust is that firms do not want to antagonize their customers. Yet there is little empirical evidence that antagonizing customers through price adjustments lowers either demand or profits. In this paper we study the effects of downward price adjustments and show that many customers stop purchasing if they see a firm charge a lower price than they previously paid for the same item. This customer boycott is concentrated among the firm's most valuable customers, which greatly magnifies the cost to the firm. Firms can mitigate these costs by limiting the frequency and/or depth of price adjustments.

The findings are replicated in two separate field experiments conducted in different product categories. The first experiment
was conducted with a publishing company. Customers were randomly chosen to receive a test or control version of a catalog (the “Test Catalog”). The versions were identical except that the test version offered substantially lower prices on 36 items (the “Test Items”). The loss of demand and profits was particularly dramatic among customers who had recently paid a higher price for one of those items. Lower prices under the test condition led to 14.8% fewer orders over the next 28 months, which equates to over $90 in lost revenue per customer (including revenue from the Test Catalog itself). Decomposing the results reveals that the price adjustments had two effects. As expected, orders from the Test Catalog were higher in the test condition, as some customers took advantage of the discounted prices. However, this short-run effect was overwhelmed by a sharp reduction in orders from other catalogs. Price adjustments in a single catalog had negative spillover effects on future orders (and overall revenue). To investigate the robustness and generalizability of the results, we replicate the key findings with a separate company in a different product category (clothing). This replication is particularly noteworthy because most customers expect to see lower prices on clothing at the end of a season. Despite this, we show that sending a “Sale” catalog to customers in the days immediately after Christmas reduced purchases by customers who had previously paid a higher price for one of the discounted items. There are several possible explanations for these effects. One explanation is that customers are antagonized if they observe the firm charging lower prices than they previously paid. Another explanation is that lower prices may have prompted customers to update their price expectations and change their purchase behavior. A third possibility is that low prices may have influenced customers’ beliefs about product quality. To evaluate these explanations, we exploit heterogeneity in the experimental treatment effects. For example, the reduction in demand and profits was restricted to customers who had recently purchased one of the discounted items at a high price. Customer antagonism is consistent with these boundaries. We also find some evidence that customers updated their price expectations and delayed purchasing, although this effect does not appear to explain the outcome fully. Other explanations are ruled out by the randomized allocation of customers to the two experimental conditions. The randomization ensures that competitive reactions, inventory constraints,
macroeconomic changes, and customer characteristics cannot explain the decrease in demand, as these factors were equally likely to affect sales under both conditions. The results of the field experiments had a direct impact on the pricing policies of the two participating retailers. After learning of the results, the company that provided data for our first experiment responded by no longer sending catalogs containing discounts to customers who had recently purchased one of the discounted items. This effectively reduces the degree to which prices are varied for an individual customer. The company that participated in the second study also responded by restricting price changes. For example, the company removed a discounted item from the back page of a widely circulated catalog to avoid antagonizing approximately 120,000 customers who had previously purchased the item at full price. Even before learning of the findings, this company had policies that limited the frequency of discounts. Managers acknowledged that these policies reflected concern that frequent price adjustments may antagonize customers. These reactions suggest that firms may be able to mitigate the reduction in demand from charging different prices to different customers. If the products are durables that most customers only purchase once, the cost of foregoing these price changes will be minimal. However, most retailers are not able to price discriminate as perfectly as the two catalog retailers in this study. When firms cannot charge different prices to individual customers, the effects that we report will tend to reduce the optimal frequency and/or depth of price adjustments. I.A. Previous Research The research on price stickiness can be broadly categorized into two topics: (1) are prices sticky? and (2) why are they sticky? The evidence that prices respond slowly to business cycles is now extensive (Gordon 1990; Weiss 1993). In recent years much of the attention has turned to the second question. One set of explanations argue that costs may be constant, either within a limited neighborhood or within a limited time period. Another common explanation studies the (menu) cost of changing prices. The more costly it is to change prices, the less frequently we would expect them to change. Other explanations have considered imperfections in the information available to price setters and asymmetries in the demand curve.
In this paper we study the role of customer antagonism. This explanation argues that firms do not want to change prices because doing so may antagonize their customers. Hall and Hitch (1939) were among the first to investigate this issue empirically. They interviewed a sample of managers to learn how prices are set. The managers’ responses included statements such as “Price changes [are] a nuisance to agents, and disliked by the market” and “Frequent changes of price would alienate customers” (pp. 35 and 38). More recently, Blinder et al. (1998) also asked managers why they did not vary their prices. This study was conducted on a large scale and involved 200 interviews with senior executives conducted over a two-year period. The most common response was that frequent price changes would “antagonize” or “cause difficulties” for customers, leading the authors to conclude that we need to better understand the role of customer antagonism (pp. 85 and 308; see also the quotation at the start of this paper). The findings have since been corroborated by similar studies surveying managers in Canada (Amirault, Kwan, and Wilkinson 2004) and a broad range of European countries (Hall, Walsh, and Yates 1996; Fabiani et al. 2004; Apel, Friberg, and Hallsten 2005).1 Although these studies have raised awareness that customer antagonism may contribute to price stickiness, there are limits to what can be learned from survey data. Notably, the data do not allow researchers to measure whether antagonizing customers through price adjustments reduces demand or profits. The evidence in this paper is related to Rotemberg’s theoretical research on firm altruism. Customers in Rotemberg’s models only want to transact with firms that are “altruistic.” Although customers interpret firms’ actions generously, they boycott firms when there is convincing evidence that their expectations regarding altruism are violated. This reaction has been used to help explain why prices are sticky (Rotemberg 2005); investigate how customers react to brand extensions (Rotemberg 2008); and explore how fairness influences firms’ pricing decisions (Rotemberg 2009). We will present evidence that the reductions in future purchases do not just reflect changes in customers’ expectations about prices or product quality. Instead, the effects appear to influence 1. Other empirical evidence includes Zbaracki et al. (2004), who report findings from an extensive study of the cost of changing prices at a large industrial firm. Their findings include a series of anecdotes and quotations indicating that managers at this firm were concerned that frequent price changes would damage the firm’s reputation with its customers.
how customers view less tangible characteristics of the firm and its brand. This evidence may be seen as direct support for Rotemberg’s claim that customers’ perceptions of whether firms are altruistic contribute to the value of a firm’s brand (Rotemberg 2008). Although our findings are clearly consistent with Rotemberg’s models, this is not the only explanation for why customers may stop purchasing from a firm when they are antagonized by price adjustments. It is possible that firms are simply risk-averse and want to minimize the risk of future antagonism.2 Other explanations are also suggested by previous research on the role of reputations and firm brands (Wernerfelt 1988; Tadelis 1999). I.B. Customer Antagonism When Prices Increase We measure the response to downward price adjustments. This is a natural place to start, as it invokes a strong strawman: lower prices lead to higher sales. There is a complementary stream of research that studies customer antagonism in response to upward price adjustments. The origins of this work can be traced to Phelps and Winter (1970) and Okun (1981). Okun introduces the label “customer markets” to describe markets with repeated transactions. He argues that if price increases antagonize buyers, then sellers may respond to unobserved price shocks by absorbing the cost changes rather than increasing their prices. Nakamura and Steinsson (2008) extend this intuition in a recent paper. In their model, price rigidity serves as a partial commitment device, which enables sellers to commit not to exploit customers’ preferences to repeatedly purchase from the same seller. The commitment mechanism is endogenous and relies on the threat that a deviation would trigger an adverse shift in customer beliefs about future prices.3 As support for their model, Nakamura and Steinsson (2008) cite two experimental papers. Renner and Tyran (2004) demonstrate that when buyers are uncertain about product quality, sellers are less likely to raise prices in response to cost increases if there is an opportunity to develop long-term relationships. This effect is particularly pronounced if buyers cannot observe the cost increases. The explanation offered for these effects is that the 2. We thank an anonymous reviewer for this suggestion. 3. In a related paper, Kleshchelski and Vincent (2009) investigate how customer switching costs create an incentive for firms to build market share. They demonstrate that this may prompt firms to absorb a portion of transitory cost increases without increasing prices.
sellers are willing to absorb the additional costs rather than risk antagonizing the buyers. These results complement an earlier experimental study (Cason and Friedman 2002) that investigates how variation in customers’ search costs affects both their willingness to engage in repeated relationships and the resulting variation in prices. The authors show that higher search costs tend to result in more repeated relationships, increasing sellers’ profits and leading to less variation in prices. I.C. Other Relevant Research Our findings are also related to previous work on intertemporal price discrimination. Research on the timing of retail sales argues that firms can profit by occasionally lowering prices and selling to a pool of low-valuation customers (Conlisk, Gerstner, and Sobel 1984; Sobel 1984, 1991). These arguments do not consider the possibility that customers who observe the lower prices may be less likely to make future purchases. Our findings represent a countervailing force that may limit a firm’s willingness to price discriminate. The findings are also relevant to recent work on price obfuscation. This literature recognizes that many firms adopt practices that make it difficult for customers to compare prices (Ellison 2006; Ellison and Ellison 2009). Price obfuscation may allow firms to mitigate the reactions that we document in this paper. If customers could not easily compare prices, then we would not expect the same outcomes. In this respect, our findings may help to explain the use of price obfuscation mechanisms. Finally, we can also compare our findings with research on reference prices (Kahneman, Knetsch, and Thaler 1986). A challenge in the reference price literature is to identify the correct “reference” price. Some customers may use later prices as a reference against which to evaluate the price paid in earlier transactions. Under this interpretation, the findings are easily reconciled with the reference price literature: customers who later see the firm charging a lower price may conclude that they experienced a loss from overpaying in the past. We caution that this definition of the reference price is not unique, and alternative definitions may lead to different predictions. I.D. Plan of the Paper The paper continues in Section II with a detailed description of the experimental setting and the design of the first field
experiment. We present initial findings in Section III, which provide both context and motivation for the results that follow. We investigate how the outcome is moderated by the recency of customers' prior purchases in Section IV. In Section V we consider the price paid in the prior purchases, together with the persistence of the results across the 28-month measurement period. In Section VI we investigate three alternative explanations for the results, and then in Section VII we present a replication in a different product category. The paper concludes in Section VIII with a summary of findings and limitations.
II. STUDY DESIGN

The primary field experiment was conducted with a medium-sized publishing retailer that manufactures and sells a range of approximately 450 products targeted at well-educated retail customers.4 All of the products carry the retailer's brand name and are sold exclusively through the company's catalogs. At the time of our study the firm also operated an Internet site, but few customers placed their orders online. The products are durables with characteristics similar to those of books, computer software, and music. It is rare for customers to buy multiple units of the same item (few customers buy two copies of Oliver Twist), and so incremental sales typically reflect purchases of other items (buying both Oliver Twist and David Copperfield). The average interpurchase interval is 48 weeks.

Interviews with managers revealed that the firm engages in intertemporal price discrimination, charging relatively high regular prices interspersed with frequent shallow discounts and occasional deep discounts. A review of the firm's historical transaction data confirms that there had been wide variation in prices paid for the same item. Approximately one-fifth of transactions occur at the regular price, with most of the remaining transactions occurring at either a shallow (20%–30%) or a deep (50%–60%) discount. In Figure I we report a frequency distribution describing the discounts received in the two years before the Test Catalog was mailed.

4. The results of this study for a small subset of customers were previously described in Anderson and Simester (2004). This earlier paper uses data from this study (and two other studies) to investigate a different research question: comparing how prospective and existing customers react to discounts.
FIGURE I
Histogram of Pre-test Discounts
[Bar chart omitted from this text version: the x-axis shows the discount received (List price, Under 20%, 20–30%, 30–40%, 40–50%, 50–60%, 60–70%, 70–80%, Over 80%); the y-axis shows the percentage of purchases.]
A histogram of the discounts received in purchases in the two years before the Test Catalog was mailed. The histogram was calculated using a randomly selected sample of customers (including customers not involved in the study).
II.A. Design of the Test Catalog

The field test encompassed two time periods separated by the mailing of the Test Catalog. Our data describe individual customer purchases in the eight years before the Test Catalog was mailed (the "pre-test" period) and the 28 months after this date (the "post-test" period). It will be important to remember that our measure of sales during the post-test period includes purchases from the Test Catalog itself. This allows us to rule out the possibility that the findings merely reflect intertemporal demand substitution.

There were two different versions of the Test Catalog: a "shallow discount" and a "deep discount" version. A total of 55,047 retail customers were mailed the Test Catalog, all of whom had previously purchased at least one item from the company. Approximately two-thirds of the customers (36,815) were randomly assigned to the Shallow Discount condition, and the remaining customers (18,232) were assigned to the Deep Discount condition.5

5. In Harrison and List's (2004) nomenclature, this is a "natural field experiment." They distinguish between "artefactual field experiments," which are the same as conventional lab experiments but with a nonstudent subject pool; "framed field experiments," which introduce field context; and "natural field experiments," which also occur in a field context and use subjects who do not know that they are participants in an experiment.
TABLE I
CHECK ON RANDOMIZATION PROCEDURES

                                            Shallow Discount    Deep Discount
                                            condition           condition          Difference      p-value
Days since last purchase (recency)          646.04 (2.54)       647.92 (3.61)      −1.88 (4.42)    .67
Number of units purchased (frequency)       3.34 (0.02)         3.31 (0.03)        0.03 (0.04)     .42
Average price of units purchased
  (monetary value) ($)                      133.26 (0.44)       132.69 (0.63)      0.57 (0.77)     .46
Sample size                                 36,815              18,232

Note. Table I reports the mean values of each historical purchasing measure (calculated separately for each condition). The statistics are calculated using purchases during the eight-year pre-test period, prior to the mailing date for the Test Catalog. Standard errors are in parentheses. The p-values denote the probability that the difference between the deep discount and shallow discount averages will be larger than the observed difference, under the null hypothesis that the true averages are identical.
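The randomization check summarized in Table I (and discussed in the text that follows) amounts to comparing the pre-test RFM means across the two conditions and testing the differences. A minimal sketch of such a check is below; it runs on simulated data, and the column names, values, and use of pandas/SciPy are purely illustrative assumptions rather than a description of the firm's actual file.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 55_047

# Illustrative stand-in for the pre-test purchase histories; all names and values are made up.
pre = pd.DataFrame({
    "deep": rng.integers(0, 2, n),             # 1 = Deep Discount condition
    "recency": rng.normal(647, 300, n),        # days since last purchase
    "frequency": rng.poisson(3.3, n),          # units previously purchased
    "monetary": rng.normal(133, 52, n),        # average price of units purchased
})

for col in ["recency", "frequency", "monetary"]:
    shallow = pre.loc[pre["deep"] == 0, col]
    deep = pre.loc[pre["deep"] == 1, col]
    diff = shallow.mean() - deep.mean()
    se = np.sqrt(shallow.sem() ** 2 + deep.sem() ** 2)   # SE of the difference in means
    t_stat, p_value = stats.ttest_ind(shallow, deep, equal_var=False)
    print(f"{col}: difference = {diff:.2f} (se = {se:.2f}), p = {p_value:.2f}")
```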
The decision to assign a larger fraction of customers to the shallow discount was made by the firm and was outside our control but does not affect our ability to interpret the results. We confirm that the allocation of customers to the two conditions was random by comparing the historical purchases made by the two samples of customers. In Table I we compare the average Recency, Frequency, and Monetary Value (RFM) of customers' purchases during the eight-year pre-test period.6 If the assignment were truly random, we should not observe any systematic differences in historical sales between the two samples. Reassuringly, none of the differences are significant despite the large sample sizes.

6. "Recency" is measured as the number of days since a customer's last purchase. "Frequency" measures the number of items that customers previously purchased. "Monetary Value" measures the average price (in dollars) of the items ordered by each customer. The interpurchase interval (48 weeks) is much shorter than the average of the recency measure (646 days). These measures are not directly comparable: the interpurchase interval describes the time between purchases, whereas the recency measure includes some customers who will make no additional purchases.

The Test Catalog was a regularly scheduled catalog containing 72 pages and 86 products. The only differences between the two versions were the prices on the 36 "Test Items." These 36 items were discounted in both versions, but the discounts were larger in the Deep Discount condition. In the Shallow Discount condition, the mean discount on the Test Items was 34%. In the Deep
Discount condition, the mean discount was 62%. These yielded mean prices of $133.81 and $77.17 on the 36 Test Items under the two conditions (compared to a mean regular price of $203.83). The sizes of the price discounts were chosen to be large enough to generate an effect, but not so large that they were outside the range of historical discounts (see Figure I). The other fifty items were all at their regular prices in both versions.

The prices of the Test Items were presented as "Regularly $x Sale $y" in the deep discount version, and as "Regularly $x Sale $z" in the shallow discount version. The regular price ($x) was the same under both conditions, but the sale price was lower in the Deep Discount condition ($y < $z). The use of identical text ensured that the discounted price was the only difference between the two versions and also explains why we used shallow discounts rather than the regular price as a control. There is considerable evidence that customers are sensitive to the word "Sale," even when prices are held constant (Anderson and Simester 1998, 2001). Charging the regular price as a control would have made it difficult to distinguish the effect of the price change from the "Sale" effect. By using the same wording under both conditions, we avoid this confound. As we will discuss, using shallow discounts as a control does not affect our ability to measure how price adjustments affect sales.

II.B. Mailing and Pricing Policies after the Test Catalog

The firm agreed to use the same mailing policy for all customers once the experimental manipulation was over. To confirm compliance, we obtained data describing post-test mailings for a random sample of 16,271 of the 55,047 customers involved in the test (approximately 30% of the full sample). Customers in the Deep Discount condition received a mean of 48.47 catalogs in the post-test period, compared to 48.77 catalogs in the Shallow Discount condition. The difference in these means is not close to statistical significance.

Notice also that the paper's key finding is that sales are lower under the Deep Discount condition (compared to the Shallow Discount condition). As we will discuss, the deep discounts led to an increase in sales from the Test Catalog itself. If this affected the firm's subsequent mailing decisions, it would tend to increase the number of catalogs mailed to customers under this condition. This could not explain the decrease in aggregate post-test sales.
FIGURE II
Histogram of Post-test Discounts
[Bar chart omitted from this text version: the x-axis shows the discount received (List price, Under 20%, 20–30%, 30–40%, 40–50%, 50–60%, 60–70%, 70–80%, Over 80%); the y-axis shows the percentage of purchases.]
A histogram of the discounts received in the 28-month post-test period. The histogram was calculated using a randomly selected sample of customers (including customers not involved in the study).
It is also useful to consider the firm’s pricing policy after the Test Catalog was mailed. The distribution of discounts received in the post-test period is reported in Figure II. The use of discounts persisted, with a noticeable increase in the frequency of deep discounts. Most catalogs mailed during the post-test period contained at least some items with deep discounts. Because customers under the two conditions received the same downstream catalogs, this change in policy does not affect our ability to compare the behavior of customers under the two experimental conditions. II.C. Predictions Just over 47% of the customers (25,942) had not purchased any of the Test Items before receiving the Test Catalog. Of the remaining customers, very few customers had purchased more than one or two of the 36 Test Items, with less than 0.3% purchasing more than 10. As a result, the Test Catalog offered customers an opportunity to purchase discounted Test Items that they did not already own. Because our measure of post-test demand includes purchases from the Test Catalog itself, a standard model predicts higher demand in the Deep Discount condition (which offered lower prices). We label this the “Low Prices” prediction: Low Prices. Post-test demand is higher under the Deep Discount condition than under the Shallow Discount condition.
Customer antagonism suggests an alternative prediction. Customers under the Deep Discount condition were more likely to see prices in the Test Catalog that were lower than what they paid for the same item. If this antagonized customers then we might expect lower sales in this condition. We label this the "antagonism" prediction:

Antagonism. Post-test sales are lower under the Deep Discount condition than under the Shallow Discount condition.

This prediction requires only a relatively simple model of customer behavior. It is sufficient that if customers see prices lower than what they paid they are less likely to make additional purchases. Because this is more likely to occur under the Deep Discount condition, we expect fewer sales under that condition.7

7. Notice that we do not need to rule out favorable responses if customers see a price higher than what they previously paid. We expect lower post-test sales under the Deep Discount condition irrespective of whether there is a positive response (or no response) to seeing higher prices. However, in Section IV we do investigate how the price that customers previously paid affects their response to the deep discounts.

We will also consider three interactions. The first interaction distinguishes the 29,105 customers who had purchased one of the 36 Test Items before receiving the Test Catalog from the 25,942 customers who had not. We only expect customers to be antagonized if they see lower prices on items that they have purchased. Therefore, we expect a more negative (less positive) reaction to the deep discounts among the 29,105 customers with a prior purchase:8

Past Purchase Interaction. Deep discounts have a more negative (less positive) impact on post-test demand among customers who have previously purchased one of the Test Items.

8. It is possible that customers may be concerned about whether other customers were antagonized. However, previous studies have consistently reported that decision makers show little concern for the sufficiency of other customers' outcomes (Guth and van Damme 1998; Selten and Ockenfels 1998).

There is an alternative explanation for this interaction. Because customers are unlikely to purchase the same item twice, customers who had already purchased may have been less likely to take advantage of the deep discounts in the Test Catalog. We will investigate this explanation in Section VI, together with other alternative explanations.

Customer antagonism may also depend upon how recently customers had purchased. Customers who purchased a long time ago may find it harder to remember the price that they paid. Moreover, those who can remember the price may be less antagonized
TABLE II
POST-TEST PURCHASES: COMPLETE SAMPLE

                                              Shallow discount    Deep discount      Difference
Revenue and orders per customer
  Revenue ($)                                 159.28 (2.19)       157.16 (3.06)      −2.11 (3.78)
  Orders: all post-test orders                1.05 (0.01)         1.04 (0.02)        −0.02 (0.02)
  Orders: from the Test Catalog itself        0.022 (0.001)       0.036 (0.002)      0.014** (0.002)
  Orders: from other catalogs                 1.03 (0.01)         1.00 (0.02)        −0.03 (0.02)
  % of customers with at least
    1 post-test order                         35.4 (0.3)          35.0 (0.4)         −0.4 (0.4)
  Number of customers                         36,815              18,232
Composition of orders
  Average item price ($)                      101.49 (0.51)       101.14 (0.74)      −0.35 (0.90)
  Average number of items per order           1.57 (0.01)         1.60 (0.01)        0.03 (0.02)

Note. Table II reports the averages of each post-test sales measure for the respective samples. All statistics are calculated using only purchases from the 28-month post-test period (after the Test Catalog was mailed). The orders, revenue, and any post-test order statistics are calculated using the 36,815 and 18,232 customers under the Shallow and Deep Discount conditions, respectively. The average item price and number of items per order are calculated using the 13,038 and 6,388 customers from each condition who placed at least one order during the post-test period. Standard errors are in parentheses. ** Significantly different from zero, p < .01. * Significantly different from zero, p < .05.
because they have had additional opportunities to consume. We will investigate whether the time between a customer's prior purchase and the mailing date for the Test Catalog contributes to the outcome:

Time Interaction. Deep discounts have a less negative (more positive) effect on post-test sales if the time since a customer's previous purchase is longer.

We also investigate whether there is evidence that customers who paid higher prices for Test Items were more antagonized by the deep discounts. This "past price" interaction is somewhat complicated by the experimental design and so we will defer discussion of this issue to when we present the results.

III. INITIAL RESULTS

In this section we present initial findings that provide both context and motivation for the main analysis that follows. To begin, we ask how the deep discounts affected post-test purchases by
TABLE III
POST-TEST PURCHASES: CUSTOMERS AT RISK OF BEING ANTAGONIZED

                                              Shallow discount    Deep discount      Difference
Revenue and orders per customer
  Revenue ($)                                 506.02 (23.70)      415.31 (30.91)     −90.71* (40.15)
  Orders: all post-test orders                3.27 (0.13)         2.78 (0.18)        −0.48* (0.22)
  Orders: from the Test Catalog               0.10 (0.01)         0.12 (0.02)        0.02 (0.02)
  Orders: from other catalogs                 3.17 (0.13)         2.66 (0.17)        −0.50* (0.22)
  % of customers with at least
    1 post-test order                         72.9 (1.5)          66.0 (2.2)         −6.9* (2.6)
  Number of customers                         933                 459
Composition of orders
  Average item price ($)                      97.61 (1.98)        97.82 (2.87)       0.21 (3.53)
  Average number of items per order           1.66 (0.06)         1.64 (0.06)        −0.02 (0.06)

Note. Table III reports the averages of each post-test sales measure for the respective samples. The samples are restricted to customers who had paid a high price (above the shallow discount price) for a Test Item within three months of the Test Catalog mailing date. The orders, revenue, and any post-test order statistics are calculated using all of the customers in the respective samples. The average item price and number of items per order are calculated using customers who placed at least one order during the post-test period. Standard errors are in parentheses. ** Significantly different from zero, p < .01. * Significantly different from zero, p < .05.
the full sample of 55,047 customers. These results are reported in Table II, where we describe the average number of orders placed by customers under the Deep and Shallow Discount conditions. We distinguish between orders from the Test Catalog itself and orders from other catalogs (not the Test Catalog) during the post-test period.9 We also present the average revenue earned, the average size of customer orders, and the average price of the items that they purchased.

9. When customers call to place an order, they are asked for a code printed on the back of the catalog. This code is also printed directly on the mail-in order form. The data we received contain the catalog code for each order, so we can identify orders from the Test Catalog and orders from other catalogs. There are a small number of orders for which the catalog code is not available, including some Internet orders. Fortunately these instances are rare (there were very few Internet orders during the period of the study) and have little effect on the findings.

Because this overall comparison aggregates customers for whom the outcome was positive with others for whom it was negative, it reveals few significant differences between the two samples. We can illustrate this by returning to the motivating example
used in the Introduction. In Table III we focus on customers who paid a high price (above the shallow discount price) for a Test Item in the three months before the Test Catalog was mailed. In the Introduction we anticipated that these were the customers most likely to have been antagonized by the deep discounts. The deep discounts resulted in fewer post-test orders from these customers. Under the Deep Discount condition customers placed an average of 2.78 post-test orders compared to 3.27 orders under the Shallow Discount condition. The difference (0.48 orders or 14.8%) is statistically significant and can be fully attributed to the prices in the Test Catalog. The findings also help clarify the source of the effect. The differences result solely from changes in the number of orders placed, rather than changes in the composition of those orders; there is no difference between conditions in either the average number of items per order or the average prices of the items purchased. The reduction in orders at least partly reflects an increase in the proportion of customers who placed no orders during the post-test period (34.0% under the Deep Discount condition versus 27.1% under the Shallow Discount condition). It is this result that led us to describe the effect as a customer boycott. The decrease in orders is a strong result: merely sending these customers a catalog containing lower prices reduced purchases. The effect is large, and is precisely the outcome that managers appear to anticipate when stating they are reluctant to adjust prices for fear of antagonizing their customers (Hall and Hitch 1939; Blinder et al. 1998). Notice that customers who had purchased recently (Table III) are systematically more valuable than the average customer in the study (Table II). Although they place similar-sized orders, the recent purchasers order a lot more frequently in the post-test period. Because they are more valuable, the cost to the firm is greatly amplified. Although confidentiality restrictions prevent us from reporting detailed profit results, sending the deep discount version to these 1,392 customers would have lowered the profits earned from these customers by over $93,000 (compared to sending the shallow discount version). It is also helpful to remember factors that cannot explain this result. Most importantly, the difference in orders cannot be explained by differences in the customers themselves. Our analysis compares customers under the two experimental conditions, and random assignment (which we verified in Table I) ensures that there are no systematic differences between customers under the two conditions. We can also see that the difference in post-test
orders is not due to a difference in the number of orders from the Test Catalog itself. Orders from the Test Catalog represent only a very small proportion (less than 3%) of overall post-test demand. Moreover, orders from the Test Catalog were actually slightly higher under the Deep Discount condition, presumably due to the lower prices. Finally, because our measure of sales includes purchases from the Test Catalog, the result cannot be explained by intertemporal demand substitution (forward buying). Acceleration in purchases to take advantage of the Test Catalog discounts would not affect this measure of total orders.

One explanation for the results, which does not depend on customer antagonism, is that the deep discounts may have changed customers' expectations about the availability of future discounts. If customers delayed their future purchases in anticipation of these discounts, it could explain a reduction in post-test sales. As we acknowledged in the Introduction, we cannot completely rule out this explanation. However, in Section VI we will present results that suggest this cannot be a complete explanation for all of the findings.

The initial analysis in this section focused on the sample of customers whom we expected to be most susceptible to the antagonism prediction. In the next section we extend the focus to all of the customers in the study, and measure how the effect was moderated by whether customers had previously purchased a Test Item and (if so) how recently they had purchased it.

IV. THE PAST PURCHASE AND TIME INTERACTIONS

To directly estimate the past purchase and time interactions, we use a multivariate approach. Because our initial analysis revealed that the primary effect is upon the number of orders placed, rather than the composition of those orders, we use a "count" of orders placed as the dependent variable (we later also use revenue and profit as dependent variables). We estimate this variable using Poisson regression, which is well suited to count data.10 Under this model, the number of orders from customer i (Q_i) is drawn from a Poisson distribution with parameter λ_i:

(1)    Prob(Q_i = q) = e^(−λ_i) λ_i^q / q!,    where q = 0, 1, 2, . . . , and ln(λ_i) = βX_i.

10. In Section V we investigate the impact on revenue and profits. This analysis demonstrates that the results are also robust to using an OLS specification.
To estimate how the outcome is moderated by the time since a customer's prior purchase, we use the following specification:

(2)    βX_i = β1 DeepDiscount_i + β2 DeepDiscount_i × Time_i + β3 Time_i + θ Z_i.
These variables are defined as follows:

DeepDiscount_i. A binary variable indicating whether customer i was in the Deep Discount condition.

Time_i. The log of the number of months between the Test Catalog mailing date and customer i's most recent purchase of a Test Item.

For completeness, the vector Z includes the log of the historical RFM measures as control variables. Because these variables do not vary systematically across the two conditions (Table I), their inclusion or exclusion has little impact on the coefficients of interest (β1 and β2). Among customers who previously purchased a Test Item, β1 describes how receiving the deep discounts affected post-test orders by a "benchmark" customer, who purchased a Test Item immediately before receiving the Test Catalog (Time equals zero). As the time since the customer's prior purchase increases, the estimated impact of the deep discounts is moderated by β2. The time interaction predicts that β2 will have a positive sign: the longer the time since the prior purchase, the smaller the reduction in post-test sales.

Notice that this model preserves the benefits of the randomized experimental design. The coefficients of interest measure the percentage difference in post-test sales between customers who received the deep and shallow discount versions of the Test Catalog. As a result, the coefficients of interest cannot be explained by differences in customer characteristics between the two experimental treatments, or by intervening competitive or macroeconomic events. We rely heavily on this feature of the study as it allows us to rule out a wide range of alternative explanations.

The model is estimated separately on the 29,105 customers who had previously purchased a Test Item and on the 25,942 customers who had not previously purchased a Test Item.11 Coefficients and standard errors for both models are reported in Table IV.

11. Joint estimation of the two models yields the same pattern of results, but the findings are more difficult to interpret (they require three-way interactions). For customers who had not previously purchased a Test Item, we calculate the Time since their most recent purchase of any item. This is the same as our Recency measure, and so we omit Recency from this model.
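For concreteness, one way to estimate equation (2) is sketched below (Python, using statsmodels). The data frame, column names, and simulated values are illustrative assumptions only; they are not the firm's confidential data, and the sketch is not a description of the exact estimation code used in the study.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000

# Simulated stand-in for the customer file; every column name is illustrative.
df = pd.DataFrame({
    "deep": rng.integers(0, 2, n),              # 1 = Deep Discount condition
    "months_since": rng.uniform(0.5, 36.0, n),  # months since the prior Test Item purchase
    "recency": rng.uniform(30, 2000, n),        # days since any purchase
    "frequency": rng.integers(1, 20, n),        # units previously purchased
    "monetary": rng.uniform(20, 300, n),        # average price previously paid
})
df["time"] = np.log(df["months_since"])         # Time_i enters in logs

# Simulate a count outcome with the signs reported in Table IV (beta_1 < 0, beta_2 > 0).
lin = 0.9 - 0.08 * df["deep"] + 0.03 * df["deep"] * df["time"] - 0.08 * df["time"]
df["orders"] = rng.poisson(np.exp(lin))

X = sm.add_constant(pd.DataFrame({
    "deep": df["deep"],                          # beta_1
    "deep_x_time": df["deep"] * df["time"],      # beta_2
    "time": df["time"],                          # beta_3
    "log_recency": np.log(df["recency"]),        # RFM controls (the vector Z)
    "log_frequency": np.log(df["frequency"]),
    "log_monetary": np.log(df["monetary"]),
}))

fit = sm.GLM(df["orders"], X, family=sm.families.Poisson()).fit()
print(fit.summary())
```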
TABLE IV
POISSON REGRESSION: IMPACT OF DEEP DISCOUNTS ON POST-TEST ORDERS

                                          Customers with prior       Customers without prior
                                          test item purchases (1)    test item purchases (2)
Deep discount                             −0.078** (0.020)           0.085 (0.071)
Deep discount × time                      0.026** (0.007)            −0.017 (0.012)
Time (since prior test item purchase)     −0.084** (0.006)           −0.382** (0.007)
Recency (since any purchase)              −0.249** (0.005)           —
Frequency                                 0.620** (0.005)            0.466** (0.012)
Monetary value                            0.038** (0.010)            0.048** (0.014)
Intercept                                 0.926** (0.051)            1.247** (0.081)
Log likelihood                            −53,680                    −30,308
Sample size                               29,105                     25,942

Note. Table IV reports the coefficients from estimating equation (2) on each subsample of customers. The dependent variable measures the number of orders made during the post-test period. Asymptotic standard errors are in parentheses. ** Significantly different from zero, p < .01. * Significantly different from zero, p < .05.
Column (1) focuses on customers with a prior Test Item purchase. For these customers β1 is negative and significant, indicating that for our benchmark customer (for whom Time equals zero) the deep discounts led to a 7.8% reduction in post-test sales. This replicates our univariate results and confirms that merely sending customers a catalog containing lower prices led to fewer orders by these customers. For the same customers, β2 is positive and significant. This result is consistent with the Time interaction and confirms that the loss of sales is smaller when more time has passed since a customer's earlier purchase.

The findings also reveal that the drop in post-test sales was limited to customers who had previously purchased one of the discounted items. For customers who had not previously purchased a Test Item (column (2)), the difference in sales between the two conditions was not significant. This is consistent with the Past
Purchase prediction, which anticipated that the deep discounts would only reduce sales among customers who had previously purchased one of the Test Items.12

We can use the coefficients in Table IV to calculate how much time must elapse before the deep discounts have a positive effect on post-test sales. Given coefficients of −0.078 for β1 and 0.026 for β2, the net impact of the deep discounts equals zero when Time is equal to 20.1 months.13 We conclude that the estimated impact of the deep discounts was negative if customers purchased a Test Item within approximately two years of receiving the Test Catalog.

This time interaction has an important additional implication for the firm. Recall that the initial findings in the preceding section (and the Recency and Time coefficients in Table IV) confirm that customers who purchased recently are systematically more valuable—they purchase significantly more frequently in the post-test period. Because the reduction in post-test sales was focused on these recent customers, there was a disproportionate effect on overall sales. Specifically, 13,830 customers purchased a Test Item within 20.1 months of receiving the Test Catalog. These customers represented just 25% of the sample but contributed approximately 52% of post-test revenue. If the firm mailed the deep discount version (rather than the shallow discount version) to each of these 13,830 customers, its profits would decrease by approximately $155,000.14

Equation (2) imposes a functional form on the interaction between the Time and Deep Discount variables. We can relax this restriction by grouping customers into segments based on the time since their earlier Test Item purchase and directly estimating the impact of the deep discount on each segment. In particular, we group customers into four segments based on the timing of their earlier purchases: less than 250 days, 250 to 500 days, 500 to 750 days, or over 750 days before the Test Catalog was mailed. We then estimate the following Poisson regression model for each segment:

(3)    βX_i = β1 DeepDiscount_i + θ Z_i.
12. We did not expect these customers to be antagonized by the deep discounts, but we would have expected them to take advantage of the lower prices. Further investigation confirms that when we restrict attention to orders from the Test Catalog itself, we see a strong increase in sales under the Deep Discount condition, but this was offset by lower sales from subsequent catalogs.
13. Calculated as e^(0.078/0.026) (recall that Time is measured in months and has a log specification).
14. Calculated as the difference in the average profit in each condition multiplied by the 13,830 customers.
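The cutoff in footnote 13 can be written out explicitly. A worked restatement of the break-even condition implied by equation (2), using the column (1) estimates in Table IV:

\[
\beta_1 + \beta_2 \ln(\mathrm{Time}^*) = 0
\quad\Longrightarrow\quad
\mathrm{Time}^* = \exp\!\left(-\frac{\beta_1}{\beta_2}\right)
= \exp\!\left(\frac{0.078}{0.026}\right) = e^{3} \approx 20.1 \text{ months}.
\]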
FIGURE III
Impact of Deep Discounts on Post-test Sales by Timing of Prior Test Item Purchase
[Bar chart omitted from this text version: the β1 coefficients when estimating equation (3) on the four customer segments are −5.22%** (under 250 days), −4.16% (250 to 500 days), −0.50% (500 to 750 days), and +7.87%** (over 750 days).]
Detailed findings are provided in Table V.
** Significantly different from zero, p < .01. * Significantly different from zero, p < .05.
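The segment-level estimates in Figure III and Table V amount to re-running the regression within each timing bucket. A minimal, self-contained sketch follows (Python with statsmodels); the data are simulated and the column names, bucket construction, and values are illustrative assumptions, not the study's actual code or data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 8_000

# Illustrative stand-in data; column names are hypothetical.
df = pd.DataFrame({
    "deep": rng.integers(0, 2, n),
    "days_before_mailing": rng.uniform(1, 1500, n),   # days between prior Test Item purchase and mailing
    "log_recency": np.log(rng.uniform(30, 2000, n)),
    "log_frequency": np.log(rng.integers(1, 20, n)),
    "log_monetary": np.log(rng.uniform(20, 300, n)),
})
df["orders"] = rng.poisson(np.exp(0.5 - 0.05 * df["deep"]))

# Four timing buckets, mirroring the segments used in Figure III and Table V.
bins = [0, 250, 500, 750, np.inf]
labels = ["under 250 days", "250 to 500 days", "500 to 750 days", "over 750 days"]
df["segment"] = pd.cut(df["days_before_mailing"], bins=bins, labels=labels)

# Equation (3): within each segment, regress orders on the Deep Discount dummy plus the RFM controls.
for seg, grp in df.groupby("segment", observed=True):
    X = sm.add_constant(grp[["deep", "log_recency", "log_frequency", "log_monetary"]])
    fit = sm.GLM(grp["orders"], X, family=sm.families.Poisson()).fit()
    print(f"{seg}: beta_1 = {fit.params['deep']:.3f} (se = {fit.bse['deep']:.3f})")
```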
The β1 coefficients for each segment are summarized in Figure III, and complete findings are provided in Table V.15 The deep discounts led to a significant decrease in sales (−5.22%) from the 4,413 customers who had purchased within 250 days of receiving the Test Catalog, and a significant increase in sales (+7.87%) from customers whose prior purchases occurred over 750 days ago. The results for the other two segments fall between these two results. This consistent monotonic relationship is reassuring, and confirms that we cannot attribute the role of time solely to the functional form imposed in equation (2).

15. In the next set of analyses we show that the effect is concentrated among customers who paid high prices in their previous purchases. We restrict attention to these customers in this analysis.

The positive outcome for customers whose prior purchases were not recent (over 750 days ago) is worthy of comment. We offer two possible explanations. First, recall that customers had generally purchased at most one or two of the Test Items that were discounted in the Test Catalog. The deep discounts in the Test Catalog gave them the opportunity to purchase other items at low prices. This motivated our Low Prices prediction that sales would be higher under the Deep Discount condition. Second, it
TABLE V
IMPACT OF DEEP DISCOUNTS ON POST-TEST SALES BY TIMING OF PRIOR TEST ITEM PURCHASE

                     Segment 1           Segment 2           Segment 3           Segment 4
                     Under 250 days      250 to 500 days     500 to 750 days     Over 750 days
Deep discount        −0.052** (0.015)    −0.042 (0.027)      −0.005 (0.037)      0.079** (0.029)
Recency              −0.152** (0.009)    −0.182** (0.012)    −0.262** (0.017)    −0.306** (0.011)
Frequency            0.623** (0.010)     0.707** (0.016)     0.769** (0.024)     0.677** (0.018)
Monetary value       −0.203** (0.016)    −0.239** (0.026)    −0.347** (0.033)    −0.365** (0.025)
Intercept            1.724** (0.085)     1.739** (0.132)     2.397** (0.170)     2.728** (0.136)
Log likelihood       −11,261             −6,364              −4,435              −8,459
Sample size          4,413               2,913               2,604               6,079

Note. Table V reports the coefficients from estimating equation (3) separately on the four customer segments. Asymptotic standard errors are in parentheses. ** Significantly different from zero, p < .01. * Significantly different from zero, p < .05.
TABLE VI
IMPACT OF THE DEEP DISCOUNTS ON POST-TEST ORDERS FROM THE TEST CATALOG AND OTHER CATALOGS

                            Orders from the         Orders from other      All post-test
Timing of prior purchase    Test Catalog (%) (1)    catalogs (%) (2)       orders (%) (3)
Under 250 days              31.6**                  −6.5**                 −5.2**
250 to 500 days             21.4                    −4.9                   −4.2
500 to 750 days             84.3**                  −2.6                   −0.5
Over 750 days               61.2**                  6.6*                   7.9**

Note. Table VI reports the Deep Discount coefficient (β1) from estimating equation (3) separately on each customer segment and dependent variable. The results in column (3) are also reported in Table V (and in Figure III). ** Significantly different from zero, p < .01. * Significantly different from zero, p < .05.
is possible the deep discounts persuaded some customers that the firm offered low prices, which may have prompted them to pay more attention to future catalogs (Anderson and Simester 2004). We investigate both explanations in Table VI, where we distinguish orders from the Test Catalog itself and post-test orders from other catalogs.
The deep discounts led to more orders from the Test Catalog for all customer segments. This is consistent with customers taking advantage of the deep discounts in that catalog to purchase items they had not already purchased (the Low Prices prediction). The effect is particularly strong for customers whose prior purchases were less recent. These are the same customers for whom we observe a positive outcome in Figure III. Among customers who had not purchased for over 750 days, the positive effect of the deep discounts extends beyond the Test Catalog to also increase orders from other catalogs by 6.6%. This is consistent with our second explanation: more favorable price expectations may have expanded the range of occasions on which these customers searched for products in the firm's catalogs.

We conclude that the findings offer strong support for the Past Purchase and Time interactions. The reduction in sales is limited to customers who had previously purchased a Test Item and is stronger for customers who had purchased more recently. We next consider how the price that customers paid in their earlier purchases influences the outcome. We will then investigate whether the findings persist throughout the post-test period, and whether they survive when we use post-test measures of revenue and profit as the dependent variable.

V. ADDITIONAL RESULTS

If price variation leads to customer antagonism, we might expect that customers who paid the highest prices in their earlier purchases would be most antagonized. There is considerable variation in the prices that customers paid for Test Items before receiving the Test Catalog. We can illustrate this variation by using this past price to group customers into three segments, which we label "Low," "Medium," and "High":

Segment     Past price level                                  Number of customers
Low         Less than the deep discount price                 463
Medium      Between the deep and shallow discount prices      12,633
High        Above the shallow discount price                  16,009
We do not expect the 463 customers who paid less than the deep discount price to be antagonized by the deep discounts. For the 12,633 customers in the Medium segment who paid between the
deep and shallow discount prices, the possibility of antagonism only arises under the Deep Discount condition, as this was the only condition where prices were lower than what customers previously paid. In contrast, all of the 16,009 customers in the High segment paid above the shallow discount price and saw lower prices when they received the Test Catalog. These customers may be antagonized under both experimental conditions.

Without knowing how customers respond to small and large price differences, we cannot make a clear ex ante prediction about how the outcome will vary across the High and Medium segments. We can illustrate this uncertainty by considering two extreme examples. If customers have the same negative reaction to any observed price reduction, then customers in the High segment will have the same reaction under both experimental conditions. In contrast, customers in the Medium segment will react negatively in the Deep Discount condition but not in the Shallow Discount condition. Our comparison of the two conditions will reveal a negative response in the Medium segment and no effect in the High segment. A second (equally extreme) behavior is that customers react only to large price differences. It is possible that none of the customers in the Medium segment observe a large enough price difference to prompt a reaction, whereas the outcome in the High segment depends upon the amount customers paid. Customers who paid the most may react under both conditions, whereas others may only react under the Deep Discount condition. Comparison of the two conditions will reveal no effect in the Medium segment and a negative effect for some customers in the High segment.

We can investigate this issue empirically. Our experimental design ensures that within each segment customers are randomly assigned to the Deep and Shallow Discount conditions. Therefore, we can measure how the deep discounts affected post-test sales in each segment by estimating equation (2), which includes the time interaction, separately for each segment. The coefficients, which are reported in Table VII, reveal clear evidence that the past price plays an important role. Customers in the High segment reacted adversely to the deep discounts, but we do not observe any reaction in the Medium segment. Together, these results suggest that although customers react negatively to large price differences, they may be willing to overlook small price differences.

These results also have an important implication for the firm. Customers who pay higher prices tend to be systematically more
TABLE VII
THE ROLE OF THE PRICE PAID IN EARLIER TRANSACTIONS

                                          Past price segment
                                          Low                 Medium              High
Deep discount                             0.469 (0.256)       0.008 (0.036)       −0.133** (0.024)
Deep discount × time                      −0.169** (0.063)    −0.004 (0.013)      0.050** (0.009)
Time (since prior test item purchase)     0.045 (0.055)       −0.069** (0.010)    −0.102** (0.007)
Recency (since any purchase)              −0.113* (0.056)     −0.253** (0.008)    −0.246** (0.006)
Frequency                                 0.489** (0.061)     0.648** (0.008)     0.606** (0.007)
Monetary value                            0.061 (0.073)       0.112** (0.017)     0.008 (0.013)
Intercept                                 0.056 (0.037)       0.477** (0.090)     1.128** (0.068)
Log likelihood                            −965                −21,449             −31,196
Sample size                               463                 12,633              16,009

Note. Table VII reports the coefficients from estimating equation (2) separately on the three customer segments. The Low segment includes customers who paid less than the deep discount price for a Test Item before the Test Catalog; customers in the Medium segment paid between the deep and shallow discount prices; customers in the High segment paid over the shallow discount price. The dependent variable in all three columns measures the number of units purchased during the post-test period. Asymptotic standard errors are in parentheses. ** Significantly different from zero, p < .01. * Significantly different from zero, p < .05.
valuable: the 16,009 customers in the High segment represent approximately 29% of the sample but contribute 47% of the post-test profit. The concentration of the effect among these customers amplifies the impact on the firm's profits.

We conclude that the reduction in demand is concentrated in the segment of customers who had previously paid a high price (above the shallow discount price) for a Test Item. We will focus on this segment in the remainder of our analysis. We next consider whether the findings persist throughout the post-test period, and then investigate whether they survive when we use post-test measures of revenue and profit as the dependent variable.

V.A. Does the Adverse Outcome Persist?

To evaluate the persistence of the results we divided the 28-month post-test period into two 14-month subperiods. We estimate equation (2) on both subperiods and report the findings in
TABLE VIII
ADDITIONAL RESULTS: REVENUE, PROFITS, AND PERSISTENCE OF THE EFFECT

                       Start of the         End of the
                       post-test            post-test
                       period (1)           period (2)           Revenue (3)          Profits (4)
Deep discount          −0.139** (0.033)     −0.127** (0.035)     −65.83** (19.72)     −48.42** (14.55)
Deep discount × time   0.044** (0.013)      0.056** (0.013)      21.30** (6.74)       15.46** (4.98)
Time                   −0.117** (0.010)     −0.085** (0.011)     −31.07** (6.22)      −21.00** (4.59)
Recency                −0.249** (0.009)     −0.242** (0.009)     −78.24** (5.43)      −59.14** (4.01)
Frequency              0.598** (0.009)      0.616** (0.010)      204.23** (4.26)      150.84** (3.14)
Monetary value         0.040** (0.018)      0.025** (0.018)      58.22** (6.03)       45.61** (4.45)
Intercept              0.375** (0.095)      0.495** (0.097)      277.56** (37.66)     190.79** (27.80)
Log likelihood         −19,674              −20,635              —                    —
Adjusted R2            —                    —                    .229                 .230
Sample size            16,009               16,009               16,009               16,009

Note. The dependent variables in the four columns are the number of units purchased in the first fourteen months of the post-test period (column (1)); the number of units purchased in the last fourteen months of the post-test period (column (2)); total revenue earned from each customer in the post-test period (column (3)); and total profit (calculated as revenue minus cost of goods sold) earned from each customer in the post-test period (column (4)). Columns (1) and (2) are estimated using Poisson regression (asymptotic standard errors are in parentheses). Columns (3) and (4) are estimated using OLS (standard errors are in parentheses). All of the models are estimated using the 16,009 customers who had previously paid a high price for a Test Item. ** Significantly different from zero, p < .01. * Significantly different from zero, p < .05.
Table VIII (columns (1) and (2)). Comparing β 1 and β 2 between the two data periods reveals no significant difference in either coefficient (they are also not significantly different from the findings reported for these customers in Table VII). However, the results do imply a different time cutoff for which there is a negative response to the deep discounts. At the start of the post-test period, the negative response extends to customers who purchased within 23.5 months of receiving the Test Catalog. At the end of the period, the results indicate that the response is only negative for customers whose prior purchase occurred within 9.7 months of the Test Catalog. We conclude that although the negative effect survives for more than a year after the Test Catalog was mailed, there is evidence that the effect decays over time.
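The two cutoffs quoted here follow from the column (1) and (2) estimates in Table VIII, using the same break-even formula as footnote 13. A quick, purely illustrative check (Python):

```python
import math

# Break-even horizon Time* = exp(-beta_1 / beta_2), in months (see footnote 13),
# using the column (1) and (2) estimates from Table VIII.
for label, b1, b2 in [("start of post-test period", -0.139, 0.044),
                      ("end of post-test period", -0.127, 0.056)]:
    print(f"{label}: {math.exp(-b1 / b2):.1f} months")
# -> roughly 23.5 and 9.7 months
```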
V.B. Profit and Revenue

The regression results reported so far consider only the number of orders placed during the post-test period. To evaluate robustness, we also analyzed two additional dependent measures: the Total Revenue and Total Profit earned from each customer during the post-test period (where profit is calculated as revenue minus cost of goods sold). Both variables are continuous rather than count measures, and so we estimate the models using OLS. The findings are reported in Table VIII (columns (3) and (4)). They confirm that deep discounts lead to a reduction in both revenue and profits. This effect diminishes as the time since a customer's previous purchase increases. The interval at which the net effect is zero is approximately 22 months for both metrics.

We conclude that lower prices lead to fewer purchases by some customers. This effect is strongest among customers who had recently paid a high price to buy an item on which the price is later lowered. Unfortunately these include many of the firm's most valuable customers and this magnifies the importance of the effect. Readers who are solely interested in the existence of this effect may want to read ahead to Section VII, where we replicate the findings with a different company in a different product category. However, other readers may be interested in reviewing alternative explanations for these outcomes. We address this issue in the next section.

VI. ALTERNATIVE EXPLANATIONS

The findings reported in the preceding sections are consistent with customer antagonism. In this section we evaluate three alternative explanations for the findings. We begin by considering the possibility that customers delayed their purchases in anticipation of future discounts. We then consider the role of both quality signals and demand depletion.

VI.A. Delayed Purchases

Coase recognized that customers may respond to intertemporal discounts by delaying their purchases in anticipation of future discounts (Coase 1972). In our context, the discounts under the Deep Discount condition may have alerted customers to the possibility of future discounts. If customers responded by delaying their subsequent purchases, this might explain the reduction in
TABLE IX
WERE CUSTOMERS DELAYING IN ANTICIPATION OF FUTURE DISCOUNTS?
POST-TEST PURCHASES AT DIFFERENT DISCOUNT LEVELS

                                 Coefficients of interest
Discount threshold               (β1) Deep discount      (β2) Deep discount × time
Any discount level               −0.130** (0.018)        0.055** (0.007)
Discounts of at least 10%        −0.115** (0.020)        0.049** (0.008)
Discounts of at least 20%        −0.114** (0.020)        0.048** (0.008)
Discounts of at least 30%        −0.112** (0.021)        0.048** (0.008)
Discounts of at least 40%        −0.111** (0.021)        0.050** (0.008)
Discounts of at least 50%        −0.115** (0.021)        0.050** (0.008)
Discounts of at least 60%        −0.109** (0.024)        0.049** (0.009)
Discounts of at least 70%        −0.091** (0.034)        0.066** (0.013)

Note. Each row in Table IX reports the β1 and β2 coefficients from reestimating equation (2) when the dependent variable counts post-test units sold under each discount threshold. In each of these eight models the samples are restricted to customers who had paid a high price (the sample size for each model is 16,009). Asymptotic standard errors are in parentheses. ** Significantly different from zero, p < .01.
post-test orders. As we have already acknowledged, we cannot fully rule out this explanation, but we can investigate it using several approaches. First, if customers were waiting for future discounts, the decrease in post-test sales should be larger at prices that represent either small discounts or no discounts. It is these purchases that we would expect customers to forgo while waiting for larger discounts. To investigate this prediction we recalculate post-test sales using different discount thresholds. For example, "Discounts of at least 60%" counts the number of units purchased in the post-test period at a discount of at least 60%.16 We reestimated equation (2) using eight different discount thresholds and report the results from each of these eight models in Table IX.

16. Notice that we shift from "orders placed" to "units purchased" because discounts are defined at the unit level.
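Constructing the thresholded dependent variables is a simple counting exercise over item-level transactions. A minimal sketch (Python with pandas) is below; the data frame, column names, and values are hypothetical illustrations, not the firm's records.

```python
import pandas as pd

# Illustrative item-level post-test transactions; column names are hypothetical.
tx = pd.DataFrame({
    "customer_id":   [1, 1, 2, 2, 3],
    "regular_price": [100.0, 80.0, 120.0, 60.0, 90.0],
    "price_paid":    [100.0, 30.0, 40.0, 60.0, 25.0],
})
tx["discount"] = 1 - tx["price_paid"] / tx["regular_price"]

def units_at_discount(transactions: pd.DataFrame, threshold: float) -> pd.Series:
    """Count post-test units per customer bought at a discount of at least `threshold`."""
    keep = transactions[transactions["discount"] >= threshold]
    return keep.groupby("customer_id").size()

# Dependent variable for the "Discounts of at least 60%" row of Table IX.
print(units_at_discount(tx, 0.60))
```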
Though not statistically significant, there is weak evidence of a trend in the results. At higher discount thresholds the Deep Discount coefficients are slightly less negative, which is consistent with customers waiting for larger discounts. Moreover, the implied Time at which the effect switches from negative to positive becomes shorter, which indicates that we only observe a negative effect among customers whose prior purchases were more proximate to receiving the Test Catalog. The findings also confirm that sales decrease even for deeply discounted items, and this is difficult to reconcile with customers waiting for future discounts.

Our second approach recognizes that deeply discounted prices were not unique to the Test Catalog. Over 38% of purchases in the pre-test period were made at discount levels of at least 50% (see Figure I).17 If customers had purchased at deep discounts in the past, it seems less likely they would be surprised by the deep discounts in the Test Catalog or that they would change their behavior by delaying future purchases. Therefore, we can evaluate this alternative explanation by investigating whether the decrease in demand was smaller among customers who had purchased at deep discounts in the pre-test period. We do so by including a measure of past discounts in our model,

(4)    βX_i = β1 DeepDiscount_i + β2 DeepDiscount_i × Time_i + β3 Time_i + β4 DeepDiscount_i × MaximumPastDiscount_i + β5 MaximumPastDiscount_i + θ Z_i,
where MaximumPastDiscount_i measures the largest percentage discount on any item that customer i purchased during the pre-test period. We report the coefficients for this model in Table X. We also include an alternative model in which MaximumPastDiscount_i is replaced with a binary variable (60% Past Discount_i) identifying customers who had previously purchased at a discount of at least 60%.

17. Many pre-test purchases were made at even deeper discounts (14.6% were made at a discount of at least 60%). Recall that we used a 60% discount level under the Deep Discount condition because it was consistent with past discount levels.

Neither of these interaction terms approaches statistical significance. The response to the deep discounts was apparently not affected by the size of the discounts that customers had received in the past. This is not what we would expect if the Test Catalog changed customers' price expectations; the effect should be
TABLE X
PRICE EXPECTATIONS

                                               (1)                  (2)
Deep discount                                  −0.126** (0.053)     −0.146** (0.026)
Deep discount × time                           0.051** (0.010)      0.051** (0.009)
Deep discount × maximum past discount          −0.022 (0.089)       —
Deep discount × 60% past discount              —                    0.049 (0.032)
Time                                           −0.097** (0.007)     −0.102** (0.007)
Maximum past discount                          1.004 (0.062)        —
60% past discount                              —                    −0.018 (0.019)
Recency                                        −0.239** (0.006)     −0.246** (0.006)
Frequency                                      0.528** (0.008)      0.607** (0.007)
Monetary value                                 0.143** (0.015)      0.008 (0.013)
Intercept                                      0.098 (0.089)        1.134** (0.068)
Log likelihood                                 −31,024              −31,194
Sample size                                    16,009               16,009

Note. Table X reports the coefficients from estimating equation (4) on the customers who had previously paid a high price for a Test Item. Asymptotic standard errors are in parentheses. ** Significantly different from zero, p < .01. * Significantly different from zero, p < .05.
strongest for customers who had not previously purchased at a large discount.

We make a final observation about this explanation. At the start of Section III we reported the average price of the items purchased in the post-test period (Table III). Increased willingness to wait for discounts should lead to customers paying lower average prices. However, the average price paid (per unit) was not significantly different between the two conditions. We conclude that delay in anticipation of future discounts does not appear to be a complete explanation for the results.

VI.B. Quality Signals

It is possible that customers interpreted the deep discounts on the Test Items as a signal that these items were of inferior
TABLE XI
QUALITY SIGNALS

                                          Non-test items (1)     Test items (2)
Deep discount                             −0.152** (0.021)       −0.047 (0.039)
Deep discount × time                      0.057** (0.008)        0.046** (0.015)
Time (since prior test item purchase)     −0.153** (0.007)       −0.176** (0.013)
Recency (since any purchase)              −0.196** (0.006)       −0.180** (0.012)
Frequency                                 0.752** (0.006)        0.514** (0.011)
Monetary value                            0.116** (0.012)        −0.177** (0.021)
Intercept                                 0.483** (0.062)        0.917** (0.111)
Log likelihood                            −40,946                −17,817
Sample size                               16,009                 16,009

Note. Both models are estimated using the 16,009 customers who had previously paid a high price for a test item. In column (1) the dependent variable measures the number of non-Test Items purchased during the post-test period. In column (2) the dependent variable measures the number of test items purchased. Asymptotic standard errors are in parentheses. ** Significantly different from zero, p < .01. * Significantly different from zero, p < .05.
quality. To investigate this possibility, we can restrict attention to non-Test Items (“other items”). We do not expect discounts on the Test Items to signal information about the quality of these other items, and so if the effect persists for these other items, it is unlikely to be due to an adverse quality signal. The Test Items accounted for only 22% of the 94,487 post-test purchases, so the remaining 78% of the purchases were for the approximately 400 other items sold by the firm. We repeat our earlier analysis when distinguishing between post-test sales for Test Items and other items and report the results in Table XI. The pattern of findings is unchanged. Indeed, the findings are stronger for these other items than for the Test Items, presumably because some customers took advantage of the deep discounts (on the Test Items) in the Test Catalog. We conclude that the decrease in sales cannot be fully explained by customers using the deep discounts as a signal that the Test Items are poor quality. An alternative interpretation is that the deep discounts lowered the perceived quality of all of the products sold by the firm.
This is a relatively implausible explanation in this setting. Most customers in the study had made multiple previous purchases and had received a large number of catalogs from the firm. On average, each customer had purchased approximately 3.3 units prior to receiving the Test Catalog. This increases to 5.2 units among customers who had purchased a Test Item within 20.1 months of receiving the Test Catalog. Moreover, customers' first purchases occurred on average over two years before they received the Test Catalog and in the intervening period they had received many of the firm's catalogs (approximately once every two to four weeks). Given this extensive experience with the firm, it seems unlikely that customers' overall perceptions of the quality of the firm's products would be affected by the prices in a single catalog.

We conclude that it is unlikely that the results can be explained by changes in customers' expectations about product quality. It also appears unlikely that customers were merely waiting in anticipation of future discounts. It is for this reason that we suggested in the Introduction that the reputation effects reflect intangible brand attributions, rather than inferences about future prices or product quality. We next consider an alternative explanation for the Past Purchase and Time interactions.

VI.C. Demand Depletion

Because customers are unlikely to buy the same item twice, those who had already purchased a Test Item had fewer opportunities to take advantage of the deep discounts on these items in the Test Catalog. Similarly, customers with recent purchases may have had their immediate needs satisfied, which may also have diminished demand for discounted items in the Test Catalog.

It is helpful to consider the limits of this explanation. Demand depletion may explain why there was no increase in sales in the Deep Discount condition (compared to the Shallow Discount condition): customers could not take advantage of the lower prices because their demand was depleted. However, the two versions were mailed to equivalent groups of customers (whose demand was equivalently depleted) and so demand depletion cannot explain why sales decreased in the Deep Discount condition. Notice also that orders from the Test Catalog account for less than 3% of the post-test orders. If we restrict attention to orders from catalogs other than the Test Catalog, the findings survive and are actually strengthened by omission of the Test Catalog orders. This result cannot be explained by a diminished response to the Test Catalog
itself.18 We conclude that depletion of demand for Test Items from the Test Catalog cannot fully explain the reduction in aggregate post-test orders. In the next section we present the findings from a second study conducted with a clothing retailer. The study provides an opportunity to replicate the results with a different company and product category. VII. REPLICATION To investigate whether the effects generalize, we conducted a second study with a separate catalog firm that sells private label clothing. The study investigates the impact of mailing a “Sale” catalog containing discounts in the days immediately after Christmas, a period when many customers expect clothing to be discounted. If we can show that the key findings extend to this setting we can be more confident that customer antagonism plays an important role in contributing to price stickiness. Approximately 110,000 customers who had previously purchased from the firm were randomly assigned to equal-sized Test and Control conditions.19 Those in the Control condition received a 16-page “Sale” catalog containing 132 items, all of which were discounted below their regular prices. The average regular price of the 132 items was $67.24 and the average discount was 45%. Customers in the Test condition were not mailed this catalog. We measured (post-test) sales over the subsequent thirty months, including purchases from the Sale catalog. Before describing these findings, we recognize that there are two important differences from the previous study in the experimental manipulation. First, it was standard for this company to mail a catalog containing discounts immediately after Christmas, and so the experimental manipulation was suppression of the Sale catalog. Because lowering the price was standard behavior, it appears implausible that merely receiving this Sale catalog changed customers’ expectations about the frequency of future discounts. The second difference is that in the previous study all customers received the Test Catalog; we merely varied prices between the 18. Similarly, diminished demand from the Test Catalog cannot explain the findings in the second half of the post-test period. Over 99.7% of purchases from the Test Catalog occur within the first fourteen months of the post-test period. 19. Comparison of historical purchases confirmed that the assignment of customers to these two conditions was random.
TABLE XII
REPLICATION RESULTS: IMPACT OF SALE CATALOG ON POST-TEST ORDERS FROM THE APPAREL CATALOG

                         Customers who had          Customers who had paid     Customers who had not
                         previously paid more       a price no higher than     previously purchased
                         than the sale price        the sale price             a sale item
                         (1)                        (2)                        (3)
Received sale catalog    −0.024∗∗ (0.005)           0.012 (0.013)              0.016∗∗ (0.003)
Recency                  −0.208∗∗ (0.002)           −0.177∗∗ (0.006)           −0.220∗∗ (0.001)
Frequency                0.624∗∗ (0.003)            0.673∗∗ (0.007)            0.541∗∗ (0.002)
Monetary value           0.002∗∗ (0.011)            0.081∗∗ (0.028)            0.075∗∗ (0.005)
Intercept                0.478∗∗ (0.048)            0.047 (0.116)              0.511∗∗ (0.021)
Log likelihood           −74,498                    −10,392                    −362,018
Sample size              14,699                     1,630                      93,142

Note. The dependent variable in all three models measures the number of orders made during the post-test period. Column (1) is estimated using the 14,699 customers who had previously paid a higher price for one of the items in the Sale catalog. Column (2) is estimated using the 1,630 customers who had previously paid a price equal to or less than the price in the Sale catalog. Column (3) is estimated using the 93,142 customers who had not previously purchased one of the items in the Sale catalog. Asymptotic standard errors are in parentheses. ∗∗ Significantly different from zero, p < .01. ∗ Significantly different from zero, p < .05.
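For concreteness, a model with the structure of the Table XII columns can be sketched as a count regression of post-test orders on a sale-catalog indicator and the recency/frequency/monetary covariates. The sketch below is illustrative only: the file, column, and sample names are hypothetical, and the Poisson likelihood is an assumption (the exact specification used in the table, e.g., a negative binomial, may differ).

```python
# Illustrative count regression with the structure of Table XII, column (1):
# post-test orders on a sale-catalog indicator plus recency, frequency, and
# monetary value. File, column, and sample names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("apparel_posttest.csv")  # one row per customer (hypothetical)
col1 = df[df["sample"] == "previously_paid_more_than_sale_price"]

model = smf.poisson(
    "post_test_orders ~ received_sale_catalog + recency + frequency + monetary_value",
    data=col1,
).fit()
print(model.summary())
```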
two conditions. This ensured that any effects could be attributed to variation in prices and not to other information in the catalog. In this second study, the variation is sending the Sale catalog itself. As a result, we cannot distinguish which feature(s) of the Sale catalog prompted the observed outcomes.20 The findings are reported in Table XII. We report three models using different samples of customers. In column (1) we estimate the model using customers who had previously purchased one of the items in the Sale catalog and paid more than the Sale catalog price. In column (2) we use customers who had previously paid a price equal to or less than the price in the Sale catalog, whereas 20. If customers under the Test condition (who did not receive the Sale catalog) called to order one of the items in the Sale catalog, they were also charged the discounted price. Therefore, under a strict interpretation, the experiment manipulates the information that customers received about price changes, rather than the prices themselves. In a catalog setting this distinction is less important, because customers cannot respond to price changes that they do not know about.
in column (3) we use customers who had not previously purchased one of the items in the Sale catalog. Customers who had previously paid a higher price for one of the discounted items (column (1)) made 2.4% fewer orders if they received the Sale catalog. This is again a strong result. Our measures include purchases from the Sale catalog, and the drop in sales can be fully attributed to merely receiving the Sale catalog containing lower prices. Customers who had not previously purchased one of the discounted items (column (3)) placed 1.6% more orders during the 30-month post-test period if they received the Sale catalog. These outcomes are both significantly different from zero and significantly different from each other (p < .01). They suggest that customer antagonism contributed to a net reduction in orders of 4.0%, which is the difference between a 1.6% increase and a 2.4% decrease. The results in this replication may initially appear counterintuitive. Customers would not be surprised to receive a Sale catalog in the days after Christmas, and so we might not expect them to react adversely to the low prices in this catalog. On the other hand, customers do not know which products will be discounted in the Sale catalog, and it is likely that they experienced regret if they saw that they could have purchased the same item at a lower price. This is precisely the situation anticipated by Rotemberg (2009). Recall that customers in his model only want to transact with firms that are “altruistic.” An altruistic firm would not allow customers to experience regret, and so regret represents evidence that a firm is not altruistic. The findings in this second study offer reassurance that the effects are replicable. They also provide evidence that they generalize to other markets, including markets in which customers recognize that prices vary over time. Without this evidence it would be more difficult to claim that the effects contribute to price stickiness to an extent that is relevant to our understanding of monetary policy. VIII. DISCUSSION AND CONCLUSIONS Although customer antagonism is recognized as a possible explanation for price stickiness, there has been virtually no empirical evidence that price adjustments can cause customer antagonism and reduce demand. We present findings from two large-scale field tests that investigate how customers respond to
downward price adjustments. The findings reveal that many customers stop purchasing if a firm charges a lower price than they previously paid for the same item. We characterize the loss in demand as a customer boycott of the firm. The loss in profits was sufficient to cause the firms that participated in our two field experiments to reduce the frequency of price adjustments. Both of the studies reported in the paper focus on products that are durables for which repeat purchases are rare. It is possible that the results might be different if we had used products that consumers purchase repeatedly. Customers may be less antagonized by lower prices on items they have already purchased if they can take advantage of the discounts. However, it is important to remember that the focus on durables in these two studies did not prevent customers from taking advantage of the discounts. Most customers had previously purchased very few of the discounted items, and so they could take advantage of the discounts by purchasing other items. Indeed, many customers did take advantage of the discounts: in the first study purchases from the Test Catalog itself were over 60% higher under the Deep Discount condition, whereas in the second study many customers who received the Sale catalog purchased from it. Our measures may underestimate how customers react to seeing lower prices on items that they have previously purchased. When customers receive catalogs many of them simply throw them out without opening them. Random assignment ensures that for the Test Catalog this is likely to have occurred at equivalent rates under both conditions. However, if customers did not open the Test Catalog then the variation of prices within the catalog could not have affected their behavior. This will tend to weaken the aggregate differences between the two treatments and underestimate the individual effects on the customers who did see the discounts. Although the data and the randomized experimental design allowed us to rule out many alternative explanations, we recognize that there are limitations. The most important limitation is that we do not have access to direct measures of customers’ psychological reactions. This absence of intermediate process measures is a limitation common to many field studies. NORTHWESTERN UNIVERSITY MIT
REFERENCES
Amirault, David, Carolyn Kwan, and Gordon Wilkinson, “A Survey of the Price-Setting Behavior of Canadian Companies,” Bank of Canada Review (2004), 29–40.
Anderson, Eric T., and Duncan I. Simester, “The Role of Sale Signs,” Marketing Science, 17 (1998), 139–155.
——, “Are Sale Signs Less Effective When More Products Have Them?” Marketing Science, 20 (2001), 121–142.
——, “Long-Run Effects of Promotion Depth on New versus Established Customers: Three Field Studies,” Marketing Science, 23 (2004), 4–20.
Apel, Mikael, Richard Friberg, and Kerstin Hallsten, “Micro Foundations of Macroeconomic Price Adjustment: Survey Evidence from Swedish Firms,” Journal of Money, Credit and Banking, 37 (2005), 313–338.
Blinder, Alan S., Elie R. D. Canetti, David E. Lebow, and Jeremy B. Rudd, Asking about Prices: A New Approach to Understanding Price Stickiness (New York: Russell Sage Foundation, 1998).
Cason, Timothy N., and Daniel Friedman, “A Laboratory Study of Customer Markets,” Advances in Economic Analysis and Policy, 2 (2002), Article 1.
Coase, Ronald H., “Durability and Monopoly,” Journal of Law and Economics, 15 (1972), 143–149.
Conlisk, John, Eitan Gerstner, and Joel Sobel, “Cyclic Pricing by a Durable Goods Monopolist,” Quarterly Journal of Economics, 99 (1984), 489–505.
Ellison, Glenn, “Bounded Rationality in Industrial Organization,” in Advances in Economics and Econometrics: Theory and Applications, Ninth World Congress, Richard Blundell, Whitney Newey, and Torsten Persson, eds. (Cambridge, UK: Cambridge University Press, 2006).
Ellison, Glenn, and Sara Fisher Ellison, “Search, Obfuscation, and Price Elasticities on the Internet,” Econometrica, 77 (2009), 427–452.
Fabiani, Silvia, Martine Druant, Ignacio Hernando, Claudia Kwapil, Bettina Landau, Claire Loupias, Fernando Martins, Thomas Matha, Roberto Sabbatini, Harald Stahl, and Ad Stokman, “The Pricing Behavior of Firms in the Euro Area: New Survey Evidence,” Paper Presented at Conference on Inflation Persistence in the Euro Area at the European Central Bank, 2004.
Gordon, Robert J., “What Is New-Keynesian Economics?” Journal of Economic Literature, 28 (1990), 1115–1171.
Güth, Werner, and Eric van Damme, “Information, Strategic Behavior and Fairness in Ultimatum Bargaining: An Experimental Study,” Journal of Mathematical Psychology, 42 (1998), 227–247.
Hall, Robert L., and Charles J. Hitch, “Price Theory and Business Behavior,” Oxford Economic Papers, 2 (1939), 12–45.
Hall, Simon, Mark Walsh, and Anthony Yates, “How do UK Companies Set Prices?” Bank of England Quarterly Bulletin (1996), 180–192.
Harrison, Glenn W., and John A. List, “Field Experiments,” Journal of Economic Literature, 42 (2004), 1009–1055.
Kahneman, Daniel, Jack L. Knetsch, and Richard H. Thaler, “Fairness as a Constraint on Profit Seeking: Entitlements in the Market,” American Economic Review, 76 (1986), 728–741.
Kleshchelski, Isaac, and Nicolas Vincent, “Market Share and Price Rigidity,” Journal of Monetary Economics, 56 (2009), 344–352.
Nakamura, Emi, and Jón Steinsson, “Price Setting in Forward-Looking Customer Markets,” Columbia University Working Paper, 2008.
Okun, Arthur M., Prices and Quantities: A Macroeconomic Analysis (Washington, DC: Brookings Institution, 1981).
Phelps, Edmond S., and Sidney G. Winter, “Optimal Price Policy under Atomistic Competition,” in Microeconomic Foundations of Employment and Inflation Theory, E. S. Phelps, ed. (New York: W. W. Norton, 1970).
Renner, Elke, and Jean-Robert Tyran, “Price Rigidity in Customer Markets: An Empirical Study,” Journal of Economic Behavior and Organization, 55 (2004), 575–593.
Rotemberg, Julio J., “Customer Anger at Price Increases, Changes in the Frequency of Price Adjustment and Monetary Policy,” Journal of Monetary Economics, 52 (2005), 829–852.
——, “Expected Firm Altruism and Brand Extensions,” Harvard Business School Working Paper, 2008.
——, “Fair Pricing,” Journal of the European Economic Association, forthcoming, 2009.
Selten, Reinhard, and Axel Ockenfels, “An Experimental Solidarity Game,” Journal of Economic Behavior and Organization, 34 (1998), 517–539.
Sobel, Joel, “The Timing of Sales,” Review of Economic Studies, 51 (1984), 353–368.
——, “Durable Goods Monopoly with Entry of New Consumers,” Econometrica, 59 (1991), 1455–1485.
Tadelis, Steven, “What’s in a Name? Reputation as a Tradeable Asset,” American Economic Review, 89 (1999), 548–563.
Weiss, Yoram, “Inflation and Price Adjustment: A Survey of Findings from Micro Data,” in Optimizing Pricing, Inflation, and the Cost of Price Adjustment, E. Sheshinski and Y. Weiss, eds. (Cambridge, MA: MIT Press, 1993).
Wernerfelt, Birger, “Umbrella Branding as a Signal of New Product Quality: An Example of Signalling by Posting a Bond,” Rand Journal of Economics, 19 (1988), 458–466.
Zbaracki, Mark J., Mark Ritson, Daniel Levy, Shantanu Dutta, and Mark Bergen, “Managerial and Customer Costs of Price Adjustment: Direct Evidence from Industrial Markets,” Review of Economics and Statistics, 86 (2004), 514–533.
BARBED WIRE: PROPERTY RIGHTS AND AGRICULTURAL DEVELOPMENT∗ RICHARD HORNBECK This paper examines the impact on agricultural development of the introduction of barbed wire fencing to the American Plains in the late nineteenth century. Without a fence, farmers risked uncompensated damage by others’ livestock. From 1880 to 1900, the introduction and near-universal adoption of barbed wire greatly reduced the cost of fences, relative to the predominant wooden fences, especially in counties with the least woodland. Over that period, counties with the least woodland experienced substantial relative increases in settlement, land improvement, land values, and the productivity and production share of crops most in need of protection. This increase in agricultural development appears partly to reflect farmers’ increased ability to protect their land from encroachment. States’ inability to protect this full bundle of property rights on the frontier, beyond providing formal land titles, might have otherwise restricted agricultural development.
I. INTRODUCTION In The Problem of Social Cost, Coase (1960) begins with the example of a farmer and a cattle-raiser: without a fence, cattle will damage the farmer’s crops. Land use will be efficient if liability for damage is defined and enforced, and this property right can be traded costlessly. Otherwise, cattle damage imposes an externality that distorts the farmer’s product choices, investment levels, and production methods (Cheung 1970). The externality is internalized or eliminated when those costs fall below the resulting gains (Demsetz 1967); for example, when fencing costs fall sufficiently, the farmer builds a fence and produces within at efficient levels.1 The efficiency gains from establishing and enforcing property rights may be large, and much attention has been focused on ∗ An earlier version of this paper was distributed under the title “Good Fences Make Good Neighbors: Evidence on the Effects of Property Rights.” I thank Daron Acemoglu, Esther Duflo, Michael Greenstone, Peter Temin, anonymous referees, and Larry Katz for their comments and suggestions, as well as Lee Alston, Josh Angrist, David Autor, Abhijit Banerjee, Dora Costa, Joe Doyle, Claudia Goldin, Tal Gross, Tim Guinnane, Raymond Guiteras, Jeanne Lafortune, Steve Levitt, Derek Neal, Philip Oreopoulos, Paul Rhode, Chris Udry, and numerous seminar participants. I thank Lisa Sweeney, Daniel Sheehan, and the GIS Lab at MIT, as well as Christopher Compean, Lillian Fine, Paul Nikandrou, and Praveen Rathinavelu, for their research assistance. For supporting research expenses, I thank the MIT Schultz Fund, the MIT World Economy Lab, and the MIT UROP program. 1. De Meza and Gould (1992) outline conditions when private decisions to enforce property rights lead to more or less enclosure of land than is socially efficient. C 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of
Technology. The Quarterly Journal of Economics, May 2010
the role of land rights in development (Alston, Libecap, and Schneider 1996; De Soto 2000; Brasselle, Gaspart, and Platteau 2002; Lanjouw and Levy 2002; Galiani and Schargrodsky 2005; Libecap 2007; Besley and Ghatak 2009). Insecurity distorts farmers’ investments (Goldstein and Udry 2008), and increased tenure security can increase farmers’ investment in land (Banerjee, Gertler, and Ghatak 2002; Jacoby, Li, and Rozelle 2002).2 More broadly, insecure property rights can distort labor supply (Field 2007), reduce investment (Johnson, McMillan, and Woodruff 2002), and slow economic growth (North 1981; Engerman and Sokoloff 2003; Acemoglu and Johnson 2005). Indeed, private enclosure of common lands in England may have contributed to the onset of the Industrial Revolution by increasing both agricultural output and labor supplied to other sectors (Ashton 1997). Returning to Coase’s setting, this paper examines the impact on agricultural development of a decrease in the cost of protecting farmland: the introduction of barbed wire fencing to the American Plains in the late nineteenth century. Fences protected farmers’ land and crops from damage by others’ cattle. Farmers often had no formal right to compensation for such damage if their land was not enclosed with fences. Farmers with formal legal protection still faced uncertainty in their ability to collect damages for intrusions on unfenced land. Fencing had relatively little effect on farmers’ security of land ownership;3 rather, fencing improved farmers’ property rights in the sense that it secured their ability to use land for certain purposes. Before barbed wire, fence construction on the plains was restricted by high costs in areas that lacked local fencing materials. Small sections of local woodland were a vital source of timber for fencing on the plains. The introduction and universal adoption of barbed wire from 1880 to 1900 most affected areas with the least woodland, which had been most costly to fence.4 Based on decennial data from the Census of Agriculture, this paper finds that counties with the least woodland experienced 2. See Besley (1995) for a discussion of three mechanisms: decreased expropriation raises the expected return on investment; an improved ability to collateralize land increases access to credit; and lower costs of trading land raise the expected return on investment. 3. Fencing may have played some informal role in delineating and substantiating land claims. 4. Anderson and Hill (1975) review the historical development of property rights on the American Plains and the role of barbed wire in enforcing private control over land use.
large increases in agricultural development from 1880 to 1900, relative to counties with sufficient woodland for farmers to have accommodated previous fencing material shortages. Controlling for time-invariant differences among counties and statewide shocks to all counties, the fraction of county farmland that was improved increased by nineteen percentage points in counties with the least woodland. From 1880 to 1890, average crop productivity increased relatively by 23% in counties with the least woodland, controlling for crop-specific differences among counties and crop-specific statewide shocks. The increased productivity was entirely among crops more susceptible to damage from roaming livestock, as opposed to hay. Farmers shifted the allocation of farmland toward crops and, in particular, crops more at risk. Agricultural development increased along intensive margins, even as counties with the least woodland expanded along the extensive margin of total farmland settled. Estimated increases in the fraction of farmland improved are robust to controlling for changes correlated with counties’ distance west and distance from St. Louis; counties’ region, subregion, or soil group; counties’ initial fraction of farmland improved; or the expansion of railroad networks. Estimated increases in total farmland are more sensitive to these robustness checks. There were substantial and robust increases in total improved land, combining both intensive and extensive margins. Increases in agricultural development were capitalized in higher land values, totaling among sample counties roughly 0.9% of national GDP. In all, the estimates lend support to historical accounts that “without barbed wire the Plains homestead could never have been protected from the grazing herds and therefore could not have been possible as an agricultural unit” (Webb 1931, p. 317). Indeed, some states’ efforts to reform legal fencing requirements appear to have had little effect, suggesting a difficulty in enforcing land protection on the frontier without physical barriers. In interpreting the results, this paper emphasizes the role of barbed wire in protecting farmland from encroachment by others’ cattle. However, the estimates may also reflect barbed wire’s contribution to agricultural technology. Aside from any external protection effects, cheaper fencing benefits an isolated farm by providing greater control over a farmer’s own cattle. This allows the production of cattle and crops in close proximity, and increases cattle productivity through improvements in feeding and
breeding. Barbed wire’s effects are a combination of direct technological improvements and increased protection from others’ cattle. There are some indications, however, that direct technological effects of barbed wire do not drive the main results. Counties most affected by barbed wire became increasingly specialized in either crops or cattle, rather than increasing the joint production of cattle and crops. Furthermore, there is little evidence of an increase in cattle production. This suggests that barbed wire did not affect agricultural production only through the purely technological benefits of cheaper fencing; rather, barbed wire’s effects partly reflect an increase in security from external encroachment. Overall, barbed wire appears to have had a substantial impact on agricultural development in the United States and, in particular, this may reflect an important role for protecting land and securing farmers’ full bundle of property rights. The paper is organized as follows. Section II reviews historical accounts of the need for alternative fencing materials in timber-scarce areas and the introduction of barbed wire fencing. Section III provides a theoretical framework to the historical accounts. Section IV describes the data and presents summary statistics. Section V develops the empirical methodology. Section VI presents the main results and explores their robustness. Section VII discusses the interpretation of barbed wire’s effects, and Section VIII concludes. II. HISTORY OF BARBED WIRE AND THE GREAT PLAINS II.A. Timber Shortages Constrained Land Protection English common law made livestock owners responsible for damages by roaming livestock, assigning the responsibility to fence in livestock. In contrast, the American colonies adopted legal codes that required farmers to fence out others’ livestock (Washburn and Moen Manufacturing Company 1880; Davis 1973; Kawashima 1994; Kantor 1998).5 Without a “lawful fence,” farmers had no formal entitlement to compensation for damages by others’ livestock. New states’ legal codes continued to require that farmers fence out livestock, and gave technical specifications for what constituted a lawful fence. 5. This was meant to encourage livestock production and exploit widely available land. Some southern colonies took further steps to prohibit fencing of pasture lands, even private pasture lands.
In practice, fences were necessary to protect crops and required substantial investment. In 1872, fencing capital stock in the United States was roughly equal to the value of all livestock, the national debt, or the railroads; annual fencing repair costs were greater than combined annual tax receipts at all levels of government (U.S. House 1872; Webb 1931, pp. 288–289). Fencing became increasingly costly as settlement moved into areas with little woodland. High transportation costs made it impractical to supply low-woodland areas with enough timber for fencing (Hayter 1939; Kraenzel 1955, p. 129; Bogue 1963b, pp. 6– 7). Although wood scarcity encouraged experimentation, hedge fences were costly to control and smooth iron fences could be broken by animals and were prone to rust (Primack 1969). Writers in agricultural journals argued that the major barrier to settlement was the lack of timber for fencing: the Union Agriculturist and Western Prairie Farmer in 1841, the Prairie Farmer in 1848, and the Iowa Homestead in 1863 (Bogue 1963a, p. 74). An 1871 guide for immigrants focused on three main characteristics of farmland in Plains counties: its price, the amount of timber, and the amount fenced (U.S. House 1871).6 Historians emphasize the importance of fencing for protecting farmers from encroachment by others’ cattle. “When he sought to fence his crops against marauding livestock, the prairie farmer faced the timber problem at its most acute” (Bogue 1963b, p. 7). “Without fences [farmers] could have no crops; yet the expense of fencing was prohibitive, especially in the Plains proper. It is not strange that the farmers began to insist that stock be fenced and that fields be permitted to lie out” (Webb 1931, p. 287).7 Political debates took place in Plains states about changing farmers’ fencing requirements (Davis 1973).8 For example, in 1872, Kansas gave counties the option of adopting a herd law that would make livestock owners liable for damages to farmers’ unfenced crops. Counties’ decisions were attributed explicitly to 6. Similarly, timber availability is among the first county characteristics described in 1870s Kansas State Board of Agriculture Reports. 7. Recent scholarship has associated barbed wire with society’s need to define control over space, beginning with the Western frontier and continuing in wars and prisons (Razac 2002; Netz 2004). 8. The desire for legal reform underscores the inability of individual cattleraisers and farmers to negotiate private guarantees: the large number of potential neighbors may have contributed to high transaction costs, along with an inability to enforce such contracts. Violent conflict on the plains between farmers and cattleraisers was largely prevented by farmers’ concession to settle elsewhere (Alston, Libecap, and Mueller 1998).
divided local public sentiment: opposed by stockmen, good for farmers and grain production (Kansas State Board of Agriculture 1876, 1877–1878, 1879–1880).9 Formal fencing requirements began to vary by state, county, and township, and were collected by wire manufacturer Washburn and Moen Manufacturing Company (1880).10 However, these legal reforms faced challenges in monitoring damages and enforcing payments on the frontier, while also overcoming established fence-out social norms.11 Farmers mainly adjusted to fencing material shortages by settling in areas with nearby timber plots. Bogue (1963b, p. 6) writes about central Iowa:
Where timber and prairie alternated, locations in or near wooded areas were relatively much more attractive. . . . [T]here developed a landholding pattern of which the timber lot was an intricate part. Settlers on the prairie purchased five or ten acres along the stream bottoms or in the prairie groves and drove five, ten or fifteen miles to cut building timber or to split rails during the winter months.
Smaller counties were roughly 30 miles on each side, so farmers traveling 5–15 miles for timber would have been mostly within their home counties. For a standard homestead farm size of 160 acres, a county would need to be roughly 4% woodland for each farm to acquire 5–10 acres of woodland. Based on this calculation, counties can be grouped into three woodland categories: low (0%–4%), medium (4%–8%), and high (8%–12%). The “low” counties are roughly those most constrained by timber scarcities, whereas “medium” counties could have partially adjusted through this landholding pattern and, like “high” counties, would have been less affected. The exact cutoffs for these categories are not relevant for the results; rather, the continuous estimates will be evaluated at three corresponding benchmark levels (0%, 6%, 12%) to assist in interpreting the estimated magnitudes.
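A worked version of this back-of-the-envelope calculation, using only the 160-acre homestead and the 5–10 acre timber lot quoted above:

\[
\frac{5\ \text{acres}}{160\ \text{acres}} \approx 3\%,
\qquad
\frac{10\ \text{acres}}{160\ \text{acres}} \approx 6\%,
\]

so a county in which every 160-acre farm held its own 5–10 acre timber lot would be roughly 3%–6% (about 4%) woodland, which is what motivates the 0%–4% “low” category and the 0%, 6%, and 12% benchmark levels.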
9. Kantor (1998) analyzes similar debates in Georgia, counties’ decisions to adopt herd laws, and relative changes in counties that adopted herd laws. 10. Among the sample states, Iowa left farmers liable for livestock damage. Texas left farmers liable for cattle damage, but allowed counties to determine liability for other animals. Kansas and Colorado allowed damage liability to be determined by counties, whereas Minnesota left this decision to townships. Information that Nebraska left farmers liable appears to be from 1867 and outdated; in 1871, Nebraska resolved a period of conflicting decisions and passed a statewide herd law making livestock owners liable for damage (Davis 1973; Kawashima 1994). 11. Ellickson (1991) analyzes a modern California county, in which farmers and ranchers appeal more to social norms than strict legal responsibilities. In Ellickson’s setting, social norms encourage ranchers to control cattle.
II.B. The Arrival of Cheap Barbed Wire Fences, 1880 to 1900 The most practical and ultimately successful design for barbed wire was patented in 1874 by Joseph Glidden, a farmer in DeKalb, Illinois. Glidden’s design had three important characteristics: barbs prevented cattle from breaking the fence, twisted wires tolerated temperature changes, and the design was easy to manufacture. Glidden sold a half-stake in the patent for a few hundred dollars to Isaac Ellwood, a hardware merchant in DeKalb, and the two started the first commercial production of barbed wire, producing a few thousand pounds per year by hand (Hayter 1939; McCallum and McCallum 1972). Barbed wire was cheaper than wooden fencing, particularly in timber-scarce areas, and it had lower labor requirements.12 Ellwood wrote to sales agents in 1875 that he did “not expect the wire to be much in demand where farmers can build brush and pole fences out of the growth on their own land” (Hayter 1939) and that “where lumber is reportedly dearer, the wire would probably sell for more” (Webb 1931, p. 310). In 1876, the country’s largest plain wire manufacturer (Washburn and Moen) bought half of the Glidden–Ellwood business for $60,000 cash plus royalties and began the first largescale production of barbed wire.13 In contrast to Glidden’s sale to Ellwood, Washburn and Moen’s purchase showed an awareness of barbed wire’s potential and they made “enormous profits” (Webb 1931, p. 309).14 Newspaper advertisements began to appear in Kansas and Nebraska in 1878 and 1879 (Davis 1973, pp. 133–134). There were a series of public demonstrations and, once the effectiveness of barbed wire was proved, “Glidden himself could hardly realize 12. In Iowa, wooden fences varied in total construction costs per rod from $0.91 to $1.31 in 1871, whereas barbed wire fences cost $0.60 in 1874 and below $0.30 in 1885 (Bogue 1963b, p. 8). Other reports quote barbed wire fences as costing $0.75 per rod in Indiana in 1880, whereas hedge fences cost $0.90 per rod and were wasteful of the land (Primack 1977, p. 73). Primack (1977, p. 82) estimates that a rod of barbed wire took 0.08, 0.06, and 0.04 days to construct in 1880, 1900, and 1910. The labor requirements for constructing wooden fences were constant throughout this period: 0.20, 0.34, and 0.40 days for board, post and rail, and Virginia rail. 13. This process began in 1875 when Washburn and Moen, headquartered in Massachusetts, sent an agent to investigate unusually large orders from DeKalb, Illinois. They acquired barbed wire samples and designed automatic machines for its production. 14. McFadden (1978) provides details on the further development of these businesses, with the 1899 incorporation of the American Steel and Wire Company of New Jersey leading to the monopolization of the barbed wire industry.
TABLE I
BARBED WIRE PRODUCTION, FENCE STOCKS, AND NEW FENCE CONSTRUCTION

Panel A. Annual production of barbed wire, thousands of tons
1911 Encyclopedia Britannica:  1874: 0.005;  1875: 0.3;  1876: 1.5;  1877: 7;  1878: 13;  1879: 25;  1880: 40;  1890: 125;  1900: 200;  1907: 250
Webb (1931, p. 309):           1874: 0.005;  1875: 0.3;  1876: 1.3;  1877: 6;  1878: 12;  1879: 23;  1880: 37;  1901: ∼250
Hayter (1939):                 1880–1884: 80–100;  1888: 150;  1895: 157
the magnitude of his business. One day he received an order for a hundred tons; ‘he was dumbfounded and telegraphed to the purchaser asking if his order should not read one hundred pounds’ ” (Webb 1931, p. 312). Local newspapers that had successfully lobbied for herd-law reform recognized the importance of barbed wire, writing “every farm needs some fencing” and as “soon as a farmer is able, he fences his farm. There must be an apparent benefit” (Nebraska Farmer and Wichita Beacon, quoted in Davis [1973, p. 134]).15 Legal and illegal fencing led to controversy and conflict on the range, as stockmen competed with each other and with farmers for control over land. This culminated in fence-cutting wars, which were resolved by the late 1880s (McCallum and McCallum 1972, pp. 159–166; Webb 1931, pp. 312–316). Local recognition of barbed wire’s importance is most reflected in the rapid increase and magnitude of its use. Table I, Panel A, shows a sharp increase around 1880 in the annual production of barbed wire. Panel B shows the resulting transformation in regional fence stocks. Before 1880, fences were predominately made of wood. From 1870 to 1880, there were some small increases in wire fencing, including both smooth wire and barbed wire. After 1880, there were rapid increases in barbed wire fencing. Total fencing increased most in the Plains and Southwest regions, where there were more timber-scarce areas. Wood fencing also initially increased, however, highlighting that it would be inappropriate to attribute all regional increases in fencing and economic activity to the introduction of barbed wire. 15. Despite the previous attention focused on herd laws, Kansas State Board of Agriculture Reports stopped including details on these decisions after 1880 and entirely stopped reporting the law status after 1884.
TABLE I (CONTINUED)
Panel B. Fence stocks, millions of rods (1 rod = 16.5 feet)

                1850   1860   1870   1880   1890   1900   1910
North Central
  Total          228    303    359    427    443    493    483
  Wood           226    285    320    369    279    192     75
  Stone            2      3      3      3      0      0      0
  Hedge            0      9     22     26     27     30     27
  Wire             0      6     14     30    137    271    382
South Central
  Total          175    230    245    344    531    685    701
  Wood           171    219    235    330    425    411    280
  Stone            3      5      4      7      0      0      0
  Hedge            0      5      2      3      0      0      0
  Wire             0      2      3      3    106    274    420
Prairie
  Total            5     22     41     80    255    607    718
  Wood             4     17     23     40    130    176      7
  Stone            0      1      2      2      0      0      0
  Hedge            0      3     13     26      3     18     22
  Wire             0      1      4     12    122    413    689
Southwest
  Total           39     78     94    162    280    710    749
  Wood            38     71     80    123    174    312    187
  Stone            1      2      2      2      0      0      0
  Hedge            0      2      4      5      0      0      0
  Wire             0      4      9     32    106    398    562
Even as the quality of barbed wire improved and consumers became increasingly aware of its effectiveness in the early 1880s, falling input costs and manufacturing improvements drove down prices: $20 (1874), $10 (1880), $4.20 (1885), $3.45 (1890), and $1.80 (1897).16 Panel C of Table I reports that new fence construction was all barbed wire after 1900, so further price declines or quality improvements would have had no differential effect across counties with varying access to wooden fences.17 Barbed wire differentially affected farmers’ fencing costs from roughly 1880 to 1900. The empirical approach presented here requires that the introduction of barbed wire fencing was exogenous, that is, that its rapid rise around 1880 was not caused by the anticipated 16. Prices are per hundred pounds (Webb 1931, p. 310). Hayter reports similar prices for 1874 and 1893. 17. Complete adoption was slower in the Prairie and Southwest, which may reflect less developed distribution networks and ranchers’ opposition.
TABLE I (CONTINUED)
Panel C. New fence construction, percentage

                1850–1859  1860–1869  1870–1879  1880–1889  1890–1899  1900–1909
North Central
  Wood              79         66         73          3          0          0
  Stone              1          0          0          0          0          0
  Hedge             12         21          6          1          0          0
  Wire               8         13         22         96        100        100
South Central
  Wood              90         94        100         50          0          0
  Stone              2          0          0          0          0          0
  Hedge              4          1          1          0          0          0
  Wire               3          5          0         50        100        100
Prairie
  Wood              71         39         38         45         18          0
  Stone              5          4          0          0          0          0
  Hedge             18         45         38          0          0          0
  Wire               6         12         24         55         82        100
Southwest
  Wood              84         56         63         42         32          0
  Stone              2          3          0          0          0          0
  Hedge              5         11          2          0          0          0
  Wire               9         29         35         58         68        100
Notes. “Wood” fences include three types: Virginia worm, post and rail, and board. “Wire” fences are smooth iron from 1850 to 1870 and include barbed wire beginning in 1880. Each region includes the following states: (North Central) Ohio, Indiana, Illinois; (South Central) Kentucky, Tennessee, Alabama, Arkansas, Louisiana, Mississippi; (Prairie) North Dakota, South Dakota, Iowa, Nebraska, Kansas; (Southwest) Oklahoma, Missouri, Texas. Panel B is excerpted from Primack (1977, Table 23, pp. 206–208). Panel C is excerpted from Primack (1977, Table 26, pp. 83–84).
development of low-woodland areas. This assumption appears plausible for two main reasons. First, from a microeconomic perspective, the demand for fencing alternatives had been high for decades and Glidden and Ellwood appear not to have anticipated the tremendous market demand for barbed wire. Second, from a more macroeconomic perspective, the necessary cheap steel was only becoming available around 1880. Barbed wire’s widespread commercial success was made possible by unrelated developments in the industrial steel sector. Strong rust-free wire became dramatically cheaper as the Bessemer steel process, originally patented in England in 1855, became widely used. Figure I shows prices for barbed wire, steel, and iron. Barbed wire’s introduction and mass-production follows the sharp decline in steel prices in the 1870s.18 By contrast, iron prices are 18. Before 1877, the Aldrich (1893) report lists high and variable prices in 1876 ($184, $381, $324), and Webb/Hayter report a high price in 1874 ($450).
Price increases around 1900 coincide with the monopolization of the barbed wire industry, as well as the Spanish-American War.
more stable and follow a weighted index of general prices over this period. Primack (1969) summarizes:
Outcries about the burdens of fencing by agriculturists in the 1850 to 1880 period seem amply justified. A need was revealed and the problem was resolved, not by changing laws and institutions but rather by technological change. This solution had to wait for the development of cheap steel in the industrial sector. Then a solution was found in wire fencing, cheap in both money and labor costs. (p. 289)
FIGURE I
Declining Steel Prices and the Introduction of Barbed Wire
(The figure plots prices, in dollars per ton, against year, 1850–1910, for iron, steel, and barbed wire.) From the NBER Macrohistory Database. “Iron” is the price of pig iron in Pennsylvania, and follows the reported general price level from 1860 to 1910; “steel” is the price of Bessemer steel rails in Pennsylvania before 1890 and the price of Bessemer steel billets in Pennsylvania after 1890; “barbed wire” is the price of galvanized barbed wire in Chicago after 1890. From 1877 to 1890, “barbed wire” is the price from the 1893 Aldrich Report (reported by manufacturer Washburn and Moen, Vol. 2, p. 183).
III. THEORETICAL FRAMEWORK To motivate the empirical analysis, consider a farmer in each county c and time period t choosing a level of investment I_ct and protection P_ct to maximize profits. The farmer produces output F(I_ct, q_ct), where q_ct denotes land quality. F(·, ·) is increasing in both arguments and F_12(·, ·) > 0. Following Besley (1995), a fraction of output is lost in each period, τ(P_ct) ∈ [0, 1], and it is decreasing in the level of protection, τ′(P_ct) < 0. Investment and
protection are each produced at some cost, C_ct(I_ct) and C_ct(P_ct). Thus, farmers choose I_ct and P_ct to maximize

(1)  $(1 - \tau(P_{ct}))\,F(I_{ct}, q_{ct}) - C_{ct}(I_{ct}) - C_{ct}(P_{ct})$.

An optimal interior solution satisfies two first-order conditions:

(2)  $C'_{ct}(I_{ct}) = F_1(I_{ct}, q_{ct})\,(1 - \tau(P_{ct}))$,
(3)  $C'_{ct}(P_{ct}) = -\tau'(P_{ct})\,F(I_{ct}, q_{ct})$.
In equation (2), the marginal cost of investment is set equal to the marginal return that the farmer expects to retain. In equation (3), the marginal cost of protection is set equal to the marginal increase in retained total output. These equations generate three relationships of interest. First, the optimal choice of investment is increasing in the level of protection because a greater proportion of the marginal return would be kept. Second, the optimal choice of protection is increasing in the level of investment because total output is greater. Third, higher land quality directly increases both investment and protection by raising the marginal return to investment and total output. The empirical identification problem is that an observed correlation between I_ct and P_ct could reflect more than one of these three effects. The effect of a change in the marginal cost of protection, however, can be informative about the direct effect of protection on investment. Equation (2) defines the optimal choice of investment, I*_ct(P*_ct, q_ct), and inserting that function into equation (3) defines the optimal choice of protection, P*_ct(q_ct, C′_ct(P*_ct)). If C_p denotes the marginal cost of protection, it follows that dI*/dC_p = ∂I*/∂P* · ∂P*/∂C_p; that is, the effect on investment from a change in protection cost equals the direct effect of protection on investment, multiplied by the effect on protection from a change in protection cost. Because protection can be assumed to decrease in its cost (∂P*/∂C_p < 0), an estimate of dI*/dC_p is informative about the sign of ∂I*/∂P*. If investment increases when the marginal cost of protection falls, this implies that greater protection directly increases investment.19 19. The marginal cost of protection can be thought of as an instrumental variable, where estimating dI*/dC_p is the “reduced form.” Without data on protection levels, it is not possible to estimate the “first-stage” term ∂P*/∂C_p and ultimately recover ∂I*/∂P*. Still, the magnitude of the reduced form reflects the importance of that particular decrease in protection costs for increasing investment.
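As a numerical illustration of this comparative static, the sketch below solves the farmer's problem for a falling marginal cost of protection under hypothetical functional forms, F(I, q) = q·√I, τ(P) = e^(−P), and linear costs; none of these functional forms or parameter values come from the paper, and the brute-force grid search is only for transparency.

```python
# Illustrative numerical version of the Section III problem: maximize
#   (1 - tau(P)) * F(I, q) - c_I * I - c_P * P
# with F(I, q) = q * sqrt(I) and tau(P) = exp(-P), by grid search.
# Functional forms and parameter values are hypothetical, not from the paper.
import numpy as np

def optimal_choices(q=1.0, c_I=0.05, c_P=1.0):
    I = np.linspace(0.01, 300.0, 1500)[:, None]   # investment grid (rows)
    P = np.linspace(0.01, 10.0, 1000)[None, :]    # protection grid (columns)
    profit = (1.0 - np.exp(-P)) * q * np.sqrt(I) - c_I * I - c_P * P
    i, j = np.unravel_index(profit.argmax(), profit.shape)
    return float(I[i, 0]), float(P[0, j])

# A falling marginal cost of protection (cheaper fencing) raises P* and,
# through the complementarity in equation (2), raises I* as well.
for c_P in [1.0, 0.5, 0.1]:
    I_star, P_star = optimal_choices(c_P=c_P)
    print(f"c_P = {c_P:.1f}:  I* = {I_star:6.1f},  P* = {P_star:4.2f}")
```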
To model the effect of barbed wire, assume that protection is provided by building fences with timber and barbed wire, P_ct = P(T_ct, B_ct). The price of barbed wire (p_t^B) is assumed to be decreasing over time but constant across counties. The price of timber in each county (p_c^T) is assumed to be constant over time, but decreasing in the percentage of the county that is woodland; that is, p_c^T = g(W_c) and g′(W_c) < 0. The cost of protection reflects choosing B_ct and T_ct to minimize

(4)  $p_t^B \cdot B_{ct} + p_c^T \cdot T_{ct}$, subject to $P_{ct}(T_{ct}, B_{ct}) = \bar{P}$.
If timber is used initially and it is not a perfect complement to barbed wire, a decrease in the price of barbed wire that results in its use will decrease the marginal cost of protection more in counties with less woodland and higher timber prices; that is, ∂ 3 C/∂ pT ∂ pB∂ P > 0. Once the price of barbed wire declines sufficiently that timber is no longer used, further price declines have no differential effect across counties with different woodland levels.20 Thus, barbed wire especially reduces the cost of protection in timber-scarce areas during the period from its widespread introduction until its universal adoption (1880–1900). If protection directly encourages investment, then investment should increase during this time period and especially in timber-scarce areas. IV. DATA AND SUMMARY STATISTICS IV.A. Data Construction County-level data are drawn from the U.S. Census of Agriculture (Gutmann 2005; Haines 2005). The sample is restricted to counties in Plains states (Iowa, Kansas, Nebraska, Minnesota, Texas, and Colorado) for which data are available in each decennial census from 1870 to 1920: data are first available in 1870 and land improvement data are available through 1920. Some county boundaries changed over this period, so the data are adjusted to hold the 1870 geographical units constant.21 20. This represents a corner solution to equation (4). 21. Using historical U.S. county boundary files (Carville, Heppen, and Otterstrom 1999), county borders in later decades are intersected with county borders in 1870 using ArcView GIS software. When later counties fall within more than one 1870 county, data for each piece are calculated by multiplying the later county data by the share of its area in the 1870 county. For those later periods, each 1870 county is then assigned the sum of all pieces falling within its area. This procedure assumes that data are evenly distributed across the county area, though for 85%
of counties in later periods less than 1% of their area overlaps with a second 1870 county. The sum is left missing when data for any piece are missing. After adjustment, counties are dropped when their standard deviation in number of acres is greater than 50,000 (3% of the sample).
A natural measure of local woodland would be the number of acres of woodland in a county, divided by the total area of the county.22 This measure is unavailable, but data are available on the number of acres of woodland in farms. The amount of woodland in farms may reasonably reflect the total woodland in the county, given that woodland was particularly valuable on the frontier and acquired first (Webb 1931, p. 281; Bogue 1963b, p. 6; Davis 1973, p. 125). Local woodland is defined to be the number of acres of woodland in farms in 1880, divided by the total area of the county (in acres).23 One indication that this is a reasonable measure is its correlation with the fraction of the county area mapped as forest vegetation in the 1924 Atlas of Agriculture (U.S. Department of Agriculture 1924). After digitally overlaying the Atlas with county boundaries and tracing vegetation cover, the overall correlation between local woodland and the fraction of the county in forest vegetation is 0.63 and state-specific correlations are 0.75 (Iowa), 0.64 (Kansas), 0.61 (Texas), 0.54 (Minnesota), 0.45 (Nebraska), 0.37 (Colorado). This measure of local woodland also appears to proxy for differences in wooden fence prices, based on 1879–1880 county-level data from Kansas (Kansas State Board of Agriculture 1879–1880). Figure IIA shows that counties with less local woodland face higher per unit wooden rail fencing costs. Prices also appear to have the earlier hypothesized convex relationship with local woodland.24 Fitting a fourth-degree polynomial to the data, counties with 0% woodland pay 56 cents (standard error of 21 cents) more than counties with 6% woodland, whereas counties with 6% woodland pay a statistically insignificant 23 cents (19 cents) less
22. Woodland in nearby counties could be included at some discount to reflect transportation costs, but historical accounts indicate that farmers traveled relatively short distances to cut rails for fences. Focusing on counties’ own woodland provides a simple and transparent measure. 23. Woodland data are used from 1880 when the most woodland might be included in farms, but before woodland stocks might be influenced by barbed wire. The empirical results are not sensitive to the date used to assign woodland levels. Plains agriculture did not typically involve clearing woodland for cultivation, as there was much open land and woodlands were a valuable asset. 24. Primack (1977, p. 70) describes increasing difficulty in adjusting wooden fence types to conserve timber. This convex relationship may also reflect distance to the nearest wood plot falling at a decreasing rate as wooded areas are scattered throughout a county.
than counties with 12% woodland. By contrast, Figure IIB shows that barbed wire fencing costs are not systematically related to local woodland.25
FIGURE II
Kansas Counties’ Wooden and Barbed Wire Fencing Costs (per Unit), 1879–1880
(Panel A, Wooden Rail Fences, plots the wooden rail fencing cost (per-100) against the woodland fraction; Panel B, Barbed Wire Fences, plots the barbed wire fencing cost (per-rod) against the woodland fraction.) County-level data on per-unit fencing costs are from the 1879–1880 Kansas State Board of Agriculture Report. “Woodland fraction” is defined based on Census data: the number of acres of woodland in farms in 1880, divided by the total area of the county (in acres). This measure of local woodland is shown in Figure III and used throughout the analysis.
25. Substantial quantities of rail and wire fencing are reported in later years, but differences in per-unit costs cannot be inferred. Data on fence posts, pine, and native lumber are available only in 1879–1880 and show a similar convex relationship to local woodland. Plain wire sells at roughly a 10–15 cent discount to barbed wire and its price is not related to local woodland, though it is often reported to be in little use.
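The fourth-degree polynomial comparison quoted above can be reproduced in outline as follows. The data file and column names below are hypothetical, and this point-estimate sketch would not by itself deliver the quoted standard errors (which require the fit's covariance, e.g., from an OLS regression on polynomial terms).

```python
# Illustrative quartic fit of wooden rail fencing cost on the woodland fraction,
# evaluated at the 0%, 6%, and 12% benchmarks. File/column names are hypothetical.
import numpy as np
import pandas as pd

kansas = pd.read_csv("kansas_fencing_1879_1880.csv")
coefs = np.polyfit(kansas["woodland_fraction"], kansas["rail_cost"], deg=4)
fit_at = np.polyval(coefs, [0.00, 0.06, 0.12])

print(f"0% vs. 6% woodland:  {fit_at[0] - fit_at[1]:+.2f}")
print(f"6% vs. 12% woodland: {fit_at[1] - fit_at[2]:+.2f}")
```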
FIGURE III Sample Counties Based on 1870 Boundaries, by Local Woodland Levels Based on 1870 geographical boundaries, the 377 sample counties are shown. Counties are shaded to represent the defined amount of local woodland based on Census data: the number of acres of woodland in farms in 1880, divided by the total area of the county (in acres). This measure of local woodland is used throughout the analysis.
Figure III shows all sample counties based on 1870 geographical boundaries and shaded to represent their defined local woodland levels. Counties with different woodland levels are not evenly balanced geographically, so the later empirical analysis controls for state-by-decade fixed effects. Thus, the relevant woodland
variation is mostly in Iowa, southern Minnesota, and the eastern parts of Kansas and Nebraska.26 The empirical results are not sensitive to excluding Colorado and Texas. To account partly for geographic differences within states, additional specifications control for distance west, distance from St. Louis, or finer regional groupings. Nonsample counties are mainly excluded because of unavailable data in 1870 or 1880. The empirical analysis focuses initially on three land-use outcomes: the fraction of county land in farms, the fraction of county land that is improved, and the fraction of land in farms that is improved. The fraction of county land in farms represents the extensive margin of settlement, which reflects farmers’ expected returns to converting land from the public domain.27 The fraction of farmland improved represents the intensive margin, which reflects farmers’ willingness to fix investments in land. Note that improved land could be plowed for crops or otherwise prepared for livestock, but the definition appears to exclude land that is simply fenced.28 The fraction of county land improved is a combination of extensive and intensive margins, reflecting the total increase in farmers’ fixed investments. Other outcome variables are introduced as the results are presented. These data are available by decade, so there is limited flexibility in analyzing responses to the exact timing of barbed wire’s introduction. Because the mass distribution of barbed wire was just beginning by 1880 and fencing stocks had yet to respond substantially, all 1880 county outcomes represent the end of the pre–barbed wire period.29 Because new fence construction was entirely barbed wire by 1900, this marks when barbed wire no longer had a differential effect. 26. The estimates therefore reflect changes in the eastern Plains, and may not extrapolate to western Plains regions in which agricultural production faces somewhat different environmental and technological factors. As shorthand, the text refers to this eastern Plains sample region simply as the Plains. 27. Land in farms “describes the number of acres of land devoted to considerable nurseries, orchards and market-gardens, which are owned by separate parties, which are cultivated for pecuniary profit, and employ as much as the labor of one able-bodied workman during the year. To be included are wood-lots, sheep-pastures, and cleared land used for grazing, grass or tillage, or lying fallow. Those lands not included in this variable are cabbage and potato patches, family vegetable-gardens, ornamental lawns, irreclaimable marshes, and considerable bodies of water” (Gutmann 2005). 28. Improved land is “all land regularly tilled or mowed, land in pasture which has been cleared or tilled, land lying fallow, land in nurseries, gardens, vineyards, and orchards, and land occupied by farm buildings” (Gutmann 2005). 29. Land-use measures were reported for the Census year and productivity is imputed from production and acreage in the previous year.
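The three land-use outcomes are simple ratios of census variables; a minimal sketch with hypothetical file and column names:

```python
# Illustrative construction of the three land-use outcomes from county-level
# census variables. File and column names are hypothetical.
import pandas as pd

census = pd.read_csv("plains_counties_1870_1920.csv")  # county-by-decade panel
census["farmland_share"] = census["farm_acres"] / census["county_acres"]               # extensive margin
census["improved_share_of_farmland"] = census["improved_acres"] / census["farm_acres"]  # intensive margin
census["improved_share_of_county"] = census["improved_acres"] / census["county_acres"]  # combined margin
print(census[["farmland_share", "improved_share_of_farmland",
              "improved_share_of_county"]].describe())
```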
IV.B. Summary Statistics Average local woodland among sample counties is 10%, but most counties have lower woodland levels: 39% have 0%–4% woodland, 15% have 4%–8% woodland, 11% have 8%–12% woodland, and 35% have more than 12% woodland. Three corresponding benchmarks (0%, 6%, 12%) are informative: 20% of the sample has less than 1% woodland; the median woodland level is 6%; and 12% is among the higher typical levels of woodland. Table II reports average county characteristics in 1880 for all sample counties and within the three woodland categories. Prior to barbed wire’s introduction, low-woodland counties were less settled and less improved, and a smaller share of farmland was improved. A smaller share of farmland was used for crops, and cropland was allocated less to corn and more to hay. Although low-woodland counties were larger, the total value of land was lower. Total fencing expenditures were lower in low-woodland counties, somewhat lower per farm acre, and roughly similar per dollar of output.30 Given higher per-unit costs in low-woodland areas, this suggests a lower intensity of fencing in those areas. Medium-woodland county averages generally fell between those for low-woodland and high-woodland counties, and were more similar to high-woodland counties. V. MEASUREMENT FRAMEWORK V.A. Estimation Setup: A Discrete Example The estimation strategy is illustrated in a discrete example with two county types (c ∈ {L, M}) and two time periods (t ∈ {1, 2}). Farmers in county type L (low-woodland) have a high timber price p_T^H in both periods, whereas farmers in county type M (medium-woodland) have a medium timber price p_T^M in both periods. The price of barbed wire is constant across county types but infinite in the first time period and p_B in the second time period, with p_B < p_T^H. Given this setup, a difference-in-difference estimate of the change in production outcome Y from a decrease in the cost of protecting land is (5)
(Y_{c=L, t=2} − Y_{c=L, t=1}) − (Y_{c=M, t=2} − Y_{c=M, t=1}).
30. Data on fencing expenditures were collected only in 1880.
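To make the difference-in-difference estimator in equation (5) concrete, a minimal Python sketch follows. The data frame, column names, and numbers are hypothetical illustrations, not the paper’s data.

```python
import pandas as pd

# Difference-in-difference estimate as in equation (5): the change in outcome Y for
# low-woodland counties (type L) minus the change for medium-woodland counties (type M).
def did_estimate(df):
    means = df.groupby(["county_type", "period"])["Y"].mean()
    return (means[("L", 2)] - means[("L", 1)]) - (means[("M", 2)] - means[("M", 1)])

# Hypothetical county-level data for the two periods:
df = pd.DataFrame({
    "county_type": ["L", "L", "M", "M"],
    "period":      [1, 2, 1, 2],
    "Y":           [0.25, 0.45, 0.45, 0.50],
})
print(did_estimate(df))  # (0.45 - 0.25) - (0.50 - 0.45) = 0.15
```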
TABLE II
MEAN COUNTY CHARACTERISTICS IN 1880, BY COUNTY WOODLAND GROUP

Columns: (1) All counties; (2) Low woodland, 0%–4%; (3) Medium woodland, 4%–8%; (4) High woodland, 8%–12%; (5) P-value, (2) vs. (3); (6) P-value, (3) vs. (4).

1870 county boundaries
  Number of counties: 377; 147; 57; 43; —; —
  Acres of improved land, per acre in farms: 0.54 [0.23]; 0.55 [0.20]; 0.64 [0.28]; 0.65 [0.24]; .032; .835
  Acres of land in farms, per county acre: 0.53 [0.26]; 0.42 [0.26]; 0.59 [0.28]; 0.65 [0.26]; .000; .257
  Acres of improved land, per county acre: 0.33 [0.25]; 0.25 [0.19]; 0.45 [0.30]; 0.48 [0.28]; .000; .542
  Acres in county: 550,718 [526,638]; 645,898 [801,941]; 470,219 [239,127]; 430,631 [180,333]; .018; .348
  Acres of land in farms: 237,407 [113,987]; 188,967 [125,574]; 242,318 [104,289]; 254,210 [92,210]; .002; .548
  Value of land, buildings, and fences: 3,192,401 [2,851,257]; 2,294,290 [1,750,834]; 4,271,744 [3,635,498]; 4,776,354 [3,245,953]; .000; .467
  Value of all products: 838,986 [659,730]; 593,556 [452,311]; 1,060,637 [867,683]; 1,158,761 [756,448]; .000; .548
  Cost of building and repairing fences: 33,514 [24,120]; 23,267 [18,173]; 41,589 [31,547]; 40,162 [19,616]; .000; .782

1880 county boundaries
  Number of counties: 490; 246; 61; 44; —; —
  Acres of cropland, per acre in farms: 0.31 [0.20]; 0.29 [0.19]; 0.38 [0.25]; 0.42 [0.19]; .009; .415
  % cropland for each crop:
    Corn: 40.2 [22.4]; 34.3 [23.6]; 51.1 [24.0]; 42.6 [21.3]; .000; .060
    Wheat: 23.2 [19.6]; 28.5 [19.0]; 22.6 [19.5]; 24.1 [19.7]; .033; .711
    Hay: 18.3 [20.5]; 26.2 [24.5]; 13.7 [9.2]; 14.6 [10.8]; .000; .662
    Oats: 7.6 [6.5]; 8.2 [7.9]; 6.7 [4.3]; 8.6 [4.6]; .050; .037
    Barley: 1.0 [1.7]; 1.4 [1.8]; 1.2 [2.6]; 1.0 [1.6]; .589; .664
    Rye: 0.5 [0.9]; 0.6 [1.0]; 0.6 [0.9]; 0.5 [0.7]; .773; .499

Notes. For the top panel, the sample is the same as in Figure III and Tables III and IV. For the bottom panel, the sample is the same as in Tables V and VII. Missing data for crop acreage are treated as a zero. Column (1) reports average county characteristics for the entire sample. Columns (2), (3), and (4) report average county characteristics for counties with the indicated amounts of local woodland (as defined in the notes to Figure III). Standard deviations are reported in brackets. Column (5) (or column (6)) reports the probability that the coefficients in columns (2) and (3) (or columns (3) and (4)) are the same, based on standard errors that are adjusted for heteroscedasticity.
For this estimate to be unbiased, farmers’ production decisions must depend additively on unobserved factors. Following the notation from Section III, production outcome Y in county c and time t is a function of land protection P and other characteristics q. This general function Y*_{ct}(P*_{ct}, q_{ct}) is assumed to be additively separable into a function of land protection and unobserved factors, (6)
Y*_{ct}(P*_{ct}, q_{ct}) = F(P*_{ct}) + γ_t + μ_c + ε_{ct},
where γ_t is a time effect, μ_c is a county effect, and ε_{ct} is a random error term. For example, assume Y_{ct} to be the fraction of farmland improved in county c and time t. The identifying assumption is that the fraction of farmland improved in each county would change to the same extent, apart from any change due to increased land protection in the low-woodland county after barbed wire’s introduction. It is impossible to test this identification assumption directly, but additional time periods and greater variation in county types can be used to form indirect tests. First, the same estimator for two periods before the introduction of barbed wire tests whether these county types had been trending similarly. Second, the same estimator for two periods after the universal adoption of barbed wire tests for other sources of differences, given that further price declines would not have differential effects across county types. Any differential trends before barbed wire’s introduction or after its universal adoption may or may not have occurred between those periods, but the results can be tested for robustness to each scenario. A third specification test exploits the potentially nonlinear relationship between local wooden fencing costs and local woodland. If a third county type H (high-woodland) has a low timber price p_T^L that is much closer to p_T^M than was p_T^H, then an estimate from equation (5) should be greater than the same estimate comparing medium- and high-woodland counties. This test is given by the difference-in-difference-in-difference estimator: (7)
[(Y_{c=L, t=2} − Y_{c=L, t=1}) − (Y_{c=M, t=2} − Y_{c=M, t=1})] − [(Y_{c=M, t=2} − Y_{c=M, t=1}) − (Y_{c=H, t=2} − Y_{c=H, t=1})].
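The triple difference in equation (7) extends the same calculation: it is the low-vs-medium difference-in-difference minus the medium-vs-high one. A short sketch, again with hypothetical numbers, is below.

```python
import pandas as pd

# Means of outcome Y by county type (L, M, H) and period (1, 2); values are made up.
means = pd.Series({
    ("L", 1): 0.25, ("L", 2): 0.45,
    ("M", 1): 0.45, ("M", 2): 0.50,
    ("H", 1): 0.60, ("H", 2): 0.64,
})

def did(a, b):
    # Difference-in-difference between county types a and b.
    return (means[(a, 2)] - means[(a, 1)]) - (means[(b, 2)] - means[(b, 1)])

print(did("L", "M") - did("M", "H"))  # 0.15 - 0.01 = 0.14
```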
The intuition for this empirical approach can be seen in a plot of the average share of farmland improved, by county woodland group and decade. Figure IV shows that medium- and high-woodland counties changed similarly over the entire period
FIGURE IV Acres of Improved Land (per Farm Acre) by County Woodland Group and Decade Counties are allocated to three groups based on defined local woodland levels (see notes to Figure III). For all counties in each woodland group and decade, the average number of improved acres per acre of land in farms are shown. Two vertical dotted lines represent the approximate date of barbed wire’s introduction (1880) and its universal adoption (1900).
(1870–1920). Low-woodland counties also changed similarly, except for large relative increases from barbed wire’s introduction until its universal adoption (1880–1900). This analysis of woodland categories is intended only to illustrate the intuition for the methodology, whereas the later empirical analysis examines continuous variation in woodland levels and includes controls for other potential changes. V.B. Main Estimating Equation For the main empirical analysis, county-level outcomes are first-differenced to control for any county characteristics that are constant over time. State-by-decade fixed effects αst are included to control for state-specific shocks that have an equal effect on all counties in the state. To allow flexibly for changes over time being correlated with county woodland levels, included for each decade is a fourth-degree polynomial function of a county’s 1880 local woodland level. The baseline estimated equation is (8)
Y_{ct} − Y_{c(t−1)} = α_{st} + β_{1t} W_c + β_{2t} W_c^2 + β_{3t} W_c^3 + β_{4t} W_c^4 + ε_{ct}.
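A sketch of how equation (8) could be estimated and then evaluated at representative woodland levels follows. It assumes a hypothetical county-by-decade panel with columns 'county', 'state', 'decade', 'woodland' (the 1880 local woodland share), and 'dY' (the decadal change in the outcome); these names, and the use of statsmodels, are illustrative assumptions rather than the paper’s actual code.

```python
import pandas as pd
import statsmodels.formula.api as smf

def estimate_eq8(panel):
    # Fourth-degree woodland polynomial, allowed to vary by decade, plus
    # state-by-decade fixed effects; standard errors clustered by county.
    df = panel.assign(
        w2=panel.woodland ** 2,
        w3=panel.woodland ** 3,
        w4=panel.woodland ** 4,
        state_decade=panel.state.astype(str) + "_" + panel.decade.astype(str),
    )
    model = smf.ols(
        "dY ~ C(state_decade) + (woodland + w2 + w3 + w4):C(decade)",
        data=df,
    )
    return model.fit(cov_type="cluster", cov_kwds={"groups": df["county"]})

def relative_change(result, decade, state, w_low=0.00, w_high=0.06):
    # Predicted decadal change at w_low minus the change at w_high (e.g., 0% vs. 6%),
    # mirroring how the evaluated columns of Table III are constructed. The state and
    # decade must be values that appear in the estimation sample.
    rows = pd.DataFrame({
        "woodland": [w_low, w_high],
        "decade": [decade, decade],
        "state": [state, state],
    })
    rows = rows.assign(
        w2=rows.woodland ** 2, w3=rows.woodland ** 3, w4=rows.woodland ** 4,
        state_decade=rows.state.astype(str) + "_" + rows.decade.astype(str),
    )
    pred = result.predict(rows)
    return pred[0] - pred[1]
```

Because the fixed effects cancel in the difference between the two predictions, only the woodland polynomial terms matter for the comparison.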
The estimated β’s are allowed to vary in each decade, and summarize how changes over each decade in county outcome Y vary with county woodland level W. The regression is estimated on a pooled sample of all decadal changes from 1870 to 1920.31 VI. ESTIMATION RESULTS VI.A. Land Improvement and Land Settlement Equation (8) is estimated for the fraction of farmland improved in each county. The full set of estimated β’s is difficult to interpret numerically, but the results can be seen in Figure V.32 The solid line reports the estimated change over the indicated time period for a county with that woodland level, relative to the estimated change for a county with 0% woodland.33 The two dashed lines report 95% confidence intervals around the estimates. From 1880 to 1890 and from 1890 to 1900, counties with the least woodland made large relative gains in the improvement intensity of farmland. By contrast, there were no substantial relative changes at low woodland levels before 1880, after 1900, or at higher woodland levels from 1880 to 1900. To display and interpret these results numerically, the estimated changes are evaluated at representative woodland levels: the most affected low-woodland county, with 0% woodland; the average medium-woodland county, with 6% woodland; and the least affected high-woodland county, with 12% woodland. The predicted change for a county with 0% woodland relative to the predicted change for a county with 6% woodland is analogous to a differencein-difference estimate for counties with those exact woodland levels, but the parameterized regression uses available data from counties with similar woodland levels.34 Columns (1) and (2) of Table III report the evaluated results from estimating equation (8) for the fraction of farmland 31. The estimated coefficients are identical when the sample is restricted to changes over any one decade, because the coefficients on each variable are allowed to vary over each decade. As in the case of two time periods, estimating equation (8) in first differences or with county fixed effects yields the same estimated changes. 32. For conciseness, the displayed results are limited to woodland levels less than 0.12 or 12%, though equation (8) is estimated for the entire distribution of woodland levels. 33. Due to the inclusion of state-decade fixed effects, the estimated results are only interpretable relative to some defined benchmark woodland level. 34. These evaluated estimates are not sensitive to the fourth-degree polynomial functional form, as long as the functional form is sufficiently flexible to capture the basic nonlinearity in Figure V.
[Figure V consists of four panels (From 1870 to 1880; From 1880 to 1890; From 1890 to 1900; From 1900 to 1910), each plotting the estimated relative change (+/− 2 SE) against the county woodland level (0 to 0.12).]
FIGURE V Estimated Changes in Acres of Improved Land (per Farm Acre) Relative to a County with 0% Woodland (±2 Standard Errors) The solid line reports the estimated polynomial function from equation (8) in the text, normalized at (0, 0). County-level changes in the number of improved acres per farm acre are regressed on a fourth-degree polynomial function of local woodland (see notes to Figure III) and state-by-decade fixed effects. Dashed lines report 95% confidence intervals around the estimated changes.
improved. In each decade, the coefficient in column (1) corresponds exactly to the difference in the graphed solid lines at 0 and 0.06 in Figure V. The estimated magnitude is interpreted as follows: the top coefficient in column (1) reports that acres of improved land
TABLE III
CHANGES IN LAND IMPROVEMENT AND SETTLEMENT, EVALUATED AT WOODLAND LEVELS (0%, 6%, 12%)

Decades: 1870–1880 (before barbed wire); 1880–1890 and 1890–1900 (after barbed wire’s introduction); 1900–1910 and 1910–1920 (after barbed wire’s universal adoption).

Acres of improved land per acre in farms
  (1) 0% vs. 6% woodland: 1870–1880: 0.015 (0.040) [0.22]; 1880–1890: 0.100∗∗ (0.029) [4.12]; 1890–1900: 0.086∗∗ (0.020) [3.40]; 1900–1910: −0.019 (0.011) [2.25]; 1910–1920: 0.003 (0.010) [0.31]
  (2) 6% vs. 12% woodland: 1870–1880: 0.023 (0.013); 1880–1890: −0.004 (0.012); 1890–1900: 0.020∗ (0.009); 1900–1910: 0.004 (0.006); 1910–1920: 0.006 (0.004)
  R2 = .4432; Observations = 1,885

Acres of land in farms per county acre
  (3) 0% vs. 6% woodland: 1870–1880: 0.039 (0.026) [0.41]; 1880–1890: 0.129∗∗ (0.023) [3.81]; 1890–1900: 0.128∗∗ (0.026) [3.13]; 1900–1910: −0.022 (0.024) [0.36]; 1910–1920: 0.024 (0.016) [1.59]
  (4) 6% vs. 12% woodland: 1870–1880: 0.028∗∗ (0.010); 1880–1890: 0.048∗∗ (0.009); 1890–1900: 0.057∗∗ (0.009); 1900–1910: −0.014 (0.009); 1910–1920: −0.003 (0.007)
  R2 = .5012; Observations = 1,885

Acres of improved land per county acre
  (5) 0% vs. 6% woodland: 1870–1880: −0.093∗∗ (0.019) [5.09]; 1880–1890: 0.144∗∗ (0.021) [6.26]; 1890–1900: 0.152∗∗ (0.023) [5.31]; 1900–1910: −0.020∗ (0.009) [1.54]; 1910–1920: 0.019∗ (0.008) [1.28]
  (6) 6% vs. 12% woodland: 1870–1880: −0.001 (0.008); 1880–1890: 0.026∗∗ (0.008); 1890–1900: 0.046∗∗ (0.008); 1900–1910: −0.005 (0.004); 1910–1920: 0.007∗ (0.004)
  R2 = .5696; Observations = 1,885

Notes. Estimates are from equation (8) in the text: county-level changes in each outcome are regressed on a fourth-degree polynomial function of county woodland (see notes to Figure III) and state-by-decade fixed effects. The estimates are evaluated at three woodland levels (0%, 6%, 12%) and represent the predicted change over each decade for a county with 0% woodland relative to a county with 6% woodland (columns (1), (3), (5)) or for a county with 6% woodland relative to a county with 12% woodland (columns (2), (4), (6)). In parentheses are standard errors corrected for heteroscedasticity and clustered by county. In brackets are t-statistics for the difference between coefficients in columns (1) and (2), (3) and (4), or (5) and (6). ∗∗ denotes statistical significance at 1% and ∗ at 5%.
per acre of farmland increased from 1870 to 1880, on average, 1.5 percentage points more in a county with 0% woodland than in a county with 6% woodland.35 From the same regression, column (2) reports the predicted change for a county with 6% woodland relative to a county with 12% woodland. In parentheses is the standard error for each coefficient, corrected for heteroscedasticity and clustered at the county level. In brackets is the t-statistic of the absolute difference between the coefficients, comparing 0% vs. 6% and 6% vs. 12%. For example, the coefficients in the first row of columns (1) and (2) are not statistically different, with a t-statistic of 0.22. The first main result is that, from 1880 to 1900, the improvement intensity of farmland increased by a statistically significant and substantial nineteen percentage points in counties with 0% woodland relative to counties with 6% woodland (Table III, column (1)). In contrast, there are not substantial changes before 1880, after 1900, or between higher woodland levels from 1880 to 1900. This result is clear in Figure VI, which plots the estimated cumulative changes after 1870. The increase in the improvement intensity of farmland came despite substantial expansion along the extensive margin of total settlement, which removed land from the public domain. Columns (3) and (4) of Table III report the results from estimating equation (8) for the fraction of county land in farms. In these baseline estimates, settlement increased by 26 percentage points from 1880 to 1900 in counties with 0% woodland relative to counties with 6% woodland. There were also some relative increases from 1870 to 1880, and from 1880 to 1900 counties with 6% woodland made relative gains on counties with 12% woodland. Combining changes in both intensive and extensive margins, columns (5) and (6) of Table III report estimated changes in the fraction of all county land that is improved. Total land improvement increased by 29 percentage points from 1880 to 1900, reversing a negative trend from 1870 to 1880 in counties with 0% woodland relative to counties with 6% woodland. Table IV presents the robustness of the baseline results to including control variables for other potential changes in agricultural development. Also, to account for potential spatial 35. That is, a county with 0% woodland that had 50% of its farmland improved would have, in expectation, caught up to a county with 6% woodland that initially had 51.5% of its farmland improved.
FIGURE VI Estimated Cumulative Change in Acres of Improved Land (per farm acre) Based on the estimates reported in Table III (columns (1) and (2)), the solid circles represent the estimated cumulative change after 1870 in acres of improved land per farm acre for a county with 0% woodland relative to a county with 6% woodland. The open circles represent the estimated cumulative change after 1870 for a county with 6% woodland relative to a county with 12% woodland. Two vertical dotted lines represent the approximate date of barbed wire’s introduction (1880) and its universal adoption (1900).
correlation among counties, Conley standard errors are estimated (Conley 1999).36 Overall, the estimated changes in land improvement are robust to these alternative specifications, whereas the changes in land settlement are less robust. Column (1) of Table IV presents the results without additional controls, as a basis for comparison. Allowing for spatial correlation increases the standard errors, but the estimated coefficients remain statistically significant. The results are condensed to show only the changes from 1880 to 1890 and from 1890 to 1900 in counties with 0% woodland relative to counties with 6% woodland. Because counties with less woodland tend to be further west, a concern is that baseline estimates could be confounded with an independent push toward increased westward development, changes in land policies,37 reduced armed conflict with Native 36. Spatial correlation among counties is assumed to be declining linearly up to a distance of 100 miles and zero after 100 miles (the shortest distance between the most wooded and least wooded counties in Kansas). 37. Libecap (2007) reviews changing U.S. land policy, highlighted by the 1862 Homestead Act and small subsequent revisions in 1904, 1912, and 1916.
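Footnote 36 describes the spatial weights behind the Conley standard errors. A small sketch of that weighting kernel is below; it shows only the kernel itself, not the full covariance estimator, and the distance inputs and cutoff are assumptions taken from the footnote.

```python
import numpy as np

# Linear (Bartlett-type) spatial weights: 1 at zero distance, declining linearly,
# and exactly zero at or beyond the 100-mile cutoff described in footnote 36.
def spatial_weights(dist_miles, cutoff=100.0):
    d = np.asarray(dist_miles, dtype=float)
    return np.where(d < cutoff, 1.0 - d / cutoff, 0.0)

# Example: county pairs 0, 25, 50, and 150 miles apart.
print(spatial_weights([0.0, 25.0, 50.0, 150.0]))  # [1.0, 0.75, 0.5, 0.0]
```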
TABLE IV
CHANGES IN LAND IMPROVEMENT AND SETTLEMENT (0% VS. 6% WOODLAND), ROBUSTNESS TO ALTERNATIVE SPECIFICATIONS

Panel A: Acres of improved land per acre in farms
  1880–1890: (1) 0.100∗∗ (0.034) [3.72]; (2) 0.094∗∗ (0.036) [3.64]; (3) 0.119∗∗ (0.033) [4.36]; (4) 0.100∗∗ (0.034) [3.39]; (5) 0.086∗ (0.034) [3.15]; (6) 0.120∗∗ (0.034) [4.04]; (7) 0.065∗ (0.031) [3.03]; (8) 0.098∗∗ (0.034) [3.68]
  1890–1900: (1) 0.086∗∗ (0.033) [2.36]; (2) 0.094∗ (0.038) [2.36]; (3) 0.106∗∗ (0.037) [2.62]; (4) 0.087∗∗ (0.033) [2.28]; (5) 0.070∗ (0.032) [1.90]; (6) 0.094∗∗ (0.032) [2.50]; (7) 0.076∗∗ (0.028) [2.48]; (8) 0.086∗∗ (0.033) [2.36]

Panel B: Acres of land in farms per county acre
  1880–1890: (1) 0.129∗∗ (0.040) [2.43]; (2) 0.068 (0.038) [1.39]; (3) 0.083∗ (0.037) [1.66]; (4) 0.106∗ (0.043) [1.52]; (5) 0.078∗ (0.039) [1.20]; (6) 0.121∗∗ (0.041) [2.24]; (7) 0.031 (0.034) [1.04]; (8) 0.129∗∗ (0.040) [2.42]
  1890–1900: (1) 0.128∗∗ (0.030) [2.76]; (2) 0.079∗ (0.035) [1.35]; (3) 0.066 (0.036) [1.10]; (4) 0.116∗∗ (0.032) [2.06]; (5) 0.097∗∗ (0.031) [1.69]; (6) 0.122∗∗ (0.030) [2.58]; (7) 0.062 (0.038) [1.03]; (8) 0.128∗∗ (0.030) [2.76]

Panel C: Acres of improved land per county acre
  1880–1890: (1) 0.144∗∗ (0.032) [4.42]; (2) 0.103∗∗ (0.027) [4.02]; (3) 0.126∗∗ (0.027) [4.57]; (4) 0.140∗∗ (0.032) [3.84]; (5) 0.123∗∗ (0.028) [3.89]; (6) 0.154∗∗ (0.030) [4.88]; (7) 0.061∗ (0.024) [3.41]; (8) 0.142∗∗ (0.032) [4.33]
  1890–1900: (1) 0.152∗∗ (0.043) [3.19]; (2) 0.153∗∗ (0.045) [3.08]; (3) 0.164∗∗ (0.043) [3.36]; (4) 0.150∗∗ (0.042) [2.97]; (5) 0.130∗∗ (0.040) [2.68]; (6) 0.154∗∗ (0.040) [3.33]; (7) 0.097∗∗ (0.032) [2.83]; (8) 0.152∗∗ (0.043) [3.19]

Notes. Each column reports a modified version of the specification from Table III (see notes). The results shown are the estimated changes from 1880 to 1890 and 1890 to 1900 for a county with 0% woodland relative to a county with 6% woodland. Conley standard errors that adjust for spatial correlation are reported in parentheses; reported in brackets are t-statistics for the difference from the relative change for a county with 6% woodland and 12% woodland, which are also calculated using Conley standard errors. Column (1) reports the baseline results, adjusted for spatial correlation. Column (2) controls for a county’s distance west, interacted with each decade. Column (3) controls for a county’s distance west and distance from St. Louis, interacted with each decade. Columns (4), (5), and (6) control for a quadratic time trend for each of 11 regions, 43 subregions, or 19 soil groups. Column (7) controls for a fourth-degree polynomial of a county’s 1870 outcome level, interacted with each decade. Column (8) controls for county-level changes in railroad track mileage. ∗∗ denotes statistical significance at 1% and ∗ at 5%.
Americans,38 or other factors. Column (2) includes controls for the distance west of each county centroid, interacted with each decade. Column (3) also controls for distance from St. Louis (“Gateway to the West”) interacted with each decade, to allow for expansion out from the middle of the country. The results are similar when higher-order polynomial distance measures are included, with and without including Colorado and Texas. Counties with different woodland levels may be suited to different agricultural products, so changes in prices and technologies may contribute to differential development over this time period.39 To explore the robustness of the results to these types of factors, county data were merged with traced land resource regions and subregions and great soil groups.40 Within the sample of counties, there are 11 regions, 43 subregions, and 19 soil groups. To allow for differential growth patterns, equation (8) is estimated with a quadratic time trend for each of the 11 regions (column (4)), 43 subregions (column (5)), or 19 soil groups (column (6)).41 The baseline estimates may also be confounded with convergence in agricultural development, whereby counties with lower initial levels of land improvement or settlement may have otherwise experienced higher subsequent growth. Counties with less woodland were initially less developed along each measure, though there were only small relative increases from 1870 to 1880 when initial differences were greatest. Column (7) controls for an additional fourth-degree polynomial function of the county’s fixed 1870 outcome level, interacted with each decade. This effectively 38. Hess and Weidenmier (forthcoming) record individual armed conflicts with Native Americans: 14 events from 1866 to 1869, 69 events in the 1870s, 13 events from 1880 to 1883, and 1 event in 1890. The last recorded conflict in the sample states, outside of Texas, was in 1876. I thank the authors for providing their data. 39. Regarding technological change in agriculture, there are not obvious relative advances for low-woodland areas during the particular period 1880 to 1900 (Rasmussen 1962; Primack 1977; Olmstead and Rhode 2002). Changes in local wood prices would have differential effects, though the timber market in all sample counties was a small sector: in 1870, forest products averaged 0.6% of the total value of all farm products. 40. Land resource regions and subregions were mapped in the 1966 U.S. Department of Agriculture Handbook 296. Soil groups were mapped by the U.S. Soil Conservation Service in 1951 and were retained in the National Archives Record Group 114, item 148. These maps were scanned, traced in GIS software, and digitally merged to 1870 county boundaries. Separate variables are defined for the fraction of each county area falling into each region, subregion, or soil group. 41. Because equation (8) is in changes, the first-differenced analog of a quadratic time trend is included: the fraction of the county in that region, and the fraction multiplied by 3 in 1880, 5 in 1890, 7 in 1900, 9 in 1910, 11 in 1920.
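A short check of the arithmetic behind footnote 41: a region-specific quadratic trend a + bt + ct^2 (with t = 1 for 1870 through t = 6 for 1920), once first-differenced, becomes

```latex
\Delta\left(a + b t + c t^{2}\right) = b + c\,(2t - 1),
\qquad 2t - 1 = 3,\,5,\,7,\,9,\,11 \ \text{for}\ t = 2,\dots,6,
```

so the linear part contributes the region share itself and the quadratic part contributes the share multiplied by 3, 5, 7, 9, and 11 across the five decadal changes, as listed in footnote 41.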
focuses the analysis on counties with different woodland levels but similar outcome levels in 1870.42 One particular source of convergence may have been an expansion of the railroad network into previously less-developed and lower-woodland areas. Counties’ railroad track mileage was calculated by merging county borders with railroad network maps, by decade from 1870 to 1920.43 Total track increased in sample counties from 6k miles (1870) to 19k (1880), 30k (1890), 32k (1900), and 38k (1910 and 1920). Railroad expansion after 1880 was mainly on the intensive margin: in 1870, 50% of sample counties had some railroad track and this increased to 89% (1880), 95% (1890), 97% (1900), and 99% (1910 and 1920). The construction of railroad spur lines may be endogenous and a channel through which barbed wire affected development. Aside from potentially following agricultural development, railroads were often required to fence out cattle from tracks and, as lines pushed into less wooded areas, lumber became more expensive and was sometimes stolen by settlers (McCallum and McCallum 1972, pp. 196–201).44 Although potentially inducing overcontrolling bias, column (8) presents the baseline results when controlling for changes in county railroad track mileage. The results are similar when controlling for a fourth-degree polynomial in railroad mileage, or whether a county has any railroad track.45 Overall, Table IV shows that the estimated increases in land improvement are robust, whereas increases in land settlement are 42. Pre–barbed wire land use is endogenously determined, so counties with similar land-use outcomes and different amounts of woodland might be expected to differ along other important dimensions for farmers to be compensated for the lack of woodland. Also, if low initial outcome counties converge due to barbed wire’s introduction, this specification will suffer from overcontrolling bias. 43. Railroad network maps were obtained from the Library of Congress railroad maps collection: Colton’s 1871 map for 1870, Colton’s 1882 map for 1880, Matthew’s 1890 map for 1890, 1897 Century Atlas for 1900, 1911 Century Atlas for 1910, 1918 General Railway map for 1920. Railroad lines on each map were traced and merged to 1870 county boundaries, though the railroad map projections did not merge precisely. To minimize measurement error in the changes, railroad lines for each decade were snapped to their corresponding lines in the 1910 map (the most detailed and precise map). Mapped track mileages produce state-by-decade aggregates similar to those published in Poor’s Manual of Railroads; I thank Paul Rhode for providing these data. 44. From estimating equation (8) for county track mileage, there were few systematic changes aside from that a county with 0% woodland experienced a 12.5 (5.0)–mile increase from 1880 to 1890 relative to a county with 6% woodland. 45. Railroad network expansion may also have a differential effect on areas that had different access to major riverways; note that one of the soil groups (column (6)) effectively captures the presence of a major river.
less robust to some specifications. Adjusting for spatial correlation increases the standard errors, but the estimates generally remain statistically significant. The remainder of the analysis presents standard errors that are simply clustered at the county level. In contrast to the above estimates, herd laws appear to have been of little benefit. Nebraska adopted a state-wide herd law in 1871 that was intended to make livestock owners liable for damage to farmers’ unfenced crops, which had the potential to benefit farmers more in counties with the least woodland (Davis 1973; Kawashima 1994). However, from 1870 to 1880, the improved fraction of farmland declined by 22 percentage points (standard error of 10 percentage points) in a county with 0% woodland relative to a county with 6% woodland.46 It was not until 1890, after the introduction of barbed wire, that counties with the least woodland showed a 29 (6)–percentage point increase. Settlement was mostly unchanged, until an 18 (6)–percentage point increase from 1890 to 1900. Total land improvement declined by 24 (9) percentage points from 1870 to 1880, and increased by 18 (7) and 8 (5) percentage points from 1880 to 1890 and 1890 to 1900. Kansas gave counties the option of adopting a herd law, beginning in 1872. Counties’ adoption decisions are analyzed by Sanchez and Nugent (2000), and this endogenous decision complicates an analysis of the law’s effects. Nearly all herd-law counties have less woodland than non–herd law counties, so it is not practical to estimate whether the herd law had a greater effect in counties with less woodland. However, based on which counties had adopted the herd law by 1880, it is possible to compute difference-in-difference estimates of the change in each land-use outcome for herd-law counties relative to non–herd law counties.47 Herd-law counties had an insignificant 6 (7)–percentage point decline in the improved fraction of farmland from 1870 to 1880; it was not until after barbed wire’s introduction that they experienced a 20 (3)–percentage point increase from 1880 to 1890. Similarly, total improved land had a 3 (3)–point increase from 1870 to 1880 and an 11 (3)–point increase from 1880 to 1890. By contrast, land settlement increased 46. These coefficients (and standard errors in parentheses) are from estimating equation (8) for Nebraska only. 47. Kansas State Board of Agriculture Reports indicate adoption and provide quick comments on the political situation and hypothesized effects in 1876, 1877–1878, and 1879–1880; yes/no information on adoption in 1881–1882 and 1883–1884; and no information in 1885–1886. This is consistent with the earlier hypothesis that barbed wire defused these political debates.
by 26 (4) percentage points from 1870 to 1880, and declined by 7 (3) points from 1880 to 1890; this may reflect increasingly settled areas managing to adopt the law over the 1870s. Kansas then adopted a statewide herd law in 1889, which did not lead to relative increases for non–herd law counties. The statewide law also did not benefit counties with the least woodland: the improved fraction of farmland declined by 9 (5) percentage points from 1890 to 1900, whereas settlement and total land improvement were little changed. Farmers made substantial investments in fencing before the legal reforms, after the legal reforms, and after the introduction of barbed wire. These laws may have had some small influence, or they would not have been so hotly debated. In the absence of physical barriers, however, formal laws appear to have provided farmers little refuge from roaming livestock. VI.B. Crop Productivity and Crop Choice Barbed wire’s introduction may also have led farmers to adjust crop production. When liability for damage can be traded, Coase (1960) discusses how the optimal allocation of land could favor crops or livestock. However, without the ability to protect land physically or contractually, the returns to certain crops may be particularly sensitive to the threat of uncompensated damage by others’ livestock. In response, farmers may reduce crop acreage or, if they continued to grow crops without fencing, reduce investment in cropland, harvest crops earlier, or otherwise adjust production in ways that lower productivity. County-level data are available on the total production and acreage for each of the six main crops on the Plains (corn, wheat, hay, oats, barley, rye), by decade and beginning in 1880.48 Productivity for each crop p in each county c is defined as its total production per acre harvested. To assist in interpreting the results, productivity in each decade is normalized by its value in 1880.49 For the empirical estimation, equation (8) is slightly modified. To control for regional changes in crop productivity, state-decade 48. Cotton is excluded from the analysis, as data are available only for Texas and the boll weevil blight severely impacted cotton productivity. Using the same technique as before, data are adjusted to maintain 1880 geographical boundaries. 49. The upper and lower centiles of the normalized productivity distribution are dropped: those less than 0.36 or greater than 6.4. The results are not sensitive to these cutoffs, as long as the clearly extreme observations are dropped.
fixed effects are replaced with crop-state-decade fixed effects. The equation is first-differenced by crop-county, to control for constant differences in crop productivity across counties. Data on all crops are pooled in the baseline analysis, which constrains the change in productivity across woodland levels to be the same for all crops: (9)
(Y_{pct} − Y_{pc(t−1)}) / Y_{pc,1880} = α_{pst} + β_{1t} W_c + β_{2t} W_c^2 + β_{3t} W_c^3 + β_{4t} W_c^4 + ε_{pct}.
Columns (1) and (2) of Table V present baseline results. From 1880 to 1890, average productivity across all six crops increased 23.4% more in a county with 0% woodland than in a county with 6% woodland.50 By comparison, U.S. crop yields and total crop production increased annually by 0.23% and 1.7% from 1880 to 1920.51 Crop productivity decreased by 4.6% from 1890 to 1900, leaving it 18.8% higher than in 1880. Consistent with cropland becoming more productive, an increasing share of farmland became allocated to crops. From estimating equation (8), the fraction of farmland allocated to cropland increased by 12 percentage points from 1880 to 1890 in a county with 0% woodland relative to a county with 6% woodland (Table V, column (3)). As in the case of crop productivity, there was little change from 1890 to 1900. An extension of the results uses crop-level differences in vulnerability to livestock damage. Although cattle eat hay (various grasses), hay is more resistant to livestock damage before being harvested. Hay fields can even be intended for grazing at certain times of the year. The other crops (corn, wheat, oats, barley, rye) would yield substantially less grain if they were trampled, so it would be more important to protect them from others’ livestock.52 Restricting the analysis to these five crops more at risk of damage, productivity increased 29% in counties with the least woodland (columns (5) and (6), Table V). By contrast, hay productivity was 50. The results are similar when weighting by crop acreages in 1880. 51. These numbers are estimated using indices of total U.S. crop production and yield per acre harvested for twelve major crops (NBER Macrohistory Database, files a01005aa and a01297). The production index is computed by weighting the production of each commodity by average farm prices from 1910 to 1914. To obtain the average annual increase, the natural log of each index is regressed on a time trend from 1880 to 1920. 52. Grazing these crops would require close management of timing and intensity, and even then would substantially lower grain yields (Smith, Benson, and Thomason 2004). The Census defines these crops as those that are grown for grain, rather than grazed or grown for hay.
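A sketch of equation (9), along the same lines as the earlier regression sketch, is below; the crop-by-county-by-decade panel and its column names ('crop', 'county', 'state', 'decade', 'woodland', 'yield_per_acre') are hypothetical placeholders, not the paper’s replication code.

```python
import statsmodels.formula.api as smf

def estimate_eq9(panel):
    df = panel.sort_values(["crop", "county", "decade"]).copy()
    # Normalize each crop-county series by its 1880 (first observed) value, then
    # first-difference: this equals (Y_pct - Y_pc(t-1)) / Y_pc,1880.
    base = df.groupby(["crop", "county"])["yield_per_acre"].transform("first")
    df["norm_yield"] = df["yield_per_acre"] / base
    df["d_norm_yield"] = df.groupby(["crop", "county"])["norm_yield"].diff()
    df = df.dropna(subset=["d_norm_yield"]).assign(
        w2=lambda d: d.woodland ** 2,
        w3=lambda d: d.woodland ** 3,
        w4=lambda d: d.woodland ** 4,
        crop_state_decade=lambda d: (d.crop.astype(str) + "_" + d.state.astype(str)
                                     + "_" + d.decade.astype(str)),
    )
    # Crop-by-state-by-decade fixed effects; the pooled woodland polynomial constrains
    # the productivity response to be the same for all crops, as in the baseline.
    model = smf.ols(
        "d_norm_yield ~ C(crop_state_decade) + (woodland + w2 + w3 + w4):C(decade)",
        data=df,
    )
    return model.fit(cov_type="cluster", cov_kwds={"groups": df["county"]})
```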
TABLE V
CHANGES IN CROP PRODUCTIVITY AND CROP INTENSITY

Decades: 1880–1890 and 1890–1900 (after barbed wire’s introduction); 1900–1910 and 1910–1920 (after barbed wire’s universal adoption).

Productivity, all crops
  (1) 0% vs. 6% woodland: 1880–1890: 0.234∗∗ (0.057) [3.31]; 1890–1900: −0.046 (0.039) [0.97]; 1900–1910: 0.054 (0.035) [1.70]; 1910–1920: −0.036 (0.042) [0.95]
  (2) 6% vs. 12% woodland: 1880–1890: 0.018 (0.025); 1890–1900: −0.003 (0.018); 1900–1910: −0.019 (0.018); 1910–1920: 0.014 (0.025)
  R2 = .3949; Observations = 9,104

Acres of cropland per acre in farms
  (3) 0% vs. 6% woodland: 1880–1890: 0.121∗∗ (0.015) [8.01]; 1890–1900: −0.013 (0.009) [0.24]; 1900–1910: 0.039∗∗ (0.010) [3.44]; 1910–1920: 0.010 (0.007) [1.06]
  (4) 6% vs. 12% woodland: 1880–1890: 0.015∗∗ (0.005); 1890–1900: −0.011∗∗ (0.004); 1900–1910: 0.004 (0.004); 1910–1920: 0.001 (0.004)
  R2 = .3922; Observations = 1,960

Productivity, at-risk crops
  (5) 0% vs. 6% woodland: 1880–1890: 0.292∗∗ (0.067) [3.47]; 1890–1900: −0.057 (0.046) [0.86]; 1900–1910: 0.058 (0.039) [1.81]; 1910–1920: 0.011 (0.049) [0.44]
  (6) 6% vs. 12% woodland: 1880–1890: 0.036 (0.027); 1890–1900: −0.012 (0.021); 1900–1910: −0.027 (0.019); 1910–1920: 0.038 (0.029)
  R2 = .4000; Observations = 7,320

Acres of at-risk crops per acre of cropland
  (7) 0% vs. 6% woodland: 1880–1890: 0.058∗ (0.023) [2.84]; 1890–1900: 0.007 (0.015) [0.91]; 1900–1910: 0.020 (0.019) [1.75]; 1910–1920: 0.016 (0.015) [0.63]
  (8) 6% vs. 12% woodland: 1880–1890: 0.000 (0.007); 1890–1900: −0.005 (0.006); 1900–1910: −0.015 (0.009); 1910–1920: 0.005 (0.009)
  R2 = .2777; Observations = 1,960

Notes. For changes in productivity, estimates are from equation (9) in the text. For each crop and county, output per acre is normalized by its value in 1880. Changes in this normalized productivity measure are regressed on a fourth-degree polynomial function of county woodland (see notes to Figure III) and crop-by-state-by-decade fixed effects. For columns (1) and (2), data are pooled for all crops (corn, wheat, hay, oats, barley, rye). For columns (5) and (6), the sample is limited to more at-risk crops; that is, hay is excluded. The estimates are evaluated at three woodland levels (0%, 6%, 12%) and represent the predicted change over each decade for a county with 0% woodland relative to a county with 6% woodland (columns (1) and (5)) or for a county with 6% woodland relative to a county with 12% woodland (columns (2) and (6)). In parentheses are standard errors corrected for heteroscedasticity and clustered by county. In brackets are t-statistics for the difference between coefficients in columns (1) and (2), (5) and (6). For changes in cropland, estimates are from equation (8) in the text. The results are presented in the same form as in Table III (see notes). ∗∗ denotes statistical significance at 1% and ∗ at 5%.
unchanged from 1880 to 1890. From estimating equation (8) for the fraction of cropland allocated to crops more at risk, columns (7) and (8) report that more cropland became allocated to crops more at risk from 1880 to 1890. In interpreting the results, changes in productivity for a given plot of land may be confounded by productivity differences for new lands coming under cultivation. Newly cultivated lands on the Plains may be especially productive due to stored soil nutrients. Parton et al. (2005) estimate that this productivity advantage mostly dissipates over twenty to thirty years; note that productivity gains from 1880 to 1890 are mostly persistent over the next 30 years. Permanent composition effects could appear if less wooded areas of counties were inherently more productive. However, very little land on farms was wooded: in 1880, less than 6% of the land on farms was wooded in 76% of the counties with less than 6% woodland. Furthermore, it may be that cheaper fencing encouraged the expansion of production into otherwise unprofitable and lower-quality lands within a county, which would cause the estimates to understate the increase in productivity for a given plot of land. A limitation of decadal census data is that crop production is sensitive to weather and other short-term shocks, so productivity in Census years may not be representative. Evidence on the representativeness of Census years is mixed, based on annual county-level data from Kansas for the productivity of corn, wheat, and oats (Parker, DeCanio, and Trojanowski 2000). For wheat, 0% woodland counties were unusually unproductive in 1879 relative to 6% woodland counties. By contrast, corn productivity differences in Census production years were similar to average non-Census years.53 These data caution that Census years may happen to give an inaccurate picture of typical changes in productivity. This highlights the advantage of comparing productivity changes for at-risk crops and hay, which implicitly includes county-year fixed effects. Additionally, the estimates in Table V are robust to the distance, region, subregion, and soil group specifications reported in Table IV. These additional controls may absorb changes in technology, weather, or other factors.54 53. Census data in 1880 reports production data from 1879. Oats were less commonly grown, but estimates are more similar to wheat than corn: 1879 was fairly unproductive, 1880 was fairly productive, and later Census years were similar to non-Census years. 54. When productivity is analyzed, the additional controls are interacted with each crop.
Overall, it appears that farmers secured, improved, and expanded crop production from 1880 to 1890. If after 1890 farmers no longer cultivated substantial lands without fences, then crop productivity would not be expected to increase further. VI.C. Land Value Changes in land values potentially capitalize the total value of barbed wire to farmers. The Census provides self-reported data on land values, and farmers may have been familiar with their lands’ market value. Speculation in land markets was active at this time and taxes were paid on separately assessed land values (Gates 1973).55 Land value data are available in each period only for the combined value of farmland, buildings, and fences. However, land was the largest component of this measure: in 1900 and 1910, buildings’ value was between 13% and 17% of the total in low-, medium-, and high-woodland counties; in 1879, the cost of building and repairing fences was 1% of the total value. Equation (8) is estimated for the natural log of land value, per county acre. The log is analyzed because technology, prices, and land protection are typically modeled to have multiplicative effects on output value. Similar increases in land settlement or other additive shocks would have a larger percentage effect in areas with low initial levels, so the analysis also controls for initial land values.56 Table VI presents the results. Land values increased substantially from 1880 to 1890 in counties with 0% woodland, relative to counties with 6% woodland. Land values continued to increase from 1890 to 1900, but not statistically more than the relative increase at higher woodland levels. Before 1880 or after 1900, in contrast, there were either relative declines or small changes. Focusing on the increase from 1880 to 1890, this represents an economically substantial increase of 50% above 1880 levels. This is 1.7 times the 1880 value of all agricultural products in low-woodland counties. Assuming that barbed wire had no effect on counties with more than 6% woodland, the estimated total benefit to farmers is $103 million (1880 U.S. dollars) with a standard error 55. Farmers may have partly anticipated the arrival of barbed wire by the 1880 Census, though unsettled land would still be valued at zero. 56. The specification controls for a fourth-degree polynomial function of the 1870 log land value, interacted with each decade. Without these additional controls, the estimates fit a clear pattern of economic convergence for both low- and medium-woodland counties: there are large relative increases from 1870 to 1880 that then decline over time.
TABLE VI
CHANGES IN LAND VALUE

Decades: 1870–1880 (before barbed wire); 1880–1890 and 1890–1900 (after barbed wire’s introduction); 1900–1910 and 1910–1920 (after barbed wire’s universal adoption).

Log value of land in farms, per county acre
  (1) 0% vs. 6% woodland: 1870–1880: −0.364∗ (0.151) [1.11]; 1880–1890: 0.406∗∗ (0.105) [3.99]; 1890–1900: 0.213∗∗ (0.072) [1.45]; 1900–1910: −0.101 (0.073) [2.07]; 1910–1920: 0.044 (0.063) [0.50]
  (2) 6% vs. 12% woodland: 1870–1880: −0.224∗∗ (0.053); 1880–1890: 0.074∗ (0.037); 1890–1900: 0.126∗∗ (0.027); 1900–1910: 0.013 (0.027); 1910–1920: 0.018 (0.024)
  R2 = .7569; Observations = 1,880
Notes. Estimates are from a modified version of equation (8) in the text: county-level changes in land value (farmland, buildings, fences) are regressed on a fourth-degree polynomial function of county woodland (see notes to Figure III), state-by-decade fixed effects, and a fourth-degree function of the county’s 1870 log land value. The estimates are evaluated at three woodland levels (0%, 6%, 12%) and represent the predicted change over each decade for a county with 0% woodland relative to a county with 6% woodland (column (1)) or for a county with 6% woodland relative to a county with 12% woodland (column (2)). In parentheses are standard errors corrected for heteroscedasticity and clustered by county. In brackets are t-statistics for the difference between coefficients in columns (1) and (2). ∗∗ denotes statistical significance at 1% and ∗ at 5%.
of $32 million.57 This is approximately 0.9% of total U.S. GDP in 1880 (Historical Statistics of the United States 2006). The estimated increase in land value understates the total value of barbed wire if counties with more than 6% woodland also benefited (or nonsample counties). This number overstates the total value to the extent that farmers’ investment costs became capitalized into land values. As a check on the results, an upper bound 57. This total is calculated as follows. Sixty-four counties had between 0% and 1% woodland, with an average of 0.42% woodland and $1.7 million of land value in 1880. For an average county with 0.42% woodland, land values increased by an estimated 43% relative to a county with 6% woodland. This gives an overall effect of approximately $47 million (64 × $1.7m × 43%). Summing across the woodland bins (1%–2%, 2%–3%, 3%–4%, 4%–5%, 5%–6%) yields an estimate of $103 million with a standard error of $32 million.
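The aggregation in footnote 57 can be checked for the lowest woodland bin; the land values and estimated effects for the remaining bins are not reported in the text, so only the first term of the sum is reproduced here.

```python
# Footnote 57, first bin: 64 counties with 0%-1% woodland, average 1880 land value of
# $1.7 million per county, and an estimated 43% increase relative to a 6%-woodland county.
n_counties = 64
avg_land_value_1880 = 1.7e6   # dollars
relative_increase = 0.43
print(n_counties * avg_land_value_1880 * relative_increase)  # about $47 million
```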
on the value of barbed wire is the total saved fence construction costs, which are estimated to be $767 million.58 If fencing demand declines linearly, then an implied tighter upper bound on farmer surplus is half this amount. VII. INTERPRETATION Barbed wire appears to have had a substantial impact on U.S. agricultural development, as seen in the relative development of low-woodland areas from 1880 to 1900. Given the simple nature of the innovation, the estimated magnitudes are remarkable and reflect the substantial cost of fencing before 1880. Barbed wire was particularly important in this historical context, due to timber scarcity and the importance of fencing. Farmers had secure legal title to land, but had to pay high fencing costs to receive protection from damage by others’ livestock. Farmers had lobbied for legal reforms to states’ fencing requirements, but effective protection came about only with the introduction of barbed wire. Legal title was not affected; rather, the ability to exclude others’ livestock became part of farmers’ bundle of property rights over land.59 Barbed wire may also have been influential as a general improvement in agricultural technology. Cheaper fencing benefits even an isolated farm by providing greater control over a farmer’s own cattle. This allows the production of cattle and crops in close proximity, and increases cattle productivity through improvements in feeding and breeding. This would be particularly beneficial if nearby lands varied substantially in their suitability for cattle and crops.60 However, empirical estimates suggest that barbed wire did not increase cattle production and decreased the 58. This is found by multiplying the difference in cost between wooden and barbed wire fences (roughly $1 per rod) by the total amount of barbed wire built by 1900 in the Prairie and Southwest (roughly 767 million rods). If land values increased by more, then farmers should have been willing to construct these fences prior to barbed wire’s introduction. 59. Property rights often vary beyond whether ownership is secure: rights may not include the ability to sell, rent, mortgage, pledge, bequeath, or gift land (Besley 1995); land ownership may be contingent on not leaving it fallow (Goldstein and Udry 2008); others may have the right to kill and eat animals on your land, but not keep the fur (Demsetz 1967). In addition, these rights only exist to the extent that they are enforced and not simply allocated. 60. To avoid encroachment and operate more in isolation, there was an incentive to expand farms prior to barbed wire’s introduction. These farms would still have neighbors, however, and farm scale may have been restricted by inferior fencing or monitoring options. Estimating equation (8) for log average farm size, there are not systematic changes by woodland levels, apart from an increase in average farm sizes at low-woodland levels from 1890 to 1900.
TABLE VII
CHANGES IN CATTLE PRODUCTION AND COUNTY SPECIALIZATION

Decades: 1880–1890 and 1890–1900 (after barbed wire’s introduction); 1900–1910 and 1910–1920 (after barbed wire’s universal adoption).

Number of cattle per five county acres
  (1) 0% vs. 6% woodland: 1880–1890: −0.0039 (0.0137) [2.01]; 1890–1900: 0.0567∗∗ (0.0150) [1.29]; 1900–1910: 0.0437∗∗ (0.0140) [3.50]; 1910–1920: −0.0023 (0.0102) [3.32]
  (2) 6% vs. 12% woodland: 1880–1890: 0.0247∗∗ (0.0066); 1890–1900: 0.0355∗∗ (0.0071); 1900–1910: −0.0122 (0.0065); 1910–1920: 0.0348∗∗ (0.0052)
  R2 = .5823; Observations = 1,960; 1880 mean = 0.2034; 1880 std. deviation = 0.1570

Degree of specialization in crops
  (3) 0% vs. 6% woodland: 1880–1890: 0.0117∗∗ (0.0039) [3.50]; 1890–1900: 0.0007 (0.0032) [2.05]; 1900–1910: −0.0023 (0.0025) [1.48]; 1910–1920: −0.0064 (0.0020) [0.28]
  (4) 6% vs. 12% woodland: 1880–1890: −0.0007 (0.0012); 1890–1900: −0.0070∗∗ (0.0020); 1900–1910: 0.0021 (0.0014); 1910–1920: −0.0015 (0.0014)
  R2 = .1681; Observations = 1,960; 1880 mean = 0.0175; 1880 std. deviation = 0.0248
Notes. Estimates are from equation (8) in the text: county-level changes in each outcome are regressed on a fourth-degree polynomial function of county woodland (see notes to Figure III) and state-by-decade fixed effects. Specialization is defined as the squared difference between the fraction of farmland allocated to crops in a county and the average over all counties in that state and decade. The estimates are evaluated at three woodland levels (0%, 6%, 12%) and represent the predicted change over each decade for a county with 0% woodland relative to a county with 6% woodland (columns (1) and (3)) or for a county with 6% woodland relative to a county with 12% woodland (columns (2) and (4)). In parentheses are standard errors corrected for heteroscedasticity and clustered by county. In brackets are t-statistics for the difference between coefficients in columns (1) and (2), (3) and (4). ∗∗ denotes statistical significance at 1% and ∗ at 5%.
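The specialization index defined in the notes above (and in the text) is straightforward to compute; the sketch below assumes a county-decade data frame with placeholder columns 'state', 'decade', and 'crop_share' (the fraction of farmland allocated to crops).

```python
import pandas as pd

# I_ct = (M_ct - M_bar_st)^2: squared deviation of a county's crop share of farmland
# from the average crop share across counties in the same state and decade.
def specialization_index(df):
    state_decade_mean = df.groupby(["state", "decade"])["crop_share"].transform("mean")
    return (df["crop_share"] - state_decade_mean) ** 2
```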
joint production of cattle and crops. These findings suggest that barbed wire had relatively small technological benefits for an isolated farm. To examine changes in cattle production, equation (8) is estimated for the number of cattle per five county acres.61 A cow required roughly five acres to graze in this region, so the estimated magnitudes can be compared to estimated changes in settlement and land improvement. Columns (1) and (2) of Table VII present 61. Data on cattle are first available in 1880, so county regions are held constant at their 1880 boundaries.
the results. From 1880 to 1890, there was no substantial increase in cattle production for a county with 0% woodland relative to a county with 6% woodland. By contrast, cattle production increased moderately in all subsequent periods and at higher woodland levels from 1880 to 1890. To examine changes in the joint production of cattle and crops, an index is defined that captures the degree to which counties are specialized in crop production.62 The index is the squared difference between the fraction of county farmland devoted to crops and the average over all counties in that decade and state: I_{ct} = (M_{ct} − \bar{M}_{st})^2. The index increases when a county with above-average crop intensity increases crop production, and vice versa. Changes in this index are estimated using equation (8), which controls for average county deviations from the mean and state-by-decade shocks. Columns (3) and (4) of Table VII present the results. From 1880 to 1890, counties with 0% woodland became increasingly specialized by half a standard deviation, relative to counties with 6% woodland.63 Barbed wire may affect cattle production and county specialization through multiple channels, but these results suggest that barbed wire’s effects are not simply the direct technological benefits that would be expected for an isolated farm. On the contrary, it appears that barbed wire affected agricultural development largely by reducing the threat of encroachment by others’ cattle. VIII. CONCLUSIONS There is growing evidence from current developing countries that insecure property rights may limit economic development. Complementing that literature, the historical development of American agriculture appears to have been limited when farmers were unable to protect frontier lands from encroachment by others’ cattle. In the United States, this institutional failure was resolved not by legal reform but by technological change: the introduction of barbed wire fencing. Following the introduction of barbed wire, low-woodland areas that had been especially costly to fence experienced substantial relative increases in agricultural development. Increases along intensive margins were particularly rapid and substantial: 62. Pasture land is not directly observed, and per–cow acreage requirements vary with the environment, production methods, and desired sustainability. 63. Estimates are robust to the distance, region, subregion, and soil group specifications from Table IV.
land improvement, crop production, and crop productivity. Land values increased substantially, indicating a large increase in total economic production. These results suggest that land protection plays an important role in facilitating agricultural development. Indeed, that this also occurred on the American frontier, rather than being a unique feature of some modern developing countries, suggests that it may be a more universal characteristic of economic development. HARVARD UNIVERSITY AND NATIONAL BUREAU OF ECONOMIC RESEARCH
REFERENCES Acemoglu, Daron, and Simon Johnson, “Unbundling Institutions,” Journal of Political Economy, 113 (2005), 949–995. Aldrich, Nelson, “Wholesale Prices, Wages, and Transportation,” Report to the Committee on Finance, 1893. Alston, Lee J., Gary D. Libecap, and Bernardo Mueller, “Property Rights and Land Conflict: A Comparison of Settlement of the U.S. Western and Brazilian Amazon Frontiers,” in Latin America and the World Economy Since 1800, John H. Coatsworth and Alan M. Taylor, eds. (Cambridge, MA: Harvard University Press, 1998). Alston, Lee J., Gary D. Libecap, and Robert Schneider, “The Determinants and Impact of Property Rights: Land Titles on the Brazilian Frontier,” Journal of Law, Economics, and Organization, 12 (1996), 25–61. Anderson, Terry L., and Peter J. Hill, “The Evolution of Property Rights: A Study of the American West,” Journal of Law and Economics, 18 (1975), 163–179. Ashton, T.S., The Industrial Revolution, 1760–1830 (New York: Oxford University Press, 1997). Banerjee, Abhijit V., Paul J. Gertler, and Maitreesh Ghatak, “Empowerment and Efficiency: Tenancy Reform in West Bengal,” Journal of Political Economy, 110 (2002), 239–280. Besley, Timothy, “Property Rights and Investment Incentives: Theory and Evidence from Ghana,” Journal of Political Economy, 103 (1995), 903–937. Besley, Timothy, and Maitreesh Ghatak, “Property Rights and Economic Development,” in Handbook of Development Economics V, D. Rodrik and M. Rosenzweig, eds. (Amsterdam: North Holland, 2009). Bogue, Allan G., From Prairie to Corn Belt: Farming on the Illinois and Iowa Prairies in the Nineteenth Century (Chicago: University of Chicago Press, 1963a). ——, “Farming in the Prairie Peninsula, 1830–1890,” Journal of Economic History, 23 (1963b), 3–29. Brasselle, A.-S., F. Gaspart, and J.-P. Platteau, “Land Tenure Security and Investment Incentives: Puzzling Evidence from Burkina Faso,” Journal of Development Economics, 68 (2002), 373–418. Carville, Earle, John Heppen, and Samuel Otterstrom, HUSCO 1790–1999: Historical United States County Boundary Files (Baton Rouge: Louisiana State University, 1999). Cheung, Steven, “The Structure of a Contract and the Theory of a Non-exclusive Resource,” Journal of Law and Economics, 13 (1970), 49–70. Coase, Ronald H, “The Problem of Social Cost,” Journal of Law and Economics, 3 (1960), 1–44. Conley, T.G., “GMM Estimation with Cross Sectional Dependence,” Journal of Econometrics, 92 (1999), 1–45. Davis, Rodney O., “Before Barbed Wire: Herd Law Agitations in Early Kansas and Nebraska,” in Essays in American History in Honor of James C. Malin, Burton J. Williams, ed. (Lawrence, KS: Coronado Press, 1973).
De Meza, David, and J.R. Gould, “The Social Efficiency of Private Decisions to Enforce Property Rights,” Journal of Political Economy, 100 (1992), 561–580. Demsetz, Harold, “Toward a Theory of Property Rights,” American Economic Review: Papers and Proceedings, 57 (1967), 347–359. De Soto, Hernando, The Mystery of Capital (New York: Basic Books, 2000). Ellickson, Robert C., Order without Law: How Neighbors Settle Disputes (Cambridge, MA: Harvard University Press, 1991). Encyclopedia Britannica, 11th ed., “Barbed Wire” (1911). Engerman, S., and K. Sokoloff, “Institutional and Non-institutional Explanations of Economic Differences,” in Handbook of New Institutional Economics, Claude Menard and Mary M. Shirley, eds. (Dordrecht, the Netherlands: Springer, 2003). Field, Erica, “Entitled to Work: Urban Property Rights and Labor Supply in Peru,” Quarterly Journal of Economics, 122 (2007), 1561–1602. Galiani, Sebastian, and Ernesto Schargrodsky, “Property Rights for the Poor: Effects of Land Titling,” Universidad Torcuato di Tella Business School Working Paper No. 06/2005, 2005. Gates, Paul W., “The Role of the Land Speculator in Western Development,” in Landlords and Tenants on the Prairie Frontier (Ithaca, NY: Cornell University Press, 1973). Goldstein, Markus, and Christopher Udry, “The Profits of Power: Land Rights and Agricultural Investment in Ghana,” Journal of Political Economy, 116 (2008), 981–1022. Gutmann, Myron P., Great Plains Population and Environment Data: Agricultural Data (Ann Arbor: University of Michigan and ICPSR, 2005). Haines, Michael R., Historical, Demographic, Economic, and Social Data: The United States, 1790–2000 (Hamilton, NY: Colgate University and ICPSR, 2005). Hayter, Earl W., “Barbed Wire Fencing—A Prairie Invention,” Agricultural History, 13 (1939), 189–207. Hess, Greg, and Marc D. Weidenmier, “How the West Was Won,” Claremont McKenna College Working Paper, forthcoming. Historical Statistics of the United States, 2006. Available at http://hsus.cambridge .org. Jacoby, Hanan G., Guo Li, and Scott Rozelle, “Hazards of Expropriation: Tenure Insecurity and Investment in Rural China,” American Economic Review, 92 (2002), 1420–1447. Johnson, Simon, John McMillan, and Christopher Woodruff, “Property Rights and Finance,” American Economic Review, 92 (2002), 1335–1356. Kansas State Board of Agriculture, Report of the State Board of Agriculture to the Legislature of the State of Kansas (Topeka, KS: The Board, 1876–1888). Kantor, Shawn E., Politics and Property Rights: The Closing of the Open Range in the Postbellum South (Chicago: University of Chicago Press, 1998). Kawashima, Yasuhide, “Fence Laws on the Great Plains, 1865–1900,” in Essays on English Law and the American Experience, Elizabeth A. Cawthorn and David E. Narrett, eds. (College Station: Texas A&M University Press, 1994). Kraenzel, Carl Frederick, The Great Plains in Transition (Norman: University of Oklahoma Press, 1955). Lanjouw, Jean O., and Philip I. Levy, “Untitled: A Study of Formal and Informal Property Rights in Urban Ecuador,” Economic Journal, 112 (2002), 986– 1019. Libecap, Gary D., “The Assignment of Property Rights on the Western Frontier: Lessons for Contemporary Environmental and Resource Policy,” Journal of Economic History, 67 (2007), 257–291. McCallum, Henry D., and Frances T. McCallum, The Wire That Fenced the West (Norman: University of Oklahoma Press, 1972). 
McFadden, Joseph M., “Monopoly in Barbed Wire: The Formation of the American Steel and Wire Company,” Business History Review, 52 (1978), 465–489. Netz, Reviel, Barbed Wire: An Ecology of Modernity (Middleton, CT: Wesleyan University Press, 2004). North, Douglass C., Structure and Change in Economic History (New York: W.W. Norton, 1981).
810
QUARTERLY JOURNAL OF ECONOMICS
Olmstead, Alan L., and Paul W. Rhode, “The Red Queen and the Hard Reds: Productivity Growth in American Wheat, 1880–1940,” Journal of Economic History, 62 (2002), 929–966. Parker, William N., Stephen J. DeCanio, and Joseph M. Trojanowski, Adjustments to Resource Depletion: The Case of American Agriculture—Kansas, 1874–1936 (Ann Arbor, MI: ICPSR, 2000). Parton, William J., Myron P. Gutmann, Stephen A. Williams, Mark Easter, and Dennis Ojima, “Ecological Impact of Historical Land-Use Patterns in the Great Plains: A Methodological Assessment,” Ecological Applications, 15 (2005), 1915–1928. Primack, Martin L., “Farm Fencing in the Nineteenth Century,” Journal of Economic History, 29 (1969), 287–291. ——, Farm Formed Capital in American Agriculture 1850–1910 (New York: Arno Press, 1977). Rasmussen, Wayne D., “The Impact of Technological Change on American Agriculture, 1862–1962,” Journal of Economic History, 22 (1962), 578–591. Razac, Olivier, Barbed Wire: A Political History (New York: W.W. Norton, 2002). Sanchez, Nicolas, and Jeffrey B. Nugent, “Fence Laws vs. Herd Laws: A Nineteenth-Century Kansas Paradox,” Land Economics, 76 (2000), 518–533. Smith, S. Ray, Brinkley Benson, and Wade Thomason, “Growing Small Grains for Forage in Virginia,” Virginia Cooperative Extension, Publication Number 424-006, 2004. U.S. Department of Agriculture, Atlas of Agriculture, Part I, Section E (Washington, DC: GPO, 1924). U.S. House of Representatives, 42nd Congress, 1st Session, “Special Report on Immigration: Accompanying Information for Immigrants . . .,” Edward Young, Chief, Bureau of Statistics (Washington, DC: GPO, 1871). U.S. House of Representatives, 42nd Congress, 2nd Session, “Statistics of Fences in the United States,” House Executive Documents, 17 (1872), 497–512 (Washington, DC: GPO, Report of the Commissioner of Agriculture, U.S. Congressional Serial Set Vol. 1522). Washburn and Moen Manufacturing Company, Fence Laws: The Statute Prescriptions as to the Legal Fence in the United States and Territories, the Dominion of Canada and Provinces, and Australia, with Illustrative Historical Notes and Judicial Decisions, and a View of Fences and Fence Laws in Great Britain (Worcester, MA: Snow, Woodman & Co., 1880). Webb, Walter Prescott, The Great Plains (New York: Grosset & Dunlap, 1931, 1978 printing).
TRUST AND THE REFERENCE POINTS FOR TRUSTWORTHINESS IN GULF AND WESTERN COUNTRIES∗ IRIS BOHNET BENEDIKT HERRMANN RICHARD ZECKHAUSER Why is private investment so low in Gulf compared to Western countries? We investigate cross-regional differences in trust and reference points for trustworthiness as possible factors. Experiments controlling for cross-regional differences in institutions and beliefs about trustworthiness reveal that Gulf citizens pay much more than Westerners to avoid trusting, and hardly respond when returns to trusting change. These differences can be explained by subjects’ gain/loss utility relative to their region’s reference point for trustworthiness. The relation-based production of trust in the Gulf induces higher levels of trustworthiness, albeit within groups, than the rule-based interactions prevalent in the West.
I. INTRODUCTION Private domestic investment is low in Arab countries, particularly relative to public investment. In the Persian Gulf countries examined here, the private/public ratio was less than 2:1 in the 1990s; for OECD countries it was over 6:1 (Sala-i-Martin and Artadi 2002). Investment requires placing one's funds in the hands of another person. Not surprisingly, investment rates are closely associated with people's willingness to trust others (Knack and Keefer 1997). Trust levels are generally low in Islamic societies. Across the fourteen Islamic countries surveyed in the World Values Survey, only 28% of the respondents indicated that "most people can be trusted," compared to 46% in Protestant European countries (Inglehart 2007).1 ∗ We thank Kuwait University, Sultan Qaboos University, UAE University and the University of Zurich for permission to conduct our research, and Samar Attar, Miriam Avis, Paul Bohnet, Robin Hogarth, Sarah Hardy, Magma Ismail, Timur Kuran, Alan Levy, Stephan Meier, Hilary Rantisi, Dani Rodrik, Frank Vogel, three anonymous referees, the participants in seminars at Harvard, Pompeu Fabra University (Barcelona), the University of Zurich, the Santa Fe Institute, the conference for Laboratory Experiments and the Field (University College London), the CESifo conference on Economics and Psychology (Venice), and the Economic Science Association Meetings 2005 (Montreal), and particularly Edward Glaeser (the editor) for their helpful comments. Financial support from the Kuwait Fund at Harvard Kennedy School and the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-08-1-0144 is gratefully acknowledged. 1. Interpersonal trust and more generally what Inglehart (2007) refers to as "self-expression values" have also been associated with support for democracy. © 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of
Technology. The Quarterly Journal of Economics, May 2010
We investigate whether reference points for trustworthiness help explain these cross-regional differences in people's willingness to trust strangers. We build on Kahneman and Tversky's (1979) Prospect Theory and the formalizations by Köszegi and Rabin (2006, 2007) and posit that betrayal imposes an additional utility cost beyond monetary loss. That cost increases the more the likelihood of betrayal deviates from one's reference points of accustomed experience. Gulf residents are accustomed to higher levels of trustworthiness than Westerners. As is typical for tribal societies, most trust interactions take place within groups. Trust is fostered by decreasing the likelihood of betrayal through repeated interactions, reputation, and reciprocity. "Aman [i.e., trust] . . . tends to convey a sense of personal attachment between those who trust one another rather than confidence in institutions, office-holders, or even one's own knowledge or abilities . . . . For Arabs, who believe that it is contexts of relationship, not invariant capabilities, that most fully define a person, actively entangling them in webs of indebtedness constitutes the greatest predictability and security that one can have for their actions towards oneself" (Rosen 2000, pp. 135–136). Social networks have evolved to allow such informal enforcement: "Groups in the Middle East are necessarily more limited in size in order to maximize trust and cooperative endeavor . . . . Asabiyya ['social solidarity'] was most easily developed in small, informal, and highly personalistic groups" (Bill and Springborg 2000, pp. 66–67). Untrustworthy behavior often leads to expulsion, and xenophobia is common (e.g., Arab Human Development Report [2004]; Inglehart, Moaddel, and Tessler [2006]).2 In the West, in contrast, formal institutions, notably contract law, promote trust by decreasing the cost of betrayal by awarding damages given breach. Oliver Wendell Holmes, Jr., wrote about U.S. law in 1897: "The duty to keep a contract at common law means a prediction that you must pay damages if you do not keep it and nothing else" (Rosen 2000, p. 139).3 2. The Economist (April 9, 2005, p. 37) describes a recent case in Qatar where "its rulers have just stripped some 5,000 Qataris of their citizenship, apparently because they belong to a clan deemed disloyal." Cultural theorists characterize the Gulf countries as "collectivist" and Protestant Western countries as "individualist" (e.g., Triandis [1995]; Hofstede [2001]). They predict a stronger distinction between "in-group" and "out-group" members in the former than the latter. See also Greif (1994) and Kuran (2004). 3. In Islamic law, contracts that impose expectations on future performance are not permitted because they are inherently speculative (Rosen 2000, p. 142). Specifically, there is no recovery for lost profits or other damages that are based on a counterfactual premise or speculation about events that did not occur (Vogel 1997).
Given the differences in their trustworthiness reference points, people in the Gulf will demand higher levels of trustworthiness before trusting than Westerners. To examine this experimentally, we elicit people's minimum acceptable probability of trustworthiness, the threshold value that would make them just willing to trust a randomly selected, anonymous counterpart, given particular payoffs, in three Gulf countries, Kuwait, Oman, and the United Arab Emirates, and two Western countries, Switzerland and the United States. The higher a person's threshold, the higher the price—in loss of expected value of payoff—he or she is willing to pay to avoid trusting. Our method eliminates institutional factors and beliefs about a counterpart's trustworthiness as explanations. To control for differences in willingness to take risk, we elicit minimum acceptable probabilities for a straight gamble offering the same payoffs for the two parties as the trust game. The difference in values for the required probabilities when confronted with nature rather than a person isolates a person's intolerance of betrayal. The utility structure we posit creates a second effect. It affects how responsive people are to changes in the likelihood or the cost of betrayal. We label the way people respond to changes in the expected returns from trusting as their elasticity of trust, and calibrate it using our experiments. Elasticity of trust must be high if institutional innovations, such as laws that protect investors or enforce contracts, are to raise trust and investment. Our paper is organized as follows. Section II presents our theory and conceptual framework. Section III explains the experimental design, and Section IV presents the results. Section V concludes. II. THEORY AND METHODS Trust is primarily produced by preventing betrayal through webs of relationships in the Gulf, but by mitigating the cost of betrayal through contract law in the West. This difference leads to differences in accustomed levels of trustworthiness in the two regions. We posit that such trustworthiness experiences provide a reference point, r, for expectations of trustworthiness levels, $r_g$ and $r_w$, respectively, in the Gulf and Western nations, where
$r_g > r_w$. We build on a model of reference-dependent preferences by Köszegi and Rabin (2006), who argue that individuals experience a loss aversion component when an outcome deviates from a reference point that is "endogenously determined by the economic environment, derived from experiences in the past" (p. 1133). Consider two groups of individuals who must choose between lotteries S and T, Sure and Trust. A lottery pays x with probability p and $y < x$ with probability $1 - p$. The total utility function is
(1)  $u(p, r, x, y) = p\,v(x) + (1 - p)\,v(y) + z(p, r, x, y)$.
Here v is a traditional von Neumann–Morgenstern (VN-M) utility function, where $v' > 0$. The innovation beyond traditional utility theory is that $z(p, r, x, y)$ is a reference-dependent utility coming from the probability of trustworthiness itself. For T, $0 < p < 1$, $x = m$, and $y = n < m$. For S, $p = 1$, $x = s$, and the value of y is irrelevant. For the choice to be meaningful, $m > s > n$. Consistent with Köszegi and Rabin (and intuition), we assume that z is increasing with p and strictly decreasing with r. This implies that u is increasing with p. For $p = 0$, T is inferior to S; at $p = 1$, it is superior. Because the utility of S does not vary with p, there must be a minimum cutoff value $p_j$ that makes an individual j willing to trust. In particular, $p_j$ sets $u(S) = u(T)$, implying that
(2)  $p_j\,v(m) + (1 - p_j)\,v(n) + z(p_j, r_j, m, n) = v(s) + z(1, r_j, m, n)$.
(Hereafter, we suppress the arguments m and n in z, because they play no role.) Consider individuals A and B with identical utility function (1). The trustworthiness reference point for A, $r_a$, is greater than the trustworthiness reference point for B, $r_b$. The critical assumption is that there is diminished importance for the reference when $p = 1$ (i.e., in lottery S). Specifically, for all reference points $r_a > r_b$ and all probabilities of trustworthiness $p < 1$,
(3)  $z(1, r_b) - z(1, r_a) < z(p, r_b) - z(p, r_a)$.
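To see how condition (3) translates into different trust cutoffs, the following minimal sketch works a numerical example. The functional form z(p, r) = −k(1 − p)r, the parameter k, and the reference values are our own illustrative assumptions (they satisfy the properties assumed for z, including condition (3), but are not the authors' specification); the payoffs 15, 8, and 10 are those of the baseline trust game introduced below.

```python
# Illustrative only: an assumed reference-dependent term z(p, r) = -k * (1 - p) * r.
# It is increasing in p, strictly decreasing in r, and satisfies condition (3)
# because the reference term vanishes at p = 1 (the Sure lottery).

def min_acceptable_probability(r, k=5.0, m=15.0, n=8.0, s=10.0):
    """Smallest p at which Trust is weakly preferred to Sure, assuming
    linear v(x) = x and the hypothetical z above."""
    # Solve p*m + (1 - p)*n - k*(1 - p)*r = s for p.
    return (s - n + k * r) / (m - n + k * r)

for label, r in [("lower reference point (r = 0.3)", 0.3),
                 ("higher reference point (r = 0.8)", 0.8)]:
    print(label, "-> cutoff p_j =", round(min_acceptable_probability(r), 3))

# The higher reference point yields the higher cutoff, as in the Proposition below;
# without the z term the risk-neutral breakeven would be (10 - 8) / (15 - 8) = 2/7.
```

Any z with the assumed monotonicity and a muted reference effect at p = 1 delivers the same ordering; only the size of the gap between the two cutoffs depends on the particular form.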
PROPOSITION. Given (3) and that z is increasing with p, an individual with a higher value of reference trustworthiness, r, will require a greater level of trustworthiness, p, in order to trust. The proof is given in the Appendix. In the context of our paper, the Proposition yields
FIGURE I
The Trust Game [Payoffs to Principal; Payoffs to Agent]
The Principal chooses Sure, yielding [10; 10], or Trust; if trusted, the Agent chooses Trustworthy, yielding [15; 15], or Betray, yielding [8; 22].
IMPLICATION 1. Gulf citizens will require a higher level of trustworthiness than Westerners before they trust. To assess minimally required trustworthiness levels, p, we have subjects play a modified trust game (Camerer and Weigelt 1988) in the two regions. The game is shown in Figure I. We ask individuals (principals) what minimum acceptable probability of trustworthiness, $p_j$, would lead them to trust as opposed to receiving a sure payoff. The higher a principal's minimum acceptable probability value, the more he or she is willing to sacrifice in expected value to avoid trusting. The amount he or she is willing to pay is the difference in expected values from his or her Trust and Sure strategies if $p_j$ is just satisfied. Representing the expected trust payoff in brackets, that amount is $[p_j \cdot 15 + (1 - p_j) \cdot 8] - 10$. Note that a principal's minimum acceptable probability value should be independent of his or her assessments of the probability of trustworthiness in the game. If the actual trustworthiness level is below his or her threshold value, he or she will end up not trusting; and if trustworthiness is above it, he or she will get the actual level and reap a surplus. Thus truthful revelation is optimal.4 Principals do not learn the true 4. Note that a principal cannot affect the probability he or she receives in the lottery, because it in no way relates to the answer that he or she provides. Given our procedure, truth telling by a principal is as good as anything else. It is strictly dominant if, as seems reasonable, people believe that actual levels of trustworthiness may lie in the immediate neighborhood of their minimum acceptable probability, and if they obey the Substitution Axiom of VN-M utility. Our procedure is closely related to the (strictly dominant) Becker–DeGroot–Marschak elicitation procedure. The major difference is that we do not draw our probability of payoff randomly from a uniform distribution but rather observe it empirically.
proportion of trustworthy agents in their game until after they have made their decisions. Moreover, the game offers no institutional protections. Thus, observed differences in behavior cannot be accounted for by cross-regional differences therein. Willingness to trust is likely related to willingness to take risk. If people were accustomed to different levels of success when taking risk in the two regions, a similar logic would apply to willingness to take risk as to willingness to trust. Given that earning returns based on chance is strongly discouraged in Islamic law, that gambling is strictly forbidden, and that Islamic banks tend to invest more conservatively than Western banks (Al-Suwailem 2000), there may well be cross-regional differences in success reference points leading to differences in willingness to take risk. In addition, there could also be differences in standard risk preferences. To make sure we are not merely picking up cross-regional differences in willingness to take risk in our experiments, we ran a control treatment in each country, the risky dictator game (Bohnet and Zeckhauser 2004; Bohnet et al. 2008). In it, Nature rather than the Agent determines the outcome. Figure I represents the risky dictator game if "Trust" is replaced with "Gamble," "Agent" is replaced with "Nature," and "Trustworthy" and "Betray" are respectively replaced with "Success" and "Failure." A comparison of behavior in the trust game and in the risky dictator game will tell us whether cross-regional differences in willingness to trust are due to differences in betrayal intolerance, differences in risk intolerance, or a combination of the two. According to Implication 1, people in the West will trust for lower levels of trustworthiness than people in the Gulf. This also suggests that trust is more responsive to changes in the likelihood of trustworthiness in the West than in the Gulf. To examine this, we compute the elasticity of trust. That elasticity tells how the percentage of those not trusting diminishes in response to a percentage reduction in those not trustworthy. Let α be the fraction of trusting principals, and β the fraction of trustworthy agents. Our elasticity concept looks at the curve α = f(β). The elasticity measure at each point is thus [dα/(1 − α)]/[dβ/(1 − β)]. Because our data are limited, we compute this elasticity looking
only at decile intervals. Thus, we measure the elasticity at each 10% increase of trustworthiness with start points at 0% to 90%.5 To get an overall elasticity measure, we average these ten numbers. IMPLICATION 2. Trust levels will be less elastic to levels of trustworthiness in the Gulf than in the West. Given greater concern with levels of trustworthiness and lesser concern with monetary returns from trusting in the Gulf, Gulf citizens will respond less to changes in the cost of betrayal. We test this by comparing willingness to trust in a high-cost and a low-cost trust game. IMPLICATION 3. Trust levels will respond less to changes in the cost of betrayal in the Gulf than in the West.
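The decile-based elasticity just described can be computed mechanically from elicited thresholds. The sketch below is one way to do so; the helper function, the sample thresholds, and the handling of deciles in which everyone already trusts are our own assumptions rather than the authors' code.

```python
# Sketch: average decile-interval elasticity of trust,
# [d(alpha)/(1 - alpha)] / [d(beta)/(1 - beta)],
# where alpha(beta) is the share of principals whose minimum acceptable
# probability (MAP) does not exceed the trustworthiness level beta.

def elasticity_of_trust(maps):
    n = len(maps)
    alpha = lambda beta: sum(1 for m in maps if m <= beta + 1e-9) / n
    values = []
    for i in range(10):                   # start points 0.0, 0.1, ..., 0.9
        beta = i / 10
        a0, a1 = alpha(beta), alpha(beta + 0.1)
        if a0 >= 1.0:                     # everyone already trusts; count the interval as 1 (assumption)
            values.append(1.0)
            continue
        d_alpha = (a1 - a0) / (1.0 - a0)  # share of remaining non-trusters converted
        d_beta = 0.1 / (1.0 - beta)       # share of remaining non-trustworthy removed
        values.append(d_alpha / d_beta)
    return sum(values) / 10

# Hypothetical MAPs: the lower-threshold group responds more elastically.
print(elasticity_of_trust([0.3, 0.4, 0.5, 0.5, 0.6, 0.6, 0.7]))
print(elasticity_of_trust([0.6, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9]))
```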
III. DESIGN AND PROCEDURES As is traditional with experiments, we relied on student subjects, with 736 total subjects in the five countries.6 The students were from Kuwait University in Kuwait, Sultan Qaboos University in Oman, the University of Zurich in Switzerland, UAE University in the United Arab Emirates, and various universities in the greater Boston area in the United States. Participants’ average age and self-reported wealth levels on a scale from 1 (poor) to 6 (wealthy) were, respectively, 21 and 4.1 in Kuwait, 21 and 3.7 in Oman, 23 and 4.0 in Switzerland, and 24 and 3.5 in the United States.7 We ran a total of 28 experimental sessions; 22 to 36 subjects participated in each. In Oman, Kuwait, Switzerland, and the United States, we ran mixed-sex sessions. In the UAE, this was not possible because higher education is sex-segregated; experiments there were 5. We exclude 100%, as everyone is willing to trust if trustworthiness is guaranteed. Thus, the elasticity in the final decile interval is always 1. 6. As we are interested in comparisons between Gulf and Western countries, ideally we would have liked to run our experiments with representative samples of the general population in each country. However, this was not feasible in the three Gulf countries. To the best of our knowledge, not even Western surveys have been allowed to be conducted in any of the three countries to date (or any other Gulf country, for that matter). Our experiments represent five case studies. We do not claim that they are conclusive about behavior in either the Gulf or the West. 7. We collected this information in a short post-experimental questionnaire. We were not allowed to collect demographic information in the UAE. However, as the sessions there were segregated by sex, we can control for a person’s sex in all our analyses.
TABLE I
NUMBERS OF PARTICIPANTS IN THE DIFFERENT SUBJECT POOLS

                              Mixed    All men    All women
Baseline trust game
  Kuwait                       24        26          28
  Oman                         58        —           —
  United Arab Emirates         —         28          28
  Switzerland                  50        —           —
  United States                62        —           —
Risky dictator game
  Kuwait                       32        28          20
  Oman                         44        —           —
  United Arab Emirates         —         30          30
  Switzerland                  48        —           —
  United States                58        —           —
High-cost trust game
  Oman                         70        —           —
  United States                72        —           —
conducted separately for female and male subjects.8 Subjects were identified by code numbers and kept anonymous to other players. They were randomly assigned to the role of principal or agent and randomly matched (single-blind). Table I provides an overview of the participants in our three sets of experiments: the baseline trust game, the risky dictator game, and a high-cost trust game. The high-cost trust game lowered the principal's payoff given betrayal to 6 and raised the agent's associated payoff to 24 points to determine how changes in the material cost of betrayal affected trust decisions. The probability that equates expected values in the first two games is .286; in the third (high-cost) game it is .444. The payoffs were presented to subjects in a matrix form with neutral terminology, and no discussion of breakeven probabilities. Payoffs were given in points. Each point was converted, respectively, to 0.25 Kuwaiti dinar, 0.2 Omani rial, 1 Swiss franc, 1 UAE dirham, or 1 U.S. dollar at the end of the experiment. Subjects earned a 10-point show-up fee and received on average an additional 13 points for an experiment that took approximately 8. To get a sense for how this might affect behavior, we added an all-male and an all-female session to our mixed-sex session in Kuwait, a nation with substantial components of both single-sex and mixed-sex higher education. We believe that there are no analogous single-sex comparison groups in the West, as people self-select into single-sex colleges in the West but not in the UAE.
thirty to sixty minutes. To ensure the equivalence of experimental procedures across countries, we followed Roth et al. (1991) on designs for multinational experiments.9 The experiments were run as follows: In the trust games, we asked principals what minimum percentage of trustworthy behavior they would require to trust. The neutral language description was: "How large would the probability of being paired with a Person Y who chose option 1 minimally have to be for you to pick B over A?" (The agent's "option 1" is what we label "trustworthy." The principal's choice "B" is our "trust.") We used the strategy method for agents: Before they knew their principal's decision, we asked them whether or not they would reward trust were it offered. Specifically, we asked: "Which option, 1 or 2, do you choose in case B?" If a principal's minimum acceptable probability exceeded the percentage of trustworthy agents in a given session, p∗, both principal and agent earned the sure payoff. If a principal's minimum acceptable probability was equal to or lower than p∗, the two payoffs were determined by the agent's choice. Principals were informed of the whole procedure, including that agents' decisions would be used to calculate p∗. Agents were not informed that principals were asked to state their minimum acceptable probability of trustworthiness, nor that we would calculate a p∗, because we did not want our elicitation procedure to affect agents' decisions. In the risky dictator game, the principal becomes the "dictator"; the agent is a "recipient," with no active role to play, as in the standard dictator game (Kahneman, Knetsch, and Thaler 1986). We asked principals to indicate the minimum acceptable probability of earning 15 such that they would take the gamble rather than the sure outcome: "How large would the probability of receiving option 1 minimally have to be for you to pick B over A?" They were informed that p∗ had been predetermined and was inside the envelope visibly posted to the blackboard. The average 9. We controlled for currency, language, and experimenter effects to the best of our ability. To produce parity in rewards across the five nations, we used the most direct measure of opportunity cost of time we could find as a guideline, the hourly wage of an undergraduate research assistant. We had the instructions translated (and back-translated) from English to Arabic. The experiments were conducted by the first two listed authors. They first ran experiments in the United States before conducting sessions in other countries. We did not find any evidence for experimenter effects in the United States. The first author ran the experiments in Switzerland and the UAE, and the second author ran the experiments in Kuwait and Oman. The instructions are available from the authors upon request. For a discussion of methods in cross-cultural experiments, see the supplementary materials of Herrmann, Thöni, and Gächter (2008), who conducted experiments in the same countries.
likelihood of trustworthiness from the baseline trust games in a given country served as p∗ for the risky dictator games, which were conducted with different subjects after the trust games. If a principal's minimum acceptable probability was higher than the predetermined probability, p∗, he or she was taken to reject the gamble. He or she was then paid the sure payoff. If the minimum acceptable probability was less than or equal to p∗, we conducted the lottery by drawing a ball from an urn containing good and bad balls in proportions p∗ and (1 − p∗). This determined whether principals received the 15 or the 8 payments; the complementary 15 or 22 payments went to their recipients. Before subjects made their decisions, they had to complete a quiz testing their understanding. Only after all subjects understood the problem and could calculate their earnings for different values of hypothetical $p_j$ and p∗ did we proceed with the experimental decisions. After subjects had made their decisions, and had given us the demographic information we were allowed to collect, we informed everyone of the details of the experimental procedure and the results. Subjects presented their code numbers to collect sealed envelopes with their earnings.
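A compact way to summarize the payoff rule just described is the sketch below; the function and its arguments are hypothetical names of ours, and the example inputs are made up.

```python
# Sketch of the trust-game payoff rule: a principal's stated minimum acceptable
# probability (MAP) is compared with p*, the realized share of trustworthy agents
# in the session. Payoffs are those of the baseline game in Figure I.

def principal_agent_payoffs(map_principal, p_star, agent_rewards_trust,
                            sure=(10, 10), trustworthy=(15, 15), betray=(8, 22)):
    if map_principal > p_star:
        return sure                           # threshold not met: both get the sure payoff
    return trustworthy if agent_rewards_trust else betray

print(principal_agent_payoffs(0.60, 0.40, agent_rewards_trust=True))    # (10, 10)
print(principal_agent_payoffs(0.30, 0.40, agent_rewards_trust=False))   # (8, 22)

# For these payoffs the risk-neutral breakeven probability is (10 - 8)/(15 - 8), about .286;
# in the high-cost game, where betrayal pays the principal 6, it is (10 - 6)/(15 - 6), about .444.
```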
IV. RESULTS We first examine Implication 1 from our Proposition, namely that people in the Gulf will require higher levels of trustworthiness before trusting than Westerners. We then compare required trustworthiness levels with required success levels in the risky dictator game to make sure behavior in the trust game is not just due to people’s willingness to take risk. Finally, we report results for Implications 2 and 3, namely, how responsive people are to changes in the likelihood or the cost of betrayal. Table II summarizes principals’ willingness to trust. On average, people in the Gulf countries are willing to trust if at least 70% of the people are trustworthy, whereas people in the Western countries are willing to trust if at least 52% of the people are trustworthy. This cross-regional difference is significant10 for each cross-regional country comparison. Emiratis, Kuwaitis, and Omanis require significantly higher levels of trustworthiness 10. We run one-tailed Mann–Whitney U tests for differences in means. All p-values reported are based on this test, unless noted otherwise. A difference is reported as significant if p < .05.
TABLE II
MINIMUM ACCEPTABLE PROBABILITIES IN BASELINE TRUST GAME (MEAN, MEDIAN, [N])

                            All                 Men                 Women
Kuwait^a                    0.61 0.70 [39]      0.74 0.80 [15]      0.53 0.50 [24]
Oman                        0.72 0.80 [29]      0.72 0.70 [12]      0.73 0.80 [16]
United Arab Emirates        0.81 0.80 [28]      0.77 0.80 [14]      0.86 0.95 [14]
Switzerland                 0.51 0.55 [25]      0.46 0.48 [18]      0.62 0.60 [7]
United States               0.54 0.50 [31]      0.50 0.50 [19]      0.61 0.72 [12]
a There are no significant differences in same-sex and mixed-sex sessions for either men or women. Men’s behavior varies not at all; women are slightly though not significantly more willing to trust in same-sex than in mixed-sex sessions.
before trusting than do Swiss and Americans. This affirms Implication 1. Table III reports a simple regression with the minimum acceptable probability as the dependent variable. In columns (1) and (2) we group the countries by region (Gulf = 1) and control for sex (woman = 1), the sex composition of our sessions (mixed = 1), and the possible interaction variables. Principals in the Gulf countries require higher minimum acceptable probabilities than Western principals. Columns (3) to (5) treat each country separately, with the United States as the excluded group. Minimum acceptable probabilities do not differ between Switzerland and the United States, and the cross-regional difference in minimum acceptable probabilities is due to both sexes in the Emirates and Oman, but only to men in Kuwait. Kuwaiti women request probability thresholds on a par with their Western counterparts.11 To determine whether people’s willingness to trust reflects primarily their willingness to take risk, we compare the probability thresholds for trusting (Table II) with the thresholds for 11. Kuwait ranks highest on the gender-related development index in the Arab world (Table 3, AHDR 2004). On May 16, 2005, the Kuwaiti parliament voted to give women the right to vote and to run for political office.
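The regional comparisons above rely on one-tailed Mann–Whitney U tests (see footnote 10). A minimal sketch of such a comparison, run on hypothetical thresholds rather than the experimental data, is:

```python
# Sketch of a one-tailed Mann-Whitney U test comparing minimum acceptable
# probabilities across regions; the two samples below are made-up placeholders.
from scipy.stats import mannwhitneyu

gulf_maps = [0.8, 0.7, 0.9, 0.8, 0.6, 0.7, 0.8]
western_maps = [0.5, 0.6, 0.4, 0.5, 0.5, 0.6, 0.5]

# H1: Gulf thresholds are higher than Western thresholds.
stat, p_value = mannwhitneyu(gulf_maps, western_maps, alternative="greater")
print(f"U = {stat}, one-tailed p = {p_value:.3f}")
```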
TABLE III
DETERMINANTS OF MINIMUM ACCEPTABLE PROBABILITIES IN THE BASELINE TRUST GAME

                             (1)         (2)         (3)         (4)         (5)
Gulf countries               0.175∗∗     0.249∗∗
                             (0.041)     (0.075)
Kuwait                                               0.065       0.063       0.243∗∗
                                                     (0.056)     (0.057)     (0.079)
Oman                                                 0.179∗∗     0.176∗∗     0.217∗
                                                     (0.060)     (0.062)     (0.084)
Switzerland                                          0.036       0.035       0.034
                                                     (0.063)     (0.063)     (0.075)
United Arab Emirates                                 0.269∗∗     0.268∗∗     0.270∗∗
                                                     (0.061)     (0.062)     (0.080)
Women                                    0.095                   0.009       0.116
                                         (0.123)                 (0.040)     (0.084)
Mixed session                            −0.018
                                         (0.079)
Gulf countries × women                   −0.183†
                                         (0.105)
Women × mixed session                    0.039
                                         (0.103)
Kuwait × women                                                               −0.332∗∗
                                                                             (0.113)
Oman × women                                                                 −0.105
                                                                             (0.121)
Switzerland × women                                                          0.037
                                                                             (0.132)
UAE × women                                                                  −0.027
                                                                             (0.120)
Constant                     0.527∗∗     0.500∗∗     0.543∗∗     0.539∗∗     0.498∗∗
                             (0.032)     (0.089)     (0.042)     (0.045)     (0.052)
Observations                 152         151         152         151         151
R²                           .11         .15         .18         .18         .25
Note. Standard errors in parentheses. † Significant at 10%. ∗ Significant at 5%. ∗∗ Significant at 1%.
risk taking. Table IV presents principals’ minimum acceptable probabilities in the risky dictator game in the five countries. We find that the mean minimum acceptable probabilities in the trust game significantly exceed those in the risky dictator game in all countries, namely by 0.16 in Kuwait, 0.26 in Oman, 0.33 in the UAE, 0.11 in Switzerland, and 0.22 in the United States (with p < .05 everywhere except p < .1 in Switzerland). This implies that all subjects thought it worse to lose the high payment due to
TABLE IV
MINIMUM ACCEPTABLE PROBABILITIES IN RISKY DICTATOR GAME (MEAN, MEDIAN, [N])

                            All                 Men                 Women
Kuwait                      0.44 0.42 [40]      0.46 0.43 [25]      0.40 0.27 [15]
Oman                        0.47 0.45 [22]      0.49 0.48 [8]       0.43 0.40 [13]
United Arab Emirates        0.48 0.48 [30]      0.51 0.50 [15]      0.46 0.45 [15]
Switzerland                 0.40 0.42 [24]      0.33 0.30 [13]      0.48 0.50 [11]
United States               0.32 0.29 [29]      0.28 0.29 [16]      0.38 0.35 [13]
betrayal than due to an unlucky draw on a chance device, that is, were intolerant of betrayal. On average, people in the Gulf are willing to take risk if the likelihood of getting the good outcome is at least 46%, whereas their counterparts in the West are willing to do so for a minimum acceptable probability of at least 36%. This cross-regional difference is mainly driven by men. Repeating the analysis conducted for the trust game (comparisons of means using Mann–Whitney tests and regressions), we find that Gulf men are significantly less willing to take risk than Western men, whereas there are no cross-regional differences in risk taking for women. Overall, the cross-regional difference in willingness to trust is mainly due to differences in intolerance to betrayal. This supports our notion that trust behavior responds to differences in trustworthiness reference points across the regions. In addition, for men, there are also cross-regional differences in willingness to take risk. To see how responsive willingness to trust is to changes in the likelihood of betrayal, Figure II shows the percentage of principals willing to trust for given likelihoods of trustworthiness in the two regions. Emiratis’, Kuwaitis’, and Omanis’ willingness to trust is less elastic to changes in the likelihood of trustworthiness than that
FIGURE II
Cumulative Distribution of Willingness to Trust in the West and in the Gulf
(Percent trusting, plotted against the likelihood of trustworthiness, for Western and Gulf countries.)

TABLE V
ELASTICITY OF TRUST TO THE LIKELIHOOD OF TRUSTWORTHINESS

                            Elasticity of trust
Kuwait                      0.81
Oman                        0.57
United Arab Emirates        0.21
Switzerland                 1.17
United States               1.03
of Westerners. Our elasticity measure indicates how, on average, the percentage of those not trusting responds to a percentage reduction in those not trustworthy for each 10% change in trustworthiness. Table V presents the results, with the UAE at the bottom and Switzerland at the top. This supports Implication 2. To examine how responsive willingness to trust is to changes in the cost of betrayal, we compare the required trustworthiness thresholds in the basic trust game and the high-cost trust game in Oman and the United States (the only two countries studied). We find a pattern similar to the above. Americans respond to changes in the material cost of betrayal; Omanis do not. Table VI presents principals’ minimum acceptable probability values for
TABLE VI
MINIMUM ACCEPTABLE PROBABILITIES IN BASELINE AND HIGH-COST TRUST GAMES (MEAN, MEDIAN, [N])

                   All baseline        All high-cost       Men high-cost       Women high-cost
Oman               0.72 0.80 [29]      0.71 0.75 [35]      0.72 0.75 [23]      0.68 0.78 [12]
United States      0.54 0.50 [31]      0.69 0.75 [36]      0.60 0.70 [16]      0.77 0.80 [18]
the two games in the two countries. Americans request significantly higher minimum acceptable probabilities in the high-cost than in the baseline trust game. Omanis' minimum acceptable probabilities, in contrast, differ hardly at all across the two conditions.12 This supports Implication 3. Our major interest is why and when people trust strangers; hence our focus on the behavior of principals. But agents' responses are interesting as well, although they do not entail any information on cross-regional differences in accustomed levels of trustworthiness. In our experiments, betrayal entails neither reputational nor legal costs—prime concerns respectively in the Gulf and the West—and thus primarily reflects a person's intrinsic motivation to be trustworthy. There are no significant cross-regional differences in our agents' willingness to reward trust. In our baseline trust game, 43% of the agents chose to reward trust in Kuwait (N = 39), 31% in Oman (N = 29), 32% in the United Arab Emirates (N = 28), 28% in Switzerland (N = 25), and 29% in the United States (N = 31).13 V. CONCLUSIONS Private investment levels and trust levels are lower in Gulf than in Western countries. This paper shows that differences in accustomed levels of trustworthiness might contribute to this 12. The trustworthiness rates in the high-cost trust games—not relevant for the probability thresholds—are alike in the two countries: 36% are trustworthy in the United States (N = 36) and 37% in Oman (N = 35). 13. None of the differences between these percentages is significant (e.g., χ²-test p = .21 when comparing Kuwait and Switzerland, the two extremes). Calculating weighted averages for each region gives us a trustworthiness rate of 37% in the three Gulf and 29% in the two Western countries (χ²-test p = .32).
pattern, leading people in the Gulf to require higher levels of trustworthiness before trusting than Westerners. In the Gulf, trust has traditionally been primarily produced by relying on personal relationships, whereas in the West formal rules, such as contract law, play an important role. Relation-based trust decreases the likelihood of betrayal; rule-based trust decreases the cost of betrayal. Thus, the reference points for trustworthiness are higher in the Gulf than in the West. Following Köszegi and Rabin (2006), we posited a gain–loss utility for accepting a level of trustworthiness below one's reference level. This utility adds to the pure VN-M utility from the game's monetary payoffs. Given their higher trustworthiness reference points, people in the Gulf should be willing to pay a higher price to avoid trusting than people in the West. Our experiments confirmed this prediction. Emiratis, Kuwaitis, and Omanis demanded a substantially higher minimum trustworthiness threshold before trusting than did the Americans and Swiss, well exceeding the probability thresholds required to take risk in an analogous game where Nature determined the outcome. Cross-regional differences in willingness to trust mainly came from differences in people's intolerance of betrayal, though for men differences in willingness to take risk also contributed. Differences in trust preferences, brought about by differences in the reference points for trustworthiness for the two regions, help us understand disparities in private investment rates. Beyond trust levels, a better understanding of preferences that depend on reference points may prove particularly useful in comparing countries or cultures. APPENDIX Define $p_a$ and $p_b$ to be the minimum levels of trustworthiness such that A and B respectively select Trust. We show that at trustworthiness level $p_b$, A does not select Trust, implying that $p_a > p_b$. Set $p = p_b$ for individual A. Then by (3),
(4)  $z(1, r_b) - z(1, r_a) < z(p_b, r_b) - z(p_b, r_a)$.
Rearranging terms yields $z(1, r_b) - z(p_b, r_b) < z(1, r_a) - z(p_b, r_a)$. Furthermore, (2) implies that
(5)  $p_b\,v(x) + (1 - p_b)\,v(y) + z(p_b, r_b) = v(s) + z(1, r_b)$.
Substituting terms in (4) and (5) yields
(6)  $p_b\,v(x) + (1 - p_b)\,v(y) + z(p_b, r_a) < v(s) + z(1, r_a)$.
The left-hand side of (6) is simply the utility of Trust for A at trustworthiness level $p_b$, and the right-hand side is A's utility of Sure. This inequality implies that at $p_b$, A does not select Trust. Because the sum of the VN-M terms and z are both increasing with p, and at $p = 1$ A does trust, there must be a value $p_a > p_b$ that gets A to trust. QED
Note that we have not required the stronger condition of continuity of z in its arguments, and only require (3) to hold when $p = 1$ for the Sure lottery. Had we assumed continuity and differentiability, a sufficient condition replacing (3) would be that $\partial^2 z / \partial r \partial p > 0$. In the context of our paper, the Proposition implies that Gulf citizens will require a higher level of trustworthiness than Westerners to Trust.
HARVARD UNIVERSITY
UNIVERSITY OF NOTTINGHAM
HARVARD UNIVERSITY
REFERENCES
Al-Suwailem, Sami, "Decision under Uncertainty: An Islamic Perspective," Al-Rajhi Banking and Investment Corporation Working Paper, 2000.
Arab Human Development Reports (AHDR), United Nations Development Program, 2002–2004.
Bill, James A., and Robert Springborg, Politics in the Middle East (New York: Longman, 2000).
Bohnet, Iris, Fiona Greig, Benedikt Herrmann, and Richard Zeckhauser, "Betrayal Aversion," American Economic Review, 98 (2008), 294–310.
Bohnet, Iris, and Richard Zeckhauser, "Trust, Risk and Betrayal," Journal of Economic Behavior and Organization, 55 (2004), 467–484.
Camerer, Colin, and Keith Weigelt, "Experimental Tests of a Sequential Equilibrium Reputation Model," Econometrica, 56 (1988), 1–36.
Economist, "A Long Way to Go," April 9, 2005, 36–38.
Greif, Avner, "Cultural Beliefs and the Organization of Society: A Historical and Theoretical Reflection on Collectivist and Individualist Societies," Journal of Political Economy, 102 (1994), 912–950.
Herrmann, Benedikt A., Christian Thöni, and Simon Gächter, "Antisocial Punishment across Societies," Science, 319 (2008), 1362–1367.
Hofstede, Geert, Culture's Consequences, 2nd ed. (Thousand Oaks, CA: Sage, 2001).
Inglehart, Ronald F., "The Worldviews of Islamic Publics in Global Perspective," in Values and Perception of Islamic and Middle Eastern Publics: Findings of Value Surveys (New York: Palgrave, 2007).
Inglehart, Ronald F., Mansoor Moaddel, and Mark Tessler, "Xenophobia and In-Group Solidarity in Iraq: A Natural Experiment on the Impact of Insecurity," Perspectives on Politics, 4 (2006), 495–505.
Kahneman, Daniel, Jack L. Knetsch, and Richard H. Thaler, "Fairness and the Assumptions of Economics," Journal of Business, 59 (1986), 285–300.
Kahneman, Daniel, and Amos Tversky, "Prospect Theory: An Analysis of Decisions under Risk," Econometrica, 47 (1979), 263–291.
Knack, Stephen, and Philip Keefer, "Does Social Capital Have an Economic Payoff?" Quarterly Journal of Economics, 112 (1997), 1251–1288.
Köszegi, Botond, and Matthew Rabin, "A Model of Reference-Dependent Preferences," Quarterly Journal of Economics, 121 (2006), 1133–1165.
——, "Reference-Dependent Risk Attitudes," American Economic Review, 97 (2007), 1047–1073.
Kuran, Timur, Islam and Mammon (Princeton, NJ: Princeton University Press, 2004).
Rosen, Lawrence, The Justice of Islam (Oxford, UK: Oxford University Press, 2000).
Roth, Alvin E., Vesna Prasnikar, Masahiro Okuno-Fujiwara, and Shmuel Zamir, "Bargaining and Market Behavior in Jerusalem, Ljubljana, Pittsburgh, and Tokyo: An Experimental Study," American Economic Review, 81 (1991), 1068–1095.
Sala-i-Martin, Xavier, and Elsa V. Artadi, "Economic Growth and Investment in the Arab World," Columbia University Working Paper, 2002.
Triandis, Harry C., Individualism and Collectivism (Boulder, CO: Westview Press, 1995).
Vogel, Frank E., "The Contract Law of Islam and of the Arab Middle East," Harvard Law School Working Paper, 1997.
SHROUDED ATTRIBUTES AND INFORMATION SUPPRESSION: EVIDENCE FROM THE FIELD∗ JENNIFER BROWN TANJIM HOSSAIN JOHN MORGAN We use field and natural experiments in online auctions to study the revenue effect of varying the level and disclosure of shipping charges. Our main findings are (1) disclosure affects revenues—for low shipping charges, a seller is better off disclosing; and (2) increasing shipping charges boosts revenues when these charges are hidden. These results are not explained by changes in the number of bidders.
I. INTRODUCTION Online stores often reveal shipping charges only after a consumer fills his or her “shopping cart.” Television offers for items “not sold in stores” disclose shipping and handling in small print with speedy voice-overs. Airlines increasingly use hidden fuel surcharges. Hidden mandatory telephone and energy fees in hotels have triggered class-action lawsuits.1 Are these practices profitable? Firms will enjoy higher revenues if consumers naively underestimate “shrouded” charges. However, if hidden fees make consumers suspicious, demand may fall. If consumers fully anticipate the charges, shrouding will have no effect. We conduct field experiments using leading online auction platforms in Taiwan and Ireland to compare revenues for identical items while varying both the amount and the disclosure level of the shipping charge. We also compare revenues before and after a change on eBay’s U.S. site that allowed users to display shipping charges in their search results. Our main findings are (1) shrouding affects revenues—for low shipping charges, a ∗ We thank Alvin Ho, John Li, and Jason Snyder for their excellent research assistance, Edward Tsai and Rupert Gatti for their help in conducting the experiments in Taiwan and Ireland, respectively, and Sean Tyan for kindly sharing his data set with us. We also thank Eric Anderson, Rachel Croson, Stefano DellaVigna, Botond Koszegi, Ulrike Malmendier, Chun-Hui Miao and two editors of this journal, as well as seminar participants at a number of institutions. The second author gratefully acknowledges the financial support of the Hong Kong Research Grants Council and the Center for Economic Development, HKUST. The third author gratefully acknowledges the financial support of the National Science Foundation.
1. See Woodyard (2004) for examples. © 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of
Technology. The Quarterly Journal of Economics, May 2010
seller is better off disclosing; and (2) increasing shipping charges boosts revenues when shipping charges are shrouded. Changes in the number of bidders do not appear to drive these revenue differences. Theoretical predictions on the profitability of shrouded pricing frequently depend on the rationality level of consumers. The literature makes a distinction between shrouded charges that are unavoidable (surcharges) and avoidable (add-ons). Shrouding a surcharge is not optimal when all consumers are fully rational and disclosure is costless (Milgrom 1981; Jovanovic 1982). However, shrouding may be optimal with boundedly rational consumers (Spiegler 2006). Add-ons may be shrouded in equilibrium when consumers are myopic (Gabaix and Laibson 2006; Miao 2006), lack self-control (DellaVigna and Malmendier 2004), or vary in their tastes for the add-on and advertising add-on prices is expensive (Ellison 2005). Moreover, there is no incentive for firms to educate consumers about competitors' shrouded add-ons (Gabaix and Laibson 2006). Empirical literature on price shrouding mostly suggests that shrouding raises profitability. Ellison and Ellison (2009) find that shrouding add-ons is a profitable strategy for online firms selling computer memory chips. Chetty, Looney, and Kroft (2009) use a field experiment to show that consumer demand falls when retailers post tax-inclusive prices (i.e., disclose a surcharge) for personal care products. They offer similar results for tax disclosure in alcohol prices using historical data. Ellison (2006) surveys various approaches to modeling bounded rationality and their implications for firm pricing. DellaVigna (2009) provides an overview of bounded rationality models using field data. Theory suggests that firms can exploit price partitioning (separating price into components) to affect consumer choice (Kahneman and Tversky 1984; Thaler 1985). Hossain and Morgan (2006) find evidence of this in field experiments on eBay's U.S. auction site. They find that, when shipping is shrouded, raising the shipping charge increases both revenues and the number of bidders attracted to an auction. In contrast, mixed results have been obtained in laboratory experiments (Morwitz, Greenleaf, and Johnson 1998; Bertini and Wathieu 2008). Smith and Brynjolfsson (2001) find that online book retailers do not benefit from price partitioning. Our paper complements these earlier works by studying the interaction between price partitioning and disclosure using both field and natural experiments.
II. FIELD EXPERIMENTS We conducted field experiments, selling ten different types of iPods, to study the revenue effect of changing the amount and shrouding level of shipping charges. The auction title and item description specified the capacity, model, and color of each iPod. The item description clearly stated the shipping charge and method. We disclosed the shipping charge in the title of the listing for half of the auctions and shrouded (omitted) it from the title for the other half. We used two different auction sites for these experiments, selling 36 items on Yahoo Taiwan in 2006 and 40 items on eBay Ireland in 2008. Our seller identity on each site had a reasonable reputation rating. The choice of auction sites and products allows us to vary shipping and shrouding easily while selling identical items. iPod markets on these sites are thick, and exhibit considerable variation in shipping charges. Neither site automatically reveals shipping in search listings, an essential feature for examining shrouding.2 This allowed us to control the disclosure level of shipping charges without drawing attention to ourselves. II.A. Taiwan We sold new 512 MB and 1 GB silver iPod Shuffles as well as 1 GB and 2 GB Nanos in both white and black—a total of six different iPod models. Our treatments were as follows:

                         Opening price of TWD 750                        Opening price of TWD 600
                         Low shipping TWD 30    High shipping TWD 180    High shipping TWD 180
Disclosed                DL                     DH                       DR
Shrouded                 SL                     SH                       SR
where “TWD” denotes new Taiwan dollars. At the time of our experiments, the exchange rate was TWD 33 to USD 1 or EUR 0.83. Prior to the start of the experiments, we collected field data and observed shipping charges ranging from TWD 50 to 250 with a median shipping charge of TWD 100. Thus, our low shipping charge is a “bargain” in this market, whereas our high shipping charge is at the 99th percentile of the market. We auctioned all six 2. In contrast, eBay U.S. automatically discloses shipping.
iPod models under each treatment. Treatments DL, DH, and DR were conducted from March 13 to March 20, 2006, whereas treatments SL, SH, and SR were conducted from March 20 to March 27, 2006. Although the auctions are separated by a week, Apple made no changes to the suggested retail price over this period, nor were there any price trends in online auctions for iPods worldwide (Glover and Raviv 2007). All auctions closed successfully. Figures I and II present screenshots (and accompanying English translations) for auctions where the shipping charge is disclosed and shrouded, respectively. To examine the effect of shrouding, we compare treatments Dx to Sx. Comparing treatments xL to xH reveals the effect of raising the shipping charge while holding the opening price fixed. In comparing treatments xL to xH, there is a potential confound—the reserve price (minimum payment) of the auction also increases. This is unlikely to matter because the minimum payment is considerably below the retail price, and not likely to be binding.3 Nevertheless, the xR treatments (“R” is a mnemonic for reserve) disentangle shipping charges and reserve price. To study the effects of raising the shipping charge while holding the reserve constant, we compare treatments xL to xR. Comparing treatments xR to xH identifies the effect of raising the opening price with a fixed shipping charge. II.B. Ireland We sold new 1 GB second generation iPod Shuffles in four different colors: blue, green, pink, and silver. Because changing the reserve price had no effect in the Taiwan experiments, we simplified the design, omitting the xR treatments. Our treatments were as follows: Opening price of EUR 0.01
                         Low shipping EUR 11    High shipping EUR 14
Disclosed                DL                     DH
Shrouded                 SL                     SH
At the time of our experiments, the exchange rate was EUR 0.77 to USD 1. We conducted eight auctions per week, with two items 3. The cheapest iPod we sold, the 512 MB Shuffle, had a retail price of TWD 2,500.
FIGURE I
Screenshot for Disclosed Auction in Taiwan
Title: Brand new IPOD SHUFFLE 1G!!! Shipping Fee TWD30!!!
Item Description: This is a brand new IPOD SHUFFLE 1G. The seller delivers only via standard postage service. The shipping cost is TWD30 and is not negotiable. The buyer needs to make the payment within 10 days of completion of the auction. The seller only accepts payment by bank transfer. Your iPod comes with 90 days of telephone technical support and 1 year of warranty.
FIGURE II Screenshot for Shrouded Auction in Taiwan Title: Brand new IPOD SHUFFLE 1G!!! Item Description: This is a brand new IPOD SHUFFLE 1G. The seller delivers only via standard postage service. The shipping cost is TWD30 and is not negotiable. The buyer needs to make the payment within 10 days of completion of the auction. The seller only accepts payment by bank transfer. Your iPod comes with 90 days of telephone technical support and 1 year of warranty.
FIGURE III Screenshot from Disclosed eBay Ireland Auction
in each treatment cell. In a given week, items of the same color differed only by shipping charge. The disclosure treatment for a color alternated each week. We ran the experiments over the five-week period from October 13, 2008, to November 18, 2008, and all auctions closed successfully. Prior to the start of the experiments, we collected field data and chose shipping charges coinciding with the 25th and 75th percentiles of the market. Figures III and IV
FIGURE IV Screenshot from Shrouded eBay Ireland Auction
present screenshots for auctions where the shipping charge is disclosed and shrouded, respectively. II.C. Results Table I summarizes the results by country for each treatment, whereas Table II presents formal statistical tests. By pooling the
TABLE I
SUMMARY STATISTICS FOR YAHOO AND EBAY FIELD EXPERIMENTS

                                    Opening price of TWD 750 or EUR 0.01        Opening price of TWD 600
                                    Low shipping           High shipping        High shipping
                                    TWD 30 or EUR 11       TWD 180 or EUR 14    TWD 180
Disclosed
  Taiwan    Revenue                 92.92 (28.76)          96.92 (30.91)        95.31 (30.03)
            # of bidders            11.17 (2.32)           10.17 (3.76)         10.5 (3.7)
            # of observations       6                      6                    6
  Ireland   Revenue                 37.52 (5.63)           36.93 (5.65)         —
            # of bidders            5.8 (1.3)              7.0 (1.9)            —
            # of observations       10                     10                   —
Shrouded
  Taiwan    Revenue                 88.89 (29.31)          93.26 (28.87)        94.27 (30.53)
            # of bidders            11.33 (5.6)            10.5 (5.2)           12.7 (4.1)
            # of observations       6                      6                    6
  Ireland   Revenue                 36.36 (4.85)           38.94 (3.15)         —
            # of bidders            6.7 (2.26)             6.9 (1.6)            —
            # of observations       10                     10                   —
Note: Values are means with standard deviations shown in parentheses. Revenue is denoted in euros. In March 2006, TWD 1 = EUR 0.025. Shipping charges are “shrouded” when they are not included in the title or search results. Shipping charges are “disclosed” when they appear in the title and search results.
data from the two countries, we can take advantage of a larger data set to estimate more precise effects. Three tests are reported: a standard t-test, a Wilcoxon signed-rank test, and a Fisher–Pitman exact permutation test. As the table shows, the statistical significance is similar across tests. Table II also presents permutation-based confidence intervals.4 The effects of shrouding on revenues may be seen by comparing each item under treatment Dx with its pair under 4. Permutation-based confidence intervals are only valid under the null hypothesis of exchangeability. Thus, we construct these only for treatment pairs where we cannot reject the null.
TABLE II
SUMMARY OF PAIRWISE TESTS OF REVENUE AND NUMBER OF BIDDERS FOR YAHOO AND EBAY FIELD EXPERIMENTS

                 # of pairs   Mean differences   t-test      Wilcoxon signed-   Fisher–Pitman      Monte Carlo permutation-based
                 of obs.      (e.g., DL − SL)    t-stat      rank test z-stat   permutation test   90% confidence intervals
                                                                                p-value
Revenue
  DL vs. SL      10           2.763              2.578∗∗     1.736∗             .047               —
  DH vs. SH      10           1.422              0.807       0.410              .445               (−2.95, 2.95)
  DL vs. DH      16           −1.126             0.853       0.724              .409               (−2.16, 2.16)
  SL vs. SH      16           −3.254             3.043∗∗∗    2.617∗∗∗           .011               —
  DH vs. DR      6            −1.605             0.793       0.420              .500               (−3.09, 3.09)
  SH vs. SR      6            1.008              0.488       0.216              .750               (−3.25, 3.25)
  DL vs. DR      6            −2.389             2.200∗      1.782∗             .094               —
  SL vs. SR      6            −5.376             4.997∗∗∗    2.201∗∗            .031               —
# of bidders
  DL vs. SL      10           −0.533             0.291       0.204              .805               (−2.93, 2.93)
  DH vs. SH      10           −0.271             0.148       0.307              .906               (−2.18, 2.18)
  DL vs. DH      16           −0.375             0.535       0.863              .666               (−1.13, 1.13)
  SL vs. SH      16           0.188              0.174       0.339              .921               (−1.69, 1.69)
  DH vs. DR      6            0.333              1.000       1.000              .625               (−0.66, 0.66)
  SH vs. SR      6            2.167              2.484∗∗     1.897∗∗            .094               —
  DL vs. DR      6            0.667              0.445       0.315              .750               (−2.33, 2.33)
  SL vs. SR      6            −1.333             0.623       0.954              .656               (−3.33, 3.33)

Note: "D" indicates disclosed, "S" indicates shrouded, "L" indicates low shipping fees, and "H" indicates high shipping fees. "R" indicates Taiwan auctions with a high shipping fee and low opening price, designed to have a reserve equal to the reserve in treatment "L". Revenue is denoted in euros. In March 2006, TWD 1 = EUR 0.025. Permutation-based confidence intervals were constructed only when we failed to reject the null hypothesis of equality (200,000 replications). ∗, ∗∗, and ∗∗∗ represent statistical significance at the 10%, 5%, and 1% levels, respectively.
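The Fisher–Pitman statistic in Table II is an exact permutation test on the paired treatment differences. A minimal sign-flip implementation, shown with made-up differences rather than the experimental revenues, looks like this:

```python
# Sketch of an exact sign-flip permutation test for paired differences,
# in the spirit of the Fisher-Pitman test reported in Table II.
from itertools import product

def paired_permutation_p(diffs, alternative="two-sided"):
    """Exact p-value obtained by flipping the sign of each paired difference."""
    n = len(diffs)
    observed = sum(diffs) / n
    count = total = 0
    for signs in product([1, -1], repeat=n):
        stat = sum(s * d for s, d in zip(signs, diffs)) / n
        if alternative == "greater":
            count += stat >= observed
        else:
            count += abs(stat) >= abs(observed) - 1e-12
        total += 1
    return count / total

# Ten hypothetical DL - SL revenue differences (in euros):
diffs = [4.0, 3.5, 5.0, 2.0, 4.5, 1.5, 1.0, 0.5, 2.5, 3.0]
print(paired_permutation_p(diffs, alternative="greater"))
```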
treatment Sx.5 Notice that, under low shipping, revenues declined with shrouding. Statistical tests indicate that this revenue difference is significant at the 5% level. Under high shipping, the effect is ambiguous—disclosure increased revenues in Taiwan but decreased them in Ireland. Formal statistical tests do not indicate a significant difference in revenues—confidence bounds suggest that revenue differences between shrouded and disclosed treatments under high shipping do not exceed EUR 2.95. Disclosing a low shipping charge might raise revenues by attracting more bidders, yet there is little evidence of this. Disclosure increased the number of bidders in Taiwan but decreased the number in Ireland. Statistical tests suggest that revenue differences cannot be attributed to changes in the number of bidders. Similarly, disclosure has no significant effect on the number of bidders under high shipping. How do shipping charges affect revenues under the different shrouding treatments? This may be seen by comparing each item under treatment xL with its pair under treatment xH. When shipping charges are disclosed, the revenue effect is ambiguous— more expensive shipping raises revenues increase in Taiwan but lowers them in Ireland. Once again, formal statistical tests fail to reject the hypothesis of no treatment effect—confidence bounds indicate that the effect is somewhere below EUR 2.16. In contrast, raising the shipping charge significantly increases revenues when it is shrouded—the winning bidder pays, on average, 5% more in Taiwan and 7% more in Ireland under high shipping. As Table II shows, this revenue difference is significant at about the 1% level. Shipping charges have only modest effects on the number of bidders attracted to each auction. In Taiwan, higher shipping charges attract slightly fewer bidders. In Ireland, they attract slightly more. Statistical tests are consistent with this observation—we cannot reject the null hypothesis of no treatment effect at conventional levels under either disclosure or shrouding. When the opening price is held fixed, raising the shipping charge increases the reserve level of the auction. Comparing treatments xH to xR isolates a pure reserve effect. Regardless of disclosure, there is no statistical difference between these treatments. In contrast, comparing treatments xL to xR isolates a pure 5. When multiple identical items were sold under the same treatment, we used mean revenue as the unit of observation leading to ten observations for ten different types of iPod.
870
QUARTERLY JOURNAL OF ECONOMICS
shipping effect. Here we find that raising the shipping charge increases revenues, but the effect is more pronounced when shipping costs are shrouded.6 This revenue difference is significant at the 10% level under disclosure and the 5% level under shrouding. To summarize, changes in the reserve level do not appear to drive auction revenues. II.D. Discussion The main findings that emerge from the field experiments are (1) shrouding a low shipping charge is a money-losing strategy; (2) raising shipping charges increases revenue, particularly when they are shrouded; and (3) these revenue differences cannot be attributed to changes in the number of bidders. We sketch a model that can explain these findings. Suppose that the number of bidders is fixed. Some bidders are attentive—they are fully aware of the shipping charge. Others are naive—they are unaware of the exact shipping charge, but believe it to be extremely low.7 Finally, suspicious bidders are also unaware of the exact shipping charge, but assume that it will be high.8 With disclosure, a fraction of the naive and suspicious bidders become aware of the exact shipping charge and change their bids. Suspicious bidders raise their bids because the actual shipping charge is lower than their expectations, whereas naive bidders lower their bids because the shipping charge is unexpectedly high. When the shipping charge is low, the net effect of disclosure is to increase seller revenues, because the gains from suspicious bidders outweigh the losses from naive bidders. The reverse is true when the shipping charge is high. Thus, there is a shipping charge threshold below which disclosure is optimal and above which sellers prefer to shroud. Increasing the shipping charge causes attentive bidders to reduce their bids on a one-for-one basis. Bids of naive and suspicious bidders, who are unaware of the exact shipping charge, do not respond to this change. The net effect is to improve seller revenues. When the shipping charge is shrouded, this improvement 6. The revenue difference between treatments SL and SR is consistent with the findings of Hossain and Morgan (2006), who also found that revenues increased with higher shipping charges, holding the reserve fixed. Unlike their findings, we do not see a treatment difference in the number of bidders. 7. Such behavior might arise if consumers anchored on the base price (Kahneman and Tversky 1979). 8. We are grateful to an anonymous referee for suggesting a model along these lines.
SHROUDED ATTRIBUTES AND INFORMATION SUPPRESSION
871
is larger than when the shipping charge is disclosed because a smaller fraction of bidders adjust their bids. III. NATURAL EXPERIMENT On October 28, 2004, eBay US announced a change in their search format—prospective bidders would now have the option of seeing the shipping charge for each auction on the results page. Prior to this, users had to read the body of each auction listing to learn the shipping charge. EBay also increased the visibility of shipping charges by displaying them on the bid confirmation screen. This action shifted the default from shrouding to disclosure of shipping charges. We obtained a data set used in Tyan (2005), consisting of successful auctions for gold and silver coins conducted on eBay’s U.S. site from September to December 2004. In this data set, we classify the shipping charges for each auction as either “shrouded” or “disclosed.” Shipping charges are shrouded when they are not included in the title or search results and disclosed when they are included. Shrouded auctions are those ending prior to October 27, 2004, whereas disclosed auctions are those beginning after November 10, 2004.9 Auctions between these dates are omitted. Table III summarizes the revenue (including shipping), opening price, shipping charge, and number of unique bidders for the shrouded and disclosed auctions of gold and silver coins. Interestingly, average revenues are higher when the shipping charge is disclosed than when it is shrouded. The increase, however, cannot be attributed to differences in the number of bidders—shrouded auctions attract about the same number of bidders as do disclosed auctions. We study changes in shrouding and shipping charges using the following regression: revenue = β0 + β1 shipping + β2 opening + β3 disclosed + β4 disclosed × shipping + β5 disclosed × opening (1) + γ X + ε, where X is a matrix of control variables. For the field experiments, we include product fixed effects. For silver coins, we use a dummy for whether the coin was graded. For gold coins, we use dummies 9. Results are robust to variations in these cutoff dates.
872
QUARTERLY JOURNAL OF ECONOMICS TABLE III SUMMARY STATISTICS FOR GOLD AND SILVER COIN AUCTIONS Gold coins
Revenue Opening price Shipping charge # of bidders # of observations Revenue Opening price Shipping charge # of bidders # of observations
Silver coins
Disclosed 67.45 (22.00) 12.17 (21.81) 4.55 (1.37) 6.15 (2.48) 162
45.72 (4.19) 24.10 (16.16) 5.08 (1.27) 4.53 (2.92) 306
Shrouded 62.12 (16.92) 9.04 (17.02) 4.81 (1.90) 6.34 (2.44) 124
42.49 (4.18) 18.98 (15.98) 4.95 (1.48) 4.37 (2.70) 212
Note: Values are means with standard deviations shown in parentheses. Shipping charges are “shrouded” when they are not included in the title or search results. Shipping charges are “disclosed” when they appear in the title and search results. Data from silver and gold coin auctions were provided by Tyan (2005). For the coin data, shrouded auctions are those ending prior to October 27, 2004, whereas disclosed auctions are those beginning after November 10, 2004. Auctions between these dates are omitted.
for each grade interacted with dummies for the grading organization. We also control for whether the coin was listed as a “proof ” or “brilliant uncirculated.” Controls for photographs, acceptance of Paypal or credit cards, and the decile of the sellers’ feedback rating are used for all coin auctions. To account for heteroscedasticity, we use robust estimation. Table IV presents the results of this analysis. If shrouding matters, then we should reject the hypothesis that the coefficients associated with disclosure are all equal to zero (β3 = β4 = β5 = 0). Table IV reports that this is the case in all instances. What happens when a seller increases the shipping charge but leaves the reserve level unchanged? If all bidders were attentive, this would have no effect on revenues (under shrouding β1 = β2 ; under disclosure β1 + β4 = β2 + β5 ). When shipping charges are shrouded, we reject this hypothesis—a one-dollar
SHROUDED ATTRIBUTES AND INFORMATION SUPPRESSION
873
TABLE IV REGRESSIONS OF TOTAL AUCTION REVENUE FOR IPOD AND COIN AUCTIONS iPods (EUR) Coefficient estimates 1.130∗∗∗ (0.320) opening price −0.101 (0.378) disclosed 6.991 (8.634) disclosed × shipping charge −0.470∗∗ (0.266) disclosed × opening price −0.140 (0.446)
β1 shipping charge β2 β3 β4 β5
β3 = β4 = β5 = 0
2.031∗∗∗ (0.569) 0.013 (0.046) 4.053 (4.941) −0.359 (1.218) 0.048 (0.075)
Silver coins (USD) 0.888∗∗∗ (0.178) 0.079∗∗∗ (0.015) 4.261∗∗∗ (1.392) −0.290 (0.253) −0.013 (0.021)
F-tests 4.17∗∗∗ d.f. (3, 61)
2.1∗ (3, 261)
18.47∗∗∗ (3, 499)
d.f.
4.48∗∗ (1, 61)
11.95∗∗∗ (1, 261)
20.45∗∗∗ (1, 499)
d.f.
2.20 (1, 61)
2.15 (1, 261)
8.45∗∗∗ (1, 499)
76
286
518
β1 = β2 β1 + β4 = β2 + β5 # of observations
Gold coins (USD)
Note: The values in parentheses are robust standard errors. For experimental data, “disclosed” = 1 when the shipping charge was listed in the item title. For field data, “disclosed” = 1 when the auction occurred after November 10, 2004. iPod regressions include item-specific fixed effects. Coin regressions include controls for condition, grade, seller reputation, and other auction characteristics. ∗ , ∗∗ , and ∗∗∗ represent statistical significance at the 10%, 5%, and 1% levels, respectively.
increase in shipping with an equal reduction in the opening price raises revenue. When shipping charges are disclosed, we can reject the null hypothesis for silver coins, but not for other items. In all cases, increasing shipping by a dollar while holding the reserve level constant has a smaller revenue effect when the shipping charge is disclosed than when it is shrouded. An average seller benefited from the increased disclosure of shipping charges due to eBay’s format change. Formally, we reject the hypothesis that an average seller earned the same revenue under shrouding and disclosure (β3 + β4 × average opening price + β5 × average shipping charge = 0; F(1,261) = 4.48 for gold coins and F(1,499) = 50.58 for silver coins). Are differences in the number of bidders driving the revenue effects? To examine this, we change the dependent variable in equation (1) to the number of unique bidders. Table V presents
874
QUARTERLY JOURNAL OF ECONOMICS TABLE V REGRESSIONS OF TOTAL NUMBER OF BIDDERS FOR IPOD AND COIN AUCTIONS iPods
β1 shipping charge β2 opening price
Gold coins
Coefficient estimates 0.244 0.124 (0.394) (0.078) −0.228 −0.077∗∗∗ (0.350) (0.005)
β3 disclosed β4 disclosed × shipping charge β5 disclosed × opening price
β3 = β4 = β5 = 0
−0.089 (0.089) −0.132∗∗∗ (0.007) 0.969 (0.756) 0.066 (0.132) −0.019∗∗ (0.010)
F-tests 0.51 d.f. (3, 61)
0.44 (3, 261)
12.2∗∗∗ (3, 499)
0.44 (1, 61)
6.44 (1, 264)
0.23 (1, 499)
β1 = β2 d.f. β1 + β4 = β2 + β5
1.83 (1, 499)
d.f. # of observations
Silver coins
76
286
518
Note: The values in parentheses are robust standard errors. For experimental data, “disclosed” = 1 when the shipping charge was listed in the item title. For field data, “disclosed” = 1 when the auction occurred after November 10, 2004. iPod regressions include item-specific fixed effects. Coin regressions include controls for condition, grade, seller reputation, and other auction characteristics. ∗ , ∗∗ , and ∗∗∗ represent statistical significance at the 10%, 5%, and 1% levels, respectively.
the results of this analysis. We only observe a shrouding effect on the number of bidders for silver coins. For all other data, we cannot reject the hypothesis that the disclosure coefficients are all equal to zero (β3 = β4 = β5 = 0). Moreover, in every instance, shipping charge coefficients are statistically indistinguishable from zero. There is little evidence that changes in the number of bidders are responsible for the observed revenue differences. Instead, revenue differences are likely a result of differences in the bids being placed. The regression results complement those of the field experiment: (1) shrouding affects revenues; (2) raising the shipping charge increases revenues, and the effect is stronger under shrouding; and (3) these differences are not attributable to changes in the number of bidders. The finding that disclosure on eBay increased average seller revenues, however, presents a
SHROUDED ATTRIBUTES AND INFORMATION SUPPRESSION
875
puzzle. If disclosure were profitable, then why didn’t more sellers disclose their shipping charges in the titles of their listings? Prior to the institutional change on eBay, an individual seller would not benefit by switching from shrouding to disclosing a high shipping charge. Revenues would fall if more naive bidders than suspicious ones became aware of the shipping charge, because newly aware naives would then lower their bids. In contrast, disclosure is profitable for sellers offering low shipping charges. A marketwide change is likely to have different effects on awareness. In particular, suppose that suspicious bidders are more technologically sophisticated than naive bidders and hence more likely to adjust their user preferences to make shipping visible following the changes to eBay’s site. Now, if a seller discloses a high shipping charge, newly aware suspicious bidders will raise their bids (so long as the charge is below their expectations), and revenues will increase. Similarly, sellers offering a low shipping charge will also benefit from disclosure. As a result, overall seller revenues can increase with such a change even when disclosure was previously unprofitable (for high–shipping charge sellers). IV. CONCLUSIONS Although sellers often shroud their shipping charges in online auctions, our findings suggest that the profitability of this strategy depends on the size of the charge. In field experiments, we find that shrouding a low shipping charge actually reduces seller revenues, whereas shrouding a high shipping charge does not improve revenues relative to disclosure. Using field data from eBay, we find that an institutional change toward transparency may raise revenues for the average seller. Shrouding and partitioned pricing are complements—a seller can increase revenues by raising its shipping charge when shrouded, but not under disclosure. These revenue effects are not attributable to changes in the number of bidders. Perhaps most surprising is the large revenue effect of raising shipping charges under shrouding. Indeed, for all products, the estimated effect of raising the shipping charge (β1 in Table IV) is statistically indistinguishable from 1 at the 5% level.10 10. For gold coins, the coefficient is more than one. Formally, we can reject the null hypothesis that β1 = 1 at the 7% level.
876
QUARTERLY JOURNAL OF ECONOMICS
That is, at the current level of shipping fees, a dollar marginal increase in shipping fees passes directly through to seller revenues. NORTHWESTERN UNIVERSITY UNIVERSITY OF TORONTO UNIVERSITY OF CALIFORNIA, BERKELEY
REFERENCES Bertini, Marco, and Luc Wathieu, “Attention Arousal through Price Partitioning,” Marketing Science, 27 (2008), 236–246. Chetty, Raj, Adam Looney, and Kory Kroft, “Salience and Taxation: Theory and Evidence,” American Economic Review, 99 (2009), 1145–1177. DellaVigna, Stefano, “Psychology and Economics: Evidence from the Field,” Journal of Economic Literature, 47 (2009), 315–372. DellaVigna, Stefano, and Ulrike Malmendier, “Contract Design and Self-Control: Theory and Evidence,” Quarterly Journal of Economics, 119 (2004), 353–402. Ellison, Glenn, “A Model of Add-on Pricing,” Quarterly Journal of Economics, 120 (2005), 585–637. ——, “Bounded Rationality in Industrial Organization,” in Advances in Economics and Econometrics: Theory and Applications, Ninth World Congress, R. Blundell, W. Newey, and T. Persson, eds. (Cambridge, UK: Cambridge University Press, 2006). Ellison, Glenn, and Sara Fisher Ellison, “Search, Obfuscation, and Price Elasticities on the Internet,” Econometrica, 77 (2009), 427–452. Gabaix, Xavier, and David Laibson, “Shrouded Attributes, Consumer Myopia, and Information Suppression in Competitive Markets,” Quarterly Journal of Economics, 121 (2006), 505–540. Glover, Brent, and Yaron Raviv, “Revenue Non-equivalence between Auctions with Soft and Hard Closing Mechanism: New Evidence from Yahoo!” University of Pennsylvania Working Paper, 2007. Hossain, Tanjim, and John Morgan, “. . . Plus Shipping and Handling: Revenue (Non)equivalence in Field Experiments on eBay,” Advances in Economic Analysis and Policy, 6 (2006), Article 3. Jovanovic, Boyan, “Truthful Disclosure of Information,” Bell Journal of Economics, 13 (1982), 36–44. Kahneman, Daniel, and Amos Tversky, “Prospect Theory: An Analysis of Decision under Risk,” Econometrica, 47 (1979), 263–291. ——, “Choices, Values, and Frames,” American Psychologist, 39 (1984), 341–350. Miao, Chun-Hui, “Consumer Myopia, Standardization and Aftermarket Monopolization,” University of South Carolina Working Paper, 2006. Milgrom, Paul R., “Good News and Bad News: Representation Theorems and Applications,” Bell Journal of Economics, 12 (1981), 380–391. Morwitz, Vicki G., Eric A. Greenleaf, and Eric J. Johnson, “Divide and Prosper: Consumers’ Reaction to Partitioned Prices,” Journal of Marketing Research, 35 (1998), 453–463. Smith, Michael D., and Erik Brynjolfsson, “Consumer Decision-Making at an Internet Shopbot: Brand Still Matters,” Journal of Industrial Economics, 49 (2001), 541–558. Spiegler, Ran, “Competition over Agents with Boundedly Rational Expectations,” Theoretical Economics, 1 (2006), 207–231. Thaler, Richard, “Mental Accounting and Consumer Choice,” Marketing Science, 4 (1985), 199–214. Tyan, Sean, “The Effect of Shipping Costs on Bidder Entry and Seller Revenues in eBay Auctions,” Senior Thesis, Department of Economics, Stanford University, 2005. Woodyard, Chris, “Hotels Face Lawsuits on Surcharges for Phones, Energy,” USA TODAY, September 26, 2004.
SHROUDED ATTRIBUTES AND INFORMATION SUPPRESSION: EVIDENCE FROM THE FIELD∗ JENNIFER BROWN TANJIM HOSSAIN JOHN MORGAN We use field and natural experiments in online auctions to study the revenue effect of varying the level and disclosure of shipping charges. Our main findings are (1) disclosure affects revenues—for low shipping charges, a seller is better off disclosing; and (2) increasing shipping charges boosts revenues when these charges are hidden. These results are not explained by changes in the number of bidders.
I. INTRODUCTION Online stores often reveal shipping charges only after a consumer fills his or her “shopping cart.” Television offers for items “not sold in stores” disclose shipping and handling in small print with speedy voice-overs. Airlines increasingly use hidden fuel surcharges. Hidden mandatory telephone and energy fees in hotels have triggered class-action lawsuits.1 Are these practices profitable? Firms will enjoy higher revenues if consumers naively underestimate “shrouded” charges. However, if hidden fees make consumers suspicious, demand may fall. If consumers fully anticipate the charges, shrouding will have no effect. We conduct field experiments using leading online auction platforms in Taiwan and Ireland to compare revenues for identical items while varying both the amount and the disclosure level of the shipping charge. We also compare revenues before and after a change on eBay’s U.S. site that allowed users to display shipping charges in their search results. Our main findings are (1) shrouding affects revenues—for low shipping charges, a ∗ We thank Alvin Ho, John Li, and Jason Snyder for their excellent research assistance, Edward Tsai and Rupert Gatti for their help in conducting the experiments in Taiwan and Ireland, respectively, and Sean Tyan for kindly sharing his data set with us. We also thank Eric Anderson, Rachel Croson, Stefano DellaVigna, Botond Koszegi, Ulrike Malmendier, Chun-Hui Miao and two editors of this journal, as well as seminar participants at a number of institutions. The second author gratefully acknowledges the financial support of the Hong Kong Research Grants Council and the Center for Economic Development, HKUST. The third author gratefully acknowledges the financial support of the National Science Foundation. [email protected], [email protected], [email protected]. 1. See Woodyard (2004) for examples. C 2010 by the President and Fellows of Harvard College and the Massachusetts Institute of
Technology. The Quarterly Journal of Economics, May 2010
seller is better off disclosing; and (2) increasing shipping charges boosts revenues when shipping charges are shrouded. Changes in the number of bidders do not appear to drive these revenue differences. Theoretical predictions on the profitability of shrouded pricing frequently depend on the rationality level of consumers. The literature makes a distinction between shrouded charges that are unavoidable (surcharges) and avoidable (add-ons). Shrouding a surcharge is not optimal when all consumers are fully rational and disclosure is costless (Milgrom 1981; Jovanovic 1982). However, shrouding may be optimal with boundedly rational consumers (Spiegler 2006). Add-ons may be shrouded in equilibrium when consumers are myopic (Gabaix and Laibson 2006; Miao 2006) or lack self-control (DellaVigna and Malmendier 2004), or when consumers vary in their tastes for the add-on and advertising add-on prices is expensive (Ellison 2005). Moreover, there is no incentive for firms to educate consumers about competitors' shrouded add-ons (Gabaix and Laibson 2006). The empirical literature on price shrouding mostly suggests that shrouding raises profitability. Ellison and Ellison (2009) find that shrouding add-ons is a profitable strategy for online firms selling computer memory chips. Using a field experiment, Chetty, Looney, and Kroft (2009) find that consumer demand falls when retailers post tax-inclusive prices (i.e., disclose a surcharge) for personal care products. They offer similar results for tax disclosure in alcohol prices using historical data. Ellison (2006) surveys various approaches to modeling bounded rationality and their implications for firm pricing. DellaVigna (2009) provides an overview of bounded rationality models using field data. Theory suggests that firms can exploit price partitioning (separating price into components) to affect consumer choice (Kahneman and Tversky 1984; Thaler 1985). Hossain and Morgan (2006) find evidence of this in field experiments on eBay's U.S. auction site. They find that, when shipping is shrouded, raising the shipping charge increases both revenues and the number of bidders attracted to an auction. In contrast, mixed results have been obtained in laboratory experiments (Morwitz, Greenleaf, and Johnson 1998; Bertini and Wathieu 2008). Smith and Brynjolfsson (2001) find that online book retailers do not benefit from price partitioning. Our paper complements these earlier works by studying the interaction between price partitioning and disclosure using both field and natural experiments.
II. FIELD EXPERIMENTS We conducted field experiments, selling ten different types of iPods, to study the revenue effect of changing the amount and shrouding level of shipping charges. The auction title and item description specified the capacity, model, and color of each iPod. The item description clearly stated the shipping charge and method. We disclosed the shipping charge in the title of the listing for half of the auctions and shrouded (omitted) it from the title for the other half. We used two different auction sites for these experiments, selling 36 items on Yahoo Taiwan in 2006 and 40 items on eBay Ireland in 2008. Our seller identity on each site had a reasonable reputation rating. The choice of auction sites and products allows us to vary shipping and shrouding easily while selling identical items. iPod markets on these sites are thick, and exhibit considerable variation in shipping charges. Neither site automatically reveals shipping in search listings, an essential feature for examining shrouding.2 This allowed us to control the disclosure level of shipping charges without drawing attention to ourselves. II.A. Taiwan We sold new 512 MB and 1 GB silver iPod Shuffles as well as 1 GB and 2 GB Nanos in both white and black—a total of six different iPod models. Our treatments were as follows:

              Opening price of TWD 750                      Opening price of TWD 600
              Low shipping TWD 30    High shipping TWD 180  High shipping TWD 180
  Disclosed   DL                     DH                     DR
  Shrouded    SL                     SH                     SR
where "TWD" denotes new Taiwan dollars. At the time of our experiments, the exchange rate was TWD 33 to USD 1 or EUR 0.83. Prior to the start of the experiments, we collected field data and observed shipping charges ranging from TWD 50 to 250 with a median shipping charge of TWD 100. Thus, our low shipping charge is a "bargain" in this market, whereas our high shipping charge is at the 99th percentile of the market. 2. In contrast, eBay U.S. automatically discloses shipping.
We auctioned all six iPod models under each treatment. Treatments DL, DH, and DR were conducted from March 13 to March 20, 2006, whereas treatments SL, SH, and SR were conducted from March 20 to March 27, 2006. Although the auctions are separated by a week, Apple made no changes to the suggested retail price over this period, nor were there any price trends in online auctions for iPods worldwide (Glover and Raviv 2007). All auctions closed successfully. Figures I and II present screenshots (and accompanying English translations) for auctions where the shipping charge is disclosed and shrouded, respectively. To examine the effect of shrouding, we compare treatments Dx to Sx. Comparing treatments xL to xH reveals the effect of raising the shipping charge while holding the opening price fixed. In comparing treatments xL to xH, there is a potential confound—the reserve price (minimum payment) of the auction also increases. This is unlikely to matter because the minimum payment is considerably below the retail price, and not likely to be binding.3 Nevertheless, the xR treatments ("R" is a mnemonic for reserve) disentangle shipping charges and reserve price. To study the effects of raising the shipping charge while holding the reserve constant, we compare treatments xL to xR. Comparing treatments xR to xH identifies the effect of raising the opening price with a fixed shipping charge. II.B. Ireland We sold new 1 GB second generation iPod Shuffles in four different colors: blue, green, pink, and silver. Because changing the reserve price had no effect in the Taiwan experiments, we simplified the design, omitting the xR treatments. Our treatments were as follows:

              Opening price of EUR 0.01
              Low shipping EUR 11    High shipping EUR 14
  Disclosed   DL                     DH
  Shrouded    SL                     SH
At the time of our experiments, the exchange rate was EUR 0.77 to USD 1. 3. The cheapest iPod we sold, the 512 MB Shuffle, had a retail price of TWD 2,500.
FIGURE I Screenshot for Disclosed Auction in Taiwan Title: Brand new IPOD SHUFFLE 1G!!! Shipping Fee TWD30 !!! Item Description: This is a brand new IPOD SHUFFLE 1G. The seller delivers only via standard postage service. The shipping cost is TWD30 and is not negotiable. The buyer needs to make the payment within 10 days of completion of the auction. The seller only accepts payment by bank transfer. Your iPod comes with 90 days of telephone technical support and 1 year of warranty.
FIGURE II Screenshot for Shrouded Auction in Taiwan Title: Brand new IPOD SHUFFLE 1G!!! Item Description: This is a brand new IPOD SHUFFLE 1G. The seller delivers only via standard postage service. The shipping cost is TWD30 and is not negotiable. The buyer needs to make the payment within 10 days of completion of the auction. The seller only accepts payment by bank transfer. Your iPod comes with 90 days of telephone technical support and 1 year of warranty.
FIGURE III Screenshot from Disclosed eBay Ireland Auction
We conducted eight auctions per week, with two items in each treatment cell. In a given week, items of the same color differed only by shipping charge. The disclosure treatment for a color alternated each week. We ran the experiments over the five-week period from October 13, 2008, to November 18, 2008, and all auctions closed successfully. Prior to the start of the experiments, we collected field data and chose shipping charges coinciding with the 25th and 75th percentiles of the market.
FIGURE IV Screenshot from Shrouded eBay Ireland Auction
Figures III and IV present screenshots for auctions where the shipping charge is disclosed and shrouded, respectively. II.C. Results Table I summarizes the results by country for each treatment, whereas Table II presents formal statistical tests.
TABLE I
SUMMARY STATISTICS FOR YAHOO AND EBAY FIELD EXPERIMENTS

                                            Disclosed        Shrouded
Opening price of TWD 750 or EUR 0.01, low shipping TWD 30 or EUR 11
  Taiwan    Revenue                         92.92 (28.76)    88.89 (29.31)
            # of bidders                    11.17 (2.32)     11.33 (5.6)
            # of observations               6                6
  Ireland   Revenue                         37.52 (5.63)     36.36 (4.85)
            # of bidders                    5.8 (1.3)        6.7 (2.26)
            # of observations               10               10
Opening price of TWD 750 or EUR 0.01, high shipping TWD 180 or EUR 14
  Taiwan    Revenue                         96.92 (30.91)    93.26 (28.87)
            # of bidders                    10.17 (3.76)     10.5 (5.2)
            # of observations               6                6
  Ireland   Revenue                         36.93 (5.65)     38.94 (3.15)
            # of bidders                    7.0 (1.9)        6.9 (1.6)
            # of observations               10               10
Opening price of TWD 600, high shipping TWD 180 (Taiwan only)
  Taiwan    Revenue                         95.31 (30.03)    94.27 (30.53)
            # of bidders                    10.5 (3.7)       12.7 (4.1)
            # of observations               6                6

Note: Values are means with standard deviations shown in parentheses. Revenue is denoted in euros. In March 2006, TWD 1 = EUR 0.025. Shipping charges are "shrouded" when they are not included in the title or search results. Shipping charges are "disclosed" when they appear in the title and search results.
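The revenue pattern discussed below can be read directly off the shrouded-treatment means in Table I. A quick arithmetic check (values taken from the table, in euros):

```python
# Percentage revenue gain from the higher shipping charge under shrouding,
# computed from the Table I means (euros).
taiwan_low, taiwan_high = 88.89, 93.26      # SL vs. SH, Taiwan
ireland_low, ireland_high = 36.36, 38.94    # SL vs. SH, Ireland
print(f"Taiwan:  +{100 * (taiwan_high / taiwan_low - 1):.1f}%")    # about +4.9%
print(f"Ireland: +{100 * (ireland_high / ireland_low - 1):.1f}%")  # about +7.1%
```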
By pooling the data from the two countries, we can take advantage of a larger data set to estimate more precise effects. Three tests are reported: a standard t-test, a Wilcoxon signed-rank test, and a Fisher–Pitman exact permutation test. As the table shows, the statistical significance is similar across tests. Table II also presents permutation-based confidence intervals.4 4. Permutation-based confidence intervals are only valid under the null hypothesis of exchangeability. Thus, we construct these only for treatment pairs where we cannot reject the null.
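A minimal sketch of the paired sign-flip (Fisher–Pitman) permutation test, together with the companion paired t-test and Wilcoxon test, may help fix ideas. This is not the authors' code; the `diffs` values below are hypothetical, and our reading that the reported 90% intervals are percentiles of the sign-flip null distribution of the mean difference is an assumption.

```python
# Paired sign-flip permutation test on matched revenue differences.
import numpy as np
from scipy import stats

def paired_permutation_test(diffs, reps=200_000, seed=0):
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)
    observed = diffs.mean()
    # Under the null of no treatment effect, the sign of each paired
    # difference is exchangeable, so flip signs at random.
    signs = rng.choice([-1.0, 1.0], size=(reps, diffs.size))
    perm_means = (signs * diffs).mean(axis=1)
    p_value = np.mean(np.abs(perm_means) >= abs(observed))
    ci_90 = np.percentile(perm_means, [5, 95])  # percentiles of the null distribution
    return observed, p_value, ci_90

diffs = [4.0, 2.5, -1.0, 5.5, 3.0, 2.0]  # hypothetical paired differences (EUR)
mean_diff, p, ci = paired_permutation_test(diffs)
print(f"mean difference {mean_diff:.2f}, permutation p = {p:.3f}, 90% null interval {ci}")

# The same paired differences can be checked with the other two tests:
print(stats.ttest_1samp(diffs, 0.0))
print(stats.wilcoxon(diffs))
```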
TABLE II
SUMMARY OF PAIRWISE TESTS OF REVENUE AND NUMBER OF BIDDERS FOR YAHOO AND EBAY FIELD EXPERIMENTS

             # of pairs  Mean difference   t-test     Wilcoxon signed-  Fisher–Pitman        Monte Carlo permutation-based
             of obs.     (e.g., DL − SL)   t-stat     rank test z-stat  permutation p-value  90% confidence interval
Revenue
DL vs. SL    10          2.763             2.578∗∗    1.736∗            .047                 —
DH vs. SH    10          1.422             0.807      0.410             .445                 (−2.95, 2.95)
DL vs. DH    16          −1.126            0.853      0.724             .409                 (−2.16, 2.16)
SL vs. SH    16          −3.254            3.043∗∗∗   2.617∗∗∗          .011                 —
DH vs. DR    6           −1.605            0.793      0.420             .500                 (−3.09, 3.09)
SH vs. SR    6           1.008             0.488      0.216             .750                 (−3.25, 3.25)
DL vs. DR    6           −2.389            2.200∗     1.782∗            .094                 —
SL vs. SR    6           −5.376            4.997∗∗∗   2.201∗∗           .031                 —
# of bidders
DL vs. SL    10          −0.533            0.291      0.204             .805                 (−2.93, 2.93)
DH vs. SH    10          −0.271            0.148      0.307             .906                 (−2.18, 2.18)
DL vs. DH    16          −0.375            0.535      0.863             .666                 (−1.13, 1.13)
SL vs. SH    16          0.188             0.174      0.339             .921                 (−1.69, 1.69)
DH vs. DR    6           0.333             1.000      1.000             .625                 (−0.66, 0.66)
SH vs. SR    6           2.167             2.484∗∗    1.897∗∗           .094                 —
DL vs. DR    6           0.667             0.445      0.315             .750                 (−2.33, 2.33)
SL vs. SR    6           −1.333            0.623      0.954             .656                 (−3.33, 3.33)

Note: "D" indicates disclosed, "S" indicates shrouded, "L" indicates low shipping fees, and "H" indicates high shipping fees. "R" indicates Taiwan auctions with a high shipping fee and low opening price, designed to have a reserve equal to the reserve in treatment "L". Revenue is denoted in euros. In March 2006, TWD 1 = EUR 0.025. Permutation-based confidence intervals were constructed only when we failed to reject the null hypothesis of equality (200,000 replications). ∗, ∗∗, and ∗∗∗ represent statistical significance at the 10%, 5%, and 1% levels, respectively.
The effects of shrouding on revenues may be seen by comparing each item under treatment Dx with its pair under treatment Sx.5 Notice that, under low shipping, revenues declined with shrouding. Statistical tests indicate that this revenue difference is significant at the 5% level. Under high shipping, the effect is ambiguous—disclosure increased revenues in Taiwan but decreased them in Ireland. Formal statistical tests do not indicate a significant difference in revenues—confidence bounds suggest that revenue differences between shrouded and disclosed treatments under high shipping do not exceed EUR 2.95. Disclosing a low shipping charge might raise revenues by attracting more bidders, yet there is little evidence of this. Disclosure increased the number of bidders in Taiwan but decreased the number in Ireland. Statistical tests suggest that revenue differences cannot be attributed to changes in the number of bidders. Similarly, disclosure has no significant effect on the number of bidders under high shipping. How do shipping charges affect revenues under the different shrouding treatments? This may be seen by comparing each item under treatment xL with its pair under treatment xH. When shipping charges are disclosed, the revenue effect is ambiguous—more expensive shipping raises revenues in Taiwan but lowers them in Ireland. Once again, formal statistical tests fail to reject the hypothesis of no treatment effect—confidence bounds indicate that the effect is somewhere below EUR 2.16. In contrast, raising the shipping charge significantly increases revenues when it is shrouded—the winning bidder pays, on average, 5% more in Taiwan and 7% more in Ireland under high shipping. As Table II shows, this revenue difference is significant at about the 1% level. Shipping charges have only modest effects on the number of bidders attracted to each auction. In Taiwan, higher shipping charges attract slightly fewer bidders. In Ireland, they attract slightly more. Statistical tests are consistent with this observation—we cannot reject the null hypothesis of no treatment effect at conventional levels under either disclosure or shrouding. When the opening price is held fixed, raising the shipping charge increases the reserve level of the auction. Comparing treatments xH to xR isolates a pure reserve effect. Regardless of disclosure, there is no statistical difference between these treatments. 5. When multiple identical items were sold under the same treatment, we used mean revenue as the unit of observation, leading to ten observations for ten different types of iPod.
In contrast, comparing treatments xL to xR isolates a pure shipping effect. Here we find that raising the shipping charge increases revenues, but the effect is more pronounced when shipping costs are shrouded.6 This revenue difference is significant at the 10% level under disclosure and the 5% level under shrouding. To summarize, changes in the reserve level do not appear to drive auction revenues. II.D. Discussion The main findings that emerge from the field experiments are (1) shrouding a low shipping charge is a money-losing strategy; (2) raising shipping charges increases revenue, particularly when they are shrouded; and (3) these revenue differences cannot be attributed to changes in the number of bidders. We sketch a model that can explain these findings. Suppose that the number of bidders is fixed. Some bidders are attentive—they are fully aware of the shipping charge. Others are naive—they are unaware of the exact shipping charge, but believe it to be extremely low.7 Finally, suspicious bidders are also unaware of the exact shipping charge, but assume that it will be high.8 With disclosure, a fraction of the naive and suspicious bidders become aware of the exact shipping charge and change their bids. Suspicious bidders raise their bids because the actual shipping charge is lower than their expectations, whereas naive bidders lower their bids because the shipping charge is unexpectedly high. When the shipping charge is low, the net effect of disclosure is to increase seller revenues, because the gains from suspicious bidders outweigh the losses from naive bidders. The reverse is true when the shipping charge is high. Thus, there is a shipping charge threshold below which disclosure is optimal and above which sellers prefer to shroud. Increasing the shipping charge causes attentive bidders to reduce their bids on a one-for-one basis. Bids of naive and suspicious bidders, who are unaware of the exact shipping charge, do not respond to this change. The net effect is to improve seller revenues. When the shipping charge is shrouded, this improvement is larger than when the shipping charge is disclosed because a smaller fraction of bidders adjust their bids. 6. The revenue difference between treatments SL and SR is consistent with the findings of Hossain and Morgan (2006), who also found that revenues increased with higher shipping charges, holding the reserve fixed. Unlike their findings, we do not see a treatment difference in the number of bidders. 7. Such behavior might arise if consumers anchored on the base price (Kahneman and Tversky 1979). 8. We are grateful to an anonymous referee for suggesting a model along these lines.
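A stylized numerical sketch of this bidder-type story follows. It is an illustrative reading of the argument rather than the authors' formal model; the bidder counts, the value distribution, the naive and suspicious beliefs, the awareness rate under disclosure, and the second-price payment rule are all assumed for the example.

```python
# Stylized simulation: attentive, naive, and suspicious bidders in an
# ascending (second-price-like) auction where the seller also collects
# the actual shipping charge. Parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(1)

def expected_revenue(shipping, disclosed, n_sims=20_000):
    n_att, n_naive, n_susp = 3, 3, 3              # assumed bidder counts
    naive_belief, susp_belief = 1.0, 15.0          # assumed beliefs about shipping
    aware_share = 0.6 if disclosed else 0.0        # share of naive/suspicious bidders
                                                   # who notice a disclosed charge
    revenues = []
    for _ in range(n_sims):
        values = rng.uniform(30, 50, size=n_att + n_naive + n_susp)
        beliefs = np.concatenate([
            np.full(n_att, shipping),                                        # attentive
            np.where(rng.random(n_naive) < aware_share, shipping, naive_belief),
            np.where(rng.random(n_susp) < aware_share, shipping, susp_belief),
        ])
        bids = values - beliefs              # bid = value minus perceived shipping
        price = np.sort(bids)[-2]            # winner pays roughly the second-highest bid
        revenues.append(price + shipping)    # seller also receives the actual shipping
    return float(np.mean(revenues))

for shipping in (2.0, 12.0):
    for disclosed in (True, False):
        rev = expected_revenue(shipping, disclosed)
        print(f"shipping={shipping:>4}, disclosed={disclosed}: avg revenue {rev:.2f}")
```

With these assumed parameters the simulation reproduces the qualitative pattern in the text: disclosure helps when the shipping charge is low, hurts when it is high, and a higher shipping charge raises revenue by more when it is shrouded.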
III. NATURAL EXPERIMENT On October 28, 2004, eBay U.S. announced a change in its search format—prospective bidders would now have the option of seeing the shipping charge for each auction on the results page. Prior to this, users had to read the body of each auction listing to learn the shipping charge. eBay also increased the visibility of shipping charges by displaying them on the bid confirmation screen. This action shifted the default from shrouding to disclosure of shipping charges. We obtained a data set used in Tyan (2005), consisting of successful auctions for gold and silver coins conducted on eBay's U.S. site from September to December 2004. In this data set, we classify the shipping charges for each auction as either "shrouded" or "disclosed." Shipping charges are shrouded when they are not included in the title or search results and disclosed when they are included. Shrouded auctions are those ending prior to October 27, 2004, whereas disclosed auctions are those beginning after November 10, 2004.9 Auctions between these dates are omitted. Table III summarizes the revenue (including shipping), opening price, shipping charge, and number of unique bidders for the shrouded and disclosed auctions of gold and silver coins. Interestingly, average revenues are higher when the shipping charge is disclosed than when it is shrouded. The increase, however, cannot be attributed to differences in the number of bidders—shrouded auctions attract about the same number of bidders as do disclosed auctions. We study changes in shrouding and shipping charges using the following regression:
(1)   revenue = β0 + β1 shipping + β2 opening + β3 disclosed + β4 disclosed × shipping + β5 disclosed × opening + γ X + ε,
where X is a matrix of control variables. For the field experiments, we include product fixed effects. For silver coins, we use a dummy for whether the coin was graded. 9. Results are robust to variations in these cutoff dates.
TABLE III
SUMMARY STATISTICS FOR GOLD AND SILVER COIN AUCTIONS

                              Disclosed        Shrouded
Gold coins
  Revenue                     67.45 (22.00)    62.12 (16.92)
  Opening price               12.17 (21.81)    9.04 (17.02)
  Shipping charge             4.55 (1.37)      4.81 (1.90)
  # of bidders                6.15 (2.48)      6.34 (2.44)
  # of observations           162              124
Silver coins
  Revenue                     45.72 (4.19)     42.49 (4.18)
  Opening price               24.10 (16.16)    18.98 (15.98)
  Shipping charge             5.08 (1.27)      4.95 (1.48)
  # of bidders                4.53 (2.92)      4.37 (2.70)
  # of observations           306              212

Note: Values are means with standard deviations shown in parentheses. Shipping charges are "shrouded" when they are not included in the title or search results. Shipping charges are "disclosed" when they appear in the title and search results. Data from silver and gold coin auctions were provided by Tyan (2005). For the coin data, shrouded auctions are those ending prior to October 27, 2004, whereas disclosed auctions are those beginning after November 10, 2004. Auctions between these dates are omitted.
For gold coins, we use dummies for each grade interacted with dummies for the grading organization. We also control for whether the coin was listed as a "proof" or "brilliant uncirculated." Controls for photographs, acceptance of PayPal or credit cards, and the decile of the sellers' feedback rating are used for all coin auctions. To account for heteroscedasticity, we report robust standard errors. Table IV presents the results of this analysis. If shrouding matters, then we should reject the hypothesis that the coefficients associated with disclosure are all equal to zero (β3 = β4 = β5 = 0). Table IV reports that this is the case in all instances. What happens when a seller increases the shipping charge but leaves the reserve level unchanged? If all bidders were attentive, this would have no effect on revenues (under shrouding, β1 = β2; under disclosure, β1 + β4 = β2 + β5). When shipping charges are shrouded, we reject this hypothesis.
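As a rough sketch of how equation (1) and these restriction tests could be run for the iPod data, assuming a pandas DataFrame with hypothetical columns revenue, shipping, opening, disclosed, and item (the coin regressions would swap in their own controls); this is not the authors' code:

```python
# Estimate equation (1) with robust standard errors and test the linear
# restrictions discussed in the text. Interaction columns are built by hand
# so the restriction strings stay simple.
import pandas as pd
import statsmodels.formula.api as smf

def estimate_equation_1(df: pd.DataFrame):
    df = df.copy()
    df["disc_ship"] = df["disclosed"] * df["shipping"]
    df["disc_open"] = df["disclosed"] * df["opening"]
    fit = smf.ols(
        "revenue ~ shipping + opening + disclosed + disc_ship + disc_open + C(item)",
        data=df,
    ).fit(cov_type="HC1")  # heteroscedasticity-robust standard errors

    # Does disclosure matter at all?  (beta3 = beta4 = beta5 = 0)
    print(fit.f_test("disclosed = 0, disc_ship = 0, disc_open = 0"))
    # Shrouded pass-through with the reserve held fixed  (beta1 = beta2)
    print(fit.f_test("shipping = opening"))
    # The same comparison under disclosure  (beta1 + beta4 = beta2 + beta5)
    print(fit.f_test("shipping + disc_ship = opening + disc_open"))
    # Is the shipping coefficient one-for-one?  (beta1 = 1, see the conclusions)
    print(fit.t_test("shipping = 1"))
    return fit
```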
TABLE IV
REGRESSIONS OF TOTAL AUCTION REVENUE FOR IPOD AND COIN AUCTIONS

                                   iPods (EUR)        Gold coins (USD)   Silver coins (USD)
Coefficient estimates
β1 shipping charge                 1.130∗∗∗ (0.320)   2.031∗∗∗ (0.569)   0.888∗∗∗ (0.178)
β2 opening price                   −0.101 (0.378)     0.013 (0.046)      0.079∗∗∗ (0.015)
β3 disclosed                       6.991 (8.634)      4.053 (4.941)      4.261∗∗∗ (1.392)
β4 disclosed × shipping charge     −0.470∗∗ (0.266)   −0.359 (1.218)     −0.290 (0.253)
β5 disclosed × opening price       −0.140 (0.446)     0.048 (0.075)      −0.013 (0.021)
F-tests
β3 = β4 = β5 = 0                   4.17∗∗∗            2.1∗               18.47∗∗∗
  d.f.                             (3, 61)            (3, 261)           (3, 499)
β1 = β2                            4.48∗∗             11.95∗∗∗           20.45∗∗∗
  d.f.                             (1, 61)            (1, 261)           (1, 499)
β1 + β4 = β2 + β5                  2.20               2.15               8.45∗∗∗
  d.f.                             (1, 61)            (1, 261)           (1, 499)
# of observations                  76                 286                518

Note: The values in parentheses are robust standard errors. For experimental data, "disclosed" = 1 when the shipping charge was listed in the item title. For field data, "disclosed" = 1 when the auction occurred after November 10, 2004. iPod regressions include item-specific fixed effects. Coin regressions include controls for condition, grade, seller reputation, and other auction characteristics. ∗, ∗∗, and ∗∗∗ represent statistical significance at the 10%, 5%, and 1% levels, respectively.
A one-dollar increase in shipping with an equal reduction in the opening price raises revenue. When shipping charges are disclosed, we can reject the null hypothesis for silver coins, but not for other items. In all cases, increasing shipping by a dollar while holding the reserve level constant has a smaller revenue effect when the shipping charge is disclosed than when it is shrouded. An average seller benefited from the increased disclosure of shipping charges due to eBay's format change. Formally, we reject the hypothesis that an average seller earned the same revenue under shrouding and disclosure (β3 + β4 × average shipping charge + β5 × average opening price = 0; F(1,261) = 4.48 for gold coins and F(1,499) = 50.58 for silver coins). Are differences in the number of bidders driving the revenue effects? To examine this, we change the dependent variable in equation (1) to the number of unique bidders.
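The average-seller comparison is a test of one linear combination of the disclosure coefficients. A hedged sketch, reusing the hypothetical fitted model and hand-built interaction columns from the earlier snippet:

```python
# Evaluate the disclosure effect for an "average" auction, i.e., at the mean
# shipping charge and mean opening price (column names are assumptions).
def average_seller_test(fit, df):
    avg_ship = df["shipping"].mean()
    avg_open = df["opening"].mean()
    hypothesis = f"disclosed + {avg_ship} * disc_ship + {avg_open} * disc_open = 0"
    print(fit.f_test(hypothesis))
```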
TABLE V
REGRESSIONS OF TOTAL NUMBER OF BIDDERS FOR IPOD AND COIN AUCTIONS

                                   iPods              Gold coins          Silver coins
Coefficient estimates
β1 shipping charge                 0.244 (0.394)      0.124 (0.078)       −0.089 (0.089)
β2 opening price                   −0.228 (0.350)     −0.077∗∗∗ (0.005)   −0.132∗∗∗ (0.007)
β3 disclosed                                                              0.969 (0.756)
β4 disclosed × shipping charge                                            0.066 (0.132)
β5 disclosed × opening price                                              −0.019∗∗ (0.010)
F-tests
β3 = β4 = β5 = 0                   0.51               0.44                12.2∗∗∗
  d.f.                             (3, 61)            (3, 261)            (3, 499)
β1 = β2                            0.44               6.44                0.23
  d.f.                             (1, 61)            (1, 264)            (1, 499)
β1 + β4 = β2 + β5                                                         1.83
  d.f.                                                                    (1, 499)
# of observations                  76                 286                 518

Note: The values in parentheses are robust standard errors. For experimental data, "disclosed" = 1 when the shipping charge was listed in the item title. For field data, "disclosed" = 1 when the auction occurred after November 10, 2004. iPod regressions include item-specific fixed effects. Coin regressions include controls for condition, grade, seller reputation, and other auction characteristics. ∗, ∗∗, and ∗∗∗ represent statistical significance at the 10%, 5%, and 1% levels, respectively.
Table V presents the results of this analysis. We observe a shrouding effect on the number of bidders only for silver coins. For all other data, we cannot reject the hypothesis that the disclosure coefficients are all equal to zero (β3 = β4 = β5 = 0). Moreover, in every instance, shipping charge coefficients are statistically indistinguishable from zero. There is little evidence that changes in the number of bidders are responsible for the observed revenue differences. Instead, revenue differences are likely a result of differences in the bids being placed. The regression results complement those of the field experiment: (1) shrouding affects revenues; (2) raising the shipping charge increases revenues, and the effect is stronger under shrouding; and (3) these differences are not attributable to changes in the number of bidders. The finding that disclosure on eBay increased average seller revenues, however, presents a
puzzle. If disclosure were profitable, then why didn’t more sellers disclose their shipping charges in the titles of their listings? Prior to the institutional change on eBay, an individual seller would not benefit by switching from shrouding to disclosing a high shipping charge. Revenues would fall if more naive bidders than suspicious ones became aware of the shipping charge, because newly aware naives would then lower their bids. In contrast, disclosure is profitable for sellers offering low shipping charges. A marketwide change is likely to have different effects on awareness. In particular, suppose that suspicious bidders are more technologically sophisticated than naive bidders and hence more likely to adjust their user preferences to make shipping visible following the changes to eBay’s site. Now, if a seller discloses a high shipping charge, newly aware suspicious bidders will raise their bids (so long as the charge is below their expectations), and revenues will increase. Similarly, sellers offering a low shipping charge will also benefit from disclosure. As a result, overall seller revenues can increase with such a change even when disclosure was previously unprofitable (for high–shipping charge sellers). IV. CONCLUSIONS Although sellers often shroud their shipping charges in online auctions, our findings suggest that the profitability of this strategy depends on the size of the charge. In field experiments, we find that shrouding a low shipping charge actually reduces seller revenues, whereas shrouding a high shipping charge does not improve revenues relative to disclosure. Using field data from eBay, we find that an institutional change toward transparency may raise revenues for the average seller. Shrouding and partitioned pricing are complements—a seller can increase revenues by raising its shipping charge when shrouded, but not under disclosure. These revenue effects are not attributable to changes in the number of bidders. Perhaps most surprising is the large revenue effect of raising shipping charges under shrouding. Indeed, for all products, the estimated effect of raising the shipping charge (β1 in Table IV) is statistically indistinguishable from 1 at the 5% level.10 10. For gold coins, the coefficient is more than one. Formally, we can reject the null hypothesis that β1 = 1 at the 7% level.
That is, at the current level of shipping fees, a marginal one-dollar increase in shipping fees passes through directly to seller revenues.
NORTHWESTERN UNIVERSITY
UNIVERSITY OF TORONTO
UNIVERSITY OF CALIFORNIA, BERKELEY
REFERENCES
Bertini, Marco, and Luc Wathieu, "Attention Arousal through Price Partitioning," Marketing Science, 27 (2008), 236–246.
Chetty, Raj, Adam Looney, and Kory Kroft, "Salience and Taxation: Theory and Evidence," American Economic Review, 99 (2009), 1145–1177.
DellaVigna, Stefano, "Psychology and Economics: Evidence from the Field," Journal of Economic Literature, 47 (2009), 315–372.
DellaVigna, Stefano, and Ulrike Malmendier, "Contract Design and Self-Control: Theory and Evidence," Quarterly Journal of Economics, 119 (2004), 353–402.
Ellison, Glenn, "A Model of Add-on Pricing," Quarterly Journal of Economics, 120 (2005), 585–637.
——, "Bounded Rationality in Industrial Organization," in Advances in Economics and Econometrics: Theory and Applications, Ninth World Congress, R. Blundell, W. Newey, and T. Persson, eds. (Cambridge, UK: Cambridge University Press, 2006).
Ellison, Glenn, and Sara Fisher Ellison, "Search, Obfuscation, and Price Elasticities on the Internet," Econometrica, 77 (2009), 427–452.
Gabaix, Xavier, and David Laibson, "Shrouded Attributes, Consumer Myopia, and Information Suppression in Competitive Markets," Quarterly Journal of Economics, 121 (2006), 505–540.
Glover, Brent, and Yaron Raviv, "Revenue Non-equivalence between Auctions with Soft and Hard Closing Mechanism: New Evidence from Yahoo!" University of Pennsylvania Working Paper, 2007.
Hossain, Tanjim, and John Morgan, ". . . Plus Shipping and Handling: Revenue (Non)equivalence in Field Experiments on eBay," Advances in Economic Analysis and Policy, 6 (2006), Article 3.
Jovanovic, Boyan, "Truthful Disclosure of Information," Bell Journal of Economics, 13 (1982), 36–44.
Kahneman, Daniel, and Amos Tversky, "Prospect Theory: An Analysis of Decision under Risk," Econometrica, 47 (1979), 263–291.
——, "Choices, Values, and Frames," American Psychologist, 39 (1984), 341–350.
Miao, Chun-Hui, "Consumer Myopia, Standardization and Aftermarket Monopolization," University of South Carolina Working Paper, 2006.
Milgrom, Paul R., "Good News and Bad News: Representation Theorems and Applications," Bell Journal of Economics, 12 (1981), 380–391.
Morwitz, Vicki G., Eric A. Greenleaf, and Eric J. Johnson, "Divide and Prosper: Consumers' Reaction to Partitioned Prices," Journal of Marketing Research, 35 (1998), 453–463.
Smith, Michael D., and Erik Brynjolfsson, "Consumer Decision-Making at an Internet Shopbot: Brand Still Matters," Journal of Industrial Economics, 49 (2001), 541–558.
Spiegler, Ran, "Competition over Agents with Boundedly Rational Expectations," Theoretical Economics, 1 (2006), 207–231.
Thaler, Richard, "Mental Accounting and Consumer Choice," Marketing Science, 4 (1985), 199–214.
Tyan, Sean, "The Effect of Shipping Costs on Bidder Entry and Seller Revenues in eBay Auctions," Senior Thesis, Department of Economics, Stanford University, 2005.
Woodyard, Chris, "Hotels Face Lawsuits on Surcharges for Phones, Energy," USA TODAY, September 26, 2004.