The Econometrics Journal (2010), volume 13, pp. Si–Sii. doi: 10.1111/j.1368-423X.2010.00321.x
Recent developments in structural microeconometrics
EDITORIAL
The papers in this Special Issue of The Econometrics Journal arose from the 19th meeting of (EC)² (European Conferences of the Econom(etr)ics Community in Econometrics and Quantitative Economics), held in Rome, Italy, on December 19–20, 2008, at ‘Palazzo Koch’, which houses the Bank of Italy. The focus of the Conference was on ‘Recent Developments in Structural Microeconometrics’. Both theoretical and applied contributions were presented. The Local Organizing Committee consisted of Franco Peracchi (Tor Vergata University and EIEF) and Stefano Siviero (Bank of Italy). The Scientific Committee comprised Jerome Adda (University College London), Jaap Abbring (Free University in Amsterdam), Stephane Bonhomme (CEMFI, Madrid), Moshe Buchinsky (University of California, Los Angeles), Luigi Guiso (European University Institute), Pedro Mira (CEMFI, Madrid), Aviv Nevo (Northwestern University) and Luigi Pistaferri (Stanford University). Jean-Marc Robin (Sciences Po and University College London) acted as Programme Chair.
Invited lectures were presented by Costas Meghir (University College London), Jean-Pierre Florens (University Toulouse I), Zvi Eckstein (Tel Aviv University) and Elie Tamer (Northwestern University). In addition, 19 contributed papers were presented, divided more or less equally between theoretical and empirical topics. The programme is available online at http://www.ec2-rome2008.net/.
This Special Issue of The Econometrics Journal gathers together six of the papers presented at the conference. The survey paper by Frédérique Fève and Jean-Pierre Florens (Toulouse School of Economics) is concerned with practical issues associated with the solution of inverse problems arising in non-parametric estimation. They study the particular case of transformation models of the form $\varphi(Y) = \beta' X + U$ under exogeneity or instrumental variable assumptions.
Given the increasing importance of inverse problems, this paper should be both of interest and a valuable resource for many researchers.
Ivana Komunjer and Andres Santos (UCSD) deal with non-separable structural models of the form $Y = m_\alpha(X, U)$ with $U$ uniformly distributed on $(0, 1)$, in which $m_\alpha$ is a known real function parametrized by a structural parameter $\alpha$ that contains both a finite dimensional and an infinite dimensional component. This type of model is frequently encountered in structural economic applications. They employ a minimum distance from independence criterion. Consistency and rates of convergence are obtained for the estimator of the non-parametric component, whereas the estimator of the finite dimensional parameter is consistent and asymptotically normally distributed. Unlike the existing literature, which has been primarily concerned with issues of identification, this paper provides an estimation procedure which should appeal to practitioners.
The paper by Leandro Maschietto Magnusson (Tulane University) proposes tests for structural parameters in limited dependent variable models with endogenous explanatory variables which are robust to weak identification. These tests are based upon the generalized minimum distance principle. The paper compares the proposed tests to alternative Wald tests in a simulation experiment. The tests are also used to analyse female labour supply and the demand for cigarettes.
Anne Vanhems (Toulouse School of Economics) analyses a structural microeconomic relation describing exact consumer surplus in a non-parametric setting with endogenous prices. Consumer surplus here is characterized as the solution of a differential equation involving the observed demand function. This research tackles a particular econometric inverse problem employing non-parametric instrumental variable regression and is a notable contribution to this literature.
The last two papers are more classical examples of structural econometrics applied to employment and retirement issues. Peter Haan (DIW Berlin) and Victoria Prowse (University of Oxford) estimate a dynamic structural life-cycle model of employment, non-employment and retirement. The estimated model is used to evaluate the employment effects of a tax reform focused on low-income individuals. Fedor Iskhakov (University of Oslo) provides an empirical analysis of substitution between early retirement and disability as two major exit routes from the labour market in Norway. This paper uses Norwegian register data. Administrative data are a relatively new source of microeconomic data and are becoming increasingly popular in empirical microeconomics.
The (EC)² meeting was particularly successful, bringing together both theorists and practitioners in statistics and econometrics. The Special Issue is representative of the very high quality of the papers that were presented. In conclusion, I would like to take this opportunity to extend the gratitude of The Econometrics Journal to the contributors for their submissions. Special thanks are owed to the referees of the papers comprising this Special Issue, without whose assistance it would not have been possible.
© 2010 The Author(s). The Econometrics Journal © 2010 Royal Economic Society. Published by Blackwell Publishing Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
Jean-Marc Robin
Department of Economics (Office J402), Sciences Po, 28 rue des Saints Pères, 75007 Paris, France
Department of Economics, University College London, Drayton House, 30 Gordon Street, London WC1H 0AX, UK
The Econometrics Journal (2010), volume 13, pp. S1–S27. doi: 10.1111/j.1368-423X.2010.00314.x
The practice of non-parametric estimation by solving inverse problems: the example of transformation models
FRÉDÉRIQUE FÈVE† AND JEAN-PIERRE FLORENS†
†IDEI and Toulouse School of Economics, Université de Toulouse Capitole, 21 allée de Brienne, 31000 Toulouse, France.
E-mails: [email protected], [email protected]
First version received: March 2009; final version accepted: January 2010
Summary
This paper considers a semi-parametric version of the transformation model $\varphi(Y) = \beta' X + U$ under exogeneity or instrumental variables assumptions ($E(U \mid X) = 0$ or $E(U \mid \text{instruments}) = 0$). This model is used as an example to illustrate the practice of estimation by solving linear functional equations. The paper focuses especially on the data-driven selection of the regularization parameter and of the bandwidths. Simulation experiments illustrate the relevance of this approach.
Keywords: Instrumental variables, Integral equations, Selection of the regularization parameter and of the bandwidths, Tikhonov regularization.
1. INTRODUCTION
The objective of this paper is to provide a simple guideline for the estimation of functional econometric parameters based on Tikhonov regularization of ill-posed linear inverse problems. We concentrate our presentation around a class of examples, namely the transformation models. This model is characterized by the relation $\varphi(Y) = \beta' X + U$ and has been extensively studied in the econometric literature following in particular the paper by Horowitz (1996). The origin of the transformation models is probably the Box–Cox model, where $\varphi(y) = \frac{y^a - 1}{a}$ if $a \neq 0$ and $\varphi(y) = \ln y$ if $a = 0$. Several extensions of this family of transformations have been proposed (see Horowitz, 1998, ch. 5, for references). These models are essentially parametric and have been estimated under endogeneity using instruments by GMM. In this paper, we treat this model semi-parametrically: $\varphi$ is a functional element and $\beta$ is a vector of parameters.
We assume that $\varphi$ is monotone, and a particular example is the case $\varphi = S^{-1}$, where $S$ is the cumulative distribution or the survivor function of a random variable. This example covers in particular market share models ($Y$ is the observed proportion of individuals who take the choice 1 between 0 and 1; the choice 1 is selected if an individual characteristic $\theta$ is greater than $\beta' X + U$, and $1 - S$ is the c.d.f. of $\theta$). An extension of this market share model covers the econometric models derived from the theory of two-sided markets. For example, let us take the credit card market. The share of users of the credit card depends on the share of stores which accept the credit card, and the share of stores depends on the share of users. This creates a system of transformation models, which may be analysed in a limited information approach by transformation models with endogenous variables (see Rochet and Tirole, 2003, and Argentesi and Fillistrucchi, 2007).
Many other examples and references to previous papers may be found in Horowitz’s (1998) book. In particular, this class of model includes semi-parametric analysis of duration models where $\varphi$ is the integrated hazard function. More recent references on this model may be found in Linton et al. (2008), for example.
Two main differences characterize our model. We do not assume independence between $U$ and $X$, and we consider two cases: $X$ exogenous, defined by $E(U \mid X) = 0$, or $X$ endogenous. In the latter case the model is estimated using instrumental variables. For identification reasons we normalize $\beta$ such that one element is equal to one and we consider the model
$$\varphi(Y) = Z + \beta' W + U, \tag{1.1}$$
where $Z$ may be endogenous. In that case we assume that there exists a vector $R$ of instruments such that $E(U \mid R, W) = 0$. The simplest case consists in the model $\varphi(Y) = Z + U$, where $E(U \mid Z) = 0$ (exogeneity condition). Even in this case the parameter of interest $\varphi$ should be considered as the solution of the equation
$$E(\varphi(Y) \mid Z) = Z \tag{1.2}$$
or, in terms of density functions,
$$\int \varphi(y) f(y \mid z)\,dy = z. \tag{1.3}$$
Then the estimation of $\varphi$ may be obtained by first estimating the conditional expectation operator $E(\varphi(Y) \mid Z)$ and second solving equation (1.2). The more general model (1.1) under an instrumental variables assumption satisfies the condition
$$E(\varphi(Y) \mid W, R) = E(Z \mid W, R) + \beta' W, \tag{1.4}$$
where the two conditional expectations may be estimated and equation (1.4) needs to be solved w.r.t. $\varphi$ and $\beta$ in order to estimate the parameters.
This example illustrates the inverse problems approach in econometrics. Economic theory defines a structural model where the (possibly functional) parameters $\varphi$ are linked with the observation scheme by a (functional) equation $A(\varphi, F) = 0$, where $F$ is the data cumulative distribution function. The statistical analysis is then performed in two steps. First, we estimate the equation using, for example, an i.i.d. sample of data whose distribution is $F$ and, second, we solve the equation in order to recover the parameters of interest. This approach is very common in econometrics, and a usual example is provided by GMM, where the parameters $\varphi$ and $F$ are linked by a relation $E_F(h(X, \varphi)) = 0$.
The main question raised by the non-parametric approach concerns the ill-posedness of the inversion. The solution of the equation may not exist or is not in general a continuous function of the estimated part of the equation. The estimator is then inconsistent in many cases. There exist several ways to treat this problem: we can restrict the parameter space to be compact (see Ai and Chen, 2003, or Newey and Powell, 2003) or we can keep the parameter space general by introducing a regularized solution of the equation. Instead of solving $A(\varphi, F) = 0$, the principle is to minimize a penalized criterion
$$\|A(\varphi, F)\|^2 + \alpha \|\varphi\|^2, \tag{1.5}$$
where the norms are suitably chosen and where $\alpha$ goes to zero at a suitable rate. The minimization of (1.5) leads to the Tikhonov regularization approach but other regularizations may be used. The regularized solutions of ill-posed inverse problems are standard in numerical analysis and in image treatment and have been introduced in econometrics to solve GMM estimation in infinite dimension (see Carrasco and Florens, 2000) and in non-parametric estimation using instrumental variables (see Darolles et al., 2003, Florens, 2003, Hall and Horowitz, 2005, and Carrasco et al., 2007).
The main objective of this paper is to present an introduction to inverse problems, covering both their practical implementation and the main mathematical arguments behind the derivation of the rate of convergence. This paper also contains several contributions to this literature. Identification of the transformation model without independence is based on standard tools but it contains new results. For example, the estimation of the transformation model under a mean independence condition is a contribution of this paper. The rate of convergence of the estimators has not been derived in previous articles on inverse problems with instrumental variables, even though the proof rests on general arguments which have been developed in, for example, Carrasco et al. (2007). The selection of $\alpha$ we suggest is derived from a known technique, and cross-validation selection of the bandwidth is standard, but the recursive application of these approaches has not been presented previously in the literature.
This paper is not a survey of inverse problems in econometrics (see e.g. Florens, 2003, and Carrasco et al., 2007, for more examples of the application of this theory). However, we may locate our basic examples in the general class of ill-posed inverse problems. The main characteristic of our example is that it is linear, with an unknown integral operator. This operator is a conditional expectation operator.
Linear inverse problems take the form $T\varphi = r$, and usually only $r$ is estimated and $T$ is given. This is not true in our case, where $T\varphi$ is equal to the conditional expectation of $\varphi$ given some random elements. Other relevant models belong to this class, essentially the basic non-parametric instrumental variables model ($Y = \varphi(Z) + U$, $E(U \mid W) = 0$), which leads to $E(\varphi(Z) \mid W) = E(Y \mid W)$, very similar to the model treated in Section 5. This question has been addressed in Darolles et al. (2003) and Hall and Horowitz (2005) in particular. This non-parametric inverse problem has been extended to additive models (see Florens et al., 2006) and has been used to test parametric specifications (see Horowitz, 2006) or exogeneity assumptions (see Blundell and Horowitz, 2007). In all these cases, the problem is ill-posed because $T$ is a compact integral operator. The problem becomes well posed if equations $\varphi + T\varphi = r$ are considered, where $T$ may still be an unknown conditional expectation operator (see Mammen and Yu, 2009). The literature on inverse problems is essentially theoretical in econometrics but an empirical application is presented in Blundell et al. (2007). The link between instrumental variables and simultaneous equations models is treated in Chernozhukov et al. (2007).
Linear ill-posed inverse problems where the operator is not the conditional expectation operator are also relevant in econometrics. A class of examples is based on covariance operators ($T\varphi = E(X \langle W, \varphi \rangle)$, estimated by $\hat{T}\varphi = \frac{1}{n} \sum_{i} x_i \langle w_i, \varphi \rangle$), which define an ill-posed problem if the data are functional data (see Cardot and Johannes, 2009, and Florens and Van Bellegem, 2009). An illustration is the linear instrumental regression model with many regressors and instruments (see Carrasco, 2008). The different forms of deconvolution problems (when the operator is the convolution with a known or unknown density) have generated a huge literature and are particularly well suited to many econometric problems (see e.g. Bonhomme and Robin, 2008, or Carrasco and Florens, 2009).
More recently, researchers have become interested in non-linear inverse problems motivated by non-separable models (to treat quantile regression under endogeneity, for example, see Gagliardini and Scaillet, 2007, Horowitz and Lee, 2007, and Chernozhukov et al., 2009) or by a general approach to GMM with functional parameters (see Chen and Pouzo, 2008, and Chen and Pouzo, 2009). These models raise difficult numerical questions. Other non-linear inverse problems come from game-theoretic models (see Florens and Sbai, 2010). Inverse problems in more complex spaces or using other classes of operators may be found in Gautier and Kitamura (2008) and in Hoderlein et al. (2010).
The paper is organized as follows. In Section 2, the model is described and the identification is examined. Section 3 presents a simple example in more detail. Some asymptotic properties are considered in Section 4, and the semi-parametric extension and the instrumental variables approach are studied in Section 5. Numerous simulations and some technical details are reported in Appendices A and B.
2. AN EXAMPLE OF A SEMI-PARAMETRIC TRANSFORMATION MODEL
We assume that all the variables and functions that we consider are square integrable. The model satisfies
$$\varphi(Y) = Z + \beta' W + U, \qquad Y \in \mathbb{R},\ Z \in \mathbb{R},\ W \in \mathbb{R}^k, \tag{2.1}$$
where $U$ is an unobservable noise. The model is semi-parametric and the parameter space contains a function $\varphi$ and a vector $\beta \in \mathbb{R}^k$. Equation (2.1) may be completed by one of these hypotheses:
$$\text{Exogeneity:}\quad E(U \mid Z, W) = 0, \tag{2.2}$$
$$\text{Instrumental variables:}\quad E(U \mid R, W) = 0, \tag{2.3}$$
where $R$ is a random vector. As we will see below, these mean independence conditions are sufficient, up to some regularity assumptions, to identify $\varphi$ and $\beta$, and an estimation procedure will be naturally derived from condition (2.2) or (2.3).
The Box–Cox models or their extensions are naturally developed in the regression case, i.e. with $E(U \mid Z) = 0$, and not under an independence assumption between $U$ and $Z$. The main motivation for analysing the problem under these weaker assumptions is to consider cases where high-order moments of $U$ may depend on $(Z, W)$ or $(R, W)$. In practice, heteroscedasticity is extremely common in empirical research. The theory is well established when $U$ and $(Z, W)$ are independent, but this is not so when $Z$ is endogenous. In that case the treatment of the problem leads to a non-linear integral equation (as in Horowitz and Lee, 2007) that may be difficult to analyse. It follows that in the endogenous case the mean independence conditions lead to a simpler procedure for the estimation of $\varphi$ and $\beta$. If we trust the full independence assumption and want to use a non-linear Tikhonov procedure (or other regularization methods for the non-linear inverse problem), our estimator will provide a natural (because consistent) initial value for the required recursive procedure.
To analyse the identification of these models, we need to recall two important concepts extensively used in the theory of the resolution of inverse problems involving conditional expectation operators. First, a random element $A$ is said to be strongly identified by $B$ given $C$ if $E(g(A, C) \mid B, C) = 0$ a.s. implies $g = 0$ a.s. for any square integrable function $g$ (see Florens et al., 1990). This concept has been introduced in statistics under the name ‘completeness’ in a particular case. Second, a random element $A$ is said to be measurably separated from another random element $B$ if the equality $g(A) = h(B)$ a.s. implies $g = h = \text{constant}$ a.s. (see also Florens et al., 1990). This concept also has a long history in statistics in the theory of sufficient and ancillary statistics. The identification theorem may then be written as follows.
THEOREM 2.1. Let us consider model (2.1) under the exogeneity condition (2.2). Let us assume:
(i) Assumption A1: $E(WW')$ is invertible and $W$ only contains non-constant variables;
(ii) Assumption A2: $Y$ is strongly identified by $Z$ given $W$;
(iii) Assumption A3: $Y$ and $W$ are measurably separated.
Then $\varphi$ and $\beta$ are identified.
The following generalization also holds:
THEOREM 2.2. Let us consider model (2.1) under (2.3). Let us assume A1, A3 and A2′, where Assumption A2′ is: $Y$ is strongly identified by $R$ given $W$. Then $\varphi$ and $\beta$ are identified.
The assumptions of Theorems 2.1 and 2.2 are not immediately interpretable. However, they can be illustrated by the following comments. Assumption A3 ($Y$ and $W$ are measurably separated) is essentially a support condition (for a precise statement see Florens et al., 2008). It means that there does not exist an exact relation between $W$ and $Y$ or, equivalently, that the derivative of any function of $W$ w.r.t. $Y$ is zero. Then $\varphi(Y) - \beta' W = 0$ implies $\frac{d\varphi}{dY} = 0$ and $\varphi = \text{constant}$.
This hypothesis is false if $Z + U$ is constant, which is an extreme form of dependence between $Z$ and the noise $U$. More generally, it is sufficient that $Z + U$ may vary independently of $W$ for the assumption to hold.
Assumption A2 is more restrictive. For simplicity we may eliminate $W$ (or we can consider the question with respect to the conditional distribution given $W = w$, $w$ fixed). Assumption A2 is a dependence condition between $Y$ and $Z$. It is known that if $Y$ and $Z$ are jointly normal, this assumption is equivalent to $\operatorname{rank} \operatorname{Cov}(Y, Z) = 1$. General characterizations of this dependence are more difficult (see, for example, a recent contribution of d’Haultfoeuille, 2008). Intuitively, this assumption means that there exists no non-zero function of $Y$ that is uncorrelated with every function of $Z$. If this assumption is false the theory is essentially preserved, but $\varphi$ is then estimable only up to functions of $Y$ that are orthogonal to every function of $Z$. The recent developments on ‘set identification’ may be applied in that case.
REMARK 2.1. The model analysed in Theorem 2.1 can be extended to the case where $\varphi(Y)$ is replaced by a function $\varphi(Y, \cdot)$ that also depends on some additional exogenous variables. However, the extension of Theorems 2.1 and 2.2 to that case requires that these additional exogenous variables and $W$ be measurably separated. This condition excludes the case where they share some common elements with $W$, since common elements prevent the identification of the model. Moreover, assumption A2 should be replaced by ‘$Y$ is strongly identified by $Z$ given the additional exogenous variables and $W$’. Similar extensions of Theorem 2.2 may also be made.
REMARK 2.2. For simplicity, all the variables we consider in the paper are assumed to be continuous. All our results still apply, but some hypotheses may fail in the presence of discrete variables. For example, hypothesis A2 is no longer true if $Y$ is continuous and $Z$ discrete, but the case of $Y$ discrete and $Z$ continuous usually satisfies A2. In the instrumental variables approach, $R$ should be continuous if $Y$ is continuous. In all cases, $W$ may contain discrete variables, but if $Y$ is discrete the support condition A3 needs to be checked carefully. If $Y$ takes only a finite number of values, the functional estimation problem becomes a finite dimensional question and the difficulty of ill-posedness disappears.
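The distinction between mean independence and full independence discussed in this section can be illustrated with a small simulated design. The following is only a sketch under assumptions that are not taken from the paper: the transformation is taken to be the logarithm, $Z$ is standard normal, and the scale of the noise depends on $Z$, so that $E(U \mid Z) = 0$ holds while the conditional variance of $U$ varies with $Z$. The function name `simulate_transformation_model` and all numerical values are hypothetical.

```python
import numpy as np

def simulate_transformation_model(n, seed=0):
    """Draw (y, z, u) from phi(Y) = Z + U with phi = log (an assumed choice).

    U = sigma(Z) * eps with eps independent of Z and E(eps) = 0, so that
    E(U | Z) = 0 while Var(U | Z) depends on Z (heteroscedasticity).
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    eps = rng.normal(size=n)
    u = 0.3 * np.sqrt(1.0 + z**2) * eps  # conditional variance 0.09 * (1 + z^2)
    y = np.exp(z + u)                    # inverts the assumed phi = log
    return y, z, u

y, z, u = simulate_transformation_model(500)
```

In such a design $U$ and $Z$ are not independent, so methods relying on full independence do not apply directly, whereas the mean independence condition (2.2) still holds.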
3. ESTIMATION BY TIKHONOV REGULARIZATION
We illustrate our analysis by the particular simple case
$$\varphi(Y) = Z + U, \qquad E(U \mid Z) = 0. \tag{3.1}$$
We assume that an i.i.d. sample of $(Y, Z)$ is available and denoted by $(y_i, z_i)$, $i = 1, \ldots, n$. Equation (3.1) implies
$$E(\varphi(Y) \mid Z) = Z \tag{3.2}$$
and the usual kernel smoothing estimation gives the following empirical counterpart of this equation:
$$\frac{\sum_{i=1}^{n} \varphi(y_i)\, K\!\left(\frac{z - z_i}{h_n}\right)}{\sum_{i=1}^{n} K\!\left(\frac{z - z_i}{h_n}\right)} = z, \tag{3.3}$$
where $K$ is a univariate kernel and $h_n$ the bandwidth. This equation has no solution in general because there does not exist a linear combination of the functions $K\!\left(\frac{z - z_i}{h_n}\right) \big/ \sum_{i=1}^{n} K\!\left(\frac{z - z_i}{h_n}\right)$ equal to $z$. The resolution of equation (3.2) is then ill posed. We then adopt a Tikhonov regularization, which is based on the minimization of
$$\|T\varphi - Z\|^2 + \alpha \|\varphi\|^2, \tag{3.4}$$
where $T\varphi = E(\varphi(Y) \mid Z)$ ($T$ is an operator from $L^2_Y$ to $L^2_Z$ defined w.r.t. the true data distributions) and the two norms are $L^2$ norms ($\|\varphi\|^2 = \int \varphi^2(z) f(z)\,dz$ if $f$ is the true density of $Z$). This minimization leads to the first-order condition
$$\alpha\varphi + T^*T\varphi = T^*Z, \tag{3.5}$$
where $T^*$ is the adjoint operator of $T$. A general theory of adjoint operators is not necessary here and it is sufficient to note that $T^*$ is the conditional expectation operator of functions of $Z$ given $Y$. Then the first-order condition of the minimization of (3.4) is
$$\alpha\varphi(y) + E\big(E(\varphi(Y) \mid Z) \mid Y = y\big) = E(Z \mid Y = y). \tag{3.6}$$
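The step from the criterion (3.4) to the first-order condition (3.5) can be spelled out as follows. This is a sketch of the standard argument, in which $\delta$ denotes an arbitrary direction in $L^2_Y$ and $t$ a real number (both are notation introduced here for illustration), and $\langle \cdot, \cdot \rangle$ denotes the inner product of the relevant $L^2$ space.

```latex
\begin{align*}
J(\varphi) &= \|T\varphi - Z\|^2 + \alpha\|\varphi\|^2, \\
J(\varphi + t\delta) &= J(\varphi)
  + 2t\big[\langle T\varphi - Z,\, T\delta\rangle + \alpha\langle \varphi,\, \delta\rangle\big]
  + t^2\big[\|T\delta\|^2 + \alpha\|\delta\|^2\big], \\
\frac{d}{dt} J(\varphi + t\delta)\Big|_{t=0}
  &= 2\,\langle T^*(T\varphi - Z) + \alpha\varphi,\ \delta\rangle = 0
  \quad \text{for all } \delta \\
\Longrightarrow\quad \alpha\varphi + T^*T\varphi &= T^*Z .
\end{align*}
```

The third line uses the defining property of the adjoint, $\langle T\varphi - Z, T\delta\rangle = \langle T^*(T\varphi - Z), \delta\rangle$. Since the term in $t^2$ is strictly positive for $\delta \neq 0$ when $\alpha > 0$, the solution of the first-order condition is the unique minimizer.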
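The empirical counterpart of this equation may be written
$$\alpha\varphi(y) + \frac{\displaystyle\sum_{j=1}^{n} K\!\left(\frac{y - y_j}{h_n}\right) \frac{\sum_{i=1}^{n} \varphi(y_i)\, K\!\left(\frac{z_j - z_i}{h_n}\right)}{\sum_{i=1}^{n} K\!\left(\frac{z_j - z_i}{h_n}\right)}}{\displaystyle\sum_{j=1}^{n} K\!\left(\frac{y - y_j}{h_n}\right)} = \frac{\displaystyle\sum_{j=1}^{n} z_j\, K\!\left(\frac{y - y_j}{h_n}\right)}{\displaystyle\sum_{j=1}^{n} K\!\left(\frac{y - y_j}{h_n}\right)}. \tag{3.7}$$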
This equation may be solved in two steps. Consider first equation (3.7) for $y = y_1, \ldots, y_n$. Then (3.7) reduces to a matrix equation
$$\alpha\bar{\varphi} + C_Y C_Z \bar{\varphi} = C_Y \bar{z}, \tag{3.8}$$
where $\bar{\varphi}$ is the vector of the $(\varphi(y_j))_{j=1,\ldots,n}$, $\bar{z}$ the vector of $(z_j)_{j=1,\ldots,n}$ and $C_Z$ and $C_Y$ two $n \times n$ matrices:
$$C_Z = \left( \frac{K\!\left(\frac{z_j - z_i}{h_n}\right)}{\sum_{i} K\!\left(\frac{z_j - z_i}{h_n}\right)} \right)_{j,i = 1,\ldots,n}, \qquad C_Y = \left( \frac{K\!\left(\frac{y_l - y_j}{h_n}\right)}{\sum_{j} K\!\left(\frac{y_l - y_j}{h_n}\right)} \right)_{l,j = 1,\ldots,n}.$$
Equation (3.8) has a solution
$$\hat{\bar{\varphi}}^{\alpha} = (\alpha I + C_Y C_Z)^{-1} C_Y \bar{z} \tag{3.9}$$
involving the inversion of an $n \times n$ matrix.$^1$ If we want $\varphi(y)$ for a value $y$ which does not belong to the sample we may use equation (3.7), from which $\varphi(y)$ may be derived immediately from the knowledge of $\hat{\bar{\varphi}}^{\alpha}$.
REMARK 3.1. In the particular case of market share models, the random variable $Y$ is constrained to belong to the $[0, 1]$ interval. In that case, we are faced with boundary problems in the kernel estimation. We solved this difficulty by using boundary kernels in the estimation of conditional expectations given $Y$. We use the boundary Gaussian kernel defined, for example, in Li and Racine (2006).
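The two-step solution of (3.7) through the matrix equation (3.8) and its solution (3.9) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the Gaussian kernel, the separate bandwidths `h_y` and `h_z`, and all function names are assumptions made here, and no boundary correction is applied.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel (an assumed choice of the univariate kernel K)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def smoothing_matrix(points, h):
    """Entry (a, b) is K((p_a - p_b)/h) / sum_b K((p_a - p_b)/h)."""
    weights = gaussian_kernel((points[:, None] - points[None, :]) / h)
    return weights / weights.sum(axis=1, keepdims=True)

def tikhonov_estimate(y, z, alpha, h_y, h_z):
    """Solve (alpha I + C_Y C_Z) phi_bar = C_Y z_bar, i.e. equation (3.9)."""
    n = len(y)
    c_z = smoothing_matrix(z, h_z)  # kernel estimate of E(phi(Y) | Z = z_j)
    c_y = smoothing_matrix(y, h_y)  # kernel estimate of E(. | Y = y_l)
    return np.linalg.solve(alpha * np.eye(n) + c_y @ c_z, c_y @ z)

def evaluate_out_of_sample(y_new, y, z, phi_bar, alpha, h_y, h_z):
    """Use (3.7) to obtain phi at arbitrary points from the fitted vector phi_bar."""
    c_z = smoothing_matrix(z, h_z)
    w = gaussian_kernel((np.atleast_1d(y_new)[:, None] - y[None, :]) / h_y)
    w = w / w.sum(axis=1, keepdims=True)
    # (3.7) rearranged: phi(y) = [E_hat(Z | y) - E_hat(E_hat(phi | Z) | y)] / alpha
    return (w @ z - w @ (c_z @ phi_bar)) / alpha
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical choice made in this sketch; it computes the same vector as (3.9). Evaluating `evaluate_out_of_sample` at a sample point returns the corresponding entry of `phi_bar`, since (3.7) at $y = y_l$ is row $l$ of (3.8).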
1 The Tikhonov regularization needs this inversion of a possibly large matrix. If n is very large, other methods like Landweber–Fridman regularization may be used which do not require inversion.
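A Landweber–Fridman scheme of the kind mentioned in this footnote can be sketched as follows. This is an illustration under assumptions not stated in the paper: the kernel is Gaussian, the iteration starts from zero, and the step size `step` and the bandwidths are hypothetical values; the step is assumed small enough relative to the norm of the estimated operator.

```python
import numpy as np

def smoothing_matrix(points, h):
    """Row-normalised Gaussian kernel weights (the Gaussian kernel is an assumption)."""
    w = np.exp(-0.5 * ((points[:, None] - points[None, :]) / h) ** 2)
    return w / w.sum(axis=1, keepdims=True)

def landweber_fridman(y, z, n_iter, step=0.5, h_y=0.3, h_z=0.3):
    """Iterate phi <- phi + step * C_Y (z - C_Z phi), starting from phi = 0.

    No matrix is inverted; the number of iterations n_iter plays the role
    of the regularization parameter (stopping early regularizes more).
    """
    c_y = smoothing_matrix(y, h_y)
    c_z = smoothing_matrix(z, h_z)
    phi = np.zeros_like(z, dtype=float)
    for _ in range(n_iter):
        phi = phi + step * (c_y @ (z - c_z @ phi))
    return phi
```

Each iteration costs two matrix-vector products, which is the reason such schemes are attractive when $n$ is large.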
REMARK 3.2. The model may imply that $\varphi$ is monotone non-increasing. We do not impose this restriction in our estimation even if the minimization of (3.4) under this constraint is feasible (see e.g. Engle et al., 2000). The estimation without the monotonicity constraint illustrates better the impact of the selection of $\alpha$, because the monotonicity constraint is itself a regularization and it is not easy to separate the impact of the penalization by $\alpha\|\varphi\|^2$ from that of the constraint. Moreover, our model is then a little more general and not restricted to the usual transformation models.
The implementation of our method depends on the selection of the bandwidths in the different kernel estimations and on the value of the regularization parameter $\alpha$. The bandwidths may be chosen using many methods. We will compare two of them:
(a) ‘Naive’ bandwidth. As recommended by many authors (see e.g. Silverman, 1986) we may choose $1.059 \times$ (standard deviation of the variable) $\times\, n^{-1/5}$.
(b) Cross-validation. Recall that the bandwidth may be chosen for the estimation of $E(g(A) \mid B)$ by minimization of the sum of squares of the residuals computed using the leave-one-out method (the residual of an observation is computed by dropping this observation from the estimation). We then have three bandwidths to compute: the one corresponding to $E(Z \mid Y)$, the one corresponding to $E(\varphi(Y) \mid Z)$ and finally the bandwidth of the estimation of $E(E(\varphi(Y) \mid Z) \mid Y)$. The last two bandwidths require a preliminary estimation of $\varphi$ in order to be computed.
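The two bandwidth choices (a) and (b) can be sketched as follows. This is a minimal illustration under assumptions made here: a Gaussian kernel, the sample standard deviation with `ddof=1`, and a search over a user-supplied grid of candidate bandwidths; the function names are hypothetical.

```python
import numpy as np

def naive_bandwidth(x):
    """Rule-of-thumb bandwidth 1.059 * std * n^(-1/5) quoted in item (a)."""
    x = np.asarray(x, dtype=float)
    return 1.059 * x.std(ddof=1) * len(x) ** (-0.2)

def loo_cv_score(h, target, cond):
    """Sum of squared leave-one-out residuals of the kernel regression of target on cond."""
    k = np.exp(-0.5 * ((cond[:, None] - cond[None, :]) / h) ** 2)
    np.fill_diagonal(k, 0.0)       # observation i is dropped when predicting i
    denom = k.sum(axis=1)
    if np.any(denom <= 0.0):       # bandwidth too small: some point has no neighbour
        return np.inf
    fitted = (k @ target) / denom
    return float(np.sum((target - fitted) ** 2))

def cv_bandwidth(target, cond, grid):
    """Return the grid value minimizing the leave-one-out criterion of item (b)."""
    grid = np.asarray(grid, dtype=float)
    scores = [loo_cv_score(h, target, cond) for h in grid]
    return float(grid[int(np.argmin(scores))])
```

For the bandwidth of $E(Z \mid Y)$ one would call `cv_bandwidth(z, y, grid)`. For $E(\varphi(Y) \mid Z)$ the target is a preliminary estimate of $\varphi$ at the sample points, which is why the last two bandwidths can only be cross-validated once a first estimate of $\varphi$ is available.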
The selection of $\alpha$ (given the bandwidths) is also a standard issue in the regularized solution of linear equations. We adopt a version of the principle described in Engle et al. (2000). This method consists in the following procedure.
(a) For any (small) value of $\alpha$ compute the estimation of $\varphi$ by an iterated Tikhonov approach. This estimation is defined by
$$\hat{\bar{\varphi}}^{\alpha}_{2} = (\alpha I + C_Y C_Z)^{-1}\left(C_Y \bar{z} + \alpha \hat{\bar{\varphi}}^{\alpha}\right) = (\alpha I + C_Y C_Z)^{-1}\left[C_Y + \alpha(\alpha I + C_Y C_Z)^{-1} C_Y\right]\bar{z}.$$
Even if our final estimate is based on the usual (non-iterated) Tikhonov regularization, the iterated method is necessary to determine the value of $\alpha$ that is optimal for the non-iterated scheme.
(b) Minimize the following sum of squares:
$$SS(\alpha) = \frac{1}{\alpha} \sum_{j=1}^{n} \left[ \frac{\sum_{i=1}^{n} \hat{\varphi}^{\alpha}_{2}(y_i)\, K\!\left(\frac{z_j - z_i}{h_n}\right)}{\sum_{i=1}^{n} K\!\left(\frac{z_j - z_i}{h_n}\right)} - z_j \right]^2. \tag{3.10}$$
The idea is to minimize the norm of the residuals of the integral equation $E(\varphi(Y) \mid Z) = z$, where the conditional expectation is replaced by its estimator, the norm by the empirical mean of the squares and $\varphi$ by its estimator. This norm should be divided by $\alpha$ in order to admit a minimum. We will show in Section 4 that the value of $\alpha$ which leads to the optimal rate of convergence of our estimator should be proportional to $n^{-\frac{4}{5(\beta+1)}}$. Engle et al. (2000) show that the value of $\alpha$ which minimizes (3.10) satisfies this condition, and this value of $\alpha$ may then be called optimal in the rate sense.$^2$
The bandwidths and $\alpha$ may be chosen sequentially: we start with naive bandwidths and we minimize (3.10) in order to get a first value of $\alpha$ and of $\varphi$, which may be used to improve the bandwidths by cross-validation. A new value of $\alpha$ is then obtained from the minimization of (3.10). The process may then be recursively updated.
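The selection of $\alpha$ for given bandwidths, i.e. the iterated Tikhonov estimate followed by the minimization of the criterion (3.10), can be sketched as follows. This is an illustration under assumptions made here: a Gaussian kernel, a minimization carried out over a finite grid `alphas` of candidate values, and hypothetical function names.

```python
import numpy as np

def smoothing_matrix(points, h):
    """Row-normalised Gaussian kernel weights (the Gaussian kernel is an assumption)."""
    w = np.exp(-0.5 * ((points[:, None] - points[None, :]) / h) ** 2)
    return w / w.sum(axis=1, keepdims=True)

def iterated_tikhonov(c_y, c_z, z, alpha):
    """Return the plain estimate (3.9) and its once-iterated version."""
    a = alpha * np.eye(len(z)) + c_y @ c_z
    phi_1 = np.linalg.solve(a, c_y @ z)                  # equation (3.9)
    phi_2 = np.linalg.solve(a, c_y @ z + alpha * phi_1)  # iterated Tikhonov step
    return phi_1, phi_2

def ss_criterion(alpha, c_y, c_z, z):
    """SS(alpha) of (3.10): residuals of the estimated equation, divided by alpha."""
    _, phi_2 = iterated_tikhonov(c_y, c_z, z, alpha)
    return float(np.sum((c_z @ phi_2 - z) ** 2) / alpha)

def select_alpha(y, z, h_y, h_z, alphas):
    """Grid-minimize SS(alpha); return the chosen alpha and the non-iterated estimate."""
    c_y = smoothing_matrix(y, h_y)
    c_z = smoothing_matrix(z, h_z)
    scores = [ss_criterion(a, c_y, c_z, z) for a in alphas]
    best = float(alphas[int(np.argmin(scores))])
    phi_1, _ = iterated_tikhonov(c_y, c_z, z, best)
    return best, phi_1
```

In a sequential scheme one would call `select_alpha` with naive bandwidths, use the returned estimate of $\varphi$ to cross-validate the bandwidths, and then call `select_alpha` again with the updated bandwidths.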
4. ASYMPTOTIC PROPERTIES
Even if this paper is focused on practical implementation, this section gives a flavour of the asymptotic analysis at a low technical level. The objective of this section is to provide the general method for the analysis of the rate of convergence of an estimator derived from a Tikhonov regularization. We concentrate this study on the case of model (3.1) and refer to different papers for more general cases.
Let us recall that the estimator $\hat{\varphi}^{\alpha}$ is the solution of equation (3.6) where the conditional expectation operators $T\varphi = E(\varphi(Y) \mid Z)$ and $T^*\psi = E(\psi(Z) \mid Y)$ are replaced by the kernel estimator $\hat{T}$ and by $\hat{T}^*$, defined analogously to $\hat{T}$. In an abstract way we have
$$\hat{\varphi}^{\alpha} = (\alpha I + \hat{T}^*\hat{T})^{-1}\hat{T}^* Z. \tag{4.1}$$
The asymptotic properties of $\hat{\varphi}^{\alpha}$ are based on two properties of the kernel estimation.
(a) First, we consider $\|\hat{T}\varphi - Z\|^2 = \int \big((\hat{T}\varphi)(z) - z\big)^2 f(z)\,dz$. Using usual results on kernel smoothing, we assume that our problem is sufficiently regular to have$^3$
$$\|\hat{T}\varphi - Z\|^2 \sim O\!\left(\frac{1}{n h_n} + h_n^{2\rho}\right).$$
In this expression $\rho$ is the minimum of the smoothness of $\varphi$ and the order of the kernel. We simplify our presentation by considering probability kernels and twice continuously differentiable functions, and then $\rho = 2$. All the $O$ in the paper are in probability.
(b) We also assume that the two norms $\|\hat{T} - T\|^2$ and $\|\hat{T}^* - T^*\|^2$ are $O\!\left(\frac{1}{n h_n^2} + h_n^4\right)$. Intuitively these results are based on the rate of convergence of the kernel estimator of the joint density of $Y$ and $Z$ to the true density.$^4$ Note that $\hat{T}^*$ is an estimator of $T^*$ and not the adjoint of $\hat{T}$.
An important component of the calculation of the rate of convergence is the regularity assumption on $\varphi$. As we will see in Appendix B, the asymptotic analysis involves a term
$$C = (\alpha I + T^*T)^{-1} T^*T\varphi - \varphi = -\alpha(\alpha I + T^*T)^{-1}\varphi.$$
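The identity defining $C$ and its behaviour as $\alpha \to 0$ can be checked numerically in a finite dimensional stand-in. The following is only a sketch under assumptions made here: the operator $T$ is replaced by an arbitrary random matrix `t`, its adjoint by the transpose (Euclidean inner product), and the vector `phi` is arbitrary; none of these objects comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=(6, 4))   # hypothetical finite dimensional stand-in for T
phi = rng.normal(size=4)      # hypothetical "true" parameter

def regularization_bias(t, phi, alpha):
    """C = (alpha I + T*T)^{-1} T*T phi - phi, with T* taken as the transpose."""
    gram = t.T @ t
    regularized = np.linalg.solve(alpha * np.eye(t.shape[1]) + gram, gram @ phi)
    return regularized - phi

def closed_form_bias(t, phi, alpha):
    """Equivalent expression -alpha (alpha I + T*T)^{-1} phi."""
    gram = t.T @ t
    return -alpha * np.linalg.solve(alpha * np.eye(t.shape[1]) + gram, phi)

squared_norms = [float(np.sum(regularization_bias(t, phi, a) ** 2))
                 for a in (1.0, 0.1, 0.01)]
```

In this finite dimensional example the squared norm of $C$ shrinks as $\alpha$ decreases for any fixed `phi`. The difficulty in the functional case is that the eigenvalues of $T^*T$ accumulate at zero, so the speed of this decline depends on $\varphi$.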
2 The proof given by Engle et al. (2000) is done in the case where the operator T is known. The extension of the proof to the case of an unknown operator T is given in Appendix B.
3 See Darolles et al. (2003) or Hall and Horowitz (2005).
4 See Darolles et al. (2003) and Florens et al. (2006).
This term represents the difference between the true function and the regularized solution of the 'true' problem $T\varphi = r$, namely $(\alpha I + T^{*}T)^{-1}T^{*}T\varphi$. The value $C$ is called the regularization bias. $C \to 0$ if $\alpha \to 0$, but not uniformly with respect to $\varphi$. To control the rate of decline of $\|C\|^{2}$ when $\alpha \to 0$, $\varphi$ should be constrained to be an element of a regularity class: $\varphi$ is said to have the regularity $\beta > 0$ (with respect to the joint data-generating process) if $\|C\|^{2} \sim O(\alpha^{\beta})$. For example (see Carrasco et al., 2007), if there exists a function $\psi(z)$ such that $\varphi(y) = E(\psi(Z) \mid Y = y)$, then $\varphi$ has the regularity 1. The characterization of the regularity is a very complex question which is not treated here (see e.g. Carrasco et al., 2007). Note finally that a constraint imposed by Tikhonov regularization is that $\beta \le 2$. If $\beta > 2$, it should be replaced by 2.

THEOREM 4.1. Under the previous hypotheses (a) and (b) and the regularity condition on $\varphi$ we have
$$\|\hat\varphi^{\alpha} - \varphi\|^{2} = O\left(\frac{1}{\alpha n h_n} + \frac{h_n^{4}}{\alpha} + \frac{1}{\alpha n h_n^{2}}\,\alpha^{\min(\beta+1,2)} + \frac{h_n^{4}}{\alpha}\,\alpha^{\min(\beta+1,2)} + \alpha^{\beta}\right).$$
The estimator is then consistent if $\alpha \to 0$ and $h_n \to 0$ such that $\alpha n h_n \to \infty$, $\frac{h_n^{4}}{\alpha} \to 0$ and $\alpha^{[1-\min(\beta+1,2)]}\, n h_n^{2} \to \infty$.
The question is now to select the optimal value of $\alpha$ and to derive the speed of convergence. In our approach $h_n$ is selected by cross-validation constructed from the estimation of conditional expectations given a single variable. Then $h_n$ is proportional to $n^{-1/5}$. In that case the optimal choice of $\alpha$ is obtained by balancing $\frac{1}{\alpha n^{4/5}}$ and $\alpha^{\beta}$, which leads to $\alpha = n^{-\frac{4}{5(\beta+1)}}$. It is then clear that the two other terms are negligible and we get
$$\|\hat\varphi^{\alpha} - \varphi\|^{2} \sim O\left(n^{-\frac{4\beta}{5(\beta+1)}}\right).$$
The component $n^{-4/5}$ is due to non-parametric estimation and the factor $\frac{\beta}{\beta+1}$ follows from the resolution of the integral equation. Note that this rate is the actual rate of our procedure, characterized by a specific choice of the regularization parameters.

The optimality of this rate of convergence is a complex question and in this paper we just give an intuitive answer. Our rate is optimal under our hypotheses, which do not link the regularity conditions of the kernel estimation and of the inverse problem. Intuitively the speed of convergence of the kernel estimation is based on differentiability conditions on $\varphi$ and on the joint density of $Y$ and $Z$. The source condition ($\|C\|^{2} \sim O(\alpha^{\beta})$) is based on the Fourier decomposition of $\varphi$ on the basis of singular vectors of $T$. In general, $\rho$ and $\beta$ are not related. However, if the source condition is derived from a degree of ill-posedness of $T$ and from a regularity condition on $\varphi$, both measured relative to the differential operator (defining a Hilbert scale), the rate may be improved under this set of stronger hypotheses. This analysis has been done by Chen and Reiss (2007) and Johannes et al. (2007). See also Darolles et al. (2003) for a discussion of the minimax property of inverse problem solutions. We just consider here the consistency and the rate of convergence, but the asymptotic normality may also be examined (see Darolles et al., 2003, and Horowitz, 2007).
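The balancing argument above can be checked with exact arithmetic on the exponents of $n$. The following is a minimal sketch, assuming $h_n = n^{-1/5}$ and $0 < \beta \le 2$ as in the text; the function names and the dictionary keys are illustrative and do not come from the paper.

```python
from fractions import Fraction

H_EXP = Fraction(1, 5)  # assumption taken from the text: h_n proportional to n**(-1/5)


def optimal_alpha_exponent(beta):
    """Return a such that alpha = n**(-a) balances 1/(alpha*n*h_n) against alpha**beta."""
    beta = Fraction(beta)
    if not 0 < beta <= 2:
        raise ValueError("the text restricts the regularity to 0 < beta <= 2")
    # alpha**(beta + 1) = 1/(n*h_n) = n**(-(1 - 1/5))
    return (1 - H_EXP) / (beta + 1)


def term_exponents(beta):
    """Powers of n of the five terms of the bound in Theorem 4.1 at the balanced alpha."""
    beta = Fraction(beta)
    a = optimal_alpha_exponent(beta)
    m = min(beta + 1, Fraction(2))
    return {
        "1/(alpha n h)": a - (1 - H_EXP),
        "h^4/alpha": a - 4 * H_EXP,
        "alpha^m/(alpha n h^2)": -a * (m - 1) - (1 - 2 * H_EXP),
        "h^4 alpha^m/alpha": -a * (m - 1) - 4 * H_EXP,
        "alpha^beta": -a * beta,
    }


for beta in (Fraction(1, 2), 1, 2):
    exps = term_exponents(beta)
    print("beta =", beta,
          "| alpha ~ n^-(%s)" % optimal_alpha_exponent(beta),
          "| squared error ~ n^(%s)" % max(exps.values()))
```

The largest exponent is the one that drives the rate; for every admissible $\beta$ it equals $-\frac{4\beta}{5(\beta+1)}$, while the two terms carrying the factor $\alpha^{\min(\beta+1,2)}$ have strictly smaller exponents.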
5. EXTENSIONS TO ENDOGENOUS VARIABLES AND SEMI-PARAMETRIC MODELS
Let us first consider the model $\varphi(Y) = Z + U$ where $Z$ is endogenous and $E(U \mid R) = 0$, where $R$ is a real instrumental variable. We now solve the empirical counterpart of
$$\alpha\varphi(y) + E(E(\varphi(Y) \mid R) \mid Y = y) = E(E(Z \mid R) \mid Y = y). \tag{5.1}$$
Using the same arguments as in Section 3, it may be shown that the vector $\hat{\bar\varphi}^{\alpha}$ verifies
$$\hat{\bar\varphi}^{\alpha} = (\alpha I + C_Y C_R)^{-1} C_Y C_R \bar Z,$$
where $C_R$ is defined analogously to $C_Y$ or $C_Z$. The asymptotic properties of these estimators are very similar to those studied in Darolles et al. (2003). The choice of $\alpha$ and of the bandwidth is done analogously to the case where $Z$ is exogenous.

Let us now analyse semi-parametric estimation. We first consider the simple case of model (1.1) where $W \in \mathbb{R}$ under an exogeneity assumption:
$$\varphi(Y) = Z + \beta W + U, \qquad E(U \mid Z, W) = 0. \tag{5.2}$$
We adopt a sequential approach extending the backfitting principle frequently used in semi-parametric statistics.
(a) If $\varphi$ is given, $\beta$ may be obtained by OLS, where the dependent variable is $\varphi(Y) - Z$ and the explanatory variable is $W$, because $E(U \mid W) = 0$.
(b) If $\beta$ is given, our approach is identical to the one presented in Section 3, replacing $Z$ by $Z + \beta W$, because $E(U \mid Z + \beta W) = 0$.
The algorithm iterates these two steps until convergence. An initial value for $\beta$ should be selected and should not be too far from the true value. In many cases 0 may be a suitable initial value. This algorithm converges to the solution of the set of two equations:
$$E(W(\varphi(Y) - Z - \beta W)) = 0, \tag{5.3}$$
$$E(\varphi(Y) \mid Z + \beta W) = Z + \beta W. \tag{5.4}$$
The second equation is actually regularized and transformed into
$$\alpha\varphi(y) + E[E(\varphi(Y) \mid Z + \beta W) \mid Y = y] = E[Z + \beta W \mid Y = y]. \tag{5.5}$$
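The two-step iteration can be sketched numerically as follows. This is only an illustration under stated assumptions, not the authors' code: conditional expectations are replaced by Nadaraya-Watson weight matrices with a Gaussian kernel, the bandwidths come from a normal-reference rule instead of cross-validation, $\alpha$ is fixed by hand instead of being data-driven, and all function names and simulation values are hypothetical.

```python
import numpy as np


def nw_weights(x, h):
    """Row-normalised Gaussian kernel weights; row i averages towards E(. | x = x_i)."""
    k = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return k / k.sum(axis=1, keepdims=True)


def rule_of_thumb(x):
    """Normal-reference bandwidth (assumption: used here in place of cross-validation)."""
    return 1.06 * x.std() * len(x) ** (-0.2)


def tikhonov_step(y, v, alpha):
    """Step (b): solve alpha*phi + C_Y C_V phi = C_Y v at the sample points, as in (5.5)."""
    c_y = nw_weights(y, rule_of_thumb(y))
    c_v = nw_weights(v, rule_of_thumb(v))
    return np.linalg.solve(alpha * np.eye(len(y)) + c_y @ c_v, c_y @ v)


def backfit(y, z, w, alpha, beta=0.0, tol=1e-6, max_iter=50):
    """Alternate steps (b) and (a) until beta stabilises (or max_iter is reached)."""
    for _ in range(max_iter):
        phi = tikhonov_step(y, z + beta * w, alpha)
        # Step (a): OLS slope through the origin, the sample version of (5.3).
        beta_new = float(w @ (phi - z) / (w @ w))
        if abs(beta_new - beta) < tol:
            break
        beta = beta_new
    return beta_new, phi


# Hypothetical design: phi(y) = ln((1 - y)/y), so that phi(Y) = Z + beta*W + U.
rng = np.random.default_rng(0)
n, beta_true = 200, 0.5
z = rng.normal(0.0, 0.8, n)
w = rng.normal(0.0, 1.0, n)
u = rng.normal(0.0, 0.3, n)
y = 1.0 / (1.0 + np.exp(z + beta_true * w + u))

beta_hat, phi_hat = backfit(y, z, w, alpha=0.05)
print("estimated beta:", round(beta_hat, 3))
```

Because $\alpha$ is held fixed, the estimate of $\beta$ inherits some regularization bias; updating $\alpha$ and the bandwidths at each pass, as the text describes, is left out of the sketch.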
We extend this analysis by considering $Z$ as an endogenous variable, and we use two instruments $R$ and $W$. The computations are also carried out using a recursive algorithm:
(a) The step where $\varphi$ is given is analogous to the first step when $Z$ is exogenous.
(b) The step where $\beta$ is given is performed by solving
$$\alpha\varphi(y) + E(E(\varphi(Y) \mid R, W) \mid Y = y) = E(E(Z + \beta W \mid R, W) \mid Y = y), \tag{5.6}$$
where the conditional expectations are replaced by their empirical counterparts.
The different bandwidths and the $\alpha$ parameter are computed by purely data-driven methods, as in Section 3. In the sequential algorithms these parameters are updated at each step. The key question concerns the asymptotic properties of the estimator of the parametric component $\beta$. It has been proved (see Ai and Chen, 2003, and Florens et al., 2006) that the estimator of $\beta$ is asymptotically normal and converges at the $\sqrt{n}$ rate under weak assumptions.
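Step (b) of the endogenous case can be sketched in the same matrix form. This is a hedged illustration rather than the authors' implementation: the conditional expectation given $(R, W)$ is approximated with a product Gaussian kernel, the bandwidths follow a normal-reference rule, $\alpha$ and $\beta$ are supplied by hand, and the design values are assumptions loosely modelled on Appendix A.

```python
import numpy as np


def gauss(x, h):
    """Gaussian kernel matrix K((x_i - x_j)/h)."""
    return np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)


def bandwidth(x):
    """Normal-reference bandwidth (assumption: stands in for the data-driven choice)."""
    return 1.06 * x.std() * len(x) ** (-0.2)


def iv_step(y, z, w, r, beta, alpha):
    """Solve the sample version of (5.6) for phi at the observed y_i, with beta given."""
    k_y = gauss(y, bandwidth(y))
    k_rw = gauss(r, bandwidth(r)) * gauss(w, bandwidth(w))  # product kernel on (R, W)
    c_y = k_y / k_y.sum(axis=1, keepdims=True)              # estimates E(. | Y = y_i)
    c_rw = k_rw / k_rw.sum(axis=1, keepdims=True)           # estimates E(. | R = r_i, W = w_i)
    a = c_y @ c_rw
    return np.linalg.solve(alpha * np.eye(len(y)) + a, a @ (z + beta * w))


# Hypothetical design: Z is endogenous because it loads on U.
rng = np.random.default_rng(1)
n, beta_true = 200, 0.5
r = rng.normal(0.0, 0.3, n)
u = rng.normal(0.0, 0.3, n)
w = rng.normal(0.0, 1.0, n)
z = 2.5 * r + 2.0 * u + rng.normal(0.0, 0.015, n)
y = 1.0 / (1.0 + np.exp(z + beta_true * w + u))   # phi(y) = ln((1 - y)/y)

phi_hat = iv_step(y, z, w, r, beta=beta_true, alpha=0.05)
phi_true = np.log((1.0 - y) / y)
print("correlation with the true phi:", np.corrcoef(phi_hat, phi_true)[0, 1])
```

In the full recursive algorithm this function would replace the exogenous Tikhonov step, with $\beta$ updated by the OLS step between calls.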
6. CONCLUSION
This paper proposes an approach to the transformation model based on a conditional mean hypothesis and not on an independence condition between the exogenous variables (or the instrumental variables) and the residual. This weaker assumption leads us to estimate the functional parameter by solving an integral equation of type I, and hence to construct estimators whose rates of convergence differ from the usual $\sqrt{n}$ rate. The treatment of the endogeneity of some variables is, however, easier under this weaker assumption.

This family of semi-parametric transformation models is taken in this paper as a class of examples of econometric inverse problems. We want to show that, despite the technicality of the mathematical framework, Tikhonov regularization is easy to implement. We illustrate this simplicity using numerous simulations presented in the paper. The usual difficulty in the practical use of non-parametric techniques is the selection of the bandwidths and of the regularization parameters. We present in this paper a purely data-driven strategy for these bandwidths and parameters, and we illustrate the relevance of our methods by simulations. This paper is not a purely theoretical contribution, but we present in Section 4 the main intuitions of the analysis of asymptotic properties, essentially the consistency and the rate of convergence.
ACKNOWLEDGMENTS
The authors thank F. Collard and P. Fève for helpful suggestions and comments. They also thank S. Van Bellegem for numerous discussions and the participants of the EC2 conference (Rome, December 2008) and of the workshop at the Economics department of Brown University. We also thank the Editor J. M. Robin and two anonymous referees for their helpful remarks and suggestions.
REFERENCES
Ai, C. and X. Chen (2003). Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica 71, 1795–843.
Argentesi, E. and L. Filistrucchi (2007). Estimating market power in a two-sided market: the case of newspapers. Journal of Applied Econometrics 22, 1247–66.
Blundell, R., X. Chen and D. Kristensen (2007). Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica 75, 1613–69.
Blundell, R. and J. Horowitz (2007). A non-parametric test of exogeneity. Review of Economic Studies 74, 1035–58.
Bonhomme, S. and J. M. Robin (2008). Generalized non parametric deconvolution with an application to earning dynamics. Review of Economic Studies 77, 491–533.
Cardot, H. and J. Johannes (2009). Thresholding projection estimators in functional linear models. Journal of Multivariate Analysis 101, 395–408.
Carrasco, M. (2008). A regularization approach to the many instruments problem. Working paper, University of Montreal.
Carrasco, M. and J. P. Florens (2000). Generalization of the GMM in presence of a continuum of moment conditions. Econometric Theory 16, 797–834.
Carrasco, M. and J. P. Florens (2009). Spectral method for deconvolving a density. IDEI Working Paper No. 138, Institut d'Économie Industrielle (IDEI).
Carrasco, M., J. P. Florens and E. Renault (2007). Linear inverse problems in structural econometrics: estimation based on spectral decomposition and regularization. In J. J. Heckman and E. E. Leamer (Eds.), Handbook of Econometrics, Volume 6B, 5633–751. Amsterdam: North-Holland.
Chen, X. and D. Pouzo (2008). Estimation of nonparametric conditional moment models with possibly nonsmooth moments. Working paper, Yale University.
Chen, X. and D. Pouzo (2009). Efficient estimation of semi-parametric conditional moment models with possibly nonsmooth residuals. Journal of Econometrics 152, 46–60.
Chen, X. and M. Reiss (2007). On rate optimality for ill posed inverse problems in econometrics. Cowles Foundation Discussion Paper No. 1626, Yale University.
Chernozhukov, V., P. Gagliardini and O. Scaillet (2009). Nonparametric instrumental variable estimators of quantile structural effects. Technical report, HEC, Swiss Finance Institute.
Chernozhukov, V., G. W. Imbens and W. K. Newey (2007). Instrumental variable identification and estimation of nonseparable models via quantile conditions. Journal of Econometrics 139, 4–14.
Darolles, S., J. P. Florens and E. Renault (2003). Non parametric instrumental regression. Working paper, Institut d'Économie Industrielle (IDEI).
d'Haultfoeuille, X. (2008). On the completeness condition in nonparametric instrumental problems. Forthcoming in Econometric Theory.
Engle, H. W., M. Hanke and A. Neubauer (2000). Regularization of Inverse Problems. Dordrecht: Kluwer.
Florens, J. P. (2003). Inverse problems in structural econometrics. The example of instrumental variables. In M. Dewatripont, L. P. Hansen and S. J. Turnovsky (Eds.), Advances in Economics and Econometrics, Volume 2, 284–311. Cambridge: Cambridge University Press.
Florens, J. P., J. J. Heckman, C. Meghir and E. Vytlacil (2008). Identification of treatment effects using control functions in models with continuous, endogenous treatment and heterogeneous effects. Econometrica 76, 1191–206.
Florens, J. P., J. Johannes and S. Van Bellegem (2006). Instrumental regression in partially linear models. CORE Discussion Paper No. 2006/25, Center for Operations Research and Econometrics (CORE), Université catholique de Louvain.
Florens, J. P., J. Johannes and S. Van Bellegem (2007). Identification and estimation by penalization in nonparametric instrumental regression. CORE Discussion Paper No. 2007/83, Center for Operations Research and Econometrics (CORE), Université catholique de Louvain.
Florens, J. P., M. Mouchart and J. M. Rolin (1990). Elements of Bayesian Statistics. New York: Marcel Dekker.
Florens, J. P. and E. Sbai (2010). Local identification in empirical games of incomplete information. Forthcoming in Econometric Theory.
Florens, J. P. and S. Van Bellegem (2009). Functional instrumental linear regression. Working paper, Toulouse University.
Gagliardini, P. and O. Scaillet (2007). A specification test for non parametric instrumental variable regression. Working paper, HEC, Swiss Finance Institute.
Gautier, E. and Y. Kitamura (2008). Non parametric estimation in random coefficients binary models. Working paper, Yale University.
Hall, P. and J. Horowitz (2005). Non parametric methods for inference in the presence of instrumental variables. Annals of Statistics 33, 2904–29.
Hoderlein, S., J. Klemelä and E. Mammen (2010). Reconsidering the random coefficient model. Econometric Theory 26, 804–37.
Horowitz, J. L. (1996). Semiparametric estimation of a regression model with an unknown transformation of the dependent variable. Econometrica 64, 103–37.
Horowitz, J. L. (1998). Semiparametric Methods in Econometrics. New York: Springer.
Horowitz, J. L. (2006). Testing a parametric model against a nonparametric alternative with identification through instrumental variables. Econometrica 74, 521–38.
Horowitz, J. L. (2007). Asymptotic normality of a nonparametric instrumental variables estimator. International Economic Review 48, 1329–49.
Horowitz, J. L. and S. Lee (2007). Nonparametric instrumental variables estimation of a quantile regression model. Econometrica 75, 1191–208.
Johannes, J., S. Van Bellegem and A. Vanhems (2007). A unified approach to solve ill posed inverse problems in econometrics. CORE Discussion Paper No. 2007/83, Center for Operations Research and Econometrics (CORE), Université catholique de Louvain.
Li, Q. I. and J. S. Racine (2006). Nonparametric Econometrics. Princeton, NJ: Princeton University Press.
Linton, O., S. Sperlich and I. Van Keilegom (2008). Estimation of a semi parametric transformation model. Annals of Statistics 36, 686–718.
Mammen, E. and K. Yu (2009). Nonparametric estimation of noisy integral equations of the second kind. Journal of the Korean Statistical Society 38, 99–110.
Newey, W. and J. Powell (2003). Instrumental variable estimation of nonparametric models. Econometrica 71, 1565–78.
Rochet, J. C. and J. Tirole (2003). Platform competition in two sided markets. Journal of the European Economic Association 1, 990–1029.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
APPENDIX A: SIMULATION RESULTS
This appendix illustrates our procedures by a Monte Carlo simulation.⁵

A.1. Simulation in the non-parametric model under exogeneity
The exogenous variable $Z$ is drawn from an $N(0, 1.2^{2})$ and $U$ is independently generated by an $N(0, 0.3^{2})$. The survivor $S$ is a logistic function $S(t) = \frac{1}{1+e^{-t}}$ or equivalently $\varphi(y) = \ln\frac{1-y}{y}$. Figure 1 shows the sample (the $y_i$ on the horizontal axis and the $z_i$ on the vertical one), the true function and three estimations with naive bandwidth and arbitrary values of $\alpha$. The sample size is $n = 500$. Intuitively, large values of $\alpha$ lead to a flat line and very small values to a curve oscillating around the true one. The selection of $\alpha$ by minimization is illustrated in Figure 2, which plots the function $SS(\alpha)$ defined in (3.10) (the corresponding estimation is represented in Figure 3). In Figure 4 the same estimation using the optimal $\alpha$ is represented for a smaller sample of 200. In these figures, the bandwidth is a naive bandwidth. We show in Figure 5 how the estimation changes after two recursive evaluations of the bandwidth by cross-validation and selection of the optimal $\alpha$. In all these first five figures, only one draw of the sample is treated. Finally, 50 Monte Carlo replications of the model (where $n = 100$) are drawn and 50 curves are estimated (and represented in Figure 6) using naive bandwidths and the optimal $\alpha$ for each simulation. This figure illustrates graphically the distribution of the estimator.

We have checked the sensitivity of our results by modifying some assumptions. We first increased the variance of the error term in the equation ($0.3^{2}$ is replaced by $0.6^{2}$). The results are very similar to the previous ones, and the Monte Carlo simulations for a sample size of 100 are represented in Figure 6(a). Second, we have replaced the function $\varphi$ by $\varphi(Y) = \tan(2\pi Y)$. The design of $Z$ is modified ($Z_i \sim N(0, 0.4^{2})$) and the results are given in Figure 6(b) by a Monte Carlo simulation. Here also the results are very similar.

A theory for the joint determination of $\alpha$ and the bandwidths has not yet been developed. Our simulation experiments show that many pairs $(\alpha, h_n)$ give an identical result. If $h_n$ is fixed arbitrarily in a suitable interval, $\alpha$ will 'adapt' itself to give 'good' results. This conjecture needs to be checked theoretically.

⁵ All the codes are available from the authors upon request.

Figure 1. Estimation under different values of α (one draw). [Plot: φ(Y) = Z + U (Z exogenous), N = 500; horizontal axis: Y; legend: data (Yi, Zi), true φ, "any α", too large α, too small α.]
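The design of this subsection and the effect of $\alpha$ can be reproduced in a few lines. The sketch below is an illustration under assumptions, not the authors' code (which is available from them on request): it uses Gaussian Nadaraya-Watson weights with a normal-reference ('naive') bandwidth, three arbitrary values of $\alpha$, and hypothetical function names.

```python
import numpy as np


def nw_weights(x):
    """Row-normalised Gaussian kernel weights with a normal-reference bandwidth."""
    h = 1.06 * x.std() * len(x) ** (-0.2)
    k = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return k / k.sum(axis=1, keepdims=True)


def tikhonov(y, z, alpha):
    """Solve alpha*phi + C_Y C_Z phi = C_Y z at the sample points."""
    c_y, c_z = nw_weights(y), nw_weights(z)
    return np.linalg.solve(alpha * np.eye(len(y)) + c_y @ c_z, c_y @ z)


# Design of Appendix A.1: phi(Y) = Z + U with phi(y) = ln((1 - y)/y).
rng = np.random.default_rng(0)
n = 500
z = rng.normal(0.0, 1.2, n)
u = rng.normal(0.0, 0.3, n)
y = 1.0 / (1.0 + np.exp(z + u))
phi_true = np.log((1.0 - y) / y)

# Arbitrary values of alpha: very large, moderate, very small.
fits = {alpha: tikhonov(y, z, alpha) for alpha in (10.0, 0.05, 1e-8)}
rmse = {alpha: float(np.sqrt(np.mean((phi - phi_true) ** 2))) for alpha, phi in fits.items()}
for alpha, phi in fits.items():
    print("alpha =", alpha, "| spread of phi_hat:", round(float(phi.std()), 3),
          "| RMSE:", round(rmse[alpha], 3))
```

With the very large $\alpha$ the fitted values have almost no spread, which is the flat line mentioned above; plotting `fits[1e-8]` against `y` is the simplest way to inspect the behaviour for very small $\alpha$.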
Figure 2. Representation of SS(α) (one draw) and selection of the minimum. [Plot: Z exogenous, optimal choice of α, N = 500; horizontal axis: α.]
Figure 3. Estimation under data-driven value of α. [Plot: φ(Y) = Z + U (Z exogenous), N = 500; horizontal axis: Y; legend: data (Yi, Zi), true φ, optimal α.]
Figure 4. Estimation under data-driven value of α (one draw, small sample). [Plot: φ(Y) = Z + U (Z exogenous), N = 200; horizontal axis: Y; legend: data (Yi, Zi), true φ, optimal α.]
Figure 5. Estimation under two bandwidths (one draw). [Plot: φ(Y) = Z + U (Z exogenous), N = 500; horizontal axis: Y; legend: data (Yi, Zi), true φ, Silverman bandwidth, bandwidth after 1 iteration.]
Figure 6(a). Monte Carlo simulations of the estimation, 'largest' variance. [Plot: Z exogenous, optimal α, N = 100; horizontal axis: Y.]
Figure 6(b). Monte Carlo simulations of the estimation, other specification. [Plot: Z exogenous, optimal α, N = 100; horizontal axis: Y.]
Figure 7. Estimation under different values of α (one draw). [Plot: φ(Y) = Z + U (Z endogenous), N = 500; horizontal axis: Y; legend: data (Yi, Zi), true φ, "any α", too large α, too small α.]
A.2. Simulation in the non-parametric model under endogeneity
We illustrate the extension of Section 5 by some simulations. The function $\varphi(y)$ is still equal to $\ln\frac{1-y}{y}$; $U$ and $R$ are independent and both $N(0, 0.3^{2})$. The variable $Z$ is equal to $aR + bU + \varepsilon$, where $a = 2.5$, $b = 2$ and $\varepsilon$ is $N(0, 0.015^{2})$. Figures 7–9 have the same definitions as Figures 1–3 but with $Z$ endogenous. Figure 10 shows the impact of the bandwidth improvement. This graph concerns a single draw of the data set, whereas Figure 11 shows the results of 50 Monte Carlo simulations with optimal $\alpha$ and selection of bandwidths by cross-validation for each simulation (case $n = 100$).

A.3. Simulation in the semi-parametric model
Two models have been simulated. In the exogenous case, $n = 50$, $\varphi(y)$ and $U$ remain identical, $Z \sim N(0, 0.8^{2})$, $W \sim N(0, 1)$ and $\beta = 0.5$. In the endogenous case, $n = 200$ and $Z = aR + bU + \varepsilon$, where $a = 2.6$ and $b = 2.1$. The other elements remain the same. In each case 50 Monte Carlo replications are generated. We represent the Monte Carlo distribution of the estimator of $\beta$ with the values of the mean (Figures 12 and 14). Figures 13 and 15 represent the different estimators of $\varphi$ with naive bandwidth and optimal $\alpha$ for each simulation.

Our conclusion from the simulations concerning the bandwidth and the regularization parameter is the following. The simultaneous choice of $h_n$ and $\alpha$ is not 'identified', in the sense that there probably exists a curve in the $(h_n, \alpha)$ space such that each of its elements gives the same result. In other terms, the selection procedure of $\alpha$ adapts to the choice of $h_n$ (in a reasonable 'set') in order to give a 'good' result. This conjecture will be examined in future work.
Figure 8. Representation of SS(α) (one draw) and selection of the minimum. [Plot: Z endogenous, optimal choice of α, N = 500; horizontal axis: α.]
Figure 9. Estimation under the data-driven selection of α (one draw). [Plot: φ(Y) = Z + U (Z endogenous), N = 500; horizontal axis: Y; legend: data (Yi, Zi), true φ, optimal α.]
Figure 10. Estimation under different bandwidths (one draw). [Plot: φ(Y) = Z + U, Z endogenous, N = 500; horizontal axis: Y; legend: data (Yi, Zi), true φ, Silverman bandwidth, bandwidth after 1 iteration, bandwidth after 2 iterations.]
Figure 11. Monte Carlo simulations of the estimation of ϕ. [Plot: Z endogenous, cross-validation, optimal α, N = 100; horizontal axis: Y.]
Figure 12. Monte Carlo distribution of β. [Plot: Z exogenous, estimator average = 0.50829, N = 50; horizontal axis: γ.]
Figure 13. Monte Carlo simulations of the estimation of ϕ. [Plot: Z exogenous, N = 50; horizontal axis: Y.]
Figure 14. Monte Carlo distribution of β. [Plot: Z endogenous, estimator average = 0.47332, N = 500; horizontal axis: γ.]
Figure 15. Monte Carlo simulations of the estimation of ϕ. [Plot: Z endogenous, N = 200; horizontal axis: Y.]
APPENDIX B: COMPLEMENTS ON ASYMPTOTIC PROPERTIES

B.1. Proof of Theorems
Proof of Theorem 2.1: Let us consider two solutions $\varphi_0, \beta_0$ and $\varphi_1, \beta_1$ to equation (2.1). Then if $\varphi = \varphi_1 - \varphi_0$ and $\beta = \beta_1 - \beta_0$ we have
$$E(\varphi(Y) \mid W, Z) = \beta' W, \tag{2.4}$$
and we have to prove that this implies $\varphi = 0$ and $\beta = 0$. Equation (2.4) and A2 imply $\varphi(Y) - \beta' W = 0$, and A3 implies that $\varphi(Y) = \beta' W = c$, where $c$ is a real constant. As $W$ is not constant, $\beta' W = c$ implies $c = 0$ and then $\beta = 0$ under A1. Finally, $\varphi(Y) = 0$.
Proof of Theorem 2.2: Analogous to that of Theorem 2.1.

Proof of Theorem 4.1:
(a) Let us first start with the following remark. In the previous practical computations the kernel estimation of $E(\varphi(Y) \mid Z)$ was based on formula (3.3), i.e.
$$(\hat T\varphi)(z) = \frac{\sum_{i=1}^{n} \varphi(y_i)\, K\left(\frac{z - z_i}{h_n}\right)}{\sum_{i=1}^{n} K\left(\frac{z - z_i}{h_n}\right)}.$$
The asymptotic theory we present in this section is actually based on a slightly different expression of $\hat T\varphi$, namely
$$(\hat T\varphi)(z) = \frac{\sum_{i=1}^{n} \int \varphi(y)\, \frac{1}{h_n} K\left(\frac{y - y_i}{h_n}\right) dy\; K\left(\frac{z - z_i}{h_n}\right)}{\sum_{i=1}^{n} K\left(\frac{z - z_i}{h_n}\right)},$$
and the same modification is done for $\hat T^{*}$. This last expression is obtained by estimating $\int \varphi(y) f(y \mid z)\,dy$ (where $f$ is the density of $Y$ and $Z$) by replacing $f$ by its kernel estimator. This modification is motivated by the following argument. The first estimator defined above is an unbounded operator. To see this point we can imagine two functions, $\varphi_1$ and $\varphi_2$, very close in the square norm sense ($E((\varphi_1(Y) - \varphi_2(Y))^{2})$ small) but such that $\hat T\varphi_1$ and $\hat T\varphi_2$ are very different. This unboundedness complicates the proofs. However, the second estimator defines a bounded operator and $\hat T$ is continuous. To see the difference between the two computations we first remark that with this new expression the empirical counterpart of (3.6) now becomes
$$\alpha\varphi(y) + \frac{\sum_{j=1}^{n} K\left(\frac{y - y_j}{h_n}\right) \int \frac{1}{h_n} K\left(\frac{z - z_j}{h_n}\right) \left[\frac{\sum_{i=1}^{n} \int \varphi(\tilde y)\, \frac{1}{h_n} K\left(\frac{\tilde y - y_i}{h_n}\right) d\tilde y\; K\left(\frac{z - z_i}{h_n}\right)}{\sum_{i=1}^{n} K\left(\frac{z - z_i}{h_n}\right)}\right] dz}{\sum_{j=1}^{n} K\left(\frac{y - y_j}{h_n}\right)} = \frac{\sum_{j=1}^{n} z_j\, K\left(\frac{y - y_j}{h_n}\right)}{\sum_{j=1}^{n} K\left(\frac{y - y_j}{h_n}\right)}.$$
Let us multiply this equation by $\frac{1}{h_n} K\left(\frac{y - y_\ell}{h_n}\right)$ ($\ell = 1, \ldots, n$). After integration of the two sides of the equation with respect to $y$ we get the same system as in (3.8), except that $K\left(\frac{y_\ell - y_j}{h_n}\right)$ and $K\left(\frac{z_j - z_i}{h_n}\right)$ are now replaced by
$$\int \frac{\frac{1}{h_n} K\left(\frac{y - y_\ell}{h_n}\right) K\left(\frac{y - y_j}{h_n}\right)}{\sum_{j} K\left(\frac{y - y_j}{h_n}\right)}\, dy \quad\text{and}\quad \int \frac{\frac{1}{h_n} K\left(\frac{z - z_j}{h_n}\right) K\left(\frac{z - z_i}{h_n}\right)}{\sum_{i} K\left(\frac{z - z_i}{h_n}\right)}\, dz.$$
These approximations introduce errors of the same magnitude as the bias of the kernel, and they may then be neglected.
(b) Let us now come back to formula (4.1). First, we may remark that $(\alpha I + \hat T^{*}\hat T)$ is invertible for $n$ sufficiently large. Indeed assumption (b) of Section 4 implies that $\|\hat T^{*}\hat T - T^{*}T\|^{2}$ goes to 0, which implies that the eigenvalues of $\alpha I + \hat T^{*}\hat T$ converge uniformly to the eigenvalues of $\alpha I + T^{*}T$. These eigenvalues have the form $\alpha + \lambda_j^{2}$, where $\lambda_j^{2}$ is a (positive) eigenvalue of $T^{*}T$, and are then strictly positive. This property is then also true for $\alpha I + \hat T^{*}\hat T$.
(c) Now consider $\|\hat\varphi^{\alpha} - \varphi\|^{2}$. We want to analyse the rate of convergence to zero of this norm. We have
$$\hat\varphi^{\alpha} - \varphi = (\alpha I + \hat T^{*}\hat T)^{-1}\hat T^{*}Z - \varphi$$
$$= \left[(\alpha I + \hat T^{*}\hat T)^{-1}\hat T^{*}Z - (\alpha I + \hat T^{*}\hat T)^{-1}\hat T^{*}\hat T\varphi\right] + \left[(\alpha I + \hat T^{*}\hat T)^{-1}\hat T^{*}\hat T\varphi - (\alpha I + T^{*}T)^{-1}T^{*}T\varphi\right] + \left[(\alpha I + T^{*}T)^{-1}T^{*}T\varphi - \varphi\right]$$
$$= A + B + C.$$
From the properties of a norm, $\|\hat\varphi^{\alpha} - \varphi\| \le \|A\| + \|B\| + \|C\|$.
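Let us consider the first term $A = (\alpha I + \hat T^{*}\hat T)^{-1}(\hat T^{*}Z - \hat T^{*}\hat T\varphi)$. We use here the property
$$\|(\alpha I + \hat T^{*}\hat T)^{-1}\hat T^{*}(Z - \hat T\varphi)\| \le \|(\alpha I + \hat T^{*}\hat T)^{-1}\hat T^{*}\|\,\|Z - \hat T\varphi\|.$$
The first norm $\|(\alpha I + \hat T^{*}\hat T)^{-1}\hat T^{*}\|$ is equal to the largest eigenvalue of the operator. These eigenvalues converge to $\frac{\lambda_j}{\alpha + \lambda_j^{2}}$ (with $\lambda_j = \sqrt{\lambda_j^{2}}$) and are then smaller than $\frac{1}{\sqrt{\alpha}}$. Using assumption (a) of Section 4 we get that
$$\|(\alpha I + \hat T^{*}\hat T)^{-1}\hat T^{*}(Z - \hat T\varphi)\|^{2} \sim O\left(\frac{1}{\alpha}\left(\frac{1}{nh_n} + h_n^{4}\right)\right).$$
Using elementary algebra, the second term $B$ verifies
$$B = (\alpha I + \hat T^{*}\hat T)^{-1}\left[\hat T^{*}(\hat T - T) + (\hat T^{*} - T^{*})T\right]\alpha(\alpha I + T^{*}T)^{-1}\varphi.$$
We have first remarked that $\|(\alpha I + \hat T^{*}\hat T)^{-1}\hat T^{*}\| = O\left(\frac{1}{\sqrt{\alpha}}\right)$ and that $\|\hat T - T\|$ and $\|\hat T^{*} - T^{*}\|$ are $O\left(\frac{1}{\sqrt{n}\,h_n} + h_n^{2}\right)$. The last term, identical to $\alpha(\alpha I + T^{*}T)^{-1}T^{*}\varphi$, is the regularity bias of $T^{*}\varphi$, equal to $O\left(\sqrt{\alpha^{\min(\beta+1,2)}}\right)$. Then
$$\|B\|^{2} = O\left(\frac{1}{\alpha}\left(\frac{1}{nh_n^{2}} + h_n^{4}\right)\alpha^{\min(\beta+1,2)}\right).$$
Finally, we have seen in Section 4 that $\varphi$ is assumed to be sufficiently regular: $\|C\|^{2} = O(\alpha^{\beta})$.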
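The two algebraic facts used in this proof can be checked numerically on finite-dimensional stand-ins for the operators. This is a sanity check only, with arbitrary random matrices and illustrative variable names; it also shows that the eigenvalue bound is in fact $\frac{1}{2\sqrt{\alpha}}$, which is smaller than the $\frac{1}{\sqrt{\alpha}}$ used above.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
T = rng.normal(size=(40, 30)) / 6.0                       # stand-in for the operator T
T_star = T.T                                              # its adjoint
T_hat = T + 0.002 * rng.normal(size=T.shape)              # perturbed "estimate" of T
T_star_hat = T_star + 0.002 * rng.normal(size=T_star.shape)  # estimate of T*, not the adjoint of T_hat
phi = rng.normal(size=30)
identity = np.eye(30)

# Norm of (alpha I + T*T)^{-1} T*: the largest value of lambda_j / (alpha + lambda_j^2).
reg_inverse = np.linalg.solve(alpha * identity + T_star @ T, T_star)
lambdas = np.linalg.svd(T, compute_uv=False)
norm_direct = np.linalg.norm(reg_inverse, 2)
norm_formula = np.max(lambdas / (alpha + lambdas ** 2))
print(norm_direct, norm_formula, 1.0 / (2.0 * np.sqrt(alpha)))

# Term B written as a difference and in the factorised form used in the proof.
inv_hat = np.linalg.inv(alpha * identity + T_star_hat @ T_hat)
inv_true = np.linalg.inv(alpha * identity + T_star @ T)
B = inv_hat @ T_star_hat @ T_hat @ phi - inv_true @ T_star @ T @ phi
middle = T_star_hat @ (T_hat - T) + (T_star_hat - T_star) @ T
B_factored = inv_hat @ middle @ (alpha * (inv_true @ phi))
print(np.max(np.abs(B - B_factored)))
```

The factorisation holds for any pair of estimates, which is why the proof does not need $\hat T^{*}$ to be the adjoint of $\hat T$.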
B.2. Speed of convergence of the data-driven selection of α
Let us consider the main elements of the proof. If $n$ is large, $SS(\alpha)$ defined in (3.10) is almost equal to $\frac{1}{\alpha}\|\hat T\hat\varphi^{\alpha}_{(2)} - Z\|^{2}$, and
$$\hat T\hat\varphi^{\alpha}_{(2)} - Z = \hat T(\alpha I + \hat T^{*}\hat T)^{-1}\left[\hat T^{*} + \alpha(\alpha I + \hat T^{*}\hat T)^{-1}\hat T^{*}\right](Z - \hat T\varphi) + \hat T(\alpha I + \hat T^{*}\hat T)^{-1}\left[\hat T^{*} + \alpha(\alpha I + \hat T^{*}\hat T)^{-1}\hat T^{*}\right]\hat T\varphi = A + B.$$
As $\hat T(\alpha I + \hat T^{*}\hat T)^{-1}\left[\hat T^{*} + \alpha(\alpha I + \hat T^{*}\hat T)^{-1}\hat T^{*}\right]$ is bounded,
$$\|A\|^{2} = O\left(\frac{1}{nh_n} + h_n^{4}\right).$$
The second term is equal to $B = B_1 + B_2$, where
$$B_1 = T(\alpha I + T^{*}T)^{-1}\left(T^{*} + \alpha(\alpha I + T^{*}T)^{-1}T^{*}\right)T\varphi$$
and
$$B_2 = \left[\hat T(\alpha I + \hat T^{*}\hat T)^{-1}\left(\hat T^{*} + \alpha(\alpha I + \hat T^{*}\hat T)^{-1}\hat T^{*}\right)\hat T - T(\alpha I + T^{*}T)^{-1}\left(T^{*} + \alpha(\alpha I + T^{*}T)^{-1}T^{*}\right)T\right]\varphi.$$
$\|B_1\|^{2}$ is the regularization bias of $T\varphi$, equal to $\alpha^{\beta+1}$ if $\beta \le 2$ (see Engle et al., 2000). The last term is negligible, using arguments identical to the end of the proof of Theorem 4.1 but based on the algebra of the iterated Tikhonov regularization. Then
$$\frac{1}{\alpha}\|\hat T\hat\varphi^{\alpha}_{(2)} - Z\|^{2} \sim O\left(\frac{1}{\alpha}\left(\frac{1}{nh_n} + h_n^{4}\right) + \alpha^{\beta}\right)$$
and the minimization of this expression gives an $\alpha$ which converges to zero at the optimal rate.
The Econometrics Journal (2010), volume 13, pp. S28–S55. doi: 10.1111/j.1368-423X.2010.00317.x

Semi-parametric estimation of non-separable models: a minimum distance from independence approach

IVANA KOMUNJER† AND ANDRES SANTOS†
† Department of Economics, University of California at San Diego, 9500 Gilman Drive MS0508, La Jolla, CA 92093, USA. E-mails: [email protected], [email protected]
First version received: March 2009; final version accepted: February 2010
Summary. This paper studies non-separable structural models that are of the form $Y = m_{\alpha}(X, U)$ with $U$ uniform on $(0, 1)$, in which $m_{\alpha}$ is a known real function parametrized by a structural parameter $\alpha$. We study the case in which $\alpha$ contains a finite dimensional component $\theta$ and an infinite dimensional component $h$. We assume that the true value $\alpha_0$ is identified by the restriction $U \perp X$. Our proposal is to estimate $\alpha_0$ by a minimum distance from independence (MDI) criterion. We show that: (a) our estimator for $h_0$ is consistent and we obtain rates of convergence and (b) the estimator for $\theta_0$ is $\sqrt{n}$ consistent and asymptotically normally distributed.
Keywords: Identification, Non-separable models, Semi-parametric estimation.
1. INTRODUCTION
Non-parametric identification of non-linear non-separable structural models is often achieved by assuming that the model's latent variables are independent of the exogenous variables. Examples of such arguments include Brown (1983), Roehrig (1988), Matzkin (1994), Chesher (2003), Matzkin (2003) and Benkard and Berry (2006), among others. Yet the criteria used for estimation in such models rarely involve the independence property. Instead, non-parametric and semi-parametric estimation methods typically use the mean independence between the latent and exogenous variables that comes in the form of conditional moment restrictions (see e.g. Ai and Chen, 2003, Blundell et al., 2007). Weaker than independence, the mean independence property by itself does not guarantee that identification holds. As a result, this literature most often simply assumes the models to be identified by the conditional moment restrictions.

In this paper, we unify the estimation and identification of non-separable models by employing the same criterion to obtain both: full independence between the models' latent and exogenous variables. We focus on models of the form $Y = m_{\alpha}(X, U)$, with variables $Y \in \mathbb{R}$ and $X \in \mathcal{X} \subseteq \mathbb{R}^{d_x}$ that are observable, and a latent disturbance $U$ that is uniformly distributed on $(0, 1)$.¹ We denote by $\alpha_0$ the true value of the structural parameter $\alpha$, which consists of (a) a component $\theta$ in $\Theta$ that is finite dimensional ($\Theta \subset \mathbb{R}^{d_\theta}$) and (b) a function $h$ of $x$ and $u$ belonging to an infinite dimensional set of functions $\mathcal{H}$. Thus $\alpha \equiv (\theta, h) \in \mathcal{A} \equiv \Theta \times \mathcal{H}$. We focus on non-separable models in which, for every value $x \in \mathcal{X}$, the mapping $m_{\alpha}(x, u)$ is strictly increasing in $u$ on $(0, 1)$, and the true value $\alpha_0$ of $\alpha$ is identified by the independence restriction $U \perp X$.

¹ In a semi-parametric specification, requiring $U \sim U(0, 1)$ can often be seen as a normalization on the non-parametric component that does not affect the parametric one. This assumption is also often used for non-parametric identification (see Matzkin, 2003, for examples of such arguments).

© 2010 The Author(s). The Econometrics Journal © 2010 Royal Economic Society. Published by Blackwell Publishing Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.

The key insight of our estimation procedure lies in the following equality implied by the model:
$$P(Y \le m_{\alpha_0}(X, t_u);\, X \le t_x) = t_u \cdot P(X \le t_x) \tag{1.1}$$
for all $t \equiv (t_x, t_u) \in \mathcal{X} \times (0, 1)$. We exploit this relationship between the marginal and joint cdfs to construct a Cramér–von Mises-type criterion function:
$$Q(\alpha) \equiv \int_{\mathcal{X} \times (0,1)} \left[P(Y \le m_{\alpha}(X, t_u);\, X \le t_x) - t_u \cdot P(X \le t_x)\right]^{2} d\mu(t),$$
where $\mu$ is a measure on $\mathcal{X} \times (0, 1)$. In a sense, the criterion function $Q(\alpha)$ measures the distance from independence of $U$ and $X$ in the model. Hence, we call our estimator $\hat\alpha$, which we obtain by minimizing an appropriate sample analogue $Q_n(\alpha)$ of $Q(\alpha)$ above, a minimum distance from independence (MDI) estimator. When $\alpha_0$ is identified by the assumptions of the model, then $\alpha_0$ will also be the unique zero of $Q(\alpha)$. Exploiting standard M-estimation arguments we are then able to: (i) show that the MDI estimator $\hat\alpha = (\hat\theta, \hat h)$ is consistent for $\alpha_0 = (\theta_0, h_0)$; (ii) obtain the rate of convergence of the estimator $\hat h$ for $h_0$; (iii) establish the asymptotic normality of the estimator $\hat\theta$ for $\theta_0$.

The approach of minimizing the distance from independence for estimation was originally explored in the seminal work of Manski (1983). In the context of non-linear parametric simultaneous equations systems, the asymptotic properties of the MDI estimators were derived in Brown and Wegkamp (2002). These results, however, assume that the structural mappings are finitely parametrized and do not allow for the presence of non-parametric components, which our approach does. Our paper is also related to the vast literature on estimation of conditional quantiles. Horowitz and Lee (2007) and Chen and Pouzo (2008a), for example, study non-parametric and semi-parametric estimation, respectively, in an instrumental variables setting. However, these results concern a finite number of quantile restrictions, while (1.1) constitutes a continuum of them. Carrasco and Florens (2000) examine efficient GMM estimation under a continuum of restrictions, but their results apply only to finite dimensional parameters. Additional work in non-separable models concerns identification and estimation of average treatment effects rather than the entire structural parameter, as in Altonji and Matzkin (2005), Chernozhukov and Hansen (2005), Florens et al. (2008) and Imbens and Newey (2009), among others.
The remainder of the paper is organized as follows. In Section 2, we present the estimator and establish its consistency, while in Section 3 we obtain a rate of convergence. The asymptotic normality result for $\sqrt{n}(\hat\theta - \theta_0)$ is derived in Section 4. In Section 5, we illustrate how semi-parametric non-separable models arise naturally in economic analysis by studying a simple version of the Berry et al. (1995) model of price-setting with differentiated products. The same section contains a Monte Carlo experiment that illustrates the properties of our estimator. Section 6 concludes the paper. The proofs of all the results stated in the text are relegated to the Appendices.
I. Komunjer and A. Santos
2. MINIMUM DISTANCE FROM INDEPENDENCE ESTIMATION

We consider the following non-separable model:

$$Y = m_\alpha(X, U) \quad \text{and} \quad U \sim U(0, 1) \tag{2.1}$$
with observables $Y \in \mathbb{R}$ and $X \in \mathcal{X} \subseteq \mathbb{R}^{d_x}$, unobservable $U \in (0, 1)$, and structural parameter $\alpha \in \mathcal{A}$. In our setup $\alpha$ consists of an unknown parameter $\theta \in \Theta$ that is finite dimensional ($\Theta \subseteq \mathbb{R}^{d_\theta}$), as well as an unknown real function $h : \mathcal{X} \times (0, 1) \to \mathbb{R}$. The latter component of $\alpha$ is infinite dimensional and we assume that $h \in \mathcal{H}$, where $\mathcal{H}$ is an infinite dimensional set of real valued functions of $x$ and $u$. We therefore let $(\theta, h) \equiv \alpha \in \mathcal{A} \equiv \Theta \times \mathcal{H}$. Hereafter, we assume that the model (2.1) is correctly specified and we denote by $\alpha_0$ the true value of the parameter $\alpha$.

For every $\alpha \in \mathcal{A}$, the structural mapping $m_\alpha : \mathcal{X} \times (0, 1) \to \mathbb{R}$ in (2.1) is a known real function that is continuously differentiable in $u$ on $(0, 1)$ for every $x \in \mathcal{X}$. Moreover, we assume that for every $x \in \mathcal{X}$, we have $\partial m_{\alpha_0}(x, u)/\partial u > 0$. In other words, at the true parameter value $\alpha_0$, the real function $m_{\alpha_0}(x, u)$ is assumed to be strictly increasing in $u$ on $(0, 1)$ for all values of $x \in \mathcal{X}$. In particular, this property guarantees that, conditional on $X$, the mapping from the unobservables $U$ to the observables $Y$ is one-to-one.

Our estimator will be constructed from a sample $\{y_i, x_i\}_{i=1}^n$ of observations of $(Y, X)$ drawn according to model (2.1) with $\alpha = \alpha_0$. We assume the following:

ASSUMPTION 2.1. (a) $\{y_i, x_i\}_{i=1}^n$ are i.i.d.; (b) $X$ is continuously distributed on $\mathcal{X}$ with density $f_X(x)$ and (c) the densities $f_{Y|X}(y \mid x)$ and $f_X(x)$ are uniformly bounded in $(y, x)$ on $S$ (defined below) and in $x$ on $\mathcal{X}$, respectively.

Assumption 2.1(a) is more likely to hold in cross-sectional applications; though extensions to a time-series context are feasible, we do not pursue them here. Assumptions 2.1(b) and (c) put restrictions on the density of the observables. Combining $U \sim U(0, 1)$ with $m_{\alpha_0}(x, \cdot)$ being strictly increasing ensures that, conditional on $X = x$, $Y$ is continuously distributed with support in $m_{\alpha_0}(x, (0, 1))$; we denote by $f_{Y|X}(\cdot \mid \cdot)$ its conditional density. Assumption 2.1(b) then ensures that $(Y, X)$ are jointly continuously distributed on the set $S \equiv \bigcup_{x \in \mathcal{X}} (m_{\alpha_0}(x, (0, 1)), x)$. Moreover, $Y$ is then continuous on $\mathcal{Y} \equiv \bigcup_{x \in \mathcal{X}} m_{\alpha_0}(x, (0, 1))$. Note that we allow the support of the dependent variable $Y$ to depend on the true value $\alpha_0$ of $\alpha$, as in some well-known examples of (2.1) such as the Box–Cox transformation model (see e.g. Komunjer, 2009).

The key property of model (2.1) upon which we base our estimation procedure is that $\alpha_0$ is non-parametrically identified by an independence restriction.

ASSUMPTION 2.2. The true value $\alpha_0 \in \mathcal{A}$ of the structural parameter $\alpha$ in model (2.1) is identified by the restriction $U \perp X$.

Assumption 2.2 requires that model (2.1) be identified by an independence restriction. For fully non-parametric specifications, the arguments that lead to this result are well understood (see e.g. Matzkin, 2003). Identification in semi-parametric setups, however, can be more challenging and of course depends on the model specification. In Section 5, we provide more primitive conditions under which the identification Assumption 2.2 holds within a simplified BLP model. The following lemma derives a simple characterization of the property in Assumption 2.2.

LEMMA 2.1. Let Assumptions 2.1(b) and 2.2 hold. Then, it follows that

$$P(Y \le m_\alpha(X, u);\; X \le x) = u \cdot P(X \le x) \quad \text{for all } (x, u) \in \mathcal{X} \times (0, 1)$$

if and only if $\alpha = \alpha_0$.
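The "if" direction of Lemma 2.1 can be seen in a few lines. The derivation below is only a sketch: it uses the strict monotonicity of $m_{\alpha_0}(x, \cdot)$, $U \sim U(0, 1)$ and $U \perp X$, and it says nothing about the "only if" direction, which is where the identification Assumption 2.2 is needed.

```latex
\begin{aligned}
P\bigl(Y \le m_{\alpha_0}(X,u);\ X \le x\bigr)
  &= P\bigl(m_{\alpha_0}(X,U) \le m_{\alpha_0}(X,u);\ X \le x\bigr) \\
  &= P\bigl(U \le u;\ X \le x\bigr)
     && \text{(strict monotonicity in } u\text{)} \\
  &= P(U \le u)\,P(X \le x)
     && (U \perp X) \\
  &= u \cdot P(X \le x)
     && (U \sim U(0,1)).
\end{aligned}
```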
Lemma 2.1 suggests a straightforward way to construct a criterion function through which to estimate $\alpha_0$. Let $t = (x, u) \in \mathcal{X} \times (0, 1)$ and define

$$W_\alpha(t) \equiv P(Y \le m_\alpha(X, u);\; X \le x) - u \cdot P(X \le x). \tag{2.2}$$

Under the assumptions of Lemma 2.1, we have $W_\alpha(t) = 0$ for all $t \in \mathcal{X} \times (0, 1)$ if and only if $\alpha = \alpha_0$. Hence, a natural candidate for a population criterion function is the Cramér–von Mises-type objective:

$$Q(\alpha) \equiv \int_{\mathcal{X} \times (0,1)} W_\alpha^2(t)\, d\mu(t), \tag{2.3}$$

where $\mu$ is a measure on $\mathcal{X} \times (0, 1)$ that is absolutely continuous with respect to Lebesgue measure. The choice of $\mu$ is free, though we note that it will influence the asymptotic variance of our estimator for $\theta$. When the model in (2.1) is identified by the restriction $U \perp X$, Lemma 2.1 implies that $\alpha_0$ is the unique zero of $Q(\alpha)$ and hence we have

$$\alpha_0 = \arg\min_{\alpha \in \mathcal{A}} Q(\alpha).$$

The absolute continuity of $\mu$ is needed to ensure that $\alpha_0$ is the unique minimum of $Q(\alpha)$. Indeed, if $\mu$ were to place point masses on some finite number of values $t_i \in \mathcal{X} \times (0, 1)$ of $t$ (with $i \in I$ and $I$ finite), then the objective function $Q(\alpha)$ would be minimized at values of $\alpha$ for which $W_\alpha(t_i) = 0$ for all $i \in I$. Therefore, multiple minimizers will exist in specifications where the independence assumption cannot be weakened without losing identification.

Estimation will proceed by minimizing an empirical analogue $Q_n(\alpha)$ of $Q(\alpha)$ over an appropriate sieve space. First define the sample analogue to $W_\alpha(t)$:

$$W_{\alpha,n}(t) \equiv \frac{1}{n} \sum_{i=1}^{n} 1\{y_i \le m_\alpha(x_i, u);\; x_i \le x\} - u \cdot \frac{1}{n} \sum_{i=1}^{n} 1\{x_i \le x\}, \tag{2.4}$$

which yields a finite sample criterion function:

$$Q_n(\alpha) \equiv \int_{\mathcal{X} \times (0,1)} W_{\alpha,n}^2(t)\, d\mu(t). \tag{2.5}$$
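The sample criterion in (2.4) and (2.5) can be sketched numerically as follows. This is a minimal illustration under assumptions that are not in the text: the structural map `m` is a hypothetical one-parameter model $m_\theta(x, u) = \theta x + \log(u/(1-u))$ with the function of $u$ treated as known, $X$ is scalar, and the integral against a uniform $\mu$ is approximated by equal weights on a midpoint grid. All function names are illustrative.

```python
import math
import random


def m(theta, x, u):
    """Hypothetical structural map theta*x + logit(u); strictly increasing in u."""
    return theta * x + math.log(u / (1.0 - u))


def W_n(theta, data, x, u):
    """Sample analogue (2.4): joint frequency minus u times marginal frequency."""
    n = len(data)
    joint = sum(1 for (yi, xi) in data if xi <= x and yi <= m(theta, xi, u))
    marginal = sum(1 for (_, xi) in data if xi <= x)
    return joint / n - u * marginal / n


def Q_n(theta, data, grid):
    """Criterion (2.5), with mu approximated by equal weights on a grid of t = (x, u)."""
    return sum(W_n(theta, data, x, u) ** 2 for (x, u) in grid) / len(grid)


def simulate(n, theta0, seed=0):
    """Draw (y_i, x_i) from the hypothetical model with X ~ U[-1, 1], U ~ U(0, 1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        u = max(rng.random(), 1e-12)  # guard against u == 0.0
        data.append((m(theta0, x, u), x))
    return data


# Midpoint grid on [-1, 1] x (0, 1), standing in for a uniform measure mu.
grid = [(-1.0 + 2.0 * (i + 0.5) / 10, (j + 0.5) / 10)
        for i in range(10) for j in range(10)]

data = simulate(300, theta0=-1.5)
q_true = Q_n(-1.5, data, grid)   # criterion at the true parameter
q_wrong = Q_n(0.0, data, grid)   # criterion at a misspecified parameter
# Crude grid search over theta in {-3.0, -2.9, ..., 0.0}.
theta_hat = min((k / 10 for k in range(-30, 1)),
                key=lambda th: Q_n(th, data, grid))
```

The grid search is used only to keep the sketch dependency-free; it is not the sieve minimization over $\mathcal{A}_n$ described in the text, which also optimizes over the non-parametric component.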
Since $\mathcal{A}$ contains a non-parametric component, minimizing $Q_n(\alpha)$ to obtain an estimator may not only be computationally difficult, but also undesirable as it may yield slow rates of convergence (see Chen, 2006). For this reason we instead sieve the parameter space $\mathcal{A}$. Let $\mathcal{H}_n \subset \mathcal{H}$ be a sequence of approximating spaces, and define the sieve $\mathcal{A}_n = \Theta \times \mathcal{H}_n$. The MDI estimator is then given by

$$\hat\alpha \in \arg\min_{\alpha \in \mathcal{A}_n} Q_n(\alpha). \tag{2.6}$$

For the consistency analysis, we endow $\mathcal{A}$ with the metric $\|\alpha\|_c = \|\theta\| + \|h\|_\infty$ and impose the following additional assumption:2

2 See the Appendix for details regarding the notations and definitions.
ASSUMPTION 2.3. (a) $\mu$ has full support on $\mathcal{X} \times (0, 1)$; (b) $\Theta$ and $\mathcal{H}$ are compact w.r.t. $\|\cdot\|$ and $\|\cdot\|_\infty$; (c) $m_\alpha(x, \cdot) : (0, 1) \to \mathbb{R}$ is strictly increasing for every $(\alpha, x) \in \mathcal{A} \times \mathcal{X}$; (d) For every $x \in \mathcal{X}$, $\sup_{u \in (0,1)} |m_\alpha(x, u) - m_{\tilde\alpha}(x, u)| \le G(x)\{\|\theta - \tilde\theta\| + \|h - \tilde h\|_\infty\}$ with $E[G^2(X)] < \infty$; (e) The entropy $\int_0^\infty N_{[\,]}(\eta^3, \mathcal{H}, \|\cdot\|_\infty)\, d\eta < \infty$; (f) $\mathcal{H}_n \subset \mathcal{H}$ are closed in $\|\cdot\|_\infty$ and for any $h \in \mathcal{H}$ there exists $\Pi_n h \in \mathcal{H}_n$ such that $\|h - \Pi_n h\|_\infty = o(1)$.

As already pointed out, Assumption 2.3(a) ensures that $Q(\alpha)$ is uniquely minimized at $\alpha_0$. Assumptions 2.3(b)–(e) ensure the stochastic process is asymptotically equicontinuous in probability. It is interesting to note that while strict monotonicity of $m_\alpha(x, \cdot)$ is not needed for identification, imposing it on the parameter space is helpful in the statistical analysis. In Assumption 2.3(e), $N_{[\,]}(\eta^3, \mathcal{H}, \|\cdot\|_\infty)$ denotes the bracketing number of $\mathcal{H}$ with respect to $\|\cdot\|_\infty$; see van der Vaart and Wellner (1996) for details and examples of function classes satisfying Assumption 2.3(e). Finally, Assumption 2.3(f) requires that the sieve can approximate the parameter space with respect to the norm $\|\cdot\|_\infty$. Assumptions 2.1–2.3 are sufficient for establishing the consistency of the MDI estimator under the norm $\|\cdot\|_c$.

THEOREM 2.1. Under Assumptions 2.1–2.3 it follows that $\|\hat\alpha - \alpha_0\|_c = o_p(1)$.
3. RATE OF CONVERGENCE

In this section, we establish the rate of convergence of $\hat h$. This result is not only interesting in its own right, but is also instrumental in deriving the asymptotic normality of $\sqrt{n}(\hat\theta - \theta_0)$. We focus on the following norm for $h(x, u)$:

$$\|h\|_{L^2}^2 = \int_{\mathcal{X} \times (0,1)} h^2(x, u) f_X(x)\, dx\, du. \tag{3.1}$$

Associated to the norm $\|h\|_{L^2}$ is the vector space $L^2 = \{h(x, u) : \|h\|_{L^2} < \infty\}$. We assume the structural function $m_\alpha(x, u)$ in (2.1) satisfies $\|m_\alpha\|_{L^2} < \infty$ and define the mapping $m : (\mathcal{A}, \|\cdot\|_c) \to L^2$ which to any $\alpha \in \mathcal{A}$ associates $m(\alpha) \equiv m_\alpha$. Given these definitions, we introduce the following assumption.3

ASSUMPTION 3.1. (a) In a neighbourhood $N(\alpha_0) \subset \mathcal{A}$, $m : (\mathcal{A}, \|\cdot\|_c) \to L^2$ is continuously Fréchet differentiable; (b) For every $(y, x)$ and $(y', x)$ in $S$, the conditional densities satisfy $|f_{Y|X}(y \mid x) - f_{Y|X}(y' \mid x)| \le J(x)|y - y'|^\nu$ with $E[J^2(X) G^{2 \vee 2\nu}(X)] < \infty$; (c) The marginal density of $\mu$ with respect to $u$ is uniformly bounded on $(0, 1)$.

In what follows, we denote by $\frac{dm}{d\alpha}(\tilde\alpha)$ the Fréchet derivative of $m$ evaluated at $\tilde\alpha \in \mathcal{A}$. For example, consider the structural mapping $m_\alpha(x, u) = h(x, u) + x'\theta$ and assume that $\|m_\alpha\|_{L^2} < \infty$. In this case $m$ is linear and so it is its own Fréchet derivative, i.e. for any $\pi = (\pi_h, \pi_\theta) \in \mathcal{A}$ we have $\frac{dm}{d\alpha}(\alpha)[\pi](x, u) = \pi_h(x, u) + x'\pi_\theta$. To simplify the notation, we hereafter let

$$\frac{dm_\alpha(x, u)}{d\alpha}[\pi] \equiv \frac{dm}{d\alpha}(\alpha)[\pi](x, u).$$

3 See the Appendix for definitions.
In order to obtain the rates of convergence for $\|\hat h - h_0\|_{L^2}$, it is necessary to examine the local behaviour of $Q(\alpha)$ at $\alpha_0$. Under Assumptions 2.1(b)–(c), 2.3(d) and 3.1, the Fréchet differentiability of $m$ is inherited by the mapping $Q : (\mathcal{A}, \|\cdot\|_c) \to \mathbb{R}$, which to every $\alpha \in \mathcal{A}$ associates $Q(\alpha)$. To state the form of this Fréchet derivative, we define the linear map $D_{\bar\alpha} : (\mathcal{A}, \|\cdot\|_c) \to L^2_\mu$ which to every $\pi \in \mathcal{A}$ associates $D_{\bar\alpha}[\pi]$, where $D_{\bar\alpha}[\pi] : \mathcal{X} \times (0, 1) \to \mathbb{R}$ maps $t = (x, u) \in \mathcal{X} \times (0, 1)$ into $D_{\bar\alpha}[\pi](t)$ given by:

$$D_{\bar\alpha}[\pi](t) = \int_{\mathcal{X}} f_{Y|X}(m_{\bar\alpha}(s_x, u) \mid s_x)\, \frac{dm_{\bar\alpha}(s_x, u)}{d\alpha}[\pi]\, 1\{s_x \le x\}\, f_X(s_x)\, ds_x. \tag{3.2}$$

Lemma 3.1 establishes that $Q(\alpha)$ is twice Fréchet differentiable at $\alpha_0$.

LEMMA 3.1. Under Assumptions 2.1(b)–(c), 2.3(d) and 3.1(a)–(c), $Q : (\mathcal{A}, \|\cdot\|_c) \to \mathbb{R}$ is: (a) continuously Fréchet differentiable in $N(\alpha_0)$ with

$$\frac{dQ(\bar\alpha)}{d\alpha}[\pi] = \int_{\mathcal{X} \times (0,1)} W_{\bar\alpha}(t)\, D_{\bar\alpha}[\pi](t)\, d\mu(t);$$

(b) twice Fréchet differentiable at $\alpha_0$ with

$$\frac{d^2 Q(\alpha_0)}{d\alpha^2}[\psi, \pi] = \int_{\mathcal{X} \times (0,1)} D_{\alpha_0}[\psi](t)\, D_{\alpha_0}[\pi](t)\, d\mu(t).$$

In this model, since $Q(\alpha)$ is minimized at $\alpha_0$, its second derivative at $\alpha_0$ induces a norm on $\mathcal{A}$. This result is analogous to a parametric model, in which if the Hessian $H$ is a positive definite matrix, then $\sqrt{a'Ha}$ is a norm equivalent to the standard Euclidean norm. Guided by Lemma 3.1 we therefore define the inner product and associated norm:

$$\langle \alpha, \tilde\alpha \rangle_w \equiv \int_{\mathcal{X} \times (0,1)} D_{\alpha_0}[\alpha](t)\, D_{\alpha_0}[\tilde\alpha](t)\, d\mu(t) \quad \text{and} \quad \|\alpha\|_w^2 = \langle \alpha, \alpha \rangle_w. \tag{3.3}$$

The advantage of the norm $\|\cdot\|_w$ is that through a Taylor expansion it is often possible to show $\|\alpha - \alpha_0\|_w^2 \lesssim Q(\alpha)$, which makes it feasible to obtain rates of convergence in $\|\cdot\|_w$. However, the norm $\|\cdot\|_w$ may not be of interest in itself. We instead aim to obtain a rate of convergence in the stronger norm $\|\alpha\|_s \equiv \|\theta\| + \|h\|_{L^2}$. It is possible to obtain a rate of convergence for $\|\hat\alpha - \alpha_0\|_s$ by understanding the behaviour of the ratio $\|\cdot\|_s / \|\cdot\|_w$ on the sieve $\mathcal{A}_n$. We impose the following assumptions in order to obtain the rate of convergence of $\hat\alpha$ in the norm $\|\cdot\|_s$.

ASSUMPTION 3.2. (a) In a neighbourhood $N(\alpha_0)$, $\|\alpha - \alpha_0\|_w^2 \lesssim Q(\alpha) \lesssim \|\alpha - \alpha_0\|_s^2$; (b) The ratio $\tau_n \equiv \sup_{\mathcal{A}_n} \|\alpha_n\|_s^2 / \|\alpha_n\|_w^2$ satisfies $\tau_n = o(n^\gamma)$ with $\gamma < 1/4$; (c) For any $h \in \mathcal{H}$ there exists $\Pi_n h \in \mathcal{H}_n$ with $\|h - \Pi_n h\|_s = o(n^{-1/2})$ and $\|h - \Pi_n h\|_c = o(n^{-1/4})$.

Assumption 3.2(a) requires $\|\alpha - \alpha_0\|_w^2 \lesssim Q(\alpha)$. As discussed, this is often verified through a Taylor expansion and allows us to obtain a rate of convergence in $\|\cdot\|_w$. In our model, $\|\cdot\|_w$ is too weak and $Q(\alpha)$ is often not continuous in this norm. We impose instead $Q(\alpha) \lesssim \|\alpha - \alpha_0\|_s^2$. Assumption 3.2(b) is crucial in enabling us to obtain rates in $\|\cdot\|_s$ from rates in $\|\cdot\|_w$, and vice versa, which is needed to refine initial estimates of the rate of convergence. The ratio $\tau_n$ is often referred to as the sieve modulus of continuity (see e.g. Chen and Pouzo, 2008b). In practice, Assumption 3.2(b) requires the sieve not to grow too fast. Finally, Assumption 3.2(c) refines the requirements on the rates of approximation for the sieve $\mathcal{A}_n$.
Given these assumptions we obtain the following rate of convergence result:

THEOREM 3.1. Under Assumptions 2.1–2.3, 3.1 and 3.2, $\|\hat\alpha - \alpha_0\|_s = o_p(n^{-1/4})$.

Note that since $\|\hat\alpha - \alpha_0\|_s = \|\hat\theta - \theta_0\| + \|\hat h - h_0\|_{L^2}$, it immediately follows from Theorem 3.1 that $\|\hat h - h_0\|_{L^2} = o_p(n^{-1/4})$ as well.
4. ASYMPTOTIC NORMALITY

In this section, we establish the asymptotic normality of $\sqrt{n}(\hat\theta - \theta_0)$. The approach of the proof is similar to that of Ai and Chen (2003) and Chen and Pouzo (2008a). We proceed in two steps. First, we show that for any $\lambda \in \mathbb{R}^{d_\theta}$ the linear functional $F_\lambda(\alpha) = \lambda'\theta$, which returns a linear combination of the parametric component of the semi-parametric specification, is continuous in $\|\cdot\|_w$. By appealing to the Riesz Representation Theorem it then follows that there is $v^\lambda$ such that $\langle v^\lambda, \hat\alpha - \alpha_0 \rangle_w = \lambda'(\hat\theta - \theta_0)$. Second, we establish the asymptotic normality of $\sqrt{n}\langle v^\lambda, \hat\alpha - \alpha_0 \rangle_w$ and employ the Cramér–Wold device to conclude the asymptotic normality of $\sqrt{n}(\hat\theta - \theta_0)$.

We therefore first aim to establish the continuity of $F_\lambda(\alpha) = \lambda'\theta$ in $\|\cdot\|_w$. Let $\bar{\mathcal{A}}$ denote the closure of the linear span of $\mathcal{A} - \alpha_0$ under $\|\cdot\|_w$, and observe that $(\bar{\mathcal{A}}, \|\cdot\|_w)$ is a Hilbert space with inner product $\langle \cdot, \cdot \rangle_w$ and that $\bar{\mathcal{A}}$ is of the form $\bar{\mathcal{A}} = \mathbb{R}^{d_\theta} \times \bar{\mathcal{H}}$. For any $(\alpha - \alpha_0)$ in $\bar{\mathcal{A}}$, we can then decompose $D_{\alpha_0}[\alpha - \alpha_0]$ as:4

$$D_{\alpha_0}[\alpha - \alpha_0] \equiv \frac{dW(\alpha_0)}{d\alpha}[\alpha - \alpha_0] = \frac{dW(\alpha_0)}{d\theta}[\theta - \theta_0] + \frac{dW(\alpha_0)}{dh}[h - h_0]. \tag{4.1}$$

For each component $\theta_j$ of $\theta$, $1 \le j \le d_\theta$, let $h_j^* \in \bar{\mathcal{H}}$ be defined by

$$h_j^* \equiv \arg\min_{h \in \bar{\mathcal{H}}} \int_{\mathcal{X} \times (0,1)} \left( \frac{dW_{\alpha_0}(t)}{d\theta_j} - \frac{dW_{\alpha_0}(t)}{dh}[h] \right)^2 d\mu(t), \tag{4.2}$$

where the minimum in (4.2) is indeed attained and $h_j^*$ is well defined due to the Projection Theorem in Hilbert spaces (see e.g. Theorem 3.3.2 in Luenberger, 1969). Similarly, define $h^* \equiv (h_1^*, \ldots, h_{d_\theta}^*)$ and let

$$\frac{dW_{\alpha_0}(t)}{dh}[h^*] = \left( \frac{dW_{\alpha_0}(t)}{dh}[h_1^*], \ldots, \frac{dW_{\alpha_0}(t)}{dh}[h_{d_\theta}^*] \right). \tag{4.3}$$

As a final piece of notation, we also need to denote the vector of residuals:

$$R_{h^*}(t) = \frac{dW_{\alpha_0}(t)}{d\theta} - \frac{dW_{\alpha_0}(t)}{dh}[h^*], \tag{4.4}$$

and the associated matrix

$$\Sigma^* \equiv \int_{\mathcal{X} \times (0,1)} R_{h^*}(t)\, R_{h^*}'(t)\, d\mu(t). \tag{4.5}$$

4 The first equality in (4.1) is formally justified in the proof of Lemma 3.1 in the Appendix, in which it is shown that $D_{\bar\alpha}$ is the Fréchet derivative of the mapping $W : (\mathcal{A}, \|\cdot\|_c) \to L^2_\mu$ given by $W : \alpha \mapsto W_\alpha$, when evaluated at $\bar\alpha$. Similar to before, we use the notation $\frac{dW_\alpha(t)}{d\alpha}[\pi] \equiv \frac{dW}{d\alpha}(\alpha)[\pi](t)$.
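Lemma 4.1 shows that the functional $F_\lambda(\alpha) = \lambda'\theta$ is continuous if the matrix $\Sigma^*$ is positive definite, which may be interpreted as a local identification condition on $\theta_0$. Lemma 4.1 also obtains the formula for the Riesz Representor of $F_\lambda(\alpha)$.

LEMMA 4.1. Let $v_\theta^\lambda \equiv (\Sigma^*)^{-1}\lambda$ and $v_h^\lambda \equiv -h^* v_\theta^\lambda$. If $\Sigma^*$ is positive definite, then for any $\lambda \in \mathbb{R}^{d_\theta}$, $F_\lambda(\alpha - \alpha_0) = \lambda'(\theta - \theta_0)$ is continuous on $\bar{\mathcal{A}}$ under $\|\cdot\|_w$ and in addition we have $F_\lambda(\alpha - \alpha_0) = \langle v^\lambda, \alpha - \alpha_0 \rangle_w = \lambda'(\theta - \theta_0)$.

Having established the continuity of $F_\lambda(\alpha)$ in $\|\cdot\|_w$ and the closed-form solution for the Riesz Representor $v^\lambda$, we can study the asymptotic normality of $\lambda'(\hat\theta - \theta_0)$ by examining $\sqrt{n}\langle v^\lambda, \hat\alpha - \alpha_0 \rangle_w$ instead. The latter representation is simpler to analyse as it is determined by the local behaviour of $Q(\alpha)$ near its minimum $\alpha_0$. In order to establish asymptotic normality, we require one final assumption.

ASSUMPTION 4.1. (a) The matrix $\Sigma^*$ is positive definite; (b) $v^\lambda \in \mathcal{A}$ for $\lambda$ small; (c) For every $\alpha \in N(\alpha_0)$ and every $(\pi, \bar\alpha) \in \mathcal{A}^2$, the pathwise derivative $\frac{dD_{\alpha + \tau\bar\alpha}[\pi]}{d\tau}$ exists and in addition satisfies

$$\int_{\mathcal{X} \times (0,1)} \sup_{s \in [0,1]} \left| \frac{dD_{\alpha + \tau\bar\alpha}[\pi]}{d\tau}(t) \Big|_{\tau = s} \right| d\mu(t) \lesssim \|\bar\alpha\|_s \|\pi\|_s$$

as well as

$$\int_{\mathcal{X} \times (0,1)} \sup_{s \in [0,1]} \left( \frac{dD_{\alpha + \tau\bar\alpha}[\pi]}{d\tau}(t) \Big|_{\tau = s} \right)^2 d\mu(t) \lesssim \|\bar\alpha\|_s^2;$$

(d) For every $\alpha \in N(\alpha_0)$ and every $\pi \in \mathcal{A}$, $|D_\alpha[\pi](t)|$ is bounded uniformly in $t \in \mathcal{X} \times (0, 1)$.

Assumption 4.1(a) ensures that $F_\lambda(\alpha) = \lambda'\theta$ is continuous in $\|\cdot\|_w$, as shown in Lemma 4.1. While $v^\lambda \in \bar{\mathcal{A}}$, Assumption 4.1(b) additionally requires $v^\lambda \in \mathcal{A}$. As a result $v^\lambda$ may be approximated by an element $\Pi_n v^\lambda \in \mathcal{A}_n$ due to Assumption 3.1(c). The qualification 'for $\lambda$ small' is due to the compactness assumption on $\Theta \times \mathcal{H}$ imposing that they be bounded in norm. Finally, Assumptions 4.1(c)–(d) require $W_\alpha(t)$ to be twice differentiable and certain regularity conditions to hold on the derivatives.

We are now ready to establish the asymptotic normality of $\sqrt{n}(\hat\theta - \theta_0)$.

THEOREM 4.1. Let Assumptions 2.1–4.1 hold. Then, $\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{L} N(0, \Sigma)$, where

$$\Sigma \equiv [\Sigma^*]^{-1} \left( \int_{(\mathcal{X} \times (0,1))^2} R_{h^*}(t)\, R_{h^*}'(s)\, \Sigma(t, s)\, d\mu(t)\, d\mu(s) \right) [\Sigma^*]^{-1},$$

and for every $t = (x, u)$ and $t' = (x', u')$ in $\mathcal{X} \times (0, 1)$ the kernel $\Sigma(t, t')$ is given by

$$\Sigma(t, t') \equiv E\left[ (1\{U \le u;\; X \le x\} - u \cdot 1\{X \le x\})(1\{U \le u';\; X \le x'\} - u' \cdot 1\{X \le x'\}) \right].$$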
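Under the maintained assumptions $U \sim U(0, 1)$ and $U \perp X$, the covariance kernel in Theorem 4.1 (written here as $\Sigma(t, t')$) reduces to a simple closed form. The expansion below is a sketch for illustration; $F_X$ denotes the cdf of $X$ and $x \wedge x'$ the componentwise minimum, both of which are notation introduced here rather than taken from the text.

```latex
\begin{aligned}
\Sigma(t,t')
 &= E\bigl[1\{X \le x \wedge x'\}\,
      \bigl(1\{U \le u\} - u\bigr)\bigl(1\{U \le u'\} - u'\bigr)\bigr] \\
 &= F_X(x \wedge x')\;
    E\bigl[\bigl(1\{U \le u\} - u\bigr)\bigl(1\{U \le u'\} - u'\bigr)\bigr]
    && (U \perp X) \\
 &= F_X(x \wedge x')\,\bigl(\min\{u,u'\} - u\,u'\bigr)
    && (U \sim U(0,1)).
\end{aligned}
```

The second factor is the covariance function of a Brownian bridge, so in this sketch the kernel depends on the data distribution only through $F_X$.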
5. EXAMPLE AND MONTE CARLO EVIDENCE

5.1. The model

We proceed to illustrate how non-separable structures of the form in (2.1) arise naturally in simple economic models. We shall also use this example in a small Monte Carlo study of the performance of our estimator. Our example is a basic version of the Berry et al. (1995) (BLP henceforth) model with two products and two firms. On the demand side, we use a random utility specification à la Hausman and Wise (1978):

$$u_{ij} = -a p_j + b'x_j + \xi_j + \zeta_i + \varepsilon_{ij}, \tag{5.1}$$

in which $u_{ij}$ is the utility of product $j$ ($j = 1, 2$) to individual $i$ ($i = 1, \ldots, I$) with unobserved characteristics $\zeta_i$ ($\zeta_i \in \mathbb{R}$); $p_j$ and $x_j$ are, respectively, the price and a $d_x$-vector of observed characteristics of product $j$ ($p_j \in \mathbb{R}_+$, $x_j \in \mathbb{R}^{d_x}$, $d_x < \infty$); $b$ is a $d_x$-vector of coefficients determining the impact of $x_j$ on the utility for $j$ ($b \in \mathbb{R}^{d_x}$), and $\xi_j$ is an index of unobserved characteristics of the latter ($\xi_j \in \mathbb{R}$); $-a$ is a taste parameter on the price assumed constant across individuals ($a > 0$); finally, $\varepsilon_{ij}$ is an error term that represents the deviations from an average behaviour of agents and whose distribution is induced by the characteristics of the individual $i$ and those of product $j$ ($\varepsilon_{ij} \in \mathbb{R}$).

A baseline specification of the random utility in (5.1) is that $\varepsilon_{ij}$ are i.i.d. across products $j$ and individuals $i$. For example, assuming that the $\varepsilon_{ij}$ are Gumbel random variables, the resulting individual choice model is logit. In what follows, we let the difference $\varepsilon_{i2} - \varepsilon_{i1}$ be distributed with some known cdf $F$ that need not be logit. Note that $F$ necessarily satisfies $F(-\varepsilon) = 1 - F(\varepsilon)$. When $\varepsilon_{i2} - \varepsilon_{i1}$ has cdf $F$, the demand for good $j$, denoted $D_j(p_j, p_{-j})$, is given by

$$D_j(p_j, p_{-j}) = M \cdot F(-a(p_j - p_{-j}) + b'(x_j - x_{-j}) + \xi_j - \xi_{-j}), \tag{5.2}$$

where $M$ is the total market size. Hereafter, we let $Y \equiv F^{-1}(D_1(p_1, p_2)/M)$ be the quantile of the market share for firm 1's good ($Y \in \mathbb{R}$), $P \equiv p_1 - p_2$, $X \equiv x_1 - x_2$ and $\xi \equiv \xi_1 - \xi_2$. Then, the structural BLP model of (5.2) takes the form

$$Y = -aP + b'X + \xi \quad \text{with } \xi \perp X. \tag{5.3}$$
In the model above, prices are endogenous, so even if $\xi$ is independent of $X$, we can expect $P$ to depend on $\xi$. Hence, without further restrictions on $\xi$ and $P$ it is not possible to identify the parameters $a$ and $b$ in (5.3). We now show how the supply-side information may be used to identify these parameters. We assume that firms compete in prices (à la Bertrand), so each firm chooses the price which maximizes its profit $\Pi_j(p_j, p_{-j}) = (p_j - c) D_j(p_j, p_{-j})$. We assume the marginal cost parameter $c$ to be the same for both firms. The equilibrium prices $(p_1, p_2)$ are implicitly defined by the solution to the Bertrand game with exogenous variables $X$. Lemma 5.1 exploits this relationship to obtain an alternative representation for the BLP model (5.3).

LEMMA 5.1. Assume $F$ is twice continuously differentiable on $\mathbb{R}$ with strictly increasing hazard rate $\tau$. If $\xi$ is continuously distributed, then it follows that:

$$Y = h(X, U) + X'\theta, \quad U \sim U(0, 1), \tag{5.4}$$

with $h$ continuously differentiable, $\partial h(x, u)/\partial u > 0$, and $\theta = b$.

Lemma 5.1 assumes the hazard rate $\tau(\varepsilon) \equiv f(\varepsilon)/[1 - F(\varepsilon)]$ to be strictly increasing on $\mathbb{R}$, which is equivalent to requiring that $f'(\varepsilon)[1 - F(\varepsilon)] + f^2(\varepsilon) > 0$ for all $\varepsilon \in \mathbb{R}$ (also equivalent to $f'(\varepsilon)F(\varepsilon) - f^2(\varepsilon) < 0$). This assumption guarantees the existence of a unique Nash equilibrium and the lemma can then be obtained by analysing the equilibrium strategies.
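As a concrete check of the hazard-rate condition in Lemma 5.1, consider the logistic cdf, which is the distribution of the difference of two i.i.d. Gumbel errors. The computation below is a standard calculus exercise included only as an illustration; it is not part of the original argument.

```latex
\begin{gathered}
F(\varepsilon) = \frac{1}{1 + e^{-\varepsilon}}, \qquad
f(\varepsilon) = F(\varepsilon)\,[1 - F(\varepsilon)], \qquad
\tau(\varepsilon) = \frac{f(\varepsilon)}{1 - F(\varepsilon)} = F(\varepsilon), \\
\tau'(\varepsilon) = f(\varepsilon) > 0
\quad \text{for all } \varepsilon \in \mathbb{R}.
\end{gathered}
```

The logistic hazard rate is therefore strictly increasing, so this $F$ satisfies the condition of Lemma 5.1.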
The BLP model (5.4) is clearly a special case of the non-separable structural model in (2.1) with $\alpha \equiv (\theta, h)$ and

$$m_\alpha(X, U) = h(X, U) + X'\theta. \tag{5.5}$$

We now illustrate how to verify other assumptions for this model. If the BLP model variables are i.i.d. with continuous distribution functions, then the continuous differentiability of the demand function guarantees that the sampling Assumption 2.1 holds. We note that in this example the supports of the endogenous and exogenous variables are given by $\mathcal{Y} = \mathbb{R}$ and $\mathcal{X} \subseteq \mathbb{R}^{d_x}$. Far more difficult to check is the identification Assumption 2.2, for which we now derive more primitive conditions. Our identification result for the BLP model (5.4) is contained in the following theorem.

THEOREM 5.1. Assume $F$ is strictly increasing, twice continuously differentiable on $\mathbb{R}$ with strictly increasing hazard rate $\tau$. Assume moreover that $\xi$ is continuously distributed, and that we have: (a) $h(0, 1/2) = 0$ and (b) $\partial h(0, 1/2)/\partial x = 1$. Then, the BLP model (5.4) satisfies Assumption 2.2.

The conditions of Theorem 5.1 fix the values of the unknown function $h$ and of its gradient with respect to $x$, denoted by $\partial h(x, u)/\partial x$, at zero. In particular, (a) holds if the distribution $F_\xi$ of the products' unobservables $\xi$ in the BLP model in equation (5.3) is known to satisfy $F_\xi(0) = 1/2$, since when $X = 0$ and $\xi = 0$ the equilibrium is symmetric ($x_1 = x_2$), which implies $P = 0$.5 Hence, $-aP + \xi = 0 = h(0, 1/2)$. Requirement (b) fixes the value of the gradient $\partial h(x, u)/\partial x$ at zero. It ensures that the effects of changing $\theta$ can be separated from those of changing $h$. Indeed, if $h$ is additive in $x$ as in $h(x, u) = \phi'x + r(u)$, then (b) holds if $\phi = 1$. This restriction is as we would expect, since it would otherwise be impossible to identify $\theta$ in $Y = (\phi + \theta)'X + r(U)$.

5 Note that whenever $\xi_1$ and $\xi_2$ are identically distributed, the distribution of their difference $\xi$ satisfies $F_\xi(0) = 1/2$.

In the context of the BLP model (5.4), Assumptions 2.3(b) and 2.3(e) can be verified by letting $\mathcal{H}$ be a smooth set of functions. For example, suppose $X$ has compact support $\mathcal{X}$ and let $\lambda$ be a $d_x + 1$ dimensional vector of positive integers. Define $|\lambda| \equiv \sum_{i=1}^{d_x + 1} \lambda_i$ and $D^\lambda = \partial^{|\lambda|} / \partial x_1^{\lambda_1} \cdots \partial x_{d_x}^{\lambda_{d_x}} \partial u^{\lambda_{d_x + 1}}$. An appropriate set $\mathcal{H}$ is then

$$\mathcal{H} = \left\{ h : \max_{|\lambda| \le \frac{3(d_x + 1)}{2} + 1}\; \sup_{(x,u) \in \mathcal{X} \times (0,1)} |D^\lambda h(x, u)| \le M, \quad \inf_{(x,u) \in \mathcal{X} \times (0,1)} \frac{\partial h(x, u)}{\partial x} \ge \varepsilon \right\}$$

for some positive $M$ and $\varepsilon$. By Theorem 2.7.1 in van der Vaart and Wellner (1996), Assumptions 2.3(b) and 2.3(e) are then satisfied. The definition of $\mathcal{H}$ also ensures Assumption 2.3(c) holds, while 2.3(d) is immediate from (5.5).

As already noted, $m_\alpha$ in (5.5) is linear, and since it is a continuous map from $(\mathcal{A}, \|\cdot\|_c)$ to $L^2$, it is continuously Fréchet differentiable with

$$\frac{dm_\alpha(x, u)}{d\alpha}[\pi] = \pi_h(x, u) + x'\pi_\theta,$$

which verifies Assumption 3.1(a). In addition, for any $t = (x, u)$ we then have

$$D_\alpha[\pi](t) = \int_{\mathcal{X}} f_{Y|X}\left( h(s_x, u) + s_x'\theta \mid s_x \right) \left( \pi_h(s_x, u) + s_x'\pi_\theta \right) 1\{s_x \le x\}\, f_X(s_x)\, ds_x. \tag{5.6}$$
Hence, if $f_{Y|X}(y \mid x)$ is uniformly bounded, then $\mathcal{X}$ and $\Theta$ compact together with our choice for $\mathcal{H}$ imply Assumption 4.1(d) holds. Similarly, by direct calculation we obtain that in the discussed BLP example, for any $\alpha = (\theta, h)$ and $\bar\alpha = (\bar\theta, \bar h)$ we have

$$\frac{dD_{\alpha + \tau\bar\alpha}[\pi]}{d\tau}(t) \Big|_{\tau = s} = \int_{\mathcal{X}} f'_{Y|X}\left( h(s_x, u) + s\bar h(s_x, u) + s_x'(\theta + s\bar\theta) \mid s_x \right) \left( \pi_h(s_x, u) + s_x'\pi_\theta \right) \left( \bar h(s_x, u) + s_x'\bar\theta \right) 1\{s_x \le x\}\, f_X(s_x)\, ds_x;$$

hence Assumption 4.1(c) is easily verified if $|f'_{Y|X}(y \mid x)|$ is bounded in $(y, x)$ on $S$.

5.2. Monte Carlo setup

We consider the case in which the idiosyncratic errors $\varepsilon_{i1}$ and $\varepsilon_{i2}$ in (5.1) are i.i.d. Gumbel random variables, so that the distribution $F$ of their difference is logistic. The equilibrium prices are the solution to the FOC equations:

$$\exp(\Delta) + 1 - a(p_1 - c) = 0, \qquad \exp(-\Delta) + 1 - a(p_2 - c) = 0,$$

where $\Delta \equiv -a(p_1 - p_2) + bX + \xi$ as before; in this model $X$ is scalar ($d_x = 1$). The equilibrium prices obtained by solving the above equations are continuously differentiable functions of $X$; see Lemma E.1. A simple application of the Implicit Function Theorem shows that at equilibrium the prices satisfy

$$\frac{\partial(p_1 - p_2)}{\partial x} = \frac{2b}{3a},$$

so $\partial h(0, 1/2)/\partial x = -2b/3$. We set the true values of the parameters to be $a = 2.4$, $b = -1.5$ and $c = 1$. The variables are drawn as $X \sim U[-1, 1]$ and $\xi \sim N(0, 1)$, where $X$ and $\xi$ are independent.6 For the sieve we used a fully interacted polynomial of order 2 in $X$ and $U$, while the measure $\mu$ was chosen to be uniform on $[-1, 1] \times [0, 1]$.

Table 1 reports the mean, standard deviation, mean squared error and the 10th, 50th and 90th percentiles of the proposed estimator $\hat\theta$ for sample sizes $n = 100, 200, 500$. The statistics were computed based on 500 replications. The estimator performs well, exhibiting only a small downward bias (recall the true value is $b = -1.5$) and small mean squared errors for sample sizes of 200 and 500 observations. For the latter two sample sizes, the estimator is also within 0.1 of the true value in over 80% of the replications.
Figure 1 exhibits a Gaussian kernel estimate for the density of $\hat\theta$ obtained with sample size 500. The density is fairly symmetric and centred at the true value −1.5. Overall, we find the performance of the estimator on this limited Monte Carlo study to be encouraging.

Table 1. Monte Carlo results.

           Mean     STD     MSE     10%      50%      90%
n = 100   −1.493   0.110   0.012   −1.629   −1.498   −1.354
n = 200   −1.493   0.072   0.005   −1.583   −1.495   −1.395
n = 500   −1.486   0.043   0.002   −1.541   −1.484   −1.432

Figure 1. $\hat\theta$ Kernel density estimate (n = 500).

6 Note that by setting b = −1.5 we ensure that the identification condition (b) of Theorem 5.1 is satisfied. Condition (a) holds since the distribution of ξ is symmetric around zero.
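The data-generating process of Section 5.2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: subtracting the two first-order conditions gives $a(p_1 - p_2) = 2\sinh(\Delta)$, so the index $\Delta = -a(p_1 - p_2) + bX + \xi$ solves the scalar equation $\Delta + 2\sinh(\Delta) = bX + \xi$, which is solved here by bisection. With logistic $F$, $Y = F^{-1}(D_1/M) = \Delta$. The function names are hypothetical.

```python
import math
import random

A, B, C = 2.4, -1.5, 1.0  # true values of a, b and c from Section 5.2


def solve_delta(v, tol=1e-12):
    """Solve delta + 2*sinh(delta) = v by bisection (left side is increasing)."""
    lo, hi = -abs(v) - 1.0, abs(v) + 1.0  # bracket guaranteed to contain the root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid + 2.0 * math.sinh(mid) < v:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def equilibrium(x, xi):
    """Return (Y, p1, p2) for one market with characteristics X = x and xi."""
    delta = solve_delta(B * x + xi)
    p1 = C + (math.exp(delta) + 1.0) / A   # first FOC solved for p1
    p2 = C + (math.exp(-delta) + 1.0) / A  # second FOC solved for p2
    return delta, p1, p2                   # Y equals delta under logistic F


def draw_sample(n, seed=0):
    """Draw n observations (y_i, x_i) with X ~ U[-1, 1] and xi ~ N(0, 1)."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        xi = rng.gauss(0.0, 1.0)
        y, _, _ = equilibrium(x, xi)
        sample.append((y, x))
    return sample


def h_at_median(x):
    """h(x, 1/2) = Y - b*x evaluated at xi = 0, the median of xi."""
    return equilibrium(x, 0.0)[0] - B * x


# Numerical check of identification condition (b): dh(0, 1/2)/dx = -2b/3 = 1.
eps = 1e-5
dh_at_zero = (h_at_median(eps) - h_at_median(-eps)) / (2.0 * eps)
sample = draw_sample(500)
```

Feeding such a sample into a sieve minimization of $Q_n$ is a separate step that this sketch does not attempt.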
6. CONCLUSION

We have proposed a general estimation framework for a large class of semi-parametric non-separable models. The resulting estimator of the non-parametric component converges at an $o_p(n^{-1/4})$ rate, and the estimator of the parametric component is asymptotically normal. Some of the assumptions must be verified on a model-specific basis, which we have done in an example motivated by the Berry et al. (1995) model of price-setting with differentiated products. A small Monte Carlo study illustrates the performance of the proposed estimator within the BLP example.
ACKNOWLEDGMENTS We would like to thank the Editor, Jean-Marc Robin, and two anonymous referees for their suggestions, which improved an earlier version of the paper. This paper was presented at the 2008 EC2 Meeting ‘Recent Advances in Structural Microeconometrics’ in Rome, Italy. Many thanks to the participants of the conference for their comments. All errors are ours.
REFERENCES Ai, C. and X. Chen (2003). Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica 71, 1795–844. C 2010 The Author(s). The Econometrics Journal C 2010 Royal Economic Society.
S40
I. Komunjer and A. Santos
Akritas, M. and I. van Keilegom (2001). Non-parametric estimation of the residual distribution. Scandinavian Journal of Statistics 28, 549–67.
Altonji, J. G. and R. L. Matzkin (2005). Cross section and panel data estimators for nonseparable models with endogenous regressors. Econometrica 73, 1053–102.
Benkard, C. L. and S. Berry (2006). On the nonparametric identification of non-linear simultaneous equations models: comment on Brown (1983) and Roehrig (1988). Econometrica 74, 1429–40.
Berry, S. T., J. Levinsohn and A. Pakes (1995). Automobile prices in market equilibrium. Econometrica 63, 841–90.
Blundell, R., X. Chen and D. Kristensen (2007). Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica 75, 1613–69.
Brown, B. W. (1983). The identification problem in systems nonlinear in the variables. Econometrica 51, 175–96.
Brown, D. J. and M. H. Wegkamp (2002). Weighted minimum mean-square distance from independence estimation. Econometrica 70, 2035–51.
Carrasco, M. and J.-P. Florens (2000). Generalization of GMM to a continuum of moment conditions. Econometric Theory 16, 797–834.
Chen, X. (2006). Large sample sieve estimation of semi-nonparametric models. Working paper, Yale University.
Chen, X. and D. Pouzo (2008a). Efficient estimation of semiparametric conditional moment models with possibly nonsmooth moments. Working paper, Yale University.
Chen, X. and D. Pouzo (2008b). Estimation of nonparametric conditional moment models with possibly nonsmooth moments. Working paper, Yale University.
Chernozhukov, V. and C. Hansen (2005). An IV model of quantile treatment effects. Econometrica 73, 245–61.
Chesher, A. (2003). Identification in nonseparable models. Econometrica 71, 1405–41.
Florens, J. P., J. J. Heckman, C. Meghir and E. Vytlacil (2008). Identification of treatment effects using control functions in models with continuous, endogenous treatment and heterogenous effects. Econometrica 76, 1191–206.
Hausman, J. A. and D. A. Wise (1978). A conditional probit model for qualitative choice: discrete decisions recognizing interdependence and heterogeneous preferences. Econometrica 46, 403–26.
Horowitz, J. L. and S. Lee (2007). Nonparametric instrumental variable estimation of a quantile regression model. Econometrica 75, 1191–208.
Imbens, G. W. and W. K. Newey (2009). Identification and estimation of triangular simultaneous equation models without additivity. Econometrica 77, 1481–512.
Komunjer, I. (2009). Global identification of the semiparametric Box–Cox model. Economics Letters 104, 53–56.
Luenberger, D. G. (1969). Optimization by Vector Space Methods. New York: John Wiley.
Manski, C. F. (1983). Closest empirical distribution estimation. Econometrica 51, 305–19.
Matzkin, R. L. (1994). Restrictions of economic theory in nonparametric methods. In R. F. Engle and D. L. McFadden (Eds.), Handbook of Econometrics, Volume 4, 2524–59. Amsterdam: North-Holland.
Matzkin, R. L. (2003). Nonparametric estimation of nonadditive random functions. Econometrica 71, 1339–75.
Milgrom, P. and J. Roberts (1990). Rationalizability, learning and equilibrium in games with strategic complementarities. Econometrica 58, 1255–77.
Newey, W. K. and J. Powell (2003). Instrumental variables estimation of nonparametric models. Econometrica 71, 1565–78.
Roehrig, C. S. (1988). Conditions for identification in nonparametric and parametric models. Econometrica 56, 433–47.
Rudin, W. (1976). Principles of Mathematical Analysis. New York: McGraw-Hill.
Siddiqi, A. H. (2004). Applied Functional Analysis. New York: Marcel Dekker.
van der Vaart, A. W. and J. A. Wellner (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer.
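APPENDIX A: NOTATION AND DEFINITIONS

The following is a table of the notation and definitions to be used.

$a \lesssim b$: $a \le Mb$ for some constant $M$ which is universal in the context of the proof.
$\|\cdot\|_c$: The norm $\|\alpha\|_c \equiv \|\theta\| + \|h\|_\infty$ where $\alpha = (\theta, h)$.
$\|\cdot\|_s$: The norm $\|\alpha\|_s \equiv \|\theta\| + \|h\|_{L^2}$ where $\alpha = (\theta, h)$.
$\|\cdot\|_\infty$: The norm $\|h\|_\infty \equiv \sup_{(x,u) \in \mathcal{X} \times (0,1)} |h(x,u)|$.
$\|\cdot\|_{L^2}$: The norm $\|h\|^2_{L^2} \equiv \int_{\mathcal{X} \times (0,1)} h^2(x,u) f_X(x)\,dx\,du$.
$\|\cdot\|_{L^2_\mu}$: The norm $\|h\|^2_{L^2_\mu} \equiv \int_{\mathcal{X} \times (0,1)} h^2(x,u)\,d\mu(t)$ where $t \equiv (x,u)$.
$N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|)$: The bracketing numbers of size $\varepsilon$ for $\mathcal{F}$ under the norm $\|\cdot\|$.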
A mapping $m : (\mathcal{A}, \|\cdot\|_c) \to L^2$ is said to be Fréchet differentiable if there exists a bounded linear map $\frac{dm}{d\alpha} : (\mathcal{A}, \|\cdot\|_c) \to L^2$ such that
\[ \lim_{\|\pi\|_c \to 0} \|\pi\|_c^{-1} \left\| m_{\alpha+\pi} - m_\alpha - \frac{dm}{d\alpha}[\pi] \right\|_{L^2} = 0. \]
The Fréchet derivative is a natural extension of a derivative to general metric spaces.
APPENDIX B: PROOFS FOR SECTION 2

Proof of Lemma 2.1: First, consider all values $\bar\alpha$ of $\alpha$ in $\mathcal{A}$ such that $m_{\bar\alpha}(x,u)$ is not strictly increasing in $u$ on $(0,1)$ for all values of $x \in \mathcal{X}$. Let $\bar{x} \in \mathcal{X}$ be one such value. Then, the function $u \mapsto P(X \le \bar{x};\, Y \le m_{\bar\alpha}(X,u))$ is not strictly increasing on $(0,1)$; hence, there must exist $\bar{u} \in (0,1)$ such that $P(X \le \bar{x};\, Y \le m_{\bar\alpha}(X,\bar{u})) \ne \bar{u} \cdot P(X \le \bar{x})$.

Now, consider all values $\tilde\alpha$ of $\alpha$ in $\mathcal{A}$ such that $m_{\tilde\alpha}(x,u)$ is strictly increasing in $u$ on $(0,1)$ for all values of $x \in \mathcal{X}$. Note that $\alpha_0$ is an element of that set. For any such $\tilde\alpha$ and any $(x,u) \in \mathcal{X} \times (0,1)$ the following holds:
\[
\begin{aligned}
P(X \le x;\, Y \le m_{\tilde\alpha}(X,u))
&= \int_{s_x \le x} \int_{s_y \le m_{\tilde\alpha}(s_x,u)} f_{XY}(s_x,s_y)\,ds_x\,ds_y \\
&= \int_{s_x \le x} \int_{s_u \le u} f_{XY}(s_x, m_{\tilde\alpha}(s_x,s_u)) \frac{\partial m_{\tilde\alpha}(s_x,s_u)}{\partial u}\,ds_x\,ds_u \\
&= \int_{s_x \le x} \int_{s_u \le u} f_{X\tilde{U}}(s_x,s_u)\,ds_x\,ds_u \\
&= P(X \le x;\, \tilde{U} \le u),
\end{aligned}
\tag{B.1}
\]
where for the second and third equalities we made a change of variable $(s_x,s_y) = (s_x, m_{\tilde\alpha}(s_x,s_u))$ and a change in measure $Y = m_{\tilde\alpha}(X,\tilde{U})$. Under Assumption 2.2, $\tilde\alpha = \alpha_0$ if and only if $\tilde{U} \perp X$, which since $\tilde{U}$ is uniform on $(0,1)$ is equivalent to
\[ P(\tilde{U} \le u;\, X \le x) = u \cdot P(X \le x) \tag{B.2} \]
for all $(x,u) \in \mathcal{X} \times (0,1)$. Combining (B.2) and (B.1) then establishes the lemma.

LEMMA B.1. Under Assumptions 2.1(a)–(c) and 2.3(b)–(e), the following class is Donsker:
\[ \mathcal{F} \equiv \{ f(y_i,x_i) = 1\{y_i \le m_\alpha(x_i,u);\, x_i \le x\},\ (\alpha,x,u) \in \mathcal{A} \times \mathbb{R}^{d_x} \times (0,1) \}. \]
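Proof: First define the following classes of functions for $1 \le k \le d_x$:
\[ \mathcal{F}_u \equiv \{ f(y_i,x_i) = 1\{y_i \le m_\alpha(x_i,u)\} : (\alpha,u) \in \mathcal{A} \times (0,1) \}, \tag{B.3} \]
\[ \mathcal{F}_x^{(k)} \equiv \{ f(x_i) = 1\{x_i^{(k)} \le t\} : t \in \mathbb{R} \}, \tag{B.4} \]
where $x^{(k)}$ is the $k$th coordinate of $x$. Further note that by direct calculation we have
\[ \mathcal{F} = \mathcal{F}_u \times \prod_{k=1}^{d_x} \mathcal{F}_x^{(k)}. \tag{B.5} \]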
We establish the lemma by exploiting (B.5). For any continuously distributed random variable $V \in \mathbb{R}$ and $\eta > 0$ we can find $\{-\infty = t_1, \ldots, t_{\eta^{-2}+2} = +\infty\}$ such that they satisfy $P(t_s \le V \le t_{s+1}) \le \eta^2$. The brackets $[1\{v \le t_s\}, 1\{v \le t_{s+1}\}]$ then cover $\{1\{v \le t\} : t \in \mathbb{R}\}$ and in addition we have $E[(1\{V \le t_s\} - 1\{V \le t_{s+1}\})^2] \le \eta^2$. Therefore, we immediately establish that for all $1 \le k \le d_x$:
\[ N_{[\,]}(\eta, \mathcal{F}_x^{(k)}, \|\cdot\|_{L^2}) = O(\eta^{-2}). \tag{B.6} \]
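The bracket construction behind (B.6) can be illustrated numerically. The sketch below is not part of the original proof: it assumes, purely for illustration, that $V$ is standard logistic, and the helper names (`indicator_cut_points`, `squared_bracket_sizes`) are hypothetical. It places cut points at equally spaced quantiles so that every cell has probability at most $\eta^2$, which needs on the order of $\eta^{-2}$ brackets.

```python
import math

def indicator_cut_points(quantile, eta):
    """Cut points -inf = t_1 < ... < t_{K+1} = +inf whose cells have probability <= eta**2."""
    cells = math.ceil(1.0 / eta ** 2)  # K cells, so K = O(eta^-2)
    inner = [quantile(k / cells) for k in range(1, cells)]
    return [-math.inf] + inner + [math.inf]

def squared_bracket_sizes(cuts, cdf):
    """Squared L2 size of each bracket [1{v <= t_s}, 1{v <= t_{s+1}}].

    E[(1{V <= t_{s+1}} - 1{V <= t_s})^2] equals P(t_s < V <= t_{s+1}).
    """
    return [cdf(hi) - cdf(lo) for lo, hi in zip(cuts[:-1], cuts[1:])]

# Illustrative assumption: V follows a standard logistic distribution.
def logistic_cdf(t):
    if t == -math.inf:
        return 0.0
    if t == math.inf:
        return 1.0
    return 1.0 / (1.0 + math.exp(-t))

def logistic_quantile(p):
    return math.log(p / (1.0 - p))

eta = 0.1
cuts = indicator_cut_points(logistic_quantile, eta)
sizes = squared_bracket_sizes(cuts, logistic_cdf)
print(len(sizes), max(sizes) <= eta ** 2 + 1e-9)
```

Halving $\eta$ roughly quadruples the number of brackets, which is the $O(\eta^{-2})$ growth stated in (B.6).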
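By Assumption 2.3(b), $\mathcal{H}$ is compact under $\|\cdot\|_\infty$ and $\Theta$ under $\|\cdot\|$. Thus, for any $K_h, K_\theta > 0$ there exist collections $\{h_j\}$ and $\{\theta_l\}$ such that the open balls of size $K_h\eta^3$ around $\{h_j\}$ and of size $K_\theta\eta^3$ around $\{\theta_l\}$ cover $\mathcal{H}$ and $\Theta$, respectively. Defining $\{\alpha_s\} = \{h_j\} \times \{\theta_l\}$ we then have:
\[ \#\{\alpha_s\} = N_{[\,]}(K_h\eta^3, \mathcal{H}, \|\cdot\|_\infty) \times (K_\theta\eta^3)^{-d_\theta}. \tag{B.7} \]
Hence, by Assumption 2.3(c), for any $\alpha \in \mathcal{A}$ there is a $(\theta_{s^*}, h_{s^*}) \equiv \alpha_{s^*} \in \{\alpha_s\}$ with
\[ \sup_{u \in (0,1)} |m_\alpha(x_i,u) - m_{\alpha_{s^*}}(x_i,u)| \le G(x_i)\{\|\theta - \theta_{s^*}\| + \|h - h_{s^*}\|_\infty\} \le G(x_i)\{K_\theta + K_h\}\eta^3. \tag{B.8} \]
We conclude from (B.8) that for $\alpha_s \in \{\alpha_s\}$ brackets of the form
\[ [\,m_{\alpha_s}(x_i,u) - \{K_\theta + K_h\}\eta^3 G(x_i);\ m_{\alpha_s}(x_i,u) + \{K_\theta + K_h\}\eta^3 G(x_i)\,] \tag{B.9} \]
cover the class $\{m_\alpha(x_i,u) : \alpha \in \mathcal{A}\}$ for each fixed $u \in (0,1)$. Next note that since $m_\alpha(x_i,u)$ is strictly increasing in $u$ for all $(x_i,\alpha)$ by Assumption 2.3(c), we may define their inverses:
\[ v_\alpha(x_i,t) = u \iff m_\alpha(x_i,u) = t. \tag{B.10} \]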
Following Akritas and van Keilegom (2001), for each $\alpha_s \in \{\alpha_s\}$ we let $F_s^U(u)$ be as in the first equality in (B.11) and obtain the second equality in (B.11) from (B.10).
\[ F_s^U(u) \equiv P(Y_i \le m_{\alpha_s}(X_i,u) + \{K_\theta + K_h\}\eta^3 G(X_i)) = P(v_{\alpha_s}(X_i, Y_i - \{K_\theta + K_h\}\eta^3 G(X_i)) \le u). \tag{B.11} \]
Arguing as in (B.6), there is a collection $\{u^U_{sk}\}$ with $\#\{u^U_{sk}\} = O(\eta^{-2})$ such that it partitions $\mathbb{R}$ into segments each with $F_s^U$ probability at most $\eta^2/6$. Similarly, also let
\[ F_s^L(u) \equiv P(Y_i \le m_{\alpha_s}(X_i,u) - \{K_\theta + K_h\}\eta^3 G(X_i)) \tag{B.12} \]
and choose $\{u^L_{sk}\}$ with $\#\{u^L_{sk}\} = O(\eta^{-2})$ so that it partitions $\mathbb{R}$ into segments with $F_s^L$ probability at most $\eta^2/6$. Next, combine $\{u^L_{sk}\}$ and $\{u^U_{sk}\}$ by letting each $u \in \mathbb{R}$ form the bracket $u^L_{sk_1} \le u \le u^U_{sk_2}$, where $u^L_{sk_1}$ is the largest element of $\{u^L_{sk}\}$ such that $u^L_{sk} \le u$, and similarly $u^U_{sk_2}$ is the smallest element in $\{u^U_{sk}\}$ such that $u^U_{sk} \ge u$. We denote this new bracket by $\{[u_{sk_1}, u_{sk_2}]\}$ and note that
\[ \#\{[u_{sk_1}, u_{sk_2}]\} = O(\eta^{-2}). \tag{B.13} \]
It follows from (B.9) and the strict monotonicity of $m_\alpha(x,u)$ in $u$ that for every $(\alpha,u) \in \mathcal{A} \times (0,1)$ there exists an $\alpha_s \in \{\alpha_s\}$ and $[u_{sk_1}, u_{sk_2}] \in \{[u_{sk_1}, u_{sk_2}]\}$ such that
\[ 1\{y_i \le m_{\alpha_s}(x_i,u_{sk_1}) - \{K_\theta + K_h\}\eta^3 G(x_i)\} \le 1\{y_i \le m_\alpha(x_i,u)\} \le 1\{y_i \le m_{\alpha_s}(x_i,u_{sk_2}) + \{K_\theta + K_h\}\eta^3 G(x_i)\}, \tag{B.14} \]
and hence $\{[\,1\{y_i \le m_{\alpha_s}(x_i,u_{sk_1}) - \{K_\theta + K_h\}\eta^3 G(x_i)\},\ 1\{y_i \le m_{\alpha_s}(x_i,u_{sk_2}) + \{K_\theta + K_h\}\eta^3 G(x_i)\}\,]\}$ form brackets for the class of functions $\mathcal{F}_u$. In order to calculate the size of the proposed brackets, note their $L^2$ squared norm is equal to $F_s^U(u_{sk_2}) - F_s^L(u_{sk_1})$. The construction of $\{[u_{sk_1}, u_{sk_2}]\}$ in turn implies the first inequality in (B.15) holds for any $u \in [u_{sk_1}, u_{sk_2}]$, while direct calculation yields the second inequality for any constant $M_\eta > 0$. Setting $M_\eta = 6E[G^2(X_i)]/\eta$ and Chebychev's inequality yields the final result in (B.15).
\[
\begin{aligned}
F_s^U(u_{sk_2}) - F_s^L(u_{sk_1})
&\le F_s^U(u) - F_s^L(u) + \frac{\eta^2}{3} \\
&\le F_s^U(u;\, G(X_i) \le M_\eta) - F_s^L(u;\, G(X_i) \le M_\eta) + 2P(G(X_i) \ge M_\eta) + \frac{\eta^2}{3} \\
&\le F_s^U(u;\, G(X_i) \le M_\eta) - F_s^L(u;\, G(X_i) \le M_\eta) + \frac{2}{3}\eta^2.
\end{aligned}
\tag{B.15}
\]
To conclude, note that $M_\eta = 6E[G^2(X_i)]/\eta$ and the Mean Value Theorem imply that
\[
\begin{aligned}
F_s^U(u;\, G(X_i) \le M_\eta) - F_s^L(u;\, G(X_i) \le M_\eta)
&\le P(Y_i \le m_{\alpha_s}(X_i,u) + \{K_\theta + K_h\}M_\eta\eta^3) - P(Y_i \le m_{\alpha_s}(X_i,u) - \{K_\theta + K_h\}M_\eta\eta^3) \\
&\le 2 \sup_{y_i,x_i} f_{Y|X}(y_i \mid x_i)\,\{K_\theta + K_h\}\,6E[G^2(X_i)]\,\eta^2,
\end{aligned}
\]
where the resulting expression is finite due to Assumptions 2.1(c) and 2.3(d). Combining the preceding result with that obtained in (B.15) it follows that by choosing
\[ \{K_\theta + K_h\} \le \Bigl( 2 \sup_{y_i,x_i} f_{Y|X}(y_i \mid x_i)\, 6E[G^2(X_i)] \Bigr)^{-1} \]
the proposed brackets will have $L^2$ size $\eta$. Thus, we have from (B.7) and (B.13),
\[ N_{[\,]}(\eta, \mathcal{F}_u, \|\cdot\|_{L^2}) = O\bigl( N_{[\,]}(K_h\eta^3, \mathcal{H}, \|\cdot\|_\infty) \times (K_\theta\eta^3)^{-(2+d_\theta)} \bigr). \tag{B.16} \]
To conclude note that (B.6), (B.16), Assumption 2.3(d) and Theorem 2.5.6 in van der Vaart and Wellner (1996) imply the classes $\mathcal{F}_x^{(k)}$ and $\mathcal{F}_u$ are Donsker. In turn, since all classes are uniformly bounded by 1, Theorem 2.10.6 in van der Vaart and Wellner (1996) and equation (B.5) establish the claim of the lemma.
Proof of Theorem 2.1: By Assumption 2.3(b) and the Tychonoff Theorem, $\mathcal{A}$ is compact with respect to $\|\cdot\|_c$. Furthermore, Lemma B.1 and simple manipulations show
\[ \sup_{t,\alpha} |W_{\alpha,n}(t) - W_\alpha(t)| = o_p(1). \tag{B.17} \]
Exploiting (B.17) and $W_{\alpha,n}(t)$ and $W_\alpha(t)$ being bounded by 1, we obtain that
\[ \sup_\alpha |Q_n(\alpha) - Q(\alpha)| \lesssim \sup_{t,\alpha} |W_{\alpha,n}(t) - W_\alpha(t)| \times \Bigl[ \sup_{t,\alpha} |W_{\alpha,n}(t)| + \sup_{t,\alpha} |W_\alpha(t)| \Bigr] = o_p(1). \tag{B.18} \]
The result then follows by Lemma A1 in Newey and Powell (2003) and noticing that their requirement that $Q_n(\alpha)$ be continuous can be substituted by $\hat\alpha$ being an element of the argmin correspondence.
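The consistency argument rests on the sample criterion being uniformly close to its population counterpart, which vanishes at $\alpha_0$. The following is a minimal simulation sketch, not the authors' implementation. It assumes a hypothetical scalar design $m_\alpha(x,u) = \theta x + u$ with $X$ and $U$ independent uniforms, takes $W_{\alpha,n}(t)$ to be the sample analogue of $P(Y \le m_\alpha(X,u);\, X \le x) - u \cdot P(X \le x)$, and replaces the measure $\mu$ by equal weights on a finite grid. All function names are illustrative.

```python
import random

def m(theta, x, u):
    # Hypothetical structural function: m_alpha(x, u) = theta * x + h(x, u), with h(x, u) = u.
    return theta * x + u

def W_n(theta, data, x, u):
    """Assumed sample analogue of P(Y <= m(X,u); X <= x) - u * P(X <= x)."""
    n = len(data)
    joint = sum(1 for xi, yi in data if yi <= m(theta, xi, u) and xi <= x) / n
    marginal = sum(1 for xi, _ in data if xi <= x) / n
    return joint - u * marginal

def Q_n(theta, data, grid):
    """Criterion: average of W_n^2 over the grid (mu = equal weights on grid points)."""
    return sum(W_n(theta, data, x, u) ** 2 for x, u in grid) / len(grid)

def simulate(n, theta0, seed=0):
    """Draw (X, Y) with X, U independent U(0,1) and Y = m(theta0, X, U)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, u = rng.random(), rng.random()
        data.append((x, m(theta0, x, u)))
    return data

grid = [(i / 10, j / 10) for i in range(1, 10) for j in range(1, 10)]
data = simulate(400, theta0=1.0)
print(Q_n(1.0, data, grid), Q_n(2.0, data, grid))
```

In this design the criterion evaluated at the true $\theta_0 = 1$ is of sampling-noise size, while at $\theta = 2$ it stays bounded away from zero, which is the separation the argmin argument exploits.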
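APPENDIX C: PROOFS FOR SECTION 3

Proof of Lemma 3.1: Similar to previously, let $W : (\mathcal{A}, \|\cdot\|_c) \to L^2_\mu$ be a mapping which to each $\alpha \in \mathcal{A}$ associates $W(\alpha) \equiv W_\alpha$. We first study the differentiability of $W$ in a neighbourhood of $\alpha_0$. Recall that for any $t = (x,u)$,
\[ D_{\bar\alpha}[\pi](t) = \int_{\mathcal{X}} f_{Y|X}(m_{\bar\alpha}(s_x,u) \mid s_x) \frac{dm_{\bar\alpha}(s_x,u)}{d\alpha}[\pi]\, 1\{s_x \le x\} f_X(s_x)\,ds_x \]
and note that $D_{\bar\alpha}[\pi]$ is well defined for every $\bar\alpha \in \mathcal{N}(\alpha_0)$ due to Assumption 3.1(a). Next, use $f_{Y|X}(y \mid x)$ uniformly bounded and Jensen's inequality to obtain the first result in (C.1). The second inequality then holds for $\|\cdot\|_o$ the linear operator norm by Assumption 3.1(c).
\[
\begin{aligned}
\|D_{\bar\alpha}[\pi]\|^2_{L^2_\mu}
&= \int_{\mathcal{X} \times (0,1)} \left[ \int_{\mathcal{X}} f_{Y|X}(m_{\bar\alpha}(s_x,u) \mid s_x) \frac{dm_{\bar\alpha}(s_x,u)}{d\alpha}[\pi]\, 1\{s_x \le x\} f_X(s_x)\,ds_x \right]^2 d\mu(t) \\
&\lesssim \int_{\mathcal{X} \times (0,1)} \int_{\mathcal{X}} \left[ \frac{dm_{\bar\alpha}(s_x,u)}{d\alpha}[\pi] \right]^2 f_X(s_x)\,ds_x\,d\mu(t) \\
&\lesssim \left\| \frac{dm}{d\alpha}(\bar\alpha) \right\|_o^2 \|\pi\|_c^2.
\end{aligned}
\tag{C.1}
\]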
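Since Fréchet derivatives are a fortiori continuous, (C.1) implies $D_{\bar\alpha}[\pi]$ is continuous in $\pi \in \mathcal{A}$ for all $\bar\alpha \in \mathcal{N}(\alpha_0)$. To examine continuity of $D_{\bar\alpha}$ in $\bar\alpha \in \mathcal{N}(\alpha_0)$, we consider $(\bar\alpha, \tilde\alpha) \in \mathcal{A}^2$ and use Jensen's inequality to obtain (C.2) pointwise in $t = (x,u)$.
\[
\begin{aligned}
|D_{\bar\alpha}[\pi](t) - D_{\tilde\alpha}[\pi](t)|
&\le \int_{\mathcal{X}} f_{Y|X}(m_{\tilde\alpha}(s_x,u) \mid s_x) \left| \frac{dm_{\bar\alpha}(s_x,u)}{d\alpha}[\pi] - \frac{dm_{\tilde\alpha}(s_x,u)}{d\alpha}[\pi] \right| f_X(s_x)\,ds_x \\
&\quad + \int_{\mathcal{X}} \bigl| f_{Y|X}(m_{\bar\alpha}(s_x,u) \mid s_x) - f_{Y|X}(m_{\tilde\alpha}(s_x,u) \mid s_x) \bigr| \left| \frac{dm_{\bar\alpha}(s_x,u)}{d\alpha}[\pi] \right| f_X(s_x)\,ds_x.
\end{aligned}
\tag{C.2}
\]
In turn, the Lipschitz Assumptions 2.3(d) and 3.1(b), $f_{Y|X}(y \mid x)$ uniformly bounded by Assumption 2.1(c) and equation (C.2) yield that pointwise in $t = (x,u)$,
\[
\begin{aligned}
|D_{\bar\alpha}[\pi](t) - D_{\tilde\alpha}[\pi](t)|
&\lesssim \|\bar\alpha - \tilde\alpha\|_c^\nu \int_{\mathcal{X}} J(s_x) G^\nu(s_x) \left| \frac{dm_{\bar\alpha}(s_x,u)}{d\alpha}[\pi] \right| f_X(s_x)\,ds_x \\
&\quad + \int_{\mathcal{X}} \left| \frac{dm_{\bar\alpha}(s_x,u)}{d\alpha}[\pi] - \frac{dm_{\tilde\alpha}(s_x,u)}{d\alpha}[\pi] \right| f_X(s_x)\,ds_x.
\end{aligned}
\tag{C.3}
\]
Using (C.3), Cauchy–Schwarz and Jensen's inequality and $E[J^2(X)G^{2\nu}(X)] < \infty$ yields
\[
\begin{aligned}
\|D_{\bar\alpha}[\pi] - D_{\tilde\alpha}[\pi]\|^2_{L^2_\mu}
&\lesssim \|\bar\alpha - \tilde\alpha\|_c^{2\nu} \int_{\mathcal{X} \times (0,1)} \int_{\mathcal{X}} \left[ \frac{dm_{\bar\alpha}(s_x,u)}{d\alpha}[\pi] \right]^2 f_X(s_x)\,ds_x\,d\mu(t) \\
&\quad + \int_{\mathcal{X} \times (0,1)} \int_{\mathcal{X}} \left[ \frac{dm_{\bar\alpha}(s_x,u)}{d\alpha}[\pi] - \frac{dm_{\tilde\alpha}(s_x,u)}{d\alpha}[\pi] \right]^2 f_X(s_x)\,ds_x\,d\mu(t).
\end{aligned}
\tag{C.4}
\]
Let $\bar{\mathcal{A}}_c$ denote the completion of the linear span of $\mathcal{A}$ under $\|\cdot\|_c$. The definition of $\|\cdot\|_o$ then implies the first equality in (C.5), while the first inequality follows from (C.4). Further, since the functional $\|\frac{dm}{d\alpha}(\bar\alpha)\|_o : (\mathcal{N}(\alpha_0), \|\cdot\|_c) \to \mathbb{R}$ is continuous and $\mathcal{A}$ is compact under $\|\cdot\|_c$ it follows that $\sup_{\mathcal{N}(\alpha_0)} \|\frac{dm(\bar\alpha)}{d\alpha}\|_o < \infty$. The second inequality in (C.5) then follows.
\[
\begin{aligned}
\|D_{\bar\alpha} - D_{\tilde\alpha}\|_o^2
&= \sup_{\pi \in \bar{\mathcal{A}}_c} \|\pi\|_c^{-2} \|D_{\bar\alpha}[\pi] - D_{\tilde\alpha}[\pi]\|^2_{L^2_\mu} \\
&\lesssim \|\bar\alpha - \tilde\alpha\|_c^{2\nu} \left\| \frac{dm}{d\alpha}(\bar\alpha) \right\|_o^2 + \left\| \frac{dm}{d\alpha}(\bar\alpha) - \frac{dm}{d\alpha}(\tilde\alpha) \right\|_o^2 \\
&\lesssim \|\bar\alpha - \tilde\alpha\|_c^{2\nu} + \left\| \frac{dm}{d\alpha}(\bar\alpha) - \frac{dm}{d\alpha}(\tilde\alpha) \right\|_o^2.
\end{aligned}
\tag{C.5}
\]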
Therefore, $D_{\bar\alpha}$ is continuous in $\bar\alpha$ by $m$ being continuously Fréchet differentiable. We now show $D_{\bar\alpha}$ is indeed the Fréchet derivative of $W$ at $\bar\alpha$. Straightforward manipulations imply that for any $t = (x,u) \in \mathcal{X} \times (0,1)$ we have
\[ W_\alpha(t) = \int_{\mathcal{X}} P(Y \le m_\alpha(s_x,u) \mid s_x)\, 1\{s_x \le x\} f_X(s_x)\,ds_x - u \cdot P(X \le x). \tag{C.6} \]
Next, using the definition of $D_{\bar\alpha}$ and (C.6) together with Jensen's inequality we obtain (C.7) pointwise in $t$ for any $\bar\alpha \in \mathcal{N}(\alpha_0)$ and $\pi \in \mathcal{A}$.
\[
|W_{\bar\alpha+\pi}(t) - W_{\bar\alpha}(t) - D_{\bar\alpha}[\pi](t)| \le \int_{\mathcal{X}} \left| P(Y \le m_{\bar\alpha+\pi}(s_x,u) \mid s_x) - P(Y \le m_{\bar\alpha}(s_x,u) \mid s_x) - f_{Y|X}(m_{\bar\alpha}(s_x,u) \mid s_x) \frac{dm_{\bar\alpha}(s_x,u)}{d\alpha}[\pi] \right| f_X(s_x)\,ds_x.
\tag{C.7}
\]
Applying the Mean Value Theorem inside the integral in (C.7) then implies
\[
|W_{\bar\alpha+\pi}(t) - W_{\bar\alpha}(t) - D_{\bar\alpha}[\pi](t)| \le \int_{\mathcal{X}} \left| f_{Y|X}(\bar{m}(s_x,u) \mid s_x)\,[m_{\bar\alpha+\pi}(s_x,u) - m_{\bar\alpha}(s_x,u)] - f_{Y|X}(m_{\bar\alpha}(s_x,u) \mid s_x) \frac{dm_{\bar\alpha}(s_x,u)}{d\alpha}[\pi] \right| f_X(s_x)\,ds_x,
\tag{C.8}
\]
where $\bar{m}(s_x,u)$ is a convex combination of $m_{\bar\alpha+\pi}(s_x,u)$ and $m_{\bar\alpha}(s_x,u)$. Therefore, it follows that $|\bar{m}(s_x,u) - m_{\bar\alpha}(s_x,u)| \le |m_{\bar\alpha+\pi}(s_x,u) - m_{\bar\alpha}(s_x,u)|$. The Lipschitz conditions of Assumptions 2.3(d) and 3.1(b) then imply the inequality:
\[
\int_{\mathcal{X}} \bigl| [f_{Y|X}(\bar{m}(s_x,u) \mid s_x) - f_{Y|X}(m_{\bar\alpha}(s_x,u) \mid s_x)]\,[m_{\bar\alpha+\pi}(s_x,u) - m_{\bar\alpha}(s_x,u)] \bigr| f_X(s_x)\,ds_x \lesssim \|\pi\|_c^{1+\nu} \int_{\mathcal{X}} J(s_x) G^{1+\nu}(s_x) f_X(s_x)\,ds_x.
\tag{C.9}
\]
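Using (C.8), (C.9), $f_{Y|X}(y \mid x)$ being bounded and Jensen's inequality in turn establishes the first inequality in (C.10). The final result in (C.10) then follows by $\frac{dm}{d\alpha}$ being the Fréchet derivative of $m$.
\[
\|W_{\bar\alpha+\pi}(t) - W_{\bar\alpha}(t) - D_{\bar\alpha}[\pi](t)\|^2_{L^2_\mu} \lesssim \|\pi\|_c^{2+2\nu} + \int_{\mathcal{X} \times (0,1)} \int_{\mathcal{X}} \left[ m_{\bar\alpha+\pi}(s_x,u) - m_{\bar\alpha}(s_x,u) - \frac{dm_{\bar\alpha}(s_x,u)}{d\alpha}[\pi] \right]^2 f_X(s_x)\,ds_x\,d\mu(t) = o(\|\pi\|_c^2).
\tag{C.10}
\]
We conclude from (C.10) and (C.5) that $D_{\bar\alpha}$ is the Fréchet derivative at $\bar\alpha$ of the map $W : (\mathcal{A}, \|\cdot\|_c) \to L^2_\mu$ and that it is continuous in $\bar\alpha$. To conclude the proof of the first claim of the lemma, note that $Q(\alpha) = \|W_\alpha(t)\|^2_{L^2_\mu}$. Since the functional $\|\cdot\|^2_{L^2_\mu} : L^2_\mu \to \mathbb{R}$ is trivially Fréchet differentiable, applying the Chain rule for Fréchet derivatives (see e.g. Theorem 5.2.5 in Siddiqi, 2004) yields
\[ \frac{dQ(\bar\alpha)}{d\alpha}[\pi] = \int_{\mathcal{X} \times (0,1)} W_{\bar\alpha}(t) D_{\bar\alpha}[\pi](t)\,d\mu(t). \tag{C.11} \]
To establish the second claim of the lemma, define the bilinear form $T : \mathcal{A} \times \mathcal{A} \to \mathbb{R}$,
\[ T[\psi,\pi] = \int_{\mathcal{X} \times (0,1)} D_{\alpha_0}[\psi](t) D_{\alpha_0}[\pi](t)\,d\mu(t). \tag{C.12} \]
We will show $T$ is the second Fréchet derivative of $Q(\alpha)$ at $\alpha_0$. Note that $T[\psi,\cdot] : \mathcal{A} \to \mathbb{R}$ is a linear operator. The first requirement of Fréchet differentiability is to show $T[\psi,\cdot]$ is continuous in $\psi$. For this purpose, note that the first equality in (C.13) follows by definition while the first and second inequalities are implied by the Cauchy–Schwarz inequality and (C.1), respectively.
\[
\begin{aligned}
\|T[\psi,\cdot]\|_o^2
&= \sup_{\pi \in \bar{\mathcal{A}}_c} \|\pi\|_c^{-2}\, T^2[\psi,\pi] \\
&\le \int_{\mathcal{X} \times (0,1)} D^2_{\alpha_0}[\psi](t)\,d\mu(t) \times \sup_{\pi \in \bar{\mathcal{A}}_c} \|\pi\|_c^{-2} \int_{\mathcal{X} \times (0,1)} D^2_{\alpha_0}[\pi](t)\,d\mu(t) \\
&\lesssim \left\| \frac{dm(\alpha_0)}{d\alpha} \right\|_o^4 \|\psi\|_c^2.
\end{aligned}
\tag{C.13}
\]
It follows from (C.13) that $T[\psi,\cdot]$ is continuous in $\psi \in \mathcal{A}$. Next, we verify $T$ is the second Fréchet derivative of $Q(\alpha)$ at $\alpha_0$. In (C.14) use (C.11) and $W_{\alpha_0}(t) = 0$ for all $t$ to note $\frac{dQ(\alpha_0)}{d\alpha} = 0$ and obtain
\[
\left\| \frac{dQ(\alpha_0+\psi)}{d\alpha} - \frac{dQ(\alpha_0)}{d\alpha} - T[\psi,\cdot] \right\|_o^2 = \sup_{\pi \in \bar{\mathcal{A}}_c} \|\pi\|_c^{-2} \left[ \int_{\mathcal{X} \times (0,1)} \bigl( W_{\alpha_0+\psi}(t) D_{\alpha_0+\psi}[\pi](t) - D_{\alpha_0}[\psi](t) D_{\alpha_0}[\pi](t) \bigr)\,d\mu(t) \right]^2.
\tag{C.14}
\]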
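Next, use the Cauchy–Schwarz inequality to obtain the first inequality in (C.15) and $D_{\alpha_0}$ being the Fréchet derivative of $W : (\mathcal{A}, \|\cdot\|_c) \to L^2_\mu$ at $\alpha_0$ for the second.
\[
\begin{aligned}
&\sup_{\pi \in \bar{\mathcal{A}}_c} \|\pi\|_c^{-2} \left[ \int_{\mathcal{X} \times (0,1)} \bigl( W_{\alpha_0+\psi}(t) - W_{\alpha_0}(t) - D_{\alpha_0}[\psi](t) \bigr) D_{\alpha_0+\psi}[\pi](t)\,d\mu(t) \right]^2 \\
&\qquad \le \|W_{\alpha_0+\psi}(t) - W_{\alpha_0}(t) - D_{\alpha_0}[\psi]\|^2_{L^2_\mu} \times \sup_{\pi \in \bar{\mathcal{A}}_c} \|\pi\|_c^{-2} \|D_{\alpha_0+\psi}[\pi]\|^2_{L^2_\mu} \\
&\qquad \lesssim o(\|\psi\|_c^2) \times \|D_{\alpha_0+\psi}\|_o^2.
\end{aligned}
\tag{C.15}
\]
Similarly, we use the Cauchy–Schwarz inequality and the definition of $\|\cdot\|_o$ to obtain,
\[
\begin{aligned}
&\sup_{\pi \in \bar{\mathcal{A}}_c} \|\pi\|_c^{-2} \left[ \int_{\mathcal{X} \times (0,1)} D_{\alpha_0}[\psi](t) \bigl( D_{\alpha_0+\psi}[\pi](t) - D_{\alpha_0}[\pi](t) \bigr)\,d\mu(t) \right]^2 \\
&\qquad \le \|D_{\alpha_0}\|_o^2 \|\psi\|_c^2 \times \sup_{\pi \in \bar{\mathcal{A}}_c} \|\pi\|_c^{-2} \|D_{\alpha_0+\psi}[\pi] - D_{\alpha_0}[\pi]\|^2_{L^2_\mu} \\
&\qquad \le \|D_{\alpha_0}\|_o^2 \|\psi\|_c^2 \times \|D_{\alpha_0+\psi} - D_{\alpha_0}\|_o^2.
\end{aligned}
\tag{C.16}
\]
To conclude, combine (C.14), (C.15) and (C.16) and $W_{\alpha_0}(t) = 0$ for all $t$ to derive the first inequality in (C.17). As argued in (C.5), however, $\|D_{\bar\alpha}\|_o$ is bounded in a neighbourhood of $\alpha_0$. Thus, the continuity of $D_{\bar\alpha}$ in $\bar\alpha$ for $\bar\alpha \in \mathcal{N}(\alpha_0)$ implies the final result in (C.17).
\[
\left\| \frac{dQ(\alpha_0+\psi)}{d\alpha} - \frac{dQ(\alpha_0)}{d\alpha} - T[\psi,\cdot] \right\|_o^2 \lesssim o(\|\psi\|_c^2) \times \|D_{\alpha_0+\psi}\|_o^2 + \|\psi\|_c^2 \|D_{\alpha_0}\|_o^2 \|D_{\alpha_0+\psi} - D_{\alpha_0}\|_o^2 = o(\|\psi\|_c^2).
\tag{C.17}
\]
It follows from (C.17) that $T$ is the second Fréchet derivative of $Q(\alpha)$ at $\alpha_0$.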
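Proof of Theorem 3.1: Let $\Pi_n\alpha_0 = \arg\min_{\mathcal{A}_n} \|\alpha_0 - \alpha\|_s$. By Theorem 2.1, $\hat\alpha \in \mathcal{N}(\alpha_0)$ with probability tending to one and hence Assumptions 3.2(a) and 3.2(c) imply that with probability tending to one we have that
\[
\|\hat\alpha - \alpha_0\|_w^2 \lesssim Q(\hat\alpha) = Q(\hat\alpha) - Q(\Pi_n\alpha_0) + Q(\Pi_n\alpha_0) \le Q(\hat\alpha) - Q(\Pi_n\alpha_0) + o(n^{-1}).
\tag{C.18}
\]
By Theorem 2.1 and $\|\cdot\|_s \lesssim \|\cdot\|_c$, there is a $\delta_n \to 0$ such that $P(\|\hat\alpha - \alpha_0\|_s > \delta_n) \to 0$. Letting $\mathcal{A}_0^{\delta_n} = \{\alpha \in \mathcal{A} : \|\alpha - \alpha_0\|_s \le \delta_n\}$ then yields the first inequality in (C.19). Noticing that $Q_n(\hat\alpha) \le Q_n(\Pi_n\alpha_0)$ by virtue of $\hat\alpha$ minimizing $Q_n(\alpha)$ over $\mathcal{A}_n$ and using the Cauchy–Schwarz inequality gives us the second inequality. For the third and fourth inequalities we use Lemma B.1, which implies $\sqrt{n}(W_{\alpha,n}(t) - W_\alpha(t))$ is tight in $L^\infty(\mathbb{R}^{d_t} \times \mathcal{A})$, together with the definition of $Q(\alpha)$.
\[
\begin{aligned}
Q(\hat\alpha) - Q(\Pi_n\alpha_0)
&\le Q_n(\hat\alpha) - Q_n(\Pi_n\alpha_0) + 2 \sup_{\mathcal{A}_0^{\delta_n}} |Q_n(\alpha) - Q(\alpha)| \\
&\le 2 \sup_{(t,\alpha) \in \mathbb{R}^{d_t} \times \mathcal{A}} |W_{\alpha,n}(t) - W_\alpha(t)| \times \left[ \sup_{\mathcal{A}_0^{\delta_n}} \int_{\mathcal{X} \times (0,1)} (W_{\alpha,n}(t) + W_\alpha(t))^2\,d\mu(t) \right]^{1/2} \\
&\lesssim O_p(n^{-1/2}) \times \left[ \sup_{(t,\alpha) \in \mathbb{R}^{d_t} \times \mathcal{A}} (W_{\alpha,n}(t) - W_\alpha(t))^2 + \sup_{\mathcal{A}_0^{\delta_n}} 4 \int_{\mathcal{X} \times (0,1)} W_\alpha^2(t)\,d\mu(t) \right]^{1/2} \\
&\lesssim O_p(n^{-1/2}) \times \left[ O_p(n^{-1}) + \sup_{\mathcal{A}_0^{\delta_n}} 4Q(\alpha) \right]^{1/2}.
\end{aligned}
\tag{C.19}
\]
By Assumption 3.2(a), $\sup_{\mathcal{A}_0^{\delta_n}} Q(\alpha) \lesssim \delta_n^2 = o(1)$. Therefore, combining (C.18) and (C.19):
\[
\|\hat\alpha - \alpha_0\|_w^2 \lesssim O_p(n^{-1/2}) \times o_p(1) + o(n^{-1}) = o_p(n^{-1/2}).
\tag{C.20}
\]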
To obtain a rate with respect to $\|\cdot\|_s$, we use Assumption 3.2(c) for the first and second inequalities in (C.21). It follows from (C.20) and Assumption 3.2(c) that $\|\hat\alpha - \Pi_n\alpha_0\|_w^2 = o_p(n^{-1/2})$, which together with Assumption 3.2(b) implies the equality in (C.21).
\[
\|\hat\alpha - \alpha_0\|_s^2 \lesssim \|\hat\alpha - \Pi_n\alpha_0\|_s^2 + o(n^{-1}) \lesssim \sup_{\alpha \in \mathcal{A}_n} \frac{\|\alpha\|_s^2}{\|\alpha\|_w^2} \times \|\hat\alpha - \Pi_n\alpha_0\|_w^2 + o(n^{-1}) = o_p(n^{-\frac{1}{2}+\gamma}).
\tag{C.21}
\]
We can now exploit the local behaviour of the objective function to improve on the obtained rate of convergence. Note that due to (C.21) it is possible to choose $\delta_n = o(n^{-\frac{1}{4}+\frac{\gamma}{2}})$ such that $P(\hat\alpha \in \mathcal{A}_0^{\delta_n}) \to 1$. Repeating the steps in (C.19) we obtain (C.22) with probability approaching one.
\[
Q(\hat\alpha) - Q(\Pi_n\alpha_0) \lesssim O_p(n^{-1/2}) \times \left[ O_p(n^{-1}) + \sup_{\mathcal{A}_0^{\delta_n}} 4Q(\alpha) \right]^{1/2} = O_p(n^{-1/2}) \times o_p(n^{-\frac{1}{4}+\frac{\gamma}{2}}).
\tag{C.22}
\]
From (C.18), (C.22) and Assumption 3.2(b), we then obtain $\|\hat\alpha - \alpha_0\|_w^2 = o_p(n^{-\frac{1}{2}-\frac{1}{4}+\frac{\gamma}{2}})$ and similarly that $\|\hat\alpha - \Pi_n\alpha_0\|_w^2 = o_p(n^{-\frac{1}{2}-\frac{1}{4}+\frac{\gamma}{2}})$. In turn, by repeating the argument in (C.21) we obtain the improved rate $\|\hat\alpha - \alpha_0\|_s^2 = o_p(n^{(\gamma-\frac{1}{2})(1+\frac{1}{2})})$. Proceeding in this fashion we get $\|\hat\alpha - \alpha_0\|_s^2 = o_p(n^{(\gamma-\frac{1}{2})(1+\frac{1}{2}+\frac{1}{4}+\frac{1}{8}+\cdots)})$. Since $\gamma - 1/2 < -1/4$, repeating this argument a possibly large, but finite number of times yields the desired conclusion $\|\hat\alpha - \alpha_0\|_s^2 = o_p(n^{-1/2})$, thus establishing the claim of the theorem.
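The bootstrapping of rates at the end of this proof is a simple recursion on exponents, sketched below under stated assumptions. If $\|\hat\alpha - \alpha_0\|_s^2 = o_p(n^{e})$, one more pass through (C.22) and (C.21) gives the exponent $-\tfrac{1}{2} + \tfrac{e}{2} + \gamma$, starting from $e_1 = \gamma - \tfrac{1}{2}$; this reproduces $(\gamma - \tfrac{1}{2})(1 + \tfrac{1}{2} + \cdots)$. The function name `rate_exponents` is hypothetical, and exact rational arithmetic is used only to keep the comparison with $-\tfrac{1}{2}$ free of rounding.

```python
from fractions import Fraction

def rate_exponents(gamma, target=Fraction(-1, 2)):
    """Exponents e_k with ||alpha_hat - alpha_0||_s^2 = o_p(n^{e_k}) after k rounds.

    Recursion read off (C.21)-(C.22): e_1 = gamma - 1/2 and
    e_{k+1} = -1/2 + e_k / 2 + gamma. Stops once e_k <= target.
    """
    gamma = Fraction(gamma)
    if gamma >= Fraction(1, 4):
        # The limit 2 * (gamma - 1/2) is then not below -1/2, so the loop would not end.
        raise ValueError("the argument requires gamma - 1/2 < -1/4")
    exponents = [gamma - Fraction(1, 2)]
    while exponents[-1] > target:
        exponents.append(Fraction(-1, 2) + exponents[-1] / 2 + gamma)
    return exponents

steps = rate_exponents(Fraction(1, 5))
print([str(e) for e in steps])  # -3/10, -9/20, -21/40
```

With $\gamma = 1/5$ three rounds suffice; the closer $\gamma$ is to $1/4$, the more rounds are needed, but the number is always finite because the limiting exponent $2(\gamma - \tfrac{1}{2})$ lies strictly below $-\tfrac{1}{2}$.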
APPENDIX D: PROOFS FOR SECTION 4

Because the criterion function $Q_n(\alpha)$ is not smooth in $\alpha$, it is convenient to define
\[ Q_n^s(\alpha) = \int_{\mathcal{X} \times (0,1)} (W_{\alpha_0,n}(t) + W_\alpha(t))^2\,d\mu(t). \tag{D.1} \]
Throughout the proofs we will exploit the following lemma:

LEMMA D.1. If Assumptions 2.1, 2.3, 3.1, 3.2 hold, then: $Q_n^s(\hat\alpha) \le \inf_{\mathcal{A}_n} Q_n^s(\alpha) + o_p(n^{-1})$.
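Proof: Since $\|\hat\alpha - \alpha_0\|_c = o_p(1)$ and $W_{\alpha_0}(t) = 0$ for all $t \in \mathcal{X} \times (0,1)$, Lemma B.1 implies
\[ \sup_{t \in \mathcal{X} \times (0,1)} |W_{\hat\alpha,n}(t) - W_{\hat\alpha}(t) - W_{\alpha_0,n}(t)| = o_p(n^{-1/2}). \]
By simple manipulations we therefore obtain
\[
\begin{aligned}
Q_n^s(\hat\alpha)
&\le \int_{\mathcal{X} \times (0,1)} \bigl( |W_{\alpha_0,n}(t) + W_{\hat\alpha}(t) - W_{\hat\alpha,n}(t)| + |W_{\hat\alpha,n}(t)| \bigr)^2\,d\mu(t) \\
&= \int_{\mathcal{X} \times (0,1)} W^2_{\hat\alpha,n}(t)\,d\mu(t) + o_p(n^{-1/2}) \times \int_{\mathcal{X} \times (0,1)} |W_{\hat\alpha,n}(t)|\,d\mu(t) + o_p(n^{-1}).
\end{aligned}
\tag{D.2}
\]
Next, apply Jensen's inequality and $Q_n(\hat\alpha) \le Q_n(\Pi_n\alpha_0)$ to obtain the first and second inequalities in (D.3). By Lemma B.1, $\sup_{t,\alpha} |W_{\alpha,n}(t) - W_\alpha(t)| = O_p(n^{-1/2})$. Together with Assumption 3.2(a), the final two inequalities in (D.3) then immediately follow.
\[
\begin{aligned}
\int_{\mathcal{X} \times (0,1)} |W_{\hat\alpha,n}(t)|\,d\mu(t)
&\le \left[ \int_{\mathcal{X} \times (0,1)} W^2_{\hat\alpha,n}(t)\,d\mu(t) \right]^{1/2} \\
&\le \left[ \int_{\mathcal{X} \times (0,1)} W^2_{\Pi_n\alpha_0,n}(t)\,d\mu(t) \right]^{1/2} \\
&\le \left[ 2 \int_{\mathcal{X} \times (0,1)} (W_{\Pi_n\alpha_0,n}(t) - W_{\Pi_n\alpha_0}(t))^2\,d\mu(t) + 2 \int_{\mathcal{X} \times (0,1)} W^2_{\Pi_n\alpha_0}(t)\,d\mu(t) \right]^{1/2} \\
&\lesssim \bigl[ O_p(n^{-1}) + \|\Pi_n\alpha_0 - \alpha_0\|_s^2 \bigr]^{1/2}.
\end{aligned}
\tag{D.3}
\]
By Assumption 3.2(c), $\|\Pi_n\alpha_0 - \alpha_0\|_s = o(n^{-1/2})$ and hence combining (D.2) and (D.3),
\[ Q_n^s(\hat\alpha) \le Q_n(\hat\alpha) + o_p(n^{-1}). \tag{D.4} \]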
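Let $\tilde\alpha \in \arg\min_{\mathcal{A}_n} Q_n^s(\alpha)$, and note that Lemma B.1 and the same arguments as in Theorem 2.1 imply $\|\alpha_0 - \tilde\alpha\|_c = o_p(1)$. The same arguments as in (D.2) then imply that $Q_n(\tilde\alpha)$ is bounded above by
\[
\begin{aligned}
&\int_{\mathcal{X} \times (0,1)} \bigl( |W_{\tilde\alpha,n}(t) - W_{\alpha_0,n}(t) - W_{\tilde\alpha}(t)| + |W_{\alpha_0,n}(t) + W_{\tilde\alpha}(t)| \bigr)^2\,d\mu(t) \\
&\qquad = \int_{\mathcal{X} \times (0,1)} (W_{\alpha_0,n}(t) + W_{\tilde\alpha}(t))^2\,d\mu(t) + o_p(n^{-1/2}) \times \int_{\mathcal{X} \times (0,1)} |W_{\alpha_0,n}(t) + W_{\tilde\alpha}(t)|\,d\mu(t) + o_p(n^{-1}).
\end{aligned}
\tag{D.5}
\]
Proceeding as in (D.3), Jensen's inequality and $Q_n^s(\tilde\alpha) \le Q_n^s(\Pi_n\alpha_0)$ imply the first and second inequalities in (D.6). The last two results in (D.6) then follow by Assumption 3.2(a) and by noting that Lemma B.1 implies $\sup_t |W_{\alpha_0,n}(t)| = O_p(n^{-1/2})$,
\[
\begin{aligned}
\int_{\mathcal{X} \times (0,1)} |W_{\alpha_0,n}(t) + W_{\tilde\alpha}(t)|\,d\mu(t)
&\le \left[ \int_{\mathcal{X} \times (0,1)} (W_{\alpha_0,n}(t) + W_{\tilde\alpha}(t))^2\,d\mu(t) \right]^{1/2} \\
&\le \left[ \int_{\mathcal{X} \times (0,1)} (W_{\alpha_0,n}(t) + W_{\Pi_n\alpha_0}(t))^2\,d\mu(t) \right]^{1/2} \\
&\le \left[ 2 \int_{\mathcal{X} \times (0,1)} W^2_{\alpha_0,n}(t)\,d\mu(t) + 2 \int_{\mathcal{X} \times (0,1)} W^2_{\Pi_n\alpha_0}(t)\,d\mu(t) \right]^{1/2} \\
&\lesssim \bigl[ O_p(n^{-1}) + \|\alpha_0 - \Pi_n\alpha_0\|_s^2 \bigr]^{1/2}.
\end{aligned}
\tag{D.6}
\]
Since $\|\Pi_n\alpha_0 - \alpha_0\|_s = o(n^{-1/2})$ by Assumption 3.2(c), (D.5) and (D.6) imply
\[ Q_n(\tilde\alpha) \le Q_n^s(\tilde\alpha) + o_p(n^{-1}). \tag{D.7} \]
Hence, since $Q_n(\hat\alpha) \le Q_n(\tilde\alpha)$, the definition of $\tilde\alpha$ together with (D.4) and (D.7) establish
\[ Q_n^s(\hat\alpha) \le Q_n(\tilde\alpha) + o_p(n^{-1}) \le \inf_{\mathcal{A}_n} Q_n^s(\alpha) + o_p(n^{-1}), \]
which establishes the claim of the lemma.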
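Proof of Lemma 4.1: The arguments closely follow those of Ai and Chen (2003). We first establish continuity. Since $F_\lambda$ is linear, it is only necessary to establish that it is bounded. For any $\theta \in \mathbb{R}^{d_\theta}$, we can obtain the first equality in (D.8) by using (4.2), while the second equality is definitional.
\[
\min_{h \in \bar{\mathcal{H}}} \int_{\mathcal{X} \times (0,1)} \left[ \frac{dW(\alpha_0)}{d\theta}[\theta](t) - \frac{dW(\alpha_0)}{dh}[h](t) \right]^2 d\mu(t) = \int_{\mathcal{X} \times (0,1)} \left[ \left( \frac{dW(\alpha_0)}{d\theta}(t) - \frac{dW(\alpha_0)}{dh}[h^*](t) \right)' \theta \right]^2 d\mu(t) = \theta' \Sigma^* \theta.
\tag{D.8}
\]
In order to show $F_\lambda$ is bounded we need to establish the left-hand side of (D.9) is finite. Using (D.8) immediately implies the first equality in (D.9). For the second equality note the optimization problem is solved at $\theta^* = (\Sigma^*)^{-1}\lambda$ and plug in $\theta^*$.
\[
\sup_{0 \ne \alpha \in \bar{\mathcal{A}}} \frac{F_\lambda^2(\alpha)}{\|\alpha\|_w^2} = \sup_{0 \ne \theta \in \mathbb{R}^{d_\theta}} \frac{(\lambda'\theta)^2}{\theta' \Sigma^* \theta} = \lambda' (\Sigma^*)^{-1} \lambda.
\tag{D.9}
\]
Since by assumption $\Sigma^*$ is positive definite, (D.9) is finite and hence $F_\lambda$ is bounded, which establishes continuity. For the second claim of the lemma, note the following orthogonality condition must hold as a result of (4.1) and (4.2):
\[
\int_{\mathcal{X} \times (0,1)} \left( \frac{dW(\alpha_0)}{d\theta}(t) - \frac{dW(\alpha_0)}{dh}[h^*](t) \right) \frac{dW(\alpha_0)}{dh}[h](t)\,d\mu(t) = 0
\tag{D.10}
\]
for all $h \in \bar{\mathcal{H}}$. Therefore, employing result (D.10) we obtain $\langle \alpha - \alpha_0, v^\lambda \rangle$ equals:
\[
(\theta - \theta_0)' \int_{\mathcal{X} \times (0,1)} \left( \frac{dW(\alpha_0)}{d\theta}(t) - \frac{dW(\alpha_0)}{dh}[h^*](t) \right) \left( \frac{dW(\alpha_0)}{d\theta}(t) - \frac{dW(\alpha_0)}{dh}[h^*](t) \right)' d\mu(t)\; v_\theta^\lambda.
\]
Hence, since $v_\theta^\lambda = (\Sigma^*)^{-1}\lambda$, the second claim of the lemma follows.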
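LEMMA D.2. Let Assumptions 2.1, 2.3, 3.1, 3.2 and 4.1 hold, and let $v_n^\lambda = \Pi_n v^\lambda$. Then:
(a) $\int_{\mathcal{X} \times (0,1)} W_{\alpha_0,n}(t) D_{\hat\alpha}[v_n^\lambda](t)\,d\mu(t) = \int_{\mathcal{X} \times (0,1)} W_{\alpha_0,n}(t) D_{\alpha_0}[v^\lambda](t)\,d\mu(t) + o_p(n^{-1/2})$; also
(b) $\int_{\mathcal{X} \times (0,1)} (W_{\hat\alpha}(t) - W_{\alpha_0}(t)) D_{\hat\alpha}[v_n^\lambda](t)\,d\mu(t) = \int_{\mathcal{X} \times (0,1)} D_{\alpha_0}[\hat\alpha - \alpha_0](t) D_{\alpha_0}[v^\lambda](t)\,d\mu(t) + o_p(n^{-1/2})$; and
(c) $\sqrt{n}\,W_{\alpha_0,n}(t) \overset{L}{\to} G(t)$, where $G(t)$ is a Gaussian process with covariance:
\[ \Sigma(t,t') = E[(1\{U \le u;\, X \le x\} - u\,1\{X \le x\})(1\{U \le u';\, X \le x'\} - u'\,1\{X \le x'\})]. \]

Proof: To establish the first claim apply the Cauchy–Schwarz inequality, the definition of the operator norm, Theorem 2.1 and Lemma B.1 implying $\sup_t |W_{\alpha_0,n}(t)| = O_p(n^{-1/2})$ to obtain that with probability approaching one we have
\[
\begin{aligned}
\left| \int_{\mathcal{X} \times (0,1)} W_{\alpha_0,n}(t) D_{\hat\alpha}[v_n^\lambda - v^\lambda](t)\,d\mu(t) \right|
&\le \left[ \int_{\mathcal{X} \times (0,1)} W^2_{\alpha_0,n}(t)\,d\mu(t) \right]^{1/2} \times \|D_{\hat\alpha}[v_n^\lambda - v^\lambda]\|_{L^2_\mu} \\
&\lesssim O_p(n^{-1/2}) \times \sup_{\alpha \in \mathcal{N}(\alpha_0)} \|D_\alpha\|_o \times \|v_n^\lambda - v^\lambda\|_c.
\end{aligned}
\tag{D.11}
\]
As argued in (C.5), $\sup_{\alpha \in \mathcal{N}(\alpha_0)} \|D_\alpha\|_o < \infty$. Further, Assumptions 4.1(b) and 2.3(f) imply that $\|v^\lambda - v_n^\lambda\|_c = o(1)$. Therefore, we obtain from (D.11) that
\[
\int_{\mathcal{X} \times (0,1)} W_{\alpha_0,n}(t) D_{\hat\alpha}[v_n^\lambda](t)\,d\mu(t) = \int_{\mathcal{X} \times (0,1)} W_{\alpha_0,n}(t) D_{\hat\alpha}[v^\lambda](t)\,d\mu(t) + o_p(n^{-1/2}).
\tag{D.12}
\]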
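Similarly, the derivations in (D.11) imply the inequality in (D.13). The equality is a result of the continuity of $D_\alpha$ in $\alpha$ under $\|\cdot\|_c$, as established in the proof of Lemma 3.1.
\[
\left| \int_{\mathcal{X} \times (0,1)} W_{\alpha_0,n}(t) \bigl( D_{\hat\alpha}[v^\lambda](t) - D_{\alpha_0}[v^\lambda](t) \bigr)\,d\mu(t) \right| \lesssim O_p(n^{-1/2}) \times \|D_{\hat\alpha} - D_{\alpha_0}\|_o \times \|v^\lambda\|_c = o_p(n^{-1/2}).
\tag{D.13}
\]
Together, equations (D.12) and (D.13) establish the first claim of the lemma. For the second claim of the lemma, note that Assumption 4.1(c) allows us to do a second-order Taylor expansion to obtain (D.14) pointwise in $t \in \mathcal{X} \times (0,1)$,
\[
W_{\hat\alpha}(t) = W_{\alpha_0}(t) + D_{\alpha_0}[\hat\alpha - \alpha_0](t) + \frac{1}{2} \left. \frac{dD_{\alpha_0+\tau(\hat\alpha-\alpha_0)}[\hat\alpha - \alpha_0](t)}{d\tau} \right|_{\tau=s(t)}.
\tag{D.14}
\]
The first equality in (D.15) then follows from (D.14), while the second one is implied by Assumptions 4.1(c) and 4.1(d). The final equality in turn follows from Theorem 3.1.
\[
\begin{aligned}
&\int_{\mathcal{X} \times (0,1)} \bigl( W_{\hat\alpha}(t) - W_{\alpha_0}(t) - D_{\alpha_0}[\hat\alpha - \alpha_0](t) \bigr) D_{\hat\alpha}[v_n^\lambda](t)\,d\mu(t) \\
&\qquad = \frac{1}{2} \int_{\mathcal{X} \times (0,1)} \left[ \left. \frac{dD_{\alpha_0+\tau(\hat\alpha-\alpha_0)}[\hat\alpha - \alpha_0](t)}{d\tau} \right|_{\tau=s(t)} \right] D_{\hat\alpha}[v_n^\lambda](t)\,d\mu(t) \lesssim \|\hat\alpha - \alpha_0\|_s^2 = o_p(n^{-1/2}).
\end{aligned}
\tag{D.15}
\]
Next, apply the Cauchy–Schwarz inequality and a Taylor expansion to obtain the first inequality in (D.16). The second inequality then follows by Assumption 4.1(c), $\|\hat\alpha - \alpha_0\|_w \lesssim \|\hat\alpha - \alpha_0\|_s$ in a neighbourhood of $\alpha_0$ by Assumption 3.2(a) and Theorem 3.1.
\[
\begin{aligned}
&\left| \int_{\mathcal{X} \times (0,1)} D_{\alpha_0}[\hat\alpha - \alpha_0](t) \bigl( D_{\hat\alpha}[v_n^\lambda](t) - D_{\alpha_0}[v_n^\lambda](t) \bigr)\,d\mu(t) \right| \\
&\qquad \le \|\hat\alpha - \alpha_0\|_w \times \left[ \int_{\mathcal{X} \times (0,1)} \left( \left. \frac{dD_{\alpha_0+\tau(\hat\alpha-\alpha_0)}[v_n^\lambda](t)}{d\tau} \right|_{\tau=s(t)} \right)^2 d\mu(t) \right]^{1/2} \lesssim \|\hat\alpha - \alpha_0\|_s^2 = o_p(n^{-1/2}).
\end{aligned}
\tag{D.16}
\]
Similarly, applying the Cauchy–Schwarz inequality, $\|\hat\alpha - \alpha_0\|_w = o_p(n^{-1/4})$ and $\|v_n^\lambda - v^\lambda\|_c = o(n^{-1/4})$ by Assumption 3.2(b) we are able to conclude,
\[
\left| \int_{\mathcal{X} \times (0,1)} D_{\alpha_0}[\hat\alpha - \alpha_0](t) \bigl( D_{\alpha_0}[v_n^\lambda](t) - D_{\alpha_0}[v^\lambda](t) \bigr)\,d\mu(t) \right| \le \|\hat\alpha - \alpha_0\|_w \times \|D_{\alpha_0}\|_o \times \|v_n^\lambda - v^\lambda\|_c = o_p(n^{-1/2}).
\tag{D.17}
\]
Combining results (D.15)–(D.17) establishes the second claim of the lemma. The third claim of the lemma is immediate from $W_{\alpha_0,n}(t)$ being a Donsker class due to Lemma B.1 and the regular Central Limit Theorem.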
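The covariance of the limiting Gaussian process in Lemma D.2(c) has a simple closed form in the scalar case, which the sketch below checks by simulation. It is an illustration under assumptions that go beyond the lemma's statement: $X$ is taken to be scalar and uniform on $(0,1)$, and $U$ is uniform and independent of $X$, so that $E[(1\{U \le u; X \le x\} - u 1\{X \le x\})(1\{U \le u'; X \le x'\} - u' 1\{X \le x'\})] = F_X(\min(x,x'))\,(\min(u,u') - uu')$. The function names are hypothetical.

```python
import random

def covariance_closed_form(t, t_prime, F_X):
    """F_X(min(x, x')) * (min(u, u') - u * u'), valid for scalar X independent of uniform U."""
    (x, u), (xp, up) = t, t_prime
    return F_X(min(x, xp)) * (min(u, up) - u * up)

def covariance_monte_carlo(t, t_prime, draws):
    """Sample average of the product of centred indicators over (X, U) draws."""
    (x, u), (xp, up) = t, t_prime
    total = 0.0
    for X, U in draws:
        a = (1.0 if (U <= u and X <= x) else 0.0) - u * (1.0 if X <= x else 0.0)
        b = (1.0 if (U <= up and X <= xp) else 0.0) - up * (1.0 if X <= xp else 0.0)
        total += a * b
    return total / len(draws)

def uniform_cdf(x):
    return min(max(x, 0.0), 1.0)

rng = random.Random(1)
draws = [(rng.random(), rng.random()) for _ in range(20000)]
t, t_prime = (0.5, 0.3), (0.8, 0.6)
print(covariance_closed_form(t, t_prime, uniform_cdf), covariance_monte_carlo(t, t_prime, draws))
```

The factor $\min(u,u') - uu'$ is the Brownian bridge covariance in the $u$ direction, scaled by the probability mass of $X$ on the overlap of the two half-lines.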
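Proof of Theorem 4.1: Let $u^* = \pm v^\lambda$, $u_n^* = \Pi_n u^*$ and $0 < \varepsilon_n = o(n^{-1/2})$ be such that it satisfies $Q_n^s(\hat\alpha) \le \inf_{\mathcal{A}_n} Q_n^s(\alpha) + O_p(\varepsilon_n^2)$, which is possible due to Lemma D.1. Define $\alpha(\tau) = \hat\alpha + \tau\varepsilon_n u_n^*$ and note that by Assumption 3.1(a) and Lemma 2.1, with probability tending to one $\alpha(\tau) \in \mathcal{A}_n$ for $\tau \in [0,1]$. Therefore, Lemma D.1 establishes the first inequality in (D.18). A second-order Taylor expansion around $\tau = 0$ yields the equality in (D.18) for some $s \in [0,1]$.
\[
0 \le Q_n^s(\alpha(1)) - Q_n^s(\alpha(0)) + O_p(\varepsilon_n^2) = 2\varepsilon_n \int_{\mathcal{X} \times (0,1)} (W_{\alpha_0,n}(t) + W_{\hat\alpha}(t)) D_{\hat\alpha}[u_n^*](t)\,d\mu(t) + \frac{1}{2} \left. \frac{d^2 Q_n(\alpha(\tau))}{d\tau^2} \right|_{\tau=s},
\tag{D.18}
\]
where by direct calculation we have that
\[
\left. \frac{d^2 Q_n(\alpha(\tau))}{d\tau^2} \right|_{\tau=s} = \varepsilon_n^2 \int_{\mathcal{X} \times (0,1)} \bigl( D_{\alpha(s)}[u_n^*](t) \bigr)^2 d\mu(t) + \int_{\mathcal{X} \times (0,1)} (W_{\alpha_0,n}(t) + W_{\alpha(s)}(t)) \left. \frac{dD_{\hat\alpha+\tau\varepsilon_n u_n^*}[\varepsilon_n u_n^*](t)}{d\tau} \right|_{\tau=s} d\mu(t).
\tag{D.19}
\]
As shown in (C.5), $\sup_{\alpha \in \mathcal{N}(\alpha_0)} \|D_\alpha\|_o < \infty$, and hence, since $\|v^\lambda\|_c < \infty$, we obtain that
\[
\int_{\mathcal{X} \times (0,1)} \bigl( D_{\alpha(s)}[u_n^*](t) \bigr)^2 d\mu(t) \le \sup_{\alpha \in \mathcal{N}(\alpha_0)} \|D_\alpha\|_o^2 \times \|u_n^*\|_c^2 = O(1).
\tag{D.20}
\]
Since $W_{\alpha,n}(t)$ and $W_\alpha(t)$ are both bounded by 1, Assumption 4.1(c) establishes
\[
\left| \int_{\mathcal{X} \times (0,1)} (W_{\alpha_0,n}(t) + W_{\alpha(s)}(t)) \left. \frac{dD_{\hat\alpha+\tau\varepsilon_n u_n^*}[\varepsilon_n u_n^*](t)}{d\tau} \right|_{\tau=s} d\mu(t) \right| \lesssim \|\varepsilon_n u_n^*\|_s^2 = O(\varepsilon_n^2).
\tag{D.21}
\]
Therefore, by combining (D.18)–(D.21), $u_n^* = \pm v_n^\lambda$ and $\varepsilon_n = o(n^{-1/2})$, it follows that:
\[
\int_{\mathcal{X} \times (0,1)} (W_{\alpha_0,n}(t) + W_{\hat\alpha}(t)) D_{\hat\alpha}[u_n^*](t)\,d\mu(t) = o_p(n^{-1/2}).
\tag{D.22}
\]
To conclude, in (D.23) use Lemma 4.1 for the first equality, Lemma D.2(b) for the second equality, $W_{\alpha_0}(t) = 0$ and (D.22) for the third one and Lemma D.2(a) for the final result.
\[
\begin{aligned}
\sqrt{n}\,\lambda'(\hat\theta - \theta_0)
&= \sqrt{n} \int_{\mathcal{X} \times (0,1)} D_{\alpha_0}[\hat\alpha - \alpha_0](t) D_{\alpha_0}[v^\lambda](t)\,d\mu(t) \\
&= \sqrt{n} \int_{\mathcal{X} \times (0,1)} (W_{\hat\alpha}(t) - W_{\alpha_0}(t)) D_{\hat\alpha}[v_n^\lambda](t)\,d\mu(t) + o_p(1) \\
&= \sqrt{n} \int_{\mathcal{X} \times (0,1)} W_{\alpha_0,n}(t) D_{\hat\alpha}[v_n^\lambda](t)\,d\mu(t) + o_p(1) \\
&= \sqrt{n} \int_{\mathcal{X} \times (0,1)} W_{\alpha_0,n}(t) D_{\alpha_0}[v^\lambda](t)\,d\mu(t) + o_p(1).
\end{aligned}
\tag{D.23}
\]
Hence, applying Lemma D.2(c) we are able to conclude from (D.23) that
\[
\sqrt{n}\,\lambda'(\hat\theta - \theta_0) \overset{L}{\to} N(0, \Sigma_\lambda),
\tag{D.24}
\]
where $\Sigma_\lambda = \int\!\!\int D_{\alpha_0}[v^\lambda](t) D_{\alpha_0}[v^\lambda](s) \Sigma(t,s)\,d\mu(t)\,d\mu(s)$. Using the closed form for $v^\lambda$, obtained in Lemma 4.1, and the definition of $R_{h^*}(t)$ in turn imply
\[
D_{\alpha_0}[v^\lambda](t) = \left( \frac{dW(\alpha_0)}{d\theta}(t) - \frac{dW(\alpha_0)}{dh}[h^*](t) \right)' [\Sigma^*]^{-1}\lambda = R_{h^*}(t)' [\Sigma^*]^{-1}\lambda.
\tag{D.25}
\]
The Cramér–Wold device, (D.24) and (D.25) then establish the claim of the theorem.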
APPENDIX E: DETAILS OF THE BLP EXAMPLE

In this appendix, we give the proofs of Lemma 5.1 and Theorem 5.1. We start with an auxiliary lemma whose result will be useful later on.

LEMMA E.1. Assume $F$ is twice continuously differentiable on $\mathbb{R}$ with strictly increasing hazard rate $\tau$. Then the BLP equilibrium prices exist, are unique, and the map $(\xi_1 - \xi_2, x_1 - x_2, c) \mapsto (p_1 - p_2)$ is twice continuously differentiable with:
\[ 0 < \frac{\partial(p_1 - p_2)}{\partial(\xi_1 - \xi_2)} < \frac{1}{a}. \]
Proof: Under the strictly increasing hazard rate assumption the goods are substitutes, and since $f'(\varepsilon)[1 - F(\varepsilon)] + f^2(\varepsilon) > 0$ and $f'(\varepsilon)F(\varepsilon) - f^2(\varepsilon) < 0$ we have
\[ \frac{\partial^2 \ln D_j(p_j,p_{-j})}{\partial p_j \partial p_{-j}} > 0, \]
i.e. the elasticity of demand is a decreasing function of the other firm's prices. It follows that the (log-transformed) Bertrand duopoly played by the firms is supermodular; hence, there exists a pure Nash equilibrium to the game (see e.g. Milgrom and Roberts, 1990). We now show that this equilibrium is unique. For this purpose note that
\[ \frac{\partial^2 \ln \Pi_j(p_j,p_{-j})}{\partial p_j^2} < 0, \qquad \frac{\partial^2 \ln \Pi_j(p_j,p_{-j})}{\partial p_j \partial p_{-j}} > 0 \]
and
\[ -\frac{\partial^2 \ln \Pi_j(p_j,p_{-j})}{\partial p_j^2} - \frac{\partial^2 \ln \Pi_j(p_j,p_{-j})}{\partial p_j \partial p_{-j}} = \frac{1}{(p_j - c)^2} > 0 \]
so that the 'dominant diagonal' condition of Milgrom and Roberts (1990) holds; this guarantees that the equilibrium is unique. Since under the strictly increasing hazard rate assumption we have $f'(\varepsilon)[1 - F(\varepsilon)] + f^2(\varepsilon) > 0$ and $f'(\varepsilon)F(\varepsilon) - f^2(\varepsilon) < 0$ it also holds that $\partial^2 \ln D_j(p_j,p_{-j})/\partial p_j^2 < 0$, which implies that $\partial^2 \ln \Pi_j(p_j,p_{-j})/\partial p_j^2 < 0$, and the Nash equilibrium $(p_1^*, p_2^*)$ is the unique solution to the first-order conditions $\Phi(p_1,p_2,\xi) = 0$, where we have let $\xi = \xi_1 - \xi_2$ and
\[
\Phi(p_1,p_2,\xi) = \begin{bmatrix} \dfrac{1}{p_1 - c} + \dfrac{\partial \ln D_1(p_1,p_2)}{\partial p_1} \\[2ex] \dfrac{1}{p_2 - c} + \dfrac{\partial \ln D_2(p_1,p_2)}{\partial p_2} \end{bmatrix}.
\]
Note that the map $\Phi$ is continuously differentiable and we have
\[
D_{(p_1,p_2)}\Phi = \begin{bmatrix} -\dfrac{1}{(p_1 - c)^2} + \dfrac{\partial^2 \ln D_1(p_1,p_2)}{\partial p_1^2} & \dfrac{\partial^2 \ln D_1(p_1,p_2)}{\partial p_1 \partial p_2} \\[2ex] \dfrac{\partial^2 \ln D_2(p_1,p_2)}{\partial p_1 \partial p_2} & -\dfrac{1}{(p_2 - c)^2} + \dfrac{\partial^2 \ln D_2(p_1,p_2)}{\partial p_2^2} \end{bmatrix}.
\]
In addition, note that the demand function in (5.2) satisfies
\[
-\frac{\partial^2 \ln D_j(p_j,p_{-j})}{\partial p_j^2} = \frac{\partial^2 \ln D_j(p_j,p_{-j})}{\partial p_j \partial p_{-j}} = a\,\frac{\partial^2 \ln D_j(p_j,p_{-j})}{\partial p_j \partial(\xi_j - \xi_{-j})} > 0,
\tag{E.1}
\]
where the last inequality follows from $f'(\varepsilon)F(\varepsilon)/f^2(\varepsilon) < 1$. Therefore,
\[
\det D_{(p_1,p_2)}\Phi = \frac{1}{(p_1 - c)^2 (p_2 - c)^2} - \frac{1}{(p_1 - c)^2} \frac{\partial^2 \ln D_2(p_1,p_2)}{\partial p_2^2} - \frac{1}{(p_2 - c)^2} \frac{\partial^2 \ln D_1(p_1,p_2)}{\partial p_1^2} > 0.
\]
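Hence, by the Implicit Function Theorem (see e.g. Theorem 9.28 in Rudin, 1976), the equation $\Phi(p_1, p_2, \xi) = 0$ defines in a neighbourhood of the point $(p_1^*, p_2^*, \xi)$ a mapping $\xi \mapsto (p_1, p_2)$ that is continuously differentiable, and whose derivative at this point equals
$$\begin{pmatrix} \dfrac{\partial p_1}{\partial \xi} \\[2ex] \dfrac{\partial p_2}{\partial \xi} \end{pmatrix} = -\left[D_{(p_1,p_2)}\Phi(p_1, p_2, \xi)\right]^{-1} D_\xi \Phi(p_1, p_2, \xi).$$
Thus,
$$\frac{\partial p_1}{\partial \xi} = -\frac{1}{a}\,\frac{1}{\det D_{(p_1,p_2)}\Phi}\,\frac{1}{(p_2 - c)^2}\,\frac{\partial^2 \ln D_1(p_1, p_2)}{\partial p_1^2}, \tag{E.2}$$
$$\frac{\partial p_2}{\partial \xi} = \frac{1}{a}\,\frac{1}{\det D_{(p_1,p_2)}\Phi}\,\frac{1}{(p_1 - c)^2}\,\frac{\partial^2 \ln D_2(p_1, p_2)}{\partial p_2^2}, \tag{E.3}$$
where the first equality uses (E.1) and the fact that
$$\frac{\partial^2 \ln D_2(p_1, p_2)}{\partial p_2^2}\,\frac{\partial^2 \ln D_1(p_1, p_2)}{\partial p_1 \partial \xi} - \frac{\partial^2 \ln D_1(p_1, p_2)}{\partial p_1 \partial p_2}\,\frac{\partial^2 \ln D_2(p_1, p_2)}{\partial p_2 \partial \xi} = 0,$$
while the second exploits (E.1) and the fact that
$$\frac{\partial^2 \ln D_1(p_1, p_2)}{\partial p_1^2}\,\frac{\partial^2 \ln D_2(p_1, p_2)}{\partial p_2 \partial \xi} - \frac{\partial^2 \ln D_2(p_1, p_2)}{\partial p_1 \partial p_2}\,\frac{\partial^2 \ln D_1(p_1, p_2)}{\partial p_1 \partial \xi} = 0.$$
From (E.2) to (E.3) we then have the desired result:
$$0 < \frac{\partial(p_1 - p_2)}{\partial \xi} = \frac{\partial(p_1 - p_2)}{\partial(\xi_1 - \xi_2)} < \frac{1}{a},$$
which concludes the proof of the lemma.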
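As a quick numerical sanity check of the bound just derived, the following sketch evaluates (E.2) and (E.3) for arbitrary illustrative inputs. The function name and all numbers are assumptions made for this example only; `A1` and `A2` stand for the own-price second derivatives of $\ln D_1$ and $\ln D_2$, which the proof shows to be negative.

```python
def price_gap_sensitivity(a, p1, p2, c, A1, A2):
    """Return d(p1 - p2)/d(xi) implied by (E.2) and (E.3).

    A1, A2 are hypothetical values of the own-price second derivatives
    of ln D_1 and ln D_2 (both negative under the lemma's assumptions).
    """
    k1 = 1.0 / (p1 - c) ** 2
    k2 = 1.0 / (p2 - c) ** 2
    # Determinant of the Jacobian of the first-order conditions.
    det = k1 * k2 - k1 * A2 - k2 * A1
    dp1 = -(1.0 / a) * k2 * A1 / det   # equation (E.2)
    dp2 = (1.0 / a) * k1 * A2 / det    # equation (E.3)
    return dp1 - dp2


# Illustrative values only; any a > 0, p_j > c and A_j < 0 should work.
a = 2.0
value = price_gap_sensitivity(a=a, p1=1.5, p2=1.8, c=1.0, A1=-0.7, A2=-0.4)
print(0.0 < value < 1.0 / a)
```

The check passes for any admissible inputs because the numerator of the ratio is a strict subset of the positive terms in the determinant.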
Proof of Lemma 5.1: Since $\xi = \xi_1 - \xi_2$ is continuously distributed, it has a strictly increasing cdf, which we denote $F_\xi$. Noting that $F_\xi(\xi) \sim U(0, 1)$, we may define $h(X, U) \equiv -a(p_1 - p_2) + F_\xi^{-1}(U)$, so that
$$Y \equiv F^{-1}\!\left(\frac{D_1(p_1, p_2)}{M}\right) = h(X, U) + \theta X,$$
with $X \equiv x_1 - x_2$ and $\theta \equiv b$. Since, by Lemma E.1, $h$ is continuously differentiable, we have for all $(x, u) \in \mathcal{X} \times (0, 1)$:
$$\frac{\partial h(x, u)}{\partial u} = \left[-a\,\frac{\partial(p_1 - p_2)}{\partial \xi} + 1\right]\frac{1}{f_\xi(F_\xi^{-1}(u))} > 0,$$
which completes the proof of Lemma 5.1.
Proof of Theorem 5.1: Consider the BLP model in (5.4) and let $F_{Y|X}(\cdot|\cdot)$ denote the conditional distribution of $Y$ given $X$ that is induced by the structure $(\theta, h)$. Fix $x \in \mathcal{X}$ and let $v : \mathbb{R} \times \mathcal{X} \to (0, 1)$ be such that for any $u \in (0, 1)$, we have: $h(x, u) = t$ if and only if $u = v(t, x)$. Note that $v(\cdot, x)$ is well defined since by (a) we have $\partial h(x, u)/\partial u > 0$. Then, for any $(y, x) \in S$,
$$F_{Y|X}(y|x) = P(Y \le y \mid X = x) = P(h(X, U) \le y - \theta x \mid X = x) = P(U \le v(y - \theta x, x) \mid X = x) = P(U \le v(y - \theta x, x)) = v(y - \theta x, x), \tag{E.4}$$
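where the second equality uses the fact that $h(x, u)$ is strictly increasing in $u$, the third exploits the independence of $U$ and $X$, and the last follows from $U$ being uniform. Since $h(x, u)$ is continuously differentiable on $\mathcal{X} \times (0, 1)$ and such that $\partial h(x, u)/\partial u > 0$ on $\mathcal{X} \times (0, 1)$, $v(t, x)$ is continuously differentiable on $\mathbb{R} \times \mathcal{X}$ with
$$\frac{\partial v}{\partial x}(t, x) = -\frac{\partial h}{\partial x}(x, v(t, x))\left[\frac{\partial h}{\partial u}(x, v(t, x))\right]^{-1}, \qquad \frac{\partial v}{\partial t}(t, x) = \left[\frac{\partial h}{\partial u}(x, v(t, x))\right]^{-1}. \tag{E.5}$$
Further, for any $(y, x) \in S$ let $\Psi(y, x) \equiv F_{Y|X}(y|x)$. Under our assumptions on $F$, $\Psi(y, x)$ is continuously differentiable on $S$ and we have
$$\frac{\partial \Psi}{\partial y}(y, x) = \frac{\partial v}{\partial t}(y - \theta x, x), \qquad \frac{\partial \Psi}{\partial x}(y, x) = -\theta\,\frac{\partial v}{\partial t}(y - \theta x, x) + \frac{\partial v}{\partial x}(y - \theta x, x). \tag{E.6}$$
In particular, $\partial \Psi(y, x)/\partial y > 0$ on $S$. Combining (E.5) and (E.6) we then obtain
$$-\frac{\partial \Psi}{\partial x}(y, x)\left[\frac{\partial \Psi}{\partial y}(y, x)\right]^{-1} = \theta + \frac{\partial h}{\partial x}\big(x, v(y - \theta x, x)\big). \tag{E.7}$$
Evaluate the left-hand side of (E.7) at $x = 0 \in \mathcal{X}$ and $y = 0$. For these values of $x$ and $y$, we have $y - \theta x = 0$, so by using condition (i) of Theorem 5.1, $v(0, 0) = 1/2$. Combining the latter with condition (b) then gives
$$\theta = -\frac{\partial \Psi}{\partial x}(0, 0)\left[\frac{\partial \Psi}{\partial y}(0, 0)\right]^{-1} - 1,$$
from which it follows that $\theta$ is identified. The identification of $v$, and hence $h$, then immediately follows from (E.4).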
The Econometrics Journal (2010), volume 13, pp. S56–S79. doi: 10.1111/j.1368-423X.2009.00309.x
Inference in limited dependent variable models robust to weak identification

LEANDRO M. MAGNUSSON†

†Department of Economics, Tulane University, 6823 Saint Charles Ave., 308 Tilton Hall, New Orleans, LA 70118, USA
E-mail: [email protected]

First version received: April 2009; final version accepted: November 2009

Summary We propose tests for structural parameters in limited dependent variable models with endogenous explanatory variables. These tests are based upon the generalized minimum distance principle. They are of the correct size regardless of whether the structural parameters are identified and are especially appropriate for models whose moment conditions are nonlinear in the parameters. Moreover, they are computationally simple, allowing them to be implemented using a large number of statistical software packages. We compare our tests to Wald tests in a simulation experiment and use them to analyse the female labour supply and the demand for cigarettes.

Keywords: Hypothesis testing, Limited dependent variable models, Minimum chi-square estimation, Weak identification.
1. INTRODUCTION

In this paper, we use the generalized minimum distance approach to derive tests for structural parameters in limited dependent variable models with endogenous explanatory variables. These tests are of the correct size even when the parameters are unidentified. The generalized minimum distance approach is especially convenient when the moment conditions are non-linear in the parameters.

As shown by Staiger and Stock (1997) and Stock and Wright (2000), Wald, Lagrange multiplier (LM) and likelihood-ratio (LR) tests have non-standard limiting distributions when the parameter is not identified, and inference based on these tests is thus unreliable. In the case of linear instrumental variable models, several tests are robust to parameter identification failure, among them the AR-test of Anderson and Rubin (1949), the K-test of Kleibergen (2002) and the conditional likelihood-ratio (CLR) test of Moreira (2003).

For non-linear models, the extensions of these tests are based on the generalized method of moments (GMM). The starting point is the objective function of the continuous updating estimator (CUE). Stock and Wright (2000) formulate the S-test as an extension of the AR-test. Kleibergen (2005) proposes a new K-test which is the quadratic form of the score of the CUE's objective function. In the same paper, he derives the GMM extension of the CLR-test.

© 2010 The Author(s). The Econometrics Journal © 2010 Royal Economic Society. Published by Blackwell Publishing Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
The generalized minimum distance approach is based on a link function that relates the structural and reduced-form parameters. By avoiding moment conditions, this approach permits the construction of robust tests for a class of models in which GMM would involve solving constrained non-linear systems, or in which the GMM is not feasible because the moments are not differentiable.

From an applied point of view, our proposed tests are useful because they are simple to compute. In many models, they can be computed using the built-in functions of regular statistical software packages. Moreover, confidence intervals based on these tests do not require the estimation of untested parameters under the null hypothesis at every hypothesized value of the parameter of interest.

However, the convenience of the proposed tests goes beyond their computational ease. Their asymptotic properties are derived from the asymptotic properties of the estimator of the reduced-form parameters, which do not depend on the structural parameters. Necessary conditions for the implementation are standard assumptions for minimum distance estimation. Under these assumptions, the reduced-form parameters can be estimated either parametrically or semi-parametrically.

The paper is organized as follows. In the next section, we illustrate the use of the generalized minimum distance principle for estimating parameters in limited dependent variable models. The tests are derived in Section 3, and, in the subsequent section, we suggest an algorithm for the implementation of these tests in a class of limited dependent variable models. Next, in order to compare the performance of the proposed tests to Wald and robust GMM tests, we simulate endogenous Tobit and endogenous Poisson count data models. As an application of the tests, Section 6 considers the female labour supply described by Blundell and Smith (1989) and Lee (1995) and the demand for cigarettes described by Mullahy (1997). The Appendix contains all proofs.
2. MINIMUM DISTANCE PRINCIPLE FOR LIMITED DEPENDENT VARIABLE MODELS

The minimum distance principle for estimation explores the underlying relation between structural parameters, denoted by $\beta$, and reduced-form parameters, denoted by $\pi$. This relation is described by a system of implicit functions of the form $r(\pi, \beta) = 0$. In the literature, $r$ is known as the link function. Let $\hat\pi$ be an estimator of $\pi$, such that $\hat\pi \xrightarrow{p} \pi$. We estimate $\beta$ by forcing $\|r(\hat\pi, \beta)\| = 0$, where $\|\cdot\|$ represents an appropriately weighted norm. The next example illustrates the minimum distance method introduced by Amemiya (1979) for estimating parameters of limited dependent variable models. For the use of this method in cross-sectional models, see Maddala (1983), Blundell and Smith (1989), Lee (1995) and Blundell et al. (2007), and, in panel data models, see Arellano et al. (1999) and Jones and Labeaga (2003).

EXAMPLE 2.1. Let $(y^*, x^*)$ be a vector of latent variables generated by a linear simultaneous system
$$\begin{aligned} y^* &= x^*\beta_1 + w\gamma + u \\ x^* &= z\pi_z + w\pi_w + v, \end{aligned} \tag{2.1}$$
where $x^*$ is correlated with the stochastic disturbance $u$, and $\mathbf{z} = (z, w)$ is a vector of exogenous variables. The reduced-form representation of (2.1) is
$$\begin{aligned} y^* &= \mathbf{z}\delta_y + e \\ x^* &= \mathbf{z}\pi_x + v. \end{aligned} \tag{2.2}$$
Let $L_w$ be a selection matrix, such that $w = \mathbf{z}L_w$. From systems (2.1) and (2.2), we have
$$\mathbf{z}\delta_y = \mathbf{z}\pi_x\beta_1 + \mathbf{z}L_w\gamma + \zeta,$$
where $\zeta = u + (v\beta_1 - e)$. Under the assumption that $E(\zeta \mid \mathbf{z}) = 0$, it must be the case that
$$\underbrace{\delta_y - \pi_x\beta_1 - L_w\gamma}_{r(\pi,\beta)} = 0.$$
The estimation of $\beta = (\beta_1, \gamma)$ is based on the minimization of the objective function
$$(\hat\delta_y - \hat\pi_x\beta_1 - L_w\gamma)'\,\hat\Omega^{-1}(\hat\delta_y - \hat\pi_x\beta_1 - L_w\gamma), \tag{2.3}$$
where $\hat\pi = \mathrm{vec}[\hat\delta_y, \hat\pi_x]$ are estimates of the reduced-form parameters and $\hat\Omega$ is a weighting matrix, i.e. we choose the $\beta$ that minimizes the distance of $r(\hat\pi, \beta)$ measured by the norm $\|\cdot\|_{\hat\Omega}$.

The reduced-form vector $\pi$ can be estimated parametrically or semi-parametrically, according to the latent nature of $(y^*, x^*)$ and the distribution of $(e, v)$. The consistent estimation of $\beta$ depends on the identification condition, which is related to the rank of $\partial r(\pi,\beta)/\partial\beta'$.¹ We use local concepts of identification as follows: $\beta$ is identified if $\partial r(\pi,\beta)/\partial\beta'$ is a full rank matrix, weakly identified if $\partial r(\pi,\beta)/\partial\beta'$ is singular or $\partial r(\pi,\beta)/\partial\beta' = C/\sqrt{n}$, where $C$ is a full rank matrix and $n$ is the sample size, and unidentified if $\partial r(\pi,\beta)/\partial\beta'$ is a null matrix. In Example 2.1, the identification of $\beta$ is given by the rank of $\pi_z$.

If $\beta$ is weakly identified or unidentified, then the minimum distance estimator is biased and the limiting normal asymptotic results do not hold. Consequently, the usual approach to inference which is based on the limiting distribution of the minimum distance estimator, e.g. the Wald and likelihood-ratio tests, is misleading. Although the estimation of $\beta$ is unreliable under weak and non-identification cases, we can still perform hypothesis testing about the value of the structural parameter $\beta$. In the following section, we present tests based on the minimum distance principle that have the null hypothesis of the form $H_0 : \beta = \beta_0$, against the alternative hypothesis $H_a : \beta \neq \beta_0$. The proposed tests are of the correct size even when identification is weak or absent.
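Because the link function in Example 2.1 is linear in $(\beta_1, \gamma)$, minimizing the objective function (2.3) for a fixed weighting matrix has a closed form, namely a weighted least-squares regression of the reduced-form estimate of $\delta_y$ on the reduced-form estimate of $\pi_x$ and $L_w$. The sketch below illustrates this under stated assumptions: the function name, the argument `weight` (standing for the inverse of the weighting matrix in (2.3)) and the numerical inputs are all made up for the example, and a single endogenous regressor is assumed.

```python
import numpy as np


def minimum_distance(delta_hat, pi_hat, L_w, weight):
    """Minimize (delta - pi*beta1 - L_w*gamma)' W (delta - pi*beta1 - L_w*gamma).

    `weight` plays the role of the inverse weighting matrix in (2.3).
    Returns the stacked estimate (beta1, gamma) and the attained objective.
    """
    A = np.column_stack([pi_hat, L_w])          # regressors of the linear link
    AtW = A.T @ weight
    theta = np.linalg.solve(AtW @ A, AtW @ delta_hat)
    resid = delta_hat - A @ theta
    return theta, float(resid @ weight @ resid)


# Hypothetical reduced-form values: three excluded instruments plus one w.
pi_x = np.array([0.8, 0.3, -0.2, 0.1])
L_w = np.array([0.0, 0.0, 0.0, 1.0])            # selects w from (z, w)
delta_y = pi_x * 0.5 + L_w * 1.0                # link holds exactly
theta_hat, objective = minimum_distance(delta_y, pi_x, L_w, np.eye(4))
print(theta_hat, objective)
```

When the excluded-instrument entries of `pi_x` are all zero, the matrix `A.T @ weight @ A` becomes singular, which is the computational face of the identification failure discussed above.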
3. GENERALIZED MINIMUM DISTANCE ROBUST TESTS

The objective function (2.3) is an example of a broad class of minimum distance objective functions. Let $\pi$ be a $k_\pi \times 1$ vector representing the reduced-form parameters, and $\beta$ be an $m \times 1$ vector of structural parameters. The values of $\pi$ and $\beta$ under the true data-generating process are $\pi_0$ and $\beta_0$, respectively. The estimator of $\pi$ is denoted by $\hat\pi$. Let $r$ be a $q \times 1$ real vector value function defined as $r : \mathbb{R}^{k_\pi} \times \mathbb{R}^m \to \mathbb{R}^q$, with $r(\pi, \beta)$ as a typical element. The link function $r$ represents the distance between structural and reduced-form parameters; and $q$, the number of restrictions imposed on the reduced-form parameters, measures the dimension of this distance. These tests rely on the following regularity conditions.

¹ $\partial f(\theta)/\partial\theta'$, the derivative of a $k_f \times 1$ vector function $f(\theta)$ by the $k_\theta \times 1$ vector $\theta$, is a $k_f \times k_\theta$ matrix.

ASSUMPTION 3.1. (Regularity conditions). (a) $\hat\pi \xrightarrow{p} \pi_0$, $\sqrt{n}(\hat\pi - \pi_0) \xrightarrow{d} N(0, \Lambda_0)$ where $\Lambda_0$ is a symmetric, positive definite covariance matrix; $\hat\Lambda \xrightarrow{p} \Lambda_0$. (b) $r(\pi, \beta)$ is continuous on $\mathbb{R}^{k_\pi} \times \mathbb{R}^m$, differentiable in $\pi$ on a neighbourhood of $\pi_0$ and twice differentiable in $\beta$. Moreover, $\mathrm{rank}[\partial r(\pi,\beta)/\partial\pi'] = q$. (c) $r(\pi_0, \beta_0) = 0$.

The above conditions are the same as the ones commonly adopted in minimum distance estimation; see Newey (1985, Assumptions 1 and 2), Lee (1992, Assumption 1) and Gourieroux and Monfort (1995, Assumption 9.5). Assumption 3.1(a) states that the reduced-form parameter is root-n consistent and asymptotically normal, and its asymptotic variance matrix is consistently estimable. Newey and McFadden (1994) provide more primitive conditions if $\hat\pi$ is a maximum likelihood or a GMM estimator. In a model combining censored and continuous endogenous variables, Newey (1985) presents conditions for estimating $\sqrt{n}$-consistent and asymptotically normal reduced-form parameters that do not rely on the normality of the residuals. The definiteness of $\Lambda_0$ in Assumption 3.1(a), together with the differentiability of $r(\pi, \beta)$ in $\pi$ and the full rank of $\partial r(\pi,\beta)/\partial\pi'$ in 3.1(b), are necessary for deriving the asymptotic distribution of $r(\hat\pi, \beta)$.

Assumption 3.1 deserves further explanation. First, we do not require that $\partial r(\pi,\beta)/\partial\beta'$ is a full rank matrix, which is necessary for estimating $\beta$; see Newey (1985, assumption 1) and Lee (1992, assumption 2). Therefore, Assumption 3.1 holds independently of the structural parameter identification. Second, Kleibergen (2005) uses smoothness of the empirical moment condition for deriving weak identification robust tests. Some limited dependent variable models, such as the symmetric censored and the winsorized least squares discussed in Section 6, have non-differentiable moments. Unlike GMM tests, our tests rely on the differentiability of the link function (Assumption 3.1(b)), and, as a consequence, the reduced-form parameters can be estimated semi-parametrically.
Finally, in binary choice models or in models with a selection equation, a scale normalization on the variance of the residuals is necessary for estimating $\pi$.

In Example 2.1, we use a triangular specification, which results in a linear link function. However, the $r$ function can be non-linear in $\beta$, as in Blundell and Smith (1994). The next example presents a simplified version of their model.

EXAMPLE 3.1. Consider the system²
$$\begin{aligned} y^* &= x^*\beta_1 + w\gamma + u \\ x^* &= y\beta_2 + z\pi_z + v, \end{aligned} \tag{3.1}$$
where $y = \max\{0, y^*\}$ and $x^*$ is continuously observed. The difference between (2.1) and (3.1) is that the observed $y$, not the latent $y^*$, determines $x^*$. When $y^* > 0$, we derive the quasi reduced-form system
$$\begin{aligned} y^* &= z\delta_z + w\delta_w + e \\ (x^* - \beta_2 y) &= z\pi_z + v. \end{aligned}$$

² The general model has a third equation that describes the mechanism in which the first equation is observed. We also impose a coherency statistical restriction by ignoring the observed $y$ in the first equation.
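The link function relating the reduced form and the structural parameter $\beta = (\beta_1, \beta_2)$ is³
$$r(\pi, \beta) = \delta_z - \frac{\pi_z\beta_1}{1 - \beta_1\beta_2}, \tag{3.2}$$
and $\pi_z$ is estimable for a given value of $\beta_2$.

By the delta method, we find the asymptotic distribution of $r(\hat\pi, \beta)$, which is
$$\sqrt{n}\,\big(r(\hat\pi, \beta) - r(\pi_0, \beta)\big) \xrightarrow{d} N(0, \Omega_\beta), \tag{3.3}$$
where
$$\Omega_\beta = \left[\frac{\partial r(\pi_0, \beta)}{\partial\pi'}\right]\Lambda_0\left[\frac{\partial r(\pi_0, \beta)}{\partial\pi'}\right]'. \tag{3.4}$$
The identification of $\beta$ does not affect the derivation of the asymptotic distribution of $r(\hat\pi, \beta)$. Define the objective function of the optimal minimum distance estimator, $S_{MD}(\beta)$, as
$$S_{MD}(\beta) = n\,[r(\hat\pi, \beta)]'\,\hat\Omega_\beta^{-1}\,[r(\hat\pi, \beta)], \tag{3.5}$$
where $\hat\Omega_\beta = \left[\frac{\partial r(\hat\pi,\beta)}{\partial\pi'}\right]\hat\Lambda\left[\frac{\partial r(\hat\pi,\beta)}{\partial\pi'}\right]'$. From equation (3.3), we show that $S_{MD}(\beta)$ follows a chi-squared distribution with $q$ degrees of freedom under the null hypothesis.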
THEOREM 3.1. ($S_{MD}$-test) Under Assumption 3.1 and the null hypothesis $H_0 : \beta = \beta_0$,
$$S_{MD}(\beta_0) \xrightarrow{d} \chi^2(q)$$
independent of whether $\beta$ is identified or not.

The $S_{MD}$-test is similar to the S-test proposed by Stock and Wright (2000) derived under the GMM framework. However, it is important to emphasize two differences. First, the link function is not a sample average of empirical moments. Under continuity and differentiability of $r(\pi, \beta)$, the limiting distribution of the $S_{MD}$-test is solely derived from the asymptotic properties of the reduced-form parameter estimator $\hat\pi$ and its covariance $\hat\Lambda$. Secondly, in non-linear models, because structural and nuisance parameters are non-separable, testing a structural parameter involves the estimation of untested parameters under the null hypothesis. This is not the case for our tests. For the linear link function in Section 2, $\gamma$ does not have to be estimated in order to test $\beta_1$. This property, illustrated in Sections 4 and 6, has important computational advantages, especially for the estimation of confidence intervals by inverting the statistical tests.

When the model is overidentified ($q > m$), the $S_{MD}$-test tests two hypotheses simultaneously: the value of the structural parameter vector and the $(q - m)$ overidentification restrictions. As a result, this test loses power under the alternative hypothesis along with an increasing number of overidentification restrictions.
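To make equations (3.2) to (3.5) concrete, here is a minimal sketch of the $S_{MD}$ statistic for the non-linear link function of Example 3.1. It is an illustration under assumptions, not the author's code: the function name and inputs are hypothetical, `lam` stands for an estimate of the asymptotic covariance of the stacked reduced-form estimates of $(\delta_z, \pi_z)$, and the reduced-form estimates are taken as given (in the example they would be re-estimated for each hypothesized $\beta_2$).

```python
import numpy as np


def s_md_nonlinear_link(delta_z, pi_z, lam, beta1, beta2, n):
    """S_MD for the link r = delta_z - pi_z * beta1 / (1 - beta1 * beta2)."""
    kz = len(delta_z)
    c = beta1 / (1.0 - beta1 * beta2)
    r = delta_z - c * pi_z
    # Jacobian of r with respect to (delta_z', pi_z'): [I, -c I].
    G = np.hstack([np.eye(kz), -c * np.eye(kz)])
    omega = G @ lam @ G.T                       # delta-method variance (3.4)
    return n * float(r @ np.linalg.solve(omega, r))


# One instrument, identity covariance: c = 1, r = 1, omega = 2, so S = 100 / 2.
stat = s_md_nonlinear_link(np.array([2.0]), np.array([1.0]), np.eye(2),
                           beta1=0.5, beta2=1.0, n=100)
print(stat)
```

Under the null hypothesis the value would be compared with a chi-squared critical value with $q$ degrees of freedom, here $q = k_z$.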
³ Lemieux et al. (1994) derive a similar link function, which is $\left(\delta_z - \frac{\pi_z\beta}{1-\beta},\ \delta_w - \frac{\pi_z}{1-\beta}\right) = 0$.
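As in Kleibergen (2007), we can decompose the $S_{MD}$-test into two orthogonal statistics: $K_{MD}$ and $J_{MD}$. The first of these statistics tests only the value of the structural parameter, while the latter tests the overidentification restrictions.

THEOREM 3.2. ($K_{MD}$- and $J_{MD}$-tests) Define the $K_{MD}$- and $J_{MD}$-tests as
$$K_{MD}(\beta_0) = n\left[\hat\Omega_{\beta_0}^{-1/2} r(\hat\pi, \beta_0)\right]' P_{\hat\Omega_{\beta_0}^{-1/2}\hat D_{\beta_0}} \left[\hat\Omega_{\beta_0}^{-1/2} r(\hat\pi, \beta_0)\right],$$
$$J_{MD}(\beta_0) = n\left[\hat\Omega_{\beta_0}^{-1/2} r(\hat\pi, \beta_0)\right]' M_{\hat\Omega_{\beta_0}^{-1/2}\hat D_{\beta_0}} \left[\hat\Omega_{\beta_0}^{-1/2} r(\hat\pi, \beta_0)\right],$$
where
$$P_{\hat\Omega_{\beta_0}^{-1/2}\hat D_{\beta_0}} = \hat\Omega_{\beta_0}^{-1/2}\hat D_{\beta_0}\left[\hat D_{\beta_0}'\hat\Omega_{\beta_0}^{-1}\hat D_{\beta_0}\right]^{-1}\hat D_{\beta_0}'\hat\Omega_{\beta_0}^{-1/2}, \qquad M_{\hat\Omega_{\beta_0}^{-1/2}\hat D_{\beta_0}} = I_q - P_{\hat\Omega_{\beta_0}^{-1/2}\hat D_{\beta_0}},$$
$$\hat D_{\beta_0} = \left[\hat D_1(\beta_0)\ \ldots\ \hat D_m(\beta_0)\right],$$
$$\hat D_j(\beta_0) = \frac{\partial r(\hat\pi, \beta_0)}{\partial\beta_j} - \left[\frac{\partial}{\partial\beta_j}\frac{\partial r(\hat\pi, \beta_0)}{\partial\pi'}\right]\hat\Lambda\left[\frac{\partial r(\hat\pi, \beta_0)}{\partial\pi'}\right]'\hat\Omega_{\beta_0}^{-1}\,r(\hat\pi, \beta_0),$$
for $j = 1, \ldots, m$. Under Assumption 3.1 and $H_0 : \beta = \beta_0$, we have
$$K_{MD}(\beta_0) \xrightarrow{d} \chi^2(m) \quad\text{and}\quad J_{MD}(\beta_0) \xrightarrow{d} \chi^2(q - m),$$
regardless of whether $\beta$ is point-identified. Also,
$$S_{MD}(\beta_0) = K_{MD}(\beta_0) + J_{MD}(\beta_0). \tag{3.6}$$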
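The decomposition in (3.6) is a projection identity and can be checked numerically. The sketch below computes $K_{MD}$ and $J_{MD}$ from given values of the link function, its variance estimate and the matrix of $\hat D_j(\beta_0)$ columns; the function name and all inputs are hypothetical values chosen only to exercise the formulas.

```python
import numpy as np


def k_j_decomposition(r, omega, D, n):
    """Return (K_MD, J_MD) given r (q,), omega (q, q), D (q, m) and n."""
    # Symmetric inverse square root of omega via its eigen-decomposition.
    vals, vecs = np.linalg.eigh(omega)
    omega_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
    r_std = omega_inv_half @ r
    D_std = omega_inv_half @ D
    # Projection onto the column space of the standardized D, and its complement.
    P = D_std @ np.linalg.solve(D_std.T @ D_std, D_std.T)
    M = np.eye(len(r)) - P
    return n * float(r_std @ P @ r_std), n * float(r_std @ M @ r_std)


# Illustrative inputs with q = 3 restrictions and m = 1 structural parameter.
r = np.array([0.3, -0.1, 0.2])
omega = np.array([[1.0, 0.2, 0.0],
                  [0.2, 1.5, 0.1],
                  [0.0, 0.1, 0.8]])
D = np.array([[1.0], [0.5], [-0.2]])
K, J = k_j_decomposition(r, omega, D, n=200)
S = 200 * float(r @ np.linalg.solve(omega, r))
print(K, J, S)
```

Because the two quadratic forms use complementary projections, their sum reproduces the $S_{MD}$ value computed directly.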
Unlike the $S_{MD}$-test, the $K_{MD}$-test is not affected by the number of overidentifying restrictions. The statistic $\hat D_{\beta_0}$ is asymptotically independent of $r(\hat\pi, \beta_0)$ under the null hypothesis. If $\beta$ is identified, then $\hat D_{\beta_0}$ converges in probability to $\partial r(\pi_0, \beta_0)/\partial\beta'$. If not, i.e. $\partial r(\hat\pi, \beta_0)/\partial\beta'$ is close to a reduced rank value, then $\sqrt{n}\hat D_{\beta_0}$ converges in distribution to a random variable. Because of the asymptotic independence between $\hat D_{\beta_0}$ and $r(\hat\pi, \beta_0)$, the distribution of the $K_{MD}$-test, conditional on $\hat D_{\beta_0}$, is free from nuisance parameters; see Kleibergen (2005). Moreover, its unconditional distribution is pivotal.

The derivative of the $S_{MD}$ with respect to $\beta$, as shown in the Appendix, is
$$\frac{1}{2}\frac{\partial S_{MD}(\beta)}{\partial\beta'} = n\,r(\hat\pi, \beta)'\,\hat\Omega_\beta^{-1}\hat D_\beta. \tag{3.7}$$
The $K_{MD}$-test is the quadratic form of equation (3.7), weighted by its own variance. The minimum value of $S_{MD}(\beta)$ coincides with the point at which the $K_{MD}$-test equals zero. This point is the generalized minimum distance continuous updating estimator (GMD-CUE), which is different from the GMM-CUE, the minimizer of the S-test. The $J_{MD}$-test is related to the overidentification test proposed by Lee (1992). The latter results from substituting a minimum distance estimator for $\beta$ in the $S_{MD}$-test. Therefore, it is not robust to identification failure.

Equation (3.7) shows that the $K_{MD}$-test will suffer from a spurious decline of power at inflection and local extrema points of the $S_{MD}$-test. Close to these points, the value of the $J_{MD}$-test approximates the value of the $S_{MD}$-test, which has discriminatory power; see equation (3.6). We define a new test for the structural parameter, the $KJ_{MD}$-test, by combining both the $K_{MD}$- and the $J_{MD}$-tests; see Kleibergen (2005). Let $\tau_{K_{MD}}$ and $\tau_{J_{MD}}$ be the significance levels of the $K_{MD}$- and $J_{MD}$-tests, respectively. The $KJ_{MD}$-test has an approximate significance level of $\tau = \tau_{K_{MD}} + \tau_{J_{MD}}$. Rejection of $KJ_{MD}$ occurs if $K_{MD}$ rejects at the $\tau_{K_{MD}}$ or if $J_{MD}$ rejects at the $\tau_{J_{MD}}$ significance level. The $J_{MD}$ component of the $KJ_{MD}$-test corrects the decline of power that affects the $K_{MD}$-test at the inflection and local minima points.

An extension of Moreira's (2003) conditional likelihood-ratio (CLR) to the current framework, under the null hypothesis, is
$$CLR_{MD}(\beta_0) = \frac{1}{2}\left\{S_{MD}(\beta_0) - rk(\beta_0) + \sqrt{[S_{MD}(\beta_0) + rk(\beta_0)]^2 - 4J_{MD}(\beta_0)\,rk(\beta_0)}\right\}, \tag{3.8}$$
where $rk(\beta_0)$ is a statistic that tests the rank of $\hat D_{\beta_0}$; see Kleibergen (2005). In the case of one endogenous variable, $rk(\beta_0) = n\{\hat D_{\beta_0}'\hat\Xi_{\beta_0}^{-1}\hat D_{\beta_0}\}$, where $\hat\Xi_{\beta_0}$ is the variance estimate of $\hat D_{\beta_0}$, described by equation (A.2) in the Appendix. The presence of the $S_{MD}(\beta_0)$ statistic in (3.8) shows that the $CLR_{MD}$-test does not have the spurious decline of power of the $K_{MD}$-test. The asymptotic distribution of the $CLR_{MD}$ is not pivotal and depends on $rk(\beta)$. The critical values of this test are calculated by simulating independent values of $\chi^2(m)$ and $\chi^2(q - m)$ random variables for a given value of $rk(\beta)$.

PROPOSITION 3.1. ($CLR_{MD}$-test) Under Assumption 3.1 and the null hypothesis we have
$$CLR_{MD}(\beta_0) \xrightarrow{d} \frac{1}{2}\left\{\chi^2_m + \chi^2_{q-m} - rk(\beta_0) + \sqrt{[\chi^2_m + \chi^2_{q-m} + rk(\beta_0)]^2 - 4\chi^2_{q-m}\,rk(\beta_0)}\right\},$$
where $\chi^2_m$ and $\chi^2_{q-m}$ are independent chi-squared distributed random variables with $m$ and $q - m$ degrees of freedom, respectively.
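The simulation of critical values described before Proposition 3.1 can be sketched as follows. This is an assumed, minimal implementation: the function name, the number of draws and the seed are arbitrary choices, and the sketch requires $q > m$ so that both chi-squared draws are well defined.

```python
import numpy as np


def clr_critical_value(rk, m, q, level=0.05, draws=200_000, seed=0):
    """Simulated (1 - level) quantile of the limiting CLR_MD law given rk."""
    rng = np.random.default_rng(seed)
    chi_m = rng.chisquare(m, size=draws)        # plays the role of K_MD
    chi_qm = rng.chisquare(q - m, size=draws)   # plays the role of J_MD
    s = chi_m + chi_qm
    clr = 0.5 * (s - rk + np.sqrt((s + rk) ** 2 - 4.0 * chi_qm * rk))
    return float(np.quantile(clr, 1.0 - level))


cv_unidentified = clr_critical_value(rk=0.0, m=1, q=3)
cv_strong = clr_critical_value(rk=1e6, m=1, q=3)
print(cv_unidentified, cv_strong)
```

The two calls illustrate the limiting cases of the formula: with `rk = 0` the statistic collapses to the sum of the two chi-squared draws, so the critical value is close to that of a $\chi^2(q)$, while for very large `rk` it approaches that of a $\chi^2(m)$.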
In the linear instrumental variable model with homoscedastic residuals, Andrews et al. (2006) show that the CLR-test dominates the S- and K-tests in terms of power. However, this result has not yet been extended to a more general class of models. The simulations reported in Section 5 have an example in which the $CLR_{MD}$-test does not dominate the $K_{MD}$-test in terms of power.

The proposed tests can be adapted in order to test only a subset of the structural parameter vector. The procedure consists of estimating the untested structural parameters under the null hypothesis by the GMD-CUE and replacing the estimated values in the original tests. If the estimated parameters are identified, the $S_{MD}$- and $K_{MD}$-tests remain chi-squared distributed with degrees of freedom reduced by the number of estimated parameters. If the estimated parameter is not identified, Kleibergen and Mavroeidis (2009) show that the limiting distributions of the tests are asymptotically bounded by the adjusted chi-squared distributions.

There is a correspondence between our tests and other robust tests for the linear instrumental variable model represented by the following system:
$$\begin{aligned} y &= x\beta_0 + u \\ x &= z\pi_z + v. \end{aligned}$$
We omit the included instruments, $w$, to simplify the exposition. The AR- and K-tests are, respectively,
$$AR(\beta_0) = \frac{(y - x\beta_0)'P_z(y - x\beta_0)}{\hat\sigma^2_{\beta_0}} \quad\text{and}\quad K(\beta_0) = \frac{(y - x\beta_0)'P_{z\tilde\pi_z(\beta_0)}(y - x\beta_0)}{\hat\sigma^2_{\beta_0}},$$
where $P_a = a(a'a)^{-1}a'$,
$$\hat\sigma^2_{\beta_0} = \frac{(y - x\beta_0)'M_z(y - x\beta_0)}{n - k_z}, \qquad \tilde\pi_z(\beta_0) = (z'z)^{-1}z'\left[x - (y - x\beta_0)\frac{(y - x\beta_0)'M_z x}{(n - k_z)\hat\sigma^2_{\beta_0}}\right],$$
and $M_z = I - P_z$.

In the Appendix, we show that the AR- and K-tests are the same as the following $S_{MD}$- and $K_{MD}$-tests:
$$S_{MD}(\beta_0) = (\hat\delta_z - \hat\pi_z\beta_0)'\,\hat\Omega_{\beta_0}^{-1}(\hat\delta_z - \hat\pi_z\beta_0),$$
$$K_{MD}(\beta_0) = (\hat\delta_z - \hat\pi_z\beta_0)'\,\hat\Omega_{\beta_0}^{-1/2}P_{\hat\Omega_{\beta_0}^{-1/2}\hat D_{\beta_0}}\hat\Omega_{\beta_0}^{-1/2}(\hat\delta_z - \hat\pi_z\beta_0),$$
where $\hat\delta_z = (z'z)^{-1}z'y$, $\hat\pi_z = (z'z)^{-1}z'x$, $\hat\Omega_{\beta_0} = \hat\sigma^2_{\beta_0}(z'z)^{-1}$ and $\hat D_{\beta_0} = -\tilde\pi_z(\beta_0)$. In comparison with the tests proposed by Chernozhukov and Hansen (2008), the $S_{MD}$ is the same as the Wald-S, but the $K_{MD}$ differs from the Wald-K, which is based on testing $H_0 : \eta = 0$ in the following Gauss–Newton regression:
$$y - x\beta_0 = z\tilde\pi_z(\beta_0)\eta + \text{residuals}.$$
In the case of heteroscedastic or clustered residuals, these tests differ from the other tests in small samples. However, they are all asymptotically equivalent (see the Appendix).
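The equality between the AR statistic and the $S_{MD}$ statistic in the homoscedastic linear model is an algebraic identity, so it can be verified on any data set. The sketch below does this on simulated data; the data-generating values are arbitrary assumptions and serve only to produce numbers to compare.

```python
import numpy as np

rng = np.random.default_rng(0)
n, kz, beta0 = 200, 3, 0.0
z = rng.uniform(size=(n, kz))
v = rng.normal(size=n)
x = z @ np.array([0.5, 0.0, 0.0]) + v
y = x * beta0 + 0.5 * v + rng.normal(size=n)

# Anderson-Rubin statistic as defined in the text.
e0 = y - x * beta0
zz_inv = np.linalg.inv(z.T @ z)
Pz_e0 = z @ (zz_inv @ (z.T @ e0))               # P_z applied to the residual
sigma2 = float(e0 @ (e0 - Pz_e0)) / (n - kz)    # e0' M_z e0 / (n - kz)
ar = float(e0 @ Pz_e0) / sigma2

# Minimum distance version built from the reduced-form OLS estimates.
delta_hat = zz_inv @ z.T @ y
pi_hat = zz_inv @ z.T @ x
r = delta_hat - pi_hat * beta0
omega = sigma2 * zz_inv
s_md = float(r @ np.linalg.solve(omega, r))

print(ar, s_md)
```

The two printed values agree up to floating-point error, since the reduced-form difference equals $(z'z)^{-1}z'(y - x\beta_0)$.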
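4. IMPLEMENTATION OF ROBUST TESTS

We provide an algorithm for the implementation of the proposed tests specific to the following class of limited dependent variable models:⁴
$$\begin{aligned} y^* &= x\beta + w\gamma + u \\ x &= z\pi_z + w\pi_w + v \end{aligned} \qquad\qquad \begin{aligned} y^* &= z\delta_z + w\delta_w + e \\ x &= z\pi_z + w\pi_w + v, \end{aligned} \tag{4.1}$$
where $y^*$ is latent and $x$ is continuous and fully observed. The link function $r(\pi, \beta)$ derived from system (4.1) is
$$r(\pi, \beta) = \delta_z - \pi_z\beta, \tag{4.2}$$
where $\pi = \mathrm{vec}[\delta_z, \pi_z]$. In this model, $\partial r(\pi,\beta)/\partial\pi' = [I_{k_z}\ \ -\beta' \otimes I_{k_z}]$ is of full rank, independent of the value of $\beta$. The variance of $r(\hat\pi, \beta)$, the $\hat D_\beta$-statistic of the $K_{MD}$-test, and its variance matrix $\Xi_\beta$, which is necessary for computing the $CLR_{MD}$-test, are
$$\Omega_\beta = \Lambda_{\delta_z\delta_z} - (\beta' \otimes I_{k_z})\Lambda_{\pi_z\delta_z} - \Lambda_{\delta_z\pi_z}(\beta \otimes I_{k_z}) + (\beta' \otimes I_{k_z})\Lambda_{\pi_z\pi_z}(\beta \otimes I_{k_z}), \tag{4.3a}$$
$$\mathrm{vec}[\hat D_\beta] = -\mathrm{vec}[\hat\pi_z] + [\Lambda_{\pi_z\delta_z} - \Lambda_{\pi_z\pi_z}(\beta \otimes I_{k_z})]\,\Omega_\beta^{-1}(\hat\delta_z - \hat\pi_z\beta), \tag{4.3b}$$
$$\Xi_\beta = \Lambda_{\pi_z\pi_z} - [\Lambda_{\pi_z\delta_z} - \Lambda_{\pi_z\pi_z}(\beta \otimes I_{k_z})]\,\Omega_\beta^{-1}[\Lambda_{\delta_z\pi_z} - (\beta' \otimes I_{k_z})\Lambda_{\pi_z\pi_z}], \tag{4.3c}$$
where $\Lambda_{\delta_z\delta_z}$ and $\Lambda_{\pi_z\pi_z}$ are the asymptotic variances of $\sqrt{n}(\hat\delta_z - \delta_z)$ and $\sqrt{n}[\mathrm{vec}(\hat\pi_z - \pi_z)]$, respectively, and $\Lambda_{\pi_z\delta_z} = \Lambda_{\delta_z\pi_z}'$ is their asymptotic covariance.

⁴ Finlay and Magnusson (2009) have files available for implementing tests for the instrumental variable Probit and Tobit models in STATA. These files are downloadable from http://greenspace.tulane.edu/kfinlay/research.html.

We can also estimate the reduced-form parameters by introducing a linear control function $v\alpha$. Then, the structural and reduced-form equations become
$$\begin{aligned} y^* &= x\beta + w\gamma + v\alpha + \varepsilon \\ x &= z\pi_z + w\pi_w + v \end{aligned} \qquad\qquad \begin{aligned} y^* &= z\delta_z + w\delta_w + v\delta_v + \varepsilon \\ x &= z\pi_z + w\pi_w + v, \end{aligned} \tag{4.4}$$
where $\varepsilon = u - v\alpha$ and $\delta_v = \beta + \alpha$. We demonstrate that the use of the control function allows us to write $\Lambda_{\pi_z\delta_z} = \Lambda_{\pi_z\pi_z}[(\delta_v \otimes I_{k_z})]$. The elements for computing the tests reduce to
$$\Omega_\beta = \Lambda_{\delta_z\delta_z} + [(\delta_v - \beta)' \otimes I_{k_z}]\,\Lambda_{\pi_z\pi_z}\,[(\delta_v - \beta) \otimes I_{k_z}], \tag{4.5a}$$
$$\mathrm{vec}[D_\beta] = -\mathrm{vec}[\pi_z] + \Lambda_{\pi_z\pi_z}\,\mathrm{vec}[\Omega_\beta^{-1}(\delta_z - \pi_z\beta)(\delta_v - \beta)'], \tag{4.5b}$$
$$\Xi_\beta = \Lambda_{\pi_z\pi_z} - \{\Lambda_{\pi_z\pi_z}[(\delta_v - \beta) \otimes I_{k_z}]\}\,\Omega_\beta^{-1}\,\{[(\delta_v - \beta)' \otimes I_{k_z}]\Lambda_{\pi_z\pi_z}\}. \tag{4.5c}$$
Further simplification is possible by assuming that $v$ is homoscedastic (see the Appendix). The algorithm takes the following steps:

(1) Estimate $\pi_z$ and $\mathrm{Var}[\sqrt{n}(\hat\pi_z - \pi_z)]$ by OLS. Denote the estimated values by $\hat\pi_z$ and $\hat\Lambda_{\pi_z\pi_z}$. Compute $\hat v_i$, the OLS residuals.
(2) Estimate $\delta_z$, $\delta_w$ and $\delta_v$ from the following equation:
$$y = f(z\delta_z + w\delta_w + \hat v\delta_v + \tilde\varepsilon),$$
where $f(\cdot)$ is a known function and $\tilde\varepsilon = \varepsilon - (\hat v - v)\delta_v$. Denote the estimates of $\delta_z$ and $\delta_v$ by, respectively, $\hat\delta_z$ and $\hat\delta_v$. We do not have to keep the estimate of $\delta_w$ because it is not part of the link function (4.2).
(3) Save $\hat\Lambda_{\delta_z\delta_z}$, the output of the variance–covariance matrix estimate of $\hat\delta_z$.
(4) Finally, substitute $\hat\delta_z$, $\hat\Lambda_{\delta_z\delta_z}$, $\hat\Lambda_{\pi_z\pi_z}$, $\hat\pi_z$ and $\hat\delta_v$ into equation (4.5) with the hypothesized value of $\beta$.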
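The four steps above can be sketched in a few lines for the simplest possible case. The sketch makes strong simplifying assumptions that are not in the text: $f$ is taken to be the identity (so step 2 is plain OLS rather than a Tobit, Probit or Poisson fit), there is a single endogenous regressor, errors are treated as homoscedastic, and finite-sample covariance matrices are used in place of asymptotic ones, which absorbs the factor $n$ in the statistics. All function and variable names are hypothetical.

```python
import numpy as np


def ols(X, y):
    """OLS coefficients, residuals and homoscedastic covariance estimate."""
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ X.T @ y
    resid = y - X @ coef
    sigma2 = float(resid @ resid) / (X.shape[0] - X.shape[1])
    return coef, resid, sigma2 * XtX_inv


def control_function_tests(y, x, z, w, beta0):
    """S, K, J and rk for H0: beta = beta0, following (4.5) with scalar beta."""
    kz = z.shape[1]
    Zw = np.column_stack([z, w])
    # Step 1: first-stage OLS of x on (z, w); keep pi_z, its variance, residuals.
    pi_all, v_hat, V_pi_all = ols(Zw, x)
    pi_z, V_pipi = pi_all[:kz], V_pi_all[:kz, :kz]
    # Step 2: regress y on (z, w, v_hat); here f is the identity by assumption.
    coef, _, V_all = ols(np.column_stack([Zw, v_hat]), y)
    delta_z, delta_v = coef[:kz], coef[-1]
    # Step 3: variance block of delta_z from the second-stage output.
    V_dd = V_all[:kz, :kz]
    # Step 4: plug into (4.5a)-(4.5c) at the hypothesized beta0.
    c = delta_v - beta0
    r = delta_z - pi_z * beta0
    omega = V_dd + c ** 2 * V_pipi
    omega_inv_r = np.linalg.solve(omega, r)
    D = -pi_z + c * (V_pipi @ omega_inv_r)
    xi = V_pipi - c ** 2 * V_pipi @ np.linalg.solve(omega, V_pipi)
    S = float(r @ omega_inv_r)
    omega_inv_D = np.linalg.solve(omega, D)
    K = float(r @ omega_inv_D) ** 2 / float(D @ omega_inv_D)
    rk = float(D @ np.linalg.solve(xi, D))
    return {"S": S, "K": K, "J": S - K, "rk": rk}


rng = np.random.default_rng(1)
n = 200
z = rng.uniform(size=(n, 3))
w = np.ones(n)
v = rng.normal(size=n)
x = z @ np.array([1.0, 0.0, 0.0]) + 0.2 * w + v
y = 0.0 * x + 0.5 * w + 0.5 * v + rng.normal(size=n)
stats = control_function_tests(y, x, z, w, beta0=0.0)
print(stats)
```

For an actual limited dependent variable model, step 2 would be replaced by the relevant estimator and its reported covariance matrix; the plug-in formulas in step 4 would remain unchanged.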
5. SIMULATIONS

We simulate the endogenous Tobit and the endogenous Poisson count data models, which can be represented by the simultaneous system (4.1). In both cases, $\hat\pi_x = (\hat\pi_z, \hat\pi_w)$ is the ordinary least-squares estimate. We estimate $\delta_y = (\delta_z, \delta_w)$ using Powell's (1986) symmetric censored least squares (SCLS) and the Poisson quasi-likelihood method for, respectively, the endogenous Tobit model and the endogenous Poisson count data model.

We compare the performance of the proposed robust tests with Wald tests, defined as
$$W(\beta_0) = (\hat\beta - \beta_0)'\,\hat V_{\hat\beta}^{-1}(\hat\beta - \beta_0),$$
where $\hat\beta$ is an estimate of $\beta$, and $\hat V_{\hat\beta}$ is the estimated variance of $\hat\beta$ evaluated at $\hat\beta$. We compute the rejection frequency of the tests at the 10%, 5% and 1% significance levels. For the $KJ_{MD}$-test, the significance level of $K_{MD}$ is four times the significance level of $J_{MD}$.

For both simulations, we generate 10,000 random samples of 200 observations each, satisfying the following conditions: $\beta_0 = 0$; $w_i$ is a unitary constant; $z_i$ is a $1 \times 3$ row vector drawn from independent uniform distributions and kept fixed in all simulations; and $\pi_z = (\pi_{z1}, 0, 0)'$ is a $3 \times 1$ column vector. The value of $\pi_{z1}$ is set according to $\mu$, the concentration parameter divided by $k_z$:
$$\mu = \frac{1}{k_z}\,\frac{\pi_{z1}'\,z_1'z_1\,\pi_{z1}}{\sigma_v^2},$$
where $z_1$ is the $n \times 1$ vector of the first instrument and $\sigma_v^2$ is the variance of $v_i$. We choose $\mu$ to be 30 and 3, representing strong and weak identification, respectively.⁵

5.1. Endogenous Tobit

The endogenous Tobit model is represented by
$$\begin{aligned} y_i &= \max\{0,\ x_i\beta + w_i\gamma + u_i\} \\ x_i &= z_i\pi_z + w_i\pi_w + v_i. \end{aligned} \tag{5.1}$$
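The calibration of the first-stage coefficient from the concentration parameter can be sketched by inverting the definition of $\mu$ given in the simulation design. The helper name and the inputs below are hypothetical; the error variance is set to one purely for illustration.

```python
import numpy as np


def first_stage_coefficient(mu, z1, sigma_v2, kz):
    """Solve mu = pi_z1 * (z1'z1) * pi_z1 / (kz * sigma_v2) for pi_z1 > 0."""
    return float(np.sqrt(mu * kz * sigma_v2 / (z1 @ z1)))


rng = np.random.default_rng(0)
z1 = rng.uniform(size=200)                       # first instrument, kept fixed
pi_weak = first_stage_coefficient(mu=3.0, z1=z1, sigma_v2=1.0, kz=3)
pi_strong = first_stage_coefficient(mu=30.0, z1=z1, sigma_v2=1.0, kz=3)
mu_check = pi_weak * (z1 @ z1) * pi_weak / (3 * 1.0)
print(pi_weak, pi_strong, mu_check)
```

Since $\mu$ is quadratic in the coefficient, moving from the weak to the strong design multiplies the first-stage coefficient by $\sqrt{10}$.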
We consider two cases for the joint distribution of $(u_i, v_i)$: a bivariate Laplace distribution with zero mean and unit variance, and a bivariate t-distribution with three degrees of freedom. In both cases, we first generate bivariate uniform distributed random variables with correlation coefficient $\rho$. Then, we generate the residuals $\{(u_i, v_i)\}_{i=1}^n$ using the inverse of the cumulative distribution functions; $\rho$ is either 0.2 or 0.9, and $\pi_w = 0.2$. The parameter $\gamma$ takes on values of 0.7267 and 0.4901 for the t-Student and the Laplace residuals, respectively. We calibrate $\gamma$ such that, on average, 25% of the observations are left censored.

In computing the robust tests, we assumed that residuals are heteroscedastic of unknown form. The statistics are based on the elements defined in (4.3). For estimating $\delta_y = (\delta_z, \delta_w)$ by SCLS we use the algorithm proposed by Silva (2001). This algorithm converges faster and more frequently compared to the original algorithm in Powell (1986). However, between 0.5% and 1% of the simulations did not converge. The Wald test is derived from the two-step minimum distance estimator in Lee (1995).

Table 1 reveals that the sizes of the Wald tests become distorted when identification decreases. The distortion varies according to the degree of endogeneity: the tests under-reject the null hypothesis when $\rho = 0.2$ and over-reject it when $\rho = 0.9$. These results are related to the bias of the minimum distance estimator of $\beta$: the lower the degree of identification and the higher the endogeneity, the more upwardly biased are the two-step estimates. In contrast to the Wald tests, the performance of our tests is neither affected by the level of identification nor by the degree of endogeneity.

⁵ In linear instrumental variable models, Staiger and Stock (1997) suggest that values of $\mu$ below 10 indicate that the instruments are weak.

5.2. Poisson count data model

The following system is a representation of the endogenous Poisson count data model:
$$\begin{cases} y_i \sim \mathrm{Poisson}(\lambda_i) \\ \lambda_i = \exp(x_i\beta + w_i\gamma + u_i) \\ x_i = z_i\pi_z + w_i\pi_w + v_i, \end{cases} \tag{5.2}$$
Table 1. Size comparison (in percentage), H0: β = 0, endogenous censored model.

                           μ = 30                                    μ = 3
                ρ = 0.2              ρ = 0.9              ρ = 0.2              ρ = 0.9
ACV          10%     5%     1%    10%     5%     1%    10%     5%     1%    10%     5%     1%

Laplace residuals
Wald_MD     9.23   4.39   0.68  12.24   7.45   2.41   3.68   1.13   0.05  22.34  13.45   3.54
S_MD       12.07   6.87   2.08  12.24   6.41   1.85  12.07   6.87   2.08  12.24   6.41   1.85
K_MD       11.12   5.88   1.40  11.43   6.05   1.28  10.94   5.83   1.40  12.84   7.30   1.76
J_MD       11.54   6.66   1.86  11.43   6.33   1.57  11.70   6.81   1.89  10.45   5.43   1.41
KJ_MD      11.48   6.42   1.64  11.82   6.26   1.53  11.50   6.53   1.61  12.73   7.21   1.81
CLR_MD     11.12   6.43   1.32  11.64   6.69   1.31  11.68   6.68   1.97  13.36   7.34   1.87

t-Student residuals
Wald_MD     9.93   4.91   0.88  12.60   7.46   2.37   4.18   1.30   0.07  20.25  12.03   3.28
S_MD       13.19   7.84   2.55  13.31   7.34   2.33  13.19   7.84   2.55  13.31   7.34   2.33
K_MD       11.89   6.47   1.68  12.07   6.67   1.57  11.77   6.51   1.69  13.40   7.47   1.91
J_MD       12.38   7.32   2.19  12.09   6.99   1.81  12.45   7.41   2.20  11.53   6.20   1.73
KJ_MD      12.64   7.35   2.09  12.98   7.13   1.98  12.85   7.42   2.11  13.57   7.75   2.19
CLR_MD     11.82   7.12   1.76  12.30   7.39   1.60  12.63   7.70   2.37  13.88   8.18   2.24

Notes: Rejection frequencies under the null hypothesis are based on 10,000 simulations of samples with 200 observations. ACV stands for asymptotic critical value.
where λi is the mean of the Poisson distribution. We analyse the performance of the tests for over-, equally and underdispersed data. Because the results are similar, we report only the equally dispersed case (see footnote 6). We also analyse tests that include a linear control function, as described in equation (4.4).

Realizations from the system (5.2) are generated in the following steps. First, we sample the random variables (ν1, ν2) from a bivariate uniform distribution with correlation coefficient ρ ∈ {0.2, 0.9}. We then set v = ν1 and obtain y from the inverse Poisson distribution function evaluated at ν2. By fixing πw = −0.5 and γ = log(2.0), we obtain E[wπw + v] = 0 and E[y] = Var[y] = 2.

We also investigate the performance of the GMM robust tests in Stock and Wright (2000) and Kleibergen (2005), which are derived from the following empirical moment condition proposed by Mullahy (1997):

$$\frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} z_i \\ w_i \end{pmatrix}\left[\exp(-x_i\beta - w_i\gamma)\,y_i - 1\right]. \qquad (5.3)$$

The estimation of γ under the null hypothesis is necessary for computing the robust GMM tests. The non-robust tests are listed according to the method used for estimating β: two-step GMM or two-step minimum distance.

6 The remaining results are available on the author's web site.
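The sampling steps described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the correlated uniforms are constructed, so the sketch uses a Gaussian copula (its parameter is the correlation of the latent normals, which is close to but not identical to the correlation of the uniforms), and the Poisson mean is held fixed at 2 rather than computed from the system (5.2). The names `poisson_ppf` and `correlated_uniforms` are hypothetical.

```python
import math
import random


def poisson_ppf(u, lam):
    """Inverse Poisson CDF: smallest k with P(Y <= k) >= u for Y ~ Poisson(lam)."""
    k = 0
    pmf = math.exp(-lam)  # P(Y = 0)
    cdf = pmf
    # The pmf > 0 guard stops the loop if u is numerically indistinguishable from 1.
    while cdf < u and pmf > 0.0:
        k += 1
        pmf *= lam / k    # P(Y = k) from P(Y = k - 1)
        cdf += pmf
    return k


def correlated_uniforms(rho_normal, rng):
    """One pair of dependent uniforms built from a Gaussian copula (an assumption)."""
    g1 = rng.gauss(0.0, 1.0)
    g2 = rho_normal * g1 + math.sqrt(1.0 - rho_normal ** 2) * rng.gauss(0.0, 1.0)

    def std_normal_cdf(g):
        return 0.5 * (1.0 + math.erf(g / math.sqrt(2.0)))

    return std_normal_cdf(g1), std_normal_cdf(g2)


rng = random.Random(0)
nu1, nu2 = correlated_uniforms(0.9, rng)
v = nu1                        # first-stage error, uniform on (0, 1)
y = poisson_ppf(nu2, lam=2.0)  # count outcome by inverse-transform sampling
```

Because ν2 is marginally uniform, the draws of y are Poisson with the chosen mean, while their dependence on v comes entirely from the copula.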
Table 2. Size comparison (in percentage), H0: β = 0, endogenous Poisson count data model, equidispersion case.

             μ = 30, ρ = 0.2      μ = 30, ρ = 0.9      μ = 3, ρ = 0.2       μ = 3, ρ = 0.9
ACV           10%     5%     1%    10%     5%     1%    10%     5%     1%    10%     5%     1%
Wald_GMM     8.86   3.84   0.48  14.07   9.14   3.74   4.12   1.23   0.05  30.84  22.49  10.59
Wald_MD      8.41   3.55   0.51  15.04  10.33   4.71   3.78   1.22   0.07  43.08  35.95  23.83
Wald_cMD     9.96   4.47   0.75  15.87  11.39   5.40   4.64   1.61   0.08  46.42  39.55  27.79
S_MD         9.25   4.71   1.11   9.01   4.44   0.96   9.25   4.71   1.11   9.01   4.44   0.96
K_MD        10.22   4.73   0.95   9.62   4.80   0.73  10.12   5.10   1.04   9.41   4.74   0.86
J_MD         9.29   4.51   1.07   8.55   4.56   0.87   9.33   4.58   1.05   9.11   4.40   0.97
KJ_MD        9.79   4.79   0.94   9.30   4.70   0.81   9.96   5.19   0.89   9.35   4.76   0.81
CLR_MD      10.08   5.07   0.92   9.62   5.11   0.72   9.90   5.23   1.03   9.19   4.94   0.78
S_cMD       12.44   6.60   1.76  10.09   5.22   0.89  12.44   6.60   1.76  10.09   5.22   0.89
K_cMD       12.06   6.03   1.50   9.94   5.01   0.90  11.95   6.52   1.52   9.75   5.07   0.93
J_cMD       11.51   6.06   1.51   9.92   5.03   1.10  11.65   6.24   1.52  10.19   5.21   1.09
KJ_cMD      12.01   6.23   1.50  10.13   5.02   0.91  12.18   6.67   1.58   9.90   5.01   0.96
CLR_cMD     12.08   6.73   1.38   9.98   5.39   0.88  12.55   6.88   1.45   9.89   5.22   0.88
S_GMM       10.23   4.88   0.97   9.56   4.65   0.68  10.23   4.88   0.97   9.56   4.65   0.68
K_GMM        9.64   4.40   0.80   8.97   4.42   0.66   9.73   4.80   0.81   9.09   4.36   0.76
J_GMM       10.84   5.53   1.11  10.53   5.07   0.91  11.00   5.30   1.02  10.27   4.89   0.88
KJ_GMM       9.37   4.42   0.81   9.20   4.37   0.66   9.69   4.78   0.76   9.03   4.34   0.74
CLR_GMM      9.66   4.74   0.77   9.09   4.76   0.55   9.82   4.78   0.77   9.39   4.53   0.63
Notes: Rejection frequencies under the null hypothesis are based on 10,000 simulations of samples with 200 observations. ACV stands for asymptotic critical value. The subscript c indicates tests that use the control function vα.
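When reading the rejection frequencies, it helps to know how much of the deviation from the nominal level can be attributed to simulation noise alone. The following sketch, which is not part of the original study, computes the binomial standard error of a simulated rejection frequency for the 10,000 replications stated in the notes; the function name `mc_standard_error` is hypothetical.

```python
import math


def mc_standard_error(nominal, n_sim):
    """Standard error, in percentage points, of a simulated rejection
    frequency when the true rejection probability equals `nominal`."""
    return 100.0 * math.sqrt(nominal * (1.0 - nominal) / n_sim)


for level in (0.10, 0.05, 0.01):
    se = mc_standard_error(level, 10_000)
    low = 100.0 * level - 1.96 * se
    high = 100.0 * level + 1.96 * se
    print(f"nominal {100 * level:.0f}%: s.e. {se:.2f}, approx. 95% band [{low:.2f}, {high:.2f}]")
```

At the 10% level the standard error is 0.3 percentage points, so entries within roughly 0.6 points of 10 are compatible with correct size, whereas the Wald entries for ρ = 0.9 lie far outside that band.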
The results in Table 2 show that changes in the strength of identification affect the behaviour of the Wald tests. As in the endogenous Tobit model, they under-reject the null hypothesis when ρ = 0.2 and over-reject it when ρ = 0.9. For example, the rejection probability of the Wald two-step minimum distance test jumps from 3.78% when ρ = 0.2 to 43.08% when ρ = 0.9, against a nominal level of 10%. The rejection probabilities of the proposed tests and of the GMM robust tests are close to the nominal levels, regardless of the level of endogeneity, the degree of dispersion or the identification strength. The introduction of a control function has ambiguous effects: it brings the rejection frequencies closer to the nominal levels when ρ = 0.9 and moves them further away when ρ = 0.2.

5.3. Power: endogenous count data model

We investigate the power of the proposed tests for the endogenous count data model using the same data-generating process as in Section 5.2. We report only the results of the tests that do not incorporate a control function and are therefore less efficient.
[Figure 1: four panels (μ = 30, ρ = 0.2; μ = 3, ρ = 0.2; μ = 30, ρ = 0.9; μ = 3, ρ = 0.9), each plotting the rejection probability (vertical axis, 0 to 1) against β (horizontal axis, −2 to 2) for the S_MD, K_MD, CLR_MD, J_MD and Wald_GMM tests.]
Figure 1. Power curves for testing H0 : β = 0 at the 5% significance level, endogenous count data model.
Figure 1 compares the robust tests and the two-step GMM Wald test across degrees of endogeneity and identification strength. First, the size of the robust tests remains correct in all panels, while the Wald test is biased even in cases in which identification is relatively strong. Secondly, the S_MD test has less discriminatory power than the K_MD test, which is explained by the number of overidentifying restrictions. Thirdly, in three of the four cases, the CLR_MD test dominates the remaining robust tests. Finally, the K_MD test suffers a decline in power at some values of β that differ from the hypothesized value.
Figure 2 compares the power of our tests with that of the robust GMM tests. The KJ_MD and CLR_MD tests dominate the KJ_GMM and CLR_GMM tests, respectively, in all cases. The S_GMM test dominates the S_MD test for positive values of β. The results are even more favourable when a control function is used.
6. TWO APPLICATIONS

We illustrate the use of the robust tests by constructing confidence intervals and regions for the two models of Section 5: the endogenous Tobit and Poisson count data models. The former is illustrated by married female labour supply (see Blundell and Smith, 1989, and Lee, 1995), while the latter is exemplified by the demand for cigarettes (see Mullahy, 1997). The 1 − τ confidence set is formed by the points of the parameter space at which the null hypothesis is not rejected at significance level τ.

6.1. Female labour supply

Consider the married female labour supply model of Blundell and Smith (1989). In equation (5.1), yi represents weekly hours in paid work, and xi is other household income, measured in thousands of US dollars, which includes unearned income and savings. Besides a constant term, wi includes demographic variables: female age and its square, education and its square, three child dummy variables and a race dummy variable (see footnote 7). The data set was originally obtained from the 1987 cross-section of the Michigan Panel Study of Income Dynamics and is identical to the one used by Lee (1995). The sample includes married couples with non-negative total family income. The female household member must be of working age (18–64) and not self-employed. Of the 3382 married females, 895 were not working, which is approximately 26.4% of the total number of observations.

Besides the SCLS, we consider the winsorized mean estimator (WME) suggested by Lee (1995) to estimate the reduced-form parameters. The WME is less restrictive than Powell's SCLS estimator because the latter requires a symmetric distribution of the residuals, while the former assumes only local symmetry. On the other hand, the WME requires the choice of a trimming parameter that imposes local symmetry. Our trimming parameter, denoted by w, is the point that minimizes the sum of the diagonal elements of the variance matrix of the WME.
Table 3 presents the 95% confidence intervals derived from the two-step Wald estimator and from our tests. The results are divided into two groups. In the first group, column (a), we follow Mroz (1987) and use functions of the included instruments as the excluded instruments: cubic terms of female age and education. In the second group, column (a)+(b), we add three dummy variables related to the male's occupation. We also report the exogeneity test proposed by Smith and Blundell (1986) and the first-stage F-statistic. For the model in column (a), the intervals derived from the Wald test differ from those derived from the robust tests. In the SCLS case, the robust confidence intervals are wider than the non-robust confidence interval. In the WME case, the opposite occurs. However, when dummies for the husband's occupation are added, the confidence intervals become identical, except for the S_MD interval. These results show that the Wald confidence intervals in column (a) are unreliable, even with a first-stage F-statistic above 10 and more than 3000 observations.
7 See the notes to Table 3.
[Figure 2: panels comparing S_GMM with S_MD, KJ_GMM with KJ_MD, and CLR_GMM with CLR_MD for each of the four designs (μ = 30, ρ = 0.2; μ = 3, ρ = 0.2; μ = 30, ρ = 0.9; μ = 3, ρ = 0.9), each plotting the rejection probability (vertical axis, 0 to 1) against β (horizontal axis, −2 to 2).]
Figure 2. Power curves for testing H0 : β = 0 at the 5% significance level, endogenous count data model.
Table 3. 95% confidence interval, other household income.

                                      Instruments
Estimation method      Test       (a)               (a)+(b)
SCLS                   Wald       [−0.36, 0.14]     [−0.20, 0.03]
                       S_MD       [−0.59, 0.23]     [−0.23, 0.06]
                       KJ_MD      [−0.79, 0.14]     [−0.20, 0.03]
                       CLR_MD     [−0.60, 0.13]     [−0.20, 0.03]
WME                    Wald       [−0.55, 0.20]     [−0.23, −0.01]
                       S_MD       [−0.26, 0.09]     [−0.30, 0.05]
                       KJ_MD      [−0.57, 0.15]     [−0.23, −0.01]
                       CLR_MD     [−0.38, 0.12]     [−0.23, −0.01]
Exogeneity t-test                 −0.56             −3.29
First-stage F-statistic           15.08             32.15
Notes: The included instruments are age = (Age − 40)/10, where Age is the female's age in years, age², educ = (Education − 8), where Education is the female's education in years, educ², three child dummy variables (C1 for a child aged 0–5, C2 for a child aged 6–13 and C3 for a child aged 14–17) and a dummy variable for race (1 if non-white and 0 otherwise). In column (a), the excluded instruments are age × education, age³, education³, age² × education and age × education². In column (a)+(b), the excluded instruments are those of column (a) plus three male occupation dummies (O1: manager or professional, O2: sales worker or clerical or craftsman, O3: farm-related worker). The number of observations is 3382.
6.2. Cigarette demand function

Mullahy (1997) suggests a Poisson-type regression to investigate the impact of smoking habits on cigarette consumption. The data set consists of 6160 responses of males to the Smoking Supplement of the 1979 National Health Interview Survey. In equation (5.2), yi represents the number of cigarettes consumed, measured in packs per day. The endogenous explanatory variable xi is the smoking habit stock measure K210 (see footnote 8). The vector of included instruments wi contains: the state-level average price per pack of cigarettes in 1979; the individual's age in years; his years of education and its square; his family income in thousands of US dollars; a race dummy variable (one if white, zero otherwise); and a constant. As excluded instruments, we use: an interaction term between age and education; the state-level average price per pack of cigarettes in 1978; and the number of years the state's restaurant smoking restrictions had been in place in 1979. The first-stage F-statistic is 9.78.

We compute the 90% and 95% confidence regions for the smoking habit stock (β) and the cigarette price (γ) using the S_GMM, S_MD, KJ_GMM and KJ_MD tests. They are illustrated in Figure 3. In this example, the GMM tests require estimation of the eight parameters of the included instruments for each hypothesized value in the grid search. This procedure involves optimizing a system of non-linear functions. As a consequence, in parts of the parameter space, the optimization algorithm is unstable and may not converge. In addition, the computational time for estimating the confidence region increases significantly. The S_MD and KJ_MD tests do not involve the estimation of the untested parameters. They require solving only one optimization problem, namely the estimation of the reduced-form parameters (see footnote 9). Moreover, their confidence regions are smaller.

[Figure 3: four panels (S_GMM, S_MD, KJ_GMM, KJ_MD) showing the 0.9 and 0.95 confidence contours, with γ on the horizontal axis (−0.02 to 0.01) and β on the vertical axis (0 to 0.02).]
Figure 3. 90% and 95% confidence regions for smoking habit stock (β) and cigarette price (γ).

8 K210 is an index of the habit-forming effects of prior cigarette consumption. The author argues that the smoking habit and past unobserved determinants of smoking are correlated. Since the latter are also correlated with the present unobserved determinants of smoking, the smoking habit is correlated with the unobserved determinants of current smoking.

9 In this example the grid contains 30,000 points. Using the same computer, the S_MD and KJ_MD tests take 8.64 seconds to calculate the confidence region, while the GMM tests take more than 26 hours.
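The grid search used to build these confidence regions is a test inversion: every hypothesized value is kept if the statistic evaluated there does not exceed the critical value. A minimal sketch of that idea for a single parameter is given below. It is not the author's implementation; `confidence_set` and `toy_statistic` are hypothetical names, and the quadratic toy statistic merely stands in for a statistic such as S_MD, which would be recomputed from the reduced-form estimates at each grid point.

```python
CHI2_1_95 = 3.841458820694124  # 95% quantile of the chi-square(1) distribution


def confidence_set(statistic, grid, critical_value):
    """Return the grid points at which the null hypothesis is not rejected."""
    return [b for b in grid if statistic(b) <= critical_value]


def toy_statistic(b, estimate=0.3, std_error=0.1):
    """Stand-in statistic (hypothetical): squared t-ratio for H0: beta = b."""
    return ((estimate - b) / std_error) ** 2


grid = [i / 1000.0 for i in range(-1000, 1001)]   # beta from -1 to 1 in steps of 0.001
accepted = confidence_set(toy_statistic, grid, CHI2_1_95)
print(min(accepted), max(accepted))
```

The function returns the accepted points rather than two endpoints on purpose: confidence sets obtained by inverting identification-robust statistics need not be a single interval, so reporting only a minimum and a maximum could hide gaps.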
7. CONCLUSION

We develop tests robust to weak identification in the context of models in which non-linearities in the moment conditions make conventional GMM procedures intractable. These tests, based on the generalized minimum distance principle, avoid non-linearity problems because they do not require direct inference about the structural parameters. Instead, the crucial assumptions concern the relationship between the structural and reduced-form parameters and the asymptotic behaviour of the reduced-form parameter estimator. The simplicity of this approach extends to its computational implementation, which can be conducted using regular statistical software packages. Simulations show that these tests perform well in the case of weak identification and different degrees of endogeneity.
ACKNOWLEDGMENTS

I would like to thank Frank Kleibergen, Sophocles Mavroeidis, Sun-Jae Jun, Toru Kitagawa, Christian Bontemps, Jason Pearcy, Keith Finlay, Luciana C. Fiorini and two anonymous referees for their helpful suggestions. I am also grateful for the suggestions of the participants of the 2008 (EC)² conference in Rome. Special thanks to Professor John Mullahy for providing me with the data set used in Section 6.2. I acknowledge the Tulane Research Enhancement Fund and the Committee on Research Summer Fellowship for funding support.
REFERENCES

Amemiya, T. (1979). The estimation of a simultaneous-equation Tobit model. International Economic Review 20, 169–81.
Anderson, T. W. and H. Rubin (1949). Estimation of the parameters of a single equation in a complete system of stochastic equations. Annals of Mathematical Statistics 20, 46–63.
Andrews, D. W. K., M. J. Moreira and J. Stock (2006). Optimal two-sided invariant similar tests for the instrumental variable regression. Econometrica 74, 715–52.
Arellano, M., O. Bover and J. M. Labeaga (1999). Autoregressive models with sample selectivity for panel data. In C. Hsiao, K. Lahiri, L.-F. Lee and M. H. Pesaran (Eds.), Analysis of Panels and Limited Dependent Variable Models, 23–48. Cambridge, UK: Cambridge University Press.
Blundell, R., T. MaCurdy and C. Meghir (2007). Labor supply models: unobserved heterogeneity, nonparticipation and dynamics. In J. Heckman and E. Leamer (Eds.), Handbook of Econometrics, Volume 6A, 4668–775. Amsterdam: North-Holland.
Blundell, R. W. and R. J. Smith (1989). Estimation in a class of simultaneous equation limited dependent variable models. Review of Economic Studies 56, 37–58.
Blundell, R. W. and R. J. Smith (1994). Coherency and estimation in simultaneous models with censored or qualitative dependent variables. Journal of Econometrics 64, 355–73.
Chernozhukov, V. and C. Hansen (2008). The reduced form: a simple approach to inference with weak instruments. Economics Letters 100, 68–71.
Finlay, K. and L. M. Magnusson (2009). Implementing tests for a general class of instrumental variable models that are robust to weak instruments. Stata Journal 9, 398–421.
Gourieroux, C. and A. Monfort (1995). Statistics and Econometric Models, Volume 1. Cambridge, UK: Cambridge University Press.
Jones, A. M. and J. M. Labeaga (2003). Individual heterogeneity and censoring in panel data estimates of tobacco expenditure. Journal of Applied Econometrics 18, 157–77.
Kleibergen, F. (2002). Pivotal statistics for testing structural parameters in instrumental variables regression. Econometrica 70, 1781–803.
Kleibergen, F. (2005). Testing parameters in GMM without assuming that they are identified. Econometrica 73, 1103–23.
Kleibergen, F. (2007). Generalizing weak instrument robust IV statistics towards multiple parameters, unrestricted covariance matrices and identification statistics. Journal of Econometrics 139, 181–216.
Kleibergen, F. and S. Mavroeidis (2009). Weak instrument robust tests in GMM and the New Keynesian Phillips curve. Journal of Business and Economic Statistics 27, 293–311.
Lee, L.-F. (1992). Amemiya's generalized least squares and tests of the overidentification in simultaneous equation models with qualitative or limited dependent variables. Econometric Reviews 11, 319–28.
Lee, M.-J. (1995). Semi-parametric estimation of simultaneous equations with limited dependent variables: a case study of female labor supply. Journal of Applied Econometrics 10, 187–200.
Lemieux, T., B. Fortin and P. Fréchette (1994). The effect of taxes on labor supply in the underground economy. American Economic Review 84, 231–54.
Maddala, G. (1983). Limited-Dependent and Qualitative Variables in Econometrics. Cambridge, UK: Cambridge University Press.
Moreira, M. J. (2003). A conditional likelihood ratio test for structural models. Econometrica 71, 1027–48.
Mroz, T. A. (1987). The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. Econometrica 55, 765–99.
Mullahy, J. (1997). Instrumental-variable estimation of count data models: applications to models of cigarette smoking behaviour. Review of Economics and Statistics 79, 586–93.
Newey, W. and D. McFadden (1994). Large sample estimation and hypothesis testing. In R. Engle and D. McFadden (Eds.), Handbook of Econometrics, Volume 4, 2111–245. Amsterdam: North-Holland.
Newey, W. K. (1985). Semiparametric estimation of limited dependent variable models with endogenous explanatory variables. Annales de l'INSEE 59–60, 219–36.
Powell, J. L. (1986). Symmetrically trimmed least squares estimation for Tobit models. Econometrica 54, 1435–60.
Santos Silva, J. M. C. (2001). Influence diagnostics and estimation algorithms for Powell's SCLS. Journal of Business and Economic Statistics 19, 55–62.
Smith, R. J. and R. W. Blundell (1986). An exogeneity test for a simultaneous equation Tobit model with an application to labor supply. Econometrica 54, 679–85.
Staiger, D. and J. H. Stock (1997). Instrumental variables regression with weak instruments. Econometrica 65, 557–86.
Stock, J. H. and J. Wright (2000). GMM with weak identification. Econometrica 68, 1055–96.
APPENDIX

A.1. Proofs of results

Proof of Theorem 3.1: Under the null hypothesis, $r(\pi_0,\beta_0)=0$ and, by continuity of the link function in $\pi$ and the Cramér theorem, the asymptotic behaviour of $r(\hat\pi,\beta_0)$ is

$$\sqrt{n}\,r(\hat\pi,\beta_0)=\frac{\partial r(\pi_0,\beta_0)}{\partial\pi'}\sqrt{n}(\hat\pi-\pi_0)+o_p(1)\xrightarrow{d}N(0,\Omega_{\beta_0}),$$

where $\Omega_{\beta_0}=\left[\frac{\partial r(\pi_0,\beta_0)}{\partial\pi'}\right]\Lambda_0\left[\frac{\partial r(\pi_0,\beta_0)}{\partial\pi'}\right]'$. Since $\hat\Lambda\xrightarrow{p}\Lambda_0$ and $\frac{\partial r(\hat\pi,\beta_0)}{\partial\pi'}\xrightarrow{p}\frac{\partial r(\pi_0,\beta_0)}{\partial\pi'}$, $S_{MD}(\beta_0)\xrightarrow{d}\chi^2(q)$ by the Slutsky theorem.
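Proof of Theorem 3.2: From Assumption 3.1 and a Taylor expansion, the asymptotic joint distribution of $r(\hat\pi,\beta)$ and $\partial r(\hat\pi,\beta)/\partial\beta'$, under the null hypothesis, is

$$\sqrt{n}\begin{pmatrix} r(\hat\pi,\beta_0)-r(\pi_0,\beta_0)\\[4pt] \mathrm{vec}\left[\dfrac{\partial r(\hat\pi,\beta_0)}{\partial\beta'}-\dfrac{\partial r(\pi_0,\beta_0)}{\partial\beta'}\right]\end{pmatrix}\xrightarrow{d}N\left(\begin{bmatrix}0\\0\end{bmatrix},\begin{bmatrix}\Omega_{\beta_0} & \dfrac{\partial r(\pi_0,\beta_0)}{\partial\pi'}\Lambda_0F_0'\\[6pt] F_0\Lambda_0\left[\dfrac{\partial r(\pi_0,\beta_0)}{\partial\pi'}\right]' & F_0\Lambda_0F_0'\end{bmatrix}\right),\qquad (A.1)$$

where $\Omega_{\beta_0}$ is defined in equation (3.4) and $F_0=\frac{\partial}{\partial\pi'}\left[\mathrm{vec}\left(\frac{\partial r(\pi_0,\beta_0)}{\partial\beta'}\right)\right]$. Pre-multiplying (A.1) by the lower-block triangular matrix

$$\begin{bmatrix} I_q & 0_{q\times qm}\\[4pt] -\hat F_0\hat\Lambda\left[\dfrac{\partial r(\hat\pi,\beta_0)}{\partial\pi'}\right]'\hat\Omega_{\beta_0}^{-1} & I_{qm}\end{bmatrix},\qquad \hat F_0=\frac{\partial}{\partial\pi'}\left[\mathrm{vec}\left(\frac{\partial r(\hat\pi,\beta_0)}{\partial\beta'}\right)\right],$$

results in

$$\sqrt{n}\begin{pmatrix} r(\hat\pi,\beta_0)\\[4pt] \mathrm{vec}\left[\hat D(\beta_0)-\dfrac{\partial r(\pi_0,\beta_0)}{\partial\beta'}\right]\end{pmatrix}\xrightarrow{d}N\left(\begin{bmatrix}0\\0\end{bmatrix},\begin{bmatrix}\Omega_{\beta_0} & 0\\ 0 & \Theta_{\beta_0}\end{bmatrix}\right),$$

where

$$\Theta_{\beta_0}=F_0\Lambda_0F_0'-F_0\Lambda_0\left[\frac{\partial r(\pi_0,\beta_0)}{\partial\pi'}\right]'\Omega_{\beta_0}^{-1}\left[\frac{\partial r(\pi_0,\beta_0)}{\partial\pi'}\right]\Lambda_0F_0'.\qquad (A.2)$$

Thus, $\hat D(\beta_0)$ and $r(\hat\pi,\beta_0)$ are asymptotically independent, regardless of the rank of $C$, where $C=\partial r(\pi_0,\beta_0)/\partial\beta'$. Let $\psi_r$ be the limiting distribution of $\sqrt{n}\,r(\hat\pi,\beta_0)$. If $C$ has full rank, then $\hat D(\beta_0)\xrightarrow{p}C$ and

$$\sqrt{n}\left[\hat D(\beta_0)'\hat\Omega_{\beta_0}^{-1}\hat D(\beta_0)\right]^{-1/2}\hat D(\beta_0)'\hat\Omega_{\beta_0}^{-1}r(\hat\pi,\beta_0)\xrightarrow{d}\left(C'\Omega_{\beta_0}^{-1}C\right)^{-1/2}C'\Omega_{\beta_0}^{-1}\psi_r.$$

The last term is $N(0,I_m)$. If $C$ is singular, then, as in Kleibergen (2002, 2005), $\sqrt{n}\,\mathrm{vec}[\hat D(\beta_0)]\xrightarrow{d}\psi_D$, where $\psi_D$ is a $qm\times1$ multivariate normal random vector with variance $\Theta_{\beta_0}$. In this case,

$$\sqrt{n}\left[\hat D(\beta_0)'\hat\Omega_{\beta_0}^{-1}\hat D(\beta_0)\right]^{-1/2}\hat D(\beta_0)'\hat\Omega_{\beta_0}^{-1}r(\hat\pi,\beta_0)\xrightarrow{d}\left(\psi_D'\Omega_{\beta_0}^{-1}\psi_D\right)^{-1/2}\psi_D'\Omega_{\beta_0}^{-1}\psi_r.$$

The conditional distribution $\psi_D'\Omega_{\beta_0}^{-1}\psi_r\mid\psi_D$ is multivariate normal with mean zero and variance $\psi_D'\Omega_{\beta_0}^{-1}\psi_D$. Since $\psi_r$ and $\psi_D$ are independent, the marginal and conditional distributions of the standardized statistic are the same. This implies that $(\psi_D'\Omega_{\beta_0}^{-1}\psi_D)^{-1/2}\psi_D'\Omega_{\beta_0}^{-1}\psi_r\equiv N(0,I_m)$, and

$$\sqrt{n}\left[\hat D(\beta_0)'\hat\Omega_{\beta_0}^{-1}\hat D(\beta_0)\right]^{-1/2}\hat D(\beta_0)'\hat\Omega_{\beta_0}^{-1}r(\hat\pi,\beta_0)\xrightarrow{d}N(0,I_m),$$

unconditionally.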
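Derivation of equation (3.7): The derivative of $S_{MD}(\beta)=n\,r(\hat\pi,\beta)'\hat\Omega_\beta^{-1}r(\hat\pi,\beta)$ with respect to $\beta$ satisfies

$$\frac12\frac{\partial S_{MD}(\beta)}{\partial\beta'}=n\,r(\hat\pi,\beta)'\hat\Omega_\beta^{-1}\frac{\partial r(\hat\pi,\beta)}{\partial\beta'}+\frac n2\left(r(\hat\pi,\beta)'\otimes r(\hat\pi,\beta)'\right)\frac{\partial\mathrm{vec}[\hat\Omega_\beta^{-1}]}{\partial\beta'}.$$

The partial derivative of $\hat\Omega_\beta^{-1}$ with respect to $\beta$ is

$$\frac{\partial\mathrm{vec}[\hat\Omega_\beta^{-1}]}{\partial\beta'}=-\left(\hat\Omega_\beta^{-1}\otimes\hat\Omega_\beta^{-1}\right)\frac{\partial\mathrm{vec}\left[\frac{\partial r(\hat\pi,\beta)}{\partial\pi'}\hat\Lambda\frac{\partial r(\hat\pi,\beta)'}{\partial\pi}\right]}{\partial\beta'}$$
$$=-\left(\hat\Omega_\beta^{-1}\otimes\hat\Omega_\beta^{-1}\right)\left\{\left(\frac{\partial r(\hat\pi,\beta)}{\partial\pi'}\hat\Lambda\otimes I_q\right)\frac{\partial\mathrm{vec}\left[\frac{\partial r(\hat\pi,\beta)}{\partial\pi'}\right]}{\partial\beta'}+\left(I_q\otimes\frac{\partial r(\hat\pi,\beta)}{\partial\pi'}\hat\Lambda\right)\frac{\partial\mathrm{vec}\left[\frac{\partial r(\hat\pi,\beta)'}{\partial\pi}\right]}{\partial\beta'}\right\}.\qquad (A.3)$$

Substituting (A.3), the factor $\left(r(\hat\pi,\beta)'\otimes r(\hat\pi,\beta)'\right)\partial\mathrm{vec}[\hat\Omega_\beta^{-1}]/\partial\beta'$ in the second term simplifies to

$$-\left(r(\hat\pi,\beta)'\hat\Omega_\beta^{-1}\frac{\partial r(\hat\pi,\beta)}{\partial\pi'}\hat\Lambda\otimes r(\hat\pi,\beta)'\hat\Omega_\beta^{-1}\right)\frac{\partial\mathrm{vec}\left[\frac{\partial r(\hat\pi,\beta)}{\partial\pi'}\right]}{\partial\beta'}-\left(r(\hat\pi,\beta)'\hat\Omega_\beta^{-1}\otimes r(\hat\pi,\beta)'\hat\Omega_\beta^{-1}\frac{\partial r(\hat\pi,\beta)}{\partial\pi'}\hat\Lambda\right)\frac{\partial\mathrm{vec}\left[\frac{\partial r(\hat\pi,\beta)'}{\partial\pi}\right]}{\partial\beta'}.\qquad (A.4)$$

Using the fact that

$$\frac{\partial\mathrm{vec}\left[\frac{\partial r(\hat\pi,\beta)}{\partial\pi'}\right]}{\partial\beta_j}=\mathrm{vec}\left[\frac{\partial}{\partial\beta_j}\left(\frac{\partial r(\hat\pi,\beta)}{\partial\pi'}\right)\right]\quad\text{and}\quad\frac{\partial\mathrm{vec}\left[\frac{\partial r(\hat\pi,\beta)'}{\partial\pi}\right]}{\partial\beta_j}=\mathrm{vec}\left[\frac{\partial}{\partial\beta_j}\left(\frac{\partial r(\hat\pi,\beta)'}{\partial\pi}\right)\right],$$

the $j$th column of (A.4) is

$$-\,r(\hat\pi,\beta)'\hat\Omega_\beta^{-1}\left[\frac{\partial}{\partial\beta_j}\left(\frac{\partial r(\hat\pi,\beta)}{\partial\pi'}\right)\right]\hat\Lambda\frac{\partial r(\hat\pi,\beta)'}{\partial\pi}\hat\Omega_\beta^{-1}r(\hat\pi,\beta)-r(\hat\pi,\beta)'\hat\Omega_\beta^{-1}\frac{\partial r(\hat\pi,\beta)}{\partial\pi'}\hat\Lambda\left[\frac{\partial}{\partial\beta_j}\left(\frac{\partial r(\hat\pi,\beta)'}{\partial\pi}\right)\right]\hat\Omega_\beta^{-1}r(\hat\pi,\beta).$$

Since both terms are scalars and each is the transpose of the other, the $j$th column of (A.4) equals

$$-2\,r(\hat\pi,\beta)'\hat\Omega_\beta^{-1}\left[\frac{\partial}{\partial\beta_j}\left(\frac{\partial r(\hat\pi,\beta)}{\partial\pi'}\right)\right]\hat\Lambda\frac{\partial r(\hat\pi,\beta)'}{\partial\pi}\hat\Omega_\beta^{-1}r(\hat\pi,\beta),\qquad j=1,\ldots,m,$$

and the first-order condition becomes

$$\frac12\frac{\partial S_{MD}(\beta)}{\partial\beta'}=n\,r(\hat\pi,\beta)'\hat\Omega_\beta^{-1}\hat D(\beta),$$

where $\hat D(\beta)=[\hat D_1(\beta)\;\ldots\;\hat D_m(\beta)]$ and, for $j=1,\ldots,m$,

$$\hat D_j(\beta_0)=\frac{\partial r(\hat\pi,\beta_0)}{\partial\beta_j}-\left[\frac{\partial}{\partial\beta_j}\left(\frac{\partial r(\hat\pi,\beta_0)}{\partial\pi'}\right)\right]\hat\Lambda\frac{\partial r(\hat\pi,\beta_0)'}{\partial\pi}\hat\Omega_{\beta_0}^{-1}r(\hat\pi,\beta_0).$$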
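A.2. Robust tests in linear instrumental variable models

The linear limited information model and its unrestricted reduced form are:

$$\begin{aligned} y&=x\beta+u, & y&=z\delta_z+e,\\ x&=z\pi_z+v, & x&=z\pi_z+v.\end{aligned}\qquad (A.5)$$

The included exogenous regressor $w$ is omitted for clarity of exposition. The OLS estimators of $\delta_z$ and $\pi_z$ are $\hat\delta_z=(z'z)^{-1}z'y$ and $\hat\pi_z=(z'z)^{-1}z'x$, respectively. The link function $(\hat\delta_z-\hat\pi_z\beta_0)$ can be rewritten as $(z'z)^{-1}z'(y-x\beta_0)$. Note that $\partial r(\pi,\beta)/\partial\pi'=[1\;\;-\beta_0']\otimes I_{k_z}$. If the residuals are homoscedastic, the asymptotic variance of $\mathrm{vec}[\hat\delta_z,\hat\pi_z]$ is $\Sigma\otimes\left(\frac{z'z}{n}\right)^{-1}$, where $\Sigma$ is $\mathrm{Var}(e,v)$. By definition, $\Omega_{\beta_0}$ is

$$\left(\begin{bmatrix}1 & -\beta_0'\end{bmatrix}\otimes I_{k_z}\right)\left[\Sigma\otimes\left(\frac{z'z}{n}\right)^{-1}\right]\left(\begin{bmatrix}1\\-\beta_0\end{bmatrix}\otimes I_{k_z}\right).$$

Let $\hat\Sigma=[y\;\;x]'M_z[y\;\;x]/(n-k_z)$ be the estimator of $\Sigma$. After substituting $\hat\Sigma$ into $\Omega_{\beta_0}$, we find that $\hat\Omega_{\beta_0}=\hat\sigma^2_{\beta_0}\left(\frac{z'z}{n}\right)^{-1}$, where $\hat\sigma^2_{\beta_0}=\frac{(y-x\beta_0)'M_z(y-x\beta_0)}{n-k_z}$, and we have $S_{MD}=AR$.

The matrix $\Sigma$ can be partitioned as $\Sigma=[\Sigma_{ee},\Sigma_{ev}:\Sigma_{ve},\Sigma_{vv}]$. Since $\frac{\partial r(\hat\pi,\beta_0)}{\partial\beta'}=-\hat\pi_z$ and $\frac{\partial}{\partial\pi'}\left[\mathrm{vec}\left(\frac{\partial r(\hat\pi,\beta_0)}{\partial\beta'}\right)\right]=-[0\;\;I_{k_zm}]$,

$$\mathrm{vec}[\hat D(\beta)]=-\mathrm{vec}[\hat\pi_z]+\left(\begin{bmatrix}\hat\Sigma_{ve} & \hat\Sigma_{vv}\end{bmatrix}\otimes\left(\frac{z'z}{n}\right)^{-1}\right)\left(\begin{bmatrix}1\\-\beta_0\end{bmatrix}\otimes I_{k_z}\right)\hat\Omega_{\beta_0}^{-1}(\hat\delta_z-\hat\pi_z\beta_0)$$
$$=-\mathrm{vec}[\hat\pi_z]+\mathrm{vec}\left(\left(\frac{z'z}{n}\right)^{-1}\hat\Omega_{\beta_0}^{-1}(\hat\delta_z-\hat\pi_z\beta_0)\begin{bmatrix}1 & -\beta_0'\end{bmatrix}\begin{bmatrix}\hat\Sigma_{ev}\\ \hat\Sigma_{vv}\end{bmatrix}\right).$$

Since $\frac{(y-x\beta_0)'M_zx}{n-k}=[1\;\;-\beta_0']\begin{bmatrix}\hat\Sigma_{ev}\\ \hat\Sigma_{vv}\end{bmatrix}$, we have that $\hat D(\beta)=-\tilde\pi_z(\beta_0)$.

In the case of heteroscedastic residuals, the White estimator of the asymptotic covariance matrix of $\sqrt{n}\left((\hat\delta_z-\delta_z)',\mathrm{vec}(\hat\pi_z-\pi_z)'\right)'$ can be written as

$$\hat\Lambda=\left[I_{m+1}\otimes\left(\frac{z'z}{n}\right)^{-1}\right]\frac1n\left(I_{m+1}\otimes z'\right)\begin{bmatrix}\mathrm{diag}(\hat e_i^2) & \mathrm{diag}[\mathrm{vec}(\hat v_i\hat e_i)]'\\ \mathrm{diag}[\mathrm{vec}(\hat v_i\hat e_i)] & \mathrm{diag}[\mathrm{vec}(\hat v_i\hat v_i')]\end{bmatrix}\left(I_{m+1}\otimes z\right)\left[I_{m+1}\otimes\left(\frac{z'z}{n}\right)^{-1}\right],$$

where $\mathrm{diag}(t_i)$ represents a diagonal matrix whose $i$th element on the main diagonal is $t_i$. Pre-multiplying by $[(1\;\;-\beta_0')\otimes I_{k_z}]$ and post-multiplying by $\left[\begin{pmatrix}1\\-\beta_0\end{pmatrix}\otimes I_{k_z}\right]$ results in $\hat\Omega_{\beta_0}$, which is

$$\hat\Omega_{\beta_0}=\left(\frac{z'z}{n}\right)^{-1}\Big\{n^{-1}z'\Big[\mathrm{diag}(\hat e_i^2)-(\beta_0'\otimes I_n)\mathrm{diag}[\mathrm{vec}(\hat v_i\hat e_i)]-\mathrm{diag}[\mathrm{vec}(\hat v_i\hat e_i)]'(\beta_0\otimes I_n)+(\beta_0'\otimes I_n)\mathrm{diag}[\mathrm{vec}(\hat v_i\hat v_i')](\beta_0\otimes I_n)\Big]z\Big\}\left(\frac{z'z}{n}\right)^{-1},$$

or

$$\hat\Omega_{\beta_0}=\left(\frac{z'z}{n}\right)^{-1}\left\{n^{-1}z'\,\mathrm{diag}\left[(\hat e_i-\hat v_i'\beta_0)^2\right]z\right\}\left(\frac{z'z}{n}\right)^{-1}.$$

Then, the White covariance matrix with the ordinary least-squares estimate of the reduced-form parameters gives the following result:

$$\sqrt{n}(\hat\delta_z-\hat\pi_z\beta_0)'\left[\left(\frac{z'z}{n}\right)^{-1}\left\{n^{-1}z'\,\mathrm{diag}\left[(\hat e_i-\hat v_i'\beta_0)^2\right]z\right\}\left(\frac{z'z}{n}\right)^{-1}\right]^{-1}\sqrt{n}(\hat\delta_z-\hat\pi_z\beta_0)$$
$$=\frac{1}{\sqrt n}(y-x\beta_0)'z\left\{n^{-1}z'\,\mathrm{diag}\left[(\hat u_i(\beta_0))^2\right]z\right\}^{-1}\frac{1}{\sqrt n}z'(y-x\beta_0),$$

where $\hat u(\beta_0)=(\hat e-\hat v\beta_0)=M_z(y-x\beta_0)$. Kleibergen's (2007) S- and K-tests use $u(\beta_0)=y-x\beta_0$ in the last equation for estimating the variance–covariance matrix. The same proof extends to the case of clustered and autocorrelated residuals. Since

$$\frac1n\left\{z'\,\mathrm{diag}\left[(\hat u_i(\beta_0))^2\right]z-z'\,\mathrm{diag}\left[(u_i(\beta_0))^2\right]z\right\}=o_p(1),$$

the S- and the K-tests are asymptotically equivalent to the $S_{MD}$- and the $K_{MD}$-tests, respectively. Chernozhukov and Hansen (2008) demonstrate the asymptotic equivalence between the Wald-S and Wald-K tests and the S and K tests.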
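The equality between the minimum distance S statistic and the Anderson–Rubin statistic in the homoscedastic linear model can be checked numerically. The sketch below is an illustration on simulated data, not part of the paper: it assumes a single endogenous regressor, no included exogenous regressors, and the chi-square form of the Anderson–Rubin statistic (not divided by the number of instruments). The function names and the simulated design are hypothetical.

```python
import numpy as np


def s_md_linear_iv(y, x, z, beta0):
    """S_MD = n * r' Omega^{-1} r with r = delta_hat - pi_hat * beta0."""
    n, kz = z.shape
    ztz_inv = np.linalg.inv(z.T @ z)
    delta_hat = ztz_inv @ z.T @ y              # reduced form of y on z
    pi_hat = ztz_inv @ z.T @ x                 # first stage of x on z
    r = delta_hat - pi_hat * beta0             # link function at beta0
    u0 = y - x * beta0
    resid = u0 - z @ (ztz_inv @ (z.T @ u0))    # M_z (y - x beta0)
    sigma2 = resid @ resid / (n - kz)
    omega = sigma2 * np.linalg.inv(z.T @ z / n)
    return n * r @ np.linalg.solve(omega, r)


def anderson_rubin(y, x, z, beta0):
    """Chi-square form of the Anderson-Rubin statistic."""
    n, kz = z.shape
    u0 = y - x * beta0
    proj = z @ np.linalg.solve(z.T @ z, z.T @ u0)   # P_z (y - x beta0)
    resid = u0 - proj
    return (u0 @ proj) / (resid @ resid / (n - kz))


rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=(n, 3))
v = rng.normal(size=n)
u = 0.5 * v + rng.normal(size=n)                    # endogeneity through v
x = z @ np.array([0.5, 0.2, 0.1]) + v
y = 1.0 * x + u

s_md = s_md_linear_iv(y, x, z, beta0=1.0)
ar = anderson_rubin(y, x, z, beta0=1.0)
print(np.isclose(s_md, ar))
```

The two functions agree up to rounding because the link function equals $(z'z)^{-1}z'(y-x\beta_0)$, so the quadratic form in $r$ collapses to the projection of $y-x\beta_0$ on the instruments.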
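A.3. The algorithm for computing the robust tests using built-in functions

We derived tests for the following class of limited dependent variable models with unrestricted reduced-form representation

$$\begin{aligned} y^*&=x\beta+w\gamma+u, & y^*&=z\delta_z+w\delta_w+v\delta_v+\varepsilon,\\ x&=z\pi_z+w\pi_w+v, & x&=z\pi_z+w\pi_w+v,\end{aligned}$$

where $u=v\alpha+\varepsilon$ and $\delta_v=\alpha+\beta$. Define $\bar z=(z,w)$, $\delta=(\delta_z',\delta_w',\delta_v')'$, $\pi_x=(\pi_z';\pi_w')'$, and $h=z\delta_z+w\delta_w+v\delta_v$. We consider estimators of $\delta$ and $\pi_x$ which have an influence function of the following form: $\sum_{i=1}^ng_i(\delta,\pi_x)=g_n(\delta,\pi_x)=(g_{1,n}(\delta,\pi_x)'\;\;g_{2,n}(\pi_x)')'$, where

$$g_{1,n}(\delta,\pi_x)=\begin{bmatrix}\bar z'\tilde\varepsilon(h)\\ v'\tilde\varepsilon(h)\end{bmatrix}\quad\text{and}\quad g_{2,n}(\pi_x)=\mathrm{vec}(\bar z'v),$$

where $\tilde\varepsilon$ is an $n\times1$ vector of generalized residuals, i.e., $E[\tilde\varepsilon_i\mid v_i,z_i;\delta,\pi_x]=0$ for $i=1,\ldots,n$. This vector is a function of the reduced-form parameters through $h$. This representation accommodates several limited dependent variable models, including the endogenous probit, Tobit and count data models. In the case of maximum or quasi-maximum likelihood, $g_{1,n}(\delta,\pi_x)$ is the score function (see footnote 10).

Define $\sigma_i=\partial\tilde\varepsilon_i/\partial h_i$. The Jacobian of $g_n(\delta,\pi_x)$ has the following form:

$$H_n=\begin{bmatrix}H_{\delta\delta,n} & H_{\delta\pi_x,n}\\ 0 & I_m\otimes(\bar z'\bar z)\end{bmatrix}=\begin{bmatrix}\bar z'\Xi\bar z & \bar z'\Xi v & -\delta_v'\otimes\bar z'\Xi\bar z\\ v'\Xi\bar z & v'\Xi v & -\delta_v'\otimes v'\Xi\bar z\\ 0 & 0 & I_m\otimes(\bar z'\bar z)\end{bmatrix},$$

where $\Xi=\mathrm{diag}(\sigma_i)$. We use that $\frac{\partial g_{1,n}(\delta,\pi_x)}{\partial\mathrm{vec}(\pi_x)'}=[\bar z\;\;v]'\frac{\partial\tilde\varepsilon(h)}{\partial\mathrm{vec}(h)'}\frac{\partial\mathrm{vec}(h)}{\partial\mathrm{vec}(\pi_x)'}$ and $\frac{\partial\mathrm{vec}(h)}{\partial\mathrm{vec}(\pi_x)'}=-\delta_v'\otimes\bar z$ in the derivation of the last block column. Because of the conditional independence we have

$$\frac{1}{\sqrt n}\begin{bmatrix}g_{1,n}(\delta,\pi_x)\\ g_{2,n}(\pi_x)\end{bmatrix}\xrightarrow{d}N\left(\begin{bmatrix}0\\0\end{bmatrix},\begin{bmatrix}G & 0\\ 0 & \Upsilon\end{bmatrix}\right),$$

where $G=E[g_{1,i}(\delta,\pi_x)g_{1,i}(\delta,\pi_x)'\mid z_i,v_i]$ and $\Upsilon=\lim_{n\to+\infty}E\left[\frac1n\sum_{i=1}^n(I_m\otimes\bar z_i')v_iv_i'(I_m\otimes\bar z_i)\right]$. The asymptotic distribution of the reduced-form parameter estimator is

$$\sqrt n\begin{bmatrix}\hat\delta-\delta\\ \mathrm{vec}(\hat\pi_x-\pi_x)\end{bmatrix}\overset{a}{=}-\left(\frac1n\begin{bmatrix}H_{\delta\delta,n} & H_{\delta\pi_x,n}\\ 0 & I_m\otimes(\bar z'\bar z)\end{bmatrix}\right)^{-1}\frac{1}{\sqrt n}\begin{bmatrix}g_{1,n}(\delta,\pi_x)\\ g_{2,n}(\pi_x)\end{bmatrix}.$$

After some simplification, we find that the asymptotic variance is

$$\begin{bmatrix}H_{\delta\delta}^{-1}GH_{\delta\delta}^{-1}+(\delta_v'\otimes I_k)\Lambda_{\pi_x\pi_x}(\delta_v\otimes I_k) & (\delta_v'\otimes I_k)\Lambda_{\pi_x\pi_x}\\ \Lambda_{\pi_x\pi_x}(\delta_v\otimes I_k) & \Lambda_{\pi_x\pi_x}\end{bmatrix},$$

where $n^{-1}H_{\delta\delta,n}\xrightarrow{p}H_{\delta\delta}$, $\Lambda_{\pi_x\pi_x}=(I\otimes Q)^{-1}\Upsilon(I\otimes Q)^{-1}$ and $n^{-1}\bar z'\bar z\xrightarrow{p}Q$. The term $H_{\delta\delta}^{-1}GH_{\delta\delta}^{-1}$ is the variance–covariance of the quasi-maximum likelihood estimator. Pre-multiplying by $[(1\;\;-\beta_0')\otimes(I_{k_z}\;\;0)]$ and post-multiplying by $\left[\begin{pmatrix}1\\-\beta_0\end{pmatrix}\otimes\begin{pmatrix}I_{k_z}\\0\end{pmatrix}\right]$ results in $\Omega_{\beta_0}$, which is

$$\left(H_{\delta\delta}^{-1}GH_{\delta\delta}^{-1}\right)_{\delta_z\delta_z}+\left((\delta_v-\beta_0)'\otimes I_{k_z}\right)\Lambda_{\pi_z\pi_z}\left((\delta_v-\beta_0)\otimes I_{k_z}\right),\qquad (A.6)$$

where $(H_{\delta\delta}^{-1}GH_{\delta\delta}^{-1})_{\delta_z\delta_z}$ is the $k_z\times k_z$ variance of $\hat\delta_z$. In the simulations for the endogenous count data model, the computation of $\hat\Omega_\beta$ is based on equation (A.6). If the likelihood is correctly specified, $H_{\delta\delta}^{-1}GH_{\delta\delta}^{-1}=H_{\delta\delta}^{-1}$ by the information matrix equality. If residuals are homoscedastic, the covariance matrix estimator derived from ordinary least squares can be simplified to $\Lambda_{\pi_x\pi_x}=(\Sigma_{vv}\otimes Q^{-1})$. The matrix $Q$ can be partitioned as $Q=[Q_{zz},Q_{zw}:Q_{wz},Q_{ww}]$. Then, equation (A.6) simplifies to

$$\left(H_{\delta\delta}^{-1}\right)_{\delta_z\delta_z}+(\delta_v-\beta_0)'\Sigma_{vv}(\delta_v-\beta_0)\,Q_{zz.w}^{-1},$$

where $Q_{zz.w}=Q_{zz}-Q_{zw}Q_{ww}^{-1}Q_{wz}$. This is the variance equation used for computing the robust tests for endogenous probit and Tobit models in Finlay and Magnusson (2009).

10 In the Tobit model, $\sigma_\varepsilon^2$, the variance of $\varepsilon$, is a nuisance parameter in the reduced-form model. In this case, $g_{1,n}(\delta,\pi_x)$ is the 'effective' score, which is obtained as the residual of regressing $g_{1,n}(\delta,\pi_x)$ on the score of $\sigma_\varepsilon^2$.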
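The homoscedastic simplification relies on a standard partitioned-inverse fact: the block of the inverse of Q that corresponds to the excluded instruments equals the inverse of Q_zz − Q_zw Q_ww⁻¹ Q_wz. The sketch below checks this numerically and assembles the simplified variance for a single endogenous regressor. It is an illustration only; the function name `omega_beta0_homoskedastic`, its arguments and all numerical inputs are hypothetical and are not taken from the paper or from any software package.

```python
import numpy as np


def omega_beta0_homoskedastic(var_delta_z, delta_v, beta0, sigma_vv, q, kz):
    """Variance of the link function under homoscedasticity, one endogenous
    regressor: var(delta_z) + (delta_v - beta0)^2 * sigma_vv * Q_{zz.w}^{-1}."""
    q_zz, q_zw = q[:kz, :kz], q[:kz, kz:]
    q_wz, q_ww = q[kz:, :kz], q[kz:, kz:]
    q_zz_w = q_zz - q_zw @ np.linalg.solve(q_ww, q_wz)   # Schur complement
    return var_delta_z + (delta_v - beta0) ** 2 * sigma_vv * np.linalg.inv(q_zz_w)


rng = np.random.default_rng(1)
kz, kw = 2, 3
a = rng.normal(size=(50, kz + kw))
q = a.T @ a / 50                      # positive definite stand-in for Q
var_delta_z = 0.5 * np.eye(kz)        # stand-in for the variance of delta_z

omega = omega_beta0_homoskedastic(var_delta_z, delta_v=0.8, beta0=0.3,
                                  sigma_vv=2.0, q=q, kz=kz)

# Same quantity computed from the upper-left block of the full inverse of Q.
upper_left = np.linalg.inv(q)[:kz, :kz]
omega_check = var_delta_z + (0.8 - 0.3) ** 2 * 2.0 * upper_left
print(np.allclose(omega, omega_check))
```

Working with the Schur complement avoids inverting the full matrix Q when only the block for the excluded instruments is needed.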
The Econometrics Journal
Econometrics Journal (2010), volume 13, pp. S80–S98. doi: 10.1111/j.1368-423X.2010.00311.x

Non-parametric estimation of exact consumer surplus with endogeneity in price

ANNE VANHEMS†,‡
†Toulouse Business School, 20, Boulevard Lascrosses, BP 7010, 31068 Toulouse Cedex 7, France.
‡Toulouse School of Economics, 21 Allee de Brienne, 31000 Toulouse, France.
E-mail: [email protected]
First version received: March 2009; final version accepted: January 2010
Summary

This paper analyses a structural microeconomic relation describing the exact consumer surplus in a non-parametric setting with endogenous prices. The exact consumer surplus can be characterized as the solution of a differential equation involving the observed demand function. The strategy put forward in this paper involves two steps: first, estimate the demand function with endogeneity using non-parametric IV; second, plug this estimator into the differential equation to estimate the exact consumer surplus. The rate of convergence for this estimator is derived and is shown to be faster than the rate for the underlying non-parametric IV regression estimator. Solving the differential equation smooths the demand estimator and leads to a faster rate of convergence. The implementation of the methodology is illustrated through a simulation study.

Keywords: Exact consumer surplus, Inverse problem, Non-parametric instrumental regression.
1. INTRODUCTION

This paper addresses the issue of evaluating exact consumer surplus in a non-parametric setting. Consumer surplus is a widely used tool in microeconomics and can be interpreted as a monetary measure of the impact on consumer welfare of a change in the price of a good. It defines the income that would be necessary for the consumer to keep his utility level constant after this price change (Varian, 1992). This quantity was introduced by John Hicks (see Hicks, 1956) and depends on the unobserved Hicksian demand function. Although it could be roughly approximated by integrating the observed Marshallian demand function (Willig, 1976), Hausman (1981) shows that a measure of the exact consumer surplus can be derived from the observed demand curve without any approximation. Consider one consumer, and denote by y his income, by q his demand for the good and by p the price of this single good. Assume that the price changes from p to $p^1$. The exact consumer surplus associated with an income level y, denoted by $S_y$, is characterized by the following relation:

$$S_y'(p)=-q(p,\,y-S_y(p)),\qquad S_y(p^1)=0.\qquad (1.1)$$
The link between Sy and q is given by this non-linear ordinary differential equation of order 1. The initial condition Sy (p1 ) = 0 means that with no variation of price the exact consumer surplus is equal to 0. The approach classically used to estimate and analyse the function Sy involves two steps: first, the estimation of the demand function q; second, the resolution of the differential equation. To be more precise, consider (Q, P , Y ) a continuously distributed random vector defining demand, price and income, and a sample (Qi , Yi , Pi )i=1,...,n of observations. The demand function q can be approximated by the function g defined as the regression of Q given P and Y: Q = g(P , Y ) + U , E(U |P , Y ) = 0. Hausman and Newey (1995) propose a semi-parametric estimation of the demand function with a non-parametric estimation of g, and an additive parametric part including several exogenous variables such as the year of survey and the city/state of the household. They assume that the identification assumption E(U |P , Y ) = 0 is satisfied. In a second step, they plug this demand estimator into the differential equation and solve it numerically. Finally, they analyse its statistical properties (see also Vanhems, 2006, for the asymptotic properties). Our work extends this setting by relaxing the exogeneity assumption E(U |P , Y ) = 0 and considering the case where price can be an endogenous variable. Endogeneity issues occur frequently in economics, for example if an additional variable causes both independent and dependent variables and is not included in the regression model. Consider the example of hourly individual wages explained by the level of education (this example is quoted from Angrist and Krueger, 2001, Hall and Horowitz, 2005). The error term U may include personal unobserved characteristics such as individual ability, that would influence both level of education and wage. 
Another classical example is given by the Engel curve relationship, which describes the expansion path for commodity demands with respect to the household budget. In this setting, the total budget variable is a choice variable in the consumer’s allocation of income and acts as an endogenous regressor (see e.g. Blundell et al., 2007). The price endogeneity issue is also raised in several research articles. Brown and Walker (1989) argue that the hypothesis of random utility maximization implies that the additive error $U$ can depend on $P$ (see also Lewbel, 2001, and Matzkin, 2007), and the error term is interpreted as consumer preference heterogeneity.$^{1}$ In an industrial organization framework, Berry et al. (1995) analyse demand and supply in differentiated product markets and highlight the problem raised by correlation between prices and product characteristics, some of which are observed by the consumer but not by the econometrician. They use the instrumental variables approach to estimate the demand system, and apply their techniques to the analysis of equilibrium in the US automobile industry. Yatchew and No (2001), proposing an analysis of household gasoline demand in Canada, also raise the problem of price endogeneity. In fact, they observe significant variations in prices within a given urban area, with a 5% higher coefficient of variation, which leads them to conclude that this heterogeneity in price variation depends on location and may affect consumers’ choices. The authors suggest instrumenting the observed price variable with the average price over a relatively small geographical area, such as the average inter-city price. In general, in any equilibrium determination of market outcomes, prices and demands will be determined simultaneously.
1 Note however that this literature mainly focuses on heteroscedasticity of $U$.
The purpose of this work is to provide a theoretical analysis of the non-parametric exact consumer surplus estimator under the assumption of price endogeneity, using an instrumental variable approach to identify the structural demand relationship. The instrumental variable approach has also been investigated in many recent econometric studies such as Darolles et al. (2002), Ai and Chen (2003), Newey and Powell (2003), Hall and Horowitz (2005), Blundell et al. (2007), Chen (2007) and Gagliardini and Scaillet (2007), to name but a few. In this paper, we apply the purely non-parametric kernel regression model used in Darolles et al. (2002) or Hall and Horowitz (2005). The regression estimators proposed in the two papers are similar and we finally adopted the methodology developed in Hall and Horowitz (2005) in order to stick to the consumer surplus illustration with one common variable $Y$ in the regressors and in the instruments.$^{2}$ To implement the instrumental variable approach we introduce some continuously distributed random variable $W$, called an instrument, such that $E(U|Y, W) = 0$. The underlying function $g$ is then defined through a second equation:
\[ E(Q - g(P, Y)|Y, W) = 0. \tag{1.2} \]
As pointed out by recent econometric analysis of non-parametric instrumental regression, the study of g defined by (1.2) is a difficult ill-posed inverse problem that cannot be solved using standard tools, and equation (1.2) needs to be stabilized before estimation (see Engl et al., 2000, for a general overview of ill-posed inverse problems and regularization methods). Both steps of stabilization and estimation are discussed in detail in the body of the paper. A major property we find is that the rate of convergence of the estimated exact consumer surplus is improved, compared to the rate of estimated demand function. Solving the differential equation smooths the demand estimator and leads to a faster rate of convergence. This smoothing effect is consistent with the results obtained in the exogenous case (see Vanhems, 2006, for more details) and is completely driven by the resolution of the differential equation. The paper proceeds in the following way. In the next section, we set out the notations, give the main equations to be solved and establish the link with inverse problems theory. We then present our non-parametric estimator and recall the theoretical properties of each inverse problem (equations (1.1) and (1.2)). In Section 4, we study the asymptotic behaviour of our estimator and conclude the analysis with some simulations.
2. MODEL SPECIFICATION In this section, we set out the notation and link our model with inverse problem theory.
2 Note that other identification methods could have been used such as control function approach (see e.g. Newey et al., 1999, Blundell and Powell, 2003, or Newey and Imbens, 2009, for a non-parametric setting).
2.1. The linear equation model The objective of this part is to set the econometric model defining the demand function $q$. We follow the modelling of Hall and Horowitz (2005). Consider $(Q, P, Y, W, U)$ a continuously distributed random vector with all scalar random variables (to simplify the notations and fit with the microeconomic illustration). $P$ and $Y$ are endogenous and exogenous explanatory variables, respectively, and $W$ is the instrument. We assume that $P$, $Y$ and $W$ are supported on $[0, 1]$.$^{3}$ Let $(Q_i, P_i, Y_i, W_i, U_i)$, for $1 \le i \le n$, be observed data independently and identically distributed as $(Q, P, Y, W, U)$. Let $f_{PYW}$ denote the joint density of $(P, Y, W)$, and $f_Y$ the density of $Y$. Following Hall and Horowitz (2005) notations, we define for each $y \in [0, 1]$,
\[ t_y(p_1, p_2) = \int f_{PYW}(p_1, y, w)\, f_{PYW}(p_2, y, w)\, dw \]
and the operator $T_y$ on $L^2[0, 1]$ by
\[ (T_y \psi)(p, y) = \int t_y(\xi, p)\, \psi(\xi, y)\, d\xi. \]
The solution $g$ of equation (1.2) satisfies:
\[ (T_y g)(p, y) = f_Y(y)\, E_{W|Y}\{E(Q|Y = y, W)\, f_{PYW}(p, y, W) \,|\, Y = y\}, \tag{2.1} \]
where $E_{W|Y}$ denotes the expectation operator with respect to the distribution of $W$ conditional on $Y$. Then, for each $y$ for which $T_y^{-1}$ exists, it may be proved that
\[ g(p, y) = f_Y(y)\, E_{W|Y}\{E(Q|Y = y, W)\, (T_y^{-1} f_{PYW})(p, y, W) \,|\, Y = y\}. \]
2.2. The non-linear equation model Consider a price value $p^1 \in\, ]0, 1[$.$^{4}$ Our functional parameter of interest $S_y$ is the solution of the differential equation (1.1) depending on the demand function $q$. When $q$ is replaced by the approximation function $g$, the differential equation to solve is rewritten:
\[ S_y'(p) = -g(p, y - S_y(p)), \qquad S_y(p^1) = 0, \tag{2.2} \]
or equivalently:
\[ S_y(p) = \int_p^{p^1} g(t, y - S_y(t))\, dt. \tag{2.3} \]
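The equivalence between the differential form (2.2) and the integral form (2.3) follows from integrating between $p$ and the reference price. As a brief sketch, assuming only that $g$ is continuous so that the fundamental theorem of calculus applies:

```latex
\begin{align*}
S_y(p^1) - S_y(p) &= \int_p^{p^1} S_y'(t)\,dt
                   = -\int_p^{p^1} g\bigl(t,\, y - S_y(t)\bigr)\,dt, \\
S_y(p) &= \int_p^{p^1} g\bigl(t,\, y - S_y(t)\bigr)\,dt
         \qquad \text{since } S_y(p^1) = 0 .
\end{align*}
```

Conversely, differentiating the integral form with respect to its lower limit $p$ returns the differential equation, and evaluating it at the reference price returns the initial condition.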
The definition of Sy involves the function g which depends on the distribution of (Q, P , Y , W ). Under standard regularity assumptions on the function g, there exists a unique local solution to (2.2). The analysis of these two problems (2.1) and (2.2) is closely linked to inverse problem theory and we recall below the characteristics of each of them.
3 This assumption is directly taken from Hall and Horowitz (2005) and is not a restrictive one, as they argue in their article, p. 2908. Moreover, in our case, we are interested in solutions of differential equations which are by construction uniquely defined in a neighbourhood of the initial condition $S_y(p^1) = 0$, which will restrict the support of the functions and random variables.
4 We fix a price value $p^1$ in the interior of $]0, 1[$ so that a neighbourhood of $p^1$, on which $S_y$ is defined, can also be included in $[0, 1]$.
2.3. Link with inverse problems theory The methodology used to study $S_y$ proceeds in two steps, solving successively the two equations (2.1) and (2.2). As we will see below, they have different regularity properties that affect both the way to solve them and the properties of their solutions. Consider first the relation (2.2). The function $S_y$ is defined implicitly as the solution of this non-linear differential equation, which can be considered as an inverse problem to solve. The standard issue is to check whether or not the inverse problem is well-posed, that is, whether there exists a unique stable solution to (2.2) (see Tikhonov and Arsenin, 1977, Kress, 1999, or Engl et al., 2000, for a general definition). This relation is characterized by the differential operator $A_y$ defined by $A_y(g, S_y) = S_y' + g(\cdot, y - S_y)$, and solving (2.2) is equivalent to inverting this operator under the initial condition $S_y(p^1) = 0$. Although the operator is non-linear, it can be proved (see Vanhems, 2006, for details) that this inverse problem is in fact well-posed and defines a unique local stable solution. Under regularity assumptions on $g$ (recalled in the next section), there exists a unique solution $S_y(p) = \Phi_y[g](p)$, where $\Phi_y$ is continuous with respect to $g$. Consider now the first relation (2.1). This second inverse problem, which implicitly defines the parameter of interest $g$, requires inverting the linear integral operator $T_y$. As recalled in the introduction, this model is the foundation of many studies, and it was proved (see e.g. Tikhonov and Arsenin, 1977, or Kress, 1999) that even when the probability distribution of $(P, Y, W)$ is known, the calculation of a solution $g$ from equation (2.1) is an ill-posed inverse problem. In particular, the solution is not stable and a regularization step is required to solve the problem. In our case, as in the problems studied by Darolles et al. (2002), Hall and Horowitz (2005), Carrasco et al. (2007) or Johannes et al. (2010), $f_{PYW}$ is unknown and has to be estimated from a sample of $(P, Y, W)$. The way to proceed is the following: first, equation (2.1) is stabilized using a standard regularization method (recalled in the next section); second, the operator $T_y$ is replaced by an estimator and the estimated stabilized equation is solved. Under regularity assumptions on the function $g$ and the operator $T_y$, there exists a unique regularized solution $\hat g$. The purpose of the next section is to recall separately the estimation procedure for the two equations (2.1) and (2.2) as well as the theoretical properties of their estimated solutions. Both inverses will then be combined in Section 4.
REMARK 2.1. A potentially better way to proceed would have been to study the parameter of interest $S_y$ directly in one step and invert one operator instead of two. However, this one-step approach raises several issues. First, contrary to the operator $A_y$, $T_y$ depends on the law of the data and has to be estimated (which we do in a first step). Second, as we will see in the next section, it is possible to write an explicit solution to the linear inverse problem, whereas this turns out to be impossible for the non-linear one, for which only a numerical approximation is available. For these two reasons, we decided not to treat our model as a single inverse problem.
3. ESTIMATION AND IDENTIFICATION In this section, we present the non-parametric methodology used as well as the issues of identification and overidentification for both inverse problems separately. We briefly recall the results in Hall and Horowitz (2005) and Vanhems (2006) that will be necessary to prove the asymptotic properties of the final estimated functional parameter Sy.
3.1. Estimation of consumer demand We first consider the non-parametric instrumental regression defined in equation (2.1) and present the methodology developed in Hall and Horowitz (2005). As recalled in the previous section, solving the relation (2.1) generates a linear ill-posed inverse problem, which implies that a consistent estimator of $g$ is not found by a simple inversion of the estimated operator $\hat T_y$. For the purpose of estimation, we need to replace the inverse of $\hat T_y$ by a regularized version. In what follows, we use the well-known Tikhonov regularization and replace $\hat T_y^{-1}$ by $(\hat T_y + aI)^{-1} = \hat T_y^{+}$, where $I$ is the identity operator and $a > 0$ (see Engl et al., 2000, for an overview of the main regularization methods).
3.1.1. Estimation. The function $g$ is estimated using kernel estimation. Consider $K$ a kernel function of one dimension, centred and separable, $h > 0$ the bandwidth parameter and $K_h(u) = (1/h)K(u/h)$. In order to get rid of edge effects, following Hall and Horowitz (2005), we can introduce some generalized kernel function $K_h(\cdot, \cdot)$ such that if $t$ is not close to either 0 or 1 then $K_h(u, t) = K_h(u)$. In what follows, in order to simplify the formulas and notations, we simply denote it by $K_h(u)$. To construct an estimator of $g(p, y)$, let $h_p, h_y > 0$ be two bandwidth parameters and define:
\begin{align*}
\hat f_{PYW}(p, y, w) &= \frac{1}{n} \sum_{i=1}^{n} K_{h_p}(p - P_i)\, K_{h_y}(y - Y_i)\, K_{h_p}(w - W_i), \\
\hat f^{(-i)}_{PYW}(p, y, w) &= \frac{1}{n-1} \sum_{j=1,\, j \ne i}^{n} K_{h_p}(p - P_j)\, K_{h_y}(y - Y_j)\, K_{h_p}(w - W_j), \\
\hat t_y(p_1, p_2) &= \int \hat f_{PYW}(p_1, y, w)\, \hat f_{PYW}(p_2, y, w)\, dw, \\
(\hat T_y \psi)(p, y) &= \int \hat t_y(\xi, p)\, \psi(\xi, y)\, d\xi.
\end{align*}
The non-parametric estimator of $g(p, y)$ is then defined by:
\[ \hat g(p, y) = \frac{1}{n} \sum_{i=1}^{n} \bigl(\hat T_y^{+} \hat f^{(-i)}_{PYW}\bigr)(p, y, W_i)\, Q_i\, K_{h_y}(y - Y_i). \tag{3.1} \]
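To see why a regularized inverse of the form $(T + aI)^{-1}$ is needed, the following sketch discretizes a hypothetical compact operator with polynomially decaying eigenvalues and compares naive inversion with Tikhonov inversion when the right-hand side is slightly perturbed. It is not the estimator of Hall and Horowitz (2005): the cosine basis, the decay rates, the noise level and the value of `a` are all illustrative assumptions.

```python
import numpy as np

# Midpoint grid on [0, 1]; m is an illustrative discretization size.
m = 50
x = (np.arange(m) + 0.5) / m

# Cosine basis evaluated on the grid: Phi[j, k] = phi_j(x_k).
# On this grid the basis is orthonormal for the inner product mean(u * v).
j = np.arange(m)
Phi = np.sqrt(2.0) * np.cos(np.pi * np.outer(j, x))
Phi[0, :] = 1.0

# Assumed polynomial decay of the eigenvalues and of the coefficients of g.
lam = (j + 1.0) ** -4.0
b = (j + 1.0) ** -2.0

# Discretized operator (T psi)(p) ~ (1/m) * sum_k t(x_k, p) psi(x_k),
# with kernel t(xi, p) = sum_j lam_j phi_j(xi) phi_j(p).
T = Phi.T @ (lam[:, None] * Phi) / m
g_true = b @ Phi
r = T @ g_true

# Perturb the right-hand side to mimic estimation error (hypothetical noise level).
rng = np.random.default_rng(0)
r_noisy = r + 1e-4 * rng.standard_normal(m)


def l2_error(g_hat):
    """Discrete L2 distance between a candidate solution and the true function."""
    return float(np.sqrt(np.mean((g_hat - g_true) ** 2)))


# Naive inversion amplifies the perturbation through the small eigenvalues.
err_naive = l2_error(np.linalg.solve(T, r_noisy))

# Tikhonov inversion: solve (T + a I) g = r_noisy with a > 0.
a = 1e-2
err_tikhonov = l2_error(np.linalg.solve(T + a * np.eye(m), r_noisy))

print(f"naive error: {err_naive:.3g}, Tikhonov error: {err_tikhonov:.3g}")
```

The regularization parameter trades bias (which grows with `a`) against the noise amplification of the small eigenvalues (which shrinks with `a`); this is the trade-off that drives the choice of $a$ in the asymptotic theory.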
3.1.2. Theoretical properties. In order to derive rates of convergence for $\hat g(p, y)$ it is necessary to impose regularity conditions on the operator $T_y$. By construction $T_y$ is linear and we assume that for each $y \in [0, 1]$, $T_y$ is a compact operator. Compactness is a standard and often used regularity assumption for integral operators; in particular, it allows us to define a discrete spectrum. We denote by $\{\phi_{y1}, \phi_{y2}, \ldots\}$ the orthonormalized sequence of eigenvectors and by $\lambda_{y1} \ge \lambda_{y2} \ge \cdots > 0$ the respective eigenvalues of $T_y$. Assume that $\{\phi_{yj}\}$ forms an orthonormal basis of $L^2[0, 1]$ and consider the following decompositions on this orthonormal basis:
\begin{equation}
\begin{cases}
t_y(p_1, p_2) = \displaystyle\sum_{j=1}^{\infty} \lambda_{yj}\, \phi_{yj}(p_1)\, \phi_{yj}(p_2), \\[2mm]
f_{PYW}(p, y, w) = \displaystyle\sum_{j=1}^{\infty} \sum_{k=1}^{\infty} d_{yjk}\, \phi_{yj}(p)\, \phi_{yk}(w), \\[2mm]
g(p, y) = \displaystyle\sum_{j=1}^{\infty} b_{yj}\, \phi_{yj}(p).
\end{cases}
\tag{3.2}
\end{equation}
Under regularity conditions on the density $f_{PYW}$ and the kernel $K$ ($f_{PYW}$ has $r$ continuous derivatives and $K$ is of order $r$), on the function $g(p, y)$, and on the rate of decay of the coefficients $b_{yj}$, $\lambda_{yj}$ and $d_{yjk}$ depending on constants $\alpha$ and $\beta$, it is proved in Hall and Horowitz (2005) that $\hat g(p, y)$ converges to $g(p, y)$ in mean square at the rate $n^{-\tau \frac{2\beta - 1}{2\beta + \alpha}}$ with $\tau = \frac{2r}{2r+1}$. In particular, the constants $\alpha$ and $\beta$ are defined such that, for all $j$, $|b_{yj}| \le C j^{-\beta}$, $j^{-\alpha} \le C \lambda_{yj}$ and $\sum_{k \ge 1} |d_{yjk}| \le C j^{-\alpha/2}$, $C > 0$, uniformly in $y \in [0, 1]$.
3.2. Estimation of exact consumer surplus Consider now the second non-linear inverse problem defined by equation (2.2). The estimated exact consumer surplus $\hat S_y(p)$ is defined as the solution of the estimated system:
\[ \hat S_y'(p) = -\hat g(p, y - \hat S_y(p)), \qquad \hat S_y(p^1) = 0. \tag{3.3} \]
3.2.1. Estimation. The Cauchy–Lipschitz theorem states that under some regularity assumptions on $g$, for each $y \in\, ]0, 1[$, there exists a unique solution $S_y$ defined in a neighbourhood of the initial condition $(p^1, 0)$.$^{5}$ Again, under regularity conditions on $\hat g$, following the Cauchy–Lipschitz theorem, there exists a unique solution $\hat S_y$ defined on a neighbourhood of the initial condition $(p^1, 0)$.$^{6}$ The estimated solution $\hat S_y$ can be approximated using numerical implementation. Various classical algorithms can be used such as the Euler–Cauchy algorithm, Heun’s method or the Runge–Kutta method (see Ascher and Petzold, 1998, or Collatz, 1960, for a general overview of these numerical methods). As an illustration, Hausman and Newey (1995) use a Bulirsch–Stoer algorithm from Numerical Recipes and Vartia (1983) details the polygon method. Let us briefly recall the general methodology. Consider a grid of equidistant points $p_1, \ldots, p_n$, where $p_{i+1} = p_i + h$ and $p_1 = p^1$. The differential equation (2.2) is transformed into a discretized version, where $\hat g_h$ is an approximation of $\hat g$:
\[ \hat S_{y(i+1)} = \hat S_{yi} - h\, \hat g_h(p_i, y - \hat S_{yi}), \qquad \hat S_{y0} = 0. \tag{3.4} \]
5 We fix $y$ in the interior of $[0, 1]$ for convenience, to make sure that $y - S_y(p)$ still belongs to $[0, 1]$.
6 See e.g. Coddington and Levinson (1955) for a general presentation of the Cauchy–Lipschitz theorem and Vanhems (2006) for an application in econometrics.
In the particular case of the Euler algorithm, $\hat g_h = \hat g$. These numerical algorithms converge faster than the non-parametric estimators and hence the numerical approximation of $\hat S_y$ does not affect the theoretical properties of the estimator (as detailed in Vanhems, 2006).
3.2.2. Theoretical properties. Existence and uniqueness of both solutions $S_y$ and $\hat S_y$ is proved under Cauchy–Lipschitz assumptions imposed on both functions $g$ and $\hat g$. Consider a fixed income value $y \in\, ]0, 1[$. Denote $I = [p^1 - \epsilon_1, p^1 + \epsilon_1]$, for $\epsilon_1 > 0$, a closed neighbourhood of $p^1$, $J = [y - \epsilon_2, y + \epsilon_2]$ with $\epsilon_2 > 0$ and $D_y = I \times J$. The regularity conditions required to prove existence and uniqueness for $S_y$ are the following:
ASSUMPTION 3.1. $\max_{(p, \tilde y) \in D_y} |g(p, \tilde y)| < \epsilon_2 / \epsilon_1$.
ASSUMPTION 3.2. $|g(p, y_2) - g(p, y_1)| \le k |y_2 - y_1|$, $\forall (p, y_i) \in D_y$, such that $c = k \epsilon_1 < 1$.
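As a worked illustration of Assumptions 3.1 and 3.2, the sketch below checks both conditions numerically on a rectangle around a reference point. The demand function, the reference values `p_ref` and `y`, and the half-widths `eps1` and `eps2` (half-lengths of the intervals I and J) are hypothetical choices made only for this example.

```python
import numpy as np


def g(p, y):
    """Illustrative demand function (an assumption of this example)."""
    return 0.2 * y / (p + 0.1)


# Hypothetical reference price, income and half-widths of the intervals I and J.
p_ref, y = 0.5, 0.5
eps1, eps2 = 0.2, 0.2

# Grid over the rectangle D_y = I x J.
p_grid = np.linspace(p_ref - eps1, p_ref + eps1, 201)
y_grid = np.linspace(y - eps2, y + eps2, 201)
P, Y = np.meshgrid(p_grid, y_grid, indexing="ij")

# Assumption 3.1: the maximum of |g| on D_y is below eps2 / eps1.
max_abs_g = float(np.max(np.abs(g(P, Y))))
bound_ok = max_abs_g < eps2 / eps1

# Assumption 3.2: Lipschitz constant k in the income argument, with c = k * eps1 < 1.
# The income derivative is approximated by finite differences along the income axis.
dg_dy = np.diff(g(P, Y), axis=1) / np.diff(y_grid)
k = float(np.max(np.abs(dg_dy)))
c = k * eps1
contraction_ok = c < 1.0

print(f"max|g| = {max_abs_g:.3f}, k = {k:.3f}, c = {c:.3f}")
```

For a demand function that is continuously differentiable in income, the Lipschitz constant is simply the largest absolute income derivative on the rectangle, so shrinking `eps1` is always enough to obtain `c < 1`.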
Note that the important condition to prove existence and uniqueness of a solution for the differential equation (2.2) is the second one. Indeed, Assumption 3.1 is just imposed by the local definition of our solution on $I$ and the Cauchy–Lipschitz theorem proves existence and uniqueness of a solution defined on $I$. In particular, if the function $g$ is assumed to be continuous, this assumption is very easily checked.$^{7}$ Assumption 3.2 imposes $g$ to be continuous on $D_y$ and to satisfy the Lipschitz condition. A sufficient condition on $g$ to satisfy this assumption is to be continuously differentiable of order 1 on $D_y$. In the next section, in order to derive rates of convergence for the estimated solution $\hat S_y$, we impose this last stronger condition. Let us turn now to the existence and uniqueness of a solution $\hat S_y$. Indeed, we study the exact consumer surplus in a two-step procedure and we also have to take into account the estimated differential equation (3.3). We use again the Cauchy–Lipschitz theorem and introduce the parameters $\epsilon_{1n}$ and $\epsilon_{2n}$, define the neighbourhoods $I_n$ and $D_{yn}$ such that $\hat g$ satisfies the two following assumptions:
ASSUMPTION 3.1′. $\max_{(p, \tilde y) \in D_{yn}} |\hat g(p, \tilde y)| < \epsilon_{2n} / \epsilon_{1n}$.
ASSUMPTION 3.2′. $|\hat g(p, y_2) - \hat g(p, y_1)| \le k_n |y_2 - y_1|$, $\forall (p, y_i) \in D_{yn}$, such that $c_n = k_n \epsilon_{1n} < 1$.
Again, in order to derive rates of convergence in the next section, we will transform these conditions into regularity conditions on the kernel function used to construct $\hat g$. At last, in order to define both solutions $S_y$ and $\hat S_y$ on the same neighbourhood $D_y$, we need an additional assumption of convergence of the Lipschitz factor $k_n$ to $k$. In other words, under the condition that $\frac{\partial}{\partial e_2} \hat g$ (i.e. the derivative of $\hat g$ with respect to the second variable) converges uniformly to $\frac{\partial}{\partial e_2} g$, both solutions can be defined on a common subset $I$ and the inverse problem is stable and well-posed (see Vanhems, 2006, for more details). The main issue of this differential inverse problem is its non-linearity and the next step to derive rates of convergence is to linearize the relation between $S_y$ and $g$. The methodology used to transform the non-linear equation into a linear problem is closely related to the functional delta method and is similar to Hausman and Newey (1995) and Vanhems (2006). Then, under the assumptions of existence, uniqueness and stability for $S_y$ and $\hat S_y$, it can be proved that:
\[ \forall p \in I, \qquad \hat S_y(p) - S_y(p) = I(p, y) + R_n(p, y), \tag{3.5} \]
where $R_n(p, y) = o_P(\|\hat g - g\|)$ is the residual term and the counterpart in the Taylor expansion. Under assumptions on the estimated function $\hat g$, this term converges to zero in probability and will be neglected in the asymptotics. The first term $I(p, y)$ is linear in $\hat g - g$ and has an explicit form that will be detailed in the next section. Note that all the asymptotic results will be given using the $L^2$ norm, which will be written $\|\cdot\|$. In particular, $\|\hat g - g\|^2 = \int_{D_y} (\hat g - g)^2(a, b)\, da\, db$. If other norms are used it will be clearly specified.
7 From a practical point of view, it could be interesting to check if the solution can be extended to a larger interval to take into account larger price variations. Under the same assumptions, it can be proved that a unique maximal solution exists, which can be constructed by piecing together local solutions if the intersection of their definition intervals is not empty.
4. ASYMPTOTIC BEHAVIOUR OF THE ESTIMATED SOLUTION The objective of this section is to combine both inverse problems and derive the asymptotic behaviour of the solution of the differential equation obtained after estimating the regression function observed in an endogenous setting. We use the delta method to transform the non-linear differential equation into a linear relation, up to the residual term. We show that, under assumptions detailed below, we are able to control the residual term and derive the rate of convergence for the leading linear term.
4.1. Assumptions In order to prove theoretical properties on the estimated exact consumer surplus $\hat S_y$, we need to impose a set of regularity conditions. These assumptions are derived from the analysis of each inverse problem (estimation of consumer demand and estimation of exact consumer surplus) and are adapted from Hall and Horowitz (2005) and Vanhems (2006). The regularity conditions on $g$ and $\hat g$ discussed in Section 3.2.2 are given in Assumptions 4.1, 4.5 and 4.7. Assumption 4.1 is equivalent to equation (1.2). Assumptions 4.2, 4.3 and 4.6 imply that $T_y$ is a compact operator; Assumption 4.4 describes the sizes of the tuning parameters. Moreover, we also introduce the generalized Fourier decomposition for the following function:
\[ m_y(p, t) = 1_{[p^1, p]}(t) \cdot \exp\left[ \int_p^t \frac{\partial}{\partial e_2} g(u, y - S_y(u))\, du \right] = \sum_{j=1}^{\infty} \sum_{k=1}^{\infty} c_{yjk}\, \phi_{yj}(p)\, \phi_{yk}(t), \]
with specific assumptions on the rate of decay of the coefficients $c_{yjk}$ given in Assumption 4.3.$^{8}$ All the required assumptions are summarized below:
ASSUMPTION 4.1. The data $(Q_i, P_i, Y_i, W_i)$ are independent and identically distributed as $(Q, P, Y, W)$, where $P, Y, W$ are supported on $[0, 1]$ and $E(Q - g(P, Y)|W, Y) = 0$.
ASSUMPTION 4.2. The distribution of $(P, Y, W)$ has a density $f_{PYW}$ with $r \ge 2$ derivatives, each derivative bounded in absolute value by $C > 0$, uniformly in $p$ and $y$. The functions $E(Q^2|Y = y, W = w)$ and $E(Q^2|P = p, Y = y, W = w)$ are bounded uniformly by $C$ and $E(Q^2) < +\infty$. The function $g$ is continuously differentiable of order 1 on $[0, 1]^2$.
ASSUMPTION 4.3. The constants $\alpha, \beta, \nu$ satisfy $\beta > 1/2$, $\nu > 1/2$, $\alpha > 1$ and $\max(\beta + \nu - 1/2;\, 2\nu - 1) < \alpha < \min(2\nu;\, 2\beta;\, \beta + \nu)$. Moreover, $|b_{yj}| \le C j^{-\beta}$, $j^{-\alpha} \le C \lambda_{yj}$, $\sum_{k \ge 1} |d_{yjk}| \le C j^{-\alpha/2}$ and $\sum_{k \ge 1} |c_{yjk}| \le C j^{-2\nu}$ uniformly in $y$, for all $j \ge 1$.
ASSUMPTION 4.4. The parameters $a, h_p, h_y$ satisfy $a \asymp n^{-\alpha\tau/(2\beta+\alpha)}$, $h \asymp n^{-1/(2r+1)}$ for $h \in \{h_p, h_y\}$ as $n$ goes to infinity, where $\tau = 2r/(2r + 1)$.
ASSUMPTION 4.5. The kernel function $K$ is a bounded and Lebesgue integrable function defined on $[0, 1]$. $\int K(u)\, du = 1$ and $K$ is of order $r \ge 2$. Moreover, $K$ is continuously differentiable of order $r$ with derivatives in $L^2([0, 1])$.
ASSUMPTION 4.6. For each $y \in [0, 1]$, the functions $\phi_{yj}$ form an orthonormal basis for $L^2[0, 1]$ and $\sup_p \sup_y \max_j |\phi_{yj}(p)| < \infty$.
ASSUMPTION 4.7. $\forall y \in [0, 1]$, $\sup_{D_y} \left| \frac{\partial}{\partial e_2} \hat g(p, \tilde y) - \frac{\partial}{\partial e_2} g(p, \tilde y) \right|$ converges in probability to 0.
8 The notation of $m_y(p, t)$ with $y$ as a subscript is arbitrary, in order to follow the initial notation of the operator $T_y$. We could as well have written $m(p, t, y)$.
REMARK 4.1. (i) In order to estimate the demand function $g$, a standard kernel function $K$ has been introduced in Assumption 4.5. As recalled in Section 3.1.1 (see also Hall and Horowitz, 2005), in order to prevent edge effects, a generalized kernel function or ‘boundary kernel’ has to be used. It corrects in particular for the bad behaviour of the non-parametric estimator around 0 or 1. However, to simplify the expansions in the proofs, we simply use the notation $K$. (ii) Assumption 4.3 specifies a polynomial rate of decay for the coefficients $b_{yj}$, $c_{yjk}$, $d_{yjk}$ and $\lambda_{yj}$. However, other rates of decay could be used, such as an exponential rate, which would lead to different rates of convergence for the non-parametric estimator (see Johannes et al., 2010, for a general overview).
4.2. Theoretical properties Consider Assumptions 4.1 to 4.7. Then we can prove the following results.
THEOREM 4.1. For each $y \in\, ]0, 1[$, there exist unique solutions $S_y$ and $\hat S_y$ defined on a common neighbourhood $I$ of $p^1$.
This first result proves that both solutions $S_y$ and $\hat S_y$ exist and are defined in the same neighbourhood $I$. It implies that the estimated solution $\hat S_y$ is stable and will converge to $S_y$ as soon as $\hat g$ converges to $g$. In order to derive rates of convergence, we now need to linearize the differential equation.
THEOREM 4.2. (i) Linear decomposition. Consider $y \in\, ]0, 1[$. For any $p \in I$,
\begin{align}
\hat S_y(p) - S_y(p) &= -\int (\hat g - g)(t, y - S_y(t)) \cdot m_y(p, t)\, dt + R_n(p, y) \tag{4.1} \\
&= I(p, y) + R_n(p, y), \tag{4.2}
\end{align}
with $R_n(p, y)$ the residual term introduced in equation (3.5), which converges to zero. (ii) Convergence in mean square. Under the additional property:
\[ \sup_{y \in ]0,1[} \int E\{I(p, y)\}^2\, dp \le \sup_{y \in [0,1]} \int E\left\{ \int (\hat g - g)(t, y)\, m_y(p, t)\, dt \right\}^2 dp, \tag{4.3} \]
we can prove that:
\[ \sup_{y \in [0,1]} E\bigl( \|I(\cdot, y)\|^2 \bigr) = O\left( n^{-\tau \frac{2(\beta + \nu) - 1}{2\beta + \alpha}} \right). \tag{4.4} \]
We give below some comments on this rate of convergence and the condition (4.3) required to derive it.
REMARK 4.1. Note first that the rate of convergence obtained here is faster than the rate given in Hall and Horowitz (2005). This finding is consistent with the conclusions in Vanhems (2006): solving the differential equation improves the regularity of the initial estimator $\hat g$ and the rate of convergence for $\hat S_y$ is expected to be faster. Moreover, compared to the Hall and Horowitz (2005) result, an additional parameter $\nu$ appears in the rate of convergence. In fact, the linear term $I(p, y)$ can be rewritten using the scalar product in $L^2[0, 1]$: $I(p, y) = \langle (\hat g - g)(\cdot, y);\, m_y(p, \cdot) \rangle$, and our objective is then to analyse the scalar product of the estimator $\hat g$ with a smooth function (instead of the function $\hat g$ itself, as in Hall and Horowitz, 2005). Our rate of convergence will depend on the smoothness of the function $m_y(p, \cdot)$, characterized by the parameter $\nu$. This parameter captures the regularity induced by solving the differential equation. That explains why the rate of convergence of $\hat S_y$ is faster than $n^{-\tau \frac{2\beta - 1}{2\beta + \alpha}}$, the rate of convergence for $\hat g$ obtained by Hall and Horowitz (2005).
REMARK 4.2. In order to derive the rate of convergence in Theorem 4.2, we need an additional condition, given by the inequality (4.3). This condition is not restrictive as the income value $y$ is initially fixed in $]0, 1[$. Since $S_y$ takes values in a neighbourhood of 0 and $\hat g$ and $g$ are continuous functions on $[0, 1]^2$, we can conclude that $y - S_y(p)$ also varies in $[0, 1]$, which proves equation (4.3). From an economic point of view, it acts as if the compensated income were finally neglected in the surplus equation, as it is in the definition of the observed consumer surplus, when the demand function is integrated over price with fixed income.
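To make the comparison of the two rates concrete, the short sketch below evaluates both exponents: $\tau(2\beta - 1)/(2\beta + \alpha)$ for the demand estimator and $\tau(2(\beta + \nu) - 1)/(2\beta + \alpha)$ for the linear term of the surplus estimator, with $\tau = 2r/(2r + 1)$. The function name and the parameter values are hypothetical; the values are chosen only so that the inequalities of Assumption 4.3 hold.

```python
def rate_exponents(r, alpha, beta, nu):
    """Return the mean-square rate exponents (demand, surplus) as positive numbers."""
    # Constraints on (alpha, beta, nu) as stated in Assumption 4.3.
    lower = max(beta + nu - 0.5, 2 * nu - 1)
    upper = min(2 * nu, 2 * beta, beta + nu)
    if not (beta > 0.5 and nu > 0.5 and alpha > 1 and lower < alpha < upper):
        raise ValueError("parameters violate Assumption 4.3")
    tau = 2 * r / (2 * r + 1)
    demand = tau * (2 * beta - 1) / (2 * beta + alpha)
    surplus = tau * (2 * (beta + nu) - 1) / (2 * beta + alpha)
    return demand, surplus


# Hypothetical values: r = 2, alpha = 1.75, beta = 1, nu = 1.
demand_exp, surplus_exp = rate_exponents(r=2, alpha=1.75, beta=1.0, nu=1.0)
print(f"demand: n^-{demand_exp:.3f}, surplus: n^-{surplus_exp:.3f}")
```

The gap between the two exponents equals $2\nu\tau/(2\beta + \alpha)$, which is positive whenever $\nu > 0$, so the surplus exponent is always the larger of the two under these assumptions.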
5. SOME SIMULATIONS AND CONCLUDING REMARKS We present a small Monte Carlo study in order to demonstrate the practical implementation of the proposed method. The function $g$ is defined as follows: $g(p, y) = \frac{0.2\,y}{p + 0.1}$. This form fits with the classical demand function derived from the Cobb–Douglas utility (up to an additive term 0.1 to ensure the function is well-defined on $[0, 1]$). For fixed values $y$ and $p^1$, the differential equation can be explicitly solved and $S_y$ is defined by:
\[ S_y(p) = y \left[ 1 - \left( \frac{p + 0.1}{p^1 + 0.1} \right)^{0.2} \right]. \]
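Because a closed-form solution is available for this design, it offers a convenient check of the Euler scheme recalled in Section 3.2.1. The sketch below is written in Python rather than R and uses the true demand function in place of an estimate, so it illustrates only the numerical step; the function names and the number of grid points are assumptions of this example.

```python
import numpy as np


def g(p, y):
    """Demand function of the simulation design."""
    return 0.2 * y / (p + 0.1)


def surplus_closed_form(p, y, p_ref):
    """Explicit solution of S'(p) = -g(p, y - S(p)) with S(p_ref) = 0."""
    return y * (1.0 - ((p + 0.1) / (p_ref + 0.1)) ** 0.2)


def surplus_euler(demand, y, p_ref, p_end, n_steps):
    """Euler scheme S_{i+1} = S_i - h * demand(p_i, y - S_i), started at S(p_ref) = 0."""
    grid = np.linspace(p_ref, p_end, n_steps + 1)
    h = grid[1] - grid[0]  # negative when moving from p_ref down to p_end
    s = np.zeros(n_steps + 1)
    for i in range(n_steps):
        s[i + 1] = s[i] - h * demand(grid[i], y - s[i])
    return grid, s


# Income 0.5 and reference price 1, as in the simulation design.
grid, s_euler = surplus_euler(g, y=0.5, p_ref=1.0, p_end=0.0, n_steps=2000)
s_exact = surplus_closed_form(grid, y=0.5, p_ref=1.0)
max_abs_error = float(np.max(np.abs(s_euler - s_exact)))
print(f"maximum absolute Euler error: {max_abs_error:.2e}")
```

In an actual replication, `g` would be replaced by the estimated demand function, which is the only change the Euler step requires.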
√ We consider the trigonometric basis in L2 [0, 1], that is φ1 = 1, φ2j (·) = 2cos(2π j .), √ φ2j +1 (.) = 2sin(2π j .). The variables P , Y and W are uniformly
distributed on [0, 1] λ φ (p)φj (y)φj (w) and the joint density of (P , Y , W ) is defined by fPYW (p, y, w) = ∞
∞ −1 −1 j =1 j j −1 with singular values satisfying λ1 = 1 and λj = j (2 j =1 l ) , j ≥ 2. For computational purposes, the infinite series were truncated at j = 100. We then generate Q = E[g(P , Y )|W ] + V , where V is distributed as Normal(0, 0.1). To compute the exact consumer surplus, the income value is fixed and equal to 0.5 and the price reference p1 is equal to 1. The estimated solution of the differential equation is calculated using the Euler algorithm (see Section 3.2.1). We generate samples of size n = 200, and perform 500 Monte Carlo replications. The experiments are carried out in R. The kernel function is the Gaussian kernel and the values of the smoothing parameters are fixed and equal to h = 0.5 and a = 0.05. Results are illustrated graphically in Figures 1 and 2. The figures show g(p, 0.5) and g (p, 0.5)) and E(S S0.5 (p) in the solid line, and Monte Carlo approximation to E( 0.5 (p)) in the dotted line. Performances of both estimators are compared using the average of Monte Carlo approximations to mean squared error (MSE). The results are the following: MSE(g) = 0.01687601 and MSE(Sy ) = 0.0003646748. This illustrates clearly the fact that solving the differential equation smooths the demand and improve its properties (see Section 4.2), although the smoothing parameters h and a are not chosen optimally. To conclude, this article develops a non-parametric estimator of exact consumer surplus where price is specified to be endogenous. We combine the methodology of the nonparametric instrumental variable of Hall and Horowitz (2005) with the estimation of solution of differential equations by Vanhems (2006) in a two-step procedure: first non-parametric estimation of demand; second, non-parametric estimation of exact consumer surplus. 
We analyse the asymptotic property of our estimator and show that the rate derived for the estimated exact consumer surplus is faster than the rate obtained for the estimated demand (due to the resolution of the differential equation linking both functions). This result is illustrated via a small Monte Carlo simulation.
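The second step of the procedure can be illustrated with a minimal sketch of an Euler scheme for the differential equation $S_y'(p) = -g(p, y - S_y(p))$ with $S_y(p_1) = 0$, the form used in Appendix A. The demand function below is hypothetical and chosen only because it has a closed-form surplus; the function names, the step count and the use of Python (the experiments above were run in R) are assumptions of this illustration, not the author's implementation.

```python
import math

def euler_surplus(g, y, p_ref, p_target, n_steps=2000):
    """Euler scheme for S'(p) = -g(p, y - S(p)) with S(p_ref) = 0.

    Returns the approximate value of S at p_target.
    """
    h = (p_target - p_ref) / n_steps  # signed step, negative if p_target < p_ref
    p, s = p_ref, 0.0
    for _ in range(n_steps):
        s += h * (-g(p, y - s))
        p += h
    return s

def demand(p, income):
    # Hypothetical demand (not from the paper): quantity equals income.
    # The exact solution is then S(p) = y * (1 - exp(p - p_ref)).
    return income

# Income 0.5 and reference price 1, as in the simulation design above.
approx = euler_surplus(demand, y=0.5, p_ref=1.0, p_target=0.0)
exact = 0.5 * (1.0 - math.exp(-1.0))
print(approx, exact)
```

With 2000 steps the Euler value and the closed form differ by far less than $10^{-3}$ for this toy demand; in the estimator, $g$ would be replaced by the first-step estimate $\hat g$.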
Figure 1. Graph of functions $g$ (solid line) and $E(\hat g)$ (dotted line); horizontal axis: $P$.
Figure 2. Graph of functions $S_y$ (solid line) and $E(\hat S_y)$ (dotted line); horizontal axis: $P$.
ACKNOWLEDGMENTS
I am grateful to Richard Blundell for stimulating conversations, suggestions and advice. I also thank Jean-Pierre Florens, Jean-Marc Robin, Jan Johannes, Stefan Hoderlein, Sebastien Van Bellegem and two anonymous referees for most helpful comments.
REFERENCES
Ai, C. and X. Chen (2003). Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica 71, 1795–843.
Angrist, J. and A. Krueger (2001). Instrumental variables and the search for identification: from supply and demand to natural experiments. Journal of Economic Perspectives 15, 69–85.
Ascher, U. and L. Petzold (1998). Computer Methods for Ordinary Differential Equations and Differential Algebraic Equations. Philadelphia: Society for Industrial and Applied Mathematics (SIAM).
Berry, S., J. Levinsohn and A. Pakes (1995). Automobile prices in market equilibrium. Econometrica 63, 841–90.
Blundell, R., X. Chen and D. Kristensen (2007). Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica 75, 1613–69.
Blundell, R. and J. Powell (2003). Endogeneity in semiparametric and nonparametric regression models. In M. Dewatripont, L. Hansen and S. Turnovsky (Eds.), Advances in Economics and Econometrics: Theory and Applications, Volume 43, 111–21. Cambridge: Cambridge University Press.
Brown, B. and M. Walker (1989). The random utility hypothesis and inference in demand systems. Econometrica 57, 815–29.
Carrasco, M., J.-P. Florens and E. Renault (2007). Linear inverse problems in structural econometrics: estimation based on spectral decomposition and regularization. In J. Heckman and E. Leamer (Eds.), Handbook of Econometrics, Volume 6B, 5633–751. Amsterdam: Elsevier.
Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. In J. Heckman and E. Leamer (Eds.), Handbook of Econometrics, Volume 6B, 5548–632. Amsterdam: Elsevier.
Coddington, E. and N. Levinson (1955). Theory of Ordinary Differential Equations. New York: McGraw-Hill.
Collatz, L. (1960). The Numerical Treatment of Differential Equations. New York: Springer.
Darolles, S., J.-P. Florens and E. Renault (2002). Nonparametric instrumental regression. IDEI Working Paper No. 228, Université de Toulouse I.
Engl, H. W., M. Hanke and A. Neubauer (2000). Regularization of Inverse Problems. Dordrecht: Kluwer Academic.
Gagliardini, P. and O. Scaillet (2007). A specification test for nonparametric instrumental variable regression. Research Paper No. 07-13, Swiss Finance Institute.
Hall, P. and J. L. Horowitz (2005). Nonparametric methods for inference in the presence of instrumental variables. Annals of Statistics 33, 2904–29.
Hausman, J. (1981). Exact consumer’s surplus and deadweight loss. American Economic Review 71, 662–76.
Hausman, J. and W. Newey (1995). Nonparametric estimation of exact consumer surplus and deadweight loss. Econometrica 63, 1445–76.
Hicks, J. R. (1956). A Revision of Demand Theory. Oxford: Clarendon Press.
Johannes, J., S. Van Bellegem and A. Vanhems (2010). Convergence rates for ill-posed inverse problems with an unknown operator. Forthcoming in Econometric Theory.
Kress, R. (1999). Linear Integral Equations, Applied Mathematical Sciences Volume 82. New York: Springer.
Lewbel, A. (2001). Demand systems with and without errors. American Economic Review 91, 611–18.
Matzkin, R. (2007). Heterogeneous choice. In R. Blundell, W. Newey and T. Persson (Eds.), Advances in Economics and Econometrics, Theory and Applications: Ninth World Congress of the Econometrics Society, Volume 43, 111–21. Cambridge: Cambridge University Press.
Newey, W. and G. Imbens (2009). Identification and estimation of triangular simultaneous equations models without additivity. Econometrica 77, 1481–512.
Newey, W. K. and J. L. Powell (2003). Instrumental variable estimation of nonparametric models. Econometrica 71, 1565–78.
Newey, W., J. Powell and F. Vella (1999). Nonparametric estimation of triangular simultaneous equations models. Econometrica 67, 565–604.
Tikhonov, A. and V. Arsenin (1977). Solutions of Ill-Posed Problems. Washington, DC: Winston and Sons.
Vanhems, A. (2006). Nonparametric study of solutions of differential equations. Econometric Theory 22, 127–57.
Varian, H. (1992). Microeconomic Analysis. New York: W. W. Norton.
Vartia, Y. (1983). Efficient methods of measuring welfare change and compensated income in terms of ordinary demand functions. Econometrica 51, 79–98.
Willig, R. D. (1976). Consumer’s surplus without apology. American Economic Review 66, 589–97.
Yatchew, A. and J. No (2001). Household gasoline demand in Canada. Econometrica 69, 1697–709.
APPENDIX A: PROOFS OF RESULTS
Proof of Theorem 4.1: Existence and uniqueness of the solutions $S_y$ and $\hat S_y$ are proved using the Cauchy–Lipschitz theorem, under the sufficient condition that both functions $g$ and $\hat g$ are continuously differentiable of order 1, which is assumed in Assumptions 4.2 and 4.5. Moreover, under Assumption 4.7 of uniform convergence, we can define a common Lipschitz factor $k$ for both functions $g$ and $\hat g$ and common neighbourhoods $I$ and $D_y$ (see Vanhems, 2006, proof of Lemma 2.2, on p. 150, for details).
Proof of Theorem 4.2: Linear decomposition. This proof is directly adapted from Vanhems (2006) (proof of Proposition 4.1, p. 151). Under the assumptions of existence and uniqueness, for any $y \in\, ]0, 1[$ there exists a unique solution to (2.2), $S_y(p) = \Phi_y[g](p)$. The objective is to characterize the functional $\Phi_y$, that is, the exact dependence between $S_y$ and $g$. Consider the operator $A_y$ defined as follows:
$$A_y : C^1(D_y) \times C^1_{2,0}(I) \to C(I), \qquad (u, v) \mapsto A_y(u, v),$$
where $C(I)$ is the space of continuous functions defined on $I$, and $C^1(D_y)$ the space of functions defined on $D_y$ and continuously differentiable of order 1. We consider also the space $C_{2,0}(I)$ of continuous functions defined on $I$ and satisfying both Assumptions 3.1 and 3.2 of Section 3.2.1. The space $C^1_{2,0}(I)$ stands for continuously differentiable functions of order 1 belonging to $C_{2,0}(I)$. Note that both spaces $(C^1(D_y), \|\cdot\|)$ and $(C(I), \|\cdot\|)$ are Banach spaces. Moreover, we define the following norm:
$$\|\cdot\| = \max\bigl(\|v\|, \|v'\|\bigr)$$
on $C^1_{2,0}(I)$. We can easily see that $(C^1_{2,0}(I), \|\cdot\|)$ is a Banach space. The use of such a norm gives the continuity and linearity of the following function:
$$D : \bigl(C^1_{2,0}(I), \|\cdot\|\bigr) \to \bigl(C(I), \|\cdot\|\bigr), \qquad f \mapsto f'.$$
So, we have: $\forall x \in I$, $A_y(u, v)(x) = v'(x) + u(x, y - v(x))$. Define an open subset $O$ of $C^1(D_y) \times C^1_{2,0}(I)$ with $(g, S_y) \in O$. $A_y$ is continuous on $O$ (it is a sum of continuous maps) and $A_y(g, S_y) = 0$. Let us check the hypotheses of the implicit function theorem. $A_y$ is in fact continuously differentiable (by the same argument), so we can take its derivative with respect to the second variable, $d_2 A_y(g, S_y)$. Moreover, we have: $\forall h \in C^1_{2,0}(I)$, $\forall p \in I$,
$$d_2 A_y(g, S_y)(h)(p) = h'(p) + \frac{\partial}{\partial e_2} g(p, y - S_y(p)) \cdot h(p).$$
We have to prove that $d_2 A_y(g, S_y)$ is a bijection. Let us show first the surjectivity: for every $v \in C(I)$, does there exist $h \in C^1_{2,0}(I)$ such that, $\forall p \in I$,
$$h'(p) + \frac{\partial}{\partial e_2} g(p, y - S_y(p)) \cdot h(p) = v(p)?$$
This is a linear differential equation, so we can solve it and find that:
$$\forall p \in I, \quad h(p) = -\int_{p_1}^{p} v(s) \cdot e^{\int_s^p \frac{\partial}{\partial e_2} g(t, y - S_y(t))\,dt}\,ds.$$
Therefore, $d_2 A_y(g, S_y)$ is surjective. Let us now demonstrate the injectivity, that is $\mathrm{Ker}(d_2 A_y(g, S_y)) = \{0\}$. We are going to solve $d_2 A_y(g, S_y)h = 0$, $h \in C^1_{b,0}(I)$. We find again a linear differential equation that we can solve, and find:
$$\forall p \in I, \quad h(p) = c\,e^{-\int_{p_1}^{p} \frac{\partial}{\partial e_2} g(t, y - S_y(t))\,dt} \quad \text{and} \quad h(p_1) = 0.$$
Therefore, we get $c = 0$. Thus, we have demonstrated that $d_2 A_y(g, S_y)$ is bijective. Let us now demonstrate the bi-continuity of $d_2 A_y(g, S_y)$. In the usual implicit function theorem, this assumption is not required, but here we consider infinite-dimensional spaces, which is why we need a more general theorem with further
assumptions to satisfy. The continuity of $d_2 A_y(g, S_y)$ has already been proved since $A_y$ is continuously differentiable. The continuity of the inverse map follows from an application of the Baire theorem (the bounded inverse theorem): a continuous linear bijection between two Banach spaces has a continuous inverse. Therefore, we can apply the implicit function theorem: there exist an open subset $U$ around $g$ and an open subset $V$ around $S_y$ such that, $\forall u \in U$, $A_y(u, v) = 0$ has a unique solution in $V$.
Let us denote by $v = \Phi_y[u]$ this unique solution for $u \in U$. Now we differentiate the relation $A_y(u, \Phi_y[u]) = 0$, $\forall u \in U$, and apply it at $(g, S_y = \Phi_y[g])$. Let us first differentiate $A_y$: $\forall h \in C^1(D_y) \times C^1_{2,0}(I)$,
$$dA_y(g, S_y)(h)(p) = d_1 A_y(g, S_y)\,dg(h)(p) + d_2 A_y(g, S_y)\,dS_y(h)(p) = dg(h)(p, y - S_y(p)) + (dS_y(h))'(p) + \frac{\partial}{\partial e_2} g(p, y - S_y(p))\,dS_y(h)(p).$$
The differential of $A_y$ leads to a linear differential equation in $dS_y(h)$ that we can solve. Now we apply it with $dg(h) = \hat g - g$ and $dS_y(h) = d\Phi_y[g](\hat g - g)$ in order to find:
$$\bigl(d\Phi_y[g](\hat g - g)\bigr)'(p) = -\frac{\partial}{\partial e_2} g(p, y - \Phi_y[g](p)) \cdot d\Phi_y[g](\hat g - g)(p) - (\hat g - g)(p, y - \Phi_y[g](p)).$$
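Solving it leads us to:
$$\begin{aligned}
d\Phi_y[g](\hat g - g)(p) &= -\int_{p_1}^{p} (\hat g - g)(t, y - \Phi_y[g](t)) \cdot e^{\int_t^p \frac{\partial}{\partial e_2} g(u, y - \Phi_y[g](u))\,du}\,dt \\
&= -\int_{p_1}^{p} (\hat g - g)(t, y - S_y(t)) \cdot e^{\int_t^p \frac{\partial}{\partial e_2} g(u, y - S_y(u))\,du}\,dt \\
&= -\int_{p_1}^{p} (\hat g - g)(t, y - S_y(t)) \cdot v(p, t)\,dt.
\end{aligned}$$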
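The final expression can be checked numerically. The sketch below, under purely illustrative assumptions, solves the differential equation $S' = -g(p, y - S)$, $S(p_1) = 0$, by the Euler method for a hypothetical demand `g` and a perturbed demand `g_hat`, and compares the difference of the two solutions at $p = 0$ with the linear term $-\int_{p_1}^{p} (\hat g - g)(t, y - S_y(t))\, v(p, t)\,dt$. The functional forms, the perturbation size `EPS` and the grid size are not taken from the paper.

```python
import math

def euler_path(g, y, p_ref, p_target, n_steps):
    """Euler path for S'(p) = -g(p, y - S(p)), S(p_ref) = 0."""
    h = (p_target - p_ref) / n_steps  # signed step
    grid, path = [p_ref], [0.0]
    for _ in range(n_steps):
        p, s = grid[-1], path[-1]
        path.append(s + h * (-g(p, y - s)))
        grid.append(p + h)
    return grid, path, h

def g(p, e):
    # Hypothetical demand, linear in income e (assumption of this sketch).
    return e / (1.0 + p)

def dg_de(p, e):
    # Derivative of g with respect to its second argument.
    return 1.0 / (1.0 + p)

EPS = 0.01

def delta(p, e):
    # Hypothetical perturbation playing the role of g_hat - g.
    return EPS * e * e

def g_hat(p, e):
    return g(p, e) + delta(p, e)

y, p_ref, p_target, n = 0.5, 1.0, 0.0, 2000
grid, s_path, h = euler_path(g, y, p_ref, p_target, n)
_, s_hat_path, _ = euler_path(g_hat, y, p_ref, p_target, n)

# cum[k] approximates the integral of dg_de(u, y - S(u)) from p_ref to grid[k].
cum = [0.0]
for k in range(n):
    cum.append(cum[-1] + h * dg_de(grid[k], y - s_path[k]))

# Linear term: -int (g_hat - g)(t, y - S(t)) * exp(int_t^p dg_de) dt.
linear_term = -sum(
    delta(grid[k], y - s_path[k]) * math.exp(cum[n] - cum[k]) * h
    for k in range(n)
)
difference = s_hat_path[-1] - s_path[-1]
print(difference, linear_term)
```

For a small perturbation the two printed numbers should agree up to a remainder of second order in `EPS` plus the discretization error, which is what the linear decomposition asserts.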
So the statement is proved. The convergence of the residual term is proved in Hall and Horowitz (2005).
Convergence in mean square. We analyse the following term: $\int (\hat g - g)(t, y)\, m_y(p, t)\,dt$. The objective is to prove that:
$$\sup_{y \in [0,1]} E \int \Bigl[\int (\hat g - g)(t, y)\, m_y(p, t)\,dt\Bigr]^2 dp = O\bigl(n^{-\tau\frac{2(\beta+\nu)-1}{2\beta+\alpha}}\bigr).$$
The sketch of the proof is very similar to the demonstration in Hall and Horowitz (2005). We decompose the difference $\int (\hat g - g)(t, y)\, m_y(p, t)\,dt$ into four terms and analyse the convergence of each one. Define:
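$$D_{ny}(p) = \iiint g(x, y)\, f_{PYW}(x, y, w)\, T_y^{+}\bigl(\hat f_{PYW} - f_{PYW}\bigr)(t, y, w)\,dx\,dw\; m_y(p, t)\,dt,$$
$$A_{n1y}(p) = \frac{1}{n}\sum_{i=1}^{n} \int T_y^{+} f_{PYW}(t, y, W_i)\, Q_i K_{h_y}(y - Y_i)\, m_y(p, t)\,dt,$$
$$A_{n2y}(p) = \frac{1}{n}\sum_{i=1}^{n} \int T_y^{+}\bigl(\hat f^{(-i)}_{PYW} - f_{PYW}\bigr)(t, y, W_i)\, Q_i K_{h_y}(y - Y_i)\, m_y(p, t)\,dt - D_{ny}(p),$$
$$A_{n3y}(p) = \frac{1}{n}\sum_{i=1}^{n} \int \bigl(\hat T_y^{+} - T_y^{+}\bigr) f_{PYW}(t, y, W_i)\, Q_i K_{h_y}(y - Y_i)\, m_y(p, t)\,dt + D_{ny}(p),$$
$$A_{n4y}(p) = \frac{1}{n}\sum_{i=1}^{n} \int \bigl(\hat T_y^{+} - T_y^{+}\bigr)\bigl(\hat f^{(-i)}_{PYW} - f_{PYW}\bigr)(t, y, W_i)\, Q_i K_{h_y}(y - Y_i)\, m_y(p, t)\,dt.$$
Then $\int \hat g(t, y)\, m_y(p, t)\,dt = A_{n1y}(p) + A_{n2y}(p) + A_{n3y}(p) + A_{n4y}(p)$ and the theorem will follow if we prove that:
$$E\Bigl\|A_{n1y} - \int g(t, y)\, m_y(p, t)\,dt\Bigr\|^2 = O\bigl(n^{-\tau\frac{2(\beta+\nu)-1}{2\beta+\alpha}}\bigr), \tag{A.1}$$
$$E\|A_{njy}\|^2 = O\bigl(n^{-\tau\frac{2(\beta+\nu)-1}{2\beta+\alpha}}\bigr), \quad \text{for } j = 2, 3, 4. \tag{A.2}$$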
We will then carefully detail the proof for equation (A.1) and very briefly indicate the way to prove equation (A.2), following Hall and Horowitz (2005). To derive (A.1), we first decompose the bias term:
$$E A_{n1y}(p) - \int g(t, y)\, m_y(p, t)\,dt = I_1 + I_2,$$
with
$$I_1 = -a \sum_k \sum_j b_{yj}\, c_{yjk}\, (\lambda_j + a)^{-1} \phi_{yk}(p),$$
$$I_2 = O(h_y^r) \iiint T_y^{+} f_{PYW}(t, y, w)\, q\, \frac{\partial^r}{\partial y^r} f_{QWY}(q, w, y)\,dq\,dw\; m_y(p, t)\,dt.$$
Therefore, $\bigl\|E A_{n1y}(p) - \int g(t, y)\, m_y(p, t)\,dt\bigr\|^2 \le 2\bigl(\|I_1\|^2 + \|I_2\|^2\bigr)$ and
$$\|I_1\|^2 = \sum_k \Bigl(a \sum_j b_{yj}\, c_{yjk}\, (\lambda_j + a)^{-1}\Bigr)^2 \le C^2 \Bigl(a \sum_j |b_{yj}|\, j^{-2\nu} (\lambda_j + a)^{-1}\Bigr)^2.$$
Using the Cauchy–Schwarz inequality, we get:
$$\|I_1\|^2 \le C^2 a^2 \Bigl(\sum_j j^{-2\nu}\Bigr)\Bigl(\sum_j |b_{yj}|^2 j^{-2\nu} (\lambda_j + a)^{-2}\Bigr) \le \text{const.}\; a^2 \sum_j |b_{yj}|^2 j^{-2\nu} (\lambda_j + a)^{-2},$$
where here and below “const.” denotes a positive constant. We then divide the series into the sum over $j \le J$, with $J$ of order $a^{-1/\alpha}$, and the complementary part. Following Hall and Horowitz (2005), we bound the right-hand side by $a^2 \sum_{j \le J} (b_{yj}\, j^{-\nu}/\lambda_j)^2 + \sum_{j > J} (b_{yj}\, j^{-\nu})^2$. Under Assumptions 4.3 and 4.4, we prove that:
$$\|I_1\|^2 = O\bigl(n^{-\tau\frac{2(\beta+\nu)-1}{2\beta+\alpha}}\bigr). \tag{A.3}$$
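The splitting argument rests on the termwise inequalities $a^2/(\lambda_j + a)^2 \le a^2/\lambda_j^2$ and $a^2/(\lambda_j + a)^2 \le 1$. A minimal numerical sketch follows, with hypothetical polynomial sequences $\lambda_j = j^{-\alpha}$ and $b_{yj} = j^{-\beta}$; these sequences and the parameter values are assumptions made for illustration only and are not the paper's Assumptions 4.3 and 4.4 themselves.

```python
def split_bound(a, alpha, beta, nu, n_terms=100):
    """Compare a^2 * sum_j b_j^2 j^(-2 nu) / (lam_j + a)^2 with the split bound.

    Assumes lam_j = j^(-alpha) and b_j = j^(-beta) (hypothetical sequences).
    Returns (lhs, rhs), where rhs sums the j <= J and j > J parts.
    """
    J = a ** (-1.0 / alpha)  # cut-off of order a^(-1/alpha)
    lhs = head = tail = 0.0
    for j in range(1, n_terms + 1):
        lam = j ** (-alpha)
        b = j ** (-beta)
        lhs += a ** 2 * b ** 2 * j ** (-2 * nu) / (lam + a) ** 2
        if j <= J:
            head += a ** 2 * (b * j ** (-nu) / lam) ** 2
        else:
            tail += (b * j ** (-nu)) ** 2
    return lhs, head + tail

# a = 0.05 matches the regularization value used in the simulations;
# alpha, beta and nu are illustrative choices.
lhs, rhs = split_bound(a=0.05, alpha=2.0, beta=2.0, nu=0.5)
print(lhs, rhs)
```

Because each term of the left-hand side is dominated by the matching term on the right, the inequality holds for any cut-off; choosing $J$ of order $a^{-1/\alpha}$ is what balances the two parts and delivers the stated rate.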
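Consider now the second term $I_2$, the statistical bias. We have:
$$\|I_2\| \le \text{const.}\; h_y^r \Bigl\|\bigl\langle E_{W|Y}\bigl[T_y^{+} f_{PYW}(\cdot, y, W) \mid Y = y\bigr],\, m_y(p, \cdot)\bigr\rangle\Bigr\| \le \text{const.}\; h_y^r \Bigl\|\sum_{j,k,l} \frac{d_{yjk}\, c_{ylj}}{\lambda_{yj} + a}\, \phi_{yl}(p)\Bigr\|.$$
Therefore, we get:
$$\|I_2\|^2 \le \text{const.}\; h_y^{2r} \sum_l \Bigl(\sum_{k,j} \frac{d_{yjk}\, c_{ylj}}{\lambda_{yj} + a}\Bigr)^2 \le \text{const.}\; h_y^{2r} \Bigl(\sum_j \frac{j^{-2\nu - \alpha/2}}{\lambda_{yj} + a}\Bigr)^2.$$
Again, we can use the Cauchy–Schwarz inequality and divide the series into the sum over $j \le J$ and the complementary part to get:
$$\|I_2\|^2 \le \text{const.}\; h_y^{2r}\, a^{\frac{2\nu - \alpha - 1}{\alpha}} = O\bigl(n^{-\tau\frac{2(\beta+\nu)-1}{2\beta+\alpha}}\bigr)$$
and
$$\Bigl\|E A_{n1y}(p) - \int g(t, y)\, m_y(p, t)\,dt\Bigr\|^2 = O\bigl(n^{-\tau\frac{2(\beta+\nu)-1}{2\beta+\alpha}}\bigr). \tag{A.4}$$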
Consider now the variance term. Using Assumption 4.2, we deduce that
$$n h_y\, \mathrm{var}\{A_{n1y}(p)\} \le \text{const.}\; E_{W|Y}\Bigl[\Bigl(\int (T_y^{+} f_{PYW})(t, y, W)\, m_y(p, t)\,dt\Bigr)^2\Bigr].$$
Then we prove, from an expansion of $T_y^{+} f_{PYW}$ and $m_y(p, \cdot)$ in their generalized Fourier series, that
$$\int \mathrm{var}\{A_{n1y}(p)\}\,dp \le \text{const.}\; \frac{1}{n h_y} \sum_{j,k,i,q,l} \frac{d_{yjk}\, d_{yiq}\, c_{ylj}\, c_{yli}}{(\lambda_{yj} + a)(\lambda_{yi} + a)} \le \text{const.}\; \frac{1}{n h_y} \sum_l \Bigl(\sum_j \frac{\sqrt{\lambda_{yj}}\, c_{ylj}}{\lambda_{yj} + a}\Bigr)^2 \le \text{const.}\; \frac{1}{n h_y} \Bigl(\sum_j \frac{\sqrt{\lambda_{yj}}\, j^{-2\nu}}{\lambda_{yj} + a}\Bigr)^2.$$
Using again the Cauchy–Schwarz inequality and the series decomposition as previously, we prove that:
$$E\|A_{n1y} - E A_{n1y}\|^2 = \int \mathrm{var}\{A_{n1y}(p)\}\,dp = O\bigl((n h_y)^{-1} a^{-(\alpha + 1 - 2\nu)/\alpha}\bigr) = O\bigl(n^{-\tau\frac{2(\beta+\nu)-1}{2\beta+\alpha}}\bigr).$$
Result (A.1) is implied by this bound and (A.4). We now present briefly how to handle the other terms in (A.2). Start with $j = 2$. We introduce the additional notations:
$$D_{nyi}(p) = \iiint g(x, y)\, f_{PYW}(x, y, w)\, T_y^{+}\bigl(\hat f^{(-i)}_{PYW} - f_{PYW}\bigr)(t, y, w)\,dx\,dw\; m_y(p, t)\,dt,$$
$$A_{n2y1}(p) = \frac{1}{n}\sum_{i=1}^{n} \Bigl[\int T_y^{+}\bigl(\hat f^{(-i)}_{PYW} - f_{PYW}\bigr)(t, y, W_i)\, Q_i K_{h_y}(y - Y_i)\, m_y(p, t)\,dt - D_{nyi}(p)\Bigr],$$
$$A_{n2y2}(p) = \frac{1}{n}\sum_{i=1}^{n} \bigl(D_{nyi}(p) - D_{ny}(p)\bigr),$$
$$A_{n2y}(p) = A_{n2y1}(p) + A_{n2y2}(p).$$
We then study each term $\|A_{n2y1}\|^2$ and $\|A_{n2y2}\|^2$. It may be shown by tedious calculations that $E\|A_{n2y1}\|^2 = O\bigl(n^{-\tau\frac{2(\beta+\nu)-1}{2\beta+\alpha}}\bigr)$. Moreover, write $\int A_{n2y2}(p)^2\,dp$ as a double series and take the expected values of the terms one by one. We can again show that $E\|A_{n2y2}\|^2 = O\bigl(n^{-\tau\frac{2(\beta+\nu)-1}{2\beta+\alpha}}\bigr)$.
Next we derive (A.2) for $j = 3$. Write $\Delta_y = \hat T_y - T_y$ and consider the following decomposition: $\hat T_y^{+} - T_y^{+} = -(I + T_y^{+}\Delta_y)^{-1}\, T_y^{+} \Delta_y T_y^{+}$. We introduce the additional notations:
$$A_{n3y1}(p) = -\bigl(I + T_y^{+}\Delta_y\bigr)^{-1} T_y^{+}\Delta_y\, \langle g(\cdot, y),\, m_y(p, \cdot)\rangle,$$
$$A_{n3y2}(p) = -\bigl(I + T_y^{+}\Delta_y\bigr)^{-1} T_y^{+}\Delta_y\, \bigl[A_{n1y}(p) - \langle g(\cdot, y),\, m_y(p, \cdot)\rangle\bigr],$$
$$A_{n3y}(p) = A_{n3y1}(p) + A_{n3y2}(p).$$
Following the argument of Hall and Horowitz (2005) and using the Cauchy–Schwarz inequality, it can be shown that:
$$E\|A_{n3y2}\|^2 \le \Bigl(E\bigl\|(I + T_y^{+}\Delta_y)^{-1} T_y^{+}\Delta_y\bigr\|^4\; E\bigl\|A_{n1y}(p) - \langle g(\cdot, y),\, m_y(p, \cdot)\rangle\bigr\|^4\Bigr)^{1/2} = O\bigl(n^{-\tau\frac{2(\beta+\nu)-1}{2\beta+\alpha}}\bigr).$$
The second term is again decomposed into several sub-terms, each of them being controlled in the same vein as for $A_{n1y}(p)$. Tedious moment calculations show that $E\|A_{n3y1}\|^2 = O\bigl(n^{-\tau\frac{2(\beta+\nu)-1}{2\beta+\alpha}}\bigr)$. The last result, (A.2) with $j = 4$, follows from the rates of $A_{n2y}$ and $A_{n3y}$.
The Econometrics Journal (2010), volume 13, pp. S99–S125. doi: 10.1111/j.1368-423X.2009.00308.x
A structural approach to estimating the effect of taxation on the labour market dynamics of older workers
PETER HAAN† AND VICTORIA PROWSE‡
†DIW Berlin, Mohren Str. 58, 10117 Berlin, Germany. E-mail: [email protected]
‡Department of Economics, University of Oxford, Manor Road Building, Oxford OX1 3UQ, UK. E-mail: [email protected]
First version received: March 2009; final version accepted: December 2009
Summary We estimate a dynamic structural life-cycle model of employment, nonemployment and retirement that includes endogenous accumulation of human capital and intertemporal non-separabilities in preferences. In addition, the model accounts for the effects of income tax, social security contributions and the transfer system on work incentives. The structural parameter estimates are used to evaluate the employment effects of a tax reform focused on low-income individuals. This tax reform is found to cause a significant increase in employment and we find evidence for anticipation effects if the reform is targeted only at older workers. Keywords: Income taxation, In-work credits, Life-cycle labour supply, Tax reform.
1. INTRODUCTION
In most developed countries, the employment rates of older workers are relatively low, and long-term unemployment is heavily concentrated among older workers. For example, in Germany almost two-thirds of all unemployed people aged 55–64 years have been unemployed for more than a year, compared to roughly 40% in the total population. This is particularly problematic for the low educated and individuals living in east Germany as they generally have the lowest employment rates. Responding to this phenomenon, in several countries labour market policy targeted at older people has recently shifted from early retirement schemes to ‘active’ labour market programmes aimed at increasing the employment rates of older unemployed people. The aim of this paper is to evaluate how changes in the tax and transfer system could be effective in fostering employment among older members of the workforce in Germany. The German tax and transfer system can be characterized as a traditional welfare system with relatively generous out-of-work transfers and high marginal deduction rates when people start working. In the political discussion this has often been criticized and the low work incentives implicit in the system have been identified as a central reason for high unemployment, particularly among the low educated. Drawing on the international experience, mainly from the Earned Income Tax Credit (EITC) in the United States and the Working Tax Credit (WTC) in
the United Kingdom, there is an ongoing debate about changing the German welfare system by shifting more transfers to the working poor and thus increasing their work incentives. In this analysis, we evaluate the introduction of an in-work credit for the working poor designed similarly to the EITC. We consider two different implementations of this tax reform. The first is targeted at the whole population and increases working incentives for employed individuals of any age. In the second implementation eligibility for the tax credit is conditioned on age. Specifically, only workers aged 60 years and above are eligible for the transfer. The age-specific reform has the advantage that it is targeted at a population with low employment rates and thus the design limits subsidies given to individuals who are likely to choose employment without additional fiscal incentives. However, this change in the tax and transfer system for older workers induces dynamic effects over the life-cycle. To account for these potential dynamic effects, it is appropriate to work within a dynamic structural life-cycle model of labour supply. A dynamic model is required to take account of intertemporal non-separabilities in wages and preferences. Such effects imply that reforms of the tax system which affect the net incomes of young or middle-aged individuals might induce incentives which change employment behaviour towards the end of the working life. Similarly, a life-cycle model, featuring optimizing forward-looking individuals, provides a desirable framework as it allows current labour supply to depend on the expected rewards to future employment. Thus, a life-cycle model captures the employment response of younger members of the labour force to a tax reform that affects only the net incomes of older workers.
In common with the preceding literature concerned with the specification and estimation of dynamic structural life-cycle models of labour supply, prominently Eckstein and Wolpin (1989), our model allows for on-the-job accumulation of human capital and for intertemporally non-separable preferences. In addition, the implemented model captures the effects of income taxation, social security contributions and in-work and out-of-work transfers on labour supply incentives. The latter feature is necessary to represent labour supply incentives correctly. Despite this, very few papers in this literature have attempted to model the returns to working as net income rather than gross earnings. Indeed, while there exist several implementations of dynamic structural life-cycle models including a specification of the transfers paid to non-working individuals (see, inter alia, Wolpin, 1992, Ferrall, 1997, and Adda et al., 2007), the tax and transfer system applicable to the gross incomes of working individuals has been widely ignored. Exceptions include Yamada (2007), who models progressive income tax when analysing the life-cycle employment behaviour of Japanese women, Haan et al. (2008), who use a full specification of all relevant elements of the German tax and transfer system when studying the effect of in-work benefits, and Rust and Phelan (1997), who study the effect of the design of the social security system on the retirement decisions of American men. In our framework, the transfer system determines the net incomes of non-working individuals through means-tested benefits while the net income of a working individual is defined as his or her gross earnings minus social security contributions (SSC) and income tax plus any in-work transfers. The novelty of the current paper lies in the focus on the effect of the design of the system of taxation applied to earned income on the employment and retirement decisions of older individuals.
For this analysis we use a sample of German men and women aged between 40 and 65 years living in single adult households without dependent children. Each period non-retired individuals choose between full-time employment, non-employment and, if eligible, early retirement. In a similar vein to Low et al. (2009), dependent on age and health status, individuals can decide to retire before the compulsory retirement age of 65 years. In particular, individuals
without health problems can enter retirement if they are aged 50 years or more while individuals with health problems can retire at any age. This analysis complements a large empirical literature which has evaluated the labour supply effects of policies that alter the net incomes of working individuals with low earnings, prominently the EITC and the WTC (see the surveys by Blundell, 2000, Blank, 2002, and Hotz and Scholz, 2003). In contrast to the reduced form and structural myopic methods of evaluation which have been used previously, we use a dynamic structural life-cycle model to determine the employment and retirement effects of a reform affecting the net incomes of working individuals. The main advantage of our approach is that the estimated structural parameters can be used to simulate the life-cycle effects of proposed or hypothetical reforms to tax and transfer schemes that affect the net incomes of working individuals while recognizing the forward-looking and intertemporal nature of individuals’ labour supply behaviour. In contrast to a growing literature on retirement behaviour based on dynamic structural life-cycle models, e.g. Rust and Phelan (1997) or French (2005), we do not focus on the behavioural effects of specific reforms to the pension system. Instead we show how changes in the tax and transfer system might induce positive employment effects while holding the pension system fixed. In this sense our study is similar to Adda et al. (2007, 2009) who evaluate the employment effect of different labour market policies over the life-cycle. However, while these papers focus on the beginning of the working career, we analyse the employment effects towards the end of the working life. Our results show that the introduction of an in-work credit targeted at low-earning individuals of any age leads to a significant increase in employment and a postponement in retirement. 
In addition, anticipation effects occur when the tax reform is limited to only those aged 60 years and above. Indeed, because we model labour supply in a dynamic setting with forward-looking individuals, the employment behaviour of younger individuals might be affected as they know that if their future earnings are sufficiently low then they will be eligible for the tax credit once they reach 60 years of age. Our results show that the age-dependent reform induces essentially no effects for individuals aged under 57 years. However, between age 57 and 60 years there is an increase in employment and a postponement of retirement. The remainder of this paper proceeds as follows. Section 2 presents our dynamic structural model of labour supply behaviour over the life-cycle. This section closes with a presentation of the empirical specification of the flow utilities and the equations of motion for gross wages and health status. Section 3 contains a full description of the institutional features of the German tax and transfer system that impact on the net incomes of employed and non-employed individuals. The strategy for estimation is outlined in Section 4 and the data source, the German Socio-Economic Panel (SOEP), and our sample selection criteria are discussed in Section 5. Estimation results and an analysis of goodness of fit are presented in Section 6. Section 7 shows the estimated employment and retirement effects of changes to the system of income taxation. Finally, Section 8 concludes.
2. MODEL AND EMPIRICAL SPECIFICATION
2.1. Overview
It is the purpose of this paper to study the effects of the tax and transfer system on the employment behaviour of older individuals. To this end, we derive and estimate a dynamic structural life-cycle model of employment, non-employment and early retirement that accounts for the endogeneity of
work experience, intertemporally non-separable preferences, and the effect of the tax and transfer system on work incentives. To reduce complexities, we restrict our sample to one particular population group. Specifically, we model only the life-cycle labour supply of men and women residing in single adult households without dependent children. Further, we focus on individuals aged 40 years and above. We assume that family composition is constant over the individual’s future life and this is justified by the aforementioned age restriction. In addition, it is assumed that men and women in this age category have finished their education and therefore all of the analysis is conditional on educational qualifications obtained prior to age 40 years. Each period an individual re-optimizes his or her labour supply and retirement behaviour. A period is defined to be a quarter of a year; this provides a reasonable tradeoff between the reality of individuals being able to move between employment states on a monthly or even weekly basis and the need for computational tractability. Following e.g. Rust and Phelan (1997), we account for gender- and age-specific life-expectancy which is calculated on the basis of the Human Mortality Database.¹ Before proceeding, two further limitations of our analysis are discussed. First, as is common in this literature, e.g. Rust and Phelan (1997), we make the restrictive assumption that individuals do not save and are credit constrained.² Therefore, the estimated employment effects of a tax subsidy should be interpreted as upper bounds. In a more general model, in addition to the tax and transfer system, precautionary savings would provide insurance by allowing intertemporal consumption smoothing; e.g. Low et al. (2009). In such a setting households are less dependent on the tax and transfer system and therefore any behavioural effects induced by changes in the tax legislation are likely to be lower.
However, because in this study we focus on a sample of low educated men and women, ignoring precautionary savings as potential insurance should only be of minor importance.³ Secondly, unlike numerous studies focusing on job search behaviour, including Ferrall (1997) and Frijters and van der Klaauw (2006), we do not model job search; in our model, all non-retired individuals receive one job offer each period and all non-work among non-retired individuals, henceforth referred to as non-employment, corresponds to individuals who chose not to accept a job at the wage they were offered. That implies that job transitions are driven mainly by persistence or state-dependence effects in employment and by experience or human capital accumulation which might affect wages.
2.2. Job offers and net income
Let $t = \tau_i$ denote the age at which individual $i$ enters the labour market and let $T$ denote the age of compulsory retirement. Similar to Low et al. (2009), we allow individuals aged $T^R$ years or older and those with poor health to take early retirement while this alternative is not open to individuals without health problems aged under $T^R$ years. Non-employed individuals remain in the labour force and may return to full-time employment in the future. In contrast, early retirement is a fully absorbing state and thus once an individual enters early retirement returning to employment in the future is precluded.⁴
¹ Human Mortality Database is provided by the University of California, Berkeley (USA), and Max Planck Institute for Demographic Research (Germany). The database is available at www.mortality.org.
² French (2005) is one of the few examples of a discrete choice model of life-cycle labour supply that allows saving.
³ On average, the low educated men and women in our sample save roughly 5% of average gross earnings.
⁴ This assumption is in line with the observed behaviour of early-retired individuals in Germany; hardly any of the early-retired made the transition into full-time employment.
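The eligibility rules just described can be summarized in a short sketch of the feasible choice set. The function name, the state labels and the default thresholds are assumptions of this illustration: the value 50 for $T^R$ and 65 for $T$ are the ages quoted in the Introduction, and treating every individual at or above the compulsory retirement age as retired is a simplification of the sketch, not a statement about the estimated model.

```python
def choice_set(age, poor_health, retired, t_r=50, t_max=65):
    """Return the feasible labour market states for one period.

    "f" = full-time employment, "n" = non-employment, "r" = retirement.
    t_r and t_max are illustrative defaults for T^R and T.
    """
    # Retirement is absorbing; at or beyond t_max it is assumed compulsory.
    if retired or age >= t_max:
        return ("r",)
    options = ["f", "n"]
    # Early retirement is open to those aged t_r or more,
    # and to individuals with poor health at any age.
    if poor_health or age >= t_r:
        options.append("r")
    return tuple(options)

print(choice_set(45, poor_health=False, retired=False))
print(choice_set(45, poor_health=True, retired=False))
print(choice_set(55, poor_health=False, retired=True))
```

In a dynamic programming solution, a function of this kind would restrict the maximization over alternatives at each age and health state.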
In each period t = τi , . . . , T every non-retired individual receives a single offer of a fulltime job (f ). The gross wage associated with the job offer received by individual i at time t is denoted wi,t . Non-retired individuals without health problems aged younger than T R must decide between accepting the full-time job, in which case they receive a net income in the current quarter of mi,f ,t , and rejecting the offer, in which case the individual is non-employed (n) and receives a net income of mi,n,t . Individuals without health problems aged T R or above and all individuals with health problems have a choice between full-time employment, non-employment and retirement (r). In practice, the net income of pensioners, mi,r,t , is mainly determined by pension payments which depend on previous earnings and the working history over the whole life-cycle. Because we focus on the employment effects of a tax reform which does not affect the pension system, we do not model pension payments explicitly but instead implement a reducedform specification for payoffs associated with retirement which absorbs the effect of pension income. Section 2.5 below provides further details. 2.3. Financial rewards by employment status In contrast to most previous studies of employment behaviour over the life-cycle, we model in detail the effect of the tax and transfer system on working incentives and assume that individuals make their employment decision based on net income rather than on gross earnings. This study uses the German tax and transfer system as a benchmark. The main features of the German tax and transfer system are noted here while Section 3 below provides a more detailed description together with information concerning recent relevant changes to the system. 
5 Our estimation includes a tax simulation model that maps all relevant features of the tax and transfer system and generates for each individual the employment state-specific net income conditional on the individual’s demographic characteristics, the offered wage and non-labour income.

2.3.1. Net income in full-time employment. The individual’s net income in full-time employment takes the following form:

$$m_{i,f,t} = F_f(w_{i,t}, I_{i,t}; TS_t).$$

Net income in full-time employment depends on the offered gross wage $w_{i,t}$, non-labour income $I_{i,t}$, and the tax and transfer system of the given period $TS_t$. The tax and transfer system includes social security payments (SSC), income taxation, and, if net income is sufficiently low, a transfer to raise the individual’s income to the minimum income. 6

2.3.2. Net income in non-employment. Conditional on an individual’s employment and earnings history, a non-working individual may receive unemployment insurance transfers. In addition, depending on the level of any unemployment insurance transfers and income from other sources, the individual may receive a means-tested minimum income transfer which includes housing benefit. The minimum income transfer does not depend on previous earnings and the transfer is permanent. We simplify the legislation and approximate out-of-work transfers by only
5 As mentioned above, we restrict attention to single households without children. This greatly simplifies the modelling of the tax and transfer system as the family-related components of the legislation, such as the joint-income taxation of married couples and child-related transfers, do not need to be considered.
6 Because our sample consists of single individuals, full-time net incomes are always higher than the minimum income and hence none of the sampled individuals receives an in-work transfer.
the means-tested minimum income transfer. Therefore, in our implementation, the net income for a non-employed individual depends on the non-labour income and the transfer system in the given year:

$$m_{i,n,t} = F_n(I_{i,t}; TS_t).$$

Given our sample selection criteria, unemployment insurance is relatively unimportant so there is little loss in including only the means-tested minimum income transfer. Specifically, in the empirical analysis we include only individuals with low educational qualifications. Such individuals tend to have low wages and therefore any unemployment insurance payments are wholly or mostly offset by the withdrawal of the means-tested minimum income transfer. For two reasons, this approximation is most problematic for older individuals with long working histories. First, wages, and therefore any unemployment insurance transfers, are increasing with experience. Secondly, the entitlement rules for unemployment insurance are relatively generous for older workers. 7

2.4. Optimal labour supply over the life-cycle

Having received a job offer with a wage of $w_{i,t}$ at time $t$, individual $i$ must decide whether to accept or reject the job offer. By drawing on dynamic programming techniques, we model optimal labour supply over the life-cycle in a forward-looking setting where the individual considers the dependence of payoffs occurring in the future on his or her current labour supply decision. We assume that the individual has full information about the tax and transfer system in the current period and makes his or her labour supply decision assuming that the current tax and transfer system will prevail in all future periods. 8 We differentiate two mechanisms linking today’s employment decision with future payoffs. First, intertemporally non-separable preferences due to habit formation and adjustment costs mean that an individual’s current employment behaviour affects his or her preference for employment relative to non-employment in future periods.
Secondly, employment today adds to the individual’s experience which, assuming positive returns to experience, leads to higher expected future wage offers. The individual’s life-cycle utility can be expressed in terms of the employment state-specific value functions $V^j_t(s_{i,t})$ for $j = f, n, r$. The state variables $s_{i,t}$ consist of all variables affecting the contemporaneous utilities and the offered wage $w_{i,t}$ at time $t$. At time $t$ the individual is assumed to know the current value of $s_{i,t}$ but may not know the values of all or some elements of $s_{i,t+k}$ for $k > 0$. However, the distribution of $s_{i,t+1}$ is known to the individual at time $t$ and it is assumed to depend only on $s_{i,t}$. The value function associated with full-time employment is defined as the discounted value of the individual’s expected lifetime utility if he or she works full-time in the current quarter and makes optimal labour supply and retirement decisions in all subsequent quarters. The value function for non-employment is similarly defined. The value
7 In an ongoing research project, Haan and Prowse (2009) distinguish between the different transfer schemes for the non-employed and model the endogeneity of entitlement to unemployment insurance payments in a life-cycle model. This richer model is informative about the effects of changes in the entitlement period of the insurance-based part of unemployment transfers. However, such concerns are beyond the scope of this paper. 8 This assumption rules out anticipation of tax reforms. In general tax reforms are not announced long before their implementation and often the timing or design is subject to alteration, as was the case with Tax Reform 2000 in Germany.
function associated with early retirement is defined as the discounted value of the individual’s expected lifetime utility if he or she enters retirement in the current quarter. Formally, let $D_{i,t}$ be an indicator of individual $i$ being eligible for early retirement at age $t$. $D_{i,t}$ takes the value one if the individual is aged $T^R$ years or above and/or the individual has health problems and is zero otherwise. The employment state-specific value functions for full-time employment and non-employment are defined recursively as follows:

$$V^j_{i,t}(s_{i,t}) = \begin{cases} U_{i,j,t}(s_{i,t}) + \delta E_t\left[\max\left\{V^f_{i,t+1}, V^n_{i,t+1}, V^r_{i,t+1}\right\}\right] & \text{for } t = \tau_i, \ldots, T-2, \\ U_{i,j,t}(s_{i,t}) + \delta E_t\left[V^r_{t+1}\right] & \text{for } t = T-1, \end{cases} \tag{2.1}$$

while the value function for early retirement is

$$V^r_t(s_{i,t}) = \begin{cases} U_{i,r,t} + \sum_{h=1}^{\bar{T}} \delta^h \kappa_{i,h,t} E_t\left[U_{i,r,t+h} \mid s_{i,t}, y_{i,r,t} = 1\right] & \text{if } D_{i,t} = 1, \\ -\infty & \text{if } D_{i,t} = 0. \end{cases} \tag{2.2}$$
In the above, $y_{i,j,t}$ for $j = f, n, r$ is an indicator variable taking the value one if the individual was in employment state $j$ at time $t$ and zero otherwise, and $\bar{T} > T$ denotes the maximum length of the individual’s life. $U_{i,j,t}$ denotes the individual’s flow utility associated with employment state $j$ at time $t$ and is a function of the individual’s current net income, socio-economic characteristics and his or her previous employment outcomes. $\kappa_{i,h,t}$ denotes the probability that individual $i$ will survive at least $h$ periods conditional on having survived until age $t$. $\delta$ denotes the discount factor. This is a crucial parameter in the life-cycle optimization problem as it describes how strongly expected future utility affects the individual’s current choice. In the empirical analysis we follow the literature and assume an annualized discount factor of 0.96. 9

In each quarter the individual maximizes his or her discounted expected life-cycle utility subject to a budget constraint. Because, in our framework, individuals neither save nor borrow, the budget constraint dictates that consumption equals state-specific net income. Optimizing behaviour on the part of an individual without health problems implies acceptance of the job offer received at age $t < T^R$ if and only if $V^f_{i,t}(s_{i,t}) \ge V^n_{i,t}(s_{i,t})$. Conversely, if $V^n_{i,t}(s_{i,t}) > V^f_{i,t}(s_{i,t})$ then the individual will choose non-employment. A healthy individual aged $T^R \le t < T$ or an individual with health problems aged $t < T$ will work full-time if and only if $V^f_{i,t}(s_{i,t}) \ge V^n_{i,t}(s_{i,t})$ and $V^f_{i,t}(s_{i,t}) \ge V^r_{i,t}(s_{i,t})$, will be non-employed if and only if $V^n_{i,t}(s_{i,t}) > V^f_{i,t}(s_{i,t})$ and $V^n_{i,t}(s_{i,t}) \ge V^r_{i,t}(s_{i,t})$, and otherwise the individual will move out of the labour market and into retirement. At age $t = T$ all remaining non-retired individuals enter compulsory retirement.

2.5. Empirical specification

This section describes the chosen specifications of the flow utilities, the distribution of offered wages and the stochastic health process. Finally, we detail how the initial conditions are modelled.
9 Previous studies, e.g. Karlstrom et al. (2004), discuss problems estimating the discount factor in similar life-cycle models.
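As a rough illustration of recursion (2.1) and of quarterly discounting, the following Python sketch solves a deliberately simplified problem by backward induction. It is not the solution method used in the paper, which approximates the value functions with an adaptation of Keane and Wolpin (1994). The sketch assumes a hypothetical state consisting only of lagged employment, deterministic flow utilities supplied by the caller, a constant retirement value available in every period, and the closed form for the expected maximum under type 1 extreme value errors. All function names and numerical inputs other than the annualized discount factor of 0.96 are illustrative assumptions.

```python
import math

EULER_GAMMA = 0.5772156649015329  # mean of a standard type 1 extreme value draw


def quarterly_discount(annual=0.96):
    """Convert an annualized discount factor into a quarterly one."""
    return annual ** 0.25


def emax(values):
    """E[max_j (v_j + eps_j)] when the eps_j are i.i.d. type 1 extreme value."""
    top = max(values)  # subtract the maximum for numerical stability
    return EULER_GAMMA + top + math.log(sum(math.exp(v - top) for v in values))


def solve(n_periods, u_f, u_n, v_r, delta):
    """Backward induction on a toy version of (2.1).

    u_f(lag) and u_n(lag) give deterministic flow utilities as functions of
    last quarter's employment (1 = employed). v_r is a constant deterministic
    retirement value (an assumption of this sketch). Returns V[t][lag][choice]
    for choice in {"f", "n"}, excluding the current-period error term.
    """
    V = [None] * n_periods
    for t in reversed(range(n_periods)):
        V[t] = {}
        for lag in (0, 1):
            row = {}
            for choice, flow, new_lag in (("f", u_f(lag), 1), ("n", u_n(lag), 0)):
                if t == n_periods - 1:
                    # Last decision period: compulsory retirement follows.
                    cont = emax([v_r])
                else:
                    nxt = V[t + 1][new_lag]
                    cont = emax([nxt["f"], nxt["n"], v_r])
                row[choice] = flow + delta * cont
            V[t][lag] = row
    return V
```

Because the errors are type 1 extreme value, the expectation of the maximum has a log-sum-exp form, so no numerical integration is needed in this toy case. The paper's state space also contains wages, experience and health, which this sketch omits.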
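2.5.1. Flow utilities. For the estimation, the flow utilities from full-time work and non-employment are specified as follows:

$$U_{i,f,t}(x_{i,t}, m_{i,f,t}, \alpha^e_i, \varepsilon_{i,f,t}) = \beta_f + \beta_y \frac{m_{i,f,t}^{1-\rho} - 1}{1-\rho} + \beta_x x_{i,t} + \beta_\alpha \alpha^e_i + \varepsilon_{i,f,t}, \tag{2.3}$$

$$U_{i,n,t}(m_{i,n,t}, \varepsilon_{i,n,t}) = \beta_y \frac{m_{i,n,t}^{1-\rho} - 1}{1-\rho} + \varepsilon_{i,n,t}. \tag{2.4}$$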
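The two flow utilities in (2.3) and (2.4) can be written as a short Python sketch. The function and argument names are illustrative assumptions; the default of 1.5 for the curvature parameter is the value the paper sets for $\rho$, and the logarithmic branch for $\rho = 1$ is the standard limiting case rather than something the paper uses.

```python
import math


def income_utility(m, rho=1.5):
    """Transformation (m**(1 - rho) - 1) / (1 - rho) of net income m > 0."""
    if m <= 0:
        raise ValueError("net income must be positive")
    if rho == 1.0:
        return math.log(m)  # limit of the expression as rho approaches 1
    return (m ** (1.0 - rho) - 1.0) / (1.0 - rho)


def flow_utility_full_time(m_f, x, alpha_e, eps, beta_f, beta_y, beta_x,
                           beta_alpha, rho=1.5):
    """Equation (2.3); x and beta_x are sequences of equal length."""
    x_part = sum(b * v for b, v in zip(beta_x, x))
    return beta_f + beta_y * income_utility(m_f, rho) + x_part + beta_alpha * alpha_e + eps


def flow_utility_non_employment(m_n, eps, beta_y, rho=1.5):
    """Equation (2.4); characteristics and the random effect are normalized out."""
    return beta_y * income_utility(m_n, rho) + eps
```

With $\rho = 1.5$ the transformation is bounded above by 2, so additional net income raises utility by less and less as income grows.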
As is common in this literature, we assume that individuals are risk averse and set $\rho = 1.5$. 10 $\beta_y$ determines the sign and magnitude of the preference for net income and therefore consumption. The intercept for full-time employment, denoted $\beta_f$, accounts for any disutility from work. The vector of observed individual characteristics $x_{i,t}$ includes an indicator of the individual’s employment state in the last quarter and socio-economic variables including gender and health status. The lagged employment state captures intertemporal non-separabilities in preferences due to the combined effects of habit formation and adjustment costs. The unobservables $\varepsilon_{i,f,t}$ and $\varepsilon_{i,n,t}$ are assumed to be mutually independent and independent over time. In addition, $\varepsilon_{i,j,t}$ for all $i$, $j$ and $t$ is assumed to have a type 1 extreme value distribution. At time $t$ individual $i$ knows the current values of $\varepsilon_{i,f,t}$ and $\varepsilon_{i,n,t}$ but has no information about the future values of these error terms. Persistence in unobservables is captured by $\alpha^e_i$, which represents a time-invariant individual specific random effect, assumed to be known to the individual but unobserved to the econometrician. Prior to entering the labour market each individual draws a value of $\alpha^e_i$ from a standard normal distribution. Draws are assumed to be independent of observed socio-economic characteristics and independent over individuals. By construction, the persistent unobservable $\alpha^e_i$ will be correlated with the lagged dependent variable, experience and the individual’s initial employment status. Our estimation methodology, described below, fully accounts for these effects. 11

A reduced-form specification of the value function for retirement $V^r_t(s_{i,t})$ is adopted. The reduced form captures the effects of both pension income and preferences on the value function for retirement. Specifically, we assume

$$V^r_t(s_{i,t}) = \gamma q_{i,t} + \varepsilon_{i,r,t}, \tag{2.5}$$

where $\varepsilon_{i,r,t}$ is an error term with the same properties as $\varepsilon_{i,j,t}$ for $j = f, n$. In the above, $q_{i,t}$ contains age terms and the individual’s expected duration of retirement at time $t$. The age terms capture effects arising from either the design of public early retirement schemes or rules tying firm-specific or private pension payments to the age of retirement. The expected duration of retirement is defined as the individual’s age and gender adjusted life expectancy, computed from the survival rates published in the Human Mortality Database, minus the individual’s age.

2.5.2. Gross wages. Gross wages are a central component of the model as the offered gross wage is a major determinant of an individual’s net income from full-time work. In the empirical
10 In Appendix B, we provide a robustness check of this assumption by re-estimating the model with $\rho = 2.5$, which corresponds to higher risk aversion.
11 To obtain identification, the coefficients of the observed and unobserved individual characteristics $x_{i,t}$ and $\alpha^e_i$ have been normalized to zero in the flow utility from non-employment.
analysis individual $i$’s log offered gross wage is assumed to evolve according to

$$\log(w_{i,t}) = \lambda_z z_{i,t} + \lambda_\alpha \alpha^w_i + \upsilon_{i,t} \quad \text{for } t = \tau_i, \ldots, T. \tag{2.6}$$
In the above, $z_{i,t}$ are observed individual characteristics that affect wages including education, region of residence and years of experience in the labour market. The coefficient on experience captures the effect of human capital accumulated via previous employment on wages. $\upsilon_{i,t}$ is a shock to individual $i$’s wages occurring at time $t$ and is assumed to be independent of observed individual characteristics, to occur independently over time and to be normally distributed with zero mean and variance $\sigma_\upsilon^2$. Individual $i$ is assumed to know the current value of $\upsilon_{i,t}$ but does not know the future values of the time-varying shocks to wages. $\alpha^w_i$ is a time-invariant individual specific random effect assumed to be unconditionally normally distributed with zero mean and unit variance. Again, by construction, $\alpha^w_i$ will be correlated with previous employment choices and therefore experience, and our estimation methodology accounts for this endogeneity.

2.5.3. Health process. Health status is known to be an important determinant of labour supply and retirement behaviour and may also impact on wages. We measure health with an indicator variable, $\text{Health Problems}_{i,t}$, which takes value one if the individual has health problems at time $t$ and zero otherwise. We assume that health status evolves stochastically over the life-cycle according to the following equation:

$$\text{Health Problems}_{i,t} = \begin{cases} 1 & \text{if } \pi_1 \text{Health Problems}_{i,t-1} + \pi_2 g_{i,t} + \phi_{i,t} \ge 0, \\ 0 & \text{otherwise,} \end{cases} \tag{2.7}$$

where $g_{i,t}$ consists of individual characteristics that impact on health, including education and age. The health status in the previous quarter, $\text{Health Problems}_{i,t-1}$, captures persistence in health status. The unobservable $\phi_{i,t}$ is assumed to occur independently over both individuals and time and to have a standard normal distribution. Given these distributional assumptions, estimation of the parameters in (2.7) can be conducted prior to estimation of the remaining parameters.
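Because the unobservable in (2.7) is standard normal, the health process is a dynamic probit: the probability of health problems in a quarter is the standard normal distribution function evaluated at the index. A minimal Python sketch follows. The function names are illustrative and any coefficient values passed in are hypothetical, since the estimates themselves are reported in Appendix A and not reproduced here.

```python
import math


def std_normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_health_problems(lagged_health, g, pi1, pi2):
    """P(Health Problems = 1 | lagged status, g) implied by (2.7).

    lagged_health is 0 or 1; g and pi2 are sequences of equal length.
    P(index + phi >= 0) equals Phi(index) for standard normal phi.
    """
    index = pi1 * lagged_health + sum(p * v for p, v in zip(pi2, g))
    return std_normal_cdf(index)
```

A positive coefficient on lagged health status makes health problems persistent: the probability of reporting problems is higher for an individual who had problems in the previous quarter.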
Appendix A details the estimation methodology for the health process and the resulting parameter estimates.

2.5.4. Initial conditions. The dynamic nature of our model implies that we cannot treat the initial sample observations of experience and the initial employment state observed in the sample as exogenous with respect to the individual’s labour supply choices during the sample period. To account for the endogeneity of the initial conditions we follow Heckman (1981) and use a reduced-form equation to model the initial observations, and allow the unobservables affecting the initial observations to be correlated with the random effects appearing in the flow utilities and the wage equation. While Heckman (1981) proposed a probit model for the initial employment state, we generalize this to account for the endogeneity of both the initial employment state and initial experience, and to allow for individuals to be retired in the initial state. Specifically, we use a reduced-form dynamic multinomial probit model to approximate labour supply and retirement behaviour between entering the labour market, assumed to occur at age 20 years, and the time when the individual enters the sample. The data generation process for behaviour prior to entering the sample is based on three indices $IE_{i,t}$, $IN_{i,t}$ and $IR_{i,t}$, indicating employment, non-employment or retirement at time $t$. More precisely, an individual is in employment at time $t$ if $IE_{i,t} \ge IN_{i,t}$ and $IE_{i,t} \ge IR_{i,t}$, is non-employed if $IN_{i,t} > IE_{i,t}$ and $IN_{i,t} \ge IR_{i,t}$, and otherwise retirement is the initial state. As above, we model retirement as an absorbing state; hence any individual who enters retirement cannot subsequently move into employment or non-employment.
In the empirical implementation, the index $IE_{i,t}$ is a linear function of observed characteristics, including experience, the random effects $\alpha^w_i$ and $\alpha^e_i$, and an error term $\varepsilon^I_{i,f,t}$. Inclusion of the random effects permits the initial observations to be correlated with subsequent labour supply behaviour. This is necessary to capture the endogenous nature of the initial conditions. The second index $IR_{i,t}$ is a linear function of age terms and error term $\varepsilon^I_{i,r,t}$ while, for identification purposes, $IN_{i,t}$ depends only on error term $\varepsilon^I_{i,n,t}$. The three error terms are mutually independent, independent over time and individuals and are drawn from a standard normal distribution.
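The classification rule for the pre-sample period, including the absorbing nature of retirement, can be sketched in Python as follows. The function names and the representation of the indices as plain numbers are illustrative assumptions; how the indices are built from characteristics, random effects and errors is left to the caller.

```python
def pre_sample_state(ie, in_, ir, previously_retired=False):
    """Classify one period from the three indices IE, IN and IR.

    Returns "f" (employment), "n" (non-employment) or "r" (retirement).
    Retirement is absorbing, so a previously retired individual stays retired.
    """
    if previously_retired:
        return "r"
    if ie >= in_ and ie >= ir:
        return "f"
    if in_ > ie and in_ >= ir:
        return "n"
    return "r"


def simulate_pre_sample(indices):
    """Apply the rule period by period to a list of (IE, IN, IR) tuples.

    Returns the sequence of states and the number of periods spent in
    employment, which serves as initial experience in this sketch.
    """
    states, experience, retired = [], 0, False
    for ie, in_, ir in indices:
        state = pre_sample_state(ie, in_, ir, retired)
        retired = retired or state == "r"
        experience += state == "f"
        states.append(state)
    return states, experience
```

Ties between employment and another state are resolved in favour of employment, and ties between non-employment and retirement in favour of non-employment, exactly as the weak and strict inequalities in the text imply.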
3. THE GERMAN TAX AND TRANSFER SYSTEM

In the following, we describe the key elements of the German tax and transfer system and how we implement the legislation in the setting of a dynamic life-cycle model of labour supply. Although the general structure of income tax, SSC and transfers was unchanged over the years 1995–2006, several reforms, discussed in detail below, affected the progressivity and generosity of this system. These reforms are important for this study as they provide an additional source of identification for the coefficient on net income, $\beta_y$, that does not affect the coefficients of the wage equation.

3.1. Social security contributions

In each month, an individual’s income from employment is subject to social security deductions for health, unemployment and pension benefits. 12 As shown in the first three columns of Table 1, except for unemployment insurance, the rates for SSC increased slightly over time. Social security contributions are capped, and the upper level of monthly earnings subject to SSC is higher in west Germany than in the east (5200 euros compared to 4500 euros in 2005). 13

3.2. Income taxation

In contrast to SSC, income tax is computed on an annual basis and at the household level. Because we focus only on single households, issues pertaining to the joint taxation of couples do not affect our model. An individual’s annual taxable income is defined as the sum of gross income from employment above an exemption threshold, gross income from assets above a disregard and income from renting property. Moreover, SSC up to a maximum amount are deducted. An individual’s annual income tax liability is obtained by applying the income tax function to taxable income. The income tax function is a smooth function of taxable income above a further exemption threshold. The exemption threshold increased between 1995 and 2006 while, over the same period, the top marginal tax rate decreased from 53% to 42% (see Table 1).
In addition to income tax, individuals pay an extra tax (Solidaritaetszuschlag) to finance the cost of German reunification. This extra tax was decreased in 1998 from 7.5% to 5.5% of income tax payments.
12 In addition to the employee’s SSC, the employer contributes about the same amount in SSC.
13 Low earning individuals pay SSC at a subsidized rate. However, because we only consider the full-time employed, the lower bound is of no relevance for our application.
Table 1. Key parameters of the German tax and transfer system.

        Social security contributions (%)           Income taxation                  Transfers (average per month)
Year    Health       Retirement   Unemployment      Tax allowance   Top marginal     West       East
        insurance    pension      insurance         per year        tax rate (%)
1995    7            9.3          3.3               4050            53               564        553
1996    7.5          9.65         3.3               6021            53               571        560.5
1997    7.75         10.15        3.3               6021            53               580        569.5
1998    7.75         10.15        3.3               6156            53               586        575
1999    7.75         9.85         3.3               6507            53               594        584
2000    7.75         9.85         3.3               6876            51               606        596
2001    7.75         9.55         3.3               7200            48.5             617        606
2002    7.75         9.75         3.3               7200            48.5             629        617
2003    8            9.75         3.3               7200            48.5             634        622
2004    8            9.75         3.3               7632            45               643        631
2005    8.5          9.75         3.3               7632            42               653        637
2006    8.5          9.75         3.3               7632            42               658        642

Note: All payments are given in euros. The rates of the SSC describe only the employee’s share. The employer contributes the same amount. The minimum income includes housing benefits.
3.3. Transfer system

Minimum income payments made to non-working individuals are means-tested against capital income and income from renting. The last two columns of Table 1 show the average monthly minimum income transfer paid to non-working individuals for the years 1995–2006. Working individuals with net incomes below the minimum income receive an in-work transfer to raise their income to the minimum income. However, because all work in our model consists of full-time employment, the majority of working individuals do not receive minimum income transfers. In Germany, minimum income transfers are not subject to income taxation.

3.4. Implementation

As described above, income tax is based on annual income. However, we model labour supply decisions at quarterly intervals. In our implementation of the German tax and transfer system we calculate net income in the current quarter based on an annualized version of the individual’s income in the current quarter. The procedure assumes implicitly that individuals base their labour supply decision in the current quarter on their net income relating to their current gross income and ignore any adjustments in taxes and transfers pertaining to income received previously in the fiscal year. In addition, we assume full take-up of benefits.
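The annualization step can be illustrated with a stylized Python sketch. It is only a sketch under strong assumptions and not the tax simulation model used in the paper: the income tax schedule is a hypothetical flat rate above the allowance (the actual schedule is a smooth progressive function), non-labour income is ignored, and all SSC are deducted from taxable income. The default parameter values are the 2005 west German figures quoted in the text and in Table 1 (combined employee SSC rate of 8.5 + 9.75 + 3.3 = 21.55%, monthly cap of 5200 euros, allowance of 7632 euros, minimum income of 653 euros per month, surcharge of 5.5%).

```python
def hypothetical_income_tax(taxable, allowance=7632.0, rate=0.30):
    """Placeholder schedule (an assumption): flat rate above the allowance."""
    return rate * max(taxable - allowance, 0.0)


def quarterly_net_income(gross_quarter, ssc_rate=0.2155, ssc_cap_month=5200.0,
                         min_income_month=653.0, surcharge=0.055,
                         tax_fn=hypothetical_income_tax):
    """Net income in the current quarter from annualized quarterly earnings."""
    annual_gross = 4.0 * gross_quarter            # annualize the current quarter
    ssc = ssc_rate * min(annual_gross, 12.0 * ssc_cap_month)  # capped contributions
    income_tax = tax_fn(annual_gross - ssc)       # tax on income net of SSC
    total_tax = income_tax * (1.0 + surcharge)    # add the solidarity surcharge
    net_quarter = (annual_gross - ssc - total_tax) / 4.0
    # In-work transfer raises income to the minimum income if it falls short.
    return max(net_quarter, 3.0 * min_income_month)
```

Passing a different `tax_fn` is the natural way to plug in a more realistic schedule, or a schedule for another year, without changing the annualization logic.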
4. ESTIMATION STRATEGY

The parameters describing gross wages, preferences and the initial conditions are estimated jointly using the Method of Simulated Moments (MSM): parameters are chosen to minimize
the distance between a set of moments pertaining to the values of the endogenous variables, namely wages, employment and retirement outcomes, as observed in the sample and the average values of the same moments in simulated data sets. Similar to, e.g., French (2005), we estimate the health process separately from preferences and wages in a first step (see Appendix A). The dynamic structural life-cycle model itself contains 42 parameters and estimation is based on 214 moments including year- and age-specific mean values of the endogenous variables, and partial correlations between these variables and the explanatory variables obtained from multivariate regressions. Similarly, partial correlations between employment transitions and explanatory variables are included. Intertemporal correlations of the endogenous variables and the number of transitions provide information about persistence in wages and in employment behaviour and about the distribution of unobserved heterogeneity. The coefficient on net income is identified from correlations between functions of non-labour income, i.e. income from assets, and employment behaviour. Changes in the tax and transfer system over time provide a further source of identification. Specifically, such changes provide exogenous variation in the relationship between net income and employment; see Table 1 for changes in the tax and transfer system over time. The state-specific value functions, required to simulate data sets, are approximated using an adaptation of the method of Keane and Wolpin (1994). Within the MSM framework it is straightforward to deal with missing wage observations. Given the above model and the data source described below, there are three reasons for missing wages. First, wages are observed only in one quarter of each year (the quarter in which the interview was conducted) while the individual’s employment state is observed in every quarter.
Secondly, only individuals in employment are asked to report their wage; the offered wage is not observed for non-working individuals. Thirdly, some individuals in employment do not respond to all of the survey questions needed to construct the wage measure. The missing wage observations in the quarters without interviews and the unobserved wages for non-working individuals do not pose any particular difficulties when constructing the simulated data sets. In the estimation we match moments of the wages observed in the sample with moments computed from the simulated wages of individuals who, in the simulation, chose to work in the quarter in which they were interviewed. This procedure does not require wages for non-interview quarters and accounts for selection into employment based on both observed and unobserved individual characteristics. To account for survey non-response, the moments pertaining to simulated wages are computed by weighting the simulated wages according to observed socio-demographic variables. These adjusted simulated moments are then matched to the corresponding moments in the sample. This methodology accounts for survey non-response that varies according to observed socio-demographic variables but assumes that, conditional on observables, survey non-response is random.
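The moment-matching logic described in this section can be sketched in Python as follows. The sketch assumes a diagonal weighting scheme, which is an illustrative choice because the weighting matrix is not specified in this section, and it leaves out the non-response weights. The record fields `log_wage`, `employed` and `interview_quarter` are hypothetical names.

```python
def average_moments(simulated):
    """Average each moment across simulated data sets (one list per data set)."""
    n = len(simulated)
    return [sum(column) / n for column in zip(*simulated)]


def msm_objective(sample_moments, simulated, weights=None):
    """Weighted squared distance between sample and average simulated moments."""
    avg = average_moments(simulated)
    if weights is None:
        weights = [1.0] * len(sample_moments)  # assumption: identity weighting
    return sum(w * (s - a) ** 2 for w, s, a in zip(weights, sample_moments, avg))


def simulated_wage_moment(records):
    """Mean simulated log wage over person-quarters that mirror observability.

    Only simulated individuals who chose to work in their interview quarter
    contribute, which reproduces selection into observed wages.
    """
    kept = [r["log_wage"] for r in records
            if r["employed"] and r["interview_quarter"]]
    return sum(kept) / len(kept)
```

An estimator would wrap `msm_objective` in a numerical minimizer over the model parameters, re-simulating the data sets at each trial parameter vector with the random draws held fixed.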
5. DATA AND DESCRIPTIVE EVIDENCE

This study draws on data from the SOEP, which is an annual representative panel survey of over 11,000 households living in Germany and contains information about working behaviour, socio-economic variables and information about income from all sources at the individual and household levels. 14 We construct an unbalanced panel of single adult households with consecutive observations in at least two years between 1996 and 2007 inclusive, which yields
14 For a detailed description of the data set, see Haisken De-New and Frick (2005).
retrospective information for the fiscal years 1995–2006. The sample is restricted to singles aged between 40 and 65 years inclusive. The maximum level of school qualifications of individuals in our sample is a medium degree (Realschule) and we drop individuals who have a higher vocational degree. Further, we exclude individuals whose primary earnings are from self-employment as their labour supply differs substantially from that of the rest of the population of interest. These exclusions yield a sample with 874 different single individuals, consisting of 491 women and 383 men. The median number of observations per individual is 24 quarters. The SOEP includes detailed information about employment and retirement behaviour in each month of the year prior to the interview date. For tractability, we group the monthly information for each individual to form quarterly observations. More precisely, the individual’s state in the first month of the quarter determines the quarterly outcome. In this analysis we distinguish between employment, assumed to be full-time work, non-employment and retirement. Individuals aged 50 years or above who report sufficient income from a pension are classified as retired, as are younger individuals with objective health problems who receive a large enough pension. 15 Figure 1 shows the shares of employment, non-employment and retirement by age separately for men and women and by region. In general, the behaviour of the various subgroups is similar. Until the age of 55 years employment rates are fairly high while the employment rate declines to zero over the last 10 years of the working life. Before age 55 years the majority of the non-work corresponds to non-employment whereas retirement increases markedly after the age of 60 years. Employment rates for men and women are quite similar. This is not surprising because our sample consists only of single individuals without dependent children.
A difference by gender only becomes visible at the end of the working life. In particular, women tend to retire earlier than men. By region, however, we find the expected strong difference: averaged over the whole age distribution, the employment rate is 10 percentage points higher in west Germany than in east Germany, and prior to age 60 years, east Germans have a higher propensity to retire than west Germans. These differences are likely to be related to the worse economic conditions in east Germany. In addition to the retrospective information on monthly employment states, the data include the gross earnings in the month prior to the interview date. Moreover, the corresponding working hours including paid overtime work are given and thus we can construct an hourly wage measure. For time-consistency we cannot use the retrospective employment information and the current wage information from the same survey wave. Instead, we make use of the panel dimension in the data. Because we observe the exact interview day we can match the wage information collected in one year to the corresponding quarter of the retrospective employment information collected the next year. Given that our sample is very homogeneous, we condition preferences and wages on only a few demographic characteristics. Specifically in addition to gender, education, nationality and region of residence, which are time-invariant, we condition on age, experience and the stochastic health status. A measure of experience at the time the individual enters the sample is constructed from retrospective information concerning the individual’s working history. This variable is then updated in accordance with the individual’s observed employment behaviour during the sample period.
15 The assumption that only individuals older than 50 years or with health problems can choose early retirement is supported by the data.
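The grouping of monthly calendar information into quarterly observations described in this section amounts to keeping the state recorded in the first month of each quarter. A minimal Python sketch, with state labels `"f"`, `"n"` and `"r"` chosen here purely for illustration:

```python
def quarterly_states(monthly):
    """Collapse monthly states into quarterly ones.

    monthly is a sequence whose length is a multiple of three, ordered by
    calendar month. The state in the first month of each quarter determines
    the quarterly outcome.
    """
    if len(monthly) % 3 != 0:
        raise ValueError("expected complete quarters (a multiple of 3 months)")
    return [monthly[i] for i in range(0, len(monthly), 3)]
```

Under this rule a spell that starts and ends within the last two months of a quarter leaves no trace in the quarterly data, which is the price of the simplification.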
[Figure 1 has four panels: (a) Women, (b) Men, (c) West Germany, (d) East Germany. Each panel plots the proportion in state (vertical axis, 0 to 1) against age in years (horizontal axis, 40 to 65) for three states: full-time employment, non-employment and retirement.]
Figure 1. Observed life-cycle employment and retirement behaviour by gender and region of residence.
6. RESULTS

Table 2 shows the estimates of the parameters of the equation describing log wages. All parameters are in line with previous findings. Gross wages are increasing in experience: we find that an extra 10 years of experience increase the gross wage by 30%. 16 We find quite large wage differentials by gender, nationality and in particular by region of residence, while education has only a minor effect. The health effect is negative but not statistically significant and wages are lower for individuals older than 59 years. Quantitatively, the gross wages of men are about 25% higher than those of women. Ceteris paribus, wages for native Germans are roughly 20% higher than for non-natives and the wage differential between east and west Germany is about 60%. The average effect of medium education, defined as having a medium school degree or vocational qualification, is about 13%. Moreover, a large proportion of the unobserved component of log wages is due to persistent unobservables: the estimated variance of the time-dependent individual error term is slightly smaller than the variance of the persistent unobserved component of wages. The top and bottom panels of Table 3 show, respectively, the estimates of the parameters that determine the flow utility from full-time employment, relative to non-employment, and the value function associated with retirement. In terms of the flow utility from full-time employment, the
16 Our specification implies that wages are convex in experience. In an additional estimation, not reported, we also included squared experience, but this variable was insignificant.
C 2010 The Author(s). The Econometrics Journal C 2010 Royal Economic Society.
The effect of taxation on labour market dynamics Table 2. Estimates of parameters in the wage equation. Coefficient
S113
Standard error
Intercept West
0.684 0.630
0.145 0.050
Education Experience (years)/10
0.139 0.306
0.040 0.048
Male German Health Problems
0.254 0.188 −0.063
0.038 0.053 0.043
Age 1 Age 2
0.002 −0.085
0.014 0.029
0.173 0.146
0.046 0.018
λα σν
Notes: Age 1 and Age 2 are age terms. Age 1 is zero if the individual is aged less than 54 years, increases at the rate of 0.25 per quarter between age 54 and age 59 years and takes the value 5 if the individual is aged 59 years or older. Age 2 is zero if the individual is aged less than 59 years and increases at the rate of 0.25 per quarter thereafter. West is an indicator of residing in west Germany, Education is a dummy for having a medium school degree or vocational qualification. German is an indicator of being a German national. Health Problems is an indicator of having health problems that limit daily activities.
Table 3. Estimates of parameters describing preference for employment and retirement. Coefficient Standard error Employment Intercept Age 1 Age 2
−2.861 −0.808 −0.840
0.448 0.128 0.171
Employedt−1 Health Problems
4.436 −1.053
0.325 0.435
Education West Male
0.006 −0.174 0.836
0.416 0.423 0.371
1.273 3.750
0.266 0.515
Retirement Intercept I(59 < Age ≤ 62)
−7.895 0.687
6.037 2.341
I(Age > 62) Life Expectancy
−3.819 −7.767
0.844 2.442
Coefficient on net income (βy ) βα
Note: See notes for Table 2.
coefficient on the indicator of being in employment in the previous quarter is positive and highly significant. This state dependence effect may be due to adjustment costs or habit formation. The significant and negative intercept shows that on average individuals experience a disutility from work. The relatively large standard deviation of individual unobserved effect σαe , however, C 2010 The Author(s). The Econometrics Journal C 2010 Royal Economic Society.
S114
P. Haan and V. Prowse
implies that, ceteris paribus, a fraction of the population derives utility from work. Age is a significant determinant of preferences for full-time work for individuals aged 55 years and above. As mentioned above, the approximation of the out-of-work transfers is most problematic for the older workers, because entitlement rules become more generous at the end of the working life. Thus, the age-related preference effect might capture to some extent institutional regulations that provide incentives to use non-employment as a stepping stone into retirement; see Haan and Prowse (2009). We find that single men tend to have a higher preference for full-time work than single women. Education has no significant effect on preferences which is not surprising given that we exclude the higher educated from this analysis. As expected, individuals with health problems have a significantly lower preference of full-time work relative to non-employment. Finally, but importantly, the coefficient on net income is significantly positive thus implying that net income is an important determinant of labour supply behaviour. The value function associated with retirement is significantly decreasing with the genderand age-specific life-expectancy. This implies for individuals with a high life-expectancy early retirement is not attractive, perhaps because they would suffer a pension penalty due to the long expected duration of their retirement. Conditional on the life-expectancy, the value function associated with retirement for individuals aged 55–60 years is not significantly different from that of younger individuals, while individuals aged over 60 years have a significantly lower value function from retirement than younger individuals. The latter effect could arise as older individuals are relatively likely to have poor health and therefore have a low value of leisure when retired. 
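The percentage wage differentials quoted for the wage equation read the Table 2 coefficients as log-point approximations. As a rough check, the sketch below converts a coefficient b from a log-wage equation into the exact proportional change exp(b) − 1. The function names are illustrative and not taken from the paper, standard errors are ignored, and only the point estimates from Table 2 are used.

```python
import math

# Point estimates copied from Table 2 (log-wage equation).
WAGE_COEFFICIENTS = {
    "West": 0.630,
    "Education": 0.139,
    "Experience (per 10 years)": 0.306,
    "Male": 0.254,
    "German": 0.188,
    "Health Problems": -0.063,
}


def approximate_percent_effect(coefficient: float) -> float:
    """Log-point reading: 100 * b, accurate only when b is small."""
    return 100.0 * coefficient


def exact_percent_effect(coefficient: float) -> float:
    """Exact percentage change in the wage level implied by a log-wage coefficient."""
    return 100.0 * (math.exp(coefficient) - 1.0)


for name, beta in WAGE_COEFFICIENTS.items():
    approx = approximate_percent_effect(beta)
    exact = exact_percent_effect(beta)
    print(f"{name:28s} approx {approx:6.1f}%   exact {exact:6.1f}%")
```

For small coefficients such as Education the two readings nearly coincide; for a large coefficient such as West the exact conversion is noticeably larger than the log-point figure.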
To complete the description of the estimation results, Table 4 presents the coefficients appearing in the initial conditions. These parameters are descriptive of individuals’ behaviour prior to their entering the sample, but do not have a structural interpretation.

Table 4. Estimates of parameters describing initial conditions.

                                  Coefficient    Standard error
Employment
  Intercept                          0.615           1.056
  Individual employment effect       1.337           0.278
  Individual wage effect             0.168           0.320
  Age 3/10                          −0.462           0.558
  Age 4                             −0.422           0.162
  Experience/10                     −0.437           0.290
  Education                          0.461           0.590
  West                               1.467           0.653
  Male                               0.448           0.868
  Health Problems                   −1.842           0.481
  Asset 1                            1.417           0.741
  Asset 2                            0.961           0.598
  Children Previously               −0.080           1.073
  Married Previously                −0.036           0.589
Retirement
  Intercept                         −2.899           0.127
  I(55 < Age ≤ 57)                   0.566           0.359
  I(Age > 57)                        1.400           0.243

Notes: Age 3 and Age 4 are age terms. Age 3 is zero for individuals aged less than 40 years, increases at a rate of 0.25 per quarter up to age 55 years, and takes the value 15 if the individual is aged 55 years or older. Age 4 is zero for individuals aged less than 55 years and increases at a rate of 0.25 per quarter thereafter. Asset 1 is an indicator of income from assets being positive but less than 400 euros per year, and Asset 2 is an indicator of income from assets being greater than 400 euros per year. Children Previously and Married Previously are indicators of having had dependent children or having been married prior to entering the sample. For further details, see the note for Table 2.

6.1. Goodness of fit

Figure 2 presents a graphical analysis of the model’s goodness of fit. Employment, non-employment and retirement are predicted satisfactorily. The distribution of the simulated log wages for individuals in employment in the quarter in which they were interviewed, adjusted for survey non-response, accurately matches the distribution of sampled wages.[17]

[17] Simulation results are based on 50 simulated data sets each of the same size as the sample. Log wages are in year 2000 prices. We provide detailed information about the 214 simulated moments as supplementary material on the home page of the Journal.

7. LIFE-CYCLE EMPLOYMENT EFFECTS OF TAX REFORMS

In this final section attention is turned to using the structural parameter estimates, reported above, to simulate the employment and retirement effects of a reform to the system of taxation of earned income. The current German tax and transfer system is a traditional welfare system with relatively high out-of-work transfers. Transfers to the non-working are rapidly withdrawn with earnings once an individual enters the labour market, creating high marginal tax rates and low incentives to supply labour. The current system has often been identified as an important factor underlying the relatively low employment rate in Germany. There is an ongoing debate about changing the German welfare system by shifting more transfers to the working poor and thus increasing work incentives, as has been achieved in the United Kingdom via the WFTC and the United States with the EITC.

In the following we focus on one particular, hypothetical, change to the tax system designed to foster employment among low-earning individuals. Specifically, we consider an in-work tax credit, designed similarly to the EITC, which reduces the marginal tax rate directly through the introduction of earnings-related transfers for the working poor. The calibration of this policy is based on year 2000 prices and is such that individuals with monthly gross earnings below 1267 euros, which corresponds to a gross hourly wage of less than 7.5 euros for a full-time worker, receive a monthly tax credit of 200 euros. In contrast to the EITC, there is no phase-in but a monotonic phase-out of the tax credit.[18] The taper rate is roughly 47%, which implies that individuals with monthly gross earnings above 1689 euros are not eligible for any tax credit. We consider two different implementations of this tax reform. The first is targeted at the whole population and increases working incentives for individuals with low earnings of any age.

[18] Because we only focus on full-time working individuals there is an explicit-hours rule for the tax credit.
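The credit schedule described in this section can be written down in a few lines. The sketch below is an illustration only: it assumes the phase-out is linear between the two thresholds given in the text, so the taper rate is derived from those thresholds (about 47.4%) rather than fixed at exactly 47%. The function and constant names are hypothetical.

```python
MAX_CREDIT = 200.0        # monthly credit in euros (year 2000 prices)
PHASE_OUT_START = 1267.0  # monthly gross earnings below which the full credit is paid
PHASE_OUT_END = 1689.0    # monthly gross earnings above which no credit is paid

# Assumption: linear phase-out, so the taper rate follows from the thresholds.
TAPER_RATE = MAX_CREDIT / (PHASE_OUT_END - PHASE_OUT_START)  # roughly 0.47


def in_work_credit(gross_monthly_earnings: float, employed_full_time: bool = True) -> float:
    """Monthly in-work credit: no phase-in, full amount up to the first
    threshold, then a monotonic linear phase-out."""
    if not employed_full_time or gross_monthly_earnings <= 0.0:
        return 0.0
    excess = max(0.0, gross_monthly_earnings - PHASE_OUT_START)
    return max(0.0, MAX_CREDIT - TAPER_RATE * excess)


for earnings in (1000.0, 1267.0, 1478.0, 1689.0, 2000.0):
    print(f"earnings {earnings:7.0f} euros -> credit {in_work_credit(earnings):6.1f} euros")
```

Because there is no phase-in, any full-time worker earning below the first threshold receives the full 200 euros, which is why the schedule lowers effective marginal tax rates only at the point of entry into work and raises them over the phase-out range.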
[Figure 2, four panels comparing observed and predicted outcomes: (a) Full-time employment, (b) Non-employment and (c) Retirement, each shown as a proportion against age in years (40 to 65); (d) Log wages, shown as a density over the log wage.]
Figure 2. Goodness of fit.
In the second implementation eligibility is conditioned on age. Specifically, only workers aged 60 years and older are eligible for the in-work credit. The age-specific reform has the advantage that it is targeted at a population with low employment rates and thus limits subsidies given to individuals who would choose employment without additional fiscal incentives. However, this change in income taxation for older workers induces dynamic effects over the life-cycle. Indeed, for younger individuals not directly targeted by the age-related tax credit it might be optimal to adjust working behaviour because of forward-looking anticipation effects. A priori the work incentives induced by such a tax reform are ambiguous. On the one hand, as higher working experience increases the employment probability at older ages, the in-work credit makes employment prior to age 60 years more attractive. On the other hand, because the tax credit is conditional on gross earnings, it might be optimal for a younger worker to reduce employment, leading to lower future wages, in order to become eligible for the tax subsidy. These examples highlight the complexity of behavioural effects induced by fiscal policy over the life-cycle, and underline the importance of applying a dynamic life-cycle model which allows for adjustments in current labour supply in response to anticipated future incentives.

Figure 3 shows the effects on employment, non-employment and retirement when the in-work credit is introduced for all individuals regardless of age. We present the results for different subgroups, by gender, education level and region of residence, based on simulations that assume zero income from assets and that individuals have not previously been married or previously had children. More precisely, based on the estimated parameters we simulate the group-specific behavioural effect over the life-cycle. Because we condition on other household characteristics, these effects are similar to group-specific marginal effects.

[Figure 3, eight panels showing the percentage point change (−4 to 4) in full-time employment, non-employment and retirement against age in years (40 to 65): (a) Low educated, east German women; (b) Medium educated, east German women; (c) Low educated, west German women; (d) Medium educated, west German women; (e) Low educated, east German men; (f) Medium educated, east German men; (g) Low educated, west German men; (h) Medium educated, west German men.]
Figure 3. Employment and retirement effects of a tax reform affecting all individuals.

By construction, this tax reform leads to different behavioural responses for individuals with different observed characteristics. Because eligibility for the tax credit is conditional on low earnings, individuals with low potential earnings have the highest incentive to take up or remain in employment. Moreover, the size of the employment effect depends on the observed employment shares of the subgroups and is related to the estimated preference terms by observed and unobserved characteristics discussed above.

Overall, we find a fairly similar life-cycle pattern of the behavioural responses for all subgroups; however, the magnitudes of the effects differ. The largest employment effects are around the age of 40 years and thereafter the behavioural adjustment is lower. This age pattern is partly related to the returns to experience; we find a relatively high experience effect on wages and this implies that the more experienced workers lose their eligibility for the in-work credit. Moreover, the in-work credit leads to a postponement of retirement for all groups. However, the largest fraction of the previously retired chooses to be non-employed rather than to work. Recall that retirement is an absorbing state and therefore, once retired, individuals will never benefit from the tax credit. Thus, the tax credit creates a strong incentive to postpone retirement. For the elderly and those with an interrupted working history, however, employment is not attractive. First, high state dependence effects, which include adjustment costs, make a transition into employment difficult. Second, as discussed above, we find strong negative age effects in the utility from employment which can be partly related to the institutional setting of out-of-work transfers. Still, because individuals are forward looking they know that in future periods they might make a transition into employment, thus benefiting from the in-work credit.

Panels (a)–(d) show the employment effects for women and panels (e)–(h) show the effects for men. As discussed above, we estimate an overall gender differential in wages of about 25% and hence, given the lower wages, ceteris paribus, more full-time working women are eligible for the in-work credit than men. On the other hand, the estimates suggest that women tend to have a lower taste for work than men, which reduces the behavioural responses to financial incentives. Still, for all subgroups we find the largest employment effects for women.

There is no clear picture by education. The education effect on wages is relatively small and hence the incentive effects of in-work credits are only slightly higher for the low educated. On the other hand, there is an indirect effect on employment behaviour which is related to the initial conditions and the health status. For example, we find that, ceteris paribus, better education reduces the health risk by about 3% (see Table 5). Bad health, however, has a strong effect both on the initial employment state and on life-cycle employment. In this respect the better educated respond more strongly to financial incentives. The latter effect seems to dominate for all groups, but the difference is negligible.

Not surprisingly, we find the largest difference between individuals living in east and west Germany and this is mainly related to the enormous regional wage gap. In other words, east German men and women are far more likely to benefit from the in-work credit than those full-time employed in the west, and therefore we find higher behavioural effects in the eastern part.

Figure 4 shows that the employment effects differ when the entitlement to the tax credit is conditioned on age. We present the effects by the above-defined subgroups.
Table 5. Estimates of parameters in the health process.

                          Coefficient    Standard error
Health Problems_{t−1}        3.914           0.148
(Age (years)−40)/10          0.150           0.037
Education                   −0.033           0.049
West                         0.091           0.046
Male                         0.102           0.048
Experience/10               −0.062           0.026
Intercept                   −2.317           0.086

Notes: Most of the moments are OLS regression coefficients from a regression of observed health status on the previous observation of health status, Health Problems_{i,t−4}, and explanatory variables. In addition we included the proportions of individuals whose health remains good, remains poor and changes from good to poor between adjacent surveys. Also see note for Table 2.

Unsurprisingly, the employment effects are largest for individuals aged 60 years and above, who are directly affected by the policy reform. Again, we find heterogeneity by gender, education and region. The effects are highest for east Germans and tend to be higher for women, while there is no sizeable difference by education. For east German women and men we find an increase in employment of about two percentage points around the age of 63 years. At the same time, retirement is postponed, which leads to the above-described increase in non-employment amongst the elderly.

As discussed above, the age-specific tax reform might induce behavioural effects for individuals younger than 60 years, who are not affected directly by the tax reform. These effects are due to anticipation effects which induce behavioural responses of younger individuals optimizing their life-cycle labour supply. The results show that before the age of 57 years behavioural responses are negligible. However, at ages just before the policy change becomes effective, the employment rate increases. Even stronger is the postponement effect for retirement, which occurs as individuals avoid moving into retirement in order to become eligible for the in-work credit after the age of 60 years.

The size of this anticipation effect depends on several features of the model, including the specification of intertemporal dependencies in preferences, modelled here with experience and the lagged dependent variable, and the mechanism for human capital accumulation, which here takes the form of years of previous employment. In addition, the magnitude of any anticipation effects is driven by the discount factor. We have assumed individuals to be forward looking with an annualized discount factor of 0.96. At the lower bound, with myopic individuals (δ = 0), the anticipation effects for the younger individuals would be zero. At the upper bound, with a discount factor of one, the behavioural responses of younger individuals would certainly be higher.
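The dependence of anticipation effects on the discount factor can be illustrated with a deliberately stylised example. The sketch below is not the model estimated in the paper: it is a deterministic, annual, two-choice problem (work or stay at home) solved by backward induction, in which a credit is paid to workers from age 60 and re-entering work after a spell at home carries an entry cost, standing in for state dependence. All parameter values and function names are hypothetical.

```python
def employment_path(delta, credit, ages=range(57, 65), credit_age=60,
                    wage=1.0, effort_cost=1.05, entry_cost=0.4,
                    initially_employed=True):
    """Optimal work decisions by age for a stylised worker.

    State: employed last period (1) or not (0). Utility from staying at home
    is normalised to zero. Working yields wage (+ credit from credit_age on)
    minus an effort cost, and minus an entry cost if not employed last period.
    """
    ages = list(ages)
    horizon = len(ages)
    value = [[0.0, 0.0] for _ in range(horizon + 1)]  # value[t][lagged state]
    policy = [[0, 0] for _ in range(horizon)]

    # Backward induction from the last age to the first.
    for t in reversed(range(horizon)):
        bonus = credit if ages[t] >= credit_age else 0.0
        for lag in (0, 1):
            v_home = delta * value[t + 1][0]
            v_work = (wage + bonus - effort_cost - entry_cost * (1 - lag)
                      + delta * value[t + 1][1])
            policy[t][lag] = int(v_work > v_home)
            value[t][lag] = max(v_work, v_home)

    # Forward simulation of the chosen path.
    lag = int(initially_employed)
    path = {}
    for t, age in enumerate(ages):
        lag = policy[t][lag]
        path[age] = lag
    return path


def anticipation_effect(delta, credit=0.5, credit_age=60):
    """Change in years worked before credit_age caused by the future credit."""
    with_credit = employment_path(delta, credit, credit_age=credit_age)
    without_credit = employment_path(delta, 0.0, credit_age=credit_age)
    return sum(with_credit[a] - without_credit[a]
               for a in with_credit if a < credit_age)


for delta in (0.0, 0.96):
    print(f"delta = {delta:4.2f}: extra years worked before age 60 = "
          f"{anticipation_effect(delta)}")
```

In this toy setting the myopic worker responds to the credit only once it is paid, whereas the forward-looking worker remains employed before age 60 to avoid the later entry cost, which mirrors the qualitative argument made in the text.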
[Figure 4, eight panels showing the percentage point change (−4 to 4) in full-time employment, non-employment and retirement against age in years (40 to 65): (a) Low educated, east German women; (b) Medium educated, east German women; (c) Low educated, west German women; (d) Medium educated, west German women; (e) Low educated, east German men; (f) Medium educated, east German men; (g) Low educated, west German men; (h) Medium educated, west German men.]
Figure 4. Employment and retirement effects of a tax reform affecting individuals aged 60 years and over.

8. CONCLUSION

In this paper, we have developed and estimated a dynamic structural life-cycle model of employment, non-employment and retirement that includes endogenous accumulation of human capital and intertemporal non-separabilities in preferences. In addition, and in contrast to most of the previous literature, the model accounts for the effect of income taxation on work incentives. We argue that such a model is required to represent accurately individuals’ labour supply incentives and to capture the various sources of dynamics in labour supply behaviour.

Based on panel data from the SOEP, we have estimated the parameters of a life-cycle labour supply model for single adult households without dependent children. The model fits the data well, including fitting accurately the distribution of wages, which are a key determinant of individuals’ labour supply decisions. In line with the previous literature, the estimation results show significant dynamic effects which occur through both state-dependent preferences and human capital accumulation. Furthermore, we find a significant effect of net income on the employment decision, which stresses the importance of a detailed modelling of the tax and transfer system.

The structural parameter estimates are used to evaluate the effects of a tax reform targeted at low-income working individuals on employment behaviour and retirement decisions. We find that the introduction of an in-work credit similar to the EITC leads to positive employment effects and to a postponement of retirement. Due to its focus on low-earning individuals, the effect of this policy is largest for individuals with lower earnings potentials, in particular for men and women in east Germany. We have also considered the labour market implications of an age-specific tax reform, such that only individuals aged 60 years and above are eligible to receive the credit. In this case, the policy leads to a large positive employment effect and a reduction in retirements among those aged 60 years and above. Also, due to the forward-looking nature of individuals’ labour supply decisions, individuals aged under 60 years, who are not affected directly by the policy, find it optimal to adjust their labour supply behaviour. Specifically, for individuals aged 57–60 years we find an increase in full-time employment and a strong postponement effect for retirement.
ACKNOWLEDGMENTS

The authors would like to thank two anonymous referees, participants at the 19th EC-squared Conference in Rome, at Statistics Norway and at Max-Planck Institute for Demographic Research for helpful comments. Bill Goeff kindly supplied MatLab code for simulated annealing. Computations were performed using facilities at the Oxford Supercomputing Centre.
REFERENCES

Adda, J., M. Costa Dias, C. Meghir and B. Sianesi (2007). Labour market programmes and labour market outcomes: a study of the Swedish active labour market interventions. Working Paper No. 27, Institute for Labour Market Policy Evaluation (IFAU).
Adda, J., C. Dustmann, C. Meghir and J.-M. Robin (2009). Career progression and formal versus on-the-job training. IFS Working Paper 09/06, No. 2260, Institute for Fiscal Studies.
Blank, R. (2002). Evaluating welfare reform in the United States. Journal of Economic Literature 40, 1105–66.
Blundell, R. (2000). Work incentives and ‘in-work’ benefit reforms: a review. Oxford Review of Economic Policy 16, 27–44.
Eckstein, Z. and K. Wolpin (1989). The specification and estimation of dynamic stochastic discrete choice models: a survey. Journal of Human Resources 24, 562–98.
Ferrall, C. (1997). Unemployment insurance eligibility and the school-to-work transition in Canada and the United States. Journal of Business and Economic Statistics 15, 115–29.
French, E. (2005). The effects of health, wealth, and wages on labour supply and retirement behaviour. Review of Economic Studies 72, 395–427.
Frijters, P. and B. van der Klaauw (2006). Job search with nonparticipation. Economic Journal 116, 45–83.
Haan, P. and V. Prowse (2009). The design of unemployment transfers: evidence from a dynamic structural life-cycle model. Working paper, University of Oxford.
Haan, P., V. Prowse and A. Uhlendorff (2008). Employment effects of welfare reforms: evidence from a dynamic structural life-cycle model. IZA Discussion Paper No. 3480, Institute for the Study of Labour (IZA).
Haisken De-New, J. and J. Frick (2005). Desktop Compendium to the German Socio-Economic Panel Study (SOEP). Berlin: German Institute of Economic Research (DIW).
Heckman, J. (1981). The incidental parameter problem and the problem of initial conditions in estimating a discrete time–discrete data stochastic process. In C. F. Manski and D. McFadden (Eds.), Structural Analysis of Discrete Data with Econometric Applications, 179–95. Cambridge, MA: MIT Press.
Hotz, V. J. and J. K. Scholz (2003). The earned income tax credit. In R. A. Moffitt (Ed.), Means-Tested Transfer Programs in the United States, 141–98. Chicago: University of Chicago Press.
Karlstrom, A., M. Palme and I. Svensson (2004). A dynamic programming approach to model the retirement behaviour of blue-collar workers in Sweden. Journal of Applied Econometrics 19, 795–807.
Keane, M. P. and K. I. Wolpin (1994). The solution and estimation of discrete choice dynamic programming models by simulation and interpolation: Monte Carlo evidence. Review of Economics and Statistics 76, 648–72.
Laibson, D., A. Repetto and J. Tobacman (2007). Estimating discount functions with consumption choices over the lifecycle. NBER Working Paper No. 13314, National Bureau of Economic Research.
Low, H., C. Meghir and L. Pistaferri (2009). Wage risk and employment risk over the life cycle. NBER Working Paper No. 14901, National Bureau of Economic Research.
Rust, J. and C. Phelan (1997). How social security and Medicare affect retirement behavior in a world of incomplete markets. Econometrica 65, 781–832.
Wolpin, K. I. (1992). The determinants of black–white differences in early employment careers: search, layoffs, quits, and endogenous wage growth. Journal of Political Economy 100, 535–60.
Yamada, K. (2007). Marital and occupational choices of women: a dynamic model of intra-household allocations with human capital accumulation. Working paper, School of Economics, Singapore Management University.
APPENDIX A: ESTIMATION OF THE HEALTH EQUATION

The sampled individuals were asked to record their health status only in the quarter when the annual survey took place. A standard probit model cannot therefore be used to estimate the parameters in equation (2.7), as health status in the previous quarter, Health Problems_{i,t−1}, is unobserved. Instead we use the MSM to estimate the unknown parameters. Table 5 reports the MSM parameter estimates. The coefficient on health status in the previous quarter is highly significant, indicating strong persistence in health status on a quarter-by-quarter basis. In addition we see that health tends to decline with age but improves with experience and education.
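The idea behind using simulated moments when health is observed only once a year can be sketched as follows. This is a minimal illustration under strong assumptions, not the authors' estimator: the health process is reduced to an intercept and a persistence term, everyone is assumed to start in good health, the moments are the three transition proportions mentioned in the notes to Table 5, and the distance uses identity weighting. All names and parameter values are hypothetical.

```python
import math
import random


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def simulate_annual_health(intercept, persistence, n_individuals, n_years, seed=0):
    """Simulate a quarterly probit health process, keeping one observation per year."""
    rng = random.Random(seed)
    histories = []
    for _ in range(n_individuals):
        health = 0  # assumption: 0 = good health at the start, 1 = health problems
        observed = []
        for quarter in range(4 * n_years):
            p_bad = normal_cdf(intercept + persistence * health)
            health = int(rng.random() < p_bad)
            if quarter % 4 == 3:  # the survey sees only every fourth quarter
                observed.append(health)
        histories.append(observed)
    return histories


def transition_moments(histories):
    """Shares of adjacent-survey pairs: stays good, stays poor, good to poor."""
    counts = {"stay_good": 0, "stay_poor": 0, "good_to_poor": 0}
    total = 0
    for observed in histories:
        for previous, current in zip(observed, observed[1:]):
            total += 1
            if previous == 0 and current == 0:
                counts["stay_good"] += 1
            elif previous == 1 and current == 1:
                counts["stay_poor"] += 1
            elif previous == 0 and current == 1:
                counts["good_to_poor"] += 1
    return {name: count / total for name, count in counts.items()}


def msm_distance(params, target_moments, n_individuals, n_years, seed):
    """Sum of squared gaps between simulated and target moments (identity weights)."""
    simulated = transition_moments(
        simulate_annual_health(*params, n_individuals, n_years, seed))
    return sum((simulated[k] - target_moments[k]) ** 2 for k in target_moments)


# Illustrative check: moments generated at known parameters are matched best there.
true_params = (-2.0, 3.0)
target = transition_moments(simulate_annual_health(*true_params, 300, 6, seed=1))
for candidate in (true_params, (-1.0, 1.0)):
    print(candidate, msm_distance(candidate, target, 300, 6, seed=1))
```

An estimator built on this idea would search over the parameters to minimise the distance; holding the random seed fixed across evaluations keeps the objective from jumping between candidate parameter values.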
APPENDIX B: ROBUSTNESS CHECKS

Laibson et al. (2007) discuss in detail the difficulties associated with identifying the coefficient of relative risk aversion, ρ. In the above analysis ρ = 1.5 was imposed. To check the robustness of our results with respect to the calibration of this parameter, we re-estimate the dynamic life-cycle model assuming higher risk aversion (ρ = 2.5). Figures B.1 and B.2 show the policy effects for the subgroups we have discussed previously in Section 7. The similarities between the estimated policy effects obtained using different values of ρ show that our conclusions do not strongly depend on the assumed degree of risk aversion.
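To give a sense of what changing ρ from 1.5 to 2.5 means, the sketch below uses the textbook constant relative risk aversion (CRRA) utility function. This functional form is an assumption for illustration: the exact utility specification and normalisation used in the paper are not restated in this appendix, and the income levels and function names are hypothetical.

```python
import math


def crra_utility(income: float, rho: float) -> float:
    """Textbook CRRA utility; reduces to log utility as rho approaches one."""
    if income <= 0.0:
        raise ValueError("income must be positive")
    if math.isclose(rho, 1.0):
        return math.log(income)
    return (income ** (1.0 - rho) - 1.0) / (1.0 - rho)


def utility_gain(income: float, transfer: float, rho: float) -> float:
    """Utility gain from receiving a transfer on top of a given income."""
    return crra_utility(income + transfer, rho) - crra_utility(income, rho)


# Hypothetical monthly net incomes (euros) and a 200 euro transfer.
for rho in (1.5, 2.5):
    gain_low = utility_gain(1000.0, 200.0, rho)
    gain_high = utility_gain(1500.0, 200.0, rho)
    print(f"rho = {rho}: gain at low income is {gain_low / gain_high:.2f} "
          f"times the gain at high income")
```

With greater curvature, the same transfer is worth relatively more at low incomes than at high incomes, which is why the assumed value of ρ could in principle matter for simulated responses to an income-tested credit.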
[Figure B.1, eight panels, (a) to (h), one for each combination of education (low, medium), region (east, west Germany) and gender, each showing the percentage point change (−4 to 4) in full-time employment, non-employment and retirement against age in years (40 to 65).]
Figure B.1. Robustness checks: employment and retirement effects of a tax reform affecting all individuals (ρ = 2.5).

[Figure B.2, eight panels, (a) to (h), one for each combination of education (low, medium), region (east, west Germany) and gender, each showing the percentage point change (−4 to 4) in full-time employment, non-employment and retirement against age in years (40 to 65).]
Figure B.2. Robustness checks: employment and retirement effects of a tax reform affecting individuals aged 60 years and over (ρ = 2.5).
SUPPORTING INFORMATION

Additional Supporting Information may be found in the online version of this article:

Appendix S1. Simulated and observed moments.

Please note: Wiley-Blackwell are not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.
The Econometrics Journal (2010), volume 13, pp. S126–S161.
doi: 10.1111/j.1368-423X.2010.00318.x

Structural dynamic model of retirement with latent health indicator

FEDOR ISKHAKOV†

†Department of Economics, University of Oslo, PO 1095 Blindern 0317 Oslo, Norway. E-mail: [email protected]

First version received: January 2009; final version accepted: February 2010
Summary This paper provides an empirical analysis of substitution between early retirement and disability as two major exit routes from the labour market in Norway. The analysis is based on a dynamic structural model that addresses the interplay between health, institutional constraints and economic incentives of men and women in the later part of their working lives. Unlike most previous research, which has typically used self-reported and indirect measures of disability conditions, in this paper health is modelled as a direct latent indicator of the eligibility to retire through the disability system. The model specifically accounts for the fact that employment may coincide with bad health when employees forgo disability in favour of more desirable retirement opportunities in the future. Norwegian register data are used for estimation. The substitution between disability and early retirement exits is investigated by simulating a complete elimination of the latter. The simulation suggests a moderate inflow of the displaced early pensioners into disability combined with partial employment and a negligible inflow into full-time disability.
Keywords: Disability, Dynamic programming, Early retirement, Health, Structural dynamic model.
1. INTRODUCTION
The increasing life expectancy, earlier withdrawal from the labour market and consequent growing threats to the financial stability of the social security systems in many countries, including Norway, have highlighted the importance of understanding retirement behaviour and the choice of retirement routes in greater detail. Much of the pioneering and more recent research on the labour supply decisions of older workers has focused on the effects of financial incentives on retirement, generally showing their powerful behavioural implications (Gordon and Blinder, 1980, Crawford and Lilien, 1981, Stock and Wise, 1990, Berkovec and Stern, 1991, Blau, 1994, Fabrizia Mealli, 1996, Rust and Phelan, 1997, Samwick, 1998, Gustman and Steinmeier, 2000, 2004, Hernæs et al., 2000, Blundell et al., 2002, Bratberg et al., 2004, Burkhauser et al., 2004, Heyma, 2004, Karlstrom et al., 2004, and French, 2005). In the development of econometric models of retirement, health has also been recognized as an important determinant of retirement behaviour (Bound, 1998, Dwyer and Mitchell, 1999, Bloom et al., 2004, Coile, 2004, Au et al., 2005, and Disney et al., 2006). In Norway, as much as one-third of all individuals employed at age 50 retire through the disability system before the old-age pension becomes available. In this way, the disability system becomes by far the most popular exit route at later ages, whereas
the early retirement programme comes second. In these circumstances, policy measures aimed at encouraging later withdrawal from the labour market must take into account both the usual old-age and disability retirement alternatives. However, the existing literature focusing on the interplay among different retirement routes is scarce. Only a few works (Rust and Phelan, 1997, Bound et al., 2010) explicitly model health dynamics in order to assess the feasibility of disability retirement for a given individual at a given time. Such an assessment is necessary in order to establish the precise sets of choices that the decision makers are facing and to determine when and under what circumstances different retirement opportunities may interact. In particular, the possibility that bad health is not reported when disability retirement is forgone in favour of employment with more beneficial retirement settings gives rise to concerns about potential spillovers among the retirement routes in the case of policy reform.
This paper provides an empirical analysis of substitution between early retirement and disability as two major exit routes from the labour market in Norway. The analysis is based on a structural dynamic model of retirement choices which is estimated jointly with a stochastic model of individual health shocks, involuntary lay-offs and demographic dynamics. In contrast to most papers in the field, I use administrative data which cover the entire population of Norway between 1992 and 2003 and estimate the model on a very large sample of Norwegian households.¹ As a result, the estimates of structural parameters obtained in the paper are very precise. Because of a deficiency of reliable measures of work limitations, I use a very specific approach to modelling health conditions.
As pointed out by many authors and systematized in Bound (1991), the use of health measures as explanatory variables for disability is subject to several complications.² Self-reported measures are believed to be error driven and endogenous, respectively, because respondents rarely use the same scale when answering health-related questions and are tempted to use health to rationalize their labour market outcomes. Applying more objective questions not directly linked to the respondents’ employment status is believed to yield more robust results, but these measures usually suffer from their categorical nature and scale simplicity. Finally, most measures of health embody a logical contradiction, as they measure physical conditions of wellbeing, which only partially correspond to the work limitations essential for disability retirement. In my case, the data used for estimation originate in the governmental registers, which hold only indirect and unreliable measures of health, such as the annual length of sick leave absence with no medical reference. Therefore I make use of a very specific interpretation of ‘health’, which is motivated both by the described inconsistencies and by the lack of relevant data. Health is defined as the eligibility for disability pension, that is, the disability conditions themselves. There are three possible values of health status: (α) good health, indicating no work limitations, with the disability option unfeasible; (β) bad health, which adds this option to the choice set while not necessarily leading to a disability application, thus enabling the bad health to be concealed; finally, (γ) very bad (worst) health, which narrows the choice set to a single option, namely forcing an individual into full-time disability.
¹ A total of 200,921 families observed in up to 12 consecutive years; see Section 3.3.
² Detailed literature review on health measures is available in Iskhakov (2008a).
Although this approach creates an unobservable state variable and results in a significant complication of the estimation procedure, it enables an accurate structural modelling of the disability exit and facilitates the crucial policy simulation by assessing the magnitude of unrealized retirement through the disability system. The simulated complete elimination of an early retirement programme suggests a moderate substitution effect with only about 5–8% of the
otherwise early retirees ending up as recipients of the disability benefit, in most cases combining partial disability and employment.
Suggestions about the possible interaction between early retirement and disability can be found in the previous literature; for example, Heyma (2004) mentions substitution between different exit routes among the primary determinants of retirement. Bound et al. (2010) find very little effect of changes in the social security retirement programme on disability applications, and argue that those potentially eligible for disability are a distinct group, which implies low substitution. The question of substitution between the Norwegian early retirement and disability programmes has been addressed in two separate papers and has raised some controversy. Bratberg et al. (2004) estimate a static discrete-choice model to investigate the interdependence of AFP (early retirement) and disability retirement and find clear signs of substitution between the two routes, with a magnitude of 8.6% to 22.4%. At the same time, Røed and Haugen (2003), using a quasi-natural experiment of an unexpected decrease in the early retirement age, find practically no substitution between the two exit routes. The latter finding is in line with a previous study on American data by Bound (1989). Neither of the two papers assessed underlying changes in the health status and the concealed eligibility for disability pension, the factors that come into play when one retirement option becomes unavailable. Therefore, the structural dynamic model with direct assessment of the disability option developed in the current paper enables a more accurate investigation of the degree to which early retirees would transfer to the disability programme if the early retirement option were eliminated.
The approach of treating health as an unobservable process has already been used in the literature (Bound, 1991).
To moderate the discrepancy between the endogenous self-reported and more objective health measures, Bound (1998) suggests modelling health as a latent variable using self-reported disability status as a proxy for the latent construct. In the current work, the absence of reliable instruments for disability conditions in the data leads to treating the latent health as a simple uncontrolled Markovian stochastic process with three states. This approach is in line with Rust and Phelan (1997) and French (2005), although considerations of tractability of the model forced a simplified transition probability matrix which does not take into account heterogeneity of health dynamics. The majority of the works in the field compromised the features of the model related to disability retirement even more, by either completely dropping the health status (Karlstrom et al., 2004, Jia, 2005) or treating health as an exogenous deterministic variable (Bound, 1991, Gustman and Steinmeier, 2002, Burkhauser et al., 2004, and Heyma, 2004).
Bound et al. (2010) present a recent example of comprehensive modelling of health dynamics endogenously within a structural dynamic model of retirement. They estimate the model of transition from full-time work to either disability or old-age retirement through a bridge job and potential spells of unemployment, using a sample of 196 single men. In the tradition of Bound (1998), health is modelled as a latent process, which is identified with both health-related objective measures and error-driven self-reports. Because of the small sample size, considerable attention is given to the initial selection process, but the model is solved separately for each individual, allowing the authors to account for much of the individual heterogeneity by including additional exogenous covariates. The model in this paper differs from Bound et al. (2010) in a number of respects.
Because of the comparatively more complicated design of the Norwegian social security system, my model gives a more detailed account of labour market transitions towards the end of one’s working life, allowing for eight states on the labour market. In particular, I model partial disability in combination with employment, the labour market state which turns out to be very important in the policy simulation, becoming the second most frequent
destination for the displaced early pensioners. The analysis indicates that the phased retirement option provided by partial disability is much more attractive to the retirees than full-time disability. In comparison with Bound et al. (2010), my model is formulated more generally and aggregates over much of the individual heterogeneity. Aggregation is a natural way to suit the model to a data set as large as the one used in this research. Besides, the model is applied in connection with the ongoing discussions of pension reform in Norway, and is therefore tailored for the analysis of drastic policy changes, where any additional assumptions on stationarity of exogenous (and thus perfectly foreseen) variables would jeopardize the validity of the simulation results. Therefore, the vector of state variables is exhaustive: all the constructs within the model are based only on the information contained in the state variables. This feature implies that policy simulations hinge solely on the estimates of structural parameters, and also differentiates my model from many other works which use exogenous variables to cut down the computational cost.
In addition, unlike Bound et al. (2010), I model retirement decisions of single and married individuals of both genders, and account for deaths of spouses and divorces leading to breakups of full households. Modelling the intra-household bargaining process is beyond the scope of this paper; instead, I utilize a unitary model for household preferences and assume that the families inherit the utility function from the primary spouse (defined in Section 3.3). This simplification allows me to treat both single individuals and two-person families in the same way, referring to the decision maker as an individual or a household interchangeably.
The essential differences between single individuals and full families reflected in the model are the differences in taxation of these two types of households, and the existence of additional income brought in by a spouse in the latter case. The coefficient on leisure in the utility function is also allowed to depend on the household type.
The model developed in the paper is solved and estimated using the framework of Rust and Phelan (1997), which is modified to correspond to the unique features related to the latent health status. The model is estimated by maximizing the integrated likelihood function, where the integration is performed over all the realizations of the latent health process which are consistent with the observations of a particular household. A simple parametrization of the health transition probability matrix allows all possible realizations of the process to be traced and eliminates the need for a simulation-based approach. Collecting the probability mass from all consistent realizations of the latent health process, however, adds yet another level of complexity to an already computationally intensive problem. Even though standard distributional assumptions for the stochastic part of the utility function make it possible to avoid evaluating multi-dimensional integrals when calculating choice probabilities, the complexity of the likelihood function does not permit analytical expressions for its derivatives and calls for numerical optimization methods. Meanwhile, even a single evaluation of the likelihood function proves to be very time consuming because of the large number of points of the state space where the value function must be calculated with backward induction, and the huge number of observations. Consequently, the estimation of the 32 structural parameters of the model was performed with the use of distributed computations and a multi-stage estimation strategy inspired by Rust (1994).
The rest of the paper is organized as follows. The next section summarizes the social security system and describes potential routes of retirement in Norway. Section 3 describes the theoretical model and the estimation technique. Sections 4 and 5 present, respectively, estimated beliefs and preferences. The last section presents a policy simulation and is followed by the concluding remarks.
2. RETIREMENT OPTIONS OF NORWEGIAN ELDERLY
The economic incentives created by the institutional settings of the Norwegian social security system have been shown to be the most important driving force of the retirement process in the country (Hernæs et al., 2000). Normal old-age pension, early retirement and disability are the three main routes of withdrawal from the labour market among the Norwegian elderly (Figure 1). The normal retirement age is 67 for both genders, although three more years of employment are commonly used to gain full pension rights, which are conditional on at least 40 years of working history.³ The pension benefit is very generous, amounting to an after-tax replacement ratio of about 65%. A 40% earnings test of the pension benefits after the age of 67 limits the number of individuals combining work and pension to the very minimum. Because no additional pension accrues after the age of 70, virtually everybody is retired by this age.
Old-age pension (NIS), which is provided for all permanent residents, consists of three components: a basic amount, an earnings-based part and a special supplement. The basic amount (denoted throughout this paper as G) serves as a unitary measure of all social security benefits and is indexed each year, which has resulted in its growth in real terms over the last several decades. The earnings-based component is proportional to the average of the 20 highest pension points throughout life, where each pension point is derived from the corresponding annual earnings with a piecewise linear concave transformation bounded above such that earnings above 12G do not contribute to the pension benefit. A special supplement to the old-age pension differs for married and single pensioners, and is also adjusted on an annual basis. It is fully tested against the earnings-based component, regulating the minimal level of the pension benefit.
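As a rough illustration of the earnings-based component described above, the following Python sketch maps annual earnings (in units of G) into pension points with a piecewise linear concave rule and then averages the 20 highest points. The text states only that the transformation is concave and that earnings above 12G do not count; the 1G floor, the 6G kink and the one-third slope used below are assumptions chosen for illustration, and the function names are hypothetical.

```python
def pension_points(earnings_g, floor=1.0, kink=6.0, cap=12.0, upper_slope=1.0 / 3.0):
    """Map annual earnings (in units of G) to pension points.

    Only the 12G cap comes from the text; floor, kink and upper_slope
    are illustrative assumptions, not the statutory values.
    """
    # Segment with slope 1 between the floor and the kink.
    lower = min(max(earnings_g - floor, 0.0), kink - floor)
    # Flatter segment between the kink and the cap; nothing accrues above the cap.
    upper = min(max(earnings_g - kink, 0.0), cap - kink) * upper_slope
    return lower + upper


def average_top_points(earnings_history_g, n_best=20):
    """Average of the n_best highest pension points over a career."""
    points = sorted((pension_points(e) for e in earnings_history_g), reverse=True)
    best = points[:n_best]
    return sum(best) / len(best) if best else 0.0


# Hypothetical career: earnings in units of G, one entry per year.
career = [3.0] * 25 + [8.0] * 5 + [14.0] * 2
print(average_top_points(career))
```

The concavity shows up as a smaller gain in points per additional G above the kink than below it, and the cap makes 13G and 12G equivalent for pension purposes.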
³ Many of the institutional settings described here are subject to change under the Norwegian pension reform which will be in effect from January 1, 2011. Iskhakov (2008b) provides a thorough description of the changes.
The early retirement option (the so-called AFP programme) was introduced and is regulated within the collective wage agreements between the Norwegian organization of labour unions and the organization of employers, and is also overseen and sponsored by the government. It is available to roughly two-thirds of the Norwegian labour force (Midtsundstad, 2004). Conditional on certain individual criteria ensuring a consistent employment history, all employees in the public sector and large private companies have an option to retire earlier than age 67 with an extremely generous pension settlement. The age of earliest possible retirement through the AFP programme has been lowered from age 66 in 1989, when the scheme was initially introduced, to age 62 by 1998, thus making it cohort specific. The main individual AFP eligibility criteria include: being employed by a single participating company in the last three years or different participating companies in the last five years; having an annual wage over 1G in the last two years prior to early retirement; having at least 10 years of wage earnings above 1G after the age of 50; and having the average of the 10 highest annual wage earnings since 1967 over 2G (Røgeberg, 2000). The pension benefit under the AFP programme is calculated identically to the usual old-age pension, except for the missed pension points corresponding to the years between the early retirement age and age 67. These points are forecasted with the maximum of the preceding year's earnings and the average of the three highest annual wage earnings throughout the retiree’s career. Thus, not only is early retirement not punished by any reduction of the pension benefits, but given the declining earnings profile in the last years of employment, it also introduces such a strong incentive to retire that the majority
[Figure 1: shares of the population (vertical axis, 10% to 100%) in the labour market states Out of labour market, Unemployment, Disability, Pension, Full-time employment and Employment + disability, by age (horizontal axis, 50 to 70).]
Figure 1. Fractions of the population in different labour market states by age (conditional on labour market participation at age 50).
of those eligible leave the labour force and collect AFP benefits at their first available opportunity (Tysse, 2001).
In contrast to old-age and AFP retirement, disability pension carries no age requirement for the retiree and therefore serves as the path into retirement at earlier ages. As shown in Figure 1, the majority of people out of the labour market before the age of 67 choose this route.⁴ Similar to an old-age pension, disability pension is integrated into the social security system and covers all residents of Norway. Permanent full- or part-time disability is a terminal state in a complicated multi-stage process in which a typical retiree passes through different social support programmes (such as long-term sick leave, medical and vocational rehabilitation, and potential spells of unemployment) before being granted disability pension. Fevang and Røed (2006) report that about 84.8% of all permanently disabled retirees follow the route originating in long-term sick leave, about 8.5% first lose their job, and only 6.7% of cases result from injuries or inborn disabilities. Medical screenings have to be passed at several stages, and the rejection rate for disability applications is about 20%, as reported by Kristoffersen and Sagsveen (2005). The benefit calculation rules at all stages of the path to disability are such that prospective disability pensioners do not incur substantial monetary cost. The initial period of sick leave may last up to 52 weeks with the salary fully replaced by the social insurance benefit.
⁴ In terms of net outflows, disability appears to be the main exit before 62 and the second most often used exit between 62 and 66.
Medical rehabilitation programmes have similar regulations, whereas vocational rehabilitation programmes may last up to three years with a 66% replacement ratio (Fevang and Røed, 2006). Full-time disability pension is calculated using the same principles as the AFP pension, in that normal pension benefit formulas are applied to the forecasted stream of annual wage earnings (transformed into pension points). The forecasting rules, however, take into account a larger number of annual earnings and are less generous. The missing pension points up to the age of 67 are substituted with the maximum of the average of the points earned in the last three years and the average of half of the points earned prior to the disability application. In the latter average, the highest points in the working history are used (Holen, 2008). This calculation is used only when the additional requirement of positive wage earnings in one of the three years prior to the disability application is satisfied. Otherwise, the disability benefit is set to a predefined minimal level.
Other retirement options contained in the out of labour market (OLM) state in Figure 1 include all sorts of occupational and private pension plans as well as individual specific options such as personal savings and annuities, which are hard to characterize with the register data.⁵ People taking this option generally do not meet the requirements of the disability or AFP pension; they accumulate in the OLM state up to the age of 66 and start reappearing as normal old-age pension recipients at the age of 67.
This short overview of the options available to a prospective Norwegian retiree reveals the complexity and the dynamic nature of the decision problem, which may induce certain behavioural patterns around the ‘kink’ points in the institutional settings.
A clear dominance of the AFP pension benefits over disability pension benefits in terms of replacement ratios is reported in the previous literature, making early retirement through an AFP programme the most beneficial retirement path (Røed and Haugen, 2003). Trying to steer into the early retirement path, workers may look harder for jobs at the participating companies and display higher labour force attachment rates in some years prior to the age of 62.⁶ This strategy has both opportunity costs and risks associated with it; namely, an involuntary lay-off in the last three years before early retirement may lead to the loss of eligibility for the programme. Retirement through disability, although less rewarding, may be less risky given that the necessary medical conditions can be approved.⁷ Holen (2008) finds some empirical evidence of rational planning towards this route, showing that future disabled retirees increase their labour supply in the last three years of work, thus increasing their disability benefits. In addition, the application for disability pension can be filed at any age, and because the benefits are not income tested, disability may provide a way of gradually phasing from employment into retirement.
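The AFP eligibility criteria and the two forecasting rules for missing pension points described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the statutory rules: the function names and the data layout (yearly histories, oldest year first, wages in units of G, starting no earlier than 1967) are hypothetical, "after the age of 50" is read as age 50 or older, and "half of the points" is rounded down.

```python
def mean(xs):
    return sum(xs) / len(xs)


def afp_eligible(wages_g, ages, employers, afp_companies):
    """Check the four individual AFP criteria listed in Section 2.

    wages_g, ages and employers are equal-length yearly histories, oldest
    first, ending in the year before early retirement (hypothetical layout).
    """
    last3, last5 = employers[-3:], employers[-5:]
    # One participating employer for three years, or participating employers for five.
    single = len(last3) == 3 and len(set(last3)) == 1 and last3[0] in afp_companies
    several = len(last5) == 5 and all(e in afp_companies for e in last5)
    # Annual wage above 1G in each of the last two years.
    recent_wage = len(wages_g) >= 2 and all(w > 1.0 for w in wages_g[-2:])
    # At least 10 years above 1G after age 50 (read here as age >= 50, an assumption).
    years_above_1g = sum(1 for w, a in zip(wages_g, ages) if a >= 50 and w > 1.0)
    # Average of the 10 highest annual wages above 2G.
    top10 = sorted(wages_g, reverse=True)[:10]
    high_avg = len(top10) == 10 and mean(top10) > 2.0
    return (single or several) and recent_wage and years_above_1g >= 10 and high_avg


def afp_forecast(wages_g):
    """AFP rule: max of last year's earnings and the mean of the three best years."""
    return max(wages_g[-1], mean(sorted(wages_g, reverse=True)[:3]))


def disability_forecast(points):
    """Disability rule: max of the mean of the last three years and the mean
    of the best half of all points (rounding of 'half' is an assumption)."""
    half = max(1, len(points) // 2)
    return max(mean(points[-3:]), mean(sorted(points, reverse=True)[:half]))
```

With a declining earnings profile the AFP rule falls back on the three best career years, whereas the disability rule averages over the best half of the history, which is why the latter is described as less generous.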
⁵ Direct assessment through tax files is used to predict this income, as described in Appendix B.
⁶ Higher participation rates among prospective AFP pensioners prior to retirement were indeed shown by Røed and Haugen (2003).
⁷ In the last decade, medical literature on disability in Norway reports increasing use of non-verifiable disability conditions such as back pain.
3. STRUCTURAL DYNAMIC MODEL OF RETIREMENT BEHAVIOUR
The structural dynamic model of retirement developed in this paper replicates the described institutional arrangements of the Norwegian social security system, and assesses individual preferences, choice sets at different time periods and beliefs about the consequences of the current decisions on the future periods’ outcomes.
3.1. Theoretical description
Let the vector $s_t \in S$ contain the state variables representing the full set of socioeconomic factors affecting the agents’ decision making at period $t$ ($S$ is the corresponding state space). Assume that the evolution of the state vector (which is stochastic at least in part) is governed by the collection of Markovian transition probabilities $\{p(s_t \mid s_{t-1}, d_{t-1})\}$, which are controlled by the decision variable $d_t \in D$ and represent the beliefs of the decision makers about the future consequences of their choices. Assume further that, in response to the realization of the state vector, the agent chooses actions $d_t = \delta_t(s_t)$ so as to maximize the expected discounted lifetime utility, or in other words solves the sequential decision problem
$$\max_{\delta \in F} E\left\{ \sum_{t=T_0}^{T} \left( \prod_{\tau=T_0}^{t} \rho_\tau \right) \beta^{t-T_0}\, U(d_t, s_t) + \left( \prod_{\tau=T_0}^{T} \rho_\tau \right) \beta^{T-T_0}\, \Omega(s_T) \right\}, \tag{3.1}$$
where the expectation is taken with respect to the set of transition probabilities $\{p(s_t \mid s_{t-1}, d_{t-1})\}$ and $\rho_\tau$ denotes the probability of survival from period $\tau - 1$ to period $\tau$. $U(d_t, s_t)$ is the instantaneous indirect utility at period $t$, which is discounted with the intertemporal utility discount factor $\beta$. The time index in the model is identical to the age of the decision makers. The limits $T_0$ and $T$ are set so that the most relevant life span is covered, namely $T_0 - 1 = 50$, to include a sufficient number of years before possible retirement in order to capture potential dynamic responses to policy changes, and $T = 70$, since no transfers between labour market states occur and no decisions are made after this age. An additional termination function, written here as $\Omega(s_T)$, captures the remaining lifetime utility after age 70. The maximization in (3.1) is performed with respect to the decision rules $\delta = (\delta_{T_0}, \ldots, \delta_T) \in F$, which are chosen from the class of feasible decision rules $F$. The feasibility conditions are expressed in a family of choice sets $D_t(s_t, d_{t-1}) \subset D$ that represent the available options at period $t$. A decision rule $\delta$ is said to be feasible if and only if $\delta_t(s_t) \in D_t(s_t, d_{t-1})$ for each $t \in \{T_0, \ldots, T\}$. In other words, the class $F$ can be represented by a Cartesian product of the choice sets, $F = \{(\delta_{T_0}(s_{T_0}), \ldots, \delta_T(s_T)) : (\delta_{T_0}(s_{T_0}), \ldots, \delta_T(s_T)) \in \bigotimes_{t=T_0}^{T} D_t(s_t, d_{t-1})\}$. The agent's sequential decision problem (3.1) is solved and estimated using the standard procedure developed in Rust (1994). The utility function $U(d_t, s_t)$ is formulated as a random utility
$$U(d_t, s_t) = u(d_t, s_t) + \varepsilon_t[d_t], \tag{3.2}$$
where $u(d_t, s_t)$ is a non-stochastic component of the utility to be specified below and $\varepsilon_t[d_t]$ is the component of the vector $\varepsilon_t \in \mathbb{R}^{|D|}$ corresponding to the decision $d_t$. Under the assumption of conditional independence of $\varepsilon_t$, namely $P(s_t, \varepsilon_t \mid s_{t-1}, \varepsilon_{t-1}, d_{t-1}) = F(\varepsilon_t \mid s_t) \cdot P(s_t \mid s_{t-1}, d_{t-1})$, and if the components of $\varepsilon_t$ are independent and identically distributed with the extreme value distribution,⁸ the optimal decision rule $\delta^* = (\delta^*_{T_0}(s_{T_0}, \varepsilon_{T_0}), \ldots, \delta^*_T(s_T, \varepsilon_T))$ is given by $\delta^*_t(s_t, \varepsilon_t) = \arg\max_{d_t \in D_t(s_t)} \{v_t(d_t, s_t) + \varepsilon_t[d_t]\}$, $t \in \{T_0, \ldots, T\}$, where the value function $v_t(d_t, s_t)$ is defined by the recursion (with $\Omega(s_T)$ the termination function)
$$v_t(s_t, d_t) = \begin{cases} u(d_T, s_T) + \Omega(s_T), & t = T, \\ u(d_t, s_t) + \rho_t \beta \cdot E[v_{t+1}(s_{t+1}, d_{t+1}) \mid s_t, d_t], & t < T. \end{cases} \tag{3.3}$$
⁸ C.d.f. $F(\varepsilon_t \mid s_t) = \prod_{d \in D(s_t)} \exp\{-\varepsilon_t[d] + \gamma\} \cdot \exp\{-\exp(-\varepsilon_t[d] + \gamma)\}$, $\gamma = 0.577$.
The expectation in the second line of (3.3) is taken with respect to the transition probabilities $P(s_t, \varepsilon_t \mid s_{t-1}, \varepsilon_{t-1}, d_{t-1})$ of the expanded stochastic process $\{d_t, s_t, \varepsilon_t\}_{\delta^*}$ induced by the optimal decision rule $\delta^*$. Rust (1994) shows that under the mentioned assumptions the observed partial stochastic process $\{d_t, s_t\}_{\delta^*}$ induced by the optimal decision rule $\delta^*$ is Markovian with non-stationary transition probabilities $q_t(s_t, d_t \mid s_{t-1}, d_{t-1}) = P_t(d_t \mid s_t) \cdot p(s_t \mid s_{t-1}, d_{t-1})$, where $P_t(d_t \mid s_t)$ is a well-defined probability distribution given by
$$P_t(d_t \mid s_t) = \frac{\exp\{v_t(d_t, s_t)\}}{\sum_{d' \in D(s_t)} \exp\{v_t(d', s_t)\}} \tag{3.4}$$
and the expectation in the value function expression (3.3) can be calculated with the log-sum formula
$$E[v_{t+1}(s_{t+1}, d_{t+1}) \mid s_t, d_t] = \sum_{s_{t+1} \in S} \log\left( \sum_{d_{t+1} \in D(s_{t+1})} \exp\{v_{t+1}(d_{t+1}, s_{t+1})\} \right) p(s_{t+1} \mid s_t, d_t).$$
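The recursion (3.3), the log-sum formula and the choice probabilities (3.4) can be sketched on a small discrete problem as follows. This is a minimal pure-Python illustration, not the estimation code used in the paper: the function names, the nested-list layout and the convention of marking infeasible decisions with a utility of minus infinity are assumptions made for the example, and every state is assumed to have at least one feasible decision.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))); tolerates -inf entries."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def solve_backward(u, p, rho, beta, omega):
    """Backward induction for recursion (3.3).

    u[t][d][s]  : deterministic utility; -inf marks an infeasible decision
    p[d][s][s2] : transition probability from state s to s2 under decision d
    rho[t]      : survival probability; beta: discount factor
    omega[s]    : termination value added in the last period
    Returns (v, P): value functions v[t][d][s] and choice probabilities P[t][d][s].
    """
    n_t, n_d, n_s = len(u), len(u[0]), len(u[0][0])
    v = [None] * n_t
    # Last period: first line of (3.3).
    v[-1] = [[u[-1][d][s] + omega[s] for s in range(n_s)] for d in range(n_d)]
    for t in range(n_t - 2, -1, -1):
        # Log-sum over next-period decisions, one value per next-period state.
        ls = [logsumexp([v[t + 1][d][s2] for d in range(n_d)]) for s2 in range(n_s)]
        # Second line of (3.3): expectation over next-period states.
        v[t] = [[u[t][d][s] + rho[t] * beta * sum(p[d][s][s2] * ls[s2] for s2 in range(n_s))
                 for s in range(n_s)] for d in range(n_d)]
    # Logit choice probabilities (3.4).
    P = [[[math.exp(v[t][d][s] - logsumexp([v[t][k][s] for k in range(n_d)]))
           for s in range(n_s)] for d in range(n_d)] for t in range(n_t)]
    return v, P


# Toy problem: two periods, two decisions, one state, zero utilities.
u = [[[0.0], [0.0]], [[0.0], [0.0]]]
p = [[[1.0]], [[1.0]]]
v, P = solve_backward(u, p, rho=[1.0, 1.0], beta=1.0, omega=[0.0])
print(v[0][0][0], P[0][0][0])
```

In the toy problem both decisions are equally attractive, so each is chosen with probability one half and the first-period value equals the log-sum term, log 2.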
Along with the set of transition probabilities $\{p(s_t \mid s_{t-1}, d_{t-1})\}$, the choice probabilities $\{P_t(d_t \mid s_t)\}$ serve as the basis for the construction of the likelihood function.

3.2. Empirical specifications

The vector of state variables $s_t = (ps_t, h_t, m_t, e_t, sp_t, nw_t, aw_t, afpage, gender)$ and the scalar decision $d_t \in D = \{0, \ldots, 4\}$ in the current model are defined as follows.

$ps_t \in \{0, \ldots, 7\}$ denotes the previous period state on the labour market. Labour market states are constructed on the basis of the registered affiliations of a given individual with an employer or a certain social security programme. The following exhaustive set of mutually exclusive labour market states was formed after analysing all affiliations observed in the data, the less frequent of which were merged together (Iskhakov, 2008a):⁹

$ps_t = 0$: the out of labour market (OLM) state combines self-employed individuals, housewives, those retired with private retirement plans and other individuals who could not be attributed to any of the next seven labour market states.
$ps_t = 1$: full-time retirement, identified in the data by the individual’s record in either the old-age or the AFP pension register. In addition, individuals not employed after the age of 66 and everybody at age 70 are classified as retired.
$ps_t = 2$: full-time retirement through disability, identified by the individual’s record in the register of disabled, possibly combined with unemployment or a specific pension before 62, but not with employment.
$ps_t = 3$: unemployment (including partial), identified by the individual’s registration in the unemployment register for more than six months in a particular year.
$ps_t \in \{4, 6\}$: employment in non-AFP and AFP companies, respectively, identified by the individual’s record in the employment register combined with either more than 30 registered working days or registered wage income greater than 1G. AFP and non-AFP companies differ in their participation in the early retirement agreement and are identified by tracking the last employer of all registered AFP pensioners.
$ps_t \in \{5, 7\}$: employment in non-AFP and AFP companies, respectively, combined with disability benefit. Partial disability is the only form of partial retirement (combination of labour market states) considered; other types of phased retirement are disregarded.

⁹ I decided to distinguish ‘state’ in the dynamic programming sense from ‘labour market state’ by keeping the latter as the definitive expression even though some labour market states actually indicate absence from the labour market.

The last five positions ($ps_t \in \{3, 4, 5, 6, 7\}$) correspond to the active labour market states in which all agents are initially observed (see Section 3.3 for the sample definition) and among which any transitions are possible. The first three positions ($ps_t \in \{0, 1, 2\}$) are assumed to be jointly absorbing (no transfers from inactive to active labour market states are allowed), while the pension state is absorbing on its own. These assumptions are commonly used in the retirement literature and represent a stylized fact of the Norwegian retirement process.

$h_t \in \{0, 1, 2\}$ is the individual health status. As described in the introduction, health is defined directly as eligibility for disability pension. Thus, $h_t = 0$ is good health with no option to retire through disability, $h_t = 1$ gives an option to become fully or partially disabled, while $h_t = 2$ immediately leads to full-time disability. Therefore, certain information on health status is recoverable from the observations of the occupied labour market states, but the hypothesis that bad health can be concealed until a convenient retirement opportunity comes around makes it impossible to uncover the health variable completely from the data. I assume health to be a Markov process which evolves independently of the other state variables with the transition probability matrix

$$\bigl[\pi^{(h)}_{ij}\bigr]_{i,j \in \{0,1,2\}} = \begin{bmatrix} \pi^h_{00} & \pi^h_{01} & \pi^h_{02} \\ 0.0 & \pi^h_{11} & \pi^h_{12} \\ 0.0 & 0.0 & 1.0 \end{bmatrix}, \tag{3.5}$$
where unspecified matrix elements are parameters. Because of the need to keep the model computationally tractable and to ease identification of the health transition probabilities, matrix (3.5) is assumed to take the simplest form, implying identical risks of future health shocks for all individuals. Health deterioration is assumed to be permanent because there are no observations of individuals leaving the disability state except for the trivial transfer to old-age pension at the age of 67.

$m_t \in \{0, 1, 2\}$ denotes the latent process of job match, which regulates job losses as well as employment at each of the two types of companies recognized in the model. If $m_t > 0$ there is a job opening in the current period ($m_t = 1$ in a non-AFP company, $m_t = 2$ in an AFP company); otherwise the individual is forced into unemployment and possibly into full-time disability. I assume the matching process to depend on health in a way that reflects the limited labour market opportunities of individuals in bad health. Consider a transition probability matrix for $m_t$ of the form

$$\bigl[\pi^{(m)}_{ij}\bigr]_{i,j \in \{0,1,2\}} = \begin{bmatrix} \pi^m_{00} + (\pi^m_{01} + \pi^m_{02})(1 - \omega) & \pi^m_{01}\,\omega & \pi^m_{02}\,\omega \\ \pi^m_{10} + (\pi^m_{11} + \pi^m_{12})(1 - \omega) & \pi^m_{11}\,\omega & \pi^m_{12}\,\omega \\ \pi^m_{20} + (\pi^m_{21} + \pi^m_{22})(1 - \omega) & \pi^m_{21}\,\omega & \pi^m_{22}\,\omega \end{bmatrix}. \tag{3.6}$$

Parameter $\omega$ takes value 1 when $h_t = 0$ and value 0 when $h_t = 2$. When $h_t = 1$, parameter $\omega$ redistributes probability mass away from the last two columns, which correspond to job openings of both types, independently of the previous value $m_{t-1}$. Again, for reasons of computational tractability, transition probabilities for $m_t$ depend only on health and do not account for other sources of individual heterogeneity. Also, without loss of generality, when $h_t = 2$ or $ps_t \in \{0, 1, 2\}$ variable $m_t$ is assumed to take the value of zero with probability one.

$e_t \in \{0, 1\}$ denotes the fulfilment of the individual criteria for AFP eligibility (described in Section 2). Namely, $e_t = 1$ implies that the individual AFP requirements are met, which together with employment at an AFP participating company in the previous period ($ps_t \in \{6, 7\}$) introduces the early retirement option into the choice set. Because this option is only available after the AFP eligibility age and before age 67, $e_t$ is assumed to be constant outside of this time period. The cohort-specific earliest age of AFP retirement is denoted $afpage$ and enters the state space of the model as a time-invariant exogenous variable.

$sp_t \in \{0, 1\}$ indicates full-family households; namely, $sp_t = 1$ indicates the existence of a spouse and $sp_t = 0$ corresponds to a single individual household. As mentioned in the introduction, the only differences between the two household types acknowledged in this paper involve the tax function and additional incomes generated by a spouse. The risk of death of a spouse and the breakup of a household (divorce) are reflected by the dynamics of $sp_t$, which is modelled as a Markov process with empirically approximated transition probabilities computed jointly for the events of spouse death and divorce on the basis of the sample data. Only a few new marriages are observed in the considered age groups; they are therefore neglected in the paper. The gender of the primary spouse in the household is controlled for and recorded in the time-invariant exogenous variable $gender$.¹⁰

$aw_t \in \mathbb{R}_+$ represents the lifetime trend in the wage income flow of the agent and is calculated as the average of the highest 20 annual wage earnings up to period $t$ (measured in 1000 NOK in 1992 prices).
This is the only continuous state variable in the model; it carries most of the information about individual household heterogeneity and serves as the foundation for forecasting different sources of household income, discussed further in Section 5 and Appendix B. Because of its definition, the aggregated wage $aw_t$ is non-decreasing and evolves extremely smoothly. Let $w_{(1)} \le \cdots \le w_{(t)}$ denote the ordered sequence of annual wage earnings during the whole working life up to $t$. By definition, $aw_t = \frac{1}{20} \sum_{k=t-19}^{t} w_{(k)}$. It is straightforward to show that $aw_{t+1} = aw_t + \frac{1}{20} \max(0, w_{t+1} - w_{(t-19)})$. The available 2,268,837 observations of the aggregated wage allow for the estimation of the corresponding recursive equation¹¹

$$aw_t = \underset{(0.0002)}{1.0002}\; aw_{t-1} + \underset{(0.005)}{0.144} + \underset{(0.004)}{2.695} \cdot 1(ps_{t-1} \in \{3, 4, 6\}). \tag{3.7}$$
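The exact top-20 recursion and the fitted rule (3.7) can be sketched as follows. This is only an illustration: the function names are hypothetical, the wage history is made up, and the coefficients are the point estimates quoted in (3.7).

```python
def top20_average(wages):
    """Average of the 20 highest annual wages observed so far (needs >= 20 years)."""
    best = sorted(wages)[-20:]
    return sum(best) / 20.0

def update_aw(aw, smallest_of_top20, new_wage):
    """Exact update: a new wage only matters if it displaces the smallest of the top 20."""
    return aw + max(0.0, new_wage - smallest_of_top20) / 20.0

def predicted_aw(aw_prev, ps_prev):
    """Motion rule (3.7), using the reported point estimates."""
    active = 1.0 if ps_prev in (3, 4, 6) else 0.0
    return 1.0002 * aw_prev + 0.144 + 2.695 * active

# Made-up history of 25 annual wages (1000 NOK): 100, 101, ..., 124.
history = [100.0 + i for i in range(25)]
aw_now = top20_average(history)              # average of 105, ..., 124
aw_next = update_aw(aw_now, 105.0, 130.0)    # the new wage 130 replaces 105
aw_check = top20_average(history + [130.0])  # recomputed from scratch
```

Here `aw_next` and `aw_check` coincide, which is the content of the recursion: the average can only rise, and only by one twentieth of the amount by which the new wage exceeds the wage it displaces.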
The estimates reflect the fact that after age 50 the chosen aggregated measure of wages is extremely stable, with a significant increase only for those still active on the labour market (possibly partially unemployed, but not partially disabled). For individuals no longer active on the labour market the value of $aw_t$ remains nearly constant. Given its high goodness of fit, I treat equation (3.7) as the deterministic motion rule for the aggregated wage.

$nw_t \in \{0, 1, \ldots, 10\}$ denotes the number of consecutive years before and excluding period $t$ in which wage earnings exceeded the basic pension amount G. Similar to the previous variable, this measure of the short-term tendency in the income flow of the household is relevant for the calculation of different social security benefits. The values of $nw_t$ are directly calculated from the data on earning histories and are truncated at 10 to reduce the number of points in the state space. It turns out that higher values do not bear considerable additional information. The definition of $nw_t$ implies a very simple motion pattern: the value in the next period either grows by unity (up to the threshold) or drops to zero, depending on the income level in the current period.

¹⁰ See Section 3.3 for the formal definition of primary spouse.
¹¹ $1(\cdot)$ is the indicator function, which returns one if the condition is satisfied and zero otherwise. Estimation was carried out with simple OLS on pooled data.

The complete vector of state variables $s_t = (ps_t, h_t, m_t, e_t, sp_t, nw_t, aw_t, afpage, gender)$ contains all the information used in the model to describe the decision-making process. In contrast to the common practice in the literature, as mentioned in the introduction, no additional exogenous variables enter into the model. Such variables would require the assumption of perfect foresight on the part of the decision makers, which for many variables would be hard to justify and would therefore be highly undesirable from the point of view of the main application of the model, namely policy simulations. The drawback of this approach is a greater degree of simplification of all the processes in the model, and thus a relatively lower goodness of fit.

The decision variable in the current model is a scalar defined on the decision space $D = \{0, \ldots, 4\}$ with the following interpretation:

$d_t = 0$: the agent remains on the labour market and does not apply for any pension;
$d_t = 1$: the agent applies for disability benefits, but remains on the labour market;
$d_t = 2$: the agent retires and applies for disability benefits;
$d_t = 3$: the agent retires and applies for old-age or AFP pension;
$d_t = 4$: the agent leaves the labour market, but does not apply for any pension.

The decision variable indicates the intention of the agent to acquire a certain position on the labour market, which is matched against the current ‘state of nature’ to determine the actual labour market outcome.
The separation of decisions and labour market states differentiates this model from the majority of the literature on static discrete choice and from many of the earlier works on dynamic models of retirement, and allows for an explicit description of how discrete choices are made within dynamically changing stochastic choice sets. In some circumstances, the choice set collapses to a single element corresponding to a situation with no choice (e.g. to $d_t = 2$ in the case of a severe health shock), but for the sake of a uniform structure of the model such cases are treated within the common framework.

At the beginning of each period random health $h_t$, match $m_t$ and AFP eligibility $e_t$ are realized, and together with the previous period labour market state $ps_t$ and age $t$ they define the current period choice set $D_t(s_t, d_{t-1}) = D_t(ps_t, h_t, m_t, e_t)$. Definitions of the choice sets are shown in the first and second parts of Table 1: the choice set $D_t(ps_t, h_t, m_t, e_t)$ includes only those values of the control variable $d_t$ (part A in Table 1) whose conditions in part B are satisfied by the combination of values $(t, ps_t, h_t, m_t, e_t)$. The conditions laid out in part B of Table 1 replicate the institutional settings described in Section 2 and the interpretations given to the different values of the state variables. For example, retiring with an old-age pension ($d_t = 3$) is possible either when working at an AFP company ($ps_t \ge 6$) and satisfying the individual requirement ($e_t = 1$) at the proper age ($t \ge afpage$), or when already retired through AFP ($ps_t = 1$, $e_t = 1$, $afpage < t < 67$), or unconditionally after the normal retirement age of 67. Similarly, for those still active on the labour market ($ps_t \ge 3$) worsening health ($h_t = 1$) adds the option of applying for disability ($d_t = 1, 2$) for each value of the match variable ($m_t = 0, 1, 2$). Once the current choice set is defined and the spouse indicator $sp_t$ is revealed, the best alternative is chosen by the decision maker.
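The two examples just given can be written as simple predicates. This is a sketch that encodes only the conditions stated in the text above, not the full Table 1; the function names are hypothetical.

```python
def pension_in_choice_set(t, ps, e, afpage):
    """True if d_t = 3 (retire with old-age or AFP pension) is available."""
    if t >= 67:                                  # normal retirement age: unconditional
        return True
    if ps >= 6 and e == 1 and t >= afpage:       # employed at an AFP company, eligible
        return True
    if ps == 1 and e == 1 and afpage < t < 67:   # already retired through AFP
        return True
    return False

def disability_options(ps, h):
    """Decisions added by bad health (h = 1) for agents still active (ps >= 3).

    The severe-shock case h = 2, where the choice set collapses to d_t = 2,
    is deliberately not handled in this sketch.
    """
    return [1, 2] if ps >= 3 and h == 1 else []
```

For instance, a 63-year-old employed at an AFP company (`ps = 6`) with `e = 1` and `afpage = 62` has the pension option, whereas the same person at a non-AFP company (`ps = 4`) does not until age 67.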
This decision transforms into the current labour market state recorded in the next period variable $ps_{t+1}$ according to the rules laid out by all three parts of Table 1 (read from left to right). In the last example, if $d_t = 1$ is chosen, depending on the value of $m_t$ the individual ends up in combined disability and employment in an AFP or non-AFP company, or on full disability. At the same time, when followed from right to left, Table 1 shows how the decision variable $d_t$, which is unobserved in the data, is uniquely identified from the other state variables.

Table 1. Evolvement of current labour market state.

(A) Decision variable                       (B) Choice set conditions                     (C) Resulting labour market state
$d_t$  Remain on LM?  Apply for pension?    $ps_t$   $h_t$   $m_t$   $e_t$   Age          $ps_{t+1}$
0      yes            no                    ≥ 3      0       0       –       < 70         3  Unemployment
                                            ≥ 3      ≠ 2     1       –       < 70         4  Non-AFP employment
                                            ≥ 3      ≠ 2     2       –       < 70         6  AFP employment
1      yes            disability            ≥ 3      1       0       –       < 70         2  Full-time disability
                                            ≥ 3      1       1       –       < 70         5  Partial disab. (not AFP)
                                            ≥ 3      1       2       –       < 70         7  Partial disability (AFP)
2      no             disability                     > 0     –       –       < 70         2  Full-time disability
3      no             AFP/NIS               ≥ 6      –       –       1       ≥ afpage     1  Pension
                                            = 1      –       –               ≥ afpage     1  Pension
                                            –        –       –       –       ≥ 67         1  Pension
4      no             no                    ≠ 1, 2           –       –       < 70         0  Out of labour market

Notes: Parts A and B formalize definitions of the choice sets. Parts B and C present the evolvement of labour market state ($ps_t \to ps_{t+1}$) conditional on the current decision ($d_t$).

To complete the specification of the model, it remains to define the structure of the transition probabilities $\{p(s_t \mid s_{t-1}, d_{t-1})\}$ and the decision-makers’ preferences $u(d_t, s_t)$. This is done in turn in the next two sections, after describing the data sources and the modification of the standard estimation procedure made necessary by the latent processes.

3.3. Data and sample definition

The model is estimated using the collection of Norwegian governmental registers covering the whole Norwegian population in the period from 1992 to 2003. Different files containing demographic characteristics, annual employment and unemployment records, wage histories and received social benefits are linked at the individual level.¹² Given such a vast data set, I concentrate on the unbalanced panel of individuals born in the period from 1933 to 1942, who are observed in up to 12 consecutive years within the modelling period, which runs from the age of 50 to the age of 70. Although some data are updated monthly, the essential employment affiliation is only available on an annual basis. The absence of more precise records of time forces the imprecise but inevitable identification of calendar years with years of a particular age, and thus I disregard events within a calendar year.¹³ The households are constructed on the basis of the family register, which contains both registered marriages and unregistered cohabitations when the couple has at least one common child. Those individuals identified as single (including widows and widowers) represent single households. In full households the spouse initially included in the sample is assumed to be primary; if both spouses qualify, the primary spouse is the main earner in the family.¹⁴

¹² A broad description of the data collection at the Frisch Centre for Economic Research can be found in Hernæs et al. (2000).
¹³ An alternative identification of time periods could have dealt with precise ages (years from one birthday to another), but then all the annual characteristics would have to be artificially modified accordingly.

On the basis of the demographic and family registers from 1993, the initial sample conditions define 106,452 single households and 200,162 full households with 71,327 primary wives and 128,835 primary husbands, altogether 306,614 households. To concentrate solely on transitions out of the labour market, I require the primary spouse to be either employed or looking for a job in the first year of observation. Further, to exclude the extremely poor and very rich households whose labour market behaviour is not likely to be governed by the economic incentives modelled in this paper, I apply a simple income test.¹⁵ These requirements reduce the number of households in the sample by 34.47% to $A$ = 200,921 households, mainly due to the initial state requirement (69.98% of the reduction).¹⁶

3.4. Estimation strategy

The Norwegian register data described above provide an unbalanced panel of observations $\{d_t^a, ps_t^a, e_t^a, sp_t^a, nw_t^a, aw_t^a, afpage^a, gender^a\}_{t \in \{T_0^a - 1, \ldots, T^a\},\, a \in \{1, \ldots, A\}}$ of all the state variables except the health $h_t$ and match $m_t$ variables. Here $a \in \{1, \ldots, A\}$ denotes an observation index, and $T_0^a$ and $T^a$ are, respectively, the observation-specific initial and terminal observation periods. The model is estimated using the method of maximum likelihood with direct integration over the unobservables. The integration is facilitated by the simple parametrization of the Markovian transition probability matrix for the joint latent process $\{h_t, m_t\}$, which is a composition of (3.5) and (3.6). Even though the health and match state variables are latent, they are partially identified from the observed data on the labour market states $\{ps_t^a\}_{t \in \{T_0^a - 1, \ldots, T^a\},\, a \in \{1, \ldots, A\}}$, as follows from Table 1.
For example, the current full disability state ($ps_{t+1} = 2$) cannot correspond to good health ($h_t = 0$). Likewise, the current partial disability with employment in a non-AFP company ($ps_{t+1} = 5$) can only be attributed to $h_t = m_t = 1$. To generalize, denote by $HM_t(ps_{t+1}) \subset \{0, 1, 2\} \otimes \{0, 1, 2\}$ the set of pairs $(h_t, m_t)$ consistent with the labour market state in the given period. Let the set $HM^a = \bigotimes_{t=T_0^a}^{T^a} HM_t(ps^a_{t+1})$ contain all trajectories $(h, m) = (h_{T_0}, \ldots, h_T, m_{T_0}, \ldots, m_T)$ of the health–match stochastic process $\{h_t, m_t\}$ consistent with the observed evolution of the state variables $\{ps_t^a, e_t^a, sp_t^a, nw_t^a, aw_t^a\}_{t \in \{T_0^a - 1, \ldots, T^a\}}$ for a given agent. The likelihood function collects the probability mass from all realizations $(h, m)$ from $HM^a$ for each agent, and thus takes the form

$$L(\theta) = \prod_{a=1}^{A} \Biggl[ \sum_{(h,m) \in HM^a} p_0\bigl(h_{T_0^a - 1}, m_{T_0^a - 1}, \theta\bigr) \prod_{t=T_0^a}^{T^a} P_t\bigl(d_t^a \mid s_t^a, \theta\bigr) \cdot p\bigl(s_t^a \mid s_{t-1}^a, d_{t-1}^a, \theta\bigr) \Biggr], \qquad s_t^a = \bigl(ps_t^a, h_t, m_t, e_t^a, sp_t^a, nw_t^a, aw_t^a\bigr), \tag{3.8}$$

in which the parameter vector $\theta$ includes the already mentioned parameters $(\beta, \pi^{(h)}, \pi^{(m)}, \omega)$ and some more preference and transition probability parameters defined in the next sections.

¹⁴ As described in the previous section, only choices of the primary spouse are modelled, whereas incomes originating from the secondary spouse are accounted for.
¹⁵ Namely, the households whose annual before-tax income lies outside the limits of 40,000 (slightly over 1G) and 1 million Norwegian kroner (NOK) in 1992 prices are filtered out.
¹⁶ Further details on the sample construction and some descriptive statistics can be found in Iskhakov (2008a).

The constructed likelihood function (3.8) differs from the standard one in Rust (1994) by an additional summation over all consistent trajectories $(h, m)$ of the latent health–match process. The likelihood contribution of each trajectory of the health and match process is weighted with the initial condition probability $p_0(m_{T_0^a - 1}, h_{T_0^a - 1}, \theta) = p_0(h_{T_0^a - 1}) \cdot p_0(m_{T_0^a - 1} \mid h_{T_0^a - 1})$, which is parametrized in the following way. Because initially disabled individuals are excluded from the sample (Section 3.3), initial health can only take values 0 and 1. Denote by $\pi^h_{init}$ the probability of good health $h_{49} = 0$ at age 49. When the agent-specific initial period is different from 50, $T_0^a > T_0$, the corresponding distribution of $h_{T_0^a - 1}$ can be calculated from the powers of the transition probability matrix in a standard procedure:

$$p_0(h_{T_0^a - 1}) = \pi^h_{init} \cdot \Bigl[\bigl(\pi^{(h)}\bigr)^{T_0^a - T_0 + 1}\Bigr]_{(0,\, h_{T_0^a - 1})} + \bigl(1 - \pi^h_{init}\bigr) \cdot \Bigl[\bigl(\pi^{(h)}\bigr)^{T_0^a - T_0 + 1}\Bigr]_{(1,\, h_{T_0^a - 1})}, \tag{3.9}$$

where $[\cdot]_{(i,j)}$ denotes the element of a matrix in row $i$ and column $j$. Formula (3.9) is applicable because the health process is independent, and it includes the case $T_0^a = T_0 = 50$. Due to the definition of the sample, calculation of the conditional probability distribution for the job match process $p_0(m_{T_0^a - 1} \mid h_{T_0^a - 1})$ is trivial. Because the sample only includes individuals who are active on the labour market at the initial age ($T_0^a - 1$), the initial $m_{T_0^a - 1}$ is completely recoverable from the observed labour market state, implying $p_0(m_{T_0^a - 1} \mid h_{T_0^a - 1}) = 1$.

The estimation approach in this paper can be related to simulated likelihood estimation, which uses simulated sequences of unobservables to compute the likelihood. The simple structure of the unobserved process in this model allowed me to take into account all of its possible realizations instead of a limited number of simulated ones. This method is also related to the well-established EM algorithm (Dempster et al., 1977), which iterates between an expectation step, in which the latent variables are integrated out of the likelihood function conditional on the parameters of their distribution, and a maximization step, in which the optimal parameter values are found. In the current model, the algorithm collapses to one joint step because of the simple distributional assumptions on the latent variables.
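The summation over consistent latent trajectories in (3.8) can be illustrated on a stylized single-agent example. The sketch below is not the author's implementation: it collapses the latent pair into one latent index, treats the observed-data terms as a user-supplied hypothetical `weight` function, and sums over the initial latent state as well. It also shows a period-by-period (forward) recursion that yields the same sum; the recursion is a standard device for latent Markov chains and is offered here only as an assumption about how such a sum might be organized.

```python
from itertools import product

def likelihood_enumerate(p0, trans, consistent, weight):
    """Sum over every latent path that is consistent with the observations.

    p0[i]         : probability of latent state i before the first period
    trans[i][j]   : latent transition probability from i to j
    consistent[t] : set of latent states compatible with the outcome observed at t
    weight(t, j)  : observed-data term for period t given latent state j
    """
    total = 0.0
    for start in range(len(p0)):
        for path in product(*consistent):
            prob, prev = p0[start], start
            for t, j in enumerate(path):
                prob *= trans[prev][j] * weight(t, j)
                prev = j
            total += prob
    return total

def likelihood_forward(p0, trans, consistent, weight):
    """The same sum accumulated one period at a time."""
    alpha = list(p0)
    n = len(trans)
    for t, allowed in enumerate(consistent):
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * weight(t, j)
            if j in allowed else 0.0
            for j in range(n)
        ]
    return sum(alpha)

# Health chain with the Table 2 point estimates (bad and very bad health absorbing).
HEALTH = [[0.97087, 0.02778, 0.00135], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
flat = lambda t, j: 1.0  # placeholder for the observed-data terms
```

With `p0 = [1.0, 0.0, 0.0]` and two periods in which only good health is consistent, both functions return the squared probability of remaining in good health; for richer consistent sets the two computations agree while the recursion avoids enumerating paths.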
4. ESTIMATION RESULTS: BELIEFS

This section concludes the empirical specification of the transition probabilities in the model and presents the estimates of the structural parameters describing agents’ beliefs about the future consequences of their decisions.

4.1. Decomposition of transition probabilities and separability of parameter vector

Agents’ beliefs are represented in the model by a family of Markovian transition probabilities $\{p(s_t \mid s_{t-1}, d_{t-1})\}$, which reduces to a single transition probability matrix $p(s_t \mid s_{t-1}, d_{t-1})$ under the assumption of time invariance. This is a square matrix with the number of elements equal to the squared number of points in the state space $S$ of the model, and is thus practically impossible to identify without additional structural assumptions.¹⁷ Similar to Rust and Phelan (1997), I impose a certain dependence pattern on the transition probability matrix in the form of the following decomposition:

$$\begin{aligned} p(s_t \mid s_{t-1}, d_{t-1}) = {} & p(e_t \mid ps_t, e_{t-1}, nw_t, aw_t) \cdot p(aw_t \mid ps_t, aw_{t-1}) \cdot p(nw_t \mid ps_t, nw_{t-1}) \\ & \cdot\, p(ps_t \mid d_{t-1}, ps_{t-1}, h_{t-1}, m_{t-1}, e_{t-1}) \cdot p(sp_t \mid sp_{t-1}) \cdot p(m_t \mid h_t, m_{t-1}) \cdot p(h_t \mid h_{t-1}). \end{aligned} \tag{4.1}$$

¹⁷ In the current setting the number of elements is $19{,}008^2$ times the squared number of grid points approximating the aggregated wage, i.e. 17,703,899,136 with seven grid points.
The logic behind (4.1) is the following. Both latent processes, health and match, governed by the transition probabilities $p(h_t \mid h_{t-1})$ and $p(m_t \mid h_t, m_{t-1})$, represent factors independent of the other variables.¹⁸ The dynamics of household breakdowns are given by the exogenous empirical sample-specific distribution $p(sp_t \mid sp_{t-1})$.¹⁹ Transition probabilities for the labour market state $p(ps_t \mid d_{t-1}, ps_{t-1}, h_{t-1}, m_{t-1}, e_{t-1})$ formally represent the rules of motion detailed in Table 1. Similarly, transition probabilities $p(aw_t \mid ps_t, aw_{t-1})$ for the aggregated wage are also degenerate and represent equation (3.7). Transition probabilities for the rest of the state variables, namely the number of last consecutive years with high earnings $p(nw_t \mid ps_t, nw_{t-1})$ and individual AFP eligibility $p(e_t \mid ps_t, e_{t-1}, nw_t, aw_t)$, are given complete empirical specifications below.

¹⁸ See also the discussion in Section 4.2.
¹⁹ Strictly speaking, the exogenous household destruction probabilities are not time invariant, but their time dependence is disregarded here for simplicity.

The vector of parameters used for the parametrization of decision-makers’ beliefs, $(\pi^h_{00}, \pi^h_{01}, \pi^h_{11}, \pi^m_{01}, \pi^m_{02}, \pi^m_{11}, \pi^m_{12}, \pi^m_{21}, \pi^m_{22}, \omega, \pi^h_{init}, c^e_1, \ldots, c^e_5, c^{nw}_1, \ldots, c^{nw}_5)$, consists of 21 elements: the transition probability matrices for health and job match require nine linearly independent parameters; two parameters describe the initial distribution of health and the degree to which bad health reduces job market opportunities; and 10 parameters are used to parametrize the transition probabilities of $e_t$ and $nw_t$. Together with the 14 parameters introduced in the next section to specify preferences, the estimation of the model requires maximization of the likelihood function (3.8) with respect to 35 structural parameters. The full vector of parameters $\theta$ separates naturally into two subvectors, associated respectively with the decision-makers’ beliefs and preferences, or equivalently with the transition probabilities $p(s_t^a \mid s_{t-1}^a, d_{t-1}^a, \theta)$ and the choice probabilities $P_t(d_t^a \mid s_t^a, \theta)$. This makes it possible to apply a multi-stage estimation procedure similar to the one proposed in Rust (1994). The estimation strategy is described in detail in Appendix A.

4.2. Health

Identification of the parameters entering the transition probabilities for health is obtained from two sources. First, partial identification is possible from the observations of disabled retirees, whose health cannot remain good ($h_t = 0$) by the definition of the health variable. It is not possible, however, to identify health transitions completely from these observations, because intermediate (bad) health ($h_t = 1$) may stay concealed by individuals who do not opt in to the disability programme. A trivial association of health status with certain labour market states, e.g. full-time disability with very bad health, partial disability with bad health and the rest of the labour market states with good health, results in very misleading frequency-based approximations of the first row of the health transition probability matrix (3.5), namely $\tilde{\pi}^h_{00} = 0.9229$, $\tilde{\pi}^h_{01} = 0.0613$, $\tilde{\pi}^h_{02} = 0.0127$. The estimates of these parameters obtained in the structural estimation of the model and reported in Table 2 differ considerably from this rough guess. The reason is that when individual health is miscategorized, frequency-based estimates of the transfer rates between different levels of health may be biased in either direction, because the number of individuals in each health category may deviate upwards as well as downwards in both the previous and the next time period. Identification of the health transition probabilities is therefore not possible from the first source alone.

Table 2. Estimates of health transition probabilities.

Parameters                                            Estimate    Std. errors
Good health to good health          $\pi^h_{00}$      0.97087     0.00011
Good health to bad health           $\pi^h_{01}$      0.02778     0.00011
Bad health to bad health            $\pi^h_{11}$      1.00000     fixed
Initial good health probability     $\pi^h_{init}$    1.00000     $1.70 \times 10^{-07}$

The second source of identification of the parameters of the health transition probabilities is the whole structure of the decision-making process developed in this paper, in particular the individual choice sets carefully constructed in every time period. As a result of maximum likelihood estimation of the structural model as a whole, the parameters of the health transitions are pinned down to the values that are most compatible with the observed behaviour. Provided that the structure of the decision-making process is modelled correctly, the estimates presented in Table 2 are very precise estimates of the true transition probabilities that govern the latent health process.

The described principles of identification did not, however, allow for identification of the secondary health shocks to individuals already in bad health (parameter $\pi^h_{11}$). From the point of view of the structure of the decision-making process, such an event would lead to the collapse of a wide choice set, which includes disability as one of the options, to a single-element choice set with disability retirement as the only option. The available data, however, do not provide enough detail to establish whether the disability option existed prior to an observed disability take-up, in which case it was dominated by other alternatives, or whether the take-up can be attributed to a sudden severe health shock. Therefore, parameter $\pi^h_{11}$ is fixed at the value of unity, representing the setup in which both bad and very bad health are absorbing. The degree of precision in the estimates of the parameters presented in Table 2 is probably a consequence of using a very large sample for estimation.
Another factor that contributes to precise estimation is the independence of the health transition probabilities from the other state variables in the model, so that the whole data set is used to identify these parameters. The downside of this design is that it does not allow individual heterogeneity in health transitions to be reflected.²⁰ The estimated initial condition in the bottom row of Table 2 indicates that the model is compatible with the assumption of universal good health at age 49.

²⁰ Yet, the most important consideration for not including other covariates in health transitions is the fundamental restriction imposed by the computational tractability of the model.

4.3. Job match

Identification of the seven parameters entering the job match transition probabilities is due to the fact that the values of $m_t$ are recoverable from the observations of the active labour market states ($ps_{t+1} \in \{3, \ldots, 7\}$), as follows from Table 1. Given the assumption made in Section 3.2 that $m_t = 0$ when $ps_t \in \{0, 1, 2\}$, the observations almost uniquely define the trajectory of the job match process that enters the set $HM^a$ of all consistent trajectories of the latent health–match process $\{h_t, m_t\}$. Table 3 displays the estimates of the coefficients, which are again very precise due to the large data set used for estimation.

Table 3. Estimates of job match transition probabilities.

Parameters                                              Estimates   Std. errors
No job to non-AFP job                  $\pi^m_{01}$     0.18709     0.00148
No job to AFP job                      $\pi^m_{02}$     0.12548     0.00127
Non-AFP to non-AFP job                 $\pi^m_{11}$     0.89027     0.00046
Non-AFP job to AFP job                 $\pi^m_{12}$     0.08909     0.00042
AFP job to non-AFP job                 $\pi^m_{21}$     0.05583     0.00026
AFP job to AFP job                     $\pi^m_{22}$     0.93504     0.00028
Matching correction for bad health     $\omega$         1.00000     $2.90 \times 10^{-06}$

The probabilities reported in Table 3 can be interpreted as per-period probabilities of finding a job of a particular kind, as if the agents had lost their jobs at the end of each period. In this respect, the model predicts that an unemployed person is more likely to find a non-AFP job than an AFP job. Both job types are predicted to provide reasonable job security, but the probability of leaving an AFP job is lower by about 4.5 percentage points. Switching to the opposite job type is predicted to be about five times more probable than becoming unemployed. In addition, the estimated transition probabilities can be interpreted through the limiting distribution of the underlying Markov chain. Thus, if the estimated pattern of transitions holds for all ages of all generations, the model assigns 37.0334% and 58.8039% of the population to non-AFP and AFP employment, respectively, and implies an unemployment rate of 4.1627%. These numbers correspond well to the aggregate figures of the Norwegian labour market.

The estimate of the parameter $\omega$, which indicates no reduction in labour market opportunities for people with bad health, is an artefact of the model. Because the model does not allow for a combined unemployment and disability state on the labour market, it is assumed that both agents with very bad health ($h_t = 2$) and those with bad health and no job match end up in full-time disability. It is then impossible to tell these two groups of people apart on the basis of the observations. The true value of $\omega$ is thus incorporated into the estimated probability of a severe health shock $\pi^h_{02} = 1 - \pi^h_{00} - \pi^h_{01}$, which may in this case be somewhat overestimated.

4.4. Individual AFP eligibility

Partial transition probabilities $p(e_t \mid ps_t, e_{t-1}, nw_t, aw_t)$ summarize the agents’ beliefs about their future AFP eligibility. The natural assumption that agents do not expect AFP eligibility outside the period in which early retirement is feasible at all, namely between the cohort-specific AFP age ($afpage$) and 67, leads to two separate prediction problems. First, initial AFP eligibility at the AFP age must be assessed on the basis of the components of the state vector other than $e_{t-1}$, which is not defined at $t - 1 = afpage - 1$. Second, AFP eligibility in the years following the initial AFP age may be predicted on the basis of the previous value $e_{t-1}$ and possibly other state variables. Because of the dichotomous nature of the dependent variable, I use a logistic formulation to parametrize the transition probabilities $p(e_t \mid ps_t, e_{t-1}, nw_t, aw_t)$ in both of these cases.
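The limiting distribution quoted in Section 4.3 can be reproduced approximately from the Table 3 point estimates. The sketch below builds the job match matrix in the form (3.6) and iterates the chain until it settles; the function names are hypothetical, and the first column of the base matrix is obtained as one minus the two reported entries of each row.

```python
def match_matrix(base, omega):
    """Form (3.6): scale the two job-opening columns by omega, move the rest to column 0."""
    rows = []
    for row in base:
        openings = row[1] + row[2]
        rows.append([row[0] + openings * (1.0 - omega), row[1] * omega, row[2] * omega])
    return rows

def limiting_distribution(matrix, iterations=2000):
    """Repeatedly apply the transition matrix to a uniform starting distribution."""
    n = len(matrix)
    dist = [1.0 / n] * n
    for _ in range(iterations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return dist

# Rows: no job, non-AFP job, AFP job (Table 3 estimates; first column is the remainder).
BASE = [
    [1.0 - 0.18709 - 0.12548, 0.18709, 0.12548],
    [1.0 - 0.89027 - 0.08909, 0.89027, 0.08909],
    [1.0 - 0.05583 - 0.93504, 0.05583, 0.93504],
]

shares = limiting_distribution(match_matrix(BASE, omega=1.0))
```

With $\omega = 1$ the shares come out at roughly 4%, 37% and 59% for no job, non-AFP and AFP employment, in line with the figures reported in the text; setting `omega=0.0` moves all probability mass to the no-job column, as (3.6) requires for $h_t = 2$.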
Table 4. Evolvement of the individual AFP eligibility indicator.

Dependent variable                                   Coefficient   Estimates   Std. errors
Applicable for the first period AFP retirement may take place, t = afp
  Constant term                                      c_1^e         −0.36330    0.05032
  9 years with high wage, nwt−1 = 9                  c_2^e          1.19699    0.09969
  10 years with high wage, nwt−1 = 10                c_3^e          2.40131    0.05153
Applicable for the next periods, afp < t < 67
  Constant term                                      c_4^e         −0.30404    0.01951
  Gender of the decision maker                       c_5^e         −0.62806    0.02870
The definitions of some of the state variables allow a direct verification of some of the AFP rules described in Section 2. These verifications allow for perfect predictions of the et values for some agents, thus making distribution p(et | pst , et−1 , nwt , awt ) in some cases degenerate. Thus, since employment at AFP company is required for AFP eligibility, the condition pst < 6 predicts failures perfectly (pst < 6 ⇒ et = 0). Another requirement is substantial annual wages in the last two years which gives nwt < 2 ⇒ et = 0. Data analysis shows that within the sample the condition awt < 74, the level that roughly corresponds to 2G in 1992 prices, corresponds well to the requirement for the average of the best 10 annual earnings, proving awt < 74 ⇒ et = 0. These conditions perfectly predict the initial AFP eligibility for 96,704 relevant observations in the sample, which constitutes 53.96% of all predictions that have to be made. The upper panel of Table 4 reports the estimates of the coefficient in the logistic formulation of the transition probabilities p(et | pst , et−1 , nwt , awt ) for the ‘undecided’ cases in the evolvement of the AFP eligibility. Further, the AFP eligibility status et−1 in the previous period (when available) perfectly predicts successes, thus et−1 = 1 ⇒ et = 1. Together with the conditions on previous labour market state (pst < 6 ⇒ et = 0) and the number of preceding years with substantial earnings (nwt < 2 ⇒ et = 0), this condition allows for perfect predictions of 95.22% of cases when AFP eligibility has to be predicted between the AFP age and the age of 67. The lower panel of Table 4 reports estimates for the coefficients in the logistic formulation of p(et | pst , et−1 , nwt , awt ) (which practically gives simple gender-specific probabilities) for the rest of the cases. 
Given that AFP eligibility is not directly verifiable, the estimates presented in Table 4 imply the following probabilities of becoming AFP eligible at the AFP age: 42.17%, 71.60% and 88.55% for individuals with nwt−1 ≤ 8, nwt−1 = 9 and nwt−1 = 10, respectively. For the later years the implied probabilities are 42.50% and 28.22% for males and females, respectively.

4.5. Short-term income indicator

Because nwt is defined as the number of preceding consecutive years with the wage income over 1G (truncated from above at value 10), the only two feasible values for the next period are nwt+1 = nwt + 1 (unless the upper limit is reached) and nwt+1 = 0, corresponding to the cases when the current-year wage income is respectively high and low. Therefore the probability of the event that the current wage income exceeds 1G is sufficient to completely characterize the transition probabilities p(nwt | pst , nwt−1 ).
Table 5. Evolvement of the number of consecutive years with sufficient wage income.

Dependent variable                                        Coefficient   Estimates   Std. errors
No high wages previously, nwt−1 = 0                       c_1^nw        −5.78479     0.02121
Active on labour market last period, pst ∈ {3, 4, 6}      c_2^nw         5.64986     0.01415
Partial disability last period, pst ∈ {5, 7}              c_3^nw        17.77650    11.02260
OLM or fully disabled last period, pst ∈ {0, 2}           c_4^nw        −5.93140     0.09030
Constant term                                             c_5^nw        −0.47477     0.00864
In contrast to the fine prediction presumably made by the agent using a complete history of wage earnings and other available information, it is only possible within the model to condition the required probability on the values of the state variables known by the end of the period t − 1. Naturally, all the covariates used in the next section to predict the wage income level can be utilized in predicting the event that the current wage exceeds 1G. A separate pilot study of the corresponding logit model indeed revealed a very good fit, with the McFadden rho coefficient of 75.28% achieved using the previous value nwt−1 , aggregated wage and the dummies corresponding to the previous period labour market states as explanatory variables. However, it appeared that the information carried by these covariates can be extracted from a smaller number of ‘sufficient statistics’, including only the previous value nwt−1 and three dummies for the grouped labour market states.

Table 5 reports the estimates of the parameters (c_1^nw , . . . , c_5^nw ) obtained during the final stage of joint estimation of the model. All coefficients but one are sharply estimated. The coefficient c_3^nw , which is statistically indistinguishable from zero (at any significance level lower than 11%), is likely to be the consequence of greater uncertainty of the level of current earnings associated with partial disability combined with employment.

The coefficients presented in Table 5 should be interpreted as coefficients in the linear combination of the covariates wrapped into a logistic function to parametrize the probability of high wage income in the current period. The highest probability of earning a sufficient salary is 99.4376% (as calculated on the basis of Table 5 using only significant coefficients), and is assigned to the active labour market states when previous wage earnings are high. A much lower probability of 0.1649% is attributed to the inactive labour market states, also when previous earnings are high. Previous low wage earnings substantially reduce these probabilities: for the agents active on the labour market the probability falls to 35.2128%, and for the rest previous low earnings almost certainly (with probability over 99.9994%) result in similarly low wage income in the current state.
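The four probabilities quoted above can be recomputed directly from the Table 5 point estimates. The following is a minimal sketch, not the estimation code of the paper: the function and constant names are hypothetical, and the insignificant partial-disability coefficient c_3^nw is left out, as in the text.

```python
import math

# Point estimates from Table 5 (the names of the constants are illustrative).
C_NO_HIGH_WAGE = -5.78479   # c_1^nw: no high wages previously (nw_{t-1} = 0)
C_ACTIVE = 5.64986          # c_2^nw: active on the labour market last period
C_INACTIVE = -5.93140       # c_4^nw: OLM or fully disabled last period
C_CONSTANT = -0.47477       # c_5^nw: constant term


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_high_wage(prev_low_wage, prev_state_group):
    """Probability that the current wage income exceeds 1G.

    prev_low_wage: True if nw_{t-1} = 0.
    prev_state_group: 'active' or 'inactive' (the partial disability group
    is omitted here because its coefficient is not significant).
    """
    index = C_CONSTANT
    if prev_low_wage:
        index += C_NO_HIGH_WAGE
    if prev_state_group == "active":
        index += C_ACTIVE
    elif prev_state_group == "inactive":
        index += C_INACTIVE
    else:
        raise ValueError("prev_state_group must be 'active' or 'inactive'")
    return logistic(index)


for low in (False, True):
    for group in ("active", "inactive"):
        print(low, group, round(100 * prob_high_wage(low, group), 4))
```

For the inactive states with low previous earnings, the complement of the returned value is the probability of a low wage income in the current period.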
5. ESTIMATION RESULTS: PREFERENCES

This section concludes empirical specification of the utility function and presents the estimates of the structural parameters that characterize agents’ preferences.

5.1. Instantaneous utility function

The deterministic part of the utility function u(dt , st ) in (3.2) which formalizes the agents’ preferences is defined along the traditional lines as indirect utility dependent on income I = I (t, dt , st ) and leisure L = L(t, dt , st ). Following a general trend in the microeconometric studies of the labour market, I adopt the additive functional form of the instantaneous utility function with constant relative risk aversion (CRRA) with respect to household disposable income and linear leisure:21

\[
u(d_t, s_t) = u\big(I(t, d_t, s_t), L(t, d_t, s_t)\big) = a \, \frac{(Tx(I))^{\lambda} - 1}{\lambda} + b(s_t) \cdot L + \sum_{k=0}^{7} c_k \cdot \mathbb{1}(ps_{t+1} = k), \tag{5.1}
\]
where Tx(·) is a mapping from before-tax household incomes to disposable household incomes. A scalar coefficient a is used to scale the impact of the utility of income, which is measured against the utility of leisure and the additional non-pecuniary utility generated by the different labour market states and weighted by the set of parameters ck . To ensure identification, c0 is set to zero, making the OLM state the reference. A compound coefficient with leisure captures the individual heterogeneity in preferences. I estimate the following form of b(st ), which is mainly driven by computational tractability considerations:

\[
b(s_t) = b_1 \cdot \mathbb{1}(h_t = 1) + b_2 \cdot sp_t + b_3 \cdot gender. \tag{5.2}
\]
Following the general approach in the literature (Dagsvik and Strøm, 1992, Rust and Phelan, 1997, Gilleskie, 1998, Hernæs et al., 2000, Heyma, 2004, and Karlstrom et al., 2004) I completely assume away savings and require all the income in a particular year to be consumed. This assumption is first of all driven by the restrictions on the complexity of the model and the absence of reliable data on savings, but may also be supported by the argument that at the modelled ages a considerable part of households’ savings is immobilized in durable goods and housing, and thus consumption smoothing is very limited (Iskhakov, 2008a).

Leisure is a deterministic function of the labour market state and is simply calculated as a fraction of time available for leisure, net of eight hours of sleep per day and time spent at work. Active labour market states are assumed to occupy 7.5 hours per working day (corresponding to the normal Norwegian 37.5 hours working week), except the unemployment state which is assumed to occupy half of this time. OLM, pension and full-time disability have maximum leisure.

\[
L(t, d_t, s_t) =
\begin{cases}
\dfrac{24 \cdot 365 - 37.5 \cdot 52 - 8 \cdot 365}{24 \cdot 365} = 0.444, & ps_{t+1} \in \{4, 5, 6, 7\}, \\[2ex]
\dfrac{24 \cdot 365 - 0.5 \cdot 37.5 \cdot 52 - 8 \cdot 365}{24 \cdot 365} = 0.555, & ps_{t+1} \in \{3\}, \\[2ex]
\dfrac{24 \cdot 365 - 8 \cdot 365}{24 \cdot 365} = 0.667, & ps_{t+1} \in \{0, 1, 2\}.
\end{cases} \tag{5.3}
\]

21 CRRA as Box–Cox transformation has both theoretical justification (Dagsvik et al., 2006) and represents a convenient generalization spanning from linear (λ = 1) to logarithmic (λ → 0) specifications.

Preliminary analysis of the income data from the tax records reveals the following clear correspondence of income sources and labour market states. Employment incomes which in addition to wages contain different occupational benefits provided by the employer (reimbursement for communication, commute expenses, etc.) as well as holiday pay, travel allowances and similar, constitute the main source of household income at full-time employment
and a considerable amount of income at partial disability and unemployment (pst+1 ∈ {3, . . . , 7}). Pension incomes, which consist of different pension benefits, disability benefits up to the age of 67, possibly AFP benefits between the cohort-specific AFP age and the age of 67 and regular old-age pension benefits after the age of 67, are the main source of income in the retirement and full-time disability labour market states (pst+1 ∈ {1, 2}). Additional incomes, which combine the rest of the important sources of income for an individual (such as occupational pension, survival benefit, unemployment benefit, child care, sickness and other forms of benefits resulting from governmental or private insurance schemes, additional personal old-age annuities, golden handshake premiums, capital income, etc.) are received in all states, while playing the most important role in the inactive labour market states (pst+1 ∈ {0, 1, 2}). Finally, the spouse income comprises all three groups of incomes calculated for the spouse and only appears in full households (when spt = 1).

The four major sources of income for a household (wage, pensions, additional and spousal income) unambiguously correspond to the different combinations of the pst+1 and spt variables. Incomes from these four sources are carefully forecasted by multiple linear income equations of the form Ik (t, dt , st ) = gk (t, dt , st ) + υk , the estimation of which is detailed in Appendix B. Here the error terms υk are assumed independent and normally distributed with zero means and variances $\sigma_k^2$. Denote by K(dt , st ) = K(pst+1 , spt ) the set of relevant income sources. Because the estimated standard errors in the income equations are rather large (see Table B.1 in the Appendix), the predictions of the total before-tax household income $I(t, d_t, s_t) = \sum_{k \in K(d_t, s_t)} I_k(t, d_t, s_t)$ become rather noisy.

I apply a slightly more accurate procedure for the calculation of the current utility, which accounts not only for the point estimates of the incomes from different sources, but also for their second-order moments. Since the normal distribution is stable under summation, the total household income I (t, dt , st ) has a normal distribution with the expectation $\sum_{k \in K(d_t, s_t)} g_k(t, d_t, s_t)$ and the variance $\sum_{k \in K(d_t, s_t)} \sigma_k^2$. Then the income component of the utility function $a \, \frac{(Tx(I))^{\lambda} - 1}{\lambda}$ can be replaced by its expectation over the noise in the prediction of incomes:

\[
\frac{a}{\lambda} \int_{-\infty}^{+\infty} \left\{ \left( Tx\Big( \sum_{k \in K(d_t, s_t)} I_k(t, d_t, s_t) \Big) \right)^{\lambda} - 1 \right\} dF_{\upsilon}
= \frac{a}{\lambda} \int_{0}^{1} \left\{ \left( Tx\Big( \sum_{k \in K(d_t, s_t)} g_k(t, d_t, s_t) + \sqrt{\sum_{k \in K(d_t, s_t)} \sigma_k^2} \cdot \Phi^{-1}(\tau) \Big) \right)^{\lambda} - 1 \right\} d\tau, \tag{5.4}
\]
where Φ(·) is the standard normal c.d.f. Expression (5.4) can be evaluated very fast using Gaussian quadrature with nearly zero additional computational cost. The tax function Tx(·) superposed inside the utility function is also approximated by a statistical equation, which is presented in Appendix B.

The termination value function (sT ) in the agents’ sequential decision problem (3.1) represents the residual utility after the age of 70, and because most of the state variables stabilize at this age can only be formulated as dependent on a limited number of state variables. I assume the following linear specification with gender and aggregated wage awT used as controls of individual heterogeneity, together with the AFP age variable afpage, which can be interpreted as a rough control for cohort.

\[
(s_T) = c_1^{tf} \cdot afpage + c_2^{tf} \cdot gender + c_3^{tf} \cdot aw_T. \tag{5.5}
\]
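To illustrate how the right-hand side of (5.4) can be evaluated numerically, the following sketch integrates over τ on (0, 1) through the inverse normal c.d.f. It is only an illustration under stated assumptions: a plain midpoint rule is used in place of the Gaussian quadrature mentioned in the text, the function name and arguments are hypothetical, and the tax mapping defaults to the identity because the estimated tax equation from Appendix B is not reproduced here.

```python
from statistics import NormalDist


def expected_income_utility(mean_income, income_sd, a, lam,
                            tax=lambda income: income, n_nodes=2000):
    """Approximate (a / lam) * integral over (0, 1) of
    {Tx(mean + sd * Phi^{-1}(tau))^lam - 1} d tau by a midpoint rule.

    mean_income: sum of the predicted incomes g_k over the relevant sources.
    income_sd:   square root of the sum of the variances sigma_k^2.
    tax:         stand-in for Tx(.); the identity is an assumption.
    """
    std_normal = NormalDist()
    total = 0.0
    for i in range(n_nodes):
        tau = (i + 0.5) / n_nodes                  # midpoint of the i-th cell
        income = mean_income + income_sd * std_normal.inv_cdf(tau)
        disposable = max(tax(income), 1e-9)        # the power needs a positive base
        total += disposable ** lam - 1.0
    return a / lam * total / n_nodes


# Illustrative call with the Table 6 point estimates for a and lambda and
# made-up income moments (400 and 50 are not taken from the paper).
print(expected_income_utility(400.0, 50.0, a=0.18903, lam=0.67140))
```

With λ < 1 the income utility is concave, so the value returned for a positive standard deviation lies below the utility of the mean income, which is the second-order effect the procedure is meant to capture.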
In accordance with the estimation strategy described in Appendix A, the preference parameters $\theta = (\beta, \lambda, a, b_1, \ldots, b_3, c_1, \ldots, c_5, c_1^{tf}, \ldots, c_3^{tf})$ are estimated during the last stages because they require the value function calculation in each evaluation of the likelihood function and thus add to the computational load. In addition, the topography of the likelihood function in the space of the preference parameters, namely its flatness and the abundance of local maxima, makes the optimization task especially demanding. As reported in Iskhakov (2008a), the estimation of the preference parameter vector itself was performed in several steps with some parameters fixed on the earlier approaches.

5.2. Estimated preferences

Table 6 presents the obtained structural estimates of the preference parameters. The large sample used in estimation again allowed for very precise estimates which are all very significantly different from zero. Unfortunately, insufficient variation of the state variables at the termination age prevented the estimation of the parameters $c_1^{tf}, \ldots, c_3^{tf}$ in the termination function, leading to the simplifying assumption (sT ) = 0.

The estimates suggest higher marginal utility of leisure for the individuals in full households. This is a clear indication of a spousal effect in the retirement process in Norway, which is in line with previous research on joint retirement decisions. Women are estimated to have a lower marginal utility of leisure, which may reflect their tendency to work more compared to men in order to counteract the gender gap in pension benefits. The same is applicable for the huge additional marginal utility of leisure for the unhealthy.22 Comparison of the indirect utilities of different labour market states (in which two pairs of coefficients in similar states are equalized) reveals the following pattern.
The least attractive labour market state is full-time disability, especially when it results from a permanent health shock (when ht = 2 and no additional utility is attached to leisure). The most attractive is combined disability and employment; this reflects the large alternative cost of working for those who choose to become employed while on disability. The OLM state is the least attractive among the rest of the states, while employment, unemployment and retirement result in similar additional utility. However, if these three are compared, the unemployment state seems to be slightly more attractive than pension, and pension slightly more attractive than employment, which can be interpreted as evidence for preferences outside the income–leisure dimensions that affect the choice.

The large coefficient with leisure associated with bad health can also be given an alternative, interesting interpretation. Provided that the medical screening at the entrance to the disability programme is efficient, so that the health indicator in the model can be interpreted as medical health, the estimation results are consistent with a social stigma effect in the disability take-up process. Indeed, the large coefficient with leisure offsets the low non-pecuniary utility of full-time disability only for those retirees who are in bad health. Thus, potentially tricking the disability programme by applying for the benefit when in good health is punished at the level of individual utility. This implied punishment, i.e. social stigma, could be one of the important factors explaining retirement through the disability system.
22 Those in the worst health are all assigned to the maximum leisure corresponding to full-time disability—making related coefficient unidentifiable.
Table 6. Final estimates of the preference parameters.

Parameters                                            Coefficient   Estimates   Std. errors
Discount factor                                       β              0.90683    0.00211
1 – risk aversion                                     λ              0.67140    0.02764
Utility of income
  Constant                                            a              0.18903    0.00505
Utility of leisure
  ht = 1                                              b1            29.30984    0.57732
  spt                                                 b2             0.32370    0.02109
  Gender (0-male, 1-female)                           b3            −0.67042    0.01971
Labour market state-specific additional utility
  OLM (reference)                                     c0             0.0        fixed
  Pension                                             c1             0.68043    0.00394
  Disability (DI)                                     c2            −5.88040    0.13116
  Unemployment                                        c3             0.74066    0.01603
  Employment                                          c4 = c6        0.54762    0.00681
  Employment combined with disability                 c5 = c7        7.65913    0.12912
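As a short arithmetic check, two of the Table 6 point estimates can be translated into more familiar quantities. The only ingredients are the usual relation between a discount factor and a discount rate r, and the definition of λ used in this paper as one minus the CRRA coefficient; the symbol r is introduced here for illustration only.

\[
r = \frac{1}{\beta} - 1 = \frac{1}{0.90683} - 1 \approx 0.1027, \qquad 1 - \lambda = 1 - 0.67140 = 0.32860.
\]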
The second row in Table 6 indicates that the discount factor β was successfully estimated in the present model at the level of 0.90683, implying an intertemporal discount rate of 10.27%. Identification of the discount factor differentiates this paper from many works in the field where it is either fixed at a given value, usually between 0.85 and 0.95 (Berkovec and Stern, 1991, Gilleskie, 1998, Burkhauser et al., 2004, Heyma, 2004, Karlstrom et al., 2004, and Bound et al., 2010), or is a subject of a grid search (Rust and Phelan, 1997). In the absence of theoretical characterization of the models capable or incapable of identifying the discount factor, some other authors also succeed in estimating this parameter. In the estimation of his model of retirement and savings decisions French (2005) obtains a range of discount factors considerably higher compared to the current estimate, namely 0.981–1.04.23

Another parameter which is difficult to estimate and is often fixed in the literature is the Box–Cox parameter λ, which in the current setup equals the difference between unity and the coefficient of constant relative risk aversion with respect to household disposable income. Many authors also use a logarithmic specification, which corresponds to a fixed value λ = 0 (Heyma, 2004, Jia, 2005, and Bound et al., 2010).24 Table 6 reports the estimate 0.67140 with the corresponding CRRA coefficient 0.32860, which implies even lower risk aversion than Burkhauser et al. (2004), who estimate it at the level 0.407–0.520. In his model with saving decisions, French (2005) obtains much higher risk aversion with the CRRA coefficient in the range 3.19–3.78.

23 Considerably lower estimates obtained in this paper can be attributed to the effective absence of the termination function which would capture the remaining utility after the age of 70.
24 A common mistake is claiming the logarithmic form as a special case with lambda approaching zero when the utility function is specified as CRRA but not Box–Cox.

5.3. Goodness of fit of the model

Overall, the preference parameter estimates presented in Table 6 appear to be informative, reasonable and accurate. To derive a quantitative measure of the goodness of fit of the model, I calculate a goodness-of-fit coefficient similar to McFadden (1974) based on the frequency approximations of both the choice and the transition probabilities (except those fixed or degenerate) according to the following formulas (here N {·} denotes the number of elements of a set):

\[
P^{freq}(d \mid s) = \frac{N\{(d_t^a, s_t^a) : d_t^a = d, \; s_t^a = s\}}{N\{(d_t^a, s_t^a) : s_t^a = s\}}, \tag{5.6}
\]

\[
p^{freq}(s_{t+1} \mid s_t, d_t) = \frac{N\{s_{t+1}^a : s_{t+1}^a = s_{t+1}\}}{N\{(d_t^a, s_t^a) : s_t^a = s_t, \; d_t^a = d_t\}}, \tag{5.7}
\]
where the values of the latent health status and job match are defined deterministically from the current labour market states in the manner mentioned in the previous section.25 Plugging these approximations into the log-likelihood (3.8) and applying the McFadden formula gives a goodness-of-fit coefficient of 53.129%. The interpretation of this coefficient is the following: compared to the restricted model in which the choices are extrapolated from the historical data and the evolvement of the state variables is described by the primitive frequency-based rates, the estimated structural dynamic model explains about one and a half times as much variation in the data. Because the constructed coefficient varies between zero, when the full model does not add additional knowledge compared to the restricted specification, and unity, when the full model is infinitely better than the restricted version, the obtained value of 53.129% seems reasonably high.
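The construction of the frequency-based choice probabilities in (5.6) and of a McFadden-type coefficient can be sketched as follows. This is a toy illustration and not the code behind the 53.129% figure: the function names are hypothetical, the log-likelihood below covers choices only rather than the full expression (3.8), and the coefficient is assumed to take the usual form of one minus the ratio of the full to the restricted log-likelihood.

```python
import math
from collections import Counter


def choice_frequencies(observations):
    """Frequency estimate of P(d | s) from pooled (decision, state) pairs."""
    observations = list(observations)
    pair_counts = Counter(observations)
    state_counts = Counter(state for _, state in observations)
    return {(d, s): n / state_counts[s] for (d, s), n in pair_counts.items()}


def choice_log_likelihood(observations, probs):
    """Sum of log choice probabilities over the observed pairs."""
    return sum(math.log(probs[(d, s)]) for d, s in observations)


def mcfadden_coefficient(loglik_full, loglik_restricted):
    """One minus the ratio of the full to the restricted log-likelihood."""
    return 1.0 - loglik_full / loglik_restricted


# Tiny made-up sample: decisions 0/1 observed in states 'a' and 'b'.
sample = [(1, "a"), (0, "a"), (1, "a"), (1, "b")]
freq = choice_frequencies(sample)
print(freq[(1, "a")], choice_log_likelihood(sample, freq))
```

The transition frequencies in (5.7) can be tabulated with the same counting pattern applied to the observed state transitions.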
6. SUBSTITUTION BETWEEN AFP RETIREMENT AND RETIREMENT THROUGH DISABILITY

Early retirement and retirement through disability have been the two major exit routes from the labour market in Norway before the normal retirement at the age of 67. The question of possible substitution between these two exit routes thus becomes particularly important when a policy encouraging later withdrawal from the labour market is designed. As noted in the introduction, this question has already been given attention in the literature, but the conclusions were rather different (Røed and Haugen, 2003, Bratberg et al., 2004). The structural model developed in this paper outperforms the previous approaches and allows me to shed more light on the described issue.

In order to assess the magnitude of potential spillover between AFP and disability retirement, I simulate a policy of complete elimination of the AFP scheme under the assumption that the structural parameters in preferences and transition probabilities are policy invariant. Within the model the new policy is represented by a single change in Table 1 where the AFP age variable (afpage) is replaced with 67, making early retirement equivalent to retirement with normal old-age pension.26 The idea behind the simulation is to analyse the changes in distributions of households over the set of labour market states before and after the policy change in order to identify the allocation of the displaced early retirees.

25 Namely, health in the full-time disability is assumed the worst, partial disability is associated with bad health and the agents in all the rest of the labour market states are assumed to be healthy.
26 Note that the replacement is only made in the motion rule for the labour market state and the choice sets definition, but not in the income equations where the AFP age serves as a rough control for cohort.
Table 7. Policy simulation: elimination of the AFP scheme (% points).

                                                  Age
Labour market states       62        63        64        65        66        67        68        69
OLM                        0         0.21      0.21      0.32      0.43      0         0         0
Pension                  −17.96    −24.12    −30.24    −34.51    −42.17     −6.81     −4.44     −4.30
                         (100%)    (100%)    (100%)    (100%)    (100%)    (100%)    (100%)    (100%)
Disability                 0         0         0.11      0.11      0.11      0         0         0
Unemployment               0.21      0.42      0.21      0.53      0.76      0.77      0.22      0.11
Non-AFP work               1.34      2.6       4.43      6.09      7.07      0.66      0         0
N-AFP work + DI            0         0         0.11      0.32      0.65      0.88      0.89      0.68
AFP work                  16.41     20.37     23.92     25.21     30.54      1.43      0         0
AFP work + DI              0         0.52      1.26      1.92      2.61      3.08      3.33      3.51
Disability (joint)         0.0       0.51      1.48      2.35      3.37      3.96      4.22      4.19
                          (0.0%)    (2.16%)   (4.88%)   (6.81%)   (7.99%)   (58.1%)   (95%)     (97.3%)
Employment (joint)        17.75     22.97     28.35     31.30     37.61      2.09      0         0
                         (98.9%)   (95.3%)   (93.7%)   (90.7%)   (89.1%)   (30.7%)    (0%)      (0%)

Notes: The table presents % point differences between the distribution of the simulated households in the existing pension system and the distribution of the simulated households after the AFP pension was completely eliminated. Negative changes in the retirement state between the ages 62 and 66 correspond to no cases of retirement before 67 as suggested by the simulated policy.
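The joint disability row of Table 7 and its percentage shares can be recomputed from the component rows. The sketch below does this with the published, rounded entries, so its output agrees with the table only up to rounding; the variable and function names are illustrative.

```python
AGES = [62, 63, 64, 65, 66, 67, 68, 69]

# Rows copied from Table 7 (% point changes after the AFP scheme is removed).
PENSION = [-17.96, -24.12, -30.24, -34.51, -42.17, -6.81, -4.44, -4.30]
DISABILITY = [0, 0, 0.11, 0.11, 0.11, 0, 0, 0]
NON_AFP_WORK_DI = [0, 0, 0.11, 0.32, 0.65, 0.88, 0.89, 0.68]
AFP_WORK_DI = [0, 0.52, 1.26, 1.92, 2.61, 3.08, 3.33, 3.51]


def joint_disability():
    """Sum of the three disability-related rows at each age."""
    return [round(d + n + a, 2)
            for d, n, a in zip(DISABILITY, NON_AFP_WORK_DI, AFP_WORK_DI)]


def share_of_pension_decline(changes):
    """Change in a state as a percentage of the drop in the pension state."""
    return [round(100 * c / -p, 2) for c, p in zip(changes, PENSION)]


joint = joint_disability()
for age, change, share in zip(AGES, joint, share_of_pension_decline(joint)):
    print(age, change, share)
```

At the ages 64 to 66 the recomputed shares are close to the 4.88–7.99% range reported in the table.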
The analysis is implemented by a simulation of 1000 households who are assigned the initial conditions $\{d^a_{T_0^a}, ps^a_{T_0^a}, e^a_{T_0^a}, sp^a_{T_0^a}, nw^a_{T_0^a}, aw^a_{T_0^a}\}_{a \in \{1, \ldots, 1000\}}$ randomly drawn from the observed initial conditions in the sample.27 All decisions and states for these 1000 households are then sequentially simulated by drawing random variables from the distributions of transition probabilities $p(s_t^a \mid s_{t-1}^a, d_{t-1}^a)$ and choice probabilities $P_t(d_t^a \mid s_t^a)$ in all the time periods up to T = 70 (if a prior event of death does not occur). The data set of simulated state and decision variables constructed with this procedure provides the basis for the calculation of the simulated distributions across labour market states at each age. These distributions before and after the policy change are compared, and the differences are presented in Table 7. To make the comparison of the two distributions independent of the particular simulated stochastic transitions and choices, the sequence of random number realizations is shared between the policy simulations.

27 Prior to the initial period all individuals are assumed to be healthy.

It is clear from Table 7 that the people who would otherwise be retired with the AFP pensions mostly distribute themselves across employment states. Substitution into full-time disability is practically negligible, with an increase of only about one-tenth of a percentage point at ages 64 to 66. In the same time period, the fractions of employed partially disabled increased much more. The biggest inflow of the displaced AFP pensioners is seen into employment states, more precisely AFP employment. This is natural because it is exactly these employees who lose the opportunity to retire. The same reason makes AFP employment combined with disability the third biggest increase. The second biggest gain corresponds to the non-AFP employment, which indicates that most of the displaced AFP retirees stay on the labour market, even when they have to change jobs. Inflow into full-time disability is negligible, much lower than disability combined
with non-AFP employment indicating that finding any work in addition to partial disability is very attractive. To make the results comparable to the previous studies, it is necessary to combine the three disability states as shown in the last two rows of Table 7 (the two employment states are also combined). Now when disability is given a slightly different definition, namely including both full-time and partial disability, the conclusions seem to change as well. Even though the largest substitution is directed towards employment, the second largest escape for the displaced early pensioners is the joint disability state. These results can be interpreted as displaying signs of AFP and disability substitution after the age of 63 and up to the age of 69. Table 7 reports in parentheses the fractions of the total decrease in AFP the disability and employment states are responsible for. Thus, the model predicts the substitution effect at the level of about 4.88–7.99% in the ages 64 to 66, which is intermediate in comparison to the previous findings (Røed and Haugen, 2003, Bratberg et al., 2004). An interesting feature of the result is the absence of any early reaction to the anticipated reduction of opportunities due to the elimination of AFP programme. This is contrary to the results of a preliminary policy simulation based on the calibrated values of the parameters when a sizeable decrease in the labour supply was observed as early as at age 52. After accurate estimation of the preference parameters, it is clearly demonstrated that the limitation of the retirement opportunities is not capable of causing any early behavioural response. On the contrary—negative change in fractions of those who retired after 66 indicates that the simulated policy induces general postponing of retirement in society. Overall, when the AFP pension is removed, people remain on the labour market but utilize the possibility of taking out a partial disability pension as well. 
This behaviour, in fact, may be interpreted as substitution into disability.
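The simulation design used in this section, in which the baseline and the policy runs share the same sequence of random number realizations, can be illustrated with a deliberately simplified sketch. Everything in it is hypothetical: there is a single absorbing retirement state, the retirement probabilities are made up, and none of the estimated transition or choice probabilities of the model are used.

```python
import random


def simulate_retired_shares(retire_prob, n_households=1000,
                            ages=range(62, 70), seed=2008):
    """Share of households retired at each age in a toy absorbing model."""
    rng = random.Random(seed)
    ages = list(ages)
    # One uniform draw per household and age, generated in a fixed order so
    # that runs with the same seed share draws whatever the policy is.
    draws = [[rng.random() for _ in ages] for _ in range(n_households)]
    retired = [False] * n_households
    shares = []
    for j, age in enumerate(ages):
        for i in range(n_households):
            if not retired[i] and draws[i][j] < retire_prob(age):
                retired[i] = True
        shares.append(sum(retired) / n_households)
    return shares


def baseline(age):
    """Hypothetical policy: early exit is possible from the age of 62."""
    return 0.25 if age < 67 else 0.90


def no_early_scheme(age):
    """Hypothetical policy: no retirement before the age of 67."""
    return 0.0 if age < 67 else 0.90


before = simulate_retired_shares(baseline)
after = simulate_retired_shares(no_early_scheme)
differences = [round(100 * (a - b), 2) for a, b in zip(after, before)]
print(differences)  # % point changes in the retired share, ages 62 to 69
```

Because both runs use identical draws, any difference between the two simulated distributions is attributable to the policy change alone and not to simulation noise.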
7. CONCLUSIONS

Labour market behaviour of older workers is a complex process with several hidden factors affecting it. This paper makes an effort to model the most important of these factors in order to better understand their influence on the decision-making process and the dynamics of the behavioural response to policy changes. Controlling for the core variables such as health, job match, individual eligibility for early retirement, and the previous labour market state, the current model focused on careful reconstruction of the individual choice sets in each time period, thus accurately replicating the labour market opportunities of each agent. In the constructed stochastically evolving environment, the decision makers react to the situation that they find themselves in at the beginning of each period and try to optimize their outcome from the point of view of the maximization of the expected discounted lifetime utility. Forecasting of four different sources of income in each of eight different labour market states considered in the model allowed assessment of structural parameters of the indirect utility function dependent on income and leisure.

Under the deficiency of relevant data on work limitations and health, the current paper proposed and tested the latent variable approach for modelling eligibility for disability benefits within a structural dynamic model of retirement decisions. Very accurate estimates of health transition probabilities proved the fact that the dynamics of the latent health, which was modelled as a Markovian stochastic process, was perfectly identifiable in the current setup. The results of the estimation and the following policy simulation established the capability of the model to track
those individuals in bad health who do not choose retirement through disability in the anticipation of more favourable retirement opportunities. Thus, the approach has proved worthwhile. The chosen structural dynamic setup proved very fruitful in analysing the consequences of different policy changes, because it enabled the model to pick up the effects spread throughout several time periods and provided solid grounds for the analysis of dynamic behavioural responses.

After the model had been structurally estimated with the maximum likelihood method, the estimates were used to address the question of substitution between the AFP retirement scheme and disability pension. Simulation of the complete elimination of the AFP scheme revealed very little substitution of the displaced AFP retirees into full-time disability, but because the joint disability and employment positions on the labour market were also considered, the model was capable of providing a further insight into the issue of exit routes substitution. In fact, simulated substitution into partial disability accounts for 5–8% of the number of displaced AFP pensioners. To a certain extent, this paper closes the gap between the previously contradictory results of Røed and Haugen (2003) and Bratberg et al. (2004) by substantiating little substitution of the AFP pensioners into disability and out of the labour market, and at the same time providing evidence for a moderate inflow of otherwise AFP pensioners into partial disability combined with employment.

The paper succeeded in answering two important practical questions, providing solid evidence that displaced early retirees would certainly increase their labour supply, and relieving the worries that they might pour into the disability programme. However, another worrying scenario is left out of the current work.
Because bad health is defined as direct eligibility for a disability pension, the proposed approach does not address the problem of potential cheating of the system through forgery of medical screening results, feigned medical conditions, etc. The model allows eligibility for the disability pension to be non-genuine by leaving the whole application process in the shadow of the latent variable. In this respect, the estimated health transition probabilities reflect not only medical health but also application denials and moral hazard. The results of the policy simulation depend on the invariance of these estimates, and thus may be somewhat overly optimistic in assuming that the amount of faked disability remains constant throughout the policy change. To address these issues, a more complicated model would have to describe the disability insurance application process explicitly.

It also remains unclear how exactly disability application decisions are made. Both the estimation and the simulation results in this paper point to the importance of economic incentives, which drive the nearly total dominance of combined disability and employment over full-time disability. Another essential difference between these two labour market states is that the former retains choice in future periods, whereas full-time disability is completely absorbing. The extreme values of the structural estimates of agents' preferences, attained at the coefficients on the bad health and disability state indicators, suggest that there may be a social stigma associated with taking up a disability pension.

The estimation of the structural dynamic model was an intense computational task, in which the complexity of the standard approach was further raised by the very large data set used for estimation and by the need to integrate over the unobservables in the likelihood calculation.
To ensure successful estimation of the model, considerations of computational tractability were constantly the limiting factor, forcing some of its features to be abridged. The independence of health transitions in the model led to very sharp estimates of the parameters describing its dynamics, but did not allow investigation of how health could be affected by other state variables. In general, the amount of individual heterogeneity accounted for in the model was limited, first, by the number of state variables and the number of points in the state space and, second, by the number of parameters entering the model. Increasing these quantities raises the accuracy and credibility of the model, but also greatly increases the computational complexity of its estimation. The structural dynamic model estimated in this paper is thus a compromise between accuracy and computational tractability. Moreover, robustness checks, which would be a natural way of testing the restrictiveness of many assumptions in the model, also proved unreasonably costly because of the computational complexity of the model estimation.

Nevertheless, the model developed in this paper demonstrates the value of the proposed latent variable approach and its capacity to make a considerable practical contribution. I believe that by extending the basic dynamic framework of this paper to account for disability application decisions and for more individual heterogeneity in preferences and health transitions, it may eventually be possible to provide a fully comprehensive explanation of the labour market behaviour of elderly workers.
ACKNOWLEDGMENTS

I would like to thank the anonymous referees, participants of the 19th EC-squared Conference in Rome, the 3rd Italian Congress of Econometrics and Empirical Economics in Ancona, Italy, the 4th PhD Presentation Meeting at UCL, London, and the Simposio de Analisis Economico in Zaragoza, Spain, as well as seminar participants at the Centre for Applied Microeconometrics at the University of Copenhagen, the New Economic School, Moscow, and the Department of Economics at the University of Maryland for helpful comments. This paper was written within Project 1133 (Working life and welfare of the elderly) at the Frisch Centre for Economic Research. Financial support from the Research Council of Norway is gratefully acknowledged.
REFERENCES

Au, D., T. Crossley and M. Schellhorn (2005). The effect of health changes and long-term health on the work activity of older Canadians. Health Economics 14, 999–1018.
Berkovec, J. and S. Stern (1991). Job exit behavior of older men. Econometrica 59, 189–210.
Blau, D. M. (1994). Labor-force dynamics of older men. Econometrica 62, 117–56.
Bloom, D. E., D. Canning and M. Moore (2004). The effect of improvements in health and longevity on optimal retirement and saving. NBER Working Paper No. 10919, National Bureau of Economic Research.
Blundell, R., C. Meghir and S. Smith (2002). Pension incentives and the pattern of retirement. Economic Journal 112, 153–70.
Bound, J. (1989). The health and earnings of rejected disability insurance applicants. American Economic Review 79, 482–503.
Bound, J. (1991). Self-reported versus objective measures of health in retirement models. Journal of Human Resources 26, 106–38.
Bound, J. (1998). The dynamic effects of health on the labor force transitions of older workers. NBER Working Paper No. 6777, National Bureau of Economic Research.
Bound, J., T. Stinebrickner and T. Waidmann (2010). Health, economic resources and the work decisions of older men. Journal of Econometrics 156, 106–29.
Bratberg, E., T. H. Holmås and Ø. Thøgersen (2004). Assessing the effects of an early retirement program. Journal of Population Economics 17, 387–408.
Burkhauser, R. V., J. S. Butler and G. Gumus (2004). Dynamic programming model estimates of social security disability insurance application timing. Journal of Applied Econometrics 19, 671–85.
Coile, C. (2004). Health shocks and couples' labor supply decisions. NBER Working Paper No. 10810, National Bureau of Economic Research.
Crawford, V. P. and D. M. Lilien (1981). Social-security and the retirement decision. Quarterly Journal of Economics 96, 505–29.
Dagsvik, J. K. and S. Strøm (1992). Labor supply with non-convex budget sets, hours restriction and nonpecuniary job-attributes. Report No. 76, Central Bureau of Statistics.
Dagsvik, J. K., S. Strøm and Z. Y. Jia (2006). Utility of income as a random function: behavioral characterization and empirical evidence. Mathematical Social Sciences 51, 23–57.
Dempster, A. P., N. M. Laird and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38.
Disney, R., C. Emmerson and M. Wakefield (2006). Ill health and retirement in Britain: a panel data-based analysis. Journal of Health Economics 25, 621–49.
Dwyer, D. S. and O. S. Mitchell (1999). Health problems as determinants of retirement: are self-rated measures endogenous? Journal of Health Economics 18, 173–93.
Fevang, E. and K. Røed (2006). Veien til uføretrygd i Norge. Report No. 10, Frisch Centre for Economic Research.
French, E. (2005). The effects of health, wealth, and wages on labour supply and retirement behaviour. Review of Economic Studies 72, 395–427.
Gilleskie, D. B. (1998). A dynamic stochastic model of medical care use and work absence. Econometrica 66, 1–45.
Gordon, R. H. and A. S. Blinder (1980). Market wages, reservation wages and retirement decisions. Journal of Public Economics 14, 277–308.
Greene, W. H. (2000). Econometric Analysis. Englewood Cliffs, NJ: Prentice Hall.
Gustman, A. L. and T. L. Steinmeier (2000). Retirement in dual-career families: a structural model. Journal of Labor Economics 18, 503–45.
Gustman, A. L. and T. L. Steinmeier (2002). Retirement and the stock market bubble. NBER Working Paper No. 9404, National Bureau of Economic Research.
Gustman, A. L. and T. L. Steinmeier (2004). Social security, pensions and retirement behaviour within the family. Journal of Applied Econometrics 19, 723–37.
Hernæs, E., M. Sollie and S. Strøm (2000). Early retirement and economic incentives. Scandinavian Journal of Economics 102, 481–502.
Heyma, A. (2004). A structural dynamic analysis of retirement behaviour in the Netherlands. Journal of Applied Econometrics 19, 739–59.
Holen, D. S. (2008). Disability pension motivated income adjustment. Working Paper No. 17/2008, Department of Economics, University of Oslo.
Iskhakov, F. (2008a). Dynamic programming model of health and retirement. Working Paper No. 03/2008, Department of Economics, University of Oslo.
Iskhakov, F. (2008b). Pension reform in Norway: evidence from a structural dynamic model. Working Paper No. 14/2008, Department of Economics, University of Oslo.
Jia, Z. (2005). Retirement behavior of working couples in Norway: a dynamic programming approach. Discussion Paper Series No. 405, Statistics Norway.
Karlström, A., M. Palme and I. Svensson (2004). A dynamic programming approach to model the retirement behaviour of blue-collar workers in Sweden. Journal of Applied Econometrics 19, 795–807.
Kristoffersen, P. and A. Sagsveen (2005). Innskjerpingen i attføringsvilkåret i 2000. Tid fra avslag på søknad om uførepensjon til overgang til andre trygdeytelser og arbeid. Report No. 05/2005, Rikstrygdeverket.
McFadden, D. (1974). The measurement of urban travel demand. Journal of Public Economics 3, 303–28.
Mealli, F. and S. Pudney (1996). Occupational pensions and job mobility in Britain: estimation of a random-effects competing risks model. Journal of Applied Econometrics 11, 293–320.
Midtsundstad, T. (2004). Hvor mange har rett til AFP? Fafo-paper 2004:4, Institute for Labour and Social Research (Fafo).
Murphy, K. M. and R. H. Topel (1985). Estimation and inference in two-step econometric models. Journal of Business and Economic Statistics 3, 370–79.
Rust, J. (1987). Optimal replacement of GMC bus engines: an empirical model of Harold Zurcher. Econometrica 55, 999–1033.
Rust, J. (1994). Structural estimation of Markov decision processes. In R. F. Engle and D. L. McFadden (Eds.), Handbook of Econometrics, Volume 4, 3081–143. Amsterdam: Elsevier Science.
Rust, J. and C. Phelan (1997). How social security and Medicare affect retirement behavior in a world of incomplete markets. Econometrica 65, 781–831.
Røed, K. and F. Haugen (2003). Early retirement and economic incentives: evidence from a quasi-natural experiment. Labour 17, 203–28.
Røgeberg, O. J. (2000). Married men and early retirement under the AFP scheme. Working Paper No. 02/2000, Department of Economics, University of Oslo.
Samwick, A. A. (1998). New evidence on pensions, social security, and the timing of retirement. Journal of Public Economics 70, 207–36.
Stock, J. H. and D. A. Wise (1990). Pensions, the option value of work, and retirement. Econometrica 58, 1151–80.
Tysse, T. I. (2001). The effects of enterprise characteristics on early retirement. Report 2001/26, Statistics Norway (SSB).
APPENDIX A: MULTI-STAGE ESTIMATION STRATEGY

To reduce the computational burden of the model estimation, the likelihood maximization is performed in several stages, as indicated in Table A.1. At the preliminary first stage, the partial transition probabilities $p(e_t \mid ps_t, e_{t-1}, nw_t, aw_t)$ and $p(nw_t \mid ps_t, nw_{t-1})$, which respectively govern the evolution of the AFP eligibility indicator $e_t$ and of $nw_t$, the number of last consecutive years with annual salary exceeding 1G, are estimated directly using the relevant sample data.28 At the second stage, I follow Rust (1994) and formulate the partial likelihood function

$$
L'(\theta') = \prod_{a=1}^{A} \left( \sum_{(h,m)^a \in HM^a} p_0\left(m_{T_0^a}, h_{T_0^a}, \theta'\right) \prod_{t=T_0^a+1}^{T^a} \frac{1}{\left|D_t(s_t^a)\right|} \cdot p\left(s_t^a \mid s_{t-1}^a, d_{t-1}^a, \theta'\right) \right), \tag{A.1}
$$
$$
s_t^a = \left(ps_t^a, h_t, m_t, e_t^a, sp_t^a, nw_t^a, aw_t^a\right),
$$

which is maximized with respect to the separable subvector $\theta' = (\pi^h_{00}, \pi^h_{01}, \pi^h_{11}, \pi^h_{init}, \pi^m_{01}, \pi^m_{02}, \pi^m_{11}, \pi^m_{12}, \pi^m_{21}, \pi^m_{22}, \omega, c^e_1, \ldots, c^e_5, c^{nw}_1, \ldots, c^{nw}_5)$ of 21 parameters corresponding to the transition probabilities $p(s_t \mid s_{t-1}, d_{t-1}, \theta')$. Here $|D_t(s_t^a)|$ denotes the number of alternatives in the choice set $D_t(s_t^a)$ available to the agent at time $t$; in other words, the choice probabilities $P_t(d_t^a \mid s_t^a, \theta)$ in (3.8) are replaced with the uniform distribution over the corresponding choice set. This reformulation eliminates the need to recalculate the value function at each evaluation of the likelihood function, so the main computationally demanding backward induction calculation is avoided at stage 2. The resulting estimates of the parameters in the subvector $\theta'$ are consistent and asymptotically normal, but not efficient.

28 The estimates from this and other intermediate stages are reported in Iskhakov (2008a).

Table A.1. Estimation strategy.
Parameters | Number of parameters | Description | Stage 1 | Stage 2 | Stage 3 | Stage 4
$\pi^h_{00}, \pi^h_{01}, \pi^h_{11}, \pi^h_{init}$ | 4 | Health transition probability matrix and initial condition | - | X | - | X
$\pi^m_{01}, \pi^m_{02}, \pi^m_{11}, \pi^m_{12}, \pi^m_{21}, \pi^m_{22}, \omega$ | 7 | Matching transition probability matrix | - | X | - | X
$c^e_1, \ldots, c^e_5$ | 5 | AFP eligibility indicator motion rule | X | X | - | X
$c^{nw}_1, \ldots, c^{nw}_5$ | 5 | Number of years with high wage motion rule | X | X | - | X
$\beta$ | 1 | Discount factor | - | - | X | X
$\lambda, a, b_1, \ldots, b_3, c_1, \ldots, c_5, c^{tf}_1, \ldots, c^{tf}_3$ | 14 | Utility function parameters | - | - | X | X

The third stage of the estimation corresponds to the second stage of the algorithm proposed by Rust (1994) and also yields consistent and asymptotically normal, but not efficient, estimates. Here the parameter subvector $\theta'$ is kept fixed while the complete likelihood function (8) is maximized with respect to the remaining parameters, namely $\theta'' = (\beta, \lambda, a, b_1, \ldots, b_3, c_1, \ldots, c_5, c^{tf}_1, \ldots, c^{tf}_3)$. Greene (2000) references Murphy and Topel (1985), who provide a way of calculating the adjusted covariance matrix of the estimates of the joint parameter vector $(\theta', \theta'')$, thus allowing standard errors to be computed and hypotheses to be tested. Nevertheless, to avoid these cumbersome calculations and to improve the accuracy of the estimation, I perform a final fourth stage, which takes the estimates of the previous stages as starting values and performs several additional steps of the quasi-Newton line search algorithm with respect to all the parameters. Rust (1994) shows that a single Newton step is sufficient to obtain estimates asymptotically equivalent to the full information maximum likelihood estimates, but the complicated 'flat' shape of the likelihood function in the current model makes it possible to arrive at estimates with smaller estimated standard errors after several such iterations, which are in fact performed at the fourth stage. To estimate the standard errors, I use the 'information equality' to approximate the covariance matrix of the estimates from the numerically calculated Hessian after convergence is achieved.
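To make the second-stage idea concrete, the following is a minimal Python sketch of one individual's contribution to a partial likelihood of the form (A.1) in a toy setting. It assumes a single binary latent health state (0 = good, 1 = bad), a hypothetical 2×2 transition matrix, and choice-set sizes that depend only on health. None of the values or names (`partial_likelihood_bruteforce`, `partial_likelihood_forward`, `compatible`, `n_choices`) come from the paper, whose implementation is in MatLab and C. The sum over latent paths is computed both by explicit enumeration and by a forward recursion.

```python
from itertools import product


def partial_likelihood_bruteforce(p0, P, compatible, n_choices):
    """Sum over all latent health paths that the observations do not rule out.

    p0[h]         : initial probability of latent state h (hypothetical values)
    P[h][h2]      : transition probability from h to h2
    compatible[t] : set of latent states consistent with the data at time t
    n_choices[h]  : size of the choice set |D_t| when the latent state is h
    """
    T = len(compatible)
    total = 0.0
    for path in product(range(len(p0)), repeat=T):
        if any(h not in compatible[t] for t, h in enumerate(path)):
            continue
        lik = p0[path[0]]
        for t in range(1, T):
            # The uniform probability 1/|D_t| replaces the structural choice
            # probability, so no value function has to be computed here.
            lik *= P[path[t - 1]][path[t]] / n_choices[path[t]]
        total += lik
    return total


def partial_likelihood_forward(p0, P, compatible, n_choices):
    """Same quantity via a forward recursion, linear in the number of periods."""
    states = range(len(p0))
    alpha = [p0[h] if h in compatible[0] else 0.0 for h in states]
    for t in range(1, len(compatible)):
        alpha = [
            sum(alpha[h] * P[h][h2] for h in states) / n_choices[h2]
            if h2 in compatible[t] else 0.0
            for h2 in states
        ]
    return sum(alpha)


# Toy inputs, for illustration only.
p0 = [0.8, 0.2]
P = [[0.9, 0.1], [0.3, 0.7]]
compatible = [{0, 1}, {0, 1}, {1}]  # bad health is revealed in the last period
n_choices = [3, 4]                  # one extra alternative when health is bad

brute = partial_likelihood_bruteforce(p0, P, compatible, n_choices)
forward = partial_likelihood_forward(p0, P, compatible, n_choices)
```

Both functions return the same value; the recursion is preferable when the number of periods grows, because enumeration scales exponentially in the length of the observed history.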
The model was implemented in the MatLab environment, with the inner circuit of the algorithm (which is very similar to the nested fixed-point algorithm of Rust, 1987) programmed as a dynamically linked library written in the C programming language, and the outer circuit handled entirely by the standard MatLab unconstrained minimization routine.29 Using C for the computationally demanding parts of the program (both the value function calculation and the integration over the unobserved variables) kept the running time to a minimum. On the Frisch Centre server, one evaluation of the likelihood function over half of the data set took approximately 160–170 seconds, with approximately 60 seconds spent on the value function calculation.30 Production runs of the optimization routine were performed in the supercomputing centre of the University of Oslo.31 Distributing the computational load over 8 to 12 CPUs brought the run-time of one evaluation over the full sample down to 9–12 seconds.32 Parallelization was applied to both the value function calculation and the integration over the unobservables.

29 Namely, the fminunc(·) procedure with numerical derivatives and BFGS Hessian updating, followed by at most three single Newton steps based on the numerically approximated Hessian for 'fine tuning'.
30 The Frisch Centre server at the time was Anton: a Dell PowerEdge 2850 x64-based PC with 8 EM64T Family 15 Model 4 Stepping 8 GenuineIntel ∼2793 MHz processors, 8GB physical memory, running Microsoft Windows Server 2003 Enterprise x64 Edition, version 5.2.3790 Service Pack 1 Build 3790. 32bit MatLab version 7.3.1.267 (R2006b).

In addition, the use of an unconstrained optimization routine required reformulating the parameters $(\pi^h_{00}, \pi^h_{01}, \pi^h_{11}, \pi^h_{init}, \pi^m_{01}, \pi^m_{02}, \pi^m_{11}, \pi^m_{12}, \pi^m_{21}, \pi^m_{22}, \omega)$, which are constrained by the obvious probability limitations, using the logistic transformation

$$
g(\tau) = \frac{1}{1+\exp(\tau)}
$$

for single constrained parameters, or

$$
g(\tau_k) = \frac{\exp(\tau_k)}{1+\exp(\tau_1)+\cdots+\exp(\tau_K)}, \quad k \in \{1, \ldots, K\},
$$

for connected constrained parameters. Invariance of maximum likelihood estimation and the theorem on the asymptotic distribution of a non-linear function in Greene (2000, Theorem 4.17) provide a way of recalculating the estimates and their covariance matrix back in accordance with these transformations.
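The reparameterization described above can be sketched in a few lines of Python. The sketch assumes the two transformations exactly as stated in the text, and recovers a standard error for a single transformed parameter with the delta method, which is one standard way to apply the kind of result cited from Greene (2000). The function names and the numerical inputs are hypothetical and are not taken from the paper's code.

```python
import math


def g_single(tau):
    """Map an unconstrained tau to a probability in (0, 1): 1 / (1 + exp(tau))."""
    return 1.0 / (1.0 + math.exp(tau))


def g_connected(taus):
    """Map K unconstrained values to K probabilities whose sum is below one.

    The remaining mass, 1 / (1 + sum(exp(tau_j))), belongs to the omitted
    reference category, so a full row of a transition matrix sums to one.
    """
    exps = [math.exp(t) for t in taus]
    denom = 1.0 + sum(exps)
    return [e / denom for e in exps]


def delta_method_se_single(tau_hat, se_tau):
    """Standard error of p = g_single(tau) by the delta method.

    dg/dtau = -p * (1 - p), so se(p) is approximately p * (1 - p) * se(tau).
    """
    p = g_single(tau_hat)
    return p * (1.0 - p) * se_tau


# Hypothetical unconstrained estimates, for illustration only.
p_single = g_single(0.0)
p_row = g_connected([0.0, 0.0])
se_p = delta_method_se_single(0.0, 0.4)
```

The optimizer works on the unconstrained values, and the probabilities and their standard errors are recovered only after convergence, so no bound constraints need to be imposed during the search.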
APPENDIX B: HOUSEHOLD INCOME

Because the structural dynamic model developed in this paper uses the McFadden conditional approach to represent the choices of the agent, household incomes have to be predicted in all accessible labour market states and not only in those in which the agents are observed. One technique for assessing such counterfactual incomes uses simplified endogenous equations, which track the main sources but disregard small details. In the current setup, where income is virtually the only continuous variable and therefore carries much of the explanatory power, another technique is used, which preserves and carries over the largest possible amount of information into the structural model. Different sources of income are recorded in a preliminary data analysis and tied to the state variables of the model by means of several statistical approximations. Counterfactual incomes are thus predicted using observed incomes of similar individuals in similar conditions. The amount of available raw sample data ensures that all combinations of such conditions provide enough observations for accurate estimation of these approximations. I also assume away possible adverse selection problems on the grounds that the enormous sample contains observations of both those who do and those who do not take a particular decision in similar situations, and that these outcomes appear at random.

Several helpful considerations are taken into account when formulating the statistical models (mainly linear multinomial regressions) for the different income sources. First, the assumed timing convention allows conditioning on both the previous and the current period labour market states in the income equations. Second, the problem of censoring at zero is negligible for wage, pension and spouse incomes, but is severe for additional incomes, since zero additional income is observed frequently.
Instead of using a censored regression to forecast this income source, I estimate an auxiliary logit regression to predict whether it is positive. Third, because the estimated equations enter the structural dynamic model, both the goodness-of-fit measure and the estimated standard error of the residuals are important indicators of the quality of the established relationships.

Table B.1 presents a summary of the estimated income equations. Each equation was estimated on a specific data set (defined by the 'estimated on' rows) and is used for specific individuals (defined by the 'applied to' rows). This filtering ensures that the income equations are estimated only on data from those actually observed in particular states (working, pension, etc.) and that the forecasts are applied to the agents transferring to these states. The income sources are assumed to be independent and are therefore forecasted separately under ordinary regression assumptions.

Table B.1 reports the estimates of 113 coefficients in 11 equations representing four sources of income. Wage income is represented by three equations corresponding to three different age intervals: regular working careers, pre-retirement and exceptional employment after normal retirement age. These periods are

31 The Titan cluster, at the time comprising 1852 CPUs (on four-core SUN X2200 AMD and 2-CPU DELL 1425 Intel nodes), was used; see http://www.hpc.uio.no, http://lanlord.titan.uio.no/ganglia/.
32 Run-time decreased non-linearly in the number of CPUs.
−0.009 1.5E-5
−0.007 1E-5
aw2 aw3
2.24
−4.591 1.58
0.001
−2.08 0.229
sp nw aw
−82.030
≥67
≥67
−0.001
0.532
161.099 −1.901 −6.244
?–1 afp − 67
afp − 67
1–1
−0.001
−4.964 0.534
242.88 −6.442
−1692.1 −7.88 −15.981
≥67
≥67
0.403
375.9 −5.288 −6.69
Estimates
?–2
2–2
−2.221 0.243
−5.121 −1.017 0.118
−1.081 −0.093 0.0005
0
1.408 −0.102
12.928
25.993
logit >
1 2
inc > 0
0.167
0.325
−8.595 0.617
−0.747
any(∗ )
any(∗ )
Additional
21.978
?–(5,7)
?–(5,7)
Table B.1. Summary of models for different sources of income. Pension
−0.166
13.339
16.474
−1.25
−28.438
(≥3)–(≥3) 60–67
60–67
33.046
<60
<60
(≥3)–(≥3)
Wage
age − 50 (age − 50)2
Constant AFP age 1(f emale)
Variable
pst –pst+1 age, other
Applied to:
age, other
Estimated on: pst –pst+1
Income
women
−0.001
0.365
−8.608 0.069
−502.22 9.635
−0.601 0.245
−17.12 1.08
664.10 −8.726
spouse exists men women
men
spouse exists
Spouse
53,212 0.3212 31.133
108,727 0.2233 55.415
216,503 0.6087 21.466
0.3574 30.311
0.5088
0.3946 38.659
−24.644 886,269
0.1083 83.96
18.651 18.698 1,076,041
9.071 11.589
21.226 10.907 9.828
0.0902 110.555
433,340
Notes: Estimates are obtained in ordinary regressions except one logit model marked with (∗ ), for which McFadden’s phi is reported in place of R-square. pst+1 denotes the current labour market state; see Section 3.2 for descriptions of other variables.
0.6663 63.028
0.6445 80.209
0.7236 50.767
−1.538 3.728 2,262,566
R-square Std. error res.
−7.979 151,419
51.332 −8.728 914,839
1(pst+1 = 6) 1(pst+1 = 7) Num. of obs.
56.137 −7.919 703,107
−1.337 3.257
48.07 −18.853
1(pst+1 = 4) 1(pst+1 = 5)
−29.762
63.23 −17.351 −21.772 9.563 0.346 3.157 −1.042
1(pst+1 = 1) 1(pst+1 = 2) 1(pst+1 = 3)
11,296
−0.01
1(pst = 7) 1(pst+1 = 0)
46.185 −14.388
0.053
16.365
0.06 7.802 16.2 20.509
0.042
Table B.1. Continued. Estimates
1(pst = 4) 1(pst = 5) 1(pst = 6)
aw · nw
Variable
chosen in accordance with the growing variance of wage earnings revealed by a series of age-specific models. Pension income falls into four distinct categories (AFP pension, old-age (NIS) pension, full and partial disability pension), which can be clearly separated by the labour market state and age variables, facilitating the corresponding partitioning of the data. I drop the first retirement year from the data sets used in the estimation of the first three pension equations to eliminate the effect of the transition period. I estimate a separate logit model for positive additional incomes, which makes it possible to use a regression estimated only on the positive values of the dependent variable. In the model, additional income is then forecasted only for those households for which the logit model predicts a probability over one-half. Spouse income is modelled separately for the two genders, reflecting the two major types of households: retiring older husbands with younger wives and retiring younger wives with husbands already on pension.

In general, the analysis of goodness of fit in the income equations reveals the main complication of the chosen statistical approach to predicting the counterfactuals. Considerable heterogeneity in the data and the limited number of explanatory variables result in a rather poor fit of the predicting models. As follows from the last rows of Table B.1, the best R-square of 72.36% is found in a wage income equation, while the most problematic equations are those for the spouse's income. Poor fit of the income equations causes errors in the utility calculation for the different labour market states and affects the calculation of the likelihood function, influencing the parameter estimates. However, this influence does not violate the asymptotic properties of the estimates; it only reduces the goodness of fit of the model.
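The two-part rule for additional income described above can be illustrated with a short Python sketch. It assumes a logit for the event that additional income is positive and a linear regression fitted on positive incomes only, with the regression forecast used only when the predicted probability exceeds one-half. The coefficient values, the choice of regressors and the function names are hypothetical and are not the Table B.1 estimates.

```python
import math


def logit_probability(x, gamma):
    """P(additional income > 0) from a logit with coefficient vector gamma."""
    index = sum(g * v for g, v in zip(gamma, x))
    return 1.0 / (1.0 + math.exp(-index))


def forecast_additional_income(x, gamma, beta, threshold=0.5):
    """Two-part forecast: use the regression only if the logit predicts 'positive'."""
    if logit_probability(x, gamma) > threshold:
        return sum(b * v for b, v in zip(beta, x))
    return 0.0


# Hypothetical coefficients on (constant, age - 50), for illustration only.
gamma = [-1.0, 0.2]   # logit part
beta = [10.0, 0.5]    # regression part, fitted on positive incomes only

young = forecast_additional_income([1.0, 2.0], gamma, beta)   # logit index -0.6
old = forecast_additional_income([1.0, 10.0], gamma, beta)    # logit index 1.0
```

In this toy case the first household receives a zero forecast because its predicted probability is below one-half, while the second receives the regression prediction.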
The tax function is estimated as a linear equation using the simulated data, and provides the relationship between the Norwegian tax rules and the state variables of the model:

$$
Tx(I_t) = \left(1 - \underset{(0.000)}{0.423}\right) \cdot I_t + \underset{(0.08)}{21.95} \cdot 1(sp_t = 1) + \underset{(0.103)}{18.72} \cdot 1(ps_t \in \{1, 2\}) + \underset{(0.113)}{9.25} \cdot 1(ps_t \geq 3) + \underset{(0.103)}{24.50}. \tag{B.1}
$$
The estimates indicate an average marginal tax rate for the household incomes of 42.3%, but the tax amount decreases for full households, working and especially retired people. Estimated with a very tight fit (R-square 98.34%), the tax equation is treated as deterministic. More details can be found in Iskhakov (2008a).
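As a worked illustration, the following Python sketch evaluates one reading of equation (B.1). It assumes that the first term is (1 − 0.423)·I_t, that the second state indicator is 1(ps_t ≥ 3), and that all amounts are in the income units used in the paper, which are not restated here. The function name and the example households are hypothetical.

```python
def tax_function_value(income, has_spouse, ps):
    """Evaluate Tx(I_t) under the assumed reading of equation (B.1).

    income     : household income I_t before the tax function is applied
    has_spouse : True when sp_t = 1
    ps         : labour market state ps_t, an integer between 0 and 7
    """
    value = (1.0 - 0.423) * income + 24.50
    if has_spouse:
        value += 21.95
    if ps in (1, 2):
        value += 18.72
    elif ps >= 3:
        # Assumption: the relation sign lost in the source text is ">=".
        value += 9.25
    return value


# Hypothetical households with the same income of 200 units.
single_state0 = tax_function_value(200.0, has_spouse=False, ps=0)
couple_state4 = tax_function_value(200.0, has_spouse=True, ps=4)
```

Under this reading the two households differ by exactly the spouse and state shifts, 21.95 + 9.25, which is how the equation ties the tax rules to the state variables.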
The Econometrics Journal (2010), volume 13, pp. 291–292. doi: 10.1111/j.1368-423X.2010.00335.x
Index to The Econometrics Journal Volume 13
ORIGINAL ARTICLES  PAGE
Bauwens, L., A. Preminger and J.V.K. Rombouts, Theory and inference for a Markov switching GARCH model  218
Browning, M. and J.M. Carro, Heterogeneity in dynamic discrete choice models  1
Bun, M.J.G. and F. Windmeijer, The weak instrument problem of the system GMM estimator in dynamic panel data models  95
Choi, H.-S. and N.M. Kiefer, Improving robust model selection tests for dynamic models  177
Fiorio, C.V., V.A. Hajivassiliou and P.C.B. Phillips, Bimodal t-ratios: the impact of thick tails on inference  271
Honoré, B.E. and L. Hu, Estimation of a transformation model with truncation, interval observation and time-varying covariates  127
Jiang, G.J. and J.L. Knight, ECF estimation of Markov models where the transition density is unknown  245
Lee, L.-F., X. Liu and X. Lin, Specification and estimation of social interaction models with network structures  145
Madsen, E., Unit root inference in panel data models where the time-series dimension is fixed: a comparison of different tests  63
Schafgans, M.M.A. and V. Zinde-Walsh, Smoothness adaptive average derivative estimation  40
Wright, J.H., Testing the adequacy of conventional asymptotics in GMM  205
SPECIAL ISSUE ARTICLES
Fève, F. and J.-P. Florens, The practice of non-parametric estimation by solving inverse problems: the example of transformation models  S1
Haan, P. and V. Prowse, A structural approach to estimating the effect of taxation on the labour market dynamics of older workers  S99
Iskhakov, F., Structural dynamic model of retirement with latent health indicator  S126
Komunjer, I. and A. Santos, Semi-parametric estimation of non-separable models: a minimum distance from independence approach  S28
C 2010 The Author(s). The Econometrics Journal C 2010 Royal Economic Society. Published by Blackwell Publishing Ltd, 9600
Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
Magnusson, L.M., Inference in limited dependent variable models robust to weak identification  S56
Vanhems, A., Non-parametric estimation of exact consumer surplus with endogeneity in price  S80
REVIEW
Hoff, P.D., A First Course in Bayesian Statistical Methods  B1
REVIEWER
Koop, G.  B1