Econometrics - PDF Free Download

Copyright@ 2000 by Princeton University Press Published by Princeton University Press 41 William Street, Princeton, New...

Author: Fumio Hayashi

181 downloads 2520 Views 9MB Size Report

This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!

Report copyright / DMCA form

DOWNLOAD PDF

Copyright@ 2000 by Princeton University Press Published by Princeton University Press 41 William Street, Princeton, New Jersey 08540 In the United Kingdom: Princeton University Press, 3 Market Place, Woodstock. OX20 1SY All Rights Reserved Library of Congress Cataloging-in-Publication Data Hayashi, Fumio. Econometrics I Fumio Hayashi. p. em. Includes bibliographical references and index. ISBN 0-691·01 018·8 (alk. paper) 1. Econometrics. I. Title. HB139.H39 2000 330'.01 '5195--dc21

OO--Q34665

This book has been composed in Times Roman and Helvetica The paper used in this publication meets the minimum requirements of ANSI/NISO 239.48-1992 (R 1997) (Permanence of Paper) www.pup.princefon .edu Printed in the United States of America 10 9 8 7 6 5 4 3 2 1

Preface

This book is designed to serve as the textbook for a first-year graduate course in econometrics. It has two distinguishing features. First. it covers a full range of techniques with the estimation method called the Ge11eralized Method of Moments (GMM) as the organizing principle. I believe this unified approach is the most efficient way to cover the first-year materials in an accessible yet rigorous matmer. Second, most chapters include a section examining in detail original applied articJes from such diverse fields in economics as industrial organization, labor, finance, international, and macroeconomics. So the reader will know how to use the techniques covered in the chapter and under what conditions they arc applicable. Over the last several years, the lecture notes on which this book is based have been used at the University of Pennsylvania, Columbia University, Princeton University, the University of Tokyo, Boston College, Harvard University, and Ohio State University. Students seem to like the book a lot. My own experience from teaching out of the book is that students think the book is much better than the instructor. Prerequisites The reader of this hook is assumed to have a working knowledge of the basics of calculus, probability theory, and linear algebra. An understanding of the following concepts is taken for granted: functions of several variables, partial derivatives, integrals, random variables, joint distributions, independence, unconditional

and conditional expectations, variances and covariances of vector random variables, normal distributions, chi-square distributions, matrix multiplication, inverses

of matrices, the rank of a matrix, detem1inants, and positive definite matrices. Any relevant concepts above this level will be introduced as the discussion progresses. Results on partitioned matrices and Kronecker products are collected in the appendix. Prior exposure to undergraduate econometrics is not required. Organization of the Book To understand how the book is organized, it is useful to distinguish between a model and an estimation procedure. The basic premise of econometrics is that economic data (such as postwar U.S. GOP) are random variables. A model is a

XX

Preface

family of probability distributions that could possibly have generated the economic data. An estimation procedure is a data-based protocol for choosing from the model a particular distribution that is likely to have generated the data. Most estimation procedures in econometrics are a specialization of the GMM estimation principle.

For example. when GMM is applied to a model called the classical linear regression model, the resulting estimation procedure is Ordinary Least Squares (OLS), the most basic estimation procedure in econometrics. This viewpoint is the organizing

principle in the first six chapters of the book, where most of the standard estimation procedures arc presented. The book could have presented GMM in the first chapter, but that would deprive the reader of the option to follow a series of topics specific to OLS without getting distracted by GMM. For this reason! chose to usc the first two chapters to present the finite-sample and large-sample theory of OLS. GMM is presented in Chapter 3 as a generalization of OLS. A major expositional innovation of the book is to treat multiple-equation estimation procedures- such as Seemingly Unrelated Regressions (SUR), Three-Stage Least Squares (3SLS), the Random-Effects method, covered in Chapter 4, and the Fixed-Effects method covered in Chapter 5- as special cases of the singleequation GMM of Chapter 3. This makes it possible to derive the statistical properties of those advanced techniques merely by suitably specializing the results about single-equation GMM developed in Chapter 3. Chapter 6 completes the book's discussion of GMM by indicating how serial dependence in the error tenn can be incorporated in GMM. For some models in econometrics, Maximum Likelihood (ML) is the more natural estimation principle than GMM. ML is covered in Chapters 7 and 8. To make clear the relationship between GMM and ML, the book's discussion of ML starts out in Chapter 7 with an estimation principle called Extremum Estimators, which includes both ML and GMM as special cases. Applications of ML to various models arc covered in Chapter 8. The book also includes an extensive treatment of time-series analysis. Basic time-series topics arc covered in Section 2.2 and in the first half of Chapter 6. That is enough of a prerequisite for the important recent advances in nonstationary time-series analysis, which arc covered in the last two chapters of the book. Designing a Course Out of the Book

Several different courses can be designed based on the book. • Assuming that the course meets twice for an hour and a half, eight weeks should be enough to cover core theory, which is Chapters 1-4 and 6 (excluding those

Preface

XXI

sections on specific economic applications), Chapter 7 (with proofs and examples skipped), and Chapter 8. • A twelve-week semester course can cover, in addition to the core theory, Chapter 5 and the economic applications included in Chapters 1-6. • A short (say, six-week) course specializing in GMM estimation in cross-section and panel data would cover Chapters 1-5 (excluding Sections 2.10-2.12 hut including Section 2.2). Chapters 7 and 8 (excluding Section 8.7) would add the ML component to the course. • A short time-series course covering recent developments with economic applications would consist of Chapter I (excluding Section 1.7), Chapter 2, parts of Chapters 6 and 7 (Sections 6.1-{).5, 6.8, and 7.1), and Chapters 9 and 10. The applications sections in Chapters 2, 6, 9, and 10 can be dropped if the course focuses on just theory. Review Questions and Analytical Exercises The book includes a number of short questions for review at the end of each chapter, with lots of hints (and even answers). They can be used to check whether the

reader actually understood the materials of the section. On the second reading, if not the first, readers should try to answer them before proceeding. Answers to selected review questions are available from the book's website, http://pup.princeton.edu/ titles/6946 .html. There are several analytical exercises at the end of each chapter that ask the reader to prove results left unproved in the text or supplementary results that are useful for their own sake. Unless otherwise indicated, analytical exercises can be skipped without loss of continuity. Empirical Exercises Each chapter usually has one big empirical exercise. It asks for a replication of the empirical results of the original article discussed in the applications section of the chapter and an estimation of the various extensions of the article's model using the

estimation procedures presented in the chapter. The dataset for estimation, which usually is the same as the one used by the original article, can be downloaded from the book's website mentioned above. To implement the estimation procedure on the dataset, readers should run a statistical package on a computer. There arc many statistical packages that arc widely used in econometrics. They include GAUSS (www.aptcch.com), MATLAB ( www.mathworks.com), Evicws (www.cvicws.com), LIMDEP (www.limdcp.com),

XXII

Preface

RATS (www.estima.com), SAS (www.sas.com), Stata (www.stata.com), and TSP (www.tsp.com). GAUSS and MATLAB are different from the rest in that they are matrix-based languages rather than a collection of procedures. Consider, for exam-

ple, carrying out OLS with GAUSS or MATLAB. After loading the dataset into the computer's workspace, it takes several lines reflecting the matrix operations of

OLS to calculate the OLS estimate and associated statistics (such as R 2 ). With the other packages, which are procedure-driven and sometimes called "canned packages," those several lines can be replaced by a one-line command invoking the OLS procedure. For example, TSP's OLS command is OLSQ. There are advantages and disadvantages with canned packages. Obviously, it takes far fewer lines to accomplish the same thing, so one can spend less time on programming. On the other hand, procedures in a canned package, which accept data and spit out point estimates and associated statistics, are essentially a black box. Sometimes it is not clear from the documentation of the package how certain statistics arc calculated. Although those canned packages mentioned above regularly incorporate new developments in econometrics, the estimation procedure desired may not be currently supported by the package, in which case it will be necessary to write one's own procedures in GAUSS or MATLAB. But it may be a blessing in disguise; actually writing down the underlying matrix operations provides you with an excellent chance to understand the estimation procedure. With only a few exceptions, all the calculations needed for the empirical exercises of the book can be carried out with any of the canned packages mentioned above. My recommendation, therefore, is for economics Ph.D. students who are planning to write an applied thesis using state-of-the-art estimation procedures or

a theoretical thesis proposing new ones, to use GAUSS or MATLAB. Otherwise, students should use any of the canned packages mentioned above. Mathematical Notation

There is no single mathematical notation used by everyone in econometrics. The book's notation follows the most standard, if not universal, practice. Vectors are treated as column vectors and written in bold lowercase letters. Matrices are in bold uppercase letters. The transpose of the matrix A is denoted by A'. Scalar variables are (mostly lowercase) letters in italics. Acknowledgments

I acknowledge with gratitude help from the following individuals and institutions. Mark Watson, Dale Jorgenson, Eo Honore, Serena Ng, Masao Ogaki, and Jushan Bai were kind and courageous enough to usc early versions of the book as the

Preface

XXIII

textbook for their econometrics courses. Comments made by them and their students have been incorporated in this final version. Yuzo Honda read the manuscript and offered helpful suggestions. Naoto Kunitomo, Whitney Newey, Serena Ng, Pierre Perron, Jim Stock, Katsuto Tanaka, Mark Watson, Hal White, and Yoshihiro Yajima took time out to answer my questions and enquiries. Two graduate students

at University of Tokyo, Mari Sakudo and Naoki Shimoi, read the entire manuscript to weed out typos. Their effort was underwritten by a grant-in-aid !rom the Zengin Foundation for Studies on Economics and Finance. Peter Dougherty, my editor at Princeton University Press, provided enthusiasm and just the right amount of

pressure. Stephanie Hogue was a versatile enough k'TJ3X expert to accommodate my formatting whims. Ellen Foos supervised production of the book. Last but not least, Jessica Helfand agreed, probably out of friendship, to do the cover design. For more than five years all my free time went into writing this book. Now, having completed the book, I feel like someone who has just been released from prison. My research suffered, but hopefully the profession has not noticed.

Contents

1

List of Figures

xvii

Preface

xix

Finite-Sample Properties of OLS

3

1.1

3 4

1.2

The Classical Linear Regression Model The Linearity Assumption Matrix Notation The Strict Exogeneity Assumption Implications of Strict Exogeneity Strict Exogeneity in Time-Series Models Other Assumptions of the Model The Classical Regression Model for Random Samples ''Fixed" Regressors The Algebra of Least Squares OLS Minimizes the Sum of Squared Residuals Normal Equations Two Expressions for the OLS Estimator More Concepts and Algebra Influential Analysis (optional)

I .3

I .4

A Note on the Computation of OLS Estimates Finite-Sample Properties of OLS Finite-Sample Distribution of b Finite-Sample Properties of s 2 Estimate of Var(b I X) Hypothesis Testing under Normality Normally Distributed Error Terms Testing Hypotheses about Individual Regression Coefficients Decision Rule for the 1-Test Confidence Interval

6 7

8 9 lO I2 I3

I5 I5 I6 18 I8 2I 23 27 27 30 31 33 33 35 37 38

VI

Contents

p-Value Linear Hypotheses The F-Test A More Convenient Expression for F

1.5

t==F An Example of a Test Statistic Whose Distribution Depends on X Relation to Maximum Likelihood The Maximum Likelihood Principle

Conditional versus Unconditional Likelihood The Log Likelihood for the Regression Model ML via Concentrated Likelihood Crarner-Rao Bound for the Classical Regression Model The F- Test as a Likelihood Ratio Test Quasi-Maximum Likelihood 1.6 Generalized Least Squares (GLS) Consequence of Relaxing Assumption 1.4 Efficient Estimation with Known V A Special Case: Weighted Least Squares (WLS) Limiting Nature of GLS 1.7 Application: Returns to Scale in Electricity Supply The Electricity Supply Industry The Data Why Do We Need Econometrics? The Cobb-Douglas Technology How Do We Know Things Are Cohh-Douglas? Are the OLS Assumptions Satisfied? Restricted Least Squares Testing the Homogeneity of the Cost Function Detour: A Cautionary Note on R 2 Testing Constant Returns to Scale Importance of Plotting Residuals Subsequent Developments Problem Set Answers to Selected Questions 2

Large-Sample Theory 2.1 Review of Limit Theorems for Sequences of Random Variables Various Modes of Convergence Three Useful Results

38

39 40 42 ~

45

47 47 47 48 48

49 52 53 54 55 55

58 58 60 60 60 61 62

63 64

65 65 67 67 68 68 71

84 BB 88 89 92

Contents

2.2

2.3

2.4

2.5

2.6

2.7 2.8

2.9

VII

Viewing Estimators as Sequences of Random Variables Laws of Large Numbers and Central Limit Theorems Fundamental Concepts in Time-Series Analysis Need for Ergodic Stationarity Various Classes of Stochastic Processes Different Formulation of Lack of Serial Dependence The CLT for Ergodic Stationary Martingale Differences Sequences Large-Sample Distribution of the OLS Estimator The Model Asymptotic Distribution of the OLS Estimator s 2 Is Consistent Hypothesis Testing Testing Linear Hypotheses The Test Is Consistent Asymptotic Power Testing Nonlinear Hypotheses Estimating E(efx,x;) Consistently Using Residuals for the Errors Data Matrix Representation of S Finite-Sample Considerations Implications of Conditional Homoskedasticity Conditional versus Unconditional Homoskedasticity Reduction to Finite-Sample Formulas Large-Sample Distribution oft and F Statistics Variations of Asymptotic Tests under Conditional Homoskedasticity Testing Conditional Homoskedasticity Estimation with Parameterized Conditional Heteroskedasticity (optional) The Functional Form WLS with Known a Regression of ef on z 1 Provides a Consistent Estimate of a WLS with Estimated a OLS versus WLS Least Squares Projection Optimally Predicting the Value of the Dependent Variable Best Linear Predictor OLS Consistently Estimates the Projection Coefficients

94 95 97 97 98 106 106 109 109 113 115 117 117 119 120 121 123 123 125 125 126 126 127 128 129 131 133 133 134 135 136 137 137 138 139 140

Contents

VIII

2.10 Testing for Serial Correlation Box-Pierce and Ljung-Box Sample Autocorrelations Calculated from Residuals Testing with Predetennincd, but Not Strictly Exogenous, Regressors An Auxiliary Regression- Based Test 2.11 Application: Rational Expectations Econometrics The Efficient Market Hypotheses Testable Implications Testing for Serial Correlation Is the Nominal Interest Rate the Optimal Predictor? R, Is Not Strictly Exogenous Subsequent Developments 2.12 Time Regressions The Asymptotic Distribution of the OLS Estimator Hypothesis Testing for Time Regressions Appendix 2.A: Asymptotics with Fixed Regressors Appendix 2.B: Proof of Proposition 2.10 Problem Set Answers to Selected Questions

3

Single-Equation GMM

3.1

3.2

3.3

3.4

Endogeneity Bias: Working's Example A Simultaneous Equations Model of Market Equilibrium Endogeneity Bias Observable Supply Shifters More Examples A Simple Macroeconometric Model Errors-in-Variables Production Function The General Formulation Regressors and Instruments Identification Order Condition lor Identification The Assumption for Asymptotic Normality Generalized Method of Moments Defined Method of Moments Generalized Method of Moments Sampling Error

141 142 144 146 147 !50 !50 !52 !53 !56 !58 !59 160 161 163 164 165 168 183 186 187 187 188 189

193 193 194 196 198 198 200 202 202 204 205 206 207

Contents

4

IX

3.5

Large-Sample Properties of GMM Asymptotic Distribution of the GMM Estimator Estimation of Error Variance Hypothesis Testing Estimation of S Efficient GMM Estimator Asymptotic Power Small-Sample Properties 3.6 Testing Overidentifying Restrictions Testing Subsets of Orthogonality Conditions 3.7 Hypothesis Testing by the Likelihood-Ratio Principle The LR Statistic for the Regression Model Variable Addition Test (optional) 3.8 Implications of Conditional Homoskedasticity Efficient GMM Becomes 2SLS 1 Becomes Sargan's Statistic Small-Sample Properties of 2SLS Alternative Derivations of 2SLS When Regressors Are Predetermined Testing a Subset of Orthogonality Conditions Testing Conditional Homoskedasticity Testing for Serial Correlation 3.9 Application: Returns from Schooling The NLS-Y Data The Semi-Log Wage Equation Omitted Variable Bias IQ as the Measure of Ability Errors-in-Variables 2SLS to Correct for the Bias Subsequent Developments Problem Set Answers to Selected Questions

208 209 210 211 212 212 214 215 217 218 222 223 224 225 226 227 229 229 231 232 234 234 236 236 237 238 239 239 242 243 244 254

Multiple-Equation GMM

258

4.1

259 259 260 261 262

The Multiple-Equation Model Linearity Stationarity and Ergodicity Orthogonality Conditions Identification

Contents

X

5

The Assumption for Asymptotic Normality Connection to the "Complete" System of Simultaneous Equations 4.2 Multiple-Equation GMM Defined 4.3 Large-Sample Theory 4.4 Single-Equation versus Multiple-Equation Estimation When Are They "Equivalent"? Joint Estimation Can Be Hazardous 4.5 Special Cases of Multiple-Equation GMM: FIVE, 3SLS, and SUR Conditional Homoskedasticity Full-Information Instrumental Variables Efficient (FIVE) Three-Stage Least Squares (3SLS) Seemingly Unrelated Regressions (SUR) SUR versus OLS 4.6 Common Coefficients The Model with Common Coefficients The GMM Estimator Imposing Conditional Homoskedasticity Pooled OLS Beautifying the Formulas The Restriction That lsn 't 4.7 Application: Interrelated Factor Demands The Translog Cost Function Factor Shares Substitution Elasticities Properties of Cost Functions Stochastic Specifications The Nature of Restrictions Multivariate Regression Subject to Cross-Equation Restrictions Which Equation to Delete? Results Problem Set Answers to Selected Questions

264 265 265 268 271 272 273 274 274 275 276 279 281 286 286 287 288 290 292 293 296 296 297 298 299 300 301 302 304 305 308 320

Panel Data

323 324 324 327 327 330

5.1

5.2

The Error-Components Model Error Components Group Means A Repararneterization The Fixed-Effects Estimator

Contents

XI

The Formula Large-Sample Properties Digression: When q i Is Spherical Random Effects versm Fixed Effect' Relaxing Conditional Homoskedasticity 5.3 Unbalanced Panels (optional) "Zeroing Out" Missing Observations Zeroing Out versus Compression No Selectivity Bias 5.4 Application: International Differences in Growth Rates Derivation of the Estimation Equation Appending the Error Term Treatment of Consistent Estimation of Speed of Convergence Appendix 5.A: Distribution of Hausman Statistic Problem Set Answers to Selected Questions

330 33I 333 334 335 337 338 339 340 342 342 343 344 345 346 349 363

Serial Correlation

365 365

a,

6

6.I

6.2

6.3 6.4

Modeling Serial Correlation: Linear Processes MA(q) MA(oo) as a Mean Square Limit Filters Inverting Lag Polynomials ARMA Processes AR(I) and Its MA(oo) Representation Autocovariances of AR(I) AR(p) and Its MA(oo) Representation ARMA(p,q) ARMA(p, q) with Common Roots lnvertibility Autocovariance-Generating Function and the Spectrum Vector Processes Estimating Autoregressions Estimation of AR(I) Estimation of AR(p) Choice of Lag Length Estimation of VARs Estimation of ARMA(p, q)

366 366 369 372 375 376 378 378 380 382 383 383 387 392 392 393 394 397 398

Contents

Xii

7

6.5

Asymptotics for Sample Means of Serially Correlated Processes LLN for Covariance-Stationary Processes Two Central Limit Theorems Multivariate Extension 6.6 Incorporating Serial Correlation in GMM The Model and Asymptotic Results Estimating S When Autocovariances Vanish after Finite Lags Using Kernels to Estimate S VARHAC 6.7 Estimation under Conditional Homoskedasticity (Optional) Kernel-Based Estimation of Sunder Conditional Homoskedasticity Data Matrix Representation of Estimated Long-Run Variance Relation to GLS 6.8 Application: Forward Exchange Rates as Optimal Predictors The Market Efficiency Hypothesis Testing Whether the Unconditional Mean Is Zero Regression Tests Problem Set Answers to Selected Questions

400 401 402 404 406 406 407 408 410 413 413 414 415 418 419 420 423 428 441

Extremum Estimators

445 446 446 447 448 450 452 453 454 456 456 458 461 462 467 469 470 473 474

7.1

7.2

7.3

Extremum Estimators "Measurability"' of iJ Two Classes of Extremum Estimators Maximum Likelihood (ML) Conditional Maximum Likelihood lnvariance of ML Nonlinear Least Squares (NLS) Linear and Nonlinear GMM Consistency Two Consistency Theorems for Extremum Estimators Consistency of M-Estimators Concavity after Reparameterization Identification in NLS and ML Consistency of GMM Asymptotic Normality Asymptotic Normality of M -Estimators Consistent A•ymptotic Variance Estimation Asymptotic Normality of Conditional ML

Contents

8

XIII

476 478 481 483 487 487 489 489 491 493 494 497 497 498

Two Examples Asymptotic Normality of GMM GMM versus ML Expressing the Sampling Error in a Common Format 7.4 Hypothesis Testing The Null Hypothesis The Working Assumptions The Wald Statistic The Lagrange Multiplier (LM) Statistic The Likelihood Ratio (LR) Statistic Summary of the Trinity 7.5 Numerical Optimization Newton-Raphson Gauss-Newton Writing Newton-Raphson and Gauss-Newton in a Common Format Equations Nonlinear in Parameters Only Problem Set Answers to Selected Questions

498 499 501 505

Examples of Maximum Likelihood

507

8.1

507 508 509 510 511 511 512 513 514 515 517 518 518 519 521 522 523 524

8.2

8.3

8.4

Qualitative Response (QR) Models Score and Hessian for Observation t Consistency Asymptotic Normality Truncated Regression Models The Model Truncated Distributions The Likelihood Function Reparameterizing the Likelihood Function Verifying Consistency and Asymptotic Normality Recovering Original Parameters Censored Regression (Tobit) Models Tobit Likelihood Function Reparamcterization Multivariate Regressions The Multivariate Regression Model Restated The Likelihood Function Maximizing the Likelihood Function

Contents

XIV

9

Consistency and Asymptotic Normality 8.5 FIML The Multiple-Equation Model with Common Instruments Restated The Complete System of Simultaneous Equations Relationship between (f 0 , Bo) and ilo The FIML Likelihood Function The FIML Concentrated Likelihood Function Testing Overidentifying Restrictions Propcrrics of the FIML Estimator ML Estimation of the SUR Model 8.6 LIML LIML Defined Computation of LIML LIML versus 2SLS 8.7 Serially Correlated Observations Two Questions Unconditional ML for Dependent Observations ML Estimation of AR( I) Processes Conditional ML Estimation of AR(l) Processes Conditional ML Estimation of AR(p) and VAR(p) Processes Problem Set

525 526 526 529 530 531 532 533 533 535 538 538 540 542 543 543 545 546 547 549 551

Unit-Root Econometrics

557

9.1

557 558 560 562 563 563 563

9.2

9.3

Modeling Trends Integrated Processes Why Is It Important to Know if the Process Is 1(1)? Which Should Be Taken as the Null, 1(0) or I (I)? Other Approaches to Modeling Trends Tools for Unit-Root Econometrics Linear I (0) Processes Approximating I (I) by a Random Walk Relation to ARMA Models The Wiener Process A Useful Lemma Dickey-Fuller Tests The AR(l) Model Deriving the Limiting Distribution under the I (I) Null Incorporating the Intercept Incorporating Time Trend

564 566 567 570 573 573 574 577 581

Contents

9.4 Augmented Dickey-Fuller Tests The Augmented Autoregression Limiting Distribution of the OLS Estimator Deriving Test Statistics Testing Hypotheses about { What to Do When p Is Unknown? A Suggestion for the Choice of p~(T) Including the Intercept in the Regression Incorporating Time Trend Summary of the DF and ADF Tests and Other Unit-Root Tests 9.5 Which Unit-Root Test to Use? Local-to-Unity Asymptotics Small-Sample Properties 9.6 Application: Purchasing Power Parity The Embarrassing Resiliency of the Random Walk Model? Problem Set Answers to Selected Questions 10 Co integration

10.1 Cointegrated Systems Linear Vector I(O) and I( I) Processes The Beveridge-Nelson Decomposition Cointegration Defined 10.2 Alternative Representations of Cointegrated Systems Phillips's Triangular Representation VAR and Cointegration The Vector Error-Correction Model (VECM) Johansen's ML Procedure 10.3 Testing the Null of No Cointegration Spurious Regressions The Residual-Based Test for Cointegration Testing the Null of Cointegration 10.4 Inference on Cointegrating Vectors The SOLS Estimator The Bivariate Example Continuing with the Bivariate Example Allowing for Serial Correlation General Case Other Estimators and Finite-Sample Properties

XV

585 585 586 590 591 592 594 595 597 599 601 602 602 603 604 605 619 623

624 624 627 629 633 633 636 638 640 643 643 644

649 650 650 652 653 654 657 658

Contents

XVI

10.5 Application: The Demand for Money in the United States The Data (m ~ p, y, R) as a Cointegrated System DOLS Unstable Money Demand? Problem Set

659 660 660 662 663 665

Appendix A: Partitioned Matrices and Kronecker Products

670 671

Addition and Multiplication of Partitioned Matrices Inverting Partitioned Matrices Index

672 675

CHAPTER

1

Finite-Sample Properties of OLS

ABSTRACT

The Ordinary Least Squares (OLS) estimator is the most basic estimation procedure in econometrics. This chapter covers the tinite- or sma11-sample properties of the OLS estimator, that is, the statistical properties of the OLS estimator that are vahd for any given sample size. The materials covered in this chapter are entirely standard. The exposition here differs from that of most other textbooks in its emphasis on the role played by the assumption that the regressors are "strictly exogenous." ln the finaJ section, we apply the finite-sample theory to the estimation of the cost function using cross-section data on individual finns. The question posed in Ncrlovc's (1963) study is of great practical importance: arc there increasing returns to scale in electricity supply? If yes, microeconomics tells us that the industry should be regulated. Besides providing you with a hands-on experience of using the techniques to test interesting hypotheses, Nerlove's paper has a careful discussion of why the OLS is an appropriate estimation procedure in this particular application.

1.1

The Classical Linear Regression Model

In this section we present the assumptions that comprise the classical linear regression model. In the model. the variable in question (called the de)>endent variable, the regressand, or more generically the left-hand [-side] variable) is related to several other variables (called the regressors, the explanatory variables, or the right-hand [-side] variables). Suppose we observe n values for those variables. Let y; be the i -th observation of the dependent variable in question and let (x;~, x; 2 , ... , x;K) be the i -th observation of the K regressors. The sample or data is a collection of those n observations. The data in economics cannot he generated hy experiments (except in experi-

mental economics), so both the dependent and independent variables have to be treated as random variables, variahles whose values are suhject to chance. A model

4

Chapter 1

is a set of restrictions on the joint distribution of the dependent and independent variables. That is, a model is a set of joint distributions satisfying a set of assumptions. The classical regression model is a set of joint distributions satisfying Assumptions 1.1-1.4 stated below. The Linearity Assumption The first assumption is that the relationship between the dependent variable and the regressors is linear. Assumption 1.1 (linearity):

Y; = fhx;r + fhx;z + · · · + fhx;K + 8;

(i = I, 2, .... n),

(1.1.1)

where fJ 's are unknown paTameters to be estimated, and ci is the unobserved ezror term with cert:1in properties to be specified below.

The partoftherigbt-hand side involving the regressors, !h.<;r +tl2X;2+· · +tlKXiK· is called the regression or the regression function, and the coefficients (,'l's) are called tl!e regression coefficients. They represent the marginal and separate e!Iects of the regressors. For example, ,'l2 represents the change in the dependent variable when the second regressor increases by one unit while other regressors are held constant. In the language of calculus, this can be expressed as ay;jax; 2 = ,'l2 . The linearity implies that the marginal effect does not depend on the level of regressors. The error term represents the part of the dependent variable left unexplained by the regressors. Example 1.1 (consumption function): The simple consumption function familiar from introductory economics is CON; =

tlr + tl2YD; + 8;,

(1.1.2)

where CON is consumption and YD is disposable income. If the data arc annual aggregate time-series, CON; and YD; are aggregate consumption and disposable income for year i. If the data come from a survey of individual households, CON; is consumption by the i-th household in the cross-section sample of n households. The consumption function can be written as (1.1.1) by setting }'; = CON;, x; 1 = I (a constant), and x; 2 = YD;. The error term 8; represents other variables besides disposable income that inllucnce consumption. They include those variables- such as financial assets- that

5

Finite-Sample Properties of OLS

might be observable but the researcher decided not to include as regressors, as well as those variables-such as the "mood" of the consumer-that are hard to measure. When the equation has only one nonconstant regressor, as here, it is called the simple regression model. The linearity assumption is not as restrictive as it might first seem, because the dependent variable and the regressors can be transformations of the variables in question. Consider Examt>le 1.2 (wage equation): A simplified version of the wage equation routinely estimated in labor economics is log(WAGE;) = fh

+ fhS; + {3 3 TENURE; + {34 EXPR; + E;,

(1.1.3)

where WAGE = the wage rate for the individual, S = education in years, TENURE = years on the current job, and l!.XPR = experience in the labor Ioree (i.e., total number of years to date on all the jobs held currently or previously by the individual). The wage equation fits the generic format (1.1.1) with y; = log(WAGE;). The equation is said to be in the semi-log form because only the dependent variable is in logs. The equation is derived from the following nonlinear relationship between the level of the wage rate and the regressors: WAGE;= exp({3 1) exp({32 S;) exp({33 TENURE;) exp({34 EXPR;) exp(c;). (1.1.4)

By taking logs of both sides of (1.1.4) and noting that log[cxp(x)] = x, one obtains (1.1.3). The coefficients in the semi-log form have the interpretation of percentage changes, not changes in levels. For example, a value of 0.05 lor {32 implies that an additional year of education has the effect of raising the wage rate by 5 percent. The difference in the interpretation comes about because the dependent variable is the log wage rate, not the wage rate itself, and the change in logs equals the percentage change in levels. Certain other forms of nonlincarities can also be accommodated. Suppose, for example, the marginal effect of education tapers off as the level of education gets higher. This can be captured by including in the wage equation the squared term S2 as an additional regressor in the wage equation. If the coefficient of the squared

6

Chapter 1

term is {3 5, the marginal effect of education is {32

+ 2f3sS

(= 3 log(WAGE)j3S).

If f3s is negative, the marginal effect of education declines with the level of education. There are, of course, cases of genuine nonlinearity. For example, the relationship (L 1.4) could not have been made linear if the error term entered additivcly

rather than multiplicatively: WAGE;= exp(f3 1) exp(f32 S;) exp(f3 3TENURE;) exp(f34EXPR;)

+ £;.

Estimation of nonlinear regression equations such as this will be discussed m Chapter 7. Matrix Notation Before stating other assumptions of the classical model, we introduce the vector and matrix notation. The notation will prove useful for stating other assumptions precisely and also for deriving the OLS estimator of {3. Define K -dimensional (column) vectors X; and {3 as xil Xi2

X;

{3

(Kx I)

(Kx 1) XiK

f3I f3z

(1.1.5)

f3K

By the definition of vector inner products, x;f3 = f3IXii the equations in Assumption I. I can be written as

+ f32Xi2 + · · · + f3KXiK·

So

(L Ll')

Also define

y (nxl)

X (nxK)

(LL6)

In the vectors and matrices in (L 1.6), there are as many rows as there are observations, with the rows corresponding to the observations. For this reason y and X are sometimes called the data vector and the data matrix. Since the number of

7

Finite-Sample Properties of OLS

columns of X equals the number of rows of {3, X and {3 arc conformable and X/3 is an n x 1 vector. Its i -th element is x; {3. Therefore, Assumption 1.1 can he written compactly as

y=X (nxl)

{3+e.

(nxK)cKxl)

(nxl)

~~~

(n x 1)

The Strict Exogeneity Assumption The next assumption of the classical regression model is

Assumption 1.2 (strict cxogeneity): E(e; I X) =0

(i = 1,2, ... ,n).

(I. 1.7)

Here, the expectation (mean) is conditional on the regressors for all observations.

This point may be made more apparent by writing the assumption without using the data matrix as

E(e; lx 1 ,

...

,x,)=0

(i= 1,2, ... ,n).

To state the assumption differently, take, for any given observation i, the joint distribution of the nK +I random variables, f(e;, x 1 , . , . , x,), and consider the conditional distribution, f(e; I x1 , ...• x,). The conditional mean E(e; I x1 , ... , x,) is in general a nonlinear function of (x 1 , ••• , x,). The strict exogeneity assumption says that this function is a constant of value zero. 1 Assuming this constant to be zero is not restrictive if the regressors include a

constant, because the equation can be rewritten so that the conditional mean of the error term is zero. To sec this, suppose that E(e; I X) is M and xi! = I. The equation can he written as

+ f32Xi2 + ... + f3KXiK + £; (f3I + Ml + fJ2X;z + · · · + f3KxiK + (e,- Ml·

Yi = f3I

=

If we redefine {3 1 to he fh + M and£; to he£; - fl, the conditional mean of the new error term is zero. In virtually all applications, the regressors include a constant term. 1Some authors define the term "strict exogeneity" somewhat differently. For example, in Koopmans and Hood (1953) and Engle, Hendry, and Richards (1983), the regressors are strictly exogenous ifx; is independent of E:j for all i, j. This definition is stronger than, but not inconsistent with, our definition of strict exogeneity.

8

Chapter 1

Example 1.3 (continuation of Example 1.1): For the simple regression model of Example 1.1. the strict exogeneity assumption can he written as E(e; I YD 1 • YD 2 •...• YD,) = 0. Since x; = (1. YD;) 1, you might wish to write the strict exogeneity assumption as E(e; II, YD 1 , I, YD 2 ,

••• •

I, YD,) = 0.

But since a constant provides no information, the expectation conditional on (1, YD 1 , I, YD 2 ,

••. ,

I. YD,)

is the same as the expectation conditional on (YDt, YD 2 ,

.•• ,

YD,).

Implications of Strict Exogeneity

The strict exogeneity assumption has several implications. • The unconditional mean of the error term is zero, i.e.,

E(e;)=O

(i=l,2, ... ,n).

(1.1.8)

This is because, by the Law of Total Expectations from basic probability theory," E[E(e, I X)] = E(e;). • If the cross moment E(x y) of two random variables x and y is zero, then we say that x is orthogonal toy (or y is orthogonal to x). Under strict cxogcncity, the regressors are orthogonal to the error term for all observations, i.e.,

E(xjkf;) =0

(i,j =I, ... ,n;k = 1, ... ,K)

or

E(Xjt f;) E(x·, e·) 1- '

0 (Kxl)

2The Law of Total Expectations states that E[E(y I x)] = E(y).

(foralli,j).

(1.1.9)

9

Finite-Sample Properties of OLS

The proof is a good illustration of the usc of properties of conditional expectations and goes as follows. PROOF. Since Xjk is an element of X, strict exogeneity implies

E(£; I Xjk) = E[E(£; I X) I Xjk] = 0

(1.1.10)

by the Law of Iterated Expectations from probability thcory. 3 It follows from this that E(Xjk&;) = E[E(xjk
I Xjk)] (by the Law of Total Expectations)

E(£; 1Xj>)]

(by the linearity of conditional expectations4 )

=0.

•

The point here is that strict exogeneity requires the regressors be orthogonal not only to the error terrn from the same observation (i.e., E(x;,£;) = 0 for all k), but also to the error term from the other observations (i.e., E(xjks;) = 0 for all k and for j ~ i). • Because the mean of the error tenn is zero, the orthogonality conditions (1.1.9) are equivalent to zero-correlation conditions. This is because Cov(£;, Xjk) = E(xjk<;)- E(xjk) E(s;) = E(xjke;) =

0

(by definition of covariance)

(since E(s;) = 0, see (1.1.8))

(by the orthogonality conditions (1.1.9)).

In particular, fori = .i. Cov(x,., £;) = 0. Therefore. strict exogeneity implies the requirement (familiar to those who have studied econometrics before) that the regressors be contemporaneously uncorrelated with the error term. Strict Exogeneity in Time-Series Models For time-series models where i is time. the implication (1.1.9) of strict exogeneity can be rephrased as: the regressors arc orthogonal to the past, current, and future error terms (or equivalently, the error term is orthogonal to the past, current, and future regressors). But for most time-series models, this condition (and ajOJ' tiori strict exogcncity) is not satisfied, so the finite-sample theory based on strict

exogeneity to be developed in this section is rarely applicable in time-series con3The Law of Iterated Expectations states that E[E(y I x, z) I x] = E(y I x). 4 The linearity of conditional expectations states that E[f(x)y I x] = f(x) E(y I x).

10

Chapter 1

texts. However, as will be shown in the next chapter, the estimator possesses good large-sample properties without strict exogeneity. The clearest example of a failure of strict exogeneity is a model where the regressor includes the lagged dependent variable. Consider the simplest such model: Yi = f3Yi-I

+ E;

(i =I, 2, ... , n).

(1.1.11)

This is called the first-order autoregressive model (AR(l)). (We will study this model more fully in Chapter 6.) Suppose, consistent with the spirit of the strict exogcncity assumption, that the regressor for observation i, y,_ 1 , is orthogonal to the error term fori so E(y,_ 1s;) = 0. Then E(y;s;) = E[(f3y,_ 1 + s,)s;] = f3 =

E(y;-Jc;)

ECs?l

(by (1.1.11))

+ E(rfl

(since E(y,_ 1£;) = 0 by hypothesis).

Therefore, unless the error term is always zero, E(y;<;) is not zero. But y, is the regressor for observation i +I. Thus, the regressor is not orthogonal to the past error term, which is a violation of strict exogeneity. Other Assumptions of the Model The remaining assumptions comprising the classical regressiOn model are the following. Assumption 1.3 (no multicollinearity): The rank of then x K data matrix, X, is K with probability 1. Assumption 1.4 (spherical error variance): (homoskedasticity) E(r? I X) =a 2 > 0

(i = 1,2, ... ,n), 5 (1.1.12)

(no colTelation between observations) E(c;Ej I X) =0

(i,j = 1,2, ... ,n;i

~j).

(1.1.13)

5When a symbol (which here is a 2 ) is given to a moment (which here is the second moment E(efl X)), by implication the moment is assumed to exist and is finite. We will follow this convention for the rest of this book.

11

Finite-Sample Properties of OLS

To understand Assumption 1.3, recall from matrix algebra that the rank of a matrix equals the numher of linearly independent columns of the matrix. The assumption says that none of the K columns of the data matrix X can be expressed as a linear combination of the other columns of X. That is, X is of full column rank. Since the K columns cannot he linearly independent if their dimension is less than K, the assumption irnplies that n > K, i.e., there must be at least as many observations as there are regressors. The regressors are said to be (perfectly) multicollinear if the assumption is not satisfied. It is easy to see in specific applications when the regressors arc multicollinear and what problems arise. Example 1.4 (continuation of Example 1.2): If no individuals in the sample ever changed jobs, then TENURE, = EXPR; for all i, in violation of the no multicollinearity assumption. There is evidently no way to distinguish the tenure effect on the wage rate from the experience effect. If we substitute this equality into the wage equation to eliminate TENURE;, the wage equation becomes

log(WAGE;) = {3 1 + {32 S; which shows that only the sum {33 estimated.

+ ({3, + {34)EXPR; + e,,

+ {34 , but not {33 and {34 separately, can be

The homoskedasticity assumption (1.1.12) says that the conditional second moment, which in general is a nonlinear function of X, is a constant. Thanks to strict cxogcncity, this condition can be stated equivalently in more familiar terms. Consider the conditional variance Var(ci I X). lt equals the same constant hecause Var(e;

1

X) _ E(ef X) - E(e; X) 2 1

=

E(ef I X)

1

(since E(e;

(by definition of conditional variance) 1 X) =

0 by strict exogeneity).

Similarly, (1.1.13) is equivalent to the requirement that Cov(e;,£1

I X)= 0 (i,j = 1,2, ... ,n;i of j).

That is, in the joint distribution of (e;, e1 ) conditional on X, the covariance is zero. In the context of time-series models, (1.1.13) states that there is no serial correlation in the error tenn.

12

Chapter 1

Since the (i, j) clement of the nxn matrix ee' is S;SJ, Assumption 1.4 can be written compactly as

(1.1.14) The discussion of the previous paragraph shows that the assumption can also he

written as

However, (1.1.14) is the preferred expression, because the more convenient measure of variability is second moments (such as E(sf I X)) rather than variances. This point will become clearer when we deal with the large sample theory in the next chapter. Assumption 1.4 is sometimes called the spherical error variance assumption because then xn matrix of second moments (which arc also variances

and covariances) is proportional to the identity matrix In. This assumption will be relaxed later in this chapter. The Classical Regression Model for Random Samples The sample (y, X) is a random sample if {y;, x;} is i.i.d. (independently and identically distributed) across observations. Since by Assumption 1.1 s; is a function of (y;, x;) and since (y;, x;) is independent of (yj, Xj) for j fc i, (e;, x;) is independent of Xj for j fc i. So E(s; I X) = E(s; I x;), E(sf

I X)

= E(sf

I x;),

and E(s;Sj I X) = E(e; I x;) E(ej I Xj)

(fori fc j).

(1.1.15)

(Proving the last equality in (1.1.15) is a review question.) Therefore, Assumptions 1.2 and 1.4 reduce to Assumption 1.2: E(s; I X;)= 0

(i =I, 2, ... , n),

Assumption 1.4: E(sf I x;) = a 2 > 0

(i =I, 2, ... , n).

(1.1.16) (1.1.17)

The implication of the identical distribution aspect of a random sample is that the joint distribution of (£;, X;) does not depend on i. So the unconditional second moment E(cf) is constant across i (this is referred to as unconditional homoskcdasticity) and the functional form of the conditional second moment E(sf 1 x;) is the same across i. However, Assumption 1.4 -that the value of the conditional

13

Finite-Sample Properties of OLS

second moment is the same across i -docs nol follow. Therefore, Assumption 1.4 remains restrictive for the case of a random sample; without it, the conditional second moment E(ef I X;) can differ across i through its possible dependence on X;. To emphasize the distinction, the restrictions on the conditional second moments, (1.1.12) and (1.1.17), are referred to as conditional homoskcdasticity. "Fixed" Regressors We have presented the classical linear regression model, treating the regressors as

random. This is in contrast to the treatment in most textbooks, where X is assumed to be "fixed" or deterministic. If X is fixed, then there is no need to distinguish between the conditional distribution of the error term. f(e; I x 1 , ... , x,), and the unconditional distribution, f(s;), so that Assumptions 1.2 and 1.4 can be written as

Assumption 1.2: E(e;) = 0 (i =I, ... , n),

(1.1.18)

Assumption 1.4: E(ef) = u 2 (i =I, ... , n); E(e;ej) =0 (i,j =I, ... ,n;i f j).

(1.1.19)

Although it is clearly inappropriate for a nonexperimental science like econometrics, the assumption of fixed regressors remains popular because the regression model with fixed X can be interpreted as a set of statements conditional on X, allowing us to dispense with "I X" from the statements such as Assumptions 1.2 and 1.4 of the model. However, the economy in the notation comes at a price. It is very easy to miss the point that the error tenn is being assumed to be uncorrelated with current, past, and futmc regressors. Also, the distinction between the unconditional and conditional homoskedasticity gets lost if the regressors are deterministic. Throughout this book, the regressors are treated as random, and, unless otherwise noted, statements conditional on X arc made explicit by inserting "I X." QUESTIONS FOR REVIEW

1. (Change in units in the semi-log fonn) In the wage equation, (1.1.3), of Example 1.2, if WAGE is mcasmcd in cents rather than in dollars, what difference does it make to the equation? Hint: log(x y) = log(x) + log(y). 2. Prove the last equality in (1.1.15). Hint: E(s;£j I X) = E[ej E(£; I X, (e;,x;) is independent of(ej,x 1 , •.. ,x;_ 1 ,x;+ 1 , ..• ,x,) fori f j.

Ej)

I X].

14

Chapter 1

3. (Combining linearity and strict cxogcncity) Show that Assumptions 1.1 and 1.2 imply E(y,

I X)= x;p

(i = 1,2, ... ,n).

(1.1.20)

Conversely, show that this assumption implies that there exist error terms that

satisfy those two assumptions. 4.

(Nonnally distrihuted random sample) Consider a random sample on consumption and disposable income, (CON;, YD;) (i = I, 2, ... , n). Suppose the joint distribution of (CON;, YD;) (which is the same across i because of the random sample assumption) is normal. Clearly, Assumption 1.3 is satisfied; the rank of X would be less than K only by pure accident. Show that the other assumptions, Assumptions 1.1, 1.2, and 1.4, are satisfied. Hint: If two random variables, y and x, are jointly normally distributed, then the conditional expectation is linear in x, i.e.,

!h

E(y I x) =

and the conditional variance, Var(y

1

+ fl2x,

x), does not depend on x. Here, the fact

that the distribution is the same across i is important; if the distribution differed across i, /) 1 and /)2 could vary across i. 5. (Multicollinearity for the simple regression model) Show that Assumption 1.3 for the simple regression model is that the nonconstant regressor (xi2) is really nonconstant (i.e., x, 2 one). 6.

oft

xj2

for some pairs of (i, .i), i

oft j, with probability

(An exercise in conditional and unconditional expectations) Show that Assumptions 1.2 and 1.4 imply

Var(e;) =

0'

2

and Cov(e;, Ej) = 0

(i =I, 2, ... , n) (i

oft j; i, j

= 1, 2, .. . n).

Hint: Strict exogeneity implies E(e;) = 0. So(*) is equivalent to

E(ef)

=0'

and E(e;ej) = 0

2

(i

(i

oft

=

1,2, ... ,n)

j; i, j =I, 2, ... , n).

15

Finite-Sample Properties of OLS

1.2

The Algebra of Least Squares

This section describes the computational procedure for obtaining the OLS estimate, b, of the unknown coefficient vector f3 and introduces a few concepts that derive from b.

OLS Minimizes the Sum of Squared Residuals Although we do not observe the error term, we can calculate the value implied by a hypothetical value, {3. of {3 as

-

Yi- X~~This is called the residual for observation i. From this, form the sum of squared residuals (SSR):

SSR(p)-

" 2)Y;x;7i) 2 =

(y- XP)'(y- X/3).

i=l

This sum is also called the error sum of squares (ESS) or the residual sum of squares (RSS). lt is a function of {3 because the residual depends on it. The OLS estimate, b, of {3 is the {3 that minimizes this function: b

= arg!_l1in SSR({3).

(1.2.1)

fi

The relationship among {3 (the unknown coefficient vector), b (the OLS estimate of it), and {3 (a hypothetical value of {3) is illustrated in Figure 1.1 forK = I. Because SSR({3) is quadratic in {3, its graph has the U shape. The value of {3 corresponding to the bottom is b, the OLS estimate. Since it depends on the sample (y, X), the

OLS estimate b is in general different from the true value {3; if b equals {3, it is by sheer accident. By having squared residuals in the objective function, this method imposes a heavy penalty on large residuals; the OLS estimate is chosen to prevent large residuals for a few observations at the expense of tolerating relatively small residuals for many other observations. We will see in the next section that this particular criterion brings about some desirable properties for the estimate.

16

Chapter 1

SSR

----r-----------~---+------7~

b Figure 1.1: Hypothetical, True, and Estimated Values

Normal Equations A sure-fire way of solving the minimization problem is to derive the first-order conditions hy setting the partial derivatives equal to zero. To this end we seek a K-dimensional vector of partial derivatives, aSSR(ii);a{3 6 The task is facilitated by writing SSR({J) as

-

SSR({j)

=

(y- x{j)'(y- X{j)

=

(y' -{j'X')(y- X{j)

=

y'y - {J X'y - y'X{J

=

y'y- 2y'x73

'"""!

.....,

(since the i-th element of y- x{j is)'; - x;{j) (since (X{j)' = {j'X') '"""!

......,

+ {J X'Xf3

+ 73'x'x73

(since the scalar {j'X'y equals its transpose y'X{j) """

- y'y- 2a' {J

+ {3,._,I A{J"""

with a- X'y and A - X'X.

(1.2.2)

The term y'y docs not depend on {J and so can be ignored in the differentiation of SSR({3). Recalling from matrix algebra that

-

a(a'P)

ap

=a and

for A symmetric,

6If h: IRK -4 OC is a scalar-valued function of a K -dimensional vector x, the derivative of h with respect to xis a K-dimensional vector whose k-th element is Oh(x)/Oxk where Xk is the k-th element ofx. (This K-dimensional vector is called the gradient.) Here, the xis and the function h(x) is SSR(/i).

/i

17

Finite-Sample Properties of OLS

the K -dimensional vector of partial derivatives is

aSS/~({3) ap

-2a + 2Ap.

=

The first-order conditions are obtained hy setting this equal to zero. Recalling from (1.2.2) that a here is X'y and A is X'X and rearranging, we can write the first-order conditions as

X'X

b

=

X'y.

(1.2.3)

(KxK)(Kxl)

Here, we have replaced {3 by b because the OLS estimate b is the {3 that satisfies the first-order conditions. These K equations are called the normal equations. The vector of residuals evaluated at {3 = b,

c = y- Xb,

(1.2.4)

(nx I)

is called the vector of OLS residuals. Rearranging (1.2.3) gives X'(y- Xb)

=

Its i-th element

0 or X'c

=

IS

e,

0 or (1.2.3')

which shows that the normal equations can he interpreted as the sample analogue of the orthogonality conditions E(x; · £;) = 0. This point will be pursued more fully in subsequent chapters. To be sure, the first-order conditions are just a necessary condition for minimization, and we have to check the second-order condition to make sure that b achieves the minimum, not the maximum. Those who arc familiar with the Hessian of a function of several variables7 can immediately recognize that the second-order condition is satisfied because (as noted below) X'X is positive definite. There is, however, a more direct way to show that b indeed achieves the minimum. It utilizes the "add-and-subtract" strategy, which is effective when the objective function is quadratic, as here. Application of the strategy to the algebra of least squares is left to you as an analytical exercise.

7The Hessian of h(x) is a square manix whose (k. £)element is J 2h(x)jJxk Jxe.

18

Chapter 1

Two Expressions for the OLS Estimator Thus, we have obtained a system of K linear simultaneous equations in K unknowns in b. By Assumption 1.3 (no multicollinearity). the coefficient matrix X'X

is positive definite (sec review question l below for a proof) and hence nonsingular. So the normal equations can he solved uniquely forb hy premultiplying hoth sides of (1.2.3) by (X'X)- 1 : (1.2.5)

Viewed as a function of the sample (y, X), (1.2.5) is sometimes called the OLS estimator. For any given sample (y, X), the value of this function is the OLS estimate. In this book, as in most other textbooks, the two terms will be used almost interchangeably. Since (X'X)- 1X'y = (X'Xjn)- 1X'yjn, the OLS estimator can also be rewritten as b=

where

1

1

s;.}

s:\.)'•

(1.2.5')

II

Sxx = -X'X = - Lxix~ n n

(sample average ofxix;),

(1.2.6a)

i=l

I I s,.-y = -X'y = -

n

" L xi · Yi n

(sample average of xi · Yi)·

(1.2.6h)

i=l

The data matrix fonn (1.2.5) is more convenient for developing the finite-sample results, while the sample average fonn (1.2.5') is the form to be utilized for largesample theory. More Concepts and Algebra

Having derived the OLS estimator of the coefficient vector, we can define a few related concepts. • The fitted value for observation i is defined as y, = x;b. The vector of fitted value, y, equals Xb. Thus, the vector of OLS residuals can be written as e = y-

y.

• The projection matrix I' and the annihilator M are defined as I' - X(X'X)- 1X',

(1.2.7)

(nxn)

M =In- P.

(nxn)

(1.2.8)

19

Finite-Sample Properties of OLS

They have the following nifty properties (proving them is a review question): Both P and Mare symmetric and idempotent, 8 PX =X

(hence the term projection matrix),

MX = 0

(hence the term annihilator).

(1.2.9) (1.2.10) (1.2.11)

Since e is the residual vector at fJ = b, the sum of squared OLS residuals, SSR, equals e'e. It can further be written as SSR = e'e = e'Me.

(1.2.12)

(Proving this is a review question.) This expression, relating SSR to the true error term e, will be useful later on. • The OLS estimate of u 2 (the variance of the error term), denoted s 2 , is the sum of squared residuals divided by 11 - K: s2

_

SSR 11- K

e'e 11- K

(1.2.13)

(The definition presumes thatn > K; otherwise s 2 is not well-defined.) As will be shown in Proposition 1.2 below, dividing the sum of squared residuals by n- K (called the degrees of freedom) rather than by 11 (the sample size) makes this estimate unbiased for u 2 . The intuitive reason is that K parameters (fJ) have to be estimated before obtaining the residual vector c used to calculate s2 . More specifically, e has to satisfy the K normal equations (1.2.3'), which limits the variability of the residual. • The square root of s 2 , s, is called the standard error of the regression (SER) or standard error of the equation (SEE). It is an estimate of the standard deviation of the error term. • The sampling error is defined as b- fJ. It too can be related toe as follows. b- fJ = (X'X)- 1X'y- fJ

(by (1.2.5))

= (X'X)- 1X'(XfJ +e)- fJ = (X'X)- 1 (X'X)fJ

(since y = XfJ

+ e by Assumption 1.1)

+ (X'X)- 1X'e- fJ (1.2.14)

8A square matrix A is said to be idempotent if A = A 2 .

20

Chapter 1

• Uncentered R 2 One measure of the variability of the dependent variable is the sum of squares, I: yp = y'y. Because the OLS residual is chosen to satisfy the normal equations, we have the following decomposition of y'y:

y'y = (y + e)'(y +e)

(since e = y- y)

=yy+2y'e+e'e = y'y + 2b'X'e + e'e = )"y + e'e

(since y- Xh)

(since X'e = 0 by the normal equations; see (1.2.3')). (1.2.15)

The uncentered R 2 is defined as e'e = 1-uc y'y

R2

(1.2.16)

Because of the decomposition (1.2.15), this equals

.,.

yy y'y Since both

Y'Y and c'c are nonnegative, 0

<

R;c <

1. Tht1s, the uncentered R 2

has the interpretation of the fraction of the variation of the dependent variable

that is attributable to the variation in the explanatory variables. The closer the fitted value tracks the dependent variable, the closer is the unccntcrcd R 2 to one. • (Centered) R 2 , the coefficient of determination. If the only regressor is a constant (so that K = I and Xi! = I), then it is easy to sec from (1.2.5) that b equals ji, the sample mean of the dependent variable, which means that .Yi = y for all i, y'yin (1.2.15) equals ny 2 , and e'e equals I:,(y,- ji) 2 If the regressors also include nonconstant variables, then it can he shown (the proof is left as an analytical exercise) that I:,(y,- ji) 2 is decomposed as

i=l

i=l

I ji-n

i=l

n

LYi·

(1.2.17)

i=l

The coefficient of determination, R 2 , is defined as }12

= -

1_

" Li-t " ei2 ~~~

Li~l (Yi- y)

2.

(1.2.18)

21

Finite-Sample Properties of OLS

Because of the decomposition (1.2.17), this R 2 equals "" L....i=I (' Yi " L.,;~I "

-

y-)2

(y; - y) 2"

Therefore, provided that the regressors include a constant so that the decompos-

ition (1.2.17) is valid, 0 < R2 < I. Thus, this R2 as defined in (1.2.18) is a measure of the explanatory power of the nonconstant regressors. If the regressors do not include a constant but (as some regression software

packages do) youncvcrUIClcss calculate R 2 by UIC formula (1.2.18), then the R 2 can he negative. This is because, without the henefit of an intercept, the regres-

sion could do worse than the sample mean in terms of tracking the dependent variable. On the other hand, some other regression packages (notably STATA) switch to the formula (1.2.16) for the R 2 when a constant is not included, in order to avoid negative values for the R 2 This is a mixed blessing. Suppose that the regressors do not include a constant but that a linear combination of the regressors equals a constant. This occurs if, for example, the intercept is

replaced by seasonal dummics 9 The regression is essentially the same when one of the regressors in the linear combination is replaced by a constant. Indeed, one should obtain the same vector of fitted values. But if the formula for the R 2 is (1.2.16) for regressions without a constant and (1.2.18) for those with a constant, the calculated R 2 declines (sec Review Question 7 below) after the replacement by a constant. Influential Analysis (optional}

Since the method of least squares seeks to prevent a few large residuals at the expense of incurring many relatively small residuals, only a few observations can he extremely influential in the sense that dropping them from the sample changes

some elements of b substantially. There is a systematic way to find those influential observations. 10 Let IJC;J be the OLS estimate of {3 that would be obtained if OLS were used on a sample from which the i-th observation was omitted. The key equation is

bco- b = - (

1 I - p;

)cx'x)- 1x;·e;,

9oummy variables will be introduced in the empirical exercise for lhis chapter. 10 See Krasker, Kuh, and Welsch (1983) for more details.

(1.2.19)

22

Chapter 1

where X; as before is the i -th row of X, e; is the OLS residual for observation i, and p; is defined as

- '(X'X)- 1Xj, Pi =Xi

(1.2.20)

which is the i-th diagonal element of the projection matrix P. (Proving (1.2.19) would be a good exercise in matrix algebra, but we will not do it here.) It is easy to show (sec Review Question 7 of Section 1.3) that n

0 < p; < I and

L

p; = K.

(1.2.21)

i=l

Sop; equals K jn on average. To illustrate the use of (1.2.19) in a specific example, consider the relationship between equipment investment and economic growth for the world's poorest countries between 1960 and 1985. Figure 1.2 plots the average annual GDP-per-worker growth between 1960 and 1985 against the ratio of equipment investment to GDP over the same period for thirteen countries whose GDP per worker in 1965 was less than 10 percent of that of the United States. 11 It is clear visually from the plot that the position of the estimated regression line would depend very much on the single outlier (Botswana). Indeed, if Botswana is dropped from the sample, the estimated slope coefficient drops from 0.37 to 0.058. In the present case of simple regression, it is easy to spot outliers by visually inspecting the plot such as Figure 1.2. This strategy would not work if there were more than one nonconstant regressor. Analysis based on formula (1.2.19) is not restricted to simple regressions. Table 1.1 displays the data along with the OLS residuals, the values of p;, and (1.2.19) for each observation. Botswana's p; of 0.7196 is well above the average of 0.154 (= K/n = 2/13) and is highly influential, as the last two columns of the table indicate. Note that we could not have detected the influential observation by looking at the residuals, which is not smprising because the algebra of least squares is designed to avoid large residuals at the expense of many small residuals for other

observations. What should be done with influential observations? It depends. If the influential observations satisfy the regression model, they provide valuable information about the regression function unavailable from the rest of the sample and should definitely be kept in the sample. But more probable is that the influential observations are atypical of the rest of the sample because they do not satisfy the model. 11 The data are from the Penn World Table, reprinted in DeLong and Summers (1991). To their credit, their analysis is based on the whole sample of sixty-one countries.

23

Finite-Sample Properties of OLS

-·... .

0.08,..---------------------

Botswana .<:: ~

0.06

!"

"'

(i; 0.04

-

""'0

;:: 0> 0 .02

c. "0

(!)

0

.

~

• •

•

• •

•

•

•

-0.02 +-___,--+-+---t--+--+--+--+--,_-+--+---1 0.02 0.04 0.06 0.08 0.1 0.12 0.14

equipment/GOP Figure 1.2: Equipment Investment and Growth

In this case they should definitely be dropped from the sample. For the example just examined. there was a worldwide growth in the demand for diamonds. Botswana's main export, and production of diamonds requires heavy investment in drilling equipment. Tf the reason to expect an association between growth and equipment investment is the beneficial effect on productivity of the introduction of new technologies through equipment, then Botswana, whose high GOP growth is demand-driven, should he dropped from the sample. A Note on the Computation ol OLS Estimates 12

So far, we have focused on the conceptual aspects of the algebra of least squares. But for applied researchers who actually calculate OLS estimates using digital computers, it is important to be aware of a certain aspect of digital computing in order to avoid the risk of obtaining unreliable estimates without knowing it. The source of a potential problem is that the computer approximates real numbers by so-called floating-point nnmbcrs. When an arithmetic operation involves both very large numbers and very small numbers, floating-point calculation can produce inaccurate results. This is relevant in the computation of OLS estimates when the regressors greatly differ in magnitude. For example, one of the regressors may be the interest rate stated as a fraction, and another may be U.S. GDP in dollars. The matrix X'X will then contain both very small and very large numbers, and the arithmetic operation of inverting this matrix by the digital computer will produce unreliable results. 12 A fuller u·eatment of this topic can be found in Section 1.5 of Davidson and MacKinnon (1993).

Table l.I: Influential Analysis Country

Botswana Cameroon Ethiopia India Indonesia Ivory Coast Kenya Madagascar Malawi Mali Pakistan Tanzania Thailand

GDP/workcr growth

0.0676 0.0458 0.0094 0.0115 0.0345 0.0278 0.0146 -0.0102 0.0153 0.0044 0.0295 0.0184 0.0341

Equipment! GDP

Residual

Pi

(1.2.19) for {3 1

0.1310 0.0415 0.0212 0.0278 0.0221 0.0243 0.0462 0.0219 0.0361 0.0433 0.0263 0.0860 0.0395

0.0119 0.0233 -0.0056 -0.0059 0.0192 0.0117 -0.0096 -0.0254 -0.0052 -0.0188 0.0126 -0.0206 0.0123

0.7196 0.0773 0.1193 0.0980 0.1160 0.1084 0.0775 0.1167 0.0817 0.0769 0.1022 0.2281 0.0784

0.0104 -0.0021 0.0010 0.0009 -0.0034 -0.0019 0.0007 0.0045 0.0006 0.0016 -0.0020 -0.0021 -0.0012

(1.2.19) for /32 -0.3124 0.0045 -0.0119 -0.0087 0.0394 0.0213 0.0023 -0.0527 -0.0036 -0.0006 0.0205 0.0952 0.0047

25

Finite-Sample Properties of OLS

A simple solution to this problem is to choose the units of measurement so that the regressors are similar in magnitude. For example, state the interest rate in per-

cents and U.S. GDP in trillion dollars. This sort of care would prevent the problem most of the time. A more systematic transfonnation of the X matrix is to subtract the sample means of all regressors and divide by the sample standard deviations before forming X'X (and adjust the OLS estimates to undo the transformation). Most OLS programs (such as TSP) take a more sophisticated transformation of the X matrix (called the QR decomposition) to produce accurate results. QUESTIONS FOR REVIEW

1. Prove that X'X is positive definite if X is of full column rank. Hint: What needs to be shown is that c'X'Xc > 0 for c c1 0. Define z - Xc. Then c'X'Xc = z'z = L-[~ 1 zf. If X is of full column rank, then z cl 0 for any c cl 0.

f.

2. Verify that X'X/n = L_, x;X, and X'y/n = The (k, €) element of X'X is L_, x"x".

f. L_, X; · y, as in (1.2.6).

Hint:

3. (OLS estimator for the simple regression model) In the simple regression model, K = 2 and x; 1 = I. Show that

where and Show that

t

L-7~ 1 (X;z - iz)(y; - ji) b 2 = "-'"'71 -::"'="-(-.--.----:)2,---- and b 1 = ;; Li=I Xi2 -

y-

i 2b 2 •

Xz

(You may recognize the denominator of the expression for b2 as the sample variance of the nonconstant regressor and the numerator as the sample covariance between the nonconstant regressor and the dependent variable.) Hint: 1

n

1

n

- 2::x;'2 - (i,) 2 = - L(x;2- i2) 2 n n i=l

i=l

26

Chapter 1

and

You can take (1.2.5') and use the brute force of matrix inversion. Alternatively, write down the two normal equations. The first normal equation is b 1 = y-i 2 b 2 . Substitute this into the second normal equation to eliminate b 1 and then solve for b2 . 4. Prove (1.2.9)-{1.2.11). Hint: They should easily follow from the definition of P andM. 5. (Matrix algebra of fitted values and residuals) Show the following: (a)

y=

l'y, e =My= Me. Hint: Use (1.2.5).

(b) (1.2.12), namely, SS11 = e'Me. 6. (Change in units and 11 2 ) Does a change in the unit of measurement for the dependent variable change 11 2 ? A change in the unit of measurement for the regressors? Hint: Check whether the change affects the denominator and the numerator in the definition for 11 2 7. (Relation between 11;, and 11 2 ) Show that

Hint: Use (1.2.16), (1.2.18), and the identity I:,(y;-

y)'

=I:,

yf- n. y2

8. Show that

112 !IC

=

y'Py y'y

9. (Computation of the statistics) Verify that b, SS11, s 2 , and 11 2 can be calculated from the following sample averages: Sxx, Sxy, y'y/n, andy. (If the regressors include a constant, then y is the element of Sxy corresponding to the constant.) Therefore, those sample averages need to be computed just once in order to obtain the regression coefficients and related statistics.

27

Finite-Sample Properties of OLS

1.3

Finite-Sample Properties of OLS

Having derived the OLS estimator, we now examine its finite-sample properties, namely. the characteristics of the distribution of the estimator that are valid for any

given sample size n. Finite-Sample Distribution of b

Proposition 1.1 (finite-sample properties or the OLS estimator or {3): (a) (unbiasedness)

Under Assumptions 1.1-1.3, E(b I X)= {3.

(b) (expression for the va6ance) Under Assumptiom 1.1-1.4, Var(b (X'X)-

X) = rr 2

1

.

1

(c) (Gauss-Mm*ov Theorem) Under Assumptions 1.1-1.4, the OLS estimator is efficient in the class of linear unbiased estimators. That is, for any unbiased estimator that is linear· in y, Var(/i I X) > Var(b I X) in the mallix sense. 13

/i

(d) Under Assumptions 1.1-1.4, Cov(b, e I X) = 0, where e

= y- Xb.

Before plunging into the proof, let us be clear about what this proposition means. ~

• The matrix inequality in part (c) says that the K x K matrix Var(/3 Var(b

I X)

I

X) -

is positive semidefinite, so

a'[Var(/i I X) - Var(b I X) ]a > 0 or a' Var(/i I X) a > a' Var(b

I X)a

for any K -dimensional vector a. In particular, consider a special vector whose elements are all 0 except for the k-th element, which is 1. For this particular a, the quadratic fonn a' Aa picks up the (k, k) element of A. Butthe (k, k) element of Var(/i I X), for example, is Var(fjk I X) where fjk is the k-th element of

/i.

Thus the matrix inequality in (c) implies Var(fjk

I

X) > Var(bk

I

X)

(k = I, 2, ... , K).

(1.3.1)

That is~ for any regression coefficient~ the variance of the OLS estimator is no larger than that of any other linear unbiased estimator.

13 Let A and B be two square matrices of the same size. We say that A 2: B if A - B is positive semidefinite. A K x K matrix Cis said to be positive semidefinite (or nonnegative definite) ifx'Cx 2: 0 for all K-dimensional vectors x.

28

Chapter 1

• As clear from (1.2.5). the OLS estimator is linear in y. There arc many other estimators of {3 that are linear and unbiased (you will he asked to provide one in a review question below). The Gauss-Markov Theorem says that the OLS estimator is efficient in the sense that its conditional variance matrix Var(b I X) is smallest among linear unbiased estimators. For this reason the OLS estimator

is called the Best Linear Unbiased Estimator (BLUE). • The OLS estimator b is a function of the sample (y. X). Since (y. X) are random. so is b. Now imagine that we fix X at some given value, calculate b for all samples corresponding to all possible realizations of y, and take the average of

b (the Monte Carlo exercise to this chapter will ask you to do this). This average is the (population) conditional mean E(b I X). Part (a) (unbiasedncss) says that this average equals the true value {3. • There is another notion of unbiascdncss that is weaker than the unbiasedncss of part (a). By the Law of Total Expectations, E[E(b I X)] = E(b). So (a) implies E(b) = {3.

(1.3.2)

This says: if we calculated b for all possible different samples, differing not only in y but also in X, the average would be the true value. This unconditional statement is probably more relevant in economics because samples do differ in both y and X. The import of the conditional statement (a) is that it implies the unconditional statement (1.3.2), which is more relevant. • The same holds for the conditional statement (c) about the variance. A review question below asks you to show that statements (a) and (b) imply

-

Var({3) > Var(b) where

/i is any linear unbiased estimator (so that E(/i I X) =

(1.3.3)

{3).

We will now go through the proof of this important result. The proof may look lengthy; if so, it is only because it records every step, however easy. In the first reading, you can skip the proof of part (c). Proof of (d) is a review question. PROO~:

(a) (Proof that E(b I X) = {3) E(b - {3 I X) = 0 whenever E(b I X) = {3. So we prove the former. By the expression for the sampling error (1.2.14),

29

Finite-Sample Properties of OLS

b- {3 = Ae where A here is (X'X)- 1X1. So E(b- {3 I X) = E(Ae I X) =A E(e I X). Here, the second equality holds by the linearity of conditional expectations; A is a function of X and so can he treated as if nonrandom. Since E(e I X) = 0. the last expression is zero.

(b) (ProofthatVar(b X) =u 2 -(X'X)- 1) 1

Var(b 1 X) = Var(b- {3 1 X)

(since {3 is not random)

(by (1.2.14) and A_ (X'X)- 1X')

= Var(Ae I X)

= A Var(e X) A'

(since A is a function of X)

= A E(ee' I X) A'

(by Assumption 1.2)

1

= A(u 2 I,)A 1

(hy Assumption 1.4. see (1.1.14))

= u 2AA'

= u 2 · (X'X)- 1

(since AA' = (X'X)- 1X'X(X'X)- 1 = (X'X)- 1).

(c) (Gauss-Markov) Since /i is linear in y. it can be written as /i = Cy for some matrix C. which possibly is a function of X. Let D _ C - A or C = D +A where A= (X'X)- 1X'. Then

-

{3 = (D + A)y

= Dy+Ay = D(X/3 +e)+ b

(since y = X/3 + e and Ay = (X'X)- 1X'y =b)

= DX/3 +De +b. Taking the conditional expectation of both sides, we obtain

E(/i I X) = DX/3 + E(De I X)+ E(b I X). Since both b and /i arc unbiased and since E(De I X) = D E(e I X) = 0. it follows that DX/3 = 0. For this to be true for any given {3. it is necessary that DX = 0. So /i = De + b and

-

{3 - {3 = De

+ (b -

= (D + A)e

{3)

(by (1.2.14)).

30

Chapter 1

So Var(ji I X) = Var(ji - {3 I X) = Var[(D + A)e I X] = (D +A) Var(e

I X)(D' +A')

(since both D and A arc functions of X) = a 2 · (D + A)(D' +A')

(since Var(e I X) = a 2 In)

= a 2 . (DD' +AD'+ DA' + AA'). But DA' = DX(X'X)- 1 = 0 since DX = 0. Also, AA' = (X'X)- 1 as shown in (b). So

Var(ji I X)= a 2 · [DD' + (X'X)- 1]

> a 2 . (X'X)- 1 = Var(b I X)

(since DD' is positive semidefinite)

•

(by (b)).

It should be emphasized that the strict exogeneity assumption (Assumption 1.2) is critical for proving unbiasedness. Anything short of strict exogeneity will

not do. For example, it is not enough to assume that E(e; I x;) = 0 for all i or that E(x;·e;) = 0 for all i. We noted in Section 1.1 that most time-series models do not satisfy strict exogeneity even if they satisfy weaker conditions such as the orthogonality condition E(x; ·£;) = 0. It follows that for those models the OLS estimator is not unhiased. Finite-Sample Properties ol s2 We defined the OLS estimator of a 2 in (1.2.13). It, too, is unbiased.

Proposition 1.2 (Unbiascdncss of .1· 2 ): Under Assumptions 1.1-1.4, E(s 2 a 2 (and hence E(s 2 ) = a 2 ), provided n > K (so that s2 is well-defined).

1

X) =

We can prove this proposition easily by the use of the trace operator. 14 PROOF. Since s 2 = c'c/(n- K), the proof amounts to showing that E(c'c X) = (n - K)a 2 As shown in (1.2.12), c'e = e'Me where M is the annihilator. The proof consists of proving two properties: (I) E(e'Me 1 X) = a 2 · trace(M), and (2) tracc(M) = n- K. 1

14 The trace of a square matrix A is the sum of the diagonal elements of A: trace(A) =

Li aii.

31

Finite-Sample Properties of OLS

(I) (ProofthatE(e'Me I X)= u 2 ·tracc(M)) Since e'Me = 2..::'~ 1 2..:)~ 1 (this is just writing out the quadratic form e'Me ), we have II

E(e'Me I X) =

m;j

e;

Ej

II

L Lmu E(e;

Ej

i~I j~I

i=l

(since E(e;

I X) (because m;j's arc functions of X, E(m;j s; Ej I X)= m;j E(e; Ej I X))

Ej I

X) = 0 fori

op j by Assumption 1.4)

n

=a2Lmii i=l

= o- 2 · trace(M). (2) (Proof that trace(M) = n - K) trace(M) =trace(!, - P)

(since M _ 1,- P; see (1.2.8))

= trace(!,) - trace(P)

(fact: the trace operator is linear)

= n - trace(P), and tracc(l') = trace[X(X'X)- 1X'] = trace[(X'X)- 1X'X]

(since I'= X(X'X)- 1X'; sec (1.2.7)) (fact: trace(AB) = trace(BA))

= trace(h) = K. So tracc(M) = n - K. Estimate of Var(b

If

s2

•

I X)

is the estimate of o- 2 , a natural estimate ofVar(b I X)= o- 2 ·(X'X)- 1 is (1.3.4)

This is one of the statistics included in the computer printout of any OLS software package. QUESTIONS FOR REVIEW

1. (Role of the no-multicollinearity assumption) In Propositions I. I and 1.2, where did we use Assumption 1.3 that rank(X) = K? Hint: We need the no-multicollinearity condition to make sure X'X is invertible.

32

Chapter 1

2. (Example of a linear estimator) For the consumption function example in Example 1.1, propose a linear and unbiased estimator of fh that is different from the OLS estimator. Hint: How about 1Iz = (CON2 -CON1)f(YD 2 - YD 1)? Is it linear in (CON1 , •.. , CON,)? Is it unbiased in the sense that E(i52 I YD1, ... , YD,) = {J,? 3. (What Gauss-Markov does not mean) Under Assumptions 1.1-1.4, does there exist a linear, but not necessarily unbiased, estimator of {3 that has a variance smaller than that of the OLS estimator? If so, how small can the variance be? Hint: If an estimator of {3 is a constant, then the estimator is trivially linear in y. 4. (Gauss-Markov for Unconditional Variance) (a) Prove: Var(Pl = E[Var(,B I X)]+ Var[E(,B I X)]. Hint: By definition, Var(p I X)- E[@-

E(p I Xl)@- E(p I X))' I X]

and Var[E(B I X)J- E{rE(B I X)- E(B)J[E(B I X)- E(Pll'). Use the add-and-subtract strategy: take E(/f).

7f- E(/f

1

X) and add and subtract

(b) Prove (1.3.3). Hint: If Var(/f I X) > Var(b I X), then E[Var(/f I X)] > E[Var(b I X) l 5. Propose an unbiased estimator of a 2 if you had data on e. Hint: How about e'efn? Is it unbiased? 6. Prove part (d) of Proposition 1.1. Hint: By definition, Cov(b, c I X) - E( [b- E(b I X)][c- E(c I X)]'

I X J.

Since E(b X)= {3, we have b- E(b X)= Ae where A here is (X'X)- 1X'. Use Me = c (see Review Question 5 to Section 1.2) to show that c- E(c 1 X)= Me. E(Aee'M X)= A E(ee' X)M since both A and Mare functions of X. Finally, use MX = 0 (see (1.2.11)). 1

1

1

1

7. Prove (1.2.21). Hint: Since Pis positive semidefinite, its diagonal elements are nonnegative. Note that 2..::'~ 1 p; = tracc(l').

Finite-Sample Properties of OLS

1.4

33

Hypolhesis Tesling under Norma lily

Very often, the economic theory that motivated the regression equation also specifies the values that the regression coefficients should take. Suppose that the underlying theory implies the restriction that {J 2 equals 1. Although Proposition 1.1 guarantees that, on average, b 2 (the OLS estimate of {3 2 ) equals I if the restriction is true, b 2 may not be exactly equal to I for a particular sample at hand. Obviously, we cannot conclude that the restriction is false just because the estimate b 2 differs from I. In order for us to decide whether the sampling error b2 - I is "too large" for the restriction to be tn1c, we need to construct from the sampling error some test statistic whose probability distribution is known given the tmth of the hypothesis. It might appear that doing so requires one to specify the joint distribution of (X, e) because, as is clear from (1.2.14), the sampling error is a function of (X, e). A surprising fact about the theory of hypothesis testing to be presented in this section is that the distribution can be derived without specifying the joint distribution when the conditional distribution of e conditional on X is normal; there is no need

to specify the distribution of X. In the language of hypothesis testing, the restriction to be tested (such as "{3 2 = I") is called the null hypothesis (or simply the null). It is a restriction on the maintained hypothesis, a set of assumptions which, combined with the null, produces some test statistic with a known distribution. For the present case of testing hypothesis about regression coefficients, only the nonnality assumption about the conditional distribution of e needs to be added to the classical regression model (Assumptions 1.1-1.4) to form the maintained hypothesis (as just noted, there is no need to specify the joint distribution of (X, e)). Sometimes the maintained hypothesis is somewhat loosely referred to as "the model." We say that the model is correctly specified if the maintained hypothesis is tn1e. Although too large a value of the test statistic is interpreted as a failure of the nulL the interpreta-

tion is valid only as long as the model is correctly specified. It is possible that the test statistic docs not have the supposed distribution when the null is tme but the model is false.

Normally Distributed Error Terms In many applications, the error term consists of many miscellaneous factors not

captured by the regressors. The Central Limit Theorem suggests that the error tenn has a normal distribution.

In other applications, the error term is due to errors in measuring the dependent variable. It is known that very often measure-

34

Chapter 1

men! errors arc normally distributed (in fact, the normal distribution was originally developed for measurement errors). It is therefore worth entertaining the nonnality

assumption:

Assumption 1.5 (normality of the error term): The distribution of e conditional on X is jointly nonnal. Recall from probability theory that the normal distribution has several convenient features: • The distribution depends only on the mean and the variance. Thus, once the mean and the variance arc known, you can write down the density function. If the distribution conditional on X is nonnal, the mean and the variance can

depend on X. It follows that, if the distribution conditional on X is nom1al and

if neither the conditional mean nor the conditional variance depends on X, then the marginal (i.e, unconditional) distribution is the same normal distribution. • In general, if two random variables are independent, then they are uncorrelated,

but the converse is not true. However, if two random variables are joint nor-

mal, the converse is also true, so that independence and a lack of correlation are equivalent. This carries over to conditional distributions: if two random variables are joint normal and uncorrelated conditional on X, then they are independent conditional on X. • A linear function of random variables that arc jointly normally distributed is itself normally distributed. This also carries over to conditional distributions. If the distribution of e conditional on X is normal, then Ae, where the elements of matrix A arc functions of X, is normal conditional on X. It is thanks to these features of normality that Assumption 1.5 delivers the following

properties to be exploited in the derivation of test statistics: • The mean and the variance of the distribution of e conditional on X are already specified in Assumptions 1.2 and 1.4. Therefore, Assumption 1.5 together with Assumptions 1.2 and 1.4 implies that the distribution of e conditional on X is N(O, a 2 1,): e

I X~ N(O. a 2 1,).

(1.4.1)

Thus, the distribution of e conditional on X does not depend on X. It then follows that e and X arc independent. Therefore, in particular, the marginal or unconditional distribution of e is N(O, a 2 I,).

35

Finite-Sample Properties of OLS

• We know from (1.2.14) that the sampling error b- {3 is linear in e given X. Since e is nom1al given X, so is the sampling error. Its mean and variance are given by parts (a) and (b) of Proposition 1.1. Thus, under Assumptions 1.1-1.5, (1.4.2) Testing Hypotheses about Individual Regression Coefficients The type of hypothesis we first consider is ahout the k-th coefficient

Here, fh is some known value specified by the null hypothesis. We wish to test this null against the alternative hypothesis H 1 : f3k oft {3" at a significance level of a. Looking at the k-th component of (1.4.2) and imposing the restriction of the null, we obtain

where ( (X'X) -I )kk is the (k, k) element of (X'X) - l So if we define the ratio dividing bk - f3 k by its standard deviation

Zk

by

(1.4.3)

then the distribution of Zk is N(O, I) (the standard nonnal distribution). Suppose for a second that o- 2 is known. Then the statistic Zk has some desirable properties as a test statistic. First, its value can be calculated from the sample. Second, its distribution conditional on X docs not depend on X (which should not he confused with the fact that the value of Zk depends on X). So Zk and X are independently distributed, and, regardless of the value of X, the distribution of Zk is the same as its unconditional distribution. This is convenient because different samples differ not only in y but also in X. Third, the distribution is known. In particular, it docs not depend on unknown parameters (such as {3). (If the distribution of a statistic depends on unknown parameters, those parameters are cal1ed nuisance parameters.) Using this statistic, we can determine whether or not the sampling error bk - f3 k is too large: it is too large if the test statistic takes on a value that is surprising for a realization from the distribution. If we do not know the true value of o- 2 , a natural idea is to replace the nuisance parameter rr 2 by its OLS estimate s 2 The statistic after the substitution of s 2 for

36

Chapter 1

a 2 is called the !-ratio or the !-value. The denominator of this statistic is called the standard error of the OLS estimate of fh and is sometimes written as SE(b,): SE(b,)-

)s

2 .

(CX'XJ- 1)kk =

)Ck, k)

element ofVar(biX) in (1.3.4). (1.4.4)

Since s 2 , being a function of the sample, is a random variable, this substitution changes the distribution of the statistic, but fortunately the changed distribution, too, is known and depends on neither nuisance parameters nor X. Proposition 1.3 (distribution of the t-ratio): Suppose Assumptions 1.1-1.5 bold. Under the null hypothesis H 0 : f3k = {3,, the 1-ratio defined as (1.4.5)

is

dis~'ibuted !iS

l(n- K) (the I

dis~'ibution

with n- K degrees of freedom).

PROOF. We can write

Zk

e'e/(n-K)

a'

where q = e'eja 2 to reflect the substitution of s 2 for a 2 • We have already shown that Zk is N(O, 1). We will show: (1) q

I X~

x2 (n- K),

(2) two random variables

Zk

and q are independent conditional on X.

Then. by the definition of the I distribution. the ratio of Zk to ../q/(n- K) is distributed as t with n - K degrees of freedom, 15 and we are done. (1) Since e'e = e'Me from (1.2.12). we have

c'c e' e q = -2 = -M-. a a a 15 Fact: If x "' N(O, 1), y ,...., with m degrees of freedom.

x2 (m) and if x andy are independent, then the ratio xj .fYTiii has the t distribution

37

Finite-Sample Properties of OLS

The middle matrix M, being the annihilator, is idempotent Also, e(a I X~ N(O, In) hy (1.4.1). Therefore, this quadratic form is distributed as x2 with degrees of freedom equal to rank(M). 16 But rank(M) = trace(M), because M is idcmpotcntP We have already shown in the proof of Proposition 1.2 that trace(M) = 11- K. So q I X~ x2 (11- K). (2) Both band care linear functions of e (by (1.2.14) and the fact that c =Me), so they are jointly nonnal conditional on X. Also, they are uncorrelated conditional on X (sec part (d) of Proposition 1.1). Sob and e arc independently distributed conditional on X. But Zk is a function of b and q is a function of c. So Zk and q are independently distributed conditional on X. 18 • Decision Rule for the t-Tesl

The test of the null hypothesis based on the t-ratio is called the t-test and proceeds as follows:

Step 1: Given the hypothesized value, {3., of f3<. form the t-ratio as in (1.4.5). Too large a deviation oft, from 0 is a sign of the failure of the null hypothesis. The next step specifies how large is too large. Step 2: Go to the t-tablc (most statistics and econometrics textbooks include the!table) and look up the entry for 11- K degrees of freedom. Find the critical value, 10 ; 2 (11 - K), such that the area in the t distribution to the right of la; 2 (11 - K) is a/2, as illustrated in Figure 1.3. (lf 11 - K = 30 and a = 5%, for example, la;2(11- K) = 2.042.) Then, since the t distribution is symmetric around 0, Prob( -la;2(11- K) < I < la;2(11- K)) = I -

C<.

Step 3: Accept Ho if -ta;2(11- K) < t, < la;2(11- K) (that is, if It, I < la;2(11K)), where t, is the !-ratio from Step 1. Reject H 0 otherwise. Since t, ~ t (n - K) under H 0 , the probability of rejecting H 0 when H 0 is true is a. So the size (significance level) of the test is indeed a. A convenient feature of the t-test is that tbe critical value does not depend on X; there is no need to calculate critical values for each sample.

16Faet: lfx......., N(O, J 11 ) and A is idempotent, then x 1Ax has a chi-squared distribution with degrees of freedom equal to the rank of A. 17Faet: If A is idempotent, then rank( A)= trace(A). 18 Fact: Ifx andy are independently distributed, then so are f(x) and g(y).

38

Chapter 1

density

density of /

t distribution

a/2

a/2

t

-t.,;2 (n- K)

0

t.,;2 (n- K)

Figure 1.3: t Distribution

Confidence Interval Step 3 can also be stated in tcnns of bk and SE(bk). Since tk is as in (1.4.5), you accept Ho whenever

or

Therefore, we accept if and only if the hypothesized value f!k falls in the interval: (1.4.6) This interval is called the level 1 - ct confidence interval. It is narrower the smaller the standard error. Thus, the smallness of the standard error is a measure of the estimator's precision. p-Value The decision mle of the t-test can also be stated using the p-value.

Step 1: Same as above. Step 2: Rather than finding the critical value la;z(n - K), calculate p = Prob(t > ltkll x 2.

Finite-Sample Properties of OLS

39

Since the t distribution is symmetric around 0, Prob(t > ltkl) = Prob(t < -ltkl), so Prob( -ltkl <

t <

ltkl) = I - p.

(1.4.7)

Step 3: Accept H0 if p > a. Reject otherwise.

To sec the equivalence of the two decision mlcs, one based on the critical values such as 1a;2(n - K) and the other based on the p-value, refer to Figure 1.3. If Prob(t > It> I) is greater than aj2 (as in the figure), that is, if the p-value is more than a, then ltkl must be to the left of ta;2(n- K). This means from Step 3 that the null hypothesis is not rejected. Thus, when p is small, the !-ratio is surprisingly large for a random variable from the t distribution. The smaller the p, the stronger the rejection.

Examples of the 1-test can be found in Section 1.7. Linear Hypotheses

The null hypothesis we wish to test may not be a restriction about individual regression coefficients of the maintained hypothesis; it is often about linear combinations of them written as a system of linear equations: Ho:

Rf3

= r,

(1.4.8)

where values of Rand rare known and specified by the hypothesis. We denote the number of equations, which is the dimension of r, by #r. So R is #r x K. These #r equations are restrictions on the coefficients in the maintained hypothesis. It is called a linear hypothesis because each equation is linear. To make sure that there are no redundant equations and that the equations are consistent with each other,

we require that rank(R) = #r (i.e., R is of full row rank with its rank equaling the number of rows). But do not he too conscious about the rank condition; in specific

applications, it is very easy to spot a failure of the rank condition if there is one.

Example 1.5 (continuation of Example 1.2): Consider the wage equation of Example 1.2 where K = 4. We might wish to test the hypothesis that education and tenure have equal impact on the wage rate and that there is no experience effect. The hypothesis is two equations (so #r = 2):

!h

= {33 and {J4 = 0.

40

Chapter 1

This can be cast in the format R,B = r if R and r arc defined as

R [00 0I -I0 OJI ' =

[OJ. 0

r =

Because the two rows of this R are linearly independent, the rank condition is satisfied. But suppose we require additionally that

This is redundant because it holds whenever the first two equations do. With these three equations, #r = 3 and

0 I -1 R= 0 0 0 [ 0 I -1 Since the third row of R is the difference between the first two, R is not of full row rank. The consequence of adding redundant equations is that R no longer meets the full row rank condition. As an example of inconsistent equations, consider adding to the first two equations the third equation fh = 0.5. Evidently, {34 cannot be 0 and 0.5 at the same time. The hypothesis is inconsistent because there is no ,8 that satisfies the three equations simultaneously. If we nevertheless included this equation, then Rand r would become

R

=

[~ ~ ~l ~] 0 0

0

I

, r

=

[

~

]·

0.5

Again, the full row rank condition is not satisfied because the rank of R is 2 while #r = 3. The F-Test

To test linear hypotheses, we look for a test statistic that has a known distribution under the null hypothesis. Proposition 1.4 (distribution of the F-ratio): Suppose Assumptions 1.1-1.5 hold. Under the null hypothesis H0 : R,B = r, where R is #r x K with rank(R) = #r, the F -ratio defined as

41

Finite-Sample Properties of OLS

= (Rb- r)'[RVaf(biX)R'r' (Rb- r)/#r

(by (1.3.4))

(1.4.9)

is disLTibuled as F(#r, n - K) (lhe F disllibulion with #rand n - K degrees of freedom). As in Proposition 1.3, it suffices to show that the distribution conditional on X is F (#r, n - K); because the F distribution does not depend on X, it is also the unconditional distribution or the statistic. PROO~~ Since s 2 = e'e/(n - K), we can write

F = ~w_.f_#r_ qf(n- K) where w

= (Rb- r)'[u 2 . R(X'X)- 1R'r' (Rb- r)

c'c

and q=2· ()'

We need to show

(2) q

I X~ x2 (n-

K) (this is part (I) in the proof of Proposition 1.3),

(3) wand q are independently distributed conditional on X.

Then, by the definition of the F distribution, the F -ratio

~

F(#r, n - K).

(I) Let v- Rb- r. Under H 0 , Rb- r = R(b- {3). So by (1.4.2), conditional on X, vis normal with mean 0, and its variance is given by Var(v I X)= Var(R(b- {3)

I X)= RVar(b- {3 I X)R' = u 2 · R(X'X)-'R',

which is none other than the inverse of the middle matrix in the quadratic form for w. Hence, w can be written as v' Var(v I X)- 1v. Since R is of full row rank and X'X is nonsingular, u 2 • R(X'X)- 1R' is nonsingular (why? Showing this is a review question). Therefore, hy the definition of the x 2 distribution, w

I

X~

x2 (#r). 19

19 Fact: Letx be an m dimensional random vector. Ifx "'N(p.,, :E) with :E nonsingular. then (x- p.,)':E- 1 (x!') ~ x 2 (m).

42

Chapter 1

(3) w is a function of b and q is a function of e. But b and e arc independently distributed conditional on X, as shown in part (2) of the proof of Proposition 1.3. So w and q are independently distributed conditional on X.

•

If the null hypothesis R/3 = r is tme, we expect Rb - r to be small, so large values ofF should be taken as evidence for a failure of the null. This means that we look at only the upper tail of the distribution in the F -statistic. The decision mle of the F -test at the significance level of a is as follows.

Step 1: Calculate the F-ratio hy the formula (1.4.9). Step 2: Go to the table ofF distribution and look up the entry for #r (the numerator degrees of freedom) and n - K (the denominator degrees of freedom). Find the critical value F.(#r, n- K) that leaves a for the upper tail of the F distribution, as illustrated in Figure 1.4. For example. when #r = 3, n- K = 30, and a = 5%, the critical value F. 05 (3. 30) is 2.92. Step 3: Accept the null if the F -ratio from Step I is less than Fa (#r, n- K). Reject otherwise.

This decision mle can also be described in terms of the p-value:

Step 1: Same as above. Step 2: Calculate p = area of the upper tail of the F distribution to the right of the F -ratio.

Step 3: Accept the null if p >

a: reject otherwise.

Thus, a small p-value is a signal of the failure of the null. A More Convenient Expression for F

The above derivation of the F -ratio is by the Wald principle, because it is based on the unrestricted estimator, which is not constrained to satisfy the restrictions of the null hypothesis. Calculating the F -ratio hy the formula (1.4.9) requires matrix

inversion and multiplication. Fortunately, there is a convenient alternative formula involving two different sum of squared residuals: one is SSR, the minimized sum of squared residuals obtained from (1.2.1) now denoted as SSRu, and the other is the restricted sum of squared residuals, denoted SSRn, obtained from mjn SSR(/3) s.t. p

R/3

= r.

(1.4.10)

43

Finite-Sample Properties of OLS

density ~ densityof

~

F(#r,n-K)

a

o~------------------------~-------7F

Fa(r, n-K) Figure 1.4: F Distribution

Finding the fJ that achieves this constrained minimization is called the restricted regression or restricted least squares. It is left as an analytical exercise to show that the F -ratio equals F = (SSRR- SSRu)/#r -'--=ss-:::R"-.u-/,.,-(n-'-'K:':-)-,

(1.4.11)

which is the difference in the objective function deflated by the estimate of the error variance. This derivation of the F -ratio is analogous to how the likelihoodratio statistic is derived in maximum likelihood estimation as the difference in log likelihood with and without the imposition of the null hypothesis. For this reason, this second derivation of the F-ratio is said to be hy the Likelihood-Ratio principle. There is a closed-fonn expression for the restricted least squares estimator of fJ. Deriving the expression is left as an analytical exercise. The computation of restricted least squares will be explained in the context of the empirical example in Section 1.7.

t versus F Because hypotheses about individual coefficients are linear hypotheses, the 1-test ofH0 : fh = f3k is a special case of the F-test. To see this, note that the hypothesis can be written as RfJ = r with R

(I xK)

=

[o

0

I 0 (k)

44

Chapter 1

So by (1.4.9) the F -ratio is

which is the square of the t-ratio in (1.4.5). Since a random variable distributed as F(l, 11- K) is the square of a random variable distributed as 1(11 - K), the t- and F -tests give the same test result. Sometimes, the null is that a set of individual regression coefficients equal certain values. For example, assume K = 2 and consider Ho: fJ1 = I and {J2 = 0. This can be written as a linear hypothesis R{J = r for R = 12 and r = (I, 0)'. So the F -test can be used. It is tempting, however, to conduct the 1-test separately for each individual coefficient of the hypothesis. We might accept H 0 if both restrictions fJ1 = I and fJ2 = 0 pass the t -test. This amounts to using the confidence region of {cfJI, fJz)

I b1 -

SE(bi) · ta;2(n - K) < fJ1 <

h

b,- SE(bz) · 1af2(n- K) < fJz < hz

+ SE(bi) · ta;2(n -

K),

+ SE(bz) · ta;z(n- K)},

which is a rectangular region in the ({J 1, f32) plane, as illustrated in Figure 1.5. If (I, 0), the point in the ({J 1, {3 2 ) plane specified by the null, falls in this region, one would accept the null. On the other hand, the confidence region for the F -test is - I X)) -l[bl -fJI] < 2F.(#r, {(fJJ, fJz) I (b1- fJI, b,- fJz)(Var(b b2 - fJz

11-

-

K) }.

Since Var(b I X) is positive definite, the F -test acceptance region is an ellipse in the ({J 1, {3 2 ) plane. The two confidence regions look typically like Figure 1.5. The F-test should he preferred to the test using two t-ratios for two reasons. First, if the size (significance level) in each of the two 1-tests is a, then the overall size (the probability that (I, 0) is outside the rectangular region) is not a. Second, as will be noted in the next section (see (1.5.19)), the F-test is a likelihood ratio test and likelihood-ratio tests have certain desirable properties. So even if the significance level in each 1-test is controlled so that the overall size is a, the test is less desirable than the F -test. 20

2°For more details on the relationship between the t-test and the F -tests. see Scheffe (1959, p. 46).

45

Finite-Sample Properties of OLS

/3,

'' -----------------------.----''

level 1-a

con f. interval for f3 2

b2

---------------------

'' ''

'' '

' ''

-~--------·--------~-

----- .... ------0 level 1-a confidence Interval for /31

Figure 1.5: t- versus F-Tests

An Example of a Test Statistic Whose Distribution Depends on X To place the discussion of this section in a proper perspective, it may be useful to note that there are some statistics whose conditional distribution depends on X. Consider the celebrated Durbin-Watson statistic:

The conditional distribution, and hence the critical values, of this statistic depend on X, but J. Durbin and G. S. Watson have shown that the critical values fall between two bounds (which depends on the sample size, the number or regressors, and whether the regressor includes a constant)_ Therefore, the critical values for the unconditional distribution, too, fall between these bounds. The statistic is designed for testing whether there is no serial correlation in the error term. Thus, the null hypothesis is Assumption 1.4, while the maintained hypothesis is the other assumptions of the classical regression model (including the strict exogeneity assumption) and the normality assumption. But, as emphasized in Section L 1, the strict exogeneity assumption is not satisfied in time-series models typically encountered in econometrics, and scrjal correlation is an issue that arises only in time-series models. Thus, the Durbin-Watson statistic is not useful in econometrics. More useful tests for serial correlation, which are all based on large-sample theory, will be covered in the next chapter.

46

Chapter 1

QUESTIONS FOR REVIEW 1.

(Conditional vs. unconditional distribution) Do we know from Assumptions 1.1-l.Sthatthc marginal (unconditional) distribution orb is nom1al? [Answer: No.] Are the statistics Zk (see (1.4.3)), t" and F distributed independently of X? [Answer: Yes, because their distributions conditional on X don't depend on X.]

2. (Computation of test statistics) Verify that SE(bk) as well as b, SSR, s 2 , and R 2 can be calculated from the following sample averages: Sxx, Sxy• y'yfn, andy. 3. For the fonnula (1.4.9) for the F to be well-defined, the matrix R(X'X)- 1R' must be nonsingular. Prove the stronger result that the matrix is positive definite. Hint: X'X is positive definite. The inverse of a positive definite matrix is positive definite. Since R (#r x K) is of full row rank, for any nonzero #r dimensional vector z, R'z fc 0.

4. (One-tailed t-test) The t-test described in the text is the two-tailed t-test because the significance a is equally distributed between both tails of the t distribution. Suppose the alternative is one-sided and written as H1: fh > fh. Consider the following modification of the decision rule of the t-test. Step 1: Same as above. Step 2: Find the critical value 10 such that the area in the t distribution to the right of la is a. Note the difference from the two-tailed test: the left tail is ignored and the area of a is assigned to the upper tail only. Step 3: Accept if tk < 10 ; reject otherwise. Show that the size (significance level) of this one-tailed t-test is a. 5. (Relation between F(l, n- K) and t(n- K)) Look up the t and F distribution tables to verify that I';, (I, n- K) = Cta; 2 (n- K)) 2 for degrees of freedom and significance levels of your choice. 6. (t vs. F) "It is nonsense to test a hypothesis consisting of a large number of equality restrictions, because the !-test will most likely reject at least some of the restrictions." Criticize this statement.

7. (Variance or s 2 ) Show that, under Assumptions 1.1-1.5, Var(s 2 I X) =

2u 4

---= n-K

Hint: If a random variable is distributed as x 2 (m), then its mean ism and vari-

ance 2m.

47

Finite-Sample Properties of OLS

1.5

Relalion to Maximum Likelihood

Having specified the distribution of the error vector e, we can usc the maximum likelihood (ML) principle to estimate the model parameters (/3. a 2 ) 21 In this section, we will show that b, the OLS estimator of {3. is also the ML estimator, and the OLS estimator of a 2 differs only slightly from the ML counterpart, when the error is normally distributed. We will also show that b achieves the Cramer-Rae lower bound. The Maximum Likelihood Principle As you might recall from elementary statistics, the basic idea of the ML principle is to choose the parameter estimates to maximize the probability of obtaining the observed sample. To he more precise, we assume that the probability density of the sample (y, X) is a member of a family of functions indexed by a finite-dimensional parameter vector~: j(y, X;~). (This is described as parameterizing the density function.) This function, viewed as a function of the hypothetical parameter vector ~,is called the likelihood function. At the true parameter vector ~, the density of (y, X) is .f(y, X; 0- The ML estimate of the true parameter vector~ is the~ that maximizes the likelihood function given the data (y, X). Conditional versus Unconditional Likelihood Since a (joint) density is the product of a marginal density and a conditional density, the density of (y, X) can be written as f(y, X;~) = f(y

I X; 0)

· .f(X;

1/1),

(1.5.1)

where 0 is the subset of the parameter vector ~ that deten11ines the conditional density function and 1/1 is the subset deten11ining the marginal density function. The parameter vector of interest is (J; for the linear regression model with nonnal errors, 0 = ({3', a 2 )' and .f(y I X; 0) is given by (1.5.4) below. -I -I Let~ - (0, 1/1 )'be a hypothetical value of~ = (0', 1/1')'. Then the (unconditional or joint) likelihood function is .f(y, X;~)= .f(y I X; O)..f(X; If we knew the parametric form of .f(X;

1/1),

;jJ).

(1.5.2)

then we could maximize this joint

21 For a fuller treatment of maximum likelihood. see Chapter 7.

48

Chapter 1

likelihood function over the entire hypothetical parameter vector ~, and the ML estimate of 0 would he the elements of the ML estimate of~. We cannot do this for the classical regression model because the model does not specify .f(X; 1/J). However, if there is no Junctional relationship between 0 and 1/1 (such as a subset of {fr being a function of 0), then maximizing (1.5.2) with respect to~ is achieved

-

-

-

by separately maximizing .f(y I X; 0) with respect to 0 and maximizing .f(X; 1/r) with respect to 1/f. Thus the ML estimate of 0 also maximizes the conditional likelihood

.f (y I X; 0).

The Log Likelihood lor the Regression Model As already observed, Assumption 1.5 (the normality assumption) together with Assumptions 1.2 and 1.4 imply that the distribution of e conditional on X is N(O, a 2 In) (see (1.4.1)). But since y = XP + e by Assumption 1.1, we have

y I X~ N(X{j,a 2 ln).

(1.5.3)

Thus, the conditional density of y given X is22 (1.5.4) Replacing the true parameters CP, a 2 ) by their hypothetical values CP, &2 ) and taking logs, we obtain the log likelihood function:

n n I log L(p, &2 ) = - log(2rr)- 1og(& 2 ) (y- Xfj)'(y- X{j). (1.5.5) 252 2 2

Since the log transformation is a monotone transformation, the ML estimator of (p, a 2 ) is the (p, &2 ) that maximizes this log likelihood. ML via Concentrated Likelihood tt is instructive to maximize the log likelihood in two stages. First, maximize over p for any given &2 . The p that maximizes the objective function could (but docs not, in the present case of Assumptions 1.1-1.5) depend on &2 . Second, maximize over &2 taking into account that the p obtained in the first stage could depend on &2 The log likelihood function in which {3 is constrained to be the value from 22 Recall from basic probability theory that the density function for ann-variate normal disu·ibution with mean JL and variance matrix I: is

(2n)-n/21:EI-l/2exp[ -~(y- p,)':E-l(y

To derive (1.5.4),just set 1-L =X~ and :E = a 2 In.

-~t)J.

49

Finite-Sample Properties of OLS

the first stage is called the concentrated log likelihood function (concentrated with respect to {J). For the normal log likelihood ( 1.5.5), the first stage amounts to minimizing the sum of squares (y- Xfi)'(y- Xfi). The fi that does it is none other than the OLS estimator b, and the minimized stml of squares is c'c. Thus, the concentrated log likelihood is concentrated log likelihood= -

n

2

-

-

-

Iog(2n)-

1 , n log(u 2 ) - & 2 c c. (1.5.6) 2 2

This is a function of & 2 alone, and the & 2 that maximizes the concentrated likelihood is the ML estimate of a 2 . The maximization is straightforward for the present case of the classical regression modeL because e'e is not a function of 5 2 and so can be taken as a constant. Still, taking the derivative with respect to 5 2 , rather than with respect to can be tricky. This can be avoided by denoting & 2 by ji. Taking the derivative of (1.5.6) with respect to ji (= & 2 ) and setting it to zero, we obtain the following result.

u,

Proposition 1.5 (ML Estimator of (/3, a 2 )): Suppose Assumptions 1.1-1.5 hold. Then the ML estimator of fi is the OLS estimator b and I SSR n- K 2 ML estimator of a 2 = - e'e = - - = s . n n n

(1.5.7)

We know from Proposition 1.2 that s 2 is unbiased. Since s 2 is multiplied by a factor (n- K)fn which is different from I, the ML estimator of a 2 is biased, although the bias becomes arbitrarily small as the sample size n increases for any given fixed

K. For later use, we calculate the maximized value of the likelihood function. Substituting (1.5.7) into (1.5.6), we obtain

n log (2n) maximized log likelihood = - - -n - -n log(SSR), 2 n 2 2 so that the maximized likelihood is

- & 2 ) = (2n)-n/2 ") · (SSR)-nf 2 tpaxL(fi, · exp ( - p,a' n 2

(1.5.8)

Cramer-Rao Bound for the Classical Regression Model

Just to refresh your memory of basic statistics, we temporarily step outside the classical regression model and present without proof the Cramer-Rao inequality for the variance-covariance matrix of any unbiased estimator. For this purpose, define

50

Chapter 1

-

the score vector at a hypothetical parameter value IJ to be the gradient (vector of partial derivatives) of log likelihood: score: s(IJ)

=

alogL(IJ)

_

ae

.

(1.5.9)

Cramer-Rao Inequality: Let z be a vector of random vadables (not necessarily independent) the joint density of which is given by J(z; IJ), where IJ is an mdimensional vcc10r oFpm·amclcrs in some parameter space e. Let L(IJ) = f(z; IJ) be the likelihood function, and let IJ(z) be an unbiased estimator of IJ with a finite vmiance-covariance matrix. Then, under some regularity conditions on f(z; 0) (not stated here),

-

Var[O(z)] > I(IJ)- 1

(-

Cramer-Rao Lower Bound),

(mxm)

where I(IJ) is the information matrix defined by I(IJ) - E[s(IJ) s(IJ)'].

(1.5.10)

(Note well that the score is evaluated at the t111e pm·ameter value IJ.) Also under the regulan'ly conditions, the infOrmation matrix equals the negative o[ the expected value of the Hessian (matrix of second pmtial derivatives) of the log likelihood: I(IJ) =

-E[a2l~g~~IJ)J·

(1.5.11)

ae ae

This is called the information matrix equality.

See, e.g., Amemiya (1985, Theorem 1.3.1) for a proof and a statement of the regularity conditions. Those conditions guarantee that the operations of differentiation and taking expectations can be interchanged. Thus, for example,

-

-

E[aL(IJ)faiJ] =a E[L(IJ)]faiJ.

Now, for the classical regression model (of Assumptions 1.1-1.5), the likelihood function L(IJ) in the Cramer-Rao inequality is the conditional density (1.5.4), so the variance in the inequality is the variance conditional on X. It can be shown that those regularity conditions arc satisfied for the nonnal density (1.5.4) (sec, e.g., Amemiya, 1985, Sections 1.3.2 and 1.3.3). In the rest of this subsection, we calculate the information matrix for (1.5.4). The parameter vector IJ is ((3', a 2 )'.

51

Finite-Sample Properties of OLS

So ii =

(jj', ji)' and the matrix of second derivatives we seck to calculate is

a2 log L(O) - ao_, ao ((K+I)x(K+I))

a2 1ogL(8) - -, ap ap

a2 Iog L(8) apay

(KxK)

(Kxl) a 2 Iog L(8)

a2 1ogL(O) ay aP'

(1.5.12)

aly

(I xI)

(1 xK)

The first and second derivatives of the log likelihood (1.5.5) with respect to 0, evaluated at the true parameter vector 0, are

alog ~(0) ap

=

_a_lo-=g-::-~-'-(0...:..)

= __ n

(1.5.13a)

Y

ay

B2 log L(O)

- _, ap ap

a2 logL(O)

2.x'(y _ xp),

+ _l_(y _

2y

I ' =--XX, y n I = - -Cy2y2 Y'

a2l~L(O) ap aji

= _

xp)

,

(y-

xp).

_.!._X'( -XR).

Y2

Y

(1.5.13b)

(1.5.14a)

-~--

a2y

XP)'(y _ Xp).

2y2

(1.5.14b)

(1.5.14c)

"

Since the derivatives are evaluated at the true parameter value, y - XP = e in these expressions. Substituting (1.5.14) into (1.5.12) and using E(e I X) = 0 (Assumption 1.2), E(e' e 1 X) = nu 2 (implication of Assumption 1.4), and recalling y = a 2 , we can easily derive

I(O) =

[o',~'X

(1.5.15)

Here, the expectation is conditional on X because the likelihood fm1ction (1.5.4) is a conditional density conditional on X. This block diagonal matrix can he inverted to obtain the Cramer-Rao bound: Cramer-Rao bound - I(O) -I =

u2. (X'X)-1 [

0

,

(1.5.16)

Therefore, the unbiased estimator b, whose variance is u 2 · (X'X) -I by Proposition 1.1, attains the Cramcr-Rao bound. We have thus proved

52

Chapter 1

Proposition 1.6 (b is the Best Unbiased Estimator (BUE)): Under Assumptions 1.1-1 .5. the OLS estimator b of p is B UE in that any other unbiased (but not

necessmily linem~ estimator has larger conditional variance in the matrix sense. This result should be distinguished from the Gauss-Markov Theorem that b is minimum variance among those estimators that arc unbiased and linear in y. Proposition 1.6 says that b is minimum variance in a larger class of estimators that includes nonlinear unbiased estimators. This stronger statement is obtained under the normality assumption (Assumption 1.5) which is not assumed in the Gauss-Markov Theorem. Put differently, the Gauss-Markov Theorem does not exclude the possibility of some nonlinear estimator beating OLS, but this possibility is mled out by the normality assumption. As was already seen, the ML estimator of rr 2 is biased, so the Cramer-Rao bound does not apply. But the OLS estimator s2 of rr 2 is unbiased. Does it achieve the bound? We have shown in a review question to the previous section that Var(s

2

I X)

2rr 4

= ---:::-

n- K

under the same set of assumptions as in Proposition 1.6. Therefore, s 2 does not attain the Cramer-Rao bound 2rr 4 In. However, it can be shown that an unbiased estimator of rr 2 with variance lower than 2rr 4 l(n - K) docs not exist (sec, e.g., Rao, 1973, p. 319). The F- Test as a Likelihood Ratio Test

The likelihood ratio test of the null hypothesis compares Lu, the maximized likelihood without the imposition of the restriction specified in the null hypothesis, with LR, the likelihood maximized subject to the restriction. If the likelihood ratio A _ LuI L R is too large, it should be a sign that the null is false. The F -test of the null hypothesis H 0 : RP = r considered in the previous section is a likelihood ratio test because the F -ratio is a monotone transfonnation of the likelihood ratio A. For the present model, Lu is given by (1.5.8) where the SSR, the sum of squared residuals minimized without the constraint Ho, is the SSRu in (1.4.11). The restricted likelihood L R is given by replacing this SSR by the restricted sum of squared residuals, SSRR. So

LR

-

max L(p, ii 2 ) = p,o-z s.t. Ho

= _

and the likelihood ratio is

(2n)-n/2 n) · (SSRn)-n/ · exp (- n

2

2

,

(1.5.17)

53

Finite-Sample Properties of OLS

(1.5.18) Comparing this with the formula (1.4.11) for the F -ratio, we sec that the F -ratio is a monotone trans[ormation of the likelihood ratio A: F =

n-K #r

2

(1.5.19)

(A / " - 1),

so that the two tests are the same. Quasi-Maximum Likelihood

All these results assume the normality of the error term. Without normality, there is no guarantee that the ML estimator of pis OLS (Proposition 1.5) or that the OLS estimator b achieves the Cramer-Rao bound (Proposition 1.6). However, Proposition 1.5 docs imply that b is a quasi- (or pseudo-) maximum likelihood estimator, an estimator that maximizes a misspecified likelihood function. The misspecified likelihood function we have considered is the normal likelihood. The results of Section 1.3 can then be interpreted as providing the finite-sample properties of the quasi-ML estimator when the error is incorrectly specified to he nonnal. QUESTIONS FOR REVIEW

1. (Usc of regularity conditions) Assuming that taking expectations (i.e., taking integrals) and ditferentiation can he interchanged, prove that the expected value of the score vector given in (1.5.9), if evaluated at the true parameter value 0, is zero. Hint: What needs to be shown is that

f

alog f_(z;

0) f(z; 8) dz = 0.

a8

Since f(z; 0) is a density, f f(z, 0) dz = I for any 0. Differentiate both sides with respect to 0 and use the regularity conditions, which allows us to change the order of integration and differentiation, to obtain j[af(z; 8)/BO] dz = 0. Also, from basic calculus,

alogf(z; 0) a8

aj(z;8) f(z; 8) a8 I

2. (Maximizing joint log likelihood) Consider maximizing (the log of) the joint likelihood (1.5.2) for the classical regression model, where ii = ii 2 )' and log J(y I X; 8) is given hy (1.5.5). You would parameterize the marginal like-

c7i'.

54

Chapter 1

lihood f(X; 1/r) and take the log of (1.5.2) to obtain the objective function to he maximized over~ = (0', 1/r')'. What is the ML estimator of 0 = ((3', a 2 )'? [Answer: It should be the same as that in Proposition 1.5.] Derive the CramerRao bound for (3. Hint: By the information matrix equality,

-

_,

Also, 32 Iog L(~)/(30 31/r) = 0. 3.

(Concentrated log likelihood with respect to ii 2 ) Writing ii 2 as ji, the log likelihood function for the classical regression model is

n n 1 logL({3,ji) = Iog(2rr)- zlog(ji)- ji(y- X(3)'(y-X(3). 2 2 ln the two-step maximization procedure described in the text, we first maximized this function with respect to {3. Instead, first maximize with respect toY given (3. Show that the concentrated log likelihood (concentrated with respect to ji = ii 2 ) is

-

n 2

--[1

+ log(2rr)]-

n -log

((y-

2

xp)'(yn

XP)) .

4. (Information matrix equality for classical regression model) Verify (1.5.11) for the linear regression model. 5. (Likelihood equations for classical regression model) We used the two-step procedure to derive the ML estimate for the classical regression model. An alternative way to find the ML estimator is to solve for the first-order conditions that set (1.5.13) equal to zero (the first-order conditions for the log likelihood is called the likelihood equations). Verify that the ML estimator given in Proposition 1.5 solves the likelihood equations.

1.6

Generalized Least Squares (GLS)

Assumption 1.4 states that then x n matrix of conditional second moments E(ee' I X)(= Var(e I X)) is spherical, that is, proportional to the identity matrix. Without the assumption, each element of then x n matrix is in general a nonlinear function

55

Finite-Sample Properties of OLS

of X. If the error is not (conditionally) homoskcdastic, the values of the diagonal elements of E(ee' I X) are not the same, and if there is correlation in the error term between observations (the case of serial correlation for time-series models), the values of the off-diagonal clements arc not zero. For any given positive scalar a 2 , define V(X) - E(ee' I X)ja 2 and assume V(X) is nonsingular and known. That is, E(ee' I X)= a 2 V(X), V(X) nonsingular and known.

(1.6.1)

(nxn)

The reason we decompose E(ee' 1 X) into the component a 2 that is common to all elements of the matrix E(ee' I X) and the remaining component V(X) is that we do not need to know the value of a 2 for efficient estimation. The model that results when Assumption 1.4 is replaced by (1.6.1), which merely assumes that the conditional second moment E(ee' I X) is nonsingular, is called the generalized regression model. Consequence of Relaxing Assumption 1.4

Of the results derived in the previous sections, those that assume Assumption 1.4

are no longer valid for the generalized regression model. More specifically, • The Gauss-Markov Theorem no longer holds for the OLS estimator

The BLUE is some other estimator. • The /-ratio is not distributed as the t distribution. Thus, the /-test is no longer valid. The same comments apply to the F -test. • However, the OLS estimator is still unbiased, because the unbiasedness result (Proposition1.1(a)) docs not require Assumption 1.4. Etficient Estimation with Known V If the value of the matrix function V(X) is known, docs there exist a BLUE for the

generalized regression model? The answer is yes, and the estimator is called the

generalized least squares (GLS) estimator, which we now derive. The basic idea of the derivation is to transfonn the generalized regression mode], which consists of Assumptions 1.1-1.3 and (1.6.1), into a model that satisfies all the assumptions, including Assumption 1.4, of the classical regression model.

56

Chapter 1

For economy of notation, we usc V for the value V(X), Since Vis by construction symmetric and positive definite, there exists a nonsingular n x n. matrix C such that y-I = C'C

(I ,6.2)

This decomposition is not unique, with more than one choice for C, but, as is clear

from the discussion below, the choice of C doesn't matter, Now consider creating a new regression model by transforming (y, X, e) by Cas

y_ Cy, x _ex, e _ Ce Then Assumption],] for (y, X, e) implies that (y,

(1,6.3)

X, e) too satisfies linearity: (1,6.4)

The transformed model satisfies the other assumptions of the classical linear regression model, Strict exogeneity is satisfied because E(e

1Xl

= E(e 1X) (since Cis nonsingular, X and X contain the same information) = = =

E(Ce I X) CE(e I X) (by the linearity of conditional expectations) 0 (since E(e 1X) = 0 by Assumption 1,2),

Because Vis positive definite, the no-multicollinearity assumption is also satisfied (see a review question below for a proof), Assumption 1.4 is satisfied for the transformed model because E(ee' I X) = E(U' I X)

(since X and X contain the same information)

= C E(ee' I X)C' = C, = CJ

=

CJ

CJ

2 ,

VC'

2

CVC'

2

In

(since

ee' =

Cee'C')

(by (1,6,1))

(since (C')- 1 V- 1 C- 1 =In or CVC' =In by (1,6.2)),

e

e

So indeed the variance of the transformed error vector is sphericaL Finally, I X is normal because the distribution of e I X is the same as e I X and e is a linear transformation of e. This completes the verification of Assumptions 1.1-1.5 for

the transformed modeL

57

Finite-Sample Properties of OLS

The Gauss-Markov Theorem for the transfom1cd model implies that the BLUE of p for the generalized regression model is the OLS estimator applied to (1.6.4): PaLs=

(X'X)-IX'y

= [(CX)'(CX)r 1 (CX)'Cy = (X'C'CX)- 1 (X'C'Cy) = cx'v- 1 x)- 1X'V- 1y

(by (1.6.2)).

(1.6.5)

This is the GLS estimator. Its conditional variance is VarCPaLs I X) = cx'v- 1x)- 1X'v- 1 Var(y X)V- 1X(X'v- 1x)- 1 1

= (X'v- 1x)- 1X'V- 1(rr 2 V)V- 1X(X'V- 1X)- 1

(since Var(y X)= Var(e X)) 1

= rr 2 . cx'v- 1 x)- 1

1

(1.6.6)

Since replacing V by rr 2 ·V (= Var(e 1 X)) in (1.6.5) does not change the numerical value, the GLS estimator can also be written as

As noted above, the OLS estimator (X'X)- 1X'y too is unbiased without Assumption 1.4, but nevertheless the GLS estimator should be preferred (provided V is known) because the latter is more efficient in that the variance is smaller in the matrix sense. The gain in efficiency is achieved by exploiting the heteroskedasticity and correlation between observations in the error term, which, operationally, is to insert the inverse of (a matrix proportional to) Var(e I X) in the OLS formula, as in (1.6.5). The discussion so far can he summarized as Proposition 1.7 (finite-sample properties of GLS):

(a) (unbiascdncss) Under Assumption 1.1-1.3, ECPaLs I X) = {J. (b) (expression for the vmiance) Under Assumptions 1.1-1.3 and the assumption (1.6.1) that the conditional second momem is proportional to V(X),

(c) (efficiency of GLS) Under the same set of assumptions as in (b), the GLS estimator is efficient in that the conditional variance of any unbiased estimator

that is linear in y is greater than or equal to VarCPaLs I X) in the matrix sense.

58

Chapter 1

A Special Case: Weighted Least Squares (WLS) The idea of adjusting for the error variance matrix becomes more transparent when there is no correlation in the error term between observations so that the matrix V is diagonal. Let v;(X) be the i-th diagonal clement ofV(X). So E(ef I X)(= Var(e; I X))= rr 2 · v;(X). It is easy to sec that Cis also diagonal, with the square root of 1/v;(X) in the i-th diagonal. Thus, (y, X) is given by

-

Yi

-

Xi

Y; = Jv;(X)' x; = Jv;(X)

(i = 1,2, ... ,11).

Therefore, efficient estimation under a known form of heteroskedasticity is first to weight each observation by the reciprocal of the square root or the variance v; (X) and then apply OLS. This is called the weighted regression (or the weighted least squares (WLS)). An important further special case is the case of a random sample where {Yi, xi} is i.i.d. across i. As was noted in Section 1.1, the error is unconditionally homoskcdastic (i.e., ECe?) docs not depend on i), but still GLS can be used to increase efficiency because the error can be conditiona11y heteroskedastic. The conditional second moment E(ef 1 X) for the case of random samples depends only on x;, and the functional form of E(ef I X;) is the same across i. Thus v; (X) = v(x;)

for random samples.

(1.6.7)

So the knowledge of VO comes down to a single function of K variables, v(-). Limiting Nature of GLS All these sanguine conclusions about the finite-sample properties of GLS rest on the assumption that the regressors in the generalized regression model are strictly

-

exogenous (E(e I X) = 0). This fact limits the usefulness of the GLS procedure. Suppose, as is often the case with time-series models, that the regressors are not strictly exogenous and the error is serially correlated. So neither OLS nor GLS has those good finite-sample properties such as unbiasedness. Nevertheless, as will be shown in the next chapter, the OLS estimator, which ignores serial correlation in the error, will have some good large sample properties (such as "consistency" and "asymptotic nonnality"), provided that the regressors are "predetermined" (which is weaker than strict cxogcncity). The GLS estimator, in contrast, does not have that redeeming feature. That is, if the error is not strictly exogenous

59

Finite-Sample Properties of OLS

but is merely prcdctcnnincd, the GLS procedure to correct for serial correlation can make the estimator inconsistent (see Section 6.7). A procedure for explicitly taking seria1 correlation into account while maintaining consistency will be presented in Chapter 6. If it is not appropriate for correcting for serial correlation, the GLS procedure can still be used to correct for hctcroskcdasticity when the error is not serially correlated with diagonal V(X), in the fonn of WLS. But that is provided that the matrix function V(X) is known. Very rarely do we have a priori information specifying the values of the diagonal clements of V(X), which is necessary to weight observations. In the case of a random sample where serial correlation is guaranteed not to arise, the knowledge of V (X) boils down to a single function of K variables, v(x;), as we have just seen, but even for this case the knowledge of such a function is unavailable in most applications. lf we do not know the function V(X), we can estimate its functional form from the sample. This approach is called the Feasible Generalized Least Squares (FGLS). But if the function V (X) is estimated tram the sample, its value V becomes a random variable, which affects the distribution of the GLS estimator. Very little is known about the finite-sample properties of the FGLS estimator. We will cover the large-sample properties of the FGLS estimator in the context of heteroskedasticity correction in the next chapter. Before closing, one positive side of GLS should be noted: most linear estimation techniques- including the 2SLS, 3SLS, and the random effects estimators to be introduced later-can be expressed as a GLS estimator, with some liberal definition of data matrices. However, those estimators and OLS can also be interpreted as a GMM (generalized method of moments) estimator, and the GMM interpretation is more useful for developing large-sample results. QUESTIONS FOR REVIEW

1. (The no-multicollinearity assumption for the transfonned model) Assumption 1.3 for the transfom1ed model is that rank(CX) = K. This is satisfied since C is nonsingular and X is of full column rank. Show this. Hint: Since X is of full column rank, for any K -dimensional vector c i" 0, Xc i" 0. 2. (Generalized SSR) Show that PGLS minimizes (y- xp)'v- 1(y- Xp). 3. Derive the expression for Var(b I X) for the generalized regression model. What is the relation of it to Var(f3GLS I X)? Verify that Proposition 1.7(c) implies

-

60

4. (SamplingerrorofGLS) Show: PaLs

Chapter 1

-p

=

(X'V- 1X)- 1X'V- 1 e.

1.7 Application: Returns to Scale in Electricity Supply Nerlove's 1963 paper is a classic study of returns to scale in a regulated industry. It also is excellent material for illustrating the techniques of this chapter and presenting a few more not yet covered.

The Electricity Supply Industry At the time of Ncrlovc's writing, the U.S. electric power supply industry had the following features: (1) Privately owned local monopolies supply power on demand. (2) Rates (electricity prices) are set by the utility commission. (3) Factor prices (e.g., the wage rate) arc given to the finn, either because of perfect competition in the market for factor inputs or through long-term contracts with labor unions. These institutional features will he relevant when we examine whether the OLS is an appropriate estimation procedure. 23

The Data Ncrlovc assembled a cross-section data set on 145 finllS in 44 states in the year 1955 for which data on all the relevant variables were available. The variables in the data are total costs, factor prices (the wage rate, the price of fuel, and the rental price of capital), and output. Although firms own capital (such as power plants, equipment, and structures), the standard investment theory of Jorgenson (1963) tells us that (as long as there are no costs in changing the capital stock) the firm should behave as if it rents capital on a period-to-period basis from itself at a rental price called the "user cost of capital," which is defined as (r + 8) ·PI, where r here is the real interest rate (below we will user for the degree of retums to scale), 8 is the depreciation rate, and PI is the price of capital goods. For this reason capital 23 Thanks to the deregulation of the indust1y since the time ofNerlove's writing, multiple firms are now allowed to compete in the same local market, and the strict price control has been lifted in many states. So the first two features no longer characterize the industry.

Finite-Sample Properties of OLS

61

input can be treated as if it is a variable factor ofproduetion, just like labor and fuel inputs.

Appendix B of Nerlove (1963) contains a careful and honest discussion of how the data were constmcted, Data on output, fuel, and labor costs (which, along with capital cost...,, make up total costs) were obtained from the Federal Power

Commission (1956), For the wage rate, Nerlove used statewide average wages for utility workers, Ideally, one would calculate capital costs as the reproduction cost of capital times the user cost of capitaL Due to data limitation, Nerlove instead used interest and depreciation charges available from the firm's books, Why Do We Need Econometrics?

Why do we need a fancy econometric technique like OLS to determine retums to scale? Why can't we be simple-minded and plot the average cost (which can be easily calculated from the data as the ratio of total costs to output) against output and see whether the AC (average cost) curve is downward sloping? The reason is that each firm can have a different AC curve. lf firms face different factor prices,

then the average cost is less for firms facing lower factor prices, That cross-section units at a given moment face the same prices is usually a good assumption to make,

but not for the U,S, electricity industry with substantial regional differences in factor prices, The effect of factor prices on the AC curve has to be isolated somehow, The approach taken by Nerlove, which became a standard econometric practice, is to estimate a parameterized cost function.

Another factor that shifts the individual AC curve is the level of production efficiency. If more efficient firms produce more output, then it is possible that

the individual AC curve is upward sloping but the line connecting the observed combination of the average cost and output is downward sloping, To illustrate, consider a competitive industry described in Figure L6, where the AC and MC (marginal cost) curves are drawn for two firms competing in the same market. To

focus on the connection between production efficiency and output, assume that all firms face the same factor prices so that the only reason the AC and MC curves differ between firms is the difference in production efficiency, The AC and MC curves are upward sloping to reflect decreasing retums to scale, The AC and MC curves for fim1 A lie above those for firm B because finn A is less efficient than B, Because the industry is competitive, both firms face the same price p, Since output is determined at the intersection of the MC cmve and the market price, the combinations of output and the average cost for two finns are points A and B in

the figme, The curve obtained from connecting these two points can be downward sloping, giving a false impression of increasing returns to scale.

62

Chapter 1

AC,MC

MCs ACs p

c_-----~------------~------7

output

Figure 1.6: Ou1pu1 De1ermina1ion

The Cobb-Douglas Technology

To derive a parameterized cost function, we start with the Cobb-Douglas production function (1.7.1) where Q 1 is finn i 's output, xil is labor input for finn i, xi2 is capital input, and x 13 is fuel. A; captures unobservable differences in production efficiency (tltis term is often called lirm heterogeneity). The sum a 1 +a2 +a 3 = r is the degree of returns to scale. Thus, it is assumed a priori that the degree of returns to scale is constant (this should not be confused with constant returns to scale, which is that r = 1). Since the electric utilities in the sample are privately owned, it is reasonable to suppose that they arc engaged in cost minimization (sec, however, the discussion at the end of this section). We know from microeconomics that the cost function associated with the Cobb-Douglas production function is Cobb-Douglas: (1.7.2) where TC; is total costs for finn i. Taking logs, we obtain the following log-linear relationship:

''

log(1C;) = /-';

1 C¥t C¥2 C¥3 + -log(Q;) + -log(p;t) + -log(p; 2 ) + -log(p; 3 ),

r

r

r

r

(1.7.3)

63

Finite-Sample Properties of OLS

where p,; = log[r · (A;

a:'

a~ 2 a;')- 11'"]. The equation is said to be log-linear

because both the dependent variable and the regressors are logs. Coefficients in

log-linear equations are elasticities. The log(p; 1) coefficient. for example. is the elasticity of total costs with respect to the wage rate, i.e., the percentage change in total costs when the wage rate changes by I percent. The degree of returns to scale, which in (1.7.3) is the reciprocal of the output elasticity of total costs, is independent of the level of output. Now let p, _ E(p,;) and define £; _ p,; - p, so that E(e;) = 0. This £; represents the inverse of the firm's production efficiency relative to the industry's average efficiency; finns with positive £; are high-cost firms. With this notation, (1.7 .3) becomes log(TC;) = f3t

+ f3z

log(Q;)

+ f33

log(p;t)

+ f34 log(p;2) + f3s

log(p;,)

+ £;. (1.7.4)

where I

f32 = -, r

(1.7.5)

Thus, the cost function has been cast in the regression format of Assumption 1.1 with K = 5. We noted a moment ago that the simple-minded approach of plotting the average cost against output cannot account for the factor price effect. What we have shown is that under the Cobb-Douglas technology the factor price effect is controlled for by the inclusion in the cost function of the logs of factor prices. Because the equation is derived from an explicit description of the firm's technology, the error tenn as well as the regression coefficients have clear interpretations. How Do We Know Things Are Cobb-Douglas?

The Cobb-Douglas functional form is certainly a very convenient parameterization of technology. But how do we know that the true production function is CobbDouglas? The Cobb-Douglas form satisfies the properties, such as diminishing marginal productivitics, that we normally require for the production function, but the Cobb-Douglas form is certainly not the only functional form with those desirable properties. A number of more general functional forms have been proposed in the literature, hut the Cobb-Douglas fonn, despite its simplicity, has proved to be a surprisingly good description of technology. Nerlove's paper is one of the relatively few studies in which the Cobb-Douglas (log-linear) form is found to be inadequate, hut it only underscores the importance of the Cobb-Douglas functional fonn as the benchmark from which one can usefully contemplate generalizations.

64

Chapter 1

Are the OLS Assumptions Satisfied?

To justify the use of least squares, we need to make sure that Assumptions 1.11.4 are satisfied for the equation (1.7 .4). Evidently, Assumption 1.1 (linearity) is

satisfied with

y, =log (TC;), x; = (1, log(Q;), log (p; 1), log (p; 2 ), log (p; 3 ))'. There is no reason to expect that the regressors in (1.7.4) arc perfectly multicollinear. Indeed, in Nerlove's data set, rank(X) = 5 and n = 145, so Assumption 1.3 (no multicollinearity) is satisfied as well. In verifying the strict cxogcncity assumption (Assumption 1.2), the features of the electricity industry mentioned above are relevant. It is reasonable to assume, as in most cross-section data, that

X;

is independent of ej fori ,6 j. So the question

is whether x; is independent of e;. If it is, then E(e I X) = 0. According to the third feature of the industry, factor prices are given to the firm with no regard for the firm's efficiency, so it is eminently reasonable to assume that factor prices are independent of£;. What about output? Since the firm's output is supplied on demand (the first feature of the industry), output depends on the price of electricity set by the utility commission (the second feature). If the regulatory scheme is such that the price is determined regardless of the firm's efficiency, then log(Q;) and£; are independently distributed. On the other hand, if the price is set to cover the average cost, then the finn's efficiency atl'ects output through the etl'ect of the electricity price on demand and output in this case is endogenous, being correlated with the error tcnn. We will very briefly come back to this point at the end, but until then we will ignore the possible endogeneity of output. This certainly would not do if we were dealing with a competitive industry. Since high-cost firms tend to produce less, there would be a negative correlation between log(Q;) and£;, making OLS

an inappropriate estimation procedure. Regarding Assumption 1.4, the assumption of no correlation in the error term between firms (observations) would be suspect if, for example, there were technology spillovers running from one finn to otl1er closely located firms. For the industry under study, this is probably not the case. There is no a priori reason to suppose that homoskedasticity is satisfied. Indeed, the plot of residuals to be shown shortly suggests a failure of this condition. The main part of Nerlove's paper is exploring ways to deal with this problem.

65

Finite-Sample Properties of OLS

Restricted Least Squares

The equation (1.7 .4) is ovcridcntilicd in that its five coefficients, being functions of the four technology parameters (which are "h a 2 , a 3 , and fL), are not free parameters. We can easily sec that from (1.7.5): fh +{34 + f3s = 1 (recall: r = a 1+a2 +a,). This is a reflection of the generic property of the cost function that it is linearly homogeneous in factor prices. Indeed, multiplying total costs TC, and all factor prices (p, 1 , p, 2 , p, 3 ) by a common factor leaves the cost function (1.7.4) intact if and only if fJJ + f34 + f3s = I. Estimating the equation by least squares while imposing a priori restrictions on the coefficient vector is the restricted least squares. It can be done easily by deriv-

ing from the original regression a separate regression that embodies the restrictions. In the present example, to impose the homogeneity restriction {3 3 + {34 + {3 5 = 1 on the cost function, we take any one of the factor prices, say Pi3, and subtract log(p, 3 ) from both sides of (1.7.4) to obtain

(TC)

log - '

= fJ1

(P'I)

(P2)

+ fJz log(Q,) + {33 log -'-

+ {34 log - '

Pi3

Pi3

Pi3

+ £,. (1.7.6)

There are now four coefficients in the regression, from which unique values of the four technology parameters can be determined. The restricted least squares estimate of ({3 1, ... , {34) is simply the OLS estimate of the coefficients in (1.7 .6). The restricted least squares estimate of fJs is the value implied by the estimate of (fJI, ...• {34) and the restriction. Testing the Homogeneity of the Cost Function

Before proceeding to the estimation of the restricted model (1.7.6), in order to test the homogeneity restriction {3 3 + {34 + {3 5 = 1, we will first estimate the umestricted model (1.7.4). If one uses the data available in printed form in Nerlove's paper, the OLS estimate of the equation is: log(TC,) = -3.5

(1.8) -

+

0.72 log(Q,) (0.017)

0.22 log(p, 2 ) (0.34)

+

+

0.44 log(p,I) (0.29)

0.43 log(p, 3 ) (0.10)

R 2 = 0.926, mean of dep. variable= 1.72,

SER = 0.392, SSR = 21.552, n = 145.

(1.7.7)

Here, numbers in parentheses are the standard errors of the OLS coefficient estimates. Since {32 = 1/r, the estimate of the degree of returns to scale implied

66

Chapter 1

by the OLS coefficient estimates is about 1.4 (= 1/0.72). The OLS estimate of {J4 = a 2 jr has the wrong sign. As noted hy Nerlove, there are reasons to helieve that p; 2 , the rental price of capital, is poorly measured. This may explain why b4 is so imprecisely dctcnnincd (i.e., the standard error is large relative to the size of the coefficient estimate) that one cannot reject the hypothesis that {J4 = 0 with a t-ratio of -0.65 (= -0.22/0.34) 24 To test the homogeneity restriction Ho: {J3 + {!4 + {! 5 = I, we could write the hypothesis in the form R/3 = r with R = (0, 0, I, I, I) and r = I and use the formula ( 1.4.9) to calculate the F -ratio. The maintained hypothesis is the unrestricted model (1.7.4) (that is, Assumptions 1.1-1.5 where the equation in Assumption 1.1 is (1.7.4)), so the band the estimated variance of bin the F-ratio fonnula should come from the OLS estimation of (1.7.4). Alternatively, we can use the F-ratio formula (1.4.11). The unrestricted model producing SSRu is (1.7 .4) and the restricted model producing SSRR is (1.7.6), which superimposes the null hypothesis on the unrestricted model. The OLS estimate of (1.7 .6) is

(TC)

log - ' P;J

=

-4.7 (0.88)

+

+

0.72 log(Q;) (0.017)

0.59 log(p;J/ p;3) (0.20)

0.007 log(p; 2 / p; 3 ) (0.19)

R 2 = 0.932, mean ofdep. var. = -1.48, SER

= 0.39, SSR = 21.640, n = 145.

(1.78)

The F test of the homogeneity restriction proceeds as follows.

Step 1: Using (1.4.11), the F-ratio can be calculated as (21.640- 21.552)/1 0 57 21.552/(145 - 5) = ' '

Step 2: Find the critical value. The number of restrictions (equations) in the null hypothesis is I, and K (the numher of coefficients) in the unrestricted model (which is the maintained hypothesis) is 5. So the degrees of freedom arc I and 140 (= 145- 5). From the table ofF distributions, the critical value is about 3.9. Step 3: Thus, we can easily accept the homogeneity restriction, a very comforting conclusion for those who take microeconomics seriously (like us). 24 The consequence of measurement error is not just that the coefficient of the variable measured with error is poorly determined; it could also contaminate the coefficient estimates for all other regressors. The appropriate context to address this problem is the large sample themy for endogeneous regressors in Chapter 3.

67

Finite-Sample Properties of OLS

Detour: A Cautionary Note on R 2

The R 2 of 0.926 is surprisingly high for cross-section estimates, but some of the explanatory power of the regression comes from the scale effect that total costs increase with finn size. To gauge the contribution of the scale effect on the R 2 , subtract log(Q;) from hath sides of (1.7.4) to obtain an equivalent cost function:

log(~')

= {J 1

+ ({J2 -

+ fJJ

I) log(Q;)

log(pil)

+ f34

log(p;z)

+ fJs

log(p;J)

+ £;.

(I. 7 .4')

Here, the dependent variable is the average cost rather than total costs. Application of the OLS to (1.7 .4') using the same data yields

(TC)

log - ' Q,

= -3.5 -

( 1.8)

+

0.28 log(Q;) (0.017)

0.44 log(pil) - 0.22 log(p; 2 ) (0.29) (0.34)

+

0.43 log(p; 3 ) (0.10)

R 2 = 0.695, mean of dep. var. = -4.83, SER = 0.392, SSR = 21.552, n = 145.

(1.7.9)

As you no doubt have anticipated, the output coefficient is now -0.28 ( = 0.72- I) with the standard errors and the other coefficient estimates unchanged. The R 2 changes only because the dependent variable is different. It is nonsense to say that the higher R 2 makes (1.7.4) preferable to (1.7.4'), because the two equations represent the same model. The point is: when comparing equations on the basis of the fit, the equations must share the same dependent variable. Testing Constant Returns to Scale

As an application of the !-test, consider testing whether returns to scale are constant

(r = !). We take the maintained hypothesis to be the restricted model (1.7.6). Because fJ2 (the log output coefficient) equals I if and only if r = l, the null hypothesis is that H 0 : {J 2 = I. The 1-test of constant returns to scale proceeds as follows. Step 1: Calculate the 1-ratio for the hypothesis. From the estimation of the restricted model, we have b2 = 0.72 with a standard error of 0.017, so

1-ratio =

0.72- I = -16. 0.017

68

Chapter 1

Because the maintained hypothesis here is the restricted model (1.7.6), K (the number of coefficients) = 4. Step 2: Look for the critical value in the t (141) distribution. If the size of the test is 5 percent, the critical value is 1.98. Step 3: Since the absolute value of the t-ratio is far greater than the critical value, we reject the hypothesis of constant returns to scale. Importance of Plotting Residuals The regression has a problem that cannot be seen from the estimated coefficients and their standard errors. Figure 1.7 plots the residuals against log(Q,). Notice two things from the plot. First. as output increases, the residuals first tend to be positive, then negative, and again positive. This strongly suggests that the degree of returns to scale (r) is not constant as assumed in the log-linear specification. Second, the

residuals are more widely scattered for lower outputs, which is a sign of a failure of the homoskedasticity assumption that the error variance does not depend on the regressors. To deal with these problems, Nerlove divided the sample of 145 firms into five groups of 29, ordered by output, and estimated the model (1.7.6) separately for each group. This amounts to allowing all the coefficients (including fh = 1/r) and the error variance to differ across the five groups differing in size. Nerlove finds that returns to scale diminish steadily, from a high of well over 2 to a low of slightly below I, over the output range of the data. In the empirical exercise of this chapter, the reader is asked to replicate this finding and do some further analysis using dummy variables and the weighted least squares. Subsequent Developments One strand of the subsequent literature is concerned about generalizing the CobbDouglas technology while maintaining the assumption of cost minimization. An obvious alternative to Cobb-Douglas is the Constant Elasticity of Substitution (CES) production function, but it has two problems. First, the cost function implied by the CES production function is highly nonlinear (which, though, could be overcome by the use of nonlinear least squares to be covered in Chapter 7).

Second, tl1e CES technology implies a constant degree of returns to scale. One of Nerlove's main findings is that the degree varies with output. Christensen and Greene (1976) are probably the first to estimate the technology parameters allowing for variable degrees of returns to scale. Using the translog cost function introduced hy Christensen, Jorgenson, and Lau (1973), they find that the significant scale economies evident in the 1955 data were mostly exhausted by 1970, with most firms operating at much higher output levels where the AC curve is essentially flat. Their work will he examined in detail in Chapter 4.

69

Finite-Sample Properties of OLS

2

•

1.5

• •

1

•

• •

0.5

"' ·u; ::>

"0

"

•

0

•

•

••

·~

-1

•• •

·~

,-~'!.

•

~

-0.5

..... . .··· :. .. . . .., .... • .. • •• • • ••.,..... . . . .-.... .... . • • • ••• • •

• •

• • • • •

I

• •

•

••

•

•

•

•

-1.5 0

2

4

6

B

10

log output

Figure 1. 7: Plot of Residuals against Log Oulput

Another issue is whether regulated firms minimize costs. The influential paper by Averch and Johnson (1962) argues that the practice by regulators to guarantee utilities a "fair rate of return"' on their capital stock distorts the choice of input levels. Since the fair rate of return is usually higher than the interest rate, utilities have an incentive to overinvest. That is, they minimize costs, but the relevant rate of return in the definition of the user cost of capital is the fair rate of return. Consequently, unless the fair rate of return is used in the calculation of p; 2 , the true technology parameters cannot be estimated from the cost function. The fair-rateof-retunl regulation creates another econometric problem: to guarantee utilities a fair rate of return, the price of electricity must be kept relatively high in markets served by high-cost utilities. Thus output will be endogenous. A more recent issue is whether the regulator has enough information to bring about cost minimization. If the utility has more infonnation about costs, it has an incentive to misreport to the regulator the true value of the efficiency parameter. Schemes to be adopted by the regulator to take into account this incentive problem may not lead to cost minimization. Wolak's (1994) empirical results for California's water utility industry indicate that the observed level of costs and output is better modeled as the outcome of a regulator-utility interaction under asymmetric inforn1ation. Wolak resolves the prohlem of the endogeneity of output hy estimating the demand function along with the cost function. Doing so, however, requires an estimation technique more sophisticated than the OLS.

70

Chapter 1

QUESTIONS FOR REVIEW

1. (Review of duality theory) Consult your favorite microeconomic textbook to remember how to derive the Cobb-Douglas cost function from the CobbDouglas production function.

2. (Change of units) In Nerlove's data, output is measured in kilowatt hours. If output were measured in megawatt hours, how would the estimated restricted regression change? 3. (Recovering technology parameters from regression coefficients) Show that the technology parameters (/L, a 1, a 2, a 3) can be determined uniquely from the first four equations in (1.7.5) and the definition r _ a 1 + a 2 + a 3. (Do not use the fifth equation fJs = a 3 jr.) 4. (Recovering left-out coefficients from restricted OLS) Calculate the restricted OLS estimate of fJs from (1.7.8). How do you calculate the standard error of bs from the printout of the restricted OLS? Hint: Write b 5 = a + c'b for suitably chosen a and c where b here is (b 1 , ... , b4 )'. So Var(b 5 I X) = c' Var(b I X) c. The printout from the restricted OLS should include Var(b I X).

-

5. If you take p; 2 instead of p; 3 and subtract log(p; 2 ) from both sides of (1.7.4), how does the restricted regression look? Without actually estimating it on Nerlove's data, can you tell from the estimated restricted regression in the text what the restricted OLS estimate of ({3 1 , .•• , fJs) will be? Their standard errors? The SSR? What about the R 2 ? 6. Why is the R 2 of0.926 from the unrestricted model (1.7.7) lower than the R 2 of 0.932 from the restricted model (1.7.8)? 7.

A more realistic assumption about the rental price of capital may he that there

is an economy-wide capital market so Piz is the same across firms. In this case,

(a) Can we estimate the technology parameters? Hint: The answer is yes, but why? When p; 2 is constant, (1.7.4) will have the perfect multicollinearity problem. But recall that ({3 1 , •.. , {3 5 ) are not free parameters. (b) Can we test homogeneity of the cost function in factor prices? 8. Taking logs of both sides of the production function (1.7.1), one can derive the log-linear relationship: log(Q;) = ao

+ a1

log(xil)

+ az

log(x;z) +a, log(x;,)

+ <;,

71

Finite-Sample Properties of OLS

where£; here is defined as log(A;) -E[log(A; )] and ao = E[log(A;)]. Suppose, in addition to total costs, output, and factor prices, we had data on factor inputs. Can we estimate a's by applying OLS to this log-linear relationship'? Why or why not? Hint: Do input levels depend on £;? Suggest a different way to estimate a's. Hint: Look at input shares.

PROBLEM SET FOR CHAPTER 1

ANALYTICAL EXERCISES

1. (Proof that b minimizes SSR) Let b be the OLS estimator of {3. Prove that, for any hypothetical estimate, {3, of {3,

(y- x7iJ'Cy- x/i) >

(y- Xb)'(y- Xb).

In your proof, use the add-and-subtract strategy: take y- X{3, add Xb to it and then subtract the same limn it. It produces the decomposition of y- X/3: y- X{3 = (y- Xb)

+ (Xb- X{3).

Hint: (y- X{3)'(y- X{3) = [(y- Xb) + X(b- {3)]'[(y- Xb) Using the normal equations, show that this equals (y- Xb)'(y- Xb)

+ X(b- {3)].

+ (b -Pl'X'X(b- {3).

2. (The annihilator associated with the vector of ones) Let 1 he then-dimensional column vector of ones, and let M 1 _ 1, -1(1'1)- 11'. That is, M 1 is the annihilator associated with 1. Prove the following: (a) M 1 is symmetric and idempotent. (b) M 11 = 0.

(c) M 1y = y-

y · 1 where I

"

.Y=-2:>· n i=l

M 1y is the vector of deviations from the mean. (d) M 1X =X- 1x' where x = X'1/n. The k-th element of the K x I vector

-·

X IS

'""

fi

Li=l Xik·

72

Chapter 1

3. (Deviation-from-the-mean regression) Consider a regression model with a con stant. Let X he partitioned as x

(nxK)

= [ 1 : nxi

x2

nx(K-1)

J

so the first regressor is a constant. Partition {3 and b accordingly:

fh ]

{3 =

+-- scalar +-- (K- I) x

[ f3z

!"

b=

[h ]

h2 ·

Also let x2 - MIX2 andy- MIY· They are the deviations from the mean for the nonconstant regressors and the dependent variable. Prove the following: (a) The K normal equations are

where ii:2 =

X~ljn,

XSy - n · b1 · ii:2 - X~Xzbz = (b) b2 =

(X;Xz)- 1X;y.

0

((K -l)x I)

Hint: Substitute the first normal equation into the other

K - I equations to eliminate b 1 and solve for b2 . This is a generalization of the result you proved in Review Question 3 in Section 1.2.

4. (Partitioned regression, generalization of Exercise 3) Let X be partitioned as x (nxK)

= [ x1 (nxK 1)

: x2 ]. (nxKz)

Partition {3 accordingly:

Thus, the regression can be written as

Let P 1 = X 1CX;X 1)- 1X'p M 1 =I- P 1 , X2 = M 1X2 andy= M 1y. Thus, y is tl1e residual vector from the regression of yon X~o and the k-th column of X2 is the residual vector from the regression of the corresponding k-th column of

73

Finite-Sample Properties of OLS

X2 on X 1• Prove the following: (a) The normal equations are

+ X'1X2b2 = x;x,b, + x;x2b2 = X'1X 1b,

X'1y, x;y.

(b) b2 = (X;X2)-'X;y. That is, b2 can be obtained by regressing the residuals yon the matrix of residuals X2. Hint: Derive XJ{3 1 = -l' 1X2fl 2 + l' 1y from (*). Substitute this into (**) to obtain x;M,X 2fl 2 = x;M,y. Then use the fact that M 1 is symmetric and idempotent. Or, if you wish, you can apply the brute force of the partitioned inverse formula (A.1 0) of Appendix A to the coefficient matrix

Show that the second diagonal block of (X'X)- 1 is (X;X 2 )- 1 . (c) The residuals from the regression of y on X2 numerically equals e, the residuals from the regression of yon X (- (X, : X2)). Hint: If cis the residual from the regression of yon X,

Premultiplying both sides by M 1 and using M 1X 1 = 0, we obtain

Show that M 1c = c and observe that b 2 equals the OLS coefficient estimate in the regression of y on X 2. (d) b 2 = cx;X 2)-'X;y. Note the difference from (b). Here, the vector of dependent variable is y, not y. Are the residuals from the regression of y on X2 numerically the same as e? [Answer: No.] Is the SSR from the regression of yon X2 the same as the SSR from the regression of y on X2? [Answer: No.] The results in (b)-( d) are known as the Frisch- Waugh Theorem.

-

(e) Show: -'X(X'MX)-'X'y_,_Y - C' C = Y 2 2 I 2 2Y·

74

Chapter 1

Hint: Apply the general decomposition formula (1.2.15) to the regression in (c) to derive _,_Y = I'X-'XI Y >2 2 2 >2

+ e e. I

Then use (b). (f) Consider the following four regressions:

(1) regress yon X 1 • (2) regress y on

Xz.

(3) regress y on X 1 and X2 • (4) regress yon X2 . Let SSRj be the sum of squared residuals from regression j. Show: (i) SSR 1 = y'j. Hint: y is constructed so that no explanatory power.

x;y =

0, so X 1 should have

(ii) SSR 2 = e'e. Hint: Use (c). (iii) SSR 3 = e'e. Hint: Apply the Frisch-Waugh Theorem on regression (3).

M,y = y. (iv) Verify by numerical example that SSR4 is not necessarily equal to c'c. 5. (Restricted regression and F) In the restricted least squares, the sum of squared residuals is minimized subject to the constraint implied by the null hypothesis Rfl = r. Form the Lagrangian as

1 £ = 2_CY- XtJ)'(y- XtJ)

+ A (RtJr), 1

where A here is the #r-dimcnsional vector or Lagrange multipliers (recall: R is #r x K, {3 is K x I, and r is #r x I). Let tJ be the restricted least squares estimator of tJ. It is the solution to the constrained minimization problem.

-

(a) Let b be the unrestricted OLS estimator. Show:

(i =

b- (X'X)- 1R'[R(X'X)- 1R']- 1 (Rb- r),

A= [R(X'X)- 1R'r'CRb- r). Hint: The first-order conditions are X'y- (X'X)(i = R'A or X'(yR'A. Combine this with the constraint R(i = r to solve for A and (i.

X(i)

=

75

Finite-Sample Properties of OLS

(b) Let ii - y -

X/i, the residuals from the restricted regression. Show:

SSRR - SSRu

= (b =

/i)'(X'X) (b - {i)

(Rb- r)'[R(X'X)- 1R'r 1 (Rb- r)

= ).'R(X'X)- 1R'). = e''I''e,

where Pis the projection matrix. Hint: For the first equality, use the addand-subtract strategy:

SSRR = (y - X/i)' (y- X{i) = [(y- Xb)

+ X(b -/i)J'[(y- Xb) + X(b -/i)].

Use the normal equations X'(y - Xb) = 0. For the second and third equalities, use (a). To prove the fourth equality, the easiest way is to use the first-order condition mentioned in (a) that R''- = X'ii. (c) Verify that you have proved in (b) that (1.4.9) = (1.4.11). 6.

(Proof of the decomposition (1.2.17)) Take the unrestricted model to be a regression where one of the regressors is a constant, and the restricted model to be a regression where the only regressor is a constant.

(a) Show that (b) in the previous exercise is the decomposition (1.2.17) for this case. Hint: What is /i for this case? Show that SSRR = L;(Y; - y) 2 and (b -/i)'(X'X) (b - Pl = Li CY - .W (b) (R 2 as an F-ratio) For a regression where one of the regressors is a constant, prove that

7. (Hausman principle in finite samples) For the generalized regression model, prove the following. Here, it is understood that the expectations, variances, and

covariances are all conditional on X. (a) Cov(/ioLs, b -PoLs) = 0. Hint: Recall that, for any two random vectors x andy, Cov(x, y)- E[(x- E(xl)(y- E(y))'J.

76

Chapter 1

So Cov(Ax, By) =A Cov(x, y)R'. Also, since {3 is nonrandom, _....._

_.....

_....._

----

Cov(f3oLS, b - f3or.sl = Cov(f3 0 r.s - {3, b - f3or.sl. (b) Let {3 be any unbiased estimator and define q _ {3 - Por.s· Assume {3 is such that Vq

= Var(q)

is nonsingular. Prove: Cov(Por.s• q) = 0. (If we

P

set = b, we are back to (a).) Hint: Define: Show: Var({i) = VarCPor.sl

/3- Por.s + Hq for some H.

+ CH' + HC' + HVqH',

where C - Cov(Por.s• q). Show that, if C oft 0 then Var(/3) can be made smaller than Var(Por.sl by setting H = -CV;j 1 Argue that this is in contradiction to Proposition 1.7(c). (c) (Optional, only for those who are proficient in linear algebra) Prove: if the

K columns of X are characteristic vectors of V, then b = Por.s• where Vis then x n variance-covariance matrix of the n-dimcnsional error vector e. (So not all unbiased estimators satisfy the requirement in (b) that Var({3 f3or.sl be nonsingular.) Hint: For any 11 x 11 symmetric matrix V, there exists ann x 11 matrix H such that H'H = 1, (soH is an orthogonal matrix) and H'VH = A, where A is a diagonal matrix with the characteristic roots (which are real since V is symmetric) of V in the diagonal. The columns of H are called the characteristic vectors of V. Show that

-

H- 1 = H', H'V- 1H =A -I, H'V- 1 =A - 1H'. Without loss of generality, X can be taken to be the first K columns of H. So X= HF, where

F _[IK] 0 .

(nxK)-

EMPIRICAL EXERCISES

Read Marc Nerlove, "Returns to Scale in Electricity Supply" (except paragraphs of equations (6)-(9), the part of section 2 from p. 184 on, and Appendix A and

Finite-Sample Properties of OLS

77

C) before doing this exercise. For 145 electric utility companies in 1955, the file

NERLOVE.ASC has data on the following: Column 1: Column 2: Column 3: Column 4: Column 5:

total costs (call it TC) in millions of dollars output (Q) in billions of kilowatt hours price of labor (PL) price of fuels (PFJ price of capital (PK).

They are from the data appendix of his article. There are 145 observations, and the observations are ordered in size, observation 1 being the smallest company and observation 145 the largest. Using the data transfonnation facilities of your computer software, generate for each of the 145 firms the variables required for

estimation. To estimate (1.7.4), for example, you need to generate log(TC), a constant, log(Q), log(PL), log(PKJ, and log(PF), for each of the 145 finllS. (a) (Data question) Does Nerlove's construction of the price of capital conform to the definition of the user cost of capital? Hint: Read Nerlove's Appendix 8.4. (b) Estimate the unrestricted model (1.7.4) by OLS. Can you replicate the estimates in the text?

(c) (Restricted least squares) Estimate the restricted model (1.7.6) by OLS. To do this, you need to generate a new set of variables for each of the 145 finns. For example, the dependent variable is log(TC/PFJ, not log(TC). Can you replicate the estimates in the text? Can you replicate Nerlove's results? Nerlove's estimate of {3 2 , for example, is 0.721 with a standard error of 0.0174 (the standard error in his paper is 0.175, but it is probably a typographical error). Where in Nerlove's paper can you find this estimate? What about the other coeffi-

cients? (Warning: You will not be able to replicate Nerlove's results precisely. One reason is that he used common rather than natural logarithms; however,

this should affect only the estimated intercept term. The other reason: the data set used for his results is a corrected version of the data set published with his article.) As mentioned in the text, the plot of residuals suggests a nonlinear rela-

tionship between log(TC) and log(Q). Nerlove hypothesized that estimated returns to scale varied with the level of output. Following Ncrlovc, divide the sample of 145 firms into five subsamples or groups, each having 29 firms. (Recall that since the data arc ordered by level of output, the first 29 observations will have the smallest output levels, whereas the last 29 observations will have the largest output levels.) Consider the following three generalizations of the model (1.7.6):

78

Chapter 1

Modell: Both the coefficients (fJ's) and the error variance in (1,7,6) differ across groups, Model 2: The coefficients are different, but the error variance is the same across groups, Model 3: While each group has common coefficients for {J3 and {J4 (price elasticities) and common error variance, it has a different intercept term and a dillerent {J2 , Model 3 is what Nerlove called the hypothesis of neutral variations in returns to scale. For Model ], the coefficients and error variances specific to groups can he esti-

mated from

yUl = xUlf3Ul

+ eUl

(j = !, , , , , 5),

where yUl (29 x I) is the vector of the values of the dependent variable for group j, xUJ (29 x 4) is the matrix of the values of the four regressors for group j, 13
(d) Estimate Model I by OLS, How well can you replicate Ncrlovc's reported results? On the basis of your estimates of

fh, compute the point estimates

of returns to scale in each of the five groups, What is the general pattern of estimated scale economies as the level of output increases? What is the general pattern of the estimated error variance as output increases? Model 2 assumes for Model I that af = a2 for all j, This equivariance restriction can be incorporated by stacking vectors and matrices as follows:

y=X{3+e, where e (145xl)

ln particular, X is now a block-diagonal matrix. The equivariance restriction can be expressed as E(ee' I X) = u 2 114s, There are now 20 variables derived from

the original four regressors, The 145 dimensional vector corresponding to the second variable, for example, has log(Q 1) , , , , , log(Q 29 ) as the first 29 elements and

79

Finite-Sample Properties of OLS

zeros elsewhere. The vector corresponding to the 6th variable, which represents log output for the second group of !inns, has log(Q 30 ), ... , log(Q 58 ) for the 30th through 58th elements and zeros elsewhere, and so on. The stacking operation needed to fom1 they and X in(*) can be done easily if your computer software is matrix-hased. Otherwise, you trick your software into accomplishing the same thing by the usc of dummy variables. Define the j-th dummy variable as .. _ {I D1 , 0

iffim1 i belongs to thcj-thgroup,

(i =I, ... , 145).

otherwise,

Then the second regressor is Dli · log(Q;). The 6th variable is 0 so forth.

2;

-log(Q;), and

(e) Estimate Model 2 by OLS. Verify that the OLS coefficient estimates here arc the same as those in (d). Also verify that 5

'L,SSRj =SSR, j=l

where SSRj is the SSR from the j-th group in your estimation of Model I in (d) and SSR is the SSR from Model 2. This agreement is not by accident, i.e., not specific to the present data set. Prove that this agreement for the coefficients and the SSR holds in general, temporarily assuming just two groups without loss of generality. Hint: First show that the coefficient estimate is the same between Model 1 and Model 2. Use formulas (A.4), (A.5), and (A.9) of Appendix A. (f) (Chow test) Model 2 is more general than Model (1.7.6) because the coefficients can differ across groups. Test the null hypothesis that the coefficients are the same across groups. How many equations (restrictions) are in the null hypothesis? This test is sometimes called the Chow test for structural change. Calculate the p-value of the F -ratio. Hint: This is a linear hypothesis about the coefficients of Model 2. So take Model 2 to be the maintained hypothesis and (1.7.6) to be the restricted model. Use the formula (1.4.11) for the F -ratio. Gauss Tip: If xis the F-ratio, the Gauss command cdffc (x, dfl, df2)

gives the area to the right of F for the F distribution with dfl and df2 degrees of freedom.

80

Chapter 1

TSP Tip: The TSP command to do the same is cdf If, df l~df1, df 2~ df2)

x. An output ofTSP's OLS command, OLSQ, is @SSR, which is the

SSR for the regression. RATS Tip: The RATS command is cdf ftest x dfl df2. An output

of RATS's OLS command, LINREG, is %RSS, which is the SSR for the regression.

The restriction in Model 3 that the price elasticities are the same across firm groups can be imposed on Model2 by applying the dummy variable transfonnation only to the constant and log output. Thus, there are 12 (= 2 x 5 + 2) variables in X. Now X looks like

X= 0

0

0

0

log(PLz9/Pf"z9)

0

0

l

log(Qm)

log(PLmfPFm)

log(PKm/PFm)

0

0

l

log(Q 145 )

log(PL14s/PF145)

log(PK145jPF145)

(**)

(g) Estimate Model 3. The model is a special case of Model 2, with the hypothesis that the two price elasticities are the same across the five groups. Test the

hypothesis at a significance level of 5 percent, assuming normality. (Note: Ncrlovc's F-ratio on p. 183 is wrong.) As has become clear from the plot of residuals in Figure 1.7, the conditional second moment E(ef I X) is likely to depend on log output, which is a violation of the conditional homoskedasticity assumption. This time we do not attempt to

test conditional homoskedasticity, because to do so requires large sample theory and is postponed until the next chapter. lnstcad, we pretend to know the form of the function linking the conditional second moment to log output. The function, specified below, implies that the conditional second moment varies continuously with output, contrary to the three models we have considered above. Also contrary

to those models, we assume that the degree of returns to scale varies continuously

81

Finite-Sample Properties of OLS

with output by inclt1ding the square of log output. 25 Model 4 is

Model4:

(TC·)

log - ' Pi3

=

flt

+ fJ2

log(Q;)

+ {33 [log(Q;)f

(Pd) + fls (P;2) + e;

+ f34 log -

log -

Pi3

E(ef I X)= a 2

· (

0.0565 +

2

Pi3

77

·~; )

(i =I, 2, ... , 145)

for some unknown a 2 . (h) Estimate Model4 hy weighted least squares on the whole sample of 145 firms. (Be careful about the treatment of the intercept; in the equation after weighting, none of the regressors is a constant.) Plot the residuals. Is there still evidence for conditional homoskedasticity or further nonlinearities? MONTE CARLO EXERCISES

Monte Carlo analysis simulates a large numher of samples from the model to study the finite-sample distribution of estimators. In this exercise, we use the technique to confinn the two finite-sample results of the text: the unbiascdncss of the OLS coefficient estimator and the distrihution of the t-ratio. The model is the following simple regression model satisfying Assumptions 1.1-1.5 with n = 32. The regression equation is

Y; = flt + fJzX; + e;

(i =I, 2, ... , n)

or y = 1 · flt + x · fJ2 + e = X{J + e, where X= (1: x) and fJ = ({J 1, {3 2)'. The model parameters are ({J 1, {3 2, a 2). As mentioned in the text, a model is a set of joint distributions of (y, X). We pick a particular joint distribution by specifying the regression model as follows. Set flt = I, fJ2 = 0.5, and a 2 = I. The distribution of x = (x 1, x2, ... , x,)' is specified by the following AR(l) process: X;

=c+¢x;_ 1 +17;

(i = 1,2, ... ,n),

25 We have derived the log-linear cost function from the Cobb-Douglas production function. Does there exist a production function from which this generalized cost function with a quadratic term in log output can be derived? This is a question of the ''integrability" of cost functions and is discussed in detail in Christensen et al. ( 1973).

82

Chapter 1

where {>Jd is i.i.d. N(O. 1) and ~N

X 0

1 ) c ( 1- ¢" 1-¢2 . c

=

2. ¢ = 0.6.

This fixes the joint distribution of (y. X). From this distribution, a large number of samples will be drawn. 1n programming the simulation, the following expression for x will be useful. Solve the first-order difference equation(**) to obtain X;

=¢'xo+(1+¢+¢ 2 +···+¢'- 1)c + (~; + ¢>Ji-l + ¢ 2>Ji-2 + ... + ¢'- 11JJ),

or, in matrix notation, x=r·xo+d+A

(nxl)

where d = (d1 , d2 ,

•.. ,

(nxl)

(nxl)

q,

(nxn)(nxl)

d,)' and

2 dt = c, d2 = (1 +¢)c, ... , d; = (1 + ¢ + ¢ + · · · + ¢'-')c, ... ,

¢

0 1

....... 0

0 0

¢2

¢

1

0

1

¢ ¢2 , A=

r= ¢"

cpn-1 ¢n-2

1)J , q=

~2

1]n

¢

1

Gauss Tip: To form the rmatrix, use seqm. To form the A matrix, 11se toepli tz and lowmat.

(a) Rnn two Monte Carlo simulations. The first simulation calculates E(b I x) and the distribution of the !-ratio as a distribution conditional on x. A computer program for the first simulation should consist of the following steps. (1) (Generate x just once) Using the random number generator, draw a vector q of n i.i.d. random variables from N(O, 1) and x0 from N(c/(1 - ¢), 1/ (1- ¢ 2)), and calculate x by(***). (Calculation of x can also be accomplished recursively by(**) with a do loop, but vector operations such as (***)consume less CPU time than do loops. This becomes a consideration

in the second simulation, where x has to be generated in each replication.)

83

Finite-Sample Properties of OLS

(2) Set a counter to zero. The counter will record the incidence that It I > t 0 .025 (n- 2). Also, set a two-dimensional vector at zero; this vector will he used for calculating the mean of the OLS estimator b of ({3 1 , {3 2 )'. (3) Start a do loop of a large number of replications (I million, say). In each replication, do the following. (i) (Generate y) Draw an n dimensional vector e of n i.i.d. random variables from N(O, 1), and calculate y = (y 1 , ••. , Ynl' by(*). This y is paired with the same x from step (I) to form a sample (y, x). (ii) From the sample, calculate the OLS estimator b and the t-value for Ho: f3z = 0.5. (iii) Increase the counter by one if It I >

t 0 .025 (n -

2). Also, add b to the

two-dimensional vector.

(4) After the do loop, divide the counter by the number of replications to calculate the frequency of rejecting the null. Also, divide the two-dimensional vector that has accumulated b hy the numher of replications. It should equal E(b I x) if the number of replications is infinite. Note that in this first simulation, xis fixed throughout the do loop for y. The second simulation calculates the unconditional distribution of the t-ratio. It should consist of the following steps. (1) Set the counter to zero. (2) Start a do loop of a large number of replications. In each replication, do the following. (i) (Generate x) Draw a vector q of n i.i.d. random variables from N (0, I) and x 0 from N(c/(1- ¢), 1/(1- ¢ 2 )), and calculate x by(***). (ii) (Generate y) Draw a vector e of n i.i.d. random variables from N (0, 1), and calculate y = (YI, ... , Yn)' by ( *). (iii) From a sample (y, x) thus generated, calculate the t-valuc for Ho: f3 = 0.5 from the sample (y, x). (iv) Increase the counter by one if It I > to.ozs (n - 2). (3) After the do loop, divide the counter by the number of replications. For the two simulations, verify that, for a sufficiently large number of replications,

84

Chapter 1

1. the mean of b from the first simulation is arbitrarily close to the true value

(1. 0.5);

2. the frequency of rejecting the tn1e hypothesis Ho (the type I error) is arbitrarily close to 5 percent in either simulation.

(b) In those two simulations, is the (noneonstant) regressor strictly exogenous? Is the error conditionally homoskedastic?

ANSWERS TO SELECTED QUESTIONS _ _ _ _ _ _ _ _ __

ANALYTICAL EXERCISES

1.

(y -

x{i)' (y- x{i)

= [(y- Xb)

+ X(b- p)]'[(y- Xb) + X(b- Pll

(by the add-and-subtract strategy) = [(y- Xb)' + (b- Pl'X'][(y- Xb)

+ X(b- Pll

+ (b- Pl'X'(y- Xb) + (y- Xb)'X(b- Pl + (b- Pl'X'X(b- Pl (y- Xb)'(y- Xb) + 2(b- Pl'X'(y- Xb) + (b- Pl'X'X(b- .Bl

= (y- Xb)'(y- Xb) =

(since (b- Pl'X'(y- Xb) = (y- Xb)'X(b- Pll = (y- Xb)'(y- Xb)

+ (b- P)'X'X(b- Pl

(since X'(y- Xb) = 0 hy the normal equations) > (y- Xb)'(y- Xb) n

(since (b- Pl'X'X(b- Pl = z'z =

L zf > 0 where z- X(b- ,8)). i=l

7a. PoLS - {3 = Ae where A = cx•v- 1x)- 1X'v-• and b -PoLS = Be where B _ (X'X)- 1X'- (X'V- 1X)- 1X'V- 1 So Cov(PoLs- {3, b -PoLs) = Cov(Ae, Be) =A Var(e)B' = rr 2AVB'. It is straightforward to show that AVB' = 0.

85

Finite-Sample Properties of OLS

7b. For the choice of H indicated in the hint, Var({3) - Var(f3m.sl =

If C z,

-cv; I C'.

oft 0, then there exists a nonzero vector z such that C'z = v oft 0. For such

z'[Var(B)- VarCPaLs)]z = -v'V; 1v < 0

(since Vq is positive definite),

-

which is a contradiction because f3oLs is efficient. EMPIRICAL EXERCISES

(a) Ncrlovc's description in Appendix B.4 leads one to believe that he did not include the depreciation rate 8 in his constmction of the price of capital. (b) Your estimates should agree with (1.7.7). (c) Our estimates differ from Nerlove's slightly. This would happen even if the data used by Nerlove were the same as those provided to you, because computers in his age were much less precise and had more frequent rounding errors.

(d) How well can you replicate Nerlove's reported results? Fairly well. The point estimates of returns to scale in each of the five subsamplcs are 2.5, 1.5, 1.1, 1.1, and .96. As the level of output increases, the returns to scale decline. (e) Model 2 can be written as y = X{3 (setting j = 2),

+ e, where y, X, and e arc as in(*).

which means (X'X)- 1 = [(

X(I)'X(l))-1

0

And

, Xy

=

[X(IYy(I)] XC2YyC2) ·

So

86

Chapter 1

Therefore,

Thus, the OLS estimate of fhc coefficient vector for Model 2 is the same as fhat for Model I. Since fhe estimate of fhe coefficient vector is fhe same, fhe sum of squared residuals, too, is the same. (f) The number of restrictions is 16. K = #coefficients in Model 2 = 20. So fhe

two degrees of Jfeedom should be (16, 125). SSRu = 12.262 and SSRR = 21.640. F -ratio = 5.97 with a p-value of 0.0000. So this can be rejected at any reasonable significance level. (g) SSRu = 12.262 and SSRR = 12.577. So F = .40 with 8 and 125 degrees of freedom Its p-value is 0.92. So fhe restrictions can be accepted at any reasonable significance level. Nerlove's F-ratio (seep. 183, 8fh line from bottom) is 1.576. (h) The plot still shows that the conditional second moment is somewhat larger for smaller firms, but now there is no evidence for possible nonlinearities.

References Amcmiya, T., 1985, Advanced Econometrics, Cambridge: Harvard University Press. Avcrch, H., and L. Johnson, 1962, "Behavior of the Firm under Regulatory Constraint," American Economic Review, 52, 1052-1069. Christensen, L., and W. Greene, 1976, "Economics of Scale in US Electric Power Generation," Journal of Political Economy, 84, 655-676. Christensen, L., D. Jorgenson, and L. Lau, 1973, "Transcendental Logarithmic Production Frontiers," Review of Economics and Statistics, 55, 28-45. Davidson, R .. and J. MacKinnon, 1993, Estimatio11 and Inference in Econometrics, Oxford: Oxford University Press. DeLong, B., and L. Summers, 1991, "Equipment [nvestment and Growth," Quarterly Journal of Economics, 99, 28-45. Engle, R., D. Hendry, and J.-F. Richards, 1983, "Exogeneity," Econometrica, 51,277-304. Federal Power Commission, 1956, Statistics of Electric Utilities in the United States, 1955, Class A and B Privately Owned Companies, Washington, D.C. Jorgenson, D., 1963, "Capital Theory and Investment Behavior," American Economic Review, 53, 247-259. Koopmans, T., and W. Hood, 1953, "The Estimation of Simultaneous Linear Economic Relationships," in W. Hood, and T. Koopmans (eds.), Studies in Ec01wmetric Method, New Haven: Yale University Press.

Finite-Sample Properties of OLS

87

Krasker, W., E. Kuh, and R. Welsch, 1983, "Estimation for Dirty Data and Flawed Models," Chapter 11 in Z. Griliches, and M. lntriligator (eds.), Handbook of Econometrics, Volume 1, Amsterdam: North-Holland. Nerlove, M., 1963, "Returns to Scale in Electricity Supply," in C. Christ (ed.), Measurement in Economics: Studies in i'l1athematical Economics and Econometrics in Memory of Yehuda Grunfeld, Stanford: Stanford University Press. Rao, C. R., 1973, Linear Satistical Inference and Its Applications (2d ed.), New York: Wiley. Scheffe, H., 1959, The Analysis of Variance, New York: Wiley. Wolak, F., 1994, "An Econometric Analysis of the Asymmetric Information, Regulator-Utility Interaction," Annales D'Economie et de Statistique, 34, 13--69.

CHAPTER 2

Large-Sample Theory

ABSTRACT

In the previous chapter, we derived the exact- or finite-sample distribution of the OLS estimator and its associated test statistics. However, not very often in economics are the assumptions of the exact distribution satisfied. The finite-sample theory breaks down if one of the folJowing three assumptions is violated: (1) the exogeneity of

regressors, (2) the normality of the error term, and (3) the linearity of the regression equation. This chapter develops an alternative approach, retaining only the third

assumption. The approach, called asymptotic or large-sample theory, derives an approximation to the distribution of the estimator and its associated statistics assuming that the sample size is sufficiently large. Rather than making assumptions on the sample of a given size, large-sample theory makes assumptions on the stochastic process that generates the sample. The first two sections of this chapter provide the necessary language for describing stochastic processes. The concepts introduced in this chapter are essential for rational expectations econometrics, as illustrated in Fama's classic paper on the Fisher Hypothesis that the real interest rate is constant. The hypothesis has a very strong policy implication: monetary and fiscal policies cannot influence aggregate demand through the real interest rate. Very surprisingly, one cannot reject the hypothesis for the United States

(at least if the sample period ends in the early 1970s).

2.1

Review of Limit Theorems for Sequences of Random Variables

The material of this section concerns the limiting behavior of a sequence of random variables, (z 1, z2 , ... ). Since the material may already be familiar to you, we present it rather formally, in a series of definitions and theorems. An authoritative source is Rao (1973, Chapter 2c) which gives proofs of all the theorems included in this section. In this section and the rest of this book, a sequence (z 1 , z 2 , ... ) will be denoted by {z,}.

89

Large-Sample Theory

Various Modes ol Convergence

Convergence in Probability

A sequence of random scalars {z.) converges in probability to a constant (nonrandom) a if, for any e > 0,

lim Prob(lz.- al >e)= 0.

(2.1.1)

n~oo

The constant a is called the probabili[y limit of Zn and is written as "plimn~oo Zn = a" or ••zn -)>P a". Evidently, "z,

-)>

p

a"

is the Sa.IIle as

"zn - a

-)>

p

0."

This definition of convergence in prohahility is extended to a sequence of random vectors or random matrices (by viewing a matrix as a vector whose elements have been rearranged) by requiring element-hy-clcmcnt convergence in probability. That is, a sequence of K -dimensional random vectors {z.) converges in probability to a K -dimensional vector of constants ct if, for any e > 0, lim Prob(IZnk- akl >e)= 0 for all k (=I. 2, ... , K),

(2.1.2)

n~oo

where Znk is the k-th clement of z. and CLk the k-th element of ct. Almost Sure Convergence

A sequence of random scalars {z.) converges almosl surely to a constant a if

Prob( lim Zn = a) = I.

(2.1.3)

·~00

We write this as "z, -)oa.s. a.'' The extension to random vectors is analogous to that for convergence in probability. As will be mentioned below, this concept of convergence is stronger than convergence in prohability; that is, if a sequence

converges almost surely, then it converges in probability. The concept involved in (2.1.3) is harder to grasp because the probability is about an event concerning an infinite sequence (z., z2 , ••• ). For our purposes, however, all that matters is that almost sure convergence is stronger than convergence in probability. If we can show that a sequence converges almost surely, that is one way ID prove the sequence converges in probability.

90

Chapter 2

Convergence in Mean Square A sequence of random scalars {Zn I converges in mean square (or in quadratic mean) to a (written as "zn -->m.•. a") if lim E[(Zn - a) 2] = 0.

(2.1.4)

n~oo

The extension to random vectors is analogous to that for convergence in probability: Zn -->m.,. a if each element of z. converges in mean square to the corresponding component of a:. Convergence to a Random Variable In these definitions of convergence, the limit is a constant (i.e., a real number). The limit can be a random variable. We say that a sequence of K -dimensional random variables {z.l converges to a K-dimensional random variable z and write z. -->p z if {z. - zl converges to 0: "zn

~ p

z"

is the same as

"zn -

Z ~ p

0."

(2.1.5a)

"zn

~

z"

is the same as

"zn -

z ~ 0,"

(2.1.5b)

z"

is the same as

"zn -

z

(2.1.5c)

Similarly,

"zn

a..

~

m.s.

a.&

~ m.s.

0."

Convergence in Distribution Let {z.l be a sequence of random scalars and F. be the cumulative distribution function (c.d.f.) of Zn. We say that {z, I converges in distribution to a random scalar z if the c.d.f. F. of Zn converges to the c.d.f. F of z at every continuity point of F. 1 We write "zn -->d z" or "zn -->L z" and call F the asymptotic (or limit or limiting) distribution of z•. Sometimes we write "zn -->d F," when the distribution F is well-known. For example, "zn -->d N(O, I)" should read

"zn

-->d

z and the distribution of z is N(O, 1) (normal distribution with mean 0 and

variance 1)." It can be shown from the definition that convergence in probability is stronger than convergence in distribution, that is, "zn ____,. z" =? "zn ----;,.. z." p

(2.1.6)

d

1Do not be concerned about the quaJifier "at every continuity point of F." For the most part, except possibly

for the chapten; on the discrete or limited dependent variable, the relevant distribution is continuous, and the distribution function is continuous at all points.

91

Large-Sample Theory

A special case of convergence in distribution is that z is a constant (a trivial random variable). The extension to a sequence of random vectors is immediate: z. --+d z if the joint c.d.f. F. of the random vector Zn converges to the joint c.d.f. F of z at every continuity point of F. Note, however, that, unlike the other concepts of convergence, for convergence in distribution, element -by-element convergence does not necessarily mean convergence for the vector sequence. That is, "each element of z. --+ d corresponding element of z" does not necessarily imply "z. --+ d z." A common way to establish the connection between scalar convergence and vector convergence in distribution is

Multivariate Convergence in Distribution Theorem: (stated in Rao, 1973, p. 128) Let {z.} be a sequence of K -dimensional random vectors. Then: "z.--+ z" d

*

"'-'z.--+ '-'z for any K -dimensional vector of real numbers." d

Convergence in Distribution vs. Convergence in Moments It is worth emphasizing that the moments of the limit distribution of z. are not necessarily equal to the limits of the moments of z•. For example, "zn --+ d z" does not necessarily imply "lim.~ 00 E(z.) = E(z)." However,

Lemma 2.1 (convergence in distribution and in moments): Let a,. be the s-th moment of z, and lim,--"' 00 asn =as where as is finite (i.e .• a real number). Then: "z, ......:,. z"::::} ''as is the s -th moment of z." d

Thus, for example, if the variance of a sequence of random variables converging in distribution converges to some finite number, then that number is the variance of the limiting distribution. Relation among Modes of Convergence Some modes of convergence are weaker than others. The following theorem establishes the relationship between the four modes of convergence.

Lemma 2.2 (relationship among the four modes of convergence): 1a)

~·

"zn ......:,. m.s. a'•=? "zn ......:,. p a.·· So "zn ......:,. m.s. z•'=? "zn ......:,. p z.••

92

Chapter 2

(c) "zn --->p a"* "zn --->d a." (That is, if the limiting random van'able is a constant [a trivial random van'able], convergence in distribution is the same as convergence in probability.) Three Useful Results

Having defined the modes of convergence, we can state three results essential for developing large-sample theory. Lemma 2.3 (preservation of convergence for continuous transformation): Sup pose aO is a vector-valued continuous function that does not depend on n.

(a) "zn

a"=} "a(zn)

--->p

a(a)."Stateddifferently,

--->p

plim a(zn) = a(plim Zn) n~oo

n~oo

provided the plim exists.

An immediate implication of Lemma 2.3(a) is that the usual arithmetic operations preserve convergence in probability. For example:

"Xn "Xn

--+ p

--+ p

"xn---> p

fJ , Yn fJ , Yn

--+ p --+ p

Y"

--->.. ----r ,,Xn

Y"

--->.. " ---r

fJ, Yn---> y" p

"Y.---> p

=}

r" =}

+ Yn

XnYn

--+ p

fJ

+ Y,

fJ Y ·,

--+ p

"xn/Yn---> fJ/y,"provided that y oft 0. p

1 "Y;;- ---> p

r- 1," provided that r

is invertible.

The next result about combinations of convergence in probability and in distribution will be used repeatedly to derive the asymptotic distribution of the estimator. Lemma2.4:

(b'J,\ "X n

' X • Yn ----;rp ' ---;rd

0"---"---r "y'n Xn

~ -----;rp

0" '

(c) "x. --->d x, A. --->p A"=} "A.x. --->d Ax," provided that A. and x. are confonnable. In particular; if x ~ N(O, 1:), then A.x. --->d N(O, A:EA').

93

Large-Sample Theory 1 1 'd'"x VI n ___,. d x ' A___,. n p A"~ ____, "x'An n x n ___,. d x'A- x"providedthatA ' n and are confonnable and A is nonsingular.

Xn

Parts (a) and (c) are sometimes called Slutzky's Theorem. By setting a= 0, part (a) implies: "xn ~ X, Yn ~ d p

0"

+ Yn

~ "Xn

~ d

(2.1.7)

x."

That is, ifz. = x. +Yn and Yn -+p 0 (i.e., ifz. -x. -+p 0), then the asymptotic distribution ofz. is the same as that ofx •. When z.- Xn -+p 0, we sometimes (but not always) say that the two sequences are asymptotically equivalent and write it as

,, Zn

'"""' Xn

a

"Or"Zn =

Xn

+" Op,

where op is some suitable random variable (y. here) that converges to zero in probability. A standard trick in deriving the asymptotic distribution of a sequence of random variables is to find an asymptotically equivalent sequence whose asymptotic distribution is easier to derive. In particular, by replacing Yn by Yn -a in part (b) of the lemma, we obtain

''xn

~

d

X, Yn

~

p

a"

~

''y'nXn

,.......

a

a ' Xn " or ''y'nXn =a ' Xn

+ Op· "

(2.1.8)

The oP here is (y.- a)'x•. Therefore, replacing y. by its probability limit does not change the asymptotic distribution of Y,.x., provided Xn converges in distribution to some random variable. The third result will allow us to test nonlinear hypotheses given the asymptotic distribution of the estimator. Lemma 2.5 (the "delta method"): Suppose {x.} is a sequence of K -dimensional random vectors such that x. -+p p and -/ii(x. -

p) ___,.

z,

d

and suppose a(·) : JJr.K ___,. llr.' has continuous first derivatives with A(p) denoting the r x K matrix of first derivatives evaluated at P:

94

Chapter 2

Then v"il[a(x.)- a(fJ)] --> A(fJ)z. d

In particular: "v"ii"(x.-

fJ)--> N(O, :E)"=? "v"ii"[a(x.)- a(fJ)]--> N(O, A(fJ)l:A(fJ)')." d

d

Proving this is a good way to learn how to use the results covered so far. PROOF. By the mean-value theorem from calculus (see Section 7.3 for a statement

of the theorem), there exists a K -dimensional vector Yn between x. and a(x.)- a(fJ) · A(y.)(x.(rxK)

Multiplying both sides by

P such that

fJ).

(Kxl)

-Jii, we obtain

v"il[a(x.)- a(fJ)] = A(y.)v"ii"(x.-

fJ).

Since Yn is between x. and P and since x. -->p p, we know that Yn -->p fJ. Moreover, the first derivative AO is continuous by assumption. So by Lemma 2.3(a), A(y.) --> A(fJ). p

By Lemma 2.4(c), this and the hypothesis that ,Jri(xn A(y.)v"ii"(x.- /J)

fJ) -->d z imply that

(= v"il[a(x.)- a(fJ)l)--> A(fJ)z. d

•

Viewing Estimators as Sequences of Random Variables

o.'

Let be an estimator of a parameter vector /J based on a sample of size n. The ' sequence {/J.] is an example of a sequence of random variables, so the concepts ' introduced in this section for sequences of random variables are applicable to {/J.]. We say that an estimator On is consistent for /J if plimO. = /J n~oo

or

'

o.--> /J. p

95

Large-sample Theory

The asymptotic bias of 8. is defined as plim.- 00 8. - 8.1 So if the estimator is consistent, its asymptotic bias is zero. A consistent estimator 8n is asymptotically normal if

.fii(iJ. -IJ) --->d

N(O, 1:).

Such an estimator is called Jii"-consistent. The acronym sometimes used for "consistent and asymptotically normal" is CAN. The variance matrix l: is called the asymptotic variance and is denoted Avar(8.). Some authors use the notation Avar(8.) to mean l:jn (which is zero in the limit). In this book, Avar(8.) is the variance of the limiting distribution of .fii(8. -IJ). Laws ot Large Numbers and Central Limit Theorems For a sequence of random scalars {z;}."the sample mean Zn is defined as ]

n

z.=-L::z;. n i=l

Consider the sequence {z.}. Laws of large numbers (LLNs) concern conditions under which {z.l converges either in probability or almost surely. An LLN is called strong if the convergence is almost surely and weak if the convergence is in probability. We can derive the following weak LLN easily from Part (a) of Lemma 2.2. A Version of Chebychev's Weak LLN: "lim E(z.) n-+oo

= 11-. n-+oo lim Var(z.) = 0"=} "z.---> p

11-·"

This holds because, under the condition specified, it is easy to prove (see an analytical question) that Zn --->m.s. 11-· The following strong LLN assumes that {z;} is i.i.d. (independently and identically distributed), but the variance does not need to be finite. Kolmogorov's Second Strong Law of Large Numbers: 3

E(z;) = 11-. Then

Zn

Let {z;} be i.i.d. with

->as. 11-·

2Some authors use the term "asymptotic bias" differently. Amemiya (1985), for example, defines it to mean limn-HXJ E(/Jn)-

8.

3so the mean exists and is finite (a real number). When a moment (e.g., the mean) is indicated, as here, then by implication the moment is assumed to exist and is finite.

96

Chapter 2

These LLNs extend readily to random vectors by requiring element-by-element convergence. Central Limit Theorems (CLTs) are about the limiting behavior of the difference between Zn and E(znl (which equals E(z;) if {z;) is i.i.d.) blown up by ~. The only Central Limit Theorem we need for the case of i.i.d. sequences is: Lindeberg-Levy CLT: Let {z;} be i.i.d. with E(z;) = JL and Var(z;) = l:. Then ]

n

~('in- JL) = ~ ~)z;- JL)--> N(O, l:). n

d

i=l

This reads: a sequence of random vectors I~ ('in - JL)} converges in distribution to a random vector whose distribution is N(O, l:). (Usually, the Lindeberg-Levy CLT is for a sequence of scalar random variables. The vector version displayed above is derived from the scalar version as follows. Let {z;} be i.i.d. with E(z;) = JL and Var(z;) = l:, and let l be any vector of real numbers of the same dimension. Then {l'znl is a sequence of scalar random variables with E(l'znl = l'JL and Var(l'zn) = l'l:l. The scalar version ofLinderberg-Levy then implies that ~(l'zn- l'JL) = l'~(Zn- JL)--> N(O, l'l:l). d

But this limit distribution is the distribution of l'x where x ~ N(O, l:). So by the Multivariate Convergence in Distribution Theorem stated a few pages back, ~(z" - JL)} -->d x, which is the claim of the vector version of Lindeberg-Levy.)

I

QUESTIONS FOR REVIEW

1. (Usual convergence vs. convergence in probability) A sequence of real numbers is a trivial example of a sequence of random variables. Is it true that "limn--+oo Zn = a" ~ "plimn->oo Zn =a"? Hint: Look at the definition of plim. Since limn~ooZn = a,lzn- al < E: for n sufficiently large. 2. (Alternative definition of convergence for vector sequences) Verify that the definition in the text of "zn ----*m.s. z" is equiva1ent to lim E[(z"- z)'(zn- z)] = 0. n~oo

Hint: E[(Zn- z)'(z"- z)] = E[(Znt- Zt) 2 ] + · · · + E[(ZnK- ZK) 2 ], where K is the dimension of z. Similarly, verify that the definition in the text of

97

Large-sample Theory

"zn ----*p a" is equivalent to

lim Prob((zn -a)'(zn -a)> s)= Oforanys > 0. n~oo

3. Prove Lemma 2.4(c) from Lemma 2.4(a) and (b). Hint: AnXn = (An- A)Xn

Axn. By (b), (An - A)xn 4. Suppose

.jii(On

-II)

-->p

-->d

A

lin -II =

+

0.

N(O, a 2 ). Does it follow that On I

-->p

II? Hint:

I

A

r.; · ..fii(lin -II), plim-= = 0.

vn

n---+00

..Jn

5. (Combine Delta method with Lindeberg-Levy) Let (z;} be a sequence of i.i.d. (independently and identically distributed) random variables with E(z;) = fL oft 0 and Var(z;) = a 2 , and let Zn be the-sample mean. Show that ..fii (

:J-- _I_)--+ N(o. fL Zn

d

a2 ). /L4

Hint: In Lemma 2.5, set f3 = JL, a(/3) = l / JL, Xn = Zn·

2.2

Fundamental Concepts in Time-Series Analysis

In this section, we define the very basic concepts in time-series analysis that will form an integral part of our language. The key concept is a stochastic process, which is just a fancy name for a sequence of random variables. If the index for the random variables is interpreted as representing time, the stochastic process is called a time series. If (z;} (i = l, 2, ... ) is a stochastic process, its realization or a sample path is an assignment to each i of a possible value of z;. So a realization of (z;} is a sequence of real numbers. We will frequently use the term time series to mean both the realization and the process of which it is a realization. Need for Ergodic Stationarity The fundamental problem in time-series analysis is that we can observe the realiza. tion of the process only once. For example, the sample on the U.S. annual inflation rate for the period from 1946 to !995 is a string of 50 particular numbers, which is just one possible outcome of the underlying stochastic process for the inflation rate; if history took a different course, we could have obtained a different sample.

98

Chapter 2

If we could observe the history many times over, we could assemble many samples, each containing a possibly different string of 50 numbers. The mean inflation rate for, say, 1995 can then be estimated by taking the average of the 1995 inflation rate (the 50th element of the string) across those samples. The population mean thus estimated is sometimes called the ensemble mean. In the language of general equilibrium theory in economics, the ensemble mean is the average across all the possible different states of nature at any given calendar time. Of course, it is not feasible to observe many different alternative histories. But if the distribution of the inflation rate remains unchanged (this property will be referred to as stationarity), the particular string of 50 numbers we do observe can be viewed as 50 different values from the same distribution. Furthermore, if the process is not too persistent (what's called ergodicity has this property), each element of the string will contain some information not available from the other elements, and, as shown below, the time average over the elements of the single string will be consistent for the ensemble mean. Various Classes of Stochastic Processes4

Stationary Processes A stochastic process {z1} (i = I, 2, ... ) is (strictly) stationary if, for any given finite integer rand for any set of subscripts, i~o i2, ... , i" the joint distribution of (z;, z;,, Z;2 , ... , z;J depends only on i 1 - i, iz- i, i3- i, ... , ir- i but not on i. For example, the joint distribution of (z 1 , z5 ) is the same as that of (z 12 , z 16 ). What matters for the distribution is the relative position in the sequence. In particular, the distribution of z1 does not depend on the absolute position, i, of z1, so the mean, variance, and other higher moments, if they exist, remain the same across i. The definition also implies that any transformation (function) of a stationary process is itself stationary, that is, if {zd is stationary, then {f(z1)} is. 5 For example, {z1zj} is stationary if {z1} is. Example 2.1 (i.i.d. sequences): A sequence of independent and identically distributed random variables is a stationary process that exhibits no serial dependence.

4Many of the concepts collected in this subsection can aJso be found in Section 4. 7 of Davidson and MacKinnon (1993). 5 The function f (-) needs to be "measurable" so that f (Zj) is a well-defined random variable, Any continuous function is measurable. In what follows, we \VOn 't bother to add the qualifier "measurable" when a function f of a random variable is understood to be a random variable.

Large~Sample

Theory

99

Example 2.2 (constant series): Draw z1 from some distribution and then set z1 = z 1 (i = 2, 3, ... ). So the value of the process is frozen at the initial date. The process {z1I thus created is a stationary process that exhibits maximum serial dependence. Evidently, if a vector process {z1I is stationary, then each element of the vector forms a univariate stationary process. The converse, however, is not true. Example 2.3 (element-wise vs. joint stationarity): Let {e,} (i = 1, 2, ... ) be a scalar i.i.d_ process. Create a two-dimensional process {z1I from it by defining Zil = e 1 and z12 = e 1 The scalar process {zil I is stationary (this is the process of Example 2.1). The scalar process {z 12 1, too, is stationary (the Example 2.2 process). The vector process {z,}, however, is not (jointly) stationary, because the (joint) distriliution of z 1 (= (e 1, e 1)')differs from that ofz2 (= (e2,eJ)'). Most aggregate time series such as GOP are not stationary because they exhibit time trends. A less obvious example of nonstationarity is international exchange rates, which are alleged to have increasing variance. But many time series with trend can be reduced to stationary processes. A process is called trend stationary if it is stationary after subtracting from it a (usually linear) function of time (which is the index i). If a process is not stationary but its first difference, z1 - z 1_ 1 , is stationary, {zd is called difference stationary. Trend-stationary processes and difference-stationary processes will be studied in Chapter 9. Covariance Stationary Processes A stochastic process {zd is weakly (or covariance) stationary if: (i) E(z1) does not depend on i, and (ii) Cov(z 1, z1_j) exists, is finite, and depends only on j but not on i (for example, Cov(z1, z5 ) equals Cov(z12, z 16)). The relative, not absolute, position in the sequence matters for the mean and covariance of a covariance-stationary process. Evidently, if a sequence is (strictly) stationary and if the variance and co variances are finite, then the sequence is weakly stationary (hence the term "strict"). An example of a covariance-stationary but not strictly stationary process will be given in Example 2.4 below.

100

Chapter 2

The j -th order autocovariance, denoted rj

= Cov(z;, Z;-j)

r j, is defined as

(j = 0, I, 2, ... ).

The term "auto" comes about because the two random variables are taken from the same process. r j does not depend on i because of covariance-stationarity. Also by covariance stationarity, r j satisfies

(2.2.1) (Showing this is a review question below.) The 0-th order autocovariance is the variance ro = Var(z;). The processes in Examples 2.1 and 2.2 are covariance stationary if the variance exists and is finite. For the process of Example 2.1, r 0 is the variance of the distribution and rj = 0 for j > I. For the process of Example 2.2. rj = ro. For a scalar covariance stationary process (z 1}, the j-th order autocovariance is now a scalar. If Yj is this autocovariance, it satisfies

(2.2.2)

Yj = Y-j·

Take a string of n successive values, (z;, Zi+l, ... , Zi+n-J), from a scalar process. By covariance stationarity, its n x n variance-covariance matrix is the same as that of (z 1 , z 2 , ... Zn) and is a band spectrum matrix: Yo

Yt

Y2

Yn-1

Yt

Yo

YI

Yn-2

Var(z;, Zi+I, ... Zi+n-I) = Yn-2

YI

Yo

YI

Yn-1

Y2

YI

Yo

This is called the autocovariance matrix of the process. The j-th order autocorrelation coefficient, pj. is defined as j -th order autocorrelation coefficient Yj

Cov(z;, Zi-j)

Yo

Var(z;)

Pj=-=

=

(j=l,2, ... ).

(2.2.3)

For j = 0, Pj = I. The plot of (pj} against j = 0, I, 2, ... is called the correlogram.

Large~Sample

101

Theory

White Noise Processes A very important class of weakly stationary processes is a white noise process, a process with zero mean and no serial correlation: a covariance-stationary process (z,) is white noise if E(z;) = 0 and Cov(z;,

z,_1 ) =

0 for j

i' 0.

Clearly, an independently and identically distributed (i.i.d.) sequence with mean zero and finite variance is a special case of a white noise process. For this reason, it is called an independent white noise process. Example 2.4 (a white noise process that is not strictly stationary 6 ): Let w be a random variable uniformly distributed in the interval (0, 2rr), and define

z;

= cos(i w)

(i

= I, 2, ... ) .

It can be shown thatE(z;) = 0, Var(z;) = If2, and Cov(z;, z1) = Ofori ,;, j. So (z;} is white noise. However, clearly, it is not an independent white noise process. It is not even strictly stationary.

Ergodicity A stationary process ( z;) is said to be ergodic if, for any two bounded functions f : lit -+ JR: and g : JR:f -+ JR:,

lim jE[/(Z;, ... , Z;+k)g(Z;+n• ... , Zi+n+ellj

n~oo

= jE[f(z;, ... , Z;H)JjjE[g(z;+n• ... , Zi+n+elJj.

Heuristically, a stationary process is ergodic if it is asymptotically independent, that is, if any two random variables positioned far apart in the sequence are almost independently distributed. A stationary process that is ergodic will be called ergodic stationary. Ergodic stationarity will be an integral ingredient in developing large· sample theory because of the following property. Ergodic Theorem: (See, e.g., Theorem 9.5.5 of Karlin and Taylor (1975).) Let (z;} be a stationary and ergodic process with E(z;) = JL? Then

6Drawn from Example 7.8 of Anderson (1971, p. 379). 7So the mean is assumed to exist and is finite.

102

Chapter 2 ]

Zn

n

= -n L

Z;

~ JL.

a.s.

i=l

The Ergodic Theorem, therefore, is a substantial generalization of Kolmogorov's LLN. Serial dependence, which is ruled out by the i.i.d. assumption in Kolmogorov's LLN, is allowed in the Ergodic Theorem, provided that it disappears in the long run. Since, for any (measurable) function f(), {f(z,)} is ergodic stationary whenever (z,} is, this theorem implies that any moment of a stationary and ergodic process (if it exists and is finite) is consistently estimated by the sample moment. For example, suppose (z;} is stationary and ergodic and E(z,z;) exists and is finite. Then, ~ Li z;z~ is consistent for E(zizi). The simplest example of ergodic stationary processes is independent white noise processes. (White noise processes where independence is weakened to no serial correlation are not necessarily ergodic; Example 2.4 above is an example.) Another important example is the AR(l) process satisfying Zi =c+PZi-1 +&;,

IPI

<

1,

where (&;}is independent white noise. Martingales Let x; be an element of z,. The scalar process (x;} is called a martingale with respect to (z;} if (2.2.4) The conditioning set (z;-I, z,_2 , .•. , z 1) is often called the information set at point (date) i - I. (x,} is called simply a martingale if the information set is its own past values (x,_ 1 , x,_ 2 , ••• , x 1). If z; includes x;, then (x;} is a martingale, because E(x; I xi-!, x;_z, ... , x1) = E[E(x; I Zi-1, Z;-z,

... ,

ZJ) I X;-J,

X;-z, ... ,

xJ]

(Law of Iterated Expectations) =E(xi-JIXi-!•X;_z, ... ,x1)=x;_J.

8 rf the process started in the infinite pa~t so that i runs from -oo to +oo. the definition is E(x;

1

Zi-1, Zi-2• ... ) = Xi- I• and the qualifier "i ~ 2" is not needed. Whether the process started in the infinite past or in date i = I is not important for large~sample theory to be developed below. What will matter is that the

process starts before the sample period.

103

Large-Sample Theory

A vector process {z1} is called a martingale if E(z 1 I Z;-I> ... , z,) = z1_ 1 fori > 2.

(2.2.5)

Example 2.5 (Hall's Martingale Hypothesis): Let z1 be a vector containing a set of macroeconomic variables (such as the money supply or GOP) including aggregate consumption ci for period i. Hall's (1978) martingale hypothesis is that consumption is a martingale with respect to {zi }:

This formalizes the notion in consumption theory called "consumption smoothing": the consumer, wishing to avoid fluctuations in the standard of living, adjusts consumption in date i - I to the level such that no change in subsequent consumption is anticipated. Random Walks An important example of martingales is a random walk. Let {gi} be a vector independent white noise process (so it is i.i.d. with mean 0 and finite variance matrix). A random walk, {zi}, is a sequence of cumulative sums: z1 = gl> zz = g, + gz, ... , z; = g1 + g2 + · · · + g, · · · .

(2.2.6)

Given the sequence {zi}, the underlying independent white noise sequence, {gi}, can be backed out by taking first differences: (2.2.7) So the first difference of a random walk is independent white noise. A random walk is a martingale because E(z1 I z;_ 1, ... , z 1) = E(zi I gi-l, ... , g1) (since (zi-1, ... , Z1) and (gi-l, ... , g 1 ) have the same information, as just seen) = E(g, +

~

+ · · · + gi I g,_t, ... , g1)

= E(gi I g;-t •... , g,) + (gl + · · · + gi-t) = gt +···+gi-l =zi-t

(E(gi I gi-l, ... , g 1) = 0 as {g;} is independent white noise)

(by the definition of zi_ 1).

(2.2.8)

104

Chapter 2

Mar1ingale Difference Sequences A vector process {g1} with E(g,) = 0 is called a martingale difference sequence (m.d.s.) or martingale differences if the expectation conditional on its past values, too, is zero:

E(g,

I g,_,, g,_2, ... , g,)

= 0 for i

~

2.

(2.2.9)

The process is so called because the cumulative sum {z1} created from a martingale difference sequence {g1 } is a martingale; the proof is the same as in (2.2.8). Conversely, if {z,j is martingale, the first differences created as in (2.2.7) are a martingale diff~rence sequence. A martingale difference sequence has no serial correlation (i.e., Cov(g1, g1_ j) = 0 for all i and j oft 0). A proof of this claim is as follows. First note that we can assume, without loss of generality, that j > I. Since the mean is zero, it suffices to show that E(g1g_) = 0. So consider rewriting it as follows.

PROOF.

E(g,g;_ j) = E[E(g1g;_j

I g1_ 1 )] (by the Law of Total Expectations)

= E[E(g1 I g1_j)g_j] Now, since j >I, (g1_ 1 ,

(by the linearity of conditional expectations).

... , gi-j· ... ,

g 1) includes

I g,_j) = E[E(g1 I g1_ 1 , ... , g 1_ j· ... , g 1) I g,_ j]

gi-j·

Therefore,

E(g,

(by the Law of Iterated Expectations)

=0. The last equality holds because E(g1 I gi-l , ... , g 1_ j,

... ,

g 1) = 0.

•

ARCH Processes An example of martingale differences, frequently used in analyzing asset returns, is an autoregressive conditional heteroskedastic (ARCH) process introduced by Engle (1982). A process {g1} is said to be an ARCH process of order 1 (ARCH(l)) if it can he written as

(2.2.10)

Large~Sample

105

Theory

where {e;) is i.i.d. with mean zero and unit variance. If 8 1 is the initial value of the process, we can use (2.2.10) to calculate subsequent values of 8i. For example, 82 =

J I; + a8f · e2.

More generally, 8; (i =:: 2) is a function of 8 1 and (e2, e 3 , ••• , e1). Therefore. e1 is independent of (8 1 , 8z, ...• 8i-!l· It is then easy to show that {8d is an m.d.s. because E(8;

I 81-1· 81-2·

(2.2.11)

... '8Il

=

E(Ji; + a8f_ 1 • e, I 8i-1· 8;-z,

=

J I;+ a8f_ 1 E(e, I 8i-1, 8;-z, ... , 81l

=

J~; + a8f_ 1 E(e1)

=0

(since E(e;)

.. ·, 81)

(since e1 is' independent of (8 1, 8 2, ... , 8;-!l)

= 0).

(2.2.12)

By a similar argument, it follows that (2.2.13) So the conditional second moment (which equals the conditional variance since E(8; I 8!, 8z, ... , 81_ 1) = 0) is a function of its own history of the process. In this sense the process exhibits own conditional heteroskedasticity. It can be shown (see, e.g., Engle. 1982) that the process is strictly stationary and ergodic if Ia I < I, provided that 8 1 is a draw from an appropriate distribution or provided that the process started in the infinite past. If 8i is stationary, the unconditional second moment is easy to obtain. Taking the unconditional expectation of both sides of (2.2.13) and noting that E[E(8f I 81-1.81-2 •... , 81ll

= E(8fl

and

E(8fl

= E(8f_ 1 ) if 8i is stationary,

we obtain E(8fl = I;

+a E(8fl

or

E(82) =

'

I;

1-a

(2.2.14)

If a > 0, this model captures the characteristic found for asset returns that large values tend to be followed by large values. (For more details of ARCH processes, see, e.g., Hamilton, 1994, Section 21.1.)

106

Chapter 2

Different Formulation of Lack of Serial Dependence Evidently, an independent white noise process is a stationary martingale difference sequence with finite variance. And, as just seen, a martingale difference sequence has no serial correlation. Thus, we have three formulations of a lack of serial dependence for zero-mean covariance stationary processes. They are, in the order of strength, (I) "{g;} is independent white noise." =}

(2) "(g;} is stationary m.d.s. with finite variance."

=}

(3) "(g;} is white noise."

(2.2.15)

Condition (1) is stronger than (2) because there are processes satisfying (2) but not (1). An ARCH(!) process (2.2.10) wjth Ia I < 1 is an example. Figure 2.1 shows how a realization of a process satisfying (I) typically differs from that satisfying (2). Figure 2.1, Panel (a), plots a realization of a sequence of independent and normally distributed random variables with mean 0 and unit variance. Panel (b) plots an ARCH (I) process (2.2.10) with ~ = 0.2 and a = 0.8 (so that the unconditional variance, ~ ((1 -a), is unity as in panel (a)), where the value of the i.i.d. sequence £ 1 in (2.2.1 0) is taken from Figure 2.1, Panel (a), so that the sign in both panels is the same at all points. The series in Panel (b) is generally less volatile than in Panel (a), but at some points it is much more volatile. Nevertheless, the series is stationary. Condition (2) is stronger than (3); the process in Example 2.4 is white noise, but (as you will show in a review question) it does not satisfy (2). The CLT for Ergodic Stationary Martingale Differences Sequences The following CLT extends the Lindeberg-Levy CLT to stationary and ergodic m.d.s. Ergodic Stationary Martingale Differences CLT (Billingsley, 1961): Let (g1} be a vector martingale difference sequence that is stationary and ergodic with E(g;gi) = 1:,9 and Jet

g=

~

I:7= 1 g1• Then

.Jrig =

1 r::

-yn

n

L g; --> N(O, :E). i=l

d

9Since {g;} is stationary, this matrix of cross moments does not depend on i. Also, since a cross moment matrix is indicated, it is implicitly assumed that all the cross moments exist and are finite.

Large~Sample

Theory

107

(a)

4,------------------------------------------, 3

2 1

-1

-2 -3~--------------------------------------~

(b) ·~-----------------------------------------,

3

2 1

-1

-2 -3~------------------------------------------~

Figure 2.1: Plots of Serially Uncorrelated Time Series: (a) An i.i.d. N(O, 1) sequence. (b) ARCH(l) with shocks taken from panel (a) Unlike in the Linde berg-Levy CLT, there is no need to subtract the mean from g1 because the unconditional mean of an m.d.s. is by definition zero. For the same reason, :E also equals Var(g 1). This CLT, being applicable not just to i.i.d. sequences but also to stationary martingale differences such as ARCH(l) processes, is more general than Lindeberg-Levy. We have presented an LLN for serially correlated processes in the form of the Ergodic Theorem. A central limit theorem for serially correlated processes will be presented in Chapter 6.

108

Chapter 2

QUESTIONS FOR REVIEW

1. Prove that

r -j

=

r;.

Hint: Cov(z;, Z;-j) = E[(z; -JL)(Zi-j -JL)'] where

JL = E(z;). By covariance-stationarity, Cov(z 1, z;-j) = Cov(z;+j• z1).

2. (Forecasting white noise) For the white noise process of Example 2.4, E(z 1) = 0. What is E(z 1 I z 1) fori > 2? Hint: You should be able to forecast the future exactly if you know the value of z 1 . Is the process an m.d.s? [Answer: No.] 3. (No anticipated changes in martingales) Suppose {x 1} is a martingale with respect to {z1}. Show that E(x;+j I z 1_ 1 , Z;-2, ... , z 1) = x 1_ 1 and E(x;+j+I x 1+j I z,_ 1 , z,_ 2 , ..• , z 1) = 0 for j = 0, I, .... Hint: Use the Law of Iterated Expectations. 4. Let {x;} be a sequence of real numbers that change with i and {e;} be a sequence of i.i.d. random variables with mean 0 and finite variance. Is {x, . e;} i.i.d.? [Answer: No.] Is it serially independent? [Answer: Yes.] An m.d.s? [Answer: Yes.] Stationary? [Answer: No.]" 5. Show that a random walk is non stationary. Hint: Check the variance. 6. (The first difference of a martingale is a martingale difference sequence) Let {z;} be a martingale. Show that the process {g;} created by (2.2.7) is an m.d.s. Hint: (g 1, ... , g1) and (z 1, ... , z 1) share the same information. 7. (An m.d.s. that is not independent white noise) Let g 1 = e; · e;- 1 , where {e;} is an independent white noise process. Evidently, {g;} is not i.i.d. Verify that {g;} (i = 2, 3, ... ) is an m.d.s. Hint: E(g; I g,_,, ... , gz) = E[E(e; · e;-1 e;-t •... , Bt)

I ei-I'B;-z, E:;-2·6;-3,

... ,

Cz·etJ.

8. (Revision of expectations is m.d.s.) Let {y;} be a process such that E(y1 y 1_ 1 , Y;- 2 , ... , y 1) exists and is finite, and definer;; = E(y; I Yi-1> y1_ 2 , .•. , y 1)- E(y; I y1_ 2 , y,_,, ... , y;). So r;; is the change in the expectation as one more observation is added to the information set. Show that {r1d (i > 2) is an m.d.s. with respect to {y;}. 9. (Billingsley is stronger than Lindeberg-Levy) Let {z;} be an i.i.d. sequence with E(z;) = JL and Var(z;) = 1:, as in the Lindeberg-Levy CLT. Use the Martingale Differences CLT to prove the claim of the Lindeberg-Levy CLT, namely, that .jii(zn -JL) ->d N(O, 1:). Hint: {z; -JL) is an independent white noise process and hence is an ergodic stationary m.d.s.

109

Large-Sample Theory

2.3

Large-Sample Distribution of the OLS Estimator

The importance in econometrics of the OLS procedure, originally developed for the classical regression model of Chapter 1, lies in the fact that it has good asymptotic properties for a class of models, different from the classical model, that are useful in economics. Of those models, the model presented in this section has probably the widest range of economic applications. No specific distributional assumption (such as the normality of the error term) is required to derive the asymptotic distribution of the OLS estimator. The requirement in finite-sample theory that the regressors be strictly exogenous or "fixed" is replaced by a much weaker requirement that they be "predetermined." (For the sake of completeness, the appendix develops the parallel asymptotic theory for a model with "fixed" regressors.)

The Model We use the term the data generating process (DGP) for the stochastic process that generated the finite sample (y, X). Therefore, if we specify the DGP, the joint distribution of the finite sample (y, X) can be determined. ln finite-sample theory, where the sample size is fixed and finite, we defined a model as a set of the joint distributions of (y, X). In large-sample theory, a model is stated as a set of DGPs. The model we study is the set of DGPs satisfying the following set of assumptions. Assumption 2.1 (linearity):

y;=x;fJ+s;

(i=1,2, ... ,n)

where x; is a K-dimensional vector of explanatory variables (regressors), {J is a K-dimensional coefficient vector, and £; is the unobservable error term. Assumption 2.2 (ergodic stationarity): The (K +I)-dimensional vector stochastic process {y;, X;} is jointlystationary and ergodic. Assumption 2.3 (predetermined regressors): All the regressors are predetermined in the sense that they are orthogonal to the contemporaneous error term: E(x;k<;) = 0 for alii and k (= 1, 2, ... , K)w This can be written as 10 0ur definition of the 1enn predetermined is not universal. Some authors say that the regressors are predetennined if E(xi- j · Sj) = 0 for all j 2: 0, not just for j = 0. That is, ihe error tenn is orthogonal not only to the contemporaneous but also the past regressors. In Koopmans and Hood ( 1953), the regressors are said to be predetennined if ei is independent of Xi- j forall j 2: 0. Our definition is the same as Hamilton's (1994).

110

Chapter 2

E[x, · (y 1 -

X,Jl)]

= 0 or equivalently E(g1) = 0

where~ =xi · Cj.

Assumption 2.4 (rank condition): The K x K matrix E(x 1x;) is nonsingular (and hence finite). We denote this matrix by I:xx. Assumption 2.5 (g 1 is a martingale difference sequence with finite second moments): (g1) is a martingale difference sequence (so a fortiori E(g;) = 0). The K x K matrix of cross moments, E(g;g;), is nonsingular We useS for Avar(ji) (the variance of the asymptotic distribution of .jii g, where g = ~ I:, g;). By Assumption 2.2 and the ergodic stationary Martingale Differences CLT, S = E(g1i;). The first assumption is just reproducing Assumption 1.1. The rest of the assumptions require some lengthy comments. • (Ergodic stationarity) A trivial but important special case of ergodic stationarity is that {y1, x,j is i.i.d., that is, the sample is a random sample.U Most existing microdata on households are random samples, with observations randomly drawn from a population of a nation's households. Thus, we are in no way ruling out models that use cross-section data. • (The model accommodates conditional heteroskedasticity) If (y1, x 1) is stationary, then the error term e1 = y 1 - x;p is also stationary. Thus, Assumption 2.2 implies that the unconditional second moment E(ef) -if it exists and is finite- is constant across i. That is, the error term is unconditionally homoskedastic. Yet the error can be conditionally heteroskedastic in that the conditional second moment, E(ef I x1), can depend on x1• An example in which the error is homoskedastic unconditionally but not conditionally is included in Section 2.6, where the consequence of superimposing conditional homoskedasticity (that E(e,2 1 x1) = 0" 2 ) on the model will be explored. • (E(x; · e 1) = 0 vs. E(e1 1 x1) = 0) Sometimes, instead of the orthogonality condition E(x; · e1) = 0, it is assumed that the error is unrelated in the sense that E(e; I x1) = 0. This is stronger than the orthogonality condition because it

11 Actually, once the independence assumption is made, the same large-sample results can be proved for the more general case where Cn, Xi l is independently but not identically distributed (Ln.i.d.), provided that some conditions on higher moments of the joint distribution of (Ej, Xi) are satisfied. We will not entertain this genera1ization because the i.i.d assumption is satisfied in most microdata sets.

111

Large-Sample Theory

implies that, for any (measurable) function f of x1, f (x 1) is orthogonal to £;: E[f(x;)<;] = E[E(f(x;)£; I x;)] = E[f(x;)E(£; I x1)] = 0. In rational expectations models, this stronger condition is satisfied, but for the purpose of developing asymptotic theory, we need only the weaker assumption of the orthogonality condition. • (Predetermined vs. strictly exogenous regressors) The regressors are not required to be strictly exogenous. As we noted in Section 1.2, the exogeneity assumption (Assumption 1.2) implies that, for any regressor k, E(xjkE;) = 0 for all i and j, not just fori = j, which rules out the possibility that the current error term, £;, is correlated with future regressors, x1+ j for j > I. Assumption 2.3, restricting only the contemporaneous relationship between the error term and the regressors, does not rule· out that possibility. For example, the AR(l) process, which does not satisfy the exogeneity assumption of the classical regression model, can be accommodated in the model of this chapter. This weaker assumption of predetermined regressors will be further relaxed in the next chapter. • (Rank condition as no multicollinearity in the limit) Since E(x 1xj) is finite by Assumption 2.4, limn-).oo Sxx = l:xx (where Sxx = ~ I:7=t x;xD with probability one by the Ergodic Theorem. So, for n sufficiently large, the sample cross moment of the regressors Sxx, which can be written as ~ X'X, is nonsingular by Assumptions 2.2 and 2.4. Since ~ X'X is nonsingular if and only if rank(X) = K, Assumption 1.3 (no multicollinearity) is satisfied with probability one for sufficiently large n. In the OLS formula b = S;;i Sxy (where Sxy = ~ I:7~t X; · y1), S.,. needs to be inverted. IfS,.. is singular in a finite sample (so it cannot be inverted), we just assign an arbitrary value to b so that the OLS estimator is well-defined for any sample. • (A sufficient condition for {g;} to be an m.d.s.) Since an m.d.s. (martingale difference sequence) is zero-mean by definition, Assumption 2.5 is stronger than Assumption 2.3. We will need Assumption 2.5 to prove the asymptotic normality of the OLS estimator. The assumption, about the product of the regressors and the error term, may be hard to interpret. A sufficient condition that is easier to interpret is E(e;

I £i-t·Ci-2·····£t,Xi,Xi-l·····xr)

= 0.

(2.3.I)

Note that the current as well as lagged regressors is included in the information

112

Chapter 2

set. This condition implies that the error term is serially uncorrelated and also is uncorrelated with the current and past regressors (the proof is much like the proof in the previous section that an m.d.s. is serially uncorrelated). That (2.3.1) is sufficient for (g,.) to be an m.d.s. can be seen as follows. We have E(g; I g;-I, ... , gi) = E[E(g;

I <;-I, <;-z, ... ,
This holds by the Law of Iterated Expectations because there is more information in the "inside" information set (£;_ 1 , E;-2 •... , e 1 , xj, X;- 1 , ... , x 1) than in the "outside" information set (g,_ I, ... , gi ). Therefore, E(g; I g;_ I, ... , gi) = E[x; E(<;

I <;-I, <;-2, ·... ,<~ox;, X;-I, ... , XI) I g;-I, ... , gil

(by the linearity of conditional expectations) = 0

(by (2.3.1)).

(2.3.2)

• (When the regressors include a constant) In virtually all applications, the regressors include a constant. If the regressors include a constant so that xn = I for all i, then Assumption 2.3 of predetermined regressors can be stated in more familiar terms: the mean of the error term is zero (which is implied by E(x;,E;) = 0 for k = 1), and the contemporaneous correlation between the error term and the regressors is zero (which is implied by E(x;,<;) fork f I and E(<;) = 0). Also, since the first element of the K -dimensional vector g, (= x; . <;) is E;, Assumption 2.5 implies E(E; I g,_I, g;-2, ... , gi) = 0. Then, by the Law of Iterated Expectations,(<;) is a scalar m.d.s: (2.3.3) Therefore, Assumption 2.5 implies that the error term itself is an m.d.s. and hence is serially uncorrelated. • (Sis a matrix of fourth moments) Since g1 = X; · E;, the Sin Assumption 2.5 can be written as E(
113

Large-Sample 'rheory

• (S will take a different expression without Assumption 2.5) Thanks to the assumption that {g1} is an m.d.s .• S (= Avar(g)) is equal to E(g;l(,). Without the assumption, as we will see in Chapter 6, the expression for S is more complicated and involves autocovariances of g;. Asymptotic Distribution of the OLS Estimator We now prove that the OLS estimator is consistent and asymptotically normal. It should be kept in mind throughout the rest of this chapter that the OLS estimator b depends on the sample size n (although the dependence is not made explicit by our choice of not to subscript b by n) and that K, the number of regressors, is held fixed when we track the sequence of OLS estimators indexed by n. For the time being, we presume that there is available some consistent estimator, denoted S, of

S (= Avar@ = E(g1g;) = E(efx1x;>). The issue of estimating S consistently will be taken up later. Proposition 2.1 (asymptotic distribution of the OLS Estimator): (a) (Consistency ofb for {J) Under Assumptions 2.1-2.4, Assumption 2.5 is not needed for consistency.)

plimn~oo

b = {J. (So

(b) (Asymptotic Normality of b) If Assumption 2.3 is strengthened as Assumption 2.5, then

vfn(b- {J)

-->

N(O, Avar(b)) as n --> oo,

d

where

Avar(b) = I:-;,.1 S I:;,.'. (Recall: I:xx

= E(x1x;J, S =

(2.3.4)

E(g 11(,), g1 = x1 • e1.)

(c) (Consistent Estimate of Avar(b)) Suppose there is available a consistent estimatm; S, ofS (K x K). Then, under Assumption 2.2, Avar(b) is consistently estimated by

-

-1- -1 Avar(b) = Sxx S Sxx,

(2.3.5)

where Sxx is the sample mean ofx 1x;: n

I"", I, Sxx = - L... x;x1 = -X X. (KxK)

n

i=l

n

(2.3.6)

114

Chapter 2

The proof is a showcase of all the standard tricks in asymptotics. For proving (a) and (b), three tricks will be employed: (I) write the object in question in terms of sample means, (2) apply the relevant LLN (the Ergodic Theorem in the present context) and CLT (the ergodic stationary Martingale Differences CLT) to sample means, and (3) use Lemma 2.4(c) to derive the asymptotic distribution. Proof of (c) will not be given because it is an immediate implication of ergodic stationarity. PROOF

(Parts (a) and (b)).

(I) We first write the sampling error b -

fJ in terms of sample means.

= S -1-g, XX

(2.3.7)

where

The sample means Sxx and g depend on the sample size n, although the notation does not make it explicit. (2) (Consistency) Since by Assumption 2.2 {x,x;) is ergodic stationary, Sxx ->p I: xx· (The convergence is actually almost surely, but almost sure convergence implies convergence in probability.) Since I:,, is invertible by Assumption 2.4, S;;,_1 ->p 1:;,1 by Lemma 2.3(a). Similarly, g ->p E(g;) which by Assumption 2.3 is 0. So by Lemma 2.3(a), S;;;ig ->p I:;;;iO = 0. Therefore, plimn~ 00 (b {J) = 0, which implies plimn~oo b = fJ. (3) (Asymptotic normality) Rewrite (2.3.7) as Jli(b- {J) = S;:1(J1ig).

(2.3.8)

As mentioned in the statement of Assumption 2.5, Jn g ->d N (0, S). So, by Lemma 2.4(c), Jli(b- /J) converges to a normal distribution with mean 0 and variance I:;:1S(I:;:1)'. But since I:xx is symmetric, this expression equals ~3~.

•

115

Large-Sample Theory

This result says that the distribution of .fii times the sampling error is approximated arbitrarily well by a normal distribution when the sample size is sufficiently large. The natural question is how large is "large": how large must the sample size be for the asymptotic approximation to be valid? The asymptotic result we just derived holds for all the DGPs satisfying the model assumptions. However, the sample size needed to achieve a given measure of proximity to the asymptotic distribution depends on the DGP. We will par1ially address this issue in the Monte Carlo experiment of this chapter.

s 2 Is Consistent We now tum to the OLS estimator, s 2 , of the error variance.

Proposition 2.2 (consistent estimation of error variance): Let e; = y; - x;b be the OLS residual for observation i. Under Assumptions 2.1-2.4, s

2

I

= n-K

n "2

L.. e; ---> E(e;2), P

i=l

provided E(ef) exists and is finite.

If we could observe the error term e;, then the obvious estimator would be the sam-

ef.

ple mean of It is consistent by ergodic stationarity. The message of Proposition 2.2 is that the substitution of the OLS residual e; for the true error term e; does not impair consistency. Let us go through a sketch of the proof, because knowing how to handle the discrepancy between t:; and its estimate e; will be useful in other contexts as well. Since n

2

s =

n (_!_ L ef),

n- K n

i=l

ef, L; ef, converges in probability to

it suffices to prove that the sample mean of ~ E(t:f}. The relationship between e1 and e; is given by ei

=Yi- X;b = y1 =

-

x; fJ - x; (b - {J)

(by adding and subtracting x; {J)

e; - X,(b- {J),

(2.3.9)

so that

ef = ef- 2(b- {J)'x; · e; + (b- {J)'x;x;(b- {J).

(2.3.10)

116

Chapter 2

Summing over i, we obtain n

~nL...t' "'e

2

n

=

i=l

~nL...t' "'e

2

n

-

2(b- tJ)' ~ "'x1 · e1 +(b-Ill'(~"' x,x')(b-

nL...t

i=l

=

~n

t

e? -

n

nL...t

i=l

i=l

2(b - fJ)'ii; + (b- fJ)'Sxx(b- fJ).

'

fJ)

(2.3.11)

i=l

The rest of the proof, which is to show that the plims of the last two terms are zero so that plim ~ I:, e? = plim ~ I:, ef, is left as a review question. If you do the review question, it should be clear to you that all that is required for the coefficient estimator is consistency; if some consistent estimator, rather than the OLS estimator b, is used to form the residuals, the error variance estimator is still consistent for E( ef). QUESTIONS FOR REVIEW

1. Suppose E(y1 I x1) = x;tJ, that is, suppose the regression of y1 on x1 is a linear function of x1• Define e1 = y1 - x;tJ. Show that x1 is orthogonal to e1• Hint: First show that E(e 1 1 x 1) = 0. 2. (Is E(ef) assumed to be finite?) (a) Do Assumptions 2.1-2.5 imply that E(eF) exists and is finite? Hint: A strictly stationary process may not have finite second moments. (b) If one of the regressors is a constant in our model, then the variance of

the error term is finite. Prove this. Hint: II x 11 = I, the (1, I) element ol E(g1g;J is

e;.

3. (Alternative expression for S) Let E(e;x,x;)) can be written as

f (x1)

E(ef I x1). Show that S (=

S = E[f(x1)x 1x;J. Hint: Law of Total Expectations.

4. Complete the proof of Proposition 2.2. Hint: We have already proved lor Proposition 2.1 that plim ii; = 0, plim Sxx = I:xx, and plim(b - fJ) = 0 under Assumptions 2.1-2.4. Use Lemma 2.3(a) to show that plim(b - fJ)'g = 0 and plim(b - fJ)'S .. (b - fJ) = 0.

-

5. (Proposition 2.2 with consistent fJ) Prove the following generalization of Proposition 2.2:

117

Large-Sample Theory

x;p

Let £1 = y 1 where Pis any consistent estimator of {J. Under Assumptions 2.1, 2.2, and the assumption that E(x1 . e;) and E(x1x) are finite, ~I:, i!J->, E(eJ). So the regressors do not have to be orthogonal to the error term.

2.4

Hypothesis Testing

Statistical inference in large-sample theory is based on test statistics whose asymptotic distributions are known under the truth of the null hypothesis. Derivation of the distribution of test statistics is easier than in finite-sample theory because we are only concerned about the large-sample approximation to the exact distribution. In this section we derive test statistics, assuming throughout that a consistent estimator, S, of S (= E(g;g;l) is available. The issue of consistent estimation of Swill be taken up in the next section.

Testing Linear Hypotheses Consider testing a hypothesis about the k-th coefficient {h. Proposition 2.1 implies that under the Ho: fJk = {J., and

-

Avar(bd--> Avar(bk). p

where bk is the k-th element of band Avar(bd is the (k, k) element of the K x K matrix Avar(b). So Lemma 2.4(c) guarantees that (2.4.1)

where

JI -

l (

-1-!) kk' SE'(bk) =;; · Avar(bk) = ; ; · Sxx SSxx

The denominator in this t-ratio, SE'(bk), is called the heteroskedasticityconsistent standard error, (heteroskedasticity-)robust standard error, or White's standard error. The reason for this terminology is that the error term can be conditionally heteroskedastic; recall that we have not assumed conditional homoskedasticity (that E(e?l x1) does not depend on x1) to derive the asymptotic distribution of tk. This t-ratio is called the robust t-ratio, to distinguish it from the

118

Chapter 2

t-ratio of Chapter I. The relationship between these two sorts oft-ratio will be discussed in Section 2.6. Given this t -ratio, testing the null hypothesis H 0 : fh = fh at a significance level of a proceeds as follows:

Step 1: Calculate tk by the formula (2.4.1). Step 1: Look up the table of N (0, I) to find the critical value t.12 which leaves a /2 to the upper tail of the standard normal distribution. (example: if a = 5%,

= 1.96.) Step 3: Accept the hypothesis if ltkl < t. 12 ; otherwise reject. 1aj2

The differences from the finite-sample t-test are: (I) the way the standard error is calculated is different, (2) we use the table of N(O, I) rather than that of t(n- K), and (3) the actual size or exact size of the test (the probability of Type I error given the sample size) equals the nominal size (i.e., the desired significance level a) only approximately, although the approximation becomes arbitrarily good as the sample size increases. The difference between the exact size and the nominal size of a test is called the size distortion. Since tk is asymptotically standard normal, the size distortion of the t-test converges to zero as the sample size n goes to infinity. Thus, we have proved the first half of Proposition 2.3 (robust t-ratio and Wald statistic): Suppose Assumptions 2.12.5 hold, and suppose there is available a consistent estimateS ofS (= E(g,Jt;)). As before, Jet -

Avar(b)

-1-- -1 = s.. S s...

Then (a) Under the null hypothesis H 0 :

fh = fh, tk defined in (2.4.1) --+d N(O, 1).

(b) Under the null hypothesis H 0 : R{J = r, where R is an #r x K matrix (where #r, the dimension of r. is the number of restrictions) of full row rank,

W

-

= n · (Rb- r)'{R[Avar(b)]R')-\Rb- r)--+ x2 (#r).

What remains to be shown is that W cation of Lemma 2.4( d).

--+d

(2.4.2)

d

x2 (#r), which is a straightforward appli-

119

Large-Sample Theory

PROOF

(continued). Write Was W = c~Q;;- 1 c. where

Cn

-

= .)ii(Rb- r) and Q. = RAvar(b)R'.

Under Ho, en = R.jii(b- {3). So by Proposition 2.1, Cn

--> c d

where

c ~ N(O, RAvar(b)R').

Also by Proposition 2.1, Q. --> Q p

where

Q

= RAvar(b)R'.

Because R is of full row rank and Avar(b) is positive definite, Q is invenible. Therefore, by Lemma 2.4( d),

Since the #r-dimensional random vector c is normally distributed and since Q • equals Var(c), c'Q- 1c ~ x2(#r).

-

This chi-square statistic W is a Wald statistic because it is based on unrestricted estimates (band Avar(b) here) not constrained by the null hypothesis H0 . Testing H0 at a significance level of a proceeds as follows. Step 1: Calculate theW statistic by the formula (2.4.2). Step 2: Look up the table of x2(#r) distribution to find the critical value xJ(#r) that gives a to the upper tail of the x 2(#r) distribution. Step 3: If W < xJ(#r), then accept Ho; otherwise reject.

The probability of Type I error approaches a as the sample becomes larger. As will be made clear in Section 2.6, this Wald statistic is closely related to the familiar F -test under conditional homoskedasticity. The Test Is Consistent

Recall from basic statistics that the (finite-sample) power of a test is the probability of rejecting the null hypothesis when it is false given a sample of finite size (that is, the power is I minus the probability of Type II error). Power will obviously depend on the DGP (i.e., how the data were actually generated) considered as the alternative as well as on the size (significance level) of the test. For example, consider any DGP {y;, x;) satisfying Assumptions 2.1-2.5 but not the null hypothesis

120

Chapter 2

H0 : fh = {Jk. The power of the /-test of size a against this alternative is

which depends on the DGP in question because the DGP controls the distribution of lk. We say that a test is consistent against a set of DGPs, none of which satisfies the null, if the power against any particular member of the set approaches unity as n ---+ oo for any assumed significance level. That the 1-test is consistent against the set of alternatives (DGPs) satisfying Assumptions 2.1-2.5 can be seen as follows. Look at the expression (2.4.1) for the 1-ratio, reproduced here:

0i,(bk -7fk) tk

= ·,;A-;;;;;(J;k)

.

The denominator converges to .JAvar(bk) despite the fact that the DGP does not satisfy the null (recall that all parts of Proposition 2.1 hold regardless of the truth of the null, provided Assumptions 2.1-2.5 are satisfied). On the other hand, the numerator tends to +oo or -oo because h converges in probability to the DGP's fJk. which is different from {Jk. So the power tends to unity as the sample size n tends to infinity, implying that the t -test of Proposition 2.3 is consistent against those alternatives, the DGPs that do not satisfy the null. The same is true for the Wald test. Asymptotic Power

For later use in the next chapter, we define here the asymptotic power of a consistent test. As noted above, the power of the 1-test approaches to unity as the sample size increases while the DGP taken as the alternative is held fixed. But if the DGP gets closer and closer to the null as the sample size increases, the power may not converge to unity. A sequence of such DGPs is called a sequence of local alternatives. For the regression model and for the null of H0 : fJk = {Jh it is a sequence of DGPs such that (i) the n-th DGP, IYin), xin)) (i = I, 2, ... ), satisfies Assumptions 2.1-2.5 and converges in a certain sense to a fixed DGP {y;, x;)/ 2 and (ii) the value of fJk of the n-th DGP, fJi"), converges to {Jk. Suppose, further, that fJt) satisfies (2.4.3)

12 See, e.g., Assumption l of Newey (1985) for a precise statement.

121

Large-Sample Theory

for some given y of 0. So fJt> approaches to {Jk at a rate proportional to 1I .fii. This special sequence of local alternatives is called a Pitman drift or a Pitman sequence. Substituting (2.4.3) into (2.4.1), the t-ratio above can be rewritten as (2.4.4)

If the sample of size n is generated by the n-th DGP of a Pitman drift, does lk converge to a nontrivial distribution? Since the n-th DGP satisfies Assumptions 2.1-2.5, the first term on the right hand side of (2.4.4) converges in distribution to N(O, l) by parts (b) and (c) of Proposition 2.1. By part (c) of Proposition 2.1 and the fact that {yj•>, xJ•>) "converges" to a fixed DGP, the second term converges in probability to (2.4.5) where Avar(bk) is evaluated at the fixed DGP. Therefore, tk -+d N(f.l, l) along this sequence of local alternatives. If the significance level is a, the power converges to Prob(lxl > ta;z)

(2.4.6)

where x ~ N(f.l, l) and t.12 is the level-a critical value. This probability is called the asymptotic power. lt is a measure of the ability of the test to detect small deviations of the model from the null hypothesis. Evidently, the larger is lf-ll, the higher is the asymptotic power for any given size a. By a similar argument, it is easy to show that the Wald statistic converges to a distribution called noncentral chi-squared. Testing Nonlinear Hypotheses The Wald statistic can be generalized to a test of a set of nonlinear restrictions on

fJ.

Consider a null hypothesis of the form H0 : a({J) = 0.

Here, a is a vector-valued function with continuous first derivatives. Let #a be the dimension of a({J) (so the null hypothesis has #a restrictions), and let A(/J) be the #a x K matrix of first derivatives evaluated at {J: A(/J) = Ja({J) jofJ'. For the hypothesis to be well-defined, we assume that A({J) is of full row rank (this is the generalization of the requirement for linear hypothesis R{J = r that R is of full

122

Chapter 2

row rank). Lemma 2.5 of Section 2.1 and Proposition 2.1 (b) imply that Jil[a(b)- a(fl)] -> c, c ~ N(O, A(/1) Avar(b) A(/1)').

(2.4.7)

d

Since a(fl) = 0 under H 0 , (2.4. 7) becomes Jila(b)-> c, c ~ N(O, A(/1) Avar(b) A(/1)').

(2.4.8)

d

Since b ->p p by Proposition 2.1(a), Lemma 2.3(a) implies that A(b) ->p A(fl). By Proposition 2.l(c), Avar(b) ->p Avar(b). So by Lemma 2.3(a),

-

-

A(b)Avar(b) A(b)'-> A(/1) Avar(b) A(/1)' = Var(c).

(2.4.9)

p

Because A(/1) is of full row rank ancj Avar(b) is positive definite, Var(c) is invertible. Then Lemma 2.4(d), (2.4.8), and (2.4.9) imply Jil a(b)'(A(b)A~)A(b)')- 1 Jil a(b) -> c'Var(c)- 1c ~ d

x2 (#a).

(2.4.10)

Combining two Jil's in (2.4.10) into one n, we have proved Proposition 2.3 (continued):

(c) Under the null hypothesis with #a restrictions H 0 : a(fl) = 0 such that A(fl), the #a x K matrix of continuous first derivatives of a(/1), is of full row rank, we have

W

= n · a(b)'(A(b)A~) A(b)')- 1a(b) ->d x2 (#a).

Par1 (c) is a generalization of (b); by setting a(fl) =

RP -

(2.4.11)

r, (2.4.11) reduces to

(2.4.2), the Wald statistic for linear restrictions. The choice of a(·) for representing a given set of restrictions is not unique. For example, f'>tfh =I can be written as a(/1) = 0 with a(/1) = fh/11- I or with a(/1) = fh -I/ f)z. While par1 (c) of the proposition guarantees that in large samples the outcome of the Wald test is the same regardless of the choice of the function a, the numerical value of the Wald statistic W does depend on the representation, and the test outcome can be different in finite samples. In the above example, the second representation, a (/1) = fh - If fh. does not satisfy the requirement of continuous derivatives at /12 = 0. Indeed, a Monte Carlo study by Gregory and Veal! (1985) reports that, when /12 is close to zero, the Wald test based on the second representation rejects the null too often in small samples.

123

Large-Sample Theory

QUESTIONS FOR REVIEW

1. Does SE'(bk)

---+p

0 as n---+ oo?

2. (Standard error of a nonlinear function) For simplicity let K = I and let b be the OLS estimate of {3. The standard error of b is JA~)/n. Suppose A = -log({3). The estimate of A implied by the OLS estimate of {3 is~= -log(b). Verify that the standard error

of~ is (1/b). J A~)/n.

3. (Invariance [or lack thereof] of the Wald statistic) There is no unique way to write the linear hypothesis RP = r, because for any #r x #r nonsingular matrix F, the same set of restrictions can be represented as R.p = r with R = FR and r = Fr. Does a different choice of R and r affect the asymptotic distribution of W? The finite-sample distribution? The numerical value?

2.5 Estimating E(ef x;x;) Consistently The theory developed so far presumes that there is available a consistent estimator, S, of S (= E(g,g;J E(efx,x;J) to be used to calculate the estimated asymptotic variance, Avar(b). This section explains how to obtainS from the sample (y, X).

-

=

-

Using Residuals for the Errors If the error were observable, then the sample mean of efx,x; is obviously consistent by ergodic stationarity. But we do not observe the error term, and the substitution of some consistent estimate of it results in n

-S

p

' , , XjX;, ' =-I "L.J£i

n.1=1

(2.5.1)

where £1 = y; - x;{i, and is some consistent estimator of p. (Although the obvious candidate for the consistent estimator is the OLS estimator b, we use rather than b here, in order to make the point that the results of this section hold for any consistent estimator.) For this estimator to be consistent for S, we need to make a fourth-moment assumption about the regressors.

P

p

Assumption 2.6 (finite fourth moments for regressors): E[(x;kXij) 2 ] exists and is finite for all k, j (= I, 2, ... , K).

124

Chapter 2

Proposition 2.4 (consistent estimation of S): Suppose the coefficient estimate 7J used for calculating the residual E:, for Sin (2.5.1) is consistent, and suppose S = E(g,g,) exists and is finite. Then, under Assumptions 2.1, 2.2, and 2.6, S given in (2.5.1) is consistent for S.

To indicate why the fourth-moment assumption is needed for the regressors, we provide a sketch of the proof for the special case of K = 1 (only one regressor). Sox; is now a scalar x;, g, is a scalar g; = x;e;, and (2.3.10) (with b = 7J and e; = E:,) simplifies to ·2 e, = e,2 - 2({3 - f3)x;£;

By multiplying both sides by 1~

•2 2

1~

2 2

;; L_.£;X; - ; ; L_.£;X; = i::::l

2 + (/3{3) 2x,.

(2.5.2)

xt and summing over i, -

-2({3-

i=l

3 21~ 4 /3);;1 ~ L...e,x, + (/3- {3) n L_.X;. (2.5.3) i:::d

j::::l

Now we can see why the finite fourth-moment assumption on x; is required: if the fourth moment E(x() is finite, then by ergodic stationarity the sample average of converges in probability to some finite number, so that the last term in (2.5.3) vanishes (converges to 0 in probability) if fj is consistent for {3. It can also be shown (see Analytical Exercise 4 for proof) that, by combining the same fourthmoment assumption about the regressors and the fourth-moment assumption that E(g, g,) (= E( efx,x;)) is finite, the sample average of xJ £; converges in probability to some finite number, so that the other term on the RHS of (2.5.3), too, vanishes. According to Proposition 2.1(a), the assumptions made in Proposition 2.3 are sufficient to guarantee that b is consistent, so we can set b = fJ in (2.5.1) and use the OLS residual to calculateS. Also, the assumption made in Proposition 2.4 that E(g;g;) is finite is part of Assumption 2.5, which is assumed in Proposition 2.3. Therefore, the import of Proposition 2.4 is:

xi

-

-

-

If Assumption 2.6 is added to the hypothesis in Proposition 2.3, then the S given in (2.5.1) with b = 7J (so E:, is the OLS residual e;) can be used in (2.3.5) to calculate the estimated asymptotic variance: (2.5.4) which is consistent for Avar(b).

125

Large-Sample Theory

Data Matrix Representation of S

If B is then x n diagonal matrix whose i-th diagonal element is (2.5.1) can be represented in terms of data matrices as

£f, then the Sin

X'BX S=-n

(2.5.1')

with

So (2.5.4) can be rewritten (with £i in B set to ei) as

-

Avar(b) = n · (X'X)- 1 (X'BX)(X'X)- 1•

(2.5.4')

These expressions, although useful for some purposes, should not be used for computation purposes, because the n x n matrix B will take up too much of the computer's working memory, panicularly when the sample size is large. To computeS from the sample, the formula (2.5.1) is more useful than (2.5.1'). Finite-Sample Considerations

Of course, in finite samples, the power may well be far below one against cenain alternatives. Also, the probability of rejecting the null when the DGP does satisfy the null (the Type I error) may be very different from the assumed significance level. Davidson and MacKinnon (1993, Section 16.3) repon that, at least for the Monte Carlo simulations they have seen, the robust t-ratio based on (2.5.1) rejects the null too often and that simply replacing the denominator n in (2.5.1) by the degrees of freedom n- K or equivalently multiplying (2.5.1) by nj(n- K) (this is a degrees of freedom correction) mitigates the problem of overrejecting. They also repon that the robust t-ratios based on the following adjustments on S perform even better:

-

(2.5.5) where Pi is the Pi defined in the context of the influential analysis in Chapter I: it is the i -th diagonal element of the projection matrix P, that is,

126

Chapter 2

QUESTIONS FOR REVIEW

1. (Computation of robust standard errors) In Review Question 9 of Section 1.2, we observed that the standard errors of the OLS coefficient estimates can be calculated from Sxx, Sxy (the sample mean of x1 • y1), y'y jn, andy, so the sample moments need to be computed just once. ls the same true for the robust standard errors where theSis calculated according to the formula (2.5.1) with ~.- e·? c,,.

2. The finite-sample variance of the OLS estimator in the generalized regression model of Chapter I is Var(b 1 X) = (X'X)- 1X'(0" 2 V)X(X'X)- 1• Compare this to (2.5.4'). What are the differences?

2.6

Implications of Conditional Homoskedasticity

The test statistics developed in Sections 2.4 and 2.5 are different from the finitesample counterparts of Section 1.4 designed for testing the same null hypothesis. How are they related? What is the asymptotic distribution of the t and F statistics of Chapter I? This section answers these questions. lt turns out that the robust r-ratio is numerically equal to the /-ratio of Section 1.4, for a particular choice of S. Therefore, the asymptotic distribution of the /-

-

ratio of Section 1.4 is the same as that of the robust t -ratio, if that particular choice is consistent for S. The same relationship holds between the F -ratio of Section 1.4 and the Wald statistic W of this chapter. Under the conditional homoskedasticity assumption stated below, that particular choice is indeed consistent.

Conditional versus Unconditional Homoskedasticity The conditional homoskedasticity assumption is: Assumption 2.7 (conditional homoskedasticity): E(ef I x 1 ) = "

2

> 0.

(2.6.1)

This assumption implies that the unconditional second moment E(ef) equals 0' 2 by the Law of Total Expectations. To be clear about the distinction between unconditional and conditional homoskedasticity, consider the following example.

127

Large-Sample Theory

Example 2.6 (unconditionally homoskedastic but conditionally heteroskedastic errors): As already observed, if (y;, x;) is stationary, so is {£;). and the error is unconditionally homoskedastic in that E(st) does not depend on i. To illustrate that the error can nevertheless be conditionally heteroskedastic, suppose that£; is written as £1 ry,j(x 1), where (ry;} is zero-mean E(ry 1) = 0 and is independent of x 1• The conditional second moment of s1 depends on x1 because

=

E(sf I x,) = E(ryt j(x;) 2 I x,) 2

= f(x;) E(ryf

= f(x1) 2 E(ryt)

I x1)

(since £1 = ry,j(x 1))

(by the linearity of conditional expectations)

(since ry 1 is independent of x1 by assumption),

which varies across i because of the variation in

f (x1) across i.

Reduction to Finite-Sample Formulas

To examine the large-sample distribution of the 1 and F statistics of Chapter I under the additional Assumption 2. 7, we first examine the algebraic relationship to their robust counterparts. Consider the following choice for the estimate of S: -

S = s 2 Sxx.

where s 2 is the OLS estimate of a 2 . (We will show in a moment that this estimator is consistent under conditional homoskedasticity.) Then the expression (2.3.5) for Avar(b) becomes

-

(2.6.2) Substituting this expression into (2.4.1), we see that the robust standard error becomes Js 2 times (k, k) element of (X'X)- 1,

(2.6.3)

which is the usual standard error in finite-sample theory. So the robust 1-ratio is numerically identical to the usual finite-sample 1-ratio when we setS = s 2 Sxx. Similarly, substituting (2.6.2) into the expression for the Wald statistic (2.4.2), we obtain

128

Chapter 2

W = n. (Rb- r)'(R[n. s 2 • (X'X)- 1]R')- 1(Rb- r)

= (Rb- r)'{R[s 2 . (X'X)- 1]R')- 1(Rb- r) = (Rb- r)'!R[(X'X)- 1]R'r\Rb- r)/s

(the two n's cancel)

2

(by the definition of (1.4.9) of the F -ratio) = (SSRR- SSRu)fs 2 (by (1.4.11)). = r ·F

Thus, when we set S = s 2 Sxx, the Wald statistic W is numerically identical to #r. F (where #r is the number of restrictions in the null hypothesis). Large-Sample Distribution of t and F Statistics

It then follows from Proposition 2.3 that the t-ratio (2.4.1) is asymptotically N(O, I) and #r · F asymptotically x2 (#r), if s 2 Sxx is consistent for S. That s2 Sxx is consistent for Scan be seen as follows. Under conditional homoskedasticity, the matrix of fourth moments S can be expressed as a product of second moments:

S = E(g;g;J = E(x;x;ef)

(since g;

= x; · e;)

= E[E(x;x;ef I X;)]

(by the Law of Total Expectations)

= E[x;x; E(e~

1

(by the linearity of conditional expectations)

= E(x;x' a 2 )

(by Assumption 2.7)

x; )]

(2.6.4) This decomposition has several implications. • (l:xx is nonsingular) Since by Assumption 2.5 Sis nonsingular, this decomposition of S implies that a 2 > 0 and l:xx is nonsingular. Hence, Assumption 2.4 (rank condition) is implied. • (No need for fourth-moment assumption) By ergodic stationarity S,, ->p l:xx. By Proposition 2.2, s 2 is consistent for a 2 under Assumptions 2.1-2.4. Thus, s 2 S,, ->p a 2 l:xx = S. We do not need the fourth-moment assumption (Assumption 2.6) for consistency. As another implication of (2.6.4), the expression for Avar(b) can be simplified: inserting (2.6.4) into (2.3.4) of Proposition 2.1, the expression for Avar(b) becomes (2.6.5) Thus, we have proved

129

Large-Sample Theory

Proposition 2.5 (large-sample properties of b, t, and F under conditional homo skedasticity): Suppose Assumptions 2.1-2.5 and 2. 7 are satisfied. Then (a) (Asymptotic distribution of b) The OLS estimator b of fl is consistent and asymptotically normal with Avar(b) = a'I;;-,1 •

-

(b) (Consistent estimation of asymptotic variance) Under the same set of assumptions, Avar(b) is consistently estimated by Avar(b) = s 2 s;,1 = n · s 2 • (X'X)- 1 •

(c) (Asymptotic distribution of the t and F statistics of the finite-sample theory) Under Ho: fh = fJk, the usual t-ratio (1.4.5) is asymptotically distributed as N(O, 1). Under Ho: Rfl = r, #r · F is asymptotically x2 (#r), where F is the F statistic from (1.4.9) and #r is the number of restrictions in H0 .

Variations of Asymptotic Tests under Conditional Homoskedastlclty According to this result, you should look up the N(O, I) table to find the critical value to be compared with the t-ratio (1.4.5) and the x' table for the statistic #r · F derived from (1.4.9). Some researchers replace the s 2 in (1.4.5) and (1.4.9) by ~ L;1 el. That is, the degrees of freedom n - K is replaced by n, or the degrees of freedom adjustment implicit in s 2 is removed. The difference this substitution makes vanishes in large samples, because limn~ 00 [nj(n- K)] = I. Therefore, regardless of which test to use, the outcome of the test will be the same if the sample size is sufficiently large. Another variation is to retain the degrees of freedom n - K but use the t (n K) table for the t ratio and the F(#r, n- K) table for F, which is exactly the prescription of finite-sample theory. This, too, is asymptotically valid because, as n - K tends to infinity (which is what happens when n --+ oo with K fixed), the t(n- K) distribution converges to N(O, I) (just compare the t table for large degrees of freedom with the standard normal table) and F (#r, n- K) to x2 (#r)j#r. Put differently, even if the error is not normally distributed and the regressors are merely predetermined (orthogonal to the error term) and not strictly exogenous, the distribution of the t-ratio (1.4.5) is well approximated by t(n - K), and the distribution of the F ratio by F(#r, n- K). These variations are all asymptotically equivalent in that the differences in the values vanishes in large samples and hence (by Lemma 2.4(a)) their asymptotic distributions are the same. However, when the sample size is only moderately large, the approximation to the finite-sample or exact distribution of test statistics may be better with t(n- K) and F(#r, n- K), rather than with N(O, I) and x2 (#r). Because the exact distribution depends on the DGP, there is no simple guide as to which variation works better in finite samples. This issue of which table- N (0, l)

130

Chapter 2

or I (n - K)- should be used for moderate sample sizes will be taken up in the Monte Carlo exercise of this chapter. QUESTIONS FOR REVIEW

1. (Inconsistency of finite sample-formulas without conditional homoskedasticity) Without Assumption 2.7, Avar(b) is given by (2.3.4) in Proposition 2.1. Is it consistently estimated by (2.6.2) without Assumption 2.7? [The answer is no. Why?] Is the /-ratio (1.4.5) asymptotically standard normal without the assumption? [Answer: No.] 2. (Advantage of finite-sample formulas under conditional homoskedasticity) Conversely, under Assumption 2.7, Avar(b) is given by (2.6.5). Is it consistently estimated by (2.3.5) under Assumption 2.7? If Assumption 2.7 holds, what do you think is the advantage of using (2.6.2) over (2.3.5) to estimate the asymptotic variance? [Note: The finite-sample properties of an estimator are generally better, the fewer the number of population parameters estimated to form the estimator. How many population parameters need to be estimated to form (2.3.5)? (2.6.2)?] 3. (Relation ofF to x2 ) Find the 5 percent critical value of F(IO, oo) and compare it to the 5 percent critical value of x2 (10). What is the relationship between the two? 4. Without conditional homoskedasticity, is (SSRR- SSRu)fs 2 asymptotically x2 (#r)? [Answer: No.] 5. (n R 2 test) For a regression with a constant, consider the null hypothesis that the coefficients of the K - I nonconstant regressors are all zero. Show that nR2 -+d x2 (K - I) under the hypothesis of Proposition 2.5. Hint: You have proved for a Chapter 1 analytical exercise that the algebraic relationship between the F -ratio for the null and R 2 is

R 2 j(K- I) F - -:-::---::.,;.,...,.,---...:...,::- (I - R2 )/(n - K).

Can you use the nR 2 statistic when the error is not conditionally homoskedastic? [Answer: No.]

131

large-Sample Theory

2.7

Testing Conditional Homoskedasticity

With the advent of robust standard errors allowing us to do inference without specifying the conditional second moment E(e1 1 x, ), testing conditional homoskedasticity is not as important as it used to be. This section presents only the most popular test due to White (1980) for the case of random samples. 13 Recall that S given in (2.5.1) (withe, = e;) is consistent for S (Proposition 2.4), and s 2 Sxx is consistent for cr 2 l:xx (an implication of Proposition 2.2). But under conditional homoskedasticity, S = cr 2 l:xx (see (2.6.4)), so the difference between the two consistent estimators should vanish:

(2.7.1)

Let Y,; be a vector collecting unique and nonconstant elements of the K x K symmetric matrix

x,x;.

(Construction of

1/r;

from x;X, will be illustrated in Example

2.7 below.) Then (2.7.1) implies n

c. = -1""2 L_.(e; - s2)Vr; -+ 0. n i=l P

(2.7.2)

This c, is a sample mean converging to zero. Under some conditions appropriate for a Central Limit Theorem to be applicable, we would expect .jTi c. to converge in probability to a normal distribution with mean zero and some asymptotic variance B, so for any consistent estimator Bof B, (2.7.3) where m is the dimension of c•. For a certain choice of B, this statistic can be computed as nR 2 from the following auxiliary regression: regress e~ on a constant and

1/r;.

13 See, e.g., Judge eta!. (1985, Section 11.3) or Greene (1997, Section 12.3) for other tests.

(2.7.4)

132

Chapter 2

White ( 1980) rigorously developed this argument 14 to prove Proposition 2.6 (White's Test for Conditional Heteroskedasticity): In addition to Assumptions 2.1 and 2.4, suppose that (a) (y;, X;} is i.i.d. with finite E(efx;x;) (thus strengthening Assumptions 2.2 and 2.5), (b) e; is independent ofx; (thus strengthening Assumption 2.3 and conditional homoskedasticity), and (c) a certain condition holds on the moments ofe; andx;. Then,

where R2 is the R2 from the auxiliary regression (2. 7.4), and m is the dimension of•/1;.

Example 2.7 (regressors in White's n R 2 test): Consider the Cobb-Douglas cost function of Section 1.7:

(TC)

(P'I) + {3 1og (P'2) - ' + e;.

log - ' = {3 1 + {32 1og(Q;) + /33log - ' Pi3

4

Pi3

Pi3

Here, x; = (I, log(Q;), log(pn I Pn), log(p;z/ pn)), a four-dimensional vector. There are 10 (= 4 · 5/2) unique elements in x;x;: , (PH) , log (p;z) P'I)' , log(Q;) ·log (p~) [log(Q;)f, log(Q;) ·log (-' , 2 (P'I) (p·z) [ (P'2)] 2 VI )] [log (P:3 ' log P:3 .log P:3 ' log P:3 . I, log(Q;), log -

Pi3

Pi3

Pi3

Pi3

Vr; is a nine-dimensional vector excluding a constant from this list.

So them

in Proposition 2.6 is 9. If White's test accepts the null of conditional homoskedasticity, then the results of Section 2.6 apply, and statistical inference can be based on the I- and F -ratios from Chapter I. Otherwise inference should be based on the robust 1 and Wald statistics of Proposition 2.3. Because the regressors >/r; in the auxiliary regression have many elements consisting of squares and cross-products of the elements of x;, the test should be consiste.lt (i.e., the power approaches unity as n --> oo) against most heteroskedas14 The original statement of White's theorem is more general as it covers the case where {Yi, X;} is independent, but not identically distributed (i.n.i.d.).

133

Large-Sample Theory

tic alternatives but may require a fairly large sample to have power close to unity. If the researcher knows that some of the elements of 1/t; do not affect the conditional second moment, then they can be dropped from the auxiliary regression and the power might be increased in finite samples. The downside of it, of course, is that if such knowledge is false, the test will have no power against heteroskedastic alternatives that relate the conditional second moment to those elements that are excluded from the auxiliary regression. QUESTION FOR REVIEW

1. (Dimension of 1/t;) Suppose x, = (1, q;, q,', p;)', a four-dimensional vector. How many nonconstant and unique elements are there in x,x;? [Answer: 8.]

2.8

Estimation with Parameterized Conditional Heteroskedasticity (optional)

Even when the error is found to be conditionally heteroskedastic, the OLS estimator is still consistent and asymptotically normal, and valid statistical inference can be conducted with robust standard errors and robust Wald statistics. However, in the (somewhat unlikely) case of a priori knowledge of the functional form of the conditional second moment E(ef I x; ), it should be possible to obtain sharper estimates with smaller asymptotic variance. Indeed, in finite-sample theory, the WLS (weighted least squares) can incorporate such knowledge for increased efficiency in the sense of smaller finite-sample variance. Does this finite-sample result carry over to large-sample theory? This section deals with the large-sample properties of the WLS estimator. To simplify the discussion, throughout this section we strengthen Assumptions 2.2 and 2.5 by assuming that {y;, X;} is i.i.d. This is a natural assumption to make, because it is usually in cross-section contexts where WLS is invoked. The Functional Form

The parametric functional form for the conditional second moment we consider is E(ef I x,) = z;a, where zi is a function of Xi.

(2.8.1)

134

Chapter 2

Example 2.8 (parametric form of the conditional second moment): The functional form used in the WLS estimation in the empirical exercise to Chapter 1 was Pi3 ,log (P'2)) Pi3 = a1 + E( s,2llog(Q;),log(Q;) 2,log (Pn)

a2 ·

( Q;1) .

Since the elements of z; can be nonlinear functions of x;, this specification is more flexible tban it might first look, but still it rules out some nonlinearities. For example, tbe functional form E(st I x;) = exp(z;a)

(2.8.2)

might be more attractive because its value is guaranteed to be positive. We consider tbe linear specification (2.8.1) only because estimating parameters in nonlinear specifications such as (2.8.2) requires tbe use of nonlinear least squares. WLS with Known cr

To isolate tbe complication arising from tbe fact tbat the unknown parameter vector a must be estimated, we first examine tbe large-sample distribution of tbe WLS estimator with known a. If a is known, tbe conditional second moment can be calculated from data as z;a, and WLS proceeds exactly as in Section 1.6: dividing botb sides of tbe estimation equation y, = x; P+ s; by tbe square root of z;a, to obtain (2.8.3) where

and tben apply OLS. For later reference, write the resulting WLS estimator as {i (V). It is given by _.. . ._ fl(V)

1

n

-1

1

n

= (~ 2)t,x;) ~ L i=l

1 "

=

(~ L

i= 1

ii, · _y,

i=l

1 -1 1 " z'ax,x;) ~

L

I

i=l

I

z~ax' · y, I

(2.8.4)

135

Large-Sample Theory

with

V -=

z;a [

l

..

.

z' a

"

If Assumption 2.3 is strengthened by the condition that

E(c; I x;) = 0,

(2.8.5)

then E(£; I x;) = 0. To see this, note first that, because z; is a function ofx,,

E(~ z;a

E(£; I x;) =

/ x;) =

}z;a E (c; I z;a

X;)= 0.

(2.8.6)

Second, because i; is a function of xi, there is no more information in i; than in X;.

So by the Law of Iterated Expectations we have E(i', 1 x,) = E[E(i', 1 x,) 1 x;] =

o.

Therefore, provided that E(x,x;) is nonsingu1ar, Assumptions 2.1-2.5 are satisfied for equation (2.8.3). Furthermore, by construction, the error i'; is conditionally homoskedastic: E(ef 1 x;) = 1. So Proposition 2.5 applies with rr 2 = I: the WLS estimator is consistent and asymptotically normal, and the asymptotic variance is Avar(P(V)) = E(x,x;)- 1

I.

= p

(since the error variance is I)

(1~--,)-1

lffi ;; ~X;Xi

(by ergodic stationarity)

1=1

I " l = plim(n L..,; z'.a

"-X; X'·) i=}

=

1

-1

(since X; = x;/ .jZiii)

I

plim(~x'v- 1 xt.

(2.8.7)

So (~X'V- 1 X)- 1 is a consistent estimator of Avar(B(V)). Regression of

ef on Z; Provides a Consistent Estimate of a

If a is unknown, it can be estimated by running a separate regression. (2.8.1) says that is a regression of cf. If we define ~i = "F - E(cf I x;), then (2.8.1) can be

z;a

136

Chapter 2

written as a regression equation: (2.8.8) By construction, E(~; I x;) = 0, which, together with the fact that Z; is a function of X;, implies that the regressors Z; are orthogonal to the error term ~;. Hence, provided that E(z;z;) is nonsingular, Proposition 2.1 is applicable to this auxiliary regression (2.8.8): the OLS estimator of a is consistent and asymptotically normal. Of course, this we cannot do because we don't observe the error term. However, since the OLS estimator b for the original regression y; = x;tJ + 8; is consistent despite the presence of conditional heteroskedasticity, the OLS residual e; provides a consistent estimate of 8;. It is left to you as an analytical exercise to show that, when 8; is replaced by e; in the regression (2.8.8), the OLS estimator, call it is consistent for a.

a,

WLS with Estimated

11t

The WLS estimation of the original equation y; = x;fJ + 8; with estimated a is P(V), where Vis the n x n diagonal matrix whose i-th diagonal is Under suitable additional conditions (see, e.g., Amemiya (1977) for an explicit statement of such conditions), it can be shown that

z;a.

(a) .jii (P(V) - {J) and .jii (P(V) - fJ) are asymptotically equivalent in that the difference converges to zero in probability as n -> co. Therefore, by Lemma 2.4(a), the asymptotic distribution of .jii (P(V) - {J) is the same as that of ./ii (p(V)- {J). So Avar(p(V)) equals Avar(p(V)), which in turn is given by (2.8.7); (b) plim lx'v-'x = plim lx'v- 1x. • • So ( l X'v- 1X)- 1 is consistent for Avar(P(V)) .

•

All this may sound complicated, but the operational implication for the WLS estimation of the equation y; = x;fJ + 8; is very clear:

Step 1: Estimate the equation y; = x;fJ + 8; by OLS and compute the OLS residuals ei. Step 2: Regress ef on Z;, to obtain the OLS coefficient estimate a. Step 3: Re-estimate the equation y; = x; fJ + 8; by WLS, using I I as the weight for observation i.

J7!:&

Since the correct estimate of the asymptotic variance, (~X'V- 1 X)- 1 in (b), is (n times) the estimated variance matrix routinely printed out by standard regression

137

Large-Sample Theory

packages for the Step 3 regression, calculating appropriate test statistics is quite straightforward: the standard t- and F -ratios from Step 3 regression can be used to do statistical inference. OLS versus WLS

Thus we have two consistent and asymptotically nonnal estimators, the OLS and WLS estimators. We say that a consistently and asymptotically nonnal estimator is asymptotically more efficient than another consistent and asymptotically nonnal estimator of the same parameter if the asymptotic variance of the former is no larger than that of the latter. lt is left to you as an analytical exercise to show that the WLS estimator is asymptotically more efficient than the OLS estimator. The superiority of WLS over OLS, however, rests on the premise that the sample size is sufficiently large and the functional fonn of the conditional second moment is correctly specified. If the functional fonn is misspecified, the WLS estimator would still be consistent, but its asymptotic variance may or may not be smaller than Avar(b). ln finite samples, even if the functional form is correctly specified, the large-sample approximation will probably work less well for the WLS estimator than for OLS because of the estimation of extra parameters (a) involved in the WLS procedure. QUESTIONS FOR REVIEW

1. Prove: "E(~; I x;) = 0, Total Expectations.

Z;

is a function ofx;" => "E(z; · ~;)

= 0." Hint:

Law of

2. Is the error conditionally homoskedastic in the auxiliary regression (2.8.8)? If so, does it matter for the asymptotic distribution of the WLS estimator?

2.9

Least Squares Projection

What if the assumptions justifying the large-sample properties of the OLS estimator (except for ergodic stationarity) are not satisfied but we nevertheless go ahead and apply OLS to the sample? What is it that we estimate? The answer is in this section. OLS provides an estimate of the best way linearly to combine the explanatory variables to predict the dependent variable. The linear combination is called the least squares projection.

138

Chapter 2

Optimally Predicting the Value of the Dependent Variable We have been concerned about estimating unknown parameters from a sample. Let us temporarily suspend the role of econometrician and put ourselves in the following situation. There is a random scalar y and a random vector x. We know the joint distribution of (y, x) and the value of x. On the basis of this knowledge we wish to predict y. So a predictor is a function j(x) of x with the functional form JO determined by the joint distribution of (y, x). Naturally, we choose the function JO so as to minimize some index that is a function of the forecast error y- j(x). We take the loss function to be the mean squared error E[(y- j(x)) 2] because it seems as reasonable a loss function as any other and, more importantly, because it produces the following convenient result:

Proposition 2.7: E(y I x) is the best-predictor ofy in that it minimizes the mean squared error.

We have seen in Chapter I that the add-and-subtract strategy is effective in showing that the candidate sol uti on minimizes a quadratic function. Let us apply the strategy to the squared error here. Let f (x) be any forecast. Add E(y I x) to the forecast error y - f (x) and then subtract it to obtain the decomposition y- j(x) = (y- E(y I x)) + (E(y I x)- j(x)).

(2.9.1)

So the squared forecast error is (y- J(x)) 2 = (y- E(y I x)) 2 + 2(y- E(y I x))(E(y I x)- f(x)) + (E(y I x)- j(x)) 2 .

(2.9.2)

Take the expectation of both sides to obtain mean squared error= E[ (y- j(x)) 2 ] = E[(y- E(y I x)) 2 ] +2E[(y- E(y I x))(E(y I x)- j(x))j + E[ (E(y I x)- j(x)) 2 ].

(2.9.3)

It is a straightforward application of the Law of Total Expectations to show that the middle term, which is the covariance between the optimal forecast error and the difference in forecasts, is zero (a review question). Therefore, mean squared error= E[(y- E(y I x)) 2 ] + E[(E(y I x)- f(x)) 2 ] > E[ (y- E(y

I xlf],

(2.9.4)

139

Large-Sample Theory

which shows that the mean squared error is bounded from below byE[ (y - E(y x)) 2], and this lower bound is achieved by the conditional expectation. Best Linear Predictor It requires the knowledge of the joint distribution of (y, x) to calculate E(y

I

I x),

which may be highly nonlinear. We now restrict the predictor to being a linear function ofx and ask: what is the best (in the sense of minimizing the mean squared error) linear predictor of y based on x? For this purpose, consider fJ* that satisfies the orthogonality condition E[x · (y- x' {J*)] = 0 or E(xx'){J* = E(x · y).

(2.9.5)

The idea is to choose {J* so that the forecast error y - x' {J* is orthogonal to x. If E(xx') is nonsingular, the orthogonality condition can be solved for {J*: {J* = [E(xx')r 1 E(x · y).

(2.9.6)

The least squares (or linear) projection of yon x, denoted E*(y I x), is defined as x' {J*, where {J* satisfies (2.9.5) and is called the least squares projection coefficients. Proposition 2.8: The least squares projection E *(y I x) is the best linear predictor of y in that it minimizes the mean squared error. The add-and-subtract strategy also works here. PROOF. For any linear predictor x'/i, mean squared error= E[ (y - x'/i) 2]

+ x' ({J* x' fJ*)'] + 2(/J* -

= E{ [ (y - x' {J*)

/i) ]'}

(by the add-and-subtract strategy) = E[ (y /i)' E[x . (y - x' {J*) J+ E[ (x' (fJ* - /i)) 2] = E[ (y - x' {J*) 2] + E[ (x' (p - {J*)) 2j (by the orthogonality condition (2.9.5)) > E[ (y - x'{J*)2j.

•

ln contrast to the best predictor, which is the conditional expectation, the best linear predictor requires only the knowledge of the second moments of the joint distribution of (y, x) to calculate (see (2.9.6)).

140

Chapter 2

If one of the regressors xis a constant, the least squares projection coefficients

can be written in terms of variances and covariances. Let x be the vector of nonconstant regressors so that

An analytical exercise to this chapter asks you to prove that F:'(y 1xl

= F:'(.v 11. x) =

J.L

+ y'x,

(2.9.7)

where y

= Var(x)- 1 Cov(i, y),

J.L

= E(y)- y'E(i).

This formula is the population analogue of the "deviations-from-the-mean regression" formula of Chapter I (Analytical Exercise 3). OLS Consistently Estimates the Projection Coefficients

Now let us put the econometrician's hat back on and consider estimating fJ*. Suppose we have a sample of size n drawn from an ergodic stationary stochastic process (y,, x,} with the joint distribution (y,, x,) (which does not depend on i because of stationarity) identical to that of (y, x) above. So, for example, E(x,x;l = E(xx'). By the Ergodic Theorem the second moments in (2.9.6) can be consistently estimated by the corresponding sample second moments. Thus a consistent estimator of the projection coefficients fJ' is

which is none other than the OLS estimator b. That is, under Assumption 2.2 (ergodic stationarity) and Assumption 2.4 guaranteeing the nonsingularity of E(xx'), the OLS estimator is always consistent for the projection coefficient vector, the fJ* that satisfies the orthogonality condition (2.9.5). QUESTIONS FOR REVIEW

1. (Unforecastability offorecast error) For the minimum mean square error forecast E(y I x), the forecast error is orthogonal to any function
141

Large-Sample Theory

2. (Forecasting white noise) Suppose {e;} is white noise. What is E'(e;

I e;_ 1,

e;-2, ... , e;-m)? What is E'(e; I I, e;-1, e;-z, ... , e;-m)? Is it true that E(e; e;_ 1, e;_ 2 , •.• , e;-m) = 0? Hint: Is it zero for the process of Example 2.4? 3. (Conditional expectations that are linear) Suppose E(y E'(y 1 I, x) = E(y 1 x).

I x) =

11- + y'x. Show:

4. (Partitioned projection) Consider the model y, = x;tJ + z;~ + e; with E(x; · e;) = 0, E(z; · &,) I 0, and E(z,x;) = 0. Thus, z, is not predetermined (i.e., not orthogonal to the error term), but it is unrelated to the predetermined regressor xi in that the cross moments are zero.

(a) Show that the least squares projection coefficient of x; in the projection of y, on x; and z, is {J. Hint: Calculate E'(e; I x;, z;). (b) What is the least squares projection coefficient of X; in the least squares projection of y; on x,? Hint: Treat z;~ + e; as the error term. (c) Which projection would you use for estimating {J? Hint: You want the error variance to be smaller.

2.10

Testing for Serial Correlation

As remarked in Section 2.3 (see (2.3.3)), when the regressors include a constant (true in virtually all known applications), Assumption 2.5 implies that the error term is a scalar martingale difference sequence (m.d.s.), so if the error is found to be serially correlated, that is an indication of a failure of Assumption 2.5. Serial correlation has traditionally been an important subject in econometrics, and there are available a number of tests for serial correlation (i.e., tests of the null of no serial correlation in the error term). Some of them, however, require that the regressors be strictly exogenous. The test to be presented in this section does not require strict exogeneity. Because the issue of serial correlation arises only in timeseries models, we use the subscript "t" instead of"i" in this section (and the next). Throughout this section we assume that the regressors include a constant. It would be nice to have those tests extended to cover serial correlation in g, (= x, . &, ), but no such tests have been proposed to gain acceptance. This is a gap in the literature, but not a serious one, because nowadays researchers know how to live with serial correlation in g,. That is, as will be shown in Chapter 6, there is available a method to do inference in the presence of serial correlation in g,.

142

Chapter 2

Earlier in this chapter we learned how to calculate standard errors that are robust to conditional heteroskedasticity. In Chapter 6, those standard errors will be made robust to serial correlation as well. Box-Pierce and Ljung-Box

Before turning to the tests for serial correlation in the error term, we temporarily step outside the regression framework and consider serial correlation in a univariate time series. Suppose we have a sample of size n, {Z!> ... , Zn), drawn from a scalar covariance-stationary process. In Section 2.2 we defined the (population) j -th order autocovariance Yi. The sample j -th order antocovariance is

Yi

l

=n

L"

(z,- Zn)(Zt-i- Zn)

(j = 0, I, ... ),

(2.10.1)

t=j+l

where I

"

Zn=-LZt· n 1=1

(If the population mean E(z,) is known, it can replace the sample mean Zn; doing

so would improve the small sample property.) Here, even though only n - j terms are in the sum, the denominator is n rather than n - j. Whether the sum of n - j terms is divided by n or by n - j does not affect large-sample results. For moderate sample sizes, however, the numerical difference can be substantial, and you should always be explicit about which is used as the denominator. The sample j -th order

autocorrelation coefficient, Pi, is defined as Pi

= -;-YiYo

(' 1 = 12 , , · · ·l·

(2.10.2)

If {z,} is ergodic stationary, then it is easy to show (see a review question) that )j is consistent for Yi (j = 0, l, 2, ... ). Hence, by Lemma 2.3(a), Pi is consistent for Pi (j = l, 2, ... ). In particular, if {z,} is serially uncorrelated, then all the sample autocorrelation coefficients converge to 0 in probability. To test for serial correlation, however, we need to know the asymptotic distribution of .JTiPi· It is provided by

Proposition 2.9 (special case of Theorem 6.7 of Hall and Heyde (1980)): Suppose {ztl can be written as /L + 1i1 , where 1i 1 is a stationary martingale difference sequence with "own" conditional homoskedasticity: (own conditional homoskedasticity) E(li; le1_ 1 , e1_ 2 ,

••• )

= a 2 , a 2 > 0.

143

Large-Sample Theory

Let the sample autocorrelation

.fiiy --+ d

Pj

be defined as in (2.10.1) and (2.10.2). Then

.fiip--+

N(O, a 4 1p) and

d

N(O, Ip),

Here, the process e, is not required to be ergodic. Proving this under the additional condition of ergodicity is left as an analytical exercise. Thus, asymptotically, (.fii times) the autocorrelations are i.i.d. and the distribution is N(O, 1). The process { e1 } assumed here is more general than independent white noise processes, but the conditional second moment has to be constant, so this result does not cover ARCH processes, for example. One way to test for serial correlation in the series is to check whether the firstorder autocorrelation, P~o is 0. Proposition 2.9 implies that

.fiif5t =

-

(2.10.3)

P./n --+ N(O, 1).

If n

d

So the /-statistic forrned as the ratio of p1 to a "standard error" of I/ .fii is asymptotically standard norrnal. We can also test whether a group of autocorrelations are simultaneously zero. Let p = (Pio ... , Pp)' be the p-dimensional vector collecting the first p sample autocorrelations. Since the elements of .fiip are asymptotically independent and individually distributed as standard norrnal, their squared sum, called the BoxPierce Q because it was first considered by Box and Pierce (1970), is asymptotically chi-squared: p

Box-Pierce Q statistic =n

p

LPJ = 'L,(.fii/lj} j=l

2

-;j

x2 (p).

(2.10.4)

j=l

It is easy to show (see a review question) that the following modification, called the Ljung-Box Q, is asymptotically equivalent in that its difference from the BoxPierce Q vanishes in large samples. So by Lemma 2.4(a) it, too, is asymptotically chi-squared: p

Ljung-Box Q statistic= n. (n

+ 2) L j=l

jj2 j

p

.

n-;

=

+2

L. n-; n . ( .fii/lj) j=l

2

--+ x2 (p). d

(2.10.5)

144

Chapter 2

This modification often provides a better approximation to the chi-square distribution for moderate sample sizes (you will be asked to verify this in a Monte Carlo exercise). For either statistic, there is no clear guide to the choice of p. If pis too small, there is a danger of missing the existence of higher-order autocorrelations, but if p is too large relative to the sample size, its finite-sample distribution is likely to deteriorate, diverging greatly from the chi-square distribution. Sample Autocorrelations Calculated from Residuals

Now go back to the regression model described by Assumptions 2.1-2.5. If the error terrn e, were observable, we would calculate the sample autocorrelations as Yj (' Pj=-:::;= I ' 2, ... ) ' Yo

(2.10.6)

where n

Yj

=-nI

"L...t .

Bt Gt-j

(j = 0, 1' 2, ... ) .

(2.10.7)

1=;+1

(There is no need to subtract the sample mean because the population mean is zero, an implication of the inclusion of a constant in the regressors.) Since [e,e,_j} is ergodic stationary by Assumption 2.2, Yj converges in probability to the corresponding population mean, E(e,e,_ j ), for all j, and Pj is consistent for the population j -th autocorrelation coefficient of e,. Next consider the more realistic case where we do not observe the error terrn. We can replace e, in the above forrnula by the OLS estimate e, and calculate the sample autocorrelations as A

Yj Pj=-;:Yo A

(' ; =12 , , ... ) ,

(2.10.8)

where ]

Yj=n

n

L

e,e,_j

(j=0,1,2, ... ).

(2.1 0.9)

t=J+I

(Because the regressors include a constant, the norrnal equation corresponding to the constant ensures that the sample mean of e, is zero. So there is no need to subtract the sample mean.) Is it all right to use Pj (calculated from the residuals) instead of Pj and the residual-based Q statistics derived from (pj} for testing for serial correlation? The answer is yes, but only if the regressors are strictly exogenous.

145

Large-Sample Theory

Recall that the expression linking the residual e1 to the true error terrn e, was given in (2.3.9). Using this, the difference between jij and Yj can be written as J

91 = -

n

=

~n

n

L.

e, e,_ 1

!=J+I

n

L

[e,- x;(b- fl)][e,_j- x;_/h- fl)]

t=j+l

I

= Yj-n

n

L

(Xr-j ·

e,

+ x,

· e,_j)'(b-

fl)

t=J+l

+ (b- flY(~

n

L

x, x;_j )
(2.10.10)

t=J+l

lfE(x, · e,_ j ), E(x,_ j · e1 ), and E(x,x;_) are all finite, then the second and the third terrns vanish (converges to zero in probability) because b - fJ -+ P 0. Therefore, ' v-v-+ J J p

0

(j = 0, I, 2, ... ),

and thus the difference between Pj and pj, too, vanishes in large samples. However, .jii times the difference does not. .jiipj and ../ii/3j can be written as (2.10.11) We have just seen that, for both .jiipj and .jiipj, the denominator -+p a 2 . So the difference between .jiipj and .Jii/3j will vanish if the difference between .jiijij and .jiiyj does (if you are not convinced, see Review Question 3 below). Now, by multiplying both sides of (2. 10.10) by .jii, we obtain

CL n

+ .Jii (b- fl)'

x, x;_j )
(2.10.12)

t=J+!

Because .jii (b- fl) converges to a random variable (whose distribution is norrnal), the third terrn on the RHS vanishes by Lemma 2.4(b). Regarding the second terrn, we have

146

Chapter 2

I

n

n

L.

(Xr-j ·<1 +x, ·<1-j)-> E(Xr-j ·e,)+E(x,· <1-j). • l=j+l

(2.10.13)

If the regressors are strictly exogenous in the sense that E(x, · e,) = 0 for all 1, s, then (2.10.14)

E(Xr-j · <1 )+ E(x, ·
and by Lemma 2.4(b) the second term converges to zero in probability. So the difference between Jnjjj and .fiipj vanishes, which means that the Q statistic calculated from the regression residuals {e,}, too, is asymptotically chi-squared, and we can use this residual-based Q to test for serial correlation. lf, on the other hand, the regressors are not strictly exogenous, then there is no guarantee that (2.10.14) holds. Consequently, the residual-based Q statistic may not be asymptotically chisquared. Testing with Predetermined, but Not Strictly Exogenous, Regressors

Therefore, when the regressors are not strictly exogenous, we need to modify the Q statistic to restore its asymptotic distribution. For this purpose, consider two restrictions: (stronger form of predeterminedness) E(e, I e,_,,
••• )

= 0,

(2.10.15) (stronger form of conditional homoskedasticity)

E(ei I e,_,, Cr-2·

... ,

x,, Xt-h

... ) = CJ

2

>

0.

(2.10.16)

The first condition is just reproducing (2.3.1) (with "t" now being used as the subscript). As was shown in Section 2.3, this is stronger than Assumption 2.3 and implies that g, (= x, · e,) is an m.d.s. Condition (2.10.16), with the conditioning set including x,, is obviously stronger than Assumption 2.7 (conditional homoskedasticity). It also is stronger than the own conditional homoskedasticity assumption in Proposition 2.9 because the conditioning set includes current and past x as well as past e. The next result shows that under these additional conditions there is an appropriate modification of the Q statistic. Proposition 2.10 (testing for serial correlation with predetermined regressors): Suppose that Assumptions 2.1, 2.2, 2.4, (2.10.15), and (2.10.16) are satisfied. Let the sample autocorrelation of the OLS residuals, Pi, be defined as in (2.1 0.8). Then,

Large-Sample Theory

147

,Jiiy---+ N(O, o- 4 -(Ip- ))

and

d

where y = (Y~o y2 , ... , Yp)', jJ = the p x p matrix ) is given by

(/i~o

fj 2 ,

,j/ijJ---+ N(O, Ip- ), d

fjp)', and t/lj, (

... ,

(2.10.17)

(j, k) element of

(2.10.18) The proof, though not very difficult, is relegated to Appendix 2.B. By the Ergodic Theorem, matrix is consistently estimated by its sample counterpart: --

"

= (t/Jj,),

" ¢jk

= it}S;x1[L,js 2

(j, k = I, 2, ... , p),

(2.10.19)

where llj

I

=n

L. n

Xt •Ct-j·

t=;+l

It follows from this and Proposition 2.10 that

modified Box-Pierce Q = n · jJ'(Ip- ~)- 1 p---+ d

x2 (p).

(2.1 0.20)

As long as the regressors are predetermined and the error is conditionally homoskedastic in the sense of (2.10.15) and (2.10.16), this modified Q statistic can be used even when the regressors are not strictly exogenous. An Auxiliary Regression-Based Test Although calculating this modified Q statistic is straightforward with matrix-based software, it is useful to find an asymptotically equivalent statistic that can be calculated from regression packages. For this ·purpose, consider the following auxiliary regressiOn: regress er on

Xr,

(2.1 0.21)

et-1, et-2 • ... , er-p·

To run this auxiliary regression fort= I, 2, ... , n, we need data on (e0 , e_ 1 , ... , e-p+J). It does not matter asymptotically which particular numbers to assign to them, but it seems sensible to set them equal to 0, their expected value. 15 From this auxiliary regression, we can calculate the F statistic for the hypothesis that the p coefficients of e,_ 1, e,_ 2, ... , e,_P are all zero. Given Proposition 2.5(c), it is only natural to wonder whether p · F is asymptotically x2 (p). This conjecture is indeed IS Another asymptotically equivalent choice is to run the auxiliary regression fort = p

+

1, p

+ 2, ... , n.

148

Chapter 2

true: under the hypothesis of Proposition 2.1 0, the modified Q statistic (2.10.20) is asymptotically equivalent to p . F (i.e., the difference between the two converges to zero in probability as n -+ oo), sop · F, too, is asymptotically chi-squared (showing this is an analytical exercise). This p · F statistic, in tum, is asymptotically equivalent to nR 2 from the auxiliary regression. This can be shown as follows. Recall the algebraic result from Chapter l that the F -ratio can be calculated from the difference in the sum of squared residuals between the unrestricted and restricted regressions. The unrestricted regression in the present context is (2.10.21) while the restricted regressiOn ts (2.10.22)

regress e1 on x 1 •

Therefore, if SSRu and S S RR are SSRs from (2.10.21) and (2.1 0.22), respectively, we have

p ·F =

SSRR- SSRu ( )SSRR- SSRu ) = n- #x,- p (2.!0.23 SSRu /(n- #x1 - p) SSRu

where #x, is the number of variables in x,. However, since e, is the residual from the original regression (a regression of y, on x,), the regressors x, in (2.!0.22) have no explanatory power. So SSRR = e'e where e is the n-dimensional vector of residuals from the original regression, and (2.10.23) is numerically identical to

Rz l - "'R2

(n- #x - p) I

"'

where R~, is the uncentered R 2 for the auxiliary regression (2.10.21 )l 6 But since the sample mean of e, is by construction zero (this is because x, includes a constant). R~, is numerically identical to the R 2 for the auxiliary regression. Thus, we have proved the algebraic result that

Rz p. F = (n- #x, - p) l - R2, where R 2 is equal to the R 2 for the auxiliary regression (2.1 0.21 ). Solving this equation for R 2 and multiplying both sides by n, we obtain 16 Recall from (1.2.16) that the uncentered R2 for a regression of yon xis defined as 2 _ y'y-e'e

Rue=

f

yy

·

In the present context, the y'y in tills fonnula is e1e while the e1e in che fonnula is SSRu.

149

Large-Sample Theory

1 n R2 = -----,n,---__ --......,--. P. F. n-#x,-p

I

+I!

p·F #xt

p

Since l! 'fixF! p -+p 0 as n -+ oo (an implicat.ion of Lemma 2.4(b)), this equation shows that p. F- nR 2 -+p 0. Therefore, nR 2 from the auxiliary regression (2.10.21) is asymptotically x2 (p). The test based on nR 2 is called the BreuschGodfrey test for serial correlation. When p = I, the test statistic is asymptotically equivalent to the square of what is known as Durbin's h statistic. 17 QUESTIONS FOR REVIEW

1. (Consistency of sample autocovariances) Show: if [z 1 } is ergodic stationary, then the sample autocovariance YJ given in (2.1 0.1) is consistent for YJ, the population autocovariance. Hint: YJ . E(z,z,_ 1)-E(z,) E(z,_1). Rewrite y1 as 11

1 - =- '"' Yj L...t n t=j-+1

11

Zt Zr-j-

1 '"' ZnL...t n t=j+l

11

Zt-j-

1 '"' ZnL...t n t=j+l

Zt

+

n-· 1 (-Zn )2 • n

2. (Asymptotic equivalence of two Q's) Prove that the difference between the Box-Pierce Q and the Ljung-Box Q converges to 0 in probability as n -+ oo (so their asymptotic distributions are the same). Hint: Let

Find a p-dimensional vector a, such that the difference between two Q's is a;,x,. Show that a, -+ 0. The asymptotic property to use is Lemma 2.4(b). 3. Consider vnPj and vnPJ in (2.10.11). We have shown that Yo -+p 0" 2 and ji0 -+p 0" 2 . By Proposition 2.9, .jnji1 -+d N(O, 0" 4 ). Taking these for granted, showthat.jnpj-vnPJ -+p Oif .jnjiJ-vnYJ -+p 0. Hint: ll.jnjiJ-vnYJ -+p 0, then by Lemma 2.4(a), .jny1 -+d N(O, 0" 4 ).

Multiply both sides by .jn and apply Lemma 2.4(b). 17 See Breusch (1978) and Godfrey (1978). The Breusch-Godfrey test was originally derived as a Lagrange Multiplier test for the case where the error is normally dislributed and x1 consists of fixed regressors and the lagged dependent variable. The discussion in the text shows that the test is applicable to the more general case considered in Proposition 2.1 0.

150

Chapter 2

2.11

Application: Rational Expectations Econometrics

In dynamic economic models and other contexts, expectations about the futme natlirally play an important role. Rational expectations econometrics concerns estimating equations involving expectations. Expectations are usually not observable, but, if we assume that economic agents form expectations rationally, we can

overcome the problem of unobservable expectations using the techniques of this chapter. The application we consider is Fama's efficient market hypothesis. In his own words, "An efficient capital market is a market that is efficient in processing information .... In an efficient market, prices 'fully reflect' availahle information" (Fama, 1976, p. 133). The words "fully reflect" can be formalized precisely. We will do so for a particular capital market and test its implication. The Efficient Market Hypotheses

The capital market studied in Fama (1975) is the market for U.S. Treasury bills. In this section we focus on the one-month Treasury bill rates observed on a monthly basis. (Later in the book we will also consider Treasury bills of different maturities.) To formalize market efficiency for the bills market, we need to introduce a few concepts from macroeconomics and finance. Since in this section we deal exclusively with time-series data, we use "I" rather than "i" for the subscript. Define v,

= price of a one-month Treasury bill at the he ginning of month I,

R,

=one-month nominal interest rate over month t. i.e., nominal return on the bill over the month= (I- v,)jv,, so v, = 1/(1

P,

+ R,),

= value of the Consumer Price Index (CPI), which is om measure of the price level at the beginning of month I,

rr,+ 1

=inflation rate over month I (i.e., from the beginning of month I to the beginning of month

,rr,+ 1

1Jr+I

1

+ I)= (P,+ 1 -

P, )/ P,,

= expected

inflation rate over month I, expectation formed at the beginning of month I,

=inflation forecast error=

7rr+I -

r1l'"r+1,

1/ P,+ 1 - v,j P, r,+ 1 = ex-post real interest rate over month I = ....:......:...:..:....--,-...:..:.....:. v,j P, I+ R, :---'- - I ~ R, - Jrr+I, I

,r,+ 1

+ rr,+I

=ex-ante real interest rate =

I+ R, 1 + r1rr+I

- 1 : : : : R, - r1rr+I·

151

Large-Sample Theory

Month t -------r-----------------------------+--------~tlme

t+l

t

R, P, Figure 2.2: What's Observed When

Timing--the value of which variable is ohserved at what point in time-is very important in rational expectations econometrics. Our rule for dating variables (which is different from Fama's) is that the variable has subscript I if its value is first ohserved at the beginning of period (month) 1 -' 8 For example, because rr,+I> the inflation rate over month 1, depends on P,+ I> its subscript is 1 + I rather than 1. For the same reason, r,+l, the ex-post real interest rate over month 1 has subscript I + I. In contrast, the nominal interest rate R, over month 1 has subscript I, not 1 + I, because it is determined by the price of the Treasury bill at the beginning of month I, v,. Figure 2.2 shows the value of which variable is observed at what point in time. We take a period to be a month, only because our data are monthly. Note that the maturity of the security (one month) coincides with the sampling interval (a month). If the sampling interval is finer than the maturity, which is the case if, for example, we have monthly data on the three-month Treasury bill rate, then maturities overlap and we need a more sophisticated technique to be introduced in Chapter 6. The efficient market hypothesis is a joint hypothesis combining: Rational Expectations. Inflationary expectations are rational: ,rr,+ 1 = E(rr,+ 1 I I,), where I, is information available at the beginning of month I and includes {R,, R,_ 1 , ••. , rr,, rr,_,, ... }. Also, I, :::J I,_, :::J I,_ 2 . . . . That is, agents participating in the market do not forget. Constant Real Rates. The ex-ante real interest rate is constant: ,r,+ 1 = r.

18 tn Fama's (1975) notation, our Hr+l is ~3. , rr+I is Tt+I, and It is ¢r. 1

152

Chapter 2

Testable Implications We derive two testable implications of the efficient market hypothesis by utilizing the following key observations ahout the inflation forecast error under rational expectations. (a)

E(~t+l

I I,)

= 0, that is, the inflation forecast error is a martingale difference

sequence with respect to the information set.

(b)

(~,, ~ 1 - 1 , ... )

as well as (R,, R,_ 1 , ••• , rr,, rr,_ 1 , ... ) are included in I, (known at the beginning of month t). That is, agents remember all past mistakes.

(a) makes intuitive sense; if people use all available information to forecast the inflation rate, the forecast error, which is known only after the fact, will not have any systematic relation to what people knew when they formed expectations. It can be proved easily: E(~t+l

I /,)

= E(nr+I- r1l'"r+l I lr)

(since 17r+l

=

7rr+l- r7rr+J)

= E(rr,+ 1 I /,)- E(,rr,+ 1 11,) = E(rr,+l II,)- E[E(rr,+l II,) II,]

(since ,rr,+ 1 = E(rr,+ 1 I I,) by rational expectations) = E(rr,+l II,)- E(rr,+l I/,)= 0.

(2.11.1)

(h) follows because agents remember past inflation forecasts. Since t-j-lrrt-j = E(rr,_j l/,_j_ 1) is a function of f,_j-l• it is included in f,_j-l (i.e., known at the beginning of month t - j - 1) and hence in I, (known at the beginning of month t) for j > 0. Thus ~t-j = rr1 -j - t-j- 1rr,_j is included in I, for all j 2: 0. Observations (a) and (b), together with the Law of Total Expectations, imply (c)

{~,)is

m.d.s.

So, it is a zero mean and serially uncorrelated process. These are not testable because expectations are unobservable, but, when combined with the constant-real-rate assumption, they imply two testable implications about the inflation rate and the interest rate which are ohservable.

Implication 1: The ex-post real interest rate has a constant mean and is serially uncorrelated.

153

Large-Sample Theory

This follows because rt+I

=

Rr - 1ft+I

= (Rr - t1ft+I)= r'r+l - ~r+l = r - ~r+l

(7rr+l -

r1ft+I)

(by definition)

(since r'r+I = r by assumption).

(2.11.2)

By (c), {r,} has mean rand is serially uncorrelated. Implication 2: E(rrr+l II,)= -r

+ R,.

That is, the best (in the sense of minimum mean squared error) inflation forecast on the basis of all the available information. is the nominal interest rate; the nominal rateR,, which is determined by the price of a Treasury hill v,, summarizes all the currently available information relevant for predicting future inflation. This is the formalization of asset prices "fully reflecting" available information. This, too, can he derived easily. Solve (2.11.2) for 7rr+I as Jrr+I = -r + R, + ~r+I· So E(rrr+l II,)

+ R, + ~r+I -r + R, + E(~r+l

= E(-r

II,)

=

llr) (since r is a constant and R, is included in I,)

= -r

+

R,

(by (a)).

(2.11.3)

In the rest of this section, we test these two implications of market efficiency. Testing for Serial Correlation

Consider first the implication that the ex-post real rate has no serial correlation, which Fama tests using the result from Proposition 2.9. We use the monthly data set used by Mishkin (1992) on the one-month T -bill rate and the monthly CPI inflation rate, stated in percent at annual rates 19 Figure 2.3 plots the data. The ex-post real interest rate, defined as the difference between the T-bill rate and the CPI inflation rate, is plotted in Figure 2.4. To duplicate Fama's results, we take the sample period to be the same as in Fama (1975), which is January 1953 through July 1971. The sample size is thus 223. The time-series properties of the real interest rate are summarized in Table 2.1. In calculating the sample autocorrelations Pj = Yj fy0 , I9The data set is described in some detail in the empiricaJ exercise. Following Fama, the inflation measure used in the text is calculated from the raw CPl numbers. The inflation measure used by Mishkin (1992) treats the housing component of the CPI consistently. See question (k) of Empirical Exercise 1.

25 ,-------------------------------------------, 20

-5

-10

~----------------------------------------~

1/55

1/60 1/65 1-month T-bill rate

1/70

1/75 1/80 1/85 ·· ·· · inflation rate

1/90

Figure 2.3: Inflation and Interest Rates

15,------------------------------------------, 10

-"" "c. ~

"

-15

1/55

1/60

1/65

1/70

1/75

Figure 2.4: Real Interest Rates

1/80

1/85

1/90

Table 2.1: Real1nterest Rates, January 1953-July 1971 mean"" 0.82%, standard deviation= 2.847%, sample size= 223

pj std. error Ljung-Box Q p-value (%)

j =I

j=2

-0.101 0.067 2.3 12.8%

0.172 0.067 9.1 1.1%

j=9

j = 10

j =II

j = 12

-0.019 -0.004 -0.064 -0.021 -0.092 0.095 0.094 0.067 0.067 0.067 0.067 0.067 0.067 0.067 10.1 9.1 9.1 10.2 12.1 14.2 16.3 2.8% 7.3% 11.7% 9.6% 7.6% 6.1% 5.8%

0.019 0.067 16.4 8.9%

0.004 0.067 16.4 12.8%

0.207 0.067 26.5 0.9%

)=3

j=4

)=5

j=6

j=7

j=8

156

Chapter 2

we use the formula (2.1 0.1) for yj where the denominator is the sample size rather than n - j. Similarly, the formula for the standard error of pj is I/ .jii rather than 1/ ..jn j (so the standard error does not depend on j). The Ljung-Box Q statistic in the tahle for, say, j = 2 is (2.10.5) with p = 2. The results in the table show that none of the autocorrelations are highly significant individually or as a group, in accordance with the first implication of the efficient market hypothesis. Is the Nomlnallnterest Rate the Optimal Predictor?

We can test Implication 2 by estimating an appropriate regression and carrying out the t-test of Proposition 2.3. The remarkable fact about the efficient market hypothesis is that, despite its simplicity, it is specific enough to imply the important parts of the assumptions justifying the t-test 20 To see this, let y, = rr,+h x, (1, R,)', £, = ~t+l (the inflation forecast error), and rewrite (2.11.2) as (2.11.4) so Assumption 2.1 is ohviously satisfied with

p=

(-r, !)'.

(2.11.5)

The other assumptions of the model are verified as follows. • Assumption 2.3 (predetermined regressors). The g, (= x, · £,) in the present context is (~t+l, R 1 ~t+J)'. What we need to show is that E(~t+I) = 0 and E(R 1 ~ 1 + 1 ) = 0. The former is implied by (a) and the Law of Total Expectations: E(~t+I) = E[E(~t+I

I I,)]=

0.

(2. II .6)

The latter holds because E(R,~t+I)

= =

I I,)] (by the Law of Total Expectations) E[R, E(~,+I I I,)] (by linearity of conditional expectations and E[E(R,~,+I

= 0

(by (a)).

R, E I,) (2.11. 7)

• Assumption 2.5. By (b), I, includes (s,_ 1 (= ~,), £,_ 2 , ... ) as well as (x,, x,_ 1, ... ). (a) then implies (2.3.1), which is a sufficient condition for [g,} to he m.d.s. 20 The technical part-that the fourth-moment matrix E(grg;) exists and is finite and that the regressors have finite fourth moments -is not implied by the efficient market hypothesis and needs to be assumed.

157

Large-Sample Theory

• Assumption 2.4. E(x,x;) here is ' = [ E(R,) I E(x,x,)

E(R, )] E(R~)

.

(2.11.8)

Its determinant is E(R~)- [E(R,)f = Var(R,), which must be positive; if the variance were zero, we would not observe the interest rate R, fluctuating. • Assumption 2.2 (ergodic stationarity). It requires [n,+ 1, R,) to be ergodic stationary. The plot in Figure 2.3 shows a stretch of upward movements, especially for the nominal interest rate, although the stretch is followed by a reversion to a lower level. Testing stationarity of the series will be covered in Example 9.2 of Chapter 9. For now, despite this rather casual evidence to the contrary, we proceed under the assumption of stationarity (but the implication of having trending series will be touched upon in the final subsection). For ergodicity, we just have to assume it.

What we have shown is that, under the efficient market hypothesis, [y,, x,} (where y, = n,+ 1 and x, = (I, R, )') belongs to the set of DGPs satisfying Assumptions 2.1-2.5. Having verified the required assumptions, we now proceed to estimation. Using our data, the estimated regression is, for t = 1I 53, ... , 7/71, n,+I = -0.868

(0.431)

+

1.015 R,, (0.112)

R 2 = 0.24, mean of dependent variable = 2.35, SER = 2.84%, n = 223. (2.11.9)

-

The heteroskedasticity-robust standard errors calculated as in (2.4.1) (where S is given by (2.5.1) with 1 = e1) are shown in parentheses. As shown in (2.11.5), market efficiency implies that the R, coefficient equals 121 The robust t-ratio for the null that the R, coefficient equals I is (1.015- 1)/0.112 = 0.13. Its p-value is about 90 percent. Thus, we can easily accept the null hypothesis. There are other ways to test the implication of the nominal interest rate summarizing all the currently available information. We can bring in more explanatory variables in the regression and see if they have an explanatory power over and above the nominal rate. The additional variables to be brought in, however, must he part of the current information set I, hecause, obviously, the efficient market

e

21The intercept is known to be -r, but it does not produce a testable restriction since the efficient market hypothesis does not specify the value of r.

158

Chapter 2

hypothesis is consistent with the existence of some variable not currently available predicting future inflation better than the current nominal rate. Formally, if x, includes variables not included in I,, the argument we used for verifying Assumption 2.3 is no longer valid.

R, Is Not Strictly Exogenous As has been emphasized, one important advantage of large-sample theory over finite-sample theory of Chapter I is that the regressors do not have to be strictly exogenous as long as they are orthogonal to the error term. We now show by example that R, is not strictly exogenous, so finite-sample theory is not applicable to the Fama regression (2.11.4). Suppose that the process for the inflation rate is an AR(l) process: n, = c If~,+ 1

+ pn,_ 1 + ~,.

. (~,}is

independent white noise.

is independent of any elements of 1,, then

E(rr,+1 II,)

+ pn, + E(~ 1 + 1 II,) c + pn, (since E(~ 1 + 1

= c

(since rr, is in 1,)

=

II,) =

E(~ 1 + 1 )

and

E(~ 1 +J)

= 0 by assumption).

So ~,+ 1 is indeed the inflation forecast error for rr,+ 1 and

R, = r + c + pn, (since R, = r + E(rr,+ 1 I 1,) under market efficiency). It is then easy to see that the error term e, in the Fama regression (of "'+ 1 on a constant and R, ). which is the inflation forecast error ~<+I, can be correlated with future regressors. For example, Cov(e,, R,+Il

= Cov(~<+h r

+ c + P"<+Il

= pCov(~<+1·"'+J)

= p

Var(~ 1 + 1 )

(since rr,+ 1 = c + pn,

+ ~,+ 1 and Cov(~,+ 1 , rr,) =

0),

which is not zero. Although finite-sample theory is not applicable, its prescription for statistical inference is approximately valid, provided that the error is conditionally homoskedastic and the sample size is large enough, which was the message of Section 2.6.

Large-Sample Theory

159

Subsequent Developments Although Fruna's paper represents perhaps the finest example of rational expectations econometrics, events after its publication and a large number of empirical studies it inspired proved that the near one-for-one relation between inflation and interest rates is limited to the postwar period until 1979. Such strong association cannot be found for the prewar period or for many other countries. Furthermore, the increase in the real interest rates after the change in the Fed's operating procedure that took place in October 1979 runs counter to the premise of the Fama regression that the expected real interest rate is constant over time. When the sample is restricted to the post-October 1979 period, the interest rate coefficient in the Fama regression is far below unity. 22 These findings raise the question of why the strong association occurs only for certain periods but not for others. The explanation by Mishkin (1992) is that the inflation premium gets incorporated into interest rates only gradually. In periods when the inflation rate shows only short-run fluctuations, interest rates are not responsive to inflation. However, sustained movements in inflation will be reflected in interest rates. The period when the strong association is observed (which includes Fruna's srunple period, see Figure 2.3) is precisely when inflation showed a sustained upward movement. On January 29, 1997, the U.S. Treasury auctioned $7 billion in ten-year inflation-indexed bonds, making it possible for researchers to observe the ex-ante real interest rate as the yield on indexed bonds. Evidence from Great Britain, where indexed bonds have been available since the early 1980s, is that the yields on indexed bonds, although much less volatile than the yield on ordinary bonds, are not constant over time. The constant-real-rate assumption that we made was the auxiliary assumption whereby we can test whether bond prices fully reflect available information. Now we can do so without the auxiliary assumption by regressing the actual inflation rate on the yield differential between the conventional and indexed bonds. We can also relax the assumption, made implicitly so far, that the yield differential equals the expected inflation rate. Investors may not be risk-neutral, and the inflation risk premium over and above the expected inflation rate may he needed to induce them to hold ordinary bonds. If so, the expected inflation rate is once again unobservable. Rational expectations econometrics will continue to be a useful tool for dealing with unobservable expectations.

22 see, for example, Mishkin (1992, Table I). We will verify this in part (l) of Empirical Exercise J.

160

Chapter 2

QUESTIONS FOR REVIEW

1. Let R, = 5% and rr,+ 1 = 2%. Calculate r,+ 1 using the two formulas, (1 + R,)/(1 + rr,+ 1) - 1 and R,- rr,+I· Is the difference small? What if R, = 20% and rr,+ 1 = 17%?

-.

2. Show that E (rr,+ 1 Hint: Use (2.9.7).

I I, R,)

3. If the inflation forecast

= -r + R, under the efficient market hypothesis.

error~'+'

were observable, how would you test market

efficiency? Do we need the constant ex-ante real rate assumption?

4. If the inflation rates and the interest rates were measured in fractions rather than percent (e.g., 0.08 instead of 8%), how would the regression result (2. I 1.9) change? If the inflation rate is in percent but per month and the interest rate is in percent per year? Hint: If x is ttie inflation rate per month and y is its annual rate, then 1 + y =(I +x) 12 so that y"" 12x. 5. Suppose in (2. I 1.4) x, includes a third variable which is not in I,. Which part of the argument deriving Assumption 2.3 fails? Hint: The third element of g, is the third variable times ~'+'· 6. Provide an example of market inefficiency such that Implication I holds but Implication 2 does not. Hint: Suppose rr, and R, are serially independent and mutually independent processes.

2.12

Time Regressions

Throughout this chapter, we have assumed that [y,, x,} is stationary. This assumption, however, is not always satisfied in time-series regressions. In this book, we examine two cases where the regressors are not stationary and yet OLS is applicable. The first is examined here, while the other case, called "cointegrating regressions," is covered in Chapter 10. The regression we consider is written as

y, = a

+ 8 · t + £,,

(2. I 2.1)

where [£,} is independent white noise. This regression can be written as y, = x;p +£,with

161

Large-Sample Theory

x, = (1, t)', fJ =(a, 8)'.

(2.12.2)

Clear! y, x, is not stationary because the mean of the second element is t, which increases with time. Similarly, y, is not stationary. The linear function, a+ 8. t, is called the time trend of y,. We say that a process is trend stationary if it can he written as the sum of a time trend and a stationary process. The process [y,} here is a special trend-stationary process where the stationary component is independent white noise. The Asymptotic Distribution of the OLS Estimator

Let b be the OLS estimate of fJ based on a sample of size n: (2.12.3) The sampling error can be written as (2.12.4) Using the algebraic result that L:7~ 1 t 1)/6, it is easy to show

~ "

XX ' -

' ' -

[

= n · (n + 1)/2, L:;'~ 1 t 2 =

n · (n + 1)(2n +

n·(n+l)/2] n · (n + I) /2 n · (n + 1)(2n + 1)/6 · n

(2.12.5)

So, unlike in the stationary case, I:;~ I x,x;;n does not converge in probability to a nonsingular matrix; it actually diverges. It turns out that the OLS estimates & and 8are consistent but have different rates of convergence. As in the stationary case, the rate of convergence for & is -lfi. In contrast, the rate for 8 is n 312 That is, n 312 (8 - 8) converges to a nondegenerate distribution (the distribution of a nonconstant random variable). To assign those different rates of convergence to the elements of b - {J, consider multiplying both sides of (2.12.4) by the matrix (2.12.6)

162

Chapter 2

This produces

Yn(b- Pl = =

..jii(&

_a)]

[ n'l'(i; _ o)

(~X1 X1

= Yn

~X 1 n

)

· 81

Yn(f=x,x;tYn Y;;- 1 (I:X, ·8,) t~I

t=I

n

=

1

')_ (

n

-I

1

[-r;;- (I;x,x;)-r;;-

1 ]

n

(-r;;- I;x, ·8,) 1

t=I

t:I

= Qn-1 Vn,

(2.12.7)

where n

Q.

= Y;;-

1

(I: x,x; )-r;;-

n

1

and

Vn = Y;;-

t= l

1

L x, · t=

I

Substituting (2.12.5) and (2.12.6) into these expressions for Qn and

Qn = [ (n

+

I l)/(2n)

(2.12.8)

8,.

(n + l)/(2n) ] 2 (n + 1)(2n + l)/(6n )

Vn, we obtain

'

(2.12.9) Clearly, we have

Qn--> Q

=

I/2] . [1/2I 1/3

(2.12.10)

For vn. it can be shown (see, e.g., Hamilton, 1994, pp. 458-460) that

Vn

--> d

N(O, u2 Q)

(2.12.11)

if [8,} is independent white noise with E(812) = u 2 and E(8~) < oo. Thus the asymptotic distribution of (2.12.7) is normal with mean 0 and variance Q- 1 (u 2 Q) · Q- 1 = u 2 Q- 1 . Summarizing the discussion so far,

Proposition 2.11 (OLS estimation of the time regression): Consider the time regression (2. 12. I) where 8 1 is independent white noise with E(8;) = u 2 and

163

Large-Sample Theory

E(s;') < oo and let & and 8 be the OLS estimate of a and 8. Then

a)] (

..fii(&[ n'1'(8 - 8)

2 [ 1 1/2]-l)

0, a . 1/2 1/3

--: N

.

As in the stationary case, & is ..fii-consistent because ..jii(& - a) converges to a (nonnal) random variable. The OLS estimate of the time coefficient, 8, too is consistent, but the speed of convergence is faster: it is n 3i 2-consistent in that n 312 (8 -8) converge to a random variable. In this sense, 8 is hyperconsistent. 23 Hypothesis Testing for Time Regressions Thus, the OLS coefficient estimates of the time regression are asymptotically nor-

mal, provided the sampling error is proRerly scaled. We now show that the deflation of the sampling error by the standard error provides a scaling that makes the resulting ratio- the t-value- asymptotically standard nonnal. First consider the t-value for the null a = a 0 . Noting that n

I

(2.12.12)

(I, I) elementof(I_>,x;r =[I t=l

the t-value can he written as

-

a -ao

t = --,====~====;=

js

2

·[I

01(2::7= 1 x,x;r' [b]

..jii(<x -

ao)

..fii(&- ao)

Js Js

2

·[I

(since [ ..fii 0] = [I

O]'Y' nl

Ol'Y'n(I:7=• x,x;r'rn[b] ..fii(&- ao)

2

·[I

OJ(Y;;'(I:7=• x,x;)r;;')

..fii(&- ao)

(by (2.12.8)).

(2.12.13)

O]Q;'[b]

23Estimators that are n-consistent are usually called superconsistent. A prominent example of superconsistent estimators arises in the estimation of unit-root processes; see Chapter 9. Some authors use the term "superconsistent" more broadly, to refer to estimators that are nY -consistent withy > !-In their language, 3in the text is superconsistent.

164

Chapter 2

It is straightforward (see review question I) to show that s 2 is consistent for a 2 .

And hy (2.12.1 0), Q.

-->p

Q. Thus hy Lemma 2.4(c) and Proposition 2.11,

t -value for the null of a = ao

--> d

I st element of z

J a 2 x (I, 1) element of Q

1

(2.12.14)

where z ~ N(O, a 2 Q- 1). So the t-value for a is asymptotically N(O, 1), as in the stationary case. Using the same trick and noting that [0 n 312 ] = [0 l]Y., we can write the t-value for the null of 8 = 80 as (2.12.15)

So the t-value for the coefficient of time, too, is asymptotically N(O, I)! Inference about a or 8 can be done exactly as in the stationary case. QUESTIONS FOR REVIEW

1. (s 2 is consistent for a 2 ) Show that s 2

-->p

a 2 Hint: From (2.3.10), derive

Sum overt to obtain n

n

" 2 I", 2 , I , -1L..e, = - L..'i- -(b-/3) Y.v. +-(b- {3) Y.Q.Y.(b -{3). n t=I n t=l n n

2. (Serially uncorrelated errors) Suppose [r 1 } is a general stationary process, rather than an independent white noise process, but suppose that v. -->d N (0, V) where V oft a 2 Q. Are the OLS estimators, & and 8, consistent? Does Y n (b {3) converge to a normal random vector with mean zero? If so, what is the variance of the normal distribution? [Answer: Q- 1VQ- 1.] Are the t-values asymptotically normal? [Answer: Yes, but the variance will not be unity.]

Appendix 2.A: Asymptotics with Fixed Regressors

To prove results similar in conclusion to Proposition 2.5, we add to Assumption 2.1 the following.

Assumption 2.A.l: X is detenninistic. As n --> oo, s.. --Y 1:,,, a nonsingular matrix (conventional convergence here, because X is nonrandom). Assumption 2.A.2: {e;} is i.i.d. with E(e,)

= 0 and Var(e;) = 0' 2 .

Proposition 2.A.l (asymptotics with fixed regressors): Under Assumptions 2.1, 2.A. I, and 2.A.2, plus an assumption called the Lindeberg condition on {g,} (where g, = X; • &; ), the same conclusions as in Proposition 2.5 hold. Rather than stating the Lindeberg condition, we explain why such condition is needed. To prove the asymptotic normality of b, we would take (2.3.8) and show that .jii g converges to a normal distribution. The technical difficulty is that the sequence {g;} is not stationary thanks t() the assumption that {x;} is a sequence of nonrandom vectors. For example,

is not constant across observations. Consequently, neither the Lindeberg-Levy CLT nor the Martingale Differences CLT is applicable. Fortunately, however, there is a generalization of the Lindeberg-Levy CLT to nonstationary processes with nonconstant variance, called the Lindeberg-Feller CLT, which places a technical restriction on the tail distribution of g; to prevent observations with large x, from dominating the distribution of .fii g.

Appendix 2.8: Proof of Proposition 2.10 This appendix provides a proof of Proposition 2. 10. We start with the expression for yj given in (2. 10. 12). The first part of the proof is to find an expression that is asymptotically equivalent.

.fiiyj = .fiijij-

~ L"

(Xr-j · Br

+ x,

· Br-j)' .fii(b-

t=j+l

+ .fii(b- Pl'(~

t

x,x;_j )
t=j+l

(since the last term above vanishes)

/J)

166

Chapter 2

I (sincen =

L" (x,_j ·

£1

t=j+l

VnYj -IL}Vn(b- /3)

+ x, · <1-j)-->P E(x1-j · £ 1 + x, where

· <1-j))

llj = E(x 1 • s,_j)

(since E(x,_ j · r 1 ) = 0 by (2.10.15)) =

I

n

1

=

"

VnYj - IL}.Jris;;;- L x, . r, Vnn

t=l

1

n

L &,r,_j -IL}-./iiS;;!,-n LX n

t=j+l

1

C:rfr- j -

1

1

1

JL} Jns;x -

n

1=1

(since

(by (2.10.7))

t=l

n

'"a'"' Jn-n L

r,

1

n

L t=I

1

n

-./ii- L &1<1- j - -./iin

n

t=I

=

Xt · er n

L

r,r,_j

t=j+l

L

I ' r.:: s,r,_ j --> 0 for each j)

vn

t=I

P

(since Sxx --> :!:xxl p

where gj, ((K+l)xl)

Thus, letting

we have proved that

_ [
=

Xr 'Bt

(2.B.I)

167

Large-Sample Theory

where

(p(Kfl)xp)

-

g

=

[CJ ·. I

n

n

t=I

J'

=- I:g,,

{p(K+l)xl)

g,

~

{p(K+l)xl)

gill : [gpl ·

The second part of the proof is to show that g, is a martingale difference sequence, namely, that E(gj, I gr-1, g,_,, ... ) = 0

(for j = I, 2, .... p).

But this easily follows from the Law of Total Expectations, the fact that (g1_ 1, g,_,, ... ) has less information than (x,, Xr-1, ... ,
s,_,, ... ),

. , ) E( g)lgkt ((K +l)x(K+Ill

_ [E(t:?
-

E(x;s?tr-j)] 2 I E(t:1 x,x,)

'

By using the Law of Total Expectations and (2.10.16), it is easy to show that 1

E(gj,g,,) ((K+l)x(K+l))

=

a •sjk [(J 2 ILk

a ' JL1,· 2 a 1:xx

J

,

(2.B.2)

where 8j, is "Kronecker's delta," which is I if j = k and 0 otherwise. Since, as shown above, ~y ~ C' ~ g and~ g---.. N(O, E(g,gi)), we have a

Avar(y) (=variance of limiting distribution of ~y) = C' E(g,g;)c. Its (j, k) element is (j, k) element of Avar(y) =

<

E(gj 1 g~1 )c,.

168

Chapter 2

Substituting (2.B.2) into this and using the definition of

Cj

in (2.B.l ), we obtain,

after some simple calculations, (j, k) element of Avar(y) = cr 4 · [8jk-

p.;I:;:: p.dcr

2

].

So

where is defined in Proposition 2.10. Since .jfip ~ .jfiyjcr 2 , the limiting

" is N(O, lp- ). distribution of .jfip is the same as that of fo y jcr 2 , which

•

PROBLEM SET FOR CHAPTER 2 - - - - - - - - - - - - ANALYTICAL EXERCISES

1. (Taken from Example 3.4.2 of Amemiya (1985)) Let

z, be defined by

with probability (n- 1)/n, with probability l j n.

Show that plimn---+oo 2.

= 0 but limn-HXl E(zn) = oo.

Zn

Prove Chebychev's weak LLN, by showing that

z,

-~-' =

z, ~m.<. J.l.. Hint:

(z,- E(z,,)) + (E
So 2

u,- J.1.) 2 = (z,- E(z,)) + 2(z,- E(z,l)(E
E[(z,- E(z,l)(E(z,) -1-')]

=

o.

3. (Consistency and Asymptotic Normality of OLS for Random Samples) Con-

sider replacing Assumption 2.2 by Assumption 2.2': [y;, X;) is a random sample.

and Assumption 2.5 by Assumption 2.5': S

= E(g;g;) exists and is finite.

169

large-Sample Theory

Prove the following simplified version of Proposition 2.1:

Proposition 2.1 for Random Samples: Under Assumptions 2.1, 2.3, 2.4, and 2.2'. the OLS estimator b is consistent. Under the additional assumption of Assumption 2.5', the OLS estimator b is asymptotically normal with Avar(b) given by (2.3.4). (a) Show that this is a special case of Proposition 2.1 of the text. (b) Give a direct proof of this proposition. Hint: Use Kolmogorov and Linderberg-Levy. 4. (Proof of Proposition 2.4) We wish to prove Proposition 2.4. To avoid inessential complications, we assume as in the text that K = I (only one regressor). What remains to he shown is that the sample mean of the first term on the RHS of (2.5.3) converges in probability to some finite number. Prove it. Hint: Let f = x;<; and h = xf. The Cauchy-Schwartz inequality states:

So

By assumption. E(x;
I:; s;x(

--+p some finite number

by ergodic

stationarity. 5. (Direct proof that the change in SSR divided by u 2 is asymptotically Section 2.6, we proved that

(SSRR- SSRu)fs 2 -+ d

x2 )

In

x'\#r)

as a corollary to Proposition 2.3. Give a direct proof, first by showing that

6. (Optional, consistency of a) In this question we prove the claim made in Section 2.8 that&, the OLS estimate for the regression of on z;, is consistent.

ef

170

Chapter 2

Let

a be the OLS estimate from (2.8.8)

(a) Make appropriate assumptions in addition to Assumptions 2.1 and 2.2 to show that a is consistent. Hint: You need a rank condition for 1 .

z

(b) Derive:

1 n (-

-I

1

1=1

with v1 = -2(b- {3)'x1 · £;

n

L z,z;) -n L. n .

a- a=

Z; . V;

1=1

+ (b- {3)'x1x;(b- {3). Hint: The discrepancy

ef

and the squared OLS residual from the OLS estimation of between the original equation y1 = x;p + £; is given by (2.3.10). Substituting it into (2.8.8) gives:

ef =

z;a

+ (ry1 + v1).

(c) To avoid inessential complications, suppose that x1 is a scalar x1. Then

~

Li zi ·Vi in(*) becomes l n -2(b- {3)- I:x;£;Z; n.

]

n

+ (b- {3) 2- I:xtz;. n.

1=1

(**)

1=1

Show that the plim of the first term is zero. Hint: E(x 1£1z1 ) = E[z 1 · x 1 • E(£;

I x 1)].

Show that the plim of the second term is zero if E(xfz1) exists and is finite. (d) Thus we have proved that a - a vanishes. Prove the stronger result that .Jii(a -a) vanishes. Hint: Use Lemma 2.4(b). 7. (Optional, proof that WLS is asymptotically more efficient than OLS) In Section 2.8, the WLS estimator is denoted {3 (V). We wish to prove that it is asymptotically more efficient than OLS, namely, ~~

Avar@(V)) < Avar(b). But since Avar(P(V)) = Avar(P(V)), it is sufficient to prove that Avar(P(V)) < Avar(b). Prove this last matrix inequality. Hint: In Chapter 1, we proved the algebraic

171

Large-Sample Theory

result that

for any positive definite matrix V. (This follows from the fact that GLS is more efficient than OLS; the LHS is (a 2 times) the variance of the OLS estimator for the generalized regression model while the RHS is (a 2 times) the variance of the GLS estimator.) Dividing both sides of(*) by nand taking probability limits yields:

(plim~x'xr'(plim~x'vx)(plim~X'xr' :o: plim(~x'v-'xr'.

(**l

-

The RHS is Avar(/J(V)) (see (2.8.7)). Show that the LHS equals Avar(b) = ..--1 L.o XX

8 ..--1 L.o XX "

8. Prove (2.9.7). Hint: If (f.i, y) is the least squares projection coefficients, it satisfies

E(xx')

[~] = E(x · y),

or

[~] = [E(xx'lr

1

E(x · y),

where E xx' = [ l < l E(x) The first equation of(*) is f.i

E(x)' ] E x . = [ E(y) ] E(xx'l • < Yl E(x . .vl ·

+ E(x)'y =

E(y). Use this to eliminate f.i from the

rest of the equations of(*) and then solve for y. In the process, use Var(x) = E(xx')- E(x) E(x), Cov(x, y) = E(x · y)- E(x) E(y). Alternatively, use the formula for the partitioned inverse (see (A.10) of Appendix A) to obtain E(xx'l-' = [l

+ E(x)' ~ar(x)-~ E(x) 1 - Var(x)- E(x)

- E(x)' ~ar(x)-'] . Var(x)- 1

Then substitute this into(**)· 9. (Proof of Proposition 2.9 for p = l) Assume that{<,} is an ergodic stationary martingale difference sequence and that E(
I 8 1_ 1 ,
,
= a 2 < oo.

172

Chapter 2

Let g1

= Br · Br-1

and

i/j

I = n

n

L

,

'

Yt

(j=O,I), PI = -;-. Yo

Br · Er- j

t=j+I

(a) Show that {g,} (t = 2, 3, ... ) is a martingale difference sequence. (b) Show that E(g;) = a 4 • (c) Show that .fii YI

--->d

N(O, a 4 ) as n

--->

co.

(d) Show that .fii PI --->d N(O, 1). Hint: First show that .fii y1ja 2 Then use Lemma 2.4(c).

--->d

N(O, 1).

10. (Asymptotics of sample mean of MA(2)) Let (<-I, 8 0 , 8t. 8 2 , ... ) be i.i.d. with mean zero and variance a,2. Consider a process (Y~t. Yo, Yt. Y2 .... ) gen-

erated by Y~I=<-I,

yo=8o+et<-I,

y,=8,+e,8,~t+e2 8,~2

(t=l,2, ... ).

(This process is called a second-order moving average process (MA(2)); see Chapter 6. (a) Show that (y 1 , y 2 , ..• ) is covariance stationary. Derive the expressions for the autocovariances Yi (j = 0, I, 2, ... ). (b) Let r,i

= E(y, I Y•~i· Y•~j~t• ···,Yo. Y~tl - E(y, I Y•~j~t. Y•~j~2· ···,yo, Y~tl (t = j, j +I, ... ; j = 0, I, 2, ... ).

Show that

Hint: There is a one-to-one mapping between (<-t. 8 0 , 8I, ... , 8,) and (Y~I, Yo. YI· ... , y,) so E(y, I Y•~i· ... , Y~tl = E(y, I 8,~j· ... , <-Il· (c) Let

I

Yn = -(y, n

+ Y2 + · · · + Ynl·

173

Large-Sample Theory

Show that

(d) Because {y,} is not i.i.d, the Lindeberg-Levy CLT is not applicable. However, it will be shown in Chapter 6 that ..jii Yn converges in distribution to a normal random variable. What is the mean of the limiting distribution? What is the variance of the limiting distribution (i.e., Avar()i.))? Hint: In Lemma 2.1, set Zn = ..jii y., and limHooO - qfn) = l for any fixed q. 11. (Optional, Breusch-Godfrey test for serial correlation) In this question, we prove that the modified Box-Pierce Q is asymptotically equivalent to pF from auxiliary regression (2.10.21), where pis the number of lags and F is the Fratio for the hypothesis that the coefficients of e,_,, e,_ 2 , ... , e,_P are all zero. First we establish the notation. Let

(n~K) B

= []'

-

((K+p)x(K+Pil-

E

(nxp)

-

0

0

e,

0

ez

e1

0

e,

ez

e1

en- I

en-2

en-p

0

[!X'X !E'E !X'E] , B-l - _ n

~E'X

n

e

(nx I)

-

[l ["] Yt

9

-

(pxl)

n

=

:

9p '

where e, is the residual from the original regression y, = x;tl + s, and 91 is the sample j-th order autocovariance, defined in (2.10.8), calculated from the residuals. (a) The auxiliary regression (2.1 0.21) has K + p regressors. Let ft be the vector of the K + p coefficients. Show that

a• -_ -B-1

[(K~ll] " . y

(px I)

Hint: Since e is the vector of residuals from the original regression with X as regressors, X' e = 0.

174

Chapter 2

(b) Show:

where H

(Kxp)

= [E(x, · e,_ 1 )

· ··

E(x, · e,_p)J.

Hint: The j-th column of ~X'E is~ L:7~Ht x, · e,_i· Use (2.3.9) to show that it converges in probability to E(x, · e,_i)-

(c) (very easy) Show:

a -->p 0.

(d) Let SSR here be the sum of squared residuals from the auxiliary, not the original, regression. Show:

SSR

-....:..::--- ..... "2

n-K-p P

'

where a 2 is the variance of e1, the error term from the original regression. Hint: [X : E] a = EY, So SSR = (e- Ey)'(e- Ey). ~E'e = y, ~E'E -->p u 2 Ip. (e) Show:

PF =

22 nVil !' -=c,....,...:'-----=---,-

SSR/(n- K- p)

Hint: Apply the formula for the F-ratio in (1.4.9) to the auxiliary regression.

(f) Use the formula for the partitioned inverse (see (A.l 0) of Appendix A) to show

~

(g) Let p and be as in (2.10.8) and (2.10.19). Show that the modified BoxPierce Q defined in (2.10.20) is asymptotically equivalent to pF (i.e., the difference between the two converges to zero in probability). Hint: Show that both p F and the modified Q are asymptotically equivalent to

where is defined in (2.10.18).

175

Large-Sample Theory

12. (Optional, Chow test for structural change in large samples) Consider applying the Chow test for structural change to the regression model with conditional homoskedasticity described in Proposition 2.5. Since the issue of structural change arises mostly in time-series models, we use "t" for the subscript. We generalize Assumption 2.1 by allowing the coefficient vector to change at the break date r: fort = I, 2, ... , r, fort= r +I, r + 2, ... , n. We assume that the break date r is known. (When the break date is unknown, the situation is more complicated and has been an object of recent research. See Stock (1994, Section 5) for a survey.) The null hypothesis to test is that

P1

= p2 . This is a set of K restrictions. Let SSR 1 be the sum of squared residuals from the first period (t = I, 2, ... , r), SSR 2 be the SSR from the

second period (t = r +I, r + 2, ... , n), and SSRR be the SSR from the entire sample under the constraint p1 = p2 . From Chapter I, the Chow statistic is defined as

F _ "-[S::::SR.,:R:---,_.:.,(S:::S:::R'-:-1+":-:--SS_R-'2::-:)]':7/K_ - (SSR1 + SSRz) / (n - 2K) . Let b 1 be the OLS estimate of P1 obtained from the first period and b 2 be the OLS estimate of P2 from the second period. (a) Show that K F is numerically equal to

where s2 = (SSR 1 +SSR2 )/(n -2K). Hint: Then equations can be written in matrix form as y =

XP + e, where

X

X =

(nx2K)

Xz ((n-,)xK)

[

I 0

0

Xz]

l~~

("K)

=

[

,

P (2Kxl)

:

,

p

,+J

x' :

x'] [~; I

' XI =

=

[Pz] I

.

176

Chapter 2

The null hypothesis that {3 1 = {3 2 can be written as R/3 = r where R = [lx : - lx] and r = 0. Use Proposition 1.4. (b) Let A = r In. Show that, as n --+ oo with A fixed,

(c) Show that, as n --+ oo with A fixed,

It can be shown (see, e.g., Stock, 1994, Section 5) that

)n L7=r+l XtE:r are asymptotically uncorre1ated.

)n I:;~ I x,e,

and

So

Show that

bt - b2 = [IK : -lx ]b. Use Lemma 2.4(c). (e) Finally, show that K F

-+d

x2 (K) as n --+ oo with A =

r In fixed.

EMPIRICAL EXERCISES

1. Read the introduction and Sections I-IV of Fama (1975), before doing this exercise. In the data file MISHKIN.ASC, monthly data are provided on: Column 1: year Column 2: month

Large-Sample Theory

Column 3: Column 4: Column 5: Column 6: Column 7:

177

one-month inflation rate (in percent, annual rate; call this PAil) three-month inflation rate (in percent, annual rate; call this PA/3) one-month T-hill rate (in percent, annual rate; call this TBJ) three-month T-bill rate (in percent, annual rate; call this TB3) CPI for urban consumers, all items (the 1982-1984 average is set to 100; call this CPJ).

The sample period is February 1950 to December 1990 (491 observations). The data on PAil, PA/3, TBJ, and TB3 are the same data used in Mishkin (1992) and were made available to us by him. The T-bill data were obtained from the Center for Research in Security Prices (CRSP) at the University of Chicago. The T-bill rates for the month are as of the last business day of the previous month (and so can be taken for the interest rates at the beginning of the month). The construction of PAil and PA/3 will be described toward the end of this exercise; for the time being, we will use the inflation derived from CPl. (a) (Library/internet work) To check the accuracy of the data in MISHKIN.ASC, find relevant tables from back issues of Treasury Bulletin or Federal Reserve Bulletin for hard-copy data on T-bill rates. Or visit the web sites of the Board of Governors (www.bog.frb.fed.us) or the Treasury Department (www.ustreas.gov) to accomplish the same. Can you find one-month T-bill rates? [Answer: Probably not.] Are the rates in MISHKIN.ASC close to those in the relevant tables? Do they appear to be at the beginning of the month? (b) (Library/internet work) Find relevant tables from back issues of Monthly Labor Review (or visit the web site of the Bureau of Labor Statistics, www.bls.gov) to verify that the CPI figures in MISHKIN.ASC are correct. Verify that the timing of the variable is such that a January CPI observation is the CPI for the month. Regarding the definition of the CPI, verify the following. (I) The CPJ is for urban consumers, for all items including food and housing, and is not seasonally adjusted. (2) Prices of the component items of the index are sampled throughout the month. When is the CPI for the month announced? (c) Is the CPI a fixed-weight index or a variable-weight index? Hint: Dig up your old intermediate macro textbook; graduate macro textbooks won't do. The one-month T-hill rate for month t in the data set is for the period from the beginning of month t to the end of the month (as you just verified). Ideally, if we had data on the price level at the beginning of each period, we would

178

Chapter 2

calculate the inflation rate for the same period as (Pr+l - P,)/ P,, where P, is the beginning-of-the-period price level. We use CPI for month t - I for P, (i.e., set P, = CP/,_ 1). Since the CPT component items are collected at different times during the month, there arises the inevitable misalignment of the inflation measure and the interest-rate data. Another problem is the timing of the release of the CPT figure. The efficient market theory assumes that P, is known to the market at the beginning of month t when the T-bill rates for the month are set. However, the CPI for month t - I, which we take to be P" is not announced until sometime in the following month (month t ). Thus we are assuming that people know the CPI for month t- I at the beginning of month t. TSP Tip: When reading in the data and calculating the inflation rate, you

should exploit TSP's ability to handle calendar dates. The initial part of your TSP program might look like: ? The data are monthly, 1950:2 thru 1990:12 freq m;smpl 50:2 90:12; ? Read in the ASCII data read(file='mishkin.asc') year month pail pai3 tbl tb3 cpi; ? Calculate inflation rate and the real rate smpl 50:3 90:12; pai=( (cpi/cpi(-1) )**12-1)*100; r = tbl-pai; RATS Tip: Similarly, the initial part of your RATS program might look like:

* The data are monthly, 1950:2 thru 1990:12 cal 50 2 12;all 0 90:12 * Read in the ASCII data open data mishkin.asc data(org=obs) I year month pail pai3 tbl tb3 cpi * Calculate inflation rate and real rate set pa1 = ( (cpi(t)/cpi(t-1))**12-1)*100 set r = tbl(t)-pai(t)

(d) Reproduce the results in Table 2. I. Because the T-bill rate is in percent and at an annual rate, the inflation rate must be measured in the same unit. Calculate rrr+J, which is to be matched with TBI, (the one-month T-bill rate for month 1), as the continuously compounded rate in percent:

179

Large-Sample Theory

[(;,'t -1]

X

100.

Gauss Tip: One way to compute the sample autocovariances by formula (2.1 0.1) and the sample autocorrelations by (2.10.2) is the following. Let z be then-dimensional vector whose i -th component is z; and nobs be the sample size, n.

z==z-meanc ( z) ; nlag=l2;

/*de-mean the ser1es */ /* lag length */

rho:::::zeros(nlag,l); j=l;do until j>nlag; rho[j]=z'*shiftr(z' ,j,O) '/nobs; /* sample autocovariances */ j=j+l;endo; rho=rho./(z'*z/nobs); /* sample autocorrelations */

TSP Tip: Use the BJIDENT command. It does not calculate p-values, so (unless you are willing to learn how to use TSP's matrix commands) give up the idea of producing p-values. RATS Tip: Use CORRELATE or BOXJENK. (e) Can you reproduce (2.11.9), which has robust standard errors? What is the interpretation of the intercept?

Gauss Tip: For the heteroskedasticity-robust standard errors, you have to calculateS by formula (2.5.1) with i!; = e; (OLS residual). An example of the Gauss codes is the following. Let x be the n x K data matrix and eha t the n-dimensional vector of OLS residuals. If g is then x K matrix whose i-th row is x; · e;, the Gauss code for generating g is g = x. * eha t. The Gauss code for calculating Sis g' g /nabs, where nobs is the sample size. This, however, does not exploit the fact that S is symmetric. The more computationally efficient command is moment ( g, 0) /nabs.

TSP Tip: Use the option HCTYPE= 0 of OLSQ. RATS Tip: Use the ROBUST ERRORS option of LINREG. (f) (Davidson-MacKinnon correction of White) The finite-sample properties

of the robust t-ratio might be improved by the Davidson-MacKinnon adjust-

180

Chapter 2

ment discussed in the text. Calculate robust standard errors using the three formulas discussed in the text. The first is the degrees of freedom correction of multiplying S by n/(n- K). The second is to calculateS by formula (2.5.5) with d = I. The third is (2.5.5) with d = 2. These correspond to TSP's OLSQ options HCTYPE = 1, 2, 3.

-

-

Gauss Tip: Let p be the n-dimensional vector to store the p; 's of (2.5.5), sxxinv = s;:,'. x = the data matrix X. One way to compute pis

p=zeros(nobs,1); i;l;do until i>nobs; p[i)=x[i, .)*sxxinv*x[i, .) '/nobs; i=i+l;endo; (g) (Estimation under conditional homoskedasticity) Test market efficiency by regressing ll"r+l on a constant and TBJ, under conditional homoskedasticity (2.6.1). Compare your results with those in (e) and (f). Which part is different? (h) (Breusch-Godfrey test) For the specification in (g), conduct the BreuschGodfrey test for serial correlation with p = 12. (The n R 2 statistic should be about 27.0.) Let e, (t = 1, 2, ... , n) be the OLS residual from the regression in (g). To perform the Breusch-Godfrey test as described in the text, we need to set e, (t = 0, 1, ... , -11) to zero in running the auxiliary regression for t = I, 2, ... , n. TSP Tip: The TSP codes for calculating the nR 2 statistic are the following. resid is the OLS residual from the original regression. ? Part (h): Breusch-Godfrey

resid=@res; smpl 52:1 52:12; resid=O; smpl 53:1 71:7; olsq resid c tb1 resid(-11 resid(-2) resid(-3) resid(-4) resid(-5) resid(-6) resid(-7) resid(-8) resid(-9) resid(-10) resid(-11) resid(-12); set W=@nob*@rsq;cdf(chisq,df=12) w; Here, w is the n R 2 statistic.

large-Sample Theory

181

RATS Tip: The RATS codes for calculating the n R 2 statistic are the following. resid is the OLS residual from the original regression. * Part (h) : Breusch-Godfrey set resid 53:1 71:7 = pai-%beta(1)-%beta(2)*tb1 set resid 52:1 52:12 = 0 linreg resid 53:1 71:7; # constant tb1 resid{l to 12} compute w = %nobs*%rsquared cdf chisquared w 12

Here, w is then R 2 statistic. Gauss Tip: As in the Gauss codes for (d), the shiftr command is useful here. Let y (223 x I) and x .(223 x 2) be the vector of the dependent variable and the matrix of regressors, respectively, from the previous regression. Let b (2 x I) be the vector of estimated coefficients from the previous regression. Your Gauss program for creating the y and x for the auxiliary regression might look like: "@ Part (h): Breusch-Godfrey @"; y=y-x*b; /* residual vector from the previous

regression */ j=1;do until j>12; X=x-shiftr(y' ,j,O) '; j=j+1;endo; (i) (Reconstructing Fama) What are Farna's own point estimate and standard

error of the nominal interest rate coefficient? Are they identical to your results? (Farna uses different notation, so you need to translate his results in our notation.) Why is his estimate of the intercept different from yours? (One obvious reason is that our data are different, but there is another reason.) Optional: What are the differences between his data and our data? 0) (Seasonal dummies) The CPI used to calculate the inflation rate is not sea-

sonally adjusted. To take account of seasonal factors while still using seasonally unadjusted data in the regression, define twelve monthly dummies, M l, M2, ... , M 12, and use them in place of a constant in the regression in (g). Does this make any difference to your results? What happens if you include a constant along with the twelve monthly dummies in the regression? Optional: Is there any reason for preferring seasonally unadjusted data to seasonally adjusted?

TSP Tip: Use the DUMMY command to create monthly dummies. RATS Tip: Use the SEASONAL command to create monthly dummies.

A well-known problem with the CPI series is that its residential housing component is home mortgage interest costs for periods before I983 and a more appropriate "rental equivalence" measure since I983. The inflation variables PAil and PA/3 in MISHKIN.ASC are calculated from a price index that uses the rental-equivalence measure for all periods. For more details on this, see Section II of Huizinga and Mishkin (1984). The timing of the variables is such that a January observation for PAil is calculated from the December and January data on the price index and PAI3 from the December and March data on the price index. So, under the assumption that the price index is for the end of the month, a January observation for PAil is the inflation rate during the month, and a January observation for PAi3 is the inflation rate from the beginning of January to the end of March. (k)

Estimate the Fama regression for I/53-7nl using this better measure of the one-month inflation rate and compare the results to those you obtained in (e).

(I)

Estimate the Fama regression for the post-October 1979 period. Is the nominal interest rate coefficient much lower? (The coefficient should drop to 0.564.)

2. (Continuation of the Nerlove exercise of Chapter I, p. 76) (i) For Model4, carry out White's nR 2 test for conditional heteroskedasticity.

(nR 2 should be 66.546). 0) (optional) Estimate Model 4 by OLS. This is Step I of the procedure of Section 2.8. Carry out Step 2 by estimating the following auxiliary regres-

sion: regress ef on a constant and 1/ Q;. Verify that Step 3 is what you did in part (h). (k) Calculate White standard errors for Model 4. MONTE CARLO EXERCISES

1. (Degrees of freedom correction) In the model of the Monte Carlo exercise of Chapter I, we assumed that the error term was normal. Assume instead that<; is uniformly distributed between -0.5 and 0.5. (a) Verify that the DGP of the second simulation (where X differs across simulated samples) satisfies Assumptions 2.1-2.5 and 2.7. (It can be shown that

183

Large-Sample Theory

Assumption 2.2 [ergodic stationarity] is also satisfied. This is an implication of Proposition 6.1 (d).) (b) Run the second simulation of the Chapter I Monte Carlo exercise with the uniformly distributed error term. In each replication, calculate the usual t -ratio, as before, and compare it with two critical values. The first is the same critical value (2.042) implied by the t (30) distribution, and the second is the critical value (1.96) from N(O, 1). Compute the rejection frequency for each critical value. Which one is closer to the nominal size of 5%? 2. (Box-Pierce vs. Ljung-Box) We wish to verify the claim that the small sample properties of the Box-Ljung Q statistic are superior to those of the Box-Pierce Q. Generate a string of 50 i.i.d. random numbers with mean 0. (Choose your favorite distribution.) Taking this string as data, do the following. (1) Calculate the Box-Pierce Q and the Ljung-Box Q statistics (see (2.1 0.4) and (2.10.5)) with p = 4. (The ensemble mean is 0 by construction. Nevertheless take the sample mean from the series when you compute the autocorrelations.) (2) For each statistic, accept or reject the null of no serial correlation at a significance level of 5%, assuming that the statistic is distributed as x2 (4). Do this a large number of times, each time using a different string of 50 i.i.d. random numbers generated afresh. For each Q statistic, calculate the frequency of rejecting the null. If the finite-sample distribution of the statistic is well approximated by x2 (4), the frequency should be close to 5%. Which statistic gives you the frequency closer to the nominal size of 5%? Do we fail to reject the null too often if we use the Box-Pierce Q?

ANSWERS TO SELECTED Q U E S T I O N S - - - - - - - - - ANALYTICAL EXERCISES

3. From (2.3.7) of the text, b- p = s;;ig. By Kolmogorov, S., -->p E(x,x;l and g -->p 0. Sob is consistent. By the Lindeberg-Levy CLT, vfn g -->d N(O, S). The rest of the proof should now be routine. 7. What remains to be shown is that the LHS of(**) equals Avar(b) =

J:,, =

plim

~X' X n

J:;;,1 SJ:;;,1.

184

Chapter 2

and S = E(efx,X,) = E[E(ef I x1)x,x;J = E[(z;a)x 1x;J

= plim

(by (2.8.1))

~n.t(z;a)x,x;

(by ergodic stationarity)

l=l

I ' VX = plim -X n

(by definition (2.8.4) ofV).

EMPIRICAL EXERCISES

1c. The CPI is probably the most widely used fixed-weight price index. 1e. The negative of the intercept= r (the constant ex-ante real interest rate). 1g. Point estimates are the same. Only standard errors are different. 1i. Looking at Fama's Table 3, his R, coefficient when the sample period is 1/53 to 7171 is 0.98 with a s.e. of 0.10. Very close, but not exactly the same. There are two possible explanations for the difference between om estimates and Fama"s. First, in calculating the inflation rate for month t, 1l"r+l = (Pr+l - P,)f P,, it is not clear from Fama that CPI,_ 1 rather than CPI, was used for P,. Second, the weight for the CPI at the time of his writing may be for 1958. The weight for the CP I in our data is for 1982-1984. Our estimate of the intercept ( -0.868) differs from Fama's (0.00070) because, first, his dependent variable is the negative of the inflation rate and, second, the inflation rate and the T-hill rate are monthly rates in Fama. 1j. The seasonally adjusted series manufactured by the BLS is a sort of two-sided moving average. Thus, for example, seasonally adjusted CPI, depends on seasonally unadjusted values of future CPI, which is not in I,. Thus if there is no reason to take account of seasonal factors in the relationship between the inflation rate and the nominal interest rate, one should use seasonally unadjusted data. If there is a need to take account of seasonal factors, it is better to include seasonal dummies in the regression rather than use seasonally adjusted data. Inclusion of seasonal dummies does not change results in any important way. 1k. The R, coefficient drops to 0.807 from 1.014.

Large-Sample Theory

185

References Amemiya, T., 1977, "A Note on a Heteroscedastic Model," Jmmwl of Econometrics, 6, 365-370. ___ , 1985. Advanced Ecmwmetrics. Cambridge: Harvard University Press. Anderson, T. W., 1971, The Statistical Analysis of Time Series, New York: Wiley. Billingsley, P., 1961, "The Lindcberg-Levy Theorem for Martingales," in Proceedings £!{the American Mathematical Society, Volume 12, 788-792. Box. G., and D. Pierce. 1970, "Distribution of Residual Autocorrelations in AutoregressiveIntegrated Moving Average Time Series Models," lou mal of the American StatiMical Association, n5, 1509-1526. Rreusch, T., 1978. "Testing for Autocorrelation in Dynamic Linear Models," Australian Economic Paper~;, 17, 334-355. Davidson, R., and J. MacKinnon, 1993, Estimation and Inference in &·onometrics, Oxford: Oxford University Press. Durbin. J., 1970, "Testing for Serial Correlation in Least-Squares Regression When Some of the Regressors are Lagged Dependent Variables,'' Econometrica, 38, 410-421. Engle, R., 1982, "Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation,'' Econometrica, 50, "987-1 007. fama. E., 1975, "Short-Tenn Interest Rates as Predictors of Inflation," American Economic Review, 65, 269-282. ___ . 1976, fOundations of Finance, Basic Books. Godfrey, L., 1978, "Testing against General Autoregressive and Moving Average Error Models When the Regressors Include Lagged Dependent Variables," Econometrica, 46, 1293-1301. Greene, W., 1997, Econometric Analysis (3d ed.), New York: Prentice Hall. Gregory, A .. and M. YeaH, 1985. "Fonnulating Wald Tests of Nonlinear Restrictions," Econometrica, 53, 1465-1468. Hall, P., and C. Heyde. 1980, Martingale Limit Theory and Its Application, New York: Academic. Hall, R., 1978, "Stochastic Implications of the Life Cycle-Pennanent Income Hypothesis: Theory and Evidence," lou mal of Political Economy, 86. 971...1)87. Hamilton, J., 1994. Time Series Analysis, Princeton: Princeton University Press. Huizinga, J., and F. Mishkin, 1984, "Inflation and Real Interest Rates on Assets with Different Risk Characteristics," Journal of Finance, 39, 699-714. Judge, G., W. Griffiths. R. Hill, H. Llitkepohl, and T. Lee, 1985, The Theory and Practice of Econometrics (2nd ed.), New York: Wiley. Karlin, S., and H. M. Taylor, 1975, A First Course in Stochastic Processes (2d ed.), New York: Academic. Koopmans, T.. and W. Hood, 1953, "The Estimation of Simultaneous Linear Economic Relationships," in W. Hood and T. Koopmans (eds.), Studies in Econometn"c Method, New Haven: Yale University Press. Mishkin, F., 1992, "Is the Fisher Effect for Real?'' Joumal ofMOIIetary Economics, 30, 195-215. Newey, W., 1985. "Generalized Method of Momenls Specification Testing," Joumal of Econometrics, 29, 229-256. Rao, C. R., 1973, Linear Satisticaiinference and Its Applicatiom (2nd ed.), New York: Wiley. Stock, L 1994, "Unit Roots, Structural Breaks and Trends," in R. Engle and D. McFadden (eds.), Handbook of Econometrics, Volume IV, New York: North Holland. White, H., 1980, "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity." Econometrica, 48, 817-8~8.

CHAPTER 3

Single-Equation GMM

ABSTRACT

The most important assumption made for the OLS is the orthogonality between the error term and regressors. Without it, the OLS estimator is not even consistent. Since in many important applications the orthogonality condition is not satisfied, it is imperative to be able to deal with endogenous regressors. The estimation method called the Generalized Method of Moments (GMM), which includes OLS as a special case, provides a solution. This chapter presents GMM for single-equation models, while the next chapter develops its multiple-equation extension. These chapters deal only with linear equations. GMM estimation of nonlinear equations will be covered in Chapter 7. The major alternative to GMM, the maximum likelihood (ML) estimation, will be covered in Chapter 8. Reflecting the prevalence of endogenous regressors in economics, this chapter starts out with a number of examples. This is followed by a general formulation of endogenous regressors in Section 3.3. Section 3.4 introduces the GMM procedure for the model of Section 3.3. Sections 3.5-3.7 are devoted to developing the large sample properties of the GMM estimator and associated test statistics. Under conditional homoskedasticity, the formulas derived in these sections can be simplified. Section 3.8 collects those simplified formula~. ln particular, the two-stage least squares (2SLS), the techniques originally designed for the estimation of simultaneous equations models, is a special case of GMM. The ML counterpart of 2SLS is called Jimited-information maximum likelihood (LlML), which wil1 be covered in Section 8.6. The empirical exercise of the chapter takes up the most widely estimated equation in economics, the wage equation. The equation relates the wage rate to the individual's various characteristics such as education and ability. Because education is a choice made by individuals, and also because ability is imperfectly measured in data, the regressors are likely to be endogenous. We apply the estimation techniques introduced in this chapter to the wage equation and verify that the parameter estimates depend on whether we correct for endogeneity.

187

Single-Equation GMM

3.1

Endogeneity Bias: Working's Example

A Simultaneous Equations Model of Market Equilibrium

The classic illustration of biases created by endogenous regressors is Working (1927). Consider the following simple model of demand and supply:

qf = a 0 +a,p; + u;, (demand equation) qf = f3o + /3JP; + v;, (supply equation) qf = qf, (market equilibrium)

(3.1.la) (3.1.lb) (3.l.lc)

where qf is the quantity demanded for the commodity in question (coffee, say) in period i, qf is the quantity supplied, anc:l Pi is the price. The error term u, in the demand equation represents factors that influence coffee demand other than price, such as the public's mood for coffee. Depending on the value of u;, the demand curve in the price-quantity plane shifts up or down. Similarly, v; represents supply factors other than price. We assume that E(u;) = 0 and E(v;) = 0 (if not, include the nonzero means in the intercepts a 0 and {30 ). To avoid inessential complications, we also assume Cov(u;, v,) = 0. If we define q, = qf = qf, the three-equation system (3.1.1) can be reduced to a two-equation system:

+ a,p, + u;, f3o + f3!Pi + v;.

q; = ao

(demand equation)

(3.1.2a)

q; =

(supply equation)

(3.1.2b)

We say that a regressor is endogenous if it is not predetermined (i.e., not orthogonal to the error term), that is, if it does not satisfy the orthogonality condition. When the equation includes the intercept, the orthogonality condition is violated and hence the regressor is endogenous, if and only if the regressor is correlated with the error term. In the present example, the regressor p; is necessarily endogenous in both equations. To see why, treat (3.1.2) as a system of two simultaneous equations and solve for (p;, q;) as Pi = qi =

f3o - ao

"' - f3I

a,f3o- aof3 1 aI

-

/3!

v; -

u;

+ --'---'+

UJ - /3J a 1v; - {3,u,

"' -

{3,

(3.1.3a)

.

(3.1.3b)

So price is a function of the two error terms. From (3.1.3a), we can calculate the covariance of the regressor p; with the demand shifter u; and the supply shifter vi to be

188

Chapter 3

Cov(pi, ui) =-

Var(ui) at-f3t

, Cov(pi. vi)=

Var(vi) at-f3t

,

(3.1.4)

which are not zero (unless Var(ui) = 0 and Var(vi) = 0). Therefore, price is correlated positively with the demand shifter and negatively with the supply shifter, if the demand curve is downward-sloping (a 1 < 0) and the supply curve upwardsloping ({3 1 > 0). In this example, endogeneity is a result of market equilibrium. Endogeneity Bias When quantity is regressed on a constant and price, does it estimate the demand curve or the supply curve'? The answer is neither, because price is endogenous in both the demand and supply equations. Recall from Section 2.9 that the OLS estimator is consistent for the least squares projection coefficients. In the least squares projection of qi on a constant and Pi, the coefficient of Pi is Cov(p;, qi )/ Var(p; ), 1 so, · o f the OLS esttmate · · coeffi ctent · o f t he pnce phm = Cov(p;' q;) . (3.1.5) Var(pi) To rewrite this ratio in relation to the price effect in the demand curve (at). use the demand equation (3.1.2a) to calculate Cov(pi, q;) as Cov(p;, q;) = a 1 Var(p;)

+ Cov(p;, u;).

(3.1.6)

Substituting (3.1.6) into (3.1.5), we obtain the expression for the asymptotic bias fora 1 . . f . fi. Cov(pi,u;) p Itm of the 0 LS estimate o the pnce coef ctent - a 1 = . Var(p,)

(3 .I .7)

Similarly, the asymptotic bias for {3 1, the price effect in the supply curve, is · · o f t he OLS · · coeffi ctent phm ~ estimate o f t he pnce - vR 1 = Cov(pi' v;) . Var(p;)

(3.1.8)

But, as seen from (3.1.4), Cov(p;, u;) 'I 0 and Cov(p;, v;) 'I 0, so the OLS estimate is consistent for neither a 1 nor {3 1 • This phenomenon is known as the endogeneity bias. It is also known as the simultaneous equations bias or simultaneity bias, because the regressor and the error term are often related to each other through a system of simultaneous equations, as in the present example. 1Fact: Let y be the least squares coefficients in E*(y Proving this was an analytical exercise for Chapter 2.

1

I, x) =

a+ x'y.

Then y = Var(x)- 1 Cov(x, y).

189

Single-Equation GMM

In the extreme case of no demand shifters (so that u, = 0 for all i), we have Cov(p,, u 1) = 0, and the formula (3.1.7) indicates that the OLS estimate is consistent for the demand parameter a 1. In this case, the demand curve does not shift, and, as illustrated in Figure 3. I (a), all the observed combinations of price and quantity fall on the demand curve as the supply curve shifts. In the other extreme case of no supply shifters, the observed price-quantity pairs trace out the supply curve as the demand curve shifts (see Figure 3.I(b)). In the general case of both curves shifting, the OLS estimate should be consistent for a weighted average of a 1 and fi 1. This can be seen analytically by deriving yet another expression for the plim of the OLS estimate:

. , . , a 1 Var(v1) phm of the OLS est1mate of the pnce coefficient = Var(v1)

+ fh Var(u 1) . + Var(u 1)

(3.1.9) Proving this is a review question. Observable Supply Shifters

The reason neither the demand curve nor the supply curve is consistently estimated in the general case is that we cannot infer from data whether the change in price and quantity is due to a demand shift or a supply shift. This suggests that it might be possible to estimate the demand curve (resp. the supply curve) if some of the factors that shift the supply curve (resp. the demand curve) are observable. So suppose the supply shifter, v1, can be divided into an observable factor x, and an unobservable factor t;; uncorrelated with x; 2

q; = f!o

+ fiiPi + f!,x, + \;

with

fi2

oft 0.

(supply)

(3.1.2b')

Now imagine that this observed supply shifter x, is predetermined (i.e., uncorrelated with the error term) in the demand equation; think of x, as the temperature in coffee-growing regions. If the temperature (x;) is uncorrelated with the unobserved taste for coffee (u 1), it should be possible to extract from price movements a component that is related to the temperature (the observed supply shifter) but uncorrelated with the demand shifter. We can then estimate the demand curve by examining the relationship between coffee consumption and that component of price. Let us formalize this argument.

2This decomposition is always possible. If the least squares projection of Vi on a constant and x; is Yo+ fhxi. define~~-= v;- Yo- fJ2x;, so v; = ~i +YO+ {hx;. By definition, ~i is uncorrelated with x;. Substitute this equation into the original supply equation (3.1.2b) and submerge YO in the intercept. This produces (3.!.2b').

(a) No shifts in demand price

...

supply curves

demand curve quantity

(b) No shifts in supply price

supply curve

demand curves quantity Figure 3.1: Supply and Demand Curves

191

Single-Equation GMM

For the equation in question, a predetermined variable that is correlated with the endogenous regressor is called an instrumental variable or an instrument. We sometimes call it a valid instrument to emphasize that its correlation with the endogenous regressor is not zero. ln this example, the observable supply shifter x 1 can serve as an instrument for the demand equation. This can be seen easily. Solve the system of two simultaneous equations (3.l.2a) and (3.1.2b') for (p;, q; ): f3o - ao a1 - fJ1

+

{3, x; a1 - fJ1

atf3o-aof31 a1-f31

+

a1fJ2 a1{;-{31u; X;+ . a1-f31 a1-fJ1

p; = q; =

+

{; - u; a1 - fJ1

,

(3.1.10a)

Since Cov(x1 , {;) = 0 by construction and Cov(x;, u 1) follows from (3.1.1 Oa) that Cov(x;, p;) =

!32 a1 - fJ1

Var(x 1)

i'

(3.J.I0b) 0 by assumption, it

0.

So X; is indeed a valid instrument. With a valid instrument at hand, we can estimate the price coefficient a 1 of the demand curve consistently. Use the demand equation (3.1.2a) to calculate, not Cov(p;, q1) as in (3.1.6), but Cov(x;, q1): Cov(x1 , q;) = a 1 Cov(x;, p 1) = a 1 Cov(x;, p 1)

As we just verified, Cov(x1 , p;) to obtain

+ Cov(x1 , u1) (Cov(x 1, u 1) = 0 by assumption).

i' 0. So we can divide both sides by Cov(x;, p;)

0']

=

Cov(x;, q1) . Cov(x1 , p;)

(3.1.11)

A natural estimator that suggests itself is ,

a1w= '

sample covariance between xi and qi sample covariance between

xi

and Pi

.

(3.1.12)

This estimator is called the instrumental variables (IV) estimator with x 1 as the instrument. We sometimes say "the endogenous regressor p; is instrumented by

,.

X."

Another popular procedure for consistently estimating a 1 is the two-stage least squares (2SLS). It is so called because the procedure consists of running two

192

Chapter 3

regressiOns. In the first stage, the endogenous regressor p; is regressed on a constant and the predetermined variable x;, to obtain the fitted value, p;, of p;. The second stage is to regress the dependent variable q; on a constant and p;. The use of p; rather than p; distinguishes 2SLS from the simple-minded application of OLS to the demand equation. The 2SLS estimator of a 1 is the OLS estimate of the p; coefficient in the second-stage regression. So it equals 3 " al,2SLS

=

sample covariance between fii and qi

sample variance of jJ;

(3.1.13)

To relate the regression in the second stage to the demand equation, rewrite (3.1.2a) as (3.1.14) The second stage regression estimates this equation, treating the bracketed term as the error term. The OLS estimate of a 1 is consistent for the following reason. If the fitted value p; were exactly equal to the least squares projection E•(p; I I, x;), then neither u; nor (p;- p;) would be correlated with p;: u; is uncorrelated because it is uncorrelated with X; and p; is a linear function of X;, and (p; - p;) is uncorrelated because it is a least squares projection error. The fitted value p; is not exactly equal to E•(p; I l,x;), but the difference between the two vanishes as the sample gets larger. Therefore, asymptotically, p1 is uncorrelated with the error term in (3.1.14), making the 2SLS estimator consistent. Furthermore, since the projection is the best linear predictor, the fitted value incorporates all of the effect of the supply shifter on price. This suggests that minimizing the asymptotic variance is minimized for the 2S LS estimator. In the present example, the IV and 2SLS estimators are numerically the same (we will prove this in a more general context). More generally, the 2SLS estimator can be written as an lV estimator with an appropriate choice of instruments, and

the IV estimator, in turn, is a particular GMM estimator. QUESTIONS FOR REVIEW

1. In the simultaneous equations model (3.1.2), suppose Cov(u 1 , v1) is not necessarily zero. Is it necessarily true that price and the demand shifter u1 are positively correlated when a 1 < 0 and {3 1 > 0? [Answer: No.] Why? 3Fact: In the regression of Yi on a constant and Xi, the OLS estimator of the Xi coefficient is the ratio of the sample covariance between Xi and Yi to the sample variance of Xi. Proving this was a review question of Section 1.2.

193

Single-Equation GMM

2. (3.1.7) shows that the OLS estimate of the price coefficient in the regression of quantity on a constant and price is biased for the price coefficient a 1. Is the OLS estimate of the intercept biased for a 0 , the intercept in the demand equation? Hint: The plim of the OLS estimate of the intercept in the regression is E(q;)But from (3.1.2a), E(q;) = a 0

Cov(p;. q;) Var(p;) E(p;).

+ a 1 E(p; ).

3. Derive (3.1.9). Hint: Show: Var(v;) Var(p;) =

(

+ Var(u;)

a1 -

R

1-'1

)2

.

·) _ a 1 Var(v;) (

' C OV ( p,' q, -

+ {31 Var(u;)

ai -

R )2

1-'1

.

4. For the market equilibrium model·(3.1.2a), (3.1.2b') (on page 189) with Cov(u;, ~;) = 0, Cov(x;, u;) = 0, and Cov(x;, ~;) = 0, verify that price is endogenous in the demand equation. Is it in the supply equation? Hint: Look at (3.1.10a). Do we need the assumption that the demand and supply shifters are uncorrelated (i.e., Cov(u;, ~;) = 0) for &I.Iv and &1. 2 sLs to be consistent? Hint: Is x 1 a valid instrument without the assumption?

3.2

More Examples

Endogeneity bias arises in a wide variety of situations. We examine a few more examples. A Simple Macroeconometrlc Model

Haavelmo's (1943) illustrative model is an extremely simple macroeconometric model:

+ a 1 Y1 + u1 , 0 < a 1 < C; + 11 (GNP identity),

C; = a 0

Y1 =

I

(consumption function)

where Ci is aggregate consumption in year i, Yi is GNP, /i is investment, and a 1 is the Marginal Propensity to Consume out of income (the MPC). As is familiar from

194

Chapter 3

introductory macroeconomics, GNP in equilibrium is

+

U·

'

I - a1

(3.2.1)

If investment is predetermined in that Cov(/;, u;) = 0, it follows from (3.2. I) that Var(u;) > 0, I - a1 Var(/;) Cov(/;, Y;) = > 0. I- a1

Cov(Y;, u;) =

So income is endogenous in the consumption function, but investment is a valid instrument for the endogenous regressor. A straightforward calculation similar to the one we used to derive (3.1.7) shows that the OLS estimator of the MPC obtained from regressing consumption on a constant and income is asymptotically biased: . • Cov(Y;, u;) I - a1 > 0. Phm a, •0 , •s - a 1 -Var(Y;) - I Va<(/;)

(3.2.2)

+ Var(u,.}

This is perhaps the clearest example of simultaneity bias. The difference from Working's example of market equilibrium is that here the second equation is an identity, which only makes it easier to see the endogeneity of the regressor. The asymptotic bias can be corrected for by the use of investment as the instrument for income in the consumption function. The role played by the observable supply shifter in Working's example is played here by investment. Errors-In-Variables

The term errors-in-variables refers to the phenomenon that an otherwise predetermined regressor necessarily becomes endogenous when measured with error. This problem is ubiquitous, particularly in micro data on households. For example, in the Panel Study of Income Dynamics (PSID), information on variables such as food consumption and income is collected over the telephone. It is perhaps too much to hope that the respondent can recall on the spot how much was spent on food over a specified period of time. The cross-section version of M. Friedman's (1957) Permanent Income Hypothesis can be formulated neatly as an errors-in-variables problem. The hypothesis states that "permanent consumption" c; for household i is proportional to "permanent income" Y/":

c:= kY,'

with 0 < k
(3.2.3)

195

Single-Equation GMM

It is assumed that measured consumption Ci and measured income Yi are errorridden measures of permanent consumption and income:

(3.2.4) Measmement errors c; and y1 are assumed to be zero mean and uncorrelated with permanent consumption and income:

E(c;) = 0. E(y;) = 0, E(c;y,) = 0,

(3.2.5)

E(C;'c;) = 0, E(Y;'y;) = 0, E(C,'y;) = 0, E(Y;'c;) = 0.

(3.2.6)

Substituting (3.2.4) into (3.2.3), the relationship can be expressed in terms of measured consumption and income:

(3.2.7) This example differs from the previous ones in that the equation does not include a constant. So we should examine the cross moment E(Y;u 1) rather than the covariance Cov(Y1, u 1) to determine whether income is predetermined. It is straightforward to derive from (3.2.4)-(3.2.7) that E(Y;u;) = -k E(yf) < 0.

So measmed income is endogenous in the consumption function (3.2.7). Unlike in the previous examples, the endogeneity is due to measurement errors. Using

the fact that the OLS estimator of the Y; coefficient in (3.2.7) is consistent for the corresponding least squares projection coefficient E(C;Y;)/E(Y,'), we can also derive from (3.2.4)-(3.2.7) that

. ~ phmkoLs =

k E[(Y') 2 ] '

E[(Y/) 2 ] + E(yt)

< k.

(3.2.8)

So the regression of measured consumption on measured income (without the inter-

cept) underestimates k. Friedman used this result to explain why the MPC from the cross-section regression of consumption on income is lower than the MPC esti-

mated from aggregate time-series data. Let us suppose for a moment that there exists a valid instrument x;. So E(x;u 1) = 0 and E(x 1Y1) f= 0. A similar argument we used for deriving (3.1.11) establishes that

k = E(x·G-) I I • E(x;Y;)

(3.2.9)

196

Chapter 3

So a consistent IV estimator is sample mean of x; C; k,v = . sample mean of x; Y;

(3.2. 10)

But is there a valid instrument? Yes, and it is a constant. Substituting X; = I into (3.2.10), we obtain a consistent estimator of k which is the ratio of the sample mean of measured consumption to the sample mean of measured income. This is how Friedman estimated k. Production Function

In many contexts, the error term includes factors that are observable to the economic agent under study but unobservable to the econometrician. Endogeneity arises when regressors are decisions made by the agent on the basis of such factors. Consider a cross-sectional sample of firms choosing labor input to maximize profits. The production function of the i-th firm is Q; =A;· (L;)"' · exp(v;), 0 < ¢ 1 < I,

(3.2.11)

where Q; is the firm's output, L; is labor input, A; is the firm's efficiency level known to the firm, and v; is the technology shock. In contrast to A;, v; is not observable to the firm when it chooses L;. Neither A; nor v; is observable to the econometrician. Assume that, for each firm i, v; is serially independent over time. Therefore, B = E[exp(v;)] is the same for all i ,4 and the level of output the firm expects when it chooses L; is A; · (L;) 1' · B.

Let p be the output price and w the wage rate. For simplicity, assume that all the firms are from the same competitive industry so that p and w are constant across firms. Firm i's objective is to choose L; to maximize expected profits p ·A; · (L;) 1' · B- w · L;.

Take the derivative of this with respect to L;, set it equal to zero, and solve it for

4 Jf v; for flm1 i were correlated over time, the fum would use past values of v; to forecast the current vaJue of v; when choosing labor input, and so B would differ across finns. Also, since the expected value of a nonlinear function of a random variable whose mean is 7.ero is not necessarily 7.ero, B is not necessarily zero even if E(v;) = 0.

197

Single·Equation GMM

L;, to obtain the profit-maximizing labor inpm level: L; =

p (A; B¢,) (w)•'1

I

1 l-·1.

(3.2.12)

Let u; be firm i's deviation from fhe mean log efficiency: u; = log(A;)E[log(A; )], and let ¢ 0 E[log(A; )] (so E(u;) = 0 and A; = exp(¢0 + u; )). Then, the production function (3.2.11) and the labor demand function (3.2.12) in logs can be written as

=

(3.2.13)

log(Q;) =
(3.2.14)

where fJo =

I

¢I - I

[log(wjp)- ¢ 0 -log(¢ 1 8)].

Because all firms face the same prices, {J0 is constant across firms. (3.2.14) shows that, in the log-linear production function (3.2.13), log(L;) is an endogenous regressor positively correlated with the error term (v; + u;) through u;. Thus, the OLS estimator of ¢ 1 in the estimation of the log-linear production function confounds the contribmion to output of u; with labor's contribution. This example illustrates yet anofher source of endogeneity: a variable chosen by the agent taking into account some error component unobservable to the econometrician. QUESTIONS FOR REVIEW

1. In Friedman's Permanent Income Hypofhesis, consider the regression of C; on

a constant and Y;. Derive the plim of the OLS estimator of the Y; coefficient in this regression with a constant. Hint: The plim equals the ratio of Cov(C;, Y;) to Var(Y;). Show that it equals kVar(Y;') Var(Y;') + Var(y;).

-

2. In the production function example, show that plimn~oo ¢I.OLS = I, where ¢I.oLs is the OLS estimate of ¢ 1 from (3.2.13). Hint: Eliminate u; from the log output equation (3.2.13) by using the labor demand equation (3.2.14).

-

3. In the production function example, suppose fhe firm can observe v; as well as u; before deciding on labor input. How does the demand equation for labor

198

Chapter 3

(3.2.12) change? Show that log(Q 1) and log(L 1) are perfectly correlated. Hint: log( Q1) will be an exact linear function of Iog(L 1) without errors.

3.3

The General Formulation

We now provide a general formulation. It is a model described by the following set of assumptions and is a generalization of the model of Chapter 2. Regressors and Instruments

Assumption 3.1 (linearity): The equation to be estimated is linear: y; = z;8 +E;

(i =I, 2, ... ,n),

where z1 is an L-dimensional vector of regressors, 8 is an L-dimensional coefficient vector, and ei is an unobservable error term.

Assumption 3.2 (ergodic stationarity): Let x1 be a K-dimensional vector to be referred to as the vector of instruments, and let w 1 be the unique and nonconstant elements of(y1 , z1• x1) 5 [w1} is jointly stationary and ergodic. Assumption 3.3 (orthogonality conditions): All the K variables in x1 are predetermined in the sense that they are all orthogonal to the current error term: E(x;k E;) = 0 for all i and k (k = 1, 2•...• K) 6 This can be written as E[x1 · (y1 - z;8)] = 0

where g; =

or

E(g 1) = 0,

x; · e;.

We remarked in Section 2.3 that, when the regressors include a constant, the orthogonality conditions about the regressors are equivalent to the condition that E(E;) = 0 and that the regressors be uncorrelated with the error tenn. Here, the orthogonality conditions are about the instrumental variables. So if one of the instruments is 5 see examples below for illustrations of Wf. 6 As was noted in Section 2.3, our use of the term predetermined may not be standard in some quarters of the profession. All we require for the term is that the current error term be orthogonal to the current regressors. We do not require that the current error lenn be orthogonal to the past regressors.

199

Single·Equation GMM

a constant, Assumption 3.3 is equivalent to the condition that E(£;) = 0 and that nonconstant instruments be uncorrelated with the error term. The examples examined in the previous two sections can be written in this general format. For example, Example 3.1 (Working's example with an observable supply shifter): Consider the market equilibrium model of (3.1.2a) and (3.1.2b') on page 189. Suppose the equation to be estimated is the demand equation. The example can be cast in the general format by setting

y;

= q;,

L

= 2, ~ = [::] , z; =

u] , = E,

u, K

= 2,

x,

= [:,] ,

and w; = (qi, Pi, Xi)'. Since the mean of the error term is zero, a constant is orthogonal to the error term and so can be included in x,. It follows that x;, which is assumed to be uncorrelated with the error term, satisfies the orthogonality condition E(x; £;) = 0. So it too can be included in x;.

The other examples of the previous sections can be similarly written as special cases of the general model. Having two separate symbols, x; and z;, may give you the impression that the regressors and the instruments do not share the same variables, but that is not always the case. Indeed in the above example, x; and Z; share the same variable (a constant). The instruments that are also regressors are called predetermined regressors, and the rest of the regressors, those that are not included in X;, are called endogenous regressors. A good example to make the point is Example 3.2 (wage equation): A simplified version of the wage equation to be estimated later in this chapter is

where LW; is the log wage of individual i, S; is completed years of schooling, EXPR, is experience in years, IQ; is IQ, and £; is unobservable individual characteristics relevant for the wage rate. We assume E(E;) = 0 (if not, include the mean of E; in 81). In one of the specifications we estimate later, we assume that S; is predetermined but that IQ,, an error-ridden measure of the individual's ability, is endogenous due to the errors-in-variables problem. We

200

Chapter 3

also assume that EXPR;. AGE; (age of the individual), and MED; (mother's education in years) are predetermined. AGE; is excluded from the wage equation, reflecting the underlying assumption that, once experience is controlled for, age has no effect on the wage rate. In terms of the general model,

I

1 )'; =LW;, L =4,

Z;

=

S;

EXPR1

S;

, K

IQ;

=5,

X;=

EXPR; AGE; MED;

and w; = (LW;. S;. EXPR;. IQ;. AGE;, MED;)'. As in Example 3.1, we can include a constant in X; because E(£;) = 0. In this example, x; and z1 share three variables (1, S;, EXPR;) . .Jf, for example, S; were not included in x;, that would amount to treating S; as endogenous.

If, as in this example, some of the regressors z; are predetermined. those predetermined regressors should be included in the list of instruments X;. The GMM estimation of the parameter vector is about how to exploit the information afforded by the orthogonality conditions. Not including predetermined regressors as elements of x; amounts to throwing away the orthogonality conditions that could have been exploited. Identification As seen from the examples of the previous two sections, an instrument must be not only predetermined (i.e., orthogonal to the error term) but also correlated with the regressors. Otherwise, the instrumental variables estimator cannot be defined (see, e.g., (3.1.1 !)). The generalization to the case with more than one regressor and more than one predetermined variable is Assumption 3.4 (rank condition for identification): The K x L matrix E(x,z;) is of full column rank (i.e., its rank equals L, the number of its columns). We denote this matrix by !:., 7 To see that this is indeed a generalization, consider Example 3.!. With z; (I. p;)',

X;=

(l,x;)', the!:., matrix is

7 So the cross moment E(xiz~) is a<;sumed to exist and is finite. If a moment is indicated, as here, then by implication the moment is assumed to exist and is finite.

201

Single-Equation GMM

E

_ [ "-

I

E(p;) ]

E(x;)

E(x;p;)

·

The determinant of this matrix is not zero (and hence is of full column rank) if and only if Cov(x;, p;) = E(x; p;) - E(x;) E(p;) i= 0. Assumption 3.4 is called the rank condition for identification for the following reason. To emphasize the dependence of g; (= X;·£;) on the data and the parameter vector, rewrite g; as

g, = g(w;;

~)

= x; · (y;

- z;~).

(3.3.1)

So the orthogonality conditions can be rewritten as E[g(w;:

~)]

=

0 .

(3.3.2)

(Kx l)

Now let ~ (L x I) be a hypothetical value of~ and consider the system of K simultaneous equations in L unknowns (the L elements of~): E[g(w;;

~)]

=

0 .

(3.3.3)

(Kx l)

The orthogonality conditions (3.3.2) mean that the true value of the coefficient vector, ~. is a solution to this system of K simultaneous equations (3.3.3). So the assumptions we have made so far guarantee that there exists a solution to the simultaneous equations system. We say that the coefficient vector (or the equation) is identified if~ = ~ is the only solution. Because the estimation equation is linear in our model, the function g(w;; ~)is linear in 8, as it can be written as X; · y; - x,z;8. So (3.3.3) is a system of K linear equations:

E(x; · y;) - E(x;z)8 = 0

or

"Exz

~

(KxL)(Lxl)

=

f1 xy ,

(3.3.4)

(Kxl)

where

A necessary and sufficient condition that~ = ~ is the only solution can be derived from the following result from matrix algebra. 8 Suppose there exists a solution to a system oflinear simultaneous equations 8See. e.g.. Searle (1982. pp. 233-2~4).

202

Chapter 3

m x: Ax = b. A necessary and sufficient condition that it is the only solution is that A is of full column rank. Therefore, 8 = 8 is the only solution to (3.3.4) if and only if En is of full column rank, which is Assumption 3.4. Order Condition lor Identification Since rank( E.,) < L if K < L, a necessary condition for identification is that K (=#predetermined variables) > L (=#regressors).

(3.3.5a)

This is called the order condition for identification. It can be stated in different ways. Since K is also the number of orthogonality conditions and L is the number of parameters, the order condition can. be stated equivalently as #orthogonality conditions > #parameters.

(3.3.5b)

By subtracting the number of predetermined regressors from both sides of the inequality, we obtain another equivalent statement: #predetermined variables excluded from the equation (3.3.5c)

> #endogenous regressors.

Depending on whether the order condition is satisfied, we say that the equation is • overidentified if the rank condition is satisfied and K > L, • exactly identified or just identified if the rank condition

IS

satisfied and

K=L,

• underidentified (or not identified) if the order condition is not satisfied (i.e., if K < L).

Since the order condition is a necessary condition for identification, a failure of the order condition means that the equation is not identified. The Assumption for Asymptotic Normality As in Chapter 2, we need to strengthen the orthogonality conditions for the estimator to be asymptotically normal. Assumption 3.5 (g1 is a martingale difference sequence with finite second moments): Let g1 = x, · e;. (gd is a martingale difference sequence (so a for-

203

Single-Equation GMM

tiori E(g;) = 0). The K x K matrix of cross moments, E(g;g,), is nonsingular. We useS for Avar(iD (i.e., the variance of the limiting distribution of y'n g, where g = ~ L:7~ 1 g;). By Assumption 3.2 and the ergodic stationary Martingale Differences CLT, S = E(g;g,). This is the same as Assumption 2.5, and the same comments apply: • If the instruments include a constant, then this assumption implies that the error is a martingale difference sequence (and a fortiori serially uncorrelated). • A sufficient and perhaps easier to understand condition for Assumption 3.5 is that (3.3.6) which means that, besides being a martingale difference sequence, the error term is orthogonal not only to the current but also to the past instruments. • Since g 1g; =

El xix;, Sis a matrix of fourth moments.

Consistent estimation of

S will require a fourth-moment assumption to be specified in Assumption 3.6 below. • If (g;] is serially correlated, then S (which is defined to be Avar(g)) does not

equal E(g,g,) and will take a more complicated form, as we will see in Chapter 6.

QUESTIONS FOR REVIEW

1. In the Working example of Example 3.1, is the demand equation identified? Overidentified? Is the supply equation identified? 2. Suppose the rank condition is satisfied for the wage equation of Example 3.2. Is the equation overidentified? 3. In the production function example, no instrument other than a constant is specified. So K = I and L = 2. The order condition informs us that the log output equation (3.2.13) cannot be identified. Write down the orthogonality condition and verify that there are infinitely many combinations of (¢o, ¢ 1 ) that satisfy the orthogonality condition. 4. Verify that the examples of Section 3.2 are special cases of the model of this section by specifying (y;, x;, z;) and writing down the rank condition for each example.

204

Chapter 3

5. Show that the model implies rank( En ) =rank( En : (KxL)

(KxL)

CJ xy ) . (Kxl)

= ~ is a solution to (3.3.4). (Some textbooks add this equation to the rank condition for identification, but the equation is implied by the other assumptions of the model.)

Hint:

~

6. (Irrelevant instruments) Consider adding another variable, call it$;, to x;. Although it is predetermined, the variable is unrelated to the regressors in that E($;Z;e) = 0 for all£(= I, 2, ... , L). Is the rank condition still satisfied? Hint: If a K x L matrix is of full column rank, then adding any L·dimensional row vector to the rows of the matrix does not alter the rank. 7. (Linearly dependent instruments) In Example 3.2, suppose AGE; = EXPR; + S; for all the individuals in the population. Does that necessarily mean a failure of the rank condition? [Answer: No.] Is the full-rank condition in Assumption 3.5 (that E(g;g;) be nonsingular) satisfied? Hint: There exists a K-dimensional vector a f= 0 such that a'x; = 0. Show that a' E(g;g) = 0. B. (Linear combinations of instruments) Let A be a q x K matrix of full row rank (so q < K) such that A En is of full column rank. Let X; = Ax; (so ii; is a vector of q transformed instruments). Verify: Assumptions 3.3-3.5 hold for if they do forx;.

x;

9. Verify that the model, consisting of Assumptions 3.1-3.5, reduces to the regression model of Chapter 2 if z; = x,..

3.4

Generalized Method of Moments Defined

The orthogonality conditions state that a set of population moments are all equal to zero. The basic principle of the method of moments is to choose the parameter estimate so that the corresponding sample moments are also equal to zero. The population moments in the orthogonality conditions are E[g(w;; ~)]. Its sample analogue is the sample mean of g(w,.; ~) (where g(w;; ~) is defined in (3.3.1))

205

Single-Equation GMM

evaluated at some hypothetical value
=-I~ L.., g(w,; <1).

n

(Kxl)

(3.4.1)

i=l

Applying the method of moments principle to our model amounts to choosing a
-

-

n

I~

g,(
z;o)-

(by definition (3.3.1) ofg(w,;
1=1

]n

=n

]n

_

~x··y·(- ~x-z')o ~ ~ I. 1 I

I

n

i=l

_

= •~y - S XZ
(3.4.2)

i=l

where Sxy and Sxz are the corresponding sample moments of O" xy and I:.,:

Sxy (Kxl)

]

n

n

i=I

=- L Xi"Yi

So the sample analog g,(
li

= Sxy·

(3.4.3)

This is the sample analogue of (3.3.4). If there are more orthogonality conditions than parameters, that is, if K > L, then the system may not have a solution. The extension of the method of moments to cover this case as well is the generalized method of moments (GMM). Method of Moments

If the equation is exactly identified, then K = L and I:xz is square and invertible. Since under Assumption 3.2 Sxz converges to I:., almost surely, S., is invertible for sufficiently large sample size n with probability one. Thus, when the sample size is large, the system of simultaneous equations (3.4.3) has a unique solution given by (3.4.4)

206

Chapter 3

This estimator is also called the instrumental variables estimator with x; serving as instruments. Because it is defined for the exactly identified case, the formula assumes that there are as many instruments as regressors. If, moreover, z; = x; (all the regressors are predetermined or orthogonal to the error term), then ~ 1 v reduces to the OLS estimator. Thus, the OLS estimator is a method of moments estimator. A

Generalized Method of Moments If the equation is overidentified so that K > L, we cannot in general choose an L-dimensional ~to satisfy the K equations in (3.4.3). If we cannot set g"(~) exactly equal to 0, we can at least choose~ so that g.(~) is as close to 0 as possible. To make precise what we mean by "close," we define the distance between any two K-dimensional vectors ~ and ~ by the quadratic form (~ - ~)'W(~ - ~), where W, sometimes called the weighting matrix, is a symmetric and positive definite matrix defining the distance 9 Definition 3.1 (GMM estimator): Let W be a K x K symmetric positive definite matrix, possibly dependent on the sample, such that W --+ 0 Was the sample size n goes to infinity with W symmetric and positive definite. The GMM estimator of~. denoted ~(W), is J(W)

= argmin J(J, W),

(3.4.5)

' where

(The reason the distance g"(J)'Wg.(J) is multiplied by the sample size (n) becomes clear in Section 3.6.) The weighting matrix is allowed to be random and depend on the sample size, to cover the possibility that the matrix is estimated from the sample. The definition makes clear that the GMM is a special case of minimum distance estimation. In minimum distance estimation, plim g.(~) = 0, as here, but the g"O function is not necessarily a sample mean. Since g"(J) is linear in J (see (3.4.2)), the objective function is quadratic in J when the equation is linear: (3.4.6)

9oo not let the word "weight'' confuse you between GMM and weighted least squares (WLS). In GMM, the weighting is applied to the sample mean g, while in WLS it applies to each observation.

207

Single·Equation GMM

It is left to you to show that the first-order condition for the minimization of this with respect to ~ is

-

w

s~z

Sxy =

(LxK) (KxK) (Kxl)

s~z

-

w

S.,

~

(3.4.7)

(LxK) (KxK) (KxL) (Lx I)

If Assumptions 3.2 and 3.4 hold, then S., is of full column rank for sufficiently large n with probability one. Then, since Wis positive definite, the L x L matrix

S~, WS., is nonsingular. So the unique solution can be obtained by multiplying both sides by the inverse of S~, WS.,. That unique solution is the GMM estimator: "

....._

....._

I

....._

GMM estimator: ~(W) = (S~,WS.,)- S~,Wsxy·

(3.4.8)

If K = L, then S., is a square matrix and (3.4.8) reduces to the IV estimator (3.4.4). The GMM is indeed a generalization of the method of moments. Sampling Error For later use, we derive the expression for the sampling error. Multiplying both sides of the estimation equation y; = i;~ +e; from left by x; and taking the averages yields Sxy = S.,~

+ g,

(3.4.9)

where I

g=-

I n LX;-E; =- Lg(w;;~) = g,(~). n

n.l= 1

n

(3.4.1 0)

i=l

Substituting (3.4.9) into (3.4.8), we obtain (3.4.11)

QUESTIONS FOR REVIEW

1. Verify that (3.1.12) is the IV estimator of Cll when the method of moments is applied to the demand equation (3.1.2a) with (I, x;) as instruments. 2. If the equation is just identified, what is the minimized value of J (J, W)? 3. Even if the equation is overidentified, the population version, (3.3.4), of the system of K equations (3.4.3) has a solution. Then, why does not the sample version, (3.4.3), have a solution? Hint: The population version has a solution

208

Chapter 3

because the K x (L + I) matrix [:E., : u xy] is of rank L. The rank condition is a set of equality conditions on the elements of :E., and u xr Even if S., and Sxy converge to :E., and u xy. respectively, it does not necessarily follow that [S., : Sxy] is of rank L for sufficiently large n. In contrast, S., is of full column rank for sufficiently large n when its limit, :E., is of full column rank. This is because the full column rank condition for a matrix is a set of inequality conditions on the elements of the matrix. 4. What is wrong with the following argument? Even if the equation is overidentified, there is no problem finding a solution to (3.4.3). Just premultiply both sides by s~, to obtain (3.4.12) Since S., is of full column rank,

Hint:

This~

S~,S.,

is invertible. So the solution is

certainly solves (3.4.12), but does it solve (3.4.3)?

5. (Singular W) Verify that the GMM estimator (3.4.8) remains well-defined for sufficiently large n even if W (= plim W) is singular, as long as :E~,W:E., is nonsingu1ar.

3.5

Large-Sample Properties of GMM

The GMM formula (3.4.8) defines GMM estimators, which are a set of estimators, each indexed by the weighting matrix W. You will be delighted to know that every estimator to be introduced in the next few chapters is a GMM estimator for some choice of W. The task of this section is to develop the large-sample theory for the GMM estimator for any given choice of W, which can be carried out quite easily with the techniques of the previous chapter. The first half of this section extends Propositions 2.1-2.4 of Chapter 2 to GMM estimators. One issue that did not arise in those propositions is which GMM estimators should be preferred to other GMM estimators. This is a question of choosing Woptimally and will be taken up in the latter part of this section.

-

209

Single·Equation GMM

Asymptotic Distribution of the GMM Estimator

The large-sample theory for ~ (W) valid for any given choice of W is Proposition 3.1 (asymptotic distribution of the GMM estimator):

(a) (Consistency) Under Assumptions 3.1-3.4, plim
where

(Recall: :!:.,

= E(x1z;), S =

E(g,g,) = E(s~ x1x;). W

= plim W.)

(c) (Consistent Estimate of Avar(a(W))) Suppose there is available a consistent estimator, S, of S (K x K). Then, under Assumption 3.2, Avar(~(W)) is consistently estimated by ~

"

--._

Avar(a(W))

= (S~,ws.,)- s~,WSWS.,(S~,ws,,)- , --._

I

--._--._--._

_....._

1

(3.5.2)

where Sxz is the sample mean ofx;z;:

The ugly looking expression for the asymptotic variance will become much prettier when we choose the weighting matrix optimally. If you have gone through the proof of Proposition 2.1, you should find it a breeze to prove Proposition 3.1. The key observations are: (I) S.,

(2)

-*p

:!:.,

(by ergodic stationarity)

g (= ~ 2::7= 1 g1)

-*p

0

(by ergodic stationarity and the orthogonality condi-

tions) (3) yin g -*d N(O, S)

(by Assumption 3.5).

Consistency immediately follows if you use (1), (2), and Lemma 2.3(a) on the expression for the sampling error (3.4.11). To prove asymptotic normality, multiply

210

Chapter 3

both sides of (3.4.11) by .jii to obtain (3.5.3) and use (3), Lemma 2.3(a), and Lemma 2.4(c). Part (c) of Proposition 3.1 follows immediately from Lemma 2.3(a). Estimation of Error Variance

In Section 2.3, we proved the consistency of the OLS estimator, s 2 , of the error variance. As was noted there, the result holds as long as the residual is from some consistent estimator of the coefficient vector. The same is true here. Proposition 3.2 (consistent estimation of error variance): For any consistent estimator, ~. of~. define i!; =: y; - z;~. Under Assumptions 3.1, 3.2, plus the assumption that E(z;z;) (second moments of the regressors) exists and is finite, n

!. L e;--+ E(
i=I

P

provided E(
n

-I '"'•2 L_.f·1

2 =-I '"' L_.f·1 n

n

i=l

n

'I '"' 2(~ -~)L_.Z;·f; A

i=l

n

i""'l

n

' + <'o -o)' '(I-n '"' L_.Z;Z·') (o' -o). ' i=l

(3.5.6) As usua1, ~ 'Eic} -*p E(ef). Since~ is consistent and ~ 'E;z;z; converges in probability to some finite matrix by assumption. the last terrn vanishes. It is easy to show that E(z; ·<;) exists and is finite. 10 Then, by ergodic stationarity, 10 By the Cauchy-Schwartz inequality, E(lzaEi /) ::5 JE
E
Single-Equation GMM

211

I

- Li Zj ·Ei ---* some finite vector. n P So the second tenn on the RHS of (3.5.6), too, vanishes.

Hypothesis Testing It should now be routine to derive from Proposition 3.l(b) and 3.1 (c) the following.

-

Proposition 3.3 (robust 1-ratio and Wald statistics): Suppose Assumptions 3.13.5 hold, and suppose there is available a consistent estimateS ofS (= Avar(g) = E(g,g;)). Let

Then: (a) Under the null hypothesis H 0 : 8, = 8,,

t, =

-

-

-

where (Avar(:\(Wll), is the (C, C) element of Avar(:\(W)) [which is

Avar(8,(W))] and SE', (robust standard error)=

1 --:::--..-.. -·(Avar(8(W))) ,. n

(3.5.7)

(b) Under the null hypothesis H0 : R8 = r where #r is the number of restrictions

(the dimension of r) and R (#r x L) is of full row rank, W

-

= n · (R.l(W)- r)' {R[Avar(8(WlllRT 1(R8(W)

- r)--> d

x2 (#r). (3.5.8)

(c) Under the null hypothesis H0 : a(8) = 0 such that A(8), the #a x L matrix of first derivatives of a(8) (where #a is the dimension of a), is continuous and of full row rank,

W

-

= n · a(8(W))'{A(8(W))[Avar(8(W))]A(:l(W))T 1a(:\(WJ)-->d x2 (#a). (3.5.9)

212

Chapter 3

For the Wald statistic for the nonlinear hypothesis, the same comment that we made for Proposition 2.3 applies here: the numerical value of the Wald statistic is not invariant to how the nonlinear restriction is represented by a(·). Estimation of S

We have already studied the estimation of S (= E(g,g,) = E(rf x,x;)) in Section 2.5. The proposed estimator was, if adapted to the present estimation equation Yi = z;& + E;, (3.5.10) where £, = y, - z;~ and ~ is consistent for ~. The fourth-moment assumption needed for this to be consistent for S is a generalization of Assumption 2.6. Assumption 3.6 (finite fourth moments): E[(x1kz;e) 2 ] exists and is finite for all k (=I, ... , K) andi (=I, ... , L). It is left as an analytical exercise to prove

Proposition 3.4 (consistent estimation of S): Suppose the coefficient estimate~ used for calculating the residual £1 for S in (3.5.10) is consistent, and suppose S = E(g 1g;) exists and is finite. Then, under Assumptions 3.1, 3.2, and 3.6, S given in (3.5.10) is consistent for S. Efficient GMM Estimator

Naturally, we wish to choose from the GMM estimators indexed by Wone that has the least asymptotic variance. The next proposition provides a choice of W that minimizes the asymptotic variance. Proposition 3.5 (optimal choice of the weighting matrix): A lower bound for the asymptotic variance of the GMM estimators (3.4.8) indexed by Wis given by (I:~, s-'I:,,)- 1 The lower bound is achieved ifW is such that w (= plim W) =

s-''' Because the asymptotic variance for any given W is (3.5.1), this proposition proved if we can show that

s-

IS

1 is sufficient but not necessary for efficiency. A necessary and sufficient condi11 The condition that W = tion is that there exists a matrix C such that I:~W = C I:ri: s-•. See Newey and McFadden (1994, p. 2165).

Single-Equation GMM

213

for any symmetric positive definite matrix W. Proving this algebraic result is left as an analytical exercise.

A GMM estimator satisfying the efficiency condition that plim W = s- 1 will be called the efficient (or optimal) GMM estimator. Simply by replacing W by s- 1 (which is consistent for s- 1) in the formulas of Proposition 3.1, we obtain Efficient GMM estimator: J(S- 1) = (S~, s- 1Sx,)- 1S~, s- 1Sxy. (3.5.12) Avar(J(s- 1)) = (E~, s- 11:,)- 1•

(3.5.13)

~

Avar(J(s- 1)) = (S~,s- 1 s,)- 1

(3.5.14)

With W = S - 1, the formulas for the robust t and the Wald statistics in Proposition 3.3 become (3.5.15) where SEf is the robust standard error given by I

-

-· ( (S~, s- 1S,)- 1) ee• n and

-

To calculate the efficient GMM estimator, we need the consistent estimator S. But Proposition 3.4 assures us that the S based on any consistent estimator of is consistent for S. This leads us to the following two-step efficient GMM procedure:

-

a

Step 1: Choose a matrix W that converges in probability to a symmetric and positive definite matrix, and minimize J (J, W) over J to obtain J(W). There is no shortage of such matrices W (e.g., W = 1), but usually we set W = s;~. The resulting estimator J(S;x') is the celebrated two-stage least squares (as we will see in Section 3.8). Use this to calculate the residual E:, = y; - z;J(W) and obtain a consistent estimatorS of S by (3.5. 10). Step 2: Minimize J(J, s- 1) over J. The minimizer is the efficient GMM estimator.

214

Chapter 3

Asymptotic Power

Like the coefficient estimator, the t and Wald statistics depend on the choice of W. It seems intuitively obvious that those statistics associated with the efficient GMM estimator should be preferred in large samples. This intuition can be formalized in terms of asymptotic power introduced in Section 2.4. Take, for example, the t-ratio for testing H 0 : 8e = 8e. The 1-ratio is written (reproducing (3.5.7)) as

-

(3.5.17)

(Avar(~(W)))ee The denominator converges to some finite number even when the null is false. In contrast, the numerator explodes (its plim is infinite) when the null is false. So the power under any fixed alternative approaches unity. That is, the test is consistent. This is true for any choice of W, so test consistency cannot be a basis for choosing W. Next, consider a sequence of local alternatives subject to Pitman drift: (3.5.18) for some given y ;6 0. Substituting (3.5.18) into (3.5.17), the 1-ratio above can be rewritten as (3.5.19)

Applying the same argument for deriving the asymptotic distribution of the t-ratio in (2.4.4), we can show that le --+d N(JL, 1), where (3.5.20)

If the significance level is a, the asymptotic power is given by Prob(lxl > 1af2), where x ~ N (JL, I) and taf2 is the level-a critical value. Evidently, the larger is IJLI, the higher is the asymptotic power. And IJLI decreases with the asymptotic variance. Therefore, the asymptotic power against local alternatives is maximized by the efficient GMM estimator.

215

Single-Equation GMM

Small-Sample Properties

Do these desirable asymptotic properties of the efficient GMM estimator and associated test statistics carry over to their finite-sample distributions? The efficient a function of estimated fourth moments, for W. GenerGMM estimator uses ally, it takes a substantially larger sample size to estimate fourth moments reliably than to estimate first and second moments. So we would expect the efficient GMM estimator to have poorer small-sample properties than the GMM estimators that do not use fourth moments for W. The July 1996 issue of the Journal of Business and Economic Statistics has a number of papers examining the small-sample distribution ofGMM estimators and associated test statistics for various DGPs. Their overall conclusion is that the equally weighted GMM estimator with W= I generally outperforms the efficient GMM in terms of the bias and variance in finite samples. They also find that the size of the efficient Wald statistic in small samples far exceeds the assumed significance level. That is, if a is the assumed significance level and c, is the associated critical value so that Prob(x 2 > c,) =a, the probability in finite samples that the Wald statistic is greater than c, far exceeds a; the test rejects the null too often. Unfortunately, however, like most other smallsample studies, those studies fail to produce clear quantitative guidance which the empirical researcher could follow.

s-•'

QUESTIONS FOR REVIEW

1. Verify that all the results of Sections 2.3-2.5 are special cases of those of this section. In particular, verify that (3.5.1) reduces to (2.3.4). Hint: :Exz is square ifzi =xi.

2. (Singular W) Suppose W is singular but :E~z W:Exz is nonsingular. Verify that all the results of this section (except Proposition 3.5) remain valid. 3. (Asymptotically equivalent choice of W) Suppose that

w, - w2

->p

0. Show

Hint:

Jna<w,)- Jna<w2) = [<S~zw,s.,)- 1 S~w,- (S~zw2Sxz)- 1 S~zw2]

J/1 g converges in distribution to a random variable.

J/lg.

Use Lemma 2.4(b).

216

Chapter 3

-

4. (Three-step GMM) Consider adding to the efficient two-step GMM procedure the following third step: Recompute S by (3.5.10). but this time using ·the residual from the second step. Calculate the GMM estimator with this recomputed S. Is this estimator consistent? Asymptotically normal? Efficient? Hint: Verify by Proposition 3.4 that the recalculated S is consistent for S. Does the asymptotic distribution of the GMM estimator depend on the choice of W if plimn~oo Wis the same?

-

-

z,

-

z,

5. (When is a strict subset of x,) Suppose is a strict subset of x,. Sox, includes, in addition to the regressors (which are all predetennined), some other predetennined variables. Are the efficient two-step GMM estimator and the OLS estimator numerically the same? [Answer: No.] 6. (Data matrix representation of efficient GMM) Let B be the n x n diagonal

matrix whose i-th element is consistent estimation. That is,

if,

where£, is the residual from the first-step

Verify that

where X, y, and Z are data matrices for the instruments, the dependent variable, and the regressors (they are defined in Section 3.8 below). 7. (GLS interpretation of efficient GMM) Let X, Z, and y be as in the previous question. Then the estimation equation can be written as y = Z8 + e. Premultiply both sides by X to obtain

X'y = X'Z8 +X' e. Taking S to be the variance matrix of X' e and S to be its consistent estimate, apply FGLS. Verify that the FGLS estimator is the efficient GMM estimator (the FGLS estimator was defined in Section 1.6). 8. (Linear combination of orthogonality conditions) Derive the efficient GMM estimator that exploits a linear combination of orthogonality conditions,

AE(g,) = 0,

217

Single-Equation GMM

where A is a q x K matrix of full row rank (so q < K). [Answer: Replace Sxz by ASxy. s,, by Asxy. and S by ASA'. Formally, the estimator can be written as (3.4.8) with W= A'(ASA')- 1A.] Verify: If q = K (so that A is nonsingular), then the efficient GMM estimator is numerically equal to the efficient GMM estimator associated with the orthogonality conditions E(g;) = 0.

3.6 Testing Overidentifying Restrictions If the equation is exactly identified, then it is possible to choose elements of the sample moments g,, (~) are zero and the distance

J<~.

~

so that all the

w) = n-g,(~l' Wg,(~l

1s zero. (The ~ that does it is the IV estimator.) If the equation is overidentified, then the distance cannot be set to zero exactly, but we would expect the minimized distance to be close to zero. It turns out that, if the weighting matrix Wis chosen optimally so that plim = s-l, then the minimized distance is asymptotically chi-squared. LetS be a consistent estimator of S, and consider first the case where the dis1). Since by definition tance is evaluated at the true parameter value~. J(~. I g,(~) = g (= ~~;g;) for~=~. the distance equals

w

s-

(3.6.1) Since Jii'g ~d N(O, S) and S ~P S, its asymptotic distribution is x2(K) by Lemma 2.4(d). Now if~ is replaced by ~(S- 1 ), then the degrees of freedom change from K to K - L. The intuitive reason is that we have to estimate L parameters ~before forming the sample average of g;. (We encountered a similar situation in Chapter 1 in the context of the unbiased estimation of rr 2 ) We summarize this as Proposition 3.6 (Hansen's test ofoveridentifying restrictions (Hansen, 1982)): Suppose there is available a consistent estimator, S, of S (= E(g;t,)). Under Assumptions 3.1-3.5,

218

Chapter 3

-

A formal proof is left as an optional exercise. Because the S given in (3.5.1 0) is consistent (under the additional condition of Assumption 3.6), the minimized distance calculated in the second step of the efficient two-step GMM is asymptotically

x2 (K- L). Three points on the use and interpretation of the J test are in order. • This is a specification test, testing whether all the restrictions of the model (which are the assumptions maintained in Proposition 3.6) are satisfied. If the J statistic of Proposition 3.6 is surprisingly large, it means that either the orthogonality conditions (Assumption 3.3) or the other assumptions (or both) are likely to be false. Only when we are confident about those other assumptions can we interpret the large J statistic as evidence for the endogeneity of some of the K instruments included in x1. • Unlike the tests we have encountered so far, the test is not consistent against some failures of the orthogonality conditions. The essential reason is the loss of degrees of freedom from K to K - L. It is easy to show that g is related to its sample counterpart, g,(J(S- 1)), as

The problem is that, since BS., = 0, this matrix Bis not of full column rank. If the orthogonality conditions fail and E(g,) ,P 0, then the elements of ,jii g will diverge to +oo or -oo. But, since Bis not offull column rank, B,jfi g and hence 1 12 ) may remain finite for some pattern of nonzero orthogonalities J(J(S- 1),

s-

• It is only recently that the small-sample properties of the test became a concern. Several papers in the July 1996 issue of the Journal of Business and Economic Statistics report that the finite-sample or actual size of the J test in small samples far exceeds the nominal size (i.e., the test rejects too often). Testing Subsets of Orthogonality Conditions

Suppose we can divide the K instruments into two groups: the vector x, 1 of K 1 variables that are known to satisfy the orthogonality conditions, and the vector x12 of remaining K- K 1 variables that are suspect. Since the ordering of instruments does not change the numerical values of the estimator and test statistics, we can assume without loss of generality that the last K- K 1 elements of x, are the suspect

12 See Newey (1985, Section 3) for a thorough treatment of this issue.

219

Single-Equation GMM

instruments:

.- [Xil] }

x,-

X;z

}

K1 rows,

(3.6.3)

K- K 1 rows.

The part of the model we wish to test is (3.6.4)

E(x 12 ·e;) = 0.

This restriction is testable if there are at least as many nonsuspect instruments as there are coefficients so that K 1 > L. The basic idea is to compare two 1 statistics from two separate GMMestimators of the same coefficient vector~. one using only the instruments included in x11 , and the other using also the suspect instruments x12 in addition to x11 . If the inclusion of the suspect instruments significantly increases the 1 statistic, that is a good reason for doubting the predeterminedness ofx,2 . In accordance with the partition of x;, the sample orthogonality conditions g, (~) and S can be written as

s

=

(KxK)-

[Su S12] S21

S22

(3.6.5) '

where

In particular, g 1,

(~)

can be written as

where (3.6.6) For a consistent estimate S of S, the efficient GMM estimator using all the K instruments and its associated 1 statistic have already been derived in this and the

220

Chapter 3

previous section. Reproducing them, o'

= (S'xz

-s- 1sxz )- 1s'xz -s- 1Sxy. ~

......._ I

"

J = n-g,(~)·s- g,(~).

(3.6.7) (3.6.8)

The efficient GMM estimator of the same coefficient vector ~ using only the first K 1 instruments and its associated J statistic are obtained by replacing x; by Xi! in these expressions. So

~ = [S~,z(S,,)- 1 Sx,zr 1 S~,z(Sn) - 1Sx 1y, 11 =

-

........

1

-

n-g 1 ,(~)'(S 11 )- g 1 ,(~),

(3.6.9) (3.6.1 0)

where S 11 is a consistent estimate of S 11 . The test is based on the following proposition specifying the asymptotic distribution of J - 1 1 (the proof is left as an optional exercise).

Proposition 3.7 (testing a subset of orthogonality conditions 13 ): Suppose Assumptions 3.1-3.5 hold. Let x;~ be a subvector of x;, and strengthen Assumption 3.4 by requiring that the rank condition for identification is satisfied for x" (so E(x,,z;) is of full column rank). Then, for any consistent estimators SofS andS11 of Sn,

where K = #x; (dimension of x, ), K 1 are defined in (3.6.8) and (3.6.10).

= #xi!

(dimension of xil ), and J and 11

Clearly, the choice ofS and S 11 does not matter asymptotically as long as they are consistent. But in finite samples, the test statistic C can be negative. This problem can be avoided and C can be made nonnegative in finite samples if the same S is used throughout, that is, ifS 11 in (3.6.9) and (3.6.10) is the submatrix of Sin (3.6.7) and (3.6.8). This is accomplished by taking the following steps: (I) do the efficient two-step GMM with full instruments x; to obtainS from the

first step,

'

~

and J from the second step;

(2) extract the submatrix S 11 from S obtained from(!), calculate~ by (3.6.9) using this S 11 and 1 1 by (3.6.10). Then take the difference in J. 13 The test was developed in Newey (1985, Section 4) and Eichenbaum, Hansen, and Singleton ( 1985, Appendix

C).

221

Single-Equation GMM

It is left as an optional analytical exercise to prove that C calculated as just described is nonnegative in finite samp]es. We can use Proposition 3.7 to test for the endogeneity of a subset of regressors, as the next example illustrates.

Example 3.3 (testing whether schooling is predetermined in the wage equation): In the wage equation of Example 3.2, suppose schooling S, is suspected to be endogenous. To test for the endogeneity of S,, partition x; as

Xil

EXPR; AGE, MED;

=

The vector of regressors, z,, is the same as in Example 3.2. The first step is to do the efficient two-step GMM estimation of~ with x, = (1, EXPR;, AGE;, MED,. s,y as the instruments. This produces J and the 5 X 5 matrix Second, extract the leading 4 x 4 submatrix S 11 corresponding to Xi! from S and estimate the same coefficients ~ by GMM, this time with the fewer instruments XiJ and using this sll· The difference in the J statistic from the two different GMM estimators of~ should be asymptotically x2 (1 ).

s.

-

QUESTIONS FOR REVIEW "........

-~1

1. Does J(~(S- 1 ), S ) tent estimators of S? 2. Show:

" ........ I

J(~(S-

........ I

-+d

), s-) =

........

-

x'(K - L) where SandS are two different consis........ [

n-s~ys-

(sxy-

A

........

S,~(s-

I

)).

3. Can the degrees of freedom of the C statistic be greater than K- L? [Answer: No.] 4. Suppose K 1 = L. Does the numerical value of C depend on the partition of x, between xi 1 and xi2?

222

Chapter 3

3.7 Hypothesis Testing by the Likelihood-Ratio Principle We have derived in Section 3.5 the chi·squared test statistics for the null hypothesis H 0 : a(~) = 0 by the Wald principle. This section does the same by the likelihood· ratio (LR) principle, which is to examine the difference in the objective function with and without the imposition of the null. Derivation of test statistics by the Lagrange Multiplier principle for GMM and extension to nonlinear equations are given in Section 7.4. In the efficient GMM estimation, the objective function is J (a, s- 1) for a given consistent estimateS of S. The restricted efficient GMM estimator is defined as restrictedefficientGMM: a(S- 1) =argminJ(a,s- 1)

subjecttoH0 •

(3.7.1)

j

The LR principle suggests that (3.7.2) should be asymptotically chi·squared. Indeed it is.

-

Proposition 3.8 (test statistic by tbe LR principle): Suppose Assumptions 3.13.5 hold and suppose there is available a consistent estimator, S, ofS (= E(g1ft;)). Consider the null hypothesis of #a restrictions H 0 : a(~) = 0 such that A(~). the #a x L matrix of first derivatives, is continuous and of full row rank. Define two statistics, W and LR, by (3.5.16) and (3. 7.2), respectively. Then, under the null, the following holds:

(a) The two statistics are asymptotically equivalent in that their asymptotic distributions are the same (namely, x2 (#a)). (b) The two statistics are asymptotically equivalent in the stronger sense that their

numerical difference converges in probability to zero: LR- W -+p 0. (By Lemma 2.4(a), this result is stronger than (a).) (c) Furthermore, if the hypothesis is linear so that the restrictions can be written as R~ = r, then the two statistics are numerically equal. Proving this for the linear case, which is an algebraic result, is left as an analytical exercise. For a full proof, see Section 7 .4.

223

Single-Equation GMM

Several comments about Proposition 3.8: • The advantage of LR over W is invariance: the numerical value of LR does not depend on how the nonlinear restrictions are represented by a(·). On the other hand, you have to write a nonlinear optimization computer program to find the restricted efficient GMM when the hypothesis is nonlinear. • Proposition 3.8 requires that the distance matrix W satisfy the efficiency condition plim W= Otherwise LR is not asymptotically chi-squared. In contrast, the Wald statistic is asymptotically chi-squared without Wsatisfying the efficiency condition.

s-'.

• The same estimate of S should be used throughout in the calculation of LR. Let S and S be two different consistent estimators of S, and consider the statistic

This statistic is what you end up with if you perform two separate two-step efficient GMMs with and without the constraint of the null; S is the estimate of S from the first step with the constraint, while S is from the first step without the constraint. The statistic is asymptotically equivalent to LR (in that the difference between the two converges in probability to zero), but in finite samples it may be negative. Having the same estimate of S throughout ensures the nonnegativity of the statistic in finite samples. Researchers usually use the estimate from unconstrained estimation (S) here, but the estimate from constrained estimation is also valid because it is consistent for S under the null. • Part (b) of the proposition (the asymptotic equivalence in a stronger sense) means that if the sample size is large enough and the hypothesis is true, then the outcome (not just the probability of rejection or acceptance) of the test based on LR will be the same as that based on W because the probability that the two statistics differ numerically by an even tiny amount is zero in sufficiently large samples. • For the numerical equivalence LR = W for the linear case to hold, the same S must be used throughout to calculate not only LR but also the Wald statistic. Otherwise LR and W are only asymptotically equivalent. ~

The LR Statistic for the Regression Model

Because the regression model of Chapter 2 is a special case of the GMM model of this chapter, it may be useful to know how LR would look for that special case.

224

Chapter 3

Because x1 = z 1 in the regression model, the (unrestricted) efficient GMM estima1) = 0. Therefore, tor is OLS and J(~(S- 1 ),

s-

(3.7.3) where ~(S- 1 ) is the restricted efficient GMM estimator. By Proposition 3.8, this statistic is asymptotically chi-squared and is numerically equal to the Wald statistic if the null is linear. As will be shown below, under conditional homoskedasticity, this statistic can be written as the difference in the sum of squared residuals normalized to the error variance. Variable Addition Test (optional)

In the previous section, we considered a specification test based on the C statistic for the endogeneity of a subset, x, 2 of instruments x, while assuming that the other instruments xn are predetennined. Occasionally, we encounter a special case

where zi

= Xi!:

(3.7.4) A popular method to test whether the suspect instruments x12 are predetermined is to estimate the augmented equation

with y =

[!]

(3.7.5)

and test the null H0 : a = 0. Testing is either by the Wald statistic W or by the LR statistic, which is numerically equal to W. This test is sometimes called the variable addition test. How is the test related to the C test of Proposition 3.7? To calculate LR, we have to calculate two efficient GMM estimators of y with the same instrument set x1: one with the constraint a = 0 and one without. The unrestricted efficient GMM estimator is the OLS estimator on the unconstrained equation (3.7.5). The associated J statistic is zero. Let (3.7.6) where e1 is the OLS residual from the unrestricted regression (3.7.5). If we use this estimate of S, the restricted efficient GMM estimator of y minimizes the J function

225

Single-Equation GMM

subject to ct = 0. Clearly, the restricted estimator can be written as

_=

y

[~(s-ll] 0

'

where ~(S- 1 ) is the efficient GMM estimator with x1 as instruments on the restricted equation (3.7.4). So LR = n·(s,,- S,,y)'S- 1(s,,- S,,y) "' ........ I

........ I

(since J = 0 for unrestricted GMM) " ........ I

= n·(s,,- Sxx 1.5(S- ))'S- (s,,- Sxx,.5(S- ))

·

-

" ..-._1

(smce S.,y = Sxx 1.5(S )).

This is none other than Hansen's J statistic for the restricted equation (3.7.4) when x1 is the instrument vector. This statistic, in tum, equals the C statistic because the 11 in Proposition 3. 7 is zero in the present case. Therefore, all three statistics, W from the unrestricted regression, LR, and C, are numerically equal to Hansen's J, provided that the same S is used throughout. That is, the variable addition test is numerically equivalent to Hansen's test for overidentifying restrictions.

-

QUESTIONS·FOR REVIEW

1. (LR for the regression model) Verify that, for the regression model where

2. (Choice ofS in variable addition test) Suppose you formS from the residual from the restricted regression (3.7.4) and use it to form W, LR, and C. Are they numerically equal? Are they asymptotically chi-squared? Hint: Is this S consistent under the null?

3.8

Implications of Conditional Homoskedasticity

So far, we have not assumed conditional homoskedasticity in developing the asymptotics of the GMM estimator. This section considers the implication of imposing

226

Chapter 3

Assumption 3.7 (conditional homoskedasticity):

Under conditional homoskedasticity, the matrix of fourth moments S (= E(g;gj) = E(ef x,x;)) can be written as a product of second moments: S = a 2 1:xx,

(3.8.1)

where :Exx = E(x;x;). As in Chapter 2, this decomposition of S has several implications.

• Since Sis nonsingular by Assumption 3.5, this decomposition implies that a 2 > 0 and :Exx is nonsingular. • The estimator exploiting this structure of S is (3.8.2)

where &2 is some consistent estimator to be specified below. By ergodic stationarity, Sxx --+,,,, :Exx. Thus, provided that &2 is consistent, we do not need the fourth-moment assumption (Assumption 3.6) for S to be consistent. Needless to say, all the results presented so far are valid under the extra condition of conditional homoskedasticity. But many of the results and formulas can be simplified substantially under this extra condition, by just replacing S by a 21:,, and the expression (3.5.10) for S by (3.8.2). This section collects those simplifications. Efficient GMM Becomes 2SLS In the efficient GMM estimation, the weighting matrix is s- 1 . If we setS to (3.8.2), the GMM estimator becomes

;l(s-1) = [8~(&2 S,,)-t s.,r1 S~z(&2 S,,)-1 Sxy (s , s- 1 s )- 1s' s- 1 =

xz

"

-1

xx

= a(Sxx)

xz

" = a2SLS•

xz

xx Sxy

(3.8.3)

which does not depend on &2 In the general case, the whole point of the first step in the efficient two-step GMM was to obtain a consistent estimator of S. Under conditional homoskedasticity, there is no need to do the first step because the second-

227

Single-Equation GMM

step estimator collapses to the GMM estimator with Sxx used for S, ~(S;x'l. This estimator, ~ 2 sLS. is called the Two-Stage Least Squares (2SLS or TSLS). 14 The same equation can be estimated by maximum likelihood. Section 8.6 will cover the ML counterpart of 2SLS, called the "limited-information maximum likelihood estimator.'' The expression for Avar(~ 2 sLs) can be obtained by substituting (3.8.1) into (3.5.13): (3.8.4) A natural estimator of this is

-

Avar(~2SLsl = a 2 ·(S~,s;x' Su)- 1

(3.8.5)

For <7 2, consider the sample variance of the 2SLS residuals: -2

a

I~ = -n. L.)Y; -

· 2 i;~2SLs) ·

(3.8.6)

r=l

(Some authors divide the sum of squares by n - L, not by n, to calculate a2 ) By Proposition 3.2, (3.8.6) ---+p a 2 if E(z,z;J exists and is finite. Thus S defined in (3.8.2) with this <7 2 is consistent for S. Substituting (3.8.2) into (3.5.15) and (3.5.16), the t-ratio and the Wald statistic become

J

,52SLS,f SEe

8e

SEe =

(3.8.7)

a(82sLs)' [A (82SLsl (S~, s;x' s,J- 1A (82SLsl']-l a( 82sLS) W = n· • . a2

(3.8.8)

te =

with

Becomes Sargan's Statistic

When

Wis set to (a 2Sxx)- 1, the distance defined in (3.4.6) becomes J( 'i

O,

("2.s )-') = (J

XX

. (Sxy-

n

l4This estimator was first proposed by Theil ( 1953).

S,~)'S;;.'(sxyS,~) ' " a2

(3.8.9)

228

Chapter 3

Proposition 3.6 then implies that the distance evaluated at the efficient GMM estimator under conditional homoskedasticity, ~ZSLS• is asymptotically chi-squared. This distance is called Sargan's statistic (Sargan, 1958):

-

-

I

-

(Sxy - S,~zsLSl'S;;-x (Sxy - Su~2SLS) t t' t' .. S argan s sa IS tc = n· cr 2 ,

(3.8.10)

We summarize our results so far as Proposition 3.9 (asymptotic properties of 2SLS): (a) Under Assumptions 3.1-3.4, the 2SLS estimator (3.8.3) is consistent. If Assumption 3.5 is added, the estimator is asymptotically normal with the asymp totic variance given by (3.5.1) with W = (cr 2 l:xx)- 1 • If Assumption 3.7 (conditional homoskedasticity) is added to Assumptions 3.1-3.5, then the estimator is the efficient GMM estimator. Furthermore, ifE(z 1z;l exists and is finite, 15 then (b) the asymptotic variance is consistently estimated by (3.8.5), (c) It in (3.8. 7)

""'d

N(O, 1), Win (3.8.8)

(d) the Sargan statistic in (3.8.10)

""'d

""'d

x2 (#r), and

x2 (K- L).

Proposition 3.8 states that the LR statistic, which is the difference in J with and without the imposition of the null hypothesis, is asymptotically chi-squared. Since J can be written as (3.8.9), we have

(3.8.11) where ~ is the restricted 2SLS estimator which minimizes (3.8.9) under the null hypothesis. 16 In Proposition 3.8, the use of the same S guaranteed the statistic to be nonnegative in finite samples. Here, deflation by the same rr 2 ensures the statistic to be nonnegative. If the hypothesis is linear, then this LR is numerically equal to the Wald statistic W.

15 Th is additional assumption is needed for the consistency of 6- 2 ; see Proposition 3.2. 16 Th is statistic was first derived by Gallant and Jorgenson (1979).

229

Single-Equation GMM

Small-Sample Properties of 2SLS

There is a fairly large literature on the finite-sample distribution of the 2SLS estimator (see, e.g., Judge et al. (1985, Section 15.4) and Staiger and Stock (1997, Section I)). Some studies manage to derive analytically the exact finite-sample distribution of the estimator, while others are Monte Carlo studies for various DGPs. The analytical results, however, are not useful for empirical researchers, because they are derived under the restrictive assumptions of fixed instruments and normal errors and the analytical expressions for distributions are computationally intractable. For the case of a single regressor and a single (stochastic) instrument with normal errors, Nelson and Startz (1990) derive the exact finite-sample distribution of the 2SLS estimator that is fairly simple and easy to calculate. They also show that, when the instrument is "weak" in the sense of low explanatory power in the first-stage regression of the regressor on the instrument, a mass of the distribution of the sampling error 82sLS - 8 remains apart from zero until the sample size gets really large. Their work illustrates the need for reporting the R 2 for the first-stage regressions; if the R 2 is low, we should suspect the large-sample approximation to the distribution of the 2SLS estimator to be poor (you will see a dramatic example in part (g) of the empirical exercise). Recently, Staiger and Stock (1997) proposed an alternative asymptotic approximation to the finite-sample distribution of the 2SLS and other estimators and associated test statistics, for the case of "weak" instruments. Their device is to look at a sequence of models along which the coefficients of the instruments in the first-stage regressions converge to zero. (Analytical Exercise I 0 works out this type of asymptotics for the simple case of one regressor and one instrument.) Alternative Derivations of 2SLS

If we define data matrices as

X =

x'I

z'1

x'2

z'2

z

(nxL)

(nxK)

x,.'

=

Yt y

Yz

(nxl)

z'

"

y,.

then it is easy to see that the 2SLS estimator and associated statistics can he written as

230

Chapter 3

~2SLS = [Z'X(X'X)- 1X'Z]- 1Z'X(X'X)- 1X'y = (Z'Pz)- 1Z'Py, where P

(3.8.3')

= X(X'X)- 1X' is the projection matrix, ~

Avar(~2sLS) = n·a 2·[Z'X(X'X)- 1X'zr 1 = n-& 2-(Z'PZ)- 1. .. 2

a

e' 'e'

.. Where E ;: Y- Z~2SLS·

= -n

te =

Ja'·

(3.8.5')

~2SLS l ·

-

(3.8.6')

8e

(3.8.7')

([Z'X(X'X)-IX'Z]-1 ),

a(82sLS)' [ A(82sLS)(Z'X(X'x)- 1X'Z)-I A(82SLs)'r

1

a(~2SLS)

w = - - - " - - - - - - , - : - - - - - - d . __ _

"'

J(~ (''·S )_ 1) _ (y- Z~)'P(y- Z~) XX

'(f

"2

-

a

(3.8.8') (3.8.9')

·

e"'p"e , . . Sargan s stattst1c = ~ a

(3.8.10')

Using this formula, we can provide two other derivations of the 2SLS estimator. 2SLS as an IV Estimator Let (L x I) be the vector of L instruments (which will be generated from X; as described below) for the L regressors, and let Z be then x L data matrix of those L instruments. So the i -th row of Z is The IV estimator of~ with serving as instrument• is, by (3.4.4),

z;

z;.

n

' ( ;;I L..,Z;Z; ~' ') ~IV= i=l

-1

I ;;

z;

n

~' L..,Z;·y;

=

-, Z)- 1-, (Z z y.

(3.8.I2)

i=l

Now we generate those L instruments from X; as follows. The f-th instrument is the fitted value from regressing z;e (the f-th regressor) on x;. Then-vector of fitted value is X(X'X)- 1X'z 1 , where ze is then-vector of the f-th regressor (i.e., the f-th column of Z). Therefore, then x L data matrix of instruments is

Z=

(X(X'X)- 1 X'z 1, ... , X(X'X)- 1X'zc) = X(X'X)- 1X'Z = PZ, (3.8.!3)

where Pis the projection matrix. Substituting this into the IV format(3.8.I2) yields the 2SLS estimator (see (3.8.3')).

231

Sing\e·Equation GMM

2SLS as Two Regressions Instead of substituting the generated instruments into the IV format to estimate the equation y1 = z;8 +£;,consider regressing y, on The coefficient estimate is

z;

z,.

(Z'Z)- 1 Z'y = (Z'P'PZ)- 1Z'Py = (Z'PZ)- 1Z'Py

(since Pis symmetric and idempotent).

This is the 2SLS estimator. So the 2SLS coefficient estimate can be obtained in two stages: the first stage is to regress the L regressors on x, and obtain fitted values and the second stage is to regress y; on those fitted values. For those regressors that are predetermined and hence included in x,, there is no need to carry out the first-stage regression because the fitted value is the regressor itself. To see this, if Z;e is predetermined and included in x1 as the k-th instrument, the n-vector of fitted values for the e-th regressor is Pz 1 , where ze is the n-vector whose i-th element is z11 • But since z1 is also the k-th column of X, Pz1 = Px,. Since P is the projection matrix, Px, = x,. This derivation of the 2SLS is useful as it justifies the naming of the estimator, but there is a pitfall. In the second-stage regression where y, is regressed on the standard errors routinely calculated by the OLS package are based on the residual vector y - z82SLS· This does not equal the 2SLS residual y - z82SLS· Therefore, the OLS standard errors and estimated asymptotic variance from the second stage cannot be used for statistical inference.

z,,

z;,

When Regressors Are Predetermined When all the regressors are predetermined and the errors are conditionally homoskedastic, there is a close connection between the distance function J for efficient GMM and the sum of squared residuals (SSR). From (3.8.9') on page 230,

-

J (8, (a 2 ·S,,) -I )

-

-

(y - Z8)'P(y - Z8) = -"---'-.:-o2:"'----'a - _, y'Py - 2y'PZ8 + 8 Z'PZ8 ,72

-

_,

-

y'Py - 2y'Z8 + 8 Z'Z8 ,72

(since PZ = Z when z, C x,)

(y - Z~)' (y - Z~)

y'y- y'Py

&2 (y- Z8)'(y- Z8) &2

&2 (y- y)(y- y)' &2

-

(3.8.14)

232

Chapter 3

where y - Py is the vector of fitted values from unrestricted OLS. Since the last term does not depend on ~. minimizing J amounts to minimizing the sum of squared residuals (y- ZJ)'(y- ZJ). It then follows that (I) the efficient GMM estimator is OLS (which actually is true without conditional homoskedasticity as long as z; = x; ), (2) the restricted efficient GMM estimator subject to the constraints of the null hypothesis is the restricted OLS (whose objective function is not J but SSR), and (3) the Wald statistic, which is numerically equal to the LR statistic, can be calculated as the difference in SSR with and without the imposition of the null, normalized to a2 This last result confirms the derivation in Section 2.6 of the Wald statistic by the Likelihood Ratio principle.

-

Testing a Subset of Orthogonality Conditions

In Section 3.6 we introduced the statistic C for testing a subset of orthogonality conditions. It utilizes two efficient GMM estimators of the same equation, one using the full set X; of instruments and the other using only a subset, x", of X;. To examine what expression it takes under conditional homoskedasticity, let X 1 (n x K 1) be the data matrix whose i -th row is x; 1 • Because the two-step efficient GMM estimator is the 2SLS estimator under conditional homoskedasticity, the two GMM estimators are given by

J=

(Z'Pz)- 1Z'Py

~ = (Z'P 1Z)- 1Z'P 1y

with with

P

= X(X'X)- 1X',

(3.8.15)

P 1 = X1 (X'1X 1)- 1X;.

(3.8.16)

And the C statistic becomes the difference in two Sargan statistics: (3.8.17) where ... 2

., .

ee

a=-

n

As seen for the case without conditional homoskedasticity, C is guaranteed to be nonnegative in finite samples if the same matrix Sis used throughout, which under conditional homoskedasticity amounts to using the same estimate of the error variance, a 2 , to deflate both e'P€ and €'P 1€, as in (3.8.17). By Proposition 3.7, this C is asymptotically x'(K- K 1). Perhaps more popular is the test statistic proposed by Hausman and Taylor (1980) and further examined by Newey (1985). J is asymptotically more efficient than ~ because ~ exploits more orthogonality conditions. Therefore, Avar(~) >

-

.

.

Single-Equation GMM

•

233

'

Avar(8). Furthermore, as you will be asked to show (Analytical Exercise 9), under the same hypothesis guaranteeing C to be asymptotically chi-squared, -

-

~

~

Avar(8 - 8) = Avar(8) - Avar(8).

(3.8.18)

(If you have been exposed to the Hausman test in the ML context, you can recognize this as the GMM version of the Hausman principle.) By Proposition 3.9, ' Avar(8) is consistently estimated by ~

Avar(cS)

= n-& 2 -(Z'PZ)- 1.

(3.8.19)

Similarly, a consistent estimator of Avar(8) is ~

Avar(8)

= n-& 2 -(Z'P 1Z)- 1•

(3.8.20)

Here, as in the calculation of the C statistic, the same estimate, &2 , is used throughout, in order to guarantee the test statistic below to be nonnegative. The resulting - ' estimator of Avar(8- 8) is ~

Avar(8- cS) = &2 ·n·((Z'P 1Z)- 1 - (Z'PZ)- 1].

(3.8.21)

Hausman and Taylor (1980) have shown that (I) this matrix in finite samples is positive semidefinite (nonnegative definite) but not necessarily nonsingular, but (2) for any generalized inverse 17 of this matrix, the Hausman statistic H

= Jn(i,- cS)'(&'·n·(
&'

(3.8.22)

is invariant to the choice of the generalized inverse and is asymptotically chisquared with min(K- K 1 , L- s) degrees of freedom, where s = #zi

n xi 1 = number of regressors which are retained as instruments in xil.

What is the relationship between C and H under conditional homoskedasticity? It can be shown (see Newey, 1985) that:

17A gcncraJizcd inverse, A-, of a matrix A is any matrix satisfying AA- A = A. lf A is square and nonsingular. then A- is unique and equal to A -I.

234

•

Chapter 3

If K - K 1 < L - s so that both H and C have the same degrees of freedom, then H = C (numerically equal). Otherwise, the two statistics are numerically different and have different degrees of freedom.

One frequent case where K- K1 < L- s holds is when x 12 is a subset of z1, that is, when the suspect instruments are a subset of regressors. In this case, the formulas (3.8.17) and (3 .8.22) are two alternative ways of calculating numerically the same statistic. In the case where K - K 1 > L - s, the degrees of freedom for H are less than those for the C statistic (K - K 1). For this reason, the Hausman test, unlike the C test, is not consistent against some local alternatives. 18 Testing Conditional Homoskedasticlty

For the OLS case with predetermined regressors, as shown in Section 2.7, there is a convenient n R 2 test of conditional homoskedasticity. In the present case, the nR 2 statistic obtained by regressing the squared 2SLS residuals on a constant and second-order cross products of the instrumental variables turns out not to have the desired asymptotic distribution. A test statistic that is asymptotically chi-squared is available but is extremely cumbersome. See White (1982, note 2). Testing for Serial Correlation

For the OLS case, we developed in Section 2.10 tests for serial correlation in the error term. More specifically, under (2.10.15) (which is stronger than Assumption 2.3 or 3.3) and (2.10.16) (which is stronger than Assumption 2.7 or 3.7 of conditional homoskedasticity), the modified Box-Pierce Q given in (2.10.20) can be used for testing the null of no serial correlation in the error term, and this statistic is asymptotically equivalent (in the stronger sense of the plim of the difference being zero) to the nR 2 statistic from regression (2.10.21). Can this test be extended to the case where the regressors z1 are endogenous? If the instruments x; satisfy (2.1 0.15) and (2.10.16), then the argument in Appendix 2.B of Chapter 2 can be generalized to produce a modified Q statistic that is asymptotically chi-squared under the null of no serial correlation. However, the expression for the statistic is more compJicated than (2.10.20) and so is not presented here. This modified Q statistic for the case of endogenous regressors does not seem to be asymptotically equivalent to the nR 2 statistic from regression (2.1 0.21 ).

18 The Hausman statistic can he generalized to the case of conditional heteroskedasticity, but it is not practical because the degrees of freedom depend on the unknown values of the matrices I::u and S. See Newey (l985).

Single·Equation GMM

•

235

QUESTIONS FOR REVIEW

1. (GMM with conditional homoskedasticity) In efficient two-step GMM estimation, Sin the second step is calculated from the first step by (3.5.10) using the first-step residuals. Under conditional homoskedasticity, derive the asymptotic variance of the two-step estimator. Is it the same as (3.8.4)? Hint: Under conditional homoskedasticity, (3.5.10) -->p a 2 :Exx. 2. (2SLS without conditional homoskedasticity) ls the 2SLS consistent when conditional homoskedasticity does not hold? Derive the plim of (3.8.2) and ' Avar(8 2sLs) without assuming conditional homoskedasticity. Is the 2SLS as efficient as the two-step GMM (i.e., is its asymptotic variance as small as (3.5.13))? Hint: 2SLS is a GMM estimator with a choice of Wthat is not necessarily efficient without conditional homoskedasticity. 3. (GLS interpretation of 2SLS) Verify that the 2SLS estimator can be written as a GLS estimator if S, and Sxy are interpreted as the data matrix of regressors and the data vector of the dependent variable and Sxx as the variance matrix of the error term.

4. Provide two cases in which ~2SLS and ~(S- 1 ) are asymptotically equivalent in the sense that

Hint: Keywords are "conditional homoskedasticity" and "exactly identified."

5. Suppose the equation is just identified. Show that 2SLS (3.8.3) reduces to IV (3.4.4). 6. (Sargan as nR 2 ) Prove that Sargan's statistic (3.8.1 0') equals nR,;c, where R,;c is the uncentered R-squared from a regression of eon X. Hint: Review Question 8 of Section 1.2. 7. (When z; is a strict subset of x;) Suppose z; is a strict subset of X;. Sox; includes, in addition to the regressors (which are all predetennined), some other predetermined variables. We have shown that the efficient GMM estimator is OLS under conditional homoskedasticity. Does the result remain true without conditional homoskedasticity? Hint: Review Question 5 of Section 3.5.

236

3.9

•

Chapter 3

Application: Returns from Schooling

Since Mincer's (1958) pioneering study, the relationship between the wage rate and schooling has been the subject of a large number of empirical and theoretical investigations. You might find the amount of attention puzzling because the explanation of the positive relationship seems to be obvious: education enhances the individual's productivity. There are, however, other explanations. For example, in the Spencian job market signaling model, more educated individuals receive higher wages only because education is used as a signal of higher ability; although education does not increase the individual's earning capacity, there is a correlation between the wage rate and schooling because both variables are influenced by the third variable, ability. This section shows how to use the technique of this chapter to isolate the effect of education on the wage rate from that of ability. One of the earliest studies to address this issue is Griliches (1976). The NLS· Y Data

The data used by Griliches are the Young Men's Cohort of the National Longitudinal Survey (NLS-Y). This cohort was first surveyed in 1966 at ages 14-24, with 5,225 respondents, and was resurveyed at one- or two-year intervals thereafter. By 1969, about a quarter of the original sample was lost, but there are 2,026 individuals who reported earnings in 1969 and whose records are complete enough to allow derivation of all the variables for analysis. A very attractive feature of the NLS-Y is its inclusion of two measures of ability. One of them is the score on the Knowledge of the World of Work (KWW) test administered by the NLS interviewers in 1966. The other measure is the IQ score. All youths in the survey who had completed ninth grade by 1966 were asked to sign waivers letting their school supply the survey administrator their scores on various tests and other background materials. The resulting School Survey, conducted in 1968, yielded data on different mental ability scores, which were combined into JQ equivalents. Of the 2,026 individuals with information about the 1969 wage rate and other variables, the IQ score is available for I ,362 individuals, reflecting the fact that the School Survey was able to cover only two-thirds of the original sample. Our discussion here concerns Griliches's results based on this smaller sample. Table 3.1 reports the means and standard deviation of key variables.

Single-Equation GMM

237

•

Table 3.1: Characteristics of Young Men from the National Longitudinal Survey Means and Standard Deviations (in parentheses) Variable Sample size

I ,362

Age in 1969

22.3 (3.2)

Schooling in years in 1969 (S)

12.5 (1.9)

Logarithm of hourly wages (in cents) in 1969 (LW)

5.68 (0.40)

Score on the Knowledge of the World of Work test (KWW) JQ score (IQ)

97.7 (15.3)

Experience in years in 1969

SOURCE:

35.1 (7.9)

3.7 (2.8)

Gri1iches (1976, Tah1e 1).

The Semi-Log Wage Equation

The typical wage equation estimated in the literature is the semi-log form: LW =a+ ,8S + y A+ a'h + e,

(3.9.1)

where LW is the log wage rate for the individual, S is schooling in years, A is a measure of ability, h is the vector of observable characteristics of the individual (such as experience and location dummies), a is the associated vector of coefficients, and e is the unobservable error term with zero mean (in this section, the individual subscript i will be dropped for notational simplicity). The semi-log specification for schooling Sis often justified by appealing to the well-established stylized fact from large cross-section data (such as the Current Population Survey) that the relationship between log wages and schooling is linear. 19 The schooling coefficient ,8 measures the percentage increase in the wage rate the individual would receive if she had one more year of education. Jt therefore l9por the funclional form issue, see Card (1995).

238

•

Chapter 3

represents the marginal return from investing in human capital, which should be in the same order of magnitude as the rate of return from financial assets. We assume that the nonconstant regressors, (S, A, h), are uncorrelated with the error term c so that the OLS is an appropriate estimation technique if ability A is included in the regression along with S and h. In the rest of this section, we examine what Griliches called the "ability bias"- biases on the OLS estimate of {3 that would arise when ability A is not included in the regression and when its imperfect measure is included in its place. Omitted Variable Bias

Sometimes the data set you work with has no measures of A (this is true for the Current Population Survey, for example). What is the consequence of ignoring ability by omitting A from the wage equation? We know from Section 2.9 that the regression of LW on a constant, S, and h provides a consistent estimator of the corresponding least squares projection, which can be written as

-* -* (a+/3S+yA+8h+sll,S,h) ' E (LWII,S,h)=E

(by(3.9.1))

-* -* =a+f3S+8'h+yE (A ll,S,h)+E (< ll,S,h). (3.9.2)

Since the regressors in (3.9.1) are all predetermined, we have E(s) = 0, E(S c)= 0, and E(h ·<) = 0. So -* E (< I I , S, h) = 0. Writing -* E (A I I, S, h) = 81 + Bs S + ll~h, (3.9.2) becomes

-. (LW E

I I, S, h) =(a+ yB1) + ({3 + yBs)S + (8 + y·llh) ' h.

Therefore, the OLS coefficient estimate can be asymptotically biased for all the included regressors. This phenomenon is called the omitted variable bias. In particular, plim fioLs = {3 + yBs. That is, the OLS estimate of the schooling coefficient {3 includes the indirect effect of ability on log wages through schooling (YBs) as well as the direct effect of schooling ({3). If Bs is positive, then fioLs is asymptotically biased upward. Using the sample of I ,362 individuals from the NLS-Y described above, Griliches estimated the wage equation for 1969. The list of variables included in h will not be given here; suffice to say that it includes experience in years and some region and city-size dummies. Griliches's estimate of the schooling coefficient when ability is ignored in the wage equation is reproduced on line I of Table 3.2.

Single-Equation GMM

•

239

On the face of it, the estimate looks good: the point estimate resembles the rate of return one might get from financial assets, and it is sharply estimated as evidenced by the high t-value. IQ as the Measure of Ability

As already mentioned, the NLS-Yhas two measures of ability, KWW (the score on the Knowledge of the World of Work test) collected in 1966 and IQ (the IQ score). Since the NLS- Y respondents were at least fourteen years old in 1966, KWW would reflect the effect of schooling already undertaken, and so it cannot be a measure of raw ability. The IQ score does not have this problem. If /Q were a perfect measure of ability so that A can be equated with it, then the wage equation (3.9. I) could be estimated consistently with IQ substituting for A. Griliches's OLS estimates of fJ (schooling coefficient) and y (ability coefficient) when IQ is included in the equation are reported in line 2 of Table 3.2. Now the estimated schooling coefficient is lower, confirming our prediction that the estimated schooling coefficient includes the effect of ability on the wage rate when ability is omitted from the regression. Errors-in-Variables

Of course the IQ score may not be an error-free measure of ability. If ry is the measurement error, IQ is related to A as IQ = +A

+ ry,

(3.9.3)

with E(ry) = 0. The interpretation of ry as measurement error makes it reasonable to assume that ry is uncorrelated with A, S, h, and the wage equation error term e. Substituting (3.9.3) into (3.9.1), we obtain LW =(a- y)

+ fJS + y!Q + 8'h + (e- yry).

(3.9.4)

We now illustrate for this example the general result that, if at least one of the regressors is measured with error, the OLS estimates of all the regression coefficients can be asymptotically biased. To examine the consequence of using the error-ridden measure IQ for A, consider the corresponding least squares projection: E'(LW II, S, /Q,h) =(a- y)

+ fJS + y!Q+8'h

+ -* E (e II, S, /Q, h)- y -* E (ry

I I, S, IQ, h). (3.9.5)

Table 3.2: Parameter Estimates Estimation equation: LW =a+ {JS + y JQ +other controls

Line no.

Estimation technique

Coefficient of

s

SER

R'

Which regressor is endogenous?

0.332

0.309

none

0.313

none

JQ

OLS

0.065 (13.2)

2

OLS

0.059 (10.7)

0.0019 (2.8)

0.331

3

2SLS

0.052 (7.0)

0.0038 (2.4)

0.332

JQ

Excluded predetermined variables

MED, KWW, age, age squared, background variables

SOURCE: Line 1: equation (Bl) in Grilichcs (1976. Table 2). Line 2: equation (B3) in Grilicbcs's Table 2. Line 3: line 3 in Grilichcs's Table 5. Figures in parentheses are t-values rather than standard errors.

Single-Equation GMM

241

•

-,

Now consider E (e I 1, S, /Q, h) in (3.9.5). Since E(e) = 0 and (S, h) are uncorrelated with e, ( 1, S, h) are orthogonal toe. /Q is also orthogonal toe because E(IQ e)= E[(r/> +A+ ry) e] = E(ry e)

(by (3.9.3))

(since E(e) = 0, Cov(A, e) = 0)

=0.

.

-,

Thus, E (e 11. S, IQ, h)= 0 m (3.9.5). So the biases, if any, equal -yE'(ry 11. S, IQ, h) in (3.9.5). Let COs, 01Q, 11}.) be the projection coefficients of(S, IQ, h) in E'(ry II, S, /Q, h). By the formula (2.9.7) from Chapter 2,

l[

Os 01Q [ llh

=

Var(S) Cov(IQ, S) Cov(h, S)

Cov(S, /Q) Var(/Q) Cov(h, /Q)

Cov(S, h') Cov(IQ, h') Var(h)

l

-1 [

Cov(S, ry) Cov(/Q, ry) Cov(h, ry)

l

. (3.9.6)

In this expression, Cov(S, ry) and Cov(h, ry) are zero by asswnption. Cov(IQ, ry), however. is not zero because

Cov(IQ, ry)

= E(IQ ry)

(since E(ry)

= E[(r/> +A+ ry) ry] = rJ> E(ry)+ E(A ry) = Var(ry)

= 0)

(by (3.9.3))

+ E(ry 2 )

(since E(ry) = 0 and Cov(A, ry) = 0).

That is, the measurement error, if uncorrelated with the true value, is necessarily correlated with the measured value. Using the fact that Cov(S, ry) = 0,

Cov(/Q, ry) = Var(ry),

and

Cov(h, ry) = 0,

the projection coefficients can be rewritten as

[

~~]

= Var(ry) ·a,

where a

=

Var(S) second column of Cov(IQ, S) [ Cov(h, S)

Cov(S, /Q) Var(/Q) Cov(h, IQ)

Cov(S, h') Cov(IQ, h') Var(h) ]

-!

242

•

Chapter 3

Therefore, in the regression of LW on a constant, S, IQ, and h, plim tloLs = fJ - y · Var(ry)·(lst element of a),

(3.9.7a)

plim YoLs = y- y· Var(ry)·(2nd element of a).

(3.9.7b)

Since the second element of a is positive (it is a diagonal element of the inverse of a variance-covariance matrix), the OLS estimate of the ability coefficient is biased downward. The direction of the asymptotic bias for the schooling coefficient, however, depends on the sign of the first element of a. Typically, Cov(S, IQ) would be positive, so, barring unusually strong correlation of (S, IQ) with h, the first element of a would be negative. Thus, we conjecture that tloLs is biased upward for schooling. 2SLS to Correct for the Bias

To control for the bias, Griliches applies the 2SLS to the wage equation. To do so, the set of instruments needs to be specified. The predetermined regressors, (I, S, h), can be included in the set. The additional variables included in the set to instrument IQ are age, age squared, KWW, mother's education, and some other background variables of the individual (such as father's occupation). Those variables are thus assumed to be predetermined. Griliches's 2SLS estimate for this specification is reproduced in line 3 of Table 3.2. 20 In accordance with our prediction that the OLS estimate of the ability coefficient is biased downward, the OLS estimate of the ability coefficient of 0.0019 in line 2 is lower than the 2SLS estimate of 0.0038 in line 3. Our conjecture that floLs of the schooling coefficient when IQ is included in the regression is biased upward, too, is borne out by data because the estimate of fJ of 0.059 in line 2 is higher than the estimate of 0.052 in line 3. To summarize, the "ability bias" studied by Griliches is that the schooling coefficient is biased upward if ability is ignored and is biased in an unknown direction (but perhaps upward) if an imperfect measure of ability (IQ in the present case) is included. The 2SLS provides a solution, but it is predicated on the assumption that the instruments used (such as MED and KWW) are uncorrelated with unobserved wage determinants e and ry. Those determinants would include personal characteristics like diligence. It seems not unreasonable to suppose that KWW depends on such characteristics. If so, KWW cannot serve as an instrument. One could go further and argue that MED is not predetermined either, because some 20 The R2 is not reported for the 2SLS estimate because, unlike in the OLS case, the sum of squared 2SLS residuals cannot be divided between the "explalned variation" and the "unexplained variation,"

Single-Equation GMM

•

243

of the mother's personal characteristics that influenced her education and that are valued in the marketplace would be inherited by the child. Whether those variables included in the set of instruments are predetermined can be tested by Sargan 's statistic. But the statistic is not reported in Griliches's paper (which was written long before doing so became a standard practice). Subsequent Developments

Perhaps because it is so difficult to come up with valid instruments that are uncorrelated with unobserved characteristics but correlated with /Q, controlling for the "ability bias" continued to attract much attention in the literature. There is a fairly large literature starting with Behrman et al. (1980), which compares identical twins with different levels of education to control for genetic characteristics and family background. One can also compare the same individual at two points in time. In either case, the appropriate estimation technique is what is called the fixed-effect estimator, to be covered in Chapter 5. The endogeneity of schooling is another major issue. If one takes the view that the error term includes a host of unobservable individual characteristics that might affect the individual's choice of schooling, then schooling needs to be treated as an endogenous variable. But again, finding valid instruments uncorrelated with unobservable characteristics but correlated with schooling is extremely difficult. The recent literature can be viewed as a search for the determinants of schooling having little to do with the individual's unobserved characteristics. The variable used in Angrist and Krueger (1991), for example, is whether the individual is subject to compulsory schooling laws. Finally, the literature on the relationship between school quality and earnings is attracting renewed attention. See Card and Krueger (1996) for a survey. QUESTIONS FOR REVIEW

1. List all the assumptions made in this section about the means and covariances

of (S, A, IQ, KWW, h, c, ry). Study how Griliches states those assumptions. Which ones are crucial for the 2SLS to be consistent? 2. Suppose the relationship between IQ and A is given not by (3.9.3) but by

IQ = ¢> + rr·A

+ ry.

Is the 2SLS estimator of f3 (the schooling coefficient) still consistent? [Answer: Yes.]

244

Chapter 3

•

3. In the Spencian model of job market signaling, individuals are sorted into two groups, with low-ability individuals choosing a low level of education and high-ability ones choosing a high level of education. Education does not increase the earning capacity of individuals, so the individual's wage rate is determined by his or her ability. If we had data on the wage rate, ability, and schooling for a sample of individuals that includes both high- and low-ability individuals, can we test the hypothesis that education by itself does not contribute to higher wages? Hint: The data would have a multicollinearity problem.

PROBLEM SET FOR CHAPTER 3 - - - - - - - - - - - - ANALYTICAL EXERCISES 1.

Prove: A symmetric and idempotent matrix is positive semidefinite.

2. We wish to prove Proposition.3.4. To avoid inessential complications, consider the case where K = I and L = I so that x; and z, are scalars. So (3.5.5) becomes 2

2

A

E; = e; - 2(8 - 8)z; e;

+ (8 A

22

8) z,,

-

and the Sin (3.5.10) becomes

"

-' " L._. ' X; e; ''

n 1=1 .

"

2(8- - 8)-' " L._. Z; X; e; ' 2 n 1=1 .

+ (8- -

"

8) '-' L._. Z; X;. "''' n 1=1 .

(a) Show: The first term on the RHS of(*) converges in probability to S (= E(xf ef)). (b) Use the Cauchy-Schwartz inequality E(IX · Yl) < )E(X 2) • E(Y 2 ) to show that E(z;xfe;) exists and is finite. Hint: z;xfe; is the product of x;e; and x; z;. Fact: E(x) exists and is finite if and only if E(lx 1) < oo. Where do we use Assumption 3.6? (c) Show that the second term on the RHS of(*) converges to zero in probability.

Single-Equation GMM

•

245

(d) In a similar fashion, show that the third tenn on the RHS of(*) converges to zero in probability. 3. We wish to prove the algebraic result (3.5.11). What needs to be proved is that A- B is positive semidefinite where

and

A result from matrix algebra states: Let A and B be two positive definite and symmetric matrices. A- B is positive semidefinite if and only ifB- 1 - A- 1 is positive semidefinite. Therefore, what needs to be proved is that

is positive semidefinite. (a) Since S (K x K) is positive definite, there exists a nonsingular K x K matrix C such that C'C = s- 1• Verify that Q can be written as

where

Hint:

c- 1C'- 1 = s.

(b) Show that Q is positive semidefinite. Hint: First show that MG is symmetric and idempotent. As you showed in Exercise 1, a symmetric and idempotent matrix is positive semidefinite. 4. Suppose z1 is a strict subset of x1 . Which one is smaller in the matrix sense, the asymptotic variance of the OLS estimator or the asymptotic variance of the efficient two-step GMM estimator? Hint: Without loss of generality, we can assume that the first L elements of the K -dimensional vector x 1 are z1• If K x K matrix W is given by

246

•

w-_[A-1 0

Chapter 3

OJ

0 ,

where A = E(sfz,z;J =leading (L x L) submatrix of S,

(LxL)

then the asymptotic variance of the OLS estimator can be written as (3.5.1). If you did Exercise 3, you can verify that the algebraic relation (3.5.11) does not require W to be positive definite, as long as 1:~, WSWl:" is nonsingular. 5. (optional) Prove Proposition 3.6 by following the steps below. (a) Show: g,(cl(S- 1)) = Bg where B

= IK- S.,(S~,s- 1 S.,J- 1 S~,s-I

(b) Since Sis positive definite, there exists a nonsingular K x K matrix C such that C'C = s- 1. Define A= CS.,. Show that B'S- 1B = C'MC, where M = IK - A(A'A)- 1A'. What is the rank of M? Hint: The rank of an idempotent matrix equals its trace. (c) Letv= J/1Cg. Showthatv--*d N(O,IK)(d) Show that 1(J(S- 1), s- 1) = v'Mv. What is its asymptotic distribution? 6. Show that the 1 statistic in Proposition 3.6 is numerically equal to

....... ......

_.-.

......

where B is defined in Exercise 5(a). Hint: From Exercise 5, 1 = n·gB'S- 1Bg. Show that Bg = Bsxy. 7. (optional) Prove Proposition 3.7 by taking the following steps. The notation not introduced in this exercise is the same as in Exercise 5. (a) Prove:

where

gl

1

n

=-n.LXlJ·Cj, 1=1

B1 = IK, - Sx,,[S~,,(Su)- 1 Sx,,r 1 S~,,(Su)- 1 , C'1C1 = (Su)- 1,

Single-Equation GMM

•

247

= IK, -A,(A;AJ)-'A;, A 1 = C 1Sx 1,.

M,

(b) Prove: rank(M) = K- L, rank(MJ) = K, - L. (c) Prove: You have proved in the previous exercise that J = v'Mv, where v .jn C g. Here, prove: J 1 = v; M 1v" where v 1 .jn C 1g1•

=

=

To proceed further, define the K x K 1 matrix F as

Kt rows, K- K 1 rows.

(d) Prove J- J 1 = v'(M- D)v where D = C'- 1FC;M,C 1F'C- 1• (e) Prove that Dis symmetric and idempotent, with rank K 1 - L. Hint: F'c-'C'-'F

= F'(C'C)-'F = F'SF = Su.

c,s"c; = c,cc',c,)-'c; = IK,· (f) Prove: A'D = 0. Hint: 1 1 A'D = (S'xz C')(C'- 1FC'M FC'M I lC l F'C- ) = (S'x z t )(C l l F'C- ) •

so

(g) Prove: M - D is symmetric and idempotent, with rank K - K 1 . (h) Prove the desired result J- J 1

---+d

x2 CK- K 1).

(i) Step (d) has established that C = n·gC'(M- D)Cg. Show that C can be written as n·s~yC' (M- D)Csxy· A

Hint: g" (~) = Bsxy· So the numerical value of C would have been the same if we replaced g by Sxy from the beginning in (a).

(j) Show that M - D is positive semidefinite. Thus, the quadratic form C is nonnegative for any sample.

248

Chapter 3

•

8. (Optional, proof of numerical equivalence in Proposition 3.8) In the restricted efficient GMM, the J function is minimized subject to the restriction Ra = r. For the time being, do not restrict W to be equal to 1 and form the Lagrangian as

s-

where A is the #r-dimensional Lagrange multipliers (recall: r is #r x I, R is #r X L, and a is L X 1). Let a be the solution to the constrained minimization 1 problem. (If W = , it is the restricted efficient GMM estimator of a.)

s-

(a) Let ~(W) be the unrestricted GMM estimator that minimizes J(a, W). Show:

&= ~(W)-

cs;, WSu)- 1R' [Res;, WSu)- 1RT [R~(W)- r], 1

A= 2n· [Res;, WSu)- 1R't [Ra(W)- r]. Hint: The first-order conditions are 1--.

,--.

-

I

-2nS.,Wsxy + 2n(S.,WSu)a + R A= 0. -

-

Combine this with the constraint Ra = r to solve for A and a. (b) Show: J(a, Wl = J(~(W), Wl +n·<~<Wl- a)'(s~,ws")(a(W)- &). Hint:

J(a, W) = n. (Sxy- s.,a)'W(Sxy- s.,a) = n. [(Sxy- s.,~(W)) + S.,(3(W)- a)]'W

· [(sxy- S.,~(W)) + S.,(a(W)- a)]. Use the first-order conditions for the unrestricted GMM:

s~,W(sxy- s.,a(W)) =

o.

s-

1 (c) Now set W = and show that the Wald statistic W defined in (3.5.8) is numerically equal to LR in (3.7.2).

9. (GMM Hausman principle) Let &1 = ~(W 1 ) and&, = ~(W,) be two GMM estimators with two different choices of the weighting matrix, W 1 and W 2 .

Single-Equation GMM

•

249

(a) Show:

where

A11 = Qj" 1 l:~,W,SW,l:,Qj" 1 , A,,= Qj 1 l:~,W,SW,l:,Q2 1 , A,,= Q2'l:~,w,sw,l:,Qj" 1 , A,,= Q2'l:~,w,sw,l:,Q2'·

Q 1 = l:~,W 1 l:, Q 2 = l:~,W 2 l:, plimWj = Wj (j = 1.2). n~oo

(b) Let q

= .1, - 8,. Show that .J1l q ->d Avar(q) = A 11

(c) Set W2 =

N(O, Avar(q)), where

+ A 22 -

A 12

-

A21 •

s-' so that ~ 2 is efficient GMM. Show that

10. (Weak instruments. This question is due toM. Watson.) Consider the following model:

y;=z;8+e,

(i=l,2, ... ,n),

where Yi· Zi, and Ei are scalar random variables. Let xi denote a scalar instrumental variable that is related to Zi via the equation:

Let

Assume that (1) {x;, q;) follows a stationary and ergodic process. (2) g; is a martingale difference sequence with E(g;i;) = S, which is a positive definite matrix. (3) E(xf) = u} > 0. (4) {3 "" 0.

250

• ,.. -

Let 0 =

-1

sxz Sxy

Chapter 3

where l

Sx::

n

=-n LXiZi, i=l

(a) Prove that axz is satisfied).

= E(x1z;) f= 0 (so that the rank condition for identification

(b) Prove that 8 -->p 8. We now want to consider a case in which the rank condition is just barely satisfied, that is, axz is very close to zero. One way to do this is to replace Assumption (4) with

(4')/l = 1/vfn. For the remainder of the problem, assume that (l )-(3) and (4') hold. (This assumption and the following questions are based on Staiger and Stock (1997).) Note that 8 - 8 = s;/ g1, where g1 is the sample mean of gli (= x 1s1). (c) Show that s,

-->p

0.

(d) Show that vfn Sxz -->d a"j +a, where a is distributed N(O, Szz) and szz is the (2, 2) element of S. (e) Show that 8- 8 -->d (a"j + a)- 1b, where (a, b) are jointly normally distributed with mean zero and covariance matrix S .

.

(f) Is 8 consistent?

EMPIRICAL EXERCISES

Read Griliches (1976) before answering. We will estimate the type of the wage equation estimated by Griliches using an extract from the NLS-Y used by Blackburn and Neumark (1992). 21 The NLS-Y is panel data, with the same set of young men surveyed at several points in time (we will not exploit the panel feature of the data set in this exercise, though). The extract contains information about those individuals at two points in time: first, the earliest year in which wages and other variables are available, and second, in 1980. In a data tile GRILIC.ASC, data are provided on RNS, RNS80, MRT, MRT80, SMSA, SMSA80, MED, IQ, KWW, YEAR, AGE, AGE80, S, S80, EXPR, EXPR80, TENURE, TENURE80, LW, and LW80 (in this order, with the columns corresponding to the variables, as usual). The variable 2Isee Blackburn and Neumark (1992, Section III) for a more detailed description of the sample.

Single·Equation GMM

•

251

YEAR is the year of the first point in time. Variables without "80" are for the first point, and those with "80" are for 1980. The definition of the variables for the first point is: RNS = dummy for residency in the southern states MRT = dummy for marital status (1 if married) SMSA = dummy for residency in metropolitan areas MED = mother's education in years

KWW = score on the "Knowledge of the World of Work" test lQ = IQ score AGE = age of the individual

S = completed years of schooling EXPR =experience in years TENURE= tenure in years LW = log wage.

The Blackburn-Neumark sample has 815 observations after deleting black individuals. Also deleted are cases with missing information on mother's education, reducing the sample size to 758. (a) Calculate means and standard deviations of all the provided variables (including those for 1980) and prepare a table similar to Griliches's Table 1. Also, calculate the correlation between lQ and S (use the MSD ( CORR) command for TSP, CMOM (PRINT, CORR) for RATS). Since the year the wage rate is observed differs across individuals, the wage rate will have the year effect. Generate eight year dummies for YEAR = 66, ... , 73. (Note: There is no observation for 1972.) The year dummies will be included in the log wage equation to control for the year effect. (b) Consider the wage equation (dropping the individual subscript i) LW= /)S+ylQ+o'h+e,

where LW is log wages, S is schooling, and h

= (EXPR, TENURE, RNS, SMSA, year dummies)'.

(There is no need to include a constant because it is a linear combination of

252

•

Chapter 3

the year dummies.) Prepare a table similar to Table 3.2. In your 2SLS estimation of the wage equation. the set of instruments should consist of predetermined regressors (S and h) and excluded predetermined variables (MED, KWW, MRT, AGE). Is the relative magnitude of the three different estimates of the schooling coefficient consistent with the prediction based on the omitted variable and errors-in-variables biases? TSP Tip: To do 2SLS, use the INST or 2SLS command. RATS Tip: Use the LINREG command with the INST option. To define the instrument set, use the INSTRUMENTS command.

(c) For your 2SLS estimation of the wage equation in part (b), calculate Sargan's statistic (it should be 87.655). What should the degrees of freedom be? Calculate the p-value. TSP Tip: Sargan's statistic can be calculated as @phi I ( @s srI @nob). where the variables with "@"are outputs from TSP commands. @nob is sample size, @phi is e'X(X'X) -I X' ewhere eis the 2SLS residual vector, and @s s r . e"''"e. ts RATS Tip: Sargan's statistic is %uzwzul (%rssl%nobs), where variables with "%" are outputs from RATS commands. %nabs is sample size. %uzwzu is e'X(X'X)- 1X'e, and %rss is e'e.

(d) Obtain the 2SLS estimate by actually running two regressions. Verify that the standard errors given by the second stage regression are different from those you obtained in (b). (e) Griliches mentions that schooling. too, may be endogenous. What is his argument? Estimate by 2SLS the wage equation, treating both IQ and S as endogenous. What happens to the schooling coefficient? How would you explain the difference between your 2SLS estimate of the schooling coefficient here and your 2SLS estimate in (b)? Calculate Sargan's statistic (it should be 13.268) and its p-value. (f) Estimate the wage equation by GMM, treating schooling as predetermined as in

the 2SLS estimation in part (b). Are 2SLS standard errors smaller than GMM standard errors? Test whether schooling is predetermined by the C statistic. To calculate C. you need to do two GMMs with and without schooling as an instrument. Use the same S throughout. To calculateS, use the 2SLS residual from part (b). The C statistic should be 58.168.

Single-Equation GMM

•

-

253

TSP Tip: It is a bit awkward to calculate S in TSP. The following codes, though not elegant, get the job done. Here, resO is the residuals from the 2SLS estimation in (b). smpl 1 758; xO=resO*s; xl=resO*expr; x2=res0*tenure; x3=res0*rns; x4=res0*srnsa; x5~res0*y66;

xll~res0*y73;

x12~res0*med;

x13=res0*k'ww; x14=res0*rnrt; x15=res0*age; rnsd(noprint,rnornent) x0-x15; ~

@mom, an output from the TSP command MSD, is S. To do GMM, use TSP's GMM command. The GMM option COVOC~matrix name forces the GMM command to use the assigned matrix for S. The J statistic is given by @nob*@phi.

RATS Tip: The LINREG with the INST and ROBUSTERRORS option does GMM for linear equations. The WMATRIX~ma trix name option of LINREG accepts §-I' not S. To calculate §-I in RATS, do the following. Here, resO is the residuals from the 2SLS estimation in part (b). mcov I resO # s expr tenure rns smsa y66 ... y73 rned kww rnrt age compute sinv=inv(%crnorn) s inv is S- 1 , The J statistic is given by %uzwzu. (g) Go back to the wage equation you estimated in (e) by 2SLS endogenous schooling. The large Sargan's statistic (and the large J statistic in part (f)) is a concern. So consider dropping both MED and KWW from the list of instruments. Is the order condition for identification still satisfied? What happens to the

254

•

Chapter 3

2SLS estimates? (The schooling coefficient will be -529.3%!) What do you think is the problem? Hint: There are two endogenous regressors, Sand JQ. MRT and AGE are the only excluded predetermined variables. If MRT and AGE do not have explanatory power in the first-stage regression of S and JQ on the instruments, then the fitted values of Sand JQ will be very close to linear combinations of the included predetermined variables. This will lead to a nearmulticollinearity problem in the second-stage regression. Make sure you examine the first-stage regressions to check the explanatory power of MRT and AGE.

ANSWERS TO SELECTED Q U E S T I O N S - - - - - - - - - - -

ANALYTICAL EXERCISES

4. The asymptotic variance of the OLS estimator is (by setting x = z in (3.5.1 ))

where A (3.5.13),

=

E(s;z,z;). The asymptotic variance of the two-step GMM is, by

( l:'xz 8 -11: xz l-1 '

where S

= E(s; x;x;).

If W is as defined in the hint, then

So (3.5.1) reduces to the asymptotic variance of the OLS estimator. By (3.5.11 ), it is no smaller than (l:~, s- 11:,)- 1 , which is the asymptotic variance of the two-step GMM estimator. EMPIRICAL EXERCISES

(b) See Table 3.3, lines 1-5. (e) He gives three reasons for the endogeneity of schooling. First, the measurement error ry may be related to success in schooling. The second reason is the argument we made in the text: the choice of S is influenced by unobservable characteristics. Third, schooling may be measured with error. The general

Table 3.3: Parameter Estimates, Blackbum-Neumark Sample Estimation equation: dependent variable= LW, regressors: S, IQ, EXPR, TENURE, and other control<>

Line

""·

Estimation technique

Coefficient of

s

JQ

SEE

R'

TENURE

0.030 (0.0065)

0.043 (0.0075)

0.328

0.425

none none

0.070 (0.0067)

2

OLS

0.062 (0.0073)

0.0027 (0.0010)

O.o31 (0.0065)

0.042 (0.0075)

0.326 0.430

3

2SLS

0.069 (0.013)

0.0002 (0.0039)

0.030 (0.0066)

0.043 (0.0076)

0.328

0.172 (0.Q2l)

-0.009 (0.0047)

0.049 (0.0082)

0.042 (0.0088)

0.380

0.176 (0.021)

-0.009 (0.0049)

0.050 (0.0080)

0.043 (0.0095)

0.379

0.117 (0.027)

0.002 (0.0050)

0.033 (0.0052)

0.0051 (0.0029)

0.380

5 6

2SLS

GMM 2SLS

Which regressor is endogenous?

EXPR

OLS

4

Test of over· identifying restrictions

NoTE: Standard errors in parentheses. Line 6 is for 1980.

87.6

IQ

MED,KWW, AGE,MRT

S.IQ

MED,KWW, AGE,MRT

S.IQ

MED,KWW, AGE,MRT

SBO, IQ

MED,KWW, AGEBO, MRTSO

(p = 0.0000)

13.3 (p = 0.00131)

(p

~

11.6 0.00303) 14.9

(p = 0.00058)

Excluded predetermined variables besides regional dummies

256

•

Chapter 3

formula for the asymptotic bias for the 2SLS estimator is given by

If S is endogenous, the instrument set x1 for the 2SLS estimation in (b) erro-

neously includes a variable (which is S) that is not predetermined. So one of the elements of E(x 1·8 1) is not zero. Then it is clear from the formula that asymptotic bias can exist at least for some coefficients. (g) Without MED and KWW as instruments, the 2SLS estimator becomes impre-

cise. The order condition is satisfied with equality. The culprit is the low explanatory power of MRT (look at its r-value in the first-stage regressions for Sand /Q).

References Angrist, J., and A. Krueger, 1991, "Does Compulsory School Attendance Affect Schooling and Earnings?" Quarterly Journal of Economics, 106,979-1014. Behnnan, J., Z. Hrubcc, P. Taubman, and T. Wales, 1980, Socioeconomic Success: A Study of the Effects of Genetic Endawments, Family Environment, and Schooling, Amsterdam: North-Holland. Blackburn, M., and D. Neumark, 1992, "Unobserved Ability, Efficiency Wages, and Interindustry Wage Differentials," Quarterly Journal of Economics, 107, 1421-1436. Card, D., 1995, "Earnings, Schooling, and Ability Revisited,'' Research in Labor Economics, 14,

23-48. Card, D., and A. Krueger, 1996, "Labor Market Effects of School Quality: Theory and Evidence," Working Paper 5450, NBER. Eichenbaum, M., L. Hansen, and K. Singleton, 1985, "A Time Series Analysis of Representative Agent Models of Consumption and Leisure Choice under Uncertainty," Quarterly Journal of Economic.<:, 103,51-78. Friedman, M., 1957, A Theory of the Consumption Function, Princeton: Princeton University Press. Gallant, R., and D. Jorgenson, 1979, "Statistical Inference for a System of Simultaneous, Non-linear, Implicit Equations in the Context of Instrumental Variable Estimation," Journal of Econometrics,

II. 275-302. Griliches, Z., 1976, "Wages of Very Young Men," Journal of Political Economy, 84, S69-S85. Haavclmo, T., 1943, "The Statistical Implications of a System of Simultaneous Equations," Econometrica, II, 1-12. Hansen, L., 1982, "Large Sample Properties of Generali:tcd Method of Moments Estimators," Econometrica, 50, 1029-1054. Hausman, J., and W. Taylor, 1980, "Comparing Specification Tests and Classical Tests," MIT Working Paper No. 266. Judge, G., W. Griffiths, R. Hill, H. Llitkepohl, and T. Lee, 1985, The Theory and Practice of Econometrics (2d ed.), New York: Wiley. Mincer, J., 1958, "Investment in Human Capital and Personal Income Distribution," Journal of Political Economy, 66,281-302.

Single-Equation GMM

•

257

Nelson, C., and R. Startz, 1990, "Some Further Results on the Exact Small Sample Properties of the Instrumental Variable Estimator," Econometrica, 58, 967-976. Newey, W., 1985, "Generalized Method of Moments Specification Testing," Journal of Econometrics, 29, 229-256. Newey, W., and D. McFadden, 1994, "Large Sample Estimation and Hypothesis Testing,'' Chapter 36 in R. Engle and D. McFadden (cds.), Handbook of Econometrics, Volume TV, New York: North-Holland. Sargan, D., 1958, '!he Estimation of Economic Relationships Using Instrumental Variables," Econometrica, 26. 393-415. Searle. S., 1982, Matrix Algebra Useful for Statistics, New York: Wiley. Staiger, D., and J. Stock, 1997, "Instrumental Variables Regression with Weak Instrumenls," Econometrica, 65. 557-586. Theil, H., 1953. "Repeated Least-Squares Applied to Complete Equalion Systems," MS., The Hague: Central Planning Bureau. White, H., 1982, "Instrumental Variables Regression with Independent Observations," Econometrica, 50, 483-499. - - - , 1984, Asymptotic Theory for Econometricians, Academic. Working, E. J., 1927, "What Do 'Statistical Demand' Curves Show?" Quarterly Journal of Economics, 41,212-235.

CHAPTER 4

Multiple-Equation GMM

ABSTRACT This chapter is concerned about estimating more than one equation jointly by GMM. Having learned how to do the single-equation GMM, it is only a small step to get to a system of multiple equations, despite the seemingly complicated notation designed

to keep tabs on individual equations. This is because the multiple-equation GMM estimator can be expressed as a single-equation GMM estimator by suitably specify-

ing the matrices and vectors comprising the single-equation GMM formula. This being the case, we can develop large-sample theory of multiple-equation GMM almost off the shelf. The payoff from mastering multiple-equation GMM is considerable. Under conditional homoskedasticity, it reduces to the full~information instrumental variable efficient (FIVE) estimator, which in turn reduces to three-stage least squares (3SLS) if the set of instrumental variables is common to all equations. If we further assume that all the regressors are predetennined, then 3SLS reduces to seemingly unrelated regressions (SUR), which in tum reduces to the multivariate regression when all the equations have the same regressors. We will also show that the multiple~equation system can be written as an equa~ tion system with its coefficients constrained to be the same across equations. The GMM estimator for this system is again a special case of single~equation GMM. The GMM estimator when all the regressors are predetermined and errors are condition~ ally homoskedastic is called the random-effects (RE) estimator. Therefore, SUR and RE are equivalent estimators. The application of this chapter is estimation of interrelated factor demands by SURJRE. A system of factor demand functions is derived from a cost function called the translog cost function, which is a generalization of the linear logarithmic cost function considered in Chapter I. This chapter does not consider the maximum likelihood (ML) estimation of multiple-equation models. The ML counterpart of 3SLS is full-information maxi~ mum likelihood (FIML). For a full treatment of FIML, see Section 8.5. If the complexity of this chapter's notation stands in your way, just set M (the number of equations) to two. Doing so does not impair the generality of any of the results of this chapter.

259

Multiple-Equation GMM

4.1

The Multiple-Equation Model

To be able to deal with more than one equation, we need to complicate the notation by introducing an additional subscript designating the equation. For example, the dependent variable of the m-th equation for observation i will be denoted Yim· Linearity There are M linear equations, each of which is a linear equation like the one in Assumption 3.1: Assumption 4.1 (linearity): There are M linear equations Yim = z;miim

+ eim

(m = 1, 2, ... , M; i = 1, 2, ... , n),

(4.1.1)

where n is the sample size. Zim is the Lm-dimensional vector of regressors, lim is the conformable coefficient vector, and <;m is an unobservable error tenn in the m-th equation. The model will make no assumptions about the interequation (or contemporaneous) correlation between the errors (eil, ... , <;M ). Also, no a priori restrictions are placed on the coefficients from different equations. That is, the model assumes no cross-equation restrictions on the coefficient.;;. These point.;; can he illustrated in Example 4.1 (wage equation): In the Griliches exercise of Chapter 3, we estimated the wage equation. In one specification, Griliches adds to it the equation for KWW (score on the "Knowledge of the World of Work" test):

+ f3I S; + Y1IQ, + rr EXPR, + '"• KWW; = 4>z + f3zS; + y,IQ, + e,,,

LW, = 1/>1

where (to refresh your memory) LW; is the log wage of the i-th individual in the first year of the survey, S, is schooling in years, IQ, is JQ, and EXPR, is experience in years. In this example, M Yi2 = KWW1, and

= 2,

L1

= 4,

Lz

= 3, Yil = LW;,

260

Chapter 4

1 Zii

=

s, , 62 =

IQ, EXPR,

[~l

(4.1.2)

The errors "'' and e; 2 can be correlated. The correlation arises if, for example, there is an unobservable individual characteristic that affects both the wage rate and the test score. There are no cross-equation restrictions such as

{3, = f3z.

Cross-equation restrictions naturally arise in panel data models where the same relationship can be estimated for different points in time.

Example 4.2 (wage equation for two years): The NLS-Y data we used for the Griliches exercise actually has information for 1980, so we can estimate two wage equations, one for the first point in time (say, 1969) and the other for 1980:

LW69,

=

LWBO, =

+ {3 1S69; + y 1IQ, + rr 1EXPR69; + ""' ¢ 2 + {3 2 S80, + y 2 IQ, + rr 2 EXPR80, + e; 2 .
In the language of multiple equations, S69; (education in 1969) and SBO; (education in 1980) are two different variables. It would be natural to entertain the case where the coefficients remained unchanged over time, with ¢ 1 = ¢z, [3, = f3z, YJ = yz, rr 1 = rr2 . This is a set of linear cross-equation restrictions.

As will he shown in Section 4.3 helow, testing (linear or nonlinear) cross-equation restrictions is straightforward. Estimation while imposing the restriction that the coefficients are the same across equations will be considered in Section 4.6. Stationarity and Ergodicity

Let X;m be the vector of Km instruments for the m-th equation. Although not frequently the case in practice, the set of instruments can differ across equations, and so the number of instruments, Km, can depend on the equation.

Example 4.1 (continued): Suppose that IQ, is endogenous in both equations and that there is a variahle, MED, (mother's education), that is predetermined in the first equation (so E(MED; "il) = 0). So the instruments for the LW

261

Multiple-Equation GMM

equation are (1, S;, EXPR;, MED;). If these instruments are also orthogonal to e; 2 , then the KWW equation can he estimated with the same set of instruments as the LW equation. In this example, K 1 = K 2

= 4, and

1

s,

(4.1.3)

EXPR; MED; The multiple-equation version of Assumption 3.2 is

Assumption 4.2 (ergodic stationarity): Let w, be unique and nonconstant ele-

ments of (Yil, ... , YiM· zil, ... , Z;M. ergodic.

Xi!, ... , X;M).

{w;} is jointly stationary arrd

This assumption is stronger than just assuming that ergodic stationarity is satisfied for each equation of the system. Even if IYim• Z;m. X;m) is stationary and ergodic for each individual equation m, it does not necessarily follow, strictly speaking, that the larger process {w;}, which is the union of individual processes, is Uointly) stationary and ergodic (see Example 2.3 of Chapter 2 for an example of a vector nonstationary process whose components are univariate stationary processes). In practice, the distinction is somewhat blurred because the equations often share common variables. Example 4.1 (continued): Because z; 1 , z, 2 , Xi!, and x; 2 for the example have some common elements, w, has only six elements: (LW;, KWW;, S;, IQ,, EXPR,, MED;). The ergodic stationarity a"umption for the first equation of this example is that {LW;, S;, IQ,, EXP R;, MED;} is jointly stationary and ergodic. Assumption 4.2 requires that the joint process include KWW;. Orthogonality Conditions

The orthogonality conditions for the M -equation system are just a collection of the orthogonality conditions for individual equations. Assumption 4.3 (orthogonality conditions): For each equation m, the Km van-

abies in X;m are predetermined. That is, the fo/lowing orthogonality conditions are satisfied: E(Xim · &;m) = 0

(m = l, 2, ... , M).

262

Chapter 4

So there are Lm Km orthogonality conditions in total. If we define

(4. 1.4)

then all those orthogonality conditions can be written compactly as E(g,) = 0. Note that we are not assuming "cross" orthogonalities; for example, Xi! does not have to be orthogonal to e, 2 , although it is required to be orthogonal to eil. However, if a variable is included in both x, and x,2, then the assumption does imply that the variable is orthogonal to both"" and e 12 . Example 4.1 (continued): Lm Km in the current example is 8 (= 4 and g1 is

+ 4)

f:iJ

si eil

EXPR, e, MED, BjJ Bi2

Sisi2

EXPR, Bi2 MED; Bi2 Because

Xi!

and xn have the same set of instruments, each instrument

1s

orthogonal to both "" and e, 2 • Identification Having set up the orthogonality conditions for multiple equations, we can derive the identification condition in much the same way as in the single-equation case. The multiple-equation version of (3.3. 1) is

g(w1; ii)

=

(4.1.5)

where w 1 is as in Assumption 4.2, and ii without the subscript is a stacked vector of coefficient"

263

Multiple-Equation GMM

(4.1.6)

So the orthogonality conditions can be written as E[g(w,; o)] = 0. The coefficient vector is identified if = is the only solution to the system of equations

o o

E[g(w;; o)] = 0.

(4.!.7)

Using the definition (4.!.5), we can rewrite the left-hand side of this equation, E[g(w;; o) ], as _ E[g(w,;o)l

[ E[xil · (Yil

=

:

-z;,~,)]

l

E[x;M · (Y;M- z;MoM)l = [E(xii:'Yil) ] - [

E(x,M · YiM)

E(xil~,)~,

l

E(x;Mz;M)oM

(4.!.8) where

(The third equality in (4.!.8) is obtained by formula (A.4) of Appendix A for multiplication of partitioned matrices.) Therefore, the system of equations determining can be written as

o

(4.!.10)

which is the same in form as (3.3.4), the system of equations determining single-equation case!

oin the

264

Chapter 4

It then follows from the discussion of identification for the single-equation case that a necessary and sufficient condition for identification is that l:., be of full column rank. But, since l:xz is block diagonal, this condition is equivalent to

Assumption 4.4(rankcondition for identification): Foreachm (=I, 2, ... , M) E(x,mZ:ml (Km X Lm) is of full column rank. It should be clear why the identification condition becomes thus. We can uniquely

determine all coefficient vectors ( ~ 1 , ...• ~ M) if and only if each coefficient vector ~m is uniquely determined, which occurs if and only if Assumption 3.4 holds for each equation. The rank condition is this simple because there are no a priori cross-equation restrictions. Identification when the coefficients are assumed to he the same across equations will be covered in Section 4.6. The Assumption for Asymptotic Normality As in the single-equation estimation, the orthogonality conditions need to be strengthened for asymptotic normality: Assumption 4.5 (g, is a martingale difference sequence with finite second moments): {g,) is a joint martingale difference sequence. E(g,g;) is nonsingular. The same comment that we made above about joint ergodic stationarity applies here: the assumption is stronger than requiring the same for each equation. As in the previous chapters, we use the symbolS for Avar(g), the asymptotic variance of g (i.e., the variance of the limiting distribution of Jn g). By the CLT for stationary and ergodic martingale difference sequences (see Section 2.2), it equals E(g,g;). It has the following partitioned structure: S=

E(g,g;)

I

KmxLt;;=l Km)

E(SiJfiMXiJX;~)

]·

(4.1.11)

E(e;MfiMX;MX;M)

That is, the (m, h) block ofS is E(s,m
265

Multiple-Equation GMM

Connection to the "Complete" System of Simultaneous Equations

The multiple-equation model presented in most other textbooks, called the "complete" system of simultaneous equations, adds more assumptions to our model. Those additional assumptions are not needed for the development of multipleequation GMM, hut if you are curious about them, you can at this junction make a detour to the first two subsections of Section 8.5.

4.2

Multiple-Equation GMM Defined

Derivation of the GMM estimator for multiple equations, too, can be carried out in much the same way as in the single-equation case. Let 8 be a hypothetical value of the true parameter vector 8 and define g,(8) by (3.4.1 ). The definition of the GMM estimator is the same as in (3.4.5), provided that the weighting matrix W is now Lm Km x Lm Km. In the previous section, we were able to rewrite E[g(w;; cl)] for multiple equations as a linear function of 8 (see (4.1.8)). We can do the same for the sample analogue g,(8): ~

g,(8) ('E;!;==l K111 xl)

l " - Lx;1·Y;1 n i=1 l " - LxrM · YiM n 1=1 .

l " LXiMz;.M~M n

-

i=l

l " - LXiM ·YiM

n.1=1

(4.2.1)

266

Chapter 4

where

. (4.2.2) I

n

- LxiM · YiM n.1=1

Again, this is the same in fonn as the expression for g, ( 8) for the single-equation case.

It follows that the results of Section 3.4 about the derivation of the GMM esti-

mator are applicable to the multiple-equation case. In particular, just reproducing (3.4.8) and (3.4.11 ), "'

,....._

I

..-_

multiple-equation GMM estimator: 8(W) = (Sxz WSxz)

-J

I

,....._

SxzWsxy.

its sampling error: 8(W)- 8 = (S~,WSxz)- 1 8~ Wg.

(4.2.3) (4.2.4)

The features specific to the multiple-equation GMM fonnula are the following: (i) s,, is a stacked vector as in (4.2.2), (ii) Sxz is a block diagonal matrix as in (4.2.2), (iii) accordingly, the size of the weighting matrix W is Lm Km X Lm Km, and (iv) gin (4.2.4), the sample mean of g1, is the stacked vector I

n

-n.L

Xi!. £il

1=1

= g,(8).

J

(4.2.5)

n

-L:XiM·£iM

n

i=l

It will be necessary to rewrite the multiple-equation GMM fonnula (4.2.3) while explicitly taking account of those special features. If W mh (Km x Kh) is the (m, h) block of W (m, h = I, 2, ... , M), then (4.2.3) can be written out in full as

267

Multiple-Equation GMM

1

n

- LziMx;M

~11 ~:IM] [ WMI···WMM

n.1=1

-I

1

n

- LziMXiM

n.1=1

1

n

- LXjM. YiM

n

i=l -I

(4.2.6)

To go from the second-to-last to the last line, use formulas (A.6) and (A.7) of the appendix on partitioned matrices. If you find the operation too much, just set

M=2.

268

4.3

Chapter 4

Large-Sample Theory

Large-sample theory for the single-equation GMM was stated in Propositions 3.13.8. It is just a matter of mechanical substitution to translate it to the multipleequation model. Provided that 8, 2:.,, S, g,, Sxy. and S., are as defined in the previous sections and that "Assumption 3.x•• is replaced by "Assumption 4.x," Propositions 3.1, 3.3, and 3.5-3.8 are valid as stated for the multiple-equation model. Only a few comments about their interpretation are needed. • (Hypothesis testing) In the present case of multiple equations, 8 is a stacked vector composed of coefficients from different equations. The import of Propositions 3.3 and 3.8 about hypothesis testing for multiple equations is that we can test cross-equation restrictions. Example 4.2 (continued): For the two-equation system of wage equations for two years, the stacked coefficient vector 8 is

It would he interesting to test H0 :

fh

=

fh and rr 1 = rr2 , namely, that the

schooling and experience premia remained stable over time. The hypothesis can be written as R8 = r, where

R=[OIOOO -I 0 0 0 I 0 0

0 0

Nonlinear cross equation restrictions, too, can he tested; just use part (c) of Proposition 3.3 or Proposition 3.8.

• (Test of overidentifying restrictions) The number of orthogonality conditions is Lm Km and the number of coefficients is Lm Lm. Accordingly, the degrees of freedom for the J statistic in Proposition 3.6 are Lm Km - Lm Lm, and that for the C statistic in Proposition 3.7 is the total number of suspect instruments from different equations. Proposition 3.2 does not apply to multiple equations as is, but it is obvious how it can be adapted. Proposition 4.1 (consistent estimation ofcontemporaneous error cross-equation moments): Let ~m be a consistent estimator of ~m. and Jet Eim = Yim - z;m3m be

269

Multiple-Equation GMM

the implied residual form = I, 2, ... , M. Under Assumptions 4.1 and 4.2, plus the assumption that E(z;mz;h) exists and is finite for all m, h (= I, 2, ... , M),

(4.3.1) where

provided that E(B;mB;h) exists and is finite.

The proof, very similar to the proof of Proposition 3.2 and tedious, is left as an analytical exercise. This leaves Proposition 3.4 ahout consistently estimating S. The assumption corresponding to Assumption 3.6 (the finite fourth-moment assumption) is Assumption4.6 (finite fourth moments): E[(X;mk · Z;hJ)'] exists and is finite for all k (= I, 2, ... , Km), j (= I, 2, ... , Lh), m, h (= I, 2, ... , M), where X;mk is

the k-th element of X;m and Z;hJ is the j -th element of Z;h·

The multiple-equation version of the formula (3.5.10) for consistently estimating Sis

~

S=

for some consistent estimate

(4.3.2)

Bim

of Bim.

Proposition 4.2 (consistent estimation of S, the asymptotic variance of g): Let

am be a consistent estimator of 3m.· and Jet B;m = Y;m - z;mam be the implied residual for m = I, 2, ... , M. Under Assumptions 4.1, 4.2, and 4. 6, S given in (4.3.2) is consistent for S. ~

We will not give a proof; the technique is very similar to the one used in the proof of Proposition 2.4.

Table 4.1: Multiple-Equation GMM in the Single-Equation Format Sample Analogue of Orthogonality Conditions: GMM Estimator: Its Sampling Error: Asymptotic Variance of Optimal GMM: Its Estimator: J Statistic:

-

-

g,(8) = Sxy- S.,8 = 0

(S~,ws.,)- 1 S~,Ws,y

clcW) =

3(W)- 8 = (S~ ws.,)- 1 S~,Wg

Avarc~cs- 1 )) = (I:~,s- 1 1:.,)-1

-

Avar(~(S-1)) = (S~s- 1 s.,)-1

J(~(S- 1 ),S- 1 )

= n. g,(~(S- 1 ))'s- 1 g,(~(s- 1 ))

Single-Equation GMM applied to the

Multiple-Equation GMM

equation in question

g,

Xi · fj

(4.1.4)

8

8

(4.1.6)

Sxy

I " - :Lx; · Yi n i-1

(4.2.2)

s.,

- :Lxiz~

I " n

(4.2.2)

'

i-1

Size ofW

KxK

I:.,

E(x1z;J

LmKm X LmKm (4.1.9)

S (= Avar(g))

E(t}xixD

(4.1.11)

s

" L n

-I

i-1

-2 XjX·' f.

'

'

(4.3.2)

Estimator consistent

under which

3.1-3.4

4.1-4.4

3.1-3.5

4.1-4.5

3.1, 3.2, 3.6, E(g;t;) finite

4.1, 4.2, 4.6, E(g,·g;) finite

K-L

Lm(Km- Lm)

assumptions? Estimator asymptotic normal under which assumptions?

-S

->p

Sunder

which assumptions? d.f. of J

271

Multiple-Equation GMM

-

Since this S is consistent for S, the multiple-equation GMM estimator, using s- 1 as the weighting matrix, cl(S- 1), is an efficient multiple-equation GMM estimator with minimum asymptotic variance. The formulas for the asymptotic variance and its consistent estimator are (3.5.13) and (3.5.14), which are reproduced here for convenience:

(4.3.3) ~

Avar(cl(S- 1)) = (S~,s- 1 s.,)- 1 •

(4.3.4)

To recapitulate, the features specific to the multiple-equation GMM are: :E., is a block diagonal matrix given by (4.1.9), Sis given by (4.1.11 ), Sxy and S., by (4.2.2), and S by (4.3.2). For the initial estimator elm needed to calculate E;m and S, we can use the FIVE estimator (to be presented below) or the efficient single-equation GMM applied to each equation separately. Of course, as long as it is consistent, the choice of the initial estimator does not affect the asymptotic distribution of the efficient GMM estimator. Table 4.1 summarizes the results of this chapter so far.

4.4

Single-Equation versus Multiple-Equation Estimation

An obvious alternative to multiple-equation GMM, which estimates the stacked coefficient vector 8 (Lm Lm x I) jointly, is to apply the single-equation GMM separately to each equation and then stack the estimated coefficients. If the weighting matrix in the single-equation GMM for the m-th equation is mm (Km X Km). the resulting equation-by-equation GMM estimator of the stacked coefficient vector 8 can be written as the multiple-equation GMM estimator cl(W) given in (4.2.3), where the Lm Km x Lm Km weighting matrix W is a block diagonal matrix whose m-th block is Wmm:

w

-

(4.4.1)

This is because both S., (Lm Km x Lm Lm) and W are block diagonal. (The reader should actually substitute the expressions (4.2.2) for S., and (4.4.1) for W into (4.2.3) to verify that S~, WS., and hence (S~, WS.,)- 1 S~, Ware block diagonal, using formulas (A.5) and (A.9) of Appendix A. Set M = 2 if it makes it easier.)

272

Chapter 4

Therefore, the difference between the multiple-equation GMM and the equationby-equation GMM estimation of the stacked coefficient vector 5 lies in the choice of the Lm Km x Lm Km weighting matrix W. Put differently, the equation-byequation GMM is a particular multiple-equation GMM. When Are They "Equivalent"?

As seen in Chapter 3, if the equation is just identified, the choice of the weighting matrix W does not matter because the GMM estimator, regardless of the choice of W, numerically equals the IV estimator. The same is true for multiple-equation GMM: if each equation of the system is exactly identified so that Lm = Km for all m, then S" in the GMM formula (4.2.3) is a square matrix and the GMM estimator for any choice of W becomes the multiple-equation IV estimator s;,ts.y (and, since Sxz is block diagonal, the estimator is just a collection of single-equation IV estimators). On the other hand, if at least one of theM equations is overidentified, the choice of W (Lm Km x Lm Km) does affect the numerical value of the GMM estimator. Consider the efficient equation-by-equation GMM estimator of 5 (Lm Lm x 1), where each equation is estimated by the efficient single-equation GMM with an optimal choice of Wmm (m = I, 2, ... , M). Is it as efficient as the efficient multiple-equation GMM? The equation-by-equation estimator can be written as 8(W), where the Lm Km x Lm Km block diagonal weighting matrix W satisfies

plimW =

(4.4.2)

n~oo

Because this plim is not necessarily equal to s- 1 (unless (4.4.3) below holds), the estimator is generally less efficient than the efficient multiple-equation GMM estimator 8cs- 1), where Sis a consistent estimator of S. But if the equations are "unrelated" 1 in the sense that (4.4.3) then S becomes block diagonal and for the weighting matrix W defined above plim W = s- 1 • Since in this case W - s- 1 -+p 0, we have (as we proved in Review Question 3 of Section 3.5)

I Under conditional homoskedasticity, this condition becomes the more familiar condition that E(Eim f:iJ1 ) == 0. See the next section.

Multiple·Equation GMM

273

It then follows from Lemma 2.4(a) that the asymptotic distribution of ckW) is the

same as that of .lcs- 1 ), so the efficient equation-by-equation GMM estimator is an efficient multiple-equation GMM estimator. We summarize the discussion so far as

Proposition 4.3 (equivalence between single-equation and multiple-equation GMM):

(a) If all equations are just identified, then the equation-by-equation GMM and the multiple-equation GMM are numerically the same and equal to the IV estimator. (b) If at least one equation is overidentified but the equations are ''unrelated" in the sense of (4.4.3), then the efficient equation-by-equation GMM and the efficient multiple-equation GMM are asymptotically equivalent in that .jii times the difference converges to zero in probability. For those two cases, there is no efficiency gain from joint estimation. Furthermore, if it is known a priori that the equations are "unrelated," then the equation-byequation estimation should be preferred, because exploiting the a priori knowledge by requiring the off-diagonal blocks of Wto be zeros (which is what the equationby-equation estimation entails) will most likely improve the finite-sample properties of the estimator.

Joint Estimation Can Be Hazardous Except for cases (a) and (b),joint estimation is asymptotically more efficient. Even if you are interested in estimating one particular equation (say, the LW equation in Example 4.1), you can generally gain asymptotic efficiency by combining it with some other equations (see, however, Analytical Exercise 10 for an example where joint estimation entails no efficiency gain even if the added equations are not unrelated). There are caveats, however. First of all, the small-sample properties of the coefficient estimates of the equation in question might be better without joint estimation. Second, the asymptotic result presumes that the model is correctly specified, that is, all the model assumptions are satisfied. If the model is misspecified, neither the single-equation GMM nor the multiple-equation GMM is guaranteed even to be consistent. And chances of misspecification increase as you add equations to the system.

274

Chapter 4

To illustrate this second point, suppose Assumption 4.3 does not hold because, only for the M -th equation, the orthogonality conditions do not hold: E(X;M ·
0

plim~(W) -~ = (:E~,W:E.,)- 1 :E~W

(4.4.4)

n~oo

In efficient multiple-equation GMM, W-hence (:E~W:Exz)- 1 :E~,W-are not generally block diagonal, so any element of plimHoo ~(W) - ~ can be different from zero. Even for the coefficients of the equations of interest involving no misspecification the asymptotic bias may not be zero. That is, in joint estimation, biases due to a local misspecification contaminate the rest of the system. This problem does not arise for the equation-by-equation GMM which constrains W to be block diagonal. QUESTION FOR REVIEW

1. Express the equation-by-equation 2SLS estimator in the form (4.2.3) by suit1 ably specifying W mm 'sin (4.4.1). [Answer: W mm = (~ 2::7~ 1 X;mx;mr ]

4.5

Special Cases of Multiple-Equation GMM: FIVE, 3SLS, and SUR

Conditional Homoskedasticity Under conditional homoskedasticity, we can easily derive a number of prominent estimators as special cases of multiple-equation GMM. The multiple-equation version of conditional homoskedasticity is

Assumption 4.7 (conditional homoskedasticity):

forallm, h = 1, 2, ... , M.

275

Multiple-Equation GMM

Then the unconditional cross moment E(<;m<;h) equals Expectations. You should have no trouble showing that

crmh

by the Law of Total

(m, h) block of E(g;Jt;) given in (4.1.11) = E(eimfihXimx;h) = E[E(e;m<;hXimX.·h I X;m, X;h )]

(by the Law of Total Expectations)

= E[E(<;m
(by linearity of conditional expectations)

= E[crmhX;mXfh]

(by conditional homoskedasticity)

= crmh E(x;mXihl

(by linearity of expectations).

(4.5.1)

Thus, the S in (4.1.11) can be written as S = [ cru E(~;tX; 1 ) CTMJ E(x,Mx; 1)

UtM E(~i!XfM) ]· ···

crMM

(4.5.2)

E(x,Mx;M)

Since by Assumption 4.5 Sis finite, this decomposition implies that E(x1mxjh) exists and is finite for all m, h (= I, 2, ... , M). Full-Information Instrumental Variables Efficient (FIVE)

An estimator of S exploiting the structure of fourth moments shown in (4.5.2) is

UMM ·

c

(4.5.3) n

LX;MXfM) i=l

where, for some consistent estimator ~m of~. (m. h = I, 2, ... , M). (4.5.4)

By Proposition 4.1, amh -+p crmh provided (in addition to Assumptions 4.1 and 4.2) that E(z 1mz;,J is finite. By ergodic stationarity, ~ L; X1mx;" converges in probability to E(x 1mxih), which, as just noted, exists and is finite. Therefore, (4.5.3) is consistent for S, without the fourth-moment assumption (Assumption 4.6).

276

Chapter 4

.

The FIVE estimator of 8, denoted 8FIVE· is "

8FIVE

" ......-1 = 8(S )

with S given by (4.5.3) to exploit conditional homoskedasticity. 2 Being a multipleequation GMM for some choice of W, it is consistent and asymptotically normal by (the multiple-equation adaptation of) Proposition 3.1. Because Sis consistent for S, the estimator is efficient by Proposition 3.5. Thus, we have proved

Proposition 4.4 (large-sample properties of FIVE): Suppose Assumptions 4.14.5 and 4. 7 hold. Suppose, furthermore, that E(z 1mz;,J exists and is finite for all m, h (= I, 2, ... , M). 3 LetS and S be as in (4.5.2) and (4.5.3), respectively. Then ~

(a) S

--+p

S;

(b) 8FivE. defined as cl(S- 1), is consistent, asymptotically normal, and efficient, with Avar(clFJvEl given by (4.3.3);

.

(c) The estimated asymptotic variance given in (4.3.4) is consistent for Avar(8FivE); (d) (Sargan's statistic) J(clFJVE,S-

1 )

= n · g.(.IFJvErs- 1 g.(.IFIVEl--: x2 (L
Lml).

m

where g, (-) is given in (4.2.1).

.

~

Usually, the initial estimator 8m needed to calculate Sis the 2SLS estimator. Three-Stage Least Squares (3SLS)

When the set of instruments is the same across equations, the FIVE formula can be simplified somewhat. The simplified formula is the JSLS estimator, denoted 83sLs.' To this end, let

(4.5.5)

2The estimator is due to Brundy and Jorgenson (1971). 3This assumption is needed for G-mh to be consistent. See Proposition 4.1. 4The estimator is due to Zellner and Theil (1962).

277

Multiple-Equation GMM

Then the M x M matrix of cross moments of e,, denoted l:, can be written as

(4.5.6)

To estimate l: consistently, we need an initial consistent estimator of &m for the purpose of calculating the residual Bim· The term 3SLS comes from the fact that the 2SLS estimator of &m is used as the initial estimator. Given the residuals thus calculated, a natural estimator of l: is

-

l:=

[CrJJ .

(4.5.7)

' aMI

which is a matrix whose (m, h) element is the estimated cross moment given in (4.5.4).

If xi (=xi I = xi 2 = · · · = xiM) is the common set of instruments with dimension K, then g, in (4.!.4), S in (4.5.2), and Sin (4.5.3) can be written compactly using the Kronecker product5 as gi

=E;®Xi,

(4.5.8)

(M K xi)

S (MKxMK)

=

l:

® E(x,x;),

(MxM)

(4.5.9)

(KxK)

n

s = i: ® (~ I>x;).

(4.5.10)

i=l

so §'-I = l:- ® I:7~ 1 x,x;)- 1 The Kronecker product decomposition (4.5.9) of S and the nonsingularity of S (by Assumption 4.5) imply that both l: and E(x,x;) are nonsingular. To achieve further rewriting of the 3SLS formulas, we need to go back to 1, writes out in full the efficient multiple-equation (4.2.6), which, with W = GMM estimator. The formula (4.5.10) implies that theW in (4.2.6) is such that 1

s-

5See fonnulas (A. II) and (A.15) of the Appendix if you are not familiar with the Kronecker product notation.

278

Chapter 4 ~ Wmd= (m, h) blockofW) =am h · -;; L..,x;X,

(I

)-' ,

(4.5.11)

1=1

1

where amh is the (m, h) element of I:- Substitute this into (4.2.6) to obtain

'!!A-

a 83SLS

=

[a

a•IMA- 1M ..

II

.

:_

•MIA

l- [•!!• 1

+ .. + a1MCJM

en ..

.

a•MMA..-. MM

MI

a

a"Ml"CMt

+ .. ·+

l,

"MM" CMM

a

(4.5.12) where ..-.

1

1

n

n

-I

Amh = (-;; LZ;mx;)(-;; Lx,x;) 1

n

Cmi• = (-;; LZ;mX,)(-;; LX;x;) 1=1

l=l

(LmxK)

(KxK)

(4.5.13)

1=l

(KxK)

n

n

(-;; I:x,z;h)•

1=!

i=l (~xK)

1

1

(KxLh) -I

1

n

(-;;LX;

Y;h)

(4.5.14)

1=1

(Kxl)

Similarly, substituting into (4.3.3) the expression for :E,, in (4.1.9) and the expression for S in (4.5.9) and using formula (A.6) of the Appendix, we obtain the expressions for Avar(8JsLs):

(4.5.15)

where (4.5.16) and

amh

is the (m, h) element of :E- 1• This is consistently estimated by ·IlA-

a

Av~Lsl =

I

a• IM-A IM

(4.5.17)

:__ [

a·MIAMI

]-I

a·MMA- MM

(This sample version can also be obtained directly by substituting (4.2.2) and (4.5.10) into (4.3.4).)

279

Multiple·Equation GMM

Most textbooks give expressions even prettier tban tbese by using data matrices (such as then x K matrix X). Deriving such expressions is an analytical exercise to Ibis chapter. The preceding discussion can be summarized as Proposition 4.5 (large-sample properties of 3SLS): Suppose Assumptions 4.14.5 and 4. 7 hold, and suppose X;m = x1 (common set of instruments). Suppose, furthermore, that E(z;mz;hl exists and is finite for all m, h (= l, 2, ... , M). Let i be the M x M matrix of estimated error cross moments calculated by (4.5. 7) using the 2SLS residuals. Then

-

(a) 83sLs given by (4.5.12) is consistent, asymptotically normal, and efficient, with Avar(83sLsl given by (4.5.15).

-

-

(b) The estimated asymptotic variance (4.5. 17) is consistent for Avar(83sLsl.

(c) (Sargan 's statistic) l(cl,sLS· s-')

= n. gn(cl3SLs)'s-I gn(cl3sLsl-:

x2 ( M K- L

Lm ).

m

where S = i ® (~ g, 0 is given in (4.2./).

L; x,x;),

K is the number of common instruments, and

Seemingly Unrelated Regressions (SUR) The 3SLS formula can further be simplified if xi =union of (zi 1, ... , ziM ).

(4.5.18)

This is equivalent to the condition tbat E(z;m ·&;h)= 0

(m, h = l, 2, ... , M).

(4.5.18')

That is, tbe predetermined regressors satisfy "cross" orthogonalities: not only are they predetermined in each equation (i.e., E(z 1m · <;m) = 0), but also they are predetermined in the other equations (i.e., E(z1m · &;h) = 0 form # h). The simplified formula is called the SUR estimator, to be denoted clsuR· 6

6The estimator is due to Zellner (1962).

280

Chapter 4

Example 4.3 (Example 4.1 with different specification for instruments): If (4.5.18) holds for the two-equation system of Example 4.1, X; is the union of zil and z; 2 , s.o I

,-

X.-

s,

(4.5.19)

IQ,

EXPR; The orthogonality conditions that E(x; · EiJ) = 0 and E(x; · £; 2 ) = 0 become £j2

£ii

Sit:il

E

IQ;

= 0

and

t:il

EXPR;

E

S; £;2 IQ;

=0,

(4.5.20)

Ei2

EXPR; e,2

EiJ

which should he contrasted with the statement that the regressors are predetermined in each equation: £j)

E

S; IQ;

t:il EiJ

EXPR; Eii

=0

and

£;2] E s, [IQ, £;2 Ei2

= 0.

(4.5.21)

The difference between (4.5.20) and (4.5.21) is that the KWW equation is overidentified with EXPR as the additional instrument. Because SUR is a special case of3SLS, formulas (4.5.12), (4.5.15), and (4.5.17) for J 3sLs apply to JsUR· The implication of the SUR assumption (4.5.18) is that, in the above expressions for Amh• Cmh• and Amh• X; "disappears": ~

(4.5.13')

(4.5.14') (4.5.16') Here we give a proof of (4.5.16'). Without Joss of generality, suppose z,, is fhe first L, elements of the K elements ofx;. Let D (K x L,) be fhe first Lm columns

281

Multiple-Equation GMM ofiK.

Then we have

Substituting these into (4.5.16) yields (4.5.16'). The proof of (4.5.13') and (4.5.14'), which are about the sample moments, proceeds similarly, with Z;m = D'x;. (You will be asked to provide an alternative proof by the use of the projection matrix in the problem set.) In 3SLS, the initial consistent estimator for calculating 1: was 2SLS. But, 2SLS when the regressors are a subset of the instrument set is OLS (as you showed for Review Question 7 of Section 3.8). So, for SUR, the initial estimator is the OLS estimator. The preceding discussion can be summarized as

Proposition 4.6 (large-sample properties of SUR): Suppose Assumptions 4.14.5 and 4. 7 hold with X; = union of (z 11 , ... , Z;M ). Let 1: be theM x M matrix of estimated error cross moments calculated by (4.5. 7) using the OLS residuals. Then

-

(a) 3suR given by (4.5.12) with Amh and Cmh given by (4.5.13') and (4.5.14') form, h = 1, ... , M is consistent, asymptotically nonnal, and efficient, with Avar(3suR) given by (4.5.15) where Amh is given by (4.5.16'). (b) The estimated asymptotic variance (4.5.17) where Amh is given by (4.5.13') is consistent for Avar(8suRl· A

(c) (Sargan's statistic) J(3suR.s-')

= n. g,(3suR)'s-I g.,(3suR)-: x2 (MK-

LLm)• m

where S = 1: 0 (~ I:, x1x;J, K is the number of common instruments, and g,(·) is given in (4.2.1). (Unlike in Proposition 4.5, the condition that E(z;mz;h) be finite is not needed because, thanks to (4.5.18), it is implied by the condition that E(x,x;) be nonsingular, which in tum is implied by the nonsingularity of S [from Assumption 4.5] and the fact that Scan be written as 1: 0 E(x;x;J.) Sargan's statistic for SUR is rarely reported in practice. SUR versus OLS

Since the regressors are predetermined, the system can also be estimated by the equation-by-equation OLS. Then why SUR over OLS? As just seen, under condi-

282

Chapter 4

tiona! homoskedasticity, the efficient multiple-equation GMM is FIVE, which in tum is numerically equal to SUR under (4.5.18). As shown in Chapter 3, under conditional homoskedasticity, the efficient single-equation GMM is 2SLS, which in tum is numerically equal to OLS because the regressors are predetermined under (4.5.18). This interrelationship is shown in Figure 4.1. Therefore, the relation between SUR and equation-by-equation OLS is strictly analogous to the relation between the multiple-equation GMM and the equation-by-equation GMM. As discussed in Section 4.4, there are two cases to consider. (a) Each equation is just identified. Because the common instrument set is the union of all the regressors, this is possible only if the regressors are the same for all equations, i.e., if Zim = x, for all m. The SUR for this case is called the multivariate regression. We observed in Section 4.4 that both the efficient multiple-equation GMM and the efficient single-equation GMM are numerically the same as the IV estimator for the just identified system. Because the regressors are predetermined, we conclude that the GMM estimator of the multivariate regression model is simply equation-by-equation OLS 7 This can be verified directly by substituting (4.5.13') and (4.5.14') on page 280 (with z,m = x,) into the expression for the point estimate (4.5.12). Also substituting (4.5.16') and (4.5.13') (again with z,m = x,) into (4.5.15) and (4.5.17), we obtain, for multivariate regression, (4.5.22) Avar(.l) =

f

1 ® (;;-

n

L x,x;) I=

-1

.

(4.5.23)

I

(b) At least one equation is overidentified.

Then SUR is more efficient than equation-by-equation OLS, unless equations are "unrelated" to each other in the sense of (4.4.3). In the present case of conditional homoskedasticity and the common set of instruments, (4.4.3) becomes amh

E(x,x;) = 0

for all m oF h.

Since E(x,x;) cannot be a zero matrix (it is assumed to be nonsingular), equations are "unrelated" to each other if and only if amh = 0. Therefore, SUR is more efficient than OLS if amh oF 0 for some pair (m, h). If amh = 0 for all (m, h), the two estimators are asymptotically equivalent (i.e., fo times the difference vanishes). 71t will be shown in Chapter 8 that the estimator is also the maximum likelihood estimator.

283

Multiple-Equation GMM

conditional homoskedasticity

efficient equation-by-equation GMM

efficient multiple-equation GMM

equation-by-equation 2SLS

FIVE

equation-by-equation OLS

SUR

----7

SUA assumplion (4.5.18), i.e., endogenous regressors ----7 satisfy "cross" orthogonalities

Figure 4.1: OLS and GMM Another way to see the efficiency of SUR is to view the SUR model as a multivariate regression model with a priori exclusion restrictions. As an example, expand the two-equation system of Example 4.3 as LW;

= ¢ 1 + f3 1S; + y 1IQ, + n 1EXPR; + eiJ,

KWW, = ¢ 2

+ f3 2 S; + y2IQ, + n 2 EXPR, + t:;2 •

The common instrument set is (I, s,, IQ,, EXPR, ). Then this two-equation system is a multivariate regression model with the same regressors in both equations. But if n 2 is a priori restricted to be 0 and thus EXPR; is excluded from the second equation, then the model becomes the SUR model. The SUR estimator is more efficient than the multivariate regression because it exploits exclusion restrictions. The relationship among the estimators considered in this section is summarized in Table 4.2.

QUESTIONS FOR REVIEW

1. (FIVE without conditional homoskedasticity) Without conditional homoske-

dasticity, is the Sin (4.5.3) consistent for Sin (4.1.11)? [Answer: No. Why?] Under conditional homoskedasticity, is (4.3.2) consistent for (4.5.2)? [Answer: Yes. Why?] 2. (FIVE without conditional homoskedasticity) Without conditional homoskedasticity, is FIVE consistent and asymptotically normal? Efficient? Hint: The FIVE estimator is a multiple-equation GMM estimator (4.2.3) for some W. Does theW satisfy the efficiency condition that plim W = s- 1 ?

Table 4.2: Relationship between Multiple-Equation Estimators Multiple-

FIVE

equation GMM

3SLS~

SUR

Multivariate regression

Assumptions 4.1-4.6

Assumptions 4.1-4.5, Assumption4.7 E(z;,z;h) finite

Assumptions 4.1-4.5, Assumption 4.7 E(z;,z;h) finite x;, = Xi for all m

Assumptions 4.1-4.5, Assumption 4. 7 x1, = x; for all m x; =union ofzn, ... , Z;M

Assumptions 4.1-4.5, Assumption4.7 x1, = x1 for all m z;, = Xi for all m

(4.1.11)

(4.5.2)

E 0E(x,x;)

E 0E(x;xj)

irrelevant

s

(4.3.2)

(4.5.3)

...... I: 0 (n -1 E 1x1x1' ) E from 2SLS residuals

f

8(§"-1)

no simplification

no simplification

(4.5.12) with (4.5.13), (4.5.14)

(4.5.12) with (4.5.13'), (4.5.14')

equation-by-equation OLS

Avar(8(8- 1))

(4.3.3)

(4.3.3)

(4.5.15) with (4.5.16')

OLS fonnula

-

(4.5.15) with (4.5.16)

(4.3.4)

(4.3.4)

(4.5.17) with (4.5.13)

(4.5.17) with (4.5.13')

OLS fonnu1a

The model

S

(~

Avar@)

Avar(8(S- 1))

E0

(n- 1E 1x1x;) from OLS residuals

irrelevant

285

Multiple-Equation GMM

3. (When FIVE and eq-by-eq 2SLS are numerically the same) Suppose each equation is just identified. Are the equation-by-equation 2SLS and FIVE estimators asymptotically equivalent (in that .jli times the difference vanishes)? Numerically the same? Hint: Both FIVE and 2SLS reduce to IV. 4. (When are FIVE and eq-by-eq 2SLS asymptotically equivalent) When the errors are orthogonal to each other (so amh = 0 form f= h), the FIVE and equation-by-equation 2SLS estimators are asymptotically equivalent (in that .jli times the difference vanishes). Prove this. Hint: Superimpose conditional homoskedasticity on Proposition 4.3(b). Under conditional homoskedasticity, the efficient multiple-equation GMM reduces to the FIVE and the efficient singleequation GMM reduces to the equation-by-equation 2SLS. Also under conditional homoskedasticity, the equations are "unrelated" in the sense of (4.4.3) if O'mh =

0.

5. (Conditional homoskedasticity under SUR assumptions) Verify that, under the SUR assumption (4.5.18), Assumption 4.7 becomes E(e;m £ih

I Z;t •... , z;M) =

amh

for all m, h = I, 2, ... , M.

6. In Example 4.3, what happens to the SUR estimator when X; includes MED; also? Hint: x; still "disappears" from (4.5.13), (4.5.14), and (4.5.16). To Sargan's statistic? 7. (Identification in SUR) In Proposition 4.6, the identification condition (Assumption 4.4) is actually not needed because it is implied by the other assumptions. Verify that Assumptions 4.5 and 4.7 imply Assumption 4.4. Hint: By Assumption 4.5 and the Kronecker decomposition (4.5.9) of S, E(x;x;) is nonsingular. z;m is a subset of x 1.

8. (SUR without conditional homoskedasticity) (a) Without conditional homoskedasticity, is SUR consistent and asymptotically normal? Efficient? Hint: The SUR estimator is still a multipleequation GMM estimator, so Proposition 3.1 applies. (b) Without conditional homoskedasticity, does SUR still reduce to the multivariate regression when the system is just identified in that each equation is just identified (so Z;m =X;)? [Answer: Yes.] 9. (Role of "cross" orthogonalities) Suppose, instead of (4.5.18') on page 279, the orthogonality conditions are that E(z;m · &;m) = 0 form = I, 2, ... , M. Is

286

Chapter 4

SUR consistent? [Answer: Not necessarily.] Derive the efficient multipleequation GMM estimator under conditional homoskedasticity. [Answer: lt is OLS.] 10. Verify that Sargan's statistic for the multivariate regression model is zero.

4.6

Common Coefficients

In many applications, particularly in panel data contexts, you deal with a special case of the multiple-equation model where the number of regressors is the same across equations with the same coefficients. Such a model is a special case of the multiple-equation model of Section 4.1. This section shows how to apply GMM to multiple equations while imposing the common coefficient restriction. It will be shown at the end that the seemingly restrictive model actually includes as a special case the multiple-equation model without the common coefficient restriction. The Model with Common Coefficients

With common coefficients, the multiple-equation system becomes Assumption 4.1' (linearity): There are M linear equations to be estimated: Yim = z;m~

+ E:im

(m = 1, 2, ... ,

M; i

=

1. 2, ... , n),

(4.6.1)

where n is the sample size, Zim is an L-dimensional vector of regressors, ~ is an L-dimensional coefficient vector common to all equations, and E:;m is an unobsetYable error term of them -th equation.

Of the other assumptions of the multiple-equation model, Assumptions 4.2-4.6, only Assumption 4.4 (identification) needs to be modified to reflect the common coefficient restriction. The multiple-equation version of g(w;; ~) is now

g(w;;

so that E[g(w;;

~)]becomes

~)

=

(4.6.2)

Multiple·Equation GMM

E(xn . y;.) E[g(w1; li)] =

: [

l

287

E(x;M · YiM)

(4.6.3)

where

(4.6.4)

So the implication of common coefficients is that l:xz is now a stacked, not a block diagonal, matrix. With this change, the system of equations determining li is again (4.1.10), so the identification condition is Assumption 4.4' (identification with common coefficients): The L~~J Km x L matrix l:,z defined in (4.6.4) is of full column rank. This condition is weaker than Assumption 4.4, which requires that each equation of the system be identified. Indeed, a sufficient condition for identification is that E(x;m z;m) be of full column rank for some m. 8 This is thanks to the a priori restriction that the coefficients are the same across equations. It is possible that the system is identified even if none of the equations are identified individually. The GMM Estimator

-

-

It is left to you to verify that gn (li) can be written as Sxy - Sxzli with

I

n

-n.I:xn ·Yn 1=1

Sxy

=

(4.6.5) I

n

- LXiM "YiM n.

1=1

I

n

- LX;Mz;M n

i=l

8 Proof: If E(ximz/m) is of full column rank, its rank is L. Since the rank of a matri:< is also equal to the number of linearly independent rows, E(x;mZ:·m) has L linearly independent rows. Thus, l:xz has at least L linearly independent rows.

288

Chapter 4

so that the GMM estimator with a Lm Km X Lm Km weighting matrix Wis (4.2.3) with the sampling error given by (4.2.4). Provided that l:xz and Sxz are as just redefined and ~ is interpreted as the common coefficient, large-sample theory for the multiple-equation model with common coefficients is the same as that with separate coefficients developed in Section 4.3, with Assumption 4.1' replacing Assumption 4.1, Assumption 4.4' replacing Assumption 4.4, and the residual 1!1 redefined to be Yim -z;m~ (not Yim -z~m&m) for some consistent estimate of the L-dimensional common coefficient vector b. In order to relate this GMM estimator to popular estimators in the literature, it is necessary to write out the GMM formula in full. Substituting (4.6.5) into (4.2.3), we obtain -I

[

n

-n I:xn ·Yn i=l

l

n

-n I:xiM ·

YiM

i=I

(4.6.6) where Wmh is the (m, h) block of W. (For the second equality, use formula (A.8) of Appendix A.) The efficient GMM estimator obtains if the Win this expression is replaced by the inverse of S defined in (4.3.2).

-

Imposing Conditional Homoskedasticity

-

s-

1 The efficient GMM estimator obtains when we set in (4.6.6) W = where Sis a consistent estimator of S. As before, we can impose conditional homoskedasticity

289

Multiple·Equation GMM

-

to reduce the efficient GMM estimator to popular estimators. The FIVE estimator obtains if Sis given by (4.5.3). lf we further assume that the set of instruments is the same across equations, then, as before, this S has the Kronecker product structure n

S=i®(~Lx;x;), i=l

so that Wmh in (4.6.6) is given by (4.5.11), resulting in what should be called the 3SLS estimator with common coefficients:

(4.6.7) 1

where &mh is the (m, h) element of f- . If, in addition, the SUR condition (4.5.18) is assumed, then the "disappearance ofx" occurs in (4.6. 7) and the efficient GMM estimator becomes what is called (for historical reasons) the random-effects estimator: (4.6.8) Its asymptotic variance is

Avar(~RE) =

,?; b M

[

M

umh

E(z;mz;h)

]-l

(4.6.9)

(To derive this, go back to the general formula (4.3.3) for the Avar of efficient GMM. Set l:., as in (4.6.4), S = l: 0 E(x;x;J, use formula (A.8) of Appendix A to calculate the matrix product, and observe the "disappearance of x" that (4.5.16) becomes (4.5.16') on page 280 under the SUR assumption (4.5.18).) It is consistently estimated by

(4.6.10)

290

Chapter 4

We summarize the result about the random-effects estimator as Proposition 4.7 (large-sample properties of the random-effects estimator):

Suppose Assumptions 4.1', 4.2, 4.3, 4.4', 4.5, and 4.7 hold with x; =union of (zn, ... , Z;M ). Let :E be the M x M matrix whose (m, h) element is E(&;m &;h), and let 1: be a consistent estimate of :E. Then (a) JRE· given by (4.6.8) is consistent, asymptotically normal, and efficient, with Avar(oRE) given by (4.6.9).

.

(b) The estimated asymptotic variance (4. 6.10) is consistent for Avar(JRE).

(c) (Sargan's statistic) "

.......-I

](oRE, S

)

" , ......-1 " = n · g,(oRE) S g.(oRE)--->

d

2

X (MK- L),

wh~re S = 1: ® _(~ L x;x;), K is the number of common instruments, and g,(o) = s,y- S,o with s,y and S, given by (4.6.5). Below we will rewrite the formulas (4.6.8)--(4.6.10) in more elegant forms. Pooled OLS For the SUR of the previous section, we obtained 1: from the residuals from the equation-by-equation OLS. The consistency is all that is required for 1:, so the same procedure for 1: works here just as well, but the finite sample distribution of 3RE might be improved if we exploit the a priori restriction that the coefficients be the same across equations in the estimation of 1:. So consider setting Wto

rather than --1

1:

('

® ;;

~x;x; n

)-! ,

1=1

in the first-step GMM estimation for the purpose of obtaining an initial consistent estimate of The estimator is (4.6.8) with amh = I form =hand 0 form oft h,

o.

which can be written as

291

Multiple-Equation GMM

(4.6.11) which is simply the OLS estimator on the sample of size Mn where observations are pooled across equations. For this reason, the estimator is called the pooled OLS estimator. The expression suggests that the orthogonality conditions that are being exploited are E(z;~

· e;~

+ · · · + Z;M

· e;M)

(4.6.12)

= 0,

which does not involve the "cross" orthogonalities E(z;m · t:;h) = 0 (m r" h). It is left as an analytical exercise to show that the pooled OLS is the GMM estimator exploiting (4.6.12). Because it is so easy to calculate, pooled OLS is a popular option for those researchers who do not want to program estimators. The estimator is also robust to the failure of the "cross" orthogonalities. But it is important to keep in mind that the standard errors printed out by the OLS package, which do not take into account the interequation cross moments (amh), are biased. For the pooled OLS estimator, which is a GMM estimator with a nonoptimal choice of W, the correct formula for the asymptotic variance is (3.5.1) of Proposition 3.1. Setting W = IM ® [E(x,x)r 1 , S = :E ® E(x,x;), :E., as in (4.6.4), and again observing the disappearance of x, we obtain

Avar(Jpoolod OLS) = [

f;

E(Z;mz;m)] -I

f; f;

amh E(z,mz;h) [

f;

E(z,mz;m)] -I (4.6.13)

which is consistently estimated by

Since the pooled OLS estimator is consistent, the associated residual can be used to calculate amh (m, h = I, ... , M) in this expression. The correct standard errors of pooled OLS are the square roots of (I jn times) the diagonal element of this matrix.

292

Chapter 4

Beautifying the Formulas

These formulas for the random-effects and pooled OLS estimators are quite straightforward to derive but look complicated, with double and triple summations. They can, however, be beautified- and at the same time more useful for programming-if we introduce some new matrix notation. Let

(4.6.15)

so that theM-equation system with common coefficients in Assumption 4.1' can be written compactly as

Y; =Z;8+e;

(4.6.1 ')

(i = 1,2, ... ,n).

The beautification of the complicated formulas utilizes the following algebraic results: M

M

(4.6.16a) m=l

M

m=l

M

L LCmh.

M

Zim. Yih

LL

= z;cyi,

m=l h=I

I

M

Cmh. Z;mz;h =

z;cz;

(4.6.16b)

m=l h=I

for any M x M matrix C = M

M

(cmh)- Now take the triple summation in (4.6.8)

n

LL&mh. (;; LZ;m. Y;h) m=l h=I

i=l

l

= ;;

n

M

M

L(L L &mh ·Z;m ·Yih) i=l

(by changing order of summations)

m=l h=I

n

I "'

,--1

=;;L_.Z;I:.

Y;

--1

(by(4.6.16b)withC='E

).

(4.6.17)

1=1

Similarly, by (4.6.16b) with C comes M

M

f- 1, n

"'"' L...t L...t a. mh -I " L...t' n m=l h=1

the other triple summation in (4.6.8) be-

i=l

n

' = -I ZjmZ;h

n

"'z,--1 L...t ;I: Z;. i=l

(4.6.18)

293

Multiple-Equation GMM

The double and triple summations in (4.6.9) and (4.6.10) can be written similarly. We have thus rewritten the random-effects formulas as

oRE =

(I~

,--1

;; L.. z, I:.

Z;

)-

1

I~

y,.

,--1

;; L.. z, I:.

i=l

(4.6.8')

i=1

• Avar(oRE) =

---;:--

Avar(oRE) =

( E(Z,I:.' I Z;)

(I;; L..Z,I:. ~

,--1

)-! .

z,

)-

(4.6.9') 1

(4.6.10')

.

I:=: 1

Using (4.6.16a), the pooled OLS formulas can be written as " 1 n -1 1 n opooiedoLs = (- L:z;z,) - L:z;y,, n .

(4.6.11')

n .

1=1

1=1

Avar(~pooied oLsl = [E(z;z,)]- 1 E(z; I:.Z,)[E(z;z,)r 1 • (4.6.13') -:;:---1 n -1 1 n .-... 1 n -1 Avar(opooiedOLs) = (;; L:z;z,) (;; L:z;I:.z,)C L:z;z,) . (4.6.14') i=1

i=l

i=1

These formulas are not just pretty; as will be seen in the next section, they will be useful for handling certain classes of panel data (called unbalanced panel data). The Restriction That Isn't Although it appears that the model of this section, with the common coefficient assumption, is a special case of the model of Section 4.1, the latter can be cast in the format of the model of this section with a proper redefinition of regressors. For example, consider Example 4.1. Instead of defining zi! and z12 as in (4.1.2), define them as

Z;J

=

I

0

q,,

s,

0

fh

IQ,

0 0

Y1 rr

EXPR,

Z;2

=

0

I

0

s,

0

IQ,

with

o=

z

fJ, Yz

Then the two-equation system fits the format of Assumption 4.1 '. (See Review Question 5 below for a demonstration of the same point more generally.)

294

Chapter 4

Furthermore, a system in which only a subset of variables have common coefficients can be written in the form of Assumption 4.1'. Consider Example 4.2. Suppose we wish to assume a priori that the coefficients of schooling, TQ, and experience remain constant over time (so fJ1 = fJ2 = {3, Y1 = Y2 = y, rr1 = rr2 = rr) but the intercept does not. The translation is accomplished by setting

0

I Zn =

0 S69 ' TQ, EXPR69;

¢1

¢2

'

Zj2

=

S80 ' TQ, EXPR80,

'

o=

f3

(4.6.19)

y

,.

Therefore, the common coefficient restriction is not restrictive. We could have

derived the GMM estimator for the model of Section 4.1 as a special case of the GMM estimator with the common coefficient restriction. But the model without the restriction is perhaps an easier introduction to multiple equations. QUESTIONS FOR REVIEW

1. (Identification with common coefficients) Let Zimj be the j-th element of z;m. That is, Zimj is the j-th regressor in the m-th equation. Assume that Zimi = I for all i, m. Which of the assumptions made for the model is violated if z,mz = I for all i, m? Hint: Look at the rank condition. Would your answer change if Zim2

= m for all i, m? Hint: It should.

2. (Importance of "cross" orthogonalities) Suppose E(Z;m · E:;h) = 0 form = h but not necessarily form #h. Would the random-effects estimator be consistent? Hint: The sampling error of the random-eflects estimator is

3. (Inefficiency of pooled OLS) Derive the efficient GMM estimator of 8 that exploits E(z;m · e;m) = 0 for all m. Is it the same as the pooled OLS? [Answer: No.] Is it the same as the random-effects estimator? [Answer: No.] 4. (I:,z, s,y. S,z with common instruments in Kronecker products) Suppose the instruments are the same across equations: X;m =

x,.

Show that I:,. in (4.6.4)

295

Multiple·Equation GMM

can be written as E(Z; ® x; ), where Z; is the M x L matrix defined in (4.6.15). Also,

Sxy

= ~

L7=l Yi ®X;, Sxz = ~ L7=! zi ®X;. Hint: Xiz;m =

z;m ®Xi·

5. (The restriction that isn't) To avoid possible confusion arising from the fact that the Z;m in Section 4.1 and the Z;m in this section are different, write the M-equation system in (4.1.1) as (m = 1, 2, ... , M; i = 1, 2, ... , n),

(I)

where z;m and~:~ are Lm ·dimensional.

(a) (Assumption 4.1 as a special case of Assumption 4.1') Verify that (I) can be written as (4.6.1) if we define

Zim Q~_!f=l Lm XI)

0

)

L 1 rows

0

)

Lm-I rows

z'•m

)

Lm rows

0

)

Lm+I rows

0

)

LM rows

and

8

[:~l

(2)

(b) (Assumption 4.4 as a special case of Assumption 4.4') Verify that, when Z;m is defined as in (2) with z7m interpreted as the Z;m of Section 4.1, Assumption 4.4' becomes Assumption 4.4. (c) (GMM formula) Under the same interpretation, verify that S., of (4.6.5) becomes a block diagonal matrix whose m-th block is~

(4.6.6) becomes (4.2.6) if the

Z;m

Li ximz7~. so that

in (4.2.6) is understood to mean

z7m·

6. (Identification in theRE model) Assumption 4.4' in Proposition 4.7 is actually unnecessary. Verify that Assumptions 4.5 and 4.7 imply Assumption 4.4'. Hint: If E(x;x;J is nonsingular, then Assumption 4.4 holds, as shown in Review Question 7 to Section 4.5. Assumption 4.4' is weaker than Assumption 4.4. 7. (Only for those who have become familiar with Kronecker products) (a) Show that E(Z;:E- 1Z;) in (4.6.9') on page 293 for the random-effects estimator is non singular if the assumptions of Proposition 4. 7 hold. Hint:

296

Chapter 4

1

E(Z;l:- Z;) =

=

M

M

M

M

L L
L L amh E(Z;mx;) [E(x;X;) r' E(x,z;.J m:=l h=I

(x; can reappear, since x; includes Z;m and

Exz

Z;h)

is of full column rank by Assumption 4.4'.

(b) Show that E(z;z,) in (4.6.13') on page 293 for pooled OLS is invertible. Hint: Replace l:- 1 by I.

4.7 Application: Interrelated Factor Demands The translog cost function is a generalization of the Cobb-Douglas (log-linear) cost function introduced in Section I. 7. An attractive feature of the translog specification is that the cost-minimizing input demand equations, if transformed into cost share equations, are linear in the logs of output and factor prices, with the coefficients inheriting (a subset of) the cost function parameters characterizing the technology. Those technology parameters can be estimated from a system of equations for cost shares. Early applications of this idea include Berndt and Wood (1975) and Christensen and Greene ( 1976). This section reviews their basic methodology. The methodology is also applicable in the analysis of consumer demand. To better focus on the issue at hand and to follow the practice of those original studies, we will proceed under the assumption of conditional homoskedasticity. Also, we will assume that the regressors in the system of share equations, which are log factor prices and log output, are predetermined. As we argued in Section 1.7, this assumption is not unreasonable for the U.S. electric power industry before the recent deregulation. Consequently, the appropriate multiple-equation estimation technique is the multivariate regression. The Translog Cost Function We have already made a partial transition from the log-linear to translog cost function: in the empirical exercise of Chapter I, we entertained the idea of adding the square of log output to the log-linear cost function (see Model 4). If this idea is

297

Multiple-Equation GMM

extended to the quadratic and cross terms in the logs of all the arguments, the loglinear cost function with three inputs (1.7.4) becomes the translog cost function: I

3

log( C)= <>o + L

a1 log(p1) +

j=l

+ <>Q log(Q) +

I

3

2 Lj=I

3

LYJk log(p1) log(pk) k=I

YQQ(log(Q)) 2

2

3

+ LYJQ log(p1) log(Q) + £. j=l

(4.7.1) Here, and as in the rest of this section, the observation subscript "i" is dropped. In this expression, the term

I

3

2L

3

L

YJk log(pi) log(pk)

j=l k=l

is a quadratic form representing the second-order effect of factor prices. We can assume, without loss of generality, that the 3 x 3 matrix of quadratic form coefficients, (y1k), is symmetric: 9

YJk = YkJ

(4.7.2)

(j, k =I, 2, 3).

In the case of log-linear cost function examined in Section 1.7, the degree of returns to scale can be calculated as the reciprocal of the elasticity of costs with respect to output. If this definition is applied to the translog cost function, we have I

returns to scale =

a1og

(C

)fa 1og (Q ) I

<>Q

3

+ YQQ log(Q) + I:1 ~ 1 YJQ log(p1)

.

(4.7.3)

Factor Shares

The link between the cost function parameters and factor demands is given by Shephard'sLemma from microeconomics. Letx1 be the cost-minimizing demand for factor input j given factor prices (p 1, p 2 , p 3) and output Q. So I:]~ 1 p1x1 = C. The lemma states that

9 Lct x be an n-dimension<JI vector. The associated quadratic fonn is x 1Ax. Since x 1Ax = x' A'x, we have 1 x Ax = x 1[(A + A1)/2]x. So if A is not symmetric, it can be replaced by the symmetric matrix (A+ A )/2 without changing the value of the quadratic fonn. 1

298

Chapter 4

(4.7.4) Noting that

alog( C) alog(pj)

pj ac -·-, c apj

the lemma can also be written as stating that the logarithmic partial derivative of the cost function equals the factor share, namely,

alog(C)

=

PjXj

aIog(pj)

(4.7.5)

c

For the case of translog cost function given in (4. 7.I), the log partial derivative is easy to calculate:

alog( C) alo ( ·) g PJ

3

=

Cij

._, + L.... Yjk log(p,) k~l

+ l'jQ log(Q)

(j=l,2,3).

(4.7.6)

Combining this with (4.7.5) and defining the cost shares Sj = pjXj/C, we obtain the following system of cost share equations associated with the translog cost function: 3

Sj

=

aj

+ LYjk log(p,) + YjQ log(Q)

(j = I, 2, 3).

(4.7.7)

k=I

This system of share equations is subject to the cross-equation restrictions that the coefficient of log(p,) in the Sj equation equals the coefficient of log(pj) in the s, equation for j i= k. In the rest of this section, these cross-equation restrictions will be called the symmetry restrictions, although they are not a consequence of the symmetry assumption (4.7.2). Those cross-equation restrictions are a consequence of calculus: you should verify that, if you did not assume symmetry, the coefficient of log(p 2 ) in the s 1 equation, for example, would be (y12 + y 21 )/2, which would equal the log(p 1 ) coefficient in the s 2 equation. Substitution Elasticities

In the production function literature, a great deal of attention has been paid to estimating the elasticities of substitution (more precisely, the Hicks-Allen partial elasticities of substitution) between various inputs. 10 As shown in Ozawa (1962), 10 See Section 9.2 of Berndt (1991) for a historical overview.

299

Multiple·Equation GMM

the substitution elasticity between inputs j and k, denoted function C by the formula

ryjb

is related to the cost

(4.7.8)

For the translog cost function, some routine calculus and Shephard's lemma show that Yjk

+ SjSk

for j 'f k

SjSk

Yjj

+ s/- Sj s·2

(4.7.9) for j = k,

}

where sj is the cost share of input j. In the special case of the log-linear cost function where Yjk = 0 for all j, k, we have ryjk = I for j 'f k, verifying that the substitution elasticities are unity between any two factor inputs for the CobbDouglas technology. Properties of Cost Functions

As is well known in microeconomics (see, e.g., Varian, 1992, Chapter 5), cost functions have the following properties. I. (Homogeneity) Homogeneous of degree one in factor prices. 2. (Monotonicity) Nondecreasing in factor prices, which requires that ac jJpj > 0 for all j. 3. (Concavity) Concave in factor prices, which requires that the Hessian (the matrix of second-order derivatives) of a cost function C, say

be negative semidefinite. By (4. 7.8), concavity is satisfied if and only if the matrix of substitution elasticities, (ryjk), is negative semidefinite.'' For the case of trans log cost function, these properties take the following form.

Ll Let H be the Hessian of the cost function C and V be a diagonal matrix whose j·th diagonal is

ac;apj.

Then the matrix of substitution elasticilies, (TJjk), given in (4.7.8) can be written compactly as: (rJjk) = C ·v-I Hv- 1. Since Vis a diagonal matrix, v- 1Hv- 1 is negative semidefinite if and only if His.

300

Chapter 4

I. Homogeneity: It is easy to verify (see a review question) that, with the symmetry restrictions imposed on (yjt), the homogeneity condition can be expressed as

"' + "2 + "' = I, Y11 + Y12 + YI3 = 0, YI2 + Y22 + Y23 = 0, Yl3 + Y23 + Y33 = 0, YI Q + Y2Q + Y3Q = 0.

(4.7.10)

Homogeneity will be imposed in our estimation of the share equations. 2. Monotonicity: Given Shephard's lemma, monotonicity is equivalent to requiring that the right-hand sides of the share equations (4.7.7) be nonnegative for any combinations of factor prices and output. For any values of coefficients, we can always choose factor prices and output so that this condition is violated. Accordingly, we cannot impose monotonicity on the share equations. It is possible, however, to check whether monotonicity is satisfied for the "relevant range," that is, if it is satisfied for all the combinations of factor prices and output found in the data at hand; we will do this for the estimated parameters below. 3. Concavity: Concavity requires that the matrix of substitution elasticities given by (4.7.9) be negative semidefinite for any combinations of cost shares. It is possible to show that a necessary and sufficient condition for concavity is that the 3 x 3 matrix (Yjk) be negative semidefinite. 12 In particular, the diagonal elements, (y11 , )'22, y 33 ), have to be nonpositive. The condition that the matrix (Yjk) be negative semidefinite is a set of inequality conditions on its elements. Estimation while imposing inequality conditions is feasible (see Jorgenson, 1986, Section 2.4.5, for the required technique), but it will not be pursued here. Stochastic Specifications The translog cost function given in (4.7.1) has an (additive) error term,&, while

the share equations derived from it have none. On our data, the share equations do

12 Proof: Lets here be a 3 x I vector whose j·th element is Sj and S here be a diagonal matrix whose j·th element is s . Then the matrix of substitution elasLicities given in (4.7.9) can be written as (IJjk) = s- 1[(Yjk) +

1

ss'- S]S- 1 . For some combination of cost shares, ss'- Sis zero (for example, sets= (1, 0, 0) 1). So a necessary condition for the matrix (IJjk) to be negative semidefinite is that the matrix (Yjk) is negative semidefinite. This condition is also sufficient, because the matrix ss' - S is negative semidefinite and the sum of two negative semidefinite matrices is negative semidefinite.

301

Multiple·Equation GMM

not hold exactly for each firm, so we need to allow for errors. The usual practice is to add a random disturbance term to each share equation. There are two ways to justify such a stochastic specification. One is optimization errors: firms make random errors in choosing their cost-minimizing input combinations. But then actual total costs will not be as low as what is indicated in the cost function, because there is no room for optimization errors in the derivation of the cost function. The other justification for errors is to allow ai in the share equations to be stochastic and differ across firms. The intercept in the share equation for j would then be the mean of ai, and the error term would be the deviation of ai from its mean. But this also means that the ai in the cost function has to be treated as random. In either case, therefore, internal consistency would prevent us from adding the cost function to the share equations with additive errors to form a system of estimation equations. In the rest of this section, we will focus on a system that consists entirely of share equations. This, however, means that we can estimate only a subset of the cost function parameters. Two parameters, aQ and YQQ• which are absent in factor demands, cannot be estimated. Since the degree of returns to scale depends on them (see (4.7.3)), it cannot be estimated either. The Nature of Restrictions

With the additive errors appended, the system of share equations (4.7.7) is written out in full as

+ YII log(p,) + Yl2log(p,) + Yn log(p,) + YIQ log( Q) + .s,, s, = a 2 + Y21 log(p,) + y,log(p,) + Y2J log(p3) + Y2Q log( Q) + .s,, s, =a,+ y31log(p,) + y"log(p,) + y33log(p,) + YJQ log(Q) + .s,.

s, = a,

l

(4.7.11)

This system has fifteen coefficients or parameters. The cross-equation restrictions (that the 3 x 3 matrix of log price coefficients, (Yjk), be symmetric) and the homogeneity restrictions (4. 7 .I 0) form a set of eight restrictions, so that the fifteen parameters can be described by seven free parameters. To understand this, it is useful to decompose the restrictions into the following three groups.

a 1 + a 2 + a 3 = I, Y11 + Y21 + Y3! = 0, adding-up restrictions:

+ Yn + Y32 = 0, Yn + Y23 + Y33 = 0, y, Q + Y2Q + YJQ = 0. Y12

(4.7.12)

302

Chapter 4

homogeneity:

symmetry:

+ Y12 + YI3

= 0, + Y22 + Y23 = 0, Y3I + Y32 + Y11 = 0, Yil Y21

! {

Y12 = Y2I· YI3 = Y11,

(4.7.13)

(4.7.14)

Y23 = Y32·

These are eleven equations, but only eight of them are restrictive: given the addingup restrictions, one of the three homogeneity restrictions is implied by the two others, and given the adding-up and homogeneity restrictions, two of the three symmetry restrictions are implied by the other (you should verify this). Multivariate Regression Subject to Cross-Equation Restrictions

It is straightforward to impose the homogeneity restrictions, (4.7.13), because they are not cross-equation restrictions. We can eliminate three parameters, say (y13 , Y33), from the system to obtain

y,,,

-'1 = a1 + YIIIog(pJ/ P3) s2 =a,+ Y21log(pi/ P3) S3 =a,+ Y11log(pJ! p,)

!

+ Y12log(p,j P3) + YIQ log(Q) + 8J, + Y22log(p,j P3) + Y2Q log(Q) + s,, + y,,log(p,j P3) + Y3Q log(Q) + 83.

(4.7.15)

The system (with or without the imposition of homogeneity) has the special feature that the sum of the dependent variables, (s 1, s 2, s 3), adds up to unity for all observations. This feature and the adding-up restrictions imply that the sum of the error terms, (s~> s 2, s 3), is zero for all observations, so the 3 x 3 error covariance matrix, I: = Var(s 1, s 2, s 3), is singular. Recall that, in the multivariate regression model (which is the special case of the SUR model described in Proposition 4.6), the S matrix defined in (4.1.11) can be written as a Kronecker product: S = I: ® E(x,x;). Thus S becomes singular, in violation of Assumption 4.5. The common practice to deal with this singularity problem is to drop one equation from the system and estimate the system composed of the remaining equations (see below for more on this issue). The coefficients in the equation that was dropped can be calculated from the coefficient estimates for the included two equations by using the adding-up restrictions. The adding-up restrictions as well as homogeneity are thus incorporated. As mentioned above, given the adding-up restrictions and homogeneity, there is effectively only one symmetry restriction. For example, if (y13 , y23 , y33 ) are

303

Multiple·Equation GMM

eliminated from the system to incorporate homogeneity, as in (4.7.15), and if the third equation is dropped to incorporate the adding-up restrictions, then the unique symmetry restriction can be stated as y 12 = Y2I· If this cross-equation restriction is imposed, the system becomes

Since the regressors are predetermined and since the equations have common regressors, the system can be estimated by the multivariate regression subject to the cross-equation restriction of symmetry. This constrained estimation, if it is cast in the "common coefficient" format explored in Section 4.6, can be transformed into an unconstrained estimation. That is, the system (4.7.16) can be written in the "common coefficient" format of y, = Z;d + e1 (see (4.6.1') on page 292), if we set (while still dropping the "i" suhscript from s1, s2, log(pJ/ p3), log(p2/ p 3 ), and log( Q))

Yi = [::].

z, = [~

0 log(pi I P3) I 0

d'

= [UJ

log(p2/ p3) log(pJ/ P3) a2

Y11

YI2

0

log(p2/ p3) Y22

YIQ

log( Q) 0 Y2Q] ·

log~Q)]

'

(4.7.17)

So the multivariate regression subject to symmetry is equivalent to the (unconstrained) random-effects estimation. This random-effects estimation of the two-equation system produces consistent estimates of the following seven parameters:

The rest of the fifteen share equation parameters,

can be calculated using the adding-up restrictions (4.7.12), homogeneity (4.7.13), and symmetry (4.7.14). For example, y33 can be calculated as Y33 = -Y3I- Y32

= -y13- Y23 = (YII + Y12l + (Yl2 + Y22) = Y11 + 2YI2 + Y22·

304

Chapter 4

So the point estimate of y 33 is given by

where (YJJ, y12, Y22) are the random-effects estimates. The standard error of YJJ can be calculated by applying the "delta method" (see Lemma 2.5) to the consistent estimate of Avar(j/11 , j/12 , y22 ) obtained from the random-effects estimation. Which Equation to Delete?

Does it matter which equation is to be dropped from the system? It clearly would not if there were no cross-equation restrictions, because the multivariate regression is numerically equivalent to the equation-by-equation OLS. To see if the numerical invariance holds under the cross-equation restriction, let I:' be the 2 x 2 matrix of error co variances for the two-equation system obtained from dropping one equation from the three-equation system (4.7.15). It is the appropriate submatrix of the 3 x 3 error covariance matrix I:. (For example, if the third equation is dropped, then I:' is the leading 2 x 2 submatrix of the 3 x 3 matrix I:.) To implement the random-effects estimation, we need a consistent estimate, f*, of l:*. Two ways for obtaining I:' are: • (equation-by-equation OLS) Estimate the two equations separately, thus ignoring the cross-equation restriction, and then use the residuals to calculate I:'. Equivalently, estimate the three equations separately by OLS, use the threeequation residuals to calculate I: (an estimate of the 3 x 3 error covariance matrix I:), and then extract the appropriate submatrix from I:. If ~· I: is thus obtained, then (as you will verify in the empirical exercise) the numerical invariance is guaranteed; it does not matter which equation to drop.

-

• Obtain I:' from some technique that exploits the cross-equation restriction (such as the pooled OLS applied to the common coefficient format). The numerical invariance in this case is not guaranteed. The numerical invariance holds under equation-by-equation OLS, because the 3 x 3 matrix I:, from which I:' is derived, is singular in finite samples. In large samples, as long as I:' is consistent, and even if I:' is not from equation-by-equation OLS, this relationship between I:' and I: holds asymptotically because I: is singular. So the invariance holds asymptotically if I:' is consistent. In patticular, the asymptotic variance of parameter estimates does not depend on the choice of equation to drop.

Multiple-Equation GMM

305

Results

Nerlove's study examined in Chapter I was based on 1955 cross-section data for U.S. electric utility companies. Christensen and Greene (1976) updated Nerlove's data to 1970 for 99 firms, using data sources that allowed them to construct better measures of the wage rate and fuel prices (see their Section V for more details). Their data set includes information on factor shares, so the share equations can be estimated. The means and standard deviations for some of the variables in the data are reported in Table 4.3, which shows that fuel accounts for a very large share of the cost of electricity generation. We use the equation-by-equation OLS residuals to calculate i:', so that the random-effects estimates are numerically invariant to the choice of equation to drop. The estimates are reported in Table 4.4, along with i: derived from the equation-by-equation OLS. The factor input index j is I for labor, 2 for capital, and 3 for fuel. Some (actually all) of the diagonal elements of the matrix of estimated price coefficients, (Yj>), are positive. So the matrix is not negative semidefinite, in violation of concavity. 13 Even if (YJ>) is not negative semidefinite, the associated substitution elasticities given in the formula (4.7.9) may be negative semidefinite. Since the formula is derived assuming no optimization errors, fitted cost shares, rather than actual cost shares, should be used for the Sj and s> in the formula. The elasticities averaged over firms are shown in Table 4.5. (They are the off-diagonal elements of the symmetric 3 x 3 substitution elasticity matrix thus calculated.) The three factor inputs are substitutes because the substitution elasticities are positive, but the degree of substitutability is far less than what is assumed in the log-linear technology. Concavity is violated even in the "relevant range": for 20 firms out of the 99 in the sample, the matrix of substitution ela,ticities is not negative semidefinite. Monotonicity, in contra,t, is satisfied in the relevant range: fitted cost shares are nonnegative for all inputs for all firms in the sample.

13 This "test" of concavity can be fonnalizcd. Using the delta method, we can calculate the asymptotic distribution of the characteri!>tic roots of (fjk ). The null hypothesis of concavity is that those characteristic roots arc all nonpositive.

Table 4.3: Simple Statistics (Sample Size = 99)

Output in kilowatt hours

Labor share

Capital share

Fuel share

9.0 10.3

0.141 0.059

0.227 0.062

0.631 0.095

Mean Std. deviation

Table 4.4: Random-Effects Estimates

Point

Standard

estimate

error

-0.132 0.318 0.813 0.084 -0.023 -0.060 0.122 -0.099 0.159 -0.0211 -0.0086 0.0297

0.106 0.085 0.094 0.020 0.016 0.015 0.020 0.017 0.023 0.0025 0.0030 0.0037

Parameter

a, az a, Yu Y12 Yl3 Y22 Y23 Y33

YIQ YzQ Y.JQ

i

[ by pooled OLS =

0.00173 -0.000171 -0.00156

t-value

-1.25 3.75 8.69 4.19 -1.46 -3.92 6.19 -5.75 6.90 -8.55 -2.87 7.98

-0.000171 0.00253 -0.00236

-0.00156] -0.00236 0.00391

Table 4.5: Substitution Elasticities

Labor-Capital

Capital-Fuel

Labor-Fuel

0.17

0.27

0.29

307

Multiple-Equation GMM

QUESTIONS FOR REVIEW

1. Derive (4.7.6). Verify that the symmetry of (v;kl is used in the derivation. 2. Derive lhe homogeneity restrictions (4.7.1 0). Hint: For the translog cost function (4.7.1) to be homogeneous of degree one in factor prices, it is necessary and sufficient that, for any scalar A 'I 0, I 2

3

3

3

La; log(A · P;) + L LYJk log(A · P;) log(A · p,) j=I

j=l k=I

I

3

+ aQ log(Q) + 2YQQ(log(Q)) 2 + LYJQ log(A. P;) log(Q) )=1

3

= log(A)

I

3

+ L a1 1og(p1) + 2 L j=l

+ aQ log(Q) +

3

L Y;k 1og(p1) log(p,)

j=l k=l

I 2YQQ(log(Q)) 2 +

3

LYJQ log(p;) log(Q). }=1

3. (Counting restrictions) Using the adding-up restrictions (4.7.12), two of the three equations in (4.7.13), and one of the three equations in (4.7.14), derive the other two equations in (4.7.14). 4. (OLS respects singularity) Consider applying the equation-by-equation OLS to each of the three equations in (4.7.15). Show lhat the equation-by-equation OLS estimates of the coefficients satisfy the adding-up restrictions. Show that the 3 x 3 error covariance matrix calculated from the equation-by-equation OLS residuals is singular. Hint: Let y 1 , y 2 , y 3 ben-dimensional vectors of the three dependent variables (where n is the sample size), and let X be the n x K data matrix of the common K regressors (K = 4 in (4.7.15)). Note that y 1 + y 2 + y 3 = 0. The equation-by-equation OLS coefficient estimates are (X'X)- 1X'y1,j = 1,2,3. Thethreenxl residualvectorsareMy1,j = 1,2,3, where M =In- X(X'X)- 1X'.

308

Chapter 4

PROBLEM SET FOR CHAPTER 4 - - - - - - - - - - - - ANALYTICAL EXERCISES 1.

(Data matrix representation of 3SLS) In the 3SLS model, the set of instruments is the same across equations. Let K be the number of common instruments. Define:

X

~xK)

- [~'1]- [XI . · X1Kl -

.

-

·

.

.

Xn!

XnK

·

~

·

,

[zt]· znm

y

=

(Mnxl)

YI]

[Y~ :

'

e

(Mnx!)

Show the following:

+ e,

• Assumption 4.1 becomes y = Z8 • Sin (4.5.10) becomes S =

~X'X,

1: ®

1

1

• 83sLs = [Z'(i:- ® P)Zr 1Z'(i:- ® P)y, where P = X(X'X)- 1X',

-

"

---~1

• Avar(83sLs) = n · [Z'(:E ~

® P)Zr 1•

~

• Amh in (4.5.13) becomes Amh =

I

;;Z~PZh,

• Cmh in (4.5.14) becomes Cmh = ~Z~PYh· "

• J(83SLS.

_....._ I

s- ) =

"

----1

(y- Z83sLs)'(:E

"

® P)(y- Z83SLs).

309

Multiple-Equation GMM

Hint:

" I""'"' n i=l

I

- L..-X;X;

1' =-XX, n

Sxy

I I = -(IM0X)y, n

Sxz

1 ' = -(IM0X)Z. n

Use formulas (A.11) and (A.13)-(A.15) of Appendix A. 2. (Data matrix representation of SUR, RE, and Pooled OLS) (a) Prove that (4.5.13) reduces to (4.5.13') (on page 280) and (4.5.14) to (4.5.14'), under the SUR assumption (4.5.18). Hint: If P = X(X'X)- 1X', then Amh = ~Z~PZh, C,h = ~Z~Pyh. PZm = Zm if the columns of Zm are taken from the columns of X. (b) Show that, for SUR, 1

1

the estimator= [Z'(i:- 0 I.)Zr 1Z'(i:- 0 I.)y, 1

its Mar= n · [Z'(i:- 0 I.)Zr 1 • (c) Suppose, just for this part of the question, that Z;m = x; for all m. So Z = IM 0 X. Show that SUR reduces to the multivariate regression estimator

Verify that the m-th K -dimensional block of this M K -dimensional vector is (X'X)- 1X'Ym· Now redefine Z as

z (MnxL)

=

z1] [ :

Z~

So Assumption 4.1' (linear equations with common coefficients) can be written as y = Z8 + e, where y and e are as in Analytical Exercise I and 8 is the £-dimensional vector of common coefficients. (d) Show that, for the random-effects estimator, the same formulas you derived in (b) for SUR hold with Z as just redefined as in (*). Hint: The estimator is given in (4.6.8) and its Avar is in (4.6.10). Use formula (A.B) of the Appendix.

-

310

Chapter 4

(e) Show that

~pool
Avar(gpookd oLsl in (4.6.14) = n. (Z'z)- 1 [Z'(f ® I,)Z](Z'Z)-'(f) In the Mn x L matrix Z defined in(*), its rows are ordered first by the equation (m) and then by observation (i). Now order the rows first by observation (i), which amounts to redefining Z as

z

= (MnxL)-

z,l [ :

~n

'

where Z; (i = I, 2, ... , n) is theM x L matrix defined in (4.6.15). Reorder rows of y and e similarly:

How would the formulas for RE and pooled OLS you derived in (d) and (e) change? Hint: Just replace "f ®I," by ''I, ® f". Set M = 2 if you find the translation too hard. 3. (GLS interpretation of SUR andRE) The SUR model consists of Assumptions 4.1-4.5, 4.7, and (4.5.18) (that x; =union of (z,-,, ... , Z;M)). Specialize the model by strengthening Assumption 4.3 (orthogonality conditions) as E(s;m

I Zn, ... , Z;M)

=0

(m

= I, ... , M),

and Assumption 4.2 (ergodic stationarity) as stating that {yn, · · ·• Yw. z,·,, · .. , z;M)

is i.i.d. (a) (Easy) Explain why these are stronger assumptions. {b) Let e (M n x I) and Z (Mn x Lm Lm) be as in Analytical Exercise I above.

311

Multiple-Equation GMM

Show that the specialized model implies E(e I Z) = 0

and

E(ee' I Z) = 1: ®In.

Hint: For the first, what needs to be shown is E(e 1m I Z) = 0. For the latter,

the (m, h) block of E(ee' 1Z) is E(em eh I Z), ann x n matrix. What needs to be shown is that this matrix is O"mhln. (c) (Finite-sample properties of SUR) Suppose 1: is known so that the consistent estimates &mh in the SUR formulas in Proposition 4.6 can be equated with the true value O"mh. (i) Verify that the model satisfies the GLS assumptions of Section 1.6 (which are Assumptions 1.1-1.3 and (1.6.1)) with a sample size of Mn. (ii) Verify that the SUR estimator is the GLS estimator. Hint: The X in Section 1.6 is Z here. Show that a 2 V there is 1: ®ln. Use your answer to 2(b).

-

(iii) Verify that SUR is unbiased in finite samples in that E(~suR 0. (iv) The expression for 1.6. Show that

-

Var(~suR

-

Avar(~suR)

~

I Z) =

I Z) is given in Proposition I. 7 of Section

-

= plim n · Var(~sUR I Z). n~oo

Hint: With E(ee'

I Z)

= 1: ®In, the matrix Var(3suR

of a matrix whose (m, h) block is amhz~zh. Z~Zh

I Z) is the inverse

=I:, z 1mz;h.

(d) (Finite-sample properties of RE) Now impose on the model the condition that the coefficient vector is the same across equations so that Z is as in (*) of Analytical Exercise 2. As in (c), suppose 1: is known so that the consistent estimates &mh in theRE formulas in Proposition 4.7 can be equated with the true value amh· (i) Verify that this model with common coefficients satisfies the GLS assumptions of Section 1.6. In particular, Assumption 1.1 can be written as y =

Z~

+ e, where y and e are as in Analytical Exercise 2.

(ii) Verify that the random-effects estimator is the GLS estimator. (iii) Verify that the random-effects estimator is unbiased in finite samples.

312

Chapter 4

(iv) Show that

Avar(~RE) = plim n · Var(~RE I Z). n~oo

Hint: Var(~RE

I Z) is the inverse of Lm Lh umhz;.zh.

4. (Standard errors from equation-by-equation vs. multiple-equation GMM) We have shown in Section 4.4 that the Avar of the efficient equation-by-equation GMM estimator is no less (in matrix sense) than the Avar of the efficient multiple-equation GMM estimators. In this exercise we show that the same holds for Avar. LetS and 8,., be as in (4.3.2) and (4.2.2), respectively. (a) (Trivial) Verify that, under appropriate assumptions, (S~,S- 1 S.,)- 1 is consistent for the asymptotic variance of the efficient multiple-equation GMM estimator. What are those appropriate assumptions? (b) Let ~m be the efficient single-equation GMM estimator of ~m and ~ be the - stacked vector formed from (~ 1 , ..• , ~M ). ~ is the efficient equation-byequation GMM estimator of~. Show that Avar(~ml is consistently estimated by

where

(c) Verify that(*) is the (m, m) block of

where W is the block diagonal matrix whose m-th diagonal block is

Wmm.

(d) Proposition 3.5 is an algebraic result applicable not just to population moments, so it implies that

Taking this for granted, show the following: If the same residuals are used to calculate W for the equation-by-equation GMM and S for the multipleequation GMM, then, for each coefficient, the standard error from the

313

Multiple-Equation GMM

efficient multiple-equation GMM is no larger than that from the efficient equation-by-equation GMM. Hint: The standard errors are the square roots of the diagonal elements of the estimated asymptotic variance divided by the sample size. (e) Does the result you proved in (d) hold true if the model is misspecified in any way? Hint: What you have proved is an algebraic result. (f) Go through steps (a)-(d) for the FIVE and equation-by-equation 2SLS. Hint: It's just a matter of redefining Sand the rn-th diagonal block, W mm, of W. TheWmm for 2SLS is the inverse of n

-I n

L ,,

Eim .

i=l

n

-I n

L

'

X·unXim

i=l

-

and the S for FIVE is (4.5.3). 5. (Identification and cross-equation restrictions) Consider the wage equation for 1969 and 1980: LW69; = LW80; =

/3 0 + /3 1· S69; + f:h. · IQ, + Ei!, f3o + f3I · S80; + /32 · IQ; + E;2.

The orthogonality conditions are E(t:n) = 0, E(t:; 2 ) = 0, E(MED, e;J) = 0, E(MED, &; 2 ) = 0. Assume we have a random sample on (LW69, LW80, S69, S80, IQ, MED).

(a) Is the LW69 equation identified in isolation? Hint: Check the order condition for identification for the LW69 equation. (b) Write the four orthogonality conditions in tenns of (/30 , /3 1 , /3 2 ) as a system of equations linear in /3's. How many equations are there? State the rank condition for identification in tenns of relevant cross moments. (This example shows that even if each equation of the system is not identified in isolation, the system can be identified thanks to the cross-equation restriction that /3's are the same across equations.) (c) Show that the system cannot be identified if IQ and MED are uncorrelated. 6. (Optional) Prove Proposition 4.1. Hint: Generalize the proof of Proposition 3.2.

314

Chapter 4

7. (Numerical equivalence of 2SLS and 3SLS) In the system of equations Yim = z;m~m + o;m (i = I, ... , n; m = I, ... , M), suppose all equations share the same set of instruments (so x1m = x1). (In what follows, assume all the conditions for the 3SLS estimator are satisfied.) (a) When each equation has the same RHS variables (so Z;m = z 1), the 2SLS and 3SLS estimators are numerically the same. Prove this. Hint: Let B ~ L; x,z;. The 3SLS estimator is (4.2.3) with S., = IM 0 B, W= f-' 0 s;x1 where Sxx = ~ L; X;x;.

=

(b) If the errors are not conditionally homoskedastic, does (a) remain true for the single-equation and multiple-equation GMM estimators? Hint: The answer is no. Why? Can you write §'-I as a Kronecker product without conditional homoskedasticy? 8. (SUR is more than assuming predetermined regressors in each equation) Con-

sider estimating a system of M equations: Yim = z;m~m + o;m (i = I, ... , n; m = I, ... , M). The orthogonality conditions are E(z;m · o;m) = 0 (m = 1,2, ... ,M). (a) Does the efficient multiple-equation GMM estimator ~(S- 1 ) reduce to the equation-by-equation OLS? Does your conclusion change if the errors are conditionally homoskedastic? Hint: What's the size of S.,? S?

-

(b) Explain why the efficient GMM you derived in (a) under conditional homoskedasticity differs from SUR. Hint: The difference is in the orthogonality conditions. 9. (Optional, peril of joint estimation) Taken from Greene (1997, p. 706). Consider the two-equation SUR model: Yii

= fhxn + Eii,

Y12 = fhxt2

+ {!,x,, + 6;2.

For simplicity, assume that E(o~1 ), E(oil o12 ), and E(o~2 ) are known. Now suppose you apply SUR, but erroneously omit x 13 from the second equation. What effect does this have on the consistency of the estimator of {! 1? Hint: Let x 1 , x2 , x3 , Y~o y2 , e 1 , and e 2 ben-dimensional vectors of respective variables and

y [y'] , e =

Y2

= ["'] , X= "2

[X'0

O] ,

X2

315

Multiple-Equation GMM

(So, for example, X is 2n x 2 and

fJ

is 2 x I.) Show that:

(a) The two-equation system can be written as y = X{J (b) The SUR estimator of

fJ

+ {3 3 · d +e.

when x; 3 is inadvertently left out is

(c) Use formulas (A.6) and (A.7) of the Appendix to show

PsuR

=

fJ + {3, ·a+ b,

where ]

-1

n

ail_'""" x~

n

a= a

211

-

~ i==l

11

~ ~Xi2Xii

n.1=1

-1

h=

(d) Argue that plim b = 0. Is plim a = 0? 10. (Optional, adding just-identified equations does not increase efficiency) Consider the 3SLS two-equation model where the second equation is just identified (so L 2 = K). Let ~I.JsLs be the 3SLS estimator of the coefficients of the first equation and ~ 1 • 2 sLs be the 2SLS estimator of the same coefficients.

-

(a) Write down Avar(~J.2sLs) in terms of (4.5.16).

all

and A 11 , which is defined in

316

Chapter 4

(b) Show that Avar(~I,3SLsl = Avar(~I.2SLsl· Hint: Use the partitioned inverse formula (see formula (A.10) of the Appendix) to write the upper-left (L 1x L 1) block of Avar(~I. sLsl as G- 1 where

3

But by (4.5.16), 1 A 12 A22 A 21 = E(z"X,)[E(x,x;Jr' E(x,z;,)

[E(z 12 x;)[E(x,x;W' E(x,z;,Jr' E(z,,x;)[E(x,x;Jr' E(x,z;,) (since #z.; 2 (= L 2 ) = #x; (= K))

= E(z"x)[E(x;x;Jr' E(x;z;,l

=A". So

G= [ a 11 -

(a

12

· a a22

(since a

11

21 ) ] A11

=A11 all

(aa"·a") 12

-

= 1/all)·

11. (Linear combinations of orthogonality conditions)

For the multiple-equation model of Proposition 4.7, derive the efficient GMM estimator that exploits the following orthogonality conditions: (1)

Hint: The sample analogue of the left-hand side the orthogonality conditions is

There are as many orthogonality conditions as there are coefficients. The efficient GMM estimator is the IV estimator that salves g,. (a) = 0.

317

Multiple·Equation GMM

EMPIRICAL EXERCISES

In data file GREENE.ASC, data are provided on the following variables. Column I: Column 2: Column 3: Column 4: Column 5: Column 6: Column 7: Column 8:

finn ID in Nerlove's 1955 sample costs in 1970 in millions of dollars output in millions of kilowatt hours price of labor price of capital price of fuel labor's cost share capital's cost share

The sample has 99 observations. Fuel's cost share can be calculated as one minus the sum of labor and capital cost shares. These data were used in Christensen and Greene (1976). In the text, the third equation (for fuel share) of the system was dropped in the random-effects estimation. To verify the numerical invariance, drop the second equation (for capital share) and eliminate (Yl2· y,, Yn) when imposing homogeneity. This produces the two-equation system:

+ Y11 log(p, I P2l + Yl3log(p,j P2l + y 1Qlog(Q) + e 1 { s, =a,+ Y3l log(pd P2) + y,,Iog(p,j P2) + Y3Q log(Q) + eJ s, = a,

(labor share), (fuel share).

Call this the unconstrained system. The constrained system imposes symmetry Yl3 = y 31 on the unconstrained system. (a) Verify that the simple statistics of the sample shown in the text can be duplicated. (b) (Numerical invariance) Apply the random-effects estimation (i.e., the multivariate regression subject to the cross-equation restriction) to the constrained system. Can you duplicate the coefficient estimates in the text? You should first apply equation-by-equation OLS to the unconstrained system to calculate i'. (Note on computation: Programming joint estimation is much easier with canned programs such as TSP and RATS. However, those canned programs will not print out Sargan's statistic for the SUR.)

Gauss Tip: The random-effects estimator could be calculated by operating on data matrices, but we recommend the use of do loops. Use (4.6.8') on page 293.

318

Chapter 4

RATS Tip: Use the NLSYS proc, which is designed for handling nonlinear equations but which can be used for linear equations. It can also be used for calculating f', as done in the following codes:

* Estimation with capital share equation dropped

*

Define parameters nonlin al af gll glf gfl gff gly gfy * Define two eq. system frml labor sl = al+gll*log(pllpk)+glf*log(pflpk) +gly*log(kwh) frml fuel sf = af+gfl*log(pllpk)+gff*log(pflpk) +gfy*log(kwh) * Set starting values for the parameters compute al=af=gll=glf=gfl=gff=gly=gfy=O * Now do multivariate regression w/o cross-eq.

* restriction nlsystem(outsigma=sigmahat) I labor fuel

Here, sigmaha t is the 2 x 2 matrix f'. The random-effects estimation using this f' is done by the following codes.

* Now impose cross-eq. restriction on fuel share * eq. (set gfl=glf) * Define parameters. nonlin al af gll glf gff gly gfy * Impose symmetry. frml fuelb sf = af+glf*log(pllpk)+gff*log(pflpk) +gfy*log(kwh) * Multivariate regression with covariance matrix * calculated from equation-by-equation OLS

* residuals nlsystem(isigma=sigmahat) I labor fuelb

TSP Tip: An example of TSP codes that accomplish the same thing as the RATS codes above is: ?

Estimate the system with capital equation

dropped. frml labor sl=al+gll*log(pllpk)+glf*log(pflpk) +gly*log (kwh); ?

Multiple-Equation GMM

319

frml fuel sf=af+gfl*log(pl/pk)+gff*log(pf/pk) +gfy*log (kwh); ? Set parameter values to zero. param al af gll glf gfl gff gly gfy; ? Multivariate regression without symmetry sur labor fuel; ? Estimate error covariance matrix from residuals mform sigmahat=@covu; ?

? Do multivariate regression with cross-equation ? restriction. ? Impose symmetry. frml fuel sf=af+glf*log(pl/pk)+gff*log(pf/pk) +gfy* log I kwh I ; ? Multivariate regression subject to symmetry lsq(maxitw=O,wname=sigmahat) labor fuel;

This gives the random-effects estimate of the seven free parameters,

As explained in the text, use the adding-up, homogeneity, and symmetry restrictions to calculate the point estimates and their standard errors of the remaining eight parameters. The TSP command ANALYZ should be useful here. {c) (Optional, for Gauss users) Calculate Sargan's statistic. Hint: Use the formula in Proposition 4.7. The x, vector should be: a constant, log(pJ/p 2 ), log(p 3 ( p 2 ), log(Q). For f*. use the one obtained from the unconstrained system. Sxy

= ~ L7= 1 Yi ®Xi and Sxz = *L7= 1 Zi ®Xi. The statistic should

be 0.63313 with a p-value of 0.42621. (d) (Testing symmetry) Given that Sargan's statistic is 0.63313, perform the following hypothesis testing. Take the maintained hypothesis to be the unconstrained system. The null hypothesis is the symmetry restriction that y13 = y31 . Hint: By Proposition 3.8, the difference in the Sargan statistics with and without symmetry is asymptotically chi-squared. What is Sargan's statistic for the unconstrained system? (e) (Wald test of symmetry) Test the same null hypothesis by the Wald principle. So you test the symmetry restriction in the unconstrained system. Verify that

320

Chapter 4

the test statistic is numerically equal to Sargan's statistic for the constrained system you estimated in (b).

TSP Tip: Just insert the line analyz wald wald=glf-gfl;

right after the codes for the unconstrained estimation. (f) Using the parameter estimates you obtained in (b), calculate the elasticity of

substitution between labor and capital averaged over the 99 firms. Can you duplicate the result in the text?

ANSWERS TO SELECTED QUESTIONS _ _ _ _ _ _ _ _ __ ANALYTICAL EXERCISES

I Z) E(S;m I z,,' ... 'ZJM' Z2J, ... Z2M' ..• 'ZnJ, . .. 'ZnM) (by definition of Z) E(s;m I ZiJ, ... , z;M) (by the i.i.d. assumption)

3b. E(s;m

=

=

= 0

(by the strengthened orthogonality conditions).

The (i, j) element of then x n matrix E(eme~ I Z) is E(s;m i.i.d. assumption, this is zero for j oft i. For j = i, E(S;m S;h

=

Sjh

I Z). By the

I Z)

E(Bim E'ih

= E(s;m e;h

I Z11,

· · ·, ZJM, Z21, · · · Z2M• · · ·, Zni, · · ·, ZnM)

I Zil, ... , z;M) (by the i.i.d. assumption).

Since X;m = X; and X; is the union of (z", ... , z;M) in the SUR model, the conditional homoskedasticity assumption, Assumption 4.7, states that E(s;m
5b.

I I E(MED) E(MED)

E(S69) E(S80) E(S69·MED) E(S80-MED)

E(JQ) E(JQ) E(JQ·MED) E(JQ·MED)

[~:]

E(LW69) =

E(LW80) E(LW69 · MED) E(LW80 · MED)

321

Multiple-Equation GMM

The condition for the system to be identified is that the 4 x 3 coefficient matrix is of full column rank. 5c. If /Q and MED are uncorrelated, then E(/Q · MED) = E(JQ) · E(MED) and the third column of the coefficient matrix is E(JQ) times the first column. So the matrix cannot be of full column rank.

where ]

n

(I)=- I>im fih,

n ,

i=!

,I~ n i=l

(2) = -(~m -~m)- L._,Zim ·f;h,

(4) =

(~m- ~m)'C i:>mz:h)(~h- ~h) 1=1

As usual, under Assumption 4.1 and 4.2, (I) -->p p 0. Regarding (2), by Cauchy-Schwartz,

where Zimj is the j-th element of Zim· So E(Z;m · e,h) is finite and (2) ' because ~m - ~m -->p 0. Similarly, (3) -->p 0.

-->p

0

References Berndt, E., 1991, The Practice of Econometrics, New York: Addison-Wesley. Berndt, E., and D. Wood, 1975, "Technology, Prices, and the Derived Demand for Energy,'' Review of Economics and Statistics, 57, 259-268. Brundy, J., and D. Jorgenson, 1971, "Efficient Estimation of Simultaneous Systems by Instrumental Variables,'' Review of Economics and Statistics, 53, 207-224.

322

Chapter 4

Christensen, L., and W. Greene, 1976, "Economies of Scale in U.S. Electric Power Generation," Journal of Political Economy, 84, 655--676. Greene, W., 1997, Econometric Analysis (3d ed.), New Jersey: Prentice Hall. Jorgenson, D., 1986, "Econometric Methods for Modeling Producer Behavior," in Z. Griliches and M. lntriligator (eds.), Handbook of Econometrics, Volume 111, Amsterdam: Elsevier Science Publishers, 1841-1915. Uzawa, H., 1962, "Production Function with Constant Elasticities of Substitution," Review of Eco· nomic Studies, 29, 291-299. Varian, H., 1992, Microeconomic Analysis (3d eel.), New York: Norton. Zellner, A., 1962, "An Efficient Method of Estimating Seemingly Unrelated Regressions and Tests for Aggregation Bias," Journal of the American Statistical Association, 63, 502-511. Zellner, A., and H. Theil, 1%2. "Three-Stage Least Squares: Simultaneous Estimation of Simultaneous Equations," Econometrica, 30, 54-78.

CHAPTER 5

Panel Data

ABSTRACT

A longitudinal or panel data set has multiple observations for a number of cross~ section units. As its name suggests, a panel has two dimensions, one for crosssection units and the other for observations. The latter dimension is usually time, but there are exceptions. A cross-section sample of twins, for example, is a panel where the second dimension is siblings. The distinction between cross-section units and observations, the two terms used interchangeably until now, should he kept in mind throughout this chapter. When it is apt to do so, we will use the term "group" for the cross-section unit. In many applications, we encounter the error-components modeL In such models, the error term of the equation has a component that is common to all observations (time-invariant if the second dimension of the panel is time) but that may not be orthogonal to regressors. There is available a technique, ca11ed the "fixed-effects estimator," which allows us to estimate the model without the help of instrumental variables. It and the random-effects estimator of the previous chapter are the two staple techniques in panel econometrics. The optional section of this chapter, Section 5.3, shows how to modify the fixedeffects esti.mator when the number of observations varies across groups. The empirical section takes up growth empirics, a topic that has recently attracted a great deal of attention. As it turns out, the fixed-effects and other estimators covered in this chapter are particular GMM estimators on a suitable transformation of the error-components model. The GMM interpretation of the fixed-effects estimator is pursued in an analytical exercise to this chapter, where you will be asked to find an estimator that is asymptotically more efficient than the fixed-effects estimator.

324

5.1

Chapter 5

The Error-Components Model

Our point of departure is the multiple-equation model with common coefficients of Proposition 4.7. To simplify the discussion, we will assume that the sample is a random sample. This assumption is satisfied in most popular panels, such as the Panel Study of Income Dynamics (PSID) and the National Longitudinal Survey (NLS). In those panels, a large number-hundreds or even thousands -of crosssection units are randomly drawn from the population. For this case of random samples, the assumptions of Proposition 4. 7 can be restated as (see (4.6.15) for the definition of y1 (M x 1), z, (M x L), e 1 (M x 1)):

o+ e1

(system of M linear equations) y1 = Z1

(5.1.1)

(random sample) {y1, Z;} is i.i.d.

(5.1.2)

("SUR assumption") E(z;m · <;h) = 0 i.e., E(e 1 0

X;)

form, h = I, 2, ... , M,

(5.1.3)

= 0 where x1 = union of (z;~o ... , z,M ),

(identification) E(Z; 0 x1) is of full column rank, 1 (conditional homoskedasticity) E(e,e;

(5.1.4)

I X;)= E(e 1e) = 1:

(nonsingularity of E(g;g;)) E(g,·g) is nonsingular, where g;

= e, 0

(5.1.5)

x,. (5.1.6)

As noted in Section 4.5, since E(g,g,) = 1: 0 E(x;x;) under conditional homoskedasticity, (5.1.6) is equivalent to the condition 1: and E(x1x) are nonsingular.

(5.1.6')

Under these assumptions, the random-effects estimator &R£, defined in (4.6.8') on page 293, is the efficient GMM estimator. Error Components

This model, very frequently used for panel data, is one where the unobservable error tenn

Sim

is assumed to consist of two components:

(5.1.7)

1The l:xz in Assumption 4.4 1can be written as E(Zi ®Xi); see Review Question 4 to Section 4.6.

325

Panel Data

The first unobservable component a;, without them subscript and hence common to all equations, is called the individual effect, the individual heterogeneity, or the fixed effect. If we define

(M'I/~1) =

ryil ] [

:

'

T/iM

the matrix representation of the M -equation system is

or y 1 =Z 1 ~+1M·a 1 +q 1

(i=1,2, ... ,n),

(5.1.1')

where 1M is the M-dimensional vector of ones. The orthogonality conditions (5.1.3) are satisfied if the regressors of the system are orthogonal to both error components, that is, if E(z;m·a;)=O

form=1,2, ... ,M,

(5.1.8a)

for m, h = I, 2, ... , M.

(5.1.8b)

and E(z;m · ry;h) = 0

However, in many applications that utilize panel data, (5.1.8a) is not a reasonable assumption to make. This is because the fixed effect represents some permanent characteristics of the individual economic unit, as illustrated by the following examples. Example 5.1 (production function with firm heterogeneity): In the loglinear production function (3.2.13) of Section 3.2, suppose the firm's efficiency u; stays constant over time. Then the equation for year m is

where Q;m is the output of firm i in year m, L;m is labor input in year m, and v1m is the technology shock in year m. The equation can be written as (5.1.1') by setting

326

Chapter 5

Also, X;

= (I, log(Lil ), ... , Jog(L;M ))'.

Under perfect competition, the individual effect u, would be positively correlated with labor input because efficient firms, whose u; is higher than its mean value, would hire more labor to expand (see (3.2.14)). If v;m represents unexpected shocks that are unforeseen by the firm when input choices are made, it is reasonable to assume that v;m is uncorrelated with the regressors. Example 5.2 (wage equation): For simplicity, drop experience from the wage equation of Example 4.2 but suppose data for 1982, in addition to 1969 and 1980, are available. Suppose also that the coefficient of schooling and that of IQ remain the same over time but that the intercept is time-dependent (due, for example, to business-cycle effects on wages). With the error decomposition (5.1.7), the three-equation system is

+ f3S69, + yiQ, + ct; + ry", 2 + f3S80; + yiQ, + ct; + ry;2, 4>3 + f3S82; + yiQ, + ct; + ry;3.

LW69, = 1 LW80; = [

LW82; =

As noted at the end of Section 4.6, even if a subset of the coefficients differs across equations, the system can be written as multiple equations with common coefficients. In this example, this is accomplished with

z, =

r:;:J z, 3

=

r~ ~ ~ ~~~: ~~:J. ~· 0 0

S82,

= (¢,. 4>2. 4>3. {3. y). (5.1.9)

IQ,

Also, x; =(I, S69;, S80;, S82;, IQ;)'.

The error term includes the wage determinants not accounted for by schooling and IQ. It may be reasonable to assume that they are divided between the pe1manent individual traits (denoted a;), which would affect the individual's choice of schooling, and other factors (denoted ry;m) uncorrelated with the regressors, such as measurement error in the log wage rate.

327

Panel Data

Group Means

Fortunately, there is available a popular estimator, called the fixed-effects estimator, that is robust to the failure of (5.1.8a), i.e., that is consistent even when the regressors are not orthogonal to the fixed effect a;. To anticipate the next section's discussion, the estimator is applied to an M-equation system transformed from the original system (5.1.1 '). The matrix used for the transformation is the annihilator associated with 1M:

Q

= IM-

lM(l~lMl-' ~~

(MxM)

I ' = IM- M 1M 1M

1/M = IM-

... (5.1.10)

: [

1/M

What this matrix does is to extract deviations from group means. For example, multiplying theM -dimensional vector y; from left by Q results in

y; = Qy;

Yn- Y; : .

=

[

l

= y; - IM.

y;,

(5.1.11)

YiM- Yi

where

is the group mean for the dependent variable. Note that this mean is over m, not over i; it is specific to each group (i). Similarly for the regressors, each column of QZ; is the vector of deviations for the corresponding column of Z;. A Reparameterlzation

Not surprisingly, however, robustness to the failure of (5.1.8a) comes at a price: some of the parameters of the model may no longer be identifiable after the transformation by Q. To provide two examples, • The obvious case is the coefficient of a variable common to all equations. For example, IQ; in Example 5.2 is a common regressor (see (5.1.9)), so the column of Z; corresponding to IQ; (the fifth column) is IM · IQ; and the fifth column of QZ; is a vector of zeros. Consequently, the IQ coefficient, y, cannot

328

Chapter 5

be identified after the transformation. To separate the coefficient of common regressors from the rest, let b; be the vector of common regressors and write the M x L matrix of regressors, z,, as (5.1.12) and partition the coefficient vector

~

accordingly:

The coefficient vector y cannot be identified after transformation. • It may also be the case that, in addition to y, some of the coefficients in {3 are unidentifiable and a reparameterization is needed. Consider Example 5.2 where each equation has a different intercept. One of the intercepts needs to be dropped from F; and included in b,, because a common intercept can be created by taking a linear combination of the equation-specific intercepts. One reparameterization is to define

F; =

[~

0 S69,]

sso, ' b; =

(I' JQ)',

(5.1.13)

0 S82,

so

Zi=

[~

JQ,l JQ; , {3 = (
0 S69;

sso, 0 S82·'

I

,_- ¢3, fJ)', Y = (3, y)'.

IQ, (5.1.14)

More generally, the identification condition for fixed-effects estimation is that F; and b, are defined so that E(QF, ® x;) is of full column rank where x, = union of (z, 1 ,

... , Z;M ).

(5. J.l5)

Why is this an identification condition?

Because (as you will see more clearly in Analytical Exercise 2), the fixed-effects estimator is a (specialization of the) random-effects estimator applied to the transformed regression of Qy on QF. (5.1.15) is just the adaptation of the identification condition (5.1.4) to the transformed regression.

329

Panel Data

With z, thus divided between F; and b;, the system of M-equations (5. I. I') on page 325 can be rewritten as y;=F;{J+IM·b;y+IM·a;+~,

Yim = f/mfJ

+ b;y +a;+ ~im

(i= 1,2,,.,n), or

(i = I, 2, ... , n; m = I, ... , M),

(5.1.1")

where f(m is them-throw ofF,. The error-components model consists of assumptions (5.1.1") (M linear equations), (5.1.2) (random samples), (5.1.8) (SUR assumption with two error components), (5.1.4) (identification), (5.1.5) (conditional homoskedasticity), (5.1.6) (nonsingular E(g,g;)), and (5.1.15) (fixed-effects identification). We include the additional identification condition (5.1.15) for fixedeffects estimation, so that both the random-effects estimator and the fixed-effects estimator are well-defined for the same model. QUESTIONS FOR REVIEW

1. (Verification of Proposition 4.7 assumptions) Verify that the conditions of Proposition 4.7 are satisfied when (5.1.1)-(5.1.6) hold. Show that the identification condition (5.1.4) is satisfied when E(x1X,) is nonsingular. Hint: Z;m is a subset of x,. Thus, (5.1.4) is redundant. 2. (Nonuniqueness of reparameterization) (5.1.13) is not the only reparameterization for Example 5 .2. If F 1 is defined as

569;]

0 0 F1 = I 0 580, , [ 0 I 582 1 how should b;

and~

be defined?

3. Without reparameterization, the F; for Example 5.2 is

569,]

I 0 0 F; = 0 I 0 5801 [ 0 0 I 582 1

with b1 = IQ,. Verify that (5.1.15) is not satisfied.

330

Chapter 5

5.2

The Fixed-Effects Estimator

The Formula As already mentioned, the fixed-effects estimator is defined for the transformed system. Multiplying both sides of (5.1.1 ") on page 329 from the left by Q and making use of the fact that QlM = 0, we obtain

Qy, = QF;{J or

-

y,

=F;{J+q,

+ Qq,

(i = 1,2, ... ,n),

(5.2.1)

where

y, = Qy,, F, = QF,, (Mxl)

(Mx#fl)

= Qq,.

q,

(5.2.2)

(Mxl)

Thus, the fixed effect a; and common regressors drop out in the transformed equations. Now form the pooled sample of transformed y, and F; as

YI] [

.. y- _ = (Mnxl)

jn

[F1]

F- ·. . =

'

(Mnx#fl)

~n

(5.2.3)

-

The fixed-effects estimator of {3, denoted {JFE, is the pooled OLS estimator, that is, the OLS estimator applied to the pooled sample (y, F) of size M n. Therefore,

1

I "

= [~

1

~(QF,)'(QF;)

L F;QQF,) n . fl

= (-

1=1

=

I "

-1

-1

]- 1 "

~ ~(QF;)'(Qy;)

1 " F;QQy, n 1=1 .

L

1 "

(~ LF;QF;) ~ LF;Qy, (sinceQissymmetricandidempotent). i=l

i=l

(5.2.4)

331

Panel Data

Substituting (5.2.1) into (5.2.4), the sampling error can be obtained as

(5.2.5)

Because it is based on deviations from group means, the fixed-effects estimator is also known as the within estimator or the covariance estimator. Another, and perhaps more popular, derivation is to apply OLS to levels, i.e., not in deviations but with group (i) specific dummies added to the list of regressors. For this reason the estimator is also called the least squares dummy variables (LSDV) estimator. The LSDV interpretation allows us to see what additional assumptions are needed for the error-components model in developing a finite-sample theory for the estimator. This is left as Analytical Exercise 1. Large-Sample Properties As you will prove in Analytical Exercise 2, the fixed-effects estimator can also be written as a GMM estimator, which lends itself to a straightforward development of large-sample theory for the fixed-effects estimator summarized in Proposition 5.1 below. Proving the proposition, however, can be accomplished more easily by inspecting the expression (5.2.5) for sampling error. Proposition 5.1 (large-sample properties of the fixed-effects estimator): Consider the error-components model (consisting of (5.1.1") on page 329, (5.1.2), (5.1.8), (5.1.4)-(5.1.6), and (5.1.15)), but relax the SUR assumption (5.1.8b) by requiring only that E(f1m · ry1h) = 0 where

forallm, h (= 1, 2, ... , M),

r;"' is them-throw ofF;. Define y1, F1, and ij 1 by (5.2.2).

(5.2.6)

Then:

(a) the fixed-effects estimator (5.2.4) is consistent and asymptotically normal with (5.2.7)

(b) This asymptotic variance is consistently estimated by (5.2.8)

332

Chapter 5

where V is the sample cross-moment matrix of transfonned residuals associated with the fixed-effects estimator: n

=-nI~ L._.,(Y;- F,,BFE)(y, - F,,BFE)'.

-V

(MxM)

(5.2.9)

i=l

The only nontrivial part of the proof is to show: (I) E(ii;ii,) is nonsingular (and hence invertible),

(2) E(F;Qq 1) = 0 (which is needed for consistency), and (3) E(F;ij,ij;F,) (which is the Avar of~ L:7~ 1 ii;ij,) = E[ii; E(ij,ij;)ii,].

The rest of the proof is an easy application of Kolmogorov's LLN and the LindebergLevy CLT. Proving that (5.1.15) implies (I) is left as an optional analytical exercise. To prove (2), use formula (4.6.16b) and rewrite the expectation as M

E(F;Qq,) = E(I;

M

L qmhfim · ry;h)

= (m, h) element of Q)

(qmh

m=l h=I

M

=

M

L L qmh E(f;m · ry;h).

(5.2.10)

m=l h=l

This is zero by the orthogonality conditions (5.2.6). Proof of (3) is as follows. First note that

(This follows by the Law of Total Expectations and by the fact that x, [= union of (z, 1, ... , z,M )] includes the elements ofF, and F 1 is a function ofF,.) But E(ij 1ij; I x,) equals the unconditional mean E(ij 1 ij) because E(ij,ij; I X;)= E(e,e; I X;) = QE(e 1e;

(since

e; = Qe; =

Q(lM. a,+ q,) = Qq, = ij,)

I X;)Q

= Q E(e;e;)Q

(by (5.1.5))

= E(e,e) = E(ij 1ij;)

(since e1 = ij,).

(5.2.11)

333

Panel Data

A glance at the expression for the asymptotic variance should make you suspect that there must be some other estimator that is asymptotically more efficient, because the asymptotic variance of an efficient estimator can be written as the inverse of a matrix. Indeed, as you will prove in Analytical Exercise 2, there is available an estimator that is asymptotically more efficient than the fixed-effects estimator. Digression: When q1 Is Spherical

The usual error-components model assumes, additionally, that the second error component q1 is spherical: (5.2.12) If, as in many panel data sets, m is time, this assumption is often referred to as the assumption of no "serial correlation." The lack of serial correlation in this sense should not be confused with the condition that q 1 be uncorrelated with qj fori -I j. The latter is a consequence of our maintained assumption that the sample is i.i.d. Substituting this into (5.2.7) and noting that QF; = F;, we find that the expression for the asymptotic variance simplifies to

- =a"2· [E(F-,-F;) J-1 . (5.2.13) If there is a consistent estimate, &g. of ag, then the asymptotic variance is consisAvar(pFE)

1

tently estimated by (5.2.14) which equals n times&;;.
SSR

= (y- FPFEl'(y- FPFEl = L(Y;- F,pFE)'(y;- F,pFE).

(5.2.15)

i=l

In the pooled OLS on deviations from group means, there are #P coefficients to be estimated. The usual OLS estimate of the error variance is SSR Mn-#P

(5.2.16)

334

Chapter 5

This, however, is not consistent for a;. You will be asked to show in Analytical Exercise 3 that a consistent estimator is

SSR Mn-n-#fl

(5.2.17)

(For that matter, SSR/(Mn- n) is also consistent.) The reason n has to be subtracted from the denominator of (5.2.16) is that M transformed equations are not linearly independent; for both sides of the M transformed equations, the sum is always zero, as you can see by multiplying both sides of (5.2.1) from left by and making use of t;.,Q = 0'. The effective sample size of the pooled sample is Mn- n, not Mn. Use of (5.2.16) instead of (5.2.17) is a very common mistake, which results in an underestimation of standard errors and an overestimation of t-values. For example, if M = 3, n = 1,000, and #fl = 5, the t-values will be overestimated by about 23% (=square root of 299511995 minus one).

t;.,

Random Effects versus Fixed Effects

Now get back to the general case of no restriction on E(q,q;), so the error term can have an arbitrary pattern of "serial correlation" across m. Comparing (5.2.6) with (5.1.8), we see that the orthogonality conditions not used by the fixed-effects estimator are E(f;m ·a;)= 0 for all m, E(b; ·a;)= 0, E(b; · ry;m) = 0 for all m.

(5.2.18)

That is, the fixed-effects estimator is robust to the failure of (5.2.18). Let PRE be the elements of 8RE (the random-effects estimator applied to the original M-equation system (5.1.1") on page 329) that correspond to fl: (5.2.19) Also, let Avar(/iRE) be the asymptotic variance of PRE· It is the leading submatrix of Avar(8REl (given by (4.6.9') on page 293) corresponding to fl. The random-effects estimator PRE is an efficient estimator while fiFE is consistent but not efficient. However, if (5.2.18) is violated, flRE is no longer guaranteed to be consistent, while flFE remains consistent. Thus, a natural test of (5.2.18) is to consider the difference between the two estimators,

-

-

(5.2.20)

335

Panel Data

It is easy to prove that ..(ii q is asymptotically normal. The Hausman principle

(originally developed for ML estimators but proved to be applicable to GMM estimators in Analytical Exercise 9 to Chapter 3) implies that Avar@ = Avar(pFE)- Avar(PREl·

(5.2.21)

(So, there is no need to incorporate the asymptotic covariance between PFE and {JRE because it is zero.) By Proposition 4.7, the leading submatrix of (4.6.9') on page 293 provides a consistent estimator of Avar(PREl· By Proposition 5.1, Avar(PFEl is consistently estimated by (5.2.8). A consistent estimator of Avar(q), therefore, is the difference. Write this as Avar(q). It is proved in the appendix that (i) Avar(q) is nonsingular (and hence positive defirtite) and (ii) the Hausman statistic defined below is guaranteed to be nonnegative for any sample {y;, Z; }. To

-

-

summanze,

Proposition 5.2 (Hausman specification test): Suppose the assumptions of the error-components model ((5.1.1") (on page 329), (5.1.2), (5.1.8), (5.1.4)-(5.1.6),

-

and (5.1.15)) hold. Define q and Avar(q) as just described. Then the Hausman statistic

(5.2.22) is distributed asymptotically chi-square with #{J degrees of freedom. It is nonnegative in finite samples.

This test is a specification test because it can detect a failure of (5.2.18) which is part of the maintained assumptions of the error-components model. More specifically, consider a sequence of local alternatives that satisfy all the assumptions of the error-components model except (5.2.18) 2 With some additional technical assumptions, it can be shown that the Hausman statistic converges to a noncentral x2 distribution with a noncentrality parameter along those sequences of local alternatives. 3 That is, the Hausman statistic has power against local alternatives under which the random-effects estimator is inconsistent. Relaxing Conditional Homoskedasticity It is straightforward to derive the large sample distribution of the fixed-effects estimator without conditional homoskedasticity. As will be seen in the next section, 2The notion of local alternatives was introduced in Section 2.4. 3 See Newey (1985). The required technical assumption is Newey's Assumption 5.

336

Chapter 5

extension of the large sample results to cover unbalanced panels is actually easier without conditional homoskedasticity.

Proposition 5.3 (fixed-effects estimator without conditional homoskedasticity ): Drop conditional homoskedasticity (5.1.5) from the hypothesis of Proposition 5.1. Define y;, F;, and ij, by (5.2.2). Then: (a) the fixed-effects estimator (5.2.4) is consistent and asymptotically normal with

(5.2.23) (b) If, furthermore, some appropriate finite fourth-moment assumption (which is an adaptation of Assumption 4.6) is satisfied, then the asymptotic variance is consistendy estimated by

---=

Avar(/JFE) =

where 'ij,

(I- L.., ~-,-)-'(1 ~ -, __ ,-)(I~-,-)-' F,F; ;; L_.,(F,q,q,F,) - L.., F,F, , n 1=1 .

= y;- F,/iFE (this

. 1=1

n 1=1 .

(5.2.24)

'ij, should not be confused with ij, in (5.2.1)).

Again, there are two ways to prove this. Write the fixed-effects estimator as a GMM estimator and then apply the large sample theory of GMM estimators. Or you can prove it directly by inspecting the expression (5.2.4) for the sampling error. The only non-trivial part of the proof is that the middle summation in the expression ----::::::-for Avar(/JFE) converges in probability to its population counterpart, E(F;ij,ij;F;). The required technique is very similar to the one used for proving Proposition 4.2 and so is not repeated here.

- -

QUESTIONS FOR REVIEW 1.

For the large sample results of this section to hold, do we have to assume that ct; and q, are orthogonal to each other?

2. (Dropping redundant equations) As mentioned in the text, the M transformed equations are not linearly independent in the sense that any one of the M transformed equations is a linear combination of the remaining M - I equations. So consider forming a pooled sample of size M n - n by dropping one transformed equation (say, the last equation) for each group (i). Is the OLS estimator on this smaller pooled sample the fixed-effects estimator? [Answer: No.]

Panel Data

337

3. (Importance of cross orthogonalities) Suppose (5.2.6) holds form = h but not necessarily form "I h. Would the fixed-effects estimator be necessarily consistent? [Answer: No.] 4. Verify that E(ij,ij;) is singular. (Nevertheless, E[ii; E(ij,ij)F,j in (5.2.7) and E(Fjij,ij;F,) in (5.2.23) are nonsingular, as an optional analytical exercise will ask you to show.) 5. (Consistent estimation of error covariances) To prove that (5.2.9) is consistent for E(ij;ij;), you would apply Proposition 4.1 to the transformed system of M equations. Verify that the conditions of the proposition are satisfied here. In particular, is the requirement that E(f;mf(m) (where f;m is the m-th row of F;) be finite satisfied by the error-component model? Hint: f;m is a linear transformation of X;. (5.1.5) and (5.1.6) imply that E(x;xj) is nonsingular.

-

6. (What the Hausman statistic tests) For simplicity, assume I: (= E(e,ej)) is known. Also for simplicity, assume that there are no common regressors, so Z;m = f;m and (5.2.18) reduces to (5.1.8a): E(z;m ·a;) = 0 for all m. This is the restriction not used by the fixed-effects estimator. Is the random-effects estimator necessarily inconsistent when (5.1.8a) fails but all the other assumptions of the error-components model (such as (5.1.8b)) are satisfied? Hint: Show that, without (5.1.8a),

where e; = l·a; + q,. Provided (5.1.8b) holds, (5.1.8a) is sufficient for E(Z;I:- 1e;) to be zero. Is it necessary? Thus, strictly speaking, the Hausman test is not a test of (5.1.8a); it is a test of E(Zj :r-' e;) = 0.

5.3

Unbalanced Panels (optional)

In applying the multiple-equation model to panel data, we have been making an implicit but important assumption that the variables are observable for all m, so that the number of observations for each cross-section unit is the same (M). Such a panel is called a balanced panel. But no available panel data sets are balanced, thanks to exits from and entries to the survey. Inevitably, some firms disappear from the sample due to bankruptcies and mergers before the end year M of the survey, while those firms that did not exist at the start of the survey may be included

338

Chapter 5

later. Some of the households initially in the survey will not complete the spell of M periods due to household resolutions or because respondents get tired of being asked the same questions over and over again. The panel of identical siblings is unbalanced if it includes identical triplets (and even septuplets) as well as twins. This section shows that the fixed-effects estimator easily extends to unbalanced panels. This sanguine conclusion, however, depends on the assumption that whether the cross-section unit stays in the sample does not depend on the error term. Without this assumption, the estimator is no longer consistent. This phenomenon is called the selectivity bias. A full discussion of why a selectivity bias arises and how it can be corrected for is relegated to Section 8.2, because the issue is not specific to panel data. "Zeroing Out" Missing Observations To handle missing observations, it is convenient to define a dummy variable, d;m,

I

d;m =

if observation m is in the sample,

{0

(5.3.1)

otherwise,

and M

M; =

L d;m = #observations from i.

(5.3.2)

m=l

If observation m is missing for group i, fill them-throws of y;, F;, and zeros so that they continue to be of fixed size. That is,

. _ [ d,, : Yil y, : d;M • YiM

with

l ._[ l _[ l '

F/

-

dil ..

:

(Mx#fJ)

(Mxl)

T/;

d;M •

f;,

t:M

,

'1/;

-

dil : T/;1 :

• (5.3.3)

(Mxl)

d;M . TliM

Thus, for each i and m, either all the elements of the vector (Y;m, f;m) are observable or none is observable. We will not consider the case where some, but not all, elements of (Y;m. f;m) are observable. With this new notation, (5.1.1 ")on page 329 becomes

y, =F.,'I+d; -b;y+d, ·a;+q,

(i

= 1,2, ... ,n).

(5.3.4)

In the case of balanced panels, where d; =1M, the transformation matrix Q in the fixed-effects formula (5.2.4) is the annihilator associated with 1M (see (5.1.10)).

339

Panel Data

With missing observations, we use as the transfonnation matrix the annihilator associated with d,: Q

= IM- d,(d;d,)-'d:

(MxM)

1 d,d,' = IM - M·

' = M;). (since d,d;

(5.3.5)

' This matrix depends on i because d, differs across i. Nevertheless, we leav<' this dependence of Q on i implicit for economy of notation. With y,, F,, q,, and Q thus redefined, the fonnula for the fixed-effects estimator remains the same as (5.2.4). Substituting (5.3.4) into this fonnula and observing that Qd; = 0, we see tha.t the expression for the sampling error, too, is the same as (5.2.5). Zeroing Out versus Compression As seen in Section 5.2 for the case of balanced panels, the fixed-effects estiD.lator can be calculated as the OLS estimator on the pooled sample (of size Mn) of deviations from group means. As before, let y,, F,, and ij; be the transfonnations by Q of y,, F;, and q,, respectively. For example, y; looks like

y,

=Qy, =

(MxM)

where 1 d;y, = - x (sum of Yim over m for which data are available). (5.3.6) M; M;

y, = -

I

Thus, if observation m for group i is available, the m-th element of y, is still the deviation of Yim from the group mean, but the group mean is over available observations. Since d;m = 0 if observation m is not available, the rows corresponding to missing observations in y, are zero. The same structure applies to columns ofF,. Therefore, the pooled sample of size Mn created from (y,, F;) (i = I, ... , r.•) has a number of rows filled with zeros. Those rows can be dropped from the pooled sample without affecting the numerical value of the fixed-effects estimator. That is, we can "compress" (y,. F,) by dropping rows filled with zeros, before forming the pooled sample.

-

340

Chapters

No ~;electivity Bias The fixed-effects estimator, thus adapted to unbalanced panels, remains consistent and a
ry,. I d,) =

0

(m, h = I, 2, ... , M). (5.3.7)

As shown in the next proposition, this condition is crucial for ensuring the consistency of the fixed-effects estimator. By the Law of Total Expectations, this condition is stronger than (5.2.6) that the unconditional expectation is zero. It is easy to construct an example where (5.3. 7) does not hold because the pattern of selection d, depends on the error term q,. Note that (5.3.7) does not involve the fixed effect a,. Thus, if the dependence of sample selection on the error term is only through the fixed effect, the problem of sample selectivity bias does not occur. If the orthogonality conditions (5.2.6) are strengthened as (5.3.7), the conclusiom; of Proposition 5.3 carry over: Prop•osition 5.4 (large-sample properties of the fixed-effects estimator for unba!alllced panels): Consider the error-components model consisting of (5.3.4), (5.1.2), (5.1.4H5.1.6), (5.1.15), and (5.3.7), with (y,, F,, q;) as in (5.3.3). As before, define (y,, F;, ij,) by (5.2.2), but with Q given by (5.3.5). Then the conclusions of Proposition 5.3 hold. Here we show that E(F;Qq;) = 0. (The other parts of the proof, which are application.s of Kolmogorov's LLN and the Lindeberg-Levy CLT, are the same as in the proof of Proposition 5.3.) Since M

F;Q7J; =

M

L L q~h ·dim 'd;h '

f;m · TJih,

(5.3.8)

m=l h=l

where q~~ is the (m, h) element ofQ applied to the i-th equations, we have M

E(F;Qq;) =

M

L L E(q~~ . dim . d, • . f,m . q,.).

(5.3.9)

m=l h=l

If there is no selectivity bias so that E(f;m · q;h

I d,)

= 0, then (5.3.9) equals zero

341

Panel Data

because

(since q~~ is a function of d, ). (5.3.10) It is interesting that the extension to unbalanced panels is easier without conditional homoskedasticity. In order for (5.2.7) of Proposition 5. I to be the asymptotic variance of the estimator under conditional homoskedasticity, we would have to assume that the second moment of ij, conditional on x, and d, equals the unconditional second moment, and consistent estimation of this second moment would be a bit complicated. QUESTIONS FOR REVIEW

1. Ford; = (I, 0, I)', write down Q. 2. (Alternative derivation of the fixed-effects estimator for unbalanced panels) Let Yi be the M;-dimensional vector created from the M-dimensional vector y; by dropping rows for which there is no observation. Similarly, create Fi (M; x #{J) from F, and dj from d, (so di is the M;-dimensional vector of ones). Let Q' (M; x M;) be the annihilator associated with di. Verify that the fixed-effects estimator can also be written as (5.2.4) in terms of those starred matrices and vectors. 3. Section 4.6 described the pooled OLS in levels for the system (4.6.1') on page 292. (a) How would you modify the estimator (4.6.11 ') on page 293 to accommodate unbalanced panels? Hint: Define y, and z, (M x L) by "zeroing out." (b) Without conditional homoskedasticity, what is its asymptotic variance? 1 E(Z,e,e;z;) [E(Z,Z{jf ] [Answer: [E(Z,Z;J

r'

(c) How would you estimate the asymptotic variance? Hint: Replace F, by Z;, Q by IM. and ij, bye, in Proposition 5.3.

342

5.4

Chapter 5

Application: International Differences in Growth Rates

Do poor economies grow faster than rich economies? If so, how much faster? These are the questions addressed in the recent burgeoning literature of growth. 4 This section derives the key equation that has been used to explain economic growth of nations and discusses ways to estimate it. Actual estimation of several specifications of the equation is relegated to the empirical exercise. Derivation of the Estimation Equation In neoclassical growth theory, output per effective labor in period 1, denoted q(l) and to be defined more precisely in (5.4.4) below, converges to the steady-state level q*. The log-linear approximation around the steady state gives d log(q(l)) -"-d'-'-.:....:..:.

=)., ·

1

[log(q*)- log(q(t))].

(5.4.1)

Estimating ).., the speed of convergence, has been the objective of the recent growth literature. (5.4.1) implies that, for any two points, lm-t and 1m in time, log(q(lm)) =(I- p) ·log(q') + p ·log(q(lm-t)),

(5.4.2)

p =exp[-)., · Um- tm-1)].

(5.4.3)

where

(Since in our data tm 's are equally spaced in time, p will not depend on m.) Output per effective labor is defined as q(t)

=

Y(t) A (I) L(t)

(5.4.4)

where Y(l) equals aggregate output, L(t) equals aggregate hours worked, and A (I) equals level of labor-augmenting technical progress. Assuming A(t) grows at a constant rate g (so that A (I) = A (0) exp(g t) ), (5.4.4) implies log(q(t)) =log ( -Y(t)) - log(A(O))- g · 1. L(l) Substituting this into (5.4.2), we obtain

4 A very extensive survey of the literature is Barro and Sala-i-Manin (1995).

(5.4.5)

343

Panel Data

log ( Y(tm)) = p ·log (YUm-Il) L(tm) L(tm-J)

+(I- p) · [log(q') + log(A(O))] + 4>m. (5.4.6)

where m

= g · Um -

(5.4.7)

P 'lm-J).

An equivalent form of equation (5.4.6) can be obtained by subtracting log r~:::=:~ from both sides: log(Y(tml) -log(Y(Im-Il) L(tm)

L(tm_Jl

= (p- I) log ( YUm-Il) L(tm-I)

+(I- p) · [log(q') + log(A(O))] + 4>m·

(5.4.6')

Since p 0, this equation says that the level of per capita output has a negative effect on subsequent growth. The country, when poor, should grow faster. This form of the equation is more intuitive, but for the purpose of choosing the correct estimation technique, (5.4.6) rather than (5.4.6') is more instructive. Equation (5.4.6) should hold for each country i .5 For the rest of this section, we assume that the speed of convergence ()._)and the growth rate of labor-augmenting technical progress (g) are the same across countries. Then (5.4.6) for country i can be written as Yim = l/Jm

+ PYi,m-1 + ct;,

(5.4.8)

where Yim

= log ( y Um)) for country i, L(tm)

a;=

(I- p)

x {log(q')

+ log(A(O))}

for country i.

(5.4.9)

Appending the Error Term

It is a fairly standard practice in applied econometrics to derive an equation without an error term from the theory and then append an error term for estimation. Growth empirics is not an exception; an error term ry,m is added to (5.4.8) to obtain Yim = l/Jm

+ PYi,m-J

+ct;

+ 1Jim•

(5.4. 10)

Swe have derived the equation describing convergence assuming a dosed economy. The assumption is hard to justify, but a similar equation can be derived under free mobility of physical capital if human capital is another factor of production and not freely mobile internationally. See, e.g., Sarro and Sala-i-Martin (1995).

344

Chapter 5

One natural interpretation of Tlim is business cycles, which in macroeconomics are defined as the difference between actual output and potential output. Growth theory is about potential output. The discrepancy between potential output and actual output becomes the error term in the estimation equation derived from growth theory. But then there would be an errors-in-variables problem, which renders the regressors endogenous. 6 For the rest of this section, we follow the mainstream growth literature and ignore the endogeneity problem. Another problem is the possible correlation in Tl<m between countries (i's). This spatial (or geographical) correlation would not arise if the sample were a random sample. But the sample of nations is not a random sample; it is the universe. The phenomenon known as international business cycles implies that the spatial correlation would be positive. Statistical inference treating countries as if they were independent data points would overstate statistical significance. This problem, too, is largely ignored in the literature and will be ignored in our discussion. Treatment of a.,

As is clear from (5.4.9), the term a 1 in equation (5.4.10) depends on the steadystate level q' and the initial level of technology A(O). How you treat a; in the estimation depends on which version of neoclassical growth theory you subscribe to. According to the Cass-Koopmans optimal growth model, the sole determinant of the steady-state level q' is the growth rate of labor. Thus, if technological knowledge freely flows internationally so that A(O) is the same across countries, the international differences in a 1 are completely explained by the labor growth rate. Therefore, leaving aside the endogeneity problem just mentioned, OLS estimation of (5.4.10) delivers consistency if the labor growth rate is included as the additional regressor to control for a 1• If you subscribe to the Solow-Swan growth theory, the saving rate is the other determinant of q', so the regression should include the saving rate along with the labor growth rate. A large fraction of the recent growth literature under the rubric of "conditional convergence" includes these and other variables that might affect q' or A(O)such as measures of political stability and the degree of financial intermediationin order to control for a 1• In the empirical exercise, you will be asked to estimate p, and hence the speed of convergence A, using this conditional convergence approach.

6Let Xim be the log potential output in year 1m for country i that growth theory is purported to explain, so that (5.4.8) holds for Xim rather than for Yim. Let log actual output Yim be related to Xim as Yim = Xim +'11m· Then (5.4.10) results, but with the error tenn 'hm - prJi,m-1· Not only would there be an errors-in-variables problem, but also the error tenn has a moving-average structure.

345

Panel Data

However, no matter how ingeniously the variables are selected to control for a,, some component of a 1 having to do with features unique to the country would remain unexplained and find its way in the error term. Being part of a;, it is correlated with the regressor Yi.m-I in (5.4.10). The OLS estimation, even if the equation includes a variety of variables to control for a,, will not provide consistency. The alternative approach is to treat a, as an unobservable fixed effect. Consistent Estimation of Speed of Convergence

If output and other variables are observed at M +I points in time, t0 , t 1, t2 , ••• , fM, equation (5.4.10) is available form = I, ... , M. The fixed-effects technique could be applied to this M -equation system, but it does not provide consistency. The reason is that the system is "dynamic" in that some of the regressors are the dependent variables for other equations of the system. To see this, look at the first two equations of the system:

+ PY;o +a, + '111, <1>2 + PYn +a,+ '112·

Yn = 1 Yi2 =

Suppose the error term 'Ill is orthogonal to the regressors (which are a constant and Y10l in the first equation, so that E('lil) = 0 and E(y;o '111) = 0. If we further make the assumption that E(a 1 'Iii) = 0, then from the first equation

But Yil is one of the regressors in the second equation. Therefore, if the error term 'lim is orthogonal to the regressors for each equation, the "cross" orthogonalities are necessarily violated. 7 As emphasized in Section 5.2 (see Review Question 3), the fixed-effects estimator is not guaranteed to be consistent if the cross orthogonalities are not satisfied. For a precise expression for the bias, see Analytical Question 4to this chapter. One way to obtain consistency is the following. Take the first differences of the M -equations to obtain M - I equations:

+ (y"- Y;o)P + (1/12- 'Ill), 1L2 + (Yi2- Y11)P + ('113- 'l;z),

Y12- Y11 = 1L1

YI3- Y12 =

YIM- Yi.M-1 = ILM-1

+ (YI.M-1- Yi,M-2)P +('liM- 'li.M-Jl,

7This was first pointed out by Nickell (1981).

(5.4.11)

346

Chapter 5

where

The random-effects estimator and the equation-by-equation OLS do not provide consistency because the regressors are not orthogonal to the error term. For example in the first equation, y; 1 - y;0 is not orthogonal to ry; 2 - ry;1 because E(y; 1 · ry; 1) ~ 0 as just seen. Consistent estimation can be done by multiple-equation GMM, if there are valid instruments. In the optional empirical exercise to this chapter, you will be asked to use the saving rate as the instrument for the endogenous regressors. To the extent that the saving rate is uncorrelated with ry;m. this estimator is robust to the endogeneity problem mentioned above. QUESTIONS FOR REVIEW

1. The assumption that g (growth rate of technical progress) is common to all countries is needed to derive (5.4.8). Do we have to assume that g is constant over time? [Answer: No.] 2. (Identification) For simplicity, let M = 3 so that (5.4.11) is a two-equation system. (a) Write (5.4.11) as a multiple-equation system with common coefficients by properly defining the YiJ. y; 2 , Zn, z;2, 8 in (4.6.1). Hint: z; 2, for example, is (0, I, Yi2 - Yn )'.

(b) If S;m is the instrument for Yim - Yi.m-J. the instrument vector for the m-th equation is X;m = (1, S;m)' form = I, 2. Show that the system is not identified if Cov(s;m, Yim - Yi.m-l) = 0 for all m. Hint:

l:xz

=

E(s; 1 )

0 0

E[sn (Yn - y;o)]

0 0

I E(sn)

E(Y;2 - Yn) E[s;2(Yi2- Yn)]

E(y;l - y;o)

Appendix S.A: Distribution of Hausman Statistic

This appendix proves that the Avar(q) in (5.2.21) is positive definite and the Hausman statistic (5.2.22) is guaranteed to be nonnegative in any finite samples.

347

Panel Data

As shown in Section 4.6, (5.A.l) where

Sxz

I n =-L::Zi®Xj,

(MKxL)

n

i=l

]

n

Sxx =(KxK) n

K

~

S

(MKxMK)

==

~

:E 0 Sxx, I

Lxix;,

Sxy

i=l

n

=-n LY; 0

X;,

i=l

= #x;,

L

= #z;m

(so Z; is M x L ).

Let Z;

=

: ' -= #fJ,

(F;. lb;), p

D =

(Lxp)

, [•PJ 0

so that F 1 = Z;D. Then it can be shown fairly easily that (5.A.2) where

W=

Q0

s;;;,

Q

= annihilator associated with 1M.

From (5.A.l) and (5.A.2) it follows that

(To derive this, recall from Section 4.6 that n

1"', S.,WS., = - L...Z QZ;. ,~

n.1=1

1

But since QZ; = (QF1 : 0), this equals (5.A.4) so that s~,ws., = oo's~,ws., = s~,ws"'oo'.J As was shown in Analytical Exercise 5 of Chapter 3, (5.A.5)

348

Chapter 5

Substituting this into (5.A.3), we obtain (5.A.6) The asymptotic variance of this expression is

where B = plim B, W = plim W = Q 0 '!:;.1, and S I:.,D is invertible because it is easy to show that

= Avar@.

Here, D'I::U W ·

D'I::U Wl:xzD = E(F;QF;),

(5.A.8)

which, as proved in Analytical Exercise 5, is nonsingular. (5.A.7) is consistently estimated by

-

..-....-.. ....... ..-....-..

..-..

..-..

Avar(q) = (D'S:UWSxzD)- 1D'S:UWBSBWSxzD(D'S:UWSuD)- 1• (5.A.9)

-

-

A straightforward but lengthy matrix calculation shows that (5.A.7) equals Avar(pFE) - D' Avar(.IRE)D and (5.A.9) equals Avar({iFE) - D' Avar(.IRE)D. In any finite sample, (5.A.9) is positive sentidefinite. So the Hausman statistic is guaranteed to be nonnegative. We now show that (5.A.7) is nonsingular. By (5.1.6), Sis positive definite. Therefore, the rank of Avar(q) equals the rank of (5.A.IO) Since the columns of s- 1l:xz form a basis for the column null space of B and since (Q 0 '!:;,1)l:.,D is of full column rank by (5.1.15), by Lemma A.5 of Newey (1985), 8 the rank of (5.A.IO) equals (5.A.ll) where use has been made of the fact that S = I: 0 '!:"". Now, x; collects all the unique variables in Rearrange x1 as

z,.

Xj

' = (Zil,

••. , Z;' M,

b')' i .

(5.A.l2)

e matrix, B an e x m matrix. and Cane x n matrix. If the columns of C fonn a basis for the column null space of A and rank(B) = m, then rank(AB) = rank(C : B)- n. Here, k =l=MK,m = p,andn = L. 8The lemma states: Let A be a k x

349

Panel Data

Then 0

h-p

] , em = m-th column ofiM, (5.A. 13)

so that l:xz can be written as (5.A.l4) where H' = Avar(q) is

(H'., ... , H;_,). Substituting (5.A.l4) into (5.A.II), the rank of

The m-th K rows of the (M K x (L are

+ p))

matrix [(l:- 1 0 IK )H : (Q 0

h

)lfiD]

0 (5.A.I5)

0

0

Therefore, ran:k[(l:- 1 0 h )H : (Q 0 IK )HD] = L + p unless vec(l:- 1) Md vec(Q) are linearly dependent; that is, unless l: -I is proportional to Q. But since • ran:k(l:- 1) = M ;6 ran:k(Q) = M ~ I, l:- 1 cannot be proportional to Q. PROBLEM SET FOR CHAPTER 5 - - - - - - - - - - - ANALYTICAL EXERCISES

1. (Fixed-effects estimator as the LSDV estimator) In (5.l.I") on page 329, redefine a; to be the sum of old a; and b; y, so that (5.1.1 ") becomes Y;=F;{J+IM·a;+q;

(i= 1,2, ... ,n).

(I)

Define the following stacked vectors and matrices:

~

I]

. y=· (Mnxl)

,n

F ,

(Mnx#fl)

[F1]

= ·. ~n

,

[qll

[all

q = ·. , C i = ·. (Mnxl) ~n (nx I) ~n

,

350

Chapter 5

W

= (D: F),

(Mnx(n+#fJ))

D

(I ((n+#ft)xl)

= ["] {J

'

(2)

= ln ®1M=

(Mnxn)

so that the M -equation system (1) fori = I, 2, ... , n can be written as

y,

1M ·a 1 lM . <12

Yn

1M

Y1

'Cin

+

FtfJ

ql

F,{J

q,

+

FnfJ

qn

or

YI y,

1M 1M

"' 1M

Yn

""

+

F1 F,

Fn

ql

{J+

q, qn

.or y = Da

+ F{J + q = W6 + q.

(3)

The OLS estimator of (I on a pooled sample of size Mn, (y, W), is (W'W)- 1 . 'W'y. (a) Show that the fixed-effects estimator given in (5.2.4) is the last #{J elements of this vector. Hint: Use the Frisch-Waugh Theorem about partitioned regressions (see Analytical Exercise 4(c) of Chapter 1 ). The first step is to regress yon D and regress each column ofF on D. If we define the annihilator associated with D as Mn

= lMn- D(D'D)- 10',

(4)

(MnxMn)

then the residual vectors can be written as Moy and M 0 F. The second step is to regress M 0 y on M 0 F. Show that the estimated regression coefficient

351

Panel Data

can be written as

(5) Show that Mo = In® Q (where Q is the annihilator associated with IM, as in the text) and use this to show that (1) is indeed the fixed-effects estimator. Alternatively, you can go all the way back to the normal equations. If and bare the OLS estimators of a and of n

fJ,

a

then (a, b) satisfies the system

+ #fJ linear simultaneous equations:

[a]

D'D D'F] = [D'y] . [ F'D F'F b F'y

(6)

Take the first n equations and solve for a taking b as given. This should yield: a = (D'D)- 1 (D'y- D'Fb). Then substitute this into the rest of the equations to solve for b. (You can accomplish the same thing by using the formula for inverting partitioned matrices.) (b) Show:

where a; is the OLS estimator of a;, ~(y;,

+ · ·· + Y;M).

C!m

is them-throw ofF;, and

J;

Hint: Use the normal equations fora.

(c) Assume: (i) {y1, F 1 } is i.i.d.,

I

F;) (5.1.8b)),

(ii) E(q 1

(iii) E(q 1q;

I F 1)

=

0 (which is stronger than the orthogonality conditions

= a~IM (the spherical error assumption), and

(iv) W is of full column rank. Show that the assumptions of the Classical Regression Model are satisfied for the pooled regression (3). Hint: You have to show that W is in (3), that is, E(q

I W)

=

0 (Mnxl)

.

exogenous

352

Chapter 5

Also show that E(qq' I W) = a;IMn· Develop the small sample theory for a and b. For example, are they unbiased? Let SSR be the sum of squared residuals from the pooled regression (3). Is this SSR the same as the SSR from the "within" regression (5.2.15)? 2. (GMM interpretation of FE) It is fairly well known for the case of two equations ( M = 2) that the fixed-effects estimator is the OLS estimator in the

regression of first differences: Yi2 - Yil on zi 2 - zil. This question generalizes this argument to the case of arbitrary M. We wish to write the fixed-effects estimator (5.2.4) in the GMM format (7)

for properly defined Suo

W, and s.y. Let C be a matrix such that

Cis an M x (M- I) matrix offull column rank satisfying C'IM = 0.

(8)

A prominent example of such a matrix C is the matrix of first differences:

(a) Verify that these two matrices satisfy (8). (b) For any given choice of C, verify that you can derive the following system of M- I transformed equations by multiplying both sides of the system of M equations (5.1.1") (on page 329) from left by C':

f, where

. Y; ((M~l)xl)

= C'y;,

=

¥,p + ~~

-F; ((M~I)x#p)

= C 'F;,

(9)

353

Panel Data

(c) (optional) Verify the following: If {y,, z,] satisfies the assumptions of the error-components model (which are (5.1.1"), (5.1.2), (5.1.8), (5.1.4), (5.1.5), and (5.1.6) or (5.1.6') on page 324), then {y,, F,] satisfies the assumptions of the random-effects model, i.e., for the M - I equation system (9), (random sample) {y;, F;} is i.i.d., (orthogonality conditions) E(~; 0 X;) = 0 with x; (z 11 , ••• , z;M) =unique elements of (F;, b;),

= union of

(identification) E(F; 0 X;) is of full column rank, (conditional homoskedasticity) E(~;~; nonsingular,

I x;)

= E(~;ii) which is

(nonsingularity ofE(g,g;)) E(g;g;) is nonsingular,

g, = f;, 0

x;.

Hint: For the orthogonality conditions, note that (5.1.8b) can be written as E(~; 0 x;) = 0 and that ii; 0 x, = (C' 0 h)(~; 0 x;). The identification condition is equivalent to (5.1.15) (see the answer sheet for proof). To show that E( ii; I x;) does not depend on x; , use the fact that ii; = C' ~; = C' e; since e; = 1M· a;+~;· Given conditional homoskedasticity, we have

ij;

So the last condition (nonsingularity of E(g;g;)) is simply that E(x;x;J be nonsingular. (d) Show that the fixed-effects estimator (5.2.4) can be written in the GMM format (7) with

-

1 (I;; ~ ~x;x; )-1 .

(10)

W = (C'C)- 0

i=l

Hint: Use the following result from matrix algebra: If C satisfies condition (8), then

354

Chapter 5

(the annihilator associated with 1M). The elements of f;m are a subset of the elements of x,. Observe the "disappearance of x." So, for example,

(e) Let~ ((M- I) x (M- I)) be a consistent estimate of 11' that the efficient GMM estimator is given by

= E(i; 1i;). Show

1F; )-1 -I "'-,-1. -fJ-_ (I- "'-,-L....F,ll' L....F,ll' y,. n

n

n !=I .

n !=I .

(12)

Show that (13) Hint: Apply Proposition 4.7 to the present system of M - I equations.

(I) Show that the following is consistent for 11': (14)

Hint: Verily that all the assumptions of Proposition 4.1 are satisfied.

(g) (Sargan's statistic) Derive Sargan's statistic associated with the efficient GMM estimator. Hint: In Proposition 4.7(c), set

-S = -11' 0

(I~ ;; L....x,x; ) ,

-

-

g,({J) = Sxy- S,.{J.

i=l

(h) Suppose that q, is spherical in that (15)

Verify that, if

a-; is a consistent estimator of a;, then ~ =u'C'C

'

(16)

is consistent for 11'. Verify that the efficient GMM estimator (12) with this choice of~ is numerically equal to the fixed-effects estimator.

355

Panel Data

(i) (Invariance to the choice of C) Verify: if an M x (M -I) matrix C satisfies (8), so does B = CA where A is an (M- I) x (M- I) non singular matrix. Replace C everywhere in this section by B and verify that the choice of C does not change any of the results of this question. 3. Let SSR be the sum of squared residuals defined in (5.2.15). For the errorcomponent model with the spherical assumption that E(q 1q;) =
SSR plim- = (M -I) n-oo n

·
Hint: n

SSR = 2)Q(y;- F,pFE)l'[Q(y;- F,pFE)] i=l n

= L(Y; - F,pFE)'Q(y, - F,PFEl i=l

(since Q is symmetric and idempotent) n

= L(Y;- F,pFE)'C(C'C)-'C'(y,- F,PFEl

(by (11))

i=l n

1 -- "v'',.(C'C)v',. L.__

= C'Yi

('V;

- C'F-;; ii'FE )

i=l n

i=l n

= n . trace[ (C'C) -I~

n

L v,v; J

(since the trace operator is linear).

i=l

With the usual technical condition (that E(f1m f(h) exists and is finite for all m, h, which obtains if E(x,x;) exists and is finite), ~ should now be routine to show that

where v1 = C'y1 -

C'F,p. Show that E(v,v;) =

356

Chapter 5

4. (Inconsistency of the fixed-effects estimator for dynamic models) Consider a "dynamic" model Yim =a,+ p · Yi.m-1

+ ry,m

(m =I, 2, ... , M; i =I, 2, ... , n).

Suppose we have a random sample on (y; 0, Yih ... , YiM)- We assume that E(a; ry;m) = 0, and E(y;o ry;m) = 0 for all m. Also assume there is no crossequation correlation: E(ry;m ry;h) = 0 form ofo h, and that the second moment is the same across equations: E(ry~m) =a;. (a) Write this system of M equations as (5.1.1") on page 329 by properly defining y,, F,, and b;. (b) Show: E(y;m · ry;h) = 0 if m < h. Hint: Yim = ry;m

+ pryi.m-1 + · · · + P

m-1

ry"

+

1- Pm m I a;+ P YiD·

-p

(c) Verify that E(y;m · ry;h) = pm-ha; form >h. (optional) Show that E(F'.Q .) =_a; (M- I)- Mp +PM '

~.

M

(I - p) 2

(d) Therefore, ifE(F;QF,) is non singular and if (M- I)- Mp + pM 1 0, then the fixed-effects estimator for this dynamic model is inconsistent. Which assumption of Proposition 5.1 guaranteeing the consistency of the fixedeffects estimator is violated? 5. (Optional, identification ensures nonsingularity) (a) Show that E(F;F,) in (5.2.7) is invertible under assumption (5.1.15), that E(QF; 0 x;) is of full column rank. Hint: M

E(F;F,) = E(F;QF;) =

M

L L qmh E(f,m f/h) m=l h=l

where qmh = (m, h) element of Q. Show that

(Recall that the elements of f;m are included in x,.) Finally, show that

357

Panel Data

E(F;F;) = E(F; 0 X;)'[Q 0 [E(x;X,) r'J E(F; 0 X;) 1

= E(QF, 0x;)'(IM 0 [E(x;X,)r )E(QF, 0x;).

(b) Use a similar argument to show that E[F'; E(ij,ij;)F;] in (5.2.7) is nonsingular. Hint: Since ij, = Qq, = Qe,, we have

Also, E[F'; E(e;e)F,] = E(QF, 0 X;)'(E(e,e;) 0 [E(x,x) r'l E(QF; 0 X;).

(c) Show that E(F;'li- 1F;) in (13) of Analytical Exercise 2 is nonsingular. Hint: As shown in Analytical Exercise 2(c), E(QF 1 0 x1) is of full column rank if and only if E(C'F 1 0 x;) is offull column rank. 6. (Optional, alternative identification condition) Consider the error-component model (5.1.1") on page 329. Let p = #fJ. If Am is the matrix that picks up f,m from xi, it can be written as

where em is them-throw ofiM· (a) LetA (K Mxp) be the matrix obtained from stacking (A" ... , AM)· Show that F; = (IM 0 x;JA. (b) Show that the rank condition (5.1.15), combined with the condition that E(x1x) be nonsingular (which is implied by (5.1.5) and (5.1.6)), can be written as rank[(Q 0 IK}A] = p

(= #{J).

358

Chapter 5

(This result will be used in the next exercise.) Hint: F; ®X; = ([(IM ® x)AJ} ®X; = WM ® x) ® x, ](A ® I) = (IM ®x; ®x;)A = [IM ® (x;x)]A. So F, ®X;= (Q ® IK)(F; ®X;) = [Q ® (x,x;)JA

= (IM ® X;X)(Q ® IK)A. SoE(F; ®x 1) = [IM®E(x 1x)](Q®IK)A. 7. (Optional, checking nonsingularity) Prove that E(F;ij, ii;F;) in (5.2.23) is nonsingular, by taking the following steps. (a) Write F;ij; as a linear function of g1 = e; ® x1• Hint: Let A be as in the previous exercise.

F;q,

= A'(IM ® X;)Q~, = A'(IM ® X;)Qe;

(since e 1 = IM · a1 + ~,)

= A'(IM ® X;)(Qe; ®I) = A'(Qe; ® x;) =A'(Q®IK)(e;®X;) = A'(Q ® IK)g;. (b) Prove the desired result. Hint: E(g11i;) is positive definite by (5.1.6). EMPIRICAL EXERCISES

In this exercise, you will estimate the speed of convergence using several different estimation techniques. We use the international panel constructed by Summers and Heston (1991), which has become the standard data set for studying the growth of nations. The Summers-Heston panel (also called the Penn World Table) includes major macro variables measured in a consistent basis for virtually all the countries of the world. The file SUM_HES. ASC is an extract for 1960-1985 from version 5.6 of the Penn World Table downloaded from the NBER's home page (www.nber.org). The

359

Panel Data

file is organized as follows. For each country, information on the country is contained in multiple records (rows). The first record has the country's identification code and two dummies, one for a communist regime (referred to as COM below) and the other for the Organization of Petroleum Exporting Countries (referred to as OPEC below). The second record has the year, the population (in thousands), real GDP per capita (in 1985 U.S. dollars), and the saving rate (in percent) for 1960. The third record has the same for 1961, and so forth. The twenty-seventh record of the country is for 1985. The first few records of the file look like: 1 0 0 1960 10800 1961 11016

1723 1599

19.9 21.1

1984 1985

21173 21848

2962 2988

26.4 26.0

4816 4918

931 976

3.3 2.9

2 0 0 1960 1961

The file contains all 125 countries for which the data are available. (The mapping between the country ID and the country name is in country. as c. Country I is Algeria, 2 is Angola, etc. Note that the mapping in Penn World Table version 5.6 is different from that in Summers and Heston (1991).) GDP per capita is the country's real GDP converted into U.S. dollars using the purchasing power parity of the country's currency against the dollar for 1985 (the variable "RGDPCH" in the Penn World Table). The saving rate is measured as the ratio of real investment to real GDP ("!" in the Penn World Table). See Sections II and Ill (particularly III. D) of Summers and Heston ( 1991) for how the variables are constructed. The first issue we examine empirically is conditional convergence. (At this juncture it would be useful to skim Mankiw, Romer, and Wei! (1992).) Let s, be the saving rate of country i and n, be the country's population growth rate (which we take to be the growth rate of labor). In the Solow-Swan growth model with a Cobb-Douglas production function, the steady-state level of output per effective labor, denoted q' in the text, is a linear function of log(s;) -log(n, + g +8), where g is the rate of labor-augmenting technical progress and 8 is the depreciation rate of the capital stock. Then, assuming A(O), the initial level of technology, to be the

360

Chapter 5

same across countries, equation (5.4.10) can be written as

Yim = m

+ PYi.m-1 + Yllog(s;) + y,log(n; + g + 8) + ry;m.

(I)

The growth model implies that y 1 = -y2 , but we will not impose this restriction. This is the specification estimated in Table IV of Mankiw et a!. (1992). As in Mankiw et al., assume g + 8 to be 5 percent for all countries, and take s1 to be the saving rate averaged over the 1960-1985 period. We measure n 1 as the average annual population growth rate over the same period (Mankiw et al. uses the average growth rate of the working-age population). Because our sample of 125 countries includes (former) communist countries and OPEC, we add COM(= I if communist, 0 otherwise) and OPEC(= I if OPEC, 0 otherwise) to the equation:

+ PYi,m-1 + Yl log(s;) + y,Iog(n; + g + 8) + y3COM; + y40PEC; + ryim·

Yim = m

(2)

(a) By subtracting Yi.m-1 from both sides, we can rewrite (2) as

Yim- Yi,m-1 = m

+ (p-

l)y;,m-1

+ Y!log(s;) + y,log(n; + g + 8) (2')

For this specification, set M = I (just one equation to estimate) and take to = 1960 and t 1 = 1985, so the dependent variable, Yil - Y;o, is the cumulative growth between 1960 and 1985. Plot Yn - y;o against y10 for the 125 countries included in the sample. Is there any relation between the initial GDP (y; 0 ) and the subsequent growth (Yi! - y; 0 )? Assuming outright that the error term is orthogonal to the regressors, estimate equation (2') by OLS. Calculate the value of A (speed of convergence) implied by the OLS estimate of p. (It should be less than I percent per year.) Calculate the asymptotic standard error (i.e., lfn times the square root of the asymptotic variance) of your estimate of A. Hint: By (5.4.3). ~ = -log(p)/25. By the delta method (Lemma 2.5 of Section 2.1), A

Avar(A) =

Avar(p) (25p) 2

(since the derivative of log(p) is If p). So a consistent estimate of Avar(~) is Avar(p)/(25,0) 2 • The standard error of A is lfn times the square root of this. So: the standard error of~ =(standard error of p)/(25,0). Can you confirm the finding that "if countries did not vary in their investment and population growth -

A

Panel Data

361

rates, there would be a strong tendency for poor countries to grow faster than rich ones" (Mankiw et al., 1992, p. 428). (b) (Fixed-effects estimation) By setting M

=

25, to

=

1960, t 1

=

1961, ... ,

t 25 =

1985, we can form a system of 25 equations, with (5.4. 10) as the m-th equation. Estimate the system by the fixed-effect technique, assuming that the error is spherical in the sense of (5.2. 12). What is the implied value of )c? (It should be about 6.4 percent per year.) (Optional: Calculate the stan-

dard errors without the spherical error assumption. As emphasized in the text, the fixed-effects estimator is not consistent. We nevertheless apply blindly the fixed-effect technique just to gain programming experience.)

TSP Tip: TSP's command for the fixed-effects estimator (with the spherical error assumption) is PANEL. It, however, does not accept the data organization of SUM_HES. ASC.lt requires that the group (country) lD and the year be included in the rows containing information on group-years. The data file that PANEL can accept is FIXED. ASC.lts first few records are:

1 0 0 1 0 0

1961 1962

1599 1275

1723 1599

1 1 2 2

1984 1985 1961 1962

2962 2988 976 984

2903 2962 931 976

0 0 0 0

0 0 0 0

Here, the first variable is the country ID, followed by COM, OPEC, year, real per capita GOP for the year, and real per capita GOP for the previous year.

RATS Tip: RATS does not have a command specifically for the fixed-effects estimator, but there is a command called PANEL (not to be confused with TSP's PANEL) that calculates deviations from group means. The RATS codes for doing this part of the exercise are shown here:

* * (b) fixed-effects estimation. * calendar(panelobs=25} allocate 0 125//25

362

Chapter 5

open data fixed.asc data(org=obs) I id com opec year rgdp rgdpl * generate year dummies (only 24 dummies are * enough) dec vector[series] dummies(24) do i=l, 24 set dummies(i) = (year==i+l960) end do i * log transformation set y = log(rgdp); set ylag = log(rgdpl); * simple statistics table * calculate deviations from group means panel y I yw 1 entry 1.0 indiv -1.0 dec vector[series] dumw(24) panel ylag I ylagw 1 entry 1.0 indiv -1.0 do i=l,24 panel dummies(i) I dumw(i) 1 entry 1.0 indiv -1.0 end do l * Within regresslon, with degrees of freedom * correction linreg(dfc=125) yw;# dumw ylagw

(c) (optional) As in the previous question, take M = 25 and to = 1960, t 1 = 1961,12 = 1962, ... , t 25 = 1985. Then (5.4.11) is a system of24 equations if written as the model of Section 4.6:

Yim=z;m&+e;m

(i=l,2, ... ,125;m=l,2, ... ,24),

where the Yim in (3) equals Yi.m+l - Yim in (5.4.11), Z;m =

(0, ... ,0, I,O, ... ,O,y;m -Yi.m-d',

a 25-dimensional vector whose m-th element is I, & = (JJ-,, ... , 1"24, p)'.

(3)

363

Panel Data

If s;m is country i's saving rate (investment/GOP ratio) in year m, use a constant and s;.m~l as the instrument in the m-th equation (so, for example, in the first equation where the nonconstant regressor is y 1. 1961 - y;,~ 960 , the vector of instruments is (I, s1. 1960 )'). Apply the multiple-equation GMM technique to (3) assuming conditional homoskedasticity. (It will be very clumsy to use TSP or RATS; use GAUSS. The estimator is (4.6.6) with Wset to the inverse of (4.5.3).) The implied value of A should be about 36 percent per year!

ANSWERS TO SELECTED QUESTIONS _ _ _ _ _ _ _ _ _ __

ANALYTICAL EXERCISES

1c. E(q 1 I W) = E(q 1 I F)

(since Dis a matrix of constants)

=E(q, I F1, ... ,Fn)

= E(q 1 I F;) (since q1 is indep. of (F" ... , F 1_ 1 , F1+ 1 ,

•.• ,

F n) by i.i.d. assumption)

=0. Therefore, the regressors are exogenous. Also, E(q 1 q; I W) = E(q 1q; I F)

= E(q,q; IF,) =

a:

IM

(by the spherical error assumption).

By the i.i.d. assumption, E(q 1 qj I W) = 0 fori

'I

j. So E(qq' I W) =a~ IMn·

*

2c. Proof that "E(QF1 ® x1) is of full column rank" "E(C'F 1 ® x1) is of full column rank": E(QF; ®X;) and B E(C'F; ®X;). There exists an M x (M - I) Let A matrix L of full column rank such that Q = LC' (it is actually C(C'C)- 1 by (II)). Let D = L ® IK. Then A = DB. Since the rank of a product of two matrices is less than or equal to the rank of either of the two matrices,

=

=

rank(A) < rank(B).

(I)

364

Chapter 5

Next consider a product D' A = D'DB. rank(D'DB) < rank(D' A) < rank(A).

(2)

But since Lis of full column rank, D'D is nonsingular. Since multiplication by a nonsingular matrix does not alter rank, rank(D'DB) = rank(B).

(3)

From (1)-(3) it follows that rank(A) = rank(B). Since A and B have the same number of columns, the desired conclusion follows.

4a.

There is no b;. 4d. Since E(ry;h · Y;m) satisfied.

I 0 form

> h, some of the "cross" orthogonalities are not

References Sarro, R., and X. Sala-i-Martin, 1995, Economic Growth, New York: McGraw-Hill. Mankiw, G., D. Romer, and D. Weil, 1992, "A Contribution to the Empirics of Economic Growth," Quarterly Journal of Economics, 107, 407-437. Newey, W., 1985, "Generalized Method of Moments Specification Testing," Journal of Economet-

rics, 29, 229~256. Nickell, S., 1981, "Biases in Dynamic Models with Fixed Effects," Econometrica, 49, 1417~1426. Summers, R., and A. Heston, 1991, "The Penn World Table (Mark 5): An Expanded Set of International Comparisons, 1950-1988," Quarterly Journal of Economics, 106, 327-368.

CHAPTER 6

Serial Correlation

ABSTRACT This chapter extends the GMM model of Chapter 3 to incorporate serial correlation in the product of the vector of instruments and the error term. To do so, we need to generalize the Central Limit Theorems of Chapter 2 to serially correlated processes. The generalization is possible under certain conditions restricting the degree of serial correlation. The condition is transparent for the stochastic processes called linear processes. Linear processes are important in their own right. and this chapter devotes four sections to covering them. In the application section, the extended GMM model is used to test the efficient market hypothesis, this time for the foreign exchange market.

A Note on Notation: Because the issues treated here are specific to time-series data, we use the subscript "t" instead of "i" in this chapter.

6.1

Modeling Serial Correlation: Linear Processes

Section 2.2 introduced covariance-stationary processes and their autocovariances. In particular, a (scalar or univariate) white noise process {£,) is a zero-mean covariance-stationary process with no serial correlation: E(£ 1) = 0, E(Ei) = a 2 > 0, E(£ 1£1-j) = 0

for j

'I 0.

A very important class of covariance-stationary processes, called linear processes, can be created by taking a moving average of a white noise process. This section studies general properties of linear processes using an apparatus called the filter. Because the current value of a linear process can depend on possibly infinite past values of a white noise process, it is convenient to have the white noise process and the linear process it generates defined fort = 0, -I, -2, ... as well

366

Chapter 6

as for t = 1, 2, . . . . Intuitively, a covariance-stationary process defined for all integers (t = 0, ±1, ±2, ... ) is a process that started so long ago that its mean and autocovariances have stabilized to time-invariant constants. MA(q) The simplest example of linear processes that exhibit serial correlation is a finiteorder moving average process. A process {y,] is called the q-th order moving average process (MA(q)) if it can be written as a weighted average of the current

and most recent q values of a white noise process: (6.1. I) Evidently, a moving average process is covariance-stationary with mean J.L. It is easy to verify that the j-th order autocovariance, Vj (= E[(y,- J.LHY<-j- J.L)]), is q-j

Yj = (ejeo+ej+Iei

+··· +eqeq-j)u

2

2

=u L:ej+,e,

for j = O,l, ... ,q,

k=O

(6.1.2a) Yj = 0

for j > q,

(6.1.2b)

where u 2 = E(£?). (This formula also covers Y- j because Yj for q =I, we have Yo =

(eJ + et)u 2 =

(1

= Y- j·)

For example,

+ et)u 2

(by setting j = 0 and q = 1 in (6.1.2a) and noting e0 = 1), 2 2 YI = Y-1 = (eieo)u = e1u

(by setting j = 1 and q = 1 in (6.1.2a) and noting e0 = 1), Vj = 0

for j = ±2, ±3, ....

The whole profile of autocovariances, {yj], is described by just q + 1 parameters (e 1.e2 , .•• • eq. and u 2 ), and the correlogram {pj} (where pj = Yj/y0) by q parameters (e 1 , e2 .... , e"). MA(oo) as a Mean Square Limit As illustrated in the above example, serial correlation in MA(q) processes dies out

completely after q lags. Although some time series have this property (we will see an example in the application below), it is certainly desirable to be able to model serial correlation that does not have this property. An obvious idea is to have y,

367

Serial Correlation

depend on the infinite past by replacing the sum of finitely many terms in (6.1.1 ).

by an infinite sum 00

1/loE, + 1/J,E,~t

+ ··· =

L 1/ljE,~j

(6.1.3)

j=O

where {1/lj] is a sequence of real numbers. For this idea to work, we must make sure that this sum of infinitely many random variables is well defined, namely, that the partial sum (6.1.4) converges (in some appropriate sense) to a random variable as n --> oo. One popular condition under which this happens is that the sequence of real numbers {1/lj} be absolutely summable: (6.1.5)

(This is a standard short-hand expression in calculus. Equation (6.1.5) reads: the partial sum L}~o 11/ljl converges to a real number [i.e., a finite limit] as n --> oo and the infinite sum in (6.1.5) is defined to be this real number.) For the sequence {1/lj] to be absolutely sum mabie, it is necessary (but certainly not sufficient) that 1/lj --> 0 as j --> oo. Thus, absolute summability requires that the effect of the past shocks represented by 1/lj eventually die away. As claimed in the next proposition, the partial sum (6.1.4) converges in mean square to some random variable for any given t. The mean square limit is unique in the following sense. If there exists another random variable to which the partial sum converges in mean square, then (as you will show in Analytical Exercise I) the two mean square limits are equal with probability one. We define the infinite sum (6.1.3) to be the unique mean square limit of the partial sum (6.1.4) and say "the infinite sum converges in mean square." Proposition 6.1 (MA(oo) with absolutely summable coefficients): Let {e,) be white noise and {1/lj} be a sequence of real numbers that is absolutely summable. Then

368

Chapter 6

(a) For each t, 00

y, = IL

+L

VtjE,_ j

(6.1.6)

j=O

converges in mean square. {y,) is covariance-stationary. (The process {y,) is called the infinite-order moving-average process (MA(oo)).) (b) The mean ofy, is IL· Theautocovariances {yj) are given by Yj = (VtjVtO

=a' L

+ Vtj+IVtl + Vtj+2Vt2 + · · · )a 2 Vtj+kVtk

(j = 0, I, 2, ... ).

(6.1.7)

k=D

(c) The autocovariances are absolutely summable: (6.1.8)

(d) If, in addition, {£ 1 ) is i.i.d, then the process {y1 ) is (stricdy) stationary and ergodic. This result includes MA(q) processes as a special case with 1/tj = (}j for j = 0, I, ... , q and Vtj = 0 for j > q. Evidently, this sequence {1/tj) is absolutely summable. Proving parts (a)-{c) is Analytical Exercise 2. Proving (a) would use the fact from calculus that Cauchy sequences converge (see Review Question I about Cauchy sequences). Part (b) is an obvious extrapolation to infinity of the corresponding results on the mean and autocovariances of finite-order MA processes, but the extrapolation involves interchanging the order of expectations and summations. For example, to prove that the mean of the MA(oo) process is /J-, you would have to show that the expected value of the mean square limit of (6.1.3) is the limit of the expected value of (6.1.4). The fact that this operation is legitimate under mean square convergence would be used in the proof of part (b). Part (c) says, intuitively, that if the effect of past shocks represented by Vtj dies out fast enough to make {Vtj) absolutely summable, then serial correlation dies out just as fast. Absolute summability of autocovariances will play a key role in the theory of covariance-stationary processes to be developed in this chapter. For a proof of (d), see, e.g., Hannan (1970, p. 204, Theorem 3). 1 1Absolute summability of {1/lj} is more than enough for part (a) to hold. As shown in Analytical Exercise 2, square summability, which is weaker than absolute summability, is enough to ensure that the infinite sum is well defined as a mean square limit. However, for the other parts of thls proposition to hold, absolute summability is needed.

369

Serial Correlation

The MA(oo) process in (6.1.6) is said to be one-sided because the moving average does not include future values of £ 1. It is a special case of a linear process, which is defined to be two-sided moving averages of the form: 00

y, =

{!

+

L

00

1/Fj£1_ j

with

j=-00

I: 11/Fjl < oo. J=-00

The proposition (and all the results in this section) can easily be extended to general linear processes. We nevertheless focus on one-sided moving averages, only because two-sided moving averages are very rarely encountered in economics. Proposition 6.1 can be generalized to the case where the process {y,} is a moving average of a general covariance-stationary process rather than of a white noise process. In particular, absolute sumrnability of autocovariances survives the operation of taking an absolutely sumrnable weighted average. Proposition 6.2 (Filtering covariance-stationary processes): Let {x,} be a covariance-stationary process and {h j} be a sequence of real numbers that is absolutely summable. Then

(a) For each t, the infinite sum 00

Yt = LhJXt-j

(6.1.9)

j=O

converges in mean square. The process {y,} is covariance-stationary. (b) If, furthermore, the autocovariances of {x,} are absolutely summable, then so are the autocovariances of {y,}. We defer to, e.g., Fuller ( 1996, Theorem 4.3. I) for a proof, but the basic technique is the same as in the proof of Proposition 6.1. Deriving the autocovariances of {y,}, which would generalize (6.1.7), is Analytical Exercise 3. Filters The operation of taking a weighted average of (possibly infinitely many) successive values of a process, as in (6.1.6) and (6.1.9), is called filtering. It can be expressed compactly if we use the device called the lag operator L, defined by the relation Ox, = x,_j· We now introduce some concepts associated with the lag operator that will be useful in time-series analysis. 2

2For a fuller treatment of filters, see, e.g., Gourieroux and Monfort (1997, Sections 5.2 and 7.4).

370

Chapter 6

Filters Defined

For a given arbitrary sequence of real numbers, (a 0, a 1, a 2, ... ), define a filter by a(L)

= a 0 + a 1L + a 2L 2 + · · · .

(6.1.10)

If we just mechanically and mindlessly apply the definition of the lag operator to an input process (x, }, we get

+ a 1Lx, + a,L 2x, + · · · aoxr + a!xt-1 + a2xr-2 + · · ·

a(L)x, =a ox,

=

00

(6.1.11)

= :Lajxr-j· j=O

This is indeed the definition of a(L)x,. If x, = c (a constant), a(L)c = a(l)c = c 2::]':0 aj. If aj t- 0 for j = p and aj = 0 for j > p, the filter reduces to a p-th degree lag polynomial:

When applied to an input process, it creates a weighted average of the current and p most recent values of the process. Proposition 6.2 assures us that the object a (L )x, defined by (6.1. II) is a well-defined random variable forming a covariancestationary process if the sequence (aj} is absolutely summable and if the input process (x,} is covariance-stationary. A filter with absolutely summable coefficients will be referred to (in this book) as an absolutely summable filter. To use some fancy set theory language, an absolutely summable filter is a mapping from the set of covariance-stationary processes to itself. Products of Filters

Let (aj} and (pj} be two arbitrary sequences of real numbers and define the sequence (8j} by the relation aoPo = 8o,

+ a1Po = 81> aoP2 + a1P1 + a,Po = aofh

8,,

(6.1.12)

371

Serial Correlation

The sequence {8j} created from this convoluted formula is called the convolution of {aj} and {j3j ). Leta(L) ao+a1L +a,L 2 +· · ·, J3(L) J3o+J3IL + J3zL 2 +· · ·, and 8(L) = 80 +8 1L +8 2 L 2 + · · · be the associated filters. We define the product of two filters a(L) and Jl(L) to be this filter 8(L) and write it as

=

=

8(L) = a(L) jJ(L).

For example, for a(L) yields

= 1+

a 1 Land jJ(L)

= 1+

j3 1L, the convolution formula

(If you are familiar with the product of two polynomials of the usual sort from calculus, you can immediately see that this definition for filters is strictly analogous.) As is clear from the definition, filters are commutative: a(L) Jl(L) = jJ(L) a(L).

Commutativity, however, will not carry over to matrix filters (see Section 6.3 below). If the two sequences {aj} and {j3j} are absolutely summable and if {x,} is covariance-stationary, then Proposition 6.2 assures us that {jJ(L)x,} is covariancestationary and so a(L)J3(L)x, is a well-defined random variable, equal to 8(L)x,. Thus, if a(L) and jJ(L) are absolutely summable and {x,} is covariance-stationary, then "a(L) J3(L) = 8(L)"

~

"a(L) Jl(L)x, = 8(L)x,."

(6.1.13)

(Incidentally, {8j} is absolutely summable [see, e.g., Fuller, 1996, p. 30].) Inverses Of particular interest is the case when 8(L) = I (a very special filter) so that the two filters a(L) and jJ(L) satisfy a(L) Jl(L) = I. We say that Jl(L) is the inverse of a(L) and denote it as a(L)- 1 or 1/a(L). That is, a(L)- 1 is a filter satisfying a(L)a(L)- 1 =I.

(6.1.14)

As long as a 0 f' 0 in a(L) = a 0 + a 1L +··.,the inverse of a(L) can be defined for any arbitrary sequence {aj} because the set of equations (6.1.12) with 80 = I,

372

Chapter 6

Oj = 0 (j = I, 2, ... ) can be solved successively for {.Bj}:

For example, consider the filter I - L. Its inverse is given by (I- L)- = 1 + L 1

+ L 2 + L 3 + · ·· .

The coefficient sequence is obviously not absolutely summable. This example illustrates the point that an inverse filter may not be absolutely summable. It is left as Review Question 3 to show that a(L) a(L) - 1 = a(L)- 1a(L) "a(L) ,B(L)

= 8(L)" #

",B(L)

(so inverses are commutative), (6.1.15a)

= a(L)- 18(L)" #

"a(L)

= 8(L) ,B(L)- 1 ," (6.1.15b)

provided a 0 # 0 and .Bo # 0. Therefore, for example, to solve a(L) ,B(L) = 8(L) for ,B( L), you "multiply from the left" by the inverse of a(L). We have noted above that commutativity does not necessarily hold for matrix filters. As will be shown in Section 6.3, for inverses, commutativity does generalize to matrix filters. Inverting Lag Polynomials

In the next section on ARMA processes, it will become necessary to calculate the inverse of a p-th degree lag polynomial¢(L)

Since ¢ 0 = I it satisfies

# 0, the inverse can be defined. Let >/J(L) = t/>(L)- 1 . By definition, t/>(L) o/(L) = I.

The convolution formula (6.1.12) provides an algorithm for solving this for >/J(L). Setting a(L) = ¢(L), ,B(L) = o/(L), and 8(L) = I in the formula yields

373

Serial Correlation

constant: L: L':

u-

1

:

LP: u+I; u+Z:

>fro lfri

"''

=I,

-<1>11/fo - ilfri

=0 • =0,

-fro

1/fp-I- Il/fp-2- 1 1/fp-I - 21/fp-21/fp+I-!1/fp -zl/fp-11/fp+2 - 1 1/f p+ I - ch 1/fp -

· · · - p-Il/fo = o. · · · -p-I 1/f1 - p >fro= 0, ··· -p-Il/fz -pl/fi = 0, · · · - p-I 1/f3 - p 1/f2 = 0,

(6.1.16) These equations can easily be solved successively for (>fro,

1/f~o

1fr2, ... ) as

Also, notice that for sufficiently large j (actually for j > p), {1/fj} follows the p-th order homogeneous difference equation

So, once the first p coefficients (>fro, 1/f~o ... , 1/fp-I) are calculated, we can use this homogeneous p-th order difference equation to generate the rest of the coefficients (1/fj. j > p) with those first p coefficients as the initial condition. As we know from the theory of homogeneous difference equations (summarized in Analytical Exercise 4), the solution sequence {1/fj} to (6.1.17) eventually starts declining at a geometric rate if what is known as the stability condition holds. The condition states: All the roots of the p-th degree polynomial equation in (z) = 0 where (z)

=I-

z

1z- 2z2 - · · · - pzP

(6.1.18)

are greater than I in absolute value. The roots of the polynomial equation (z) = 0 can be complex numbers. If your command of complex numbers is rusty, a complex number z can be written as

z =a+ bi, where a and b are two real numbers and i = .J=T. So a complex number can be represented as a point in a two-dimensional diagram with the horizontal axis measuring a (the real part) and the vertical axis measuring b (the imaginary part).

374

Chapter 6

lts absolute value is lzl = vfa 2 + b 2 , which is the distance between the origin and the point representing z. We say that z lies outside the unit circle when lzl > I. So the stability condition above can be restated as requiring that the roots lie outside the unit circle. Because z = A solves (z) = 0 whenever z = I /A solves zP - ¢ 1zP-I - ¢ 2 zP- 2 - · · · - p-JZ- p = 0, the stability condition can be equivalently stated as All the roots of the p-th order polynomial equation (6.1.18') are less than I in absolute value (i.e., lie inside the unit circle). Under this condition, as you will be asked to prove (Analytical Exercise 4 ), there exist two real numbers, A > 0 and 0 < b < I, such that (6.1.19) Given (6.1.19), the absolute summability of [ 1/lj) follows because 00

00

L 11/ljl < L Abj = j=O

j=O

A

1-b

< 00.

Thus, we have proved Proposition 6.3 (Absolutely summable inverses of lag polynomials): Consider a p-th degree Jag polynomial (L) = I - 1L- 2 L 2 - · • · - pLP. and Jet 1/I(L) = (L)- 1 • If the associated p-th degree polynomial equation (z) = 0 satisfies the stability condition (6.1.18) or (6.1.18' ), then the coefficient sequence [ 1/lj) of 1/I(L) is bounded in absolute value by a geometrically declining sequence (as in (6.1.19)) and hence is absolutely summable. To illustrate, consider a lag polynomial of degree one, I - L. The root of the associated polynomial equation I - z = 0 is I/. The stability condition is that 11/1 > 1 or 11 j) is indeed bounded in absolute value by a geometrically declining sequence (for example, set A = 1.1 and b = ).

375

Serial Correlation

QUESTIONS FOR REVIEW

1. (Cauchy sequences) A sequence of real numbers, {a.}, is said to be a Cauchy sequence if lam -a. I ---> 0 as m, n ---> oo. An important result from calculus is that a. converges to a real number if and only if the sequence is Cauchy. Taking this as given, show that {Yj} is absolutely summable if and only if I:7~n+l IYj I ---> 0 as m, n ---> oo. (You can assume, without loss of generality, that m > n.) 2. (How filtering alters autocovariances) Let h(L) = h 0 + h 1 L andy, = h(L)x, where {x,} is covariance-stationary. Verify that (with {Y/l being the autocovariance of {x,}) the autocovariances of {y,} are given by I

yf =

I

L I>k h, Y/-k+l' k=O f=D

3. Prove (6.1.15b). Hint: Here is a proof that "a(L) fJ(L) = 8(L)" => "fJ(L) = a(L)- 18(L)":

•.

a(L)- 18(L) = a(L)- 1a(L)fJ(L) = a(L)a(L)- 1fJ(L) = fJ(L) .

4. Does ¢(z) = I - %z + 196 z2 satisfy the stability condition? Hint: The roots of ¢(z) = 0 are 2(1 ± v'3 i)/3. 5. (Coefficients of _polynomial equations) Consider a polynomial equation ¢(z) = 0 where ¢(z) = I - ¢ 1z - ¢2z 2 and let (A~o A2 ) be the two roots. Verify that ¢ 1 = I/A 1 + I/A 2 , ¢ 2 = -I/(A 1 · A2 ). Hint: II A1 and A2 are the two roots of ¢(z) = 0, ¢(z) can be written as ¢(z) = (I - ;, z)(l - ;, z). 6. (Inverting I - ¢L) Verify that (6.1.20) is the inverse of I - ¢L by checking the convolution formula (6.1.12) with 80 = I, 8j = 0 for j > I.

6.2

ARMA Processes

Having introduced the concept of filters, we can study with ease and grace a class of linear processes called ARMA processes, which are a parameterization of the coefficients of MA(oo) processes.

376

Chapter 6

AR(l) and Its MA(oo) Representation A first-order autoregressive process (AR(l)) satisfies the following stochastic difference equation:

y, = c + 4>Yt-I

+ t:,

or

where {t:t} is white noise. If 4>

Yt- 4>Yt-I = c + t:1

#

I, let 11

(y,- 11)- 4>- (y,_ 1 - 11)

= t:

1

or

(I- ¢L)y, = c + t:,, (6.2. I)

= cj(l- ¢)and rewrite this equation as or

(I- ¢L)(y,- 11)

= t:,.

(6.2.1')

As will be seen in a moment, 11. not c, is the mean of y, if y, is covariancestationary. For this reason, we will call (6.2. I') a deviation-from-the-mean form. The moving average is on the successive values of {y,}, not on {<1 ) _ The difference equation is called stochastic because of the presence of the random variable <1 • We seek a covariance-stationary solution {y,) to this stochastic difference equation. The solution depends on whether 14> I is less than, equal to, or greater than I. Case 1: 14>1 1 < I, we can apply it to both sides of the AR(l) equation (6. 2. I') to obtain

This operation is legitimate because, thanks to Proposition 6.2, both sides of this equation are well defined as mean square limits. By the definition (6. 1.14) of inverses, (I-¢L)(I-¢L)- 1 = I, and by commutativity of inverses (l-¢L)- 1 (I¢L) = I. Therefore, if {yt} is covariance-stationary, the left-hand side equals y,- 11 by (6.1.13). So 00

2

2

y, -11 = (1-ci>L)-'t:, =(I +4>L+¢ L +---)t:, = L4>j
or

y, = 11 + L 4>j
(6.2.2)

j=O

What we have shown is that, if {y,) is a covariance-stationary solution to the stochastic difference equation (6.2. I) or (6.2. I'), then y, has the moving-average representation as in (6.2.2). Conversely, if y, has the representation (6.2.2), then

Serial Correlation

377

it satisfies the difference equation. This can be derived easily by substitution of (6.2.2) into (6.2.1'): (I- tjJL)(y,- Jt) =(I- ¢£)(1- ¢L)- 1c1

= <,

(since (I- ¢£)(1- t/JL)- 1 = I by definition of inverses).

Thus, the process [y,) given by (6.2.2) is the unique covariance-stationary solution to the first-order stochastic difference equation if 1¢1 < I. The mean of this process is It by Proposition 6.1 (b). The condition 1¢1 < I, which is the stability condition (6.1.18) associated with the first-degree polynomial equation I - t/Jz = 0, is called the stationarity condition in the present context of autoregressive processes. Intuitively, this result says that the process, if it started a long time ago, "settles down," provided that 1¢1 I By shifting time forward by one period (i.e., by replacing "t" by "t+ 1"), multiplying both sides by q,- 1, and rearranging, the stochastic difference equation (6.2.1 ') can be written as (6.2.3) Applying the above argument but this time moving in the opposite direction in time shows that the unique covariance-stationary solution is 00

y, =~t- L¢-j<,+j·

(6.2.4)

j=l

That is, the current value of y is a moving average of future values of Ei. The infinite sum is well defined because the sequence [¢-j) is absolutely summable if 1¢1 > I. Case 3: 1¢1 = I The stochastic difference equation has no covariance-stationary solution. For example, if¢ = I, the stochastic difference equation becomes

+ Yt-1 + <, c + (c + Yt-2 + <1-1) + <, (since y,_, = c + Yt-2 + <,_,) c + (c + (c + y,_, +
y, = c = =

378

Chapter 6

Repeating this type of successive substitution j times, we obtain Yt- Yt-j =

C ·

j

+ (t:, + t:r-1 + · · · + 'r-j+J).

(If {t:r} is independent white noise, this process is a random walk with drift c.)

Suppose, contrary to our claim, that the process is covariance-stationary. Then the variance of y,- Yr-j is 2(y0 - Yj). The variance of the right-hand side is a 2 · j. So

a2

= I - - · j <-I 2yo '

p·

for j large enough (recall: p,· = y,/Yo).

This is a contradiction since autocorrelation coefficients cannot be greater than one in absolute value. So {y,} cannot be covariance-stationary. Processes with 1> = I are examples of "unit-root processes" and will be studied in Chapter 9. The solution (6.2.4) for the case of It/> I > I, with the current value of y linked to the future values of the forcing process {t:, }, is not useful in economics, which does not confer perfect foresight on economic agents. In what follows, unless otherwise stated, we will use the term "an AR(l) process" as the unique covariancestationary solution to an AR(l) equation (6.2.1) or (6.2.1') when the stationarity condition holds. Autocovariances of AR(l) The autocovariances {yj} of AR(l) (for the case of 11>1 < I) can be calculated in two ways. The first method is to represent AR(l) as MA(oo) as in (6.2.2) and use (6.1.7). Since 1/fj = tj>j for AR(l), we have

(6.2.5)

So the correlogram of an AR(l) process is very simple: it declines with j at a constant geometric rate. The second method bases the calculation on what is called the Yule-Walker equations; see Analytical Exercise 5 for the use of the Yule-Walker equations in the calculation of autocovariances. AR(p) and Its MA(oo) Representation All the results just derived for AR(l) can be generalized to AR(p), the p-th order autoregressive process, which satisfies the p-th order stochastic difference

379

Serial Correlation

equation:

+ t!>JYt-J + · ·· + t/JpYt-p + e,

or

y,- t/JJYt-J - · · · - t/JpYt-p = c + e,

or

y, = c t/J(L)y, =c+e,

withtjJ(L) = l-¢IL-¢2L 2 - ···

-rpPU,

(6.2.6)

where {etl is white noise and t/Jp i' 0. Evidently, ¢(1) = I - ¢ 1 - • • • - t/Jp. If t/J (I) i' 0, let /L = c /(I - t/>1 - · · · - t/Jp) = c /t/J (I). As shown below, /L is the mean of y, if y, is covariance-stationary. Substituting for c, the AR(p) equation (6.2.6) can be written equivalently in deviation-from-the-mean form: (y, - /L)- t/>1 · (Yt-1 - /L)- · · · - t/Jp · (Yt-p- /L) = e, or

t/J(L)(y,- /L) = e,.

(6.2.6')

The generalization to AR(p) of what we have derived for AR(l) is Proposition 6.4 (AR(p) as MA(oo) with absolutely summable coefficients): Suppose the p-th degree polynomial t/J(z) satisfies the stationarity (stability) con-

dition (6.1.18) or (6.1.18'). Then (a) The unique covariance-stationary solution to the p-th order stochastic difference equation (6.2.6) or (6.2.6') has the MA(oo) representation

where 1/f (L) = rp (L )- 1• The coefficient sequence {1/fj} is bounded in absolute value by a geometrically declining sequence and hence is absolutely summable. (b) The mean /L of the process is given by /L = ¢(1 )- 1c where c is the constant in (6.2.6).

(6.2.8)

(c) {yj} is bounded in absolute value by a sequence that declines geometrically with j. Hence, the autocovariances are absolutely summable. Unless otherwise stated, we will use the term "an AR(p) process" as the unique covariance-stationary solution (6.2.7) to an AR(p) equation (6.2.6) that satisfies the stationarity condition. The absolute summability of the inverse t/J(L)- 1 in part (a) of this proposition is immediate from Proposition 6.3. Given absolute summability, the first part of (a) can be proved by the same argument we used for AR(l ); just

380

Chapter 6

multiply both sides of (6.2.6') by lj>(L )- 1 and observe that

Part (b) is immediate from Proposition 6.l(b). Part (c) can easily be proved by combining (a) and Proposition 6.1(b). Analytical Exercise 7 will ask you to give an alternative proof of the proposition without using filters but using the apparatus called the companion form. As in AR(I), autocovariances can be obtained in two ways. The first method utilizes the MA( oo) representation. The coefficients {ifli) in the MA representation are the coefficients in the inverse of the lag polynomial¢(L), and the algorithm for calculating the inverse coefficients has already been presented (see (6.1.16)). Given {ifli), use (6.1. 7) to calculate the autocovariances {Yi). The second method uses the Yule-Walker equations. ARMA(p,q) An ARMA(p, q) process combines AR(p) and MA(q): y, = c + ¢1Yt-l

+ ¢2Yt-2 + · · · + 1/>pYt-p + Oo
or y, -lj>,y,_, -¢2Yt-2- · · · -1/>pYt-p = c

+ !Jo<, + 0,£,_, + 0,£,_, + · · · + Oq
or 1/>(L)y, = c + O(L)£ 1 , 1/>(L) = I -¢,L- · · · -1/>pU, O(L) = Oo + 0 1L

where {£1 ) is white noise. If ¢(1) mean form is

+ · · · + OqL",

(6.2.9)

#- I, set It= c/¢(1). The deviation-from-the-

(y, -It) -¢, · (y,_, -It) - · · · -1/>p · (Yt-p- It) = Oo£,

+ 0,£,_, + · · · + Oq
or 1/>(L)(y,- It)= O(L)£,.

(6.2.9')

381

Serial Correlation

This is still a stochastic p-th order difference equation but with a serially correlated forcing process {II (L )t:,} instead of a white noise process {t:,}. Proposition 6.4 generalizes easily to

Proposition 6.5 (ARMA(p, q) as MA(oo) with absolutely summable coefficients): Suppose the p-th degree polynomial ,P(z) satisfies stationarity (stability) condition (6.1.18) or(6.1.18'). Then

(a) The unique covariance-stationary solution to the p-th order stochastic difference equation (6.2.9) or (6.2.9') has the MA(oo) representation

where 1/f(L) = tjJ(L)- 111(L). The coefficient sequence {1/fj} is bounded in absolute value by a geometrically declining sequence and hence is absolutely summable. (b) The mean IL of the process is given by IL = tjJ(l)- 1c where cis the constant in (6.2.9).

(6.2.11)

(c) {yj} is bounded in absolute value by a sequence that declines geometrically with j. Hence, the autocovariances are absolutely summable. Unless otherwise stated, we will use the term "an ARMA process" as the unique solution (6.2.10) to an ARMA equation (6.2.9) or (6.2.9') that satisfies the stationarity condition. The MA(oo) representation can be derived easily by multiplying both sides of (6.2.9') by the inverse ,P(L)- 1 and observing that tjJ(L)- 1tjJ(L) = tjJ(L) tjJ(L)- 1 = I. For the rest of part (a) and part (b), the only nontrivial part of the proof is to show that {1/fj} is bounded in absolute value by a geometrically declining sequence. For AR(p) we used the algorithm based on the set of equations (6.1.16) to solve tjJ(L) 1/f(L) = I for 1/f(L). This algorithm can easily be generalized. This time 1/f(L) = tjJ(L)- 111(L) or tjJ(L) 1/f(L) = II(L). To solve this for 1/f(L), apply the convolution formula (6.1.12) with a(L) = tjJ(L), {J(L) = 1/f(L), and 8(L) = II(L) (instead of 1). The resulting set of equations is the same as (6. 1.16), except that the right-hand side of the equation for the constant is 110 rather than I, that of the equation for Lis 11 1 rather than 0, that of L 2 is 112 , and

382

Chapter 6

so forth, until the equation for L q, whose right-hand side is Oq- The right-hand side of the rest of the equations is 0. As in AR(p), these equations can easily be solved successively for (1/fo, 1/f1 , 1/f2 , •.. ) as

Again, as in the AR(p) case, for sufficiently large j (actually, for j > max(p, q + I), as you will verify in Review Question 5), {1/fi} follows the same p-th order homogeneous difference equation (6.1.17). So {1/fj} is bounded in absolute value by a geometrically declining sequence and hence is absolutely summable. Again there are two ways to derive the expression for autocovariances in terms of ( 1 , ... , p, 01, ••• , Oq. a 2 ). You can calculate the coefficients {1/fj} in the MA representation as just described and then use (6.1. 7). The other method is based on the Yule-Walker equations.

ARMA(p, q) with Common Roots In the ARMA(p, q) equation (6.2.9'), suppose (L) satisfies the stationarity condition as in Proposition 6.5, but suppose that one of the roots of (z) = 0 is also a root of O(z) = 0_ If that root is A, the polynomials (z) and l)(z) have a factorization (z) =a(z)'(z), l)(z) =a(z)O'(z) witha(z) = 1-A- 1z. The inverse of (L) is '(L)- 1a(L)- 1 So the 1/f(L) in Proposition 6.5 can be written as

which shows that the ARMA(p, q) equation (6.2.9') and the simpler ARMA(p1, q- 1) equation '(L)(y,- Jl.) = O'(L)<, share the same process as the unique covariance-stationary solution. 3 For reasons

of parsimony, ARMA equations with common roots are rarely used to parameterize covariance-stationary processes.

3sincc we defined filters for sequences of real numbers, the di~cussion in the text presumes that lhe root A real. If A is complex, then there must be another common root which is the complex conjugate of A. Let A1 = a+ bi and Az =a - bi be the pair of complex roots. Then if we define i~

a(z) = I - (Aj

I

+ A2

the discussion in the text carries over.

I

I

)z

+ (A1·A2) : :

2

= 1-

2a 2 (a 2 +b)

z + (a'

I

+ b')

2

z ·

383

Serial Correlation

lnvertlblllty For the ARMA(p, q) equation (6.2.9), if (z) = 0 and O(z) = 0 have no common roots and if O(z) satisfies the stability condition (6.1.18) or (6.1.18'), then the ARMA process is said to be invertible and the stability condition on O(z) is called the invertibility condition 4 Since O(L)- 1 is absolutely summable if O(z) satisfies the invertibility (stability) condition, we can multiply both sides of the ARMA equation (6.2.9) by O(L)- 1 and use the relation O(L)- 1c = ce(l)- 1 = cje(l) to obtain I

C

O(L)- (L)y, = O(l)

+ e,.

(6.2.13)

So, the process has the infinite-order autoregressive (AR(oo)) representation. The derivation does not require the stationarity condition (that (z) satisfies the stability condition). If both (z) and O(z) satisfy the stability condition, then the ARMA(p, q) process has both the MA(oo) and AR(oo) representations. Autocovariance-Generating Function and the Spectrum A particularly useful way to summarize the whole profile of autocovariances, {Yi l, of a covariance-stationary process {y, l is the autocovariance-generating function: 00

gy(Z)

=

L

YjZj

j=-00 00

=Yo+ LYi · (zi

+ z-i)

(since Y-i = Yil·

(6.2.14)

j=I

The argument of this function (z) is a complex scalar. Because the summation involves infinitely many terms, there arises a question as to whether the function is well defined. A familiar result from calculus says that the infinite sum is well defined at least for lzl = I (the unit circle) if {Yi l is absolutely summable. For example, for z = I, 00

00

(6.2.15) j=-00

j=I

4It is easy to confuse invertibility and the existence of an inverse. Since Bo = I f- 0. G(L) does have an inverse without the invertibility condition. The inverse, however, may not be ab~olutely summable unles~ the invertibility condition is satisfied.

384

Chapter 6

This is indeed well defined because the infinite sum converges 5 Any complex number on the unit circle can be represented as z = cos(w) - i sin(w) = e-iw,

(6.2.16)

where i = .J=T and w is the negative of the radian angle that z makes with the real axis. If the autocovariance-generating function is evaluated at this z and divided by 2n, the resulting function of w, sy(w)

= -I

2Jr

1 gy(e-'w) . gy (cos(w)- i sin(w) ) = 2Jr

(6.2.17)

is called the (power) spectrum or the spectral density (function) of [y, }, and w is referred to as the frequency. It can be shown that for absolutely summable autocovariances, all of the autocovariances can be calculated from the spectrum (hence the term autocovariancegenerating function). So there is a one-to-one mapping between the [yj) sequence and gy(z) or sy(w). The time domain approach, which concentrates on [yj) and which is what we have been doing, and the frequency domain approach, which is based on the interpretation of the spectrum, are therefore equivalent, although some results can be more easily stated in one approach than in the other. This book will not cover the frequency domain approach any further than this. If {y1 ) is white noise, then evidently gy(Z) is constant and equal to the variance. For MA(l), gr(z) is obtained by simple substitution of (6.1.2) for q = I into the definition (6.2.14): gy(Z) = Y-IZ-l +Yo+ YtZ

= (Bia 2 )z-I = a 2 · [B 1z-I = a

2

+(I+ B~)a 2 +

(Bia 2 )z

+(I + B~) +BIZ]

·(I+ Btz)(l + Biz- 1 ).

So (6.2.18) where B(z) = I +BIZ· This last expression generalizes to the MA(q) case where B(z) = I + B1z + B,z 2 + · · · + Bqzq (You should verify this by carrying out the multiplication in this expression for gy(Z) for MA(q), collecting terms by powers of z. and looking at the coefficient of zj or cj.) The expression is also good for 5 Fact: If { Yj I is absolutely summable, then it is summable (i.e., the partial sum of {Yj} converges to a finite limit).

385

Serial Correlation

MA(oo) processes if the coefficients are absolutely summable, as the following result assures us.

Proposition 6.6 (Autocovariance-generating function for filtered processes): Let {<,} be white noise and Jet >/t(L) = >/to+ >/1 1 L + >/t2 L 2 + · · · be an absolutely summable filter (i.e., with {Vtj} absolutely summable). Then the autocovariancegenerating function of the MA(oo) process {y,} where y, = /-' + >/t(L)e, is (6.2.19) More generally, Jet {x,} be covariance-stationary with absolutely summable autocovariances and gx(z) be the autocovariance-generating function of {x, }. The autocovariance-generating function of the filtered series {y,} where y, = h(L)x, (with absolutely summable {hj }) is gy(z) = h(z) gx(z) h(z- 1).

(6.2.20)

We will not prove this result; see, e.g., Theorem 4.3.1 of Fuller (1996). Because AR(p) and ARMA(p, q) processes have the MA(oo) representations, their autocovariance-generating functions can be derived easily from ( 6.2.19). Since >/t(L) = 1//t(L) = (J(L)/1J(L) for ARMA(p, q), we have I (6.2.21) AR(p): gy(Z) = a2
(6.2.22)

QUESTIONS FOR REVIEW 1.

If {y,} is the covariance-stationary solution to the AR(l) equation (6.2.1) with 1¢1 < I, what is

-,

E (y,

I I, Yt-1)

(the least squares projection)? Hint: E(y,_ 1e,) = 0. Do the projection coefficients depend on t? Is E'(y, I I, YH) = E(y, I y,_I)? Hint: E(y,_I£,) = 0 but does it mean E(e, I y,_ 1) = 0? What is E'(y, I I, Yt-J, Yt-2)? Now suppose that {y,J is the covariance-stationary solution to the AR(l) equation -, when 1¢1 > I. Is it true that E(y,_ 1£1) = 0? [Answer: No.] Does E (y, I I, y,_ 1) = c + ¢y,_ 1 as was the case when 1¢1 < I?

386

Chapter 6

2. (AR(l) with¢= -I) For simplicity, setc = Oin theAR(l) equation (6.2.1). Show that the AR(l) equation with¢ = -I has no covariance-stationary solution. Hint: The same sort of successive substitution we did for the case ¢ = I yields Yt-

.

(-1) 1 Yt-j

=£,-Bt-l+···+

"-1

(-1) 1

Bt-j+l·

From this, derive .

(12

( - l ) l1p · = l - - · j .

2yo

3. (About Proposition 6.4) How do we know that ¢(1) oft 0 in (6.2.8)? Hint: Use the stationarity condition. Prove (b) of Proposition 6.4. Hint: Take the expectation of both sides of (6.2.6) and exploit the covariance-stationarity of {y, }. 4. (About Proposition 6.5) Verify that y, - 11- = ,P(L)- 1 O(L)s, is a solution to the ARMA(p, q) equation. 5. ({1/Fj} for ARMA(p, q)) For AR(p), we used the equations (6.1.12) to calculate the MA coefficients {1/Fj}. For ARMA(3, I), write down the corresponding equations. From which lag j does {1/Fj} start following

which is (6.1.17) for p = 3? Do the same for ARMA(l, 2). From which j does {1/Fj} start following (6.1.17) for p =I? [Answer: j = 3.] 6. (How filtering alters the spectrum) Let h(L) be absolutely summable. By Proposition 6.2, if {x1 } is covariance-stationary with absolutely summable autocovariances, then so is {y, l where y, = h(L)x,. Using (6.2.20), verify that

7. (The spectrum is real valued) Show that the spectrum is real valued. Hint: Substitute (6.2.16) into (6.2.14). Use the following facts from complex analysis: [cos(w)- i sin(w)]j = cos(jw)- i sin(jw); sin( -w) = - sin(w).

387

Serial Correlation

6.3

Vector Processes

It is straightforward to extend all the concepts introduced so far to vector processes. A vector white noise process {e,} is a jointly covariance-stationary process satisfying

E(e,) = 0, E(e,e;) = Q (positive definite), E(e,e;_j) = 0

and (6.3.1)

for j oft 0.

Since Q is not restricted to being diagonal, there can be contemporaneous correlation between the elements of e,. Perfect correlation among the elements of e, is ruled out because Q is required to be positive definite. A vector MA(oo) process is the obvious vector version of (6.1.6): 00

Yr = J1.

+L

(6.3.2)

with '1! 0 =I,

lJ!jEr-j

j=O

where {lJ!j} are square matrices. The sequence of coefficient matrices is said to be absolutely summable if each element is absolutely summable. That is, if Vrktj is the (k, i) element of lJ!j, 00

"{lJ!j lis absolutely summable"

'* "L IVrklj I < oo for all (k. i)."

(6.3.3)

j=O

With this definition, Proposition 6.1 generalizes to the multivariate case in an obvious way. In particular, if rj (= E[(y,- JLHYr-j- JL)']) is the j-th order autocovariance matrix, then the expression for autocovariances in part (b) of Proposition 6.1 becomes 00

rj =

L

lf!j+k

S? lf!~

(j =

o,

(6.3.4)

1, 2 .... ).

k=O

(This formula also covers j =-I, -2, ... because A multivariate filter can be written as

r - j = rj.)

where (Hj l is a sequence of (not necessarily square) matrices. So if

Hklj

is the

388

Chapter 6

(k, e) element of H 1, the (k, e) element of H(L) is

The multivariate version of Proposition 6.2 is obvious, withy, = H(L) x,. Product of Filters Let A(L) and B(L) be two filters where {A1 } ism x r and {B1} is r x s so that the matrix product A1Bk can be defined. The product of two filters, D(L) = A(L)B(L) is an m x s filter whose coefficient matrix sequence {D1} is given by th~ multivariate version of the convolution formula (6. 1.12): AoBo =Do, AoB 1 +A 1 B 0 = 0

1,

A 0 B 2 +A 1 B 1 + A 2 B 0 = 0 2 ,

(6.3.5) Inverses Let A(L) and B(L) be two filters whose coefficient matrices are square. B(L) is said to be the inverse of A(L) and is denoted A(L)- 1 if A(L) B(L) =I.

(6.3.6)

For any arbitrary sequence of square matrices {A1}, the inverse of A(L) exists if A0 is nonsingular. It is easy to show that inverses are commutative (see Review Question 2), so A(L) A(L)- 1 = A(L)- 1A(L).

Absolutely Summable Inverses of Lag Polynomials A p-th degree lag matrix polynomial is (6.3.7) where {<1>1 } is a sequence of r x r matrices with P # 0. From the theory of homogeneous difference equations, the stability condition for the multivariate case is as follows:

389

Serial Correlation

All the roots of the detenninantal equation (6.3.8) are greater than I in absolute value (i.e., lie outside the unit circle). Or, equivalently, All the roots of the detenninantal equation (6.3.8') are less than I in absolute value (i.e., lie inside the unit circle). As an example, consider a first-degree lag polynomial 4>(L) = I - 4> 1 L where 4> 1 = (r/>te) is 2 x 2. Equation (6.3.8) can be written as

With the stability condition thus generalized, Proposition 6.3 generalizes in an obvious way. Let lii(L) = 4>(L)- 1• Each component of the coefficient matrix sequence {Illi) will be bounded in absolute value by a geometrically declining sequence.

The multivariate analogue of an AR(p) is a vector autoregressive process of p-th order (VAR(p)). It is the unique covariance-stationary solution under stationarity condition (6.3.8) to the following vector stochastic difference equation:

y, - 4>,y,_, - · · · - 4>pYr-p = c + e,

or

4>(L)(y1

4>(L) =I,- 4> 1 L- 4> 2L 2 - · · · - 4>PLP

where

and

-

/L) = e 1

/L

= 4>(1)- 1c, (6.3.9)

where 4>P f= 0. For example, a bivariate VAR(l) can be written out in full as a two-equation system with common regressors: Yit

= c,

+ r/>IIYI.t-1 + ¢I2Y2.r-l + ''''

Y2r

=

+ ri>21YI.r-I + r/>nY2,r-I + <2r.

c2

were h

~

... 1

=

["'" ¢21

"''2]

.

¢22

More generally, an M-variate VAR(p) is a set of M equations with Mp+l common regressors. Proposition 6.4 generalizes straightforwardly to the multivariate case.

390

Chapter 6

In particular, each element of rj is bounded in absolute value by a geometrically declining sequence. Vector autoregressions are a very popular tool for analyzing the dynamic interrelationship between key macro variables. A thorough treatment can be found in Hamilton (1994, Chapter II). The multivariate analogue of an ARMA(p, q) is a vector ARMA(p, q) (VARMA(p, q)). It is the unique covariance-stationary solution under the stationarity condition to the following vector stochastic difference equation:

y, = c + ,y,_, + zYr-2 + · · · + 4>pYt-p + e,

+

e, .. ,_, +

0ze,_z + ... + 0qet-q

or

(L)(y, -I-')= 0(L)e 1 , where 4>(L) =I- 4> 1 L- · · · - 4>PU, 0(L) =I+ 0 1 L + · .. + 0qLq,

1-' = 4>(1)- 1c,

where p

f=

0 and

(6.3.10)

e, f= 0. Proposition 6.5 generalizes in an obvious way.

Autocovariance Generating Function The multivariate version of the autocovariance-generating function for a vector covariance-stationary process {y, I is 00

Gv(z) =

L

00

rjzj = ro + L:
j=-00

J=!

The spectrum of a vector process {y 1 I is defined as sv(w) = - I G v(e- iw ). 2rr

(6.3.12)

Proposition 6.6 generalizes easily. In particular, if H(L) is an r x s absolutely summable filter and Gx(z) is the autocovariance-generating function of an sdimensional covariance-stationary process {x 1 I with absolutely summable amocovariances, then the autocovariance-generating function ofy, = H(L)x, is given by Gv(z) = H(z) G,(z) H(C 1)'. (rxr)

(rxs)

(sxs)

(sxr)

(6.3.13)

391

Serial Correlation

As special cases of this, we have vector white noise: Gv(z) = Q,

(6.3.14)

VMA(oo): Gv(z) = 'li(z) Q 'li(z- 1)', VAR(p): Gv(z) = [cf>(z)- 1 ]Q[cf>(z- 1)- 1]',

(6.3.15)

VARMA(p, q): Gv(z) = [cf>(z)- 1] [0(z)] Q [0(z- 1)]' [cf>(z- 1)- 1]'.

(6.3.16) (6.3.17)

QUESTIONS FOR REVIEW

1.

Verify (6.3.4) for vector MA(l).

2. (Commutativity of inverses) Verify the commutativity of inverses, that is,

I" {} "B(L) A(L) = I" for two square filters of dimension r with lAo I oft 0 and IBol oft 0. Hint: Set Do = I, Dj = 0 (j > I) in (6.3.5) and solve forB's. Show that this B(L) satisfies B(L) A(L) =I.

"A(L) B(L)

=

3. (Lack of commutativity for multivariate filters) Provide an example where A(L) and B(L) are 2 x 2 but A(L) B(L) oft B(L) A(L). Hint: How about A(L) = I +A 1L and B(L) = I + B 1 L? Find two square matrices A 1 and B 1

such that A 1 B1 oft B1 A1. 4. Verify the multivariate version of (6.1.15b), namely, "A(L) B(L) = D(L)"

* "B(L) = A(L)- D(L)"

* "A(L) = D(L) B(L)-

1

1 ,"

provided A 0 and B0 are nonsingular. So, to solve A(L) B(L) = D(L) for B(L), "multiply both sides from the left" by A(L)- 1 5. Show that [A(L) B(L)]- 1 = B(L)- 1A(L)- 1, provided that Ao and B 0 are nonsingular. Hint: By definition, A(L)B(L)[A(L)B(L)]- 1 =I.

392

6.4

Cnapler 6

Estimating Autoregressions

Autoregressive processes (autoregressions) are popular in econometrics, not only because they have a natural interpretation but also because they are easy to estimate. This section considers the estimation of autoregressions. Estimation of ARMA processes will be touched upon at the end.

Estimation of AR(l) Recall that an AR(l) is an MA(oo) with absolutely summable coefficients. So if [£1} is independent white noise (an i.i.d. sequence with zero mean and finite variance), it is strictly stationary and ergodic (Proposition 6.1 (d)). Letting x 1 = (1, y1_ 1 )' and fJ = (c, )',the AR(l) equation (6.2.1) can be written as a regression equation (6.4.1) We assume the sample is (y0, y 1, y 2, ... , y,), y 0 inclusive, so that (6.4.1) fort = I (which has y 0 on the right-hand side) can be included in the estimation. So the sample period is from t = I ton. We now show that all the conditions of Proposition 2.5 about the asymptotic properties of the OLS estimator of fJ with conditional homoskedasticity are satisfied here. First, obviously, linearity (Assumption 2.1) is satisfied. Second, as just noted, [y1, X 1} is jointly stationary and ergodic (Assumption 2.2). Third, since y1 _ 1 (the nonconstant regressor in x1) is a function of {£1_ 1, £1_2, ... }, it is independent of the error term £1 by the i.i.d. assumption for [£1}. Thus (6.4.2) that is, the error is conditionally homoskedastic (Assumption 2.7). Fourth, if & = x 1 · £1 = (£ 1, y1_ 1£1)', [gr} is a martingale difference sequence, as required in Assumption 2.5, because first element of E(g1 I g1_ 1, g1- 2, ... ) = EC£1 = E(£1 = 0

I gH, g1-2 •... ) I £1-1, '1-2• · · · , Y1-2£1-l, Yl-3'1-2• · · ·)

(since {£r} is i.i.d. and y1 _j is a function of (£1-j, £1_j_ 1, ... ))

393

Serial Correlation

and second element ofE(g,

I g,_ 1, g,_ 2 , •.. )

I g,_,' g,_,, ... ) E[E(ErYr-1 I Yr-J, g,_l> g,_,, .. ·) I gt-J, g,_,, · · .]

= E(E, Yt-1 =

(by the Law of Iterated Expectations)

I y,_,, g,_,, g,_,, ... ) I g,_,, g,_,, ... ] 0 (since E(E, I y,_,, E,_,, E,_,, ... , y,_,E,_J, Yr-3Er-2· ... ) = 0). (6.4.3)

= E[y,_, E(E, =

In particular, E(E,y,_ 1) = 0, sox, is orthogonal to the error term (Assumption 2.3). Finally, the rank condition about the moment matrix E(x,x;) (Assumption 2.4) is satisfied because the determinant of

' [I

E(x,x,) =

(6.4.4)

fL

is y0 > 0. This also means that E(g,g,) is nonsingular because under conditional homoskedasticity E(g,g;l = a 2 E(x,x;). So the nonsingularity condition in Assumption 2.5 is met. Thus the AR(l) with an independent white noise process satisfies all the assumptions of Proposition 2.5, so parts (a)-(c) of the propositions hold true here, with (c, ¢)'as {J, (1, y,_ 1)' as x,, and a 2 as the error variance. In particular, the OLS estimate of the error variance, n

I"''

s 2 = n- 2Le,, e, = y,- c- (i,y,_ 1 where t=l

(c, (i>)

are OLS estimates,

is consistent for a 2 .

Estimation of AR(p) Similarly for AR(p ), the autoregressive coefficients (¢ 1 , ••• , r/>p) can be consistently estimated by regressing y, on the constant and lagged y's, (y,_ 1, ... , Yr-p). We assume the sample is (Y-p+l> Y-p+'• ... , Yo. y 1 , ... , y.), inclusive of p lagged values prior to y 1, so that the AR(p) equation for t = 1 can be included and the sample period is from t = I ton. Proposition 6.7 (Estimation of AR coefficients): Let {y,} be the AR(p) process

following (6.2.6) with the stationarity condition (6.1.18). Suppose further that {E,} is independent white noise. Then the OLS estimator of (c, r/> 1, ¢ 2, ... , r/>p)

394

Chapter 6

obtained by regressing y, on the constant and p lagged values of y for the sample period oft = I, 2, ... , n is consistent and asymptotically normal. Letting fJ = (c,¢ 1,¢2 , ••• ,¢P)' andx, = (l,y,_~>····Y1 -p)',andP = OLSestimateof {J, we have

which is consistently estimated by --;.__

Avar({J) = s

2

l · (;;

n

L x,x;)

-I

,

1=1

where s 2 is the OLS estimate of the error variance given by

I~--

2

s =

n-p-

-2

I L.)Yt - c- tf>IYt-1 - · · · - t/>pYt-p) ·

(6.4.5)

t=I

The proof is to verify the assumptions of Proposition 2.5, which can be done in the same way as we did for AR(l). The only difficult part is to show that E(x,x;) is non singular (Assumption 2.4 ), which is equivalent to requiring the p x p autocovariance matrix Var(y,, ... , Yt-p+I) to be nonsingular (see Review Question l(b) below). It can be shown that the autocovariance matrix of a covariance-stationary process is nonsingular for any p if y 0 > 0 and Yj -> 0 as j -> oo 6 So Assumption 2.4, too, is met. We have not covered the maximum likelihood estimation of AR(p ), but for those of you who are curious about the connection to ML, the OLS estimator is numerically the same as the conditional Gaussian ML estimator and asymptotically equivalent to the exact Gaussian ML estimator. See Section 8.7. Choice of Lag Length

All these nice results about estimation of autoregressions assume that the order (the lag length) of autoregression, p, is known. How should we proceed if the order p is unknown? First of all, it is clear that Proposition 6.7, which is about p-th order autoregressions, is applicable to autoregressions of lower order, say r < p. That is, even if¢, op 0 but ¢,+ 1 = ¢,+ 2 = · · · = t/>p = 0, the proposition remains applicable as long as (¢ 1 , ¢,, ... , ¢,) satisfies the stationarity condition. In particular, the OLS estimates of (¢,+ 1 , ¢,+ 2 , .•. , t/>p) will converge to the true value of zero. To put the same argument differently, suppose that the true order of 6see, e.g., Proposition 5.1.1 of Brockwell and Davis (1991).

395

Serial Correlation

the autoregression is p (so 1/>p f= 0) and suppose that all we know about p is that it is less than or equal to some known integer Pmax· By setting the r in the argument just given to p and the p to Pmax• we see that Proposition 6.7 is applicable to autoregressions with Pmax lags: if the (y,) is stationary AR(p) and if we regress y, on the constant and Pmax lags of y, then the OLS coefficient estimates will be consistent and asymptotically normal. It is natural to think that the true lag length may be estimated from this OLS estimate of (¢> 1 , ••. , 1/>pm,xl· We consider two classes of rules for determining the lag length. The first is the General-to-specific sequential I rule: Start with an autoregression of Pmax lags. If the last lag is significant at some prespecified significance level (say, 10 percent), then end the procedure and set the lag length to Pmax· Otherwise drop the last lag, reestimate the autoregression with one fewer lag, and repeat the same test. If this process continues until there is only one lag in the autoregression and if that lag is insignificant, then set the lag length to 0 (no lags). This rule has the following properties, to be illustrated in a moment. Since the t -test is consistent, the lag length thus chosen will never be less than p (the true lag length) in large samples. However, the probability of overfitting (that the lag length chosen is greater than the true lag length p) is not zero, even in large samples. That is, if p is the lag length chosen by the sequential t rule, lim Prob(p < p) = 0

and

lim Prob(p > p) > 0.

To illustrate, suppose p = 2 and Pmax = 3. The sequential with three lags in the autoregression:

t

(6.4.6)

rule starts out

We test the null hypothesis that ¢>3 = 0 at the 10 percent significance level. With a probability of 10 percent, we reject the (true) null and set the lag length to 3, or accept the null with a probability of 90 percent, in large samples. In the event of accepting the null, we set ¢>3 = 0, estimate an autoregression with two lags, and test the hypothesis that ¢>2 = 0. Since ¢>2 f= 0 in truth (i.e., since p = 2), the absolute value of the !-value on the OLS estimate of ¢>2 would be very large in large samples, so we would never accept the (false) null that the second lag is zero. Therefore, in this example, Prob(p = 2) = 90% and Prob(p = 3) = 10% in large samples. There are two minor variations in the choice of the sample period with data (y,, ... , Ynl· One is to use the same sample period oft = Pmax + 1, Pmax + 2, ... , n

396

Chapter 6

throughout. The other is to let the sample period be "elastic" by expanding it as fewer and fewer lags are included. That is, when the autoregression being estimated has j lags, the sample period is 1 = j + l, j + 2, ... , n. The second rule for selecting the lag length uses either the Akaike infonnation criterion (AIC) or the Schwartz infonnation criterion (SIC), also called the Bayesian infonnation criterion (BIC). This procedure sets the lag length to the j that minimizes 1og(SSR1 jn)

+ (j + l)C(n)/n,

(6.4.7)

over j = 0, l, 2, ... , Pmaxo where SSR1 is the sum of squared residuals for the autoregression with j lags. In this formula, "j + l" enters as the number of parameters (the coefficients of the constant and j lags). For the AIC, C(n) = 2 while for the BIC, C(n) = log(n). A few comments about this information-criterion-based rule: • As the number of parameters increases with j, the first term of the objective function (6.4.7) declines because the fit of the equation gets better, but the second term increases. The information criterion strikes a balance between a better fit and model parsimony. • Without an upper bound Pmaxo a ridiculously large value of j might be chosen (indeed, the information criterion achieves a minimum of -oo at j + l = n because SSR1 = 0 for j = n- l ). In the present context where the true DGP is a finite-order autoregression, a reasonable upper bound is the maximum possible lag length. • Again there can be two ways to set the sample period. You can use the fixed sample period of I = Pmax + l, Pmax + 2, ... , n throughout, in which case SSRJ is the sum of n - Pmax squared residuals. Or you can use the "elastic" sample period, in which case SSR1 is the sum of n - j squared residuals. In either case, there is yet another variation, which is to replace the n in the information criterion (6.4.7) by the actual sample size. So you replace the n by n - Pmax if the fixed sample is to be used. Based on simulations and some theoretical considerations, Ng and Perron (2000) recommend the use of a fixed, rather than an "elastic," sample period, with n - Pmax replacing n in the information criterion. Let fiA1c and fj 61c be the lag lengths chosen by the AIC and BIC, respectively. Like the lag length selected by the general-to-specific 1 rule, they are functions of data, and hence are random variables. For them it is possible to show the following (see, e.g., Lutkepohl, 1993, Section 4.3):

397

Serial Correlation

(I) PBic < PAle unless the sample size is very small. (If the fixed sample of

Pmox + I, ... , n is used and "n - Pmox" replaces n in (6.4.7), this is true for any finite sample with n > Pmax + 8.) This is an algebraic result and holds for any sample on y. t =

(2) Suppose that {y,) is stationary AR(p) and {E,) is independent white noise with a finite fourth moment. Then lim PBIC = p, n---J.OO

lim Prob(PAIC < p) = 0, n---J.OO

lim Prob(PAIC > p) > 0. n---J.OO

(6.4.8) That is, just like the general-to-specific sequential t rule, the AIC rule has a positive probability of overfilling. In contrast, the BIC rule is consistent. Furthermore, Hannan and Deistler (1988, Theorem 5.4.l(c) and its Remark I) have shown that the consistency of PBIC holds when the upper bound Pmax is made to increase at a rate of log(n) (i.e., when it is set to [c log(n)] [the integer part of c log(n)] for any given c > 0). This means that you do not need to know an upper bound in the consistent estimation by BIC of the order of an autoregression. Estimation of VARs

Estimation ofVAR coefficients is equally easy. If y, is M -dimensional, the VAR(p) (6.3.9) is an M -equation system and can be written as Yrm =

x;am + Erm

where Yrm is the m-th element of y,,

Erm

(m = I, 2, ... , M),

(6.4.9)

is the m-th element of E,, and

I

t/J lm

x, (Mp+l)x I Yt-p

Cm

= m-th element of c,

t/J}m

= m-th row of j.

So the M equations share the same set of regressors. It is easy to verify that x, is orthogonal to the error term in each equation and that the error is conditionally homoskedastic (the proof is very similar to the proof for AR(p)). So the M-equation system with i.i.d. {Er) satisfies all the conditions for the multivariate regression model of Section 4.5. Thus, efficient GMM estimation is very easy:

398

Chapter 6

just do equation-by-equation OLS. The expressions for the asymptotic variance and its consistent estimate can be obtained from Proposition 4.6 with Z;m = x, (so (4.5.13') on page 280 becomes ~ L:, x,x;). If~ is the (Mp + OM-dimensional stacked vector created from (~ 1 , ... , ~M) and if~ is the equation-by-equation OLS estimate of~. we have from (4.5.17) ~

Avar(~)

-..

]

= Q ® (;:;

n

L x,x;)

-1

,

(6.4.1 0)

t=I

where e, =

y, - c- Cii,y,_, - ... -

Ci>py,_p·

(6.4.11)

If we do not know the lag length p, we estimate it from data. In the informationcriteria-based rule, the lag length we choose is the k that minimizes log

''I) + (k M L. e,e, (I ;:;I~,

2

C(n) + M) · -n-,

(6.4.12)

1=1

over k = 0, I, ... , Pmax· Here, as in the univariate case, C(n) = 2 for the AIC and log(n) for the BIC. Exactly the same results, listed as (I) and (2) above for the univariate case, hold for the multivariate case (see, e.g, Ltitkepohl, 1993, Section 4.3). Estimation of ARMA(p, q)

Letting Ur

z,

= C'r + 81Cr-1 + 82e1-2 + · · · + 8qE:t-q•

= (1, y,_,, Y<-2• · · ·, Y<-p)',

~

= (c, ¢, ¢2, · · ·,
the univariate ARMA(p, q) equation (6.2.9) can be written as y, = z;~

+ u,.

(6.4.13)

The moving-average component of ARMA creates two problems. First, obviously, the error term u, is serially correlated (in fact, it is MA(q)). Second, since the lagged dependent variables included in z, are correlated with lags of e included in the error term, the regressors z, are not orthogonal to the error term u,. That is, serial correlation in the presence of lagged dependent variables results in biased estimation. The second problem could be dealt with by using suitably lagged

399

Serial CorrelaJion

dependent variables (y,_q-!, y,_q_ 2 , ... ) as instruments. Regarding the first problem, the method to be developed later in this chapter provides consistent estimates of the coefficient vector in the presence of serial correlation. Given a consistent estimate of 8, the MA parameters (O's) can be estimated from the residuals. This procedure, however, is not efficient. 7 We will not cover the topic of efficient estimation of ARMA(p, q) processes with Gaussian errors. Interested readers are referred to, e.g., Fuller (1996, Chapter 8) or Hamilton (1994, Chapter 5). QUESTIONS FOR REVIEW 1.

(Avar of AR coefficients) (a) For AR(I), show that

cr'

Avar(~) = - = I - ¢ } , Yo where~ is the OLS estimator of¢>. Hint: It is cr 2 times the (2, 2) element of the inverse of (6.4.4).

(b) For AR(p), let~ be the OLS estimate of (¢> 1 , ¢>2 ,

.•• ,

¢>r)'. Show that

Avar(~) = cr 2 V- 1,

where V is the p x p autocovariance matrix

Yo Y1

Y1 Yo

Y2 Y1

Yp-1 Yp-2

V = Var(y,, ... , y,_r+l) = Yr-2 Yr-1

Y1 Y2

Yo Y1

Y1 Yo

Hint: If x, is as in Proposition 6.7, then

, [ I E(x,x,) = !1-1

11-l' ] V +!1-211' ,

where 1 is a p x I vector of ones. Show that the inverse of this matrix is

7The model yields infinitely many orthogonality conditions, E(u • y ) = 0 for all s :::; t - q- I. The GMM 1 5 procedure can exploit only finitely many orthogonality conditions.

400

Chapter 6

given by

This also shows that E(x,x;) is nonsingular if and only if V is. (If V is nonsingular, then E(x,x;J is nonsingular because it is invertible. If V is singular, then there exists a p-dimensional vector d ,;, 0 such that V d = 0 and E(x,x;)f = 0 where r = ( -JLl'd, d').) 2. (Estimated VAR(l) coefficients) Consider a VAR(l) without the intercept: y, = Ay 1_ 1 + e 1 • Verify that the estimate of A by multivariate regression using the sample (y 1 , y2 , ..• , Ynl (so the sample period is t = 2, 3, ... , n) is

Hint: The coefficient vector in the m-th equation is the m-th row of A.

6.5

Asymptotics for Sample Means of Serially Correlated Processes

This section studies the asymptotic properties of the sample mean ]

n

.Y=-I> n f=l

for serially correlated processes. (It should be kept in mind that y depends on n, although the notation does not make it explicit.) For the consistency of the sample mean, we already have a sufficient condition in the Ergodic Theorem of Chapter 2. The first proposition of this section provides another sufficient condition in the form of restrictions on covariance-stationary processes. We also provide two CLTs for serially correlated processes, one for linear processes and the other for ergodic stationary processes. Chapter 2 includes a CLT for ergodic stationary processes, but it rules out serial correlation because the process is assumed to be a martingale difference sequence. The CLT for ergodic stationary processes of this section generalizes it by allowing for serial correlation.

401

Serial Correlation

LLN for Covariance-Stationary Processes Proposition 6.8 (LLN for covariance-stationary processes with vanishing autocovariances): Let (y,} be covariance-stationmy with mean fl and (yj} be the auto-

co variances of (y,}. Then

00

lim Var(.j/l y) = "

(b)

y < oo L- 1

n-oo

if (y1} is summable. 8

(6.5.1)

j=-00

Since a mean square convergence implies a convergence in probability, part (a) shows that a very mild condition on covariance-stationary processes suffices for the sample mean to be consistent for fl. To prove part (a), by Chebychev's LLN (see Section 2.1), it suffices to show that lim.- 00 Var(y) = 0, which is fairly easy (Analytical Exercise 9) once the expression for Var(y) is derived. Unlike in the case without serial correlation in (y,}, calculating the variance of y is more cumbersome because covariances between two different terms have to be taken into account: n · Var(y) = Var( .j1l y)

I n

= -[Cov(y,, Yt I = -[(yo+ Yt

n

+ · · · + Yn) + · · · + Cov(y., Yt + · · · + Yn)]

+ · · · + Yn-2 + Yn-t) + (y,

+Yo+ Yt

+ · · · + Yn-2)

+ · · · + (y._, + Yn-2 + · · · + Yt +Yo)] I = -[nyo+2(n -l)y, +···+2(n- j)yi

n

n-1

=

+ ··· +2Yn-tl

,

Yo+2~(1- ~)Yt·

(6.5.2)

j=l

(If you have difficulty deriving this, set n to, say, 3 to verify.) For each fixed j > I, the coefficient of y1 in (6.5.2) goes to I as n ---> oo, so that all the autocovariances eventually enter the sum with a coefficient of one. This is not enough to show the desired result (6.5.1). Filling in the rest of the proof of (b) is Analytical Exercise 10. For a covariance-stationary process (y,}, we define the long-run variance to be the limit as n ---> oo of Var(.j/l y) (if it exists). So part (b) of the proposition says that the long-run variance equals L:J:-oo y1 , which in tum equals the value of

8 Anderson (1971, Lemma 8.3.1, p. 460), Absolute summability of {Yj} is not needed.

402

Chapter 6

the autocovariance-generating function gy(z) at z = I (see (6.2.15)) or 2rr times the spectrum at frequency zero (see (6.2.17)). Two Central Limit Theorems

We present two CLTs to cover serial correlation. The first is a generalization of Lindeberg-Levy.

Proposition 6.9 (CLT for MA(oo)): Let y, = f.L + L:J:,o t/ljEr-j where (s,} is independent white noise and L:}~o lt/lj I < oo. Then 00

Jri (ji- f.L)

--:

N( 0, L

Yj).

(6.5.3)

j=-00

We will not prove this result (see Anderson, 1971, Theorem 7.7.8 or Brockwell and Davis, 1991, Theorem 7.1.2 for a proof), but we should not be surprised to see that Avar(v) (the variance of the limiting distribution of Jri (ji - f.L)) equals the long-run variance L:)':,_ 00 Yj· We know from Lemma 2.1 that "xn ---+d x," "Var(xn) ---+ c < oo" =} "Var(x) =c." Here, set Xn = Jri (ji- f.L). By (6.5.1), Var(xn) converges to a finite limit L yj. So the variance of the limiting distribution of .fo (y- f.L) is LYj· By Proposition 6.1 (d), the process in Proposition 6.9 is (stationary and) ergodic. Remember that ergodicity limits serial correlation by requiring that two random variables sufficiently far apart in the sequence be almost independent. But ergodic stationarity alone is not sufficient to ensure the asymptotic normality of ji. A natural question arises: is there a further restriction on ergodicity, short of requiring the process to be linear as in Proposition 6.9, that delivers asymptotic normality? The restriction that has become increasingly popular is what this book calls Gordin's condition for ergodic stationary processes. It has three parts. The first two parts follow: (a) EC_v?) < oo. (This is a restriction on [strictly] stationary processes because, strictly speaking, a stationary process might not have finite second moments.) (b) E(y, I Yr-j. Yr-j-I •... ) ---+m.,, 0 as j ---+ 00. (Since (y,} is stationary, this condition is equivalenttoE(yo I Y-j. Y-j-I· ... ) ---+m.s. 0 as j---+ oo. Because E(y, I Yt-j• Yr-j-I •... ) is a random variable, the convergence cannot be the usual convergence for real numbers.)

403

Serial Correlation

This condition implies that the unconditional mean is zero, E(y,) = 0. 9 As the forecast (the conditional expectation) is based on less and less information, it should approach the forecast based on no information, i.e., the unconditional expectation. To prepare for the third part, let I,= (y,, y,_ 1, y,_2 , ..• ) and write y, as

y, = y,- [E(y, I I,_1)- E(y, I IH)1- [E(y, I I,_2)- E(y, I I,_2)1- · · ·

I I,_ 1 )- E(y, I I,_ 1)1 [y, - E(y, I I,_Jl] + [E(y, I I,_ I)- E(y, I I,_2)1 + · · · + [E(y, I I,_/+1)- E(y, I I,_;)]+ E(y, I I,_;) (r,o + rn + · · · + r,,;-1l + E(y, I Yt-J• y,_J-1, ... ) - [E(y,

=

=

where

This

r,k

is the revision of expectations about y 1 as the information set increases

from It-k-1 to It-k· By (b), y, - (r1o + rn + · · · + r,,;-1l converges in mean square to zero as j -+ oo for each t. So y, can be written as 00

Yr

=

Lrrj· J=O

This sum is called the telescoping sum. The third part of Gordin's condition is 00

(c) L[E(r1))J 112 < oo. J~O

The telescoping sum indicates how the "shocks" represented by (r, 0 , rn, ... ) influence the current value of y. Condition (c) says, roughly speaking, that shocks that occurred a long time ago do not have disproportionately large influence. As such, the condition restricts the extent of serial correlation in {y,}. To understand the condition, take for example a zero-mean AR(l) with independent errors: y, = Yt-1

+ £1,

I
{e1 } independent white noise with o- 2 = Var(e1 ).

Evidently, Gordin's condition (a) is satisfied. Condition (b) too is satisfied because

9See Lemma 5.14 of White (1984) for a proof.

404

Chapter 6

(6.5.4) and E[(¢>j Yr-j) 2 ]--> 0 as j--> oo. The expectation revision r,j can be written as (6.5.5) So the telescoping sum for AR(l) is the MA(oo) representation. Condition (c) is satisfied because [E(r~) ] 112 =

I.Pij a

and

The CLT we have been seeking is Proposition 6.10 (Gordin's CLTforzero-mean ergodic stationary processes): 10

Suppose (y,} is stationary and ergodic and suppose Gordin's condition is satisfied. Then E(y,) = 0, the autocovariances ( yj} are absolutely summable, and 00

.fiiy-:

N(o, L Yj)· j=-00

Since Gordin's condition is satisfied if (y,} is a martingale difference sequence, which exhibits no serial correlation, Proposition 6.10 is a generalization of the ergodic stationary Martingale Difference Sequence CLT of Chapter 2. Multivariate Extension

Extension of the above results to the multivariate case is straightforward. Let

J

Y=-n

n

LYr r=1

be the sample mean of a vector process (y, }. • The multivariate version of Proposition 6.8 is that (a)

y -->m.,.

fJ if each diagonal element of rj goes to zero as j -->

00,

and

(b) limn~oo Var(y'fi y) (which is the long-run covariance matrix of (y,}) equals

L:)':,_

00

rj if {fj} is summable (i.e., if each component of rj is summable).

10 This result is due to Gordin (1969), restated as Theorem 5.15 of White (1984), The claim about the absolute summability of autocovariances is noted in footnote 19 of Hansen ( 1982).

405

Serial Correlation

• Therefore, the long-run covariance matrix of {y, l can be written as 00

00

(6.5.6) j=-00

j=I

where Gv(z) (the autocovariance-generating function) and Sy (w) (the spectrum) for the vector process {y,} are defined in Section 6.3. • The multivariate version of Proposition 6.9 is that 00

-fiiCY-1-'l--;

N(o. L rj)

(6.5.7)

j=-oo

if {y, lis vector MA(oo) with absolutely summable coefficients and {e, lis vector independent white noise. • The multivariate extension of Gordin's condition on ergodic stationary processes is obvious: 11

Gordin's condition on ergodic stationary processes (a) E(y, y;) exists and is finite. (b) E(y, I Yr-j. Yr-j-1• ... ) ~m.,. 0 as j ~DO. 00

(c) L[E(r;jrlj )] 112 is finite, where j=O

r,j

= E(y,

I Yt-j· Yr-j-1, ... ) - E(y, I Yr-j-1> Yr-j-2 •... ).

• The multivariate version of Proposition 6.10 is obvious: Suppose Gordin's condition holds for vector ergodic stationary process {y1 }. Then E(y1 ) = 0, {fj} is absolutely summable, and 00

Jny--;

N(o. L rj)·

(6.5.8)

j=-oo

QUESTIONS FOR REVIEW 1.

Show that Var(.fii ji) = 1' Var(y,, Yr-1, ... , Yt-n+l)ljn, where Var(y1, Yr-I, ... , Yr-n+I) is then x n autocovariance matrix of a covariance-stationary process {y, }.

llStated as Assumption 3.5 in Hansen (1982).

406

Chapter 6

2. (Avar(y) for MA(oo)) Verify that Avar(y) for the MA(oo) process in Proposition 6. 9 equals

Hint: It should equal gy(l) for MA(oo). 3. (Asymptotics of y for AR(l )) Consider the AR(l) process y, = c+¢>y,_ 1 +er with 1¢>1 < 1. Does y --+p j.i.? Add the assumption that(&,} is an independent white noise process. Calculate Avar(y). [Answer: [(I + ¢>)/(1 - ¢>)]y0 .]

4. (Long-run covariance matrix of filtered series) Let H(L) = H 0 + H 1L and (x,} be covariance-stationary with absolutely summable autocovariances. Verify that the long-run covariance matrix of y, = H(L) x, is given by (Ho

+ H1) G,(l) (H~ + u;).

Hint: Use (6.3.13) and (6.5.6).

5. (Long-run variance of "unit-root MA'' processes) What is the long-run variance of y, = &, - &r-I> where"' is white noise? [Answer: zero.]

6.6

Incorporating Serial Correlation in GMM

The Model and Asymptotic Results

We have now all the tools needed to introduce serial correlation in the singleequation GMM model of Chapter 3. Recall that g, in Chapter 3 (with the subscript "i" replaced by "t") is a K -dimensional vector defined as x, · &,, the product of the K -dimensional vector of instruments x, and the scalar error term &, . By Assumption 3.3 (orthogonality conditions), the mean of g, is zero. We assumed in Assumption 3.5 that (g,) was a mar1ingale difference sequence. Since no serial correlation is allowed under this assumption, the matrix S, defined to be the asymptotic variance of g (= ~ L~~I g, ), was simply the variance of g,. We can now allow for serial correlation in (g,] by relaxing Assumption 3.5 as

Assumption 3.5' (Gordin's condition restricting the degree of serial correlation): (g,] satisfies Gordin's condition. Its long-ron covariance matrix is nonsingular.

407

Serial Correlation

Then by the multivariate version of Proposition 6.10, .jii g converges to a normal distribution. The variance of this limiting distribution (namely, Avar(g)), denoted S, equals the long-run covariance matrix, 00

00

(6.6.1) j=l

j=-00

where

rj

is the j -th order autocovariance matrix

rj = E(g,g;_j)

u=

o. ±I. ±2.... ).

(6.6.2)

(Since E(g,) = 0 by the orthogonality conditions, there is no need to subtract the mean from g,.) That is, S (= Avar(g)) equals r 0 under Assumption 3.5 and L~-oo rj under Assumption 3.5'. This is the only difference between the GMM model with and without serial correlation. Accordingly, all the results we developed in Sections 3.5-3.7 carry over, provided that Sis now given by (6.6.1) and S, some consistent estimation of S, is redefined to take into account serial correlation. More specifically:

-

a

• The GMM estimator (W) remains consistent and asymptotically normal. Its asymptotic variance is consistently estimated by (3.5.2), provided that the S there is consistent for S defined in (6.6.1 ). This estimated asymptotic variance sometimes gets the adjective heteroskedasticity and autocorrelation consistent (HAC). • The GMM estimator achieves the minimum variance when p1imn-oo \V =

s-t

where S is given by (6.6.1 ).

-

• The two-step procedure described in Section 3.5 still provides an efficient GMM estimator, provided that S calculated in Step 1 is consistent for S. • With S thus properly defined, the expressions for the t, W, J, C, and LR statistics of Sections 3.5-3.7 remain valid; those statistics retain the same asymptotic distributions in the presence of serial correlation. Estimating S When Autocovariances Vanish after Finite Lags

All these results assume that you have a consistent estimate, S, of the long-run variance matrix S. Obtaining such an estimate takes some intellectual effort, particularly when autocovariances do not vanish after finite lags.

408

Chapter 6

We start our quest with consistent estimation of individual autocovariances. The natural estimator is (j = 0, I, ... , n - I)

(6.6.3)

where

g,

=Xr · S,,

£, = Yr- z;J,

3consistent for~.

If we had the true value ~ in place of the estimated value ~ in this formula, then fj would be consistent for rj under Assumptions 3.1 and 3.2, the assumptions implying ergodic stationarity for {g, ). But we have to use the estimated series llld rather than the true series to calculate autocovariances. For this reason we need some suitable fourth-moment condition (which we will not bother to state) for the estimated autocovariances to be consistent. The proof of consistency is not given here because it is very similar to the proof of Proposition 3 .4. Given the estimated autocovariances, there are two cases to consider for the calculation of S. If we know a priori that rj = 0 for j > q where q is known and finite, then clearly Scan be consistently estimated by

-

(recall: j=l

-r-

j

=

_,

r).

(6.6.4)

j=-q

More difficult is the other case where we do not know q (which may or may not be infinite). How can we estimate the long-run variance matrix, which involves possibly infinitely many parameters? There are two approaches. Using Kernels to EstimateS

Recently, several procedures have been proposed to estimate the long-run variance matrix S. 12 A class of estimators, called the kernel-based (or "nonparametric") estimators, can be expressed as a weighted average of estimated autocovariances: (6.6.5)

12 For the results to be stated in this and the next subsection to hold, we need to impose a set of technical conditions (in addition to Gordin's condition above) that further restrict the nature of serial correlation. For a statement of those conditions and a more detailed discussion of the subject treated in this and the next subsection, see Den Haan and Levin ( !9%b).

409

Serial Correlation

Here, the function k(·), which gives weights for autocovariances, is called a kernel and q (n) is called the bandwidth. The bandwidth can depend on the sample size n. The estimator (6.6.4) above is a special kernel-based estimator with q(n) = q and

k(x) =

I

for lxl
{0

for lxl > 1.

(6.6.6)

This kernel is called the truncated kernel. In the present case of unknown q, we could use the truncated kernel with a bandwidth q(n) that increases with the sample size. As q(n) increases to infinity, more and more (and ultimately all) autocovariances can be included in the calculation of S. This truncated kernel-based S, however, is not guaranteed to be positive semidefinite in finite samples. This is a problem because then the estimated asymptotic variance of the GMM estimator may not be positive semidefinite. Newey and West (1987) noted that the kernel-based estimator can be made nonnegative definite in finite samples if the kernel is the Bartlett kernel:

k(x) =

I {0

-lxl

for lxl
(6.6.7)

for lxl > 1.

The Bartlett kernel-based estimator of S is called (in econometrics) the NeweyWest estimator. For example, for q(n) = 3, the kernel-based estimator includes autocovariances up to two (not three) lags: (6.6.8) In his definitive treatment of kernel-based estimation of the long-run variance, Andrews (1991) has examined positive semidefinite kernels, namely, the class of kernels (including Bartlett's) that yield a positive semidefinite S. He shows, under some suitable regularity conditions, how the speed at which S converges (in mean square) to S depends on the kernel and q(n). The most rapid possible rate is n 215 (i.e., n 215 • (S- S) remains stochastically bounded), 13 which occurs when q(n) grows at rate n 115 (i.e., q (n )/n 115 remains bounded) and the kernel is a member of a certain subset of positive semidefinite kernels. The Bartlett kernel is not in this subset. For Bartlett, the most rapid possible rate is n 113 which occurs when q(n) grows at rate n 113 • Therefore, at least on asymptotic grounds, positive semidefinite

-

!3 A sequence of random variables {x11 } is said to be stochastically bounded and written x 11 = Op if for any there exists an M > 0 such that Prob(lx11 1 > M) < f: for all n, If a sequence converges in distribution to a random variable, then the sequence is stochastically bounded.

£

410

Chapter 6

kernels in this subset should be preferred to Bartlett. An example of such kernels is the quadratic spectral (QS) kernel: k(x) =

25 (sin(6rrx/5) 6rrxj5 12rr 2 x 2

-

) cos(6rrxj5) .

(6.6.9)

Since k(x) f= 0 for lxl > I in the QS kernel, all the estimated autocovariances fj (j = 0, I, ... , n - I) enter the calculation of S even if q(n) < n - I. To be sure, the knowledge of the growth rate is not enough to determine the value of the bandwidth for any particular data of finite length. Andrews (1991) also considered a data-dependent forrnula for determining the bandwidth that not only grows at the optimal rate but also minimizes some appropriate criterion (the asymptotic mean square error). But even in this forrnula, some parameter values need to be provided by the user, so the forrnula does not really provide automatic data-based selection of the bandwidth. ~

VARHAC The other procedure for estimating S, called the VAR Heteroskedasticity and Autocorrelation Consistent (VARHAC) estimator, was proposed by Den Haan and Levin (1996a). The idea is to fit a finite-order VAR to the K -dimensional series Ill,} and then construct the long-run covariance matrix implied by the estimated VAR. This is not to say that g, is assumed to be a finite-order VAR. In fact, in this procedure (and also in the kernel-based procedure), g, does not even have to be a linear process, Jet alone a finite-order autoregression. The VARHAC procedure consists of two steps.

Step 1: Lag length selection for each VAR equation. In this step, for each VAR equation, you determine the lag length by the Bayesian information criterion (RIC). Let gk, be the k-th element of the K -dimensional vector g,. The k-th VAR equation can be written as A.(k)8kt = 'f'Jl 8J,t-l A,(k)-

A.(k)A.(k)+ ¥'12 8J,t-2 + ''' + '~-'lp 8I,t-p A,(k)-

A.(k)-

+ "'" g,,,_, + "'" g,_,_, + ... + '~-'2p g2,t-p

-_

A.(k)A.(k)A,(k)+ '~-'KlgK,t-l + '1"'1( 2 gK,t-2 +.". + '1-'KpgK,t-p + ekt ... (k).<(k)(6610) "'' g,_, + · · · + '~"'P g,_P + ek,, .•

411

Serial Correlation

where

(j = I, 2, ... , p).

In picking the lag length by the BIC, the maximum lag length Pmax needs to be specified. In Section 6.4, when the BIC was introduced in the context of estimating the true order (lag length) of a finite-order autoregression, Pmax was set to some integer known to be greater than or equal to the true lag length. In the present case where g, is not necessarily a finiteorder VAR, Den Haan and Levin (1996a, Theorem 2(c)) show that the VARHAC estimator is consistent for S when Pmax grows at rate n 113 and recommend setting Pmax to [n 113 ] (the integer part of n 113 ). To emphasize its dependence on the sample size, we denote the maximum lag length by Pmax (n). Let SSR~k) be the sum of squared residuals (SSR) from the OLS estimation of (6.6.10) (the k-th equation in the VAR) fort = PmaxCn) + I, Pmax(n)+2, . .. ,n. 14 For p = O,SSR~) = L;~pm,(,)+l(g,,) 2 , theSSR from the regression of g,, on no regressors. Since there are pK coefficients in the k-th equation, the BIC criterion is log(SSR~) jn)

+ pK ·log(n)jn.

(6.6.11)

(Then in this expression could be replaced by the actual sample size n Pmax with no asymptotic consequences.) Let p(k) be the p that minimizes the information criterion over p = 0, I, ... , Pmax(n) for the k-th VAR equation. Also let

4>t) (j

= I, 2.... , p(k)) be the OLS estimate of the

VAR coefficients <J>j') (j = I, 2, ... , p(k)) when (6.6.10) is estimated by OLS for p = p(k). Step 2: Calculating the implied long-run variance. Let P

= max{p(l), p(2), ... , p(K)).

the largest lag length in the K equations. By construction, P :5 PmaxCn). Step 1 produces an estimated VAR which can be written as

g, (Kxl)

=

4>1

~;,_I

(KxK)(Kxl)

+ · · · + 4>p g,_p + (KxK)(Kxl)

e, (Kxl)

(t = PmaxCn) +I, PmaxCn)

+ 2, ... , n).

14 So the sample period is fixed throughout the process of picking the lag length.

(6.6.12)

412

Chapter 6

Since the lag length differs across equations, some rows of ~P (p = -

'(k)

.

I, 2, ... , P) maybe zero. Therowsof~P are related tot/>i (;=I, 2, ... , p(k)) obtained in Step 1 as '(k)

-

t/>

k-th row of ~P =

{

(KxK)

for I .3co (KxK)

-

I

n

" "" L

, ,, e,e,.

(6.6.14)

f=Pnw..-(n)+I

Recall that the long-run variance matrix of a VAR is given by (6.3.16) for z = I. The estimator of S implied by the estimated VAR (6.6.12) is then (6.6.15)

This VARHAC estimator is positive semidefinite by construction. Den Haan and Levin (1996a) show that (under suitable regularity conditions) the estimator converges to S at a faster rate than any positive semidefinite kernel-based estimator for almost all autocovariance structures of g, (which is not necessarily a finite-order VAR). In particular, if g, is finite-order VARMA, then the lag order chosen by the BIC grows at a logarithmic rate and the VARHAC estimator converges at a rate arbitrarily close ton 112 , which is faster than the maximum rate n215 achieved by positive semidefinite kernel-based estimators. QUESTIONS FOR REVIEW

1. When x, = z,, what is the efficient GMM estimator? 2. (Consequence of ignoring serial correlation) Consider the efficient GMM estimator that presumes {g,} to be serially uncorrelated. If {g,} is in fact serially correlated, is the estimator consistent? Efficient?

413

Serial Correlation

6. 7

Estimation under Conditional Homoskedasticity (Optional)

We saw in Section 3.8 that the GMM estimator reduced to the 2SLS estimator under conditional homoskedasticity. This section shows how the 2SLS can be generalized to incorporate serial correlation. It will then become clear that the GLS estimator is not a GMM estimator with serial correlation. The latter half of this section examines the relationship between the GLS and GMM. Kernel-Based Estimation of S under Conditional Homoskedasticity

The relationship between serial correlation in g, = x, · £ 1 and serial correlation in the error terrn £ 1 becomes clearer under conditional homoskedasticity. Let (6.7.1) Since {£,} is stationary by Assumptions 3.1 and 3.2, wi does not depend on t. If E(£1 ) = 0 (which is the case if x, includes the constant), then wi equals the j-th order autocovariance of {£ 1 }. Suppose the error terrn is conditionally homoskedastic: conditional homoskedasticity: E(£,£,_i I x,, x,_i) = wi

(j = 0, ±I, ±2, ... ).

(6. 7.2)

Under this additional condition, the usual argument utilizing the Law of Total Expectations can be applied to r / ri = E(g,g;_i)

(since E(g,) = 0)

= E(£1 £1-ix,X,_i) = E[E(£ 1£1-i

I x,, x,_j)x,x;_i] (by the Law of Total Expectations)

= wi E(x,x;_i).

(6.7.3)

Thus, unless E(x,x;_i) is zero, {g,} is serially correlated if and only if{£,} is. To estimate ri, we exploit the fact that r i is a product of two second moments, E(x,x;_i) and E(£,£,_j). A natural estimator ofE(x,x;_) is~ L;~i+ 1 x,x;_j· It is easy to show that

Wj

is consistent1y estimated by

*L~=j+l £,f:,_j where£, is

the residual from some consistent estimator (the proof is almost the same as in the case with no serial correlation, see Proposition 3.2). Thus, a natural estimate of ri

414

Chapter 6

under conditional homoskedasticity is (6.7.4)

Given these estimated autocovariances, the kernel-based estimation of S is exactly the same as in the case without conditional homoskedasticity described in the previous section. Data Matrix Representation of Estimated Long-Run Variance

For the purpose of relating the GMM estimator with serial correlation but with conditional homoskedasticity to conventional estimators, the data matrix representation ofS is useful. By defining ann x n matrix Q suitably, we can writeS as -S = I X'Si!X, n

(6.7.5)

where X is then x K data matrix whose t-th row is like the autocovariance matrix of {e, }: ' wo '

-

WJ

w,'

' "'2

' wo

' WJ

x;. The matrix !'2 looks much ' lVn-I '

Wn-2

(6.7.6)

Sil

(n xn)

WJ

' wo

' w,

' w,

' w,

' wo

'

' Wn-2 '

lVn-I

If we know a priori that Cov(e,, e,_ j) = 0 (so that elements of this !'2 as I

"

ri

= 0) for j > q, specify the

" ' f:,f:,_i nL

ifO < j < q,

0

if j > q.

(6.7.7)

t=J+I

-

Then the Sin (6.6.4) can be written as (6.7.5). If we do not know that there is such a q, then use the kernel-based estimator of S. For the Bartlett kernel, let

' = Wj

1. J I " ' - - L [I - -q(n) n •~i+l

'

f:tf:t-.

0

ifO < j < q(n)- I, (6.7.8)

J

if j > q(n)- I.

Then the consistent estimator (6.6.3) with (6.6.4) equals (6.7.5).

415

Serial Correlation

It is now easy to show the following extension of the 2SLS estimator to encompass serial correlation:

• The efficient GMM estimator under conditional homoskedasticity can be written as (6.7.9) where Z and yare the data matrices of the regressors and the dependent variable. This generalizes (3.8.3') on page 230. • The consistent estimator of Avar(~(S- 1 )) reduces to

(I _

[I )-11 ]--:-:::Avar(d(S-1)) = n Z'X ;; X'QX n X'Z = n · [Z'X(X'QX)- 1X'Zr'.

1

(6.7.10)

So the square roots of the diagonal elements of [Z'X(X'QX)- 1X'z]- 1 are the standard errors. This generalizes (3.8.5') on page 230. Relation to GLS

The procedure for incorporating serial correlation into the GMM estimation we have described is different from the GLS procedure. To see the difference most clearly, consider the case where x, = z1 . If there are L regressors, X'Z and X'QX in (6.7.9) are all L x L matrices, so the efficient GMM estimator that exploits the orthogonality conditions E(z, · £ 1 ) = 0 is the OLS estimator ~OLS = (Z'Z)- 1Z'y. The consistent estimate of its asymptotic variance is obtained by setting X = Z in (6.7.10): ~

Avar(~oLs) = n · (Z'Z)- 1 (Z'QZ)(Z'z)- 1 .

(6.7.11)

This may seem strange because we know from Section 1.6 that the GLS estimator is more efficient than OLS when the error is serially correlated. This is a finite sample result, but, given that we have a consistent estimator fi of Q, we could take the GLS formula and estimate d as "

---1

dm_s = (Z'Q

1

---1

Z)- Z'Q y.

Is not this estimator asymptotically more efficient than OLS in that its asymptotic variance is smaller? The problem with GLS is that consistency is not guaranteed,

416

Chapter 6

if the regressors are merely predetennined, as shown below. As we emphasized in Chapter 2, the regressors are not strictly exogenous in most time series models. It follows that GLS should not be used to correct for serial correlation in the error tennfor models lacking strict exogeneity. In particular, for the x, = z, case, the correct procedure to adjust for serial correlation is to leave the point estimate unchanged while incorporating serial correlation in the estimate of the asymptotic vanance.

That the GLS estimator is generally inconsistent can be seen as follows. To focus on the problem at hand, assume that the true autocovariance matrix is proportional to some known n x n matrix V:

As shown in Section 1.6, the GLS estimator of the coefficient li in the regression model y = Zli +e can be calculated as the OLS estimator on the transformed model Cy = CZ/i + Ce, where Cis the "square root" of v- 1 such that C'C = v- 1. But look at the t-th observation of the transformed model, which can be written as (6.7.12) where = CrtYI

+ · · · + CrnYn,

Z, = CrtZI f, = CrJCI

+ · · · + CrnZn, + ·' · + CtnC:n,

ji,

(c,1 .... , c,) is the t-th row of C.

For the GLS estimator to be consistent, it is necessary that E(z, · i',) left-hand side of this condition can be written as

0. The

E(z,. l,) =(a)+ (b), n

(a)=

L c:, E(z, · e,), s=I

(b)=

L c, · c, E(z, · ev). s¥v

By the orthogonality condition E(z, · e,) - 0, (a) is zero. For (b), because the regressors are merely predetermined and not assumed to be strictly exogenous, E(z, · ev) is not guaranteed to be zero for s # v. So (b) is not zero, unless the c's take particular values to offset the nonzero terms. There is one important special case where GLS is consistent, and that is when the error is a finite-order autoregressive process. To take the simplest case, suppose

417

Serial Correlation

that(£,} is AR(I): £, =
Var(£,, £z, ... , En)= a 2 · V, V =

--"7

I -

q,z

I

I

It is easy to verify that the square root, C, ofv-' (satisfying C'C = V) is given by

Ji - q,z C=

-
0 I

0

-
0 0 I

0

0

0

0 0 0

-
0 0 0

(6.7.13)

So the t-th transformed equation is Yr- Yr-1

= (z,-
(6.7.14)

The GLS estimate of 8, which is the OLS estimate in the regression of y, -
1.

Verify (6.7.5) for n

= 3, q = I.

2. (Sargan's statistic) Derive the generalization of Sargan's statistic (3.8.1 O') on

page 230 to the case of serially correlated errors. 3. Suppose that z, is a strict subset of x,. We have seen in Section 3.8 that the

2SLS estimator reduces to the OLS estimator. Is this true here as well? That is, does (6.7.9) reduce to (Z'Z)- 1Z'y? (Answer: No.] ~

4. (Data matrix representation of S without conditional homoskedasticity) Define fi suitably so that (6.7.5) equals the Bar1lett kernel-based estimator of S without conditional homoskedasticity, namely, (6.6.5) with (6.6.7). Hint: Write fi as (6.7.6). If the lag length q is known, w1 = £1£,_ 1 tor 0 < j < q and 0 for j > q. How would you define w1 tor the Bartlett kernel-based estimator?

418

Chapter 6

5. Consider the simple regression model: y, = f3z, +e, where {e,) is AR(l ), e, = ¢e,_ 1 +~~with known¢. Assume E(z,e,) = 0, E(z,~,) = 0, E(zt-t~,) = 0, and conditional homoskedasticity. Also, to simplify, assume E(z,) = 0. Show that

-

2 a, I + 2 Li:t
Avar(f3oLs) = 2 a,

I - ¢2

_

a,2

I

Avar(fJGLs) = al -:-:(1:-+---:¢"" ,-:-
where a}= Var(z,), a; = Var(~ 1 ), Pj = Corr(z,, Zr-j). Find a configuration of(¢, {yj}) such that Avar(fi'm.sl > Avar(fi'm.s). Explain why this is consistent with the fact that f3oLs is an efficient GMM estimator.

6.8

Application: Forward Exchange Rates as Optimal Predictors

In foreign exchange markets, the percentage difference between the forward exchange rate and the spot exchange rate is called the forward premium. The relation between the forward premium and the expected rate of currency appreciation/depreciation over the life of the forward contract has been the subject of a large number of empirical studies. The simplest hypothesis about the relationship is that the two are the same. In this section, we test this hypothesis using the same weekly data set used by Bekaert and Rodrick (1993). The data cover the period of I 975-1989 and include the following variables:

S, = spot exchange rate, stated in units of the foreign currency per dollar (e.g., I 25 yen per dollar) on the Friday of week t, F, = 30-day forward exchange rate on Friday of week t, namely, the price in foreign currency units of the dollar deliverable 30 days from Friday of week t,

S30, = the spot rate on the delivery date on a 30-day forward contract made on Friday of week t. The important feature of the data is that the maturity of the contract covers several sampling intervals; the delivery date is after the Friday of week t + 4 but before the Friday of week t + 5.

419

Serial Correlation

The Market Efficiency Hypothesis To indicate the assumptions underlying the hypothesis, consider three strategies for investing dollar funds for 30 days at date t. The first is to invest in a domestic money market instrument (e.g., U.S. Treasury bills) whose maturity is 30 days. Let i, be the rate of return over the 30-day period. The second investment strategy is to convert dollars into foreign currency units, buy a 30-day money market instrument denominated in the foreign currency whose interest rate is denoted i,', and convert back to dollars at maturity. The rate of return from this investment strategy is

S, ·(I+ i,')/530,. This strategy involves foreign exchange risk because at time t of investing you do not know S30, which is the spot rate 30 days hence. The third investment strategy differs from the second in that it hedges against exchange risk by buying dollars forward at price F, at the same time the dollar fund is converted into foreign currency units. This hedged foreign investment provides a sure return equal to

S, ·(I+ i,')/F,. Because the first and the third strategies involve no risk, arbitrage requires that the rates of return be the same: I

+ i, =

S, ·(I

+ i,')/ F,.

Taking logs of both sides and using the approximation log (I the covered interest parity equation:

.

'+ lr·•

lr=Sr-Jr

or

'

Jr -

Sr

= t·•r

-

+ x)

.

lr.

"'"x, we obtain

(6.8.1)

where lower case letters for Sand Fare natural logs. That is, the forward premium, J, - s, equals the interest rate differential. This relationship holds almost exactly in real data. In fact, people routinely calculate the forward premium as the interest rate differential. The rate of return from the second investment strategy is uncertain, but if investors are risk-neutral, its expected return is equal to the sure rate of return. Furthermore, if their expectations are rational, the expected return is the conditional expectation based on the information currently available, so E[S, · (I

+ i,')/ S30, I 1,] =

S, · (I

+ i,')/ F,,

(6.8.2)

420

Chapter 6

where E(- I /,) is the expectations operator and I, is the information set as of date t. Since S,, F,, and;: are known at date t, this equality reduces to E(F,j S30,

I /,) = I.

Again, by log approximation, F,j S30, "" I E(s30,

I /,) = j, or

E(e,

+ j, -

s30,. Therefore,

I /,) = 0 where e,

=s30, -

j,.

(6.8.3)

This is the market efficiency hypothesis. As is clear from the derivation, the hypothesis assumes risk-neutrality and rational expectations. The hypothesis is sometimes described as the forward rate being the optimal forecast of future spot rates, because the conditional expectation is the optimal forecast in that it minimizes the mean square error (see Proposition 2.7). Testing Whether the Unconditional Mean Is Zero

Taking the unconditional expectation of both sides of (6.8.3), we obtain the unconditional relationship: E(s30,) = E(j,)

or

E(e,) = 0.

(6.8.4)

That is, the forecast must be correct on average. Figure 6.1 is a plot of the forecast errore, for the yen/dollar exchange rate. It looks like a stationary series, and we proceed under the assumption that it is. 15 Because the forecast error is observable, we can test the hypothesis that its unconditional mean is zero. This is in contrast to the Fama exercise, where the forecast error (about future inflation) is observable only up to a constant. To test the hypothesis of market efficiency, we need to derive the asymptotic distribution of the sample mean, £, under the hypothesis. The task is somewhat involved because (e,} is serially correlated. To see why the forecast error has serial correlation, note that !,, while including (e,_s, e,-6, ... }, does not include (e,, ... , fr-4} because e,, which depends on s30,, cannot be observed until after t + 4. Therefore, E(e,

I fr-5, fr-6, ... )

= 0.

(6.8.5)

It follows that Cov(e,, e,_i) = 0 for j > 5 but not necessarily for j < 4. In 15 Using covered interest rate parity, t = (s30 - s ) - Ut'- it). If you believe that both the 30-day rate of 1 1 1 change s30r - sr and the interest rate differential i( - ir are stationary, you have to believe that the forecast error is stationary.

421

Serial Correlation

150,---------------------------------------,

- 100 <1>

Ol

c

""'"' 0

<1>

-"' Ol

50

c

<1>

0

<1>

"-

"C <1>

-~

-50

"'

:J

c

~ ~100

-I

0

C'l-150

"'

-200~--------------------~----------------~

1/75

1/78

1/81

1/84

1/87

12/89

Figure 6.1: Forecast Error, Yen/Dollar

the Fama exercise, the ex-post real interest rate (which is the inflation forecast error pins a constant) was serially uncorrelated because the maturity for the interest rate and the sampling interval were the same. Here, the forecast error has serial correlation because the maturity of the forward contract (30 days) is longer than the sampling interval (a week). The estimated correlogram is displayed along with the two-standard error band in Figure 6.2. 16 The standard error is calculated under the assumption that the forecast error is i.i.d. So it equals I 1.fii, as in Table 2.1. The pattern of autocorrelation is broadly consistent with the hypothesis: it stays well above two standard errors for the first four lags and generally lies within the band thereafter. Under the null of market efficiency, since serial correlation vanishes after a finite lag, it is easy to see that {t: 1} satisfies the essential part of Gordin's condition restricting serial correlation. By the Law of Iterated Expectations, (6.8.5) implies E(t:, I t:,_j, t:t-j-1 •... ) = 0 for j > 5, which immediately implies part (b) of Gordin's condition. Since the revision of expectations, r,j. is zero for j > 5, part (c) is satisfied, provided that r,j (0 < j < 4) have finite second moments. It then

16 Although

theory implies that the population mean is zero, I subtracted the sample mean in the caJculation of sample autocorrelations.

422

ChapterS

1

-"'

c: 0.8

" "'""'c:0 .-0 "'

"' " "' E

0.6 0.4

~ ~

0

0.

0.2 -------- ------

---------------~--------?

0

"'"' -0.2

- - - - - - - -- - - - - - - - - - - - - - - - - - - - - - - - - - -

0

5

10

- -- - -"' - - - - -- - - - - -- - - -

20

15

25

30

35

- - - - - - -

40

lag ~---

two standard errors

Figure 6.2: Correlogram of s30- f, Yen/Dollar

follows from Proposition 6.10 that

.foe

(6.8.6) 4

Jyo+2tn

(Yo+ 2L>! )In j=l

j=l

Furthermore, the denominator is consistently estimated by replacing y1 by its sample counterpart y1. So the ratio (6.8. 7) 4

(9o+2L9t)!n j=l

is also asymptotically standard normal. The denominator of this expression will be called the standard error of the sample mean. Table 6.1 reports results for testing the implication of the hypothesis that the unconditional mean of the forecast error is zero, for three foreign exchange rates against the dollar. The three currencies are the Deutsche mark, the British pound, and the Japanese yen. Columns I and 2 of the table report the mean and the standard deviation of the actual rate of change of the spot rate, s30, - s,, and the forward premium, J, - s,. (Because the spot rate is per dollar, a positive rate of change means that the dollar strengthened.) The monthly rates of change and the

423

Serial Correlation

Table 6.1: Means of Rates of Change, Forward Premiums, and Differences betw~en the Two: 1975-1989

Exchange rate

Means and Standard Deviations s30- s f-s Difference (expected rate (unexpected rate (actual rate of change) of change) of change)

¥/$

-4.98 (41.6)

-3.74 (3.65)

-1.25 (42.4) standard error = 3. 56

DM/$

-1.78 (40.6)

-3.91 (2.17)

2.13 (41.1) standard error = 3.26

£/$

3.59 (39.2)

2.16 (3.49)

1.43 (39.9) standard error= 3.29

NOTE: s30- s is the rate of change over 30 days and f - s is the forward premium, expressed as annualized percentage changes. Standard deviations are in parentheses. The standard errors are the values of the denominator of (6.8.7). The data are weekly, from 1975 to 1989. The sample size i• 778.

forward premium have been multiplied by I ,200 to express the rates in annualized percentages. The market efficiency hypothesis is that the forward premium is the expected rate of change of the spot rate. It is evident that the actual rate of change is much more volatile than the forward premium. The third column reports the sample mean of £ 1 , which equals the difference between the first two columns for the sample mean. The numbers in parentheses below the sample mean are the standard errors calculated as described above. In no case is there significant evidence against the null hypothesis of zero mean. Regression Tests The unconditional test, however, does not exploit the implication of market efficiency that the forecast error is orthogonal to conditioning information I,. Probably the most important variable in the information set I 1 for future spot rates is the forward rate J,. We can test whether the forecast error is uncorrelated with the forward rate by regressing £ 1 on the constant and f,. This is equivalent to running

424

Chapter 6

300

250

200

150

1/78

1/81

1/84

1/87

12/89

Figure 6.3: Yen/Dollar Spot Rate, Jan. 1975-Dec. 1989

the following regression (6.8.8) and testing whether {30 = 0 and {3 1 = I. Because the error term is the forecast error orthogonal to anything known at date t including ft. the regressors are guaranteed to be orthogonal to the error term. So it might appear that we can estimate (6.8.8) by OLS and adjust for serial correlation in the error term as indicated in Section 6.6 for correct inference. However, as the graph in Figure 6.3 suggests, the spot exchange rate looks like a process with increasing variance. As we will verify in Chapter 9, we cannot reject the hypothesis that the process has a unit root. The forward rate has the same property. (Its plot is not shown here because it is indistinguishable from the plot of the spot rate.) Thus, Assumption 3.2, which requires the regressor (/1 ) and the dependent variable (s301 ) to be stationary, is not satisfied. The problem actually runs deeper. We have observed in Figure 6.1 that the difference between s30, and ft looked like a stationary series. In the language to be introduced in Chapter 10, s301 and ft are "cointegrated" with a cointegrating vector of (I, -I), in which case the OLS estimate of {3 1 in (6.8.8) converges quickly to I. This can be verified in Figure 6.4 where s30, is plotted against f,. If a regression line is fitted, its slope

425

Serial Correlation

5.8

"c: .c: ,.,""' "' "'1ii" 0

"0 0

-

5.7 5.6 5.5 5.4 5.3

~

0

c.

"'0

"'

"'"

~

0

"'"'

5.2

... .... .. • , 4,:..• ... .• .... .,... -:· . ~

.

......

4.9

5

5.1 5

I

~

.... '4.;: o-~·· ;..p:r·;' ..

4.9 4.8 4.7 4.6

4.7

4.8

5.1 f(t).

5.2

5.3

5.4

5.5

5.6

5.7

5.8

log forward rate

Figure 6.4: Plot of s30 against f, Yen/Dollar

is 0.99. The problem from the point of view of testing market efficiency is that the OLS estimate converges to I regardless of whether or not the hypothesis is true. Therefore, the test has no power. The problem can be dealt with by estimating a different regression, s30,- s, = f3o

+ f3,(j,- s,) + e,.

(6.8.9)

Judging from the plot of s30, - s, and J, - s, (not shown), it is reasonable to assume that they are stationary, which is consistent with Assumption 3.2 (ergodic stationarity). That the equation satisfies other GMM assumptions under market efficiency can be verified as follows. For {30 = 0 and f3, = I, the error term £, equals the forecast error s30,- f, and so the orthogonality conditions E(e,) = 0 and E[ (J, - s,) · £,] = 0 (Assumption 3.3 with z, = x,) are satisfied. The identification condition is that the second moment of (I, f, - s,) be nonsingular, which is true if and only if Var(J, - s,) > 0. This is evidently satisfied; if the population variance were zero, we would not observe any variations in / 1 - s1 • This leaves us with Assumption 3.5', which requires that there is not too much serial correlation in

g,

= x, . £,

e,

= [ (j, - s,) £,

]

.

426

Chapter 6

Table 6.2: Regression Tests of Market Efficiency: 1975-1989 s301

-

s1 =

fJo + f3t(ft- s1 ) + £ 1

Regression Coefficients Currency

Constant

Wald Statistic for

Forward

Ho: fJo

prermum

= 0,

fJ,

= I

¥/$

-12.8 (4.01)

-2.10 (0.738)

0.034

18.6 (p = 0.009%)

DM/$

-13.6 (5.72)

-3.01 (1.37)

0.026

8.7 (p = 1.312%)

7.96 (3.54)

-2.02 (0.85)

0.033

12.9 (p=0.156%)

£1$

NOTE: Standard errors are in parentheses. They are heteroskedasticity-robust and allow for serial correlation up to four lags calculated by the formula (6.8.11 ). The

sample size is 778. Our results differ slightly from those in Bekaert and Hodrick ( 1993) because they used the Bartlett kernel-based estimator with q (n) = 4 to estimateS.

As already noted, since 11 includes {<1 -5, £ 1 _ 6 , .•• }, the forecast error satisfies (6.8.5). Review Question 3 will ask you to show that {g 1 } inherits the same property, namely, E(g,

I g,_5, g,-6· ... ) = 0.

(6.8.10)

So, as in the unconditional test, Gordin's condition is satisfied, provided that the relevant second moments are finite. Since E(g,ft,_) = 0 for j > 5, the long-run covariance matrix S can be estimated as in (6.6.4) with q = 4. (In the empirical exercise, you will be asked to estimate S by the Bartlett kernel and also by VARHAC.) To summarize, the efficient GMM estimator of p = (fJo, {J 1)' in (6.8.9) is OLS and the consistent estimator of its asymptotic variance is given by

-

-

-..

-

-1-..

-I

Avar(floLs) = Sxx S Sxx,

(6.8.11)

where S is given in (6.6.4) with q = 4. Table 6.2 reports the regression test results for the three currencies. The corresponding plot for the yen!$ exchange rate is in Figure 6.5. Notice that the estimated slope coefficient is not only significantly different from I but also negative, for all three currencies. The null hypothesis

427

Serial Correlation

150

"

100

"'

50

Ol t:

.<::

u

-

"' u "' "" " 2

.

.'.. ... . . •••·, .

~

~

0

.,

-50

I

. . . ............ .. ...:.....-.... .. .·.

-150 -200 -20

-15

-1 0

-

-.

-5

f(t)-s(t},

..

.. . .

. •. .•.

0-100

"'"

.'

0

5

10

forward premium

regression line

Figure 6.5: Plot of s30 - s against

f -

s, Yen/Dollar

implied by market efficiency that f!o = 0 and {! 1 = I can be rejected decisively. This is reflected in the too many observations in the north-west quadrant of Figure 6.5; too often expected dollar depreciation (negative f, - s,) is associated with actual dollar appreciation (positive s30, - s, ). QUESTIONS FOR REVIEW

1. (The standard error in the unconditional test) Lets, = s30, - f, and consider a regression of s, on the constant. To account for serial correlation, use (6.6.4) with q = 4 for S. Verify that the /-value for the hypothesis that the intercept is zero is given by (6.8.7). 2. (Lack of identification) For simplicity suppose F, is the two-week forward exchange rate so that the s30, in (6.8.9) can be replaced by s,+2· Suppose that the spot exchange rate is a random walk and I, = {s,, s,_. 1 , •.. }. What should J, be under the hypothesis? Which of the required assumptions are violated? [Answer: The identification condition.] Suppose, instead, that s, is AR(I) satisfying s, = c + ¢s,_ 1 + ry, where {ry,} is i.i.d. and 11 < I. Verify that f, =(I +
428

Chapter 6

3. Prove (6.8.10). Hint: E(e, I I,) = 0. Verify that I, includes s, {g,_ 5 , ~- 6 , .•. }. Use the Law of Iterated Expectations.

.ft and

PROBLEM SET FOR CHAPTER 6 - - - - - - - - - - - - - ANALYTICAL EXERCISES

1. (Mean square limits and their uniqueness) In Chapter 2, we defined the mean square convergence of a sequence of random variables {Zn} to a random variable z. In this and other questions, we will deal with sequences of random variables with finite second moments. For those sequences, the definition of mean square convergence

IS

DEFINITION. Let {Zn} be a sequence of random variables with E(z~) < oo.

We say that lzn} converges in mean square to a random variable z with E(z 2 ) < 00, written Zn -+m.,. z, if E[(Zn - z) 2 ] -+ 0 as n -+ 00. (a) Show that, if Zn -+m.,. z and Zn -+m.,. z', then E[(z - z') 2] = 0. Hint: z- z' = (z- Zn) + (Zn - z'). Define llxll = JE(x 2 ). Then liz- z'll 2 =liz- Znll 2 < liz- Znll

2

+ 2E[(z- Zn)(Zn- z')] + llzn- z'll 2

+ 2JE[(z- ZnPl JE[(Zn- z') 2 ] +

llzn- z'll 2

2 )) (by Cauchy-Schwartz inequality that E(xy) < JE(x 2 ) J:::-E(:-y~

= (liz- Zn II

+ llzn- z'll) 2

So liz- z'll < liz- Zn II+ llzn- z'll. This is called the triangle Inequality. (b) Show that Prob(z = z') = I. Hint: Use Chebychev's inequality: Prob(lz- z'l >e) <

~ E [liz- z'll 2 ]. £

2. (Proof of Proposition 6.1) A well-known result about mean square convergence (see, e.g., Brockwell and Davis, 1991, Proposition 2.7.1) is (i) lzn} converges in mean square if and only if E[ (zm - Zn ) 2 ] -+ 0 as m, n -+ 00.

(ii) If Xn

-+m.s. X

and Zn -+m.s. Z, then

lim E(xn) = E(x)

n---+00

and

lim E(XnZn) = E(xz).

n---+00

429

Serial Correlation

Take this result for granted in answering. (a) Prove that (6.1.6) converges in mean square as n -+ oo under the hypothesis of Proposition 6.1. Hint: Let n

Y<.n =

!-'+

L•irJe,_j· J=O

What needs to be proved is that (y, ·"} converges in mean square as n -+ oo for each t. Given (i) above, it is sufficient to prove that (assuming m > n without loss of generality) as m, n -Too.

Use the fact that an absolutely summable sequence is square summable, i.e., 00

00

I:l•hl

< oo

"* L"'f < oo,

j=O

j=O

and the fact that a sequence of real numbers (an} converges (to a finite limit) if and only if"" is a Cauchy sequence:

an

-T

a

¢}

lam - an I -T 0 as m, n

-T

00.

Set "" = L:J~o l/rf. (b) Prove that E(y,) = 1-'· Hint: Let Y<.n = /-' shown in (a) that Y<.n -+m.,. y,.

+ L:J=o l/rje<-j as in (a).

It was

(c) (Proof of part (b) of Proposition 6.1) Show that E[(y,- {t)(y<-j- {L)] = lim E[(y,,n- {t)(y,-j,n- {L)]. n~oo

Have we shown that (y,} is covariance-stationary? [Answer: Yes.] (d) (Optional) Prove part (c) of Proposition 6.1, taking the following facts from calculus for granted. (i) If (aj} is absolutely summable, then (aj} is summable (i.e., -oo < :L'f'=oaj < oo) and

Chapter 6

430 00

00

L>j

<[:lajl·

j=O

j=O

{ajd (j, k

(ii) Consider a sequence with two subscripts, < oo for each k and lets, Suppose I:}::o {sd is summable. Then

lajkl

00

00

00

L:(:[>jk) j=O

< oo

and

L:}::o

00

lajkl· Suppose

00

L(Laj,) = L(Lajk) < oo. j=O

k=O

00

=

= 0, I, 2, ... ).

k=O

k=O

j=O

Hint: Derive 00

00

00

L IYjl
j=O

k=O

3. (Autocovariances of h(L)x,) We wish to derive the autocovariances of {y,) of Proposition 6.2. As in Proposition 6.2, assume throughout this question that {x,) is covariance-stationary. First consider a weighted average of finitely many terms:

(a) Showthatforthis {Yr."l' n

n

Yj = LLh,h,yf-k+, k=O l=O

where Y/ is the j -th order autocovariance of {x,). Verify that this reduces to (6.1.2) if {x,) is white noise. Hint: If you lind this difficult, first set n = I, and then increase n to establish the pattern. (b) Now consider a weighted average of possibly infinitely many terms: y, = hox 1 + h1xr-1

+ ··· .

Taking Proposition 6.2(a) for granted, show that 00

00

Yj = L L h,h,yf-k+,

if

{hj) is absolutely summable.

k=O f=O

Hint: Yr." --+m.,. y, as n --+ oo by Proposition 6.2. Proceed as in 2(c).

431

Serial Correlation

4. (Homogeneous linear difference equations) Consider a p-th order homogeneous linear difference equation: Yi-
=0

(j

= p, P +I, ... ),

(I)

where (
Let Ak (k = I, 2, ... , K) be the distinct roots of this equation (they can be complex numbers) and rk be the multiplicity of ).k· So r 1 + r, + · · · + rK = p. The polynomial equation can be written as K

fl (I -

Ai:

1

zY'

(3)

= 0.

k=l

For example, consider the second degree polynomial equation I-

z + ~z 2

(4)

= 0.

Its root is 2 with the multiplicity of 2 because I-

z + ~z 2

=(I

-!d.

So K = I and r 1 = 2 and ). 1 = 2 in this example. As you verify below, the solution to the p-th order homogeneous equation can be written as K r*-1

Yi =

L :~::>kn. un;j

(5)

k=I n=O

The coefficients ckn (there are p of those because r 1 + · · · + rK = p) can be determined from the initial conditions which specify the values for (y0 , y 1 , ••• , Yp-J). In the above example of (4),

(6) In the important special case where all the roots are distinct, K = p and

rk

= I

432

Chapter 6

for all k, and (5) becomes p

" Yj = L.. CkO

(7)

)._k- j .

k=l

(a) For the special case of p = 2 and distinct roots, (7) is

(8) Verify that this solves the homogeneous difference equation (I) with p = 2 for any given (c 10 , c20 ). To detennine (c 10 , c 20 ), write down (8) for j = 0 and j = I, and solve for c's as a function of (y0 , y 1 , A~o "A,). (b) To learn (remember) how to deal with multiple roots, consider example

(4) above. Verify that (6) solves the second-order homogeneous difference

equation

Yj- Yj-1

+ ~Yj-2 =

0.

(c) (Optional) Prove the following lemma. LEMMA. Let 0 < ~ < I and Jet n be a non-negative integer. Then there

exist two real number.s, A and b, such that~ < b J.

(d) Suppose the stability condition holds for the p-th order homogeneous linear difference equation. Prove that there exist two numbers, A and b, such thatO < b < I and

Hint: K

IYjl

=

r,.,-1

I: I: k=l n=O

Ckn.

un;j

433

Serial Correlation K

<

rt-l

L L ickn I · (j)" · l).k"IIi k=I n=O K

rt-l

< c I : I:
where c = max{lcknll. Since l).k"II Oand I).;II < bk
IYil < c

max{Ad and b

rt-I

L L A'bi = cpA'bi k=l n=O

(Recall that p = ri + .. · + rK.) Then define A to be cpA'. You may find this sort of argument using inequalities hard to follow. If so, set p, K, - I, and (ri, ... , rK) to some specific numbers (e.g., p = 3, K = 2, 'I '2

= 2).

5. (Yule-Walker equations) In the text, the autocovariances of an AR(I) process were calculated from the MA(oo) representation. Here, we use the Yule-

Walker equations. (a) From (6.2.1') on page 376, derive

Yi- Yi-I = E[<, · (y,_i- J,t)]

(j = 0, I, 2, ... ).

(b) (Easy) From the MA representation, show that (T2

E[<,. (y,_i- J.L)] =

{

0

(j = 0)

(j =I, 2, ... ).

(c) Derive the Yule-Walker equations: Yo - YI =

2
Yi- Yi-I = 0

(9) (j = I, 2, ... ).

(10)

434

Chapter 6

(d) Calculate (yo, y 1 , ... ) from (9) and (10). Hint: First set j = I in (10) to obtain Y1 =Yo· This and (9) can be solved for (yo, YJ). 6. (AR(l) without using filters) We wish to prove, without the help of Proposition 6.2, that (y,} following the AR(l) equation (6.2.1), with the stationarity condition lc/>1 < I, must have an MA(oo) representation if it is covariancestationary. (a) Let x, = y,- fJ. so that E(x,) = 0. By successive substitution, derive from (6.2.1') (on page 376) Xr = Xr ,n

+ 4Jn Xr-n

where

(b) Show that x,,, -->m.s. x,. Hint: What needs to be shown is that E[(x,,, x,) 2 ]--> 0 as n--> oo. (x,,,- x,) 2 = cf> 2"x?_,. Because (y,} is assumed to be covariance-stationary, E(x;_,) < oo. (c) (Trivial) Recalling that the infinite sum

is defined to be the mean square limit of the partial sum x, ·"' show that y, can be written as co

y, = fJ.

+ L, j £,_j· j=O

7. (Companion form) We wish to generalize the argument just developed in question 6 to AR(p). Without loss of generality, we can assume that fJ. = 0 (or redefine y, to be the deviation from the mean of y, ). Define

~.

-

y,

£,

Yr-l

0 0

Yt-2

v, -

(px I)

(px I)

Yr-p+l

0

435

Serial Correlation

F

-

¢2

0

0

0 I

"'' 0 0

(pxp)

0

I

0

Consider the following first-order vector stochastic difference equation: ~I= F~t-I

+ v,.

(II)

In this system of p equations, called the companion form, the first equation is the same as the AR(p) equation (6.2.6') on page 379 and the rest are just identities about Yt-j (j = I, 2, ... , p- 1). (a) By successive substitution, show that ~~=~'·"+(F)"~,_,,

~~,

=

v,

+ Fv,_I + F 2v,_, + ·· · + (F)"-Iv,_,+I·

(12)

The eigenvalues or characteristic roots of a p x p matrix F are the roots of the determinantal equation (13)

It can be shown that the left-hand side of this is V- ¢IV-1 - · · · -rPr-I A- rPrSo the determinantal equation becomes the p-th order polynomial equation: (14)

(b) Verify this for p = 2. In the rest of this question, assume that all the roots of the p-th order polynomial equation (which, as just seen, are the eigenvalues of F) are distinct. Let (AI, A2 , ••• , Ap) be those distinct eigenvalues. We have from matrix algebra that there exists a nonsingular p x p matrix T such that AI 0 F=TAT-I,

0 A2

0 0

0 0

A -

(15)

(pxp)

0 0

0 0

Ar-I 0

0 Ar

436

Chapter 6

(This fact does not depend on the special form of the matrix F; all that is needed for (15) is that the A's are distinct eigenvalues of F.) From (15) it follows that

(F)"

=

T(A)"T- 1.

(c) (Very easy) Prove this for n = 3. (d) Show that~'·" ~m.,. ~' as n ~ oo if the stationarity condition about the AR(p) equation is satisfied and if (y, l is covariance-stationary. Hint: The stationarity condition requires that all the roots of (13) be less than 1 in absolute value. By (12) what needs to be shown is that (F)"~,_, ~ m.,. 0. Recall that z, ~ m.,. " if the mean square convergence occurs component-wise. (e) Show that y, has the MA(oo) representation

How is Vrj related to the characteristic roots? Hint: From (d) it immediately follows that

~' = v,

+ Fv,_ 1 + (F) 2v,_2 + · · ..

The top equation of this is the MA(oo) representation for y,. (If all the eigenvalues are not distinct, we need what is called the "Jordan

decomposition" ofF in place of (15).) 8. (AR(l) that started at I = 0) In the text, the MA(oo) representation (6.2.2) of an AR(I) process following y, = c +
defined for I = -I, -2, ... as well as for I = 0, I, 2, .... Instead, consider an AR(l) process that started at 1 = 0 with the initial value of y0 . Suppose that 1<1>1 '-I)c 2 1 +<, + <,-2+··· +<1>'- <1

+'yo.

Let J.L and (Yj} be the mean and autocovariances of the covariancestationary AR(l) process, so J.L = cj(l- )and (yj} are as in (6.2.5).

437

Serial Correlation

Show that lim E(y,) =J1,

lim Var(y,) =Yo.

t-'>- 00

lim Cov(y,,y,_,) = Yi·

and

t-'>- 00

I-'>- 00

(b) Show that (y,} is covariance-stationary if E(yo) = 11 and Var(yo) = Yo· 9. (Proof of Proposition 6.8(a)) We wish to prove Proposition 6.8(a). By Chebychev's inequality, it suffices to show that limHoo Var(ji) = 0. From (6.5.2), n-1

.

2 Var(ji) =Yo+n n~

"(I- :!.)y n

1

j==!

(since I-:!.= 0 for j = n) n

Yo

+-n2 L~( I - -nj) IY·I1 j==!

n

Yo 2 <-+n n

L IYI· 1

j=!

(a) (Optional) Prove the following result: n

lim a1 = 0

=}

1 __.,_ oo

lim

n-"'- oo

~n L

a1 = 0.

j=!

(a1} converges to 0. So (i) la1 I < M for all j, and (ii) for any given 8 > 0 there exists a positive integer N such that la1I < 8 j2 for all j > N. So Hint:

]I I<;; L j

;; Lai n

j=!

n

Ia, I

j=!

IN =-

L

n.

J=!

NM

= -n

1

n

lail +n.

L

J=N+l

+

(n- N)

n

8

-

2

IN Ja1 J<"M nL j=!

NM < --

n

1

n

+-n L " 2:_ j=N+l

8

+ -. 2

By taking n large enough, N M j n can be made less than 8 j2. (b) Taking the result in (a) as given, prove that Var(ji) --> 0 as n -+

00 .

438

Chapter 6

10. (Proof of Proposition 6.8(b)) (a) (optional) Prove the following result: I

oo

n

'\' ai < oo => lim - '\' jai = 0. Ln-oon Lj=I

j=I

Hint: n

L jai =

a1

+ 2a2 + 3a3 + · ·· + na.

j=l

= (a 1 + a2

+ · ·· +a.) + (a2 + a3 + · ·· +a.) + · · ·

+(an-I +an)+ a. n

n

n

n

= LLak < L j=I k=j

N

= L j=I

n

Lak k=j

Lak k=j

j=l

n

+

L j=N+I

n

Lak. k=j

Since {ai] is summable, (i) I L~~i ak I < M lor all j, n, and (ii) there exists a positive integer N such that I L~~i akl < s/2 for any givens > 0 for all

j, n > N (this follows because II:}~ 1 ai} is Cauchy).

(b) Taking this result as given, prove Proposition 6.8(b). EMPIRICAL EXERCISES

1. Read at least pp. 115-119 (except the last two paragraphs of p. 119) of Bekaert and Hodrick (1993) before answering. Data files DM.ASC (for the Deutsche Mark), POUND.ASC (British Pound), and YEN.ASC (Japanese Yen) contain weekly data on the following items: Column 1: the date of the observation (e.g., "19850104" is January 4, 1985) Column 2: the ask price of the dollar in units of the foreign currency in the spot market on Friday of the current week (S,) Column 3: the ask price of the dollar in units of the foreign currency in the 30-day forward market on Friday of the current week ( F,) Column 4: the bid price of the dollar in units of the foreign currency in the spot market on the delivery date on a current forward contract (S30, ).

Serial Correlation

439

The sample period is the first week of 1975 through the last week of 1989. The sample size is 778. As in the text, define s, = log(S,), f, = log(F, ), s30, = log(S30,). If / 1 is the information available on the Friday of week 1, it includes (s,, s,_ 1 , ... , f,, f 1_ 1 , ••• , s30,_ 5 , s30,_6 , •.. }. Note that s30, is not observed until after the Friday of week 1 + 4. Defines, = s30, - f,. Pick your favorite currency to answer the following questions. (a) (Library/internet work) For the foreign currency of your choice, identify the week when the absolute value of the forward premium is largest. For that week, find some measure of the domestic one-month interest rate (e.g., the one-month CD rate) for the United States and the currency's home country, to verify that the interest rate differential is as large as is indicated in the forward premium. (b) (Correlogram of (e,)) Draw the sample correlogram of e, with 40 lags. Does the autocorrelation appear to vanish after 4 lags? (It is up to you to decide whether to subtract the sample mean in the calculation of sample correlations. Theory says the population mean is zero, which you might want to impose in the calculation. In the calculation of the correlogram for the yen/$ exchange rate shown in Figure 6.2, the sample mean was subtracted.) (c) (Is the log spot rate a random walk?) Draw the sample correlogram of s,+ 1 - s, with 40 lags. For those 40 autocorrelations, use the Box-Ljung statistic to test for serial correlation. Can you reject the hypothesis that (s,} is a random walk with drift? (d) (Unconditional test) Carry out the unconditional test. Can you replicate the results of Table 6.1 for the currency? {e) (Optional, regression test with truncated kernel) Carry out the regression test. Can you replicate the results of Table 6.2 for the currency?

RATS Tip: Use LINREG with the ROBUSTERRORS option. To include autocorrelations up to the fourth, set LAGS= 4 in LINREG. The DAMP= 0 option instructs L INREG to use the truncated kernel. TSP Tip: TSP's GMM procedure has the KERNEL option, but the choice is limited to Bartlett and the kernel called Parzen. You would have to use TSP's matrix commands to do the truncated kernel estimation. (f) (Bartlett kernel) Use the Bartlett kernel-based estimator of S to do the

regression test. Newey and West (1994) provide a data-dependent auto-

440

Chapter 6

matic bandwidth selection procedure. Take for granted that the autocovariance lag length detennined by this procedure is 12 for yen!$ (so autocovariances up to the twelfth are included in the calculation of S), 8 for OM/$, and 16 for Pound/$.

-

RATS Tip: The ROBUSTERRORS, DAMP~ 1 option instructs LINREG to use Bartlett. For yen!$, for example, set LAGS = 12. TSPTip: Set het, kernel =bartlett in the GMM procedure. Set NMA ~ 12 for yen!$. The standard error for the f - s coefficient for yen!$ should be 0.6815. (g) (Optional, VARHAC) Use the VARHAC estimator for Sin the regression test. Set Pmax to the integer part of n 113 In Step I of the VARHAC, you would estimate a bivariate VAR, and for each of the two VAR equations, you would pick the lag length p by B!C on the fixed sample oft = Pmax + I, ... , T. The information criterion could be (6.6.11) or, alternatively, Jog(SSR~k) /(n- Pmaxl)

+ pK ·log(n- Pmax)/(n- Pmaxl·

(16)

Just to be clear about how you picked p in your answer, use this latter objective function in selecting p by BIC. For the case of yen!$, the lag length selected should be 4 for the first VAR equation and 6 for the second equation. For yen!$, the standard error for the f - s coefficient should be 0.8027. 2. (Continuation of the Fama exercise of Chapter 2) Now we know how to use the three-month T-bill rate to test the efficiency of the U.S. Treasury bills market using monthly data. Let :n:1 = inflation rate (in percent, annual rate) over the three-month

period ending in month

t -

I,

R, = three-month T-bill rate (in percent, annual rate) at the beginning of month t.

Consider the Fama regression for testing market efficiency:

(a) (Trivial) Why is the dependent variable :n:,+ 3 rather than, say, :n:,?

Serial Correlation

441

(b) Show that, under the efficient market hypothesis, {! 1 I and the autocovariance of {s1 } vanishes after the second lag (i.e., Yj = 0 for j > 2). (c) Let r,+3

=

R, - 1ft+3 be the ex-post real rate on the three-month T-bill. Use PA/3 in MISHKIN.ASC as the measure of the three-month inflation rate to calculate the ex-post real rate. (The timing of the variables in MISHKIN.ASC is such that a January interest rate observation uses end-ofDecember bill rate data and a January observation for a three-month inflation rate is from the December to March CPI data. So PA/3 and TB3 (threemonth T-bill rate) for the same month can be paired in the regression.) Draw the sample correlogram for the sample period of 1/53-7171. What is the prediction of the efficient market hypothesis about the correlogram?

(d) Test market efficiency by estimating the Fama regression for the sample

period of 1/53-7171. (The sample size should be 223; do not throw away two-thirds of the observations, as Fama did in his Tables &-8.) Use (6.6.4) with q = 2 for S.

-

ANSWERS TO SELECTED Q U E S T I O N S - - - - - - - - - ANALYTICAL EXERCISES

2d. Since {•h} is absolutely summable, t/lj --> 0 as j --> 00. So for any j, there exists an A > 0 such that lt/lj+ki < A for all j, k. So lt/lj+k · ifJ1i < Ait/lkl· Since {t/Jkl (and hence {At/ld) is absolutely summable, so is {t/lj+k · hl (k = 0, I, 2, ... ) for any given j. Thus by (i), 00

IYjl

=

a

2

L t/lj+kifJk

k=O

00

00

k=O

k=O

L lt/lj+kt/lkl =a' L lt/lj+klit/Jkl < oo.

Now set ajk in (ii) to lt/lj+kl ·iifJki· Then 00

00

00

I: 1aj11 =I: lt/Jkllt/lj+kl < iifJkl I: lt/ljl < oo. j=O

j=O

Let 00

M=

L lt/ljl j=O

00

and

sk

= lifJkl L lt/lj+llj=O

442

Chapter 6

Then {sd is summable because Iski < l1h I· M and {1/rkl is absolutely summable. Therefore, by (ii), 00

00

j=O

k=O

L(L 11/rj+'l · 11/r•l) < oo. This times ?e.

1/rj

<1

2 is L.:J,.o IYjl· So {yj) is absolutely summable.

is the (1, I) element ofT(A)jT- 1

Sa. I-' E(y,) = I- c+¢'E(y0 ) =(I-
<1>' ·

(since f.l = cf(l- ))

[E(yo) - J.L],

Var(y,) =(I + q,2 + q,• + ... + q,2'-2)"2 + ¢ 2' Var(y0 ) (since Cov(£,, Yo) = 0) I - ¢2' I - ¢2 "2 + q,2' Var(yo)

2

= (I - ')yo +

= Yo+

<1>

2

<1> '

Var(yo)

(since Yo=

2

2

j(l - )),

2 ' · [Var(yo) - YoL

Cov(y,, y,_j) = (j + q,H2 + ... + 2,-j-2)"2 + 2,-j Var(yo) =

q,j. (I+ q,2 +

=

q,1. . I

_

... + q,2,-2j-2)"2 +

-.2U- j)

~

q,2,-j Var(yo)

" 2 + q, 2,_1. Var(y0 )

I - ¢2 = j · (I - 2u- j))Yo + ¢ 2H Var(yo) = Yoj + ¢ 2H · [Var(yo) - Yol 2'

.

= Yj + - 1 • [Var(yo) -Yo]. EMPIRICAL EXERCISES

1c. The Ljung-Box statistic with 40 lags is 84.0 with a p-value of 0.0058%. So the hypothesis that the spot rate is a random walk with drift can be rejected. See Figure 6.6 for the correlograrn for yen/$.

1f. For the yen/$ exchange rate, the standard errors are 3. 72 for the constant and 0.68 for the f- s coefficient. The Wald statistic for the hypothesis that {30 = 0 and fJ1 = I is 21.85 (p = 0.002%).

443

Serial Correlation

-" c

u

1 0.6

=" 0.6 0

u

c 0

~ 0.4

" ~ ~

0

0.2

u

"c. E

0

"'"'

-0.2 10

5

0

15

25

20

35

30

40

lag two standard errors

Figure 6.6: Correlogram of sr+l - s,, Yen/Dollar

c

.~

u

1 0.6

=" 0.6 .Q -"' 0.4 0

u

c

-

" ~ ~

0

u

"Ec. "'"'

0.2

---------

-------------------------~----~----~--~~

...--

0 -0.2

..L-,---,--,---,--,--....,--~-.--~--,------,~--,-----,c:-'

0

1

2

3

4

5

6

7

6

9

10

11

12

lag ---- two standard errors

Figure 6. 7: Correlogram of Three-Month Ex-Post Real Rate 2c. The correlogram for the three-month ex-post real rate is shown above (Figure 6. 7). Theory predicts that serial correlation vanishes after the second lag. The correlogram is mostly consistent with this prediction.

444

Chapter 6

References Anderson, T. W., 1971, The Statistical Analysis of Time Series, New York: Wiley. Andrews, D., 1991, "Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation," Econometrica, 59,817-858. Bekaert, G., and R. Hodrick, 1993, "On Biases in the Measurement of Foreign Exchange Risk Premiums," Journn.l of International Money and Finance, 12, 115-138. Brockwell. P., and R. Davis, 1991, Time Series: Theory and Metlwds (2d ed.), Berlin: SpringerVerlag. Den Haan, W., and A. Levin, l996a, "Inferences from Parametric and Non-Parametric Covariance Matrix Estimation Procedures," Technical Working Paper Series 195, NBER. - - - · 1996b, "A Practitioner's Guide to Robust Covariance Matrix Estimation," Technical Working Paper Series 197, NBER. Fuller, W., 1996, Introduction to Statistical lime Series (2d ed.), New York: Wiley. Gordin, M. 1., 1969. "The Centra1 Limit Theorem for Stationary Processes," Soviet Math. Dokl., 10,

I 174-1176. Gourieroux, C., and A. Monfort, 1997, Time Series and Dynamic Models, Cambridge: Cambridge University Press. Hamilton, J., 1994, lime Series Analysis, Princeton: Princeton University Press. Hannan, E. J., 1970, Multiple Time Series, New York: Wiley. Hannan, E. J., and M. Deistler, 1988, The Statistical Theory of Linear Systems, New York: Wiley. Hansen, L., 1982, "Large Sample Properties of Genera1ized Method of Moments Estimators," Econo~ metrica, 50, 1029-1054. Li.itkepohl, H., 1993, lmroduction to Multiple lime Series Analysis (2d eci.), New York: SpringerVerlag. Newey, W., and K. West. 1987, "Hypothesis Testing with Efficient Method of Moments Estimation," International Economic Review, 28, 777-787. _ _ _ , 1994, "Automatic Lag Selection in Covariance Matrix Estimation," Review of Economic Studies, 61.631-653. Ng, S., and P. Perron, 2000, "A Note on Model Selection in Time Series Ana1ysis," mimeo., Boston: Boston College. White, H., 1984, Asymptotic Theory for Econometn.cians, New York: Academic.

CHAPTER 7

Extremum Estimators

ABSTRACT

In the previous chapters, the generalized method of moments (GMM) has been our choice for estimating the various models we have considered. The method of maximum likelihood (ML) has been discussed, but only in passing. In this chapter, we expand our repertoire of estimation techniques by examining a class of estimators called "extremum estimators," which includes (linear and nonlinear) least squares, (linear and nonlinear) GMM, and ML as special cases. As has been emphasized by Amemiya (1985) and others, this unified approach is useful for bringing out the common structure that underlies those apparently diverse estimation principles. Section 7.1 introduces extremum estimators. The two sections that follow develop the asymptotic properties of extremum estimators. Section 7.4 defines the trio of test statistics -the Wald, Langrange multipler, and likelihood-ratio statistics- for extremum estimators in general, and shows how they can be specialized to individual cases. Section 7.5 is a brief discussion of the computational aspects of extremum estimators. The discussion of this chapter may appear daunting, as it combines a heavy use of asymptotic theory with calculus. The essence, however, is really straightforward. If you can read graduate textbooks on micro and macroeconomics without too much difficulty. you should be able to understand most of the discussion on the second (if not the first) reading. If you are not interested in the asymptotics of extremum estimators per se but plan to study Chapter 8, there is no need to read this chapter in its entirety. Just read Section 7.1 and then try to understand the statements in Propositions 7.5, 7.6, 7.8, 7.9, and 7.11. A Note on Notation: Unlike in the previous chapters, we use the parameter vector with subscript 0 for its true value. So Do is the true parameter value, while (J represents a hypothetical parameter value. This notation is standard in the "highbrow" literature on nonlinear estimation. Also, we will use "t" rather than "i" for the observation index.

446

7.1

Chapter 7

Extremum Estimators

An estimator iJ is called an extremum estimator if there is a scalar objective function Q, (0) such that

iJ maximizes

Q,(O) subject to 0 E

e c ffi!P,

(7.1.1)

where e. called the parameter space, is the set of possible parameter values. In this book, we restrict our attention to the case where e is a subset of the finite dimensional Euclidean space JRP. The objective function Q,(O) depends not only on 0 but also on the sample or the data, (w 1, w2, ... , w,), where w, is the t-th observation and n is the sample size. In our notation the dependence of the objective function on the sample of size n is signalled by the subscript n. The maximum likelihood (ML) and the linear and nonlinear generalized method of moments (GMM) estimators are particular extremum estimators. This section presents these and other extremum estimators and indicates how they are related to each other. '

"Measurability" ot 0 To be sure, the maximization problem (7.1.1) may not necessarily have a solution. But recall from calculus the following fact: Let h : EtP ~ IR be continuous and A c JRP be a compact (closed and bounded) set. Then h has a maximum on the set A. That is, there exists an x• E A such that h(x•) > h(x) for all x in A. Therefore, if Q,(O) is continuous in 0 for any data (w,, W2, ... , w,) and e is compact, then there exists a 0 that solves the maximization problem in (7 .1.1) for any given data (w 1, ••. , w,). In the event of multiple solutions, we would choose one from them. So iJ, thus uniquely determined for any data (w 1, ••• , w,), is a function of the data. Strictly speaking, however, being a function of the vector of random variables (w 1, ... , w,) is not enough to make iJ a well-defined random variable; iJ needs to be a "measurable" function of (w 1 , ... , w,). (If a function is continuous, it is measurable.) The following lemma shows that iJ is measurable if Q,(O)is Lemma 7.1 (Existence of extremum estimators): Suppose that (i) the parameter Space e is a COmpact Subset of ITtP, (ii) Q, (0) is COntinUOUS in 0 for any data (w 1, ... , w,), and (iii) Q,(O) is a measurable function of the data for all 0 in e. Then there exists a measurable function iJ of the data that solves (7.1.1).

447

Extremum Estimators

See, e.g., Gourieroux and Monfort (1995, Property 24.1) for a proof. In all the examples of this chapter, Q.(ll) is measurable. In most applications, we do not know the upper or lower bound for the true parameter vector. Even if we do, those bounds are not included in the parameter space, so the parameter space is not closed (see the CES example below for an example). So the compactness assumption for e is something we wish to avoid. In some of the asymptotic results of this chapter, we will replace the compactness assumption by some other conditions that are satisfied in many applications. Two Classes ot Extremum Estimators In the next two sections, we wil1 develop asymptotic results for extremum estimators and then specialize those results to the following two classes of extremum estimators.

I. M-Estimators. An extremum estimator is an M-estimator if the objective function is a sample average: ]

Q.(ll) = -

n

n

L

(7.1.2)

m(w,; II).

t=I

where m is a real-valued function of (w,, II). Two examples of an M-estimator we study are the maximum likelihood (ML) and the nonlinear least squares (NLS). 2. GMM. An extremum estimator is a GMM estimator if the objective function can be written as Q.(ll) = -

I

g,(II)'Wg,(ll) with g,(ll)

2

(Kxl)

]

n

=- L n

g(w,;ll),

(7.1.3)

t=l

where Wis a K x K symmetric and positive definite matrix that defines the distance of g. (II) from zero. It can depend on the data. Maximizing this objective function is equivalent to minimi1ing the distance g,(II)'Wg,(ll), which is how we defined GMM in Chapter 3. As will become clear, the deflation of the distance by 2 simplifies the expression for the derivative of the objective function. We have introduced GMM estimation for linear models where g(w,; II) is linear in II. Below we will show that GMM estimation can be extended easily to nonlinear models. There are extremum estimators that do not fall into either class. A prominent example is classical minimum distance estimators. Their objective function can be

448

Chapter 7

written as -g, (6)'Wg, (6) but g, (6) is not necessarily a sample mean. The consistency theorem of the next section is general enough to cover this class as well. In the rest of this section, we present examples of M-estimators and GMM estimators. Each example is a combination of the model (a set of data generating processes [DGPs)) and the estimation method. Maximum Likelihood (ML)

A prime example of an M-estimator is the maximum likelihood for the case where {w,} is i.i.d. The model here is a set of i.i.d. sequences {w,} where the density of w, is a member of the family of densities indexed by a finite-dimensional vector 6: j(w,; 6), 6 E 8. (Since {w,] is identically distributed, the functional form of j(-; ·)does not depend on t.) The functional form off(·;·) is known. The model is parametric because the parameter vector 6 is finite dimensional. At the true parameter value 6 0 , the density of the true DGP (the DGP that actually generated the data) is j(w,; 6 0) 1 We say that the model is correctly specified if 6 0 E 8. Since {w,} is independently distributed, the joint density of the data (w 1, W2, .•. , Wn) is n

j(w 1 , w2, ... , w.; 6o)

=IT f(w,; 6o).

(7.1.4)

f=J

With the distribution of the data thus completely specified, the natural estimation method is maximum likelihood, which proceeds as follows. With the true parameter vector 6o replaced by its hypothetical value 6, this density, viewed as a function of 6, is called the likelihood function. The maximum likelihood (ML) estimator of 60 is the 6 that maximizes the likelihood function. Since the log transformation is a monotone transformation, maximizing the likelihood function is equivalent to maximizing the log likelihood function: 2

log

j(w~o w 2, ... , w.; 6) =log

[D

f(w,; 6)] =

t.

log j(w,; 6).

(7.1.5)

The ML estimator of 6 0 , therefore, is an M-estimator with 1Had we adhered to the notational co~vention of the previous sections, the lnle parameter value would be denoted by 9 and a hypothetical value by 9. In this chapter, we use 9o for the lnle value and 9 for a hypothetical value of the parameter vector. 2 For 6 such that j(wr; 9) = 0. the log cannot be taken. This does not present a problem because such 6 never maximi7es the (level) likelihood function f(WJ, ... , Wn; 6). For such 6 we can assign an arbitrarily small number to log f(w 1 ; 6) so that the maximizer of the \eve\lilc:ellhood is also the maximization of the log likelthood. See White (1994, p. 15) for a more formal treatment of thls issue.

449

Extremum Estimators

J

m(w,; 0) =Jog f(w,; 0), that is, Qn(O) = -

n

n

L

Jog J(w,; 0). (7. 1.6)

t=l

Before illustrating ML by an example, it is useful to make three remarks. • (Serially correlated observations) The average log likelihood would not take a form as simple as this if (w,} were serially correlated. Examples of ML for serially correlated observations will be studied in the next chapter. • (Alternative estimators) Maximum likelihood is not the only way to estimate the parameter vector. For example, let JL(O) be the expectation of w, implied by the density function f(w,; 0). It is a known function ofO. By construction, the following zero-mean condition holds: E[w, - JL(Oo)] = 0.

(7.1.7)

The parameter vector 00 could be estimated by GMM with g(w,; 0) = w,- JL(O) in the GMM objective function (7. I .3). • (Efficiency of ML) As will be shown in the next two sections, ML is consistent and asymptotically normal under suitable conditions. It was widely believed that ML is efficient (i.e., achieves minimum asymptotic variance) in the class of all consistent and asymptotically normal estimators. This belief was shown to be wrong by a counterexample (described, for example, in Amemiya, 1985, Example 4.2.4). Nevertheless, ML is efficient in quite general classes of asymptotically normal estimators. One such general class is GMM estimators. In Section 7.3, we will note that the asymptotic variance of any GMM estimator of 00 is no smaller than that of the ML estimator. Example 7.1 (Estimating the mean of a normal distribution): Let the data (w 1, ••• , wn) be a scalar i.i.d. sequence with the distribution of w, given by N(/L. a 2 ). So 0 = (/L. a 2 )' and 2 f(w 1 ;/L,a)=

I [ (w,-11-)2] expv'2rra 2 2a 2

•

The average Jog likelihood of the data (w,, ... , Wn) is

L

w,-:f] .

I " I I I " [( - Llogj(w 1;/L,a 2 )=--log(2rr)--Jog(a 2 ) - n 2 2 n 2a t=l

t=I

(7. 1.8)

450

Chapter 7

The ML estimator of (JLo, aJ) is an extremum estimator with Qn (II) taken to be this average log likelihood. Suppose that the variance is assumed to be positive but that there is no a priori restriction on f.Lo, so the parameter space E> is rn: x rn:++, where ~+ is the set of positive real numbers. It is easy to check that the ML estimator of JLo is the sample mean of w1 • The GMM estimator of JLo based on the zero-mean condition E(w,- JLo) = 0, too, is the sample mean of w 1 • Conditional Maximum Likelihood In most applications, the vector w, is partitioned into two groups, y, and x,, and the researcher's interest is to examine how x1 influences the conditional distribution of y, given x,. We have been calling the variable y, the dependent variable and x, regressors. None of the results of this chapter depends on y, being a scalar; y, can be a vector and still all the results will remain valid with the scalar "y,'' replaced by a vector "y,." We nevertheless use the scalar notation, simply because in all the examples we consider in this chapter, there is only one dependent variable. Let f(y, I x,; 11 0 ) be the conditional density of y, given x,, and let j(x,; tal be the marginal density of x,. Then f(y,, x,; llo. tal= f(y, I x,; llo)f(x,; tal

(7.1.9)

is the joint density of w, = (y,, x;)'. Suppose, for now, that 11 0 and to are not functionally related (see the next paragraph for an example of a functional relationship). The average log likelihood of the data (w 1 , •.. , wn) is I

n

-n L

l=l

]

log f(w,; II, tl =n

n

L l=l

)

log f(y, I x,; II)+n

n

L

log f(x,; tl.

t=l

(7.1.10) The first term on the right-hand side is the average log conditional likelihood. The conditional ML estimator of 11 0 maximizes this first term, thus ignoring the second term. It is an M-estimator with m(w,; II)= log f(y,

I x,; II), that is,

I Qn(ll) =-

n

L n

logf(y,

I x,;ll).

1=1

(7.1.11) The second term on the right-hand side of (7.1.10) is the average log marginal likelihood. It does not depend on II, so the conditional ML estimator of 11 0 is

451

Extremum Estimators

numerically the same as the (joint or full) ML estimator that maximizes the average joint likelihood (7.1.1 0). Now suppose that Oo and 1(r 0 are functionally related. For example, Oo and 1(r 0 might be partitioned as 00 = (a~. {3~)'. 1(r 0 = ({3~. y 0)'. Then the full (i.e., unconditional) ML and conditional ML estimators are no longer numerically equal. It is intuitively clear that the conditional ML misses information that could have been obtained from the marginal likelihood. Indeed it can be shown that the conditional ML estimator of 00 is less efficient than the full ML estimator obtained from maximizing the joint log likelihood (7.1.10) (see, e.g., Gourieroux and Monfort, 1995, Section 7.5.3). In most applications, this loss of efficiency is unavoidable because we do not know, or do not wish to specify, the parametric form, f(x,; 1{r), of the marginal distribution. Here are two examples of conditional ML (other examples will be in the next chapter). Example 7.2 (Linear regression model with normal errors): In Chapter 2, we considered the linear regression model with conditional homoskedasticity. Assume further that the error term is normally distributed, so [y,, x,} is i.i.d., y, = x:Po + s,, s, I x, ~ N(O, rrJ).

(7.1.12)

The conditional likelihood of y, given x,, f(y, I x,; 0), is the density function of N(x;p, rr 2). So, with 0 = (/3'. rr 2)' and w, = (y,. x;)', them function is m(w,; 0) =log f(y, I x,; {3, rr 2)

x;p)2]) = --log(2rr)- log(rr)-- (y,- x;f3) 2 2 I [ (y,= Iog ( exp ~2rrrr2 2rr 2 I

I

2

(7.1.13)

(I

The parameter space e is JFtK X 114+· where K is the dimension of {3 and 114+ is the set of positive real numbers reflecting the a priori restriction that rrJ > 0. As was verified in Chapter I, the ML estimator of {3 0 is the OLS estimator and the ML estimator of rrJ is the sum of squared residuals divided by the sample size n. Example 7.3 (Probit): In the probit model, a scalar dependent variable y, is a binary variable, y, E [0, 1}. For example, y, = I if the wife chooses to enter the labor force andy, = 0 otherwise, and x, might include the husband's

452

Chapter 7

income. The conditional probability of y, given a vector of regressors x, is given by

I x,; llo) = 4>(X,IIo), { f(y, = 0 I x,; llo) =I- 4>(X,IIo), f(y, =I

(7.1.14)

where 4>0 is the cumulative density function of the standard normal distribution. This can be written compactly as f(y,

I x,; llo)

= 4>(x;llol'' [I - 4>(X,IIo)] 1-y'.

If there is no a priori restriction on 11 0 , the parameter space e is JRP. The ML estimator of llo is an M-estimator with them function given by m(w,;ll) = logf(y1 I x,;ll) = y,log4>(x;ll)+(l- y,)log[l -4>(x;ll)]. (7.1.15) lnvariance of ML The joint ML and conditional ML estimators have a number of desirable properties. One of them, which holds true in finite samples, is invariance. To describe invariance for an extremum estimator in general, consider reparameterizing the model by a mapping or function l = r(ll) defined on e. Let A be the range of the mappmg: A

=r(e) ={l ll = r(ll) for some II e e].

(7.1.16)

By definition, for every l in A, there exists at least one II such that l = r(ll). The mapping 1: : e -+ A is called a reparameterization if it is one-to-one; namely, for every l in A, there is only one II in e such that l = 1: (II). This unique II is therefore a function of l. This mapping from A to e is called the inverse of 1: and denoted r- 1• We say that an extremum estimator 8 is invariant to reparameterization r A

if the extremum estimator for the reparameterized model is r(ll). Let Q. (l) be the objective function associated with the reparameterized model. An extremum estimator is invariant if and only if Q.(l) = Q.(r -1 (l))

for all

leA.

(7.1.17)

Toseethis,Ieti. = r(iJ). Foranyl e A, wehaveQ.(l) = Q.(r- 1 (l)) < Q.(O) because 0 maximizes Q.(ll) one and "f- 1 (l) is in e. But Q.(O) = Q.(c 1 (l)) = Q.(i). So Q.(l) < Q.(i) for alll eA.

453

Extremum Estimators

ML (be it conditional or joint) is invariant to reparameterization because the likelihood after reparameterization is j(.; A) = f(.; T- 1(A)), which produces the objective functions that satisfy (7.1.17). Can there be an estimator that is not invariant? The answer is yes; Review Question 5 will ask you to verify that GMM is not invariant. Nonlinear Least Squares (NLS)

In NLS, as in conditional ML, the observation vector w, is partitioned into two groups, y, and x,: w, = (y,, x;)'. The model in NLS is a set of stochastic processes {y1 , x,} such that the conditional mean E(y, I x,), which is a function of x,, is a member of the parametric family of functions
+ <,,

E(<,

I x,) = 0, llo

E

e.

(7 1.18)

The most widely used estimation method to estimate llo here is least squares, which is to minimize the sum of squared residuals. Therefore, an NLS estimator, which is least squares applied to the model just described, is an M-estimator with I m(w,; II)= -[y,-
n

L [y

1 -

t=!

(7.1.19) Here maximization of Qn(ll) is the same as minimization of the sum of squared residuals. Example 7.4 (CES production function with additive errors): illustration, consider the CES production function -PO ' Q, = Ao · [ ooK 1

+ (I -

') L,-P0]-1/PO oo

+ <1 ,

As an

(7.1.20)

where Q, is output in period t, K, is the capital stock, L, is labor input, and <1 is an additive shock to the production function with E(< 1 1 K,, L,) = 0. This is a conditional mean model (7.1.18) withy,= Q,, X1 = (K 1 , L,)', 11 0 = 11 (A 0 , 80, p0)', 0, 0 < 8 < I, -I < p, which determines the parameter space e.

454

Chapter 7

Linear and Nonlinear GMM In Chapter 3, we applied GMM to linear equations. The linear equation to be estimated was (7.1.21) With x, as the vector of instruments, the orthogonality (zero-mean) conditions were (7.1.22)

E[x, · (y,- z;llo)l = 0.

The correctly specified model here is a set of ergodic stationary processes w, = (y,, z;, x;)' such that these zero-mean conditions hold for 11 0 in e. The linear GMM estimator of llo is a GMM estimator with the g function in the GMM objective function (7.1.3) given by g(w,; II)

= x, · (y, -

z;ll) = x, · y, - x,z;ll.

(7.1.23)

GMM can be readily applied to nonlinear equations. Suppose that the estima-

tion equation is nonlinear and also implicit in y,: a(y1 , z,; llo) = £,.

(7.1.24)

Just set g(w,; II) = x,·a(y,, z,; II). This estimator is called a generalized nonlinear instrumental variables estimator.' It is still a special case of GMM because the g function here can be written as a product of the vector of instruments and the error term. The properties of GMM estimators to be developed in this chapter do not rely on this special structure. As an example of nonlinear GMM, consider the nonlinear Euler equation of Hansen and Singleton (1982). Example 7.5 (Nonlinear consumption Euler equation): The Euler equation for the household optimization problem, standard in macroeconomics, is

E[Rt+i fJou'(cr+ll j I]= 1 '( ) u c,

I

'

(7.1.25)

where Rr+i is the gross ex-post rate of return (I plus the rate of return) from the asset in question, c, is consumption, {3 0 is the discounting factor, u'(c) is the marginal utility, and I, is the information available at date t. Assume for 311 is more general than the usual nonlinear instrumental variables estimator because conditional homoskedasticity is not assumed.

455

Extremum Estimators

the utility function that u(c) = c 1-"0 /(l- a 0 ). Then u'(c) = c-•o and the Euler equation becomes Cr+l

E [ a ( - - , Rr+!; O'o, fJo

)

c,

(7.1.26) If x, is a vector of variables whose values are known at date t, then x, and it follows from (7.1.26) that

E /1

(7.1.27) Taking the unconditional expectation of both sides and using the Law of Total Expectations (that E[E(x II,)]= E(x)), we obtain the orthogonality (zeromean) conditions E[g(w,; tlo)] = 0 where

Ct+l

g(w,; tlo) = x, ·a ( --;;-· Rr+!; ao, fJo

)

.

w1th

w, =

[c,;::~l

tlo =

[~:l

(7.1.28)

QUESTIONS FOR REVIEW

1. (Estimating probit model by NLS) For the probit model, verify that E(y,

I

x,) = (x;t1 0 ). Consider applying least squares to this model. Specify them function in theM-estimator objective function (7.1.2). 2. (Estimating probit model by GMM) For the probit model, verify that the orthogonality conditions E[x, · (y1 - (x;tlo) )] = 0 hold. Specify the g function in the GMM objective function (7.1.3). 3. (Estimation by ML of a GMM model?) Suppose that {y,, z,, x,} is i.i.d. in the correctly specified linear model described in the last subsection. Can you estimate tl 0 by ML? [Answer: No, because the model does not specify the parametric form for the density of w,.] 4. (E(<, 1 x,) = 0 vs. E(e, · x,) = 0 in NLS) Consider the quadratic NLS model where cp(x,; tl) = 00 +0 1x 1 +02 xl. Suppose that E(e1x 1) = 0 but E(e,x?l "I 0. Does E(<, I x,) = 0 hold? [Answer: No. If E(e, I x 1 ) = 0, then E(e,h(x, )) = 0

456

Chapter 7

for any (measurable) function h.] Would the NLS estimator of 11 0 be consistent? [Answer: No.] 5. (Lack of invariance in GMM) Consider the linear model y, = e0 z, + E:, where e0 and z, are scalars. The orthogonality conditions are E[x, · (y,- e0 z,)] = 0. Write down the GMM objective function Qn(e). Assume that E> = JR.,_+ (i.e., e0 > 0) and consider the reparameterization A = I ;e. With this repararneterization, the linear equation can be rewritten as Zr = A0 y 1 - A0 e 1 and the orthogonality conditions can be rewritten as E[x, · (z,- J. 0 y,)] = 0. Let Qn(J.) be the GMM objective functions associated with this set of orthogonality conditions. Is it the case that Qn(e) = Qn(l(e) or Qn(A) = Qn(I(J.) when Wis the same in the two objective functions? [Answer: No.]

7.2

Consistency

ln this section, we first present a set of sufficient conditions under which an extre-

mum estimator is consistent. Those conditions will then be specialized to NLS, ML,andGMM. Two Consistency Theorems lor Extremum Estimators

The objective function QnO is a random function because for each (J its value Qn(ll), being dependent on the data (w 1, ... , wn), is a random variable. The basic idea for the consistency of an extremum estimator is that if Qn(ll) converges in probability to Qo(ll), and the true parameter llo solves the "lintit problem" of maximizing the limit function Q0 (11), then the limit of the maximum 0 should be 11 0 . Sufficient conditions for the maximum of the limit to be the limit of the maximum are that the convergence in probability be "uniform" (in the sense made precise in a moment) and that the parameter space E> be compact. As already mentioned, the parameter space is not compact in most applications. We therefore provide an alternative set of consistency conditions, which does not require E> to be compact. To state the first consistency theorem for extremum estimators, we need to make precise what we mean by convergence being "uniform." If you are already familiar from calculus with the notion of uniform convergence of a sequence of functions, you will recognize that the definition about to be given is a natural extension to a sequence of random functions, QnO (n = I, 2, ... ). Pointwise convergence in probability of QnO to some nonrandom function QoO simply means plimn~oo Qn(ll) = Qo(O) for all II, namely, that the sequence of random

457

Extremum Estimators

variables IQ" (IJ) - Qo(IJ) I (n = I, 2, ... ) converges in probability to 0 for each IJ. Uniform convergence in probability is stronger. The convergence has to occur "uniformly" over the parameter space e in the following sense: (7.2.1)

sup IQn (IJ) - Qo(IJ)I --+ 0 as n --+ oo. Oee

P

As in the case of convergence in probability, uniform convergence in probability can be extended readily to sequences of vector random functions or matrix random functions (by viewing a matrix as a vector whose elements have been rearranged), by requiring uniform convergence for each element: a sequence of vector random functions {hn (-)) converges uniformly in probability to a nonrandom function h 0 (-) if each element converges uniformly. This element-by-element convergence is equivalent to convergence in the norm:

(7.2.2)

sup llhn(IJ)- ho(IJ)II--+ 0 as n--+ oo, Oee

P

where 11.11 is the usual Euclidean norm 4 In the following statement of our first consistency theorem, the conditions ensuring that IJ is well defined are labeled (i)-(iii), while those ensuring consistency are (a) and (b). A

Proposition 7.1 (Consistency with compact parameter space): Suppose that (i) e is a compact subset ofllf.P, (ii) Qn (IJ) is continuous in IJ for any data (w 1 , ... , Wn). and (iii) Qn(IJ) is a measurable function of the data for aJIIJ in e (so by Lemma 7.1 the extremum estimator jj defined in (7.1.1) is a well-defined random variable). If there is a function Q0 (1J) such that

(a) (identification) Qo(IJ) is uniquely maximized on 8 at IJo

E

8,

(b) (uniform convergence) QnO converges uniformly in probability to Qo(-), A

then IJ

-+p

IJo.

We will not prove this result; see, e.g., Amemiya (1985, Theorem 4.1.1 )for a proof. For estimates such as ML, NLS, and GMM, we will state later the identification condition (a) in more primitive terms so that the condition can be interpreted more intuitively.

4The Euclidean norm of a p-dimensional vector x =

Jxf+x5+ ··+x~..._

(XJ,

x2 .... , Xp) 1, denoted IJx(l, is defined as

458

Chapter 7

The theorem that does away with the compactness of e is Proposition 7.2 (Consistency without compactness): Suppose that (i) the true parameter vector (1 0 is an element of the interior of a convex parameter space e (c JRP), (ii) Q.(O) is concave 5 over the parameter space for any data (w 1, ...• w.) • and (iii) Q.({l) is a measurable function of the data for all (I in e. Let (I be the extremum estimator defined by (7.1.1) (wait for a moment to Jearn about its existence). If there is a function Q0 ((1) such that

.

(a) (identification) Q 0 ({1) is uniquely maximized on

e at (1 0 E e.

(b) (pointwise convergence) Q.((l) converges in probability to Q 0 ({1) for all

o E e.

then, as n -> oo,

. exists with probability approaclting 1 and (I. ->p 0

(I

0.

See Newey and McFadden (1994, pp. 2133-2134) for a proof. In most applications, e is an open convex set (as in Examples 7.1-7.4), so condition (i) is satisfied. The convergence of Q.((l) to Q 0 ((1) is required to be only pointwise, which is easier to verify in applications. The price of these niceties is the requirement that the objective function Q.(O) be concave for any data (under which pointwise convergence implies uniform convergence), but in many applications this condition is satisfied. Below we will verify that concavity is satisfied for ML for the linear regression model (after a reparameterization), probit ML, and linear GMM. Consistency of M-Estimators For an M-estimator, the objective function is given by (7.1.2). If {w1 } is ergodic stationary, the Ergodic Theorem implies pointwise convergence for Q.(O): Q.(O) converges in probability pointwise for each (I E e to Q 0 (0) given by Qo(O) = E[m(w,: (I)]

(7.2.3)

(provided, of course, that E[m(w,: 0)] exists and is finite). To apply the first consistency result (Proposition 7.1) toM-estimators, we need to show that the convergence of~ I:;= I m(w,: .) to E[m(w,: .)] is uniform. A standard method to prove uniform convergence in probability, which exploits the fact that the expression is 5 A scalar function h(x) is said to be concave if ah(XJ) + ( I - a)h(xz) ::=: h(ax1 + ( I - a)x2) for all Xt, "2· 0 ::=: a ::=: I. So a linear function is concave. Some authors use the tenn "globally concave" to distinguish it from the concept of "local concavity" that lhe Hessian evaluated at the parameter vector in question is negative semidefinite. If a function is concave, then it is continuous. So condition (ii) in this proposition is stronger lhan condition (ii) in the previous proposition.

459

Extremum Estimators

a sample mean, is the uniform law of large numbers. Its version for ergodic stationary processes (stated as Lemma 2.4 in Newey and McFadden, 1994, and as Theorem A.2.2 in White, 1994) is Lemma 7.2 (Uniform law of large numbers): Let {w,} be an ergodic stationary process. Suppose that (i) the set 8 is compact, (ii) m(w,; {I) is continuous in (I for all w,, and (iii) m(w,; {I) is measurable in w, for all (I in e. Suppose, in addition, the dominance condition: there exists a function d(w,) (sometimes called the "dominating function") such that lm(w,; 0)1 < d(w,) for all (I E e and E[d(w,)] < oo. Then ~ L:7~ 1 m(w,; .) converges uniformly in probability to E[m(w,; .)] over e. Moreover, E[m(w,; {!)] is a continuous function of (1. Since by construction lm(w,; 0)1 < sup9E8 lm(w,; 0)1 for all (I E e. we can use sup9E8 lm(w,; 0)1 for d(w,), so the above dominance condition can be stated simply as E[ sup 9E8 lm(w,; 0)11 < oo. The Uniform Law of Large Numbers can be extended immediately to vector random functions as follows: Let {w,J be an ergodic stationary process. Suppose that (i) the parameter space e is compact, (ii) h(w,; (I) is continuous in (I for all w,, and (iii) h(w,; {I) is measurable in w, for all (I in e. Suppose, in addition, the dominance condition: E[sup llh(w,; 0)111 < oo. 9e0

Then E[h(w,; (I)] is a continuous function of (I and ~ L:7~ 1 h(w,: .) converges uniformly in probability to E[h(w,; .)] over e. Just by combining this lemma with the first consistency theorem for extremum estimators, and noting that Qn ((I) is continuous if m(w,; (I) is continuous in (I and that Qn((l) is a measurable function of the data if m(w,; 0) is measurable in w,, we obtain Proposition 7.3 (Consistency ofM-estimators with compact parameter space): Let {w,} be ergodic stationary. Suppose that (i) the parameter space e is a compact subset of JR!P, (ii) m(w,; {I) is continuous in (I for all w,, and (iii) m(w,; {I) is measurable in w, for all (I in e. Let (I be theM-estimator defined by (7.1.1) and (7.1.2). Suppose, further, that A

460

Chapter 7

(a) (identification) E[m(w,; II)] is uniquely maximized on

e atll 0 E e,

(b) (dominance) E[sup 8 E8 1m(w,; 11)1] < oo. Then

0 -+p llo.

It is even more straightforward to specialize the second consistency theorem for extremum estimators toM-estimators. The objective function Qn(ll) is concave if m(w,; II) is concave in 11. 6 As noted above, the pointwise convergence of Qn(ll) to Q 0 (11) (= E[m(w,; II)]) is assured by the Ergodic Theorem if E[m(w,; II)] exists and is finite for alii!. Thus Proposition 7-4 (Consistency of M-estimators without compactness): Let [w,} be ergodic stationary. Suppose that (i) the true parameter vector 11 0 is an element of the interior of a convex parameter space e (C IR.P), (ii) m(w,; II) is concave over the parameter space for all w,, and (iii) m(w,; II) is measurable in w, for all II in 8. Lett! be theM-estimator defined by (7.1. I) and (7. I .2). Suppose, further, that

-

(a) (identification) E[m(w,; II)] is uniquely maximized on 8 atllo

E

8,

(b) E[ lm(w,; II) I] < oo (i.e., E[m(w,; II)] exists and is finite) for all II

E

Then, as n

-+

-

-

e.

oo,ll exists with probability approaching I and II -+p 11 0 •

The following example shows that probit conditional ML satisfies the concavity

condition. Example 7.6 (Concavity of the probit likelihood function): The m function for probit is given in (7.1.15) where <1>0 is the cumulative density function of the standard normal distribution. Since the normal distribution is symmetric,( -v) = I - (v) and them function can be rewritten as m(w,; II) = y, log <1>(>(,11)

+ (I

- y,) log (-X, II).

(7 .2.4)

Concavity of m is implied if both log (x;ll) and log (-X, II) are concave in II. Since x;11 is linear in II, it suffices to show concavity of log ( v ). The first derivative of log ( v) is ¢(v) d log (v) dv - (v)' 6 A fact from calc~,~lus: if f(x) and g(x) are concave functions, then so is j(x)

(7.2.5)

+ g(x).

461

Extremum Estimators

I

where ¢(v) (= '(v)) is the density function of N(O, 1). It is well known that ¢(v)/(v) is a decreasing function. So log (v) is (strictly) concave.

Concavity after Reparameterlzatlon

Proposition 7.4 can be extended easily to estimators for objective functions that are concave after reparameterization. Suppose that all the conditions of the proposition except concavity are satisfied and that there is a continuous one-to-one mapping -r(6) : 0-+ A = -r(E>) such that

is concave inland A = -r(E>) is a convex set. Let

be the objective function after this reparameterization. Clearly, all the assumptions of Proposition 7.4 are satisfied for l, A, and m(w,; l): lo = -r(6 0 ) is an interior point of A, E[m(w,; l)] is uniquely maximized at l 0 , and E[ lm(w,; l)ll < oo for alll in A. Thus for sufficiently large n, there exists an M-estimator i such that plimn_ooi = l 0 . Since Qn(l) satisfies (7.1.17) for the one-to-one mapping T, il = T- 1(l) maximizes the original objective function Qn(6). The estimator il is consistent because

n-+00

n-+00

= -r- 1(plim i)

(since -r- 1 is continuous)

n-oo (7.2.6)

Example 7.7 (Concavity of linear regression likelihood function): In the conditional ML estimation of the linear regression model, m(w,; 6) is given in (7.1.13) with 6 = ({3', a 2 )' and w, = (y,, X,)'. This function is not necessarily concave, 7 but consider a reparameterization: y = l(a and & = f3 1a. It is a continuous one-to-one mapping between e = JRK x JR,_+ and A= IRK x JR,_+· The new parameter space A is convex. With l = (&', y)',

7Forexample, the second partial derivative of m with respect to a 2 may or may not be non~negative depending on the value of y 1 -Lx;fJ. Recall that a necessary condition for a twice differentiable function h(xl, ... , Xp) to be concave is that 2hj((hj ) 2 be non~negative for all j = I, 2, ... , p.

a

462

Chapter7

the m function after reparameterization becomes

_ I m(w,; A)= - log(2rr)

2

+ log(y)-

I , z(yy,- x,~) 2 .

(7.2.7)

Since log(y) is concave in y, it is concave in A. Since v = yy, - x;~ is linear in l. and -~v 2 is concave, -!(YYr - x;0) 2 is concave in l.. So above, which is the sum of two concave functions (plus a constant), is concave in A for any w,. Therefore, provided that all the conditions of Proposition 7.4 besides concavity are satisfied for 9, 0, and m(w,; 9), the conditional ML estimator iJ is consistent.

m

Identification in NLS and ML The identification condition (a) (that E[m(w,; fl)] be uniquely maximized at flo) in these consistency theorems forM-estimators can be stated in more primitive terms for NLS and ML. NLS

Consider firstNLS, wherem(w,; {I)= -(y,-rp(x,; 9)) 2 and E(y, I x,) = rp(x,; {1 0 ). We have shown in Section 2.9 that the conditional expectation minimizes the mean squared error, that is, for any (measurable) function h(x, ), mean squared error

= E[{y,- h(x,)f] = E[{y,- E(y, I x,)1 2 ] ;::: E[ {y, - E(y, I x,) 12].

+ E[{E(y, I x,)- h(x, )1 2 ] (7.2.8)

This is just reproducing (2.9.4) in the current notation. Furthermore, the conditional expectation is essentially the unique minimizer in the following sense. Being functions of x,, h (x,) and E(y, I x,) are random variables. For two random variables X and Y, let X 'I Y mean Prob(X 'I Y) > 0 (i.e., Prob(X = Y) < I) and let X = Y mean Prob(X 'I Y) = 0. (If X and Y differ only for events whose probability is zero, we do not need to care about the difference.) Therefore, h(x,) 'I E(y1 1 x,) means Prob(h(x,) 'I E(y, I x,)) > 0. If h(x,) 'I E(y, I x,) in this sense, then {E(y, 1 x,)- h(x,)l 2 > 0 with positive probability. Consequently, E[{E(y, I x,)- h(x,)fj in (7.2.8) is positive and the mean squared error is strictly greater than E[{y,- E(y, I x, lf].

463

Extremum Estimators

Just setting h(x,) = E[IYr-
if
{ E[IYr-
I x, ),

we conclude

'I
(7.2.9)

if
Therefore, the identification condition- that E[ m(x,; II) J= - E[ {y,-
conditional mean identification:
for all II

'I llo.

(7.2.10)

ML

For ML, the role played by (7.2.8) for NLS is played by the Kullback-Leibler information inequality. Here we examine identification for conditional ML (doing the same for unconditional or joint ML is more straightforward). The hypothetical conditional density f(y, I x,; II), being a function of (y,, x,), is a random variable for any II. The expected value of the random variable log f(y, I x,; II) can be written as 8 E[log f(y,

I x,; II)]

=

J

log f(y,

I x,; ll)f(y,, Xr; llo, t

0 )dy,dx,.

(7.2.11)

Note well that the expected value is with respect to the true joint density f (y,, x,; 110 , t 0 ). The Kullback-Leibler information inequality about density functions (adapted to conditional densities) states that E[Iog f(y, { E[log f(y,

I x,; II)] I x,; II)]

< E[log f(y, = E[log f(y,

I x,; lloll I x,; llo)]

if f(y, if f(y,

I x,; II) 'I I x,; II) =

f(y, f(y,

I x,; llo). I x,; llo) (7.2.12)

(provided, of course, that E[log f(y, I x,; II)] exists and is finite). (See Analytical Exercise I for derivation.) It immediately follows that the identification condition (that E[log f(y, I x,; II)] be maximized uniquely at llo) is satisfied if

conditional density identification: f(y,

I x,; II) 'I

f(y,

for all II '111 0 .

I x,; llo) (7.2.13)

BJf f{y 1 I x1: 6) = 0, then its log cannot be defined. If this bothers you, assume f(Yt I Xt: 6) is positive for all Yt, x1, and 6. See White ( 1994, p. 9) for an argument that avoids such a simplifying assumption.

464

Chapter7

We can tum the two consistency theorems forM-estimators, Propositions 7.3 and 7.4, into ones for conditional ML. with w, = (y,. x;)' and m(w,; {/)=log f · (y, I x,; (!). Replacing the identification condition forM-estimators by conditional density identification, we obtain the following two self-contained statements about consistency of ML. Proposition 7.5 (Consistency of conditional ML with compact parameter space): Let fy,. x,} be ergodic stationary wich conditional density f(y, I x,; (1 0 ) and Jet (I be the conditional ML estimator, which maximizes che average Jog conditional likelihood (derived under che assumption chat fy,. x,} is i.i.d.):

-

-

(I=

]

argmax9E0 n

L log f(y, I x,; {/). n

t=l

Suppose the model is correctly specified so chat (1 0 is in e. Suppose chat (1} che parameter space e is a compact subset ofJRP. (ii) f(y, I x,; {I) is continuous in (I for all (y,, x,). and (iii) f(y, I x,; {/)is measurable in (y,. x,) for all (I E e (so by Lemma 7.1 (/is a well-defined random variable). Suppose, furcher, chat

-

(a) (identification) Prob[J(y, I x,; 0)

f

(b) (dominance) E[ sup8 E 8 1log f(y 1 over y, and x, ).

I x,; 0)11

Then

(I ~P

f(y,

I x,; 0 0 )]

> 0 fora!! (If 0 0 in

e.

< oo (note: che expectation is

Oo.

Proposition 7.6 (Consistency of conditional ML without compactness): Let fy,, x,} be ergodic stationary wich conditional density f (y, I x,; (/ 0 ) and let (I be che conditional ML estimator, which maximizes che average Jog conditional likelihood (derived under the assumption chat fy,, x,} is i.i.d.): ]

iJ = argmax8Ee

n

n

L log f(y, I x,; 0). t=l

Suppose che model is correctly specified so chat (1 0 is in e. Suppose that (i) che true parameter vector (1 0 is an element ofthe interior of a convex parameter space e (c JRP ). (ii) log f (y, I x,; (I) is concave in (I for all (y,, x, ). and (iii) log f (y, I x,; {I) is measurable in (y,, x,) for all (I in e. (For sufficiently large n, (I is well defined; see a few Jines below.) Suppose, furlher, that (a) (identification) Prob[J(y, I x,; 0)

f

f(y,

I x,; 0 0 )]

> 0 for all (If 0 0 in

e.

465

Extremum Estimators

(b) E[ I log f(y 1 I x,; 11)1] < oo (i.e., E[log f(y, I x,; II)] exists and is finite) for all II E 0 (note: the expectation is over Yr and x,). Then, as n ~ oo, iJ exists with probability approaching 1 and

iJ

~P llo.

There are two remarks worth making. • The Kullback-Leibler information inequality (7.2.12) holds for unconditional densities as well (the proof is easier for unconditional densities). Therefore, Propositions 7.5 and 7.6 hold for unconditional ML; just replace j(y, I x,; II) by j(w,; II). • The noteworthy feature of these two ML consistency theorems is that, despite the fact that the log likelihood is derived under the i.i.d. assumption for {y1 , x,}. consistency is assured even if {y1 , x,} is not i.i.d. but merely ergodic stationary. This is because the ML estimator is an M-estimator. An ML estimator that maxintizes a likelihood function different from the model's likelihood function is called a quasi-ML estimator or QML estimator. Put differently, a QML estimator maxintizes the likelihood function of a ntisspecified model. Therefore, the "maximum likelihood" estimators in Propositions 7.5 and 7.6 are QML estimators if {Yr. x,} is ergodic stationary but not i.i.d. The propositions say that those particular QML estimators are consistent. The identification condition in conditional ML can be stated in even more primitive terms for specific examples. Example 7.8 (Identification in linear regression model): For the linear regression model with normal errors, the log conditional likelihood for observation t is

logf(y,

I

I

I

2

2

2a 2

I x,; II)= --log(2rr)- -log(a 2 ) - -

, (y,- x,{J) 2 .

Since the log function is strictly monotonic, conditional density identification is equivalent to the condition that log j(y, I x,; II) of log j(y, I x,; 11 0 ) with positive probability if II of llo. This condition is satisfied if E(x,X,) is nonsingular (and hence positive definite). To see why, let II = (fJ', a 2 )' and llo = ({J~. aJY. Clearly, log f(y, I x,; II) of log f(y, I x,; llo) with positive probability if a 2 of aJ. So consider the case where a 2 = aJ but fJ of {J 0 • Then

466

Chapter 7

Hence, x;fJ ol x;fJ 0 with positive probability; if >
>
The last term is finite if E(x,x;) is. These results, together with the fact noted earlier that the log likelihood is concave after reparameterization, imply that the conditional ML estimator of the linear regression model with normal errors (derived under the i.i.d. assumption for {y,, x,}) is consistent if {y,, x,} is ergodic stationary and E(x,x;) is nonsingular. Example 7.9 (Identification in probit): The same conclusion just derived in the previous example for the linear model also holds for the probit model, whose conditional likelihood for observation t is

The same argument used in Example 7.8 implies that, under the nonsingularity of E(x,x;), x;o ol x;Oo with positive probability if 0 ol 00 . But since
(7.2.14)

Combining this bound for I log
467

Extremum Estimators

singularity of E(x,x;n• We therefore conclude that the probit ML estimator (whose log likelihood is derived under the i.i.d. assumption for {y,, x,}) is consistent if {y,, x,} is ergodic stationary and E(x,x;) is nonsingular. Consistency of GMM We now turn to GMM and specialize Proposition 7.1 to GMM by verifying the conditions of the proposition under a set of appropriate sufficient conditions. The GMM objective function is given by (7.1.3). Clearly, the continuity of Qn(ll) in II is satisfied if g(w,; II) is continuous in II for all w,. If {wr} is ergodic stationary, then g.(ll) (= ~ L:~~• g(w,; II)) -->p E[g(w,; II)]. So the limit function Q 0 (11) is

(7.2.15) This function is nonpositive if W (= plim W) is positive definite. It has a maximum of zero at 11 0 because E[g(w,; 11 0 )] = 0 by the orthogonality conditions. Thus, the identification condition (that Q 0 (11) be uniquely maximized at 11 0 ) in Proposition 7.1 is satisfied if E[g(w,; II)] i= 0 for II i= 11 0 . Regarding the uniform convergence of Qn(ll) to Q 0 (11), it is not hard to show that the condition is satisfied if g.O converges uniformly to E[g(w,; ·)](see Newey and McFadden, 1994, p. 2132). The multivariate version of the Uniform Law of Large Numbers stated right below Lemma 7.2, in turn, provides a sufficient condition for the uniform convergence. So we have proved Proposition 7.7 (Consistency of GMM with compact parameter space): Let {w,} be ergodic stationary and let II be the GMM estimator defined by

-

where the symmetric matrix

Wis assumed

to conve~ge in probability to some

90nly for interested readers, here is a proof: I log /(Yr I xr; 6)1 ~ IYt II log $(x;B)I ~ pog $(x;B)I

Yrlllog $( -x;B)I

+ I log$( -x;B)I

(since IYr I ~ I and II - Y1 I ~ 1)

+ 1x;o1 + 1x;01 2] (by (7.2.14)) 2 x [llog(O)I + llxrii·IIOII + llxrll 2 ·11611 2 ].

<0 2

<0

+ II -

x [I log (O)I

The last inequality is due to the Cauchy-Schwartz inequality that lx'yl :5 llxiiiiYII for any two conformable vectors x and y. Also, E(llx1 11 2 ), which is the sum of E(x~) over i, is finite because the nonsingularity of E(x1 x;) implies E(x~) < oo for all i. Jf E(llxJ 11 2 ) < oo, then ECIIx1 ID < oo.

468

Chapter 7

symmetric positive definite matrix W. Suppose that the mndel is correctly specified in that E[g(w 1; 0 0 )] = 0 (i.e., the orthogonality conditions hold) for 0 0 E EJ. Suppose that (i) the parameter space EJ is a compact subset of JRP, (ii) g(w,; 0) is continuous in 0 for all w,, and (iii) g(w,; 0) is measurable in w, for all I} in EJ (so I} is a well-defined random variable by Lemma 7. 1). Suppose, further, that

.

(a) (identification) E[g(w,; 0)]

#

0 for all 0

#

Oo in EJ,

(b) (dominance) E[sup 9 E 8 IIg(w,; 0)111 < oo. Then iJ

__.P

Oa.

If Qn(O) is concave, then the compactness of EJ, the continuity of g(w,; 0), and the dominance condition can be replaced by the condition that E[g(w,; 0)]

exist and be finite for all 0. We will not state this result as a proposition because there are very few applications in which Qn(O) is concave. For example, the GMM objective function for the Euler equation model in Example 7.5 is not concave. Just about the only well-known case is one in which g(w,; I}) is linear in 0, but the linear case has been treated rather thoroughly in Chapter 3. When g(w,; 0) is nonlinear in 0, specifying primitive conditions for (a) (identification) and (b) (dominance) in Proposition 7.7 is generally quite difficult. So in most applications those conditions are simply assumed. QUESTIONS FOR REVIEW

1.

(Necessity of density identification) We have shown for conditional ML that the conditional density identification (7.2.13) is sufficient for condition (a) (that E[log f(y, I x,; 0)] be maximized uniquely at 00 ). That it is also a necessary condition is clear from the Kullback-Leibler information inequality. Without using this inequality, show the necessity of conditional mean identification. Hint: Show that, if (7.2.13) were false, then (a) would not hold. Suppose that there exists a 0 1 # 0 0 such that 0 1 E EJ and f(y, I x,; 0 1 ) = j(y, I x,; 0 0 ). Then E[log f (y, I x,; 0 1)] = E[log f (y, I x,; Oo)].

2. (Necessity of conditional mean identification) Without using (7.2.8), show that conditional mean identification is necessary as well as sufficient for identification in NLS. 3. (Identification in NLS) Suppose that the rp function in NLS is linear: rp(x1 ; 0) = X,O. Verify that identification is satisfied ifE(x,X,) is nonsingular.

469

Extremum Estimators

4. (Identification in NLS estimation of pro bit) In the pro bit model, E(y, I x,) = (x;9 0 ). Consider estimating 90 by least squares. Verify that conditional mean identification is satisfied if E(x,x;) is non singular. 5. (Quasi-ML) Suppose in Proposition 7.5 that {y1 , x,) is i.i.d., so the likelihood function itself is not misspecified. But suppose that the true parameter vector 90 is not included in the assumed parameter space 8. So in this sense the estimator 9' is a quasi-ML estimator. Drop condition (a), and suppose instead that there is a unique 9* in 8 that maximizes E[log f(y, x,; 9)]. Show that the QML estimator iJ is consistent for 9*. Hint: Replace the 90 in Proposition 7.3 by 9'. In Propositions 7.1-7.4, 9o does not need to be the "true" parameter vector. 1

6. (Identification in linear GMM) Consider the linear GMM model of Chapter 3. Verify that the rank condition for identification derived in Chapter 3 is equivalent to the identification condition in Proposition 7. 7. 7. (Concavity of the objective function in linear GMM) Verify that the GMM objective function in linear GMM is concave in 9. Hint: The Hessian (the ~ L:~~~ x,z;. matrix of second derivatives) of Qn(9) is -S:U WS., where Sxz A twice continuously differentiable function is concave if and only if its Hessian is everywhere negative semidefinite.

=

8. (GMM identification with singular W) For W, if we require it to be merely

positive semidefinite, not necessarily positive definite, what would be the identification condition in Proposition 7.7? Hint: Suppose W is positive semidefinite. Show that if E[g(w,; 9 0 )] = 0 and WE[g(w,; 9)] oft 0 for 9 oft 90 , then Q0 (9) = E[g(w,; 9)]'W E[g(w,; 9)] has a unique maximum at 9 0 .

-4

7.3 Asymptotic Normality The basic tool in deriving the asymptotic distribution of an extremum estimator is the Mean Value Theorem from calculus. How it is used depends on whether the estimator is an M-estimator or a GMM estimator. Therefore, the proof of asymptotic normality will be done separately forM-estimators (including ML as a special case) and GMM estimators. This is followed by a brief remark on the relative efficiency of GMM and ML. By way of summing up, we will point out at the end of

470

Chapter 7

the section that the Taylor expansion for the sampling error has a structure shared by both M-estimators and GMM. Asymptotic Normality of M-Estimators Reproducing (7.1.2), the objective function forM-estimators is

I Q.(O) = n

n

L

(7.3.1)

m(w,; 0).

t=l

It will he convenient to give symbols to the gradient (vector of first derivatives) and the Hessian (matrix of second derivatives) of them function as . n)- am(w,; 0)

S (Wr, u (px I)

H(w,; OJ (pxp)

-

=

a0 as(w,; 0)

ao'

(7.3.2)

,

a2 m(w,; 0)

= aoao'

(7.3.3)

In analogy to ML, s(w,; 0) will be referred to as the score vector for observation t. This s(w,; 0) should not be confused with the score in the usual sense, which is the gradient of the objective function Q. (0). The score in the latter sense will be denoted s.(O) later in this chapter. The same applies to the Hessian: H(w,; 0) will be referred to as the Hessian for observation t, and the Hessian of Q. (0) will be denoted H.(O). The goal of this subsection is the asymptotic normality of 0 described in (7.3.10) below. In the process of deriving it, we will make a number of assumptions, which will he collected in Proposition 7.8 below. Assume that m(w,; 0) is differentiable in 0 and that 0 is in the interior of e. So 0, being the interior solution

-

-

-

to the problem of maximizing Q. (0), satisfies the first-order conditions

0

(pxl)

=

aQ.
n

'

= - L::s(w,; 0)

n

(hy (7.3.1) and (7.3.2)).

(7 .3.4)

t=l

We now use the following result from calculus: Mean Value Theorem: Let h : JRP -> ffi:" be continuously differentiable. Then h(x) admits the mean value expansion

471

Extremum Estimators

(qxl)

ah(x)

+

h(x) = h(Xo)

Ox'

(qxl)

(x- Xo), (pxl)

(qxp)

where xis a mean value lying between x and J
2

-

aQn(6) - aQn(6o) +a Qn(6) (0- 6o) a6 a6 a6a6' (px '> (pxl)

(pxl)

1

(pxp)

n

=- I:s(w,; 6 0 ) n t=!

+ [In

n

LH(w,;

0) ] (0- 6 0)

t=I

(px I)

(by (7 .3.1 )-(7 .3.3)),

(px I)

(pxp)

(7.3.5) -

'

where 6 is a mean value that lies between 6 and 60 • The continuous differentiability requirement of the Mean Value Theorem is satisfied if m (w,; 6) is twice continuously differentiable with respect to 6. Combining this equation with the first-order condition above, we obtain

I"

[I"

+ -n L

0 =- I:s(w,; 6 0 )

(pxl)

9-

n

t=l

-]·

H(w,; 6) (6- 6 0 ).

(7.3.6)

t=1

Assuming that ~ L:7~ 1 H(w,; 6 0 to yield

•

0) is nonsingular,

[I"L

this equation can be solved for

-J-l]Jn

./ii(6- 60 ) =- -;;

H(w,; 6)

t=l

n

I:s(w,; 60 ).

(7.3.7)

t=1

This expression for (.jii times) the sampling error will be referred to as the mean value expansion for the sampling error. Note that the score vector s(w,; 6) is evaluated at the true parameter value 60 . " " Now, since 6 lies between 60 and 6, 6 is consistent for 6 0 if 6 is. If {w,} is ergodic stationary, it is natural to conjecture that

-nI L H(w,; 6) ->P E[H(w,; 6 n

-

0 )].

(7.3.8)

t=l

10 The Mean Value Theorem only applies to individual elements of h, so that i actually differs from element to element of the vector equation. This complication does not affect the discussion in the text.

472

Chapter 7

The ergodic stationarity of w, and consistency of(} alone, however, are not enough to ensure this; some technical condition needs to be assumed. One such technical condition is the uniform convergence of~ L:7~ 1 H(w,; ·)to E[H(w,; -)]in a neighborhood of 9 0 . 11 But hy the Uniform Convergence Theorem, the uniform convergence of~ L:7~ 1 H(w,; ·) is satisfied if the following dominance condition is satisfied for the Hessian: for some neighhorhood Jl of 9o, E[ sup 9 E.N IIH(w,; 9) Ill < oo (this is condition (4) below). 12 This is a sufficient condition for (7 .3.8); if you can directly verify (7.3.8), then there is no need to verify the dominance condition for asymptotic normality. Finally, if (7.3.8) holds and if ]

r.:: -vn

n

L s(w,; 9 t=!

0 ) --> d

N(O, :E),

(7.3.9)

then by the Slutzky theorem (see Lemma 2.4(c)) we have

./ii(iJ- 9o)-:

N(o. (E[H(w,; 9ol]r':E(E[H(w,; 9ol]r').

(7.3.10)

Collecting the assumptions we have made so far, we have Proposition 7.8 (Asymptotic normality of M-estirnators): Suppose that the conditions of either Proposition 7.3 or Proposition 7.4 are satisfied, so that {w,} is ergodic stationary and theM-estimator(} defined by (7.1.1) and (7.1.2) is consistent. Suppose, further, that A

(1) 9o is in the interior of

e.

(2) m(w,; {})is twice continuously differentiable in(} for any w,, (3) }. L:7~, s(w,; 9 0 ) in (7.3.2),

->d

N(O, 1:), l: positive definite, where s(w,; {})is defined

11 This is a consequence of the following result (see, e.g., Theorem 4.1.5 of Amemiya, 1985): Suppose a sequence of random functions h 11 (6) converges to a nonstochastic function ho(9) uniformly in fJ over an open neighborhood of 80 . Then plim11 _... 00 hn(8) = ho(fJo) if plim 8 = 8o and ho(fJ) is continuous at 8 0 .

Set h11 (8) to~ L~=l H(w 1 ; ·)and ho(fJ) to E[H(wz; 8)]. Continuiry of E[H(w 1 ; 8)] will be assured by the Uniform Convergence Theorem. The uniform convergence needs to be only over a neighborhood of 8o, not over the entire parameter space, because 8 is close to 8o for n sufficiently large. 12 A matrix can be viewed as a vector whose elements have been rearranged. So the Euclidean norm of a matrix, such as IIH(w1; 8) 11. is the square root of the sum of squares of all its elements.

473

Extremum Estimators

(4) (local dominance condition on the Hessian) for some neighborhood oN of 0 0 , E[ sup JIH(w,;

IJ)IIl

< oo,

6eJ/

so that for any consistent estimator IJ, ~ 'L7~ 1 H(w,; 0) where H(w1 ; IJ) is defined in (7.3.3), (5) E[H(w1 ,

--+p

E[H(w,;

IJo)],

IJo)] is nonsingular.

Then iJ is asymptotically normal with Avar(il) = (E[H(w,;

1

IJol]r :E(E[H(w,; IJol]r

1

•

(This is Theorem 4.1.3 of Amemiya (1985) adapted toM-estimators.) Two remarks are in order. • Of the assumptions we have made in the derivation of asymptotic normality, the following are not listed in the proposition: (i) IJ is an interior point, and (ii) ~ 'L7~ 1 H(w,; 0) is nonsingular. It is intuitively clear that these conditions hold hecause iJ converges in prohability to an interior point 0 0 , and ~ 'L7~ 1 H(w,; 0) converges in probability to a nonsingular matrix E[H(w,; IJ 0 )]. See Newey and McFadden (1994, p. 2152) for a rigorous proof. This sort of technicality will be ignored in the rest of this chapter.

-

• If w, is ergodic stationary, then so is s(w,; IJ 0 ) and the matrix :E is the long run variance matrix of (s(w,; IJ 0)]u A sufficient condition for (3) is Gordin's condition introduced in Section 6.5. So condition (3) in the proposition can he replaced by Gordin's condition on {s(w,; IJ 0 )]. It is satisfied, for example, if w, is i.i.d. and E[s(w,; IJ 0 )] = 0. The assumption that :E is positive definite is not really needed for the conclusion of the proposition, but we might as well assume it here because in virtually all applications it is satisfied (or assumed) and also because it will be required in the discussion of hypothesis testing later in this chapter. Consistent Asymptotic Variance Estimation To use this asymptotic result for hypothesis testing, we need a consistent estimate of Avar(O) = (E[H(w,;

1

IJol]r :E(E[H(w,: IJol]r

13 The long-run variance was introduced in Section 6.5.

1 •

474

Chapter 7

Since 9' -->p 9 0 , condition (4) of Proposition 7.8 implies that ]

n

,

-n L

H(w,; 9) --> E[H(w,; 9 0 )]. P

t=l

Therefore, provided that there is available a consistent estimator f of :E, -

Avar(B) =

{]

n

L n

H(w,; B)

t=l

~-1

f { ]-;; Ln

H(w,; 0)

~-[

(7.3.11)

t=l

is a consistent estimator of the asymptotic variance matrix. To obtain f. the methods introduced in Section 6.6, such as the VARHAC, can be applied to the estimated series {s(w,; 0)} under some suitable technical conditions. Asymptotic Normality of Conditional ML We now specialize this asymptotic normality result to ML for the case of i.i.d. data. In ML, where the m function is a log (conditional) likelihood, the first-order conditions (7.3.4) are called the likelihood equations. The fact that them function is a log (conditional) likelihood leads to a simplification of the asymptotic variance ' matrix of 9. We show this for conditional ML; doing the same for unconditional or joint ML is easier. Consider the two equalities in condition (3) of Proposition 7.9 below. The second equality is called the information matrix equality (not to be confused with the Kullback-Leibler information inequality). It is so called because E[s(w,; 9 0 ) s(w,; 90 )'] is the information matrix for observation t. It is left as Analytical Exercise 2 to derive these two equalities under some technical conditions pennitting the interchange of differentiation and integration. Here, we just note the implication of these two equalities for Proposition 7.8. If w, is i.i.d., the Lindeberg-Levy CLT and the first equality (that E[s(w,; 9 0 )] = 0) imply ]

n

.Jri L

s(w,; 9 0 )

7

N(O, :E) where :E = E[s(w,; 9 0 ) s(w,; 9 0 )'].

t=l

(7.3.12) '

This and the information matrix equality imply that Avar(9) in Proposition 7.8 is simplified in two ways as Avar(O) = -jE[H(w,; 9o)

lr

I

= jE[s(w,; 9o) s(w,; 9ol']

r'.

(7.3.13)

475

Extremum Estimators

Thus, we have proved Proposition 7.9 (Asymptotic normality of conditional ML): Let w, (= (y,, x;)') be i.i.d. Suppose the conditions of either Proposition 7.5 or Proposition 7. 6 are satisfied, so that ii ~P 11 0 . Suppose, in addition, that (1) llo is in the interior of e, (2) f(y,

1

x,; II) is twice continuously differentiable in II for all

(y,,

x, ),

(3) E[s(w,; 11 0 )] = 0 and -E[H(w,; 11 0 )] = E[s(w,; 11 0 ) s(w,; 11 0 )'], where sand

H functions are defined in (7.3.2) and (7.3.3), (4) (local dominance condition on the Hessian)

for some neighborhood

J{

of 11 0 ,

E[sup IIH(w,; II) Ill< oo, 8e.N

so that for any consistent estimator ii, ~ L:;~, H(w,; ii) ~P E[H(w,; 11 0) ], (5) E[H(w,; 11 0 ) J is nonsingular.

Then

ii is asymptotically nonnal with Avar(ii) given by (7.3.13).

Two remarks about the proposition: • Condition (3) is stated as an assumption here because its derivation requires interchange of integration and differentiation (see Analytical Exercise 2). A technical condition under which the interchange is legal is readily available from calculus (see, e.g., Lemma 3.6 of Newey and McFadden, 1994). So condition (3) could be replaced by such a technical condition on f(y, I x,; II). We do not do it here because in most applications condition (3) can be verified directly. • It should be clear from the above discussion how this proposition can be adapted to unconditional ML: simply replace f(y, I x,; II) by f(w,; II). To use this asymptotic result for hypothesis testing, we need a consistent estia natural mate of Avar(O). Noting that it can be written as -{E(H(w,; 11 0 ) J

r',

estimator is

1 first estimator of Avar(O) = - { ~

L n

H(w,;

0) ~-1

(7.3.14)

t=l

Since

9 ~P

11 0 , this estimator is consistent hy condition (4) of Proposition 7.9.

476

Chapter 7

Another estimator, hased on the relation Avar(ii) = IE[s(w,; llo) s(w,; llol'] ]

second estimator of Avar(ii) =

{

n

;; ~ s(w,; ii) s(w,; ii)'

rl' is

} -I

(7.3.15)

To insure consistency for this estimator, we need to make an assumption in addition to the hypothesis of Proposition 7.9 (see, e.g., Theorem 4.4 of Newey and McFadden, 1994). We will not show it here hecause in many applications the consistency of this second estimator can be verified directly. There is no compelling reason for preferring one estimator of Avar(ll) over the other. The second estimator, which requires only the first derivative of the log likelihood, is easier to compute than the first. This can be an important consideration when it is impossible to calculate the derivatives of the density function analytically and some numerical method must he used. On the other hand, in at least some cases the first provides a better finite-sample approximation of the asymptotic distribution (see, e.g., Davidson and MacJUnnon, 1993).

-

Two Examples To get a better grasp of the asymptotic normality results about conditional ML, it is useful to see how the results can be applied to familiar examples. We consider two such examples. Example 7.10 (Asymptotic normality in linear regression model): To reproduce the log conditional density for observation t for the linear regression model with normal errors,

With e = JRK x Il4+ (where K is the dimension of {3), condition (I) of Proposition 7.9 is satisfied. Condition (2) is obviously satisfied. To verify (3), a routine calculation yields: 14

s(w,; II) =

I-] [ + ~~-~

" l

- 2a2

I '"'2

2a4

e,

, H(w,; II)=

[I' -~~~

"

-

1

f

"

;,4Xt . Cr

I - + I c,-3] '

-2a4Xt. Cr I 4<:r4 -

l '"'2 la6cr

2a6Xt.

+

I "4 4asC 1

(7.3.17) 14For lhe parameter a2 the differentiation is with respect to a 2 rather than a.

477

Extremum Estimators

where w, = (y 1, x;)', 9 = (fJ', a 2 )' and f:, = y,- x;p (which should be distinguished from£, = y,- x;fJ 0 ). So for 9 = 9o the f:, in these expressions can be replaced by £ 1 • In the linear regression model, E(£1 1 x,) = 0. Also, since £ 1 is N(O, a~). we have E(
-.!, E(x,x;)

- E[H(w,; 9o) J = E[s(w,; 9o) s(w,; 9ol'] =

"o

[

o'

: ] . (7.3.18) 2a04

If E(x,x;) is nonsingular, then E[H(w,; 9 0 ) J is nonsingular and (5) is satisfied.

-

Regarding condition (4 ), let E:, = y, - x; p for some consistent estimator and 2 . Condition (7.3.8) in this example is

p

a

(7.3.19)

It is straightforward to show this, given that E:, = £,-x;
Example 7.11 (Asymptotic nonnality of probit ML): To reproduce the log conditional likelihood of the pro bit model, log f(y,

1

x,; 9) = y,log (x;9) +(I- y,) log[! - (x;9)]. (7.3.20)

With 8 = JRP, condition (I) of Proposition 7.9 is satisfied. Condition (2) is obviously satisfied. To verify (3), a tedious calculation yields (y,- (x;9))(x;9))(x;9) ''

H

(7.3.21)

. - {- [ y, - (x;9l ]' ' ' (x;9)(1 - (x;9)) [
y,- (x;9l

+ [ (x;9)(1 (To derive (7.3.22), use the fact that

Y?

J '( , )} x,x,.,

- (x;9))
(7.3.22)

= y,.) Here, (·) is the cumulative

478

Chapter 7

density function of N(O, I) and¢(·)= <1>'0 is its density function. It is then easy to prove the conditional version of (3): E[s(w,; llo) I x,] = 0 and - E[H(w,; llo) I x,] = E[s(w,; llo) s(w,; llo)' I x,].

(7.3.23)

So (3) is satisfied by the Law of Total Expectations. In particular, for probit, it is easy to show that - E[H(w,; llo)] = E[s(w,; llo) s(w,; llol'] = E[ ).(x;llo)A( -x;llo)x,x;]. (7.3.24) where ).(v)

=

t/>(v) I - (v)

(7.3.25)

is called the inverse Mill's ratio or the hazard for N(O, 1). It can be shown that the term in braces in (7.3.22) is between 0 and 2 (see Newey and McFadden, 1994, p. 2147). So (7.3.26)

IIH(w,; IIlii < 211x,x;11.

The Euclidean norm llx,x; II is the square root of the sum of squares of the elements ofx,x;. Therefore, the expectation of llx,x;\1 2 and hence that of llx,x;ll are finite if E(x,x;) (exists and) is finite. So the local dominance condition on the Hessian (condition (4)) is satisfied ifE(x,x;) is nonsingular. It can be shown that condition (5) is satisfied ifE(x,x;) is nonsingular (see footnote 29 on p. 2147 of Newey and McFadden, 1994). We conclude, again, that all the conditions of Proposition 7.9 are satisfied ifE(x,x;) is nonsingular. Asymptotic Normality of GMM Now tum to GMM. The GMM objective function is I

-

I

"

Qn(ll) = --g,(II)'Wg,(ll) with g,(ll) =- I)<w,; II). (Kxl) n t=l 2

(7.3.27)

As in the case ofM-estimators, we apply the Mean Value Theorem to the first-order condition for maximization, but the theorem will be applied to g,(ll), not its first derivative. Thus, unlike in the case of M-estimators, the objective function needs to be continuously differentiable only once, not twice. Titis reflects the fact that the sample average enters the objective function differently for the case of GMM.

479

Extremum Estimators

Assuming that 9' is an interior point, the first-order condition for maximization of the objective function is 0

aQ (OJ , , , " = -G,(9) W g,(9), 88 (pxK) (KxK) (Kxl)

=

(pxl)

(7.3.28)

(px I)

where G,(9) is the Jacobian ofg,(9): G,(9J

= ag,(~J

(7.3.29)

a9

(K xp)

Now apply the Mean Value Theorem to g,(9), notto aQ, (9)(a9 as in M-estimation, to ohtain the mean value expansion g,(O) = g,(9o) (Kxl)

+ G,(O) (0 -

(Kxl)

(Kxp)

9o).

(7.3.30)

(pxl)

Substituting this into the first-order condition above, we obtain

,.,,_ 0

= -G,(9)

(pxl)

Solving this for

(pxK)

,.,,

W g,(9 0 ) - G,(9) (KxK)

(Kxl)

(pxK)

(0- 9 0 ) and noting that g,(9) =

__

,.,

W G,(9) (9- 9 0 ). (7.3.31)

(KxK) (Kxp)

(pxl)

~ L:;~, g(w,; 9), we obtain

-1 "

"

./n(9- 9 0 ) = - [ G,(9) (pxl)

(pxK)

I

..-..

-

W G,(9) ]

(KxK) (Kxp)

" G,(9) (pxK)

W

"

Ir.: L...., " g(w,; 9 0 ).

(KxK) yn

t= I

(Kx I)

(7.3.32) In this expression, the Jacobian G, (9) of g,(9) is evaluated at two different points, 0 and 9. Since g,(9) = ~ L:;~, g(w,; 9), the Jacobian at any given estimator 0 can be written as

"

-

G (O) = ~" ag(w,; 9) n n L. ao' .

(7.3.33)

t=l

If w, is ergodic stationary and 9 is consistent, it is natural to conjecture that this expression converges to E[ag(w,; 90 )(a9']. As was true forM-estimators, this conjecture is true if ag(w,; 9)(a9' satisfies the suitable dominance condition specified in condition (4) below. It should now be clear that the GMM equivalent of Proposition 7.8 is

480

Chapter 7

Proposition 7.10 (Asymptotic normality of GMM): Suppose that the conditions of Proposition 7. 7 are satisfied, so that {w,} is ergodic stationary, W(K x K) converges in probability to a symmetric positive definite matrix W, and the p-dimensional GMM estimator 9 is consistent. Suppose, further, that

.

(I) 9o is in the interior of

e.

(2) g(w,; 9) is continuously differentiable in 9 for any w,, (Kx I)

(3)

!o L:7= 1 g(w,; 9o) -->d N(O, (KxK) S ), S positive definite,

vn

(4) (local dominance condition on •·~~',' 81 ) for some neighborhood N of 90 ,

, 9) II] < oo, E[ sup I og(w,; 9 a eo1

~ any consistent · · ag(w 9- • 111 "'" so that 10r estimator ~r=l ao ·B) --> p E[ag(w,;8o)J 88' ' 1 ,'

(Kxp)

(5) E["•<;~: 801 ] is of full column rank. (Kxp)

Then the following results hold:

.

(a) (asymptotic normality) 9 is asymptotically normal with Avar(O) = (G' WG)- 1G' WS WG(G' WG)- 1, (pxp)

where G

= E[ag(w,;Bol],

(Kxp)

CO

(b) (consistent asymptotic variance estimation) Suppose there is available a consistent estimateS of S. The asymptotic van·ance is consistently estimated by

-

Avar(O) =

(G' WG) -'G' WSWG(G' WG) -I,

(7.3.34)

The assumption that S is positive definite is not really needed for the conclusion of the proposition, hut we might as well assume it here because in virtually all applications it is satisfied (or assumed) and also because it will be required in the discussion of hypothesis testing later in this chapter. It is useful to note the analogy between linear and nonlinear GMM by comparing this proposition to Proposition 3.1. In linear GMM, g, (9) is given by

481

Extremum Estimators ]

g,(ll) = (;;-

11

1

11

I:X, · y,) - (;;- L x,z; )11. 1=1

(7.3.35)

t=.l

So G _ G.(il) in Proposition 7.10 reduces to -(~ L: x,z;), and G reduces to - E(x,z;). The thrust of Proposition 7.10 is that the asymptotic distribution of the nonlinear GMM estimator can be ohtained hy taking the linear approximation of g, (II) around the true parameter value. The analogy to linear GMM extends further: • (Efficient nonlinear GMM) Using the current notation, the matrix inequality (3.5.11) in Section 3.5 is (7.3.36) which says that the lower bound for Avar(iJ) is (G'S- 1G)-'- The lower bound is 1 achieved ifW satisfies the efficiency condition that plim.~oo = • namely, = for some consistent estimator of s. that

w s-

w s-l

• (J statistic for testing overidentifying restrictions) Suppose that the conditions

w s-l

w

of Proposition 7.10 are satisfied and that = so that satisfies the ef= s- 1 • Then the minimized distance J = ng,(O)' ficiency condition: plim 1g,(0) (= -2nQ.(O)) is asymptotically chi-squared with K- p degrees of freedom. The proof of this result is essentially the same as in the linear case; see Newey and McFadden (1994, Section 9.5) for a proof.

w

s-

• (Estimation of S) Sis the long-run variance of {g(w,; 11 0 )}. Under some suitable technical conditions, the methods introduced in Section 6.6- such as the VARHAC-can be applied to the estimated series {g(w,; 0)} to obtainS. In particular, ifw, is i.i.d., then S = E[g(w,; 11 0 )g(w,; llol'] and the sample second ~ moment of g(w,; II) can serve asS.

-

GMM versus Ml The question we have just answered is: taking the orthogonality conditions as given, what is the optimal choice of the weighting matrix W? A related, but distinct, question is: what is the optimal choice of orthogonality conditions?" This question, too, has a clear answer, provided that the parametric form of the density of the data (w 1 , ••• , w.) is known. Here, we focus on the i.i.d. case with the 15 Yet another efficiency question concerns the optimal choice of instruments in generalized nonlinear instrumental estimation (a special case ofGMM where the g function is a product of the vector of instruments and the error rerrn for an estimation equation). See Newey and McFadden (1994, Section 5.4) for a discussion of optimal instruments.

482

Chapter7

density of w, given by f(w,; IJ). Let ii be the GMM estimator associated with the orthogonality conditions E[g(w,; IJ 0)] = 0. Its asymptotic variance Avar(O) is given in Proposition 7.10. It is not hard to show that Avar(ii)::: E[s(w,; IJ 0 )s(w,; IJ 0

)'r'

. IJ ) where s (w, 0 = (pxl)

alog J(w,; IJo) . aiJ

(7.3.37) That is, the inverse of the information matrix E(ss') is the lower bound for the asymptotic variance of GMM estimators. This matrix inequality holds under the conditions of Proposition 7.10 plus some technical condition on f (w,; IJ) which allows the interchange of differentiation and integration. See Newey and McFadden (1994, Theorem 5.1) for a statement of those conditions and a proof. Asymptotic efficiency of ML over GMM estimators then follows from this result because the lower bound [E(ss')]- 1 is the asymptotic variance of the ML estimator. The superiority of ML, however, is not very surprising, given that ML exploits the knowledge of the parametric form f(w,; IJ) of the density function while GMM does not. As you will show in Review Question 4 below, GMM achieves this lower bound when the g function in the orthogonality conditions is the score for observation t: g(w 1 ; IJ) = s(w,; IJ) (Kxl)

=

alog f(w,; IJ) aiJ

(pxl)

.

(7.3.38)

Therefore, the GMM estimator with optimal orthogonality conditions is asymptotically equivalent to ML. Actually, they are numerically equivalent, which can be shown as follows. Since K (the number of orthogonality conditions) equals p (the number of parameters) when g is chosen optimally as in (7.3.38), the GMM estimator ii should satisfy g,(ll) = 0, which under (7.3.38) can be written as I n - Z:::s(w,;

n

0) =

0.

(7.3.39)

t=l

This is none other than the likelihood equations (i.e., the first-order condition for ML). They have at least one solution because the ML estimator satisfies them. If they have only one solution, then IJ is the ML estimator as well as the GMM estimator. If they have more than one solution, we need to choose one from them as the GMM estimator. Since one of the solutions is the ML estimator, we can A

simply choose it as the GMM estimator.

483

Extremum Estimators

Expressing the Sampling Error in a Common Format To prepare for the next section's discussion, it is useful to develop a slightly different proof of asymptotic normality. We have used the Mean Value Theorem to derive the mean value expansion for the sampling error, which is (7 .3. 7) for M-estimators and (7.3.32) for GMM. We show in this suhsection that they can be written in a common format, (7.3.43) below. Consider M-estimators first. Noting that aQa~901 = ~ L::7~ 1 s(w,; 60 ), the mean value expansion (7.3.7) can be written as

.Jri (660 ) =

-

L

[I " H(w,; 6) n t=l

]-I .Jri

oQ.(6o) . a6

(7.3.40)

By condition (4) of Proposition 7.8, ~ L::7~ 1 H(w,; i}) converges in probability to some p x p symmetric matrix >II given by

>II = E[H(w,; 6o)].

(7.3.41)

(pxp)

So rewrite (7.3.40) as

By construction, the term in braces converges in probability to zero. By condition (3) of Proposition 7.8, .Jri aQa~901 (= }. L::7~ 1 s(w,; 6 0 )) converges to a random

variable. So the last term, which is the product of the term in hraces and .Jri aQa~901 , converges to zero in probability by Lemma 2.4(b). This fact can be written compactly as a Taylor expansion: 16

.Jri
~-~

o6 (pxl)

a, ,

~xl)

(7.3.43)

where the term "o," means "some random variable that converges to zero in probability."17 The exact expression for the o, term will depend on the context. Here, it equals the last term in (7.3.42). As you will see, all that matters for our argument

1611 is apt to caJl this equation a Taylor expansion rather than a mean vaJue expansion because the matrix that multiplies Jn aQato) is evaJuated at 9o rather than at 6 as in the mean vaJue expansion. 17 The "op" notation was introduced in Section 2.1.

484

Chapter 7

is that the Op term vanishes (converges to zero in probability), not its expression. Since the difference between .jii(iJ -8 0 ) and -'i'- 1./ii aQ.~Do> vanishes, the asymptotic distribution of .jii(iJ -8 0 ) is the same as that of -'i'- 1./ii aQ.~OoJ by Lemma 2.4(a). It then follows that .jii(iJ- IJ 0 ) converges to a normal distrihution with mean zero and asymptotic variance given by Avar(IJ) = '1'- >1:'1'- , where

l: (pxp)

=Avar (aQ.(IJo)) ao .

(7.3.44)

ForM-estimators, .jii aQ.~Dol = }. L:7~ 1 s(w,; IJ 0 ) and l: is the long-run variance of s(w,; IJo). Setting 'I' = E[H(w,; IJo) J gives the expression for Avar(O) in Proposition 7.8. Next tum to GMM. The expression for .jii aQg~Do) is

r.: aQ.(IJo) ,- I ~ "" = -[G.(IJo)l W r.: L..,g(w,; IJo).

-vn

aiJ

(7.3.45)

t=l

L:

(Recall that G.(IJ) = 8 ~9<~> .) Since G.(IJ 0 ) = ~ '•';/o> ~P G (= E['•';~;Do>]), W~P W, and }. L:7~ 1 g(w,; IJ 0) ~d N(O, S) underthe conditions of Proposition 7.10, we have .jii aQ.~Oo) converging to a normal distribution with mean zero and 1:

= Avar(aQ.(IJo))

=

(J(J

(pxp)

G'

W

S

W

G .

(7.3.46)

(pxK)(KxK)(KxK)(KxK)(Kxp)

Now rewrite the mean value expansion for the sampling error for GMM (7 .3.32) as

n

c

- - .jii I "L..,g(w,; IJo). = -G.(IJ)'W

(7.3.47)

t=I

It is left as Review Question 5 below to show that this can be written as the Taylor expansion (7.3.43), with the symmetric matrix 'I' given by '1'=-G' (pxp)

W

G.

(7.3.48)

(pxK) (KxK) (Kxp)

Substituting (7.3.48) and (7.3.46) into (7.3.44), we obtain the GMM asymptotic variance indicated in Proposition 7.10. Table 7.1 summarizes the discussion of this subsection by indicating how the substitution should be done to make the common formulas applicable to M-estimators and GMM.

Table 7.1: Taylor Expansion for the Sampling Error

Terms for substitution

M-estimators

GMM

Q.(ll)

I n - Lm(w,; II) n

-2g.(ll) Wg.(ll)

I

,-

t=l

Jii aQ.(llo) all

"' l:

I

n

Jii L

s(w,; 11 0 )

-[G.(II 0 )]'W

t=l

Jn

tg(w,; llo) t=l

E[H(w,; 11 0 )]

-G'WG

long-run variance ofs(w,; 11 0 )

G'WSWG

NOTE: See (7.3.2) and (7.3.3) for definition ofs(w1 ; 9) and H(w 1 ; 9). Also.

486

Chapter 7

In passing, it is useful for the next section's discussion to note that :E = - ljl 1). for ML with i.i.d. observations and efficient GMM (the GMM with W =

s-

For ML, :E = E[s(w,; 11 0 ) s(w,; 11 0 )'], which equals the negative of the "' given in (7.3.41) thanks to the information matrix equality. For efficient GMM, setting w = s-l in (7.3.46) gives (7.3.49) which is the negative of the "' in (7.3.48). QUESTIONS FOR REVIEW

1. (Score and Hessian of objective function) ForM-estimators and ML in particular, we defined the score for ohservation 1, s(w,; II), to be the gradient (vector of first derivatives) of m(w,; II) and the Hessian for observation 1, H(w,; II), to be the Hessian of m (w,; II). Let the score (without the qualifier "for observation 1") be the gradient of the objective function Q,(ll) and denote it as s,(ll). Let the Hessian (without the qualifier "for observation t") be the Hessian of Q, (II) and denote it as H,(ll). Under the conditions of Proposition 7.9, show that E[s,(llo)] = 0,

-~ E[H,(IIo)] =

E[s,(llo)s,(ll 0 )'].

n

Hint: {s(w,; 11 0 )} is i.i.d. 2. (Conditional information matrix equality for the linear regression model) For the linear regression model of Example 7.10, verify that E[s(w,; llo)

I x,]

=0

and - E[H(w,; llo)

I x,]

= E[s(w,; llo) s(w,; llo)' I x,].

3. (Avarfor the linear regression model) For the linear regression model of Example 7.10, let (ji, &2 ) be the ML estimate of (fJ, 0" 2 ). Write down the first esti-

-

..----.,.

mator of Avar(ll) defined in (7.3.14). Does Avar(fJ) equal & 2 (~ [Answer: No.]

L x,x;)- 1?

4. (GMM with optimal orthogonality conditions) Assume that the relevant technical conditions are satisfied for the hypothetical density j(w,; II) so that the information matrix equality holds. Using Proposition 7.10, show that when the g function in the orthogonality conditions is given as in (7 .3.38), the asymptotic variance matrix of the GMM estimator equals the inverse of the information matrix for observation I (which is the lower bound in (7.3.37)). Hint: Since K equals p, G is a square matrix and the expression tor Avar(ll) in

-

Extremum Estimators

487

Proposition 7.10 reduces to Avar(O) = G-'SG'- 1 • The choice of g implies G = E[H(w,; 9 0 )] and S = E[s(w,; 9 0 )s(w1 ; 90 )']. 5. (Taylor expansion of the sampling error for GMM) Show that (7.3.47) can be written as the Taylor expansion (7.3.43) under the conditions of Proposition 7 .I 0, by taking the following steps. (a) Show that B- 1 = 'it- 1 + Yn where Yn ~P 0. (b) Show that c = Jn aQa~Oo) + Yn where Yn ~P 0. Hint: Gn(O)- Gn(9 0 ) = ' I (Gn(O) -G)- (Gn(Oo)- G) ~P 0 and W,;;; L:7~t g(w,; Oo) converges in distribution to a random variable.

!il 8 Qn(Ool+x wherexn=- -[Yn'\1" !il (c) Showthat-B-'c= -IJI- 1 V" 88 r~> >Jt-IYn + YnYn]. (d) Show that Xn

~P

8Qn(Oo)+ 88

0.

6. (Why the GMM distance is deflated by 2) Verify that, if the negative of the distance g.(O)'S- 1g.(9) were not divided by 2 in the definition of the efficient GMM objective function, then we would not have :E = -'it.

7.4 Hypothesis Testing As is well known for ML, there is a trio of statistics-the Wald, Lagrange multiplier (LM), and likelihood ratio (LR) statistics-that can be used for testing the null hypothesis. The three statistics are asymptotically equivalent in that they share the same asymptotic distribution (of x2 ). In this section we show that the asymptotic equivalence of the trinity can be extended to GMM by developing an argument applicable to both ML and GMM. The key to the argument is the common format for the sampling error derived at the end of the previous section. On the first reading, you may wish to read the next subsection and then jump to the last subsection. The Null Hypothesis We have already considered, in the context of the linear regression model, the problem of testing a set of r possibly nonlinear restrictions (see Section 2.4). To recapitulate, let 9 0 be the p-dimensional model parameter. The null hypothesis can

488

Chapter 7

be expressed as

Ho : a(Oo) = 0.

(7.4.1)

(rx I)

We assume that a(-) is continuously differentiable. Also, let A(O)

= oa(~)

(>xp)

ao

(7.4.2)

be the Jacobian of a(O). We assume that

= A(0 0 ) is offull row rank.

Ao

(7.4.3)

(rxp)

This ensures that the r restrictions are not redundant. This rank condition implies that r s p. Let 0 be the extremum estimator in question. It is either ML or GMM. It solves the unconstrained optimization problem (7 .1. I). The Wald statistic for testing the null hypothesis utilizes the unconstrained estimator. The LM statistic, in contrast, utilizes the constrained estimator, denoted 0, which solves max Q.(O) s.t. a(O) = 0.

(7.4.4)

8E8

As was shown in Section 7.2, the true parameter value 0 0 solves the "limit unconstrained problem" where Q. in the unconstrained optimization problem (7 .1.1) is replaced by the limit function Q0 . It also solves the "limit constrained problem" where Q. in the above constrained optimization problem is replaced by Q0 , because 0 0 satisfies the constraint. It turns out that the uniform convergence in probability of Q. 0 to Q 0 0 assures that the limit of the constrained estimator 0 is the solution to the limit constrained problem, which is Oo. For a proof, see Newey and McFadden (1994, Theorem 9.1) for GMM and Gallant and White (1988, Lemma 7.3(b)) for general extremum estimators. It is also possible to show that, if all the conditions ensuring the asymptotic normality of the unconstrained estimator are satisfied, then 0. too, is asymptotically normal under the null. For a proof for GMM, see Newey and McFadden (1994, Theorem 9.2); for ML, see, e.g., Amemiya (1985, Section 4.5). 18

18 If you are interested in the proof. it is just the mean value version of the discussion in the subsection on the LM statistic below, which is based on Taylor expansions.

489

Extremum Estimators

The Working Assumptions The argument for showing that the Wald, LM, and LR statistics are all asymptotically x2 (r) is based on the truth of the following conditions.

(A) .jfi(O -11 0 ) admits the Taylor expansion (7.3.43), which is reproduced here (7.4.5) (B) .fii aQa~•oJ __., N(O. I:), I: positive definite. (C) .jfi(O - llo) converges in distribution to a random variable. (D) I:= -'it.

We have shown in the previous section that the first two conditions are satisfied for conditional ML under the hypothesis of Proposition 7.9, for unconditional ML under the hypothesis suitably modified, and for GMM under the hypothesis of Proposition 7.10. As just noted, under the same hypothesis, the constrained estimator II is consistent and asymptotically normal, ensuring condition (C). So conditions (AHC) are implied by the hypothesis assuring the asymptotic normality of the extremum estimator. The Wald and LM statistics, if defined appropriately so as not to depend on condition (D), are asymptotically x2 (r). But their expressions can be simplified if condition (D) is invoked. Furthermore, without the condition, the LR statistic would not be asymptotically x2 As was noted at the end of the previous section, 1• condition (D) is satisfied for ML and also for efficient GMM with W = Therefore, the part of the discussion that relies on condition (D) is not applicable to nonefficient GMM.

s-

The Wald Statistic Throughout the book we have had several occasions to derive the Wald statistic using the "delta method" of Lemma 2.5. Here, we provide a slightly different derivation based on the Taylor expansion. Applying the Mean Value Theorem to a(/1) around llo and noting that a(llo) = 0 under the null, the mean value expansion of .fii a(ii) is

-

.fii a(il) = A(ii) .fii(il- llo). (rxl)

(rxp)

(7.4.6)

(pxl)

Derivation of the Taylor expansion from the mean value expansion is more transparent than in the previous section and proceeds as follows. Since II, lying between

490 (} 0

and

Chapter 7

iJ, is consistent and since A(.) is continuous, A(ii) converges in probability

to Ao (= A(0 0 )). Furthermore, the term multiplying A(ii), y'n(iJ- 0 0 ), converges to a random variable. So (A(ii) - A 0)y'n(iJ - 0 0 ) vanishes (converges to zero in probability) by Lemma 2.4(b). Denoting this term by oP, the mean value expansion can be rewritten as a Taylor expansion:

v'n a(iJ) = Ao (rxl)

y'n(iJ- Oo)

+

(pxi)

(rxp)

oP .

(7.4.7)

(rxl)

(Note that this argument would not be valid if y'n(iJ - 0 0 ) did not converge to a random variable.) Substituting the Taylor expansion (7.4.5) into this, we obtain an equation link1: ing VIi a(O) to VIi

aQab'' c

.

1

c 3Qn(0o)

vna(O)=-Ao "'- vn (rx 1)

(rxp) (pxp)

ao

+op.

(7.4.8)

(px I)

(The Op here is A 0 times the oP in (7.4.5) plus the oP in (7.4.7), but it still converges to zero in probability.) It immediately follows from this expression and the asymptotic normality of y'n

aQab'' 1 (condition (B)) that y'n a(O) converges to a normal

distribution with mean zero and Avar(a(O)) = Ao"'- 1 1:"'-'A~ (rxr)

= A 0 :E- 1 A~

(if condition (D) is invoked).

(7.4.9)

(rxp) (pxp) (pxr)

Since the r x p matrix Ao is of full row rank and :E is positive definite (by condition (B)), this r x r matrix is positive definite. Therefore, the associated quadratic form na(O)'[ Ao :E- 1A~r'a(O)

(7.4.10)

is asymptotically x 2 (r) under the null. As noted above, A(iJ) ~P A0 . So if there is available a consistent estimator f of :E, then A(O)f-' A(O)' is consistent for A 0 :E- 1A 0, which implies by Lemma 2.4(d) that the Wald statistic defined by (7.4.11) too is asymptotically x2 (r) under the null. How the consistent estimator f of :E (= - "'l is obtained depends on whether the extremum estimator is ML or efficient GMM. For ML, it is given by a consistent

491

Extremum Estimators

estimate of- lji = - E[H(w,; 9 0)]: a2 Qn(9) I n - a9a9' = -;; LH(w,; 9), A

(7.4.12)

t=l

or by a consistent estimate of l: = E[ s(w,; 9o) s(w,; 9o)']:

I"

- L, s(w,; 9) s(w,; 9)'. A

(7.4.13)

A

n

For efficient GMM, l: = G' s-'G (see (7.3.49)), which is consistently estimated by (7.4.14) ~

As already mentioned, S is an estimate of the long-run variance matrix constructed from the estimated series (g(w,; 9) }. A

The Lagrange Multiplier (LM) Statistic Let y" (r x I) be the Lagrange multiplier for the constrained problem (7.4.4). The constrained estimator 9 satisfies the first-order conditions

.J,i a Qn(9) a9

+ A(iJ)' .J,i y n =

0 ' (pxl)

.J,i a(ii) = 0 .

(7.4.15) (7.4.16)

(rx I)

We seek the Taylor expansion of these first-order conditions. By condition (C), .jii(i! - 9 0 ) converges to a random variable. So .J,i a(il) admits the Taylor expansiOn (7.4.17) It is left as Review Question 2 to show that the following Taylor expansion holds for both ML and GMM:

(7.4.18) An implication of this equation is that .J,i '~1 9 > converges to a random variable. So .J,i y "'satisfying (7.4.15), converges to a random variable. Consequently, since

492

Chapter 7

A(/1) ---> P A0 , we obtain

Substituting these three equations into the first-order conditions, we obtain "' (pxp) [ Ao

(rxp)

l

A~ [.,fo.(iJ

(pxr)

-/loll [-.Jii oQ;~ o>l +

(pxl)

6

_

-

(pxi)

0

VnYn

0

(rxr)

(rxl)

(rxl)

Op.

(7.4.20)

Using the formula for partitioned inverses, this can be solved to yield 1A .,,-I r.: aQ,(IIo) r.: --(A0.., ·•·- 1A')yny,0 0.., yn 3(1

+ Op,

(7 •• 4 21)

].Jii aQ,(IIo) + all

.Jii(iJ -llo) = -["'-I -"'-I A~(Ao"'-l A~)- 1 Ao"'- 1

op.

(7.4.22) The rest is the same as in the derivation of the Wald statistic. From (7 .4.21 ), .Jii y, converges to a normal distribution with mean zero and variance given by Avar(y ,) = (A 0 "'-I A~)- 1 Ao"'- 11:"'-l A~(A 0 "'-I A~)- 1 (rx r)

= (A 0 l:- 1 A~)- 1

(if condition (D) is invoked).

(7.4.23)

Since this r x r matrix is positive definite, the associated quadratic form (7.4.24) is asymptotically x2 (r) under the null. If there is available a consistent estima1 tor, denoted 1:, of 1:, then A(O)l:- A(O)' is consistent for Aol:- 1A~. So the LM statistic defined by the first line below is asymptotically x2 (r) under the null. This statistic can be beautified by the first-order conditions (7.4.15): LM = n y~ [A(O)i- A(O)'j Yn 1

(rxl)

(lxr) (r

x r)

-

= n(aQ,(/1))'

all

(lxp)

1; _,(aQ.(/1)) (since A(ii)'y. (pxp)

ao

=

-•~·JB>

by (7.4.15)).

(pxl)

(7.4.25)

493

Extremum Estimators

The last expression for the LM statistic is the reason why the statistic is also called the score test statistic. It is the distance from zero of the score evaluated at the constrained estimate 8. You should not be fooled by its beauty, however: despite being a quadratic form associated with a p-dimensional vector, the degrees of freedom of its x2 asymptotic distribution are r, not p, as is clear from the derivation. For ML, I: is given by (7.4.12) or (7.4.13), evaluated at 8. For GMM, I: is consistently estimated by

-

-

.

With

G (Kxp)

1 " " ag(w,; 8) = G, (8) = -n £.__, a8 ' . -

(7.4.26)

r=l

Here, S is an estimate of the long-run variance matrix constructed from the estimated series {g(w,; 8)}. The Likelihood Ratio (LR) Statistic

-

-

The LR statistic is defined as 2n times Q"(8)- Q,(8). (For efficient GMM, calling this the likelihood ratio statistic is a slight abuse of the language because Q. for GMM is not the likelihood function, but we will continue to use the term for both ML and GMM.) It is not hard to show for ML and GMM (see Review Questions 6 and 7) that LR

= 2n · [Q.(II)- Q.(iJ)] =

-n(O- ii)''i1(0- iJ)

+ oP.

(7.4.27)

The validity of this expression does not depend on condition (D). From this expression, it is easy to derive the asymptotic distribution of the LR statistic. Subtracting (7.4.22) from (7.4.5), we obtain

Substituting this into (7.4.27), the LR statistic can be written as LR- - ( r.: vn

aQ.a(8o) )' w-1 A'0 (A0 w-rA'0)-I A0 w-r ( v r.:n aQ.a(8o)) + Op. 8

8

(7.4.29) Now invoke assumption (D). The LR statistic becomes LR- ( r.: aQ.(8 o))' I:- 1A' [A I:-rA' - ...;n a8 0 0 0

8 J- 1A0 I:-r ( ...;nr.: aQ.( a8 o)) +

Op·

(7.4.30)

494

Chapter 7

This is asymptotically x2 (r) because the variance of the limiting distribution of A0 :E- 1 'Qa~Oo)) is the positive definite matrix in brackets.

(vn

This completes the proof of the asymptotic equivalence of the trio of statistics. But it should be clear from the proof that something stronger than asymptotic equivalence (that the three statistics share the same asymptotic distribution) holds. We have shown in the proof that the difference between the Wald statistic and (7.4.10) is op. But by (7.4.8) the difference between the latter and the quadratic form in the expression for the LR statistic (7.4.30) is op when :E = -lll. So the numerical difference between the Wald statistic and the LR statistic is oP when :E = -'it. The same applies to the LM statistic. The difference between it and (7.4.24) is op. But the difference between the latter and the quadratic form in (7 .4.30) is again oP when :E = - lll. Summary of the Trinity To summarize the rather lengthy discussion of this section,

-

Proposition 7.11 (The trinity): Let 6 the extremum estimator defined in (7.1.1). Consider the null hypothesis (7.4.1) where a(·) is continuously differentiable (so the J~cobian A(6) = ";~~) is continuous) and the rank condition (7.4.3) is satisfied. Let 6 be the constrained extremum estimator defined in (7.4.4). Assume:

• the hypothesis of Proposition 7.9, if the extremum estimator is conditional ML, • the hypothesis appropriately modified as indicated right below Proposition 7.9, in the case of unconditional ML, • the hypothesis of Proposition 7.10 with W = s- 1 and that there is available a consistent estimatorS of S constructed from 0 and S constructed from ii, in the case of efficient GMM. Define the Wald, LM, and LR statistics as in Table 7.2. Then (a) the constrained extremum estimator 6 is consistent and asymptotically normal, (b) the three statistics all converge in distribution to is the number of restrictions in the null,

x2 (r)

under the null where r

(c) moreover, the numerical difference between the three statistics converges in probability to zero under the null.

Table 7.2: Trinity 1

1

Wald: na(th'[ A(O) i:- A(O)'r a(O) (lxr)

(rxp)(pxp)(pxr)

(rxl)

-

-

LM: n(aQn(IJ))' 1;-1 (aQn(IJ))

atJ

atJ

(pxp)

(pxl)

(lxp)

LR: 2n · [Qn(t/)- Qn(ii)] Conditional ML

Terms for substitution

I

n

-n L

log f(y,

Efficient GMM

I x,; IJ)

t=l

2

•

a Qn(IJ) I " · · , G = Gn(B) - atJatJ' or n L..,s(w,; IJ)s(w,; IJ) G's- 1G, (Kxp) replace

8 by ii in above

G'S- 1G, G = Gn(ii)

NOTE: See (7.3.2) and (7.3.3)for definitions of s(w,; 0) and H(w,; 0). Also, l

gn(O)

n

=- Lg(w1 ; 0), n

J=l

G (O) n

= ag, (O)

ao' ·

The same middle matrix is used in Qn(O) and Qn (0) for GMM.

(Kxp)

496

Chapter 7

QUESTIONS FOR REVIEW

1. Verify that the LM statistic for ML can be written as

-)'II"'-~-'(1~ -) n;; L..s(w,;/1) n L..s(w,;ll)s(w,;/1)' ;; L..s(w,;/1). ( I~ 1=1

2.

1=1

Derive (7.4.18). Hint: For ML, replace -

.Jri

iJ

by ii in (7.3.5). For GMM, derive

n

- ,- 1 " ' - ,= -G.(/1) W y'il L..g(w,; llo)- G.(ll) WG.(II)./Ii(ll-11 0 ),

aQ.(IIJ

all

1::1

-

-

where lilies between 11 0 and II. The derivation of this should be as easy as the derivation of (7.3.31 ). 3. Explain why the LR statistic would not be asymptotically

x2 if l:

= -lfl did

not hold.

4. For ML, suppose the trio of statistics are calculated as indicated in Table 7 .2, but suppose w, is actually serially correlated. Which one is still asymptotically

x2 ? [Answer:

None.]

-

5. Suppose l: = -lfl does not hold but suppose that a consistent estimate lJI of lJI is available along with the consistent estimate I: of l:. Propose an LM statistic that is asymptotically x2 (r). Hint: Use the first equality in (7.4.23), which is valid without l: = -lfl. The answer is

-

n (a ~n/1(/1)

)'Iii_, A(iiJ'[A(ii)lii- I:lii-' A(O)' r' A(ii)lii 1

-I

(a Q;/1(/1)). (7.4.31)

6.

(Optional, proof of (7.4.27)) Prove (7.4.27) forM-estimators. Hint: Take for granted the "second-order'' mean value expansion around II,

-

- By the lirst-order condition, aQ.. c9, , where II- lies between II- and II. = 0 . Also, 39 2

-

' a9a9' Q,rel

--+p

E[H( w,,· II oJ] ·

497

Extremum Estimators

7. (Optional, proof of (7.4.27) for GMM) Prove (7.4.27) for GMM. Hint: From the mean value expansion of g, (II) around II, derive the Taylor expansion A

(7.4.33) Substitute this into Qn(O) = - !g, (0)' Wg, (ii). Then use the first-order condition that Gn(IJ)' Wg,(t}) = 0.

7.5

Numerical Optimization

So far, we have not been concerned about the computational aspect of extremum estimators. In the ML estimation of the linear regression model, for example, the objective function Qn(li) is quadratic in II, so there is a closed-fonn fonnula for the maximizer, which is the OLS estimator. The objective function is quadratic for linear GMM also, with the linear GMM estimator being the closed-fonn fonnula. In most other cases, however, no such closed-fonn fonnulas are available, and some numerical algorithm must be employed to locate the maximum. A number of algorithms are available, but this section covers only the two most important ones. More details can be found in, e.g .• Judge et al. (1985, Appendix B). The first is used for calculating M-estimators, while the second is used for GMM. Newton-Raphson

We first consider M-estimators, whose objective function Qn(li) is (7.1.2). Since Qn(ll) is twice continuously differentiable forM-estimators, there is a secondorder Taylor expansion

where ilj is the estimate in the j-th round of the iterative procedure to be described in a moment, and Sn and Hn are the gradient and the Hessian of the objective function: Sn(IJ)

=

aQn(li)

a/1' ,

(7 .5.2)

498

Chapter 7

.

The (j + 1)-th round estimator 6j+t is the maximizer of the quadratic function on the right-hand side of (7.5.1). It is given by (7.5.3) This iterative procedure is called the Newton-Raphson algorithm. If the objective function is concave, the algorithm often converges quickly to the global maximum . • For ML estimators, when the global maximum 6 is thus obtained, the estimate of Avar(i)) is obtained as -H.(Ii)- 1 (recall that H.(6) ==~I: H(w,; 6)). Gauss-Newton Now tum to GMM, whose objective function is Q.(6) = -~g.(6)'Wg.(6) (see (7.1.3)). As in the derivation of the asymptotic distribution, we work with a linearization of the g. (6) function. The first-order Taylor expansion of g. (6) around 9j is

g.(6)

~

. + G.(6j)(6. 6j) .

g.(6j)

(K xI)

(K xp)

=

(px I)

[g.(iJj)- G.(iJj)iJj]- [-G.(iJj)j6

= Vj- Gj6,

(7.5.4)

where (7.5.5) (Recall that G. (6) = "';;;,'.9>.) If the g. (6) function in the expression for the GMM objective function were this linear function Vj - Gj6, then the objective function would be quadratic in 6 and the maximizer (or the minimizer of the GMM distance) would be the linear GMM estimator: (7.5.6) This is the (j + 1)-th round estimate in the Gauss-Newton algorithm. 19 Unlike in Newton-Raphson, there is no need to calculate second-order derivatives. Writing Newton-Raphson and Gauss-Newton in a Common Format There are also similarities between Gauss-Newton and Newton-Raphson. Substituting (7.5.5) into (7.5.6) and rearranging, we obtain 19 The Gauss-Newton algorithm was originally developed for nonlinear least squares. See Review Question 2 for how Gauss-Newton is applied to NLS. The algorithm presented here is therefore its GMM adaptation.

499

Extremum Estimators

(7 .5.7) '

The second bracketed term is none other than the gradient evaluated at II j for the GMM objective function Qn(ll) = -!g,(II)'Wg,(ll). The role played by the Hessian of the objective function in (7.5.3) is played here by the first bracketed term, which is the estimate based on Oj of the -G'WG in Table 7.1. Therefore, the analogy between M-estimators and GMM noted in Table 7.1 also holds here. Furthermore, if g, is linear, then the Hessian equivalent (the first bracketed term in (7 .5. 7)) is the Hessian of the GMM objective function. So the analogy is exact: the Gauss-Newton algorithm coincides with the Newton-Raphson algorithm. Equations Nonlinear in Parameters Only In the Gauss-Newton algorithm (7.5.7), when g,(ll) (= ~ L: g(w,; II)) is not linear in II, it is generally necessary to take averages over observations to evaluate the gradient and the Hessian equivalent in each iteration. (This is also generally true in Newton-Raphson if the objective function is not quadratic in II.) For large data sets it is a very CPU time-intensive operation. There is, however, one case where the averaging over observations in each iteration is not needed. That occurs in a class of generalized nonlinear instrumental variable estimation (a special case of nonlinear GMM where g(y,, z,, II) can be written as a(y,, z,; ll)x,). Suppose that the equation a(y,, z,; II) takes the form

+ a 1(y,, z, )'«(II),

a (y,, z,; II) = ao(y,, z,)

(lxq)

(7.5.8)

(qxl)

where a 0 0 and a 1 0 are known functions of (y,, z,) (but not functions of II). A function of this form is said to be nonlinear in parameters only or linear in variables. It can be easily seen that I

g,(ll)

I

n

n

=- Lg(y,,z,;ll) =- La(y,,z,;ll) ·X, n

n

t==l ]

t=l

I

n

= (- L:ao(y,, z,) · x,) n r=l

n t=l (Kx I)

Gn(ll) =

n

+ (- I:x,at(y,, z,l')cx(ll),

(7.5.9)

(qxl) (Kxq)

ag, (II) (1~ ') acx (II) all' = n L.. x,at (y,, z,) all' . t=l

(qxp)

(K xq)

(7 .5.10)

500

Chapter 7

These equations make it clear that the sample averages need to be calculated just once, before the start of iterations. QUESTIONS FOR REVIEW

1. (Does Q. increase during iterations?) Set II = Oj+I in (7 .5.1) and rewrite it as

'

Consider the Newton-Raphson algorithm (7.5.3). Suppose that llj+I is sufficiently close to Oj so that the sign of Q.(Oi+Il- Q.(Oj) is the same as that of the right hand side of the equation. Show that Q.(Oj+Il > Q.(Oj) if Q.(ll) is concave. Hint: Using (7.5.3), the right-hand side can be written .as

2. (Gauss-Newton for NLS) In NLS (nonlinear least squares), the objective function is given by (7.1.19). Let Oi be the estimate in the j-th round and consider the linear approximation

(7.5.11) Replace I' (x,; II) in the m function by this linear function in II. Show that the maximizer of this linearized problem is

(7.5.12)

501

Extremum Estimators

PROBLEM SET FOR CHAPTER 7 - - - - - - - - - - - - ANALYTICAL EXERCISES

1. (Kullback-Leibler information inequality) Let f(y I x; IJ) be a parametric family of hypothetical conditional density functions. with the true density function given by f(y I x; 9 0 ). Suppose E[log f(y 1 x; IJ)] exists and is finite for aliiJ. The Kullback-Leibler information inequality states that

I x; IJ) f= f(y I x; IJo)] > 0 ==> E[logf(y I x;IJ)] < E[logf(y I x;IJo)].

Prob[f(y

Prove this by taking the following steps. In the proof, for simplicity only, assume that f (y I x; IJ) > 0 for all (y. x) and IJ, so that it is legitimate to take logs of the density f(y I x; IJ). (a) Suppose Prob[f(y I x; IJ) define a(w)

f= f(y I

x;

90 )]

>

0. Let w

= (y, x')' and

= f(y I x; IJ)jf(y I x; IJo).

Verify that a(w) f= I with positive probability, so that a(w) is a nonconstant random variable. (b) The strict version of Jensen's inequality states that if c(x) is a strictly concave function and xis a nonconstant random variable, then E[c(x)] < c(E(x)). Using this fact, show that E[log a(w)] < log{E[a(w)]}. Hint: log(x) is strictly concave. (c) Show that E[a(w)] = I. Hint: The conditional mean of a(w) equals I because

502

Chapter7

E[a(w) I x]

=I =

I

=I

a(w)f(y

I x; llo)dy

f(y I x; II) f( I . II )d f(y I x; llo) y x, 0 y f(y

I x; ll)dy

=1.

The last equality holds because f(y I x; II) is a density function far any x and 11. Note well that the conditional expectation is taken with respect to the true cand~ianal dens·lty f(y 1 x:llo). (d) Finally, show the desired result. 2. (Infannatian matrix equality) For ML, the score vector and the Hessian for observation (y, x) can be rewrinen (with w = (y, x')') as: s(w; II) =

alog f(y I x; II), all

(pxl)

. II _ as(w; II) _

"<~':~) ) -

all'

a2 1ag J(y I x; II)

-

a11a11'

·

As usual, far simplicity, assume f(y I x, II) > 0 for all (y, x) and II, so it is legitimate to take logs of the density function. (a) Assuming that integration (i.e., taking expectations) and differentiation can be interchanged, show that E[s(w; llo)l = 0. Hint: Since f(y I x; II) is a hypothetical density, its integral is unity:

I

f(y

I x; ll)dy =

I.

Differentiate bath sides of this equation with respect to II and then change the order of differentiation and integration to obtain the identity

I

s(w; ll)f(y I x; ll)dy =

0 .

(px I)

Set II = llo and obtain E[s(w; llo) I x] = 0. (b) Show the infonnation matrix equality: - E[H(w; llo) J = E[ s(w; llo) s(w; llo)'].

503

Extremum Estimators

Hint: Differentiating both sides of the above identity in the previous hint and assuming that the order of differentiation and integration can be interchanged, we obtain

f

"a ,[s(w;6)f(y lx;6)]dy=

o6

0.

(pxp)

(px I)

Show that the integrand can be written as

a

a6,[s(w; 6)J(y I x; 6)] I x; 6) + s(w; 6) s(w; 6)' f(y I x; 6).

= H(w; 6)f(y

3. (Trinity for linear regression model) Consider the linear regression model with normal errors, whose conditional density for observation t is logf(y,

I x,;{J,a 2 )

I I ( = - - Iog(2tr)- -log(a 2 ) - y,

2

2

x' {J) 2

2a

;

Let (fj, & 2 ) be the unrestricted ML estimate of 6 = (fJ' a 2 )' and let (jj, a2 ) be the restricted ML estimate subject to the constraint R{J = c where R is an r x K matrix of known constants. Assume that 8 = JRK x Ill.;.+ and that E(x, x;) is nonsingular. Also, let II""'

~ =

U2 ;; L......~;=t XI x, [

0I ]

'

1: =

2(&2)2

I I "" z;l;; L......~,= I XI XI' [

0I ]

.

2(i72)2

/i

(a) Verify that minimizes the sum of squared residuals. So it is the OLS estimator. Verify that 1J minimizes the sum of squared residuals subject to the constraint R{J = c. So it is the restricted least squares estimator. (b) Let Q.(6) = ~ I:7~ 1 1og f(y,

I x,; {J, a 2 ). Show that

I Iog (SSRu) n , 2 2 2 I log(2tr)- I - I 1og (SSRR) 2 2 2 -n- ,

- = - I Iog(2tr)- I Q.(6)

Q.(6) = -

where SSRu (= L;(y,- x;/i) 2 ) is the unrestricted sum of squared residuals and SSRR (= L;(y, - x;p) 2 ) is the restricted sum of squared residuals. Hint: Show that &2 = SSRufn and a2 = SSRRfn.

504

Chapter 7

(c) Verify that the 1; given here. although not the same as-~ I:7~t H(w,; iJ), is consistent for - E[H(w,; 11 0 ) ). Verify that the 1': given here, although not the same as-~ I:7~t H(w,; 0), is consistent for- E[H(w,; 9 0 )). Hint: From the discussion in Example 7.8, ii is consistent. As mentioned in Section 7.4, 9 too is consistent under the null. Given the consistency of ii and ii, it should be easy to show that 8 2 and ii 2 are consistent for cr 2 (d) Show that the Wald, LM, and LR statistics using 1; and 1; given here in the formulas in Table 7.2 can be written as

(RP- c)'[R(X'X)-'R'r'
where y (n x I) and X (n x K) are the data vector and matrix associated with y, and x,, and P = X(X'X)- 1X'. Hint: The A(IJ) (r x (K + I)) in Table 7.2 is

R : 0 ). ( (rxK) (rxl) (e) Show that the three statistics can also be written as W=n. SSRR -SSRu SSRu ' SSRRLM = n · .:___;_:_ _SSRu _.::.. SSRR ' LR = n . iog(-SS_R_R). SSRu

Hint: As we have shown in an analytical exercise of Chapter 1, SSRR - SSRv =

= (RP- c)'[R(X'X)- 1RT 1
505

Extremum Estimators

(f) Show that W > LR > LM. (These inequalities do not always hold in

nonlinear regression models.)

ANSWERS TO SELECTED Q U E S T I O N S - - - - - - - - - - -

2a. Since j(y I x; IJ) is a hypothetical density, its integral is unity:

I

j(y I x; IJ)dy = I.

This is an identity, valid for any IJ identity with respect to IJ, we obtain

aa(}

I

E

(I)

EJ. Differentiating both sides of this

j(y I x; IJ)dy =

0 .

(2)

(pxl)

If the order of differentiation and integration can be interchanged, then

:(} I

f(y I x; IJ)dy

=I :(}

j(y I x; IJ)dy.

(3)

But by the definition of the score, s(w; IJ) j(y I x; IJ) = :e j(y I x; IJ). Substituting this into (3), we obtain

I This holds for any(}

I

E

s(w; IJ)j(y I x; IJ)dy =

(4)

0 . (p X I)

EJ, in particular, for IJo. Setting(} = 8 0 , we obtain

s(w; IJo)f(y I x; IJo)dy = E[s(w; IJo) I x] =

0 .

(px I)

(5)

Then, by the Law of Total Expectations, we obtain the desired result.

References Amemiya, T., 1985, Advanced Econometrics, Cambridge: Harvard University Press. Davidson, R., and J. MacKinnon, 1993, Estimation and Inference in Econometrics, New York: Oxford University Press. Gallant, R., and H. White, 1988, A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. New York: Basil Blackwell.

506

Chapter 7

Gourieroux, C., and A. Monfort, 1995, Statistics and Econometric Models, New York; Cambridge University Press. Hansen, L. P., and K. Singleton, 1982, "Generalized Instrumental Variables Estimation of Nonlinear Rational Expectations Models," Econometrica, 50, 1269-1286. Judge, G., W. Griffiths, R. Hill, H. Ltitkepohl, and T, Lee, 1985, The Theory and Practice of Econometrics (2d ed.), New York: Wiley. Newey, W., and D. McFadden, 1994, "Large Sample Estimation and Hypothesis Testing," Chapter 36 in R. Engle and D. McFadden (eds.), Handbook of Econometrics, Volume IV, New York: North-Holland. White, H., 1994, Estimation, Inference, and Specification Analysis, New York: Cambridge University

Press.

CHAPTER 8

Examples of Maximum Likelihood

ABSTRACT The method of maximum likelihood (ML) was developed in the previous chapter as a special case of extremum estimators. In view of its importance in econometrics, this chapter considers applications of ML to some prominent models. We have already seen some of those models in Chapters 3 and 4 in the context of GMM estimation. In deriving the asymptotic properties ofML estimators for the models considered here, the general results for ML developed in Chapter 7 will be utilized. Before proceeding, you should review Section 7.1, Propositions 7.5, 7.6, 7.8, 7.9, and 7.11. A Note on Notation:

We will maintain the notation, introduced in the previous

chapter, of indicating the true parameter value by subscript 0. So 9 is a hypothetical parameter value, while 8o is the true parameter value.

8.1

Qualitative Response (QR) Models

In many applications in economics and other sciences, the dependent variable is a categorical variable. For example, a person may be in the labor force or not; a commuter chooses a particular mode of transportation; a patient may die or not; and so on. Regression models in which the dependent variable takes on discrete values are called qualitative response (QR) models. A QR model is described by a parameterized density function J(y, I x,; (I) representing a family of discrete probability distributions of y, given x,, indexed by 0. A QR model is called a binary response model if the dependent variable can take on only two values (which can be taken to be 0 and I without loss of generality) and is called a multinomial response model if the dependent variable can take on more than two values. Only binary models are treated in this subsection. For a thorough treatment of binary and multinomial models, see Amemiya (1985, Chapter 9).

508

Chapter 6

The most popular binary response model is the probit model, which has already been presented in Example 7.3. Another popular binary model is the logit model:

I x,; Oo) = A(x;Oo). = 0 I x,; 0 0 ) =I- A(x;Oo),

J(y, =I { j(y,

(8.1.1)

where A is the cumulative density function of the logistic distribution: A (v)

This density

exp(v)

= -:---'----':--:1 + exp(v)

(8.1.2)

f (y, I x,: 0 0 ) can be written as f(y,

I x,; Oo) = A(x;o 0 )Y' x [I- A(x;0 0 )] 1-Y'.

(8.1.3)

Taking logs of both sides and replacing the true parameter value 0 0 by its hypothetical value 0, we obtain the log likelihood for observation t: log f (y,

1

x,: 0) = y, log A (x;o) + (I - y,) log[! - A (x;o)].

(8.1.4)

The logit objective function Q,(O) is 1/n times the log likelihood of the sample (y,, x 1 , y2 , x2 .... , y,, x,). Under the assumption that (y,, x,} is i.i.d., the log likelihood of the sample is the sum of the log likelihood for observation t over t. Therefore, Qn(O) =

~ :t{y,logA(x;O)+(I- y,)log[l- A(x;OJJ).

n

(8.1.5)

t=l

For probit, we have proved that its log likelihood function is concave (Example 7.6), and that the ML estimator is consistent (Example 7.9) and asymptotically normal (Example 7.11) under the assumption that E(x,x;) is nonsingular. You are about to find out that doing the same for logit is easier. (On the first reading, however, you may wish to skip the discussion on consistency and asymptotic normality.) Score and Hessian for Observation t

The logistic cumulative density function has the convenient property that A'(v) = A(v)(l - A(v)), A"(v) =[I- 2A(v)]A(v)[l - A(v)]. (8.1.6)

509

Examples of Maximum Likelihood

Using this, it is easy to derive (see Review Question I (a)) the following expressions for the score and the Hessian for observation t: s(w,; II) ( = H(w,; II) (

, a log j(y, I x,; II)) all = [y,- A(x,ll)]x,,

= as(w,;ll)) all' =

' ' ' -A(x,ll)[l - A(x,ll)]x,x,.

(8.1.7) (8.1.8)

where w, = (y,, x;)'. Consistency Since x,x; is positive semidefinite, it immediately follows from its expression that H(w,: II) is negative semidefinite and hence Q"(ll) is concave. The relevant consistency theorem, therefore, is Proposition 7.6. We show here that the conditions of the proposition are satisfied under the nonsingularity of E(x,x;). Condition (a) of the proposition is the (conditional) density identification: j(y, I x,; II) oft f(y, I x,; 11 0 ) with positive probability for II oft 11 0 1 Since a logarithm is a strictly monotone transformation, this condition is equivalent to log f(y, I x,; II)

oft log f(y, I x,; 11 0 ) with positive probability for II oft 11 0 , (8.1.9)

where llo is the true parameter value and the density f is given in (8.1.3). Verification of this condition is the same as in the probit example in Example 7.9 and proceeds as follows. The discussion in Example 7.8 has established that E(x,x;) nonsingular

=> x;11 oft x;llo with positive probability for II oft 11 0 . (8.1.10)

Since the logit function A(v) is a strictly monotone function of v, we have A(x;ll) oft A(x;llo) when x;11 oft x;11 0 . Thus, condition (a) is implied by the nonsingularity of E(x,x;). Turning to condition (b) (that E[llog f(y,lx,; Ill I) < oo for all II), it is easy to show that I log A(v)l
(8.1.11)

Condition (b) can then be verified in a way analogous to the verification of the same condition for probit in Example 7.9. Thus, as in probit, the nonsingularity of E(x,x;) ensures thatlogit ML is consistent. 1In other words, let X = fC.vt I x ; 0) and Y =- j(y I x : 8o). X and Y are random variables because they 1 1 1 are functions of w 1 = (:y 1 , x:)'. The condition can be seated simply as Prob(X f" Y) > 0 for 0 f" Oo.

510

Chapter 8

Asymptotic Normality To prove asymptotic normality under the same condition, we verify the conditions of Proposition 7.9. Condition (I) of the proposition is satisfied if the parameter space e is taken to be JRP. Condition (2) is obviously satisfied. Condition (3) can be checked easily as follows. Since E(y, I x,) = A(x;8), we have E[s(w,; 8 0 ) I x,] = 0 from (8.1.7). Hence, E[s(w,; 8 0 )] = 0 by the Law of Total Expectations. It is left as Review Question l(b) to derive the conditional information matrix equality

E[s(w,; 8 0 ) s(w,; 8 0 )' I x,] = - E[H(w,; 8 0 ) I x,].

(8.1.12)

Hence, (3) is satisfied by the Law of Total Expectations. The local dominance condition (condition (4)) is easy to verify for logit because, since IA(x;8)[1 A(x;8)]1 < l,wehave: IIH,(w,;8)11 < llx,x;llforall8. Theexpectationofllx,x;ll' (the sum of squares of the elements of x,x;) is finite if E(x,x;) is nonsingular (and hence finite). So E(llx,x; II) is finite. Regarding condition (5), the argument in footnote 29 of Newey and McFadden (1994) used for verifying the same condition for probit can be used for logit as well 2 Thus, we conclude that if (y,, x,} is i.i.d. and E(x,x;) is nonsingular, then the logit ML estimator 8' of 8o is consistent and asymptotically normal, with the asymptotic variance consistently estimated by

_

Avar(ii) = -

[l L n

n

H(w,;

9)

]-1 ,

(8.1.13)

t=I

where w, = (y,, x;)' and H(w,; 8) is given in (8.1.8). QUESTIONS FOR REVIEW

1. (Score and Hessian for observation t for general densities) In (8.1.1), replace the logit cumulative density function (c.d.f.) A(x;8o) by a general c.d.f. F(x;8o). (a) Verify that the score and Hessian for observation

t

are given by

y,- F, s(w,; 8) = -=F,'"'·-(""l-'=F,.,.,) f,x,,

20nly for interested readers, here's lhe argument. For any ii > 0, lhere exists a C such lhat A(v)[J - A(v)] ~ C > 0 for Ivi .:::;: ii. So E( A(x19)[1- A(x'O)]xx'J ~ E(l (lx'lll .S ii)A(x'O)[I - A(x'O)]xx'J 2::. C E(1Cix'81 _:::: ii)xx'] in lhe matrix inequality sense, where HI vi_::::: ii) is lhe indicator function. The last term is positive definite (not just positive semi-definite) for large enough ii by non-singularity ofE(xx1).

511

Examples of Maximum Likelihood

Yt - Ft 2 , H(w,;IJ) =- [ F,. (1- F,) ] 2 j, x,x,

Yt - Fr

,

,

+ [ F,. (1- F,) ] j,x,x,.

where F, = F(x;IJ), j, = j(x;IJ), and JO = F'O is the density function. (b) Verify the conditional information matrix equality (8.1.12) for the general c.d.f. F. Hint: Show that the expected value of the second term in the expression for the Hessian in the previous question is zero. 2. Verify (8.1.7) and (8.1.8). 3. (Logit for serially correlated observations) Suppose {y,, x,} is ergodic stationary but not necessarily i.i.d. Is the logit ML estimator described in the text consistent? [Answer: Yes, because the relevant consistency theorem, Proposition ' 7.6, does not require a random sample. However, the expression for Avar(IJ) is not given by the negative of the inverse of the sample Hessian.]

8.2 Truncated Regression Models A limited dependent variable (LDV) model is a regression model in which either the dependent variable y, is constrained in some way or the observations for which y, does not meet some prespecified criterion are excluded from the sample. The LDV model of the former sort is called a censored regression model, while the latter is called a truncated regression model. This section presents truncated regression models; censored regression models are presented in the next section. The Model Suppose that {y,, x,} is i.i.d. satisfying y, =

x;Po + s,, s, I x,

~ N(O, cr~).

(8.2.1)

The model would be just the standard linear regression model with normal errors, were it not for the following feature: only those observations that meet some prespecified rule for y, are included in the sample. We will consider only the simplest truncation rule: y, > c where c is a known constant. This rule is sometimes called a "truncation from below." Once this case is worked out, it should not be hard to consider more general truncation rules.

512

Chapter 8

''

'

'

\~ '

''

''

''

''

'

X 0

density after truncation

densityofN(O,t)

''

''

'"'"

.... ---

c

Figure 8.1: Effect of Truncation

Truncated Distributions

To proceed, we need two results from probability theory.

Density of a Truncated Random Variable: If a continuous random variable y has a density function f(y) and cis a constant, then its distribution after the truncation y > cis defined over the interval (c, oo) and is given by f(y) f(y I y > c) = Prob(y > c)

(8.2.2)

Figure 8.1 illustrates how truncation alters the distribution for N(O, 1). The solid curve is the density of N(O, 1). Because it is a density, the area under the curve is unity. The dotted curve is the distribution after truncation, which is defined over the interval (c, oo). It lies above the solid curve over this interval so that the area under the dotted curve is unity. The second result follows.

Moments of the Truncated Normal Distribution: If y ~ N(f.lo, aJ) and c is a constant, then the mean and the variance of the truncated distribution are

I y > c) = f.lo + aoA(v), Var(y I y >c)= aJ(l - A(v)[A(v)-

(8.2.3)

E(y

where v = (c- f.lo)/ao and A(v)

= 1 °':;/,).

v]),

(8.2.4)

513

Examples of Maximum Likelihood

This formula for the mean clearly shows that the sample mean of draws from the truncated distribution is not consistent for J.Lo. This is an example of the sample selection bias. The bias equals ao'-(v). The function '-(v) is called the inverse Mill's ratio or the hazard function. The function A(v) is convex and asymptotes to vas v--+ oo and to zero as v--+ -oo. So its derivative '-'(v) is between 0 and I. It is easy to show that '-'(v) = A(V)(A(v)- v).

(8.2.5)

Using (8.2.3) and (8.2.4), we can show how truncation alters the form of the regression of y, on x,. By (8.2.1), the distribution of y, conditional on x, before truncation is N(x;/3 0 , aJ). Observation I is in the sample if and only if y, > c. So E(y, I x,, I in the sample)= x;/3 0 + ao'-(c-;;Po), Var(y, I x,, I in the sample) =

aJ

II - '-(c-;jPo) ['-(c-;;Po) - c-;;Po JJ.

(8.2.6) (8.2.7)

This formula shows that the OLS estimate of the x, coefficient from the linear regression of y1 on x1 is not consistent because the correction tenn a0 AC-;~Po), which is correlated with x,, would be included in the error term in the linear regression. Since the functional form of the hazard function A(v) is known (under the normality assumption for c, ), we could avoid the sample selection bias by applying nonlinear least squares to estimate ({3 0 , aJ), but maximum likelihood should be the preferred estimation method because it is asymptotically more efficient. In passing, we note that the sample selection bias does not arise if selection is based on the regressors, not on the dependent variable. Suppose that observation 1 is in the sample if cf>(x,) > 0. Then E(y, I x,, 1 in the sample)= E[x;/3 0 + £ 1 I x,, cf>(x,) > 0] = x;/3 0 + E[c, I x,, cf>(x,) > 0]

=

x;f3o·

(8.2.8)

The last equality holds since E [c, I x,, cf>(x,) > 0] = E(c, I x,) = 0 if cf>(x,) > 0. The Likelihood Function

Derivation of the likelihood function utilizes (8.2.2). As just noted, the pretruncation distribution of y, I x, is N(x;f3 0 , aJ), whose density function is

514

Chapter 8

exp[-~(y'- x;{J 0 ) 2 ]

I

2

J2rrag

ao

= _1 ao

(8.2.9)

ao

where c I x,) = I - Prob(y, ::: c I x,) = 1 _ Prob(y' = 1-

x;Po

< c-

ao

<~>C -"':;Po)

x;Po I x,)

ao (since Y•-;JPo

1 x,

~ N(O.

t)). (8.2.10)

Therefore, by (8.2.2), the post-truncation density, defined over (c, oo), is

(8.2.11)

This is the density (conditional on x,) of observation t in the sample. Taking logs and replacing ({J 0 , ag) by their hypothetical values ({J, a 2 ), we obtain the log conditional likelihood for observation t: logf(y,

I x,; {J, a 2 )

2 (Y'- x;fJ)2} -log [I- (c- x;p)]

I I tog(a ) - 2:I = { -2log(2rr)2

17

.

17

(8.2.12) Were it not for the last term, this would be the log likelihood (conditional on x,) for the usual linear regression model. The last term is needed because the value of the dependent variable for observation t, having passed the test y, > c and thus being in the sample, is never less than or equal to c. Reparameterizi11g the Likelihood Function This log likelihood function can be made easier to analyze if we introduce the

following reparameterization: ~ =

{Jfa, y =!fa.

(8.2.13)

(We have considered this reparameterization in Example 7. 7 for the linear regression model.) This is a one-to-one mapping between (fJ, a 2 ) and (~. y), with the inverse mapping given by fJ = ~fy, a 2 = ljy 2 The reparameterized log

515

Examples of Maximum Likelihood

conditional likelihood is log j(y,

I x,; l)

= [-

I

2

1og(2rr)

+ log(y)-

I

,

Z(yy,- x,8)

2] -log[l- (ye- x,8)]. , (8.2.14)

The objective function in ML estimation is I In times the log likelihood of the sample. For a random sample, the log likelihood of the sample is the sum over t of the log likelihood for observation t. Thus, the objective function in the ML estimation of the reparameterized truncated regression model, denoted Q.(8, y), is the average overt of (8.2.14). The ML estimator (J, y) of (8 0 , y0 ) is the (8, y) that maximizes this objective function. Verifying Consistency and Asymptotic Normality

The next task is to derive the score and the Hessian and check the conditions for consistency and asymptotic normality for the reparameterized log likelihood (8.2.14). (On the first reading, you may wish to skip the discussion on consistency and asymptotic normality.) Score and Hessian for Observation t A tedious calculation yields

s(w,; 8, y) = ((K+l)x1)

[ ;;

H(w,; 8, y) = ((K+l)x(K+l))

1

(yy, - x;8)x, , -

[

]

(yy, - x,8)y,

XtXt'

-~~

,

1

-ytxr

r'

2

+ Y,

+ l(v,) [-x,] ,

(8.2.15)

e

]

+ l(v,)[l(v,)- v,]

[-ex; '

~~

-ex,] e2

'

(8.2.16) where K is the number of regressors, w, = (y,, x;)', and v, = ye- x;8 (= '-;:~). The Hessian is not everywhere negative semidefinite even after the reparameterization. So the objective function Q.(8, y) is not concave 3 The relevant consistency theorem, therefore, is Proposition 7.5, not Proposition 7.6, while the relevant asymptotic normality theorem remains Proposition 7.9. 3However. it can be shown that the Hessian of the objective function Qn(9) is negative semidefinite at any solution to the likelihood equations (see Orme, 1989). Since this rules out any saddle points or local minima, having more than two local maxima is impossible. Therefore, provided that the maximum of Qn(d, y) occurs at a point interior to the parameter space, the first-order condition is necessary and sufficient for a global maximum. If an algorithm such as Newton-Raphson produces convergence, the point of convergence is the global maximum.

516

Chapter 8

Consistency Condition (a) of Proposition 7.5 is the (conditional) density identification

-

f(y,

I x,; 8,

y)

i=

-

f(y,

I x,; 8o, Yo)

with positive probability for (8, y)

i= (8o, Yo).

where (8o, yo) is the true parameter value and the parameterized log density f(y, I x,; 8, y) is given in (8.2.14). Clearly, given (y,, x,), the value of j(y, I x,; 8, y) is different for different values of y. So identification boils down to the condition that X,8 i= x;oo with positive probability for 8 i= 80 • But (8.1.1 0) indicates that this condition is satisfied ifE(x,x;) is nonsingular. Turning to condition (b) of the proposition (the dominance condition for j(y, I x,; 8, y)), it is easy to show that the condition is satisfied under the nonsingularity of E(x,x;l 4 Therefore, the ML estimator (J, j)) of (80 , y0 ) is consistent under the nonsingularity ofE(x,x;). Asymptotic Normality A rather tedious calculation using (8.2.6), (8.2.7), and (8.2.13) shows that E[s x,] = 0. An extremely tedious calculation yields the conditional information matrix equality: E[ss' I x,] = - E[H I x, ]. Therefore, condition (3) of Proposition 7.9 is satisfied. We will not discuss verification of conditions (4) and (5) of the proposition 5 Thus, we conclude: if {y,, x,) is i.i.d. and E(x,x;) is nonsingular, then the ML estimator (J, 9) of the truncated regression model is consistent and asymptotically normal with the asymptotic variance consistently estimated by -

Avar(8.

y) =

-

[]

n

L H(w,; J, y) ]-1 n

(8.2.17)

t=l

where w, = (y,, x;)' and H(w,; IJ) is given in (8.2.16).

4 0nly for interested readers, what needs to be shown is E[sup Ee pogj(y I x ;9)1]
517

Examples of Maximum Likelihood

Recovering Original Parameters By the invariance property of ML (see Section 7.1), the ML estimate of ({j 0 , aJ) is = ~jy and &2 = I jf/ 2 , the value of the inverse mapping. Clearly, if the ML estimate of the reparameterized model, (~. y), is consistent, then so is the ML estimate (ti, &2) thus recovered. The asymptotic variance of & 2) can be obtained

/i

(/i,

-

by the delta method (see Lemma 2.5) as JAvar(~. y)J', where J is the estimate of the Jacobian matrix for the inverse mapping ([j = ~/y, a 2 = ljy 2 ) evaluated at (~. y):

-- [ti -j,] J-

0'

- ..1.. yJ

.

(8.2.18)

QUESTIONS FOR REVIEW

1. (Second moment of y,) From (8.2.6) and (8.2.7), derive:

E(y; I x,, tin sample)=

where vt

= yc -

~[I+ !.(v,)yc + (x;~) 2 + A.(v,)x;~]. y

x;o.

2. (Estimation of truncated regression model by GMM) The truncated regression model can also be estimated by GMM. Consider the following K + I orthogonality conditions. The first K of them are g 1 (w,:~.y)= (Kxi)

where w, = (y,, x;)' and v,

!.(v,)) I , ( y,--x,~--- -x,,

Y

Y

(8.2.19)

= yc-X,~. The (K + 1)-th orthogonality condition

IS

g2(w,;

~. y) = l- ~[I+ !.(v,)yc + (x;~) 2 + A.(v,)x;~]. y

(8.2.20)

(a) Verify that E[g1 (w,: ~ 0 • y0 ) 1 = 0 and E[g 2(w,: ~ 0 , y 0 ) 1 = 0. Hint: Use the Law of Total Expectations. v, is a function of x,. (b) Show that(~', y') satisfies the likelihood equations if and only if it satisfies the K +I orthogonality conditions. Hint: Let s 2(w,; ~- y) be the (K + 1)-th

518

Chapter 8

element of s(w,;

~.

y) in (8.2.15). Then

,

(

I

s 2 (w,; ~. y) + yg2 (w1 ; ~. y) = (x,~) y,- y x',~-

8.3

y),(v . 1) )

Censored Regression (Tobit) Models

The censored regression model is also called the Tobit model to pay respect to Tobin ( 1958), who was the first to introduce censoring in economics. The simplest Tobit model can be written as follows: t=l,2, ... ,n, y, = {

Yt

ifyt* > c,

c

if Yr"'
(8.3.1) (8.3.2)

Here, e1 I x1 is N (0, a~) and {y,, x1 ) (I = I, 2, ... , n) is i.i.d. The threshold value c is known. Unlike in the truncated regression model, there is no truncation here: observations for which the value of the dependent variable y,• fails to pass the test y,' > c are in the sample. The feature that distinguishes the censored regression model from the usual regression model is that the dependent variable is censored, that is, constrained to lie in a certain range, which necessitates a distinction between the observable dependent variable y, and the unobservable or latent variable y,'. The former is the censored value of the latter. An equivalent way to write the model is y, =

max{x;Po + E:,, c).

(8.3.3)

Tobit Likelihood Function For those observations that remain intact after censoring (Tobin ( 1958) called those

observations non limit observations), the density of y, (conditional on x,) is _1 q,(Y'- x;Po).

o-o

ao

(8.3.4)

This differs from the density for the truncated model (8.2.11) simply because there is no truncation. For those observations, called limit observations, for which the value of the dependent variable is altered by censoring, all we know is that y,• < c,

519

Examples of Maximum Ukelihood

the probability of which is given by Prob(y,' < c I x,) = Prob(Y:- x',Po < c- x',Po / x,) o-o o-o =

<~>(c -x',Po) (since Y!-•!Po 1 x, ao

~ N(O, 1)).

(8.3.5)

o-o

Therefore, the density of y,, defined over the interval [c, oo), is (8.3.4) for y, > c and a probability mass of size at y, = c. This density can be written as

c-;;Po)

(8.3.6) where the dummy variable D, is a function of y, defined as 0

D,= { I

if y, > c (i.e., y,' > c),

(8.3.7)

if y, = c (i.e., y,' :5 c).

Taking logs and replacing (fJ 0 , o-J) by their hypothetical values (/J, o- 2 ), we obtain the log conditional likelihood for observation t: log f(y,

I x,; p, o- 2 )

I

=(I - D,) log [ o-

(Y ' -x'P)] o- ' + D, log (c-x'p o-' ). (8.3.8)

Thus, the Tobit average log likelihood for a random sample can be written as Q.(e) =

=

~ ~{o- D,)Iog[>C' ~x;p) J+D,Iog<~>C-o-x;P)} ~ L {log[>C' ~ x;P)]} + ~ L {log <~>(c -o-x;P) ~-

l·

(8.3.9)

~-

Reparameterlzatlon

As in the truncated regression model, the discussion is simplified if we introduce the reparameterization (8.2.13). The reparameterized log conditional likelihood is log j(y, I x,; 8, y) =(I-

D,)~-~ log(2rr) + log(y)- ~(yy,- x;8) 2 ) + D,log(yc-x;8).

(8.3.10)

520

Chapter 8

We have seen that the term in braces is concave in (d, y) (see Example 7.7). We have also seen that log <J>(v) is concave (see Example 7.6), which implies that log <J>(yc- x;d) is concave in (d. y). It follows that the reparameterized Tobit average log likelihood for observation t, being a non-negative weighted average of two concave functions, is concave in the parameters (this finding is due to Olsen, 1978). Accordingly, as in probit and logit, the relevant consistency theorem for Tobit is Proposition 7.6. The remaining task is to write down the score and the Hessian and check the conditions for consistency and asymptotic normality for the reparameterized log likelihood (8.3.10). Score and Hessian for Observation t

A straightforward calculation utilizing the fact that!.( -v) = (8.2.5) produces

s(w,;d, y) = (1- D,) · ((K+l)xl)

(yy,- x;d)x, [

(

1

y- YYt-

H(w,; d, y) = -(I - D,) · ((K+l)x(K+l))

[

XrXt

") Xta Yt

',

-ytxt

]

+ D,

I

· 1.(-v,)

¢( -v) ¢'( v)

[-x,]

and

=

¢(v) ¢'(v)

,

(8.3.11)

C

]

- YrXt 1 2

y2

+ Yr

- D, · 1.(-v,)[!.(-v,) + v,]

x,x;,

[ -ext

-ex,] 2 c

•

(8.3.12)

where

Vr

= yc -

c-x;p) " (= xta a- .

Consistency and Asymptotic Normality

For Tobit, we will not verify the conditions for consistency and asymptotic normality. For the case where (x,} is a sequence of fixed constants, Amemiya (1973) has proved the consistency and asymptotic normality of Tobit ML under the assumption that x, is bounded and lim~ L~~~ x,x; is nonsingular. It seems a reasonable guess that for the case where x, is stochastic (as here), a sufficient condition for consistency and asymptotic normality is the nonsingularity ofE(x,x;). Recovering Original Parameters

As in the truncated regression model, the ML estimate of ({J 0 , O"C) can be recovered as {i = 8;9 and i7 2 = l/y 2 . The asymptotic variance of ({i, i7 2 ) can be obtained

521

Examples of Maximum likelihood

by the delta method as JAvar(~. ' yA)]-1 . - [ ~I"" L...•~I H(w,,. o,

ylJ', where J is as in (8.2.18) and Avar(~. y) is

QUESTIONS FOR REVIEW

1. (Censored vs. truncated regression) One way to see the difference between truncation and censoring is to calculate E(y, I x,) for the censored regression model. Show that

Hint: Conditional on y, > c, the distribution of y, (conditional on x,) is none

other than the truncated distribution derived in the previous section, whose expectation is given in (8.2.6). The probability that y, >cis given in (8.2.10).

2. Show that the Hessian given in (8.3.12) is negative semidefinite. Hint:

1y,x, 2] = [~'Yr ] + Yr

[ x;

y2

x,x; [ -ex'

'

-ex,] _ [ x, ] [ , _ cJ. c -c 2

Also, ).(v) > v for all v (i.e., A(-v)

-

+v >

~]·

y·

xt

0 for all -v).

3. (Tobit for serially correlated observations) Suppose {y,, x,) is ergodic stationary but not necessarily i.i.d. Is the Tobit ML estimator described in the text consistent? [Answer: Yes.]

8.4 Multivariate Regressions In Section 1.5 and also in Example 7.2 of Chapter 7, we pointed out that the OLS estimator of the linear regression model is numerically the same as the ML estimator of the same model with normal errors. We have derived the GMM estimators for the single-equation models with endogenous regressors in Chapter 3 and for their multiple-equation versions in Chapter 4, but we have not indicated what the ML estimators for those models are. In this section and the next, we derive those ML counterparts. This section examines the ML estimation of the multivariate regression model, which is the simplest multiple-equation model.

522

Chapter 8

The Multivariate Regression Model Restated

The multivariate regression model is described by Assumptions 4.1-4.5 and 4.7, with z,m = x,m = x, for all m = I, 2, ... , M. To change the notation of Chapter 4, the system of M equations can be written as

Y<m =

x;

:n:om

+ v,m

(m =I, 2, ... , M; t =I, 2, ... , n).

(8.4.1)

(I xK) (Kx I)

Here, we maintain the convention of designating the true parameter vector by subscript 0. If we define Y
y,

Y<2

v,

,

no = [:n:ol

7ro2

· · · 7roM], (8.4.2)

(KxM)

(Mxl)

(Mxl)

Y<M then the M -equation system can be written compactly as

y,

= n~x, + v,

(t

= I, 2, ... , n).

(8.4.3)

The model has the following features. • {y,, x,} is ergodic stationary (Assumption 4.2).

• The common regressors x, are predetermined in that E(x, · v,m) = 0 for all m (Assumption 4.3). • The rank condition for identification (Assumption 4.4) reduces to the familiar condition that E(x,X,) be nonsingular. • The error vector is conditionally homoskedastic in that E(v, v; Sl 0 is positive definite (Assumption 4.7).

I x,)

= Sl0 , and

The GMM estimator of :n: om is the OLS estimator: (8.4.4) So the GMM (OLS) estimate of no is given by

fi

(KxM)

1

n

= ("x,x;) n L..t t=l

(KxK)

-I

1 n (- "x,y;). n

L..t

t=l (KxM)

(8.4.5)

523

Examples of Maximum Likelihood

The Likelihood Function

The model is not yet specific enough to lend itself to ML estimation. We make the supplementary assumptions that (I) v, I x, ~ N(O. 2 0 ), and (2) (y,, x,} is i.i.d. (This assumption strengthens Assumptions 4.2 and 4.5.) The multivariate regression model with the normality assumption (I) implies that y, 1 x, ~ N(D~x,, Q 0 ). The density of this multivariate normal distribution is (8.4.6) Replacing the true parameter values (D 0 , Ro) by their hypothetical values (D, Q) and taking logs, we obtain the log conditional likelihood for observation t: logf(y, I x,; D, Q) = -

M

2

I

I

I

log(2rr) + zlog(IR- D- z(y,- D x,)

sz-

ff)

f

(y,- D x,), (8.4.7)

where use has been made of the fact that -log(IQI) = log(IQ- 1 1). For a random sample, I In times the log conditional likelihood of the sample is the average over t of the log conditional likelihood for observation t. So the objective function Q.(ll, Q) is Q.(ll, Q) =- M log(2rr) 2

+~

2

log(IR- 1 1)-

-

1

2n

n

l)y,- D'x,)'Q- 1 (y,- D'x,). t=l

(8.4.8) It is left as Review Question 2 to show that the last term can be written as n

'"' - I L...-(y,D f x,) f 2n t=l

sz- I (y,- D x,) =-2I trace[Q- 1-Q(D)], f

(8.4.9)

-

where Q(D) is defined as n

Q(ll) (MxM)

'"' =-I L...-(y,D ,x,)(y,- D ,x,)., n

(8.4.10)

t=l

So the objective function can be rewritten as Q.(ll, Q) = -

M

2

log(2rr)

+I

2

log(IQ- 1 I)-

lI

-

trace[Q- 1 Q(ll)]. (8.4.11)

The ML estimate of (D 0 , Q0 ) is the (D, Q) that maximizes this objective function.

524

Chapter 8

The parameter space for (ll 0 , S! 0) is {(ll, S!) IS! is symmetric and positive definite}.

Maximizing the Likelihood Function We now prove that the solution to the maximization problem is numerically the same as the GMM estimator. That is, the ML estimator of llo is given by the GMM/OLS estimator fi in (8.4.5) and the ML estimate of S! 0 is given by n

-S! = -I~-., L- v, v, = S!(ll), n

vt

(8.4.12)

t=I

_,

n

where = YtXr. It is left as Analytical Exercise 2 to show that, under the assumptions of the multivariate regression model stated above, Si(ll) is positive definite with probability one for any given n for sufficiently large n. So we can assume that Si(ll) is positive definite, not just positive semidefinite. The proof that the GMM estimator is numerically the same as the ML estimator is based on the following two-step maximization of the objective function.

Step 1: The first step is to maximize Qn(ll, S!) with respect to S!, taking

n

as given. For this purpose, the following fact from matrix algebra is useful.

An Inequality Involving Trace and Determinan/: 6 Let A and B be two symmetric and positive definite matrices of the same size. Then the function f(A) =log (lA I)- trace (AB)

is maximized uniquely by A = B- 1 . This result, with A = S!- 1 and B = Si(ll), immediately implies that the objective function (8.4.11) is maximized uniquely by S! = §(ll), given n. Substituting this S! into (8.4.11) gives (1/n times) the concentrated log likelihood function (concentrated with respect to S!):

6This result is implied by Lemma A.6 of Johansen (1995). The uniqueness of the minimizer is due to the linearity of the trace operator and the strict concavity oflog(IAD noted in Magnus and Neudecker ( 1988, p. 222).

525

Examples of Maximum Likelihood

Q:(ll)

=

Qn(ll, Q(ll))

M I I = -21og(2rr) + 21og(IQ(ll)- 11)- trace[Q(ll)- 1Q(ll)] 2

= -

M

2

tog(2rr)-

I

2

-

tog(IQ(ll)l)-

M 2'

(8.4.13)

Step 2: Looking at the concentrated log likelihood function the ML estimator of ll 0 should minimize

I'~

-

IQ(ll)l = -;; L..(y,t=l

·

Q~ (ll),

· ·1

n x,)(y,- n x,)

we see that

(8.4.14)

.

It is left as Analytical Exercise I to show that this is minimized by the

OLS estimator gives (8.4.12).

fi

given in (8.4.5). Finally, setting

n

=

fi

in (8.4.10)

Consistency and Asymptotic Normality We have shown in Chapter 4, without the normality assumption (that e, x, 1s normal) or the i.i.d. assumption for {y1 , x, }, that the GMM/OLS estimator fi in (8.4.5) is consistent and asymptotically normal and that Q in (8.4.12) is consistent. It then follows, trivially, that the ML estimator of ll 0 is consistent and asymptotically normal and the ML estimator of Q0 is consistent. It is worth emphasizing that the ML estimator of llo based on the likelihood function that assumes normality is consistent and asymptotically normal even if the error term is not normally distributed. We will not be concerned with the asymptotic normality of the ML estimator of Q0 , Q(ll). One way to demonstrate the asymptotic normality would be to verify the conditions of the relevant asymptotic normality theorem of the previous chapter.

-

QUESTION FOR REVIEW

1. Prove (8.4.9). Hint: Use the following properties of the trace operator. (1) trace(x) = x if xis a scalar, (2) trace(A +B) = trace(A) + trace(B), and (3) trace(AB) = trace(BA) provided AB and BA are well defined. The answer is n

.!.n L(y,- ll'x,)'Q- 1(y,- ll'x,) t=l

=trace[~ t(y,- ll'x,)'Q- 1(y,- n'x,)J t=l

526

Chapter 8

= .!_ ttrace[(y,n t= L

ll'x,n~-'(y,- ll'x,)]

n

= .!_ n

2:: trace[ ~r' (y, -

n'x, )(y, - n'x,)'J

t=l

=trace[~ tQ- 1(y,- ll'x,)(y,- ll'x,)'] t=L

~

=trace[ Q- 1 t(y,- ll'x,)(y,- ll'x,)l

8.5 FIML As seen in Chapter 4, the multivariate regression model is a specialization of the seemingly unrelated regression (SUR) model, which in tum is a specialization of the multiple-equation system to which three-stage least squares (3SLS) is appli· cable. The two-stage least squares (2SLS) estimator is the GMM estimator when each equation of the system is estimated separately, while 3SLS is the GMM estimator when the system as a whole is estimated. This section presents the ML counterparts of 2SLS and 3SLS. The Multiple-Equation Model with Common Instruments Restated To refresh your memory, the model is an M -equation system described by Assump-

tions 4.1-4.5 and 4.7, with Xrrn = x, for all m =I, 2, ... , M (so the set of instruments is the same across equations). Still maintaining the convention of denoting the true parameter value by subscript 0, we can write the M equations of the system as Yrrn

=

z;rn

~om +errn

(m= 1,2, ... ,M;t

= 1,2, ... ,n).

(8.5.1)

(lxL,.) (L,xl)

The model has the following features. • Unlike in the multivariate regression model, for each equation m, the regressors Zrrn may not be orthogonal to the error term, but there are available K predetermined variables x, that are known to satisfy the orthogonality conditions E(x, · Brrnl = 0 for all m (this is Assumption 4.3). Defining the M x I vector Etas

527

Examples of Maximum Likelihood

Et (Mxl)

(8.5.2)

=

the orthogonality conditions can be expressed as E(x,<;) =

0

(KxM)

(8.5.3)

.

These K predetermined variables serve as the common set of instruments in the GMM estimation. • The endogenous variables of the system are those variables (apart from the errors) in the system that are not included in x,. • The rank condition for identification (Assumption 4.4) is that the K x Lm matrix E(x,z;ml be of full column rank for all m. Equation m is said to be overidentified if the rank condition is satisfied and Lm < K. • The error vector is conditionally homoskedastic in that E(s,s; I: 0 is positive definite (Assumption 4.7).

I x,)

= I: 0 , and

• E(x,x;) is nonsingular (a consequence of conditional homoskedasticity and the nonsingularity requirement in Assumption 4.5). The GMM estimator for this system as a whole is the 3SLS estimator given in (4.5.12), while the single-equation GMM estimator of each equation in isolation is 2SLS given in (3.8.3). The following examples will be used to illustrate the concepts to be introduced shortly.

Example 8.1 (A model of market equilibrium): An expanded version of Working's example considered in Section 3.1 is a two-equation system:

q, = YliP< + fJll + fJ12a, +
(demand),

(8.5.4)

q, = Y21P< + fJ21 + fJnw, +

(supply).

(8.5.5)

<<2

In this system, q, is the amount of coffee transacted and p, is the coffee price. The variable a, appearing in the demand equation is a taste shifter, while the variable w, is the amount of rainfall that affects the supply of coffee. The

528

Chapter 8

system can be cast in the general format by setting

Yn =q,, Y<> =q,, Zn

=

[~:], z,, = [ ~],

~01 = [;::]' ~02 = [;~:]. Jl,, /lJ2

(Note that Yn andy,, are the same variable.) IfE(&,j) = 0, E(a,&,j) = 0, and E(w,&,j) = Ofor j = 1,2, then (l,a,,w,) can be included in the common set of instruments. So

The other nonerror variables in the system, p, and q,, are endogenous variables. Provided that the rank condition for identification is satisfied, each equation is just identified because the number of right-hand side variables equals that of instruments. Example 8.2 (Wage equation): In the empirical exercise of Chapter 3, we estimated a standard wage equation. Consider supplementing the wage equation by an equation explaining KWW (the score on the "Knowledge of the World of Work" test):

+ il11 + Jl12IQ, + ""' Y21S, + il21 + ea,

LW, = Y11S, KWW, =

(8.5.6) (8.5.7)

where S,, for example, is years in schooling for individual I. Assume that E(&,j) = 0 and E(IQ,&,j) = 0 for j = I, 2. Assume also that there is a variable MED (mother's education), which does not appear in either equation but which is predetermined in that E(MED,eu) = 0 for j = I, 2. So the common set of instruments is x, = (I, IQ,, MED,)'. The rest of the nonerror variables in the system are endogenous variables. This two-equation system has three endogenous variables, LW, S, and KWW. Provided that the rank condition is satisfied, the first equation is just identified because the number of right-hand side variables equals the number of instruments, while the second equation is overidentified.

529

Examples of Maximum Likelihood

The Complete System of Simultaneous Equations The ML counterpart of 3SLS is called the fun-information maximum likelihood (FIML) estimator. If it is to be estimated by FIML, the multiple-equation model needs to be further specialized before introducing the normality and i.i.d. assumptions. The required specialization is that those M equations form a "complete" system of simultaneous equations. Completeness requires two things. I. There are as many endogenous variables as there are equations. This condition implies that, ify, is an M-dimensional vector collecting those M endogenous variables, then (Y<~, ... , y, M, z, 1 , .•• , z, M) are all elements of (y,, x, ), which allows one to write the M -equation system (8.5.1) as

ro

y,

(MxM)(Mxl)

+

Bo

x,

e,

=

(MxK)(Kxl)

(1=1,2, ... ,n),

(8.5.8)

(Mxl)

with the m-th equation being y,m = z;moom + e,m· (This is illustrated for Example 8.1 below.) The equation system written in this way is called the structural form, and (r 0 , B 0) are called the structural form parameters. As will be illustrated below, each of the structural form parameters is a function of (,lot, ... , OoM). The system in Example 8.1 satisfies this first requirement. The system in Example 8.2 is not a complete system because the number of endogenous variables is greater than the number of equations. It is not possible to estimate incomplete systems by FIML (unless we complete the system by adding appropriate equations, see the discussion about LIML below). This is in contrast to GMM; incomplete systems can be estimated by 3SLS (or more generally by multiple-equation GMM if conditional homoskedasticity is not assumed), as long as they satisfy the rank condition for identification. 2. The square matrix r 0 is nonsingular. This implies that the structural form can be solved for the endogenous variables y, as (8.5.9) where n~ (MxK)

vt (Mxl)

=-

ro -• Bo , (MxM)

= ro (MxM)

-1

(8.5.10)

(MxK)

et .

(8.5.11)

(Mxl)

Expression (8.5.9) is known as the reduced-form representation of the structural form, and the elements of 0 0 are called the reduced-form coefficients. In

530

Chapter 8

each equation of the reduced form, the regressors are the same set of variables, x,. From (8.5.3) and (8.5.11), it follows that E(x,v;) = 0, namely, that all the regressors are predetermined. The reduced form, therefore, is a multivariate regression model. Relationship between (r0 , B,) and

&o

Let So be the stacked vector collecting all the coefficients in the M -equation system (8.5.1):

(8.5.12)

For understanding the mechanics of FIML, it is important to see how the coefficient matrices (f 0 , B0 ) depend on S0 . To illustrate, consider Example 8.1. The S0 vector is YII fJII

So = (6x I)

fJ12

(8.5.13)

Y21

fJ21 fl22

The two endogenous variables are (p,, q,). It does not matter how the endogenous variahles are ordered, so arhitrarily put p, first in y,: y, = (p,, q, )'. The structural form parameters can be written as

ro

(2x2)

=

-YII [

-y21

Bo = (2x3)

-fJII [ -fJ21

-fJI2

0

(8.5.14)

This example serves to illustrate three points. First, each row of r 0 has an element of unity, reflecting the fact that the coefficient of the dependent variable in each of theM equations (8.5.1) is unity. In this sense r 0 is already normalized. Second, some elements of r 0 and B0 are zeros, reflecting the fact that some endogenous or predetermined variables do not appear in certain equations of the M-equation system. This feature of the structural form coefficient matrices is called the exclusion restrictions. (In Example 8.1, no elements of r 0 happen to be zero because the two endogenous variables appear in both equations.) Third,

531

Examples of Maximum Likelihood

the system involves no cross-equation restrictions, so each element of ~ 0 appears in (f 0 , B0 ) just once. The FIML Likelihood Function

To make the complete system of simultaneous equations estimable by FIML, we assume (I) that the structural error vector e, is jointly normal conditional on x,, i.e., e, I x, ~ N(O, 1: 0 ), and (2) that {y,,x,} is i.i.d., not just ergodic stationary martingale differences (this assumption strengthens Assumptions 4. 2 and 4.5). By the normality assumption (I) about the structural errors e, and (8.5.11), we have v, I x, ~ N(O, f 011: 0 (f 01)'). Combining this and the reduced form y, = II~x, + v, with (8.5.1 0) produces (8.5.15) So the log conditional likelihood for observation 1 is log f(y, I x,; ~. 1:) = -

M

2

~[y, +

log(2rr) -

I

2

log (If - 11: (r- 1l'l) 1

r- 1Bx,l'[r- 11:(r- 1)']- [y, + r- 1Bx,].

(8.5.16)

The likelihood is a function of(~, 1:) because, as illustrated in (8.5.14), the structural form coefficients (f, B) are functions of~. Now

)T 1 Iy, + r- 1Bx,]

[y, + r- 1Bx,l'[r- 11:(f- 1

= [y, + r- 1Bx,l'[r'1:- 1 f]ly, + r- 1Bx,J = [ry, +Bx,]'1:- 1 [ry, +Bx,]

(8.5.17)

and (8.5.18) Substituting these into (8.5 .16), logj(y, I x,; ~. 1:) = -

M

2

I 1 1og(2rr) + 21og(lfl 2) - 2log(l1:1)

-

~[ry, +

Bx,]'1:- 1[ry, + Bx,]. (8.5.19)

Taking the average over 1, we obtain the FIML objective function for a random sample:

532

Chapter 8

I

Af

Qn(~, l:) = -2Jog(2rr)+ 21og(lfl

2

)-

I 21og(ll:l)

n

-_I "'[ry, +Bx,]'l:- 1 [ry, +Bx,]. 2n L...,

(8.5.20)

t=l

The FIML estimate of (~ 0 , 1: 0 ) is the(~, l:) that maximizes this objective function. The FIML Concentrated Likelihood Function As in the maximization of the objective function of the multivariate regression model, we can proceed in two steps. The first step is to maximize Qn(~, l:) with respect to l: given ~. This yields -l:(~) =-I~ L...,(fy, + Bx,)(fy, + Bx,).' (MxM) n r::::l

(8.5.21)

Its (m, h) element can be written as n

&mh =

~ L(Yrm- z;m~m)(y,h- z;h~h).

n

(8.5.22)

t=l

By substituting (8.5.21) back into the objective function (8.5.20), (1/n times) the FIML concentrated likelihood function can be derived as Q;(~);;: Qn(~, 1:(~)) =-

~ log(2rr)- ~

+

= - Af log(2rr)- Af-

2

2

~ log(lfl 2 ) - ~ Iog(li:(~)l)

~log~~ ~(y, + 2 nL...,

r- 1Bx,)(y, + r- 1Bx,)'l·

t=l

(8.5.23) The second step is to maximize this FIML concentrated likelihood with respect to ~. The FIML estimator &of ~ 0 is the ~ that maximizes this function 7 The FIML estimator of l:o is 1: (&l,

7So the FIM:L estimator

&is an extremum estimator that maximizes

" + r- 1BxrHYt + r- 1Bxd]. -1~ L
Some aspects of this extremum estimator are explored in an optional analytical exercise (see Analytical Exercise 3).

533

Examples of Maximum Likelihood

Testing Overidentifying Restrictions Now compare this FlML concentrated likelihood Q~(8) given in (8.5.23) with that for the multivariate regression model, Q~ (ll) given in (8.4.13). The FIML concentrated likelihood Q~(8) is what obtains when we impose on Q~(ll) the restrictions implied hy the null hypothesis that Ho: n~ = -f 0 B0 or foll~ + B0 = 0. 1

(8.5.24)

Put differently, FIML estimation of 8o is a constrained multivariate regression. Therefore, the truth of the null hypothesis can be tested by the likelihood ratio principle. Since the OLS estimator fi given in (8.4.5) maximizes the multivariate regression concentrated likelihood function Q~ (ll) given in (8.4.13) and the FIML estimator ii maximizes the FIML concentrated likelihood function Q~(ii) given in (8.5.23), the likelihood ratio test statistic is LR = 2n · (Q~(fi) - Q~(/i)) 1

= n x (logiS!(-(f- Bl')l-log

IQ(fi)l).

(8.5.25)

where Q(ll) is defined in (8.4.1 0) and (f, B) is the value of (f, B) implied hy ii. Since the dimension of 8 equals I::::'= I Lm, the number of restrictions implied by the null hypothesis H0 above is M

M

KM - L Lm = L(K - Lm), m=l

(8.5.26)

m:=:l

which is the total number of overidentifying restrictions of the system. Therefore, the likelihood ratio test based on the LR statistic (8.5.25) is a test of overidentifying restrictions. It is a specification test because the restriction being tested is a condition assumed in the model. Properties of the FIML Estimator There is no closed-form solution to the FIML estimator. So, in order to establish the consistency and asymptotic normality, we would have to verify the conditions of relevant theorems of the previous chapter. We summarize here, without proof, the main properties of the FIML estimator. Identification Stating the identification condition in various forms for complete simultaneous equations systems is a major topic in most textbooks. Here it is left as an optional

534

Chapter 8

analytical exercise (Analytical Exercise 3) to show that the identification condition for FIML as an extremum estimator is equivalent to the rank condition for identification (that E(x,z;ml be of full column rank for all m = I, 2, ... , M). Asymptotic Properties Although the FIML ohjective function (8.5.20) is derived under the normality of e, I x,, the FIML estimator of 80 is consistent and asymptotically normal without the normality assumption. For a proof of consistency without normality, see, e.g., Amemiya (1985, pp. 232-233). There is a nice proof of asymptotic normality of 8 due to Hausman (1975), which shows that an iterative instrumental variable estimation can be viewed as an algorithm for solving the first-order conditions for the maximization of Qn(o, :E). 8 This argument shows that the FIML estimator of 80 is asymptotically equivalent to the 3SLS estimator9 Therefore, its asymptotic variance is given hy (4.5.15), and a consistent estimator of the asymptotic variance is (4.5.17). A

lnvariance Since the FIML estimator is an ML estimator, it has the invariance property discussed in Section 7.1. To illustrate, consider Example 8.1 and suppose you wish to reparameterize the supply equation as p, = ji,,q,

+ !h1 + fJ,w, + 1!, 2

(supply).

(8.5.27)

The old supply parameters (y21 , fJ21, fJ22l and the new supply parameters (ji21 , fJ21 , fJ22 ) are connected by the one-to-one mapping (8.5.28) If (y 21 , {i21 , {i22 ) is the FIML estimator of (y21 , {J21 , fJ22) for the system consisting of the demand equation (8.5.4) and the old supply equation (8.5.5), then the FIML estimator based on the system consisting of (8.5.4) and the new supply equation is numerically the same as the value of the above mapping at (y21 , {i21 , {i22 ). The 3SLS estimator (and 2SLS for that matter) does not have this invariance property.

8we will not display the first-order condition.~ (the likelihood equations) for FIML because it requires some cumbersome matrix notation. See Amemiya (1985, pp. 233-234) or Davidson and MacKinnon ( 1993. pp. 658660) for more details. For the SUR model, the likelihood equations and the iterative procedure are much easier

to describe: see the next subsection. 9This asymptotic equivalence does not carry over to nonlinear equation systems; nonlinear FIML is more efficient than nonlinear 3SLS. See Amemiya (1985, Section 8.2).

535

Examples of Maximum Likelihood

The foregoing discussion about FIML can be summarized as

Proposition 8.1 (Asymptotic properties ofFIML): Consider theM-equation system (8.5. I) and Jet ~ 0 be the stacked vector collecting all the coefficients in the M-equation system. Make the following assumptions. • the rank condition for identification (that E(x,z;m) be of full column rank for all m = 1 , 2, ... , M) is satisfied,

• E(x,X,) is nonsingular, • the M-equation system can be written as a complete system of simultaneous equations (8.5.8) with r 0 nonsingular,

• e, I x,

~

N(O, 1: 0 ), 1: 0 positive definite,

• (y,, x,} is i.i.d.,

• the parameter space for (~ 0 • 1: 0 ) is compact with the true parameter vector (~ 0 • 1: 0 ) included as an interior point. Then (a) the FIML estimator(~. 1:), which maximizes (8.5.20), is consi.
(b) the asymptotic variance of~ is given by (4.5.15), and a consistent estimator of 1 the asymptotic variance is (4.5.17) where &-mh is the (m, h) element of l:- , (c) the likelihood ratio statistic (8.5.25) for testing overidentifying restrictions is asymptotically x2 with K M - L~~ 1 Lm degrees of freedom. A

Furthermore, these asymptotic results about

~hold

even if e, 1 x, is not normal.

ML Estimation of the SUR Model

The SUR model is a specialization of the FlML model with z,m being a subvector of x,. It is therefore an M-equation system (8.5.1) where the regressors z,m are predetermined. Unlike in the multivariate regression model, the set of regressors could differ across equations. It is easy to show (see Review Question 3) that no two dependent variables can be the same variable, so y, (M x 1) in the structural form (8.5.8) simply collects the M dependent variables. Furthermore, since z,m includes no endogenous variables, r 0 in the structural form is the identity matrix. Thus, the FIML objective function (8.5.20) becomes

536

Chapter 8

Qn(
M

2

1og(2n)-

I

I " 1og(ll:l)- n .L;[ry, 2 2 t=I

+ Bx,]'l:-'[ry, + Bx,], (8.5.29)

where f is understood to be the identity matrix. (We continue to carry f around in order to facilitate the comparison to the objective function for LIML to be introduced in the next subsection.) As in FIML, the l: that maximizes the objective function given
L;;;=,

n

.L;[ry,

+ Bx,]'l:- 1 [ry, + Bx,]

= (y- Z
t=I

Therefore, (8.5.31) (Doing the same for the more general case ofFIML is not as easy because, although (8.5.30) holds for FIML as well, the score is more complicated thanks to the 2 term log(lfl ) in the FIML likelihood function. Put differently, if If I is a constant (i.e., independent of <1), then for FIML is the same as (8.5.31).) Setting this equal to zero and solving for <1, we obtain

'fl'

'S'

(8.5.32) Given some consistent estimator f of 1: 0 , 3(f) is the SUR estimator of <1 0 derived in Chapter 4. These two functions (3(1:), l:(
537

Examples of Maximum Likelihood

part updates the estimate of 1: 0 using the current estimate of .1 0 . If this process converges, then the limit is a solution to the first-order conditions. QUESTIONS FOR REVIEW

1. (Deriving f(y, I x,) from f(e, I x,)) In this exercise, we wish to derive the likelihood for observation t for the FIML model by the following fact from probability theory:

Change of Varklble Formula: Suppose e is a continuous M -dimensional random vector with density J,(e) and g : m:.M --> m:.M is a oneto-one and differentiable mapping on an open set S, with the inverse mapping denoted as g- 1( •). Let the M x M matrix J(e) be the J acohian matrix af~~l (the M x M matrix of partial derivatives, whose (i, j) element is a~;•l). Assume that IJ(e)l is not zero at any point in S. ' of y = g(e) is given by Then the density J,(g-l(y)) f(y) = abs(IJ(g 1(y))l) · (Note that IJ(g- 1(y))l is a determinant, which may or may not be positive.) Using this fact, derive the density of y, I x, from the density of e, I x, and verify that the associated log likelihood is given by (8.5.16). Hint: The inverse mapping g-l(y,) is the left-hand side of (8.5.8). SoJ(e,) = f 01 Also, Iog(abs(lfl)) = ~ Iog(lfl 2 ). 2. (Instruments outside the system) Consider the complete system of simultaneous equations (8.5.8) with If ol ,P 0. Suppose that one of the K predetermined variables, say x, K, does not appear in any of the M-equations. Let E' (y I x) he the least squares projection of yon x. Show that E'(y,m I x,) = E'(y,m I x,), where x, = (xu, ... , x,_K _J)' (so x, = (x;, x,K )'). (As you will show in Analytical Exercise 4, including x, K in the list of instruments in addition to will not change the asymptotic variance of the FIML estimator of .1 0 ; it probably makes the small sample properties only worse.)

x,

3. Recall that in Example 8.1 y, 1 and y, 2 are the same variahle. In the SUR model, this cannot happen; that is, no two elements of (y, 1, y, 2 , .•• , y,M) are the same variahle if the assumptions descrihing the model are satisfied. Prove this. Hint: Suppose y, 1 and y, 2 are the same variable. Then

538

Chapter 8

Since z" and z, 2 are subvectors of x,, the right-hand side can be written as for some K-dimensional vector"'· Use the orthogonality conditions that E(x,e;) = 0 and the nonsingularity of E(x,x;) to show that"' = 0.

x;"'

8.6

LIML

LIML Defined

The advantage of the FIML estimator is that it allows you to exploit all the information afforded by the complete system of simultaneous equations. This, however, is also a weakness because, as is true with any other system or joint estimation procedure, the estimator is not consistent if any part of the system is misspecilied. If you are confident that the equation in question is correctly specified hut not so sure about the rest of the system, you may well prefer to employ single-equation methods such as 2SLS. The rest of this section derives the ML estimator called the limited-infonnation maximum likelihood (LIML) estimator, which is the ML counterpart of 2SLS. Let the m-th equation of theM-equation system he the equation of interest. The

Lm regressors, z,m, are either endogenous or predetermined. Let y, be the vector of Mm endogenous variables and be the vector of Km predetermined variables, with Mm + Km = Lm. Partition ~Om conformably as ~Om = (Y~m• {3~)'. Then the m-th equation can he written as

x,

Yrm =

z;m

+ Erm

8om

(lxLm)(Lmxl)

y;

=

+ x;

Yom

(IXMmHMmxl)

/3om

+ Erm·

(8.6.1)

(lxKm)(Kmxl)

Obviously, this single equation in isolation is an incomplete system because of the existence of the Mm included endogenous variables y,. However, if this equation is supplemented by the Mm reduced-form equations for y, taken from (8.5.9), then the system of I + Mm equations thus formed is a complete system of simultaneous equations. To describe this idea more fully, let llo be the associated reducedform coefficient matrix and collect appropriate Mm columns from ll 0 and write the reduced-form equations for the Mm included endogenous variables as

y, (Mmxl)

_, flo

X1

(M111 xK)(Kx!)

+

V1 (Mmxl)

(8.6.2)

539

Examples of Maximum Ukelihood

Combining (8.6.1) and (8.6.2), we obtain the system of I + Mm equations:

+

y,

fo

((l+Mm)x(l+Mm)) ((1+Mm)x1)

Bo

x,

e,

=

(8.6.3)

((l+Mm)xl)

((l+Mm)XK) (Kxi)

where y,

=

[

Ytm ]

(M:~ I)

Er '

=

[

Erm ] (M:rxl)

IM.' l

'

- Yom

fo

=

[M;xl)

(I xMm)

,

Bo

=

[- (lx~m) p'

l

(lx(K-Km)) 0'

if0

-

.

(8.6.4)

(Mm xK)

Here, we have assumed, without loss of generality, that the included predetermined variables x, are the first Km elements ofx,. The equation system (8.6.3) is a complete system of I + Mm simultaneous equations hecause (8.6.5)

lfol =I ;iO.

For example, consider Example 8.2. The included endogenous variable in the wage equation is S,. Supplement the wage equation by a reduced-form equation for S,, a regression of S, on a constant, !Q, and MED,. The r 0 matrix for the two-equation system that results is

- [I fo =

0

The hypothetical value r of r 0 has the same structure shown in (8.6.4) for f 0 . So lfl = I. Replacing in the FIML objective function (8.5.20) (~,B) by (ym, Pm, ll), r hy r, and 1: by 1:, and noting that If I= I, we obtain the LIML objective function:

-Mm I Q.(y m• Pm' ll, 1:) =- 21og(2rr) - :zlog(I:EI) n

-

2~ L;[ ry, + Bx,]' 1: t=1

-I [

ry, + Bx,],

(8.6.6)

--

where 1: is the (I + M.,) x (I + Mm) variance matrix of e,. Let (y .,, Pm' ll, 1:) be the FIML estimator of this system. The LIML estimator of ~Om = (Y~m, P~m)' is (y~, P~)'. This derivation of the LIML estimator, which is different from the

540

Chapter 8

original derivation hy Anderson and Ruhin (1949), is due to Pagan (1979). Since LIML is a FIML estimator, Proposition 8.1 assures us that the LIML estimator is consistent and asymptotically normal even if the error is not normally distributed. Compulalion ol LIML

For the SUR model, we have shown that the ML estimator can be obtained from the iterative SUR procedure. If you reexamine the argument, you should see that it does not depend on r in the SUR objective function (8.5.29) being the identity matrix; all that is needed is that the determinant of r is a constant. Therefore, the same argument also applies to the LIML objective function (8.6.6). That is, we can obtain the LIML estimator of 80m by applying the iterative SUR procedure to the (I +Mm )-equation system (8.6.3). In particular, if l:o is known, the LIML estimator is the SUR estimator. On the face of it, this conclusion -that SUR can be used to estimate the coefficients of included endogenous variables- is surprising, but it is true. The essential reason is this: the sampling error Sm - Som depends not only on the correlation between the included endogenous variables y, and the error term e,m but also on the correlation between y, and v,. Those two correlations cancel each other out. Review Question I verifies this for a simple example. The iterative SUR procedure, however, is not how the LIML estimator is usually calculated, because there is a closed-form expression for the estimator. The simultaneous equation system (8.6.3) has two special features. One is the special structure of r, which we have noted, and the other is the fact that there are no exclusion restrictions in the rows of B corresponding to the included endogenous variables y,. Thanks to these features, the LIML estimator Sm = (y~, {i~)' and also the likelihood ratio statistic (8.5.25) can be calculated explicitly. To write down the closed-form expressions for the LIML estimator and the likelihood ratio statistic, we need to introduce some more notation. Let Zm, Y, X, X, and Ym be the data matrices and the data vector associated with z,m, y,, x,, x,, and Y<m• respectively:

(8.6.7)

541

Examples of Maximum Likelihood

and let M and M be the annihilators hased on X and X, respectively:

M =In- X(X'X)- 1X', M =I.- X(X'X)- 1X'. (n xn)

(8.6.8)

(n xn)

Then the LIML estimator ~m can be written as (8.6.9) where k is the smallest characteristic root of (8.6.10) with (I

+ Mm)

x (I

+ Mm)

matrices Wand W defined hy (8.6.11)

(See, e.g., Davidson and MacKinnon, 1993, pp. 645--649, for a derivation.) The likelihood ratio statistic (8.5.25) for testing overidentifying restrictions reduces to LR = n logk.

(8.6.12)

Since there are no overidentifying restrictions in the supplementary Mm equations (8.6.2), the equation of interest is the only source of overidentifying restrictions, which are K- Lm in numher. Therefore, by Proposition 8.1, this statistic for testing overidentifying restrictions is asymptotically x2 (K - Lm). When the equation is just identified so that K = Lm, then the LR statistic should be zero. Indeed, it can be shown that k = I in this case (see, e.g., p. 647 of Davidson and MacKinnon, 1993). If we do not necessarily require k to be as just defined, the estimator (8.6.9) is called a k-class estimator. The LIML estimator is thus a k-class estimator with k defined above. Inspection of the 2SLS formula in terms of data matrices (see (3.8.3') on page 230) immediately shows that the 2SLS estimator is a k-class estimator with k = I, and the OLS estimator is a k-class estimator with k = 0. It follows that LIML and 2SLS are numerically the same when the equation is just identified (so that k = I). We have already shown that LIML is consistent and asymptotically normal. Moreover, using the explicit formula (8.6.9), it can be shown that the LIML estimator of llom has the same asymptotic distribution as 2SLS (see, e.g., Amemiya, 1985, Section 7.3.4). So the formula for the asymptotic variance ofLIML is given by (3.8.4) and its consistent estimate by (3.8.5).

542

Chapter 8

LIML versus 2SLS

Since LIML and 2SLS share the same asymptotic distrihution, you cannot prefer one over the other on asymptotic grounds. For any finite sample, LIML has the invariance property, while 2SLS is not invariant. Furthennore, the conclusion one could draw from the large literature on finite-sample properties (see, e.g., Judge et al., 1985, Section 15.4) is that LIML should he preferred to 2SLS. Major conclusions from the literature are listed below. They assume that the predeterrnined variables x, are fixed constants, however. • Suppose errors are nonnally distributed. The p-th moment of the 2SLS estimator exists if and only if p < K- Lm+ I, where K is the number ofpredetennined variables and Lm is the numher of regressors in the m-th equation. Thus, 2SLS does not even have a mean if the equation is just identified (i.e., if K = Lm), and this is true even if the distrihution of errors is a nonnal distribution, whose moments are all finite. • The LIML estimator has no finite moments even if errors are nonnally distrihuted. • However, in most Monte Carlo simulations (see, e.g., Anderson, Kunitomo, and Sawa, 1982), LIML approaches its asymptotic nonnal distribution much more rapidly than does 2SLS.

QUESTIONS FOR REVIEW

1. (Why SUR can be used for a structural equation) To see why SUR can be used to estimate the coefficients of included endogenous variables, consider the following simple example: Yil = YoYr2 { Yr2 = fJoX 1

+ 6'rJ, + v,,

where x, is predetennined. Suppose for simplicity that

is known. Let (y,

{J) be the SUR estimator of (yo, flo).

543

Examples of Maximum Ukelihood

(a) Show that

[~]- [;:] a 121>' -;;- L... Yr2Xr ] - ! [ai l l-;;-> L...' Yr2Br! +a 121>' -;;- L... Yr2Vr ] a22lnL...t "'x2

a2I.!nL...tti "'x e

+ a22lnL...tt "'x v

'

where amh is the (m. h) element of I: 01 • Hint: The formula for the SUR estimator is given by (4.5.12) with (4.5.13') and (4.5.14') on page 280. (b) Show that

a

" I l- lL..., "" ' Yt2Br1 +a 121"' - L..., Yr2Vr n

n

t=I

t=I

converges to zero in probability. Hint:

8.7

Serially Correlated Observations

So far, we have assumed i.i.d. observations in deriving the asymptotic variance of ML estimators. This section is concerned with ML estimation when observations are serially correlated. Two Questions

When observations are serially correlated, there arise two distinct questions. The first is whether the ML estimator that maximizes a likelihood function derived from the assumption of i.i.d. observations is consistent and asymptotically normal when indeed the observations are serially correlated. More precisely, suppose that the observation vector w, is ergodic stationary hut not necessarily i.i.d., and consider the ML estimator that maximizes

L

I " Q,(O) = log f(w,; 9), n t=I

(8.7.1)

544

Chapter B

where log f(w,; 6) is the log likelihood for ohservation t. This objective function would he (I/n times) the log likelihood of the sample (w 1, ... , Wn) if the observations were i.i.d. Even though the log likelihood for observation t, logf(w,; 6), is correctly specified, this objective function is misspecified because it assumes, incorrectly, that ( w,} is i .i .d. The ML estimator that maximizes this objective function is a quasi-ML estimator because the likelihood function is misspecified. Is a quasi-ML estimator that incorrectly assumes no serial correlation consistent and asymptotically normal? We have actually answered this first question in the previous chapter. By Propositions 7.5 and 7.6, the quasi-ML estimator is consistent because it is an M-estimator, with m(w,; 6) given hy the log likelihood for ohservation t. For asymptotic normality, the relevant theorem is Proposition 7.8, which spells out conditions under which an M-estimator is asymptotically normal. Just reproducing the conclusion of Proposition 7.8, the expression for the asymptotic variance is given by

(8. 7.2) where 6 0 is the true value of 6, H(w,; IJ) is the Hessian for observation t, and l: is the long-run variance of the score for ohservation t evaluated at IJ 0 , s(w,; 6 0 ). This asymptotic variance can he consistently estimated by

~

Avar(O) =

{ 1 -;;

L n

t=l

H(w,; 0)

}-1 f {-;;1 L n

H(w,; 0)

}-1 ,

(8.7.3)

t=I

where f is a consistent estimate of :E. Any of the "nonparametric" methods discussed in Chapter 6 (such as the VARHAC) can be used to calculate l: from the estimated series (s(w,; 0) }; there is no need to parameterize the serial correlation in (s(w,; 6o)}. The second question is what is the likelihood function that properly accounts for serial correlation in observations and what are the properties of the "genuine" ML estimator that maximizes it. The question can be posed for conditional ML as well as unconditional ML. In conditional ML, we divide the observation vector w, into two groups, the dependent variable y, and the regressors x,, and examine the conditional distribution of (y 1, ••. , y,) conditional on (x 1, ••• , x.). Parameterizing this conditional distribution is relatively straightforward when (w,} is i.i.d., hecause it can he written as a product over t of the conditional distrihution of y, given x,. In fact, this i.i.d. case is what we have examined in this and previous chapters. If (w,} is serially correlated, however, the researcher does not know the

-

545

Examples of Maximum Likelihood

dynamic interaction between y, and x, well enough to be able to write down the conditional distribution. 10 In the rest of this section, we deal only with examples of unconditional ML estimation with serially correlated ohservations. Unconditional ML for Dependent Observations

We start out with the simplest case of a univariate series. Let (yo, Y1, ... , Yn) be the sample. (For notational convenience, we assume that the sample includes observation fort = 0.) When the observations are serially correlated, the density function of the sample can no longer be expressed as a product of individual densities. However, it can be written down in terms of a series of conditional probability density functions. The joint density function of (yo, y 1) can he written as (8.7.4) Similarly, for (yo. Y1, Y2), (8.7.5) Substituting (8.7.4) into this, we obtain (8.7.6) A repeated use of this sort of sequential substitution produces n

f(yo, Y~> · · ·, Yn) =

[l f(y,

I Yr-1. · · ·, Yl· Yo)f(yo).

(8.7.7)

t=l

If the conditional density f(y,

I y,_ 1 , ... , y 1 , y 0 ) and the unconditional density

f(y0 ) are parameterized by a finite-dimensional parameter vector 6, then ( 1/ n times) the log likelihood of the sample (y0 , y 1 , ... , Yn) can be written as

I " - l:)og f(y, n r=l

I n

I Y•-1· ... , Y~> Yo; 6) +-log f(yo; 6).

(8.7.8)

The ML estimator 6 of the true parameter value 6 0 is the 6 that maximizes this log likelihood function. We will call this ML estimator the exact ML estimator.

10 For a discussion of conditional ML when the complete specification of the dynamic interaction is not possible or undesirable, see Wooldridge (1994. Section 5). For a discussion of possible restrictions on the dynamic interaction (such as "Granger causality" and "weak exogeneity"), see, e.g .• Davidson and MacKinnon (1993, Section 18.2).

546

Chapter B

ML Estimation of AR(l) Processes

In Section 6.2, we considered in some detail first-order autoregressive (AR(l)) processes. If we assume for an AR(l) process that the error term is normally distributed, we obtain a Gaussian AR(l) process: y,

=co+ ¢oy,_, + £,,

(8.7.9)

with <, ~ i.i.d. N(O. aJ). We assume the stationarity condition that I
" {J 1 f(yo,YJ, ... ,y.;fi)=fl 2rra

t=I

x

2

y.) for a

2 exp[-2a1 2 (y,-c-¢y,_ 1) ] }

I

j2rr

... ,

,~:2

exp

-(yo[

'~•)']

2 1 ~¢2 2

•

(8.7.12)

Taking logs and dividing both sides by n, we obtain the objective function for exact ML:

""I n

I L- --log(2rr)I I 2 ) - - I (y,- c- ¢YH) ') Q.(fi) =-log(a n t=l 2 2 2a 2

+-I

2

) - [(Yo- _ { --log(2rr)I I ( a c) -log I-¢ 2 2 n 2 2 I- ¢ 2 1-¢2 "

2 ] } .

(8.7.13)

The exact ML estimator fJ of the AR( I) process is the fJ that maximizes this objective function. Assuming an interior solution, the estimato~ solves the first-order conditions (the likelihood equations) obtained by setting 'a~' to zero. They are a

547

Examples of Maximum Likelihood

system of equations that are nonlinear in IJ, so finding IJ requires iterative algorithms such as those described in Section 7.5.

Conditional ML Estimation of AR(l) Processes The source of the nonlinearity in the likelihood equations is the log likelihood of y 0 . As an alternative, consider dropping this term and maximizing

~I --log(2rr)I I 2 ) - -I , (y,- c- ¢y,_ ) Qn(IJ) = -I L...., -log(CT 1 n 2 2 2CT·

'l

.

(8.7.14)

t=I

Using the relation f(yo, y,, ... , Ynl = f(v,, ... , Yn I .vo)f(yo) and (8.7.12), we see that this objective function is (1/ n times) the log of the likelihood conditional on the first observation yo,

n

1 1 (y,- c- ¢y,_,)'] } . f(y,, y,, ... , Yn I .vo; IJ) = " { exp[- 2 2 2 r~I .J2rrCT C1 (8.7.15) Let 0 be the IJ that maximizes the log conditional likelihood (8.7.14) (conditional on y 0 ). This ML estimator will henceforth he called the y 0 -conditional ML estimator. Here, conditioning is on y0 , which makes the estimator here conceptually different from the conditional ML estimators of the previous sections where conditioning is on x,. Before examining the relationship between the two ML estimators, the exact ML estimator IJ and the y 0 -conditional ML estimator IJ, we give two derivations of the asymptotic properties of the latter. The first derivation is based on the explicit "

A

solution for the estimator. Recall from Section 1.5 and Example 7.2 that the objective function in the ML estimation of the linear regression model y, = x;t3 0 + e, for i.i.d. observations is

I2

I I 2 ) - -I( y - x , Pl -] Ln --log(2rr)-log(CT

n

t=l

2

2"2

'

'

'l .

(8.7.16)

It was shown in Section 1.5, and is very easy to show, that the ML estimator of Po is the OLS coefficient estimator and the ML estimator of equals the sum of squared residuals divided by the sample size. Now, with x, = (I, y,_ 1)' and Po = (c0 , ¢ 0 )', the objective function for the linear regression model reduces to Qn (IJ) in (8.7.14), the conditional log likelihood for the AR(l) process. Therefore, algebraically, the conditional ML estimator (c, ¢l of (c0 , ¢ 0 ) is numerically the

"J

548

Chapter 8

same as the OLS coefficient estimator in the regression of y, on a constant and Y•-1> and the conditional ML estimator 8- 2 of aJ is the sum of squared residuals divided by n. Regarding the asymptotic properties of the estimator, we have shown in Section 6.4 that (c, if,) is consistent and asymptotically normal with the asymptotic variance indicated in Proposition 6.7 for p = I and that 8-2 is consistent for aJ. Therefore, for the asymptotic normality of the conditional ML estimator (c, if,), which maximizes the y0 -conditional likelihood for normal errors, the normality assumption is not needed. The second derivation is more indirect but nevertheless more instructive. As just noted, the objective function Q"(ll) coincides with (8.7.16) with x, = (I, y,_ 1)' and {J 0 = (c0 , ¢ 0 )'. Since (y,, x,) is merely ergodic stationary and not necessarily i.i.d., the conditional ML estimator (c, if,, 8- 2 ) can be viewed as the quasi-ML estimator considered in Proposition 7 .6. It has been veri tied in Example 7.8 that the objective function satisfies all the conditions of Proposition 7.6 if E(x,x;) where x, =(I, y,_ 1)' is nonsingular. We have shown in Section 6.4 that this nonsingularity condition is satisfied for AR(l) processes if aJ > 0. Thus, the conditional ML is consistent. To prove asymptotic normality, view the conditional ML estimator as an M-estimator with w, = (y,, y,_ 1 )' and m(w,; II)= -

I

2

Iog(2Jr)-

I

2

1og(a 2 ) -

I

2

2

a 2 (y,- c-
The relevant theorem, therefore, is Proposition 7.8. Let s(w,; II) be the score for observation t and let H(w,; II) be the Hessian for observation t associated with this m function, and consider the conditions (I )-(5) of Proposition 7.8. The discussion in Example 7.I 0 implies that all the conditions except for (3) hold if E(x1 x;) where x, = (1. _y1_ 1)'is nonsingular, which as just noted is satisfied if aJ > 0. This leaves condition (3) that )n L7~ 1 s(w,; llo) N(O, l:). In the AR(l) model, f(y, I Yr-I .... , YJ, Yo) = f(y, I y,_,). As you will show in Review Question I, this special feature of the joint density function implies that the sequence (s(w,; 11 0 )) is a martingale difference sequence. Since s(w 1 ; llo) is a function of w1 = (_y,, y,_J)' and (y,) is ergodic stationary, the sequence (s(w,; 11 0 )) also is ergodic stationary. So by the ergodic stationary martingale difference CLT of Chapter 2, we can claim that condition (3) is satisfied with

--+,

l: = E[s(w 1; 11 0 ) s(w,; 11 0 )'].

(8.7.18)

Finally, it is easy to show the information matrix equality for observation t that - E[H(w1; 11 0 ) J = E[s(w 1; 11 0 ) s(w,; 11 0 )'] (a proof is in Example 7.1 0). Substituting this and (8.7.18) into (8.7.2), we conclude that the asymptotic variance of the

549

Examples of Maximum Likelihood

'

y 0 -conditional ML estimator 8 of Oo is given by

1 1 Avar(i}) = - (E[H(w,; Oo) Jr = (E[s(w,; Oo) s(w,; Oo)'Jr . (8.7 .19) That is, the usual conclusion for a correctly specified ML estimator for i.i.d. observations is applicable to the y 0 -conditional ML estimator that maximizes the correctly specified conditional likelihood (8.7.14) for dependent observations. 11 The difference between (8.7.13) and (8.7.14) is the log likelihood of y 0 divided by n. If the sample size n is sufficiently large, then the first observation Yo makes a negligible contribution to the likelihood of the sample. The exact ML estimator and the Yo-conditional ML estimator tum out to have the same asymptotic distribution, provided that 1<1>1 < I (see, e.g., Fuller, 1996, Sections 8.1 and 8.4). For this reason, in most applications the parameters of an autoregression are estimated by OLS (Yo-conditional ML) rather than exact ML. Conditional ML Estimation of AR(p) and VAR(p) Processes A Gaussian AR(p) process can be written as

Yr =co+ orYr-1 + 02Yr-2 + · · · + t/>opYr-p + s,, with s, ~ i.i.d. N(O,

(8.7.20)

uJ). The conditional ML estimation of the AR(p) process is

completely analogous to the AR(l) case. If the sample is

x, = (1, y,_ 1, ... , Yr-p)'

and

{3 0 =(co, or. o2 •... , t/>op)'.

Therefore, the conditional ML (conditional on (Y-p+l• Y-p+'· ... , Yo)) estimator of (c0 , ¢>01 , ..• , t/>op) is numerically the same as the OLS coefficient estimator in the regression of y, on a constant and p lagged values of y,. The conditional ML estimate of is the sum of squared residuals divided by the sample size. By Proposition 6.7, the conditional ML estimator of the coefficients is consistent and

uJ

asymptotically normal if the stationarity condition (that the roots of I -¢>o 1z- · · ·-

t/>opzP = 0 be greater than I in absolute value) is satisfied. Again as in the AR(l) case, the error term does not need to be normal for consistency and asymptotic normality. The exact ML estimates and the conditional ML estimates have the same asymptotic distribution if the stationarity condition is satisfied. 11 This argument can be applied to models more genera] than autoregressive processes. See Lemma 5.2 of Wooldridge (1994).

550

Chapter 8

These results generalize readily to vector processes. A Gaussian VAR(p) (p-th order vector autoregression) can be written as y,

=Co+ '~>oiYt-1 + · · · + '~>opYt-p + e,,

(8.7.21)

(Mxl)

withe, ~ i.i.d. N(O, 2 0 ). The objective function in the conditional ML estimation is (1/ n times) the log conditional likelihood: M 1 1 ~~ Qn(ll) = --log(2rr) +-log( IS!- l l - - L.)(y,-

2

2

2n

n x,) Q-

11]

(y,-

n x,)}, I

l=l

(8.7.22) where

TI'

=[c

11> 1

11> 2

'l>p],

·· ·

x,

=

Yt-t

(8.7.23)

Yt-2

Yt-p

This coincides with the objective function in the ML estimation of the multivariate regression model considered in Section 8.4. Therefore, the conditional ML estimate of (c0 , TI 0 ) is numerically the same as the equation-by-equation OLS estimator. It is consistent and asymptotically normal if the coefficient matrices satisfy the stationarity condition (that all the roots of liM- '1> 01 z- · · · - '~>opzPI = 0 be greater than I in absolute value). This result does not require the error vector e, to be normally distributed. QUESTIONS FOR REVIEW 1.

(Score is m.d.s.) Consider the Gaussian AR(l) process (8.7.9) and let s(w,; (}) be the score for observation t (the gradient of (8.7.17)) with w, = (y,, y,_ 1)' and(} = (c, . a 2 )'. (a) Show that

s(w,; 9) =

where

P=

"\x, · (y,- x;{l) [ - ~1 2

+ 2 ~4 (yt

(c, )'and x, = (1, Yt-IY·

-

x;f3)

] 2

,

551

Examples of Maximum likelihood

(b) Show that E[s(w,; IJo) I Y<-1,

Y<-2· ... l =

0.

Hint: At IJ = 90 , y, - x;fJ = s,. (c) Show that {s(w,; 1} 0 )) is a martingale difference sequence. Hint: s(w,; 1} 0 ) is a (measurable) function of (y, y,_ 1). So s(w,_i; IJ 0 ) (j > 1) can be calculated from (y,_J. Y<-2• ... ). (d) Does this result in (c) carry over to Gaussian AR(p)? [Answer: Yes.]

PROBLEM SET FOR CHAPTER 8 - - - - - - - - - - - - ANALYTICAL EXERCISES

1. Show that

L...(y,- n ' x,)(y,- n ''I x,) > 1'~'''1 -;; L... v,v, , I -;;I~ t=l

(I)

t=l

~,

~

where v, = y, - TI x, and TI is given in (8.4.5). (Since v, does not depend on n by construction, this inequality shows that the left-hand side is minimized at TI = TI.) Hint: Use the "add-and-subtracf' strategy to derive ~

y,- TI'x,

=

~

v, + (TI- TI)'x,.

So n

L(y,- TI'x,)(y,- TI'x,)' t=l

n

=

n

n

I:v,v; + (fi- TI)'I:x,v; + I:v,x;(fi- TI) t=1

t=l

n

+ (fi- TI)'(I:x,x;)
n

n

= I:v,v; + (fi- TI)'(I:x,x;)
1=1

n

(since

I:x,v; = 0). 1=1

552

Chapter 8

Then use the following result from matrix algebra: Let A and B be two positive semidefinite matrices of the same size. Then

lA + Bl

>

IAI.

~

2. (Q(II) is positive definite.) Under the conditions of the multivariate regression model, show that Q(II) defined in (8.4.10) is positive definite for any given II. Hint: Derive y,- II'x, = v, + (II 0 - II)'x,. Since {y1, x,} is i.i.d., Q(II) converges almost surely toE[ (y, - II'x, )(y, - II'x, )']. Show that

Then use the matrix inequaltiy mentioned in the previous exercise. Q 0 is positive definite by assumption. 3.

(Optional, identification of~ in FlML) As noted in footnote 7 of Section 8.5, the FIML estimator of ~ 0 is an extremum estimator whose objective function is ~

Q"(~) =-In(~) I,

(2)

where

Q(~)

1

n

=- L(y1 + r-'Bx,)(y, + r-'Bx,)'. n

(3)

t=l

Let Qo(~)

= plim Qn(~).

(4)

n~oo

To show the consistency of this extremum estimator, we would verify the two conditions of Proposition 7.1 of the previous chapter. Reproducing these two conditions in the current notation: identification: at ~ 0 ,

Q 0 (~)

is uniquely maximized on the compact parameter space

uniform convergence: QnO converges uniformly in probability to Q 0 0. ln this exercise we prove that the rank condition for identification (that E(x,z;ml be of full column rank for all m) is equivalent to the above extremum estimator identification condition. ln the proof, take for granted all the assumptions we made about FIML in the text.

Examples of Maximum Likelihood

553

To set up the discussion, we need to introduce some new notation. Reproducing the m-th equation of the structural form from the text: Ytm

=

z;m

l)Om

+ Ctm.

(5)

(lxLm) (Lmxl)

The regressors Z1m consist of Mm endogenous variables and Km predetermined variables with Mm + Km = Lm. Define

Sm

: a matrix that selects those Mm endogenous regressors from y,,

(MmxM)

Cm : a matrix that selects those Km predetermined regressors from x,. (Km

x K)

To illustrate, S 1 and C 1 for Example 8.1 withy, = (p 1 • q,)' and x, = (I, a,, w, )' IS

So the regressors in the m-th equation can be written as

'('s'·'c') = y t m : XI m '

2 tm

(a) Show that E(x,z;ml = E(x,x;) ( 11o (KxLm)

(KxK)

S~

: C~ ].

(KxM)(MxMmJ

(6)

(KxKm)

(KxLm)

Hint: Use the reduced form (8.5.9) and the fact that E(x,V,) = 0. Since multiplication by a nonsingular matrix (E(x,x;) here) does not alter rank, equation (6) implies that the rank condition for identification is equivalent to the condition that the rank of (11 0 S~ : C~] be Lm. (b) Show that plim Q(d) = f n~oo

0 I:o(f 0 )' + (11~ + r-'B] E(x,x;l[11~ + r-'BJ', 1

1

(7)

where 11~ = -f 01B0 . Hint: Since {y,, x,} is i.i.d., the probability limit is given by E((y, + r-'Bx,)(y, + r-'Bx,)']. Also, y,

+ r-'Bx,

= v,

+ (11~ + r-'B)x,.

554

Chapter a

(c) Show that IR(~)I is minimized only if rrr~ + B = 0. Hint: Since E(x,x;) is positive definite, the second term on the right-hand side of (7) is zero only if II 0 + B'(f- 1 )' = 0. Use the fact from matrix algebra noted in Exercise 1 above. Therefore, the extremum estimator identification condition (that Q 0 (~) be maximized uniquely at ~ 0 ) holds if and only if (f, B) = (f 0 , B0 ) is the only solution to r II~+ B = 0 viewed as a system of simultaneous equations for (f, B). (d) Let a~ (I x (M + K)) be them-throw of [r : BJ, and let e~ be a (I x (M + K)) vector whose element corresponding to Ytm is one and whose other elements are zero. To illustrate, e'1 for Example 8.1 withy, = (p,, q, )' and x, = (I, a,, w, )' is (0, I, 0, 0, 0). Verify for Example 8.1 the general result that Sm I

am (lx(M+K))

(e) Rewrite

fiT~+

B

0'

= e' m

(MmxM)

m

(!x(Mm+Km))

[

0

(8)

(KmxM)

= 0 as (9)

Show that them-throw of rrr~

+B=

0 can be written as (10)

where (11) (f) Verify that ~m = ~om is a solution to (10). Show that a necessary and sufficient condition that ~m = ~Om is the only solution to (10) is the rank condition for identification for equation m. Hint: Recall the following fact

from linear algebra: Suppose Ax = y has a solution XQ. A necessary and sufficient condition that it is the only solution is that A is of full column rank. (g) Explain why we can conclude that the only solution to r II 0 + B = 0 is (f, B) = (f 0 , B 0 ). Hint: Each element of~ appears in either r orB only once.

555

Examples of Maximum Likelihood

4. (Optional, instruments that do not show up in the system) Consider the complete system of simultaneous equations discussed in Section 8.5. Let

x, = (Kxl)

[I(K~{)x))]. x,K

Suppose that x, K does not appear in any of the M structural equations. Show that dropping x,K from the Jist of instruments changes neither the rank condition for identification nor the asymptotic variance of the FIML estimator of 80 . Hint: Define the two matrices, Sm and Cm as in Exercise 3, so that = (y;s~ : x;c;.,) and (6) holds. FIML is asymptotically equivalent to 3SLS, so its asymptotic variance is given in (4.5.15). The last row of is a vector of zeros. Let 0 0 ((K- I) x M) be the matrix of reduced-form coefficients when x, K is dropped from the system. Then

z;m

c;,

0

= 0

[IlK ~?xM)] 0' (l xM)

References Amemiya, T., 1973, ··Regression Analysis When the Dependent Variable is Truncated Normal," Econometrica, 41,997-1016. - - - · 1985, Advanced Ecorwmetrics, Cambridge: Harvard University Press. Anderson, T. W., and H. Rubin. 1949, "Estimator of the Parameters of a Single Equation in a Complete System of Stochastic Equations." Annals of Mathematical Statistics, 20,46----63. Anderson, T. W.. N. Kunitomo. and T. Sawa, 1982, "Evaluation of the Distribution Function of the Limited Information Maximum Likelihood Estimator," Econometrica, 50. 1009-1027. Davidson, R., and J. MacKinnon. 1993, Estimation and Inference in Et·onometrics, New York: Oxford University Press. Fuller, W.. 19%, Imroduction to Statistical Time Series (2d ed.), New York: Wiley. Hausman, J., 1975, "An Instrumental Variable Approach to Full Information Estimators for Linear and Certain Nonlinear Econometric Models." Econometrica, 43, 727-738. Johansen, S .. 1995, Likelihood-Based I11ference ill Cointegrated Vector Auto-RegreJsive Modeh, New York: Oxford University Press. Judge, G., W. Griffiths. R. Hill, H. Liitkepohl. and T. Lee, 1985, Tlte Theory m1d Practice af Econometrics (2d ed.). New York: Wiley. Magnus, J .. and H. Neudecker, 1988. Matrix Differential Calculm· with Applications in Statistics and Econometrics. New York: Wiley. Newey, W., and D. McFadden. 1994. "Large Sample Estimation and Hypothesis Testing," Chapter 36 in R. Engle and D. McFadden (eds.), Handbook of Econometrics, Volume IV, New York: North-Holland.

556

Chapter 8

Olsen, R. J., 1978, "Note on the Uniqueness of the Maximum Likelihood Estimator for the Tobit Model," Econometrica, 46, 1211-1215. Orme, C., 1989, "On the Uniqueness of the Maximum Likelihood Estimator in Truncated Regression Models." Econometric Reviews, 8, 217-222. Pagan, A., 1979, "Some Consequences of Viewing LIML As an Iterated Aitken Estimator," Economics Leuers. 3, 369-372. Sapra, S. K., 1992, "Asymptotic Properties of a Quasi-Maximum Likelihood Estimator in Truncated Regression Model with Serial Correlation," Econometric Reviews, II, 253-260. Tobin, J., 1958, "Estimation of Relationships for Limited Dependent Variables,'' Econometrica, 26, 24-36. Wooldridge, J., 1994, "Estimation and Inference for Dependent Processes," Chapter 45 in R. Engle and D. McFadden (eds.), Handbook of Econometrics, Volume IV, New York: North-Holland.

CHAPTER 9

Unit-Root Econometrics

ABSTRACT Up to this point our analysis has been confined to stationary processes. Section 9.1 introduces two classes of processes with trends: trend-stationary processes and unitroot processes. A unit-root process is also called difference-stationary or integrated

of order I (I (I)) because its first difference is a stationary, or I(0), process. The technical tools for deriving the limiting distributions of statistics involving 1(0) and I (I) processes are collected in Section 9.2. Using these tools, we will derive some unitroot tests of the null hypothesis that the process in question is I (I). Those tests not only are popular hut also have better finite-sample properties than most other existing tests. In Section 9.3, we cover the Dickey-Fuller tests, which were developed to test

the null that the process is a random walk, a prototypical I (I) process whose first difference is serially uncorrelated. Section 9.4 shows that these tests can be generalized to cover 1(1) processes with serially correlated first differences. These and other unit-root tests are compared briefly in Section 9.5. Section 9.6, the application of this chapter, utilizes the unit-root tests to test purchasing power parity (PPP), the fundamental proposition in international economics that exchange rates adjust to national price levels.

9.1

Modeling Trends

There is no shortage of examples of time series with what could be reasonably described as trends in economics. Figure 9.1 displays the log of U.S. real GDP. The log GDP clearly has an upward trend in that its mean, instead of being constant as with stationary processes, increases steadily over time. The trend in the mean is called a detenninistic trend or time trend. For the case of log U.S. GDP, the deterministic trend appears to be linear. Another kind of trend can be seen from Figure 6.3 on page 424, which graphs the log yen/dollar exchange rate. The log exchange rate does not have a trend in the mean, but its every change seems to have a permanent effect on its future values so that the best predictor of future values is

558

Chapter 9

5.0 4.5-

4,0 3.5 3.0

"

2.0 1.5 1.0

0.5 0.0 1869

1889

1909

1929

1949

1969

1989

Figure 9.1: Log U.S. Real GOP, Value for 1869 Set to 0 its current value. A process with this property, which is not shared by stationary processes, is called a stochastic trend. If you recall the definition of a martingale, it fits this description of a stochastic trend. Stochastic trends and martingales are synonymous. The basic premise of this chapter and the next is that economic time series can be represented as the sum of a linear time trend, a stochastic trend, and a stationary process.' Integrated Processes

A random walk is an example of a class of trending processes known as integrated processes. To give a precise definition of integrated processes, we first define 1(0) processes. An 1(0) process is a (strictly) stationary process whose long-run variance (defined in the discussion following Proposition 6.8 of Section 6.5) is finite and positive. (We will explain why the long-run variance is required to be positive in a moment.) Following, e.g., Hamilton (1994, p. 435), we allow 1(0) processes to have possibly nonzero means (some authors, e.g., Stock, 1994, require 1(0) processes to have zero mean). Therefore, an 1(0) process can be written as (9.1.1) where {u1 } is zero-mean stationary with positive long-run variance. 1 A nme on semantics. Pmcesses

with trends are often called "nonstationary processes.'' We will not use this term in the rest of this book because a process can be not stationary without containing a trend. For example, let E1 be an i.i.d. process with unit variance and let d 1 take a value of 1 fort odd and 2 for 1 even. Then a process {u 1 } defined by u1 = d1 - e1 is not stationary because it~ vari:tnce depends on 1. Yet the process cannot be reasonably described as a process with a trend.

559

Unit-Root Econometrics

The definition of integrated processes follows from the definition of !(0) processes. Let ~ be the difference operator, so, for a sequence {~~}, ~~~=(I- L)~l

= ~~- ~1-1,

~ 2 ~ 1 =(I- L) 2 ~ 1 = (~1 - ~1 -!l- (~ 1 -I- ~1-2l. etc.

(9. !.2)

Definition 9.1 (l(d) processes): A process is said to be integrated of order d (l(d)) (d = 1, 2, ... ) if its d-th difference ~d~1 is 1(0). In particular, a process {~rl is integrated of order 1 (1(1)) if the first difference, ~~1 , is !(0). The reason the long-run variance of an !(0) process is required to be positive is to rule out the following definitional anomaly. Consider the process {v1 } defined by Vr

= Br- Br-I,

(9.1.3)

where {er} is independent white noise. As we verified in Review Question 5 of Section 6.5, the long-run variance of {v1 } is zero. lf it were not for the requirement for the long-run variance, {v1 } would be !(0). But then, since ~e 1 = v1 , the independent white noise process {er} would have to be I (I)! lf a process {vr} is written as the first difference of an !(0) process, it is called an I( -1) process. The long-run variance of an 1( -1) process is zero (proving this is Review Question 1). For the rest of this chapter, the integrated processes we deal with are of order 1. A few comments about !(I) processes follow. • (When did it start?) As is clear with random walks, the variance of an !(1) process increases linearly with time. Thus if the process had started in the infinite past, the variance would be infinite. To focus on 1(1) processes with finite variance, we assume that the process began in the finite past, and without loss of generality we can assume that the starting date is 1 = 0. Since an !(0) process can be written as (9.1.1) and since by definition an 1(1) process {~ 1 } satisfies the relation~~~ = 8 + u 1 , we can write~~ in levels as ~~ = ~0

+ 8' I+ (U! + U2 + · · · + U1 ),

(9.!.4)

where {ur} is zero-mean 1(0). So the specification of the levels process Rrl must include an assumption about the initial condition. Unless otherwise stated, we assume throughout that E(~J) < oo. So ~ 0 can be random. • (The mean in 1(0) is the trend in l(l)) As is clear from (9.1.4), an 1(!) process can have a linear trend, 8 · 1. This is a consequence of having allowed 1(0) processes to have a nonzero mean 8. lf 8 = 0, then the !(I) process has no trend

560

Chapter 9

and is called a driftless 1(1) process, while if 8 i= 0, the process is called an 1(1) process with drift. Evidently, an I (I) process with drift can be written as the sum of a linear trend and a driftless I (I) process. • (Two other names of! (I) processes) An I (I) process has two other names. It is called a difference-stationary process because the first difference is stationary. It is also called a unit-root process. To see why it is so called, consider a model

(1-pL)y,=o+u,,

(9.1.5)

where u, is zero-mean 1(0). It is an autoregressive model with possibly serially correlated errors represented by u 1 • If the autoregressive root pis unity, then the first difference of y, is 1(0), so [y,} is I (I). Why Is It Important to Know if the Process Is 1(1)? For the rest of this chapter, we will be concerned with distinguishing between trend-stationary processes, which can be written as the sum of a linear time trend (if there is one) and a stationary process, on one hand, and difference-stationary processes (i.e., I (I) processes with or without drift) on the other. As will be shown in Proposition 9.1, an I (I) process can be written as the sum of a linear trend (if any), a stationary process, and a stochastic trend. Therefore, the difference between the two classes of processes is the existence of a stochastic trend. There are at least two reasons why the distinction is important. I. First, it matters a great deal in forecasting. To make the point, consider the following simple AR(l) process with trend: Yr = Zr

Ct

+ 0 · t + Zr,

= PZr-1 +ct,

(9.1.6)

where (erl is independent white noise. The s-period-ahead forecast of Yr+, conditional on (y 1, Yr- 1,

••• )

I y,, Yr-t.

... )

E(y,+,

can be written as

+ o · (t + s) I y,, Yr-t. .. . ] + E(z,+, I y,, Yr-t. ... ) (9.1.7) =a+ 8 · (t + s) + E(z,+, I y,, Yr-to. ·. ).

= E[a

Both (y,, y,_ 1 , ••• ) and (z,, z1_ 1 , ••• ) contain the same information because there is a one-to-one mapping between the two. So the last term can be written as

561

Unit·Root Econometrics

E(z,+,

I y,, .Yr-1 •... ) = E(z,+, I z,, Zr-1 •... ) = p'zr

(9.1.8)

since Zr+' = er+.• + pe,+,-1 + · · · + p'- 1e,+ 1 + p'z,. Substituting (9.1.8) into (9.1.7), we obtain the following expression for the s-period-abead forecast: E(y,+, I y,, Yr-1, ... )

=a + 8 · (t + s) + p' Zr =a+ 8 · (t + s) + p' · (y,- a- 8 · t). (9.1.9)

There are two cases to consider.

Case IPI < 1. Now if IPI < I, then (z, I is a stationary AR(I) process and thus is zero-mean 1(0). So (y, I is trend stationary. In this case, since E(z:) < oo, we have

Therefore, the s-period-abead forecast E(y1+, I y 1 , y,_ 1, ... ) converges m mean square to the linear time trend a + 8 · (t + s) as s --> oo. More precisely, E([E(Yr+•

I y,, Yr-1• ... )- a- 8 · (t + s)f}--> 0 ass--> oo.

(9.1.10)

That is, the current and past values of y do not affect the forecast if the forecasting horizon s is sufficiently long. In particular, if 8 = 0, then E(y,+,

I y,, Yr-1· ... )--> E(y,) as s--> m.s.

00.

(9.1.11)

That is, a long-run forecast is the unconditional mean. This property, called mean reversion, holds for any linear stationary processes, not just for stationary AR( 1) processes. 2 For this reason, a linear stationary process is sometimes called a transitory component. Case p = 1. Suppose, instead, that p = I. Then (z, I is a driftless random walk (a particular stochastic trend), and y, can be written as

Yr = (a

+ zo) + 8 · t + e1 + 82 + · · · + e,,

(9.1.12)

which shows that (.Y1I is a random walk with drift with the initial value of z0 +a (a particular 1(1) or difference-stationary process). Setting p = 1 in (9.1. 9), we obtain E(y,+,

I _Y1 , Yr-1•

2See, e.g., Hamilton (1994, p. 439) for a proof.

... ) = 8 · S + y1 •

(9.1.13)

562

Chapter 9

That is, a random walk with drift 8 is expected to grow at a constant rate of 8 per period from whatever its current value y, happens to be. Unlike in the trend-stationary case, the current value has a permanent effect on the forecast for all forecast horizons. This is because of the presence of a stochastic trend z,. A stochastic trend is also called a permanent component. 2. The second reason for distinguishing between trend-stationary and differencestationary processes arises in the context of inference. For example, suppose you are interested in testing a hypothesis about fJ in a regression equation Yr = x,{J

+ u,,

(9.1.14)

where u 1 is a stationary error term. If the regressor x 1 is trend stationary, then, as was shown in Section 2.12, the t-value on the OLS estimate of {J is asymptotically standard normal. On the other hand, if x 1 is difference stationary, that is, if it has a stochastic trend, then the t-value has a nonstandard limiting distribution, so inference about fJ cannot be made in the usual way. The regression in this case is called a "cointegrating regression." Inference about a cointegrating regression will be studied in the next chapter. Which Should Be Taken as the Null, 1(0) or 1(1)?

To distinguish between trend-stationary processes and difference-stationary, or I (I), processes, the usual practice in the literature is to test the null hypothesis that the process is I (I), rather than taking the hypothesis of trend stationarity as the null. Presumably there are two reasons for this. One is that the researcher believes the variables of the equation of interest to be I (I) and wishes to control type I error with respect to the I (I) null. This, however, is not ultimately convincing because usually the researcher does not have a strong prior belief as to whether the variables are I (0) or I (I). For a concise statement of the problem this poses for inference, see Watson (1994, pp. 2867-2870). The other reason is simply that the literature on tests of the trend-stationary null against the I (I) alternative is still relatively underdeveloped. There are available several tests of the null hypothesis that the process is I (0) or more generally trend stationary; see contributions by Tanaka (1990), Kwiatkowski et a!. (1992), and Leybourne and McCabe (1994), and surveys by Stock (1994, Section 4) and Maddala and Kim (1998, Section 4.5). As of this writing, there is no single 1(0) test commonly used by the majority of researchers that has good finite-sample properties. In this book, therefore, we will concentrate on tests of the 1(1) null.

563

Unit-Root Econometrics

Other Approaches to Modeling Trends In addition to stochastic trends and time trends, there are two other popular approaches to modeling trends: fractionally integration and broken trends with an unknown break date. They will not be covered in this book. For a survey of the former approach, see Baillie (1996). The literature on trend breaks is growing rapidly. Surveys can be found in Stock (1994, Section 5) and Maddala and Kim (1998, Chapter 13). QUESTION FOR REVIEW

1. (Long-run variance of Var(u,) < oo. Verify run variance of (t.u,) v, = t.u,) is defined

(vi+ v,

g,2

differences of 1(0) processes) Let (u,) be 1(0) with that (t.u,] is covariance stationary and that the longis zero. Hint: The long-run variance of (v,) (where to be the limit as T --+ oo of Var(fiii). fiv =

+ · · · + vr)/fi =

(ur- uo)/.JT.

Tools for Unit-Root Econometrics

The basic tool in unit-root econometrics is what is called the "functional central limit theorem" (FCLT). For the FCLT to be applicable, we need to specialize 1(1) processes by placing restrictions on the associated 1(0) processes. Which restrictions to place, that is, which class of 1(0) processes to focus on, differs from author to author, depending on the version of the FCLT in use. In this book, following the practice in the literature, we focus on linear 1(0) processes. After defining linear 1(0) processes, this section introduces some concepts and results that will be used repeatedly in the rest of this chapter. Linear 1(0) Processes Our definition of linear 1(0) processes is as follows. Definition 9.2 (Linear 1(0) processes): A linear 1(0) process can be written as a constant plus a zero-mean linear process (u,] such that u, = 1/r(L)s,, ljr(L)

=

1/ro

+ 1jr1L + 1jr2 L 2 + · · · fort

=

0, ±1, ±2, .... (9.2. I)

(s,) is independent white noise (i.i.d. with mean 0 and E(s?l

= u2

> 0), (9.2.2)

564

Chapter 9 DO

L j}lfrJI < oo.

(9.2.3a)

}=0

',if (I)

i 0.

(9.2.3b)

The innovation process {etl is assumed to be white noise i.i.d. This requirement is for simplifying the exposition and can be relaxed substantially.' In the rest of this chapter, we will refer to linear 1(0) processes as simply 1(0) processes, without the qualifier "linear." Also, unless otherwise stated, {e,} will be an independent white noise process and {url a zero-mean 1(0) process. Condition (9.2.3a) (sometimes called the one-summability condition) is stronger than the more familiar condition of absolute summability. This stronger assumption will make it easier to prove large sample results of the next section, with the help of the Beveridge-Nelson decomposition to be introduced shortly. Since {',if1 } is absolutely summable, the linear process {url is strictly stationary and ergodic by Proposition 6.l(d). The j-th order autocovariance of {u 1 ) will be denoted y1. Since E(u 1 ) = 0, YJ = E(u,u,_ 1). By Proposition 6.J(c) the autocovariances are absolutely summable. To understand condition (9.2.3b), recall that the long-run variance (which we denote !-. 2 in this chapter) equals the value of the autocovariance-generating function at z = I by Proposition 6.8(b). By Proposition 6.6 it is given by DO

2

i-. =long-run variance of {u,) =Yo+ 2 LYJ = rr 2 · [1/f(l)f

(9.2.4)

j=l

The condition (9.2.3b) thus ensures the long-run variance to be positive. Approximating 1(1) by a Random Walk

Let {~tl be I (I) so that !J.~, = 8 + u, where u, = 1/f(L)e, is a zero-mean 1(0) process satisfying (9.2.1)-(9.2.3) with E(~J) < oo. Using the following identity (to be verified in Review Question 1): 1/f(L) =',if (I)+ !J.a(L),

!J. = I - L,

DO

(9.2.5)

3For example, the innovation process can be stationary martingale differences. See, e.g .. Theorem Tanaka (1996, p. 80).

~.8

of

565

Unit-Root Econometrics

we can write u r as U1

= 1/f(L)Ei

1

= 1/f(l) · Ei 1

+ ~1

- ~~-I

with ~ 1

= a(L)Ei

1.

(9.2.6)

It can be shown (see Analytical Exercise 7) that a(L) is absolutely summable. So, by Proposition 6.1 (a), {~~} is a well-defined zero-mean covariance-stationary process (it is actually ergodic stationary by Proposition 6.l(d)). Substituting (9.2.6) into (9.1.4), we obtain (what is known in econometrics as) the Beveridge-Nelson decomposition: I

~I= 8 · l

+ 2)1/f(J) · Ei, + ~'- ~,-tl + ~0 s=I

(9.2.7) Thus, any linear I( I) process can be written as the sum of a linear time trend (8 · t), a drift less random walk or a stochastic trend (1/f (I )(Ei 1 + Ei 2 + · · ·+ Ei1)), a stationary process (~ 1 ), and an initial condition (~0 - ~ 0 ). Stated somewhat differently, we have Proposition 9.1 (Beveridge-Nelson decomposition): Let {u 1 } be a zero-mean 1(0) process satisfying (9.2.1)-(9.2.3). Then

u 1 + u 2 + · · · + u1 = 1/f(l)(EiJ where~~

a(L)Ei 1 , aj

=

-(1/IHI

+ '2 + · · · + Ei1) + ~~- ~o.

+ l/IH 2 + · · ·).

{~ 1 }

is zero-mean ergodic

4

stationary

For a moment set 8 = 0 in (9.2.7) so that {~ 1 } is driftless I (I). An important implication of the decomposition is that any driftless I (I) process is dominated by the stochastic trend 1/f(l) I:;~ I Ei, in the following sense. Divide both sides of the decomposition (9.2.7) (with 8 = 0) by -fi to obtain (9.2.8)

4 For a more general condition, which does not require the 1(0) process lur) to be linear, under which u + 1 · · · -1- lit is decomposed as shown here, see Theorem 5,4 of Hall and Heyde (1980), Jn that theorem, let) is a stationary martingale difference sequence, so the stochastic trend is a martingale.

566

Chapter 9

Since E(~J) < oo by assumption, E[(~o/Ji) 2 ]--> 0 as 1 --> oo. So ~o!.fi converges in mean square (and hence in probability) to zero. The same applies to ry, 1.fi and ry 0 / .fi_ So the terms in the brackets on the right-hand side of (9.2.8) can be ignored asymptotically. In contrast, the first term has a limiting distribution of N(O, o- 2 • [1/.r(I)JZ) by the Lindeberg-Levy CLT of Section 2.1. In this sense, the stochastic trend grows at rate .,fi. Using (9.2.4), the stochastic trend can be written as (9.2.9) which shows that changes in the stochastic trend in ~~ have a variance of ). 2 , the long-run variance of {t.~, }. Now suppose 8 oft 0 so that {~,} is 1(1) with drift. As is clear from dividing both sides of (9.2.7) by 1 rather than .fi and applying the same argument just given for the 8 = 0 case, the stochastic trend as well as the stationary component can be ignored asymptotically, with ~,fl converging in probability to 8. In this sense the time trend dominates the I( I) process in large samples. Relation to ARMA Models

In Section 6.2, we defined a zero-mean stationary ARMA process {u,} as the unique covariance-stationary process satisfying the stationary ARMA(p, q) equation (9.2.1 0) where (L) satisfies the stationarity condition. If {e,} is independent white noise and if 11(1) oft 0, then the ARMA(p, q) process is zero-mean 1(0). To see this, note from Proposition 6.5(a) that u, can be written as (9.2.1) with 1/.r(L) = (L)- 111(L).

(9.2.11)

Since (I) oft 0 by the stationarity condition and 11 (I) oft 0 by assumption, condition (9.2.3b) (that 1}.r(l) oft 0) is satisfied. By Proposition 6.4(a), the coefficient sequence {1/.rj} is bounded in absolute value by a geometrically declining sequence, which ensures that (9.2.3a) is satisfied. Given the zero-mean stationary ARMA(p, q) process {u,} just described, the associated I (I) process R,} defined by the relation t.~, = u, is called an autoregressive integrated moving average (ARIMA(p, 1, q)) process. Substituting the relation u, =(I- L)~, into (9.2.10), we obtain

567

Unit-Root Econometrics '(L)~,

= 8(L)e,,

(9.2.12)

with '(L) = (L)(l - L), (L) stationary, and 8(1) oft 0. One of the roots of the associated autoregressive polynomial equation ' (z) = 0 is unity and all other roots lie outside the unit circle. More generally, a class of l(d) processes can be represented as satisfying (9.2.12) where '(L) is now defined as '(L) = (L)(l- L)d

(9.2.13)

So '(z) = 0 now has a root of unity with a multiplicity of d (i.e., d unit roots) and all other roots lie outside the unit circle. This class of l(d) processes is called autoregressive integrated moving average (ARIMA(p, d, q)) processes. The first parameter (p) refers to the order of autoregressive lags (not counting the unit roots), the second parameter (d) refers to the order of integration, and the third parameter (q) is the number of moving average lags. Since (L) satisfies the stationarity condition, taking d-th differences of an ARIMA(p, d, q) produces a zero-mean stationary ARMA(p, q) process. So an ARIMA(p, d, q) process is I(d). The Wiener Process

The next two sections will present a variety of unit-root tests. The limiting distributions of their test statistics will be written in terms of Wiener processes (also called Brownian motion processes). Some of you may already be familiar with this from continuous-time finance, but to refresh your memory, Definition 9.3 (Standard Wiener processes): A standard Wiener (Brownian motion) process W(·) is a continuous-time stochastic process, associating each date t E [0, I] with the scalar random variable W (t), such that (I) W(O) = 0;

(2) for any dates 0 < t 1 < t2 < · · · < t, < I, the changes

are independent multivariate normal with W(s)- W(t) particular W(l) ~ N(O, 1));

~

N(O, (s- t)) (so in

(3) for any realization, W(t) is continuous in t with probability I.

568

Chapter 9

Roughly speaking, the functional central limit theorem (FCLT), also called the invariance principle, states that a Wiener process is a continuous-time analogue

of a driftless random walk. Imagine that you generate a realization of a standard random walk (whose changes have a unit variance) of length T, scale the graph vertically by deflating all its values by ../f, and then horizontally compress this normalized graph so that it fits the unit interval [0, I]. In Panel (a) of Figure 9.2, that is done for a sample size of T = I 0. The sample size is increased to I00 in Panel (b), and then to 1,000 in Panel (c). As the figure shows, the graph becomes increasingly dense over the unit interval. The FCLT assures us that there exists a well-defined limit as T --> oo and that the limiting process is the Wiener process. Property (2) of Definition 9.3 is a mathematical formulation of the property that the sequence of instantaneous changes of a Wiener process is i.i.d. The continuoustime analogue of a driftless random walk whose changes have a variance of rr 2 , rather than unity, can be written as rr W (r ). To pursue the analogy between a driftless random walk and a Wiener process a bit further, consider a "demeaned standard random walk" which is constructed from a standard driftless random walk by subtracting the sample mean. That is, let {~,} be a driftless random walk with Var(Ll~,) = I and define (9.2.14) The continuous-time analogue of this demeaned random walk is a demeaned standard Wiener process defined as W''(r)

= W(r)

-1

1

W(s)ds.

(9.2.15)

(We defined the demeaned series for 1 = 0, I, ... , T - I, to match the dating convention of Proposition 9.2 below. If the demeaned series is defined for 1 I, 2, ... , T, then its continuous-time analogue, too, is (9.2.15).) We can also create from the standard random walk {~~} a detrended series: ~,' - ~~ - &. -

8· 1

(1 = 0. I, ... , T - I),

(9.2.16)

where &. and 8 are the OLS estimates of the intercept and the time coefficient in the regression of~~ on (I, 1) (I = 0, I, ... , T - I). The continuous-time analogue of

0.6 0.4 0.2

t

0 -0.2 -0.4 -0.6 -0.8 -1

(a) T=10

-1.2 2

1.5

1

0.5

0 1

(b) T=100 -0.5 0.8

0.7 0.6

0.5 0.4

0.3 0.2 0.1 0

1

-0.1

(c) T=1000 -0.2

Figure 9.2: Illustration of the Functional Central Limit Theorem

570

Chapter 9

the detrended random walk is a detrended standard Wiener process defined as W'(r)

= W(r)- a-d· r,

1 =1 1

a=

(4- 6s)W(s) ds,

1

d

(-6+ l2s)W(s)ds.

(9.2.17)

The coefficients a and dare the limiting random variables of & and 8, respectively, as T -+ oo (see, e.g., Phillips and Durlauf, 1986, for a derivation). Therefore, if a linear time trend is fitted to a driftless random walk by OLS, the estimated linear time trend 8 converges in distribution to a random variable d; even in large samples 8 will never be zero unless by accident. This phenomenon is sometimes called spurious detrending. A

A Useful Lemma Unit-root tests utilize statistics involving !(0) and !(I) processes. The key results

collected in the following proposition will be used to derive the limiting distributions of the test statistics of the unit-root tests of Sections 9.3-9.4. Proposition 9.2 (Limiting distributions of statistics involving 1(0) and 1(1) variables): Let {$1) be driftless !(I) so that L'l$1 is zero-mean !(0) satisfying (9.2.1)-(9.2.3) and E($J) < 00. Let ).

2

=long-run variance of {L'I$1 ) and y0

= Var(L'I$,).

Then

1'

; 2 -+). 2 · (a) - I 2 . L..)$r-1) W(r) 2 dr, T d o t=I

1 (b)T

L L'l$ $r-1 -+ -W(l) ). 2

T

1

ct

t=I

2

2

Yo. - -

2

Let{$,"} be the demeaned series created from{$,) by the formula (9.2.14). Then I (c) 2

T 2 "'<$,"_ 1) -+

T L. t=I

I (d) T

T

2

). ·

d

). 2

11 o

[WM(r)fdr,

L 1'1$, $,"_1--: z-{[WM(I)]2- [WM(O)J2]- ~. t=I

571

Unit-Root Econometrics

Let

{~,'}

be the detrended series created from {~,} by the formula (9.2. I 6). Then

I T (e) T'l)~,'_,)'

7

{'

).2.

O

t=I

I (f) T

T

Jo [W'(r)J' dr,

A2

I>'>~.~.'-1--: 2{[W'(l)J'- [W'(O)J'} t=l

-7·

The convergence is joint. That is, a vector consisting of the statistics indicated in (a)-(!) converges to a random vector whose elements are the corresponding random variables also indicated in (a)-(!). Part (a), for example, reads as follows: the sequence of random variables {(1/ T) 2

I:;~, (~t-1) 2 } indexed by T converges in distribution to a random variable A2 f W 2 dr. The limiting random variables in the proposition are all written in terms of standard Wiener processes. Note that the same Wiener process W (-) appears in (a) and (b) and that the demeaned and detrended Wiener processes in (c)-(f) are derived from the Wiener process appearing in (a) and (b). So the limiting random variables in (a)-(!) can be correlated. To gain some understanding of this result, suppose temporarily that {~,} is a driftless random walk with Var(t:.~,) = a 2 Given that a Wiener process is the continuous-time analogue, it is not surprising that "I:;~, (~1 _J) 2 " suitably normalized by a power of T becomes "a 2 f W2 ." Perhaps what you might not have guessed is the normalization by T 2 . One way to see why the square of T provides the right normalization, is to recall that E[(~ 1 ) 2 ] = Var(~,) = a 2 · t. Thus, (9.2.18) So the mean of L;~,(~,_J) 2 grows at rate T 2 In order to construct a random variable that could have a limiting distribution, I:;~, (~ 1 _ 1 ) 2 will have to be divided by T 2 • Now suppose that {~,} is a general driftless I (I) process whose first difference is a zero-mean 1(0) process satisfying (9.2.1)-(9.2.3). Then serial correlation in t:.~, can be taken into account by replacing a 2 by A2 (the long-run variance of {L':.~1 }).

This is due to implication (9.2.9) of the Beveridge-Nelson decomposition that a driftless I(I) process is dominated in large samples by a random walk whose changes have a variance of A2 . Put differently, the limiting distribution of L:;~,(~1 - 1 ) 2 /T 2 would be the same if{~,}, instead of being a general driftless

572

Chapter 9

I( I) process with serially correlated first differences, were a driftless random walk whose changes have a variance of A2 For a proof of Proposition 9.2, see, e.g., Stock (1994, pp. 2751-2753). It is a rather mechanical application of the FCLT and a theorem called the "continuous mapping theorem." Since W(l) ~ N(O, 1), the limiting random variable in (b) is (9.2.19) where X ~ x2 ( I) (chi-squared with I degree of freedom). Proof of (b) can be done without the fancy apparatus of the FCLT and the continuous mapping theorem and is left as Analytical Exercise I . QUESTIONS FOR REVIEW

1. (Verifying Beveridge-Nelson for simple cases) Verify (9.2.5) for ljr(L) = I+ ljr 1 L. Do the same for ljr(L) =(I - t/JL)- 1 where 1¢1 < I. 2. (A CLT for 1(0)) Proposition 6.9 is about the asymptotics of the sample mean of a linear process. Verify that the paragraph right below Proposition 9.1 is a proof of Proposition 6.9 with the added assumption of one-summability. 3. (Is the stationary component in Beveridge-Nelson 1(0)?) Consider a zeromean 1(0) process

Verify that the long-run variance of {ry,j defined in (9.2.6) is zero. 4. (Initial values and demeaned values) Let {~,} be driftless 1(1) so that~~ can be written as ~~ = ~o + u 1 + u2 + · ·· + u1 • Does the value of ~o affect the value of ~i where W'} is the demeaned series created from {~~} by the formula (9.2.14)? [Answer: No.] 5. (Linear trends and detrended values) Let ~~ be I (I) with drift, so it can be written as the sum of a linear time trend a+ 8 ·I and a driftless I (I) process. Do the values of a and 8 affect the value of ~ 1' where {~,'} is the detrended series created from {~1 } by the formula (9.2.16)? [Answer: No.]

573

Unit-Root Econometrics

9.3

Dickey-Fuller Tests

In this and the following section, we will present several tests of the null hypothesis that the data generating process (DGP) is !(I). In this section, the I( I) process is a random walk with or without drift, and the tests are based on the OLS estimation of appropriate AR(l) equations. Those tests were developed by Dickey (!976) and Fuller (1976) and are referred to as Dickey-Fuller tests or DF tests. We take the convention that the sample has T + I observations (yo, y 1, ••• , y 7 ), so that the AR(l) equation can be estimated for 1 =I, 2, ... , T. The AR(l) Model

The Dickey-Fuller test statistics are derived from the estimation of the first-order autoregressive model:

v, =

PYt-1

+ £,.

{£,}independent white noise.

(9.3.1)

(The case with intercept will be studied in a moment.) This model includes both I (I) and I(O) processes: if p = I, that is, if the associated polynomial equation I - pz = 0 has a unit root, then !'!.y, = £ 1 and {y,} is a driftless random walk, whereas if IPI < I, the process is zero-mean stationary AR(l). Thus, the hypothesis that the data are generated by a driftless random walk can be formulated as the null hypothesis that p = I. If we take the position that the DGP is either 1(0) or !(I), then -I < p < I. Given this a priori restriction on p, the !(0) alternative can be expressed as p < I. So one-tailed tests will be used to test the null of p = I against the 1(0) alternative. The tests will be based on fj, the OLS estimate of p in the AR(l) equation (9.3.1). Under the I(I) null of p = I, the sampling error is fj- I, whose expression is given by (9.3.2) As will be verified in a moment, the numerator has a limiting distribution if divided by T and the denominator has one if divided by T 2 So consider the sampling error magnified by T:

"T

T·(p-l)=T·Lt~l

!';

YtYt-1

I:,~~
TI

"T Lt=l ~Yt Yt-l I

"T (Yt-l )2 .

T2 Lt=l

(9.3.3)

574

Chapter 9

Deriving the Limiting Distribution under the 1(1) Null

Deriving the limiting distribution ofT· (p- I) under the I (I) null (a driftless random walk) is very straightforward. Since {y,] is driftless I (I), parts (a) and (b) of Proposition 9.2 are applicable with S< = y,. Furthennore, since l'!.y, is independent white noise, A2 (the long-run variance of {l'!.y,j) equals y0 (the variance of l'!.y,). Thus,

IL l'!.y, T

WIT=-

T

V<-1-+

-

d

Yo 2 Yo -W(l) --

2

2

=

WJ,

(9.3.4)

f=l

(9.3.5) As noted in the Proposition, the convergence is joint, so wT = (w 1T. w 2T)' as a vector converges to a 2 x I random vector, w = (w~o w2 )'. Since (9.3.3) is a continuous function of wT (it equals w 1T fw 2T ). it follows from Lemma 2.3(b) that

WI

-+ d

-=

w,

lfW0) 2 -lf Yo.

J0I W(r)' dr (9.3.6)

The test statistic T · (p- I) is called the Dickey-Fuller, or DF, p statistic. Several points are worth noting: • T · (p - 1), rather than the usual ./T · (p - I). has a nondegenerate limiting distribution. The estimator p is said to be superconsistent because it converges to the true value of unity at a faster rate (T).

• The null hypothesis does not specify the values of Var(e,) and y0 (the initial value) of the DGP, yet they do not affect the limiting distribution, the distribution of the random variable D FP. That is, the limiting distribution does not involve those nuisance parameters. So we can use T · (p- I) as the test statistic. This test of a driftless random walk is called the Dickey-Fuller p test. • Note that both the numerator and the denominator involve the same Wiener process, so they are correlated. Since W(l) is distributed as standard nonnal, the numerator of the expression for DFP could be written as (x 2 (1) - I )/2. But this would obscure the point being made here.

575

Unit-Root Econometrics

The usual !-test for the null of p = I, too, has a limiting distribution. It is just a matter of simple algebra to show that the !-value for the null of p = I can be written as (9.3.7)

where s is the standard error of regression: T

s=

"'<

I I L. Yr - py,_J) 2 . TA

(9.3.8)

t=l

It is easy to prove (see Review Question 3) that s 2 is consistent for y0 (= Var(LI.y1)). It is then immediate from this and (9.3.4) and (9.3.5) that l(W(l) 2 - I)

t-;t

JJoiW(r)2dr =DF,.

(9.3.9)

This limiting distribution, too, is free from nuisance parameters. So the t-value can be used for testing, but the critical values for the standard normal distribution cannot be used; they have to come from a tabulation (provided below) of D F,. This test is called the DF t test. We summarize the discussion so far as Proposition 9.3 (Dickey-Fuller tests of a driftless random walk, the case without intercept): Suppose that {y1 } is a driftless random walk (so {LI.yr} is independent white noise) with E(yJ) < oo. Consider the regression of y, on Yr-I (without intercept) fort = I, 2..... T. Then

DF p statistic: T · (p- I) --+ DFp. d

DFt statistic: t --+ DF,, d

where p is the OLS estimate of the y,_ 1 coefficient, t is the t-value for the hypothesis that the y,_ 1 coefficient is 1, and DFp and DF, are the random variables defined in (9.3.6) and (9.3.9). Thus, we have two test statistics, T · (p- I) and the t-value, for the same I(I) null. Comparison of the (generalizations of) these OF tests will be discussed in the next section.

576

Chapter 9

Table 9.1: Critical Values for the Dickey-Fuller p Test Sample size (T)

25 50 100 250 500 00

25 50 100 250 500 00

25 50 100 250 500 00 SOURCE:

0.01 -11.8 -12.8 -13.3 -13.6 -13.7 -13.8 -17.2 -18.9 -19.8 -20.3 -20.5 -20.7 -22.5 -25.8 -27.4 -28.5 -28.9 -29.4

Probability that the statistic is less than entry 0.025 0.05 0.10 0.90 0.95 0.975

-9.3 -9.9 -10.2 -10.4 -10.4 -10.5

Panel (a): T · (p- 1) -7.3 -5.3 1.01 -7.7 -5.5 0.97 -7.9 -5.6 0.95 -8.0 -5.7 0.94 -8.0 -5.7 0.93 -8.1 -5.7 0.93

-14.6 -15.7 -16.3 -16.7 -16.8 -16.9

Panel (b): T · (jY' - I) -12.5 -10.2 -0.76 -13.3 -10.7 -0.81 -13.7 -11.0 -0.83 -13.9 -II. I -0.84 -14.0 -11.2 -0.85 -14.1 -11.3 -0.85

-20.0 -22.4 -23.7 -24.4 -24.7 -25.0

Panel (c): T. (p'- I) -17.9 -15.6 -3.65 -19.7 -16.8 -3.71 -20.6 -17.5 -3.74 -21.3 -17.9 -3.76 -21.5 -18.1 -3.76 -21.7 -18.3 -3.77

0.99

1.41 1.34 1.31 1.29 1.28 1.28

1.78 1.69 1.65 1.62 1.61 1.60

2.28 2.16 2.09 2.05 2.04 2.03

0.00 -0.11 -0.13 -0.14 -0.14

0.65 0.53 0.47 0.44 0.42 0.41

1.39 1.22 1.14 1.08 1.07 1.05

-2.51 -2.60 -2.63 -2.65 -2.66 -2.67

-1.53 -1.67 -1.74 -1.79 -1.80 -1.81

-0.46 -0.67 -0.76 -0.83 -0.86 -0.88

-om

Fuller (1976, Table 8.5.1 ), corrected in Fuller (1996, Table 10.A.1 ).

Panel (a) of Table 9. I has critical values for the finite-sample distribution of the random variable T · (p- I) for various sample sizes. The solid curve in Figure 9.3 graphs the finite-sample distribution of p for T = 100, which shows that the OLS estimate of p is biased downward (i.e., the mean of p is less than 1). These are obtained by Monte Carlo assuming that £, is Gaussian and y 0 = 0. By Proposition 9.3, as the sample size T increases, those critical values converge to the critical values for the distribution of DFP, reported in the row for T = 00. (Since Yo does not have to be zero and £, does not have to be Gaussian for the asymptotic results in Proposition 9.3 to hold, we should obtain the same limiting critical values if the finite-sample critical values were calculated with a different assumption about the initial value y 0 and the distribution of£,.) From the critical values for T = oo,

577

Unit-Root Econometrics

7.0,-----------------------------------------------, 6.0

5.0

cm

4.0

~

m

···.

a. 3.0

2.0

.·· 1.0

0.0 -t-----1""·---=-~0.6 0.7

=..:.:.:..r-·=··""""";.;-~··=::::;:::: ;· ___···.:;o·-~-.·::;;_\,__j 0.8

- - - without intercept or time

1.0

0.9

········· with intercept

·····- with time

Figure 9.3: Finite-Sample Distribution of OLS Estimate of AR( l) Coefficient, T = I 00

we can see that the DFp distribution, too, is skewed to the left, with the 5 percent lower tail critical value of -8.1 (so Prob(DFp < -8.1) = 5%) and the 5 percent upper tail critical value of 1.28. The small-sample bias in p is reflected in the skewness in the limiting distribution ofT· (p- 1). Critical values for DF, (the limiting distribution of the 1-value) are in the row forT = oo of Table 9.2, Panel (a). Like DFP, it is skewed to the left. The table also has critical values for finite-sample sizes. Again, they assume that y 0 = 0 and e, is normally distributed.

Incorporating the Intercept

A shortcoming of the tests based on the AR(l) equation without intercept is its lack of invariance to an addition of a constant to the series. If the test is performed on a series of logarithms (as in Example 9.1 below), then a change in units of measurement of the series results in an addition of a constant to the series, which affects the value of the test statistic. To make the test statistic invariant to an addition of a constant, we generalize the model from which the statistic is derived. Consider the model

y,

=a+ z,, z,

=

pz,_,

+ e,,

{e,} independent white noise.

(9.3.10)

578

Chapter 9

Table 9.2: Critical Values for the Dickey-FuUer t-Test Sample size (T)

25 50 100 250 500 00

25 50 100 250 500 00

25 50 100 250 500 00

SOURCE:

0.01 -2.65 -2.62 -2.60 -2.58 -2.58 -2.58 -3.75 -3.59 -3.50 -3.45 -3.44 -3.42 -4.38 -4.16 -4.05 -3.98 -3.97 -3.96

Probability that the statistic is less than entry 0.025 0.05 0.90 0.95 0.975 0.10

0.99

-2.26 -2.25 -2.24 -2.24 -2.23 -2.23

Panel (a): t -1.95 -1.60 -1.95 -1.61 -1.95 -1.61 -1.95 -1.62 -1.95 -1.62 -1.95 -1.62

0.92 0.91 0.90 0.89 0.89 0.89

1.33 1.31 1.29 1.28 1.28 1.28

1.70 1.66 1.64 1.63 1.62 1.62

2.15 2.08 2.04 2.02 2.01 2.01

-3.33 -3.23 -3.17 -3.14 -3.13 -3.12

Panel (b): t~ -2.99 -2.64 -0.37 -2.93 -2.60 -0.41 -2.90 -2.59 -0.42 -2.88 -2.58 -0.42 -2.87 -2.57 -0.44 -2.86 -2.57 -0.44

0.00 -0.04 -0.06 -0.07 -0.07 -0.08

0.34 0.28 0.26 0.24 0.24 0.23

0.71 0.66 0.63 0.62 0.61 0.60

-3.95 -3.80 -3.73 -3.69 -3.67 -3.67

Panel (c): t' -3.60 -3.24 -1.14 -3.50 -3.18 -1.19 -3.45 -3.15 -1.22 -3.42 -3.13 -1.23 -3.42 -3.13 -1.24 -3.41 -3.13 -1.25

-0.81 -0.87 -0.90 -0.92 -0.93 -0.94

-0.50 -0.58 -0.62 -0.64 -0.65 -0.66

-0.15 -0.24 -0.28 -0.31 -0.32 -0.32

Fuller (1976. Table 8.5.2). corrected in Fuller (1996, Table IO.A.2).

The process {y,} obtains when a constant a is added to a process satisfying (9.3.1). Underthe null of p = I, {z,J is a driftless random walk. Since y, can be written as

y, =a+ zo + eJ

+ e2 + · · · + &,,

{y,} too is a driftless random walk with the initial value of vo = a + z0 • Under the alternative of p < I, {y,] is stationary AR ( 1) with mean a. Thus, the class of I (0) processes included in (9.3.10) is larger than that in (9.3.1 ).

By eliminating {z,] from (9.3.10), we obtain an AR(l) equation with intercept

Yr =a'+ PYr-I

+ e,,

(9.3.11)

579

Unit-Root Econometrics

where

a'= (I- p)a.

(9.3.12)

Since a' = 0 when p = I. the I (I) null (that the DGP is a driftless random walk) is the joint hypothesis that p = I and a' = 0 in terms of the coefficients of regression (9.3.11 ). Without the restriction a' = 0, {y,} could be a random walk with drift. We will develop tests of a random walk with drift in a moment. Until then, we continue to take the null to be a random walk without drift. Let fj" here be the OLS estimate of pin (9.3.11) and t" be the !-value for the null of p = I. It should be clear that a would not affect the value of p" or its OLS standard error; adding the same constant toy, for all t merely changes the estimated intercept. Therefore, the finite-sample as well as large-sample distributions of the DF p" statistic, T · (p" - I), and the DF t" statistic, t", will not depend on a regardless of the value of p. As you will be asked to prove in Analytical Exercise 2, T · (p" - I) converges in distribution to a random variable represented as DF~=

H[W"(l)f- [W"(0)] 2

-

1)

(9.3.13)

Jd [W~'(r)]2 dr

and t" converges in distribution to

DF'(

_ !(\W"(l)] 2 =

-

[W~'(0)] 2 -

JJ [W~'(r)]2 1 0

I)

.

(9.3.14)

dr

Here, W"(-) is a demeaned standard Wiener process introduced in Section 9.2. These limiting distributions are free from nuisance parameters such as a. The basic idea of the proof is the following. First, obtain a demeaned series by subtracting the sample mean from {y, l (or equivalently, as the OLS residuals from the regression of {y,} on a constant). Second, use the Frisch-Waugh Theorem (introduced in Analytical Exercise 4 of Chapter I) to write T · (p"- I) and the !-value in formulas involving the demeaned series. Finally, use Proposition 9.2(c) and (d). Summarizing the discussion, we have Proposition 9.4 (Dickey-Fuller tests of a driftless random walk, the case with intercept): Suppose that {y, l is a driftless random walk (so {t.y, l is independent white noise) with E(yij) < oo. Consider the regression of y, on (I, Yr-l) for t=i,2, ... ,T. Then

DF p~' statistic: T · (p"

-

I) --> DF%, d

580

Chapter 9

DF tM statistic: tM --> DF{', d

where pM is the OLS estimate of the y,_ 1 coefficient, tM is the t-value for the hypothesis that the y,_ 1 coefficient is 1, and DF};' and DF{' are the two random variables defined in (9.3.13) and (9.3.14).

Critical values for DF};' are in Table 9.1, Panel (b), and those for DF{' are in Table 9.2, Panel (b), in rows forT= oo. The distribution is more strongly skewed to the left than in the case without intercept. The table also has critical values for finite-sample sizes based on the assumption that £ 1 is normal. The finite-sample distribution of pM for T = 100 is graphed in Figure 9.3. It shows a greater bias for pM than for p (the OLS estimate without intercept). Unlike in the case without intercept, the finite-sample critical values do not depend on YO· This is because, when p = I, a change in the initial value Yo only adds the same constant toy, for alit, so the values of the DF pM and tM statistics are invariant to y 0 • We started out this subsection by saying that the I( I) null is the joint hypothesis that p = I and a' = 0. Yet the test statistics are for the hypothesis that p = I. For the unit-root tests, this is the practice in the profession; test statistics for the joint hypothesis are rarely used.

Example 9.1 (Are the exchange rates a random walk?): Let y, be the log of the spot yen/$ exchange rate. To make the tests invariant to the choice of units (e.g., yen/$ vs. yen/¢), we fit the AR(I) model with intercept, so Proposition 9.4 is applicable. For the weekly exchange rate data used in the empirical application of Chapter 6 and graphed in Figure 6.3 on page 424, the AR( I) equation estimated by OLS is y,

=

0.162 (0.435)

+

0.9983376 Yr-1, R 2 (0.0019285)

= 0.997,

SER

= 2.824.

Standard errors are in parentheses. Because of the lagged variable y,_ 1, the actual sample period is from the second week rather than the first week of 1975. So T = 777 and

T. (pM- I)= 777 · (0.9983376- I)= -1.29, 0.9983376 - I tM = 0.0019285 = -O. 86 ' Since the alternative hypothesis is that p < I, a one-tailed test is called for. (If you think that the process might be an explosive AR(l) with p > I, you

581

Unit-Root Econometrics

should perform a two-tailed test.) From Table 9.1, Panel (b), for DFff, the (asymptotic) critical value that gives 5 percent to the lower tail is -14.1. That is, Prob(DF~ < -14. I) = 0. 05. The test statistic of -1.29 is well above this critical value, so the null is easily accepted at the 5 percent significance level. Turning to the OF 1 test, from Table 9.2, Panel (b), the 5 percent critical value is -2.86. The 1-value of -0.86 is greater, so the null hypothesis is easily accepted. Incorporating Time Trend

As was mentioned in Section 9.1, most economic time series appear to have linear time trends. To make the OF tests applicable to time series with trend, we generalize the model once again, this time by adding a linear time trend:

y, =a + 8 · I + z,, z, = pz,_ 1 + e,, [e,] independent white noise. (9.3.15) Here, 8 is the slope of the linear trend which may or may not be zero. Under the null of p = I, [z,J is a driftless random walk, andy, can be written as y,

=a + 8 · 1 + zo +

e1 + e2 + · · · +

e,

=Yo+ 8 ·I+ (e1 + · · · + e,)

(9.3.16)

with Yo = a + zo. So [y,} is a random walk with drift if 8 # 0 and without drift if 8 = 0. Under the alternative that p < I, the process, being the sum of a linear trend and a zero-mean stationary AR( I) process, is trend stationary. By eliminating [z,} from (9.3.15) we obtain y,

=a'+ 8' ·I+ PYt-1 + e,,

(9.3.17)

where a'=(l-p)a+p8, 8'=(1-p)8.

(9.3.18)

Since 8' = 0 when p = I, the 1(1) null (that the OGP is a random walk with or without drift) is the joint hypothesis that p = I and 8' = 0 in terms of the regression coefficients. In practice, however, unit-root tests usually focus on the single restriction p = I. As in the AR(l) model with intercept, the statement of Proposition 9.3 can be readily modified to the AR(l) model with trend. Let p' be the OLS estimate of the pin (9.3.17) and 1' be the 1-value for the null of p = I. As you will be asked to

582

Chapter 9

prove in Analytical Exercise 3, T · (p'

DF'

- I) converges in distribution to

1)

2 l([W'(l)] 2

- [W'(0)] 2 ~ "'--'-'----:';:..___..:---,___:_:__-'-

(9.3.19)

Jd[W'(r)]2dr

P-

and t' converges in distribmion to 1{[W'(l)] 2

-

[W'(0)] 2

-

I)

1 )J0 [W'(r)]2dr

(9.3.20)

Here, W' (-) is a detrended standard Wiener process introduced in Section 9.2. The basic idea of the proof is to use the Frisch-Waugh Theorem to write T · (p'- I) and t' in formulas involving detrended series and then apply Proposition 9.2(e) and (f). Summarizing the discussion, we have Proposition 9.5 (Dickey-Fuller tests of a random walk with or without drift): Suppose that [y,} is a random walk with or without drift with E(yJ) < oo. Considerthe regression of y, on (I, t, YH) fort = I, 2, ... , T. Then

DF p' statistic: T · (/5' - I)

--+

DF;,

(9.3.21)

d

DFt' statistic: t'--+ DF;,

(9.3.22)

d

where p' is the OLS estimate of the y,_ 1 coefficient, t' is the t -value for the hypothesis that the y,_ 1 coefficient is 1, and DF; and DF; are the two random variables defined in (9.3.19) and (9.3.20). Critical values for DF; are in Table 9.1, Panel (c), and those for DF; are in Table 9.2, Panel (c). The table also has critical values for finite sample sizes based on the assumption that e, is normal (the initial value Yo as well as the trend parameters a and 8 need not be specified to calculate the finite-sample distributions because they do not affect the numerical values of the test statistics under the null of p = I). For both the DF p' and t' statistics, the distribution is even more strongly skewed to the left than that for the case with intercept. Now for all sample sizes displayed in the tables, the 99 percent critical value is negative, meaning that more than 99 percent of the time /5' is less than unity when in fact the true value of p is unity! This is illustrated by the chained curve in Figure 9.3, which graphs the finite-sample distribution of /5' for T = I00.

583

Unit-Root Econometrics

Two comments are in order about Proposition 9.5. • (Numerical invariance) The test statistics fj' and I' are invariant to (a, 8) regardless of the value of p; adding a+ 8 ·I to the series {y,} merely changes the OLS estimates of a' and 8' but not fj' or its 1-value. Therefore, the finite-sample as well as large-sample distributions ofT· (fj' - I) and 1' will not depend on (a, 8), regardless of the value of p. • (Should the regression include time?) Since the test statistics are invariant to the value 8 in the DGP, Proposition 9.5 is applicable even when 8 = 0. That is, the p' and I' statistics can be used to test the null of a driftless random walk. However, if you are confident that the null process has no trend, then the OF p" and I" tests of Proposition 9.4 should be used, with the regression not including time, because the finite-sample power against stationary alternatives is generally greater with fj" and 1" than with fj' and 1' (you will be asked to verify this in Monte Carlo Exercise I). On the other hand, if you think that the process might have a trend, then you should use fj' and 1', with time in the regression. If you fail to include time in the regression and use critical values from Table 9.1, Panel (b), and Table 9.2, Panel (b), when indeed the DGP has a trend, then the test will not be correctly sized in large samples (i.e., the probability of rejecting the null when the null is true will not approach the prespecified significance level (the nominal size) as the sample size goes to infinity). This is simply because the limiting distributions of fj" or I" when the DGP has a trend are not DF% or DF/'".

QUESTIONS FOR REVIEW

1. (Superconsistency) Let p be the OLS coefficient estimate as in (9.3.2). Verify that T 1- ' • (fj- I) -->pOforO < ry
Jd

W(r) 2 dr

where'-'= long-run variance of {LI.yr} and y0 = Var(LI.y,).

584

Chapter 9

3. (Consistency of s 2 ) The usual OLS standard error of regression for the AR( l) equation. s. is defined in (9.3.8). If IPI < l. we know from Section 6.4 that s 2 is consistent for Var(e,). Show that. under the null of p = l. s 2 remains consistent for Var(e, ). Since t:.y, = e, when p = l. s 2 is consistent for y0 (= Var(t:.y,)) as well. Hint: Rewrite s 2 as T

s2

=

l I " ' TL_.(t:.y,(p- l)y,_,) 2 A

1=1

2 :::-T---:-1 L(t:.y,)'- T- I . [T. T

(p-

J l)]· T

1=1

+

I T- I

T

L t:.y, y,_, 1=1

. [T. (p- I)],II: .(y,_,) . T

2

A

T'

l=l

Also, E[(t:.y,) 2 ] = y0 . Use the result we have used repeatedly: if xr ~, 0 and YT ~d y, then xr · YT ~, 0. 4. (Two tests for a random walk) Consider two regressions. One is to regress t:.y, on t:. y,_,, and the other is to regress t:. y, on y,_,. To test the null that {y,} is a driftless random walk, which distribution should be used, the 1-value from

the first regression or the 1-value from the second regression? 5.

(Consistency of OF tests) Recall that a test is said to be consistent against a set of alternatives if the probability of rejecting a false null hypothesis when the true OGP is any of the alternatives approaches unity as the sample size increases. LetT· (fj- I) be as defined in (9.3.3). (a) Suppose that {y,} is a zero-mean !(0) process and that its first-order autocorrelation coefficient is Jess than l (so E(y;) > E(y, y,_ 1)). Show that plim T · (p r-oo

-

l) = -oo.

Thus, the OF p test is consistent against general 1(0) alternatives. Hint: Show that, if {y,} is 1(0), •

A

pImp= )

r-oo

E(y,y,_,) E(y,_,) 2

.

(b) Show that the OF I test is consistent against general zero-mean 1(0) alternatives. Hint: s converges in probability to some positive number when {y,} is zero·mean 1(0).

585

Unit-Root Econometrics

6. (Effect of Yo on test statistics) For p~ and r". verify that the initial value y 0 does not affect their values in finite samples when p = I. Does this hold true when p < I? [Answer: No, because the effect of a change in y0 on y, depends on r .] Does the finite-sample power against the alternative of p < I depend on Yo? [Answer: Yes.] 7. (Sargan-Bhargava, 1983, statistic) Basing the test statistic on the OLS coefficient estimator is not the only way to derive a unit-root test. For a sample of (y0 , .VI, ... , YT ), the Sargan-Bhargava statistic is defined as SB=

(a) Verify that it is the reciprocal of the Durbin-Watson statistic. (b) Show that, if {y,} is a driftless random walk, then SB-> d

"T

"T '

. H In!: L,~o }',2 = L,~I Yi-I

t [W(r)]' dr.

Jo

+ YT·2

AI SO, h 'fT' -> p 0 ·

(c) Verify that, under an 1(0) alternative with E[(l'ly,) 2 ] fo 0, SB the test rejects the I(I) null for small values of SB.)

9.4

->p

0. (So

Augmented Dickey-Fuller Tests

The I (I) null process in the previous section was a random walk, which does not allow for serial correlation in first differences. The tests of this section control for serial correlation by adding lagged first differences to the autoregressive equation. They are due to Dickey and Fuller ( 1979) and came to be known as the Augmented Dickey-Fuller (ADF) tests. 'rhe Augmented Autoregression The I (I) null in the simplest ADF tests is that {y,] is ARIMA(p, I, 0), that is, {l'ly,] is a zero-mean stationary AR(p) process: L'ly, ={I l'ly,_I + {2 L'ly,_, + · · · + {p L'ly,_, +£,, E(£?) := a 2 , (9.4.1)

586

Chapter 9

where {£ 1 I is independent white noise and where the associated polynomial equation, I - \IZ- \2z 2- · · ·- \pzP = 0, has all the roots outside the unit circle. (Later on we will allow {L'.y,} to be stationary ARMA(p, q).) The first difference process (L'>y, I is thus an 1(0) process because it can be written as L'>y, = l{r(L)£1 where (9.4.2) The model used to derive the test statistics includes this process as a special case: (9.4.3) Indeed, (9.4.3) reduces to (9.4.1) when p =I. Equation (9.4.3) can also be written as an AR(p Yt =
+ 1):

+ ¢2Yt-2 +,,, +
(9.4.4)

where the ¢'s are related to (p, II- ... , \p) by

=
(j = I, 2, ... , p).

(9.4.5)

Conversely, any (p + 1)-th order autoregression can be written in the form (9.4.3). This form (9.4.3) was originally proposed by Fuller (1976, p. 374). It is called the augmented autoregression because it can be obtained by adding lagged changes to the level AR(I) equation (9.3.1 ). It is easy to show (Review Question 2) that the polynomial equation associated with (9.4.4) has a root inside the unit circle if p > I. Thus, if we take the position that the DGP is either I(I) or 1(0), then p cannot be greater than I and the 1(0) alternative is that p < I. Limiting Distribution of the OLS Estimator

Having set up the model, we now tum to the task of deriving the limiting distribution of the OLS estimate of p in the augmented autoregression (9.4.3) under the I (I) null. Numerically the same estimate of p can be obtained from the AR(p + I) equation (9.4.4) by calculating the sum of the OLS estimates of(¢~> ¢2, ... ,
587

Unit-Root Econometrics

To simplify the calculation, consider the special case of p change in y, or AR(2)), so that the augmented autoregression is

I (one lagged

(9.4.6)

Yz = PYz-I +II L'>Yz-1 + e,.

We assume that the sample is (y_ 1, y 0 , ..• , y 7 ) so that the augmented autoregression can be estimated for t = I, 2, ... , T (or alternatively, the sample is (y 1, Y2· ... , YT) but the summations in what follows are from t = 3 to t = T rather than from t = I tot = T). Write this as y, =

x;p + e,

with Xr (2xl)

If b

-=

=

p -

Yz-1 ] [ .6.Yr-1 '

(2xl)-

[p]

II .

(9.4.7)

(p, ~ 1 )' is the OLS estimator of the augmented autoregression coefficients

p <= (p, lil'), the sampling error is given by T

_ 1 T

b-P=(I;x,x;) L;x,-e,. 1=1

(9.4.8)

1=1

The individual terms in (9.4.8) are given by

'"" T

~Xr. E:r

=

[ '\'T L.,.....r=I Yt-1 E:r ]

I= I

T

(9.4.9)

•

Lr=l .6.Yr-I E:r

(2 X I)

We can now apply the same trick we used for the time regression of Section 2.12. For some non singular matrix 'Y" 7 to be specified below, rewrite (9.4.8) as ')" T ·

(b- {J) = ')" T

[!- p] II- II

= Ay 1 Cr,

(9.4.10)

where T

Cr

= Y'T

1

I;xr · E:r. t=I

(9.4.11)

588

Chapter 9

We seek a matrix Y" 7 such that Y" 7 · (b - {J) has a nondegenerate limiting distribution under the I( I) null. This is accomplished by setting (9.4.12) With this choice of the scaling matrix Y" 7 , it follows from (9.4.9) and (9.4.11) that

CT =

[

TI '\'T L,...t=I T ft I:,~~ I

Yr-1

et

]

(9.4.13)

.

Ll.y,_l ,,

Let us examine the elements of A 7 and c 7 and derive their limiting distributions under the null of p = I. The (I, I) element of A 7 • Since [y,} is I( I) without drift, it converges in distribution to A2 • W 2 dr by Proposition 9.2(a) where A2 = long-run variance of {LI.y,}.

J

The (2, 2) element of A 7 • It converges in probability (and hence in distribution) to y0 (= Var(LI.y, )) because {LI.y,}, being stationary AR(l ). is ergodic stationary with mean zero. The off-diagonal elements of Ar. They are I/ ../T times I T

T

L Ll.y,_l Yt-J,

(9.4.14)

1=1

which is the average of the product of zero-mean !(0) and driftless I( I) variables. By Proposition 9.2(b), the similar product (1/T) I:;~I Ll.y, Yt-1 converges in distribution to a random variable. It is easy to show (Review Question 3) that (9.4.14), too, converges to a random variable. So lf../T times (9.4.14), which is the off-diagonal elements of A 7 , converges in probability (hence in distribution) to 0. Therefore, A 7 , the properly rescaled X'X matrix, is asymptotically diagonal: 1

Ar--> d

[!.. 2 -J0 W(r) 2 dr OJ 0

Yo

'

589

Unit-Root Econometrics

so (Ar)-I --->

2 [(A

ri

2

· Jo W(r) dr

)-I

0

d

~~J.

(9.4.15)

Yo

We now turn to the elements of cr. The first element of cr. Using the Beveridge-Nelson decomposition. it can be shown (Analytical Exercise 4) that T

-

I

T

L

Yr-I

e,--->

C1

ct

t=I

I 2 = >"" · J/r(l) · [W(l) 2 -

(9.4.16)

1],

for a general zero-mean 1(0) process D.y, = ljr(L)e,.

(9.4.17)

Jnthepresentcase. D.y, = \ID.Yr-I+e,underthenull. soljr(L) = (l-\ 1L)- 1 and 1/r(l)=

I I - II

(9.4.18)

.

Second element of cr. It is easy to show (Review Question 4) that {D.Yr-I e,} is a stationary and ergodic martingale difference sequence. so I

second element of cr =

T

M'

yT

where y0

=Var(D.y

1)

and

0"

2

L D.y,_I e,---> t=I

=Var(e

c2

~ N(O, Yo·

2

0" ),

(9.4.19)

d

1 ).

Substituting (9.4.15)-(9.4.19) into (9.4.1 0) and noting that the true value of p is unity under the I (I) null, we obtain, for the OLS estimate of the augmented autoregression, 5

5Here, we are using Lemma 2.3(b) to claim that AT 1 CT converges in distribution to A- 1c where A and c are the limiting random variables of AT and c-r. respectively. Note that. unlike in situations encountered for the stationary case, the convergence of AT to A is in distribution. not in prohahility. We have proved in the text that AT converges in distribution to A and cT to c, but we have not shown that (AT. cT) converges in distribution jointly to (A. c). which is needed when using Lemma 2.3(b). However, by. e.g., Theorem 2.2 of Chan and Wei (1988). the convergence is joint.

590

Chapter 9

Thus,

T,

(p-

I) ---+ u21jr(l) d

A2

HW(l)2- I]

A2

or ----,---- · T · (p

Jd W(r) 2 dr

u21fr(l)

-

I) ---+ D F d

P•

(9.4.20) (9.4.21) where DFp is none other than the DF p distribution defined in (9.3.6). Note that the estimated coefficient of the zero-mean 1(0) variable (L'>y,_ 1) has the usual .Jf asymptotics, while that of the driftless I (I) variable (Yr-~) has a nonstandard limiting distribution. Deriving Test Statistics

There are no nuisance parameters entering the limiting distribution of the statistic "'~(IJ · T · (p - I) in (9.4.20), but the statistic itself involves nuisance parameters through "'~(IJ' By the formula (9.2.4) for the long-run variance A2 of {L'>y,), this ratio equals ljr(l), which in tum equals 1 ~ , by (9.4.18) under the null. Now go

1

back to the augmented autoregression (9.4.6). Since f 1 , the OLS estimate of ( 1, is consistent by (9.4.21), a consistent estimator of 1 ~ , is ( I - f 1)- 1 • It then follows 1 from (9.4.20) that (9.4.22) That is, the required correction on T · (p- I) is done through the estimated coefficient of lagged L'>y in the augmented autoregression. After the correction is made, the limiting distribution of the statistic does not depend on nuisance parameters controlling serial correlation of {L'>y,) such as A2 and y0 . The OLS t-value for the hypothesis that p = I, too, has a limiting distribution. It can be written as

591

Unit-Root Econometrics

p-]

t=--c=====~~======= T

s·

(1, I) element of

c~=x,X, l=l

r

I

T. (p-I)

s·

Jo. I) element of (A 7 )

1

(9.4.23) '

where s is the usual OLS standard error of regression. It follows easily from (9.4.15), (9.4.20), the consistency of s 2 for a 2 , and Proposition 9.2(a) and (b) for ~~

= y, that

(9.4.24)

t-+ DF,, d

where DF, is the DF

t

distribution defined in (9.3.9). Therefore, for the r-value,

there is no need to correct for the fact that lagged !:. y is included in the augmented

autoregression. All these results for p = I can be generalized easily:

Proposition 9.6 (Augmented Dickey-Fuller test of a unit root without intercept): Suppose that (y,) is ARIMA(p, I, 0) so that (t:.y,) is a zero-mean stationary AR(p) process following (9.4.1). Consider estimating the augmented autoregression, (9.4.3), and Jet (p, f" f,, ... , fp) be the OLS coefficient estimates. Also lett be the t-value for the hypothesis that p = I. Then . . ADF p statJStlc:

•

T. (p-I) •

•

I -II - 12- ... - \p

-+

(9.4.25)

DFP,

d

(9.4.26)

ADFt statistic: t-+ DF,, d

where DFp and DF, are the two random variables defined in (9.3.6) and (9.3.9). Testing Hypotheses about

1

The argument leading to this surprisingly simple result is lengthy, but it has a by-product. The obvious generalization of (9.4.21) to the p-th order autoregres-

sion is

fT.

- II II12-12 -

\p- \p

-I

0 -+N

0

, a'.

Yo

Y1

Yl

Yo

Yp-1 Yp-2

Yp-1

Yp-2

Yo

d

0

(9 .4 .27)

592

Chapter 9

where y1 is the j-th order autocovariance of {!>y, }. By Proposition 6.7, this is the same asymptotic distribution you would get if the augmented autoregression (9.4.3) were estimated by OLS with the restriction p = I. that is, if (9.4.1) is estimated by OLS. It is easy to show (Review Question 6) that hypotheses involving only the coefficients of the zero-mean I(O) regressors ({ 1 , { 2 , ... , {p) can be tested in the usual way, with the usual I and F statistics asymptotically valid. In deriving the limiting distributions of the regression coefficients, we have exploited the fact that one of the regressors is zero-mean I(O) while the other is driftless !(I). A systematic treatment of more general cases can be found in Sims, Stock, and Watson (1990) and Watson (1994, Section 2). What to Do When p Is Unknown?

Proposition 9.6 assumes that the order of autoregression, p, for t>y, is known. What can we do when the order p is unknown? To distinguish between the true order p and the number of lagged changes included in the augmented autoregression, we denote the latter by p in this section. We have considered a similar problem in Section 6.4. The difference is that the DGP here is a stationary process in first differences, not in levels as in Section 6.4, and that the estimation equation is an augmented autoregression with p freely estimated. We display below, without proof, three large-sample results about the choice of p under which the conclusion of Proposition 9.6 remains valid. These results are applicable to a class of processes more general than is assumed in Proposition 9.6: {!>y,} is zero-mean stationary and invertible ARMA(p, q) of unknown order (albeit with an additional assumption on £ 1 that its fourth moment is finite) 6 So, written as an autoregression, the order of autoregression for !>y1 can be infinite, as when q > 0, or finite, as when q = 0. The first result is that the conclusion of Proposition 9.6 continues to hold when p, the number of lagged first differences in the augmented autoregression, is increased with the sample size at an appropriate rate. (I) (Said and Dickey, 1984, generalization of Proposition 9.6) Suppose that

p

satisfies

p --+

oo but

as T

----*

oo.

(9.4.28)

(That is, p goes to infinity but at a rate slower than T '1 3 .) Then the two statistics, the ADF p and ADF 1, based on the augmented autoregression with 6See Section 6.2 for the definition of invertible ARMA(p, q) processes. If an ARMA(p. q) process is invertible, then it can be wriuen as an infinite-order autoregression.

593

Unit· Root Econometrics

p lagged changes,

have the same limiting distributions as those indicated in

Proposition 9.6. This result, however, does not give us a practical guide for the lag length selection because there are infinitely many rules for p satisfying (9.4.28). A natural question is whether any of the rules for selecting the lag length considered in Section 6.4- the general-to-specific sequential I rule or the information-criterionbased rules-can be used in the present context. To refresh your memory, the information-criterion-based rule is to set p to the j that minimizes log (

SSR·) C(T) T + (j +I) x ----;;:--,

(9.4.29)

where SSRi is the sum of squared residuals from the autoregression with j lags: t.y, = P Yr-1

+ ~1 t.y,_, + {2 t..vr-2 + ·· · + ~j t.y,_j + E:,.

(9.4.30)

The term C(T)/T is multiplied by j +I, because the equation has j +I coefficients including the lagged y, coefficient. For the Akaike information criterion (AIC), C(T) = 2, while for the Bayesian information criterion (BlC), also called Schwartz information criterion (SIC), C(T) = log(T). In either rule, pis selected from the set j = 0, I, 2, . , . , Pmax· In Section 6.4, this upper bound Pm= was set to some known integer greater than or equal to the true order of the finite-order autoregressive process. The p selected by either of these rules is a function of data (not just a function of T), and hence is a random variable. To be sure, when {t.y,} is stationary and invertible ARMA(p, q), the autoregressive order is infinite when q > 0 and the upper bound Pm= obviously cannot be set to the true order, But it can be made to increase with the sample size T. To emphasize the dependence of Pmax on T, write it as Pmax c · T' for some c > 0 and 0 < g < I /3. Then the limiting distributions of the ADF p and ADF 1 statistics are as indicated in Proposition 9,6 7

7This is Theorem 5.3 of Ng and Perron (1995). Their conditionAl is (9.4.28) here. Their condition A2 11 is implied by the second condition indicated here.

594

Chapter 9

(3) Suppose that pis selected by AIC or by BIC with p,..x(T) satisfying condition (9.4.28). Then the same conclusion holds 8 Thus, no matter which rule-the sequential t, AIC, or BIC-for selecting p you employ, the large-sample distributions of the ADF statistics are the same as m Proposition 9.6. A Suggestion for the Choice of Pmax(T)

The finite-sample distribution, however, depends on which rule you use and also on the choice of the upper bound p"""'(T), and there are infinitely many valid choices for the function p"""'(T). For example, p"""'(T) = [T 114 ] (the integer part ofT 114 ) satisfies the conditions in result (2) for the sequential t-test. So does Pm=(T) = [IOOT 3i 10 ]. It is important, therefore, that researchers use the same Pm=(T) when deciding the order of the augmented autoregression. In this context, the Monte Carlo literature examining the small-sample properties of various unitroot tests is relevant. Simulations run by Schwer! (1989) show that in small and moderately large samples (from T = 25 to I ,000), including enough lags in the autoregression is important to minimize size distortions? The choice of p (the number of lags included in the autoregression) that was more or less successful in 114 controlling the actual size in Schwert's study is [12 · C~) J. The upper bound Pm=(T) should therefore give lag selection rules a chance of selecting a p as large as this. Therefore, in the examples and application of this chapter, we use the function

1 ~f ]

4

4

Pm=(T) = [12 · (

(integer part of 12 · (

1 ~f )

(9.4.31)

in any of the rules for selecting the lag length. This function satisfies the conditions of results (2) and (3) above. The sample period when selecting pis t = p"""'(T) + 2, p"""'(T) + 3, ... , T. The first t is Pm=(T) + 2 because Pm=(T) + I observations are needed to calculate Pm=(T) lagged changes in the augmented autoregression. Since only T Pm=(T) - 1 observations are used to estimate the autoregression (9.4.30) for j = I, 2, ... , p"""'(T), the sample size in the objective function in (9.4.29) should be

8This is Theorem 4.3 of Ng and Perron (1995). The theorem does not indicate the upper bound PmiU(T), but (as pointed out by Pierre Perron in private communications with the author) Hannan and Deistler (1988, Section 7.4 (ii)) implies that (9.4.28) can be an upper bound. 9 Recall that a size distortion refers to the difference between the actual or elCact size (the finite-sample probability of rejecting a true null) and the nominal size (the probability of rejecting a true null when the sample size

is infinite),

595

Unit-Root Econometrics

T- Pmax(T)- I rather than T: SSRj ) ( log T - PmaoJT) - I

.

+ (] +

C(T - Pmax(T) - I) I) x T - Pmax(T) - I .

(9.4.32)

Including the Intercept in the Regression As in the DF tests, we can modify the ADF tests so that they are invariant to an addition of a constant to the series. We allow [y,] to differ by an unspecified constant from the process that follows augmented autoregression without intercept (9.4.3). This amounts to replacing they, in (9.4.3) by y, -a, which yields y,

=a'+ PY<-1 + ~~

l>y,_l

+ ~2 t.y,_2 + · · · + ~P t.y,_P + &,

(9.4.33)

where a' = (I - p)a. As in the AR(l) model with intercept, the 1(1) null is the joint hypothesis that p = I and a' = 0. Nevertheless, unit-root tests usually focus on the single restriction p = I. It is left as an analytical exercise to prove Proposition 9.7 (Augmented Dickey-Fuller test of a unit root with intercept): Suppose that [y,) is an ARIMA(p, I, 0) process so that [t.y,) is a zero-mean stationary AR(p) process following (9.4.1). Consider estimating the augmented autoregression with intercept, (9.4.33), and Jet(&,/)". f~o be the OLS coefficient estimates. Also Jet t" be the 1-value for the hypothesis that p = I. Then

f2 .... , frl

,, .. ADF p statJStJc:

T · (p'" - I) • •

•

(9.4.34)

ADFt 1' statistic: t 1' --> DF/',

(9.4.35)

I -

~I -

~2 -

... -

~p

d

where the random variables DF% and DF/' are defined in (9.3.13) and (9.3.14). The basic idea of the proof is to use the Frisch-Waugh Theorem to write /)" and the t-value in formulas involving the demeaned series created from [y, ]. Before turning to an example, two comments about Proposition 9.7 are in order. • (Numerical invariance) As was true in Proposition 9.4, since the regression includes a constant, the test statistics are invariant to an addition of a constant to the series. • (Said-Dickey extension) The Said-Dickey-Ng-Perron extension is also applicable here: if (t.y,] is stationary and invertible ARMA(p, q) (so t.y, can be

596

Chapter 9

written as a possibly infinite-order autoregression), then the ADF p 1' and t 1' statistics have the limiting distributions indicated in Proposition 9.7, provided that p (the number of lagged changes to be included in the autoregression) is set by the sequential t rule or the AIC or BIC rules. 10 Example 9.2 (ADF on T-bill rate): In the empirical exercise of Chapter 2, we remarked rather casually that the nominal interest rate R, might have a trend for the sample period studied by Fama (1975). Here we test whether it has a stochastic trend (i.e., whether it is driftless 1(1)). The data set we used in the empirical application of Chapter 2 was monthly observations on the U.S. one-month Treasury bill rate. We take the sample period to be from January 1953 to July 1971 (T = 223 observations), the sample period of Fama's (1975) study. The interest rate series does not exhibit time trend. So we include a constant but not time in the autoregression. The maximum lag length Pm=(T) is set equal to 14 according to the function (9.4.31). In the process of choosing the actual lag length p, we fix the sample period from t = Pm=(T) + 2 = 16 (April 1954) to t = 223 (July 1971), with a sample size of208 (= T- Pm=(T)- 1). The sequential t rule picks a p of 14. The objective function in the information-criterion-based rule when the sample size is fixed is (9.4.32). The AlC picks p = 14, while the BlC picks p = I. Given p = I, we estimate the augmented autoregression on the maximum sample period, which is t = p +2, ... , T (from March 1953 to July 1971) with 221 observations. The estimated regression is R, = 0.088 + 0.97705 (0.057) (0.0162)

R,_, -

0.20 L'.R,_" (0.067)

R2 = 0. 944, sample size = 221. From this estimated regression, we can calculate _ ADF p~ = 221

0.97705- I X

I+ 0.20

= -4.22, t 1'

0.97705- I ---,---,-,--,----- = - 1.4 2. 0.0162

From Table 9.1, Panel (b), the 5 percent critical value for the ADF p 1' statistic is -14.1. The 5 percent critical value for the ADF t~ statistic is -2.86 from Table 9.2, Panel (b). For either statistic, the 1(1) null is easily accepted.

10 Ng and Perron (1995) proved this result about the selection of lag length only for the ca~e without intercept. That it can be extended to the case with intercept and also to the case with trend was confinned in a private communication with one of the authors of the paper.

597

Unit-Root Econometrics

Incorporating Time Trend Allowing for a linear time trend, too, can be done as in the DF tests. Replacing the y, in augmented autoregression (9.4.3) by y, - a - 8 · 1, we obtain the following augmented autoregression with time trend: y, = a• + 8• ·I+ PY<-I +{I t'.y,_I + {2 t'.y,_2 + · · · + {p t'.y,_P + &,,

(9.4.36) where a• =(I- p)a

+ (p- {I- {2- · · · - {p)8,

8• = ( I - p)8.

(9.4.37)

Since 8• = 0 when p = I, the 1(1) null (that the process is 1(1) with or without drift) implies that p = I and 8• = 0 in (9.4.36), but, as usual, we focus on the single restriction p = I. Combining the techniques used for proving Propositions 9.5 and 9.7, it is easy to prove

Proposition 9.8 (Augmented Dickey-Fuller Test of a unit root with linear time trend): Suppose that (y,] is the sum of a linear time trend and an ARIMA(p, I, 0) process so that (t'.y,) is a stationary AR(p) process whose mean may or may not be zero. Consider estimating the augmented autoregression with trend, (9.4.36), and Jet (a, 8, p', {I> {2 , ... , {p) be the OLS coefficient estimates. Also Jet r' be the 1-va/ue for the hypothesis that p = I. Then ADF p' statistic:

I

T · (jj' - I) " _ -{I -

(2 -

... -

" ~ DF;, (p

(9.4.38)

d

ADF 1' statistic: 1' -> DF,',

(9.4.39)

d

where the random variables DF; and DF,' are defined in (9.3.19) and (9.3.20). Before turning to an example, three comments are in order. • (Numerical invariance to trend parameters) As in Proposition 9.5, since the regression includes a constant and time, neither a nor 8 affects the OLS estimates of (p, {I, ... , {p) and their standard errors, so the finite-sample as well as large-sample distribution of the ADF statistics will not be affected by (a, 8). • (Said-Dickey extension) Again, the Said-Dickey-Ng-Perron extension IS applicable here: if (t'.y,] is stationary and invertible ARMA(p, q) with possibly a nonzero mean, then the ADF p' and t' statistics have the limiting distributions

59B

Chapter 9

indicated in Proposition 9.8, provided that jj is set by the sequential1-rule or the AIC or BIC rules. • (Should the regression include time?) The same comment we made about the choice between the OF tests with or without trend is applicable to the ADF tests. If you are confident that the null process has no trend, then the ADF p" and 11' tests of Proposition 9.7 should be used, with the augmented autoregression not including time, because the finite-sample power of the ADF tests is generally greater without time in the augmented autoregression. On the other hand, if you think that the process might have a trend, then you should use the ADF p' and I' tests of Proposition 9.8 with time in the augmented autoregression. Example 9.3 (Does GDP have a unit root?): As was shown in Figure 9.1, the log of U.S. GOP clearly has a linear time trend, so we include time in the augmented autoregression. Using the logarithm of U.S. real GOP quarterly data from 1947:QI to 1998:Ql (205 observations), we estimate the augmented autoregression (9.4.36). As in Example 9.2, we set Pm=(T) = [12. (205/100) 114 ] (which is 14). Also as in Example 9.2, the sample period is fixed (1 = Pm=(T) + 2, ... , T) in the process of selecting the number of lagged changes. The sequential I picks jj = 12, while both A1C and B1C pick jj = I . Given jj = I, the augmented autoregression estimated on the maximal sample of 1 = jj + 2, ... , T (1947:Q3 to 1998:Ql) is y, =

0.22 (0.10)

+

0.00022 · I (0.00011) R2

= 0.999,

+

0.9707 y,_, (0.0137)

sample size

+

0.348 L'.yH, (0.066)

= 203.

From this estimated regression, we can calculate , Q 0.9707388 - I I_ _ = -9.11, AD F p = 2 3 x 0 348 I'=

0.9707388 - I = -2.13. 0.0137104

From Table 9.1, Panel (c), the 5 percent critical value for the ADF p' statistic is -21.7. The 5 percent critical value for the ADF 1' statistic is -3.41 from Table 9.2, Panel (c). For either statistic, the 1(1) null is easily accepted. A little over 50 years of data may not be enough to discriminate between the 1(1) and 1(0) alternatives. (As noted by Perron (1991 ), among others, the span of the data rather than the sampling frequency, e.g., quarterly vs.

599

Unit-Root Econometrics

annual, is more important for the power of unit-root tests, because I - p is much smaller on quarterly data than on annual data.) We now use annual GDP data from 1869 to 1997 for the ADF tests." SoT = 129. As before, we set Pmox(T) = [12 · (129/100) 114 ] (which is 12) and fix the sample period in the process of choosing the lag length. The sequential t picks 9 for p, while the value chosen by AIC and BIC is I. When p = I, the regression estimated on the maximum sample oft= fj + 2, ... , T (from 1871 to 1997, 127 observations) is

y, =

0.67 (0.18)

+

0.0048 · t (0.0014)

R 2 = 0.999,

+

0.85886 Yt-1 (0.03989)

+

0.32 t. Vt-1, (0.086)

sample size = 127.

The ADF p' and t' statistics are -26.2 and -3.53, respectively, which are less (greater in absolute value) than their respective 5 percent critical values of -21.7 and -3.41. With the annual data covering a much longer span of time, we can reject the 1(1) null that GDP has a stochastic trend. The impression one gets from Examples 9.1-9.3 is the ease with which the 1(1) null is accepted, which leads to the suspicion that the DF and ADF tests may have low power in finite samples. The power issue will be examined in the next section. Summary of the OF and ADF Tests and Other Unit-Root Tests The various 1(1) processes we have considered in this and the previous section satisfy the model (a set of DGPs)

y, =

d,

+ z,, z,

=

pz,_ 1 + u,,

{u,)

~zero-mean

1(0).

(9.4.40)

The restrictions on (9.4.40), besides the restriction that p = I, that characterize the I( I) null for each of the cases considered are the following. (I) Proposition 9.3: d, = 0, {u,] independent white noise,

(2) Proposition 9.4: d, =a, {u,} independent white noise, (3) Proposition 9.5: d, =a

+ 8 · t, {u,} independent white noise,

(4) Said-Dickey extension of Proposition 9.6: d, = 0, {u,} is zero-mean ARMA(p, q), II Annual GDP data from 1929 are available from the Bureau of Economic Ana1ysis web site. The BalkeGordan GNP data (Balke and Gordon, 1989) were used to ell:lrapolate GDP back to 1869.

600

Chapter 9

(5) Said-Dickey extension of Proposition 9.7: d, =a, {u,) is zero-mean ARMA(p,q), (6) Said-Dickey extension of Proposition 9.8: d, = ARMA(p,q).

a+ 8 · t,

{u 1 } is zero-mean

QUESTIONS FOR REVIEW

1. (Non-uniqueness of the decomposition between zero-mean !(0) and driftless !(I)) Assume p = 2 in (9.4.4). Show that the third-order autoregression can be written equivalently as y, =a, · y,_,

+a, · (y,_,

- y,_,)

+a,· (y,_,

- y,_,)

+ '•·

How are (a 1 , a 2 , a 3 ) related to (¢ 1 , ¢>2, ¢3)? Which regressor is zero-mean 1(0) and which one is driftless !(!)?Verify that the OLS estimate of they,_, coefficient from this equation is numerically the same as the OLS estimate of p from (9.4.3) with p = 2. 2. Let (z) = I - ¢ 1z- · · ·- (z) = 0 has a real root inside the unit circle if p > I. Hint: ¢(!) = I - ( 1 + ·· · + I. ¢(0) = I > 0. 2

'i'·

3. Prove that (9.4.14) -+d A2 W(l) 2 + Hint: The asymptotic distribution of } L;~, t>y,_, y,_, is the same as that of } L;~, t>y, y,. Also, t>y, y, = t>y, y,_, + (t>y, ) 2 • Use Proposition 9.2(b). 4. Let {l>y,] be a zero-mean 1(0) process satisfying (9.2.1)-(9.2.3). (The zeromean stationary AR(p) process for {t>y,] considered in Proposition 9.6 is a special case of this.) (a) Show that {l>yt-Jotl is a stationary and ergodic martingale difference sequence. Hint: Since {8 1 ] and {t>y,} are jointly ergodic stationary, {t>y,_ 1· o,} is ergodic stationary. To show that {t>y,_,,,] is a martingale difference sequence, use the Law of Iterated Expectations. {8 1 _ 1, 8 1 _ 2 , •.. ] has more information than {t>y,_ 2 ,,_ 1, t>y,_3 ,,_,, ... }. (b) Prove (9.4.19). Hint: You need to show that E[(t>y,_ 1 o1) 2 ] = Yo.

<J 2

601

Unit-Root Econometrics

5. Verify (9.4.24). Hint: From (9.4.20) and (9.2.4), ,

!{W(l) 2

I

- I) T · (p - I) -> - - .o..'--:7'---'d 1/r( I) W'dr .

Jd

Also, by (9.4.15) and (9.2.4), s-J(I, l)elementof(Ar)- 1

--;;

I/)t(l) 2 Jd W 2 dr.

(9.4.41)

Since ljr(l) > 0 by the stationarity condition, ./1/r(l) 2 = ljr(l). 6. (Hest on 0 It is claimed in the text (right below (9.4.27)) that the usual 1value for each I; is asymptotically standard normal. Verify this for the case of p = I. Hint: (9.4.27) for p = I is ' - 1; )-> N(O. a 2 /Yo). v"" 1 · (1; 1 1 d

You need to show that v'f times the standard error of { 1 converges in probability to the square root of a 2 jy0 . v'f times the standard error of { 1 is s · (2. 2) element of (Ar) 1•

J

9.5 Which Unit-Root Test to Use? Besides the OF and ADF tests, a number of other unit-root tests are available (see Maddala and Kim, 1998, Chapter 3, for a catalogue). The most prominent among them are the Phillips (1987) test for case (4) (mentioned at the end of the previous section) and the Phillips-Perron (1988) test, a generalization of the Phillips test to cover cases (5) and (6). (The Phillips tests are derived in Analytical Exercise 6.) Their tests are based on the OLS estimate of the y,_ 1 coefficient in an AR(l) equation, rather than an augmented autoregression, with the long-run variance of {t.y,) estimated from the residuals of the AR(l) equation. We did not present them in the text because the finite-sample properties of the tests are rather poor (see below). There is a new generation of unit-root tests with reasonably low size distortions and good power. They include the ADF-GLS test of Elliott, Rothenberg, and Stock (I 996) and the M -tests of Perron and Ng (1996).

602

Chapter 9

Local-to-Unity Asymptotics

These and most other unit-root tests are consistent against all fixed 1(0) alternatives.12 Which consistent test should be chosen over the other? Recall that in Section 2.4 we defined the asymptotic power against a sequence of local alternatives that drift toward the null at rate ./'f. lt turns out that in the unit-root context the appropriate rate is T rather than .ff (see, e.g., Stock, 1994, pp. 2772-2773). That is, the probability of rejecting the null of p = l when the true DGP is

c

P =I+T

(9.5.1)

has a limit between 0 and l as T goes to infinity. This limit is the asymptotic power. The sequence of alternatives described by (9.5.1) is called local-to-unity alternatives. By definition, the asymptotic power equals the nominal size when c = 0. It can be shown that the DF or ADF p tests are more powerful than the DF or ADF t tests against local-to-unity alternatives (see Stock, 1994, pp. 27742776). That is, for any nominal size and c, the asymptotic power of the DF or ADF p statistic is greater than that of the DF or ADF t-statistic. On this score, we should prefer the p-based test to the t-based test, but the verdict gets overturned when we examine their finite-sample properties. Small-Sample Properties

Turning to the issue of testing an I( l) null against a fixed, rather than drifting, 1(0) alternative, the two desirable finite-sample properties of a test are a reasonably low size distortion and high power against 1(0) alternatives. There is a large body of Monte Carlo evidenoe on the finite-sample properties of the DF, ADF, and other unit-root tests (for a reference, see the papers cited on p. 2777 of Stock, 1994, and see Ng and Perron, 1998, for a comparison of the ADF-GLS and the M -tests). Major findings are the following. • Virtually all tests suffer from size distortions, particularly when the null is in a sense close to being 1(0). TheM-test generally has the smallest size distortion, followed by the ADF t. The size distortion of the ADF p is much larger than that of the ADF t. For example, the I( I) null considered by Schwer! (1989) is

y,- Yr-1 = cr When

+ e,,_,, (c,} Gaussian i.i.d.

(9.5.2)

e is close to -1, this l(l) process is very close to a white noise process,

12 That the DF tests are consistent against general 1(0) alternatives was verified in Review Question 5 to Section 9.3. For the consistency of the ADF tests, see, e.g., Stock (1994, p. 2770).

603

Unit-Root Econometrics

because the AR root of unity nearly equals the MA root of -e. Simulations by Schwert (1989) show that the ADF p and Phillips-Perron Zp and Z, tests (see Analytical Exercise 6) reject the I (I) null far too often in finite samples when e is close to -I. In particular, for the Phillips-Perron tests, the actual size when the nominal size is 5 percent is more than 90 percent even for sample sizes as large as I ,000 when e = -0.8. • In the case of ADF tests, how the lag length p is selected affects the finitesample size and power. For a given rule for choosing fj, the power is generally higher for the ADF p-test than for the ADF !-test. • The power is generally much higher for the ADF-GLS and theM-tests than for the ADF t- and p-tests. Unlike the M -statistic, the ADF-GLS statistic can be calculated easily by standard regression packages. In the empirical exercise of this chapter, you will be asked to use the ADF-GLS to test an I (I) null.

9.6

Application: Purchasing Power Parity

The theory of purchasing power parity (PPP) states that, for any two countries, the currency exchange rate equals the ratio of the price levels of the two countries. Put differently, once converted to a common currency, national price levels should be equal. Let P, be the price index for the United States, P." be the price index for the foreign country in question (say, the United Kingdom), and S, be the exchange rate in dollars per unit of the foreign currency. PPP states that P, = S, · P,*,

(9.6.1)

or taking logs and using lower-case letters for them, 13

Pr = s,

+ p;.

(9.6.2)

PPP is related to but different from the law of one price, which states that (9.6.1) holds for any good, with P, representing the dollar price of the good in question 13 Equation (9.6.1) or (9.6.2) is a statement of "absolute PPP.'' What is called relative PPP requires only that the first difference of (9.6.2) hold, that is, the rate of growth of the exchange rate should offset the differential between the rate of growth in home and foreign price indexes. Relative PPP should not be confused with the weaker version of PPP to be introduced below.

604

Chapter 9

and P,' the foreign currency price of the good. If international price arbitrage works smoothly, the law of one price should hold at every date r. If the law of one price holds for all goods and if the basket of goods for the price index is the same between two countries, then PPP should hold. Due to a variety of factors limiting international price arbitrage, PPP does not hold exactly at every date r. 14 A weaker version of PPP is that the deviation from PPP Zr

=

Sr - Pr

+ P1• •

(9.6.3)

which is the real exchange rate, is stationary. Even if it does not enforce the law of one price in the short run, international arbitrage should have an effect in the long run. The Embarrassing Resiliency of the Random Walk Model?

For many years researchers found it difficult to reject the hypothesis that real exchange rates follow a random walk under floating exchange regimes. As Rogoff (1996) notes, this was somewhat of an embarrassment because every reasonable theoretical model suggests the weaker version of PPP. However, studies since the mid 1980s looking at longer spans of data have been able to reject the random walk null. Recent work by Lothian and Taylor (1996) is a good example. It uses annual data spanning two centuries from 1791 to 1990 for dollar/sterling and franc/sterling real exchange rates to find that the random walk null can be rejected. In what follows, we apply the ADF r-test to their data on the dollar/sterling (dollars per pound) exchange rate. The dollar/sterling real exchange rate, calculated as (9.6.3) and plotted in Figure 9.4, exhibits no time trend. We therefore do not include time in the augmented autoregression. If we were dealing with the weak version of the law of one price, with P, and P,' denoting the home and foreign prices of a particular good, then we could exclude the intercept from the augmented autoregression because a change in units of measuring the good does not affect z,. But P, and P,' here are price indexes, and a change in the base year leads to adding a constant to z,. To make the test invariant to such changes, a constant is included in the augmented autoregression. By applying the ADF test, rather than the DF test, we can allow for serial correlation in the first differences of z, under the null, but the lag length chosen by BIC (with Pmm(T) = [12 · (200/I
605

Unit-Root Econometrics

0.5 0.4 0.3 0.2 0.1 0 ·0.1 ·0.2 -0.3 -0.4 -0.5 1791

1811

1831

1851

1871

1891

1911

1931

1951

1971

Figure 9.4: Dollar/Sterling Real Exchange Rate, 1791-1990, 1914 Value Set to 0

z,= 0.179 (0.052)

+

0.8869z,_ 1, SER=0.07i, R2 =0.790, t= 1792, ... ,1990. (0.0326)

The OFt-statistic (t") is -3.5 (= (0.8869- 1)/0.0326), which is significant at I percent. Thus, we can indeed reject the hypothesis that the sterling/dollar real exchange rate is a random walk. An obvious caveat to this result is that the sample period of 1791-1990 includes fixed and floating rate periods. It might be that international price arbitrage works more effectively under fixed exchange rate regimes. If so, the z,_ 1 coefficient should be closer to 0 during fixed rate periods. Lothian and Taylor ( 1996) report that if one uses a simple Chow test, one cannot reject the hypothesis that the z,_ 1 coefficient is the same before and after floating began in 1973. In the empirical exercise of this chapter, you will be asked to verify their finding and also use the ADF-GLS to test the same hypothesis. PROBLEM SET FOR CHAPTER 9 _ _ _ _ _ _ _ _ _ _ _ __ ANALYTICAL EXERCISES

1. Prove Proposition 9.2(b). Hint: By the definition of t.~,.

So

t.~,,

we

have~~

=

~,_ 1

+

606

Chapter 9

From this, derive

Sum this overt = I, 2, ... , T to obtain T

T

:L t>~~ · ~~-' = GJ[(~d- (~ol l- m:L(t>~~) . 2

2

t=l

Divide both sides by T to obtain

L 6.~1 . ~I-I =

I T T

I(

2

~T

2

.jT

)

-

I(

2

~o )

2

I

T

- 2T L(f>~ 1 )

.jT

1=1

2 •

t=l

Apply Proposition 6.9 and the Ergodic Theorem to the terms on the right-hand side of(*).

2. (Proof of Proposition 9.4) Let p~ be the OLS estimate of p in the AR(l) equation with intercept, (9.3.11), and define the demeaned series {y;_,) (I = I, 2, ... , T) by ~

Y1_,

This

=

Y1-1-

- y, Y

=

Yo+Yl+ .. ·+YT-1 T (I= I, 2, ... , T).

(I)

y;_, equals the OLS residual from the regression of y1_ 1 on a constant for

t=l,2, ... ,T. (a) Derive "JJ- _

p

l _ -

"T ~ Ltt=l f::l.yt Yr-1 "T ~ L-l=i(YI-1)

(2)

2

Hint: By the Frisch-Waugh Theorem, p~ is numerically equal to the OLS

coefficient estimate in the regression of y 1 on

y;_, (without intercept). Thus, (3)

Use the fact that

L;=l (yl

I;;=, y;_,

- Y1-1l · Y:'-1·

= 0 to show that

I;;=, (y y;_,) · y;_, 1 -

607

Unit-Root Econometrics

(b) (Easy) Show that, under the null hypothesis that [y,] is driftless 1(1) (not necessarily driftless random walk),

A'

-([W"(l)] 2

T · (p"

-

I)--+ 2

A2 •

d

Jd

y

-

[W"(O)J 2) - ~

2

[W"(r)) 2 dr

(4)

where Yo is the variance of D.y,, A2 is the long-run variance of D.y,, and W"(-) is a demeaned standard Brownian motion introduced in Section 9.2. Hint: Use Proposition 9.2(c) and (d) with~' = y,. (c) (Trivial) Verify that T · (p"- I) converges to DF~ (defined in (9.3.13)) if [y,} is a driftless random walk. Hint: If y, is a random walk, then A2 = y0 . (d) (Optional) Lets be the standard error of regression for the AR(I) equation with intercept, (9.3.11). Show that s 2 is consistent for y0 (= Var(D.y,)) under the null that [y,} is driftless I (I) (so D.y, is zero-mean 1(0) satisfying (9.2.1)--{9.2.3)). Hint: Let(&*, p") be the OLS estimates of (a*, p). Show that &* converges to zero in probability. Write s 2 as T

s' =

I

T-1

L[(D.y,- &*) -

(p" -

l)y,_JJ'

t=l

T

=

I T- I

L( D.y, -a'*)' t=l

2 . [T. T-1

T

(p"- I)]·.!._ L(D.y,- &*). Yt-) T

t=l

T

+

I , 1"" 2 2 T _I · [T · (p"- I)] · T' L)y,_,) . (5) t=l

Since plim &* = 0, it should be easy to show that the first term on the righthand side of (5) converges in probability to y0 = E[ (D.y,) 2]. To prove that the second term converges to zero in probability, take for granted that I I LY<-i T --+ A ..;T T ct f'i' -

t=l

1'

W(r) dr.

(6)

0

(e) Show that t" --+ct DF'/ if [y,] is a driftless random walk. Hint: Since [y,] is a driftless random walk, A2 = y0 = a 2 (=variance oft:,). So sis

608

Chapter 9

consistent for

rJ.

.0"- 1

(7)

t" = s-x-,;--;=(2=.c::2C")'=el=e=m=e=nt=o=f=o(x=·x"")=c1 Show that the square root of the (2, 2) element of (I:;~,

x,x;) _,

is

1

x, = [1, y,_Jl'.

where 3.

(Proof of Proposition 9.5) Let

p'

be the OLS estimate of p in the AR(l)

equation with time trend, (9.3.17), and define the detrended series [y,'_ 1) (t =

1, 2, ... , T) by

' y,_ 1

8)

where (a,

=y,_

1

. .

-a-8·t,

(8)

are the OLS coefficient estimates for the regression of y,_ 1 on

(1, t) for t = 1, 2, ... , T. (a) Derive

., p - 1-

'i;"'T

L.....t=l

b.

'i;"'T

(

' Yr Yr-1 '

L....r=l Yr-l

Hint: By the Frisch-Waugh Theorem,

,0'

(9)

)2

is numerically equal to the OLS

estimate of the y J_ 1 coefficient in the regression of y, on y 1'_ 1 (without intercept or time). Thus, "'r

p =

'i;"'T

L....t=l

Yr

'i;"'T

(

'

Yr-1 '

Lt=l Yr-1

(10)

)2

Show that I:;~, (y,- y;_,) · y1'_ 1 = I:;~, (y,- y,_,) · y,'_ 1 • (By construction,

I:;~,

y,'_ 1 =

0 and I:;~, t ·

y;_, =

0 (these are the normal equations).)

(b) (Easy) Show that, under the assumption that y,

=a+ 8. t +~~where[~,)

is driftless !(1),

A' ~ -([W'(1)] 2 - [W'(0)] 2 ) - ~ T·<.D'-1)-+ 2 d

A' j~[W'(r)]2dr

2

(11)

609

Unit-Root Econometrics

where

Yo is the variance of

t>y,, A2 is the long-run variance of t>y,, and

W' O is a detrended standard Brownian motion introduced in Section 9.2. Hint: Let ~,'_ 1 be the residual from the hypothetical regression of ~r-l on (I, t) lor t = I, 2, ... , T where 1;, = y, - a - 8 · t (this regression is hypothetical because we do not observe{!;,)). Then 1;,'_ 1 = fort= I, 2, .... T, because 1;, differs from y, only by a linear trend a + 8 · t. Use Proposition 9.2(e) and (I). Since t>y, = 8 + <'>~, under the null, the variance of !>~, equals that of !> y, and the long-run variance of <'>~, equals that of

_v;_,

t>v,. (c) (Trivial) Verify that T · (jj'- I) converges to DF; (defined in (9.3.19)) if {y,} is a random walk with or without drift.

4. (Proof of (9.4.16)) In the text,

L .Yr-1 e, T

-I T

--+ c, d

t=l

=

I 2 · Y,(l) · [W(l)'- I] -u 2 0

(9.4.16)

was left unproved. Prove this under the condition that {y,} is driftless I (I) so that {t>y,) is zero-mean 1(0) satisfying (9.2.1)-(9.2.3).

Hint: Thee, in

(9.4.16) is the innovation process in the MA representation of {t>yr). Using the Beveridge-Neisen decomposition, we have

Yr =

l/J(l)w, +~~+(Yo- ~o), w,

= e, + e, + · · · + e,.

So

IT

IT

T LYr-J e, = 1/1(1) T L t=I

IT

w,_, e, + T

t=I

IT

L ~r-J e, +(Yo- ~o) T L e,. t=I

t=I

(*) Show that the second and the third term on the right-hand side of this equation converge in probability to zero. Once you have done Analytical Exercise 1 above, it should be easy to show that I

T

T

2

L w,_, e,-; (~ )lW(l)'- 1]. t=I

610

Chapter 9

5. (Optional, proof of Proposition 9.7) To simplify the calculation, set p = I, so the augmented autoregression is y,

Let (a', p~,

=a'+ PY<-1 + ~~

t>y,_l

+ <,.

fil be the OLS coefficient estimates. Show that T__. .:::DF~. I - ~~ d

Hint: By the Frisch-Waugh Theorem, (p~, { 1) are numerically the same as the

OLS coefficient estimates from the regression of y, on (y;_,, (l>y,_,)l~l), where is the residual from the regression of on a constant and (t>y,_,)l~l is the residual from the regression of t>y,_ 1 on a constant lor 1 = I, 2 ... , T. Therefore, if b = (p~. f,)', fJ = (p, ~J)'. and x, = (y;_,, (t>y,_,)l~l)', then they satisfy (9.4.8). So

y;_,

y,_,

T

1"'~2 (I, I) element of A 7 = T2 L)Y,_,) , t=l

I 1st element of c7 = T

T

LY:'- 1 • s,. t= 1

Apply Proposition 9.2(c) and (d) with 1;, = y, to these expressions. 6. (Phillips tests) Suppose that (y,} is driftless I (I), so t>y, can be serially correlated. Nevertheless, estimate the AR(l) equation without intercept, rather than the augmented autoregression without intercept, on (y,}. Let p be the OLS estimate of they,_, coefficient. Define Zp

= Z,

,

T · (p- I) -

= ;cs · I A

-

I

- ·

2

I

2·

(T

2 ·

s2

uj) , ·

2

,

(A -Yo),

(T·u~>) '2 , , · (A - Yo), s ·A

where &P is the standard error of p, s is the OLS standard error of regression, X2 is a consistent estimate of A2 , y0 is a consistent estimate of y0 , and 1 is the 1-value for the hypothesis that p = I.

611

Unit-Root Econometrics

(a) Show that Zp and Z, can be written as

Hint: By the algebra of least squares, I

(b) Show that

Zp ---> DFp and Z, ---> DF,. d

d

Hint: Use Proposition 9.2(a) and (b)

with~~

= y,.

7. (Optional) Show that a(L) in (9.2.5) is absolutely summable. Hint: A one-line

proof is 00

00

00

00

L:: Vt; < L:: L:: It;)= "L:ilt;) < i=j+l

00.

j=Oi=j+l

Justify each of the equalities and inequalities. MONTE CARLO EXERCISES

1. (You have less power with time) In this simulation, we wish to verify that the finite-sample power against stationary alternatives is greater with the OF t~-test than with the OFt' -test, that is, inclusion of time in the AR( I) equation reduces power. The OGP we examine as the stationary alternative is a stationary AR(l) process: y,

= py,_, + e,,

p

= 0.95,

e,

~

i.i.d. N(O. !), Yo= 0.

Choose the sample size T to be I 00 and the nominal size to be 5 percent. Your computer program for calculating the finite-sample power should have the following steps.

612

Chapter 9

(1) Set counters to zero. There should be two counters, one for the DF t~ and the other fort'. They will record the number of times the (false) null gets rejected. (2) Start a do loop of a large number of replications (I million, say). In each replication, do the following. (i) Generate the data vector (yo, Y1, ... , YT ).

Gauss Tip: As was mentioned in the Monte Carlo exercise for Chapter I, to generate (y 1, y,, ... , YT), avoid creating a do loop within the current do loop; use the matrix operators. The matrix formula is r

=

y (Txl)

·yo+

A

e

(TxT}(Txi)

(Txl)

where p

YI Y2

y=

p' r= ' PT

YT

0

1

0

p

A=

.......

p'

p

PT-1

PT-2

0 0 0 ' f=

fi

s, fT

p

To define r, use the Gauss command seqm. To define A, use toepli tz and lowma t. rand A should be defined in step (1), outside the current do loop. (ii) Calculate I~ and t' from (yo. Y1o ..• , YT). (So the sample size is actually T +I.)

by I if I~ < -2.86 (the 5 percent critical value for DF'/). Increase the counter fort' by I if t' < -3.41 (the 5 percent critical value for D F,').

(iii) Increase the counter for

I~

(3) After the do loop, for each statistic, divide the counter by the number of replications to calculate the frequency of rejecting the null. This frequency converges to the finite-sample power as the number of Monte Carlo replications goes to infinity.

613

Unit·Aoot Econometrics

Power should be about 0.124 for the OF t"-test (so only about 12 percent of the time can we reject the false null!) and 0.092 for the OFt' -test. So indeed, power is less when time is included in the regression, at least against this particular OGP. 2. (Size distortions) In this Monte Carlo simulation, we calculate the size distortions for the AOF p"- and !~'-tests. Our null OGP is the one examined by Schwer! (I 989): y, = py,_,

e,

~

+e, +lie,_,,

p =I, II= -0.8,

i.i.d. N(O, 1), Yo= 0, eo= 0.

For T = I 00 and the nominal size of 5 percent, calculate the finite-sample or exact size for /3" and t". It would be useful to select the number of lagged changes in the augmented autoregression by one of the data-dependent methods (the sequential t, the AIC, or the BIC), but to keep the computer program simple and also to save CPU time, set it equal to 4. Thus, the regression equation, as opposed to the OGP, is

\1 · (y,_, - Yt-2) + \2 · (Yt-2- y,_,) + \1 · (y,_,- y,_.) + I• · (y,_•- Yt-s) +error.

y, =canst.+ PYt-1 +

Since the sample includes y0 , the sample period that can accommodate four lagged changes is from t = 5 (not p + 2 = 6) to t = T and the actual sample size is T- 4. Verify that the size distortion is less for the AOF t"- than for the AOF p"-test. (The size should be about 0.496 for the AOF p" and 0.290 for the AOF t~'.) EMPIRICAL EXERCISES

Read Lothian and Taylor (I 996, pp. 488-509) (but skip over their discussion of the Phillips-Perron tests) before answering. The file LT.ASC has annual data on the following: Column I: Column 2: Column 3: Column 4:

year dollar/sterling exchange rate (call itS,) U.S. WPI (wholesale price index), normalized to 100 for 1914 (P,) U.K. WPI, normalized to 100 for 1914 (P/).

The sample period is 1791 to 1990 (200 observations). These are the same data used by Lothian and Taylor ( 1996) and were made available to us. For data sources

614

Chapter 9

and how the authors combined them to construct consistent time series, see the Appendix of Lothian and Taylor ( 1996). (According to the Appendix, the exchange rate (and probably WPis) are annual averages. This is rather unfortunate and pointin-time data should have been used, because of the time aggregation problem: if a variable follows a random walk on a daily basis, the annual averages of the variable do not follow a random walk. See part (f) below for more on this.) Calculate the dollar/sterling real exchange rate as

z, = ln(S,) ~ ln(P,) + ln(P,').

(I)

Since S, is in dollars per pound, an increase means sterling appreciation. The plot of the real exchange rate in Figure 9.4 shows no clear time trend, so we apply the AOF t"-test. Therefore, the augmented autoregression to be estimated is

z,

= const. + Plr-1 + \1 · (Z:-1 ~ Zr-2) + · · · + \p · (Zr-p ~ Zr-p-1) +error,

(2) where p is the number of lagged changes included in the augmented autoregression (which was denoted p in the text). (a) (AOF with automatic lag selection)

For the entire sample period of 17911990, apply the sequential t-rule, the AIC, and the BIC to select p (the number of lagged changes to be included). Set Pmax (the maximum number of lagged changes in the augmented autoregression, denoted Pmax(T) in the text) by (9.4.31) (so Pm"x = 14). For the sequential t test, use the significance level of I 0 percent. (You should find p = 14 by the sequential t, 14 by AIC, and 0 by BIC.) Verify that the value of the t"-statistic for p = 0 (if estimated on the maximum sample) matches the value reported in Table I of Lothian and Taylor (1996) as r,.

RATS Tip: RATS allows you to change the number of regressors in a do loop. Since the RATS OLS procedure LINREG does not allow the standard errors to be retrieved, the OFt" cannot be calculated in the program. So calculate the OF t"-statistic as the t-value on the z,_ 1 coefficient in the regression of z, ~ Zr-l· Also, neither the AIC nor the BIC is calculated by LINREG. So use COMPUTE. In the RATS codes below, z is the log real exchange rate and dz is its first difference. In applying the sequential t, AIC, and BIC rules, fix the sample period to be 1806 to 1990 (which corresponds to Pmax + 2, ... , T). The RATS codes for AIC and BIC are linreg dz 1806:1 1990:1;# constant z{1)

615

Unit·Aoot Econometrics

compute ssrr=%rss compute akaike = log(%rss/%nobs)+%nreg*2.0/%nobs compute schwarz = log(%rss/%nobs)+%nreg*log(%nobs) /%nobs display akaike schwarz * run ADF with 1 through 14 lagged changes do P=L 14 linreg dz 1806:1 1990:1;# constant z{1) dz{1 top) compute akaike = log(%rss/%nobs)+%nreg*2.0/%nobs compute schwarz = log(%rss/%nobs)+%nreg *log(%nobs)/%nobs display akaike schwarz end do TSP Tip: Unlike RATS, it appears that TSP does not allow you to change the sample period and the number of regressors in a do loop. The OLSQ pro-

cedure prints out the BIC, bm not the AIC, so the AIC must be calculated using the SET command. (b) (OF test on the floating period) Apply the OF 1g-test to the floating exchange rate period of 1974-1990. Because of the lagged dependent variable Z<-J, the first equation is fort = 1975: Zt975

=canst.+

PZt974

+error.

So the sample size is 16. Can you reject the random walk null at 5 percent? (c) (OF test on the gold standard period) Apply the OF tM-test to the gold standard period of 1870--1913. Take t =I for 1871, so the sample size is 43. Can you reject the random walk null at 5 percent? Is the estimate of p (the z,_ 1 coefficient in the regression of z, on a constant and z,_ 1) closer to zero than in (b)? (d) (Chow test) Having rejected the I(l) null, we proceed under the maintained assumption that the log real exchange rate z, follows a stationary AR(l) process. To test the null hypothesis that there is no structural change in the AR(l) equation, split the sample into two subsamples, 1791-1973 and 1974-1990, and apply the Chow test. You may recall that the Chow test was originally developed for models with strictly exogenous regressors. If K is the number of regressors including a constant (2 in the present case), it is an F-test with (K, T - 2K) degrees of freedom. This result is not directly applicable here, because in the AR(l) equation the regressor is the lagged dependent variable

616

Chapter 9

which is not strictly exogenous. However, as you verified in Analytical Exercise 12 of Chapter 2, if the variables (the regressors and the dependent variable) are ergodic stationary (which is the case here because (z,} is ergodic stationary when IPI < I), then in large samples, with the sizes of both subsamples increasing, K · F is asymptotically x2 (K). This result does not require the regressors to be strictly exogenous. Calculate K · F and verify the claim (see pp. 498-502 of Lothian and Taylor. 1996) that "there is no sign of a structural shift in the parameters during the floating period." In the present case of a lagged dependent variable, there is an issue of whether the equation for 1974, ZJ974

= const. +

PZI973

+error,

should be included in the first subsample or in the second. It does not matter in large samples; include the equation in the first subsample (so the sample size of the first subsample is 183). (Note: K · F should be about 1.26.) (e) (The ADF-GLS) As was mentioned in Section 9.5, the ADF-GLS test achieves

substantially higher power at a cost of only slightly higher size distortions. The ADF-GLS t 1'- and t' -statistics are calculated as follows. As in the ADF t"- and t' -tests, the null process is

y,

= d, +

z,. z,

=

PZt-l

+

u,. p =I, {u,} zero-mean 1(0) process.

(3)

Write d, = x;p. For the case where d, is a constant. x, = I and p =a, while for the case where d, is a linear time trend, x, = (I. t )' and p = (a. 8)'. We assume the sample is of size T with (y~o y2 , ••. , YT ). (i) (GLS estimation of fJ) In estimating the regression equation

y,

= x;p +error,

(t

=

I, 2, .... T),

(4)

do GLS assuming that the error process is AR(l) with the autoregressive coefficient of jj = I+ cIT. (The value of c will be specified in a moment.) That is, estimate p by OLS by regressing -

-

-

-

-

-

YI· Y2- PYI· Y3- PY2· · · ·, YT- PYT-l on x1.x2 -px,.x3 -px2 .... ,xr- PXr-1·

(Unlike in the genuine GLS, the first observations, y 1 and x 1, are not

617

Unit-Root Econometrics

weighted by /I ~ p 2 , but this should not matter in large samples.) Set c according to

in the demeaned case (with x, = 1), c=~I3.5

in the detrended case. (Roughly speaking, this choice of censures that the asymptotic power at c = c where p =I+ efT is nearly 50 percent which is the asymptotic power achieved by an ideal test specifically tailored to this alternative of c = c.) Call the resulting estimator PoLs and define y, = y, ~ x;PoLs· (This should not be confused with the GLS residual, which is (y, ~ PYr-Il ~ (x, ~ PXr-Il'PoLs·l (ii) (ADF without intercept on the residuals) Estimate the augmented autore-

gression without intercept using

y, instead of y,:

.Vt = PYr-1 + \1 · (Yr-1 ~ Yr-2) + · · · + \r · (Yr-p ~ .Vr-p-I) +error. The !-value for p = I is the ADF-GLS !-statistic; call it the ADF-GLS t~ if x, = I and the ADF-GLS t' if x, = (I, t)'. The limiting distribution of the ADF-GLS t~ is that of DF, (which is tabulated in Table 9.2, Panel (a)), while that of the ADF-GLS t' is tabulated in Table I. C. of Elliott eta!. (1996). Reproducing their lower tail critical values ~3.48

for I%,

~3.15

for 2.5%.

~2.89

for 5%, and

~2.57

for 10%.

Elliott et a!. ( 1996) show for the ADF-GLS test that the asymptotic power is higher than those of the ADF p- and t -tests, and that compared with the ADF tests, the finite-sample power is substantially higher with only a minor deterioration in size distortions. Apply the ADF-GLS 1~-test to the gold standard subsample 1870-1913. Choose the lag length by BIC with Pmax (the maximum number of lagged changes in the augmented autoregression, denoted PmaxCTl in the text) by (9.4.31) (so Pmax = 9). Can you reject the null that the real exchange rate is driftless 1(1)? (Note: The BIC will pick 0 for p. The statistic should be ~2.24. You should be able to reject the 1(1) null at 5 percent.)

TSP Tip: An example of TSP codes assuming p = 0 for this part is the following.

618

Chapter 9 ?

7 (e) ADF-GLS ?

7 Step 1: GLS demeaning ?

set rhobar=l-7/(1913-1870+1); smpl 1870 1913; const=l; Ct=l; zt=z; ? z lS the log real exchange rate smpl 1871 1913; zt=z-rhobar*z(-1); ct=const-rhobar*const(-1); smpl 1870 1913; olsq zt ct; z=z-@coef(l); ? @coef(l) is the GLS estimate of the mean ? ? Step 2:

DF with demeaned variable

?

smpl 1871 1913; olsq z z (-1); set t=(@coef(l)-1)/@ses(l) ;print t; (f) (Optional analytical exercise on time aggregation)

Suppose y, is a random walk without drift. Create from this time-averaged series by the formula Y1 Yl =

+ Y2 + · ·· + Yn , n

Yn+l Y2 =

+ Yn+2 + · · · + Y2n , · · · · n

(Think of y, as the exchange rate on day t and Yj as an annual average of daily exchange rates for year j.) Show that {yj} is not a random walk. More specifically, show that

619

Unit-Root Econometrics

ANSWERS TO SELECTED Q U E S T I O N S - - - - - - - - - ANALYTICAL EXERCISES

1. Since E(;Jl < oo, ; 0 ; ft converges in mean square (and hence in probability) to 0 as T ---> oo. Thus, the second term on the right-hand side of(*) in the hint can be ignored. Regarding the first term, we have

The first term on the right-hand side of this equation can be ignored. By Proposition 6.9, the second term converges to N(O, A2 ) because !l.;, is a linear process with absolutely summable MA coefficients. So the first term on the right-hand side of(*) converges in distribution to (A 2 j2)X where X ~ x 2 (1). Since {!l.;,} is ergodic stationary, the last term on the right-hand side of(*) converges in probability to ~ E[(!l.;,) 2 ] by the Ergodic Theorem. 5. The (1, I) element of Ar converges to the random variable indicated in Proposition 9.2(c). Regarding the first element of cr, by Beveridge-Nelson, y, = >f!(l)w,

+ ry, +(yo~

ryo), w,

= '' + '' + · · · + t:,.

(I)

Therefore, (2)

where M

-

X 1_ 1 = Xr-I- x,

_

X=

Xo+···+XT-I T

for x

= w, TJ,

and

The second term on the right-hand side of this equation can be ignored asymptotically. By Proposition 9.2(d) for;, = w, (a driftless random walk), the first term on the right-hand side converges to

620

Chapter 9

From this, we have I T

I

T

LY;'_ 1 • s, --;j 2 · a 2 · >fr(l) · ([W"(l)f- [W"(O)f- I).

(4)

/o=l

The A 7 matrix is asymptotically diagonal for the demeaned case too. The offdiagonal elements of A 7 are equal to 1/ft times (5)

Since

L;=, y;'_

1

= 0, this equals

(6) It is easy to show, by a little bit of algebra analogous to that in Review Question

3 of Section 9.4 and Proposition 9.2(d), that (6) has a limiting distribution. So 1/ ft times (6), which is the off-diagonal elements of A 7 , converges to zero in probability. Then, using the same argument we used in the text for the case without intercept, it is easy to show that ~ 1 is consistent for \ 1 . So >fr(l) is consistently estimated by 1/(1- ~I). Taken together, T_·_.:::(p_'"-,--:.:.1) _::_ , --> DF:. I - \I d EMPIRICAL EXERCISES

(a) With p = 0, the estimated AR(l) equation is Zt

=canst.+ 0.8869 Zt-1, SSR = 0.995135, R 2 = 0.790, T = 199. (0.0326)

The value oft" is -3.47, which matches the value in Table I of Lothian and Taylor (1996). The 5 percent critical value is -2.86 from Table 9.2, Panel (b). (b) The estimated AR(l) equation for 1974-1990 is

z, =canst.+ 0.7704z,_ 1 ,

SSR = 0.150964, R 2 = 0.539, T = 16.

(0.1904) t" = -1.21 > -2.86. We fail to reject the null at 5 percent.

621

Unit-Root Econometrics

(c) The estimated AR(l) equation for 1870-1913 is

z,

= const. + 0.7222 z,_ 1 , SSR (0.1 045)

= 0.069429,

R2

= 0.538,

T

= 43.

t" = -2.66 > -2.86. We fail to reject the null at 5 percent.

(d) The estimated AR(l) equation for 1791-1974 is

z,

=cons!.+ 0.8993z,_ 1 , SSR (0.0329)

= 0.837772,

R2

= 0.805,

T = 183.

From this and the results in (a) and (b), 0.995135- (0.837772 + 0.150964) K .F =

o 837772+0.t5o964

1·26 ·

(199 4)

The 5 percent critical value for x 2 (2) is 5.99. So we accept the null hypothesis of parameter stability. (e) The ADF-GLS t" = -2.24. The 5 percent critical value is -1.95 from Table 9.2, Panel (a). So we can reject the 1(1) null.

References Baillie, R., 1996, "Long Memory Processes and Fractional Integration in Econometrics." Journal of Econometrics, 73, 5-59. Balke, N., and R. Gordon, 1989, "The Estimation of Prewar Gross National Product: Methodology and New Evidence," Journal of Political Economy, 97. 39-92. Chan, N., and C. Wei, 1988, "Limiting Distributions of Least Squares Estimates of Unstable Autore· gressivc Processes," Annals of Statistics, 16, 367-401. Dickey, D., 1976, "Estimation and Hypothesis Testing in Nonstationary Time Series," Ph.D. disser· tation, Iowa State University. Dickey, D., and W. Fuller, 1979, "Distribtion of the Estimators for Autoregressive Time Series with a Unit Root," Journal of the American Statistical Association, 74, 427-431. Elliott, G., T. Rothenberg, and J. Stock, 1996, "Efficient Tests for an Autoregressive Unit Root," Econometrica, 64, 813-836. Fama, E., 1975, "Short-Tenn Interest Rates as Predictors of Inflation," American Economic Review, 65, 269--282. Fuller, W., 1976, Introduction to Statistical Time Series, New York: Wiley. _ _ _ , 1996, Introduction to Statistical Time Series (2d ed.), New York: Wiley. Hall, P., and C. Heyde, 1980, Martingale Limit Theory and Its Application, New York; Academic. Hamilton, 1., 1994, Time Series Analysis, Princeton: Princeton University Press. Hannan, E. 1., and M. Deistler, 1988, The Statistical Theory of Linear Systems, New York: Wiley.

622

Chapter 9

Kwiatkowski, D., P. Phillips, P. Schmidt, and Y. Shin, 1992, ''Testing the Null Hypothesis of Stationarity against the Alternative of a Unit Root," Journal of Econometrics, 54, I 59-178. Leyboume, S., and B. McCabe, 1994, "A Consistent Test for a Unit Root," Journal of Business and

Economic Statistics, 12, 157-166. Lothian, J., and M. Taylor, 1996, "Real Exchange Rate Behavior: The Recent Float from the Perspective of the Past Two Centuries," Journal of Political Economy, 104-,488-509. Maddala, G. S., and In-Moo Kim, 1998, Unit Root.~. Cointegration, and Structural Change, New York: Cambridge University Press. Ng, S .. and P. Perron, 1995, "Unit Root Tests in ARMA Models with Data-Dependent Methods for the Selection of the Truncation Lag," Journal of the American Statistical Association, 90, 268-

281. - - - · 1998, "Lag Length Selection and the Construction of Unit Root Tests with Good Size and Power," working paper, mimeo. Perron, P., 1991, "Test Consistency with Varying Sampling Frequency,'' Econometric Theory, 7, 341-368. Perron, P., and S. Ng, 1996, "Useful Modifications to Unit Root Tests with Dependent Errors and Their Local Asymptotic Properties," Review of Economic Studies, 63, 435-463. Phillips, P., 1987, "Time Series Regression with a Unit Root," Econometn'ca, 55, 277-301. Phillips, P., and S. Durlauf, 1986, "Multiple Time Series Regression with Integrated Processes," Review of Economic Studies, 53,473-496. Phillips, P., and P. Perron, 1988, "Testing for a Unit Root in Time Series Regression," Biometrika, 75, 335-346. Rogoff, K., 1996, "The Purchasing Power Parity Puzzle," Journal of Economic Literature, 34,647668. Said, S., and D. Dickey, 1984, "Testing for Unit Roots in Autoregressive~ Moving Average Models of Unknown Order," Biometrika, 71,599-607. Sargan, D., and A. Bhargava, 1983, "Testing Residuals from Least Squares Regression for Being Generated by the Gaussian Random Walk," Econometrica, 51, 153-174. Schwert, G., 1989, "Tests for Unit Roots: A Monte Carlo Investigation," Journal of Business and Economic Statistics, 7, 147-159. Sims, C., J. Stock, and M. Watson, 1990, "Inference in Linear Time Series Models with Some Unit Roots," Econometrica, 58, 113-144-. Stock, J., 1994, "Unit Roots, Structural Breaks and Trends," Chapter 46 in R. Engle and D. McFadden (eds.), Handbook of Econometrics, Volume IV, New York: North Holland. Tanaka, K., 1990, ''Testing for a Moving Average Unit Root," Econometric Theory, 6, 433-444-. _ _ _ , 1996, Time Series Analysis, New York: Wiley. Watson, M., 1994, "Vector Autoregressions and Cointegration," Chapter 47 in R. Engle and D. McFadden (eds.), Handbook of Econometrics, Volume IV, New York: North Holland.

CHAPTER

10

Cointegration

ABSTRACT As the examples of the previous section amply illustrate, many economic time series can be characterized as being 1(1 ). But very often their linear combinations appear to be stationary. Those variables are said to be co integrated and the weights in the linear combination are called a co integrating vector. This chapter studies cointegrated systems, namely, vector 1(1) processes whose elements are cointegrated.

In the previous chapter, we have defined univariate 1(0) and 1(1) processes.

The first section of this chapter presents their generalization to vector processes and defines co integration for vector I ( 1) processes. Section I 0.2 presents alternative representations of a cointegrated system. Section I 0.3 examines in some detail a test of whether a vector 1(1) process is cointegrated. Techniques for estimating cointegrating vectors and making inferences about them are developed in Section I 0.4.

As an application, Section 10.5 takes up the money demand function. The log real money supply, the nominal interest rate, and the log real income appear to be 1(1), but the existence of a stable money demand function implies that those variables are cointegrated, with the coefficients in the money demand function forming a cointegrating vector. The techniques of Section I0.4 are utilized to estimate the cointegrating vector. Unlike in other chapters, we will not present results in a series of propositions, because the level of the subject matter is such that it is difficult to state assumptions without introducing further technical concepts. However, for technically oriented readers, we indicate in the text references where the assumptions are formally stated. A Note on Notation: Unlike in other chapters, the symbol "n" is used for the dimension of the vector processes, and as in the previous chapter, the symbol "T" is used for the sample size and "t" is the observation index. Also, if Yr is the t-th observation of ann-dimensional vector process, its j-th element will be denoted Yjr instead of Yrj. We make these changes to be consistent with the notation used in the majority of original papers on cointegration.

624

Chapter tO

10.1

Co Integrated Systems

Figure IO.l(a) plots the logs of quarterly real per capita aggregate personal disposable income and personal consumption expenditure for the United States from 1947:Ql to 1998:Ql. Both series have linear trends and-as will be verified in an example below- stochastic trends. However, these two 1(1) series tend to move together, suggesting that the difference is stationary. This is an example of cointegration. Cointegration relations abound in economics. In fact, many of the variables we have examined in the book are cointegrated: prices of the same commodity in two different countries (the difference should be stationary under the weaker version of Purchasing Power Parity), long- and short-term interest rates (even though they have trends, yield differential may be stationary), forward and spot exchange rates, and so forth. Cointegration may characterize more than two variables. For example, the existence of a stable money demand function implies that a linear combination of the log of real money stock. the nominal interest rate, and log aggregate income may be stationary even though each of the three variables is I (I). We start out this section by extending the definitions of univariate 1(0) and 1(1) processes of the previous chapter to vector processes. The concept of cointegration for multivariate I (I) processes will then be introduced formally. 1.4,----------------------------, 1.3

~

1 .1

.... ... --·"" ...

1.0 0.9

... -·--- .-

/ ~.,.-...--::=::-::::/--~ _..........

1.2

-

,·-----

.r;;.,...:.:.....................----.. ·····"

0.8 ~~::.···

0.7

06~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~-M~YWDM~Mron~~~~~M~MOO~M~~

- - log per capita income

year · · · · • log per capita consumption

Figure 10.1 (a): Log Income and Consumption

Linear Vector 1(0) and 1(1) Processes

Our definition of vector 1(0) and 1(1) processes follows Hamilton (1994) and Johansen (1995). To introduce the multivariate extension of a zero-mean linear

625

Cointegration

univariate !(0) process, consider a zero-mean n-dimensional VMA (vector moving average) process:

u, = lii(L) e, , W(L) (11xl)

== 1, + lii,L + lii 2 L 2 + · · ·,

(10,1,])

S2 , S2 positive definite.

(10.1.2)

(11xn) (11xl)

where e, is i.i.d, with E(e 1) = 0, E(e,e;) =

(11 XII)

(The error process e, could be a vector Martingale Difference Sequence satisfying certain properties, but we will not allow for this degree of generality,) As in the univariate case, we impose two restrictions. The first is one-summability: 1 {lilj) is one-summable,

(10,1.3)

Since {lilj) is absolutely summable when one-summability holds, the multivariate version of Proposition 6. I (d) means that u, is (strictly) stationary and ergodic. The second restriction on the zero-mean VMA process is Iii(!)

f=

(n x n matrix ofzeros).

0

(10.1.4)

(n XII)

That is, at least one element of then x n matrix Iii (I) is not zero. This is a natural multivariate extension of condition (9.2.3b) in the definition of univariate 1(0) processes on page 564. A (linear) zero-mean n-dimensional vector 1(0) process or a (linear) zeromean 1(0) system is a VMA process satisfying (IO.l.l)-(10.1.4). Given a zeromean 1(0) system u,, a (linear) vector 1(0) process or a (linear) 1(0) system can be written as ~

+u,,

where~

is ann-dimensional vector representing the mean of the stationary process. Using the formula (6.5.6) with (6.3.15), the long-run covariance matrix of the 1(0) system is lil(l)S21i1(1)'.

(10.1.5)

Since S2 is positive definite and Iii (I) satisfies (10.1.4 ), at least one of the diagonal 1A sequence of matrices {lit j J is said to be one-summable if Nj,kel is one-summable (i.e., L:~o jl\l!j,ke I < oo) for all k, f = I, 2, ... , n, where \l.lj,ke is the (k, f) element of then x n matrix ~ j.

626

Chapter 10

elements of this long-run variance matrix is positive, implying that at least one of the elements of an 1(0) system is individually 1(0) (i.e., 1(0) as a univariate process). We do not necessarily require \11(1) to be nonsingular. In fact, accommodating its singularity is the whole point of the theory of cointegration. Consequently, the long-run variance matrix, too, can be singular. Example 10.1 (A bivariate 1(0) system): Consider the following bivariate first-order VMA process:

{

u,, =Bit -

U21

=

fJ,/-1

+ Yf2,t-I'

or u, = e,

+ 'IIIer-h

(10.1.6)

B2r,

where

e,

=

[elt] E:2r

[-I

r]

0 0 .

'If, =

'

(10.1.7)

Since this is a finite-order VMA, the one-summability condition is trivially satisfied. Requirement (10.1.4) is satisfied because

\11(1)=12+"',=[

0 0

y]-~ 0. I (2x2)

(10.1.8)

So this example fits our definition of 1(0) systems. If y = 0, then the first element, u It• is actually I( -I) as a univariate process. The n-dimensionall(1) system y, associated with a zero-mean 1(0) system u, can be defined by the relation f),y, =

~

+ u,

0

(10.1.9)

So the mean of f),y, is ~- Since not every element of the associated zero-mean 1(0) process u, is required to be individually 1(0), some of the elements of the I (I) system y, may not be individually 1(1). 2 Substituting (10.1.1) into this, we obtain f),y, = li + \II(L)e,.

(10.1.10)

This is called the vector moving average (VMA) representation of an 1(1) 2Jn many textbooks and also in Engle and Granger (1987), all elements of an I (I) system are assumed to be individually 1(1), but that assumption is not needed for the development of cointegration theory. This point is emphasized in, e.g., Lt.itkepohl ( !993, pp. 352-353) and Johansen (1995, Chapter 3).

627

Cointegration

system. In levels, y, can be written as

y, =Yo+ 8 ·I+ U1 + U2 + · · · + u,,

(10.1.11)

or, written out in full, Yit

YI,O

Y2t

Y2.0

Ynt

Yn,O

OJ ·I

+

82 ·I

u1,2

+

+ Un,I

u2,2

+ .. ·+

Un,2

Regarding the initial value y0 , we assume either that it is a vector of fixed constants or that it is random but independent of e, for alit.

Example 10.2 (A bivariate 1(1) system): An I (I) system whose associated zero-mean 1(0) process is the bivariate process in Example 10.1 is

(10.1.12) where \11 1 is given in (10.1.7). In levels, Yit = YI,o +OJ ·t +(£It - fJ,o) + Y · (s2,o + £2,1 + · · · + f2,t-J), { Y2t = Y2.o + 02 ·I+ (£2,1 + £2,2 + · · · + f2t ).

(10.1.13) The second element of y, is I(I) as a univariate process. If y = 0, then the first element y 1, is trend stationary because the deviation from the trend function (y 1,o-fJ.o)+o 1·I is stationary (actually i.i.d.). Otherwise y 1, is 1(1). The Beverldge-Nelson Decomposition

For an I (I) system whose associated zero-mean I (0) process is u, satisfying (I 0.1.1 )-(I 0.1.4), it is easy to obtain the Beveridge-Nelson (BN) decomposition. Recall from Section 9.2 that the univariate BN decomposition is based on the rewriting of the MA filter as (9.2.5) on page 564. The obvious multivariate version

628

Chapter 10

of it is 00

\II(L) = \11(1) +(I - L)a(L), a(Ll = (nxn)

(n xn)

(n xn)

Clj

= -('i'j+l

I>ju. j=O

+ "'j+2 + ... ).

(10.1.14)

(n xn)

So the multivariate analogue of (9.2.6) on page 565 is

u, = \ll(l)e,

+ ~~- ~,_ 1 ,

~~ (n xI)

= a(L)e,.

(10.1.15)

Since \II(L) is one-summable, a(L) is absolutely summable and ~~ is a welldefined covariance-stationary process. Substituting this into (10.1.11), we obtain the multivariate version of the BN decompnsition:

y, =

~ ·t

+ \ll(l)(e, + e, + · · · + e,)

+~~+(yo- ~ 0 ).

(10.1.16)

As in the univariate case, the I (I) system y, is decomposed into a linear trend~· t, a stochastic trend \11(1)(£ 1 + e 2 + · · · + e,), a stationary component~,. and the initial condition Yo - ~ 0 • By construction, ~ 0 is a random variable. So unless Yo is perfectly correlated with ~ 0 • the initial condition is random. Example 10.2 (continued): For the bivariate I (I) system in Example 10.2, the matrix version of a(L) equals -\11 1 so that the stationary component~~ in the BN decomposition is

- [£" -0 ye,,] .

~~-

(10.1.17)

(It should be easy for you to verify this from the matrix version of (9.2.5).)

The two-dimensional stochastic trend is written as

\ll(l)(e 1 + e2 + · · · + e,) =

[~

Thus, the two-dimensional stochastic trend is driven by one common stochastic trend, L;~, e,_,. This is an example of the "common trend" representation of Stock and Watson (1993) (see Review Question 2 below).

629

Cointegration

Cointegration Defined

The 1(1) system y, is not stationary because it contains a stochastic trend W(l) · (e 1 + · · · + e,), and, if~ i 0, a deterministic trend~· 1. The deterministic and stochastic trends, however, may disappear if we take a suitable linear combination of the elements of y,. To pursue this idea, 3 premultiply both sides of the BN decomposition (I 0.1. I 6) by some conformable vector a to obtain a'y, =a'~· 1 + a'W(I)(e 1 + e 2 + · · · + e,) + a'q, + a'(y0

-

q0 ). (10.1.19)

If a satisfies a'W( I)= 0' ,

(10.1.20)

(1 xn)

then the stochastic trend is eliminated and a'y, becomes

"

a ' y, = a o · 1 +a 'q, +a '(Yo - q 0 ) .

(10.1.21)

Strictly speaking, this process is not trend stationary because the initial condition a' (Yo - q0 ) can be correlated with the subsequent values of a' q1 (see Review Question 3 below for an illustration). The process would be trend stationary if, for example, the initial value y0 were such that a' (Yo - q0 ) = 0. We are now ready to define cointegration. 4 Definition lO.l (Cointegration): Let y, be ann-dimensional I (I) system whose associated zero-mean 1(0) system u, satisfies (10.1.1)-(10.1.4). We say that y, is cointegrated with ann-dimensional cointegrating vector a if a i 0 and a'y, can be made trend stationary by a suitable choice of its initial value y0 . Defining cointegration in this way, although dictated by logical consistency, does not necessarily mean that the theory of cointegration requires that the initial condition Yo be chosen as indicated in the definition. The process in (10.1.21) is not stationary because q, and q0 are correlated. However, since q, = a.(L)e, and a.(L) is absolutely summable, q, and q 0 will become asymptotically independent as 1 increases. In this sense, the process in (10.1.21) is "asymptotically stationary," and asymptotic stationarity is all that is needed for estimation and inference for cointegrated I(I) systems. 5

3The idea was originally suggested by Granger (1981). See also Aoki (1968) and Box and Tiao (1977). 40ur definition i.~ the same as Definition 3.4 in Johansen ( 1995). 5The distinction between stationarity and asymptotic stationarity is discussed in Li.itkepohl (1993, pp. 348349)

630

Chapter 10

With cointegration thus defined, we can define the following related concepts. • (Cointegration rank) The cointegrating rank is the number of linearly independent cointegrating vectors, and the cointegrating space is the space spanned by the cointegrating vectors. As the preceding discussion shows, an n-dimensional vector a is a cointegrating vector if and only if (10.1.20) holds. Since there are h linearly independent such vectors if the cointegration rank is h, it follows that rank [ljl(l)] = n- h.

(10. 1.22)

(nxn)

Put differently, the cointegration rank h equals n - rank[ljl (I)]. • (Cointegration among subsets of y1 ) At least one of the elements of a cointegrating vector is not zero. Suppose, without loss of generality, that the first element of the cointegrating vector is not zero. Then we say that y 11 (the first element of y,) is cointegrated with Y21 (the remaining n - I elements of y,) or that y 11 is part of a cointegrating relationship. We can also define cointegration for a subset ofy 1 . For example, the n-1 variables in Y2 1 are not cointegrated if there does not exist an (n - I )-dimensional vector b oft 0 such that (I 0. 1.20) holds with a' = (0, b'). Therefore, y 21 is not cointegrated if and only if the last n- I rows of ljl(l) are linearly independent. • (Stochastic cointegration) Note that the deterministic trend a'8 ·tis not eliminated from a'y, unless the cointegrating vector also satisfies a'8 = 0.

(10.1.23)

In most applications, a cointegrating vector eliminates both the stochastic and deterministic trends in (10.1.19). That is, a vector a that satisfies (10.1.20) necessarily satisfies (I 0. 1.23), so that a'y, which can now be written as a'y, = a'q,

+ a'(yo- qo),

(I 0.1.24)

is stationary (not just trend stationary) for a suitable choice of y0 . This implies that 8 is a linear combination of the columns of ljl(l), so rank [8

: ljl(l)] = n- h.

(10.1.25)

(nx(n+l))

Unless otherwise noted, we will assume that (10.1.25) as well as (10.1.22) are

Cointegration

631

satisfied when the cointegration rank is h. If we wish to describe the case where a cointegrating vector eliminates the stochastic trend but not necessarily the detenninistic trend, we will use the term stochastic cointegration. Of course, ify, does not contain a deterministic trend (i.e., if 8 = 0), then there is no difference between cointegration and stochastic cointegration. Example 10.2 (continued): In the bivariate 1(1) system (10.1.12) in Example 10.2, the matrix >It (I) is given in (10.1.8), so

The rank of >It (I) is I, so the cointegration rank is I (= 2- 1). All cointegrating vectors can be written as (c, -cy)', c # 0. The assumption that the cointegration vector also eliminates the detenninistic trend can be written as c8, - cy8, = 0 or 81 = y8,, which implies that the rank of [8 : >It (I)] shown above is one. Quite a few implications for 1(1) systems follow from the definition of cointegration. For example, For an n -dimensional I(I) system, the cointegration rank cannot be as large as n. If it were n, then rank[W(l)] = 0 and >It (I) would have to be a matrix of zeros, which is ruled out by requirement (I 0.1.4).

• (h < n)

• (Implications of the positive definiteness of the long-run variance of L'>y1) The long-run variance matrix of L'>y, is given by >lt(l)Q>It(l)' (see (10.1.5)), which is positive definite if and only if W(l) is nonsingular. Therefore, y, is not cointegrated if and only if the long-run variance matrix of L'>y, is positive definite 6 Since the long-run variance of each element of L'>y, is positive if the long-run variance matrix of L'>y1 is positive definite, it follows that each element of y, is individually I (I) (i.e., I (I) as a univariate process) if y, is not cointegrated. The same is true for a subset of y,. For example, consider y2,, the last n - I elements of y,. The long-run variance matrix of L'>y 2, is given by >it 2 (l)Q>it 2 (1)' where >it 2 (1) is the last n - I rows of >it (1). It is positive definite if and only if the rows of W2 (1) are linearly independent or y 2, is not cointegrated. In particular, each element of y2 , is individually I (I) if y 2, is not cointegrated.

6This equivalence is specific to linear processes. In general, the positive definiteness of the long-run variance matrix of !:::.yz is sufficient, but not always necessary, for y1 to be not cointegrated. See Phillips (1986, p. 321).

632

Chapter 10

o

Suppose that y 11 is cointegrated with y,,. Then y 2, is not cointegrated if h = l and cointegrated if h > l. The reason is as follows. Cointegration of y 1, with y 2 , implies that there exists a cointegrating vector, call it a 1, whose first element is not zero. lf h = 1, then there should not exist an (n - ])-dimensional vector b of 0 such that a~ = (0, b') is a cointegrating vector, because a 1 and a, are linearly independent. A vector such as a2 can be found if h > 1.

o

(VAR in first differences?) If y, is difference stationary (without drift), it is

tempting to model it as a stationary finite-order VAR (L)!!.y, = e, where (L) is a matrix polynomial of degree p such that all the roots of l(z) I = 0 are outside the unit circle. But then y, cannot be cointegrated. The reason is as follows. lf (L) satisfies the stationarity condition, the coefficient matrix sequence {'llj] of its inverse 'li(L) = (L)- 1 is bounded by a geometrically declining sequence (see Section 6.3). So 'li(L) is one-summable and t.y, 'li(L)e, is a VMA process satisfying (10.!.2) and (10.1.3). Furthermore, 1'110)1

= 1<1>(1)- 1 1= 1<1>(1)1-l

of 0.

So 'II ( 1) is nonsingular and the long-run variance matrix of t.y, is positive definite.

QUESTIONS FOR REVIEW

1. Let (yJ 1 , y,,) be a' in Example 10.2 withy = 0. Show that {Ytr + Y21 } is 1(1). Hint: You need to show that the long-run variance of { !!.y 11 + !!.y 21 ) is positive. 2. (Stock and Watson (1988) common-trend representation) e2 + · · · + e 1 so that the BN decomposition is

Let w,

= e1 +

A result from linear algebra states that if C is ann x n matrix of rank n - h, then there exists ann x n nonsingular matrix G and an n x (n - h) matrix F of full column rank such that

CG=[ F :

(nxn)(nxn)

(nx(n-h))

o].

(nxh)

Show that the permanent component '11(l)w1 can be written as Fr,, where F is ann x (n- h) matrix of full column rank and r, is an (n- h)-dimensional

633

Cointegration

random walk with Var(I'>T 1 ) positive definite. Hint: lJt(l)w1 = lJt(l)GG- 1 w1 . Let T1 be the first n - h elements of G - I w 1 . This representation makes clear that an I (I) system with a cointegrating rank of h has n- h common stochastic trends. 3. (a'y 1 is not quite stationary) To show that the process in (1 0.1.21) is not quite stationary, consider the simple case where q1 = e1 - e,_,, 8 = 0, and Yo = 0. Verify that

Cov(q" q0 ) =

252

fort= 0,

0

fort= 0,

-Q

fort = 1, Var(a'y,) =

6a'Qa

fort = 1,

0

fort>l,

4a'Qa

fort>l,

where Q = Var(e 1 ). Verify that (a'y1 } is stationary fort = 2. 3, .... 4. (What if some elements are stationary?) Let y, be an I( 1) system and suppose that the first element of y, is stationary. Show that the cointegrating rank is at least I. 5. (Matrix of cointegrating vectors) Suppose that the cointegration rank of an n dimensional I (I) system y1 is h and let A be an n x h matrix collecting h linearly independent cointegrating vectors. Let F be any h x h nonsingular matrix. Show that columns of AF are h linearly independent cointegrating vectors. Hint: Multiplication by a nonsingular matrix does not alter rank.

10.2

Alternalive Representations of Cointegraled Systems

Jn addition to the common trend representation (see Review Question 2 of the previous section), there are three other useful representations of cointegrated vector 1(1) processes: the triangular representation of Phillips (1991), the VAR representation, and the VECM (vector error-correction model) of Davidson et al. (1978). This section introduces these representations. Phillips's Triangular Representation This representation is convenient for the purpose of estimating cointegrating vectors. The representation is valid for any cointegration rank, but we initially assume that the cointegrating rank h is I. Let a be a cointegrating vector, and suppose, without loss of generality, that the first element of a is not zero (if it is zero, change

634

Chapter 10

the ordering of the variables of the system so that the first element after reordering is not zero). Partition y, accordingly:

y,

=

(nxl)

Y11

Y21

]

[ ((n-l)xl)

(10.2.1)

.

Thus, y 1, is cointegrated with Y2 1. Since a scalar multiple of a cointegrating vector, too, is a cointegrating vector, we can normalize the cointegrating vector a so that its first element is unity:

(I 0.2.2)

a=

We have seen in the previous section that if a cointegrating vector a eliminates

not only the stochastic trend (i.e., a'ljl(l) = 0') but also the deterministic trend (i.e., a'8 = 0), then a'y, can be written as (10.1.24). Setting the a in (10.1.24) to the cointegrating vector in (I 0.2.2), we obtain

Yll = Y 'Y21

+ z,• + M.

(10.2.3)

where (10.2.4) Since q1 is jointly stationary, z~ is stationary. This equation, with z~ viewed as an error term and M as an intercept, is called a cointegrating regression. Its regression coefficients y, relating the permanent component in y 11 to those in y2, can be interpreted as describing the long-run relationship between y 11 and y 2,. The cointegrating regression will have the trend term a' 8 · t as an additional regressor if the cointegrating vector does not eliminate the deterministic trend in a'y,. The triangular representation is an n -equation system consisting of this cointegrating regression and the la't n - I rows of (10.1.9): D.y2, = 82

+ n2,

=

82 ((n-l)xl)

+

lji2(L)

e, ,

(I 0.2.5)

((n-l)xn) (nxl)

where 82 and n 2, are the last n - I elements of the n-dimensional vectors 8 and n,, respectively, and lji 2(L) is the last n - I rows of the lji(L) in the VMA representation (10.1.10). The implication of the h = I assumption is that, as noted in

Cointegration

635

the previous section, y 2, is not cointegrated (it would be cointegrated if h > I). In particular, each element of y2, is individually I (I). More generally, consider the case where the cointegration rank h is not necessarily I. By changing the order of the elements ofy, if necessary, it is possible (see, e.g., Hamilton, 1994, pp. 576-577) to select h linearly independent cointegrating vectors, a 1, a 2 , ... , ah, such that

A

(n xh)

(I 0.2.6)

=[a,

Partition y, conformably as

=

y, (n xI)

y,,

(hx 1)

l

(10.2.7)

Y21 [ ((n-h)xl)

Multiplying both sides of the BN decomposition (I 0.1.16) by this A' and noting that A''i'(l) = 0 (since the columns of A are cointegrating vectors), A'~ = 0 (if those cointegrating vectors also eliminate the deterministic trend), and A'y 1 = y 1, - f'y 2,, we obtain the following h cointegrating regressions:

y,,

r'

=

(hxl)

Y21

+

(hx(n-h))((n-h)xl)

/L

+ z,*

(hxl)

(I 0.2.8)

,

(hxl)

where /L = A'(y0 - ~ 0 ) and z; =A'~,. Since~~ is jointly stationary, so is z;. The triangular representation is these h cointegrating regressions, supplemented by the rest of the VMA representation: ~'>Y2t ((n-h)xl)

=

~2 ((n-h)xl)

+

U21

~2

((n-h)xl)

((n-h)xl)

+

W2(L)

e, .

(I 0.2.9)

((n-h)xn) (nxl)

Here, 'it 2 (L) is the last n-h rows of the 'it (L) in the VMA representation (I 0.1.1 0). It is easy to show (see Review Question I) that y 2, is not cointegrated. To illustrate the triangular representation and how it can be derived from the VMA representation, consider Example 10.3 (From VMA to triangular): In the bivariate system (I 0. 1.12) of Example I 0.2, the cointegration rank is 1. The cointegrating vector whose first element is unity is (1, -y)'. So the and f.l in (10.2.3) are

z;

636

Chapter 10

z; = (1, -y)~, =

e,,- ye2,.

11- = (1, -y)(yo- ~o) = (Yt.o- YY2.ol- (eJ,O- Yt:2,o), (10.2.10)

and the triangular representation is

Ytt = 11- + YY2t

+ (t:It -

yt:2,),

( 10.2.11)

{ L'>y2, =82+t:21· VAR and Cointegration

For the stationary case, we found it useful and convenient to model a vector process as a finite-order VAR. Although, as seen above, no cointegrated I (I) system can be represented as a finite-order VARin first differences, some cointegrated systems may admit a finite-order VAR representation in levels. So suppose a cointegrated 1(1) system y, can be written as y, (n x 1)

= ct

+ 8 · t + ~1 •

(L)~,

(L) =In- 1L- 2L 2

-

= e,,

· · ·-

PLP

(10.2.12)

(nxn)

For later reference, we eliminate

~

from this to obtain

y, = ct' + 8' · t + tYr-I + 2Yr-2 + · · · + rYr-r + e,,

(10.2.13)

where ct' (nxl)

= (t

+24>2 + ... + pp)8 + (l)ct,

8' (nxl)

= 4>(1)8.

(10.2.14)

How do we know that this finite-order VARin levels is a cointegrated I (I) system? An obvious way to find out is to derive the VMA representation, L'>~, = lji(L)e,, from the VAR representation, (L)~, = e,, and see if lji(L) satisfies the definition of a cointegrated system. The derivation is a bit tricky because the VMA representation is in first differences, but it can be done fairly easily as follows. Taking the first difference of both sides of (L)~, and (1 - L)(L)=(L)(l - L), we obtain (L)L'>~ 1 =(I-

= e, and noting that L'> = 1- L

L)e,.

(10.2.15)

637

Cointegration

Substitute

1'>~ 1

= lf!(L)e, into this to obtain (L)lf!(L)e, = (1 - L)e 1 •

(10.2.16)

This equation has to hold for any realization of e 1 , so (L)lf!(L) = (1 - L)ln.

(10.2.17)

This can be solved for >1! (L) by multiplying both sides from the left by (L)- 1, the inverse of (L). 7 This produces: >1! (L) = (L)- 1(1 - L). The question is, under what conditions on (L) is this lfl(L) one-summable and rank[lfl(l)] = n- h? We can easily derive a necessary condition. Setting L = I in (10.2.17), we obtain (l)lfl(l)= 0 (1 0.2.18) (nxn)

(nxn)

(nxn)

Since the rank of lf!(l) equals n -h when the cointegration rank of~~ ish, the rauk of <1>(1) is at most h. As shown below, the rank is actually h. To state a necessary and sufficient condition, let U(L) and V(L) ben x n matrix lag polynomials with all their roots outside the unit circle and let M(L) be a matrix polynomial satisfying (1 - L)In-h M(L) =

[

(hx(~--h))

(1 0.2.19)

That is, the first n -h diagonal elements of the diagonal matrix M(L) are 1-L and the remaining h diagonal elements are unity. A necessary and sufficient condition for a finite-order VAR process {~ 1 } following (L)~, = e, to be a cointegrated 1(1) system with rank his that (L) can be factored as (L) = U(L)M(L)V(L) 8 Therefore, all the roots of l(z)l = 0 are on or outside the unit circle and those that are on the unit circle are all unit roots (z = 1). It is not sufficient that (L) has n - h unit roots with the other roots outside the unit circle; see an example in Review Question 4 where (z) has two unit roots and one root outside the unit circle yet the system is not 1(1). Then - h unit roots have to be located in the system in the particular way indicated by the factorization (z) = U(z)M(z)V(z). Setting z = I in this factorization, we obtain (I) = U( l)M(l )V (1 ). Since the roots of U(z) and V(z) are all outside the unit circle, U(l) and V (1) are nonsingular 7The inverse exists because + :=: 1 is nonsingular; see Section 6.3. Since we are not assuming the station11 0 arity condition for +(L), the inverse filter may not be absolutely summable. 8This result is an implication of the lemma due to Sam Yoo, cited in Engle and Yoo ( 1991). See Watson (1994, pp. 2870-2873) for an accessible exposition.

638

Chapter 10

and the rank of (l) equals the rank of M(l ), which ish. That is, rank[ (!)]= h.

(10.2.20)

Then it follows from basic linear algebra that there exist two n x h matrices of full column rank, B and A, such that (l) = (nxn)

B

A' .

(10.2.21)

(nxh) (hxn)

This is sometimes called the reduced rank condition. The choice of B and A is not unique; if F is an h x h nonsingular matrix, then B(F') -I and AF in place of B and A also satisfy (l 0.2.21 ). Substituting (l 0.2.21) into ( 10.2.18), we obtain BA'ljl(l) = 0. Since B is offull column rank, this equation implies A'ljl(l) = 0. So the columns of the n x h matrix A are cointegrating vectors. The Vector Error-Correction Model (VECM)

For the univariate case, we derived the augmented autoregression (9.4.3) on page 586 from an AR equation. The same idea can be applied to the VAR here. It is a matter of simple algebra to show that the VAR representation (L)~, = e, m (10.2.12) can be written equivalently as

where

= ... , +

p

... , + ... + p.

(10.2.23)

(nxn)

= -(Hl + H2 + · · · + p)

~'

for s

= I, 2, ... , p-I.

(10.2.24)

(nxn)

~r-l

from both sides of (10.2.22) and noting that p - In = -(In ,- ,-···- p) = -<1>(1), we can rewrite (10.2.22) as

Subtracting

L'>~, = -(l)~,-1 + ~~L'>~r-1 + ~,t-~r-2 + .. · + ~p-ll'>~r-p+l + e,

= -BA'~r-1 + ~~L'>~r-1 + ~,L'>~r-2 + · .. + ~p-lL'>~r-p+l + e,. (10.2.25)

Using the relation y, = L'>y, =a'+~·

· t-

=a'+~·· t -

a+~

·t

+~"this

can be translated into an equation in y,:

(l) Yr-1 + ~ 1L'>y,_, +···+~p-I L'>Yr-p+l + e, (10.2.26)

B z,_, + (nxh)(hxl)

~1

L'>y,_, + · · · + ~p-JL'>Yr-p+l + e,, (10.2.27)

639

Cointegratton

where a* and 8' are as in ( !0.2.14) and z, is given by Zr (hxl)

=

A' Yr ·

(10.2.28)

(hxn)(nxl)

Since, as noted above, the n x h matrix A collects h cointegrating vectors, z, is trend stationary (with a suitable choice of the initial value y0 ) 9 This representation is called the vector error-correction model (VECM) or the error-correction representation. If it were not for the term Bz,. 1 in the VECM, the process, expressed as a VAR in first differences, could not be cointegrated. The VECM accommodates h cointegrating relationships by including h linear combinations of levels of the variables. If there are no time trends in the cointegrating relations, that is, if A'8 = 0, then 8* = cfl(l)8 = BA'8 = 0. So the VAR and the VECM representations do not involve time trends despite the possible existence of time trends in the elements ofy,. That the same I(O) process has the VMA, VAR, and VECM representations is known as the Granger Representation Theorem. Example 10.4 (From VMA to VARIVECM): In the previous example, we derived the triangular representation from the VMA representation (I 0. I .12). In this example, we derive the VAR and VECM representations from the same VMA representation. For the '11 (L) in ( !0.1.12), it is easy to verify that ( !0.2.17) is satisfied with (10.2.29) So the VMA can be represented by a finite-order VAR. For this VAR, we have (I 0.2.30)

which can be written as (!0.2.21) with

9 Just in case you are wondering why Yo is relevant. It is true that you can solve (10.2.27) for z as 1 Zt-1

= (B'B)- 1B'[a:* +a*. t - !:::.yt

+ s1!:::.Yr-1 + ... + Sp-t!:::.Yr-p+l + etl.

So you might think that z1 is trend stationary regardless of the choice of y0 . This is not true, strictly speaking, because, unlike in the VMA representation, !:::.y1 in the VARNECM representation is not defined for t -1, -2, .... Only with a judicious choice of Yo can one make !:::.y1 (t I, 2.... ) stationary. This point is mentioned in Johansen (1995, Theorem 4.2).

=

=

640

Chapter 10

B= [

~]

(10.2.31)

and A= [ _;] (for example).

By (10.2.27) and (10.2.28), the VECM for this choice ofB and A is [

~~::] = [y82+~:- ya2] + [8' -t2] .1_ [~] 1+e,, 21 _

z, =

A'y, = Ylt- YY2t·

(10.2.32)

y82,that is, if the cointegrating

The deterministic trend disappears if 8 1 = vector also eliminates the deterministic trend.

Johansen's ML Procedure

In closing this section, having introduced the VECM representation, we here provide an executive summary of the maximum likelihood (ML) estimation of cointegrated systems proposed by Johansen (1988). Go back to the VAR representation (10.2.13). For simplicity, we assume that there is no trend in the cointegrating relations, so that 8* = 0. 10 We have considered the conditional ML estimation (conditional on the initial values of y) of a VARin Section 8.7. If the error vector e1 is jointly normal N(O, Q) and if we have a sample ofT+ p observations (Y-r+I, y -r+2· ... , YT ), then the (average) log likelihood of (y 1 , ... , YT) conditional on the initial values (Y-r+l, Y-r+2• ... , Yo) is given by QT(II) =-

~ log(2rr) + ~ log(IQ-

T

1

1)-

~ L{(y,- II'x,)'Q- 1(y,- II'x,)l. 2 t=l

(10.2.33) where n is the dimension of the system (not the sample size), II= (II, Q), and I

n

'- [a* =

(nxl)

y,_, ,

<1>2

(nxn)

(n xn)

...

p ]

(n xn)

'

x,

Yt-2

. (10.2.34)

((l+np)xl)

Yr-p

1°For a thorough treatment of time trends in the VECM, see Johansen (1995, Sections 5.7 and 6,2).

641

Cointegration

Let

So = -4>(1) =-[In- (4>, + · · · + 4>p)].

(I 0.2.35)

(nxn)

Then the reduced rank condition is that 1; 0 = -BA' and the VECM (10.2.26) can be written as

t;.y, =

a'

+ I; oYt-1 + I; 1t;.y,_, + · · · +I; p-J t;.yt-p+l +

e,.

(10.2.36)

Equations (10.2.35) and (I 0.2.24) provide a one-to-one mapping between (4> 1 , ••• , 4>p) and (1; 0 , •.. , Sp- 1). Thus the same average log likelihood Qr(IJ) can be rewritten in terms of the VECM parameters as

n -:zlog(2rr)

l

T

+

I _, I '""'{ _,_ ' _, _,_ :zlog(IQ I)- T L.... (t;.y,- II x,) Q (t;.y,- II x,) , 2

t;:;;:: I

(10.2.37) where I

-' = [ (nxl) a' n

So

(nxn)

1;, (nxn)

.. · sr-'] (nxn)

Yt-1

x,

t;.y,_,

. (10.2.38)

• ((l+np)xl)

The ML estimate of the VECM parameters is the (a', I; 0 , ... , I; p-J, Q) that maximizes this objective function, subject to the reduced rank constraint that 1; 0 = -BA' for some n x h full rank matrices A and B. This constraint accommodates h cointegrating relationships on the 1(1) system. Obviously, the maximized log likelihood is higher the higher the assumed cointegration rank h. Thus, we can use the likelihood ratio statistic to test the null hypothesis that h = ho against the alternative hypothesis of more cointegration. Unlike in the stationary VAR case, the limiting distribution of the likelihood ratio statistic is nonstandard. Given the cointegration rank h thus determined, the ML estimate of h cointcgrating vectors can be obtained as the estimate of A. For a more detailed exposition of this procedure, see Hamilton (1994, Chapter 20) and also Johansen (1995, Chapter 6).

642

Chapter 10

QUESTIONS FOR REVIEW

1. (Error vector in the triangular representation) Consider the error vector (z~, u2 1) in the triangular representation (10.2.8) and (10.2.9). It can be written as

z~

(hxl)

[

U2r

l

_

-

''''(L) _ 'l" Er =

[ l w;(L) (hxn)

Er.

(10.2.39)

W2(L)

{(n-h)xl)

((n-h)xn)

Here,

w;(L) = (1,

-r') a(L),

(hxn)

(I 0.2.40)

(nxn)

where a(L) is from the BN decomposition and 'i' 2 (L) is the last n- h rows of 'i'(L) in the VMA representation (a) Show that 'i' 2 ( I) is of full row rank (i.e., the n-h rows of 'i' 2 (1) are linearly independent). Hint: Suppose, contrary to the claim, that there exists an (n-h)-dimensionalvectorb f' Osuchthatb''i' 2 (1) = 0'. Showthat(O', b') would be an (n-dimensional) cointegrating vector and the cointegration rank would be at least h + I. (b) Verify that 'it'(L) is absolutely summable. (c) Write 'i''(L) ='it~+ w;L + w;L 2 + · · ·. For the bivariate process of Example I 0.3, verify that 'it~ is not diagonal. (Therefore, even if the elements of e, are uncorrelated, z~ and u 2, can be correlated.) 2. (From triangular to VMA representations) For Example 10.3, start from the triangular representation (10.2.1 I) and recover the VMA representation. Hint: Take the first difference of the cointegrating regression. 3. (An alternative decomposition of 4>(1)) For the bivariate system of Example I 0.4, verify that B = (y,O)', A= (ljy, -I)'

is an alternative decomposition of 4>(1). Write down the VECM corresponding to this choice of B and A.

643

Cointegration

4. (A trivariate VAR that is 1(2)) Consider a trivariate VAR given by

Ylr = 2y,,,_,- YJ.r-2 + '" (i.e., 1'. 2 y,, Y2r = PY2,r-J + e2r with IPI < I, and

= e 1,),

!

YJt = E:Jr ·

Verify that this is an 1(2) system by writing 1'. 2 y, as a vector moving average process. Write down (L) for this system and show that (z) has three roots, two unit roots and one that is outside the unit circle.

10.3 Testing the Null of No Cointegration Having introduced the notion of cointegration, we need to deal with two issues. The first is how to determine the cointegration rank, and the second is how to estimate and do inference on cointegrating vectors. The first will be discussed in this section, and the second will be the topic of the next section. There are several procedures for determining the cointegration rank. Among them are Johansen's (1988) likelihood ratio test derived from the maximum likelihood estimation of the VECM (briefly covered at the end of previous section) and the common trend procedure of Stock and Watson (1988). These procedures allow us to test the null of h = h 0 where h 0 is some arbitrary integer between 0 and n - I. We will not cover these procedures.'' Here, we cover only the simple test suggested by Engle and Granger (1987) and extended by Phillips and Ouliaris (1990). In that test, the null hypothesis is that h = 0 (no cointegration) and the alternative is that h > I. Spurious Regressions

The test of Engle and Granger ( 1987) is based on OLS estimation of the regression

y,, =

f.I

+ y'Y2r + z;.

(10.3.1)

where Yit is the first element of y,, Y2r is the vector of the remaining n - I elements, and

z; is an error term. This regression would be a cointegrating regression

if h = I and Y1t were part of the cointegrating relationship. Under the null of h = 0 (no cointegration), however, this regression does not represent a cointegrating relationship. Let ({j, y) be the OLS coefficient estimates of (JI, y ). It turns out that y 11 See Maddala and Kim (1998, Section 7) for a catalogue of available procedures.

Chapter 10

644

does not provide consistent estimates of any population parameters of the system' For example, even if Yir is unrelated to Y2r (in that Ll.y 11 and Ll.y2, are independent for all s, 1), the 1- and F -statistics associated with the OLS estimates become arbitrarily large as the sample size increases, giving a false impression that there is a close connection between y 11 and y2,. This phenomenon, called the spurious regression, was first discovered in Monte Carlo experiments by Granger and Newbold (1974). Phillips (1986) theoretically derived the large-sample distributions of the statistics for spurious regressions. For example, the 1-value, if divided by ft. converges to a nondegenerate distribution. The Residual-Based Test tor Cointegration

The regression (10.3.1) nevertheless provides a useful device for testing the null of no cointegration, because the OLS residuals, y 11 - fi- f/y 2,, should appear to have a stochastic trend if y, is not cointegrated and be stationary otherwise. Engle and Granger (1987) suggested applying the ADF 1-test to the residuals in order to test the null of no cointegration. Because of the use of the residuals, the test is called the residual-based test for cointegration. The asymptotic distributions of the test statistic for some leading unit-root tests were derived theoretically and the (asymptotic) critical values tabulated by Phillips and Ouliaris (1990) and Hansen (1992a). In contrast to the univariate unit-root tests, the asymptotic distributions (and hence the asymptotic critical values) depend on the dimension n of the system. This is because the residuals depend on ({i, y), which, being estimates based on data, are random variables. Here we indicate the appropriate asymptotic critical values when the unit-root test applied to the residuals is the ADF t-test of Proposition 9.6 for autoregressions without a constant or time. That is, the ADF 1-statistic is the 1value on the x,_ 1 coefficient in the following augmented autoregression estimated on residuals: Ll.x, = (p- l).t,_ 1 +

~1

Ll.x,_ 1 + · · · +

~PLI.x,_P

+error,

(10.3.2)

where x, here is the residual from regression (10.3.1). There is no need to include a constant in this augmented autoregression because, with the regression already including a constant, the sample mean of the residuals is guaranteed to be zero. There is no need to include time either, because the variables of the regression (10.3.1) implicitly or explicitly include time trends (see below for more on this). The number of lagged changes, p, needs to be increased to infinity with the sample

645

Cointegration

size T, but at a rate slower than T 113 More precisely, 12 p ---> oo but

~13

T

---> 0 as T ---> oo. P

(10.3.3)

The critical values are the same if Phillips' Z1-test (see Analytical Exercise 6 of the previous chapter) is used instead of the ADF t. There are three cases to consider. I. E(L'>y21) = 0 and E(L'>y1 1) = 0, so no elements of the 1(1) system have drift. The appropriate critical values are in Table IO.I(a), which reproduces Table lib of Phillips and Ouliaris (1990). For a statement of the conditions on the VMA representation under which the asymptotic distribution is derived, see Hamilton (1994, Proposition 19.4).

2. E(L'>y21) i" 0 but E(~'>Yir) may or may not be zero. This case was discussed by Hansen (1992a). Let g (= n - I) be the number of regressors besides a constant in the regression (I 0.3.1 ). Some of the g regressors have drift. Since linear trends from different regressors can be combined into one, 13 the regression (10.3.1) can be rewritten a' a regression of y 11 on a constant, g- I I( I) regressors without drift, and one I (I) regressor with drift. Now, since linear trends dominate stochastic trends (in the sense made precise in the discussion of the BN decomposition in the previous chapter), the 1(1) regressor with a trend behaves very much like time in large samples. So the residuals are "asymptotically the same" as the residuals from a regression with a constant, g - I driftless 1(1) variables, and time as regressors, in the sense that the limiting distribution of a statistic based on the residuals from the former regression is the same as that from the latter regression. The critical values for the ADF t test based on the latter regression are tabulated in Table lie of Phillips and Ouliaris (1990). Therefore, to find the appropriate critical value when the regression (I 0.3.1) has a constant and g regressors but not time, turn to this table for g - I regressors. If the regression has only one regressor besides the constant (i.e., if g = 1), then the regression is asymptotically equivalent to a regression of y 11 on a constant and time. For this case, the limiting distribution of the ADF t-statistic calculated from the residuals turns out to be the Dickey-Fuller distribution (D Fr') of Proposition 9.8. Table IO.l(b) combines these distributions: for the case of one 1(1) 12see Phillips and Ouliaris (1990, Theorem 4.2). The lag length p can be a random variable because it can be data-dependent. This condition is the same as in the Said-Dickey extension of the ADF tests in the previous chapter (see Section 9.4), except that p here can be a random variable. 13 For example, let n == 3 so that there are two regressors, Y2l and Y3t with coefficients Yl and Y2· respectively. If 82 and 83 are the drifts in Y2t and Y3r• respectively, then the linear trends in regression (10.3.1) are Y18zt and J128Jt, which can he combined into a single time trend (y182 + n83)t.

Chapter to

646 Table 10.1: Critical Values for the ADF t-Statistic Applied to Residuals Estimated Regression: Ytr = 1-' + y'Y2r Number of regressors, excluding constant (a) Regressors have no drift I

2 3 4 5

1%

2.5%

5%

10%

-3.96 -4.31 -4.73 -5.07 -5.28

-3.64 -4.02 -4.37 -4.71 -4.98

-3.37 -3.77 -4.11 -4.45 -4.71

-3.07 -3.45 -3.83 -4.16 -4.43

-3.67 -4.07 -4.39 -4.77 -5.02

-3.41 -3.80 -4.16 -4.49 -4.74

-3.13 -3.52 -3.84 -4.20 -4.46

(b) Some regressors have drift

2 3 4 5

-3.96 -4.36 -4.65 -5.04 -5.36

For panel (a), Phillips and Ouliaris (1990, Table lib). For panel (b), the first row is from Fuller ( 1996, Table IO.A.2), and the other rows are from Phillips and Ouliaris (1990, Table lie). SOURCE:

regressor (g = I), it shows the ADF t' -distribution, and for the case of g (> I) regressors, it shows the critical values from Table lie of Phillips and Ouliaris (1990) for g - I regressors. For example, if g = 2, the 5 percent critical value is -3.80, which is the 5 percent critical value in Table lie in Phillips and Ouliaris (1990) for one regressor. 3. This leaves the case where E(C>y2,) = 0 and E(C>Ytrl 'I 0. Since Ytr has drift and Y2r does not, we need to include time in the regression (I 0.3.1) in order to remove a linear trend from the residuals. The discussion for the previous case makes it clear that the ADF 1-statistic calculated using the residuals from a regression of y 1, on a constant, g driftless I (I) regressors y 2,, and time has the asymptotic distribution tabulated in Table I 0.1 (b) for g + I regressors (or Table lie of Phillips and Ouliaris, 1990, for g regressors). For example, if g = 2, the regressor (I 0.3.1) has a constant, two I (I) regressors, and time; the critical values can be found from Table IO.l(b) for three regressors.

Cointegration

647

Several comments are in order about the residual-based test for cointegration. • (Test consistency) The alternative hypothesis is that y, is cointegrated (i.e., h ::0: l ). The test is consistent against the alternative as long as Y1t is cointegrated with y 2,. 14 The reason (we will discuss it in more detail below for the case with h = l) is that the OLS residuals from the regression (10.3.1) will converge to a stationary process. However, if y 11 is I(l) and not part of the cointegration relationship, then the test may have no power against the alternative of cointegration because the OLS residuals, y 11 - y'y 21 , with a nonzero coefficient (of unity) on the 1(1) variable y 11 , will not converge to a stationary process. Thus, the choice of normalization (of which variable should be used as the dependent variable) matters for the consistency of the test. • (Should time be included in the regression?) If time is included in the regression (10.3.1), then the drift in y 11 , E(t:.y 11 ), affects only the time coefficient, making the numerical value of the residuals (and hence the ADF t-value) invariant to E(t:.y 11 ). This means that the case 3 procedure, which adds time to the regressors, can be used for case l, where E(t:.y1 1 ) happens to be zero. That is, if you include time in the regression (l 0.3. 1), then the appropriate critical value for case I is given from Table IO.I(b) for g +I regressors. The procedure is also valid for case 2, because if time is included in addition to a constant and g I( l) regressors with drift, then the regression can be rewritten as a regression with a constant, g driftless l(l) regressors, and time that combines the drifts in the g 1(1) regressors. This regression falls under case 3 and the critical values provided by Table 10.1 (b) for g + I regressors apply. Therefore, when time is included in the regression (10.3.1), the same critical values can be used, regardless of the location of drifts. A possible disadvantage is reduced power in finite samples. The finite-sample power is indeed lower at least for the DGPs examined by Hansen (1992a) in his simulations. • (Choice of lag length) The requirement (10.3.3) does not provide a practical rule in finite samples for selecting the lag length pin the augmented autoregression to be estimated on the residuals. There seems to be no work in the context of the residual-based test comparable to that of Ng and Perron ( 1995) for univariate I( l) processes. The usual practice is to proceed as in the univariate context, which is to use the Akaike information criterion (AIC) or the Bayesian information criterion (BIC), also called the Schwartz information criterion (SIC), to determine the number of lagged changes. t4For the case of h = I, this is an implication of Theorem 5.1 of Phillips and Ouliaris (1990). Their remark (d) to this theorem shows that the test is consistent when h > I and hence Y2r is cointegrated.

648

Chapter 10

• (Finite-sample considerations) Besides the residual-based ADF 1-test, a number of tests are available for testing the null of no cointegration. They include tests proposed by Phillips and Ouliaris (1990), Stock and Watson (1988), and Johansen ( 1988). Haug ( 1996) reports a recent Monte Carlo study examining the finite-sample penormance of these and other tests. It reveals a tradeoff between power and size distortions (that is, tests with the least size distortion tend to have low power). The residual-based ADF t-test with the lag length chosen by AIC, although less powenul than some other tests, has the least size distortion for the DGPs examined. The following example applies the residual-based test with the ADF !-statistic to the consumption-income relationship.

Example 10.5 (Are consumption and income cointegrated?): As was already mentioned in connection with Figure I 0.1 (a), log income (y1 ) and log consumption (c1 ) appear to be cointegrated with a cointegrating vector of (I, -I). This figure, however, is rather deceiving. The plot of y, - c1 (which is the log of the saving rate) in Figure I0.1 (b) shows an upward drift right after the war and a downward drift since the mid 1980s. (The latter is the well-publicized fact that the U.S. personal saving rate has been declining.) As seen below, the test results depend on whether to include these periods or not We initially focus on the sample period of 1950:Ql to 1986:Q4 (148 obervations ). We first test whether the two series are individually 1(1), by conducting the ADF t-test with a constant and time trend in the augmented autoregression. In applying the BIC to select the number of lagged changes in the augmented augoregression, we follow the same practice of the previous chapter: the maximum length (Pmaxl is [ 12 · (T I I 00) 114] (the integer part of [ 12· (T I I00) 114 ]), the sample period is fixed at t = Pmax + 2, Pmax +3, ... , T in the process of choosing the lag length p, and, given p, the maximum sample oft = p + 2, p + 3, ... , T is used to estimate the augmented autoregression with p lagged changes. For the present sample size, Pmax is 13. For disposable income, the BIC selects the lag length of 0 and the ADF !-statistic (t') is -1.80. The 5 percent critical value from Table 9.2 (c) is -3.41, so we accept the hypothesis that y, is I(!). For consumption, the lag length by the BIC is I and the ADF t statistic is -2.07. So we accept the I (I) null at 5 percent. Thus, both series might well be described as I( I) with drift Turning to the residual-based test for cointegration, the OLS estimate of the static regression is

649

Cointegration

0.07 0.06 0.05 0.04 0.03 0.02 0.01 0.00

47

52

57

62

67

72

77

82

87

92

97

Figure 10.1 (b): Log Income Minus Log Consumption

c, = -0.046 (0.009)

+

0.975 y,, R 2 = 0.998, (0.0037)

t =

1950:Ql to 1986:Q4. (10.3.4)

To conduct the residual-based ADF t test, an augmented autoregression without constant or time is estimated on the residuals with the lag length of 0 selected by the BIC. The ADF t statistic is -5.49. Since the series have time trends, we turn to Table I 0.1 (b), rather than Table I 0.1 (a), to find critical values. For g = I, the 5 percent critical value is -3.41, so we can reject the null of no cointegration.

If the residual-based test is conducted on the entire sample of 47:Q I to 98:QI, the ADF t statistic is -2.94 with the lag length of I determined by BIC, and thus, we cannot reject the hypothesis that the consumption-onincome regression is spurious! Testing the Null of Cointegration

In the above test, cointegration is taken as the alternative hypothesis rather than the null. But very often in economics the hypothesis of economic interest is whether the variables in question are cointegrated, so it would be desirable to develop tests where the null hypothesis is that h = I rather than that h = 0. Very recently, several tests of the null of cointegration have been proposed. For a catalogue of such tests, see Maddala and Kim ( 1998, Section 4.5). As was true in the testing of

650

Chapter 10

the null that a univariate process is 1(0), there is no single test of cointegration used by the majority of researchers yet. 15 QUESTION FOR REVIEW

1. For the residual-based test for cointegration with the regression (10.3.1) not including time as a regressor, verify that there is very little difference in the critical value between case I and case 2.

10.4

Inference on Cointegrating Vectors

In the previous section, we examined whether an 1(1) system is cointegrated. In this section, we assume that the system is known to be cointegrated and that the cointegration rank and the associated triangular representation are known. 16 Our interest is to estimate the cointegrating vectors in the triangular representation and to make inference about them. For the most part, we focus on the special case where the cointegration rank is I; the general case is briefly discussed at the end of the section. The SOLS Estimator

For the h = I case, the triangular representation is Yit {

+ y'y2, + z; = ~2 + u2,

= J.L

D.y2, ((n-1) xI)

z;

(1

[

X

1)

u2,

l

= IJI'(L)

e, ,

(10.4.1)

(nxn) (nxl)

((n-l)xl)

where y 2, is not cointegrated. (This is just reproducing (10.2.3) and (10.2.5) with (10.2.39) for h = 1.) So the regression (10.3.1) is now the cointegrating regression. Therefore, there exists a unique (n- I)-dimensional vector y such that y 1, - ji'y2 , is a stationary process (z;l plus some time-invariant random variable (J.L) when y = y and has a stochastic trend when y ~ y. This suggests that the OLS estimate 15 The procedures of Johansen (1988) and Stock and Watson (1988) cannot be used for testing the null of cointegration against the alternative of no cointegration, because in their tests the alternative hypothesis specifies more cointegration than the null.

16 In practice, we rarely have such knowledge, and the decision to entertain a system with h cointegrating relationships is usually based on the outcome of prior tests (such as those mentioned in the previous section) designed for determining the cointegration rank. This creates a pretest problem, because the distribution of the estimated cointegrating vectors does not take into account the uncertainty about the cointegration rank. This issue has not been studied extensively.

651

Cointegration

(/l, ji) of the coefficient vector, which minimizes the sum of squared residuals,

is consistent. In fact, as was shown by Phillips and Durlauf (1986) and Stock (1987) for the case where ~ 2 in (10.4.1) (= E(Ll.y2,)) is zero and by Hansen (1992a, Theorem l(b)) for the ~ 2 i' 0 case, the OLS estimate ji is superconsistent for the cointegrating vector y, with the sampling error converging to 0 at a rate faster than the usual rate of ft. 17 Also, the R2 converges to unity. 18 The OLS estimator of y from the cointegrating regression will be referred to as the "static" OLS (SOLS) estimator of the cointegrating vector. Since the SOLS estimator is consistent, the residuals converge to a zero-mean stationary process. Thus, if a univariate unitroot test such as the ADF test is applied to the residuals, the test will reject the 1(1) null in large samples. This is why the residual-based test of the previous section is consistent against cointegration. This fact- that the OLS coefficient estimates are consistent when the regressors are I (I) and not co integrated- is a remarkable result, in sharp contrast to the case of stationary regressors. To appreciate the contrast, remember from Chapter 3 what it takes to obtain a consistent estimator of y when the regressors y2, are stationary: for the OLS estimator to be consistent, the error term has to be uncorrelated with y2,; otherwise we need instrumental variables for y2,. In contrast, if y 1, is cointegrated with y2, and if the I (I) regressors Y2 1 are not cointegrated, as here, then we do not have to worry about the simultaneity bias, at least in large samples, even though the error term and the 1(1) regressors are correlated. 19 In finite samples, however, the bias of the SOLS estimator (the difference between the expected value of the estimator and the true value) can be large, as noted by Banerjee et al. (1986) and Stock (1987). Another shortcoming of SOLS is that the asymptotic distribution of the associated t value depends on nuisance parameters (which are the coefficients in IJI' ( L) in (I 0.4.1 )), so it is difficult to do inference. Later in this section we will introduce another estimator (to be referred to as the "DOLS" estimator) which is efficient and whose associated test statistics (such as the t- and Wald statistics) have conventional asymptotic distributions.

z;

z;

17 The speed of convergence is T if liz = 0, and T 312 if liz ;;f. 0. All the studies cited here assume that 11 is a fixed constant. The same conclusion should hold even if 11 is treated as random. As mentioned in Section 10.1 (see the paragraph right below (10.1.21)), strictly speaking, II+ zj is not stationary even if zj is stationary. However, since 11 and zi are asymptotically independent as t --+ oo, 11 + z7 is asymptotically stationary, which is all that is needed for the asymptotic results here. 18 For a statement of the conditions under which these results hold. see Hamilton (1994, Proposition 19.2). Those conditions are restrictions on "IJI"*(L)er in (10.4.1). 19 tf Y2t is cointegrated, then h ~ 2 and the regression (10.3.1)-with n - I regressors-is no longer a cointegrating regression (a cointegrating regression should haven- h regressors, see the triangular representation (10.2.B)). This case can be handled by the general methodology presented by Sims, Stock, and Watson (1990). See Hamilton (1994, Chapter 18) or Watson (1994, Section 2) for an accessible exposition of the methodology.

652

Chapter 10

The Bivariate Example

To be clear about the source and nature of the correlation between the error term and the I (I) regressors and also to pave the way for the introduction of the DOLS estimator, consider the bivariate version of (I 0.4.1 ). For the time being, assume 'il'(L) = 'il~ (so there is no serial correlation in (z;. u 2,)), 82 = 0, and /L = 0. The triangular representation can be written as Ytt = Y Y2t {

~Y2t

+ z;

(10.4.2)

= Uz,.

The error term z; and the 1(1) regressor yz 1 in the cointegrating regression in (10.4.2) can be correlated because Cov(y,,, z;l = Cov(_v,,o + L'>yz,I = =

+ L'>yz,2 + ·· · + t>y,,, z;l Cov(t>y,,I + t>y,,, + · · · + t>y,,, z;) (since Cov(yz,o. z;) = Cov(u 2 , 1 + u 2 , 2 + · · · + u 2,, z;) (since £>y2, = u 2,)

= Cov(u 2,, z;l

(since (u 2,, z;l is i.i.d.).

0)

(I 0.4.3)

z;

Since 'il~ and Q are not restricted to be diagonal, and u 2, can be contemporaneously correlated. To isolate this possible correlation, consider the least squares projection of -, on a constant and u 2, (= t>y2,). Recall from Section 2.9 that E (y I I, x) = [E(y)- f3o E(x)] + f3ox where f3o = Cov(x, y)/ Var(x) and that the least squares projection error, y- E'(y I I, x), has mean zero and is uncorrelated with x. Here, the mean is zero for both z; and u,,. So E'(z; I I, uz 1 ) = f3 0u,, and, if we denote the least squares projection error by = E'(z; I I, u 2, ), we have

z;

v, z; -

z; = f3ou,,

+ v,

= f3ol>y,,

+ v,,

E(v,) = 0, Cov(u 2,, v,) = Cov(t>y,,, v,) = 0.

(10.4.4)

Substituting this into the cointegrating regression, we obtain Ytt = yy,,

+ f3ot>y,, + v,.

(10.4.5)

This regression will be referred to as the augmented cointegrating regression. Now we show that the I (I) regressor y2, is strictly exogenous in that Cov(y2,, v,) = 0 for all t, s. By construction, Cov(t>y2 ,, v,) = 0. For Cov(t>y2, v,) fort oft s, Cov(t>y 2,, v,) = Cov(u 2, v,) = Cov(u 2 , z;- f3ou 2,)

= Cov(u 2,, z;)- f3oCov(u 2,, u2,) = 0,

(I 0.4.6)

Cointegration

653

because (z;, u2,) is i.i.d. So L'.yz, is strictly exogenous. The strict exogeneity of L'.y 2, implies the same for y2, because Cov(yz,, v,) = Cov(Y2.o + L'.y2.1 = Cov(L'.y2,1 = 0

+ L'.yz,z + · · · + L'.y2, v,) + L'.y2,2 + ·· · + L'.yz,, v,) (since Cov(yz,o, v,) =

for all t. s by (10.4.6).

0)

(10.4.7)

Continuing with the Bivariate Example

To recapitulate, we have shown that, in the augmented cointegrating regression (10.4.5), the I (I) regressor is strictly exogenous if 'i''(L) = '11 0. Let (9, fj;,) be the OLS coefficient estimate of (y, fJo) from the augmented cointegrating regression. (This 9 should be distinguished from the SOLS estimator of y in the cointegrating regression in (10.4.2).) This regression, (10.4.5), is very similar to the augmented autoregression (9.4.6) in that one of the two regressors is zero-mean 1(0) and the other is driftless I (I). We have shown for the augmented autoregression that the "X'X" matrix, if properly scaled by T and ft. is asymptotically diagonal, so the existence of 1(0) regressors can be ignored for the purpose of deriving the limiting distribution of 9, the OLS estimator of the coefficient of the I(I) regressor. The same is true here, and the same argument exploiting the asymptotic diagonality of the suitably scaled "X'X" matrix shows that the usual 1-value for the hypothesis that y = y0 is asymptotically equivalent to I

I=

T

T I~ Y21VI a2

r

'

(10.4.8)

/z,E(yz,)2

aJ

v

where is the variance of 1 • That is, the difference between the usual t-value and i in (10.4.8) converges to zero in probability, so the limiting distribution oft and that of i are the same. In the augmented autoregression used for the ADF test, the limiting distribution of the ADF !-statistic was nonstandard (it is the Dickey-Fuller /-distribution). In contrast, the asymptotic distribution of i (and hence that of the usual t-va!ue) is standard nonnal. To see why, first recall that the 1(1) regressor y2, is strictly exogenous in that Cov(y2, v,) = 0 for all I, s. To develop intuition, temporarily assume that (z;, u 2,) is jointly normal. Then y2, and v, are independent for all (s, 1), not just uncorrelated. Consequently, the distribution of v, conditional on

(Y2.I, Y2.2 •... , Yzr) is the same as its unconditional distribution, which is N(O, a;).

654

Chapter 10

So the distribution of the numerator of (10.4.8) conditional on (y2 • 1 , Yz,z, ... , Y2T) is

But the standard deviation of this normal distribution equals the denominator of -t, so

t I (Yz.I, Yz.z, ... , Yzr)

~ N(O, l).

Since this conditional distribution oft does not depend on (y2•1 , y2 ,2 , •.. , y 2 r ), the unconditional distribution of i', too, is N(O, 1). (This type of argument is not new to you; see the paragraph containing ( 1.4.3) of Chapter l.) Recall from Chapter 2 that, even if the normality assumption is dropped, the distribution of the usual t-value is standard normal, albeit asymptotically, in large samples. The same conclusion is true here: the limiting distribution oft is N(O, l) even if (z;, u 2,) is not jointly normal. Proving this requires an argument different from the type used in Chapter 2, because y 21 here has a stochastic trend. We will not give a formal proof here; just an executive summary. It consists of two parts. The first is to derive the limiting distribution of the numerator of (10.4.8) (which can be done using a result stated in, e.g., Proposition 18.1 (e) of Hamilton, 1994, or Lemma 2.3(c) of Watson, 1994). This distribution is nonstandard. The second is to show that normalization (10.4.8) converts the nonstandard distribution into a normal distribution (Lemma 5.1 of Park and Phillips, 1988). This example of a bivariate l(l) system is special in several respects: (a) there is no serial correlation in the error process (z;, u 21 ), (b) the l(l) regressor y2, is a scalar, (c) y 2, has no drift, and (d) p. = 0. Of these, relaxing (a) requires some thoughts. Allowing for Serial Correlation

as in (10.4.2), then (z;. u 21 ) is serially correlated. Consequently, the l(l) regressor y 21 in the augmented cointegrating regression (10.4.5) is no longer strictly exogenous because Cov(L'.y 2,, v1), while still zero fort = s, is no longer guaranteed to be zero for r i' s. To remove this correlation, consider the least squares projection of on the current, past, and future values of u 2 , not just on the current value of uz. Noting that L'.y21 = u21, the projection can If it is not the case that l[t'(L) =

l[t~

z;

655

Cointegration

be written as 00

z; = {J(L)"-Yzr + v,,

fJ(L)

=

L

fJjLi

(10.4.9)

j=-oo

(This v, is different from the v, in (10.4.5).) By construction, E(v,) = 0 and Cov("-y2 ., v,) = 0 for all 1, s. The two-sided filter {J(L) can be of infinite order, but suppose- for now- that it is not and {Jj = 0 for lj I > p. Thus,

+ fJ-1 "-Yz.r+l + · ·· + fJ-p"-Y2,t+p + fJ1"-Yz.r-1 + · · · + fJp"-Yz.r-p + v,.

z: = fJo"-Yzr

(10.4.10)

Substituting this into the cointegrating regression in (10.4.2), we obtain the (vastly) augmented cointegrating regression

+ fJo"-Yzr + fJ-1 "-Y2.r+l + ·· · + fJ-p"-Yz.r+p + fJ1"-Yz.r-l + ·· · + fJP"-Yz.r-p + v,.

Ylr = YYzr

(10.4.11)

Since Cov(;,.y2, v,) = 0 for all 1 and s, ;,.y 2, is strictly exogenous, which means (as seen above) that the level regressor Y2r too is strictly exogenous. Thus, by including not just the current change but also past and future changes of the I (I) regressor in the augmented cointegrating regression, we are able to maintain the strict exogeneity of Yzr· The OLS estimator of the cointegrating vector y based on this augmented cointegrating regression is referred to as the "dynamic'' OLS (DOLS), to distinguish it from the SOLS estimator based on the cointegrating regression without changes in y 2 .

There are 2 + 2p regressors, the first of which is driftless 1(1) and the rest zero-mean 1(0). As before, with suitable scaling by T and ~. the I (I) regressor is asymptotically uncorrelated with the 2p + I zero-mean 1(0) regressors, so that the suitably scaled X'X matrix is again block diagonal and the zero-mean 1(0) regressors can be ignored for the purpose of deriving the limiting distribution of the DOLS estimate of y. Therefore, the expression for the random variable that is asymptotically equivalent to the 1-value for y = y0 is again given by (10.4.8), the expression derived for the case where (z;, Uzr) is serially uncorrelated. However, when (z;, u 2,) is serially correlated, the derivation of the asymptotic distribution of (10.4.8) is not the same as when (z;, uz,) is serially uncorrelated, because the error term v, can now be serially correlated; the projection (10.4.9), while eliminating the correlation between "-yz, and v, for all s, I, does not remove its own serial correlation in v,. To examine the asymptotic distribution,

Chapter 10

656

A;

let V (T x T) be the autocovariance matrix ofT successive values of v, and be the long-run variance of v,. Also, let X in the rest of this subsection be the T x I matrix (y2 , 1, y 2 , 2 , .•. , y 2r )'. Again, to develop intuition, temporarily assume that (z;, u 2 ,) are jointly normal. Then, since y 2, is strictly exogenous, the distribution of the numerator of (10.4.8) conditional on X is normal with mean zero and the conditional variance I ' T 2 XVX.

(10.4.12)

The square root of this would have to replace the denominator of (10.4.8) to make the ratio standard normal in finite samples. Fortunately, it turns out (see Corollary 2. 7 of Phillips, 1988) that, in large samples, the distribution of the numerator is the same as the distribution you would get if v, were serially uncorrelated but with the rather than a variance of Therefore, all that is required to modify the ratio (10.4.8) to accommodate Var(v,)) by (the long-run variance of serial correlation in is to replace namely, if tis given by (I 0.4.8) with still in the denominator, its rescaled value

A;

J.

v,

v, ),

a;(=

A;

a;

(10.4.13) is asymptotically N (0, I)! Since the OLS t -value for y is asymptotically equivalent to t, it follows that

·t-: N(O, (;._) A,

1),

(10.4.14)

-

where Av is some consistent estimator of Av and s is the usual OLS standard error

of regression (it is easy to show that s is consistent for a,). Put differently, if we rescale the usual standard error of y (the OLS estimate of the y 2, coefficient in (10.4.11)) as rescaled standard error= (

~s")

x usual standard error,

(10.4.15)

then the r-value based on this rescaled standard error is asymptotically N(O, 1). The foregoing argument is applicable only to the coefficient of the I(I) regressor, so the t -values for fJ's in (I 0.4.11 ), even when their standard errors are rescaled as just described, are not necessarily N(O, I) in large samples.

657

Cointegration

Since the regressors are strictly exogenous, the discussion in Sections 1.6 and 6.7 suggests that the GLS might be applicable. However, the fact noted above that X'VX behaves asymptotically like A,X'X implies that the GLS estimator of y (to be referred to as the DGLS estimator) is asymptotically equivalent to the DOLS estimator (or more precisely, T times the difference converges to 0 in probability) (see Phillips, 1988, and Phillips and Park, 1988). Therefore, there is no efficiency gain from correcting for the serial correlation in the error term v1 by GLS. A consistent estimate of A, is easy to obtain. Recall that in Section 6.6 we used the VARHAC procedure to estimate the long-run variance matrix of a vector process. The same procedure can be applied to the present case of a scalar error from the augmented process. Consider fitting an AR(p) process to the residuals, cointegrating regression

v,,

v,

=
+ t/>zv1_2 + ·· · +
(t = p

+ l, ... , T).

(10.4.16)

(This p should not be confused with the p in the augmented cointegrating regression.) Use the BlC to pick the lag length by searching over the possible lag lengths ofO, l, ... , T 113 • Using the relation (6.2.21) on page 385 and noting that the longrun variance is the value at z = l of the autocovariance-generating function, the long-run variance A~ of v1 can be estimated by (10.4.17)

where J,j (j = l, 2, ... , p) are the estimated AR(p) coefficients and residual from the AR(p) equation (10.4.16).

e,

is the

General Case

We now tum to the general ca•e of an n-dimensional cointegrated system with possibly nonzero drift. We focus on the h = l case because extending it to the case where h > l is straightforward. So the triangular representation is (10.4.1) and the augmented cointegrating regression is

+ Y Y2t + fJ~llY2t + fJ~1 llY2.t+l + · · · + fJ~pllY2.t+p + p;llY2.t-! + · ·· + {J~llY2.1-p + v,. (10.4.18)

Ylt = JL

1

Here, {Jj (j = 0, l, 2, ... , p, -1, -2, ... , - p) are the least squares projection coefficients in the projection of (the error in the cointegrating regression) on the current, past, and future values of !ly2. The DOLS estimator of the cointegrating

z;

658

Chapter 10

vector y is simply the OLS estimator of y from this augmented cointegrating regression.

The following results are proved in Saikkonen (1991) and Stock and Watson (1993). 20 1. The bivariate results derived above carry over. That is, the DOLS estimator of y is superconsistent and the properly rescaled r- and Wald statistics for hypotheses about y have the conventional asymptotic distributions (standard normal and chi squared). The proper rescaling is to multiply the usual r-value by (s/A.,) and the Wald statistics by the square of (s/X,). This simple rescaling, however, does not work for the r-value and the Wald statistic for hypotheses involving J.L or Pj.

z;

2. If the two-sided filter fJ(L) in the projection of on the leads and lags of Ll.y2, is infinite order, then the error term v, includes the truncation remainder, 00

00

j=p+I

j=p+I

All the results just mentioned above carry over, provided that p in (10.4.18) is made to increase with the sample si7£ T at a rate slower than T 113 See Saikkonen (1991, Section 4) for a statement of required regularity conditions. 3. The estimator is efficient in some precise sense 21 The estimator is asymptotically equivalent to other efficient estimators such as Johansen's maximum likelihood procedure based on the VECM, the Fully Modified estimator of Phillips and Hansen (1990), and a nonlinear least squares estimator of Phillips and Loretan (1991) (and also to the DGLS, as mentioned above). For all these efficient estimators, the t- and Wald statistics for hypotheses about y, correctly rescaled if necessary, have standard asymptotic distributions (see Watson, 1994, Section 3.4.3, for more details). Other Estimators and Finite-Sample Properties

Stock and Watson (1993) examined the finite-sample performance of these estimators just mentioned (excluding the Phillips-Loretan estimator). For the DGPs and the sample sizes (T = 100 and 300) they examined, the following results emerged. 20 These authors assume that J.L is a fixed constant. However, the same conclusions would hold even if J.L were a time-invariant random variable. 21 The r-value is asymptotically N(O, I), but the DOLS estimator itself is not asymptotically normal. So the usual criterion of comparing the variances of the asymptotic distributions among asymptotically normal estimators is not applicable here. See Saikkonen ( 1991, Section 3) for an appropriate definition of efficiency.

659

Cointegration

(l) The bias is smallest for the Johansen estimator and largest for SOLS. (2) However, the variance of the Johansen estimator is much larger than those for the other efficient estimators, and DOLS has the smallest root mean square error (RMSE). (3) For all estimators examined, the distribution of the 1-values (correctly rescaled if necessary) tends to be spread out relative to N(O, 1), suggesting that the null will be rejected too often. These conclusions are broadly consistent with the assessment of Monte Carlo studies found in Maddala and Kim (1998, Section 5.7). QUESTION FOR REVIEW

1. As we have emphasized, there is a similarity between augmented autoregression (9.4.6) on page 587 and augmented cointegrating regression (10.4.5). Then why is the 1-value on the l(l) regressor in the augmented cointegrating regression (10.4.5) asymptotically N(O, I) while that in (9.4.6) is not?

10.5

Application: The Demand tor Money in the United Stales

The literature on the estimation of the money demand function is very large (see Goldfeld and Sichel, 1990, for a review). The money demand equation typically estimated in the literature is m, - p, = 1L + YyYt

+ YRR, + z;,

(10.5.1)

where m, is the log of money stock in period t, p 1 is the log of price level, y, is log income, R1 is the nominal interest rate, and z; is some error term. Yy is the income elasticity and YR is the interest semielasticity of money demand (it is a semielasticity because the interest rate enters the money demand equation in levels, not in logs). Most empirical analyses that predate the literature on unit roots and cointegration suffer from two drawbacks. First, as will be verified below, all the variables involved, m, - p,, y,, and R,, appear to contain stochastic trends. If the variables have trends, conventional inference under the stationarity assumption is not valid. Second, the regressors may be endogenous. This is likely to be a serious problem in the case of money demand, because in virtually all macro models a shift in money demand represented by the error term affects the nominal interest rate and perhaps income. A resolution of both problems is provided by the econo-

z;

metric technique for estimating cointegrating regressions presented in the previous section.

660

Chapter 10

Another issue addressed in this section is the stability of the money demand function. The consensus in the literature is that money demand is unstable. This is challenged by Lucas (1988) who, using annual data since 1900, argues that the interest semielasticity is stable if the income elasticity is constrained to be unity. We will examine Lucas' claim by applying the state-of-the-art econometric techniques developed in this chapter. The Data

We base our analysis on the annual data studied by Lucas (1988), extended by Stock and Watson ( 1993) to cover 1900-1989. Their measures of the money stock, income, and the nominal interest rate are Ml, net national product (NNP), and the six-month commercial paper rate (in percent at an annual rate), respectively. The use of net national product-rather than gross national product-follows the tradition since Friedman (1959) that the scale variable (y,) in the money demand equation should be a measure of wealth rather than a measure of transaction volume. There are no official statistics on Ml that date back to as early as 1900, so one needs to do some splicing on series from multiple sources. For the money stock and the interest rate, monthly data were averaged to obtain annual observations. See Appendix B of Stock and Watson (1993) for more details on data construction. (m- p, y, R) as a Cointegrated System

Figures 10.2 and 10.3 plot m - p, y, and R. The three series have clear trends. Log real money stock (m - p) grew rapidly over the first half of the century, but experienced almost no growth thereafter until 1981. Log income (y), on the other hand, grew steadily over the entire sample period with a major interruption from 1930 to the early 1940s, so Ml velocity (y- (m- p)), also plotted in Figure 10.3, dropped during that period and then grew steadily until 1981. This movement in Ml velocity is fairly closely followed by the nominal interest rate, which suggests that m - p, y, and R are cointegrated, with a cointegrating vector whose element corresponding to y is about I. Inspection of these figures suggests that the three variables, m - p, y, and R, might be well described as being individually 1(1). In fact, Stock and Watson (1993), based on the ADF tests, report that m- p andy are individually I (I) with drift, while R is I (I) with no drift. They also report, based on the Stock-Watson (1988) common trend test, that the three-variable system is cointegrated with a cointegrating rank of I. In what follows, we proceed under the assumption that m - p is cointegrated with y and R. (In the empirical exercise, you will be asked to check the validity of this premise.)

. -. -·

-·-.

--. -...

... .. -- ...

.·--

..

-·-· - . . . -- ..

.. .

-A,./'_ _j

1920

1910

1940

1930

- ... -

Real M1

1960

1950

1970

1980

Net National Product

Figure 10.2: U.S. Real NNP and Real M1, in Logs

2.0,-----------------------------------------------~--· 15

,.

1.8 1.6

•

1.2 1.0 0.8

.. ... . '•. .... . ' ·. .-..· . -. \: ·,' . ~.

0.6

••

~

.

0.4

I'

-- ......... -

0.2 o.ok---~----~----_J

1900

.... '• "

1910

1920

_____J____

1930

1940

M1 Velocity

11

.....

...

1.4

13

..

.. .... .... ·.

·'

9 7 5

.. . •

3

·:

1 ~

_____ L_ _ _ __ L_ _ _ __ L_ _ _ _

1960 1970 -- - · - Com. Paper Rate 1950

~

-1

1980

Figure 10.3: Log M1 Velocity (left scale) and Short-Term Commercial Paper Rate (right scale)

662

Chapter 10

DOLS

The first line of Table 10.2 reports the SOLS estimate of the cointegrating regression (10.5.1). (The sample period is 1903~1987 because in the DOLS estimation below two lead and lagged changes will be included in the augmented cointegrating regression.) The standard errors are not reported because the asymptotic distribution of the associated t-ratio, being dependent on nuisance parameters, is unknown. The SOLS estimate of the income elasticity of 0.943 is close to unity, but we cannot tell whether it is insignificantly different from unity. To be able to do inference, we tum to the DOLS estimation of the cointegrating vector, which is to estimate the parameters (JL, Yy· YR) in the cointegrating regression (I 0.5.1) by adding !>y, l>R, and their leads and lags. Following Stock and Watson (1993), the number of leads and lags is (arbitrarily) chosen to be 2, so the augmented cointegrating regression associated with (I 0.5. I) is mt- Pt

+ YyYr + YRRr + {Jyol>Yr + fJy.-1 L'>Yr+l + {Jy,-2L'>Yr+2 + fJv1 L'>Yr-1 + {Jyzl>Yr-2 + fJRol>R, + fJR.~l!>Rr+l + fJR.~zl>Rr+2 + fJRIL'>Rr~l + fJRzl>Rr-2 + v,.

= JL

(10.5.2) This dictates the sample period to bet = 1903, ... , 1987. The DOLS point estimate of (yy. YR). reported in the second line of Table 10.2, is obtained from estimating this equation by OLS. To calculate appropriate standard errors, the long-run variance of the error term v, needs to be estimated. For this purpose we fit an autoregressive process to the DOLS residuals. The order of the autoregression is (again arbitrarily) set to 2. With the DOLS residuals calculated fort = 1903, ... , 1987, the sample period for the AR(2) estimation is fort = 1905, ... , 1987 (sample size= 83). The estimated autoregression is

ii,

= 0.93806 iir-1- 0.13341

ii,~2.

SSR

= 0.19843.

(10.5.3)

We then use the formula (I 0.4.17), but to be able to replicate the DOLS estimates by Stock and Watson (1993, Table Ill), we divide the SSR by T - p - K rather than by T - p to calculate a"/:, where K here is the number of regressors in the augmented cointegrating regression (which is 13 here). Thus a"/:= 0.19843/(832 -13) = 0.0029181 and

).2 = 0.0029181 = 0.07647. 2 ' (1- 0.93806 + 0.13341)

(10.5.4)

663

Colntegration

Table 10.2: SOLS and DOLS Estimates, 1903-1987

Yy SOLS estimates DOLS estimates (Rescaled standard errors)

YR

R2

Std. error of regression

0.943

-0.082

0.959

0.135

0.970 (0.046)

-0.101 (0.013)

0.982

0.096

SOURCE: The point estimates and standard errors are from Table III of Stock and Watson ( 1993). The R2 and standard error of regression are by author's calculation.

Table 10.3: Money Demand in Two Subsamples 1903-1945

y, SOLS estimates DOLS estimates (Rescaled standard errors)

YR

1946-1987

Yy

YR

0.919

-0.085

0.192

-0.016

0.887 (0. I 97)

-0.104 (0.038)

0.269 (0.213)

-0.027 (0.025)

SouRCE: Table III of Stock and Watson (1993).

The estimate 5., is the square root of this, which is 0.277. Since the standard error of the DOLS regression is 0.096, the factor in the formula (10.4.15) for rescaling the usual OLS standard errors from the augmented cointegrating regression is 0.277/0.096 = 2.89. The numbers in parentheses in Table 10.2 are the adjusted standard errors thus calculated. Now we can see that the estimated income elasticity is not significantly different from unity. Unstable Money Demand?

Table 10.3 reports the parameter estimates by SOLS and DOLS for two subsamples, 1903-1945 and 1946-1987. (Following Stock and Watson (1993), the break date of 1946 was chosen both because of the natural break at the end of World War 11 and because it divides the full sample nearly in two.) In sharp contrast to the estimates based on the first-half of the sample, the postwar estimates are very different from those based on the full sample. Lucas (1988) argues that estimates like these are consistent with a stable demand for money with a unitary income elasticity. To support this view, he points

664

Chapter 10

O.Or------------------------, -0.2 -0.4 >- -0.6

~ 0

-0.8

'!/

-1.0

Q)

!"

'!/

£

-1.2

•. . . .

-1.4

-1.6 -1.8 -2,0

~· .....

L__

0

__jL__

2

• until1945

. .... .

.

__[_ __ l_ __ L_ __ L_ ____L__ ___L__

4

6 8 10 commercial paper rate .& since

12

14

__j

16

1946

Figure 10A: Plot of m - p - y against R

to the evidence depicted in Figure lOA. It plots m,- p, -y, (the log inverse velocity, which would be the dependent variable in the money demand equation (10.5.1) if the income elasticity Yy were set to unity) against the interest rate, with the postwar observations indicated by a different set of symbols from the prewar observations. The striking feature is that, although postwar interest rates are substantially higher, postwar observations lie on the line defined by the prewar observations 22 This finding suggests that the postwar estimates suffer from the near multicollinearity problem: because of the similar trends in y and R in the postwar period (shown in Figures 102 and 103), Yy and YR estimated on postwar data are strongly correlated. Even though for each element the postwar estimate of (yy. YR) is different from the prewar estimate, they may not be jointly significantly different This possibility could be easily checked by the Chow test of structural change, at least if the regressors were stationary. In the Chow test, you would estimate

where D, is a dummy variable whose value is 1 if t > 1946 and 0 otherwise. Thus, the coefficients of the constant, y,, and R, are (/1-, Yy· YR) until 1945 and (/1- + o0, Yy + oy. YR +oR) thereafter. If the regressors are stationary, the Wald statistic for the null hypothesis that o0 = 0, oy = 0, oR = 0 is asymptotically x 2 (3) as both subsamples grow larger with the relative size held constant (see 22Jn Lucas' ( 1988) original plot, the sample period is 1900--1985 and the break date is 1957, but the message of the figure is essentially the same.

665

Cointegration

Analytical Exercise 12 in Chapter 2). Now in the present case where regressors are I(I) and not cointegrated, it has been shown by Hansen (1992b, Theorem 3(a)) that the Wald statistic (properly rescaled if necessary) has the same asymptotic distribution (of x2 ) if the parameters of the cointegration regression are estimated by an efficient method such as DOLS (which adds !;.y,, !;.R,, and their leads and Jags to the cointegrating regression (I 0.5.5)), the Fully Modified estimation, or the Johansen ML. 23 The DOLS estimates of (10.5.5) augmented with two leads and two Jags of !;,.y and /;,.R are shown in Table 10.4. The insignificant Wald statistic supports Lucas' view that the money demand in the United States has been stable in the twentieth century. PROBLEM SET FOR CHAPTER 1 0 - - - - - - - - - - - EMPIRICAL EXERCISES

(Skim Stock and Watson, 1993, Section 7 and Appendix B, before answering.) The file MPYR.ASC has annual data on the following: Column Column Column Column

1: 2: 3: 4:

natural log ofMI (to be referred to as m) natural log of the NNP price deflator (p) natural Jog of NNP (y) the commercial paper rate in percent at an annual rate (R).

The sample period is 1900 to 1989. This is the same data used by Stock and Watson (1993) in their study of the U.S. money demand. See their Appendix B for data sources.

Questions (a) and (b) are for verifying the presumption that m - pis cointegrated with (y, R). The rest of the questions are about estimating the cointegrating vector. (a) (Univariate unit-root tests) Stock and Watson (1993, Appendix B) report that the ADF t~' and t' statistics, computed with 2 and 4 lagged first differences, fail to reject a unit root in each of m- p and Rat the 10 percent level and that the unit root hypothesis is not rejected for y with 4 lags, but is rejected at the 10 percent (but not 5 percent) level with 2 lags. Verify these findings. In your tests, use the t~' test for R, and t' form - p andy, and use asymptotic critical values (those critical values for T = oo ). 23 We noted in the previous section that the Wald statistic is not asymptotically chi squared even after rescaling if the hypothesis involves 11. Hansen's (1992b) result, however, shows that the Wald statistic for the hypothesis about the difference in J.L is asymptoticaJly chi squared. Hansen (1992b) also develop.~ tests for structural change at unknown break dates.

Table 10.4: Test for Structural Change Wald statistic DOLS estimates (Rescaled standard errors)

0.925 (0.142)

-0.090 (0.026)

1.36 (0.72)

-0.52 (0.31)

SoURCE: Author's calculation, to be verified in the empirical exercise.

0.048

1.85

(0.034)

(p-value = 0.60)

Cointegration

667

(b) (Residual-based tests) Conduct the Engle-Granger test of the null that m - p is not cointegrated withy and R. Set p = I (which is what is selected by BIC). (If you do not include time in the regression, the ADF t -statistic derived from the residuals should be about -4.7. Case 2 discussed in Section 10.3 should apply. So the 5 percent critical value should be - 3.80.) (c) (DOLS) Reproduce the SOLS and DOLS estimates of the cointegrating vector reported in Table 10.2. (d) (Chow test) Reproduce the DOLS estimates reported in Table 10.4. They are based on the following augmented cointegrating regression: m,- Pr =M+~~+~~+~~+~~~+8R~~

+ fJyo!>,y, + fJy.-1/>,Yt+I + f3y,-2;.,Yt+2 + fJyi;.,Yt-1 + fJy,;.,Yt-2 + fJRot>,R, + fJR,-1/>,R,+I + fJR,-2/>,Rt+2 + fJRit>,R,_I + fJR2t>,R,_, + v,. The null hypothesis is that 8o = 8y = 8R = 0. Set the pin (10.4.16) to 2. Calculate the Wald statistic for the null hypothesis of no structural change.

References Aoki, Masanao, 1968, "Control of Large-Scale Dynamic Systems by Aggregation," IEEE Transactions on Automatic Control, AC-13, 246-253. Banerjee, A., J. Dolado, D. Hendry, and G. Smith, 1986, "Exploring Equilibrium Relationships in Econometrics through Static Models: Some Monte Carlo Evidence," Oxford Bulletin of Economics and Statistics, 48, 253-277. Box, G., and G. Tiao, 1977, "A Canonical Analysis of Multiple Time Series," Biometrika, 64, 355365. Davidson, J., D. Hendry, F. Srba, and S. Yeo, 1978, "Econometric Modelling of the Aggregate TimeSeries Relationships between Consumers' Expenditure and Income in the United Kingdom," Economic Journal, 88,661-692. Engle, R., and C. Granger, 1987, "Co-Integration and Error Correction: Representation, Estimation, and Testing," Econometrica, 55, 251-276. Engle, R., and S. Yoo, 1991, "Cointcgrated Economic Time Series: An Overview with New Results," in R. Engle and C. Granger (eds.), Long-Run Economic Relations: Readings in Cointegration, New York: Oxford University Press. Friedman, M., 1959, "The Demand for Money: Some Theoretical and Empirical Results," Journal of Political Economy, 67, 327-351. Fuller, W., 1996, Introduction to Statistical Time Series (2d ed.), New York: Wiley. Goldfcld, S., and D. Sichel, 1990, "The Demand for Money," Chapter 8 in B. Friedman and F. Hahn (eds.), Handbook of Monetary Economics, Volume I, New York: North-Holland.

668

Chapter 10

Granger, C., 1981, "Some Properties of Time Series Data and Their Use in Econometric Model Specification,'' Journal of Econometrics, 16, 121-130. Granger, C., and P. Newbold, 1974, "Spurious Regressions in Econometrics," Journal of Econometrics, 2, 111-120. Hamilton, J., 1994, Time Series Analysis, Princeton: Princeton University Press. Hansen, B., 1992a, "Efficient Estimation and Testing of Cointegrating Vectors in the Presence of Deterministic Trends," Journal of Econometrics, 53, 87-121. _ _ _ , 1992b, ''Tests for Parameter Instability in Regressions with I(I) Processes," Journal of Business and Economic Statistics, I0, 321-335. Haug, A .. 1996, "Tests for Cointegration: A Monte Carlo Comparison," Journal of Econometrics, 71,89-115. Johansen, S., 1988, "Statistical Analysis of Cointegrated Vectors," Journal of Economic Dyamics and Control, 12, 231-254. - - - · 1995, Likelihood-Based Inference in Cointegrated Vector Auto-Regressive Models, New York: Oxford University Press. Lucas, R., 1988, Money Demand in the United States: A Quantitative Review, Carnegie-Rochester Conference Series on Public Policy, Volume 29, 137-168. Ltitkepohl, H., 1993, Introduction to Multiple Time Series Analysis (2d ed.), New York: SpringerVerlag. Maddala, G. S., and In-Moo Kim, 1998, Unit Roots, Cointegration, and Structural Change, New York Cambridge University Press. Ng, S., and P. Perron, 1995, ''Unit Root Tests in ARMA Models with Data-Dependent Methods for the Selection of the Truncated Lag," Journal of the American Statistical Association, 90, 268-281. Park, J., and P. Phillips, 1988, ''Statistica11nference in Regressions with Integrated Processes: Part 1," Econometric Theory, 4,468-497. Phillips. P., 1986, "Understanding Spurious Regressions in Econometrics,'' Journal of Econometrics, 33, 311-340. _ _ _ , 1988, "Weak Convergence to the Matrix Stochastic Integral Bd 8 1," Journal of Multivariate Analysis, 24, 252-264. - - - · 1991, "Optimal Inference in Co integrated Systems," Econometrica, 59, 283-306. Phillips, P., and S. Dnrlauf, 1986, "Multiple Time Series Regression with Integrated Processes," Review of Economic Studies, 53, 473-495. Phillips, P.• and B. Hansen, 1990, "Statistical Inference in Instrumental Variables Regression with 1(1) Processes," Review of Economic Studies, 57, 99-I25. Phillips, P., and M. Loretan, 199I, "Estimating Long-Run Economic Equilibria," Review of Economic Studies, 58,407-436. Phillips, P.• and S. Ouliaris, 1990, "Asymptotic Properties of Residual Based Tests for Cointegration," Econometrica, 58, 165-193. Phillips, P., and J. Park, 1988. "Asymptotic Equivalence of OLS and GLS in Regressions with Integrated Regressors," Journal of the American Statistical Association, 83, 111-115. Saikkonen, P., 1991, "Asymptotically Efficient Estimation ofCointegration Regressions," Eco!UJmetric Theory, 7, 1-21. Sims, C., J. Stock, and M. Watson, 1990, "Inference in Linear Time Series Models with Some Unit Roots," Econometrica, 58, I 13-I44. Stock, J., 1987, "Asymptotic Properties of Least Squares Estimators of Cointegrating Vectors,'' Econometrica. 55. 1035-l 056.

JJ

Colntegration

669

Stock, J., and M. Watson, 1988, "Testing for Common Trends," Journal of the American Statistical Association, 83, 1097-1107. ___ , 1993, "A Simple Estimator of Cointegrating Vectors in Higher Order Integrated Systems,'' Econometrica, 61, 783-820. Watson, M., 1994, "Vector Antoregressions and Cointegration," Chapter 47 in R. Engle, and D. McFadden (eds.), Handbook of Econometrics, Volume IV, New York: North Holland.

Econometrics

Read more

Econometrics

Read more

Econometrics

Read more

Econometrics

Read more

Econometrics

Read more

Econometrics

Read more

Econometrics

Read more

Econometrics

Read more

Econometrics

Read more

Econometrics

Read more

Econometrics

Read more

Econometrics

Read more

Econometrics

Read more

Econometrics

Read more

Econometrics

Read more

Econometrics,

Read more

Econometrics

Read more

Econometrics

Read more

Econometrics

Read more

Bayesian Econometrics (Advances in Econometrics)

Read more

Financial Econometrics

Read more

Introductory Econometrics

Read more

Spatial Econometrics

Read more

Basic Econometrics

Read more

Financial Econometrics

Read more

Advanced Econometrics

Read more

Advanced Econometrics

Read more

Ebooks Econometrics The Econometrics of Financial Markets

Read more

Finite Sample Econometrics (Advanced Texts in Econometrics)

Read more

Bayesian econometrics

Read more

Recommend Documents

Econometrics

Econometrics c Michael Creel Version 0.80, February, 2006 D EPT. OF E CONOMICS AND E CONOMIC H ISTORY, U NIVERSITA...

Econometrics

EC403, Part 1 Oliver Linton October 14, 2002 2 Contents 1 Linear Regression 1.1 The Model . . . . . . . . . . . . . ...

Econometrics

Econometrics Badi H. Baltagi Econometrics Fourth Edition 123 Professor Badi H. Baltagi Syracuse University Cente...

Econometrics

...

Econometrics

Econometrics Badi H. Baltagi Econometrics Fourth Edition 123 Professor Badi H. Baltagi Syracuse University Cente...

Econometrics

Econometrics

...

Econometrics

ECONOMETRICS Bruce E. Hansen c °2000, 20061 University of Wisconsin www.ssc.wisc.edu/~bhansen Revised: January 2006 Comm...

Econometrics

Econometrics Badi H. Baltagi Econometrics Fourth Edition 123 Professor Badi H. Baltagi Syracuse University Cente...

Econometrics

Thomas Andren Econometrics Download free books at BookBooN.com Econometrics © 2007 Thomas Andren & Ventus Publishin...