Matrix Calculus & Zero-One Matrices: Statistical and Econometric Applications
Darrell A. Turkington
Darrell A. Turkington is Professor in the Department of Economics at the University of Western Australia, Perth. He is coauthor with Roger J. Bowden of the widely cited work Instrumental
Variables (1984) in the Econometric Society Monographs series published by Cambridge University Press. Professor Turkington has published in leading international journals such as the Journal
of the American Statistical Association, the International Economic Review, and the Journal of Econometrics. He has held visiting positions at the University of California, Berkeley, the University of Warwick, and the University of British Columbia.
JACKET DESIGN BY JAMES F. BRISSON
Printed in the United Kingdom at the University Press, Cambridge
ADVANCE PRAISE FOR
Matrix Calculus and Zero-One Matrices
"As a textbook and reference on sampling theoretic multivariate linear inference, this book is distinguished by its rigorous use of advanced matrix calculus. Not since Theil's text appeared in 1971 has there been a classical econometrics text that introduced to econometricians new matrix calculus methods in such an extensive manner. While the earlier books sometimes used the Kronecker product, Turkington's book goes much deeper and systematically incorporates matrix calculus methods that were not available to econometricians in 1971."
William Barnett, Washington University, St. Louis
"Do you think that a full Gaussian maximum likelihood analysis of a simultaneous equations system with vector autoregressive or moving average residuals is clumsy and therefore almost untractable? Turkington shows you that you are wrong. By defining suitable zero-one matrices and vectorization operators he finds an elegant way to deal with such models and he shows that they are quite accessible."
Helmut Lütkepohl, Humboldt University of Berlin
CAMBRIDGE UNIVERSITY PRESS www.cambridge.org ISBN 0-521-80788-3
Matrix Calculus and Zero-One Matrices
This book presents the reader with mathematical tools taken from matrix calculus and zero-one matrices and demonstrates how these tools greatly facilitate the application of classical statistical procedures to econometric models. The matrix calculus results are derived from a few basic rules that are generalizations of the rules of ordinary calculus. These results are summarized in a useful table. Well-known zero-one matrices, together with some new ones, are defined, their mathematical roles explained, and their useful properties presented. The basic building blocks of classical statistics, namely, the score vector, the information matrix, and the Cramer-Rao lower bound, are obtained for a sequence of linear econometric models of increasing statistical complexity. From these are obtained iterative interpretations of maximum likelihood estimators, linking them with efficient econometric estimators. Classical test statistics are also derived and compared for hypotheses of interest.
Matrix Calculus and Zero-One Matrices Statistical and Econometric Applications
DARRELL A. TURKINGTON University of Western Australia
CAMBRIDGE UNIVERSITY PRESS
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, Sao Paulo
Cambridge University Press, The Edinburgh Building, Cambridge CB2 2RU, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521807883
© Darrell A. Turkington 2002
This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published 2002
This digitally printed first paperback version 2005
A catalogue record for this publication is available from the British Library
Library of Congress Cataloguing in Publication data
Turkington, Darrell A. Matrix calculus and zero-one matrices : statistical and econometric applications / Darrell A. Turkington. p. cm. Includes bibliographical references and index. ISBN 0-521-80788-3 1. Matrices. 2. Mathematical statistics. I. Title. QA188 .T865 2001 512.9'434 dc21 2001025614
ISBN-13 978-0-521-80788-3 hardback
ISBN-10 0-521-80788-3 hardback
ISBN-13 978-0-521-02245-3 paperback
ISBN-10 0-521-02245-2 paperback
To Sonia
Contents
Preface
1 Classical Statistical Procedures
  1.1 Introduction
  1.2 The Score Vector, the Information Matrix, and the Cramer-Rao Lower Bound
  1.3 Maximum Likelihood Estimators and Test Procedures
  1.4 Nuisance Parameters
  1.5 Differentiation and Asymptotics
2 Elements of Matrix Algebra
  2.1 Introduction
  2.2 Kronecker Products
  2.3 The Vec and the Devec Operators
  2.4 Generalized Vec and Devec Operators
  2.5 Triangular Matrices and Band Matrices
3 Zero-One Matrices
  3.1 Introduction
  3.2 Selection Matrices and Permutation Matrices
  3.3 The Commutation Matrix and Generalized Vecs and Devecs of the Commutation Matrix
  3.4 Elimination Matrices $L$, $\bar{L}$
  3.5 Duplication Matrices $D$, $\bar{L}'$
  3.6 Results Concerning Zero-One Matrices Associated with an n x n Matrix
  3.7 Shifting Matrices
4 Matrix Calculus
  4.1 Introduction
  4.2 Basic Definitions
  4.3 Some Simple Matrix Calculus Results
  4.4 Matrix Calculus and Zero-One Matrices
  4.5 The Chain Rule and the Product Rule for Matrix Calculus
  4.6 Rules for Vecs of Matrices
  4.7 Rules Developed from the Properties of KG„, Kt'„, and IC„I
  4.8 Rules for Scalar Functions of a Matrix
  4.9 Tables of Results
5 Linear-Regression Models
  5.1 Introduction
  5.2 The Basic Linear-Regression Model
  5.3 The Linear-Regression Model with Autoregressive Disturbances
  5.4 Linear-Regression Model with Moving-Average Disturbances
  Appendix 5.A Probability Limits Associated with the Information Matrix for the Autoregressive Disturbances Model
  Appendix 5.B Probability Limits Associated with the Information Matrix for the Moving-Average Disturbances Model
6 Seemingly Unrelated Regression Equations Models
  6.1 Introduction
  6.2 The Standard SURE Model
  6.3 The SURE Model with Vector Autoregressive Disturbances
  6.4 The SURE Model with Vector Moving-Average Disturbances
  Appendix 6.A Probability Limits Associated with the Information Matrix of the Model with Moving-Average Disturbances
7 Linear Simultaneous Equations Models
  7.1 Introduction
  7.2 The Standard Linear Simultaneous Equations Model
  7.3 The Linear Simultaneous Equations Model with Vector Autoregressive Disturbances
  7.4 The Linear Simultaneous Equations Model with Vector Moving-Average Disturbances
  Appendix 7.A Consistency of R and E
References and Suggested Readings
Index
Preface
This book concerns itself with the mathematics behind the application of classical statistical procedures to econometric models. I first tried to apply such procedures in 1983 when I wrote a book with Roger Bowden on instrumental variable estimation. I was impressed with the amount of differentiation involved and the difficulty I had in recognizing the end product of this process. I thought there must be an easier way of doing things. Of course at the time, like most econometricians, I was blissfully unaware of matrix calculus and the existence of zero-one matrices. Since then several books have been published in these areas showing us the power of these concepts. See, for example, Graham (1981), Magnus (1988), Magnus and Neudecker (1999), and Lütkepohl (1996).
This present book arose when I set myself two tasks: first, to make myself a list of rules of matrix calculus that were most useful in applying classical statistical procedures to econometrics; second, to work out the basic building blocks of such procedures (the score vector, the information matrix, and the Cramer-Rao lower bound) for a sequence of econometric models of increasing statistical complexity. I found that the mathematics involved working with operators that were generalizations of the well-known vec operator, and that a very simple zero-one matrix kept cropping up. I called the matrix a shifting matrix for reasons that are obvious in the book. Its basic nature is illustrated by the fact that all Toeplitz circulant matrices can be written as linear combinations of shifting matrices.
The book falls naturally into two parts. The first part outlines the classical statistical procedures used throughout the work and aims at providing the reader with the mathematical tools needed to apply these procedures to econometric models. The statistical procedures are dealt with in Chap. 1. Chapter 2 deals with elements of matrix algebra. In this chapter, generalized vec and devec operators are defined and their basic properties investigated. Chapter 3 concerns itself with zero-one matrices. Well-known zero-one matrices such as commutation matrices, elimination matrices, and duplication matrices are defined and their properties listed. Several new zero-one matrices are introduced in this chapter. Explicit expressions are given for the generalized vec and devec of the commutation matrix, and the properties of these matrices are investigated in several theorems. Shifting matrices are defined and the connection between these matrices and Toeplitz and circulant matrices is explained. Moreover, the essential role they play in time-series processes is demonstrated. Chapter 4 is devoted to matrix calculus. The approach taken in this chapter is to derive the matrix calculus results from a few basic rules that are generalizations of the chain rule and product rule of ordinary calculus. Some of these results are new, involving as they do generalized vecs of commutation matrices. A list of useful rules is given at the end of the chapter.
The second part of the book is designed to illustrate how the mathematical tools discussed in the preceding chapters greatly facilitate the application of classical statistical procedures to econometric models in that they speed up the difficult differentiation involved and help in the required asymptotic work. In all, nine linear statistical models are considered. The first three models (Chap. 5) are based on the linear-regression model: the basic model, the linear-regression model with autoregressive disturbances, and the linear-regression model with moving-average disturbances. The next three models (Chap. 6) are based on the seemingly unrelated regression equations (SURE) model: the basic model, the SURE model with vector autoregressive disturbances, and the SURE model with vector moving-average disturbances. The final three models (Chap. 7) are based on the linear simultaneous equations (LSE) model. We consider the basic LSE model and the two variations that come about when we assume vector autoregressive or vector moving-average disturbances. For each model considered, the basic building blocks of classical statistics are obtained: the score vector, the information matrix, and the Cramer-Rao lower bound. Statistical analysis is then conducted with these concepts. Where possible, econometric estimators of the parameters of primary interest that achieve the Cramer-Rao lower bound are discussed. Iterative interpretations of the maximum-likelihood estimators that link them with the econometric estimators are presented. Classical test statistics for hypotheses of interest are obtained.
The models were chosen in such a way as to form a sequence of models of increasing statistical complexity. The reader can then see, for example, how the added complication changes the information matrix or the Cramer-Rao lower bound. There are, in fact, two such sequences in operation. First, within chapters, we have in Chap. 5, for example, the basic linear-regression model followed by versions of this model with more complicated disturbance structures. Second, between chapters, we have sequences of models with the same characteristics assigned to the disturbances: for example, the linear-regression model with autoregressive disturbances followed by the SURE model and the LSE model with vector autoregressive disturbances.
It is assumed that the reader has a good working knowledge of matrix algebra, basic statistics, and classical econometrics and is familiar with standard asymptotic theory. As such, the book should be useful for graduate students in econometrics and for practicing econometricians. Statisticians interested in how their procedures apply to other fields may also be attracted to this work. Several institutions should be mentioned in this preface: first, my home university, the University of Western Australia, for allowing me time off from teaching to concentrate on the manuscript; second, the University of Warwick and the University of British Columbia for providing me with stimulating environments at which to spend my sabbaticals. At Warwick I first became interested in matrix calculus; at British Columbia I put the finishing touches to the manuscript. Several individuals must also be thanked: my teacher Tom Rothenberg, to whom I owe an enormous debt; Adrian Pagan, for his sound advice; Jan Magnus, for introducing me to the intricacies of zero-one matrices; my colleagues Les Jennings, Michael McAleer, Shiqing Ling, and Jakob Madsen for their helpful suggestions and encouragement; Helen Reidy for her great patience and skill in typing the many drafts of this work; finally, my family, Sonia, Joshua, and Nikola, for being there for me.
1
Classical Statistical Procedures
1.1. INTRODUCTION
An alternative title to this book could have been The Application of Classical Statistical Procedures to Econometrics or something along these lines. What it purports to do is provide the reader with mathematical tools that facilitate the application of classical statistical procedures to the complicated statistical models that we are confronted with in econometrics. It then demonstrates how these procedures can be applied to a sequence of linear econometric models, each model being more complicated statistically than the previous one. The statistical procedures I have in mind are those centered around the likelihood function: procedures that involve the score vector, the information matrix, and the Cramer-Rao lower bound, together with maximum-likelihood estimation and classical test statistics.
Until recently, such procedures were little used by econometricians. The likelihood function in most econometric models is complicated, and the first-order conditions for maximizing this function usually give rise to a system of nonlinear equations that is not easily solved. As a result, econometricians developed their own class of estimators, instrumental variable estimators, that had the same asymptotic properties as those of maximum-likelihood estimators (MLEs) but were far more tractable mathematically [see Bowden and Turkington (1990)]. Nor did econometricians make much use of the prescribed classical statistical procedures for obtaining test statistics for the hypotheses of interest in econometric models; rather, test statistics were developed on an ad hoc basis.
All that changed in the last couple of decades, when there was renewed interest by econometricians in maximum-likelihood procedures and in developing Lagrangian multiplier test (LMT) statistics. One reason for this change was the advent of large, fast computers.
A complicated system of nonlinear equations could now be solved so we would have in hand the maximum-likelihood estimates even though we had no algebraic expression for the underlying estimators. Another more recent explanation for this change in attitude is the advent of results on zero-one matrices and matrix calculus. Works by Graham (1981), Magnus (1988), Magnus and Neudecker (1988), and Lütkepohl (1996) have shown us the importance of zero-one matrices, their connection to matrix calculus, and the power of matrix calculus, particularly with respect to applying classical statistical procedures. In this introductory chapter, I give a brief and nonrigorous summary of the classical statistical procedures that are used extensively in the latter part of this book.
1.2. THE SCORE VECTOR, THE INFORMATION MATRIX, AND THE CRAMER-RAO LOWER BOUND
Let $\theta$ be a $k \times 1$ vector of unknown parameters associated with a statistical model and let $l(\theta)$ be the log-likelihood function that satisfies certain regularity conditions and is twice differentiable. Let $\partial l/\partial\theta$ denote the $k \times 1$ vector of partial derivatives of $l$. Then $\partial l/\partial\theta$ is called the score vector. Let $\partial^2 l/\partial\theta\,\partial\theta'$ denote the $k \times k$ Hessian matrix of $l(\theta)$. Then the (asymptotic) information matrix is defined as
$$I(\theta) = -\lim_{n \to \infty} \frac{1}{n} E\left(\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\right),$$
where $n$ denotes the sample size. Now the limit of the expectation need not be the same as the probability limit. However, for the models we consider in this book, based as they are on the multivariate normal distribution, the two concepts will be the same. As a result it is often more convenient to regard the information matrix as
$$I(\theta) = -\operatorname{plim} \frac{1}{n} \frac{\partial^2 l}{\partial\theta\,\partial\theta'}.$$
The inverse of this matrix, $I^{-1}(\theta)$, is called the (asymptotic) Cramer-Rao lower bound. Let $\hat\theta$ be a consistent estimator of $\theta$ such that
$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} N(0, V).$$
The matrix $V$ is called the asymptotic covariance matrix of $\hat\theta$. Then $V$ exceeds the Cramer-Rao lower bound $I^{-1}(\theta)$ in the sense that $V - I^{-1}(\theta)$ is a positive-semidefinite matrix. If $V = I^{-1}(\theta)$, then $\hat\theta$ is called a best asymptotically normally distributed estimator (which is shortened to BAN estimator).
1.3. MAXIMUM LIKELIHOOD ESTIMATORS AND TEST PROCEDURES
Classical statisticians prescribed a procedure for obtaining a BAN estimator, namely the maximum-likelihood procedure. Let $\Theta$ denote the parameter space. Then any value of $\theta$ that maximizes $l(\theta)$ over $\Theta$ is called a maximum-likelihood estimate, and the underlying estimator is called the MLE. The first-order conditions for this maximization are given by
$$\frac{\partial l(\theta)}{\partial\theta} = 0.$$
Let $\hat\theta$ denote the MLE of $\theta$. Then $\hat\theta$ is consistent, and $\hat\theta$ is the BAN estimator so
$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} N[0, I^{-1}(\theta)].$$
Let $h$ be a $G \times 1$ vector whose elements are functions of the elements of $\theta$. We denote this by $h(\theta)$. Suppose we are interested in developing test statistics for the null hypothesis $H_0: h(\theta) = 0$ against the alternative $H_A: h(\theta) \neq 0$. Let $\hat\theta$ denote the MLE of $\theta$ and $\tilde\theta$ denote the constrained MLE of $\theta$; that is, $\tilde\theta$ is the MLE of $\theta$ we obtain after we impose $H_0$ on our statistical model. Let $\partial h(\theta)/\partial\theta$ denote the $k \times G$ matrix whose $(ij)$ element is $\partial h_j/\partial\theta_i$. Then classical statisticians prescribed three competing procedures for obtaining a test statistic for $H_0$. These are as follows.
LAGRANGIAN MULTIPLIER TEST STATISTIC
$$T_1 = \frac{1}{n} \frac{\partial l(\tilde\theta)'}{\partial\theta} I^{-1}(\tilde\theta) \frac{\partial l(\tilde\theta)}{\partial\theta}.$$
Note that the LMT statistic uses the constrained MLE of $\theta$. If $H_0$ is true, $\tilde\theta$ should be close to $\hat\theta$ and as, by the first-order conditions, $\partial l(\hat\theta)/\partial\theta = 0$, the derivative $\partial l(\theta)/\partial\theta$ evaluated at $\tilde\theta$ should also be close to the null vector. The test statistic is a measure of the distance $\partial l(\tilde\theta)/\partial\theta$ is from the null vector.
WALD TEST STATISTIC
$$T_2 = n\, h(\hat\theta)' \left[\frac{\partial h(\hat\theta)'}{\partial\theta} I^{-1}(\hat\theta) \frac{\partial h(\hat\theta)}{\partial\theta}\right]^{-1} h(\hat\theta).$$
Note that the Wald test statistic uses the (unconstrained) MLE of $\theta$. Essentially it is based on the asymptotic distribution of $\sqrt{n}\,h(\hat\theta)$ under $H_0$, the statistic itself measuring the distance $h(\hat\theta)$ is from the null vector.
LIKELIHOOD RATIO TEST STATISTIC
$$T_3 = 2[l(\hat\theta) - l(\tilde\theta)].$$
Note that the likelihood ratio test (LRT) statistic uses both the unconstrained MLE $\hat\theta$ and the constrained MLE $\tilde\theta$. If $H_0$ is indeed true, it should not matter whether we impose it or not, so $l(\hat\theta)$ should be approximately the same as $l(\tilde\theta)$. The test statistic $T_3$ measures the difference between $l(\hat\theta)$ and $l(\tilde\theta)$.
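The three statistics can be made concrete with a small numerical sketch. The model used here is an assumption chosen for illustration and is not one of the book's models: an i.i.d. normal sample with theta = (mu, sigma^2)' and the null hypothesis mu = 0, for which the score, the information matrix, and both MLEs have simple closed forms. The function names `loglik`, `score`, and `info` are hypothetical.

```python
import numpy as np

def loglik(theta, y):
    """Log-likelihood of an i.i.d. N(mu, s2) sample; theta = (mu, s2)."""
    mu, s2 = theta
    n = y.size
    return -0.5 * n * np.log(2.0 * np.pi * s2) - 0.5 * np.sum((y - mu) ** 2) / s2

def score(theta, y):
    """Score vector dl/dtheta (k = 2), returned as a 1-D array."""
    mu, s2 = theta
    r = y - mu
    return np.array([r.sum() / s2,
                     -0.5 * y.size / s2 + 0.5 * np.sum(r ** 2) / s2 ** 2])

def info(theta):
    """Asymptotic information matrix of this model: diag(1/s2, 1/(2 s2^2))."""
    _, s2 = theta
    return np.diag([1.0 / s2, 0.5 / s2 ** 2])

rng = np.random.default_rng(0)
y = rng.normal(loc=0.3, scale=1.0, size=200)   # simulated data (assumption)
n = y.size

# Unconstrained MLE and the MLE constrained by H0: mu = 0.
theta_hat = np.array([y.mean(), np.mean((y - y.mean()) ** 2)])
theta_til = np.array([0.0, np.mean(y ** 2)])

# h(theta) = mu, so dh/dtheta is the k x G = 2 x 1 matrix (1, 0)'.
dh = np.array([[1.0], [0.0]])
h_hat = np.array([theta_hat[0]])

# Lagrangian multiplier statistic: score evaluated at the constrained MLE.
s_til = score(theta_til, y)
T1 = s_til @ np.linalg.inv(info(theta_til)) @ s_til / n

# Wald statistic: h evaluated at the unconstrained MLE.
middle = dh.T @ np.linalg.inv(info(theta_hat)) @ dh
T2 = n * h_hat @ np.linalg.inv(middle) @ h_hat

# Likelihood ratio statistic: uses both estimates.
T3 = 2.0 * (loglik(theta_hat, y) - loglik(theta_til, y))

print("LM:", T1, "Wald:", T2, "LR:", T3)
```

In this particular model the statistics reduce to n x/(1 + x), n x, and n log(1 + x), with x the squared sample mean divided by the unconstrained variance estimate, so the Wald statistic is the largest and the LM statistic the smallest. That ordering is a property of this example, not a claim made in the text.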
All three test statistics are asymptotically equivalent in the sense that, under $H_0$, they all have the same limiting $\chi^2$ distribution and under $H_A$, with local alternatives, they have the same limiting noncentral $\chi^2$ distribution. Usually imposing the null hypothesis on our model leads to a simpler statistical model, and thus the constrained MLEs $\tilde\theta$ are more obtainable than the unconstrained MLEs $\hat\theta$. For this reason the LMT statistic is often the easiest statistic to form. Certainly it is the one that has been most widely used in econometrics.
1.4. NUISANCE PARAMETERS
Let us now partition $\theta$ into $\theta = (\alpha'\ \beta')'$, where $\alpha$ is a $k_1 \times 1$ vector of parameters of primary interest and $\beta$ is a $k_2 \times 1$ vector of nuisance parameters, $k_1 + k_2 = k$. The terms used here do not imply that the parameters in $\beta$ are unimportant to our statistical model. Rather, they indicate that the purpose of our analysis is to make statistical inference about the parameters in $\alpha$ instead of those in $\beta$. In this situation, two approaches can be taken. First, we can derive the information matrix $I(\theta)$ and the Cramer-Rao lower bound $I^{-1}(\theta)$. Let
$$I(\theta) = \begin{pmatrix} I_{\alpha\alpha} & I_{\alpha\beta} \\ I_{\beta\alpha} & I_{\beta\beta} \end{pmatrix}, \qquad I^{-1}(\theta) = \begin{pmatrix} I^{\alpha\alpha} & I^{\alpha\beta} \\ I^{\beta\alpha} & I^{\beta\beta} \end{pmatrix}$$
be these matrices partitioned according to our partition of $\theta$. As far as $\alpha$ is concerned we can now work with $I_{\alpha\alpha}$ and $I^{\alpha\alpha}$ in place of $I(\theta)$ and $I^{-1}(\theta)$, respectively. For example, $I^{\alpha\alpha}$ is the Cramer-Rao lower bound for the asymptotic covariance matrix of a consistent estimator of $\alpha$. If $\hat\alpha$ is the MLE of $\alpha$, then
$$\sqrt{n}(\hat\alpha - \alpha) \xrightarrow{d} N(0, I^{\alpha\alpha}),$$
and so on. A particular null hypothesis that has particular relevance for us is $H_0: \alpha = 0$ against $H_A: \alpha \neq 0$.
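The role of the upper-left block of the inverse information matrix can be illustrated numerically. The sketch below is an illustration under stated assumptions: the matrix `info` is an arbitrary symmetric positive-definite stand-in for an information matrix, and the Schur-complement formula for the block of the inverse is the standard partitioned-inverse result, which the text does not state explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
k1, k2 = 2, 2                       # sizes of alpha and beta (assumption)
k = k1 + k2

# Arbitrary symmetric positive-definite stand-in for the information matrix.
M = rng.normal(size=(k, k))
info = M @ M.T + k * np.eye(k)

# Blocks of the information matrix, partitioned as theta = (alpha', beta')'.
I_aa, I_ab = info[:k1, :k1], info[:k1, k1:]
I_ba, I_bb = info[k1:, :k1], info[k1:, k1:]

# Upper-left block of the inverse: the Cramer-Rao bound for alpha.
bound_alpha = np.linalg.inv(info)[:k1, :k1]

# Standard partitioned-inverse (Schur complement) expression for that block.
schur = I_aa - I_ab @ np.linalg.inv(I_bb) @ I_ba

# Gap between the bound with nuisance parameters and the bound that would
# apply if beta were known; it is positive semidefinite.
gap = bound_alpha - np.linalg.inv(I_aa)
gap = 0.5 * (gap + gap.T)           # symmetrize against rounding error

print(np.allclose(bound_alpha, np.linalg.inv(schur)))
print(np.linalg.eigvalsh(gap))
```

The second printout shows why the block of the inverse, rather than the inverse of the block, is the relevant bound: having to estimate the nuisance parameters never lowers the bound for the parameters of interest.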
Under this first approach, the classical test statistics for this null hypothesis would be the following test statistics.
LAGRANGIAN MULTIPLIER TEST STATISTIC
$$T_1 = \frac{1}{n} \frac{\partial l(\tilde\theta)'}{\partial\alpha} I^{\alpha\alpha}(\tilde\theta) \frac{\partial l(\tilde\theta)}{\partial\alpha}.$$
WALD TEST STATISTIC
$$T_2 = n\, \hat\alpha'\, I^{\alpha\alpha}(\hat\theta)^{-1} \hat\alpha.$$
LIKELIHOOD RATIO TEST STATISTIC
$$T_3 = 2[l(\hat\theta) - l(\tilde\theta)].$$
Under $H_0$ all three test statistics would have a limiting $\chi^2$ distribution with $k_1$ degrees of freedom, and the nature of the tests insists that we use the upper tail of this distribution to find the appropriate critical region.
The second approach is to work with the concentrated log-likelihood function. Here we undertake a stepwise maximization of the log-likelihood function. We first maximize $l(\theta)$ with respect to the nuisance parameters $\beta$ to obtain $\bar\beta = \bar\beta(\alpha)$, say. The vector $\bar\beta$ is then placed back in the log-likelihood function to obtain
$$\bar l(\alpha) = l[\alpha, \bar\beta(\alpha)].$$
The function $\bar l(\alpha)$ is called the concentrated likelihood function. Our analysis can now be reworked with $\bar l(\alpha)$ in place of $l(\theta)$. For example, let
$$\bar I = -\operatorname{plim} \frac{1}{n} \frac{\partial^2 \bar l}{\partial\alpha\,\partial\alpha'}$$
and let $\hat\alpha$ be any consistent estimator of $\alpha$ such that
$$\sqrt{n}(\hat\alpha - \alpha) \xrightarrow{d} N(0, V_\alpha).$$
Then $V_\alpha \geq \bar I^{-1}$ in the sense that their difference is a positive-semidefinite matrix. If $\hat\alpha$ is the MLE of $\alpha$, then $\hat\alpha$ is obtained from
$$\frac{\partial \bar l}{\partial\alpha} = 0, \qquad \sqrt{n}(\hat\alpha - \alpha) \xrightarrow{d} N(0, \bar I^{-1}),$$
and so on. As far as test procedures go for the null hypothesis $H_0: h(\alpha) = 0$, under this second approach we rewrite the test statistics by using $\bar l$ and $\bar I$ in place of $l(\theta)$ and $I(\theta)$, respectively. In this book, I largely use the first approach as one of my expressed aims is to achieve the complete information matrix $I(\theta)$ for a sequence of econometric models.
1.5. DIFFERENTIATION AND ASYMPTOTICS
Before we leave this brief chapter, note that classical statistical procedures involve us in much differentiation. The score vector $\partial l/\partial\theta$, the Hessian matrix $\partial^2 l/\partial\theta\,\partial\theta'$, and $\partial h/\partial\theta$ all involve working out partial derivatives. It is at this stage that difficulties can arise in applying these procedures to econometric models. As hinted at in Section 1.2, the log-likelihood function $l(\theta)$ for most econometric models is a complicated function, and it is no trivial matter to obtain the derivatives required in our application. Usually it is too great a task for ordinary calculus. Although in some cases it can be done [see, for example, Rothenberg and Leenders (1964)], what often happens when one attempts to do the differentiation by using ordinary calculus is that one is confronted with a hopeless mess. It is precisely this problem that has motivated the writing of this book. I hope that it will go some way toward alleviating it.
It is assumed that the reader is familiar with standard asymptotic theory. Every attempt has been made to make the rather dull but necessary asymptotic analysis in this book as readable as possible. Only the probability limits of the information matrices that are required in our statistical analysis are worked out in full. The probability limits themselves are assumed to exist; a more formal mathematical analysis would give a list of sufficient conditions needed to ensure this. Finally, as already noted, use is made of the shortcut notation
$$\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N(0, V)$$
rather than the more formally correct notation
$$\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} x \sim N(0, V).$$
2
Elements of Matrix Algebra
2.1. INTRODUCTION
In this chapter, we consider matrix operators that are used throughout the book and special square matrices, namely triangular matrices and band matrices, that will crop up continually in our future work. From the elements of an $m \times n$ matrix $A = (a_{ij})$ and a $p \times q$ matrix $B = (b_{ij})$, the Kronecker product forms an $mp \times nq$ matrix. The vec operator forms a column vector out of a given matrix by stacking its columns one underneath the other. The devec operator forms a row vector out of a given matrix by stacking its rows one alongside the other. In like manner, a generalized vec operator forms a new matrix from a given matrix by stacking a certain number of its columns under each other and a generalized devec operator forms a new matrix by stacking a certain number of rows alongside each other. It is well known that the Kronecker product is intimately connected with the vec operator, but we shall see that this connection also holds for the devec and generalized operators as well. Finally we look at special square matrices with zeros above or below the main diagonal or whose nonzero elements form a band surrounded by zeros. The approach I have taken in this chapter, as indeed in several other chapters, is to list, without proof, well-known properties of the mathematical concept in hand. If, however, I want to present a property in a different light or if I have something new to say about the concept, then I will give a proof.
2.2. KRONECKER PRODUCTS
Let $A = (a_{ij})$ be an $m \times n$ matrix and $B$ a $p \times q$ matrix. The $mp \times nq$ matrix given by
$$\begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix}$$
is called the Kronecker product of $A$ and $B$, denoted by $A \otimes B$. The following useful properties concerning Kronecker products are well known:
$$A \otimes (B \otimes C) = (A \otimes B) \otimes C = A \otimes B \otimes C,$$
$$(A + B) \otimes (C + D) = A \otimes C + A \otimes D + B \otimes C + B \otimes D, \quad \text{if } A + B \text{ and } C + D \text{ exist},$$
$$(A \otimes B)(C \otimes D) = AC \otimes BD, \quad \text{if } AC \text{ and } BD \text{ exist}.$$
The transpose of a Kronecker product is
$$(A \otimes B)' = A' \otimes B',$$
whereas the rank of a Kronecker product is
$$r(A \otimes B) = r(A)\, r(B).$$
If $A$ is a square $n \times n$ matrix and $B$ is a square $p \times p$ matrix, then the trace of the Kronecker product is
$$\operatorname{tr}(A \otimes B) = \operatorname{tr} A \operatorname{tr} B,$$
whereas the determinant of the Kronecker product is
$$|A \otimes B| = |A|^p |B|^n,$$
and if $A$ and $B$ are nonsingular, the inverse of the Kronecker product is
$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}.$$
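These well-known properties are easy to confirm numerically. The following is a minimal sketch using NumPy's `np.kron`; the particular matrices are arbitrary choices made for illustration, with `A` and `B` square and nonsingular so that the trace, determinant, and inverse rules apply.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                 # n x n with n = 2, det = 5
B = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 4.0],
              [5.0, 6.0, 0.0]])            # p x p with p = 3, det = 16
C = np.arange(8.0).reshape(2, 4)           # 2 x 4, full row rank
D = np.array([[1.0, 0.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 3.0, 0.0],
              [1.0, 1.0, 0.0, 0.0, 2.0]])  # 3 x 5, full row rank
n, p = A.shape[0], B.shape[0]
AB = np.kron(A, B)

rank = np.linalg.matrix_rank
checks = {
    "associative": np.allclose(np.kron(A, np.kron(B, C)),
                               np.kron(np.kron(A, B), C)),
    "mixed product": np.allclose(AB @ np.kron(C, D), np.kron(A @ C, B @ D)),
    "transpose": np.allclose(AB.T, np.kron(A.T, B.T)),
    "rank": rank(np.kron(C, D)) == rank(C) * rank(D),
    "trace": np.isclose(np.trace(AB), np.trace(A) * np.trace(B)),
    "determinant": np.isclose(np.linalg.det(AB),
                              np.linalg.det(A) ** p * np.linalg.det(B) ** n),
    "inverse": np.allclose(np.linalg.inv(AB),
                           np.kron(np.linalg.inv(A), np.linalg.inv(B))),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Note the exponents in the determinant rule: the determinant of the n x n factor is raised to the order p of the other factor, and vice versa.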
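Other properties of Kronecker products, although perhaps less well known, are nevertheless useful and are used throughout this book. First note that, in general, Kronecker products do not obey the commutative law, so $A \otimes B \neq B \otimes A$. One exception to this rule is if $a$ and $b$ are two column vectors, not necessarily of the same order; then
$$a' \otimes b = b \otimes a' = ba'. \tag{2.1}$$
This exception allows us to write $A \otimes b$ in an interesting way, where $A$ is an $m \times n$ matrix and $b$ is a $p \times 1$ vector. Partitioning $A$ into its rows, we write
$$A = \begin{pmatrix} a^{1\prime} \\ \vdots \\ a^{m\prime} \end{pmatrix},$$
where $a^{i\prime}$ is the $i$th row of $A$. Then clearly from our definition of Kronecker product
$$A \otimes b = \begin{pmatrix} a^{1\prime} \otimes b \\ \vdots \\ a^{m\prime} \otimes b \end{pmatrix} = \begin{pmatrix} b \otimes a^{1\prime} \\ \vdots \\ b \otimes a^{m\prime} \end{pmatrix}, \tag{2.2}$$
where we achieve the last equality by using Eq. (2.1).
Second, it is clear from the definition of the Kronecker product that if $A$ is partitioned into submatrices, say
$$A = \begin{pmatrix} A_{11} & \cdots & A_{1K} \\ \vdots & & \vdots \\ A_{l1} & \cdots & A_{lK} \end{pmatrix},$$
then
$$A \otimes B = \begin{pmatrix} A_{11} \otimes B & \cdots & A_{1K} \otimes B \\ \vdots & & \vdots \\ A_{l1} \otimes B & \cdots & A_{lK} \otimes B \end{pmatrix}.$$
Suppose we now partition $B$ into an arbitrary number of submatrices, say
$$B = \begin{pmatrix} B_{11} & \cdots & B_{1r} \\ \vdots & & \vdots \\ B_{s1} & \cdots & B_{sr} \end{pmatrix}.$$
Then, in general,
$$A \otimes B \neq \begin{pmatrix} A \otimes B_{11} & \cdots & A \otimes B_{1r} \\ \vdots & & \vdots \\ A \otimes B_{s1} & \cdots & A \otimes B_{sr} \end{pmatrix}.$$
One exception to this rule is given by the following theorem.
Theorem 2.1. Let $a$ be an $m \times 1$ vector and $B$ be a $p \times q$ matrix. Write $B = (B_1 \cdots B_r)$, where each submatrix of $B$ has $p$ rows. Then
$$a \otimes B = (a \otimes B_1 \cdots a \otimes B_r).$$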
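Proof of Theorem 2.1. Clearly
$$a \otimes B = \begin{pmatrix} a_1 B \\ \vdots \\ a_m B \end{pmatrix} = \begin{pmatrix} a_1 (B_1 \cdots B_r) \\ \vdots \\ a_m (B_1 \cdots B_r) \end{pmatrix} = \begin{pmatrix} a_1 B_1 & \cdots & a_1 B_r \\ \vdots & & \vdots \\ a_m B_1 & \cdots & a_m B_r \end{pmatrix} = (a \otimes B_1 \cdots a \otimes B_r). \qquad \square$$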
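Theorem 2.1 and the general failure of the same rule for a matrix left factor can both be seen in a small example. This is a sketch with arbitrary illustrative matrices; `np.hstack` plays the role of placing submatrices side by side.

```python
import numpy as np

a = np.array([[1.0], [2.0], [3.0]])        # m x 1 vector
B1 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
B2 = np.array([[5.0],
               [6.0]])
B = np.hstack([B1, B2])                    # B = (B1 B2), each block has p = 2 rows

# Theorem 2.1: a (x) (B1 B2) = (a (x) B1  a (x) B2).
lhs = np.kron(a, B)
rhs = np.hstack([np.kron(a, B1), np.kron(a, B2)])
print(np.array_equal(lhs, rhs))

# With a matrix A on the left the analogous rule fails in general:
# the two sides contain the same columns but in a different order.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
lhs_general = np.kron(A, B)
rhs_general = np.hstack([np.kron(A, B1), np.kron(A, B2)])
print(np.array_equal(lhs_general, rhs_general))
```

The first comparison holds and the second does not, which is why the theorem is stated for a column vector `a` only.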
Now consider $A$ as an $m \times n$ matrix partitioned into its columns $A = (a_1 \cdots a_n)$ and a partitioned matrix $B = (B_1 \cdots B_r)$. Then, by using Theorem 2.1, it is clear that we can write
$$A \otimes B = (a_1 \otimes B_1 \cdots a_1 \otimes B_r \cdots a_n \otimes B_1 \cdots a_n \otimes B_r).$$
This property of Kronecker products allows us to write $A \otimes B$ in a useful way. Partitioning $A$ and $B$ into their columns, we write $A = (a_1 \cdots a_n)$, $B = (b_1 \cdots b_q)$. Then
$$A \otimes B = (a_1 \otimes b_1 \cdots a_1 \otimes b_q \cdots a_n \otimes b_1 \cdots a_n \otimes b_q).$$
Third, note that if $A$ and $B$ are $m \times n$ and $p \times q$ matrices, respectively, and $x$ is any column vector, then
$$A(I_n \otimes x') = (A \otimes 1)(I_n \otimes x') = A \otimes x',$$
$$(x \otimes I_p)B = (x \otimes I_p)(1 \otimes B) = x \otimes B.$$
This property, coupled with the Kronecker product of $A \otimes B$, where $A$ is partitioned, affords us another useful way of writing $A \otimes B$. Partitioning $A$ into its columns, we obtain
$$A \otimes B = (a_1 \otimes B \cdots a_n \otimes B) = [(a_1 \otimes I_p)B \cdots (a_n \otimes I_p)B].$$
Finally, note that for $A$ $m \times n$ and $B$ $p \times q$
$$A \otimes B = (A \otimes I_p)(I_n \otimes B) = (I_m \otimes B)(A \otimes I_q).$$
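The factorizations at the end of this section are mostly a matter of keeping the orders of the identity matrices straight, which a short numerical sketch makes explicit. The dimensions and entries below are arbitrary choices for illustration.

```python
import numpy as np

m, n, p, q = 2, 3, 4, 2
A = np.arange(1.0, m * n + 1).reshape(m, n)    # m x n
B = np.arange(1.0, p * q + 1).reshape(p, q)    # p x q
x = np.array([[1.0], [-2.0], [0.5]])           # any column vector

AB = np.kron(A, B)                             # mp x nq

# A (x) B = (A (x) I_p)(I_n (x) B) = (I_m (x) B)(A (x) I_q)
via_left = np.kron(A, np.eye(p)) @ np.kron(np.eye(n), B)
via_right = np.kron(np.eye(m), B) @ np.kron(A, np.eye(q))

# Column form: A (x) B = (a_1 (x) b_1 ... a_1 (x) b_q ... a_n (x) b_q)
cols = np.hstack([np.kron(A[:, [i]], B[:, [j]])
                  for i in range(n) for j in range(q)])

# A (I_n (x) x') = A (x) x'   and   (x (x) I_p) B = x (x) B
row_identity = np.allclose(A @ np.kron(np.eye(n), x.T), np.kron(A, x.T))
col_identity = np.allclose(np.kron(x, np.eye(p)) @ B, np.kron(x, B))

print(np.allclose(AB, via_left), np.allclose(AB, via_right),
      np.allclose(AB, cols), row_identity, col_identity)
```

The identity orders follow from conformability: the left factorization uses the row count p of B and the column count n of A, while the right one uses the row count m of A and the column count q of B.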
2.3. THE VEC AND THE DEVEC OPERATORS 2.3.1. Basic Definitions Let A be an m x n matrix and a jbe its jth column. Then vec A is the mn x 1 vector a1
vec A = an
11
Elements of Matrix Algebra
that is, the vec operator transforms A into a column vector by stacking the columns of A one underneath the other. Let A be an m x n matrix and let be the ith row of A. Then devec A is the 1 x mn vector devec A = (a l' • • • am'), that is, the devec operator transforms A into a row vector by stacking the rows of A alongside each other. Clearly the two operators are intimately connected. Writing A = (al• • • a.), we obtain (vec A)' = (c4 • • • an) = devec A'.
(2.3)
Now let A' = B. Then vec B' = (devec B)'.
(2.4)
These basic relationships mean that results for one of the operators can be readily obtained from results for the other operator.

2.3.2. Vec, Devec, and Kronecker Products

A basic connection between our operators and the Kronecker products can be derived from the property noted in Section 2.2 that, for any two column vectors a and b, $ab' = b' \otimes a = a \otimes b'$. From this property it is clear that the jth column of ab' is $b_j a$, where $b_j$ is the jth element of b, so
$$\operatorname{vec} ab' = \operatorname{vec}(b' \otimes a) = b \otimes a. \qquad (2.5)$$
Also, the ith row of ab' is $a_i b'$, so
$$\operatorname{devec} ab' = \operatorname{devec}(a \otimes b') = a' \otimes b'.$$
More generally, if A, B, and C are three matrices such that the matrix product ABC is defined, then
$$\operatorname{vec} ABC = (C' \otimes A) \operatorname{vec} B. \qquad (2.6)$$
The corresponding result for the devec operator is
$$\operatorname{devec} ABC = [\operatorname{vec}(C'B'A')]' = [(A \otimes C') \operatorname{vec} B']' = (\operatorname{vec} B')'(A' \otimes C) = \operatorname{devec} B\,(A' \otimes C). \qquad (2.7)$$
Note that for A, an m x n matrix,
$$\operatorname{vec} A = (I_n \otimes A) \operatorname{vec} I_n = (A' \otimes I_m) \operatorname{vec} I_m, \qquad (2.8)$$
$$\operatorname{devec} A = (\operatorname{devec} I_n)(A' \otimes I_n) = (\operatorname{devec} I_m)(I_m \otimes A). \qquad (2.9)$$
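Equations (2.6) through (2.9) can be verified on small random matrices. This is an illustrative sketch only; the helper names `vec` and `devec` are assumptions introduced here, implemented with NumPy reshapes (column-major for vec, row-major for devec).

```python
import numpy as np

def vec(M):
    """Stack the columns of M underneath each other (mn x 1)."""
    return M.reshape(-1, 1, order="F")

def devec(M):
    """Stack the rows of M alongside each other (1 x mn)."""
    return M.reshape(1, -1)

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))
m, n = A.shape

# Eq. (2.6): vec ABC = (C' (x) A) vec B
ok_26 = np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))
# Eq. (2.7): devec ABC = devec B (A' (x) C)
ok_27 = np.allclose(devec(A @ B @ C), devec(B) @ np.kron(A.T, C))
# Eq. (2.8): vec A = (I_n (x) A) vec I_n = (A' (x) I_m) vec I_m
ok_28 = (np.allclose(vec(A), np.kron(np.eye(n), A) @ vec(np.eye(n)))
         and np.allclose(vec(A), np.kron(A.T, np.eye(m)) @ vec(np.eye(m))))
# Eq. (2.9): devec A = devec I_n (A' (x) I_n) = devec I_m (I_m (x) A)
ok_29 = (np.allclose(devec(A), devec(np.eye(n)) @ np.kron(A.T, np.eye(n)))
         and np.allclose(devec(A), devec(np.eye(m)) @ np.kron(np.eye(m), A)))

print(ok_26, ok_27, ok_28, ok_29)
```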
Special cases of these results that we shall have occasion to refer to are those for a, an m x 1 vector, and b, an n x 1 vector:
$$a = \operatorname{vec} a = (a' \otimes I_m) \operatorname{vec} I_m,$$
$$b = \operatorname{vec} b' = (I_n \otimes b') \operatorname{vec} I_n,$$
$$b' = \operatorname{devec} b' = (\operatorname{devec} I_n)(b \otimes I_n),$$
$$a' = \operatorname{devec} a = (\operatorname{devec} I_m)(I_m \otimes a).$$
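In future chapters, we often have to work with partitioned matrices. Suppose that A is an m x np matrix and partition A so that $A = (A_1 \cdots A_p)$, where each submatrix is m x n. Then it is clear that
$$\operatorname{vec} A = \begin{pmatrix} \operatorname{vec} A_1 \\ \vdots \\ \operatorname{vec} A_p \end{pmatrix}.$$
Suppose also that B is any n x q matrix and consider $A(I_p \otimes B) = (A_1 B \cdots A_p B)$. It follows that
$$\operatorname{vec} A(I_p \otimes B) = \begin{pmatrix} \operatorname{vec} A_1 B \\ \vdots \\ \operatorname{vec} A_p B \end{pmatrix} = \begin{pmatrix} I_q \otimes A_1 \\ \vdots \\ I_q \otimes A_p \end{pmatrix} \operatorname{vec} B.$$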
2.3.3. Vecs, Devecs, and Traces Traces of matrix products can conveniently be expressed in terms of the vec and the devec operators. It is well known, for example, that tr AB = (vec A')' vec B. However, from Eq. (2.3) we can now write tr AB = devec A vec B, thus avoiding the awkward expression (vec A')'. Similarly, we can obtain the many expressions for tr ABC= tr BCA = tr CAB, for example, by writing tr ABC = devec A vec BC = devec AB vec C, and then using equations (2.6) and (2.7).
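As a quick numerical illustration of the trace identities, the sketch below checks tr AB = devec A vec B and the two forms given for tr ABC. The helper names `vec` and `devec` and the test matrices are assumptions made for this example.

```python
import numpy as np

def vec(M):
    return M.reshape(-1, 1, order="F")   # columns stacked

def devec(M):
    return M.reshape(1, -1)              # rows placed side by side

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 3))

# tr AB = (vec A')' vec B = devec A vec B
tr_AB = np.trace(A @ B)
tr_AB_vec = (devec(A) @ vec(B)).item()

# tr ABC = devec A vec BC = devec AB vec C
tr_ABC = np.trace(A @ B @ C)
tr_ABC_1 = (devec(A) @ vec(B @ C)).item()
tr_ABC_2 = (devec(A @ B) @ vec(C)).item()

print(np.isclose(tr_AB, tr_AB_vec),
      np.isclose(tr_ABC, tr_ABC_1),
      np.isclose(tr_ABC, tr_ABC_2))
```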
2.3.4. Related Operators: Vech and $\bar{v}$

In taking the vec of a square matrix A, we form a column by using all the elements of A. The vech and the $\bar{v}$ operators form column vectors by using select elements of A. Let A be an n x n matrix:
$$A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}.$$
Then vech A is the $\tfrac{1}{2}n(n+1)$ x 1 vector
$$\operatorname{vech} A = (a_{11} \;\cdots\; a_{n1} \;\; a_{22} \;\cdots\; a_{n2} \;\cdots\; a_{nn})',$$
that is, we form vech A by stacking the elements of A on and below the main diagonal, one underneath the other. The vector $\bar{v}(A)$ is the $\tfrac{1}{2}n(n-1)$ x 1 vector given by
$$\bar{v}(A) = (a_{21} \;\cdots\; a_{n1} \;\; a_{32} \;\cdots\; a_{n2} \;\cdots\; a_{n\,n-1})',$$
that is, we form $\bar{v}(A)$ by stacking the elements of A below the main diagonal, one underneath the other. If A is a symmetric matrix, that is, A' = A, then $a_{ij} = a_{ji}$ and the elements of A below the main diagonal are duplicated by the elements above the main diagonal. Often we wish to form a vector from A that consists of the essential elements of A without duplication. Clearly the vech operator allows us to do this.
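The two operators can be sketched in a few lines. The function names `vech` and `vbar` are chosen here for illustration (they are not library functions), and the 3 x 3 symmetric matrix is an arbitrary example.

```python
import numpy as np

def vech(A):
    """Stack the elements on and below the main diagonal, column by column."""
    n = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(n)]).reshape(-1, 1)

def vbar(A):
    """Stack the elements strictly below the main diagonal, column by column."""
    n = A.shape[0]
    return np.concatenate([A[j + 1:, j] for j in range(n)]).reshape(-1, 1)

# A symmetric example, e.g. a covariance-like matrix
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 5.0],
              [3.0, 5.0, 6.0]])

print(vech(A).ravel())   # the n(n+1)/2 = 6 distinct elements
print(vbar(A).ravel())   # the n(n-1)/2 = 3 off-diagonal elements
```

For a covariance matrix, `vech` collects variances and covariances, whereas `vbar` collects the covariances only.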
An obvious application in statistics is that in which A is a covariance matrix. Then the unknown parameters associated with the covariance matrix are given by vech A. Suppose that now we wish to form a vector consisting of only the covariances of the covariance matrix but not the variances. Then such a vector is given by $\bar{v}(A)$. Before we leave this section, note that for a square matrix A, not necessarily symmetric, vec A contains all the elements in vech A and in $\bar{v}(A)$ and more besides. It follows then that we could obtain vech A and $\bar{v}(A)$ by premultiplying vec A by a matrix whose elements are zeros or ones strategically placed. Such matrices are examples of the zero-one matrices called elimination matrices. Elimination matrices and other zero-one matrices are discussed in the next chapter.

2.4. GENERALIZED VEC AND DEVEC OPERATORS
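In this section, we look at operators that are generalizations of the vec and the devec operators discussed in Section 2.3.

2.4.1. Generalized Vec Operator

Consider an m x p matrix partitioned into its columns $A = (a_1 \cdots a_p)$, where $a_j$ is the jth column of A. Then $\operatorname{vec}_1 A = \operatorname{vec} A$, that is, $\operatorname{vec}_1 A$ is the mp x 1 vector given by
$$\operatorname{vec}_1 A = \operatorname{vec} A = \begin{pmatrix} a_1 \\ \vdots \\ a_p \end{pmatrix}.$$
Suppose now that A is an m x 2p matrix $A = (a_1 \cdots a_{2p})$. Then we define $\operatorname{vec}_2 A$ as the mp x 2 matrix given by
$$\operatorname{vec}_2 A = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \\ \vdots & \vdots \\ a_{2p-1} & a_{2p} \end{pmatrix},$$
that is, to form $\operatorname{vec}_2 A$, we stack columns of A under each other, taking two at a time. More generally, if A is the m x np matrix $A = (a_1 \cdots a_{np})$, then $\operatorname{vec}_n A$ is the mp x n matrix given by
$$\operatorname{vec}_n A = \begin{pmatrix} a_1 & \cdots & a_n \\ a_{n+1} & \cdots & a_{2n} \\ \vdots & & \vdots \\ a_{n(p-1)+1} & \cdots & a_{np} \end{pmatrix}.$$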
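A generalized vec operator amounts to splitting the columns into blocks of n and stacking the blocks. The sketch below is one possible implementation; the function name `vec_n` and the small integer example are assumptions introduced for illustration.

```python
import numpy as np

def vec_n(A, n):
    """Generalized vec: stack the columns of A under each other, n at a time.

    A must be m x (n*p); the result is (m*p) x n.
    """
    m, K = A.shape
    if K % n != 0:
        raise ValueError("n must divide the number of columns of A")
    return np.vstack(np.hsplit(A, K // n))

A = np.arange(1, 13).reshape(2, 6)   # a 2 x 6 matrix, so K = 6

V1 = vec_n(A, 1)   # ordinary vec A, 12 x 1
V2 = vec_n(A, 2)   # 6 x 2
V3 = vec_n(A, 3)   # 4 x 3
V6 = vec_n(A, 6)   # A itself

print(V2)
```

With K = 6 columns the admissible values of n are 1, 2, 3, and 6, the divisors of K.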
Table 2.1. Vec Operations

K     Vec Operators Performable on A
1     1
2     1  2
3     1  3
4     1  2  4
5     1  5
6     1  2  3  6
7     1  7
8     1  2  4  8
9     1  3  9
10    1  2  5  10
11    1  11
12    1  2  3  4  6  12
13    1  13
14    1  2  7  14
15    1  3  5  15
16    1  2  4  8  16
17    1  17
18    1  2  3  6  9  18
19    1  19
20    1  2  4  5  10  20
For a given m x K matrix A, the number of generalized vec operations that can be performed on A clearly depends on the number of columns K of A. If K is a prime number, then only two generalized vec operators can be performed on A, $\operatorname{vec}_1 A = \operatorname{vec} A$ and $\operatorname{vec}_K A = A$. For K any other number, the number of generalized vec operations that can be performed on A is the number of divisors of K. We then have Table 2.1. (Footnote 1: Number theorists assure me that it has not been possible as yet to derive a sequence for Table 2.1.)

2.4.2. Generalized Devec Operator

In applying a generalized vec operator, we form a matrix by stacking a certain number of columns of a given matrix underneath each other. In applying a generalized devec operator, we form a matrix by stacking a certain number of rows of a given matrix alongside each other. Consider a p x m matrix B that we partition into its rows $B = (b^1 \cdots b^p)'$, where $b^{j\prime}$ is the jth row of B. Then $\operatorname{devec}_1 B = \operatorname{devec} B$ is the 1 x mp vector given by
$$\operatorname{devec}_1 B = \operatorname{devec} B = (b^{1\prime} \cdots b^{p\prime}).$$
Suppose now that B is a 2p x m matrix $B = (b^1 \cdots b^{2p})'$. Then we define $\operatorname{devec}_2 B$ as the 2 x mp matrix given by
$$\operatorname{devec}_2 B = \begin{pmatrix} b^{1\prime} & b^{3\prime} & \cdots & b^{(2p-1)\prime} \\ b^{2\prime} & b^{4\prime} & \cdots & b^{2p\prime} \end{pmatrix},$$
that is, to form $\operatorname{devec}_2 B$, we stack rows of B alongside each other, taking two at a time. More generally, if B is an np x m matrix $B = (b^1 \cdots b^{np})'$, then $\operatorname{devec}_n B$ is the n x mp matrix given by
$$\operatorname{devec}_n B = \begin{pmatrix} b^{1\prime} & b^{(n+1)\prime} & \cdots & b^{(n(p-1)+1)\prime} \\ \vdots & \vdots & & \vdots \\ b^{n\prime} & b^{2n\prime} & \cdots & b^{np\prime} \end{pmatrix}.$$
Generalized devec operators performable on a given K x m matrix are given by a table similar to Table 2.1.

2.4.3. $\operatorname{Vec}_n$ and $\operatorname{Devec}_n$ Operators

2.4.3.1. Basic Relationship Between the $\operatorname{Vec}_n$ and the $\operatorname{Devec}_n$ Operators
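Let A be an m x np matrix and write $A = (A_1 \cdots A_p)$, where each submatrix $A_i$ is m x n. Then
$$\operatorname{vec}_n A = \begin{pmatrix} A_1 \\ \vdots \\ A_p \end{pmatrix},$$
so
$$(\operatorname{vec}_n A)' = (A_1' \cdots A_p') = \operatorname{devec}_n A'. \qquad (2.10)$$
Now, letting B = A', we have
$$(\operatorname{devec}_n B)' = \operatorname{vec}_n B'. \qquad (2.11)$$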
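Equations (2.10) and (2.11) can be confirmed with a small sketch. The helper names `vec_n` and `devec_n` are assumptions made for this example; they split a matrix into blocks of n columns (or n rows) and restack them.

```python
import numpy as np

def vec_n(A, n):
    """Stack blocks of n columns of A underneath each other."""
    return np.vstack(np.hsplit(A, A.shape[1] // n))

def devec_n(B, n):
    """Place blocks of n rows of B alongside each other."""
    return np.hstack(np.vsplit(B, B.shape[0] // n))

rng = np.random.default_rng(3)
m, n, p = 2, 3, 4
A = rng.standard_normal((m, n * p))   # A is m x np
B = A.T                               # B is np x m

# Eq. (2.10): (vec_n A)' = devec_n A'
ok_210 = np.allclose(vec_n(A, n).T, devec_n(A.T, n))
# Eq. (2.11): (devec_n B)' = vec_n B'
ok_211 = np.allclose(devec_n(B, n).T, vec_n(B.T, n))

print(ok_210, ok_211, vec_n(A, n).shape, devec_n(B, n).shape)
```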
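Again, these basic relationships mean that we have to derive theorems for only one of these operators. We can then readily obtain the corresponding results for the other operator by using Eq. (2.10) or Eq. (2.11).

2.4.3.2. $\operatorname{Vec}_n$, $\operatorname{Devec}_n$, and Kronecker Products

Let A and B be p x q and m x n matrices, respectively, and write $A = (a_1 \cdots a_q)$, where $a_j$ is the jth column of A. Then
$$A \otimes B = (a_1 \otimes B \;\cdots\; a_q \otimes B),$$
so
$$\operatorname{vec}_n(A \otimes B) = \begin{pmatrix} a_1 \otimes B \\ \vdots \\ a_q \otimes B \end{pmatrix} = \operatorname{vec} A \otimes B. \qquad (2.12)$$
As a special case, $\operatorname{vec}_n(a' \otimes B) = a \otimes B$. Now write $A = (a^1 \cdots a^p)'$, where $a^{i\prime}$ is the ith row of A. Then
$$A \otimes B' = \begin{pmatrix} a^{1\prime} \otimes B' \\ \vdots \\ a^{p\prime} \otimes B' \end{pmatrix},$$
so
$$\operatorname{devec}_n(A \otimes B') = (a^{1\prime} \otimes B' \;\cdots\; a^{p\prime} \otimes B') = \operatorname{devec} A \otimes B'. \qquad (2.13)$$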
As a special case, note that $\operatorname{devec}_n(a \otimes B') = a' \otimes B'$. Equations (2.12) and (2.13) mean that for complicated Kronecker products, where A is a product matrix for example, the results for $\operatorname{vec}_n$ and $\operatorname{devec}_n$ for these Kronecker products can again be obtained from known results of the vec operator.

2.4.4. Obvious Relationships between Generalized Vecs and Generalized Devecs for a Given Matrix

This section finishes with references to some obvious relationships between our operators for a given T x K matrix A. Let $\operatorname{vec}_j A$ refer to a generalized vec operator that is performable on A and let $\operatorname{devec}_i A$ refer to a generalized devec operator that is also performable on A. Then clearly the following relationships hold:

$\operatorname{devec}(\operatorname{vec} A) = (\operatorname{vec} A)' = \operatorname{devec} A'$, and $\operatorname{vec}(\operatorname{devec} A) = \operatorname{vec} A' = (\operatorname{devec} A)'$.

$\operatorname{devec}_T(\operatorname{vec}_j A) = A$, and $\operatorname{vec}_K(\operatorname{devec}_i A) = A$.

$\operatorname{devec}(\operatorname{vec}_j A)$ is a 1 x TK vector whose elements are obtained from a permutation of those of $(\operatorname{vec} A)'$, and $\operatorname{vec}(\operatorname{devec}_i A)$ is a TK x 1 vector whose elements are obtained from a permutation of those of vec A.
2.4.5. Theorems about Generalized Vec and Devec Operators In this subsection, we derive results concerning the generalized vec and devec operators that are important in the application of our concepts in future sections. These results are summarized in the following theorems.
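Theorem 2.2. Let A, C, D, E, and $\alpha$ be m x np, r x m, n x s, p x q, and p x 1 matrices, respectively. Then

1. $\operatorname{vec}_n CA = (I_p \otimes C)\operatorname{vec}_n A$ and $\operatorname{devec}_n A'C' = \operatorname{devec}_n A'(I_p \otimes C')$,
2. $A(\alpha \otimes I_n) = (\alpha' \otimes I_m)\operatorname{vec}_n A$ and $(\alpha' \otimes I_n)A' = \operatorname{devec}_n A'(\alpha \otimes I_m)$,
3. $\operatorname{vec}_s[CA(I_p \otimes D)] = (I_p \otimes C)(\operatorname{vec}_n A)D$ and $\operatorname{devec}_s[(I_p \otimes D')A'C'] = D'(\operatorname{devec}_n A')(I_p \otimes C')$,
4. $\operatorname{vec}_s[A(E \otimes D)] = (I_q \otimes A)(\operatorname{vec} E \otimes D)$ and $\operatorname{devec}_s[(E' \otimes D')A'] = (\operatorname{devec} E' \otimes D')(I_q \otimes A')$.

Proof of Theorem 2.2. We need to prove the results for only the generalized vec operator as we can readily obtain the equivalent results for the devec operator by using the basic relationship given by Eq. (2.10).

1. Partition A as $A = (A_1 \cdots A_p)$, where each submatrix $A_j$ is m x n; it follows that
$$\operatorname{vec}_n CA = \begin{pmatrix} CA_1 \\ \vdots \\ CA_p \end{pmatrix} = (I_p \otimes C)\operatorname{vec}_n A.$$

2. Writing $\alpha = (\alpha_1 \cdots \alpha_p)'$, we have
$$A(\alpha \otimes I_n) = (A_1 \cdots A_p)\begin{pmatrix} \alpha_1 I_n \\ \vdots \\ \alpha_p I_n \end{pmatrix} = \alpha_1 A_1 + \alpha_2 A_2 + \cdots + \alpha_p A_p.$$
However,
$$(\alpha' \otimes I_m)\operatorname{vec}_n A = (\alpha_1 I_m \cdots \alpha_p I_m)\begin{pmatrix} A_1 \\ \vdots \\ A_p \end{pmatrix} = \alpha_1 A_1 + \alpha_2 A_2 + \cdots + \alpha_p A_p.$$

3. Clearly $CA(I_p \otimes D) = (CA_1 D \cdots CA_p D)$ and, as each of these submatrices is r x s,
$$\operatorname{vec}_s[CA(I_p \otimes D)] = \begin{pmatrix} CA_1 D \\ \vdots \\ CA_p D \end{pmatrix} = (I_p \otimes C)(\operatorname{vec}_n A)D.$$

4. From 1, we have
$$\operatorname{vec}_s[A(E \otimes D)] = (I_q \otimes A)\operatorname{vec}_s(E \otimes D) = (I_q \otimes A)(\operatorname{vec} E \otimes D),$$
where we obtain the last equality by using Eq. (2.12).
□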
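The four vec results of Theorem 2.2 can be checked numerically. The sketch below uses assumed helper names `vec` and `vec_n` and arbitrary dimensions; it is a sanity check under those assumptions, not a proof.

```python
import numpy as np

def vec(M):
    return M.reshape(-1, 1, order="F")

def vec_n(A, n):
    """Stack blocks of n columns of A underneath each other."""
    return np.vstack(np.hsplit(A, A.shape[1] // n))

rng = np.random.default_rng(4)
m, n, p, r, s, q = 2, 3, 2, 4, 5, 3
A = rng.standard_normal((m, n * p))      # m x np
C = rng.standard_normal((r, m))          # r x m
D = rng.standard_normal((n, s))          # n x s
E = rng.standard_normal((p, q))          # p x q
alpha = rng.standard_normal((p, 1))      # p x 1
I = np.eye

# 1. vec_n CA = (I_p (x) C) vec_n A
ok1 = np.allclose(vec_n(C @ A, n), np.kron(I(p), C) @ vec_n(A, n))
# 2. A(alpha (x) I_n) = (alpha' (x) I_m) vec_n A
ok2 = np.allclose(A @ np.kron(alpha, I(n)), np.kron(alpha.T, I(m)) @ vec_n(A, n))
# 3. vec_s[CA(I_p (x) D)] = (I_p (x) C)(vec_n A) D
ok3 = np.allclose(vec_n(C @ A @ np.kron(I(p), D), s),
                  np.kron(I(p), C) @ vec_n(A, n) @ D)
# 4. vec_s[A(E (x) D)] = (I_q (x) A)(vec E (x) D)
ok4 = np.allclose(vec_n(A @ np.kron(E, D), s),
                  np.kron(I(q), A) @ np.kron(vec(E), D))

print(ok1, ok2, ok3, ok4)
```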
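A special case that is of interest to us is that in which s = n so D is a square n x n matrix. This case is summarized in the following corollary.

Corollary 2.1. For A, C, and E as prescribed in Theorem 2.2 and D n x n,
$$\operatorname{vec}_n[CA(I_p \otimes D)] = (I_p \otimes C)(\operatorname{vec}_n A)D,$$
$$\operatorname{devec}_n[(I_p \otimes D')A'C'] = D' \operatorname{devec}_n A'(I_p \otimes C'),$$
$$\operatorname{vec}_n[A(E \otimes D)] = (I_q \otimes A)(\operatorname{vec} E \otimes D),$$
$$\operatorname{devec}_n[(E' \otimes D')A'] = (\operatorname{devec} E' \otimes D')(I_q \otimes A').$$

Often A is a square np x np matrix. For such a matrix, we have the following theorem.

Theorem 2.3. Let A be an np x np matrix so each submatrix is np x n, and let D and $\alpha$ be as prescribed in Theorem 2.2. Then
$$\operatorname{vec}_n[(D' \otimes \alpha')A] = (I_p \otimes D' \otimes \alpha')\operatorname{vec}_n A,$$
$$\operatorname{devec}_n[A'(D \otimes \alpha)] = \operatorname{devec}_n A'(I_p \otimes D \otimes \alpha).$$

Proof of Theorem 2.3. We prove the result for the generalized vec operator. We write $A = (A_1 \cdots A_p)$, where each $A_i$ is now np x n. Then clearly
$$(D' \otimes \alpha')A = [(D' \otimes \alpha')A_1 \;\cdots\; (D' \otimes \alpha')A_p],$$
and, as each submatrix is s x n,
$$\operatorname{vec}_n(D' \otimes \alpha')A = \begin{pmatrix} (D' \otimes \alpha')A_1 \\ \vdots \\ (D' \otimes \alpha')A_p \end{pmatrix} = (I_p \otimes D' \otimes \alpha')\operatorname{vec}_n A.$$
□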
2.4.6. A More Convenient Notation

Generalized vec and devec operators arise naturally when we are working with partitioned m x np or np x m matrices, and it is convenient at this stage to introduce a separate notation for these operators that is more manageable. Let A be an m x np matrix and write $A = (A_1 \cdots A_p)$, where each submatrix $A_i$ is m x n. Similarly let B be an np x m matrix and write
$$B = \begin{pmatrix} B_1 \\ \vdots \\ B_p \end{pmatrix},$$
where each submatrix $B_i$ is n x m. It is clear that the operators $\operatorname{vec}_n$ and $\operatorname{devec}_n$ are performable on A and B, respectively.

NOTATION. Let $A^{\tau} = \operatorname{vec}_n A$ and $B^{\bar\tau} = \operatorname{devec}_n B$.

Our future work will involve the intensive use of this notation so it is convenient to list the results we have obtained so far in terms of this new notation.

2.4.6.1. Properties of the Operator $\tau$

1. Let A and B be p x q and m x n matrices, respectively. Then from subsection 2.4.3.2 on $\operatorname{vec}_n$, $\operatorname{devec}_n$, and Kronecker products we have $(A \otimes B)^{\tau} = \operatorname{vec} A \otimes B$. As a special case, let x and y be q x 1 and n x 1 vectors, respectively; then $(x' \otimes y')^{\tau} = x \otimes y'$.

2. From the basic relationship linking $\operatorname{vec}_n$ and $\operatorname{devec}_n$ given by Eq. (2.10) we have
$$(A^{\tau})' = (A')^{\bar\tau}. \qquad (2.14)$$

3. For A, C, E, and $\alpha$ as prescribed by Theorem 2.2 and for D an n x n matrix we have
a. $(CA)^{\tau} = (I_p \otimes C)A^{\tau}$,
b. $A(\alpha \otimes I_n) = (\alpha' \otimes I_m)A^{\tau}$,
c. $[CA(I_p \otimes D)]^{\tau} = (I_p \otimes C)A^{\tau}D$,
d. $[A(E \otimes D)]^{\tau} = (I_q \otimes A)(\operatorname{vec} E \otimes D)$.

4. From the same theorem, if we set D = x, an n x 1 vector, so that s = 1 we have
e. $\operatorname{vec} A(I_p \otimes x) = A^{\tau}x$,
f. $\operatorname{vec}[CA(I_p \otimes x)] = (I_p \otimes C)A^{\tau}x$.

5. If we set D = x, an n x 1 vector, and C = y', a 1 x m vector, so that s = 1 and r = 1, we have
g. $(y'A)^{\tau} = (I_p \otimes y')A^{\tau}$,
h. $[y'A(I_p \otimes x)]' = (I_p \otimes y')A^{\tau}x$.

2.4.6.2. Properties of the Operator $\bar\tau$

The following are the equivalent properties for the operator $\bar\tau$.

1. Let A and B be p x q and m x n matrices, respectively. Then $(A' \otimes B')^{\bar\tau} = \operatorname{devec} A' \otimes B'$, and $(x \otimes y)^{\bar\tau} = x' \otimes y$.

2.
$$(B^{\bar\tau})' = (B')^{\tau}. \qquad (2.15)$$

3. Letting B = A' and with A, C, E, and $\alpha$ as prescribed in Theorem 2.2, we have
a. $(BC')^{\bar\tau} = B^{\bar\tau}(I_p \otimes C')$,
b. $(\alpha' \otimes I_n)B = B^{\bar\tau}(\alpha \otimes I_m)$,
c. $[(I_p \otimes D')BC']^{\bar\tau} = D'B^{\bar\tau}(I_p \otimes C')$,
d. $[(E' \otimes D')B]^{\bar\tau} = (\operatorname{devec} E' \otimes D')(I_q \otimes B)$.

4. For x, an n x 1 vector,
e. $\operatorname{devec}(I_p \otimes x')B = x'B^{\bar\tau}$,
f. $\operatorname{devec}[(I_p \otimes x')BC'] = x'B^{\bar\tau}(I_p \otimes C')$.

5. If x and y are n x 1 and m x 1 vectors, respectively,
g. $(By)^{\bar\tau} = B^{\bar\tau}(I_p \otimes y)$,
h. $[(I_p \otimes x')By]' = x'B^{\bar\tau}(I_p \otimes y)$.

2.4.6.3. Word of Warning on the Notation

As A is m x np, the operator $\tau$ could be defined as $A^{\tau} = \operatorname{vec}_n A$ or $A^{\tau} = \operatorname{vec}_p A$, and of course these two matrices are not the same. The convenience of presenting the operator as a superscript far outweighs this ambiguity. However, in cases in which the possibility of confusion exists, we use the notation
$$A^{\tau_n} = \operatorname{vec}_n A, \qquad A^{\tau_p} = \operatorname{vec}_p A.$$
A similar notation is used for devecs. Given the extensive use we make of the operators $\tau$ and $\bar\tau$, it behooves us to distinguish them clearly from a similar operator, namely the transpose operator. Although the transpose operator is defined on any matrix and thus has more general applicability than the operators $\tau$ and $\bar\tau$, it can still be compared with these operators in the case in which A is an m x np matrix and B is an np x m matrix. This comparison is discussed in the next subsection.

2.4.7. The Operators $\tau$, $\bar\tau$ and the Transpose Operator

Let A and B be m x np and np x m matrices, respectively, and write
$$A = (A_1 \cdots A_p), \qquad B = \begin{pmatrix} B_1 \\ \vdots \\ B_p \end{pmatrix},$$
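where each submatrix $A_j$ is m x n and each submatrix $B_i$ is n x m. Let $A^{\tau} = \operatorname{vec}_n A$ and $B^{\bar\tau} = \operatorname{devec}_n B$. Then the following properties highlight the differences between the generalized vec and devec operators and the transpose operator.

1.
$$A' = \begin{pmatrix} A_1' \\ \vdots \\ A_p' \end{pmatrix}, \quad A^{\tau} = \begin{pmatrix} A_1 \\ \vdots \\ A_p \end{pmatrix}, \quad B' = (B_1' \cdots B_p'), \quad B^{\bar\tau} = (B_1 \cdots B_p).$$

2.
$$A'A = \begin{pmatrix} A_1'A_1 & \cdots & A_1'A_p \\ \vdots & & \vdots \\ A_p'A_1 & \cdots & A_p'A_p \end{pmatrix}, \qquad B'B = \sum_{i=1}^{p} B_i'B_i,$$
whereas for the case m = n only
$$A^{\tau}A = \begin{pmatrix} A_1^2 & \cdots & A_1A_p \\ \vdots & & \vdots \\ A_pA_1 & \cdots & A_p^2 \end{pmatrix}, \qquad B^{\bar\tau}B = \sum_{i=1}^{p} B_i^2.$$

3.
$$AA' = \sum_{i=1}^{p} A_iA_i', \qquad BB' = \begin{pmatrix} B_1B_1' & \cdots & B_1B_p' \\ \vdots & & \vdots \\ B_pB_1' & \cdots & B_pB_p' \end{pmatrix},$$
whereas for the case m = n only
$$AA^{\tau} = \sum_{i=1}^{p} A_i^2, \qquad BB^{\bar\tau} = \begin{pmatrix} B_1^2 & \cdots & B_1B_p \\ \vdots & & \vdots \\ B_pB_1 & \cdots & B_p^2 \end{pmatrix}.$$

4.
$$(AB)' = B'A' = \sum_{i=1}^{p} B_i'A_i', \qquad B^{\bar\tau}A^{\tau} = \sum_{i=1}^{p} B_iA_i.$$
For the case m = n,
$$(AB)^{\tau} = AB = \sum_{i=1}^{p} A_iB_i = (AB)^{\bar\tau}.$$

5. If A is square and symmetric, A' = A. If A is square n x n, $A^{\tau} = A^{\bar\tau} = A$.

6. (A')' = A, (B')' = B. For the case m = n, $(A^{\tau})^{\bar\tau} = A$, $(B^{\bar\tau})^{\tau} = B$.

7.
$$(A^{\tau})' = (A')^{\bar\tau} = (A_1' \cdots A_p'), \qquad (B^{\bar\tau})' = (B')^{\tau} = \begin{pmatrix} B_1' \\ \vdots \\ B_p' \end{pmatrix}.$$

8. For x, an n x 1 vector, the transpose is x' and $x^{\bar\tau} = \operatorname{devec} x = x'$.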
9. One final distinction can be drawn between the generalized vec and devec operators on the one hand and the transpose operator on the other. From the properties of the transpose operator and the Kronecker products, $[A(E \otimes D)]' = (E' \otimes D')A'$. The corresponding results for the generalized vec and devec operators are given by the following theorem.

Theorem 2.4. Let A, B, E, and D be m x np, np x m, p x q, and n x n matrices, respectively. Then
$$[A(E \otimes D)]^{\tau} = (E' \otimes I_m)A^{\tau}D,$$
$$[(E' \otimes D')B]^{\bar\tau} = D'B^{\bar\tau}(E \otimes I_m).$$

Proof of Theorem 2.4. From the properties of the operator $\tau$,
$$[A(E \otimes D)]^{\tau} = (I_q \otimes A)(\operatorname{vec} E \otimes D) = (I_q \otimes A)(\operatorname{vec} E \otimes I_n)D.$$
Writing $E = (\varepsilon_1 \cdots \varepsilon_q)$, where $\varepsilon_i$ is the ith column of E, we have
$$(I_q \otimes A)(\operatorname{vec} E \otimes I_n) = \begin{pmatrix} A(\varepsilon_1 \otimes I_n) \\ \vdots \\ A(\varepsilon_q \otimes I_n) \end{pmatrix} = \begin{pmatrix} (\varepsilon_1' \otimes I_m)A^{\tau} \\ \vdots \\ (\varepsilon_q' \otimes I_m)A^{\tau} \end{pmatrix} = (E' \otimes I_m)A^{\tau}.$$
We obtain the corresponding result for the devec operator by taking the transpose and by using Eq. (2.14) or Eq. (2.15). □

2.5. TRIANGULAR MATRICES AND BAND MATRICES
Special square matrices that crop up a lot in our future work are triangular matrices and band matrices. In this section, these matrices are defined and some
theorems concerning triangular matrices that are important in our forthcoming analysis are proved. A square matrix A, n x n, is upper triangular if
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ 0 & a_{22} & \cdots & a_{2n} \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & a_{nn} \end{pmatrix}$$
and lower triangular if
$$A = \begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ a_{21} & a_{22} & & \vdots \\ \vdots & & \ddots & 0 \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}.$$
If in addition $a_{ii}$ is 0 for i = 1, . . . , n, A is said to be strictly triangular. Working with triangular matrices is relatively easy as their mathematical properties are simple. For example, if A is upper (lower) triangular then A' is lower (upper) triangular. The determinant of a triangular matrix is the product of its main diagonal elements. The product of a finite number of upper (lower) triangular matrices is also upper (lower) triangular, and if one of the matrices in the product is strictly upper (lower) triangular the product itself is strictly upper (lower) triangular. Two theorems that we will appeal to often in future chapters are the following.

Theorem 2.5. If A is lower (upper) triangular with ones as its main diagonal elements then A is nonsingular and $A^{-1}$ is also lower (upper) triangular with ones as its main diagonal elements.

Proof of Theorem 2.5. Clearly A is nonsingular as |A| = 1. We prove the result by mathematical induction for the case in which A is lower triangular. Consider
$$A_2 = \begin{pmatrix} 1 & 0 \\ a & 1 \end{pmatrix},$$
where a is a constant. Then
$$A_2^{-1} = \begin{pmatrix} 1 & 0 \\ -a & 1 \end{pmatrix},$$
so it is clearly true for this case. Suppose it is true for $A_n$, an n x n matrix with the prescribed characteristics, and consider
$$A_{n+1} = \begin{pmatrix} A_n & 0 \\ a' & 1 \end{pmatrix},$$
where a is an n x 1 vector. Clearly
$$A_{n+1}^{-1} = \begin{pmatrix} A_n^{-1} & 0 \\ -a'A_n^{-1} & 1 \end{pmatrix},$$
so it is also true for an (n + 1) x (n + 1) matrix. We establish the proof for the upper-triangular case by taking transposes. □

Theorem 2.6. Suppose A is an nG x nG matrix and let
$$A = \begin{pmatrix} A_{11} & \cdots & A_{1G} \\ \vdots & & \vdots \\ A_{G1} & \cdots & A_{GG} \end{pmatrix},$$
where each n x n matrix $A_{ii}$, i = 1, . . . , G, is lower (upper) triangular with ones along its main diagonal and each n x n matrix $A_{ij}$, $i \neq j$, is strictly lower (upper) triangular. Suppose A is nonsingular and let
$$A^{-1} = \begin{pmatrix} A^{11} & \cdots & A^{1G} \\ \vdots & & \vdots \\ A^{G1} & \cdots & A^{GG} \end{pmatrix}.$$
Then each n x n matrix $A^{ii}$, i = 1, . . . , G, is also lower (upper) triangular with ones as its main diagonal elements and each $A^{ij}$, $i \neq j$, is also strictly lower (upper) triangular.

Proof of Theorem 2.6. We use mathematical induction to establish the result for the lower-triangular case. We then obtain the upper-triangular proof by taking transposes. Consider $A_2$, a 2n x 2n matrix, and let
$$A_2 = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$$
where the submatrices are n x n with the characteristics prescribed by the theorem. Then as $|A_{11}| = 1$, $A_{11}$ is nonsingular and lower triangular so $D = A_{22} - A_{21}A_{11}^{-1}A_{12}$ exists. Now $A_{21}A_{11}^{-1}A_{12}$ is the product of lower-triangular matrices and, as $A_{21}$ is strictly lower triangular, this product is also strictly lower triangular. It follows then that D is lower triangular with ones as its main diagonal elements so |D| = 1 and D is nonsingular. Let
$$A_2^{-1} = \begin{pmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{pmatrix}.$$
Then
$$A^{11} = A_{11}^{-1} + A_{11}^{-1}A_{12}D^{-1}A_{21}A_{11}^{-1}, \qquad A^{12} = -A_{11}^{-1}A_{12}D^{-1},$$
$$A^{21} = -D^{-1}A_{21}A_{11}^{-1}, \qquad A^{22} = D^{-1}.$$
Then, by using Theorem 2.5 and properties of products of triangular matrices, we clearly see that the submatrices $A^{ij}$ have the required characteristics. Suppose now that it is true for $A_p$, an np x np matrix, and consider
$$A_{p+1} = \begin{pmatrix} A_p & B_{12} \\ B_{21} & A_{p+1\,p+1} \end{pmatrix},$$
where $B_{21} = (A_{p+1\,1} \cdots A_{p+1\,p})$, $B_{12} = (A_{1\,p+1}' \cdots A_{p\,p+1}')'$, all the submatrices $A_{p+1\,i}$ and $A_{i\,p+1}$, i = 1, . . . , p, are n x n and strictly lower triangular, and $A_{p+1\,p+1}$ is lower triangular with ones as its main diagonal elements. Let
$$A_p^{-1} = \begin{pmatrix} A^{11} & \cdots & A^{1p} \\ \vdots & & \vdots \\ A^{p1} & \cdots & A^{pp} \end{pmatrix},$$
where, by assumption, $A_p^{-1}$ exists and each of the n x n submatrices $A^{ij}$ has the desired characteristics. Consider
$$F = A_{p+1\,p+1} - B_{21}A_p^{-1}B_{12},$$
where $B_{21}A_p^{-1}B_{12} = \sum_i \sum_j A_{p+1\,i}A^{ij}A_{j\,p+1}$ is the sum of products of lower-triangular matrices and each of these products is, in fact, strictly lower triangular. It follows that F is lower triangular with ones as its main diagonal elements so |F| = 1 and F is nonsingular. Let
$$A_{p+1}^{-1} = \begin{pmatrix} B^{11} & B^{12} \\ B^{21} & B^{22} \end{pmatrix}.$$
Then
$$B^{11} = A_p^{-1} + A_p^{-1}B_{12}F^{-1}B_{21}A_p^{-1}, \qquad B^{12} = -A_p^{-1}B_{12}F^{-1},$$
$$B^{21} = -F^{-1}B_{21}A_p^{-1}, \qquad B^{22} = F^{-1}.$$
Expanding $B_{21}A_p^{-1}$ and $A_p^{-1}B_{12}$ as we did above and by using Theorem 2.5, we clearly see that the $B^{ij}$s have the required characteristics. □

In a triangular matrix, zeros appear on one side of the main diagonal whereas nonzero elements of the matrix appear on the other side of the main diagonal. In a band matrix, the nonzero elements appear in a band surrounded on both sides by zeros. More formally, a square n x n matrix $A = (a_{ij})$ is a band matrix with bandwidth r + s + 1 if $a_{ij} = 0$ for i - j > s and for j - i > r, where r and s are both less than n. That is,
$$A = \begin{pmatrix}
a_{11} & \cdots & a_{1\,r+1} & 0 & \cdots & 0 \\
\vdots & & & \ddots & \ddots & \vdots \\
a_{s+1\,1} & & & & \ddots & 0 \\
0 & \ddots & & & & a_{n-r\,n} \\
\vdots & \ddots & \ddots & & & \vdots \\
0 & \cdots & 0 & a_{n\,n-s} & \cdots & a_{nn}
\end{pmatrix}.$$
In Subsection 3.7.4.3 of Chap. 3, we shall look at special types of band matrices that are linear combinations of a special zero-one matrix called a shifting matrix.
3
Zero-One Matrices
3.1. INTRODUCTION
A matrix whose elements are all either one or zero is, naturally enough, called a zero-one matrix. Probably the first zero-one matrix to appear in statistics and econometrics was a selection matrix. The columns of a selection matrix are made up of appropriately chosen columns from an identity matrix. When a given matrix A is postmultiplied by a selection matrix, the result is a matrix whose columns are selected columns from A. A related matrix is a permutation matrix whose columns are obtained from a permutation of the columns of an identity matrix. When a given matrix A is (premultiplied) postmultiplied by a permutation matrix, the resultant matrix is one whose (rows) columns we obtain by permuting the (rows) columns of A. Both selection matrices and permutation matrices are used throughout this book. Several zero-one matrices are associated with the vec, vech, and $\bar{v}$ operators discussed in Chap. 2. These are commutation matrices, elimination matrices, and duplication matrices. These matrices are important in our work as they arise naturally in matrix calculus. Other zero-one matrices that are important in this context are generalized vecs and devecs of commutation matrices. A final zero-one matrix that is used throughout this book is what I call a shifting matrix. When a given matrix A is (premultiplied) postmultiplied by a shifting matrix, the (rows) columns of A are shifted (down) across a number of places and the spaces thus created are filled with zeros. Shifting matrices are useful, at least as far as asymptotic theory is concerned, in writing time-series processes in matrix notation. Their relationship to Toeplitz, circulant, and forward-shift matrices is explained. Well-known results about our zero-one matrices are presented with references only, the proofs being reserved for results that I believe are new or that are presented in a different light than usual.
3.2. SELECTION MATRICES AND PERMUTATION MATRICES

Consider A, an m x n matrix, and write $A = (a_1 \cdots a_n)$, where $a_i$ is the ith column of A. Suppose from A that we wish to form a new matrix B whose columns consist, say, of the first, fourth, and fifth columns of A. Let $S = (e_1^n \; e_4^n \; e_5^n)$, where $e_j^n$ is the jth column of the n x n identity matrix $I_n$. Then clearly
$$AS = (a_1 \; a_4 \; a_5) = B.$$
The matrix S, whose columns are made up of a selection of columns from an identity matrix, is called a selection matrix. Selection matrices have an obvious application in econometrics. The matrix A, for example, may represent the observations on all the endogenous variables in an econometric model, and the matrix B may represent the observations on the endogenous variables that appear on the right-hand side of a particular equation in the model. Often it is mathematically convenient for us to use selection matrices and write B = AS. This property generalizes to the case in which our matrices are partitioned matrices. Suppose A is an m x np matrix partitioned into p matrices, each m x n; that is, we write $A = (A_1 \cdots A_p)$, where each submatrix $A_i$ is m x n. Suppose we wish to form a new matrix B from A made up of $A_1$, $A_4$, and $A_5$. Then
$$B = (A_1 \; A_4 \; A_5) = A(S \otimes I_n),$$
where $S = (e_1^p \; e_4^p \; e_5^p)$ is formed as at the beginning of this section, now from the columns of $I_p$. A permutation matrix P is obtained from a permutation of the columns or rows of an identity matrix. The result is a matrix in which each row and each column of the matrix contains a single element, 1, and all the remaining elements are 0s. As the columns of an identity matrix form an orthonormal set of vectors it is quite clear that every permutation matrix is orthogonal, that is, $P' = P^{-1}$. When a given matrix A is premultiplied (postmultiplied) by a permutation matrix, the result is a matrix whose rows (columns) are obtained from a permutation of the rows (columns) of A.

3.3. THE COMMUTATION MATRIX AND GENERALIZED VECS AND DEVECS OF THE COMMUTATION MATRIX
3.3.1. The Commutation Matrix

Consider an m x n matrix A and write
$$A = (a_1 \cdots a_n) = \begin{pmatrix} a^{1\prime} \\ \vdots \\ a^{m\prime} \end{pmatrix},$$
where $a_j$ is the jth column of A and $a^{i\prime}$ is the ith row of A. Then
$$\operatorname{vec} A = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix}, \quad \text{whereas} \quad \operatorname{vec} A' = \begin{pmatrix} a^1 \\ \vdots \\ a^m \end{pmatrix}.$$
Clearly both vec A and vec A' contain all the elements of A, although arranged in different orders. It follows that there exists an mn x mn permutation matrix $K_{mn}$ that has the property
$$K_{mn} \operatorname{vec} A = \operatorname{vec} A'.$$
This matrix is called the commutation matrix. Under our notation $K_{nm}$ is the commutation matrix associated with an n x m matrix. The two commutation matrices, $K_{mn}$ and $K_{nm}$, are linked by
$$K_{nm}K_{mn} \operatorname{vec} A = \operatorname{vec} A,$$
so it follows that $K_{nm} = K_{mn}^{-1} = K_{mn}'$, where the last equality comes about because $K_{mn}$ is a permutation matrix. Note also that $K_{1n} = K_{n1} = I_n$.

Explicit expressions for the commutation matrix $K_{mn}$ that are used extensively throughout this book are
$$K_{mn} = \begin{pmatrix} I_n \otimes e_1^{m\prime} \\ I_n \otimes e_2^{m\prime} \\ \vdots \\ I_n \otimes e_m^{m\prime} \end{pmatrix} = [\,I_m \otimes e_1^n \;\; I_m \otimes e_2^n \;\cdots\; I_m \otimes e_n^n\,],$$
where $e_j^m$ is the jth column of the m x m identity matrix $I_m$ and $e_i^n$ is the ith column of the n x n identity matrix $I_n$. For example,
$$K_{23} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}.$$
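A commutation matrix is straightforward to build by placing a single 1 in each row. The sketch below is one way to do this; the function name `commutation` is an assumption introduced here, and the checks reproduce the defining property, the $K_{23}$ example, and the block expression built from $I_n \otimes e_i^{m\prime}$.

```python
import numpy as np

def vec(M):
    return M.reshape(-1, 1, order="F")

def commutation(m, n):
    """Return K_mn, the mn x mn matrix with K_mn vec(A) = vec(A') for A m x n."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # a_ij sits at position j*m + i of vec A and i*n + j of vec A'
            K[i * n + j, j * m + i] = 1.0
    return K

K23 = commutation(2, 3)
A = np.arange(1.0, 7.0).reshape(2, 3)

# Block form: rows I_n (x) e_i^m' stacked for i = 1..m
m, n = 2, 3
K_blocks = np.vstack([np.kron(np.eye(n), np.eye(m)[[i], :]) for i in range(m)])

print(K23.astype(int))
print(np.allclose(K23 @ vec(A), vec(A.T)))                    # defining property
print(np.allclose(commutation(3, 2), K23.T))                  # K_nm = K_mn'
print(np.allclose(commutation(3, 2) @ K23, np.eye(6)))        # K_nm K_mn = I
print(np.allclose(K_blocks, K23))                             # explicit expression
```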
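3.3.2. Commutation Matrices, Kronecker Products, and Vecs

An essential property of commutation matrices is that they allow us to interchange matrices in a Kronecker product. In particular, the following results are well known [see Neudecker and Magnus (1988), p. 47]. Let A be an m x n matrix, let B be a p x q matrix, and let b be a p x 1 vector. Then
$$K_{pm}(A \otimes B) = (B \otimes A)K_{qn},$$
$$K_{pm}(A \otimes B)K_{nq} = B \otimes A,$$
$$K_{pm}(A \otimes b) = b \otimes A,$$
$$K_{mp}(b \otimes A) = A \otimes b.$$
Another interesting property of commutation matrices with respect to Kronecker products is not so well known. Consider A and B as above and partition B into its rows:
$$B = \begin{pmatrix} b^{1\prime} \\ \vdots \\ b^{p\prime} \end{pmatrix},$$
where $b^{i\prime}$ is the ith row of B. Then we saw in Section 2.2 of Chap. 2 that, in general,
$$A \otimes B = A \otimes \begin{pmatrix} b^{1\prime} \\ \vdots \\ b^{p\prime} \end{pmatrix} \neq \begin{pmatrix} A \otimes b^{1\prime} \\ \vdots \\ A \otimes b^{p\prime} \end{pmatrix}.$$
However, we have the following theorem for these matrices.

Theorem 3.1.
$$K_{pm}(A \otimes B) = \begin{pmatrix} A \otimes b^{1\prime} \\ \vdots \\ A \otimes b^{p\prime} \end{pmatrix}.$$

Proof of Theorem 3.1.
$$K_{pm}(A \otimes B) = \begin{pmatrix} I_m \otimes e_1^{p\prime} \\ \vdots \\ I_m \otimes e_p^{p\prime} \end{pmatrix}(A \otimes B) = \begin{pmatrix} A \otimes e_1^{p\prime}B \\ \vdots \\ A \otimes e_p^{p\prime}B \end{pmatrix} = \begin{pmatrix} A \otimes b^{1\prime} \\ \vdots \\ A \otimes b^{p\prime} \end{pmatrix}.$$
□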
Similarly, we partition B into its columns so that $B = (b_1 \cdots b_q)$, where $b_j$ is the jth column of B. Then we saw in Section 2.2 of Chap. 2 that, in general,
$$A \otimes B = A \otimes (b_1 \cdots b_q) \neq (A \otimes b_1 \;\cdots\; A \otimes b_q).$$
However, we have the following result.

Theorem 3.2.
$$(A \otimes B)K_{nq} = (A \otimes b_1 \;\cdots\; A \otimes b_q).$$

Proof of Theorem 3.2.
$$(A \otimes B)K_{nq} = (A \otimes B)(I_n \otimes e_1^q \;\cdots\; I_n \otimes e_q^q) = (A \otimes Be_1^q \;\cdots\; A \otimes Be_q^q) = (A \otimes b_1 \;\cdots\; A \otimes b_q).$$
□
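Theorems 3.1 and 3.2, together with the interchange property $K_{pm}(A \otimes B)K_{nq} = B \otimes A$, can be checked on random matrices. The helper `commutation` is an assumed name introduced for this sketch, built so that $K_{mn}\operatorname{vec} A = \operatorname{vec} A'$.

```python
import numpy as np

def commutation(m, n):
    """K_mn with K_mn vec(A) = vec(A') for an m x n matrix A."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[i * n + j, j * m + i] = 1.0
    return K

rng = np.random.default_rng(5)
m, n, p, q = 2, 3, 4, 2
A = rng.standard_normal((m, n))
B = rng.standard_normal((p, q))
K_pm, K_nq = commutation(p, m), commutation(n, q)

# Theorem 3.1: K_pm (A (x) B) stacks A (x) (ith row of B), i = 1..p
thm31 = np.vstack([np.kron(A, B[[i], :]) for i in range(p)])
ok31 = np.allclose(K_pm @ np.kron(A, B), thm31)

# Theorem 3.2: (A (x) B) K_nq places A (x) (jth column of B) side by side
thm32 = np.hstack([np.kron(A, B[:, [j]]) for j in range(q)])
ok32 = np.allclose(np.kron(A, B) @ K_nq, thm32)

# Interchange property: K_pm (A (x) B) K_nq = B (x) A
ok_swap = np.allclose(K_pm @ np.kron(A, B) @ K_nq, np.kron(B, A))

print(ok31, ok32, ok_swap)
```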
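Related theorems are the following.

Theorem 3.3.
$$(K_{qm} \otimes I_p)(A \otimes \operatorname{vec} B) = \begin{pmatrix} A \otimes b_1 \\ \vdots \\ A \otimes b_q \end{pmatrix}.$$

Proof of Theorem 3.3.
$$(K_{qm} \otimes I_p)(A \otimes \operatorname{vec} B) = \begin{pmatrix} I_m \otimes (e_1^{q\prime} \otimes I_p) \\ \vdots \\ I_m \otimes (e_q^{q\prime} \otimes I_p) \end{pmatrix}(A \otimes \operatorname{vec} B) = \begin{pmatrix} A \otimes (e_1^{q\prime} \otimes I_p)\operatorname{vec} B \\ \vdots \\ A \otimes (e_q^{q\prime} \otimes I_p)\operatorname{vec} B \end{pmatrix} = \begin{pmatrix} A \otimes b_1 \\ \vdots \\ A \otimes b_q \end{pmatrix}.$$
□

Theorem 3.4.
$$(I_n \otimes K_{pm})(\operatorname{vec} A \otimes B) = \begin{pmatrix} B \otimes a_1 \\ \vdots \\ B \otimes a_n \end{pmatrix}.$$

Proof of Theorem 3.4.
$$(I_n \otimes K_{pm})(\operatorname{vec} A \otimes B) = \begin{pmatrix} K_{pm} & & 0 \\ & \ddots & \\ 0 & & K_{pm} \end{pmatrix}\begin{pmatrix} a_1 \otimes B \\ \vdots \\ a_n \otimes B \end{pmatrix} = \begin{pmatrix} K_{pm}(a_1 \otimes B) \\ \vdots \\ K_{pm}(a_n \otimes B) \end{pmatrix} = \begin{pmatrix} B \otimes a_1 \\ \vdots \\ B \otimes a_n \end{pmatrix}.$$
□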
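Using the fact that $K_{mn}^{-1} = K_{nm}$, we find that these theorems lead to the following corollaries:
$$A \otimes B = K_{mp}\begin{pmatrix} A \otimes b^{1\prime} \\ \vdots \\ A \otimes b^{p\prime} \end{pmatrix},$$
$$A \otimes B = (A \otimes b_1 \;\cdots\; A \otimes b_q)K_{qn},$$
$$A \otimes \operatorname{vec} B = (K_{mq} \otimes I_p)\begin{pmatrix} A \otimes b_1 \\ \vdots \\ A \otimes b_q \end{pmatrix},$$
$$\operatorname{vec} A \otimes B = (I_n \otimes K_{mp})\begin{pmatrix} B \otimes a_1 \\ \vdots \\ B \otimes a_n \end{pmatrix}.$$
Several other similar results can be obtained; the analysis is left to the reader. Theorems 3.3 and 3.4 allow us to convert the vec of a Kronecker product into the Kronecker product of vecs and vice versa. Consider an m x n matrix A and a p x q matrix B partitioned into their columns: $A = (a_1 \cdots a_n)$, $B = (b_1 \cdots b_q)$. Then, as we saw in Section 2.2 of Chap. 2, we can write
$$A \otimes B = (a_1 \otimes b_1 \;\cdots\; a_1 \otimes b_q \;\cdots\; a_n \otimes b_1 \;\cdots\; a_n \otimes b_q),$$
so
$$\operatorname{vec}(A \otimes B) = \begin{pmatrix} a_1 \otimes b_1 \\ \vdots \\ a_1 \otimes b_q \\ \vdots \\ a_n \otimes b_1 \\ \vdots \\ a_n \otimes b_q \end{pmatrix}, \quad \text{whereas} \quad \operatorname{vec} A \otimes \operatorname{vec} B = \begin{pmatrix} a_1 \otimes \begin{pmatrix} b_1 \\ \vdots \\ b_q \end{pmatrix} \\ \vdots \\ a_n \otimes \begin{pmatrix} b_1 \\ \vdots \\ b_q \end{pmatrix} \end{pmatrix}.$$
Clearly both vectors have the same elements, although these elements are rearranged in moving from one vector to the other. Each vector must then be able to be obtained when the other is premultiplied by a suitable zero-one matrix. Applying Theorem 3.3 or Theorem 3.4 we have
$$\operatorname{vec}(A \otimes B) = (I_n \otimes K_{qm} \otimes I_p)(\operatorname{vec} A \otimes \operatorname{vec} B), \qquad (3.1)$$
$$(I_n \otimes K_{mq} \otimes I_p)\operatorname{vec}(A \otimes B) = \operatorname{vec} A \otimes \operatorname{vec} B. \qquad (3.2)$$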
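Equations (3.1) and (3.2) can be illustrated numerically as follows. This is a sketch under the assumption that `commutation(m, n)` (a name introduced here) returns $K_{mn}$ with $K_{mn}\operatorname{vec} A = \operatorname{vec} A'$; the dimensions are arbitrary.

```python
import numpy as np

def vec(M):
    return M.reshape(-1, 1, order="F")

def commutation(m, n):
    """K_mn with K_mn vec(A) = vec(A') for an m x n matrix A."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[i * n + j, j * m + i] = 1.0
    return K

rng = np.random.default_rng(6)
m, n, p, q = 2, 3, 2, 4
A = rng.standard_normal((m, n))
B = rng.standard_normal((p, q))

vecA_vecB = np.kron(vec(A), vec(B))
vec_AB = vec(np.kron(A, B))

# Eq. (3.1): vec(A (x) B) = (I_n (x) K_qm (x) I_p)(vec A (x) vec B)
P = np.kron(np.kron(np.eye(n), commutation(q, m)), np.eye(p))
ok_31 = np.allclose(vec_AB, P @ vecA_vecB)

# Eq. (3.2): (I_n (x) K_mq (x) I_p) vec(A (x) B) = vec A (x) vec B
Q = np.kron(np.kron(np.eye(n), commutation(m, q)), np.eye(p))
ok_32 = np.allclose(Q @ vec_AB, vecA_vecB)

print(ok_31, ok_32)
```

Because the two vectors differ only by a permutation, `P` and `Q` are inverse permutation matrices of each other.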
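The same properties of the commutation matrix allow us to write vec(A ⊗ B) in terms of either vec A or vec B, as illustrated by the following theorem.

Theorem 3.5.
$$\operatorname{vec}(A \otimes B) = \left[ I_n \otimes \begin{pmatrix} I_m \otimes b_1 \\ \vdots \\ I_m \otimes b_q \end{pmatrix} \right] \operatorname{vec} A,$$
$$\operatorname{vec}(A \otimes B) = \left[ \begin{pmatrix} I_q \otimes a_1 \\ \vdots \\ I_q \otimes a_n \end{pmatrix} \otimes I_p \right] \operatorname{vec} B.$$

Proof of Theorem 3.5. By Eq. (3.1),
$$\operatorname{vec}(A \otimes B) = (I_n \otimes K_{qm} \otimes I_p)(\operatorname{vec} A \otimes \operatorname{vec} B) = (I_n \otimes K_{qm} \otimes I_p)\operatorname{vec}[\operatorname{vec} B(\operatorname{vec} A)'],$$
by Eq. (2.1). Now we can write
$$\operatorname{vec}[\operatorname{vec} B(\operatorname{vec} A)'] = [I_n \otimes (I_m \otimes \operatorname{vec} B)]\operatorname{vec} A = [(\operatorname{vec} A \otimes I_q) \otimes I_p]\operatorname{vec} B,$$
so
$$\operatorname{vec}(A \otimes B) = \{I_n \otimes [(K_{qm} \otimes I_p)(I_m \otimes \operatorname{vec} B)]\}\operatorname{vec} A = \{[(I_n \otimes K_{qm})(\operatorname{vec} A \otimes I_q)] \otimes I_p\}\operatorname{vec} B.$$
Applying Theorems 3.3 and 3.4 gives us the result.
□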
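3.3.3. Generalized Vecs and Devecs of the Commutation Matrix

3.3.3.1. Explicit Expressions for $K_{mn}^{\bar\tau_n}$ and $K_{mn}^{\tau_m}$

Consider the commutation matrix $K_{mn}$, which we can write as
$$K_{mn} = \begin{pmatrix} I_n \otimes e_1^{m\prime} \\ \vdots \\ I_n \otimes e_m^{m\prime} \end{pmatrix} = [\,I_m \otimes e_1^n \;\cdots\; I_m \otimes e_n^n\,], \qquad (3.3)$$
where $e_j^n$ is the jth column of the n x n identity matrix $I_n$. It follows then that $K_{mn}^{\bar\tau_n}$ is the n x nm² matrix given by
$$K_{mn}^{\bar\tau_n} = [\,I_n \otimes e_1^{m\prime} \;\cdots\; I_n \otimes e_m^{m\prime}\,] \qquad (3.4)$$
and $K_{mn}^{\tau_m}$ is the mn² x m matrix given by
$$K_{mn}^{\tau_m} = \begin{pmatrix} I_m \otimes e_1^n \\ \vdots \\ I_m \otimes e_n^n \end{pmatrix}.$$
From Theorem 3.5, we see that for an m x n matrix A
$$\operatorname{vec}(A \otimes I_G) = \left[ I_n \otimes \begin{pmatrix} I_m \otimes e_1^G \\ \vdots \\ I_m \otimes e_G^G \end{pmatrix} \right] \operatorname{vec} A. \qquad (3.5)$$
It follows then that we can now write $\operatorname{vec}(A \otimes I_G)$ in terms of generalized vecs as
$$\operatorname{vec}(A \otimes I_G) = (I_n \otimes K_{mG}^{\tau_m})\operatorname{vec} A. \qquad (3.6)$$
Note for the special case in which a is an m x 1 vector, we have $\operatorname{vec}(a \otimes I_G) = K_{mG}^{\tau_m}a$. In a similar fashion, we can write
$$\operatorname{vec}(I_G \otimes A) = \left[ \begin{pmatrix} I_n \otimes e_1^G \\ \vdots \\ I_n \otimes e_G^G \end{pmatrix} \otimes I_m \right] \operatorname{vec} A = (K_{nG}^{\tau_n} \otimes I_m)\operatorname{vec} A, \qquad (3.7)$$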
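and for the special case in which a is an m x 1 vector we have $\operatorname{vec}(I_G \otimes a) = (\operatorname{vec} I_G \otimes I_m)a$. As we shall see in the next chapter, Eqs. (3.6) and (3.7) allow us to write the derivatives of $\operatorname{vec}(A \otimes I_G)$ and $\operatorname{vec}(I_G \otimes A)$ with respect to vec A in terms of generalized vecs and devecs of the commutation matrix.

3.3.3.2. Useful Properties of $K_{nG}^{\tau_n}$, $K_{Gn}^{\bar\tau_n}$

In deriving results for $K_{nG}^{\tau_n}$ and $K_{Gn}^{\bar\tau_n}$, it is often convenient to express these matrices in terms of the commutation matrix $K_{Gn}$. The following theorem does this.

Theorem 3.6.
$$K_{nG}^{\tau_n} = (K_{Gn} \otimes I_G)(I_n \otimes \operatorname{vec} I_G) = (I_G \otimes K_{nG})(\operatorname{vec} I_G \otimes I_n),$$
$$K_{Gn}^{\bar\tau_n} = [I_n \otimes (\operatorname{vec} I_G)'](K_{nG} \otimes I_G) = [(\operatorname{vec} I_G)' \otimes I_n](I_G \otimes K_{Gn}).$$

Proof of Theorem 3.6. Using Eq. (3.3), we can write
$$(K_{Gn} \otimes I_G)(I_n \otimes \operatorname{vec} I_G) = \begin{pmatrix} I_n \otimes (e_1^{G\prime} \otimes I_G)\operatorname{vec} I_G \\ \vdots \\ I_n \otimes (e_G^{G\prime} \otimes I_G)\operatorname{vec} I_G \end{pmatrix}.$$
This clearly equals $K_{nG}^{\tau_n}$ as $(e_j^{G\prime} \otimes I_G)\operatorname{vec} I_G = e_j^G$ for j = 1, . . . , G. Now
$$(I_G \otimes K_{nG})(\operatorname{vec} I_G \otimes I_n) = \begin{pmatrix} K_{nG}(e_1^G \otimes I_n) \\ \vdots \\ K_{nG}(e_G^G \otimes I_n) \end{pmatrix} = \begin{pmatrix} I_n \otimes e_1^G \\ \vdots \\ I_n \otimes e_G^G \end{pmatrix} = K_{nG}^{\tau_n}.$$
The equivalent results for $K_{Gn}^{\bar\tau_n}$ are readily obtained with $K_{Gn}^{\bar\tau_n} = (K_{nG}^{\tau_n})'$.
□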
37
Using Theorem 3.6 and known results about the commutation matrix, we can derive results for K,TinG and Kin. For example, we know that for A, an m x n matrix, B, a p x q matrix, and b, a p x 1 vector, K pm(A b) = b 0 A, K,pp(b 0 A) = A 0 b. Using these results, Eqs. (2.7)—(2.9), and Theorem 3.6, we prove the following theorem. Theorem 3.7. For A, B, and b, which are m x n, p x q, and p x 1 matrices, respectively, K,I;fp(A 0 b Im) = b
(vec A)',
K ip-^,;,(b 0 A 0 /p) = A 0 b',
K:rp(I„, b 0 A) = devec A 0 b, K
0 A 0 b) = b' 0 A.
Proof of Theorem 3.7. KLP,,(A
b I„,) = [Ip
OTC WV 0 A 0 Im)
= b 0 (vec 1,)'(A0 Im) =b
(vec A)',
IC;;(b 0 A 0 I p) = [Im 0 (vec p)1(A b
p)
= A 0 (vec /p) f(b /p) = A 0 b f, K:fp(I„, b 0 A) = [(vec 40' 0 /Aim 0 A 0 b) = (vec WV'', ® A) 0 b = devec A 0 b,
Kpm (Ip ®A®b)= [(vec lp)'®I,n ](Ip ® b®A) = (vec /p)'(/p 0b) 0 A = 0 A.
0
We can obtain the equivalent results for the generalized vec of the commutation matrix $K_{mp}$ by taking appropriate transposes. However, such proofs are rather inefficient in that we can obtain the same results immediately by using the properties of the generalized vec and devec operators discussed in Section 2.4 of Chap. 2. For example, we have
$$K_{pm}(A \otimes b) = b \otimes A.$$
Taking the devec$_m$ of both sides, we have
$$K_{pm}^{\bar\tau_m}(I_p \otimes A \otimes b) = b' \otimes A.$$
Similarly, taking the devec$_p$ of both sides of $K_{mp}(b \otimes A) = A \otimes b$, we have
$$K_{mp}^{\bar\tau_p}(I_m \otimes b \otimes A) = \mathrm{devec}\, A \otimes b.$$
In like manner, as $K_{pm}K_{mp} = I_{pm}$, taking the vec$_m$ of both sides, we have
$$(I_p \otimes K_{pm})K_{mp}^{\tau_m} = (I_{pm})^{\tau_m}.$$
Further such proofs are left to the reader.

3.3.3.3. Other Theorems about $K_{Gn}^{\tau_G}$ and $K_{nG}^{\bar\tau_G}$

In subsequent chapters, we shall need to call on the following theorems involving $K_{nG}^{\bar\tau_G}$. (We can obtain the equivalent theorems for $K_{Gn}^{\tau_G}$ by taking appropriate transposes.) For notational convenience for the rest of this section, we shall use $\bar\tau$ and $\tau$ to denote the devec$_G$ and vec$_G$ operators, respectively. Other generalized devec and vec operators shall be denoted by appropriate subscripts.

Theorem 3.8. Let $a$ be an $n \times 1$ vector. Then
$$K_{nG}^{\bar\tau}(a \otimes I_{nG}) = I_G \otimes a',$$
$$K_{nG}^{\bar\tau}(I_{nG} \otimes a) = a' \otimes I_G.$$
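Proof of Theorem 3.8. Let $a = (a_1 \cdots a_n)'$ and write
$$K_{nG}^{\bar\tau}(a \otimes I_{nG}) = (I_G \otimes e_1^{n\prime} \ \cdots \ I_G \otimes e_n^{n\prime}) \begin{pmatrix} a_1(I_G \otimes I_n) \\ \vdots \\ a_n(I_G \otimes I_n) \end{pmatrix} = a_1(I_G \otimes e_1^{n\prime}) + \cdots + a_n(I_G \otimes e_n^{n\prime}) = I_G \otimes a'.$$
Similarly, write
$$K_{nG}^{\bar\tau}(I_{nG} \otimes a) = (I_G \otimes e_1^{n\prime} \ \cdots \ I_G \otimes e_n^{n\prime})[I_n \otimes (I_G \otimes a)] = (I_G \otimes e_1^{n\prime}a \ \cdots \ I_G \otimes e_n^{n\prime}a) = (I_G \otimes a_1 \ \cdots \ I_G \otimes a_n) = (a_1 I_G \ \cdots \ a_n I_G) = a' \otimes I_G.$$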
❑
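Theorem 3.9. Let $A$ be an $n \times p$ matrix. Then
$$(I_p \otimes K_{nG}^{\bar\tau})(\mathrm{vec}\, A \otimes I_{nG}) = K_{pG}(I_G \otimes A'). \tag{3.8}$$

Proof of Theorem 3.9. Write $A = (a_1 \cdots a_p)$, where $a_j$ is the $n \times 1$ $j$th column of $A$. Then we can write the left-hand side of Eq. (3.8) as
$$\begin{pmatrix} K_{nG}^{\bar\tau} & & 0 \\ & \ddots & \\ 0 & & K_{nG}^{\bar\tau} \end{pmatrix} \begin{pmatrix} a_1 \otimes I_{nG} \\ \vdots \\ a_p \otimes I_{nG} \end{pmatrix} = \begin{pmatrix} K_{nG}^{\bar\tau}(a_1 \otimes I_{nG}) \\ \vdots \\ K_{nG}^{\bar\tau}(a_p \otimes I_{nG}) \end{pmatrix}. \tag{3.9}$$
However, by using Theorem 3.8, we can write the right-hand side of Eq. (3.9) as $(I_G \otimes a_1 \ \cdots \ I_G \otimes a_p)'$, and, from the properties of the commutation matrix, this is equal to $K_{pG}(I_G \otimes A')$. ❑

Theorem 3.10. Let $U$ be an $n \times G$ matrix and $u = \mathrm{vec}\, U$. Then
$$K_{nG}^{\bar\tau}(I_n \otimes u) = U' = (u^{\bar\tau_n})'.$$

Proof of Theorem 3.10. Clearly
$$K_{nG}^{\bar\tau}(I_n \otimes u) = (I_G \otimes e_1^{n\prime} \ \cdots \ I_G \otimes e_n^{n\prime}) \begin{pmatrix} u & & 0 \\ & \ddots & \\ 0 & & u \end{pmatrix} = [(I_G \otimes e_1^{n\prime})u \ \cdots \ (I_G \otimes e_n^{n\prime})u].$$
Now write $u = (u_1' \cdots u_G')'$, where $u_j$ is the $j$th column of $U$. Then
$$(I_G \otimes e_j^{n\prime})u = \begin{pmatrix} e_j^{n\prime}u_1 \\ \vdots \\ e_j^{n\prime}u_G \end{pmatrix} = \begin{pmatrix} u_{j1} \\ \vdots \\ u_{jG} \end{pmatrix} = U_{j\cdot}',$$
where $U_{j\cdot}$ is the $j$th row of $U$.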
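Theorem 3.11. Let $u$ be an $nG \times 1$ vector. Then
$$K_{nG}^{\bar\tau}(u \otimes I_n) = u^{\bar\tau}.$$

Proof of Theorem 3.11. Write $u = (u_1' \cdots u_n')'$, where each vector $u_i$ is $G \times 1$. Then
$$K_{nG}^{\bar\tau}(u \otimes I_n) = (I_G \otimes e_1^{n\prime} \ \cdots \ I_G \otimes e_n^{n\prime}) \begin{pmatrix} u_1 \otimes I_n \\ \vdots \\ u_n \otimes I_n \end{pmatrix} = (u_1 \otimes e_1^{n\prime}) + \cdots + (u_n \otimes e_n^{n\prime}).$$
However, $u_j \otimes e_j^{n\prime} = u_j e_j^{n\prime} = (0 \ \cdots \ u_j \ \cdots \ 0)$, so
$$K_{nG}^{\bar\tau}(u \otimes I_n) = (u_1 \ \cdots \ u_n) = u^{\bar\tau}.$$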
❑
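Theorem 3.12.
$$(I_m \otimes K_{nG}^{\bar\tau})K_{mn,nG} = K_{mG}(I_G \otimes K_{nm}^{\bar\tau_m}).$$

Proof of Theorem 3.12. We can write
$$K_{mn,nG} = [I_m \otimes (I_n \otimes e_1^{nG}) \ \cdots \ I_m \otimes (I_n \otimes e_{nG}^{nG})],$$
so
$$(I_m \otimes K_{nG}^{\bar\tau})K_{mn,nG} = [I_m \otimes K_{nG}^{\bar\tau}(I_n \otimes e_1^{nG}) \ \cdots \ I_m \otimes K_{nG}^{\bar\tau}(I_n \otimes e_{nG}^{nG})]. \tag{3.10}$$
Now, by Theorem 3.10,
$$K_{nG}^{\bar\tau}(I_n \otimes e_j^{nG}) = [(e_j^{nG})^{\bar\tau_n}]'$$
and, as $e_j^{nG} = e_1^G \otimes e_j^n$ for $j = 1, \ldots, n$, from the properties of the generalized devec operator given in Section 2.4 of Chap. 2, $(e_j^{nG})^{\bar\tau_n} = e_1^{G\prime} \otimes e_j^n$, so
$$K_{nG}^{\bar\tau}(I_n \otimes e_j^{nG}) = e_1^G \otimes e_j^{n\prime} = e_1^G e_j^{n\prime}.$$
The first $n$ matrices on the right-hand side of Eq. (3.10) can then be written as
$$(I_m \otimes e_1^G e_1^{n\prime} \ \cdots \ I_m \otimes e_1^G e_n^{n\prime}) = (I_m \otimes e_1^G)(I_m \otimes e_1^{n\prime} \ \cdots \ I_m \otimes e_n^{n\prime}) = (I_m \otimes e_1^G)K_{nm}^{\bar\tau_m}.$$
It follows then that the left-hand side of Eq. (3.10) can be written as
$$[(I_m \otimes e_1^G)K_{nm}^{\bar\tau_m} \ \cdots \ (I_m \otimes e_G^G)K_{nm}^{\bar\tau_m}] = K_{mG}(I_G \otimes K_{nm}^{\bar\tau_m}).$$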
❑
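Theorem 3.13. Let $b$ and $d$ be $n \times 1$ vectors and let $c$ be a $G \times 1$ vector. Then
$$K_{nG}^{\bar\tau}(b \otimes c \otimes d) = d'bc.$$

Proof of Theorem 3.13.
$$K_{nG}^{\bar\tau}(b \otimes c \otimes d) = (I_G \otimes e_1^{n\prime} \ \cdots \ I_G \otimes e_n^{n\prime}) \begin{pmatrix} b_1(c \otimes d) \\ \vdots \\ b_n(c \otimes d) \end{pmatrix} = b_1(c \otimes d_1) + \cdots + b_n(c \otimes d_n) = (b_1d_1 + \cdots + b_nd_n)c.$$
❑

Theorem 3.14. Let $b$ be an $n \times 1$ vector and let $A$ be a $Gn \times p$ matrix. Then
$$K_{nG}^{\bar\tau}(b \otimes A) = (I_G \otimes b')A.$$

Proof of Theorem 3.14.
$$K_{nG}^{\bar\tau}(b \otimes A) = (I_G \otimes e_1^{n\prime} \ \cdots \ I_G \otimes e_n^{n\prime}) \begin{pmatrix} b_1 A \\ \vdots \\ b_n A \end{pmatrix} = b_1(I_G \otimes e_1^{n\prime})A + \cdots + b_n(I_G \otimes e_n^{n\prime})A.$$
Let $A = (A_1' \cdots A_G')'$, where each submatrix $A_i$ is $n \times p$. Then
$$(I_G \otimes e_j^{n\prime})A = (A_1'e_j^n \ \cdots \ A_G'e_j^n)',$$
so
$$K_{nG}^{\bar\tau}(b \otimes A) = b_1(A_1'e_1^n \ \cdots \ A_G'e_1^n)' + \cdots + b_n(A_1'e_n^n \ \cdots \ A_G'e_n^n)'.$$
Consider the first submatrix, $b_1e_1^{n\prime}A_1 + \cdots + b_ne_n^{n\prime}A_1 = b'A_1$, so we can write
$$K_{nG}^{\bar\tau}(b \otimes A) = (A_1'b \ \cdots \ A_G'b)' = (I_G \otimes b')A.$$
❑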
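Theorems 3.8, 3.10, 3.13, and 3.14 can be spot-checked on random inputs. The sketch below assumes NumPy; `commutation` and `gen_devec` are illustrative helper names, and the generalized devec is taken to be the matrix obtained by placing successive $G$-row blocks of $K_{nG}$ side by side.

```python
import numpy as np

def commutation(m, n):
    """Commutation matrix K_mn with K_mn vec(A) = vec(A') for A of order m x n."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[i * n + j, j * m + i] = 1.0
    return K

def gen_devec(M, G):
    """Generalized devec: place successive G-row blocks of M side by side."""
    return np.hstack([M[k:k + G, :] for k in range(0, M.shape[0], G)])

n, G, p = 3, 2, 4
rng = np.random.default_rng(1)
K_bar = gen_devec(commutation(n, G), G)      # devec_G of K_nG, order G x n^2 G
a, b, d = (rng.standard_normal(n) for _ in range(3))
c = rng.standard_normal(G)
U = rng.standard_normal((n, G))
A = rng.standard_normal((G * n, p))
col = lambda v: v.reshape(-1, 1)
row = lambda v: v.reshape(1, -1)
I = np.eye

checks = {
    "3.8a": np.allclose(K_bar @ np.kron(col(a), I(n * G)), np.kron(I(G), row(a))),
    "3.8b": np.allclose(K_bar @ np.kron(I(n * G), col(a)), np.kron(row(a), I(G))),
    "3.10": np.allclose(K_bar @ np.kron(I(n), col(U.reshape(-1, order="F"))), U.T),
    "3.13": np.allclose(K_bar @ np.kron(b, np.kron(c, d)), (d @ b) * c),
    "3.14": np.allclose(K_bar @ np.kron(col(b), A), np.kron(I(G), row(b)) @ A),
}
print(checks)
```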
3.3.3.4. $K_{p,np}$ versus $K_{np}^{\tau_n}$

Note that both $K_{p,np}$ and $K_{np}^{\tau_n}$ have $np^2$ rows, and it is of some interest what happens to a Kronecker product with $np^2$ columns when it is postmultiplied by these matrices. Let $A$ be an $m \times p$ matrix and let $B$ be an $r \times np$ matrix; then $A \otimes B$ is such a Kronecker product. From Theorem 3.2,
$$(A \otimes B)K_{p,np} = (A \otimes b_1 \ \cdots \ A \otimes b_{np}), \tag{3.11}$$
where $b_i$ is the $i$th column of $B$. The equivalent result for $K_{np}^{\tau_n}$ is given by the following theorem.

Theorem 3.15. Let $A$ be an $m \times p$ matrix and let $B$ be an $r \times np$ matrix. Then
$$(A \otimes B)K_{np}^{\tau_n} = \begin{pmatrix} B(I_n \otimes a^1) \\ \vdots \\ B(I_n \otimes a^m) \end{pmatrix}, \tag{3.12}$$
where $a^{i\prime}$ is the $i$th row of $A$.

Proof of Theorem 3.15.
$$(A \otimes B)K_{np}^{\tau_n} = \begin{pmatrix} (a^{1\prime} \otimes B)K_{np}^{\tau_n} \\ \vdots \\ (a^{m\prime} \otimes B)K_{np}^{\tau_n} \end{pmatrix}.$$
However, from Theorem 3.14, $(a^{1\prime} \otimes B)K_{np}^{\tau_n} = B(I_n \otimes a^1)$. ❑
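Taking the transposes of Eqs. (3.11) and (3.12), we find that if $C$ and $D$ are $p \times m$ and $np \times r$ matrices, respectively,
$$K_{np,p}(C \otimes D) = \begin{pmatrix} C \otimes d^{1\prime} \\ \vdots \\ C \otimes d^{np\prime} \end{pmatrix}, \tag{3.13}$$
whereas
$$K_{pn}^{\bar\tau_n}(C \otimes D) = [(I_n \otimes c_1')D \ \cdots \ (I_n \otimes c_m')D], \tag{3.14}$$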
where $d^{i\prime}$ is the $i$th row of $D$ and $c_j$ is the $j$th column of $C$.

3.3.4. The Matrix $N_n$

Associated with the commutation matrix $K_{nn}$ is the $n^2 \times n^2$ matrix $N_n$, which is defined by
$$N_n = \tfrac{1}{2}(I_{n^2} + K_{nn}).$$
For a square $n \times n$ matrix $A$,
$$N_n\, \mathrm{vec}\, A = \tfrac{1}{2}\mathrm{vec}(A + A').$$
From the properties of the commutation matrix, it is clear that $N_n$ is symmetric idempotent, that is, $N_n = N_n' = N_n^2$. Other properties of $N_n$ that can easily be derived from the properties of $K_{nn}$, in which $A$ and $B$ are $n \times n$ matrices and $b$ is an $n \times 1$ vector, are

1. $N_nK_{nn} = N_n = K_{nn}N_n$,
2. $N_n(A \otimes B)N_n = N_n(B \otimes A)N_n$,
3. $N_n(A \otimes B + B \otimes A)N_n = N_n(A \otimes B + B \otimes A) = (A \otimes B + B \otimes A)N_n = 2N_n(A \otimes B)N_n$,
4. $N_n(A \otimes A)N_n = N_n(A \otimes A) = (A \otimes A)N_n$,
5. $N_n(A \otimes b) = N_n(b \otimes A) = \tfrac{1}{2}(A \otimes b + b \otimes A)$.

In subsequent chapters, it is often convenient to drop the subscripts from $K_{nn}$ and $N_n$. Thus, we often use the symbol $K$ for $K_{nn}$ and $N$ for $N_n$, in which the order for these matrices is clear. In like fashion, we drop the subscripts from elimination matrices and duplication matrices, which are discussed in Sections 3.4 and 3.5.

3.4. ELIMINATION MATRICES $L$, $\bar L$
It was noted in Section 2.3 of Chap. 2 that if $A$ is an $n \times n$ matrix, then $\mathrm{vec}\, A$ contains all the elements in $\mathrm{vech}\, A$ and in $\bar v(A)$ and more besides. It follows that there exist zero-one matrices $L$ and $\bar L$ whose orders are $\tfrac{1}{2}n(n+1) \times n^2$ and $\tfrac{1}{2}n(n-1) \times n^2$, respectively, such that
$$L\, \mathrm{vec}\, A = \mathrm{vech}\, A, \qquad \bar L\, \mathrm{vec}\, A = \bar v(A).$$
These matrices are called elimination matrices. For example, for a $3 \times 3$ matrix $A$,
$$L_3 = \begin{pmatrix} 1&0&0&0&0&0&0&0&0 \\ 0&1&0&0&0&0&0&0&0 \\ 0&0&1&0&0&0&0&0&0 \\ 0&0&0&0&1&0&0&0&0 \\ 0&0&0&0&0&1&0&0&0 \\ 0&0&0&0&0&0&0&0&1 \end{pmatrix}, \qquad \bar L_3 = \begin{pmatrix} 0&1&0&0&0&0&0&0&0 \\ 0&0&1&0&0&0&0&0&0 \\ 0&0&0&0&0&1&0&0&0 \end{pmatrix}.$$

3.5. DUPLICATION MATRICES $D$, $\bar L'$
In Chap. 2 it was also noted that if an $n \times n$ matrix $A$ is symmetric, then $\mathrm{vech}\, A$ contains all the essential elements of $A$, with some of these elements duplicated in $\mathrm{vec}\, A$. It follows therefore that there exists an $n^2 \times \tfrac{1}{2}n(n+1)$ zero-one matrix $D$ such that
$$D\, \mathrm{vech}\, A = \mathrm{vec}\, A.$$
For example, for a $3 \times 3$ symmetric matrix,
$$D_3 = \begin{pmatrix} 1&0&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 0&0&1&0&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \end{pmatrix}.$$
Similarly, if $A$ is strictly lower triangular then $\bar v(A)$ contains all the essential nonzero elements of $A$. Hence there exists an $n^2 \times \tfrac{1}{2}n(n-1)$ zero-one matrix, $\bar L'$ as it turns out, such that
$$\bar L'\, \bar v(A) = \mathrm{vec}\, A.$$
The matrices $D$ and $\bar L'$ are called duplication matrices.

3.6. RESULTS CONCERNING ZERO-ONE MATRICES ASSOCIATED WITH AN $n \times n$ MATRIX
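The following results are well known and may be found in Magnus (1988).

1. $KD = D = ND$, $LD = I$.
2. $LL' = I$.
3. $DLN = N$.
4. $D'D = (LNL')^{-1} = 2I - LKL'$.
5. $D = NL'(LNL')^{-1} = 2NL' - L'LKL'$.
6. $DD' = 2N - L'LKL'L$.
7. $\bar L \bar L' = I$.
8. $\bar L K \bar L' = 0$.
9. The generalized inverse of $D$ is $D^+ = LN$.
10. $D^+(A \otimes B)D = D^+(B \otimes A)D$.
11. For a nonsingular matrix $A$, $[D'(A \otimes A)D]^{-1} = LN(A^{-1} \otimes A^{-1})NL'$.
12. Let $A = \{a_{ij}\}$ be an $n \times n$ matrix. Then $\bar L'\bar L\, \mathrm{vec}\, A = \mathrm{vec}\, \bar A$, where $\bar A$ is the strict lower-triangular matrix given by
$$\bar A = \begin{pmatrix} 0 & 0 & \cdots & 0 & 0 \\ a_{21} & 0 & \cdots & 0 & 0 \\ a_{31} & a_{32} & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{n\,n-1} & 0 \end{pmatrix}.$$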
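The results listed above lend themselves to a direct numerical check. The following is a minimal sketch, assuming NumPy; the constructors `commutation`, `elimination`, and `duplication` are illustrative helpers written for this check, not routines taken from Magnus (1988).

```python
import numpy as np

def commutation(n):
    """K_nn, with K_nn vec(A) = vec(A') for a square n x n matrix A."""
    K = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            K[i * n + j, j * n + i] = 1.0
    return K

def elimination(n, strict=False):
    """Rows of I_{n^2} that pick vech A (or, if strict, the elements below the diagonal) from vec A."""
    keep = [j * n + i for j in range(n) for i in range(j + int(strict), n)]
    return np.eye(n * n)[keep]

def duplication(n):
    """D with D vech(A) = vec(A) for symmetric A."""
    D = np.zeros((n * n, n * (n + 1) // 2))
    col = 0
    for j in range(n):
        for i in range(j, n):
            D[j * n + i, col] = D[i * n + j, col] = 1.0
            col += 1
    return D

n = 4
K, L, Lbar, D = commutation(n), elimination(n), elimination(n, strict=True), duplication(n)
N = 0.5 * (np.eye(n * n) + K)
Ih = np.eye(n * (n + 1) // 2)
rng = np.random.default_rng(2)
A = rng.standard_normal((n, n)) + n * np.eye(n)
B = rng.standard_normal((n, n))
Ainv = np.linalg.inv(A)
LKL = L @ K @ L.T

results = {
    1: np.allclose(K @ D, D) and np.allclose(N @ D, D) and np.allclose(L @ D, Ih),
    2: np.allclose(L @ L.T, Ih),
    3: np.allclose(D @ L @ N, N),
    4: np.allclose(D.T @ D, np.linalg.inv(L @ N @ L.T)) and np.allclose(D.T @ D, 2 * Ih - LKL),
    5: np.allclose(D, 2 * N @ L.T - L.T @ LKL),
    6: np.allclose(D @ D.T, 2 * N - L.T @ LKL @ L),
    7: np.allclose(Lbar @ Lbar.T, np.eye(n * (n - 1) // 2)),
    8: np.allclose(Lbar @ K @ Lbar.T, 0.0),
    9: np.allclose(np.linalg.pinv(D), L @ N),
    10: np.allclose(L @ N @ np.kron(A, B) @ D, L @ N @ np.kron(B, A) @ D),
    11: np.allclose(np.linalg.inv(D.T @ np.kron(A, A) @ D),
                    L @ N @ np.kron(Ainv, Ainv) @ N @ L.T),
    12: np.allclose((Lbar.T @ Lbar @ A.reshape(-1, order="F")).reshape((n, n), order="F"),
                    np.tril(A, k=-1)),
}
print(results)
```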
One further result that is useful to us is given by the following theorem.
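Theorem 3.16. Suppose $A$ is an $n \times n$ diagonal matrix with diagonal elements $a_{11}, a_{22}, \ldots, a_{nn}$. Then $D'(A \otimes A)D$ is also a diagonal matrix given by
$$D'(A \otimes A)D = \mathrm{diag}\left(a_{11}^2,\ 2a_{11}a_{22},\ \ldots,\ 2a_{11}a_{nn},\ a_{22}^2,\ \ldots,\ 2a_{22}a_{nn},\ \ldots,\ a_{nn}^2\right).$$

Proof of Theorem 3.16. From Theorem 4.9 of Magnus (1988), $D^+(A \otimes A)D$ is a diagonal matrix with diagonal elements $a_{ii}a_{jj}$ for $1 \le j \le i \le n$. Now
$$D'(A \otimes A)D = D'DD^+(A \otimes A)D,$$
but Magnus (1988) in Theorem 4.4 shows that we can write
$$D'D = 2I_{\frac{1}{2}n(n+1)} - \sum_{i=1}^{n} u_{ii}u_{ii}',$$
where $u_{ii}$ is a unit vector of the order of $\tfrac{1}{2}n(n+1) \times 1$, with one in the $(i-1)n + \tfrac{3}{2}i - \tfrac{1}{2}i^2$ position and zeros elsewhere. However, it is easily seen that
$$\sum_{i=1}^{n} u_{ii}u_{ii}'\, D^+(A \otimes A)D = \mathrm{diag}\left(a_{11}^2,\ 0,\ \ldots,\ 0,\ a_{22}^2,\ 0,\ \ldots,\ 0,\ a_{nn}^2\right).$$
❑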
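Theorem 3.16 can be confirmed for a small diagonal matrix. This is a sketch assuming NumPy; `duplication` is an illustrative helper and the diagonal values are arbitrary.

```python
import numpy as np

def duplication(n):
    """D with D vech(A) = vec(A) for symmetric A (vech stacks the lower triangle by columns)."""
    D = np.zeros((n * n, n * (n + 1) // 2))
    col = 0
    for j in range(n):
        for i in range(j, n):
            D[j * n + i, col] = D[i * n + j, col] = 1.0
            col += 1
    return D

n = 4
a = np.array([2.0, -1.0, 0.5, 3.0])          # diagonal elements a_11, ..., a_nn
A = np.diag(a)
D = duplication(n)

M = D.T @ np.kron(A, A) @ D
# a_jj^2 on the positions of the diagonal elements, 2 a_jj a_ii elsewhere
expected = np.array([a[i] * a[j] * (1.0 if i == j else 2.0)
                     for j in range(n) for i in range(j, n)])
print(np.allclose(M, np.diag(expected)))
```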
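Note that as $[D'(A \otimes A)D]^{-1} = LN(A^{-1} \otimes A^{-1})NL'$ for nonsingular $A$, it follows that
$$LN(A^{-1} \otimes A^{-1})NL' = \mathrm{diag}\left(a_{11}^{-2},\ \tfrac{1}{2}a_{11}^{-1}a_{22}^{-1},\ \ldots,\ \tfrac{1}{2}a_{11}^{-1}a_{nn}^{-1},\ a_{22}^{-2},\ \tfrac{1}{2}a_{22}^{-1}a_{33}^{-1},\ \ldots,\ \tfrac{1}{2}a_{22}^{-1}a_{nn}^{-1},\ \ldots,\ a_{nn}^{-2}\right)$$
for a diagonal matrix $A$.

3.7. SHIFTING MATRICES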
3.7.1. Introduction

In this section, a zero-one matrix is introduced that will be useful to us in a future analysis of time-series models. When a time-series process is written in matrix notation, shifting matrices, at least as far as asymptotic theory is concerned, play the same role as that of the lag operators that appear when we write the process for a particular time period. However, specifying the process in matrix notation using these zero-one matrices greatly facilitates the use of matrix calculus and hence the application of classical statistical procedures to the model.

3.7.2. Definition and Basic Operations

Consider the $n \times n$ matrix
$$S_1 = \begin{pmatrix} 0 & 0 & \cdots & 0 & 0 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix} = \begin{pmatrix} 0' & 0 \\ I_{n-1} & 0 \end{pmatrix}.$$
Clearly $S_1$ is a strictly lower-triangular zero-one matrix. It is called a shifting matrix as when a given matrix is premultiplied or postmultiplied by $S_1$, the elements of that matrix are shifted one space and zeros are placed in the spaces thus created. For example, let $A = \{a_{ij}\}$ and $B = \{b_{ij}\}$ be $n \times m$ and $m \times n$ matrices, respectively. Then
$$S_1A = \begin{pmatrix} 0 & \cdots & 0 \\ a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{n-1\,1} & \cdots & a_{n-1\,m} \end{pmatrix},$$
that is, in forming $S_1A$ we shift the rows of $A$ down one space and replace the first row of $A$ with the null row vector. Similarly
$$BS_1 = \begin{pmatrix} b_{12} & \cdots & b_{1n} & 0 \\ \vdots & & \vdots & \vdots \\ b_{m2} & \cdots & b_{mn} & 0 \end{pmatrix},$$
that is, in forming $BS_1$ we shift the columns of $B$ to the left one place and replace the last column of $B$ with the null column vector. Notice that for $a$, an $n \times 1$ vector,
$$S_1a = (0 \ a_1 \ \cdots \ a_{n-1})', \qquad a'S_1 = (a_2 \ \cdots \ a_n \ 0).$$
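From $S_1$, other shifting matrices can be formed that will also shift the elements of a given matrix one space. Clearly
$$S_1'A = \begin{pmatrix} a_{21} & \cdots & a_{2m} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nm} \\ 0 & \cdots & 0 \end{pmatrix}$$
shifts the rows of $A$ up one and replaces the last row of $A$ with the null row vector and
$$BS_1' = \begin{pmatrix} 0 & b_{11} & \cdots & b_{1\,n-1} \\ \vdots & \vdots & & \vdots \\ 0 & b_{m1} & \cdots & b_{m\,n-1} \end{pmatrix}$$
shifts the columns of $B$ to the right one and replaces the first column of $B$ with the null vector.

Suppose now that we want to replace the first or the last row (column) of a matrix with a null vector while leaving the other elements of the matrix the same. We can achieve this too by using the shifting matrix $S_1$. Clearly
$$S_1S_1'A = \begin{pmatrix} 0 & \cdots & 0 \\ a_{21} & \cdots & a_{2m} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}, \qquad BS_1S_1' = \begin{pmatrix} 0 & b_{12} & \cdots & b_{1n} \\ \vdots & \vdots & & \vdots \\ 0 & b_{m2} & \cdots & b_{mn} \end{pmatrix},$$
$$S_1'S_1A = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{n-1\,1} & \cdots & a_{n-1\,m} \\ 0 & \cdots & 0 \end{pmatrix}, \qquad BS_1'S_1 = \begin{pmatrix} b_{11} & \cdots & b_{1\,n-1} & 0 \\ \vdots & & \vdots & \vdots \\ b_{m1} & \cdots & b_{m\,n-1} & 0 \end{pmatrix}.$$
Alternatively, we can use $S_1$ to leave the first or the last row (column) of a matrix unaltered while multiplying all other elements of the matrix by a given constant, $k$, say. For example, if $I_n$ is the $n \times n$ identity matrix, then
$$[I_n + (k-1)S_1S_1']A = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ ka_{21} & \cdots & ka_{2m} \\ \vdots & & \vdots \\ ka_{n1} & \cdots & ka_{nm} \end{pmatrix}.$$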
The other cases are left to the reader.

3.7.3. Shifting Matrices Associated with an $n \times n$ Matrix

When a given matrix is premultiplied (postmultiplied) by the shifting matrix $S_1$ or $S_1'$, the rows (columns) of that matrix are shifted one space. Other shifting matrices can similarly be defined that will shift the elements of a given matrix any number of spaces. In fact, a given $n \times n$ matrix has $n$ shifting matrices associated with it (see Footnote 1):
$$I_n = \begin{pmatrix} 1 & & 0 \\ & \ddots & \\ 0 & & 1 \end{pmatrix}$$
shifts elements zero spaces;
$$S_1 = \begin{pmatrix} 0 & & & 0 \\ 1 & \ddots & & \\ & \ddots & \ddots & \\ 0 & & 1 & 0 \end{pmatrix}$$
shifts elements one space. Zeros go in the spaces created;
$$S_2 = \begin{pmatrix} 0 & & & & 0 \\ 0 & \ddots & & & \\ 1 & \ddots & \ddots & & \\ & \ddots & \ddots & \ddots & \\ 0 & & 1 & 0 & 0 \end{pmatrix}$$
shifts elements two spaces. Zeros go in the spaces created;
$$S_{n-1} = \begin{pmatrix} 0 & \cdots & 0 \\ \vdots & & \vdots \\ 0 & & \\ 1 & 0 \ \cdots & 0 \end{pmatrix}$$
shifts elements $n - 1$ spaces. Zeros go in the spaces created.

Footnote 1: We do not consider shifting matrices that convert the matrix into the null matrix.

Suppose we denote the $j$th column of the $n \times n$ identity matrix by $e_j$. Then clearly
$$S_j = \sum_{i=j+1}^{n} e_ie_{i-j}',$$
$$S_je_i = e_{i+j}, \quad i + j \le n, \qquad S_je_i = 0, \quad i + j > n.$$
Moreover,
$$S_j = S_1^j, \qquad j = 0, 1, \ldots, n-1,$$
provided we take $S_1^0 = I_n$. A similar analysis as that conducted for $S_1$ can now be made for $S_j$. For example, when a given matrix is premultiplied (postmultiplied) by $S_jS_j'$, the first $j$ rows (columns) of that matrix are replaced with zeros while the other elements are left unaltered. If $A$ is an $n \times n$ matrix, then $S_i'AS_j$ is the matrix formed when we move the elements of $A$ up $i$ rows and across to the left $j$ columns, filling the spaces thus created with zeros. Similarly $S_iAS_j'$ is the matrix formed when we move the elements of $A$ down $i$ rows and across to the right $j$ columns, filling the spaces thus created with zeros.
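The basic operations described above can be reproduced in a few lines. The sketch below assumes NumPy and builds $S_j$ with `np.eye(n, k=-j)`; the helper name `shift` and the test matrices are illustrative choices.

```python
import numpy as np

def shift(n, j):
    """Shifting matrix S_j: ones on the j-th diagonal below the main diagonal."""
    return np.eye(n, k=-j)

n, m, k = 5, 3, 7.0
A = np.arange(1.0, n * m + 1).reshape(n, m)   # an n x m matrix
B = A.T                                        # an m x n matrix
X = np.arange(1.0, n * n + 1).reshape(n, n)   # an n x n matrix
S1 = shift(n, 1)

down = S1 @ A            # rows of A moved down one place, first row null
left = B @ S1            # columns of B moved left one place, last column null
up = S1.T @ A            # rows of A moved up one place, last row null
right = B @ S1.T         # columns of B moved right one place, first column null
scaled = (np.eye(n) + (k - 1) * S1 @ S1.T) @ A   # first row kept, other rows times k

i, j = 2, 1
moved = shift(n, i).T @ X @ shift(n, j)   # X moved up i rows and left j columns
print(moved)
```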
3.7.4. Some Properties of Shifting Matrices

3.7.4.1. Obvious Properties

It is clear that all shifting matrices other than the identity matrix have the following properties:

1. They are all singular. In fact, $r(S_j) = n - j$.
2. They all are nilpotent. For example, $S_1^n = 0$.
3. $S_iS_j = S_{i+j}$ if $i + j < n$, and $S_iS_j = 0$ if $i + j \ge n$.

The last property can be used to obtain inverses of matrices involving shifting matrices. An example is the following theorem.

Theorem 3.17. Let $R$ be an $m \times m$ matrix, and consider the $mn \times mn$ matrix given by
$$M(r) = I_{nm} + (R \otimes S_1).$$
Then
$$M(r)^{-1} = I_{nm} - R \otimes S_1 + R^2 \otimes S_2 - \cdots + (-1)^{n-1}(R^{n-1} \otimes S_{n-1}).$$

Proof of Theorem 3.17. Let
$$A = I_{nm} - R \otimes S_1 + R^2 \otimes S_2 - \cdots + (-1)^{n-1}(R^{n-1} \otimes S_{n-1}),$$
and consider
$$M(r)A = A + R \otimes S_1 - R^2 \otimes S_2 + \cdots + (-1)^{n-2}(R^{n-1} \otimes S_{n-1}) + (-1)^{n-1}(R^n \otimes S_n) = I_{nm}$$
as $S_n$ is the null matrix. Hence, by the uniqueness of the inverse, $A = M(r)^{-1}$. ❑

Matrices like $M(r)$ crop up in the application of shifting matrices to time-series models. Also, in this application we will need to consider how shifting matrices interact with triangular matrices. This topic is taken up in the next subsection.

3.7.4.2. Shifting Matrices and Triangular Matrices

Suppose $A$ is the upper-triangular matrix
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ 0 & a_{22} & \cdots & a_{2n} \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & a_{nn} \end{pmatrix}.$$
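Then
$$S_j'A = \begin{pmatrix} 0 & \cdots & 0 & a_{j+1\,j+1} & \cdots & a_{j+1\,n} \\ \vdots & & & & \ddots & \vdots \\ 0 & \cdots & \cdots & \cdots & 0 & a_{nn} \\ 0 & \cdots & \cdots & \cdots & \cdots & 0 \\ \vdots & & & & & \vdots \\ 0 & \cdots & \cdots & \cdots & \cdots & 0 \end{pmatrix},$$
that is, $S_j'A$ is strictly upper triangular with zeros in the $j - 1$ diagonals above the main diagonal. The matrix $AS_j'$ has a like configuration, with
$$AS_j' = \begin{pmatrix} 0 & \cdots & 0 & a_{11} & \cdots & a_{1\,n-j} \\ \vdots & & & & \ddots & \vdots \\ 0 & \cdots & \cdots & \cdots & 0 & a_{n-j\,n-j} \\ 0 & \cdots & \cdots & \cdots & \cdots & 0 \\ \vdots & & & & & \vdots \\ 0 & \cdots & \cdots & \cdots & \cdots & 0 \end{pmatrix}.$$
Similarly, if $B$ is lower triangular, then
$$S_jB = \begin{pmatrix} 0 & \cdots & \cdots & \cdots & \cdots & 0 \\ \vdots & & & & & \vdots \\ 0 & \cdots & \cdots & \cdots & \cdots & 0 \\ b_{11} & 0 & \cdots & \cdots & \cdots & 0 \\ \vdots & \ddots & & & & \vdots \\ b_{n-j\,1} & \cdots & b_{n-j\,n-j} & 0 & \cdots & 0 \end{pmatrix}, \qquad BS_j = \begin{pmatrix} 0 & \cdots & \cdots & \cdots & \cdots & 0 \\ \vdots & & & & & \vdots \\ 0 & \cdots & \cdots & \cdots & \cdots & 0 \\ b_{j+1\,j+1} & 0 & \cdots & \cdots & \cdots & 0 \\ \vdots & \ddots & & & & \vdots \\ b_{n\,j+1} & \cdots & b_{nn} & 0 & \cdots & 0 \end{pmatrix}.$$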
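The rank, nilpotency, and product properties of Subsection 3.7.4.1, the inverse in Theorem 3.17, and the triangular pattern just described can all be verified numerically. This is a minimal sketch assuming NumPy; `shift` is an illustrative helper, and the matrix $R$ is an arbitrary random choice.

```python
import numpy as np

def shift(n, j):
    """Shifting matrix S_j; np.eye returns the null matrix once j >= n."""
    return np.eye(n, k=-j)

n, m = 5, 3
rng = np.random.default_rng(3)

# Properties 1-3 of Subsection 3.7.4.1
ranks = [np.linalg.matrix_rank(shift(n, j)) for j in range(1, n)]
products_ok = all(np.allclose(shift(n, i) @ shift(n, j), shift(n, i + j))
                  for i in range(n) for j in range(n))
nilpotent = np.allclose(np.linalg.matrix_power(shift(n, 1), n), 0.0)

# Theorem 3.17: inverse of M(r) = I + (R kron S_1) as a finite alternating sum
R = rng.standard_normal((m, m))
M = np.eye(n * m) + np.kron(R, shift(n, 1))
M_inv = sum((-1) ** k * np.kron(np.linalg.matrix_power(R, k), shift(n, k))
            for k in range(n))

# Subsection 3.7.4.2: S_j' A for upper-triangular A
j = 2
A_up = np.triu(rng.standard_normal((n, n)))
shifted = shift(n, j).T @ A_up

print(ranks, products_ok, nilpotent)
print(np.allclose(M_inv, np.linalg.inv(M)))
```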
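3.7.4.3. Shifting Matrices and Toeplitz Matrices

Shifting matrices form the building blocks for Toeplitz matrices. An $n \times n$ matrix $A$ is Toeplitz if it takes the form
$$A = \begin{pmatrix} a_1 & a_2 & a_3 & \cdots & a_n \\ b_1 & a_1 & a_2 & \cdots & a_{n-1} \\ b_2 & b_1 & a_1 & \cdots & a_{n-2} \\ \vdots & & \ddots & \ddots & \vdots \\ b_{n-1} & \cdots & b_2 & b_1 & a_1 \end{pmatrix},$$
that is, the matrix takes on a constant along all its diagonals running from upper left to lower right. It is easily seen that $A$ is Toeplitz if and only if it can be written as
$$A = \sum_{i=1}^{n} a_iS_{i-1}' + \sum_{i=1}^{n-1} b_iS_i = \sum_{i=1}^{n} a_i(S_1^{i-1})' + \sum_{i=1}^{n-1} b_iS_1^i,$$
where $S_0 = S_1^0 = I$. Two special sorts of Toeplitz matrices which we will come across a lot in future are
$$A = I_n + a_1S_1 + \cdots + a_pS_p = \begin{pmatrix} 1 & & & & & 0 \\ a_1 & 1 & & & & \\ \vdots & \ddots & \ddots & & & \\ a_p & & \ddots & \ddots & & \\ & \ddots & & \ddots & \ddots & \\ 0 & & a_p & \cdots & a_1 & 1 \end{pmatrix}$$
and
$$B = b_1S_1 + \cdots + b_pS_p = \begin{pmatrix} 0 & & & & & 0 \\ b_1 & 0 & & & & \\ \vdots & \ddots & \ddots & & & \\ b_p & & \ddots & \ddots & & \\ & \ddots & & \ddots & \ddots & \\ 0 & & b_p & \cdots & b_1 & 0 \end{pmatrix}.$$
Both these matrices are Toeplitz matrices that are lower triangular and band. As we shall be working with such matrices quite a lot in future chapters we finish off this section by looking at some of their useful properties as presented in the following theorems:

Theorem 3.18. Let $A$ and $B$ be the $n \times n$ matrices given by
$$A = I_n + a_1S_1 + \cdots + a_pS_p, \qquad B = b_1S_1 + \cdots + b_pS_p.$$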
Then
$$AB = BA = b_1S_1 + c_2S_2 + \cdots + c_{2p}S_{2p} = \begin{pmatrix} 0 & & & & & 0 \\ b_1 & 0 & & & & \\ c_2 & b_1 & \ddots & & & \\ \vdots & \ddots & \ddots & \ddots & & \\ c_{2p} & & \ddots & \ddots & \ddots & \\ 0 & c_{2p} & \cdots & c_2 & b_1 & 0 \end{pmatrix},$$
provided that $2p < n$, where $c_2, \ldots, c_{2p}$ are formed from the $a$s and the $b$s. If $n \le 2p$, then
$$AB = BA = b_1S_1 + c_2S_2 + \cdots + c_{n-1}S_{n-1}.$$

Proof of Theorem 3.18.
$$AB = (I_n + a_1S_1 + \cdots + a_pS_p)(b_1S_1 + \cdots + b_pS_p).$$
Using the property of shifting matrices that $S_iS_j = S_{i+j}$ if $i + j < n$, and $S_iS_j = 0$ if $i + j \ge n$, we can write the product $AB$ as
$$\begin{aligned} AB = {} & b_1S_1 + b_2S_2 + \cdots + b_pS_p \\ & + a_1b_1S_2 + \cdots + a_1b_{p-1}S_p + a_1b_pS_{p+1} \\ & + \cdots \\ & + a_{p-1}b_1S_p + \cdots + a_{p-1}b_pS_{2p-1} \\ & + a_pb_1S_{p+1} + \cdots + a_pb_pS_{2p}, \end{aligned}$$
with the understanding that some of these shifting matrices will be the null matrix if $2p \ge n$. Collecting shifting matrices of the same order gives the result. Moreover, writing the product $BA$ in the same manner shows that $AB = BA$. ❑

Theorem 3.19. Let $A$ be the $n \times n$ matrix
$$A = I_n + a_1S_1 + \cdots + a_pS_p.$$
Then
$$A^{-1} = I_n + c_1S_1 + \cdots + c_{n-1}S_{n-1} = \begin{pmatrix} 1 & & & 0 \\ c_1 & 1 & & \\ \vdots & \ddots & \ddots & \\ c_{n-1} & \cdots & c_1 & 1 \end{pmatrix},$$
where the $c_j$s are products of the $a_i$s.

Proof of Theorem 3.19. This proof is similar to that of Theorem 2.5 of Section 2.5 of Chap. 2. ❑

Theorem 3.19 gives us the form of the inverse $A^{-1}$, and this is all we really need. However, if we like, we can often go further and actually specify the elements $c_j$ of the inverse as shown by the following theorem.

Theorem 3.20 (see Footnote 2). Let $A$ and $A^{-1}$ be the matrices specified in Theorem 3.19. Then
$$c_j = (-1)^jA_{jj}, \qquad j = 1, \ldots, n-1,$$
where the $A_{jj}$s are the leading principal minors of the $(n-1) \times (n-1)$ matrix
$$\begin{pmatrix} a_1 & 1 & 0 & \cdots & 0 \\ a_2 & a_1 & 1 & & \vdots \\ \vdots & & \ddots & \ddots & 0 \\ a_p & & & \ddots & 1 \\ 0 & a_p & \cdots & a_2 & a_1 \end{pmatrix}.$$

Footnote 2: I am grateful to Shiqing Ling for providing me with this theorem and proof.
Proof of Theorem 3.20. As $A$ is lower triangular with ones as its main diagonal elements, by Theorem 2.5 of Section 2.5 of Chap. 2, $A^{-1}$ is also lower triangular with ones as its main diagonal elements. Hence, in forming the inverse $A^{-1}$, we need consider only the cofactors $c_{ij}$ for $j > i$, $i, j = 1, \ldots, n-1$. Interchanging rows and columns if necessary, we see that
$$c_{i\,i+k} = (-1)^{2i+k} \begin{vmatrix} a_1 & 1 & & 0 & \\ a_2 & a_1 & \ddots & & 0 \\ \vdots & & \ddots & 1 & \\ a_k & \cdots & a_2 & a_1 & \\ & & & & B \end{vmatrix}$$
for a suitable submatrix $B$ of $A$. However,
$$\begin{vmatrix} C & 0 \\ D & E \end{vmatrix} = \begin{vmatrix} C & F \\ 0 & E \end{vmatrix} = |C||E|$$
for square matrices $C$ and $E$, so
$$c_{i\,i+k} = (-1)^kA_{kk}.$$
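Moreover, as $A$ is triangular with ones as its main diagonal elements, $|A| = 1$, and $c_{i\,i+k}$ is the $(i+k, i)$ element of $A^{-1}$. ❑

With this theorem in hand, we can prove the following theorem.

Theorem 3.21. Suppose $A$ is an $nG \times nG$ matrix and write
$$A = \begin{pmatrix} A_{11} & \cdots & A_{1G} \\ \vdots & & \vdots \\ A_{G1} & \cdots & A_{GG} \end{pmatrix},$$
where each submatrix $A_{ij}$ is $n \times n$ and each $A_{ii}$ is of the form
$$I_n + a_1S_1 + \cdots + a_pS_p = \begin{pmatrix} 1 & & & & 0 \\ a_1 & 1 & & & \\ \vdots & \ddots & \ddots & & \\ a_p & & \ddots & \ddots & \\ 0 & a_p & \cdots & a_1 & 1 \end{pmatrix}$$
for $i = 1, \ldots, G$, whereas each $A_{ij}$, for $i \ne j$, is of the form
$$b_1S_1 + \cdots + b_pS_p = \begin{pmatrix} 0 & & & & 0 \\ b_1 & 0 & & & \\ \vdots & \ddots & \ddots & & \\ b_p & & \ddots & \ddots & \\ 0 & b_p & \cdots & b_1 & 0 \end{pmatrix}.$$
Suppose $A$ is nonsingular and let
$$A^{-1} = \begin{pmatrix} A^{11} & \cdots & A^{1G} \\ \vdots & & \vdots \\ A^{G1} & \cdots & A^{GG} \end{pmatrix}.$$
Then each $n \times n$ matrix $A^{ii}$ is of the form
$$A^{ii} = I_n + c_1S_1 + \cdots + c_{n-1}S_{n-1} = \begin{pmatrix} 1 & & & 0 \\ c_1 & 1 & & \\ \vdots & \ddots & \ddots & \\ c_{n-1} & \cdots & c_1 & 1 \end{pmatrix},$$
whereas each $A^{ij}$, $i \ne j$, is of the form
$$A^{ij} = d_1S_1 + \cdots + d_{n-1}S_{n-1} = \begin{pmatrix} 0 & & & 0 \\ d_1 & 0 & & \\ \vdots & \ddots & \ddots & \\ d_{n-1} & \cdots & d_1 & 0 \end{pmatrix},$$
where the $c_j$s and $d_j$s are products of the elements of $A$.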
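Proof of Theorem 3.21. This proof is similar to that of Theorem 2.6 of Section 2.5 of Chap. 2. ❑

We finish this discussion by showing that the $nG \times nG$ matrix $A$ specified in Theorem 3.21 can be written as
$$A = I_{nG} + (R \otimes I_n)C,$$
where $R$ is a $G \times Gp$ matrix and $C$ is the $Gpn \times Gn$ matrix given by
$$C = \begin{pmatrix} I_G \otimes S_1 \\ \vdots \\ I_G \otimes S_p \end{pmatrix}.$$
To do this, we let $N(r) = (R \otimes I_n)C$ and write $R = (R_1 \cdots R_p)$, where each submatrix $R_l$ is $G \times G$ so
$$N(r) = (R_1 \otimes S_1) + \cdots + (R_p \otimes S_p).$$
Now, letting $R_l = \{r_{ij}^l\}$ for $l = 1, \ldots, p$, we have
$$N(r) = \begin{pmatrix} r_{11}^1S_1 & \cdots & r_{1G}^1S_1 \\ \vdots & & \vdots \\ r_{G1}^1S_1 & \cdots & r_{GG}^1S_1 \end{pmatrix} + \cdots + \begin{pmatrix} r_{11}^pS_p & \cdots & r_{1G}^pS_p \\ \vdots & & \vdots \\ r_{G1}^pS_p & \cdots & r_{GG}^pS_p \end{pmatrix},$$
so if we write
$$N(r) = \begin{pmatrix} N_{11} & \cdots & N_{1G} \\ \vdots & & \vdots \\ N_{G1} & \cdots & N_{GG} \end{pmatrix},$$
then
$$N_{ij} = r_{ij}^1S_1 + r_{ij}^2S_2 + \cdots + r_{ij}^pS_p = \begin{pmatrix} 0 & & & & 0 \\ r_{ij}^1 & 0 & & & \\ \vdots & \ddots & \ddots & & \\ r_{ij}^p & & \ddots & \ddots & \\ 0 & r_{ij}^p & \cdots & r_{ij}^1 & 0 \end{pmatrix}.$$
As under this notation
$$A = \begin{pmatrix} I_n & & 0 \\ & \ddots & \\ 0 & & I_n \end{pmatrix} + \begin{pmatrix} N_{11} & \cdots & N_{1G} \\ \vdots & & \vdots \\ N_{G1} & \cdots & N_{GG} \end{pmatrix},$$
we see that $A = I_{nG} + (R \otimes I_n)C$ meets the specifications of Theorem 3.21. Finally, it follows that $A^{-1}$ can be written as
$$A^{-1} = I_{nG} + (\bar R \otimes I_n)\bar C,$$
say, where $\bar R$ is a $G \times G(n-1)$ matrix whose elements are formed from those of $A$ and $\bar C$ is the $Gn(n-1) \times Gn$ matrix given by
$$\bar C = \begin{pmatrix} I_G \otimes S_1 \\ \vdots \\ I_G \otimes S_{n-1} \end{pmatrix}.$$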
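Theorems 3.18 to 3.20 can be illustrated with small band Toeplitz matrices. The sketch below assumes NumPy; the coefficient values and helper names (`shift`, `a_ext`) are illustrative. The coefficients of the shifting matrices are read off the first column of each lower-triangular Toeplitz matrix.

```python
import numpy as np

def shift(n, j):
    """Shifting matrix S_j (the null matrix when j >= n)."""
    return np.eye(n, k=-j)

n, p = 6, 2
a = np.array([0.5, -0.3])     # a_1, ..., a_p (illustrative values)
b = np.array([1.2, 0.7])      # b_1, ..., b_p (illustrative values)

A = np.eye(n) + sum(a[k - 1] * shift(n, k) for k in range(1, p + 1))
B = sum(b[k - 1] * shift(n, k) for k in range(1, p + 1))

# Theorem 3.18: A and B commute and AB is a combination of S_1, ..., S_2p
AB = A @ B
c_prod = AB[1:, 0]            # coefficients of S_1, S_2, ... from the first column
AB_rebuilt = sum(c_prod[j - 1] * shift(n, j) for j in range(1, n))

# Theorem 3.19: the inverse is lower-triangular Toeplitz with unit diagonal
A_inv = np.linalg.inv(A)
c = A_inv[1:, 0]              # c_1, ..., c_{n-1}
A_inv_rebuilt = np.eye(n) + sum(c[j - 1] * shift(n, j) for j in range(1, n))

# Theorem 3.20: c_j = (-1)^j A_jj, with A_jj the leading principal minors of T
def a_ext(k):
    """a_0 = 1, a_k for 1 <= k <= p, and zero otherwise."""
    if k == 0:
        return 1.0
    return a[k - 1] if 1 <= k <= p else 0.0

T = np.array([[a_ext(r - s + 1) for s in range(n - 1)] for r in range(n - 1)])
minors = np.array([np.linalg.det(T[:j, :j]) for j in range(1, n)])
c_from_minors = np.array([(-1.0) ** j for j in range(1, n)]) * minors

print(np.round(c, 6))
print(np.round(c_from_minors, 6))
```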
3.7.4.4. Shifting Matrices and Circulant Matrices

Circulant matrices form a subset of the set of Toeplitz matrices. A circulant matrix of the order of $n$ is an $n \times n$ matrix of the form
$$C = \mathrm{circ}(c_1, c_2, \ldots, c_n) = \begin{pmatrix} c_1 & c_2 & \cdots & c_n \\ c_n & c_1 & \cdots & c_{n-1} \\ \vdots & & \ddots & \vdots \\ c_2 & c_3 & \cdots & c_1 \end{pmatrix},$$
that is, the elements of each row of the circulant matrix $C$ are identical to those of the previous row but moved to the right one position and wrapped around. Unlike most other matrices, circulant matrices commute under multiplication. If $C$ and $D$ are $n \times n$ circulant matrices then so is their product $CD$. If $C$ is circulant then so is $C'$, and if $C$ is circulant and nonsingular then so is $C^{-1}$. Any linear function of circulant matrices of the same order is circulant. Finally, any circulant matrix is symmetric about its main counterdiagonal. A zero-one matrix that is circulant is the forward-shift matrix, which is defined as
$$\Pi_1 = \mathrm{circ}(0, 1, 0, \ldots, 0) = \begin{pmatrix} 0 & 1 & & 0 \\ & \ddots & \ddots & \\ 0 & & \ddots & 1 \\ 1 & 0 & \cdots & 0 \end{pmatrix}.$$
Note that the forward-shift matrix is also a permutation matrix so $\Pi_1' = \Pi_1^{-1}$, and this inverse is also circulant.

Suppose now $A = \{a_{ij}\}$ is an $n \times m$ matrix. Then
$$\Pi_1A = \begin{pmatrix} a_{21} & \cdots & a_{2m} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nm} \\ a_{11} & \cdots & a_{1m} \end{pmatrix},$$
that is, when a matrix $A$ is premultiplied by the forward-shift matrix, the rows of $A$ are shifted up one place and wrapped around. It operates similarly to the shifting matrix $S_1'$, the difference in the two operations being this: When $A$ is premultiplied by $S_1'$, the rows of $A$ are pushed up one and zeros are placed in the last row of $A$. When $A$ is premultiplied by $\Pi_1$, the rows of $A$ are pushed up one and wrapped around. As with shifting matrices, we can define a series of forward-shift matrices as
$$\Pi_j = \mathrm{circ}(0, \ldots, 0, 1, 0, \ldots, 0), \qquad j = 1, \ldots, n-1,$$
with the one in the $(j+1)$th position. Then, when $A$ is premultiplied by $\Pi_j$, the rows of $A$ are pushed up $j$ places and wrapped around. Like shifting matrices,
$$\Pi_j = \Pi_1^j, \qquad j = 0, \ldots, n-1.$$
However, unlike shifting matrices, these forward-shift matrices are permutation matrices and are therefore orthogonal. To use this property, we write
$$\Pi_n = \Pi_1^n = \mathrm{circ}(1, 0, \ldots, 0) = I_n = \Pi_j\Pi_{n-j} \qquad \text{for } j = 0, \ldots, n.$$
It follows then that
$$\Pi_j' = (\Pi_j)^{-1} = \Pi_{n-j}, \qquad j = 0, \ldots, n.$$
The matrices $\Pi_j'$ behave in much the same way as the shifting matrices $S_j$ except with wrapping around. Forward-shift matrices are, in fact, the sum of shifting matrices as
$$\Pi_j = S_j' + S_{n-j}, \qquad j = 0, \ldots, n-1.$$
We can write any $n \times n$ circulant matrix $C = \mathrm{circ}(c_1, c_2, \ldots, c_n)$ first as a linear combination of forward-shift matrices and by using this property as a linear combination of shifting matrices:
$$\begin{aligned} C &= c_1I_n + c_2\Pi_1 + \cdots + c_n\Pi_{n-1} \\ &= c_1I_n + c_2(S_1' + S_{n-1}) + \cdots + c_n(S_{n-1}' + S_1) \\ &= c_1I_n + c_2S_1' + \cdots + c_nS_{n-1}' + c_nS_1 + \cdots + c_2S_{n-1}. \end{aligned}$$
A good reference for both Toeplitz and circulant matrices is Davis (1979).
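The relations between circulant, forward-shift, and shifting matrices can be checked directly. This is a sketch assuming NumPy; `forward_shift` builds $\Pi_j$ by rolling the columns of the identity matrix, and `circ` is an illustrative constructor following the row pattern given in the text.

```python
import numpy as np

def shift(n, j):
    """Shifting matrix S_j (the null matrix when j >= n)."""
    return np.eye(n, k=-j)

def forward_shift(n, j):
    """Pi_j: row i has its one in column (i + j) mod n."""
    return np.roll(np.eye(n), j, axis=1)

def circ(c):
    """Circulant matrix whose first row is c, each row moved right one place and wrapped."""
    n = len(c)
    return np.array([[c[(l - i) % n] for l in range(n)] for i in range(n)])

n = 5
c = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
d = np.array([0.5, -1.0, 0.0, 2.0, 1.5])
C, Dm = circ(c), circ(d)

C_from_Pi = sum(c[j] * forward_shift(n, j) for j in range(n))
C_from_S = c[0] * np.eye(n) + sum(c[j] * (shift(n, j).T + shift(n, n - j))
                                  for j in range(1, n))

A = np.arange(1.0, 3 * n + 1).reshape(n, 3)
wrapped = forward_shift(n, 1) @ A      # rows pushed up one place and wrapped around
print(wrapped)
```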
3.7.4.5. Traces and Shifting Matrices

In the application of shifting matrices to time-series models, we often have occasion to consider $\mathrm{tr}\, S_i'AS_j$, where $A$ is an $n \times n$ matrix. First, we consider $\mathrm{tr}\, S_i'AS_i$. To obtain $S_i'AS_i$ we imagine shifting the lower right-hand corner of $A$ up the main diagonal $i$ places. It follows then that
$$\mathrm{tr}\, S_i'AS_i = \mathrm{tr}\, A - a_{11} - a_{22} - \cdots - a_{ii}, \qquad i = 1, \ldots, n-1.$$
Now we consider $\mathrm{tr}\, S_i'AS_j$. If $j > i$, then we are shifting the elements of $A$ across more spaces than we are raising them. It follows then that $\mathrm{tr}\, S_i'AS_j$ is the sum of the elements of the $(j - i)$th diagonal above the main diagonal minus the first $i$ elements of this diagonal. If, however, $j < i$, then we are shifting the elements of $A$ up further than we are moving them across. It follows then that $\mathrm{tr}\, S_i'AS_j$ is the sum of the elements of the $(i - j)$th diagonal below the main diagonal minus the first $j$ elements of this diagonal.

3.7.4.6. Shifting Matrices and Partitioned Matrices

Consider an $np \times r$ matrix $A$ that we partition into $p$ matrices:
$$A = \begin{pmatrix} A_1 \\ \vdots \\ A_p \end{pmatrix},$$
where each submatrix $A_i$ is $n \times r$. Suppose we wish to shift the partitioned matrices down one and replace the spaces thus created with zeros; that is, we wish to form
$$C = \begin{pmatrix} 0 \\ A_1 \\ \vdots \\ A_{p-1} \end{pmatrix}.$$
Then it is easily seen that $C = (S_1 \otimes I_n)A$. In like manner,
$$\begin{pmatrix} A_2 \\ \vdots \\ A_p \\ 0 \end{pmatrix} = (S_1' \otimes I_n)A.$$
Similar results hold for an $r \times np$ matrix $B$ that we partition into $p$ matrices:
$$B = (B_1 \ \cdots \ B_p),$$
where each submatrix $B_i$ is $r \times n$. Suppose we wish to move the partitioned matrices to the right one and replace the spaces thus created with zeros; that is, we wish to consider
$$D = (0 \ B_1 \ \cdots \ B_{p-1}).$$
Again, it is easily seen that $D = B(S_1' \otimes I_n)$. In like manner,
$$(B_2 \ \cdots \ B_p \ 0) = B(S_1 \otimes I_n).$$
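The trace rule and the block-shifting identities of the last two subsections can be checked as follows. The sketch assumes NumPy; `diag_rule` is an illustrative function that sums the relevant diagonal of $A$ after dropping its leading elements, which is the verbal rule stated above.

```python
import numpy as np

def shift(n, j):
    """Shifting matrix S_j (the null matrix when j >= n)."""
    return np.eye(n, k=-j)

n = 6
rng = np.random.default_rng(4)
A = rng.standard_normal((n, n))

def trace_direct(i, j):
    return np.trace(shift(n, i).T @ A @ shift(n, j))

def diag_rule(i, j):
    """Sum of the (j - i)th diagonal of A, dropping its first min(i, j) elements."""
    if j >= i:
        return np.diag(A, k=j - i)[i:].sum()      # diagonal above (or on) the main one
    return np.diag(A, k=-(i - j))[j:].sum()       # diagonal below the main one

trace_ok = all(np.isclose(trace_direct(i, j), diag_rule(i, j))
               for i in range(n) for j in range(n))

# Partitioned matrices: p blocks of order nb x r stacked in an (nb p) x r matrix
p, nb, r = 4, 2, 3
Ablk = rng.standard_normal((nb * p, r))
S1p = shift(p, 1)                                  # here S_1 is p x p
blocks_down = np.kron(S1p, np.eye(nb)) @ Ablk
blocks_up = np.kron(S1p.T, np.eye(nb)) @ Ablk
Bblk = Ablk.T                                      # r x (nb p), blocks side by side
blocks_right = Bblk @ np.kron(S1p.T, np.eye(nb))
print(trace_ok)
```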
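3.7.5. Some Theorems About Shifting Matrices

In this subsection we prove some theorems about shifting matrices that are important for the work in Chap. 5. These theorems involve the $n$ shifting matrices $I_n, S_1, S_2, \ldots, S_{n-1}$. Let $\mathbf{S}$ be the $n \times n^2$ matrix given by
$$\mathbf{S} = (I_n \ S_1 \ S_2 \ \cdots \ S_{n-1})$$
and let $\mathbf{S}^\tau$ be the $n^2 \times n$ matrix given by
$$\mathbf{S}^\tau = \begin{pmatrix} I_n \\ S_1 \\ \vdots \\ S_{n-1} \end{pmatrix},$$
that is, $\mathbf{S}^\tau$ is $\mathrm{vec}_n\, \mathbf{S}$. Then we have the following theorems concerning $\mathbf{S}$ and $\mathbf{S}^\tau$.

Theorem 3.22. Let $x$ be an $n \times 1$ vector. Then
$$\mathbf{S}(x \otimes I_n) = \mathbf{S}(I_n \otimes x) = (x' \otimes I_n)\mathbf{S}^\tau.$$

Proof of Theorem 3.22. Consider
$$\mathbf{S}(I_n \otimes x) = (x \ S_1x \ \cdots \ S_{n-1}x) = \begin{pmatrix} x_1 & 0 & \cdots & 0 \\ x_2 & x_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ x_n & x_{n-1} & \cdots & x_1 \end{pmatrix}.$$
Now
$$\mathbf{S}(x \otimes I_n) = x_1I_n + x_2S_1 + \cdots + x_nS_{n-1} = \begin{pmatrix} x_1 & & 0 \\ & \ddots & \\ 0 & & x_1 \end{pmatrix} + \begin{pmatrix} 0 & & & 0 \\ x_2 & \ddots & & \\ & \ddots & \ddots & \\ 0 & & x_2 & 0 \end{pmatrix} + \cdots + \begin{pmatrix} 0 & \cdots & 0 \\ \vdots & & \vdots \\ x_n & \cdots & 0 \end{pmatrix} = \mathbf{S}(I_n \otimes x).$$
Similarly, $(x' \otimes I_n)\mathbf{S}^\tau = \mathbf{S}(I_n \otimes x)$. ❑

Theorem 3.23. $(I_n \otimes x')\mathbf{S}^\tau$ is symmetric.

Proof of Theorem 3.23. This is obvious, as
$$(I_n \otimes x')\mathbf{S}^\tau = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ x_2 & x_3 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ x_n & 0 & \cdots & 0 \end{pmatrix}.$$
❑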
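Theorems 3.22 and 3.23 can be verified for a particular vector $x$. The following sketch assumes NumPy; the names `S_all` and `S_tau` stand for the $n \times n^2$ matrix of all shifting matrices and its generalized vec, and are illustrative.

```python
import numpy as np

def shift(n, j):
    """Shifting matrix S_j (S_0 is the identity matrix)."""
    return np.eye(n, k=-j)

n = 5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
col = x.reshape(-1, 1)
row = x.reshape(1, -1)
I = np.eye(n)

S_all = np.hstack([shift(n, j) for j in range(n)])   # (I_n S_1 ... S_{n-1}), n x n^2
S_tau = np.vstack([shift(n, j) for j in range(n)])   # its generalized vec, n^2 x n

T1 = S_all @ np.kron(col, I)
T2 = S_all @ np.kron(I, col)
T3 = np.kron(row, I) @ S_tau
H = np.kron(I, row) @ S_tau          # Theorem 3.23: a symmetric matrix

# Lower-triangular Toeplitz matrix with first column x, built entry by entry
lower_toeplitz = np.array([[x[i - j] if i >= j else 0.0 for j in range(n)]
                           for i in range(n)])
print(T2)
print(H)
```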
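Theorem 3.24. Let $S$ be an $n \times np$ matrix made up of a selection of any $p$ shifting matrices so
$$S = \mathbf{S}(\mathcal{S} \otimes I_n),$$
where $\mathcal{S}$ is an $n \times p$ selection matrix whose columns are the appropriate columns of $I_n$. Then, for $a$, a $p \times 1$ vector, and $x$, an $n \times 1$ vector,

1. $S(a \otimes I_n) = \mathbf{S}(\mathcal{S}a \otimes I_n) = \mathbf{S}(I_n \otimes \mathcal{S}a) = (a'\mathcal{S}' \otimes I_n)\mathbf{S}^\tau$,
2. $S(I_p \otimes x) = (x' \otimes I_n)\mathbf{S}^\tau\mathcal{S}$,
3. $(I_p \otimes x')S^\tau = \mathcal{S}'\mathbf{S}^{\tau\prime}(I_n \otimes x)$.

Proof of Theorem 3.24.

1. Clearly, $S(a \otimes I_n) = \mathbf{S}(\mathcal{S} \otimes I_n)(a \otimes I_n) = \mathbf{S}(\mathcal{S}a \otimes I_n)$. The result follows from the application of Theorem 3.22.
2. $S(I_p \otimes x) = \mathbf{S}(\mathcal{S} \otimes x) = \mathbf{S}(I_n \otimes x)\mathcal{S}$. The result follows from Theorem 3.22.
3. Consider
$$(I_p \otimes x')S^\tau = (I_p \otimes x')[\mathbf{S}(\mathcal{S} \otimes I_n)]^\tau = (I_p \otimes x')(\mathcal{S}' \otimes I_n)\mathbf{S}^\tau$$
by Theorem 2.4 of Chap. 2. However, clearly this is $\mathcal{S}'(I_n \otimes x')\mathbf{S}^\tau$; the result follows from Theorem 3.23. ❑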
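It is illuminating to consider the types of matrices presenting themselves in Theorem 3.24. Two important cases need to be considered.

1. $S = (S_1 \ S_2 \ \cdots \ S_p)$: Here $\mathcal{S} = (e_2 \ \cdots \ e_{p+1})$, where $e_j$ is the $j$th column of $I_n$, and
$$S(a \otimes I_n) = \begin{pmatrix} 0 & & & & 0 \\ a_1 & 0 & & & \\ \vdots & \ddots & \ddots & & \\ a_p & & \ddots & \ddots & \\ 0 & a_p & \cdots & a_1 & 0 \end{pmatrix},$$
$$S(I_p \otimes x) = \begin{pmatrix} 0 & 0 & \cdots & 0 \\ x_1 & 0 & \cdots & 0 \\ x_2 & x_1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ x_{n-1} & x_{n-2} & \cdots & x_{n-p} \end{pmatrix},$$
$$(I_p \otimes x')S^\tau = \begin{pmatrix} x_2 & x_3 & \cdots & \cdots & x_n & 0 \\ x_3 & \cdots & \cdots & x_n & 0 & 0 \\ \vdots & & & & & \vdots \\ x_{p+1} & \cdots & x_n & 0 & \cdots & 0 \end{pmatrix}.$$

2. $S = (S_{n-1} \ \cdots \ S_1 \ I_n)$: Here $\mathcal{S} = (e_n \ e_{n-1} \ \cdots \ e_1)$, and
$$S(a \otimes I_n) = \begin{pmatrix} a_n & & & 0 \\ a_{n-1} & a_n & & \\ \vdots & \ddots & \ddots & \\ a_1 & \cdots & a_{n-1} & a_n \end{pmatrix},$$
$$S(I_n \otimes x) = \begin{pmatrix} 0 & \cdots & 0 & x_1 \\ \vdots & & x_1 & x_2 \\ 0 & & & \vdots \\ x_1 & x_2 & \cdots & x_n \end{pmatrix},$$
which is symmetrical, and
$$(I_n \otimes x')S^\tau = \begin{pmatrix} x_n & & & 0 \\ x_{n-1} & x_n & & \\ \vdots & \ddots & \ddots & \\ x_1 & \cdots & x_{n-1} & x_n \end{pmatrix}.$$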
3.7.6. Shifting Matrices and Time-Series Processes

An obvious application of shifting matrices is in time-series analysis. In writing a time-series process in matrix notation, shifting matrices, at least as far as our asymptotic theory is concerned, play the same role as that of lag operators when we write the process for a particular time period $t$. Consider an autoregressive process of the order of 1 (see Footnote 3),
$$u_t + a_1u_{t-1} = \varepsilon_t, \qquad t = 1, \ldots, n,$$
and the $\varepsilon_t$ are assumed to be i.i.d. random variables. Using the lag operator $l_1$, where $l_1u_t = u_{t-1}$, we can write the process at time $t$ as
$$(1 + a_1l_1)u_t = \varepsilon_t.$$
In matrix notation, we write this process as
$$u + a_1u_{-1} = \varepsilon,$$
where
$$u = \begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix}, \qquad u_{-1} = \begin{pmatrix} u_0 \\ u_1 \\ \vdots \\ u_{n-1} \end{pmatrix}, \qquad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$

Footnote 3: The correspondence between lag operators and shifting matrices can be used to derive results for shifting matrices. For example, for $-1 < a_1 < 1$, we know we can convert the moving-average process of the order of 1 to an autoregressive process by using the expansion
$$(1 + a_1l_1)^{-1} = 1 - a_1l_1 + a_1^2l_1^2 - \cdots,$$
where $l_ju_t = u_{t-j}$. It follows that
$$(I_n + a_1S_1)^{-1} = I_n - a_1S_1 + a_1^2S_1^2 - \cdots.$$
However, we know that $S_iS_j = S_{i+j}$ if $i + j < n$, and $S_iS_j = 0$ if $i + j \ge n$, so
$$(I_n + a_1S_1)^{-1} = I_n - a_1S_1 + \cdots + (-1)^{n-1}a_1^{n-1}S_{n-1}.$$
This is just a special example of Theorem 3.17.

As far as asymptotic theory is concerned, all presample values may be replaced with zeros without affecting our asymptotic results. If we do this, then
$$u_{-1} = \begin{pmatrix} 0 \\ u_1 \\ \vdots \\ u_{n-1} \end{pmatrix} = S_1u,$$
and we can write the autoregressive process in matrix notation as
$$(I_n + a_1S_1)u = \varepsilon.$$
In like manner, consider an autoregressive process of the order of $p$:
$$u_t + a_1u_{t-1} + \cdots + a_pu_{t-p} = \varepsilon_t, \qquad t = 1, \ldots, n,$$
and $p < n$. In matrix notation, we write the process as
$$u + a_1u_{-1} + \cdots + a_pu_{-p} = \varepsilon,$$
where
$$u_{-j} = (u_{-j+1} \ \cdots \ u_0 \ u_1 \ \cdots \ u_{n-j})'.$$
Again, replacing presample values with zeros, we have
$$u_{-j} = (0 \ \cdots \ 0 \ u_1 \ \cdots \ u_{n-j})' = S_ju, \tag{3.15}$$
and we can write the autoregressive process as
$$u + a_1S_1u + \cdots + a_pS_pu = \varepsilon,$$
or
$$u + U_pa = \varepsilon, \tag{3.16}$$
where $U_p = (S_1u \ \cdots \ S_pu) = S(I_p \otimes u)$, and $S$ is the $n \times np$ matrix $S = (S_1 \ \cdots \ S_p)$.
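The matrix form of the autoregressive process can be compared with the period-by-period recursion. The sketch below assumes NumPy; the coefficients and sample size are arbitrary, and presample values are set to zero as in the text.

```python
import numpy as np

def shift(n, j):
    """Shifting matrix S_j (the null matrix when j >= n)."""
    return np.eye(n, k=-j)

n, p = 8, 2
a = np.array([0.6, -0.2])                 # a_1, ..., a_p (illustrative values)
rng = np.random.default_rng(5)
eps = rng.standard_normal(n)

# Recursion: u_t + a_1 u_{t-1} + ... + a_p u_{t-p} = eps_t with zero presample values
u = np.zeros(n)
for t in range(n):
    u[t] = eps[t] - sum(a[j - 1] * u[t - j] for j in range(1, p + 1) if t - j >= 0)

# Matrix notation: (I_n + a_1 S_1 + ... + a_p S_p) u = eps
A = np.eye(n) + sum(a[j - 1] * shift(n, j) for j in range(1, p + 1))

# Eq. (3.16): u + U_p a = eps with U_p = S (I_p kron u), S = (S_1 ... S_p)
S = np.hstack([shift(n, j) for j in range(1, p + 1)])
U_p = S @ np.kron(np.eye(p), u.reshape(-1, 1))

print(np.allclose(A @ u, eps), np.allclose(u + U_p @ a, eps))
```

Solving `A @ u = eps` with `np.linalg.solve(A, eps)` recovers the same `u` as the recursion, which is the sense in which the shifting matrices play the role of lag operators.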
4
Matrix Calculus
4.1. INTRODUCTION
The advent of matrix calculus, as mentioned in Chap. 2, has greatly facilitated the complicated differentiation required in applying classical statistical techniques to econometric models. In this chapter, we develop the matrix calculus results that are needed in the applications examined in future chapters.

Two things should be noted about the approach taken in this chapter. First, the method used in deriving proofs for our matrix calculus results is different from that used by Magnus and Neudecker. In their book, Matrix Differential Calculus, these authors derive their proofs by first taking differentials. Although this approach has mathematical elegance, it is not really necessary. After all, in ordinary calculus we do not usually derive results for derivatives by first appealing to differentials but by referring to a few general rules of differentiation such as the chain rule and the product rule. This is the approach taken in this chapter. We will obtain our matrix calculus results on the whole by appealing to a few general rules, which are the generalizations of the chain rule and the product rule of univariate calculus.

Second, consider an $m \times n$ matrix $Y$ whose elements $y_{ij}$ are differentiable functions of the elements $x_{kl}$ of a $p \times q$ matrix $X$. Then we have $mnpq$ partial derivatives we can consider:
$$\frac{\partial y_{ij}}{\partial x_{kl}}, \qquad i = 1, \ldots, m, \quad j = 1, \ldots, n, \quad k = 1, \ldots, p, \quad l = 1, \ldots, q.$$
The question is how to arrange these derivatives. Different arrangements give rise to different concepts of derivatives in matrix calculus [see, for example, Magnus and Neudecker (1988), Graham (1981), Rogers (1980)]. Our approach is to consider the vectors $y = \mathrm{vec}\, Y$ and $x = \mathrm{vec}\, X$ and define a notion of a derivative of a vector $y$ with respect to another vector $x$. Such a notion will accommodate all our needs in the future chapters.

In this chapter, no attempt has been made to give an exhaustive list of matrix calculus results. [For such a list one can do no better than refer to Lütkepohl (1996).] Instead what is presented are results, some of which are new, that are most useful for our future applications.

4.2. BASIC DEFINITIONS
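Let $y = (y_j)$ be an $m \times 1$ vector whose elements are differentiable functions of the elements of an $n \times 1$ vector $x = (x_i)$. We write $y = y(x)$ and say that $y$ is a vector function of $x$. Then we have the following definition.

Definition 4.1. The derivative of $y$ with respect to $x$, denoted by $\partial y/\partial x$, is the $n \times m$ matrix given by (see Footnote 1)
$$\frac{\partial y}{\partial x} = \begin{pmatrix} \dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_m}{\partial x_1} \\ \vdots & & \vdots \\ \dfrac{\partial y_1}{\partial x_n} & \cdots & \dfrac{\partial y_m}{\partial x_n} \end{pmatrix}.$$
Note that under this notion if $y$ is a scalar so that $y(x)$ is a scalar function of $x$, the derivative $\partial y/\partial x$ is the $n \times 1$ vector given by
$$\frac{\partial y}{\partial x} = \begin{pmatrix} \dfrac{\partial y}{\partial x_1} \\ \vdots \\ \dfrac{\partial y}{\partial x_n} \end{pmatrix}.$$
Similarly, if $x$ is a scalar and $y$ is an $m \times 1$ vector, then the derivative $\partial y/\partial x$ is the $1 \times m$ vector
$$\frac{\partial y}{\partial x} = \left( \frac{\partial y_1}{\partial x} \ \cdots \ \frac{\partial y_m}{\partial x} \right).$$
For the general case in which $y$ and $x$ are $m \times 1$ and $n \times 1$ vectors, respectively, the $j$th column of the matrix $\partial y/\partial x$ is the derivative of a scalar function with respect to a vector, namely $\partial y_j/\partial x$, whereas the $i$th row of the matrix $\partial y/\partial x$ is the derivative of a vector with respect to a scalar, namely $\partial y/\partial x_i$.

Footnote 1: Magnus and Neudecker (1985) show that if one is to be mathematically formally correct one should define the derivative of $y$ with respect to $x$ as $\partial y/\partial x'$, rather than $\partial y/\partial x$, as we have.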
69
Matrix Calculus
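Row vectors are accommodated by the following definition. By the symbol $\partial y/\partial x'$, we mean the m x n matrix defined by
\[ \frac{\partial y}{\partial x'} = \left(\frac{\partial y}{\partial x}\right)', \]
and we define
\[ \frac{\partial y'}{\partial x} = \frac{\partial y}{\partial x}. \]

4.3. SOME SIMPLE MATRIX CALCULUS RESULTS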
The following simple matrix calculus results can be derived from our basic definitions.

Theorem 4.1. Let x be an n x 1 vector and let A be a matrix of constants (i.e., the elements of A are not scalar functions of x). Then
\[ \frac{\partial Ax}{\partial x} = A' \quad \text{for } A,\ m \times n, \]
\[ \frac{\partial x'A}{\partial x} = A \quad \text{for } A,\ n \times p, \]
\[ \frac{\partial x'Ax}{\partial x} = (A + A')x \quad \text{for } A,\ n \times n. \]
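The three rules of Theorem 4.1 can be checked numerically. The following is a minimal NumPy sketch, not part of the original text: the helper `deriv` is a hypothetical name for a central-difference approximation arranged as in Definition 4.1 (row i holds the partial derivatives with respect to $x_i$), and the matrices are arbitrary random draws.

```python
import numpy as np

def deriv(f, x, h=1e-6):
    """Central-difference derivative arranged as in Definition 4.1.

    f maps a length-n vector to a scalar or a length-m vector. The result is
    n x m: row i holds the partial derivatives with respect to x[i].
    """
    x = np.asarray(x, dtype=float)
    rows = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        f_plus = np.atleast_1d(f(x + step))
        f_minus = np.atleast_1d(f(x - step))
        rows.append((f_plus - f_minus) / (2.0 * h))
    return np.array(rows)

rng = np.random.default_rng(0)
m, n, p = 3, 4, 2
x = rng.normal(size=n)
A = rng.normal(size=(m, n))   # constant matrix in Ax
B = rng.normal(size=(n, p))   # constant matrix in x'B
C = rng.normal(size=(n, n))   # constant matrix in x'Cx

d_Ax = deriv(lambda v: A @ v, x)        # compare with A'
d_xB = deriv(lambda v: v @ B, x)        # compare with B
d_xCx = deriv(lambda v: v @ C @ v, x)   # compare with (C + C')x, an n x 1 column

print(d_Ax.shape, np.allclose(d_Ax, A.T, atol=1e-6))
print(d_xB.shape, np.allclose(d_xB, B, atol=1e-6))
print(d_xCx.shape, np.allclose(d_xCx[:, 0], (C + C.T) @ x, atol=1e-6))
```

Note that the n x m arrangement is the transpose of the Jacobian that most numerical libraries return, which is why the derivative of Ax comes out as A' rather than A.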
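Proof of Theorem 4.1. The jth element of Ax is $\sum_k a_{jk}x_k$, and so the jth column of $\partial Ax/\partial x$ is $A_{j\cdot}'$, where $A_{j\cdot}$ is the jth row of A, and $\partial Ax/\partial x = A'$. Under our notation, $\partial x'A/\partial x = \partial A'x/\partial x = A$. The jth element of $\partial x'Ax/\partial x$ is $\sum_i a_{ij}x_i + \sum_k a_{jk}x_k$, so $\partial x'Ax/\partial x = (A + A')x$. ❑

4.4. MATRIX CALCULUS AND ZERO-ONE MATRICES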
With the results of Theorem 4.1 in hand we can derive further results and we can see how zero-one matrices enter the picture. We saw in Subsection 3.3.1 of Chap. 3 that $\operatorname{vec} X' = K_{nr}\operatorname{vec} X$ for X, an n x r matrix, and $\operatorname{vec} X = D\operatorname{vech} X$ for X, a symmetric n x n matrix, where $K_{nr}$ and D are a commutation matrix and a duplication matrix, respectively. It follows immediately that
\[ \frac{\partial \operatorname{vec} X'}{\partial \operatorname{vec} X} = K_{nr}' = K_{rn} \tag{4.1} \]
for X, an n x r matrix, and
\[ \frac{\partial \operatorname{vec} X}{\partial \operatorname{vech} X} = D' \]
for X, a symmetric n x n matrix. Moreover, as $\operatorname{vec} AXB = (B' \otimes A)\operatorname{vec} X$, it follows that
\[ \frac{\partial \operatorname{vec} AXB}{\partial \operatorname{vec} X} = B \otimes A', \]
and
\[ \frac{\partial \operatorname{vec} AX'B}{\partial \operatorname{vec} X} = K_{nr}'(B \otimes A') = K_{rn}(B \otimes A') \tag{4.2} \]
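for X, an n x r matrix. In Subsection 3.3.3.1 of Chap. 3, we saw too that for X, an n x r matrix,
\[ \operatorname{vec}(X \otimes I_G) = (I_r \otimes K_{nG}^{\tau_n})\operatorname{vec} X, \qquad \operatorname{vec}(I_G \otimes X) = (K_{rG}^{\tau_r} \otimes I_n)\operatorname{vec} X. \]
It follows that
\[ \frac{\partial \operatorname{vec}(X \otimes I_G)}{\partial \operatorname{vec} X} = I_r \otimes (K_{nG}^{\tau_n})' = I_r \otimes K_{Gn}^{\bar\tau_n}, \tag{4.3} \]
\[ \frac{\partial \operatorname{vec}(I_G \otimes X)}{\partial \operatorname{vec} X} = K_{Gr}^{\bar\tau_r} \otimes I_n. \tag{4.4} \]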
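Special cases of the last two results occur when X is an n x 1 vector x. Then
\[ \frac{\partial \operatorname{vec}(x \otimes I_G)}{\partial x} = K_{Gn}^{\bar\tau_n}, \tag{4.5} \]
\[ \frac{\partial \operatorname{vec}(I_G \otimes x)}{\partial x} = \operatorname{devec} I_G \otimes I_n. \tag{4.6} \]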
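Moreover, as $(x \otimes a) = (I_n \otimes a)x$ and $(a \otimes x) = (a \otimes I_n)x$,
\[ \frac{\partial (x \otimes a)}{\partial x} = I_n \otimes a', \tag{4.7} \]
\[ \frac{\partial (a \otimes x)}{\partial x} = a' \otimes I_n. \tag{4.8} \]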
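The role of the commutation matrix in Eqs. (4.1) and (4.2), and the identities behind Eqs. (4.7) and (4.8), can be illustrated with a short NumPy sketch. It is an illustration only: the helpers `vec` and `commutation` are hypothetical names, and `commutation(m, n)` is built so that it maps vec A to vec A' for an m x n matrix A, which is the convention used above.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return np.asarray(M, dtype=float).reshape(-1, order="F")

def commutation(m, n):
    """K_mn, the zero-one matrix with K_mn vec(A) = vec(A') for A of size m x n."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # A[i, j] sits at position i + j*m in vec(A) and j + i*n in vec(A')
            K[j + i * n, i + j * m] = 1.0
    return K

rng = np.random.default_rng(1)
n, r, m, p = 3, 2, 4, 5
X = rng.normal(size=(n, r))
K_nr, K_rn = commutation(n, r), commutation(r, n)

# vec X' = K_nr vec X, and K_nr' = K_rn as used in Eq. (4.1)
ok_vec_transpose = np.allclose(K_nr @ vec(X), vec(X.T))
ok_K_transpose = np.array_equal(K_nr.T, K_rn)

# vec AXB = (B' kron A) vec X; the derivative in the n x m arrangement is the
# transpose of that coefficient matrix, B kron A'
A = rng.normal(size=(m, n))
B = rng.normal(size=(r, p))
coef = np.kron(B.T, A)
ok_vec_AXB = np.allclose(coef @ vec(X), vec(A @ X @ B))
ok_deriv_AXB = np.allclose(coef.T, np.kron(B, A.T))

# Eq. (4.2): vec AX'B = (B' kron A) K_nr vec X, so the derivative is K_rn (B kron A')
A2 = rng.normal(size=(m, r))
B2 = rng.normal(size=(n, p))
coef2 = np.kron(B2.T, A2) @ K_nr
ok_vec_AXtB = np.allclose(coef2 @ vec(X), vec(A2 @ X.T @ B2))
ok_eq_4_2 = np.allclose(coef2.T, K_rn @ np.kron(B2, A2.T))

# Identities behind Eqs. (4.7) and (4.8)
x = rng.normal(size=n)
a = rng.normal(size=m)
a_col = a.reshape(-1, 1)
ok_4_7 = np.allclose(np.kron(np.eye(n), a_col) @ x, np.kron(x, a))
ok_4_8 = np.allclose(np.kron(a_col, np.eye(n)) @ x, np.kron(a, x))

print(ok_vec_transpose, ok_K_transpose, ok_vec_AXB, ok_deriv_AXB,
      ok_vec_AXtB, ok_eq_4_2, ok_4_7, ok_4_8)
```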
4.5. THE CHAIN RULE AND THE PRODUCT RULE FOR MATRIX CALCULUS
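Working out the derivatives of more complicated functions requires the application of the following lemmas that represent generalizations of the chain rule and the product rule of ordinary calculus.

Lemma 4.1. The Chain Rule. Let $x = (x_i)$, $y = (y_k)$, and $z = (z_j)$ be n x 1, r x 1, and m x 1 vectors, respectively. Suppose z is a vector function of y and y itself is a vector function of x so that z = z[y(x)]. Then$^2$
\[ \frac{\partial z}{\partial x} = \frac{\partial y}{\partial x}\frac{\partial z}{\partial y}. \]

Proof of Lemma 4.1. The (i, j)th element of the matrix $\partial z/\partial x$ is
\[ \left(\frac{\partial z}{\partial x}\right)_{ij} = \frac{\partial z_j}{\partial x_i} = \sum_{k=1}^{r}\frac{\partial y_k}{\partial x_i}\frac{\partial z_j}{\partial y_k} = \left(\frac{\partial y}{\partial x}\frac{\partial z}{\partial y}\right)_{ij}. \]
Hence,
\[ \frac{\partial z}{\partial x} = \frac{\partial y}{\partial x}\frac{\partial z}{\partial y}. \]
❑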
For our purposes it is useful to consider a generalization of the chain rule for matrix calculus for the case in which z is a vector function of two vectors. This generalization is given by the following lemma.

Lemma 4.2. Generalization of the Chain Rule. Let $z = (z_j)$ be an m x 1 vector function of two vectors $u = (u_q)$ and $v = (v_p)$, which are r x 1 and s x 1, respectively. Suppose u and v are both vector functions of an n x 1 vector $x = (x_i)$, so z = z[u(x), v(x)]. Then
\[ \frac{\partial z}{\partial x} = \frac{\partial u}{\partial x}\frac{\partial z}{\partial u} + \frac{\partial v}{\partial x}\frac{\partial z}{\partial v} = \left.\frac{\partial z}{\partial x}\right|_{v\ \text{constant}} + \left.\frac{\partial z}{\partial x}\right|_{u\ \text{constant}}. \]
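$^2$ Note that the chain rule presented here is a backward one. If, however, one were to follow Magnus' notation then a forward chain rule could be obtained by
\[ \frac{\partial z}{\partial x'} = \left(\frac{\partial z}{\partial x}\right)' = \left(\frac{\partial y}{\partial x}\frac{\partial z}{\partial y}\right)' = \frac{\partial z}{\partial y'}\frac{\partial y}{\partial x'}. \]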
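Proof of Lemma 4.2. The (i, j)th element of the matrix $\partial z/\partial x$ is
\[ \left(\frac{\partial z}{\partial x}\right)_{ij} = \frac{\partial z_j}{\partial x_i} = \sum_{q=1}^{r}\frac{\partial u_q}{\partial x_i}\frac{\partial z_j}{\partial u_q} + \sum_{p=1}^{s}\frac{\partial v_p}{\partial x_i}\frac{\partial z_j}{\partial v_p} = \left(\frac{\partial u}{\partial x}\frac{\partial z}{\partial u}\right)_{ij} + \left(\frac{\partial v}{\partial x}\frac{\partial z}{\partial v}\right)_{ij}. \]
The result follows directly. ❑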
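Lemma 4.2 can be used to obtain a product rule for matrix calculus as presented in Lemma 4.3.

Lemma 4.3. Product Rule. Let X be an m x n matrix and let Y be an n x p matrix, and suppose that the elements of both matrices are scalar functions of a vector $\delta$. Then
\[ \frac{\partial \operatorname{vec} XY}{\partial \delta} = \frac{\partial \operatorname{vec} X}{\partial \delta}(Y \otimes I_m) + \frac{\partial \operatorname{vec} Y}{\partial \delta}(I_p \otimes X'). \]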
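Proof of Lemma 4.3. By Lemma 4.2, we have
\[ \frac{\partial \operatorname{vec} XY}{\partial \delta} = \left.\frac{\partial \operatorname{vec} XY}{\partial \delta}\right|_{\operatorname{vec} Y\ \text{constant}} + \left.\frac{\partial \operatorname{vec} XY}{\partial \delta}\right|_{\operatorname{vec} X\ \text{constant}} = \frac{\partial \operatorname{vec} X}{\partial \delta}\left.\frac{\partial \operatorname{vec} XY}{\partial \operatorname{vec} X}\right|_{\operatorname{vec} Y\ \text{constant}} + \frac{\partial \operatorname{vec} Y}{\partial \delta}\left.\frac{\partial \operatorname{vec} XY}{\partial \operatorname{vec} Y}\right|_{\operatorname{vec} X\ \text{constant}}, \]
where the last equality follows from Lemma 4.1. We find that the result follows immediately by noting that $\operatorname{vec} XY = (Y' \otimes I_m)\operatorname{vec} X = (I_p \otimes X)\operatorname{vec} Y$, and by applying Theorem 4.1. ❑

Lemma 4.3 has the following useful corollary.

Corollary to Lemma 4.3. Let x be an n x 1 vector, f(x) be a scalar function of x, u(x) and v(x) be m x 1 vector functions of x, and A(x) and B(x) be p x m and m x q matrices, respectively, whose elements are scalar functions of x. Then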
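\[ \frac{\partial f(x)x}{\partial x} = \frac{\partial x}{\partial x}[f(x) \otimes I_n] + \frac{\partial f(x)}{\partial x}x' = f(x)I_n + \frac{\partial f(x)}{\partial x}x', \]
\[ \frac{\partial f(x)u(x)}{\partial x} = \frac{\partial u(x)}{\partial x}[f(x) \otimes I_m] + \frac{\partial f(x)}{\partial x}u(x)' = f(x)\frac{\partial u(x)}{\partial x} + \frac{\partial f(x)}{\partial x}u(x)', \]
\[ \frac{\partial u(x)'v(x)}{\partial x} = \frac{\partial u(x)}{\partial x}v(x) + \frac{\partial v(x)}{\partial x}u(x), \]
\[ \frac{\partial \operatorname{vec} u(x)v(x)'}{\partial x} = \frac{\partial u(x)}{\partial x}[v(x)' \otimes I_m] + \frac{\partial v(x)}{\partial x}[I_m \otimes u(x)'], \]
\[ \frac{\partial A(x)u(x)}{\partial x} = \frac{\partial \operatorname{vec} A(x)}{\partial x}[u(x) \otimes I_p] + \frac{\partial u(x)}{\partial x}A(x)', \]
\[ \frac{\partial \operatorname{vec} u(x)'B(x)}{\partial x} = \frac{\partial B(x)'u(x)}{\partial x} = \frac{\partial \operatorname{vec}[B(x)']}{\partial x}[u(x) \otimes I_q] + \frac{\partial u(x)}{\partial x}B(x). \]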
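The product rule of Lemma 4.3 can be checked numerically. The sketch below is illustrative rather than part of the text: it assumes, purely for convenience, that X and Y are affine functions of a vector $\delta$ (the arrays `X0`, `XD`, `Y0`, `YD` are arbitrary), and `deriv` and `vec` are hypothetical helper names.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector."""
    return np.asarray(M, dtype=float).reshape(-1, order="F")

def deriv(f, x, h=1e-6):
    """Central-difference derivative in the n x m arrangement of Definition 4.1."""
    x = np.asarray(x, dtype=float)
    rows = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        rows.append((np.atleast_1d(f(x + step)) - np.atleast_1d(f(x - step))) / (2.0 * h))
    return np.array(rows)

rng = np.random.default_rng(2)
m, n, p, q = 2, 3, 4, 5          # X is m x n, Y is n x p, delta is q x 1
X0, Y0 = rng.normal(size=(m, n)), rng.normal(size=(n, p))
XD, YD = rng.normal(size=(q, m, n)), rng.normal(size=(q, n, p))

def X_of(d):
    # assumed form: X(delta) = X0 + sum_i delta_i * XD[i]
    return X0 + np.tensordot(d, XD, axes=1)

def Y_of(d):
    # assumed form: Y(delta) = Y0 + sum_i delta_i * YD[i]
    return Y0 + np.tensordot(d, YD, axes=1)

delta = rng.normal(size=q)
X, Y = X_of(delta), Y_of(delta)

dX = deriv(lambda d: vec(X_of(d)), delta)              # q x mn
dY = deriv(lambda d: vec(Y_of(d)), delta)              # q x np
lhs = deriv(lambda d: vec(X_of(d) @ Y_of(d)), delta)   # q x mp
rhs = dX @ np.kron(Y, np.eye(m)) + dY @ np.kron(np.eye(p), X.T)

ok_product_rule = np.allclose(lhs, rhs, atol=1e-6)
print(lhs.shape, ok_product_rule)
```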
Our definition, the simple results given in Theorem 4.1, and the lemmas representing generalizations of the chain rule and product rule of ordinary calculus allow us to derive derivatives for more complicated vector functions and scalar functions. These derivatives are presented as theorems under appropriate headings that make up the subsequent sections.

4.6. RULES FOR VECS OF MATRICES
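Theorem 4.2. For a nonsingular n x n matrix X,
\[ \frac{\partial \operatorname{vec} X^{-1}}{\partial \operatorname{vec} X} = -(X^{-1} \otimes X^{-1\prime}). \]

Proof of Theorem 4.2. Taking the vec of both sides of $XX^{-1} = I_n$, we have $\operatorname{vec} XX^{-1} = \operatorname{vec} I_n$. Differentiating both sides with respect to vec X we have, by applying Lemma 4.3,
\[ \frac{\partial \operatorname{vec} X}{\partial \operatorname{vec} X}(X^{-1} \otimes I_n) + \frac{\partial \operatorname{vec} X^{-1}}{\partial \operatorname{vec} X}(I_n \otimes X') = 0. \]
Solving, we obtain
\[ \frac{\partial \operatorname{vec} X^{-1}}{\partial \operatorname{vec} X} = -(X^{-1} \otimes I_n)(I_n \otimes X^{-1\prime}) = -(X^{-1} \otimes X^{-1\prime}). \]
❑

Theorem 4.3. For A and B, m x n and n x p matrices of constants, respectively, and X, an n x n nonsingular matrix,
\[ \frac{\partial \operatorname{vec} AX^{-1}B}{\partial \operatorname{vec} X} = -X^{-1}B \otimes X^{-1\prime}A'. \]

Proof of Theorem 4.3. From $\operatorname{vec} AX^{-1}B = (B' \otimes A)\operatorname{vec} X^{-1}$ we have, by using Lemma 4.1 and Theorem 4.2,
\[ \frac{\partial \operatorname{vec} AX^{-1}B}{\partial \operatorname{vec} X} = \frac{\partial \operatorname{vec} X^{-1}}{\partial \operatorname{vec} X}\frac{\partial (B' \otimes A)\operatorname{vec} X^{-1}}{\partial \operatorname{vec} X^{-1}} = -(X^{-1} \otimes X^{-1\prime})(B \otimes A') = -X^{-1}B \otimes X^{-1\prime}A'. \]
❑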
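Theorems 4.2 and 4.3 lend themselves to a quick numerical check. The following is a hedged NumPy sketch: `vec`, `unvec`, and `deriv` are hypothetical helper names, and the test matrix is made diagonally dominant only so that the inverse is well conditioned.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector."""
    return np.asarray(M, dtype=float).reshape(-1, order="F")

def unvec(v, rows, cols):
    """Inverse of vec for a rows x cols matrix."""
    return np.asarray(v, dtype=float).reshape((rows, cols), order="F")

def deriv(f, x, h=1e-6):
    """Central-difference derivative in the n x m arrangement of Definition 4.1."""
    x = np.asarray(x, dtype=float)
    rows = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        rows.append((np.atleast_1d(f(x + step)) - np.atleast_1d(f(x - step))) / (2.0 * h))
    return np.array(rows)

rng = np.random.default_rng(3)
n, m, p = 3, 2, 4
X = rng.normal(size=(n, n)) + n * np.eye(n)   # shifted to keep X well conditioned
A = rng.normal(size=(m, n))
B = rng.normal(size=(n, p))
X_inv = np.linalg.inv(X)

# Theorem 4.2: d vec X^{-1} / d vec X = -(X^{-1} kron X^{-1}')
lhs_42 = deriv(lambda v: vec(np.linalg.inv(unvec(v, n, n))), vec(X))
rhs_42 = -np.kron(X_inv, X_inv.T)

# Theorem 4.3: d vec A X^{-1} B / d vec X = -(X^{-1} B kron X^{-1}' A')
lhs_43 = deriv(lambda v: vec(A @ np.linalg.inv(unvec(v, n, n)) @ B), vec(X))
rhs_43 = -np.kron(X_inv @ B, X_inv.T @ A.T)

ok_42 = np.allclose(lhs_42, rhs_42, atol=1e-6)
ok_43 = np.allclose(lhs_43, rhs_43, atol=1e-6)
print(ok_42, ok_43)
```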
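The following results involve the commutation matrix $K_{rr}$ and the matrix $N_r = \tfrac{1}{2}(I_{r^2} + K_{rr})$ and are derived for an n x r matrix X.

Theorem 4.4. Let A be an n x n matrix of constants. Then
\[ \frac{\partial \operatorname{vec} X'AX}{\partial \operatorname{vec} X} = (I_r \otimes AX)K_{rr} + (I_r \otimes A'X). \]

Proof of Theorem 4.4. Applying Lemma 4.3, we have
\[ \frac{\partial \operatorname{vec} X'AX}{\partial \operatorname{vec} X} = \frac{\partial \operatorname{vec} X'}{\partial \operatorname{vec} X}(AX \otimes I_r) + \frac{\partial \operatorname{vec} AX}{\partial \operatorname{vec} X}(I_r \otimes X). \]
However, as $\operatorname{vec} AX = (I_r \otimes A)\operatorname{vec} X$, by applying Theorem 4.1 and Eq. (4.1) we can write
\[ \frac{\partial \operatorname{vec} X'AX}{\partial \operatorname{vec} X} = K_{rn}(AX \otimes I_r) + (I_r \otimes A'X). \]
However, from the properties of the commutation matrix, $K_{rn}(AX \otimes I_r) = (I_r \otimes AX)K_{rr}$. ❑

Note that if A is the n x n identity matrix we have from the definition of $N_r$ that
\[ \frac{\partial \operatorname{vec} X'X}{\partial \operatorname{vec} X} = 2(I_r \otimes X)N_r. \]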
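Note also that if X is symmetric, then
\[ \frac{\partial \operatorname{vec} XAX}{\partial \operatorname{vec} X} = (AX \otimes I_n) + (I_n \otimes A'X). \]

Theorem 4.5. Let B be an r x r matrix of constants. Then
\[ \frac{\partial \operatorname{vec} XBX'}{\partial \operatorname{vec} X} = (BX' \otimes I_n) + (B'X' \otimes I_n)K_{nn}. \]

Proof of Theorem 4.5. Applying Lemma 4.3, we have
\[ \frac{\partial \operatorname{vec} XBX'}{\partial \operatorname{vec} X} = \frac{\partial \operatorname{vec} X}{\partial \operatorname{vec} X}(BX' \otimes I_n) + \frac{\partial \operatorname{vec} BX'}{\partial \operatorname{vec} X}(I_n \otimes X'). \]
However, from Eq. (4.2) we have
\[ \frac{\partial \operatorname{vec} BX'}{\partial \operatorname{vec} X} = K_{rn}(I_n \otimes B'), \]
and hence
\[ \frac{\partial \operatorname{vec} XBX'}{\partial \operatorname{vec} X} = (BX' \otimes I_n) + K_{rn}(I_n \otimes B'X') = (BX' \otimes I_n) + (B'X' \otimes I_n)K_{nn}. \]
❑

Note that if B is the r x r identity matrix, we have
\[ \frac{\partial \operatorname{vec} XX'}{\partial \operatorname{vec} X} = 2(X' \otimes I_n)N_n. \]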
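Theorems 4.4 and 4.5, together with their special cases involving $N_r$ and $N_n$, can be verified numerically. The sketch below is an illustration under the conventions used so far; `vec`, `unvec`, `commutation`, and `deriv` are hypothetical helper names, and all matrices are arbitrary random draws.

```python
import numpy as np

def vec(M):
    return np.asarray(M, dtype=float).reshape(-1, order="F")

def unvec(v, rows, cols):
    return np.asarray(v, dtype=float).reshape((rows, cols), order="F")

def commutation(m, n):
    """K_mn with K_mn vec(A) = vec(A') for A of size m x n."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[j + i * n, i + j * m] = 1.0
    return K

def deriv(f, x, h=1e-6):
    """Central-difference derivative in the n x m arrangement of Definition 4.1."""
    x = np.asarray(x, dtype=float)
    rows = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        rows.append((np.atleast_1d(f(x + step)) - np.atleast_1d(f(x - step))) / (2.0 * h))
    return np.array(rows)

rng = np.random.default_rng(4)
n, r = 4, 3
X = rng.normal(size=(n, r))
A = rng.normal(size=(n, n))
B = rng.normal(size=(r, r))
K_rr, K_nn = commutation(r, r), commutation(n, n)
N_r = 0.5 * (np.eye(r * r) + K_rr)
N_n = 0.5 * (np.eye(n * n) + K_nn)
I_r, I_n = np.eye(r), np.eye(n)

def X_of(v):
    return unvec(v, n, r)

# Theorem 4.4
lhs_44 = deriv(lambda v: vec(X_of(v).T @ A @ X_of(v)), vec(X))
rhs_44 = np.kron(I_r, A @ X) @ K_rr + np.kron(I_r, A.T @ X)

# Theorem 4.5
lhs_45 = deriv(lambda v: vec(X_of(v) @ B @ X_of(v).T), vec(X))
rhs_45 = np.kron(B @ X.T, I_n) + np.kron(B.T @ X.T, I_n) @ K_nn

# Special cases A = I_n and B = I_r
lhs_XtX = deriv(lambda v: vec(X_of(v).T @ X_of(v)), vec(X))
lhs_XXt = deriv(lambda v: vec(X_of(v) @ X_of(v).T), vec(X))

ok_44 = np.allclose(lhs_44, rhs_44, atol=1e-6)
ok_45 = np.allclose(lhs_45, rhs_45, atol=1e-6)
ok_XtX = np.allclose(lhs_XtX, 2.0 * np.kron(I_r, X) @ N_r, atol=1e-6)
ok_XXt = np.allclose(lhs_XXt, 2.0 * np.kron(X.T, I_n) @ N_n, atol=1e-6)
print(ok_44, ok_45, ok_XtX, ok_XXt)
```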
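Theorem 4.6.
\[ \frac{\partial \operatorname{vec}(X'X)^{-1}}{\partial \operatorname{vec} X} = -2[(X'X)^{-1} \otimes X(X'X)^{-1}]N_r. \]

Proof of Theorem 4.6. From Lemma 4.1,
\[ \frac{\partial \operatorname{vec}(X'X)^{-1}}{\partial \operatorname{vec} X} = \frac{\partial \operatorname{vec} X'X}{\partial \operatorname{vec} X}\frac{\partial \operatorname{vec}(X'X)^{-1}}{\partial \operatorname{vec} X'X} = -2(I_r \otimes X)N_r[(X'X)^{-1} \otimes (X'X)^{-1}]. \]
The result follows from the properties of $N_r$. ❑

Theorem 4.7.
\[ \frac{\partial \operatorname{vec} X(X'X)^{-1}}{\partial \operatorname{vec} X} = [(X'X)^{-1} \otimes I_n] - 2[(X'X)^{-1} \otimes X(X'X)^{-1}]N_r(I_r \otimes X'). \]

Proof of Theorem 4.7. Applying Lemma 4.3, we have
\[ \frac{\partial \operatorname{vec} X(X'X)^{-1}}{\partial \operatorname{vec} X} = \frac{\partial \operatorname{vec} X}{\partial \operatorname{vec} X}[(X'X)^{-1} \otimes I_n] + \frac{\partial \operatorname{vec}(X'X)^{-1}}{\partial \operatorname{vec} X}(I_r \otimes X'). \]
The result follows from Theorem 4.6. ❑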
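Note that by using the property that, for an r x r matrix A, $N_r(A \otimes A) = (A \otimes A)N_r = N_r(A \otimes A)N_r$, we can write
\[ \frac{\partial \operatorname{vec} X(X'X)^{-1}}{\partial \operatorname{vec} X} = [(X'X)^{-1} \otimes I_n] - 2(I_r \otimes X)N_r[(X'X)^{-1} \otimes (X'X)^{-1}X'] = [(X'X)^{-1} \otimes I_n] - 2(I_r \otimes X)N_r[(X'X)^{-1} \otimes (X'X)^{-1}]N_r(I_r \otimes X'). \]

Theorem 4.8.
\[ \frac{\partial \operatorname{vec}(X'X)^{-1}X'}{\partial \operatorname{vec} X} = K_{rn}[I_n \otimes (X'X)^{-1}] - 2(I_r \otimes X)N_r[(X'X)^{-1}X' \otimes (X'X)^{-1}]. \]

Proof of Theorem 4.8. From Lemma 4.3,
\[ \frac{\partial \operatorname{vec}(X'X)^{-1}X'}{\partial \operatorname{vec} X} = \frac{\partial \operatorname{vec}(X'X)^{-1}}{\partial \operatorname{vec} X}(X' \otimes I_r) + \frac{\partial \operatorname{vec} X'}{\partial \operatorname{vec} X}[I_n \otimes (X'X)^{-1}]. \]
Applying Theorem 4.6 and Eq. (4.1), we can then write
\[ \frac{\partial \operatorname{vec}(X'X)^{-1}X'}{\partial \operatorname{vec} X} = -2(I_r \otimes X)N_r[(X'X)^{-1} \otimes (X'X)^{-1}](X' \otimes I_r) + K_{rn}[I_n \otimes (X'X)^{-1}] = -2(I_r \otimes X)N_r[(X'X)^{-1}X' \otimes (X'X)^{-1}] + K_{rn}[I_n \otimes (X'X)^{-1}]. \]
❑

Again, we can obtain other expressions for this derivative by using the properties of the matrix $N_r$.

Theorem 4.9.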
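\[ \frac{\partial \operatorname{vec} X(X'X)^{-1}X'}{\partial \operatorname{vec} X} = 2\{(X'X)^{-1}X' \otimes [I_n - X(X'X)^{-1}X']\}N_n. \]

Proof of Theorem 4.9. Applying Lemma 4.3, we write
\[ \frac{\partial \operatorname{vec} X(X'X)^{-1}X'}{\partial \operatorname{vec} X} = \frac{\partial \operatorname{vec} X(X'X)^{-1}}{\partial \operatorname{vec} X}(X' \otimes I_n) + \frac{\partial \operatorname{vec} X'}{\partial \operatorname{vec} X}[I_n \otimes (X'X)^{-1}X']. \tag{4.9} \]
From Theorem 4.7 and Eq. (4.1), we can write the right-hand side of Eq. (4.9) as
\[ \{[(X'X)^{-1} \otimes I_n] - 2(I_r \otimes X)N_r[(X'X)^{-1} \otimes (X'X)^{-1}X']\}(X' \otimes I_n) + K_{rn}[I_n \otimes (X'X)^{-1}X']. \]
From the definition of $N_r$ we can rewrite this expression as
\[ (X'X)^{-1}X' \otimes [I_n - X(X'X)^{-1}X'] - (I_r \otimes X)K_{rr}[(X'X)^{-1}X' \otimes (X'X)^{-1}X'] + K_{rn}[I_n \otimes (X'X)^{-1}X']. \]
However, by using the properties of the commutation matrix, we obtain
\[ K_{rr}[(X'X)^{-1}X' \otimes (X'X)^{-1}X'] = [(X'X)^{-1}X' \otimes (X'X)^{-1}X']K_{nn}, \]
\[ K_{rn}[I_n \otimes (X'X)^{-1}X'] = [(X'X)^{-1}X' \otimes I_n]K_{nn}, \]
and the result follows. ❑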
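Because Theorems 4.6 and 4.9 involve several Kronecker factors, a numerical check is a useful safeguard against sign and ordering slips. The following NumPy sketch is illustrative only; the helper names are hypothetical and X is an arbitrary full-column-rank draw.

```python
import numpy as np

def vec(M):
    return np.asarray(M, dtype=float).reshape(-1, order="F")

def commutation(m, n):
    """K_mn with K_mn vec(A) = vec(A') for A of size m x n."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[j + i * n, i + j * m] = 1.0
    return K

def deriv(f, x, h=1e-6):
    """Central-difference derivative in the n x m arrangement of Definition 4.1."""
    x = np.asarray(x, dtype=float)
    rows = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        rows.append((np.atleast_1d(f(x + step)) - np.atleast_1d(f(x - step))) / (2.0 * h))
    return np.array(rows)

rng = np.random.default_rng(5)
n, r = 5, 2
X = rng.normal(size=(n, r))
N_r = 0.5 * (np.eye(r * r) + commutation(r, r))
N_n = 0.5 * (np.eye(n * n) + commutation(n, n))

def gram_inverse(v):
    Xv = v.reshape((n, r), order="F")
    return np.linalg.inv(Xv.T @ Xv)

def projection(v):
    Xv = v.reshape((n, r), order="F")
    return Xv @ np.linalg.inv(Xv.T @ Xv) @ Xv.T

G_inv = np.linalg.inv(X.T @ X)
P = X @ G_inv @ X.T

# Theorem 4.6
lhs_46 = deriv(lambda v: vec(gram_inverse(v)), vec(X))
rhs_46 = -2.0 * np.kron(G_inv, X @ G_inv) @ N_r

# Theorem 4.9
lhs_49 = deriv(lambda v: vec(projection(v)), vec(X))
rhs_49 = 2.0 * np.kron(G_inv @ X.T, np.eye(n) - P) @ N_n

ok_46 = np.allclose(lhs_46, rhs_46, atol=1e-5)
ok_49 = np.allclose(lhs_49, rhs_49, atol=1e-5)
print(ok_46, ok_49)
```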
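The next two theorems involve the derivatives of vecs of Kronecker products and bring in the generalized devec of the commutation matrix.

Theorem 4.10. Let A be a G x s matrix of constants. Then
\[ \frac{\partial \operatorname{vec}(X \otimes A)}{\partial \operatorname{vec} X} = I_r \otimes K_{Gn}^{\bar\tau_n}(A \otimes I_{nG}) = I_r \otimes (I_n \otimes a_1' \ \cdots \ I_n \otimes a_s'), \tag{4.10} \]
where $a_j$ is the jth column of A.

Proof of Theorem 4.10. We write $\operatorname{vec}(X \otimes A) = \operatorname{vec}[(X \otimes I_G)(I_r \otimes A)] = (I_r \otimes A' \otimes I_{nG})\operatorname{vec}(X \otimes I_G)$, so by Theorem 4.1,
\[ \frac{\partial \operatorname{vec}(X \otimes A)}{\partial \operatorname{vec} X} = \frac{\partial \operatorname{vec}(X \otimes I_G)}{\partial \operatorname{vec} X}(I_r \otimes A \otimes I_{nG}) = I_r \otimes K_{Gn}^{\bar\tau_n}(A \otimes I_{nG}) \]
by Eq. (4.3). The last equality of Eq. (4.10) follows from Theorem 3.15 of Chap. 3. ❑

Theorem 4.11. Let A be a G x s matrix of constants. Then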
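\[ \frac{\partial \operatorname{vec}(A \otimes X)}{\partial \operatorname{vec} X} = K_{sr}^{\bar\tau_r}(I_{sr} \otimes A') \otimes I_n = (I_r \otimes a_1' \ \cdots \ I_r \otimes a_s') \otimes I_n, \]
where $a_j$ is the jth column of A.

Proof of Theorem 4.11. Write $\operatorname{vec}(A \otimes X) = \operatorname{vec}[(A \otimes I_n)(I_s \otimes X)] = (I_{sr} \otimes A \otimes I_n)\operatorname{vec}(I_s \otimes X)$, so, from Theorem 4.1,
\[ \frac{\partial \operatorname{vec}(A \otimes X)}{\partial \operatorname{vec} X} = \frac{\partial \operatorname{vec}(I_s \otimes X)}{\partial \operatorname{vec} X}(I_{sr} \otimes A' \otimes I_n) = K_{sr}^{\bar\tau_r}(I_{sr} \otimes A') \otimes I_n \]
by Eq. (4.4). Now
\[ K_{sr}^{\bar\tau_r}(I_{sr} \otimes A') = (I_r \otimes e_1^{s\prime} \ \cdots \ I_r \otimes e_s^{s\prime})\begin{pmatrix} I_r \otimes A' & & 0 \\ & \ddots & \\ 0 & & I_r \otimes A' \end{pmatrix} = (I_r \otimes e_1^{s\prime}A' \ \cdots \ I_r \otimes e_s^{s\prime}A') = (I_r \otimes a_1' \ \cdots \ I_r \otimes a_s'). \]
❑

Special cases of the last two theorems occur when X is an n x 1 vector x. Then
\[ \frac{\partial \operatorname{vec}(x \otimes A)}{\partial x} = (I_n \otimes a_1' \ \cdots \ I_n \otimes a_s'), \]
\[ \frac{\partial \operatorname{vec}(A \otimes x)}{\partial x} = (a_1' \ \cdots \ a_s') \otimes I_n = \operatorname{devec} A' \otimes I_n, \]
which are generalizations of the results given by Eqs. (4.5) and (4.6).

4.7. RULES DEVELOPED FROM THE PROPERTIES OF $K_{Gn}$, $K_{Gn}^{\bar\tau_n}$, AND $K_{nG}^{\tau_n}$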
Several derivatives can be derived by use of the properties of the commutation matrix and generalized vecs and devecs of commutation matrices, as the following theorems illustrate.

Theorem 4.12. Let x be an n x 1 vector and let A be a G x n matrix of constants. Then
\[ \frac{\partial (x \otimes I_G)Ax}{\partial x} = (I_n \otimes x'A') + A'(x' \otimes I_G). \]

Proof of Theorem 4.12. From the product rule, we have
\[ \frac{\partial (x \otimes I_G)Ax}{\partial x} = \frac{\partial \operatorname{vec}(x \otimes I_G)}{\partial x}(Ax \otimes I_{nG}) + \frac{\partial Ax}{\partial x}(x' \otimes I_G) = K_{Gn}^{\bar\tau_n}(Ax \otimes I_{nG}) + A'(x' \otimes I_G) \]
from Eq. (4.5) and Theorem 4.1. However, by Theorem 2.8 of Chap. 2, $K_{Gn}^{\bar\tau_n}(Ax \otimes I_{nG}) = I_n \otimes x'A'$. ❑
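Theorem 4.13. Let x be an n x 1 vector and let A be a G x n matrix of constants. Then
\[ \frac{\partial (I_G \otimes x)Ax}{\partial x} = (x'A' \otimes I_n) + A'(I_G \otimes x'). \]

Proof of Theorem 4.13.
\[ \frac{\partial (I_G \otimes x)Ax}{\partial x} = \frac{\partial K_{Gn}(x \otimes I_G)Ax}{\partial x} = \frac{\partial (x \otimes I_G)Ax}{\partial x}K_{nG}. \]
The result follows from Theorem 4.12 and the properties of the commutation matrix. ❑

Theorem 4.14. Let x be an n x 1 vector and let A be an nG x n matrix of constants. Then
\[ \frac{\partial (I_G \otimes x')Ax}{\partial x} = (A^{\bar\tau_n} + A')(I_G \otimes x). \]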
Proof of Theorem 4.14. From Theorem 3.8 of Chap. 3,
am 0 x')Ax ax
a -
= — Ic%(x 0 InG )Ax ax n = [(In 0 x'A') + A'(x' 0 InG)1KG%
from Theorems 4.12 and 4.1. However, from Theorem 3.8, (x' 0 InG)IcrGn = IG 0 x, and, from Theorem 3.10, (In 0 x'A')KGrGn = (Ax)1" = At. (/G 0 x).
❑
Theorem 4.15. Let x be an n x 1 vector and let A be an nG x n matrix of constants. Then
\[ \frac{\partial (x' \otimes I_G)Ax}{\partial x} = (I_n \otimes x')(A')^{\tau_G} + A'(x \otimes I_G). \]

Proof of Theorem 4.15. From Theorem 3.8,
_ a(x' 0IG)Ax _ ax
nG(I
ax
nG 00 x)Ax
= [(x'A' 0 In) + A'(InG 0 x' )1Iqn from Theorems 4.13 and 4.14. From Theorem 3.8, (InG 0 X')KLGn = x 0 IG,
and, from Theorem 3.11, (x' A' 0
= (Ax) = (x'ATG = 0 x')(Ar.
0
4.8. RULES FOR SCALAR FUNCTIONS OF A MATRIX
Suppose y is a scalar that is a differentiable function of the elements of an n x r matrix X. We say that y is a scalar function of the matrix X, and we write y = y(X). Scalar functions that crop up a lot in the application of classical statistical procedures are traces and determinants, and derivatives for such functions are developed in the next two subsections.

4.8.1. Rules for Traces of Matrices

We can easily derive derivatives for scalar functions that are traces of matrices by using the result $\operatorname{tr} AB = (\operatorname{vec} A')'\operatorname{vec} B$ and then by applying the results we already obtained in Section 4.6 for vector functions that are vecs of matrices. In the next three theorems, we shall use the same notation as that we used previously and let X be an n x r matrix and A and B be n x n and r x r matrices of constants, respectively.

Theorem 4.16.
\[ \frac{\partial \operatorname{tr} AX}{\partial \operatorname{vec} X} = \operatorname{vec} A'. \]

Proof of Theorem 4.16. Write $\operatorname{tr} AX = (\operatorname{vec} A')'\operatorname{vec} X$. Then the result follows from Theorem 4.1. ❑

Theorem 4.17.
\[ \frac{\partial \operatorname{tr} X'AX}{\partial \operatorname{vec} X} = \operatorname{vec}(A'X + AX). \]

Proof of Theorem 4.17. We write $\operatorname{tr} X'AX = \operatorname{tr} AXX' = (\operatorname{vec} A')'\operatorname{vec} XX'$. Then, applying the chain rule given by Lemma 4.2, we obtain
\[ \frac{\partial \operatorname{tr} AXX'}{\partial \operatorname{vec} X} = \frac{\partial \operatorname{vec} XX'}{\partial \operatorname{vec} X}\operatorname{vec} A' = [(X' \otimes I_n) + (X' \otimes I_n)K_{nn}]\operatorname{vec} A' \]
from Theorem 4.5. We find that the result follows by noting that $K_{nn}\operatorname{vec} A' = \operatorname{vec} A$. ❑
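Theorem 4.18.
\[ \frac{\partial \operatorname{tr} XBX'A}{\partial \operatorname{vec} X} = \operatorname{vec}(AXB + A'XB'). \]

Proof of Theorem 4.18. We write $\operatorname{tr} XBX'A = \operatorname{tr} BX'AX = (\operatorname{vec} B')'\operatorname{vec} X'AX$. Hence, by applying the chain rule, we obtain
\[ \frac{\partial \operatorname{tr} XBX'A}{\partial \operatorname{vec} X} = \frac{\partial \operatorname{vec} X'AX}{\partial \operatorname{vec} X}\operatorname{vec} B'. \]
However, by Theorem 4.4, the right-hand side of the preceding equation can be written as
\[ [(I_r \otimes AX)K_{rr} + (I_r \otimes A'X)]\operatorname{vec} B' = (I_r \otimes AX)\operatorname{vec} B + (I_r \otimes A'X)\operatorname{vec} B' = \operatorname{vec}(AXB + A'XB'). \]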
❑

4.8.2. Rules for Determinants of Matrices and Logs of Determinants

Again the results developed in Section 4.6 in conjunction with the chain rule allow us to obtain the derivatives of scalar functions that are determinants of matrices or logs of such determinants. However, before we follow such a procedure we prove the following theorem.

Theorem 4.19. Let X be a nonsingular n x n matrix and let |X| denote the determinant of X. Then
\[ \frac{\partial |X|}{\partial \operatorname{vec} X} = |X|\operatorname{vec}[(X^{-1})']. \]

Proof of Theorem 4.19. Expanding the determinant of X by using the jth column of X, we have $|X| = \sum_{i=1}^{n} c_{ij}x_{ij}$, where $c_{ij}$ is the cofactor of the (i, j)th element of X. It follows that $\partial |X|/\partial x_{ij} = c_{ij}$ and that the jth subvector of the column vector $\partial |X|/\partial \operatorname{vec} X$ is $(c_{1j} \cdots c_{nj})'$, which is the jth column of $(\operatorname{adjoint} X)'$. Hence,
\[ \frac{\partial |X|}{\partial \operatorname{vec} X} = \operatorname{vec}[(\operatorname{adjoint} X)'] = |X|\operatorname{vec}[(X^{-1})']. \]
❑

Note that if |X| is positive, log |X| exists and the following result follows directly from this theorem and the application of the chain rule:
\[ \frac{\partial \log |X|}{\partial \operatorname{vec} X} = \operatorname{vec}[(X^{-1})']. \]
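The determinant rule just derived, its log counterpart, and the trace rules of Subsection 4.8.1 can all be checked by finite differences. The sketch below is illustrative only: helper names are hypothetical, and the square test matrix is constructed to be positive definite solely so that its determinant is positive and the log is defined.

```python
import numpy as np

def vec(M):
    return np.asarray(M, dtype=float).reshape(-1, order="F")

def unvec(v, rows, cols):
    return np.asarray(v, dtype=float).reshape((rows, cols), order="F")

def deriv(f, x, h=1e-6):
    """Central-difference derivative in the n x m arrangement of Definition 4.1."""
    x = np.asarray(x, dtype=float)
    rows = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        rows.append((np.atleast_1d(f(x + step)) - np.atleast_1d(f(x - step))) / (2.0 * h))
    return np.array(rows)

rng = np.random.default_rng(6)
n, r = 3, 2

# Determinant and log-determinant (Theorem 4.19 and the note after it)
G = rng.normal(size=(n, n))
X = G @ G.T + np.eye(n)            # positive definite, so |X| > 0
X_inv = np.linalg.inv(X)
d_det = deriv(lambda v: np.linalg.det(unvec(v, n, n)), vec(X))[:, 0]
d_logdet = deriv(lambda v: np.log(np.linalg.det(unvec(v, n, n))), vec(X))[:, 0]
ok_det = np.allclose(d_det, np.linalg.det(X) * vec(X_inv.T), rtol=1e-5, atol=1e-6)
ok_logdet = np.allclose(d_logdet, vec(X_inv.T), rtol=1e-5, atol=1e-6)

# Trace rules (Theorems 4.17 and 4.18) for an n x r matrix Z
Z = rng.normal(size=(n, r))
A = rng.normal(size=(n, n))
B = rng.normal(size=(r, r))
d_tr_417 = deriv(lambda v: np.trace(unvec(v, n, r).T @ A @ unvec(v, n, r)), vec(Z))[:, 0]
d_tr_418 = deriv(lambda v: np.trace(unvec(v, n, r) @ B @ unvec(v, n, r).T @ A), vec(Z))[:, 0]
ok_tr_417 = np.allclose(d_tr_417, vec(A.T @ Z + A @ Z), atol=1e-6)
ok_tr_418 = np.allclose(d_tr_418, vec(A @ Z @ B + A.T @ Z @ B.T), atol=1e-6)

print(ok_det, ok_logdet, ok_tr_417, ok_tr_418)
```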
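With Theorem 4.19 and the results of Section 4.6 in hand, further derivatives for scalar functions that are determinants can easily be obtained, as the following theorems indicate.

Theorem 4.20. Let X be an n x r matrix, let A be an n x n matrix of constants, and suppose that Y = X'AX is nonsingular. Then
\[ \frac{\partial |Y|}{\partial \operatorname{vec} X} = |Y|[(Y^{-1\prime} \otimes A) + (Y^{-1} \otimes A')]\operatorname{vec} X. \]

Proof of Theorem 4.20. Applying the chain rule of Lemma 4.1, we have
\[ \frac{\partial |Y|}{\partial \operatorname{vec} X} = \frac{\partial \operatorname{vec} Y}{\partial \operatorname{vec} X}\frac{\partial |Y|}{\partial \operatorname{vec} Y}, \]
so from Theorems 4.19 and 4.4, we have
\[ \frac{\partial |Y|}{\partial \operatorname{vec} X} = |Y|[(I_r \otimes AX)K_{rr} + (I_r \otimes A'X)]\operatorname{vec} Y^{-1\prime} = |Y|[(I_r \otimes AX)\operatorname{vec} Y^{-1} + (I_r \otimes A'X)\operatorname{vec} Y^{-1\prime}] = |Y|[\operatorname{vec} AXY^{-1} + \operatorname{vec} A'XY^{-1\prime}] = |Y|[(Y^{-1\prime} \otimes A) + (Y^{-1} \otimes A')]\operatorname{vec} X. \]
❑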
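Several obvious corollaries flow from this theorem. First, if A is the identity matrix, then
\[ \frac{\partial |X'X|}{\partial \operatorname{vec} X} = 2|X'X|\operatorname{vec} X(X'X)^{-1}, \]
provided, of course, that X'X is nonsingular. Second, if |Y| is positive so that log |Y| exists,
\[ \frac{\partial \log |Y|}{\partial \operatorname{vec} X} = [(Y^{-1\prime} \otimes A) + (Y^{-1} \otimes A')]\operatorname{vec} X, \]
and if A is symmetric so Y is symmetric:
\[ \frac{\partial \log |Y|}{\partial \operatorname{vec} X} = 2(Y^{-1} \otimes A)\operatorname{vec} X. \]
Third, suppose the elements of X are themselves scalar functions of a p x 1 vector $\delta$. Then, by the chain rule,
\[ \frac{\partial \log |Y|}{\partial \delta} = \frac{\partial \operatorname{vec} X}{\partial \delta}[(Y^{-1\prime} \otimes A) + (Y^{-1} \otimes A')]\operatorname{vec} X, \]
and, for symmetric A,
\[ \frac{\partial \log |Y|}{\partial \delta} = 2\frac{\partial \operatorname{vec} X}{\partial \delta}(Y^{-1} \otimes A)\operatorname{vec} X. \]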
Theorem 4.21. Let X be an n x r matrix, let B be an r x r matrix of constants, and suppose Z = XBX' is nonsingular. Then
\[ \frac{\partial |Z|}{\partial \operatorname{vec} X} = |Z|[(B \otimes Z^{-1\prime}) + (B' \otimes Z^{-1})]\operatorname{vec} X. \]

Proof of Theorem 4.21. This proof is similar to that of Theorem 4.20. ❑

Corollaries for this theorem corresponding to those derived for Theorem 4.20 are easily obtained; the details are left to the reader.

4.9. TABLES OF RESULTS
In this section, for easy reference, the results proved in this chapter are summarized in Tables 4.1-4.7.
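Table 4.1. General Rules

Chain Rules
$\dfrac{\partial z}{\partial x} = \dfrac{\partial y}{\partial x}\dfrac{\partial z}{\partial y}$; z = z[y(x)].
$\dfrac{\partial z}{\partial x} = \dfrac{\partial u}{\partial x}\dfrac{\partial z}{\partial u} + \dfrac{\partial v}{\partial x}\dfrac{\partial z}{\partial v}$; z = z[u(x), v(x)].

Product Rule
$\dfrac{\partial \operatorname{vec} XY}{\partial \delta} = \dfrac{\partial \operatorname{vec} X}{\partial \delta}(Y \otimes I_m) + \dfrac{\partial \operatorname{vec} Y}{\partial \delta}(I_p \otimes X')$; X is m x n, Y is n x p.

Corollaries of Product Rule
$\dfrac{\partial f(x)x}{\partial x} = f(x)I_n + \dfrac{\partial f(x)}{\partial x}x'$; f(x) is a scalar function of n x 1 x.
$\dfrac{\partial f(x)u(x)}{\partial x} = f(x)\dfrac{\partial u(x)}{\partial x} + \dfrac{\partial f(x)}{\partial x}u(x)'$; f(x) scalar function, u(x) vector function.
$\dfrac{\partial u(x)'v(x)}{\partial x} = \dfrac{\partial u(x)}{\partial x}v(x) + \dfrac{\partial v(x)}{\partial x}u(x)$; u(x), v(x) vector functions.
$\dfrac{\partial \operatorname{vec} u(x)v(x)'}{\partial x} = \dfrac{\partial u(x)}{\partial x}[v(x)' \otimes I_m] + \dfrac{\partial v(x)}{\partial x}[I_m \otimes u(x)']$; u(x), v(x) m x 1 vector functions.
$\dfrac{\partial A(x)u(x)}{\partial x} = \dfrac{\partial \operatorname{vec} A(x)}{\partial x}[u(x) \otimes I_p] + \dfrac{\partial u(x)}{\partial x}A(x)'$; u(x) vector function, A(x) p x m.
$\dfrac{\partial \operatorname{vec} u(x)'B(x)}{\partial x} = \dfrac{\partial \operatorname{vec}[B(x)']}{\partial x}[u(x) \otimes I_q] + \dfrac{\partial u(x)}{\partial x}B(x)$; u(x) vector function, B(x) m x q.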
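Table 4.2. Simple Results

$\dfrac{\partial Ax}{\partial x} = A'$.
$\dfrac{\partial x'A}{\partial x} = A$.
$\dfrac{\partial x'Ax}{\partial x} = (A + A')x$.
$\dfrac{\partial (x \otimes a)}{\partial x} = I_n \otimes a'$; x is an n x 1 vector.
$\dfrac{\partial (a \otimes x)}{\partial x} = a' \otimes I_n$; x is an n x 1 vector.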
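Table 4.3. Results for Vecs of Matrices

$\dfrac{\partial \operatorname{vec} X'}{\partial \operatorname{vec} X} = K_{rn}$.
$\dfrac{\partial \operatorname{vec} X}{\partial \operatorname{vech} X} = D_n'$; X is a symmetric n x n matrix.
$\dfrac{\partial \operatorname{vec} AXB}{\partial \operatorname{vec} X} = B \otimes A'$.
$\dfrac{\partial \operatorname{vec} AX'B}{\partial \operatorname{vec} X} = K_{rn}(B \otimes A')$.
$\dfrac{\partial \operatorname{vec}(X \otimes I_G)}{\partial \operatorname{vec} X} = I_r \otimes K_{Gn}^{\bar\tau_n}$.
$\dfrac{\partial \operatorname{vec}(I_G \otimes X)}{\partial \operatorname{vec} X} = K_{Gr}^{\bar\tau_r} \otimes I_n$.
$\dfrac{\partial \operatorname{vec}(x \otimes I_G)}{\partial x} = K_{Gn}^{\bar\tau_n}$; x is an n x 1 vector.
$\dfrac{\partial \operatorname{vec}(I_G \otimes x)}{\partial x} = \operatorname{devec} I_G \otimes I_n$; x is an n x 1 vector.
$\dfrac{\partial \operatorname{vec} X^{-1}}{\partial \operatorname{vec} X} = -(X^{-1} \otimes X^{-1\prime})$; X is nonsingular.
$\dfrac{\partial \operatorname{vec} AX^{-1}B}{\partial \operatorname{vec} X} = -X^{-1}B \otimes X^{-1\prime}A'$; X is nonsingular.
$\dfrac{\partial \operatorname{vec} X'AX}{\partial \operatorname{vec} X} = (I_r \otimes AX)K_{rr} + (I_r \otimes A'X)$.
$\dfrac{\partial \operatorname{vec} X'X}{\partial \operatorname{vec} X} = 2(I_r \otimes X)N_r$.
$\dfrac{\partial \operatorname{vec} XAX}{\partial \operatorname{vec} X} = (AX \otimes I_n) + (I_n \otimes A'X)$; X is a symmetric n x n matrix.
$\dfrac{\partial \operatorname{vec} XBX'}{\partial \operatorname{vec} X} = (BX' \otimes I_n) + (B'X' \otimes I_n)K_{nn}$.
$\dfrac{\partial \operatorname{vec} XX'}{\partial \operatorname{vec} X} = 2(X' \otimes I_n)N_n$.
$\dfrac{\partial \operatorname{vec}(X'X)^{-1}}{\partial \operatorname{vec} X} = -2[(X'X)^{-1} \otimes X(X'X)^{-1}]N_r$.
$\dfrac{\partial \operatorname{vec} X(X'X)^{-1}}{\partial \operatorname{vec} X} = [(X'X)^{-1} \otimes I_n] - 2[(X'X)^{-1} \otimes X(X'X)^{-1}]N_r(I_r \otimes X')$.
$\dfrac{\partial \operatorname{vec}(X'X)^{-1}X'}{\partial \operatorname{vec} X} = K_{rn}[I_n \otimes (X'X)^{-1}] - 2(I_r \otimes X)N_r[(X'X)^{-1}X' \otimes (X'X)^{-1}]$.
$\dfrac{\partial \operatorname{vec} X(X'X)^{-1}X'}{\partial \operatorname{vec} X} = 2\{(X'X)^{-1}X' \otimes [I_n - X(X'X)^{-1}X']\}N_n$.
$\dfrac{\partial \operatorname{vec}(X \otimes A)}{\partial \operatorname{vec} X} = I_r \otimes K_{Gn}^{\bar\tau_n}(A \otimes I_{nG})$; A is a G x s matrix.
$\dfrac{\partial \operatorname{vec}(A \otimes X)}{\partial \operatorname{vec} X} = K_{sr}^{\bar\tau_r}(I_{sr} \otimes A') \otimes I_n$; A is a G x s matrix.

Note: Unless otherwise specified, X is an n x r matrix in these results.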
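Table 4.4. For x, an n x 1 Vector

$\dfrac{\partial (x \otimes I_G)Ax}{\partial x} = (I_n \otimes x'A') + A'(x' \otimes I_G)$; A is a G x n matrix.
$\dfrac{\partial (I_G \otimes x)Ax}{\partial x} = (x'A' \otimes I_n) + A'(I_G \otimes x')$; A is a G x n matrix.
$\dfrac{\partial (I_G \otimes x')Ax}{\partial x} = (A^{\bar\tau_n} + A')(I_G \otimes x)$; A is a Gn x n matrix.
$\dfrac{\partial (x' \otimes I_G)Ax}{\partial x} = (I_n \otimes x')(A')^{\tau_G} + A'(x \otimes I_G)$; A is a Gn x n matrix.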
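Table 4.5. Rules for Traces of Matrices

$\dfrac{\partial \operatorname{tr} AX}{\partial \operatorname{vec} X} = \operatorname{vec} A'$.
$\dfrac{\partial \operatorname{tr} X'AX}{\partial \operatorname{vec} X} = \operatorname{vec}(A'X + AX)$.
$\dfrac{\partial \operatorname{tr} XBX'A}{\partial \operatorname{vec} X} = \operatorname{vec}(AXB + A'XB')$.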
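Table 4.6. Rules for Determinants of Matrices

$\dfrac{\partial |X|}{\partial \operatorname{vec} X} = |X|\operatorname{vec}[(X^{-1})']$.
$\dfrac{\partial |Y|}{\partial \operatorname{vec} X} = |Y|[(Y^{-1\prime} \otimes A) + (Y^{-1} \otimes A')]\operatorname{vec} X$ for a nonsingular Y = X'AX; $= 2|Y|(Y^{-1} \otimes A)\operatorname{vec} X$ for symmetric A.
$\dfrac{\partial |Z|}{\partial \operatorname{vec} X} = |Z|[(B \otimes Z^{-1\prime}) + (B' \otimes Z^{-1})]\operatorname{vec} X$ for a nonsingular Z = XBX'; $= 2|Z|(B \otimes Z^{-1})\operatorname{vec} X$ for symmetric B.
$\dfrac{\partial |X'X|}{\partial \operatorname{vec} X} = 2|X'X|\operatorname{vec} X(X'X)^{-1}$.
$\dfrac{\partial |XX'|}{\partial \operatorname{vec} X} = 2|XX'|\operatorname{vec}(XX')^{-1}X$.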
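Table 4.7. Rules for Logs

$\dfrac{\partial \log |X|}{\partial \operatorname{vec} X} = \operatorname{vec}[(X^{-1})']$.
$\dfrac{\partial \log |Y|}{\partial \operatorname{vec} X} = [(Y^{-1\prime} \otimes A) + (Y^{-1} \otimes A')]\operatorname{vec} X$ for a nonsingular Y = X'AX; $= 2(Y^{-1} \otimes A)\operatorname{vec} X$ for symmetric A.
$\dfrac{\partial \log |Z|}{\partial \operatorname{vec} X} = [(B \otimes Z^{-1\prime}) + (B' \otimes Z^{-1})]\operatorname{vec} X$ for a nonsingular Z = XBX'; $= 2(B \otimes Z^{-1})\operatorname{vec} X$ for symmetric B.
$\dfrac{\partial \log |X'X|}{\partial \operatorname{vec} X} = 2\operatorname{vec} X(X'X)^{-1}$.
$\dfrac{\partial \log |XX'|}{\partial \operatorname{vec} X} = 2\operatorname{vec}(XX')^{-1}X$.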
5
Linear-Regression Models
5.1. INTRODUCTION
The linear-regression model is without doubt the best-known statistical model in both the material sciences and the social sciences. Because it is so well known, it provides us with a good starting place for the introduction of classical statistical procedures. Moreover, it furnishes an easy first application of matrix calculus that assuredly becomes more complicated in future models, and its inclusion ensures completeness in our sequence of statistical models. The linear-regression model is modified in one way only to provide our basic model. Lagged values of the dependent variable will be allowed to appear on the right-hand side of the regression equation. Far more worthy candidates for the mathematical tools presented in the preceding chapters are variations of the basic model that we achieve by allowing the disturbances to be correlated, forming either an autoregressive system or a moving-average system. These modifications greatly increase the complexity of the model. Lagged values of the dependent variable appearing among the independent variables, when coupled with correlated disturbances, make the asymptotic theory associated with the application of classical statistical procedures far more difficult. The same combination also makes the differentiation required in this application more difficult. Our work then with these two variations of the basic linear-regression model requires applications of the results and concepts discussed in the first four chapters. Particularly useful in this context will be the properties of shifting matrices and generalized vec operators discussed in Sections 3.7 and 2.4, respectively. For notational convenience we drop the n from the generalized vec operator $\tau_n$, so throughout this chapter $S^{\tau}$ will stand for $S^{\tau_n}$, the generalized vec of order n of the matrix S.
The asymptotic theory required in the evaluation of the information matrices is reserved for appendices at the end of the chapter in which the appropriate assumptions regarding the existence of various probability limits associated with the model are made.
5.2. THE BASIC LINEAR-REGRESSION MODEL

5.2.1. Assumptions of the Model

Consider the linear-regression model represented by the equation
\[ y_t = \sum_{k=1}^{K} x_{tk}\beta_k + u_t, \qquad t = 1,\dots,n, \]
where we allow the possibility that some of the regressors $x_k$ represent lagged values of the dependent variable y. In matrix notation we write this equation as
\[ y = X\beta + u, \]
where y is the n x 1 vector $y = (y_t)$, X is the n x K matrix $X = (x_{tk})$, $\beta$ is the K x 1 vector $\beta = (\beta_k)$, and u is the n x 1 vector $u = (u_t)$. We make the usual assumptions: The disturbances $u_t$ are independently, identically normally distributed random variables with mean zero and variance $\sigma^2$ so $u \sim N(0, \sigma^2 I_n)$, r(X) = K so $(X'X)^{-1}$ exists, and the variables in X that are not lagged dependent variables are constants. Additionally we assume that $\operatorname{plim} X'X/n$ exists and is a positive-definite matrix and that there are no unit root problems. The last assumptions can be written more formally. Let $y_{-j}$ be the n x 1 vector whose elements are those of y except lagged j periods. Suppose $y_{-1}, \dots, y_{-g}$ form the first g variables that go to make up the matrix X. Then we require that all the g roots of the polynomial equation
\[ z^g - \beta_1 z^{g-1} - \cdots - \beta_g = 0 \]
have absolute values of less than one. We now wish to obtain the basic building blocks of classical statistical procedures, namely, the log-likelihood function, the score vector, the information matrix, and the Cramer-Rao lower bound.

5.2.2. The Log-Likelihood Function and the Score Vector

The parameters of our model are given by the (K + 1) x 1 vector $\theta = (\beta'\ \sigma^2)'$. The log-likelihood function, apart from a constant, is
\[ l(\theta) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}u'u, \]
where in this function u is set equal to $y - X\beta$. Obtaining the first component of the score vector, $\partial l/\partial\beta$, involves us in our first application of matrix calculus. By the chain rule
\[ \frac{\partial u'u}{\partial \beta} = \frac{\partial u}{\partial \beta}\frac{\partial u'u}{\partial u} = -2X'u, \]
so
\[ \frac{\partial l}{\partial \beta} = X'u/\sigma^2. \tag{5.1} \]
Clearly
\[ \frac{\partial l}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{u'u}{2\sigma^4}. \tag{5.2} \]
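The score components in Eqs. (5.1) and (5.2) can be compared with a finite-difference gradient of the log-likelihood. The sketch below is a hedged illustration on simulated data: the sample size, coefficient values, and function names (`loglik`, `score`, `deriv`) are all assumptions made for the example. It also checks that the score vanishes at the least-squares solution with $\sigma^2$ set to the mean squared residual.

```python
import numpy as np

def deriv(f, x, h=1e-5):
    """Central-difference derivative in the n x m arrangement of Definition 4.1."""
    x = np.asarray(x, dtype=float)
    rows = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        rows.append((np.atleast_1d(f(x + step)) - np.atleast_1d(f(x - step))) / (2.0 * h))
    return np.array(rows)

# Simulated data (illustrative values only)
rng = np.random.default_rng(7)
n, K = 40, 3
X = rng.normal(size=(n, K))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)

def loglik(theta):
    """l(theta) = -(n/2) log sigma^2 - u'u / (2 sigma^2), with theta = (beta', sigma^2)'."""
    beta, s2 = theta[:K], theta[K]
    u = y - X @ beta
    return -0.5 * n * np.log(s2) - (u @ u) / (2.0 * s2)

def score(theta):
    """Analytic score from Eqs. (5.1) and (5.2)."""
    beta, s2 = theta[:K], theta[K]
    u = y - X @ beta
    return np.concatenate([X.T @ u / s2, [-n / (2.0 * s2) + (u @ u) / (2.0 * s2 ** 2)]])

theta = np.concatenate([np.array([0.8, -0.2, 1.5]), [1.3]])   # arbitrary evaluation point
numeric_score = deriv(loglik, theta)[:, 0]
ok_score = np.allclose(numeric_score, score(theta), rtol=1e-5, atol=1e-5)

# The score is zero at the least-squares solution with sigma^2 = u'u / n
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat
theta_hat = np.concatenate([beta_hat, [(u_hat @ u_hat) / n]])
ok_zero_score = np.allclose(score(theta_hat), 0.0, atol=1e-8)

print(ok_score, ok_zero_score)
```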
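5.2.3. The Hessian Matrix, the Information Matrix, and the Cramer-Rao Lower Bound

The components of the Hessian matrix are easily obtained from Eqs. (5.1) and (5.2). They are
\[ \frac{\partial^2 l}{\partial \beta\,\partial \beta'} = -\frac{X'X}{\sigma^2}, \qquad \frac{\partial^2 l}{\partial \beta\,\partial \sigma^2} = -\frac{X'u}{\sigma^4}, \qquad \frac{\partial^2 l}{\partial (\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{u'u}{\sigma^6}. \]
With standard asymptotic theory, $\operatorname{plim} X'u/n = 0$ and $\operatorname{plim} u'u/n = \sigma^2$, so the information matrix is
\[ I(\theta) = \operatorname{plim}\frac{1}{n\sigma^2}\begin{pmatrix} X'X & 0 \\ 0' & \dfrac{n}{2\sigma^2} \end{pmatrix}, \]
and inverting this matrix gives the Cramer-Rao lower bound,
\[ I^{-1}(\theta) = \sigma^2 \operatorname{plim}\, n\begin{pmatrix} (X'X)^{-1} & 0 \\ 0' & \dfrac{2\sigma^2}{n} \end{pmatrix}. \]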
5.2.4. Maximum-Likelihood Estimators

Equating $\partial l/\partial\beta$ to the null vector and $\partial l/\partial\sigma^2$ to zero gives the MLEs of $\theta$:
\[ \hat\beta = (X'X)^{-1}X'y, \qquad \hat\sigma^2 = (y - X\hat\beta)'(y - X\hat\beta)/n. \]
The MLE of $\beta$ is the ordinary-least-squares (OLS) estimator. This estimator is consistent and asymptotically efficient, so
\[ \sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N(0, I^{\beta\beta}), \]
where, from the Cramer-Rao lower bound $I^{-1}(\theta)$, $I^{\beta\beta} = \sigma^2(\operatorname{plim} X'X/n)^{-1}$. This, by the way, is the only occasion on which we can actually obtain an algebraic expression for the MLE of the parameters of primary interest. In future models, we shall have to be content with an iterative interpretation of the MLE we obtain by setting the score vector equal to the null vector.

5.3. THE LINEAR-REGRESSION MODEL WITH AUTOREGRESSIVE DISTURBANCES

5.3.1. The Model, the Matrix M(α), and the Log-Likelihood Function
Consider the linear-regression equation
\[ y_t = \sum_{k=1}^{K} x_{tk}\beta_k + u_t, \qquad t = 1,\dots,n, \]
where, as in Section 5.2, we allow the possibility that some of the regressors $x_k$ represent lagged values of the dependent variable y. We also assume that the disturbances are subject to an autoregressive system of order p, so
\[ u_t + \alpha_1 u_{t-1} + \cdots + \alpha_p u_{t-p} = \varepsilon_t, \qquad t = 1,\dots,n, \]
where p < n. The $\varepsilon_t$'s are assumed to be independently, identically normally distributed random variables with mean zero and variance $\sigma^2$. We assume that there are no unit root problems. In matrix notation, we write the autoregressive process as
\[ u + \alpha_1 u_{-1} + \cdots + \alpha_p u_{-p} = \varepsilon, \]
where
\[ u = (u_1, \dots, u_n)', \qquad \varepsilon = (\varepsilon_1, \dots, \varepsilon_n)', \qquad u_{-j} = (u_{-j+1}, \dots, u_0, u_1, \dots, u_{n-j})'. \]
As far as our asymptotic theory is concerned, all presample values may be replaced with zeros without affecting our asymptotic results. We saw in Subsection 3.7.6 that if we do this at the start of our analysis we can write
\[ u_{-j} = S_j u, \]
where $S_j$ is the shifting matrix
\[ S_j = \begin{pmatrix} 0 & 0 \\ I_{n-j} & 0 \end{pmatrix}, \]
that is, the n x n matrix with ones on the jth subdiagonal and zeros elsewhere, and we can write the autoregressive process as
\[ u + \alpha_1 S_1 u + \cdots + \alpha_p S_p u = \varepsilon, \]
or $u + U_p\alpha = \varepsilon$, where $\alpha$ is the p x 1 vector $(\alpha_1 \cdots \alpha_p)'$,
\[ U_p = (S_1 u \ \cdots \ S_p u) = S(I_p \otimes u), \tag{5.3} \]
and S is the n x np matrix $S = (S_1 \cdots S_p)$. Considering $S(I_p \otimes u)\alpha = S\operatorname{vec} u\alpha' = S(\alpha \otimes I_n)u$, we can write the autoregressive process as
\[ M(\alpha)u = \varepsilon, \tag{5.4} \]
where
\[ M(\alpha) = I_n + S(\alpha \otimes I_n). \tag{5.5} \]
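From case 1 in Subsection 3.7.5,
\[ M(\alpha) = \begin{pmatrix} 1 & & & & & 0 \\ \alpha_1 & 1 & & & & \\ \vdots & \ddots & \ddots & & & \\ \alpha_p & & \ddots & \ddots & & \\ & \ddots & & \ddots & \ddots & \\ 0 & & \alpha_p & \cdots & \alpha_1 & 1 \end{pmatrix}. \tag{5.6} \]
By using Theorem 3.24 we can write
\[ M(\alpha) = I_n + \tilde S(I_n \otimes \bar S\alpha), \]
where $\tilde S = (I_n \ S_1 \ \cdots \ S_{n-1})$ and $\bar S$ is the selection matrix $(e_2 \ \cdots \ e_{p+1})$. Clearly $\operatorname{vec}\tilde S(I_n \otimes \bar S\alpha) = \tilde S^{\tau}\bar S\alpha$, and
\[ \operatorname{vec} M(\alpha) = \operatorname{vec} I_n + \tilde S^{\tau}\bar S\alpha. \tag{5.7} \]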
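The construction of M(α) from shifting matrices in Eqs. (5.3) to (5.6) can be made concrete with a small NumPy sketch. This is an illustration under the zero-presample convention described above; the function name `shifting` and the values of n, p, and α are assumptions chosen for the example.

```python
import numpy as np

def shifting(n, j):
    """S_j: the n x n matrix with ones on the j-th subdiagonal, so (S_j u)_t = u_{t-j}."""
    return np.eye(n, k=-j)

rng = np.random.default_rng(8)
n, p = 6, 2
alpha = np.array([0.5, -0.3])     # illustrative autoregressive coefficients
u = rng.normal(size=n)

# S = (S_1 ... S_p) is n x np
S = np.hstack([shifting(n, j) for j in range(1, p + 1)])

# Eq. (5.3): U_p = (S_1 u ... S_p u) = S (I_p kron u)
U_p = S @ np.kron(np.eye(p), u.reshape(-1, 1))

# Eq. (5.5): M(alpha) = I_n + S (alpha kron I_n)
M = np.eye(n) + S @ np.kron(alpha.reshape(-1, 1), np.eye(n))

# Direct construction for comparison: I_n + sum_j alpha_j S_j, the band form of Eq. (5.6)
M_direct = np.eye(n) + sum(alpha[j - 1] * shifting(n, j) for j in range(1, p + 1))

ok_lag = np.allclose((shifting(n, 1) @ u)[1:], u[:-1]) and (shifting(n, 1) @ u)[0] == 0.0
ok_U_p = all(np.allclose(U_p[:, j - 1], shifting(n, j) @ u) for j in range(1, p + 1))
ok_M = np.allclose(M, M_direct)
ok_lower_triangular = np.allclose(M, np.tril(M))
ok_process = np.allclose(M @ u, u + U_p @ alpha)   # M(alpha) u = u + U_p alpha

print(ok_lag, ok_U_p, ok_M, ok_lower_triangular, ok_process)
```

Building M(α) this way makes the lower-triangular band structure of Eq. (5.6) explicit: the diagonal is one and the jth subdiagonal carries $\alpha_j$.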
Therefore, by replacing presample values with zeros, we can write our model succinctly in matrix notation as
\[ y = X\beta + u, \qquad M(\alpha)u = \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I), \]
where M(α) is the matrix given by Eqs. (5.5) and (5.6). The parameters of our model are $\theta = (\beta'\ \alpha'\ \sigma^2)'$, and the log-likelihood function (apart from a constant) is
\[ l(\theta) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\varepsilon'\varepsilon, \]
where in this function $\varepsilon$ is set equal to $M(\alpha)(y - X\beta)$. In applying matrix calculus to help us work out the derivatives of this function, we need $\partial\varepsilon/\partial\alpha$ and $\partial\operatorname{vec} M(\alpha)/\partial\alpha$. Both these expressions can be written simply in terms of our shifting matrices. Clearly
\[ \varepsilon = u + S(\alpha \otimes I_n)u = u + S\operatorname{vec} u\alpha' = u + S(I_p \otimes u)\alpha, \]
so
\[ \frac{\partial \varepsilon}{\partial \alpha} = (I_p \otimes u')S' = U_p'. \tag{5.8} \]

$^1$ The matrix M(α) was originally used by Phillips (1966) and appears in the works of other authors such as Pagan (1974) and Godfrey (1978a).
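From Eq. (5.7), we see that
\[ \frac{\partial \operatorname{vec} M(\alpha)}{\partial \alpha} = \bar S'(\tilde S^{\tau})'. \tag{5.9} \]

5.3.2. The Score Vector $\partial l/\partial\theta$ and the Hessian Matrix $\partial^2 l/\partial\theta\,\partial\theta'$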
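If we use the derivatives given by Eqs. (5.8) and (5.9) it is a simple matter to obtain the score vector of our model. The components of this vector are as follows:
\[ \frac{\partial l}{\partial \beta} = X'M(\alpha)'\varepsilon/\sigma^2, \tag{5.10} \]
\[ \frac{\partial l}{\partial \alpha} = -(I_p \otimes u')S'\varepsilon/\sigma^2 = -U_p'\varepsilon/\sigma^2, \tag{5.11} \]
\[ \frac{\partial l}{\partial \sigma^2} = -n/2\sigma^2 + \varepsilon'\varepsilon/2\sigma^4. \tag{5.12} \]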
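In a similar manner, certain components of the Hessian matrix are easily obtained with Eqs. (5.8) and (5.9). They are
\[ \frac{\partial^2 l}{\partial \beta\,\partial \beta'} = -X'M(\alpha)'M(\alpha)X/\sigma^2, \]
\[ \frac{\partial^2 l}{\partial \beta\,\partial \sigma^2} = -X'M(\alpha)'\varepsilon/\sigma^4 = \left(\frac{\partial^2 l}{\partial \sigma^2\,\partial \beta'}\right)', \]
\[ \frac{\partial^2 l}{\partial \alpha\,\partial \sigma^2} = (I_p \otimes u')S'\varepsilon/\sigma^4 = \left(\frac{\partial^2 l}{\partial \sigma^2\,\partial \alpha'}\right)', \]
\[ \frac{\partial^2 l}{\partial \alpha\,\partial \alpha'} = -(I_p \otimes u')S'S(I_p \otimes u)/\sigma^2 = -U_p'U_p/\sigma^2, \]
\[ \frac{\partial^2 l}{\partial (\sigma^2)^2} = n/2\sigma^4 - \varepsilon'\varepsilon/\sigma^6. \]
The remaining derivative, namely $\partial^2 l/\partial\beta\,\partial\alpha'$, requires a bit more work. From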
Eq. (5.10) we can write
al = ap
M(a)' M(a)u I a2= (u' 0 X') vec M(a)1111(a)1a2
Now, using the backward chain rule, we have
a vec M(a)/ M (a)
a vec M(a) a vec M(a)/M(a) as a vec M(a)
as
= 2S'St [I„ 0 M(a)]N., 1 where Nn= — (in2 +
2
is a commutation matrix. It follows that
and
aX'M(a)'M(a)u = 2S'St [/,, 0 M(a)]N.(u 0 X) as = S'St [/„ 0 M(a)][(X 0 u) + (u 0 X)] = S' St [(X 0 E) + [u 0 M(a)X]1 = S' St [(1. ®E) + (u 0 1„)M(a)1X. However, from Theorem 3.24 we have S'St (/.
E) =
0 OS',
S'St (u 0 In) = [S(/p 0u)]' = U. If follows then that
ap 2la, = a2 [x/ St p 0+ x' m(ar uP 1
/ a21 V = aaaP'
1 821 5.3.3. The Information Matrix 1(0) = —p limn 0000' The probability limits required for forming this matrix are worked out in Appendix S.A. By using these limits we can write the information matrix as
1 I(6)= plim— naz
—U'p X
—g'U p UP U p
0'
0'
0 0
(5.13)
2.72
where X = M(a)X. For the special case in which X is exogenous and contains
no lagged dependent variables, the information matrix simplifies
0
O
1 [0UU p 0 1(0)= plimn nag 0, 0' 2(32
(5.14)
5.3.4. The Cramer-Rao Lower Bound /-1(0) Inverting the information matrix is straightforward. We obtain (II m plyi
p 5)-i I-1(6 ) = a2p limn [(Wpti 0-1 Vp i ( I' M
(I'Mpi)-l l'U p(WpU p)-1 (Up /C/Up)-i
0'
'
0'
0 0 2a2'
(5.15) for the general case, in which Mp =
- up(u'pup)-lup,
k=
g(g/g)-ig',
and
(g/g)-1 o 0 0 O (u p / UO-1 V I (0) = cr2 p limn 2C/2 0' 0' n for the case in which X is exogenous.
(5.16)
5.3.5. Statistical Inference from the Score Vector and the Information Matrix

We have seen in the preceding subsections how the use of shifting matrices and matrix calculus greatly facilitates the obtaining of the score vector and information matrix of our model. With these matrices in hand, we can easily derive statistical results from them. These results, most of which are well known, are listed in this section.

5.3.5.1. Maximum-Likelihood Estimators as Iterative Generalized-Least-Squares Estimators

Using the score vector given by Eqs. (5.10)-(5.12), we can obtain an interpretation of the MLE of $\beta$ as an iterative generalized-least-squares (GLS) estimator for the model with autoregressive disturbances. Returning to the score vector, we see that $\partial l/\partial\sigma^2 = 0$ gives

$$\tilde\sigma^2 = \tilde\varepsilon'\tilde\varepsilon/n = \tilde u'M(\tilde\alpha)'M(\tilde\alpha)\tilde u/n,$$

and $\partial l/\partial\beta = 0$ gives

$$X'M(\tilde\alpha)'M(\tilde\alpha)(y - X\tilde\beta) = 0,$$

which yields

$$\tilde\beta = [X'M(\tilde\alpha)'M(\tilde\alpha)X]^{-1}X'M(\tilde\alpha)'M(\tilde\alpha)y.$$

Finally,
a//aa = 0 gives Up'E=0, but Up'E =Up/vec M(a)u = U'p (u/0 I„)vec M(a)
= U'P(u' 0 /,i)( vec /„ + St Sa). Now
U'p(u' 0 /,i ) vec i„ = U'p u, and, by using Theorem 3.24, we obtain
U'p(u' 0 /,i)gt Sa = U 'p S(/p 0u)a = U'pU pa. Therefore,
$\partial l/\partial\alpha = 0$ gives rise to the equations

$$-\tilde U_p'\tilde U_p\tilde\alpha = \tilde U_p'\tilde u, \qquad \tilde\alpha = -(\tilde U_p'\tilde U_p)^{-1}\tilde U_p'\tilde u.$$

Clearly the solution we have obtained for the MLEs is iterative, as $\tilde\alpha$ still contains $\tilde\beta$ through $\tilde U_p$ and $\tilde\beta$ clearly contains $\tilde\alpha$.

5.3.5.2. Asymptotic Efficiency

Suppose $\alpha$ was known. Then from Eq. (5.13) the information matrix would be

$$I\begin{pmatrix}\beta \\ \sigma^2\end{pmatrix} = \operatorname{plim}\frac{1}{n\sigma^2}\begin{bmatrix} \tilde X'\tilde X & 0 \\ 0' & \dfrac{n}{2\sigma^2} \end{bmatrix}$$

and the asymptotic Cramer-Rao lower bound for a consistent estimator of $\beta$ would be $\operatorname{plim}\sigma^2(\tilde X'\tilde X/n)^{-1}$. The GLS estimator $\bar\beta = (\tilde X'\tilde X)^{-1}\tilde X'\tilde y$ would, for example, attain this bound. From Eq. (5.14) it is clear that, even if $\alpha$ is unknown, provided that $X$ is exogenous and does not contain lagged values of the dependent variable, the Cramer-Rao lower bound for a consistent estimator of $\beta$ is still $\operatorname{plim}\sigma^2(\tilde X'\tilde X/n)^{-1}$. The GLS estimator

$$\hat\beta = (\hat X'\hat X)^{-1}\hat X'\hat y,$$

where $\hat X = M(\hat\alpha)X$, $\hat y = M(\hat\alpha)y$, and $\hat\alpha$ is a consistent estimator of $\alpha$, for this special case would be as efficient asymptotically as $\bar\beta$. However, from Eq. (5.15) it is clear that, if $X$ contains lagged values of $y$, the Cramer-Rao lower bound for a consistent estimator of $\beta$ is now $\sigma^2\operatorname{plim}(\tilde X'M_p\tilde X/n)^{-1}$. The GLS estimator $\hat\beta$ still attains this bound, so it is still asymptotically efficient, but it is less efficient than $\bar\beta$, the GLS estimator formed with $\alpha$ known.

5.3.5.3. Classical Tests for the Null Hypothesis $H_0: \alpha = 0$

LAGRANGIAN MULTIPLIER TEST STATISTIC FOR $H_0: \alpha = 0$. We wish to obtain the Lagrangian multiplier test (LMT) statistic for the null hypothesis $H_0: \alpha = 0$ against the alternative $H_A: \alpha \neq 0$. Recall from Chap. 1 that the LMT statistic can be written as
$$T_1 = \frac{1}{n}\left.\frac{\partial l}{\partial\alpha}\right|_{\hat\theta}' I^{\alpha\alpha}(\hat\theta)\left.\frac{\partial l}{\partial\alpha}\right|_{\hat\theta},$$

where $I^{\alpha\alpha}$ refers to that part of the Cramer-Rao lower bound corresponding to $\alpha$ and $\hat\theta$ refers to the constrained MLE of $\theta$; that is, in forming $\hat\theta$, we set $\alpha$ equal to 0 and $\beta$ and $\sigma^2$ equal to $\hat\beta$ and $\hat\sigma^2$, the constrained MLEs of $\beta$ and $\sigma^2$ obtained after we impose $H_0: \alpha = 0$ on our model. However, with $\alpha = 0$, $y = X\beta + \varepsilon$ and $\varepsilon \sim N(0, \sigma^2 I)$, so we are confronted with the linear-regression model with the sole modification that $X$ may contain lagged values of $y$. It follows that the constrained MLEs are the OLS estimators $\hat\beta = (X'X)^{-1}X'y$ and $\hat\sigma^2 = y'My/n$, where $M = I_n - X(X'X)^{-1}X'$. Finally, under $H_0$, $u$ equates to $\varepsilon$, so

$$\left.\frac{\partial l}{\partial\alpha}\right|_{\hat\theta} = -\hat U_p'\hat u/\hat\sigma^2,$$

where $\hat U_p$ is formed from the elements of the OLS residual vector $\hat u = My$. We are now in a position to form the LMT statistic. We do this both for the case in which $X$ is exogenous and for the more general case in which $X$ contains lagged values of the dependent variable $y$.

First, when $X$ is exogenous, we have seen that $I^{\alpha\alpha} = \sigma^2(\operatorname{plim} U_p'U_p/n)^{-1}$, so if we ignore the probability limit,

$$I^{\alpha\alpha}(\hat\theta) = \hat\sigma^2(\hat U_p'\hat U_p/n)^{-1}.$$

The LMT statistic for this special case would be

$$T_1' = \hat u'\hat U_p(\hat U_p'\hat U_p)^{-1}\hat U_p'\hat u/\hat\sigma^2.$$

Clearly $T_1'$ is the ratio of the explained variation to the unexplained variation of a regression of $\hat u$ on $\hat U_p$. Under $H_0$, $T_1'$ asymptotically has a $\chi^2$ distribution with $p$ degrees of freedom, and the upper tail of this distribution would be used to find the appropriate critical region.

Second, with $X$ containing lagged values of $y$, we have seen that $I^{\alpha\alpha}$ is more complicated. Now it is given by $I^{\alpha\alpha} = \sigma^2(\operatorname{plim} U_p'\tilde M U_p/n)^{-1}$. Under $H_0$, $\tilde X = X$ and $\tilde M = M$, so if we again ignore the probability limit,

$$I^{\alpha\alpha}(\hat\theta) = \hat\sigma^2(\hat U_p'M\hat U_p/n)^{-1},$$

and the LMT statistic now becomes

$$T_1 = \hat u'\hat U_p(\hat U_p'M\hat U_p)^{-1}\hat U_p'\hat u/\hat\sigma^2.$$

This test statistic was first obtained by Godfrey (1978a), and, under $H_0$, $T_1$ asymptotically has a $\chi^2$ distribution with $p$ degrees of freedom. Godfrey shows that this test statistic is asymptotically equivalent to $p$ times the usual $F$ test statistic one would form for the null hypothesis in the regression
$$y = X\beta - \hat U_p\alpha + \eta,$$

the only difference in the two test statistics being the consistent estimator used for $\sigma^2$.

WALD TEST STATISTIC FOR $H_0: \alpha = 0$. As in the first part of this subsection, let $\tilde\theta = (\tilde\beta'\ \tilde\alpha'\ \tilde\sigma^2)'$ be the MLE of $\theta$. By using these estimators, we could form $M(\tilde\alpha)$, $\tilde X = M(\tilde\alpha)X$, $\tilde M = I_n - \tilde X(\tilde X'\tilde X)^{-1}\tilde X'$, $\tilde u = y - X\tilde\beta$, and $\tilde U_p$. The Wald test statistic would be

$$T_2' = \tilde\alpha'\tilde U_p'\tilde U_p\tilde\alpha/\tilde\sigma^2 \tag{5.17}$$

for the case in which $X$ is exogenous and

$$T_2 = \tilde\alpha'\tilde U_p'\tilde M\tilde U_p\tilde\alpha/\tilde\sigma^2 \tag{5.18}$$

for the more general case in which $X$ contains lagged values of $y$. The difficulty with these test statistics is in obtaining an algebraic expression for the MLE $\tilde\theta$. We find the MLE $\tilde\theta$ by solving the system of equations $\partial l/\partial\theta = 0$ for $\theta$, but a casual glance at the score vector shows that this system of equations is highly nonlinear. However, numerical techniques, with the aid of fast computers, should ensure that we can come up with the maximum-likelihood estimates of $\theta$ and thus the observed values of the Wald test statistics $T_2'$ and $T_2$.
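One simple numerical route, sketched here only as an illustration, is to cycle through the iterative solutions of Subsection 5.3.5.1: update $\beta$ by GLS given $\alpha$, then update $\alpha$ from the residuals given $\beta$. The function name `iterative_gls`, the simulated data, the seed, and the fixed number of iterations are all assumptions made for this example; no claim is made that this is the numerical method the author has in mind, nor that the iteration converges in every sample.

```python
import numpy as np

def shift(n, j):
    """n x n shifting matrix S_j: (S_j u)_t = u_{t-j}, presample values zero."""
    return np.eye(n, k=-j)

def iterative_gls(y, X, p, n_iter=25):
    """Cycle through the first-order conditions for beta, alpha, sigma^2."""
    n = len(y)
    alpha = np.zeros(p)
    for _ in range(n_iter):
        # M(alpha) = I_n + sum_j alpha_j S_j
        M = np.eye(n) + sum(a * shift(n, j) for j, a in enumerate(alpha, start=1))
        # GLS step: OLS on the transformed data M(alpha) y and M(alpha) X
        beta = np.linalg.lstsq(M @ X, M @ y, rcond=None)[0]
        u = y - X @ beta
        Up = np.column_stack([shift(n, j) @ u for j in range(1, p + 1)])
        # alpha = -(U_p' U_p)^{-1} U_p' u
        alpha = -np.linalg.solve(Up.T @ Up, Up.T @ u)
    eps = u + Up @ alpha            # epsilon = M(alpha) u
    sigma2 = eps @ eps / n
    return beta, alpha, sigma2

# Simulated example (all values assumed): u_t + 0.5 u_{t-1} = eps_t
rng = np.random.default_rng(3)
n = 400
beta_true = np.array([1.0, 2.0])
alpha_true = np.array([0.5])
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
eps = rng.standard_normal(n)
u = np.linalg.solve(np.eye(n) + alpha_true[0] * shift(n, 1), eps)
y = X @ beta_true + u

beta_hat, alpha_hat, sigma2_hat = iterative_gls(y, X, p=1)
```

With the estimates in hand, the Wald statistic (5.17) could be formed as `alpha_hat @ Up.T @ Up @ alpha_hat / sigma2_hat` once `Up` is rebuilt from the final residuals.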
We can then determine whether the observed values fall in the appropriate critical region and act accordingly. This being said, it is still possible for comparison purposes to get some insights into the Wald test statistics by use of the iterative solution for a obtained in Subsection 5.3.5.1, namely = (i7,19 0.0-1 cf u. Substituting this iterative solution into Eq. (5.17), we get T2
ii /C1p(CFpC1p)-111/pii [6 2,
where the symbol stands for "is asymptotically equivalent to." Comparing T2 with T1, we see that the test statistic Tz essentially has the same format as that of the LMT statistic T1, the difference being that the former works with the unconstrained MLEs whereas the latter uses the OLS estimators. For the more general case in which X contains lagged values of y, we have
T2
11/17 p(17/1,17 0-1(17 P117117OW P 1719)-117/p1ef'2
Noting that for the LMT statistic T1, as U'p T1
(5.19)
= UP MUM, we could write
= ii'O p(frpOp)-1(frpM6p)(frpOp)-1 frpii/a2,
so again the Wald test statistic essentially has the same format as that of the LMT statistic. The difference between the two test statistics is that the former works with the MLE in forming a and U pwhereas the LMT statistic uses the OLS estimators and the (equivalent) Wald test statistic uses M in the covariance matrix whereas the LMT statistic uses M. THE LIKELIHOOD RATIO TEST STATISTIC FOR Ho. The likelihood ratio test (LRT) statistic is
T3 = 2[1(o) — 1(5)]. In obtaining an explicit expression for this test statistic, we are faced with the same difficulty as we had with the Wald test statistic in that both test statistics work with the MLE B. However, as with the Wald test statistic we can get some insight into the nature of the LRT statistic by using the iterative solution a -(U'Up)-1U/pii. Clearly g = u + Upa, so substituting our iterative solution for a into this equation allows us to write
E = lapil with lap = Our iterative solution for (32 is for e we write 2
-/
M uln.
In - Up((/ ///p)-1(/ / . a2 g'g
/n,so substituting the above expression (5.20)
By substituting expression (5.20) into the log-likelihood function we get, apart
from a constant, /(5) and an equivalent test for the LRT statistic would be —n log a2/6 2.
T3
As log is a monotonic function, the LRT statistic is equivalent to T3 = nei 2162 =
—
CI'P (17'PP CI )-111'P al/62.
Note from Eq. (5.19) that we can write
[a'a — a/Up(CFpUp)-1(17/pgri7p)(17/p17p)-117'pall5 2,
T2
with = and we can obtain an equivalent expression for T1. Again, we see the essential similarities between the test statistics. For example, the numerator in the test statistic equivalent to T2 would be the same as that of T3 except that the former uses N in forming the matrix in the quadratic form whereas the latter is the identity matrix. 5.4. LINEAR-REGRESSION MODEL WITH MOVING-AVERAGE DISTURBANCES
5.4.1. The Model, the Log-Likelihood Function, and Important Derivatives

Consider the linear-regression equation

$$y_t = \sum_{k=1}^{K} x_{tk}\beta_k + u_t, \qquad t = 1, \ldots, n,$$

where again we allow for the possibility that some of the regressors $x_k$ represent lagged values of the dependent variable $y$. Now we assume that the disturbances are subject to a moving-average process of order $p$, which we write as

$$u_t = \varepsilon_t + \alpha_1\varepsilon_{t-1} + \cdots + \alpha_p\varepsilon_{t-p},$$

where $p < n$. The $\varepsilon_t$s are assumed to be independently, identically normally distributed random variables with mean zero and variance $\sigma^2$. Again, we assume that there are no unit root problems arising from lagged dependent variables. Putting this process in matrix notation and replacing presample values with zeros, we can write the process as

$$u = M(\alpha)\varepsilon.$$

Clearly, as $|M(\alpha)| = 1$, $M(\alpha)$ is nonsingular, and we can write $\varepsilon = M(\alpha)^{-1}u$.
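To make the matrix form concrete, here is a small sketch that builds the shifting matrices and $M(\alpha) = I_n + \sum_j \alpha_j S_j$, then checks that $u = M(\alpha)\varepsilon$ reproduces the moving-average recursion with zero presample values, that $|M(\alpha)| = 1$, and that $\varepsilon$ is recovered from $u$. The helper names and the numerical values of `alpha` and `eps` are assumptions chosen for illustration.

```python
import numpy as np

def shifting_matrices(n, p):
    """S_1, ..., S_p with ones on the j-th subdiagonal: (S_j e)_t = e_{t-j}."""
    return [np.eye(n, k=-j) for j in range(1, p + 1)]

def M(alpha, n):
    """M(alpha) = I_n + alpha_1 S_1 + ... + alpha_p S_p (unit lower triangular)."""
    S = shifting_matrices(n, len(alpha))
    return np.eye(n) + sum(a * Sj for a, Sj in zip(alpha, S))

n = 6
alpha = [0.4, -0.3]                 # illustrative MA(2) coefficients
eps = np.arange(1.0, n + 1)         # illustrative innovations
Ma = M(alpha, n)

u = Ma @ eps                        # matrix form of the MA process

# Direct recursion u_t = eps_t + a_1 eps_{t-1} + a_2 eps_{t-2}, presample zeros
u_direct = np.array([
    eps[t] + sum(a * eps[t - j] for j, a in enumerate(alpha, start=1) if t - j >= 0)
    for t in range(n)
])

det_M = np.linalg.det(Ma)           # unit lower triangular, so determinant is 1
eps_back = np.linalg.solve(Ma, u)   # eps = M(alpha)^{-1} u
```

Solving the triangular system with `np.linalg.solve` avoids forming $M(\alpha)^{-1}$ explicitly, which is usually preferable numerically.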
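We assume that the disturbance process is invertible. In matrix notation, then, we can write our model succinctly as

$$y = X\beta + u, \qquad u = M(\alpha)\varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I).$$

Again, the parameters of our model are $\theta = (\beta'\ \alpha'\ \sigma^2)'$ and the log-likelihood function is

$$l(\theta) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\varepsilon'\varepsilon,$$

except now $\varepsilon$ is set equal to $M(\alpha)^{-1}(y - X\beta)$ in this function. As in the preceding model, we need to acquire an expression for $\partial\varepsilon/\partial\alpha$. To do this, we write

$$\varepsilon = \operatorname{vec} M(\alpha)^{-1}u = (u' \otimes I_n)\operatorname{vec} M(\alpha)^{-1}.$$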
Using the backward chain rule, we then have
aE _ a vec M(a) a vec M(a)-1a(u/ /,i)vec M(a)-1 as a vec M(a) a vec M(a)-1
as
= —S'S'T'[M(a)-1 0M(a)-1I(u 0 In
)
= —S'gt (E 0 /,i)M(ce)-v . However, from Theorem 3.24, S'St (E
In) =
0 E')S' = EP
,
where

$$E_p = S(I_p \otimes \varepsilon). \tag{5.21}$$

It follows that

$$\frac{\partial\varepsilon}{\partial\alpha} = -E_p'M(\alpha)^{-1\prime}. \tag{5.22}$$

Note that this derivative is more complicated than the corresponding one for the autoregressive model in that $\partial\varepsilon/\partial\alpha$ now involves $M(\alpha)^{-1}$. This means in turn that certain components of the Hessian matrix $\partial^2 l/\partial\theta\,\partial\theta'$ will of necessity be more complicated. However, again by use of our results on shifting matrices, these derivatives are obtainable.
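Equation (5.22) can be checked numerically. The sketch below compares the analytic $p \times n$ matrix $-E_p'M(\alpha)^{-1\prime}$ with a central finite-difference approximation, using the convention that row $j$ holds the derivative of $\varepsilon'$ with respect to $\alpha_j$. The helper names, the step size `h`, and the random `u` are assumptions made for the example.

```python
import numpy as np

def M(alpha, n):
    """M(alpha) = I_n + sum_j alpha_j S_j, with S_j the j-th shifting matrix."""
    return np.eye(n) + sum(a * np.eye(n, k=-j) for j, a in enumerate(alpha, start=1))

def eps_of(alpha, u):
    """epsilon = M(alpha)^{-1} u for the moving-average model."""
    return np.linalg.solve(M(alpha, len(u)), u)

rng = np.random.default_rng(1)
n, p = 8, 2
alpha = np.array([0.5, -0.2])       # illustrative values
u = rng.standard_normal(n)

eps = eps_of(alpha, u)
# E_p = S (I_p kron eps) = (S_1 eps, ..., S_p eps), an n x p matrix
Ep = np.column_stack([np.eye(n, k=-j) @ eps for j in range(1, p + 1)])
analytic = -Ep.T @ np.linalg.inv(M(alpha, n)).T      # Eq. (5.22), p x n

# Central finite differences, one row per element of alpha
h = 1e-6
numeric = np.empty((p, n))
for j in range(p):
    step = np.zeros(p)
    step[j] = h
    numeric[j] = (eps_of(alpha + step, u) - eps_of(alpha - step, u)) / (2 * h)

max_abs_diff = np.max(np.abs(analytic - numeric))
```

The agreement follows from $\partial M(\alpha)^{-1}/\partial\alpha_j = -M(\alpha)^{-1}S_jM(\alpha)^{-1}$, which gives $\partial\varepsilon/\partial\alpha_j = -M(\alpha)^{-1}S_j\varepsilon$ column by column.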
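5.4.2. The Score Vector $\partial l/\partial\theta$ and the Hessian Matrix $\partial^2 l/\partial\theta\,\partial\theta'$

By using the derivative $\partial\varepsilon/\partial\alpha$ given by Eq. (5.22) and the derivative $\partial\operatorname{vec}M(\alpha)/\partial\alpha$ given by Eq. (5.9), it is a simple matter to obtain the score vector. Its components are

$$\frac{\partial l}{\partial\beta} = X'M(\alpha)^{-1\prime}\varepsilon/\sigma^2, \tag{5.23}$$

$$\frac{\partial l}{\partial\alpha} = (I_p \otimes \varepsilon')S'M(\alpha)^{-1\prime}\varepsilon/\sigma^2 = E_p'M(\alpha)^{-1\prime}\varepsilon/\sigma^2, \tag{5.24}$$

$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{\varepsilon'\varepsilon}{2\sigma^4}. \tag{5.25}$$

As with the preceding model, certain components of the Hessian matrix are easily obtained with Eqs. (5.22) and (5.9). They are

$$\frac{\partial^2 l}{\partial\beta\,\partial\beta'} = -X'M(\alpha)^{-1\prime}M(\alpha)^{-1}X/\sigma^2,$$

$$\frac{\partial^2 l}{\partial\beta\,\partial\sigma^2} = -X'M(\alpha)^{-1\prime}\varepsilon/\sigma^4 = \left(\frac{\partial^2 l}{\partial\sigma^2\,\partial\beta'}\right)',$$

$$\frac{\partial^2 l}{\partial\alpha\,\partial\sigma^2} = -E_p'M(\alpha)^{-1\prime}\varepsilon/\sigma^4,$$

$$\frac{\partial^2 l}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{\varepsilon'\varepsilon}{\sigma^6}.$$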
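The remaining derivatives, namely $\partial^2 l/\partial\beta\,\partial\alpha'$ and $\partial^2 l/\partial\alpha\,\partial\alpha'$, require extra effort and will again involve our results on shifting matrices. Consider first obtaining $\partial^2 l/\partial\beta\,\partial\alpha'$. To this end write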
X//1/1(a)-1'M(a)-l u = X' vec M(a)-1'M(a)-lu = X'(u' 0 /n)vec M(a)Now,
a vec M(a)-1' M(a)-1 as
a vec M(a) a vec M(a)-1a vec M(a)-1' M(a)-1 avec M(a) a vec M(a)-1 as = —2S'St [M(a)-1 M(a)-11[1. 0 M(a)-1]Nn,
SO
aX' vec M(a)-1' M(a) l u
as
= S'gt' [M(a)-10 M(a)-1' M(a)-1] x [(In u)+ (u O LAX =—S St 1[1. M(a)-1' E]+ (E 0 In) x M(a)-1'1111(a)-1X. '
However, from Theorem 3.24,
S'gt [/„ 0 M(a)-1 'E] = S'S'e(E
[I p
0 E/M(a)-1]St,
0 In) = (Ip 0E')S' = E p,
SO
a2 1 aPaa'
= X/ M(a)-1' {Slip 0M(a)-1'E] + M(a)-1Epl/a2 ( all ),
aaap, • The most difficult derivative to derive is a2//aaaa/. Using the product rule of matrix calculus and referring to Eq. (5.24), we write a —
aa
(1p0 E')S' M(a)-1E
a yew p0 E')S [M(a)-1 E 0 4] as a m(a)-1' E S p 0 E). (5.26) + as '
—
Using the backward chain rule, we have
a yew p 0E')S' _ aE a vec S(/p 0E) a yew p 0E')S' — as aE a vec S(11, ® E) ' as and, as vec S(/p 0E) = ST E, a vec S(11, 0 E)/aE = Se, we have
a yew p 0 E')S'
= -E'l, M(a)-vSeK„p, as where Km, is a commutation matrix. Again, using the product rule, we have
(5.27)
am(a)-v E = a vec m(a)-1' (E 0 In) + aE M(a)-1, aa
aa
aa a vec M(a)-1/ a vec M(a) a vec M(a)' vec M(a)-1/ = as a vec M(a) a vec M(a)' as = —S' St K..[M(a)-1' 0 M(a)-11,
so aM(a)-1E
as
= —S' ST K„,i [M(a)-1/E 0 M(a)-1] —E' M(a)-1/ M(a)-1. (5.28)
Using Eqs. (5.27) and (5.28) in Eq. (5.26), we obtain azi - -{E' M(a)-1'St'Knp[M(a)-V E 0 8a 8a'— P
Ip]
± S' ST K nn [M (0-1' E 0 M(a)-11E p + E p M(a)-1/111(a)-1E plIcr 2.
(5.29)
From the definition of E pgiven by Eq. (5.21) and the properties of the commutation matrix, we can write the first matrix on the right-hand side of Eq. (5.29) as -(I p0 E')S'ill(a)-1' St' [I p0 M(a)-1' E]lcr 2 . Again using the properties of the commutation matrix, we can write the second matrix on the right-hand side of Eq. (5.29) as -S'S'T [In 0M(a)-1'E]M(a)-1S(I p0 E)/(32. However, by Theorem 3.24, this matrix is equal to -[I p0 E/ M(a)-i ]st M(a)-1 syp 0 010.2,so we can write ,921
aa 8a
= - ((In0 E')S' M(a)-1S1I p0 M(a)-1' E] '
+ {(I p0 EVM(a)-1Se[I p0 M(a)-1'E]}' + rp M(a)- vM(a)-1Ep)1(32 5.4.3. The Information Matrix 1(0) = -p lint 4821/8080, The probability limits required for forming this matrix are worked out in Appendix S.B. Using these limits, we can write the information matrix as
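$$I(\theta) = \operatorname{plim}\frac{1}{n\sigma^2}\begin{bmatrix} X^{*\prime}X^* & X^{*\prime}M(\alpha)^{-1}E_p & 0 \\ E_p'M(\alpha)^{-1\prime}X^* & E_p'M(\alpha)^{-1\prime}M(\alpha)^{-1}E_p & 0 \\ 0' & 0' & \dfrac{n}{2\sigma^2} \end{bmatrix}, \tag{5.30}$$

where $X^* = M(\alpha)^{-1}X$. For the special case in which $X$ is exogenous and contains no lagged dependent variables, the information matrix simplifies to

$$I(\theta) = \operatorname{plim}\frac{1}{n\sigma^2}\begin{bmatrix} X^{*\prime}X^* & 0 & 0 \\ 0 & E_p'M(\alpha)^{-1\prime}M(\alpha)^{-1}E_p & 0 \\ 0' & 0' & \dfrac{n}{2\sigma^2} \end{bmatrix}. \tag{5.31}$$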
5.4.4. The Cramer-Rao Lower Bound $I^{-1}(\theta)$

Inverting the information matrix is straightforward. We obtain

$$I^{-1}(\theta) = \sigma^2 \operatorname{plim}\, n\begin{bmatrix} (X^{*\prime}M_FX^*)^{-1} & -(X^{*\prime}M_FX^*)^{-1}X^{*\prime}F(F'F)^{-1} & 0 \\ -(F'F)^{-1}F'X^*(X^{*\prime}M_FX^*)^{-1} & (F'M^*F)^{-1} & 0 \\ 0' & 0' & \dfrac{2\sigma^2}{n} \end{bmatrix}, \tag{5.32}$$

where $F = M(\alpha)^{-1}E_p$, $M_F = I_n - F(F'F)^{-1}F'$, and $M^* = I_n - X^*(X^{*\prime}X^*)^{-1}X^{*\prime}$ for the general case, and

$$I^{-1}(\theta) = \sigma^2 \operatorname{plim}\, n\begin{bmatrix} (X^{*\prime}X^*)^{-1} & 0 & 0 \\ 0 & (F'F)^{-1} & 0 \\ 0' & 0' & \dfrac{2\sigma^2}{n} \end{bmatrix} \tag{5.33}$$

for the case in which $X$ is exogenous.

5.4.5. Statistical Inference from the Score Vector and the Information Matrix

As with the preceding model, the score vector and the information matrix can be used to obtain statistical results for our model, and this is the purpose of this subsection.

5.4.5.1. Maximum-Likelihood Estimators as Iterative Generalized-Least-Squares Estimators

In Subsection 5.3.5.1 we saw that we could obtain iterative solutions for the MLEs $\tilde\beta$, $\tilde\alpha$, and $\tilde\sigma^2$ in the autoregressive case and that these solutions gave an iterative GLS interpretation for the MLE of $\beta$. Interestingly enough, a similar interpretation for the MLE of $\beta$ in the moving-average disturbances case does not appear to be available. Consider the score vector for this model given by Eqs. (5.23)-(5.25). Solving $\partial l/\partial\beta = 0$ gives

$$\tilde\beta = [X'M(\tilde\alpha)^{-1\prime}M(\tilde\alpha)^{-1}X]^{-1}X'M(\tilde\alpha)^{-1\prime}M(\tilde\alpha)^{-1}y,$$

as expected, but the difficulty arises from $\partial l/\partial\alpha = 0$. Unlike in the autoregressive case, we cannot extract $\alpha$ from this equation. The problem is that whereas $\operatorname{vec}M(\alpha)$ is linear in $\alpha$ [see Eq. (5.7)], $\operatorname{vec}M(\alpha)^{-1}$ is highly nonlinear in $\alpha$.
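The nonlinearity is easy to see in the simplest case. For $p = 1$, $M(\alpha) = I_n + \alpha S_1$, and the first column of $M(\alpha)^{-1}$ is $(1, -\alpha, \alpha^2, -\alpha^3, \ldots)'$, so the elements of the inverse are polynomials in $\alpha$ of increasing degree. The following sketch, with an assumed size `n` and value `a`, confirms this.

```python
import numpy as np

n = 5
a = 0.5                              # illustrative MA(1) coefficient
S1 = np.eye(n, k=-1)                 # first shifting matrix
M = np.eye(n) + a * S1               # M(alpha) is linear in alpha
Minv = np.linalg.inv(M)              # M(alpha)^{-1} is not

first_col = Minv[:, 0]
expected = np.array([(-a) ** k for k in range(n)])   # 1, -a, a^2, -a^3, a^4
```

Each entry of $M(\alpha)$ is affine in $\alpha$, whereas the entries of $M(\alpha)^{-1}$ are powers of $-\alpha$, which is why the first-order condition for $\alpha$ cannot be rearranged into a least-squares formula.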
5.4.5.2. Asymptotic Efficiency

Suppose $\alpha$ is known. Then the information matrix would be

$$I\begin{pmatrix}\beta \\ \sigma^2\end{pmatrix} = \operatorname{plim}\frac{1}{n\sigma^2}\begin{bmatrix} X^{*\prime}X^* & 0 \\ 0' & \dfrac{n}{2\sigma^2} \end{bmatrix}$$

and the asymptotic Cramer-Rao lower bound for a consistent estimator of $\beta$ would be $\sigma^2\operatorname{plim}(X^{*\prime}X^*/n)^{-1}$. The GLS estimator $\bar\beta = (X^{*\prime}X^*)^{-1}X^{*\prime}y^*$ would attain this bound. From Eq. (5.33) it is clear that, even if $\alpha$ is unknown, provided that $X$ is exogenous and does not contain lagged values of the dependent variable, the Cramer-Rao lower bound for a consistent estimator of $\beta$ is still $\operatorname{plim}\sigma^2(X^{*\prime}X^*/n)^{-1}$. The GLS estimator

$$\hat\beta = (\hat X^{*\prime}\hat X^*)^{-1}\hat X^{*\prime}\hat y^*,$$

where $\hat X^* = M(\hat\alpha)^{-1}X$, $\hat y^* = M(\hat\alpha)^{-1}y$, and $\hat\alpha$ is a consistent estimator of $\alpha$, for this special case would be as efficient asymptotically as $\bar\beta$. However, from Eq. (5.32) it is clear that, if $X$ contains lagged values of $y$, the Cramer-Rao lower bound for a consistent estimator of $\beta$ is now $\sigma^2\operatorname{plim}(X^{*\prime}M_FX^*/n)^{-1}$. The GLS estimator $\hat\beta$ still attains the bound, so it is still asymptotically efficient, but it is less efficient than $\bar\beta$, the GLS estimator formed with $\alpha$ known.
5.4.5.3. Classical Tests for the Null Hypothesis $H_0: \alpha = 0$

LAGRANGIAN MULTIPLIER TEST STATISTIC FOR $H_0: \alpha = 0$. Godfrey (1978) showed that the LMT statistic was incapable of distinguishing between autoregressive disturbances and moving-average disturbances. With the score vectors and the information matrices in hand for the two models, we can easily see why this is the case. We form the LMT statistic by using $\partial l/\partial\alpha$ and the information matrix evaluated at $\alpha = 0$. If we do this, then, for both models,

$$\left.\frac{\partial l}{\partial\alpha}\right|_{\alpha=0} = \pm U_p'u/\sigma^2,$$

$$I(\theta)\big|_{\alpha=0} = \operatorname{plim}\frac{1}{n\sigma^2}\begin{bmatrix} X'X & \pm X'U_p & 0 \\ \pm U_p'X & U_p'U_p & 0 \\ 0' & 0' & \dfrac{n}{2\sigma^2} \end{bmatrix},$$

so the LMT statistic must be the same for both models.

THE WALD TEST STATISTIC AND THE LIKELIHOOD RATIO TEST STATISTIC FOR $H_0: \alpha = 0$. As before, let $\tilde\theta = (\tilde\beta'\ \tilde\alpha'\ \tilde\sigma^2)'$ be the MLE of $\theta$ and let $\hat\sigma^2 = y'My/n$. If we had these estimators in hand we could form $M(\tilde\alpha)^{-1}$, $\tilde X^* = M(\tilde\alpha)^{-1}X$, $\tilde y^* = M(\tilde\alpha)^{-1}y$, $\tilde\varepsilon = \tilde y^* - \tilde X^*\tilde\beta$, and thus $\tilde E_p$, $\tilde F$, and $\tilde M^*$. The Wald test statistic would then be

$$T_2' = \tilde\alpha'\tilde F'\tilde F\tilde\alpha/\tilde\sigma^2$$

for the case in which $X$ is exogenous and

$$T_2 = \tilde\alpha'\tilde F'\tilde M^*\tilde F\tilde\alpha/\tilde\sigma^2$$

for the more general case in which $X$ contains lagged values of $y$. The LRT statistic would be

$$T_3 = 2[l(\tilde\theta) - l(\hat\theta)].$$

As with the autoregressive process, the difficulty in obtaining algebraic expressions for these test statistics lies in the fact that the system of equations $\partial l/\partial\theta = 0$ is highly nonlinear in $\theta$. Again, numerical techniques with the aid of computers ensure that we can at least obtain the maximum-likelihood estimates and thus the observed values of these test statistics. However, unlike in the autoregressive process, we cannot get comparative insights into our classical test statistics by using an iterative solution for the MLE $\tilde\theta$, as we have seen in Subsection 5.4.5.1 that no such iterative solution is available.

APPENDIX 5.A. PROBABILITY LIMITS ASSOCIATED WITH THE INFORMATION MATRIX FOR THE AUTOREGRESSIVE DISTURBANCES MODEL
$\operatorname{plim} X'M(\alpha)'\varepsilon/n$: The easiest way of obtaining this probability limit is to consider the transformed equation

$$\tilde y = \tilde X\beta + \varepsilon, \tag{5.A.1}$$

where $\tilde y = M(\alpha)y$ and $\tilde X = M(\alpha)X$. If $\alpha$ is known, then we would obtain a consistent estimator of $\beta$, regardless of whether $X$ contains lagged values of $y$, by applying ordinary least squares to Eq. (5.A.1), to obtain

$$\bar\beta = (\tilde X'\tilde X)^{-1}\tilde X'\tilde y = \beta + (\tilde X'\tilde X/n)^{-1}\tilde X'\varepsilon/n.$$

We assume that $\operatorname{plim} X'M(\alpha)'M(\alpha)X/n$ exists, so $\operatorname{plim} X'M(\alpha)'\varepsilon/n = 0$.

$\operatorname{plim} U_p'\varepsilon/n$: Consider the artificial equation

$$u = -U_p\alpha + \varepsilon \tag{5.A.2}$$

and suppose for the moment that the $u$s are known. Then we could obtain a consistent estimator of $\alpha$ by applying ordinary least squares to Eq. (5.A.2) to obtain

$$\hat\alpha = -(U_p'U_p)^{-1}U_p'u = \alpha - (U_p'U_p/n)^{-1}U_p'\varepsilon/n.$$

Again, we assume that $\operatorname{plim} U_p'U_p/n$ exists, so $\operatorname{plim} U_p'\varepsilon/n = 0$.
p lim Sr'(I p e)/n: This probability limit involves our looking at plimX'S'Eln. Let xk be the k column of X and consider n- j
S E = ExtkEt+i,
= 1 , • • • • P.
t=i
Clearly, plimxk'S'Eln = 0 even if xk refers to lagged values of the dependent variable. p lim M(a)' S(I p u)/n: Consider the jth component of this matrix, namely p lim M(a)' S ju I n. Clearly, if X is exogenous, this p lim is equal to the null vector. Suppose, however, that X contains lagged values of the dependent variable. This p lim may not be equal to the null vector at least for some j. Suppose, for example, that xk = y_i. Then, as M(a)' is upper triangular with ones down the main diagonal, 4,M(a)/Siu would involve Et ut_iut, and so the p lim X' M(a)/Siu I n would not be the null vector. APPENDIX S.B. PROBABILITY LIMITS ASSOCIATED WITH THE INFORMATION MATRIX FOR THE MOVING-AVERAGE DISTURBANCES MODEL
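$\operatorname{plim} X'M(\alpha)^{-1\prime}\varepsilon/n$: We proceed as in Appendix 5.A and consider the transformed equation

$$y^* = X^*\beta + \varepsilon, \tag{5.B.1}$$

where $y^* = M(\alpha)^{-1}y$ and $X^* = M(\alpha)^{-1}X$. If $\alpha$ is known, then we would obtain a consistent estimator of $\beta$, regardless of whether $X$ contains lagged values of $y$, by applying ordinary least squares to Eq. (5.B.1) to obtain

$$\bar\beta = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = \beta + (X^{*\prime}X^*/n)^{-1}X^{*\prime}\varepsilon/n.$$

We assume that $\operatorname{plim} X'M(\alpha)^{-1\prime}M(\alpha)^{-1}X/n$ exists, so $\operatorname{plim} X'M(\alpha)^{-1\prime}\varepsilon/n = 0$.

$\operatorname{plim} E_p'M(\alpha)^{-1\prime}\varepsilon/n$: We write

$$E_p'M(\alpha)^{-1\prime}\varepsilon = (I_p \otimes \varepsilon')S'M(\alpha)^{-1\prime}\varepsilon = [\varepsilon'M(\alpha)^{-1}S_1\varepsilon\ \cdots\ \varepsilon'M(\alpha)^{-1}S_p\varepsilon]'$$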
and consider the typical quadratic form of this vector, say E' Yj M(a)-1' E. Now M(a)-1' is upper triangular and we have seen from Subsection 3.7.4.2 that YM(a)-1' is strictly upper triangular with zeros in the j — 1 diagonals above the main diagonal. It follows that E' Svi M(a)-1'E is of the form Et Etatt+jEt+j, and so p lim E' S M(a)-vEln is zero and the p lim E M(a)-1/E I n is the null vector. p lim M(a)-1/ St' [I p 0 M(a)-1/ E] /n: Clearly,
M (a)- 1ST' p 0 M (01)- V Ell = M (ay [sc m (a)- 1 E
s;,m(ot)-vEl.
Consider the typical submatrix, say X'/If(a)-1'5i M(a)-1'E. As M(a)-1' is upper triangular, we have seen from Subsection 3.7.4.2 that Sq.J M(a)-1' is strictly upper triangular with zeros along the j - 1 diagonals above the main diagonal. Let A = Way liSi' M(a)-1'. Then, as M(a)-1' is upper triangular, A is strictly upper triangular with the same configuration as that of YJ M(a)-1'. Let X = (x1 • • • xK) and consider XI,'AE = Extkatt+JEt+J• t
Clearly, if the independent variables are exogenous, then plimx;c As In = 0. However, this is also true if xk refers to lagged values of the dependent variable. Suppose, for example, that xk = y_1so xrk = Yr-1. As yr_1 depends on Et-1 and the Ets are, by assumption, independent, it follows that p lim xi/AE/ n is still zero. Hence, regardless of whether X contains lagged values of y, plim X' M(a)-1' St' [I p0 M(a)-1' Ell n = 0. p lim (I p0 e'),S' M(a)-1' Sr' [I p0 M(a)-1'E]/n: Clearly, the matrix in this probability limit can be written as [
E/ SW(0-1151111(0-1'E
E' S1M(a)-11 5lc M(a)-1'E •
ETP M(a ) 1 SMarl E
E/ S'P M(a)-115'PM(a)-1'E
Consider the typical quadratic form of this matrix, say E' Sq./ M(0)-115*(0-VE .
We have seen that Sj/ M(a)-1' is strictly upper triangular with zeros along the j - 1 diagonals above the main diagonal. Let B = Y./ M(a)-l iS;M(a)-1' and suppose, without loss of generality, that we assume that j > i. Then B is also strictly upper triangular with the same configuration as that of S'i M(a)-1', so this quadratic form can be written as EEtbrt+JEt+i,
and, as the Et s are assumed to be independent random variables, p lim Et Erbtr+ J Et+iln is zero. It follows then that p lim(I p0 s')S'M(a)-liSe[I p 0 M(a)-1' Ell n = 0.
6
Seemingly Unrelated Regression Equations Models
6.1. INTRODUCTION
On a scale of statistical complexity, the seemingly unrelated regression equations (SURE) model is one step up from the linear-regression model. The essential feature that distinguishes the two models is that in the former model the disturbances are contemporaneously correlated whereas in the latter model the disturbances are assumed independent. In this chapter, we apply classical statistical procedures to three variations of the SURE model: first we look at the standard model; then, as we did with the linear-regression model, we look at two versions of the model in which the disturbances are subject to vector autoregressive processes and vector moving-average processes. In our analysis, we shall find our work on generalized vecs and devecs, covered in Sections 2.4 and 4.7, particularly relevant. The duplication matrix discussed in Section 3.5 will make an appearance, as will elimination matrices (Section 3.4). Following the practice established in the previous chapter, the asymptotic analysis needed in the evaluation of information matrices is given in Appendix 6.A at the end of the chapter, in which appropriate assumptions are made about the existence of certain probability limits. We can obtain the matrix calculus rules used in the differentiation of this chapter by referring to the tables at the end of Chap. 4.

6.2. THE STANDARD SURE MODEL
6.2.1. The Model and the Log-Likelihood Functions

We consider a system of $G$ linear-regression equations, which we write as

$$\begin{aligned} y_1 &= X_1\delta_1 + u_1 \\ &\ \,\vdots \\ y_G &= X_G\delta_G + u_G \end{aligned}$$

or more succinctly as

$$y = X\delta + u,$$

where $y$ is the $nG \times 1$ vector $y = (y_1' \cdots y_G')'$, $X$ is the block diagonal matrix with $X_i$ in the $i$th block diagonal position, $\delta = (\delta_1' \cdots \delta_G')'$, and $u = (u_1' \cdots u_G')'$. We assume that the disturbances have zero expectations and are contemporaneously correlated so

$$E(u_{ti}u_{sj}) = \begin{cases} 0, & t \neq s, \\ \sigma_{ij}, & t = s, \end{cases}$$

and the covariance matrix of $u$ is given by $V(u) = \Sigma \otimes I_n$, where $\Sigma$ is the $G \times G$ matrix whose $(ij)$th element is $\sigma_{ij}$. Finally, we assume that the disturbance vector $u$ has a multivariate normal distribution with mean vector 0 and covariance matrix $V(u) = \Sigma \otimes I_n$. Thus we write our model as

$$y = X\delta + u, \qquad u \sim N(0, \Sigma \otimes I_n).$$

The parameters of the model are given by $\theta = (\delta'\ v')'$, where $v = \operatorname{vech}\Sigma$, and the log-likelihood function, apart from a constant, is

$$l(\theta) = -\frac{n}{2}\log\det\Sigma - \frac{1}{2}u'(\Sigma^{-1} \otimes I_n)u, \tag{6.1}$$

where in this function $u$ is set equal to $y - X\delta$. Alternatively, we can write this function as

$$l(\theta) = -\frac{n}{2}\log\det\Sigma - \frac{1}{2}\operatorname{tr}\Sigma^{-1}U'U, \tag{6.2}$$

where $U$ is the $n \times G$ matrix $(u_1 \cdots u_G)$. Usually $\delta$ contains the parameters of primary interest, and $v = \operatorname{vech}\Sigma$ are the nuisance parameters. This is not always the case, however, as later on in the section we wish to develop the LMT statistic for the null hypothesis $H_0: \sigma_{ij} = 0$, $i \neq j$. Then $v$ would represent the parameters of primary interest and $\delta$ the nuisance parameters. However, treating $\delta$ as the parameter of primary interest for the moment, we obtain the concentrated log-likelihood function $l^*(\delta)$. From the log-likelihood function $l(\theta)$, written as Eq. (6.2), we have

$$\frac{\partial l}{\partial v} = -\frac{n}{2}\frac{\partial\log\det\Sigma}{\partial v} - \frac{1}{2}\frac{\partial\operatorname{tr}\Sigma^{-1}U'U}{\partial v}$$

and we deal with each component of this derivative in turn. First, by the chain
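rule,

$$\frac{\partial\log\det\Sigma}{\partial v} = \frac{\partial\operatorname{vec}\Sigma}{\partial v}\frac{\partial\log\det\Sigma}{\partial\operatorname{vec}\Sigma} = D'\operatorname{vec}\Sigma^{-1},$$

where $D$ is the $G^2 \times (1/2)G(G+1)$ duplication matrix associated with a $G \times G$ symmetric matrix. Next,

$$\frac{\partial\operatorname{tr}\Sigma^{-1}U'U}{\partial v} = \frac{\partial\operatorname{vec}\Sigma}{\partial v}\frac{\partial\operatorname{vec}\Sigma^{-1}}{\partial\operatorname{vec}\Sigma}\frac{\partial\operatorname{tr}\Sigma^{-1}U'U}{\partial\operatorname{vec}\Sigma^{-1}} = -D'(\Sigma^{-1} \otimes \Sigma^{-1})\operatorname{vec}U'U.$$

It follows that

$$\frac{\partial l}{\partial v} = \frac{D'}{2}\left(\operatorname{vec}\Sigma^{-1}U'U\Sigma^{-1} - n\operatorname{vec}\Sigma^{-1}\right). \tag{6.3}$$

Clearly, this derivative is equal to the null vector only if we set

$$\Sigma = \tilde\Sigma = U'U/n.$$

Substituting back for $\Sigma$ in Eq. (6.2), we find that the concentrated log-likelihood function is, apart from a constant,

$$l^*(\delta) = -\frac{n}{2}\log\det\tilde\Sigma, \tag{6.4}$$

with $\tilde\Sigma$ set equal to $U'U/n$.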
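The concentration step can be illustrated numerically. The sketch below evaluates the log-likelihood (6.2) at $\tilde\Sigma = U'U/n$, checks that it equals $-\frac{n}{2}\log\det\tilde\Sigma$ up to the constant $-nG/2$, and confirms that a handful of perturbed positive-definite matrices give a lower value. The function name `loglik`, the simulated `U`, and the form and size of the perturbations are assumptions made for this example.

```python
import numpy as np

def loglik(Sigma, U):
    """Eq. (6.2): -(n/2) log det Sigma - (1/2) tr(Sigma^{-1} U'U), constant omitted."""
    n = U.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * n * logdet - 0.5 * np.trace(np.linalg.solve(Sigma, U.T @ U))

rng = np.random.default_rng(2)
n, G = 40, 3
mix = np.array([[1.0, 0.3, 0.0],
                [0.0, 1.0, 0.5],
                [0.0, 0.0, 1.0]])          # induces contemporaneous correlation
U = rng.standard_normal((n, G)) @ mix      # illustrative disturbance matrix

Sigma_tilde = U.T @ U / n
l_at_max = loglik(Sigma_tilde, U)

# Concentrated form (6.4) plus the omitted constant -nG/2
l_conc = -0.5 * n * np.linalg.slogdet(Sigma_tilde)[1] - 0.5 * n * G

# Perturbed positive-definite alternatives should never do better
perturbed = []
for _ in range(20):
    A = 0.2 * rng.standard_normal((G, G))
    perturbed.append(loglik(Sigma_tilde + A @ A.T, U))
```

The constant $-nG/2$ arises because $\operatorname{tr}\tilde\Sigma^{-1}U'U = nG$ at the maximizer, which is why it can be dropped from (6.4).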
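6.2.2. The Score Vector $\partial l/\partial\theta$ and the Hessian Matrix $\partial^2 l/\partial\theta\,\partial\theta'$

By using the rules of matrix calculus listed in the tables at the end of Chap. 4, we can easily carry out the differentiation required for obtaining the score vector and the Hessian matrix of the log likelihood. The first component of the score vector is

$$\frac{\partial l}{\partial\delta} = -\frac{1}{2}\frac{\partial u}{\partial\delta}\frac{\partial u'(\Sigma^{-1} \otimes I_n)u}{\partial u} = X'(\Sigma^{-1} \otimes I_n)u. \tag{6.5}$$

The second component of the score vector was derived in the previous subsection and is

$$\frac{\partial l}{\partial v} = \frac{D'}{2}\left(\operatorname{vec}\Sigma^{-1}U'U\Sigma^{-1} - n\operatorname{vec}\Sigma^{-1}\right). \tag{6.6}$$

The first component of the Hessian matrix¹ is

$$\frac{\partial^2 l}{\partial\delta\,\partial\delta'} = \frac{\partial}{\partial\delta}\left(\frac{\partial l}{\partial\delta}\right) = \frac{\partial u}{\partial\delta}\frac{\partial X'(\Sigma^{-1} \otimes I_n)u}{\partial u} = -X'(\Sigma^{-1} \otimes I_n)X.$$

¹ We need not take the transpose here as $\partial^2 l/\partial\delta\,\partial\delta'$ is symmetric.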
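The derivative $\partial^2 l/\partial\delta\,\partial v'$ is the transpose of

$$\frac{\partial}{\partial v}\left(\frac{\partial l}{\partial\delta}\right) = \frac{\partial X'\operatorname{vec}U\Sigma^{-1}}{\partial v} = \frac{\partial\operatorname{vec}\Sigma}{\partial v}\frac{\partial\operatorname{vec}\Sigma^{-1}}{\partial\operatorname{vec}\Sigma}\frac{\partial\operatorname{vec}U\Sigma^{-1}}{\partial\operatorname{vec}\Sigma^{-1}}X = -D'(\Sigma^{-1} \otimes \Sigma^{-1}U')X,$$

so

$$\frac{\partial^2 l}{\partial\delta\,\partial v'} = -X'(\Sigma^{-1} \otimes U\Sigma^{-1})D.$$

Finally, from Eq. (6.6),

$$\frac{\partial^2 l}{\partial v\,\partial v'} = \left(\frac{\partial\operatorname{vec}\Sigma^{-1}U'U\Sigma^{-1}}{\partial v} - n\frac{\partial\operatorname{vec}\Sigma^{-1}}{\partial v}\right)\frac{D}{2},$$

with

$$\begin{aligned} \frac{\partial\operatorname{vec}\Sigma^{-1}U'U\Sigma^{-1}}{\partial v} &= \frac{\partial\operatorname{vec}\Sigma}{\partial v}\frac{\partial\operatorname{vec}\Sigma^{-1}}{\partial\operatorname{vec}\Sigma}\frac{\partial\operatorname{vec}\Sigma^{-1}U'U\Sigma^{-1}}{\partial\operatorname{vec}\Sigma^{-1}} \\ &= -D'(\Sigma^{-1} \otimes \Sigma^{-1})[(I_G \otimes U'U\Sigma^{-1})K_{GG} + (I_G \otimes U'U\Sigma^{-1})] \\ &= -2D'(\Sigma^{-1} \otimes \Sigma^{-1})(I_G \otimes U'U\Sigma^{-1})N, \end{aligned}$$

and

$$\frac{\partial\operatorname{vec}\Sigma^{-1}}{\partial v} = \frac{\partial\operatorname{vec}\Sigma}{\partial v}\frac{\partial\operatorname{vec}\Sigma^{-1}}{\partial\operatorname{vec}\Sigma} = -D'(\Sigma^{-1} \otimes \Sigma^{-1}),$$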
where $N = (1/2)(I_{G^2} + K_{GG})$. It follows that

$$\frac{\partial^2 l}{\partial v\,\partial v'} = D'(\Sigma^{-1} \otimes \Sigma^{-1})\left[\frac{n}{2}I_{G^2} - (I_G \otimes U'U\Sigma^{-1})\right]D,$$

as $ND = D$.

6.2.3. The Information Matrix $I(\theta) = -\operatorname{plim}\frac{1}{n}\,\partial^2 l/\partial\theta\,\partial\theta'$ and the Cramer-Rao Lower Bound $I^{-1}(\theta)$

Under our assumptions, $\operatorname{plim} U'U/n = \Sigma$ and $\operatorname{plim} X'(I_G \otimes U)/n = 0$ regardless of whether $X$ contains lagged values of the dependent variables. The information matrix is then

$$I(\theta) = \begin{bmatrix} \operatorname{plim} X'(\Sigma^{-1} \otimes I_n)X/n & 0 \\ 0 & \frac{1}{2}D'(\Sigma^{-1} \otimes \Sigma^{-1})D \end{bmatrix} \tag{6.7}$$

and from the properties of the duplication matrix given in Section 3.5 the Cramer-Rao lower bound is

$$I^{-1}(\theta) = \begin{bmatrix} [\operatorname{plim} X'(\Sigma^{-1} \otimes I_n)X/n]^{-1} & 0 \\ 0 & 2LN(\Sigma \otimes \Sigma)NL' \end{bmatrix},$$

where $L$ is the $(1/2)G(G+1) \times G^2$ elimination matrix.

6.2.4. Statistical Inference from the Score Vector and the Information Matrix

6.2.4.1. Maximum-Likelihood Estimators as Iterative Joint-Generalized-Least-Squares Estimators

Let $\hat\delta$ be a consistent estimator of $\delta$ and suppose that $\sqrt{n}(\hat\delta - \delta) \xrightarrow{d} N(0, V)$. Then, in order for $\hat\delta$ to be a best asymptotically normally distributed estimator (which is shortened to BAN estimator), $V$ must equal the Cramer-Rao lower bound $[\operatorname{plim} X'(\Sigma^{-1} \otimes I_n)X/n]^{-1}$. One such estimator that does this is the joint-generalized-least-squares (JGLS) estimator

$$\hat\delta = [X'(\hat\Sigma^{-1} \otimes I_n)X]^{-1}X'(\hat\Sigma^{-1} \otimes I_n)y,$$
(6.8)

where $\hat\Sigma = \hat U'\hat U/n$ and $\hat U$ is formed from the OLS residual vectors $\hat u_j = [I_n - X_j(X_j'X_j)^{-1}X_j']y_j$. The MLE has an iterative JGLS interpretation, as when we equate the score vector to the null vector we get

$$\frac{\partial l}{\partial\delta} = X'(\Sigma^{-1} \otimes I_n)(y - X\delta) = 0,$$

which implies that

$$\tilde\delta = [X'(\tilde\Sigma^{-1} \otimes I_n)X]^{-1}X'(\tilde\Sigma^{-1} \otimes I_n)y,$$

and

$$\frac{\partial l}{\partial v} = \frac{D'}{2}\left(\operatorname{vec}\Sigma^{-1}U'U\Sigma^{-1} - n\operatorname{vec}\Sigma^{-1}\right) = 0,$$

which implies that

$$\tilde\Sigma = \tilde U'\tilde U/n. \tag{6.9}$$

These solutions are clearly iterative rather than explicit, as $\tilde\delta$ still depends on $\tilde\Sigma$ and $\tilde\Sigma$ depends on $\tilde\delta$ through $\tilde U$.

6.2.4.2. The Lagrangian Multiplier Test Statistic for $H_0: \sigma_{ij} = 0$, $i \neq j$

As we mentioned at the start of this chapter, what distinguishes the SURE model from the linear-regression model is the assumption that the disturbances are contemporaneously correlated. It is this assumption that induces us to regard the equations as one large system $y = X\delta + u$ and to estimate $\delta$ by using the JGLS estimator given by Eq. (6.8). If, however, the disturbances are not contemporaneously correlated there is no point in doing this. Instead, efficient estimation would merely require the application of ordinary least squares to each equation. We would then like to develop a classical test statistic for the null hypothesis

$$H_0: \sigma_{ij} = 0,\ i \neq j \quad\text{against}\quad H_A: \sigma_{ij} \neq 0,$$

or, in vector notation,

$$H_0: \bar v = 0 \quad\text{against}\quad H_A: \bar v \neq 0,$$
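where $\bar v = \bar v(\Sigma)$ is defined in Subsection 2.3.4. The most amenable test statistic for this case is the LMT statistic, which would be

$$T = \frac{1}{n}\left.\frac{\partial l}{\partial\bar v}\right|_{\hat\theta}' I^{\bar v\bar v}(\hat\theta)\left.\frac{\partial l}{\partial\bar v}\right|_{\hat\theta},$$

where $I^{\bar v\bar v}$ refers to that part of the Cramer-Rao lower bound corresponding to $\bar v$ and in forming $\hat\theta$ we set $\bar v$ equal to the null vector and evaluate all other parameters at the constrained MLEs, i.e., at the OLS estimators. In forming $T$ the first task is to obtain $\partial l/\partial\bar v$ and $I^{\bar v\bar v}$ under $H_0$ from $\partial l/\partial v$ and $I^{vv}$, which we already have in hand. We do this by means of an appropriate selection matrix $S$, which has the property

$$\bar v = Sv.$$

However, $v = L\operatorname{vec}\Sigma$ and $\bar v = \bar L\operatorname{vec}\Sigma$, where $L$ and $\bar L$ are $(1/2)G(G+1) \times G^2$ and $(1/2)G(G-1) \times G^2$ elimination matrices, defined in Section 3.4, so it must be true that $SL = \bar L$. Now consider $\partial l/\partial\bar v$. As the matrix $S$ selects from the vector $v$ the elements that belong to $\bar v$, it follows that

$$\frac{\partial l}{\partial\bar v} = S\frac{\partial l}{\partial v} = S\frac{D'D}{2}\left[v(\Sigma^{-1}U'U\Sigma^{-1}) - n\,v(\Sigma^{-1})\right],$$

where we use the property that for a symmetric matrix $A$, $Dv(A) = \operatorname{vec}A$. Moreover, from the properties of zero-one matrices given in Section 3.6,

$$SD'D = 2S - SLK_{GG}L' = 2S - \bar LK_{GG}L' = 2S.$$

Also, if $H_0$ is true, $Sv(\Sigma^{-1}) = \bar v(\Sigma^{-1}) = 0$, so

$$\frac{\partial l}{\partial\bar v} = Sv(A) = \bar v(A),$$

where $A = \Sigma^{-1}U'U\Sigma^{-1}$, which itself under $H_0$ is

$$A = \begin{bmatrix} \rho_{11} & \cdots & \rho_{1G} \\ \vdots & & \vdots \\ \rho_{G1} & \cdots & \rho_{GG} \end{bmatrix}$$

with $\rho_{ij} = u_i'u_j/\sigma_{ii}\sigma_{jj}$. Finally then, under $H_0$,

$$\frac{\partial l}{\partial\bar v} = (\rho_{21}\ \cdots\ \rho_{G1}\ \ \rho_{32}\ \cdots\ \rho_{G2}\ \ \cdots\ \ \rho_{GG-1})'.$$

Next we evaluate $I^{\bar v\bar v}$ under $H_0$. As $\bar v = Sv$, it follows that $I^{\bar v\bar v} = SI^{vv}S'$, where

$$I^{vv} = 2LN(\Sigma \otimes \Sigma)NL',$$

and under $H_0$, $\Sigma$ is diagonal, so from Theorem 3.16, $I^{vv}$ will then be the diagonal matrix given by

$$I^{vv} = \operatorname{diag}\left(2\sigma_{11}^2,\ \sigma_{11}\sigma_{22},\ \ldots,\ \sigma_{11}\sigma_{GG},\ 2\sigma_{22}^2,\ \sigma_{22}\sigma_{33},\ \ldots,\ \sigma_{22}\sigma_{GG},\ \ldots,\ 2\sigma_{GG}^2\right).$$

Selecting the appropriate elements from this matrix, we have that, under $H_0$,

$$I^{\bar v\bar v} = \operatorname{diag}\left(\sigma_{11}\sigma_{22},\ \ldots,\ \sigma_{11}\sigma_{GG},\ \sigma_{22}\sigma_{33},\ \ldots,\ \sigma_{22}\sigma_{GG},\ \ldots,\ \sigma_{G-1\,G-1}\sigma_{GG}\right).$$

Marrying the two components of the LMT statistic together, we obtain

$$T = n\sum_{i=2}^{G}\sum_{j=1}^{i-1} r_{ij}^2,$$

where $r_{ij}^2 = \hat\sigma_{ij}^2/\hat\sigma_{ii}\hat\sigma_{jj}$, $\hat\sigma_{ij} = \hat u_i'\hat u_j/n$, and $\hat u_i$ is the OLS residual vector obtained from the $i$th equation. This test statistic was first obtained by Breusch and Pagan (1980). Under $H_0$, $T$ tends in distribution to a $\chi^2$ random variable with $(1/2)G(G-1)$ degrees of freedom, so the upper tail of this distribution is used to find the appropriate critical region.

6.3. THE SURE MODEL WITH VECTOR AUTOREGRESSIVE DISTURBANCES

6.3.1. The Model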
As in the preceding section, we consider a system of $G$ linear-regression equations
$$y_1 = X_1\delta_1 + u_1,\ \ \ldots,\ \ y_G = X_G\delta_G + u_G,$$
or a one-equation system
$$y = X\delta + u.$$
We now assume that the disturbances are subject to a vector autoregressive system of order $p$. Let $u_t$ be the $G \times 1$ vector containing the $t$th values of the $G$ disturbances. Then we have
$$u_t + R_1u_{t-1} + \cdots + R_pu_{t-p} = \varepsilon_t, \quad t = 1, \ldots, n, \qquad (6.10)$$
where each matrix $R_j$ is a $G \times G$ matrix of unknown parameters and the $\varepsilon_t$ are assumed to be independently, identically normally distributed random vectors with mean 0 and a positive-definite covariance matrix $\Sigma$. We assume that there are no unit root problems. Let $U$ and $E$ be the $n \times G$ matrices $U = (u_1 \cdots u_G)$ and $E = (\varepsilon_1 \cdots \varepsilon_G)$, so under this notation $\varepsilon_t'$ is the $t$th row of $E$, and let $U_{-l}$ denote the matrix $U$ but with values that are lagged $l$ periods. Then we can write the disturbances system (6.10) as
$$U + U_{-1}R_1' + \cdots + U_{-p}R_p' = E,$$
or
$$U + U_pR' = E, \qquad (6.11)$$
where $R$ is the $G \times Gp$ matrix $R = (R_1 \cdots R_p)$ and $U_p$ is the $n \times Gp$ matrix $U_p = (U_{-1} \cdots U_{-p})$. In the application of asymptotic theory, presample values are replaced with zeros without affecting our results. Suppose we do this at the start of our analysis. Then we can write
$$U_{-j} = S_jU, \quad j = 1, \ldots, p,$$
where $S_j$ is the appropriate $n \times n$ shifting matrix and $U_p = S(I_p \otimes U)$, where $S$ is the $n \times np$ matrix given by $S = (S_1 \cdots S_p)$. Taking the vec of both sides of Eq. (6.11), we have
$$u + (R \otimes I_n)\operatorname{vec}U_p = \varepsilon,$$
where $u = \operatorname{vec}U$, $\varepsilon = \operatorname{vec}E$. However,
$$\operatorname{vec}U_p = \begin{pmatrix}\operatorname{vec}S_1U\\ \vdots\\ \operatorname{vec}S_pU\end{pmatrix} = \begin{pmatrix}I_G\otimes S_1\\ \vdots\\ I_G\otimes S_p\end{pmatrix}u = Cu,$$
where $C$ is a $Gpn \times nG$ matrix given by
$$C = \begin{pmatrix}I_G\otimes S_1\\ \vdots\\ I_G\otimes S_p\end{pmatrix},$$
so we can write our disturbances system as
$$M(r)u = \varepsilon,$$
where $M(r) = I_{Gn} + N(r)$, and $N(r)$ is the $nG \times nG$ matrix given by
$$N(r) = (R \otimes I_n)C.$$
Therefore, after this mathematical maneuvering, we can write our model as
$$y = X\delta + u, \quad M(r)u = \varepsilon, \quad \varepsilon \sim N(0, \Sigma\otimes I_n).$$
6.3.2. Properties of the Matrices N(r) and M(r)

The matrices $N(r)$ and $M(r)$ play a crucial role in the statistical analysis of our model that follows. As such, it pays us to consider some of the properties of these matrices. Clearly,
$$N(r) = (R_1\otimes S_1) + \cdots + (R_p\otimes S_p).$$
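The construction of $C$ and $N(r)$ can be checked numerically. The following is a minimal sketch, assuming NumPy and small illustrative dimensions; the helper names (`shift_matrix`, `build_C`, `build_N`) are hypothetical and not part of the text. It builds $N(r) = (R\otimes I_n)C$, compares it with $\sum_j R_j\otimes S_j$, and confirms that $M(r)u$ reproduces $\operatorname{vec}(U + U_pR')$ from Eq. (6.11).

```python
import numpy as np

def shift_matrix(n, j):
    """n x n shifting matrix S_j: ones on the j-th subdiagonal, zeros elsewhere."""
    return np.eye(n, k=-j)

def build_C(n, G, p):
    """Stack I_G (x) S_j, j = 1..p, into the (Gpn x nG) matrix C."""
    return np.vstack([np.kron(np.eye(G), shift_matrix(n, j)) for j in range(1, p + 1)])

def build_N(R_blocks, n):
    """N(r) = (R (x) I_n) C with R = (R_1 ... R_p)."""
    G, p = R_blocks[0].shape[0], len(R_blocks)
    R = np.hstack(R_blocks)                      # G x Gp
    return np.kron(R, np.eye(n)) @ build_C(n, G, p)

rng = np.random.default_rng(0)
n, G, p = 6, 2, 2                                # illustrative sizes only
R_blocks = [rng.normal(size=(G, G)) for _ in range(p)]

N = build_N(R_blocks, n)
N_sum = sum(np.kron(Rj, shift_matrix(n, j)) for j, Rj in enumerate(R_blocks, start=1))

# Cross-check against the matrix form U + U_p R' = E of Eq. (6.11).
U = rng.normal(size=(n, G))
U_p = np.hstack([shift_matrix(n, j) @ U for j in range(1, p + 1)])   # n x Gp
E = U + U_p @ np.hstack(R_blocks).T
eps = (np.eye(n * G) + N) @ U.reshape(-1, order="F")                 # M(r) u

N11 = N[:n, :n]                                  # the (1, 1) block of N(r)
```

Column-major reshaping (`order="F"`) plays the role of the vec operator here, so `eps` should agree with the vec of `E`.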
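Let $R_l = \{r_{ij}^l\}$ for $l = 1, \ldots, p$. Then
$$N(r) = \begin{pmatrix} r_{11}^1S_1 & \cdots & r_{1G}^1S_1\\ \vdots & & \vdots\\ r_{G1}^1S_1 & \cdots & r_{GG}^1S_1\end{pmatrix} + \cdots + \begin{pmatrix} r_{11}^pS_p & \cdots & r_{1G}^pS_p\\ \vdots & & \vdots\\ r_{G1}^pS_p & \cdots & r_{GG}^pS_p\end{pmatrix}.$$
Consider the $n \times n$ submatrix of $N(r)$ in the (1, 1) position as typical. Letting $N_{11}$ denote this matrix, we have
$$N_{11} = r_{11}^1S_1 + r_{11}^2S_2 + \cdots + r_{11}^pS_p = \begin{pmatrix} 0 & & & & & \\ r_{11}^1 & 0 & & & & \\ \vdots & \ddots & \ddots & & & \\ r_{11}^p & & \ddots & \ddots & & \\ & \ddots & & \ddots & \ddots & \\ 0 & & r_{11}^p & \cdots & r_{11}^1 & 0\end{pmatrix},$$
which is clearly a Toeplitz matrix that is strictly lower triangular and a band matrix. Therefore, if we write
$$N(r) = \begin{pmatrix} N_{11} & \cdots & N_{1G}\\ \vdots & & \vdots\\ N_{G1} & \cdots & N_{GG}\end{pmatrix},$$
then each submatrix $N_{ij}$ is $n \times n$, Toeplitz, strictly lower triangular, and band. Now if we write
$$M(r) = I_{Gn} + N(r) = \begin{pmatrix} M_{11} & \cdots & M_{1G}\\ \vdots & & \vdots\\ M_{G1} & \cdots & M_{GG}\end{pmatrix},$$
then it follows that each $M_{ii}$, $i = 1, \ldots, G$, is $n \times n$, Toeplitz, lower triangular, and band with ones along its main diagonal whereas each $M_{ij}$, $i \neq j$, is $n \times n$, Toeplitz, strictly lower triangular, and band.

6.3.3. The Matrix J and Derivatives $\partial\operatorname{vec}N(r)/\partial r$, $\partial\varepsilon/\partial r$

Important derivatives for our work are $\partial\operatorname{vec}N(r)/\partial r$ and $\partial\varepsilon/\partial r$. These derivatives bring generalized vecs and devecs into the analysis and are derived in this section. For notational convenience we use superscripts $\tau$ and $\bar\tau$ to denote the generalized vec$_G$ and devec$_G$ operators, respectively.

THE MATRIX J AND THE DERIVATIVE $\partial\operatorname{vec}N(r)/\partial r$. Now $N(r) = (R\otimes I_n)C$, so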
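$$\operatorname{vec}N(r) = (C'\otimes I_{nG})\operatorname{vec}(R\otimes I_n).$$
By Eq. (3.6) $\operatorname{vec}(R\otimes I_n) = (I_{pG}\otimes K_{Gn}^{\tau})r$, where $r = \operatorname{vec}R$, and we can write
$$\operatorname{vec}N(r) = Jr, \qquad (6.12)$$
where $J$ is an $n^2G^2 \times pG^2$ matrix given by
$$J = (C'\otimes I_{nG})(I_{pG}\otimes K_{Gn}^{\tau}). \qquad (6.13)$$
Clearly then
$$\frac{\partial\operatorname{vec}N(r)}{\partial r} = J',$$
and, as $(K_{Gn}^{\tau})' = (K_{Gn}')^{\bar\tau} = K_{nG}^{\bar\tau}$,
$$J' = (I_{pG}\otimes K_{nG}^{\bar\tau})(C\otimes I_{nG}). \qquad (6.14)$$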
PROPERTIES OF THE MATRIX J. The matrix $J$ is used extensively throughout the rest of this chapter. As such, we need to know some of its properties. These properties can be derived from the theorems concerning $K_{nG}^{\bar\tau}$ in Subsection 3.3.3 and are given in the following propositions, where $u$, $U_p$, and $C$ are defined in Section 6.3.1.

Proposition 6.1. $J'(u\otimes I_{nG}) = K_{pG,G}(I_G\otimes U_p')$.

Proof of Proposition 6.1. As $\operatorname{vec}U_p = Cu$, we can write
$$J'(u\otimes I_{nG}) = (I_{pG}\otimes K_{nG}^{\bar\tau})(\operatorname{vec}U_p\otimes I_{nG}). \qquad (6.15)$$
However, applying Theorem 3.9, we can write the right-hand side of Eq. (6.15) as $K_{pG,G}(I_G\otimes U_p')$. □
Proposition 6.2. $J'K_{nG,nG} = K_{pG,G}(I_G\otimes K_{n,pG}^{\bar\tau_{pG}})(I_{nG}\otimes C)$.

Proof of Proposition 6.2. Using the properties of the commutation matrix, we can write
$$J'K_{nG,nG} = (I_{pG}\otimes K_{nG}^{\bar\tau})K_{pGn,nG}(I_{nG}\otimes C).$$
We find that the result follows by applying Theorem 3.12. □

Proposition 6.3. $J'(I_{nG}\otimes u) = (I_{pG}\otimes U')C$.

Proof of Proposition 6.3. From the definition of $J$,
$$J'(I_{nG}\otimes u) = (I_{pG}\otimes K_{nG}^{\bar\tau})(C\otimes u) = (I_{pG}\otimes K_{nG}^{\bar\tau})(I_{pGn}\otimes u)C.$$
Now,
$$(I_{pG}\otimes K_{nG}^{\bar\tau})(I_{pGn}\otimes u) = I_{pG}\otimes K_{nG}^{\bar\tau}(I_n\otimes u) = I_{pG}\otimes U',$$
by Theorem 3.10. □
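Identities of this kind are easy to check numerically for small dimensions. The sketch below assumes NumPy and the convention $K_{mn}\operatorname{vec}A = \operatorname{vec}A'$ for an $m \times n$ matrix $A$; the helper names (`commutation`, `gen_vec`) are hypothetical. It builds $J$ from the commutation matrix and its generalized vec, then tests $\operatorname{vec}N(r) = Jr$ together with Propositions 6.1 and 6.3 on random inputs.

```python
import numpy as np

def vec(A):
    """Column-stacking vec operator."""
    return A.reshape(-1, order="F")

def commutation(m, n):
    """K_{mn}, defined by K_{mn} vec(A) = vec(A') for every m x n matrix A."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # A[i, j] sits at position j*m + i of vec(A) and i*n + j of vec(A').
            K[i * n + j, j * m + i] = 1.0
    return K

def gen_vec(A, m):
    """Generalized vec_m: stack successive blocks of m columns under one another."""
    return np.vstack([A[:, k:k + m] for k in range(0, A.shape[1], m)])

rng = np.random.default_rng(1)
n, G, p = 4, 2, 2                                  # illustrative sizes only
S = [np.eye(n, k=-j) for j in range(1, p + 1)]     # shifting matrices S_1..S_p
C = np.vstack([np.kron(np.eye(G), Sj) for Sj in S])
R = rng.normal(size=(G, p * G))
U = rng.normal(size=(n, G))
u, r = vec(U), vec(R)
U_p = np.hstack([Sj @ U for Sj in S])

# J = (C' (x) I_nG)(I_pG (x) K^tau), with K^tau the generalized vec_G of K_{Gn}.
K_tau = gen_vec(commutation(G, n), G)
J = np.kron(C.T, np.eye(n * G)) @ np.kron(np.eye(p * G), K_tau)
N = np.kron(R, np.eye(n)) @ C

lhs_61 = J.T @ np.kron(u[:, None], np.eye(n * G))          # J'(u (x) I_nG)
rhs_61 = commutation(p * G, G) @ np.kron(np.eye(G), U_p.T)
lhs_63 = J.T @ np.kron(np.eye(n * G), u[:, None])          # J'(I_nG (x) u)
rhs_63 = np.kron(np.eye(p * G), U.T) @ C
```

A numerical match on random draws is only a sanity check, not a proof; the proofs rest on the theorems of Subsection 3.3.3 cited above.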
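THE DERIVATIVE $\partial\varepsilon/\partial r$. Next, as $\varepsilon = M(r)u = (u'\otimes I_{nG})\operatorname{vec}M(r)$, we have, by using the backward chain rule of matrix calculus,
$$\frac{\partial\varepsilon}{\partial r} = \frac{\partial\operatorname{vec}N(r)}{\partial r}(u\otimes I_{nG}) = J'(u\otimes I_{nG}).$$
However, applying Proposition 6.1, we have
$$\frac{\partial\varepsilon}{\partial r} = J'(u\otimes I_{nG}) = K_{pG,G}(I_G\otimes U_p'). \qquad (6.16)$$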
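6.3.4. The Parameters of the Model, the Log-Likelihood Function, and the Score Vector

The parameters of the model are given by $\theta = (\delta'\ r'\ v')'$, where $v = \operatorname{vech}\Sigma$, and the log-likelihood function, apart from a constant, is
$$l(\theta) = -\frac{n}{2}\log\det\Sigma - \frac{1}{2}\varepsilon'(\Sigma^{-1}\otimes I_n)\varepsilon, \qquad (6.17)$$
where in this function we set $\varepsilon$ equal to $M(r)(y - X\delta)$. We can obtain the first and the third components of the score vector by adapting Eqs. (6.5) and (6.6) of Subsection 6.2.2. They are
$$\frac{\partial l}{\partial\delta} = X^{d\prime}(\Sigma^{-1}\otimes I_n)\varepsilon, \qquad (6.18)$$
$$\frac{\partial l}{\partial v} = \frac{D'}{2}\left(\operatorname{vec}\Sigma^{-1}E'E\Sigma^{-1} - n\operatorname{vec}\Sigma^{-1}\right), \qquad (6.19)$$
where $X^d = M(r)X$ and $D$ is the $G^2 \times (1/2)G(G+1)$ duplication matrix. Using the derivative given by Eq. (6.16), we easily obtain the second component of the score vector as follows:
$$\frac{\partial l}{\partial r} = -\frac{1}{2}\frac{\partial\varepsilon}{\partial r}\frac{\partial\,\varepsilon'(\Sigma^{-1}\otimes I_n)\varepsilon}{\partial\varepsilon} = -K_{pG,G}(\Sigma^{-1}\otimes U_p')\varepsilon. \qquad (6.20)$$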
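The score with respect to $r$ can be compared with a finite-difference gradient of the log likelihood. The sketch below assumes NumPy, treats $U$ and $\Sigma$ as given random inputs, and uses hypothetical helper names. It relies on $\varepsilon'(\Sigma^{-1}\otimes I_n)\varepsilon = \operatorname{tr}(\Sigma^{-1}E'E)$ with $E = U + U_pR'$, and it also checks the matrix form $-\Sigma^{-1}E'U_p$, whose vec is the right-hand side of Eq. (6.20).

```python
import numpy as np

def vec(A):
    return A.reshape(-1, order="F")

def commutation(m, n):
    """K_{mn} with K_{mn} vec(A) = vec(A') for m x n A."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[i * n + j, j * m + i] = 1.0
    return K

def lagged_block(U, p):
    """U_p = (U_{-1} ... U_{-p}) with zero presample values."""
    n = U.shape[0]
    return np.hstack([np.eye(n, k=-j) @ U for j in range(1, p + 1)])

def loglik(R, Sigma, U, U_p):
    """Eq. (6.17) up to a constant, with E = U + U_p R'."""
    n = U.shape[0]
    E = U + U_p @ R.T
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * n * logdet - 0.5 * np.trace(np.linalg.solve(Sigma, E.T @ E))

def score_r(R, Sigma, U, U_p):
    """Eq. (6.20): -K_{pG,G} (Sigma^{-1} (x) U_p') eps."""
    G, pG = R.shape
    E = U + U_p @ R.T
    return -commutation(pG, G) @ np.kron(np.linalg.inv(Sigma), U_p.T) @ vec(E)

def numerical_score(R, Sigma, U, U_p, h=1e-5):
    """Central differences of loglik with respect to r = vec R."""
    r = vec(R)
    grad = np.zeros_like(r)
    for k in range(r.size):
        step = np.zeros_like(r)
        step[k] = h
        up = loglik((r + step).reshape(R.shape, order="F"), Sigma, U, U_p)
        down = loglik((r - step).reshape(R.shape, order="F"), Sigma, U, U_p)
        grad[k] = (up - down) / (2 * h)
    return grad

rng = np.random.default_rng(2)
n, G, p = 30, 2, 2                                 # illustrative sizes only
A = rng.normal(size=(G, G))
Sigma = A @ A.T + G * np.eye(G)                    # an arbitrary positive-definite matrix
R = 0.3 * rng.normal(size=(G, p * G))
U = rng.normal(size=(n, G))
U_p = lagged_block(U, p)

analytic = score_r(R, Sigma, U, U_p)
numeric = numerical_score(R, Sigma, U, U_p)
matrix_form = vec(-np.linalg.solve(Sigma, (U + U_p @ R.T).T @ U_p))
```

Because the log likelihood is quadratic in $R$ for fixed $\Sigma$ and $U$, central differences should agree with the analytic score up to rounding error.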
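6.3.5. The Hessian Matrix $\partial^2 l/\partial\theta\partial\theta'$

We can also obtain several components of the Hessian matrix by adapting the derivatives of Section 6.2.2. They are listed here for convenience:
$$\frac{\partial^2 l}{\partial\delta\partial\delta'} = -X^{d\prime}(\Sigma^{-1}\otimes I_n)X^d, \qquad (6.21)$$
$$\frac{\partial^2 l}{\partial\delta\partial v'} = -X^{d\prime}(\Sigma^{-1}\otimes E\Sigma^{-1})D, \qquad (6.22)$$
$$\frac{\partial^2 l}{\partial v\partial v'} = D'(\Sigma^{-1}\otimes\Sigma^{-1})\left[\frac{nI_{G^2}}{2} - (I_G\otimes E'E\Sigma^{-1})\right]D. \qquad (6.23)$$
The derivatives involving $r$ are obtained each in turn.

$\partial^2 l/\partial\delta\partial r'$: We derive this derivative from $\partial l/\partial r$, which, from Eq. (6.20), we write as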
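$$\frac{\partial l}{\partial r} = -K_{pG,G}(\Sigma^{-1}\otimes I_{pG})\operatorname{vec}U_p'E.$$
Using a product rule of matrix calculus, we have
$$\frac{\partial\operatorname{vec}U_p'E}{\partial\delta} = \frac{\partial\operatorname{vec}U_p'}{\partial\delta}(E\otimes I_{pG}) + \frac{\partial\varepsilon}{\partial\delta}(I_G\otimes U_p).$$
However, $\operatorname{vec}U_p' = K_{n,pG}\operatorname{vec}U_p = K_{n,pG}C(y - X\delta)$, so $\partial\operatorname{vec}U_p'/\partial\delta = -X'C'K_{pG,n}$, and
$$\frac{\partial\operatorname{vec}U_p'E}{\partial\delta} = -X'C'K_{pG,n}(E\otimes I_{pG}) - X^{d\prime}(I_G\otimes U_p).$$
Our derivative follows directly and is given by
$$\frac{\partial^2 l}{\partial\delta\partial r'} = X'C'K_{pG,n}(E\Sigma^{-1}\otimes I_{pG})K_{G,pG} + X^{d\prime}(\Sigma^{-1}\otimes U_p)K_{G,pG} = X'C'(I_{pG}\otimes E\Sigma^{-1}) + X^{d\prime}(\Sigma^{-1}\otimes U_p)K_{G,pG}. \qquad (6.24)$$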
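$\partial^2 l/\partial r\partial v'$: Again, we derive this derivative from $\partial l/\partial r$, which we now write as
$$\frac{\partial l}{\partial r} = -K_{pG,G}(I_G\otimes U_p'E)\operatorname{vec}\Sigma^{-1}.$$
As $\partial\operatorname{vec}\Sigma^{-1}/\partial v = -D'(\Sigma^{-1}\otimes\Sigma^{-1})$, it follows that our derivative is given by
$$\frac{\partial^2 l}{\partial r\partial v'} = K_{pG,G}(\Sigma^{-1}\otimes U_p'E\Sigma^{-1})D.$$
$\partial^2 l/\partial r\partial r'$: From Eqs. (6.20) and (6.16), we have
$$\frac{\partial^2 l}{\partial r\partial r'} = -K_{pG,G}(\Sigma^{-1}\otimes U_p'U_p)K_{G,pG} = -(U_p'U_p\otimes\Sigma^{-1}). \qquad (6.25)$$

6.3.6. The Information Matrix $I(\theta) = -\operatorname{plim}\frac{1}{n}\,\partial^2 l/\partial\theta\partial\theta'$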
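Clearly, under appropriate assumptions, $\operatorname{plim}E'E/n = \Sigma$, $\operatorname{plim}U_p'E/n = 0$, and $\operatorname{plim}X'C'(I_{pG}\otimes E)/n = 0$, so our information matrix can be written as
$$I(\theta) = \operatorname{plim}\frac{1}{n}\begin{pmatrix} X^{d\prime}(\Sigma^{-1}\otimes I_n)X^d & -X^{d\prime}(\Sigma^{-1}\otimes U_p)K_{G,pG} & 0\\ -K_{pG,G}(\Sigma^{-1}\otimes U_p')X^d & U_p'U_p\otimes\Sigma^{-1} & 0\\ 0 & 0 & \frac{n}{2}D'(\Sigma^{-1}\otimes\Sigma^{-1})D\end{pmatrix}. \qquad (6.26)$$
If the matrix $X$ does not contain lagged values of dependent variables, then $\operatorname{plim}X^{d\prime}(I_G\otimes U_p)/n = 0$, and for this special case the information matrix simplifies to
$$I(\theta) = \operatorname{plim}\frac{1}{n}\begin{pmatrix} X^{d\prime}(\Sigma^{-1}\otimes I_n)X^d & 0 & 0\\ 0 & U_p'U_p\otimes\Sigma^{-1} & 0\\ 0 & 0 & \frac{n}{2}D'(\Sigma^{-1}\otimes\Sigma^{-1})D\end{pmatrix}. \qquad (6.27)$$

6.3.7. The Cramer-Rao Lower Bound $I^{-1}(\theta)$

Inverting the information matrix is straightforward. For the general case let
$$I^{-1}(\theta) = \begin{pmatrix} I^{\delta\delta} & I^{\delta r} & I^{\delta v}\\ I^{r\delta} & I^{rr} & I^{rv}\\ I^{v\delta} & I^{vr} & I^{vv}\end{pmatrix}.$$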
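Then
$$I^{\delta\delta} = \left[\operatorname{plim}X^{d\prime}(\Sigma^{-1}\otimes M_p)X^d/n\right]^{-1}, \qquad (6.28)$$
$$I^{\delta r} = (I^{r\delta})' = I^{\delta\delta}\operatorname{plim}X^{d\prime}[I_G\otimes U_p(U_p'U_p)^{-1}]K_{G,pG}, \qquad (6.29)$$
$$I^{\delta v} = (I^{v\delta})' = 0, \qquad (6.30)$$
$$I^{rr} = K_{pG,G}\operatorname{plim}\left\{\Sigma\otimes(U_p'U_p/n)^{-1} + [I_G\otimes(U_p'U_p)^{-1}U_p']X^dI^{\delta\delta}X^{d\prime}[I_G\otimes U_p(U_p'U_p)^{-1}]\right\}K_{G,pG} \qquad (6.31)$$
$$\phantom{I^{rr}} = \operatorname{plim}n\left\{U_p'U_p\otimes\Sigma^{-1} - K_{pG,G}(\Sigma^{-1}\otimes U_p')X^d[X^{d\prime}(\Sigma^{-1}\otimes I_n)X^d]^{-1}X^{d\prime}(\Sigma^{-1}\otimes U_p)K_{G,pG}\right\}^{-1},$$
$$I^{rv} = (I^{vr})' = 0, \qquad (6.32)$$
$$I^{vv} = 2LN(\Sigma\otimes\Sigma)NL', \qquad (6.33)$$
where $M_p = I_n - U_p(U_p'U_p)^{-1}U_p'$, $N = (1/2)(I_{G^2} + K_{GG})$, and $L$ is the $(1/2)G(G+1) \times G^2$ elimination matrix. For the special case in which $X$ is exogenous and contains no lagged dependent variables,
$$I^{-1}(\theta) = \operatorname{plim}n\begin{pmatrix}[X^{d\prime}(\Sigma^{-1}\otimes I_n)X^d]^{-1} & 0 & 0\\ 0 & (U_p'U_p)^{-1}\otimes\Sigma & 0\\ 0 & 0 & 2LN(\Sigma\otimes\Sigma)NL'/n\end{pmatrix}. \qquad (6.34)$$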
6.3.8. Statistical Inference from the Score Vector and the Information Matrix

6.3.8.1. Efficient Estimation of $\delta$

1. CASE IN WHICH R IS KNOWN. Consider the equation
$$y^d = X^d\delta + \varepsilon, \qquad (6.35)$$
where $y^d = M(r)y$ and $X^d = M(r)X$. Clearly this equation satisfies the assumptions of the SURE model without vector autoregressive disturbances. With $R$ known, we can form $y^d$ and $X^d$ and an asymptotically efficient estimator of $\delta$ would be the JGLS estimator applied to Eq. (6.35); that is,
$$\tilde\delta = [X^{d\prime}(\tilde\Sigma^{-1}\otimes I_n)X^d]^{-1}X^{d\prime}(\tilde\Sigma^{-1}\otimes I_n)y^d, \qquad (6.36)$$
where $\tilde\Sigma = \tilde E'\tilde E/n$, $\tilde E = \operatorname{devec}_n\tilde\varepsilon$, and $\tilde\varepsilon$ is the OLS residual vector. As $\tilde\delta$ is a BAN estimator, we have
$$\sqrt{n}(\tilde\delta - \delta) \xrightarrow{d} N(0, V_1),$$
where $V_1$ is the Cramer-Rao lower bound referring to $\delta$. With $r$ known, our unknown parameters would be $(\delta'\ v')'$ and the information matrix, both for the case in which $X$ is exogenous and for the case in which $X$ contains lagged dependent variables, would be
$$I^{*}(\delta, v) = \operatorname{plim}\frac{1}{n}\begin{pmatrix} X^{d\prime}(\Sigma^{-1}\otimes I_n)X^d & 0\\ 0 & \frac{n}{2}D'(\Sigma^{-1}\otimes\Sigma^{-1})D\end{pmatrix}.$$
The asymptotic covariance matrix of $\tilde\delta$ would then be
$$V_1 = \left[\operatorname{plim}X^{d\prime}(\Sigma^{-1}\otimes I_n)X^d/n\right]^{-1}.$$
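2. CASE IN WHICH R IS UNKNOWN. The estimator $\tilde\delta$ is not available to us in the more realistic case in which $R$ is unknown. However, an asymptotically efficient estimator for $\delta$ may be obtained from the following procedure.2

1. Apply joint generalized least squares to $y = X\delta + u$, ignoring the vector autoregression, to obtain estimator $\bar\delta$, say, and the residual vector $\hat u = y - X\bar\delta$. From it, form $\hat U = \operatorname{devec}_n\hat u$ and $\hat U_p = S(I_p\otimes\hat U)$.
2. Compute $\hat R' = -(\hat U_p'\hat U_p)^{-1}\hat U_p'\hat U$, $\hat r = \operatorname{vec}\hat R$ and $M(\hat r)$, $\hat E = \hat U + \hat U_p\hat R'$, $\hat\Sigma = \hat E'\hat E/n$, $\hat y^d = M(\hat r)y$, $\hat X^d = M(\hat r)X$.
3. Compute
$$\hat\delta = [\hat X^{d\prime}(\hat\Sigma^{-1}\otimes I_n)\hat X^d]^{-1}\hat X^{d\prime}(\hat\Sigma^{-1}\otimes I_n)\hat y^d. \qquad (6.37)$$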
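The three steps above can be sketched in code as follows. This is a minimal illustration assuming NumPy, not a definitive implementation. The function names are hypothetical, the first-step JGLS takes its $\Sigma$ from OLS residuals (an assumption, since Eq. (6.8) is not reproduced here), and the simulated data use made-up parameter values purely to exercise the procedure.

```python
import numpy as np

def devec(v, n):
    """devec_n: the n x G matrix whose columns are the successive n-blocks of v."""
    return v.reshape(n, -1, order="F")

def lagged_block(U, p):
    """U_p = S(I_p (x) U) = (U_{-1} ... U_{-p}) with zero presample values."""
    n = U.shape[0]
    return np.hstack([np.eye(n, k=-j) @ U for j in range(1, p + 1)])

def jgls(y, X, Sigma, n):
    """[X'(Sigma^{-1} (x) I_n) X]^{-1} X'(Sigma^{-1} (x) I_n) y."""
    W = np.kron(np.linalg.inv(Sigma), np.eye(n))
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

def M_of_R(R, n, p):
    """M(r) = I_{nG} + sum_j R_j (x) S_j for R = (R_1 ... R_p)."""
    G = R.shape[0]
    N = sum(np.kron(R[:, j * G:(j + 1) * G], np.eye(n, k=-(j + 1))) for j in range(p))
    return np.eye(n * G) + N

def feasible_jgls(y, X, n, p):
    # Step 1: JGLS ignoring the autoregression (Sigma from OLS residuals: an assumption).
    U_ols = devec(y - X @ np.linalg.lstsq(X, y, rcond=None)[0], n)
    delta_bar = jgls(y, X, U_ols.T @ U_ols / n, n)
    U = devec(y - X @ delta_bar, n)
    U_p = lagged_block(U, p)
    # Step 2: R' = -(U_p'U_p)^{-1} U_p'U, then E, Sigma and M(r).
    R = -np.linalg.solve(U_p.T @ U_p, U_p.T @ U).T
    E = U + U_p @ R.T
    Sigma = E.T @ E / n
    M = M_of_R(R, n, p)
    # Step 3: JGLS on the transformed system, Eq. (6.37).
    delta = jgls(M @ y, M @ X, Sigma, n)
    return {"delta": delta, "R": R, "Sigma": Sigma, "E": E, "U_p": U_p}

# Simulated example with hypothetical parameter values and exogenous regressors.
rng = np.random.default_rng(42)
n, G, p = 400, 2, 1
R1 = np.array([[-0.5, 0.1], [0.0, -0.3]])
Sigma_true = np.array([[1.0, 0.4], [0.4, 1.0]])
eps = rng.multivariate_normal(np.zeros(G), Sigma_true, size=n)
U = np.zeros((n, G))
for t in range(n):
    U[t] = eps[t] - (R1 @ U[t - 1] if t > 0 else 0.0)   # u_t + R_1 u_{t-1} = eps_t
delta_true = np.array([1.0, -2.0, 0.5, 3.0])
X = np.zeros((n * G, 4))
for i in range(G):
    X[i * n:(i + 1) * n, 2 * i:2 * i + 2] = rng.normal(size=(n, 2))
y = X @ delta_true + U.reshape(-1, order="F")

fit = feasible_jgls(y, X, n, p)
```

Forming $\Sigma^{-1}\otimes I_n$ explicitly keeps the code close to Eq. (6.37) but is wasteful for large $n$; a practical implementation would exploit the Kronecker and band structure instead.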
The estimator $\hat\delta$ is asymptotically efficient for both the case in which $X$ is exogenous and for the case in which $X$ contains lagged values of the dependent variables. However, as in the case of GLS estimators in dynamic linear-regression models, the efficiency of $\hat\delta$ differs in the two cases. First, consider the case in which $X$ is exogenous. As $\hat\delta$ is a BAN estimator,
$$\sqrt{n}(\hat\delta - \delta) \xrightarrow{d} N(0, V_2),$$
where $V_2$ is the appropriate Cramer-Rao lower bound obtained from $I^{-1}(\theta)$ given by Eq. (6.34). Therefore, we see that
$$V_2 = V_1 = \left[\operatorname{plim}X^{d\prime}(\Sigma^{-1}\otimes I_n)X^d/n\right]^{-1}.$$
This means that the JGLS estimator $\hat\delta$ with unknown $R$ is as asymptotically efficient as the JGLS estimator $\tilde\delta$ with known $R$. As in the linear-regression model, not knowing $R$ costs us nothing in terms of asymptotic efficiency. Next, consider the case in which $X$ contains lagged dependent variables. For this case $I^{-1}(\theta)$ is given by Eqs. (6.28)-(6.33) so the asymptotic covariance matrix of $\hat\delta$ is
$$V_2 = I^{\delta\delta} = \left[\operatorname{plim}X^{d\prime}(\Sigma^{-1}\otimes M_p)X^d/n\right]^{-1}.$$

2 The formal proof that this procedure does indeed lead to an asymptotically efficient estimator may be obtained along the lines of a similar proof presented in Subsection 7.3.5.1.
It is easily seen that now $V_1^{-1} - V_2^{-1}$ is positive semidefinite so $V_2 - V_1$ is also positive semidefinite. The JGLS estimator $\tilde\delta$ that can be formed with known $R$ is asymptotically more efficient than the JGLS estimator $\hat\delta$ with unknown $R$. Not knowing $R$ now costs us in terms of asymptotic efficiency, just as it did in the equivalent linear-regression case.

6.3.8.2. Maximum-Likelihood Estimators as Iterative Joint-Generalized-Least-Squares Estimators

Using the score vector given by Eqs. (6.18)-(6.20) makes it possible to obtain an interpretation of the MLE of $\delta$ as an iterative JGLS estimator. Returning to the score vector, we see that $\partial l/\partial r = 0$ gives
$$K_{pG,G}\operatorname{vec}U_p'E\Sigma^{-1} = 0,$$
which implies that $U_p'E = 0$, so from Eq. (6.11),
$$\tilde R' = -(U_p'U_p)^{-1}U_p'U. \qquad (6.38)$$
Next, $\partial l/\partial v = 0$ gives
$$\tilde\Sigma = E'E/n, \qquad (6.39)$$
and solving $\partial l/\partial\delta = 0$ for $\delta$ yields
$$\tilde\delta = [X^{d\prime}(\Sigma^{-1}\otimes I_n)X^d]^{-1}X^{d\prime}(\Sigma^{-1}\otimes I_n)y^d.$$
This interpretation of the MLE is clearly iterative as $\tilde R$ still contains $\delta$ through $U_p$ whereas $\tilde\delta$ contains $R$ through $X^d$. However, this interpretation clearly points to the estimation procedure outlined above.

6.3.8.3. Classical Test Statistics for Hypotheses Concerning the Disturbances

1. THE HYPOTHESIS $H_0: \sigma_{ij} = 0$, $i \neq j$. Two hypotheses are of interest regarding the disturbances of this model: that the disturbances $\varepsilon$ are in fact contemporaneously correlated and that the disturbances are in fact subject to an autoregressive process. The first is easily dealt with. The appropriate null hypothesis is
$$H_0: \sigma_{ij} = 0, \quad i \neq j.$$
Compare Eqs. (6.26) and (6.27) with Eq. (6.7). Clearly, as the information matrix $I(\theta)$ for this model is block diagonal, both for the case in which $X$ is exogenous and for the case in which $X$ contains lagged dependent variables, the LMT statistic for $H_0$ is the Breusch-Pagan test statistic discussed in Subsection 6.2.4.2 with the proviso that we now work with the $\varepsilon$s rather than with the $u$s. This proviso complicates procedures in that
$$\varepsilon = y^d - X^d\delta,$$
and, as $y^d = M(r)y$, $X^d = M(r)X$, and $r$ is unknown, $y^d$ and $X^d$ are unknown and must be predicted before we start. We could do this by ignoring the vector autoregressive process, assuming the $\varepsilon$s are contemporaneously uncorrelated, and applying ordinary least squares to $y = X\delta + u$. With the OLS residual vector we would obtain a consistent estimator $\hat r$ of $r$ by following the procedure outlined in Subsection 6.3.8.1 and then form predictors $\hat y^d$ and $\hat X^d$. Then $\hat\varepsilon$ would be the residual vector from the regression of $\hat y^d$ on $\hat X^d$ and the LMT statistic would be formed as in Subsection 6.2.5.2 but with $\hat\varepsilon$ in place of $\hat u$.

It is unlikely that testing $H_0: \sigma_{ij} = 0$, $i \neq j$, would play as significant a role in this model as it would in the standard model, the reason being that imposing $\sigma_{ij} = 0$, $i \neq j$, on the model does not simplify the estimation procedure all that much. We would still have to follow steps like those provided in Subsection 6.3.8.1, starting with an OLS estimator rather than a JGLS estimator in step 1, and the estimator we would end up with is simpler only to the extent that $\Sigma$ is now diagonal. This does not lead to great computational savings as $X^d$ is not block diagonal. A better diagnostic procedure would be first to test the null hypothesis $H_0: r = 0$. If this is accepted, we could then aim at greater simplification by testing $H_0: \sigma_{ij} = 0$, $i \neq j$, by using the basic LMT statistic of Subsection 6.2.4.2.

2. THE LMT STATISTIC FOR $H_0: r = 0$. The second, more important hypothesis then concerns the autoregressive process. If the disturbances of the SURE model are not subject to vector autoregression then, rather than using the estimator $\hat\delta$ given by Eq. (6.37), we would use the JGLS estimator obtained from $y = X\delta + u$, namely
$$\bar\delta = [X'(\hat\Sigma^{-1}\otimes I_n)X]^{-1}X'(\hat\Sigma^{-1}\otimes I_n)y.$$
It is of interest to us then to develop a test statistic for the null hypothesis $H_0: r = 0$ against the alternative $H_A: r \neq 0$. As in the linear model, the most amenable classical test statistic is the LMT statistic, which is given by
$$T_1 = \frac{1}{n}\left(\frac{\partial l}{\partial r}\right)' I^{rr}(\theta)\,\frac{\partial l}{\partial r}\bigg|_{\hat\theta},$$
where, in forming $\hat\theta$, we put $r$ equal to the null vector and evaluate all other parameters at the constrained MLEs, the MLEs we get for $\delta$ and $v$ after we set $r$ equal to the null vector. Asymptotically the constrained MLE for $\delta$ is equivalent to $\bar\delta$. The actual test statistic itself will depend on the case before us. We have seen that $I(\theta)$ and therefore $I^{rr}(\theta)$ differ, depending on whether $X$ is exogenous or $X$ contains lagged dependent variables. Of course, for both cases, when $r = 0$ we have $M(r) = I_{nG}$, $X^d = X$, and $\operatorname{plim}U_p'U_p/n = I_p\otimes\Sigma$. We consider each case in turn.
First, when $X$ is exogenous
$$I^{rr}(\theta)\big|_{r=0} = I_p\otimes\Sigma^{-1}\otimes\Sigma.$$
It follows that for this case the LMT statistic is
$$T_1' = \frac{1}{n}u'(\Sigma^{-1}\otimes U_p)K_{G,pG}[I_p\otimes\Sigma^{-1}\otimes\Sigma]K_{pG,G}(\Sigma^{-1}\otimes U_p')u\Big|_{\hat\theta}$$
$$\phantom{T_1'} = \frac{1}{n}u'[\Sigma^{-1}\otimes U_p(I_p\otimes\Sigma^{-1})U_p']u\Big|_{\hat\theta}$$
$$\phantom{T_1'} = n\,\hat u'\{(\hat U'\hat U)^{-1}\otimes\hat U_p[I_p\otimes(\hat U'\hat U)^{-1}]\hat U_p'\}\hat u, \qquad (6.40)$$
where $\hat u$ is the constrained MLE residual vector, $\hat U = \operatorname{devec}_n\hat u$ and $\hat U_p = S(I_p\otimes\hat U)$. (An asymptotically equivalent test statistic would use the JGLS estimator residuals formed from $\bar\delta$.) Under $H_0$, $T_1'$ has a limiting $\chi^2$ distribution with $pG^2$ degrees of freedom, so the upper tail of this distribution is used to obtain the appropriate critical region.

This LMT statistic, like several other such test statistics, has an intuitive interpretation in terms of an F test associated with an underlying regression. Suppose for the moment that $U$ and $U_p$ are known and consider the artificial regression
$$u = -(I_G\otimes U_p)\rho + \varepsilon,$$
where $\rho = \operatorname{vec}R'$. This equation would not satisfy the assumptions of the linear-regression model as $V(\varepsilon) = \Sigma\otimes I_n$ rather than a scalar matrix. However, suppose further that $\Sigma$ is known and consider the nonsingular matrix $P$ such that $P'P = \Sigma^{-1}$. Then the transformed equation
$$\bar u = -\bar U_p\rho + \bar\varepsilon,$$
with $\bar u = (P\otimes I_n)u$, $\bar\varepsilon = (P\otimes I_n)\varepsilon$, and $\bar U_p = P\otimes U_p$, would satisfy the Gauss-Markov assumptions. We could then apply the usual F test for the null hypothesis $H_0: \rho = 0$, obtaining the test statistic
$$F = \frac{(\bar u'\bar u - \bar u'\bar M_p\bar u)/pG^2}{\bar u'\bar M_p\bar u/(n - pG^2)}$$
with $\bar M_p = I_{nG} - \bar U_p(\bar U_p'\bar U_p)^{-1}\bar U_p'$. However, it is easily seen that the numerator of this artificial F test statistic, apart from the constant, is
$$u'[\Sigma^{-1}\otimes U_p(U_p'U_p)^{-1}U_p']u.$$
Noting that, under $H_0$, $\operatorname{plim}U_p'U_p/n = I_p\otimes\operatorname{plim}U'U/n$ and that $U'U/n$ is a consistent estimator of $\Sigma$, we see that the LMT statistic is asymptotically equivalent to $n$ times this numerator after we have placed caps on $u$ and $U_p$.
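The statistic in Eq. (6.40) depends only on a residual vector. The sketch below assumes NumPy, uses hypothetical function names, and substitutes random numbers for the constrained residuals. It evaluates the expression once literally with Kronecker products and once through the identity $u'(A\otimes B)u = \operatorname{tr}(AU'BU)$ for symmetric $A$ and $u = \operatorname{vec}U$, which avoids forming $nG \times nG$ arrays.

```python
import numpy as np

def devec(v, n):
    """devec_n: n x G matrix whose columns are the successive n-blocks of v."""
    return v.reshape(n, -1, order="F")

def lagged_block(U, p):
    """U_p = S(I_p (x) U) with zero presample values."""
    n = U.shape[0]
    return np.hstack([np.eye(n, k=-j) @ U for j in range(1, p + 1)])

def lm_stat_kron(u_hat, n, p):
    """Eq. (6.40) evaluated literally with Kronecker products."""
    U = devec(u_hat, n)
    U_p = lagged_block(U, p)
    UU_inv = np.linalg.inv(U.T @ U)
    inner = U_p @ np.kron(np.eye(p), UU_inv) @ U_p.T       # n x n
    return n * u_hat @ np.kron(UU_inv, inner) @ u_hat

def lm_stat_trace(u_hat, n, p):
    """Same quantity via u'(A (x) B)u = tr(A U'BU), with A symmetric."""
    U = devec(u_hat, n)
    U_p = lagged_block(U, p)
    UU_inv = np.linalg.inv(U.T @ U)
    UpU = U_p.T @ U                                         # pG x G
    return n * np.trace(UU_inv @ UpU.T @ np.kron(np.eye(p), UU_inv) @ UpU)

rng = np.random.default_rng(7)
n, G, p = 50, 3, 2                       # illustrative sizes only
u_hat = rng.normal(size=n * G)           # placeholder for constrained residuals
T_kron = lm_stat_kron(u_hat, n, p)
T_trace = lm_stat_trace(u_hat, n, p)
dof = p * G * G                          # degrees of freedom of the limiting chi-square
```

In an application `u_hat` would be the residual vector from the JGLS fit of $y = X\delta + u$, and the statistic would be compared with the upper tail of a $\chi^2$ distribution with `dof` degrees of freedom.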
Second, we consider the more complicated case in which $X$ contains lagged dependent variables. From Eq. (6.31) we can write
$$I^{rr}(\theta)\big|_{r=0} = K_{pG,G}(\Sigma\otimes I_p\otimes\Sigma^{-1})K_{G,pG} + \operatorname{plim}\frac{1}{n}K_{pG,G}[I_G\otimes(I_p\otimes\Sigma^{-1})U_p']X\left(X'\{\Sigma^{-1}\otimes[I_n - U_p(I_p\otimes\Sigma^{-1})U_p'/n]\}X\right)^{-1}X'[I_G\otimes U_p(I_p\otimes\Sigma^{-1})]K_{G,pG}.$$
Now, as
$$\frac{\partial l}{\partial r}\bigg|_{r=0} = -K_{pG,G}(\Sigma^{-1}\otimes U_p')u$$
and as $K_{pG,G}^{-1} = K_{pG,G}' = K_{G,pG}$, we can write the LMT statistic, ignoring the p lim, as
$$T_1 = T_1' + u'[\Sigma^{-1}\otimes U_p(I_p\otimes\Sigma^{-1})U_p'/n]X\left(X'\{\Sigma^{-1}\otimes[I_n - U_p(I_p\otimes\Sigma^{-1})U_p'/n]\}X\right)^{-1}X'[\Sigma^{-1}\otimes U_p(I_p\otimes\Sigma^{-1})U_p'/n]u\Big|_{\hat\theta}. \qquad (6.41)$$
In the evaluation at $\hat\theta$, we put $\hat u$, $\hat U'\hat U/n$, and $\hat U_p$ in place of $u$, $\Sigma$, and $U_p$, respectively, where $\hat u$ is the constrained MLE residual vector, $\hat U = \operatorname{devec}_n\hat u$ and $\hat U_p = S(I_p\otimes\hat U)$.

3. THE WALD TEST STATISTIC FOR $H_0: r = 0$. The other classical statistic that is worth considering in this context is the Wald test statistic. Suppose $\tilde\theta = (\tilde\delta'\ \tilde r'\ \tilde v')'$ is the MLE of $\theta$. Then the Wald test statistic would be based on
$$T_2 = n\,\tilde r'[I^{rr}(\tilde\theta)]^{-1}\tilde r,$$
where $I^{rr}(\theta) = (\operatorname{plim}U_p'U_p/n)^{-1}\otimes\Sigma$ for the case in which $X$ is exogenous, and
$$I^{rr}(\theta) = \operatorname{plim}n\left\{U_p'U_p\otimes\Sigma^{-1} - K_{pG,G}(\Sigma^{-1}\otimes U_p')X^d[X^{d\prime}(\Sigma^{-1}\otimes I_n)X^d]^{-1}X^{d\prime}(\Sigma^{-1}\otimes U_p)K_{G,pG}\right\}^{-1}$$
for the case in which $X$ contains lagged values of the dependent variables. An analysis similar to that conducted in Subsection 5.3.5.3 can now be conducted for the Wald test statistic here. As in the dynamic linear-regression case, we have difficulty in obtaining an explicit expression for the Wald statistic as the system of equations $\partial l/\partial\theta = 0$ is nonlinear in $\theta$, although numerical techniques with the aid of computers should ensure that we can obtain the observed value of our test statistic. It is also possible in this case to gain some comparative insight into the Wald test statistic by using the iterative solution for the MLEs obtained in
Subsection 6.3.8.2. There we show that the iterative solution for $\tilde R$ is given by
$$\tilde R' = -(U_p'U_p)^{-1}U_p'U,$$
so $\operatorname{vec}\tilde R' = -[I_G\otimes(U_p'U_p)^{-1}U_p']u$ and
$$\tilde r = K_{pG,G}\operatorname{vec}\tilde R' = -K_{pG,G}[I_G\otimes(U_p'U_p)^{-1}U_p']u.$$
For the case in which $X$ is exogenous, the Wald test statistic essentially looks at
$$T_2' = n\,u'[I_G\otimes U_p(U_p'U_p)^{-1}]K_{G,pG}(U_p'U_p/n\otimes\Sigma^{-1})K_{pG,G}[I_G\otimes(U_p'U_p)^{-1}U_p']u\Big|_{\tilde\theta} \qquad (6.42)$$
$$\phantom{T_2'} = u'(\Sigma^{-1}\otimes N_p)u\Big|_{\tilde\theta}$$
with $N_p = U_p(U_p'U_p)^{-1}U_p'$. Comparing Eq. (6.42) with Eq. (6.40), we see that essentially the Wald test statistic has the same format as that of the LMT statistic, the difference being that the former evaluates $u$, $\Sigma^{-1}$, and $U_p$ by using the (unconstrained) MLE whereas the latter evaluates these components by using the constrained MLEs. Note that from Eqs. (6.38) and (6.39)
$$\tilde\Sigma = \tilde E'\tilde E/n, \quad \tilde E = \tilde M_p\tilde U, \qquad (6.43)$$
where $\tilde M_p = I_n - \tilde N_p$, so we can write $T_2'$ out in full as
$$T_2' = \tilde u'[(\tilde U'\tilde M_p\tilde U/n)^{-1}\otimes\tilde U_p(\tilde U_p'\tilde U_p)^{-1}\tilde U_p']\tilde u,$$
where $\tilde u = y - X\tilde\delta$, $\tilde\delta$ is the MLE of $\delta$, and $\tilde U = \operatorname{devec}_n\tilde u$.

For the more complicated case in which $X$ contains lagged dependent variables, the Wald test statistic is equivalent to
$$T_2 = u'[I_G\otimes U_p(U_p'U_p)^{-1}]K_{G,pG}\left\{(U_p'U_p\otimes\Sigma^{-1}) - K_{pG,G}(\Sigma^{-1}\otimes U_p')X^d[X^{d\prime}(\Sigma^{-1}\otimes I_n)X^d]^{-1}X^{d\prime}(\Sigma^{-1}\otimes U_p)K_{G,pG}\right\}K_{pG,G}[I_G\otimes(U_p'U_p)^{-1}U_p']u\Big|_{\tilde\theta}.$$
Using the properties of the commutation matrix, we can simplify this expression to
$$T_2 = T_2' - u'(\Sigma^{-1}\otimes N_p)X^d[X^{d\prime}(\Sigma^{-1}\otimes I_n)X^d]^{-1}X^{d\prime}(\Sigma^{-1}\otimes N_p)u\Big|_{\tilde\theta}.$$
Compare this with the corresponding case for the LMT given by Eq. (6.41), which we can write as
$$T_1 = T_1' + u'(\Sigma^{-1}\otimes N_p)X^d[X^{d\prime}(\Sigma^{-1}\otimes M_p)X^d]^{-1}X^{d\prime}(\Sigma^{-1}\otimes N_p)u\Big|_{\hat\theta}.$$
We have already noted the similarities between $T_1'$ and $T_2'$. The second part of the two test statistics clearly involves a quadratic form of the vector
$X^{d\prime}(\Sigma^{-1}\otimes N_p)u$, the difference being that for $T_2$ the matrix used in the quadratic form is $[X^{d\prime}(\Sigma^{-1}\otimes I_n)X^d]^{-1}$ whereas in $T_1$ it is $[X^{d\prime}(\Sigma^{-1}\otimes M_p)X^d]^{-1}$. Needless to say, $T_2$ evaluates everything at the (unconstrained) MLE whereas $T_1$ evaluates everything at the constrained MLE.

4. THE LRT STATISTIC FOR $H_0: r = 0$. Again using our iterative interpretation of the MLEs of $\Sigma$, both for the case in which we have no vector autoregression disturbances (Subsection 6.2.4.1) and for the case in which such disturbances exist (Subsection 6.3.8.2), and the log-likelihood function given by Eq. (6.17), we see that the LRT statistic is equivalent to
$$T_3 = -\frac{n}{2}\log\frac{\det\tilde\Sigma}{\det\hat\Sigma},$$
where $\tilde\Sigma$ and $\hat\Sigma$ are given by Eqs. (6.39) and (6.9), respectively. This expression clearly is a monotonic function of $\det\tilde\Sigma/\det\hat\Sigma$, but given the complicated nature of the determinant of a matrix, it is difficult to make further comparisons with the other two test statistics. Unlike in the linear-regression case, our iterative solutions do not help in this context.

6.4. THE SURE MODEL WITH VECTOR MOVING-AVERAGE DISTURBANCES

6.4.1. The Model
In this section, we assume that the disturbances of the model are now subject to the moving-average process
$$u_t = \varepsilon_t + R_1\varepsilon_{t-1} + \cdots + R_p\varepsilon_{t-p}.$$
Again, we assume that there are no unit root problems arising from lagged dependent variables. Following a similar analysis to that of Subsection 6.3.1, we write the model as
$$y = X\delta + u, \quad u = M(r)\varepsilon, \quad \varepsilon \sim N(0, \Sigma\otimes I_n).$$
Assuming invertibility, we write
$$\varepsilon = M(r)^{-1}u.$$
It is the presence of the inverse matrix $M(r)^{-1}$ that makes the differentiation of the log likelihood far more complicated for the case of moving-average disturbances, but again the mathematics is greatly facilitated by use of generalized vecs and devecs. Before we commence this differentiation it pays us to look at
some of the properties of $M(r)^{-1}$, properties that we shall need in the application of our asymptotic theory.

6.4.2. The Matrix $M(r)^{-1}$

Recall from Subsection 6.3.2 that if we write
$$M(r) = \begin{pmatrix} M_{11} & \cdots & M_{1G}\\ \vdots & & \vdots\\ M_{G1} & \cdots & M_{GG}\end{pmatrix},$$
then each $M_{ii}$, $i = 1, \ldots, G$, is an $n \times n$ Toeplitz lower-triangular band matrix with ones along its main diagonal whereas each $M_{ij}$, $i \neq j$, is an $n \times n$ Toeplitz matrix that is strictly lower triangular and band. Suppose we write
$$M(r)^{-1} = \begin{pmatrix} M^{11} & \cdots & M^{1G}\\ \vdots & & \vdots\\ M^{G1} & \cdots & M^{GG}\end{pmatrix},$$
where each submatrix is $n \times n$. Then Theorem 2.6 of Chap. 2 allows us to conclude that each $M^{ij}$ has characteristics similar to those of $M_{ij}$; that is, $M^{ii}$, $i = 1, \ldots, G$, is a lower-triangular matrix with ones down its main diagonal whereas $M^{ij}$, $i \neq j$, is strictly lower triangular. However, we can go further than this with regard to the properties of $M(r)^{-1}$. Recall that
$$M(r) = I_{nG} + (R\otimes I_n)C,$$
where $R$ is the $G \times Gp$ matrix $R = (R_1 \cdots R_p)$ and $C$ is the $Gnp \times Gn$ matrix given by
$$C = \begin{pmatrix} I_G\otimes S_1\\ \vdots\\ I_G\otimes S_p\end{pmatrix}.$$
From the work we did in Subsection 3.7.4.3, it follows that it is possible to write
$$M(r)^{-1} = I_{nG} + (\bar R\otimes I_n)\bar C, \qquad (6.44)$$
where $\bar R$ is a $G \times G(n-1)$ matrix whose elements are products of the elements of $R$ and $\bar C$ is the $Gn(n-1) \times Gn$ matrix given by
$$\bar C = \begin{pmatrix} I_G\otimes S_1\\ \vdots\\ I_G\otimes S_{n-1}\end{pmatrix}.$$
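The structure of the inverse can be illustrated numerically. The sketch below assumes NumPy, small illustrative dimensions, and hypothetical helper names. It uses the fact that $N(r)$ is nilpotent (because $S_jS_k = S_{j+k}$ and $S_m = 0$ for $m \geq n$), so that $M(r)^{-1}$ equals the finite sum $\sum_{k=0}^{n-1}(-N(r))^k$, and then checks block by block that the inverse is Toeplitz and lower triangular with the stated diagonals.

```python
import numpy as np

rng = np.random.default_rng(3)
n, G, p = 6, 2, 2                                  # illustrative sizes only
R_blocks = [0.5 * rng.normal(size=(G, G)) for _ in range(p)]

# N(r) = sum_j R_j (x) S_j and M(r) = I + N(r).
N = sum(np.kron(Rj, np.eye(n, k=-j)) for j, Rj in enumerate(R_blocks, start=1))
M = np.eye(n * G) + N
M_inv = np.linalg.inv(M)

# Finite Neumann series: every term of N^n involves S_m with m >= n, hence N^n = 0.
series = sum(np.linalg.matrix_power(-N, k) for k in range(n))

def block(A, i, j):
    """The (i, j) n x n submatrix of an nG x nG matrix (zero-based indices)."""
    return A[i * n:(i + 1) * n, j * n:(j + 1) * n]

def is_toeplitz(B):
    return np.allclose(B[1:, 1:], B[:-1, :-1])

def is_lower_with_diag(B, d):
    """Lower triangular with constant main diagonal d."""
    return np.allclose(np.triu(B, 1), 0.0) and np.allclose(np.diag(B), d)

structure_ok = all(
    is_toeplitz(block(M_inv, i, j)) and is_lower_with_diag(block(M_inv, i, j), float(i == j))
    for i in range(G) for j in range(G)
)
```

The coefficients multiplying $S_1, \ldots, S_{n-1}$ in each block of `M_inv` are the entries of the matrix written $\bar R$ in Eq. (6.44).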
In other words, each submatrix $M^{ii}$ of the inverse is a Toeplitz matrix of the form
$$I_n + a_1S_1 + \cdots + a_{n-1}S_{n-1}$$
for $i = 1, \ldots, G$, whereas each submatrix $M^{ij}$, $i \neq j$, of the inverse is a Toeplitz matrix of the form
$$b_1S_1 + \cdots + b_{n-1}S_{n-1}$$
for suitable $a_i$s and $b_i$s that are functions of the elements of $R$. Now consider, say,
$$M(r)^{-1\prime} = \begin{pmatrix} M^{11\prime} & \cdots & M^{G1\prime}\\ \vdots & & \vdots\\ M^{1G\prime} & \cdots & M^{GG\prime}\end{pmatrix} = \begin{pmatrix}\bar M_{11} & \cdots & \bar M_{1G}\\ \vdots & & \vdots\\ \bar M_{G1} & \cdots & \bar M_{GG}\end{pmatrix}.$$
It follows that each $\bar M_{ii}$, $i = 1, \ldots, G$, is an $n \times n$ upper-triangular matrix with ones as its main diagonal elements whereas each $\bar M_{ij}$, $i \neq j$, is strictly upper triangular. In fact, each $\bar M_{ii}$ is a Toeplitz matrix of the form
$$I_n + a_1S_1' + \cdots + a_{n-1}S_{n-1}',$$
whereas each $\bar M_{ij}$, $i \neq j$, is a Toeplitz matrix of the form
$$b_1S_1' + \cdots + b_{n-1}S_{n-1}'$$
for suitable $a_i$s and $b_j$s.

6.4.3. The Derivative $\partial\varepsilon/\partial r$

Just as in the analysis of the preceding model we shall need the derivative $\partial\varepsilon/\partial r$. We write
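$$\varepsilon = M(r)^{-1}u = (u'\otimes I_{nG})\operatorname{vec}M(r)^{-1},$$
and
$$\frac{\partial\operatorname{vec}M(r)^{-1}}{\partial r} = \frac{\partial\operatorname{vec}N(r)}{\partial r}\frac{\partial\operatorname{vec}M(r)^{-1}}{\partial\operatorname{vec}M(r)} = -J'[M(r)^{-1}\otimes M(r)^{-1\prime}], \qquad (6.45)$$
so
$$\frac{\partial\varepsilon}{\partial r} = -J'(\varepsilon\otimes I_{nG})M(r)^{-1\prime}.$$
However, we can obtain an alternative way of writing this derivative by using the properties of $J$ as we did in Eq. (6.16) of Subsection 6.3.3 to get
$$\frac{\partial\varepsilon}{\partial r} = -J'(\varepsilon\otimes I_{nG})M(r)^{-1\prime} = -K_{pG,G}(I_G\otimes E_p')M(r)^{-1\prime}, \qquad (6.46)$$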
where $E_p = S(I_p\otimes E)$.

6.4.4. The Parameters of the Model, the Log-Likelihood Function, and the Score Vector

The parameters of the model are given by $\theta = (\delta'\ r'\ v')'$ and the log-likelihood function, apart from a constant, is
$$l(\theta) = -\frac{n}{2}\log\det\Sigma - \frac{1}{2}\varepsilon'(\Sigma^{-1}\otimes I_n)\varepsilon,$$
where in this function we set $\varepsilon$ equal to $y^* - X^*\delta$, with $y^* = M(r)^{-1}y$ and $X^* = M(r)^{-1}X$. The first and the third components of the score vector are given by Eqs. (6.18) and (6.19), with $X^*$ in place of $X^d$. Using Eq. (6.46), we find that the second component of the score vector is given by
$$\frac{\partial l}{\partial r} = K_{pG,G}(I_G\otimes E_p')M(r)^{-1\prime}(\Sigma^{-1}\otimes I_n)\varepsilon. \qquad (6.47)$$
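6.4.5. The Hessian Matrix $\partial^2 l/\partial\theta\partial\theta'$

The components of the Hessian matrix $\partial^2 l/\partial\delta\partial\delta'$, $\partial^2 l/\partial\delta\partial v'$, and $\partial^2 l/\partial v\partial v'$ are given by Eqs. (6.21), (6.22), and (6.23), respectively, but with $X^*$ in place of $X^d$. The derivative $\partial^2 l/\partial r\partial v'$ is obtained in much the same way as for the preceding model. We get
$$\frac{\partial^2 l}{\partial r\partial v'} = -K_{pG,G}(I_G\otimes E_p')M(r)^{-1\prime}(\Sigma^{-1}\otimes E\Sigma^{-1})D.$$
The last two components of the Hessian matrix, namely $\partial^2 l/\partial\delta\partial r'$ and $\partial^2 l/\partial r\partial r'$, require more effort to obtain and draw heavily on the properties of the matrix $J$ given by Propositions 6.1 and 6.2 of Subsection 6.3.3.1 and on Theorem 3.11 of Subsection 3.3.3 concerning $K_{Gn}^{\bar\tau_n}$. Each is handled in turn.

$\partial^2 l/\partial\delta\partial r'$: We start from $\partial l/\partial r$, which we write as
$$\frac{\partial l}{\partial r} = J'(\varepsilon\otimes I_{nG})A\varepsilon,$$
where $A = M(r)^{-1\prime}(\Sigma^{-1}\otimes I_n)$. Using the backward chain rule of matrix calculus, we find that it follows that
$$\frac{\partial^2 l}{\partial\delta\partial r'} = \frac{\partial\varepsilon}{\partial\delta}\frac{\partial(\varepsilon\otimes I_{nG})A\varepsilon}{\partial\varepsilon}J.$$
However, in Chap. 4, from our table of matrix calculus results,
$$\frac{\partial(\varepsilon\otimes I_{nG})A\varepsilon}{\partial\varepsilon} = A'(\varepsilon'\otimes I_{nG}) + (I_{nG}\otimes\varepsilon'A'), \qquad (6.48)$$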
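and, as $\partial\varepsilon/\partial\delta = -X^{*\prime}$, we have, referring to Eq. (6.48),
$$\frac{\partial^2 l}{\partial\delta\partial r'} = -X^{*\prime}\left\{(\Sigma^{-1}\otimes I_n)M(r)^{-1}(\varepsilon'\otimes I_{nG}) + [I_{nG}\otimes(\operatorname{vec}E\Sigma^{-1})'M(r)^{-1}]\right\}J. \qquad (6.49)$$
We now want to write this derivative in terms of commutation matrices. We do this by using the properties of $J$. By Proposition 6.1 of Subsection 6.3.3.1,
$$J'(\varepsilon\otimes I_{nG}) = K_{pG,G}(I_G\otimes E_p'). \qquad (6.50)$$
Consider the $nG \times 1$ vector $a = M(r)^{-1\prime}\operatorname{vec}E\Sigma^{-1}$. Then, from Proposition 6.3 of the same section,
$$J'(I_{nG}\otimes a) = (I_{pG}\otimes a^{\bar\tau_n\prime})C,$$
and if we use the properties of the devec operator
$$a^{\bar\tau_n} = [M(r)^{-1\prime}]^{\bar\tau_n}(I_G\otimes\operatorname{vec}E\Sigma^{-1}) = ([M(r)^{-1}]^{\tau_n})'(I_G\otimes\operatorname{vec}E\Sigma^{-1}), \qquad (6.51)$$
so we can now write
$$\frac{\partial^2 l}{\partial\delta\partial r'} = -X^{*\prime}(\Sigma^{-1}\otimes I_n)M(r)^{-1}(I_G\otimes E_p)K_{G,pG} - X^{*\prime}C'\left\{I_{pG}\otimes([M(r)^{-1}]^{\tau_n})'(I_G\otimes\operatorname{vec}E\Sigma^{-1})\right\}.$$
$\partial^2 l/\partial r\partial r'$: Again we start with $\partial l/\partial r$. As before, we let $a(r) = M(r)^{-1\prime}(\Sigma^{-1}\otimes I_n)\varepsilon$ and $A(r) = I_G\otimes E_p'$. With this notation we can write
$$\frac{\partial l}{\partial r} = K_{pG,G}A(r)a(r)$$
and, by using the product rule of matrix calculus,
$$\frac{\partial^2 l}{\partial r\partial r'} = \left\{\frac{\partial\operatorname{vec}A(r)}{\partial r}[a(r)\otimes I_{pG^2}] + \frac{\partial a(r)}{\partial r}A(r)'\right\}K_{G,pG}. \qquad (6.52)$$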
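However, from Eq. (3.7),
$$\operatorname{vec}A(r) = Q\operatorname{vec}E_p' = QK_{n,pG}\operatorname{vec}E_p = QK_{n,pG}C\varepsilon, \qquad (6.53)$$
where $Q = K_{nG}^{\tau_n}\otimes I_{pG}$, so
$$\frac{\partial\operatorname{vec}A(r)}{\partial r} = \frac{\partial\varepsilon}{\partial r}C'K_{pG,n}Q' = -J'(\varepsilon\otimes I_{nG})M(r)^{-1\prime}C'K_{pG,n}Q'. \qquad (6.54)$$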
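Again by using the product rule, we can write
$$\frac{\partial a(r)}{\partial r} = \frac{\partial\operatorname{vec}M(r)^{-1\prime}}{\partial r}[(\Sigma^{-1}\otimes I_n)\varepsilon\otimes I_{nG}] + \frac{\partial(\Sigma^{-1}\otimes I_n)\varepsilon}{\partial r}M(r)^{-1} \qquad (6.55)$$
and, from Eq. (6.45),
$$\frac{\partial\operatorname{vec}M(r)^{-1\prime}}{\partial r} = -J'[M(r)^{-1}\otimes M(r)^{-1\prime}]K_{nG,nG}.$$
This, together with Eq. (6.46) and the properties of the commutation matrix, allows us to write
$$\frac{\partial a(r)}{\partial r} = -J'[M(r)^{-1}\otimes a(r)] - J'(\varepsilon\otimes I_{nG})M(r)^{-1\prime}(\Sigma^{-1}\otimes I_n)M(r)^{-1}. \qquad (6.56)$$
Substituting Eqs. (6.54) and (6.56) into Eq. (6.52) gives
$$\frac{\partial^2 l}{\partial r\partial r'} = -J'(\varepsilon\otimes I_{nG})M(r)^{-1\prime}C'K_{pG,n}Q'[a(r)\otimes I_{pG^2}]K_{G,pG} - J'[M(r)^{-1}\otimes a(r)](I_G\otimes E_p)K_{G,pG} - J'(\varepsilon\otimes I_{nG})M(r)^{-1\prime}(\Sigma^{-1}\otimes I_n)M(r)^{-1}(I_G\otimes E_p)K_{G,pG}. \qquad (6.57)$$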
To write this derivative in terms of commutation matrices as we want to do, we use the properties of $J$ and our theorems concerning $K_{Gn}^{\bar\tau_n}$. To this end, we consider the first matrix on the right-hand side of Eq. (6.57). As $Q' = K_{Gn}^{\bar\tau_n}\otimes I_{pG}$, we write, by using Theorem 3.11,
$$Q'[a(r)\otimes I_{pG^2}] = \{K_{Gn}^{\bar\tau_n}[a(r)\otimes I_G]\}\otimes I_{pG} = a(r)^{\bar\tau_n}\otimes I_{pG}.$$
Then, recalling that $J'(\varepsilon\otimes I_{nG}) = K_{pG,G}(I_G\otimes E_p')$ and again using properties of the commutation matrix, we write this first matrix as
$$-K_{pG,G}(I_G\otimes E_p')M(r)^{-1\prime}C'[I_{pG}\otimes a(r)^{\bar\tau_n}]. \qquad (6.58)$$
Next, the second matrix on the right-hand side of Eq. (6.57) can be written as $-J'[I_{nG}\otimes a(r)]M(r)^{-1}(I_G\otimes E_p)K_{G,pG}$, and from Proposition 6.3 of Subsection 6.3.3.1, we see that this second matrix is just the transpose of the first matrix. Thus, by using Eq. (6.14), we obtain our final expression for our derivative:
$$\frac{\partial^2 l}{\partial r\partial r'} = -K_{pG,G}(I_G\otimes E_p')M(r)^{-1\prime}C'\left\{I_{pG}\otimes([M(r)^{-1}]^{\tau_n})'(I_G\otimes\operatorname{vec}E\Sigma^{-1})\right\} - \left\{I_{pG}\otimes[I_G\otimes(\operatorname{vec}E\Sigma^{-1})'][M(r)^{-1}]^{\tau_n}\right\}CM(r)^{-1}(I_G\otimes E_p)K_{G,pG} - K_{pG,G}(I_G\otimes E_p')M(r)^{-1\prime}(\Sigma^{-1}\otimes I_n)M(r)^{-1}(I_G\otimes E_p)K_{G,pG}.$$
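6.4.6. The Information Matrix $I(\theta) = -\operatorname{plim}\frac{1}{n}\,\partial^2 l/\partial\theta\partial\theta'$

The work required for evaluating some of the probability limits associated with this matrix is described in Appendix 6.A. By using the results of this appendix, we can write the information matrix as
$$I(\theta) = \begin{pmatrix} I_{\delta\delta} & I_{\delta r} & I_{\delta v}\\ I_{r\delta} & I_{rr} & I_{rv}\\ I_{v\delta} & I_{vr} & I_{vv}\end{pmatrix},$$
where
$$I_{\delta\delta} = \operatorname{plim}\frac{1}{n}X^{*\prime}(\Sigma^{-1}\otimes I_n)X^*,$$
$$I_{\delta r} = \operatorname{plim}\frac{1}{n}X^{*\prime}(\Sigma^{-1}\otimes I_n)M(r)^{-1}(I_G\otimes E_p)K_{G,pG} = (I_{r\delta})',$$
$$I_{\delta v} = 0 = (I_{v\delta})', \quad I_{rv} = 0 = (I_{vr})',$$
$$I_{rr} = \operatorname{plim}\frac{1}{n}K_{pG,G}(I_G\otimes E_p')M(r)^{-1\prime}(\Sigma^{-1}\otimes I_n)M(r)^{-1}(I_G\otimes E_p)K_{G,pG}, \qquad (6.59)$$
$$I_{vv} = \frac{1}{2}D'(\Sigma^{-1}\otimes\Sigma^{-1})D. \qquad (6.60)$$
In this appendix, it is also shown that, in order to ensure the existence of $I_{rr}$, we need to assume that $\lim M(r)^{-1\prime}M(r)^{-1}/n$ exists as $n$ tends to infinity and we further evaluate $I_{rr}$. However, for the purposes of deriving statistical inference by using the information matrix, as we do in Subsection 6.4.7, it is sufficient to write $I_{rr}$ as given by Eq. (6.59). For the special case in which $X$ contains no lagged dependent variables, $I_{\delta r} = 0 = (I_{r\delta})'$.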
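6.4.7. The Cramer-Rao Lower Bound $I^{-1}(\theta)$

As $I(\theta)$ is block diagonal, inverting it presents little difficulty. Using the property of commutation matrices that $K_{pG,G}^{-1} = K_{pG,G}' = K_{G,pG}$, if we write

$$I^{-1}(\theta) = \begin{bmatrix} I^{\delta\delta} & I^{\delta r} & I^{\delta v} \\ I^{r\delta} & I^{rr} & I^{rv} \\ I^{v\delta} & I^{vr} & I^{vv} \end{bmatrix},$$

then

$$I^{\delta\delta} = \operatorname{plim}\, n\{X^{*\prime}(\Sigma^{-1}\otimes I_n)X^{*} - X^{*\prime}\mathcal{F}[\mathcal{F}'(\Sigma\otimes I_n)\mathcal{F}]^{-1}\mathcal{F}'X^{*}\}^{-1}, \tag{6.61}$$

$$I^{\delta r} = (I^{r\delta})' = -I^{\delta\delta}\operatorname{plim} X^{*\prime}\mathcal{F}[\mathcal{F}'(\Sigma\otimes I_n)\mathcal{F}]^{-1}K_{G,pG}, \tag{6.62}$$

$$I^{\delta v} = (I^{v\delta})' = 0, \tag{6.63}$$

$$I^{rv} = (I^{vr})' = 0, \tag{6.64}$$

$$I^{rr} = K_{pG,G}\operatorname{plim}\, n\{\mathcal{F}'(\Sigma\otimes I_n)\mathcal{F} - \mathcal{F}'X^{*}[X^{*\prime}(\Sigma^{-1}\otimes I_n)X^{*}]^{-1}X^{*\prime}\mathcal{F}\}^{-1}K_{G,pG}, \tag{6.65}$$

$$I^{vv} = 2LN(\Sigma\otimes\Sigma)NL', \tag{6.66}$$

where $\mathcal{F} = (\Sigma^{-1}\otimes I_n)M(r)^{-1}(I_G\otimes E_p)$. The special case in which $X$ is exogenous and contains no lagged dependent variables is simpler. Here,

$$I^{-1}(\theta) = \operatorname{plim}\, n\begin{bmatrix} [X^{*\prime}(\Sigma^{-1}\otimes I_n)X^{*}]^{-1} & 0 & 0 \\ 0 & K_{pG,G}[\mathcal{F}'(\Sigma\otimes I_n)\mathcal{F}]^{-1}K_{G,pG} & 0 \\ 0 & 0 & 2LN(\Sigma\otimes\Sigma)NL'/n \end{bmatrix}. \tag{6.67}$$

6.4.8. Statistical Inference from the Score Vector and the Information Matrix

Having used our work on the generalized vec and devec operators to assist us in the complicated matrix calculus needed to obtain the score vector and the information matrix, we can now avail ourselves of these latter concepts to derive statistical results for our model in much the same way as we have done for previous models.

6.4.8.1. Efficient Estimation of $\delta$

1. CASE IN WHICH R IS KNOWN. Consider the equation

$$y^{*} = X^{*}\delta + \varepsilon, \tag{6.68}$$

where $y^{*} = M(r)^{-1}y$ and $X^{*} = M(r)^{-1}X$. Clearly this equation satisfies the assumptions of the SURE model without vector moving-average disturbances.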
With R known, we can form $y^{*}$ and $X^{*}$, and an asymptotically efficient estimator of $\delta$ would be the JGLS estimator obtained from Eq. (6.68), that is,

$$\tilde\delta = [X^{*\prime}(\hat\Sigma^{-1}\otimes I_n)X^{*}]^{-1}X^{*\prime}(\hat\Sigma^{-1}\otimes I_n)y^{*}, \tag{6.69}$$

where $\hat\Sigma = \hat E'\hat E/n$, $\hat E = \operatorname{devec}_n\hat\varepsilon$, and $\hat\varepsilon$ is the OLS residual vector. As $\tilde\delta$ is a BAN estimator we have

$$\sqrt{n}(\tilde\delta - \delta) \xrightarrow{d} N(0, V_1),$$

where $V_1$ is the Cramer-Rao lower bound referring to $\delta$. With $r$ known, our unknown parameters would be $(\delta'\ v')'$, and the information matrix for both the case in which $X$ is exogenous and for the case in which $X$ contains lagged dependent variables would be

$$I^{*}\begin{pmatrix}\delta \\ v\end{pmatrix} = \operatorname{plim}\frac{1}{n}\begin{bmatrix} X^{*\prime}(\Sigma^{-1}\otimes I_n)X^{*} & 0 \\ 0 & \dfrac{n}{2}D'(\Sigma^{-1}\otimes\Sigma^{-1})D \end{bmatrix}.$$

The asymptotic covariance matrix of $\tilde\delta$ would then be $V_1 = [\operatorname{plim} X^{*\prime}(\Sigma^{-1}\otimes I_n)X^{*}/n]^{-1}$.

2. CASE IN WHICH R IS UNKNOWN. The estimator $\tilde\delta$ is not available to us in the more realistic case in which R is unknown. However, once a consistent estimator $\hat r$ is obtained, we can form $\hat X^{*} = M(\hat r)^{-1}X$, $\hat y^{*} = M(\hat r)^{-1}y$, and

$$\hat\delta = [\hat X^{*\prime}(\hat\Sigma^{-1}\otimes I_n)\hat X^{*}]^{-1}\hat X^{*\prime}(\hat\Sigma^{-1}\otimes I_n)\hat y^{*}. \tag{6.70}$$

As with the autoregressive case, the estimator $\hat\delta$ is asymptotically efficient for both the case in which $X$ is exogenous and for the case in which $X$ contains lagged dependent variables, but the efficiency of the estimator differs for the two cases. Consider the case in which $X$ is exogenous. As $\hat\delta$ is a BAN estimator,

$$\sqrt{n}(\hat\delta - \delta) \xrightarrow{d} N(0, V_2),$$

where $V_2$ is the appropriate Cramer-Rao lower bound obtained from $I^{-1}(\theta)$ given by Eq. (6.67), that is,

$$V_2 = V_1 = [\operatorname{plim} X^{*\prime}(\Sigma^{-1}\otimes I_n)X^{*}/n]^{-1}.$$

This means that the JGLS estimator $\hat\delta$ with unknown R is as asymptotically efficient as the JGLS estimator $\tilde\delta$ with known R. Not knowing R costs us nothing in terms of asymptotic efficiency.
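The two-step JGLS calculation described above can be sketched numerically. The following is only an illustration under stated assumptions: the data are simulated, the function name `jgls` is hypothetical, and the transformed quantities $y^{*}$ and $X^{*}$ are taken as already formed (here with $M(r) = I$, so that no moving-average transformation is applied).

```python
import numpy as np

def jgls(y_star, X_star, Sigma_hat):
    """Return [X*'(S^-1 kron I_n) X*]^-1 X*'(S^-1 kron I_n) y* for a stacked system."""
    G = Sigma_hat.shape[0]
    n = y_star.shape[0] // G
    omega_inv = np.kron(np.linalg.inv(Sigma_hat), np.eye(n))
    XtO = X_star.T @ omega_inv
    return np.linalg.solve(XtO @ X_star, XtO @ y_star)

# Simulated two-equation system (illustrative values only, not from the text).
rng = np.random.default_rng(0)
n, G = 50, 2
X1, X2 = rng.normal(size=(n, 2)), rng.normal(size=(n, 3))
X_star = np.block([[X1, np.zeros((n, 3))],
                   [np.zeros((n, 2)), X2]])        # block-diagonal regressors
delta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
E = rng.multivariate_normal(np.zeros(G), Sigma, size=n)   # n x G disturbances
y_star = X_star @ delta + E.T.reshape(-1)          # vec E stacks the columns of E

# Step 1: OLS residuals, rearranged into an n x G matrix (the devec step).
ols = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
E_hat = (y_star - X_star @ ols).reshape(G, n).T
Sigma_hat = E_hat.T @ E_hat / n

# Step 2: feasible JGLS with the estimated covariance matrix.
delta_jgls = jgls(y_star, X_star, Sigma_hat)
```

Forming the $nG \times nG$ Kronecker product explicitly is convenient for a small sketch; for large $n$ one would exploit the Kronecker structure instead of building the full matrix.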
Next consider the case in which $X$ contains lagged dependent variables. For this case $I^{-1}(\theta)$ is given by Eqs. (6.61)-(6.66), so the asymptotic covariance matrix of $\hat\delta$ is now

$$V_2 = I^{\delta\delta} = \operatorname{plim}\, n\{X^{*\prime}(\Sigma^{-1}\otimes I_n)X^{*} - X^{*\prime}\mathcal{F}[\mathcal{F}'(\Sigma\otimes I_n)\mathcal{F}]^{-1}\mathcal{F}'X^{*}\}^{-1}.$$

As with the autoregressive case it is easily seen that $V_2 - V_1$ is positive semidefinite, so now $\hat\delta$ is less efficient asymptotically than $\tilde\delta$. Not knowing R now costs us in terms of asymptotic efficiency.

6.4.8.2. Maximum-Likelihood Estimators as Iterative Joint-Generalized-Least-Squares Estimators?

Interestingly enough, a similar interpretation of the MLE of $\delta$ as obtained in the autoregressive case does not seem to be available to us for this case. Consider the score vector for this model. Solving $\partial l/\partial\delta = 0$ and $\partial l/\partial v = 0$ gives

$$\tilde\delta = [X^{*\prime}(\Sigma^{-1}\otimes I_n)X^{*}]^{-1}X^{*\prime}(\Sigma^{-1}\otimes I_n)y^{*}, \tag{6.71}$$

and $\tilde\Sigma = E'E/n$, as expected, but problems arise when we attempt to extract $r$ from the equation $\partial l/\partial r = 0$. Unlike the autoregressive case, this equation is highly nonlinear in $r$, involving as it does $M(r)^{-1}$. Notwithstanding this, Eq. (6.71) clearly points to the estimator $\hat\delta$ given by Eq. (6.70).

6.4.8.3. Lagrangian Multiplier Test Statistic $H_0: r = 0$

The analysis of Subsection 6.3.8.3 with respect to the null hypothesis $H_0: \sigma_{ij} = 0$, $i \ne j$, can be carried over to this model with $X^{*} = M(r)^{-1}X$ and $y^{*} = M(r)^{-1}y$ in place of $X^{d}$ and $y^{d}$, respectively. However, as noted in that section, a good diagnostic procedure would be first to test $H_0: r = 0$ and, if the null hypothesis is accepted under this test, then to use the Breusch-Pagan LMT statistic to test $H_0: \sigma_{ij} = 0$, $i \ne j$. It is the former null hypothesis, $H_0: r = 0$, that we turn our attention to now. What we show is that the LMT statistic for the vector moving-average disturbance case before us is the same test statistic as that developed for the preceding vector autoregressive disturbances model. It follows then that the LMT statistic is incapable of distinguishing between the two disturbance systems.3 We do this by noting that with $r = 0$, $M(r) = I_{nG}$, $X^{*} = X^{d} = X$, $U = E$, $u = \varepsilon$, $U_p = E_p$, $\operatorname{plim} E_p'E_p/n = \operatorname{plim} U_p'U_p/n = I_p\otimes\Sigma$, and $\mathcal{F} = \Sigma^{-1}\otimes U_p$, so for both models

$$I^{rr}\big|_{r=0} = K_{pG,G}\{\Sigma^{-1}\otimes I_p\otimes\Sigma - (\Sigma^{-1}\otimes U_p')X[X'(\Sigma^{-1}\otimes I_n)X]^{-1}X'(\Sigma^{-1}\otimes U_p)\}^{-1}K_{G,pG},$$

$$\frac{\partial l}{\partial r}\bigg|_{r=0} = \pm K_{pG,G}(\Sigma^{-1}\otimes U_p')u.$$

3 This result is a generalization of the result obtained by Godfrey (1978a) for the linear-regression model discussed in Subsection 5.3.5.3.
It follows then that the LMT statistic for $H_0: r = 0$ is the same for both models, for both the case in which $X$ is exogenous and for the case in which $X$ contains lagged dependent variables.

APPENDIX 6.A. PROBABILITY LIMITS ASSOCIATED WITH THE INFORMATION MATRIX OF THE MODEL WITH MOVING-AVERAGE DISTURBANCES

$\operatorname{plim}(1/n)\,\partial^2 l/\partial\delta\,\partial r'$: Recalling that $X^{*} = M(r)^{-1}X$ and writing $M(r)^{-1} = (M_1 \cdots M_G)$, where each submatrix $M_i$ is $nG \times n$, we have

$$X^{*\prime}(\Sigma^{-1}\otimes I_n)M(r)^{-1}(I_G\otimes E_p) = \begin{bmatrix} X_1'M_1'(\Sigma^{-1}\otimes I_n)M_1E_p & \cdots & X_1'M_1'(\Sigma^{-1}\otimes I_n)M_GE_p \\ \vdots & & \vdots \\ X_G'M_G'(\Sigma^{-1}\otimes I_n)M_1E_p & \cdots & X_G'M_G'(\Sigma^{-1}\otimes I_n)M_GE_p \end{bmatrix}.$$
Clearly if each X, is truly exogenous then plim X:111;(E-1 01p)Mi Epin= 0. However, if X, contains lagged dependent variables this probability limit will not be the null matrix. Consider now X*'ClI pG at") = X*' (IG 0 Sat
1G 0 S a f^).
(6.A.1)
We consider the first matrix on the right-hand side of Eq. (6.A.1) as typical, and we use the notation of Subsection 6.4.2 to write, say, Mil
•••
Mic
MG1
• • •
MGG
M(r)-1' = [
.A4 = MG
Then X;M il S;a1-^ X*VG
X;M iG S;a1"
Si al") =
(6.A.2) X G/ MGIS;a1-^
X/G MGG S;a f^
Again, we take the matrix in the (1, 1) position of the right-hand side of Eq. (6.A.2) as typical. Now, under our notation a f^ = (M i vecEE-1-MG vec EE-1),
(6.A.3)
Seemingly Unrelated Regression Equations Models SO X.A.4 11 Sa f" = X.A/111Si(A/11 vec EE -1—MG vec EE-1). However, X1.A/111SA/11 vec EE -1= X.A/111SA/11(E-1 0In )E GG =
MliE j), i=1 j=1
where E -1= {crij }. Therefore, in evaluating p lira X 'C'(/PG 0a f.)In, we are typically looking at p lim ri M11.51Mii s j /n. Now, in Subsection 6.4.2, we saw that Mijis upper triangular and Mir , i j, is strictly upper triangular so Si Mii is strictly upper triangular. It follows As I n is the that A = A/111S'iA/1 1 t is strictly upper triangular and so p lim null vector even if X1contains lagged dependent variables. We conclude that, regardless of whether X contains lagged dependent variables, p lim X'Ci (lpG a l")1n = 0. p lim(11n)82118r Or' p lim(11n)K pG 4(1 G ep)M(r)-1' (I pG a5-.): We wish to show that this probability limit is the null matrix. We do this by proving that the probability limit of a typical element of the matrix is zero. To this end, we consider / EA\ EiG S1
E' E = (I p E')S' =
= Eii S \EG S197
so, by the property of the commutation matrix given in Subsection 3.3.2, 4,51\
/
IG
EG S;
IG
Eii S
po,o(lo E'p )=
\IG
p/ EiGSi
It follows then that the submatrix in the (1, 1) position of the matrix we are considering is p lim (11 n)(IG 0EilSOM (0-1' (IG 0 Slat.) and the row vector in the (1, 1) position is p lim (1 I n)Ei Si.MiiSiat. Now by using Eq. (6.A.3), we find that the first element in this vector is 1 1 plim - EAMiiSMi(E -1 0 In )E = p lim
GG
EEcy i=1 j-i
The typical element of the matrix we have in hand is then cri j p lim E:S;MkiS; Msj E j ln. In Subsection 6.4.2, we saw that each Muis upper triangular and each M11, i j, is strictly upper triangular. It follows from the properties of shifting matrices that Sk Mijis strictly upper triangular for all k, i, j so the matrix is the quadratic form of our p lim being the product of strictly uppertriangular matrices is also strictly upper triangular. We conclude then that the probability limit of a typical element of our matrix is zero. p lim (1 n)K pG,G(IG E PM(r)-1'(E -1l„)M(r)-1(IG E p)K G ,pG: It is more convenient in this subsection to consider the limits of expectations rather than to work with probability limits. We have seen that we can write (IG O Ep)KG,pG = (IG
O 51E1- IG O S1EG IG 0 SpEi IG 0 SpEG),
so we need to consider the limit as $n$ tends to infinity of the expectation of

$$\frac{1}{n}(I_G\otimes\varepsilon_i'S_l')M(r)^{-1\prime}(\Sigma^{-1}\otimes I_n)M(r)^{-1}(I_G\otimes S_k\varepsilon_j) \tag{6.A.4}$$

for $i, j = 1, \ldots, G$ and $l, k = 1, \ldots, p$. The submatrix in the $(r, s)$ block position of expression (6.A.4) is

$$\frac{1}{n}\varepsilon_i'S_l'M_r(\Sigma^{-1}\otimes I_n)M_s'S_k\varepsilon_j = \frac{1}{n}\sum_{x=1}^{G}\sum_{y=1}^{G}\sigma^{xy}\varepsilon_i'S_l'M_{rx}M_{sy}'S_k\varepsilon_j$$

for $r, s = 1, \ldots, G$, so typically we are looking at

$$\lim\frac{1}{n}E(\varepsilon_i'S_l'M_{rx}M_{sy}'S_k\varepsilon_j) = \sigma_{ij}\lim\frac{1}{n}\operatorname{tr} S_l'M_{rx}M_{sy}'S_k.$$

We need to assume that such limits exist, so it will pay us to look at the nature of these traces. Recall from Subsection 3.7.4.5 that if $l = k$, $\operatorname{tr} S_l'M_{rx}M_{sy}'S_k = \operatorname{tr} M_{rx}M_{sy}'$ minus the sum of the first $l$ elements in the main diagonal of $M_{rx}M_{sy}'$; if $l < k$, $\operatorname{tr} S_l'M_{rx}M_{sy}'S_k$ is the sum of the elements of the $(k - l)$th diagonal above the main diagonal of $M_{rx}M_{sy}'$ minus the sum of the first $l$ elements of this diagonal;
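and if $l > k$, $\operatorname{tr} S_l'M_{rx}M_{sy}'S_k$ is the sum of the elements of the $(l - k)$th diagonal below the main diagonal of $M_{rx}M_{sy}'$ minus the sum of the first $k$ elements of this diagonal. It remains for us to consider the nature of the matrix $M_{rx}$. Recall from Subsection 6.4.2 that $M_{rx}$ is of the form

$$I_n + a_1S_1 + \cdots + a_{n-1}S_{n-1} \quad\text{for } r = x,$$

or

$$b_1S_1 + \cdots + b_{n-1}S_{n-1} \quad\text{for } r \ne x,$$

for suitable $a_i$'s and $b_i$'s. We now have some insight into the nature of the limits whose existence we need to assume. Suppose we make the initial assumption that

$$\lim_{n\to\infty} M(r)^{-1\prime}M(r)^{-1}/n$$

exists. Then this implies that

$$\lim_{n\to\infty} M_{rx}M_{sy}'/n$$

exists and hence

$$\lim_{n\to\infty}\operatorname{tr} S_l'M_{rx}M_{sy}'S_k/n$$

exists. Making this initial assumption, we let

$$C_{lk} = \lim_{n\to\infty}\frac{1}{n}\begin{bmatrix} \operatorname{tr} S_l'M_1(\Sigma^{-1}\otimes I_n)M_1'S_k & \cdots & \operatorname{tr} S_l'M_1(\Sigma^{-1}\otimes I_n)M_G'S_k \\ \vdots & & \vdots \\ \operatorname{tr} S_l'M_G(\Sigma^{-1}\otimes I_n)M_1'S_k & \cdots & \operatorname{tr} S_l'M_G(\Sigma^{-1}\otimes I_n)M_G'S_k \end{bmatrix} = C_{kl}'$$

for $l, k = 1, \ldots, p$. Then we can write

$$I_{rr} = \begin{bmatrix} \Sigma\otimes C_{11} & \cdots & \Sigma\otimes C_{1p} \\ \vdots & & \vdots \\ \Sigma\otimes C_{p1} & \cdots & \Sigma\otimes C_{pp} \end{bmatrix}.$$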
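The trace rules recalled from Subsection 3.7.4.5 in this appendix can be checked numerically. The sketch below assumes that the shifting matrix $S_k$ is the $n \times n$ matrix with ones on the $k$th diagonal below the main diagonal and zeros elsewhere; that definition, and the helper names `shift` and `trace_rule`, are assumptions made for illustration. An arbitrary matrix `A` stands in for $M_{rx}M_{sy}'$.

```python
import numpy as np

def shift(n, k):
    """Assumed shifting matrix S_k: ones on the k-th subdiagonal."""
    return np.eye(n, k=-k)

def trace_rule(A, l, k):
    """tr(S_l' A S_k) computed from the diagonals of A, following the three cases."""
    if l <= k:
        # (k - l)-th diagonal above the main diagonal (the main diagonal if l == k),
        # dropping its first l elements.
        return np.diag(A, k - l)[l:].sum()
    # (l - k)-th diagonal below the main diagonal, dropping its first k elements.
    return np.diag(A, -(l - k))[k:].sum()

rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n))          # stands in for M_rx M_sy'

# Largest absolute discrepancy between the direct trace and the diagonal rule.
max_gap = max(
    abs(np.trace(shift(n, l).T @ A @ shift(n, k)) - trace_rule(A, l, k))
    for l in range(1, 4) for k in range(1, 4)
)
```

Under the assumed definition, element $(i, i)$ of $S_l'AS_k$ is $a_{i+l,\,i+k}$, which is why each trace reduces to a partial sum along a single diagonal of $A$.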
7
Linear Simultaneous Equations Models
7.1. INTRODUCTION
The most complicated statistical models we consider are variations of the linear simultaneous equations (LSE) model, the statistical model that lies behind linear economic models. The complication that the standard LSE model adds to the standard SURE model is that, in the former model, current values of some of the right-hand variables of our equations must be regarded as random variables correlated with the current value of the disturbance term. Suppose we write the $j$th equation of the standard LSE model as

$$y_j = H_j\delta_j + u_j, \qquad j = 1, \ldots, G,$$

where $y_j$ and $u_j$ are $n \times 1$ random vectors and $H_j$ is the matrix of observations on the right-hand variables of this equation. We partition $H_j$ as follows: $H_j = (Y_j\ X_j)$. The variables in $X_j$ are statistically the equivalent of the $X_j$'s on the right-hand sides of the equations in the SURE model. The variables in $Y_j$ are those contemporaneously correlated with the disturbance term, so the elements in the $t$th row of $Y_j$ are correlated with $u_{tj}$, the $t$th element of $u_j$. Now, even if we regard the elements of the disturbance vector $u_j$ as being statistically independent random variables, we still have $E(Y_j'u_j) \ne 0$ and asymptotically $\operatorname{plim} Y_j'u_j/n \ne 0$. Econometricians have traditionally solved the problem of right-hand variables that are contemporaneously correlated with the disturbance term by forming instrumental variable estimators (IVEs). As we shall have quite a lot to do with such estimators in this chapter, it will pay us to examine briefly two generic types of IVEs developed by Bowden and Turkington (1984). We consider a statistical model that is broad enough to encompass the LSE model and variations of this model. We write an equation system as

$$y = H\delta + u, \tag{7.1}$$
where $y$ and $u$ are $nG \times 1$ random vectors and $u$ has an expectation equal to the null vector, and we suppose some of the variables forming the $nG \times l$ data matrix $H$ are correlated with the disturbance vector in the sense that $\operatorname{plim} H'u/n \ne 0$. Suppose we further assume that the covariance matrix of $u$ is a positive-definite nonscalar $nG \times nG$ matrix $V$ and that there exists an $nG \times q$ matrix $Z$, with $q > l$ but not dependent on $n$, available to form instrumental variables for $H$. The following requirements are the essential asymptotic requirements for $Z$:

1. $\operatorname{plim} Z'H/n$ exists but is not equal to the null matrix,
2. $\operatorname{plim} Z'u/n = 0$.

Bowden and Turkington (1984) proposed two generic IVEs for $\delta$ that may be motivated as follows.

THE IV-GLS ESTIMATOR. Equation system (7.1) has two statistical problems associated with it, namely the nonscalar covariance matrix $V$ of the disturbance term and the correlation of right-hand variables with this disturbance term. Suppose we deal with the former problem first by premultiplying Eq. (7.1) by the nonsingular matrix $P$, where $P'P = V^{-1}$, to obtain

$$Py = PH\delta + Pu. \tag{7.2}$$

The disturbance term of this transformed equation has a scalar covariance matrix, but we are still left with the problem of right-hand variables correlated with this disturbance vector. Suppose now we form an IV for $PH$ in this equation by regressing $PH$ on $PZ$ to get $\widehat{PH} = PZ(Z'V^{-1}Z)^{-1}Z'V^{-1}H$. Using $\widehat{PH}$ as an IV for $PH$ in Eq. (7.2) gives the estimator

$$\bar\delta_1 = [H'V^{-1}Z(Z'V^{-1}Z)^{-1}Z'V^{-1}H]^{-1}H'V^{-1}Z(Z'V^{-1}Z)^{-1}Z'V^{-1}y.$$

Usually $V$ is unknown so $\bar\delta_1$ is not available to us. However, suppose the elements of $V$ are functions of the elements of an $r \times 1$ vector $\rho$ so we write $V = V(\rho)$, where $r$ is not a function of $n$. Suppose further that it is possible to obtain a consistent estimator $\hat\rho$ of $\rho$ and let $\hat V = V(\hat\rho)$. Then the IV-GLS estimator is

$$\hat\delta_1 = [H'\hat V^{-1}Z(Z'\hat V^{-1}Z)^{-1}Z'\hat V^{-1}H]^{-1}H'\hat V^{-1}Z(Z'\hat V^{-1}Z)^{-1}Z'\hat V^{-1}y. \tag{7.3}$$

Such an estimator has also been proposed by White (1984).

THE IV-OLS ESTIMATOR. Suppose that in Eq. (7.1) we deal with the second econometric problem first and attempt to break the correlation of right-hand variables with the disturbance vector by premultiplying Eq. (7.1) by $Z'$, obtaining

$$Z'y = Z'H\delta + Z'u. \tag{7.4}$$
Equation (7.4) still has a disturbance vector whose covariance matrix is nonscalar, but we can deal with this by applying a GLS estimator to the equation. Replacing the unknown $V$ in this GLS estimator by $\hat V$ gives the IV-OLS estimator

$$\hat\delta_2 = [H'Z(Z'\hat VZ)^{-1}Z'H]^{-1}H'Z(Z'\hat VZ)^{-1}Z'y. \tag{7.5}$$
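The two generic estimators of Eqs. (7.3) and (7.5) translate almost line by line into matrix code. The sketch below is an illustration only: the function names `iv_gls` and `iv_ols` are hypothetical, the data are simulated without any economic content, and the argument `V` stands for whatever estimate $\hat V$ is in hand.

```python
import numpy as np

def iv_gls(y, H, Z, V):
    """IV-GLS as in Eq. (7.3); V is the (estimated) disturbance covariance matrix."""
    Vi_Z = np.linalg.solve(V, Z)                 # V^-1 Z
    A = H.T @ Vi_Z                               # H' V^-1 Z
    M = Z.T @ Vi_Z                               # Z' V^-1 Z
    AMi = np.linalg.solve(M, A.T).T              # H' V^-1 Z (Z' V^-1 Z)^-1
    return np.linalg.solve(AMi @ A.T, AMi @ (Vi_Z.T @ y))

def iv_ols(y, H, Z, V):
    """IV-OLS as in Eq. (7.5): GLS applied to Z'y = Z'H delta + Z'u."""
    A = H.T @ Z                                  # H'Z
    M = Z.T @ V @ Z                              # Z'VZ
    AMi = np.linalg.solve(M, A.T).T              # H'Z (Z'VZ)^-1
    return np.linalg.solve(AMi @ A.T, AMi @ (Z.T @ y))

# Simulated inputs (illustrative only): 40 stacked observations, l = 2, q = 3.
rng = np.random.default_rng(0)
N, l, q = 40, 2, 3
Z = rng.normal(size=(N, q))
H = Z @ rng.normal(size=(q, l)) + 0.5 * rng.normal(size=(N, l))
y = H @ np.array([1.0, -1.0]) + rng.normal(size=N)
C = rng.normal(size=(N, N))
V = C @ C.T + N * np.eye(N)                      # a positive-definite nonscalar V

d1 = iv_gls(y, H, Z, V)
d2 = iv_ols(y, H, Z, V)
```

Both functions rely on $V$ being symmetric. In general the two estimates differ; a simple sanity check is that with $Z = H$ and $V = I$ both reduce to ordinary least squares.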
Such estimators are used throughout this chapter. As in Chap. 6 on the SURE model, in this chapter we consider three LSE models: First we consider the basic LSE model, then the LSE model in which we assume the disturbances are subject to a vector autoregressive system, and finally the LSE model in which the disturbances are subject to a vector moving-average system. For each version of the model, we apply classical statistical procedures, drawing heavily, as always, on our rules of matrix calculus to do this. Because readers may not be as familiar with these models as they are with the previous models considered, I have broken our tradition and have largely reinstated the asymptotic analysis to the main text.

7.2. THE STANDARD LINEAR SIMULTANEOUS EQUATIONS MODEL

7.2.1. The Model and Its Assumptions

We consider a complete system of $G$ linear stochastic structural equations in $G$ jointly dependent current endogenous variables and $k$ predetermined variables. The $i$th equation is written as

$$y_i = Y_i\beta_i + X_i\gamma_i + u_i = H_i\delta_i + u_i, \qquad i = 1, \ldots, G,$$

where $y_i$ is an $n \times 1$ vector of sample observations on one of the current endogenous variables, $Y_i$ is an $n \times G_i$ matrix of observations on the other $G_i$ current endogenous variables in the $i$th equation, $X_i$ is an $n \times k_i$ matrix of observations on the $k_i$ predetermined variables in the equation, $u_i$ is an $n \times 1$ vector of random disturbances, $H_i$ is the $n \times (G_i + k_i)$ matrix $(Y_i\ X_i)$, and $\delta_i$ is the $(G_i + k_i) \times 1$ vector $(\beta_i'\ \gamma_i')'$. The usual statistical assumptions are placed on the random disturbances. The expectation of $u_i$ is the null vector, and it is assumed that disturbances are contemporaneously correlated so that $E(u_{si}u_{tj}) = 0$ for periods $s \ne t$, $E(u_{si}u_{tj}) = \sigma_{ij}$ for $s = t$; $i, j = 1, \ldots, G$, that is, $E(u_iu_j') = \sigma_{ij}I_n$. We write our model and assumptions more succinctly as

$$y = H\delta + u, \tag{7.6}$$

$$E(u) = 0, \qquad V(u) = \Sigma\otimes I_n, \qquad u \sim N(0, \Sigma\otimes I_n),$$

where

$$y = (y_1' \cdots y_G')', \qquad u = (u_1' \cdots u_G')', \qquad \delta = (\delta_1' \cdots \delta_G')',$$
$H$ is the block diagonal matrix

$$H = \begin{bmatrix} H_1 & & 0 \\ & \ddots & \\ 0 & & H_G \end{bmatrix},$$

and $\Sigma$ is the symmetric matrix, assumed to be positive definite so that $\Sigma^{-1}$ exists, whose $(i, j)$th element is $\sigma_{ij}$. We assume that each equation in the model is identifiable by a priori restrictions, that the $n \times k$ matrix $X$ of all predetermined variables has rank $k$, and that $\operatorname{plim}(X'u_i/n) = 0$ for all $i$. Finally, we assume that $H_i$ has full column rank and that $\operatorname{plim} H_i'H_j/n$ exists for all $i$ and $j$. A different way of writing our model is

$$YB + X\Gamma = U, \tag{7.7}$$

where $Y$ is the $n \times G$ matrix of observations on the $G$ current endogenous variables, $X$ is the $n \times k$ matrix on the $k$ predetermined variables, $B$ is the $G \times G$ matrix of coefficients of the endogenous variables in our equations, $\Gamma$ is the $k \times G$ matrix of coefficients of the exogenous variables in our equations, and $U$ is the $n \times G$ matrix $(u_1 \cdots u_G)$. It follows that some of the elements of $B$ are known a priori to be equal to one or zero, as $y_i$ has a coefficient of one in the $i$th equation and some endogenous variables are excluded from certain equations. Similarly, some of the elements of $\Gamma$ are known a priori to be zero, as certain predetermined variables are excluded from each equation. We assume that $B$ is nonsingular. Note that $y = \operatorname{vec} Y$ and $u = \operatorname{vec} U$. Equation (7.6) or Eq. (7.7) is often called the structural form of the model. We obtain the reduced form of the model from Eq. (7.7) by solving for $Y$ to get

$$Y = -X\Gamma B^{-1} + UB^{-1} = X\Pi + V, \tag{7.8}$$

where $\Pi = -\Gamma B^{-1}$ is the matrix of reduced-form parameters and $V = UB^{-1}$ is the matrix of reduced-form disturbances. Taking the vecs of both sides of Eq. (7.8), we obtain

$$y = (I_G\otimes X)\pi + v, \tag{7.9}$$

where $\pi = \operatorname{vec}\Pi$ and $v = \operatorname{vec} V = (B^{-1\prime}\otimes I_n)u$. Note that the covariance matrix of the reduced-form disturbances is

$$V(v) = B^{-1\prime}\Sigma B^{-1}\otimes I_n.$$
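The passage from the structural form (7.7) to the vectorized reduced form (7.9) rests on the rule $\operatorname{vec}(ABC) = (C'\otimes A)\operatorname{vec} B$. A small numerical sketch, with arbitrary simulated matrices and the hypothetical helper `vec`, illustrates both the reduced form and the covariance matrix of its disturbances.

```python
import numpy as np

def vec(A):
    """Stack the columns of A into one vector."""
    return A.reshape(-1, order="F")

rng = np.random.default_rng(1)
n, G, k = 6, 3, 2
B = np.eye(G) + 0.1 * rng.normal(size=(G, G))    # assumed nonsingular for this draw
Gamma = rng.normal(size=(k, G))
X = rng.normal(size=(n, k))
U = rng.normal(size=(n, G))

B_inv = np.linalg.inv(B)
Pi = -Gamma @ B_inv                               # reduced-form parameters
V = U @ B_inv                                     # reduced-form disturbances
Y = X @ Pi + V                                    # Eq. (7.8)

# Eq. (7.9): y = (I_G kron X) pi + v with v = (B^-1' kron I_n) u
y_from_vec = np.kron(np.eye(G), X) @ vec(Pi) + np.kron(B_inv.T, np.eye(n)) @ vec(U)

# Covariance of v when V(u) = Sigma kron I_n
Sigma = np.array([[2.0, 0.5, 0.3], [0.5, 1.0, 0.2], [0.3, 0.2, 1.5]])
T = np.kron(B_inv.T, np.eye(n))
cov_v = T @ np.kron(Sigma, np.eye(n)) @ T.T
cov_v_closed = np.kron(B_inv.T @ Sigma @ B_inv, np.eye(n))
```

The last two lines compute the same matrix in two ways, by direct transformation of $\Sigma\otimes I_n$ and by the closed form $B^{-1\prime}\Sigma B^{-1}\otimes I_n$.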
7.2.2. Parameters of the Model and the Log-Likelihood Function

The unknown parameters of our model are $\theta = (\delta'\ v')'$, where $v = \operatorname{vech}\Sigma$. Here the log-likelihood function takes a little more work to obtain than that of the previous models we have looked at. Our sample point is the vector $y$, and thus the likelihood function is the joint probability density function of $y$. We obtain this function by starting with the joint probability density of $u$. We have assumed that $u \sim N(0, \Sigma\otimes I_n)$, so the joint probability density function of $y$ is

$$f(y) = |J|\,\frac{1}{(2\pi)^{nG/2}(\det\Sigma\otimes I_n)^{1/2}}\exp\left[-\frac{1}{2}u'(\Sigma^{-1}\otimes I_n)u\right],$$

with $u$ set equal to $y - H\delta$ and where $|J|$ is the absolute value of the Jacobian

$$J = \det\frac{\partial u}{\partial y}.$$

Our first application of matrix calculus to this model involves working out this Jacobian. Taking the vec of both sides of $U = YB + X\Gamma$, we have $u = (B'\otimes I_n)y + (\Gamma'\otimes I_n)x$, where $u = \operatorname{vec} U$, $y = \operatorname{vec} Y$, and $x = \operatorname{vec} X$. It follows that

$$\frac{\partial u}{\partial y} = (B\otimes I_n),$$

$$f(y) = \frac{|\det(B\otimes I_n)|}{(2\pi)^{nG/2}(\det\Sigma\otimes I_n)^{1/2}}\exp\left[-\frac{1}{2}u'(\Sigma^{-1}\otimes I_n)u\right].$$

However, from the properties of the determinant of a Kronecker product we have $\det(\Sigma\otimes I_n) = (\det\Sigma)^n$, and similarly $\det(B\otimes I_n) = (\det B)^n$, so

$$f(y) = \frac{|\det B|^{n}}{(2\pi)^{nG/2}(\det\Sigma)^{n/2}}\exp\left[-\frac{1}{2}u'(\Sigma^{-1}\otimes I_n)u\right],$$

with $u$ set equal to $y - H\delta$ in this expression. This is the likelihood function $L(\theta)$. The log-likelihood function, apart from a constant, is

$$l(\theta) = n\log|\det B| - \frac{n}{2}\log\det\Sigma - \frac{1}{2}u'(\Sigma^{-1}\otimes I_n)u,$$

with $u$ set equal to $y - H\delta$. An alternative way of writing this function is

$$l(\theta) = n\log|\det B| - \frac{n}{2}\log\det\Sigma - \frac{1}{2}\operatorname{tr}\Sigma^{-1}U'U, \tag{7.10}$$

where $U$ is set equal to $YB + X\Gamma$. Comparing Eq. (7.10) with the log-likelihood function of the standard SURE model given by Eq. (6.2) in Subsection 6.2.1, we see we now have an additional component, namely $n\log|\det B|$, and it is this additional component that makes
the classical statistical analysis of the LSE model that much more difficult than that for the SURE model. This extra term is of course a function of $\delta$ but not of $v$. It follows then that our derivatives of the log-likelihood function with respect to $v$ will be the same as those derived for the basic SURE model in Section 6.2. What changes are the derivatives of $l(\theta)$ with respect to $\delta$. This being the case, we find that the concentrated log likelihood $l^{*}(\delta)$, obtained when $\delta$ is the vector of parameters of primary interest, is given by substituting $\Sigma = U'U/n$ into the log-likelihood function as we did in Subsection 6.2.1. From Equation (7.10) we get, apart from a constant,

$$l^{*}(\delta) = n\log|\det B| - \frac{n}{2}\log\det\tilde\Sigma,$$

where $\tilde\Sigma$ is set equal to $U'U/n$. Comparing this with the corresponding concentrated function of the SURE model given by Eq. (6.4), we see again that we have the additional component $n\log|\det B|$.

7.2.3. The Derivative $\partial\log|\det B|/\partial\delta$

The extra term in the log-likelihood functions gives rise to a new derivative that must be considered, namely $\partial\log|\det B|/\partial\delta$, and this derivative is obtained in this subsection. Our first task is to express matrix $B$ of Eq. (7.7) in terms of $\delta$ of Eq. (7.6). To this end, we write the $i$th equation of our model as

$$y_i = Y\bar W_i\beta_i + X\bar T_i\gamma_i + u_i,$$

where $\bar W_i$ and $\bar T_i$ are $G \times G_i$ and $k \times k_i$ selection matrices, respectively, with the properties that

$$Y\bar W_i = Y_i, \qquad X\bar T_i = X_i.$$

Alternatively we can write

$$y_i = YW_i\delta_i + XT_i\delta_i + u_i,$$

where $W_i$ and $T_i$ are the $G \times (G_i + k_i)$ and $k \times (G_i + k_i)$ selection matrices given by $W_i = (\bar W_i\ 0)$ and $T_i = (0\ \bar T_i)$, respectively. Under this notation, we can write

$$Y = (y_1 \cdots y_G) = Y(W_1\delta_1 \cdots W_G\delta_G) + X(T_1\delta_1 \cdots T_G\delta_G) + U.$$

It follows then that

$$B = I_G - (W_1\delta_1 \cdots W_G\delta_G), \qquad \Gamma = -(T_1\delta_1 \cdots T_G\delta_G). \tag{7.11}$$

Moreover,

$$\operatorname{vec} B = \operatorname{vec} I_G - W\delta, \tag{7.12}$$
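where $W$ is the block diagonal matrix

$$W = \begin{bmatrix} W_1 & & 0 \\ & \ddots & \\ 0 & & W_G \end{bmatrix}.$$

Returning to our derivative now, clearly from Eq. (7.12)

$$\frac{\partial\operatorname{vec} B}{\partial\delta} = -W', \tag{7.13}$$

and as

$$\frac{\partial\log|\det B|}{\partial\delta} = \frac{\partial\operatorname{vec} B}{\partial\delta}\,\frac{\partial\log|\det B|}{\partial\operatorname{vec} B},$$

we obtain

$$\frac{\partial\log|\det B|}{\partial\delta} = -W'\operatorname{vec}(B^{-1\prime}). \tag{7.14}$$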
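Equation (7.14) can be checked against finite differences. The sketch below builds the selection matrices $W_i = (\bar W_i\ 0)$ for a hypothetical three-equation system (the inclusion pattern and the value of $\delta$ are assumptions chosen only for illustration) and compares $-W'\operatorname{vec}(B^{-1\prime})$ with a central-difference gradient of $\log|\det B|$.

```python
import numpy as np

def block_diag(*blocks):
    """Place the given matrices down the diagonal of a zero matrix."""
    out = np.zeros((sum(b.shape[0] for b in blocks), sum(b.shape[1] for b in blocks)))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r, c = r + b.shape[0], c + b.shape[1]
    return out

G = 3
I = np.eye(G)
# Hypothetical inclusion pattern; each W_i is G x (G_i + k_i).
W1 = np.hstack([I[:, [1]], np.zeros((G, 2))])      # eq. 1: y2 plus two predetermined variables
W2 = np.hstack([I[:, [0, 2]], np.zeros((G, 1))])   # eq. 2: y1, y3 plus one predetermined variable
W3 = np.hstack([I[:, [1]], np.zeros((G, 1))])      # eq. 3: y2 plus one predetermined variable
W = block_diag(W1, W2, W3)                         # G^2 x 8

def B_of(delta):
    """Eq. (7.12): vec B = vec I_G - W delta."""
    return (I.reshape(-1, order="F") - W @ delta).reshape(G, G, order="F")

def logabsdet(delta):
    return np.linalg.slogdet(B_of(delta))[1]

delta = np.array([0.3, 1.0, -0.5, 0.2, -0.4, 2.0, 0.1, 0.7])
B = B_of(delta)

# Eq. (7.14): analytic derivative
analytic = -W.T @ np.linalg.inv(B).T.reshape(-1, order="F")

# Central finite differences, one coordinate of delta at a time
h = 1e-6
numeric = np.array([
    (logabsdet(delta + h * e) - logabsdet(delta - h * e)) / (2 * h)
    for e in np.eye(delta.size)
])
```

The coordinates of $\delta$ that multiply predetermined variables do not enter $B$, so both gradients are zero in those positions; only the $\beta_i$ coordinates contribute.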
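7.2.4. The Score Vector $\partial l/\partial\theta$

With this derivative in hand, we easily obtain the score vector, whose first component is

$$\frac{\partial l}{\partial\delta} = n\frac{\partial\log|\det B|}{\partial\delta} - \frac{1}{2}\frac{\partial}{\partial\delta}u'(\Sigma^{-1}\otimes I_n)u = -nW'\operatorname{vec}(B^{-1\prime}) + H'(\Sigma^{-1}\otimes I_n)u. \tag{7.15}$$

Comparing this with the corresponding component of the score vector for the SURE model, Eq. (6.5), we see that we now have an additional term, namely $-nW'\operatorname{vec}(B^{-1\prime})$. The second component of the score vector, as noted above, is the same as that component for the SURE model, namely

$$\frac{\partial l}{\partial v} = \frac{D'}{2}(\operatorname{vec}\Sigma^{-1}U'U\Sigma^{-1} - n\operatorname{vec}\Sigma^{-1}). \tag{7.16}$$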
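7.2.5. The Hessian Matrix $\partial^2 l/\partial\theta\,\partial\theta'$

Two components of this matrix are the same as those of the SURE model with the proviso that we put $H$ in place of $X$. They are repeated here for convenience:

$$\frac{\partial^2 l}{\partial v\,\partial v'} = D'(\Sigma^{-1}\otimes\Sigma^{-1})\left[\frac{n}{2}I_{G^2} - (I_G\otimes U'U\Sigma^{-1})\right]D,$$

$$\frac{\partial^2 l}{\partial\delta\,\partial v'} = -H'(\Sigma^{-1}\otimes U\Sigma^{-1})D = \left(\frac{\partial^2 l}{\partial v\,\partial\delta'}\right)'.$$

From $\partial l/\partial\delta$ given by Eq. (7.15), the final component may be written as

$$\frac{\partial^2 l}{\partial\delta\,\partial\delta'} = -n\frac{\partial\operatorname{vec}(B^{-1\prime})}{\partial\delta}W - H'(\Sigma^{-1}\otimes I_n)H. \tag{7.17}$$

Now,

$$\frac{\partial\operatorname{vec}(B^{-1\prime})}{\partial\delta} = \frac{\partial\operatorname{vec} B}{\partial\delta}\,\frac{\partial\operatorname{vec} B'}{\partial\operatorname{vec} B}\,\frac{\partial\operatorname{vec} B^{-1\prime}}{\partial\operatorname{vec} B'},$$

and by using our rules of matrix calculus and Eq. (7.13), we can write this as

$$\frac{\partial\operatorname{vec}(B^{-1\prime})}{\partial\delta} = W'K_{GG}(B^{-1\prime}\otimes B^{-1}). \tag{7.18}$$

Substituting Eq. (7.18) into Eq. (7.17) gives

$$\frac{\partial^2 l}{\partial\delta\,\partial\delta'} = -nW'K_{GG}(B^{-1\prime}\otimes B^{-1})W - H'(\Sigma^{-1}\otimes I_n)H.$$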
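7.2.6. The Information Matrix $I(\theta) = -\operatorname{plim}\frac{1}{n}\,\partial^2 l/\partial\theta\,\partial\theta'$

Under our assumptions, $\operatorname{plim} U'U/n = \Sigma$ and

$$\operatorname{plim} H_i'U/n = \operatorname{plim}\frac{1}{n}(U'Y_i\ \ U'X_i)' = \operatorname{plim}\frac{1}{n}(U'Y\bar W_i\ \ 0)' = W_i'B^{-1\prime}\Sigma,$$

so

$$\operatorname{plim} H'(I_G\otimes U)/n = W'(I_G\otimes B^{-1\prime}\Sigma).$$

Thus, we can write

$$I(\theta) = \begin{bmatrix} \operatorname{plim} H'(\Sigma^{-1}\otimes I_n)H/n + W'K_{GG}(B^{-1\prime}\otimes B^{-1})W & W'(\Sigma^{-1}\otimes B^{-1\prime})D \\ D'(\Sigma^{-1}\otimes B^{-1})W & \dfrac{1}{2}D'(\Sigma^{-1}\otimes\Sigma^{-1})D \end{bmatrix}.$$

It is instructive to compare this information matrix with the corresponding information matrix of the SURE model given by Eq. (6.7) of Subsection 6.2.4. We see that the information matrix in hand is more complex on two counts. First, the matrix in the block (1, 1) position has an extra term in it, namely $W'K_{GG}(B^{-1\prime}\otimes B^{-1})W$. Second, the matrix in the block (1, 2) position is now nonnull.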
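The zero-one matrices appearing in this information matrix can be built explicitly for small $G$. The sketch below constructs a commutation matrix $K_{GG}$, a duplication matrix $D$, and an elimination matrix $L$ using hypothetical helper functions, and then checks numerically that the block $\frac{1}{2}D'(\Sigma^{-1}\otimes\Sigma^{-1})D$ has inverse $2LN(\Sigma\otimes\Sigma)NL'$ with $N = \frac{1}{2}(I_{G^2} + K_{GG})$, the form quoted in Eq. (6.66). The ordering of $\operatorname{vech}$ (lower triangle, column by column) is an assumption of this sketch.

```python
import numpy as np

def vec(A):
    return A.reshape(-1, order="F")

def commutation(m, n):
    """K such that K vec(A) = vec(A') for an m x n matrix A."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[i * n + j, j * m + i] = 1.0
    return K

def lower_pairs(G):
    # Positions (i, j), i >= j, in column-by-column order (assumed vech ordering).
    return [(i, j) for j in range(G) for i in range(j, G)]

def elimination(G):
    """L such that L vec(A) = vech(A)."""
    pairs = lower_pairs(G)
    L = np.zeros((len(pairs), G * G))
    for r, (i, j) in enumerate(pairs):
        L[r, j * G + i] = 1.0
    return L

def duplication(G):
    """D such that D vech(A) = vec(A) for symmetric A."""
    pairs = lower_pairs(G)
    D = np.zeros((G * G, len(pairs)))
    for r, (i, j) in enumerate(pairs):
        D[j * G + i, r] = 1.0
        D[i * G + j, r] = 1.0
    return D

G = 3
K, L, D = commutation(G, G), elimination(G), duplication(G)
N = 0.5 * (np.eye(G * G) + K)

Sigma = np.array([[2.0, 0.5, 0.3], [0.5, 1.0, 0.2], [0.3, 0.2, 1.5]])
S_inv = np.linalg.inv(Sigma)

I_vv = 0.5 * D.T @ np.kron(S_inv, S_inv) @ D
I_vv_inverse = 2.0 * L @ N @ np.kron(Sigma, Sigma) @ N @ L.T
```

Building these matrices entry by entry is only practical for small $G$, but it makes identities such as $DLN = N$ easy to confirm before relying on them in longer derivations.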
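7.2.7. The Cramer-Rao Lower Bound $I^{-1}(\theta)$

Inverting the information matrix to give the asymptotic Cramer-Rao lower bound presents no difficulty. We write

$$I^{-1}(\theta) = \begin{bmatrix} I^{\delta\delta} & I^{\delta v} \\ I^{v\delta} & I^{vv} \end{bmatrix}.$$

Then

$$I^{\delta\delta} = \{\operatorname{plim} H'(\Sigma^{-1}\otimes I_n)H/n + W'K_{GG}(B^{-1\prime}\otimes B^{-1})W - 2W'(\Sigma^{-1}\otimes B^{-1\prime})D[D'(\Sigma^{-1}\otimes\Sigma^{-1})D]^{-1}D'(\Sigma^{-1}\otimes B^{-1})W\}^{-1}.$$

However, from the properties of the duplication matrix (see Section 3.5), $[D'(\Sigma^{-1}\otimes\Sigma^{-1})D]^{-1} = LN(\Sigma\otimes\Sigma)NL'$, where $N = (1/2)(I_{G^2} + K_{GG})$, $DLN = N$, and $N(\Sigma\otimes\Sigma)N = N(\Sigma\otimes\Sigma)$, so the third matrix in the inverse can be written as

$$-2W'(\Sigma^{-1}\otimes B^{-1\prime})N(I_G\otimes\Sigma B^{-1})W = -W'(\Sigma^{-1}\otimes B^{-1\prime}\Sigma B^{-1})W - W'K_{GG}(B^{-1\prime}\otimes B^{-1})W.$$

Using standard asymptotic theory, we note that $\operatorname{plim} H'(\Sigma^{-1}\otimes M)H/n = W'(\Sigma^{-1}\otimes B^{-1\prime}\Sigma B^{-1})W$, where $M$ is the projection matrix $M = I_n - P_X$, with $P_X = X(X'X)^{-1}X'$. It follows that

$$I^{\delta\delta} = [\operatorname{plim} H'(\Sigma^{-1}\otimes P_X)H/n]^{-1}.$$

By a similar analysis,

$$I^{\delta v} = -2[\operatorname{plim} H'(\Sigma^{-1}\otimes P_X)H/n]^{-1}W'(I_G\otimes B^{-1\prime}\Sigma)NL',$$

and

$$I^{vv} = 2LN\{(\Sigma\otimes\Sigma) + 2(I_G\otimes\Sigma B^{-1})W[\operatorname{plim} H'(\Sigma^{-1}\otimes P_X)H/n]^{-1}W'(I_G\otimes B^{-1\prime}\Sigma)\}NL'. \tag{7.19}$$

7.2.8. Statistical Inference from the Score Vector and the Information Matrix

7.2.8.1. Full-Information Maximum-Likelihood Estimator as Iterative Instrumental Variable Estimator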
In the introduction to this chapter, we considered a general statistical model that is broad enough to encompass variations of the LSE model that we consider. Associated with this general model is a matrix of instrumental variables $Z$ and two generic IV estimators: the IV-GLS estimator and the IV-OLS estimator given by Eqs. (7.3) and (7.5), respectively. The basic LSE model studied in this section is a special example of this framework in which $V = \Sigma\otimes I_n$ and the instrumental variables are obtainable from the reduced form. Considering the reduced form written as Eq. (7.9), we see that $(I_G\otimes X)$ qualifies as instrumental variables for $y$, and as it is the $y$'s in $H$ that cause the asymptotic correlation between $H$ and $u$, this means that $(I_G\otimes X)$ qualifies as instrumental variables for $H$ as well. Therefore, for the model before us, in the generic IVEs $\hat\delta_1$ and $\hat\delta_2$ given by Eqs. (7.3) and (7.5), respectively, we set $V = \Sigma\otimes I_n$ and $Z = I_G\otimes X$. Doing this we see that both estimators collapse to the same estimator, namely

$$\bar\delta = [H'(\Sigma^{-1}\otimes P_X)H]^{-1}H'(\Sigma^{-1}\otimes P_X)y.$$

This is the three-stage least-squares (3SLS) estimator with known $\Sigma$. For the more realistic case in which $\Sigma$ is unknown, we first obtain a consistent estimator for $\Sigma$ by using the two-stage least-squares (2SLS) residual vectors $\hat u_i$. We let $\hat\sigma_{ij} = \hat u_i'\hat u_j/n$ and use $\hat\Sigma = \{\hat\sigma_{ij}\}$. The 3SLS estimator is then

$$\hat\delta = [H'(\hat\Sigma^{-1}\otimes P_X)H]^{-1}H'(\hat\Sigma^{-1}\otimes P_X)y,$$

or

$$\hat\delta = [\hat H'(\hat\Sigma^{-1}\otimes I_n)\hat H]^{-1}\hat H'(\hat\Sigma^{-1}\otimes I_n)y, \tag{7.20}$$

where $\hat H = (I_G\otimes P_X)H$ is the predicted value for $H$ from the regression of $H$ on $(I_G\otimes X)$. This estimator was first developed by Zellner and Theil (1962). It is well known that it is a BAN estimator [see, for example, Rothenberg (1964)], so its asymptotic covariance matrix is given by

$$I^{\delta\delta} = [\operatorname{plim} H'(\Sigma^{-1}\otimes P_X)H/n]^{-1}.$$

We now show that the MLE [usually referred to in this context as the full-information maximum-likelihood (FIML) estimator] has a similar IVE interpretation except this interpretation is iterative. This result was first obtained by Durbin in an unpublished paper [see Durbin (1988)] and again demonstrated by Hausman (1975). To this end, consider the score vector that is reproduced here for convenience:

$$\frac{\partial l}{\partial\delta} = -nW'\operatorname{vec}(B^{-1\prime}) + H'(\Sigma^{-1}\otimes I_n)u, \tag{7.21}$$

$$\frac{\partial l}{\partial v} = \frac{D'}{2}(\operatorname{vec}\Sigma^{-1}U'U\Sigma^{-1} - n\operatorname{vec}\Sigma^{-1}).$$

Setting $\partial l/\partial v$ to the null vector gives $\tilde\Sigma = U'U/n$. Now, we write

$$W'\operatorname{vec}(B^{-1\prime}) = W'(\tilde\Sigma^{-1}\otimes I_G)\operatorname{vec}(B^{-1\prime}\tilde\Sigma)$$

and note that $B^{-1\prime}\tilde\Sigma = B^{-1\prime}U'U/n = V'U/n$, where $V$ is the disturbance
matrix of reduced-form equation (7.8). So we write

$$W'\operatorname{vec}(B^{-1\prime}) = \frac{1}{n}W'(\tilde\Sigma^{-1}\otimes I_G)(I_G\otimes V')u = \frac{1}{n}W'(I_G\otimes V')(\tilde\Sigma^{-1}\otimes I_n)u,$$

and substituting this expression into Eq. (7.21) gives

$$\frac{\partial l}{\partial\delta} = [H' - W'(I_G\otimes V')](\tilde\Sigma^{-1}\otimes I_n)u.$$

To evaluate this derivative further, we consider

$$H - (I_G\otimes V)W = \begin{bmatrix} H_1 - VW_1 & & 0 \\ & \ddots & \\ 0 & & H_G - VW_G \end{bmatrix},$$

where $H_i - VW_i = (Y_i\ X_i) - V(\bar W_i\ 0)$. If we write the reduced form of the endogenous variables on the right-hand side of this $i$th equation of our model as $Y_i = X\Pi_i + V_i$ then, as $Y\bar W_i = Y_i$, it follows that $V\bar W_i = V_i$ and $H_i - VW_i = (X\Pi_i\ X_i)$. Let $\bar H_i = (X\Pi_i\ X_i)$ and let $\bar H$ be the block diagonal matrix with $\bar H_i$ in the $i$th block diagonal position. Then we can write

$$\frac{\partial l}{\partial\delta} = \bar H'(\tilde\Sigma^{-1}\otimes I_n)u.$$

Setting this derivative to the null vector and replacing the remaining parameters with their MLEs gives

$$\tilde H'(\tilde\Sigma^{-1}\otimes I_n)(y - H\tilde\delta) = 0, \tag{7.22}$$

where $\tilde\delta$ is the FIML estimator of $\delta$ and $\tilde H$ is the MLE of $\bar H$ given by the block diagonal matrix with $\tilde H_i = (X\tilde\Pi_i\ X_i)$ in the $i$th block diagonal position, $\tilde\Pi_i$ being the MLE of $\Pi_i$. Solving Eq. (7.22) for $\tilde\delta$ gives

$$\tilde\delta = [\tilde H'(\tilde\Sigma^{-1}\otimes I_n)H]^{-1}\tilde H'(\tilde\Sigma^{-1}\otimes I_n)y.$$

A few points should be noted about this interpretation of the FIML estimator. First, it is clearly an iterative interpretation as we have not solved explicitly for $\tilde\delta$, with the matrices $\tilde\Pi_i$ and $\tilde\Sigma$ still depending on $\tilde\delta$. Second, this IVE is similar to the 3SLS estimator of Eq. (7.20). The 3SLS estimator uses $\hat Y_i = X\hat\Pi_i$ as instrumental variables for $Y_i$, where $\hat\Pi_i$ is the OLS estimator of $\Pi_i$, i.e., $\hat\Pi_i = (X'X)^{-1}X'Y_i$. The FIML estimator uses $\tilde Y_i = X\tilde\Pi_i$ as instrumental variables for $Y_i$, where $\tilde\Pi_i$ is the MLE of $\Pi_i$.
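The collapse of the two generic IVEs to the 3SLS form, and the equivalence of the two ways of writing the 3SLS estimator, can be illustrated on a small simulated system. Everything specific in the sketch below is an assumption made for illustration: the two-equation structure, the parameter values, and the use of the true $\Sigma$ (the "known $\Sigma$" case) in place of an estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
n, G, k = 30, 2, 4
X = rng.normal(size=(n, k))
Sigma = np.array([[1.0, 0.5], [0.5, 1.5]])
U = rng.multivariate_normal(np.zeros(G), Sigma, size=n)

# Hypothetical structure: y1 = b1*y2 + X[:, :2] g1 + u1,  y2 = b2*y1 + X[:, 2:] g2 + u2
b1, b2 = 0.5, -0.3
g1, g2 = np.array([1.0, -1.0]), np.array([2.0, 0.5])
B = np.array([[1.0, -b2], [-b1, 1.0]])
Gamma = np.zeros((k, G))
Gamma[:2, 0], Gamma[2:, 1] = -g1, -g2
Y = (U - X @ Gamma) @ np.linalg.inv(B)            # solves Y B + X Gamma = U

H1 = np.hstack([Y[:, [1]], X[:, :2]])
H2 = np.hstack([Y[:, [0]], X[:, 2:]])
H = np.block([[H1, np.zeros_like(H2)], [np.zeros_like(H1), H2]])
y = Y.reshape(-1, order="F")                      # y = vec Y

P = X @ np.linalg.solve(X.T @ X, X.T)             # P_X
S_inv = np.linalg.inv(Sigma)
I_n, I_G = np.eye(n), np.eye(G)

def solve_with(Wt):
    """Return [H' Wt H]^-1 H' Wt y for a weighting matrix Wt."""
    return np.linalg.solve(H.T @ Wt @ H, H.T @ Wt @ y)

# Closed form with known Sigma
d_3sls = solve_with(np.kron(S_inv, P))

# Generic IVEs with V = Sigma kron I_n and Z = I_G kron X
Z, V, V_inv = np.kron(I_G, X), np.kron(Sigma, I_n), np.kron(S_inv, I_n)
d_ivgls = solve_with(V_inv @ Z @ np.linalg.inv(Z.T @ V_inv @ Z) @ Z.T @ V_inv)
d_ivols = solve_with(Z @ np.linalg.inv(Z.T @ V @ Z) @ Z.T)

# Second form of Eq. (7.20), using H_hat = (I_G kron P_X) H
H_hat = np.kron(I_G, P) @ H
d_second = np.linalg.solve(H_hat.T @ V_inv @ H_hat, H_hat.T @ V_inv @ y)
```

All four vectors agree up to rounding, which is the algebraic content of the statement that both generic estimators collapse to the same estimator for this choice of $V$ and $Z$.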
7.2.8.2. The Lagrangian Multiplier Test Statistic for $H_0: \sigma_{ij} = 0$, $i \ne j$

If the disturbances are not contemporaneously correlated there is nothing to link the equations together. Single-equation estimators such as 2SLS estimators or limited-information maximum-likelihood (LIML) estimators would then be as efficient asymptotically as the system estimators such as 3SLS and FIML estimators. Needless to say, the former estimators require a lot less computational effort than the latter estimators. The null hypothesis that interests us then is

$$H_0: \sigma_{ij} = 0, \quad i \ne j,$$

the alternative being

$$H_A: \sigma_{ij} \ne 0.$$

The LMT procedure has appeal for testing the null hypothesis as we need compute only LIML estimators in forming the test statistic. If we then accept the null hypothesis, then on the grounds of asymptotic efficiency these estimators suffice. There is no need to move to the more computationally difficult estimators. With the score vector and the asymptotic Cramer-Rao lower bound in hand, it is a relatively simple matter to form the required test statistic. We proceed as we did for the LMT statistic for the equivalent hypothesis in the SURE model, which gave rise to the Breusch-Pagan test statistic discussed in Subsection 6.2.4.2. Let $\bar v = \bar v(\Sigma)$. Then we can write $H_0: \bar v = 0$ against $H_A: \bar v \ne 0$, and the LMT statistic is

$$T_1 = \frac{1}{n}\,\frac{\partial l}{\partial\bar v}'\,I^{\bar v\bar v}(\theta)\,\frac{\partial l}{\partial\bar v}\bigg|_{\hat\theta},$$

where $I^{\bar v\bar v}$ refers to that part of the Cramer-Rao lower bound corresponding to the parameters $\bar v$, and $\hat\theta$ sets $\bar v$ equal to the null vector and evaluates all other parameters at the constrained MLE, that is, at the LIML estimators. Under $H_0$, $T_1$ tends to a $\chi^2$ random variable with $(1/2)G(G - 1)$ degrees of freedom, so the upper tail of this distribution is used to find the appropriate critical region. In forming $T_1$, we follow the procedure adopted for the SURE model and obtain $\partial l/\partial\bar v$ and $I^{\bar v\bar v}$ from $\partial l/\partial v$ and $I^{vv}$, which we have in hand, by using the selection matrix $S$ defined by $\bar v = Sv$. Recall that $S$ has the property $SL = \bar L$, where $L$ and $\bar L$ are the $(1/2)G(G + 1) \times G^2$ and $(1/2)G(G - 1) \times G^2$ elimination matrices defined in Section 3.4. Moreover, as $\partial l/\partial v$ for the model before us is the same as that for the SURE model, we have

$$\frac{\partial l}{\partial\bar v} = \bar v(A) = \bar L\operatorname{vec} A,$$
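where $A = \Sigma^{-1}U'U\Sigma^{-1}$, and, under $H_0$,

$$A = \begin{pmatrix}\rho_{11} & \cdots & \rho_{1G}\\ \vdots & & \vdots\\ \rho_{G1} & \cdots & \rho_{GG}\end{pmatrix}, \quad\text{with}\quad \rho_{ij} = \frac{u_i'u_j}{\sigma_{ii}\sigma_{jj}}.$$

Again, as in the SURE model,

$$I^{\bar v\bar v} = S\,I^{vv}S'.$$

However, now $I^{vv}$ is far more complicated, being given by Eq. (7.19) rather than by $2LN(\Sigma\otimes\Sigma)NL'$, as in the SURE model. Using Eq. (7.19) and the fact that $SL = \bar L$, we can write

$$T_1 = \frac{1}{n}(\mathrm{vec}\,A)'\bar L'\bar L\,N\,V(\hat\theta)\,N\,\bar L'\bar L\,\mathrm{vec}\,A\,\Big|_{\hat\theta},$$

where

$$V(\theta) = 2\{(\Sigma\otimes\Sigma) + 2(I_G\otimes\Sigma B^{-1})\,W[\mathrm{plim}\,H'(\Sigma^{-1}\otimes P_x)H/n]^{-1}W'(I_G\otimes B^{-1\prime}\Sigma)\}.$$

However, from Section 3.6, $\bar L'\bar L\,\mathrm{vec}\,A = \mathrm{vec}\,\bar A$, where

$$\bar A = \begin{pmatrix} 0 & 0 & \cdots & 0 & 0\\ \rho_{21} & 0 & \cdots & 0 & 0\\ \rho_{31} & \rho_{32} & \ddots & & \vdots\\ \vdots & \vdots & & 0 & 0\\ \rho_{G1} & \rho_{G2} & \cdots & \rho_{G\,G-1} & 0\end{pmatrix},$$

so we write

$$T_1 = \frac{1}{n}(\mathrm{vec}\,\bar A)'\,N\,V(\hat\theta)\,N\,\mathrm{vec}\,\bar A, \qquad (7.23)$$

which is a quadratic form in $N\,\mathrm{vec}\,\bar A$, with $V(\hat\theta)/n$ as the matrix in this quadratic form.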
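We consider first the vector in this quadratic form. By definition,

$$N\,\mathrm{vec}\,\bar A = \tfrac{1}{2}\,\mathrm{vec}(\bar A + \bar A'),$$

and

$$\bar A' = \begin{pmatrix} 0 & \rho_{21} & \rho_{31} & \cdots & \rho_{G1}\\ 0 & 0 & \rho_{32} & \cdots & \rho_{G2}\\ \vdots & & \ddots & \ddots & \vdots\\ 0 & 0 & \cdots & 0 & \rho_{G\,G-1}\\ 0 & 0 & \cdots & 0 & 0\end{pmatrix} = \begin{pmatrix} 0 & \rho_{12} & \rho_{13} & \cdots & \rho_{1G}\\ 0 & 0 & \rho_{23} & \cdots & \rho_{2G}\\ \vdots & & \ddots & \ddots & \vdots\\ 0 & 0 & \cdots & 0 & \rho_{G-1\,G}\\ 0 & 0 & \cdots & 0 & 0\end{pmatrix},$$

as $\rho_{ij} = \rho_{ji}$. Therefore,

$$\bar A + \bar A' = \begin{pmatrix} 0 & \rho_{12} & \rho_{13} & \cdots & \rho_{1G}\\ \rho_{21} & 0 & \rho_{23} & \cdots & \rho_{2G}\\ \rho_{31} & \rho_{32} & 0 & \cdots & \rho_{3G}\\ \vdots & \vdots & & \ddots & \vdots\\ \rho_{G1} & \rho_{G2} & \cdots & \rho_{G\,G-1} & 0\end{pmatrix} = (r_1\; r_2\; \cdots\; r_G),$$

where

$$r_1 = \begin{pmatrix}0\\ \rho_{21}\\ \vdots\\ \rho_{G1}\end{pmatrix},\quad r_2 = \begin{pmatrix}\rho_{12}\\ 0\\ \rho_{32}\\ \vdots\\ \rho_{G2}\end{pmatrix},\quad\ldots,\quad r_G = \begin{pmatrix}\rho_{1G}\\ \vdots\\ \rho_{G-1\,G}\\ 0\end{pmatrix}.$$

Thus, we can write

$$N\,\mathrm{vec}\,\bar A = \frac{1}{2}\begin{pmatrix} r_1\\ \vdots\\ r_G\end{pmatrix}. \qquad (7.24)$$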
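The identity behind Eq. (7.24) can be checked numerically. The following is a minimal sketch, assuming the usual column-major vec operator and taking $N = \tfrac12(I_{G^2} + K_{GG})$; the helper names `commutation_matrix` and `vec` are illustrative and not from the text.

```python
import numpy as np

def commutation_matrix(m, n):
    """Return K_{mn}, which satisfies K_{mn} vec(A) = vec(A') for an m x n matrix A."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # a_ij is element j*m + i of vec(A) and element i*n + j of vec(A')
            K[i * n + j, j * m + i] = 1.0
    return K

def vec(A):
    """Stack the columns of A (column-major order)."""
    return A.reshape(-1, order="F")

G = 4
rng = np.random.default_rng(0)
rho = rng.normal(size=(G, G))
rho = (rho + rho.T) / 2            # symmetric, so rho_ij = rho_ji
A_bar = np.tril(rho, k=-1)         # strictly lower-triangular part of rho

N = 0.5 * (np.eye(G * G) + commutation_matrix(G, G))
lhs = N @ vec(A_bar)               # N vec(A_bar)

R_cols = rho - np.diag(np.diag(rho))   # columns are r_1, ..., r_G
rhs = 0.5 * vec(R_cols)                # (1/2)(r_1', ..., r_G')'
```

Here `lhs` and `rhs` agree, which is the content of Eq. (7.24) for an arbitrary symmetric array of $\rho_{ij}$.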
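Next, we consider the matrix $V(\hat\theta)/n$ of the quadratic form. The first part of $V(\theta)$ in this matrix, namely $2(\Sigma\otimes\Sigma)$, under $H_0$ is given by

$$2(\Sigma\otimes\Sigma) = 2\,\mathrm{diag}\{\sigma_{11}\Sigma_0,\ldots,\sigma_{GG}\Sigma_0\},\quad\text{with}\quad \Sigma_0 = \mathrm{diag}\{\sigma_{11},\ldots,\sigma_{GG}\}.$$

The second part of $V(\theta)$ under $H_0$ requires a little more work. Under $H_0$,

$$[H'(\Sigma^{-1}\otimes P_x)H]^{-1} = \mathrm{diag}\{\sigma_{11}(H_1'P_xH_1)^{-1},\ldots,\sigma_{GG}(H_G'P_xH_G)^{-1}\},$$

so this second part, if we ignore the plim and the 2, is the block diagonal matrix

$$\mathrm{diag}\{\sigma_{ii}\,\Sigma_0B^{-1}W_i(H_i'P_xH_i/n)^{-1}W_i'B^{-1\prime}\Sigma_0\}.$$

We can further simplify this part by noting that

$$W_i(H_i'P_xH_i)^{-1}W_i' = \bar W_i[Y_i'(P_x - P_{x_i})Y_i]^{-1}\bar W_i',$$

where $P_{x_i} = X_i(X_i'X_i)^{-1}X_i'$. Putting the two parts together we have, still ignoring the plim, that

$$V(\theta) = 2\,\mathrm{diag}\Big\{\sigma_{ii}\big\{\Sigma_0 + 2\Sigma_0B^{-1}\bar W_i[Y_i'(P_x - P_{x_i})Y_i/n]^{-1}\bar W_i'B^{-1\prime}\Sigma_0\big\}\Big\}. \qquad (7.25)$$

Equations (7.23)-(7.25) taken together lead to the following procedure for forming the LMT statistic:

1. Apply the LIML estimator to each equation $y_i = H_i\delta_i + u_i$ to get the estimator, $\hat\delta_i$ say. Form the residual vectors $\hat u_i = y_i - H_i\hat\delta_i$ for $i = 1,\ldots,G$.

2. Using $\hat u_i$ and $\hat u_j$, form

$$\hat\sigma_{ij} = \frac{\hat u_i'\hat u_j}{n},\qquad \hat\rho_{ij} = \frac{\hat u_i'\hat u_j}{\hat\sigma_{ii}\hat\sigma_{jj}},\quad\text{for } i, j = 1,\ldots,G,$$

and $\hat B$.

3. Form

$$\hat r_i = (\hat\rho_{1i}\;\cdots\;0\;\cdots\;\hat\rho_{Gi})',\qquad \hat\Sigma_0 = \mathrm{diag}\{\hat\sigma_{11},\ldots,\hat\sigma_{GG}\},$$

where the zero in $\hat r_i$ is in the $i$th position, and

$$\hat A_i = 2\hat\sigma_{ii}\big\{\hat\Sigma_0 + 2\hat\Sigma_0\hat B^{-1}\bar W_i[Y_i'(P_x - P_{x_i})Y_i/n]^{-1}\bar W_i'\hat B^{-1\prime}\hat\Sigma_0\big\}\quad\text{for } i = 1,\ldots,G.$$

4. Form

$$T_1 = \frac{1}{4n}\sum_{i=1}^{G}\hat r_i'\hat A_i\hat r_i.$$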
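Steps 2 to 4 of this procedure can be sketched in code. The sketch below is only an illustration under stated assumptions: it takes the residual matrix, $\hat B$, the included endogenous regressors $Y_i$, the included exogenous regressors $X_i$, and the selection matrices $\bar W_i$ as given, and it uses the scaling implied by Eqs. (7.23)-(7.25), namely $T_1 = (1/n)(N\,\mathrm{vec}\,\bar A)'V(\hat\theta)N\,\mathrm{vec}\,\bar A$. The function names and argument layout are hypothetical.

```python
import numpy as np

def lmt_statistic(U_hat, B_hat, Y_list, X_list, X, Wbar_list):
    """T_1 = (1/n) (N vec A_bar)' V (N vec A_bar), with V block diagonal as in Eq. (7.25)."""
    n, G = U_hat.shape
    cross = U_hat.T @ U_hat                 # entries u_i'u_j
    sigma = np.diag(cross) / n              # sigma_ii estimates
    rho = cross / np.outer(sigma, sigma)    # rho_ij = u_i'u_j / (sigma_ii sigma_jj)
    Sigma0 = np.diag(sigma)
    B_inv = np.linalg.inv(B_hat)
    P_x = X @ np.linalg.pinv(X)             # projection onto the columns of X
    total = 0.0
    for i in range(G):
        r_i = rho[:, i].copy()
        r_i[i] = 0.0                        # zero in the i-th position
        A_i = Sigma0.copy()
        Y_i, X_i, Wbar_i = Y_list[i], X_list[i], Wbar_list[i]
        if Y_i.shape[1] > 0:                # equation i has right-hand endogenous variables
            P_xi = X_i @ np.linalg.pinv(X_i)
            middle = np.linalg.inv(Y_i.T @ (P_x - P_xi) @ Y_i / n)
            left = Sigma0 @ B_inv @ Wbar_i
            A_i = A_i + 2.0 * left @ middle @ left.T
        total += r_i @ (2.0 * sigma[i] * A_i) @ r_i
    return total / (4.0 * n)

def breusch_pagan(U_hat):
    """n times the sum over i < j of squared residual correlations."""
    n, G = U_hat.shape
    S = U_hat.T @ U_hat / n
    corr2 = S ** 2 / np.outer(np.diag(S), np.diag(S))
    return n * np.triu(corr2, k=1).sum()
```

When no equation contains right-hand endogenous variables, the correction terms drop out and `lmt_statistic` returns the same value as `breusch_pagan`, which is a convenient sanity check on the scaling.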
In forming this test statistic, we may make use of the 2SLS estimator instead of the LIML estimator. The 2SLS estimator of $\delta_i$, $\hat\delta_i^{2S}$ say, is asymptotically equivalent to the LIML estimator $\hat\delta_i$. Moreover, an estimator of $\sigma_{ij}$ that uses the 2SLS residual vector is asymptotically equivalent to the one that uses the LIML residual vector. Hence, an asymptotically equivalent version of the LMT statistic would use a 2SLS estimator in place of the LIML estimator in the preceding procedure. The test statistic $T_1$ was first obtained by Turkington (1989) and may be viewed as a generalization of the Breusch-Pagan test statistic discussed in Subsection 6.2.4.2. Indeed, comparing the two test statistics, we see we can write the Turkington test statistic as

$$T_1 = T_1' + \sum_{i=1}^{G}\hat q_i'[\hat\sigma_{ii}Y_i'(P_x - P_{x_i})Y_i]^{-1}\hat q_i,$$

where $T_1'$ is the Breusch-Pagan test statistic and $\hat q_i = \hat\sigma_{ii}\bar W_i'\hat B^{-1\prime}\hat\Sigma_0\hat r_i$. Therefore the Breusch-Pagan test statistic is modified by the addition of the quadratic forms $\hat q_i'[\hat\sigma_{ii}Y_i'(P_x - P_{x_i})Y_i]^{-1}\hat q_i$. It is instructive to investigate what is being measured by these quadratic forms. To do this, we first note that as $Y_i = Y\bar W_i$, it follows that $V_i = V\bar W_i = UB^{-1}\bar W_i$. We consider now

$$\hat\sigma_{ii}\hat\Sigma_0\hat r_i = (\hat u_1'\hat u_i\;\cdots\;0\;\cdots\;\hat u_G'\hat u_i)' = \hat U_i'\hat u_i,$$

where $\hat U_i = (\hat u_1\cdots\hat u_{i-1}\;0\;\hat u_{i+1}\cdots\hat u_G)$, so $\hat q_i = \bar W_i'\hat B^{-1\prime}\hat U_i'\hat u_i = \bar V_i'\hat u_i$, where $\bar V_i = \hat U_i\hat B^{-1}\bar W_i$. It follows then that the quadratic form $\hat q_i'[\hat\sigma_{ii}Y_i'(P_x - P_{x_i})Y_i]^{-1}\hat q_i$ measures the distance the vector $\bar V_i'\hat u_i$ is from the null vector; that is, it concerns itself with the correlations between $\hat u_i$ and $\hat V_i$, excluding the part of $\hat V_i$ that depends directly on $\hat u_i$.

7.3. THE LINEAR SIMULTANEOUS EQUATIONS MODEL WITH VECTOR AUTOREGRESSIVE DISTURBANCES
7.3.1. The Model and Its Assumptions

In this section, we extend the LSE model by allowing the disturbances to be subject to a vector autoregressive system of order $p$. We use the same notation for this system as that of Subsection 6.3.1. Replacing presample values with zeros, we can write our model as

$$y = H\delta + u,\qquad M(r)u = \varepsilon,\qquad \varepsilon\sim N(0, \Sigma\otimes I_n),$$

where, as in Subsection 6.3.1, $M(r)$ is the $nG\times nG$ matrix given by

$$M(r) = I_{Gn} + N(r) = I_{Gn} + (R\otimes I_n)C,$$

$C$ is the $Gnp\times Gn$ matrix given by

$$C = \begin{pmatrix} I_G\otimes S_1\\ \vdots\\ I_G\otimes S_p\end{pmatrix}, \qquad (7.26)$$

and $S_1,\ldots,S_p$ are shifting matrices. Alternatively, we can write this model as

$$YB + X\Gamma = U, \qquad (7.27)$$
$$U = -U_pR' + E, \qquad (7.28)$$

where the rows of the random matrix $E$ are assumed to be independently, identically normally distributed random vectors with null vectors as means and covariance matrix $\Sigma$. Recall that $U_p$ is the $n\times Gp$ matrix $(U_{-1}\cdots U_{-p})$, and if we replace presample values with zeros we can write $U_p = S(I_p\otimes U)$, where $S$ is the $n\times np$ matrix $S = (S_1\cdots S_p)$. As in Subsection 6.3.1, we assume that there are no unit root problems. Lagging Eq. (7.27), we can write

$$U_p = (Y_{-1}B + X_{-1}\Gamma\;\cdots\;Y_{-p}B + X_{-p}\Gamma) = (Y_{-1}B\cdots Y_{-p}B) + (X_{-1}\Gamma\cdots X_{-p}\Gamma) = Y_p(I_p\otimes B) + X_p(I_p\otimes\Gamma), \qquad (7.29)$$

where $Y_p = (Y_{-1}\cdots Y_{-p})$ and $X_p = (X_{-1}\cdots X_{-p})$. Using Eq. (7.29) for $U_p$ and Eq. (7.28), we can write the reduced form as

$$Y = X\Pi_1 + UB^{-1} = X\Pi_1 + X_p\Pi_2 + Y_p\Pi_3 + EB^{-1}, \qquad (7.30)$$

where

$$\Pi_1 = -\Gamma B^{-1},\qquad \Pi_2 = -(I_p\otimes\Gamma)R'B^{-1},\qquad \Pi_3 = -(I_p\otimes B)R'B^{-1}.$$

The predetermined variables in the model are given by $Z = (X\; X_p\; Y_p)$. For our asymptotic theory we make the usual assumptions, namely that $\mathrm{plim}\,(I_G\otimes Z')\varepsilon/n = 0$, $\mathrm{plim}\,Z'Z/n$ exists, and this plim is nonsingular. Further assumptions about asymptotic behavior will be given when needed in future sections.
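The two ways of writing the disturbance process, $M(r)u = \varepsilon$ and $U = -U_pR' + E$, can be compared numerically. The sketch below assumes that the shifting matrix $S_j$ has ones on its $j$th subdiagonal (so that $S_jU$ lags $U$ by $j$ periods with presample zeros) and that vec stacks columns; the helper names are illustrative only.

```python
import numpy as np

def shifting_matrix(n, j):
    """n x n matrix with ones on the j-th subdiagonal: (S_j x)_t = x_{t-j}, zeros presample."""
    return np.eye(n, k=-j)

def build_C(n, G, p):
    """Stack I_G kron S_1, ..., I_G kron S_p into the Gnp x Gn matrix C of Eq. (7.26)."""
    return np.vstack([np.kron(np.eye(G), shifting_matrix(n, j)) for j in range(1, p + 1)])

def build_M(R, n):
    """M(r) = I_{Gn} + (R kron I_n) C for a G x Gp coefficient matrix R."""
    G, Gp = R.shape
    p = Gp // G
    return np.eye(G * n) + np.kron(R, np.eye(n)) @ build_C(n, G, p)

n, G, p = 6, 2, 2
rng = np.random.default_rng(0)
U = rng.normal(size=(n, G))
R = 0.3 * rng.normal(size=(G, G * p))

S = np.hstack([shifting_matrix(n, j) for j in range(1, p + 1)])   # n x np
U_p = S @ np.kron(np.eye(p), U)                                   # (U_{-1} ... U_{-p})

eps_from_M = build_M(R, n) @ U.reshape(-1, order="F")    # M(r) u with u = vec U
eps_direct = (U + U_p @ R.T).reshape(-1, order="F")      # vec E with E = U + U_p R'
```

The vectors `eps_from_M` and `eps_direct` coincide, which confirms that $\varepsilon = \mathrm{vec}\,E$ under these conventions.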
7.3.2. The Parameters of the Model, the Log-Likelihood Function, the Score Vector, and the Hessian Matrix
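The parameters of the model are given by the vector $\theta = (\delta'\; r'\; v')'$, and an analysis similar to that conducted in Subsection 7.2.2 above gives the log-likelihood function that, apart from a constant, is

$$l(\theta) = n\log|\det B| - \frac{n}{2}\log\det\Sigma - \frac{1}{2}\varepsilon'(\Sigma^{-1}\otimes I_n)\varepsilon,$$

where in this function we set $\varepsilon$ equal to $M(r)(y - H\delta)$. Note that we can write $\varepsilon = y^d - H^d\delta$, where $y^d = M(r)y$ and $H^d = M(r)H$. When we compare this log-likelihood function with that obtained for the SURE model with vector autoregressive disturbances, given by Eq. (6.17) of Subsection 6.3.4, we see that again we have an additional term, namely $n\log|\det B|$. Moreover, as this term is a function of $\delta$ but not of $r$ or $v$, it follows that the derivatives with respect to $r$ and $v$ that we need for both the score vector and the Hessian matrix are the same as those derived in Subsections 6.3.4 and 6.3.5, with the proviso that we replace $X^d$ with $H^d$. Similarly, the derivatives involving $\delta$ but not $r$ can be derived from those obtained in Subsections 7.2.4 and 7.2.5 above, with the proviso that $H$ is now replaced with $H^d$, $u$ with $\varepsilon$, and $U$ with $E$. Finally, we obtain $\partial^2l/\partial\delta\,\partial r'$ from Eq. (6.25) of Subsection 6.3.5 by replacing $X$ and $X^d$ in this equation with $H$ and $H^d$, respectively. Thus, we obtain the score vector and the Hessian matrix.

SCORE VECTOR

$$\frac{\partial l}{\partial\delta} = -nW'\,\mathrm{vec}(B^{-1\prime}) + H^{d\prime}(\Sigma^{-1}\otimes I_n)\varepsilon,$$
$$\frac{\partial l}{\partial r} = -K_{pG,G}(\Sigma^{-1}\otimes U_p')\varepsilon,$$
$$\frac{\partial l}{\partial v} = \frac{D'}{2}(\mathrm{vec}\,\Sigma^{-1}E'E\Sigma^{-1} - n\,\mathrm{vec}\,\Sigma^{-1}).$$

HESSIAN MATRIX

$$\frac{\partial^2l}{\partial\delta\,\partial\delta'} = -nW'K_{GG}(B^{-1\prime}\otimes B^{-1})W - H^{d\prime}(\Sigma^{-1}\otimes I_n)H^d,$$
$$\frac{\partial^2l}{\partial\delta\,\partial r'} = H'C'(I_{pG}\otimes E\Sigma^{-1}) + H^{d\prime}(\Sigma^{-1}\otimes U_p)K_{G,pG},$$
$$\frac{\partial^2l}{\partial\delta\,\partial v'} = -H^{d\prime}(\Sigma^{-1}\otimes E\Sigma^{-1})D,$$
$$\frac{\partial^2l}{\partial r\,\partial r'} = -(U_p'U_p\otimes\Sigma^{-1}),$$
$$\frac{\partial^2l}{\partial r\,\partial v'} = K_{pG,G}(\Sigma^{-1}\otimes U_p'E\Sigma^{-1})D,$$
$$\frac{\partial^2l}{\partial v\,\partial v'} = \frac{n}{2}D'(\Sigma^{-1}\otimes\Sigma^{-1})D - D'(\Sigma^{-1}\otimes\Sigma^{-1}E'E\Sigma^{-1})D. \qquad (7.31)$$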
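The component $\partial l/\partial r$ of the score vector can be checked against finite differences. The sketch below holds $\delta$ and $\Sigma$ fixed, so only the quadratic-form term of the log-likelihood matters, and uses the fact that $\varepsilon'(\Sigma^{-1}\otimes I_n)\varepsilon = \mathrm{tr}\,\Sigma^{-1}E'E$ with $E = U + U_pR'$. In matrix form the stated score is $\partial l/\partial R = -\Sigma^{-1}E'U_p$. All function names are illustrative, and $U_p$ is replaced by an arbitrary matrix because the identity does not depend on its lag structure.

```python
import numpy as np

def commutation_matrix(m, n):
    """K_{mn} with K_{mn} vec(A) = vec(A') for an m x n matrix A."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[i * n + j, j * m + i] = 1.0
    return K

def loglik_r_part(R, U, U_p, Sigma_inv):
    """The part of l(theta) that depends on R: -(1/2) tr(Sigma^{-1} E'E)."""
    E = U + U_p @ R.T
    return -0.5 * np.trace(Sigma_inv @ E.T @ E)

def score_R(R, U, U_p, Sigma_inv):
    """Matrix form of dl/dr: dl/dR = -Sigma^{-1} E' U_p (a G x pG matrix)."""
    E = U + U_p @ R.T
    return -Sigma_inv @ E.T @ U_p

def score_r_kron(R, U, U_p, Sigma_inv):
    """Vector form from the text: -K_{pG,G} (Sigma^{-1} kron U_p') eps."""
    G, pG = R.shape
    eps = (U + U_p @ R.T).reshape(-1, order="F")
    return -commutation_matrix(pG, G) @ np.kron(Sigma_inv, U_p.T) @ eps

def numerical_score_R(R, U, U_p, Sigma_inv, h=1e-5):
    """Central finite differences of loglik_r_part with respect to each element of R."""
    grad = np.zeros_like(R)
    for i in range(R.shape[0]):
        for j in range(R.shape[1]):
            Rp, Rm = R.copy(), R.copy()
            Rp[i, j] += h
            Rm[i, j] -= h
            grad[i, j] = (loglik_r_part(Rp, U, U_p, Sigma_inv)
                          - loglik_r_part(Rm, U, U_p, Sigma_inv)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
n, G, p = 8, 2, 2
U = rng.normal(size=(n, G))
U_p = rng.normal(size=(n, G * p))   # stand-in for the lagged disturbances
R = 0.2 * rng.normal(size=(G, G * p))
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.3], [0.3, 2.0]]))
```

With these inputs, `score_R`, `numerical_score_R`, and the reshaped output of `score_r_kron` agree up to rounding error.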
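7.3.3. The Information Matrix $I(\theta) = -\mathrm{plim}\,\dfrac{1}{n}\dfrac{\partial^2l}{\partial\theta\,\partial\theta'}$

To obtain the information matrix, we need to evaluate some new probability limits. Under our assumptions, $\mathrm{plim}\,E'E/n = \Sigma$, $\mathrm{plim}\,U_p'E/n = 0$, and $\mathrm{plim}\,X'E/n = 0$. We assume further that $\mathrm{plim}\,U_p'U_p/n$ exists and is equal to a positive-definite matrix, $\Omega$ say. Let $H_p = (H_{-1}\cdots H_{-p})$, where

$$H_{-j} = \mathrm{diag}\{H_{1,-j},\ldots,H_{G,-j}\}$$

and $H_{i,-j}$ denotes the matrix formed when we lag the values in $H_i$ $j$ periods, where $i = 1,\ldots,G$ and $j = 1,\ldots,p$. If we replace the presample values with zeros, then we can write

$$H_p = [(I_G\otimes S_1)H\;\cdots\;(I_G\otimes S_p)H],$$

where $S_1,\ldots,S_p$ are shifting matrices, and then

$$CH = \begin{pmatrix}H_{-1}\\ \vdots\\ H_{-p}\end{pmatrix},$$

a matrix with $l$ columns, where $l = \sum_i(k_i + G_i)$. Clearly, regardless of whether $X$ contains lagged endogenous variables,

$$\mathrm{plim}\,H'C'(I_{pG}\otimes E)/n = \mathrm{plim}\,(CH)'(I_{pG}\otimes E)/n = 0.$$

Now, under our new notation,

$$H^d = M(r)H = H + (R\otimes I_n)CH, \qquad (7.32)$$

so

$$\mathrm{plim}\,H^{d\prime}(I_G\otimes E)/n = \mathrm{plim}\,H'(I_G\otimes E)/n = \mathrm{diag}\{\mathrm{plim}\,H_i'E/n\} = \mathrm{diag}\left\{\begin{pmatrix}\mathrm{plim}\,Y_i'E/n\\ \mathrm{plim}\,X_i'E/n\end{pmatrix}\right\}.$$

Now, as $\mathrm{plim}\,X'E/n = 0$, $\mathrm{plim}\,X_i'E/n = 0$, and from the reduced form, $Y = X\Pi_1 + UB^{-1} = X\Pi_1 - U_pR'B^{-1} + EB^{-1}$, we have $\mathrm{plim}\,Y'E/n = B^{-1\prime}\,\mathrm{plim}\,E'E/n = B^{-1\prime}\Sigma$, and $\mathrm{plim}\,Y_i'E/n = \bar W_i'B^{-1\prime}\Sigma$. We can then write

$$\mathrm{plim}\,H^{d\prime}(I_G\otimes E)/n = \mathrm{diag}\left\{\begin{pmatrix}\bar W_i'B^{-1\prime}\Sigma\\ 0\end{pmatrix}\right\} = \mathrm{diag}\{W_i'B^{-1\prime}\Sigma\} = W'(I_G\otimes B^{-1\prime}\Sigma). \qquad (7.33)$$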
With these probability limits in hand and using the probability limits already derived in the SURE model with vector autoregressive disturbances, we can write the information matrix as

$$I(\theta) = \mathrm{plim}\,\frac{1}{n}\begin{pmatrix} nW'K_{GG}(B^{-1\prime}\otimes B^{-1})W + H^{d\prime}(\Sigma^{-1}\otimes I_n)H^d & -H^{d\prime}(\Sigma^{-1}\otimes U_p)K_{G,pG} & nW'(\Sigma^{-1}\otimes B^{-1\prime})D\\ -K_{pG,G}(\Sigma^{-1}\otimes U_p')H^d & n(\Omega\otimes\Sigma^{-1}) & 0\\ nD'(\Sigma^{-1}\otimes B^{-1})W & 0 & \dfrac{n}{2}D'(\Sigma^{-1}\otimes\Sigma^{-1})D\end{pmatrix}.$$
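It is instructive to compare this information matrix with the corresponding information matrix of the SURE model with vector autoregressive disturbances given by Eq. (6.26) in Subsection 6.3.6. We note that the matrix in the block (1, 1) position has an extra term, namely $nW'K_{GG}(B^{-1\prime}\otimes B^{-1})W$. Also, the matrix in the block (1, 3) position is now nonnull. The information matrix for the LSE model is then considerably more complex than that for the SURE model.

For the case in which $X$ is strictly exogenous and contains no lagged endogenous variables, the information matrix can be further evaluated. If $X$ is exogenous, Eq. (7.30), as well as being the reduced form of the model, is also the final form. Exogeneity also allows us to write $\mathrm{plim}\,(I_G\otimes X')u/n = 0$, and this simplifies the asymptotics in the evaluation of $\mathrm{plim}\,H^{d\prime}(I_G\otimes U_p)/n$. From Eq. (7.32) we have

$$H^{d\prime}(I_G\otimes U_p) = H'(I_G\otimes U_p) + (CH)'(I_{pG}\otimes U_p)(R'\otimes I_{pG}), \qquad (7.35)$$

and we deal with each part of the right-hand side of Eq. (7.35) in turn. First,

$$H'(I_G\otimes U_p) = \mathrm{diag}\{H_i'U_p\} = \mathrm{diag}\left\{\begin{pmatrix}Y_i'U_p\\ X_i'U_p\end{pmatrix}\right\}.$$

With $X$ strictly exogenous, $\mathrm{plim}\,X'U_p/n = 0$, so $\mathrm{plim}\,X_i'U_p/n = 0$. Moreover, from reduced-form equation (7.30) and Eq. (7.28), we have $\mathrm{plim}\,Y_i'U_p/n = -\bar W_i'B^{-1\prime}R\Omega$, so

$$\mathrm{plim}\,H'(I_G\otimes U_p)/n = -\mathrm{diag}\left\{\begin{pmatrix}\bar W_i'B^{-1\prime}R\Omega\\ 0\end{pmatrix}\right\} = -\mathrm{diag}\{W_i'B^{-1\prime}R\Omega\} = -W'(I_G\otimes B^{-1\prime}R\Omega). \qquad (7.36)$$

Next,

$$(CH)'(I_{pG}\otimes U_p) = [H_{-1}'(I_G\otimes U_p)\;\cdots\;H_{-p}'(I_G\otimes U_p)], \qquad (7.37)$$

so we consider the matrix in the $j$th block position:

$$H_{-j}'(I_G\otimes U_p) = \mathrm{diag}\left\{\begin{pmatrix}Y_{i,-j}'U_p\\ X_{i,-j}'U_p\end{pmatrix}\right\}.$$

With $X$ exogenous, $\mathrm{plim}\,X_{i,-j}'U_p/n = 0$ and $\mathrm{plim}\,Y_{i,-j}'U_p/n = \bar W_i'\,\mathrm{plim}\,Y_{-j}'U_p/n = \bar W_i'B^{-1\prime}\Omega_j$, where $\Omega_j = \mathrm{plim}\,U_{-j}'U_p/n$, a $G\times pG$ matrix, and

$$\Omega = \begin{pmatrix}\Omega_1\\ \vdots\\ \Omega_p\end{pmatrix}. \qquad (7.38)$$

It follows then that

$$\mathrm{plim}\,H_{-j}'(I_G\otimes U_p)/n = \mathrm{diag}\left\{\begin{pmatrix}\bar W_i'B^{-1\prime}\Omega_j\\ 0\end{pmatrix}\right\} = \mathrm{diag}\{W_i'B^{-1\prime}\Omega_j\} = W'(I_G\otimes B^{-1\prime}\Omega_j).$$

Returning to Eq. (7.37), we see that

$$\mathrm{plim}\,(CH)'(I_{pG}\otimes U_p)/n = W'(I_G\otimes B^{-1\prime})(I_G\otimes\Omega_1\;\cdots\;I_G\otimes\Omega_p).$$

We can simplify this expression by using selection matrices. Let $\mathcal S_j$ be the $G\times pG$ selection matrix given by

$$\mathcal S_j = (0\;\cdots\;I_G\;\cdots\;0), \qquad (7.39)$$

where each null matrix is $G\times G$ and $I_G$ is in the $j$th block position. Then, from Eq. (7.38), $\Omega_j = \mathcal S_j\Omega$, and

$$\mathrm{plim}\,(CH)'(I_{pG}\otimes U_p)/n = W'(I_G\otimes B^{-1\prime})\,\mathcal S\,(I_{pG}\otimes\Omega), \qquad (7.40)$$

where

$$\mathcal S = (I_G\otimes\mathcal S_1\;\cdots\;I_G\otimes\mathcal S_p). \qquad (7.41)$$

Finally, it follows from Eqs. (7.35), (7.36), and (7.40) that, for the case before us,

$$\mathrm{plim}\,H^{d\prime}(I_G\otimes U_p)/n = -W'(I_G\otimes B^{-1\prime}R\Omega) + W'(I_G\otimes B^{-1\prime})\,\mathcal S\,(R'\otimes\Omega), \qquad (7.42)$$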
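and by using this probability limit we can write the information matrix for the special case in which $X$ is strictly exogenous as

$$I(\theta) = \begin{pmatrix} I_{\delta\delta} & I_{\delta r} & I_{\delta v}\\ I_{\delta r}' & \Omega\otimes\Sigma^{-1} & 0\\ I_{\delta v}' & 0 & \tfrac{1}{2}D'(\Sigma^{-1}\otimes\Sigma^{-1})D\end{pmatrix},$$

with

$$I_{\delta\delta} = W'K_{GG}(B^{-1\prime}\otimes B^{-1})W + \mathrm{plim}\,H^{d\prime}(\Sigma^{-1}\otimes I_n)H^d/n,$$
$$I_{\delta r} = W'(\Sigma^{-1}\otimes B^{-1\prime}R\Omega)K_{G,pG} - W'(I_G\otimes B^{-1\prime})\,\mathcal S\,(R'\Sigma^{-1}\otimes\Omega)K_{G,pG},$$
$$I_{\delta r}' = K_{pG,G}(\Sigma^{-1}\otimes\Omega R'B^{-1})W - K_{pG,G}(\Sigma^{-1}R\otimes\Omega)\,\mathcal S'(I_G\otimes B^{-1})W,$$
$$I_{\delta v} = W'(\Sigma^{-1}\otimes B^{-1\prime})D,\qquad I_{\delta v}' = D'(\Sigma^{-1}\otimes B^{-1})W.$$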
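7.3.4. The Cramer-Rao Lower Bound $I^{-1}(\theta)$

Inverting the information matrix to obtain the Cramer-Rao lower bound requires a little work. We let

$$I^{-1}(\theta) = \begin{pmatrix} I^{\delta\delta} & I^{\delta r} & I^{\delta v}\\ I^{r\delta} & I^{rr} & I^{rv}\\ I^{v\delta} & I^{vr} & I^{vv}\end{pmatrix}.$$

Then, for the general case,

$$I^{\delta\delta} = \big\{W'K_{GG}(B^{-1\prime}\otimes B^{-1})W + \mathrm{plim}\,H^{d\prime}(\Sigma^{-1}\otimes I_n)H^d/n - \mathrm{plim}\,H^{d\prime}(\Sigma^{-1}\otimes U_p)K_{G,pG}[(U_p'U_p)^{-1}\otimes\Sigma]K_{pG,G}(\Sigma^{-1}\otimes U_p')H^d/n - 2W'(\Sigma^{-1}\otimes B^{-1\prime})DLN(\Sigma\otimes\Sigma)NL'D'(\Sigma^{-1}\otimes B^{-1})W\big\}^{-1}.$$

Before we obtain the other components of $I^{-1}(\theta)$, it will pay us to evaluate $I^{\delta\delta}$ further. Using the result $DLN = N$ and the properties of $N$ and $K_{GG}$, we can write the last matrix in this inverse as

$$-W'(\Sigma^{-1}\otimes B^{-1\prime}\Sigma B^{-1})W - W'K_{GG}(B^{-1\prime}\otimes B^{-1})W.$$

We have seen already from Eq. (7.33) that $\mathrm{plim}\,H^{d\prime}(I_G\otimes E)/n = W'(I_G\otimes B^{-1\prime}\Sigma)$ and so, as $\mathrm{plim}\,E'E/n = \Sigma$, we have

$$\mathrm{plim}\,H^{d\prime}(\Sigma^{-1}\otimes P_E)H^d/n = W'(\Sigma^{-1}\otimes B^{-1\prime}\Sigma B^{-1})W,$$

where $P_E$ is the projection matrix $P_E = E(E'E)^{-1}E'$. One way of writing $I^{\delta\delta}$ would then be

$$I^{\delta\delta} = [\mathrm{plim}\,H^{d\prime}(\Sigma^{-1}\otimes M_E)H^d/n - T]^{-1}, \qquad (7.43)$$

where $T = \mathrm{plim}\,H^{d\prime}[\Sigma^{-1}\otimes U_p(U_p'U_p)^{-1}U_p']H^d/n$ and $M_E$ is the projection matrix $M_E = I_n - P_E$. However, we can obtain a more convenient way of writing $I^{\delta\delta}$ by showing that

$$\mathrm{plim}\,H^{d\prime}(\Sigma^{-1}\otimes M_z)H^d/n = W'(\Sigma^{-1}\otimes B^{-1\prime}\Sigma B^{-1})W$$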
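as well, where $M_z = I_n - Z(Z'Z)^{-1}Z'$ and $Z$ is the matrix of predetermined variables, $Z = (X\;X_p\;Y_p)$. To do this, note that

$$H^{d\prime}(I_G\otimes M_z) = H'(I_G\otimes M_z) + (CH)'(I_{pG}\otimes M_z)(R'\otimes I_n)$$

and, as $H_p$ involves the variables in $X_p$ and $Y_p$, $(CH)'(I_{pG}\otimes M_z) = 0$, so

$$H^{d\prime}(\Sigma^{-1}\otimes M_z)H^d = H'(\Sigma^{-1}\otimes M_z)H = \{\sigma^{ij}H_i'M_zH_j\},$$

where $\Sigma^{-1} = \{\sigma^{ij}\}$. Moreover, as $X'M_z = 0$,

$$H_i'M_zH_j = \begin{pmatrix}Y_i'M_zY_j & 0\\ 0 & 0\end{pmatrix}$$

and, from reduced-form equation (7.30),

$$\mathrm{plim}\,Y_i'M_zY_j/n = \bar W_i'B^{-1\prime}(\mathrm{plim}\,E'E/n)B^{-1}\bar W_j = \bar W_i'B^{-1\prime}\Sigma B^{-1}\bar W_j,$$

so

$$\mathrm{plim}\,H_i'M_zH_j/n = W_i'B^{-1\prime}\Sigma B^{-1}W_j,\qquad \mathrm{plim}\,H^{d\prime}(\Sigma^{-1}\otimes M_z)H^d/n = W'(\Sigma^{-1}\otimes B^{-1\prime}\Sigma B^{-1})W,$$

as required. It follows then that we can write

$$I^{\delta\delta} = [\mathrm{plim}\,H^{d\prime}(\Sigma^{-1}\otimes P_z)H^d/n - T]^{-1}, \qquad (7.44)$$

where $P_z = Z(Z'Z)^{-1}Z'$. The other components of $I^{-1}(\theta)$ are

$$I^{\delta r} = I^{\delta\delta}\,\mathrm{plim}\,H^{d\prime}[I_G\otimes U_p(U_p'U_p)^{-1}]K_{G,pG} = (I^{r\delta})',$$
$$I^{\delta v} = -2I^{\delta\delta}W'(I_G\otimes B^{-1\prime}\Sigma)NL' = (I^{v\delta})',$$
$$I^{rr} = (\Omega^{-1}\otimes\Sigma) + K_{pG,G}\,\mathrm{plim}\,\{[I_G\otimes(U_p'U_p)^{-1}U_p']H^dI^{\delta\delta}H^{d\prime}[I_G\otimes U_p(U_p'U_p)^{-1}]\}K_{G,pG},$$
$$I^{rv} = -2K_{pG,G}\,\mathrm{plim}\,[I_G\otimes(U_p'U_p)^{-1}U_p']H^dI^{\delta\delta}W'(I_G\otimes B^{-1\prime}\Sigma)NL' = (I^{vr})', \qquad (7.45)$$
$$I^{vv} = 2LN[(\Sigma\otimes\Sigma) + 2(I_G\otimes\Sigma B^{-1})WI^{\delta\delta}W'(I_G\otimes B^{-1\prime}\Sigma)]NL'.$$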
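For the special case in which $X$ is exogenous and contains no lagged endogenous values, we know that $\mathrm{plim}\,H^{d\prime}(I_G\otimes U_p)/n$ is given by Eq. (7.42), so we can evaluate our probability limits further to obtain

$$T = W'[(I_G\otimes B^{-1\prime}R\Omega) - (I_G\otimes B^{-1\prime})\,\mathcal S\,(R'\otimes\Omega)](\Sigma^{-1}\otimes\Omega^{-1})[(I_G\otimes\Omega R'B^{-1}) - (R\otimes\Omega)\,\mathcal S'(I_G\otimes B^{-1})]W, \qquad (7.46)$$

$$I^{\delta r} = -I^{\delta\delta}W'[(I_G\otimes B^{-1\prime}R) - (I_G\otimes B^{-1\prime})\,\mathcal S\,(R'\otimes I_{pG})]K_{G,pG} = (I^{r\delta})',$$

$$I^{rr} = (\Omega^{-1}\otimes\Sigma) + K_{pG,G}[(I_G\otimes R'B^{-1}) - (R\otimes I_{pG})\,\mathcal S'(I_G\otimes B^{-1})]WI^{\delta\delta}W'[(I_G\otimes B^{-1\prime}R) - (I_G\otimes B^{-1\prime})\,\mathcal S\,(R'\otimes I_{pG})]K_{G,pG}. \qquad (7.47)$$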
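We now wish to show that for the case under consideration, in which $X$ contains no lagged endogenous variables, we can write

$$I^{\delta\delta} = [A'(I_G\otimes F')\,\mathrm{plim}\,X^{d\prime}\Phi^{-1}X^d/n\,(I_G\otimes F)A]^{-1},$$

where$^1$ $\Phi = \Sigma\otimes I_n$, $A$ is the $G(G+k)\times l$ block diagonal matrix given by $A = \mathrm{diag}\{A_i\}$, $A_i$ is the $(G+k)\times(G_i+k_i)$ selection matrix defined by $H_i = QA_i$ with $Q = (Y\;X)$, $F$ is the $k\times(G+k)$ matrix given by $F = (\Pi_1\;I_k)$, and $X^d$ is the $nG\times Gk$ matrix given by

$$X^d = M(r)(I_G\otimes X). \qquad (7.48)$$

$^1$ It is assumed that $\mathrm{plim}\,X^{d\prime}\Phi^{-1}X^d/n$ exists.

As the first step in proving this, we need to write $H^d$ in terms of $X^d$. To this end we consider

$$H = \mathrm{diag}\{H_i\} = (I_G\otimes Q)A, \qquad (7.49)$$

where $Q = (Y\;X)$. However, we can write

$$Q = (X\Pi_1 + V\;\;X) = XF + V(I_G\;\;0),$$

so, from Eq. (7.49),

$$H = (I_G\otimes X)(I_G\otimes F)A + (I_G\otimes V)[I_G\otimes(I_G\;\;0)]A.$$

We can tidy this expression up by noting that $[I_G\otimes(I_G\;\;0)]A = \mathrm{diag}\{(I_G\;\;0)A_i\} = \mathrm{diag}\{W_i\} = W$, so

$$H = (I_G\otimes X)(I_G\otimes F)A + (I_G\otimes V)W$$

and, as $H^d = M(r)H$,

$$H^d = X^d(I_G\otimes F)A + M(r)(I_G\otimes V)W. \qquad (7.50)$$

Our second step is to bring the $G^2\times p^2G^2$ selection matrix $\mathcal S$ defined by Eqs. (7.39) and (7.41) into the picture. We do this by writing

$$M(r)(I_G\otimes V) = (I_G\otimes V) + (R\otimes I_n)C(I_G\otimes V) \qquad (7.51)$$

and by noting that

$$C(I_G\otimes V) = \begin{pmatrix}I_G\otimes S_1\\ \vdots\\ I_G\otimes S_p\end{pmatrix}(I_G\otimes V) = \begin{pmatrix}I_G\otimes V_{-1}\\ \vdots\\ I_G\otimes V_{-p}\end{pmatrix} = \begin{pmatrix}I_G\otimes U_{-1}\\ \vdots\\ I_G\otimes U_{-p}\end{pmatrix}(I_G\otimes B^{-1}).$$

However, $U_{-j} = U_p\mathcal S_j'$, so

$$C(I_G\otimes V) = (I_{pG}\otimes U_p)\,\mathcal S'(I_G\otimes B^{-1})$$

and, from Eq. (7.51),

$$M(r)(I_G\otimes V) = (I_G\otimes V) + (R\otimes U_p)\,\mathcal S'(I_G\otimes B^{-1}). \qquad (7.52)$$

Note, by the same analysis, that we could write

$$X^d = M(r)(I_G\otimes X) = (I_G\otimes X) + (R\otimes X_p)\,\bar{\mathcal S}', \qquad (7.53)$$

where $X_p = (X_{-1}\cdots X_{-p})$, $\bar{\mathcal S}$ is the $Gk\times Gkp^2$ selection matrix given by $\bar{\mathcal S} = (I_G\otimes\bar{\mathcal S}_1\;\cdots\;I_G\otimes\bar{\mathcal S}_p)$, and $\bar{\mathcal S}_j$ is the $k\times kp$ selection matrix given by $\bar{\mathcal S}_j = (0\;\cdots\;I_k\;\cdots\;0)$, where each null matrix is $k\times k$.

The third step in the proof is to consider $H^{d\prime}\Phi^{-1}H^d$, which, by using Eq. (7.50), we can write as

$$H^{d\prime}\Phi^{-1}H^d = A'(I_G\otimes F')X^{d\prime}\Phi^{-1}X^d(I_G\otimes F)A + W'(I_G\otimes V')M(r)'\Phi^{-1}X^d(I_G\otimes F)A + A'(I_G\otimes F')X^{d\prime}\Phi^{-1}M(r)(I_G\otimes V)W + W'(I_G\otimes V')M(r)'\Phi^{-1}M(r)(I_G\otimes V)W. \qquad (7.54)$$

We consider each component in turn. First, from Eqs. (7.52) and (7.53),

$$(I_G\otimes V')M(r)'\Phi^{-1}X^d = (I_G\otimes B^{-1\prime})[(\Sigma^{-1}\otimes U'X) + (\Sigma^{-1}R\otimes U'X_p)\,\bar{\mathcal S}' + \mathcal S(R'\Sigma^{-1}\otimes U_p'X) + \mathcal S(R'\Sigma^{-1}R\otimes U_p'X_p)\,\bar{\mathcal S}'],$$

and, as $X$ is strictly exogenous, $\mathrm{plim}\,X'U/n = 0$, $\mathrm{plim}\,X_p'U/n = 0$, $\mathrm{plim}\,X'U_p/n = 0$, and $\mathrm{plim}\,X_p'U_p/n = 0$. Therefore

$$\mathrm{plim}\,(I_G\otimes V')M(r)'\Phi^{-1}X^d/n = 0. \qquad (7.55)$$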
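Next, from Eq. (7.52),

$$(I_G\otimes V')M(r)'\Phi^{-1}M(r)(I_G\otimes V) = (I_G\otimes B^{-1\prime})[(\Sigma^{-1}\otimes U'U) + (\Sigma^{-1}R\otimes U'U_p)\,\mathcal S' + \mathcal S(R'\Sigma^{-1}\otimes U_p'U) + \mathcal S(R'\Sigma^{-1}R\otimes U_p'U_p)\,\mathcal S'](I_G\otimes B^{-1}).$$

Now,

$$U'U/n = (-RU_p' + E')(-U_pR' + E)/n \xrightarrow{p} R\Omega R' + \Sigma,\qquad U'U_p/n \xrightarrow{p} -R\Omega,$$

so

$$(I_G\otimes V')M(r)'\Phi^{-1}M(r)(I_G\otimes V)/n \xrightarrow{p} (I_G\otimes B^{-1\prime})\{[\Sigma^{-1}\otimes(R\Omega R' + \Sigma)] - (\Sigma^{-1}R\otimes R\Omega)\,\mathcal S' - \mathcal S(R'\Sigma^{-1}\otimes\Omega R') + \mathcal S(R'\Sigma^{-1}R\otimes\Omega)\,\mathcal S'\}(I_G\otimes B^{-1}).$$

Putting our pieces together, we see from Eq. (7.54) that

$$\mathrm{plim}\,H^{d\prime}\Phi^{-1}H^d/n = A'(I_G\otimes F')\,\mathrm{plim}\,X^{d\prime}\Phi^{-1}X^d/n\,(I_G\otimes F)A + W'(\Sigma^{-1}\otimes B^{-1\prime}\Sigma B^{-1})W + T,$$

where $T$ is given by Eq. (7.46). However, we have already seen that for the case in question we could write

$$I^{\delta\delta} = [\mathrm{plim}\,H^{d\prime}\Phi^{-1}H^d/n - W'(\Sigma^{-1}\otimes B^{-1\prime}\Sigma B^{-1})W - T]^{-1},$$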
so our task is complete.

7.3.5. Statistical Inference from the Score Vector and the Information Matrix

7.3.5.1. Efficient Estimation of $\delta$

1. Case in which R is known: With $R$ known, $M(r)$ is known. Consider the transformed equation

$$y^d = H^d\delta + \varepsilon, \qquad (7.56)$$

where $y^d = M(r)y$ and $H^d = M(r)H$ are now known. Clearly this equation satisfies the assumptions of the LSE model in which the disturbances given by the vector $\varepsilon$ are no longer subject to vector autoregression. It follows from Subsection 7.2.7 that the Cramer-Rao lower bound for an asymptotically efficient consistent estimator of $\delta$ is given by

$$I^{*\delta\delta} = [\mathrm{plim}\,H^{d\prime}(\Sigma^{-1}\otimes P_z)H^d/n]^{-1}.$$

Comparing this expression with $I^{\delta\delta}$ given by Eq. (7.44), we see that $T$ must represent the addition to the Cramer-Rao lower bound that comes about because $R$ is unknown. Moreover, as in the case of the basic LSE model discussed in Section 7.2, the several competing estimators collapse to the same efficient estimator. We consider each of these estimators in turn.

a. The 3SLS Estimator

Consider first a 3SLS estimator. Writing Eq. (7.56) as

$$y^d = H\delta + (R\otimes I_n)CH\delta + \varepsilon,$$

we note that $H$ is correlated with $\varepsilon$ but $CH$ is not. In forming a 3SLS estimator for $\delta$ we need an instrumental variable for $H$ but not for $CH$. As $\mathrm{plim}\,Z'E/n = 0$, all the predetermined variables in $Z$ are available to form an instrumental variable for $H$. Regressing $H$ on $(I_G\otimes Z)$, we would use $\hat H = (I_G\otimes P_z)H$. However, as $CH$ involves variables in $Z$, $(I_{Gp}\otimes P_z)CH = CH$, so

$$\hat H^d = (I_G\otimes P_z)H^d = \hat H + (R\otimes I_n)CH.$$

Therefore, we can write the 3SLS estimator as

$$\hat\delta_{3S} = [H^{d\prime}(\hat\Sigma^{-1}\otimes P_z)H^d]^{-1}H^{d\prime}(\hat\Sigma^{-1}\otimes P_z)y^d, \qquad (7.57)$$

where $\hat\Sigma = \hat E'\hat E/n$ and $\hat E$ is the matrix formed from the 2SLS residual vectors after the 2SLS estimator has been applied to the individual equations making up Eq. (7.56), again by using $Z$ to form instruments. Under our assumptions, standard asymptotic theory shows that $\sqrt n(\hat\delta_{3S} - \delta)$ has a limiting multivariate normal distribution with mean zero and covariance matrix $I^{*\delta\delta}$. Moreover, the analysis holds good regardless of whether $X$ is strictly exogenous or $X$ contains lagged endogenous variables.

b. The IV-GLS Estimator or White Estimator

Equation system (7.56) has two econometric problems associated with it, namely the nonspherical covariance matrix of the disturbance term $\varepsilon$ and the correlation of right-hand variables with this disturbance term. Suppose we deal with the former problem first by premultiplying Eq. (7.56) by the nonsingular matrix $P$, where $P'P = \Sigma^{-1}\otimes I_n$, to get

$$Py^d = PH^d\delta + P\varepsilon. \qquad (7.58)$$
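The 3SLS formula in Eq. (7.57) can be sketched directly in code. This is a minimal illustration, assuming that the transformed data $y^d$ and $H^d$, the instrument matrix $Z$, and an estimate of $\Sigma$ are already available; the function names `three_sls` and `block_diag` are hypothetical, and the dense Kronecker product is used only for clarity on small problems.

```python
import numpy as np

def block_diag(blocks):
    """Place the given matrices along the diagonal of a larger zero matrix."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

def three_sls(y_d, H_d, Z, Sigma_hat):
    """[H^d'(Sigma^{-1} kron P_z) H^d]^{-1} H^d'(Sigma^{-1} kron P_z) y^d, as in Eq. (7.57)."""
    P_z = Z @ np.linalg.solve(Z.T @ Z, Z.T)            # projection onto the instruments
    weight = np.kron(np.linalg.inv(Sigma_hat), P_z)    # Sigma^{-1} kron P_z
    lhs = H_d.T @ weight @ H_d
    rhs = H_d.T @ weight @ y_d
    return np.linalg.solve(lhs, rhs)

# Small artificial example with G = 2 equations and no disturbance,
# so the estimator should return the coefficients exactly.
rng = np.random.default_rng(0)
n = 30
Z = rng.normal(size=(n, 4))
H_d = block_diag([rng.normal(size=(n, 2)), rng.normal(size=(n, 2))])
delta_true = np.array([1.0, -2.0, 0.5, 3.0])
y_d = H_d @ delta_true
Sigma_hat = np.array([[1.0, 0.4], [0.4, 2.0]])
delta_hat = three_sls(y_d, H_d, Z, Sigma_hat)
```

In practice one would avoid forming the $nG\times nG$ weight matrix explicitly, but the dense form keeps the correspondence with Eq. (7.57) visible.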
In Eq. (7.58) the disturbances are now spherical, but we are left with the second problem, namely that $PH^d$ is correlated with the disturbance term $P\varepsilon$. We achieve an instrumental variable for $PH$ by regressing this matrix on $P(I_G\otimes Z)$ to give

$$P\hat H = P(I_G\otimes P_z)H.$$

However, as noted above, $(I_G\otimes P_z)(R\otimes I_n)CH = (R\otimes I_n)CH$, so in forming the White estimator we could use

$$P\hat H^d = P(I_G\otimes P_z)H^d$$

as an instrumental variable for $PH^d$ in Eq. (7.58). Doing this and replacing the unknown $\Sigma$ with $\hat\Sigma$ gives the estimator. However, a little work shows that this is identical to $\hat\delta_{3S}$.

c. The IV-OLS Estimator

Consider again equation system (7.56), which we still write as

$$y^d = H\delta + (R\otimes I_n)CH\delta + \varepsilon. \qquad (7.59)$$

We have seen that $H$ is correlated with $\varepsilon$. Suppose we attempt to break this correlation, at least asymptotically, by multiplying both sides of Eq. (7.59) by $I_G\otimes Z'$ to obtain

$$(I_G\otimes Z')y^d = (I_G\otimes Z')H^d\delta + (I_G\otimes Z')\varepsilon. \qquad (7.60)$$

Equation (7.60) still has a disturbance term whose covariance matrix is nonspherical, but we can deal with this by applying GLS estimation to the equation. Replacing the unknown $\Sigma$ in this GLS estimator with $\hat\Sigma$ gives the IV-OLS estimator. However, a little work shows that this estimator is also identical to $\hat\delta_{3S}$.

2. Case in which R is Unknown

a. The Modified 3SLS Estimator $\tilde\delta_{M3S}$

For the more realistic case in which $R$ is unknown, the Cramer-Rao lower bound $I^{\delta\delta}$ given by Eq. (7.44) must be the asymptotic covariance matrix of any consistent estimator purporting to be asymptotically efficient. Estimation is more complicated now as $H^d$ and $y^d$ are no longer observable. Moreover, the estimator $\hat\Sigma$ described above is no longer available. Suppose, however, that a consistent estimator $\tilde R$ can be obtained so we could form $M(\tilde r)$, where $\tilde r = \mathrm{vec}\,\tilde R$, and thus we have the predictors $\tilde H^d = M(\tilde r)H$ and $\tilde y^d = M(\tilde r)y$ of $H^d$ and $y^d$, respectively. Suppose further that an alternative consistent estimator of $\Sigma$, $\tilde\Sigma$ say, could also be obtained. Then at first glance it may seem that a reasonable approach to obtaining an estimator of $\delta$ would be to replace $H^d$, $y^d$, and $\hat\Sigma$ in Eq. (7.57) with $\tilde H^d$, $\tilde y^d$, and $\tilde\Sigma$, respectively. However, a little thought will reveal that this approach is unacceptable. Recall that in forming $\hat\delta_{3S}$, we agreed that with $R$ known in Eq. (7.56) we needed an instrumental variable for $H$ only, and all the predetermined variables in $Z$ qualify to form such an instrumental variable. However, with $R$ unknown we are forced to work with $\tilde y^d$ and $\tilde H^d$, and these matrices are related by the equation

$$\tilde y^d = \tilde H^d\delta + M(\tilde r)u = \tilde H^d\delta + \varepsilon + [M(\tilde r) - M(r)]u. \qquad (7.61)$$

In this artificial equation, we need an instrumental variable for $\tilde H^d$, and now all of $Z$ is no longer available to us to form such an instrument. The easiest case to handle is the one in which $X$ is strictly exogenous and does not contain lagged endogenous variables. Then, as $\mathrm{plim}\,X'u/n = 0$, $X$ and $X_p$ are still available to form instruments, but clearly $Y_p$ is not. For the moment we restrict ourselves to this case. Suppose we drop $Y_p$ from the instrument set and work with $Z_1 = (X\;X_p)$ instead of $Z$. Replacing $H^d$, $y^d$, $\hat\Sigma$, and $P_z$ in Eq. (7.57) with $\tilde H^d$, $\tilde y^d$, $\tilde\Sigma$, and $P_{z_1} = Z_1(Z_1'Z_1)^{-1}Z_1'$, respectively, we obtain the following estimator:

$$\tilde\delta_{M3S} = [\tilde H^{d\prime}(\tilde\Sigma^{-1}\otimes P_{z_1})\tilde H^d]^{-1}\tilde H^{d\prime}(\tilde\Sigma^{-1}\otimes P_{z_1})\tilde y^d. \qquad (7.62)$$
Before we discuss the asymptotic properties of this estimator, we need to outline the procedure for finding the consistent estimators $\tilde R$ and $\tilde\Sigma$.

b. Procedure for Obtaining Consistent $\tilde R$ and $\tilde\Sigma$

i. We apply three stage least squares to the original equation $y = H\delta + u$, ignoring the autoregressive disturbances and using the exogenous variables $X$ to form the instrumental variables for $H$ in this equation. Although the 3SLS estimator, $\bar\delta$ say, is not efficient, it is shown in Appendix 7.A that it is consistent.

ii. We form the 3SLS residual vector $\bar u = y - H\bar\delta$. From $\bar u$, we form $\bar U = \mathrm{devec}_n\,\bar u$ and $\bar U_p = S(I_p\otimes\bar U)$.

iii. We compute $\tilde R' = -(\bar U_p'\bar U_p)^{-1}\bar U_p'\bar U$, $\tilde E = \bar U + \bar U_p\tilde R'$, and $\tilde\Sigma = \tilde E'\tilde E/n$.

In Appendix 7.A, it is shown that both $\tilde R$ and $\tilde\Sigma$ are consistent estimators. The analysis there makes it clear that any consistent estimator of $\delta$ would lead to consistent estimators of $R$ and $\Sigma$. The 3SLS estimator is singled out as it is the estimator the econometrician has in hand when dealing with the LSE model. With $\tilde R$ obtained, we form $\tilde r = \mathrm{vec}\,\tilde R$, $M(\tilde r)$, $\tilde y^d = M(\tilde r)y$, and $\tilde H^d = M(\tilde r)H$.

c. Consistency of $\tilde\delta_{M3S}$

We now show that $\tilde\delta_{M3S}$ is a consistent estimator. From Eqs. (7.61) and (7.62) we can write

$$\tilde\delta_{M3S} = \delta + [\tilde H^{d\prime}(\tilde\Sigma^{-1}\otimes P_{z_1})\tilde H^d]^{-1}\tilde H^{d\prime}(\tilde\Sigma^{-1}\otimes P_{z_1})\tilde\varepsilon, \qquad (7.63)$$

where $\tilde\varepsilon = \varepsilon + [M(\tilde r) - M(r)]u$. Recall from Subsection 7.3.1 that $M(r) = I_{Gn} + (R\otimes I_n)C$, so we can write

$$\tilde\varepsilon = \varepsilon + [(\tilde R - R)\otimes I_n]Cu, \qquad (7.64)$$
$$\tilde H^d = H^d + [(\tilde R - R)\otimes I_n]CH. \qquad (7.65)$$
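Steps ii and iii of the procedure for obtaining $\tilde R$ and $\tilde\Sigma$ amount to a multivariate regression of the residual matrix on its own lags. The following is a rough sketch, assuming the residual matrix $\bar U$ ($n\times G$) is already in hand and presample values are set to zero; the function names are illustrative, and the simulated check uses true disturbances in place of 3SLS residuals.

```python
import numpy as np

def lagged_matrix(U, p):
    """Return U_p = (U_{-1} ... U_{-p}), with presample values replaced by zeros."""
    n, _ = U.shape
    lags = []
    for j in range(1, p + 1):
        lag = np.zeros_like(U)
        lag[j:] = U[:n - j]
        lags.append(lag)
    return np.hstack(lags)

def estimate_R_Sigma(U, p):
    """R' = -(U_p'U_p)^{-1} U_p'U, E = U + U_p R', Sigma = E'E/n."""
    n = U.shape[0]
    U_p = lagged_matrix(U, p)
    R_t = -np.linalg.solve(U_p.T @ U_p, U_p.T @ U)   # this is R' (pG x G)
    E = U + U_p @ R_t
    Sigma = E.T @ E / n
    return R_t.T, Sigma, E, U_p

# Simulate U = -U_p R' + E with p = 1, that is, u_t = -R u_{t-1} + e_t.
rng = np.random.default_rng(0)
n, G = 5000, 2
R_true = np.array([[0.5, 0.1], [0.0, -0.3]])
shocks = rng.normal(size=(n, G))
U_sim = np.zeros((n, G))
U_sim[0] = shocks[0]
for t in range(1, n):
    U_sim[t] = -R_true @ U_sim[t - 1] + shocks[t]

R_est, Sigma_est, E_est, U_p_sim = estimate_R_Sigma(U_sim, p=1)
```

Because the computation is ordinary least squares of $U$ on $U_p$ with the sign reversed, the fitted $\tilde E$ is orthogonal to the lagged residuals by construction, and with a long simulated sample `R_est` should lie close to `R_true`.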
We wish to prove that the second vector on the right-hand side of Eq. (7.63) has a probability limit equal to the null vector. To this end, from Eq. (7.65) we have

$$\tilde H^{d\prime}(I_G\otimes Z_1)/n = H^{d\prime}(I_G\otimes Z_1)/n + [(CH)'(I_{Gp}\otimes Z_1)/n][(\tilde R - R)'\otimes I_{k(1+p)}].$$

Under our asymptotic assumptions, $\mathrm{plim}\,H^{d\prime}(I_G\otimes Z_1)/n$ and $\mathrm{plim}\,(CH)'(I_{Gp}\otimes Z_1)/n$ exist and, as $\tilde R$ is consistent,

$$\mathrm{plim}\,\tilde H^{d\prime}(I_G\otimes Z_1)/n = \mathrm{plim}\,H^{d\prime}(I_G\otimes Z_1)/n. \qquad (7.66)$$

It follows then that

$$\mathrm{plim}\,\tilde H^{d\prime}(\tilde\Sigma^{-1}\otimes P_{z_1})\tilde H^d/n = \mathrm{plim}\,H^{d\prime}(\Sigma^{-1}\otimes P_{z_1})H^d/n. \qquad (7.67)$$

Moreover, from Eq. (7.64),

$$(I_G\otimes Z_1')\tilde\varepsilon/n = (I_G\otimes Z_1')\varepsilon/n + [(\tilde R - R)\otimes I_{k(1+p)}](I_{Gp}\otimes Z_1')Cu/n.$$

As $\mathrm{plim}\,(I_G\otimes Z_1')\varepsilon/n = 0$ and $\mathrm{plim}\,(I_{Gp}\otimes Z_1')Cu/n$ exists,

$$\mathrm{plim}\,(I_G\otimes Z_1')\tilde\varepsilon/n = 0. \qquad (7.68)$$

From Eqs. (7.66)-(7.68) we have then that $\tilde\delta_{M3S}$ is consistent.

d. Asymptotic Efficiency of $\tilde\delta_{M3S}$

Having established that our modified 3SLS estimator is consistent, we now seek to prove that it is asymptotically efficient in that $\sqrt n(\tilde\delta_{M3S} - \delta)$ has a limiting multivariate normal distribution with a null vector mean and a covariance matrix equal to $I^{\delta\delta}$ given by Eq. (7.44). To do this we need to make the following simplifying assumption.

Assumption 7.1. Let $u_p = Cu = (u_{-1}'\cdots u_{-p}')'$ and $\mathcal V = \mathcal E(u_pu_p')$. As $n$ tends to infinity, the matrix $(I_{Gp}\otimes Z_1')\mathcal V(I_{Gp}\otimes Z_1)/n$ tends to a positive-definite matrix.
This assumption is similar to the type of assumption used in discussing the asymptotic efficiency of a GLS estimator in a linear-regression framework with autoregressive disturbances and unknown autoregressive coefficients [see, for example, Theil (1971), Section 8.6]. Its importance for the discussion here is that it ensures that $(I_{Gp}\otimes Z_1')u_p/\sqrt n$ has a limiting distribution as $n$ tends to infinity. We are still assuming that $X$ contains no lagged endogenous variables, so this random vector has an expectation equal to the null vector. Under Assumption 7.1 its covariance matrix is bounded as $n$ tends to infinity. Consider now

$$\sqrt n(\tilde\delta_{M3S} - \delta) = [\tilde H^{d\prime}(\tilde\Sigma^{-1}\otimes P_{z_1})\tilde H^d/n]^{-1}\{\tilde H^{d\prime}[\tilde\Sigma^{-1}\otimes Z_1(Z_1'Z_1)^{-1}]\}(I_G\otimes Z_1')\tilde\varepsilon/\sqrt n. \qquad (7.69)$$

We have already established that the matrices in the square brackets and the curly braces have probability limits, so we need to show that $(I_G\otimes Z_1')\tilde\varepsilon/\sqrt n$ has a limiting distribution. To this end, from Eq. (7.64) we can write

$$(I_G\otimes Z_1')\tilde\varepsilon/\sqrt n = (I_G\otimes Z_1')\varepsilon/\sqrt n + [(\tilde R - R)\otimes I_{k(1+p)}](I_{Gp}\otimes Z_1')u_p/\sqrt n. \qquad (7.70)$$

Under our assumption, $(I_{Gp}\otimes Z_1')u_p/\sqrt n$ has a limiting distribution, so the second term on the right-hand side of Eq. (7.70) has probability limit equal to the null vector, and so $(I_G\otimes Z_1')\tilde\varepsilon/\sqrt n$ has the same limiting distribution as that of $(I_G\otimes Z_1')\varepsilon/\sqrt n$, which is a multivariate normal distribution with a mean equal to the null vector and a covariance matrix $\Sigma\otimes\mathrm{plim}\,Z_1'Z_1/n$. Then, from Eqs. (7.67) and (7.69) we obtain

$$\sqrt n(\tilde\delta_{M3S} - \delta)\xrightarrow{d}N(0, V_M),$$

where $V_M = [\mathrm{plim}\,H^{d\prime}(\Sigma^{-1}\otimes P_{z_1})H^d/n]^{-1}$. Having established the limiting distribution of $\tilde\delta_{M3S}$, it remains for us to show that $V_M = I^{\delta\delta}$. Recall that $Z = (Z_1\;Y_p)$, so

$$M_z = M_{z_1} - M_{z_1}Y_p(Y_p'M_{z_1}Y_p)^{-1}Y_p'M_{z_1}, \qquad (7.71)$$

where $M_{z_1} = I_n - P_{z_1}$. However, from Eq. (7.29),

$$Y_p = X_p(I_p\otimes\Pi_1) + U_p(I_p\otimes B^{-1}) \qquad (7.72)$$

and, as $X_p$ is part of $Z_1$, $M_{z_1}X_p = 0$, and

$$M_{z_1}Y_p = M_{z_1}U_p(I_p\otimes B^{-1}). \qquad (7.73)$$
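The partitioned-projection identity in Eq. (7.71) is a purely algebraic fact and can be confirmed numerically. The sketch below uses arbitrary full-column-rank matrices in the roles of $Z_1$ and $Y_p$; the helper name `annihilator` is illustrative.

```python
import numpy as np

def annihilator(A):
    """Return M_A = I - A (A'A)^{-1} A' for a full-column-rank matrix A."""
    n = A.shape[0]
    return np.eye(n) - A @ np.linalg.solve(A.T @ A, A.T)

rng = np.random.default_rng(0)
n = 25
Z1 = rng.normal(size=(n, 4))     # stands in for (X  X_p)
Yp = rng.normal(size=(n, 3))     # stands in for the lagged endogenous variables
Z = np.hstack([Z1, Yp])

M_z = annihilator(Z)
M_z1 = annihilator(Z1)
inner = Yp.T @ M_z1 @ Yp
M_z_partitioned = M_z1 - M_z1 @ Yp @ np.linalg.solve(inner, Yp.T @ M_z1)
```

The two matrices `M_z` and `M_z_partitioned` are equal up to rounding, and replacing `Yp` by `Yp @ T` for any nonsingular `T` leaves the result unchanged, which is why the factor $(I_p\otimes B^{-1})$ in Eq. (7.73) cancels in the next step.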
Substituting Eq. (7.73) into Eq. (7.71) gives

$$M_z = M_{z_1} - M_{z_1}U_p(U_p'M_{z_1}U_p)^{-1}U_p'M_{z_1}.$$

Now, with $X$ containing no lagged endogenous variables, $\mathrm{plim}\,X'U_p/n = 0$, $\mathrm{plim}\,X_p'U_p/n = 0$, and $\mathrm{plim}\,Z_1'U_p/n = 0$. It follows then that

$$\mathrm{plim}\,H^{d\prime}(\Sigma^{-1}\otimes M_{z_1})H^d/n = \mathrm{plim}\,H^{d\prime}\{\Sigma^{-1}\otimes[M_z + U_p(U_p'U_p)^{-1}U_p']\}H^d/n,$$

so, from Eq. (7.44), $V_M = I^{\delta\delta}$, as required.

In the preceding analysis we have assumed that $X$ contains no lagged endogenous variables. When this is not the case we have seen that the Cramer-Rao lower bound remains at $I^{\delta\delta}$. However, now $\mathrm{plim}\,X'U/n\neq 0$, $\mathrm{plim}\,X'U_p/n\neq 0$, reduced-form equation (7.30) is no longer the final form, and we must specify which lagged endogenous variables enter which equations. All this makes the asymptotic analysis far more complex, and as a consequence it is left outside the scope of this present work.

7.3.5.2. Using $X^d$ to Form Instrumental Variables

If $X$ does not contain lagged endogenous variables, $X^d$ is available to form instrumental variables. In this subsection we look at this possibility for both the case in which $R$ is known and the case in which $R$ is unknown.

1. R IS KNOWN. Suppose again that $R$ is known so $y^d$, $H^d$, and $X^d$ are available to us. Consider again the transformed equation

$$y^d = H^d\delta + \varepsilon. \qquad (7.74)$$

Now, we know from Eq. (7.30) that $y = (I_G\otimes X)\pi_1 + v$, where $\pi_1 = \mathrm{vec}\,\Pi_1$, $v = \mathrm{vec}\,V$, and $V = UB^{-1}$, so

$$y^d = X^d\pi_1 + M(r)v$$

may be regarded as the reduced-form equation for $y^d$. This being the case, we may be tempted to use $X^d$ to form instrumental variables for $H^d$ in Eq. (7.74). Suppose we form an IV-GLS type estimator. First, we would consider

$$Py^d = PH^d\delta + P\varepsilon, \qquad (7.75)$$

where we recall that $P'P = \Sigma^{-1}\otimes I_n = \Phi^{-1}$. We would then form an instrumental variable for $PH^d$ in this equation by regressing $PH^d$ on $PX^d$ to get

$$P\hat H^d = PX^d(X^{d\prime}\Phi^{-1}X^d)^{-1}X^{d\prime}\Phi^{-1}H^d.$$
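Using $P\hat H^d$ as an IV for $PH^d$ in Eq. (7.75) and, in the resultant estimator, replacing $\Phi$ with $\hat\Phi = \hat\Sigma\otimes I_n$ gives

$$\hat\delta_d = [H^{d\prime}\hat\Phi^{-1}X^d(X^{d\prime}\hat\Phi^{-1}X^d)^{-1}X^{d\prime}\hat\Phi^{-1}H^d]^{-1}H^{d\prime}\hat\Phi^{-1}X^d(X^{d\prime}\hat\Phi^{-1}X^d)^{-1}X^{d\prime}\hat\Phi^{-1}y^d.$$

We have assumed that $\mathrm{plim}\,X^{d\prime}\Phi^{-1}X^d/n$ exists. Standard asymptotic theory then reveals that $\hat\delta_d$ is consistent and

$$\sqrt n(\hat\delta_d - \delta)\xrightarrow{d}N(0, V_d),$$

where

$$V_d = [\mathrm{plim}\,H^{d\prime}\Phi^{-1}X^d(X^{d\prime}\Phi^{-1}X^d)^{-1}X^{d\prime}\Phi^{-1}H^d/n]^{-1}.$$

From Eq. (7.50),

$$H^{d\prime}\Phi^{-1}X^d = A'(I_G\otimes F')X^{d\prime}\Phi^{-1}X^d + W'(I_G\otimes V')M(r)'\Phi^{-1}X^d,$$

and we have seen from Eq. (7.55) that $(I_G\otimes V')M(r)'\Phi^{-1}X^d/n\xrightarrow{p}0$, so

$$\mathrm{plim}\,H^{d\prime}\Phi^{-1}X^d/n = A'(I_G\otimes F')\,\mathrm{plim}\,X^{d\prime}\Phi^{-1}X^d/n.$$

We conclude that $V_d = I^{\delta\delta}$. We know that $I^{\delta\delta}$ is the Cramer-Rao lower bound for a consistent estimator of $\delta$ in which $R$ is unknown. If $R$ is known, as we are now assuming, the Cramer-Rao lower bound is $I^{*\delta\delta}$. We conclude then that $\hat\delta_d$ is not an asymptotically efficient estimator of $\delta$. Forming an IV as we have just done does not make the most efficient use of our knowledge of $R$.

2. R IS UNKNOWN. Now we look at the more realistic case in which $R$ is unknown. Suppose we persevere with $\hat\delta_d$, replacing $H^d$, $X^d$, $y^d$, and $\hat\Phi$ in this estimator with $\tilde H^d = M(\tilde r)H$, $\tilde X^d = M(\tilde r)(I_G\otimes X)$, $\tilde y^d = M(\tilde r)y$, and $\tilde\Phi = \tilde\Sigma\otimes I_n$, respectively, to obtain the estimator

$$\tilde\delta_d = [\tilde H^{d\prime}\tilde\Phi^{-1}\tilde X^d(\tilde X^{d\prime}\tilde\Phi^{-1}\tilde X^d)^{-1}\tilde X^{d\prime}\tilde\Phi^{-1}\tilde H^d]^{-1}\tilde H^{d\prime}\tilde\Phi^{-1}\tilde X^d(\tilde X^{d\prime}\tilde\Phi^{-1}\tilde X^d)^{-1}\tilde X^{d\prime}\tilde\Phi^{-1}\tilde y^d.$$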
Consistency of bd. Clearly, d (i)-1 d)-ix Sd = 6 +[ fix x ildi)-ik-d (xd'i)-ixd )-ik-cr 4)-1E,
(7.76)
where
= E +[(I? —
R) 0 In ]Cu.
Now, in Subsection 7.3.3, Eq. (7.32), we saw that we could write Hd =H (R 0 ln)H; ,
(7.77)
179
Linear Simultaneous Equations Models SO
H d = H d + [(1?'— R)
In] 11;
Also, in Subsection 7.3.4, Eq. (7.53), we saw that we could write Xd =(/G ® X) + (R 0 X p):9' , SO
Xd =
(7.78)
+ [(R — R) 0 X p ]S"
Consider then fid4-1 d = Hd4-1 xd HdVG 0 x 0[2-1(1?
R) 0 ipk]s,
H Pr (1pG 0 X)[(i?' — R)'2-1 0 H (I pG 0 X ORE — 42 -1R 0 1pk14.S" 1-41(1pG 0 X p)[(k' — R)'2-1(E — R) 0 1pk14.S"
We make the usual assumptions that plim Hp' (1pG 0 X)/n and p lim HP (IpG 0 X p)In exist so, as R is a consistent estimator of R, fid,-ijc-din
p limHd'(1)-1Xd /n.
In a like manner, je' d' 4)-1 je-cl I• n
P plim
r (1)-1 gy n We conclude that 75d is a consistent estimator of S. Asymptotic Efficiency of 5d: From Eq. (7.76), _ 6) = {[ d (i)-1
(i)-1 je)-1 (i)-1 je)-115( ' d' 4)-1E/,\FL x ( je d' 4)-1
(i)-1 d (7.79)
We have just shown that the term in the braces has a probability limit of plimn[Hd' 0-1 xd (xd' 0-1 xd )-1 x cr 4) -1 Hd 1-1 Hd' 0-1 xd (xd' 0-1 xd )-1. Now, using Eqs. (7.77) and (7.78), we can write 4) -1 giVn = Xd'(1)-1E
±[2-1 (1?— R)
+S[R' 2-1(E +SRI? — 51(12'— + 4.-
—
1k] (
IPI x')cu
R) 0 Ipk](IpG 0 X' )Cu/./ 0 1pd(IG ®X'p)EINFI
— R)
1pal pG 0 X'p)Cu
Under Assumption 7.1, we have assumed that (/pG 0X')Cii[ji and (/pG 0 X' )Cu/.J have limiting distributions. Clearly (11,,G 0 X'19)E1,\FI has a limiting distribution. Therefore, as R is consistent, Xd'(1)-1g/,/il has the same limiting distribution as that of xdo-lEtji, which is a multivariate normal distribution with the null vector as mean and a covariance matrix p lim Xd' (1)-1 Xd In. Returning now to Eq. (7.79) we see that - 5) 4 N(0,
Vd),
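and, as we have seen that $V_d = I^{\delta\delta}$, the estimator $\hat\delta_d$ achieves the Cramer-Rao lower bound and is therefore asymptotically efficient.

7.3.5.3. Maximum-Likelihood Estimator as Iterative Instrumental Variable Estimator

In the preceding analysis, we showed that efficient IVE involves finding the appropriate instrumental variable for the $H$ part of $H^d$. In this section we show that the MLE has a similar IVE interpretation, although, as always, this interpretation is iterative. Recall that the components of the score vector are
\[
\frac{\partial l}{\partial \delta} = -nW' \operatorname{vec}(B^{-1\prime}) + H^{d\prime}(\Sigma^{-1} \otimes I_n)\varepsilon,
\]
\[
\frac{\partial l}{\partial r} = -K_{pG,G}(\Sigma^{-1} \otimes U_p')\varepsilon, \tag{7.80}
\]
\[
\frac{\partial l}{\partial v} = \frac{D'}{2}\left(\operatorname{vec} \Sigma^{-1}E'E\Sigma^{-1} - n \operatorname{vec} \Sigma^{-1}\right).
\]
In equating $\partial l/\partial r$ to the null vector, we proceed as we did in Subsection 6.3.8.2 to obtain $\tilde R' = -(U_p'U_p)^{-1}U_p'U$, whereas $\partial l/\partial v = 0$ gives $\tilde\Sigma = E'E/n$. In dealing with $\partial l/\partial \delta = 0$, we proceed in exactly the same manner as we did in Subsection 7.2.8.1 to obtain
\[
\frac{\partial l}{\partial \delta} = [H^{d\prime} - W'(I_G \otimes B^{-1\prime}E')](\tilde\Sigma^{-1} \otimes I_n)\varepsilon.
\]
Now,
\[
H - (I_G \otimes EB^{-1})W = \bar H = \operatorname{diag}\{H_i - EB^{-1}W_i\},
\]
where $H_i - EB^{-1}W_i = (Y_i, X_i) - EB^{-1}(\bar W_i, O)$. However, writing the reduced-form equation as $Y = Z\Pi + EB^{-1}$, we see that
\[
Y_i = Y\bar W_i = Z\Pi\bar W_i + EB^{-1}\bar W_i = Z\Pi_i + EB^{-1}\bar W_i,
\]
so $Y_i - EB^{-1}\bar W_i$ gives the systematic part of the reduced form of $Y_i$, that is, the part that is left after the influence of the random disturbance matrix $E$ has been removed. Thus,
\[
\bar H = \operatorname{diag}\{\bar H_i\} \quad \text{with} \quad \bar H_i = (Z\Pi_i \;\; X_i),
\]
and
\[
H^d - (I_G \otimes EB^{-1})W = \bar H + (R \otimes I_n)H_p = \bar H^d,
\]
say, and
\[
\frac{\partial l}{\partial \delta} = \bar H^{d\prime}(\tilde\Sigma^{-1} \otimes I_n)(y^d - H^d\delta).
\]
Equating this derivative to the null vector and replacing the remaining parameters with their MLEs gives
\[
\tilde{\bar H}^{d\prime}(\tilde\Sigma^{-1} \otimes I_n)(\tilde y^d - \tilde H^d\tilde\delta) = 0, \tag{7.81}
\]
where
\[
\tilde{\bar H}^d = \tilde{\bar H} + (\tilde R \otimes I_n)H_p, \qquad \tilde{\bar H} = \operatorname{diag}\{(Z\tilde\Pi_i \;\; X_i)\},
\]
$\tilde y^d = M(\tilde r)y$, $\tilde H^d = H + (\tilde R \otimes I_n)H_p$, and $\tilde\delta$, $\tilde R$, $\tilde\Pi_i$, and $\tilde r = \operatorname{vec} \tilde R$ are the MLEs of $\delta$, $R$, $\Pi_i$, and $r$, respectively. Solving Eq. (7.81) gives
\[
\tilde\delta = [\tilde{\bar H}^{d\prime}(\tilde\Sigma^{-1} \otimes I_n)\tilde H^d]^{-1}\tilde{\bar H}^{d\prime}(\tilde\Sigma^{-1} \otimes I_n)\tilde y^d.
\]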
Although the interpretation is iterative, it clearly points to the estimation procedures outlined in the preceding subsections.

7.3.5.4. The Lagrangian Multiplier Test for $H_0: r = 0$

If the disturbances of the LSE model are not subject to a vector autoregressive process, then the estimation of $\delta$ is a far simpler affair. In place of the estimators of the preceding subsections we would use the 3SLS estimator obtained from $y = H\delta + u$ by using $X$ to form instrumental variables for $H$, namely
\[
\hat\delta = [H'(\hat\Sigma^{-1} \otimes P_x)H]^{-1}H'(\hat\Sigma^{-1} \otimes P_x)y,
\]
where we form the estimator $\hat\Sigma$ by using the 2SLS residual vectors (again using $X$ to form instrumental variables) in the usual way. It is of interest to us then to develop a test statistic for the null hypothesis $H_0: r = 0$ against $H_A: r \neq 0$. As with the SURE model, the most amenable test statistic is the LMT statistic, which is given by
\[
T_1 = \frac{1}{n}\,\frac{\partial l}{\partial r}'\bigg|_{\hat\theta}\, I^{rr}(\hat\theta)\, \frac{\partial l}{\partial r}\bigg|_{\hat\theta},
\]
where in forming $\hat\theta$ we put $r$ equal to the null vector and evaluate all other parameters at the constrained MLEs, the MLEs we get for $\delta$ and $v$ after we set $r$ equal to the null vector. The constrained MLE for $\delta$ in this context is the FIML estimator for $\delta$ discussed in Section 7.2. Asymptotically this estimator is equivalent to the 3SLS estimator $\hat\delta$.

We proceed as we did with the SURE model in Subsection 6.3.8.3. As in that model the actual test statistic itself will depend on the case before us. We have seen that if $X$ is strictly exogenous then we can simplify $I^{rr}$ so it is given by Eq. (7.47). With $X$ containing lagged endogenous variables, we are stuck with the more complicated expression for $I^{rr}$ given by Eq. (7.45). Of course, for both cases when $r = 0$, $M(r) = I_{nG}$, $\varepsilon = u$, $H^d = H$, $U = E$, $U_p = E_p$, and $\operatorname{plim} E_p'E_p/n = \operatorname{plim} U_p'U_p/n = I_p \otimes \Sigma$.

Consider now the simpler case in which $X$ is exogenous. From Eq. (7.47) we have
\[
I^{rr}(\theta)\big|_{r=0} = I_p \otimes \Sigma^{-1} \otimes \Sigma,
\]
and, from Eq. (7.80), we have
\[
\frac{\partial l}{\partial r}\bigg|_{r=0} = -K_{pG,G}(\Sigma^{-1} \otimes U_p')u. \tag{7.82}
\]
Marrying these two components together, we can write the LMT statistic as
\[
T_1' = \frac{1}{n}\hat u'(\hat\Sigma^{-1} \otimes \hat U_p)K_{G,pG}(I_p \otimes \hat\Sigma^{-1} \otimes \hat\Sigma)K_{pG,G}(\hat\Sigma^{-1} \otimes \hat U_p')\hat u
= \frac{1}{n}\hat u'[\hat\Sigma^{-1} \otimes \hat U_p(I_p \otimes \hat\Sigma^{-1})\hat U_p']\hat u
= n\hat u'\{(\hat U'\hat U)^{-1} \otimes \hat U_p[I_p \otimes (\hat U'\hat U)^{-1}]\hat U_p'\}\hat u,
\]
where $\hat u$ is the FIML residual vector, $\hat U = \operatorname{devec}_n \hat u$, $\hat\Sigma = \hat U'\hat U/n$, and $\hat U_p = S(I_p \otimes \hat U)$. (An asymptotically equivalent test statistic would use the 3SLS residuals formed from $\hat\delta$.) Under $H_0$, $T_1'$ has a limiting $\chi^2$ distribution with $pG^2$ degrees of freedom, the upper tail of this distribution being used to obtain the appropriate critical region.
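The last form of the statistic can be evaluated directly from a residual matrix. The following is a minimal NumPy sketch, not code from the text: the function names `lagged_block` and `lm_statistic` are hypothetical, the input `U` is assumed to be the $n \times G$ matrix of FIML (or 3SLS) residuals, and presample values are set to zero as in the text. It uses the identity $u'(A \otimes B)u = \operatorname{tr}(AU'BU)$ for $u = \operatorname{vec} U$ and symmetric $A$, so the $nG \times nG$ Kronecker product is never formed.

```python
import numpy as np

def lagged_block(U, p):
    """Return U_p = S(I_p kron U) = (S_1 U, ..., S_p U).

    Block j holds U lagged j periods, with presample rows set to zero.
    """
    n, G = U.shape
    blocks = []
    for j in range(1, p + 1):
        lag = np.zeros_like(U)
        lag[j:, :] = U[:n - j, :]      # row t of block j is u_{t-j}
        blocks.append(lag)
    return np.hstack(blocks)           # n x Gp

def lm_statistic(U, p):
    """n u'{(U'U)^{-1} kron U_p [I_p kron (U'U)^{-1}] U_p'} u, with u = vec U."""
    n, G = U.shape
    Up = lagged_block(U, p)
    A = np.linalg.inv(U.T @ U)                   # (U'U)^{-1}, G x G
    Bmat = Up @ np.kron(np.eye(p), A) @ Up.T     # n x n, positive semidefinite
    # u'(A kron B)u = tr(A U' B U) for symmetric A and u = vec U
    return n * np.trace(A @ U.T @ Bmat @ U)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    U = rng.standard_normal((50, 2))             # stand-in residuals, G = 2
    print(lm_statistic(U, p=1))                  # compare with chi-square, p*G^2 = 4 d.f.
```

The value would be compared with the upper tail of a $\chi^2$ distribution with $pG^2$ degrees of freedom, as described above.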
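Note that the form of the LMT statistic obtained here is exactly the same as that obtained for the SURE model in Subsection 6.3.8.3, although of course the residuals used differ in the two models. It follows then that the current LMT statistic before us has the same intuitive F test interpretation developed in Subsection 6.3.8.3.

Now for the more complicated case in which $X$ contains lagged endogenous variables: For this case we have seen that $I^{rr}$ is more complex now, being given by Eq. (7.45). Setting $r$ equal to the null vector in this expression involves $I^{\delta\delta}|_{r=0}$, which, from Eq. (7.44), we can write as
\[
I^{\delta\delta}(\theta)\big|_{r=0} = \{\operatorname{plim} H'[\Sigma^{-1} \otimes (P_z - P_p)]H/n\}^{-1},
\]
where $P_p = U_p(U_p'U_p)^{-1}U_p'$. Thus,
\[
I^{rr}(\theta)\big|_{r=0} = I_p \otimes \Sigma^{-1} \otimes \Sigma + K_{pG,G}\Big\{\operatorname{plim} \frac{1}{n^2}[I_G \otimes (I_p \otimes \Sigma^{-1})U_p']\,H\,I^{\delta\delta}(\theta)\big|_{r=0}\,H'[I_G \otimes U_p(I_p \otimes \Sigma^{-1})]\Big\}K_{G,pG}.
\]
Marrying this with $\partial l/\partial r|_{r=0}$ given by Eq. (7.82) and remembering that $K_{pG,G}' = K_{pG,G}^{-1} = K_{G,pG}$, we write the LMT statistic, ignoring the p lim, as
\[
T_1 = T_1' + \frac{1}{n^3}u'[\Sigma^{-1} \otimes U_p(I_p \otimes \Sigma^{-1})U_p']\,H\,I^{\delta\delta}(\theta)\big|_{r=0}\,H'[\Sigma^{-1} \otimes U_p(I_p \otimes \Sigma^{-1})U_p']u. \tag{7.83}
\]
In evaluating this expression at $\hat\theta$, we put $\hat u$, $\hat U'\hat U/n$, and $\hat U_p$ in place of $u$, $\Sigma$, and $U_p$, respectively, where $\hat u$ is the FIML residual vector, $\hat U = \operatorname{devec}_n \hat u$, and $\hat U_p = S(I_p \otimes \hat U)$. For $I^{\delta\delta}$ we ignore the p lim and use
\[
I^{\delta\delta}(\hat\theta) = \{H'[\hat\Sigma^{-1} \otimes (P_z - \hat P_p)]H/n\}^{-1},
\]
where $\hat P_p = \hat U_p(\hat U_p'\hat U_p)^{-1}\hat U_p'$ or, asymptotically equivalently, as $\operatorname{plim} U_p'U_p/n = I_p \otimes \Sigma$,
\[
I^{\delta\delta}(\hat\theta) = \{H'[\hat\Sigma^{-1} \otimes (P_z - \hat U_p(I_p \otimes \hat\Sigma^{-1})\hat U_p'/n)]H/n\}^{-1}.
\]
Comparing the statistic with the equivalent statistic for the SURE model given by Eq. (6.14), we see that the two test statistics have the same format with one qualification concerning $I^{\delta\delta}(\theta)|_{r=0}$. The matrix in the test statistics for the two models is
\[
I^{\delta\delta}(\hat\theta) = \{X'[\hat\Sigma^{-1} \otimes (I_n - \hat P_p)]X/n\}^{-1} \quad \text{for the SURE model,}
\]
\[
I^{\delta\delta}(\hat\theta) = \{H'[\hat\Sigma^{-1} \otimes (I_n - \hat P_p - M_z)]H/n\}^{-1} \quad \text{for the LSE model.}
\]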
The introduction of endogenous variables into the picture introduces the matrix $M_z = I_n - Z(Z'Z)^{-1}Z'$ into the expression for $I^{\delta\delta}(\hat\theta)$. There is one other asymptotically equivalent way of writing our test statistic. We saw in Eq. (7.43) that it is possible to write
\[
I^{\delta\delta}(\theta) = \{\operatorname{plim} H^{d\prime}[\Sigma^{-1} \otimes (M_E - P_p)]H^d/n\}^{-1},
\]
so
\[
I^{\delta\delta}(\theta)\big|_{r=0} = \{\operatorname{plim} H'[\Sigma^{-1} \otimes (M_u - P_p)]H/n\}^{-1},
\]
where $M_u = I_n - U(U'U)^{-1}U'$. Ignoring the p lim we could, if we like, use
\[
I^{\delta\delta}(\hat\theta) = \{H'[\hat\Sigma^{-1} \otimes (\hat M_u - \hat P_p)]H/n\}^{-1} \tag{7.84}
\]
in the expression for $T_1$, where $\hat M_u = I_n - \hat U(\hat U'\hat U)^{-1}\hat U'$.

It is also possible to get some comparative insight into the Wald test by using the iterative solution of the MLE of $R$ in much the same way as we did in Subsection 6.3.8.3. The detailed analysis is left to the reader.

7.4. THE LINEAR SIMULTANEOUS EQUATIONS MODEL WITH VECTOR MOVING-AVERAGE DISTURBANCES

7.4.1. The Model and Its Assumptions
In this final section, we consider a further extension of the LSE model by allowing the disturbances to be subject to a vector moving-average system of order $p$. We use the same notation for the disturbance system as that used in Subsection 6.4.1. Replacing presample values with zero, we can write the model under consideration as
\[
y = H\delta + u, \qquad u = M(r)\varepsilon, \qquad \varepsilon \sim N(0, \Sigma \otimes I_n).
\]
Assuming invertibility, we write
\[
\varepsilon = M(r)^{-1}u.
\]
As always, we assume no problems from unit roots that arise from lagged endogenous variables. Alternatively, we can write the model as
\[
YB + X\Gamma = U, \qquad U = E_pR' + E, \tag{7.85}
\]
where the rows of the random matrix $E$ are assumed to be independently, identically normally distributed random vectors with null vectors as means and covariance matrix $\Sigma$. Again $E_p$ is the $n \times Gp$ matrix $(E_{-1} \cdots E_{-p})$, and if we replace presample values with zero we can write
\[
E_p = S(I_p \otimes E),
\]
where $S$ is the $n \times np$ shifting matrix $S = (S_1 \cdots S_p)$. The reduced form of the model is
\[
Y = X\Pi_1 + UB^{-1}, \tag{7.86}
\]
with $U = E_pR' + E$. More will be said about this reduced form later on in this section.

7.4.2. The Parameters of the Model, the Log-Likelihood Function, the Score Vector, and the Hessian Matrix

The parameters of the model are $\theta = (\delta' \; r' \; v')'$ and the log-likelihood function, apart from a constant, is
\[
l(\theta) = n \log|\det B| - \frac{n}{2}\log\det\Sigma - \frac{1}{2}\varepsilon'(\Sigma^{-1} \otimes I_n)\varepsilon,
\]
except now in this function we set $\varepsilon$ equal to $M(r)^{-1}(y - H\delta)$, and we can write $\varepsilon = y^* - H^*\delta$, where $y^* = M(r)^{-1}y$ and $H^* = M(r)^{-1}H$.

The same comments can be made about the derivatives we now require as those made in Subsection 7.3.2. Derivatives with respect to $r$ and $v$ that we need for both the score vector and for the Hessian matrix are the same as those derived in Subsections 6.4.4 and 6.4.5 with the proviso that we replace $X^*$ with $H^*$. Derivatives involving $\delta$ but not $r$ can be derived from those obtained in Subsections 7.2.4 and 7.2.5 with the proviso that $H$ is now replaced with $H^*$, $u$ with $\varepsilon$, and $U$ with $E$. Finally, we obtain $\partial^2 l/\partial\delta\,\partial r'$ from the equivalent expression of Subsection 6.4.5 by replacing $X^*$ with $H^*$. Thus, we obtain the score vector and the Hessian matrix.

SCORE VECTOR
\[
\frac{\partial l}{\partial \delta} = -nW'\operatorname{vec}(B^{-1\prime}) + H^{*\prime}(\Sigma^{-1} \otimes I_n)\varepsilon,
\]
\[
\frac{\partial l}{\partial r} = K_{pG,G}(I_G \otimes E_p')M(r)^{-1\prime}(\Sigma^{-1} \otimes I_n)\varepsilon,
\]
\[
\frac{\partial l}{\partial v} = \frac{D'}{2}\left(\operatorname{vec}\Sigma^{-1}E'E\Sigma^{-1} - n\operatorname{vec}\Sigma^{-1}\right).
\]
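To make the log-likelihood concrete, the sketch below evaluates it for given parameter values. It is an illustration under stated assumptions rather than the author's procedure: the function names `ma_innovations` and `log_likelihood` are hypothetical, $R$ is taken to be partitioned as $(R_1 \cdots R_p)$ with each $R_j$ of order $G \times G$, and presample innovations are set to zero. The innovations are recovered row by row from $U = E_pR' + E$, that is, $\varepsilon_t = u_t - \sum_j R_j\varepsilon_{t-j}$, and the quadratic form is computed as $\operatorname{tr}(\Sigma^{-1}E'E)$, which equals $\varepsilon'(\Sigma^{-1} \otimes I_n)\varepsilon$ for $\varepsilon = \operatorname{vec} E$.

```python
import numpy as np

def ma_innovations(U, R):
    """Solve U = E_p R' + E for E, with presample rows of E set to zero.

    R is G x Gp and is read as (R_1, ..., R_p), each R_j being G x G.
    """
    n, G = U.shape
    p = R.shape[1] // G
    E = np.zeros_like(U)
    for t in range(n):
        e = U[t].copy()
        for j in range(1, p + 1):
            if t - j >= 0:
                Rj = R[:, (j - 1) * G: j * G]
                e -= Rj @ E[t - j]      # u_t = e_t + sum_j R_j e_{t-j}
        E[t] = e
    return E

def log_likelihood(Y, X, B, Gamma, R, Sigma):
    """l = n log|det B| - (n/2) log det Sigma - (1/2) eps'(Sigma^{-1} kron I_n) eps."""
    n = Y.shape[0]
    U = Y @ B + X @ Gamma               # structural disturbances, Eq. (7.85)
    E = ma_innovations(U, R)
    Sinv = np.linalg.inv(Sigma)
    quad = np.trace(Sinv @ E.T @ E)     # same as the Kronecker quadratic form
    return (n * np.log(abs(np.linalg.det(B)))
            - 0.5 * n * np.log(np.linalg.det(Sigma))
            - 0.5 * quad)
```

Working with the $n \times G$ matrix $E$ instead of the stacked vector $\varepsilon$ avoids building $M(r)^{-1}$ explicitly, which matters because that matrix is of order $nG \times nG$.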
HESSIAN MATRIX a2/
asas, a 21
= —nW' KGG(B -1' B-1)W — H*'(E -1 0In )H* , —
a 21 nay, a 21 arar'
(E -10 In)A/10-1(1G EP)KG,PG
— H 44' C{I pG 0 [M(r)-1 ]1(IG
vec =1 )1.
H*'(E -1 EE-1)D, —KPG.G(IG 0 E;)M(r)-1' x {IpG [M(r) vec EE -1)1 —{I pG [IG (vec EE -1)'1[M(r)-1 ]'1CM(r)-1
E P) KG,PG
x (IG
— K pG,G(IG Ep)A10-1'
—1 ® in)m(r)-1
X (IG 0 E p)KG,pG •
a zi KpG,G( IG
1 EE-1)D,
E'p
arav' n 2, n Df(E -1 0 E -1)D D/(E -1 0 E -1 EfE " avav' — 2 7.4.3. The Information Matrix 1(0) =
$-\operatorname{plim}\dfrac{1}{n}\,\partial^2 l/\partial\theta\,\partial\theta'$

The probability limits with reference to the derivatives of the Hessian matrix that involve $r$ or $v$ but not $\delta$ have already been evaluated in Appendix 6.A. The work that confronts us here is to evaluate the probability limits that refer to the derivatives of the Hessian matrix that involve $\delta$. Before we do this it is expedient for us to return to the properties of $M(r)^{-1}$ discussed in Subsection 6.4.2. In that subsection we saw that
\[
M(r)^{-1} = \begin{bmatrix} M^{11} & \cdots & M^{1G} \\ \vdots & & \vdots \\ M^{G1} & \cdots & M^{GG} \end{bmatrix},
\]
where each submatrix is $n \times n$ and $M^{ii}$ is a lower-triangular matrix with ones down its main diagonal whereas $M^{ij}$, $i \neq j$, is strictly lower triangular. In fact, we saw that each $M^{ii}$ is a Toeplitz matrix of the form
\[
I_n + a_1S_1 + \cdots + a_{n-1}S_{n-1}
\]
for suitable $a_i$s that are functions of the elements of $M(r)$, whereas each $M^{ij}$, $i \neq j$, is of the form
\[
b_1S_1 + \cdots + b_{n-1}S_{n-1},
\]
again for suitable $b_j$s that are functions of the elements of $M(r)$. It follows that we can write
\[
M(r)^{-1} = I_{Gn} + \begin{bmatrix} N^{11} & \cdots & N^{1G} \\ \vdots & & \vdots \\ N^{G1} & \cdots & N^{GG} \end{bmatrix} = I_{Gn} + N,
\]
say, where $N^{ij} = M^{ij}$, $i \neq j$, and $N^{ii} = M^{ii} - I_n$. Therefore, each $n \times n$ submatrix $N^{ij}$ is a strictly lower-triangular Toeplitz matrix of the form
\[
b_1S_1 + \cdots + b_{n-1}S_{n-1}.
\]
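The block structure just described can be checked numerically for a small system. The sketch below is illustrative only and rests on assumptions labeled here: the helper names `shift`, `ma_matrix`, `block`, and `has_expected_structure` are hypothetical, and $M(r)$ is built as $I_{nG} + \sum_j R_j \otimes S_j$, which is what $u = \operatorname{vec}(E_pR' + E)$ implies when $R = (R_1 \cdots R_p)$ and presample values are zero.

```python
import numpy as np

def shift(n, j):
    """Shifting matrix S_j: ones on the j-th subdiagonal, so S_j y lags y by j periods."""
    return np.eye(n, k=-j)

def ma_matrix(R, n):
    """M(r) = I_{nG} + sum_j (R_j kron S_j) for R = (R_1, ..., R_p), each R_j G x G."""
    G = R.shape[0]
    p = R.shape[1] // G
    M = np.eye(n * G)
    for j in range(1, p + 1):
        Rj = R[:, (j - 1) * G: j * G]
        M += np.kron(Rj, shift(n, j))
    return M

def block(A, i, j, n):
    """Return the n x n submatrix in block position (i, j), zero based."""
    return A[i * n:(i + 1) * n, j * n:(j + 1) * n]

def has_expected_structure(Minv, G, n):
    """Check: every block is lower-triangular Toeplitz; diagonal blocks have unit
    diagonals and off-diagonal blocks have zero diagonals."""
    for i in range(G):
        for j in range(G):
            blk = block(Minv, i, j, n)
            lower = np.allclose(blk, np.tril(blk))
            diag_ok = np.allclose(np.diag(blk), 1.0 if i == j else 0.0)
            toeplitz = np.allclose(blk[1:, 1:], blk[:-1, :-1])
            if not (lower and diag_ok and toeplitz):
                return False
    return True

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, G, p = 6, 2, 2
    R = 0.4 * rng.standard_normal((G, G * p))
    Minv = np.linalg.inv(ma_matrix(R, n))
    print(has_expected_structure(Minv, G, n))
```

The inverse always exists here because $\sum_j R_j \otimes S_j$ is nilpotent: every product of $n$ or more shifting matrices is null.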
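Recall that we have written
\[
M(r)^{-1\prime} = \begin{bmatrix} M_{11} & \cdots & M_{1G} \\ \vdots & & \vdots \\ M_{G1} & \cdots & M_{GG} \end{bmatrix}.
\]
Here it is convenient also to write
\[
M(r)^{-1\prime} = I_{Gn} + N' = I_{Gn} + \begin{bmatrix} N^{11\prime} & \cdots & N^{G1\prime} \\ \vdots & & \vdots \\ N^{1G\prime} & \cdots & N^{GG\prime} \end{bmatrix}
= I_{Gn} + \begin{bmatrix} \mathcal{N}_{11} & \cdots & \mathcal{N}_{1G} \\ \vdots & & \vdots \\ \mathcal{N}_{G1} & \cdots & \mathcal{N}_{GG} \end{bmatrix},
\]
say, where each $n \times n$ matrix $\mathcal{N}_{ij}$ is strictly upper-triangular Toeplitz and of the form
\[
b_1S_1' + \cdots + b_{n-1}S_{n-1}'.
\]
Using the notation we have just introduced, we can write
\[
H^{*\prime}(I_G \otimes E) = H'M(r)^{-1\prime}(I_G \otimes E) = H'(I_G \otimes E) + H'N'(I_G \otimes E).
\]
Now,
\[
H'(I_G \otimes E) = \operatorname{diag}\{H_i'E\} = \operatorname{diag}\left\{\begin{pmatrix} Y_i'E \\ X_i'E \end{pmatrix}\right\}. \tag{7.87}
\]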
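Regardless of the nature of $X$, $\operatorname{plim} X'E/n = 0$, and so $\operatorname{plim} X_i'E/n = 0$. Also, from reduced-form equation (7.86),
\[
Y_i'E = \bar W_i'Y'E = \bar W_i'\Pi_1'X'E + \bar W_i'B^{-1\prime}RE_p'E + \bar W_i'B^{-1\prime}E'E.
\]
However, $\operatorname{plim} E_p'E/n = 0$, so $\operatorname{plim} Y_i'E/n = \bar W_i'B^{-1\prime}\Sigma$, and returning to Eq. (7.87) we see that
\[
\operatorname{plim} H'(I_G \otimes E)/n = W'(I_G \otimes B^{-1\prime}\Sigma),
\]
regardless of whether $X$ is strictly exogenous or contains lagged endogenous variables. We now wish to show that $\operatorname{plim} H'N'(I_G \otimes E)/n = 0$. To do this, we write
\[
H'N'(I_G \otimes E) = \begin{bmatrix} H_1'\mathcal{N}_{11}E & \cdots & H_1'\mathcal{N}_{1G}E \\ \vdots & & \vdots \\ H_G'\mathcal{N}_{G1}E & \cdots & H_G'\mathcal{N}_{GG}E \end{bmatrix},
\qquad
H_i'\mathcal{N}_{ij}E = A_i'\begin{pmatrix} Y'\mathcal{N}_{ij}E \\ X'\mathcal{N}_{ij}E \end{pmatrix},
\]
where $A_i$ is the selection matrix given by
\[
A_i = \begin{pmatrix} \bar W_i & O \\ O & \bar T_i \end{pmatrix}
\]
and the selection matrices $W$, $\bar W_i$, and $\bar T_i$ are defined in Subsection 7.2.3. Now $Y'\mathcal{N}_{ij}E$ and $X'\mathcal{N}_{ij}E$ are $G \times G$ and $k \times G$ matrices whose $(r, s)$ elements are $y_r'\mathcal{N}_{ij}\varepsilon_s$ and $x_r'\mathcal{N}_{ij}\varepsilon_s$, respectively. It follows that, as $\mathcal{N}_{ij}$ is strictly upper triangular,
\[
\operatorname{plim} y_r'\mathcal{N}_{ij}\varepsilon_s/n = 0, \qquad \operatorname{plim} x_r'\mathcal{N}_{ij}\varepsilon_s/n = 0,
\]
regardless of whether $X$ is strictly exogenous or contains lagged endogenous variables, so $\operatorname{plim} H_i'\mathcal{N}_{ij}E/n = 0$, and we have achieved our aim. We have then that
\[
\operatorname{plim} H^{*\prime}(I_G \otimes E)/n = W'(I_G \otimes B^{-1\prime}\Sigma), \tag{7.88}
\]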
regardless of the nature of X. (I pG 0 at-)In is the null vector, We now wish to prove that p lim where we recall that a is the nG x 1 vector given by a = M(r)-1' vec EE and, from the properties of the devec operator, = [M(r)-'l (IG 0 vec EE -1). The analysis here is the same as that conducted in Appendix 6.A, except that
we now have H in place of X. Following that appendix, we write fr'C'(/pG at") = H*'(IG 0 Scat"
1G 0 S' al"),
flp/i ii S' H*'(IG 0
= IVAGGS'f at^
for j = 1, . , p. Now recalling how we wrote at., we have vec EE -1 MGvec EE -1)
= Hi' Mu
(7.89) for i, / = 1, . . . , G. We consider the typical matrix on the right-hand side of Eq. (7.89) H, MijSj Mk vec EE -1= Hi'MitS'i Mk(E-1 0OE GG
E E ars.MksE)
=
(
r=1 s=1
for k = 1, , G and where E-1 ={Q's). Therefore, in evaluating plim 114`'C'(IpG0 dr.)/ n we are typically looking at plimM ii S'f .A4 - krEsIn. Above we noted that Miiis upper triangular and Mi., , i j,is strictly upper triangular so from the properties of shifting matrices Y./ A/4ris strictly upper triangular and therefore so is MilS k r .As such, plimk Mil S'i Mkr Es In is the null vector even if Xicontains lagged dependent variables. We conclude that, regardless of the nature of X, plim H'C'(1pG
= 0.
With these probability limits in hand and using those derived in the equivalent SURE model, we are now in a position to write the information matrix. Let
\[
I(\theta) = \begin{bmatrix} I_{\delta\delta} & I_{\delta r} & I_{\delta v} \\ I_{r\delta} & I_{rr} & I_{rv} \\ I_{v\delta} & I_{vr} & I_{vv} \end{bmatrix};
\]
then
\[
I_{\delta\delta} = W'K_{GG}(B^{-1\prime} \otimes B^{-1})W + \operatorname{plim} H^{*\prime}(\Sigma^{-1} \otimes I_n)H^*/n,
\]
\[
I_{\delta r} = \operatorname{plim}\frac{1}{n}H^{*\prime}(\Sigma^{-1} \otimes I_n)M(r)^{-1}(I_G \otimes E_p)K_{G,pG}, \tag{7.90}
\]
\[
I_{\delta v} = W'(\Sigma^{-1} \otimes B^{-1\prime})D, \qquad I_{rv} = 0,
\]
\[
I_{rr} = \operatorname{plim}\frac{1}{n}K_{pG,G}(I_G \otimes E_p')M(r)^{-1\prime}(\Sigma^{-1} \otimes I_n)M(r)^{-1}(I_G \otimes E_p)K_{G,pG},
\]
\[
I_{vv} = \frac{1}{2}D'(\Sigma^{-1} \otimes \Sigma^{-1})D.
\]
Comparing this information matrix with that obtained for the corresponding SURE model given in Subsection 6.4.6 we see that the former is considerably more complex than the latter. The matrix in the block (1, 1) position has an extra term in it, namely $W'K_{GG}(B^{-1\prime} \otimes B^{-1})W$. Also the matrix in the block (1, 3) position is now nonnull. In fact, this information matrix, unlike all previous ones considered in this work, does not lend itself to further simplification when we restrict ourselves to the case in which $X$ is strictly exogenous. Consider $I_{\delta r}$. Then, following an analysis similar to that conducted at the end of Appendix 6.A, we see that evaluating the probability limit, that is, $I_{\delta r}$, involves our looking at
\[
\operatorname{plim} H_i'M_r(\Sigma^{-1} \otimes I_n)M^sS_k\varepsilon_j/n = \operatorname{plim}\sum_{x=1}^{G}\sum_{y=1}^{G}\sigma^{xy}H_i'M_{rx}M^{ys}S_k\varepsilon_j/n
\]
for $i, j, r, s = 1, \ldots, G$, $k = 1, \ldots, p$. It is the complicated nature of the matrix $M_{rx}M^{ys}S_k$ that prevents further evaluation. All we know of this matrix is that, from the properties of the shifting matrix $S_k$, the last $k$ columns of the matrix are all $n \times 1$ null vectors. Fortunately, assuming that such p lims exist and leaving $I_{\delta r}$ in the form given by Eq. (7.90) suffices for our purposes.

7.4.4. The Cramer-Rao Lower Bound $I^{-1}(\theta)$

For one last time we invert the information matrix to obtain the Cramer-Rao lower bound. As always, we let
\[
I^{-1}(\theta) = \begin{bmatrix} I^{\delta\delta} & I^{\delta r} & I^{\delta v} \\ I^{r\delta} & I^{rr} & I^{rv} \\ I^{v\delta} & I^{vr} & I^{vv} \end{bmatrix}.
\]
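Then
\[
\begin{aligned}
I^{\delta\delta} = \Big\{ & W'K_{GG}(B^{-1\prime} \otimes B^{-1})W + \operatorname{plim} H^{*\prime}(\Sigma^{-1} \otimes I_n)H^*/n \\
& - 2W'(\Sigma^{-1} \otimes B^{-1\prime})D[D'(\Sigma^{-1} \otimes \Sigma^{-1})D]^{-1}D'(\Sigma^{-1} \otimes B^{-1})W \\
& - \operatorname{plim}\frac{1}{n}H^{*\prime}(\Sigma^{-1} \otimes I_n)M(r)^{-1}(I_G \otimes E_p)K_{G,pG}\,[K_{pG,G}(I_G \otimes E_p')M(r)^{-1\prime}(\Sigma^{-1} \otimes I_n)M(r)^{-1}(I_G \otimes E_p)K_{G,pG}]^{-1} \\
& \quad \times K_{pG,G}(I_G \otimes E_p')M(r)^{-1\prime}(\Sigma^{-1} \otimes I_n)H^* \Big\}^{-1}.
\end{aligned}
\]
In Subsection 7.3.4, we saw that we could write the third matrix in this inverse as
\[
-W'(\Sigma^{-1} \otimes B^{-1\prime}\Sigma B^{-1})W - W'K_{GG}(B^{-1\prime} \otimes B^{-1})W
\]
and, as we have shown that $\operatorname{plim} H^{*\prime}(I_G \otimes E)/n = W'(I_G \otimes B^{-1\prime}\Sigma)$,
\[
W'(\Sigma^{-1} \otimes B^{-1\prime}\Sigma B^{-1})W = \operatorname{plim} H^{*\prime}(\Sigma^{-1} \otimes P_E)H^*/n,
\]
where $P_E = E(E'E)^{-1}E'$. If we now let
\[
\mathcal{F} = (\Sigma^{-1} \otimes I_n)M(r)^{-1}(I_G \otimes E_p),
\]
we can write
\[
I^{\delta\delta} = \Big\{\operatorname{plim}\big[H^{*\prime}(\Sigma^{-1} \otimes M_E)H^* - H^{*\prime}\mathcal{F}[\mathcal{F}'(\Sigma \otimes I_n)\mathcal{F}]^{-1}\mathcal{F}'H^*\big]/n\Big\}^{-1},
\]
where $M_E = I_n - P_E$. The other components of the Cramer-Rao lower bound are
\[
I^{\delta r} = -I^{\delta\delta}\operatorname{plim} H^{*\prime}\mathcal{F}[\mathcal{F}'(\Sigma \otimes I_n)\mathcal{F}]^{-1}K_{G,pG},
\]
\[
I^{\delta v} = -2I^{\delta\delta}W'(I_G \otimes B^{-1\prime}\Sigma)NL',
\]
\[
I^{rr} = \operatorname{plim}\, n\,K_{pG,G}[\mathcal{F}'(\Sigma \otimes I_n)\mathcal{F}]^{-1}\big[\mathcal{F}'(\Sigma \otimes I_n)\mathcal{F} + \mathcal{F}'H^*I^{\delta\delta}H^{*\prime}\mathcal{F}/n\big][\mathcal{F}'(\Sigma \otimes I_n)\mathcal{F}]^{-1}K_{G,pG},
\]
\[
I^{rv} = 2K_{pG,G}\operatorname{plim}[\mathcal{F}'(\Sigma \otimes I_n)\mathcal{F}]^{-1}\mathcal{F}'H^*I^{\delta\delta}W'(I_G \otimes B^{-1\prime}\Sigma)NL',
\]
\[
I^{vv} = 2LN[\Sigma \otimes \Sigma + 2(I_G \otimes \Sigma B^{-1})WI^{\delta\delta}W'(I_G \otimes B^{-1\prime}\Sigma)]NL'.
\]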
Consider the case in which S is the vector of parameters of primary interest. If the nuisance parameters given by the vector r were known, then the Cramer-Rao lower bound for a consistent estimator of S would be I*" = {p lim H1E -10 (In -PE)]H*04-1.
It follows that 1 T = plim- fr T[F(E 0 In ),F]-1 H* n
represents an addition to the Cramer-Rao low bound for S that comes about because r is unknown. 7.4.5. Maximum-Likelihood Estimator of 6 We return now to the reduced-form equation Y = X111 + UB-1.
As $U = E_pR' + E$, we can write this equation as
\[
Y = X\Pi_1 + E_pR'B^{-1} + EB^{-1}.
\]
Now, from Eq. (7.85),
\[
E = YB + X\Gamma - E_pR',
\]
so
\[
E_{-1} = Y_{-1}B + X_{-1}\Gamma - E_{p,-1}R', \qquad E_{-2} = Y_{-2}B + X_{-2}\Gamma - E_{p,-2}R',
\]
and so on, where
\[
E_{p,-i} = [E_{-(1+i)} \cdots E_{-(p+i)}]
\]
for $i = 1, \ldots, n - 1 + p$. It follows that if we replace presample values with zero we must be able to write the reduced-form equation as
\[
Y = \bar Z\bar\Pi + EB^{-1}, \tag{7.91}
\]
where $\bar Z = (X \;\; X_{n-1} \;\; Y_{n-1})$, $X_{n-1} = [X_{-1} \cdots X_{-(n-1)}]$, and $Y_{n-1} = [Y_{-1} \cdots Y_{-(n-1)}]$.
There is a crucial difference between this equation and the reduced-form equation of the previous model given by Eq. (7.30). The latter involved only $Z = (X \;\; X_p \;\; Y_p)$ and hence in forming instrumental variables we need consider only values of the $X$s and the $Y$s that were lagged $p$ periods. In the asymptotic analysis of the previous model, we could make the assumption that $\operatorname{plim} Z'Z/n$ exists. Now, regardless of the value of $p$, $Y$, through Eq. (7.91), is linked to variables that are lagged $n - 1$ periods. This prevents us from forming the type of IVEs considered in the previous model. The problem is that, unlike $Z$, both dimensions of the matrix $\bar Z$ are dependent on $n$, so no assumptions about the existence of $\operatorname{plim} \bar Z'\bar Z/n$ can be made. However, such an assumption is crucial in forming IVEs.

The same point can be made a different way if $y^d$ is compared and contrasted with $y^*$. Recall that
\[
y^d = M(r)y = y + (R \otimes I_n)Cy,
\]
where $R$ is a $G \times Gp$ matrix whose elements are constants (not depending on $n$) and
\[
C = \begin{pmatrix} I_G \otimes S_1 \\ \vdots \\ I_G \otimes S_p \end{pmatrix}.
\]
If we partition $R$ so $R = (R_1 \cdots R_p)$, where each $R_j$ is $G \times G$, we have
\[
y^d = y + (R_1 \otimes I_n)y_{-1} + \cdots + (R_p \otimes I_n)y_{-p},
\]
where, as always, $y_{-j} = (I_G \otimes S_j)y$ refers to the vector whose elements are those of $y$ lagged $j$ periods, $j = 1, \ldots, p$. Let $r_i^{j\prime}$ refer to the $i$th row of the matrix $R_j$. Then the $n \times 1$ vector in the $i$th block position of $y^d$, $y_i^d$, can be written as
\[
y_i^d = y_i + (r_i^{1\prime} \otimes I_n)y_{-1} + \cdots + (r_i^{p\prime} \otimes I_n)y_{-p}. \tag{7.92}
\]
We consider the $t$th element of this vector. From Eq. (7.92),
\[
y_{ti}^d = y_{ti} + (r_{i1}^1y_{t-1,1} + \cdots + r_{iG}^1y_{t-1,G}) + \cdots + (r_{i1}^py_{t-p,1} + \cdots + r_{iG}^py_{t-p,G}),
\]
that is, in forming $y_{ti}^d$ we use $y_{ti}$ plus a linear combination of the $t-1, t-2, \ldots, t-p$ values of all the endogenous variables. We form current values of $y^d$ by using the corresponding current values of $y$ plus linear combinations of lagged values of all the endogenous variables, lagged down $p$ periods. If we introduce the notation $y_t^d$ and $y_t$ for the $G \times 1$ vectors containing the $t$th values of $y^d$ and $y$, respectively, then we have
\[
y_t^d = y_t + R_1y_{t-1} + \cdots + R_py_{t-p}.
\]
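The equivalence between the stacked form $y^d = y + (R \otimes I_n)Cy$ and the period-by-period form $y_t^d = y_t + R_1y_{t-1} + \cdots + R_py_{t-p}$ can be illustrated with a short NumPy sketch. The function names `ar_matrix` and `filter_rows` are hypothetical, and the sketch assumes that $y$ is the column-stacked $\operatorname{vec} Y$ and that presample values are zero.

```python
import numpy as np

def ar_matrix(R, n):
    """M(r) = I_{nG} + (R kron I_n) C, with C stacking (I_G kron S_j), j = 1..p."""
    G = R.shape[0]
    p = R.shape[1] // G
    C = np.vstack([np.kron(np.eye(G), np.eye(n, k=-j)) for j in range(1, p + 1)])
    return np.eye(n * G) + np.kron(R, np.eye(n)) @ C

def filter_rows(Y, R):
    """Row t of the result is y_t + R_1 y_{t-1} + ... + R_p y_{t-p} (presample zero)."""
    n, G = Y.shape
    p = R.shape[1] // G
    Yd = Y.copy()
    for j in range(1, p + 1):
        Rj = R[:, (j - 1) * G: j * G]
        Yd[j:, :] += Y[:n - j, :] @ Rj.T    # adds (R_j y_{t-j})' to row t
    return Yd

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n, G, p = 7, 3, 2
    R = rng.standard_normal((G, G * p))
    Y = rng.standard_normal((n, G))
    stacked = ar_matrix(R, n) @ Y.reshape(-1, order="F")
    rowwise = filter_rows(Y, R).reshape(-1, order="F")
    print(np.allclose(stacked, rowwise))
```

Only $p$ lags enter each row, whatever the value of $n$; this is the feature of the autoregressive system that the moving-average system lacks.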
The fact that we consider only variables lagged $p$ periods in our autoregressive systems has practical implications for our asymptotic theory. Consider $\operatorname{plim}(R \otimes A_n)$, where $A_n$ is a $K \times K$ matrix, say, whose probability limit exists. Then, seeing that the elements of $R$ are constants that do not depend on $n$ and that the dimensions of $R$ do not depend on $n$, we can write
\[
\operatorname{plim}(R \otimes A_n) = (R \otimes I_K)(I_{Gp} \otimes \operatorname{plim} A_n).
\]
This greatly simplifies our asymptotic analysis. Now, in Subsection 6.4.2 we saw that it is possible to write
\[
M(r)^{-1} = I_{nG} + (\bar R \otimes I_n)\bar C, \tag{7.93}
\]
where $\bar R$ is a $G \times G(n-1)$ matrix whose elements are products of the elements of $R$ and $\bar C$ is the $Gn(n-1) \times Gn$ matrix given by
\[
\bar C = \begin{pmatrix} I_G \otimes S_1 \\ \vdots \\ I_G \otimes S_{n-1} \end{pmatrix}.
\]
It follows that we can write $y^* = M(r)^{-1}y = y + (\bar R \otimes I_n)\bar Cy$. Conducting an analysis similar to the preceding one, we see that
\[
y_t^* = y_t + \bar R_1y_{t-1} + \cdots + \bar R_{n-1}y_{t-(n-1)},
\]
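that is, in forming current values of $y^*$ we use the corresponding current values of $y$ plus linear combinations of all the endogenous variables lagged right down to the beginning of our sample period. (In fact, if we did not agree to set presample values to zero they would be infinitely lagged.) The practical implication of this for our asymptotic theory is that we cannot isolate $\bar R$ in the way we isolated $R$ above. As the dimensions of $\bar R$ depend on $n$, this matrix blows up as $n$ tends to infinity. This makes the asymptotic analysis for the moving-average system far more complicated than that for the autoregressive system.²

In this subsection then, we content ourselves with the insights into the MLE of $\delta$ that we obtain from an iterative interpretation of this estimator. We proceed as we did in Subsection 7.3.5.3. Equating $\partial l/\partial v$ to the null vector gives $\tilde\Sigma = E'E/n$, and we write
\[
\frac{\partial l}{\partial \delta} = [H^{*\prime} - W'(I_G \otimes B^{-1\prime}E')](\tilde\Sigma^{-1} \otimes I_n)\varepsilon.
\]
Writing $M(r)^{-1}$ as we did in Eq. (7.93) we have
\[
H^* = H + (\bar R \otimes I_n)\bar CH.
\]
Consider then
\[
H - (I_G \otimes EB^{-1})W = \operatorname{diag}\{H_i - EB^{-1}W_i\}, \qquad H_i - EB^{-1}W_i = (Y_i - EB^{-1}\bar W_i, \; X_i).
\]
Clearly $\bar Y_i = Y_i - EB^{-1}\bar W_i$ is that part of $Y_i$ left after the influence of the disturbance matrix $E$ has been removed. Let
\[
\bar H = \operatorname{diag}\{\bar H_i\} \quad \text{with} \quad \bar H_i = (\bar Y_i \;\; X_i), \qquad \bar H^* = \bar H + (\bar R \otimes I_n)\bar CH.
\]
Then $\bar H^*$ represents that part of $H^*$ purged of the influence of $E$ and therefore of $\varepsilon = \operatorname{vec} E$, and our derivative can be written as
\[
\frac{\partial l}{\partial \delta} = \bar H^{*\prime}(\tilde\Sigma^{-1} \otimes I_n)(y^* - H^*\delta).
\]
Equating this derivative to the null vector and replacing the remaining parameters with their MLEs gives
\[
\tilde{\bar H}^{*\prime}(\tilde\Sigma^{-1} \otimes I_n)(\tilde y^* - \tilde H^*\tilde\delta) = 0, \qquad
\tilde\delta = [\tilde{\bar H}^{*\prime}(\tilde\Sigma^{-1} \otimes I_n)\tilde H^*]^{-1}\tilde{\bar H}^{*\prime}(\tilde\Sigma^{-1} \otimes I_n)\tilde y^*,
\]
where
\[
\tilde{\bar H}^* = \tilde H^* - (I_G \otimes \tilde E\tilde B^{-1})W, \qquad \tilde H^* = M(\tilde r)^{-1}H, \qquad \tilde y^* = M(\tilde r)^{-1}y.
\]

² Suppose $\hat r$ is a consistent estimator of $r$. Let $\hat y^* = M(\hat r)^{-1}y$, $\hat H^* = M(\hat r)^{-1}H$, $\hat X^* = M(\hat r)^{-1}(I_G \otimes X)$. One suspects that the estimator
\[
\hat\delta = [\hat H^{*\prime}\hat\Phi^{-1}\hat X^*(\hat X^{*\prime}\hat\Phi^{-1}\hat X^*)^{-1}\hat X^{*\prime}\hat\Phi^{-1}\hat H^*]^{-1}\hat H^{*\prime}\hat\Phi^{-1}\hat X^*(\hat X^{*\prime}\hat\Phi^{-1}\hat X^*)^{-1}\hat X^{*\prime}\hat\Phi^{-1}\hat y^*
\]
is asymptotically efficient.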
Thus, we see that the MLE of $\delta$ has an iterative IVE interpretation in which we obtain the instrumental variable for $H^*$ by purging $H$ of the influence of $E$.

7.4.6. The Lagrangian Multiplier Test Statistic for $H_0: r = 0$

We have seen in the previous subsection that forming IVEs for $\delta$ for the LSE model with vector moving-average disturbances is a lot more difficult than for the model with vector autoregressive disturbances. It is imperative then that we obtain a classical test statistic for the null hypothesis $H_0: r = 0$ for the model before us. As always, the LMT statistic is the most amenable, and in this section we show that the LMT statistic for the LSE model with vector moving-average disturbances is the same statistic as that developed for the preceding LSE model with vector autoregressive disturbances. As in the linear-regression model and in the SURE model, the LMT statistic is incapable of distinguishing between the two disturbance systems. We do this by noting that with $r = 0$, $M(r) = I_{nG}$, $H^* = H^d = H$, $U = E$, $u = \varepsilon$, $U_p = E_p$, $\operatorname{plim} E_p'E_p/n = \operatorname{plim} U_p'U_p/n = I_p \otimes \Sigma$, and $\mathcal{F} = \Sigma^{-1} \otimes U_p$, so for both models
\[
\frac{\partial l}{\partial r}\bigg|_{r=0} = \pm K_{pG,G}(\Sigma^{-1} \otimes U_p')u,
\]
\[
I^{\delta\delta}(\theta)\big|_{r=0} = \{\operatorname{plim} H'[\Sigma^{-1} \otimes (M_u - P_p)]H/n\}^{-1},
\]
\[
I^{rr}(\theta)\big|_{r=0} = I_p \otimes \Sigma^{-1} \otimes \Sigma + K_{pG,G}\Big\{\operatorname{plim}\frac{1}{n^2}[I_G \otimes (I_p \otimes \Sigma^{-1})U_p']\,H\,I^{\delta\delta}(\theta)\big|_{r=0}\,H'[I_G \otimes U_p(I_p \otimes \Sigma^{-1})]\Big\}K_{G,pG}.
\]
It follows then that the LMT statistic for $H_0: r = 0$ is the same for both models, for both the case in which $X$ is exogenous and for the case in which $X$ contains lagged endogenous variables.

APPENDIX 7.A. CONSISTENCY OF $\hat R$ AND $\hat\Sigma$
Write the model as $YB + X\Gamma = U$, that is, $QA = U$, where $Q = (Y \;\; X)$ and $A = (B' \;\; \Gamma')'$. Consider $\hat U = Q\hat A = U + Q(\hat A - A)$, where $\hat A$ is some consistent estimator of $A$. As $-\operatorname{plim}(U_p'U_p)^{-1}U_p'U = R'$, it follows that $\hat R' = -(\hat U_p'\hat U_p)^{-1}\hat U_p'\hat U$, where $\hat U_p = S(I_p \otimes \hat U)$, is a consistent estimator of $R'$ as long as $\hat A$ is a consistent estimator of $A$. Similarly, as $\operatorname{plim} E'E/n = \Sigma$, $\hat\Sigma = \hat E'\hat E/n$ with $\hat E = \hat U + \hat U_p\hat R'$ will also be consistent, provided $\hat A$ is consistent. The estimator $\hat A$ used in our analysis is formed from the 3SLS estimator we obtain by ignoring the autoregression and by using $X$ to form the instrumental variable for $H$ in $y = H\delta + u$ in the usual manner. To establish the consistency of $\hat R$ and $\hat\Sigma$ then it suffices to show that such an estimator is consistent.

The first step in obtaining the 3SLS estimator is to apply a 2SLS estimator to each equation. Write the 2SLS estimator for the $i$th equation, $y_i = H_i\delta_i + u_i$, as
\[
\hat\delta_i = \delta_i + (H_i'P_xH_i)^{-1}H_i'P_xu_i
= \delta_i + \left[\frac{H_i'X}{n}\left(\frac{X'X}{n}\right)^{-1}\frac{X'H_i}{n}\right]^{-1}\frac{H_i'X}{n}\left(\frac{X'X}{n}\right)^{-1}\frac{X'u_i}{n}, \tag{7.A.1}
\]
where $P_x = X(X'X)^{-1}X'$. Our assumptions ensure that the p lims exist for all the matrices on the right-hand side of Eq. (7.A.1). Moreover, we have assumed that $X$ is strictly exogenous so $\operatorname{plim} X'u_i/n = 0$. It follows that the 2SLS estimators are consistent estimators and that $\operatorname{plim} \hat u_i'\hat u_j/n = \operatorname{plim} u_i'u_j/n$, where $\hat u_i$ is the 2SLS residual vector (the latter p lim is assumed to exist).

The next step in applying the 3SLS estimator is to form the matrix $\hat V = \{\hat u_i'\hat u_j/n\}$. Clearly $\operatorname{plim} \hat V = \{\operatorname{plim} u_i'u_j/n\} = V$, say. The 3SLS estimator in this case is asymptotically equivalent to
\[
\delta^* = \delta + \left[\frac{H'(V^{-1} \otimes P_x)H}{n}\right]^{-1}\frac{H'(I \otimes X)}{n}\left[V^{-1} \otimes \left(\frac{X'X}{n}\right)^{-1}\right]\frac{(I \otimes X')u}{n}. \tag{7.A.2}
\]
Again, our assumptions ensure that the p lims of all the matrices on the right-hand side of Eq. (7.A.2) exist. Moreover, with $X$ strictly exogenous, $\operatorname{plim}(I \otimes X')u/n = 0$, and it follows that the 3SLS estimator of $\delta$ is consistent.
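The two steps described in this appendix (2SLS equation by equation, then 3SLS with the residual moment matrix) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the author's code: the function names `two_sls` and `three_sls` are hypothetical, each equation is passed as a pair $(y_i, H_i)$, and $X$ is the matrix of exogenous variables used to form the instruments. The blocks of $H'(\hat V^{-1} \otimes P_x)H$ are built as $\hat v^{ij}H_i'P_xH_j$, so no $nG \times nG$ matrix is formed.

```python
import numpy as np

def two_sls(y, H, X):
    """2SLS for one equation: (H' P_x H)^{-1} H' P_x y, with P_x = X (X'X)^{-1} X'."""
    PH = X @ np.linalg.solve(X.T @ X, X.T @ H)      # P_x H
    return np.linalg.solve(PH.T @ H, PH.T @ y)

def three_sls(ys, Hs, X):
    """3SLS: [H'(V^{-1} kron P_x)H]^{-1} H'(V^{-1} kron P_x)y with H = diag{H_i}.

    V is formed from the 2SLS residuals, V = {u_i'u_j / n}.
    Returns the stacked coefficient vector and V.
    """
    G = len(ys)
    n = X.shape[0]
    d2 = [two_sls(y, H, X) for y, H in zip(ys, Hs)]
    resid = np.column_stack([y - H @ d for y, H, d in zip(ys, Hs, d2)])
    V = resid.T @ resid / n
    Vinv = np.linalg.inv(V)
    PHs = [X @ np.linalg.solve(X.T @ X, X.T @ H) for H in Hs]
    # block (i, j) of H'(V^{-1} kron P_x)H is v^{ij} H_i' P_x H_j
    A = np.block([[Vinv[i, j] * PHs[i].T @ Hs[j] for j in range(G)]
                  for i in range(G)])
    b = np.concatenate([sum(Vinv[i, j] * PHs[i].T @ ys[j] for j in range(G))
                        for i in range(G)])
    return np.linalg.solve(A, b), V
```

In a two-equation system, for example, `ys` would hold the two left-hand-side variables and `Hs` the corresponding matrices of included endogenous and exogenous right-hand-side variables; identification requires that each equation exclude enough columns of `X`.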
References and Suggested Readings
Bowden, R. J. and Turkington, D. A., 1984. Instrumental Variables, Econometric Society Monograph in Quantitative Economics 8. Cambridge University Press, New York. Bowden, R. J. and Turkington, D. A., 1990. Instrumental Variables, paperback edition, Cambridge University Press, New York. Breusch, T. S., 1978. Testing for Autocorrelation in Dynamic Linear Models. Australian Economic Papers 17,334-335. Breusch, T. S. and Pagan, A. R., 1980. The Lagrange Multiplier Test and its Applications to Model Specification in Econometrics. Review of Economic Studies 47, 239-254. Cramer, J. S., 1986. Econometric Applications of Maximum Likelihood Methods. Cambridge University Press, Cambridge, U. K. Davis, P. J., 1979. Circulant Matrices. Wiley, New York. Dhrymes, P. J., 1978. Mathematics for Econometrics. Springer, New York. Durbin, J., 1988. Maximum Likelihood Estimation of the Parameters of a System of Simultaneous Regression Equations. Econometric Theory 4, 159-170. Dwyer, P. S., 1967. Some Applications of Matrix Derivatives in Multivariate Analysis. Journal of the American Statistical Association 26, 607-625. Dwyer, P. S. and MacPhail, M. S., 1948. Symbolic Matrix Derivatives. Annals of Mathematical Statistics, 19, 517-534. Godfrey, L. G., 1978a. Testing Against General Autoregressive and Moving Average Error Models when the Regressors Include Lagged Dependent Variables. Econometrica 46, 1293-1302. Godfrey, L. G., 1978b. Testing for Higher Order Serial Correlation in Regression Equations when the Regressors Include Lagged Dependent Variables. Econometrica 46, 1303-1310. Godfrey, L. G., 1988. Misspecifications Tests in Econometrics, Econometric Society Monograph in Quantitative Economics 16. Cambridge University Press, New York. Godfrey, L. G. and Breusch T. S., 1981. A Review of Recent Work on Testing for Autocorrelation in Dynamic Economic Models. In: D. Currie, R. Nobay, and D. Peel (eds.), Macro-Economic Analysis. Graham, A., 1981. 
Kronecker Products and Matrix Calculus with Applications. Ellis Horwood, Chichester, U. K. Hadley, G., 1961. Linear Algebra. Addison-Wesley, Reading, MA. 197
198 Matrix Calculus and Zero-One Matrices Hausman, J. A., 1975. An Instrumental Variable Approach to Full Information Estimates for Linear and Certain Non-Linear Econometric Models. Econometrica 43, 727-738. Henderson, H. V. and Searle, S. R., 1979. Vec and Vech Operators for Matrices, with Some Uses in Jacobians and Multivariate Statistics. Canadian Journal of Statistics 7, 65-81. Henderson, H. V. and Searle, S. R., 1981. The Vec-Permutation Matrix, the Vec Operator, and Kronecker Products: A Review. Linear and Multilinear Algebra 9, 271-288. Koopmans, T. C., Hood, W. C., 1953. The Estimation of Simultaneous Linear Economic Relationships. In: W. C. Hood and T. C. Koopmans (eds.), Studies in Econometric Methods. Cowles Commission Monograph 14, Wiley, New York, pp. 112-199. Reprint Yale University Press, New Haven, CT, 1970. Koopmans, T. C., Rubin, H., and Leipnik, R. B., 1950. Measuring the Equation Systems of Dynamic Economics. In: T. C. Koopmans (ed.), Statistical Inference in Dynamic Economic Models. Cowles Commission Monograph 10, Wiley, New York, pp. 53-237. Lutkepohl, H., 1996. Handbook of Matrices. Wiley, New York. MacDuffee, C. C., 1933. The Theory of Matrices. Reprinted by Chelsea, New York. Magnus, J. R. 1985. Matrix Differential Calculus with Applications to Simple, Hadamard, and Kronecker Products, Journal of Mathematical Psychology, 474-492. Magnus, J., 1988. Linear Structures. Oxford University Press, New York. Magnus J. R. and Neudecker, H., 1979. The Commutation Matrix: Some Properties and Applications. Annals of Statistics 7, 381-394. Magnus, J. R. and Neudecker, H., 1980. The Elimination Matrix: Some Lemmas and Applications. SIAM Journal on Algebraic and Discrete Methods, 422-449. Magnus, J. R. and Neudecker, H. 1985. Matrix Differential Calculus with Applications to Simple, Hadamard, and Kronecker Products. Journal of Mathematical Psychology 29, 474-492. Magnus, J. R. and Neudecker, H., 1986. Symmetry, 0-1 Matrices and Jacobians: A Review. 
Econometric Theory 2, 157-190. Magnus, J. R. and Neudecker, H., 1988. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, Chichester, U. K. Magnus J. R. and Neudecker, H., 1999. Matrix Differential Calculus with Applications in Statistics and Econometrics. Revised edition, Wiley, Chichester, U. K. McDonald, R. P. and Swaminathan, H., 1973. A Simple Matrix Calculus with Applications to Multivariate Analysis. General Systems 18, 37-54. Muirhead, R. J., 1982. Aspects of Multivariate Statistical Theory. Wiley, New York. Neudecker, H., 1967. On Matrix Procedures for Optimising Differential Scalar Functions of Matrices. Statistica Neerlandica 21, 101-107. Neudecker, H., 1969. Some Theorems on Matrix Differentiation with Special References to Kronecker Matrix Products. Journal of the American Statistical Association 64, 953-963. Neudecker, H., 1982. On Two Germane Matrix Derivatives. The Matrix and Tensor Quarterly 33, 3-12. Neudecker, H., 1985. Recent Advances in Statistical Application of Commutation Matrices. In: W. Grossmann, G. Pflug, I. Vineze and W. Wertz (eds.), Proceedings of the Fourth Pannonian Symposium on Mathematical Statistics. Reidel, Dordrecht, The Netherlands, Vol. B, pp. 239-250.
Pagan, A. R., 1974. A Generalised Approach to the Treatment of Autocorrelation. Australian Economic Papers 13, 267-280.
Phillips, A. W., 1966. The Estimation of Systems of Difference Equations with Moving Average Disturbances. Paper presented at the Econometric Society Meetings, San Francisco, 1966.
Pollock, D. S. G., 1979. The Algebra of Econometrics. Wiley, New York.
Rogers, G. S., 1980. Matrix Derivatives. Marcel Dekker, New York.
Rothenberg, T. J., 1973. Estimation with A Priori Information. Cowles Foundation Monograph 23, Yale University Press, New Haven, U.S.
Rothenberg, T. J. and Leenders, C. T., 1964. Efficient Estimation of Simultaneous Equation Systems. Econometrica 32, 57-76.
Searle, S. R., 1979. On Inverting Circulant Matrices. Linear Algebra and Its Applications 25, 77-89.
Theil, H., 1971. Principles of Econometrics. Wiley, New York.
Tracy, D. S. and Dwyer, P. S., 1969. Multivariate Maxima and Minima with Matrix Derivatives. Journal of the American Statistical Association 64, 1574-1594.
Tracy, D. S. and Singh, R. P., 1972. Some Modifications of Matrix Differentiation for Evaluating Jacobians of Symmetric Matrix Transformations. In: D. Tracy (ed.), Symmetric Functions in Statistics. University of Windsor, Ontario, Canada.
Turkington, D. A., 1989. Classical Tests for Contemporaneously Uncorrelated Disturbances in the Linear Simultaneous Equations Model. Journal of Econometrics 42, 299-317.
Turkington, D. A., 1998. Efficient Estimation in the Linear Simultaneous Equations Model with Vector Autoregressive Disturbances. Journal of Econometrics 85, 51-74.
Turkington, D. A., 2000. Generalised Vec Operators and the Seemingly Unrelated Regression Equations Model with Vector Correlated Disturbances. Journal of Econometrics 99, 225-253.
White, H., 1984. Asymptotic Theory for Econometricians. Academic Press, New York.
Wise, J., 1955. The Autocorrelation Function and the Spectral Density Function. Biometrika 42, 151-159.
Wong, C. S., 1980. Matrix Derivatives and its Applications in Statistics. Journal of Mathematical Psychology 22, 70-81.
Zellner, A., 1962. An Efficient Method of Estimating Seemingly Unrelated Regressions and Tests for Aggregation Bias. Journal of the American Statistical Association 57, 348-368.
Zellner, A. and Theil, H., 1962. Three-Stage Least Squares: Simultaneous Estimation of Simultaneous Equations. Econometrica 30, 54-78.
Index
A
Asymptotic theory, 6, 46, 64, 154
  assumptions for, 6, 147, 162, 192-194
  differentiation and, 5-6
  equivalence of statistics, 4
  presample values, 65, 90, 118, 162, 184
  shifting matrices, 28, 64
  See also specific estimators, test statistics, models
Autoregressive disturbances. See linear regression models

B
Backward chain rule, 103
BAN. See Best asymptotically normally distributed estimator
Band matrices, 26
  matrix algebra, 23-27
  shifting matrices and, 27
  triangular matrices and, 26
  zero-one matrices and, 27
Best asymptotically normally (BAN) distributed estimator, 2, 114, 125, 126, 140, 155
Breusch-Pagan statistic
  defined, 117
  SURE model and, 127
  Turkington statistic and, 157

C
Chain rules, x, 67, 83
  backward, 71n2, 94, 103
  forward, 71n2
  matrix calculus and, 71-73
Circulant matrices, 28
  commutativity of, 58
  defined, 58
  forward-shift matrices and, 58, 59
  linear functions of, 58
  shifting matrices and, x, 58-59
  symmetric, 58
Classical statistical procedures, 1-6
  building blocks of, x
  econometrics and, xi, 1
  linear-regression models and, 87
  LSE models and, 148
  SURE models and, 110
Commutation matrices, x, xi, 94, 102
  circulant matrices and, 58
  defined, 30
  derivatives and, 78-80, 104, 136
  devecs and, 29-43, 77-79
  expressions for, 30
  Kronecker products and, 31-35, 39-42
  LMT statistic and, 131
  N matrix and, 42
  permutation matrix and, 30
  properties of, 31-35, 78-80
  symmetric matrix, 69
  vecs and, 29-43, 73-78
  Wald statistic and, 131
  zero-one matrices and, 29-43
Concentrated likelihood function, 5, 112, 151
Consistent estimators. See specific estimators
Cramer-Rao lower bound, x, 1-2
  defined, 2
  linear regression models, 89, 95, 105
  LSE models, 154, 157, 167-171, 172, 177, 190-191
  SURE models, 113-114, 124-125, 138-139

D
Derivatives
  asymptotics and, 5-6
  classical statistical procedures and, 5
  commutation matrix and, 78-80
  defined, 68
  determinants and, 81-83
  econometric models and, 67
  logs of determinants, 81-83
  scalar functions, 80-83
  tables of, 83-85
  traces of matrices, 80
  vecs and, 73-78
  of vectors, 67-68
  zero-one matrices and, 69-73
  See also Matrix calculus
Determinants, 81-83
  derivatives and, 81
  logs of, 81-83
  rules for, 86
  scalar functions, 81, 82
Devec operators, 188
  definition of, 10-11
  generalized. See Generalized devec operator
  Kronecker products and, 11-12
  matrix algebra of, 10-14
  notation and, 21
  traces and, 12
  transpose operator and, 22
  vec operator and, 10-14, 22
Differentiation. See Derivatives, Matrix calculus
Disturbances. See specific models
Duplication matrices, xi, 28, 110
  defined, 43-44
  vech and, 43-44
  zero-one matrices and, 28, 43-44

E
Econometrics, 5-6
  classical procedures and, xi, 1
  differentiation and, 67
  information matrix for, 5-6
  LMT statistic and, 4
  log-likelihood function, 6
  LSE model and, 174
  nonlinear equations, 1
  partial derivatives, 5
  test statistics and, 1
  See also specific models
Elimination matrices, xi, 110
  defined, 43
  zero-one matrices and, 14, 28, 43

F
F test, 129, 183
FIML. See Full information maximum-likelihood (FIML) estimator
Forward-shift matrices, 28, 58-59
Full information maximum-likelihood (FIML) estimator, 155-157, 182

G
Gauss-Markov assumptions, 129
Generalized devec operators, 15-16
  commutation matrix and, 29-43, 77, 78
  defined, 16
  derivatives and, 77-80
  generalized vec operators and, 14-23
  Kronecker products and, 16-17
  matrix algebra and, 14-23
  notation for, 19, 87
  theorems for, 17-21
  transpose operator and, 21-23
Generalized inverse, 44
Generalized-least-squares (GLS) estimators, 96, 97, 106
  asymptotically efficient, 97, 106, 176
  maximum likelihood estimators as, 95-96, 105
Generalized vec operators, 14-15, 87
  commutation matrices and, 29-43
  definition, 14
  derivatives, 77-80
  generalized devec operators and, 14-23
  Kronecker products and, 16-17
  matrix algebra and, 14-23
  notation for, 19, 87
  theorems for, 17-21
  transpose operator and, 21-23
GLS. See Generalized-least-squares estimators
Godfrey statistic, 98

H
Hessian matrix, 89, 101, 135-138
  linear regression models, 89, 93, 94, 102-104
  LSE models, 152, 153, 163, 186
  SURE models, 112-113, 123-124, 135-138

I
Information matrix, x, 1
  defined, 2
  econometric models and, 5
  linear-regression models and, 89, 94-95, 104, 107, 109
  LSE models and, 153, 164-167, 186-190
  SURE models and, 113, 124, 138
Instrumental variable estimators (IVEs), 1
  asymptotic efficiency of, 155, 175-177, 179, 180
  generic types of, 146
  IV-GLS estimator, 147, 154, 172
  IV-OLS estimator, 147, 154, 173
  LSE models and, 177-180
  maximum-likelihood estimators and, 154-156, 180, 181, 191-195
  modified 3SLS estimator and, 173-177
  3SLS estimators and, 155
  White estimator and, 172-173
IVEs. See Instrumental variable estimators

J
JGLS. See Joint-generalized-least-squares estimator
Joint-generalized-least-squares (JGLS) estimator, 125-127, 139-141
  asymptotic efficiency of, 114, 125, 126, 140, 141
  maximum-likelihood estimators and, 114, 127, 141

K
Kronecker products, 7-10, 41
  commutation matrices and, 31-35
  determinant of, 8
  devec operators and, 11-12, 16-17
  inverse of, 8
  partitioning and, 9-10
  rank of, 8
  trace of, 8
  transpose of, 8
  vec operators and, 7, 11-12, 16-17, 31-35, 77

L
Lagged values, 88, 126
  asymptotic theory and, 87, 177
  linear-regression models and, 87
  shifting matrices and, 46, 64, 64n3
  time-series analysis and, 46, 64
  unit root problems, 100, 132, 184
  See also Time-series analysis
Lagrange multiplier test (LMT) statistics, 1, 3, 4, 114-115, 128, 129
  autoregressive disturbances and, 106
  Breusch-Pagan statistic and, 127
  constrained MLE and, 3
  econometrics and, 4
  F test and, 129, 183
  FIML estimator and, 157
  LIML estimator, 160
  linear regression models and, 97-99, 106-107
  LSE models and, 157-161, 181-184, 195
  moving average disturbances for, 106
  nuisance parameters and, 4
  SURE models and, 114-117, 128-130, 157-161, 182-184
  3SLS estimator and, 157
  See also specific models
Likelihood ratio test statistic (LRT), 3, 5, 99-100, 106, 132
  See also specific models
Limited information maximum-likelihood (LIML) estimators, 157, 160-161
LIML. See Limited information maximum-likelihood estimators
Linear regression models, 87-109
  assumptions of, 88
  autoregressive disturbances and, 90-100
    asymptotic efficiency and, 96-97
    Cramer-Rao lower bound and, 95
    generalized-least-squares estimators, 95-96
    information matrix, 94-95
    LRT statistic, 99-100
    LMT statistic and, 97-99
    log-likelihood function, 92
    maximum likelihood estimators, 95-96
    probability limits, 107, 108
    score vector, 93-94
    Wald test statistic and, 98-99, 106-107
  basic model, 88-90
    Cramer-Rao lower bound, 89
    information matrix, 89
    log-likelihood function, 88
    score vector, 88-89
  moving average disturbances, x, 100-107
    asymptotic efficiency, 106
    Cramer-Rao lower bound, 105
    Hessian matrix, 102-104
    information matrix, 104, 107-109
    LMT statistic, 106, 107
    LRT statistic, 106, 107
    log-likelihood function, 101
    probability limits and, 107, 108
    score vector, 102
    SURE model and, 114-115
    Wald test statistic, 106-107
Linear simultaneous equations (LSE) models, 146-196
  the standard model, 148-161
    assumptions, 148-149
    Cramer-Rao bound, 154
    FIML estimator as iterative IV estimator, 154-156
    generic IV estimators, 146
    Hessian matrix, 152-153
    information matrix, 153
    log-likelihood function, 149-151
    LMT statistic, 157-161
    3SLS estimator, 154-155
    score vector, 152
  with vector autoregressive disturbances, 161-184
    assumptions, 161-162
    Cramer-Rao lower bound, 167-171
    Hessian matrix, 163
    information matrix, 164-167
    IVE, 177-180
    IV-OLS estimator, 173
    MLE as iterative IVE, 189
    log-likelihood function, 163
    LMT statistic, 181-184
    3SLS estimator, 172
    score vector, 163
  with moving average disturbances, 184-196
    assumptions, 184
    Cramer-Rao lower bound, 190, 191
    Hessian matrix, 186
    information matrix, 186-190
    log-likelihood function, 185
    LMT statistic, 195
    score vector, 185
Log-likelihood function, 5
  linear-regression models and, 88, 92, 101
  linear simultaneous equations models, 149-151, 163, 185
  SURE models, 111, 122, 135
LRT. See likelihood ratio test statistic
LSE. See linear simultaneous equations models

M
Magnus notation, 71n2
Matrix calculus, 67-86
  chain rule, 71-73
  commutation matrices, generalized vecs and devecs of, 78, 79
  definitions for, 68-69
  product rule, 71-73
  scalar functions and, 80-83
  simple results, 69-70
  tables of results, 83-86
  vecs, rules for, 73-78
  zero-one matrices and, 69-70
  See also Derivatives
Maximum likelihood estimators (MLEs), 1
  asymptotic properties, 1
  defined, 3
  econometric models, 1
  FIML estimator, 154, 156, 182
  generalized-least-squares estimators and, 95-96, 105-106
  IVEs and, 154-156, 180-181
  JGLS estimators and, 114, 127, 141-142
  ordinary-least-squares estimators, 89-90
  test procedures and, 2-4
  See also specific models
Modified 3SLS estimator, 173-177
Moving average disturbances. See linear regression models

N
N matrix, 42-43
Nuisance parameters, 4-5
Null hypothesis
  autoregressive disturbances, 97
  contemporaneously correlated disturbances, 114, 127, 157
  moving average disturbances, 106
  vector autoregressive disturbances, 128, 181
  vector moving average disturbances, 141, 195

O
OLS. See Ordinary-least-squares estimators
Ordinary-least-squares (OLS) estimators, 89-90, 107, 108
  asymptotic efficiency, 89
  MLE as, 89-90

P
Permutation matrices
  commutation matrix and, 30
  defined, 29
  forward-shift matrices and, 58-59
  zero-one matrices and, 29
Presample values, 65, 90, 100, 118, 162, 184
Probability limits, 87
  autoregressive disturbances and, 107-108
  moving average disturbances and, 108-109, 142-145
Product rule, x, 71-73
  corollaries of, 72

Q
Quadratic forms, 109, 161

R
Reduced form, 149, 162, 185, 192

S
Scalar functions
  derivatives of, 80-83
  determinants and, 81, 82
  logs of determinants and, 82, 83
  traces and, 80
Score vector, x, 1
  defined, 2
  linear regression models and, 88-89, 93-94, 102
  LSE models and, 152, 163, 185
  SURE models and, 112, 122, 135
Seemingly unrelated regression equations (SURE) models, x, 110-145, 158, 189
  the standard model, 110-117
    assumptions, 117
    Breusch-Pagan test statistic, 114-117
    Cramer-Rao lower bound, 114
    Hessian matrix, 112-113
    information matrix, 113
    log-likelihood function, 111
    MLE as iterative JGLS estimator, 114
    score vector, 112
  with vector autoregressive disturbances
    assumptions of the model, 117
    asymptotically efficient estimators, 125-127
    classical test statistics for vector autoregressive disturbances, 128-132
    Cramer-Rao lower bound, 124-125
    Hessian matrix, 123, 124
    information matrix, 124
    log-likelihood function, 122
    LMT statistics, 127, 128
    LRT statistic, 132
    matrices N(r) and M(r), 119-121
    matrix J, 120-122
    MLE as iterative JGLS estimator, 127
    score vector, 122
    Wald test statistic, 130
  with vector moving-average disturbances
    assumptions, 132
    asymptotically efficient estimators, 139-141
    Cramer-Rao lower bound, 138-139
    Hessian matrix, 135-138
    information matrix, 138
    log-likelihood function, 135
    MLE, 141
    LMT statistic, 142
    score vector, 135
Selection matrices
  autoregressive process and, 92
  defined, 29
  LSE model and, 151, 188
  LMT and, 115
  zero-one matrices and, 28, 29
Shifting matrices, 28, 46-66, 162
  asymptotic theory, 28, 64
  band matrices and, 27
  circulant matrices and, x, 58-59
  defined, 46
  forward-shift matrices and, 59
  lag operators and, 46, 64, 64n3
  linear combination of, 59
  for n × n matrices, 48-49
  nilpotent, 50
  partitioned matrices and, 60-61
  properties of, 50-61
  singular, 50
  theorems about, 61-64
  time-series processes and, 28, 46, 64-66
  Toeplitz matrices and, x, 52-61
  triangular matrices and, 50-51
  zero-one matrices and, xi, 46-66
SURE. See Seemingly unrelated regression equations
Structural form, 149

T
Three-stage least squares (3SLS) estimator
  BAN estimator, 155
  IVEs and, 155
  LMT statistic and, 157
  LSE models and, 172-177
3SLS. See Three-stage least squares estimator
Time-series analysis, x
  lag operators and, 46, 64
  shifting matrices and, 28, 46, 50, 64-66
  zero-one matrices and, 46
  See also Lagged values
Toeplitz matrices, 28, 186
  circulant matrices and, 58
  defined, 52
  shifting matrices and, x, xi, 52-61
  triangular matrix and, 120, 133
Traces, of matrices
  derivatives and, 80, 81
  devecs and, 12
  shifting matrices, 60
  vecs and, 12
Transpose operator, 21-23
Triangular matrices, 23-27
  defined, 24
  matrix algebra and, 23-27
  shifting matrices and, 50-51
  strictly triangular, 24
  transposes and, 24
Turkington test statistic, 161
2SLS. See Two-stage least-squares estimators
Two-stage least-squares (2SLS) estimators, 157, 160-161

U
Unit roots, 90, 100, 184

V
V operators, 13-14
Vec operators
  commutation matrix and, 31-35
  definition of, 10-11
  derivatives of, 73-78
  devec operators and, 10
  elimination matrix and, 43
  generalized. See Generalized vec operator
  Kronecker products and, 11-12, 31-35
  matrix algebra and, 10-14
  traces of, 12
  vech operator and, 13
Vech operator, 13-14, 43-44
Vector autoregressive disturbances. See specific models
Vector moving average disturbances. See specific models
Vector functions, 68, 73-78

W
Wald test statistic, 3, 5, 98-99, 106, 107, 130-132, 184
White estimator, 172-173

Z
Zero-one matrices, xi
  commutation matrix, 29-35
  defined, 28
  duplication matrices, 43-45
  elimination matrices, 14, 43
  generalized vecs and devec of the commutation matrix, 35-42
  matrix calculus and, 69-71
  matrix N, 42, 43
  permutation matrices, 29
  results for, 44-46
  selection matrices, 28-29
  shifting matrices, 46-66
  See also specific types