, and e_t is the error at time t (called noise). If a is zero, the observations consist only of noise, that is, there is no signal. Assuming that the frequency is known, show that the above signal-plus-noise model with unspecified amplitude and phase reduces to (1.3.1) after a suitable transformation of the parameters. What are the explanatory variables of this linear model? Formulate the hypothesis of 'no signal' in terms of the parameters of the linear model.
1.2 The salary (y) of an employee in an organization is modelled as
y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + β_4 x_4 + e,
where x_1 and x_2 are binary indicators of graduation from high school and college, respectively, x_3 is the indicator of at least one post-graduate degree, x_4 is the number of years in service and e is the error term of the model.
(a) What are the possible sources of the model error?
(b) Interpret the parameters β_0, ..., β_4.
(c) Which constraints on the parameters correspond to the hypothesis 'salary does not depend on the educational background'?

Table 1.1 World record running times data (source: International Association of Athletics Federations, http://www.iaaf.org/Results/Records/index.html)

Running distance (meters):  100    200    400    800     1000    1500    2000    3000    5000    10000
Men's record (seconds):     9.79   19.32  43.18  101.11  131.96  206.00  284.79  440.67  759.36  1582.75
Women's record (seconds):   10.49  21.34  47.60  113.28  148.98  230.46  325.36  486.11  868.09  1771.78

1.3 Table 1.1 gives the men's and women's world record times for various running distances, recognized by the International Association of Athletics Federations (IAAF) as of 16 August, 2002. It may be assumed that the log of the record time is approximately a linear function of the log of the running distance. Identify the matrix and vectors of (1.3.2) if a linear model is used for the men's log-record times and another one for the women's log-record times. Construct a single 'grand model' with four parameters which can be used as a substitute for these two models, and identify the corresponding matrix and vectors.
1.4 (a) If a single linear model is used for all the log-record times for the data of Table 1.1, and the gender effect is represented by an additional (binary) explanatory variable, then identify the matrix and vectors of (1.3.2).
(b) Identify a constraint on the parameters of the 'grand model' of Exercise 1.3 which would make it equivalent to the model of part (a).

Table 1.2 World population data (source: U.S. Census Bureau, International Data Base, http://www.census.gov/ipc/www/idbnew.html)

Year:                  1981   1982   1983   1984   1985   1986   1987   1988   1989   1990
Population (billion):  4.533  4.613  4.694  4.774  4.855  4.938  5.024  5.110  5.196  5.284
Year:                  1991   1992   1993   1994   1995   1996   1997   1998   1999   2000
Population (billion):  5.367  5.450  5.531  5.611  5.691  5.769  5.847  5.925  6.003  6.080

1.5 Table 1.2 gives the midyear population of the world for the years 1981-2000. Suppose that a linear model is used to express the world population approximately in terms of the year. Identify the matrix and vectors of (1.3.2), and interpret the parameters
β_0 and β_1.
1.6 Show that if the explanatory variables are random and (1.3.2)-(1.3.3) represent a model of y conditional on X, then the model error e must be uncorrelated with X. [See Exercise 3.7 for a stronger version of this result.]
1.7 Consider the piecewise linear model
y = α_0 + α_1 x + e   if x ≤ x_0,
y = β_0 + β_1 x + e   if x > x_0.
Show that if x_0 is known, this model can be rewritten as a linear model, with a suitable choice of explanatory variables.
[Note: Usually the change point x_0 is unknown, and therefore the piecewise linear model is in fact a nonlinear model.] (A numerical sketch of a design matrix for the known-x_0 case is given at the end of this exercise list.)
1.8 If the linear model of Exercise 1.7 is used for the world population data of Table 1.2 with x_0 chosen as the year 1990, and the model is expressed as (1.3.2), identify the matrix and vectors.
1.9 According to the piecewise linear model of Exercise 1.7, E(y|x) may be discontinuous at x_0. Observe that the discontinuity disappears if the restriction β_0 - α_0 = (α_1 - β_1)x_0 is imposed. Rewrite this continuous, piecewise linear model as a linear model, with a suitable choice of explanatory variables.
1.10 If the linear model of Exercise 1.9 is used for the world population data of Table 1.2 with x_0 chosen as the year 1990, and the model is expressed as (1.3.2), identify the matrix and vectors.
1.11 Consider the models (1.4.2) and (1.4.3) of Example 1.4.4. Show that the errors e and δ both have zero mean only if v and 1/v are uncorrelated. Is this condition likely to hold?
1.12 Cobb-Douglas model. This model for the production function postulates that the production (q) is related to labour (l) and capital (c) via the equation
q = a l^α c^β u,
where a, α and β are unspecified constants and u is the (multiplicative) model error. The model is transformed to a linear model via a log-transformation of both sides of the equation. Assume that the additive error of the transformed model has zero mean. If δ is defined as q - a l^α c^β, the additive error of the original model, show that this error has larger variance when the mean response of the transformed model is larger.
1.13 Logistic regression model. Suppose that the response is a binary variable whose conditional mean (π) given the explanatory variables x_1, ..., x_p is given by the equation
log(π/(1 - π)) = β_0 + β_1 x_1 + ··· + β_p x_p.
Is this model a special case of any of the models discussed in this chapter? Can it be linearized by a suitable transformation of the response?
1.14 The manufacturer of a medicine for the common cold claims that this medicine provides 30% longer relief than that provided by a competing brand. In order to test this claim, an experiment is conducted with a number of adult volunteers who were given a standard dose of one medicine or the other. The duration of relief was measured, and other possibly influencing factors such as gender were recorded too. Is it possible to formulate the problem in such a way that the claim amounts to a simple condition on the parameters of a linear model which may be fitted to the above data?
1.15 Response surface. Consider the quadratic regression model
y = β_0 + β_1 x + β_2 x² + e,   Var(e) = σ²,
with independent observations. If β_2 > 0, determine the value of x which will minimize the expected response. [See Exercise 5.8 for inference of this value from data.]
1.16 Errors in variables. Suppose that for a given value of the random explanatory variable x, the response (y) is given by the linear model y = β_0 + β_1 x + e. Suppose that x is observed with some random error, and the observation x_0 is represented by the model x_0 = x + δ, where δ has zero mean and is independent of e and x.
(a) A model involving y and x_0 may be obtained by eliminating x from the two equations. Show that this model is not a special case of (1.1.1), by calculating the correlation between the model error and x_0.
(b) Is the model represented by the original pair of equations a special case of any of the models considered in this chapter?
1.17 Suppose that the effectiveness of a new drug (for which there is no competitor) is studied in the following way. A random
sample of 10 clinics is selected without replacement from all the clinics in the country, and a random sample of 10 patients is selected without replacement from each of these clinics. The selected patients are administered the drug and the 'improvement in status' is recorded. The model for this response is
y_ij = μ + δ_i + e_ij,   i, j = 1, ..., 10,
where μ is a constant, δ_i is the effect of the ith clinic, and e_ij is a random term corresponding to the jth patient of the ith clinic. The objective of the study is to measure the average 'improvement in status', irrespective of the clinic. Should the δ_i's be modelled as fixed parameters or random quantities? Which parameter of this model should be the focus of inference? Identify the model from among all those considered in this chapter for which the above model is a special case.
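The following sketch illustrates how the design matrix asked for in Exercises 1.7 and 1.8 can be built once the change point x_0 is treated as known. It is an illustrative addition, not part of the original text; the use of NumPy, the variable names and the choice x_0 = 1990 are assumptions made for the example.

```python
import numpy as np

# Hypothetical sketch: piecewise linear model with a known change point x0.
# For x <= x0 the mean is a0 + a1*x; for x > x0 it is b0 + b1*x.
# Writing z = 1(x > x0), the mean is a linear model in the explanatory
# variables (1 - z), (1 - z)*x, z and z*x.
years = np.arange(1981, 2001)              # x values, as in Table 1.2
x0 = 1990                                  # assumed known change point
z = (years > x0).astype(float)             # indicator of the second segment

# Design matrix X of the linear model y = X (a0, a1, b0, b1)' + error
X = np.column_stack([1 - z, (1 - z) * years, z, z * years])
print(X.shape)                             # (20, 4)
print(np.linalg.matrix_rank(X))            # 4: all four parameters identifiable
```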
Chapter 2
Review of Linear Algebra
This chapter provides a brief summary of the concepts and results of linear algebra that the reader will find useful in the subsequent development. Related results are grouped into sections, and their proofs are omitted when they are lengthy and inessential for understanding the results. Our limited purpose here is to acquaint the reader with essential facts in order to make the treatment self-contained. No attempt is made at comprehensive coverage or at stating the results in their most general form. Matrix notations which are introduced in this chapter and used subsequently are summarized in the Glossary of matrix notations given at the end of the book.
2.1 Matrices and vectors
A matrix is a rectangular array of numbers arranged in rows and columns. If a matrix has m rows and n columns, then it is said to have order m × n. We shall denote matrices by bold and uppercase Roman or Greek letters, such as A, and occasionally specify the order explicitly as a subscript, as in A_{m×n}. The entry in the ith row and the jth column of a matrix is called its (i,j)th element. We shall denote an element of a matrix by the corresponding lowercase letter, with the location specified as subscript. For instance, the (i,j)th element of the matrix A is a_ij. Sometimes we shall describe a matrix by a typical element: ((a_ij)) will
represent the matrix whose (i,j)th element is a_ij, that is, A = ((a_ij)). We only deal with matrices whose elements are real numbers, that is, take values in ℝ, the real line. An m × n matrix assumes values in ℝ^{mn}, the mn-fold Cartesian product of the real line.

When two matrices have identical orders, one can define their sum. The sum or addition of the two matrices A_{m×n} and B_{m×n}, denoted by A + B, is defined as A + B = ((a_ij + b_ij)). The scalar product of a matrix A with a real number c is defined as cA = ((c a_ij)). The difference of A_{m×n} and B_{m×n} is defined as A - B = A + (-1)B. Thus, A - B = ((a_ij - b_ij)).

The product of the matrix A_{m×n} with the matrix B_{n×k}, denoted by AB, is defined as AB = ((Σ_{l=1}^{n} a_il b_lj)). The product has order m × k. The product is defined only when the number of columns of A is the same as the number of rows of B. In general AB ≠ BA, even if both the products are defined and are of the same order. To emphasize the importance of the order of the multiplication, the operation of obtaining AB is referred to as 'the post-multiplication of A by B' or 'the pre-multiplication of B by A.' A series of additions and subtractions (such as A + B - C) and a series of multiplications (such as ABC) may be defined likewise, irrespective of the sequence of the operations. Whenever we carry out these operations without mentioning the orders explicitly, it is to be understood that the matrices involved have orders appropriate for the operations.
The elements occurring in the (i,i)th position of a matrix (i = 1, 2, ...) are called diagonal elements, while the others are called nondiagonal or off-diagonal elements. A diagonal matrix is a matrix with all off-diagonal elements equal to zero. If A = ((a_ij)), the transpose of A is defined as A' = ((a_ji)). It is easy to verify that (AB)' = B'A'. A square matrix is one with the number of rows equal to the number of columns. A square matrix A is called symmetric if A' = A, that is, if a_ij = a_ji for all i and j. The sum of all the diagonal elements of a square matrix is called the trace of the matrix. We shall denote the trace of the matrix A by tr(A). It can be verified that
tr(A_{m×n} B_{n×m}) = tr(B_{n×m} A_{m×n}) = Σ_{i=1}^{m} Σ_{j=1}^{n} a_ij b_ji.

A matrix with a single column is called a column vector, while a matrix with a single row is called a row vector. Throughout this book, when we simply refer to a vector it should be understood that it is a column vector. We shall denote a vector by a bold and lowercase Roman or Greek letter, such as a or α. We shall use the corresponding lowercase (non-bold) letter to represent an element of a vector, with the location specified as subscript. Thus, a_i and α_i are the ith elements (or components) of the vectors a and α, respectively. If the order of a vector is n × 1, then we shall call it a vector of order n for brevity. A matrix with a single row and a single column is a scalar, which we shall denote by a lowercase Roman or Greek letter, such as a and α.

We shall use special notations for a few frequently used matrices. A matrix having all the elements equal to zero will be denoted by 0, regardless of the order. Therefore, a vector of 0s will also be denoted by 0. A non-trivial vector or matrix is one which is not identically equal to 0. The notation 1 will represent a vector of 1s (every element equal to 1). A square, diagonal matrix with all the diagonal elements equal to 1 is called an identity matrix, and is denoted by I. It can be verified that A + 0 = A, A0 = 0A = 0, and AI = IA = A for 0 and I of appropriate order.
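As a quick numerical illustration of the transpose and trace rules just stated, the following NumPy sketch (an illustrative addition, not part of the original text) verifies (AB)' = B'A' and tr(AB) = tr(BA) for randomly generated matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # a 3 x 4 matrix
B = rng.standard_normal((4, 3))   # a 4 x 3 matrix

# (AB)' = B'A'
print(np.allclose((A @ B).T, B.T @ A.T))                 # True

# tr(AB) = tr(BA) = sum over i, j of a_ij * b_ji
print(np.allclose(np.trace(A @ B), np.trace(B @ A)))     # True
print(np.allclose(np.trace(A @ B), np.sum(A * B.T)))     # True
```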
Often a few contiguous rows and/or columns of a matrix are identified as blocks. For instance, we can partition I_{5×5} into four blocks as
I_{5×5} = (I_{3×3}  0_{3×2};  0_{2×3}  I_{2×2}).
Sometimes the blocks of a matrix can be operated with as if they are single elements. For instance, it can be easily verified that
(A_{m×n1} : B_{m×n2} : C_{m×n3}) (u_{n1×1}; v_{n2×1}; w_{n3×1}) = Au + Bv + Cw.
The Kronecker product of two matrices A_{m×n} and B_{p×q}, denoted by A ⊗ B = ((a_ij B)), is a partitioned mp × nq matrix with a_ij B as its (i,j)th block. This product is found to be very useful in the manipulation of matrices with special block structure. It follows from the definition of the Kronecker product that, for matrices of appropriate orders,
(a) (A_1 + A_2) ⊗ B = (A_1 ⊗ B) + (A_2 ⊗ B);
(b) A ⊗ (B_1 + B_2) = (A ⊗ B_1) + (A ⊗ B_2);
(c) (A ⊗ B)' = A' ⊗ B';
(d) (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD), whenever the products AC and BD are defined;
(e) tr(A ⊗ B) = tr(A) tr(B), whenever A and B are square.
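A small numerical check of these Kronecker product rules can be done with numpy.kron. The sketch below is an illustrative addition under arbitrarily chosen dimensions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3)); C = rng.standard_normal((3, 2))
B = rng.standard_normal((4, 2)); D = rng.standard_normal((2, 4))

# (A kron B)(C kron D) = (AC) kron (BD)
lhs = np.kron(A, B) @ np.kron(C, D)
rhs = np.kron(A @ C, B @ D)
print(np.allclose(lhs, rhs))                              # True

# (A kron B)' = A' kron B'
print(np.allclose(np.kron(A, B).T, np.kron(A.T, B.T)))    # True
```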
A set of vectors {v_1, ..., v_k} is called linearly dependent if there is a set of real numbers {a_1, ..., a_k}, not all zero, such that Σ_{i=1}^{k} a_i v_i = 0. If a set of vectors is not linearly dependent, it is called linearly independent. Thus, all the columns of a matrix A are linearly independent if there is no non-trivial vector b such that the linear combination of the columns, Ab, equals 0. All the rows of A are linearly independent if there is no non-trivial vector c such that the linear combination of the rows, c'A, equals 0. The column rank of a matrix A is the maximum number of linearly independent columns of A. Likewise, the row rank is the maximum
number of its rows that are linearly independent. If the column rank of a matrix A_{m×n} is equal to n, the matrix is called full column rank, and it is called full row rank if the row rank equals m. A matrix which is neither full row rank nor full column rank is called rank-deficient. An important result of matrix theory is that the row rank of any matrix is equal to its column rank (see, e.g., Rao and Bhimasankaram (1992, p. 107) for a proof). This number is called the rank of the corresponding matrix. We denote the rank of the matrix A by ρ(A). Obviously, ρ(A) ≤ min{m, n}. A square matrix B_{n×n} is called full rank or nonsingular if ρ(B) = n. If ρ(B_{n×n}) < n, then B is called a singular matrix. Thus, a singular matrix is a square matrix which is rank-deficient.

The inner product of two vectors a and b, having the same order n, is defined as the matrix product a'b = Σ_{i=1}^{n} a_i b_i. This happens to be a scalar, and is identical to b'a. We define the norm of a vector a as ‖a‖ = (a'a)^{1/2}. If ‖a‖ = 1, then a is called a vector with unit norm, or simply a unit vector. For a vector v_{n×1} and a matrix A_{n×n}, the scalar
v'Av = Σ_{i=1}^{n} Σ_{j=1}^{n} a_ij v_i v_j
is called a quadratic form in v. Note that v'Av = v'(½(A + A'))v. Since a non-symmetric matrix A in a quadratic form can always be replaced by its symmetric version ½(A + A') without changing the value of the quadratic form, there is no loss of generality in assuming A to be symmetric. The quadratic form v'Av is characterized by the symmetric matrix A, which is called the matrix of the quadratic form. Such a matrix is called
(a) positive definite if v'Av > 0 for all v ≠ 0,
(b) negative definite if v'Av < 0 for all v ≠ 0,
(c) nonnegative definite if v'Av ≥ 0 for all v, and
(d) positive semidefinite if v'Av ≥ 0 for all v and v'Av = 0 for some v ≠ 0.
A positive definite matrix is nonsingular, while a positive semidefinite matrix is singular (see Exercise 2.3).
Saying that a matrix is nonnegative definite is equivalent to saying that it is either positive definite or positive semidefinite. If A_{n×n} is a positive definite matrix, then one can define a general inner product between the pair of order-n vectors a and b as a'Ab. The corresponding generalized norm of a would be (a'Aa)^{1/2}. The fact that A is a positive definite matrix ensures that a'Aa is always positive, unless a = 0.

A useful construct for matrices is the vector formed by stacking the consecutive columns of a matrix. We denote the vector constructed from the matrix A in this manner by vec(A). It is easy to see that tr(AB) = vec(A')'vec(B). The number ‖vec(A)‖ is called the Frobenius norm of the matrix A, and is denoted by ‖A‖_F. Note that ‖A‖_F² is the sum of squares of all the elements of the matrix A. The Frobenius norm is also referred to as the 'Euclidean norm' in the statistical literature.
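The vec operator and the Frobenius norm are easy to experiment with numerically. The following sketch is an illustrative addition, not part of the original text; the column-stacking is implemented through a transpose because NumPy stores arrays row by row.

```python
import numpy as np

def vec(M):
    # Stack the columns of M into a single vector.
    return M.T.reshape(-1)

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

# tr(AB) = vec(A')' vec(B)
print(np.allclose(np.trace(A @ B), vec(A.T) @ vec(B)))          # True

# Frobenius norm: ||A||_F = ||vec(A)||, the root of the sum of squared elements
print(np.allclose(np.linalg.norm(A), np.linalg.norm(vec(A))))   # True
print(np.allclose(np.linalg.norm(A)**2, np.sum(A**2)))          # True
```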
2.2 Inverses and generalized inverses

If AB = I, then B is called a right-inverse of A, and A is called a left-inverse of B. We shall denote a right-inverse of A by A^{-R}. It exists only when A is of full row rank. Likewise, we shall denote a left-inverse of B, which exists only when B is of full column rank, by B^{-L}. Even when a right-inverse or a left-inverse exists, it may not be unique. For a rectangular matrix A_{m×n}, the rank condition indicates that there cannot be a right-inverse when m > n, and there cannot be a left-inverse when m < n. As a matter of fact, both the inverses exist if and only if the matrix A is square and full rank. In such a case, A^{-L} and A^{-R} happen to be unique and equal to each other (this follows from Theorem 2.1.1 of Rao and Mitra, 1971). This special matrix is called the inverse of the nonsingular matrix A, and is denoted by A^{-1}. By definition, the inverse exists and is unique if and only if A is nonsingular, and AA^{-1} = A^{-1}A = I. If A and B are both nonsingular with the same order, then (AB)^{-1} = B^{-1}A^{-1}.

A matrix B is called a generalized inverse or g-inverse of A if ABA = A. A g-inverse of A is denoted by A^-. Obviously, if A has order m × n, then A^- must have the order n × m. Every matrix
has at least one g-inverse. Every symmetric matrix has at least one symmetric g-inverse (see Exercise 2.5). It is easy to see that if A has either a left-inverse or a right-inverse, then the same is also a g-inverse of A. In general, A^- is not uniquely defined. It is unique if and only if A is nonsingular, in which case A^- = A^{-1}. Even though A^-, A^{-L} and A^{-R} are not uniquely defined in general, we often work with these notations anyway. However, we use these notations only in those expressions where the specific choice of A^-, A^{-L} or A^{-R} does not matter.

We have just noted that the matrix A has an inverse if and only if it is square and nonsingular. Therefore a nonsingular matrix is also called an invertible matrix. Every other (non-invertible) matrix has a g-inverse that is necessarily non-unique. It can be shown that for every matrix A there is a unique matrix B having the properties
(a) ABA = A,
(b) BAB = B,
(c) AB = (AB)', and
(d) BA = (BA)'.
Property (a) indicates that B is a g-inverse of A. This special g-inverse is called the Moore-Penrose inverse of A, and is denoted by A^+. When A is invertible, A^+ = A^{-1}. When A is a square and diagonal matrix, A^+ is obtained by replacing the non-zero diagonal elements of A by their respective reciprocals.

Example 2.2.1  Let A be a non-square matrix with full column rank. Such a matrix has more than one left-inverse; let B and C be two of them, chosen so that AB is symmetric while AC is not. Then B and C are distinct choices of A^{-L}. Likewise, B' and C' are right-inverses of A'. While B is the Moore-Penrose inverse of A, C is not, since AC is not a symmetric matrix. □

If A is invertible and A^{-1} = A', then A is called an orthogonal matrix. If A is of full column rank and A' is a left-inverse of A, then
A is said to be semi-orthogonal. If a_1, ..., a_n are the columns of a semi-orthogonal matrix, then a_i'a_j = 0 for i ≠ j, while a_i'a_i = 1. A semi-orthogonal matrix happens to be orthogonal if it is square.

The following inversion formulae are useful for small-scale computations:
α^+ = 1/α if α ≠ 0, and α^+ = 0 if α = 0 (for a scalar α);
a^+ = (a'a)^{-1}a' if ‖a‖ > 0, and a^+ = 0' if ‖a‖ = 0 (for a vector a);
A^+ = lim_{δ→0} (A'A + δ²I)^{-1}A' = lim_{δ→0} A'(AA' + δ²I)^{-1};
(A'A)^+ = A^+(A^+)'.
The third formula is proved in Albert (1972, p. 19). The other formulae are proved by direct verification. Two other formulae are given in Proposition 2.5.2. See Golub and Van Loan (1996) for numerically stable methods for computing the Moore-Penrose inverse. Possible choices of the right- and left-inverses, when they exist, are
A^{-L} = (A'A)^{-1}A';   A^{-R} = A'(AA')^{-1}.
In fact, these choices of left- and right-inverses are Moore-Penrose inverses. If A^- is a particular g-inverse of A, other g-inverses can be expressed as A^- + B - A^-ABAA^-, where B is an arbitrary matrix of appropriate order. This is a characterization of all g-inverses of A (see Rao, 1973c, p. 25).

We conclude this section with inversion formulae for matrices with some special structure. It follows from the definition and properties of the Kronecker product of matrices (see Section 2.1) that a generalized inverse of A ⊗ B is A^- ⊗ B^-,
where A^- and B^- are any g-inverses of A and B, respectively. It can be verified by direct substitution that if A, C and A + BCD are nonsingular matrices, then C^{-1} + DA^{-1}B is also nonsingular and
(A + BCD)^{-1} = A^{-1} - A^{-1}B(C^{-1} + DA^{-1}B)^{-1}DA^{-1}.
Consider the square matrix
M = (A  B; C  D).
If M and A are both nonsingular, then
M^{-1} = (A^{-1} + A^{-1}BT^{-1}CA^{-1}   -A^{-1}BT^{-1};  -T^{-1}CA^{-1}   T^{-1}),
where T = D - CA^{-1}B, which must be a nonsingular matrix (see Exercise 2.7). If C = B' and M is a symmetric and nonnegative definite matrix, then a g-inverse of M is
M^- = (A^- + A^-BT^-B'A^-   -A^-BT^-;  -T^-B'A^-   T^-),
where T = D - B'A^-B, which is a nonnegative definite matrix (see Exercise 2.19). The proofs of these results may be found in Rao and Bhimasankaram (1992, pp. 138, 347).
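The inversion formulae of this section are convenient to check numerically. The sketch below is an illustrative addition, not part of the original text: it verifies the four Moore-Penrose conditions for numpy.linalg.pinv and approximates the limit formula A^+ = lim (A'A + δ²I)^{-1}A' for a rank-deficient matrix.

```python
import numpy as np

A = np.array([[4., 5., 2.],
              [0., 3., 6.],
              [4., 5., 2.],
              [0., 3., 6.]])        # rank 2; this matrix reappears in Example 2.5.1

Ap = np.linalg.pinv(A)              # Moore-Penrose inverse

# The four defining properties: A Ap A = A, Ap A Ap = Ap, A Ap and Ap A symmetric
print(np.allclose(A @ Ap @ A, A),
      np.allclose(Ap @ A @ Ap, Ap),
      np.allclose(A @ Ap, (A @ Ap).T),
      np.allclose(Ap @ A, (Ap @ A).T))

# Limit formula: (A'A + d^2 I)^{-1} A' approaches A^+ as d -> 0
d = 1e-6
approx = np.linalg.solve(A.T @ A + d**2 * np.eye(3), A.T)
print(np.abs(approx - Ap).max())    # very small
```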
2.3 Vector space and projection
For our purpose, a vector space is a nonempty set S of vectors having a fixed number of components such that if u ∈ S and v ∈ S, then au + bv ∈ S for any pair of real numbers a and b. If the vectors in S have order n, then S ⊆ ℝ^n.
If S_1 and S_2 are two vector spaces containing vectors of the same order, then the intersection S_1 ∩ S_2 contains all the vectors that belong to both the spaces. Every vector space contains the vector 0. If S_1 ∩ S_2 = {0}, then S_1 and S_2 are said to be virtually disjoint. It is easy to see that S_1 ∩ S_2 is itself a vector space. However, the union S_1 ∪ S_2 is not necessarily a vector space. The smallest vector space that contains the set S_1 ∪ S_2 is called the sum of the two spaces, and is denoted by S_1 + S_2. It consists of all the vectors of the form u + v where u ∈ S_1 and v ∈ S_2.

A vector u is said to be orthogonal to another vector v (having the same order) if u'v = 0. If a vector is orthogonal to all the vectors in the vector space S, then it is said to be orthogonal to S. If S_1 and S_2 are two vector spaces such that every vector in S_1 is orthogonal to S_2 (and vice versa), then the two spaces are said to be orthogonal to each other. Two vector spaces which are orthogonal to each other must be virtually disjoint, but the converse is not true. The sum of two spaces which are orthogonal to each other is called the direct sum. In order to distinguish it from the sum, the symbol '+' is replaced by '⊕'. Thus, when S_1 and S_2 are orthogonal to each other, S_1 + S_2 can be written as S_1 ⊕ S_2. If S_1 ⊕ S_2 = ℝ^n, then S_1 and S_2 are called orthogonal complements of each other. We then write S_1 = S_2^⊥ and S_2 = S_1^⊥. Clearly, (S^⊥)^⊥ = S.

A set of vectors {u_1, ..., u_k} is called a basis of the vector space S if (a) u_i ∈ S for i = 1, ..., k, (b) the set {u_1, ..., u_k} is linearly independent and (c) every member of S is a linear combination of u_1, ..., u_k. Every vector space has a basis, which is in general not unique. However, the number of vectors in any two bases of S is the same (see Exercise 2.8). Thus, the number of basis vectors is a uniquely defined attribute of any given vector space. This number is called the dimension of the vector space. The dimension of the vector space S is denoted by dim(S). Two different vector spaces may have the same dimension. If S consists of n-component vectors, then dim(S) ≤ n (see Exercise 2.10).
Example 2.3.1  Let
u_1 = (1, 0, 0)',  u_2 = (1, 1, 0)',  u_3 = (0, 1, 0)',  u_4 = (0, 0, 1)'.
Define
S_1 = {u : u = a u_1 + b u_2 for any real a and b},
S_2 = {u : u = a u_1 + b u_4 for any real a and b},
S_3 = {u : u = a u_4 for any real a},
S_4 = {u : u = a u_1 for any real a},
S_5 = {u : u = a u_2 for any real a}.
It is easy to see that S_1, ..., S_5 are vector spaces. A basis of S_1 is {u_1, u_2}. An alternative basis of S_1 is {u_1, u_3}. The pair of vector spaces S_4 and S_5 constitute an example of virtually disjoint spaces which are not orthogonal to each other. The spaces S_1 and S_3 are orthogonal to each other. In fact, S_1 ⊕ S_3 = ℝ³, so that S_3 = S_1^⊥. The intersection between S_1 and S_2 consists of all the vectors which are proportional to u_1, that is, S_1 ∩ S_2 = S_4. In this case, S_1 ∪ S_2 is not a vector space. For instance, u_1 + u_4 is not a member of S_1 ∪ S_2, even though u_1 and u_4 are. The sum, S_1 + S_2, is equal to ℝ³, which is a vector space. Even so, S_1 and S_2 are not orthogonal complements of each other, because they are not orthogonal to each other in the first place.

Note that a set of pairwise orthogonal vectors are linearly independent, but the converse does not hold. If the vectors in a basis set are orthogonal to each other, this special basis set is called an orthogonal basis. If, in addition, the vectors have unit norm, then the basis is called an orthonormal basis. For instance, {u_1, u_3} is an orthonormal basis for S_1 in Example 2.3.1. Given any basis set, one can always construct an orthogonal or orthonormal basis out of it, such that the new basis spans the same vector space. Gram-Schmidt orthogonalization (see Golub and Van Loan, 1996) is a sequential method for such a conversion.

A few important results on vector spaces are given below.

Proposition 2.3.2  Suppose S_1 and S_2 are two vector spaces.
(a) dim(S_1 ∩ S_2) + dim(S_1 + S_2) = dim(S_1) + dim(S_2).
(b) (S_1 + S_2)^⊥ = S_1^⊥ ∩ S_2^⊥.
(c) If S_1 ⊆ S_2 and dim(S_1) = dim(S_2), then S_1 = S_2.
Proof. See Exercise 2.9.
Note that part (b) of the above proposition also implies that (S_1 ∩ S_2)^⊥ = S_1^⊥ + S_2^⊥.

For any vector space S containing vectors of order n, we have S ⊕ S^⊥ = ℝ^n. Hence, every vector v of order n can be decomposed as
v = v_1 + v_2,
where v_1 ∈ S and v_2 ∈ S^⊥. Thus, the two parts belong to mutually orthogonal spaces, and are orthogonal to each other. This is called an orthogonal decomposition of the vector v. The vector v_1 is called the projection of v on S. The projection of a vector on a vector space is uniquely defined (see Exercise 2.11).

A matrix P is called a projection matrix for the vector space S if Pv = v for all v ∈ S and Pv ∈ S for all v of appropriate order. In such a case, Pv is the projection of v on S for all v. Since PPv = Pv for any v, the matrix P satisfies the property P² = P. Square matrices having this property are called idempotent matrices. Every projection matrix is necessarily an idempotent matrix. If P is an idempotent matrix, it is easy to see that I - P is also idempotent. If P is a projection matrix of the vector space S such that I - P is a projection matrix of S^⊥, then P is called the orthogonal projection matrix for S. Every vector space has a unique orthogonal projection matrix, although it may have other projection matrices (see Exercise 2.12). Henceforth we shall denote the orthogonal projection matrix of S by P_S. An orthogonal projection matrix is not only idempotent but also symmetric. Conversely, every symmetric and idempotent matrix is an orthogonal projection matrix (see Exercise 2.13).

Example 2.3.3  Consider the vector space S_1 of Example 2.3.1, and the matrices
P_1 = (1 0 0; 0 1 0; 0 0 0)   and   P_2 = (1 0 1; 0 1 1; 0 0 0).
Notice that P_i u_j = u_j for i = 1, 2, j = 1, 2. Therefore, P_i v = v for
any v ∈ S_1 and i = 1, 2. Further, P_i = (u_1 : u_2)T_i, i = 1, 2, where
T_1 = (1 -1 0; 0 1 0),   T_2 = (1 -1 0; 0 1 1).
Therefore, for any v, P_i v = (u_1 : u_2)(T_i v), which is a linear combination of u_1 and u_2 and hence is in S_1. Thus, P_1 and P_2 are both projection matrices of S_1. It can be verified that both are idempotent matrices.
Notice that u_4 ∈ S_3 = S_1^⊥. Further, (I - P_1)u_4 = u_4, but (I - P_2)u_4 ≠ u_4. Also, (I - P_1)v is in S_3 for all v. Therefore, P_1 is the orthogonal projection matrix for S_1, while P_2 is not. □

Proposition 2.3.4  If the vectors u_1, ..., u_k constitute an orthonormal basis of a vector space S, then P_S = Σ_{i=1}^{k} u_i u_i'.
Proof. Let P = Σ_{i=1}^{k} u_i u_i'. Since any vector v in S is of the form Σ_{i=1}^{k} a_i u_i, it follows that
Pv = Σ_{i=1}^{k} Σ_{j=1}^{k} u_i u_i' u_j a_j = Σ_{i=1}^{k} a_i u_i = v.
On the other hand, for a general vector v, Pv = Σ_{i=1}^{k} (u_i'v)u_i, which is evidently in S. Therefore, P is indeed a projection matrix for S.
We now have to show that I - P is a projection matrix of S^⊥. Let v ∈ S^⊥, so that v'u_j = 0 for j = 1, ..., k. Then
(I - P)v = v - Σ_{j=1}^{k} u_j(u_j'v) = v.
Since u_j'P = u_j' for j = 1, ..., k, we have for a general vector v of appropriate order,
u_j'(I - P)v = 0,   j = 1, ..., k.
Therefore, (I - P)v is orthogonal to S, and is indeed a member of S^⊥. Combining the above results, and using the fact that the orthogonal projection matrix of any vector space is unique (see Exercise 2.12), we have P_S = P.
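Proposition 2.3.4 is easy to verify numerically: obtain an orthonormal basis of the span of a few vectors (for instance via the QR decomposition, which implements a Gram-Schmidt-type orthogonalization) and form the sum of the rank-one matrices u_i u_i'. The sketch below is an illustrative addition, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 2))          # two vectors in R^5 spanning a plane S

Q, _ = np.linalg.qr(X)                   # columns of Q: an orthonormal basis of S
P = Q @ Q.T                              # P_S = sum of u_i u_i'

# P is symmetric and idempotent, hence the orthogonal projection matrix of S
print(np.allclose(P, P.T), np.allclose(P @ P, P))

# P leaves vectors in S unchanged, and v - Pv is orthogonal to S
v = rng.standard_normal(5)
print(np.allclose(P @ X, X))             # the columns of X lie in S
print(np.allclose(X.T @ (v - P @ v), 0)) # the residual is orthogonal to S
```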
2.4 Column space
Let the matrix A have the columns a_1, a_2, ..., a_n. If x is a vector having components x_1, x_2, ..., x_n, then the matrix-vector product
Ax = x_1 a_1 + x_2 a_2 + ··· + x_n a_n
represents a linear combination of the columns of A. The set of all vectors that may be expressed as linear combinations of the columns of A is a vector space and is called the column space of A. We denote it by C(A). The column space C(A) is said to be spanned by the columns of A. The statement u ∈ C(A) is equivalent to saying that the vector u is of the form Ax where x is another vector. The row space of A is defined as C(A').
We now list a few results which will be useful in the subsequent chapters.

Proposition 2.4.1
(a) C(A : B) = C(A) + C(B).
(b) C(AB) ⊆ C(A).
(c) C(AA') = C(A). Consequently, ρ(AA') = ρ(A).
(d) C(C) ⊆ C(A) only if C is of the form AB for a suitable matrix B.
(e) If C(B) ⊆ C(A), then AA^-B = B, irrespective of the choice of the g-inverse. Similarly, C(B') ⊆ C(A') implies BA^-A = B.
(f) C(B') ⊆ C(A') and C(C) ⊆ C(A) if and only if BA^-C is invariant under the choice of the g-inverse.
(g) B'A = 0 if and only if C(B) ⊆ C(A)^⊥.
(h) dim(C(A)) = ρ(A).
(i) If A has n rows, then dim(C(A)^⊥) = n - ρ(A).
(j) If C(A) ⊆ C(B) and ρ(A) = ρ(B), then C(A) = C(B). In particular, C(I_{n×n}) = ℝ^n.
(k) ρ(AB) ≤ min{ρ(A), ρ(B)}.
(l) ρ(A + B) ≤ ρ(A) + ρ(B).

Proof. Part (a) follows easily from the definition. Every vector belonging to C(AB) is of the form ABl, or A(Bl), which is clearly in C(A). This proves part (b).
To prove that C(AA') = C(A), note that
l ∈ C(AA')^⊥ ⇒ l'AA' = 0 ⇒ l'AA'l = ‖A'l‖² = 0 ⇒ A'l = 0 ⇒ l ∈ C(A)^⊥.
Thus C(AA')^⊥ ⊆ C(A)^⊥, and consequently, C(A) ⊆ C(AA'). The reverse inclusion follows from part (b). Equating the dimensions (see part (h), proved below), we have ρ(AA') = ρ(A).
To prove part (d), let C = (c_1 : ··· : c_k). Since C(C) ⊆ C(A), c_j ∈ C(A) for j = 1, ..., k. Therefore, for each j between 1 and k, there is a vector b_j such that c_j = Ab_j. It follows that C = AB where B = (b_1 : ··· : b_k).
If C(B) ⊆ C(A), then there is a matrix T such that B = AT. Hence, AA^-B = AA^-AT = AT = B. The other statement of part (e) is proved in a similar manner.
In order to prove part (f), let C(B') ⊆ C(A') and C(C) ⊆ C(A). There are matrices T_1 and T_2 such that B = T_1A and C = AT_2. If A_1^- and A_2^- are two g-inverses of A, then
BA_1^-C - BA_2^-C = T_1(AA_1^-A - AA_2^-A)T_2 = T_1(A - A)T_2 = 0.
This proves the invariance of BA^-C under the choice of the g-inverse. In order to prove the converse, consider the g-inverses A_1^- = A^+ and A_2^- = A^+ + K - A^+AKAA^+, where K is an arbitrary matrix of appropriate dimension. Then the invariance of BA^-C implies
0 = BA_2^-C - BA_1^-C = BKC - (BA^+A)K(AA^+C) for all K.
By choosing K = u_i v_j', where u_i is the ith column of an appropriate identity matrix and v_j is the jth column of another identity matrix, we conclude that
(Bu_i)(v_j'C) = (BA^+Au_i)(v_j'AA^+C) for all i, j.
Therefore, B = aBA^+A and C = a^{-1}AA^+C for some a ≠ 0. (In fact, by using the first identity repeatedly, we can show that a = 1.) Therefore, C(B') ⊆ C(A') and C(C) ⊆ C(A).
Part (g) is proved by noting that l ∈ C(B) implies that l = Bm for some vector m, and consequently l'A = m'B'A = 0, or l ∈ C(A)^⊥.
To prove part (h), let k = ρ(A), and a_1, ..., a_k be linearly independent columns of A. By definition, any column of A outside this list is a linear combination of these columns. Therefore, any vector in C(A) is also a linear combination of these vectors. Hence, these vectors constitute a basis set of C(A), and dim(C(A)) = k = ρ(A).
Part (i) follows from part (h) above and part (a) of Proposition 2.3.2. Part (j) is a direct consequence of part (g) above and part (c) of Proposition 2.3.2. Parts (b) and (h) imply that ρ(AB) ≤ ρ(A) and ρ(AB) = ρ(B'A') ≤ ρ(B') = ρ(B). Combining these two, we have the result of part (k). In order to prove part (l), observe that
ρ(A + B) = dim(C(A + B)) ≤ dim(C(A : B)) ≤ dim(C(A)) + dim(C(B)) = ρ(A) + ρ(B),
where the second inequality follows from part (a) of Proposition 2.3.2.

A few additional results on column spaces will be presented in the next section. We now examine the projection matrices corresponding to a column space.

Proposition 2.4.2  For any matrix A, the matrix AA^- is a projection matrix for C(A). Further, the orthogonal projection matrix for C(A) is P_{C(A)} = A(A'A)^-A'.

Proof. For any v of appropriate order, (AA^-)v is obviously in C(A). If v ∈ C(A), it is of the form At. In such a case, (AA^-)v = AA^-At = At = v. Therefore, AA^- is a projection matrix for C(A).
Let P_A = A(A'A)^-A'. Parts (c) and (e) of Proposition 2.4.1 imply that A(A'A)^-A'A = A. Thus, (A'A)^-A' is a g-inverse of A, and consequently P_A is a projection matrix for C(A). Part (f) of Proposition 2.4.1 ensures that P_A is defined uniquely, no matter what g-inverse of A'A is used in its definition.
Now let l ∈ C(A)^⊥, so that A'l = 0. Then P_A l = 0 and (I - P_A)l = l. Further, for any general l, (I - P_A)l ∈ C(A)^⊥ since A'(I - P_A)l = 0. Therefore (I - P_A) is a projection matrix for C(A)^⊥. The conclusion follows.

Henceforth, we shall abbreviate P_{C(A)} by P_A. The above proposition not only gives an explicit form of P_A, but it also provides an explicit form of P_{C(A)^⊥}, which is I - P_A. It is easy to see that C(P_A) = C(A) and C(I - P_A) = C(A)^⊥.
The column space of a single non-null vector, u, has dimension 1. The corresponding orthogonal projection matrix is P_u = (u'u)^{-1}uu'. If u has unit norm, then P_u reduces to uu'. (This fact can also be seen as a corollary to Proposition 2.3.4.) For a given pair of vectors u and v having the same order, we shall refer to u'v/u'u as the component of v along u. We had previously used the word 'component' to mean an element of a vector. This is a special case of the notion of 'component' defined here. Indeed, if u_i is the vector consisting of zeros except for a 1 in the ith position, then the component of v along u_i is the ith component (or element) of v.

Remark 2.4.3  If u and v are two vectors having the same number of elements, it follows from the above discussion that (u'v)²/(u'u) = v'P_u v ≤ v'v. The result
(u'v)² ≤ ‖u‖² ‖v‖²    (2.4.1)
is the well-known Cauchy-Schwarz inequality. □

Proposition 2.4.4  Suppose that A and B are matrices having the same number of rows.
(a) C(A : B) = C(A) ⊕ C((I - P_A)B).
(b) P_{A:B} = P_A + P_{(I-P_A)B}.

Proof. The spaces C(A) and C((I - P_A)B) are easily seen to be mutually orthogonal. Let l ∈ C(A : B). Then l can be written as Au + Bv for some vectors u and v. Therefore,
l = P_A(Au + Bv) + (I - P_A)(Au + Bv) = P_A(Au + Bv) + (I - P_A)Bv.
The first vector, P_A(Au + Bv), is in C(A), while the second is in C((I - P_A)B). Thus, l ∈ C(A) ⊕ C((I - P_A)B). Reversing this sequence of arguments, we have l ∈ C(A) ⊕ C((I - P_A)B) ⇒ l ∈ C(A : B). The result of part (a) is obtained by combining these two implications.
Part (b) is a direct consequence of part (a) and the fact that
P_{S_1⊕S_2} = P_{S_1} + P_{S_2}
whenever S_1 and S_2 are mutually orthogonal vector spaces (see Exercise 2.14).
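Propositions 2.4.2 and 2.4.4 translate directly into a few lines of NumPy. The sketch below is an illustrative addition, not part of the original text: it forms P_A = A(A'A)^-A' with a g-inverse computed by pinv and checks the decomposition P_{A:B} = P_A + P_{(I-P_A)B}.

```python
import numpy as np

def orth_proj(M):
    # Orthogonal projection matrix onto C(M): M (M'M)^- M'
    return M @ np.linalg.pinv(M.T @ M) @ M.T

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 2))
B = rng.standard_normal((6, 3))

PA = orth_proj(A)
PAB = orth_proj(np.hstack([A, B]))            # projector onto C(A : B)
Prest = orth_proj((np.eye(6) - PA) @ B)       # projector onto C((I - P_A)B)

print(np.allclose(PA, PA.T), np.allclose(PA @ PA, PA))   # symmetric, idempotent
print(np.allclose(PAB, PA + Prest))                      # Proposition 2.4.4(b)
print(np.allclose(PA @ Prest, 0))                        # the two pieces are orthogonal
```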
2.5 Matrix decompositions
A number of decompositions of matrices are found to be useful for numerical computations as well as for theoretical developments. We mention three decompositions which will be needed later.
Any non-null matrix A_{m×n} of rank r can be written as B_{m×r}C_{r×n}, where B has full column rank and C has full row rank. This is called a rank-factorization.
Any matrix A_{m×n} can be written as UDV', where U_{m×m} and V_{n×n} are orthogonal matrices and D_{m×n} is a diagonal matrix with nonnegative diagonal elements. This is called a singular value decomposition (SVD) of the matrix A. The non-zero diagonal elements of D are referred to as the singular values of the matrix A. The columns of U and V corresponding to the singular values are called the left and right singular vectors of A, respectively. It can be seen that ρ(A) = ρ(D) (see Exercise 2.16). Therefore, the number of non-zero (positive) singular values of a matrix is equal to its rank. The diagonal elements of D can be permuted in any way, provided the columns of U and V are also permuted accordingly. A combination of such permuted versions of D, U and V would constitute another SVD of A (see Exercise 2.2). If the singular values are arranged so that the positive elements occur in the first few diagonal positions, then we can write A = Σ_{i=1}^{r} d_i u_i v_i', where r = ρ(A), d_1, ..., d_r are the non-zero singular values, while u_1, ..., u_r and v_1, ..., v_r are the corresponding left and right singular vectors. This is an alternative form of the SVD.
This sum can also be written as U_1D_1V_1', where U_1 = (u_1 : ··· : u_r), V_1 = (v_1 : ··· : v_r) and D_1 is a diagonal matrix with d_i in the (i,i)th location, i = 1, ..., r.

Example 2.5.1  Consider the matrix
A = (4 5 2; 0 3 6; 4 5 2; 0 3 6).
The rank of A is 2. An SVD of A is UDV', where
U = (1/2  1/2  1/√2  0;  1/2  -1/2  0  1/√2;  1/2  1/2  -1/√2  0;  1/2  -1/2  0  -1/√2),
D = (12 0 0; 0 6 0; 0 0 0; 0 0 0),
V = (1/3  2/3  -2/3;  2/3  1/3  2/3;  2/3  -2/3  -1/3).
This decomposition is not unique. We can have another decomposition by reversing the signs of the first columns of U and V. Yet another SVD is obtained by replacing the last two columns of U by their negatives, (-1/√2 : 0 : 1/√2 : 0)' and (0 : -1/√2 : 0 : 1/√2)'. Two alternative forms of the SVD of A are given below:
d_1 u_1 v_1' + d_2 u_2 v_2' = 12 (1/2; 1/2; 1/2; 1/2)(1/3  2/3  2/3) + 6 (1/2; -1/2; 1/2; -1/2)(2/3  1/3  -2/3),
U_1 D_1 V_1' = (1/2  1/2;  1/2  -1/2;  1/2  1/2;  1/2  -1/2)(12  0;  0  6)(1/3  2/3  2/3;  2/3  1/3  -2/3).
Two rank-factorizations of A are (U_1D_1)(V_1') and (U_1)(D_1V_1'). □
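The SVD in Example 2.5.1 can be checked with numpy.linalg.svd. The sketch below is an illustrative addition, not part of the original text; the signs of the computed singular vectors may differ from those displayed above, reflecting the non-uniqueness discussed in the example.

```python
import numpy as np

A = np.array([[4., 5., 2.],
              [0., 3., 6.],
              [4., 5., 2.],
              [0., 3., 6.]])

U, s, Vt = np.linalg.svd(A)
print(np.round(s, 10))                    # [12.  6.  0.]: two non-zero singular values
print(np.linalg.matrix_rank(A))           # 2

# Reduced form: A = U1 D1 V1' using only the non-zero singular values
U1, D1, V1t = U[:, :2], np.diag(s[:2]), Vt[:2, :]
print(np.allclose(U1 @ D1 @ V1t, A))      # True

# A corresponding rank-factorization: A = (U1 D1)(V1')
B, C = U1 @ D1, V1t
print(np.allclose(B @ C, A))              # True
```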
If A is a symmetric matrix, it can be decomposed as VΛV', where V is an orthogonal matrix and Λ is a square and diagonal matrix. The diagonal elements of Λ are real, but these need not be nonnegative. The set of the distinct diagonal elements of Λ is called the spectrum of A. We shall refer to VΛV' as a spectral decomposition of the symmetric matrix A. The diagonal elements of Λ and the columns of V have the property
Av_i = λ_i v_i,   i = 1, ..., n,
λ_i and v_i being the ith diagonal element of Λ and the ith column of V, respectively. Combinations of scalars and vectors satisfying this property are generally called eigenvalues and eigenvectors of A, respectively. Thus, every λ_i is an eigenvalue of A, while every v_i is an eigenvector of A. We shall denote the largest and smallest eigenvalues of the symmetric matrix A by λ_max(A) and λ_min(A), respectively.
There are several connections among the three decompositions mentioned so far. If A is a general matrix with SVD UDV', a spectral decomposition of the nonnegative definite matrix A'A is V(D'D)V'. If A is itself a nonnegative definite matrix, any SVD of A is a spectral decomposition, and vice versa. An alternative form of spectral decomposition of a symmetric matrix A is V_1Λ_1V_1', where V_1 is semi-orthogonal and Λ_1 is a nonsingular, diagonal matrix. If A is nonnegative definite with ρ(A) = r, then Λ_1 has r positive diagonal elements, and can be written as D_1², D_1 being another diagonal matrix. Thus, A can be factored as (V_1D_1)(V_1D_1)'. This construction shows that any nonnegative definite matrix can be rank-factorized as BB', where B has full column rank. In general we use this form of rank-factorization for nonnegative definite matrices (see Rao and Bhimasankaram, 2000, p. 361 for an algorithm for this decomposition). We have already seen in Example 2.5.1 how an SVD leads to a rank-factorization of a general matrix (not necessarily square).
Although the SVD is not unique, the set of singular values of any matrix is unique. Likewise, a symmetric matrix can have many spectral decompositions, but a unique set of eigenvalues. The rank-factorization, SVD and spectral decomposition help us better understand the concepts introduced in the preceding sections. We now present a few characterizations based on these decompositions.
Proposition 2.5.2  Suppose U_1D_1V_1' is an SVD of the matrix A, such that D_1 is full rank. Suppose, further, that B is a symmetric matrix with spectral decomposition VΛV', Λ having the same order as B.
(a) P_A = U_1U_1'.
(b) A^+ = V_1D_1^{-1}U_1'.
(c) P_A = AA^+.
(d) B is nonnegative (or positive) definite if and only if all the elements of Λ are nonnegative (or positive).
(e) If B is nonnegative definite, then B^+ = VΛ^+V'.
(f) B is idempotent if and only if all the elements of Λ are either 0 or 1.
(g) If CD is a rank-factorization of A, then the Moore-Penrose inverse of A is given by A^+ = D'(DD')^{-1}(C'C)^{-1}C'.
(h) tr(B) = tr(Λ), the sum of the eigenvalues.

Proof. Note that U_1 and V_1 are semi-orthogonal matrices. Further,
C(A) = C(AA') = C(U_1D_1D_1U_1') = C(U_1D_1) = C(U_1).
The last equality holds because D_1 is invertible. It follows from Proposition 2.4.2 that P_A = P_{U_1} = U_1(U_1'U_1)^-U_1' = U_1U_1'. Part (b) is proved by verifying the four conditions the Moore-Penrose inverse must satisfy (see Section 2.2). Part (c) follows directly from parts (a) and (b).
Suppose that the elements of Λ are λ_1, ..., λ_n and the columns of V are v_1, ..., v_n. Then, for any vector l of order n,
l'Bl = l'(Σ_{i=1}^{n} λ_i v_i v_i')l = Σ_{i=1}^{n} λ_i (l'v_i)².
If λ_1, ..., λ_n are all nonnegative, l'Bl is nonnegative for all l, indicating that B is nonnegative definite. Otherwise, if λ_j < 0 for some j, we have v_j'Bv_j = λ_j < 0, and B is not nonnegative definite. Thus, B is
nonnegative definite if and only if the elements of Λ are nonnegative. If the word 'nonnegative' is replaced by 'positive,' the statement is proved in a similar manner. This proves part (d).
As mentioned on page 29, the Moore-Penrose inverse of a square and diagonal matrix is obtained by replacing the non-zero elements of the matrix by their respective reciprocals. Therefore, Λ^+ is obtained from Λ in this process. The matrix VΛ^+V' is easily seen to satisfy the four properties of a Moore-Penrose inverse given on page 29, using the fact that V is an orthogonal matrix. This proves part (e).
In order to prove part (f), note that B² = B if and only if VΛ²V' = VΛV', which is equivalent to Λ² = Λ. Since Λ is a diagonal matrix, this is possible if and only if each diagonal element of Λ is equal to its square. The statement follows.
Let F = D'(DD')^{-1}(C'C)^{-1}C', where C and D are as in part (g). The conditions AFA = A, FAF = F, AF = (AF)' and FA = (FA)' are easily verified. Hence, F must be the Moore-Penrose inverse of A. Part (h) follows from the fact that tr(VΛV') = tr(ΛV'V) = tr(Λ). □
In view of Proposition 2.4.2, part (c) of Proposition 2.5.2 describes how a projection matrix AA^- can be made an orthogonal projection matrix by choosing the g-inverse suitably. Part (d) implies that a nonnegative definite matrix is singular if and only if it has at least one zero eigenvalue. Part (f) characterizes the eigenvalues of orthogonal projection matrices (see Exercise 2.13). Parts (f) and (h) imply that whenever B is an orthogonal projection matrix, ρ(B) = tr(B).
We define the determinant of a symmetric matrix as the product of its eigenvalues. This coincides with the conventional definition of the determinant that applies to any square matrix (see a textbook on linear algebra, such as Marcus and Minc, 1988). We denote the determinant of a symmetric matrix B by |B|. According to part (d) of Proposition 2.5.2, B is nonnegative definite only if |B| ≥ 0, positive definite only if |B| > 0 and positive semidefinite (singular) only if |B| = 0.
The decompositions described above also serve as tools to prove some theoretical results that may be stated without reference to the decompositions. We illustrate the utility of rank-factorization by proving
some more useful results on column spaces.

Proposition 2.5.3
(a) If B is nonnegative definite, then C(ABA') = C(AB), and ρ(ABA') = ρ(AB) = ρ(BA').
(b) If (A  B; B'  C) is a nonnegative definite matrix, then C(B) ⊆ C(A) and C(B') ⊆ C(C).

Proof. Suppose that CC' is a rank-factorization of B. Then
C(ABA') = C(ACC'A') ⊆ C(ACC') ⊆ C(AC).
However, C(AC) = C((AC)(AC)') = C(ABA'). Thus, all the above column spaces are identical. In particular, C(AB) = C(ACC') = C(ABA'). Consequently, ρ(ABA') = ρ(AB) = ρ((AB)') = ρ(BA').
In order to prove part (b), let TT' be a rank-factorization of the given nonnegative definite matrix, and let T be partitioned as (T_1; T_2), with the block T_1 having the same number of rows as A. Then we have
TT' = (T_1T_1'  T_1T_2';  T_2T_1'  T_2T_2') = (A  B;  B'  C).
Comparing the blocks, we have A = T_1T_1' and B = T_1T_2'. Further,
C(B) = C(T_1T_2') ⊆ C(T_1) = C(T_1T_1') = C(A).
Repeating this argument on the transposed matrix, we have C(B') ⊆ C(C).

2.6 Löwner order
Nonnegative definite matrices are often arranged by a partial order called the Löwner order, which is defined below.

Definition 2.6.1  If A and B are nonnegative definite matrices of the same order, then A is said to be smaller than B in the sense of the Löwner order (written as A ≤ B or B ≥ A) if the difference B - A is nonnegative definite. If the difference is positive definite, then A is said to be strictly smaller than B (written as A < B or B > A).
It is easy to see that whenever A ≤ B, every diagonal element of A is less than or equal to the corresponding diagonal element of B. Apart from the diagonal elements, several other real-valued functions of the matrix elements happen to be algebraically ordered whenever the corresponding matrices are Löwner ordered.

Proposition 2.6.2  Let A and B be symmetric and nonnegative definite matrices having the same order and let A ≤ B. Then
(a) tr(A) ≤ tr(B);
(b) the largest eigenvalue of A is less than or equal to that of B;
(c) the smallest eigenvalue of A is less than or equal to that of B;
(d) |A| ≤ |B|.

Proof. Part (a) follows from the fact that B - A is a symmetric and nonnegative definite matrix, and the sum of the eigenvalues of this matrix is tr(B) - tr(A). Parts (b) and (c) are consequences of the fact that u'Au ≤ u'Bu for every u, and that the inequality continues to hold after both sides are maximized or minimized with respect to u. In order to prove part (d), note that the proof is non-trivial only when |A| > 0. Let A be positive definite and CC' be a rank-factorization of A. It follows that I ≤ C^{-1}B(C')^{-1}. By part (c), the smallest eigenvalue (and hence, every eigenvalue) of C^{-1}B(C')^{-1} is greater than or equal to 1. Therefore, |C^{-1}B(C')^{-1}| ≥ 1. The stated result follows from the identities |C^{-1}B(C')^{-1}| = |C^{-1}| |B| |(C')^{-1}| = |B| |A|^{-1}.
Note that the matrix functions considered in Proposition 2.6.2 are in fact functions of the eigenvalues. It can be shown that whenever A ≤ B, all the ordered eigenvalues of A are smaller than the corresponding eigenvalues of B (see Bellman, 1960). Thus, the Löwner order implies algebraic order of any increasing function of the ordered eigenvalues. The four parts of Proposition 2.6.2 are special cases of this stronger result.
There is a direct relation between the Löwner order and the column spaces of the corresponding matrices.
Proposition 2.6.3  Let A and B be matrices having the same number of rows.
(a) If A and B are both symmetric and nonnegative definite, then A ≤ B implies C(A) ⊆ C(B).
(b) C(A) ⊆ C(B) if and only if P_A ≤ P_B, and C(A) ⊂ C(B) if and only if P_A < P_B.

Proof. If C(A) ⊆ C(B), then C(A : B) = C(B), so that Proposition 2.4.4 gives
P_B = P_{A:B} = P_A + P_{(I-P_A)B} ≥ P_A.
In particular, if C(A) ⊂ C(B), then C((I - P_A)B) cannot be identically zero, which leads to the strict order P_A < P_B. On the other hand, when P_A ≤ P_B, part (a) implies that C(A) ⊆ C(B). If P_A < P_B, there is a vector l such that A'l = 0 but B'l ≠ 0. Therefore, C(B)^⊥ ⊂ C(A)^⊥, that is, C(A) ⊂ C(B).
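Since B - A being nonnegative definite can be checked through its eigenvalues, the Löwner order and Proposition 2.6.2 can be explored numerically as in the sketch below, which is an illustrative addition using an arbitrarily chosen pair of matrices rather than anything from the original text.

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))
A = M @ M.T                        # a nonnegative definite matrix
B = A + np.eye(4)                  # then A <= B (in fact A < B) in the Loewner order

# B - A is nonnegative definite: all of its eigenvalues are >= 0
print(np.linalg.eigvalsh(B - A).min() >= 0)                          # True

# Consequences listed in Proposition 2.6.2
print(np.trace(A) <= np.trace(B))                                    # True
print(np.linalg.eigvalsh(A).max() <= np.linalg.eigvalsh(B).max())    # True
print(np.linalg.eigvalsh(A).min() <= np.linalg.eigvalsh(B).min())    # True
print(np.linalg.det(A) <= np.linalg.det(B))                          # True
```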
2.7 Solution of linear equations
Consider a set of linear equations written in the matrix-vector form as Ax = b, where x is unknown. Proposition 2.7.1 below provides answers to the following questions: (a) When do the equations have a solution? (b) If there is a solution, when is it unique? (c) When there is a solution, how can we characterize all the solutions?

Proposition 2.7.1  Suppose A_{m×n} and b_{m×1} are known.
(a) The equations Ax = b have a solution if and only if b ∈ C(A).
(b) The equations Ax = b have a unique solution if and only if b ∈ C(A) and ρ(A) = n.
(c) If b ∈ C(A), every solution to the equations Ax = b is of the form A^-b + (I - A^-A)c, where A^- is any fixed g-inverse of A and c is an arbitrary vector.

Proof. Part (a) follows directly from parts (b) and (d) of Proposition 2.4.1. Part (b) is proved by observing that b can be expressed as a unique linear combination of the columns of A if and only if b ∈ C(A) and the columns of A are linearly independent. It is easy to see that whenever b ∈ C(A), A^-b is a solution to Ax = b. If x_0 is another solution, then x_0 - A^-b must be in C(A')^⊥. Since Al = 0 if and only if (I - A^-A)l = l, C(A')^⊥ must be the same as C(I - A^-A). Hence x_0 must be of the form A^-b + (I - A^-A)c for some c.

Remark 2.7.2  If b is a non-null vector contained in C(A), every solution of the equations Ax = b can be shown to have the form A^-b, where A^- is some g-inverse of A (see Corollary 1, p. 27 of Rao and Mitra, 1971).
Since the equations Ax = b have no solution unless b ∈ C(A), this condition is often called the consistency condition. If this condition is violated, the equations have an inherent contradiction. If A is a square and nonsingular matrix, then the conditions of parts (a) and (b) are automatically satisfied. In such a case, Ax = b has a unique solution given by x = A^{-1}b. If b ∈ C(A), a general form of the solution of Ax = b is A^+b + (I - P_{A'})c. This is obtained by choosing the g-inverse in part (c) as A^+.
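Proposition 2.7.1(c) can be illustrated with the Moore-Penrose inverse as the chosen g-inverse. The sketch below is an illustrative addition, not part of the original text: it shows that A^+b solves a consistent system and that adding (I - A^+A)c produces further solutions.

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.]])       # rank 1, so Ax = b has many solutions when consistent
b = np.array([6., 12.])            # b is in C(A), since b = A @ [1, 1, 1]

Ap = np.linalg.pinv(A)
x_part = Ap @ b                    # one particular solution
print(np.allclose(A @ x_part, b))  # True: the system is consistent

# General solution: A^+ b + (I - A^+ A) c for arbitrary c
rng = np.random.default_rng(6)
for _ in range(3):
    c = rng.standard_normal(3)
    x = x_part + (np.eye(3) - Ap @ A) @ c
    print(np.allclose(A @ x, b))   # True for every c
```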
2.8 Optimization of quadratic forms and functions
Consider a quadratic function q(x) = x'Ax + b'x + c of a vector variable x, where we assume A to be symmetric without loss of generality. In order to minimize or maximize q(x) with respect to x, we can differentiate q(x) with respect to the components of x, one at a time,
and set the derivatives equal to zero. The solution(s) of these simultaneous equations are candidates for the optimizing value of x. This algebra can be carried out in a neater way using vector calculus. Let x = (x_1, ..., x_n)'. The gradient of a function f(x) is defined as the n × 1 vector
∂f(x)/∂x = (∂f(x)/∂x_1, ..., ∂f(x)/∂x_n)'.
The Hessian matrix of f(x) is defined as
∂²f(x)/∂x∂x' = ((∂²f(x)/∂x_i∂x_j)),
the n × n matrix whose (i,j)th element is ∂²f(x)/∂x_i∂x_j.
The gradient and the Hessian are the higher-dimensional equivalents of the first and second derivatives, respectively. A differentiable function has a maximum or a minimum at the point x_0 only if its gradient is zero at x_0. If the gradient is zero at x_0, it is a minimum if the Hessian at x_0 is nonnegative definite, and a maximum if the negative of the Hessian is nonnegative definite.

Proposition 2.8.1  Let q(x) = x'Ax + b'x + c, where A is a symmetric matrix.
(a) ∂(x'Ax)/∂x = 2Ax,   ∂(b'x)/∂x = b,   ∂c/∂x = 0.
(b) ∂²(x'Ax)/∂x∂x' = 2A,   ∂²(b'x)/∂x∂x' = 0,   ∂²c/∂x∂x' = 0.
(c) q(x) has a maximum if and only if b ∈ C(A) and -A is nonnegative definite. In such a case, a maximizer of q(x) is of the form -(1/2)A^-b + (I - A^-A)x_0, where x_0 is arbitrary.
(d) q(x) has a minimum if and only if b ∈ C(A) and A is nonnegative definite. In such a case, a minimizer of q(x) is of the form -(1/2)A^-b + (I - A^-A)x_0, where x_0 is arbitrary.
Proof. The proposition is proved by direct verification of the expressions and the conditions stated above, coupled with part (c) of Proposition 2.7.1.
Most of the time it is easier to maximize or minimize a quadratic function by 'completing squares', rather than by using vector calculus. To see this, assume that b ∈ C(A) and rewrite q(x) as
(x + (1/2)A^-b)' A (x + (1/2)A^-b) + (c - (1/4)b'A^-b).
If A ≥ 0, then q(x) is minimized when x = -(1/2)A^-b + (I - A^-A)x_0 for arbitrary x_0. If A ≤ 0, then q(x) is maximized at this value of x.
The quadratic function q(x) may be maximized or minimized subject to the linear constraint Dx = e by using the Lagrange multiplier method. This is accomplished by adding the term 2y'(Dx - e) to q(x) and optimizing the sum with respect to x and y. Thus, the task is to optimize
(x'  y') (A  D'; D  0) (x; y) + (b'  -2e') (x; y) + c
2.8 Optimization of quadratic forms and functions
51
Proof. Let VAV' be a spectral decomposition of A, and let Ai,..., An, the diagonal elements of A, be in decreasing order. Suppose that o = Vb. It follows that the optimization problem at hand is equivalent to the task of maximizing YA~I ^ial subject to the constraint ^ " = 1 af = 1. It is easy to see that the weighted sum of A^s is maximized when a\ = 1 and ai = 0 for i — 2,..., n. This choice ensures the solution stated in the proposition. D Note that if 6o maximizes b'Ab subject to the unit-norm condition, so does — &o- Other solutions can be found if Ai = A2. If Ai > 0, then the statement of the above proposition can be strengthened by replacing the constraint b'b with the inequality constraint b'b < 1. An equivalent statement to Proposition 2.8.2 is the following: the ratio b'Ab/b'b is maximized over all b ^ 0 when b is an eigenvector of A corresponding to its largest eigenvalue. The corresponding maximum value of b'Ab/b'b is equal to the largest eigenvalue of A. Similarly it may be noted that b'Ab is minimized with respect to b subject to the condition b'b = 1 when b is a unit-norm eigenvector of A corresponding to its smallest eigenvalue. The corresponding minimum value of b'Ab is equal to the smallest eigenvalue of A. This statement can be proved along the lines of Proposition 2.8.2. An equivalent statement in terms of the minimization of the ratio b'Ab/b'b can also be made. We shall define the norm of any rectangular matrix A as the largest value of the ratio
||Ab|| _ (b'A'AbV12
11*11 ~\ tib )
'
and denote it by ||A||. Proposition 2.8.2 and the preceding discussion imply that \\A\\ must be equal to the square-root of the largest eigenvalue of A'A, which is equal to the largest singular value of A (see the discussion of page 42). It also follows that the vector b which maximizes the above ratio is proportional to a right singular vector of A corresponding to its largest singular value. The norm defined here is different from the Frobenius norm defined in page 28.
52
Chapter 2 : Review of Linear Algebra
2.9
Exercises AP is 2.1 Find a matrix Pnxn such that for any matrix Amxru a modification of A with the first two columns interchanged. Can you find a matrix Q with suitable order such that QA is another modification of A with the first two rows interchanged? 2.2 Obtain the inverses of the matrices P and Q of Exercise 2.1. 2.3 Let A be a matrix of order n x n. (a) Show that if A is positive definite, then it is nonsingular. (b) Show that if A is symmetric and positive semidefmite, then it is singular. (c) If A is positive semidefinite but not necessarily symmetric, does it follow that it is singular? 2.4 Show that Mmxrai
Onixm2 j
is a g-inverse of
(
Amixni
Omixn2
0m2 xni
"m2XH2/
|
2.5 Is (A~)' a g-inverse of A'l Show that a symmetric matrix always has a symmetric g-inverse. 2.6 If A has full column rank, show that every g-inverse of A is also a left-inverse. 2.7 If A is nonsingular, show that the matrix M = I _, is \G D J nonsingular if and only if D — CA~XB is nonsingular. 2.8 Show that any two bases of a given vector space must have the same number of vectors. 2.9 Prove Proposition 2.3.2. 2.10 Prove that the dimension of a vector space is no more than the order of the vectors it contains. 2.11 Show that the projection of a vector on a vector space is uniquely defined. 2.12 Prove that the orthogonal projection matrix for a given vector space is unique.
2.9 Exercises
53
2.13 (a) Prove that every idempotent matrix is a projection matrix, (b) Prove that a matrix is an orthogonal projection matrix if and only if it is symmetric and idempotent. 2.14 If S\ and S2 are mutually orthogonal vector spaces, show that
2.15 2.16
2.17 2.18
2.19
2.20
P = P 4- P Sies2 5i
(c) c(v1(v1 + v2yv2) = c(vl)nc(v2). 2.21 Show that the dimension of C( A) n C(B) is less than or equal to the rank of A'B, and that the inequality can be strict.
54
Chapter 2 : Review of Linear Algebra 2.22 If A is a nonsingular matrix and D is a square matrix, show that
£ ^\ = \A\-\D-CA-lB\. 2.23 Let A and B be symmetric matrices such that A < B. (a) If A and B are positive definite, show that B~l < A~l. (b) If A and B are nonnegative definite, show that AB~ A < A. Does AB~ A depend on the choice of the g-inverse? 2.24 Prove that the converse of part (a) of Proposition 2.6.3 does not hold, that is, for symmetric and nonnegative definite matrices A and B satisfying the condition C{A) C C(B), one does not necessarily have A < B. 2.25 Prove the following results for the Kronecker product of two matrices A and B. (a)
PA®B=PA®PB-
(b) If A and B have full column rank, so does A
that \A®B\ = \A\PW
\B\P^A\
2.27 Show that vec(A-BC) = ( C
Chapter 3
Review of Statistical Results
In this chapter we summarize the statistical theory which serves as the background for the chapters to follow. As in Chapter 2, the aim is to present the results in a coherent and self-contained manner. In order to cover only the essential facts, we sometimes use weaker versions of standard results. We do not make a notational distinction between a random vector and a particular realization of it. 3.1
Covariance adjustment
For a given random vector u, we denote the expected value and variance-covariance matrix (also known as the dispersion matrix) as E(u) and D(u), respectively. For an ordered pair of random vectors u and v, we denote the matrix of covariances by Cov(u,v), that is, Cov(u, v) = E[{u - E{u)}{v - E(v)}']. According to this notation, D(u) = Cov(u, u), and Cov(v,u) = Cov(u, v)'. It is easy to see that Cov(Au,Bv) = ACov(u,v)B' and D(Au) = AD(u)A'. If V is the dispersion matrix of y, the variance of I'y (denoted by Var(l'y)) is I'Vl, which must be nonnegative for every I. Hence, every dispersion matrix is nonnegative definite. The matrix V is singular (that is, positive semidefmite) if and only if there is a vector / such that Var(l'y) = 0, that is, i'y is a degenerate random variable. In 55
56
Chapter 3 : Review of Statistical Results
the last five chapters of this book we shall have the occasion to deal with random vectors which may have a singular dispersion matrix. A major difficulty in working with a singular dispersion matrix is that its column space does not contain all the vectors having the same order as its columns. Therefore, it is important to know which vectors are contained in the column space of a dispersion matrix. In this context, the next proposition provides two results which will be frequently used in the later chapters. Proposition 3.1.1 If u and v are random vectors with finite mean and dispersion matrices, then (a) [u - E(u)} e C(D{u)) with probability 1, (b) C(Cov{u,v)) CC(D(u)). Proof. Let y = {I-P
)[u-E(u)].
It is easy to see that E(y) = 0 and
D(y) = 0. Therefore the random variable ||y|| 2 satisfies the condition £(||y||2)
= E(y'y) = E(tv(yy')) = trE(yy') = tvE[{E(y) + (y - E(y))}{E(y) + (y - E{y))}'} = ti[E(y)E(y')] + trD(y) = 0.
It follows by a standard argument of probability theory (see, for instance, Theorem 15.2(ii) of Billingsley (1985)), that any nonnegative random variable with zero expected value must be zero with probability 1. Therefore, the vector y is almost surely a zero vector.a This proves part (a). Part (b) is obtained by applying part (b) of Proposition 2.5.3 to the combined dispersion matrix of u and v. Q The correlation of a random vector u with another random vector v can be removed by a linear adjustment. The mechanism of achieving uncorrelatedness is given below. Proposition 3.1.2 Let u and v be random vectors having first and second order moments such that E(v) = 0. Then the linear compound aWe
shall use the phrase with probability 1 interchangeably with almost surely.
3.1 Covariance adjustment
57
u — Bv is uncorrelated with v if and only if Bv = Cov(u,v)[D(v)]~v
with probability 1.
Proof. The 'if part is easy to prove. To prove the 'only if part, let Cov(u - Bv,v) = 0. It follows that BD(v) = Cov{u,v), or D(v)B' = Cov(v,u). According to part (c) of Proposition 2.7.1, B' where V = D{v) must be of the form V~Cov(v,u) - (I - VV)G, and G is an arbitrary matrix. Since E(v) = 0, part (a) of Proposition 3.1.1 implies that v 6 C(V) almost surely. We can choose V~ as a symmetric matrix without loss of generality, and the value of Cov(u,v)V~v does not depend on the choice of the g-inverse. Consequently, v'B' = v'V~Cov(v,u) — 0, that is, Bv = Cov(u,v)V~v with probability 1. Part (b) of Proposition 3.1.1 ensures that the value of Bv in Proposition 3.1.2 does not depend on the choice of the g-inverse.
us
Bv
u — Bv
J
Figure 3.1 A geometric view of covariance adjustment
We shall refer to the result of Proposition 3.1.2 as the covariance adjustment principle. It plays an important role in the derivation of several results of this book. The principle can be understood from the diagram of Figure 2.1, where the random vectors u and v are represented by lines with arrows. Vectors at right angles to each other signify zero correla-
58
Chapter 3 : Review of Statistical Results
tion. Parallel vectors have correlation 1 or —1. In general u and v are correlated, that is, these vectors are not at right angles to one another. The vector Bv described in Proposition 3.1.2 can be interpreted as the component of u in the direction of v. The remaining part, u — Bv, is at right angles (uncorrelated) with v. The random vectors Bv and u — Bv constitute a decomposition of u into uncorrelated components. 3.2
Basic distributions
Definition 3.2.1 The random vector y is said to have the multivariate normal distribution if, for every fixed vector I having the same order as y, the random variable I'y has the univariate normal distribution. The multivariate normal distribution is completely characterized by the mean vector and the dispersion matrix. If E{y) = fx and D(y) = V, the notation y ~ N(fj,, V) indicates that y has the multivariate normal distribution with mean [i and dispersion matrix V. The joint probability density of such a random vector with n components is f(y) = (27r)-"/ 2 |Vr 1 / 2 exp[-I(y - n)'V-\V
- /x)],
if V is nonsingular. If V is singular, the joint density function is f ( 2 7 T ) ^ ( ^ ) / 2 | C ' C | - 1 / 2 exp[-I(j/ - ri'V-(y - /*)] f(y) = \
if(/-Pv)(y-/i) = 0,
I0
if(/-P y )(y-/x)#0,
where CC is any rank-factorization of V. This density reduces to the earlier form when V is nonsingular. In particular, when V = a21, the joint density is f(y) =
(2na2)-^2exp[-^a-2(y-fxY(y-t,)}.
The singular normal distribution is encountered when there is a deterministic relationship among the random components of y. Specifically, if BB' is a rank-factorization of / — Py, then the linear relationship B'y = B/J. holds with probability 1.
3.2 Basic distributions
59
Let y ~ JV(/x, V) and y-fVl)
U-fl*)
V-(V^
Vl")
where the partitions correspond to one another. Then the conditional distribution of y2 given y1 is normal with mean and dispersion given by E{y2\Vi) = »2 + V21Vi1(y1-n1),
(3.2.1)
D(Vi\V\) = V22-V2lV^Vl2.
(3.2.2)
Proposition 3.1.1 ensures that the above expressions do not depend on the choice of the g-inverse of V u . Note that the conditional mean is linear in yl, while the conditional dispersion does not depend on yl. If V12 = 0, this density does not depend on y1 (thus coinciding with the unconditional density of y2). This confirms the well-known fact that if the joint distribution of two random vectors is multivariate normal, then they are independent of one another if and only if they are uncorrelated. If y ~ iV(/x, V) and L is a non-random matrix with n columns, then Ly ~ N(Lfx,LVL'). Other useful functions of y lead to several important distributions. Definition 3.2.2 A random variable q is said to have the chi-square distribution with n degrees of freedom (written formally as q ~ Xn) if Q can be written as y'y, where y n x i ~ iV(0, / ) . n Definition 3.2.3 A random variable u is said to have the student's ^-distribution with n degrees of freedom (written formally as u ~ tn) if u can be written as y/y/z/n, where y ~ N(0,1), z ~ x\ an< i V a n d z are independent random variables. D Definition 3.2.4 A random variable v is said to have the F distribution with n\ and n2 degrees of freedom (written formally as v ~ Fni jH2) if v can be written as ^y" 1 , where z\ ~ x\x > zi ~ X«2 a n ( ^ Z l anc ^ ^2 a r e independent random variables. Remark 3.2.5 If the normal distribution mentioned in definitions 3.2.2-3.2.3 has non-zero mean, then the resulting distributions are said
60
Chapter 3 : Review of Statistical Results
to be noncentral. Specifically, if y n X l ~ N(/J,, I), then q = y'y is said to have the noncentral chi-square distribution with n degrees of freedom and noncentrality parameter H'/JL (written formally as q ~ Xniv't*))- If y ~ N{fi,l), z ~ xii and y and z are independent then u = y/y/z/n is said to have the noncentral student's ^-distribution with n degrees of freedom and noncentrality parameter fj, (written formally as u ~ tn(fj,)). Finally, if z\ ~ Xnj(c); Z2 ~ Xn2 a n ( ^ ^i anc ^ Z2 are independent random variables, then v = j 4 ^ - is said to have the noncentral F distribution with degrees of freedom n\ and n^ and noncentrality parameter c (written formally a s w ~ Fni>n2(c)).
3.3
Distribution of quadratic forms
Proposition 3.3.1
(Fisher-Cochran Theorem) Let ynXi
~
N(fx,I)
and y'Aiy,... ,y'Ary be quadratic forms whose sum is equal to y'y. Then y'A^y ~ XJ,(A-)(^i) for * = >r and are independent if and only ifJ2i=ip(Ai) = n, in which case Aj = fi'Anj,, i = l , . . . , r , and Proof. Suppose that y'A^y ~"X^iA)(^) for i = 1,... ,r and are independent. It follows that for i = 1 , . . . , r there is xi ~ iV(/Xj, I) such that x \ x { — y ' A i y a n d Aj = /x^/Xj. L e t x — (x[ : x'r)'. T h e n x'x ~
Xyr
pM0
( 2 A,- 1
and y'y ~ Xn (/*'A*)
Since s'a; = y'y, by comparing the distributions of these we conclude that X)Li P(Ai) = nTo prove the converse, let Y^-iP(-^-i) = n a n d ^ t ^ i be a rankfactorization of Aj, ? = l , . . . , r . Also let B = (B\ : : B r ) and C = {C\ Cr). The condition Y!i=i pi-^-i) — n ensures that JB and C are n x n matrices. Further,
y'y = J2v'AiV = Jlv'BiC'iV = y'BC'y i=l
i=l
Vy.
3.4 Regression
61
are symmetWe can assume without loss of generality that A\,...,Ar ric. Hence, BC' is symmetric. Choosing y in the above equation as one eigenvector of BC' at a time, it is easily seen that all the eigenvalues of this matrix are equal to 1. Therefore, BC' = I, that is, C = B~l. It follows that C'B — I. By comparing the blocks of the two sides of this matrix identity, we conclude that C[Bi = I for i = 1,... ,r. Conare symmetric and sequently, BiC'iBiC\ = BiC'i, that is, A\,...,AT idempotent matrices. Without loss of generality we can choose C{ — B{ for % — 1,... , r. Since BB' — BC' = I, B is an orthogonal matrix. Consequently, B'y ~ N{B'IM,I). It follows that B\y ~ N{B\n,I) for i = l , . . . , r and that these are independent. Independence and the stated distributions of y'Aiy,..., y'Ary follow from Remark 3.2.5, which also implies the facts that Aj — fx'AifJ,, i = 1,... ,r. Finally,
J2 V-'AiH = j ^ VL'BiB'ip = fi'BB'n = v!n. Proposition 3.3.2 Suppose that y n x l ~ N(n, I), A and B are idempotent matrices of order n x n and C is any matrix of order m x n. (a) y'Ay~x2p(A){v'An). (b) y'Ay and y'By are independently distributed if and only if AB = O. (c) y' Ay and Cy are independently distributed if and only ifCA = 0. Proof. See Exercise 3.3. 3.4
Regression
The problem of regression concerns the approximation of a random vector y by a suitable function of a random vector x. In particular, the vector function g(x) that minimizes E[(y — g(x))'W(x)(y — g(x)], where W is an arbitrary, positive definite matrix that may depend on x, is g(x) — E(y\x) (Exercise 3.7). This conditional expectation is called the regression of y on x. It is sometimes referred to as the regression function, when viewed as a function of a;. If y is a scalar (denoted as
62
Chapter 3 : Review of Statistical Results
y), then the 'minimum mean squared error' criterion for approximating y by x is to minimize
E[y-g(x)}2 with respect to the function g. The minimum occurs when g(x) = E(y\x), the regression of y on x. The approximation error, y — E(y\x), is uncorrelated with E(y\x) (see Exercise 3.7). It follows from the foregoing discussion that when the joint distribution of y and x is multivariate normal, the regression function of y on x is an affine function of x, that is, a linear function of x with an added constant. Specifically, if (X)~N((^)
\y)
(V™
v*y))
l U J ' U ^ VyyJJ'
then the regression of y on x is
(3.4.1) E(y\x)=liy + v'xyV-x(x-nx).
(3.4.1)
It can be rewritten as
(3.4.2) E{y\x) = fa + Pixx + + 0pxp, (3.4.2) where x = {x\ : : xp)' and PQ, Pi,..., Pp are appropriate functions of /j,x, jj,y, vxy and Vxx- This equation clearly has the form of (1.3.5). Even if the distribution of (x' : y)' is not normal but its mean and dispersion are as above, we can consider the approximation of y through a linear function of x, including a constant. The solution to the minimization problem min E[y -I'x - c]2 l,c may be called the linear regression of y on x. It is known more commonly as the best linear predictor (BLP) of y given x. We denote it by
E{y\x). Proposition 3.4.1 Let
E(X) = (^) \y) \ny /
and D(X) = (Vr \y J \vxy
Vxy).
vyy)
3.4 Regression
63
Then, (a) E{y\x) =ny + v'xyVxx(x - fix), (b) y — E(y\x) is uncorrelated with every linear function of x. (c) E[E(y\x)} = ny. Proof. It is easy to see that y — \xy — v'xyVxx(x — /J X ) has zero mean and is uncorrelated with x — fj,x. Hence, it must be uncorrelated with every linear function of a;. It follows that for any I and c E[y-l'x-c}2
=
Eiy-Vy-v'zyV-^x-^)]2 + E[fiy + v'xyVxx{x - nx) - I'x - c]2.
Clearly the left hand side is minimized by choosing I and c in a way that makes the second term on the right hand side equal to zero. This proves part (a). Part (b) also follows from the above argument. Part (c) is a consequence of part (a). d Note that part (b) of the above proposition also follows from the covariance adjustment principle of Proposition 3.1.2. The expression of E(y\x) given in part (a) is identical to that of E(y\x) in the normal case, given in (3.4.1). Therefore, we can write y as y = Po + fiixi +
+ f3pxp + e,
(3.4.3)
: xp)', /3o,/?i> ,Pp a r e a s m (3-4.2), and e = where x = {x\ : y—E(y\x). According to Proposition 3.4.1, e has zero mean and is uncorrelated with x. Therefore, (3.4.3) is a special case of (1.3.2)—(1.3.3) for a single observation. However, the explanatory variables in (3.4.3) are in general random and are not necessarily independent of e. Even though the model (3.4.3) applies to any y and x having the moments described in Proposition 3.4.1, the model may not always be interpretable as a conditional one (given the explanatory variables). Methods of inference which require the explanatory variables to be conditionally non-random are not applicable to (3.4.3). The multiple correlation coefficient of y with x is the maximum value of the correlation between y and any linear function of a;. If the covariance structure of x and y is as in Proposition 3.4.1, the linear
64
Chapter 3 : Review of Statistical Results
function of x which has the maximum correlation with y happens to be v'xyV~xx. Therefore, the squared multiple correlation is Cov(y,vxyV-xx)2 Var{y) Var(v'xyVxxx)
=
v'xyV-xvxy vyy
4
Thus, y has a larger correlation with the BLP of y, given in part (a) of Proposition 3.4.1, than with any other linear function of x. We end this section with a proposition which shows how the linear model (1.3.2)—(1.3.3) follows from a multivariate normal distribution of the variables. We shall denote by vec(A) the vector obtained by successively concatenating the columns of the matrix A. Proposition 3.4.2 Let the random vector y n x l and the random matrix -X"nx(p+i) be such that X = (1 : Z), vec(y:Z)~Jv((M
® l n x l , £ { p + 1 ) x ( p + 1 ) ® Vnxn) .
Then E(y\X) = (l:Z)/3,
D(y\X) = a2V,
where P ff2
=
(Vy + °iy^xxMx\ V ^xxaxy J'
=
°yy ~
CT'xy^'xx{Txy
Proof. Since X = (1 : Z), conditioning with respect to X is equivalent to conditioning with respect to Z. The stated formulae would follow from (3.2.1) and (3.2.2). Let x = vee(Z),
and S ( p + 1 ) x ( p + 1 ) = (a™ \ Oxy
< ) . Zjxx J
It follows that D(y\X)
= D(y\x) = ayyV - (a'xy ® V)(VXX ® V)-(axy
® V)
= OyyV - (*'xy ® V)(E- a ® V-)(
3.4 Regression
65
This is of the form given in the proposition. Using the properties of the Kronecker product described in Sections 2.1 and 2.2, and the fact that x — E(x) = ( S X I ® V)l for some vector I of size pn (see Proposition 3.1.1 (a)), we can also write E(y\X)
= E(y\x)
=
Myl
+ {a'xy ® F)(S X X ® V)~{x - fj.x ® 1)
=
/iyl + ( ^ y ® V " ) ( E x x ® V ) - ( S M ® V ) /
=
/i y l + [ ( a ^ S ; s S s x ) ® ( V W ) ] I
=
/iyl + [(O^E^E**) ® V]Z
=
^yl + [(^'x V S- a ! )®/](a!-/* I ®l)
=
(/iy + o - ^ S - ^ J l + ^ S - , , ) ® / ! *
=
(/zy + cTxyV-xtix)l
+ ZVxx
which coincides with the expression given in the proposition.
HH
The expression for the conditional mean of y given in Proposition 3.4.2 is not just similar to that of (1.3.4); these are the same. Note that since S I X is allowed to be singular, 'non-random' explanatory variables may be included in X with no additional technical problem. If Ylxx is singular, j3 is not uniquely denned — as it depends on the selection of Yl~x. However, E{y\X) is uniquely defined. The result of Proposition 3.4.2 would not hold if the joint covariance matrix of y and Z has no special structure. The assumed form of this matrix, S ® V, ensures that the first element of E(y\X) depends only on the first row of X , the second element depends on the the second row, and so on. Further, the assumptions E(y) = /j,yl and E(Z) = l®y.'x ensure that the constant part of E(y\X) is the same for all the elements. These two reductions mean that the regression model has a manageable number of parameters for inference. As mentioned in Section 1.3, usually V is assumed to be known and j3 and a2 are estimated from the data. We shall continue to make these assumptions in Chapters 4-7. The case of partially known V is discussed in Chapter 8.
66
3.5
Chapter 3 : Review of Statistical Results
Basic concepts of inference
Suppose that the random vector y has distribution Fg, which involves a vector parameter 9 that can assume any value from a set 0 . If one has to draw inference about 9 from y, a reasonable approach is to work with a vector-valued function t of y that has all the information which is relevant for 6. The concept of sufficiency is useful for this purpose. A statistic t(y) is called sufficient for the parameter 9 if the conditional distribution of y given t(y) — to does not involve 9. The idea of summarizing information through sufficient statistic is more meaningful when the summary is as brief as possible. A sufficient statistic t(y) is called minimal if it is almost surely equal to a function of any other sufficient statistic almost surely for all 9. If we have a minimal sufficient statistic, we have no more than what we need to know from y about 9. On the other hand, we can also identify the information that is not relevant for inference about 9. A statistic z{y) is called ancillary for 9 if the marginal distribution of z(y) does not depend on 9. An ancillary statistic is called maximal if every other ancillary statistic is almost surely equal to a function of it. Note that the value of a distribution function at any given point can be expressed as the expected value of an indicator function. Therefore, we can characterize a statistic z(y) as ancillary for the parameter 9 if the expectation of any function of z(y) does not involve 9. By the same token, we can characterize a statistic t(y) as sufficient for the parameter 9 if the conditional expectation of any function of y given t(y) is almost surely a function of t(y) and does not depend on 9. Example 3.5.1 Suppose that the order-n vector y has the distribution N(fil,I). It is easy to see that the conditional distribution of y given l'y = to is N((to/n)l, (I — n ^ l l ' ) ) , which does not depend on fj,. Hence, t(y) = l'y is a sufficient statistic for \i. (This can also be proved via the factorization theorem given below.) The vector y is also sufficient. The statistic l'y is minimal sufficient, but y is not. The distribution of z(y) — (I — n~1ll')y does not depend on fi. Therefore, any function of this vector is ancillary for JJ,. In particular,
3.5 Basic concepts of inference ||(J — n~1ll')y\\2 of M-
67
is ancillary, and has Xn-i distribution, irrespective D
Sometimes even a minimal sufficient statistic contains redundant information in the sense that a function of it is ancillary. A sufficient statistic is most useful in summarizing the data if no nontrivial function of it is ancillary. A sufficient statistic t(y) is called complete if E[g(t(y))} = 0 for all 0 € 0 implies that the function g is zero almost everywhere (that is, over a set having probability 1) for all 9. A complete sufficient statistic may not always exist. Bahadur (1957) showed that if it does, then it must be minimal too (see Schervish, 1995, p.94). it can be E x a m p l e 3.5.2 In Example 3.5.1, where y ~ N(fil,a2I), shown that l'y is complete, though the proof is beyond the scope of the present discussion. Since the statistic is also sufficient for /u, it is minimal. E x a m p l e 3.5.3 Suppose that y i , . . . , yn are independent samples from the uniform distribution over the interval (6 — ^, # + ^), — oo < # < oo. It can be seen that a minimal sufficient statistic for 6 is (ymirn Vmax)-)
where ymin and ymax a re the smallest and largest of the y^s, respectively. However, it is not complete, since E Umax - Vmin ~ ^ j )
=°
for
a11
°'
thus exhibiting a non-trivial function with zero expectation. The relation between a complete sufficient statistic and an ancillary statistic is brought out by the following result (Basu, 1958). Proposition 3.5.4 (Basu's Theorem) Ift(y)isa complete sufficient statistic for 6, it is independent of any ancillary statistic z(y). Proof. Note that P(z(y) £ A) is independent of 6 whenever A is a set for which the probability is defined. Denoting P(z{y) G A\t(y) = t 0 ) - P(z(y) G A) by gA{t0), we have E{gA{t(y))) = 0 for all 0. By
68
Chapter 3 : Review of Statistical Results
completeness, g^ must be zero almost everywhere. Therefore, t(y) must be independent of z(y). See Schervish (1995) for a stronger version of the above theorem. Example 3.5.5 In Example 3.5.1, where y ~ N(fil,a2I), the complete sufficient statistic l'y and the ancillary ||(J — n~ 1 ll')y|| 2 are independent. A simple way of deriving a sufficient statistic is given in the following proposition. Proposition 3.5.6 (Factorization Theorem) Let fg be the probability density function or the probability mass function of the random vector y, where 9 is a vector parameter which can assume any value from a set 0 . Then a statistic t(y) is sufficient for 0 if and only if fg can be factorized as fe(v) = 9o(t(y)) h(y), where gg and h are nonnegative functions and h does not depend on 6. Proof. We shall prove the result in the continuous case; the proof in the discrete case is similar. Suppose that fg can be factorized as stated in the proposition. Then the probability density function of t(y) at the point to is given by Potto) = /., . . fe{y)dy = gg(t0) h{y)dy = gg(to)H{to), Jt{y) = t0 Jt{y) = t0 for some function H. Thus, the conditional density of y given t(y) = to
fe(y) pg(t0)
=
ge(to)h(y) ge(to)H(to)
=
Wy)_ H{t0)'
which does not depend on 6. Therefore, t(y) is sufficient for 6. In order to prove the converse, let t{y) be sufficient for 0. Then the conditional density of y given t(y) — to,
Mv) Pe{to)'
3.5 Basic concepts of inference
69
does not depend on 0. If we denote this ratio by k(y,to), then we have fe(y) — Pe(to) k(y, to) for all io- In particular, we have
fe(y) = Pe{t{y))
k(y,t(y)),
which is of the form stated in the proposition.
n
A family V = {fg, 0kxi G ®} °f distributions of y is said to form a /c-dimensional exponential family if the density (or probability mass function) has the form My)=c(0)exp
Y,qj{O)tj{y)
h(y),
(3.5.1)
j=l
where q\{9),..., qk{0) and c(9) are real-valued functions of the parameter 0 and t\ (y),..., t)-(y) and h(y) are real-valued statistics. The quantities qi(9), , qk{9) are called the natural parameters of the exponential family. Most common families of distributions like the binomial, gamma, normal and multivariate normal can be shown to belong to such an exponential family. The following result shows that exponential families admit sufficient statistics for natural parameters which, under general conditions, are also minimal and complete. Proposition 3.5.7 If ylf... ,yn are samples from the distribution (3.5.1), and the set {{qi{0),q2{9),... ,qk{9)), 6 E &} contains a kdimensional rectangle, then the vector statistic
*(vi,---,yB)=fi;*i(yi):E*2(yi): \j=l
2=1
) !=1
is complete and sufficient for the natural parameters qi(9),...,
/
qk(9).
Proof. Sufficiency follows directly from the factorization theorem. For the proof of completeness we refer the reader to Lehmann and Casella (1998). D Example 3.5.8 Let the p-dimensional random vectors j / 1 ? . . . ,yn be samples from N(fj,, £ ) , where £ is positive definite. Here, 0 consists of
70
Chapter 3 : Review of Statistical Results
the elements of fj, and S. Define the sample mean vector and sample variance-covariance matrix as n
n
i/ = n~1Z!l/i.
S=
n-1^2(yi-y)(yi-y)'. i=l
i=l
Then the joint density of y 1 , . . . , yn can be written as
f[MVi) = (27r|S|)-"/ 2 expf-^(y J - M )'S- 1 (y 2 -/x) i=\
L
2=1
r i " oc |s|-"/2exP --^(yi-y =
l + y-^'^Hyi-y
+
y-^l
. i = i J |S|-"/2exP |-^X!(y i -y)'S- 1 (y i -i/) + n(y-/i)/S-1(y-/u) L
z
»=i
= |£|-"/ 2 exp [ - | (^(S- 1 ^) + y'S- ! y - 2y'S" V + /i'S" V)]
= |S|-"/2exP[-^'S-V] p [-1 (trfE-^S + yy')) - 2y'S"V)] Since the last expression is of the form (3.5.1), y and S + yy' must be complete and sufficient for the elements of E - 1 ^t and S . Therefore, y and S are complete and sufficient for /i and S.
3.6
Point estimation
Suppose that t(y) is used to estimate g(0), a function of 0. It cannot be a good estimator if it is systematically away from g (9). The quantity E[t{y)] —g(6) is called the bias of t(y). If the bias is zero for all 9 E ©, the estimator is called unbiased. The bias is only one criterion for judging the quality of an estimator. Sometimes the ill-effects of the deviation of t(y) from g{6) is expressed in terms of a function L(0,t(y)), called the loss-function. Usually the loss function is chosen such that it is nonnegative with L(9,g(9)) = 0.
3.6 Point estimation
71
Often it is also assumed to be convex in the second argument. The risk function of the estimator t is defined as R(d,t) = E[L(9,t(y))]. An important example in the case of a real-valued g (and a scalar t) is the squared error loss function, defined as L(9,t(y)) = (t(y) — g(9))2. In such a case, the risk of t, E[(t(y) — g(9))2] is called the mean squared error (MSE). It is easy to see that E[{t(y) - g{6))2} = Var(t(y)) + [E(t(y)) - g(9)}\ that is, the mean squared error of t(y) is the sum of its variance and squared bias. In particular, the risk of an unbiased estimator with respect to the squared error loss function is just its variance. If g is a vector-valued function of 9, the mean squared error (MSE) matrix of an estimator t(y) is defined as E[(t(y) — g(9))(t(y) — g{9))']. The MSE matrix of t(y) can be decomposed as E[(t(y) - g(9))(t(y) - g(9))'] = D(t(y)) + [E(t(y)) - g(9)][E(t(y)) - g{9)]'. In the above, the first term is the dispersion of t(y), while the second term is the bias vector times its transpose. Returning to the estimation of the real-valued functions of 6, the following result underscores the importance of a sufficient statistic in improving estimators by reducing their risk functions. Proposition 3.6.1 (Rao-Blackwell theorem) In the above set-up, let t(y) be sufficient for 6 G 0 , s(y) be an estimator of a real-valued function g(6) and L be a loss function which is convex in the second argument. If s(y) has a finite expectation, then the risk function (with respect to L) of the revised estimator E[s(y)\t(y)] is at most as large as that of s(y). Proof. Let /i(i0) = E[s(y)\t(y) = to\. According to the Jensen's equality, a convex function of the expected value of a random variable is less than or equal to the expectation of that convex function of the random variable. Choosing the random variable as s(y) and taking all the expectations with respect to the conditional distribution of y given
72
Chapter 3 : Review of Statistical Results
t(y) — to, we have for all 6
L(0,h(tQ)) < E[L(0,s(y))\t(y) = to}. Integrating both sides with respect to the distribution of t(y), we obtain R(0,h(t))
Var(s(y)) = Var(t(y) + (s(y)-t(y))) = Var{t(y)) + Var((s(y) - t{y))) > Var{t{y)). Therefore, t(y) must be a UMVUE. To prove the converse, let t(y) be correlated with z(y), an unbiased estimator of zero with finite and positive variance. Using the covariance adjustment principle of Proposition 3.1.2, we can construct another unbiased estimator of g{0),
s(y) = t(y) - Cov{t{y),z{y)){Var{z{y))}-lz{y),
3.6 Point estimation
73
which is uncorrelated with z(y). Thus, t(y) is the sum of two uncorrelated random variables one of which is s(y). It is easy to see that t(y) has larger variance than s(y), that is, t(y) cannot be a UMVUE. d A consequence of the above proposition is that if t\(y) and t%{y) are two UMVUEs, both must be uncorrelated with t\{y) — £2(2/)- It follows that Var{h{y)-t2{y)) = Cov{t1{y)My)-h{y))-Cov{t2{y),t1{y)-Uy))
= o,
that is, ti(y) = £2(2/) almost surely. This proves the uniqueness of the UMVUE whenever it exists. Given an unbiased estimator of g(6) and a complete sufficient statistic for 0, the UMVUE can be determined by the following extension of Proposition 3.6.1. It says in essence, that any function of such a complete sufficient statistic is the unique UMVUE of its own expectation. Proposition 3.6.3 (Lehmann-Scheffe theorem) In the above set-up, let g(6) have an unbiased estimator s(y), and t(y) be a complete sufficient statistic for 9. Then E(s(y)\t(y)) is the (almost surely) unique UMVUE of g{0). Proof. Let U\(y) = E{s{y)\t(y)). It is easy to see that U\(y) is a function of t(y), and is unbiased for g(0). To prove that it is the UMVUE, let ^2(2/) be another unbiased estimator of g(0) with smaller variance than U\ (y). We can assume without loss of generality that £^2(2/) is a function of t also (if not, we can reduce its variance by conditioning on t(y), while preserving unbiasedness). Note that E[Ui(t(y)) — U2{t{y))] = 0 for all 6. Because of the completeness of t(y), we must have U\(t(y)) — U2{t{y)) = 0 with probability 1 for all 6. Therefore, U2{y) cannot have smaller variance than U\{y). D Example 3.6.4 Let y ~ N(fj,l,a2I) as in Example 3.5.1. The statistic t(y) = l'y is complete and sufficient for fi. Any component of y is an unbiased estimator of /u. We can find the unique UMVUE of /j, by taking the conditional expectation of any of these given t(y). The result turns out to be n~lVy, the sample mean (see Exercise 3.18).
74
Chapter 3 : Review of Statistical Results
On the other hand, using the result of Example 3.5.8 for p = 1, we conclude that n~1l'y and n - 1 ||y — {n~ll'y)l\\2 are complete and sufficient statistics for /j, and a2. As n~ll'y and (n — l ) " 1 ^ — (n" 1 !'^)!!! 2 are unbiased estimators of fi and a2, respectively, and are functions of the complete and sufficient statistics, these must be UMVUEs. Apart from the UMVUE (which may not always exist), there are many formal methods of point estimation including the method of moments, the method of maximum likelihood, and several distance based methods. We shall briefly describe the method of maximum likelihood (ML) here. Suppose that a given observation y has a probability density function (or probability mass function — in the discrete case) fe(y), for a given parameter value 0. One can also interpret fe(y) as a function of 0 for a given/observed value of y. In the latter sense, it represents the likelihood of the parameter 0 to have generated the observed vector y. When fe(y) is viewed as a function of 0 for fixed y, it is called the likelihood function of 0. If we know y and would like to figure out which value of 0 is most likely to have generated this y, we should then ask for what 0 is fe(y) a maximum. A value 0 satisfying fe(y)>fe(y)
V0
is called the maximum likelihood estimator (MLE) of 0. Provided a number of regularity conditions hold and the derivatives exist, such a point can be obtained by methods of calculus. Since logp is a monotone increasing function of p, maximizing fe(y) is equivalent to maximizing log fe{y) with respect to 0, which is often easier (e.g., when fg{y) is in the exponential family). The maximum occurs at a point where the gradient vector satisfies the likelihood equation dlog fo(y) 80 and the matrix of second derivatives (Hessian matrix)
o2 log Mv) 8080'
3.6 Point estimation
75
is negative definite. Example 3.6.5 Let ynxl be easily seen that
~ N(fil,a2I).
Here, 9 = (n : a2)'. It can
log/*(y) = -(n/2)log(27ra 2 ) - (y -»l)'(y
-
M l)/(2a 2 ).
Consequently / glog/e(y) _ 50
nfx — l'y \ ^2 (y-A*i);(y-/*i) '
» V
2a 2
2(CT 2 ) 2
/
Equating this vector to zero, we obtain
and a 2
=
n - 1 ( y - /2l)'(y - /II)
as the unique solutions. Note that these are the sample mean and sample variance, respectively. These would be the respective MLEs of /j, and a2 if the Hessian matrix is negative definite. Indeed, n
I
d2 log fe(y) d9d9' 9=9
=
~72 nfl - l'y
njl— l'y
(72)2 \\y - fil\\2
V {72Y _ fn/a2 V 0
(72f 0_
\
n
2(^)2/
\
n/(2(o*)2))'
which is obviously negative definite. Recall that, when a sufficient statistic t(y) is available, the density of y can be factored as fe(y) = ge(t(y))
h(y)
in view of the factorization theorem (Proposition 3.5.6). Therefore, maximizing the log-likelihood is equivalent to maximizing loggg(t(y)).
76
Chapter 3 : Review of Statistical
Results
The value of 6 which maximizes this quantity must depend on y, only through t(y). Thus, the MLE is a function of every sufficient statistic. It can be shown that under some regularity conditions, the bias of the MLE goes to zero as the sample size goes to infinity. Thus, it is said to be asymptotically unbiased. On the other hand, a UMVUE is always unbiased. We now discuss a theoretical limit on the dispersion of an unbiased estimator, regardless of the method of estimation. Let fe(y) be the likelihood function of the r-dimensional vector parameter 6 corresponding to observation y. Let
^)-((-^r))'
^
assuming that the derivatives and the expectation exit. The matrix 1(6) is called the (Fisher) information matrix for 6. The information matrix can be shown to be an indicator of sensitivity of the distribution of y to changes in the value of 6 (larger sensitivity implies greater potential of knowing 0 from the observation y). Proposition 3.6.6 Under the above set-up, let t(y) be an unbiased estimator of the k-dimensional vector parameter g(9), and let
Then D(t(y)) > G{6)l-(Q)G'(e) in the sense of the Lowner order, and G(6)l~(0)G(0) on the choice of the g-inverse. Proof. Let s(y) =
°g °
does not depend
. It is easy to see that E[s{y)} = 0, and
do
\d2fe(y)l r\d2logfe(y)]
[ de^
deiddj
=
[
fe(y)
\fdf9(y)\ _
\
(dfe(y)Y
dOj
fe(y)
Lv / V
= 0-Cov(si{y),sj{y)),
dOj
fe(y)
).
3.6 Point estimation that is, D(s(y)) = 1(9). Further Cov(t(y),s(y)) dispersion matrix
77 = G{0). Hence, the
(t(y)\_(D(t(y)) G(9)\ G'(9) 1(9))
U{s(y))-(
must be nonnegative definite. The result of Exercise 2.19(a) indicates that D(t(y)) - G(9)1~(9)G'(9) is nonnegative definite. The inequality follows. The invariance of G(9)1~(9)G(9)' on the choice of 1~(9) is a consequence of Propositions 3.1.l(b) and 2.4.1(f). C The proof of Proposition 3.6.6 reveals that the information matrix has the following alternative expressions Tlff\ Z{6) =
p /d2log/fl(yA
„ \(d\ogfg(y)\
~E { 3989' ) = E [{
39 ) {
fd\ogfg(y)\r
39 ) '
The lower bound on D(t(y)) given in Proposition 3.6.6 depends only on the distribution of y, and the result holds for any unbiased estimator, irrespective of the method of estimation. This result is known as the information inequality or Cramer-Rao inequality. The Cramer-Rao lower bound holds even if there is no UMVUE for g(9). If t(y) is an unbiased estimator of 9 itself, then the information inequality simplifies to D{t(y)) >l'l(9). Example 3.6.7 Let ynxl ~ N(fj,l,a2I) and 9 - (// : a2)'. It follows from the calculations of Example 3.6.5 that m~{
0 n/(2^)J-
The information matrix is proportional to the sample size. The CramerRao lower bound on the variance of any unbiased estimator of /j, and a2 are a2/n and 2a 4 /n, respectively. The bound a2 jn is achieved by the sample mean, which is the UMVUE as well as the MLE of /i. The variance of the UMVUE of a2 (given in Example 3.6.4) is 2cr4/(n - 1), and therefore this estimator does not achieve the Cramer-Rao lower bound. The variance of the MLE of a2 (given in Example 3.6.5) is
78
Chapter 3 : Review of Statistical Results
2(n — I)u 4 /n 2 , which is smaller than the Cramer-Rao lower bound. However, the information inequality is not applicable to this estimator, as it is biased. If t(y) is an unbiased estimator of the parameter g(0) and bg is the corresponding Cramer-Rao lower bound described in Proposition 3.6.6,
then the ratio bg/Var(t(y)) is called the efficiency of t(y). The Cramer-Rao lower bound has a special significance for maximum likelihood estimators. Let 6Q be the 'true' value of 0, and 1(6) be the corresponding information matrix. It can be shown under some regularity conditions that, (a) the likelihood equation for 6 has at least one consistent solution 6 (that is, for all 6 > 0, the probability P[\\6 — 6Q\\ > 6] goes to 0 as the sample size n goes to infinity), (b) the distribution function of n 1//2 (0 — 0) converges pointwise to that of N(0,G(6)l-{60)G'{0)), and (c) the consistent MLE is asymptotically unique in the sense that if 6\ and #2 are distinct roots of the likelihood equation which are both consistent, then nll2(d\ — 62) goes to 0 with probability 1 as n goes to infinity. We refer the reader to Schervish (1995) for more discussion of Fisher information and other measures of information.
3.7
Bayesian estimation
Sometimes certain knowledge about the parameter 6 may be available prior to one's access to the vector of observations y. Such knowledge may be subjective or based on past experience in similar experiments. Bayesian inference consists of making appropriate use of this prior knowledge. This knowledge is often expressed in terms of a prior distribution of 6, denoted by IT(6). The prior distribution is sometimes referred to simply as the prior. Once a prior is attached to 0, the 'distribution' of y mentioned in the foregoing discussion has to be interpreted as the conditional distribution of y given 6. The average risk of the estimator t(y) with respect to the prior n(d) is
r(t,n) = JR(d,t)d7r(e),
3.7 Bayesian estimation
79
where R(9,t) is the risk function, defined in Section 3.6. Definition 3.7.1 An estimator which minimizes the average risk (also known as the Bayes risk) r(t, IT) is called the Bayes estimator of g(0) with respect to the prior IT. A Bayes estimator t of g(0) is said to be unique if for any other Bayes estimator s, r(s,7r) < r(t, n) implies that Pg(t(y) ^ s(y)) = 0 for all e <E 0 . The comparison of estimators with respect to a risk function sometimes reveals the unsuitability of some estimators. For instance, if the risk function of one estimator is larger than that of another estimator for all values of the parameter, the former estimator should not be considered a competitor. Such an estimator is called an inadmissible estimator. Definition 3.7.2 An estimator t belonging to a class of estimators A is called admissible for the parameter g(0) in the class A with respect to the loss function L if there is no estimator s in A such that R(6, s) < R(0,t) for all 0 E 0 , with strict inequality for at least one 6 € Q. D The above definition can be used even when the scalar function g and the scalar statistic t are replaced by vector-valued g and t. The risk continues to be defined as the expected loss, while the loss is a function of 0 and t. The squared error loss function in the vector case isp(y)-<7(0)|| 2 . Proposition 3.7.3 // a Bayes estimator is unique, then it is admissible. Proof. Let t be an inadmissible but unique Bayes estimator of g{6) with respect to the prior TT(6). Let s be another estimator such that R{0,s) < R(O,t) for all 6 with strict inequality for some 6. It follows that r(s,7r) < r(t,7r), that is, s is another Bayes estimator. The uniqueness of t implies that Pe(t(y) / s(y)) = 0 for all 9 £ ®. This contradicts the assumption that R{6, s) is strictly less than R(0, t) for some 0. We now obtain an explicit expression of the Bayes estimator in the case of the squared error loss function, and prove that it is essentially a
80
Chapter 3 : Review of Statistical Results
biased estimator. Proposition 3.7.4 function.
In the above set up, let L be the squared error loss
(a) The Bayes estimator of g(9) is t(y) = E(g(9)\y), where the expectation is taken with respect to the conditional distribution of 9 given y. (b) The Bayes estimator t(y) cannot be unbiased unless t(y) is almost surely equal to g(0). Proof. It is easy to see that t(y), as defined in part (a), is the unique minimizer of E[L(d,t(y))\y]. Therefore,
E[L(9,t(y))\y] < E[L(9,s(y))\y] for any other estimator s of g{0). By taking the expectation of both sides with respect to the distribution of y, it follows that t(y) minimizes the Bayes risk. In order to prove part (b), assume that t(y) is unbiased for g(6). Therefore, we have
E{t{y)\9) = g{0)
for all 6 e 0.
Therefore,
E[\\g(9)f] = E[(g(9))'E(t(y)\9)} = E[(g(9))'t(y)] = E[{t(y))'E(g(0)\y)} = E[\\t{y)f] Consequently
E[\\t(y) - g(9)f} = E[\\t(y)\\2} + E[\\g(0)f] - 2E[{g(e))'t{y)} = 0. This implies that E(\\t(y) — g(0)\\2) = 0, where the expectation is taken over the distributions of y and 9. Hence, \\t(y) — g(9)\\2 — 0 with probability 1. D It can be shown that the above proposition also holds when L is any quadratic loss function of the form (t(y) — g{9))'B(t(y) —g(0)), where B is a positive definite matrix.
3.7 Bayesian estimation
81
Minimizing the average or Bayes risk does not ensure that the risk will be as small as possible for a specific value of the parameter 0. Indeed, it does not make sense to choose an estimator which minimizes R{9,t) for a specific 0, because 6 is unknown. A conservative strategy would be to choose an estimator which minimizes R(6,t) in the worst possible case, that is, which minimizes sup 0 e 0 R(0, t). Such an estimator is called a minimax estimator. It is usually very difficult to find a minimax estimator. However, a solution can often be found in the form of a Bayes estimator. Specifically, we shall show that if we can find a prior such that the average risk of a corresponding Bayes estimator is equal to its maximum risk, then that Bayes estimator is also a minimax estimator. The prior which maximizes the average risk (of the corresponding Bayes estimator) over all possible priors is called a least favourable prior. Proposition 3.7.5 Suppose that a distribution n on 0 and a corresponding Bayes estimator t ofg(8), are such that r(t,7r) = sup R{d,t). Bee Then n is a least favourable prior and t is a minimax estimator of g(6). Further, ift is the unique Bayes estimator of g(0) (corresponding to ir), then it is the unique minimax estimator. Proof. Let TT* be another distribution on 0 , and i* be the corresponding Bayes estimator. Then r(t»,7r*) =
f R(0,U)dir.(0)
< sup R{0,t) Bee
=
< [ R{0,t)dTr*{0) r{t,ir),
which shows that TT is a least favourable prior. Let s be another estimator. It follows that sup R{9,t) see
= f R(0,t)dir(9) J
< f R{G,s) di:{$)< J
Consequently, t is a minimax estimator.
supR(6,s). Bee
82
Chapter 3 : Review of Statistical
Results
lit is the unique Bayes estimator and s is not a Bayes estimator, then / R(0, t) dn(0) must be strictly smaller than / R(0, s) dir(O). Therefore, s u p e e 0 R(6, t) < supeeGR(6,s), which implies that t is the unique minimax estimator. If a minimax estimator is not admissible, then there must be another estimator with smaller risk, which must also be minimax. Therefore, a minimax estimator is admissible whenever it is unique. Note that a Bayes estimator has a similar property (see Proposition 3.7.3).
3.8
Tests of hypotheses
Based on a random sample y from some parametric model fe(y), 6 £ 0 , an important aspect of inference deals with testing the validity (truth or credibility) of a certain statement (hypothesis) about the unknown parameter 6. If ©o and ©i denote disjoint subsets of 0 , one may wish to test the null hypothesis Ho 0 £ ©o versus the alternative hypothesis Hi : 6 6 @i. If &i, i = 0,1, contains just one parameter value, it is called a simple hypothesis and otherwise, a composite hypothesis. For any given sample y, the testing is accomplished by constructing a test function ip{y), taking values in [0,1], which denotes the probability of rejecting the null hypothesis for the given sample. The null hypothesis is rejected if
3.8 Tests of hypotheses
83
II error) over all tests
E[
sup
E[
o e 0i e e ©o A family of probability distributions having density (or probability mass function) fe{y), and indexed by the real-valued parameter 6 G ©, is said to have monotone likelihood ratio (MLR) in a statistic t(y) if for 0i < $2, fei{y)lfe2{y) i s a monotone function in t(y). One-parameter exponential families can be shown to have the MLR property. The following result presents a most powerful test when both Ho and Ti\ axe simple hypotheses. Proposition 3.8.1 (Neyman-Pearson lemma) For testing Ho OQ versus Ti\ : 9 = 0\, the test m(,,\-[l
M )
0=
iffei(y)>c-f6o{y),
10
iff0l{v)
where c satisfies the level condition E[tp(y)] = a for 9 = OQ, is the most powerful test of level a.
Proof. Let Ei denote expectation when 6 = 0i, i = 0,1. Let tp(y) be any level a test. Then Eo(
[My) - <-p(y)} = j [My) - ip(y)]fe1 {y)dy
84
Chapter 3 : Review of Statistical Results
= J[Mv) - ¥>(i/)][/«i (y) - cfgo(y)}dy + CEQ[
fe{y)
t(y) = * g Q ° sup fe{y)
(3.8.2)
Gee Clearly, £(y) is in the range [0,1]. The closer it is to zero, the less credible is the null hypothesis. Thus, the GLRT rejects %Q when £ < c, where c is determined by the level condition. It can be shown that subject to some regularity conditions, —21og£ has an asymptotic xt-i
3.9 Confidence region
85
distribution as the sample size goes to infinity, under the null hypothesis. The GLRT often leads one to such optimal tests as the UMP or UMPU tests, when these exist. A detailed discussion of testing of hypotheses, including various optimal tests and the regularity conditions required for the above asymptotic result can be found in Lehmann (1986). 3.9
Confidence region
As opposed to testing or verifying a given hypothesis, one would often be interested in constructing a set of values where the unknown parameter may lie, with a certain probability. In frequentist terminology such a set of possible values is called a confidence region or set, while the analogous Bayesian concept is referred to as a credible set. Given a sample y from fe(y), 6 6 0 , the goal is to construct a set C(y) C 0 with the property
pe[0 e C{y)] > i - a v e e &,
(3.9.1)
for a given confidence level 1 — a. In the above, PQ is probability computed on the basis of the density fg. In the frequentist sense, randomness of the event 6 £ C(y) is not because of the parameter 9, but due to the set C(y) which depends on y and varies from sample to sample. This random set C(y), called a level (1 — a) confidence set or region, contains an unknown but fixed 0 with probability (I—a) or more. Such a set should be small so that the chances of containing a 'wrong' 9 is as small as possible (without this restriction, 0 would be an excellent choice). We call a set C(y) uniformly most accurate (UMA) confidence region of level (1 — a) if it minimizes Pg[0*eC(y)}
V0,^0
among all sets satisfying (3.9.1). We also call a confidence region unbiased if it has a better chance of containing the 'correct' 9 than any incorrect 9*, that is, if Pe[OeC(y)]>Po[0*eC{y)]
V 0* ^ 0.
(3.9.2)
86
Chapter 3 : Review of Statistical
Results
A UMA confidence region among those satisfying (3.9.2) is called a uniformly most accurate unbiased (UMAU) confidence region of level a. One way of obtaining a UMA confidence region is to choose C(y) = {00:ye
A(60)},
where A(Go) is the acceptance region of the corresponding uniformly most powerful test for T-LQ : 0 = 9Q, when this test exists. A UMAU confidence region can be found from a uniformly most powerful unbiased test in a similar manner. Thus, there is a duality between hypothesis testing and confidence regions. We can also construct confidence regions from the generalized likelihood ratio test. E x a m p l e 3.9.1 Consider testing Tio : \i — /^o versus 1-L\ : /J, ^ /J-O from observation y n x l ~ N(nl,a2I). The uniformly most powerful unbiased level a test has the acceptance region (see Lehmann, 1986, p.195)
A(,0) = (
L:^
y/^/n
J
where y = n~ll'y and tn_xa is the (1 — j) quantile of the ^-distribution with n degrees of freedom. Thus, a UMAU confidence set for y, (in this case, an interval) is given by
C(y) = L:^
J 0 fe(y)K{a)da A set C, (y) C 0 with the property /
Jc.
ir(O\y)dO > 1 - a
3.10 Exercises
87
is called a (1 - a) level credible set. In order to make such a set the smallest possible, we look for a region C*(y) where the posterior is large, that is, C*(y) = {0 : n(O\y)>c}, where the threshold c is determined by the level condition. Such a set is called the highest posterior density (HPD) credible set. 3.10
Exercises
3.1 Prove the following facts about covariance adjustment. (a) If u and v are random vectors with known first and second order moments such that E{v) 6 C(D(v)), and B is chosen so that u — Bv is uncorrelated with v, show that D(u — Bv) < D(u) (that is, covariance adjustment reduces the dispersion of a vector in the sense of the Lowner order). (b) Let u, v and B be as above, v = (v[ : v'2)1 and B\ be chosen such that l'(u — BiV\) is uncorrelated with v\. Show that D(u — B\V\) > D(u — Bv) (that is, larger the size of v, smaller is the dispersion of the covariance adjusted vector). 3.2 If z ~ N(0, S), then show that the quadratic form z'YTz almost surely does not depend on the choice of the g-inverse, and has the chi-square distribution with p(S) degrees of freedom. 3.3 Prove Proposition 3.3.2. 3.4 Prove the following converse of part (a) of Proposition 3.3.2: If y ~ N(fj,, I), and y'Ay has a chi-square distribution then A is an idempotent matrix, in which case the chi-square distribution has p{A) degrees of freedom and noncentrality parameter fi'Afx. [See Rao (1973c, p.186) for a more general result.] 3.5 Show that part (b) of Proposition 3.3.2 (for /j, = 0) holds under the weaker assumption that A and B are any nonnegative definite matrix (not necessarily idempotent). 3.6 If y ~ N(0,I), and A and B be nonnegative definite matrices such that y'Ay ~ x\A) a n d v'(A + B)v ~ Xp(A+B)> t h e n s h o w
88
Chapter 3 ; Review of Statistical Results that y1 By ~ x2p{By 3.7 Let y and x be random vectors and W(x) be a positive definite matrix for every value of x. (a) Show that the vector function g(x) that minimizes E[(y — g(x))'W(x)(y - g(x)] is g(x) = E(y\x). (b) Show that y — E(y\x) is uncorrelated with E(y\x). [This result is a stronger version of the result of Exercise 1.6.] (c) What happens when W(x) is positive semidefinite? 3.8 Let x and y be random vectors with finite first and second order moments such that = (»*)
E(X)
\v)
W1
D(X)
\v)
= (V™
v*y)
\VyX vyyj-
Then show that (a) The best linear predictor (BLP) of y in terms of x, which minimizes (in the sense of the Lowner order) the mean squared prediction error matrix, E[(y — Lx — c)(y — Lx — c)'} with respect to L and c is unique and is given by E(y\x) =»y + VyxV-x{x
- fix),
(b) y — E(y\x) is uncorrelated with every linear function of x. (c) The mean squared prediction error matrix of the BLP is V yy
v
yxr
Xxv
xy
3.9 Modify the results of Exercise 3.7 under the constraint that g(x) is of the form Lx + c. 3.10 Conditional sufficiency. Let z have the uniform distribution over the interval [1,2]. Let yi,...,yn be independently distributed as N(6, z) for given z, 9 being a real-valued parameter. The observation consists of y = {y\ : ... : yn) and z. Show that the vector (n~ll'y : z) is minimal sufficient, even though z is ancillary (that is, the vector is not complete sufficient). In such a case, n~ll'y is said to be conditionally sufficient given z. Verify that given z, n~ll'y is indeed sufficient for 9.
3.10 Exercises
89
3.11 Let xi,...,xn be independent and identically distributed with density fe{x) — g(x — 9) for some function g which is free of the parameter 9. Show that the range maxi<;
a one-to-one mapping rj = h{6) such that the matrix —— is oO exists and is invertible, show that the Cramer-Rao lower bound is unaffected by this reparametrization. 3.15 Let y = n + au where u has zero mean, unit variance, and a completely known and differentiable probability density function h(-) which is symmetric around 0. Assume that the information matrix for 0 = (/J, : a2)' exists. (a) Show that the information matrix is I ^ V0
1 [ f°° ( dlogh(u)\2
z"
= ^[L{u^r2)
_ ) where 2W
. .
1
M«)*-IJ.
(b) Show that a2!^ > 1. (c) Show that the result of part (b) holds with equality if and only if h is the density of N(0,1). Interpret the result. 3.16 Let the random vector y n x l and the random matrix Xnxp finite mean and variance-covariance matrix of the form E(y:X) D(vec((y : X)'))
=
1 ® (/iv : / O , Vnxn ® S ( p + 1 ) x ( p + 1 ) ,
have
90
Chapter 3 : Review of Statistical Results for some nonnegative definite matrices V and 53. Show that y can be decomposed as V = (l:X)/3 + e, where E(e) = 0, D(e) = <J2V, e is uncorrelated with the elements of X, and (3 and a2 are functions of fj,y, iix and S. Can it be said that y follows the model (1.3.2) along with (1.3.3)? 3.17 Show that the UMVUE described in proposition 3.6.3 uniformly minimizes the risk of an unbiased estimator with respect to any loss function which is convex in its second argument. 3.18 If 2/1,..., yn are samples from N(9,1) and y is the sample mean, show that the conditional distribution of y\ given y is N(y,
oHl-l/n)). 3.19 If y i , . . . , y n are samples from N(fi,l) and the prior of fi is N(0, r), then show that the Bayes estimator of /n with respect to the squared error loss function is a9 + (1 — a) X^=i Ui, where a = n/(n + r). 3.20 Jeffreys' prior. Given the scalar parameter 0, suppose that the random variable y has the probability density function fe(y) and the Fisher information for 6 based on the observation y is 1(9). If / \JT{6)
3.10 Exercises
91
inadmissible with respect to the squared error loss function. Find an admissible estimator. 3.22 Let y ~ N(fil,I), with unspecified real parameter /i. Find the UMP test for the null hypothesis HQ : /i = 0 against the alternative Hi y, > 0. Find the GLRT for this problem. Which test has greater power for a given size? What happens when the alternative is two-sided, that is, 1-L\ : n ^ 0? 3.23 Let y ~ N(jil,I), with unspecified real parameter \x. Find the level (I —a) UMA confidence region for JJL when it is known that pL € [0, oo). Find the level (1 — a) UMAU confidence region for fi when it is known that fi £ (—oo, oo).
Chapter 4
Estimation in the Linear Model
Consider the homoscedastic linear model (y,X/3,a 2 /). This model is a special case of (1.3.2)—(1.3.3) where the model errors have the same variance and are uncorrelated. The unknown parameters of this model are the coefficient vector /3 and the error variance a2. In this chapter we deal with the problem of estimation of these parameters from the observables y and X. We assume that y is a vector of n elements, X is an n x k matrix and /3 is a vector of k elements. Some of these parameters may be redundant. If one has ten observations, all of them measuring the combined weight of an apple and an orange, one cannot hope to estimate the weight of the orange alone from these measurements. In general, only some functions of the model parameters and not all, can be estimated from the data. We discuss this issue in Section 4.1. Supposing that it is possible to estimate a given parameter, the next question is how to estimate it in an "optimal" manner. This leads us to the theory of best linear unbiased estimation, discussed in Section 4.3. An important tool used in the development of this theory is the set of linear zero functions, — linear functions of the response which have zero expectation. In Sections 4.2 and 4.4, we present the least squares method and the method of maximum likelihood (ML), the latter being considered under a distributional assumption for the errors. (Some other methods are discussed in Chapter 11.) Subsequent sections deal with measuring the degree of fit which the 93
94
Chapter 4 : Estimation in the Linear Model
estimated parameters provide, some variations to the linear model, and issues and problems which arise in estimation. 4.1
Linear estimation: some basic facts
Much of the classical inference problems related to the linear model (y, X/3, cr2l) concern a linear parametric function (LPF), p'/3. We often estimate it by a linear function of the response, l'y. Since y itself is modelled as a linear function of the parameter 0 plus error, it is reasonable to expect that one may be able to estimate /3 by some kind of a linear transformation in the reverse direction. This is why we try to estimate LPFs by linear estimators, that is, as linear functions of y. 4.1.1
Linear unbiased estimator and linear zero function
For accurate estimation of the LPF p'/3, it is desirable that the estimator is not systematically away from the 'true' value of the parameter. Definition 4.1.1 The statistic l'y is said to be a linear unbiased estimator (LUE) of p'(3 if E(l'y) = p'(3 for all possible values of (5. Another class of linear statistics have special significance in inference in the linear model. Definition 4.1.2 A linear function of the response, l'y is called a linear zero function (LZF) if E{l'y) — 0 for all possible values of /3. By putting p — 0 in the definition of the LUE, we see that any LZF is a linear unbiased estimator or LUE for 0. Therefore, by adding LZFs to an LUE of p'/3, we get other LUEs of p'{3. A natural question seems to be: Why bother about LZFs which are after all estimators of zero? There are two important ways in which the LZFs contribute to inference. First, they contain information about the error or noise in the model and are useful in the estimation of a 2 , which we consider in Section 4.7. Second, since the mean and variance of the LZFs do not depend on /3, they are in some sense decoupled from X/3 — the systematic part of the model (see also Remark 4.1.6). Therefore, we can use them to isolate the noise from what is useful for the estimation of the systematic part. This is precisely what we do in Section 4.3.
4.1 Linear estimation: some basic facts
95
Example 4.1.3 (A trivial example) Suppose that an orange and an apple with (unknown) weights a.\ and a2, respectively, are weighed separately with a crude scale. Each measurement is followed by a 'dummy' measurement with nothing on the scale, in order to get an idea about typical measurement errors. Let us assume that the measurements satisfy the linear model (Vi\
/I
Vz
0
\y4J
\0
/ei\
0\ 1
0/
\a2)
€3
'
\£4/
with the usual assumption of homoscedastic errors with variance a2. The observations j/2 a n d 1/4, being direct measurements of error, may be used to estimate the error variance. These are LZFs. The other two observations carry information about the two parameters. There are several unbiased estimators of a\, such as yi, y\ + 1/2 and Vi + Vi- It appears that y\ would be a natural estimator of a\ since it is free from the baggage of any LZF. We shall formalize this heuristic argument later. In reality we seldom have information about the requisite LPFs and the errors as nicely segregated as in Example 4.1.3. Our aim is to achieve this segregation for any linear model, so that the task of choosing an unbiased estimator becomes easier. Before proceeding further, let us characterize the LUEs and LZFs algebraically. Recall that Px = X(X'X)~X' is the orthogonal projection matrix for C(X). Proposition 4.1.4 tic I'y is
In the linear model (y, X/3, cr2l), the linear statis-
(a) an LUE of the LPF p'/3 if and only if X'l = p, (b) an LZF if and only if X'l = 0, that is, I is of the form (I — Px)m for some vector m. Proof. In order to prove part (a), note that I'y is an LUE of p'(3 if and only if E(l'y) = I'X/3, that is, the relation I'X/3 = p'/3 must hold as an identity for all /3. This is equivalent to the condition X'l = p.
96
Chapter 4 : Estimation in the Linear Model
The special case of part (a) for p = 0 indicates that I'y is an LZF if and only if X'l = 0. The latter condition is equivalent to requiring I to be of the form (I — Px)m for some vector m. Remark 4.1.5 Proposition 4.1.4(b) implies that every LZF is a linear function of (I—Px)y, and can be written as m'{I—P)y for some vector m. This fact will be used extensively hereafter. Proposition 4.1.4 will have to be modified somewhat for the more general model (y, X/3, a2V) (see Section 7.2.2). However, the characterization of LZFs as linear functions of (/ — Px)y continues to hold for such models. Remark 4.1.6 Consider the explicit form of the model considered here, y = X(3 + e. (1.3.2) A consequence of Remark 4.1.5 is that any LZF can be written as / ' ( / — Px)y, which is the same as I'(I — Px)e. Thus, the LZFs do not depend on /3 at all, and are functions of the model error, e. This is why LZFs are sometimes referred to as linear error functions. Example 4.1.7 (Trivial example, continued) ple 4.1.3, /I
In the case of Exam-
0 0 0\
/0 \
'** 0 I 1 I . »«--*,),,= » . \yj
\0 0 0 0/ Thus, every LZF is a linear function of j/2 and 2/4Example 4.1.8
Y
_ 5 —
(Two-way classified data) Consider the model
/llOxl
llOxl
Oioxl
llOxl
0l0xl\
llOxl
OlOxl , J-lOxl OlOxl
llOxl n Uioxl llOxl
llOxl n Uioxl OlOxl
OlOxl 1 ' llOxl llOxl/
l MlOxl
D
a a
P5xl —
a
P2 \T2/
This designed set-up is typical in agricultural experiments where several treatments are applied to various blocks of land. The experiment is often
4.1 Linear estimation: some basic facts
97
conducted to assess the differential impact of the treatments. Here, the parameter /j, represents a general effect which is present in all the observations, the parameters Pi and /% represent the respective effects of two blocks and the parameters T\ and T2 represent the respective effects of two treatments. The observed response has the combined effect of /i, the particular block where it comes from and the particular treatment received. We shall revisit this example several times in order to explain various concepts. Presently we wish to identify some LUEs and LZFs in this example. Each of the first ten observations is an unbiased estimator of the LPF (M + A + n ) . Here, p = (1 : 1 : : 1 : 0)' and the ith observation (j/i) can be written as I'y where I is the ith column of /40x40- The reader may try to identify the LPFs for which LUEs can be found from the last thirty observations. The difference (yt — yj) is an LZF for 1 < i,j < 10, i / j . It can be easily verified that yj — yj — I'y where I is the difference between the ith and jth columns of /40x40, a n d the LZFs described here satisfy the condition X'l = 0. These LZFs are by no means the only ones in this example, and the reader is encouraged to look for other LZFs. 4.1.2
Estimability and identifiability
Proposition 4.1.4(a) provides a condition not only on I, but also on p. If p does not lie in C(X'), the column space of X', there is no I satisfying the condition X'l = p. Thus, some LPFs may not have an LUE. Definition 4.1.9 An LPF is said to be estimable if it has an LUE. Since the existence of linear unbiased estimator is the primary concern here, a more appropriate name for such a function should be linearly estimable. However, the less specific term 'estimable' is used almost universally to describe this. The discussion preceding Definition 4.1.9 leads to the following simple characterization. Proposition 4.1.10 A necessary and sufficient condition for the estimability of an LPF (p'/3) is that p € C(X'). D
98
Chapter 4 : Estimation in the Linear Model
Proposition 4.1.10 says that p'0 is estimable when p' is any row or linear combination of rows of the X matrix. If p' is a particular row of the X matrix, the corresponding element of the response vector y is itself an unbiased estimator of p'j3. When p' is a specific linear combination of the rows of X, this linear combination of the corresponding elements of y is an unbiased estimator of p'/3. All these LPFs must be estimable. The result of Proposition 4.1.10 is just one of the numerous characterizations of estimability that can be found in the literature. Alalouf and Styan (1979) give a catalogue of fifteen other equivalent conditions. Most of these are merely algebraic characterizations that do not provide much insight into the issue. See Exercise 4.6 for an easily verifiable characterization. Remark 4.1.11 A vector LPF, A/3, is estimable if and only if C(A') C C(X'). The entire vector /3 is estimable if and only if p(X) = k, the number of parameters. In such a case all LPFs are estimable (Exercise 4.2). Example 4.1.12 (Trivial example, continued) The rank of the X matrix in Example 4.1.3 is 2, which means that every LPF of this model is estimable. Indeed, for arbitrary p\ and p2, an LUE of the LPF p\ai + p2a2 ispij/i +P22/3a Example 4.1.13 (Two-way classified data, continued) Consider the model of Example 4.1.8. Here, the first, second and fourth columns of X form a basis of C(X). Since there are five columns but p(X) is only 3, we can not expect all the parameters to be estimable. The LPF T\ — T2 is estimable, because the corresponding p-vector, (0 : 0 : 0 : 1 : — 1)' lies in C(X'). (We can also cite a specific LUE, such as y\ — j/31). However, n is not estimable, since ( 0 : 0 : 0 : 1 : 0 ) ' ^ C(X'). Definition 4.1.9 may suggest that the only defect of a non-estimable LPF is that there is no linear and unbiased estimator. It appears as if a non-estimable LPF might be reasonably estimated by a biased and/or nonlinear estimator. However, this is not the case. Consider the model of Example 4.1.8, and let 1, 2, 3, 4 and 5 be a set of candidate values of the unknown parameters, /u, f3\, /32, T\ and T2, respectively. If one
4.1 Linear estimation: some basic facts
99
subtracts 2 from fx and adds 1 to each of /3\, fa, T\ and T2, this alternative set of values has the same mean and variance of all the observations. There is no way to distinguish one set of candidate values from the other, on the basis of the observables y i , . . . , j/4o- (The reader may look for other candidate values of the parameters which are equally plausible.) One value of JJ, cannot be empirically more reasonable than any other. In this sense, /i can not be identified on the basis of the observations. The non-estimability of \i has to be appreciated in this context. The above example illustrates the difficulty in trying to 'estimate' non-estimable functions. This interpretation of non-estimability holds quite generally: given a non-estimable LPF and a 'candidate' /3, one can always find infinitely many alternative values of the parameter /3, each of which corresponds to a different value of the this LPF but makes no difference to the y-vector. It emerges from the above discussion that the issue of estimability of an LPF in a linear model is related to a more fundamental question: whether a parameter can be meaningfully identified. Let us formalize this notion. Definition 4.1.14 Let f(y;6), 0 G 0 be a family of multivariate probability density functions. A measurable parametric function g defined on 0 is said to be identifiable by distribution, or simply identifiable, if for any pair 9\, #2 in 0 , the relation g(0\) ^ ff(#2) implies
f(y,ei)^f(y,e2). The above definition is due to Bunke and Bunke (1974). It essentially says that two distinct values of an identifiable parametric function should always lead to different likelihoods of the observation.
Proposition 4.1.15
An LPF in the linear model (y,Xfi,a2I)
is
identifiable if and only if it is estimable. Proof. Consider the LPF p'fi. It is identifiable if and only if p'fi1 / p'P2 => X/3i / X(32 for all /3X and /3 2 . This condition is equivalent to X{3 = 0 =^ p'(3 = 0. According to Exercise 2.15, the latter condition can be written a s p e C(X'), which is the necessary and sufficient condition for estimability (see Proposition 4.1.10).
100
Chapter 4 : Estimation in the Linear Model
A consequence of Proposition 4.1.15 is that a non-estimable LPF in the linear model does not have a linear or nonlinear unbiased estimator (see Exercise 4.3). Non-estimable or non-identifiable LPFs occur when there is some redundancy in the model description in the form of too many parameters (see Remark 4.1.11). This can be 'rectified', if desired, through a reparametrization, which is discussed in Section 4.8. Often it is convenient not to reparametrize, particularly when the corresponding X matrix has a special structure. The methods developed in this chapter are perfectly applicable to models with redundancy.
4.2
Least squares estimation
The least squares method is the oldest method that is used for estimation in the linear model. The error vector of the linear model (y,Xj3,a2I) can be written as (y — X/3). One would expect the elements of this vector to be generally large whenever an 'incorrect' value of f3 is plugged in. Thus it makes sense to estimate the value of (3 so as to minimize the sum of squared elements of this error vector. Such an estimator is called a least squares estimator (LSE) of /3. Formally, an LSEis 3 L 5 = argimn(y-X/3)'(y-X/3). (4.2.1) Differentiating the quadratic function with respect to /3 and setting it to zero, we have X'X/3 = X'y (4.2.2) The above equation is traditionally referred to as the normal equation. A general solution to the normal equation is of the form (X'X)~X'y, where (X'X)~ is any g-inverse of X'X (see Remark 2.7.2). Thus, 3 L S = (X'X)-X'y.
(4.2.3)
Proposition 2.7.1 indicates that an LSE always exists, but it is uniquely defined if and only if X'X is nonsingular, that is, if X has full column rank. In such a case (3 is estimable (see Remark 4.1.11), and the unique least squares estimator of /3 is
£LS = (X'Xy'X'y.
(4.2.4)
4.2 Least squares estimation
101
It is easy to see that f3LS is a linear estimator and it is unbiased whenever it is uniquely denned. It should be noted that the least squares method provides an estimator of the entire parameter vector, /3, - even if one is interested in a single estimable LPF (say, p'/3). When p'/3 is estimable, it follows from Proposition 4.1.10 that P PLS 1S * n e same for all LSEs of f3. Hence, any value of p'/3 other than p'fiis corresponds to a choice of /3 that does not minimize (y — X/3)'(y — X/3). In this sense, p' /3LS is the unique
LSE ofp'p. Example 4.2.1 (Simple linear regression) When there is only one explanatory variable, the linear model (1.3.2) simplifies to y = pol + PiX + e.
(4.2.5)
This model is known as the simple linear regression model. Here, X = (1 : x) and 0 = (/30 : /?i)'. Thus, the unique LSE of /? is -s _ / n nx \~l ( ny\ PLS ~ I\nx= II«c|| || ||2/I I\x /y j) ) where x = n~ll'x and y = n~ll'y. the LSEs of Pi and /?o simplify to o Pi
After some algebraic manipulation
x'y-n(x){y) =
-j]—To
7^W
(4.2.bj
||a;|| 2 — n{x)1
Po = y-pix.
(4.2.7)
These estimators are linear in y and unbiased. Example 4.2.2 (World population data) For the world population data of Table 1.2, let y be the mid-year population for a given year x. If the linear model
y = PQ+ plX + e is assumed, then the least squares estimates of the parameters, obtained by evaluating the expressions given Example 4.2.1, are /30 = —158.3 and Pi = .0822.
102
Chapter 4 : Estimation in the Linear Model
4.3 Best linear unbiased estimation Suppose that A/3 is an estimable (vector) LPF of the linear model (y,X/3,a2I). Given an LUE of A/3, we can always construct another LUE by simply adding an LZF to it. a Thus, there is a large class of LUEs of any given estimable LPF. We seek to identify a member of this class which has the smallest dispersion (in the sense of the Lowner order denned in Section 2.6). Definition 4.3.1 The best linear unbiased estimator (BLUE) of an estimable vector LPF is defined as the LUE having the smallest dispersion matrix. The BLUE of a single estimable LPF is the LUE having the smallest variance. Suppose that L\y and Lyy are two distinct LUEs of the vector LPF A/3. It is entirely possible that neither D(L\y) — D(L2y) nor D(L2y) —D(Liy) is nonnegative definite (after all, the Lowner order is only a partial order). Therefore, it is not obvious that there would exist an LUE of A/3 whose dispersion is smaller than that of every other LUE. We shall prove that the BLUE of an estimable LPF not only exists but is also unique. We begin by showing an important connection between BLUEs and LZFs. Proposition 4.3.2 A linear function is the BLUE of its expectation if and only if it is uncorrelated with every LZF. D Proof. Let L\y and L,2y be two distinct LUEs of the same vector LPF, and L\y be uncorrelated with every LZF. Rewrite L,2y as
L2y = Liy + (L2 - Lx)y. Notice that (L2 — L\)y is a vector LZF and, hence, uncorrelated with L\y. Therefore,
D(L2y) = D(LlV) + D((L2 - LJy) > D(LlV). a In
fact, every other LUE can be obtained this way.
4.3 Best linear unbiased estimation
103
The strict inequality follows from the fact that the LZF (L2 — L\)y cannot be identically zero if the two LUEs are distinct. This proves the 'if part. In order to prove the 'only if part, let Ly be an LUE which is correlated with the nontrivial LZF, m'y. Consider L\y — Ly — bm'y,
where
It is easy to see that Cov(Liy,m'y) Ly = L\y + bm'y, we have
b—
Cov(Ly,m'y)/Var(m'y).
= 0. Rewriting the above as
D(Ly) = D(LlV) + D(bm'y) > D{Liy). Note that Ly and L\y are LUEs of the same LPF, so Ly can not be the BLUE of this LPF. It is clear from Proposition 4.3.2 that if an LPF has a BLUE, then any other LUE of that LPF can be expressed as the sum of two uncorrelated parts: the BLUE and an LZF. Any ordinary LUE has larger dispersion than that of the BLUE, precisely because of the added LZF component, which carries no information about the LPF. Having understood this, we should be able to improve upon any given LUE by 'trimming the fat.' To accomplish this, we have to subtract a suitable LZF from the given LUE so that the remainder is uncorrelated with every LZF. The task is simplified by the fact that every LZF is of the form m'(I — Px)y for some vector m (see Proposition 4.1.4). Therefore, all we have to do is to make the given LUE uncorrelated with (/ — Px)yThe covariance adjustment principle of Proposition 3.1.2 comes handy for this purpose. If Ly is an LUE of an LPF, then the correspondProposition 4.3.3 ing BLUE is LPxy. Proof. Note that one can write Ly = LPxy+L(I—Px)y. Since EL(I— Px)y — 0, LPxy has the same expectation as Ly. Further, LPxy is uncorrelated with any LZF of the form m'{I — Px)y. Therefore, LPxy must be the BLUE of E{Ly). D
104
Chapter 4 : Estimation in the Linear Model
This proposition gives a constructive proof of the existence of the BLUE of any estimable LPF. Instead of modifying a preliminary estimator, one can also construct the BLUE of a given LPF directly using the Gauss-Markov Theorem (see Proposition 4.3.9). Remark 4.3.4 Proposition 4.3.2 is the linear analogue of Proposition 3.6.2. The BLUE of a single LPF is a linear analogue of the UMVUE, and the LZFs are linear estimators of zero.b Indeed, when y has the normal distribution, it can be shown that the BLUE of any estimable LPF is its UMVUE and that any LZF is ancillary (Exercise 4.20). In the general case, the uncorrelatedness of the UMVUE and estimators of zero did not provide a direct method of constructing the UMVUE, because it is difficult to characterize the set of all estimators of zero. (We needed an unbiased estimator, a complete sufficient statistic and an additional result — Proposition 3.6.3 — for the construction). In the linear case however, 'zero correlation with LZFs' is an adequate characterization for constructing the BLUE — as we have just demonstrated through Proposition 4.3.3. D We now prove that the BLUE of an estimable LPF is unique. Proposition 4.3.5
Every estimable LPF has a unique BLUE.
Proof. Let L\y and L2y be distinct BLUEs of the same vector LPF. Writing L\y as L2y + {L\ — L2)y and using Proposition 4.3.2, we have
Var(Liy) — Var(L2y) + Var({L\ — L2)y). Therefore (Li — L2)y has zero mean and zero dispersion. It follows that {L\ — L2)y must be zero with probability one and that L\y — L2y almost surely. d Although we have proved the existence and uniqueness of the BLUE of an estimable LPF, another point remains to be clarified. Suppose that A/3 is an estimable vector LPF. Since all the elements of A(3 are estimable, these have their respective BLUEs. If these BLUEs are arranged as a vector, would that vector be the BLUE of A/3? Proposition 4.3.6 Let Ly be an LUE of the estimable vector LPF A/3. Then Ly is the BLUE of A/3 if and only if every element of Ly bSee
Chapter 11 for a linear version of the fundamental notions of inference.
4.3 Best linear unbiased estimation
105
is the BLUE of the corresponding element of A/3. Proof. The 'if part is proved from the fact that the elements of Ly has zero correlation with every LZF. The 'only if part follows from the fact that Lowner order between two matrices implies algebraic order between the corresponding diagonal elements. Proposition 4.3.6 answers a question but gives rise to another one. Why do we bother about Lowner order of dispersion matrices, if the BLUE of a vector LPF is nothing but the vector of BLUEs of the elements of that LPF? The reason for our interest in the Lowner order is that it implies several important algebraic orders. Let Ly be the BLUE and Ty be another LUE of A/3. It follows from Proposition 2.6.2 that ti(D(Ly)) < tr(D(Ty)), that is, the total variance of all the components of Ly is less than that of Ty. This proposition also implies that \D(Ly)\ < \D(Ty)\. It can be shown that the volume of an ellipsoidal confidence region of A/3 (see Section 5.2.2) centered at Ty is a monotonically increasing function of \D(Ty)\. Thus, a confidence region centered at the BLUE is the smallest. It also follows from Proposition 2.6.2 that the extreme eigenvalues of D(Ly) are smaller than those of D(Ty). The reader may ponder about the implications of these inequalities. Example 4.3.7 (Trivial example, continued) For the linear model of Example 4.1.3 it was shown that every LZF is a linear function of y2 and yi (see Example 4.1.7). Since y\ and y^ are both uncorrelated with y2 and y^, these must be the respective BLUEs of a.\ and a2. Thus, the 'natural estimator' mentioned in Example 4.1.3 is in fact the BLUE. Another LUE of a.\ is y\ + y2, but it has the additional baggage of the LZF y2, which inflates its variance. The variance of y\ + y2 is 2cr2, while that of the corresponding BLUE (yi) is only a2. Example 4.3.8 (Two-way classified data, continued) For the model of Example 4.1.8, C(X) is spanned by the orthogonal vectors u\, u2 and us, where u\ is the difference between the last two columns of X and u2 and it3 are the second and third columns of X, respectively. It
106
Chapter 4 : Estimation in the Linear Model
follows that
P rx
= P
+ P
r«i
+ P
^ *u 3 ^ -^3
=
L 40
/ 311' 11' 11' " ' 311' -11'
-11' \ 11'
11' - 1 1 ' 3 - 1 1 ' 11' ' ^ -11' 11' 11' 3 11' j
Each block in the above matrix has order 10 x 10. We have already noted that the first observation is an LUE of /j, + ft + T\ . The LUE can be expressed as I'y where I is the first column of /40x40- According to Proposition 4.3.3, the BLUE of n + ft + n is l'Pxy. The BLUE simplifies to \yx + \y2 + \y~s — \y^, where the quantity |/j is the average of the observations J/io(i-i)+i> > 2/io(i-i)+io> f° r z = li 2, 3,4. Likewise, the BLUE of /J + A + r 2 is \yx - \y2 + fy 3 + \yA. The BLUE of T\ — T
= A{X'X)~X'y = A/3LS,
by virtue of (4.2.3). The Gauss-Markov theorem is also called the principle of substitution, since the BLUE of any estimable LPF p'/3 is obtained simply by substituting any least squares estimator /3LS for f3, thus getting p''/3LS. The following proposition gives a converse to the principle of substitution.
4.4 Maximum likelihood estimation
107
Proposition 4.3.10 // /3 is such that for every estimable p'/3 the BLUE is p f3, then /3 must be a least squares estimator of /3. Proof. Let (3 = Gy, which satisfies the condition of the proposition. As Xf3 is estimable, XGy is the BLUE of X0. Since the BLUE is unbiased, we have XGX = X. This implies that XGPX = Px. The BLUE XGy must be uncorrelated with (I — Px)y- Therefore, XG(I — Px) = 0. Adding this to the equation XGPX = Px, we have XG = Py. Let Fy be any competing estimator of (3. It follows that \\y-XFy\\2 = \\y-XGy\\2 + \\X(G-F)y\\2 + 2y'(I-Px)X(G-F)y = \\y-XGy\\2 + \\X(G-F)y\\2 > ||y-XGy||2. Thus, /3 is a least squares estimator. 4.4
Maximum likelihood estimation
If the errors in the linear model (y, X(3, a21) are assumed to have a multivariate normal distribution, then the likelihood of the observation vector y is (27ra2)-"/2 e x p [ - ^ ( y - X/3)'(y - X/3)]. It is clear that a maximum likelihood estimator (MLE) of /3 is a minimizer of the quadratic form (y — X/3)'(y — X/3), which is by definition an LSE. Substituting the maximized quadratic form into the likelihood and maximizing it with respect to
^2ML = Uy - XPML)'(y - X0ML). We shall revisit the issue of estimating a2 in Section 4.7. An MLE of /3 is an LSE when the errors have the normal distribution. Therefore the condition for its uniqueness is that X should have full column rank. Further, the MLE of any estimable LPF is unique and it coincides with its BLUE (or LSE). If the MLE of /3 is unique, one may even derive the MLE of a nonlinear parametric function.
108
Chapter 4 : Estimation in the Linear Model
The multivariate normal distribution is a special case of the general class of spherically symmetric distributions. When the error distribution is spherically symmetric, it can be shown that an MLE of /3 is an LSE (Exercise 4.16). Although the three estimation strategies considered so far yield the same unbiased estimator of a given estimable LPF, not all is well with this estimator. It is inadmissible with respect to the squared error loss function and is sensitive to small changes in the measurements and/or to non-normality of the errors. Other estimation strategies which take some of these factors into account are considered briefly in Chapter 11.
4.5
Fitted value, residual and leverage
Recall that the LPF Afi is estimable if and only if A is of the form LX. The special case A = X corresponds to the vector LPF X/3, which is the systematic part of the response y. The BLUE of this vector is Px y. We formally define y = Pxy. (4.5.1) The elements of y are the fitted values of the respective observations. Since the matrix Px turns y into y, it is often referred to as the hatmatrix, and denoted by H. It is also called the prediction matrix. On the other hand, the vector e = y-y
= (I-H)y
(4.5.2)
is called the residual vector. The elements of the residual vector are LZFs. It follows from (4.5.1) that the fitted value for the ith observation is n .7 = 1
where hij is the (z,j)th element of the hat-matrix. The hijs for j = 1,... ,n are the weights with which the various components of y are combined to obtain &. In particular, the diagonal element h{^ is the weight of yi in the linear combination that gives y,-. This number is called the ith leverage, and the notation is commonly abbreviated as hi. The
4.5 Fitted value, residual and leverage
109
larger the leverage of an observation, the larger is its contribution to the corresponding fitted value. Since H is idempotent and has eigenvalues which are either 0 or 1, the leverages are always in the range [0,1]. A geometric view of the decomposition of y into y and e is given in Section 11.5.1. Proposition 4.5.1 Every BLUE is linear function of the vector of fitted values, and every LZF is a linear function of the residual vector. Proof. Let p'j3 be an estimable LPF, so that p G C(X'). We can write p as X'(X')~p and p'/3 as p'X~X/3. According to Proposition 4.3.9, the BLUE of the latter is p'(X'X)-X'y or p'X'y. It follows from Remark 4.1.5 that every LZF is of the form l'(I — Px)y or I'e for some I. D Remark 4.5.2 The class of all vectors I such that I'y is a BLUE is referred to as the estimation space. Likewise, the error space is the space of vectors m such that m'y is an LZF. Proposition 4.5.1 implies that the estimation and error spaces of the linear model (y, Xf3, o2l) are C{X) and C(X)X, respectively. Note that in general the vector parameter /3 may not be estimable, unless X has full column rank (see Remark 4.1.11). Even so, we can formally define the following estimator of /3: 3 = X~Pxy = X~y,
(4.5.3)
where X~ is any g-inverse of X. Although this estimator is expressed in terms of y, it is identical to f3LS (see Exercise 4.18). Therefore all BLUEs can be generated from it by the substitution principle of the Gauss-Markov theorem. In particular, if /3 is estimable, then /3 is the BLUE of /3. Example 4.5.3 (Two-way classified data, continued) Consider once again the model of Example 4.1.8. Let y l t y2, y3 and y4 be the simple averages of the four groups of ten successive observations, as denoted in
110
Chapter 4 : Estimation in the Linear Model
Example 4.3.8. Then we have
X'X
/40 20 = 20 20 \20
20 20 0 10 10
20 20 20 \ 0 10 10 20 10 10 , 10 20 0 10 0 2 0 /
(yl + y2 +y3 + j / 4 \ y x + y3 X'y = 10 10y2 + J74 y x + y2 V V3+V4 )
It can be shown that the three non-zero eigenvalues of X'X and 20. The corresponding eigenvectors are /0\ 1 1 vi = - 1 , Z 1
« 2 = 2- 1 / 2
\i/
/ 0 \ 1 -1 , 0
« 3 = 2- 1 / 2
are 80, 20 / 0 \ 0 0 . 1
V-1/
\ °/
Using these and Proposition 2.5.2, we obtain the Moore-Penrose inverse of X'X as
(X'X)+
= —
/ 4 2 2 2 2 17 -15 1 2 -15 17 1 2 1 1 17 V2 1 1 -15
Using the easily verifiable fact that X+ 'estimator' of /3 and the fitted values
Ul-SR + W+W
2 \ 1 1 . -15 17 /
= (X'X)+X',
we have the
U+ft+ft+W
In particular, the BLUE of the estimable LPF T\ — r2 is 9\ — T2 = (Vi + y)/2 — (^3 + F4)/2, as we had found out in Example 4.3.8. Although T\ — T2 is estimable, no single component of /3 is estimable. Another g-inverse of X in the expression X~y would lead to another
4.6 Dispersions
111
least squares estimator of /3. For instance, since the vector v = (1 : — 1 : — 1 : 0 : 0 ) ' satisfies Xv — 0, we can get another X~ by adding vu' to X + , where u is any vector of appropriate dimension. By choosing u as -3/160 times the first column of X, we have /
! 3 =^
-j/i -2/2-2/3-2/4
N
5|/i — 3y2 + 5y3 - 3^4 -3f7i + 5y2 - 3y3 + 5y4 . 8yi+8y 2 V 8j/3 + 8|/4 /
This is a different LSE of 0 which leads to the same fitted values of y and the same estimator of T\ — T2O
4.6
Dispersions The dispersion matrices of y and e are easily seen to be D{y)
= a2Px,
D(e) = a2{I-Px).
(4.6.1) (4.6.2)
Suppose that p'/3 and g'yS are two estimable LPFs. We can write p and q as X'(X')~p and X'(X')~g, respectively. Using (4.6.1) we have Cov(p'(3,q'0)
=
Cov(p'X-y,q'X-y)=a2p'X-Px(X')-q = a2p'(X'X)-q, Var{p'J3) = a2p'{X'X)-p.
Since p and q are in C(X'X), the above expressions do not depend on the choice of the g-inverse (see Proposition 2.4.1, part (f)). When X has full column rank, the dispersion of the BLUE of /3 is
D(J3) =
a2(X'xyl.
Example 4.6.1 (World population data, continued) For the world population data of Table 1.2, let the mid-year population (y) for a given
112
Chapter 4 : Estimation in the Linear Model
year (x) follow the linear model of Example 4.2.2. It turns out that
(X'X\~l= {
'
(fa\ = ( \fa)
5958
V-2.993
-2-993^
.001504 7 '
Thus, Var(f3o) = 5958cr2, Var((h) = .001504
=
-cr2hij,
i,j= l,...,n,
= a2(l-hi),
t = l,...,n,
(4.6.3) (4.6.4)
where hij is the (i,j)th element of H, defined in the previous section, and hi is the ith leverage. Example 4.6.2 (Two-way classified data, continued) We have shown in Example 4.3.8 that _ J_ * ~ 40
/ 3- 11' 11' 11' - 1 1 ' \ 11' 3 11' - 1 1 ' 11' 11' - 1 1 ' 3 11' 11' ' \ -11' 11' 11' 3 11' j
The dispersion matrix of y if &2PX. In particular, hi = 3/40 for all i. Note that r\ — T2 and fa — fo can be written as p'(3 and q'/3, respectively, where p' = (0 : 0 : 0 : 1 : -1) and q' = (0 : 1 : - 1 : 0 : 0). It follows from the calculation of (X'X)+ in Example 4.5.3 that Var(?i-T2) = o2p'{X'X)-p
=
A similar computation reveals that VariPi-fa) = a2/10,
C o u ^ i - ^ . n - ^ ) = 0.
Consider the decomposition of y into the vector of fitted values, y, and the residual vector, e: y = y + e.
4.7 Estimation of error variance and canonical decompositions 113 It follows from Proposition 4.3.2 that the corresponding dispersion matrix has the similar decomposition: D(y) = D(y) + D(e). The expressions (4.6.1) and (4.6.2) give expressions for the dispersion matrices on the right hand side. These dispersions have the following interpretation in the case of normal errors. D(y) D(e)
= =
D(y\e), D(y\y).
These results follow directly from (3.2.2). Note that y is an LUE of the vector LPF X/3, but the corresponding BLUE is y. The dispersion of the latter is the remaining variability of y after the variability of the LZFs has been removed by the conditioning. Similarly, the dispersion of the residual vector corresponds to the remaining variability of y after the effect of the BLUEs are removed. We shall show in Chapter 7 that these interpretations hold even when the model errors are heteroscedastic and correlated. 4.7
Estimation of error variance and canonical decomposition
We mentioned earlier (see Remark 4.1.6) that the LZFs are functions of model errors alone. It is easy to see that LZFs are the only linear functions of the model errors which are observable as linear functions of the response y. Therefore, a sensible way of estimating the error variance a2 may be through a sum of squares of suitable LZFs. Since there are infinitely many LZFs in general, one has to look for a minimal set of LZFs that would suffice. 4.7.1
A basis set of linear zero functions
We begin with a series of definitions for the model (y, X/3, a21). Definition 4.7.1
A set of LZFs is said to be a generating set if every
114
Chapter 4 : Estimation in the Linear Model
LZF of non-zero variance is almost surely a linear combination of the members of this set. d Definition 4.7.2 A generating set of LZFs is called a basis set if no linear combination of the members of this set is identically zero with probability 1. Definition 4.7.3 A basis set of LZFs is said to be a standardized basis set if all its members are uncorrelated and have variance equal to a2.^ Example 4.7.4 It is easy to see from Proposition 4.5.1 that the elements of e form a generating set of LZFs. Since the elements of e are correlated (see (4.6.3)), this is not a standardized basis set. However, we can find a standardized basis set of LZFs in the following way. The projection matrix / — Px can be factored as CC where C is an n x (n—r) semi-orthogonal matrix and r is the rank of X (this follows from Proposition 2.5.2(f)). If C~L is any left-inverse of C, then D(C~Le) = o2l. Therefore, the n—r elements of the vector C~Le form a standardized d basis set of LZFs. Once we have a standardized basis set of LZFs, we can estimate a2 by the average of their squared values. It would be ideal if the estimator does not depend on the choice of the standardized basis set. The following proposition ensures this. Proposition 4.7.5 // z is any vector whose elements constitute a standardized basis set of LZFs of the model (y,X/3,a2I), then (a) z has n—r elements, where r = p{X); (b) z'z = e'e. Proof. Suppose that z has m elements, l\y,... ,l'my. Further, let L = (h---lm)Then D{z) = o2L'L. Since the LZFs contained in the basis are uncorrelated and have variance a2, the columns of L must be orthonormal. Proposition 4.1.4 implies that C(L) C C(I — Px)Therefore m = p(L) < p[I — Px) = n—r. If m < n—r, we can find a vector I in C{X)A- such that L'l = 0. Then I'y would be a nontrivial LZF uncorrelated with z, which is a contradiction. Hence, m must be equal to n—r. This proves part (a).
4.7 Estimation of error variance and canonical decompositions 115 To prove part (b), let e = Bz. Equating the dispersions of these two vectors, we have I — Px= BB'. Therefore p{B) = p(BB') = n—r, that is, B has full column rank. It follows that P = I. Hence, z'z = z'PB,z = z'B'{BB')Bz = e'{I-Px)-e = e'{I-Px)e = e'e.
Proposition 4.7.5 ensures that the sum of squared values of LZFs contained any standardized basis set is equal to e'e, and the number of summands is always the same (n—r). Definition 4.7.6 If z is any vector whose elements constitute a standardized basis set of LZFs, then the error sum of squares is defined as
R20 = z ' z . Alternative names for the 'error sum of squares' are sum of squares due to error (SSE) and residual sum of squares. There are two interpretations of the phrase 'residual sum of squares'. It is clear from Proposition 4.7.5 that RQ = e'e, which is the sum of squared values of the residuals. Further, recall from Section 4.2 that an LSE of (3 minimizes the 'sum of squares', \\y — X/3|| 2 . According to the Gauss-Markov Theorem, X/3LS = Pxy. Then the minimum value of this 'sum of squares' is \\y-XpLSf = y'(I-Px)y = e'e. Thus, RQ is also the residual value of the sum of squares after it has been minimized as much as possible with respect to /3. It turns out that the above definition of error sum of squares holds for the general linear model considered in Chapter 7, but it cannot be interpreted as the sum of squared residuals in that context. Hence, we prefer to use the phrase 'error sum of squares' instead of 'residual sum of squares'. Remark 4.7.7 If z is any vector whose elements constitute a generating set of LZFs, then (see Exercise 4.23) R20 = z'[a-2D(z)]-z,
p(D(z))=n-r.
If we put z = e in the above expressions, then D(z) simplifies to cr2(I — Px) and z'[a~2D(z)]~z simplifies to e'e, as expected.
116 4.7.2
Chapter 4 : Estimation in the Linear Model A natural estimator of error variance
It follows from Proposition 4.7.5 that a natural unbiased estimator of a2 is the sample average of the squared LZFs of a standardized basis set z. Thus we use
3_-J_,.,_-!^L_JS-, n —r
n—r
(4.7.!,
n—r
where e is the vector of residuals and r = p(X). The MLE of a2 in the normal case is e'e/n (see Section 4.4). In contrast to the natural unbiased estimator a2, the MLE is a biased estimator. Remark 4.7.8 Although a2 is not the MLE in the normal case, it has another very important property: it is the UMVUE of a2. To see this, note that the exponent of the density function of y (given X) can be written as
-~[\\X0-X0\\2
+ Rl]-
By factorization theorem (Proposition 3.5.6), X/3 and i?2, are jointly sufficient for X(3 and a2. It can be be shown along the lines of Example 3.5.8 that these are complete (see Exercise 4.19). Since Xf3 and cr2 are both unbiased, and are complete and sufficient for Xf3 and CT2, Proposition 3.6.3 implies that these are UMVUE of these parameters.^ Other optimal properties of a2 will be discussed in Chapter 8 (see Section 8.2.3 and Exercise 8.11). Example 4.7.9 (World population data, continued) For the world population data of Table 1.2, if we assume the linear model of Example 4.2.2 for the mid-year population (y) for a given year (x), then natural unbiased estimator of the error variance is 6.727 x 10~5, while the MLE (under the additional assumption of normal distribution) is 6.055 x 10"5.
4.7 Estimation of error variance and canonical decompositions 117 4.7.3
A decomposition of the sum of squares*
In view of the decomposition y = y + e, the total sum of squares of the observed vector y may be decomposed as
\\y\\2 = Ml2 + M2.
(4.7-2)
The second term in the right hand side is the error sum of squares, JRQ, which can be further decomposed as the sum of squares of n — r uncorrelated LZFs each with variance a2. This fact raises the question: Is it possible to write ||y||2 as a similar sum? In order to answer this question, we have to define generating and basis sets of BLUEs. Definition 4.7.10 A set of BLUEs is said to be a generating set if the BLUE of any estimable LPF can be written as a linear combination of the members of this set. d Definition 4.7.11 A generating set of BLUEs is called a basis set if no linear combination of the members of this set is identically zero (for all possible values of 0) with probability 1. Definition 4.7.12 A basis set of BLUEs is said to be a standardized basis set if all its members are uncorrelated and have variance equal to a2 or zero. The reason why we allow the members of a standardized basis set of BLUEs to have zero variance will be clear when these definitions are used in the case of general linear models of the form (y, X/3, o2V) with V possibly singular (see Sections 11.1.2 and 11.1.3). In the present context every BLUE has a positive variance. See Exercise 4.22 for a transformation of y which simultaneously gives a standardized basis set of BLUEs and a standardized basis set of LZFs. Proposition 4.7.13 If z is any vector of BLUEs whose elements constitute a standardized basis set, then (a) z has r elements, where r = p{X); (b) z'z = y'y.
118
Chapter 4 : Estimation in the Linear Model
Proof. Suppose that z has m elements, l[y,..., l'my. Further, let L = (li---lm).
If VarHly)
= 0 for any i (1 < i < m), Ifa must
be zero, which is not possible. Therefore, each l\y must have variance a2. Then D(z) = a2L'L — a2l, that is, the columns of L are orthogonal. Proposition 4.3.2 implies that Cov(L'y, (I — Px)y) = 0, that is, C(L) C C{X). Therefore m = p(L) < p{X) = r. If m < r, we can find a vector I in C(X) such that L'l = 0. Then I'y would be a BLUE uncorrelated with z, which is a contradiction. This proves part (a). To prove part (b), let y = Bz. Equating the dispersions of these two, we have Px = BB'. Therefore p(B) = p(BB') = r, that is, B has full column rank. It follows that P , — I. Hence, z'z - z'PB,z = z'B'{BB')-Bz = y'{PxTy = y'Pxy = y'y. Proposition 4.7.13 adds further significance to the decomposition given in (4.7.2). The left hand side consists of the sum of squares of n uncorrelated observations, each having variance a2. On the right hand side, IIJ/H2 and ||e|| 2 account for r BLUEs and n—r LZFs. These are all uncorrelated and each have variance a2. The number of summands in each of the three terms are called the degrees of freedom. The number of uncorrelated observations appearing in the left hand side, n, is the total degrees of freedom. The number of uncorrelated BLUEs in ||y||2, r, is the number of degrees of freedom used for the estimation of the parameters. The number of uncorrelated LZFs in ||e|| 2 , n—r, is referred to as the error degrees of freedom. These 'degrees of freedom' assume greater significance in the case of normally distributed errors, when they characterize the chi-square square distributions of ||y||2/cr2, \\y \\2/o~2 and ||e||2/cr2, respectively. The generating and basis sets of BLUEs and LZFs have roots in some fundamental concepts of linear inference. These are discussed in Chapter 11. 4.8
Reparametrization
Consider the model M.\ = (y, X/3, cr2l), and suppose that Z is any matrix such that C(Z) = C(X). Then there are matrices T\ and T2
4.8 Reparametrization
119
such that X = ZT\ and Z = XT 2- If we define a new vector parameter 6 as Ti/3, then the model M.2 = (y, Z0,a2l) is equivalent to M.\. We call the transformation from M.\ to .M2 a reparametrization. Sometimes a reparametrization is done in order to remove a redundancy in the description of the model. In such a case, Z is chosen so that it has r linearly independent columns, where r — p(X). As a result, each of the r elements of 0 turn out to be estimable (see Exercise 4.28). This is why p(X) is often referred to as the effective number of parameters in the model (y,X/3,a2I). Example 4.8.1 (Two-way classified data, continued) Consider the model of Example 4.1.8. Since p{X) = 3, it should be possible to find a matrix Z with three columns such that C(Z) = C(X). We construct such a matrix by combining the first column, the difference between the second and third columns and the difference between the fourth and fifth columns of X. Thus we have ' llOxl v _
llOxl
llOxl ^
llOxl
—llOxl
llOxl
llOxl
llOxl
—llOxl
\ llOxl
—llOxl
—llOxl /
It is easy to verify that C(Z) = C(X), and that Z has full column rank. As X and Z have the same column space, there are matrices T\ and T2 such that X = ZT\ and Z = XT2- Specifically, the following choice of T\ and T 2 would suffice: /I / T
i =
1 i
° \o x
2
1 * 2
2
i\ 2
(l
\
1 -5 ° ° . 0 0 1 -i / 22/
° 0
T2
=
°\ 1
0
0 - 1 0 0 0 1 V0 0 -1 y
The model (t/,X/9,a 2 /) is equivalent to the reparametrized model (y, Z0, a21). The parameter 0 can be expressed in terms of /3 as 0 = T\j3. Specifically, dx = n + (/3i + p2 + n + r 2 )/2, 62 = ( ^ - £ 2 )/2 and 03 = (TI — T2)/2. In the reparametrized model, the entire vector parameter
120
Chapter 4 : Estimation in the Linear Model
0 is estimable. This may be contrasted by the original model, where /3 is not estimable. Note that we could have chosen another Z matrix, which would have led to another reparametrization. For the given Z matrix, the above choice of T\ is unique, but the choice of T2 is not unique. The reader may look for other possible choices of T2 which satisfy the relation
z = XT2.
a
A more general form of reparametrization occurs when /3 is transformed to 0 = T\f3 + #0) where #0 is a fixed vector. If Z is a matrix such that C{Z) = C{X) and T\ and T 2 are such that X = ZTi and Z — XT2, we can rewrite the model equation of M\ as y = ZTi/3 + e, that is, y + Z60 = Z{Ti/3 + dQ) + e = Z6 + e. The model M3 = {y + ZOQ, ZO, a21) can be called a reparametrization of M\. The parameters of the two models are related as 6 = T\fi + 60 and /3 = T2(9 — 60). It is easy to see that I'y is an LZF in M.\ if and only ifl'(y — ZOQ) is an LZF in M3. Likewise, Ly is a BLUE in M\ if and only if L(y + Z6Q) is a BLUE in .M3. In particular, the BLUEs of Xj3 and Z6 in the respective models are related to one another by the equation
xp = z~o + ze0. Reparametrization does not alter RQ, the degrees of freedom or a2.
4.9
Linear restrictions
Often the parameter (3 in the linear model (y, X(3, ex21) is subjected to a linear restriction (constraint) of the form A/3 = £. This restriction may be (a) a fact known from theoretical or experimental considerations, (b) an hypothesis that may have to be tested or (c) an artificially imposed condition to reduce or eliminate redundancy in the description of the model. Let us assume that the restriction is algebraically consistent, that is, £ £ C(A). How does the restriction affect estimation? Example 4.9.1 (Two-way classified data, continued) Consider the model of Example 4.1.8 where we want to take into account the linear
4.9 Linear restrictions
121
restrictions /3i + #2 = 0 and T\ + T
fVl+V3\ \V2+Vi' where yx, y2, y3 and y4 are as in Example 4.3.8. This is different from the vector of fitted values obtained from the original model (see Example 4.5.3). In the above two examples we were able to find an unrestricted model that is equivalent to the original model with the restriction. However, the choice was specifically for the model and the restriction at hand. Given the model (y, Xj3, a21) subject to the general restriction A/3 = £, can we find an equivalent unrestricted model? A general linear restriction of the form A/3 = £ can be taken into account by treating £ as an observation of A/3 with zero error. Therefore, the model (y,X/3,a2I) with the restriction A/3 = £ is equivalent
122
Chapter 4 : Estimation in the Linear Model
to the unrestricted model (y,, X*/3,cr 2 V), where
-(?)
*-£)
-(-)
The dispersion matrix a2 V is singular. Singular dispersion matrices are dealt with in Chapter 7, and we shall find other ways of handling linear restrictions before we get there. However, this formulation provides us with an important insight. An LPF p'(S is estimable in the restricted model if and only if p e C((X,)'), that is, p G C(X' : A').c On the other hand, p'fi is estimable in the unrestricted model if and only if p E C(X'). Thus, every estimable function in the unrestricted model is estimable in the restricted model, but the converse is not necessarily true. If restrictions can expand the set of estimable functions, what are the functions that become estimable because of the restrictions? In the case of Example 4.9.1, the additional rows of the matrix X* are (0 : 1 : 1 : 0 : 0)' and (0 : 0 : 0 : 1 : 1)'. In other words, ft + /32 and T\ + T2 become estimable after taking the restrictions into account, as these two parameters are 'observed' to be 0. Consequently all the LPFs become estimable. We have seen in Example 4.9.1 that these restrictions only amount to a reparametrization of the original model. In general, a restricted model is equivalent to a reparametrized model when p(X*) — p{X) + p(A) (Exercise 4.30). We shall refer to such restrictions as model-preserving constraints. Sometimes the very purpose of imposing the restriction is to make the matrix X* full column rank, so that all the parameters become estimable. As another special case, we may have p(X») = p(X), in which case the set of estimable functions is the same with or without the restriction. This happens when Aft is itself estimable in the unrestricted model. (For instance, the LPF T\ — r^ of Example 4.9.2 is estimable.) Estimation under such restrictions is often required for the purpose of conducting tests of hypotheses. c This
is a consequence of Proposition 4.1.10. We shall see in Section 7.2.2 that this proposition continues to hold even when the dispersion matrix is singular.
4.9 Linear restrictions
123
In general, the rank of X* may be somewhere in-between p(X) and the number of columns of X, although the extreme cases are more common. Dasgupta and Das Gupta (2000) show how any algebraically consistent restriction can be decomposed into a model preserving constraint and a restriction involving an estimable function. We shall obtain a similar decomposition in Section 5.3.1. The model (y,X(B,a2I) with the restriction A(3 = £ can also be shown to be equivalent to another unrestricted model with dispersion matrix a2l. We now derive such a model. When we estimate a vector parameter in the usual linear model, it is implicitly assumed that j3 can be anywhere in IRk. The restriction effectively confines f3 to a subset of IRk. Specifically, it follows from Proposition 2.7.1 that the vector /3 satisfies the algebraically consistent restriction A/3 — £ if and only if it is of the form A~£ + ( I — A~ A)0, where A~ is a g-inverse of A and 9 is an arbitrary vector. We can ensure that j3 satisfies the restriction by using the above form of j3 explicitly in the model equation: y = Xp+e = XA-£+X(I-A-A)0+e,
E(e) = O, D(e) = o2l.
Since XA~£ is completely known from the restriction, we can transfer it to the left hand side and write the model equation as y-XA-£
= X{I-A-A)0
+ e,
E{e) = O, D(e) = a2I.
Thus, the 'restricted model' is equivalent to the unrestricted model (y - XA~£,X(I — A~A)0, a2l), which is under the purview of the discussions of the foregoing sections. A geometric perspective of the BLUE of -Y/3 under the restriction A/3 = £ is given in Section 11.5.2. Example 4.9.1 (continued) The restriction may be written as A/3 = £ where .
A =
/0
1 1 0 0\
U oo i ij
,
and
t
/0\
*= UJ-
124
Chapter 4 : Estimation in the Linear Model
We can choose A~ — \A\ so that the equivalent unrestricted model is {y,ZO,cr2I), where /I 0
Z = X(I-A~A) = X
0 0 i -i
° ~5 0
0
1
0 0
0 0
0\ 0
°
°
0 5 -5 0 -i i 2 2/
V ^ llOxl
2^10xl
2^10xl ^
llOxl
—2^10x1
S^10*1
llOxl
2^10xl
~2^10xl
llOxl
—2"ll0xl
—jllOxl
,
This form of Z is very similar to that obtained in Example 4.8.1. The model obtained here is obviously a reparametrization of the model of Example 4.8.1, obtained by replacing the parameters /3 and r by 2/9 and 2T, respectively, and by adjusting the Z matrix for this change. d Example 4.9.2 (continued) The restriction is of the form A(3 — £, where A = (0 : 0 : 0 : 1 : - 1 ) and £ = 0. Let us choose A~ = (0 : 0 : 0 : 1 : 0)', and obtain the equivalent model (y, Z0,a2l), where
Z = X{I-A~A)
_
/I 0 = X 0 0 \0
0 1 0 0 0
0 0 1 0 0
0 0 0 0 0
0\ 0 0 1 1/
/llOxl
llOxl
OlOxl
Oioxl
ll0xl\
llOxl
Oioxl
llOxl
Oioxl
llOxl
I llOxl
llOxl
Oioxl
Oioxl
llOxl
VliOxl
Oioxl
llOxl
Oioxl
llOxl/
In this example, the fourth element of 6 is irrelevant, and the fifth element always appears together with the first element. An appropriate
4.9 Linear restrictions
125
move would be to reparametrize by dropping the last two columns of Z and the last two elements of 0. The resulting model is precisely the one which we had obtained earlier. The 'equivalent' model can not be chosen uniquely, because it depends on the choice of the g-inverse of A. It is important to use the same g-inverse in all the above computations. No matter which g-inverse is used, the BLUEs of all LPFs that are estimable under the original model, their dispersions and the error sum of squares do not depend on the choice of A~ in the definition of the equivalent model. A possible choice of A~ is A'(AA')~, which leads to the well-defined equivalent model (y - XA'(AA')-$,X(I - PA, )0, o2l). The vector of fitted values under this model is Px( Jy — XA'(AA')~£). Therefore, the BLUE of X/3 under the restriction Af3 = £ is X~prest = XA'iAA'Tt
+ PX(i-pA,)(y - XA'{AA')-i).
(4.9.1)
We conclude this section with a formal comparison of the restricted and unrestricted models. Proposition 4.9.3 Let A/3 = £ be a consistent restriction on the model {y,X/3, a21). (a) All estimable LPFs of the unrestricted model are estimable under the restricted model. (b) All LZFs of the unrestricted model are LZFs under the restricted model. (c) The restriction can only reduce the dispersion of the BLUE of X0. (d) The restriction can only increase the error sum of squares. Proof. Part (a) has already been proved via the 'equivalent' model that uses the restriction as additional observations with zero error. We shall use the equivalent model (y — XA'(AA')~£, X(I — P,)0, a21) in order to prove the other parts. For convenience we shall refer to the latter model as M.r and the original model as M..
126
Chapter 4 : Estimation in the Linear Model Suppose that I'y is an LZF in M, so that X'l = 0. Consequently,
I'y = l'(y- XA'(AA')-£)
and (I - PA, )X'l = 0. Hence, I'y is an LZF
of Mr. This proves part (b). It follows from (4.9.1) that the dispersion of the BLUE of X/3 under the restriction is CT2P' . I n contrast, the dispersion of the BLUE ji.(i—rAi)
of X/3 under M is o2Px. Since C(X(I - PA,)) C C(X), the result of part (c) follows from Proposition 2.6.3(b). Part (d) follows from the fact that the set of LZFs under Mr can only be larger than that under M., which follows from part (b). In the case of model-preserving constraints, the error sum of squares is not affected by the restriction. For other restrictions, the amount of increase is an indication of the validity of the restrictions. We shall examine this quantity in some detail in the next chapter. 4.10
Nuisance p a r a m e t e r s
Often one is interested in a subset of the parameters, or linear functions of them. For instance, in Example 4.1.8, we assumed that the parameter of interest was the difference between the treatment coefficients, T\—T2- In general, even if the primary interest of analysis is in one set of variable, an additional set of variables may be needed to make the model more realistic. Thus we can write Xfi as X\/31 + X2^2^ where the interest is only on the linear functions of /3X. The vector /32 contains the other parameters in which we have no interest. These parameters are called nuisance parameters. They are included in the model usually because the model may be inadequate without them. In Example 4.1.8, /x, f3\ and /?2 can be viewed as nuisance parameters. However, not all linear functions of the parameters of interest {T\ and T2) are estimable. A precise description of the estimable LPFs in such situations is given below. Proposition 4.10.1
In the linear model {y,X\fil
LPF p'/31 is estimable if and only if p e C(X[(I
Proof. Suppose that i'y is an LUE of p'fli-
+ X2/3 2 ,cr 2 /), the - Px
)).
Then we must have
4.10 Nuisance parameters
127
l'Xi/31 + l'X2/32 = p'f3l for all f3l and (32. Therefore X[l = p and X'2l = 0. It follows that I e C(X2) = C(I - Px), that is, I is of the form(J — Px )m for some vector m. Consequently p — X'l(I — Px )m, that is, p e C(X[(I -PX2))On the other hand, if p 6 C(X[(I — Px )), then there is a vector m such that p = X[(I - PX2)m. Then m'{I - PX2)y is an LUE of p'/3v 2 2 Example 4.10.2 (Two-way classified data, continued) Suppose that ft = (n : T2)' and 0 2 = (/i : ft : /32)'. Then /llOxl - _
Oioxl\
llOxl
OlOxl
OlOxl
llOxl
\OlOxl
Y
/llOxl
llOxl
—
l
OlOxl
llOxl
llOxl
llOxl
Oioxl
MlOxl
Oioxl
llOxl/
'
llOxl/
OioxlX
It follows that
(
-1/
'
-1/
1
V A 10xl
10xl
1' 110xl
1
10xl
1/
~110xl
\
I
V
V
-"-lOxl
- l 10xl /
'
so that C(X[(I — P )) is the space spanned by the vector (1 : —1)'. For p'0i to be estimable, p must be proportional to (1 : —1)'. Thus, only T\ — T2 (or a multiple of it) is estimable. Once the estimable LPFs are identified, their estimation may proceed in the usual manner. One may then ask if it is possible to construct an equivalent model that only involves fix, by somehow eliminating the effect of /32? It is shown in Section 7.10 that an equivalent model is [(I-PX2)y, (I-PXi)Xif3l,a2(I-PxJ). Since the dispersion matrix for this model is singular, we shall return to its analysis only after developing the requisite theory in Chapter 7 (see Proposition 7.10.1 and Remark 7.10.2).
128
Chapter 4 : Estimation in the Linear Model
4.11
Information matrix and Cramer-Rao bound
4.11.1
The case of normal
distribution*
Assuming that y ~ N(Xf3,a2I), let us compute the information matrix for /3 and a2. Here, 0 = (0' : a2)' and fe{y) is the density of N{X/3,a2I). We have diogMy)
i Yi (
dlog My)
n
Ya,
,
l
n
V/oM2
d2 log My)
_
i Y,Y
d2 log My)
_
d(3d{3'
-
a^XX'
d/3do2
~
d(a2)2
~ 2a*~^lly~X/3"
1
YR,
Y>(
~2^X{y-X/3>>
Consequently / ( )
~
d 2 log/ e (y)
g2iog/e(y) V 9/39a 2
d2 log My) \
a 2 log M y ) " a(a2)2 /
, i
T
o
- ^ J ' 2CT
If A/9 is nonestimable, then it does not have any unbiased estimator (see Exercise 4.3). If A/3 is an estimable LPF, then according to Proposition 3.6.6, the Cramer-Rao lower bound for the dispersion of an unbiased estimator of A/3 is a2 A(X'X)~ A'. The bound is achieved by the BLUE of A/3. Since the 1(6) is block diagonal, the information for /3 and a2 are decoupled from one another in the sense that the Cramer-Rao bound corresponding to functions of /3 does not depend on the block of the information matrix corresponding to a2, and vice-versa. We refer to the two blocks as information of f3 andCT2,respectively. It is also possible to define the information for the particular LPF p'/3; the expression turns out to be <j-2(p'p)-2p'X'(I - Pvn „ ,)Xp (see Exercise 4.37). If X{ is the ith. row of X, then the information matrix of /3 can be written as YA=I xi&\- Since Xix\ is a nonnegative definite matrix,
4.11 Information matrix and Cramer-Rao bound
129
an additional observation can only increase the information of /3 in the sense of the Lowner order. For fixed n, the information matrix can be larger for some values of the X-matrix than for other values. If one has the option of controlling the X-matrix (perhaps under some constraints), then it is desirable — for the purpose of inference — to arrange as large an information matrix as possible. Typically maximization of an increasing function of the eigenvalues of X'X (such as the trace, the determinant or the largest eigenvalue) is chosen as the design criterion. See Chapter 6 for more discussion on this subject. Example 4.11.1 (Weighing design — Spring balance) A spring balance is used to weigh three objects of unknown weights 0i, /?2 and /?3. Each object is weighted twice and the average weight is used as the estimate of the weight of that object. The model is (y, Xfl, a21) with
P = (ft : /32 : &)' and / I 0 0\ 1 0 0 _ 0 1 0
v
x~
o i o 0
0
1
\o o 1/ The elements of the matrix X can only assume binary values. This a characteristic of weighing designs with spring balance. In this case, the information matrix for f3 is 2a~2I^x^. Now consider an alternative plan where every pair of objects is weighed twice. For this design
v A
/I 1 _ 0 ~ o 1 \1
1 1 1 i 0 0
0\ 0 1 i ' 1 1/
130
Chapter 4 : Estimation in the Linear Model
and the information matrix for ft is
a" 2
/4 2 2\ 2 4 2 =2<j-2I +
\2
2a-2ll'.
2 4/
Clearly, this matrix is larger (in the sense of the Lowner order) than the information matrix in the case of the first design. According to the result of Exercise 3.13, the BLUEs of/3i, 02 and ^3 have smaller variance in the case of the second design (cr2/2 in the case of the first design, 3<72/8 for the second), even though the same number of measurements are used. It can be shown that the second design is not only better than the first, it is the unique design with three weights and six measurements that maximizes the determinant of the information matrix. It follows from the discussion of Section 4.11.2 that the optimality of this design holds even when the error distribution is not normal. In order to examine the effect of nuisance parameters, let ft and X be partitioned as in Section 4.10, and ft2 be the nuisance parameter. There is a corresponding partition of the information matrix of ft. The diagonal block corresponding to ftl is cr~2X\X\, which is the same as what would have been the 'information of fti if /32 were not there. However, the corresponding block of a g-inverse of the information matrix very much depends on the other blocks. Thus, the Cramer-Rao lower bound is affected by nuisance parameters. Let t'(I - P^XiP-t be an estimable LPF (see Proposition 4.10.1). Then the Cramer-Rao lower bound for the variance of an unbiased estimator of this LPF is a2?(I - PXi)X{X''X)~X'{I - Px)t, which simplifies to
(Two-way classified data, continued) Consider the
4.11 Information matrix and Cramer-Rao bound
131
model of Example 4.1.8. Here, the information for /3 is /40 20 20 20 20 \ 20 20 0 10 10 G~2X'X
= o~2
20
20 \20
0
20
10
10
.
10 10 20 0 10 10 0 2 0 /
There is no apparent segregation of information. However, the CramerRao lower bound for the variance of an unbiased estimator of T\ — T2 is CT2/10, which is the same as what it would have been in the absence of the nuisance parameters /u, f5\ and fa. It may be recalled that the bound is achieved by the BLUE. A reparametrization of the kind described in Example 4.8.1 would have brought out clearly the segregation of information for main and nuisance parameters (see Exercise 4.33). 4.11.2
The symmetric non-normal case*
Let y = X/3 + au, where the n elements of the random vector u are independent and identically distributed with mean 0, variance 1 and density h(-) satisfying h(—u) = h(u) for all u. Suppose further that all the partial derivatives and expected values used in the following derivation exist and the interchanges of derivatives and integrals are permissible. Then Iog/*(V) = — l o g ^ + ^ l o g / i ( ^ £ ) where y = {y\
yn)' and X = (x\ :
dlogMv)
9/3
_ l ^
~
o{r[Xl'
,
: xn)'. We have
(dlog h(u)\
\
du
)
u_Vi-
a
dlogfe(y) da2
__n 2a2
1 " / 2a2 ^[V
d\ogh(u)\ du )
Vi - x'jP ' a
132
Chapter 4 : Estimation in the Linear Model
Each summand in the right hand side of the first equation has zero expectation. Therefore,
\(dlogfe(y)\ fd\ogfe(y)\'] I f
, r (dlogh(u)\2
o" ~1
J-oo V
du
lufyl-x'l(3\
)
_ V%
XjP a
\
J
a
)
where X»
= -^.\ o2 J-oo V
1 - ^ h{u)du. du )
Further, P
\(dlogfe(y)\ 2cr3 t-' 1
fdlogfe(y)\] y_oo \
1
du
)
_V% ~xiP
a
\
a
)
Ur
a
as the integrand in the last expression is an odd function of u. Finally,
E[{~d^—)\
4.11 Information matrix and Cramer-Rao bound
133
/ d\ogh(u)\ \
ou
}
_ Vj
xjP
(Jb
o
n2
n2 f°° dh(u)
n2-nf[°° n2
dh{u) \2
n2 . n ,
n2-n.
n f°° ( d\ogh(u)\2, .2
n f°° (
. ,,
dlogh(u)\2L,
where I \r
( d\ogh{u)\2
4CT4 [y_oo \
ou
u/
J
1 J
Therefore, the information matrix for 0 = (/3' : cr2)' is
2 W
(l.X'X V 0
0 \ nla2j-
The information for /3 is I^X'X. Therefore, the design issues can be addressed with reference to the matrix X'X — just as in the normal case. The scalar I M is equal to a"2 when the components of y are normally distributed; otherwise it is greater than a"2 (see Exercise 3.15). Thus, the Cramer-Rao lower bound for the dispersion of an unbiased estimator of an estimable LPF is smaller in the non-normal case than in the normal case. We have already seen that the dispersion of the BLUE is equal to the 'normal' Cramer-Rao lower bound, irrespective of the error distribution. As the Cramer-Rao bound in the non-normal case is strictly smaller than the dispersion of the BLUE, there is a potential of achieving lower dispersion than that of the BLUE, by employing a nonlinear estimator, such as the MLE.
134
4.12
Chapter 4 : Estimation in the Linear Model
Collinearity in the linear model*
We have seen in Example 4.1.13 that redundancy in the description of the parameters may render some LPFs non-estimable. In such a case, the matrix X does not have full column rank, and there are one or more columns of X which are linear combinations of the other columns. In practice it is often found that some columns of X are almost equal to linear functions of the other columns. This may happen, for instance, when there is an exact linear relationship among the corresponding explanatory variables, but the variables are measured with some error. An approximate relation among the columns of X may also occur for no apparent reason. This phenomenon is known as multicollinearity or simply collinearity. Even if the matrix X has full column rank (so that all LPFs are estimable), the presence of collinearity can make it nearly rank-deficient in the following sense: a small alteration in the elements of X would turn the approximate relation among the columns into an exact one, and the perturbed matrix would have smaller rank. The presence of collinearity can make the variance of certain BLUEs very large compared to the model error variance, a2. Example 4.12.1 Let X = (#(i) : #(2))> where x^2) — 3Jm + otv, a being a small number and v being a vector such that v'v = x^'x^ and v'xn) = 0. It follows that
vuHp'm - .y(x'x)- P - ^ ^ p ' (1+_f - 1 ) P. In particular, if /3 = (fix : #2)', then by choosing p = (1 : —1)' we have
which can be very large if a is small. If a —> 0, then the variance explodes. Of course, $\ — 02 is no longer estimable when a = 0. On the other hand,
Var01+p2) = o2- J—,
4.12 Collinearity in the linear model*
135
which does not depend on a. Thus, the variance of /S1+/S2 is n °t affected by collinearity. Note that /3i + $2 remains estimable even if a = 0. D The above example shows that the variance of certain BLUEs may be very high because of collinearity, while the variance of some other BLUEs may not be affected. It also shows that non-estimability of parameters is an extreme form of collinearity. Let us try to appreciate the above points in a general set-up. The presence of collinearity implies that there is a vector v of unit norm so that the linear combination of the columns of X given by Xv, is very close to the zero vector. In such a case we have a small value of v'X'Xv, which can be interpreted as dearth of information in the direction of v (see the expression of information matrix given in page 133). We may informally refer to such a unit vector as a direction of collinearity. The unit vector v can be written as k 1=1
where Y,i-i Kviv'i IS a spectral decomposition of X'X, the eigenvalues being in the decreasing order. It follows that k
k
k
\\Xv\\2 = v'X'X Y,(v'vi)vi = v' ^2(v'vl)Xlvl = J2 hiv'vtf. i=\
i=l
i=l
The above is the smallest when v = Vk- Therefore, a very small value of Afc signifies the presence of collinearity. When A& is very small, v^ is a direction of collinearity. When X'X has several small eigenvalues, the corresponding eigenvectors are directions of collinearity. All unit vectors which are linear combinations of these eigenvectors are also directions of collinearity. Let X have full column rank, so that all LPFs are estimable. Then
Var(p% = a2p'{X'X)-lp = a2 £ ^ 2 .
(4.12.1)
If p is proportional to Vk, then the variance of its BLUE is p'p/XkThe presence of collinearity would mean that A^ is small and therefore,
136
Chapter 4 : Estimation in the Linear Model
this variance is large. As A& -> 0, the variance goes to infinity. (When Ajt = 0, we have a rank-deficient X matrix with Vk £ C(X'), and so p'/3 is not estimable at all.) A similar argument can be given if p has a substantial component (p'vi) along an eigenvector (vi) corresponding to any small eigenvalue (Aj) of X'X. If p has zero component along all the eigenvectors corresponding to small eigenvalues, then the reciprocals of the smaller eigenvalues do not contribute to the right hand side of (4.12.1), and thus Var(p /3) is not very large. In summary, all 'estimable' LPFs are not estimable with equal precision. Some LPFs can be estimated with greater precision than others. When there is collinearity, there are some LPFs which are estimable but the corresponding BLUEs have very little precision (that is, these have a very high variance). Non-estimable LPFs can be viewed as extreme cases of LPFs which can be linearly estimated with less precision. When an experiment is designed, one has to choose the matrix X in a way that ensures that the LPFs of interest are estimable with sufficient precision. The directions of collinearity can also be interpreted as directions of data inadequacy. To see this, write ||Xw|| 2 as
\\Xv\\2 = v'X'Xv = J2(xiv)\ where x[, x'2 ..., x'n are the rows of X. The ith. component of y is an observed value of x\fi (with error). If ||Xu|| 2 is small, every (x[v)2 is small. If this is the case, none of the observations carry much information about v'(3. This explains why the BLUE of this LPF has a large variance. If (x^v)2 = 0 for all i, the observations do not carry any information about v'(3. In this extreme case v'(3 is not estimable. If one has a priori knowledge of an approximate relationship among certain columns of X, and confines estimation to linear combinations which are orthogonal to these, then collinearity would not have much effect on the inference. This is analogous to the fact, as seen in Example 4.1.8, that estimable functions can be estimated even if there is one or more exact linear relationship involving the columns of X. If collinearity arises because of a known linear constraint, its impact
4.13 Exercises
137
on the precision of affected BLUEs can be reduced easily by incorporating this constraint into the model. Proposition 4.9.3 assures us that the restriction would reduce the variances of the BLUEs. If the cause of collinearity is not so obvious, then the eigenvectors corresponding to the small eigenvalues of (X'X) point to the variables which appear to be linearly related. If f3 is estimable, then the extent of collinearity can be measured by the variance inflation factors,
VIF^a^VarfyWx^f,
j = l,2,...,k,
(4.12.2)
where x^ is the jth column of X. The factor a~2 ensures that VIFj depends only on the matrix X, while the factor \\x^ ||2 ensures that this measure is not altered by a change in scale of the corresponding variable (see Exercise 4.38). All the variance inflation factors are greater than or equal to 1, and a very large value indicates that the variance of the BLUE of the corresponding parameter is inflated due to collinearity. Alternate measures of collinearity can be found in Belsley (1991) and Sengupta and Bhimasankaram (1997, see also Exercise 4.40). Since collinearity tends to inflate variances (and hence the mean squared error) of certain BLUEs, one may seek to reduce the MSE by adopting a biased estimation strategy in the case of collinear data. Some of these alternative estimators are discussed in Sections 7.9.2 and 11.3. 4.13
Exercises
4.1 The linear model ( j / n x l , X/3, o2l) is said to be saturated if the error degrees of freedom (n — p(X)) is equal to zero. Show that in a saturated model, every linear unbiased estimator is the corresponding BLUE. 4.2 Show that all the components of f3 in the model (y, X(3, a2l) are estimable if and only if X has full column rank, and that in such a case, every LPF is estimable. 4.3 If there is no linear unbiased estimator of the LPF Af3 in the model (y,X/3,cr 2 /), show that there is no nonlinear unbiased estimator of .A/3.
138
Chapter 4 : Estimation in the Linear Model 4.4 Consider the model I/t = 0 1 + 0 2 + 0 3 + - " + 0 i + e«,
!<»<".
with uncorrelated errors having zero mean and variance a2. (a) Obtain expressions for the BLUEs of the parameters and examine the possibility of unbiased estimation of the error variance. (b) Repeat part (a) when it is known that ft = j3n-i+i, 1 < i < n. 4.5 Suppose that x\,..., Xk are the columns of the matrix X in the linear model (y, X/3, a21). Show that the coefficient of X\ is not estimable if and only if x\ has an exact linear relationship with the other columns of X. [This is a case of exact 'collinearity.'] 4.6 Show that a vector valued LPF Af3 is estimable if and only if
4.7 Show that the affine estimator i'y + c is unbiased for p'/3 if and only if X'l = p and c = 0. 4.8 Show that an LUE (I'y) of an estimable LPF in the linear model (y, X/3, a21) is its BLUE if and only if I e C{X). 4.9 Spring balance. Four items having weights /3i, fc, fiz and /?4 are weighed eight times, each item being weighed twice. The measurements follow the model /yi\ 2/2 2/3 2/4 y5 y6 y7
Vy8/
/I 0 0 = 0 ~ 1 0 0
0 1 0 0 0 1 0
0 0 1 0 0 0 1
0\ 0 0 1 0 0 0
\0 0 0 1/
/ft \ fa 03 \04/
/eA e2 e3 e4 e5 ' e6 e7
Ve8/
with uncorrelated errors having zero mean and variance a2. We denote this model by (y,X(3,a2I).
4.13 Exercises (a) (b) (c) (d) (e) (f)
139
Show that every LPF of the model is estimable. When is I'y the BLUE of its expectation? When is i'y an LZF? What are the BLUEs of /3j,j = l, 2, 3,4? What is the dispersion of the BLUE of /3? Find an unbiased estimator of a2.
4.10 Repeat Exercise 4.9 for the model where three out of four items are measured at a time and each combination is weighed twice. 4.11 Estimate the parameters of the model described in Exercise 1.3 for the world record running times data of Table 1.1. What are the standard errors of the estimates? 4.12 Repeat Exercise 4.11 for the model described in Exercise 1.4. 4.13 BLUE from calculus. Let p'/3 be an estimable LPF. Suppose that I'y is the candidate which must satisfy the unbiasedness condition (XI = p) and the minimum-variance condition (a2l'l should be as small as possible). Formulate this as an optimization problem with Lagrange multipliers, and show that the optimum i'y is p'(X'X)~X'y. 4.14 The data set of Table 4.1, taken from Brownlee (1965), shows the observations from 21 days' operation of a plant for the oxidation of ammonia, which is used for producing nitric acid. The response variable (y) is the stack loss defined as the percentage of the ingoing ammonia that escapes unabsorbed. The explanatory variables are air flow (x\), cooling water inlet temperature in °C (#2) and acid concentration in percentage (3:3). Using a homoscedastic linear regression model of y, obtain the BLUEs of the coefficients of the explanatory variables, along with their standard errors. 4.15 If l[y and l'2y are the BLUEs of the estimable LPFs p[/3 and p'2f3, respectively, prove that (lx + I2)'y is the BLUE of (px + P 2 )'/3. 4.16 A multivariate distribution is said to be spherically symmetric if its density is of the form f(x) oc g(x'x), where g(-) is a nonnegative and nonincreasing function defined over the positive half of the real line. Derive the MLE of /3 for the linear model
140
Chapter 4 : Estimation in the Linear Model Case
x\
X2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
80 80 75 62 62 62 62 62 58 58 58 58 58 58 50 50 50 50 50 56 70
27 27 25 24 22 23 24 24 23 18 18 17 18 19 18 18 19 19 20 20 20
£3
y
58.9 4.2 58.8 3.7 59.0 3.7 58.7 2.8 58.7 1.8 58.7 1.8 59.3 1.9 59.3 2.0 58.7 1.5 58.0 1.4 58.9 1.4 58.8 1.3 58.2 1.1 59.3 1.2 58.9 0.8 58.6 0.7 57.2 0.8 57.9 0.8 58.0 0.9 58.2 1.5 59.1 1.5
Table 4.1 Stack loss in ammonia oxidation plant (Source: Brownlee, 1965)
(y, X/3, (J2I), when the errors have a spherically symmetric distribution. Is the MLE unique? 4.17 The LSE of /3 in the linear model {y,Xf3,a2l), with the pdimensional vector parameter /3 estimable, can be written as (see Subrahmanyam, 1972) 3 _ Sr \XrXr\/3r where Xr is a p x p sub-matrix of X, yr is the corresponding sub-vector of y, 3 r ls the LSE from the sub-model (y r ,X r /3,
4.13 Exercises
4.18
4.19
4.20
4.21
4.22
141
a21), and the summations are over all such sub-models where /3 is estimable. Interpret this result and prove it for the special case of Example 4.2.1. Show that (X'X)~X' is a g-inverse of X, and X~X(X'X)~ is a g-inverse of X'X. Hence, show that the estimator defined in (4.5.3) is identical to /3LSIf X/3 is the BLUE of X/3 and RQ is the error sum of squares for the linear model (y, X/3, a21) with normally distributed errors, show that these two statistics are jointly sufficient and complete for X/3 and a2. If y in the model (y,X/3,o2I) has the multivariate normal distribution, show that the BLUE of any estimable LPF is its UMVUE and that any LZF is ancillary for it. Given the model (y,X/3,<72/), define a finite set A of LZFs with the following property: there is no LZF outside A which has non-zero variance and is uncorrelated with all the members of A. Show that A is a generating set of LZFs. Given the model (y,X(3,a2I) with p{X) = r, find a transformation matrix L such that the vector Ly has the following properties. (i) D(Ly) = o2l. (ii) The first r elements of Ly constitute a standardized basis set of the BLUEs of the given model. (iii) The last n—r elements of Ly constitute a standardized basis set of the LZFs of the given model.
4.23 Prove the statement of Remark 4.7.7. 4.24 If z is a vector of BLUEs whose elements constitute a generating set, prove that D{z) has rank p{X). 4.25 Consider the the linear model (y, X/3, o2l) having n observations and k components of the parameter /3. (a) Determine the average value of the n leverages. (b) If the vector 1 is included in C(X), show that the zth leverage can be written as 1/n plus a quantity that can be interpreted as a squared distance of the «th row of X from
142
Chapter 4 : Estimation in the Linear Model the average of all the rows of X. [Thus, observation with extreme values of explanatory variables have high leverage.] (c) If hi and e; are the ith leverage and ith residual, respectively, show that
0 < hi + e2/e'e < 1. [Thus, fitted values of observations with leverage close to 1 cannot be very different from the observed value of the response.] 4.26 Show that an observation in the linear model (y,X/3,cr 2 I) has leverage equal to 1 if and only if the corresponding row of the X matrix is not a linear combination of the other rows. 4.27 Consider the model (y, X0, a21) with normally distributed errors. (a) Compute the mean squared errors of a2 and a2ML- Which estimator has smaller MSE? (b) Find c such that CRQ is the estimator of a2 having the smallest possible mean squared error. Does the answer coincide with a2 or a2ML? (c) What is the MSE of CRQ when c has this optimum value. 4.28 Consider the reparametrization of the model M.\ = (y,X/3, a21) as M.2 = (y,Z8,a2I), where Z has full column rank. Express the BLUEs of 6 and Xj3 in terms of one another. 4.29 Consider the model of Example 4.1.8. Identify the BLUEs lost and the LZFs gained because of the restriction T\ = T2- What happens to the sets of BLUEs and LZFs when the restrictions Pi + fa = 0 and T\ + T2 = 0 are introduced? 4.30 lip{X' : A') = p{X')+p(A'), show that the restriction A/3 = £ amounts to a reparametrization of the model (y,X/3,a2I). 4.31 The Cobb-Douglas model for production function postulates that the production (q) is related to labour (/) and capital (c) via the equation q = a- lacP
u,
4.13 Exercises
143
where a, a and ft are unspecified constants and u is the (positive) model error. A log-transformation of both sides of the equation linearizes the model. Economists are sometimes interested in a condition called 'constant returns to scale' (see Poirier, 1995, p.484), which amounts to a+/3 = 1. Suppose that n independent observations of the three variables are available, and all the parameters are identifiable. (a) Derive an unrestricted linear model which is equivalent to the Cobb-Douglas model with the restriction a + /3 = 1. (b) Find an expression for the 'decrease' in the dispersion of /loga\ the BLUE of a because of the restriction a+/3 = 1.
\ P J 4.32 Consider the linear model for two-way classified data yij = fi + Pi + Tj + tij,
1 < i < 6, 1 < j < t,
where the errors e^ have zero mean and are uncorrelated. The parameters /3i,...,/3;, represent the effects of b blocks, while Ti,...,Tt represent t different treatment effects. Show that a linear function of the treatment parameters, Y?j=i cjTj i s es~ timable if and only if Y?j=i cj = 0- [Functions satisfying this condition are called treatment contrasts.] 4.33 Show that the information matrix for the parameters 6\, 62 and 83 in the reparametrized model of Example 4.8.1 is block diagonal where one block corresponds to 6\ and 62 and the other, to 63. Compare the Cramer-Rao bound for an unbiased estimator of 9s with the bound given in Example 4.11.2. 4.34 Compare the information matrices of/3 for the models of Exercises 4.9 and 4.10, assuming normally distributed errors. 4.35 Spring balance with bias. A spring balance with bias is one in which the mean measurement of any weight differs from the 'true' weight by a non-zero constant. Let this constant be 0o, and suppose that p objects with weights /3i,... fip (in various combinations) are measured in such a balance. There are n measurements. If the errors are uncorrelated and have mean
144
Chapter 4 : Estimation in the Linear Model
zero and variance a2, show that the variance of the BLUE of /% for i = 1,... ,p is at least 4cr2/n. Can all the variances be equal to 4cr2/n? 4.36 Chemical balance. When a chemical balance is used to weigh various combinations of objects having weights /?i,... ,/3p, the elements of the matrix X can be 0, 1 or —1. Suppose that there are exactly n weighing operations, and there is no bias. Show that the variance of the BLUE of /3; for i = 1 , . . . , p is at least a2In. Can all the variances be equal to a2jni What happens when there is a bias in the balance (that is, an intercept term)? 4.37 Let Tp be the information matrix of j3 in the model (y, X/3, a21), given by the top left block of 1(0) in page 133. Partition Is as (
u
y1 I where l u is a scalar, and let In 2 = 1\\ —
V^-21 i-22/
X12X^X2i- Let /3\ be the first component of/3. (a) When f5\ is estimable, show that the Cramer-Rao bound on the variance of an unbiased estimator of pi is l/Zn.2(b) When fix is nonestimable, show that I\\.2 = 0. (c) How do you interpret the results of parts (a) and (b)? (d) Can the results of parts (a) and (b) and the interpretation of part (c) be extended to a general scalar LPF? 4.38 Suppose that X = (a?m : : a?(fc)), D is a diagonal matrix with ||aj(j)|| in the jth. diagonal position and Xs = XD~l, that is, the columns of Xs have unit norm and are proportional to the columns of X. Let X have full column rank. (a) Show that VIFj is the jth. diagonal element of the matrix (X'sXs)~l. [Thus, it is insensitive to a change in scale of the corresponding variable.] (b) Show that VIFj > 1 for j = 1 , . . . , k. (c) If Ai > > Afc are the eigenvalues of X'SXS and vi,... ,Vk are the corresponding eigenvectors, show that k
VIFJ
2
= I2T-'
J = I,-..,*,
4.13 Exercises
145
Vij being the ith element of Vj, i, j = 1,..., k. 4.39 Variance proportions table. Let X, Xs, Ai,..., A& and v\,..., v^ be as in Exercise 4.38. Let 7TJJ = vfj/(\iVIFj), i,j = 1,... ,k. This ratio represents the fraction of VIFj which may be attributed to the ith eigenvalue, and is called the (i,j)th variance proportion. A matrix arrangement of the TTJJ'S leads one to the
variance proportions table (see Belsley et al., 1980). (a) Show that -K\J < 1/VIFj, that is, if VIFj is large then -K\J is small (other variance proportions can be large). (b) If Ai is very small and TT^ is very close to 1, does it mean that the variance of the BLUE of /3j is inflated due to the approximate relation XSV{ « 0 among the columns of X? 4.40 Condition number. Let X, Xs and Ai,..., A& be as in Exercise 4.38. The ratio K = (Ai/A^)1/2 is called the condition number of Xs, which is an index of collinearity for the model matrix X. Show that k
VIFJ
[Thus, K is large if and only if at least one of the variance inflation factors is large.] 4.41 For the world population data of Table 1.2, calculate the condition number K defined in Exercise 4.40 and the variance inflation factors. Explain why the variance inflation factors for the two parameters are the same. 4.42 Replace the year number in the world population data of Table 1.2 by the 'number of years since 1980'. (a) Calculate the least squares estimates of the parameters, and compare these with the estimates of Example 4.2.2. Explain the findings. (b) Calculate the dispersion of the vector of estimated parameters and compare it with the corresponding result of Example 4.6.1. Explain the findings.
146
Chapter 4 : Estimation in the Linear Model (c) Calculate a2 and compare it with the corresponding estimate of Example 4.7.9. Explain the findings. (d) Repeat Exercise 4.41 for the modified data, compare with the results of that Exercise, and explain the findings.
Chapter 5
Further Inference in the Linear Model
In Chapter 4 we considered point estimation of estimable LPFs in the linear model (y,X/3,o2I). In the present chapter we discuss confidence regions for such LPFs, tests of hypothesis and the model-based prediction. In Section 5.1 we derive the distributions of the usual estimators of j3 and a2 under the assumption that the errors are independent and standard normal. (The assumption of normality of errors can be relaxed somewhat in the case of large samples, see Section 11.6.) This preliminary result allows us to construct confidence regions (Section 5.2) and test linear hypotheses (Section 5.3). After deriving the general theory, we construct analysis of variance tables for some important testing problems. We address the problem of optimal prediction in Section 5.4. In Section 5.5 we discuss the problem of collinearity and examine its effect on confidence regions, prediction and tests of hypotheses. 5.1
Distribution of the estimators
Suppose that A/3 is a vector of estimable LPFs in the linear model (y,X/3,a2I), and p(X) = r. When y is normally distributed, the following basic result provides the distributions of the least squares estimator A/3 and the natural unbiased estimator a2 defined in (4.7.1). Proposition 5.1.1 If y ~ N(Xp,a2I), then (a) A(3 ~ N(A(S,a2A{X'X)-A'), (b) [n-r)72ja2 ~ x*_r, 147
148
Chapter 5 : Further Inference in the Linear Model (c) A/3 and a2 are independent.
Proof. Part (a) follows from the discussion in Section 4.6 and the fact that any linear combination of a multivariate normally distributed random vector has the multivariate normal distribution. Part (b) follows from the characterization of R% as the sum of squares of (n—r) uncorrelated LZFs, each with variance a2. The LZFs, being linear functions of y, themselves have a multivariate normal distribution. Further, the LZFs are independent and have zero mean. Therefore, RQ/U2 is the sum of squares of n—r independent, standard normal random variables, and has the chi-square distribution with (n—r) degrees of freedom, thus proving Part (b). If y has the multivariate normal distribution, then the BLUEs and LZFs of the model are not only uncorrelated but also independent. Part (c) follows from the fact that a2 is a function of the LZFs only. These results can be used to construct confidence regions of the unknown parameters, as described in the next section.
5.2 5.2.1
Confidence regions Confidence interval for a single LPF
Let p'/3 be an estimable parametric function and p'0 be its BLUE. Under the assumption of normality, it follows from Proposition 5.1.1 (a) that / " - ' " '
~AT(O.l).
Since this is independent of a2, we have from Proposition 5.1.1 (b) and Definition 3.2.3
^/a2p'(X'X)-p l(n-rW/a2 V n— r
=
fip-p'P ^p'(X'X)-p
t
5.2 Confidence regions
149
where £ n _ r represents the student's ^-distribution with n—r degrees of freedom. Denoting the (1 — a) quantile of this distribution by tn-r>a, we have 1
r^z
— bn—r,a
my/a2p>{X'X)-p
= P [p'P >P^~ tn.r:a^p'(X'X)-pj
= 1 - a.
This gives a 100(1 — a)% lower confidence limit for p'/3, or
[^3 - tn-riay/£p'{X'X)-p,
oo)
(5.2.1)
is a 100(1 — a)% (one-sided) confidence interval for p'/3. Similar arguments lead to the other one-sided confidence interval (or upper confidence limit) (-oo, p^0 + tn-r,ay/£p'{X'X)-p\
,
(5.2.2)
and the two-sided confidence interval
[^3 - tn_r,a/2^p>(x'x)-P, fip + i n _,, Q/2 \/^p'(x'x)-p]. (5.2.3) If /3j, the j t h component of j3, is estimable, then we can obtain oneor two-sided confidence intervals for (ij as above, by choosing p' = (0 : 0: :1: : 0), with 1 in the j t h place. In such a case, p'(X'X)~p is the j t h diagonal element of (X'X)~, which does not depend on the choice of the g-inverse (as we noted in page 111). Example 5.2.1 (Two-way classified data, continued) Consider once again the model of Example 4.1.8. It was shown in Example 4.6.2 that the BLUEs ?i — ?2 and /?i — #2 e a c n have variance a 2 /10. Suppose that we choose confidence coefficient .95, corresponding to a = .05. Since tn-r,.02b/Vl0 = i37,.o25/Vl0 = (2.026)/y/T6 = .6407, a two-sided 95% confidence interval for ?i — ?2 [?i -?2-
.6407CT,
fx - ? 2 + .64075].
150
Chapter 5 : Further Inference in the Linear Model
On the other hand, t37t.05/y/W = (1.687)/\/l0 = .5335. Hence, leftand right-sided 95% confidence intervals for f\ — T% are [fi - ?2 — .53355, oo] and [—oo, n — f2 + .53355], respectively. Confidence intervals for Pi—fa can be obtained similarly.D 5.2.2
Confidence region for a vector LPF
Construction of a confidence region for the vector LPF A/3 is a meaningful task only if A/3 is estimable. Note that if y ~ N(X/3, o2l) and A/3 is estimable, then A/3 ~ N(A/3,a2A(X'X)-A'). Therefore, from Exercise 3.2, (A^ - A/3)'[A(X'X)-A']-(AP
- A/3)
CT2
2
~ *m>
where m is the rank of A. Since the BLUEs are independent of the LZFs, the above quadratic form is independent of a2. Consequently, (A/3 - A/3)'[A{X'X)-A']-{A0
- A/3)/{ma2)
{n-p{X))72l(a2{n-r)) (A3 - A/3)'[A(X'X)~A']-(A3 - A0) ^5 tm,n-r, maz where Fm>n-.r represents the P-distribution with m and n—r degrees of freedom (see Definition 3.2.5). If F m ) n _ r > Q is the (1 — a) quantile of this distribution, we have P [(A/3 - Aft'[A(X'X)-A']~(Al3
- Aft < m^F m , n _ r , Q ] = 1 - a.
The resulting confidence region for A/3 is an m-dimensional ellipsoid given by {A/3 : (A/3 - Aft'[A{X'X)-A'Y{Al3
- Aft < m^F m , n _ r , Q ,
(A/3 - Aft e C{A{X'X)~A')) .
(5.2.4)
5.2 Confidence regions
151
Example 5.2.2 (Two-way classified data, continued) Consider the vector LPF (/?i—/?2 : T\—T
{(£): [a-(A-A)] 2 + [ 6 - ( n - ^ ) ] 2 < ^ ^ = .6504?|. The area of this circle is .6504TR72 = 2.043cr2. Suppose that f3 is estimable, that is, p(X) is equal to k, the number of columns of X. Then we can find a confidence region of j3 by replacing A, [A{X'X)'A'\and m in the expression (5.2.4) by I, X'X and k, respectively. Specifically, a confidence region of /3 with confidence coefficient 1 — a is thefc-dimensionalellipsoid
{/3:||X/3-X3|| 2 <^F M _ fc , Q }. If the assumption of normality of y is not tenable, the above confidence regions may be grossly inaccurate. Alternative confidence regions may be constructed using resampling techniques (see Sections 9.4-9.5 of Efron and Tibshirani, 1993 and Sections 7.2-7.3 of Shao and Tu, 1995). These methods usually involve considerable computation, but provide satisfactory results for moderate sample sizes. 5.2.3
Simultaneous confidence intervals*
If a[,a'2, ,a'g denote the rows of the matrix A, then the vector parameter A/3 represents the LPFs a\P,a'2P,. ,a'q/3, with m = p{A) < q. As an alternative to the ellipsoidal confidence region in the g-dimensional parameter space discussed in the earlier section, one may construct individual confidence intervals for each of the scalar LPFs, a[f3, a'20,..., a'q/3. Viewed in the (/-dimensional parameter space, these
152
Chapter 5 : Further Inference in the Linear Model
simultaneous confidence intervals represent a ^-dimensional rectangle. Such intervals may be easier to visualize and deal with, compared to the ellipsoidal regions, particularly when q is greater than 3. Since the interest is typically in all the LPFs simultaneously, it is meaningful to assign a single coverage probability to the combination of these q confidence intervals, rather than a separate confidence level to each interval. These are referred to as the simultaneous confidence intervals. The simultaneous confidence intervals constitute a rectangle in the parameter space, in contrast to the ellipsoidal region derived previously. Let us first examine single confidence intervals of the type described in Section 5.2.1. A two-sided interval for the scalar LPF a'/3 with confidence coefficient 1 — a is
where r = p(X). Let £j denote the event
£3 = K/3 e i^}. The coverage probability of Ij is P(£j), which is equal to 1 — aforevery j . The coverage probability of the simultaneous confidence intervals l[s\ ..., 4s} is P(£i n n £g). If aft, j = 1 , . . . , q were independent, then the coverage probability would have simplified to (1 — a)q. These are not, in general, independent (they involve a common a2 for one) and the exact probability becomes difficult to compute. Using the superscript c to denote the complement of a set or event, the Bonferroni inequality gives
P(£in---n£g) = I - P ^ U - . - U ^ ) > l-[P(£l) + ... + P{£C)) = P(£in---n£g)
i-ga,
< P(£i) = l - a .
The above inequalities can be summarized to produce the following bounds for the coverage probability of the simultaneous confidence intervals: l-qa<
P ( / ] s ) includes alfi for all j) < 1 - a.
5.2 Confidence regions
153
It is clear that the combination of the confidence intervals of the single intervals described above, does not have adequate coverage probability. It follows from the above discussion that if we replace a by a/q, then the resulting simultaneous confident intervals
if^lafi-tn-r^yj^iX'X)-^, a'jP+tn^z.fea'j(X'X)-aj], j = 1,2,..., q, would have coverage probability greater than equal to 1 — qot/q = 1 — ot. These are called Bonferroni confidence intervals.
In general, the Bonferroni confidence intervals are conservative in the sense that the actual coverage probability is often much more than the assured minimum coverage probability of (1 — a). In spite of this, the Bonferroni rectangle in the g-dimensional parameter space may not entirely include the ellipsoid discussed in Section 5.2.2. In other words, a vector A/3 may belong to the ellipsoidal confidence region of (5.2.4), and yet some of its components may lie outside the Bonferroni confidence intervals. Scheffe suggested a set of conservative confidence intervals which avoids this possibility. Scheffe confidence intervals can be described geometrically as the smallest ^-dimensional rectangle, with faces orthogonal to the parameter axes, which includes the ellipsoid (5.2.4). Thus, the faces of this rectangle are tangents to the ellipsoid. The algebraic description of Scheffe simultaneous confidence intervals is as follows (see Exercise 5.5).
ij*c) =
a'jP-JmFm^aa'jiX'Xyap, a'jP+yJmFm,n-r,aa.'j{X'X)-aja*
, j = 1 , 2 , . . . ,q. (5.2.5)
A set of simultaneous confidence intervals with exact coverage probability can be obtained if the components of A/3 are uncorrelated. In such a case m = q, and la'jP - a'jft
max -3=— l
154
Chapter 5 : Further Inference in the Linear Model
has the same distribution as £* = maxi<j< m \ZJ\/S, where s 2 ~ Xn-r, zi,..., zm ~ JV(0,1) and s,zi,...,zm are independent. The joint distribution of zi/s,..., zm/s is said to be multivariate t with parameters m and n — r, and the distribution of £* can be derived from it. Let tm,n-r,a be the 1 — a quantile of the distribution of £* (see Hahn and Hendrickson, 1971 for a table of these quantiles). Then we have the maximum modulus-t confidence intervals
a'fi + t^n-r^^a'^X'Xyaj
, j = l,2,...,m.
This confidence interval has coverage probability exactly equal to 1 — a, only when the of LPFs of interest have uncorrelated BLUEs. Sidak (1968) showed that these confidence intervals are conservative for a general class of correlation pattern of the BLUEs. Example 5.2.3 (Two-way classified data, continued) An ellipsoidal confidence region for the vector LPF (fix —f$2 T\ —T2)' in the model of Example 4.1.8 was obtained in Example 5.2.2. The half-width of the separate 95% confidence intervals of the two LPFs, as determined in Example 5.2.1, is .6407
tn_r ^J^a'AX'X)~aj ^mF^n^a'^X'X)-gp tm^-r^y/a^a'^X'X)-^
=
.73885.
=
.8065a.
= .73685.
The rectangular confidence regions corresponding to these confidence intervals are illustrated in Figure 5.1, along with that corresponding to the single confidence intervals derived in Example 5.2.1 and the
5.2 Confidence regions
155
Scheffe I
/
/
SinSle
/ '
\ \
"x
Bonferroni\ and Maximum \ modulus-^ Ny ' _ _
Ellipsoidal
N? -\
I
/
/
A
I
Figure 5.1 Elliptic and rectangular confidence regions for Example 5.2.3
ellipsoidal (circular) confidence region derived in Example 5.2.2. The confidence region corresponding to the single confidence intervals is the smallest, as its coverage probability is inadequate. The Scheffe confidence intervals are very conservative, and the corresponding rectangular confidence region contains the circular confidence region. The Bonferroni confidence intervals are also conservative, but the corresponding confidence region excludes some points contained in the ellipsoidal confidence region. In this case the Bonferroni intervals practically coincide with the maximum modulus-i confidence intervals, which are exact. Here, we are able to use the maximum modulus-t intervals because the BLUEs of /?i — /?2 a n d T\ — T2 are uncorrelated.
Apart from the three simultaneous confidence intervals mentioned here, other useful intervals can be found in some special cases. A few such methods applicable to comparison of groups means are discussed in Section 6.2.4 (see also Miller, 1981, Chapter 2).
156
5.2.4
Chapter 5 : Further Inference in the Linear Model
Confidence band for regression surface*
In the context of linear regression analysis we can write the generic model equation for each observation as y = x'0 + e,
(5.2.6)
which is another form of (1.1.1). We wish to specify a neighbourhood of the fitted value (a:'/3) which is likely to include the 'true' mean of the response (x'j3) for all reasonable x. This region can be called a confidence band for the regression surface. The problem can be viewed as that of finding simultaneous confidence intervals for infinitely many x'0s. If we consider only a finite number of estimable LPFs, Scheffe's simultaneous confidence intervals are applicable. A crucial advantage of these intervals is that the distribution involved in the computation depends only on the number of the linearly independent LPFs, and not on the total number of LPFs (see (5.2.5) where m appears but q does not appear). If r is the rank of X, then simultaneous confidence intervals for r linearly independent LPFs can lead to confidence intervals for any finite number of estimable LPFs. There is no reason why the result should not hold for infinitely many estimable LPFs. The next proposition shows that this is indeed the case. Proposition 5.2.4 Let the response vector ynxl N(X/3,a2I) and r be that rank of X. Then P \\x'P-x'P\
have the distribution
<{rF r>n _ r>a aj'(JC'A:)-a5^2} 5 for all x G C(X')]
=l-a.
Proof. Let I be such that x = X'l. Let us define b = Px{X~/3 - Xfi) and c = Pxl- It follows from the Cauchy-Schwarz inequality that {x'/3 - x'P)2 o*x'(X'X)-x Also, b ~ N(0,o2Px),
{VXP - I'Xp)2 7H'PXI
_ jc'b)2 ^C'C
b'b ~ ^ '
and consequently b'b/a2 ~ xf- Since b is inde-
pendent of a2 (see Proposition 5.1.1), we have b'b/(ro2) ~ F r ; n _ r . It
5.2 Confidence regions
157
follows that
P
(x'8 — x'8)2 max -4=-^ ^— ra2x'{X'X)-x x e C{X')
< Frn-ra
=l-a.
Proposition 5.2.4 is in fact a proof of the validity of Scheffe's simultaneous confidence intervals in the special case A = X. In the case of simple linear regression, we have the model (5.2.6) with x = (1 : x)' and 8 = (/30 : pi)'. The 100(1 - a)% confidence band for the regression line (/3o + fiix) simplifies to
A, + &x-> 2Fr>n^a72 (I + -*)\') , \ \n n(x2 — x2) J A) + Ax + * rFrtn-r,a£
(- + ^X)*)
,
(5-2.7)
where x and x2 are the observed average of x and x2, respectively (See Exercise 5.9). Example 5.2.5 (World population data) For the world population data of Table 1.2, we fitted a linear model for the mid-year population (y) for a given year (x). The least squares estimates of the parameters are /?o = —158.3 and A = -0822, while the estimated error variance is a2 = 6.055 x 10~5 (see Examples 4.2.2 and 4.7.9). Using r = 2 and a = .05 in (5.2.7), we have the following 100(1 — a)% confidence band for the expected world population (in billion) in year x. [-158.3 + .0822a; - .02187^/(z - 1990.5)2/665 + .05, -158.3 + .0822a; + .02187^/(a; - 1990.5)2/665 + .05] . The upper and lower parts of the band are plotted in Figure 5.2. The band happens to be extremely narrow. Thus, we can locate the regression line very precisely for this data. Further details on confidence band for the regression surface can be found in Miller (1981, Section 3.4.1).
158
Chapter 5 : Further Inference in the Linear Model
6.2 r
y/
6 5.8 -
/
5.6 -
/
5.4 -
/
population
/ 5.2 -
/
4.8 4.6
-J^
44 L
1980
Figure 5.2
5.3
y^ i
i
i
i
1985
1990 year
1995
2000
Confidence band for the regression line of Example 5.2.5
Tests of linear hypotheses
We now turn to tests of hypotheses involving linear functions of /3. For example, we may want to test (a) if a specific coefficient, /3j, is zero, (b) if a subset of coefficients are zero, (c) if two coefficients are equal, or (d) if a coefficient is equal to a specified value. All these can be viewed as special cases of what is called the general linear hypothesis, A/3 = £ for given matrix A and £.
5.3.1
Testability of linear hypotheses*
Before attempting to construct such statistical tests, we have to clearly delineate what we can or cannot test statistically.
5.3 Tests of linear hypotheses
159
Example 5.3.1 (Two-way classified data, continued) Consider the hypothesis T\ + T2 = 0 in the model of Example 4.1.8. Following the argument of page 99, it is easy to see that T\ + T2 is not identifiable, that is, one cannot discern one value of T\ + T-I from another on the basis of the observations. Therefore, the hypothesis T\ + T2 = 0 is not statistically testable. Similarly, the hypothesis {5\ + fo = 0 is not testable. On the other hand, the data gives us information about the estimable LPF T\ — T2- This information would allow us to test the hypothesis Tl - r2 = 0. Consider the hypothesis Aft — £ where Aft is a scalar. Following the arguments of the above example, we can formally define testability of Aft = £ as the same as estimability (or identifiability) of Aft. When Aft is a vector, it may include estimable as well as nonestimable components. Therefore, one has to be more careful about the notion of testability. Definition 5.3.2 A linear hypothesis Aft = £ in the linear model (y, Xft, o2l) is called completely testable if all the elements of the vector Aft are estimable LPFs. According to Remark 4.1.11, the restriction Aft = £ is completely testable if and only if C(A') C C{X'). When the hypothesis Aft = £ is not completely testable, a meaningful statistical test may still be possible. The above hypothesis implies that I'Aft = l'£ for all I. If there is an I such that I'Aft is estimable, we may be able to test the hypothesis I'Aft = l'£. If there is no such I, no testing is possible. Definition 5.3.3 A linear hypothesis Aft = £ in the linear model (y,Xft,a2I) is called completely untestable if there is no vector I such that I'Aft is an estimable LPF. D Definition 5.3.4 A linear hypothesis Aft = £ in the linear model (y, Xft, a21) is called partially testable if it is neither completely testable nor completely untestable. The following proposition gives a simple criterion to judge when a linear hypothesis would be completely testable, partially testable or
160
Chapter 5 : Further Inference in the Linear Model
completely untestable. Proposition 5.3.5 (y,X(3,a2I) is
The linear hypothesis A/3 = £ in the linear model
(a) completely testable if and only if p(X' : A') = p(X'); (b) completely untestable if and only if p(X' : A') — p(X') + p(A'); (c) partially testable if and only if p{X) < p(X' : A') < p{X') + p(A'). Proof. Part (a) follows from the fact that C(A') C C(X') if and only if p(X' : A') — p{X') (see Exercise 4.6). The notion of complete untestability is equivalent to the virtual disjointness of the column spaces of X' and A'. Part (b) is a restatement of this condition. Part (c) follows from the other two parts. A completely untestable linear restriction is equivalent to a reparametrization of the original model with fewer parameters (see Exercise 4.30). What would be a meaningful way to test a partially testable hypothesis? We need not draw any conclusion about the untestable part of it. On the other hand, we should test for all the testable restrictions implied by the original hypothesis. The testable restrictions implied by the hypothesis A/3 = £ are of the form p'/3 where p € C(A') n C(X'). The following proposition provides the basis for a test. Proposition 5.3.6 In the linear model (y,Xj3,cr2I), £ be a hypothesis with £ G C(A).
let T-Co
A(3 —
(a) Ho is equivalent to the pair of hypotheses %Q\ : TA0 = T£ and H02 {I - Px,)A'Ap = (/ - PX,)A'£, where T is any matrix such that C(A'T') = C{A') nC(X'). (b) The hypothesis HQI is completely testable. (c) The hypothesis K02 is completely untestable. Proof. It is easy to see that Ho implies both %oi and %02- The reverse implication is proved if we can show that
C(A') C C(A'T') + C(A'A(I - Px,)).
5.3 Tests of linear hypotheses
161
Suppose that I is a vector which is orthogonal to the right hand side. Therefore, (/ - Px,)A'Al = 0, ie, A'Al e C(X'). It follows that A'Al is in C(A')nC(X') (which is equal to C(A'T')), and hence orthogonal to /. Consequently Al = 0. This proves the desired inclusion and part (a). Part (b) is proved by the fact that C(A'T') C C{X'). In order to prove part (c), let I be such that I'(I — P ,)A'A(3 is estimable. The condition A'A(I - Px,)l G C(X') implies that (I -
PX,)A'A{I
- Px,)l = 0, that is, A(I - Px,)l = 0. Therefore, I'(I -
Pvl)A1 A/3 must be identically zero.
d
The decomposition given in Proposition 5.3.6 is similar to a decomposition due to Dasgupta and Das Gupta (2000) which reduces any linear restriction into two parts: a model-preserving constraint and a restriction that only involves estimable LPFs. Remark 5.3.7 The choice of the matrix T in Proposition 5.3.6 should have no effect on the actual test, because the two versions of %oi corresponding to two distinct choices of T imply one another. It follows from Exercise 2.20 that a simple choice of T is X'X(X'X + A'A)~A'. Example 5.3.8 (Two-wa}r classified data, continued) Consider the hypothesis T\ = T2 = 0 in the model of Example 4.1.8. Here, .
A
/0
0 0
1 0\
.
n
= { o o o o i j ' * =0-
It is easy to see that C(A') consists of all vectors of order five which have the first three components equal to zero. On the other hand, /I 1 C(X') = £ 0 1 \0
1 1 1\ / I 1 1\ / 0 1 1 \ 0 10 1 0 1 0 0 1 1 0 1 = C 0 1 0 =C 0 1 0 . 1 0 0 1 1 0 1 1 0 0 1 1/ \ 0 0 1/ \-l 0 1 /
The first column of the last matrix is obtained from the matrix in the previous step by subtracting its third column from the first. It is clear that the first column of the last matrix is in C(A'). On the
162
Chapter 5 : Further Inference in the Linear Model
other hand, the other two columns cannot be linearly combined in any way so that the first three elements of the combined vector is zero. Therefore, C{A') n C(X') is spanned by the vector ( 0 : 0 : 0 : 1 : - 1 ) ' alone. Following Proposition 5.3.6, we can choose T = (1 : — 1), so that C{A'T') = C(A') DC{X'). Thus, the completely testable hypothesis (?^oi) is simply TX - r 2 = 0. The untestable hypothesis (^02) is (/ — Pvl )A'Aft = 0, which simplifies to I(-1:-1:-2:2:2)'(TI+^)=O.
This is equivalent to T\ + r^ — 0.
D
It is easy to see that %o2 is trivial when C(A') C C(X') and Hoi is trivial when p{X' : A') = p(X') + p(A'). Neither hypothesis is trivial when A/3 = £ is partially testable. One can test %Q by testing HQ\, while keeping in mind that the untestable restriction ^02 is implied by the hypothesis. Note that HQ can be tested directly without formally reducing it to Tim (see Remark 5.3.13). However, it is important to understand which hypothesis is actually being tested. In addition, the restriction H02 may serve as a pointer to any possible mistake in specifying the hypothesis. 5.3.2
Hypotheses
with a single degree of freedom
Let the n x 1 response vector y have the distribution N(X/3,a2I) and p(X) = r. Consider the null hypothesis Ho p'/3 = 6 Since p'/3 is a scalar, the question of partial testability does not arise. The hypothesis is testable if p'/3 is estimable and completely untestable otherwise. Let us assume that the hypothesis is testable. We shall deal with three alternative hypotheses: Hi : p'P > £,
H2 : p'P < 6 -Hz : p'/3 £.
5.3 Tests of linear hypotheses
163
It follows from Proposition 5.1.1 and Remark 3.2.5 that
,J*fi~* ~ tn-r(p'(3 - 0\l^p'{X'X)-p
(5-3.1)
The non-centrality parameter of the ^-distribution depends on /3, the 'true value' of the vector parameter. The noncentrality parameter is zero under Ho, positive under H\ and negative under H%. We can test for Ho against the alternative hypothesis H\ by rejecting Ho when the statistic of (5.3.1) is too large. If the level of the test is specified as a, then the test amounts to accepting H\ if
r^
-^ ln—r,ai
\u.o.4)
^jcjip>(X'X)-p and accepting HQ otherwise. Similarly, a test of the null hypothesis Ho against the alternative hypothesis H2 is to accept H2 when
-JI=L=<-t y/^p'(X'X)-p and to accept T-LQ otherwise. A size-a test for the null hypothesis %o against the alternative hypothesis Hz is to accept H3 when
1^ < tn-r,a/2, y/^pf{X'X)-p
(5.3.3)
and to accept Ho otherwise. This test coincides with the generalized likelihood ratio test, described in Section 5.3.4 (see Exercise 5.12). The above three tests happen to be uniformly most powerful unbiased tests for the respective problems, under the given set-up (see Lehmann, 1986, Section 7.7).
164
Chapter 5 : Further Inference in the Linear Model
5.3.3
Decomposing the sum of squares
We now return to the general linear hypothesis Aβ = ξ. As we have seen in Section 5.3.1, a partially testable hypothesis can always be reduced to a completely testable hypothesis. We shall henceforth assume that the hypothesis Aβ = ξ is completely testable, that is, C(A') ⊆ C(X'). We also assume that ξ ∈ C(A), so that the hypothesis itself has no algebraic inconsistency. Finally, we assume that the error degrees of freedom is positive, so that there is at least one LZF available for the estimation of σ². Let us denote the model (y, Xβ, σ²I) by M and its restricted version under the hypothesis Aβ = ξ by M_r.

Proposition 5.3.9  Under the above set-up, (a) the elements of Aβ̂ − ξ are LZFs of M_r; (b) Aβ̂ − ξ is uncorrelated with (I − P_X)y; and (c) every LZF of M_r which is uncorrelated with Aβ̂ − ξ and (I − P_X)y is almost surely zero.

Proof. Rewrite Aβ̂ − ξ as

    Aβ̂ − ξ = A(X'X)⁻X'y − AA⁻ξ = A(X'X)⁻X'y − A(X'X)⁻X'XA⁻ξ = A(X'X)⁻X'(y − XA⁻ξ).

Therefore, Aβ̂ − ξ is a linear function of y − XA⁻ξ. Its mean (under M_r), obtained by substituting X(I − A⁻A)θ for (y − XA⁻ξ) in the last expression, easily simplifies to zero. This proves part (a). Part (b) follows from the fact that Aβ̂ − ξ and (I − P_X)y are vectors of BLUEs and LZFs in M, respectively.
In order to prove part (c) by contradiction, let l'(y − XA⁻ξ) be an LZF of M_r which is uncorrelated with (I − P_X)y and Aβ̂ − ξ. The first condition of uncorrelatedness implies that (I − P_X)l = 0, that is, l is of the form Xm for some vector m. The second condition amounts to A(X'X)⁻X'l = 0, or Am = 0. Consequently X(I − A⁻A)m = l, so l'(y − XA⁻ξ) is a BLUE of M_r (see Exercise 4.8). Since l'(y − XA⁻ξ) is simultaneously a BLUE and an LZF of M_r, it must be identically zero with probability one.  □

Proposition 5.3.9 implies that the elements of (I − P_X)y and Aβ̂ − ξ together constitute a generating set of the LZFs of M_r.

Proposition 5.3.10  If the SSE of M is R_0², the SSE of M_r is given by

    R_H² = R_0² + σ²(Aβ̂ − ξ)'[D(Aβ̂ − ξ)]⁻(Aβ̂ − ξ).        (5.3.4)
Proof. Consider the rank-factorization of D(Aβ̂ − ξ) as σ²CC'. Then C has a left-inverse, C^{−L}. The elements of the vector u_1 = C^{−L}(Aβ̂ − ξ) each have variance σ², while they are uncorrelated with (I − P_X)y and with each other. Further, the elements of (I − P_X)y constitute a basis set of the LZFs of M. Suppose that u_2 is a vector of a corresponding uncorrelated basis set of LZFs having variance σ². According to Proposition 5.3.9, an uncorrelated basis set of LZFs of M_r is given by the elements of u_1 and u_2, each element having variance σ². It follows from Proposition 4.7.5 that

    R_H² = u_2'u_2 + u_1'u_1 = R_0² + (Aβ̂ − ξ)'[(C^{−L})'C^{−L}](Aβ̂ − ξ)
         = R_0² + σ²(Aβ̂ − ξ)'[D(Aβ̂ − ξ)]⁻(Aβ̂ − ξ).  □
Remark 5.3.11  It is also clear from Proposition 5.3.9 that the number of additional LZFs that constitute an uncorrelated basis set of LZFs of M_r is ρ(D(Aβ̂ − ξ)), or ρ(A(X'X)⁻A'). Suppose A = T(X'X). Consequently

    ρ(A(X'X)⁻A') = ρ(TX'XT') ≤ ρ(TX'X) ≤ ρ(TX').

However, ρ(TX') = ρ((TX')(XT')) = ρ(TX'XT'). Hence

    ρ(D(Aβ̂ − ξ)) = ρ(TX'X) = ρ(A).

Thus, the number of additional LZFs is precisely the number of linearly independent restrictions implied by the statement Aβ = ξ. The total error degrees of freedom of M_r is n − r + ρ(A).

5.3.4 Generalized likelihood ratio test and ANOVA table
Consider the linear model (y, X(3, a2 J) where y has the multivariate normal distribution, and the testable and algebraically consistent hypothesis %Q : A(3 = £. The generalized likelihood ratio test (GLRT) for Ho against the general alternative %i : Afi J= £ is given below. Proposition 5.3.12 Under the above set-up, the GLRT at level a is equivalent to rejecting Ho if
    [(R_H² − R_0²)/R_0²] · [(n − r)/m]  >  F_{m,n−r,α},

where m = ρ(A).

Proof. The GLRT statistic is

    Λ = [ max_{σ²,θ} (2πσ²)^{−n/2} exp{ −‖y − XA⁻ξ − X(I − A⁻A)θ‖² / (2σ²) } ]
        / [ max_{σ²,β} (2πσ²)^{−n/2} exp{ −‖y − Xβ‖² / (2σ²) } ].

The denominator is maximized with respect to σ² when σ² = n⁻¹‖y − Xβ̂‖², while the numerator is maximized by choosing σ² = n⁻¹‖y − XA⁻ξ − X(I − A⁻A)θ̂‖². Substituting these values in the above expression and simplifying, we have

    Λ = [ min_θ ‖y − XA⁻ξ − X(I − A⁻A)θ‖² / min_β ‖y − Xβ‖² ]^{−n/2},
which is a decreasing function of (R_H² − R_0²)/R_0². Since the GLRT consists of rejecting H_0 when Λ is unduly small, it is equivalent to rejecting H_0 when (R_H² − R_0²)/R_0² is sufficiently large. In order to find an appropriate cut-off value of this ratio, notice from Propositions 4.7.5, 5.3.9 and 5.3.10 that, under H_0, (R_H² − R_0²)/σ² and R_0²/σ² are sums of squares of independent, standard normal LZFs. Part (a) of Proposition 4.7.5 and Remark 5.3.11 indicate that the numbers of summands in them are m and n − r, respectively. Hence, (R_H² − R_0²)/σ² ~ χ²_m, R_0²/σ² ~ χ²_{n−r}, and the two random variables are independent. Therefore, the null distribution of

    [(R_H² − R_0²)/m] / [R_0²/(n − r)]

is F_{m,n−r}. The
statement of the proposition follows. Remark 5.3.13 If A/3 = £ is a partially testable but algebraically consistent hypothesis, the GLRT of Proposition 5.3.12 is valid, with . ). Here, R2H should be interpreted as the error sum of squares under the restriction A/3 = £. If one overlooks the nontestability of the hypothesis, then one may incorrectly use m = p(A). This would entail a reduction in size and power of the test (see Exercise 5.17). See Peixoto (1986) and von Rosen (1990) for related results. n The GLRT statistic is intuitively quite meaningful. From Proposition 5.3.10, the difference R2H - R\ coincides with a2(A@ - £)'[.D(A/3 £)]~(A/3 — £). This quadratic form accounts for the deviation from the hypothesis A/3 = £, and should be small if Ho is indeed true. The other component of R2H is R2., which is present regardless the validity of "HQ. The GLRT consists of rejecting %Q if the quadratic form is too large relative to R2,. The decomposition of R2H and the associated degrees of freedom is often displayed in the form of the analysis of variance (ANOVA) given in Table 5.1. The 'total' sum of squares in the above table is in fact the error sum of squares, under the hypothesis Afi = £. This 'total' includes the unrestricted sum of squares {RQ) and the additional sum of squares arising from the departure from %Q. The 'departure' here may just be due to statistical fluctuations (or noise, since after all, A/3 — £ need not be identically zero even if A/3 — £ is) or due to the fact that Ho is
Source               Sum of Squares                                       Degrees of Freedom   Mean Square
Deviation from H_0   R_H² − R_0² = σ²(Aβ̂ − ξ)'[D(Aβ̂ − ξ)]⁻(Aβ̂ − ξ)       m = ρ(A)             (R_H² − R_0²)/m
Residual             R_0² = min_β ‖y − Xβ‖²                               n − r = n − ρ(X)     R_0²/(n − r)
Total                R_H² = min_{β: Aβ=ξ} ‖y − Xβ‖²                       n − r + m

Table 5.1  Analysis of variance table for the hypothesis Aβ = ξ
not valid, that is, due to significant and systematic difference between A/3 and £. The statistical significance test determines whether the 'departure' is large enough to signify the violation of Holt is clear that any two of the above three sums of squares determines the third. Usually one computes R\ and either R2H or R2H — RQ, the choice depending on whichever is easier. The ANOVA table can be appreciated better when we interpret all the quantities in terms of LZFs. Recall from the proof of Proposition 5.3.9 that we can think of a basis set of LZFs of the restricted model as the union of two disjoint subsets: a basis set of LZFs of the unrestricted model and a set of LZFs which would have been BLUEs in the absence of the restrictions. The 'sums of squares' RQ and R2H — RQ are in fact the sums of squares of these subsets of LZFs, and the 'degrees of freedom' m and n — r are the numbers of LZFs in these. The total 'sum of squares' and 'degree of freedom' correspond to these quantities for the combined set of LZFs. The last column of Table 5.1 displays the average squared values of the LZFs of the two subsets and the combined set.
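The entries of Table 5.1 can be computed directly from two quantities: R_0² from the unrestricted fit, and the quadratic form of Proposition 5.3.10. The sketch below is a minimal illustration under the stated assumptions (testable, algebraically consistent hypothesis); the function name and arguments are invented for the illustration.

    import numpy as np
    from scipy import stats

    def glrt_anova(y, X, A, xi, alpha=0.05):
        """ANOVA quantities for the completely testable hypothesis A beta = xi."""
        n = len(y)
        r = np.linalg.matrix_rank(X)
        m = np.linalg.matrix_rank(A)
        G = np.linalg.pinv(X.T @ X)
        beta_hat = G @ X.T @ y
        e = y - X @ beta_hat
        R0_sq = e @ e                                   # R_0^2
        d = A @ beta_hat - xi
        V = A @ G @ A.T                                 # D(A beta_hat - xi) / sigma^2
        dev_sq = d @ np.linalg.pinv(V) @ d              # R_H^2 - R_0^2, by (5.3.4)
        F = (dev_sq / m) / (R0_sq / (n - r))
        return {"R0^2": R0_sq, "RH^2 - R0^2": dev_sq, "df": (m, n - r),
                "F": F, "reject H0": F > stats.f.ppf(1 - alpha, m, n - r)}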
5.3.5 Special cases
In this section we work out the GLRT in a few important special cases. In all these examples we assume that the null hypothesis is testable and algebraically consistent. The alternative hypothesis is the complement of the null hypothesis (Aβ ≠ ξ).

Example 5.3.14  Consider the hypothesis H_0 : β = 0. Testing this hypothesis amounts to checking if the model is useful at all. The null hypothesis contends that it is not. It is easy to see that in this case R_H² = y'y, r = k and m = k. The ANOVA is given in Table 5.2.
Source               Sum of Squares             Degrees of Freedom   Mean Square
Deviation from H_0   R_H² − R_0² = y'y − e'e    k                    (R_H² − R_0²)/k
Residual             R_0² = e'e                 n − k                R_0²/(n − k)
Total                R_H² = y'y                 n

Table 5.2  Analysis of variance table for the hypothesis β = 0
Thus the level-α GLRT reduces to rejecting H_0 if

    [(y'y − R_0²)/R_0²] · [(n − k)/k]  =  [(y'y − e'e)/e'e] · [(n − k)/k]  >  F_{k,n−k,α}.

Note that the 'deviation from H_0' sum of squares, R_H² − R_0², coincides with the sum of squares in (4.7.2) which was attributed to the BLUEs.  □

Example 5.3.15  Suppose that X_{n×k} = (1_{n×1} : Z), 1_{n×1} ∉ C(Z) and β' = (β_0 : θ')'. Then the hypothesis H_0 : θ = 0 corresponds to the irrelevance of the (non-constant) explanatory variables. This hypothesis can be expressed as Aβ = ξ where A = (0_{p×1} : I_{p×p}) and ξ = 0_{p×1}, where p = k − 1. It follows that R_H² = y'y − nȳ², where ȳ is the sample mean of the response, n⁻¹y'1_{n×1}. We also have r = k = p + 1 and m = p. The quantity R_H² − R_0² represents the reduction in the sum of squares due to the coefficients of all the (non-constant) explanatory variables. It is referred to as the regression sum of squares. R_H² is called the total sum of squares. The ANOVA is given in Table 5.3.
Source               Sum of Squares                      Degrees of Freedom   Mean Square
Deviation from H_0   R_H² − R_0² = y'y − nȳ² − e'e       p                    (R_H² − R_0²)/p
Residual             R_0² = e'e                          n − p − 1            R_0²/(n − p − 1)
Total                R_H² = y'y − nȳ²                    n − 1

Table 5.3  Analysis of variance table for the hypothesis θ = 0
The level-α GLRT consists of rejecting H_0 if

    [(y'y − nȳ² − e'e)/e'e] · [(n − p − 1)/p]  =  [R²/(1 − R²)] · [(n − p − 1)/p]  >  F_{p,n−p−1,α},

where

    R² = (y'y − nȳ² − e'e) / (y'y − nȳ²).

The ratio R² is known as the coefficient of determination or the sample squared multiple correlation coefficient. It is the sample version of the multiple correlation coefficient defined in (3.4.4). Notice that 1 − R² = e'e/(y'y − nȳ²), which is the ratio between the R_0² values without and with the restriction θ = 0. This quantity is small when the variability of y is explained by the explanatory variables much better than by the constant term alone. This is why R² has traditionally been used as an empirical indicator of the degree of fit. A large value of R² is regarded as an indication of good fit (that is, significance of θ). The GLRT is a formal assertion of this statement.

Example 5.3.16  The hypothesis H_0 : β_j = 0 corresponds to the insignificance of the jth explanatory variable. We can express this hypothesis as Aβ = ξ, where A is the jth row of the k × k identity matrix and ξ = 0. In this case, R_H² − R_0² has a simpler expression than R_H². Let (X'X)^{jj} be the jth diagonal element of (X'X)⁻ (since β_j is assumed to be estimable, this element does not depend on the g-inverse:
see page 111). Then D(Aβ̂ − ξ) = Var(β̂_j) = σ²(X'X)^{jj}. It follows that R_H² − R_0² = β̂_j²/(X'X)^{jj}. This leads to the ANOVA of Table 5.4.

Source               Sum of Squares                      Degrees of Freedom   Mean Square
Deviation from H_0   R_H² − R_0² = β̂_j²/(X'X)^{jj}       1                    R_H² − R_0²
Residual             R_0² = e'e                          n − k                R_0²/(n − k)
Total                R_H² = e'e + β̂_j²/(X'X)^{jj}        n − k + 1

Table 5.4  Analysis of variance table for the hypothesis β_j = 0
The level-α GLRT is to reject H_0 if

    [β̂_j²/(X'X)^{jj}] · [(n − k)/R_0²]  >  F_{1,n−k,α}.

This test is very similar to the test described in (5.3.3). A direct connection is established in Exercise 5.12.
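The special cases of Examples 5.3.15 and 5.3.16 are illustrated by the following sketch on simulated data (invented for the illustration; it assumes X = (1 : Z) has full column rank). It computes R², the overall F statistic for θ = 0, and the statistic β̂_j²(n − k)/{R_0²(X'X)^{jj}} for a single coefficient.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, p = 50, 3
    Z = rng.normal(size=(n, p))
    X = np.column_stack([np.ones(n), Z])                 # k = p + 1 columns
    y = X @ np.array([1.0, 0.5, 0.0, -0.8]) + rng.normal(size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    R0_sq = e @ e                                        # residual sum of squares
    tot = y @ y - n * y.mean() ** 2                      # total sum of squares about the mean
    R2 = 1 - R0_sq / tot                                 # coefficient of determination
    F_overall = (R2 / (1 - R2)) * (n - p - 1) / p        # test of theta = 0 (Example 5.3.15)
    sigma2_hat = R0_sq / (n - p - 1)
    j = 2                                                # test of beta_j = 0 (Example 5.3.16)
    F_j = beta_hat[j] ** 2 / (sigma2_hat * XtX_inv[j, j])
    print(R2, F_overall, stats.f.sf(F_overall, p, n - p - 1),
          F_j, stats.f.sf(F_j, 1, n - p - 1))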
5.3.6 Power of the generalized likelihood ratio test*
Let us assume, once again, that the errors are normal and that the linear hypothesis Aβ = ξ is testable and algebraically consistent. We can calculate the power of the generalized likelihood ratio test by using its distribution under the alternative hypothesis. This distribution is provided below.

Proposition 5.3.17  Under the above set-up,

    [(R_H² − R_0²)/m] / [R_0²/(n − r)]  ~  F_{m,n−r}(c),

where m = ρ(A), c = σ⁻²(Aβ − ξ)'[A(X'X)⁻A']⁻(Aβ − ξ), and β is the 'true' value of the parameter vector.
Proof. Notice that Aβ̂ is a BLUE in the unrestricted model, and (Aβ̂ − ξ) ~ N(Aβ − ξ, σ²A(X'X)⁻A'). Rank-factorizing σ²A(X'X)⁻A' as CC', we have C^{−L}(Aβ̂ − ξ) ~ N(C^{−L}(Aβ − ξ), I). It is easy to see that ρ(C) = ρ(A(X'X)⁻A') = ρ(A) = m. Therefore, by Remark 3.2.5,

    (R_H² − R_0²)/σ² = (Aβ̂ − ξ)'(C^{−L})'C^{−L}(Aβ̂ − ξ)  ~  χ²_m(c).
As argued in the proof of Proposition 5.3.10, the above is independent of R_0². The statement of the proposition follows from Remark 3.2.5.

Remark 5.3.18  The value of the noncentrality parameter c does not depend on the choice of the g-inverse of A(X'X)⁻A'. To see this, note that (Aβ̂ − Aβ) must be in C(A(X'X)⁻A'), which is identical to C(A). On the other hand, the algebraic consistency of the hypothesis (assumed above) ensures that (Aβ̂ − ξ) is also in this space. Therefore, (Aβ − ξ) must be in this space too.

From Propositions 5.3.12 and 5.3.17, we find that the power of the generalized likelihood ratio test at level α is P[F > F_{m,n−r,α}], where F ~ F_{m,n−r}(c) and c = σ⁻²(Aβ − ξ)'[A(X'X)⁻A']⁻(Aβ − ξ). Given α and any numerical value of c, this probability can be computed from the infinite series expansions of the non-central F-distribution (see, e.g., Abramowitz and Stegun, 1980). The power is a monotonically increasing function of the noncentrality parameter, c.
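For a numerical check, the power can be evaluated with the noncentral F distribution available in standard software. A minimal sketch, assuming illustrative values of m, n − r and c:

    from scipy import stats

    def glrt_power(m, n_minus_r, c, alpha=0.05):
        """P[F > F_{m, n-r, alpha}] when F ~ F_{m, n-r}(c)."""
        f_crit = stats.f.ppf(1 - alpha, m, n_minus_r)
        return stats.ncf.sf(f_crit, m, n_minus_r, c)

    # power increases with the noncentrality parameter c
    print([round(glrt_power(3, 20, c), 3) for c in (0, 2, 5, 10, 20)])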
5.3.7 Multiple comparisons*
Sometimes one has to test a number of single-degree-of-freedom hypotheses in a linear model. If these are combined to form a hypothesis of the form AP = £, then the rejection of this hypothesis only means that at least some of the single-degree-of-freedom hypotheses are probably incorrect. One may be interested in checking the validity of these hypotheses on a case-by-case basis, using the test of Section 5.3.2 with a nominal size a. If this is done, and all the hypotheses happen to be true, then the probability of erroneous rejection of at least one of the hypotheses is greater than a. Thus, the nominal size of the tests is misleading. Some adjustment will be needed, in the spirit of simultaneous
confidence intervals discussed in Section 5.2.3. We shall mention two sets of tests. Consider the linear model (y, X(3, a21) with n observations and p(X) = r. Suppose that we have to test q testable hypotheses,
    H_{0j} : a_j'β = ξ_j    against    H_{1j} : a_j'β ≠ ξ_j,

j = 1, …, q. Let A' = (a_1 : a_2 : ⋯ : a_q), ξ = (ξ_1 : ξ_2 : ⋯ : ξ_q)', and ρ(A) = m. A set of conservative tests would be to reject H_{0j} if

    |a_j'β̂ − ξ_j| / √(σ̂² a_j'(X'X)⁻a_j)  >  t_{n−r,α/(2q)},

j = 1, 2, …, q. This set of tests is analogous to Bonferroni confidence intervals. The probability of erroneous rejection of at least one of the hypotheses, when all of them actually hold, is at most α. The tests which are analogous to the Scheffé confidence intervals are to reject H_{0j} if

    (a_j'β̂ − ξ_j)² / (σ̂² a_j'(X'X)⁻a_j)  >  m F_{m,n−r,α},

j = 1, 2, …, q. This set of tests is also conservative. If the components of Aβ̂ are uncorrelated, a set of tests analogous to the maximum modulus-t confidence intervals can be constructed in a similar manner. Two other methods for a special case are described in Section 6.2.4. We refer the reader to Hocking (1996, Chapter 4), Christensen (1996, Chapter 5) and Hochberg and Tamhane (1987) for more details on these and other multiple comparison methods.
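A sketch of the two sets of conservative tests (illustrative only; the matrix A, the data and the function name are invented):

    import numpy as np
    from scipy import stats

    def multiple_tests(y, X, A, xi, alpha=0.05):
        """Bonferroni- and Scheffe-type tests of a_j' beta = xi_j, j = 1..q."""
        n, q = len(y), A.shape[0]
        r = np.linalg.matrix_rank(X)
        m = np.linalg.matrix_rank(A)
        G = np.linalg.pinv(X.T @ X)
        beta_hat = G @ X.T @ y
        e = y - X @ beta_hat
        sigma2_hat = e @ e / (n - r)
        # squared t statistics, one per hypothesis
        t_sq = (A @ beta_hat - xi) ** 2 / (sigma2_hat * np.einsum('ij,jk,ik->i', A, G, A))
        bonferroni = t_sq > stats.t.ppf(1 - alpha / (2 * q), n - r) ** 2
        scheffe = t_sq > m * stats.f.ppf(1 - alpha, m, n - r)
        return bonferroni, scheffe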
5.3.8 Nested hypotheses
Sometimes it is necessary to test hypotheses that are related to each other. For example, given two sets of similar regression data in two groups, we may first check if the regression equations are identical — so that a single model can be used for the two groups. If the answer is negative, then we may ask if the regression surfaces are, at least,
parallel — so that the group effect can be taken into account by means of a single added parameter. This is an example of nested hypotheses. Let T-LOj : Aj/3 = £y, j = 1 , . . . , q be a series of hypotheses so that for each j , HQJ implies "HOJ+I- We begin by checking Hoi using the generalized likelihood ratio test with nominal size a. If this hypotheses is accepted, there is no need to proceed further. If it is rejected, then we can check ^02 using another generalized likelihood ratio test with nominal size a. The problem of larger-than-nominal size does not arise here, since 'Hoi implies Ho 2- The (unconditional) probability of incorrect rejection at the second stage can not be more than the probability of incorrect rejection at the first stage. The tests can thus proceed in a sequential manner, until some hypothesis is accepted or the last hypothesis is rejected.
5.4 Prediction in the linear model
Suppose that yo is an unobserved response, which has to be predicted by making use of the observed response vector y. If g(y) is a predictor of yo, then it may be called a good predictor if E[yo — g(y)]2, the mean squared error of prediction (MSEP), is small. It was mentioned in Section 3.4 that the MSEP is minimized when g(y) = E(yo\y). Thus, the regression of yo o n V 1S the best predictor of yo- However, one usually needs the joint distribution of j/o a n ( l y m order to calculate the conditional expectation. Even if the distribution is known, the regression may not have a simple form and/or involve some unknown parameters. If we restrict g to be a linear function, then the resulting predictor can be called a linear predictor. The best linear predictor (BLP) is the linear function of y which minimizes the MSEP. An expression for E(yo\y), the BLP of yo in terms of y, is given in Proposition 3.4.1. The BLP is a linear function of y, also involving first and second order moments of yo and y. If these are not known, the predictor can be approximated by estimating these parameters. As the number of parameters is very large in comparison to the size of the vector y, one has to use a model, such as an autoregressive model. This method is not very useful when yo has little or no correlation with y. In the latter
case the BLP reduces to E(yo), and the most one can do is to estimate it from y. This is where the linear regression model can play a role. Suppose that yo and y are the responses from the models (yo,x'Q(3,a2) and (y, X0, cr2l), respectively, where XQ and X are observed values of possibly random explanatory variables. We assume that, given XQ and X, yo and y are uncorrelated. The links between the two models are the common parameters /3 and a2, which are unspecified. Note that the above model for yo implies that the BLP of yo is x'0f3, and the mean square prediction error is a2. When /3 and a2 are unknown, we would have to make use of the fact the y carries some information about them.
5.4.1 Best linear unbiased predictor
The mean squared prediction error (MSEP) of the linear predictor a'y + b can be written as

    E[y_0 − a'y − b]² = E[(y_0 − x_0'β + x_0'β − a'y − b)²] = E[y_0 − x_0'β]² + E[x_0'β − a'y − b]²,

because y_0 and y are uncorrelated and E(y_0) = x_0'β. Therefore, minimizing the MSEP with respect to a and b is equivalent to minimizing E[x_0'β − a'y − b]². At this point we put a crucial restriction on the linear predictor. We require that a'y + b be an unbiased predictor of y_0 for all β. Since E(y_0) = x_0'β, this assumption is equivalent to the condition

    E(a'y + b) = x_0'β    for all β.
We would look for the linear unbiased predictor with the smallest MSEP, and call it the best linear unbiased predictor (BLUP). Isn't the BLP itself an unbiased predictor? Of course it is, but it is still not useful as long as it depends on the unknown parameter /3. In order to find a meaningful solution, we put the further condition that a and 6 should not be functions of 0. This automatically rules out BLP as a candidate for BLUP.
When x_0 ∈ C(X'), Exercise 4.7 indicates that the condition 'E(a'y + b) = x_0'β for all β' holds if and only if b = 0 and a'y is an LUE of the estimable LPF x_0'β. Hence, the solution to the above optimization problem is obtained by choosing b as 0 and a'y as the BLUE of x_0'β. Thus, the unique BLUP of y_0 is the BLUE of x_0'β in the model (y, Xβ, σ²I). If x_0 ∉ C(X'), it is easy to see that the BLUP does not exist. The above result depends on the crucial assumption that y_0 and y are conditionally uncorrelated, given x_0 and X. See Section 7.13 for a generalization to the correlated case.

5.4.2 Prediction interval
Suppose that a future response y_0 has to be predicted on the basis of the past response y, the vector of explanatory variables (x_0) corresponding to y_0 and the matrix of explanatory variables (X) corresponding to y. In addition to the best linear unbiased (point) predictor, we want to find an interval which will contain y_0 with a specified probability. We assume once again that x_0 ∈ C(X'), that is, x_0'β is an estimable LPF. The prediction error of the BLUP is

    e_0 = y_0 − x_0'β̂ = y_0 − x_0'(X'X)⁻X'y.

The mean of the prediction error is zero, while the variance is

    Var(e_0) = σ²(1 + x_0'(X'X)⁻x_0).

If y_0 and y (given x_0 and X) jointly have the normal distribution with appropriate mean and dispersion, then Proposition 4.1 implies that e_0 has the normal distribution with zero mean and the variance given above. Using the argument of Section 5.2 we conclude that the interval

    [ŷ_0 − a, ŷ_0 + a],   where  ŷ_0 = x_0'(X'X)⁻X'y,   a = t_{n−r,α/2} √(σ̂²(1 + x_0'(X'X)⁻x_0)),        (5.4.1)

contains the unobserved response y_0 with probability 1 − α. We call this a 100(1 − α)% prediction interval for y_0. Note that the above interval is wider than the corresponding two-sided confidence interval for x_0'β. The additional width accounts for the
error in observing x'0/3. The difference between the widths of the two intervals can be quite substantial when cco(X'X)~a;o is small, which typically happens when the sample size (n) is large. The prediction error of the BLUP is the sum of the estimation error of x'Q(3 (which is x'0(3 — x'0/3) and the deviation of the observation from its mean (which is yo - X'QP)- We can hope to reduce the estimation error by using a lot of data, but we still have to allow for the variability in the observations while constructing the prediction interval. The latter component does not depend on n. Example 5.4.1 The data set given in Table 5.5, taken from Daniel (1995), shows the measurements of lower abdominal adipose tissue (AT), known to be associated with metabolic disorders considered as risk factors for cardiovascular disease, and waist circumference in centimeters (waist) of 109 men. Here, the waist circumference is meant to be an indicator of extent of adipose tissue. If log (AT) is regressed on log(waist), the least squares fitted equation is log(AT) = -12.46 + 3.748log(waist) where the standard errors associated with the two estimators are .9820 and .2176, respectively. Suppose that one wants to find out the average AT for a man with waist circumference 100 cm. The point estimate of this average is 121.5 and a 95% confidence interval, according to (5.2.3), is [113.0,130.7]. If this study is replicated 100 times and similar confidence intervals are computed from each of the data sets, one may expect the 'true average' to lie in 95 out of these 100 intervals. Now suppose that the objective is to predict the AT for a particular man with waist circumference 100 cm. The point prediction is 121.5 and a 95% prediction interval, according to (5.4.1), is [62.68,235.6]. If similar prediction intervals are obtained from 100 replications of the study with different sets of 109 subjects, and every time the AT of a different man with waist circumference 100 cm is measured and compared with this prediction interval, then one may expect the prediction intervals to successfully capture the measured values 95 times.
Table 5.5  Waist circumference (cm) and deep abdominal adipose tissue (AT) measurements for the 109 men of Example 5.4.1 (Source: Daniel, 1995)
As for the present data, one of the two observed responses for individuals having waist circumference 100 cm falls outside the confidence interval, but the prediction interval includes both of these. This is not unexpected, as the confidence interval only accounts for the estimation error of the parameters, and ignores the variation from one individual to another.
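The computations behind the confidence and prediction intervals of this example follow directly from (5.2.3) and (5.4.1). The sketch below uses simulated data loosely patterned on the example; it will not reproduce the numbers quoted above, since the original measurements are not used here.

    import numpy as np
    from scipy import stats

    def intervals(y, X, x0, alpha=0.05):
        """Confidence interval for x0'beta and prediction interval (5.4.1) for y0."""
        n = len(y)
        r = np.linalg.matrix_rank(X)
        G = np.linalg.pinv(X.T @ X)
        beta_hat = G @ X.T @ y
        e = y - X @ beta_hat
        sigma2_hat = e @ e / (n - r)
        fit = x0 @ beta_hat
        t = stats.t.ppf(1 - alpha / 2, n - r)
        h = x0 @ G @ x0
        ci = (fit - t * np.sqrt(sigma2_hat * h), fit + t * np.sqrt(sigma2_hat * h))
        pi = (fit - t * np.sqrt(sigma2_hat * (1 + h)), fit + t * np.sqrt(sigma2_hat * (1 + h)))
        return ci, pi

    rng = np.random.default_rng(2)
    waist = rng.uniform(60, 120, size=109)               # invented waist circumferences
    X = np.column_stack([np.ones(109), np.log(waist)])
    y = -12.5 + 3.75 * np.log(waist) + rng.normal(scale=0.3, size=109)   # simulated log(AT)
    print(intervals(y, X, np.array([1.0, np.log(100.0)])))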
5.4.3 Simultaneous prediction intervals*
When prediction intervals for a number of future observations have to be specified, we have to be careful about the confidence coefficient associated with all of them together. If x_{01}, …, x_{0q} are the vectors of explanatory variables corresponding to the unobserved response values y_{01}, …, y_{0q}, we can provide prediction intervals of the form

    [ x_{0i}'(X'X)⁻X'y − c √(σ̂²(1 + x_{0i}'(X'X)⁻x_{0i})),  x_{0i}'(X'X)⁻X'y + c √(σ̂²(1 + x_{0i}'(X'X)⁻x_{0i})) ],   i = 1, …, q.

Use of the Bonferroni inequality leads to the choice c = t_{n−r,α/(2q)}, r being the rank of X, in order to ensure coverage probability of at least (1 − α). On the other hand, the use of a Scheffé-type argument leads to the choice c = [qF_{q,n−r,α}]^{1/2}. Note that, unlike in the case of simultaneous confidence intervals, we cannot use mF_{m,n−r,α} where m is the number of linearly independent vectors out of x_{01}, …, x_{0q} (see Exercise 5.27). The larger the value of q, the wider are the simultaneous prediction intervals, to allow for the uncertainties of all the future observations. (This is the case even when all the x_{0i}'s are the same, that is, m = 1.) Indeed, both t_{n−r,α/(2q)} and qF_{q,n−r,α} increase without bound as q increases. When a large and/or uncertain number of future observations have to be predicted, a reasonable strategy is to switch to tolerance intervals, discussed next.
5.4.4 Tolerance interval*
There are two kinds of uncertainties which have to be accounted for in a prediction interval: uncertainty arising from the randomness of the observed samples and that due to the randomness of the future observations. As we have found out, the intervals become very wide when a large number of future observations are 'bracketed' simultaneously, rendering the intervals useless. A more meaningful solution can be found if we set a less ambitious target: that of covering most of the future observations — instead of covering them all. Our revised strategy is to ensure with 100(1 − α)% confidence that on the average 100(1 − γ)% of all the future observations will lie in the specified intervals. These intervals are called tolerance intervals.

Let us first consider the case where x_{0i} = x_0 for i = 1, …, q. If β and σ² are known, we can say with 100% confidence that the interval

    [x_0'β − σz_{γ/2},  x_0'β + σz_{γ/2}]

will contain on the average 100(1 − γ)% of all the future observations, z_{γ/2} being the 1 − γ/2 quantile of the standard normal distribution. The parameters x_0'β and σ are not known, but we can use 100(1 − α/2)% confidence intervals of each of these:

    [ x_0'β̂ − σ̂ t_{n−r,α/4} √(x_0'(X'X)⁻x_0),  x_0'β̂ + σ̂ t_{n−r,α/4} √(x_0'(X'X)⁻x_0) ],
    [ 0,  σ̂ / √(χ²_{n−r,1−α/2}/(n − r)) ].

Combining these with the Bonferroni inequality, we have the 100(1 − α)% tolerance interval for 100(1 − γ)% of the future observations:

    [ x_0'β̂ − σ̂ { t_{n−r,α/4} √(x_0'(X'X)⁻x_0) + z_{γ/2} / √(χ²_{n−r,1−α/2}/(n − r)) },
      x_0'β̂ + σ̂ { t_{n−r,α/4} √(x_0'(X'X)⁻x_0) + z_{γ/2} / √(χ²_{n−r,1−α/2}/(n − r)) } ].

Note that the size of the interval does not depend on q any more, because 100γ% of the future observations are meant to lie outside this tolerance interval.

Let us now turn to the case of x_{01}, …, x_{0q} being possibly different. We only need a minor modification of the above procedure to achieve this end: instead of using a single confidence interval for x_0'β, we can use the confidence band of x'β, derived in Section 5.2.4. This procedure leads to the simultaneous tolerance intervals

    [ x_{0i}'β̂ − σ̂ { √(rF_{r,n−r,α/2} x_{0i}'(X'X)⁻x_{0i}) + z_{γ/2} / √(χ²_{n−r,1−α/2}/(n − r)) },
      x_{0i}'β̂ + σ̂ { √(rF_{r,n−r,α/2} x_{0i}'(X'X)⁻x_{0i}) + z_{γ/2} / √(χ²_{n−r,1−α/2}/(n − r)) } ],

for i = 1, …, q. Note that these intervals contain, with probability 1 − α, at least 100(1 − γ)% of all replications of any combination of y_{01}, …, y_{0q} (on the average). Other simultaneous tolerance intervals can be found in Miller (1981, Section 3.4.2).
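The Bonferroni construction of the single-x_0 tolerance interval is easy to program. A minimal sketch, assuming the quantile conventions used above (χ²_{n−r,1−α/2} denotes the point with upper-tail probability 1 − α/2, i.e., a lower quantile):

    import numpy as np
    from scipy import stats

    def tolerance_interval(y, X, x0, alpha=0.05, gamma=0.10):
        """100(1-alpha)% tolerance interval for 100(1-gamma)% of future observations at x0."""
        n = len(y)
        r = np.linalg.matrix_rank(X)
        G = np.linalg.pinv(X.T @ X)
        beta_hat = G @ X.T @ y
        e = y - X @ beta_hat
        sigma_hat = np.sqrt(e @ e / (n - r))
        fit = x0 @ beta_hat
        # chi2.ppf(alpha/2, .) is the point with upper-tail probability 1 - alpha/2
        half = sigma_hat * (stats.t.ppf(1 - alpha / 4, n - r) * np.sqrt(x0 @ G @ x0)
                            + stats.norm.ppf(1 - gamma / 2)
                            / np.sqrt(stats.chi2.ppf(alpha / 2, n - r) / (n - r)))
        return fit - half, fit + half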
Figure 5.3  Loci of simultaneous tolerance intervals for Example 5.4.2 (population plotted against year)
Example 5.4.2  (World population data) For the world population data of Table 1.2, a confidence band for the regression line was obtained in Example 5.2.5. A set of simultaneous tolerance intervals for 90% of all predicted observations with coverage probability .95 (α = .05, γ = .1) is

    [ −158.3 + .0822x − (.02477√((x − 1990.5)²/665 + .05) + .00967),
      −158.3 + .0822x + (.02477√((x − 1990.5)²/665 + .05) + .00967) ].

The loci of the upper and lower limits of the tolerance intervals are plotted in Figure 5.3. The observed data are shown as dots. All the observed points lie within the band, which is wider than the confidence band for the regression line shown in Figure 5.2.  □
5.5 Consequences of collinearity*
We have considered in this chapter the problems of interval estimation, testing of hypotheses and prediction, assuming conditional normality of the response given the explanatory variables. Every single procedure described here can be affected by collinearity. It follows from the discussion of Section 4.12 that the confidence interval of p'β, described in Section 5.2, would be wide whenever p has a component along an eigenvector of X'X corresponding to one of its small eigenvalues. Likewise, the prediction interval of y_0, described in Section 5.4.2, would be wide whenever x_0 has such a component.

In order to study the effect of collinearity on tests of hypotheses, let Aβ = ξ be the testable part of a general hypothesis as per Proposition 5.3.6. According to Propositions 5.3.10 and 5.3.12, the numerator of the F-ratio for the GLRT is proportional to (Aβ̂ − ξ)'[D(Aβ̂ − ξ)/σ²]⁻(Aβ̂ − ξ). We assume for simplicity that X'X is nonsingular. Even if X'X is singular, the following argument holds with k replaced by ρ(X). Note that

    D(Aβ̂ − ξ)/σ² = A(X'X)⁻A' = Σ_{i=1}^k (1/λ_i)(Av_i)(Av_i)',

where λ_i and v_i, i = 1, …, k, are the eigenvalues and eigenvectors of X'X. If l is a vector of unit norm, then

    l'D(Aβ̂ − ξ)l/σ² = Σ_{i=1}^k (l'Av_i)²/λ_i.
Suppose that there is an I for which (I'Avi)2 is large while A; is small. (This means that the relation /'A/3 = Z'£, implied by the hypothesis A(3 — £, is such that A'I has a component along an eigenvector corresponding to a small eigenvalue of X'X.) In such a case, l'D(A/3—£)l/cr2 is large. Therefore, D(A/3 — £)/a2 must have at least one large eigenvalue. Recall from Proposition 5.3.9 that (AJ3-£y[D{AJ3-£)/(T2]-(A/3£) is the sum of squares of a few uncorrelated LZFs (under the restricted model) each having variance 1. These LZFs are all linear functions of A/3 —£, and are the additional LZFs arising from the restriction A/3 = £. We can construct a set of such LZFs in various ways. Here, we construct it in a way that helps us understand the effect of collinearity. We choose the LZFs as u-(A/3 - £), i = 1,2,... ,p(D(A0 - £)), where the Uj's are unit-norm eigenvectors of D(A/3 — £) corresponding to its nonzero eigenvalues, arranged in the decreasing order. Thus, we have
    (Aβ̂ − ξ)'[D(Aβ̂ − ξ)/σ²]⁻(Aβ̂ − ξ) = Σ_i σ²[u_i'(Aβ̂ − ξ)]² / Var(u_i'(Aβ̂ − ξ)) = Σ_i [u_i'(Aβ̂ − ξ)]² (σ²/κ_i),        (5.5.1)
where «j is the «th ordered eigenvalue of D(A0 — £). If some of the K;'S are large, the above sum would be small. The hypothesis A/3 = £ is a combination of several statements of the form p'/3 = po- We have already argued that whenever any of these implied statements is such that p has a component along an eigenvector corresponding to a small eigenvalue of X'X, at least one of the «j's would be large. Whenever this happens, the last expression of (5.5.1) is likely to have some small summands. This expression is proportional to the numerator of the F-ratio. On the other hand, the denominator of the F-ratio is not affected by collinearity, because RQ depends on X
only through the projection matrix Px, which is not a function of the eigenvalues of X'X. The impact of collinearity on the GLRT can be easily understood by following the above argument in the special case where A/3 is a single LPF, with A having a substantial component in the 'wrong' direction. In this case the sum of (5.5.1) consists of a single term with a large denominator, thus making it difficult to reject the null hypotheses. This explains the common experience of regression practitioners: the estimated coefficients of presumably 'important' variables often happen to be statistically insignificant when there is collinearity. (When £ is a scalar, one can use a i-statistic instead of an .F-statistic, as in Section 5.3.2. A simplified form of the above argument would hold in the case of the i-test.) If A/3 is a vector LPF, the presence of collinearity may make some summands of (5.5.1) small. The possible rejection of the hypothesis would then depend too much on the other terms. Note that the degrees of freedom for the numerator of the F-ratio is equal to the number of terms of this sum, which remains the same whether or not there is collinearity. Thus, some degrees of freedom may be wasted because of collinearity. In summary, a linear hypothesis can be thought of as a combination of statements. Some of these statements may be difficult to verify statistically because of collinearity. When a single statistic is used to test all these statements simultaneously, precious degrees of freedom are wasted in trying to test the statements which are difficult to verify. The rejection of the hypothesis may then depend unduly on the possible non-conformity of the data with the remaining statements, and may thus become less likely. In the extreme case of collinearity, a part of A/3 may not be estimable at all. Then there is an / such that A'I is an eigenvector of X'X corresponding to a zero eigenvalue, and Z'A/3 = l'£ is a completely nontestable hypothesis. An oversight of the nontestability leads to reduced chances of rejection of the null hypothesis (see Remark 5.3.13). One can also justify the above qualitative statements by analysing the power of the GLRT. The power is an increasing function of the
noncentrality parameter c given in Proposition 5.3.17. This parameter would tend to be small when the hypothesis Aβ = ξ implies some statements of the form p'β = p_0 where ‖Xp‖ is small.
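The mechanism described in this section can be seen numerically by examining the spectral decomposition of X'X. The sketch below (simulated, nearly collinear data; illustrative only) shows that Var(p'β̂)/σ² = Σ_i (p'v_i)²/λ_i explodes when p has a component along the eigenvector with a tiny eigenvalue.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100
    z = rng.normal(size=n)
    X = np.column_stack([np.ones(n), z, z + 0.01 * rng.normal(size=n)])  # near-collinear columns

    lam, V = np.linalg.eigh(X.T @ X)          # eigenvalues (ascending) and eigenvectors of X'X
    p = np.array([0.0, 1.0, -1.0])            # contrast aligned with the small-eigenvalue direction
    q = np.array([0.0, 1.0, 1.0])             # contrast roughly orthogonal to it
    var_factor = lambda v: sum((v @ V[:, i]) ** 2 / lam[i] for i in range(len(lam)))
    print(lam)                                 # one eigenvalue is tiny
    print(var_factor(p), var_factor(q))        # Var(p'beta_hat)/sigma^2 is huge; q's is not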
5.6 Exercises

5.1 Assuming the model of Exercise 1.5 (with normal errors) for the world population data of Table 1.2, find a 95% confidence interval for the yearly rate of growth of population.
5.2 Describe how you can construct a one- or two-sided confidence interval of σ², assuming y to be normal and using Proposition 5.1.1.
5.3 If p_1'β and p_2'β are estimable LPFs which are not multiples of one another, find a 100(1 − α)% elliptical confidence interval for the vector LPF (p_1'β : p_2'β : (p_1 + p_2)'β)', and show that it is the same as the corresponding ellipsoid for (p_1'β : p_2'β)'.
5.4 If simultaneous confidence intervals have to be provided for the means of all the observed responses in linear regression (assuming normal errors), which of the three confidence intervals described in Section 5.2.3 should be used? Why?
5.5 (a) Suppose that M is a nonnegative definite matrix. Consider the ellipsoidal region (θ − θ_0)'M⁻(θ − θ_0) ≤ 1, (θ − θ_0) ∈ C(M), and the hyperplane a'θ = c. Which values of c will ensure that the intersection of the hyperplane and the ellipsoid contains (i) no point, (ii) exactly one point, (iii) infinitely many points?
    (b) Use the result of part (a) to derive the Scheffé simultaneous confidence intervals given in (5.2.5).
5.6 Using the deep abdominal adipose tissue data of Table 5.5, plot the loci of 95% Bonferroni and Scheffé confidence intervals of the mean of log(AT) for the observed values of log(waist).
5.7 Let p_1'β and p_2'β be two estimable LPFs in the linear model (y, Xβ, σ²I) with normally distributed errors, and suppose that λ = p_1'β/p_2'β. Find a 95% confidence interval of λ in the following manner.
    (a) Find the mean and variance of a = p_1'β̂ − λp_2'β̂, where λ is the true value of the ratio of the parameters and p_1'β̂ and p_2'β̂ are BLUEs. Can a be called an LZF?
    (b) Determine the distribution of (a²/Var(a)) × (σ²/σ̂²), where σ̂² is the usual estimator of σ².
    (c) Show that the ratio of part (b) is less than a given constant if and only if a quadratic function in λ is negative. Using this fact, obtain a two-sided confidence interval for λ. [See Exercise 5.8 for an application.]
5.8 Response surface: continued from Exercise 1.8. Consider the quadratic regression model
    y_i = β_0 + β_1 x_i + β_2 x_i² + ε_i,   Var(ε_i) = σ²,   i = 1, …, n.
Assuming that β_2 > 0 and that the errors have the normal distribution, find a 95% confidence interval for the value of the explanatory variable which will minimize the expected response.
5.9 Prove the expression (5.2.7) for the 100(1 − α)% confidence band for the regression line in the special case of simple linear regression. Show that the band is the narrowest where the explanatory variable is equal to its sample average.
5.10 Using the deep abdominal adipose tissue data of Table 5.5, plot a 95% confidence band for the regression line of log(AT).
5.11 For the Cobb-Douglas model of Exercise 4.31, find an expression for the increase in the error sum of squares because of the restriction α + β = 1.
5.12 Show that the GLRT of H_0 : β_j = 0 against the alternative H_1 : β_j ≠ 0 described in Example 5.3.16 is equivalent to the test of (5.3.3) for a special choice of p and ξ.
5.13 Construct the ANOVA table and describe the GLRT for the hypothesis τ_1 = τ_2 in Example 4.1.8.
5.14 Suppose that you want to test the hypothesis β = ξ, where ξ is a specified vector. Construct the ANOVA table and describe the GLRT.
5.15 Using the model of Exercises 1.9-1.10, formulate the hypothesis of 'no change in slope at x_0' as a condition on the parameters of
the model. Calculate the p-value of the generalized likelihood ratio test for this hypothesis for the world population data of Table 1.2, assuming normal errors.
5.16 Using the model of Exercises 1.7-1.8, formulate the hypothesis of 'no discontinuity of the regression line at x_0' as a condition on the parameters of the model. Calculate the p-value of the generalized likelihood ratio test for this hypothesis for the world population data of Table 1.2, assuming normal errors.
5.17 Under the set-up of Section 5.3.4, show that if the hypothesis is algebraically consistent but only partially testable, the GLRT of Proposition 5.3.12 is valid, with m = ρ(A) + ρ(X) − ρ((X' : A')'). Does the size of the test reduce or increase when one incorrectly uses m = ρ(A)? What happens to the power of the test?
5.18 Suppose that you have data from two linear models, (y_1, X_1β_1, σ²I) and (y_2, X_2β_2, σ²I). The objective is to test the equality of the regression lines, that is, H_0 : β_1 = β_2. How will you formulate the testing problem and proceed to solve it?
5.19 Suppose, in the preceding problem, that X_i = (1_{n_i×1} : Z_i) and β_i' = (β_{0i} : θ_i'), i = 1, 2 (as in Example 5.3.15). The objective is to test the parallelity of the regression lines, that is, H_0 : θ_1 = θ_2. How will you formulate the testing problem and proceed to solve it?
5.20 Suppose that the restriction of Exercise 1.4(b) is posed as a hypothesis. Compute the p-value of the GLRT statistic for this hypothesis, using the world record running times data of Table 1.1 and assuming normality of errors. Can you conclude at the level .05 that the regression lines for the men's and women's data are parallel?
5.21 Lack of fit. Let there be n_i observations of the response (arranged as the n_i × 1 vector y_i) for a given combination of the explanatory variables (x_i), i = 1, …, m, n_1 + ⋯ + n_m = n. The plan is to check the adequacy of the model (y, Xβ, σ²I) through a formal test of lack-of-fit, assuming normal errors.
Here, y = (y_1' : ⋯ : y_m')' and X' = (x_1 1_{n_1×1}' : ⋯ : x_m 1_{n_m×1}'). Assume that m > r = ρ(X).
    (a) Show that the model (y, Xβ, σ²I) is a restricted version of another model, where the response for every given x_i is allowed to have an arbitrary mean.
    (b) Obtain the error sum of squares under the unrestricted model (pure error sum of squares).
    (c) Identify the restriction of part (a) as the hypothesis of adequate fit, and obtain an expression for the sum of squares for deviation from the hypothesis (lack of fit sum of squares).
    (d) Construct the ANOVA table.
    (e) Describe the GLRT for lack of fit.
5.22 Consider the hypothesis β ∝ b, where b is a specified vector. Show that it can be reformulated as a linear hypothesis. Construct the ANOVA table and describe the GLRT. [Hint: In this case R_H² may be easier to compute than R_H² − R_0². A reparametrization would simplify the restricted model.]
5.23 Consider the model (y, Xβ, σ²I) with normally distributed errors. Let A_{m×n}(u) be a matrix whose elements are possibly nonlinear functions of a vector u_{n×1}.
    (a) Show that E[A(Xβ)e] = 0. [Since it is a possibly nonlinear function of y, we can call it a generalized zero function (GZF).]
    (b) Obtain an expression for the dispersion of the GZF of part (a).
    (c) Obtain a set of transformed GZFs from those of part (a) such that (i) the transformed GZFs are independently distributed as N(0, σ²), and (ii) the original GZFs can be retrieved from the transformed GZFs via a reverse transformation.
    (d) Assuming that ρ(X : A(Xβ)) is a constant with probability 1, how can you scale the transformed GZFs of part (c) so that the distribution of their sum of squares (after scaling) is free of σ²? Describe this distribution. [See Section 6.3.2 for an application of this construction.]
5.24 Given the linear model (y, Xβ + Aθ, σ²I) with normally distributed errors, obtain the GLRT for the testable hypothesis θ = 0. Does the null distribution of the test statistic change if the elements of A are (possibly nonlinear) functions of Xβ, and the latter is replaced by Xβ̂ in the statistic? [You can use the result of Exercise 5.23 and assume that ρ(X : A(Xβ)) is a constant with probability 1.]
5.25 Using the world record running times data of Table 1.1 and assuming normality of errors, test the hypothesis of equality of the regressions of the men's and women's log-record times on log-distance, at the level .05.
5.26 Assuming the model of Exercise 1.5 (with normal errors) for the world population data of Table 1.2, find a 95% prediction interval for the midyear population of the world in 2001, and compare it with the actual midyear population.
5.27 Suppose that the response vector of the normal-error linear model ((y' : y_0')', (X' : X_0')'β, σ²I) is only partially observed, that is, y_0 is unobserved. The purpose of this exercise is to provide a region where y_0 must lie with probability 1 − α.
    (a) Show that y_0 is contained with probability 1 − α in the ellipsoidal 'prediction region'
        (y_0 − ŷ_0)'[I + X_0(X'X)⁻X_0']⁻(y_0 − ŷ_0) ≤ q σ̂² F_{q,n−r,α},
    where ŷ_0 = X_0β̂, β̂ is any least squares estimator of β, n and q are the numbers of elements of y and y_0, respectively, and r = ρ(X).
    (b) If X_0 = (x_{01} : ⋯ : x_{0q})' and y_0 = (y_{01} : ⋯ : y_{0q})', then justify the form of the Scheffé prediction intervals given in Section 5.4.3.
5.28 Using the deep abdominal adipose tissue data of Table 5.5, plot the loci of 95% simultaneous tolerance intervals of log (AT) for 90% of all values of log (waist) to be observed for a similar group of subjects in future.
Chapter 6
Analysis of Variance in Basic Designs
It has been observed that one of the invariables in our life is variability itself. All human beings do not have the same height. Yield of wheat per acre of land is not the same everywhere. Even the microchips, for which we desire very low tolerances, vary in their performance characteristics, due to factors beyond the manufacturer's control. In trying to understand variability we might ask what factors cause this variability. For instance, individual heights vary possibly because of our parents heights and other genetic factors, our sex, dietary and exercise habits and so on. Similarly the yield (per acre) of a wheat crop may vary depending on what variety of wheat it is, the soil type, the amount of water and rainfall, the amount and type of fertilizer used and so on. We would have achieved a better understanding of such variability in a given context and possibly help control it, if we are able to list most of the major causes for the variation and split the total variation into parts, each of which is attributable to a given cause. Since we can not possibly list all the causes, we expect at the end of such an exercise, a component that might be called the residual (left-over or unexplained) variation. Achieving such a decomposition of the total variability is the grand goal of the statistical tool called analysis of variance (ANOVA). The systematic and efficient conduct of an experiment which leads us to identify these components, is the subject of design of experiments. In this chapter we consider linear models for experiments where the explanatory variables are not necessarily given/observed quantities, but 191
are chosen from a finite set of values by careful design. For instance, in the agricultural yield experiment, one may be allowed the choice of amount of water, or fertilizer applied. When the matrix X of the linear model (y, X/3, a21) is thus designed, we call it the design matrix. We have already come across a rudimentary ANOVA in Chapter 5, where the variation present in the response is measured by the sum of squared deviations from the mean. The total sum of squares was decomposed into regression sum of squares and error sum of squares in Example 5.3.15. In the case of designed experiments, the matrix X usually has a simple structure. Depending on the model, which determines the design matrix, it is often possible to further decompose the regression sum of squares into components which are attributable to different identifiable sources of variation. This chapter deals with such decompositions for different practically important yet simple models, and testing whether any of the factors (sources of variation) contribute significantly to the total variation. Some of the factors under study may be of primary interest to the experimenter (for example, which variety of wheat gives the most yield?). However in order to control the error (or unexplained part), the other factors and their effects are also brought into the model. These effects account for some nuisance parameters. Much of the initial development in experimental designs came from agricultural research, see Fisher (1926). The different factors whose effects are being studied, are called treatments, as for example the different varieties of wheat or different manufacturing processes, from which we want to choose. The basic unit of material on which these treatments are applied, is called an experimental unit. For instance, a plot of land where various types of wheat can be grown, or a batch of steel produced by the manufacturing process, is an experimental unit. Finally a measure of the effectiveness of the treatment on the experimental unit is referred to as the yield, as for instance the weight of wheat per plot, or the tensile strength of steel produced by the manufacturing process. In the case of a linear model, each experimental unit corresponds to a case or observation, the yield is the response and treatments are explanatory variables. Without going deeply into the principles of experimental designs, we
give a flavour of optimal designs in the next section. In the subsequent sections we show how the analysis of some basic designs provides answers to some important questions. The reader will find the development somewhat different from that in other books. The material is tied up with the theory provided in Chapters 4 and 5, with special emphasis on interpretation through linear zero functions.
6.1 Optimal design
In order to draw inference on some parameters of a linear model (such as treatment effects), an objective function may be identified so that minimizing this function with respect to all possible designs would enhance the quality of inference. If there is only one parameter of interest, an obvious criterion is to minimize the inverse of the (Fisher) information for this parameter. If the interest is in all the parameters, a real-valued function of the information matrix is chosen as the objective function. Some popular criteria for optimal design are as follows. A-optimality: Minimize the sum of reciprocals of the positive eigenvalues of the information matrix. D-optimality: Minimize the product of reciprocals of the positive eigenvalues of the information matrix. E-optimality: Minimize the reciprocal of the smallest positive eigenvalue of the information matrix. Note that these objective functions are monotone non-increasing functions of the eigenvalues of the information matrix. If an information matrix is larger than another one in the sense of the Lowner order, then it follows from Proposition 2.6.2 that the design corresponding to the first information matrix is better according to each of the above criteria. Usually an optimal design has to be selected subject to some constraints. There are limits on the number of observations because of cost and other considerations. Another typical constraint is that some elements of the matrix X can only assume finitely many values. We have already come across such a problem in Example 4.11.2, where the D-optimal design was identified subject to the constraint that the total number of observations is 6 and the elements of X can be either 0 or 1.
In general, the task of finding an optimal design in the discrete case suffers from the handicap that derivatives cannot be used. See Shah and Sinha (1989) for a detailed discussion of optimal designs. Example 6.1.1 Suppose that the means of t populations (//i,..., fit) have to be estimated using a total of n samples from these, and the variance of all these samples is a2. Let the number of observations allocated to population i be n^, i = 1,... ,t, such that J2i=i — n- Let us find the A-, D- and E-optimal designs subject to the restriction that all the means are estimable, that is, n^ > 1 for i — 1 , . . . , t. By arranging the observations as a vector, we can use a linear model where the matrix X consist of Os and Is. There is exactly one 1 in each row of X and exactly n^ Is in the column corresponding to \X{. The information matrix is diagonal. The reciprocals of its diagonal elements (or eigenvalues) are cr 2 /ni,..., a2frit- Because of the well-known order between the arithmetic and geometric means of positive numbers, we have
    max( σ²/n_1, …, σ²/n_t ) ≥ (1/t)( σ²/n_1 + ⋯ + σ²/n_t ) ≥ ( ∏_{i=1}^t σ²/n_i )^{1/t} ≥ σ²/(n/t).
Any one of the inequalities holds with equality if and only if n\ = = nt — n/t, in which case all the inequalities hold with equality. This is only possible if n is a multiple of t. It is clear that the design which allocates equal number of samples to each population is at once E-, Aand D-optimal. Such a design is called a balanced design. The model discussed in the next section is a variation of the model of Example 6.1.1, but the design is not assumed to be balanced.
6.2 One-way classified data

6.2.1 The model
Suppose that we are asked to compare t types of treatments and are given n experimental units on which to conduct the experiment. One of
the simple ways to do this is to select integers n_1, n_2, …, n_t such that n_1 + n_2 + ⋯ + n_t = n and apply treatment i to n_i units, i = 1, …, t (different treatments being applied to different sets of units). If the units are known to be homogeneous, that is, if there is no factor (other than the treatment itself) which may cause some of them to have a different mean response than others, then the allocation of the units to the t groups may be done at random. This design is called a completely randomized design (CRD). Clearly we would want n_i ≥ 2 for each i, so that each treatment is replicated, providing a measure of internal variability. If no other prior information is available, it might well be best to subdivide n into t approximately equal parts and apply each of these treatments to an equal number of units (see Example 6.1.1).

Let y_ij denote the yield of the jth experimental unit which received the ith treatment, j = 1, …, n_i, i = 1, …, t. The yield may be modelled as

    y_ij = μ + τ_i + ε_ij,   j = 1, …, n_i,   i = 1, …, t,
    E(ε_ij) = 0,   Cov(ε_ij, ε_kl) = σ² if (i, j) = (k, l), and 0 otherwise.        (6.2.1)
The parameter μ represents the baseline response that might be present in all the units, and τ_1, …, τ_t are the additive effects of the t different treatments. We refer to these parameters as the baseline effect and the treatment effects, respectively. The matrix-vector form of the model is

    y_{n×1} = X_{n×(t+1)} β_{(t+1)×1} + ε_{n×1},   E(ε) = 0,   D(ε) = σ²I,

where

    y = (y_11 ⋯ y_{1n_1} : y_21 ⋯ y_{2n_2} : ⋯ : y_{t1} ⋯ y_{tn_t})',
    ε = (ε_11 ⋯ ε_{1n_1} : ε_21 ⋯ ε_{2n_2} : ⋯ : ε_{t1} ⋯ ε_{tn_t})',
    β = (μ : τ_1 : ⋯ : τ_t)',

    X = [ 1_{n_1×1}  1_{n_1×1}  0_{n_1×1}  ⋯  0_{n_1×1}
          1_{n_2×1}  0_{n_2×1}  1_{n_2×1}  ⋯  0_{n_2×1}
            ⋮           ⋮          ⋮             ⋮
          1_{n_t×1}  0_{n_t×1}  0_{n_t×1}  ⋯  1_{n_t×1} ].
The model can be reparametrized by defining μ_i = μ + τ_i, i = 1, …, t, so that the new parameters μ_1, …, μ_t are estimable. This reparametrization highlights the crucial fact that the one-way classified data model (6.2.1) may arise not only from a CRD, but also from other problems such as the comparison of the population means of t samples, considered in Example 6.1.1. We shall proceed with estimation without this reparametrization.
6.2.2 Estimation of model parameters
Note that the first column of the matrix X is the sum of the remaining columns, the latter being linearly independent. Therefore, p(X) = t. It is clear that not all linear parametric functions of this model are estimable. In fact, none of the individual parameters of the model is estimable. Comparison of the treatments under the model (6.2.1) amounts to comparing the treatment effects T\ , . . . , r^ associated with the treatments. If these are the parameters of interest, /i is only a nuisance parameter. Let us first find out which linear functions of the treatment effects are estimable. If c = (c\ : : Q ) ' , then Proposition 4.10.1 provides a necessary and sufficient condition for the estimability of the function YA=I ciTi- The condition is c G C(X[(I — Px )), where X2 = l n x i (the first column of X) and X\ is the remaining part of X. To simplify this condition, note that (by Proposition 2.4.4)
    dim(C(X_1'(I − P_{X_2}))) = ρ(X_1'(I − P_{X_2})) = ρ(X_1 : X_2) − ρ(X_2) = t − 1.

Therefore, dim(C(X_1'(I − P_{X_2}))^⊥) = 1. It is easy to see that 1 ∈ C(X_1'(I − P_{X_2}))^⊥. Hence, C(X_1'(I − P_{X_2}))^⊥ = C(1). Consequently Σ_{i=1}^t c_iτ_i is estimable if and only if c'1 = 0. A function of τ_1, …, τ_t which satisfies this property is called a treatment contrast. The above discussion reveals that the only linear functions of the treatment effects which are estimable are treatment contrasts. Examples of treatment contrasts are pairwise differences like τ_1 − τ_2, which bring out the contrast between the effects of two treatments. Linear combinations of such
differences are also treatment contrasts. In fact, whenever c'1 = 0, the LPF c'τ is a linear combination of differences of the type τ_i − τ_j (see Exercise 6.3), thus justifying the name 'treatment contrast'. Treatment contrasts are not the only linear parametric functions of β which are estimable. It is obvious that μ + τ_i is estimable for every i. In order to find the BLUE of Xβ, note that

    P_X = P_{X_1} = X_1(X_1'X_1)⁻X_1' = X_1 diag(n_1, …, n_t)^{−1} X_1' = diag(P_{1_{n_1×1}}, …, P_{1_{n_t×1}}),

the last matrix being block-diagonal. It follows that

    ŷ = Xβ̂ = (ȳ_1· 1_{n_1×1}' : ⋯ : ȳ_t· 1_{n_t×1}')',   where  ȳ_i· = (1/n_i) Σ_{j=1}^{n_i} y_ij,  i = 1, …, t.        (6.2.2)
UlnJ In other words, for any i and j the fitted value of the yield yij is the average of the yield in the ith treatment group, y^. The corresponding residual is y^ — y io the deviation from the group mean. All linear zero functions of the model are linear functions of these residuals. It follows that the error sum of squares and the usual unbiased estimator of a2 are
    R_0² = ε̂'ε̂ = Σ_{i=1}^t Σ_{j=1}^{n_i} (y_{ij} − ȳ_{i.})²,                      (6.2.3)

    σ̂² = R_0²/(n − t).                                                              (6.2.4)
The dispersion matrix of the vector of fitted values is

    D(Xβ̂) = σ² P_X = σ² block-diag(P_{1_{n_1}}, P_{1_{n_2}}, ..., P_{1_{n_t}}).      (6.2.5)
Let τ = (τ_1 : ··· : τ_t)'. The BLUE of a treatment contrast c'τ can be obtained from the formulae given in Chapter 4. However, there is a simpler way of obtaining the BLUE of any contrast and its variance from those of the fitted values. From the definition of a treatment contrast it follows that c'τ = c'(μ1 + τ). The vector μ1 + τ is completely estimable, as its ith element is the expected value of y_{i1}. Therefore, the BLUE of the ith element of μ1 + τ is ȳ_{i.}. Consequently the BLUE of c'τ is

    Σ_{i=1}^t c_i ȳ_{i.}.                                                            (6.2.6)

Further, (6.2.5) implies that

    Var(Σ_{i=1}^t c_i ȳ_{i.}) = Σ_{i=1}^t c_i² Var(ȳ_{i.}) = σ² Σ_{i=1}^t c_i²/n_i.   (6.2.7)
For instance, the BLUE of the contrast τ_1 − τ_2 is ȳ_{1.} − ȳ_{2.}, and its variance is σ²/n_1 + σ²/n_2. An extension of the above arguments shows that the covariance between the BLUEs of the contrasts Σ_{i=1}^t c_{1i} τ_i and Σ_{i=1}^t c_{2i} τ_i is

    Cov(Σ_{i=1}^t c_{1i} ȳ_{i.}, Σ_{i=1}^t c_{2i} ȳ_{i.}) = σ² Σ_{i=1}^t c_{1i} c_{2i}/n_i.
Let ȳ_{..} = n^{-1} y_{..}, where y_{..} is the sum of all the observations. Since

    ȳ_{..} = n^{-1} Σ_{i=1}^t Σ_{j=1}^{n_i} y_{ij} = n^{-1} Σ_{i=1}^t n_i ȳ_{i.},

that is, ȳ_{..} is a linear combination of the fitted values, it must be the BLUE of its expectation, which is μ + Σ_{i=1}^t (n_i/n) τ_i. Sometimes the redundancy of the parameters of (6.2.1) is sought to be removed by introducing the 'side-condition' Σ_{i=1}^t (n_i/n) τ_i = 0. This restriction is a model-preserving constraint (see page 122), so it does not affect the BLUEs of estimable functions. Under this restriction, all the parameters of the model are estimable, and ȳ_{..} is the BLUE of the baseline mean yield (μ). Its variance works out to be σ²/n.
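The computations of (6.2.3), (6.2.4), (6.2.6) and (6.2.7) are easy to carry out directly. The following sketch (not part of the original text, with made-up data) obtains the BLUE of the contrast τ_1 − τ_2 and its estimated variance.

    import numpy as np

    # Hypothetical one-way data: one array of yields per treatment group
    groups = [np.array([6.7, 7.8, 5.5, 8.4]),
              np.array([9.9, 8.4, 10.4, 9.3]),
              np.array([10.4, 8.1, 10.6, 8.7])]
    t = len(groups)
    n = sum(len(g) for g in groups)

    means = np.array([g.mean() for g in groups])
    R0_sq = sum(((g - g.mean())**2).sum() for g in groups)   # within-group SS, (6.2.3)
    sigma2_hat = R0_sq / (n - t)                             # (6.2.4)

    # BLUE of the contrast tau_1 - tau_2 and its estimated variance, per (6.2.6)-(6.2.7)
    c = np.array([1.0, -1.0, 0.0])        # c'1 = 0, so c'tau is a treatment contrast
    blue = c @ means
    var_hat = sigma2_hat * sum(c[i]**2 / len(groups[i]) for i in range(t))
    print(blue, var_hat)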
6.2.3  Analysis of variance
The main hypothesis of interest here is that of 'no difference in treatment effects',

    H_0 : τ_1 = τ_2 = ··· = τ_t.

This hypothesis can be rephrased as

    H_0 :  ( τ_1 − τ_2 )
           ( τ_1 − τ_3 )  =  0,
           (     ⋮     )
           ( τ_1 − τ_t )

or as Aβ = 0, where

    A_{(t−1)×(t+1)} = ( 0  1  −1   0  ···   0 )
                      ( 0  1   0  −1  ···   0 )
                      ( ⋮  ⋮   ⋮   ⋮   ⋱    ⋮ )
                      ( 0  1   0   0  ···  −1 ).
The error sum of squares under the hypothesis (R_H²) is very easy to calculate, since the restriction Aβ = 0 reduces (6.2.1) to a model with a common mean for all the observations. Therefore,

    R_H² = min_{β: Aβ=0} ||y − Xβ||² = Σ_{i=1}^t Σ_{j=1}^{n_i} (y_{ij} − ȳ_{..})².         (6.2.8)

Thus, R_H² is the sum of squared deviations from the grand mean of all the observations. We can also find an interpretable expression for R_H² − R_0², as follows.
    R_H² = Σ_{i=1}^t Σ_{j=1}^{n_i} (y_{ij} − ȳ_{..})²
         = Σ_{i=1}^t Σ_{j=1}^{n_i} (y_{ij} − ȳ_{i.} + ȳ_{i.} − ȳ_{..})²
         = Σ_{i=1}^t Σ_{j=1}^{n_i} (y_{ij} − ȳ_{i.})² + Σ_{i=1}^t Σ_{j=1}^{n_i} (ȳ_{i.} − ȳ_{..})²,     (6.2.9)
since the cross terms, when summed over j, reduce to zero. Comparing the first term of the last expression with (6.2.3), we find that it is equal to R_0². Therefore,
    R_H² − R_0² = Σ_{i=1}^t Σ_{j=1}^{n_i} (ȳ_{i.} − ȳ_{..})² = Σ_{i=1}^t n_i (ȳ_{i.} − ȳ_{..})².       (6.2.10)
The summands of the right-hand side are squared deviations of the group means from the grand mean. Each of these deviations is a BLUE under the alternative hypothesis, Aβ ≠ 0. These BLUEs turn into LZFs under H_0. The above sum is called the between-groups sum of squares, as it captures the variation across the group means. The expression for R_0² is called the within-group sum of squares, for obvious reasons. The sum of these two, R_H², is the total sum of squares.

The between-groups sum of squares represents the departure from the null hypothesis of equal group means. Thus, it makes intuitive sense that a large value of this quantity, relative to the within-group sum of squares, would lead to rejection of the null hypothesis. The generalized likelihood ratio test (GLRT) for H_0, under the assumption of normality of the errors, reduces to this criterion. It follows from the discussion of Section 5.3.4 that the GLRT rejects H_0 when
    [(R_H² − R_0²)/(t − 1)] / [R_0²/(n − t)]  >  F_{t−1, n−t, α},

where F_{t−1,n−t,α} is the (1 − α) quantile of the F_{t−1,n−t} distribution. The analysis of variance for the model (6.2.1) is given in Table 6.1. The GLRT statistic is obtained from this table as the ratio of the mean squares, MS_g/MS_w.
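A minimal numerical sketch of this GLRT (not part of the original text; the data are hypothetical) is given below. It computes the between-groups and within-group sums of squares of Table 6.1 and the F statistic with its p-value.

    import numpy as np
    from scipy import stats

    # Hypothetical one-way classified data: one array of responses per treatment
    groups = [np.array([5.9, 6.4, 6.7, 6.2]),
              np.array([7.1, 7.9, 7.4, 7.6]),
              np.array([6.8, 6.6, 7.0, 6.5])]
    t = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()

    # Between-groups and within-group sums of squares, as in Table 6.1
    SS_between = sum(len(g) * (g.mean() - grand_mean)**2 for g in groups)
    SS_within  = sum(((g - g.mean())**2).sum() for g in groups)

    MS_between = SS_between / (t - 1)
    MS_within  = SS_within / (n - t)
    F = MS_between / MS_within                      # GLRT statistic MS_g / MS_w
    p_value = stats.f.sf(F, t - 1, n - t)           # upper-tail F_{t-1, n-t} probability
    print(F, p_value)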
6.2.4  Multiple comparisons of group means
When the null hypothesis of 'equal group means' is rejected, we need to investigate further which of the treatments are better than others. The multiple comparison techniques described in Section 5.3.7 can be used for this purpose. Apart from these general methods, there are some methods designed specifically for one-way classified data; we describe two such methods below.
    Source           Sum of Squares                                        Degrees of Freedom   Mean Square
    Between groups   Σ_{i=1}^t n_i (ȳ_{i.} − ȳ_{..})²                      t − 1                MS_g
    Within groups    R_0² = Σ_{i=1}^t Σ_{j=1}^{n_i} (y_{ij} − ȳ_{i.})²     n − t                MS_w = R_0²/(n − t)
    Total            Σ_{i=1}^t Σ_{j=1}^{n_i} (y_{ij} − ȳ_{..})²            n − 1

    Table 6.1  ANOVA for model (6.2.1) of one-way classified data
If τ_i = τ_j, then ȳ_{i.} − ȳ_{j.} has the N(0, σ²(n_i^{-1} + n_j^{-1})) distribution, and (ȳ_{i.} − ȳ_{j.})/[σ̂²(n_i^{-1} + n_j^{-1})]^{1/2} has the t-distribution with n − t degrees of freedom.^a In the absence of other comparisons, the hypothesis τ_i = τ_j should be rejected at the level α when |ȳ_{i.} − ȳ_{j.}| is larger than
    LSD_{ij} = t_{n−t, α/2} √(σ̂²(n_i^{-1} + n_j^{-1})).
One can use this criterion simultaneously for the differences of several pairs of group means, provided that the hypothesis of 'equal group means' has been rejected. If all the group means are equal, then the probability of erroneous rejection of the hypothesis τ_1 = ··· = τ_t through the level-α GLRT is exactly equal to α. Hence, the probability of erroneous rejection of this hypothesis, followed by the rejection of any one of the hypotheses of the type τ_i = τ_j, is at most α. The very fact that the GLRT precedes the multiple comparisons protects the level of the latter. The cut-off LSD_{ij} is called Fisher's protected least significant
^a This is the only section of the book where the notation t is used both for the quantile of a distribution and for the number of treatments. The quantile always appears with a subscript.
difference (PLSD), when it is used in this manner.

Tukey suggested a procedure which can be used without carrying out the GLRT, in the special case n_1 = ··· = n_t. Under the hypothesis of equal group means, the scaled group averages ȳ_{1.}/√(σ̂²/n_1), ..., ȳ_{t.}/√(σ̂²/n_t) have the multivariate t-distribution with parameters t and n − t (see page 154). The null distribution of the range of these ratios,

    max_{1≤i≤t} ȳ_{i.}/√(σ̂²/n_i)  −  min_{1≤i≤t} ȳ_{i.}/√(σ̂²/n_i),

is called the studentized range distribution with parameters t and n − t. If q_{t,n−t,α} is the (1 − α) quantile of this distribution, then the hypotheses τ_i = τ_j, i, j = 1,...,t, i ≠ j, can be tested simultaneously at the level α by checking if |ȳ_{i.} − ȳ_{j.}| is larger than

    HSD = q_{t,n−t,α} √(σ̂²/n_1).
The cut-off HSD is called Tukey's honestly significant difference (HSD). Tables of percentage points of the studentized range distribution are given by Harter (1960). When n_1, ..., n_t are unequal, this procedure needs modification. The Tukey-Kramer method is to use the HSD as the cut-off for |ȳ_{i.} − ȳ_{j.}|, with 1/n_1 replaced by (1/n_i + 1/n_j)/2. See Miller (1981, Chapter 2), Kshirsagar (1983, Chapter 6) and Hochberg and Tamhane (1987) for more information on these and other multiple comparison methods for one-way classified data.
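The following sketch (not part of the original text) applies Tukey's HSD to hypothetical balanced data. It assumes a reasonably recent SciPy, whose scipy.stats.studentized_range distribution supplies the quantile q_{t,n−t,α}.

    import numpy as np
    from scipy import stats

    # Hypothetical balanced one-way data with equal group sizes, as Tukey's HSD assumes
    groups = [np.array([5.9, 6.4, 6.7, 6.2]),
              np.array([7.1, 7.9, 7.4, 7.6]),
              np.array([6.8, 6.6, 7.0, 6.5])]
    t, m = len(groups), len(groups[0])
    n = t * m
    alpha = 0.05

    means = np.array([g.mean() for g in groups])
    sigma2_hat = sum(((g - g.mean())**2).sum() for g in groups) / (n - t)

    # Studentized range quantile q_{t, n-t, alpha}; requires SciPy >= 1.7
    q = stats.studentized_range.ppf(1 - alpha, t, n - t)
    HSD = q * np.sqrt(sigma2_hat / m)

    for i in range(t):
        for j in range(i + 1, t):
            diff = abs(means[i] - means[j])
            print(i + 1, j + 1, diff, diff > HSD)   # True indicates a significant difference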
6.3  Two-way classified data
As we mentioned earlier in this chapter, our main interest is often in checking whether the treatments have equal effects and, in the event that they are not all the same, in picking the best among them. However, in any such study we cannot ignore other major factors that contribute significantly to the total variability, since if we do, this contribution will become part of the residual or error sum of squares. Error control should therefore be part of any good design. To this end, we might separate the experimental units
into groups, called blocks, which are homogeneous with respect to all non-treatment factors, and replicate the complete set of t treatments inside each block. For instance, in the agricultural experiment where we wish to find which variety of wheat is best, one may see that the different levels of soil fertility in the experimental units are an important source of variability. Then we divide the available experimental units into blocks so that soil fertility within each block is nearly equal. This is referred to as blocking.

6.3.1  Single observation per cell
Let us first consider the simple case where we have n = tb experimental units and these are divided into b blocks of t units each. We then randomly assign the t treatments, one to each unit inside the block, so that each block represents a complete replication of the treatments. This type of design is called a randomized block design (RBD) with one observation per cell, a cell representing a combination of treatment and block. Let y_{ij} denote the yield on the ith treatment in the jth block, i = 1,...,t, j = 1,...,b. We may assume the simple additive effects model

    y_{ij} = μ + τ_i + β_j + ε_{ij},   i = 1,...,t,  j = 1,...,b,
    E(ε_{ij}) = 0,
    Cov(ε_{ij}, ε_{i'j'}) = σ²  if i = i' and j = j',
                          = 0   otherwise,                                      (6.3.1)

where μ represents the baseline effect, τ_i the effect of the ith treatment and β_j the effect of the jth block. The matrix-vector form of the model is

    y_{n×1} = X_{n×(t+b+1)} β_{(t+b+1)×1} + ε_{n×1},   E(ε) = 0,   D(ε) = σ²I,
where

    y = (y_{11} : ··· : y_{1b} : y_{21} : ··· : y_{2b} : ··· : y_{t1} : ··· : y_{tb})',

    X = (1_{n×1} : I_{t×t} ⊗ 1_{b×1} : 1_{t×1} ⊗ I_{b×b})

      = ( 1_{b×1}  1_{b×1}  0_{b×1}  ···  0_{b×1}  I_{b×b} )
        ( 1_{b×1}  0_{b×1}  1_{b×1}  ···  0_{b×1}  I_{b×b} )
        (    ⋮        ⋮        ⋮      ⋱      ⋮        ⋮    )
        ( 1_{b×1}  0_{b×1}  0_{b×1}  ···  1_{b×1}  I_{b×b} ),

    β = (μ : τ_1 : ··· : τ_t : β_1 : ··· : β_b)',

    ε = (ε_{11} : ··· : ε_{1b} : ε_{21} : ··· : ε_{2b} : ··· : ε_{t1} : ··· : ε_{tb})'.
In the above, the notation '⊗' signifies the Kronecker product. The first column of X is equal to the sum of the next t columns and also to the sum of the last b columns. It can be verified that the last t + b − 1 columns of X are linearly independent. Thus, ρ(X) = t + b − 1. Sometimes it is assumed that Σ_{i=1}^t τ_i = 0 and Σ_{j=1}^b β_j = 0. If these restrictions are imposed, then the baseline effect μ can be interpreted as the overall mean effect. These are model-preserving constraints which make all the parameters estimable. However, we shall not need any such restriction for our analysis.

As indicated earlier, our main interest is in treatment effects. Let c = (c_1 : ··· : c_t)' and τ = (τ_1 : ··· : τ_t)'. According to Exercise 4.32, c'τ is estimable if and only if c'1 = 0. Thus, treatment contrasts turn out to be the only estimable linear functions of the treatment parameters, as in the case of the CRD.

In order to find the BLUE of Xβ, let us determine P_X. Since

    C(X) = C(1_{n×1} : I_{t×t} ⊗ 1_{b×1} : 1_{t×1} ⊗ I_{b×b})
         = C(1_{n×1} : (I − t^{-1}11') ⊗ 1_{b×1} : 1_{t×1} ⊗ (I − b^{-1}11')),          (6.3.2)

and the three sets of columns of the last matrix are orthogonal, we have^b

    P_X = P_{1_{n×1}} + P_{(I − t^{-1}11') ⊗ 1} + P_{1 ⊗ (I − b^{-1}11')}
        = n^{-1}11' + (I − t^{-1}11') ⊗ (b^{-1}11') + (t^{-1}11') ⊗ (I − b^{-1}11').
^b Here, as well as in the case of some other designs to follow, decomposition of the projection matrix as the sum of several projection matrices plays a crucial role in obtaining the analysis of variance. The algebra of this decomposition may initially appear to be involved, but the compact representation opens the door for analysis of more complicated designs. The benefits are sure to outweigh the effort needed to get used to this approach. The decomposition also corresponds to a reparametrization; see Exercise 6.13.
The above decomposition of P_X for the two-way classification model implies

    Xβ̂ = P_X y = ȳ_{..} 1_{n×1} + ((ȳ_{1.} − ȳ_{..}) : ··· : (ȳ_{t.} − ȳ_{..}))' ⊗ 1_{b×1} + 1_{t×1} ⊗ ((ȳ_{.1} − ȳ_{..}) : ··· : (ȳ_{.b} − ȳ_{..}))',     (6.3.3)

where a 'dot' in the subscript indicates averaging over the corresponding index. In particular, the fitted value and residual for y_{ij} are

    ŷ_{ij} = ȳ_{..} + (ȳ_{i.} − ȳ_{..}) + (ȳ_{.j} − ȳ_{..}) = ȳ_{i.} + ȳ_{.j} − ȳ_{..},     (6.3.4)
    ε̂_{ij} = y_{ij} − ȳ_{i.} − ȳ_{.j} + ȳ_{..}.                                             (6.3.5)
The three terms in the first expression of ŷ_{ij} are the BLUEs of the following estimable functions: the grand mean (μ + t^{-1} Σ_{i=1}^t τ_i + b^{-1} Σ_{j=1}^b β_j), the deviation of the ith treatment effect from the mean treatment effect (τ_i − t^{-1} Σ_{l=1}^t τ_l) and the deviation of the jth block effect from the mean block effect (β_j − b^{-1} Σ_{l=1}^b β_l), respectively. Under the model-preserving constraints Σ_{i=1}^t τ_i = 0 and Σ_{j=1}^b β_j = 0, these three terms are the BLUEs of μ, τ_i and β_j, respectively. It follows from the expression of P_X that

    D(Xβ̂) = σ² [n^{-1}11' + (I − t^{-1}11') ⊗ (b^{-1}11') + (t^{-1}11') ⊗ (I − b^{-1}11')].      (6.3.6)

The error sum of squares and the usual unbiased estimator of σ² are (see (6.3.5))
    R_0² = Σ_{i=1}^t Σ_{j=1}^b (y_{ij} − ȳ_{i.} − ȳ_{.j} + ȳ_{..})²,                     (6.3.7)

    σ̂² = R_0²/(n − t − b + 1).                                                           (6.3.8)
As in the case of the CRD, an important hypothesis of interest is that of 'no difference in treatment effects',

    H_0 : τ_1 = τ_2 = ··· = τ_t.
The hypothesis can be written as Aβ = 0, where

    A_{(t−1)×(t+b+1)} = ( 0  1  −1   0  ···   0   0  ···  0 )
                        ( 0  1   0  −1  ···   0   0  ···  0 )
                        ( ⋮  ⋮   ⋮   ⋮   ⋱    ⋮   ⋮       ⋮ )
                        ( 0  1   0   0  ···  −1   0  ···  0 ).
Under H_0, (6.3.1) reduces to a model with only a block effect and no treatment effect. This is a version of (6.2.1), with blocks assuming the role of treatments. Therefore, the error sum of squares under H_0 is (see (6.2.3))
    R_H² = Σ_{i=1}^t Σ_{j=1}^b (y_{ij} − ȳ_{.j})².                                       (6.3.9)

This simplifies to

    R_H² = Σ_{i=1}^t Σ_{j=1}^b ε̂_{ij}² + Σ_{i=1}^t Σ_{j=1}^b (ȳ_{i.} − ȳ_{..})² = R_0² + b Σ_{i=1}^t (ȳ_{i.} − ȳ_{..})².

The cross-terms vanish because Σ_{j=1}^b ε̂_{ij} = 0 for i = 1,...,t. Consequently,

    R_H² − R_0² = b Σ_{i=1}^t (ȳ_{i.} − ȳ_{..})².                                         (6.3.10)
Note that the deviation of the ith treatment effect from the mean, (ȳ_{i.} − ȳ_{..}), is a BLUE of (6.3.1) which turns into an LZF under H_0, for i = 1,...,t. Thus, R_H² − R_0² can be called the sum of squares due to differences between treatments (S_τ). In contrast to the case of one-way classified data, R_H² (corresponding to the null hypothesis H_0) in the present case is not the total sum of squares. The difference between the total sum of squares (S_t) and R_H² is
    Σ_{i=1}^t Σ_{j=1}^b (y_{ij} − ȳ_{..})² − R_H²
        = Σ_{i=1}^t Σ_{j=1}^b (y_{ij} − ȳ_{.j})² + Σ_{i=1}^t Σ_{j=1}^b (ȳ_{.j} − ȳ_{..})² − R_H²
        = t Σ_{j=1}^b (ȳ_{.j} − ȳ_{..})²,

which is the sum of squares due to differences between blocks (S_β). The summands of the last expression are squares of BLUEs of (6.3.1). Thus, the total sum of squares (S_t) can be decomposed into three sums of squares: that arising (i) due to differences between block effects (a sum of squared BLUEs), (ii) due to differences between treatment effects (a sum of squares of BLUEs that turn into LZFs under H_0) and (iii) the error sum of squares (a sum of squared LZFs).

The number of degrees of freedom associated with the sum of squares due to differences between treatment effects is equal to the number of linearly independent LZFs (under H_0) contained in the BLUE of Aβ, which is equal to ρ(A), that is, t − 1. By the symmetry of the block and treatment effects in the model (6.3.1), the degrees of freedom associated with the sum of squares due to differences between block effects is b − 1. The degrees of freedom associated with the error sum of squares is n − ρ(X) = n − t − b + 1 = (b − 1)(t − 1). Thus, we have the detailed analysis of variance given in Table 6.2.

The GLRT statistic for H_0 is given by MS_τ/MS_e, which simplifies to (b − 1)S_τ/R_0². This statistic has the F_{t−1,(b−1)(t−1)} distribution under H_0. By analogy, one can also obtain the GLRT for the hypothesis of 'no difference in block effects'. The test statistic, (t − 1)S_β/R_0², has the F_{b−1,(b−1)(t−1)} distribution under this null hypothesis. When two-way classified data arise from an RBD, this test can be used to determine whether heterogeneity due to the non-treatment factors has been effectively controlled by 'blocking'.
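As an illustration (not part of the original text), the sketch below computes the quantities of Table 6.2 for a hypothetical t × b layout and carries out the two GLRTs just described.

    import numpy as np
    from scipy import stats

    # Hypothetical yields y[i, j] for t = 3 treatments (rows) and b = 4 blocks (columns)
    y = np.array([[4.8, 5.2, 5.6, 5.0],
                  [6.1, 6.4, 6.9, 6.3],
                  [5.4, 5.9, 6.2, 5.7]])
    t, b = y.shape
    n = t * b

    row_means = y.mean(axis=1)       # treatment means
    col_means = y.mean(axis=0)       # block means
    grand = y.mean()

    S_tau = b * ((row_means - grand)**2).sum()               # between treatments
    S_beta = t * ((col_means - grand)**2).sum()              # between blocks
    resid = y - row_means[:, None] - col_means[None, :] + grand
    R0_sq = (resid**2).sum()                                 # error SS, (6.3.7)

    MS_e = R0_sq / ((b - 1) * (t - 1))
    F_tau = (S_tau / (t - 1)) / MS_e                         # GLRT for equal treatment effects
    F_beta = (S_beta / (b - 1)) / MS_e                       # GLRT for equal block effects
    print(F_tau, stats.f.sf(F_tau, t - 1, (b - 1) * (t - 1)))
    print(F_beta, stats.f.sf(F_beta, b - 1, (b - 1) * (t - 1)))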
6.3.2  Interaction in two-way classified data
Suppose that within-block heterogeneity has been removed by appropriate 'blocking'. The model (6.3.1) implies that the block and treatment effects are additive. An implication of this model is that a 'good' block is good for all varieties of treatment, and a good treatment would result in the same incremental improvement of mean response in all the blocks.
    Source               Sum of Squares                                               Degrees of Freedom   Mean Square
    Between treatments   S_τ = b Σ_{i=1}^t (ȳ_{i.} − ȳ_{..})²                          t − 1                MS_τ = S_τ/(t − 1)
    Between blocks       S_β = t Σ_{j=1}^b (ȳ_{.j} − ȳ_{..})²                          b − 1                MS_β = S_β/(b − 1)
    Error                R_0² = Σ_{i=1}^t Σ_{j=1}^b (y_{ij} − ȳ_{i.} − ȳ_{.j} + ȳ_{..})²   (b − 1)(t − 1)   MS_e = R_0²/((b − 1)(t − 1))
    Total                S_t = Σ_{i=1}^t Σ_{j=1}^b (y_{ij} − ȳ_{..})²                  n − 1

    Table 6.2  ANOVA for model (6.3.1) of two-way classified data
This assumption may not always hold in practice. For instance, a fertilizer may be very good for the yield of a crop when it is grown on a particular type of soil, but not as good when a different soil type is used. This kind of departure from the additive model (6.3.1) is known as an interaction effect between the block and treatment effects.

The effect of interaction can be taken into account by introducing an interaction term in (6.3.1). The resulting model is

    y_{ij} = μ + τ_i + β_j + γ_{ij} + ε_{ij},   i = 1,...,t,  j = 1,...,b,
    E(ε_{ij}) = 0,   i = 1,...,t,  j = 1,...,b,
    Cov(ε_{ij}, ε_{i'j'}) = σ²  if i = i' and j = j',
                          = 0   otherwise,                                      (6.3.11)

where γ_{ij} represents the interaction effect, and the other parameters are as in (6.3.1). The design matrix corresponding to this model has rank n. There are effectively n parameters, one for each observation. The model (6.3.11) is called a saturated model.
The model (6.3.11) is not an adequate vehicle for testing whether an interaction is present. This is because there are so many parameters affecting the mean response that, after estimating these parameters, one no longer has the degrees of freedom to estimate σ². As a result, one cannot test for the significance of any parametric function in such a model. A simpler model which incorporates a limited type of interaction is

    y_{ij} = μ + τ_i + β_j + λ(τ_i − τ̄)(β_j − β̄) + ε_{ij},   i = 1,...,t,  j = 1,...,b,
    E(ε_{ij}) = 0,   i = 1,...,t,  j = 1,...,b,
    Cov(ε_{ij}, ε_{i'j'}) = σ²  if i = i' and j = j',
                          = 0   otherwise,                                      (6.3.12)
where τ̄ = t^{-1} Σ_{i=1}^t τ_i, β̄ = b^{-1} Σ_{j=1}^b β_j, λ represents the extent of interaction, and the other parameters are as in (6.3.1). The interaction term has a more prominent effect when the treatment and block effects (τ_i and β_j) are both far from their respective averages. The model reduces to (6.3.1) when λ = 0. Thus, the hypothesis λ = 0 signifies absence of interaction (that is, additivity of the block and treatment effects). Even though the model (6.3.12) is nonlinear in the parameters, Tukey (1949) showed that a simple test for this hypothesis can be derived in the following manner. If we assume for the moment that μ, τ_1, ..., τ_t, β_1, ..., β_b are known, then (6.3.12) can be rewritten as

    y_{ij} − μ − τ_i − β_j = λ(τ_i − τ̄)(β_j − β̄) + ε_{ij},   i = 1,...,t,  j = 1,...,b,
    E(ε_{ij}) = 0,   i = 1,...,t,  j = 1,...,b,
    Cov(ε_{ij}, ε_{i'j'}) = σ²  if i = i' and j = j',
                          = 0   otherwise,                                      (6.3.13)

which is a linear model with transformed response y_{ij} − μ − τ_i − β_j, i = 1,...,t, j = 1,...,b. The parameter λ is estimable, and its BLUE is easily seen to be

    λ̂ = [Σ_{i=1}^t Σ_{j=1}^b (y_{ij} − μ − τ_i − β_j)(τ_i − τ̄)(β_j − β̄)] / [Σ_{i=1}^t Σ_{j=1}^b (τ_i − τ̄)²(β_j − β̄)²].     (6.3.14)
By calculating R_H² − R_0² for this model, it turns out that the sum of squares due to deviation from the hypothesis λ = 0 is (see Exercise 6.18)

    S_λ = [Σ_{i=1}^t Σ_{j=1}^b (y_{ij} − μ − τ_i − β_j)(τ_i − τ̄)(β_j − β̄)]² / [Σ_{i=1}^t Σ_{j=1}^b (τ_i − τ̄)²(β_j − β̄)²].    (6.3.15)
Since μ, τ_1, ..., τ_t, β_1, ..., β_b are unknown, we can replace them in the above expression by their respective estimators, and reject the hypothesis of no interaction if the resulting quantity is significantly different from 0. We shall now derive the null distribution of the resulting statistic under the usual assumption of multivariate normality of the errors.

When λ = 0, (6.3.13) reduces to (6.3.1). According to this model, the BLUEs of μ + τ_i + β_j, (τ_i − τ̄) and (β_j − β̄) are ŷ_{ij}, ȳ_{i.} − ȳ_{..} and ȳ_{.j} − ȳ_{..}, respectively (see the discussion following (6.3.4)). Substituting these BLUEs in the expression for S_λ, we have the statistic

    Ŝ_λ = [Σ_{i=1}^t Σ_{j=1}^b (y_{ij} − ŷ_{ij})(ȳ_{i.} − ȳ_{..})(ȳ_{.j} − ȳ_{..})]² / [Σ_{i=1}^t (ȳ_{i.} − ȳ_{..})² Σ_{j=1}^b (ȳ_{.j} − ȳ_{..})²].   (6.3.16)
To find the distribution of Ŝ_λ, note that it can be written as

    Ŝ_λ = [Σ_{i=1}^t Σ_{j=1}^b a_i b_j ε̂_{ij}]² / [Σ_{i=1}^t a_i² Σ_{j=1}^b b_j²],

where a_i = ȳ_{i.} − ȳ_{..} and b_j = ȳ_{.j} − ȳ_{..}, i = 1,...,t, j = 1,...,b. For fixed a_i and b_j, the numerator is the square of an LZF of (6.3.1), and the variance of this LZF is σ² times the denominator (see Exercise 6.9). Therefore, for fixed a_i and b_j, Ŝ_λ/σ² has the χ²_1 distribution, under the assumption of multivariate normality of the model errors. Since ȳ_{i.} − ȳ_{..} and ȳ_{.j} − ȳ_{..} are BLUEs of the model (6.3.1) for i = 1,...,t, j = 1,...,b, these are independent of the LZFs. Therefore, the conditional distribution of Ŝ_λ/σ² given the BLUEs is χ²_1. Since the conditional distribution does not depend on the BLUEs, the unconditional distribution must also be χ²_1.

In order to construct a test statistic, we have to use an estimator of σ² which is independent of Ŝ_λ. The latter is the square of an LZF having
variance σ². Therefore, one can form a standardized basis set of LZFs by augmenting this LZF with n − t − b other LZFs which are independent of the first one. Since the sum of squares of all the n − t − b + 1 LZFs is the error sum of squares for (6.3.1) (see Definition 4.7.6), the sum of squares of the additional n − t − b LZFs is R_0² − Ŝ_λ, which must be independent of Ŝ_λ. It follows that (n − t − b)Ŝ_λ/(R_0² − Ŝ_λ) has the F_{1,n−t−b} distribution under the hypothesis λ = 0. The test which rejects the hypothesis of no interaction (that is, the hypothesis of additivity of block and treatment effects) for large values of (n − t − b)Ŝ_λ/(R_0² − Ŝ_λ) is known as Tukey's one-degree-of-freedom test for non-additivity.

The analysis of variance for the model (6.3.12) is given in Table 6.3. Strictly speaking, Ŝ_λ is not a sum of squares, but the sum of squares of some 'generalized zero functions' of (6.3.1), which are nonlinear functions of the response (see Exercise 5.23). The test statistic for the hypothesis of additivity is the ratio of the mean squares, MS_λ/MS_e. It is a special case of a more general class of tests (see Exercise 6.19).
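The following sketch (not part of the original text, with hypothetical data) computes Ŝ_λ of (6.3.16) and Tukey's one-degree-of-freedom statistic.

    import numpy as np
    from scipy import stats

    # Hypothetical two-way layout with one observation per cell (t treatments x b blocks)
    y = np.array([[4.8, 5.2, 5.6, 5.0],
                  [6.1, 6.4, 7.5, 6.3],
                  [5.4, 5.9, 6.2, 5.7]])
    t, b = y.shape
    n = t * b

    a = y.mean(axis=1) - y.mean()          # treatment deviations from the grand mean
    c = y.mean(axis=0) - y.mean()          # block deviations from the grand mean
    resid = y - y.mean(axis=1)[:, None] - y.mean(axis=0)[None, :] + y.mean()
    R0_sq = (resid**2).sum()               # additive-model error SS

    # Estimated interaction sum of squares of (6.3.16)
    num = (y * np.outer(a, c)).sum()**2    # equivalent to using the residuals, since a and c sum to zero
    den = (a**2).sum() * (c**2).sum()
    S_lambda = num / den

    F = (n - t - b) * S_lambda / (R0_sq - S_lambda)     # one-degree-of-freedom statistic
    print(F, stats.f.sf(F, 1, n - t - b))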
Sum of Squares
Between treatments
o
,w— iZi
Between
<-,
.w —
blocks
*/> =
Degrees of Freedom
Mean Square
- \2
+ i
n^c
_ N2
u
n^-c
*^T t— 1
b
n
)
^
aP
* - !
MSfi
= bZi
1
^X
= SX
Sx =
activity (£L2«i^-*->^-*->) a
EUE^ifa--*-) 2 ^-*-) 2 MSe = Error
R% - Sx
n-t-b
Rp-Sx n —t—b
t
Total
b
st = Y, Y.^3 - y-)2
n~l
i=\j=\
Table 6.3 ANOVA for model (6.3.12) of two-way classified data with limited interaction
Ignoring interaction can produce reasonable point estimates but unduly long confidence intervals for treatment contrasts (see Exercise 6.17). See Scheffe (1959, pp. 134-136) for other effects of ignored interaction in two-way classified data with one observation per cell.

One has to be careful about the interpretation of statistical findings when there is interaction. For instance, if the contrast τ_1 − τ_2 is found to be significantly positive, it only means that the effect of treatment 1 is likely to be more than that of treatment 2 when averaged over the b blocks. It does not necessarily mean that this order holds in every block. As another example, consider the case where the hypothesis of 'no treatment difference' is accepted but the hypothesis of 'no interaction' is rejected. This only means that there are differences among the treatments, but the differences somewhat offset one another when averaged over the various blocks. The model (6.3.12) represents a very specific type of interaction. Broader interaction between treatment and block effects makes it more difficult to allocate variance to the various sources as clearly as in Table 6.3.
6.3.3  Multiple observations per cell: balanced data
In Section 6.3.2 we could handle interaction only to a limited extent, as we did not have adequate degrees of freedom to assess all possible types of interaction. This inadequacy is removed if there are multiple observations for every combination of block and treatment. We first consider the case of balanced data, when there is an equal number of observations for every combination of treatment and block. Thus, we have the following extension of (6.3.11):

    y_{ijk} = μ + τ_i + β_j + γ_{ij} + ε_{ijk},   i = 1,...,t,  j = 1,...,b,  k = 1,...,m,
    E(ε_{ijk}) = 0,   i = 1,...,t,  j = 1,...,b,  k = 1,...,m,
    Cov(ε_{ijk}, ε_{i'j'k'}) = σ²  if i = i', j = j' and k = k',
                             = 0   otherwise.                                      (6.3.17)
The model with n = tbm observations can be written in the matrix-vector form as

    y_{n×1} = X_{n×(t+1)(b+1)} β_{(t+1)(b+1)×1} + ε_{n×1},   E(ε) = 0,   D(ε) = σ²I,

where

    y = (((y_{111} : ··· : y_{11m}) : ··· : (y_{1b1} : ··· : y_{1bm})) : ··· : ((y_{t11} : ··· : y_{t1m}) : ··· : (y_{tb1} : ··· : y_{tbm})))',

    X = (1_{t×1} ⊗ 1_{b×1} : I_{t×t} ⊗ 1_{b×1} : 1_{t×1} ⊗ I_{b×b} : I_{t×t} ⊗ I_{b×b}) ⊗ 1_{m×1},

    β = (μ : (τ_1 : ··· : τ_t) : (β_1 : ··· : β_b) : (γ_{11} : ··· : γ_{1b}) : ··· : (γ_{t1} : ··· : γ_{tb}))',

    ε = (((ε_{111} : ··· : ε_{11m}) : ··· : (ε_{1b1} : ··· : ε_{1bm})) : ··· : ((ε_{t11} : ··· : ε_{t1m}) : ··· : (ε_{tb1} : ··· : ε_{tbm})))'.
Note that

    C(X) = C((1_{t×1} ⊗ 1_{b×1} : I_{t×t} ⊗ 1_{b×1} : 1_{t×1} ⊗ I_{b×b} : I_{t×t} ⊗ I_{b×b}) ⊗ 1_{m×1})
         = C((1_{t×1} ⊗ 1_{b×1} : (I − P_{1_{t×1}}) ⊗ 1_{b×1} : 1_{t×1} ⊗ (I − P_{1_{b×1}}) : (I − P_{1_{t×1}}) ⊗ (I − P_{1_{b×1}})) ⊗ 1_{m×1}).

The columns in the four partitions of the last matrix are orthogonal to one another. Therefore,

    P_X = P_μ + P_τ + P_β + P_γ,                                                        (6.3.18)

where

    P_μ = P_{1_{t×1}} ⊗ P_{1_{b×1}} ⊗ P_{1_{m×1}},
    P_τ = (I − P_{1_{t×1}}) ⊗ P_{1_{b×1}} ⊗ P_{1_{m×1}},
    P_β = P_{1_{t×1}} ⊗ (I − P_{1_{b×1}}) ⊗ P_{1_{m×1}},
    P_γ = (I − P_{1_{t×1}}) ⊗ (I − P_{1_{b×1}}) ⊗ P_{1_{m×1}}.

Therefore, the BLUE of Xβ is

    Xβ̂ = P_μ y + P_τ y + P_β y + P_γ y.                                                 (6.3.19)
In particular, the fitted value of y_{ijk} and the corresponding residual are

    ŷ_{ijk} = ȳ_{...} + (ȳ_{i..} − ȳ_{...}) + (ȳ_{.j.} − ȳ_{...}) + (ȳ_{ij.} − ȳ_{i..} − ȳ_{.j.} + ȳ_{...}),    (6.3.20)
    ε̂_{ijk} = y_{ijk} − ȳ_{ij.},                                                                                (6.3.21)
where a 'dot' in the subscript indicates averaging over the corresponding index. The four terms of (6.3.20) are the respective BLUEs of the following estimable parametric functions (see Exercise 6.22): (a) the grand mean, μ + τ̄ + β̄ + γ̄_{..}; (b) the deviation of the ith treatment effect from the average treatment effect (averaged over all the blocks), τ_i − τ̄ + γ̄_{i.} − γ̄_{..}; (c) the deviation of the jth block effect from the average block effect (averaged over all the treatments), β_j − β̄ + γ̄_{.j} − γ̄_{..}; (d) the deviation of the ijth interaction effect from the average interaction effects of the ith treatment and the jth block, γ_{ij} − γ̄_{i.} − γ̄_{.j} + γ̄_{..}. Under the model-preserving 'side-conditions' τ̄ = 0, β̄ = 0, γ̄_{i.} = 0, i = 1,...,t, and γ̄_{.j} = 0, j = 1,...,b, the above four parametric functions simplify to μ, τ_i, β_j and γ_{ij}, respectively.

The dispersion of Xβ̂ is σ²P_X, where P_X is given in (6.3.18). Since the four components of this projection matrix are orthogonal, the four parts of ŷ_{ijk} are uncorrelated. The error sum of squares and the usual unbiased estimator of σ² are (see Exercise 6.22)
    R_0² = Σ_{i=1}^t Σ_{j=1}^b Σ_{k=1}^m (y_{ijk} − ȳ_{ij.})²,                          (6.3.22)

    σ̂² = R_0²/(n − tb) = R_0²/[tb(m − 1)].                                               (6.3.23)
Using the decomposition
    y = P_μ y + P_τ y + P_β y + P_γ y + (I − P_X)y,

we can define the sums of squares

    S_τ = ||P_τ y||² = Σ_{i=1}^t Σ_{j=1}^b Σ_{k=1}^m (ȳ_{i..} − ȳ_{...})² = bm Σ_{i=1}^t (ȳ_{i..} − ȳ_{...})²,
    S_β = ||P_β y||² = Σ_{i=1}^t Σ_{j=1}^b Σ_{k=1}^m (ȳ_{.j.} − ȳ_{...})² = tm Σ_{j=1}^b (ȳ_{.j.} − ȳ_{...})²,
    S_γ = ||P_γ y||² = Σ_{i=1}^t Σ_{j=1}^b Σ_{k=1}^m (ȳ_{ij.} − ȳ_{i..} − ȳ_{.j.} + ȳ_{...})²
                     = m Σ_{i=1}^t Σ_{j=1}^b (ȳ_{ij.} − ȳ_{i..} − ȳ_{.j.} + ȳ_{...})²,
    R_0² = ||(I − P_X)y||² = Σ_{i=1}^t Σ_{j=1}^b Σ_{k=1}^m (y_{ijk} − ȳ_{ij.})²,
    S_t = ||(I − P_μ)y||² = Σ_{i=1}^t Σ_{j=1}^b Σ_{k=1}^m (y_{ijk} − ȳ_{...})².

Thus, we have the analysis of variance of model (6.3.17), given in Table 6.4.

    Source               Sum of Squares           Degrees of Freedom   Mean Square
    Between treatments   S_τ = ||P_τ y||²         t − 1                MS_τ = S_τ/(t − 1)
    Between blocks       S_β = ||P_β y||²         b − 1                MS_β = S_β/(b − 1)
    Interaction          S_γ = ||P_γ y||²         (t − 1)(b − 1)       MS_γ = S_γ/((t − 1)(b − 1))
    Error                R_0² = ||(I − P_X)y||²   n − tb               MS_e = R_0²/(n − tb)
    Total                S_t = ||(I − P_μ)y||²    n − 1

    Table 6.4  ANOVA for model (6.3.17) of balanced two-way classified data with interaction
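A numerical sketch of these computations (not part of the original text; the data are simulated) is given below. It produces the sums of squares of Table 6.4 and the three F tests discussed next.

    import numpy as np
    from scipy import stats

    # Hypothetical balanced data y[i, j, k]: t treatments, b blocks, m replicates per cell
    rng = np.random.default_rng(0)
    t, b, m = 3, 4, 2
    y = 5.0 + rng.normal(size=(t, b, m))
    n = t * b * m

    grand = y.mean()
    yi = y.mean(axis=(1, 2))        # treatment means
    yj = y.mean(axis=(0, 2))        # block means
    yij = y.mean(axis=2)            # cell means

    S_tau = b * m * ((yi - grand)**2).sum()                               # treatments
    S_beta = t * m * ((yj - grand)**2).sum()                              # blocks
    S_gamma = m * ((yij - yi[:, None] - yj[None, :] + grand)**2).sum()    # interaction
    R0_sq = ((y - yij[:, :, None])**2).sum()                              # error, (6.3.22)

    MS_e = R0_sq / (n - t * b)
    print(stats.f.sf((S_tau / (t - 1)) / MS_e, t - 1, n - t * b))                               # no treatment effect
    print(stats.f.sf((S_beta / (b - 1)) / MS_e, b - 1, n - t * b))                              # no block effect
    print(stats.f.sf((S_gamma / ((t - 1) * (b - 1))) / MS_e, (t - 1) * (b - 1), n - t * b))     # no interaction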
The mean squares in the last column of Table 6.4 can be used to test several hypotheses, assuming that the errors have the multivariate normal distribution. The GLRT for the hypothesis of 'no treatment effect' consists of rejecting the hypothesis when the statistic MS_τ/MS_e is too large. The null distribution of this statistic is F_{t−1,n−tb}. The GLRT for the hypothesis of 'no block effect' can be obtained similarly. The GLRT for the hypothesis of 'no interaction effect' is to reject the null hypothesis when MS_γ/MS_e is too large, the null distribution of the statistic being F_{(t−1)(b−1),n−tb}. Tests for the significance of specific contrasts or interaction effects can also be obtained from the general theory (see Exercises 6.25 and 6.26).

If the hypothesis of no interaction is accepted, then one may test the hypotheses of 'no treatment effect' and 'no block effect' using a reduced model with no interaction. If the γ_{ij}'s are eliminated from (6.3.17), then we have the simpler model

    y_{ijk} = μ + τ_i + β_j + ε_{ijk},   i = 1,...,t,  j = 1,...,b,  k = 1,...,m,
    E(ε_{ijk}) = 0,   i = 1,...,t,  j = 1,...,b,  k = 1,...,m,
    Cov(ε_{ijk}, ε_{i'j'k'}) = σ²  if i = i', j = j' and k = k',
                             = 0   otherwise.                                      (6.3.24)
The resulting analysis of variance is similar to Table 6.4, except that all the BLUEs corresponding to the interaction effects become LZFs. Therefore, S_γ has to be added to R_0² in order to obtain the error sum of squares of the simplified model, and the corresponding number of degrees of freedom is n − bt + (t − 1)(b − 1) = n − b − t + 1. The hypotheses of 'no treatment effect' or 'no block effect' can be tested on the basis of this table.
6.3.4  Unbalanced data
If a two-way classified design occurs with an unequal number of observations per cell, then the data are said to be unbalanced. Such a
model is given by

    y_{ijk} = μ + τ_i + β_j + γ_{ij} + ε_{ijk},   i = 1,...,t,  j = 1,...,b,  k = 1,...,m_{ij},
    E(ε_{ijk}) = 0,   i = 1,...,t,  j = 1,...,b,  k = 1,...,m_{ij},
    Cov(ε_{ijk}, ε_{i'j'k'}) = σ²  if i = i', j = j' and k = k',
                             = 0   otherwise.                                      (6.3.25)
The total number of observations is n = Σ_{i=1}^t Σ_{j=1}^b m_{ij}. The matrix-vector form of the model is

    y_{n×1} = X_{n×(t+1)(b+1)} β_{(t+1)(b+1)×1} + ε_{n×1},   E(ε) = 0,   D(ε) = σ²I,

where

    y = (((y_{111} : ··· : y_{11m_{11}}) : ··· : (y_{1b1} : ··· : y_{1bm_{1b}})) : ··· : ((y_{t11} : ··· : y_{t1m_{t1}}) : ··· : (y_{tb1} : ··· : y_{tbm_{tb}})))',

    X = diag(1_{m_{11}×1}, 1_{m_{12}×1}, ..., 1_{m_{tb}×1}) (1_{t×1} ⊗ 1_{b×1} : I_{t×t} ⊗ 1_{b×1} : 1_{t×1} ⊗ I_{b×b} : I_{t×t} ⊗ I_{b×b}),

where diag(1_{m_{11}×1}, ..., 1_{m_{tb}×1}) denotes the n × tb block-diagonal matrix with the indicated columns on the diagonal,

    β = (μ : (τ_1 : ··· : τ_t) : (β_1 : ··· : β_b) : (γ_{11} : ··· : γ_{1b}) : ··· : (γ_{t1} : ··· : γ_{tb}))',

    ε = (((ε_{111} : ··· : ε_{11m_{11}}) : ··· : (ε_{1b1} : ··· : ε_{1bm_{1b}})) : ··· : ((ε_{t11} : ··· : ε_{t1m_{t1}}) : ··· : (ε_{tb1} : ··· : ε_{tbm_{tb}})))'.
The key to the analysis of variance in the case of balanced data had been the decomposition of the vector of fitted values, given in (6.3.19). The four components of the decomposition represented sets of BLUEs which are not only attributable to various effects (grand mean, difference in treatment effects, difference in block effects and interaction), but were also uncorrelated with one another. One of the last three
terms turn into LZFs under the hypothesis of 'no difference in treatment effects', 'no difference in block effects' or 'no interaction effect', respectively. In the case of unbalanced data one can still identify BLUEs that signify differences in treatment effects, following the general principles. These BLUEs turn into LZFs under the hypothesis of 'no difference in treatment effects'. A GLRT for this hypothesis can also be obtained. Similar analysis is also possible for the block and interaction effects. However, the sets of BLUEs for the various effects are in general correlated. As a result, it is not possible to conduct a detailed analysis of variance of the type described in Table 6.4. One can take at most one effect at a time and form a limited ANOVA table similar to Table 5.1. We refer the reader to Hocking (1996, Chapter 13) and Searle (1987) for more details on the subject. Even if the data are unbalanced, a detailed analysis of variance is possible if m_{ij} = m_i n_j for some m_i and n_j, i = 1,...,t, j = 1,...,b (see Exercise 6.28).

Sometimes unbalanced data occur from missing observations in an experiment which is designed as balanced. If the number of missing observations is not too large, then tractable solutions can be found by exploiting the structure of the original design. We outline here two methods of analysis, but defer proofs to Chapter 9. The tests of hypotheses obtained from both methods work even when the missing observations render some potentially estimable parameters non-estimable. One only has to be careful about what is being tested (see Section 5.3.1) and keep track of the degrees of freedom.

The first technique is called missing plot substitution. The missing values of the responses are treated as parameters, and these parameters are estimated by minimizing the error sum of squares with respect to them. Parameter estimates under any restriction are obtained by minimizing the sum of squares under that restriction. It is expected that these sums of squares have explicit expressions, owing to the designed nature of the experiment, and lend themselves to easy minimization. The GLRT for a linear hypothesis is obtained by replacing the restricted and unrestricted sums of squares by their respective
minimized values. The name 'missing plot substitution' arises from the substitution of the missing values with their respective estimators. The degrees of freedom for the restricted and unrestricted sums of squares are obtained from the unbalanced model in the usual manner. A proof of the validity of this technique in a more general set-up is given in Section 9.2.5. See Exercise 6.29 for the step-by-step construction of a test procedure in the special case of two-way classified data with one observation per cell and a single missing observation.

The second technique is based on the fact that, for the purpose of inference, a missing observation can be treated as an available observation in a model having an extra parameter. Specifically, if the last l out of n observations of the balanced-data model (y, Xβ, σ²I) are missing, then the truncated (unbalanced-data) set-up is equivalent to the modified model (y, Xβ + Zη, σ²I), where Z is an n × l matrix obtained from the last l columns of the n × n identity matrix, and the missing elements of y can be substituted by an arbitrary set of numbers. The proof of this equivalence (originally due to Bartlett, 1937a) is given in Section 9.5. The model (y, Xβ + Zη, σ²I) is a special case of the analysis of covariance model which is discussed in Section 6.6. The construction of a test procedure in a special case is outlined in Exercises 6.35, 6.37 and 6.39.
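The device just described is easy to reproduce numerically. The following sketch (not part of the original text) sets up a hypothetical RBD with one missing cell, attaches the extra column Z taken from the identity matrix, and fits the augmented model by least squares; the missing-cell index and the data values are made up for illustration.

    import numpy as np

    # Hypothetical RBD with t = 3, b = 4 and one observation per cell; one cell is missing.
    t, b = 3, 4
    n = t * b
    y = np.array([4.8, 5.2, 5.6, 5.0,
                  6.1, 6.4, 6.9, 6.3,
                  5.4, 5.9, 6.2, 5.7])
    missing = 2 * b + 2            # flattened index of the missing cell (third treatment, third block)
    y[missing] = 0.0               # the value substituted here is arbitrary

    # Design matrix of model (6.3.1): overall mean, treatment indicators, block indicators
    X = np.column_stack([np.ones(n),
                         np.kron(np.eye(t), np.ones((b, 1))),
                         np.kron(np.ones((t, 1)), np.eye(b))])

    # Bartlett's device: one extra column of the identity matrix per missing observation
    Z = np.zeros((n, 1))
    Z[missing, 0] = 1.0

    # Least squares fit of (y, X beta + Z eta); the fitted values at the observed cells
    # coincide with those obtained by deleting the missing row altogether.
    coef, *_ = np.linalg.lstsq(np.column_stack([X, Z]), y, rcond=None)
    fitted = np.column_stack([X, Z]) @ coef
    print(fitted.reshape(t, b))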
6.4  Multiple treatment/block factors
So far we have made a clear distinction between treatment and block effects by stating that blocks are formed by clubbing together experimental units having homogeneity with respect to all non-treatment factors. Sometimes the blocks represent a second kind of treatment. In other words, every experimental unit is subject to a combination of two treatments, each treatment being chosen from a finite number of possibilities. In such a case, one may be interested in knowing which (if any) of the treatments makes a significant contribution to the variance of the response, and whether there is any interaction. In the case of balanced data, analysis of variance gives a clear answer to this question. When the data are unbalanced, the comparison of the treatment effects can be viewed as a problem of model selection. One has to determine whether
the three sets of parameters corresponding to treatment 1, treatment 2 and interaction deserve to be present in the model. See Christensen (1996, Chapter 7) for a discussion of some interesting scenarios that may arise in this regard.

When there are two or more treatment factors, the non-treatment factors may be controlled by blocking. This leads to the general p-way classification model with p > 2. This model may also arise when there is more than one block factor. In particular, when homogeneity among experimental units is sought to be ensured through blocking by two factors, the resulting set-up is called a row-column design (see Exercise 6.31). The analysis of balanced p-way classified data is similar to the analysis described in Section 6.3.3.
6.5  Nested models
Suppose that various groups of experimental units are subjected to a number of treatments. We have seen how heterogeneity within treatment classes due to non-treatment factors is removed by blocking. Sometimes the treatment classes have within-group heterogeneity due to treatment-related factors. For instance, each treatment may be a type of drug while the experimental units subjected to a particular drug may receive different doses of it. The dose levels for one drug can be completely different from those of another. So there is no question of 'blocking' by doses, or of using the two-way classification model. The dose levels being specific to a given treatment, their effects are 'nested' within the treatment effects. Assuming that the treatment and nested effects (that is, the effects of groups and subgroups) are additive and that there is no other effect, this set-up can be described by the nested
classification model

    y_{ijk} = μ + τ_i + γ_{ij} + ε_{ijk},   k = 1,...,m_{ij},  j = 1,...,b_i,  i = 1,...,t,
    E(ε_{ijk}) = 0,   k = 1,...,m_{ij},  j = 1,...,b_i,  i = 1,...,t,
    Cov(ε_{ijk}, ε_{i'j'k'}) = σ²  if i = i', j = j' and k = k',
                             = 0   otherwise.                                      (6.5.1)

The total number of observations is n = Σ_{i=1}^t Σ_{j=1}^{b_i} m_{ij}. In this model, γ_{i1}, ..., γ_{ib_i} are the effects of the factor (with b_i possible levels) nested in treatment i, i = 1,...,t. The matrix-vector form of the model is

    y_{n×1} = X_{n×(1+t+Σ_{i=1}^t b_i)} β_{(1+t+Σ_{i=1}^t b_i)×1} + ε_{n×1},   E(ε) = 0,   D(ε) = σ²I,
where

    y = (((y_{111} : ··· : y_{11m_{11}}) : ··· : (y_{1b_11} : ··· : y_{1b_1m_{1b_1}})) : ··· : ((y_{t11} : ··· : y_{t1m_{t1}}) : ··· : (y_{tb_t1} : ··· : y_{tb_tm_{tb_t}})))',

    X = (1_{n×1} : block-diag(1_{n_1×1}, ..., 1_{n_t×1}) : block-diag(A_1, ..., A_t)),

    with n_i = Σ_{j=1}^{b_i} m_{ij} and A_i = block-diag(1_{m_{i1}×1}, 1_{m_{i2}×1}, ..., 1_{m_{ib_i}×1}), i = 1,...,t,

    β = (μ : (τ_1 : ··· : τ_t) : (γ_{11} : ··· : γ_{1b_1}) : ··· : (γ_{t1} : ··· : γ_{tb_t}))',

    ε = (((ε_{111} : ··· : ε_{11m_{11}}) : ··· : (ε_{1b_11} : ··· : ε_{1b_1m_{1b_1}})) : ··· : ((ε_{t11} : ··· : ε_{t1m_{t1}}) : ··· : (ε_{tb_t1} : ··· : ε_{tb_tm_{tb_t}})))'.
The model (6.5.1) implies that there are Σ_{i=1}^t b_i subgroups of response, with homogeneity of mean within every subgroup. Therefore, it
follows from (6.2.2) that the fitted values are

    ŷ_{ijk} = ȳ_{ij.},

with the usual notations. This can be decomposed as

    ŷ_{ijk} = ȳ_{...} + (ȳ_{i..} − ȳ_{...}) + (ȳ_{ij.} − ȳ_{i..}).                      (6.5.2)

The three terms on the right-hand side are the BLUEs of the grand mean (μ + Σ_{i=1}^t τ_i/t + Σ_{i=1}^t Σ_{j=1}^{b_i} γ_{ij}/Σ_{i=1}^t b_i), the deviation of the ith group mean from the grand mean (τ_i + Σ_{j=1}^{b_i} γ_{ij}/b_i − Σ_{l=1}^t τ_l/t − Σ_{l=1}^t Σ_{j=1}^{b_l} γ_{lj}/Σ_{l=1}^t b_l), and the deviation of the ijth subgroup mean from the ith group mean (γ_{ij} − Σ_{l=1}^{b_i} γ_{il}/b_i), respectively. The model-preserving constraints Σ_{j=1}^{b_i} γ_{ij} = 0, i = 1,...,t, and Σ_{i=1}^t τ_i = 0 make all the parameters estimable. Under these restrictions, the three terms on the right-hand side of (6.5.2) are the BLUEs of μ, τ_i and γ_{ij}, respectively.

The expression (6.5.2) of the fitted value leads to the following decomposition of the deviation from the grand mean:

    y_{ijk} − ȳ_{...} = (ȳ_{i..} − ȳ_{...}) + (ȳ_{ij.} − ȳ_{i..}) + ε̂_{ijk},              (6.5.3)

where ε̂_{ijk} is the residual for the observation y_{ijk}. It is clear that the BLUEs (ȳ_{i..} − ȳ_{...}) and (ȳ_{ij.} − ȳ_{i..}) are uncorrelated with the residual ε̂_{ijk}. We shall argue that the two BLUEs are also uncorrelated with one another. To see this, consider the restriction γ_{i1} = ··· = γ_{ib_i}, which means that all the subgroups within the ith group have equal mean. Under this restriction, ȳ_{i..} − ȳ_{...} continues to be a BLUE (according to the one-way classification model), while ȳ_{ij.} − ȳ_{i..} turns into an LZF. Therefore, the two linear functions are uncorrelated. Thus, the three terms in the decomposition (6.5.3) are uncorrelated. The sums of squares of these terms lead to the analysis of variance given in Table 6.5.

The null hypothesis of no significant effect of any subgroup can be tested by the GLRT statistic MS_γ/MS_e, which has the F distribution with Σ_{i=1}^t b_i − t and n − Σ_{i=1}^t b_i degrees of freedom under the null hypothesis. If this hypothesis is rejected, one can look for the groups where the subgroup effect is significant, using techniques of multiple comparisons (see Section 6.2.4).
    Source               Sum of Squares                                                  Degrees of Freedom   Mean Square
    Between groups       S_τ = Σ_{i=1}^t Σ_{j=1}^{b_i} m_{ij} (ȳ_{i..} − ȳ_{...})²       t − 1                MS_τ = S_τ/(t − 1)
    Between subgroups    S_γ = Σ_{i=1}^t Σ_{j=1}^{b_i} m_{ij} (ȳ_{ij.} − ȳ_{i..})²       Σ_{i=1}^t b_i − t    MS_γ = S_γ/(Σ_{i=1}^t b_i − t)
    Error                R_0² = Σ_{i=1}^t Σ_{j=1}^{b_i} Σ_{k=1}^{m_{ij}} ε̂_{ijk}²        n − Σ_{i=1}^t b_i    MS_e = R_0²/(n − Σ_{i=1}^t b_i)
    Total                S_t = Σ_{i=1}^t Σ_{j=1}^{b_i} Σ_{k=1}^{m_{ij}} (y_{ijk} − ȳ_{...})²   n − 1

    Table 6.5  ANOVA for model (6.5.1) of nested classification
For a fixed group i, the GLRT statistic for the null hypothesis of no subgroup effect is

    [Σ_{j=1}^{b_i} m_{ij} (ȳ_{ij.} − ȳ_{i..})² / (b_i − 1)] / MS_e,

which has the F distribution with b_i − 1 and n − Σ_{i=1}^t b_i degrees of freedom under the null hypothesis. If this hypothesis is tested for several (or all) groups simultaneously, one has to be careful about the levels of the tests. The GLRT statistic for the hypothesis of no group effect is

    [(S_τ + S_γ) / (Σ_{i=1}^t b_i − 1)] / MS_e,

which has the F distribution with Σ_{i=1}^t b_i − 1 and n − Σ_{i=1}^t b_i degrees of freedom under the null hypothesis.
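The sketch below (not part of the original text, with hypothetical data) computes the sums of squares of Table 6.5 and the two F statistics just quoted; the form of the 'no group effect' statistic follows the degrees of freedom stated above.

    import numpy as np
    from scipy import stats

    # Hypothetical nested data: groups[i][j] holds the observations of subgroup j within group i
    groups = [[np.array([6.1, 6.4, 6.0]), np.array([6.8, 7.0])],
              [np.array([5.2, 5.5, 5.4, 5.1]), np.array([5.9, 6.2, 6.0]), np.array([5.6, 5.8])]]
    all_y = np.concatenate([obs for g in groups for obs in g])
    n, t = all_y.size, len(groups)
    B = sum(len(g) for g in groups)          # total number of subgroups
    grand = all_y.mean()

    S_tau = sum(obs.size * (np.concatenate(g).mean() - grand)**2 for g in groups for obs in g)
    S_gamma = sum(obs.size * (obs.mean() - np.concatenate(g).mean())**2 for g in groups for obs in g)
    R0_sq = sum(((obs - obs.mean())**2).sum() for g in groups for obs in g)

    MS_e = R0_sq / (n - B)
    F_gamma = (S_gamma / (B - t)) / MS_e             # no subgroup effect anywhere
    F_group = ((S_tau + S_gamma) / (B - 1)) / MS_e   # no group (or subgroup) effect
    print(stats.f.sf(F_gamma, B - t, n - B))
    print(stats.f.sf(F_group, B - 1, n - B))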
Apart from the model (6.5.1), there are nested classification models for more than one main factor. See Hocking (1996, Chapter 14) for more details.

6.6  Analysis of covariance

6.6.1  The model
In Section 6.3 we used the method of blocking in order to account for the non-uniformity of the experimental units and to reduce the experimental error, so that treatment comparisons can be made more precise. There are many experiments where for each experimental unit one has supplementary or concomitant variables (also called covariates) which might influence the response. In such a case, comparison of treatment effects can be meaningful only after accounting for the effects of the concomitant variables. The effects of designed factors and concomitant variables are combined in the analysis of covariance model,

    y = Xβ + Zη + ε,   E(ε) = 0,   D(ε) = σ²I,                                     (6.6.1)

where the elements of the matrix X represent the designed part of the experiment, such as block/treatment levels, the elements of β are the corresponding effects, the columns of the matrix Z contain the values of the respective covariates and η represents the effect of these covariates on the mean response.

The model (6.6.1) is a generalization of the model of Example 1.1.4 of Chapter 1. For given η, it can be written as y − Zη = Xβ + ε, which is essentially the linear model for a designed experiment. For given β, (6.6.1) can be written as y − Xβ = Zη + ε, which is a linear model with explanatory variables that do not come from a design.
Uses of the model
It has already been mentioned that the model (6.6.1) can be used to reduce experimental error. Example 1.1.4 (originally due to Fisher,
6.6 Analysis of covariance
225
1932) is a case in point. In this example, the pre-treatment yield is used as the concomitant variable. If this variable is excluded, then the model reduces to that of a designed experiment, but the variation in the pretreatment yield across the experimental units would make the treatment comparisons more imprecise. Inclusion of this variable means that the error variance essentially represents the conditional variance of the posttreatment yield for any treatment, given the pre-treatment yield. This should be smaller than the unconditional variance. In order to reduce the error variance, it is important that the concomitant variables influence the response considerably (that is, the conditional variance is much smaller than the unconditional variance). Identification of appropriate concomitant variables is, therefore, an important task. Care should be taken that the concomitant variables do not influence or are not influenced by the treatments or the blocks. For instance, in Example 1.1.4 one should not use blocking by pre-treatment yield, or apply one kind of treatment only to experimental units with low pre-treatment yield. If this precaution is not taken, then the design is no longer randomized with respect to the concomitant variables, and there may be some bias in the estimators. The model (6.6.1) is also used in observational studies where the objective is to study the effects of some binary variables, but there are some additional variables having possible effect on the response. For instance, in a comparison of heights of children from two different schools, Greenberg, (1953) took into account the age of the children in order to make the comparison meaningful. Omission of these additional variables may not only inflate the error variance, but also introduce bias in the estimators (see Section 9.3.4). In contrast to the case of randomized experiments, the bias due to omitted variables in observational studies in general cannot be removed by randomization. See Cochran (1957) for a discussion on the finer points of this subject. Sometimes the analysis of covariance model is used to achieve a deeper understanding of treatment effects. If the difference between two treatments is significant, but it becomes insignificant after taking into account an explanatory variable, then this explanatory variable may be responsible for the difference in these treatment effects.
226
Chapter 6 : Analysis of Variance in Basic Designs
If parallel regression lines are to be fitted to two or more groups of data, the model used for this purpose is essentially that of (6.6.1) (see Exercise 1.4). It was pointed out in Section 6.3.4 that the analysis of covariance model can be used to carry out analysis of variance when some observations are missing. This follows from the fact that deletion of an observation is equivalent to inclusion of an additional explanatory variable in the model (see Section 9.5 for a proof of this equivalence). The effects of these concocted variables corresponding to all the missing observations assume the role of r\ in (6.6.1). 6.6.3
Estimation of parameters
We have seen in the foregoing sections how the special structure of the matrix X leads to neat expressions of various BLUEs — when there is no covariate. Presence of covariates changes the scenario altogether, as the matrix Z does not have a special structure. What we intend to do here is to exploit the structure of X by estimating 77 and j3 successively. Let us first consider the estimation of the estimable functions of 77. For this purpose /3 is a vector of nuisance parameters. It follows from Proposition 4.10.1 that the estimable functions of rj are of the form L(I - Px)Zrj. It is shown in Remark 7.10.2 that the BLUE of L(I — Px)Zr} is obtained by replacing 77 with a solution of the reduced normal equation Z'(I-Px)Zr, = Z'(I-Px)y. Let us denote such a solution by 77. It also follows from Remark 7.10.2 that D((I-Px)Zr,) = o*P(I_Px)z. Let us use the notation
{Rl* £) = ( z ' ) ( / - p * ) ( y z)-
(6-6-2)
The reduced normal equation for 77 can be written simply as Rq = r , so that the substitution estimator is 77 = R~r.
(6.6.3)
6.6 Analysis of covariance
227
Note that the normal equation for the simultaneous estimation of the estimable linear functions of /3 and r) is
fX'X \z'X
X'Z\f/3\_fX'y\ Z'Zj{r,)-{z'y)-
Using the first of these equations and substituting r\ — fj, we have the solution for /3, J3 = (X'X)-[X'y-X'Zri]. (6.6.4) The BLUE of any estimable function of the form A/3 is given by A/3, where j3 is as in (6.6.4) and r\ is as in (6.6.3). Remark 6.6.1 The expression of f3 can be interpreted in the following manner. Write (6.6.1) as y = X/3 + Zrj + e = X(3 + PxZrj + {I ~ Px)Zrj + e =
X^ + il-P^Zrj
+ e,
(6.6.5)
Let Zj be the j t h column of Z (that where /30 = (3 + {X'X)~X'Zr}. is, the column containing the values of the j t h concomitant variable), j = 1 , . . . , q. Then we can write Q
where r/j is the jth element of rj and ctj is a least squares estimator of cx.j from the model (zj,Xcxj,a2I) (that is, a model where the j t h covariate plays the role of response). This is only an interpretation. As we condition y on Z, Qj need not be treated as random. Replacing /3 0 (X'X)~X'y and r] by their respective least squares estimators (f30 — and fj as in (6.6.3)), we have ~
~
q 3=1
This expression is equivalent to (6.6.4), as (6.6.5) is only a reparaD metrization of (6.6.1).
228
Chapter 6 : Analysis of Variance in Basic Designs
6.6.4
Tests of
hypotheses
We shall now assume that the conditional distribution of y given Z is normal. Let us first consider the hypothesis rj = 0, which essentially means that the covariates may be ignored. The error sum of squares under the hypothesis is obviously R2H — y'(I — Px)y with n — p(X) degrees of freedom. The error sum of squares for the model (6.6.1) is
Rl = y\i - P{X:Z))y = y'(i - P
X
-
P{I_PX)Z)V,
which follows from Proposition 2.4.4(b). In view of (6.6.2), we can write
R2 = Rip - r'R'r
= R2H -
r'R'r.
The associated degrees of freedom is n — p(X : Z). It follows that the GLRT for the hypothesis rj = 0 is to reject the null hypothesis when the ratio [(n - p(X : Z))(r'Rrr)/[(p(X : Z) - p(X))fl§] is too large. The null distribution of the statistic is FpiX:Z)-p(X),n-p(X:Z)We now turn to the main problem of testing a general linear hypothesis of the form A/3 = 0. Tests for the hypotheses such as no treatment effect, no block effect, no interaction and no nested effect are all special cases of this problem. It follows along the lines of the discussion of Section 4.9 that the model (6.6.1) is equivalent to
y = X(I-PA,)9
+ Zri + e,
E(e) = 0,
D(e) - a21.
(6.6.6)
We have already seen that the error sum of squares for the model (6.6.1) can be written as R20 = R20p - r'R-r, (6.6.7) where the terms on the right hand side are given by (6.6.2). Similarly, the error sum of squares for the model (6.6.6) is R2H = R2H0 - r'HR-HrH,
(6.6.8)
where the terms on the right hand side are given by (RHp
rH\
=
(y'\(i
_p
)(„
z)
(6 6 9)
6.6 Analysis of covariance
229
The degrees of freedom associated with RQ and R2H are n—p(X : Z) and n - p{X{I -PA,):Z), respectively. Thus, the GLRT of the hypothesis A/3 = 0 is to reject the null hypothesis for large values of the statistic
Rl-Rl Rl
n-P(X:Z) P(X:Z)-p(X(I-PA,):Zy
^
^
which has the null distribution Fp(X:z)-p(x(i-PA,):Z),n-P(X:Z)- If the concomitant variables are independent of the various effects, then C(Z) and C(X) are virtually disjoint. In such a case, p(X : Z) = p(X)+p(Z) and p(X(I-PA,) : Z) = p(X(I - PA,)) +p(Z). Consequently, the test statistic of (6.6.10) simplifies to R2H~R2o Rl
n-p(X)-p(Z) p{X)-p{X{I-PA,)Y
and its null distribution is F^X)_p{X(i-PA,)),n-p(X)-p{zy When the analysis of covariance model arises from covariates inserted in lieu of missing observations (see Section 6.3.4), the assumption p(X : Z) = p{X) + p(Z) means that no parameter becomes nonestimable because of the missing observations. 6.6.5
ANCOVA table and adjustment for covariate*
Recall that in the absence of covariates, the ANOVA table can be used to compute the GLRT statistics for the common hypothesis testing problems. Can there be a similar table for the computation of (6.6.10)? The key to the computation lies with the matrices defined in (6.6.2) and (6.6.9). Indeed, the top left elements of these two matrices can be obtained from a suitable analysis of variance table. These are the sum of squares RQQ and R2Hg, respectively, with no adjustment for covariates. If we can compute R^ and Rj^p from a table, we should also be able to derive the other elements of the matrices of (6.6.2) and (6.6.9) from an expanded table. It is clear from the structure of the matrices that the remaining diagonal elements are obtained from an 'ANOVA' table where y is replaced by a column of Z. Likewise, the off-diagonal elements are obtained by replacing every sum of squares in the ANOVA table
230
Chapter 6 : Analysis of Variance in Basic Designs
by a sum of squares and products, where the factors in each product correspond to the response or some of the covariates. We illustrate this method with an extended version of the balanced two-way classification model with interaction (considered in Section 6.3.3), where covariates are also included. The model is Vijk = v+Ti+Pj+7ij+ EjLi Cijkim+eijk, i = i , . . . , t, .7 = 1 , . . . , 6, fc = l , . . . , m , E{tijk)
= 0, i =
l,...,t,
j = l,...,b, n
t
\
/ c2 10
k-l,...,m,
if i = i', 7 = i' and k = k', otherwise. (6.6.11)
The matrix-vector form of the model is
y = X/3 + Zrj + e, E{e) = 0, D(e) = a2!, where y, X, fi and e are as in Section 6.3.3, rj = (rji : Z — (z\ : : zq) and Zj
—
(((clllj
:
"""
: cllmj)
: ((ctiij
:
(C\blj : : Ctimj)
: r)q)',
: Cibmj))
(
(hbmj)))
»
for; = l,...,q. Using the projection matrices of (6.3.18), we form the analysis of covariance (ANCOVA) table of model (6.6.11), given in Table 6.6. Note that the middle column of the ANCOVA table contains matrices of dimension (q + 1) x (q + 1). Conventionally a separate column is used for the ijth element of all the matrices, for each i and j . The matrix representation is used for brevity. For every (q+1) x (q+1) nonnegative definite matrix (
lxl
\sqxl ( S
let g (
J-xq I, &qxqj
s'\ c
I denote the number s — s'S~s. We shall use this notation
to describe the tests of the following hypotheses. %T : There is no treatment effect,
6.6 Analysis of covariance
231
Source Between treatments
Sum of Squares and products
Interaction
S 7 = (y : Z)'P7{y : Z)
(t - l)(b - 1)
Error
Se = {y: Z)'{I - Px){y: Z)
n-tb
Total
St = (y:ZY(I-P^(y:Z)
n-\
= ( v
)#
'
( TVtf
Degrees of Freedom
} y
Table 6.6 ANCOVA for model (6.6.11) of balanced two-way classified data with interaction and covariates
Hp
: There is no block effect,
Ti-y : There is no interaction effect, HT1
: There is no treatment or interaction effect,
Hp-,
: There is no block or interaction effect,
Tia
: All the effects are present.
The matrix of (6.6.9) for the six hypotheses are ST + Se, Sp + Se, S-y + Se, St — Sp, St — ST and Se, respectively. The corresponding corrected sum of squares, after eliminating the effect of the covariates, are g{ST + Se), g{Sp + Se), g{S1 + Se), g(St-Sp), g(St-ST) and g(Se), respectively. A careful examination of p(X(I — P ,)) of (6.6.10) for the various hypotheses leads to the GLRT statistics listed in Table 6.7. In this table c denotes p(Z), and it is assumed that p(X : Z) — p(X) + p(Z). (We remind the reader that for covariates inserted in lieu of missing observations, as in Section 6.3.4, the assumption p(X : Z) = p(X) + p(Z) is equivalent to assuming that no parameter becomes nonestimable because of the missing observations.) The null hypothesis is to be rejected if the F-ratio is too large. The number of degrees of freedom of the null distribution is obvious from the context.
232
Chapter 6 : Analysis of Variance in Basic Designs
Null hypothesis
Alternative hypothesis
oy
n,
GLRT statistic 9{ST
+ Se)-g(Se) 9(Se)
'
g(S0 + Se)-g(Se) 9(Se)
n-tb-c t-1
-.
v
v
u
n.
nj
HTI
Hl
9(St ~ Sp) ~ g(^ 7 + Se) n-t-b+1-C g(S, + Se) " t-i
H/37
Ul
g{St-ST)-g(S^ + Se) g(S, + Se) '
u
n,
HT"
Ha
^
v
HM
Ha
g(S1 + Se)-g(Se) g(Se)
n-tb-c b-l n-tb-c -(t-l)(6-l)
n-t-b+l-c 6-1
g(St-S0)-g(Se) n-tb-c g{se) " 6(<-i) g(St - ST) - g{Se) n-tb-c g(Se) ' t(b-i)
Table 6.7 List of GLRT statistics for various hypotheses for the ANCOVA model (6.6.11)
6.7  Exercises

6.1 Spring balance. A spring balance with no bias is used to weigh two objects of unknown weights β_1 and β_2. The objects can be weighed individually or together, but the maximum number of measurements that can be taken is 6. The measurement errors are independent and identically distributed with mean 0 and variance σ². Find the D-optimal design subject to the condition that each weight should be estimable.

6.2 Find the A- and E-optimal designs for the problem of Exercise 6.1.

6.3 Show that a treatment contrast is necessarily a linear combination of differences of various pairs of treatment effects.
    Sulfamerazine content              Hemoglobin in brown trout blood
    (grams per 100 pounds of fish)     (grams per 100 ml of blood)
    0       6.7   7.8   5.5   8.4   7.0   7.8   8.6   7.4   5.8   7.0
    5       9.9   8.4   10.4  9.3   10.7  11.9  7.1   6.4   8.6   10.6
    10      10.4  8.1   10.6  8.7   10.7  9.1   8.8   8.1   7.8   8.0
    15      9.3   9.3   7.2   7.8   9.3   10.2  8.7   8.6   9.3   7.2

    Table 6.8  Brown trout hemoglobin data (source: Gutsell, 1951)
6.4 Table 6.8 gives data, taken from Gutsell (1951), on measured hemoglobin content in the blood of brown trout that were randomly allocated to four troughs. The fish in the four troughs received food containing various concentrations of sulfamerazine, 35 days prior to measurement. Assuming that the response (hemoglobin content) follows the model (6.2.1) with normal errors, test the hypothesis that sulfamerazine has no effect on the hemoglobin content of trout blood, using these data.

6.5 For the model (6.2.1), find the expected value of the between-groups mean square (MS_g).

6.6 For the brown trout hemoglobin data of Table 6.8, determine which of the six pairs of group means have a significant difference at the level 0.95, using Tukey's HSD and assuming that the response (hemoglobin content) follows the model (6.2.1) with normal errors. Compare the results with those obtained from the Bonferroni and Scheffe methods, and comment. Can the maximum modulus-t or Fisher's PLSD methods be used for this problem?

6.7 Testing for heterogeneity of variances. The hypothesis of equality of group means is often tested through one-way ANOVA, while assuming equality of the within-group variances. The purpose of this problem is to test the latter assumption. Given random samples y_{ij}, j = 1,...,m, i = 1,...,t from N(μ_i, σ_i²),
Rib height (i)      Reynolds number (j)
                    4.8     4.9     5.0     5.1     5.2     5.3
.010 inch          -.024   -.023    .001    .008    .029    .023
.015 inch           .033    .028    .045    .057    .074    .080
.020 inch           .037    .079    .079    .095    .101    .111

Table 6.9 Air speed experiment data (source: Wilkie, 1962)
find the generalized likelihood ratio test for the null hypothesis σ₁² = ⋯ = σ_t². You may assume that min ...
a smoothed pipe surrounding it. The 'position' is defined as the distance in inches from the center of the rod, in excess of 1.4 inches. The height of ribs on the roughened rod can have three different values. For each height category, one measurement is taken for six different Reynolds numbers. Using a two-way classification model with no interaction, obtain the ANOVA table and test for (a) no difference in effects of rib height, (b) no difference in effects of Reynolds numbers.
6.13 Identify the reparametrization of (6.3.1) which corresponds to the second partition of C(X) given in (6.3.2).
6.14 Is it possible to obtain simultaneous confidence intervals with exact confidence coefficient for all pairs of treatment differences, in respect of model (6.3.1) for two-way classified data with one observation per cell, using any of the methods described in Sections 5.2.3 and 6.2.4? Explain.
6.15 Show that the estimator obtained in Exercise 6.11 remains unbiased and its variance remains the same even if there is interaction of the type described in (6.3.12). Is it the BLUE of c'τ in (6.3.12)?
6.16 For the air speed experiment data of Table 6.9, test for interaction as per model (6.3.12).
6.17 Show that the estimator of σ² given in (6.3.8) is an overestimate when the true model is (6.3.12), that is, when interaction is present. Comment on the appropriateness of confidence intervals of treatment contrasts computed on the basis of the model (6.3.1) when the correct model is (6.3.12).
6.18 Derive the BLUE of λ and the sum of squares due to deviation from the hypothesis λ = 0, for the model (6.3.13).
6.19 Show that Tukey's one degree of freedom test for nonadditivity is a special case of the test described in Exercise 5.24.
6.20 How can Tukey's one-degree-of-freedom test for interaction be generalized to the case of balanced data with multiple observations per cell?
6.21 Show that, in the absence of any side-condition, no linear function of the treatment effects of the model (6.3.17) is estimable.
              Poison I                Poison II               Poison III
Treatment A   3.1   4.5   4.6   4.3   3.6   2.9   4.0   2.3   2.2   2.1   1.8   2.3
Treatment B   8.2  11.0   8.8   7.2   9.2   6.1   4.9  12.4   3.0   3.7   3.8   2.9
Treatment C   4.3   4.5   6.3   7.6   4.4   3.5   3.1   4.0   2.3   2.5   2.4   2.2
Treatment D   4.5   7.1   6.6   6.2   5.6  10.2   7.1   3.8   3.0   3.6   3.1   3.3

Table 6.10 Survival times of animals exposed to poison and treatment (source: Box and Cox, 1964)
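Exercise 6.23 below asks for the GLRTs under the two-way classification model with interaction for these data. The following sketch computes the balanced two-way ANOVA F statistics directly from cell, row and column means; the array and variable names are our own, and the computation assumes the balanced layout of Table 6.10 (four observations per treatment-poison cell).

import numpy as np
from scipy.stats import f

# Table 6.10: survival times in hours; axes = (treatment A-D, poison I-III, replicate).
y = np.array([
    [[3.1, 4.5, 4.6, 4.3], [3.6, 2.9, 4.0, 2.3], [2.2, 2.1, 1.8, 2.3]],    # A
    [[8.2, 11.0, 8.8, 7.2], [9.2, 6.1, 4.9, 12.4], [3.0, 3.7, 3.8, 2.9]],  # B
    [[4.3, 4.5, 6.3, 7.6], [4.4, 3.5, 3.1, 4.0], [2.3, 2.5, 2.4, 2.2]],    # C
    [[4.5, 7.1, 6.6, 6.2], [5.6, 10.2, 7.1, 3.8], [3.0, 3.6, 3.1, 3.3]],   # D
])
t, b, m = y.shape
cell = y.mean(axis=2)
row = y.mean(axis=(1, 2))
col = y.mean(axis=(0, 2))
grand = y.mean()
ss = {
    "treatment":   b * m * ((row - grand) ** 2).sum(),
    "poison":      t * m * ((col - grand) ** 2).sum(),
    "interaction": m * ((cell - row[:, None] - col[None, :] + grand) ** 2).sum(),
}
df = {"treatment": t - 1, "poison": b - 1, "interaction": (t - 1) * (b - 1)}
ss_err = ((y - cell[:, :, None]) ** 2).sum()
df_err = t * b * (m - 1)
for name in ss:
    F = (ss[name] / df[name]) / (ss_err / df_err)
    print(name, F, f.sf(F, df[name], df_err))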
6.22 In the case of two-way classified data with multiple observations per cell, obtain a simple expression for (I - P_X), where X is the design matrix. Hence, show that the four terms of (6.3.20) are indeed the BLUEs identified thereafter, and justify the expression of the error sum of squares given in (6.3.22).
6.23 Table 6.10 gives the survival times (in hours) of groups of four animals randomly allocated to three poisons and four treatments. The data, which appear in Box and Cox (1964), arise from an experiment that was part of an investigation to study the effects of certain toxic agents. There is no blocking, and both the factors are of interest. Construct the ANOVA table for the two-way classification model with interaction, and carry out the GLRT for 'no difference in effects of poisons', 'no difference in treatment effects' and 'no interaction effect'. What does interaction mean in this case?
6.24 For the survival time data of Table 6.10, construct the ANOVA table for the two-way classification model without interaction, and carry out the GLRT for 'no treatment difference'.
6.25 Given the model (6.3.17) and assuming that the model errors are independent and have normal distribution, find a t-statistic for the hypothesis of 'no difference in average effects of the first and second treatments'. Is this the same as the hypothesis τ₁ = τ₂?
6.26 Given the model (6.3.17) with independent and normally distributed errors, suppose that we have to test if there is significant interaction between the first treatment and second block. Formulate this problem as a testable hypothesis involving the model parameters, and find a t-statistic for testing it.
6.27 We wish to test the hypothesis γ_ij = 0 for all i, j (that is, no interaction), for the linear model (6.3.17). Find the testable part of this hypothesis and interpret the result. What is the untestable part of the hypothesis?
6.28 Consider the model (6.3.25) for two-way classified data with interaction and unequal number of observations per cell, and assume that m_ij = m_i n_j for some m_i and n_j, i = 1, . . . , t, j = 1, . . . , b. Split the index k into two indices k₁ and k₂, and rewrite the model as follows:
y_{i k₁ j k₂} = μ + τ_i + β_j + γ_ij + ε_{i k₁ j k₂},   k₁ = 1, . . . , m_i,  k₂ = 1, . . . , n_j,  i = 1, . . . , t,  j = 1, . . . , b,

and the errors have the usual properties. Rearrange the observations in such a way that k₂ changes faster than j, which changes faster than k₁, followed by i. Verify that the model can be written in the form (y, Xβ, σ²I) such that

X = (a ⊗ b : A ⊗ b : a ⊗ B : A ⊗ B),
where

A = [ 1_{m₁×1}   0_{m₁×1}  ⋯  0_{m₁×1} ]
    [ 0_{m₂×1}   1_{m₂×1}  ⋯  0_{m₂×1} ]
    [    ⋮           ⋮      ⋱     ⋮     ]
    [ 0_{m_t×1}  0_{m_t×1}  ⋯  1_{m_t×1} ],

B = [ 1_{n₁×1}   0_{n₁×1}  ⋯  0_{n₁×1} ]
    [ 0_{n₂×1}   1_{n₂×1}  ⋯  0_{n₂×1} ]
    [    ⋮           ⋮      ⋱     ⋮     ]
    [ 0_{n_b×1}  0_{n_b×1}  ⋯  1_{n_b×1} ],

a = A 1_{t×1},   b = B 1_{b×1}.
Hence, obtain a decomposition of the fitted values similar to (6.3.19) and analysis of variance in the form of Table 6.4.
6.29 Suppose that the observation y_kl in the model (6.3.1) is missing. Derive the GLRT for the hypothesis of 'no difference in treatment effects' using the missing plot technique, in the following manner. (a) Obtain the value of y_kl which minimizes the error sum of squares R₀² given in (6.3.8). Call it y_a. (b) Obtain the value of y_kl which minimizes the restricted sum of squares R_H² given in (6.3.9). Call it y_b. (c) Derive the test statistic with appropriate degrees of freedom.
6.30 Table 6.11 shows measurements of absorbance of light at a particular wavelength for positive control samples in an enzyme-linked immunosorbent assay (ELISA) test for human immunodeficiency virus (HIV), which is believed to cause the acquired immunodeficiency syndrome (AIDS). The data are taken from Hoaglin et al. (1991). The materials for the test come in lots, and each lot contains enough material for three runs. Every run consists of measurements of three control samples. The purpose of the experiment is to test for significance of between-lot differences and run-to-run difference within a lot. Obtain an ANOVA table as per the model (6.5.1) and carry out the GLRT for these two problems.
        Run 1               Run 2               Run 3               Run 4               Run 5
Lot A   1.053 1.708 0.977   0.881 0.788 0.788   0.896 1.038 0.963   0.971 1.234 1.089   0.984 0.986 1.067
Lot B   0.996 1.129 1.016   1.019 1.088 1.280   1.120 1.054 1.235   1.327 1.361 1.233   1.079 1.120 0.959
Lot C   1.229 1.027 1.109   1.118 1.066 1.146   1.053 1.082 1.113   1.140 1.172 0.966   0.963 1.064 1.086
Lot D   0.985 0.894 1.019   0.847 0.799 0.918   1.033 0.943 1.089   0.988 1.169 1.106   1.308 1.498 1.271
Lot E   1.128 1.141 1.144   0.990 0.801 0.416   0.929 0.950 0.899   0.873 0.871 0.786   0.930 0.968 0.844

Table 6.11 Light absorbance for positive control samples in an ELISA test for HIV (source: Hoaglin et al., 1991)
6.31 Three-way classified data from row-column design. Describe a linear model with three-way classified data with a single observation per cell and no interaction, the classification being according to two types of block factors (with b and h levels, respectively) and a treatment effect (t levels). Describe the ANOVA table.
6.32 Latin square design. Consider a design where there are two block factors and one treatment factor, all three effects having b levels. There are a total of b² observations, one each from every combination of block levels. If the two block factors are arranged in rows and columns, then there is exactly one treatment allocated to each row and one to each column. This is
called a b × b Latin square design. The model equation is

y_ijk = μ + τ_i + β_j + γ_k + ε_ijk,   j = 1, . . . , b,  k = 1, . . . , b,
i = j + k - 1       if j + k - 1 ≤ b,
i = j + k - 1 - b   otherwise,

with E(ε_ijk) = 0 for all i, j, k and

Cov(ε_ijk, ε_i'j'k') = σ²   if i = i', j = j' and k = k',
                       0    otherwise.

(a) Show that the model can be represented as (y, Xβ, σ²I) with X = (1_{b×1} ⊗ 1_{b×1} : Z_{b²×b} : I_{b×b} ⊗ 1_{b×1} : 1_{b×1} ⊗ I_{b×b}), where Z' = (J_b : J_b² : ⋯ : J_b^b)' and the matrix J_b is obtained from I_{b×b} by removing the first column and appending it after the last column.
(b) Show that

C(X) = C(1_{b×1} ⊗ 1_{b×1} : Z - b⁻¹(1_{b×1} ⊗ 1_{b×1})1'_{b×1} : (I_{b×b} - b⁻¹1_{b×1}1'_{b×1}) ⊗ 1_{b×1} : 1_{b×1} ⊗ (I_{b×b} - b⁻¹1_{b×1}1'_{b×1})),

and the column spaces of the four partitions are pairwise orthogonal.
(c) Obtain a suitable decomposition of R₀² and the ANOVA.
6.33 RBD with nested effects. A set of t drugs, each having d dose levels, is administered to subjects divided into b blocks. Each dose level of every drug is applied to m subjects of every block, while the allocation is completely random. The response is a measure of degree of relief caused by the drug. Write down a suitable nested model for this set-up and derive the ANOVA table.
6.34 Given the set-up of Exercise 6.33, how will you test the following hypotheses?
                          Temperature in Celsius
        -20            0             20            40            60
Tree    y      z       y      z      y      z      y     z       y     z
 1     13.14  42.1    12.46  41.1    9.43  43.1    7.63  41.4    6.34  39.1
 2     15.90  41.0    14.11  39.4   11.30  40.3    9.56  38.6    7.27  36.7
 3     13.39  41.1    12.32  40.2    9.65  40.6    7.90  41.7    6.41  39.7
 4     15.51  41.0    13.68  39.8   10.33  40.4    8.27  39.8    7.06  39.3
 5     15.53  41.0    13.16  41.2   10.29  39.7    8.67  39.0    6.68  39.0
 6     15.26  42.0    13.64  40.0   10.35  40.3    8.67  40.9    6.62  41.2
 7     15.06  40.4    13.25  39.0   10.56  34.9    8.10  40.1    6.15  41.4
 8     15.21  39.3    13.54  38.8   10.46  37.5    8.30  40.6    6.09  41.8
 9     16.90  39.2    15.23  38.5   11.94  38.5    9.34  39.4    6.26  41.7
10     15.45  37.7    14.06  35.7   10.74  36.7    7.75  38.9    6.29  38.2

Table 6.12 Compressive strength and moisture content of wood in hoop trees (source: Williams, 1959)
(a) The dose levels of none of the drugs have different effects.
(b) The t drugs do not have different effects.
(c) The various dose levels of Drug 1 do not have different effects.
6.35 Consider the model (6.6.1) where Xβ is as in (6.3.1), η is an estimable scalar parameter (written as η) and Z is a known vector (written as z). Find the BLUEs of η and Xβ, and the fitted values. Give simple expressions for Var(η̂) and D(Xβ̂).
6.36 The data set of Table 6.12, taken from Williams (1959), consists of the maximum compressive strength parallel to the grain (y) and moisture content (z) of 10 hoop trees for five temperature categories. Using y as response and z as a covariate, describe the ANCOVA table and test for the hypothesis that the five temperature categories do not have different effects. How will the conclusions change if the covariate is ignored? Carry out a test for the significance of the covariate effect.
6.37 Given the ANCOVA model of Exercise 6.35, how can you test the hypothesis of 'no difference in treatment effects'?
6.38 Describe the ANCOVA table for the model of Exercise 6.35.
6.39 Suppose that the vector z of the ANCOVA model of Exercise 6.35 has the element 1 at the location corresponding to y_kl, and 0 everywhere else. Describe explicitly the GLRT for the hypothesis of 'no difference in treatment effects'.
6.40 Show that the test of Exercise 6.39 does not depend on the value of y_kl, and explain the result.
6.41 Prove the equivalence of the tests derived in Exercises 6.29 and 6.39, in the following manner.
(a) Show that the value of the uncorrected R₀² at y_kl = y_a, obtained in Exercise 6.29, is the same as the value of the corrected (for covariate) R₀², obtained in Exercise 6.39, at y_kl = y_a.
(b) Show that the value of the uncorrected R_H² at y_kl = y_b, obtained in Exercise 6.29, is the same as the value of the corrected (for covariate) R_H², obtained in Exercise 6.39, at y_kl = y_a.
(c) Hence, show that the test derived from the missing plot technique is the same as the test derived by using analysis of covariance.
Chapter 7
General Linear Model
In the present chapter, we discuss the 'general' case of the linear model, (y, Xβ, σ²V), where V is not necessarily the identity matrix. In contrast to the homoscedastic model, this allows the observations to be correlated as well as to have different variances. Although we allowed X to be rank-deficient in the preceding chapters, here for the first time we also allow the dispersion matrix V to be rank-deficient (that is, singular). We refer to the linear model with singular dispersion matrix as the singular linear model. A careful review of the literature on the singular linear model shows that this case has been treated by most authors as if it were an entirely different object, compared to the model with nonsingular dispersion matrix. Besides the obvious differences such as the non-invertibility of V, there are other subtle differences between the two situations, which we highlight in Section 7.2. Such differences have made the singular linear model a happy hunting ground for researchers equipped with the artillery of linear algebra. However, the heavy use of algebra, as well as the apprehension that intuition may sometimes fail us in this context, has had the effect of turning practitioners away from the singular linear model. One of the primary goals of this chapter is to demonstrate that the singular linear model can be dealt with using exactly the same fundamental principles that were used for the nonsingular case. One can continue to use most of the results obtained in Chapter 4 for the latter case, essentially unmindful of the fact that the model is singular.
In Section 7.1 we discuss why the singular model is important and provide some examples. After dealing briefly with the nuances of the singular model in Section 7.2, we extend the results of Chapter 4 to the general linear model (including the singular case) in Sections 7.3-7.10. Although our treatment is different from the conventional one, in our opinion it reinforces the strong commonality in the analyses of the general linear model and the more common homoscedastic linear model. As in Chapter 5, we assume that the errors are normally distributed while deriving tests of hypotheses, confidence sets and prediction and tolerance intervals in Sections 7.11, 7.12 and 7.13, respectively. This assumption can be relaxed when the sample size is large; see Section 11.6.

7.1 Why study the singular model?
Recall that, as observed on page 123, the usual model with linear restrictions can be expressed in terms of an equivalent unrestricted model, making it possible to apply the standard techniques to the restricted model. However, such an equivalent model is expressed in terms of a transformed set of explanatory variables as well as parameters. It is then necessary to use the reverse transform to make inferences on the original parameters. An alternative to this strategy is to treat the restrictions as a set of observations with zero error, as we did in Section 4.9 (see page 122). Thus, the model (y, Xβ, σ²I) under the restriction Aβ = ξ is represented by the augmented model equation

( y )   ( X )       ( ε )
(   ) = (   ) β  +  (   ).
( ξ )   ( A )       ( 0 )

This is equivalent to the unrestricted model (y_*, X_*β, σ²V_*), with

y_* = (y' : ξ')',   X_* = (X' : A')',   V_* = ( I  0 )
                                              ( 0  0 ).
This is clearly a singular model, since V is a singular matrix. The most attractive feature of this model is that its parameters are identical to those in the original (unrestricted) model.
A similar singular dispersion structure may also arise naturally when a subset of measurements come error-free or nearly so. This may happen because of sophisticated measurement of certain physical variables. More commonly, some measurements may have much smaller variance than others and a model with singular V may serve as a limiting special case of a model with nearly rank-deficient V. In an econometric context, Buser (1977) gives an example where the dispersion matrix of a set of investment returns involving 'risk-free' mutual funds, may be singular. If one or more linear combinations of the response is constrained to have a specified value, the dispersion matrix becomes singular. For instance, if the response consists of a few proportions whose sum must be equal to one, then the sum of the model errors has zero variance (see also Example 7.3.5). Scott et al. (1990) shows how such constraints on the response occur when the generalized linear model of Example 1.4.5 is used (see also Sengupta, 1995). Estimation in such models typically involves an iterative procedure with a new linear model appearing at every stage of the iteration. Methodology for the singular linear model can be very useful for such problems. The dispersion matrix can also be singular when the elements of y are not measured directly, but are derived from other measurements. Rowley (1977) and Bich (1990) provide examples of this phenomenon in the fields of Economics and Metrology, respectively. Analysis of models with nuisance parameters can sometimes lead to singular dispersion matrices. In Section 7.10 we shall see that when the only objects of interest are the estimable linear functions of /31 in the model (y, X\fil + X2/32iCr2I), we can work with the reduced model ((I - PXa)y, (I - PX2)Xlf3l,o2{I - P X2 )), which does not involve the nuisance parameters. However the dispersion matrix of this reduced model becomes singular. Kempthorne (1976) and Zyskind (1975) show that in finite population sampling, the response obtained from certain sampling schemes can be represented as a linear model (corresponding to a completely randomized design) whose covariance structure is singular. Similar singularity of the error dispersion occurs in randomized block designs and
some other designs in this context. Examples of a singular error dispersion matrix abound in the literature of state-space models, particularly in the area of automatic control (see Kohn and Ansley, 1983, Shaked and Soroka, 1987 and Bekir, 1988 for examples). It is shown in Section 9.1.6 how the minimum mean squared error linear predictor in a state-space model can be computed by means of best linear unbiased estimation in a special linear model. While some of the examples cited above can be handled by specialized techniques on a case-by-case basis, the general linear model with possibly singular dispersion matrix provides an appropriate and unified framework for the discussion of all such situations.
7.2 Special considerations with singular models
Before developing the theory, it would be a good idea to examine some of the 'peculiarities' of a singular model. Some readers may want to skip this section at the first reading, but it should be noted that in dealing with singular models, the slightly more general Proposition 7.2.3 replaces Proposition 4.1.4 regarding the LUEs and LZFs.
7.2.1 Checking for model consistency*
A fundamental difference between the linear models with singular and nonsingular V is that, unlike the latter, the singular model implies a partly deterministic statement about the response. According to the singular linear model, certain linear combinations of y have zero variance, that is, these are constant with probability 1. To see this, note that the model error vector must lie in C(V) with probability 1, while the mean of the response must lie in C(X). Therefore, we must have y ∈ C(X : V) with probability 1. Other equivalent forms of this condition are: (a) (I - P_V)y ∈ C((I - P_V)X) with probability 1, and (b) (I - P_X)y ∈ C((I - P_X)V) with probability 1. If V is of full rank, it is clear that these conditions are automatically satisfied. When V is singular, the conditions sometimes follow from the formulation of the model, so that the conditions need not be checked. However, it is a good idea to check to see if one of these conditions is met, before
proceeding to do inference. To understand the issue of model consistency when V is singular, consider the case of a nearly singular dispersion matrix, when V has determinant (and hence at least one eigenvalue) very close to zero. In this case, there is a linear function l'y which is close to its mean, l'Xβ. If the observed value of l'y is far from l'Xβ for every choice of β, we would have to suspect that the model is bad. If this happens in the extreme case when V is perfectly singular, the 'suspicion' will turn into disbelief, and we shall say that the model is inconsistent with the data.

Example 7.2.1  Consider the model (y, Xβ, σ²V) with

y = (1, 2, 3, 4)',   β = (β₀, β₁, β₂)',   V = diag(1, 1, a, a),

X the 4 × 3 matrix with rows (1, 1, 0), (1, 1, 0), (1, 0, 1), (1, 0, 1), and σ² known to be 1. If a is very small, say a = 0.0001 (making V nearly singular), we expect the last two observations (3 and 4) to be known with a high degree of precision. The model stipulates that these observations are noisy measurements of a common entity (β₀ + β₂), and they should not be far from one another. From Chebyshev's inequality,
P[|y₃ - y₄| > 1] ≤ Var(y₃ - y₄)/1² = 2a = 0.0002,
which makes the observed values of 3 and 4 quite improbable. This fact casts some doubt on the appropriateness of the model. Had a been larger, say 0.4, the observed values would have been more plausible and we would not have been much concerned about the issue. On the other hand, if a = 0 (making V singular), then the model postulates that y₃ and y₄ should be identical with probability 1, while in reality they are not. In such a case, the model is inconsistent with the data. In the latter case (a = 0),

(I - P_V)y = (0, 0, 3, 4)',   (I - P_V)X = the 4 × 3 matrix with rows (0, 0, 0), (0, 0, 0), (1, 0, 1), (1, 0, 1),
and the condition (I - P_V)y ∈ C((I - P_V)X) is not satisfied. The reader may verify that neither of the equivalent conditions y ∈ C(X : V) and (I - P_X)y ∈ C((I - P_X)V) is satisfied.

There is no reason why a properly formulated singular model will turn out to be inconsistent. Inconsistency often results from overlooking facts, either in model formulation or in the measurement of observations.
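The consistency condition y ∈ C(X : V) can be checked numerically by projecting y onto C(X : V). A minimal numpy sketch is given below; the function name and tolerance are our own choices, and the Moore-Penrose inverse is used to build the orthogonal projector. Applied to Example 7.2.1 with a = 0, it reports the inconsistency noted above.

import numpy as np

def is_consistent(y, X, V, tol=1e-8):
    """Check the condition y in C(X : V) for a possibly singular model."""
    W = np.hstack([X, V])
    P = W @ np.linalg.pinv(W)              # orthogonal projector onto C(X : V)
    return np.linalg.norm(y - P @ y) < tol * max(1.0, np.linalg.norm(y))

# Example 7.2.1 with a = 0:
y = np.array([1.0, 2.0, 3.0, 4.0])
X = np.array([[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]], dtype=float)
V = np.diag([1.0, 1.0, 0.0, 0.0])
print(is_consistent(y, X, V))              # False: model inconsistent with the data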
7.2.2 LUE, LZF, estimability and identifiability*
Consider the model (y, Xβ, σ²V) where V is singular. Then, as we saw earlier, there is a constant vector d such that (I - P_V)y = d with probability 1. Consequently (I - P_V)(I - P_d)y is zero with probability 1. The vector d may not be known before the data are gathered. If the singularity of V arises from incorporating a linear restriction, then d is known. If the singularity arises from zero measurement error, then d is not known until the measurements are actually recorded. What matters in the following discussion is that d exists and is non-random.

Let p = X'l, so that l'y is an LUE of p'β. Note that k'y is also an LUE of p'β when k = l + (I - P_V)(I - P_d)m for an arbitrary m. However, X'k is not necessarily equal to p. This shows that Proposition 4.1.4(a) does not hold when V is singular. Likewise, if l is a vector in C(X)^⊥, then l'y is an LZF. The linear function k'y is also an LZF when k = l + (I - P_d)(I - P_V)m for an arbitrary m. However, X'k is not necessarily equal to 0, showing that the characterization of LZFs given in Proposition 4.1.4(b) is not quite appropriate when V is singular.

Example 7.2.2  Consider the model (y, Xβ, σ²V) where y and X are as in Example 7.2.1, and V is a diagonal matrix with diagonal elements 1, 0, 1 and 0, respectively. It is easy to see that
(I - P_d)(I - P_V) =
[ 0    0     0    0   ]
[ 0    0.8   0   -0.4 ]
[ 0    0     0    0   ]
[ 0   -0.4   0    0.2 ].

If we choose l = (1 : 0 : 0 : 0)' and m = (0 : 0 : 0 : 1)', then
k = (1 : -.4 : 0 : .2)' and X'k = (.8 : .6 : .2)', but p = (1 : 1 : 0)'. Thus, k'y is an LUE of p'β and yet X'k ≠ p. On the other hand, by choosing l = (1 : -1 : 0 : 0)' and m as above, we have k = (1 : -1.4 : 0 : .2)' and X'k = (-.2 : -.4 : .2)'. Clearly k'y is an LZF even though X'k ≠ 0.

That Proposition 4.1.4 is no longer valid may make us somewhat uneasy about the singular case. Another difficulty arises from the fact that β is no longer a free parameter: it must satisfy the constraint (I - P_V)Xβ = d, and the definitions of LUE and LZF have to be interpreted in the light of this constraint. This constraint also affects the definition of estimability. According to Definition 4.1.9, we should call p'β an estimable function whenever there is a linear function l'y such that E(l'y) = p'β for all β that satisfy the above constraint. We shall also have to reexamine the issue of identifiability.

The first clue to resolving the confusion comes from the following fact. Although the vector k₁ of Example 7.2.2 violates Proposition 4.1.4, the LUE k₁'y is identical to the original LUE, l₁'y, with probability 1. Thus, k₁'y is not a different LUE of p'β; it is only an equivalent form of l₁'y. Likewise, k₂'y is just another representation of the LZF l₂'y. Therefore, only a minor modification of Proposition 4.1.4 is needed. We provide this modification in the following proposition.

Proposition 7.2.3 (Rao, 1973a) Consider the model (y, Xβ, σ²V) where V may be singular.
(a) k'y is an LUE of p'β if and only if there is a vector l such that X'l = p and k'y = l'y with probability 1.
(b) k'y is an LZF if and only if there is a vector l such that X'l = 0 and k'y = l'y with probability 1.

Proof. The sufficiency in part (a) is obvious. In order to prove the necessity, let k'y be an LUE of p'β. It follows that k'Xβ = p'β for all β satisfying the condition (I - P_V)Xβ = d. Choose an arbitrary β₁ so that (I - P_d)(I - P_V)Xβ₁ = 0. It follows that (I - P_V)Xβ₁ ∈ C(d). Therefore, we must either have (I - P_V)Xβ₁a = d for some constant a ≠ 0, or (I - P_V)Xβ₁ = 0. In the former case, β₁a satisfies the condition (I - P_V)Xβ = d, so it
must also satisfy k'X/3 = p'/3, and therefore (X'k — p)'/3i = 0. In the latter case, for every solution /30 to the equation (I — Py)Xj3 = d, the vector j30 + fii is also a solution. Therefore, /3 0 and /3 0 + /31 both satisfy the equation (X'k - p)'/3 = 0. It follows that (X'k - p)'0x = 0. We conclude from the analysis of the two cases that the equation (X'k - p)'/3 = 0 holds whenever (/ - J> )(I - Pv)X/3 = 0, that is, (X'k-p) e C(X'(I-PV)(IPd)). Therefore, there is an m such that
(X'k -p) = X'(I -PV)(IPd)m. Let I = k - (I - Py)(I - Pjm. Clearly X'l = p. Also, k'y - I'y = m'(I - Pd)(I - Py)y = 0 with probability 1. This proves part (a). The proof of part (b) follows from part (a) by putting p = 0. Q We now turn to the questions of estimability and identifiability in the singular case. The results stated earlier in Proposition 4.1.10 and Proposition 4.1.15 for the case V = I, continue to hold. So we restate these results and give proofs for the general case of possibly singular V. Consider the linear model (y, X/3, a2V) where V may be singular, and let d be a constant vector such that (/ — Pv)y = d with probability 1. Proposition 7.2.4 (Restatement of Proposition 4.1.10) A necessary and sufficient condition for the estimability of an LPF p'/3 in the general linear model (y,X(3,a2V) is p £ C(X'). Proof. Suppose that p'/3 is an estimable function, that is, there is a linear estimator k'y of p'(3 such that E(k'y) = p'(3 for all j3 satisfying the condition (/ — Py)X(3 = d. Let /3 0 be a choice of ft which satisfies the condition (/ - Py )X0 = d. Then f3x= (30 + (I- Px, )p is another such choice. Therefore, k'X0{ = p'fii, i = 0,1. It follows that p 7 3 0 = k'X/30
= k'Xpx
= p'/3i = p'Po + \\(I-
PX,)P\\2-
Therefore, \\(I - Px,)p\\2 = 0 and p must be in C(X'). The sufficiency of this condition is obvious. Remark 7.2.5 It follows from the above proposition that the following is an equivalent but simpler definition of estimability: An LPF p'fi is called estimable if there is a linear estimator i'y of p'/3 such that
7.3 Best linear unbiased estimation
251
E(l'y) — p1 ft for all fi. (There is no need to restrict the condition to all possible values of /3.) d The following result shows that the characterization given in Proposition 4.1.15 for identifiability, continues to hold in the singular case. Proposition 7.2.6 (Restatement of Proposition 4.1.15) An LPF in the model (y,X0,o~2V) is identifiable if and only if it is estimable. Proof. The LPF p'0 is identifiable if and only if p'fil ^ p'/32 => Xf3l ^ Xf32 for all 01 and (32 which satisfy the condition (7 — Pv)X0 = d. This is equivalent to the condition ' ( / — Pv)X(j31 — /32) — 0 and X{px - )92) = 0 imply p'{/31 - /32) = 0', which simplifies to lXf3 = 0 implies p'/3 = 0.' Since the latter condition is identical to p E C(X'), the statement follows from Proposition 7.2.4. We have seen that the building blocks for inference in the linear model are estimable LPFs, LUEs and LZFs and that their definitions do not involve the dispersion matrix V. As the above discussion shows, these basic building blocks remain essentially the same even when V is singular except that in that case, we may have (a) a set of LZFs which are identically zero with probability 1, and (b) a set of BLUEs with variance zero. These LZFs may be added to the other LZFs and LUES (those characterized by Proposition 4.1.4) to give them a different appearance. However the notions as well as the conditions for estimability and identifiability remain exactly the same. Baksalary et al. (1992) examine in detail the effects of the linear restriction on /3 implied by the singularity of V, on various aspects of linear estimation in the linear model. These results are similar in spirit to Proposition 7.2.3. 7.3 7.3.1
Best linear unbiased estimation BLUE, fitted values and residuals
The main message from the previous section is that one may continue to use LUEs and LZFs in the general case of possibly singular V, simply
252
Chapter 7 : General Linear Model
by using Proposition 7.2.3 in place of Proposition 4.1.4. In particular, the characterization of LZFs as linear functions of (I—Px)y continues to hold (see Remark 4.1.5). Since the proofs of Propositions 4.3.2 and 4.3.5 on finding the BLUE of an estimable LPF do not involve the form of the dispersion matrix of y, these results continue to hold in the general case and lead to the following representation of the BLUE of an estimable LPF. Proposition 7.3.1 Consider the vector LPF LX/3 which is estimable in the model (y,X/3,a2V). The BLUE of this LPF is given by LX(3 = L[I - V(I - PX){(I - PX)V(I
- PX)}~(I
-
Px)]y.
Further, the BLUE is unique. Proof. Note that Ly is an LUE of LXfi, and the corresponding BLUE has to be uncorrelated with (I — Px )y, — so that it is uncorrelated with all LZFs. Putting u — Ly and v = (I — Px)y in Proposition 3.1.2, we find that the linear compound u — Bv is uncorrelated with v if and only if it is equal to the expression given in the proposition. Since the mean of this quantity is E(u) — LX/3, it must be the (unique) BLUE
of LX/3. Remark 7.3.2 Note that C{{I-PX)V) = C(Cov((I-Px)y,y)) and C({I-PX)V(I-PX)) =C{D{{I-Px)y)). It follows from Proposition 3.1.1 that (I-Px)y is contained in C{{I-PX)V{I-PX)), and C((IPX)V) is a subset of the latter. According to Proposition 2.4.l(f), the expression of BLUE given in Proposition 7.3.1 does not depend on the choice of the g-inverse. Q Remark 7.3.3 Since any estimable vector LPF A/3 can be written as AX~X0, its BLUE is
A? = AX~[I - V(I - PX){(I - PX)V(I - PX)}~(I - Px)]y. According to Remark 7.3.10 (to follow), the above expression does not depend on the choice of X~. D Remark 7.3.4 When V — I, the expression for A/3 simplifies to AX"Pxy or A(X'X)~X'y, as in the Gauss-Markov theorem.
7.3 Best linear unbiased estimation
253
Example 7.3.5 (Centered data) Sometimes in linear regression the response as well as the explanatory variables are 'centered' by subtracting the respective sample means from the observed values of these variables. The operation of centering a vector unx\ amounts to replacing it with (I — P^ )u. Suppose that we begin with a regular homoscedastic model (y,X/3,a 2 /), and obtain yc and Xc after centering y and X, respectively. Because of the constraint l'y c = 0, a homoscedastic model is inappropriate for yc. As D(yc) — a2(I — P^), we use the model Putting V = I - Pv in the expression of BLUE {yc,Xcf3,a2(I-Pl)). I-Pl-Px in Proposition 7.3.1 and using the fact {I - PX){I - PXJ = to simplify it, we have LX/3 = L[I - (I - J> - PXc)(I - Px - PXc)~(I - PXc))yc =
LPxyc.
Thus, the estimator is the same as what it should be if V were equal to / . This is not a coincidence, as we shall see in Section 8.1.1. d The vector of fitted values, obtained by substituting L = I in Proposition 7.3.1, is y = [I-V(I-
PX){(I - PX)V(I - PX)}~(I - Px)]y.
(7.3.1)
Hence A0 = AX~y. Therefore, we can formally define the following estimator of /3: 0 = X-y = X-[I -V(I - PX){(I - PX)V(I - PX)}~(I - Px)]y, (7.3.2) where X is an arbitrary g-inverse of X. Accordingly, the estimator is not uniquely defined in general. It is uniquely denned if and only if X has full column rank, in which case it is the BLUE of /3 (Exercise 7.4). Even if X is rank deficient, 3 can be used as a plug-in estimator. In particular, the (unique) BLUE of any estimable LPF A/3 is identical to A@.
Using y given in (7.3.1), the residual vector e is given by, e = y-y
= V(I- PX){(I - PX)V(I - PX)}~(I - Px)y.
(7.3.3)
A geometric perspective of the BLUE of Xf5 in the possibly singular linear model is given in Section 11.5.3.
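Formulas (7.3.1) and (7.3.3) translate directly into a short computation. The sketch below is our own illustration (the function name is not from the text); Moore-Penrose inverses are used for the g-inverses, which is permissible since the expressions do not depend on the choice of g-inverse.

import numpy as np

def blue_fit(y, X, V):
    """Fitted values and residuals for (y, X beta, sigma^2 V), per (7.3.1) and (7.3.3)."""
    n = len(y)
    P_X = X @ np.linalg.pinv(X)              # orthogonal projector onto C(X)
    Q = np.eye(n) - P_X                      # I - P_X
    G = np.linalg.pinv(Q @ V @ Q)            # a g-inverse of (I - P_X) V (I - P_X)
    e = V @ Q @ G @ Q @ y                    # residual vector, (7.3.3)
    y_hat = y - e                            # fitted values, (7.3.1)
    return y_hat, e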
254
Chapter 7 : General Linear Model Consider the model (y,X0,a2V)
Example 7.3.6
with
l ° vh } k} 10 1 ' V ~ l i l i ' \1 0 1/ \ 1 \ 1/ For the purpose of discussion, we define the orthogonal unit vectors
x-\
l
I 1\ 1 1 «i = 2 - 1
f*\ '
U2 =
1 1 2 1
( l\ 1 - 1 ' "3 = 2 1
( l\ ' "4
=
1 - 1 2 -1
1 / It is easy to see that C(X) = C{u\ : u^) and V — 3u2u'2 + u^u'3. Thus, there is a partial overlap between the column spaces of X and V. Note that (I - Px) = u3u'3 + u4u'4. Hence, V(I - Px) = (I PX)V = U3U3. A g-inverse of (/ - PX)V(I - Px) is also u3u'3. Calculations made on the basis of (7.3.3) show that e = u^u'^y = (u : —u : n u : -u)' and y = y - e, where u = \(yi - y2 + y$ - y4)-
\-lJ
W
V- 1 /
I
One does not necessarily have to use the form of the BLUE given in Proposition 7.3.1 for computational purposes. We shall examine several other ways of obtaining the BLUE in Section 7.7. As we shall see however, the explicit expression given here is a very useful theoretical device. Our derivation makes use of the covariance adjustment principle, described in Proposition 3.1.2. Although the idea of covariance adjustment in the linear models context goes back many years (see, for instance, Rao, 1965) and this algebraic expression itself appears occasionally in the literature (see Albert, 1973 and Searle, 1994), the first statistical derivation of this expression appears to be relatively recent (see Bhimasankaram and Sengupta, 1996). Although the BLUE is unique, it does not have a unique representation when V is singular. This is because of the fact that an arbitrary LZF with zero dispersion can be added to a BLUE, producing 'another' BLUE. As pointed out in Section 7.2.2, the two estimators would have identical values with probability 1.
7.3 Best linear unbiased estimation
255
Linear statistics with zero dispersion have generated much interest among theoreticians. Rao (1973a) characterizes all possible representations of I such that I'y is the BLUE of its expectation. Rao (1979) and Harville (1981) consider the class of virtually linear estimators having the form a((I - Py)y) + y'b((I - Py)y) and show that the BLUE is the minimum variance unbiased estimator in the wider class of estimators where a(-) and b(-) are allowed to be nonlinear functions. Schonfeld and Werner (1987) consider estimators of the form a(/3) + y'b({3) which are identical to linear estimators with probability 1, and show that the minimum variance unbiased estimators in this class has the same dispersion as that of the corresponding BLUE. All these results make the 'best' linear unbiased estimator look better than ever before. However, there remains a question about the practical utility of the extended classes of estimators mentioned above.
7.3.2
Dispersions
It follows from (7.3.1) that D(y) =
a2[V-V(I-Px){(I-Px)V(I-Px)}-(I-Px)V],(7.3A)
D(e) = o2V{I-Px){{I-Px)V{I-Px)}-{I-Px)V.
(7.3.5)
The sum of these two components add up to the total dispersion, a2V, just as y and e add up to y. In the case of normal errors, D{y) = D(y\e) and D(e) = D{y\y). As in the homoscedastic case, the BLUEs are linear functions of y and the LZFs are linear functions of e with probability 1 (see Exercise 7.5). Their variances and covariances can be expressed in terms of D(y) and D(e). In particular, the dispersion of the BLUE of A/3 (if it is estimable) is D(A0) = D(AX-y) = a2AX~[V -
V(I-Px){(I-Px)V(I-Px)}-(I~Px)V](AX-y.
According to Remark 7.3.2 and Proposition 7.3.9 (to follow), the above expression does not depend on the choice of {(I—PX)V(I—P)}~ and X~.
256
Chapter 7 : General Linear Model
Example 7.3.7 (Centered data, continued) For the centered model (yc,Xc/3,cr2(I - PJ) of Example 7.3.5, we have D(yc) = a2PXc, D(e) =
a\l-Px-PXc).
If one ignores the singularity of the dispersion matrix and erroneously uses the homoscedastic model (yc, Xc/3, o"2/), then the consequent value of Die) would be o2(I — Pv ), which is obviously too large. Example 7.3.8 Consider the model of Example 7.3.6 along with the orthogonal vectors «i, u2, W3 and U4 denned there. It follows from the simplifications (I — Px) = U3W3 + u\u\, V(I — Px) = u^u'3 and (I-PX)V(I-PX) = U3U3 that D(y) = 3a2u2u'2 and D{e) = a2u3u'3.n We now characterize the column spaces spanned by y, e and their respective dispersion matrices. Proposition 7.3.9
For the linear model (y,X/3,a2V),
(a) C(D(y))=C(X)nC(V). (b)C(D(e))=C(V(I-Px)). Proof. The expression of D(y) in (7.3.4) directly implies that C(D(y)) C C(V). Further, (/ - Px)D(y) is easily seen to be identically zero. Consequently C(D(y)) C C{X) and indeed, C(D{y)) C C{X)nC(V). Now let I G C(D(y))L, so that [V - V(I - PX){(I - PX)V{I PX)}~{I - Px)V]l = 0. We shall show that I e [C(X) n C ( F ) ] x . To prove this, take a vector m from C(X) n C(V). Write m as Vl\ and then as XI2, so that I'm
= I'Vh =
l'V(I-Px){(I-Px)V(I-Px)}-(I-Px)Vh
= l'V(I-Px){(I-Px)V{I-Px)}-(I-Px)Xl2
= 0.
Therefore, CiDiy))1 C [C{X) nC(V)]L. This proves part (a). The expression of D(e) given in (7.3.5) implies that C(D(e)) C C(V(I - Px). The reverse inclusion, C(V{I - Px) C C(D(e)), follows from the fact that D(e)(I - Px) = a2V(I - Px). This proves part (b).
7.3 Best linear unbiased estimation
257
Remark 7.3.10 Since E(y) = X(3, we have by virtue of part (a) of Proposition 3.1.1 and Proposition 7.3.9 V G C(X), e 6
C(V(I-PX)).
almost surely.
D
It also follows from Propositions 3.1.1 and 7.3.9 that
(xp-xp)eC{x)nc{V). This result has a nice interpretation. A consequence of best linear unbiased estimation is that we approximate the unknown decomposition y = X/3 + e by y = X/3 + e. The error in this decomposition, {Xfi — X/3), must be a vector that has wrongly been put in the systematic part, X/3, although it should have been part of the error. Such a 'mixup' can occur if and only if this vector simultaneously belongs to the column space of the systematic part, C(X), and the column space of e, C(V). If C(X) and C(V) are virtually disjoint, then there would be no scope of such an error. Indeed, in such a case Xj3 = Xfi with probability 1. A geometric interpretation of the decomposition of y into components belonging to various subspaces is discussed in Chapter 11. 7.3.3
The nonsingular case
Proposition 7.3.11 IfV is positive definite and A/3 is an estimable LPF, then the BLUE of A/3 is A0 =
A{X'V-lX)~X'V-ly,
and its dispersion matrix is D(A/3) = o2A(X'V~lX)~
A'.
Proof. Let CC' be a rank-factorization of V. Since the column spaces of /—Px and X are orthogonal complements of one another, the column spaces of C'(I — Px) and C~1X must be orthogonal complements of
258
Chapter 7 : General Linear Model
one another. Therefore, we can rewrite (7.3.1) as y =
= = =
C[I-C'(I-Px){(I-Px)CC'(I-Px)}-(I-Px)C}C-ly cv
- pcV-pJc~ly
=
CPc-ixc-ly
CC-lX[X'{C-l)'C-xX)-X'{C-l)'C-ly X[X'V-lX}-X'V-ly.
According to Remark 7.3.3, A/3 = AX~y. The assertions follow. The above proposition implies that D{y) = o2X{X'V-lX)-X',
and D(e) =
X{X'V~lX)~X'].
Remark 7.3.12 If V is nonsingular and X has full column rank, then /3 is estimable. Putting A = I in Proposition 7.3.11, we have the BLUE of /3 given by 3 = {X'V-lX)-lX'V-ly. This widely used expression was first obtained by Aitken (1935), and is sometimes referred to as the Aitken estimator. The dispersion of this estimator is
D0)=a2(X'V-1X)-\ The Aitken estimator simplifies to (X'X)~lX'y 7.4
when V = I.
Estimation of error variance
In order to obtain a reasonable estimator of a2, we have to utilize the LZFs. We begin by extending Proposition 4.7.5 to the general linear model (y, X/3, o2V). The definitions of a generating set, a basis set and a standardized basis set of LZFs given in Section 4.7.1 remain the same. Proposition 7.4.1 / / z is any vector whose elements constitute a standardized basis set of LZFs of the model (y,X/3,a2V), then (a) z has p(V : X)—p(X) elements; (b) the value of z'z does not depend on the choice of the standardized basis set.
7.4 Estimation of error variance
259
Proof. Let m be the number of elements of z. Since z and e are both basis sets, there are n x m matrices C and B such that e = C2: and z = B'e. Therefore, m = p{D{z)) = p(B'D(e)B)
< p{D(e)) = p{CD(z)C)
= p(C) < m.
It follows from the above and Proposition 7.3.9 that m = p(D(e)) = p(V(I - Px)). The last expression is equal to p(V : X)-p(X) (by Proposition 2.4.4). This proves part (a). In order to prove part (b), note that D(e) = D(Cz) = u2CC. Also, a 2 / = D{z) = D{B'e) = D{B'Cz) = a2{B'C)(C'B). B'C must be an orthogonal matrix, which means that C'BB'C = I. Consequently CC'BB'CC' and BB' must be a g-inverse of CC. It follows that z'z = e'BB'e = e'{CC')~e = o2e'[D{e)Ye, which does not depend on the choice of the standardized basis set z. The above proposition implies that we can continue to use Definition 4.7.6 of the error sum of squares (R2)) in the case of general linear models. (However, RQ would not be equal to the sum of squared residuals, e'e, in the general case.) Further, a natural unbiased estimator of £ =
^
=
e'[a-2D(e)]-e
p(V:X)-p(X) p(V:X)-p(Xy [ '} Remark 7.4.2 If z is any vector of LZFs whose elements constitute a generating set, then R% = z'[a-2D{z)}-z, p{D{z)) = p{V : X)-p(X) and a2 = z'[a~2D(z)]~z/p(D(z)) (Exercise 7.15). This is an extension of the statement of Remark 4.7.7 to the general linear model. We have already seen the special case obtained by putting z = e in the above statement. An alternative expression may be obtained by choosing z = (I — Px)yR2 = y'(I - PX){{I - PX)V{I - PX)}-(I
- Px)y.
(7.4.2)
260
Chapter 7 : General Linear Model
According to Proposition 7.4.1, the number of elements in a standardized basis set of LZFs must be p(V : X)—p(X). This is the number of error degrees of freedom of the general linear model. Example 7.4.3 For the model of Example 7.3.6, it has been noted that (I — Px) = uzu'3 + U4U4, and a g-inverse of (I — PX)V(I — Px) is W3W3. Hence,
^2 _ y'(u3u3
+ Uju'^uzu'zjuzu'z + u4u'4)y _
° -
3~2
,
2
~ l|W3y|1 '
which simplifies to (y\ — 2/2 + 2/3 - 2/4)2/4-
E
In order to simplify the expression of a2, we may try to write RQ as e1 Me for a suitable matrix M. The following result gives a general description of M'. Proposition 7.4.4 If M is an arbitrary g-inverse of V + XUX', where U is any matrix of appropriate order, then RQ = e'Me. Proof. See Exercise 7.16.
O
Note that there is no condition whatsoever on the matrix U. (For instance, we need not have C(V + XUX') = C(X : V) — a condition we use for another purpose in Section 7.7.1.) The choice M = V~ has a special significance, as we shall see in Section 7.5. This choice leads to the simple form of the above unbiased estimator of a1, ^ = e'V-e/[p(V : X)-p(X)]. The expression of a2 reduces to e'V~le/[n gular.
(7.4.3)
— p(X)] when V is nonsin-
Example 7.4.5 (Centered data, continued) For the centered model (yc,Xcf3,a2(I - Px)) of Example 7.3.5, we have e = ( I - Px )y c , V = I - P1 and p(V : Xc) = p{V) = n - 1. Hence, ^ °
\\{I-Px-PXc){I-Px)ycf n-l-p(Xc)
_ \\(I - PXc)yc\\2 n-p(Xc)-l'
7.5 Maximum likelihood estimation
261
If one ignores the singularity of the dispersion matrix and erroneously uses the homoscedastic model (yc, Xc/3, a21), then the consequent value of a 2 would be ||(/ — Px )yc\\2/(n —p(Xc)), which underestimates
Maximum likelihood estimation
Let y ~ N{Xf3, a2V), and CC' be a rank-factorization of V. Then the joint likelihood of the observation vector is (see Section 3.2) (2 7 ra 2 )-"W/ 2 |C'Cr 1 / 2 e x p [ - ^ ( y - XP)'V-(y
- XP)],
with the restriction {I — Pv)Xf3 = (I — Py)y, which ensures that the quadratic function in the exponent of the likelihood does not depend on the choice of the g-inverse V. Following the derivation in the homoscedastic case, the MLEs can be shown to be 0ML
= argrmn[(y-X/3)V-(y-X/3)],
^2ML =
^mm[(y-X(3)'V-(y-X(3)}.
If/90 represents the true value of @, then y — X(30 must be in C(V) with probability 1. When the quadratic function is minimized with respect to /3, the choice must be restricted to the set of values which satisfy the condition (y - X/3) G C{V). Proposition 7.5.1 If y ~ N(X/3,a2V), given by X/3ML = y, defined in (7.3.1).
then the MLE of Xp is
Proof. It follows from the preceding discussion that the MLE of X/3 is the vector u which minimizes (y — u)'V~(y — u) subject to the conditions (y - u) € C{V) and u 6 C{X).
262
Chapter 7 : General Linear Model
The quadratic function can be written as (y — u + e)'V~(y — u + e), where y and e are as defined in Section 7.3.1. Minimizing this with respect to u is equivalent to minimizing (d + e)'V~ (d + e) with respect to d, where d = y — u. Since e G C(V) and y G C(X), the twin conditions (y - u) G C{V) and u G C(X) are equivalent to d G C(F) and d G C(-X"). Thus we have the equivalent minimization problem min (d + e)'V~(d + e). deC{X)DC(V)
Suppose that D = V - V{I - Px){(I - PX)V(I - Px)}~(I - PX)V. Proposition 7.3.9 indicates that C(D) = C(X) nC(V). Therefore, the minimizer of the above quadratic function must be of the form Dl for some I, and the minimization problem becomes
mm(Dl + e)'V-(Dl + e). It is easy to see from Proposition 7.3.9 that e £ C(V(I — Py)) with probability 1. It follows that DV~e = 0 with probability 1. After combining this with the fact that DV~D = D, the above quadratic function simplifies almost surely to I'Dl + e'V~e, which is minimized if and only if Dl = 0. Since u — y — d — y — Dl, the uniquely optimal choice of u is y. D Proposition 7.5.1 immediately leads to the two following results. Proposition 7.5.2 The MLE of any estimable LPF is unique and it coincides with the corresponding BLUE. If 0 is not entirely estimable, the MLE of j3 is not unique, and is given by X~y for any choice of X~. The MLE of a non-estimable LPF, p'/3 is of the form p'X~y, which is not uniquely defined.
Proposition 7.5.3
The MLE of a2 is o^ML = e'Ve/piV).
It is clear from the discussion at the end of Section 7.4 and Proposition 7.5.3 that the MLE of a2 is biased and underestimates a2. It also transpires from this discussion that the minimized value of the quadratic function (y — Xf3)'V~{y — Xj3) subject to the restriction
(y-X/3)eC(V)isRl
7.6 Weighted least squares estimation
263
By an argument similar to that used in Remark 4.7.8, it can be shown that if y given X has the distribution N(Xfi, o2V), then y and o2 are the UMVUEs of Xj3 and a2, respectively. Other optimal properties of a2 are discussed in Chapter 8 (see Section 8.2.3 and Exercise 11).
7.6
Weighted least squares estimation
Suppose that we want to estimate /3 by the vector which minimizes (y—Xf3)'M(y—X/3), where M is a symmetric and nonnegative definite 'weight' matrix. Setting the derivative (gradient) of the above quadratic function with respect to /3 equal to zero, we have
~2X'M(y - X/3) = 0,
or (X'MX)j3 = X'My.
Thus, the general solution is of the form (X' M X)~ X' My. The matrix of second derivatives (Hessian), 2X'MX, is nonnegative definite, confirming that this corresponds to a minimum. We refer to this method of estimation as the weighted least squares (WLS) method, and the vector (X'MX)~X'My as a weighted least squares estimator (WLSE) of/3. A WLSE of f) depends on the choice of M . Even for a fixed M it may not be unique. However, the minimized value of the quadratic function,
mm(y - X/3)'M(y - X0) = y'My - y'MX(X'MX)~
X'My,
is unique. This expression can be used to obtain the following unbiased estimator of a2 (see Exercise 7.12): 1 a WLS =
y'My - y'MX(X'MX)-X'My tr(MV-MX(X'MX)-X'MV)
^
^
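A weighted least squares estimator and the minimized value of its quadratic form can be computed as follows. This is a sketch with our own names; it makes no claim that the result is a BLUE unless M is chosen as discussed in the remark that follows.

import numpy as np

def wls_estimator(y, X, M):
    """A WLSE (X'MX)^- X'My for a nonnegative definite weight matrix M.
    The estimator need not be unique, but the minimized quadratic form is."""
    XtM = X.T @ M
    beta_wls = np.linalg.pinv(XtM @ X) @ XtM @ y
    r = y - X @ beta_wls
    return beta_wls, r @ M @ r               # estimator and min (y - Xb)' M (y - Xb)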
Remark 7.6.1 In the previous section we found that the problem of maximizing the normal likelihood with respect to /3 is equivalent to minimizing (y — X/3)V~(y — Xj3) subject to a linear constraint. The constraint (y - X&) G C(V) is automatically satisfied if C(X) C C(V), and in particular, when V is positive definite. In such a case the normal
264
Chapter 7 : General Linear Model
MLE of X/3 (which is also the BLUE) is a special case of the WLSE corresponding to the choice, M = V~. The equivalence of the BLUE and WLSE holds in this case even if the error distribution is not normal (see also Exercise 7.8). An important question is: is there a choice of M such that the WLSE is the same as the BLUE in the general case? Rao (1973b) shows that the WLSE coincides with the BLUE, and the estimator given in (7.6.1) is the usual unbiased estimator of a 2 , if and only if M is a symmetric g-inverse of W — V + XUX', where U is any symmetric matrix such that C(W) = C(V : X). We prove the sufficiency of this form of M in Section 7.7.1, with the additional restriction that U is a nonnegative definite matrix. Note that when M = W~, the quadratic function does not depend on the choice of the g-inverse, because (y — X(3) is almost surely in C(W). It may appear from the above discussion that the singular model can be dealt with, very much like the nonsingular case, simply by replacing V with W. However, this is not true. Although the expression for the BLUE obtained by such a substitution is correct, the resulting dispersion is too large. See Proposition 7.7.2(b) for the correct dispersion in this case. Example 7.6.2 Consider once again the model of Example 7.3.6 along with the orthogonal vectors tti, u2, ^3 and u^ defined there. In order to use the WLS approach, we have to find a suitable U such that W = V + XUX' satisfies C(W) = C(X : V) = C(t*i : u2 : u 3 ). We choose W = u\u[ + 3u2u'2 + U3U3, which is accomplished by choosing JJ = 1(0 : 1 : - l ) ' ( 0 : 1 : - 1 ) . It follows that W~ may be chosen as u\u'x + ^U2u'2 + tt3«3, and that
X'WX
= {X'u^X'ux)' + Ux'u2)(X'u2y = 2v1v'1 + 2v2v'2,
where
-
*
(
-
!
)
7.7 Some recipes for obtaining the BLUE
265
and vi and v2 are orthogonal unit vectors. Thus, we can write (X'W-JT)-=it7i«i + £»2«2, X'W~y=y/2(u'1y)v1 + \J\{u'2y)v2, which lead to X{X'W-X)-X'W-y
= ^ X v i + ^ X i ; 2 = {uxu[ + u2w'2)yv2 v6 This is the WLSE y. The value of y obtained in Example 7.3.6 was (I — U3u'3)y, which is the same as (u\u'1+u2u2+U4u'i)y. The apparent difference between the two expressions, «4«'4y, is in fact zero, as y € C(X : V) = C{u\ : ui : U3) with probability 1. The equivalence of the two expressions is a confirmation of Proposition 7.7.2, to be proved later. For a comparison, consider the model (y, X/3, <J2W) where W is as defined above. The value of y turns out to be the same as above. However, W(J-Px) = (I~Px)W = u3u'3 which leads to D(y) = =
a2[W-W(I-Px){(I-Px)W(I~Px)}-(I-Px)W] <J2(UIU[
+ 31121*2)-
This is larger than the expression D(y) = 3a2U2u'2 obtained for the actual model in Example 7.3.8. 7.7
Some recipes for obtaining the BLUE
The characterization of BLUE through LZFs (Proposition 4.3.2) and the covariance adjustment principle (Proposition 3.1.2) had been the basis of our derivations of the BLUE, its dispersion and the estimator of error variance. Several other methods of obtaining these quantities are available in the literature. In this section we outline the major methods, gaining some insight in the process.
266
Chapter 7 : General Linear Model
7.7.1
'Unified theory' of least squares
estimation*
We have observed in Remark 7.6.1 that a BLUE can be computed through the weighted least squares method when C(X) C C(V). This condition may not always hold, but we can try to ensure it by expanding the column space of V. This we do by introducing more error into the model! Let M denote the model (y, Xfi, o2V). Suppose that y* = y + X 7 , where the random vector 7 is uncorrelated with y, has zero mean and Z>(7) = a2U. Clearly D(yJ = a2(V + XUX'). Let us denote the matrix V + XUX' by W, and the model (y*,Xp,a2W) by M*. Proposition 7.7.1
Let the models M. and M* be as defined above.
(a) The estimable LPFs in the models M. and M* are identical. (b) The set of LZFs in the models M. and M* are identical. (c) I'y is the BLUE of p'fi in the model M if and only if I'y* is the BLUE ofp'fi in the model M*. Proof. Part (a) follows from the identity of the systematic parts of the two models. Part (b) is a consequence of Remark 4.1.5 and the fact that (/ — Px)X~f = 0 almost surely. Part (c) follows from the fact that Cov(l'y,,k'y) = Cov(l'y,k'y) where fe'y is an LZF in either model. Proposition 7.7.1 establishes a kind of equivalence between the models M and M*. In order to be able to use the weighted least squares method for the computation of BLUE in the latter model, we have to ensure that the sufficient condition C(X) C C(W) (mentioned in Remark 7.6.1) holds. We have to choose the matrix U in model M* such that C(W) = C(V : X). (This is indeed the 'best case scenario', as in general C(W) C C(V : X), see Exercise 17.) This choice will produce an appropriate BLUE through weighted least squares, but we should expect the dispersions obtained from M* to be unduly large, as that model contains additional error. The next proposition gives a formal statement of the results. Proposition 7.7.2 Suppose that U is a symmetric and nonnegative definite matrix such that the matrix W = V + XUX' satisfies the condition C(W) = C{V : X). Then
7.7 Some recipes for obtaining the BLUE
267
(a) The BLUE ofX/3 is X0 = XPWLS
=
X{X'W-X)-X'W-y,
being a WLSE which minimizes (y — Xfi)'W~(y (b) The dispersion of X/3 is PWLS
D(Xp) = a2X[(X'W-X)~
— Xj3).
- U]X'.
(c) The error sum of squares can be written as the minimized value of the quadratic form, R2Q = (y- X0WLS)'W-(y
-
XpWLS).
(d) The usual unbiased estimator of a2 is given by a1 = RQ/[P(W) — P(X)]. Proof. The condition W = V + XUX' ensures C(X) C C(W). According to Remark 7.6.1, the BLUE of X/3 in the model M* is X{X'W~X)-X'W'y^ which must be unique. The result of part (a) follows from part (c) of Proposition 7.7.1. Let us rewrite the BLUE under the models M and M* as Cy and Cy*, respectively, where C = X{X'W~X)~X'W~. Then we have D{Cyt) = D(Cy + C 7 ) = D(Cy) + DiCX-y) = D(Cy) + £>(X7). Therefore, D{XP) = D(Cy J - D{X1) = a2X(X'W~X)~X'
- a2XUX'.
This proves part (b). Parts (c) and (d) follow from the identity of the LZFs under the models M and M* as established in Proposition 7.7.l(b). Example 7.7.3 For the model of Examples 7.3.6 and 7.6.2, it has been shown that by choosing U = \(0 : 1 : -l)'(0 : 1 : -1) we obtain the appropriate BLUE of X/3 by the weighted least squares method. However the dispersion of the BLUE of X/3 computed from the model M* is inappropriate. According to Proposition 7.7.2, we need to adjust
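Proposition 7.7.2 translates into a short computation. In the sketch below, the default choice U = I is our own (any symmetric nonnegative definite U with C(V + XUX') = C(V : X) would do, and U = I always achieves this); the function and variable names are also ours.

import numpy as np

def blue_via_unified_theory(y, X, V, U=None):
    """BLUE of X beta, its dispersion factor, and sigma^2 estimate via W = V + XUX'."""
    if U is None:
        U = np.eye(X.shape[1])               # our default; ensures C(W) = C(V : X)
    W = V + X @ U @ X.T
    Winv = np.linalg.pinv(W)                 # a g-inverse of W
    G = np.linalg.pinv(X.T @ Winv @ X)       # (X' W^- X)^-
    Xbeta_hat = X @ G @ X.T @ Winv @ y       # part (a)
    disp_factor = X @ (G - U) @ X.T          # part (b): D(X beta_hat) / sigma^2
    rss = (y - Xbeta_hat) @ Winv @ (y - Xbeta_hat)        # part (c): R_0^2
    dof = np.linalg.matrix_rank(W) - np.linalg.matrix_rank(X)
    return Xbeta_hat, disp_factor, rss / dof              # part (d): sigma^2 estimate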
268
Chapter 7 : General Linear Model
the latter dispersion by subtracting a2XUX' from it. In the present case this adjustment amounts to subtracting G2U\U\ from o2{u\u'x + 3u2u'2). The correct dispersion matrix is 3o~2u2u'2, which coincides with the expression computed directly in Example 7.3.8. The expression of R2 obtained from Proposition 7.7.2 is i?o = e'W~e = y'u3u'3[uiu[ + lu2u2 + usu'3}usu'3y = {u'3y)2, which simplifies to {y\ — y2 + j/3 — 2/4)2/4. Since p(W) — p(X) = 1, a2 is also equal to this expression, as found in Example 7.4.3. As a direct consequence of Proposition 7.7.2, we have the following result. Proposition 7.7.4 If the matrix W is as described in Proposition 7.7.2, then the dispersion of the BLUE can be written as D(y) = Cov(y,y) = a2 X(X'W~ X)~ which must be a symmetric matrix. IfC(X)
D{y) = a2X(X'V-X)-X'V-V
X'W~V,
C C(V), then
= a2X(X'V~ X)~ X'.
Remark 7.7.5 Since matrices of the form W = V + XUX' are so useful when C(W) = C(V : X), equivalent forms of this condition are important. It can be shown that two equivalent conditions are C(X) C C(W) and p(W) = p(V : X) (see Exercise 7.18). Proposition 7.7.2 can be strengthened by dropping the condition of nonnegative definiteness of U (Exercise 7.20). From a practical point of view, not much is lost by forcing U to be symmetric and nonnegative definite. Nevertheless, considerable research has been done with the aim of relaxing these conditions. See Baksalary and Mathew (1990) and the references therein for a collection of sufficient conditions for the equivalence of the BLUE with a WLSE.
7.7.2
The inverse partitioned matrix approach*
Suppose that we wish to find the BLUE of Xj3 and the residual vector simultaneously. Let the BLUE of X/3 be L'y. By Proposition 4.3.2,
7.7 Some recipes for obtaining the BLUE
269
Cov{{I - Px)y,L'y) = {I - PX)VL = 0. Thus C{VL) C C(X), that is, there is a matrix T such that VL + X T = 0. Combining this with the unbiasedness condition, L'X = X, we have
{I ?)(2)-(£)-
™
On the other hand, the task of finding the BLUE of X/3 and the residual vector amounts to finding a decomposition y = y + e, such that the summands should satisfy the conditions given in Remark 7.3.10. Therefore, we can write the decomposition as y = Xu + Vv where v € C(I — Px), i.e., X'v = 0. The last two equations can be written in a combined form as
(£ Combining (7.7.1) and (7.7.2), we have the matrix equation (V
X \ ( L
v \ _ ( 0
y\
,
[x1
o) { T
U)~ \ X '
oj-
[1J-6)
,
After solving the above equation we should have the BLUE of X/3 as L'y or Xu, and the residual vector as Vv. Proposition 7.7.6 (Rao, 1973c) Suppose that a g-inverse of the first matrix of (7.7.3) is given by (V
X\~
\x'
oj
=
(d
\c3
C2 \
-cj-
Then (a) (b) (c) (d)
= XC'2y = XC3y. The BLUE ofXBisJTp The dispersion of XB is D{XB) = a2XC4X'. The residual vector corresponding to the BLUE is e = VC\y. The usual unbiased estimator of a2 is given by a2 = R^/[p{V : X) — p(X)], where i?g *s ^ e error sum of squares given by
Rl = y'Ciy.
270
Chapter 7 : General Linear Model
Proof. It follows from the discussion preceding this proposition that the system of equations (7.7.3) is consistent. A possible set of solutions is given by (L v\ _(CX C2 \ ( 0 y\ \T u) \C3 ~C4J \X' OjPart (a)followsimmediately from the representations of the BLUE given
by L'y and Xu. Using the conditions VL + XT = 0 and L'X = X, we have
D(XJ3) = G2L'VL
= -a2L'XT
= o2L'XCAX'
=
This proves part (b). Part (c) follows from the representations of the residual vector given by e = Vv. In order to prove part (d), we simplify the numerator of (7.4.3) as follows, using the conditions X'v = 0, e = Vv and y = Xu.
Rl = e'V'e = e'V'Vv = e'v = y'v - u'X'v = y'v = y'Cxy. This leads immediately to the expression for a2. Example 7.7.7 For the model of Example 7.3.6, it can be shown after some numerical computation that [Y_l
) = 4.3722uiWi - \/2ix2W3 - v^uatt^ - 1.3722w4W4 + u5u'5,
where Ui,... ,u$ are the orthogonal vectors given below: 0 /-.43621 1/2 -.43621 1/2 0 -.43621 - 1 / 2 0 (ui : u2 : uz : uA : u5) = -.43621 - 1 / 2 0 -.39907 0 0 -.19953 0 -l/\/2 ^ - . 19953 0 1/V2
-.24438 - 1 / 2 \ -.24438 1/2 -.24438 - 1 / 2 -.24438 1/2 . .71233 0 .35616 0 .35616 0/
Using the above singular value decomposition, we obtain the g-inverse
(C^ \^3
°2 ) =4.3722uiu[-V2u2uf3-V2u3u'2-1.3722u4u'4 —wy
+ u5u'5.
7.7 Some recipes for obtaining the BLUE
271
Identification of the blocks leads to / ! °i-4
1 -1 1 -1 \ -1 1 -1 1 i _! ! _i '
V-i
i - i
x °2-6
i ;
/ 1 2 -1 \ 1 2 -1 j _j 2 ,
Vi -i
/
1 1 1 1\ C3 = U 2 2 -1 -1 , \ -l -l 2 2y
C4 =
2 ;
/ 4 2 2 \ 2 1 1 . \ 2 1 1 y
Using part (a) of Proposition 7.7.6, we have ~ _ (v\ + V2 . yi +^2 . ys + yi . y3_+_y4V y
~ V 2
'
2
'
2
" 2 J'
which is almost surely the same as the expression obtained in Example 7.3.6 in view of the fact that u'^y = 0 with probability 1. The other parts of Proposition 7.7.6 lead to values of D(X/3), e and a2 which are identical to those obtained in Examples 7.3.8, 7.3.6 and 7.4.3, respectively. 7.7.3
A constrained least squares approach*
Since the error vector in the linear model (y, -X"/3, a2 V) is contained in C(V) almost surely, we can write the observation vector as y = X(3 + Fu,
(7.7.4)
where FF' is a rank-factorization of V. Note that IMI2 = \\F-L(y - X/3)||2 = (y - X/3)'V"(y - X/3). We proved in Section 7.5 that the BLUE of X/3 uniquely minimizes the right hand side of the above equation subject to the constraint {y - X/3) € C{V). As C{V) = C{F), the constraint is equivalent to (7.7.4) for some /3 and u. The minimization problem can be solved without invoking normality of y, as long as the response vector satisfies the constraint y G C(V : X) for consistency. Therefore, we have the following result.
272
Chapter 7 : General Linear Model
Proposition 7.7.8 Let FF' be a rank-factorization ofV in the linear model (y,X(3,a2V), and let /3 and u be a choice of f3 and u which minimizes \\u\\2 subject to the constraint (7.7-4). Then the BLUE of X(3 is X/3 which is unique, and the corresponding residual vector is
Fu. The importance of Proposition 7.7.8 is that the formulation given here lends itself to a numerically stable computational procedure for obtaining the quantities of interest. Kourouklis and Paige (1981) outline a procedure for obtaining the BLUE of X/3, its dispersion, and an uncorrelated basis set of LZFs (see Exercise 7.19). The idea of obtaining the BLUE in the general linear model as the solution of a constrained quadratic optimization problem is quite old. Goldman and Zelen (1964), the first authors to formally consider the general linear model with possibly singular covariance matrix, show that the BLUE can be obtained by minimizing (y — X/3)'V~{y — X(3) with respect to /3 subject to a linear constraint on /3. The constraint is equivalent to (I — Pv)X/3 — (I — Pv)y, which becomes important only when the dispersion matrix is singular.
7.8
Information matrix and Cramer-Rao bound*
If y ~ N{X/3, a2 V) and V is nonsingular, then it is easily seen, via a derivation similar to that of Section 4.11, that the information matrix for the vector parameter 6 = (/31 : a2)' is
(LX'v-lx
o\
Consequently the Cramer-Rao lower bound for the dispersion of an unbiased estimator of the estimable function A/3 is a2A{X'V~1 X)~ A!. When V is singular, this argument does not hold. To see this, it is enough to consider the special case V — 0. As the distribution is degenerate, the partial derivatives used in the definition of the information matrix do not exist. Consequently the information matrix does not exist. However, the BLUE of every estimable function has zero variance.
7.8 Information matrix and Cramer-Ran bound*
273
Therefore, the Cramer-Rao lower bound for every estimable LPF exists and is equal to 0. The Cramer-Rao lower bound can be generally determined when V is singular, even though the information matrix may not exist. Example 7.8.1 Let y ~ (X*rj,a2V*), where y, X*, r\ and V* have the following forms with conformable partitions: (yx\ V2
v
\y4/
(I 0
0\ I
\0
0/
friA
// 0
0 0
0 0
0\ 0
\0
0
0
0/
Evidently r\2 can be unbiasedly estimated by y2 which has dispersion 0. Note that y 1 ; y2, y 3 and y 4 are independent and the distribution of y2 and y 4 do not involve rjl and a2. Hence, we can ignore the distributions of y2 and y 4 for computing the information matrix for (77^ : a2)'. It happens to be ( " /
p(v0°/(2ff4))-
Every estimable LPF in the model (y, X*r), a2V*) can be expressed as Airji + A2rj2. The statistic t(y) is an unbiased estimator of this LPF if and only if t(y) — A2y2 is an unbiased estimator of A\r}l. As t(y) and t(y) — A2y2 have the same dispersion matrix, the same lower bound should work for both. The Cramer-Rao lower bound for an unbiased estimator of Air)1 is a2A\A\. This bound holds for any unbiased estimator of A\r\x + A2r\2 also. The lower bound for the variance of unbiased estimators of a2 is 2<x4/p(V*). The simple model of Example 7.8.1 has all the essential features of a singular linear model. It is shown in Section 11.1.3 that every general linear model can be reduced to this simple form. We use this decomposition to derive the Cramer-Rao lower bound in the general case.
Proposition 7.8.2 If y ~ N{Xf3,a2V), then the Cramer-Rao lower bound for the dispersion of an unbiased estimator of the estimable LPF A/3 is a2AX-[V
- V(I - PX){(I - PX)V(I - PX)}-(I
-
PX)V)(AX-)',
274
Chapter 7 : General Linear Model
which does not depend on the choice of the g-inverses. The lower bound for the variance of an unbiased estimator of a2 is 2a4/p(V).
Proof. According to Proposition 11.1.16, there is a nonsingular matrix L = {L[ : L'2 : L'3 : L'A)' such that Ly ~ 7V(X*77,CT 2 F*), where X* and V* are as in Example 7.8.1 and
Further, every BLUE is almost surely a linear function of h\y and Z^y Therefore, there is a matrix (K\ : K2) such that the BLUE of X/3 is almost surely equal to K\L\y + K2L2y. Equating the expected values of these, we have X/3 = Kirji + K2r]2- Equating the dispersions, we have D(X/3) = G2K\K'1. Use of the argument of Example 7.8.1 leads to the conclusion that the Cramer-Rao lower bound for the dispersion of any unbiased estimator of A/3 (or AX~Kirj1 + AX~K2TI2) is ^{AX-Ki^AX-Kx)'
= (AX~)D(X]3)(AX-y,
which simplifies to the given expression (see Section 7.3.2). Invariance under the choice of g-inverses was shown in Section 7.3.2. The lower bound corresponding to a2 is similarly found to be 2
V
0
p(V)/(2ai)J-
The Cramer-Rao lower bound for the dispersion of an unbiased estimator of the estimable LPF A/3 is a2A{X'V~X)~A'. Neither the information matrix nor the lower bound depends on the choice of the g-inverses (see Exercise 7.21). O
7.9 Effect of linear restrictions 7.9 7.9.1
275
Effect of linear restrictions Linear restrictions in the general linear model
Consider the linear model M = (y,X/3,a2V). If we impose the (algebraically consistent) linear restriction Afl = £ on M., then the restricted model is equivalent to the unrestricted model M.r = (y — XA~£,X(I - A~A)0,a2V). This statement may be justified along the lines of the arguments in the homoscedastic case (see Section 4.9). As mentioned there, the parameters in the two models are related by the equation X0 = XA'(AA')-^ + X(I-PA,)e.
(7.9.1)
Recall from Section 7.2.1 that the response in a singular model must satisfy a consistency condition. The consistency condition of M.T is (I - Pv)(y - XA~i) € C((I - PV)X(I - A'A)).
(7.9.2)
It is a good idea to check this condition before proceeding with any analysis of the restricted model. The restricted model given above depends on the choice of A". The specific choice A~ = A'(AA')~ leads to the well-defined model (y - XA'{AA')-£,X(I - PA,)0,
The consistency condition of M.R simplifies to
((i-pv)yyc((i-pv)xy
(793)
Propositions 4.9.3 and 5.3.9 had brought out the effect of the restrictions on the model (y,X/3,a2I). We shall now show that these
276
Chapter 7 : General Linear Model
two propositions hold for the model (y, X/3, u2 V) as well. We shall use MR to simplify the proofs. Note that we were unable to use MR in Chapters 4 and 5 because of the singularity of its dispersion matrix. Proposition 7.9.1 (Restatement of Proposition 4.9.3) Let A/3 = £ be a consistent restriction on the model (y,X/3, a2V). (a) All estimable LPFs of the unrestricted model are estimable under the restricted model. (b) All LZFs of the unrestricted model are LZFs under the restricted model. (c) The restriction can only reduce the dispersion of the BLUE of
xp. (d) The restriction can only increase the error sum of squares. Proof. Part (a) follows directly from Proposition 7.2.4 and the structures of the model matrices of M and MR. To prove part (b), let i'y be an LZF in M. Then there must be a A; such that i'y = k'y and X'k = 0. Therefore, (X' A'){k' 0')' = 0, that is, I'y = (k'0')(y'gy is an LZF in MR. Part (b) implies that one can construct a basis set of LZFs of MR by expanding a basis set of LZFs of M. Therefore, the BLUE of X/3 under MR can be obtained by covariance adjustment of y (an unbiased estimator of Xf3) with this larger basis set. The result of Exercise 3.1 implies that the dispersion of the BLUE of X/3 under MR would be larger than that of the unrestricted BLUE. This proves part (c). Part (d) is a straightforward consequence of part (b). D We now impose the restriction C(A') C C{X'), that is, A/3 = £ is a completely testable restriction. We prove the statement of Proposition 5.3.9 with Mr replaced by MR, denned above. Proposition 7.9.2 (Restatement of Proposition 5.3.9) Let A/3 = £ be a completely testable and algebraically consistent restriction, and Aft be the BLUE of A/3 under the model M. (a) A/3 - £ is a vector of LZFs under the model MR. (b) A/3 — £ is uncorrelated with (I — Px)y-
7.9 Effect of linear restrictions
277
(c) There is no nontrivial LZF of MR which is uncorrelated with AP-£ and(I-Px)y. Proof. Note that A(3 — £ is a linear function of (y' : £')'. Part (a) is proved directly by computing the expectation of Aft — £. Part (b) follows from the fact that AJ3 and (/ - Px)y are BLUEs and LZFs in M, respectively. In order to prove part (c) by contradiction, let l[y +1'2$, be an LZF of MR which is uncorrelated with both A/3 — £ and (/ — Px)y- The condition Cov((I — Px)y, {l[y + I2Q) = 0 is equivalent to Vl\ = Xm for some vector m. In view of this, the condition Cov((A(3 — £), (l[y + l'2£)) = 0 is equivalent to AX-[V-V(I-Px){(I-Px)V(I-Px)}-(I-Px)V]l1
= Am = O.
Suppose that k\y + k'2^ is another LZF of MR, and assume without loss of generality that X'k\ + A'k2 = 0 (see Proposition 7.2.3). It follows that Cov{{l[y + Z'2£), (k[y +fc^))= l[Vki = m'X'ki
+ m'A'k2 = 0.
Since the l[y + l'2£ is uncorrelated with every LZF of M.2, it must be uncorrelated with itself, that is, it must be identically zero with probability one. Thus, the elements of e and Af3 — £ constitute a basis set of the LZFs of MR. Denoting the SSE in this model by R2H, we have from Remark 4.7.7
=
2 a
/ e \(D(e) \A(3~CJ \ 0
0 yf e \ D(AP-Z)) U 3 - J
U^-J I 0 [D(A0-t)}-j{Ap-Zj = R20 + (Ap-Z)'lc--2D(AP-Z)}-(Ap-$,), (7.9.4)
278
Chapter 7 : General Linear Model
which is a restatement of Proposition 5.3.10 in the general case. The number of additional LZFs in an uncorrelated basis of M.R is p(D(A/3 — The computations for the 'equivalent' model described above can also be performed by using any one of the methods described in Section 7.7. Baksalary and Pordzik (1989) develop an inverse partitioned matrix method specifically for models with linear restrictions, where the restrictions are explicitly split into completely testable and completely untestable parts. Remark 7.9.3 Rao (1978) shows that when V is singular, there is no matrix M such that the minimized value of (y — X/3)'M(y — X/3) subject to a linear restriction produces the appropriate R?H for all completely testable restrictions. This result exposes an important limitation of the WLSE interpretation given to the BLUE in the singular case.
7.9.2
Improved estimation through restrictions
Part (c) of Proposition 7.9.1 shows that the dispersion of the BLUE of X/3 is reduced when a linear restriction is introduced. Sometimes a linear restriction is imposed with the purpose of reducing dispersion. In such a case, there is a possibility that the restriction may result in a bias in the estimator of X/3. We now examine the trade-off between the increased bias and the reduced dispersion of the 'restricted' BLUE under the unrestricted model M., by comparing the mean squared error matrices of the two estimators. We begin with the assumption that the restriction A/3 = £ is testable (if it is not, we can work with the testable part of it). It follows from Proposition 7.9.2 that the BLUE of X/3 under the restriction can be written as
X0R =
X0-CD-{A0-t),
where X/3 is the unrestricted BLUE of X/3, A/3 is the unrestricted BLUE of A/3, C = Cov{X/3,A/3) and D = D(AJ3). We also assume that D is nonsingular. Consequently the bias and the dispersion of the
7.9 Effect of linear restrictions
279
restricted BLUE are:
E(XPR)-X/3 = -CD-HA/3-0, D{X0R) = D{Xp)-CD-lC. Let us denote the mean squared error matrix of a vector estimator by MSE{-). We have MSE(Xp)
-
MSE(XpR)
= D(xp) - D(X0R) - (E(xpR) - xp){E{xpR) - xpy = CDlC
-CD-l(Ai3-Z)(Ap-Z)'D-lC
= CD-l[D-{Aj5-Z){Ap-£)']D-lC
(7.9.5)
A necessary and sufficient condition for MSE(X(3R) < MSE{Xj3) in the sense of the Lowner partial order is that the matrix D — (Aj3 — €)(A(3 — £)' is non-negative definite. The latter condition is equivalent to the scalar inequality (Exercise 7.24)
{AP-£)'D-\AP-Z)<1.
(7.9.6)
If the above condition holds, then for any estimable function with nonsingular dispersion the restricted 'BLUE' will have smaller MSE matrix. This result implies that imposing a restriction may be a good idea when £ is close to the true value of A/3 (that is, the restriction is almost true) or when D is large (that is, there is much uncertainty about A/3). See Rao and Toutenburg (1999) for other criteria of comparing the estimators and for the implications of misspecified linear restrictions. 7.9.3
Stochastic restrictions*
Suppose that the linear model (y,X/3,a2V) is subject to a somewhat uncertain linear restriction. The restriction is A0 = $ + 8,
(7.9.7)
where A and £ are known, d is a random vector with zero mean and dispersion T2W, and 6 is uncorrelated with y. The stochastic restriction
280
Chapter 7 : General Linear Model
may have resulted from prior information or an independent study. In order that the restrictions are consistent, we must have £ £ C(A : W). The case W = 0 corresponds to a deterministic restriction, which was considered earlier. We can treat £ as a set of additional observations, and consider the model
(?)-«)^>-U)-^W-C7^> If r 2 = cr2, this model fits into the framework of this chapter. Otherwise, the methods outlined in Section 8.3 may be used. We shall now assume that T 2 = a2 and C(A') C C(X'), and examine the effect of the restrictions on the BLUE and its dispersion. Proposition 7.9.4 Suppose that the unrestricted and restricted models, M and MR, respectively, are defined as
such that C(A') C C(X'). Let XJ3 and X~J3R be the BLUEs ofXfl under the models M and MR, respectively, and f3 be as defined in (7.3.2). (a) (b) (c) (d) (e)
All estimable LPFs of M. are estimable under M.RAll LZFs of M are LZFs under MR. A/3 — £ is a vector of LZFs under the model MRA{5 — £ is uncorrelated with (I — Px)yThere is no nontrivial LZF of MR which is uncorrelated with AJ3-£ and {I - Px)y. (f) The BLUEs of Xf3 under the two models are related as follows: XPR = X0-
D(XP)(AX-)'[D(AP)
where D(X/3) is as given in (7.3.4), D(A/3) =
+ a2W]-(Ap and
AX-D(X0)(AX-)'.
- £),
7.9 Effect of linear restrictions
281
(g) The respective dispersions of the BL UEs of X/3 under the two models are related as follows: D(X0R)
= D(Xft_
^
-D(X0)(AX-)'[D{AJ3)+
R2R = R2 + (A0 - Z)'[o-2D{AP) + W}-(AP - £), and the associated number of degrees of freedom is p(V : X) — p{X) + p(D{Ap) + a2W). Proof. Proofs of parts (a) and (b) are similar to the proofs of Proposition 7.9.1(a) and (b) (where W was the null matrix). Parts (c), (d) and (e) are proved along the lines of the proof of Proposition 7.9.2. Specifically for part (e), let l[y + l'2$, be an LZF of MR which is uncorrelated with A/3 — £ and (/ — Px)y- These two conditions are equivalent to VI \ = Xm and WI2 = Am for some vector m. Let k[y + k'2£ be another LZF of MR, and assume without loss of generality that X'k\ + A'ki = 0 (see Proposition 7.2.3). Then Cao{l\y + l'2i,k\y + k'2i) = a2[k'lVl1 + k'2Wl2}=a2(k[X + kl2A)m=0. It follows that l[y + l'2£ is uncorrelated with every LZF of M2, and therefore it must be identically zero. Parts (b)-(e) imply that the uncorrelated vectors (/ — Px)y and Afi — £ together constitute a basis set of LZFs of MR. Part (f) follows immediately via the covariance adjustment principle of Proposition 3.1.2. Part (g) follows from part (f) by expressing Xf3 as the sum of Xf3R and an uncorrelated term. Part (h) is an easy consequence of Remark 4.7.7 and the description of a basis set of LZFs of MR given above. Note that all the above results are generalizations of the case of non-stochastic restrictions (W — 0). When X and V have full column
282
Chapter 7 : General Linear Model
rank and A has full row rank, the expression for the restricted BLUE simplifies to X0R = X07.9.4
{X'V-xX)-lA![A(X'V-lX)-lA!
+ W}-l(A0 - £).
Inequality constraints*
Deterministic constraints of the form A/3 < £ (where the vector inequality represents inequality of the corresponding components) are quite common in econometric literature. For instance, some components of /3 may be known to be non-negative. Since equality constraints can also be written as a collection of inequality constraints, the latter may be viewed as a generalization of equality constraints. Inequality constraints make it difficult to work with LUEs, because these estimators may not satisfy the constraints. Judge and Takayama (1966) consider the least squares estimator under inequality constraints, which coincides with the MLE in the case of independent and normally distributed errors. This led them to a quadratic programming problem which can be solved by a version of the simplex algorithm. Liew (1976) presents another solution of this problem, and shows how the dispersion matrix of the estimator can be computed. Werner (1990) presents an expression of the estimator that minimizes (y — Xfi)'V~(y — X/3) subject to a set of inequality constraints in terms of various projectors and generalized inverses, assuming that V is nonsingular. Werner and Yapar (1996) extend this geometric approach to the case of possibly singular V. The estimator is nonlinear and does not have a neat form except in some special cases. The inequality constrained MLE of X/3 in the normal case with possibly singular error dispersion matrix can be described as follows. Suppose that there are m inequality constraints, involving estimable LPFs only. There are 2 m possible subsets of these inequalities which can be converted to 'equality constraints'. For a given set of equality constraints, we can compute the BLUE of X/3 subject to these linear restrictions as well as the consistency condition y — X/3 G C(V). Some of these 2 m 'BLUE's satisfy all the inequalities. The desired solution is given by that 'BLUE' which corresponds to the smallest value of
7.9 Effect of linear restrictions
283
(y - X(3)'V~(y — X/3) and satisfies all the inequalities. The next example demonstrates how a simple form of the inequality constrained least squares estimator is available in some special cases. (with V possibly Example 7.9.5 Consider the model (y,X(3,a2V) singular) subject to the constraint 0\ < b. Assume that 0\ is estimable, and that the singularity of V does not make 0\ equal to a constant with probability 1. The appropriate minimization problem is min
(y-X/3)'V-(y-X0).
Pi < o
P
(y - X/3) E C(V) Partition X and /? as (xi : X2) and (0\ : (5'2)', respectively. Let z(0\) = y — xi/?i. If we ignore the inequality constraint, then the above problem can be solved by (a) minimizing (z(0\) — -X"2/32)'V~(2(/3i)—_X"2/32) subject to the constraint (z(/3i)-X2j9 2 ) S C(V) for every feasible value of fii, and then (b) minimizing the resulting function with respect to /3\. The solution to the first problem is given by the BLUE of X2/32 in the model {z{fi\), X2/92, o2V). An expression for this BLUE can be obtained by using (7.3.1). Therefore, the function of f5i which has to be minimized with respect to 0i in the second step is quadratic in f}\. [It is easy to see that this 'function' is free offi\whenever /3\ is not estimable.] The constraint p\ < 6 can be incorporated in the second step. The constraint (y — X/3) G C(V) is automatically satisfied because of the constraint {z(/3\) — X202) £ C{Y) used in the first step. If the minimizer of the quadratic function in the second step automatically satisfies the constraint fii < b, then the overall solution with the inequality constraint, coincides with the unconstrained solution. Otherwise, the quadratic function in 0i has the minimum feasible value at 01 = b and the remaining part of the solution is given by the BLUE of X 2 /3 2 in the model {z{0\), X 2 /3 2 ,cr 2 V). Thus, the solution to the original problem is
BLUE of X/3m(y,X/3,a2V), \
n T T T n
r
a
.
,
-v a
ifp\<6,
9xr\
BLUE of Xp in (y, X/3, alV) subject to 0i = b,
otherwise.
284
Chapter 7 : General Linear Model
The estimator is obviously not linear in y. The simple estimator in the above example is sometimes referred to as the two-step estimator. The result can be extended to an inequality constraint involving any single estimable LPF (Exercise 7.26). Extension to multiple inequality constraints is similar (see Werner, 1990). Nevertheless, data analysts working with prior knowledge of the signs of some coefficients are sometimes tempted to conduct a two-step analysis where all the estimated coefficients with wrong sign in the first step are constrained to be zero in the second step. Lovell (1970) shows that this procedure (in the special case V — I) can lead to bias and inefficiency. A limited simulation study by Liew (1976) indicates that even the optimally constrained estimators generally tend to be biased.
7.10
Model with nuisance parameters
Consider the model M. = (y,Xf3,a2V)
where X = (X\ : X2)
and (3 = ((3[ : (3'2)' so that X/3 = X1f3l + X2(32. If o n e is interested only in the estimable linear functions of /3 1? then (32 is a vector of nuisance parameters. Proposition 4.10.1 regarding the estimability of such functions still holds, although with a slightly modified proof (Exercise 7.27). Carrying the idea of this proposition further, we can pre-multiply the model equation by (I — Px ), which leads to the 'reduced' model
M* = ((I-P^y^I-P^X^a^I-P^Vil-P^)).
This model
is free of the nuisance parameter. The following proposition proves that this model is equivalent to M. for the purpose of inference. Proposition 7.10.1
Let the models M and M* be as defined above.
(a) The set of LZFs in M* coincides with that in M. (b) p'Pi is estimable in M* if and only if it is estimable in M. (c) The set of BLUEs in M* coincides with the set of BLUEs of estimable linear functions of 01 in M.. (d) The dispersion of (I — Px )Xi/3l under the models Ai* and M. are identical. (e) The SSE under the models M* and M. are identical.
7.10 Model with nuisance parameters
285
(f) The error degrees of freedom under the models M* and M are identical. Proof. A vector of a basis set of LZFs of M* is
using the result of Proposition 2.4.4(b). This proves part (a). Part (b) follows from the estimability condition in Ad, (p' : 0)' £ C(X\ : X2)', which was shown in Proposition 4.10.1 to be equivalent to peC(X[(I-PX2)). Let I'y be the BLUE of an estimable function of f3x in M. The unbiasedness condition requires that there should be a vector k such that I'y — k'y almost surely and X'2k = 0 or k'y = k'(I — Px )y. Therefore I'y or k'y is an unbiased estimator of the same estimable function in M and M*. Since it is uncorrelated with the LZF's under either model it must be the BLUE. Similarly, any BLUE in /A* is also the BLUE of its expectation (which must be a function of /3X alone) in M.. This proves part (c). Part (d) can be proved by using (7.3.4) and the fact that (I — P{I_P )x ){I ~ Px )V = {I — PX)V, which can be proved along the lines of the proof of part (a). Part (e) follows from part (a), the expression (7.4.2), and the fact that £>((! - Px)y) = o2{I - PX)V{I - Px) even under M*. To prove part (f), observe that
pw - px2)v^-
px2y-
( j - p * 2 )*i) - P(( J - p x 2 )*o
- p{{I - PX2){V : Xx) - p{{I - PX2)XX) = [p(V : X, : X2) - p(X2)} - \p{Xl : X2) - p(X2)}. The last expression simplifies to p(V : X) — p(X), which is the error degrees of freedom for M.. The last two parts of the above proposition imply that even the usual unbiased estimator of a2 under the reduced model is identical to that under the original model. We can use the methodology developed
286
Chapter 7 : General Linear Model
in the earlier sections to analyse the reduced model which eliminates the nuisance parameters. Remark 7.10.2 In the special case of the homoscedastic linear model (V = / ) , the BLUE of (I - P ^ J X I J S J can be obtained from (7.3.1) by substituting / - JP^ for V, ( / - PX2)XX for X and ( / - PXi)y for y. The expression simplifies to (i-PX2)x3i
= P{I_Px2)Xiy.
It follows that a 'substitution' estimator of / ^ that produces BLUEs of estimable functions of /3X must satisfy the equation
XW-P^Xtf,
= X[P{I_Px2)Xiy = =
X'1(I-PX2)X1[X'1(I~PX2)X1}-X'1(I-PX2)y X[(I-PX2)y.
The equation X[(I — Px )X\f31 — X[(I — PX )y is called the reduced normal equation for fii- It can also be verified that D((I-PX2)XM
=
*2P{I_PX2)XI,
which follows from (7.3.4) by appropriate substitution. Likewise, the usual unbiased estimator of a2, obtained from (7.4.1) and (7.4.2), are
^_ °
SSE p(V:X)-p(X)
= IK f - P (/-^)x 1 )( J -^)yll 2
Q
p{I-PXi)-p{(I-PXi)X{)
There have been attempts to obtain simpler reduced models that would also produce the appropriate results. Two such models are:
Ml - {{I-PX2)y,(I-PX3)Xtfx,o2V) {y^I-P^Xtf^V) M\ = Bhimasankaram and SahaRay (1996) point out that M\ cannot be a valid model in general, because the dispersion of (I — Px )y must be
7.11 Tests of hypotheses
287
(I—Py )V(I — PX ). In spite of this observation, there has been a flurry of research work on this model. The model M.\ can only be meaningful if the response satisfies the consistency condition y G C(V : (J —JV. )Xi), which is stronger than the consistency condition of the original model, y G C(V : X). Bhimasankaram and SahaRay (1996) show that when C(VX2) C C{X2) and V is nonsingular, M% indeed produces the right BLUE of (I — Px )Xij31 and the right dispersion matrix, but the wrong estimator of a 2 . The condition of nonsingularity of V is relaxed to some extent by Puntanen (1997). 7.11
Tests of hypotheses
Suppose that y ~ N(Xf3,a2V). If the hypothesis HO : A/3 = £ is to be tested statistically against the hypothesis T-L\ : A/3 ^ £, we have to make sure that the following conditions hold. (a) The model under the alternative hypothesis must be consistent, that is, yeC(X: V) or (/ - Py)y G C((I - Py)X). (b) The hypothesis must be testable, that is, C(A') C C(X'). (c) The equation A/3 = £ must be algebraically consistent, that is, ieC{A). (d) The model under the null hypothesis must be consistent, that is, y-XA-$, G C(X(I-A-A) : V) or {I-Pv)(y-XA~£) G C((I-PV)X(I-A-A)). We have seen in Examples 7.2.1 and 5.3.1 how conditions (a) or (b) may be violated. If condition (a) does not hold, then the model (y, -X"/3, a2V) is inconsistent with the data, and we do not even have a basis for testing statistically the hypothesis A/3 = £. If condition (b) is violated, then we have to work with the testable part of the hypothesis as per Proposition 5.3.6. Condition (c) essentially says that we cannot test statements such as f i I = f
j , which is self-contradicting. The next example
shows how condition (d) is violated. If either of (c) and (d) is violated, then the null hypothesis may be rejected without conducting a statistical test. Conditions (a) and (d) are automatically satisfied when
288
Chapter 7 : General Linear Model
V is nonsingular. A statistical test may be conducted if all the four conditions hold. Example 7.11.1 Let the observed response for the model of Exercise 7.3.6 be y = (1 : 2 : 3 : 4)'. It is easy to see that u[y = 0, so that y 6 C(V : X), that is, condition (a) holds. Consider the hypothesis fa — Pz = 0. Since y2 - yz is an LUE of /32 — fiz, the hypothesis is testable. Further, a hypothesis with a single degree of freedom is always algebraically consistent. Thus, conditions (b) and (c) are satisfied. However, condition (d) is violated, as C(X(I—A~ A)) = C(u2) and this column space does not contain y. In order to understand what goes wrong, note that the BLUE of /? 2 -/3 3 isy2~y3 = (yi+y2-yz-yi)/2 = —2. The expression of D{y) obtained in Example 7.3.8 implies that Var(y2 — $3) = 0. Thus, it is known from the data with certainty that &2 — Pz = - 2 . It is no wonder that the observed data is inconsistent with the restricted model which incorporates the null hypothesis
ft - Pz = 0.
n
Proposition 7.11.2 Under the above set-up, the GLRT at level a is equivalent to rejecting T-LQ if RH-RQ
Rl where ri = p{X :V),r
= p(X)
n'-r
' m
>iV>"'-^
and m =
p(D(A0)).
Proof. See Exercise 7.29. A general version of the ANOVA table of Section 5.3.4 is given in Table 7.1.
Multiple comparisons of a number of single-degree-of-freedom hypothesis can be made using the ideas of Section 5.3.7. Consider the collection of testable hypotheses nOj
a'j/3 = £j,
against
H\j : a'j/3 ^ £,-,
7.11 Tests of hypotheses Source
Sum of Squares
Deviation
from %
Total
Degrees of Mean Freedom Square
R2H~Ro =
m=
{A0-ty[£D(A0-£)]-(AP-t) Ro
Residual
289
p(D(AP-£))~^r~
=
n'~r
{y-XP)'V-{y-XP)
m i n
_ mm
R2H~RO
(y-X0)'V-(y-Xl3)
=
P&:V)
R2
^ZZ
n'~r+m
Af3 = £ Table 7.1 Analysis of variance Table for the hypothesis A/3 = £
j - 1 , . . . , q. L e tA
1
= ( o i : a2 :
: a g) a n d £ = { £ i
tq)'
Using the Bonferroni inequality, we obtain a set of conservative tests which reject HQJ if
^a'3X-D(y){X-)'a3
'
'2«
j = 1,2,..., q. The probability of erroneous rejection of at least one of the hypotheses, when all of them actually hold, is at most a. On the other hand, using Scheffe's technique, we have another set of conservative tests which reject HQJ if
K-3 - Q)2 — o'a'jX-DiyKX-yaj
_ > mFm n ' _ r Q ,
j = 1, 2,..., q, where m = p(D(A0)) = p(AX~D{y){X~y'A').
290
Chapter 7 : General Linear Model
7.12
Confidence regions
If y ~ N(X/3,a2V) and p'/3 is an estimable LPF, then it follows from the discussion of Sections 7.3 and 7.4 that
p'0-p'0
.
y/a2prX-D(X-)'p where D = a~2D{y), ri = p(X : V) and r = p{X). Thus, a left-sided 100(1 — a)% confidence interval for p'/3 is
(-oo,P'3 + tn,_r>a^P>x-D(x-yP ]. As is Section 5.2, we can also find a right-sided or two-sided confidence interval. If the jth component of /3 is estimable, the corresponding oneand two-sided confidence intervals are obtained by replacing 'p' in these intervals by the jth. column of the k x k identity matrix. Under the above set-up, a 100(1 — a)% ellipsoidal confidence region for the estimable parameter vector A/3 is
| A/3 : (A/3 - AMAX-D(X-)'A']-(AP
- Afi) < m°2^'-^
I,
where m = p(D(A/3)). If the entire parameter vector /3 is estimable, the corresponding confidence region is as given above with A = I and r = k. li a[,a'2,.. .,a'q are the rows of the matrix A, then simultaneous confidence intervals for the estimable LPFs a[0, a'2(3,..., a'q(3 can be constructed as in Section 5.2.3. The Bonferroni confidence intervals with confidence coefficient (1 — a) are
/f = Lfi-t^e.fea'jX-DiX-yaj, a'fi+tnl.r^aia'jX-D{X-)'aj
, j = 1,2, ...,.
7.13 Prediction
291
The corresponding Scheffe confidence intervals are
I(fc) =
lafi-JmFn^aa'jX-DiX-yap, a'jJ3+yJmFm,n,-riaa'jX-D(X-)'ap
, j = l,2,...,q.
In the context of linear regression, we may obtain a confidence band for the regression surface (x f3) by adapting Proposition 5.2.4 to the general linear model. The confidence band is
\x'0 -
\jm'Fm,^^ax'X-D(X-yx72, x'P + y/m'Fm,,n,-riax'X-D(X-yxZ* ] ,
where m' = p(D(Xfi)) = dim(C(X) nC(V)). This band covers the regression surface with probability 1 — a. 7.13 7.13.1
Prediction Best linear unbiased predictor
Consider the the linear model
where y is observed but y0 is not. If the dispersion matrix is known, the above model can serve as a vehicle for the prediction of yo in terms of y. It follows from Proposition 3.4.1 that the BLP of yo given y is E(yo\y) = x'o/3 + v'QV-(y - Xfi). If/3 is not known, we have to look for the best linear unbiased predictor (BLUP). This predictor should (i) be of the form a'y + b, (ii) satisfy the condition E(yo-a'y—b) = 0 for all /3, and (iii) minimize E[HQ—a'y—b]2. Proposition 7.13.1 In the above set-up, let M. denote the linear model {y,Xf3,a2V). Then
292
Chapter 7 : General Linear Model (a) If x'oj3 is not estimable under the model M, then a BLUP of y0 does not exist. (b) If X'Q(3 is estimable under the model M., then a BLUP of yo is given by yo = x'0B + v'o V~e, where x'0J3 and e are the BLUE of x'0f3 and the residual vector from the model M.. (c) The BLUP described in part (b) is unique in the sense that any other BLUP is equal to it with probability 1. (d) The mean square prediction error of the BLUP of part (b) is a2{v0 - v'0V-v0)
+ (x'0X- - v'0V-)D(XP){X'-x0
-
V~v0).
Proof. If a'y + b is a linear unbiased predictor of yo, then it is a linear unbiased estimator of x'Qf3 under the model M.. It follows from Exercise 7.3 that xo € C(X'), that is, x'0f3 is estimable. This proves part (a). In order to prove the remaining three parts, let XQ € C(X') and a'y + b be a linear unbiased predictor of yo- Consider the decomposition yo-a'y-b=
(y0 - E(yo\y)) - (y0 - E(yo\y)) + (y0 - a'y - b).
The first term on the right hand side is the prediction error of the BLP E(yo\y), which must be uncorrelated with y (see Proposition 3.4.1, part (b)). Therefore, this term is uncorrelated with the other two terms. On the other hand, the second term is the estimation error of the BLUE of x'Q[3 - v'0V~X(3 in M (V~ being a particular g-inverse of V), while the third term is a linear zero function in this model. Therefore, these two terms are also uncorrelated. Consequently E[y0 - a'y - b}2 = E[y0 - E{yo\y)}2 + E[y0 - E{yo\y)}2 + E[y0 - a'y - bf. The above is minimized if and only if the LZF yo—a'y—b is almost surely equal to zero. This proves parts (b) and (c). By setting a'y + b = yo in the above equation, we have £[yo-yo] 2
=
E[y0 - E(yo\y)}2 + E[y0 - E(yo\y)}2
=
a2(vo-v'oV-vo)
=
o-2(v0 - v'oV-vo) + Var{(x'0X- -
+
Var(x\8-v'oV-X^) v'0V-)XP),
7.13 Prediction
293
which leads to the expression of part (d).
d
The expression for the mean squared prediction error of the BLUP is the sum of two terms. The first term is the mean squared prediction error of the BLP. The second term represents the increase in the mean squared prediction error because X/3 has to be estimated. 7.13.2
Prediction and tolerance
intervals
Under the assumption of normality, the 100(1 — a)% prediction interval for t/o is [yo-a, yo + a], where y0 = x'of3 + v'oV~e, a
=
tp{X:V)-P{X),a/2{°2b)ll'1,
b = vQ - v'0V-v0 +
(x'0X--v'0V-)(D(XP)/a2)-{X'-xo-V-v0).
When v0 = 0, the quantity b simplifies to VQ + Var(x'0(3)/a2. The resulting expression of a is similar to that obtained in Section 5.4.2. A tolerance interval for yo can also be obtained by using the idea of Section 5.4.4. Note that on the average 100(1 — 7)% of all replications of yo must satisfy the inequality |yo - x'oP - v'0V-(y - X/3)| < zl/2^/a2(v0
- v'0V~v0),
z1j2 being the 1 — 7/2 quantile of the standard normal distribution. Using the usual 100(1 — a/2)% two-sided confidence interval of (x'QX~ — v'0V~)X[3 and a 100(1 — a/2)% upper confidence limit for a2 together with the Bonferroni inequality, we have the 100(1 — a)% two-sided tolerance interval for inclusion of 100(1 — 7)% of all replications of yo, \yo ~ tp(X:V)-P(x),a/4Vca2 - zl/2\jda2
,
Vo + tp(x-.v)-p(x),a/4Vca2 + z1/2\Jda2
,
294
Chapter 7 : General Linear Model
where y0 = x'o0 + v'oV~e, c = (x'oX--v'oV-)(D(X0)/a2)-(X'-xo-V-vo), d = (p(X : V) - p(X))(v0 - v'0V-v0) /x2p{x:V)-p{x),x-*l2 Simultaneous prediction or tolerance intervals can also be obtained as in Sections 5.4.3 and 5.4.4 (see Exercises 7.30 and 7.31). 7.13.3
Inference through finite population
sampling*
Characteristics of a finite population is often estimated from a sample. Let ys be the vector of observed values of a particular variable in the sample, and the unobserved vector yr represent the values in the rest of the population. Let the combined vector for the population, {y's VT)' b e denoted by yt. The objective is to estimate a function of yt by means of a function of the observable ys. The present discussion is confined to estimating linear functions of yt such as the population mean and the population total. The population characteristic can be estimated better if there are some auxiliary variables carrying information about the main variable of interest, and these are known for the entire population. Let Xs be the matrix of auxiliary variables in the sample (each row representing a single unit), and Xr be the corresponding matrix for the rest of the population. We may seek to estimate the desired population characteristic by predicting yr by means of the observables, ys, Xs and Xr. The prediction is typically made on the basis of a linear model of the form
*(£) = (£)>
°C;H(v;; £ ) <»«>
where /3 is an unspecified vector parameter, and the dispersion matrix is known up to the scale factor a2. Let ~i\yt be the function to be estimated, where 7 t is a known coefficient vector. It can be written as ~f'sys + 1f'ryr, where 7 t = (7^ : 7^.)'. Since "f'sys is exactly known, the task of 'estimating' ~i'tyt is equivalent to that of predicting 'y'ryr. If the model (7.13.1) is assumed, then the
7.13 Prediction
295
theory of best linear unbiased prediction can be used for this purpose. The BLUP of 7^.yr is given by Proposition 7.13.1, with the following substitutions: y = ys, X = Xs, V = Vss, x0 = X'rjr,
VQ = 7 r V r r 7 r , v0 = VST^r.
According to Proposition 7.13.1, the BLUP exists and is unique if and only if X'r7T £ C(X'S). If the BLUP exists, it is given by 1'ryr
= 7'rXrX:X~7p + 7rVr8V-(ya - XJ),
where
vss (I-PXS)
X7P=[I-
{(I-PXS)
v (i-PXt)}~
{I-PXS)\
y..
Therefore, the model-based estimator of iy'tyt is TS/*
= I'SVS + I'AXrXjx^p
+ vrsv;s(ys - x7p)\.
(7.13.2)
According to part (d) of Proposition 7.13.1, the mean square prediction error of ~i'tyt is MSEP =
O2~f'r(Vrr-VrsV;sVsr)lr^ WAXrXj
- Vr3V-)D(X,/3)(X's-X'r -
V-aVar)lr.
If Vrs = 0, then the estimator and its MSEP simplify to itVt = MSEP =
l'sys+l'rXrXjXs(3, a21'rVrrlr+1lrXrXjD(x7l3)Xl-X'rlr.
If Vrs = 0 and Vss is nonsingular, then the expressions further simplify to itit = l'sys+l'rXr{X'sV-}Xs)-X'sV-}ys, MSEP = O-WriVrr + XriX'sV^XsrX'r}^.
(7.13.3) (7.13.4)
When Vrs = 0 and Vss = I, the estimator and its MSEP are iSt MSEP
= l'sys+l'rXr(X'sXs)-X'sys, = aVr[V r r + X r (X' s X 8 )-x;]7 r .
(7.13.5) (7.13.6)
296
Chapter 7 : General Linear Model
In the following examples, n denotes the sample size and N is the population size, assumed known. Example 7.13.2 (No auxiliary variable) If there is no auxiliary variable, a plausible model is E(yt) = fil, D(yt) = o2l. The BLUP of the population total l'yt simplifies from (7.13.5) to Nl'ys/n or Nys. This is known as the expansion estimator of population total. The MSEP, given by (7.13.6), simplifies to (N - n)No2/n. The expansion estimator also arises naturally from simple random sampling without replacement (SRSWOR). According to the sampling design, D(ys) = s2(I — N~lll'), where s 2 is the true population variance, and consequently the variance of the expansion estimator is N(N — n)s2/n (see Exercise 7.2). As the parameters /J, and a2 of the above prediction model are essentially the same as the population mean, N~ll'yt, and the population variance, s2, respectively, we find that the variance expression obtained from the design-based approach is identical to the MSEP obtained from the model-based approach. Example 7.13.3 (One auxiliary variable) Let there be a single auxiliary variable, so that Xs = (1 : xs) and Xr = (1 : xr), and D[yt) = a21. The computations are simplified by using the reparametrization where Xs — (1 : xs — xsl) and Xr = (1 : xr — xsl), xs being the sample mean of the auxiliary variable or n~ll'xs. The expression (7.13.5) for the BLUP of the population total l'yt simplifies to
where ys = n~ll'ys and x = N~ll'x. This estimator is known as the regression estimator of population total. The MSEP of this estimator, given by (7.13.6), simplifies to
^ f(N-n)N
[
n
+
N2{x-xsn
D
||xs -x s l|| 2 J '
Example 7.13.4 (One auxiliary variable with heteroscedasticity) Let there be a single auxiliary variable but no intercept term, so that Xs and Xr can be written as xs and xr, respectively, and E(y's : y'r)' =
7.14 Exercises
297
(x's : x'r)'f3. Further, let D(yt) be equal to a2 times a diagonal matrix with the elements of (x's : x'r) as its diagonal element. The BLUP of the population total l'yt is given by (7.13.3), which simplifies to l'ys(l + l'xr/l'xs) or Nysf-, x being the population mean of the auxiliary variable and ys and xs being the sample means of the main and auxiliary variables, respectively. This estimator of the population total is known as the ratio estimator, and it arises naturally in the context of probability proportional to size (PPS) sampling with the auxiliary variable used as 'size'. The MSEP of the ratio estimator is given by (7.13.4), which simplifies to a2^N~™>N ^-, xr being the average value of the auxiliary variable among non-samples. D It may appear from the theory of model-based inference and the preceding examples that the results hold irrespective of the sampling design. However, the assumed model may not be valid for all sampling designs. For example, sampling only from units which have a large value of the auxiliary variable may lead to wrong conclusions. Further, the worth of model-based prediction depends crucially on the validity of the model. The model has to be chosen with utmost care. If sampling with replacement takes place, then there may be replications within the sample. In such a case, one may have to use a model with singular dispersion matrix. A combination of design-based and model-based approaches leads to the design-assisted approach, including the general regression (GREG) estimator. Valliant et al. (2000) give a detailed treatment of modelbased and model-assisted inference in finite population sampling. 7.14
Exercises
7.1 For each of the following situations describe an appropriate general linear model with no constraint on the parameters, and indicate whether the dispersion matrix is singular. (a) Uncorrelated and homoscedastic observations following the model yij = x\$ + e^, j = 1,... ,rij, % = 1,... ,m are averaged. The only available data are n~l Y^jLi Vij> xi
298
Chapter 7 : General Linear Model and rii for i = 1 , . . . , m. (b) For the stack loss data of Table 4.1, it is desired that a simple linear regression model is used to describe the relationship of the stack loss with air flow, — after suitable linear adjustment for the other explanatory variables. (c) From a complete set of data on response and explanatory variables on a number of randomly selected individuals, some information are discarded in order to protect privacy: the response is expressed in terms of deviations from the sample mean, and each explanatory variable is scaled so that the sum of squared values over all the individuals is equal to 1. (d) An expensive but error-free instrument is used to measure the response variable once. Twenty additional measurements of the response by means of an inexpensive but erroneous instrument are also available. These measurements are unbiased and independent. There are two explanatory variables which are measured free of error. 7.2 Let the vector y = (y^ : j/j2 : : yin)' consist of samples from a finite population yi, j/2> > VN, where the units are selected according to simple random sampling without replacement (SRSWOR). The population mean y = N'1 J^iLiVi a n d the population variance s2 — (N — I ) " 1 J2iLi{Vi ~ y) 2 a r e unknown. Show that the vector of samples follows the linear model
(y, X/3, a2V) with X = 1, p = y, a2 = s2 and V =
I-N^ll'.
Is V singular? Obtain the BLUE of y, according to this model, and its variance. 7.3 Consider a linear parametric function p'/3 in the possibly singular linear model (y,X/3,cr 2 V). Show that there is an unbiased estimator of p'/3 having the form k'y+c if and only if p G C{X'). [Hint: Follow the proof of Proposition 7.2.4.] 7.4 Show that the estimator of (7.3.2) is uniquely denned if and only if X has full column rank, in which case it is the BLUE of/3. 7.5 Prove Proposition 4.5.1 for the general linear model.
7.14 Exercises
299
7.6 lil'y and m'y are BLUEs of their respective expectations under the model (y, X/3, a2V), show that l'y + m'y is also the BLUE of its expectation. 7.7 lil'y is uncorrelated with all BLUEs in the model (y, X/3, a2 V), is it necessarily an LZF? 7.8 HC(X) C C{V), show that the linear model (y,X/3,a2V) can be viewed as a linearly transformed version of another model of the form (y*,X*/3,cr2/). 7.9 Prove Proposition 7.3.11 By transforming y to C~1y, where CC' is a rank-factorization of V. 7.10 Let p'/3 be an estimable LPF injhe model (y,X/3,o2V). Let p'/3 be the BLUE_of p'P and p'/3LU be another LUE. Define the efficiency of p'fiiu as = P
f Far(p'3)/FarW L[/ )
if Far(g/3Lt/) > 0,
\l
ifFar(p'/3 L[/ ) = 0.
Show that 1 — r/p is equal to the squared multiple correlation coefficient of p'fim with any generating set of LZFs. Is the above notion of efficiency consistent with the definition given in page 78? 7.11 Consider the modified model (y*,X/3,cr2W) of Section 7.7.1 where W = V + XUX' and U is a symmetric, nonnegative definite matrix such that C(W) = C{X : V). Show that D{yt) = a2X{X'W-X)-X'. If D(y) is obtained as in Section 7.3 from the model (y, Xj3, a2V), then observe that D{y) < £>(yj in the sense of the Lowner order. When does the above relation hold with equality? 7.12 Show that the estimator of a2 given by (7.6.1) is unbiased. 7.13 Show that a choice of U that ensures C(V + XUX') = C{V : X) is U = I. Show that X(X'(V + XX')-X)-X'{V + XX')~y is the BLUE of X/3 in the model {y,X(3,a2V).
300
Chapter 7 : General Linear Model
7.14 Examine possible simplifications in the forms of the BLUE of Xfi in the model (y,Xf3, o2V), the dispersion of the BLUE, the residual vector and the usual unbiased estimator of a2 when (a) C(X)QC(V), (b) C(V) C C(X), (c) C(X) is orthogonal to C(V), (d) C(X) and C(V) are virtually disjoint. 7.15 Prove the statement of Remark 7.4.2. 7.16 Prove Proposition 7.4.4 using (7.3.3). 7.17 If V and U are symmetric and nonnegative definite matrices of appropriate order, then show that C(V) C C{V + XUX') C C{V:X). 7.18 This exercise aims at finding equivalent sufficient conditions for the results of Proposition 7.7.2. Let W = V + XUX' where V and U are symmetric and nonnegative definite matrices. Prove that the following three conditions are equivalent. (a) C(W) = C(V : X). (b) C(X) C C(W). (c) p(W)=p(V:X). 7.19 Let FF' be a rank-factorization of V, Q be a nonsingular matrix such that QX has the lower trapezoidal form with p(X) non-zero rows, and P be an orthogonal matrix such that QFP has the lower trapezoidal form. Partition Q and P as (Q[ : Q'2)' and ( P i : P2)) respectively, where Q2 has p(X) rows and P i has p(V : X) - p{X) columns. Note that QrX = 0, Q^FP2 = 0, and P can be chosen to ensure that Q1FPi and Q2FP2 have full column rank. Prove that a solution to the constrained minimization problem of Proposition 7.7.8 is given by ^ and P\Ui, where /3 and 2i satisfy the equation (QlFPl \Q2FPl
0 \(*i\ Q2X)\P)
=
(QiV\ \Q2yJ-
Further, show that the elements of ui form a basis set of LZFs with variance a2 and that the dispersion of X 3 is a2FP2P'2F'.
7.14 Exercises
301
7.20 Prove Proposition 7.7.2 without the condition that the matrix U is nonnegative definite. 7.21 Prove the statements made in Remark 7.8.3. 7.22 Let y = X/3+aFu, where FF is a rank-factorization of V, and the p{V) elements of the random vector u are independent and identically distributed with mean 0, variance 1 and density h(-) satisfying h(—u) = h(u) for all u. Assuming that the necessary partial derivatives and the integrals exist, derive the CramerRao lower bound for the dispersion of an unbiased estimator of an estimable parameter A/3, and show that it is in general smaller than the bound in the case where h is the density of the standard normal distribution. 7.23 Let MR and M be the model (y,X/3,a2V) with and without the restrictions A/3 = £, respectively, with A/3 not necessarily estimable in M. Suppose that X/3 and X/3R are the BLUEs of X/3 under M and MR, respectively. (a) Show that X/3 is the BLUE of X/3 in MR with probability 1 if and only if C(V) nC(X) C C{X(I - PA,)). (b) Simplify the condition of part (a) when V is nonsingular.
7.24
7.25
7.26 7.27
[See Yang, Cui and Sun (1987) and the references therein for a discussion of this problem.] Show that the MSE matrix of the BLUE of X/3 under model (y,X/3,cr2V) with the restriction A/3 = £ is smaller than the MSE matrix of its unrestricted BLUE whenever (7.9.6) holds. [Assume that A/3 is estimable but not equal to £, and its unrestricted BLUE has a positive definite dispersion matrix.] Determine when the MSE matrix of the BLUE of X/3 under model (y, X/3, o2V) with the stochastic restriction A/3 = £ + S (described in Section 7.9.3) is smaller than the MSE matrix of its unrestricted BLUE. Find the estimator of X/3 that minimizes (y-X/3)'F"(y-X/3) subject to the conditions (y — X/3) € C(V) and p'/3 < b, where p'/3 is estimable in (y,X/3,a2V). Prove Proposition 4.10.1 for the general linear model with possibly singular dispersion matrix.
302
Chapter 7 : General Linear Model
7.28 Compare the BLUEs of (/ — Px )Xi/31, their dispersions and the usual estimators of a2 under the models A4 = (y, X\fix + X2/32,cr2I) and M* = (y, (I - P ^ j X ^ . a 2 / ) . 7.29 Prove Proposition 7.11.2, using the proof of Proposition 5.3.12 as a model. 7.30 Suppose that the response vector of the normal-error linear
model ((*)
( £ ) ^ ( n v'J) i-'yp^'yo"-
served, that is, y 0 is unobserved. The purpose of this exercise is to provide a region where y 0 must lie with probability 1 — a. (a) Show that y0 is contained with probability 1 — a in the ellipsoidal 'prediction region'
(yQ-yQy[(XoX--V'oV-){-^D(X0)}(XoX--V'QV-)' + (V O o-V' o V-Vo)]-(yo-yo) < mcr2Fm>n/_ria, where y 0 = Xo3 + V^Ve, (5 and e are as in (7.3.2) and (7.3.3), respectively, n' = p{X^ : V), r = p(X) and m = p[(X0X- - V'0V-){o-2D(Xf3)}(XoX- - V'0V-)' +(Voo-V'oV-Vo)}. (b) If y0 = (j/oi : : yoq)', Xo = (ajOi : : xQq)', Vo = («oi : : voq) and uoi is the ith diagonal element of Voo, i = 1,.. ,q, then show that the generalized version of Scheffe prediction intervals for j/oi, > Voq given in Section 5.4.3 are Woi ~ {™>a2CiFm,n>_r^)ll2,yQi
+
(mo2CiFm,n>_r,a)1/2],
where
c- = KIx--«;,y-){(7-2D(^)}-((j-)'a;orV-«o«) + (voi - «diV~i) O i),
i=
l,...,q.
7.31 Derive simultaneous tolerance intervals for the components of y 0 , given the model of Exercise 7.30, as follows.
7.14 Exercises
303
(a) Consider the prediction of one sample of the entire vector y0 at a time. Show that on the average, at least 100(1 — 7)% of all replications of y0 must satisfy the simultaneous inequalities lyoi-x'uP-v'oiV-iy-XP)]
<
^xl^2(voi-v'OlV-vOi),
i = 1 , . . . , q, where s = p(Voo — V'QV~VQ). (b) Using the result of part (a), 100(1 - a/2)% Scheffe confidence intervals of x'0i(3 — v'0iV~X/3, i = 1 , . . . ,q, and 100(1 — a/2)% upper confidence limit for a2 together with the Bonferroni inequality, show that the intervals
[
I
"
~^z
yoi - \Jm'Fm,tn,_ria/2Ci(T2
~^z
1
- ^xl^diG2
Voi + ym'Fm,tn,_r}Q/2Cicr2
,
+ \jx2sadia2
I ,
where Voi = x'Oi0 + v'Ol V~e, cz =
(x'0lX--v'0lV-)(D(XP)/a2)(X'-x0l-V-v0t),
di = (n'
-r)(voi-v'OiV~vOi)/xl'-r^a/2,
m = p{D(XP)) =
dim(C(X)nC(V))
contain at least 100(1 — 7)% of all replications of y0 with probability not less than 100(1 — a)%. (c) If VQO—V'OV~VQ is a diagonal matrix, then show that the intervals [yOi - abi,yOi + obi],
i = 1 , . . . , q,
where k = y ra'-Fm',n'-r,a/2ci +
27
/2\/^J
contain with probability 1 — a at least 100(1 — 7)% of all replications of any combination of yoi> , yog-
304
Chapter 7 : General Linear Model
7.32 Suppose that samples are drawn from a number of strata of a finite population, and the main variable is observed for all the sampled units. There is no auxiliary variable. The population mean of the variable in the various strata may be different. Identify a suitable linear model of the form (7.13.1), assuming D(yt) = a21, and obtain the BLUP of the population total. Determine the MSEP of the BLUP. 7.33 In Exercise 7.32 let the units in the ith stratum have variance of, every pair of units within the ith stratum have correlation pi, and the units from different strata be uncorrelated. Identify a suitable linear model of the form (7.13.1), and obtain the BLUP of the population total. Determine the MSEP of the BLUP. 7.34 Small area estimation. The units in a finite population are crossclassified into c classes and d domains. The number of units in each cell (that is, each combination of class and domain) is known. Let ysij be the vector of sample values from the ijth cell, and yrij be the corresponding vector of nonsamples. Assume the following model for ytij = (y'sij : y'rij)': EiVtij) = mi1, i = 1.--->C, D(ytij) =
Chapter 8
Misspecified or Unknown Dispersion
The results of Chapters 4 and 7 are based on the crucial assumption that the error dispersion matrix (a2V) is known, up to an unspecified scale factor. The expression for the BLUE of an estimable parametric function, for instance, depends on this dispersion matrix and can not be computed when the matrix is unknown. As we shall see in this chapter, there are many practical situations where the dispersion matrix is unknown. In models with unknown dispersion, a simple strategy may be to use the least squares estimator (LSE), which is known to be unbiased. This amounts to using the BLUE with a linear model with misspecified dispersion matrix. In Section 8.1 we look at the consequences of using an incorrect dispersion matrix, specifically to see if such misspecification may not matter very much in some situations. Sometimes one may have information about the dispersion matrix, say an estimated V from previous data. The possibility of inserting this estimate in the expression for the BLUE is examined in Section 8.2. If no information at all is available about the error dispersion matrix and it is completely unspecified, then it is impossible to derive an inference procedure. This is because the number of unknown parameters far exceeds the size of the data set. In fact the number of unknown parameters in the dispersion matrix alone is n(n + l)/2, all of which have to be estimated using just the n observations! Thus, we can proceed only if there is at least partial knowledge of the dispersion matrix, say in the 305
306
Chapter 8 : Misspecified or Unknown Dispersion
form of a functional form for the elements of the matrix, making these depend on a small number of unknowns. In such a case, the inference problem at hand includes the estimation of /3 as well as the unspecified parameters which determine V. A few general strategies of estimation are considered in Section 8.2. In the subsequent sections we discuss estimation strategies that exploit the functional forms of V in various special cases. A very important special case of partially unknown dispersion matrix is the mixed effects linear model, also known as the variance components model. This model is considered in Section 8.3. Some other cases of correlated errors are dealt with in Section 8.4. The case of uncorrelated errors with unequal variances is discussed in Section 8.5. Some related problems of signal processing are outlined in Section 8.6.
8.1
Misspecified dispersion matrix
Apart from the possibility of oversight, incorrect specification of the dispersion matrix can occur for various other reasons. If the dispersion matrix is unknown, one is tempted to work with the least squares method, which amounts to using the best linear unbiased estimator after assuming a homogeneous and uncorrelated error structure. Even if the dispersion structure is not ignored and an iterative method is used to estimate V", an LSE may be needed as an initial value for the iterative procedure. The quality of the initial value may be crucial to the convergence of the iterative procedure. Some iterative procedures consist of estimating the dispersion matrix and computing the weighted least squares estimator (WLSE) at every stage of the iteration. The estimated dispersion matrix at any given stage is likely to be different from the 'true' dispersion matrix. The extent of misspecification of V can sometimes be marginal, for instance, in the final stages of an iterative procedure. The effect of small perturbations in V on the WLSE is studied by several authors (see Strand 1974, Neuwirth 1984, Stulajter, 1990). In this chapter we consider possibly larger misspecifications, but confine the discussion to the case where the 'assumed' dispersion matrix is a21, so that the BLUE under the assumed model is the LSE.
8.1 Misspecified dispersion matrix
307
The consequences of incorrect specification of the dispersion matrix may be appreciated through the following example. Example 8.1.1
Suppose that the true model is (?/, X/3, a2V), where
/ 1 0 1\ 10-1 1 0 1 10-1 II 1 ' 11-1 I I I Kl 1 -1 J
a
n
d
/I 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 a 0 0 O O O O O a O O O O O O O a VO 0 0 0 0 0 0
0\ 0 0 0 0 ' O O a/
and a is a positive number (not necessarily equal to 1). If we use the LSE (effectively assuming V — I), how does the actual dispersion of this estimator compare with that of the BLUE? which in The dispersion of the LSE is o2{X'X)-lX'VX{X'X)-\ this example, turns out to be / 1/4 D(PLS)=°2[
-1/4
V 0
-1/4 (l + a)/4
0
0 0
(l + o)/16/
On the other hand, the dispersion of the BLUE of j3 is which simplifies to / 1/4 D(PBLU)=^\
-1/4
V 0
-1/4 (l + a)/4
0
\
0
a2(X'V""1X)~1, \
0
a/4(l + a ) /
If we write /3 as (fio : 0i : fa)'\ then it is clear that Var(PitLs) = = (l + a) 2 /4a, Var(PhBLUE) for % = 0,1, but Var(^LS)/Var(^BLUE) a factor which grows without bound as a moves away from 1. Thus, while the LSE can sometimes be as good as the BLUE, it can also be much worse. Note that when a is very small, the last four elements of the response vector carry very accurate information about /3o+/3i+/?2 and fio+fii—fa-
308
Chapter 8 : MisspeciRed or Unknown Dispersion
A combination of the last four observations may be used to determine P2 precisely, irrespective of the values of the first four observations. The LSE fails to exploit this advantage. Conversely, when a is very large, the LSE fails to attach less importance to the last four observations, thus inheriting the large amount of uncertainty associated with them.D
8.1.1
When dispersion mis specification can be tolerated*
Proposition 8.1.2 The BLUE of all estimable LPFs in the linear model (y,X/3,o2V) are the same as the corresponding LSEs with probability 1 if and only if the matrices X and V satisfy one of the following equivalent conditions. (a) C(VX) C C(X), (b) PXV is symmetric. Proof. We do not really have to consider all estimable LPFs; it is enough to consider X/3 only. The LSE of X0 is Pxy, which is unbiased. A necessary and sufficient condition for this to be the BLUE with probability 1 is that it is uncorrelated with all LZFs. The latter condition is the same as PXV(I - Px) = 0, that is, C(VPX) C C(X). Since C{VPX) = C(VX), the necessary and sufficient condition reduces to C(VX) C C(X). This proves part (a). It is clear from the above arguments that the condition (a) is equivalent to PXV(I - Px) = 0 . This can hold only if PXV = PXVPX, that is, Px V is symmetric. The reverse implication is obvious. Example 8.1.3 (Intra-class correlation structure) Suppose that 1 E C(X) and V has the intra-class correlation structure, V = (1 — a)I + all'. This structure amounts to assuming that all the observations have the same variance, and all pairs of observations have the same correlation. In order that V is nonnegative definite, a should not be less than — l/(n — 1). It is easy to see that C(VX) c C(X) in this case. Therefore, the LSE of any estimable LPF would be the corresponding BLUE here. Example 8.1.4
(One-way classified data with between-groups het-
8.1 Misspecified dispersion matrix
309
erogeneity) Consider the model Vij = t* + n + *ij,
j = I ) 2 , . . . ,rii,
i = 1,2,... ,i,
with uncorrelated zero-mean errors such that Varfaj) = of. Let y = {yu
Vim
: Vti
Vtnt)'
a n d /3 = (fj, : n
:
: Tt)'.
Once
the corresponding X and V matrices are identified, it is easy to check that VX = XT, where
T
_
/ 0
0
0
...
0 \
a\
a\
0
...
0
a\
0
ol
...
0
Vcr t2
0
0 ... a^J
It follows that the LSE of every estimable LPF would be its BLUE, in spite of the heterogeneity between the groups. n Example 8.1.5 (Two-way classified data with interaction and heterogeneity between groups) Consider the model Vijk = n + Ti + Pj + -jij + tijk,
k = 1,..., riij, i = 1,..., t, j = 1,..., b,
with uncorrelated zero-mean errors such that Varfejk) = <Jij. Instead of using the messy expression of VX, write V as J^ij Vij, where Vjj i s obtained from V by replacing all the diagonal elements except those for the ijth cell by zero. The multiplication of Vij with X would involve the rows of X corresponding to the ijih cell only. These rows of X are identical to one another. The product VijX, therefore, consists of repeated rows at the locations corresponding to the ijth cell, and zero everywhere else. This is a matrix of rank one, and C(VijX) is spanned by the single column of X that corresponds to the ijth. interaction term. It follows that C{VX) CC(X). Therefore, the LSE of an estimable LPF is robust against heterogeneity between the groups. Another way to appreciate this result is to view the two-way model with interactions as a reparametrization of the one-way model with p~xq groups. The latter set-up has already been explored in the previous example.
310
Chapter 8 : Misspeci&ed or Unknown Dispersion
Example 8.1.6 (Seemingly Unrelated Regression (SUR) model) This model consists of a few apparently unrelated models with equal number of observations, Vi = X l $ l + 6 i ,
E(el) = 0, D(ei) = oijl, i = l,2,...,p, Cov(ei,ej) = (TijI, i,j = 1,2,...,p.
The only connection among the models (y^Xiff, a^I), i = 1,... ,p is through the covariance condition. The special case where Xi is the same for all i can be interpreted as a multivariate linear model, with the ith equation describing the ith characteristic of the response. Once the models are combined to form a single model, it is fairly easy to see that Px is block diagonal with Px ,..., Px along its diagonals. Also, V can be partitioned into p x p blocks with OijI in the ijth block. The condition PXV = VPX is equivalent to aij(Px — Px ) = 0 for all i ^ j . This condition holds if a^ = 0 for all i ^ j , that is, the 'seemingly' unrelated models are really uncorrelated. Another sufficient condition is Px = Px for all i ^ j , which is essentially the case of the linear model with multivariate response. Proposition 8.1.2 only gives necessary and sufficient conditions to check whether the dispersion matrix V is such that one can afford to ignore its form. One of these conditions happens to be satisfied by the matrix V for the special cases considered in the foregoing examples. However, for a more complete understanding of the class of V for which the LSE can serve as BLUE, we need a characterization of its structure, which is provided by the following result. Proposition 8.1.7 The BLUE of all estimable LPFs in the linear model (y,X0,a2V) are the same as the corresponding LSEs with probability 1 if and only if V can be written in the form
V = PXAPX + {I- PX)B{I -Px)
+ cl,
where A and B are symmetric nonnegative definite matrices and c is a nonnegative constant.
8.1 Misspecified dispersion matrix
311
Proof. It is easy to see that if V has the prescribed form, then Px V is symmetric, and hence Proposition 8.1.2 can be used. In order to prove the converse, let Px V = VPX, and write V as V = [PX + (I- Px)}V[PX +
(I-PX)}.
If the expression on the right hand side is expanded, four terms are obtained. The two cross terms turn out to be zero because of the condition PXV = VPX. Therefore, V = PXVPX + (I - PX)V(I - Px). The last expression is of the form prescribed in the proposition. O Example 8.1.8
(A mixed effects model) Consider the model v i=l
where (31:... (3p are fixed parameters and e and "7i,...,7 P are pairwise uncorrelated, zero mean random vectors with D(e) =CT2/and £>(7j) = Vj, i = l,...,p. The model equation can be rewritten by putting together all the random terms, so that the effective dispersion matrix, V = -D(ELi XHi + e ) = °21 + E L i XiViXi- T h i s is clearly in the form described in Proposition 8.1.7, where c = a2, B = 0 and A is a block diagonal matrix with the matrices (X'iXi)X'iViXi(X'iXi)~, i = 1,... ,p appearing as the diagonal blocks. The LSEs should be good enough for 0 in this model, even if a2 and V i , . . . , Vp are unspecified.!^ Apart from the three equivalent conditions given in Propositions 8.1.2 and 8.1.7, various other equivalent conditions have been derived by Rao (1967), Zyskind (1967) and several other authors. Puntanen and Styan (1989) and Lin (1993) survey the literature and catalog about 30 different equivalent conditions (see also Styan, 1973)! Kempthorne, in his discussion of Puntanen and Styan's (1989) article rightly points out that most of these 'equivalent' conditions are algebraic exercises that do not provide much additional insight to the problem. Kramer (1980) and Mathew (1985) consider the possibility that the LSE may coincide with the BLUE if the observation vector y happens to lie in a particular subspace of IRn. Mathew and Bhimasankaram (1983b) and Mathew
312
Chapter 8 : Misspecified or Unknown Dispersion
(1983) give some conditions under which an LPF would have a common
BLUE under the models {y,X/3,a2V)
and
{y,XP,o2Vx).
Even if the LSE coincides with the BLUE, does this mean that the dispersion structure can be completely ignored? This question may be answered by examining the usual estimator of the variance of the LSE of an estimable LPF. The following proposition provides conditions under which this estimator is unaffected, when V is 'assumed' to be / . Proposition 8.1.9 Let M-v and M.i denote the models and (y, X(3,cr2I), respectively, such that p(X : V) = n.
(y,X/3,a2V)
(a) The usual estimator of variance of the 'BLUE' of every estimable LPF, computed from the model Mi, is appropriate for the model Mv if and only if PXVPx + (I - PX)V{I- Px) = cl for some c > 0. (b) The BLUE of every estimable LPF and its usual estimator of variance computed from the model Mi are both appropriate for the model Mv if and only if V = cl for some c > 0. Proof. We assume that M.y is the right model but work with the expressions obtained from the model M.j. If p'/3 is an estimable LPF, the variance of its LSE, as estimated from A4i, is
^<"^>
Mi
=y-p\-*-> 'n~omy-l'Pxl> n
where I is such that X'l = p. The variance of the LSE under the 'true' model is
Mv
=
y'i1 - PxM* - px)m - px))-(i - px)y , p{X:V)-p{X)
vp
- xvrx^
using the expression given in (7.4.2). The statement of the proposition stipulates that the two expressions given above must coincide for all I
8.1 Misspecified dispersion matrix
313
and all y e C(X : V). Since p(X : V) = n, we must have y'{I-Px)y y'(I - PX){{I - PX)V(I - PX)}~(I - Px)y
=
l'PxVPxl l'Pxl
for all y and I. Substituting y = V(I — Px)m in the above expression, we have m'(I-Py)V(I-Py)V(I-Py)m
l'PyVPyl
(8.1.1) Each ratio should therefore, be equal to a constant c > 0 that does not depend on m or I. Setting the first ratio equal to c, we have cm'(I - PX)V(I - Px)m = m'[(I - PX)V(J - Px)]2m
for all m.
Therefore, all eigenvalues of the matrix c'1 (I — PX)V(I — Px) are equal to their respective squares. Hence, this matrix must be an idempotent matrix which is also symmetric. It must be the orthogonal projection matrix (see Exercise 2.13) for C((I — Px)V). The latter is in general a subset of C(I — Px)- However, the condition p(X : V) = n ensures that the dimensions of these two spaces are equal. Hence, the two column spaces are identical. By equating the orthogonal projection matrices of the two column spaces, we have c~\l-Px)V{I-Px) = I-Px. Since the right hand side of (8.1.1) is also equal to c, we have PxVPx=cPx. Putting the two conditions together, we have the necessary and sufficient condition PXVPX + (I - PX)V{I - Px) = cl. This proves part (a). In order to prove part (b), write V as V = [PX+ (I-PX)]V[PX + (I-Px)] = pxvpx + (I-PX)V(I-PX)+PXV(I-PX)
+ (I-PX)VPX.
314
Chapter 8 : Misspecified or Unknown Dispersion
In order that the BLUEs under the models M / and My coincide, Px V must be symmetric, that is, the last two summands of the last expression should be zero. Part (a) implies that the estimated variances of the BLUEs for the two models agree if and only if the sum of the first two summands is cl for some c > 0. The two conditions hold simultaneously if and only if V = cl for some c > 0. Example 8.1.10 (Intra-class correlation structure, continued) For the model of Example 8.1.3 assume that 1 G C(X). Then the dispersion of the BLUE of X(3 is D{Pxy) = o2PxVPx
= a2Px[(l-a)I
+ all'}Px
= a2[(l-a)Px+all']. The usual estimator of a2, computed from (7.4.1) and (7.4.2) is 7
y'(i-Px)y (l-a)(n-p(X)Y
Therefore, the correctly (and unbiasedly) estimated dispersion matrix of the BLUE of XfB is
~2P x
VP x
- y'V-px)y \P + _ 2 L . n , i n — p(X) L 1—a J
In contrast, the dispersion matrix computed by wrongly assuming a = 0 (that is, V = / ) is y'(i-px)yp n-p(X) * ' which is clearly an underestimate. O Part (b) of Proposition 8.1.9 raises a serious question about the usefulness of a large body of research done on inference with misspecified dispersion matrix. It says that even if the estimators are alright, a misspecified dispersion matrix would seriously jeopardize some other aspect of inference. Several researchers consider robustness of inference under dispersion misspecification. Kariya (1980) provides a general structure of V such
8.1 Misspecified dispersion matrix
315
that the estimator of a2 under two models would be identical. Jeyaratnam (1982) obtains conditions on V that ensure the validity of the likelihood ratio test which is appropriate for V = I. Bischoff (1993) discusses the robustness of D-optimal designs under dispersion misspecification, such that the point estimators of estimable LPFs remain the same. According to Proposition 8.1.9, the analysis of data arising out of such an experiment would be questionable for any further inference (for instance, for tests of hypotheses), unless the misspecification is rectified at that stage. Mathew and Bhimasankaram (1983a, 1983b), like a number of researchers before them, consider the validity of the GLRT of a linear hypothesis. An interesting aspect of their work is that they look for invariance of inference for a specific vector LPF A/3, rather than that of the entire vector X/3. They also find conditions under which the test with a misspecified model would be a conservative one, even if the test statistic is wrong. For instance, they show that the likelihood ratio test of any linear hypothesis of the form A/3 = 0 would be a conservative one if the true dispersion matrix has the intra-class correlation structure with a > 0. This is perhaps one of the few studies where useful conclusions are drawn in this context of a linear model with misspecified dispersion matrix. If we cannot get full mileage out of the LSE, we may try to do the next best thing: ask how good it is in relation to the BLUE, as we do in the next section. 8.1.2
Efficiency of least squares
estimators*
Suppose that we are interested in an estimable LPF p'/3 with P'PBLU and p'Pis representing its BLUE and LSE, respectively. We use as an indicator of efficiency the number = P
f Var(p>PBLU)/Var(p'PLS) \l
if
V«TWPLS)
ifVar(p'0LS)
> 0, = O.
Note that when Var(p'/3LS) = 0, Var(p'f3BLU) is also equal to zero, and so we define the efficiency to be equal to 1 in this case. In general
316
Chapter 8 : Misspecified or Unknown Dispersion
rjp is a number between 0 and 1. If V is singular, then an LSE mayhave zero efficiency, as in the case of ^3 of Example 8.1.1 with a = 0. It follows from Exercise 7.10 that 1 — r]p is the squared multiple correlation coefficient of p'(3LS with any generating set of LZFs. Let us examine the worst case scenario for the LSE. It can be shown that
(8.1.2) where Xmax(V) and Xmin(V) are the largest and smallest eigenvalues, respectively, of V. See Hannan (1970) for a proof of this result when X and V have full column rank, and Wang and Chow (1994) for a proof in the general case where X may be rank deficient. When V is rank-deficient, the inequality holds trivially. It is clear that the lower bound on 77 is close to 1 when Xmin(V) is close to A max (V) (recall that Xmin{V) = A max (V) if and only if V = cl for some c > 0). On the other hand, the bound can be very small and close to zero, if the extreme eigenvalues of V are far apart. Example 8.1.11 Consider the model of Example 8.1.1. Since V is diagonal, the diagonal elements are its eigenvalues. The lower bound on the efficiency is 4a/(l + a)2. This bound is evidently attained in the case of the LSE of /32. Even though the lower bound on the efficiency happens to be sharp in the above example, this is not always the case. For instance, if the third column of the X-matrix of Example 8.1.1 is removed, then the X and V matrices satisfy the condition C(VX) C C(X). Therefore, by Proposition 8.1.2, r\v = 1 for all estimable p'/3. The lower bound, however, remains 4a/(l + a) 2 as V is not changed. The bound (8.1.2) is based on a worst-case scenario for a given V. For a given pair X and V, the efficiency of the LSE would depend on the interplay between these two matrices. The next proposition provides a basis for understanding of the situations which are favourable or unfavourable to the LSE. Proposition 8.1.12 Let p'(3 be an estimable LPF in the linear model (y,X(3,a2V). If the LSE ofp'fl has non-zero variance, then its effi-
8.1 MisspeciEed dispersion matrix
317
ciency r\v is bounded from above and below by 1 - Xmax(Ul2UlU'lU2)
< Vp < 1 -
\min{U'2U1U'1U2),
where the columns ofU\ and U2 form an orthonormal basis ofC(C'(I — Px)) and C(C'PX), respectively, and CC' is a rank factorization ofV. The bounds are sharp. Proof. Suppose that p'/3 = Z'JC/3, so that the LSE of p'/3 is l'Pxy and the BLUE is l'X/3BLU.
It follows that
_ *
l'D(X0BLU)l l'D(Pxy)l
Further, the difference between the above two estimators is an LZF. Therefore,
D{Pxy)
= Cov(XPBLU + = Cov(X0BLU,Pxy)
(Pxy-X0BLU),Pxy) + Cov((Pxy -
= D(XpBLU) + Cov((Px -
XpBLU),Pxy)
XpBLU),Pxy)
It follows that
Vp
~
l'Cov((Px-XPBLU),Pxy)l l'D{Pxy)l
By virtue of (7.3.1) the covariance term simplifies as below:
o-2Cov{(Pxy -
XpBLU),Pxy)
= PxVPx - VPx + V{I-PX)[(I-PX)V{I-PX)]-{I-PX)VPX = B'B -CB + CA[A'A]-A'B = B'B - C{I-PA)B - B'B - (B' + A')(I~PA)B = B'B - B'{I-PA)B = B'PAB, where CC' is a rank factorization of V, A = C'(I — Px) and B = C'PX. Hence, l'B'PABl VP-1' I'B'Bl
318
Chapter 8 : MisspeciGed or Unknown Dispersion
Let U\ and U2 be matrices with columns forming an orthonormal basis of C{A) and C(B), respectively. Then PA = UiU[, and Bl can always be written as U2q for some vector q. Therefore, _ ~i
7?p
q'U'2UlU'lU2q q'U'2U2q
_ - 1
q'U'2UlU'lU2q q'q
Since q is a completely arbitrary vector, the statement of the proposition follows. It is clear from the above proposition that the efficiency of the LSE depends directly on the relationship between the semi-orthogonal matrices U\ and U2. Two special cases are particularly interesting. When U[U2 = 0, the lower bound is 1 from Proposition 8.1.12, so that the LSE of every estimable function has efficiency 1. The condition U1U2 = 0 indicates orthogonality of the column spaces of C'PX and C'(I - Px), which is also equivalent to Px V(I - Px) = 0. This leads us back to the condition Px V = VPX which is necessary and sufficient for the equivalence of LSE and BLUE of every estimable LPF. The other interesting special case occurs when the column spaces spanned by t/i and U2 have something in common i.e., C(U i)nC(L72) / {0}. If the vector q in the proof of Proposition 8.1.12 is such that U2q £ C{U\), then the efficiency drops to zero. Thus, if p'/3 is such that p = X'l and VPxl E C(V(I - Px)),
(8.1.3)
then the LSE of p'/3 would have zero efficiency. Remark 8.1.13 Let U3 be a matrix whose columns form an orthonormal basis of C{C'{I-PX))L. It follows that U{U\ + UzU'z = / , under the notations of Proposition 8.1.12. Therefore, the bound of Proposition 8.1.12 can also be written as Xmin(U'2U3U'3U2)
*max(U2U3U'3U2),
When the number of variables in the model is much smaller than the number of parameter, C/3 would have fewer columns than U\, and easier to handle.
8.1 Misspecified dispersion matrix
319
For the given forms of X and V in the model of Example 8.1.14 Example 8.1.1, one can choose C as a diagonal matrix with diagonal elements given as square roots of those of V. The column space of X is easily seen to be spanned by the orthogonal vectors /0\ 0 0 0
/1\ 1 I 1 1
1 1
0 0
\i/
Vo/
/
1 \ -1 1 -1 -1 1
v -l /
Therefore, C(C'PX) is spanned by the vectors C'u\, Cu2 and C«3, which can be easily orthogonalized. It follows that U>z can be chosen as /
U 2
_ ~
0 0 0 0 1/2 1/2 1/2
\ 1/2
1/2 1/2 1/2 1/2 0 0 0
1/2(1 + a ) 1 / 2 \ -1/2(1 + a ) 1 ^ 1/2(1 + a ) 1 / 2 -1/2(1+a)1/* a1'2/2{\ + a)1'2 ' - a ^ ^ l + a)1^ a 1 /2/ 2 (l + a )i/2
0
-a 1 / 2 /2(l + a)ll2
Using the representation Px = J2i=i uiui/\\ui\\2 it can be verified that
)
( see Proposition 2.3.4),
C'{I-PX)C (
5 -1
- 1 - 3 5 -1
-1 -3
-a1'2 a1/2
-3
-1
-1
-QV2
5 a}'2
a1'2 5a
5
1.-1 " 8 -a1'2
-3 a1'2
-1 -a1'2
a V2
_ a i/2
a i/2
-a1/2 Va 1 / 2
a1/2 -a1/2
-a1/2 aV2
_ a i/2
_
a1/2 _ a i/2
_3a _a
a
a1'2 -a1/2
-a1/2 a1/2
a1/2 \ -a1/2
a i/2
_ fl i/2
a i/2
-a1'2 -a
all2 -3a
-all2 -a "
_
_
5a
-a _3a
a
5a _a
3 a
-a 5a
/
320
Chapter 8 : Misspecified or Unknown Dispersion
An orthonormal basis for C(C(I - P ^ ) ) 1 is obtained from the three eigenvectors of the above matrix which correspond to zero eigenvalues. A possible choice of this basis is the set of columns of the matrix
^
3
-
/ l/S 1 ^ 1/8 l/2
1/8 l/2
l/8l/2
1/8 l/2
1/g l/2
1/8V2 i/ 8 i/2 1/8^2 1/8 1 / 2 V 1/8V2
al/2/2(l+a)l/2 \ _al/2/2(l+a)l/2 o l/2/ 2 (l + a)l/2
l/S1^ - o ^ ^ l + o ) 1 / 2 _ 1 / 8 i/2 1/2(1+ a)V2 -1/8V2 -1/2(1 + a)1/2 -1/81/2 1/2(1+ 0)^2 -l/S1^ - 1 / 2 ( 1 + a)V2 7
We are now ready to compute the bounds on r}p: Remark 8.1.13 tells us that these bounds are the extreme eigenvalues of U2U3U3U2, (that is, the extreme singular values of the matrix C/3C/2). It follows from the expressions of C/2 and f/3 that / U'3U2 =
1/21/2 -1/21/2
V
0
1/21/2 -1/21/2
0
0 0
\
2^/2/(1 + a) I
which has singular values 1, 1 and 2a 1 / 2 /(l + a). Consequently, we have for all p 4a We have earlier found a p (equal to (0 : 0 : 1)') for which the lower bound of efficiency is achieved, and two other values of p (equal to ( 1 : 0 : 0)' and ( 0 : 1 : 0)') for which the upper bound is achieved. Note that in the special case V = I (that is, a — 1), the lower bound is also equal to 1. Let us briefly look into the special case a = 0. The choice p = (0 : 0 : 1)' corresponds to p = X'l with I = ( 0 : 0 : 0 : 0 : 0 : 0 : ^ : -\)'. It is easy to see that VPxl = \{l : - 1 : 1 : - 1 : 0 : 0 : 0 : 0)'. Also, the vector W4 = (1 : —1 : 1 : —1 : —1 : 1 : —1 : 1)' is such that VU4 is proportional to VPxl while U4 G C(I — Px)- Therefore,
8.1 Misspecified dispersion matrix
321
VPxl G C{V(I - Px)), and p'/3 satisfies the condition of (8.1.3). As expected, the LSE of p'(3 has zero efficiency in this case. D When V and X are both full-rank, one can consider an indicator of the efficiency of the LSE of the vector parameter /3. One such measure
det(D0BLU)) det(D(3LS)) ' Bloomfield and Watson (1975) and Knott (1975) show that
where Ai > > \n are the eigenvalues of V, and the number of observations (n) is assumed to be at least twice as large as the number of parameters (k). It is clear that the lower bound can be quite small when the eigenvalues are not of the same order. Like (8.1.2), the above lower bound also ignores the interplay between C(X) and C(V), and assumes a worst case scenario for X. Watson (1967) shows that 77 can be expressed as
17 = 13(1-^), where pi, ,Pk are the canonical correlations between Xf3LS and the LZF vector (I — Px)y. This result provides an interesting interpretation of 77. The vector Xf3LS is the sum of X/3BLU and a vector of LZFs. The added LZFs account for the larger dispersion of X(3LS than the corresponding BLUE. The canonical correlations measure the extent of contamination of the BLUEs by the LZFs. Kramer and Donninger (1987) work with yet another ratio, ti(D{X0BLU)) tr{D{XpLS))
'
as an indicator of efficiency of the LSE. Puntanen (1987) provides a review of various measures of efficiency of the LSE. Tilke (1993) obtains efficiency expressions for covariance structures arising in spatial data.
322
8.1.3
Chapter 8 : MisspeciRed or Unknown Dispersion
Effect on the estimated variance of LSEs*
Let us now consider the effect of misspecified error dispersion matrix on the usual estimator of the variance of the LSE of an estimable LPF p'/3. Let the true dispersion matrix be o2V. If V is positive definite, Swindel (1968) shows that Afc+i + n-k
+ K 1 < E[V^r{p'PLS)} <X1 + ... + \n_k 1 (8.1.4) Ai - Var(p'(3LS) ~ n-k Xn
where Ai,..., An are the eigenvalues of V in the decreasing order, and k is the number of elements in f3. If the eigenvalues of V are scattered over several orders of magnitude, the above lower an upper bounds are much smaller and larger than 1, respectively. This indicates that there is a possibility of considerable over- or under-estimation of the variance. However, for a given combination of X and V matrices, the bounds of (8.1.4) may not be sharp. This happens because the inequalities of (8.1.4) are based on the worst-case scenario for X, as we have seen in the case of the bounds of efficiency given in (8.1.2). The following proposition leads to sharper and attainable bounds on the bias of the estimated variance of p'/3. Proposition 8.1.15 In the above set-up, suppose that U is a matrix whose columns form an orthonormal basis ofC(X), and let Var(p'f3LS) > 0. Then 1 ti((I-Px)V) ^maxiU'VU)' n-p(X)
< E[Va~r(p'PLS)}
< ~
Var(p'0LS) 1 Xmin(U'VU)'
tt((I-Px)V) n-p(X)
Proof. The estimated variance of the LSE of p'/3 is
where / is such that p = X'l. On the other hand, the correct variance is o2l'PxVPxl. Note that if U is as described in the proposition, then
8.1 Misspecified dispersion matrix
323
Pvl = Uv for some vector v. It follows that
E[vZrtf0LS)] Var(p'PLS)
=
tv((I-Px)E(yy'))-l'Pxl n - p(X) a2l'PxVPxl
l'Pxl tr{(I-Px)V) n-p{X) l'PxVPxl'
_ ~
v'U'Uv v'U'VUv'
ti((I-Px)V) n-p{X) '
Since U'U = I and v is completely arbitrary, the statement of the proposition follows. The case Var{p' j3LS) — 0 is dealt with in Exercise 8.7. Example 8.1.16 (Intra-class correlation structure, continued) Consider the model of Example 8.1.3 and suppose that 1 € C{X) and V - ( l - a ) / + a l l ' . Simple calculations show that tr((I-Px)V)/(np(X)) = 1-a. Further, U'VU = (l-a)I + aU'll'U. One eigenvalue of this matrix is 1 + (n — l)a, while the other eigenvalues are 1 — a. Therefore, we have from Proposition 8.1.15 1-a < E{V^r(P'0LS)) < ^ 1 + (n - l)a ~ Var(p>J3LS) ~ ' when a > 0. Thus, the variance of the LSE may be underestimated, even though the estimator itself coincides with the BLUE. When a is negative, the expressions of the lower and upper bounds interchange, indicating the possibility of overestimation of the variance of the LSE. Sharpness of the pair of bounds can be demonstrated (see Exercise 8.8). Note that the eigenvalues of V are 1 - (n — l)a and 1 — a, with the latter having multiplicity n — 1. It follows that the bounds obtained from (8.1.4) are (1 - a)/(I + (n - l)a) and 1 + na/(n - k)(l - a), respectively. One of these bounds is sharp, but the bound from the other direction is not sharp. Example 8.1.17 choice of U is
Consider the model of Example 8.1.1. A possible /_"]_
U~{\\Ul\\
u2
\\u2\\
u3
s
HusllJ'
324
Chapter 8 : Misspecified or Unknown Dispersion
where m, U2 and 113 are as described in page 319. It follows that U'VU is a diagonal matrix having elements a, 1 and (1 + a)/2. On the other hand, tr((J - PX)V) is the same as tr(C"(J - PX)C), where V = CC'. Using the expression of C'(I — PX)C given in page 319, we have tr((I - PX)V) = 5(o 4-1)/2. When a < 1, the lower and upper bounds given by Proposition 8.1.15 are 1 and (1 + a)/2a, respectively. Evidently the variance of the LSE tends to be overestimated. The lower bound is achieved by the estimated variances of f3\ and p\, while the upper bound is sharp in the case of /?o+/3i- The bounds obtained from (8.1.4) are (4a + l)/5 and (4+o)/5a, respectively. Neither of these bounds are sharp. When a>\, the lower and upper bounds of Proposition 8.1.15 are (1+ a)/2a and 1, respectively, which indicates a tendency of underestimation of the variance of the LSE in this case. The lower bound is sharp for the estimated variance of ^0 + J3\, while the upper bound is achieved in the cases of /?i and fa- The bounds given by (8.1.4) are not sharp.D
8.2
Unknown dispersion: the general case
When the error dispersion matrix is unknown, a simple strategy may be to plug in an estimate of it in the expression for the BLUE. This estimate may be based on historical data or other prior information. Alternatively, the estimate of the dispersion matrix may be based on the data at hand. We shall show that in either case the plug-in estimator is unbiased, under mild conditions. We also consider in this section likelihood-based estimation of the dispersion matrix.
8.2.1
An estimator based on prior
information*
Let V be an estimate of V in the linear model (y, X(3,a2V), such that V is independent of y. Once V has been estimated, X/3 may be estimated by replacing V with V in the expression of its BLUE. We shall refer to this estimator as the plug-in estimator, X(3pi. Proposition 8.2.1 Under the above set-up, let C(V) = C(V) with probability 1. Then the plug-in estimator is unbiased for X/3.
8.2 Unknown dispersion: the general case
325
Proof. The result follows from the form of the BLUE given in Proposition 7.3.1, after replacing V with V and taking conditional expectation of the resulting expression given V. Remark 8.2.2 Even though the plug-in estimator of Proposition 8.2.1 is unbiased, it may not be close to the BLUE. On the contrary, it may even have a larger dispersion than the LSE. To see this, note that the dispersion of Xf3pi is D(X0pi) = E{D(X0pt\V)]+D[E(XPpi\V)}
= E{D{X$pi\V)}.
The assumption C{V) = C(V) implies that Xfipi G C(X) almost surely (see Exercise 8.9). Therefore, D(X0p,\V) = D(PxXjipi\V)
= D(Pxy -
Pxepi\V),
where epi is the residual vector corresponding to X(3pi. It follows that o-~2D(Xppi) =
PxVPx-E{Cov(Pxy,Pxepi\V)} -E[Cov(PxepuPxy\V)]
+ E[D(Pxepi\V)}.
Now consider the situation which is most favourable to the LSE. The LSE coincides with the BLUE when PXV = PxVPX (see Proposition 8.1.2). Whenever this condition holds, the covariance terms in the above expression are equal to zero. In such a case, the dispersion of the LSE of X/3, o2PxVPx, is smaller than the dispersion of the plugin estimator. Even if this condition holds approximately, the plug-in estimator based on prior information on V is not likely to be an improvement upon the LSE. Rao (1967) obtains exact confidence regions for the plug-in estimator under the assumption of normality of the errors, Wishart distribution of V and nonsingularity of V. The prior information on V may also be available in the form of a distribution. See Exercise 8.10 for properties of an estimator which utilizes such information.
326
8.2.2
Chapter 8 : Misspecified or Unknown Dispersion
Maximum likelihood
estimator
Let us now suppose that V is a function of an unknown vector parameter 0. When the scale factor (a2) is unspecified, we include in 0 for notational simplicity. Thus, the model is (y, -X"/3, V(9)). We assume that the error distribution is normal. If the rank of V{6) is less than the sample size (ro), the joint distribution of y is singular normal, having the density given in Section 3.2. The joint likelihood of /3 and 0 is
L(/3,0) =
(27r)^m\C'(e)C(e)rl2exp^-1-(y-Xf3yV-(d)(y-X^,
where C(9)C'(9) is a rank factorization of V(0). Note that the determinant |C"(0)C(0)| reduces to |V(0)| when V(0) is nonsingular. On the other hand, when V(0) is singular, we have to restrict the ranges of 0 and /3 so that y is always in C(X : V(d)) and y - Xf3 is in C(V). Further, if p(V(6)) depends on the value of 0, then the likelihood function is not bounded. Therefore, we also have to restrict the range of 0 so that the rank of V{0) is constant. Since /3 may not always be estimable, we work with X@. It follows, along the lines of the argument given in Section 7.5, that
X£)ML
=
aigmin[{y-X0)V-{dML)(y-XP)], X{3 _
6ML = argmm[log\C'(e)C(9)\ 0
+
_
(y-XpMLyv-(0)(y-XPML)}.
For every value of 0 in its allowable range (such that y G C(X : V) and p(V{0)) is constant), one may take V(0) as the true dispersion matrix and compute the BLUE of X/3 and the corresponding error sum of squares. If the latter is denoted by RQ(0), then the MLE of 0 is the minimizer of
log\C'(0)C(9)\+R2o(0) (see Proposition 7.4.4 and the subsequent discussion). Example 8.2.3 (Autocorrelated errors) Suppose that the errors in the linear model (y, X/3, V(9)) follow the autoregressive model €i = 0ei_i + Si,
i = 2,3,...,n,
8.2 Unknown dispersion: the general case
327
$2, , Sn being uncorrelated, each with variance a2. In such a case, the parameter 6 consists of a2 and >. The dispersion matrix of e is /I
V{o2,
\(f>n-1
4>
cj?
1
(f)n-2
>"-! \
4>n~2 r~3
0"" 3
1
, /
which is full rank, as long as \(f)\ < 1. A possible factorization of V(0) is C(0)C'(0), where /
(l-^.2)-1^ <j>{\ - 4?)-ll2
c{0) = o
o
0 1
tfii-^r1'2
0\ 0
4>
...
i
0
... o .
It follows that \C'{0)C(0)\ = a2n{\ - 4>2)nl2'1. Since C'{0) is nonsingular, one way to obtain RQ for a given 9 is through the least squares analysis of the model {C~l{d)y,C~l{d)X/3,a2I). Note that / ( I _ ^2)1/2
-cf>
o .
c-1(^) = a-1
V
0
0
1
0
...
-4> l . .
0
0
0
o
o
o
QX
0
0
0
o
o . . .
1
0
-
The MLEs of a2 and <> / are obtained by minimizing n log cr2 + (n/2 — l ) l o g ( l - ^ 2 ) + i2 2 (a 2 ,0), where
i2§(<72,0) =y'C-1'(0) ( / - P ^ ^ j c - ^ ^ y = i2g(l,^)/<72. It follows that the MLE of a2 is i?g(l,0)/n and the MLE of ^ is the minimizer of (n/2 - 1) log(l - 0 2 ) + n l o g ^ ( 1 , 0 ) .
328
Chapter 8 : Misspecified or Unknown Dispersion
The MLE of XJ3 is its BLUE from the model {y,Xj3,a2V) replaced by V(6ML)-
with V ^
In general closed form expressions of the MLEs of 9 and /3 are not available. One may have to adopt a recursive strategy of alternately optimizing over 6 and j3. We now describe a variation of the maximum likelihood (ML) method that has some useful properties. 8.2.3
Translation invariance and RE ML
Consider the model (y,Xf3,a2V(9)). For any vector / of appropriate dimension, the perturbed response y + XI follows the model (y + XI, X(/3 +1), a2 V). Since the parameter ft is unspecified, the addition of / to /3 should not change the model. If we are interested in estimating 9, then the models (y, X/3, o2V{9)) and (y + XI, X/3, a2V{B)) should be equivalent for this purpose. We shall refer to an estimator of 9 as translation invariant if it remains the same when y is replaced by y + XI for any I. This property is also referred to as translation invariance in the literature. If the error distribution is such that the MLE of 9 exists, then it must be translation invariant. Proposition 8.2.4 In the above set-up, an estimator of 0 is translation invariant if and only if it depends on y through the linear zero functions alone. Proof. If 0(y) is a translation invariant estimator of 6, then 0(y) = 6((I -PX)V + X((X'X)-X'y))
= 9{(I - Px)y),
which depends on y only through (I — Px)y, which is a vector of LZFs. On the other hand, if 9(y) depends on y only through LZFs, it must be a function of any generating set of LZFs. In particular, it is a function of (I — Px)y- Let us express 9{y) as r)((I — Px)y)- Then for every vector I of appropriate dimension we have 9{y + XI) = rj((I - Px)(y + XI) = V((I - Px)(y) = 9(y). Thus, 9{y) is translation invariant.
D
8.2 Unknown dispersion: the general case
329
The LZFs carry information about the model error (see Remark 4.1.6). Therefore, translation invariance is a reasonable property that we can expect an estimator of 0 to possess. As pointed out at the beginning of Section 7.3, every LZF must be a function of (I — Px)y, the least squares residual vector, whatever be the true dispersion. It follows that an estimator of 6 is translation-invariant if and only if it depends on y through (/ — Px )y. The emphasis on the least squares residual vector can be carried further by considering the likelihood of 0 constructed from it. Note that (I — Px)y is the response in the reduced linear model ((/ — Px)y,0,o2{I - Px)V(0){I - Px)). The MLE of 0 obtained by maximizing the likelihood function constructed from this model is called the residual maximum likelihood (REML) estimator. Because of the restriction of translation invariance, it is also called the restricted maximum likelihood estimator. In the normal case the REML estimator of 0 minimizes log \G'(0)G(d)\ + y'(I - Px)[G(0)G{0)']-(I
~ Px)y
over all allowable 0 such that (I — Px)y € C(G(6)) and the rank of G'{6)G{6) is constant, where G{G)G(6)' is any rank factorization of {I-PX)V{6){I~PX). Itfollowsfrom Remark 4.7.7 that y'(I-P)[G(0)G(0)'}-(I-Px)y is the same as the error sum of squares, RQ{0) computed from the model {y,Xf3,o2V{6)). Therefore, the ML and REML methods for estimating 9 differ only in the first term of the objective function that is minimized. Both the estimators are translation invariant. However, unlike the MLE, the REML estimator often accounts for the degrees of freedom utilized for estimation of the regression parameters. For instance, the 'natural' unbiased estimator of variance derived in Section 7.4 happens to be the REML estimator in the normal case (see Exercise 8.11). Remark 8.2.5 Note that \G'(0)G{0)\ is the product of the nonzero eigenvalues of G{0)G'{O). If / - Px is written as UU' where U is a semi-orthogonal matrix, then the non-zero eigenvalues of (I — Px)V(0)(I - PX) and U'V(0)UU'U are identical. The latter matrix
330
Chapter 8 : Misspecified or Unknown Dispersion
simplifies to U'V(0)U. This matrix has full rank whenever V(0) has full rank. In such a case, the REML estimator of 0 is the minimizer of log\U'V(0)U\ + R2o(0). D Example 8.2.6 (Autocorrelated errors, continued) Consider the linear model of Example 8.2.3. Let UU' be a rank factorization of the matrix I — Px. Note that U is a semi-orthogonal matrix. Since V{0) is nonsingular for the specified range of 4>, Remark 8.2.5 implies that the REML estimators of o1 and <> / are found by minimizing loglE/'V^^l/l + i^a2,^)
=
{n-k)\ogo2
+ \og\UlV{l,4>)U\+ Rl{l,4>)l
It follows that the REML estimator of a 2 is i?g(l,^>)/(n — A;), which is larger than the corresponding MLE. The REML of
log \U'V(1,4)U\ + {n-k) Iogi2g(l, 0). In contrast, the MLE is the minimizer of log | V (1,
8.2.4
A two-stage estimator
Let V{0) be an estimator of V(0), based on the response vector y under the model (y,X/3,V(0)). Let X(3ts be the estimator of Xf3 obtained by plugging in V{0) for V(0) in the expression of its BLUE. If V{0) is a translation invariant estimator, then it is a function of (I - Px)y, the vector of least squares residuals. Thus, the estimation procedure for X/3 can notionally be said to have two stages: an initial stage of least squares estimation (and subsequent estimation of the
8.2 Unknown dispersion: the general case
331
nuisance parameter, 9) followed by a second stage of best linear unbiased estimation using V(6) as the dispersion matrix. We refer to this estimator as the two-stage estimator. Proposition 8.2.7 In the above set-up, let the model error e — y — X/3 have a symmetric distribution about 0, and V(6) = H(y) be a translation invariant estimator of V(0) such that C(V(0)) = C(V) for all y 6 C(X : V) and H(—y) = H(y). Then the distribution of X/3ts — X/3 is symmetric about 0. Proof. The expression of the BLUE of X/3 given in Proposition 7.3.1 implies that X0ts-X(3 =
[I-H(e)(I-Px){(I-Px)H(e)(I-Px)}-(I-Px)}e.
The result follows from the fact that the expression on the right hand side is multiplied by —1 whenever e is replaced by —e. Thus, if E{X(3ts) exists, then Xj3ts is unbiased for Xf3. This result is proved in the case of nonsingular V in Wang and Chow (1994). Remark 8.2.8 The twin conditions of translation invariance and symmetry with respect to the data are satisfied by most of the common estimators of V that are applicable to the special cases considered in the next few sections. Proposition 8.2.7 implies that in all these cases the two-stage estimator would be unbiased as long as its mean exists and the error distribution is symmetric about 0. In particular, the two-stage estimator of X(3 based on the ML or REML estimator of 0 is unbiased under these conditions. Remark 8.2.9 Suppose that rj((I — Px)y) is a translation invariant estimator of 6. Consider the following iterative estimation scheme, where the subscript i denotes the estimator at the ith stage of iteration.
Xfii
= [I -V(di)(I - PX){(I - Px)V(0i)(I - Px)}-(I - Px)]y.
The above recursions hold for i > 1, and X0O = P y. If 0, is translation invariant, then y - Xfi{ is a function of (I - Px )y, and hence 0 i + 1
332
Chapter 8 : Misspecified or Unknown Dispersion
is also translation invariant. Thus, if the error distribution is symmetric about 0, and ^ ( X / ^ ) exists, then X0i is unbiased for every i. As in the case of the plug-in estimator described in Section 8.2.1, other properties of the two-stage estimator of X/3 can not be derived without further assumptions. Eaton (1985) and Christensen (1991) show that if V(0; y) is an unbiased estimator of V(0), then under certain conditions on the error distribution, the dispersion of the twostage estimator of X/3 is larger than that of the corresponding BLUE. Further, if the dispersion of the two-stage estimator is computed in the usual way by treating the estimated V(0) as 'true', then under the above conditions this estimated dispersion is expected to be smaller than the dispersion of the BLUE. In other words, the two-stage estimator may have a large dispersion which is likely to be underestimated. On the other hand, there is some good news on the asymptotic front (see e.g. Ullah et al., 1983) involving special structures of V{6) and consistent estimators of 9. These results indicate that the asymptotic dispersion of the two-stage estimator of X/3, as the sample size goes to infinity, may be the same as that of the BLUE in the case of known 0.
8.3
Mixed effects and variance components
The general form of the mixed effects model (also known as the mixed model) is k
y = X(3 + YtUai,
(8.3.1)
i=\
where Ui,...,Uk are known matrices, /3 is a fixed but unspecified parameter, and 7 x , . . . , 7 f e are random vector parameters such that £ ( 7 i ) = 0 and D(7j) = o\l for % = 1 , . . . , k and C£w(7;,7y) = 0 for i ?= j . The case of k — 1 coincides with the model (y, X/3, a\U\U'i) which has been dealt with in the previous chapter. In the present context this special case is referred to as the 'fixed effects' model. Another special case where X consists of a single intercept column is referred to as the 'random effects' model. We shall refer to the fixed and random parameters of the mixed model as fixed and random effects, respectively.
8.3 Mixed effects and variance components
333
The random effects part of the mixed model represents a special type of influence exerted by the E/j's on the response y. According to the model, the effect of these are linear, but these change from one experiment to another. For instance, the effect of a batch in a production process is thought to be better represented as a random effect than as a fixed effect. Since the coefficients of the Cj's are random, it does not make sense to 'estimate' them. It may be somewhat unrealistic to presume that the mean of a random effect is always zero. However, the mean is non-random, and therefore it can be clubbed together with the fixed effects whenever the need arises. It is easy to see that the model (8.3.1) is a special case of the model (y,X0,V(O)),
V(0) = Ylo-iVi,
O = (al...ol)',
(8.3.2)
and V i , . . . , Vk are known nonnegative definite matrices. This model is known as the variance components model. The mixed effects model (8.3.1) corresponds to the choice Vi = UiU[, % = 1,2,..., k. On the other hand, any model of the above form can always be represented by (8.3.1) with suitable choices of the 'random effects'. We shall treat these models as equivalent to one another. Sometimes the models (8.3.1) and (8.3.2) are written in a slightly different way: by showing an additional term that represents homogeneous and uncorrelated errors. We prefer to absorb this term in the l/;'s and Vj's. We have seen in Section 8.1 that the knowledge of the parameter 6 is usually necessary for inference on 0. However, inference on 6 in the variance components model is also an important problem by its own right. This is needed, for instance, to examine the importance of various random effects or to assess the quality of estimation or prediction. 8.3.1
Identifiability
and
estimability
Recall that in the special case of the fixed effects model, a quadratic function of y turned out to be a natural estimator of a2 which is unbiased and translation invariant. In the general case also we may look for quadratic and unbiased estimators of a2,..., a\. At the outset it is im-
334
Chapter 8 : Misspecified or Unknown Dispersion
portant to observe that one cannot always expect to estimate of,..., o\ in this context. For instance, if k = 2 and V\ = Vi, then there is essentially a single variance component and o\ and o\ cannot possibly be estimated separately. Thus, the issue of identifiability (see Definition 4.1.14) has to be addressed first. Further, we have to examine which linear functions of 0 can be unbiasedly estimated by a quadratic estimator that may also be translation invariant. Some characterizations in this context are given below. Proposition 8.3.1 Consider the estimation of p'O under the variance components model (8.3.2), and let the matrices F = {(fij)), G = {(gij)) and H = ((/iy)) be defined as fitj = tviViVj); 9id = fij-triPxViPxVj); fHj = t r ( ( / - P x ) V i ( / - P x ) V j ) . (a) p'O is identifiable if and only if p 6 C(F). (b) There is a quadratic and unbiased estimator of p'O if and only if peC(G). (c) There is a translation invariant, quadratic and unbiased estimator of p'O if and only if p G C(H). Proof. According to the definition of identifiability given in page 99, p'O is identifiable if and only if p'Qx / p'O2 implies that V(0i) - V(02) ^ 0, where 0\ and 02 are any two plausible values of 0. Another way of writing this condition is
\\V{01)-V{02)\\2F
= Q =>
p'(91-e2)=0.
The above condition is equivalent to (0i - 02)'F(0l - 6>2) = 0
=»
p ' ( 0 i - 0 2 ) = O,
which simplifies to FO = 0 =» p'O = 0. The statement of part (a) follows immediately. In order to prove part (b), let y'Qy be an unbiased estimator of p'O, and assume without loss of generality that Q is symmetric. Thei
we must have E(y'Qy)
= p'X'QXp
+ £*=i afti(QVi)
= p'O fo
8.3 Mixed effects and variance components
335
all appropriate /3 and 9. This leads to a pair of necessary conditions: (i) P'X'QX'P = 0 for all /3 such that (/ - P y(fl) ) X/3 = (/ - P y(fl) ) y and (ii) £i==i of tr(QVi) = p'0 for all nonnegative of, ... o\. The first condition essentially means that without loss of generality we can assume X'QX = 0 (see Exercise 8.15), which implies that Q must be of the form T — PXTPX where T is another symmetric matrix (see Exercise 2.29). The second condition is tr(QVj) = pi for i = 1,... , k, pi being the ith component of p. A consequence of the two conditions is /tr((T - P ^ T P - j V i h /(vec(V!) P= ! =
\tv((T-PxTPx)Vk)J
vec(PxVXPX))'\ i vec(T).
\(vec(Vk)-vec(PxVkPx))'J
If we denote the matrix in the last expression by A, the above condition implies that p <E C(A) = C(AA') = C{G). Conversely, if p e C(G), then we can write p as At for some vector t. Let T be a square matrix such that vec(T) = t, and Ts be its symmetrized version, given by (T+T')/2. Let us define Q = Ts — P TSPX. It can be verified that y'Qy is an unbiased estimator of p'd. In order to prove part (c), note that a translation invariant estimator must depend on y through (I — Px)y (see Proposition 8.2.4). The result follows by applying part (b) to the model ((/ — Py)y,O,(I — PX)V{O)(I - Px)) with variance components of (I - PX)V i{I - Px), i = l,...,A;. Remark 8.3.2 It can be shown that quadratic and translation invariant 'estimability' of p'Q implies its quadratic 'estimability' (in the sense of the above proposition), which in turn implies its identifiability (Exercise 8.14). The reverse implications are disproved via the counterexamples given in Exercises 8.16 and 17 (see also Rao and Kleffe, 1988). This is in contrast with the linear parametric functions of the 'fixed effects', for which the notions of identifiability and (linear) estimability coincide (see Proposition 4.1.15). We shall henceforth assume the quadratic and translation invariant estimability of the parameters of interest, without explicitly mentioning this assumption.
336
Chapter 8 : Misspecified or Unknown Dispersion
8.3.2
ML and REML methods
Consider the ML method described in Section 8.2.2 in the special case of the variance components model of (8.3.2), where 9 = (of : : o\)''. We assume that the random effects have independent normal distributions, and provide the MLEs of 9 and X/3 in the next proposition. Proposition 8.3.3 Under the above set-up, the MLEs of 9 = (aj : : a^Y and X/3, satisfy the equations to(V-(dML)Vi)
=
X0ML =
{y-XpML)'V-{dML)ViV-{8ML){y-XflML), i = l,...,k, [I-V(dML)(I-Px){(I-Px)V(dML)(I-Px)}-(I-Px)]y,
provided that the MLEs of a\,... ,af. are greater than 0. The above equations do not depend on the choice of the g-inverse ofV(B). Proof. It can be shown that as long as of > 0, i = 1,..., k, C(V{9)) = C{VX : V2 Vk) (see Exercise 8.18). Therefore, C(V(0)) and p(V(0)) do not depend on 9. We first prove the proposition in the special case where V(9) has full rank, and then generalize it to the possibly singular case. When V(9) is nonsingular, it follows from the discussion of Section 8.2.2 that X/3ML is as described above, and 9ML minimizes
log W*)l +
(v-XPML)'V-\O){y-X0ML).
We now obtain the derivative of the two terms with respect to of. We write V{9) as /
V(9) = YJ^Vj= j=l
\
j^^Vj
+ o-lV, +^Vl = A + 4>Vl,
j=\
where o~\ is a fixed positive number smaller than a\ and $ = a\ — a\. If CC is a rank-factorization of A and Vj is expressed as CLL'C', then |V(0)| = \A + <j>Vi\ = \C{I + 4>LL')C'\ = \C'C\ \I +
8.3 Mixed effects and variance components
337
If A i , . . . , An are the eigenvalues LL' (including possible multiplicities and zero eigenvalues), then |V(0)| == \C'C\ Il"=i(l + 4>>"j)- Hence,
£Jo^M =
mo^m _ £ JL log(1 + ^
= ibTTTr
= t*[{I +
= tr[(CC" + 4>CLL'C')-lCLL'C1} = tr[(A + ^Vl)-1Vi) = tr{V-l(e)Vi}. On the other hand, differentiating both sides of the matrix identity I = V{0)V~l{d) with respect to of, we have 0
- ^[V{e)v-\e)] = ViV-\o) + v(9)-^v-\e).
Hence, ^V^iO)
= -V-l{e)ViV~l{0).
Using these derivatives in
the defining expression of OML , we have the estimating equations tr(V-l(9ML)Vi) =
(y-XPML)lV-1(dML)VlV-l(dML)(y-X0ML)
for i — 1,..., k. Now we allow V(0) to be singular. Let UD(0)U' be a spectral decomposition of V(6), where D(0) is a positive definite diagonal matrix. Note that U cannot depend on 6 because UU' is the orthogonal projection matrix for C(V(6)), which does not depend on 0. Further, [UD^2(d)}[UD^2{G)]' is a rank-factorization of V{0), and a choice of V~{6) is UD{O)-lU'. Therefore, it follows from the discussion of Section 8.2.2 that the MLE of 6 is obtained by minimizing log \D(0)\ + (U'y - U'XPMjD-^OWy
- U'XPML)
with respect to 8. This is essentially the case of a variance components model {U'y, U'X(3,D(d)) with the nonsingular dispersion matrix D{6) = Yli=i (TiU'ViU. Therefore, whenever the MLE's of a2,...,a\
338
Chapter 8 : Misspecified or Unknown Dispersion
are positive, these satisfy the simultaneous equations
ti{D-\d)U'ViU)
= (U'y - irXfiMLYD-HeWViUD-^BKU'v
- U'X0ML), 9 — 1
Ic
I — 1, . . . , K .
The equations given in the statement of the proposition follow from the
facts that ti{D-l{O)U'VtU) = tr{UD-l{0)U'Vi) UD~l{0)U' is a choice of V~{0).
and that the matrix
In order to show that the equations do not depend on the choice of the g-inverse of V(0), note that tr(V-(0)Vi) = tT(F'iV'(0)Fi), where FiF[ is a rank factorization of Vj. Since C(Fi) = C(Vj) C C(V(0)), the matrix F^V' {0)Fj^does not depend on the choice of the g-inverse of V(0). Also, since X/3ML is the BLUE of Xfi in the model (y^X/3, V(0)) with 0 replaced by its MLE, the residual vector y — X(3ML belongs to C(V(0)) (see (7.3.5)). This fact implies that the right hand side of each equation given in the statement of the proposition is insensitive to the choice of the g-inverse of V(0). Q Remark 8.3.4 An alternative form of the estimating equations of 0ML given in Proposition 8.3.3 is k
Y, ^ti(V-(e)ViV-(0)Vj)
= a(OyVia(9),
i=
l,...,k,
where
a(0) = (/ - PX){(I - Px)V(0)(I - Px)}~(I - Px)y. This is a consequence of (7.3.1) and the fact that Vj = VjV~(0) V(0) = J2j=i a]ViV~(0)VjThis representation lends itself to recursive computation of the MLE of 6. The current iterates can be used to compute V(0), while the next iterates are obtained by solving the system of linear equations in af,..., a\. d Remark 8.3.5 If V(0) is nonsingular, the equations of Proposition 8.3.3 can be further simplified as (see Exercise 8.19) k
YJ°2MV-l{0)V3V-l{9)Vi)
= e{0)'V-\0)ViV-l{0)e{0),
8.3 Mixed effects and variance components
339
for i = 1,... , k, where e(6>) = [/ - X'(X'V-1(e)X)-X'V-1(0)]y. The computation of the REML estimator can proceed in a similar manner. It follows from the discussion of Section 8.2.3 that the REML estimator of 0 can be viewed as its MLE in a reduced model where y, Xfl and V{0) are replaced by (I - Px)y, 0 and (I - PX)V{O)(I Px), respectively. The corresponding decomposition of the dispersion matrix is (I - PX)V(9)(I - Px) = £ o \ ( I
- Px)Vt(I
-Px).
i=\
Thus, the next proposition follows easily from Remark 8.3.4. Proposition 8.3.6 When the random effects are independent and normally distributed, the REML estimators of a^,... ,cr| satisfy the equations k
£ a2Jtr(W-(6)WiW-(e)WJ)
= b{e)'Wlb{e),
i=
l,...,k,
J=l
where W(0) = (I-PX)V(O)(I-PX); Wt = (I - Px)Vi(I - Px), i = l,...,k; b(0) = W-(9)(I-Px)y. provided that the estimates are positive. The above equations do not depend on the choice of the g-inverse ofW(O). The iterative procedure outlined for the MLE in Remark 8.3.4 can be used to solve the equations for the REML estimator. Example 8.3.7 (Balanced one-way classified random effects model) Consider the model yij = fj. + n + etj,
i = 1,..., t, j = 1,..., m,
340
Chapter 8 : Misspecified or Unknown Dispersion
where the T;'S are the i.i.d. random effects having the N(0, o\) distribution, and the e^'s are the i.i.d. model errors (independent of the r^'s) having the iV(0,CT2) distribution. In this case X = 1, 6 = {a\ : CT2)', and V(6) is a block-diagonal matrix with each m x m diagonal block given by erf 11' + a\l. Specifically, /ll' .0\
Vi=
; \o
-. ...
; h iv)
v2 = i.
V~l{6) is also a block diagonal matrix with each diagonal block given It is easy to see that C{V{G)X) C C{X), so by CT2-2(I - a^m^ll'). the MLE of \i is its LSE, that is, the sample mean of the observations. Therefore, the least squares residuals may be used in the equations of Proposition 8.3.3. The equation for i = 2 amounts to
tv{v-\e)) = \\V'l{e){y-yi)\\2. The two sides of the equation simplify as follows.
ti
[
V 4 + rnal ) \ ~i
tm ( a2
\
a\ a2
\
+ mcrl /
V
_ ^
Vj - yd _
i=[
a2
tmjaj + (m - l)a2) _ ZijiVij - Vi? cr2V°'2
+ WCTf)
CT2
(ri+maf J y - y{ °2 +
2
mal
m^M - V? (CT2 + mo\Y
where y is the sample mean of all the observations, y~i is the ith cell mean and yi is the sub-vector of y corresponding to the ith cell. The equation for i = 1, as per Proposition 8.3.3, is t r C V - ^ V i ) = (y - y l J ' V - ^ V i V - ^ X y - yl).
8.3 Mixed effects and variance components
341
After some algebraic manipulations, this equation simplifies to o\ + mo\
(a2 + ma2)2
The two equations lead to the following solution:
°\ = 7 2>-v) -„> \
(8.3.3)
In order to compute the REML estimators we can use the following simplified form of the equations given in Proposition 8.3.6: tr(W-{0)Wi) = b{0)'Wib{0),
i = 1,2.
Note that the right hand sides of the estimating equations of Remark 8.3.4 and Proposition 8.3.6 are identical. In the present case, this has already been simplified for the computation of the MLE, for i = 1,2. The matrices on the left hand side simplify as follows:
11^ = ^-in', w2 = i-^-n', w(0) = v-(£ + £)u'. t
tm yt tm I Since I'V'1! = (a2/t + aj/tm)-1, it follows that a choice of W~{9) is V~l{6). Further algebraic manipulations lead to the following form of the two equations. (tm - l)g| + tm(m - l)a2
aliol + mol)
EijiVij ~ Vi)2
~
(t-l)m a2 + me2
a\
m EM - V?
(a% + ma2)2'
m2Ei(yi-y)2 (p\ + m®2)2
=
The resulting estimators are
1
31 =
l
Km-^T) P ^ - W ) 2 -
(8.3.4)
342
Chapter 8 : Misspecified or Unknown Dispersion
It is interesting to note that the ML and REML estimators of o\ coincide, while the REML estimator of o\ is larger than the corresponding MLE. We shall show later (see page 345) that the REML estimators of the two parameters are unbiased. These also happen to have the minimum variance among all unbiased estimators (see page 352). The solution for u\ obtained from the ML or REML estimating equations may turn out to be negative. If this happens to be the case, the MLE of o\ should be 0. Thus, one has to ignore the presence of the first component of the variance. The MLE of o\ under this revised model
is ^ EijiVij ~y)2, and the REML estimator is j ^ - EijiVij ~v)2-
D
Apart from the iterative method mentioned in this section, one can try and find the ML or REML estimators using other iterative methods such as Newton-Raphson, steepest descent/ascent, scoring and the EM algorithm. See Rao (1997) for some details on these algorithms in the context of variance components estimation. We now turn our attention to some methods which deal specifically with the variance components model.
8.3.3
ANOVA
methods
Two major problems of the ML and REML methods are the need of an iterative algorithm to solve them and the possibility of negative variance estimates. Faster computers have alleviated the first problem to some extent. Yet, the quality of the solutions of iterative algorithms often depend on the quality of the initial values. In this section we consider some simple estimators which are not only useful as initial values, but also quite meaningful in a number of special cases. The ANOVA methods try to exploit the fact that the linear zero functions do not depend on the fixed effects parameters, and thereby carry information about the random effects or variance components. Suppose that we have a set of quadratic forms of the LZFs, qi = y'(I — Px)Qi(I — Px)y, i = 1, , k, where the Q^s are known non-random matrices. Note that for i = 1 , . . . , k
E(qi) =
EitviQiil-P^yy'il-Px)))
8.3 Mixed effects and variance components =
343
tv(QtE((i-Px)yy'(i-Px)))
= triQiil-P^VWil-P^ + Qiil-P^XPP'X'il-Px)) j=i
j=i
where Q{ = (I—P^Q^I—Px). Thus, each ^ is a linear function of o\,..., a\. Consider the system of equations nAQxVx)
MQ^U
la\\
\ WQ.V,)
; tr(QkVk)J
h Wj
(y'Qxy\ =
i \y'QkyJ
(8-3.5)
Thus, if the weight matrices are chosen suitably, then the above matrix would be invertible, and a unique set of solutions to the above equations would exist. It is also easy to see that the resulting estimators of a\,..., crjjl are unbiased. Being functions of LZFs, these are also translation invariant. A general problem with the above estimator is that it is not clear how one should select the matrices Q1,..., Qk. Depending on the choice of these matrices, one might obtain several versions of ANOVA estimators of the variance components. Some specific applications provide intuitively meaningful choices of the matrices. In the case of balanced data, the ANOVA estimators are often found to have the minimum variance among all unbiased estimators that are quadratic functions of the response. Henderson proposed a series of methods in the early 1950's, which remained quite popular for the next few decades. These are basically ANOVA methods with various choices of the quadratic functions. Example 8.3.8 model
(Henderson's Method III) Consider the mixed linear
1=1
where Uk = I. The model can alternatively be written as (8.3.2) with Vk = I. Let P — p 0
lr(X:Uv.-:Uk-ly
344
Chapter 8 : Misspecified Qi
=
Qk
=
po
or Unknown
~ I\X:Ui:...:Ui_1:Ui+i:...:Uk_1y
I-P
0
Dispersion
»= l,...,fc-l;
.
Since C(X : Ux : : Uk-i)1 Q C{X)L, we have Qk = Qk. Likewise, Qi = Qi for i = 1,..., k — 1. The quadratic form y'Qky can be interpreted as the error sum of squares in the model where 7fc is the vector of uncorrelated errors and the remaining random effects are assumed to be fixed effects. The quadratic form y'Qiy can be seen as the sum of squares due to deviation from the hypothesis of no significant effect of Ui (with ryi treated as a fixed effect). Henderson's Method III consists of setting these sums of squares to their respective expected values under the mixed effects model, and solving for the parameters 9
9
In this special case of the ANOVA method, the coefficient matrix of equation (8.3.5) reduces to /tr(Q!Vi) 0
V
0 0
0 tr(Q 2 V 2 ) 0 0
0 0 tv(Qk^Vk^) 0
tr(QJ \ tr(Q 2 ) tr(Qfc_!) tr(Qfc) /
Consequently the explicit solutions to the simultaneous equations are
ff2
°k
=
y'QkV tiQk
=
y'(i-P0)y tr(I-P0) "
The best aspect of this method is its computational simplicity. The interpretability of the quadratic forms has also given the method an oblique justification. While the estimators are unbiased, they are not known to have any optimal property in general. Example 8.3.9 (Balanced one-way classified random effects model, continued) In the case of the model of Example 8.3.7, Henderson's Method III estimators simplify further. In fact, the estimators of o\ and
8.3 Mixed effects and variance components
345
d\ coincide with the corresponding REML estimator given in (8.3.4). Since the ANOVA estimators are generally unbiased, the REML estimators are unbiased in this case. D The ANOVA-type methods have the advantage that these work without any distributional assumption. In the case of some balanced designs, some natural quadratic forms an be found (see Hocking, 1996). One may also consider using more than k quadratic forms, and try a least squares fit on the extended system of equations of the form (8.3.5). See Searle et al. (1992) for details. 8.3.4
Minimum norm quadratic unbiased
estimator
Suppose that the 'random effects' 7 1 ,...,7 f c in the mixed effects model (8.3.1) are somehow observed. If this is the case, a 'natural' estimator of af is ||7i|| 2 /^', where di is the dimension of the vector 7j, i = 1, k. A natural estimator of a linear function of the parameters,
£ix^ 2 ,is£Lill7ill 2 P,M. Now suppose the same parameter, J2i=iPiah *s estimated by a quadratic function of the response. As in the case of the ANOVA methods, we shall insist on translation invariance of the estimator. This restricts our choice to quadratic forms in the LZF vector, (I — Px)y. We write the quadratic form as y'(I — PX)Q(I — Px)y or simply y'Qy, where Q is of the form (I — PX)Q(I — Px). We assume without loss of generality that Q and Q are symmetric matrices. The development of the estimator has so far taken place along the lines of the ANOVA method. We now make a crucial choice that would help us select a suitable matrix Q. We rewrite the estimator as
y'Qy = (J2unt) Q (X>7<) = E E ^ Q ^ - , \i=l
/
\i=l
/
1=1.7=1
and try to bring it as close as possible to the 'natural' estimator, described earlier. The difference between the estimators is k
k
k
£ £ ViE7#tfi7i - Xy«(wM)J7i-
346
Chapter 8 : Misspecified or Unknown Dispersion
Suppose that our initial guess of the parameters o\,...,o\
be toi,...,
1 /2
u>k. Then we can write *yi = w^ 6j, i = 1 , . . . , k, where all the components of the vectors e i , . . . , e& have approximately the same variance, provided that the prior guess is not too bad. Using the re-scaled random effects, we can write the difference of the two estimators as e'Ae, where
A=i
(w\l2U\\
(w\l2U\\
*i/2u'2
Q wTv2
w2u'J
f ^ I
_
wj2u>J
o
0
0
T I
o
{o
\
.:. o ^ i )
and e' = (e[ : e'2 : : e'k). In order to ensure that the quadratic form e'Ae is small, we would require that the matrix A be small. This objective can be reached by minimizing a norm of this matrix. A popular norm in this context is the Probenius norm, denoted by || \\p (see page 28). Thus, we have the task of minimizing
i=l
di
i=lj=l
F
The minimization of the above with respect to the matrix Q leads to the estimator y'Qy, which is referred to as the minimum norm quadratic estimator (MINQE). Iterative techniques are usually needed in order to determine this estimator. A detailed description of such techniques may be found in Rao and Kleffe (1988). Note that k
E[y'Qy} = tv[QE(yy')] =
^MUiQUi\.
i=i
Therefore, in order that y'Qy is an unbiased estimator of J2i=i Piah must have tT[UliQUi]=Pi,
i = l,...,k.
we
(8.3.6)
8.3 Mixed effects and variance components
347
Subject to this additional condition, together with the fact ||A||^ = ti(AA'), we have the simplification
\\AfF + E ^ T = E E WiVjWiQUjWF-
(8.3.7)
Therefore, the minimum norm quadratic unbiased estimator (MINQUE) is y'Qy such that Q minimizes the right hand side of (8.3.7) under the constraint (8.3.6). In a remarkable work, Mitra (1971) shows that the problem of finding the MINQUE can be reformulated as that of obtaining the BLUE of an estimable LPF, assuming that W{ = 1 for i = 1,..., k and D(y) is positive definite. We now provide a set of estimating equations for the MINQUE, derived by using an extension of Mitra's argument that does not require these assumptions. The estimating equations are like normal equations, and give rise to closed form solutions. In the following proposition 6 and V(0) are as in (8.3.2), p — (p\ : : pk)', and w = (wi : : Wk)' represents the 'guessed value' of 0. Proposition 8.3.10 In the above set-up letp'0 be estimable through a translation invariant, quadratic and unbiased estimator under the model (8.3.1), and C{V{w)) = C(V(0)). Then the unique MINQUE ofp'9, which minimizes (8.3.7) subject to the constraint (8.3.6), is p'0, where 6 is any solution to the set of equations k
E cr]tv(W-(w)WiW-(w)Wj)
= b(w)'Wib(w),
i=
l,...,k,
i=i
where W{-), b(-) and W\,..., Wk are as defined in Proposition 8.3.6.
Proof. Let GG' be a rank-factorization of W(w), and F be a matrix such that Pw, the orthogonal projection matrix for C(W(w)), can be written as GF. Since (I - Px)y e C(W{w)) with probability 1 (see Proposition 3.1.1), we can rewrite the quadratic form y'Qy as y'Qy = y'(I - PX)PWQPW{I
- Px)y = z'Cz,
348
Chapter 8 : MisspeciGed or Unknown Dispersion
where C = G'QG and z = F(I - Px)y. Further, let c = vec(C) and t = vec(zz')- Then y'Qy = z'Cz = tr(Czz') = c't,
(8.3.8)
where we have used the fact that C is a symmetric matrix. Note that E(z) = 0 and k
E(zz') = D{z) = F(I - Px)V(0)(I - PX)F =
^a?FWiF'. i=i
Therefore,
E(t) = Yt°ki = Xt0,
(8.3.9)
2=1
where & = v e c ( F ^ F ' ) , i = 1,..., k, and Xt = fo : : ^). We now turn to the quantity that a MINQUE is supposed to minimize. Indeed, the right hand side of (8.3.7) simplifies as follows k
k
k
k
Y^Y.WiWiWU'iQUjtfp = 1=1.7 = 1
'£YiwiwMUiQUjU'jQUi)
i=lj=l
k
k
i=lj=l
=
tr(QW(w)'QW{w))
= tr(QGG'QGG') = ^{G'QGG'QG) = tr(CC) = c'c (8.3.10) From (8.3.8), (8.3.9) and (8.3.10) we conclude that c't is a MINQUE of p'G (where E(t) = Xt0) if it is an unbiased estimator of the latter and c'c has the smallest possible value. Therefore, the problem of finding the MINQUE is computationally equivalent to finding the BLUE of p'd from the linear model (t,Xt0,I). It is well-known that such an estimator exists if and only if p G C(X't). Since E(t) in this model is the same as that obtained from (8.3.1), and translation invariance is ensured by the construction of the model, we conclude that the assumptions of
8.3 Mixed effects and variance components
349
the proposition imply p € C{X[). Itfollowsfrom Propositions 4.3.5 and 4.3.9 (Gauss-Markov Theorem) that the MINQUE is unique and is given by p'6 where 6 is any solution to the normal equation X'tXt6 = X'tt. In order to complete the proof, we only have to show that the normal equation simplifies to the set of equations described in the statement of the proposition. Indeed, the equation simplifies to k
£*&ffi=#>
» = i,...,fc.
Further, we have £'£.
= vec(FWiF'yvec(FWjF') = = =
=
ti{FW\F'FWjF')
tT(F'FWiF'FWj) ti{W-(w)PwWiPwW-(xv)PwWjPw) tT{W-(w)WiW-(w)Wj),
where we have made use of the fact F'F = F'PG,F = F'G'iGG'YGF
= PwW~{w)Pw
(as G has full column rank). Likewise, £t
= tviFW.F'Fil - Px)yy'(I - PX)F'} = y'{I - Px)F'FWiF'F(I - Px)y = y'{I-Px)W'{w)WiW~{w){I-Px)y
= b{w)'Wib{w).
This completes the proof. Remark 8.3.11 The model (£, Xt0,1) used in the proof of Proposition 8.3.10 implies that D(t) = / , which is different from the dispersion computed (8.3.1). However, this discrepancy does not come in the way of the main argument of the proof. Also, the solution to the normal equations may be such that the matrix C (obtained from c) is not symmetric. It can then be replaced by (C + C')/2 without altering the MINQUE, which is unique. The corresponding choice of Q is nonunique.
350
Chapter 8 : Misspecified or Unknown Dispersion
Example 8.3.12 (Balanced one-way classified random effects model, continued) Consider the model of Example 8.3.7. Let the initial estimators of o\ and o\ be w\ and w2, respectively. Using the forms of W\, W2 and W~{6) given in Example 8.3.7, we have after some simplification tr(W-{w)W1W-(w)W1)
=
. m ^ ~ X\. (wi + mw\Y
tr(W-(w)W1W-(w)W2)
=
." ^ " ^ (W2 + mwi)2
tr(W-(w)W2W-(w)W2)
=
(*-!) (102 + mwiY
+
(m-1)^ W2
on the other hand, using the calculations of Example 8.3.7, we readily have
b[w)'wlb{w)
- " 2£ t ( *"ff,
b{wyw2b{w)
= ^g^-g)2
+
^-^-^2.
Thus, the estimating equations of Proposition 8.3.10 simplify to
(102 + mw\Y TTrlmCTi + ( T o ) H
(w2+mwiY
(mwi + W2) o
wZ
(To
—
—:
rrr
(mwi+W2Y |
Ei,j(Vij ~ Vif W2
Solving these equations, we conclude that the MINQUE of o\ and o\ are their respective (normal) REML estimators, irrespective of w\ and w2. ID Proposition 8.3.10 has several interesting consequences. First, by comparing the estimating equations of the MINQUE and the REML
8.3 Mixed effects and variance components
351
estimator (see Proposition 8.3.6), we find that if the weights w\,... ,Wk are accurate guesses of the parameters, then these two estimators coincide. More importantly, if the MINQUE of af,... ,cr| exist and these are used as weights in a second stage of MINQUE, and this procedure is repeated, then this iterative procedure (referred to as I-MINQUE) coincides with the iterative procedure for finding REML, described in Section 8.3.2. Thus, even when the errors are non-normal, the REML estimator can be interpreted as the I-MINQUE estimator. On the other hand, the MINQUE estimator can be thought of as a single step from the initially guessed values towards the (normal) REML estimator. The MINQE and MINQUE estimators generally depend on the initial guesses. If no prior information on the parameters are available, these may be chosen as equal numbers. 8.3.5
Best quadratic unbiased
estimator
Consider a translation invariant and quadratic function of the response, y'Qy, which is an unbiased estimator of a linear function of the parameters, J2i=iPiai- An estimator of this kind is called the minimum variance quadratic unbiased estimator (MIVQUE), or the best quadratic unbiased estimator (BQUE), if it has smaller variance than all other translation invariant, quadratic and unbiased estimators. Since the variance of a quadratic function of the response involves the third and fourth moments, a distributional assumption is needed to find the MIVQUE. We assume that the distribution is multivariate normal. Note that this assumption was made in the cases of the REML and ML estimators, but not in the case of the MINQUE and ANOVA estimators. Under the assumption of normality, it can be shown that Var(y'Qy) = 2tr[(QV(0))2].
(8.3.11)
In order to find the MIVQUE, one has to minimize this quantity with respect to Q subject to the condition (8.3.6) for unbiasedness. Also, Q has to be of the form (/ - PX)Q(I - Px) in order to ensure translation invariance. It can be shown that if the MIVQUE exists, then it must satisfy the condition of Proposition 8.3.10 with w replaced by the 'true value' of 9 (Exercise 8.21). In general the solution to these estimating
352
Chapter 8 : Misspecified or Unknown Dispersion
equations is a function of the unknown parameters. Thus, the biggest drawback of the MIVQUE is that often it does not exist. We may find an approximation of the MIVQUE by using an approximation of V{6). If the approximation is of the form V(w), then the resulting estimator is identical with the MINQUE (Exercise 8.21). Thus, any MINQUE estimator can be interpreted as an approximation of the MIVQUE in the normal case. Any attempt to improve upon this estimator by recursively updating the estimated dispersion matrix would only lead us to the normal REML estimator. If V(6) in (8.3.11) is a function of y — as is expected in a recursive procedure — then the resulting matrix Q also depends on y in general. Thus, the REML estimator obtained in the end of the recursive procedure is not necessarily a quadratic function of y, let alone being the MIVQUE. However, in some special cases the normal MIVQUE can be determined. Since the MINQUE minimizes (8.3.11) for $ = iw, we can conclude that whenever the MINQUE does not depend on w it minimizes (8.3.11) uniformly over all 0. In such a case, the MINQUE must be the same as the normal MIVQUE. Example 8.3.13 (Balanced one-way classified random effects model, continued) Consider the model of Example 8.3.7. It was shown in Example 8.3.12 that the MINQUE of o\ and o\ do not depend on w\ and W2 and are equal to the corresponding REML estimators. Thus, the MIVQUE of o\ and of are identical to the REML estimators given in (8.3.4). This coincidence shows that the REML estimators in this case are not only unbiased but these also have the minimum variance among all quadratic and unbiased estimators.
8.3.6
Further inference in the mixed model*
If the parameter 6 is known, the best linear unbiased predictor of the random effects of the model (8.3.1) can be obtained from Proposition 7.13.1. Specifically, the BLUPs are given by
li=
e,
(8.3.12)
8.4 Other special cases with correlated error
353
where e is the residual vector given in page 255, with V = Sj=i a]UjU'j (Exercise 8.23). The predicted value oip'fi + q'-f is p'/3 + q!;y where p'/3 is the BLUE of p'/3 under the mixed effects model and 7 is as described above. Rao and Kleffe (1988) give a computational method for simultaneously obtaining the BLUE of estimable LPFs of the fixed effects and the BLUPs of the random effects. This approach accommodates the possible rank deficiency of the matrices in the model, and is similar to the inverse partitioned matrix method for the fixed effects case (see Section 7.7.2). The main difficulty in the problem of prediction is that the parameter 6 is in general not known. One can plug in estimators of these in the above expression. The resulting predictors would no longer be the 'best' predictor in any sense. As in the case of the two-stage estimators of the fixed effects, such two-stage predictors of the random effects can be shown to have some reasonable properties (see for instance Toyooka, 1982). The problem of testing hypotheses on variance components in the mixed linear model is not an easy one. Some illustrations of the difficulty of the general problem and tractable solutions for some special cases are given by Khuri et al. (1998).
8.4 8.4.1
Other special cases with correlated error Serially correlated
observations
When the observations of the linear model are recorded serially in time, it is often found that there is a correlation among successive observations. This phenomenon, known as serial correlation, has been studied extensively in Econometrics. One way of dealing with serial correlation is to include some lagged (past) values of the response in the list of explanatory variables (see, for instance, Anderson, 1971, p.183). Such a model is referred to as a dynamic model in the econometric literature. Dynamic models are beyond the scope of the present discussion. Estimation procedures for dynamic models can be found in standard econometric texts, such as
354
Chapter 8 : Misspecified or Unknown Dispersion
Davidson and MacKinnon (1993). Another popular model for serially correlated data is yi
=
x'i/3 + ei, p
ei
=
q
X^ fyfo-i + & + 5Z Q$i-ji 3=1
3=1
where
where 0 -
(a2 : fa :
: >p : 0
:
: 6q)'.
above model for ej, is the same as the autoregressive moving average model of order p, q or ARMA(p, q) of (1.5.2). It is assumed that the dispersion matrix V(0) is positive definite. It follows from the discussion of Section 8.2.2 that the MLE of the ARMA parameters (>i,... ,(j)p and 0\,..., 6q, under the assumption of normal distribution, are obtained by minimizing
[log|V(0)| + nlog{(y - XpML)'V-\0)(y
- XpML)}] |
2=i,
" (8.4.1) while the MLE of a2 is a2 = n-lR2Q($i
:&
,
where all the estimators in the right hand side of the last expression are MLEs. The special case of these estimators for p = 1 and 9 = 0 was derived in Example 8.2.3. Minimizing (8.4.1) is not an easy task. A complication arises because of some constraints on the parameter space which ensure that e i , . . . , e n are second order stationary time series. Harvey and Phillips (1979) have suggested a computational procedure for finding the MLEs of all the parameters based on a state-space representation of the ARMA model and the Kalman filter (see Section 9.1.6 and Exercise 9.7). Zinde-Walsh and Galbraith (1991) show that the MLE can be approximated by a class of two-stage estimators up to a reasonable degree of accuracy. The common two-stage estimators are similar in spirit to that obtained in Section 8.2.4, and are generally translation invariant.
The
8.4 Other special cases with correlated error
355
The case of q = 0 corresponds to an AR(p) model of e^. There are several possibilities of approximating the MLE in this special case. For instance, we can drop the first p observations and work with the model p
/
v
\
V i - Y L faVi-i ~ & \Xi ~ S 3xi-3 ) + S*> * = P + 1,
> (8.4.2)
which has uncorrelated errors with variance a2. This model can also be written as p
yl-@'xi
= Y^4>j{yi-i
-P'xi-j)
+ Si,
i=p+l,...,n,
(8.4.3)
i=i
The representation (8.4.2) may be used to estimate the parameters f3 for given values of the AR parameters, while (8.4.3) may be used to estimate the AR parameters for given /3, using any one of the standard methods (see Brockwell and Davis, 2002). One can use an initial stage of least squares (using (8.4.2) with all AR parameters set to 0), estimate the AR parameters from (8.4.3), and revise the estimate of (3 by using (8.4.2) once again. One can repeat the procedure for further improvement of the estimator, but the convergence of these iterations is not assured in general. The case of p — 1 is the most common one. A wide range of solutions is available in this case. One can perform a grid search on the single AR parameter, 0, to minimize the objective function given in Example 8.2.3. This would produce the exact MLE. One can also follow the back-andforth scheme for the AR(p) model given above, which simplifies to some extent for p — 1. Specifically, the minimizer of J2^j=2^1 m (8.4.3) has the explicit form 7 _ z2i=2 eie 2^=1 ei where ej is the residual of the least squares analysis of (8.4.2) based on the previous estimate of >. It is not uncommon to have (/> very close to 1. In such a case, an extremely simple strategy that works quite well is to assume <> / = 1. The implication of this assumption is that one can use the least squares
356
Chapter 8 : Misspecified or Unknown Dispersion
method on differences of successive observations. It can be shown that this estimator has good efficiency compared to the BLUE for known >. An illustration of this fact is given in Exercise 8.24. Bailie (1979) finds a decomposition of the asymptotic mean squared prediction error when estimated AR parameters are used in the linear model for the purpose of prediction. This result indicates that the effect of using estimated AR parameters (instead of their 'true' values) decrease inversely with the sample size.
8.4.2
Models for spatial data
When an observation consists of attributes of a particular location, correlation among neighbouring entities is expected. A model that takes into account such correlation is
y = X0 + e,
e = aWe + S, E{5) = 0,
D{8) = a2l,
where a is an unspecified constant and W is a known matrix whose elements represent the degree of association among pairs of locations. Usually W is chosen to have zero diagonal elements. The above model can
be written as (y,X/3,a2V(a))
where V(a) =
[{I-aW)(I-aW')]'1.
The parameter takes values in a right-neighbourhood of zero such that V(a) is positive definite. Kramer and Donninger (1987) show that the LSE, which is the BLUE in the case a = 0, may be grossly inefficient. Ord (1975) proposes an iterative technique for obtaining the MLEs of/3 and a under the assumption of the normal distribution. Several other ad hoc models for the spatial correlation of errors in the linear model have appeared in the literature. Another approach of dealing with spatial correlation has gained popularity over the last two decades, particularly in the area of geostatistics. According to this approach, the response at various locations are viewed as samples of a stochastic process defined over a suitable space. The location of an observational point in this space is described by the vector u. Thus, the ith component of the response vector y is a sample of the process y(u) at the location u = u^. The ith row of X can also be viewed as the value of a function x(u) at the location u = u\. Even though x(u) may itself be a stochastic process, it may be treated as a
8.4 Other special cases with correlated error
357
nonrandom quantity by conditioning y(u) on it. It is assumed that E{y{u)\x{-)) = 0'x(u);
Cov((y(u),y(v))\x(-)) = g(u,v),
where g is an unknown function. This leads us back to the linear model withV = ((g(ul,uj))). Some kind of assumption of stationarity of the stochastic process is needed so that inference on /3 and g is possible. We assume the wide sense stationarity of y(u) —E(y(u)\x(-)), which means that g{u,v) can be written as a function of u — v. With a minor abuse of notation we shall write it as g(u — v). Note that g(u — v) = g(v — u). Under the above assumptions, Var{y(u) — y{u+h)) = 2g(0) — 2g(h). This function is known as the variogram in geostatistical literature, while the half of this function is called the semivariogram. The latter function has traditionally been used for inference. The semivariogram can be estimated nonparametrically. For instance, if the observations are taken on a spatial lattice, then a natural estimator of the semivariogram is n{h)
^ y E M m ) - P'x(m)) - (y(m + h ) - P'x(Ul + h))]2, where n(h) is the number of pairs of observations which are h distance apart. Since /3 is not known, an estimator of it may be used in the above expression. This estimator of the semivariogram may have to be smoothed or locally averaged. Once the semivariogram is estimated, the matrix V can be estimated from it, and the parameter f3 recalculated. Iterations of this scheme is possible, although the convergence of such iterations is not always guaranteed. See Cressie (1993) for a description of the nonparametric methods for estimating the semivariogram. The semivariogram can also be estimated parametrically. Several parametric models of this function can be found in Olea (1999). The parameters of these models may be estimated by the ML or REML methods, or by a least squares approach which seeks to minimize the distance between the parametric function with a nonparametric estimator. A particular parametric model has generated a lot of interest
358
Chapter 8 : Misspecified or Unknown Dispersion
among statisticians. The model, written in terms of the covariance function g, is
g(h) = Y,vbi(h), where gi , g^ ) are known functions and o\,..., Ok are unspecified parameters. This clearly leads to the variance components model (8.3.2), with Vi = ((<7i(uj — Uj))). Therefore, all the methods of variance components estimation are applicable here. Another class of parametric models that have been used in the context of data on a spatial lattice is that of ARMA models. The methods for linear models with ARMA errors can be used here. The problem of prediction in the case of spatial data is known as kriging. The general theory of BLUP given in Section 7.13 is applicable. If the linear model for the covariance function is used, then the BLUP is given by (8.3.12). Further details on kriging and parametric estimation of semivariogram may be found in Christensen (1991). Zimmerman and Cressie (1992) examine the performance of the predictor obtained by replacing the unknown parameters involved in the BLUP by their respective estimators. Their results suggest that the estimated mean squared prediction error of these predictors may be more reliable when the spatial correlation is stronger.
8.5
Special cases with uncorrelated error
Even if the model errors are uncorrelated, the least squares method may be inadequate because of unequal variances of the errors. The latter phenomenon is referred to as heteroscedasticity. Heteroscedastic data may arise in various contexts, some of which are considered here.
8.5.1
Combining experiments:
meta-analysis
Often one is faced with the task of combining information from various sources. The quality of data available from these sources may not be uniform. Sometimes the data are only available in a summarized form. The challenge of meta-analysis is to make improved inference (in
8.5 Special cases with uncorrelated error
359
comparison to what can be done with data taken from any single source) by judicious use of whatever information is available. In the context of linear models, the data from the various sources may carry information on a common set of fixed effects, but may have different levels of the model error. A simple model for this situation is
y3 = XjP + ej, E(e3) = 0, D{e3) = a]l, Cov^e-)
= 0,
for i, j = 1,..., m, i y£ j . The m individual models (yj,Xj(3, cr|j), j = 1,..., m, can be represented by a single combined model (y, X/3, V(6)), where
(Xl\
(v\ \ y2 y=
'
X
=
\yml and 6 = (o\ :
X2 :
f**1 '
V
^ =
\Xm)
° all :
0 :
Vo
"
° \
,
0 :
o
'
a^ij
: cr^)'. Further, we can decompose V(0) as m
where Vj is a block diagonal matrix with / at the jth diagonal block and zero everywhere else. Therefore, this model is a special case of the variance components model, and the methods discussed in the previous section are directly applicable here. The normal MLEs given in Section 8.2.2 satisfy the following simplified equations: / m
XjP
= XjlT^a^X'iXij
\-/m
\
( $ X 2 * ' » V i ) , j = l,...,m(8.5.1)
o) = n" 1 || y3 - Xrf ||2, j = 1,..., m,
(8.5.2)
where rij is the number of elements of y.-. The above equations lead to a natural way of obtaining the MLE: by iterating back and forth between the estimates of Xj/3's and crj's. The least squares estimators of the
360
Chapter 8 : Misspecified or Unknown Dispersion
Xj/3's may be used as the initial iterate. The resulting estimators of the cr|'s are obviously nonnegative. Even if the normal distribution is not appropriate for the response, equations (8.5.1-8.5.2) form the basis of several reasonable estimators. Fuller and Rao (1978) consider a two-stage estimator which is similar to the second iterate of the above iterative procedure. If the number of groups (m) is fixed and minj<m rij —> oo, the two-stage estimator of X/3 is as efficient as its BLUE computed from the model with 'known' crj's. Fuller and Rao derived the large sample properties of the estimator as m —> oo, and the n / s form a fixed sequence. Chen and Shao (1993) derive the large-sample properties of the estimators obtained at later stages of the iterations. They showed that the estimator obtained after a finite, though unknown, number of iterations is asymptotically more efficient than the corresponding estimators at earlier stages, and suggested a stopping rule for the iterations. Hooper (1993) suggests an iterative procedure with a modification of (8.5.2) which is based on a Bayesian model for the variances. In the above discussion, we have assumed that the raw data from the various studies are available at the time of the meta-analysis. Sometimes one only has a summary of the information from each study. For instance, one may have the LSE /3(j) = {^'jXj)~lX'jyj, its estimated dispersion, D(J3^) = a?(X'jXj)~1, and the estimated error variance, crj = y'j(I — Px )yj/(rij - k) for j = 1 , . . ., m, assuming that (3 is fully estimable from each study. In such a case we can bypass (8.5.2) and use the available estimates of a\,..., a^ for the computation of (8.5.1). The resulting estimator of f3 is
3= E(£(3c,-)))~
£(£(%))" % ,
(8.5.3)
with estimated dispersion m
D0)=
_x
£(^(i)))~ J=1
(8.5.4)
8.5 Special cases with uncorrelated error
361
Note that the equations (8.5.3) and (8.5.4) describe the BLUE and its dispersion if the true (unknown) values of o\ ..., cr^ are used in the expressions of D0^), j = 1,... ,m (Exercise 8.26). It can be shown that the estimator (8.5.3) with true and estimated values of o\ ..., o^ become distributionally equivalent to one another as ram.j<mnj —> oo. There are several other interesting problems relating to combinations of experiments and meta-analysis, such as the problem of estimation of fixed effects in the presence of nuisance parameters (see Hedayat and Majumdar, 1985, and Liu, 1996), and combination of tests from several studies (see Zhou and Mathew, 1993 and Mathew et al, 1993). These topics will not be dealt with here. 8.5.2
Systematic
heteroscedasticity
Sometimes the variances of the responses are not only unequal, but the variances follow a definite pattern. In the case of time series data, the variance may be a function of time. In other contexts the variance may be found to be a function of the mean response, or a function of one or more of the explanatory variables. Mathematically we can model these three situations as Var(yi\x{) = g(i), Var(yi\xi) — g(x'ifi) and Var(yi\xi) = g(xi), respectively, where JEJ is the ith row of the matrix X and g is usually an unspecified function. There are several graphical methods designed to provide an exploratory assessment of the form of g in the above situations. These methods are typically based on the scaled residuals of a preliminary least squares analysis. One can plot a scaled residual against the index i, against one explanatory variable at a time, or against the fitted value (see Section 9.1.4). The variation in the spread of the residuals is expected to provide a pointer to the form of the function g in the three examples considered above. Very often analysts make a heuristic choice of the function on the basis of one of these plots. Carroll and Ruppert (1988) discuss formal methods of estimating the function g. Once Var(yi\xi) is estimated, these can be treated as known and an appropriate weighted least squares analysis can be carried out in order to estimate /3, which is the parameter of primary interest. Carroll (1982) shows that the cost of not knowing the variance in the second
362
Chapter 8 : Misspecified or Unknown Dispersion
and third examples goes to zero as the sample size increases. Similar conclusions follow from van der Genugten's (1991) work in the case of the first example. These results are based on the first-order properties of the two-stage estimators of /3, and are quite reassuring when one has a lot of data. A model of heteroscedasticity that includes all the examples given above as special cases is
Var{yi\xi) = a2g{zi,p,9), where Z{ is a a known vector (which may include some components of x-j) and 6 is an unspecified vector parameter. We make the restrictive assumption that the function g is structurally known. Once the above parametric model is used, one can use the ML method. Alternatively one can use a two-stage estimator where 9 is estimated on the basis of the least squares residuals, and plug these into the expression of BLUE of /3 for known dispersion matrix. Davidian and Carroll (1987) review some methods of estimating 9 and also consider the secondorder properties of the two-stage estimator. Their findings indicate that although any consistent estimator of 9 used in the second stage of the two-stage method is good enough for large sample sizes, the quality of the estimator of 9 does matter for moderate sample sizes. It appears that further iterations of the two-stage procedure (via back-and-forth estimation of 9 and /3) would improve the efficiency of the estimator of /3, particularly when the variance of yi depends on its mean.
8.6
Some problems of signal processing A classical model of signal processing is v yt = X > t j + et, t = l,...,N.
(8.6.1)
In the above, the response is usually recorded serially in time. The terms in the summation are referred to as signal and the error term as the noise. The signal may have the following form: xtj = ajstj,
t = l,...,N,
j = 1 , . . . ,p.
(8.6.2)
8.6 Some problems of signal processing
363
Here, the s^'s are known. For instance, s n , . . . , S j v i may represent consecutive time samples of a signal emitted by an active sonar or radar, while the y t 's represent the signal received after the emitted signal is reflected from an object of interest. The terms for j = 2 , . . . ,m may represent various lagged versions of a single emitted signal, where the lags represent the delays caused by the signal traversing various paths. This is known as the multipath effect. The unknown a / s represent the decrease in amplitude of the signals (known as attenuation) as it travels from the source to the receiver via the various paths. Another example of the model (8.6.1-8.6.2) is the case of a passive sonar or radar receiver which 'listens' but does not emit any signal. In this case the signals originate from various sources of interest. In a multiple target situation the signal s\j,..., spij can be the engine noise of the jth target, which should be available from a database of signature tunes of commonly used engines. The estimated amplitudes a\,...,ap carry information about the existence and/or distance of the objects of interest from the receiver. The 'noise' in the transmission medium is often correlated in time. Typically a time series model (such as AR(p)) is used for the correlation structure. Thus, the methods of Section 8.4.1 are applicable. An important special case of (8.6.1) is v Vt = Yl ao c o s ( w j * + Oj) + eu
i=
l,...,N.
(The case p = 1 was considered in Exercise 1.1.) If the sinusoidal frequencies U\,...,OJP are known, then the j t h signal can be rewritten as (a,j cosOj) cos(ujjt) — (a,j sinOj) sm(cjjt). Thus, the standard techniques would work for the transformed parameters a,j cos 9j and a,j sin 9j, j = 1,... ,p (in lieu of the original parameters, a,j,6j, j = 1,... ,p. If the frequencies are unknown, the problem becomes much more complicated. Often it is necessary to estimate these frequencies in real-time. A discussion of such estimation procedures may be found in Kay (1988). Sometimes the signal part of (8.6.1) is also random. A narrow-band random signal is of the form x t j = atj cos{ujt),
t = l,...,N,
j = 1 , . . . ,p,
364
Chapter 8 : Misspecified or Unknown Dispersion
where atj, t = 1,..., N, are samples from a distribution for each j . All these signal amplitudes and the noise are uncorrelated. This model can be seen as a special case of the variance components model (8.3.2) when the frequencies u\,..., up are known. A more general version of this problem with complex signals and unknown frequencies is well-known in the signal processing literature. The importance of this problem stems partly from its equivalence with the problem of estimating the direction of arrival of several random signals using measurements from an array of sensors (see Chapters 5, and 16-17 in Bose and Rao, 1993, and Chapters 3 and 7-9 in Haykin, 1991). The following mixed effects model has applications in some signal processing problems. yt = x'tp + z'ti + et,
t=
l,...,N.
Koch (1999) gives an example of physical geodesy where the response represents the gravity at a certain location on the surface of the earth, the fixed effects represent the reference potential and the random effects represent the disturbing potential of the earth's gravity. The important problems in this context are smoothing (getting rid of the noise from the recorded observations) and prediction. The theory of BLUP and the methods of Section 8.3 can be used here. 8.7
Exercises 8.1 Consider the nested classification model with homogeneity within subsamples, given by y^ — M + ai + Pij + eijki k = 1,2,..., riij, j = 1,2,..., qi, i = l,2,...,p, having uncorrelated zero-mean errors with Var(eij) = a^. Show that the LSEs of the estimable LPFs coincide with the corresponding BLUEs. 8.2 If 1 € C{XnXk) and V is a diagonal matrix with n distinct diagonal elements (n > k), show that the LSEs of all the estimable functions in the model (y, X/3, a2V) cannot be BLUE. [Hint: In order that all LSEs are BLUE, the hat matrix must be diagonal, which is impossible in this case.] 8.3 Consider Example 8.1.5 with no interaction between cells. Show
8.7 Exercises
365
that the LSEs of the estimable LPFs would coincide with the corresponding BLUEs only if the a^'s are the same for all i and j . 8.4 Suppose that the dispersion matrix in the model (y, Xfi,a2V) has the special form V = aPx + CC', where C{C) C C(X)L, C'C = I, and a = p{C)/{n - p{X)). (a) Show that if one erroneously uses the model (y,X0,a2I) to compute the BLUE of Xj3 and the usual estimator of its dispersion matrix, no mistake is committed. (b) Does this fact contradict Proposition 8.1.9? (c) Does the 'wrong model' lead to the appropriate error sum of squares? (d) Does the 'wrong model' lead to the appropriate estimate of the dispersion matrix of the residual vector? 8.5 Consider the mixed effects model yi = X(P + rii) + ei,
i=
l,2,...,p,
where /3 is a fixed parameter and e\,..., ep and rji,..., r]p are pairwise uncorrelated, zero mean random vectors with D(ei) = a21 and D(r}{) = Vo, i = 1,... ,p. Show that the BLUE of all estimable functions of /3 coincide with the corresponding LSEs. [This model is used by Chow and Shao (1991) for the analysis of shelf-life of drugs.] 8.6 Consider the linear model y = Xfi + e with zero-mean errors having a spatial correlation structure modelled by the equation e = aWe + S, where W is a known 'weight' matrix with non-negative elements, a is an unknown positive 'correlation parameter' such that a2tv(W'W) < 1 and S is a vector of uncorrelated errors of equal variance [see Section 8.4.2]. Show that the LSE of any estimable function would coincide with its BLUE under the above model whenever C(WX) C C(X) and C{W'X) CC(X).
366
Chapter 8 : MisspeciBed or Unknown Dispersion
8.7 Show that the LSE of an estimable LPF can have zero variance if and only if the column spaces C(X) and C(V) X are not virtually disjoint. Is the variance of a zero-variance LSE overestimated if V is incorrectly assumed to be equal to J? 8.8 Find estimable LPFs for which the upper and lower bounds obtained in Example 8.1.16 are achieved. 8.9 Show that the estimator Xf3pi described in Section 8.2.1 resides almost surely in C(X), provided that C(V) = C(V) with probability 1. 8.10 For the linear model (y,X(3,cr2V) consider the 'averaged' estimator of X/3,
X0pia = E[[I- V(I-Px){(I-Px)V(I~Px)}-(I-Px)]y]
,
where the expectation is with respect to a prior distribution of V such that C(V) is the same for all points in the support of the prior distribution. (a) Show that X(3pia is unbiased for Xf3. (b) Derive an expression for the dispersion of this estimator. (c) Show that X@pia G C(X) almost surely. 8.11 Consider the linear model (y, X/3, a2V) with normal errors where V is possibly singular but completely known. Derive the REML estimator of a 2 , and show that unlike the MLE, this estimator is unbiased. 8.12 Suppose that the elements of the vector By constitute a generating set of linear zero functions for the model (y,X(3, V(9)), and that D(By) according to this model is nonsingular. If the REML estimator of 6 exists, does it coincide with its MLE from the reduced model {By, 0, BV(6)B')1 You may assume that y has a multivariate (possibly singular) normal distribution. Does the MLE depend on the choice of Bl 8.13 If 6 is a scalar (written as 6) and V{6) has full rank, show that the normal MLE of 6 in the linear model (y, Xfi, V{6)) satisfies the estimating equation
tr {v~l{e)§ev {d)) =
<e)'v-lQ)^rv~l<e)>
8.7 Exercises
367
where e{9) = y-
X{X'V-\6)X)-X'V-l{e)y.
Give an estimating equation for the normal REML estimator. 8.14 Show that the condition of part (b) of Proposition 8.3.1 is weaker than that of part (c) but stronger than the condition of part (a). 8.15 If y'Qy is an unbiased estimator oip'O in the variance components model (8.3.2), show that there exists a symmetric matrix Q* such that y'Q*y = y'Qy with probability 1 and X'Q^X = 0. 8.16 Consider the linear model (y,X/3,V(0)) with /I
8.17
8.18 8.19
8.20
0\
(o\
Hi!I
vw "h
0
0
0\
°3 o
\ 0 1/ \ 0 0 0 CT| / Show that there is no translation invariant, quadratic and unbiased estimator of a\ or o\, even though there is a quadratic and unbiased estimator of each of them. Find a quadratic and unbiased estimator of a\. If = XX' and V2 = I in the model (8.3.2) and X is such that tv{XX'XX')tr{Inxn) ^ [tr{XX')]2, show that there is no quadratic and unbiased estimator of a\ even though it is identifiable. Suppose that V(6) is as in (8.3.2) and of > 0, i = 1,... ,k. Show that C(V{0)) = C(Vi : V2 : : Vk). Using the special form of the BLUE for nonsingular dispersion matrix given is Remark 7.3.11, prove the alternative equations for the MLE of 6 given in Remark 8.3.5. State and prove the corresponding result for the REML estimator. Consider the variance components model of the type (8.3.2) with Vk = I- Then the MINQUE corresponding to the choice w = (0 : : 0 : 1)', that is, V(w) = / in the equation of Proposition 8.3.10 is sometimes referred to as MINQUE(O).
368
Chapter 8 : Misspecified or Unknown Dispersion
Derive the expressions for the MINQUE(O) of the two variance parameters in the case of the model of Example 8.3.7. What happens when the model is not balanced, that is, there are m; observations for the zth level of the random effect, i = 1 , . . . , t? 8.21 Suppose that the parameter p'd is estimable through a translation invariant, quadratic and unbiased estimator under the variance components model (8.3.1) with normally distributed random effects. Show that in order that y'Qy is the MIVQUE of p'd, it must be of the form pO = Ylj=i Pja2j where
£o*3tv{W-{6)W%W-{9)Wj)
= b{0)'Wib{0),
i =
l,...,k,
3= 1
8.22
8.23 8.24
8.25
where W(-), b(-) and W\,..., Wk are as defined in Proposition 8.3.6, and 0 is the true value of the parameter. Also, show that an approximation of the MIVQUE, obtained by replacing 0 by a known vector in the objective function of (8.3.11), is a MINQUE. Find the MINQUE and normal MIVQUE estimators of o1 in the fixed effects linear model (y, X/3, a1 V) where V is a known nonnegative definite matrix. Show that the BLUP of the random effects in the mixed effects model (8.3.1) are indeed as described in (8.3.12). Consider the linear model (y,X(3,V(a2,(f))), where the error sequence follows the model of Example 8.2.3. Calculate the minimum efficiency of an LSE from (8.1.2), for (f> = 0.95 and sample size 10. Compare this with the minimum efficiency of an LSE in the transformed model with sample size 9, where the observations (response as well as explanatory variables) consist of differences of the successive observations in the original model. Also compute the bounds on the expected value of the estimated variance of an LSE (given in (8.1.4)) for the two models, assuming that X has rank 2. What can you conclude from these comparisons? A commonly used test for correlation between successive sam-
8.7 Exercises
369
pies of a (stationary) time series e i , . . . , e n is based on the
Durbin- Watson statistic, IJVV
-
VW
-
»=1 ^+l n
~
ei>
2
(a) Show that the statistic DW is bounded between 0 and 4. (b) Suppose that the time series follows the model (1.5.1) with p = 1 (that is, the AR(1) model) with the AR parameter
370
Chapter 8 : MisspeciEed or Unknown Dispersion of the mixed effects model (8.3.1) with / l n 0 Ono\
H::: 1:1 Hi)- k=2>
(
J-noXno
"JniXni
o
"niXni
U n 2 xri2
T
n
n "noxno
n\Xn\
"noxno
Onixni
"712X712 N
'
un2Xri2
-*n2Xn2'
/-0i
0
0
0
^
0
\
and G2, respectively. Consider the following class of quadratic ^ 2 = : : . : \ 0 q(w 0 0, w1,w2)V2n +ni+n 2 / = ow osl + wis\ + w2sl, functions of y based on summary statistic: where 7X represents a baseline effect in the three sets of studies, and 7 2 pertain to the random effects of the various treatmentstudy combinations. Though C/2 is assumed to be known, in practice it is estimated from the respective studies. Consequently it is assumed that o\ = 1. The fixed parameter /?2 represents the differential impact of the second treatment. Let us denote the four groups of observations as Goa, G06, G\
where SQ is the average of the sample variances in Goa and Gob, a nd sf and s 2 a r e t n e sample variances in G\ and G2, respectively. Assuming that the average of iftiS in every group is known, find conditions on the weights wo, w\ and W2 so that 5(^0,^1,^2) is translation invariant and unbiased for a\. Can W2 be equal to u;i?
Chapter 9
Updates in the General Linear Model
Consider the linear model (y,Xf3,a2V) where the parameters /3 and a2 are unknown. The statistical quantities of interest include the best linear unbiased estimators (BLUEs) of the estimable parametric functions, variance-covariance matrices of such estimators, the error sum of squares and the likelihood ratio tests for testable linear hypotheses. In this chapter we are primarily concerned with the changes in these quantities when some observations are included or excluded, as well as when some explanatory variables are included or excluded. The update problem for additional observations is important not only for computational purpose but also for theoretical reasons. A proper understanding of the update mechanism can provide insight into strategies for sequential design. Updating in the case of exclusion of observations has implications in deletion diagnostics. The inclusion and exclusion of explanatory variables are relevant for comparison of various subset models. We shall pursue a number of these applications in some detail. The emphasis of this chapter will be on statistical interpretation and understanding of the mechanism of update. The LZFs will serve as the main tool in the derivation of the updates. This approach is based on the works of Bhimasankaram and Jammalamadaka (1994b) and Jammalamadaka and Sengupta (1999). Expressions in some special cases are obtained by Plackett (1950), Mitra and Bhimasankaram (1971), McGilchrist and Sandland (1979), Haslett (1985) and Bhimasankaram 371
372
Chapter 9 : Updates in the General Linear Model
et al. (1995). The update formulae given in the following sections and in the articles mentioned above are not necessarily the best for the purpose of computation. There is a vast literature on numerically stable methods of recursive estimation in the linear model, see for instance Chambers (1975), Gragg et al. (1979), Kourouklis and Paige (1981) and Farebrother (1988). We use a special notational convention in this chapter. When it is necessary to display the sample size explicitly, we indicate it by a subscript. On the other hand, when the number of parameters has to be displayed, we use subscripts within parentheses. 9.1
Inclusion of observations Let us denote the linear model with n observations by Mn =
(yn,Xn(3,cj2Vn).
In this section we track the transition from M.m = (ym,Xm(3,a2Vm) to M.n for m < n. We refer to M.m as the 'initial' model and M.n as the 'augmented' model. Note that each LZF in the initial model M.m is also an LZF in the augmented model Mn- According to Proposition 7.4.1, the number of nontrivial and uncorrelated LZFs exclusive to the augmented model, which are uncorrelated with the LZFs common to both the models, is [p(Xn : Vn) - p(Xn)] - [p(Xm : Vm) - p(Xm)]. The clue to the update relationships lies in the identification of these LZFs. 9.1.1
A simple case
Let us first consider the case n = m + 1 and Vn = Inxnpartition y m + 1 and X m + 1 as ,.
_
(
Urn \ .
Vm+l - I ,,
)i
We consider two cases: (a) xm+i g C(X'm), that is, p{Xm+i) (b) xm+l e C{X'm), that is, p(Xm+i)
v
(
Xm
-*m+l - I /
\
We
/qi
I
= p(Xm) + 1, and = p(Xm).
i\
(y.l.lj
9.1 Inclusion of observations
373
Recall from page 119 that p{X) is the effective number of parameters/explanatory variables in a linear model with model matrix X. If p(Xm+\) — p{Xm) = 1, there is effectively an additional explanatory variable in the augmented model. This variable does not affect the fit of ym, but ensures exact fit of the last observation. As the last observation is the BLUE of its own expectation, there is no new LZF exclusive to the augmented model. Consequently there need be no revision in the BLUE of any function that is estimable under the initial model. The dispersion of such a BLUE, as well as R% would also remain unchanged (see Proposition 9.1.9 and the discussion preceding it for formal proofs of these statements in a more general case). In case (b) x'm+1/3 is estimable under Mm. Let aj' m+ i/3 m be the BLUE of this function under M.m. Then it is uncorrelated with every LZF of M.m. Consequently the linear statistic wm+i = ym+i - x'm+1/3m
(9.1.2)
is an LZF of Mm+i which is uncorrelated with every LZF of Mm. Since ) = p{I — Px ) + 1, a standardized basis set of LZFs of p(I — Px Mm+i can be obtained by augmenting a standardized basis set of LZFs of Mm with a standardized version of wm+\. Since the BLUEs under M.m are already uncorrelated with the LZFs of M.m, adjustment of their covariance with wm+\ would produce the updated BLUE under M.m+\. These observations lead to the following update equations. Proposition 9.1.1 Under the above set-up, let C(xm+i) E C(X'm). Suppose further that A/3 is estimable, and wm+\ is as in (9.1.2). Further, let h = x'm+1{X'mXm)-xm+1 and c = Xm(X'mXm)-xm+1. Then
(a) X m 3 m + 1 = Xm0m + ^ c . (b) D(Xmpm+l) (c) Rl =Rl
= D(Xm0m) +
- T^-T-CC'. 1+a
.
(d) The change in i?^ corresponding to the hypothesis A(3 = £ is
374
Chapter 9 : Updates in the General Linear Model a = A(X'mXm)-xm+1 and DA = A(X'mXm)-A'. (e) The degrees of freedom of R2, and R2H increase by 1 as a result of the inclusion of the additional observation.
Proof. Note that Xml3m is an unbiased estimator of Xm(3 that is already uncorrelated with the LZFs of M.m- By making it uncorrelated with the new LZFs wm+i through proposition 3.1.2, we have Xrr3m+l
= XrrSm ~
Cov{Xm0m,Wm+l)Wm+l/Var(wm+l)-
Part (a) is proved by simplifying the above expression. Since Xm(3m+1 must be uncorrelated with the increment term in part (a), we have
D(Xm0J = D(Xm0m+l) + D
).
Simplification of this expression leads to the relation given in part (b). Part (c) follows from the characterization of RQ through a standardized basis set of linear zero functions, and by simplifying the increment, 9
9
°- "4+i As far as the restricted model is concerned, the role of wm+\ is played by the quantity
wm+l-a!D-A{Apm-£), which is obtained from wm+i by adjusting for its covariance with the LZF Af3m — £. The variance of this quantity is easily seen to be Var(wm+\) — o2a'D~^a. The result of part (d) is similar to that of part (c) with these adjustments. Part (e) follows from the fact that the additional observation essentially results in only one additional LZF of variance a2 which is uncorrelated with the existing LZFs — both for the restricted and unrestricted models. Update equations like those given in Proposition 9.1.1 are obtained by Plackett (1950) and Mitra and Bhimasankaram (1971). The quantity wm+\ holds the key to the update equations given in Proposition 9.1.1. It can be interpreted as the prediction error of the
9.1 Inclusion of observations
375
BLUP of ym+i computed from the first m observations (see Proposition 7.13.1). Brown et al. (1975) call this quantity the recursive residual of the newly included observation. We now go back to the general case where Vn is not necessarily In and several observations may be included simultaneously. 9.1.2
General case: linear zero functions gained*
Let yn, Xn and Vn be partitioned as shown below: v
_(Vm\.
X
_(Xm\.
y
^(Vm
Vml\
(9.1.3)
where / = n — m. Let Z* = p(Xn : Vn) — p(Xm : V m ) . Note that 0
376
Chapter 9 : Updates in the General Linear Model
as (X'ti : X\2)\ where X^ has full row rank and
p(Xl)-p(Xm)=p(Xn)-P(Xm). The elements of y^ Vml and Vl can also be permuted accordingly. Thus, the inclusion of the I observations can be viewed as a two-step process: the inclusion of the first set of observations entails additional estimable LPFs but no new LZF, as in case (a), while the inclusion of the remaining observations result in additional LZFs but no new estimable LPF, as in case (b). Thus, it is enough to identify the set of new LZFs in the augmented model in case (b), which we do through the next proposition. Proposition 9.1.3 In the above set-up, let /* > 0 and C(X\) C C(X'm). Then a vector of LZFs of the model M.n that is uncorrelated with all the LZFs of M.m is given by Wl=yi-
Xtpm
- V'mlV-m(ym - Xmpm).
(9.1.4)
Further, all LZFs of the augmented model are linear combinations ofw[ and the LZFs of the initial model. Proof. It is easy to see that yl — Xi/3m is indeed an LZF in the augmented model. The expression for to; is obtained by making it uncorrelated with (Im — PY )ym as per Proposition 3.1.2, and simplifying it. We shall prove the second part of the proposition by showing that there is no LZF of the augmented model which is uncorrelated with wi and the LZFs of the initial model. Suppose, for contradiction, that u'(I — PY )y is such an LZF. Consequently it is uncorrelated with (/ - PXm)ym
and [y{ - XtfJ.
Therefore
(I-PxJ(Vm:Vml)(I-PXn)u (Vtn-.VMl-PxJu-XtXniVrnlVrraHl-PxJu
= 0 - 0
The first condition is equivalent to (Vm : Vmi)(I — Px )u 6 C(Xm). It follows from this and the second condition that
(X£)x-m(Vm:Vml){I-PXn)u=[l2l
v7) ( J - P *>'
9.1 Inclusion of observations that is, V{I - PXn)u <E C{Xn). This implies that u'(I - PXn)yn trivial LZF with zero variance.
377 is a
n
The crucial LZF wi is a generalization of ium+i denned in (9.1.2) for the special case / = 1. Remark 9.1.4 A standardized basis set of LZFs in the augmented model has Z* extra elements, in comparison with a corresponding set for the initial model. Since all the LZFs of the augmented model that are uncorrelated with those of the initial model, are linear functions of wi, the rank of D(wi) must be Z*. Remark 9.1.5 It follows from Proposition 7.13.1 that the LZF wi can be written as the prediction error yi — t/i, where yt is the BLUP of yt on the basis of the model Mm^ Remark 9.1.6 There is no unique choice of the LZF with the properties stated in Proposition 9.1.3. Any linear function of W\ having the same rank of the dispersion matrix would suffice. However, the expression in (9.1.4) is invariant under the choice of the g-inverse of V m (this follows from Proposition 7.3.9). Remark 9.1.7 Let di{j3) = Vl - Xtf - VimV^{ym - X m /3), the part of the model error of y[ that is uncorrelated with the model error ofym. The LZF wi can be seen as di(/3m), the prediction oidi(/3) based on the first m observations. The implications of this interpretation will be clear in Section 9.2. McGilchrist and Sandland (1979) extend the recursive residual of Brown et al. (1975) to the case of any positive definite V. The expression of (9.1.4) for / = 1 can be seen as a further generalization to the case of singular V. Recursive residuals have been quite popular (particularly when there is a natural order among the observations) because of the fact that these are uncorrelated. These are used as diagnostic tools (see Kianifard and Swallow, 1996). It was seen in Proposition 9.1.1 that the recursive residual plays the central role in obtaining updates of various quantities of interest. The same holds in the general case, as we shall see in Proposition 9.1.8.
378
Chapter 9 : Updates in the General Linear Model
Haslett (1985) extends the recursive residuals to the case of multiple observations (/ > 1), assuming that V is positive definite. Jammalamadaka and Sengupta (1999) further extend it to the case of possibly singular V in the following way. Suppose that FF' is a rankfactorization of a~2D(wi), and F~L is a left-inverse of F. Then the LZF, F~Lwi can be defined as a recursive group residual for the observation vector yt. The recursive group residual is not uniquely defined whenever D(wi) is a singular matrix. However, the sum of squares of the recursive group residuals is uniquely defined and is equal to o2w'l[D(wi)]~wi. The vector wi is also uniquely defined given the order of inclusion of the observations. Moreover, the components of wi have one-to-one correspondence with those of y^. We can call wi the unsealed recursive group residual for yl.
9.1.3
General case: update equations*
It transpires from Remark 9.1.2 and the subsequent discussion that the main case of interest for data augmentation is case (b). We already have from Proposition 9.1.3 a vector LZF which accounts for the additional LZFs of the augmented model. We now use it to update various statistics. Proposition 9.1.8 Under the set-up of Section 9.1.2, let C{X\) C C(X'm) and let h = p{Xn : Vn\ - p{Xm : Vm) > 0. Suppose further that A(3 is estimable with D(A/3m) not identically zero, and wi is the recursive residual given in (9.1.4). Then (a) Xmpn = Xmj3m Cov(Xmfim,wi)[D(wi)]-wi. (b) DiXnflJ - D{XJlm) Cau{Xmpm,wi)[D{w{)]-Cov{wl} X-mPm)(c) Rl^R^+a^iDiw^-WL (d) The change in R2H corresponding to the hypothesis A/3 — £ is R2Hn = R2Hm + o2WiJ[D{ww)}-Wu, where wu =wt- Cov(whA^m)[D(Afim)}-(A0m - £). (e) Inclusion of the I additional observations increases the degrees of freedom of RQ and R2H by U and p(D(w^)), respectively.
9.1 Inclusion of observations
379
Proof. The proofs of parts (a), (b) and (c) are similar to the proofs of the corresponding parts of Proposition 9.1.1. Substitution of these three update formulae into (7.9.4) leads to (d) after some algebraic manipulation. Part (e) is a consequence of the fact that the additional error degrees of freedom coincide with the number of nontrivial LZFs of the augmented model that are uncorrelated with the old ones as well as among themselves. The variances and covariances involved in the update formulae can be computed from the expressions given in Sections 7.3 and 7.7. The explicit algebraic expressions in the general case are somewhat ungainly, as found out by Pordzik (1992a) and Bhimasankaram et al. (1995) who use the inverse partitioned matrix approach of Section 7.7.2 as a vehicle for deriving the updates. Simpler expressions can be found in some simpler cases. When a single observation is included (I = 1), D{wi) reduces to a scalar. Here, the assumptions of Proposition 9.1.8 imply that p(D(wi)) is equal to 1. The rank of D(iu/*) must also be equal to 1 (it is zero if and only if wi is a linear function of the BLUEs of A4m, which is impossible). If Vm is nonsingular, the unsealed recursive group residual denned in (9.1.4) can be written as wi = si-si, where st = y, - V'mlV^lym, si = Z,3 m , and Z{ =
(Xi-V'^V^Xn).
(A similar decomposition is possible even if Vm is singular, but the quantities S; and 3j are not uniquely defined in such a case.) The quantity s; is a part of yt which is uncorrelated with ym. On the other hand, si can be interpreted as the BLUP of S[ under the model (ym,Xf3,o-2Vm). Clearly, Cov(si,si) = 0. It follows that
D(wi) = Disri + Dfr), Cov{Xm0m,wi)
=
-Cov(Xm/3m,st).
380
Chapter 9 : Updates in the General Linear Model
If, in addition, Xm has full column rank, then we can work directly with 0m (instead of X m /3 m ). Thus, we have the following simplifications: D{ai) = a2(Vt - V'^V-Vrra), D(si) = o2Zl{X'mVm1Xm)-lZll, Cov(0m,8t) = ^(X^V^Xrn)-1^ = 3n D0n) Rln
-CoV0m,Wl),
= 0m + Crtw(3m,3,)[^(«j) + ^(aj)]"(aj-*j)» = Z>(3m)-Co«(3rn,3«)[-D(ai) + £>(Sj)]"Cou(3ni,Sj)', = Rlm + a 2 (a,-«,)'[£>(«/)+^(«j)r(«i-«j)-
The above formulae for /3n> D(0n) and i?gn are essentially the same as those given by McGilchrist and Sandland (1979) (for / = 1) and Haslett (1985) (for / > 1). Further simplifications occur when V = I, in which case si
=
at = D(8i)
=
Dfr) = Cov(pm,8i)
=
Vi,
XLpm, O2I,
a2Xl(X'mXm)-1X'l,
The resulting simplified forms of the update formulae are similar to those obtained in Proposition 9.1.1. The general expression of R2Hn given in Proposition 9.1.8(e) can be somewhat simplified. Note that the unsealed recursive group residual under the restriction A/3 = £ is w^ = wi — wi, where wi = Cov(wi, Afim)[D(A0m)]~(A0m — £), the linear regression of wi on A/3m — £. It follows that w^ and wi are uncorrelated, and hence, D(wu) = D{wt)
-D(wi).
In the special case of / = /* = 1, the update formula for error sum of squares under the restriction is R2Hn = R2Hm + a2(Wl - wtfUDiwi)
- Dim)}.
9.1 Inclusion of observations
381
Bhimasankaram and Jammalamadaka (1994b) gave another formula for R% essentially in terms of wi, ibi, D(wi) and D(wi), using other notations. It contained a minor error. The correct expression given here is much simpler. If Z* > 1 and Xm and Vm have full column rank, the expressions of u>i and D(wi) simplify as follows.
^ D(Wl)
= =
-Z /J D(3 m )A'[AD(3 m )A']-(A3 m -O, ZlD0m)A'[AD@m)A']-AD(pm)Z'l.
We now turn to cases (a), (c) and (d) of page 375. In case (c), the additional observations of the augmented model are essentially linear functions of the initial model (see Exercise 9.2). Therefore, the linear zero functions and the BLUEs remain the same in the appended model. There is no change whatsoever in any statistic of interest. It has already been shown in the discussion following Remark 9.1.2 that data augmentation in case (d) essentially consists of two steps of augmentation classifiable as cases (a) and (b), respectively. In case (a), there is no additional LZF in the augmented model. Hence, the BLUEs of the LPFs which are estimable in Mm, their dispersions, the error sum of squares and the corresponding degrees of freedom are the same under the two models. The error sum of squares under the restriction A/3 = £ and the corresponding degrees of freedom also remains the same after data augmentation. However, the additional observations contribute to the estimation of the LPFs that are estimable only under the augmented model, as shown in the next proposition. Proposition 9.1-9 p(X'm) = Z». Then (a) Xt0n
= Vl-
(b) DtXtfJ
Under the set-up used in this section, let p{X'n) — VlmV^(ym
-
Xmf3m).
= *2V< - VlmV^D(ym -
Xm0m)V^Vml.
Proof. The LZFs of the augmented and original models coincide (see Remark 9.1.2). Therefore, the BLUE of Xi/3 is obtained by adjusting Hi for its covariance with the LZFs of the original model. We choose ym — Xmf3m as a representative vector of LZFs.
382
Chapter 9 : Updates in the General Linear Model If we write this vector, according to (7.3.3), as Vm ~
=
XmPm
Vm(I-PXm){(I-PxJV^I-PxJ}-(I-PxJym)
=
VmRmym,
then the required BLUE is XiX
=
yi-Cov(yhym-Xm0m)[D{ym-Xm0m)]-{ym-Xm0m)
= yi- Cov(yi, =
Vi-
VmRmym)[D{VmRmym))-{VmRrnym)
V« m V-D(V m fl m i/ m )[I>(V r n fl r o y m )]-(V m fl m y m )
= vi -
vlmvm{ym-xmpm).
The expression of part (b) follows immediately. When Vim — 0) it is clear that the fitted value of y{ is equal to its observed value, and the corresponding dispersion is equal to the dispersion of yi. This is not at all surprising, considering that the / parameters can take any value to make the fit of yx as good as possible. Let us consider an example to understand why yt is not necessarily exactly fitted when Vim 0. Example 9.1.10
Let m = 2, I = 1, 0 = (Pi : /32)' and
(Vi\ Vn = \V2 , Vys/
Xn =
/ I OX 0 0 , \0 1/
Vn =
/ I 0 1\ 0 1 1 . \1 1 3/
It is easy to see that only j3i is estimable from the first two observations. Moreover, the second observation does not carry any information about /3\. It follows that J3i = j/i, and the residuals for the first two observations are 0 and 2/2 > respectively. Further, the dispersion of the fitted values of the first two observations is o2 I
1. According to
Proposition 9.1.9, the fitted value of t/3 is yz — 2/2The reason why the fitted value of 2/3 is not j/3 itself can be understood by examining the covariance of j/3 with the other two observations. Out of these, y\ carries information about /3i. Therefore, y2 is
9.1 Inclusion of observations
383
the only available observation which carries exclusive information about the model error. The only observed sample of model error being yi-, the BLUP of the error component of j/3, based on the first two observations, is j/2- Since the third observation does not introduce any new LZF, it cannot change the estimator of f3\, and consequently, the prediction of the model error. The estimator of /?2 adjusts itself to ensure that the residual of the third observation is the same as its predicted value from D the first two observations! The argument given in Example 9.1.10 can be extended to the general case also. The fitted value of yl from the augmented model must be such that the corresponding residual is identical to the BLUP of e/ from the original model, which is given by V/ m V^ n {y m — Xm/3m). This reduces to zero when the augmented observations are uncorrelated with the original observations (V/TO = 0). Even if this residual is non-zero, it does not alter the error sum of squares, because it is a function of the LZFs of the initial model. The degrees of freedom also do not change. An expression of the BLUE of a general LPF which is estimable under the augmented model is given in Exercise 9.3. 9.1.4
Application to model diagnostics
The homoscedastic linear model, (y,X/3,o2I), is often used in a situation where there is no reason to presuppose a more complicated model. The simple model is then checked for adequacy. The basis of a diagnostic check is the residual vector, e. Note that D(e) = a2(I — H), where H is the hat matrix (see p.108). It follows that the residuals of the various observations are generally correlated, and they do not have equal variance. Some adjustment is necessary in order to make the residuals suitable for diagnostic purposes. The problem of unequal variances of the residuals may be rectified by scaling the ith residual, e^, by its estimated standard deviation, (a2(I — hi))1!2. The resulting quantity, n =^ — , (9.1.5) (a2(l - hi)?/* is called the ith standardized (or internally studentized) residual. These
384
Chapter 9 : Updates in the General Linear Model
residuals are generally correlated. Another residual that can be used for model checking is the deleted residual. Simply stated, the deleted residual for the ith observation is the prediction error arising from the linear prediction of this observation in terms of all the other observations. If the data set is permuted so that the ith observation is the last one, then the recursive residual for this observation is the ith deleted residual. The deleted residual can be standardized by dividing it with an estimator of its standard deviation. A reasonable estimator is obtained by replacing a2 in the variance expression by <72(_j), the usual estimator computed from all but the ith observation. The scaled version of the deleted residual is called the externally studentized (or simply studentized) residual. It can be shown that the zth studentized residual is U=^ -. (9.1.6) (
9.1 Inclusion of observations
385
to the possible exclusion of a nonlinear function of the corresponding explanatory variable. (e) The plot of r$ or ij vs. the fitted values is used to detect heteroscedasticity. A systematic change in the spread of the residuals with the fitted values may indicate a possible functional relation between the variance of the response and its mean. (f) The ordered values of r; or U may be plotted against appropriate quantiles of the standard normal distribution to check'whether the model errors can be assumed to be normal. Since the recursive residuals are uncorrelated, their scaled versions provide an attractive alternative to r{ and U in the above procedures, particularly when there is a natural order among the observations. Galpin and Hawkins (1984) and Hawkins (1991) give a good exposition of these diagnostic methods. A popular variation of plot (a) mentioned above is the CUSUM plot, where the cumulative sums of the scaled recursive residuals are plotted against the index. Movement away from zero is interpreted as indication of a structural change in the model. This plot is proposed by Brown et al. (1975), who also gives formal cut-offs for fluctuations of the plot when there is no structural change. Several modifications and generalizations are suggested by subsequent researchers. McGilchrist et al. (1983) uses the plot of the recursive estimates of the regression coefficients vs. the index number in order to detect structural change in the model. A number of formal tests for the violations of various assumptions can also be constructed on the basis of the recursive residuals. See Kianifard and Swallow (1996) for a review of these methods. 9.1.5
Design
augmentation*
Suppose that a set of m observations has already been collected, and one is interested in a particular estimable function p'/3. Consider the problem of choosing an additional design point optimally, so that the variance of the BLUE of p'/3 is minimized. In the absence of any constraint, the variance can be made indefinitely close to zero. A reasonable constraint may be to set an upper bound on the variance of the
386
Chapter 9 : Updates in the General Linear Model
estimated mean of the additional observation, calculated on the basis of the first m observations. Of course, one can choose alternative constraints. The purpose of this section is only to show how the update formulae derived earlier can be utilized in solving some design problems. The simple case of homoscedastic model errors admits an intuitively meaningful solution to this problem: the new row of the design matrix should be proportional to p. We now derive a solution in the general case of heteroscedastic and possibly singular error dispersion matrix, by making use of the results of Section 9.1.3. Note that in the present context 1 = 1. In order to simplify the notations, we denote Xm, Xt, ym, yb Vm, Vml, Vi, wi and d;(-) by X, x', y, y, V, v, v, w and d(-), respectively. The task is to minimize the variance of p /3m+i with respect to x, subject to the constraint Var(x fim) < aa2 where a is a known positive number. It is clear that the new design point x carries no information about p'j3 if it is not in C(X'). Therefore x has to be of the form X'u. In such a case, choosing x is equivalent to choosing u. It was argued in Section 9.1.2 that whenever x G C(X'), there must be an additional LZF with nonzero variance, unless the new observation error is perfectly correlated with the first m errors of the model. The latter case is not interesting, since the Var(x f3m+1) happens to be the same as Var(x'fim). In the following discussion we assume that x = X'u and
Var(w) > 0. In view of Part (b) of Proposition 9.1.8, minimizing Var(p'/3m+1) is equivalent to maximizing [Cov(p'0m,w)]2/Var(w). Writing w as d(fl) + [d0m) - d(0)], a sum of uncorrelated parts (see Remark 9.1.7), it follows that
Cov(p'pm,w) = Cov(p'Pm,d0J-d(f3)), Var(w) = Var{d(P)) + Var(d(J3m)-d(p)). Let 6 = Var(d(/3))/a2 and the vectors a and b satisfy p = X'a and v = Vb, respectively. Then d(0m) - d{@) = (b - u)'X(J3m - 0). Denoting (j-2D(Xj3m) by SS', we have
Cov(p'(3m,w) =
~a2a'SS'(u-b),
9.1 Inclusion of observations
387
Var(w) = a2[6 + (u-b)'SS'(u-b)], Var(x'J3m) = o2u'SS'u. Thus the optimization problem reduces to
^ e Xs-mini-by
suchthat u'ss>u^a-
(9-L7)
A further simplification occurs if we let u\ = P' S'u, bx = P S'b, u2 = S'u — ui and b2 = S'b — b\. The solution to (9.1.7) can be obtained from the solution to the following problem. ma:: (m-bi)>i-fri) uuu2 9 + {u2- b2)'{u2 - b2)' such that itx e C{S'a), u2 G C{(I - Ps, )Sr), u[ui + u'2u2 < a. (9.1.8) Note that the objective function of (9.1.8) is equivalent to, but not identical with that of (9.1.7). Sengupta (1995) arrives at a similar formulation of the problem using the inverse partitioned matrix method. Proposition 9.1.11 The solution to the optimization problem (9.1.8) is as follows.
(a) If b\ = 62 = 0, the maximum is attained if and only if u2 = 0 and «! = 1/2S'a. (b) If bi y£ 0 and b2 = 0, the maximum is attained if and only if U! = -{a/b'iSS'bi)1'2!*! and u2 = 0. (c) If b\ = 0 and b2 ^ 0; the maximum is attained if and only if u2 = c\b2, where =
Cl~
b'2b2 + a + 6
2b'2b2
[
_ ( _
Aab'2b2
(^
(bfa + a + 8)2)
\1/2
and ux = clb'^/a'SS'a^^S'a. (d) If b\ and b2 are both non-zero, then the maximum is attained if and only if u2 = c2b2 where c2 maximizes [{a—c^b'^)1^2 + {b'^)1!2}2/{9+b'2b2{c2-\)2} over the range 0
388
Chapter 9 : Updates in the General Linear Model
Proof. The proofs of Parts (a) and (b) are straightforward. The other two parts are proved by holding u2 fixed, maximizing the numerator of (9.1.8) subject to the constraint u'xu\ < a — u'2U2, and maximizing the resulting expression with respect to u2. For details, we refer the reader to Sengupta (1995). Remark 9.1.12 The solution of Part (b) coincides with one of the two solutions of Part (a). Remark 9.1.13 Suppose r\ is the correlation coefficient between y and p'J3m, and r 2 is the multiple correlation coefficient of y with X/3m. Then b\b\ = vr\ and b'^ = v{r\ ~ r i)- Thus the four different cases of Proposition 9.1.11 have direct statistical interpretation. Remark 9.1.14 Proposition 9.1.11 leads us to a choice of S'u in each of the four special cases. The choice in each case is of the form S't for some t. It is clear that S'u = S't if and only if u is of the form u = t+t\ where S't\ = 0. On the other hand, the condition S't\ = 0 holds if and only if ti is orthogonal to C(D(X(3m)) which is the same as C(X)nC(V). Thus, ti must be of the form (/ - Px)t2 + (I - Pv)tQ. Therefore, the condition S'u = S't is equivalent to Xu = Xt + X(I — Pv)to for some vector to^ The above observations allow one to translate the choice of S'u obtained from Proposition 9.1.11 into a choice of x, as follows. Proposition 9.1.15 The choice of x that minimizes subject to a~2Var(x'f3m) < a is given as follows. '
p)1/2p
+ x0
-{a/vrl)l'2X'V-v x
t/r 2 = 0,
+ x0
= J [(a - cfwrD/wp]1/2 + aX'V-v
{
Var(p'(3m+1)
if r\ = r\ > 0, + x0
if r2 > 0 = n ,
9 / 9 O\ ^ 1/2 a-civ(ri -rf)\ . . ,v1/2 ~^i \ (v pfl2rlP vr[ J J
+ c2X'V~v
+ x0
if r%>r2>
0,
9.1 Inclusion of observations
389
where vp = Var(j»'/3m)/CT2, xo is an arbitrary vector in C(X(I — Pv)), r\ and r2 are as in Remark 9.1.13, and
C1
- f1 > a + e \ \ i d
Cl
~ {2+2vr2J[L
4avr"
V/2'
{L (a + 6 + vr*)*) j ' [{a-c 2 t;(r|-r?)} 1 /2
Co =
are
max &c€[0,{aMr|-r2)}l/2]
+
{1,r2}i/2l2
-—* ^— — 6 + v(r% - r\){c - I) 2
—.
Proof. The results follow from Proposition 9.1.11 and Remark 9.1.14 after some algebra. D Remark 9.1.16 The ambiguity in the choice of V~ can be removed by replacing X'V~v by X'PvV~v. The difference between the two terms is absorbed by the arbitrary vector XQ. Remark 9.1.17 The intuitive solution of choosing x in the direction of p is optimal not only in the homoscedastic case, but whenever r-i = —7"i > 0. If r2 = ?"i > 0, the opposite direction is optimal. Both of these cases correspond to the situation when the multiple correlation of y with X/3m is the same (in magnitude) as its correlation with p>'/Jm alone. Both the solutions are optimal when r2 = 0. The assumption of uncorrelated error variances is a special case when ri — 0. n Sengupta (1995) also considers the design problem when there are several LPFs of interest. Bhaumik and Mathew (2001) consider a similar problem when several additional observations have to be designed.
9.1.6
Recursive prediction and Kalman filter*
When observations are collected over a period of time, the linear model is sometimes used as a vehicle for predicting future values of the response. This prediction is recursively updated as newer observations become available. Using Theorem 7.13.1 and the notations used in this chapter, the BLUP of y( on the basis of data ym and Xn is seen to be
y, = X{pm - V'mlV^{ym -
Xmpj,
390
Chapter 9 : Updates in the General Linear Model
assuming that Xi/3 is estimable under the model Mm- The dispersion of the corresponding prediction error is
D(yt - m) = (Vmlv^ - X ^ J P ^ J ^ V ; - xtx-y. Part (a) of Theorem 9.1.8 shows how the actual prediction error wi = Vl~Vl c a n D e utilized, once yi becomes available, to obtain the estimator of Xnf3 and its dispersion. These updated quantities can be used to predict future values of the response in terms of the corresponding values of the explanatory variables. The assumption of a linear model with fixed coefficients is unsuitable for predicting the response, if the underlying mechanism changes with time. A very versatile model for a time-varying system is the state-space model, given by the recursive relation xt
= Btxt-\ + ut,
zt
= Htxt + vt,
(9.1.9) (9.1.10)
for £ = 1,2,.... (This model was briefly mentioned in Section 1.5.) In the above, the state vector xt is unobservable, but the measurement vector zt is observable. The error vectors ut and vt have zero mean, and Cov(us,ut)
= Qu(s,t),
s,t =
1,2,...,
Cov{vs,vt)
= Qv(s,t),
s,t =
1,2,...,
Cov(us,vt)
=
Quv(s,t),
s,t = 1,2,....
The matrices Qu(s, t), Qv(s, t) and QUtV(s, t), s,t = 1,2,... are assumed to be known. T h e state transition matrix Bt and t h e measurement ma-
trix Ht, t = 1,2,..., are also assumed to be known. The objective is to predict the state vector xt by a linear function of the observations Zi,Z2,...,Zt and the initial state XQ. In some applications the measurement vector z t also has to be predicted by a linear function of XQ, Z\,..., Zt-\. The linear predictors should have the smallest possible mean squared prediction error. The vector XQ may itself be an estimator, where the corresponding estimation error is absorbed in u\. The analysis given here is conditional on a fixed value of XQ.
9.1 Inclusion of observations
391
A recursive solution to the above problem is given by the Kalman filter (Kalman, 1960, Kalman and Bucy, 1961). The state-space model and the Kalman filter have a wide range of applications. It will be shown here that the Kalman filter equations can be derived from the update formulae of Section 9.1.3 in an intuitive manner. The first step in the derivation is to show that the minimum mean squared error linear predictor is given by a BLUE in afixedeffects linear model — as pointed out by Duncan and Horn (1972). A simpler version of their argument is used here, but a stronger result (with possibly singular dispersion matrices) is proved. Proposition 9.1.18 Let h be a known non-random vector and x and z be random vectors following the model (9.1.11) where F, G and V are known matrices which may not have full row or column rank, and C(G') C C(F'). Then for an arbitrary matrix C of appropriate dimension satisfying C(C') C C(F') (a) a minimum mean squared error linear predictor of Cx having the form A\h + A2z + as must be unbiased in the sense that the expected value of its prediction error is zero for all values of E(x); (b) the BLUE ofC/3 from the fixed effects model (y, X/3, V), where
V=(*),
and * = ( £ ) ,
is a linear predictor of Cx based on z and h, having the minimum mean squared error; (c) the mean squared prediction error of the predictor of part (b) is the same as the dispersion matrix of the BLUE ofCf3 from the abovefixedeffects linear model. Proof. Let A\h + A2z + a% be a linear predictor of Cx. The matrix of mean squared prediction error for this predictor is E[(Aih + A2z + a3 - Cx)(Aih + A2z + a3 - Cx)']
392
Chapter 9 : Updates in the General Linear Model
= E{Aih + A2z + a 3 - Cx)E{Alh + A2z + a 3 - Cx)' + D{A\h + A2z + a 3 - Cx) = [(AXF + A2G- C)E(x) + a3] [(AXF + A2G - C)E(x) + a 3 ]' + D(Alh + A2z-Cx). Since h is non-random, the dispersion depends only on A2. For a given choice of A2, the bias term can be made equal to zero by choosing A\ = (C — A2G)F~ and a 3 = 0. Therefore, a linear predictor with minimum mean square prediction error cannot have non-zero bias. This proves part (a). In order to prove part (b), let A\h + A2z + a 3 be a linear predictor of Cx and B = ((C - A2G)F~ : A2). Let us also write e = (it' : vj. It follows that E[(Aih + A2z + a3 - Cx)(Aih + A2z + a3 - Cx)'} > E[(By - Cx)(By - Cx)'} = D(By -Cx) = D (Be - ((C - A2G)F~ : A2) (^) x - Cx\ = D(Be) = BVB'. Let C = LX and B* = LR where R = I - V(I - PX){(I - PX)V(I
- PX)}~(I - Px).
(9.1.12)
According to Proposition 7.3.1, B*y is the BLUE of Cf3 from the model (y, X(3, V). Moreover, B*X — LX — C. Consequently, BVB' = (B-B*+B*)V{B-B* + B*)' = B*VB*' + {B-B*)V(B-B*)'
+ B*V(B - BJ + (B- B*)VBJ. The dispersion of the BLUE Ry given in (7.3.4) can be written as VR'. Proposition 7.3.9 implies that C(VR') C C{X). It follows that VBj can be written as XK for some matrix K. Hence, (B - B*)VBJ = (BX - B*X)K = (C - C)K = 0.
9.1 Inclusion of observations
393
Consequently E[(Aih + A2z + a3 - Cx)(Aih + A2z + az - Cx)']
> BVB'
= B*VB*' + (B- B*)V{B - B.)' > B*VBJ
=
E[(B*y-Cx)(B*y-Cx)']
This proves part (b). Part (c) follows from the simplification E[(B*y-Cx){B*y-Cx)'}
= B*VB*'= LVR'L',
the last expression being the dispersion matrix of the BLUE of C{3 from the model (y,X0,V). Proposition 9.1.18 generalizes a result of Duncan and Horn (1972), where V was assumed to be block-diagonal and nonsingular, and F and L were chosen as / . The best linear predictor described in this proposition happens to be a BLUP. Note that the equations (9.1.9)-(9.1.10) up to time t can be written as yt = Xat + et, (9.1.13) where 7 f = (x' x : x'2 :
/ -Bizo \
: x't)' a n d
I
0 Vt=
I -B2
0
0 - - - 0 N I
0
-Bt
-ut , et =
z1 z2
zt )
-u2
I
, Xt=
\
/ -ui\
0
\
Hi 0
0 H2
0
0
0 0
Ht )
vi v2
V vt )
This is a special case of (9.1.11) with F nonsingular. We shall denote D{et) by Vt, and use the notation Mt to describe the model (yt,Xnt,Vt).
394
Chapter 9 : Updates in the General Linear Model
The state update and measurement equations up to time t can also be written as Vt = (Xt:Oyyt+1+et. (9.1.14) We shall denote by M\ the model (yu (Xt : 0)7 t + 1 , Vt), which also fits into the framework of (9.1.11). However, the condition C(C) C C(F') of Proposition 9.1.18 means that the result can be used only to predict linear functions of -yt, and not for all functions of -ft+iThe state update equations (9.1.9) up to time t and the measurement equations (9.1.10) up to time t — 1 can be combined into the single equation Vt\t-\ = xt\t-i7t + et|t-i,
(9.1.15)
where -yt is as in (9.1.13) and
[BlXo\
f I 0 -B2
0
0 z\ \ zt-i /
Y
0 i?i \ 0
0\
f~Ul\ -u2
0
I
-Bt
I
0 0 Ht-i 0 /
^
-ut vi \vt-ij
We shall denote D{et\t_i) by V t | t _i and use the notation Mt\t-i for the model (yt\t-iiXtu_x'yt,Vt\t_i). This is also a special case of (9.1.11) with F nonsingular. Recursive prediction of the state vector consists of the following cycle of steps. (I) Given the prediction of ccj-i based on XQ, Z\, ..., Zt-i and the dispersion of the prediction error, predict Xt and the dispersion matrix. (II) Given the above quantities, update these by taking into account the additional measurement ztThe above discussion and Proposition 9.1.18 imply that the best linear predictor of the state vector and at every stage is given by a BLUE in a suitable 'equivalent' linear model.
9.1 Inclusion of observations
395
This is where the update equations of Section 9.1.3 have a role to play. Using the 'BLUE' of Xt-i and its dispersion under the linear model (9.1.13) (with t replaced by t - 1), we can find the 'BLUE' of xt and its dispersion under the model (9.1.13) by tracking the following three transitions: (la) from Mt-i to M.l_1} (Ib) from Mt to -Mt|t_i and (II) from Mt\t-i to Mf At this point we assume that the covariance matrices given in page 390 have the special form: Quv(s,t) = 0 for all s,t, Ru(s,t) = 0 for s ^ t and Rvts^) = 0 for s ^ t. This is done only to simplify the algebra. The derivation can easily be extended to the general case. Step la: transition from A4t-i to Mt_iLet us denote the BLUE of Xt-i computed from the model Mt-i by sct_i and its dispersion matrix by Pt-i- Using Proposition 9.1.18 for (9.1.13), with t replaced by t - 1 and C = (0 : : 0 : / ) , xt-i and Pt-i may be identified as the minimum mean squared linear predictor of Xt-i based on a;o, Z\,..., zt-\, and the dispersion matrix of the corresponding prediction error. The transition from Mt-i to A/(|_1 should involve no change in the BLUE or the dispersion matrix, since the model A/Jj_1 is only a reparametrization of Mt-\- Applying Proposition 9.1.18 to (9.1.14), with t replaced by t — 1, we observe that the best linear predictor and the dispersion matrix of the prediction error remain the same. Step Ib: transition from A4t_1 to A / l i | t _ 1 .
Let £ t | t _i and Pt\t-i denote the BLUE of Xt and its dispersion matrix, computed from the model M.t\t_i. Applying Proposition 9.1.18 to (9.1.15), Xt|t_i and Pt\t-i may be identified as the best linear predictor of xt based on xo, z\,..., zt-i, and the dispersion matrix of the corresponding prediction error. The model Mt\t_i is obtained from the model M\_l by including some additional observations. Note that p(Xt\t_i) — p(Xt-\ : 0) is equal to the size of the vector xt. Therefore, there is no new LZF. The BLUE of 7f_i and its dispersion remain unchanged. We can use Proposition 9.1.9 in order to obtain xt\t_\ and P t | t _i in terms of the
396
Chapter 9 : Updates in the General Linear Model
previously computed quantities. Specifically, we have + x t |t-i -Btxt-i £>(-B t x t -i + x t | t _i)
= 2/*) = Qu{t,t),
where y* is the observed value of the additional part of the y-vector. This happens to be 0, but that should not matter while we simplify the second equation. What matters is that y* is uncorrelated with j / t _ 1 . Thus, we have Covixty^, xt-i) - Cov(y^ + Btxt-i,xt-i)
= BtD(xt-i).
Consequently, Qu(t,t) = D(Btxt-i) + Pt\t-i ~ Cov(Btxt-i,xt\t-i) -Ccw(x t | t _ 1 ,B t *t-i) = Pt\t-i BtPt-iB't. Now we substitute y* = 0 in the earlier equation and have the updates xt{t_x
= Btxt-i,
(9.1.16)
Pt\t-i = BtPt-xB't + Qvfat).
(9.1.17)
Step II: transition from M-t\t-i t° M-tThe model Mt is obtained from the model M.t\t_x by including some additional observations. Since Xt\t_\ has full column rank, there is no newly estimable LPF. In the present case, the recursive residual of Proposition 9.1.3 is identified as Wt = zt — zt, where zt = Htxqt-!.
(9.1.18)
The requisite variance and covariances are D(wt) = D(zt) + HtDixty.jH't
= Qv(t, t) + HtP^^H',, (9.1.19)
Cov(xt|t_1,u>t)
=
-PAt_xH't.
9.2 Exclusion of observations Substitution of Wt, D(wt) and Cov(xt\t_i,wt) Proposition 9.1.8 produces xt = Xtlt-i +
397
in parts (a) and (b) of
P^H^QvM+HtP^H'tyizt-HtXt^), (9.1.20) (9.1.21)
The recursive relations (9.1.16)—(9.1.17) and (9.1.20)—(9.1.21) constitute the Kalman filter. These relations hold for t > 2. The initial iterates are S^o = BIXQ and PMQ = Qu(l,l). The minimum mean squared error linear predictor of the measurement vector zt (in terms of XQ, Z\, ..., zt-i) and the dispersion matrix of the corresponding prediction error are given by (9.1.18) and (9.1.19), respectively. It may be observed that Proposition 9.1.18 holds with no condition on the nature of the matrix Vt. Allowance for the singularity of Vt is important in many practical applications (see, for instance, Harvey and Phillips, 1979). Proposition 9.1.8 also allows the matrix Vmi to be non-zero. Therefore, the above derivation can be readily generalized to incorporate correlation of the error vectors ut and vt in the statespace model (9.1.9)-(9.1.10). Temporal correlation can also be handled. Thus, the derivation of the Kalman filter through the update equations of the linear model has several theoretical advantages. Haslett (1996) uses the linear model update equations to derive the Kalman filter. However, he assumes Vt to be nonsingular and uses a more complicated set-up for linear model updates, where data and parameters are augmented simultaneously. Nieto and Guerrero (1995) derive the Kalman filter in the singular dispersion case from a different set-up, and use the Moore-Penrose inverse where any g-inverse would suffice. 9.2
Exclusion of observations
Recall the models Mm and Mn defined in Section 9.1. In this section we track the transition from the model Mn to Mm, where I = n—m > 0.
398
Chapter 9 : Updates in the General Linear Model
We refer to Mn as the 'initial' model and Mm as the 'deleted' model. 9.2.1
A simple case
Once again we first consider the case Vn = I and 1 = 1, and compare the models Mn and M.n_\. We partition yn and Xn as
(9.2.1) In Section 9.1.1 the recursive residual played a pivotal role in determining the updates in the augmented model. The recursive residual corresponding to yn is expressed in terms of the predicted value of yn from the deleted model, which does not suit the present context. We need a pivot that is expressed in terms of the quantities computed for the initial model. For this purpose we use the ordinary residual, en = yn — x'nPn. Note that f3n is uncorrelated with every LZF, and in particular with every LZF of the deleted model. On the other hand, yn is also uncorrelated with the LZFs of the deleted model, as it is uncorrelated with yra_i- Thus, en is uncorrelated with every LZF of the deleted model. Adjustment of Xn^i/3n_l for its covariance with en yields Xn_x0n
= Xn.l^n_l
- Cov(-Xn-i3 n -i, Oen/Varfen),
(9.2.2)
assuming that Var(en) > 0. (If Var(en) = 0, then no covariance adjustment is necessary, and Xn-ifin_1 = Xn-if3n.) Rearrangement of the in terms of Xn-i0n. In order to obterms of (9.2.2) gives Xn-i0n_1 tain a simple expression of Cov(Xn-i/3n_l), we calculate the covariance of both sides of (9.2.2) with yn. This yields Cov(Xn^n,yn)
= =
0-Cov(X n _i / 9 n _ 1 ,e n )Cou(e n ,y n )/yar(e n ) -Cov(Xn-i^n_l,en).
The last equation follows from the fact that Cov(en,yn) = Cov(en,x'nPn)
+ Var(en) = Var{en).
Thus, CoD(Xn_ij9n_1)eft) = -Cov(Xn^n,yn)
=
-^Xn-^X^X^-Xn.
9.2 Exclusion of observations
399
This simplification, together with (9.2.2), produces X n _i3 n _ ! = Xn-.xpn
- Xn-i{X'nXn)-xnenl{\
- hn),
(9.2.3)
where hn is x'n(X'nXn)~xn, the leverage of the nth observation. The update equations resulting from (9.2.3) are given in the next proposition. Proposition 9.2.1 Consider the initial model Mn and the deleted model Mn-i, and let c = X n-i{X'nX n)~ xn and hn = x'n(X'nXn)~xn < 1. Further, let A/3 be estimable in the models M.n and Mn-i- Then the updated statistics for the deleted model are as follows: (a) X n _i3 n _i = Xn-lPn ~ C<W(1 - K). (b) D{Xn-tfn-l) = D(Xn-lPn) + O2CC'/{1 - hn).
(c) Rl^^Rl-el/il-hn). (d) The change in the error sum of squares R2H under the restriction A/3 = £ is RH^
~ RH«
1 - hn + a'D~Aa
'
where a = A'{X'nXn)'xn and DA = A'(X'nXn)-A. (e) The degrees of freedom of JRQ and R2H decrease by 1 when the nth observation is excluded. Proof. Parts (a)-(c) follow immediately from the discussion leading to (9.2.3) and Proposition 9.1.8. In order to prove part (d), let
en* =en + a'D-A{A'J3n-$,). It is easy to see that en* is uncorrelated with all the LZFs of the deleted model. Further, we have from part (a) Cou{e*n,Apn-i) = Cov{en + a!D-A{Apn-£),Apn-aenl{l-hn)) = cr 2 (O-a' + a ' - O ) = 0. In view of Proposition 7.9.2, e* is uncorrelated with every LZF of the deleted and restricted model. Thus, it can play a pivotal role in tracking the effect of data exclusion in the restricted model. The stated result
400
Chapter 9 : Updates in the General Linear Model
follows from the fact that Var(e*n) = cr2(l - hn + a'D^a). follows from part (e) of Proposition 9.1.1.
Part (e)
Remark 9.2.2 The condition hn = 1 is equivalent to xn <£ C(Xjl_1) or p(Xn) = p(Xn-i) + 1. If hn = 1, then it follows from the discussion of Section 9.1.1 that the exclusion of yn does not change any of the quantities of interest, except that x'n/3 becomes non-estimable. 9.2.2
General case: linear zero functions
lost*
Let us consider once again the four cases described in Section 9.1.2. No LZF is lost in cases (a) and (c). It also follows from the discussion of that section that we only have to identify the LZFs lost in case (b). Case (d) can be thought of as a two-step exclusion where the steps correspond to cases (b) and (a), respectively. Therefore, we deal mainly with case (b). As we have seen in the simple case of data exclusion, the unsealed recursive group residual (wi) of Section 9.1.2 is an LZF of M.n which is uncorrelated with the LZFs of Mm, and these two sets of LZFs together form a generating set of LZFs of Mn- This is why wi was used as a pivot for obtaining the updates in Section 9.1.3. However, wi is expressed in terms of the residuals of Mm, which is generally not available before the data exclusion takes place. Of course, we can express wi directly in terms of yn as
wt = yl-XlX^ym
+
(XlX^-VlmV^)Vm(I-PxJ x[(I-PXm)Vm(I-PxJ-(I-PxJym.
This expression does not depend on the choice of the various g-inverses. However, its computation essentially entails fresh computation of -X"/3m. We need a modification of wi which can be used in the present context. We now give such an LZF via the next proposition in the special case where Vn is nonsingular. Note that in such a case, /* = p(Xn : Vn) — p(Xm : Vm) simplifies to I. Proposition 9.2.3 In the set-up of Section 9.1.2, let I > 0. Then a vector of LZFs of the model Mn that is uncorrelated with all the LZFs
9.2 Exclusion of observations
401
of M.m is given by rt = dSn) =Vl~ Xlh ~ VlmV^{ym
- XmPn),
($-2A)
where dj(-) is as defined in Remark 9.1.7. If p(Vn) = n, then there is no nontrivial LZF of M.n which is uncorrelated with W[ and the LZFs of MmProof. If en is the residual vector for Ain, then r; = [—V^mV^ : I]en. On the other hand, the LZF (I — Px )ym can be written as L(I — Px )yn for some matrix L. Hence, we have from (7.3.5) Cov(n, (I - PxJym)
= [-V lm V- : I]Cov(en, (I - PXn)yn)L' = [-V, m V- : I]Cov(yn, (I - PXn)yn)L' = Cov ((-V l m V- : I)yn, (I - PxJvm)
- 0.
In order to prove the remaining part, we show that p(rj) = I* — [p(Xn) — p(Xm)] whenever Vn has full rank. Note that in this case /* = /. Let CC' be a rank-factorization of Vn such that C' — (C\ : C'2) and Vm = CiC'l. Then D(rl) =
o2(-VlmV^:I)[Vn-Xn(X'nV;1Xn)-X'n}(-V™Vml^
= o*{-c2cl(clc1r: = a\C2{I
i)c(i-pc^Xr)c^{ClC'fClC'^
- Pc[))(/ - Pc-1Xn)((I
If LL' is a rank-factorization of I — P P(D(n))
=
- Pc, )C2). then we have
P((i-Pc_1Xn)((i-Pc,)c2))
402
Chapter 9 : Updates in the General Linear Model
= ' ( < J - i W ( / - p c i > ) = P{«-Pc-1XJL) = p{{C-lXn : L)) - p{C-lXn) = p((Xn : CL)) - p{Xn)
= P[X£ =
(?L)-p(Xn)=p{Xm)+p(C2L)-p(Xn)
p(C 2J LL')-[p(X n )-p(X m )]
= p((I - Pc, )C"2) - [p(Xn) -
p(Xm)}
= p(C[ : C'l) - p(C[) - [p(Xn) - p(Xm)] = n-m-[p(Xn)-p(Xm)} = l-[p(Xn)-p(Xm)]. This completes the proof. It can be shown that whenever C{X\) C C(X'm) and Vn is nonsingular, ri and W[ are linearly transformed versions of one another (see Exercise 9.9). The advantage of ri over wi is that the former is expressed in terms of the estimator Xn(3n in the current model. When Vn is singular, wi may not be a function of r\. In particular, r/ may even have zero dispersion whereas the dispersion matrix of wi must have rank Z*. Evidently r; can serve as a pivot for updates in the general linear model if and only if p(D(r{)) = /* — [p(Xn)— p(Xm)]. The latter condition is satisfied when Vn is nonsingular. Jammalamadaka and Sengupta (1999) overlook the necessity of the rank condition on D(rt) (see Remark 9.2.6). Note that the condition C{X[) C C(X'm) was not needed in the proof of Proposition 9.2.3. Thus, case (d) of Section 9.1.2 (that is, the case where some LZFs and estimable LPFs are lost due to data exclusion) is well within the scope of this proposition. 9.2.3
General case: update equations*
Let us assume that 0 < /* — \p(Xn) — p(Xm)] = p(D(r{)), that is, some LZFs (represented adequately by r\) are lost because of data exclusion. In the light of Remark 9.1.6, we have
Xmpm
= Xmpn + Cov(XmPm, r,)[D(r,)]"r,.
(9.2.5)
9.2 Exclusion of observations
403
The covariance on the right hand side remains to be expressed in terms of the known quantities in the current model. Prom (9.2.5) it follows that Cov(Xm0m,di(P)) = Cov(Xmfin,di(P)) +
Cov(Xmpm,r,)[D(ri)]-Cov(rhdi{p)).
Since di{0) is uncorrelated with ym while XmPm is a linear function of it, the left hand side is zero. On the other hand, Cov(ri,di(0))—D(ri) is the covariance of ri with a BLUE in M.n which must be zero. Therefore the second term in the right hand side reduces to Cov(Xm/3m, r;), which can be replaced by —Cov(Xm0n,di(f3)) in (9.2.5). This leads to the update relationships given below. Proposition 9.2.4 Let 0 < l*-[p(Xn)-p(Xm)] = p{D{r{)) and A/3 be estimable in either model with D(A/3n) not identically zero. Then the updated statistics for the deleted model are as follows:
(a) Xm0m = XmPn - Cov(.X:m3n,d,G9))[2?(r,)]-r,. (b) D(Xmpm)=D(Xmpn)+Cov(Xmpn,dl(P))[D(rl)}-Cov(dl(P),
XmK)(cjR^^Rl-a^lDinrn. (d) The reduction in the error sum of squares under the restriction Afl = £ is given by R2H = R?H — a2r'lif[D(ru)]~ri*, where
rh = n + CovW/3), A0n)[D(APn)]-(APn - £) (e) The degrees of freedom of RQ and R2H reduce by I* and p(D(r;*)) ; respectively, as a result of data exclusion.
Proof. Parts (a)-(c) and (e) follow immediately from the above discussion and Proposition 9.1.8. Part (d) is proved by substituting the update formulae of parts (a) and (b) into equation (7.9.4) and simplifying. Remark 9.2.5 Bhimasankaram and Jammalamadaka (1994a) give algebraic expressions for the updates given in Proposition 9.2.4 in the special case when / = 1 and Vn is nonsingular. Bhimasankaram and Jammalamadaka (1994b) give statistical interpretations of these results along the lines of Proposition 9.2.4. Another set of interpretations in the multivariate normal case is given by Chib et al. (1987). Bhimasankaram
404
Chapter 9 : Updates in the General Linear Model
et al. (1995) give update equations for data exclusion in all possible cases, using the inverse partition matrix approach. Remark 9.2.6 The difficulty offindingan update equation in the case p{D(ri)) < I* — [p(Xn) — p(Xm)] can be appreciated by considering the model with yn =
( Vm\ ym+1 ,
Xn =
( X m\ 1 0
,
/3 = (Po
,
Vn =
fVm\ 0 .
If p{Xm) = 2 and Vm = Imxm, then U - \p{Xn) - p{Xm)} = 2, while p(D(ri)) = 0. It is clear that fin = (y m + 1 : y m + 2 -ym+i)' and D(/3n) = 0, but /3m has to be calculated afresh. There is no way of 'utilizing' the computations of the model with n observations. 9.2.4
Deletion diagnostics
Deletion diagnostics measure the changes in various aspects of a regression analysis if a group of observations is dropped. It is assumed here that every estimable LPF of the full model remains estimable after exclusion of the observations. Suppose that the last / of the n original observations are dropped. [The assumption of deleting the last few observations is made only for notational convenience; any set of observations can be made to occur in the end by a suitable permutation.] Let us also assume that p{D(n)) = [p((Xn : Vn))-p((Xm
: Vm))] - [p(Xn) - p(Xm)}.
The consequent change in the estimator of an estimable LPF p'/3, obtained from part (a) of Proposition 9.2.4, is the difference DFBETA{_Illp=p'pn-p'0m
= Cov(p'fin,di(fi))[D(ri)]-ri, (9.2^6) 1/ being the index set of the deleted observations. The variance of p'/3 n is estimated by replacing a2 in the expression of Var{p'j3n) by
flg.-a2r{[JJ(r,)]-r, a^-p(xm:vm)-p(xmy
(9 - 2J)
9.2 Exclusion of observations
405
Let us denote this estimator by Var(p'(3n). The quantity
( Il)'P
DFBETA(_jA v [Var{p'i3n)Yl*
can be used as a scaled measure of impact of the last / observations on the BLUE of p'fi. The actual algebraic formula for DFBETAS^_h)tP would depend on the values of the model matrices and the chosen computational method (see, for instance, Bhimasankaram et al., 1995). When / = 1 and Vn = I, this measure simplifies (see Remark 9.2.5) to nVRFTAQ
-
(
hV
P'(XnXn)
Xnen
a ( _ n ) (l-/i n )[p / (X;X n )-p] 1 /2
where
.2
Rl-ej/(l-hn)
°^ n-p(Xn)-l Belsley et al. (1980) popularized the special case of this measure where p is a column of the identity matrix. Let us now turn to the change in the overall fit when the last / observations are dropped. A popular measure of this change is the Cook's squared distance
mnvn C00KD^)
a 2 (^3 ra - Xnpm)'[D(Xn0n)]-(Xnpn -
=
xjj
3*^5
^[£(r;)]-C;[Z?(Xjj]-C n lP(r;)]- n d2np{Xn) = Cov{Xnf3n,dt(P)).
=
where Cn
When I\ = n and Vn = / , the Cook's squared distance simplifies (see Remark 9.2.5) to rnnKB ( ~ n)
- elhn[n-p(Xn)] " (i - KYP{xn)Rln
_ r2n ~ P{xn)
f
where rn is the standardized residual defined in (9.1.5).
hn \ '\i-hnJ'
406
Chapter 9 : Updates in the General Linear Model
A related measure that quantifies the change in the fitted value of the last observation when it is dropped is
DFFITS,_n) = -4— ("n) <7?_1}
X'nl3n
~*'nl3m . (yar«/3J)V2
It is clear that DFFITS{_n) is a special case of DFBETAS{_n)p p — xn. When Vn = I, the above expression reduces to
for
tn being the studentized residual defined in (9.1.6). Information on other deletion diagnostics can be found in Ryan (1997) and Rencher (2000). A key assumption for the computation of these diagnostics is that the matrix Vn is known. If it is estimated, as discussed in Chapter 8, exclusion of observations alters the estimator of Vn also. If the estimator of Vn based on the complete data is used in the above formulae, the resulting diagnostics would only be approximations. A nice interpretation of the above diagnostic measures can be given in terms of a well-known result. Note that the vector wi introduced in Section 9.1.2 is the prediction error for the BLUP of y^ in terms of ym (see Remark 9.1.5). The BLUP itself is yt — wi. Let yn(_/() = {y'm (yt - wi)')', and -Mn(_/() be the model (j/ n (_/,),X n /3 n ,o- 2 V n ). This is a data-augmented version of the model Mm- It is easy to see that the recursive residual described in Proposition 9.1.3 reduces to zero in this case. Consequently the BLUEs of M.n(_j^-, their dispersions and the error sum of squares coincide with those of the model M.m. This fact is a confirmation of the intuitively reasonable idea that if one predicts a few observations on the basis of a linear model, and pretends as if the predicted values are actual data, then the estimators of the parameters of the model are not altered by taking into account these additional 'data'. If the BLUE of Xn(3 in the model Mn is written as Bnyn, then the above discussion implies that Bnyn(-lt) = Xnf3m. Consequently
XJm = XJn-Bn^y
(9.2.8)
9.2 Exclusion of observations
407
This update formula is strikingly simple. The reason for its simplicity is that, unlike the propositions of Sections 9.1.3 and 9.2.3, this device makes no attempt to use the summary statistics of M.m or M.n exclusively. Specifically, the matrix Bn is an attribute of Ain, while wi is computed on the basis of Mm- Therefore, (9.2.8) is not suitable for practical computation. Nevertheless, it provides an important interpretation of the update. Haslett (1999) proves this result in the case of nonsingular Vn. It follows from (9.2.8) that
«/*.-*. ) DFBETAS^
=
_
_
_
_
_
_
where
2
°^
p(Xm : Vm) - p(Xm)
In particular,
DFFITSI
D**llb{-Il)-
i\ —
K(°) [(3l_Ii)/o*)Var(x'JnW/*'
where b'n is the last row of Bn. Cook's squared distance also has the simple representation
a\0:w\)B'n[D(xJ>n)]-Bn(°)
C00KD^ = 9.2.5
Missing plot
azm
substitution*
Missing observations sometimes create problems in designed experiments. The BLUEs and various sums of squares of interest in a designed experiment often have closed form and simple expressions owing to the
408
Chapter 9 : Updates in the General Linear Model
special structure of the design matrix. Standard computational formulae are often available for the analysis of designed experiments. Even a single missing observation may render these formulae useless, typically by turning a balanced set-up into an unbalanced one. One of the methods of dealing with this problem is missing plot substitution, which was introduced in Section 6.3.4. The idea is to proceed with the algebra, pretending as if all the data are available, and then to minimize the 'error sum of squares' with respect to the missing observation(s). To see why this should work, let us assume for the time being that the requisite parameters remain estimable even with the missing observation^). Consider the update equation of R% in Proposition 9.1.8. It is clear that RQU > R^m, and that the equality is achieved if and only if the recursive residual, Wi, is 0. Thus, the error sum of squares is minimized when yt happens to be equal to its BLUP computed on the basis of the rest of the data. It was explained in page 406 why the substitution of yi by its BLUP amounts to estimating the parameters from the depleted model. It is clear that the BLUEs, their dispersions and RQ for Mm would be properly computed from the full model with the substituted 'data'. The appropriate number of degrees of freedom associated with RQ would however be p(Vm : Xm) — p(Xm) (rather
than p{Vn : Xn) -
p{Xn)).
Some practitioners mistakenly use the same substituted values of the missing observations for computing R2H. It can be shown that this substitution makes R2H larger than what it should be (see Kshirsagar, 1983). The appropriate R2H is obtained from the full model with another substitution: by replacing the missing observations by their BLUP in terms of the available data, subject to the appropriate linear restriction (see Exercise 9.11). Since this substitution minimizes the R2H of the full model with respect to the missing observations, any other substitution (such as the unrestricted BLUP) may result in an inflated R2H. The number of degrees of freedom associated with the appropriate R2H is
p{Vm : Xm) - p(Xm) + p(D(Aj3m)). The missing plot substitution technique can be extended to the case where the missing observations render some LPFs non-estimable. Shah
9.2 Exclusion of observations
409
and Deo (1991) proved that the principle of substitution works when V = I. The following proposition gives a stronger result. Proposition 9.2.7 Let the models Mn and Mm be defined as in Section 9.1, with p{X'n) not necessarily equal to p{Xm). (a) The BLUEs of all estimable functions of Mm and their respective dispersions are the same as those obtained from Mn, provided that yi assumes a value that minimizes RQ . (b) The minimized value of R% , described in part (a), is equal to Rl (c) The error sum of squares under the restriction Af3 — £, i% m , is equal to the minimized value of R?H with respect to yt under this restriction. Proof. Let to; be a vector of LZFs that form a generating set of the LZFs of Mn which are uncorrelated to the LZFs of Mm. Since p(Xm) is not necessarily equal to p(Xn), it is not possible to construct W[ by (9.1.4). In any case, R%n = RQ + WI[D{WI)\~W[. We shall show that it is possible to choose yl such that wi is equal to zero with probability 1, which ensures RZ = Rn . The BLUE of an estimable LPF of Mm is uncorrelated with every LZF of Mm- The BLUE of this LPF under Mn is obtained by removing the correlation of the earlier BLUE with wi, by means of covariance adjustment. The two BLUEs are related via an equation similar to the update formula of part (a) of Proposition 9.1.8, where wi is defined above. The two estimators would be identical if wi = 0. Let t'yn be an LZF of M.n which is uncorrelated with all the LZFs of Mm. In order to complete the proof of parts (a) and (b), we shall now show that any such LZF is identically zero with probability 1 if y^ assumes the value & = XlPX'Jm
+ V'mlVm(ym - Xmpm).
(9.2.9)
According to part (a) of Proposition 7.2.3, there is a vector k such that t'y = k'y almost surely and X'nk — 0. Let us partition k as (k'm : k't)',
conformably with Xn. Then t'y = k'mym+k'lyl and X'mkm+X\ki
= 0.
410
Chapter 9 : Updates in the General Linear Model
Since this LZF is uncorrelated with (I — Px )ym (that is, with all the LZFs of Mm), we have
0 = Cov(k'mym + k\yh (I - PxJym)
= (k'mVm + fcJV^)(I - PXJ.
Consequently k'mym + k'iyi
= k'mym + k'lXlP
0rn +
= k'm(ym - xmPx,Jm) -
(k'm +
k'lV'mlV^(ym-Xm0m)
+ fc{v^v-(ym -
xmpj
k'lVmlV^)(yn-Xmpn)
According to part (a) of Proposition 3.1.1) and part (b) of Proposition 7.3.9,
(ym - Xm0J
G C(D(ym - Xm0J)
= C(Vm(I - PxJ)
with probability 1. Thus, we can write ym — Xm/3m almost surely as Vm(I — PY )u for some vector u. Substituting this in the expression of k'mym + k\yh we have k'mVm + KVl = (k'm +
klV'^V^Vmil-PxJu
= (k'nVm + k'tKaXl-PxJu
=0
with probability 1. This completes the proof of parts (a) and (b). Part (c) is obtained by repeating the above argument for the linear model
which is equivalent to the restricted model (see Section 7.9). 9.3
D
Exclusion of explanatory variables
There are several reasons why we may wish to study the effect of dropping some explanatory variables from a model. The issue of reducing the number of explanatory variables often arises from the consideration of costs vis-a-vis their utility. Alternatively, the motivation may
9.3 Exclusion of explanatory variables
411
come from the consideration of collinearity and possible lack of precision of the estimators. If the purpose of a regression analysis is prediction or study of the differential impact of a single variable or a small set of variables, then the pruning of unnecessary explanatory variables may be useful. In the present section and the next, we examine the connection between the models M{k) =
(y,X{k)f3{k),a2V)
and M{h) = (y,X{h)(3{h),a2V)
(k > h),
where the subscript within parentheses represents the number of explanatory variables in the model, and X(k) = (x(h)
X(j)),
/3(fc) = ( W ) . \ H (j)
/
We shall refer to M^k) a s the larger model, and to M(h) a s the smaller model. The model M(h) c a n be viewed as a restricted version of the model .M(fc), where the restriction is /3Q\ = 0. When we dealt with general linear restriction, we had to construct an equivalent unrestricted model. No such exercise is needed here, because we already have a simple unrestricted model, JM^). We shall be able to use the results derived earlier in the general case, and shall gain further insight by exploiting the simplicity of this special case. For the consistency of the smaller model with the data, (J — Py)y must belong to C{{I — Py)Xnl\). (This is a simplified version of the consistency condition for a general linear restriction, given in (7.9.2).) We assume that this condition holds. It follows that the data is consistent with the larger model as well. Here we consider the transition from the larger to the smaller model {M(k) to M.(h))- The reverse transition will be the subject of Section 9.4.
412
Chapter 9 : Updates in the General Linear Model
9.3.1
A simple case
Let V = I and h = k — 1. We partition -X"(fc) as (Xrk_u : x
* = £(*)
(9-3.1)
Indeed, v is a BLUE in My.) a n d an LZF in M^-i)- Clearly Var(v) > 0. Therefore, v can be used as a pivot for obtaining the update equations. The detailed equations are given in the next proposition. In order to distinguish between the least squares estimators under the two models, we use a 'tilde' for the estimators under the smaller model and the usual 'hat' for those under the larger model. Proposition 9.3.1 Under the above set-up, let p(X^k_^) = p(X^) — 1 and A^ik_x\ be estimable under the larger model. Further, let
a = B(X[k)X(k))~ ^ J , c = (0:l)(X' (fc) X (fc) )-(J), K = B(X'^X^) B , where B
= (A : 0).
Then (a) ^ ( f c _ i ) = A 3 ( * _ I ) - au/c. (b) D{Ap{k_x)) = DiAp^y) - a2aa'/c.
9.3 Exclusion of explanatory variables
413
W *%<>-» =*%w+ (d) The change in R?H corresponding to the restriction A/3(fc-i) is
=
£
^(fc-D = RHW +°2v1IVar{v*), where v* = v- a'K~(AJ3(k_i) - £), and a~2Var(v*) = c - a'K~a. (e) The degrees of freedom of R2. and R\ increase by 1 with the exclusion of the explanatory variables. Q Proof. It can be shown along the lines of the proof of Proposition 9.1.1 that AP(k-i)
=AP(k-i) - Cfou(j4J9(fc_1),i/)i7Far(i/), D{Ap{k_l))=D{Ap{k_l))-Cov{Ap{k_l),v)Cov{v,Ap{k_l))lVar{u), Rlik^=R2oik)+a^/Var(u), where v is as in (9.3.1). The statements (a)-(c) are obtained by simplifying these expressions. A pivot for changes under the restriction A(3ik_]\ = £ is obtained by adjusting the covariance of v with A/3rk_iy Thus we have the LZF !/. =
v-Cov{v,Apik_1))[D{Ap{k_1))]-{Ap{k_1)-Z).
Consequently we have the result of part (d). The proof of part (e) is left as an exercise. 9.3.2
General case: linear zero functions
gained*
According to Proposition 7.9.1, every LZF in the larger model is an LZF in the smaller model. A standardized basis of LZFs for the model with k parameters contains p{X^ : V) — p(Xik\) LZFs (see Proposition 7.4.1). If this set is extended to a standardized basis for the smaller model, then the number of uncorrelated LZFs exclusive to the smaller model is j , = p(X{h) : V) - p{X{k) : V) - p{X{h)) + p(X{k)). It is clear that 0 < j* < p(Xrj\).
414
Chapter 9 : Updates in the General Linear Model
We first show that the above expression for j* can be simplified to p(X^) — p(X(h)), if we dispose of a pathological special case. Suppose that x is an explanatory variable exclusive to the larger model which is not in C(X^ : V). Then I = (I - Px _v)x must be a nontrivial vector. Consistency of the smaller model dictates that I'y = 0 with probability 1, while that of the larger model requires I'y = (l'x)/3 = \\l\\2f3, where /3 is the coefficient of x in the larger model. These two conditions hold simultaneously only if j3 is identically zero, that is, x is useless as an explanatory variable. We now assume that there is no useless explanatory variable in the larger model, that is, p{Xik\ : V) = p{X{h) : V). Consequently ;* = p{X{k)) - p{X{h)). Another trivial case occurs when j* = 0. Under this condition, the explanatory variables exclusive to the larger model are redundant in the presence of the other explanatory variables, so that the two models are reparametrizations of one another. The various statistics of interest under the two models are essentially the same. The case of real interest is 0 < j * < p{X(j)). Recall that j * is the maximum number of uncorrelated LZFs in the smaller model that are uncorrelated with all the LZFs in the larger model. A vector of LZFs having this property must be a BLUE in the larger model. In contrast to the simple case considered in Section 9.3.1, we cannot use Proposition 7.9.2 to get hold of these BLUE-turnedLZFs, because / 3 ^ may not be estimable. The following result provides an adequate set of such linear functions. Proposition 9.3.2 The linear function u = (I — Px )X^P^, is a vector of BLUEs in the model M.(k) and a vector of LZFs in the model M(h). Further, p{D(v)) = j * . Proof. The parametric function (/ — Px )X^k^ik\ is estimable in the larger model. The BLUE of this function is v. It is easy to see that E{v) = 0 under the smaller model. Since the column space of D(Xik)0{k)) is C(X(k))DC(V) (see part (a) of Proposition 7.3.9), that of D{u) must be C((I - Px )X{k)) n C({I - Px )V). Note that C^-Px{h))X^)^C{{I-PXw){X{k)
: V))
9.3 Exclusion of explanatory variables
=
C«I-pxJ(xW-v))=c«I-pxw)V).
Hence, C{D{v)) = C{(I - Px ^ )X{k)). PXJXW)=J~
415
Consequently p{D{u)) = p((I-
^
D
It is easy to see that the quantity v introduced in the above proposition is just a special case of AJ3 — £ with A = (I — Px )X^) a n d £ = 0. Thus, we have finally found an equivalent linear linear restriction that would serve the purpose. It is essentially the 'testable part' of the generally untestable restriction p^ = 0 (see Exercise 9.13). 9.3.3
General case: update equations*
Before trying to find update formulae for BLUEs, let us examine if all LPFs can be meaningfully updated. If we think of the smaller model as a restricted model, then Proposition 4.9.3 says that all LPFs which are estimable under the larger model are estimable under the smaller model. (While estimating such an LPF from the smaller model, we can formally use the restriction /3(_-\ = 0.) However, our present interest lies in the linear functions of /3/M. Note that the only functions of /3(/j) that are estimable under the larger model are linear combinations of (I — Px )X^f3ihy On the other hand, the estimable functions Therefore, in the smaller model are linear combinations of X^f3,hy the estimable functions of f3th\ in the larger model are estimable under the smaller model, but the converse is not true in general. The rank of (/ — Pv )Xih\ is j * , the maximum number of uncorrelated LZFs which are exclusive to the smaller model. Hence, a necessary and sufficient condition for all the estimable functions in the smaller model to be estimable under the larger model is that j* = p{Xrj\). (In such a case X(h)P(h) a n d X(j)(3(j) are both estimable under the larger model.) Even if 0 < j * < p(X^), there may be some functions of f3rh\ that are estimable under both the models. We now proceed to obtain the update of the BLUE of such a function when the last j explanatory variables are dropped from the larger model. Once again, we use a 'tilde' for the estimators under the smaller model and a 'hat' for those
416
Chapter 9 : Updates in the General Linear Model
under the larger model. The results given below follow along the lines of Proposition 9.1.8. Proposition 9.3.3 Then (a) Ap{h)
Let Aj3ih\ be estimable under the larger model.
= AJ3{h) - Cov{AP{h),v)[D{u)]-v,
where i/ = (I -
Pxw)X{i)P{jY (b) D(Ap{h)) = D(Ap(h))-Cov(A0{h),v){D(is)]-Cov(u,Ap{h)). (c) Rlw=Rl{k)+v>[o-*D{u)]-v. (d) The change in B?H corresponding to the hypothesis Af3th\ = £ is RHW = R2Hlk)+<W-2D("*)]-"*, Where v.=v-Cov(v,A0{h))[D(A0w)]-{A0{h)-e). (e) As a result of exclusion of the explanatory variables, the degrees of freedom of RQ and R?H increase by j * and p{D(v*)), respectively. Q
Remark 9.3.4
The vector i/* is the BLUE of (/ - Px
)X^/3^
in
the larger model under the restriction Afirh\ = £. Remark 9.3.5 Depending on the special case at hand, one may use a different form of u that would have the requisite properties. For instance, if j * = p(Xr^), it can be chosen as X/j\/3/j\. If j * = j , v can be chosen as Remark 9.3.6 we have
0(h) = D(J3W) =
fitj).
^
If [j^ is entirely estimable under the original model,
P(h)-C™@W,0u))[D0U)]-pU), D@{h))-Cov(pih),PU))[D(pU)]-Cov(0{h),0U)y.
These updates only involve J3rk\ and its dispersion.
Q
Bhimasankaram and Jammalamadaka (1994b) give the update formulae for the exclusion of a single explanatory variable when V is nonsingular. These can be obtained as a special case of Proposition 9.3.3 after routine algebra.
9.4 Inclusion of explanatory variables 9.3.4
417
Consequences of omitted variables
The BLUE-turned-LZF, v, is the key quantity that controls the updates of the BLUEs and their dispersion. If the larger model is correct, part (a) of Proposition 9.3.3 implies that the bias of the estimator Af3th\ depends on the mean of v under the larger model. The bias is not substantial for a BLUE in the smaller model when the mean of v is small. Note that E(u) = (/ — Px )X^/3^, and it is small when )Xu\ is a matrix with very small elements. This happens (/ — P if the dropped columns of the model matrix are almost linear combinations of the columns that are retained, which is possible when there is collinearity. On the other hand, the reduction of the dispersion depends only on the covariance of A/3/^ with u, which need not be small even if E{u) is small. Thus, in the presence of collinearity involving the dropped columns, there is a possibility of reducing dispersion substantially without picking up too much bias. The trade-off between bias and variance in the case V = I is considered in Section 11.3.1. 9.3.5
Sequential linear
restrictions*
We have already mentioned that deleting explanatory variables is equivalent to introducing linear restrictions. If there is a need to revise the statistical quantities of interest in view of a linear restriction of the form A/3 = £, the appropriate equations can easily be formed along the lines of Proposition 9.3.3. The main idea is to capture the 'testable part' of the restriction, and use the BLUE of the corresponding LPF under the unrestricted model as the key LZF in the restricted model. These recursions would enable one to incorporate several linear restrictions serially. Kala and Klaczynski (1988) and Pordzik (1992b) obtain the explicit algebraic formulae when V is nonsingular and only one restriction is incorporated at a time. 9.4
Inclusion of explanatory variables
We now consider the transition from the model M(h) — {y-i^(h)P(h)i a2V) to the model M^) = (yi-^(fc)/3(fc),<72V) (k > h), where X^ =
418
Chapter 9 : Updates in the General Linear Model
(X(h)
-X"(j))'
an(i
P(k) = (P[h)
P(j)Y- As
m
Section 9.3, we refer to
-M(fc) as the larger model, and to M(h) as the smaller model. 9.4.1
A simple case
Let V = I and k = h + 1. We partition -X"(/j+1) as (X^ : x^+i)) and /3(/t+i) as (/3'^ : /fyj+i))'. The LZFs in the larger model are a subset of those in the smaller model. We show that the quantity
t = x[h+l){I - PxJy
(9.4.1)
can be used as a pivot for the updates. It is easily seen that t is an LZF in M(hy Since (I — Px )x^+i) £ £(-^(/i+i))> * is uncorrelated with (I — Px )y. Therefore, t is a BLUE in M^+i) We use t as a pivot for the updates. If x^+i) € C(X(/j)), then we only have a reparametrization, and the quantities of interest do not change. Assuming that x^+i) £ C(X(h}), and using 'tilde' for the estimators under the smaller model and a 'hat' for those under the larger model, we have the updates given by the following proposition. Proposition 9.4.1 Under the above set-up, let p(X(h+l)) = p(X^)+ 1 and Af3th\ be estimable under the larger model. Further, let a = AX~[h+l){I - PX(h))x(h+i) for any g-inverse ofX{h+l), c = x'{h+l)(IPx )s(/i+i) and t be as in (9.4-1)- Then
(a) Ap{h)=Ap{h)+at/c. (b) D{Ap{h)) = D(AP{h)) + a2aa'/c. W *V +1) = Rlw - */c (d) R%{h+i) = R%(h)-1?J[c-a'{a-*D(APw) + c-laa'}-a], where U = t- a'[a-2D(Ap{h)) + c" 1 aa'}'(A0 {h) + at/c - £). (e) The degrees of freedom of RQ and R2H decrease by 1 with the exclusion of the explanatory variables. Proof. Since t given by (9.4.1) is an LZF of the smaller model and it turns into a BLUE in the larger model, Ap{h) = Ap{h) -
Cov(Ap{h),t)t/Var(t).
9.4 Inclusion of explanatory variables
419
Write ^4/3(/i) as Afiw
= AX^k)[X{k)p{k)]
= AX-{k)y - AX^y
- X (fc) £ (fc) ],
where k = h + 1. The second term is an LZF in the larger model and hence is uncorrelated with t. Therefore Cov(A/3rh\,t) = Cov(AX/k\~y, t), and we have Af3w = A0{h) + Cov(AX(k)y,t)t/Var(t), which after simplification leads to the expression given in part (a). The results of parts (b) and (c) follow from the basic expressions D(A0{h)) = D(Apw) + Rlw
=
Cov(AXlk)y,t)Cov(AX{k)y,t)'/Var(t),
R\h)-°2t2IVar{t).
Adjustment for the covariance of t with Aj3th\ — £ yields U=
t-Cov(t,AP{h))[D{Apw)]-{Apw-Z),
which simplifies to the form given in part (d), after making use of parts (a) and (b). The expression for Rjj ^ follows immediately. Part (e) is a restatement of part (e) of Proposition 9.3.1. d 9.4.2
General case: linear zero functions lost*
We now remove the assumptions V = I and k = h + 1. We need a pivot which can be computed in terms of the statistics of the smaller model. Such a vector is presented below. Proposition 9.4.2 A vector of LZFs in the smaller model that is also a BLUE in the larger model is * = *U)(' - pxj{(* Further, p{D(t)) =;*.
~ Pxih) )V(I ~ PX(h))}-(! ~ PX(h) )V. (9.4.2)
420
Chapter 9 : Updates in the General Linear Model
Proof. It is clear that t is an LZF in the smaller model. Let I'y be an LZF in the augmented model. In view of Proposition 7.2.3(b), we can conclude without loss of generality that X', d = 0 and X'(hd = 0. Writing I as (/ — Px )s, we have, by virtue of Remark 7.3.3), Cov(t,l'y) =
o*X'U)(I-PXw){(I-PXm)V{I-PXw)}-
= o*X'U)(I-PXw)a
= o>X'U)l = 0.
In the above we have used the fact that C(I — Px C((I-Py )V) (identical to C({I-Py )V{J~PY V
'
X(h)'
"
'
X(h)
)X^) is a subset of ))), which follows X(h)"
from the assumption X^ £ C(X^ : V). Being uncorrelated with all LZFs in the larger model, t must be a BLUE there. The rank condition follows from the fact that C(D(y - X (/l) )9 (h) )) = C{V(J - Px )) (see Proposition 7.3.9), which implies C(D(t)) = C{X'{j)(I - Px )). Remark 9.4.3 Recall that C(X^) is assumed to be a subset of C{X{h) : V). If X{j) = X{h)B + VC, then t is the same as C'yres, where yres is the residual of y from the smaller model. (Specifically, yres = Ry where R = V{I - Py ){(/ - Px )V(I - PY ) } " ( / "res
a
\
-*(ft)'
Lv
X(h)
X(h)
Py ), as seen from Remark 7.3.3.) The vector t can also be interpreted where X^res = RX^, the 'residual' of X^ when as X'^TeaV~yres regressed (one column at a time) on X(hy Similarly, D(t) is the same Remark 9.4.4 The expectations of v and t, denned in Propositions 9.3.2 and 9.4.2, respectively, are linear functions of Puy These linear parametric functions are estimable in the model {yres,X^resfi/j\,a2W), where W = RV. Moreover, v and t are BLUEs of the corresponding parametric functions in this 'residual' model, which is obtained from the original (larger) model by pre-multiplying both the systematic and error parts by R (see Exercise 9.15). See Exercise 9.14 for a direct relation between u and t. Remark 9.4.5 When V is positive definite and a single explanatory variable is included, the BLUE of the coefficient of the new variable in
9.4 Inclusion of explanatory variables
421
the augmented model is proportional to the 'lost' LZF. In this special case the BLUE can be interpreted as the estimated (simple) regression coefficient in the 'residual' model. 9.4.3
General case: update
equations*
We now provide the update relations for the larger model where the BLUE is denoted with a 'hat', in terms of the statistics of the smaller model, where the BLUE is denoted with a 'tilde'. Proposition 9.4.6
// Af3ih\ is estimable under the larger model, then
(a) A(3/h\ = A/3ih\ + Cov(AX7k^y,t)[D(t)]~t, where t is as in (9.4-2). (b) D(Ap{h)) =D(A(3{h))+Cov(AX(k)y,t)[D(t)}-Cov(t,AX(k)y). (c) Rl{k)=Rl{h)-aH'[D{t)]-t. (d) R2H{k) = R2H(h) - oH'*[D{U)]-U, where U=tCov(t,Ap{h))[D(Ap{h))}-(Ap{h) - £) (e) The increase in the degrees of freedom of R2, and R?H with the exclusion of the explanatory variables are given by j* and p(D(t*)), respectively. Proof. Since t contains j * uncorrelated LZFs of the current model that turn into BLUEs in the larger model,
A0(h) = Ap(h) -
Cov(A0ih),t)[D(t)]-t.
Following the argument given in the proof of Proposition 9.4.1, we have Parts (a), (b) and (c) follow imCov(AP(h),t) — Cov(AX(k)~y,t). mediately. Part (d) is proved by substituting the results of these three parts into (7.9.4). Part (e) is easy to prove. Remark 9.4.7 The vector AXZ.y depends on the choice of the generalized inverse of X^, but its covariance with t does not. Remark 9.4.8 The vector <* used in parts (d) and (e) may be expressed in terms of the statistics of the original model by using parts (a)
422
Chapter 9 : Updates in the General Linear Model
and (b). The expression simplifies to
U = D(t)[D(t) +
Cov(t,AXik)-y)[D(A0w)]-Cov(t,AX(k)-y)']~
+ Cov(t,AX(k)-y)[D(A0w)]-(A0w-S)]. 9.4.4
Q
Application to regression model building
Sometimes one wishes to examine the effect of an additional variable on a regression model. Suppose that X^ represents the columns of explanatory variables already present in the model, and XQ\ is the column of the additional explanatory variable. Let /3Q) be the coefficient of xu\ in the larger model. Remarks 9.4.4 and 9.4.5 imply that the whenever V is nonsingular, the BLUE of fi^ from the larger model is the same as its BLUE from the single-parameter linear model (yres,x^resP^,a2V), where yres is the vector of residuals from the model (y, X
9.5 Data exclusion and variable inclusion*
423
is a rank-factorization of V~l. 9.5
Data exclusion and variable inclusion*
There is a very interesting connection between the exclusion of observations from the homoscedastic linear model and inclusion of some special variables to it. The result has been known to researchers for a long time (see, for instance, Schall and Dunne, 1988). Consider the model (y,X/3,a2I). If we wish to drop the last I observations, then the the corresponding updates are given in Proposition 9.2.4. This proposition uses a key LZF, r/, which is uncorrelated with all the LZFs of the depleted model. This LZF is lost when the observations are dropped. The expression for r\ (see (9.2.4)) reduces in the present case to e/, the residuals of the last I observations. If, instead of dropping I observations, we seek to include / explanatory variables (in the form of an n x 1 matrix Z concatenated to the columns of-X"), then the appropriate 'lost' LZF is given by t\ (see (9.4.2). This LZF reduces to Ze in the present case, e being the residual vector in the original model. The key LZFs in the two cases would be identical if Z is chosen to be the last / columns of an n x n identity matrix. Since the LZFs are identical, all the updates would naturally be identical. Thus, the dropping of the observations is equivalent to the inclusion of the explanatory variables. The following is an intuitive explanation of this remarkable connection. When X is replaced by (X : Z), we introduce I additional parameters. The above choice of Z ensures that each additional parameter contributes to the mean of exactly one observation. Thus, each of the last I observations has a chance to be exactly fitted with the help of the parameter which is exclusive to it. Therefore, the last few observations determine the BLUEs of the additional parameters. The remaining parameters are estimated by the remaining observations. An application of this result in the analysis of unbalanced data in designed experiments (using the analysis of covariance model) was given in Section 6.3.4.
424 9.6
Chapter 9 : Updates in the General Linear Model Exercises 9.1 Justify the interpretations given in Remark 9.1.2. 9.2 Show that the condition of case (c) of page 375 implies that the initial and augmented models are transformed versions of one another. 9.3 Given the models M.m and M.n defined in Section 9.1 and the partition of (9.1.3), let p{Xn) - p{Xm) = h = p(Xn : Vn) p(Xn) - p(Xm : Vm). Show that the BLUE of any LPF A0 which is estimable in the model M.n is given by Afin = LmXmpm
+ Liyi ~ LiVlmVm(ym
-
Xmpj,
where Lm and Li are such that A = (Lm : Li)(X'm : X[)'. Derive a formula of the dispersion of this estimator. 9.4 Given the set-up and notations of Proposition 9.1.8, show that wi* = Vi~Vh where y( is the BLUP of j/; from the model M.m subject to the restriction A/3 = £. 9.5 Let rj and t{ be the standardized and studentized residuals, defined by (9.1.5) and (9.1.6), respectively, for the ith observation in the linear model (y,Xfl,a2I). (a) Show that rf < n — p(X). When does this result hold with equality? (b) If hi < 1, show that
Can T{ be written explicitly in terms of t(l What happens when hi = 1? 9.6 Consider the update problem of Section 9.1, and assume that the model errors are uncorrelated. Examine if the updates of a BLUE and its dispersion can be obtained as a special case of the Kalman filter equations (9.1.16)—(9.1.21). If this is possible, compare the equations with those of Section 9.1.3.
9.6 Exercises
425
9.7 Suppose t h a t the sequence of response values the model
follow
yt = a'tp + et, r
£t
r
= Y tfr-i+st + Y 9J6t-j' 3=1
3=1
where (j>\,...,
r—k
Vk,t = Y, fa+kZt-j + Y Qj+kSt-j3=1
3=0
Identify the other components of the state-space model in this case. If the Kalman filter is used for estimating P, what would be the update formulae? [Harvey and Phillips (1979) use the above formulation as a basis for estimation when the parameters
426
9.11
9.12 9.13
9.14
Chapter 9 : Updates in the General Linear Model matrix a2(X'X)~l. If the «th observation is dropped, then the ratio of the modified value of this determinant to the original value is called COVRATIOi, and is used as a measure of influence of the ith observation on the precision of the parameter estimators. Obtain an expression for COVRATIOi in terms of the leverage (hi) and the standardized residual (rj). Consider the set-up of Section 9.1.3, and let Mn correspond to a designed experiment, while the observation y/ is missing. Assume that C(X'm) = C(X'n), and let p(Xn : Vn) - p{Xm : Vm) > 0- Show that the restricted sum of squares R2H for the linear restrictions A/3 = £ is given by R2Hn for the model Mn where yt has been replaced by y{, its BLUP from M.m under the restriction A/3 = £. Prove part (e) of Proposition 9.3.1. Show that the hypothesis (/ — Px )X^j3^ = 0 in the model M.ik) °f Section 9.3.2 is equivalent to the 'testable part' of the hypothesis /3y\ = 0, as per Proposition 5.3.6. Show that the random vectors v and t defined in Propositions 9.3.2 and 9.4.2, respectively, are related to one another as v =
(I-PX{h))X{])[a-2D(t)Tt,
t = xL(i-px
){{I-PX
)v(i-pY
)}~v.
9.15 Prove the statements of Remark 9.4.4. [Hint: The explanatory variables here reside in the column space of the dispersion matrix.] 9.16 Find an expression for the change in the value of the coefficient of determination (see page 170) when a single observation is exand give a statistical cluded from the model (yn,Xnf3n,a2In), interpretation of this quantity. 9.17 Given the two-way classified data model of (6.3.1), obtain a diagnostic to check which block is most influential to the computation of (a) error sum of squares, (b) between treatments sum of squares and (c) the GLRT for equivalence of treatments.
9.6 Exercises
427
9.18 Describe how the various tests of hypotheses would change when a single observation is missing from the model of (6.3.17). 9.19 Consider the added variable plot for the last variable in the model (t/,X ( / l + 1 ) ^ ( f t + 1 ) ,cr 2 /). Let X ( / l + 1 ) = (X(h) : a?(/l+1)), %-H) = (P[h) : &+l)'. Vrts = i1 - PXw^y a n d X{h+l),res = (I — Px )x(/i+i). Assume that fih+i is estimable and 1 G
c(x{h))w (a) Show that the least squares fitted line through the scatter of the added variable plot has intercept 0 and slope equal to 0h+1, the BLUE of ph+l. (b) Show that the least squares residuals for the scatter of the added variable plot are equal to the corresponding residuals of the given model. (c) Are the usual estimators of a2 obtained from the modand (y,X(ft+1)/3(/l+1),
Chapter 10
Multivariate Linear Model
Suppose that we want to determine the efficacy of a new drug by observing n patients over a period of time. Here, the response of the ith patient (i — 1,..., n) is some quantitative measure of relief {yn,..., yiq) observed at time points t\,...,tg after the treatment begins. The response may depend on explanatory variables such as the drug used (new or conventional), the age and gender of the patient. A linear model for this response would be v Vij = Poj + 5 Z PiJxu
+ e »i>
« = 1,
,
j = 1,
,
i=i
Note that the coefficients of the explanatory variables in the above model depend on j . This accounts for the possibility that the pattern of influence of the explanatory variables may depend on the time of measurement. The above model is of the form (1.3.1). However, the observations of a given patient at different time points may be correlated. The error variance for the various time points may also be different. These variances and covariances may not be known at all. Therefore, the error model (1.3.3) with known V is not adequate. In order to handle problems of this kind, we consider in this chapter a linear model with several response variables. It is commonly referred to as the multivariate linear model. After describing the general features of the model in Section 10.1, we discuss best linear unbiased estimation of the model coefficients and unbiased estimation of the error 429
430
Chapter 10 : Multivariate Linear Model
variances and covariances in Sections 10.2 and 10.3, respectively. We deal with maximum likelihood estimation of these parameters in the case of normal distribution of the errors, in Section 10.4. The effect of linear restrictions is discussed in Section 10.5. Sections 10.6 dwells on tests of linear hypotheses. Prediction and confidence regions are briefly discussed in Section 10.7. Section 10.8 provides some applications.
10.1
Description of the multivariate linear model
The general form of the multivariate linear model is as follows. Y = XB + £.
(10.1.1)
In the above equation Ynxq is the matrix of observations of the response variables, with each row representing a case (or observation set) and each column representing a characteristic of the response. The matrix XnXk is the observed matrix of the corresponding explanatory variables (as in the univariate response linear model considered before). The matrix BkXq contains the unspecified parameters, while the matrix £nxq contains the model errors. It is assumed that
E{S) = 0,
D{vec(£)) =
-Zqxg®Vnxn.
The matrices £ and V involved in the Kronecker product can be singular. Typically the matrix V is assumed to be known and the matrix E, unknown with specified rank. When q = 1, (10.1.1) reduces to the univariate linear model (1.3.2). In the case of the opening example, note that every row of Y and X correspond to a patient, every column of Y and B correspond to a particular time point and every column of X (and the corresponding row of B) correspond to an explanatory variable (or the constant term). Assuming that the responses for the various patients are uncorrelated, we can define V = I. The variance-covariance matrix of the response for a particular patient (given the explanatory variables) is S. Using the result of Exercise 2.27, we can rewrite the model (10.1.1) as vec(Y) = (I®X)vec(tf)+vec(5), E(vec(£)) = 0, D(vec(S)) = E
10.2 Best linear unbiased estimation
431
When S is unknown, the linear model (vec(l^), (I
and C(Y'(I - Px)) C C(E),
with probability 1. These two conditions must be satisfied by the data. The first condition is equivalent to C((I - PV)Y) C C((I - PV)X) with probability 1. Even though £ is unknown, the second condition is verifiable as long as C(£) is known (for example, from some restrictions giving rise to the singularity of S). 10.2
Best linear unbiased estimation
It was mentioned in Section 10.1 that the model (10.1.1) can also be written as (10.1.2). Even though the dispersion matrix of (10.1.2) involves several unspecified parameters, much of the theory of Chapter 7 is directly applicable to this model. On the other hand, the inference problems for the multivariate linear model are posed with reference to (10.1.1). Therefore, we shall have to move back and forth between the representations (10.1.1) and (10.1.2). Note that a vector-valued linear function of vec(B) has the general representation Y1J=I-A-JBUJ, where («i : : uq) = IqXq, and Ai,..., Aq are arbitrary matrices of appropriate dimension. Likewise, a vector-valued linear function of vec(V) is of the form X^=i KJYUJ, where K\,..., Kq are arbitrary matrices of appropriate dimension.
432
Chapter 10 : Multivariate Linear Model
Consider the problem of estimating the linear parametric function Z)f=i AjBuj by a linear function of the response, J29j=i KjYuj. The characterizations of linear unbiased estimator, linear zero function and estimable linear parametric function are given in the next proposition. Proposition 10.2.1 may be singular.
Consider the linear model (10.1.1) where V
is a LUE of ]C?=i AjBiij if (a) The linear function X)jLi KJYUJ and only if there are matrices L\,..., Lq such that LjX = Aj for j = 1 , . . . , q and J29j=i LJYUJ = Zw=i KjYuj with probability 1. is a LZF if and only if there (b) The linear function J29j=i KJYUJ are matrices L\,...,Lq such that LjX = 0 for j — 1 , . . . , q and J2q=i LJYUJ = Yl9j=i KJYUJ with probability 1. (c) The LPF J29j=1 AJBUJ is estimable if and only ifC{A'j) C C(X') for j = l,...,q. (d) The LPF J29j~i AJBUJ is identifiable if and only if it is estimable. Proof. Let k' be the first row of (K\ : : Kq). Using part (a) of Proposition 7.2.3 for the model (vec(Y),(J
: Aq)vec{B) = {Lx :
: Lq)(I ® X) = {LXX :
: LqX). (10.2.1) Part (a) follows immediately. Part (b) is obtained by setting Aj = 0 for i = 1 , . . . , q in part (a). In order to prove part (c), we use Proposition 7.2.4. The condition obtained from this proposition is C{{A\ : : Aq)') C C{(I
10.2 Best linear unbiased estimation
433
metric function. Proposition 10.2.2 // £ j = 1 AJBUJ is an estimable vector LPF of the model (10.1.1), then its BLUE is Y^^IAJX'XBUJ, where XB =
[I-V(I-PX){(I-PX)V(I-PX)}-(I-PX)]Y.
Further, the BLUE is unique. Proof. According to Proposition 7.3.1, which holds even when the dispersion matrix is unknown, the BLUE of vec(XB) or (I <S> X)vec(B) in the model (10.1.1) is given by the right hand side of (7.3.1), with X, V and y replaced by I
[ / - ( E S - ) ® M ] [ ( S ® V ) o + (I®X)b] [ S ® y - S ® ( M y ) ] a + (/®X)ft [J® J - / ® M ] ( S ® V ) a + (J® JC)6 [ I ® ( J - M ) ] [ ( S ® V ) a + (J®A:)6] [/ ® (/ - Af)]vec(lr) = [I ® (J - M)]vec(Y) vec((/-M)r),
as stated in the proposition. It has mean XB and is uncorrelated with every LZF. Therefore, X)?=i AJX~XBUJ has mean X)1=i AJBUJ and is uncorrelated with every LZF. This proves the first part of the proposition. Since C{XB) C C(X) with probability 1 (see Remark 10.2.6]_and CiA'j) C C(X') (see Proposition 10.2.1), the BLUE £ ? = 1 AJX-XBUJ does not depend on the choice of the g-inverse of X. Non-existence of a different BLUE follows from Proposition 7.3.1.
434
Chapter 10 : Multivariate Linear Model We denote AjX'XB
by AjB, so that the BLUE of £ j = 1 AjBuj is
EUAJBur Remark 10.2.3 When V = I, the BLUE XB simplifies to PXY. When V is nonsingular, it simplifies to X{X'V~lX)-X'V~lY, according to Proposition 7.3.11. Remark 10.2.4 When A' C C(X'), all the elements of the matrix AB are estimable. It can be observed that vec(AB) is a special case of Yfj=x AJBUJ where Aj = Uj
(10.2.2)
D(vec(E)) = -Z®(V -T),
(10.2.3)
where T = V- V(I-PX){(I-PX)V(I-PX)}-{I-PX)V.
(10.2.4)
Note that AJBUJ = (u^-®(i4j-X~))vec(XB). It follows from (10.2.2) that the dispersion of the BLUE S?=i AJBUJ is
D [j^AjBuA = t t ^ ^ V=l
/
(10.2.5) ( 10 - 2 - 5 )
i=l j=l
where ((<Xij)) = S, Z?^- = AiX~T{X~)'A'j for i,j = l,...,q and T is as in (10.2.4). Consider the special case Aj — Uj <S> A, j — I,... ,q. If AB is a matrix of estimable LPFs, then vec(AB) = (Iqxq ® (AX~))vec(XS). It follows directly from (10.2.2) that D{vec(AB)) = E ® [ A X " T ( X - ) ' A ' ] .
(10.2.6)
10.3 Unbiased estimation of error dispersion
435
When V is nonsingular, (10.2.2) simplifies to D(vec(Y)) = S® [X(X'V-lX)-X\. When V = J, D(vec(Y)) further simplifies to S ® Px. The expressions for the dispersions of vec(JS) and J2Qj=i AJBUJ also simplify accordingly. Remark 10.2.5 Suppose that we are only interested in the estimable linear parametric function AJBUJ for a fixed j . We may ignore all but the jth character and use the univariate response linear model formulation of Chapter 7. The resulting BLUE and its dispersion would be same as those obtained from Proposition 10.2.2 and (10.2.5). Therefore, there is no need to invoke the multivariate linear model for this purpose. However, analysis from the univariate linear model will not be adequate if we are interested in inference involving more than one character, as the cross-terms of (10.2.5) corresponding toCTJJfor i ^ j are in general D not equal to zero. Remark 10.2.6 Comparison of the BLUE of XB given in Proposition 10.2.2 and (7.3.4) reveals that every column of Y belongs to C(X) with probability 1 (see Remark 7.3.10). Therefore, C{Y) C C(X) with probability 1. Likewise, vec(.E) almost surely belongs to C(£
D
Unbiased estimation of error dispersion
In order to estimate S, we have to rely on the LZFs which are functions of the model errors only. We would like to consider an adequate number of vector LZFs having dispersion matrix equal to S. Definition 10.3.1 A vector of LZFs of the multivariate linear model (10.1.1) is called a normalized linear zero function (NLZF) if its dispersion matrix is proportional to S.
436
Chapter 10 : Multivariate Linear Model
Definition 10.3.2 A set of NLZFs is called a generating set of NLZFs if every NLZF is almost surely a linear combination of the NLZFs contained in it. As an example, the rows of the residual matrix E constitute a generating set of NLZFs. Another example is the set of rows of (I — PX)Y (see Exercise 10.4). Definition 10.3.3 A generating set of NLZFs is called a standardized basis set of NLZFs if every pair of NLZFs contained in it is uncorrelated, and each NLZF has dispersion equal to E. The NLZFs are akin to the rows of S. A standardized basis set contains the maximum possible number of uncorrelated sets of NLZFs, which can be utilized to estimate S. The next proposition proves an invariance which is crucial to the derivation of a meaningful estimator. Proposition 10.3.4 If Z is any matrix whose rows constitute a standardized basis set of NLZFs, then (a) Z has p(V : X)-p(X) rows; (b) the value of Z'Z does not depend on the choice of the standardized basis set. Proof. Let Z be a matrix whose rows constitute a standardized basis set of NLZFs, and let p\ be the number of its rows. Since the set of rows of ZPlXq as well as that of Enxq are generating sets of NLZFs, there are matrices BPlXn and Cnxpi such that E — CZ and Z = BE. It follows from Exercise 2.25 that p(0(vec(Z))) = p(S ® J P l X p i ) = pipCS). Also, p(£>(vec(Z))) = = = =
p(D[(I ® B)vec(E)]) < p(D(wec(E))) p(I>[(I®C)vec(Z)]) p([(I®C)(S®/)(J®C")]) p{V®(CC')) = p{T,)p{CC) < Pl p(E).
10.3 Unbiased estimation of error dispersion
437
The last the step is a consequence of the fact that p(C) < p\, since p\ is the number of columns of C. In summary, we have the following chain of equalities: PlP(2)
= p(D(vec(Z))) < p(D(vec(E))) < p l P (S).
All the terms should therefore be the same. Hence, from (10.2.3) and Proposition 7.3.9, PlP&)
= p(D(vec(E))) = P(V)P(V(I
- Px)).
Assuming that p(S) > 0, we have p\ = P(V(I - Px)) = p(V : X) p(X) (see Proposition 2.4.4(b)). This proves part (a). In order to prove part (b), note that £®/,ixPl
= D(vec(Z)) = D[(I ® B)vec(E)] = D[{I
Therefore, BCC'B' — IPlxPl. Hence, the p\ x p\ matrix BC is orthogonal, and C'B'BC = I. Consequently, B'B is a g-inverse of CC', and Z'Z = E'B'BE = E'{CC')-E. Since D{vec{E)) = S
438
Chapter 10 : Multivariate Linear Model
Part (b) of Proposition 10.3.4 ensures that Ro is well-defined. Part (a) of this proposition implies that E(Ro) = E(Z'Z) = (p(V : X) - p{X)) E. Therefore, an unbiased estimator of S is £ = (p(V : X) - p(X))-1 Ro.
(10.3.1)
We conclude this section by presenting a few expressions for Ro, the error sum of squares and products matrix. It follows from Exercises 10.4 and 10.5 that R0 = E'[V(I-PX){(I-PX)V(I-PX)}-(I-PX)V]-E = Y'(I - PX){(I - PX)V(I - PX)}-(I - PX)Y.
(10.3.2) (10.3.3)
These alternative forms will be useful later in this chapter. Let us write the expression of (10.3.2) as E'K~~E, where K — VM and M = {I - PX){{I - PX)V(I - PX)}~(I - PX)V. Note that if U is any symmetric matrix, then K = (V + XUX')M — M'{V + XUX')M. Since S
= Y,L'K{V+ XUX')-KLV = ZL'M'(V + XUX')ML?: = HL'KK-KLY,
= E'K'E
= VL'KLV = RQ.
Consequently, Ro can be written as E'(V + XUX')~E where U is any symmetric matrix, and any g-inverse can be used in this expression. In particular, Ro = E'V'E and E = {p{V : X) - p(X))"1 E'V'E.
(10.3.4) (10.3.5)
Remark 10.3.6 Let pi = p(V : X) — p(X) and ZpiXq be as in Proposition 10.3.4. According to Proposition 3.1.1, every column of Z'
10.4 Maximum likelihood estimation
439
is in C(£) with probability 1. Therefore, C(Z') C C(£) almost surely. Let BB' be a, rank-factorization of £ . Then it can be verified that the p(£) x pi matrix B~LZ' has uncorrelated elements, each having mean zero and variance 1. If the joint distribution of these elements have a density over the entire pi/?(]S)-dimensional Euclidean space, then it can be shown that the rank of B~LZ' is almost surely equal to the minimum of pi and p(£) (see Okamoto, 1973). Therefore, whenever p\ > p(£), we have p{Z'Z) > p{B-LZ'ZB-L')
= p(£) almost surely.
Comparing this with the earlier observation on column spaces, we have C(RQ) = C(Z'Z) = C(£) and p(R0) = p(£) with probability 1, whenever px > p(£). 10.4 10.4.1
Maximum likelihood estimation Estimator
of mean
Let vec{Ynxq) ~ N(vec{XnxkBkxq), qxq ® VnXn). The task of finding the MLE of XB and S is similar to that considered in Section 7.5, in the case q = 1. It follows from Proposition 7.5.1 that the MLE of XB coincides with its BLUE, for fixed £
(10.4.1)
The dispersion matrix of the MLE is given by (10.2.2). The above value of XB maximizes the likelihood, and in the process minimizes the exponent, vec(r - XB)'(E ® V)-vec{Y - XB), which can also be written as tr[(Y - XB)'V~{Y - X S ) S " ) . The smallest or residual value of above the quadratic function of B happens to be tr(E'V~E'£~). The corresponding value of the sum of squares
440
Chapter 10 : Multivariate Linear Model
is E'V~E. This is why and products matrix, (Y-XB)'V-(Y-XB), the error sum of squares and products matrix is also called the 'residual sum of squares and products matrix'. 10.4.2
Estimator of error dispersion
We derive the MLE of S without assuming that it is positive definite, and without invoking the messy calculus that is used in some other books. The MLE is obtained by maximizing the likelihood function with respect to S, after substituting XB with its MLE, XB. Simplifying the expression of the density given in Section 7.5, we have the partially maximized log-likelihood (also called log-likelihood profile) -p{V where BB' and CC' are rank-factorizations of S and V, respectively (see Exercise 2.25). The log-likelihood profile further simplifies to _p(V)p(E) i Q g ( 2 7 r ) _ 1 l o g | ( B , B ) 3 ( c , c ) 1 _
^E'y-E^-y
According to the result of Exercise 2.26, \{B'B) ® (C'C)| = \B'B\PW
. \C'C\pW.
Consequently, the log-likelihood profile is maximized when we minimize p{V)\og\B'B\ + ti(RoI}-) with respect to B. Let B*B't be a rank-factorization of RQ/p(V). It follows from Remark 10.3.6 that B* and B have the same column space, with probability 1. Let Q be an invertible matrix such that JB* = BQ. In order to maximize the log-likelihood profile, we have to minimize the following with respect to Q: ]og\B'B\ + tT(B*B't{BB')-) = log \Q-vB'tB*Q-l\ + tT(B',{BB')-BJ = \og\B[B*Q-lQ-l'\ + tv{Q'PB,Q) = log|B'J3*|-log|Q'Q|+tr(Q'Q).
10.4 Maximum likelihood estimation
441
The second equality follows from the fact that P& = I whenever B has full column rank. If Ai,..., Ap(£) are the eigenvalues of Q'Q, then tr(g'Q)-log|Q'Q| = X ; ^ - l o g A i . J=I
The quantity x — log x is always greater than or equal to 1 for all positive x, and is equal to 1 only when x = 1. Therefore, the log-likelihood profile is minimized when all the eigenvalues of Q'Q are 1. This happens if and only if Q'Q = / = QQ', that is, when S - B B ' = B*Q-l{Q-l)'B', = B*B', = R0/p(V). Therefore, (10.4.2) Comparing with the unbiased estimator of (10.3.5) we find that the MLE of S is negatively biased. 10.4.3
REML estimator of error dispersion
By taking a cue from Section 8.2.3, let us now derive the residual maximum likelihood estimator of S, assuming again that vec(YnXq) ~ N(XnXkBkXq, S gX(7 ®V nxn )- The REML estimator maximizes the likelihood function computed from the joint density of (/ — I®Px)vec{Y). It is easy to see that (/ - I® Px)vec{Y) ~ N(0, E ® [(/ - PX)V{I - Px)]). The task is similar to maximizing the likelihood profile in the case of the and V replaced by (I-PX)V(IMLE, with E replaced by (I-PX)Y Px). Therefore, the REML estimator can be obtained by making these substitutions in (10.4.2). Therefore, ^
^REML
1 =
p((I-Px)V(I-Px)) - PX){(I - PX)V(I - PX)}-(I - PX)Y
= pmi-Px))**
=
p(V : X) -
P(X)R°-{WA-3)
442
Chapter 10 : Multivariate Linear Model
The simplification occurs due to (10.3.3). Thus, the REML estimator of £ coincides with the unbiased estimator (10.3.5) derived earlier. 10.5
Effect of linear restrictions
10.5.1
Effect on estimable LPFs, LZFs and BLUEs
Consider the linear restriction AB = \& on the parameters of the linear model (10.1.1). The restriction can be interpreted as a set of observations with zero error, and appended to the other observations. Thus, we have the model equation
(;)-C). + ($)
w*
We shall refer to the above model as A4R, and use the notation M. for the linear model of (10.1.1). Thus, M
= (vec(Y),(J®X)vec(£),S®F),
«* - H£).M*)H8>^(o l)\ The restriction must satisfy the condition C(^) C C(A) for algebraic consistency. Also, in order that the restrictions are consistent with the observed data, the response matrix Y must satisfy the condition
cQcc(s®(g
°) : J ® ( ^ ) )
almost surely. (10.5.2)
We adapt Proposition 7.9.1 to the present situation and state it below. Proposition 10.5.1 model M..
Let AB = *S? be a consistent restriction on the
(a) All estimable LPFs of the unrestricted model M. are estimable under the restricted model MR. (b) All LZFs of the unrestricted model are LZFs under the restricted model.
10.5 Effect of linear restrictions
443
(c) The restriction can only reduce the dispersion of the BLUE of vec(XB). (d) The restriction can only increase the error sum of squares and products matrix, in the sense of the Lb'wner order. The proof proceeds along the lines of that of Proposition 7.9.1, with no major change in the multivariate case. D 10.5.2
Change in error sum of squares and products
Let us now assume that AB is estimable, that is, C(A') C C(X'). The BLUEs of the unrestricted model which turn into LZFs under restrictions are identified in Proposition 7.9.2. However, we are not interested in all BLUEs that turn into LZFs, but rather in those groups of BLUEs in M which may turn into NLZFs because of the restriction — thus affecting the error sum of squares and products matrix. The following result is a extension of Proposition 7.9.2 in this direction. Proposition 10.5.2 Let C(*) C C{A) and C(A') C C{X'). Let AB be the BLUE of AB under the model M. (a) AB — \& is a matrix whose rows are NLZFs under the model MR.
(b) The elements of AB — if? are uncorrelated with all NLZFs of M. (c) There is no nontrivial NLZF of MR which is uncorrelated with AB-* and the NLZFs of M. Proof. According to the model MR, the expected value of AB is 0. The dispersion of vec(AB) is given by (10.2.6). Therefore, the dispersion of the any row of AB is proportional to S. This proves part (a). Part (b) follows from the fact that AB is a BLUE of M, which must be uncorrelated with all LZFs in M, and in particular with the NLZFs. Part (c) is a consequence of part (c) of Proposition 7.9.2. Proposition 10.5.2 shows that a generating set of NLZFs of MR can be obtained by augmenting a generating set of NLZFs of M (say, the rows of E) with the rows of AB — \I>, and that the two sets of NLZFs are uncorrelated. If we denote the error sum of squares and products
444
Chapter 10 : Multivariate Linear Model
matrix of MR by RH, then we have from (10.2.3), (10.2.6) and Exercise 10.5
R»
- ( * ' ^ - * ) ' ) ( V Ax-nx-Y*)'(AS*-*) = E'(V - T)~E + {AB-¥)'[AX-T{X-)'A']-(AB - *) = Ro + iAB-VyiAX-TiX-yA^-iAB-V), (10.5.3)
where T is as in (10.2.4). This decomposition of the error sum of squares and products matrix under the restriction AB = 1J> is similar to (7.9.4) obtained in the case of univariate response. 10.5.3
Change in 'BLUE' and mean squared error matrix
Suppose that the restriction AB = \& is not necessarily true, but it is imposed anyway to reduce dispersion of the estimators, possibly at the expense of bias. Let the matrices £ and [AX~T(X~)'A'] used in (10.2.6) be positive definite. It follows from Proposition 10.5.2 and the covariance adjustment principle of Proposition 3.1.1 that the BLUE of XB under the restriction is obtained from the unrestricted BLUE by removing the latter's correlation with AB—St. Thus, vec(XBR)
= vec(XB) Cov(vec(XB),vec(AB))[D(vec(AB))}-vec(AB-&).
In view of (10.2.2) and (10.2.6), this expression simplifies to vec(XBR) = vec(X/?)-vec (T(X")''A'[AX'T^-)''A']- l {AB-*)) . This is similar to the expression given in Section 7.9.2 in the case of univariate response. Following that derivation, we obtain that MSE(vec(XBR))
< MSE(vec(XB))
in the sense of Lowner order if [vec(AB - ¥)]'[£ ® (AX-T(X-)'i')]" 1 [vec(AB - *)] < 1.
10.6 Tests of linear hypotheses
445
This condition can also be written as
tr[{AB - *)'(AX-T{X-)'A')-l{AB
- ^ S " 1 ] < 1.
which is satisfied if * is close to the true value of AB or £ is large. 10.6
Tests of linear hypotheses
We now assume that vec{Ynxq)
~ N{vec{XnxkBkxq),
S g x g
Suppose that we wish to test the null hypothesis Mo : AB = V
(10.6.1)
against the alternative hypothesis Ui : AB^^l.
(10.6.2)
We assume that the data conforms to the consistency condition (10.5.2). The issue of testability of this hypothesis is similar to that in the univariate response case (see Section 5.3.1). We assume that the hypothesis %o is completely testable, that is, C(A') C C(X'). We also assume that there is no algebraic inconsistency in the hypothesis, that is, C(*) CC(A). The decomposition (10.5.3) suggests that the data supports the null hypothesis if the matrix RQ is not too small compared to RJJ. There is no unique way to measure the 'smallness' of one matrix compared to another. We shall consider a few tests which essentially compare these matrices in different ways.
10.6.1
Generalized likelihood ratio test
Recall from Section 10.4.2 that the log-likelihood function, after maximization with respect to XB and S, depends linearly on P(v)iog|i?;B,|,
446
Chapter 10 : Multivariate Linear Model
where B^B^ is a rank-factorization of Ro- This result was derived without assuming that £ or V is positive definite. The linear model (10.1.1) under the hypothesis WQ is equivalent to the unrestricted model (10.5.1). The error sum of squares and products matrix for this model is RHTherefore, the log-likelihood function, after maximization with respect to XB and £ subject to the restriction AB = \&, is a linear function of
/»(F)iog|s;sti, where B^B'^ is a rank-factorization of RJJ- (Note that the rank of the error dispersion matrix of (10.5.1) is p(V).) It follows from (3.8.2) that the GLRT is a monotone increasing function of
A = !!Hr!'
(10-6-3)
small values of which should lead to the rejection of %Q. This statistic, proposed by Wilks (1932), is known as Wilks' A statistic. It follows from Remark 10.3.6 that C(B*)=C(Bi)=C(B)=C(E). As a result, the matrices J3*, B^ and B have the same rank and order q x po- In the special case where £ is positive definite, the Wilks' A statistic simplifies to
A = iM.
ao,,,
Returning to the general case, we now show that the distribution of A under WQ depends only on thefollowingnumbers: Po = p(S), Pl = p(V:X)-p(X), p2 = p(AX-T(X-)'A'), where T is as in (10.2.4). We assume that px > p0. Since C(B*) = C(B), B* can be written as BB~LB*. Hence, B'*B*\ = \B'*{B-L)'B'BB-LB*\ = \B-LB*B',(B-L)'\ \B'B\ = \B-LRo(B-L)'\
\B'B\.
10.6 Tests of linear hypotheses
447
If Z\ is a matrix whose rows constitute a standardized basis set of NLZFs of (10.1.1), then Ro = Z[Zi, the rows of the p 0 x q matrix Z\ being independent and distributed as N(0, S). Therefore, \B',B,\ = \B-LZ[Zi(B-L)'\
\B'B\.
Note that the elements of the p\ x po matrix Z\(B~L)' has independent elements with the standard normal distribution. Likewise, if Z
\B'B\.
In the above, the p2 x Po matrix Z2{B~~L)' is independent of Zi(B~L)', and has independent elements having the standard normal distribution. It follows that
A
\z'ozo\
\Z'0ZQ + Z'hZh\' where the elements of the pi x po matrix ZQ — Zi(B~L)' and the P2 x Po rnatrix Z^ — Z2{B~L)' are independent and have the standard normal distribution. Therefore, the distribution of A depends only on Pi, i = 0,l,2. The joint distribution of the elements of Z'0ZQ is called the standard Wishart distribution with parameters (po,Pi). We shall denote it by WPOjPl. The density of this distribution is proportional to |^Zo|^-"°- 1 )/ 2 exp[-tr(Z( ) Zo)/2]. Obviously the joint density of the elements of Z\Z\ is WP0)P2. See Bhimasankaram and Sengupta (1991) for a more general form of this distribution resulting from singular normal distributions. The distribution of A is called the Wilks' A distribution with parameters (po> Pi iP2)- The distribution is the same as that of the product of po independent beta-distributed random variables with parameters (P\ - P o + 1 P2\ V 2 ' 2j''"'\2
(Pj_ P 2 \ '2 ) '
See Rao (1973) for some approximations of this distribution. The exact distribution in an important special case is derived in Section 10.6.3.
448 10.6.2
Chapter 10 : Multivariate Linear Model Roy's union-intersection
test
Roy (1953) introduces an interesting approach to the problem of testing multiple hypotheses. The use of this approach in the present context leads to a useful test. The null hypothesis (10.6.1) consists of a matrix equality involving B. It holds if and only if the vector equations ABl = * / hold for all I. Thus, we can think of Ho as the intersection of all the hypotheses of the form HOi : ABl = VI for all I, that is, Ho = PHHOL.
The alternative hypothesis corresponding to HQI for a fixed I is
Uu : ABl^m. The overall alternative hypothesis can be written as the union Hx = UtHu. If we fix I for the time being, then we can frame the problem of testing T-Loi against H\i in the context of the univariate response linear model (Yl,XBl,(l'Sl)V). The GLRT for this problem was derived in Section 7.11. The test rejects Hoi if the ratio {R2m —RQ^/RQI is too large, where R^ is the error sum of squares for the above model and R2Hl is the error sum of squares under the hypothesis T-LQI. It is easy to see that R^ — I'RQI and R2Hl — I'RjjlTherefore, Hoi should be rejected if the ratio l'(RH - Ro)l/(lfRol) is sufficiently large. Let us now return to the original hypothesis, %Q. Since it is the intersection of all the hypotheses of the form Hoi, it should be rejected if the data carries evidence against any one of the sub-hypotheses. In other words, HQ should be rejected if TR
= sup
V(RH-RQ)1 TTB-
10.6 Tests of linear hypotheses
449
is too large. The statistic TR is called the union-intersection statistic for n0. Suppose that B*B[ is a rank factorization of Ro- Since C(S»), C(RH) and C(S) are identical with probability 1 (see Remark 10.3.6), we can write RH as B*B~LRHTherefore, l'BtB-LRH(B;L)'B'J TR
=
T
VKBJ
l
k'B-LRH(BZL)'k =
Wk
T
L
It follows from Proposition 2.8.2 that TR + 1 is the largest eigenvalue of the matrix B~LRH(B~L)''. If -Bf-Bf is a rank-factorization of H#, then TR + 1 is also the largest eigenvalue of B'^(B~L)'B~LBj, that is, the largest eigenvalue of the matrix B[RQ B^. Proposition 10.6.1 of the next section guarantees that when fio holds, the distribution of TR does not depend on S. Like Wilks' A statistic, the union-intersection statistic is also a function of the parameters po, p\ and P2, described in the previous section. When S is positive definite, Ro is also positive definite with probability 1. In this special case, the union-intersection statistic reduces to the largest eigenvalue of (RH — Ro)Ro1- The distribution of TR in another special case is derived in Section 10.6.3. 10.6.3
10.6.3 Other tests
Let $B_*B_*'$ and $B_\dagger B_\dagger'$ be rank factorizations of $R_0$ and $R_H$, respectively. Making use of the identity of the column spaces of $R_0$ and $R_H$, the Wilks' $\Lambda$ statistic can alternatively be expressed as
$$\Lambda = \frac{|B_*'B_*|}{|B_\dagger'B_\dagger|}
= \frac{|B_*'B_*|}{|B_\dagger'(B_*^{-L})'B_*'B_*B_*^{-L}B_\dagger|}
= \frac{|B_*'B_*|}{|B_*'B_*|\,|B_\dagger'(B_*^{-L})'B_*^{-L}B_\dagger|}
= \frac{1}{|B_\dagger'R_0^-B_\dagger|}.$$
Thus, $\Lambda$ and $T_R$ are both functions of the matrix $B_\dagger'R_0^-B_\dagger$. We shall now show that under $\mathcal{H}_0$, this matrix is invariant of $\Sigma$.
Proposition 10.6.1 Under the above set-up, the matrix $B_\dagger'R_0^-B_\dagger$ does not depend on $\Sigma$.

Proof. Let $BB'$ be a rank-factorization of $\Sigma$. Since the column spaces of the matrices $B$, $B_*$ and $B_\dagger$ are identical with probability 1 (see Remark 10.3.6), we can write
$$B_\dagger'R_0^-B_\dagger = [B_\dagger'(B^{-L})'B']\,R_0^-\,[BB^{-L}B_\dagger] = [B_\dagger'(B^{-L})'][B^{-L}R_0(B^{-L})']^{-1}[B^{-L}B_\dagger].$$
It was shown in Section 10.6.1 that $B^{-L}R_0(B^{-L})'$ does not depend on $\Sigma$. Likewise, $B^{-L}R_H(B^{-L})'$ is free of $\Sigma$, and $[B^{-L}B_\dagger][B^{-L}B_\dagger]'$ can be any rank-factorization of $B^{-L}R_H(B^{-L})'$. The result follows.

The (almost surely) nonsingular matrix $B_\dagger'R_0^-B_\dagger$ holds the key to the testing problem. The distribution of this matrix under $\mathcal{H}_0$ depends only on the parameters $p_0$, $p_1$ and $p_2$, defined in Section 10.6.1. We have already seen that the Wilks' $\Lambda$ and union-intersection statistics are scalar functions of this matrix. Other scalar functions can also be used. Lawley (1938) and Hotelling (1951) suggest a test which, in the present context, amounts to rejecting $\mathcal{H}_0$ when the trace ($T^2$) of the matrix $(\rho(V:X)-\rho(X))[B_\dagger'R_0^-B_\dagger - I]$ is too large. Generalization of another test suggested by Pillai (1955) leads to the rejection of $\mathcal{H}_0$ when the trace ($T_P$) of the matrix $I - (B_\dagger'R_0^-B_\dagger)^{-1}$ is too large. All the four statistics can be expressed in terms of simple functions of the eigenvalues of $B_\dagger'R_0^-B_\dagger$. When $\Sigma$ is positive definite, these eigenvalues coincide with those of $R_HR_0^{-1}$. See Arnold (1981, Chapter 19) for asymptotic distributions of the test statistics.

Let us now consider the special case where $p_2 = 1$, that is, $R_H - R_0$ is a matrix of rank 1. Some examples of this case are given in Section 10.8. We now show that the four statistics described so far are equivalent in this case.

Proposition 10.6.2 In the above set-up, let $R_H - R_0$ be a matrix of rank 1. Then the GLRT, Roy's union-intersection test, the
Lawley-Hotelling test and Pillai's test are equivalent to rejecting the null hypothesis for large values of $r_H'R_0^-r_H$, where $R_H - R_0 = r_Hr_H'$.

Proof. Let $B_*B_*'$ be a rank factorization of $R_0$. Since $\mathcal{C}(R_H) = \mathcal{C}(R_0)$ with probability 1 (see Remark 10.3.6), we have $r_H \in \mathcal{C}(R_0)$ almost surely. Thus, $r_H'R_0^-r_H$ is well-defined for any choice of the g-inverse. It has already been observed that the four test statistics are functions of the eigenvalues of $B_\dagger'R_0^-B_\dagger$, where $B_\dagger B_\dagger'$ is a rank-factorization of $R_H$. The set of eigenvalues of $B_\dagger'R_0^-B_\dagger$ is the same as the set of nonzero eigenvalues of $R_0^-B_\dagger B_\dagger'$, that is, the set of non-zero eigenvalues of $R_0^-(R_0 + r_Hr_H')$. By expressing $r_H$ as $B_*r$ for some vector $r$, we have
$$R_0^-(R_0 + r_Hr_H') = (B_*B_*')^-B_*[I + rr']B_*'.$$
The non-zero eigenvalues of the latter matrix are the same as the eigenvalues of the nonsingular matrix $B_*'(B_*B_*')^-B_*(I + rr')$, which simplifies to $I + rr'$. This matrix has $\rho(\Sigma)$ eigenvalues, one of which is $1 + r'r$ while the others are equal to 1. Since $r'r = r_H'(B_*^{-L})'B_*^{-L}r_H = r_H'R_0^-r_H$, we conclude that $\rho(\Sigma) - 1$ eigenvalues of the matrix $B_\dagger'R_0^-B_\dagger$ are equal to 1 and the remaining one is equal to $1 + r_H'R_0^-r_H$. It follows that
$$\Lambda = \frac{1}{1 + r_H'R_0^-r_H}, \qquad T_R = r_H'R_0^-r_H, \qquad T^2 = (\rho(V:X)-\rho(X))\,r_H'R_0^-r_H, \qquad T_P = \frac{r_H'R_0^-r_H}{1 + r_H'R_0^-r_H}.$$
Therefore, all the tests are equivalent to rejecting the null hypothesis when $r_H'R_0^-r_H$ is too large.

The statistic $T^2$ described in the context of Proposition 10.6.2 is known as Hotelling's $T^2$ statistic. We now prove a result which leads to the null distribution of the statistic $r_H'R_0^-r_H$.

Proposition 10.6.3 Let $u_0, u_1, \ldots, u_n$ be independent random vectors of order $q\times 1$, having the distribution $N(0,\Sigma)$. Let
$Z = (u_1 : \cdots : u_n)'$ and $q_* = \rho(\Sigma) \le n$. Then $u_0'(Z'Z)^-u_0\,(n - q_* + 1)/q_*$ has the $F_{q_*,\,n-q_*+1}$ distribution.

Proof. Let $F_{q\times q_*}$ be such that $FF'$ is a rank-factorization of $\Sigma$ and $v_j = F^{-L}u_j$, $j = 0,1,\ldots,n$, for a fixed left-inverse of $F$. Note that the independent random vectors $v_0, v_1, \ldots, v_n$ have the distribution $N(0, I_{q_*\times q_*})$. It follows from Proposition 3.1.1 and Remark 10.3.6 that $u_0 \in \mathcal{C}(\Sigma) = \mathcal{C}(F) = \mathcal{C}(Z')$ with probability 1. Therefore,
$$u_0'(Z'Z)^-u_0 = u_0'(F^{-L})'F'(Z'Z)^-FF^{-L}u_0 = v_0'\Big(\sum_{j=1}^n v_jv_j'\Big)^{-1}v_0.$$
This reduction shows that it is enough to prove the statement of the proposition for $\Sigma = I$. We now assume that $\Sigma = I$, replace $(Z'Z)^-$ by $(Z'Z)^{-1}$, and do not make a distinction between $q$ and $q_*$. Let $d = \|u_0\|^{-1}u_0$ and $C$ be a $q_*\times(q_*-1)$ matrix such that $B' = (C : d)$ is an orthogonal matrix. Given $u_0$, the vectors $w_j = Bu_j$, $j = 1,\ldots,n$, are independent and have the distribution $N(0, I_{q_*\times q_*})$. Let $Z_* = (w_1 : \cdots : w_n)' = ZB'$. Then the ratio $u_0'(Z'Z)^{-1}u_0/(u_0'u_0)$ is seen to be the last diagonal element of $B(Z'Z)^{-1}B'$, that is, the last diagonal element of $(Z_*'Z_*)^{-1}$. Using the block matrix inversion formula (see page 31) for the matrix
$$Z_*'Z_* = \begin{pmatrix} (ZC)'ZC & (ZC)'Zd \\ (Zd)'ZC & (Zd)'Zd \end{pmatrix} = \begin{pmatrix} X_*'X_* & X_*'y_* \\ y_*'X_* & y_*'y_* \end{pmatrix},$$
where $X_* = ZC$ and $y_* = Zd$, we have the following expression for the last diagonal element of its inverse:
$$d'(Z'Z)^{-1}d = [y_*'y_* - y_*'X_*(X_*'X_*)^{-1}X_*'y_*]^{-1}.$$
Consequently,
$$\frac{u_0'u_0}{u_0'(Z'Z)^{-1}u_0} = y_*'y_* - y_*'X_*(X_*'X_*)^{-1}X_*'y_*.$$
For given $u_0$, the above quantity is the error sum of squares for the linear model $(y_*, X_*\beta, \sigma^2I)$. Since the elements of $y_*$ and $X_*$ are independent and have the standard normal distribution (given $u_0$), it follows that the error sum of squares has the $\chi^2_{n-\rho(X_*)}$ distribution. Note that $\rho(X_*) = \rho(C) = q_* - 1$, and that the above conditional distribution does not depend on $u_0$. Therefore, the conditional distribution is the same as the unconditional distribution and the chi-square statistic is independent of $u_0'u_0$. Therefore, the ratio
$$\frac{u_0'u_0/q_*}{\left[\dfrac{u_0'u_0}{u_0'(Z'Z)^{-1}u_0}\right]\Big/(n-q_*+1)} = \frac{n-q_*+1}{q_*}\,u_0'(Z'Z)^{-1}u_0$$
has the stated F distribution.

It follows from Propositions 10.6.2 and 10.6.3 that under the null hypothesis, the statistic $r_H'R_0^-r_H\,(p_1 - p_0 + 1)/p_0$ has the $F_{p_0,\,p_1-p_0+1}$ distribution, where $p_0$ and $p_1$ are as defined in Section 10.6.1.
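Proposition 10.6.3 is easy to verify by simulation. The following Python sketch is an illustration of ours under an arbitrary nonsingular $\Sigma$ (the dimensions and seed are our own choices); it compares the empirical upper quantile of $u_0'(Z'Z)^{-1}u_0(n-q+1)/q$ with the corresponding quantile of the $F_{q,\,n-q+1}$ distribution.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
q, n, nsim = 3, 20, 10000
A = rng.standard_normal((q, q))
Sigma = A @ A.T                             # a nonsingular dispersion matrix
L = np.linalg.cholesky(Sigma)

stat = np.empty(nsim)
for s in range(nsim):
    U = rng.standard_normal((n + 1, q)) @ L.T   # rows: u_0, u_1, ..., u_n ~ N(0, Sigma)
    u0, Z = U[0], U[1:]
    stat[s] = u0 @ np.linalg.solve(Z.T @ Z, u0) * (n - q + 1) / q

# compare empirical and theoretical upper 5% points of F_{q, n-q+1}
print(np.quantile(stat, 0.95), stats.f.ppf(0.95, q, n - q + 1))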
10.6.4 A more general hypothesis
The tests of Sections 10.6.1-10.6.3 can be used for any hypothesis of the form $ABC = \Psi$, against the alternative hypothesis $ABC \neq \Psi$, where $C$ is a known matrix having full column rank. To see this, transform the response matrix $Y$ linearly into $YC$, and consider the model $(\mathrm{vec}(YC),\ (I\otimes X)\mathrm{vec}(B_*),\ \Sigma_*\otimes V)$, where $B_* = BC$ and $\Sigma_* = C'\Sigma C$. The hypothesis $ABC = \Psi$ is of the form $AB_* = \Psi$ in the transformed model.

Example 10.6.4 Suppose that the height $y_{ijk}$ of the $i$th individual
of the $j$th group at age $k_0+k$ years follows the linear model
$$y_{ijk} = \mu_{jk} + \epsilon_{ijk}, \qquad i = 1,\ldots,n_j,\ j = 1,2,\ k = 1,\ldots,k_1.$$
The model can be written as $Y = XB + \mathcal{E}$, where
$$Y = \begin{pmatrix} y_{111} & \cdots & y_{11k_1} \\ \vdots & & \vdots \\ y_{n_111} & \cdots & y_{n_11k_1} \\ y_{121} & \cdots & y_{12k_1} \\ \vdots & & \vdots \\ y_{n_221} & \cdots & y_{n_22k_1} \end{pmatrix}, \qquad
X = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix}, \qquad
B = \begin{pmatrix} \mu_{11} & \cdots & \mu_{1k_1} \\ \mu_{21} & \cdots & \mu_{2k_1} \end{pmatrix}.$$
Consider the following three hypotheses: (a) $\mathcal{H}_p$, that the two plots are parallel (same 'velocity' of mean height for boys and girls), (b) $\mathcal{H}_e$, that the two plots are identical (same mean height for both the groups at any given age) and (c) $\mathcal{H}_c$, that the lines are horizontal (no gain in mean height with age). Let
$$A = (1 : -1), \qquad C_{k_1\times(k_1-1)} = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ -1 & 0 & \cdots & 0 \\ 0 & -1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & -1 \end{pmatrix}.$$
Then the three null hypotheses can be written as $ABC = 0$, $AB = 0$ and $BC = 0$, respectively. Assuming that $\mathrm{vec}(Y) \sim N((I\otimes X)\mathrm{vec}(B),\ \Sigma\otimes I)$, these hypotheses can be tested along the lines of Sections 10.6.1-10.6.4.
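The hypothesis matrices of Example 10.6.4 are simple to construct numerically. The Python sketch below is an illustration of ours with simulated heights (the group sizes, number of ages and parameter values are arbitrary, not from the text); it sets up $A$ and $C$ and evaluates $A\hat BC$, $A\hat B$ and $\hat BC$ for a least squares fit $\hat B$.

import numpy as np

rng = np.random.default_rng(3)
n1, n2, k1 = 12, 15, 4                        # group sizes and number of ages (illustrative)
X = np.vstack([np.tile([1.0, 0.0], (n1, 1)),
               np.tile([0.0, 1.0], (n2, 1))])
B_true = np.array([[120, 126, 131, 135],      # mean heights, group 1 (made up)
                   [118, 125, 131, 136.]])    # mean heights, group 2 (made up)
Y = X @ B_true + rng.standard_normal((n1 + n2, k1)) * 2.0

A = np.array([[1.0, -1.0]])                   # contrast between the two groups
C = np.vstack([np.ones((1, k1 - 1)), -np.eye(k1 - 1)])   # k1 x (k1-1), as in the text

B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]  # least squares estimate of B
print("parallel profiles (A B C):", A @ B_hat @ C)   # H_p: ABC = 0
print("identical profiles (A B): ", A @ B_hat)       # H_e: AB  = 0
print("horizontal profiles (B C):", B_hat @ C)       # H_c: BC  = 0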
10.6.5 Multiple comparisons
The idea of using the Bonferroni inequality to conduct simultaneous tests of hypotheses (see Section 5.3.7) can be used in the multivariate
case also. Another procedure for multiple comparisons follows from Roy's union-intersection test (see Section 10.6.2). Consider the three hypotheses
$$\mathcal{H}_{kl} : k'ABl = k'\Psi l, \qquad \mathcal{H}_l : ABl = \Psi l, \qquad \mathcal{H}_0 : AB = \Psi.$$
Let $R_H$ and $R_0$ be as in Section 10.6.2, $R_{0l}^2 = l'R_0l$, and $R_{Hkl}^2$ and $R_{Hl}^2$ be the error sums of squares for the model $(Yl, XBl, (l'\Sigma l)V)$ under $\mathcal{H}_{kl}$ and $\mathcal{H}_l$, respectively. It follows from the decomposition (7.9.4) and Exercise 2.30 that $R_{Hl}^2 - R_{0l}^2$ is the largest possible value of $R_{Hkl}^2 - R_{0l}^2$ over all $k$. If $t_R$ is the level $\alpha$ cut-off for Roy's $T_R$ statistic for $\mathcal{H}_0$, then under $\mathcal{H}_0$ we have
$$P\left(\sup_{k,l}\frac{R_{Hkl}^2 - R_{0l}^2}{R_{0l}^2} > t_R\right) = P\left(\sup_l\frac{l'(R_H - R_0)l}{l'R_0l} > t_R\right) = P(T_R > t_R) = \alpha.$$
Thus, the test $(R_{Hkl}^2 - R_{0l}^2)/R_{0l}^2 > t_R$ for $\mathcal{H}_{kl}$ has probability of type I error no greater than $\alpha$, even if this test is carried out along with an arbitrary number of tests of the same kind (with other choices of $k$ and $l$).
10.6.6 Test for additional information
All the $q$ characters of the response may not be important in understanding how the response depends on the explanatory variables. Suppose that we partition the response $Y$ as $(Y_1 : Y_2)$, where $Y_j$ has $q_j$ characters, $j = 1,2$, and $q_1 + q_2 = q$. It is clear from (10.1.1) that $E(Y_j|X) = XB_j$ for $j = 1,2$, where $(B_1 : B_2)$ is a partition of $B$ such that $B_1$ has $q_1$ columns. Assuming the multivariate normality of $Y$ and partitioning $\Sigma$ conformably
with $Y$ as $\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$, we have
$$E(Y_2|Y_1,X) = XB_2 + (Y_1 - XB_1)\Sigma_{11}^{-1}\Sigma_{12}, \qquad (10.6.5)$$
$$D(\mathrm{vec}(Y_2)|Y_1,X) = (\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12})\otimes V. \qquad (10.6.6)$$
It is possible that $E(Y_2|Y_1,X)$ does not depend on $X$ at all. If this happens, then the knowledge of $Y_1$ is sufficient for obtaining the mean of $Y_2$, and the knowledge of $X$ is superfluous for this purpose. In other words, the characters included in $Y_1$ determine how all the characters of the response depend on the explanatory variables, and the characters included in $Y_2$ do not carry additional information in this regard. In practice, this may hold only in an approximate sense, that is, some characters of the response may carry most of the relevant information regarding the dependence on the explanatory variables. Elimination of characters with little information would result in reduction of dimensionality. Since the aim is to retain only the essential characters, let us assume that there is no linear dependence among the retained characters, that is, $\Sigma_{11}$ is positive definite. Then we can rewrite (10.6.5) as
$$E(Y_2|Y_1,X) = Y_1B_a + XB_b = (Y_1 : X)\begin{pmatrix} B_a \\ B_b \end{pmatrix},$$
where $B_a = \Sigma_{11}^{-1}\Sigma_{12}$ and $B_b = B_2 - B_1\Sigma_{11}^{-1}\Sigma_{12}$. The hypothesis '$XB_b = 0$' can be interpreted as 'no additional information carried by the last $q_2$ characters' of the response vector. The model for $Y_2$ given $Y_1$ is
$$\left(\mathrm{vec}(Y_2),\ (I\otimes(Y_1 : X))\,\mathrm{vec}\begin{pmatrix} B_a \\ B_b \end{pmatrix},\ (\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12})\otimes V\right).$$
The essential difference of this model with (10.1.1) is that some of the response variables of (10.1.1) are regarded here as explanatory variables. It is with respect to this model that the hypothesis
$$(0 : X)\begin{pmatrix} B_a \\ B_b \end{pmatrix} = 0$$
(or $XB_b = 0$) has to be tested. The problem is well within the general hypothesis testing framework of Sections 10.6.1-10.6.3.
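As an illustration of the test for additional information, the Python sketch below (our own simulated example with $V = I$ and an $X$ of full column rank; none of it is from the text) computes Wilks' $\Lambda$ for the hypothesis $XB_b = 0$ by comparing the residual sum of squares and products matrices of $Y_2$ regressed on $(Y_1 : X)$ and on $Y_1$ alone.

import numpy as np

def resid_ssp(Y, X):
    # residual sum of squares and products matrix of Y regressed on X (V = I)
    B = np.linalg.lstsq(X, Y, rcond=None)[0]
    R = Y - X @ B
    return R.T @ R

rng = np.random.default_rng(4)
n, k, q1, q2 = 60, 3, 2, 2
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, k - 1))])
Y1 = X @ rng.standard_normal((k, q1)) + rng.standard_normal((n, q1))
Y2 = Y1 @ rng.standard_normal((q1, q2)) + 0.5 * rng.standard_normal((n, q2))
# Y2 here depends on X only through Y1, so the 'no additional information'
# hypothesis X B_b = 0 holds for these simulated data.

R_full = resid_ssp(Y2, np.hstack([Y1, X]))   # error SSP under the full model
R_red = resid_ssp(Y2, Y1)                    # error SSP under the hypothesis
wilks_lambda = np.linalg.det(R_full) / np.linalg.det(R_red)
print(wilks_lambda)                          # close to 1 when Y2 carries no extra information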
10.7 Linear prediction and confidence regions
Suppose that $Y$ follows the multivariate linear model (10.1.1) and that we have to predict a new observation $y_0$. The combined set of observations follow the model
$$\left(\begin{pmatrix} Y \\ y_0' \end{pmatrix},\ \begin{pmatrix} X \\ x_0' \end{pmatrix}B,\ \Sigma\otimes\begin{pmatrix} V & v_0 \\ v_0' & v_{00} \end{pmatrix}\right).$$
The result of Exercise 3.9 implies that the BLP of $y_0$ given $Y$ is
$$E(y_0|Y) = [x_0'B + v_0'V^-(Y - XB)]',$$
and that the mean squared prediction error of the BLP is $\Sigma\otimes(v_{00} - v_0'V^-v_0)$.
Proof. $Ya + b$ is a linear unbiased predictor of $y_0$ if and only if it is a linear unbiased estimator of $x_0'B$ under the model (10.1.1). Part (a) follows from the fact that an LUE of $x_0'B$ exists only if $x_0'B$ is estimable. In order to prove the remaining three parts, let $x_0 \in \mathcal{C}(X')$ and $Ya + b$ be a linear unbiased predictor of $y_0$. Consider the decomposition
$$y_0 - Ya - b = (y_0 - E(y_0|Y)) - (\hat y_0 - E(y_0|Y)) + (\hat y_0 - Ya - b),$$
where $\hat y_0 = [x_0'\hat B + v_0'V^-(Y - X\hat B)]'$ is the predictor obtained from the BLP by replacing $B$ with its BLUE. The first term on the right hand side is the prediction error of the BLP $E(y_0|Y)$, which must be uncorrelated with $Y$ according to the result of Exercise 3.9. Therefore, this term is uncorrelated with the other two terms. On the other hand, the second term is the estimation error of the BLUE of $x_0'B - v_0'V^-XB$ in (10.1.1), while the third term is a linear zero function in this model. Therefore, these two terms are also uncorrelated. Consequently
$$E[(y_0 - Ya - b)(y_0 - Ya - b)'] = E[(y_0 - E(y_0|Y))(y_0 - E(y_0|Y))'] + E[(\hat y_0 - E(y_0|Y))(\hat y_0 - E(y_0|Y))'] + E[(\hat y_0 - Ya - b)(\hat y_0 - Ya - b)'].$$
The above expression is minimized if and only if the LZF $\hat y_0 - Ya - b$ is almost surely equal to zero. This proves parts (b) and (c). By setting $Ya + b = \hat y_0$ in the above equation, we have
$$E[(y_0 - \hat y_0)(y_0 - \hat y_0)'] = E[(y_0 - E(y_0|Y))(y_0 - E(y_0|Y))'] + E[(\hat y_0 - E(y_0|Y))(\hat y_0 - E(y_0|Y))']$$
$$= \Sigma\otimes(v_{00} - v_0'V^-v_0) + D((x_0'\hat B - v_0'V^-X\hat B)')$$
$$= \Sigma\otimes(v_{00} - v_0'V^-v_0) + D(\mathrm{vec}((x_0'X^- - v_0'V^-)X\hat B))$$
$$= \Sigma\otimes(v_{00} - v_0'V^-v_0) + D((I\otimes(x_0'X^- - v_0'V^-))\mathrm{vec}(X\hat B))$$
$$= \Sigma\otimes(v_{00} - v_0'V^-v_0) + \Sigma\otimes[(x_0'X^- - v_0'V^-)T((X^-)'x_0 - V^-v_0)],$$
which justifies the expression of part (d).

When the joint distribution of $Y$ and $y_0$ is multivariate normal, $y_0 - \hat y_0$ is also multivariate normal, and is independent of the estimator
$\hat\Sigma$ given in (10.3.1). It follows from Proposition 10.6.3 that the statistic
$$\frac{(y_0 - \hat y_0)'\hat\Sigma^-(y_0 - \hat y_0)}{(v_{00} - v_0'V^-v_0) + (x_0'X^- - v_0'V^-)T((X^-)'x_0 - V^-v_0)}\cdot\frac{\rho(V:X) - \rho(X) + 1 - \rho(\Sigma)}{\rho(\Sigma)\,(\rho(V:X) - \rho(X))}$$
has the $F_{\rho(\Sigma),\ \rho(V:X)-\rho(X)+1-\rho(\Sigma)}$ distribution. A $100(1-\alpha)\%$ prediction region for $y_0$ is
$$(y_0 - \hat y_0)'\hat\Sigma^-(y_0 - \hat y_0) \le \frac{\rho(\Sigma)\,(\rho(V:X) - \rho(X))}{\rho(V:X) - \rho(X) + 1 - \rho(\Sigma)}\,F_{\rho(\Sigma),\ \rho(V:X)-\rho(X)+1-\rho(\Sigma),\ \alpha}\,[(v_{00} - v_0'V^-v_0) + (x_0'X^- - v_0'V^-)T((X^-)'x_0 - V^-v_0)], \qquad (10.7.1)$$
where $F_{\rho(\Sigma),\ \rho(V:X)-\rho(X)+1-\rho(\Sigma),\ \alpha}$ is the $(1-\alpha)$ quantile of the F distribution with $\rho(\Sigma)$ and $\rho(V:X)-\rho(X)+1-\rho(\Sigma)$ degrees of freedom. When $V = I$, $v_0 = 0$ and $v_{00} = 1$, the point prediction of $y_0$ simplifies to $(x_0'\hat B)'$, its mean squared prediction error matrix is $[1 + x_0'(X'X)^-x_0]\Sigma$ and the $100(1-\alpha)\%$ prediction region for $y_0$ reduces to
$$(y_0 - \hat y_0)'\hat\Sigma^-(y_0 - \hat y_0) \le \frac{\rho(\Sigma)\,(n - \rho(X))}{n - \rho(X) + 1 - \rho(\Sigma)}\,F_{\rho(\Sigma),\ n-\rho(X)+1-\rho(\Sigma),\ \alpha}\,[1 + x_0'(X'X)^-x_0].$$
An elliptical confidence region for $B'x_0$ is
$$(B'x_0 - \hat y_0)'\hat\Sigma^-(B'x_0 - \hat y_0) \le \frac{\rho(\Sigma)\,(\rho(V:X) - \rho(X))}{\rho(V:X) - \rho(X) + 1 - \rho(\Sigma)}\,F_{\rho(\Sigma),\ \rho(V:X)-\rho(X)+1-\rho(\Sigma),\ \alpha}\,[(x_0'X^- - v_0'V^-)T((X^-)'x_0 - V^-v_0)], \qquad (10.7.2)$$
which can be derived along the lines of (10.7.1). If $V = I$, this region simplifies to
$$(B'x_0 - \hat y_0)'\hat\Sigma^-(B'x_0 - \hat y_0) \le \frac{\rho(\Sigma)\,(n - \rho(X))}{n - \rho(X) + 1 - \rho(\Sigma)}\,F_{\rho(\Sigma),\ n-\rho(X)+1-\rho(\Sigma),\ \alpha}\,x_0'(X'X)^-x_0.$$
Simultaneous confidence intervals for linear parametric functions can be constructed, using the ideas described in Section 10.6.5. Tolerance intervals in the spirit of Section 7.13.2 can also be obtained.
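For $V = I$ the prediction region is easy to compute. The following Python sketch (an illustration on simulated data; all dimensions and values are our own choices, not from the text) checks whether a new observation falls inside the $95\%$ prediction ellipsoid given by the simplified ($V = I$) form of (10.7.1).

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, k, q = 40, 3, 2
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, k - 1))])
B = rng.standard_normal((k, q))
Y = X @ B + rng.standard_normal((n, q))

B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
R0 = (Y - X @ B_hat).T @ (Y - X @ B_hat)
px = np.linalg.matrix_rank(X)
Sigma_hat = R0 / (n - px)                       # estimator of Sigma (V = I case)

x0 = np.array([1.0, 0.3, -1.2])                 # new explanatory values (made up)
y0 = x0 @ B + rng.standard_normal(q)            # the new observation to be covered
y0_hat = x0 @ B_hat

d2 = (y0 - y0_hat) @ np.linalg.solve(Sigma_hat, y0 - y0_hat)
scale = 1.0 + x0 @ np.linalg.pinv(X.T @ X) @ x0
cutoff = (q * (n - px) / (n - px + 1 - q)) * stats.f.ppf(0.95, q, n - px + 1 - q) * scale
print(d2 <= cutoff)                             # True in roughly 95% of repetitions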
10.8 Applications

10.8.1 One-sample problem
Suppose that there are $n$ independent samples from the $q$-variate normal distribution $N(\mu,\Sigma)$, with $\mu$ and $\Sigma$ unspecified. The data can be organized as a matrix so that it follows the model (10.1.1) with $X = 1_{n\times 1}$, $B = \mu'$ and $V = I$. Following Bhimasankaram and Sengupta (1991), we allow $\Sigma$ to be singular. The one-sample problem is to test the hypothesis $\mu = \mu_0$ for a specified vector $\mu_0$. The hypothesis is of the form $AB = \Psi$ with $A = 1$ and $\Psi = \mu_0'$. It is easy to see that the error sum of squares and products matrix $R_0$ simplifies to $Y'(I - n^{-1}11')Y$. On the other hand, the decomposition (10.5.3) simplifies to
$$R_H - R_0 = n(\hat\mu - \mu_0)(\hat\mu - \mu_0)',$$
where $\hat\mu$ is the BLUE of $\mu$, given by $n^{-1}Y'1$. As $R_H - R_0$ is a matrix of rank 1, the four statistics mentioned in Sections 10.6.1-10.6.3 are equivalent to one another (see Proposition 10.6.2). It is easy to see that Roy's union-intersection statistic simplifies to $T_R = n(\hat\mu - \mu_0)'R_0^-(\hat\mu - \mu_0)$. The Lawley-Hotelling statistic turns out to be $T^2 = (n-1)T_R$. According to Proposition 10.6.3, the null distribution of the statistic $(n - q_*)T^2/[q_*(n-1)]$ is $F_{q_*,\,n-q_*}$, where $q_* = \rho(\Sigma)$.
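The one-sample computations can be illustrated as follows (a Python sketch of ours on simulated data with a nonsingular $\Sigma$; the sample size, dimension and $\mu_0$ are arbitrary choices, not from the text).

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, q = 30, 3
mu0 = np.zeros(q)
Y = rng.standard_normal((n, q)) @ np.diag([1.0, 2.0, 0.5]) + np.array([0.2, 0.0, -0.1])

mu_hat = Y.mean(axis=0)
R0 = (Y - mu_hat).T @ (Y - mu_hat)          # error SSP matrix, Y'(I - n^{-1} 1 1')Y
T_R = n * (mu_hat - mu0) @ np.linalg.solve(R0, mu_hat - mu0)
T2 = (n - 1) * T_R                          # Hotelling's T^2
F = (n - q) * T2 / (q * (n - 1))            # ~ F_{q, n-q} under the null hypothesis
print(T2, F, 1 - stats.f.cdf(F, q, n - q))  # statistic and p-value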
10.8.2 Two-sample problem
Suppose that we have $n_1$ and $n_2$ samples from the $q$-variate normal distributions $N(\mu_1,\Sigma)$ and $N(\mu_2,\Sigma)$, respectively. If $Y$ is the $(n_1+n_2)\times q$ matrix such that its top $n_1$ rows correspond to the $n_1$ samples from the first population and the remaining $n_2$ rows correspond to the $n_2$ samples from the second population, then $Y$ follows the model (10.1.1)
with
$$X = \begin{pmatrix} 1_{n_1\times 1} & 0_{n_1\times 1} \\ 0_{n_2\times 1} & 1_{n_2\times 1} \end{pmatrix}, \qquad B = \begin{pmatrix} \mu_1' \\ \mu_2' \end{pmatrix}, \qquad V = I.$$
The two-sample problem consists of testing the hypothesis $\mu_1 = \mu_2$. This hypothesis can be written as $AB = \Psi$ where $A = (1 : -1)$ and $\Psi = 0_{1\times q}$. In this case the matrix $R_0$ simplifies to
$$R_0 = Y_1'(I_{n_1\times n_1} - n_1^{-1}1_{n_1\times 1}1_{n_1\times 1}')Y_1 + Y_2'(I_{n_2\times n_2} - n_2^{-1}1_{n_2\times 1}1_{n_2\times 1}')Y_2,$$
where $(Y_1' : Y_2') = Y'$ and $Y_1$ has order $n_1\times q$. The matrix $R_H - R_0$ simplifies to
$$R_H - R_0 = (n_1^{-1} + n_2^{-1})^{-1}(\hat\mu_1 - \hat\mu_2)(\hat\mu_1 - \hat\mu_2)',$$
where $(\hat\mu_1 : \hat\mu_2)$ is the BLUE of $B'$, given by $\hat\mu_j = n_j^{-1}Y_j'1_{n_j\times 1}$, $j = 1,2$. Once again, the four statistics mentioned in Sections 10.6.1-10.6.3 are equivalent. Roy's union-intersection test statistic reduces to
$$T_R = (n_1^{-1} + n_2^{-1})^{-1}(\hat\mu_1 - \hat\mu_2)'R_0^-(\hat\mu_1 - \hat\mu_2).$$
The Lawley-Hotelling statistic for the two-sample problem simplifies to $T^2 = (n_1 + n_2 - 2)T_R$. The statistics $T_R$ and $T^2$ are also proportional to Mahalanobis' squared distance,
$$D^2 = (\hat\mu_1 - \hat\mu_2)'\hat\Sigma^-(\hat\mu_1 - \hat\mu_2) = (n_1 + n_2 - 2)(\hat\mu_1 - \hat\mu_2)'R_0^-(\hat\mu_1 - \hat\mu_2).$$
Under the null hypothesis, Proposition 10.6.3 implies that the statistic $(n_1+n_2-q_*-1)T^2/[q_*(n_1+n_2-2)]$ has the $F_{q_*,\,n_1+n_2-q_*-1}$ distribution, where $q_* = \rho(\Sigma)$. See Exercise 10.10 for an extension to the case $V \neq I$.
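The two-sample statistics can be computed along the following lines (an illustrative Python sketch on simulated data; the sample sizes and mean shift are our own choices, not from the text).

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n1, n2, q = 25, 30, 3
Y1 = rng.standard_normal((n1, q)) + np.array([0.0, 0.5, 0.0])
Y2 = rng.standard_normal((n2, q))

m1, m2 = Y1.mean(axis=0), Y2.mean(axis=0)
R0 = (Y1 - m1).T @ (Y1 - m1) + (Y2 - m2).T @ (Y2 - m2)   # pooled within-group SSP
diff = m1 - m2
T_R = diff @ np.linalg.solve(R0, diff) / (1.0 / n1 + 1.0 / n2)
T2 = (n1 + n2 - 2) * T_R                                 # Lawley-Hotelling / Hotelling T^2
D2 = (n1 + n2 - 2) * diff @ np.linalg.solve(R0, diff)    # Mahalanobis' squared distance
F = (n1 + n2 - q - 1) * T2 / (q * (n1 + n2 - 2))         # ~ F_{q, n1+n2-q-1} under H0
print(T2, D2, F, 1 - stats.f.cdf(F, q, n1 + n2 - q - 1))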
10.8.3 Multivariate ANOVA
Consider the model (10.1.1) with V = I and suppose that the elements of the matrix X are obtained from a designed experiment such as those described in Chapter 6. In that chapter we had attributed different parts of the total sum of squares to various sources. In a similar manner we can analyse the sum of squares and products matrix and
attribute its parts to various sources. Such an analysis is called multivariate analysis of variance (MANOVA) or analysis of dispersion. The GLRT for various hypotheses in the univariate case had turned out to be a comparison of the ratio of estimated variances (under the respective hypotheses) with a suitable cut-off. In the present case, the comparison shifts from variances to dispersions. As observed before, there are several ways of making this comparison. The GLRT is based on the ratio of the determinants of the relevant dispersion matrices. Detailed calculations are available in textbooks of multivariate analysis such as Arnold (1981, Section 19.6). A MANOVA table is similar in spirit to Table 6.6 for ANCOVA. The similarity is not superficial. When the ANCOVA table is expressed in terms of matrices (as in the case of Table 6.6), it is essentially a MANOVA table where the covariates are treated as additional characters of the response. Under the assumption of multivariate normality of the response, one can obtain the ANCOVA model by conditioning a single character of the response of a MANOVA model on the remaining characters. Conditioning of more than one character of the response on the remaining characters leads to the multivariate ANCOVA model.
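As a small illustration, the Python sketch below (a simulated one-way layout with three groups, entirely our own example and not from the text) splits the total sum of squares and products matrix into within-group and between-group parts and computes Wilks' $\Lambda$ for the hypothesis of equal group means.

import numpy as np

rng = np.random.default_rng(9)
groups, n_per, q = 3, 20, 2
Y = np.vstack([rng.standard_normal((n_per, q)) + shift
               for shift in ([0.0, 0.0], [0.3, 0.0], [0.0, 0.4])])
labels = np.repeat(np.arange(groups), n_per)

grand = Y.mean(axis=0)
W = np.zeros((q, q))     # within-group (error) SSP matrix
B = np.zeros((q, q))     # between-group (hypothesis) SSP matrix
for g in range(groups):
    Yg = Y[labels == g]
    mg = Yg.mean(axis=0)
    W += (Yg - mg).T @ (Yg - mg)
    B += len(Yg) * np.outer(mg - grand, mg - grand)

wilks = np.linalg.det(W) / np.linalg.det(W + B)
print(wilks)             # small values speak against equal group means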
10.8.4 Growth models
In Example 10.6.4 no specific relation among the $\mu_{jk}$'s was assumed. Growth models involve explicit assumptions on the nature of dependence of the mean height (or any other dimension) on time. For instance, $\mu_{jk}$ can be a polynomial function of the $k$th time point, where the coefficients of the polynomial would depend on the group $j$. This amounts to assuming a linear structure for the matrix $B$, such as $B = B_0 + AH$, where $B_0$ and $H$ are specified matrices, and the matrix $A$ has fewer columns than $B$. The matrix $H$ can always be chosen to have full row rank. Thus, a linear growth model is
$$Y = XB_0 + XAH + \mathcal{E}, \qquad E(\mathcal{E}) = 0, \qquad D(\mathrm{vec}(\mathcal{E})) = \Sigma\otimes V. \qquad (10.8.1)$$
Let $G$ be a fixed symmetric and positive definite matrix having the same order as $\Sigma$, and $ZZ'$ be a rank factorization of $(I - P_{H'})$. We can rewrite
the model (10.8.1) as
$$(Y - XB_0)[GH'(HGH')^{-1} : Z] = [XA : 0] + \mathcal{E}[GH'(HGH')^{-1} : Z].$$
Under the assumption of multivariate normality, we can condition $Y_1 = (Y - XB_0)GH'(HGH')^{-1}$ on $Y_2 = (Y - XB_0)Z$, as in Section 10.6.6, to obtain the model
$$Y_1 = XA + Y_2B_a + \mathcal{E}_*, \qquad E(\mathcal{E}_*) = 0, \qquad D(\mathrm{vec}(\mathcal{E}_*)) = \Sigma_*\otimes V, \qquad (10.8.2)$$
where $B_a$ is a matrix of unspecified parameters and
$$\Sigma_* = (HGH')^{-1}HG[\Sigma - \Sigma Z(Z'\Sigma Z)^{-1}Z'\Sigma]GH'(HGH')^{-1}.$$
This model is a special case of (10.1.1). The analysis of this model can proceed along the usual lines. The choice of the matrix $G$ is arbitrary. If $G = \Sigma^{-1}$, then the term $B_a$ disappears and $\Sigma_*$ simplifies to $(H\Sigma^{-1}H')^{-1}$. If $G$ is chosen as a reasonable approximation of $\Sigma^{-1}$, then we can use the simplified model $Y_1 = XA + \mathcal{E}_*$. See Rao (1973c, Section 8c.7) and Christensen (1991, Section 1.6) for more details on the analysis of growth models in the case $V = I$.
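The construction above can be traced numerically. The following Python sketch is an illustration of ours with arbitrary choices of $H$, $G = I$ and simulated data (none of it taken from the text); it forms $Y_1 = (Y - XB_0)GH'(HGH')^{-1}$ and $Y_2 = (Y - XB_0)Z$ and fits the conditional model (10.8.2) by least squares.

import numpy as np

rng = np.random.default_rng(8)
n, g, k1, m = 40, 2, 5, 2                  # n subjects, g groups, k1 ages, m growth coefficients
t = np.arange(1.0, k1 + 1)
H = np.vstack([t**j for j in range(m)])    # m x k1, full row rank (polynomial growth)
X = np.hstack([np.ones((n, 1)), rng.integers(0, 2, (n, g - 1)).astype(float)])
A_true = rng.standard_normal((g, m))
B0 = np.zeros((g, k1))
Y = X @ (B0 + A_true @ H) + rng.standard_normal((n, k1)) * 0.3

G = np.eye(k1)                             # an arbitrary positive definite choice of G
# Z: orthonormal columns spanning the null space of H, so that H Z = 0
u, s, vt = np.linalg.svd(H)
Z = vt[m:].T                               # k1 x (k1 - m)
Y1 = (Y - X @ B0) @ G @ H.T @ np.linalg.inv(H @ G @ H.T)
Y2 = (Y - X @ B0) @ Z
# conditional model (10.8.2): regress Y1 on X and Y2
coef = np.linalg.lstsq(np.hstack([X, Y2]), Y1, rcond=None)[0]
A_hat = coef[:g]                           # estimate of A
print(np.round(A_hat - A_true, 2))         # differences should be small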
10.9 Exercises
10.1 (Fisher's Iris data). The data set given in Table 10.1 represents 150 measurements of sepal length (ls), sepal width (ws), petal length (lp) and petal width (wp) of flowers of three species of the plant Iris (Iris setosa, Iris versicolor, and Iris virginica). This classic data set is taken from Fisher (1936). Let Y be the 150 x 4 matrix of the observations given in the table (each species accounting for a 50 x 4 block of Y). Let X be a 150 x 4 binary matrix, where the first column contains 1 in all the rows, the second column contains 1 in the rows corresponding to the first species and 0 elsewhere, and the last two columns contain 1 in the rows corresponding to the second and third species, respectively, and 0 elsewhere. Assume that the data follows the linear model (10.1.1) with V = I.
ls 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8
Iris setosa ws lp 3.5 1.4 3.0 1.4 3.2 1.3 3.1 1.5 3.6 1.4 3.9 1.7 3.4 1.4 3.4 1.5 2.9 1.4 3.1 1.5 3.7 1.5 3.4 1.6 3.0 1.4 3.0 1.1 4.0 1.2 4.4 1.5 3.9 1.3 3.5 1.4 3.8 1.7 3.8 1.5 3.4 1.7 3.7 1.5 3.6 1.0 3.3 1.7 3.4 1.9
wp 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1 0.1 0.2 0.4 0.4 0.3 0.3 0.3 0.2 0.4 0.2 0.5 0.2
Iris versicolor ls ws lp wp 7.0 3.2 4.7 1.4 6.4 3.2 4.5 1.5 6.9 3.1 4.9 1.5 5.5 2.3 4.0 1.3 6.5 2.8 4.6 1.5 5.7 2.8 4.5 1.3 6.3 3.3 4.7 1.6 4.9 2.4 3.3 1.0 6.6 2.9 4.6 1.3 5.2 2.7 3.9 1.4 5.0 2.0 3.5 1.0 5.9 3.0 4.2 1.5 6.0 2.2 4.0 1.0 6.1 2.9 4.7 1.4 5.6 2.9 3.6 1.3 6.7 3.1 4.4 1.4 5.6 3.0 4.5 1.5 5.8 2.7 4.1 1.0 6.2 2.2 4.5 1.5 5.6 2.5 3.9 1.1 5.9 3.2 4.8 1.8 6.1 2.8 4.0 1.3 6.3 2.5 4.9 1.5 6.1 2.8 4.7 1.2 6.4 2.9 4.3 1.3
Iris virginica ls ws lv 6.3 3.3 6.0 5.8 2.7 5.1 7.1 3.0 5.9 6.3 2.9 5.6 6.5 3.0 5.8 7.6 3.0 6.6 4.9 2.5 4.5 7.3 2.9 6.3 6.7 2.5 5.8 7.2 3.6 6.1 6.5 3.2 5.1 6.4 2.7 5.3 6.8 3.0 5.5 5.7 2.5 5.0 5.8 2.8 5.1 6.4 3.2 5.3 6.5 3.0 5.5 7.7 3.8 6.7 7.7 2.6 6.9 6.0 2.2 5.0 6.9 3.2 5.7 5.6 2.8 4.9 7.7 2.8 6.7 6.3 2.7 4.9 6.7 3.3 5.7
wp 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 2.0 1.9 2.1 2.0 2.4 2.3 1.8 2.2 2.3 1.5 2.3 2.0 2.0 1.8 2.1
Table 10.1 Fisher's Iris data (Source: Fisher, 1936; continued to page 465)
(a) Interpret the elements of the 4 x 4 parameter matrix B and determine which of the parameters are estimable. (b) Find the BLUE of the difference between the mean responses of Iris setosa and Iris versicolor, and provide an estimate of the dispersion matrix of the estimation error.
ls 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0
Iris setosa ws lp 3.0 1.6 3.4 1.6 3.5 1.5 3.4 1.4 3.2 1.6 3.1 1.6 3.4 1.5 4.1 1.5 4.2 1.4 3.1 1.5 3.2 1.2 3.5 1.3 3.6 1.4 3.0 1.3 3.4 1.5 3.5 1.3 2.3 1.3 3.2 1.3 3.5 1.6 3.8 1.9 3.0 1.4 3.8 1.6 3.2 1.4 3.7 1.5 3.3 1.4
wp 0.2 0.4 0.2 0.2 0.2 0.2 0.4 0.1 0.2 0.2 0.2 0.2 0.1 0.2 0.2 0.3 0.3 0.2 0.6 0.4 0.3 0.2 0.2 0.2 0.2
Iris versicolor ls ws l p wp 6.6 3.0 4.4 1.4 6.8 2.8 4.8 1.4 6.7 3.0 5.0 1.7 6.0 2.9 4.5 1.5 5.7 2.6 3.5 1.0 5.5 2.4 3.8 1.1 5.5 2.4 3.7 1.0 5.8 2.7 3.9 1.2 6.0 2.7 5.1 1.6 5.4 3.0 4.5 1.5 6.0 3.4 4.5 1.6 6.7 3.1 4.7 1.5 6.3 2.3 4.4 1.3 5.6 3.0 4.1 1.3 5.5 2.5 4.0 1.3 5.5 2.6 4.4 1.2 6.1 3.0 4.6 1.4 5.8 2.6 4.0 1.2 5.0 2.3 3.3 1.0 5.6 2.7 4.2 1.3 5.7 3.0 4.2 1.2 5.7 2.9 4.2 1.3 6.2 2.9 4.3 1.3 5.1 2.5 3.0 1.1 5.7 2.8 4.1 1.3
Iris virginica ls ws lp 7.2 3.2 6.0 6.2 2.8 4.8 6.1 3.0 4.9 6.4 2.8 5.6 7.2 3.0 5.8 7.4 2.8 6.1 7.9 3.8 6.4 6.4 2.8 5.6 6.3 2.8 5.1 6.1 2.6 5.6 7.7 3.0 6.1 6.3 3.4 5.6 6.4 3.1 5.5 6.0 3.0 4.8 6.9 3.1 5.4 6.7 3.1 5.6 6.9 3.1 5.1 5.8 2.7 5.1 6.8 3.2 5.9 6.7 3.3 5.7 6.7 3.0 5.2 6.3 2.5 5.0 6.5 3.0 5.2 6.2 3.4 5.4 5.9 3.0 5.1
wp 1.8 1.8 1.8 2.1 1.6 1.9 2.0 2.2 1.5 1.4 2.3 2.4 1.8 1.8 2.1 2.4 2.3 1.9 2.3 2.5 2.3 1.9 2.0 2.3 1.8
Table 10.1 Fisher's Iris data (continued from page 464)
10.2 Show that the residual matrix $E$ defined in page 434 has the representation described in Remark 10.2.6.
10.3 Derive the multivariate linear model from a suitable multivariate normal distribution of all the variables.
10.4 Show that the rows of the matrices $(I - P_X)Y$ and $E$ (the residual matrix) constitute two generating sets of normalized linear zero functions in the model (10.1.1).
10.5 If the rows of a matrix $Z$ constitute a generating set of normalized linear zero functions in the model (10.1.1), then show that $D(\mathrm{vec}(Z))$ is of the form $\Sigma\otimes$
10.6
10.7
10.8
10.9
10.10
10.11
10.12
Year 1900 1904 1908 1912 1920 1924 1928 1932 1936 1948 1952 1956 1960 1964 1968 1972 1976 1980 1984 1988
100 m 10.80 11.00 10.80 10.80 10.80 10.60 10.80 10.30 10.30 10.30 10.40 10.50 10.20 10.00 9.95 10.14 10.06 10.25 9.99 9.92
200 m 22.20 21.60 22.40 21.70 22.00 21.60 21.80 21.20 20.70 21.10 20.70 20.60 20.50 20.30 19.83 20.00 20.23 20.19 19.80 19.75
400 m 49.40 49.20 50.00 48.20 49.60 47.60 47.80 46.20 46.50 46.20 45.90 46.70 44.90 45.10 43.80 44.66 44.26 44.60 44.27 43.87
800 m 121.40 116.00 112.80 111.90 113.40 112.40 111.80 109.80 112.90 109.20 109.20 107.70 106.30 105.10 104.30 105.90 103.50 105.40 103.00 103.45
1500 m 246.00 245.40 243.40 236.80 241.80 233.60 233.20 231.20 227.80 225.20 225.20 221.20 215.60 218.10 214.90 216.30 219.20 218.40 212.50 215.96
Table 10.2 Men's Olympic sprint times in seconds (Source: Lunn and McNeil, 1991)
Assume the model $Y = XB + E$, where $Y$ consists of the log-values of the last five columns of Table 10.2, $X$ consists of the intercept term and values of the 'year' variable measured with origin at 1950, and $\mathrm{vec}(E) \sim N(0, \Sigma_{5\times 5}\otimes I_{20\times 20})$. Test for the following hypotheses: (a) the log-record times did not change significantly with time; (b) the log-record times of every distance category changed with time at the same rate.
Chapter 11
Linear Inference — Other Perspectives
If we restrict our attention to linear statistics of the form $Ly$ in the model $(y, X\beta, \sigma^2V)$, then an interesting theory of inference for $\beta$ can be built along the lines of Section 3.5, without making any distributional assumption. Early research on this subject was carried out by Barnard (1963) and Baksalary and Kala (1981). Drygas (1983) gives the theory a more elaborate structure, while subsequent works provide corrections, modifications and extensions. However, a systematic exposition of such a theory does not seem to exist in the literature. In Section 11.1, we present the theory and show how it is connected to the definitions as well as the propositions contained in Chapters 4 and 7. Section 11.2 carries the theory of linear inference further by discussing linear versions of admissible, Bayes and minimax estimators. Section 11.3 discusses biased linear estimators which may have a smaller mean square error than the corresponding BLUEs. Two other classes of linear estimators are discussed in Section 11.4, making the story more complete. Section 11.5 deals with the geometry of linear models, which has fascinated many mathematicians and statisticians. The ideas of orthogonality and projections can be used to derive an approximation of $X\beta$ in the linear model in a completely mathematical way, and the prize catch happens to be none other than the BLUE. Although our development of the theory in Chapters 4, 5 and 7 is based on statistical ideas, once that is done, the geometric perspective and its intuitive appeal can only help enrich and add to the understanding of linear models.
Section 11.6 discusses large sample properties of a BLUE in the general linear model, justifying its use when no distributional assumption is made.

11.1 Foundations of linear inference

11.1.1 General theory
We begin by adapting the second definition of ancillary statistic, given in Section 3.5, to the case of linear functions of y. Definition 11.1.1 A statistic Zy in the model (y, Xfi, a2V) is called linearly ancillary for /3 if any linear function of it has zero expectation for all /3. Definition 11.1.2 A linearly ancillary statistic Zy for /3 in the model (y,X(3,cr2V) is called linearly maximal ancillary if any other linearly ancillary statistic is almost surely equal to a linear function of Zy for all/3. In other words, linearly ancillary statistics in a linear model are precisely the LZFs. A linearly maximal ancillary statistic is a vector whose elements constitute a generating set of LZFs. Recall that the definition of sufficiency involves a conditional distribution. It is not suitable for adaptation to the linear case where we would like not to assume any distribution. Therefore, we make use of the alternative definition given in Section 3.5, and replace the conditional expectation by the best linear predictor (BLP) defined in page 62. Definition 11.1.3 linearly sufficient for regression E(l'y\Ty) linear function of Ty
A statistic Ty in the model (y, Xf3, <J2V) is called j3 if for every linear function I'y of y, the linear (also called the BLP) is almost surely equal to a which does not depend on /3.
Definition 11.1.4 A linearly sufficient statistic Ty for /3 in the model (y,X(3,a2V) is called linearly minimal sufficient if Ty is almost surely equal to a linear function of any other linearly sufficient statistic for all /3.
11.1 Foundations of linear inference
471
Definition 11.1.5 A statistic Ty in the model (y, X{3, G2V) is called linearly complete for /3 if no nontriviaP linear function of Ty is linearly ancillary. If a statistic is linearly sufficient and linearly complete at the same time, we shall refer to it as a linearly complete and sufficient statistic. It will be proved later (see Proposition 11.1.11) that such a statistic coincides with a linearly minimal sufficient statistic. Example 11.1.6 The statistic (I — Px)y is a, linearly maximal ancillary, and any function of the form l'(I — Px)y (that is, any LZF) is a linear ancillary. A scalar I'y which is not an LZF is linearly complete. The vector Ay, where A is a square and nonsingular matrix, is linearly sufficient for /3. The vector of fitted values is a linearly complete and sufficient statistic (see Proposition 11.1.10). D We now prove a result similar to Basu's theorem (see page 67). Proposition 11.1.7 A linearly complete and sufficient statistic is uncorrelated with every linearly ancillary statistic. Proof. Let Ty be a linearly complete and sufficient statistic for /3, and z'y be a linearly ancillary statistic (that is, an LZF). The BLP of z'y given Ty is E(z'y\Ty) = z'VT'{TVT')-(Ty - TXf3). Since Ty is linearly sufficient, this does not depend on /3, and is therefore equal to z'VT'(TVT')-Ty for any choice of the g-inverse. The linear completeness of Ty implies that this quantity should be equal to 0 almost surely. Therefore, z'VT', the covariance of E(z'y\Ty) with Ty must be zero. We can also have a linear analogue of the Rao-Blackwell theorem (Proposition 3.6.1). Proposition 11.1.8 LetTy be linearly sufficient for (3 and s'y be an estimator of a single LPF, p'/3, in the linear model (y, X/3, a2V). The aA
nontrivial linear function is one which is not equal to zero almost surely.
472
Chapter 11 : Linear Inference — Other Perspectives
mean square error of the "improved" estimator, E[s'y\Ty], is less than or equal to that of s'y. Proof. Let us denote E[s'y\Ty] by h'Ty. Proposition 3.4.1 implies that the linearly ancillary statistic s'y — h'Ty is uncorrelated with h'Ty. Therefore, E{s'y - p'/3)2 = E(h'Ty - p'0)2 + Var(s'y - h'Ty) +2E(h'Ty - p'P)E{s'y - h'Ty) = E(h'Ty-p'/3)2 + Var(s'y-h'Ty) >
E(tiTy-p'0)2
Thus, h'Ty has smaller MSE than s'y. The above proposition implies that the variance of an LUE of p'/3 can be reduced by regressing it on any linearly sufficient statistic. The LUE with minimum variance is of course the BLUE, which is the linear analogue of the UMVUE. Characterization of the BLUEs through the property of their being uncorrelated with LZFs (linear ancillaries) has already been given in Proposition 4.3.2, which is the linear version of Proposition 3.6.2. This result directly leads us to the construction of the BLUE of any estimable LPF via Proposition 7.3.1. The proof of the uniqueness of the BLUE — if it exists — also follows. Such a constructive procedure for obtaining the UMVUE in the general case was not available, because a complete collection of all estimators of zero are not easy to get. Instead, we had obtained the UMVUE from an unbiased estimator and a complete sufficient statistic, via the LehmannScheffe theorem (Proposition 3.6.3). A linear analog of this proposition is given below. Proposition 11.1.9 Letp'fl have an unbiased estimator s'y, andTy be a linearly complete and sufficient statistic for /3 in the linear model (y,X/3,a2V). Then the BLUE ofp'/3 exists and is almost surely equal to E(s'y\Ty). Proof. Let u[Ty = E(s'y\Ty). It is easy to see that u[Ty is unbiased for p'f3 (see Proposition 3.4. l(c)). To prove that it is the BLUE, let u'y
11.1 Foundations of linear inference
473
be another LUE of p'/3 having strictly smaller variance than u[Ty. Let u'2Ty = E(u'y\Ty). Therefore, u'2Ty is also an LUE of p/3 having strictly smaller variance than u[Ty. Note that u[Ty — u'2Ty is a linearly ancillary statistic or LZF. Because of the linear completeness of Ty, we must have u[Ty — u'2Ty = 0 with probability 1 for all f3. Therefore, u'2Ty cannot have smaller variance than u[Ty. The construction outlined in the above proposition depends on the existence of a linearly complete and sufficient statistic. It is now shown that the vector of fitted values y described by (7.3.1) is such a statistic. Proposition 11.1.10 A linearly complete and sufficient statistic in the linear model (y,X/3,a2V) is the vector of fitted values, y. Proof. In view of (7.3.1), we have
E(l'y\y) = l'X/3-l'D(y)[D(y)]-(y-XI3) = l'X/3-l'(y-X(3) = I'y, which does not depend on /3. Therefore, y is a linearly sufficient statistic. On the other hand, Proposition 4.3.2 and Remark 4.1.5 imply that y is uncorrelated with every linearly ancillary statistic (that is, every LZF). In particular, if a'y is an LZF, it is uncorrelated with itself, that is, it must be zero almost surely. Hence, y is linearly complete and sufficient. Propositions 11.1.8-11.1.10 lead us to a linear version of Bahadur's (1957) result linking complete sufficiency with minimal sufficiency. Unlike in the general case, these two turn out to be equivalent in the linear case. Proposition 11.1.11 A linear statistic in the model (y,X/3,a2V) is linearly complete and sufficient if and only if it is linearly minimal sufficient. Proof. Let Ty be linearly complete and sufficient. Proposition 11.1.9 implies that every component of Ty is the BLUE of its expectation. Let t'y be a component of Ty and Sy any other linearly sufficient statistic. According to Proposition 11.1.8, E(t'y\Sy), has at least as
474
Chapter 11 : Linear Inference — Other Perspectives
small a variance as t'y. Since t'y is the BLUE of its expectation, its variance must be the same as that of E(t'y\Sy). Uniqueness of the BLUE implies that t'y must be almost surely equal to E(t'y\Sy), which is a linear function of Sy. Therefore, Ty is almost surely equal to a linear function of every linearly sufficient statistic. To prove the converse, let Ty be linearly minimal sufficient. According to Proposition 11.1.10, Ty must be almost surely equal to Ay for some matrix A. If I'Ay is linearly ancillary, Proposition 4.3.2 ensures that it must be uncorrelated with itself, and hence be equal to zero almost surely. We are now ready for a characterization of linearly sufficient statistics in terms of BLUEs. Proposition 11.1.12 A statistic is linearly sufficient for /3 in the linear model (y,X/3,a2V) if and only if every BLUE is almost surely equal to a linear function of it. Proof. Every BLUE is almost surely equal to a linear function of the vector of fitted values. The latter in turn is almost surely equal to a linear function of every linearly sufficient statistic, according to Propositions 11.1.10 and 11.1.11. To prove the converse, let every BLUE be a linear function of the statistic Ty. We shall show that for any I'y, E(l'y\Ty) does not depend on (3. Since y = y + e, it is enough to show that E(l'y\Ty) and E{l'e\Ty) do not depend on /3. The statement about E(l'y\Ty) follows from the fact that y (and hence i'y) is a function of Ty. To prove the other part, note that I'e — E(l'e\Ty) is uncorrelated with Ty. Hence it is uncorrelated with y and with Ty. Setting the covariance with the latter equal to zero, we have after simplification Cov(l'e,Ty)[D(Ty)]-D(Ty) = 0. As {Ty-TX/3) e C(D(Ty)), we conclude that Cov(l'e,Ty)[D(Ty)]-(Ty-TXP) = 0, and consequently Cov(l'e,Ty)[D(Ty)]- (Ty-TX/3) is free of f3. Drygas (1983) defines a linearly sufficient statistic via the above characterization. See Miiller et al. (1984) for characterizations through three other properties.
11.1 Foundations of linear inference
475
Proposition 11.1.12 shows that a linearly sufficient statistic for (5 is a vector whose elements contain a generating set of BLUEs, defined in Section 4.7.3. Proposition 11.1.7 implies that the components of a linearly minimal sufficient statistic for /3 constitute a generating set of BL UEs. Apart from BLUEs, several linear estimators described in Sections 11.2 and 11.3 are functions of a linearly minimal sufficient statistic for j3. We shall illustrate in Section 11.1.3 the simultaneous construction of a linearly minimal sufficient statistic and a linearly maximal ancillary. When the distribution of y is multivariate normal, the concepts of linear sufficiency and completeness reduce to usual sufficiency and completeness, respectively. Proposition 11.1.13
Let y ~
N(X/3,a2V).
(a) The statistic Ty is linearly sufficient for 0 if and only if it is sufficient for /3. (b) The statistic Ty is linearly ancillary for /3 if and only if it is ancillary for f3. (c) The statistic Ty is linearly minimal sufficient (that is, linearly complete and sufficient) for /? if and only if it is complete and sufficient for /3. Proof. In the normal case E[y\Ty = t] = E[y\Ty = t). Therefore Ty is linearly sufficient for /3 if and only if E[y\Ty = t] does not depend on /3, which in turn is equivalent to the conditional (normal) distribution of y given Ty = t being not dependent on /3. This proves part (a). If Ty is linearly ancillary for /3, then its (normal) distribution has zero mean, and hence it does not depend on /3. Therefore it is ancillary for /3. On the other hand, if Ty is ancillary for /3, then E(Ty) = TX/3 does not depend on f3, which means that Ty must be an LZF. This proves part (b). In order to prove part (c), let Ty be complete and sufficient for /3. According to Basu's theorem, it must be uncorrelated with every LZF. Therefore, if the statistic I'Ty is linearly ancillary, then it must be uncorrelated with itself. It follows that I'Ty is almost surely equal to zero. This proves the linear completeness of Ty. The linear sufficiency follows from part (a). See Exercise 11.3 for the proof of the converse.D
476
Chapter 11 : Linear Inference — Other Perspectives
Drygas (1983) and Mueller (1987) give some algebraic characterizations of a matrix T so that T'y is linearly sufficient, linearly complete or linearly complete and sufficient. These results hold for nonsingular V but there are some problems in the singular case. Such characterizations for the singular case, are given in Section 11.1.4. The above theory leaves out the question of estimating a2. Since quadratic functions of y are generally needed to estimate it, one has to add at least one quadratic function of y to a linear sufficient statistic in order to make it sufficient for both (5 and a1 in some sense. See Drygas (1983) and Mueller (1987) for a definition and characterizations of quadratic sufficiency. Oktaba et al. (1988) look at the possible equivalence of the models (y, Xfi, a2V) and the corresponding model for Ty, {Ty, TX/3, o2TVT'), in terms of estimable functions, BLUEs, variance estimators and tests of hypotheses. They defined Ty as a invariant linearly sufficient statistic, and gave algebraic characterizations of such statistic. Shang and Zhang (1993) study linear sufficiency and linear completeness in the limited context of estimating a linear (vector) function of fi only, and also for restricted linear models. Kornacki (1998) considers possible ordering of linear models in terms of classes of linearly and quadratically sufficient statistics. 11.1.2
Basis set of BLUEs
No nontrivial linear function of a linearly minimal sufficient statistic is a linear ancillary. However, there may be some redundancy in it. For example, if Ty is a linearly minimal sufficient statistic, so is (T' : T')'y. Thus, there may be room for further summarization of a linearly minimal sufficient statistic. A basis set of BLUEs (defined in page 117) provides such a summary. It can be verified that every basis set of BLUEs is necessarily a linearly minimal sufficient statistic. Since some elements of a basis set of BLUEs can have zero variance (see Definition 4.7.11), one might ask if there is redundancy in a basis set of BLUEs also. Let us consider a specific example. Example 11.1.14 Suppose that we have five independent measurements on the weights of three objects. The last two observations are
11.1 Foundations of linear inference
477
made with super-precise instruments. The weights ft, ft and ft can be estimated from the model M = (y,X(3,a2V), where
fyi\
/i o o\
V2 v 2/4
W
0 n
1 0 n 1
,
n
1 0
n
I/
I
i3x3
\02x3
U3x2
1
012x2/
0
\o i o/
where 8 is a known, small constant, which represents the relative precision of measurement for the last two cases. The model becomes singular in the limit as 8 goes to zero. All the parameters in this model are estimable. The BLUEs are ft = {6y1+y4)/{l + S), ft = {8y2 + y^)l{l + 8), and /53 = y3. A basis set of BLUEs is given by the elements of the vector Ty — (ft : ft : ft)', which reduces to (2/4 : y$ : j/3)' when 8 = 0, that is, when V is singular. It may be argued that in the singular case, there is a linear combination of the first two elements of Ty which is zero. Specifically, ft ft — ft ft = 0. However, we need not be bothered about this 'redundancy' in Ty, as ft and ft are not known a priori. A model like the one in Example 11.1.14 may also arise from a linear restriction on the parameters (see the equivalent unrestricted model MR described in page 275). For instance, the model of Example 11.1.14 could have come from the model (y, 13x3/3, a21) under the restriction ft = ft = ! I n such a case, there is a known relation between two elements of Ty, namely, ft = ft- This relation corresponds to the relation ftft — ft ft = 0 in Example 11.1.14, while in the present case we know a priori that ft = ft. Thus the redundancy in the basis set of BLUEs results from the known restriction on the parameter space. The redundancy would disappear if the restrictions are incorporated via the equivalent unrestricted model Air of page 275. When there is no known restriction on the parameters space, a standardized basis set of BLUEs in a singular linear model in general contains a combination of elements with variance 0 and a2, none of which is redundant. The next proposition gives the number of elements of each kind.
478
Chapter 11 : Linear Inference — Other Perspectives
with Proposition 11.1.15 Consider the linear model (y,X(3,a2V) no known restriction on the parameters. If z is any vector of BLUEs whose elements constitute a standardized basis set, then (a) the total number of elements of z is p{X); (b) the number of elements of z which have variance equal to a2 is equal to the dimension ofC(X) nC(V), while the remaining elements have zero variance. Proof. Let ZB be a rank-factorization of X, so that Z has p{X) columns. If 6 = B/3, then (y, Z0,a2V) is a reparametrization of the original model (y,X(3,a2V), such that the vector parameter in the reparametrized model has the smallest possible size (p(X)). Since 0 is fully estimable, any standardized basis set of BLUEs must have at least p{X) elements. Let m be the number of elements of z, which must be greater than or equal to p(X). Let m > p(X) and z = L'y. Since the m x p{X) matrix L1 Z has more rows than columns, there is a nontrivial vector I such that I'L'ZG = 0. It follows that I'z is a linear combination of BLUEs which is an LZF! Since it has to be uncorrelated with all LZFs, I'z must have zero variance. Therefore, I'z = 0 with probability 1. This contradicts the definition of a standardized basis set. Hence, m must be equal to p(X). In order to prove part (b), it is enough to show that the rank of the dispersion matrix of z is equal to the dimension of C(X) C\ C(V). Since z and y are both generating sets, there are n x m matrices C and B such that y = Cz and z = B'y. Therefore,
p(D(z)) = p(B'D(y)B) < p(D(y)) = p(CD(z)C) < p(D(z)). Since all the inequalities given above must hold as equalities, we have p{D(z)) = p{D(y)). The result follows from Proposition 7.3.9(a). D As in the case of linearly minimal sufficient statistics, we can also think of a smallest set of linearly maximal ancillaries. This would coincide with a basis of LZFs, defined in Section 4.7.1. Part (a) of Proposition 7.4.1 implies that such a set should contain precisely p(V :
X)-p{X) LZFs.
11.1 Foundations of linear inference 11.1.3
479
A decomposition of the response
We now look for simultaneous construction of standardized bases for BLUEs and LZFs via a single transformation of the response. The next proposition provides these bases, and paves the way for a canonical decomposition of the sum of squares, as we have seen in Section 4.7.3. Proposition 11.1.16 Given the linear model (y, X/3, a2V) with possibly rank-deficient X and V but no known constraint on the parameter space, there is a nonsingular matrix L such that the vector Ly can be written as (y[ : y'2 : y'% : y\)'', where (a) z = (y'i : y2)' is a vector whose elements constitute a standardized basis set of BL UEs; (b) D{yx) = a2l and D(y2) = 0; (c) y 3 is a vector whose elements constitute a standardized basis set of LZFs; (d) y4 = 0 with probability 1; (e) II2/3II2 *s equal to the error sum of squares; (f) II2/2II2 = \\(I ~ Pv)y\\2 with probability 1; (g) The number of elements of y1; y2, y3 and y4 are dim(C(X) D C{V)), p(V : X) - p{V), P(V : X) - p(X) and n - p(V : X), respectively. Proof. Let UAU' be a spectral decomposition of V such that A is nonsingular. Let U\, U2, t/3 and C/4 be semi-orthogonal matrices such that U2U'2 = I-PV:X, U3A1U3 is a spectral decomposition of (J — PX)V(I U,U', = PV-PV{I_PX).
— Px),
Let KK' be a rank-factorization of the nonsingular matrix U'^VU^. We define L as / L=
K-lU'A
\
\
L l
\
=\L2
U>1
A-^U'sil
,
- P x)
U'2
\L3
J
\Lj
480
Chapter 11 : Linear Inference — Other Perspectives
Let j/j = Liy, 2 = 1,2,3,4. The number of elements of these vectors are the ranks of the orthogonal matrices U4, U\, XJ% and U2, respectively, which are easily seen to be as given in part (g). It also follows that L is a square matrix. In order to show that L is nonsingular, let / be a vector satisfying LI = 0. L4I — 0 implies that I must be of the form Va + Xb. L2l = 0 implies that X'Xb = 0, that is Xb = 0 and I = Va. L{Va = 0 and L^Va = 0 imply that Va — 0. Thus, I = 0, that is, L must have full column rank. Simple calculations show that D{yl) = a2l, D(y2) = 0, D(y3) = a21, £>(y4) = 0 and Cov{yi,y3) = 0. Further, E(y3) - 0 and E(yA) = 0. We have proved parts (b) and (d). Since the number of uncorrelated LZFs contained in y3 is exactly p(V : X) — p{X), part (c) is proved. The vectors y1 and y2 must be BLUEs of their respective expectations, as these are uncorrelated with the basis set of LZFs contained in y 3 . In order to prove part (a), partition L~~l conformably with L as (Mi : Mi : M 3 : M 4 ). Since L3X = 0 and L4X = 0, it follows that M\L\y + M
vsy3 = y'(i - PxH(i -
PX)V(I
- PX)}-(I
- PX)V = Rl
from (7.4.2). Part (f) follows from the fact that y'(I - Pv)y - y'iUxU'i + U2U'2)y = y'2y2 + y'4y4 - y'2y2 almost surely.
n
The transformation of y given in the above theorem produces a vector with uncorrelated components. Some of these components are BLUEs of their respective expectations, while the others are LZFs. Some components are degenerate with zero variance, while the others have variance a2. The number of components belonging to each category is summarized in Table 11.1.
11.1 Foundations of linear inference BLUE
481
LZF
total
with variance = a2 dim(C(X) nC(V)) p{V : X) - p{X) p{V) with variance = 0 p(V : X) - p(V) n - p(V : X) n- p(V) total
p(X)
n-p(X)
n
Table 11.1 Number of components of transformed y in various categories
Example 11.1.17 Consider the model (y, X/3, a2V) with 4 observations and two parameters, /I
0\
/I
0
0 0
0\
[0 1 0 0
10 1 '
0 0 0 0 '
Vo 1 /
Vo o o o /
Here, n = 4, ^(V) = ^(X) = 2, p(X : V) = 3, d\m{C(X) n C(V)) = 1. Therefore, there is exactly one component of each category described in Table 6.1. We can choose
(l ° \ 0 u
Vo
1 u
I
°^
f °\
I vs\
0
0
L
J_
v^
A/2
0
0
f°\ V72/
V-75/
V 0/
fM \ 0/
Then, the matrix L defined in Proposition 11.1.16 is / 1 1 0 0 \
J_ ~ V2
0 0 11 1-10 0 ' \ 0 0 1-1 /
Thus, (yi + y2)/V2 is a BLUE with variance a2, (y3 + ?/4)/v/2 is a BLUE with zero variance, (yi — ?/2)/\/2 is an LZF with variance a2, (l/3 ~ yi)/V% is an LZF with zero variance.
482
Chapter 11 : Linear Inference — Other Perspectives
Remark 11.1.18 The vector z described in part (a) of Proposition 11.1.16 is a linearly minimal sufficient statistic: The vector y3 described in part (c) is a linearly maximal ancillary. Remark 11.1.19 When the parameter space of the model is constrained, Proposition 11.1.16 continues to hold with the following change in part (a): the vector z is a generating set of the BLUEs. In this case, yl and a sub-vector of y2 constitutes a standardized basis set. There is a possible over-counting of the number of BLUEs with zero variance in Table 11.1. There is also a corresponding under-counting of LZFs with zero variance. If A/3 = £ is the known restriction, then the correct numbers can be obtained by replacing X with X(I — AA~) in that table. Nordstrom (1985) gives an additive decomposition of y into four subspaces of IRn. This decomposition is similar to the transformation given in Proposition 11.1.16 (see Exercise 11.8). The advantage of the transformation in Proposition 11.1.16 is that it provides the standardized basis sets of BLUEs and LZFs. When C(X) C C(V) (a condition which holds if V is nonsingular), it can be shown that
bill 2 = y'v-y, lly2ll2 = o,
(n.i.i) (n.i.2)
with probability 1. These facts supplement the results ||y3||2 = e'V-e, lly4H2 = 0,
(11.1.3) (11.1.4)
obtained from parts (d) and (e) of Proposition 11.1.16. Instead of proving (11.1.1), we now prove a more general result. Proposition 11.1.20 Let z be any standardized basis of BLUEs in the linear model (y,X(3,a2V) with C{X) C C(V). Then z'z = y'V~y = y'V-y-R20. Proof. Let FF' be a rank-factorization of V, C a left-inverse of F and Cy = Bz. Using the fact that y = X{X'V~X)~X'V'y (see
11.1 Foundations of linear inference
483
Exercise 7.14) and equating the dispersions of the two sides, we have Pcx = BB'. Therefore p(B) = p{CX) = p(X), that is, B has full column rank. It follows that PD, = I. Hence, z'z
= z'PB,z = z'B'{BB')-Bz = y'C'PcxCy
= y'C'(Pcx)-Cy
= y'C'Cy = y'V~y.
Also,
y'V-y = y'V-X{X'V-X)-X'V-y = y'V'y, which means that y'V~e = 0. Therefore, y'V'y
= y'V-y + e'V~e = z'z + R2Q.
The above proposition is a generalization of part (b) of Proposition 4.7.13, where it was assumed that V = / . The result can be further modified to the case where z is any basis set of BLUEs, which is not necessarily standardized (Exercise 11.6). However, the decomposition of Proposition 11.1.20 does not hold when C{X) is not contained in C(V). To see this, note that in this case there may be an element of z with zero variance. Doubling this element would lead to another standardized basis set, and the value of z'z would increase. Thus, the value of z'z is not invariant of the choice of the standardized basis set of BLUEs. This is in contrast to standardized basis sets of LZFs. 11.1.4
Estimation and error spaces
The idea of Error and Estimation spaces was introduced in Section 4.5 (see Remark 4.5.2). We now define these formally. Definition 11.1.21 The Error Space of the model {y,X/3,u2V) is defined as £r — {I : I'y is an LZF of the linear model}. Definition 11.1.22 is defined as
The Estimation Space of the model (y, X/3, o2V)
£s = {I : I'y is a BLUE of the linear model}.
484
Chapter 11 : Linear Inference — Other Perspectives
Remark 11.1.23 If there is no known restriction on the parameter space, and L\, L2, L3 and L4 are as in Proposition 11.1.16, then £r — C(L'3:L'4),£S=C(L[:L'2:L'4). It is easy to see that the error and estimation spaces conform to the definition of vector spaces given in Section 2.3. The connection between these two spaces and the definitions given in Section 11.1.1 are given in the following proposition. Proposition 11.1.24 (y,Xf3,a2V).
Let Ty be a linear statistic in the model M. =
(a) The statistic Ty is linearly ancillary if and only if C(T') C £r, and linearly maximal ancillary if and only if C(T') = £r. (b) The statistic Ty is linearly complete if and only if (C(T')n£r) C
(£sner). (c) The statistic Ty is linearly sufficient if and only if £s C C(T'). (d) The statistic Ty is linearly minimal sufficient if and only if C(T')=£S. Proof. See Exercise 11.7.
C3
The above proposition shows that the various types of linear statistics considered in Section 11.1 can be characterized if we are able to characterize error and estimation spaces. We now provide characterizations of the latter. Proposition 11.1.25 Let M be the linear model (y,X(3,a2V) no known restriction on the parameters. Then
with
(a) £r=C(X) ; (b) £S=C(V(I-PX)) ; (c) £rD£s= C(V : X)L; (d) £r + £s = IRn. Proof. Since there is no known restriction on the parameters, parts (a), (c) and (d) follow from Remark 11.1.23. In order to prove part (b), note that I'y is a BLUE in M if and only if it is uncorrelated with (/ — Px)y- This condition is equivalent to l'V{I - Px) = 0, or l£C{V{I-Px))L.
11.1 Foundations of linear inference
485
Remark 11.1.26 I G C(V{I - P^)) x if and only if VI € C(X). Therefore, when V is nonsingular, the estimation space is C(V~1X). When V = I, £s further simplifies to C(X). Part (c) of Proposition 11.1.25 shows that if Us a vector belonging to the estimation and error spaces simultaneously, then Vy is zero almost surely, as y £ C(V : X) with probability one. Thus, the intersection of the two spaces plays no role in the the values of the linear functions (BLUEs, LZFs and their linear combinations), li C{X : V) - IRn, the error and estimations spaces are virtually disjoint. If u and v are in the estimation and error spaces, respectively, then u'Vv = 0. Part (d) of Proposition 11.1.25 indicates that the two spaces together span the entire Mn. Thus, the two spaces have a complementary relationship. When V is nonsingular, they are in fact orthogonal complements of each other under the inner product defined through the positive definite matrix V. We are now ready for a characterization of a matrix T so that the statistic Ty is linearly ancillary, linearly complete, linearly sufficient or linearly minimal sufficient. Proposition 11.1.27 Let Ty be a linear statistic in the linear model (y,X/3,a2V), denoted by M., with no known restriction on the parameters. Then (a) The statistic Ty is linearly ancillary if and only if TX = 0, and linearly maximal ancillary if and only if C(T') = C(X)1-; (b) The statistic Ty is linearly complete if and only if C(TV) C C(TX); (c) The statistic Ty is linearly sufficient if and only if C{V(I — PX))^-CC(T'); (d) The statistic Ty is linearly minimal sufficient if and only if C(V{I - - P x ) ) x =C{T'). Proof. Part (a) follows from the definition. In order to prove part (b), let Ty be linear complete and I be an arbitrary vector so that I'TX = 0. According to Proposition 11.1.25, T'l G £r. Part (b) of Proposition 11.1.24 implies that T'l G £r n £s, that is, I'TV = 0. Therefore,
486
Chapter 11 : Linear Inference — Other Perspectives
C{TX)L C C(TV)L, that is, C(TV) C C(TX). Conversely, assume that the latter condition holds and I'Ty is any LZF. Then we have I'TX = 0, that is, I'TV = 0, which implies that I'Ty has zero variance. Therefore, Ty must be linearly complete. Parts (c) and (d) follow from Propositions 11.1.24 and 11.1.25.
11.2 Admissible, Bayes and minimax linear estimators
We discussed admissible, Bayes and minimax estimators in Section 3.7. In this section we introduce linear versions of these in the context of the linear model.
11.2.1 Admissible linear estimator

Consider the problem of estimating an estimable vector LPF Aβ in the linear model (y, Xβ, σ²V). We compare estimators on the basis of the squared error loss function L(β, T(y)) = ||T(y) − Aβ||². An important result due to James and Stein (1961) implies that the BLUE of Aβ is not admissible within the class of all estimators with respect to the squared error loss function. However, if we confine our attention only to linear estimators of the form Ty, the question of admissibility should then be re-examined. An estimator which is admissible within the class of linear estimators is referred to as an admissible linear estimator (ALE). It can be shown that linear admissibility of an estimator with respect to the squared error loss function ensures its linear admissibility with respect to any quadratic loss function of the form (Ty − g(θ))'B(Ty − g(θ)), where B is a nonnegative definite matrix (Exercise 11.11). It seems rather natural that linearly complete and sufficient statistics play a central role in admissible linear estimation. The following proposition confirms this.

Proposition 11.2.1 Any admissible linear estimator (with respect to the squared error loss function) of an estimable LPF Aβ in the linear model (y, Xβ, σ²V) is almost surely equal to a linear function of any linearly complete and sufficient statistic.
Proof. Let Ty be an admissible linear estimator of Aβ and Sy be a linearly complete and sufficient statistic. Let T̃y = E(Ty | Sy). According to Proposition 11.1.9, T̃y must be the BLUE of E(Ty). Then
E||Ty − Aβ||² = E||Ty − T̃y||² + E||T̃y − Aβ||² ≥ E||T̃y − Aβ||².
If this inequality is strict for some β, then Ty would not be an admissible linear estimator of Aβ. Therefore, E||Ty − T̃y||² = 0, and the LZF Ty − T̃y is equal to 0 with probability 1. Thus, Ty is almost surely equal to T̃y, which is a linear function of Sy.

Since ŷ, the BLUE of Xβ, is a linearly complete and sufficient statistic, the above proposition implies that every ALE is almost surely equal to a function of ŷ. In particular, if Ty is an ALE, it is almost surely equal to E(Ty | ŷ) = Tŷ.

Proposition 11.2.2 The BLUE of Aβ in the model (y, Xβ, σ²V) is linearly admissible with respect to the squared error loss function.

Proof. Let Ty have uniformly smaller risk than the risk of the BLUE of Aβ. Without loss of generality we can replace Ty by Tŷ. If E(Ty) = Aβ, then the uniqueness of the BLUE implies that Ty must be almost surely equal to the BLUE. Let us assume that E(Ty) = Bβ, such that Aβ − Bβ ≠ 0. Since Ty has smaller mean squared error than the BLUE (denoted here by Aβ̂), we have for all β
0 ≤ E||Aβ̂ − Aβ||² − E||Ty − Aβ||² = tr{D(Aβ̂) − D(Ty)} − ||(A − B)β||².
The first term does not depend on β, while ||(A − B)β||² is an unbounded function of β which is positive for some value of β. Letting the magnitude of β increase indefinitely we arrive at a contradiction, thus proving the result.

The next natural question to ask is whether there are other admissible linear estimators besides the BLUE. The following proposition provides a characterization of all admissible linear estimators of an estimable vector LPF. This result is a special case of a theorem of Baksalary and Markiewicz (1988) and is built upon a series of earlier results
(see Cohen (1966), Shinozaki (1975), Rao (1976), Mathew et al. (1984), and Klonecki and Zontek (1988)). The proof is lengthy and is omitted.

Proposition 11.2.3 The class of admissible linear estimators of an estimable LPF Aβ in the linear model (y, Xβ, σ²V) under the squared error loss function is equivalent to the class of estimators Ty which satisfy the following four conditions:
(a) C(VT') ⊆ C(X);
(b) TVC' is symmetric;
(c) TVT' ≤ TVC' in the sense of the Löwner order;
(d) C((T − C)X) = C((T − C)W);
where C and W are matrices such that A = CX and C(W) = C(V) ∩ C(X).

Remark 11.2.4 Some special cases of the above result are particularly interesting. When Aβ and Ty are scalars, condition (b) is redundant and condition (c) reduces to an algebraic inequality. When V is positive definite, condition (d) is redundant. When A = X, we can choose C = I. In such a case, LTy is an admissible linear estimator of LXβ for any L (Exercise 11.12).

Example 11.2.5 If Aβ̂ is the BLUE of Aβ in the linear model (y, Xβ, σ²V) where C(X) ⊆ C(V), then cAβ̂ is an ALE of Aβ for 0 ≤ c ≤ 1 (see Exercise 11.10). In particular, cAβ̂ is an ALE of Aβ whenever 0 ≤ c ≤ 1 and V is nonsingular. The condition of nonsingularity is unnecessary when c = 1.

A linear estimator of the form Ty is sometimes referred to as a homogeneous linear estimator, to distinguish it from an inhomogeneous linear estimator (also known as an affine estimator) of the form Ty + t, where t is a constant. If one looks for an estimator of an estimable Aβ within this wider class of linear estimators, then a characterization similar to Proposition 11.2.3 can be given, with the additional condition t ∈ C((T − C)X). The BLUE is easily seen to be admissible within this class too. Further results in this area can be found in Stepniak (1989) and Fu and Tang (1993), who consider admissibility in the mixed linear
model. Wu (1992) gives a summary of results on this topic as well as on admissible quadratic estimators of σ². Baksalary and Markiewicz (1990) extend the characterization of admissible linear estimators in the fixed effects linear model to the case of non-identifiable LPFs, although the usefulness of any estimator of non-identifiable LPFs is questionable. Baksalary et al. (1995) further extend this work to the case of a weighted quadratic risk function.

11.2.2 Bayes linear estimator
A Bayes linear estimator is similar to a Bayes estimator, when we restrict our attention to linear estimators. Specifically, suppose that we have a prior π(·) on the parameter vector θ and y is an observation vector which carries some information about the parametric function g(θ). A Bayes linear estimator (BLE) of g(θ) with respect to the prior π is defined as an estimator T(y) which minimizes the Bayes risk r(T, π) = ∫ R(θ, T) dπ(θ) among the class of all linear estimators of the form Ty. Rao (1976) defines such an estimator as a Bayes homogeneous linear estimator, as distinguished from an estimator which minimizes the Bayes risk among the class of inhomogeneous or affine estimators. We shall refer to the latter as a Bayes affine estimator. The proof of Proposition 3.7.3 remains valid in the case of linear estimators. Hence, a unique BLE must necessarily be linearly admissible.

Remark 11.2.6 If a quadratic loss function is used, then any BLE is almost surely a function of every linearly complete and sufficient statistic. To see this, let Ty be an estimator of Aβ and ŷ be the BLUE of Xβ (which is almost surely a function of every linearly complete and sufficient statistic). Then for any symmetric and nonnegative definite matrix B we have
E[(Ty − Aβ)'B(Ty − Aβ)] = E[E{(Ty − Aβ)'B(Ty − Aβ) | β, σ²}]
= E[E{(Tŷ − Aβ)'B(Tŷ − Aβ) + (Ty − Tŷ)'B(Ty − Tŷ) | β, σ²}]
≥ E[E{(Tŷ − Aβ)'B(Tŷ − Aβ) | β, σ²}] = E[(Tŷ − Aβ)'B(Tŷ − Aβ)].
Thus, for any estimator Ty (of Aβ), we can find Tŷ which is a function of every linearly complete and sufficient statistic and has a smaller average risk.

The next proposition provides a closed form expression for the BLE of an LPF in the general linear model under the squared error loss function, even when the LPF is not estimable.

Proposition 11.2.7 Let the prior distribution of β and σ² in the linear model (y, Xβ, σ²V) be such that E(σ²) is positive and the matrix E(ββ') is finite. Let U = [E(σ²)]⁻¹E(ββ'). Then the unique BLE of the LPF Aβ with respect to the above prior and the squared error loss function can almost surely be represented as Aβ̂_B, where
β̂_B = UX'(V + XUX')⁻y.

Proof. Let π be the prior of β and σ². A BLE is obtained by minimizing, with respect to the matrix T, the Bayes risk of the estimator Ty:
r(T, π) = E[E[||Ty − Aβ||² | β, σ²]]
= E[E[||(Ty − TXβ) + (TX − A)β||² | β, σ²]]
= E[σ² tr(TVT') + ||(TX − A)β||²]
= E(σ²) tr(TVT') + tr[(TX − A)E(ββ')(TX − A)']
= E(σ²) tr[TVT' + (TX − A)U(TX − A)'].
Let W = V + XUX'. It follows from Exercise 7.13 that C(XU) = C(XUX') ⊆ C(W), so that we can write XUA' as WT₀' for some matrix T₀. Therefore, a BLE of Aβ is Ty such that T minimizes
tr[TVT' + (TX − A)U(TX − A)'] = tr[TWT' − TWT₀' − T₀WT' + AUA']
= tr[(T − T₀)W(T − T₀)' + (AUA' − T₀WT₀')].
The minimum value of the above is tr[AUA' − T₀WT₀'], which occurs when the trace of the nonnegative definite matrix (T − T₀)W(T − T₀)' is zero, that is, when the matrix itself is 0. This happens if and only if T = T₀ + F, where F is such that FW = 0. We now argue that Fy = 0 almost surely. Indeed, Proposition 3.1.1(a) implies that β ∈ C(E(β) : D(β)) = C(E(β)E(β)' + D(β)) = C(U), and consequently y ∈ C(Xβ : V) ⊆ C(XUX') + C(V) = C(W). Therefore, the optimal Ty can almost surely be written as
Ty = T₀y = AUX'W⁻y = AUX'(V + XUX')⁻y = Aβ̂_B.
Since y and XUA' both belong to C(V + XUX'), the above expression does not depend on the choice of the g-inverse. This proves the uniqueness of the BLE.

The uniqueness of the BLE of Aβ implies that it must be linearly admissible. When V is positive definite, Rao (1976) proves that every ALE is either a BLE or the limit of a sequence of BLEs. If the loss function is chosen to have the form (Ty − Aβ)'B(Ty − Aβ), where B is a nonnegative definite matrix, the estimator given in Proposition 11.2.7 is a BLE. However, the uniqueness holds only when B is positive definite. An important aspect of the BLE under a quadratic loss function is that it depends on the prior of β and σ² only through E(ββ')/E(σ²). If V is positive definite, then the g-inverse in the expression of the BLE can be replaced by an inverse. If V and U are both positive definite, the BLE simplifies to (Exercise 11.17)

β̂_B = (X'V⁻¹X + U⁻¹)⁻¹X'V⁻¹y.    (11.2.1)
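The following short Python sketch (with made-up X, V, U and y, chosen only for illustration) computes the BLE in the general form of Proposition 11.2.7, using the pseudo-inverse as a g-inverse, and checks numerically that it agrees with the simplified form (11.2.1) when V and U are positive definite.

import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3
X = rng.standard_normal((n, k))
V = np.eye(n)                     # positive definite dispersion
U = np.eye(k)                     # U = E(beta beta') / E(sigma^2), assumed known here
y = X @ rng.standard_normal(k) + rng.standard_normal(n)

# General form: beta_B = U X' (V + X U X')^- y
beta_B = U @ X.T @ np.linalg.pinv(V + X @ U @ X.T) @ y

# Simplified form (11.2.1), valid for positive definite V and U
beta_B_pd = np.linalg.solve(X.T @ np.linalg.inv(V) @ X + np.linalg.inv(U),
                            X.T @ np.linalg.inv(V) @ y)

print(np.allclose(beta_B, beta_B_pd))   # the two expressions agree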
Recall from Proposition 3.7.4(b) that a Bayes estimator is generally biased. The expression given in Proposition 11.2.7 suggests that the BLE is biased too. A formal proof of this fact is given below. Proposition 11.2.8 Given the set-up of Proposition 11.2.7, a Bayes linear estimator of an estimable LPF in a linear model cannot be unbiased unless it is almost surely equal to that LPF.
Proof. Let Ty be the BLE of the estimable LPF Aβ. If it is unbiased, then we have
E(Ty) = Aβ   for all permissible β.
Let A = CX. We can express Aβ as C(V + XUX')(V + XUX')⁻Xβ, where U = [E(σ²)]⁻¹E(ββ'), as in Proposition 11.2.7. Therefore, the above equation can be written as
Aβ − TXβ = CV(V + XUX')⁻Xβ = 0   for all permissible β.
It follows that
E[CV(V + XUX')⁻Xββ'X'(V + XUX')⁻VC'] = E(σ²)CV(V + XUX')⁻XUX'(V + XUX')⁻VC' = 0,
that is, CV(V + XUX')⁻XUX' = 0. Consequently, the conditional covariance between Ty and Cy given β and σ² is zero. Note that any LUE of Aβ can almost surely be written as Cy for a suitable choice of C (see Proposition 7.2.3(a)). This implies that Ty is conditionally uncorrelated with every LUE of Aβ, including Ty itself. Therefore, Ty must have zero dispersion and be almost surely equal to its conditional mean, Aβ.

Proposition 11.2.8 implies that the BLUE cannot be a BLE with respect to the squared error loss function. Alternative characterizations of the BLE in a somewhat more general set-up are given by Gnot (1983). Gruber (1990) provides a comparison of the MSEs of the BLE and the BLUE, conditional on β, in the special case V = I. LaMotte (1978) considers the class of Bayes affine estimators in this special case. Gaffke and Heiligers (1989) deal with BLEs for the linear model with positive definite V, subject to some linear restrictions on β.

11.2.3 Minimax linear estimator
A minimax linear estimator minimizes the maximum risk among the class of linear estimators. In order to ensure that the maximum risk is
finite, it is customary to impose an ellipsoidal (quadratic) restriction of the form θ'Hθ ≤ 1 on the parameter space, where H is a nonnegative definite matrix. Thus a linear estimator Ty is formally defined to be a minimax linear estimator (MILE) of θ if it minimizes the maximum risk
sup{E[L(θ, Ty)] : θ ∈ Θ, θ'Hθ ≤ 1}.
Obviously a MILE would depend on the choice of the loss function as well as the matrix H. In the context of the linear model (y, Xβ, σ²V) and a quadratic loss function with weight matrix B, a MILE of the estimable LPF Aβ is an estimator Ty which minimizes
sup{E[(Ty − Aβ)'B(Ty − Aβ)] : β'Hβ ≤ σ²},
where H and B are both nonnegative definite matrices. The inclusion of σ² in the ellipsoidal restriction simplifies the mathematics, as we shall see below. It is easily seen that the proof of Proposition 3.7.5 continues to hold in the linear case. Therefore, if there is a prior of β such that the average risk of the corresponding BLE is equal to its maximum risk, then the prior is least favourable and the BLE is a MILE. Although this result is important theoretically, it does not directly lead to the identification of a MILE. Let FF' be a rank-factorization of B, and C be a matrix such that A = CX. We can then expand the risk function as
E[(Ty − Aβ)'B(Ty − Aβ)] = E[||F'Ty − F'Aβ||²] = σ²[tr(F'TVT'F) + ||F'(T − C)Xβ||²/σ²].
The problem of maximizing this with respect to β subject to the restriction β'Hβ ≤ σ² is somewhat simplified if we use the transformation γ = β/σ. The maximum risk corresponds to the maximum of ||F'(T − C)Xγ||² with respect to γ subject to the restriction γ'Hγ ≤ 1. It is easy to see that the quadratic function ||F'(T − C)Xγ||² is unbounded if the matrix H is such that ||F'(T − C)Xγ||² can be positive
even when γ'Hγ = 0. Thus, boundedness of the risk amounts to the condition that Hγ = 0 implies F'(T − C)Xγ = 0, that is, C(X'(T − C)'F) ⊆ C(H). A simple sufficient condition for the above is C(X') ⊆ C(H). This condition is also necessary when B is positive definite (that is, B is invertible) and the boundedness of the risk function of all linear estimators is required. The condition C(X') ⊆ C(H) implies that the maximum risk associated with the estimator Ty of Aβ is
R_max(Ty, Aβ) = σ²[tr(F'TVT'F) + ||F'(T − C)XH⁻X'(T − C)'F||]    (11.2.2)
(see Exercise 11.13). Obtaining a MILE by minimizing (11.2.2) with respect to Ty is usually a very difficult task in the general case. We present closed form solutions to this problem in the following special cases: (a) B has rank one; (b) Aβ is a scalar; (c) H ∝ I, B = I, A = I and V nonsingular.

The following proposition, which is a generalization of a result due to Kuks and Olman (1972), provides the MILE in the case when B has rank 1.

Proposition 11.2.9 Let Aβ be an estimable LPF in the linear model (y, Xβ, σ²V), and H be a nonnegative definite matrix with C(X') ⊆ C(H). A minimax linear estimator of Aβ subject to the restriction β'Hβ ≤ σ² and with respect to the loss function L(Ty, Aβ) = E[{(Ty − Aβ)'f}²] can almost surely be written as
Aβ̂_{m,f} = AH⁻X'(V + XH⁻X')⁻y + (I − P_f)a,
where a is an arbitrary vector, P_f = (f'f)⁻¹ff' and the two g-inverses are symmetric. Further, the linear estimator of Aβ which is minimax
with respect to the above form of loss function for every f is unique and can almost surely be written as AX⁻Xβ̂_m, where
Xβ̂_m = XH⁻X'(V + XH⁻X')⁻y.
Proof. We begin from (11.2.2) with B = ff'. It follows that a minimax estimator is one which minimizes
f'TVT'f + f'(T − C)XH⁻X'(T − C)'f.
Completing the square as in the proof of Proposition 11.2.7, we rewrite the above expression as
(T'f − QC'f)'(V + XH⁻X')(T'f − QC'f) + a,
where Q = (V + XH⁻X')⁻XH⁻X' and a is a scalar that does not depend on T. Note that the above expression does not depend on the choice of the g-inverses. Therefore, the maximum risk is minimized when T'f is equal to QC'f plus a vector which is in the null space of C(V + XH⁻X'). As we have seen in the proof of Proposition 11.2.7, we can ignore the latter vector in the representation of the corresponding estimator, Ty. Thus, Ty is a MILE of Aβ only if f'Ty = f'CQ'y almost surely. This proves the first part of the proposition. The second part is easily proved by observing that a MILE must satisfy f'(Ty − CQ'y) = 0 almost surely for all f, that is, Ty = CQ'y almost surely.

Proposition 11.2.10 Let a'β be an estimable LPF in the linear model (y, Xβ, σ²V), and H be a nonnegative definite matrix with C(X') ⊆ C(H). The minimax linear estimator of a'β, subject to the restriction β'Hβ ≤ σ² and with respect to the loss function L(t'y, a'β) = E[||t'y − a'β||²], almost surely has the unique representation c'Xβ̂_m, where Xβ̂_m is as described in Proposition 11.2.9 and c is a vector satisfying a = X'c.
Proof. Since we have a single LPF, the matrix B of (11.2.2) reduces to a scalar. Hence, we can ignore its presence. Consequently t'y is a minimax estimator of a'β only if it minimizes
t'Vt + (t − c)'XH⁻X'(t − c),
where c is a vector satisfying a'β = c'Xβ. The rest of the proof follows along the lines of the proof of Proposition 11.2.9.

The result of Proposition 11.2.9 can be derived from Proposition 11.2.10, by substituting a = A'f, where A and f are as in the proof of Proposition 11.2.9. The estimator Xβ̂_m described in Propositions 11.2.9 and 11.2.10 is called the Kuks-Olman estimator, in recognition of the early work by Kuks and Olman (1971, 1972). Gruber (1990) makes a detailed study of the maximum mean squared error of this estimator. This estimator has two interesting properties. Proposition 11.2.9 shows that a MILE of any estimable LPF with respect to a quadratic loss function with any rank-one weight matrix B can be obtained from Xβ̂_m by substitution. Proposition 11.2.10 shows that the unique MILE of any single estimable LPF with respect to the squared error loss function is obtained from Xβ̂_m by substitution. Thus, the Kuks-Olman estimator plays an important role in minimax linear estimation in the linear model. Before taking up the third special case mentioned above, let us view this estimator from another angle. The form of this estimator is identical to that of Xβ̂_B, which is a BLE of Xβ with respect to any quadratic loss function, for U = H⁻. See Exercise 11.18 for a formal relation between the two estimators. An interesting property of Xβ̂_m, which follows from a result of Bunke (1975), is
sup{E[(AX⁻Xβ̂_m − Aβ)(AX⁻Xβ̂_m − Aβ)'] : β'Hβ ≤ σ²} ≤ sup{E[(Ty − Aβ)(Ty − Aβ)'] : β'Hβ ≤ σ²}    (11.2.3)
for any estimable LPF Aβ and for any linear estimator of it (see Exercise 11.14). In the above result, the supremum of a class of nonnegative definite matrices is the nonnegative definite matrix which is larger than or equal to every matrix of the class (in the sense of the Löwner order), such that no smaller matrix possesses this property. When V is positive definite, the Kuks-Olman estimator simplifies
to (Exercise 11.17)
Xβ̂_m = X(X'V⁻¹X + H)⁻X'V⁻¹y.    (11.2.4)
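As a small numerical illustration (with made-up X, V, H and y, and the pseudo-inverse standing in for the g-inverses), the following Python sketch computes the Kuks-Olman estimator in its general form and checks that it matches the simplification (11.2.4) when V is positive definite.

import numpy as np

rng = np.random.default_rng(2)
n, k = 10, 3
X = rng.standard_normal((n, k))
V = np.eye(n)
H = 2.0 * np.eye(k)      # nonsingular H, so C(X') is contained in C(H)
y = X @ rng.standard_normal(k) + rng.standard_normal(n)

Hg = np.linalg.pinv(H)   # a (symmetric) g-inverse of H
Xb_m = X @ Hg @ X.T @ np.linalg.pinv(V + X @ Hg @ X.T) @ y              # general form
Xb_m_pd = X @ np.linalg.solve(X.T @ np.linalg.inv(V) @ X + H,           # form (11.2.4)
                              X.T @ np.linalg.inv(V) @ y)

print(np.allclose(Xb_m, Xb_m_pd))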
A special case of the above estimator is the 'ridge estimator' discussed in Section 11.3.3. In spite of all the nice properties of Xβ̂_m, it must be remembered that it is not necessarily a MILE of Xβ with respect to a general quadratic loss function in the case when the weight matrix B has arbitrary rank. This is illustrated by the next proposition, which shows that the MILE of β in the special case B = I, H ∝ I and V nonsingular is an entirely different estimator and is not a linear function of Xβ̂_m (see also Exercise 11.15).

Proposition 11.2.11 If β is estimable in the model (y, Xβ, σ²V) with nonsingular V, then its minimax linear estimator subject to the restriction β'β ≤ σ²/h and with respect to the loss function L(Ty, β) = E[||Ty − β||²], almost surely has the unique representation
β̂_M = [1 + h·tr((X'V⁻¹X)⁻¹)]⁻¹ (X'V⁻¹X)⁻¹X'V⁻¹y.

Proof. Once again we begin with the expression of maximum risk given in (11.2.2). Substituting A = I, B = I and H = hI in this expression, we have
σ⁻²R_max(Ty, β) = tr(TVT') + h⁻¹||TX − I||².
A minimax estimator has to minimize the right hand side of the above equation. We shall simplify the problem by showing that
tr(TVT') + h⁻¹||TX − I||² ≥ tr(TVT') + [h·tr((X'V⁻¹X)⁻¹)]⁻¹ tr[(TX − I)(X'V⁻¹X)⁻¹(TX − I)'].    (11.2.5)
Note that the estimability of β implies that X has full column rank. Since V is nonsingular, C(X') = C(X'V⁻¹X) and therefore X'V⁻¹X is an invertible matrix. In order to prove (11.2.5), let (X'V⁻¹X)⁻¹
be factored as LL', where L is an invertible matrix. We have, for any vector l of appropriate dimension,
||(TX − I)Ll||² ≤ ||TX − I||²·||Ll||².
Therefore, the matrix difference ||TX − I||²L'L − L'(TX − I)'(TX − I)L is nonnegative definite, and it must have a nonnegative trace. It follows that
0 ≤ tr[||TX − I||²L'L] − tr[L'(TX − I)'(TX − I)L]
= ||TX − I||² tr(LL') − tr[(TX − I)LL'(TX − I)']
= ||TX − I||² tr((X'V⁻¹X)⁻¹) − tr[(TX − I)(X'V⁻¹X)⁻¹(TX − I)'],
and consequently
||TX − I||² ≥ tr[(TX − I)(X'V⁻¹X)⁻¹(TX − I)']/tr((X'V⁻¹X)⁻¹).
This proves (11.2.5). If we can find a linear estimator Ty for which (11.2.5) holds with equality and the right hand side is minimized at the same time, then Ty would be a MILE for the problem at hand. This minimization is a rather easy task. Let a = [h·tr((X'V⁻¹X)⁻¹)]⁻¹. Then the right hand side of (11.2.5) can be written as
tr(TVT') + a·tr[(TX − I)(X'V⁻¹X)⁻¹(TX − I)']
= tr[T[V + aX(X'V⁻¹X)⁻¹X']T' − T[aX(X'V⁻¹X)⁻¹] − [a(X'V⁻¹X)⁻¹X']T' + a(X'V⁻¹X)⁻¹].
By writing [V + aX(X'V⁻¹X)⁻¹X'] as W and completing the square, we can simplify the above expression to
tr[(T − a(X'V⁻¹X)⁻¹X'W⁻¹)W(T − a(X'V⁻¹X)⁻¹X'W⁻¹)' + K],
where the matrix K does not depend on T. Thus, the matrix
T₀ = a(X'V⁻¹X)⁻¹X'W⁻¹
is the unique minimizer of the right hand side of (11.2.5). By expanding W⁻¹ we have the following alternative expression for T₀:
T₀ = a(X'V⁻¹X)⁻¹X'W⁻¹
= a(X'V⁻¹X)⁻¹X'[V⁻¹ − V⁻¹X(a⁻¹X'V⁻¹X + X'V⁻¹X)⁻¹X'V⁻¹]
= [a/(1 + a)](X'V⁻¹X)⁻¹X'V⁻¹.
This simplification shows that the matrix (ToX — / ) is proportional to the identity matrix of appropriate dimension. Consequently (11.2.5) holds with equality when T = TQ. This completes the proof of the proposition. D The estimator /3M is linearly admissible with respect to the squared error loss function (see Example 11.2.5). An interesting fact is that AfiM is not necessarily a MILE of Aj3 under the conditions of Proposition 11.2.11 (see Exercise 11.16). Thus, the principle of substitution does not work here. Apart from the closed form expressions for the MILE in the special cases considered in the preceding three propositions, finding the general solution to the problem of minimax linear estimation appears difficult. Lauter (1975) makes some progress for the special case when V and H are positive definite. Drygas (1985) provides a MILE in the special case of a single LPF, as in Proposition 11.2.10, but under the more general ellipsoidal restriction ()3 — /3Q)'H(f3 —fi0)< a2. He also considers the problem of minimax linear prediction in this set-up. Stahlecker and Lauterbach (1989) proposes a numerical method of obtaining a MILE in the general case. Pilz (1986) considers a positive definite dispersion matrix but a more general restriction on /3: that it belongs to a compact set which is symmetric around a midpoint. He investigates minimax estimators in the class of affine (rather than linear) estimators and it turns out that such a minimax estimator under this set-up is a Bayes affine estimator corresponding to the least favourable prior. The search
for the least favourable prior gives rise to an explicit expression of the minimax affine estimator in some special cases, including the case of linear inequality constraints on each parameter from both sides. Drygas (1996) attempts to solve the general problem with an ellipsoidal restriction on β, using spectral decomposition of some matrices. However, closed form solutions can only be found in some special cases. A general ellipsoidal restriction on the parameter β appears frequently in the literature of minimax estimation. Although such a restriction is unlikely to arise naturally in a practical problem, it is often implied by other natural restrictions. Toutenburg (1982) shows how to construct the least restrictive ellipsoidal restriction from a set of finite and linear inequality constraints. In many practical situations it may be possible to construct a large enough ellipsoid so that the parameter vector is contained in it. Significantly, the BLUE becomes inadmissible, even among linear estimators, as soon as such a restriction is imposed. Hoffman (1996) identifies a class of MILEs that have smaller risk than the BLUE over the entire parameter ellipsoid, irrespective of the weight matrix of the quadratic risk function.

11.3 Biased estimators with smaller dispersion
We now consider four linear estimators: the subset estimator, the principal components estimator, the ridge estimator and the shrinkage estimator. All of these estimators can be considered as alternatives to the BLUE. Compared to the BLUE, they have smaller dispersion, but this comes at the cost of some bias. The subset estimator is a special case of the BLUE with linear restrictions. The principal components estimator is a subset estimator in a reparametrized model. The ridge estimator is a Bayes linear estimator, while the shrinkage estimator is a minimax linear estimator. All the estimators can be shown to be 'shrinkage' estimators in some sense.

11.3.1 Subset estimator
Choosing a suitable subset of explanatory variables is an important part of regression model building. Sometimes one is not sure about the
set of explanatory variables which may be useful. Thus one is tempted to collect data for a large number of such variables, with the hope of selecting the useful variables later on. As a result of the collinearity that we discussed in Section 4.12, the dispersion of the BLUE of any estimable LPF may be unnecessarily inflated. Dropping some of the explanatory variables is a very natural way of reducing this dispersion. Consider the model M = (y, Xβ, σ²I), with X and β partitioned as X = (X₁ : X₂) and β = (β₁' : β₂')', and the subset model obtained by dropping the explanatory variables in X₂, that is, by imposing the restriction β₂ = 0. The MSE matrix of the resulting subset estimator is smaller than that of the BLUE under M, in the sense of the Löwner order, if and only if
σ² ≥ ||(I − P_{X₁})X₂β₂||².    (11.3.1)
By writing the model equation in the form
y = X₁β₁ + X₂β₂ + e = X₁(β₁ + (X₁'X₁)⁻X₁'X₂β₂) + (I − P_{X₁})X₂β₂ + e,
we can see that using the subset model amounts to ignoring the second term in the last expression. The MSE of the subset estimator of every estimable LPF is smaller than the MSE of the corresponding BLUE if and only if the squared norm of this term is smaller than the variance of the error term. If the columns of X₂ are almost equal to linear combinations of the columns of X₁ (that is, there is considerable collinearity), then the matrix (I − P_{X₁})X₂ has very small elements, and consequently the subset model would lead to a smaller MSE matrix.

We now turn to the estimation of a particular estimable LPF, Aβ. Let us denote the restricted BLUE of Aβ by Aβ̂_s. It is easy to see that Aβ̂_s = CXβ̂_s, where C is a matrix such that A = CX. Also, if Aβ̂ is the BLUE of Aβ, then Aβ̂ = CXβ̂. We can compare Aβ̂_s and Aβ̂ with respect to the quadratic loss function with a nonnegative definite weight matrix B. Indeed,
E[(Aβ̂ − Aβ)'B(Aβ̂ − Aβ)] = E[(Xβ̂ − Xβ)'C'FF'C(Xβ̂ − Xβ)] = tr[F'C(MSE(Xβ̂))C'F],
where B = FF'. We can simplify the risk of A/3S in a similar manner. Comparing these two expressions, we can conclude that whenever the MSE matrix of Xf3s is smaller than that of X/3 in the sense of the Lowner order, the risk of Aj3s is smaller than the risk of A/3 for any estimable A/3 and any nonnegative definite weight matrix B. Since the Lowner order of the MSE matrices is a strong condition, comparison in a weaker sense is sometimes more meaningful. We shall make comparisons in terms of (a) algebraic order of the MSE for an estimable scalar LPF and (b) algebraic order of the trace of the MSE matrices for X/3. We assume once again that V = I. Let us first consider a particular estimable LPF p'j3. A necessary and sufficient condition for the MSE of the subset estimator being smaller than the MSE of the BLUE is given in Exercise 11.19. Once
again, the subset estimator is found to be more suitable when the collinearity in the model is such that every column of X2 is approximately equal to a linear combination of the columns of X\. In order to compare the traces of the MSE matrices of the estimators of X/3, note that tiMSE(XJ3s)
= tr(σ²P_{X₁} + (I − P_{X₁})Xββ'X'(I − P_{X₁})) = σ²ρ(X₁) + ||(I − P_{X₁})Xβ||².    (11.3.2)
Similarly, tr MSE(Xβ̂) = σ²ρ(X). It follows that the latter trace is larger if and only if
σ² ≥ ||(I − P_{X₁})Xβ||²/(ρ(X) − ρ(X₁)) = ||(I − P_{X₁})X₂β₂||²/(ρ(X) − ρ(X₁)).
This condition is clearly much weaker than (11.3.1). Note that when X has full column rank, ρ(X) − ρ(X₁) is equal to the number of elements of β₂. Since ||(I − P_{X₁})y||² − σ²(n − ρ(X₁)) is an unbiased estimator of ||(I − P_{X₁})Xβ||² under the model M (this fact can be verified easily), the trace of the MSE given in (11.3.2) can be approximated by σ̂²C_p, where
C_p = ||(I − P_{X₁})y||²/σ̂² − n + 2ρ(X₁),
and σ̂² is the usual estimator of σ² from the full model (M). The above quantity is known as Mallows' C_p. It is used by practitioners not only to compare a subset model with the full model, but also as a criterion for selecting a particular model from a class of competing subset models. A smaller value of C_p is considered to be better. See Hocking (1996) for more details on this and other criteria for subset selection (also see Exercises 11.21-11.22).
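The following Python sketch (using a made-up design in which X₂ is nearly collinear with X₁) computes Mallows' C_p for a subset model along the lines of the formula above; for the full model the same statistic reduces to ρ(X).

import numpy as np

def proj(M):
    return M @ np.linalg.pinv(M)

rng = np.random.default_rng(3)
n = 50
X1 = rng.standard_normal((n, 3))
X2 = 0.95 * X1[:, :2] + 0.05 * rng.standard_normal((n, 2))    # nearly collinear with X1
X = np.hstack([X1, X2])
y = X1 @ np.array([1.0, -1.0, 0.5]) + X2 @ np.array([0.02, -0.01]) + rng.standard_normal(n)

p_full, p_sub = np.linalg.matrix_rank(X), np.linalg.matrix_rank(X1)
rss_full = y @ (np.eye(n) - proj(X)) @ y
rss_sub = y @ (np.eye(n) - proj(X1)) @ y
sigma2_hat = rss_full / (n - p_full)          # usual estimator from the full model

Cp = rss_sub / sigma2_hat - n + 2 * p_sub     # C_p for the subset model
print(Cp, "compared with", p_full, "for the full model")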
11.3.2 Principal components estimator
The idea of principal components regression arose in the context of the homoscedastic linear model with X having full column rank. We
shall confine our discussion to this model, although it is possible to generalize the idea in a straightforward manner to the model (y, X/3, o1 V), where both X and V have full column rank (see Exercise 11.23). The BLUE of /3 in the homoscedastic model is (X'X)~lX'y, and its dispersion is G2{X'X)~l. When there is collinearity, some eigenvalues of X'X are rather small. Therefore, the corresponding eigenvalues of (X'X)~l are large. This leads to the inflation of the dispersion of the BLUE of /3 and many other LPFs. A remedy for this problem can be sought in the following manner. Let UAU' be a spectral decomposition of X ' X . We can arrange the eigenvalues of X'X in the decreasing order, and partition the matrices U and A suitably so that
UΛU' = (U₁ : U₂) diag(Λ₁, Λ₂) (U₁ : U₂)' = U₁Λ₁U₁' + U₂Λ₂U₂',
where the diagonal elements of Λ₂ are small. Rewrite the model equation as
y = Xβ + e = X(U₁U₁' + U₂U₂')β + e = Z₁γ₁ + Z₂γ₂ + e = Zγ + e,
where Zᵢ = XUᵢ and γᵢ = Uᵢ'β for i = 1, 2, Z = XU = (Z₁ : Z₂) and γ = (γ₁' : γ₂')' = U'β. Note that (y, Zγ, σ²I) is a reparametrization of the original model. An important aspect of the reparametrized model is that the dispersion of the BLUE of γ is σ²(Z'Z)⁻¹ = σ²Λ⁻¹. The dispersion matrix is diagonal, and the lower right block of this matrix, σ²Λ₂⁻¹, has large diagonal elements. These correspond to parameters which cannot be estimated with good precision. In principal components regression, γ is estimated by its BLUE under the restriction γ₂ = 0. It is easy to see that this estimator is
γ̂_PC = ((Λ₁⁻¹Z₁'y)' : 0')'.
Using the transformation γ = U'β (that is, β = Uγ), we can write the principal components estimator of β as
β̂_PC = Uγ̂_PC = U₁Λ₁⁻¹Z₁'y = (U₁Λ₁⁻¹U₁')X'y.
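The following Python sketch (made-up data with an artificially collinear column) computes β̂_PC = U₁Λ₁⁻¹U₁'X'y by discarding the eigenvectors of X'X whose eigenvalues fall below a chosen cut-off, and prints it next to the least squares estimator for comparison; the cut-off used here is only an illustrative choice.

import numpy as np

rng = np.random.default_rng(4)
n, k = 40, 4
X = rng.standard_normal((n, k))
X[:, 3] = X[:, 0] + 0.01 * rng.standard_normal(n)    # induce collinearity
y = X @ np.array([1.0, 0.5, -0.5, 0.0]) + rng.standard_normal(n)

evals, U = np.linalg.eigh(X.T @ X)                   # ascending eigenvalues
order = np.argsort(evals)[::-1]                      # rearrange in decreasing order
evals, U = evals[order], U[:, order]

keep = evals > 1e-1 * evals[0]                       # drop very small eigenvalues
U1, L1 = U[:, keep], evals[keep]
beta_pc = U1 @ np.diag(1.0 / L1) @ U1.T @ X.T @ y

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_pc)
print(beta_ls)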
Thus, β̂_PC is obtained by reverse transformation from a subset estimator in a reparametrized model. Therefore, comparison of the mean squared errors of this estimator and the BLUE can proceed along the lines of the analysis of the subset estimator. It can be shown that the MSEs of β̂_PC and the BLUE of β, β̂, are
MSE(β̂_PC) = σ²U₁Λ₁⁻¹U₁' + U₂γ₂γ₂'U₂',
MSE(β̂) = σ²(U₁Λ₁⁻¹U₁' + U₂Λ₂⁻¹U₂').
Therefore,
MSE(β̂) − MSE(β̂_PC) = σ²U₂Λ₂⁻¹U₂' − U₂γ₂γ₂'U₂'.
This quantity is nonnegative definite if and only if σ²Λ₂⁻¹ − γ₂γ₂' is nonnegative definite. The latter condition is equivalent to the inequality
σ² ≥ γ₂'Λ₂γ₂ = β'U₂Λ₂U₂'β.    (11.3.3)
This condition is satisfied if Λ₂ has sufficiently small diagonal elements. On the other hand, the difference between the traces of the MSEs of the estimators of Xβ is
tr(MSE(Xβ̂) − MSE(Xβ̂_PC)) = σ²tr(Z₂Λ₂⁻¹Z₂') − ||Z₂γ₂||²
= σ²tr(Λ₂⁻¹Z₂'Z₂) − γ₂'Z₂'Z₂γ₂
= σ²ρ(U₂) − β'U₂Λ₂U₂'β.
In the above, ρ(U₂) is the number of diagonal elements of Λ₂, which is equal to the number of discarded 'variables'. The difference between the MSEs is nonnegative if and only if
σ² ≥ β'U₂Λ₂U₂'β/ρ(U₂).
This condition is evidently weaker than (11.3.3). An important qualitative difference between the subset and the principal components methods is the following. There is a clear hierarchy among the (linearly transformed) variables of the model (y, Z~f,a2I) in terms of the variance of the estimators of the corresponding coefficients,
which happen to be uncorrelated. Thus, one can choose a cut-off for the eigenvalues, and obtain a reasonable 'subset' right away. On the other hand, the search for a suitable subset may not be easy, as there is no similar hierarchy among the untransformed variables. A drawback of the principal components method is that the discarded components may not necessarily be poor as explanatory variables. It may be possible to find a discarded variable having more correlation with the response than a retained variable. Another drawback is that even though the effective number of variables is reduced, the remaining transformed variables are usually linear functions of all the original variables. If the cost of collecting data on some variables is high, the principal components method does not provide any savings in cost. Nevertheless, this method serves as an important tool for dimension reduction. Sometimes it is used to get an idea of the size of the subset model that one might look for.

11.3.3 Ridge estimator
Consider, once again, the homoscedastic model (y, Xβ, σ²I) where X has full column rank. Whenever there is collinearity, the inverse of the matrix X'X has large eigenvalues, and hence the variance of the BLUE of some LPFs is inflated. Hoerl and Kennard (1970a,b) suggested a simple solution to this problem: replace (X'X)⁻¹ in the expression of the BLUE by (X'X + rI)⁻¹, where r is a positive number. The effect of this alteration is to increase the diagonal elements of the matrix X'X by a constant amount. Since the change takes place only along the main diagonal of this matrix (like the rise of a steep mountain ridge from an ordinary landscape), the corresponding estimator is called the ridge estimator. Formally, the ridge estimator in the homoscedastic case is given by
β̂_r = (X'X + rI)⁻¹X'y.    (11.3.4)
This estimator exists even if X does not have full column rank. The limit of this estimator as r goes to zero is just the least squares estimator. The limit is not meaningful if X'X is not invertible. However, the limit of Af3r for an estimable A/3 (as r goes to zero) is indeed the BLUE of
Aβ, regardless of the possible singularity of X'X. It is easy to see that the bias of the ridge estimator is −r(X'X + rI)⁻¹β, the magnitude of which increases with r. The dispersion of the ridge estimator is D(β̂_r) = σ²(X'X + rI)⁻¹X'X(X'X + rI)⁻¹, which is a decreasing function of r, in the sense of the Löwner order (Exercise 11.25). Therefore, the constant r controls the trade-off between bias and dispersion of the ridge estimator. Such a trade-off also exists in the case of the subset and principal components estimators discussed earlier, but there the choice is limited by the number of candidate models. In this case, the number of choices is infinite because r can be any positive number. The difference between the mean squared errors of the BLUE and the ridge estimator of β is
MSE(β̂) − MSE(β̂_r)
= σ²[(X'X)⁻¹ − (X'X + rI)⁻¹X'X(X'X + rI)⁻¹] − r²(X'X + rI)⁻¹ββ'(X'X + rI)⁻¹
= σ²(X'X + rI)⁻¹[(X'X + rI)(X'X)⁻¹(X'X + rI) − X'X − r²σ⁻²ββ'](X'X + rI)⁻¹
= rσ²(X'X + rI)⁻¹[2I + r(X'X)⁻¹ − rσ⁻²ββ'](X'X + rI)⁻¹.    (11.3.5)
The above difference is nonnegative definite if and only if the matrix in the square brackets is nonnegative definite, which is equivalent to the condition
σ² ≥ β'[2r⁻¹I + (X'X)⁻¹]⁻¹β = ||Xβ||²·(β'[2r⁻¹I + (X'X)⁻¹]⁻¹β / β'X'Xβ).
The maximum value of the ratio given in the last expression is the largest eigenvalue of the matrix (2r⁻¹X'X + I)⁻¹, which is equal to (2λ/r + 1)⁻¹, where λ is the minimum eigenvalue of X'X. It follows that the ridge estimator has a smaller MSE matrix than the least squares estimator (in the sense of the Löwner order) for all β whenever
r⁻¹ ≥ (||Xβ||²/σ² − 1)/(2λ).
Thus, if r is chosen to be a small positive number, then the above inequality would be satisfied. Therefore, there is always a range of values of r for which the ridge estimator is better than the LSE. Unfortunately the ratio ||JC/3||2/cr2 is not known in practice. As a result, one has to choose r on the basis of the data (see Judge et al., 1980). Note that the above comparison of mean squared error matrices hold only when r is a constant. If r is chosen on the basis of the data, the ridge estimator is no longer a linear estimator, and the above analysis does not hold. Note that the role of the matrix rl is to inflate the smaller eigenvalues of the matrix X'X. This can also be achieved by replacing rl by a positive definite matrix, R. The corresponding ridge estimator is (X'X + R)~lX'y. The obvious generalization of this estimator to the model (y,X/3,a2V) where V is a known positive definite matrix (not necessarily proportional to the identity matrix) is the following.
β̂_R = (X'V⁻¹X + R)⁻¹X'V⁻¹y.
By comparing Xβ̂_R with the expression of the Bayes linear estimator given in (11.2.1), we can identify the ridge estimator as the BLE corresponding to any prior such that [E(σ²)]⁻¹E(ββ') = R⁻¹. It can also be interpreted as the Kuks-Olman MILE (see (11.2.4)) with H = R. In view of these identifications, we can define the ridge estimator for singular V as
β̂_R = R⁻¹X'(V + XR⁻¹X')⁻y.
The estimator does not depend on the choice of the g-inverse. The ridge estimator (11.3.4) can also be interpreted as a restricted least squares estimator. It can be shown that the sum of squares ||y − Xβ||² is minimized subject to the quadratic restriction ||β||² ≤ b² by the 'ridge' estimator β̂_r = (X'X + rI)⁻¹X'y, where r is such that ||β̂_r||² = b² (see Exercise 11.26). Note that this estimator is not linear, because r depends on y.
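As a small numerical illustration (made-up data, with one nearly collinear column and an arbitrary choice of r), the following Python sketch computes the ridge estimator (11.3.4) and checks that it coincides with the Bayes-linear form obtained by taking R = rI and V = I.

import numpy as np

rng = np.random.default_rng(5)
n, k = 30, 3
X = rng.standard_normal((n, k))
X[:, 2] = X[:, 0] + 0.01 * rng.standard_normal(n)     # near-collinearity
y = X @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(n)

r = 0.5
beta_ridge = np.linalg.solve(X.T @ X + r * np.eye(k), X.T @ y)   # (X'X + rI)^{-1} X'y

V, R = np.eye(n), r * np.eye(k)
beta_ridge_ble = np.linalg.inv(R) @ X.T @ np.linalg.pinv(V + X @ np.linalg.inv(R) @ X.T) @ y
print(np.allclose(beta_ridge, beta_ridge_ble))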
11.3.4 Shrinkage estimator
The trade-off between reduced variance and increased bias can also be achieved by simply multiplying the BLUE by a constant.
The resulting estimator,
Xβ̂_s = sXβ̂,
where Xβ̂ is the BLUE of Xβ, is called the shrinkage estimator (Mayer and Willke, 1973). The name shrinkage is derived from the fact that s is chosen to be a positive number less than 1. This choice ensures that the dispersion of the shrinkage estimator is less than that of the BLUE, in the sense of the Löwner order. Consider the linear model (y, Xβ, σ²I) where X has full column rank, and let β̂ and β̂_s be the least squares and shrinkage estimators of β. It is easy to see that
MSE(β̂) − MSE(β̂_s) = (1 − s²)σ²(X'X)⁻¹ − (1 − s)²ββ'.    (11.3.6)
The difference is positive definite if and only if ||Xβ||²/σ² − 1 < 2s/(1 − s).
The above condition is satisfied when s is marginally smaller than 1. There is always a range of values of s for which the shrinkage estimator would have a smaller MSE matrix than the least squares estimator. However, this range is not known because β and σ² are unknown. If s is chosen on the basis of the data, then the resulting estimator is no longer linear. The James-Stein estimator
(1 − cσ̂²/||Xβ̂||²) Xβ̂,   where c = (n − r)(r − 2)/[(n − r + 2)r],
defined for r = p(X) > 2, is such an estimator. Although it involves a choice of s that depends on y, it is known to have smaller trace of mean squared error than the least squares estimator, provided that y has the normal distribution (see Gruber, 1998, p.197). In the special case of full rank X and V, the shrinkage estimator of /3 can be interpreted as the MILE under the conditions of Proposition 11.2.11. The choice of an s in the range (0,1) is equivalent to the choice of a positive h for the spherical restriction, f3'f3 < a2 jh. The
shrinkage estimator is linearly admissible with respect to the squared error loss function (see Example 11.2.5). The shrinkage estimator is not the only estimator having a magnitude smaller than the BLUE. It can be shown that the ridge (Exercise 11.26, part (d)) and the principal components (Exercise 11.24) estimators also have smaller magnitude than the corresponding BLUE. The shrinkage estimator is a convex combination of the least squares estimator, β̂, and the trivial estimator, 0. If there is a prior guess of β (say, β₀), it may be wiser to choose a convex combination of the least squares estimator and β₀. This estimator can be formally written as
β̂_{s,β₀} = (1 − s)β₀ + sβ̂ = β̂ + (1 − s)(β₀ − β̂).
This act of moving β̂ towards β₀ is called shrinkage towards β₀. This turns out to be a good strategy when the prior guess β₀ is reasonably good (Exercise 11.28).
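The following Python sketch (made-up data; the shrinkage factor s and the prior guess β₀ are arbitrary illustrative choices) computes the plain shrinkage estimator and the estimator shrunk towards a prior guess, as described above.

import numpy as np

rng = np.random.default_rng(6)
n, k = 25, 3
X = rng.standard_normal((n, k))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

s = 0.9                                   # shrinkage factor in (0, 1)
beta_shrunk = s * beta_hat                # plain shrinkage towards 0

beta_0 = np.array([1.5, -1.0, 0.0])       # hypothetical prior guess
beta_towards_guess = beta_hat + (1 - s) * (beta_0 - beta_hat)   # = (1-s)*beta_0 + s*beta_hat

print(beta_hat, beta_shrunk, beta_towards_guess)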
11.4 Other linear estimators

11.4.1 Best linear minimum bias estimator
If one allows for bias, the class of linear estimators of an LPF Aβ becomes larger. In order to find a 'best' estimator in this class, we may consider the mean squared error matrix of the estimator Ty of Aβ, given by
MSE(Ty) = σ²[TVT' + (TX − A)U(TX − A)'],
where U = σ⁻²ββ'. It turns out that the matrix on the right hand side of the above equation is minimized, in the sense of the Löwner order, when
Ty = AUX'(V + XUX')⁻y    (11.4.1)
(see the proof of Proposition 11.2.7). Following Lewis and Odell (1966), we shall call this estimator the 'best linear estimator' of Aβ. The best linear estimator of Aβ exists even if Aβ is not estimable. Like the best linear predictor, the best linear estimator cannot be used in practice,
as U is not known. We can replace U by a prior guess, perhaps on the basis of a prior distribution of β and σ². We have already seen in Proposition 11.2.7 that a formal derivation of the Bayes linear estimator with respect to a prior distribution of β and σ² leads to an estimator of the same form. Consider the estimator
β̂_limbe = UX'(XUX')⁻y.
This estimator is called a linear minimum bias estimator (LIMBE). In general it is not unique. The LIMBE with the minimum dispersion is called the best linear minimum bias estimator (BLIMBE). Chipman (1964) proposes this estimator specifically for non-estimable LPFs. If U and V are positive definite matrices, then the BLIMBE of Aβ is Aβ̂_blimbe, where
β̂_blimbe = UX'(XUX'V⁻¹XUX')⁻XUX'V⁻¹y.
A brief derivation of this expression can be found in Rao (1973c). Note that both the best linear estimator and the BLIMBE depend on the choice of the matrix U. It can be chosen by using extraneous information. Rao (1973c, p.305) suggests an alternative consideration: to choose U as a relative weight given to the bias term as compared to the dispersion. It was pointed out in Chapter 4 that a non-estimable parameter is not identifiable. Such a parameter can be meaningfully estimated only if extraneous information (such as a prior distribution) is judiciously used. The fact that the best linear estimator and the BLIMBE exist even when the corresponding LPF is not identifiable makes sense only when these estimator utilizes extraneous information via the matrix U. If U is chosen without using such information, and A/3 is not estimable, neither the best linear estimator nor the BLIMBE of A/3 is meaningful. The problem of identifying the 'best' estimator among various types of linear estimators continued to interest researchers over the last three
decades. Hallum et al. (1973) find the best linear estimator in a linear model with restrictions. Chaubey (1982) obtains the best linear estimator for a particular choice of the matrix U, where the linear estimators are restricted to be LIMBE for another choice of U. Schaffrin (1999) studies the best linear estimator as the condition of unbiasedness is gradually relaxed.

11.4.2 'Consistent' estimator
When the dispersion matrix V is singular, the response in the linear model (y, Xβ, σ²V) must be such that y − Xβ ∈ C(V) almost surely. This is a property of the model error. On the other hand, Xβ (the systematic part of y) resides in C(X). Any estimator of β would lead to a decomposition of y into estimated 'systematic' and 'error' parts. If the estimator is reasonable, these two parts should belong to C(X) and C(V), respectively. The conditions
Xβ̂ ∈ C(X),    y − Xβ̂ ∈ C(V)
are sometimes referred to as consistency conditions for an estimator Xβ̂ of Xβ. Christensen (1996) refers to an LUE that satisfies the twin conditions by the name consistent LUE (CLUE). Remark 7.3.10 implies that the BLUE of Xβ is a CLUE. Christensen (1996, p.229) obtains a CLUE that minimizes ||y − Xβ̂||, and calls it the least squares CLUE. This estimator is not known to have any other optimal property and in general has a larger dispersion than the BLUE.
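The following Python sketch (made-up data with a singular V) checks the two consistency conditions numerically. The BLUE fit is computed through its residual V(I − P_X)[(I − P_X)V(I − P_X)]⁻(I − P_X)y, a standard representation; an ordinary least squares fit is shown for contrast, since its residual need not lie in C(V) when V is singular.

import numpy as np

def proj(M):
    return M @ np.linalg.pinv(M)

def in_colspace(M, v, tol=1e-8):
    return np.linalg.matrix_rank(np.column_stack([M, v]), tol=tol) == np.linalg.matrix_rank(M, tol=tol)

rng = np.random.default_rng(7)
n, k = 6, 2
X = rng.standard_normal((n, k))
C = rng.standard_normal((n, 3))
V = C @ C.T                                     # singular dispersion matrix (rank 3)
y = X @ rng.standard_normal(k) + C @ rng.standard_normal(3)

Q = np.eye(n) - proj(X)                         # I - P_X
resid_blue = V @ Q @ np.linalg.pinv(Q @ V @ Q) @ Q @ y
fit_blue = y - resid_blue
fit_ols = proj(X) @ y

print(in_colspace(X, fit_blue), in_colspace(V, y - fit_blue))   # both hold for the BLUE
print(in_colspace(X, fit_ols), in_colspace(V, y - fit_ols))     # second may fail for OLS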
11.5 A geometric view of BLUE in the Linear Model
Consider the model equation y = X/3 + e,
E(e) = 0,
£>(e) = a2V.
The term X/3 is the systematic or signal part of y, while e is called the error or noise. By leaving (3 unspecified, the model postulates that the systematic part is given by an unknown linear combination of the
columns of X. The empirical equivalent of the model equation is y = Xfi + e = y + e, where X(3 is the vector of fitted values, which is our estimator of the systematic part. We can call y the explained part of y (through the model), while the residual e is the unexplained part. An alternative way of interpreting the model is the following. The model postulates that the systematic part is in the column space C(X), while the error part has zero mean and dispersion proportional to V. In this interpretation, the emphasis is on the vector space specified as the 'systematic' part. The identity of the regressor variables is not important. A reparametrization preserves the vector spaces, and therefore it is expected to produce the same values of y and e. It turns out that there is much to learn from the vector space interpretation of best linear unbiased estimation. We have seen glimpses of the geometric view in Chapters 4, 5 and 7. The purpose of this section is to appreciate the BLUE from a purely geometric perspective. 11.5.1
The homoscedastic case
Let V = I. Figure 11.1 illustrates the decomposition y = X{5 + e. The vectors y, Xfi and e are represented by the line segments OA, OP and PA, respectively. The vertical axis represents C(X)-L, while the horizontal plane represents C(X). In reality both of these spaces may have dimensions greater than two, and typically C(X)1- would have a larger dimension than C(X). We choose the dimensions 1 and 2 for C(X)1- and C(X) respectively, for the sake of visualization. The shaded region around the point P represents the region where the point A could possibly have appeared in another random experiment. The darker shade in the core of the region represents higher concentration of probability mass there. The model specifies that OP lies somewhere on the horizontal plane. The estimation problem consists of identifying the vector OP. Figure 11.2 illustrates the synthesis of the BLUE of Xj3, represented by the line OB, which is obtained by minimizing the length of the line BA over all possible locations of B in the plane C{X). The minimum length
Figure 11.1 A geometric view of the homoscedastic linear model
Figure 11.2 A geometric view of BLUE in the homoscedastic linear model
occurs when AB is perpendicular (orthogonal) to the horizontal plane. Thus, we have OB and BA corresponding to y and e in the decomposition y = y + e. Clearly, y is the orthogonal projection of y on C(X), so that y € C(X) and e 6 C{X)L. The elliptical region around B is a typical confidence region for X/3. That the region lies entirely in the plane C{X) corresponds to the fact that C(D(y)) = C(X).
A
Figure 11.3 A geometric view of restricted BLUE in the linear model
11.5.2 The effect of linear restrictions
The effect of the restriction AB = £ in the homoscedastic case is illustrated in Figure 11.3, which is an extension of Figure 11.2. This diagram can be understood in the context of the 'equivalent' model (y— XA'{AA')-£,X(I-PA,),o2I) (see the discussion preceding (4.9.1) in page 125). The vector XA'(AA')~£ is a part of y which is completely known because of the restrictions. This vector, represented by the line segment 00', lies in C{XPA,), which is represented by a bold line on the plane of C(X). The other bold line represents the locus of tips of all the vectors which can be written as X A' (AA')~ £ + u, where u € C(X(I — PA,))- The point P (where OP is the unknown X/3) lies somewhere on this line. Since 00' is completely known, the estimation problem can be simplified by shifting the origin to O'. Consequent to this shift, the bold line passing through O' and P represents the vector space C(X(I-P )). The estimation problem reduces to finding the line segment OP or the point P on this line. The 'restricted' BLUE is obtained by finding a point B' on this line so that the length of B'A is the smallest. This is accomplished when B'A is perpendicular to this line, and in particular, to O'B'. If we refer to the model (y - XA'(AA')-{, X(I - PA,),<J2I) as {y*,X*(3,a2I) for brevity, then y*, X*[3 and e* are represented by
Figure 11.4 Orthogonal and oblique projections: (a) orthogonal projection; (b) oblique projection
the line segments O'A, O'B' and B'A, respectively. The vector X*/3 is the orthogonal projection of y* on C(X*). The line segment OB' represents yrest, the fitted value of y under the restriction A/3 — £. The line segment B'B can be seen as yrest — y or as e* — e. The vectors OB' and B'B are not shown explicitly in the figure.
11.5.3 The general linear model
In order to view the BLUE of X/3 in the general linear model as a projection, we need to use oblique projections rather than the orthogonal projection mentioned in Section 2.4. Note that if A and B are matrices with the same number of rows and C{A) and C(B) are virtually disjoint, then any vector I in C(A : B) has the unique representation l\ + I2 where /1 G C(A) and I2 6 C(B). (If + h* is another such representation, then l\ —1\* = I2* —12 belongs to C(A) as well as C(B), which means that these column space are not virtually disjoint.) We need a projection matrix which will make this decomposition possible. Definition 11.5.1 A matrix PA,R 1S called a projector onto C(A) along C{B) if for all I G C(A : B), PA.gl E C(A) and ( I - P )l E
C(B). Figure 11.4 illustrates the contrast between the orthogonal projection onto a column space and the (oblique) projection onto that column
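As a numerical illustration of this definition, the following Python sketch (made-up matrices A and B, chosen so that their column spaces are virtually disjoint) builds an oblique projector onto C(A) along C(B) using the construction A[(I − P_B)A]⁻(I − P_B) that appears in Proposition 11.5.2(b) below, and verifies its defining properties.

import numpy as np

rng = np.random.default_rng(9)
n = 6
A = rng.standard_normal((n, 2))
B = rng.standard_normal((n, 3))

P_B = B @ np.linalg.pinv(B)
P_A_along_B = A @ np.linalg.pinv((np.eye(n) - P_B) @ A) @ (np.eye(n) - P_B)

a, b = rng.standard_normal(2), rng.standard_normal(3)
l = A @ a + B @ b                                           # any l in C(A : B)
print(np.allclose(P_A_along_B @ l, A @ a))                  # recovers the C(A) component
print(np.allclose(P_A_along_B @ (B @ b), 0))                # annihilates C(B)
print(np.allclose(P_A_along_B @ P_A_along_B, P_A_along_B))  # idempotent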
space along another space, where both the spaces have dimension 1. The essential difference is that P.I and (I — P.)l are orthogonal to one another, while PA
Let A and B be matrices with the same number
(a) C{A) and C(B) are virtually disjoint if and only if C(A') = C(A'(I-PB)). (b) IfC{A) andC(B) are virtually disjoint, then A[(I-PB)A]-(IPB) is a projector onto C(A) along C(B). Proof. Suppose that C(A) and C(B) are virtually disjoint. It is clear that C(A'(I—PB)) C C(A'). If the inclusion is strict, let k be a vector in It follows that Ak £ C(B). C{A') which is orthogonal to C(A'(I-PB)). Since C(A) and C(B) are virtually disjoint, Ak must be 0. This implies that k, which is of the form A'm for some m, is itself equal to 0. Conversely, if C{A') = C{A'{I - PB)), let A = T{I - PB)A for some T. If Au\ = Bu2 for some u\ and U2, we have Am = T(I - PB)AUl
= T(I - PB)Bu2 = 0.
Thus, C(A) and C(B) are virtually disjoint. This proves part (a). To prove part (b), let P A | B = A[{I-PB)A]~{IPB) and let / be a vector in C(A : B). Suppose that I has the unique representation l\ +I2 where h € C{A) and l2 € C(B). It is easy to see that PA]Bl € C(A). On the other hand, (I~PA]B)l
=
l-A[(I-PB)A]-(J-PB)l
= l-T(I-PB)A[(I-PB)A]-(I-PB)h = l-T(I-PB)h = l-h = l2.
In the above, T is a matrix such that A' = A'(I - PB)T'. Note that the vector l2 is in C(B). We discuss now the BLUE of Xj3 in the general linear model using the idea of an oblique projector. Proposition 11.5.3 Consider the linear model (y,X/3,a2V). (a) The response vector y lies almost surely in C(V(I — Px) : X). (b) C(V(I — Px)) and C(X) are virtually disjoint. (c) The BLUE of Xj3 is almost surely equal to P . . .y. Proof. Since y G C(V : X) almost surely, it suffices to prove that
C{V : X) = C(V(I - Px) : X). The inclusion of C(V(I - Px) : X) in C(V : X) is obvious. In order to prove the reverse inclusion, let
I 6 C(V{I - Px) : X)1. Since X'l = 0, we can write I as (I - Px)m. As l'V(I - Px) = 0, we have m'(I - PX)V{I - Px) = 0 , that is, m'{I-Px)V = 0. Hence, l'(V : X) = 0. It follows that C(V{I-PX) : X)L CC{V : X ) X . Part (b) follows directly from part (a) of proposition 11.5.2, by
choosing A = V{I - Px) and B = X. Using part (b) of Proposition 11.5.2, and comparing the resulting expression with (7.3.3), we have e = P . ., y. Part (c) then follows from the fact that / — P . is a choice of PB,. d There is an interesting connection between the decomposition of the response as per Proposition 11.5.3 and the estimation and error spaces. According to this proposition, the response y can be uniquely decomposed as y1 + y 2 , where y1 6 C(X) and y2 £ C(V(I — Px))It follows from Proposition 11.1.25 that whenever a vector I is in the estimation space, I'y is almost surely equal to l'y1. Likewise, whenever / in the error space, i'y = l'y2 almost surely. See Rao (1974) for more information on oblique projectors. Figure 11.5 shows the synthesis of y in the general linear model (y, _X"/3,a2V) with possibly singular V. The systematic part, X/3, is represented by the line OP, which lies in the plane representing C(X). If V is singular, then e lies in the space C(V). The shaded region on
Figure 11.5 A geometric view of the singular linear model
this plane around the point P represents the region where the point A could possibly have appeared in another random sample. Consider the plane of C(V) that passes through the point O. Since PA (e) is parallel to this plane, the perpendiculars from A and P on C(V), A'A and P'P, must have the same length. These two vectors represent (/ — Py)y and (/ — Pv)Xf3, respectively. Thus, P'P is the part of OP that is known exactly through A'A, because of the singularity of the model. Figure 11.6 shows the construction of the BLUE in the general linear model. Note that the line of C{V{I — Px)) is not perpendicular to the plane of C(X). This is in contrast with Figure 11.1, where the axis of C{X)L is perpendicular to the plane of C(X).
Figure 11.6 A geometric view of BLUE in the general linear model
The BLUE of X/3 is obtained by dropping the oblique projection of OA on the plane C(X) along C(V(I — Px)). In other words, the point B is located by drawing a line parallel to C(V(I — Px)), to intersect the plane of C{X). Since C(V(I - Px)) is included in C(V), the line BA (corresponding to the residual vector e) is parallel to this plane. Therefore, the segment PB lies in the line through P that is parallel to C(V) fl C(X). PB corresponds to the error in the estimation of X/3, Xf3 — X/3. This clearly shows why the space spanned by D(X(3) is C{V) n C{X), and the space spanned by D{e) is C(V{I - Px)) (see Proposition 7.3.9). Figure 11.6 also illustrates that there is no need to separate the 'nonrandom' part of y to obtain the BLUE of X/3. The projections work much the same way as in the case of the homoscedastic linear model.
When $V$ is singular, one can further decompose $\widehat{X\beta}$ into a deterministic and a stochastic part (see Nordstrom, 1985). The stochastic part is the orthogonal projection of $OB$ on the line $\mathcal{C}(V) \cap \mathcal{C}(X)$, while the deterministic part is the perpendicular drawn from $O$ to the base of this projection. Restrictions in the general linear model can be visualized in a similar manner. Drawing a diagram in the singular case is made difficult by the fact that one runs out of dimensions while tracking the transition from $\mathcal{C}(X)$ to $\mathcal{C}(X(I - P_{A'}))$, and then to $\mathcal{C}(V) \cap \mathcal{C}(X(I - P_{A'}))$!
11.6 Large sample properties of estimators
The assumption of normality of the response is crucial for the confidence regions and tests of hypotheses described in Chapters 5, 7 and 10. We now explore whether, for large sample sizes, the inference can be carried out as described previously even if the distributional assumption is replaced by weaker conditions. We begin with the statement of a convergence result for the homoscedastic linear model. The proof of this result involves a series of results which are outside the purview of this book, and is omitted. We refer the reader to Sen and Singer (1993, Section 7.2) for a proof.

Proposition 11.6.1 Suppose that $\hat{\beta}_n$ is the least squares estimator of $\beta$ in the linear model
$y_n = X_n\beta + \varepsilon_n$, where $\varepsilon_n$ has independent and identically distributed components with mean 0 and variance $\sigma^2$. Let the matrix $X_n$ have full column rank and satisfy the conditions:
(i) the elements of the matrix $n^{-1}X_n'X_n$ converge to those of a finite and positive definite matrix, $V_*$, as the sample size $n$ goes to $\infty$;
(ii) the largest diagonal element of $P_{X_n}$ goes to zero as $n$ goes to $\infty$.
Then $\sqrt{n}(\hat{\beta}_n - \beta)$ converges in distribution to $N(0, \sigma^2 V_*^{-1})$. □
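As an illustration of Proposition 11.6.1, the following simulation sketch (Python/NumPy; the design, the skewed error distribution and the sample sizes are arbitrary choices made only for this example) draws repeated samples with non-normal errors and compares the empirical dispersion of $\sqrt{n}(\hat{\beta}_n - \beta)$ with $\sigma^2 V_*^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, reps = 2000, 3, 500
beta = np.array([1.0, 0.5, -1.0])
sigma2 = 4.0                                   # error variance
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)])
Vstar = X.T @ X / n                            # proxy for the limit V* in condition (i)

estimates = np.empty((reps, k))
for r in range(reps):
    eps = rng.exponential(scale=2.0, size=n) - 2.0   # mean 0, variance 4, non-normal
    y = X @ beta + eps
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

scaled = np.sqrt(n) * (estimates - beta)
print(np.cov(scaled, rowvar=False))            # empirical dispersion of sqrt(n)(beta_hat - beta)
print(sigma2 * np.linalg.inv(Vstar))           # asymptotic dispersion sigma^2 V*^{-1}
```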
We now extend this theorem to the general linear model $(y_n, X_n\beta, \sigma^2 V_n)$, where both $X_n$ and $V_n$ can be rank deficient. If $C_nC_n'$ is a rank factorization of $V_n$, then $y_n$ can be written as
$$y_n = X_n\beta + C_n\varepsilon_n, \qquad (11.6.1)$$
where $E(\varepsilon_n) = 0$ and $D(\varepsilon_n) = \sigma^2 I$. It is on the basis of this representation that the asymptotic normality of a BLUE is established.

Proposition 11.6.2 Let $\mathcal{M}_n$ be the model (11.6.1) where $\varepsilon_n$ has independent and identically distributed components with mean 0 and variance $\sigma^2$. Let the matrix $C_n$ have full column rank and $V_n = C_nC_n'$. Suppose that the following conditions hold:
(i) the model $\mathcal{M}_n$ satisfies the consistency condition $(I - P_{V_n})y_n \in \mathcal{C}((I - P_{V_n})X_n)$ with probability 1, for every sample size $n$;
(ii) $\rho(X_n)$ and $\rho(V_n : X_n) - \rho(V_n)$ do not depend on $n$;
(iii) the matrix $X_n$ has a rank factorization of the form
$$X_n = (X_{1n} : X_{2n})\begin{pmatrix} B_1 \\ B_2 \end{pmatrix},$$
where the matrix $B = (B_1' : B_2')'$ does not depend on $n$ and $\mathcal{C}(X_{2n}) = \mathcal{C}(X_n) \cap \mathcal{C}(V_n)$;
(iv) the elements of the matrix $n^{-1}X_{2n}'V_n^-X_{2n}$ converge to those of a finite and positive definite matrix, $V_*$, as the sample size $n$ goes to $\infty$;
(v) the largest diagonal element of $P_{C_n^+X_{2n}}$ goes to zero as $n$ goes to $\infty$.
If $A\hat{\beta}_n$ is the BLUE of an LPF $A\beta$ which is estimable under $\mathcal{M}_n$ for every $n$, then $\sqrt{n}(A\hat{\beta}_n - A\beta)$ converges in distribution to
$$N\!\left(0,\ \sigma^2 AB_R^-\begin{pmatrix} 0 & 0 \\ 0 & V_*^{-1}\end{pmatrix}(B_R^-)'A'\right),$$
where $B_R^-$ is any right-inverse of $B$. □
Proof. Note that $B$ has linearly independent rows. Let $\gamma_1 = B_1\beta$ and $\gamma_2 = B_2\beta$. We shall show that
$$\hat{\gamma}_{1n} = [X_{1n}'(I - P_{V_n})X_{1n}]^{-1}X_{1n}'(I - P_{V_n})y_n \quad\text{and}\quad \hat{\gamma}_{2n} = [X_{2n}'V_n^-X_{2n}]^{-1}X_{2n}'V_n^-(y_n - X_{1n}\hat{\gamma}_{1n})$$
are the BLUEs of $\gamma_1$ and $\gamma_2$, respectively, under $\mathcal{M}_n$. To verify that the inverses exist, note that
$$\rho(X_{1n}'(I - P_{V_n})X_{1n}) = \rho((I - P_{V_n})X_n) = \rho(V_n : X_n) - \rho(V_n) = \rho(X_n) - \dim(\mathcal{C}(X_n)\cap\mathcal{C}(V_n)) \ \text{(see Table 11.1)} = \rho(X_n) - \rho(X_{2n}) = \rho(X_{1n}) \ \text{(from the rank factorization of } X_n\text{)},$$
which is the same as the number of columns of $X_{1n}'(I - P_{V_n})X_{1n}$. Likewise, $\rho(X_{2n}'V_n^-X_{2n}) = \rho(X_{2n})$ (as $\mathcal{C}(X_{2n}) \subseteq \mathcal{C}(V_n)$), which is the same as the number of columns of $X_{2n}'V_n^-X_{2n}$. It is easy to verify that $E(\hat{\gamma}_{1n}) = \gamma_1$ and consequently $E(\hat{\gamma}_{2n}) = \gamma_2$. Thus, $\hat{\gamma}_{1n}$ and $\hat{\gamma}_{2n}$ are linear unbiased estimators of $\gamma_1$ and $\gamma_2$, respectively. Since $\hat{\gamma}_{1n}$ has zero dispersion, it is the BLUE of $\gamma_1$. On the other hand,
$$\mathrm{Cov}(\hat{\gamma}_{2n}, (I - P_{X_n})y_n) = \sigma^2[X_{2n}'V_n^-X_{2n}]^{-1}X_{2n}'(I - P_{X_n}) = 0.$$
As $\hat{\gamma}_{2n}$ is uncorrelated with every LZF of $\mathcal{M}_n$, it is the BLUE of $\gamma_2$. The model $\mathcal{M}_n$ can be reparametrized as $y_n = X_{1n}\gamma_1 + X_{2n}\gamma_2 + C_n\varepsilon_n$, which is equivalent to the pair of equations
$$(I - P_{V_n})y_n = (I - P_{V_n})X_{1n}\gamma_1, \qquad C_n^+(y_n - X_{1n}\gamma_1) = C_n^+X_{2n}\gamma_2 + \varepsilon_n.$$
These equations correspond to the linear models
$$\mathcal{M}_{1n} : \big((I - P_{V_n})y_n,\ (I - P_{V_n})X_{1n}\gamma_1,\ 0\big), \qquad \mathcal{M}_{2n} : \big(C_n^+(y_n - X_{1n}\gamma_1),\ C_n^+X_{2n}\gamma_2,\ \sigma^2 I\big).$$
In these models, the parameters $\gamma_1$ and $\gamma_2$ are completely estimable. For $\mathcal{M}_{1n}$, $\hat{\gamma}_{1n}$ is the BLUE of $\gamma_1$ with zero dispersion. For $\mathcal{M}_{2n}$, $\hat{\gamma}_{2n}$ is the BLUE, having dispersion $\sigma^2[X_{2n}'C_n^{+\prime}C_n^+X_{2n}]^{-1}$; note that $n^{-1}X_{2n}'C_n^{+\prime}C_n^+X_{2n} = n^{-1}X_{2n}'V_n^-X_{2n}$. Using conditions (iv) and (v) and Proposition 11.6.1 for the model $\mathcal{M}_{2n}$, we have $\sqrt{n}(\hat{\gamma}_{2n} - \gamma_2)$ converging in distribution to $N(0, \sigma^2 V_*^{-1})$. Thus,
$$\sqrt{n}\left(\begin{pmatrix}\hat{\gamma}_{1n}\\ \hat{\gamma}_{2n}\end{pmatrix} - \begin{pmatrix}\gamma_1\\ \gamma_2\end{pmatrix}\right) \ \text{converges in distribution to}\ N\!\left(0,\ \sigma^2\begin{pmatrix}0 & 0\\ 0 & V_*^{-1}\end{pmatrix}\right).$$
Using the fact that $\mathcal{C}(A') \subseteq \mathcal{C}(X_n') = \mathcal{C}(B')$, we can write $A\beta = AB_R^-B\beta = AB_R^-\gamma$ for any right-inverse $B_R^-$ of $B$. Consequently, the BLUE of $A\beta$ is $AB_R^-(\hat{\gamma}_{1n}' : \hat{\gamma}_{2n}')'$. The statement of the proposition follows. □

Remark 11.6.3 Conditions (ii) and (iii) of Proposition 11.6.2 essentially mean that the column space of $X_n'$ remains the same as the sample size goes to $\infty$, and that the LPFs which can be estimated with zero error also remain the same.

Remark 11.6.4 Under the conditions of Proposition 11.6.2, $A\hat{\beta}_n - A\beta$ converges to zero almost surely as $n$ goes to $\infty$. In this sense, the BLUE can be said to be strongly consistent.

When $V_n$ is positive definite for every $n$, we have the following corollary to Proposition 11.6.2.

Proposition 11.6.5 Let $\mathcal{M}_n$ be the model (11.6.1) where $\varepsilon_n$ has independent and identically distributed components with mean 0 and variance $\sigma^2$. Let the square matrix $C_n$ have full rank and $V_n = C_nC_n'$. Suppose that the following conditions hold:
(i) the matrix $X_n$ has a rank factorization of the form $X_n = X_{1n}B$, where the matrix $B$ does not depend on $n$;
(ii) the elements of the matrix $n^{-1}X_{1n}'V_n^{-1}X_{1n}$ converge to those of a finite and positive definite matrix, $V_*$, as the sample size $n$ goes to $\infty$;
(iii) the largest diagonal element of $P_{C_n^{-1}X_{1n}}$ goes to zero as $n$ goes to $\infty$.
If $A\hat{\beta}_n$ is the BLUE of an LPF $A\beta$ which is estimable under $\mathcal{M}_n$ for every $n$, then the limiting distribution of $\sqrt{n}(A\hat{\beta}_n - A\beta)$ is multivariate normal with mean 0 and dispersion $\sigma^2AB_R^-V_*^{-1}(B_R^-)'A'$, where $B_R^-$ is any right-inverse of $B$. □

It can be shown that under the conditions of Proposition 11.6.1, the usual unbiased estimator of $\sigma^2$ under the model $(y_n, X_n\beta, \sigma^2 I)$ converges to $\sigma^2$ almost surely as $n$ goes to $\infty$ (see Sen and Singer, 1993, p. 281). We can apply this result to the model $\mathcal{M}_{2n}$.
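The almost sure convergence of the usual estimator of $\sigma^2$ can also be seen numerically. The sketch below (Python/NumPy; the design, the skewed error distribution and the sequence of sample sizes are arbitrary assumptions made only for illustration) tracks $\hat{\sigma}^2 = R_0^2/(n-k)$ along a growing sample with non-normal errors.

```python
import numpy as np

rng = np.random.default_rng(2)
beta, sigma2 = np.array([2.0, -1.0]), 9.0

for n in [100, 1000, 10000, 100000]:
    X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
    eps = sigma2 ** 0.5 * (rng.exponential(size=n) - 1.0)  # mean 0, variance sigma2, non-normal
    y = X @ beta + eps
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2_hat = resid @ resid / (n - X.shape[1])          # usual unbiased estimator of sigma^2
    print(n, round(sigma2_hat, 3))                         # values approach 9.0 as n grows
```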
11.7 Exercises
11.1 Let the best linear predictor (BLP) of $y$ given $x$ be defined as in Exercise 3.8. Given the linear model $(y, X\beta, \sigma^2 V)$, show that the BLP of $y$ given $Ty$ is a function of $Ty$ alone (irrespective of $\beta$) if and only if $Ty$ is linearly sufficient for $\beta$.
11.2 Given the model $\mathcal{M} = (y, X\beta, \sigma^2 V)$ for the response $y$, consider the model $\mathcal{M}_T = (Ty, TX\beta, \sigma^2 TVT')$ for the linearly transformed response $Ty$. Show that all the BLUEs of $\mathcal{M}$ are BLUEs of $\mathcal{M}_T$ if and only if $Ty$ is linearly sufficient for $\beta$.
11.3 If $y \sim N(X\beta, \sigma^2 V)$, then show that the statistic $Ty$ is linearly minimal sufficient for $\beta$ only if it is complete and sufficient for $\beta$, using the following steps.
(a) Let $y_1$ and $y_2$ be as in Proposition 11.1.16. Assuming that there is no known restriction on the parameter space, explain why it is enough to show that $(y_1' : y_2')'$ is complete and sufficient for $\beta$.
(b) Show that the vector $(y_1' : y_2')'$ of part (a) is complete and sufficient for its expected value, provided that there is no known restriction on the parameter space. Hence, prove that this vector is complete and sufficient for $\beta$.
(c) In the case of restricted parameter space, use the equivalent unrestricted model of Section 11.1.2 to complete the proof.
11.4 Obtain simpler expressions for the dimensions shown in Table 11.1 in the following special cases.
(a) $\mathcal{C}(X) = \mathcal{C}(V)$. (b) $\mathcal{C}(X) \subseteq \mathcal{C}(V)$. (c) $\mathcal{C}(V) \subseteq \mathcal{C}(X)$. (d) $\mathcal{C}(X) = \mathcal{C}(V)^\perp$. (e) $\mathcal{C}(X) \subseteq \mathcal{C}(V)^\perp$. (f) $\mathcal{C}(V)^\perp \subseteq \mathcal{C}(X)$.
11.5 Describe a linear analogue of the information inequality in the context of the linear model.
11.6 Modify the statement of Proposition 11.1.20 for the case when $z$ is any basis set of BLUEs and prove the result.
11.7 Prove Proposition 11.1.24. [Hint: Use Remark 11.1.23.]
11.8 Let $y_{(i)} = L^{(i)}L_iy$, where $L_i$, $i = 1,2,3,4$, are as in Proposition 11.1.16 and $(L^{(1)} : L^{(2)} : L^{(3)} : L^{(4)}) = L^{-1}$. Show that
(a) $y = y_{(1)} + y_{(2)} + y_{(3)} + y_{(4)}$;
(b) $y_{(1)}$ and $y_{(2)}$ are BLUEs, and their elements together constitute a generating set of BLUEs;
(c) $y_{(3)}$ is an LZF whose elements constitute a generating set of LZFs;
(d) $D(y_{(2)}) = 0$ and $y_{(4)}$ is identically zero;
(e) $y_{(1)} + y_{(2)} = \widehat{X\beta}$ and $y_{(3)} = e$;
(f) $D(\widehat{X\beta}) = D(y_{(1)})$ and $D(e) = D(y_{(3)})$.
11.9 Consider the linear models $\mathcal{M} = (y, X\beta, \sigma^2 V)$ and $\mathcal{M}^D = (y, V(I - P_X)\gamma, \sigma^2 V)$ with $V$ positive definite. Bhimasankaram and Sengupta (1996) called $\mathcal{M}^D$ the dual model corresponding to $\mathcal{M}$. Show that
(a) the error space of $\mathcal{M}^D$ is the estimation space of $\mathcal{M}$, and vice versa;
(b) the class of LZFs of $\mathcal{M}^D$ coincides with the class of BLUEs of $\mathcal{M}$, and vice versa;
(c) the residual vector of $\mathcal{M}^D$ is almost surely equal to the vector of fitted values in $\mathcal{M}$, and vice versa;
(d) the dual of $\mathcal{M}^D$ is $\mathcal{M}$ or a reparametrization of it.
11.10 If $\hat{y}$ is the BLUE of $X\beta$ in the linear model $(y, X\beta, \sigma^2 V)$, then determine the conditions under which $cC\hat{y}$ is an ALE of $CX\beta$ for $0 < c < 1$.
11.11 Consider the problem of estimating the parametric function $g(\theta)$ from the observation vector $y$ using a linear estimator of the form $Ty$. Let $L_B$ indicate the loss function $(Ty - g(\theta))'B(Ty - g(\theta))$, where $B$ is a symmetric nonnegative definite matrix, and $L_I$ be the loss function $L_B$ for $B = I$.
(a) Let $Sy$ be an admissible linear estimator of $g(\theta)$ with respect to the loss function $L_B$, and let $Ty$ be an inadmissible estimator whose risk function uniformly dominates that of $Sy$. Show that $Ty$ is inadmissible with respect to $L_F$, where $F = B/b$, $b$ being the largest eigenvalue of $B$.
(b) In the above set-up, prove that $Ty$ cannot be linearly admissible for $g(\theta)$ with respect to $L_I$.
(c) If $Ty$ is an ALE of $g(\theta)$ with respect to the loss function $L_I$, then show that it is an ALE with respect to the loss function $L_B$ for any nonnegative definite $B$.
11.12 Using the result of Exercise 11.11, show that whenever $Ty$ is linearly admissible for $X\beta$ in the model $(y, X\beta, \sigma^2 V)$ with respect to the squared error loss function, $CTy$ is admissible for $CX\beta$ with respect to the same loss function for any $C$.
11.13 Show that the maximum risk of the estimator $Ty$ of the estimable LPF $A\beta$ in the linear model $(y, X\beta, \sigma^2 V)$, with respect to the loss function $(Ty - A\beta)'B(Ty - A\beta)$ and subject to the restriction $\beta'H\beta \le \sigma^2$, is given by (11.2.2). Assume that $B$ and $H$ are both nonnegative definite and $\mathcal{C}(X') \subseteq \mathcal{C}(H)$.
11.14 Suppose that $Ty$ is a linear estimator of the estimable LPF $A\beta$, where the parameter $\beta$ satisfies the quadratic restriction $\beta'H\beta \le \sigma^2$, where $H$ is a nonnegative definite matrix such that $\mathcal{C}(X') \subseteq \mathcal{C}(H)$.
(a) Show that for any vector $p$ of suitable dimension,
$$p'E[(Ty - A\beta)(Ty - A\beta)']p \le \sigma^2[p'TVT'p + p'(T - C)XH^-X'(T - C)'p],$$
where $C$ is a matrix such that $A = CX$.
(b) Show that
$$\sup_{\beta:\ \beta'H\beta \le \sigma^2} E[(Ty - A\beta)(Ty - A\beta)'] = \sigma^2[TVT' + (T - C)XH^-X'(T - C)'],$$
where the supremum is in the sense described in page 496.
(c) Using the fact that the maximum mean squared error of $p'Ty$ for estimating $p'A\beta$ is minimized when $p'Ty$ is chosen as $p'AX^-X\hat{\beta}_m$, prove (11.2.3).
11.15 Consider the linear model $(y, X\beta, \sigma^2 V)$ where $X$ and $V$ have full column rank and $\rho(X) > 1$. Consider the problem of minimax linear estimation of $\beta$ with respect to the loss function $(Ty - \beta)'B(Ty - \beta)$ and the restriction $\beta'\beta = \sigma^2/h$. Show that the MILEs for the cases (a) $B = I$ and (b) $B$ is an arbitrary positive semidefinite matrix of rank 1, are almost surely different.
11.16 Consider the linear model $(y, X\beta, \sigma^2 V)$ where $X$ and $V$ have full column rank and $\rho(X) > 1$. Show that the MILE of $a'\beta$ with respect to the loss function $|t'y - a'\beta|^2$ and the restriction $\beta'\beta = \sigma^2/h$ is almost surely different from $a'\hat{\beta}_M$, where $\hat{\beta}_M$ is as defined in Proposition 11.2.11.
11.17 Consider the model $(y, X\beta, \sigma^2 V)$ where $V$ is positive definite.
(a) If $H$ is a matrix such that $\mathcal{C}(X') \subseteq \mathcal{C}(H)$, then show that the Kuks-Olman estimator of $X\beta$ for the quadratic restriction $\beta'H\beta \le \sigma^2$ can almost surely be written as $X\hat{\beta}_m = X(X'V^{-1}X + H)^-X'V^{-1}y$.
(b) If the prior distribution of $\beta$ and $\sigma^2$ is such that $[E(\sigma^2)]^{-1}E(\beta\beta') = U$, where $U$ is a positive definite matrix, then show that the BLE of $\beta$ is $\hat{\beta}_B = (X'V^{-1}X + U^{-1})^{-1}X'V^{-1}y$.
11.18 Consider the linear model $(y, X\beta, \sigma^2 V)$ where the prior distribution of $\beta$ given $\sigma^2$ is such that $\gamma'H\gamma \le 1$ and $E(\gamma\gamma') = U$, where $\gamma = \beta/\sigma$ and $H$ and $U$ are nonnegative definite matrices such that $\mathcal{C}(X') \subseteq \mathcal{C}(H)$. The squared error loss function is used to obtain the MILE and BLE of the estimable LPF $a'\beta$.
(a) Show that the least favourable prior for $\gamma$ (as far as the BLE of $a'\beta$ is concerned) is one for which $U = H^-$ for some g-inverse of $H$.
(b) Show that the MILE and BLE of $a'\beta$ then coincide.
11.19 Consider the BLUE $p'\hat{\beta}$ of a single estimable LPF $p'\beta$ in the linear model $(y, X\beta, \sigma^2 I)$, and the corresponding 'subset estimator' $p'\hat{\beta}_s$ obtained from the model $(y, X_1\beta_1, \sigma^2 I)$. Assume that $\beta_2$ is estimable, its BLUE under the full model is $\hat{\beta}_2$, and the dispersion matrix of this BLUE is positive definite.
(a) If $p_1$ and $p_2$ are sub-vectors of $p$ such that $p'\beta = p_1'\beta_1 + p_2'\beta_2$, show that $p_1'\beta_1$ is estimable in both the models.
(b) Show that $D(\hat{\beta}_2) = \sigma^2[X_2'(I - P_{X_1})X_2]^{-1}$.
(c) Show that $\mathrm{Cov}(p'\hat{\beta}, \hat{\beta}_2) = [p_2' - p_1'(X_1'X_1)^-X_1'X_2]D(\hat{\beta}_2)$.
(d) By using (7.9.5), show that the subset estimator $p'\hat{\beta}_s$ has smaller MSE than the BLUE $p'\hat{\beta}$ if and only if
$$\sigma^2 > (q'\beta_2)^2/\{q'[X_2'(I - P_{X_1})X_2]^{-1}q\}.$$
(e) Explain how the vector $q = p_2 - X_2'X_1(X_1'X_1)^-p_1$ can be interpreted as the prediction error for $p_2$ in a suitable model. Interpret the matrix $X_2'(I - P_{X_1})X_2$ in terms of this model.
11.20 Consider the model $(y, X\beta, \sigma^2 I)$ with $X' = (x_0 : x_1 : x_2 : x_3)'$,
where 0 = (0O : 0i : 02
/ 1 1 1 1 1 1 1 1 - 1 \ 1 -1 1-
1 1 1 - 1 1 1 1 1-
&)',
1 1 1\ 1 - 1 -1 1 -1 -1 ' 1 1 -1 /
The objective of this study is to determine whether the subset model consisting only of $x_0$ and $x_1$ will be more suitable than the full model for the purpose of estimating certain LPFs.
(a) Will the 'subset estimator' of $\beta_3 - \beta_0 - \beta_1 - \beta_2$ have smaller MSE than the BLUE from the full model, if the true parameter values are such that $\beta_2 = \beta_3$?
(b) Will the 'subset estimator' of $\beta_1 - \beta_0 - \beta_2 - \beta_3$ have smaller MSE than the BLUE from the full model, if the true parameter values are such that $\beta_2 = \beta_3$?
(c) Can you give an intuitive explanation of the discrepancy in the answers to parts (a) and (b)?
11.21 Adjusted $R^2$. Consider the linear model $(y, X\beta, \sigma^2 I)$ with explanatory variables $x_1, \ldots, x_k$. Let $A$ and $B$ be the index sets of two different subsets of explanatory variables. Let $\mathcal{M}_A$ and $\mathcal{M}_B$ be the subset models corresponding to $A$ and $B$, respectively.
(a) If $A \subseteq B$, show that the sample squared multiple correlation coefficient $R^2$ (see (5.3.15)) of $\mathcal{M}_B$ is at least as large as that for $\mathcal{M}_A$. [Thus, $R^2$ is not very useful as a criterion for comparing nested subsets.]
(b) A modification of $R^2$ which is adjusted for subset size is $\bar{R}^2 = 1 - \hat{\sigma}_s^2/\hat{\sigma}^2$, where $\hat{\sigma}_s^2$ and $\hat{\sigma}^2$ are the usual estimators of $\sigma^2$ under the subset and full models, respectively. This criterion for comparing subsets is called 'adjusted $R^2$', and a small value of $\bar{R}^2$ is considered preferable. Show that this criterion is equivalent to $R^2$ and $C_p$ when comparison is made between subsets of equal size.
11.22 Show that the $\bar{R}^2$ criterion defined in Exercise 11.21 cannot give a smaller 'best' subset than Mallows' $C_p$. [Hint: Consider the best subset according to $\bar{R}^2$ and another subset of larger size; show that the latter subset cannot possibly have a smaller value of Mallows' $C_p$.]
11.23 Derive a principal components estimator of $X\beta$ in the model $(y, X\beta, \sigma^2 V)$ where $V$ is non-singular. When does it have smaller trace of mean squared error compared to the BLUE?
11.24 If the matrix $X$ in the linear model $(y, X\beta, \sigma^2 I)$ has full column rank, show that the principal components estimator of $\beta$ has smaller magnitude than the least squares estimator.
11.25 Let $D_r$ be the dispersion matrix of the ridge estimator $(X'X + rI)^{-1}X'y$ in the linear model $(y, X\beta, \sigma^2 I)$, $r$ being a positive number. Show that $D_{r_1} \ge D_{r_2}$ in the sense of the Lowner order whenever $0 < r_1 < r_2$.
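The ridge comparison of Exercise 11.25 and the adjusted-$R^2$ idea of Exercise 11.21(b) are both easy to probe numerically. The sketch below (Python/NumPy; the design, data and subset choices are arbitrary assumptions made only for illustration) computes $\bar{R}^2 = 1 - \hat{\sigma}_s^2/\hat{\sigma}^2$ for two candidate subsets, showing that for subsets of a given size the ranking is driven entirely by $\hat{\sigma}_s^2$.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of the least squares fit of y on the columns of X."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return resid @ resid

rng = np.random.default_rng(5)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])   # full model: 4 columns
y = X @ np.array([1.0, 2.0, 0.0, 0.5]) + rng.standard_normal(n)

sigma2_full = sse(X, y) / (n - X.shape[1])
for cols in [[0, 1], [0, 1, 3]]:                                  # two candidate subsets
    Xs = X[:, cols]
    sigma2_sub = sse(Xs, y) / (n - Xs.shape[1])
    print(cols, round(1 - sigma2_sub / sigma2_full, 3))           # adjusted R^2 of each subset
```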
11.27 Let the matrix $X$ in the linear model $(y, X\beta, \sigma^2 I)$ have full column rank.
(a) Find a necessary and sufficient condition on $\beta$ and $\sigma^2$ so that the ridge estimator $\hat{\beta}_r$ has smaller MSE matrix than the corresponding BLUE, in the sense of the Lowner order, for any $r > 0$.
(b) Find a necessary and sufficient condition on $\beta$ and $\sigma^2$ so that the shrinkage estimator $\hat{\beta}_s$ has smaller MSE matrix than the corresponding BLUE, in the sense of the Lowner order, for any $s$ in the interval $(0,1)$.
(c) Find a necessary and sufficient condition on $\beta$ and $\sigma^2$ so that the trivial estimator 0 has smaller MSE matrix than the corresponding BLUE, in the sense of the Lowner order.
11.28 Find the necessary and sufficient condition for the estimator $\hat{\beta}_{s,\beta_0} = \hat{\beta} + (1 - s)(\beta_0 - \hat{\beta})$ in the model $(y, X\beta, \sigma^2 I)$ (where $X$ has full column rank) to have a smaller mean square error matrix than the least squares estimator, $\hat{\beta}$. Find a condition on $\beta$ and $\sigma^2$ that would ensure this dominance for all values of $s$.
11.29 Consider the BLUE ($A\hat{\beta}$) and the shrinkage estimator ($A\hat{\beta}_s = sA\hat{\beta}$) of an estimable LPF $A\beta$ in the model $(y, X\beta, \sigma^2 V)$.
(a) Find a necessary and sufficient condition for $MSE(A\hat{\beta}) \ge MSE(A\hat{\beta}_s)$ in the sense of the Lowner order.
(b) Find a necessary and sufficient condition for the above order to hold for all values of $s$.
(c) Find a necessary and sufficient condition for the dominance of $MSE(A\hat{\beta}_s)$ by $MSE(A\hat{\beta})$ for all estimable $A\beta$.
(d) Simplify the above conditions for the case $V = I$.
11.30 Suppose that $\mathcal{C}(A)$ and $\mathcal{C}(B)$ are virtually disjoint. Prove the following facts about the oblique projection matrix $P_{A,B}$:
(a) $P_{A,B}$ is uniquely defined when $(A : B)$ has full row rank.
(b) $P_{A,B}$ is idempotent when $(A : B)$ has full row rank.
(c) $P_{A,B}$ is neither unique nor idempotent when $(A : B)$ does not have full row rank.
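As a small numerical companion to Exercise 11.25, the sketch below (Python/NumPy; $X$, $\sigma^2$ and the pair $r_1 < r_2$ are arbitrary illustrative choices) forms the ridge dispersion $D_r = \sigma^2(X'X + rI)^{-1}X'X(X'X + rI)^{-1}$ for two values of $r$ and verifies that $D_{r_1} - D_{r_2}$ is nonnegative definite.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma2 = 30, 4, 2.0
X = rng.standard_normal((n, k))
XtX = X.T @ X

def ridge_dispersion(r):
    # Dispersion of (X'X + rI)^{-1} X'y when D(y) = sigma2 * I
    M = np.linalg.inv(XtX + r * np.eye(k))
    return sigma2 * M @ XtX @ M

D1, D2 = ridge_dispersion(0.5), ridge_dispersion(2.0)   # r1 = 0.5 < r2 = 2.0
eigs = np.linalg.eigvalsh(D1 - D2)
print(eigs.min() >= -1e-10)   # True: D_{r1} - D_{r2} is nonnegative definite
```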
Solutions to Odd-Numbered Exercises
Chapter 1

1.1 The model is $y_t = \beta_1 x_{t1} + \beta_2 x_{t2} + \varepsilon_t$, where $x_{t1} = \cos(\omega t)$, $x_{t2} = \sin(\omega t)$, $\beta_1 = a\cos\phi$ and $\beta_2 = -a\sin\phi$. The hypothesis of 'no signal' corresponds to $\beta_1 = \beta_2 = 0$.
1.3 The gender-specific models are of the form $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $y_i$ is the log(record time) for the $i$th category, and $x_i$ is the log(running distance) for that category, $i = 1, \ldots, 10$. The X-matrix for each model consists of the column of 1s followed by the column of log(running distance). The y-vectors for the two models consist of the columns of log(record time) for men and women, respectively. The grand model has 20 observations and three explanatory variables. For $i = 1, \ldots, 20$, let
$$x_{i1} = \begin{cases} 1 & \text{if the } i\text{th category is for men,} \\ 0 & \text{if the } i\text{th category is for women,} \end{cases}$$
$$x_{i2} = \begin{cases} \text{log(running distance) for the } i\text{th category} & \text{if } x_{i1} = 1, \\ 0 & \text{if } x_{i1} = 0, \end{cases}$$
$$x_{i3} = \begin{cases} 0 & \text{if } x_{i1} = 1, \\ \text{log(running distance) for the } i\text{th category} & \text{if } x_{i1} = 0. \end{cases}$$
Then the model is $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i$, where $y_i$ is the log(record time) for the $i$th category. The X-matrix for this model consists of the column of 1s and three other columns containing the values of $x_{i1}$, $x_{i2}$ and $x_{i3}$. (There are 10 zeroes in each of the last three columns.)
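A sketch of how this grand design matrix can be assembled is given below (Python/NumPy). The log-distance vectors are placeholders, assumed here only so that the code runs; they are not the record-time data themselves.

```python
import numpy as np

# Placeholder inputs (assumed, not the actual data): log running distances,
# ten events per gender, listed in the same order for men and women
men_logdist = np.log(np.linspace(100.0, 10000.0, 10))
women_logdist = men_logdist.copy()

logdist = np.concatenate([men_logdist, women_logdist])    # 20 observations
x1 = np.concatenate([np.ones(10), np.zeros(10)])          # men indicator
x2 = x1 * logdist                                          # log distance for men, 0 otherwise
x3 = (1 - x1) * logdist                                    # log distance for women, 0 otherwise

X = np.column_stack([np.ones(20), x1, x2, x3])             # grand-model design matrix
print(X.shape, np.linalg.matrix_rank(X))                   # (20, 4), rank 4
```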
1.5 The model is $y = \beta_0\mathbf{1} + \beta_1 x + \varepsilon$, where $x$ consists of the years 1981 to 2000, and $y$ consists of the corresponding mid-year population. $\beta_1$ is the average increase in population per year, and $\beta_0$ is the population in the year 0, assuming that the rate of increase has been equal to $\beta_1$ all along. (This assumption is somewhat far-fetched, see Exercise 4.42.)
1.7 The model is $y = \alpha_0 + \alpha_1 x_1 + \beta_1 x_2 + \gamma x_3 + \varepsilon$, where
$$x_1 = (1 - x_3)x, \qquad x_2 = x_3 x, \qquad x_3 = \begin{cases} 1 & \text{if } x > x_0, \\ 0 & \text{if } x \le x_0, \end{cases} \qquad \gamma = \beta_0 - \alpha_0.$$
1.9 Note that $\lim_{x \uparrow x_0} E(y|x) = \alpha_0 + \alpha_1 x_0$, whereas $\lim_{x \downarrow x_0} E(y|x) = \beta_0 + \beta_1 x_0$. These two expressions are identical if and only if $\beta_0 - \alpha_0 = (\alpha_1 - \beta_1)x_0$. The linear model is $y = \gamma_0 + \alpha_1 x_{1*} + \beta_1 x_{2*} + \varepsilon$, where $\gamma_0 = \alpha_0 + \alpha_1 x_0$, $x_{1*} = (x - x_0)(1 - x_3)$ and $x_{2*} = (x - x_0)x_3$.
1.11 If $E(\varepsilon) = E(\delta) = 0$, then
E(v)E(l/v) = (P0 + fa) . J ^ _ = f 1 + ^ - ) . J ^ _ = l. kosj ki + s ki+s \k0 On the other hand, E(v (1/v)) = 1. Thus, v and 1/v must be uncorrelated, even though these are functions of one another. If v and l/v are uncorrelated, and x and y are independent and have the same distribution as v, then E(v)E(l/v)
$= E(x/y) = E[(x/y)I(x < y)] + E[(x/y)I(x > y)]$, which can be written either as $E[(x/y)I(x < y)] + E[(y/x)I(x < y)]$ or as $E[(y/x)I(x > y)] + E[(x/y)I(x > y)]$. Therefore,
$$E(v)E(1/v) = \tfrac{1}{2}\big[E\{(x/y)I(x < y)\} + E\{(y/x)I(x < y)\} + E\{(y/x)I(x > y)\} + E\{(x/y)I(x > y)\}\big] = \tfrac{1}{2}\big[E(x/y) + E(y/x)\big] = E[(x^2 + y^2)/(2xy)] = E[(x - y)^2/(2xy)] + 1.$$
Thus, the covariance between $v$ and $1/v$ is $-E[(x - y)^2/(2xy)]$. This is strictly negative when $v$ is a nonnegative random variable, and strictly positive when $v$ has a symmetric distribution around 0. In general, it is unlikely that the covariance would be zero, except possibly in a pathological case.
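This identity is easy to check by simulation. The sketch below (Python/NumPy, with an arbitrary positive distribution for $v$ chosen only for illustration) compares the sample covariance of $v$ and $1/v$ with the sample mean of $-(x-y)^2/(2xy)$ for independent copies $x$ and $y$.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 1_000_000
v = rng.gamma(shape=5.0, scale=1.0, size=N)    # a nonnegative random variable
x = rng.gamma(shape=5.0, scale=1.0, size=N)    # independent copies with the
y = rng.gamma(shape=5.0, scale=1.0, size=N)    # same distribution as v

cov_v = np.cov(v, 1.0 / v)[0, 1]               # sample Cov(v, 1/v)
identity_rhs = np.mean(-(x - y) ** 2 / (2 * x * y))
print(round(cov_v, 4), round(identity_rhs, 4)) # the two values should be close (and negative)
```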
1.13 It is a generalized linear model with $\eta(y) = \log(y/(1 - y))$. The model cannot be linearized, as $\eta(y)$ is undefined for $y = 0$ and $y = 1$, the only possible values of $y$.
1.15 If $\beta_2 > 0$, then $E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$ is minimized when $x = -\beta_1/(2\beta_2)$.
1.17 The $\delta_i$s should be random, not necessarily with zero mean. The model is a special case of the mixed effects model. Let $\alpha_i = E(\delta_i)$ and $\beta_i = \delta_i - \alpha_i$. Then the model is
$$y_{ij} = \mu + \alpha_i + (\beta_i + \varepsilon_{ij}), \qquad i,j = 1, \ldots, 10.$$
The average 'improvement in status' is
$$\frac{1}{100}\sum_{i=1}^{10}\sum_{j=1}^{10}\big(E(y_{ij}) - \mu\big) = \frac{1}{10}\sum_{i=1}^{10}\alpha_i,$$
which should be the focus of inference. (The dispersion matrix of the model errors $\beta_i + \varepsilon_{ij}$, $i,j = 1, \ldots, 10$, has a special structure, which is discussed in Chapter 8.)
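The special structure alluded to above can be made explicit: for a fixed $i$, the errors $\beta_i + \varepsilon_{ij}$, $j = 1, \ldots, 10$, share the common term $\beta_i$ and are therefore equicorrelated. A minimal sketch (Python/NumPy; the variance components are assumed values used only for illustration) builds this dispersion matrix.

```python
import numpy as np

sigma_beta2, sigma_eps2, m = 2.0, 1.0, 10     # assumed variance components, 10 repeats

# Dispersion of (beta_i + eps_i1, ..., beta_i + eps_im): compound symmetry
D_block = sigma_beta2 * np.ones((m, m)) + sigma_eps2 * np.eye(m)
print(D_block[:3, :3])                        # equal off-diagonal covariances sigma_beta2

# For all 10 subjects jointly, the dispersion is block diagonal
D_full = np.kron(np.eye(10), D_block)
print(D_full.shape)                           # (100, 100)
```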
Chapter 2 2.1 Partition / n X n as (ui : and Q = P ' . 2.3
: un). Then P = (tt2 : Ui : u 3 :
: tt n )
(a) If A is positive definite, then for all v ^ 0, we have t)'A« > 0, that is, Av ^ 0. Hence, p(A) = n and A is nonsingular. (b) If A is symmetric and positive semidefinite, then it can be factorized as CC' and there is a » / 0 such that v1 Av = 0. It follows that ||C"t>|| = 0, that is, C'v = 0. Therefore C does not have full row rank, and so A is singular. (2 3 \ (c) The matrix I I is positive semidefinite but nonsingular.
2.5 Yes. A'(A~)'A' = (AA~A)' = A'. If A is symmetric and A~ is any g-inverse of it, then \[A~ + (A~)'] is a symmetric g-inverse of A. 2.7 The result follows from the identity [I \-CA'1
0\(A I)\C
B\ (I DJ\p
-A~lB\ _ (A I ) - \Q
0 \ D-CA-1BJ-
2.9
(a) Let dim(Si n S2) = m, dim(Si) = m + j and dim(S2) = m + k. Let « ! , . . . , um be a basis of S i n S 2 , U\,..., i t m , « i , . . . ,Vj be a basis of Si and t t i , . . . , u m , i i > i , . . . , to* be a basis of 1S2.
v3 W-L,..., Wf. is a basis We shall show that ui,..., um, vi, of Si + £2- It is easy to see that every vector of Si + S2 is a linear combination of the proposed 'basis'. We only have to show the linear independence of the 'basis' vectors. To prove this by contradiction, let m
j
k
Y^ aiui + 5Z @iVi + 23 ^iWi = ° t=l t=l i=l Since X)I^i aiut + 12i=i Pivi € *^i a n d ~ J2i=i "fiWi £ $2 these two vectors are equal to one another, we have
and
k
- ^ 7 i i 0 j e Si nS2. i=l
Therefore, - Y^i=i liwi c a n be expressed as Y^IL\ &iUi, that is, YliLi Siui + T,i=i Hwi = °- Therefore, ui,... ,um,w1,... ,wk cannot be a basis set of S2 (b) Let u G S^- D S£. Then u is orthogonal to every vector in Si and S2 and therefore to any linear combination of such vectors. Hence, u 6 (Si + S2) x . Conversely, if u E (Si + S2)"1", then u is orthogonal to every vector in Si + S2, and in particular, to those in Si and S2. Thus, u £ S^~ PI S^. (c) Let dim(Si)=/c and Ui,... ,«jt be abasis of Si. Thentii,... ,«* are linearly independent vectors in S2. Since dim(S2) = k, these vectors must also form a basis of S2 Any vector of S2 is a linear combination of t i i , . . . , uk, and therefore must be in Si. 2.11 Let va and Vf, be projections of v on S. Consequently va € S, Vf, 6 S, (v - va) e S x and (v - Vb) 6 S 1 . Therefore, (va - Vb) & S and [(1; - Vb) — (v — va)] 6 5 1 . It follows that va - vb is in S fl S , that is, va - vb = 0. 2.13
(a) Let P be an idempotent matrix and S = C(P). It is obvious that for any vector u of appropriate order, Pu G S. On the other hand, whenever it £ S, we can write u as PI for some vector I, so that P u = P2l = PI = u. Hence, P is a projection matrix of S.
(b) A matrix is a projection matrix if and only if it is idempotent, so we only have to show that a projection matrix is orthogonal if and only if it is symmetric. Let P be an orthogonal projection matrix. Then for all vectors u and v of appropriate order, Pu and (/ — P)v are orthogonal to one another. Thus, v'(I — P)'Pu = 0 for all u and v, that is, (/ - P)'P = 0. Consequently P = P'P, which is symmetric. Now suppose that P is a symmetric projection matrix of the vector space S. Therefore, for every v G S1- we have v'Pu = 0 for arbitrary u. Therefore, v'P = 0, that is, Pv = 0 and hence (I - P)v = v. Also, for any vector v of appropriate order, u'P'(I - P)v = u'P(I - P)v = 0 for all u, that is, P(I - P)v = 0, and hence (/ - P)v 6 S^. Thus, I - P is a projection matrix of S^, which means that P must be the orthogonal projection matrix of S. 2.15 The condition 'Ax = 0 implies p'x = 0' is equivalent to x G C(A')1implies a; e C(p) 1 ', that is, C(A')X CC(p) 1 . The latter condition in turn is equivalent to C(p) C C(A') or p e C(A'). 2.17 Since C has full row rank, it has a right-inverse. Let C~R be a rightinverse of C. Then B = BCC~R = AC~R, which implies that C(B) C C(A). The reverse inclusion is obvious. 2.19 The identity of part (a) follows along the lines of Exercise 2.7, where the inverse is replaced by a g-inverse, and by using the results of Proposition 2.5.3(b) and Exercise 2.18. The condition of part (b) implies that the matrix I „, „ I is nonnegative definite. It follows C/ \B A vi(~i~ R ' n \ _, ) is nonnegative definite and that
(
u
cy
C(B') CC(C). 2.21 Let Ui, U2 and U-$ be semi-orthogonal matrices such that the sets of columns of U\, (Ui : U2) and {U\ : U3) are orthonormal bases for C(A)nC(B), C(A) and C(B), respectively. Then p(A'B)
=
p(PAPB)
=
piiUi : U2)(U1 : U2)'(U1 : U3)(U1 : U3)')
=
p((U1:UanU1:U3))=p(U'^
=
p(Ui) + p(U'2U3) > p(U!) = dim(C(A) n C(B)).
^
)
Solutions to Odd-Numbered Exercises (1 The inequality is strict when . 4 = 0
0\ /l 1 I and B = 0
0\ 1 .
2.23 Let C C ' be a rank factorization of A.
(a) The Lowner order A < B implies that I < C~lBIO')'1, that is, all the eigenvalues of the latter matrix are greater than or equal to 1. Therefore, all the eigenvalues of C'B~1C are less than or equal to 1. It follows that, v'C'B~lCv < v'v for all v and u'B~lu < u'{C')-lC~lu for all v. Hence, B~l < {C')-lC-1 =A~l. (b) Note that (A \A
A\_(A B) - \ A
A\ A)
+
(0 \0
0
\
B-AJ
As each of the matrices on the right hand side is nonnegative definite, so is the matrix on the left hand side. The stated result follows from Exercise 2.19(a). Propositions 2.6.3(a) and 2.4.l(f) ensure that AB~ A does not depend on the choice of the ginverse. 2-25
(a)
PA®B
= = = = =
(A®B)[(A®B)'(A®B)}-(A®B)' ® B') (A®B)[(A' ®B')(A®B)]-(A' (A
(b) Suppose that Amxn ® BpXq does not have full column rank. Then there is a nontrivial matrix Lnxq such that (A®B)vec(L) = 0, that is, BLA' = 0. Therefore, both A and B cannot have full column rank (If A has full column rank, then LA' is not 0, and hence, B cannot have full column rank.) Thus, whenever A and B have full column rank, A® B also has full column rank. (c) Assume initially that A and B are symmetric and nonnegative definite. Let A\ A[ and BiB[ be rank-factorizations of A and B, respectively. Then p(A ®B) = p((A!Ai) ® (B1B[)) = p((Ai ® B1)(A[ ® B[)).
Solutions to Odd-Numbered Exercises
539
The last quantity is equal to p(Ai <8>Bi). Since A\ and B\ have full column rank, we have
p{A ® B) = p{A1 ® Bi) = p{A{)p{Bi) = p(A)p(B). If A and B are any pair of (possibly rectangular) matrices, then using the result of part (a), we have p(A®B)
=P(PA^B) =p{PA®PB)
=p(PA)p(PB)
=p(A)p(B).
(d) This result follows from parts (b) and (c). 2.27 Let B = (bi : . . . : bp) and C be a p x q matrix. Then vec(ABC)
=
l v vec I A ^ 6iCi,i : V t=i y^jaiAbj ~[ '
v \ : A ^ 6iCi,9 »=i /
\ /chiA
Cp,iA\
/6i\
2 ^ cij?Abj I =
(C'®A)vec(B).
2.29 The sufficiency of the condition is obvious. In order to prove the necessity, observe that ABA ~ 0 implies PABPA = 0, that is, PAB = K(I — PA) for some matrix K. Using Proposition 2.7.1 for each column of B, we find that B must be of the form PAK(I — PA) + (I — PA)L for some matrix L. Therefore, we can write the symmetric matrix B as
B =
+ B') = C-PACPA,
where C = \PA{K - L) + \{K - L)'PA + \{L + L1). Chapter 3 3.1
(a) Since u — Bv is uncorrelated with v, we have D(u) = D(u —
Bv + Bv) = D{u - Bv) + D(Bv) > D(u - Bv).
540
Solutions to Odd-Numbered Exercises (b) Adjusting for the covariance of u — -Bi^i with v, we have a random vector of the form u — Cv. The latter must coincide with u — Bv because of Proposition 3.1.2. The stated result follows from part (a) with u — B\V\ playing the role of u. 3.3
(a) The result follows from the Fisher-Cochran theorem, with r = 2, A\ = A and A2 — I — A. (b) If AB — 0, then / - A - B is idempotent, and the sum of the ranks of the matrices A, B and I — A — B is n. It follows from the Fisher-Cochran theorem that y'Ay and y'By are independent. On the other hand, if these are independent, then the sum y'Ay + y'By is chi-square distributed (this follows from part (a) and the definition of the chi-square distribution). The result of Exercise 3.4 implies that A + B must be idempotent, and hence, AB = 0. (c) Let UA and Uc be semi-orthogonal matrices so that A = UAU'A and C{UC) = C(C). The condition CA = 0 implies that U'CUA = 0, that is, U'cy and U'Ay are independent and hence Cy and y'Ay are independent. On the other hand, if Cy and y'Ay independent then y'Ay is independent of U'cy, and hence, of y'(UcU'c)y. Using part (b) we have UcU'cA = 0, which implies that U'CA and CA are both null matrices.
3.5 If AB = 0, we can rank-factorize A and B as A\A\ and B\B\ and prove that A\y and B\y are independent, hence their squared norms are independent. To prove the converse, note that the moment generating function of y'Ay and y'By is proportional to \I — 2t\A - 2t2B\~1/2. If the quadratic forms are independent, then this can be factored into g{t\) and /i(^)- Putting t\ = 0, we have h(t2) proportional to \I - 2t2B\~1/2. Likewise, g{t\) is proportional to \I - 2hA\-1'2. Consequently \I-2txA-2t2B\ = \I-2hA\-\I-2t2B\ = \I-2t1A-2t2B+At1t2AB\, for all tx and t2 over a rectangle. It follows that AB = 0. 3.7
(a) By conditioning the quadratic form on x, we have E[(y-g(x))'W(x)(y-g(x))} = E[E{(y~g(x)YW(x)(y-g(x))\x}] = E[E{{y - E(y\x) + E(y\x) - g{x))'W(x) (y - E(y\x) + E(y\x) - g(x))\x}}
Solutions to Odd-Numbered Exercises
541
= E[E{(y - E{y\x))'W(x)(y - E(y\x))\x}} +E[E{(E(y\x)-g(x))'W(x)(E(y\x)-g(x))\x}] > E[(y-E(y\x)'W(x)(y-E(y\x))}. The equality holds if and only if the second quadratic form is zero, that is, E(y\x) - g(x) = 0 almost surely. (b) E[(y - E(y\x))E(y\x)} = E[E{(y - E(y\x))E(y\x)\x}}, which is equal to 0. Also, E[y — E(y\x)\ — 0. Thus, we have E[(y E{y\x))E[y\x)\ = E[y - E(y\x)}E(y\x). (c) The inequality of part (a) holds with equality if and only if E[{E(y\x)~g(x))'W{x)(E(y\x)-g(x))\x] = 0 for almost all x. When W(x) is positive semidefinite, a necessary and sufficient condition for this is that g(x) = E(y\x) + w(x) almost surely, where w(x) is any vector satisfying W(x)w{x) = 0. 3.9 Let E (X)
= (^A
and D (X)
= (Y/x
Y/y).
Following an
argument similar to that of Proposition 3.4.1, it can be shown that E[(y — g(x))'W(x)(y — g(x))] is minimized with respect to g(x) subject to the constraint that g is an affine function of x, when g(x) is equal to E(y\x) = fiy + VyxV~x(x - Hx), and that y - E(y\x) is uncorrelated with every linear function of x, including E{y\x). This minimizer is unique when W(x) is positive definite for almost all x. When W(x) is positive semidefinite, there are infinitely many solutions which differ from one another by an affine function of x which is in C(W(x))-1. 3.11 First part is easy. For n > 3, the separation between any pair of order statistics is ancillary. 3.13 Let X\{9) and Z2W be two information matrices such that I\{6) < Xiiff). It follows from Proposition 3.6.6 and Exercise 2.23 that G{e)i~{e)G'{6) = G(e)i^(e)i1(6)ir(0)G'(e) > G(0)2T (0)l! (0)Z2- (9)1, (0)lr (0)G'(0) = G(0)i;(9)G'(9). 3.15
(a) The probability density function of y is ^h(U:^). shown through simple calculations that aiogift(i^) d/i
=
1 d\og h(u) a du '
It can be
542
Solutions to Odd-Numbered Exercises dloglfe(^) da2
=
__L 2a2
dlo^hju) 2a-2 du ' u
where u = (y - p,)/a. The stated result follows by simplifying the information matrix,
J-oo[
dO
)[
89
)ah{~V-)dV'
and making use of the fact that the integrand of the off-diagonal term is an odd function of u. Simplification of the expression of the bottom diagonal term is aided by the identity f°° (dh{u)\ I u —^—^ du = - 1 . (b) Let s(u) =
. Then s I du \ a ance / M . On the other hand,
2
I has mean 0 and varia J
f°° y^oo
(dh{u)\ \ du j
The stated result follows from the fact that the largest value of the squared correlation between two random variables is equal to 1. (c) The squared correlation is equal to 1 if and only if the two random variables are almost surely linear functions of one another. This condition simplifies to s(u) = au + b. Integrating the two sides with respect to u, we conclude that \ogh(u) must be a quadratic function of u which is symmetric around u = 0. The conclusion follows in view of the condition Var(u) = 1. We have 7M > I/a2 with equality holding only in the case of the normal distribution. Thus, whenever y has mean fi, variance
Solutions to Odd-Numbered Exercises
543
3.17 Let t(y) be a complete sufficient statistic for the parameter g{9), h (y) an unbiased estimator and ^(y) the UMVUE of f(6). We have for any loss function L(-,-) which is convex in the second argument
E[L{g(ff)Mv))]
= E[E{L(g(O)Mv))\Kv)}] > £[L(s(0),.E{*i(v)|t(l/)})] = E[L(g(e)Mv))]
by Jensen's equality and the Lehmann-Scheffe theorem. 3.19 The joint distribution of y and \i is normal. Hence the the posterior distribution of /x is normal too. It follows from Proposition 3.7.4 that the Bayes estimator is E(fi\y), which simplifies to the given expression. 3.21 It is easily seen that E(y) = 6\ so y is unbiased for 8. However, y is inadmissible as R{d,y) — 62 and R(8,y/2) = 62/2. An admissible estimator can be found via a unique Bayes estimator. Let A = 1/0 and the prior distribution of A have density TT(A) = pa\a-le-x0IT{a),
A > 0,
where a and /? are positive parameters. Then the marginal distribution of y has density a/3a(y + P)~^a+1\ and the posterior distribution of A given y has density 7r(A|y) = {y + 0)a+1\ae-x^+V/T(a
+ 1).
Therefore, the unique Bayes estimator is E(8\y) = E(l/X\y) = (y +
P)/a. 3.23 It follows from the discussion of Section 3.9 and the result of Exercise 3.22 that the level (1 — a) UMA confidence region, obtained from the level (1 - a) UMP test, is [0, y + n~ 1 / 2 z a ]. Likewise, the level (1 - a) UMAU confidence region, obtained from the level (1 — a) UMPU test, + n-1/2za/2]. is [y~n-l/2za/2,y Chapter 4 4.1 An LUE in a saturated model cannot be improved by removing correlation with LZFs, as there is no LZF.
544
Solutions to Odd-Numbered Exercises 4.3 If A(3 is non-estimable, then it follows from Proposition 4.1.15 that it is not identifiable. Thus, by definition there are 01 and /32 such that A/31 ^ A/32 and yet the density of y is the same for (3 = f5x and j3 — (32- If T(y) is any statistic, then the congruence of the densities implies that E{T{y)) has the same value for /3 = /3X and /3 = /3 2 . For T(y) to be an unbiased estimator of A/3, we should have A/31 = Af32, which is a contradiction. 4.5 Let ui be the first column of Ikxk- If xi has an exact linear relationship with the other columns of X, then there is a k x 1 vector I such that XI = 0 and u[l ^ 0. This is impossible when u\ is of the form X'm. Therefore, u\ £ C(X'), and hence the coefficient of x\ is not estimable. To prove the converse, let it i ^ C(X'). It follows that u[(I-P ,)ui > 0, that is, the first element of the vector (I - P )u\ is nonzero. Let I be a multiple of the latter vector such that its first element is equal to 1. It follows that XI = 0, that is, X\ can be written as a linear combination of the other columns of X. 4.7 The 'if part is obvious. To prove the 'only if part, note that I'y + c is unbiased for p'0 only if l'X/3 + c = p'P for all /3. Putting fi = 0 into this identity, we have c = 0 and hence, l'X/3 = p'/3 for all /3. The latter identity implies X'l — p. 4.9
(a) Since p(X) = 4, every LPF is estimable. (b) C{X) = C ( r ) Note that ! e C [ ^ ) if and only if I is of the form [
I for some mixi.
By Exercise 4.8, i'y is the BLUE
of its expectation if and only if / is of this form. (c) i'y is an LZF if and only if X'l = 0, that is, / is of the form
f m \ lor . some m4 X i. \-mJ (d) pj = (yj+yJ+A)/2forj
= 1,2,3,4-
(e) £(3) = £ l . (f) 0-2 = \y'{I -Px)y
= E ' = i ( ^ - 2/;+4)2/8-
4.11 Standard errors are given in brackets: (a) for the men's records, f30 = -2.887(1.609), Pi = 1.1564(.2340); (b) for the women's records, /?0 = -2.862(1.611), ft = 1.1689(.2342).
Solutions to Odd-Numbered
Exercises
545
4.13 We have to minimize l'l such that X'l = p. Using the method of Lagrange multipliers, the solution is obtained by minimizing l'l + X'(X'l-p) with respect to I and A. Differentiating the function with respect to I and A and setting the gradient equal to zero, we have the equations 21 + XX = 0 and X'l = p. Substituting I = - | X A into the second equation we have ^X'XX = -p. A solution to this equation is A = — 2(X'X)~p, which leads to / = X(X'X)~p and I'y = p'(X'X)~X'y. The last expression does not depend on the choice of the g-inverse (why?). That the solution is indeed a minimum ( I X\ follows from the fact the the Hessian matrix, I Y, I is negative
VA
u
/
definite (why?). 4.15 E[(lx + h)'y] = Pi/3 + p'2P- Further, (li + I2)'y is uncorrelated with every LZF. 4.17 The LSE can be interpreted as a weighted average of all the LSEs of 0 obtained from the (") sub-models. Note that r is the smallest number of observations in a sub-model so that ft can possibly be estimated from it, and that the weight is zero for any sub-model where /3 is not estimable. In the special case of Example 4.2.1, let /3^ denote the 'LSE' from the i and jth observations (that is, from (xi, yi) and (Xj,yj)). Let x = (x\ : : xn)'. It follows from (4.2.4) that whenever Xi ^ Xj,
(
Xiyj-Xjyi
\
Xi-Xj
\
Vi-Vi Xi-Xi
I J
Defining the weight Wij as (a:,- — Xj)2/(2nx'x — 2n2x2), we have v-» ^ 2 1 ( 2nyx'x - 2nxx'y\ 2 ^ 2 ^ wiJPH ~ 2nx'x - 2n2x2 V 2nx'y - 2n2xy ) ' which simplifies to /3. Note that the set of weights constitute a special case of the given weights for general r, and it satisfies the condition S?=i 2 j = i wij — 1- Thus, the LSE of /3 is a weighted sum of the slopes and intercepts of lines passing through pairs of points, and larger weights are given to pairs of points having a;-values far from one another. 4.19 The joint density of y can be written as
P0,AV) = (27ra2)-'l/2exP[-^||y-X/3||2]
546
Solutions to Odd-Numbered
_ ~
P
Exercises
r iixftii2]
I" fl2 ||X/3|| 2 (X/3)'X/3l e X P [ 2q' J [ 2a 2 2a 2 + a2 J ' (£r2)n/2
Since the last expression is of the form (3.5.1), X/3 and RQ must be complete and sufficient for X/3 and a2. 4.21 Suppose that i'y is any LZF with non-zero variance. Then it must be correlated with some LZF of A. Let Ly be a vector of all the members of A. After adjusting for the covariance of I'y with Ly, we obtain the new random variable m'y = i'y — Cov(l'y,Ly)[D(Ly)]~Ly, which is uncorrelated with the members of A. Since m'y is itself an LZF, by definition of A we have Var(m'y) = 0. Therefore, m'y = 0 with probability 1, that is, i'y is almost surely equal to a linear function of the elements of A. 4.23 Let z be as in Remark 4.7.7, and CC' be a rank-factorization of <j~2D(z). Suppose that v = C~Lz, where C is any left-inverse of C. It is easy to see that the elements of v are LZFs, these are uncorrelated and have variance a2. Further, as z € C(C) with probability 1 (because of Proposition 3.1.1), we can write z as Cv almost surely. Therefore, the elements of v form a basis set of LZFs. The conclusions follow from Definition 4.7.6, Proposition 4.3.5 and (4.7.1). 4-25
(a) I Ztx hi = ^tr(fl-) = p(X)/n = kfn. (b) For any n x 1 vector u, (/ — ^ l l ' ) u = u — ul, where u is the average value of the elements of u. Thus, (J — ^\V)u is the mean-subtracted or 'centered' version of u. Hence, Xc = (I — i l l ' ) X consists of centered versions of the columns of X. Its ith row consists of the deviations of the elements from their respective column means. Let this vector be xc». Then from the decomposition H = PY = R + P.t , ,.„, we conclude that hi = 1/n + x'ci(X'cXc)~xCi, which is of the described form. (c) Let X . = (X : y). Then Px = Px +P{I_Px)y = H + Pe. Since Px is an orthogonal projection matrix, its zth diagonal element is in the range [0,1]. Hence we have the required inequality.
Solutions to Odd-Numbered Exercises
547
4.27 Let z be a vector whose elements constitute a standardized basis set of LZFs. Then Var{z'z) = 2(n - r)a4, where r = p(X). Hence E[(cRl - C72)2] = [E(cR2 - a2)]2 + Var(cR2 - a2) = [((n - r)c - I ) 2 + 2(n - r)c2]aA. (a) For c = l/(n - rj^we have £ [ ( ^ - a2)2] = 2
(a) The model equation is log q = log a + a log / + /3 log c + log u, which is of the form y = f30+axi +/fa2 + e - Under the restriction a + (3 = 1, the model can be written as log(g/c) = logo + alog(//c) + e, which is of the form y = 0Q + ax + e. (b) Following the proof of Proposition 4.9.3(c), the decrease in D(X0) is a2(Px - PX(I_PAI)). Here, A = (0 : 1 : 1), so C(X(I — Px,t_p %)) is spanned by the vectors 1 and X\ — X2On the other hand, C(X) is spanned by 1, X\ and x 2 . If u is a vector in C(X) which is orthogonal to C(X(I—Pxp )), then CT2(PX - PX{I_PAI)) =
548
Solutions to Odd-Numbered Exercises
4.33 The information matrix is 40CT~ 2 /3 X 3, which is of the requisite form. The Cramer-Rao bound for #3 is <72/40. This agrees with the bound (<72/10) for T\ - r2 obtained in Example 4.11.2, as #3 = [T\ - r 2 )/2. 4.35 Let rii be the number of measurements involving object i, i = 1 . . . ,p. For p = 1, the information matrix is a~2 { l 1, and V«i " 1 / Var{pi) =
7 = — ' 7—TTT-, TT > —FTnn\-n\ n (ni/n)(l - ni/n) n/4 The bound is achieved when n\ — n/2. When p > 1, 'm = n/2 for all i' continues to be a necessary condition, but the other weights act as nuisance parameters and may increase the variance (see Section 4.11.1). It follows from the discussion of that section that the presence of the nuisance parameters does not make a difference in the variance of the BLUE of t'(I - PX2)X101 if and only if \\PXi (I - PX2)t\\2 = that is*\\P(I_Pxi)Xa(I-PXa)t\\* = 0. The latter \\Px(I-PX3)t\\2, condition is equivalent to X'2(I - Px )(I - Px^ )t - 0, or X'2PX^ {I Px )t = 0. If we are concerned with the LPF /31( then we can assume without loss of generality X\ =
lXl
nT><1
I>
an^
expect X2 to
be such that all the parameters are estimable. Writing /3j as (fio : /3\)' and Pi as t'(I — Px )Xifl1, we observe that X[(I — Px )t must be 1" \ _ j I ) . Therefore, the condition X'2PXi (I-PX2 )* =
(
0 further simplifies to X'2 [
* 1 = 0, that is, each of the objects
2,3, . . . , p must be weighed with and without object 1 for an equal number of times. When this condition is applied to all the objects, we obtain the conditions (i) every single object occurs in n / 2 rows of the matrix X and (ii) every pair of objects occurs in n / 4 rows. These two conditions hold simultaneously only if n is a multiple of 2P. Then
X'x-
/ n n/2 n/2 .
n/2 n/2 n/4 .
\n/2
n/4 n/4
n/2 n/4 n/2 . .
n/2\ n/4 | n/4 - n ( 4 . ~4V2-1
2 1' \ 1+11')'
n/2/
It is easy to see that the above condition is also sufficient.
Solutions to Odd-Numbered Exercises 4.37
549
(a) The result follows from the definition of the Cramer-Rao lower bound and the block matrix inversion formula of Section 2.2. (b) Let X = (xi : X2). When /?i is non-estimable, we have X\ € C(X2) (see Exercise 4.5), that is, ||(/-P X 2 )a;i|| 2 = 0. It is easy to see that Z11.2 = \\{I - PX2)x1\\2. (c) The quantity In. 2 can be interpreted as the information for /?i, adjusted for the other parameters. When /?i is non-estimable, the information is zero. (d) Essentially the same argument holds, after reparametization. Let p'/3 be any LPF (not necessarily estimable), with ||p|| = 1, and let P be such that the matrix (p : P) is orthogonal. It follows that the information for p'/3 is Z M p'X'(7 - Pxp)Xp or iftP'X'il — P ,,)Xp. When ||p|| not necessarily equal to X {l—pp )
1, the information for p'fi is ipl0 = i^p'pyip'x'ii 4.39
-
Px(IPp))xp.
(a) Note that the largest eingenvalue of X'SXS is no smaller than the average of the eigenvalues, which is tr(X'sXs)/k or 1. Hence, *V = (vy\i)/VIFj < 1/VIFj. (b) If VIFj is large, A; is small and TT^- is close to 1, then the variance of the BLUE of /?y is inflated mostly because of the smallness of ||X S V;||. This conclusion does not hold when VIFj is small. [Belsley et al. (1980) does not mention this crucial fact.]
4.41 K = 23.22, and the VIFs for /?0 and ft are 135.3. The VIFs are identical because these are obtained from the 2 x 2 matrix X'SXS, which does not change when the variable are interchanged.
Chapter 5 5.1 [0.08153, 0.08287]. 5.3 Since p'j/3 and p'2(3 are distinct, their BLUEs, p^/3 and p'^0 are also distinct. It follows that the variance-covariance matrix of these two BLUEs (a2K) is invertible. Thus,
/
P[P \
V(Pi+P2)73/
(
K
^ ( 1 :1)K
K(\)
)
(l:l)K[l)J
550
Solutions to Odd-Numbered Exercises According to the inversion formula of page 2.2, a possible choice of .,
.
, .,.
,.
.
.
.
(G-2K~1
0\
„
the g-mverse ot this dispersion matrix is I I. Hence, a 100(1 - a)% elliptical confidence region of (p[/3 : p'2(3 : (p t + p 2 )'/3)' is
(fx\
(
x-p'3
\',K-i
[ W
\z-(Pi+P2)'/9/
o
V
x /
*-pi3
\
\*-(Pi+P2)')8/
< 2^F 2 , n _ r , a I . This simplifies to
{ (I) :(*-''%] K-'(X-Pi)<^ {\ZJ
V2/-P2/3/
\y-P20J
A. J
The latter is a 100(1 — a)% elliptical confidence region of (p^/9 : p2/3)'5.5
(a) (i) Rewrite the hyperplane equation as a'(6 — 6$) = d. Let FF' be a rank factorization of M. If 0* is a point that lies both on the hyperplane and on the ellipsoid, write it as 0O + Mt. It follows from the Cauchy-Schwartz inequality that d2
= [a'(0. - 6»O)]2 = [a'Mtf = < [\F'a\\2 \{F't[\2 = (a'Ma)(t'Mt)
[a'FF't]2 < a'Ma.
The last step follows from the fact that t'Mt = (0* - eo)'M~ (6 - 00) < 1. Thus, a common solution does not exist if
d2 > a'Ma. (ii) If d2 = a'Ma, both the inequalities must hold with equality. The second of these implies that 6* must lie on the boundary of the elliptical region. The Cauchy-Schwartz inequality holds with equality if and only if F't — tpF'a, that is, 6* = 80 + tj;Ma for some constant ip. This constant must be d/(a'Ma), so that 6* lies on the hyperplane. Thus, the choices c = a'9o (a'Ma)1/2 lead to a unique point lying in the intersection between the hyperplane and the ellipsoidal region (the hyperplane being a tangent), (iii) If d2 < a1 Ma and 0* lies in the intersection, then we can
Solutions to Odd-Numbered Exercises
551
construct Of = 0*+g, where g is in C(M) n C ( a ) 1 and is small enough to ensure that Of satisfies the inequality. Since Of also lies on the hyperplane, any linear combination of 0* and Of lies in the intersection of the hyperplane and the ellipsoidal region. (b) Put 0=A/3, 00=AJ3, M=ma2Fm,n^aA{X' X)~ A in part (a) and choose a as the jth column of the q x q identity matrix. 5.7
(a) E(a) = 0; Var{a) = a2\p'1{X'X)-p1 - 2\p'l{X'X)~ p2 +A2P2'(X1X)~p2]. However, a is not an LZF, as it depends on the unknown A. In fact, it is a function of BLUEs. (b) [a2/Var(a)]/fi/a2} ~ F 1>n _ r . (c) [a2/Var{a)}/[a2/a2} < c if and only if (p[/3 - p'2/3)2 is less than cVar(a)a2/a2. The latter inequality can be rewritten as \2[{p[f3)2-c?p'2{X'X)-p2] -2X{(p[/3)(P20) co2p\{X'XYp2] + [(p'1l3)2-ca2pl1(XlX)-p1}<0. If p'2f3 is insignificant, the question of estimating p[l3/p2l3 does not arise. Let us assume that p'2f3 is significant. Then the coefficient of A2 is negative and the discriminant of the quadratic function is positive (why?). In such a case, the above inequality holds if and only if A lies between the roots of the quadratic function. We can choose c = Fi ifl _ ria , so that the interval between the corresponding roots is a 100o;(l — a)% confidence interval of A.
5.9 The result follows from Proposition 5.2.4 after simplifying the term x'(X'X)~x to the expression given within the parentheses. The width of the confidence band is a monotonically increasing function of {x - x)2/(x2 - x2), which is the smallest at x — x. 5.11 Let o^and /? be BLUEs of a and ft from the unrestricted model. Since a + P is a BLUE, it must be uncorrelated with all the LZFs of this model. On the other hand, a + 0— 1 is an LZF in the restricted model. An argument involving ranks shows that a standardized version of a + ft — 1, together with a standardized basis set of the LZFs of the unrestricted model, constitutes a basis set of the LZFs of the restricted model. Hence, the increase in Rl must be (a + (3-l)/[Var(a + ft)/a2]. 5.13 The ANOVAjs given in Table A.I. The GLRT is to reject n0 when (R2H - RD/a2 > F 2 , 3 7 , Q .
552
Solutions to Odd-Numbered Exercises Source Deviation from n
Sum of Squares R*H - R2 = 20 (S-^
\
=T2
Residual
Total
- yY '
v
*
2
ll
v
1
^ J l M
37
L-^Y
l<»<10 ^ 21
Mean sq.
J
R% = R% - (R2H - 1%)
^=
d.f.
? =
^
38 /
z
'
Table A.I ANOVA for Exercise 5.13
5.15 In the context of the model given in the solution of Exercise 1.9, the hypothesis is Ti0 : a\ = /?i. The SSE corresponding to this hypothesis is the same as the SSE for a single model for the 20 years' midyear population data with the year as the sole regressor (apart from the constant term). This turns out to be R2H = 1.211 x 10~3, with 18 degrees of freedom. The linear model of Exercise 1.9 leads to the (unrestricted) SSE F% = 5.236 x 10~4 with 17 degrees of freedom. The GLRT statistic is 22.31, with p-value .0002. Thus, we reject the hypothesis of 'no change in slope of the regression line at XQ1 at any reasonable level. 5.17 Using Proposition 5.3.6 we can decompose the hypothesis A/3 = £ into two hypothesis one of which is completely testable and the other, completely untestable. The completely untestable hypothesis only amounts to a reparametrization (see Exercise 4.30), while the completely testable hypothesis is of the form A*f3 = £„ where p(A») = dim(C(A') n C(X')). It is clear that the constraints A0 = £ and A*(3 = £„ lead to the same value of R2H. Therefore, the GLRT for the testable part of A/3 = £ is the test described in Proposition 5.3.12 with m = p(Ar). It follows from Proposition 2.3.2 that the latter number is the same as p{A') + p{X') — p(X' : A').
Solutions to Odd-Numbered Exercises
553
Suppose that p(A) is used in place of p{A,), and let R2C be the sum of squares of p(A) — p(A*) samples from iV(0, a2) which are independent of y. Then under the null hypothesis
\R2H-Rl
P(A)
\R2H-R2
1
+ R2C
P(A)
1 _
Therefore, whenever p(A) is used instead of p(A*), the size of the test is smaller than a. Another way of interpreting the incorrect test is that the appropriate test statistic is compared to a cut-off value which is too large for size a. Hence, the test would have unnecessarily small power. 5.19 Formulation: Test for 9X = #2 in the model
((Vi) f1 \\v2)'\0
Z*
° °)( £) a2l) 0 1 Z2J I ^02 I ' I
Solution: .RQ is the sum of SSEs from the sub-models, as in the solution of Exercise 5.18. R2H is the SSE from
((S)-(i I with ni + n2 - p ( _ ., \u
r,1 I degrees of freedom.
1 z,2 J
5.21 (a) Consider the model ((Vi\
( l n i xi®ui W ^ i \
; . \\Vm/ where (ui :
: Vin^xl » < /
\
h )yi . \^m/
: u m ) = 7mXm- Let fi = (pi :
/ : /i m )' and
Z - (xi : : xmy. Then the restriction JJ, = Z/3 reduces the above model to (y, X/3, a21). A more standard (but equivalent) form of the restriction is (I — Pz)/j, = 0.
554
Solutions to Odd-Numbered Exercises Source
Sum of Squares
Lack of fit
Rfm-Ro
m~r
Pure error R20] = E?=i IIVi-J/il||2 Total
Degrees of Freedom
R2 = YZi Ilyi-^i3l|| 2
Mean Square R2
—R2
—
-
n-m
^
n-v
Table A.2 ANOVA for Exercise 5.21
(b) The pure error sum of squares is iSL = Y^T=i \\Vi ~ ViM\2(c) The lack of fit sum of squares is RQ-R20, , where RQ = YLiLi \\Vi ~ z-/3l|| 2 the SSE under the model (y,XP,a2I). (d) The ANOVA is given in Table A.2. (e) The GLRT is to reject the hypothesis of adequate fit when n-m -"(o) ~ Ko
RfQ) 5.23
m-r
(a) The conditional mean olj4(X/3)e given XJ3 is 0. (b) D(A(X0)e) = a2E[A(X/3)(I Px)A'(Xfi)]. (c) A vector of transformed GZFs with the required properties is LA(X/3)e, where £ is a left-inverse of C and C is a rankfactorization of A(XJ3)(I - PX)A'(X~P). Note that the dispersion of the conditional expectation of this vector is 0, and hence its dispersion is just the expected value of the conditional dispersion. (d) Each component of the GZF of part (c) has to be divided by an estimator of a which does not depend on the transformed GZFs (for given X/3). We can use a set of additional GZFs which, together with the present set, constitute a basis set of LZFs (for given X0). The sum of squares of the latter GZFs is R20 - \\LA(Xf3)e\\2, and the number of these GZFs is n - p(X) - p(A(XP)(I
- PX)A'(XP))
=n-
p(X:A(Xfi)).
Therefore, a natural choice of the requisite estimator of a is [(R2 - \\LA(X0)e\\2)/(n - p(X : A(X0)))}^.
Solutions to Odd-Numbered Exercises
555
5.25 The (unrestricted) SSE is R% = 11.4569 with 16 degrees of freedom. From the combined model implied by the restriction we have R2H = 11.5171 with 18 degrees of freedom. The resulting GLRT is .0420428, with p-value .959. The hypothesis of equality of regression lines is accepted at any reasonable level. 5.27 The prediction error of the BLUP y0 can be decomposed into two uncorrelated parts: 2/o ~ Vo = (XoP - XQfi) - (y0 - Xof3). Hence, D(y0 matrix is q. y0)'W~2D(y0 part (a) and
- y0) = a2[Xo(X'X)~X'o + I], and the rank of this The result of part (a) follows from the fact that (y0 — - yo^Hvo ~ Vo)/qv2 ~ Fq>n-r. Part (b) follows from Exercise 7.5.
Chapter 6 6.1. Let ni, n2 and ni2 be the number of times we weigh object 1 alone, object 2 alone and objects 1 and 2 together, respectively. Then the information matrix is proportional to /rai +n12 \ ni2
nu \ n2+ni2j'
and its determinant is proportional to n\n2 + n 12 (ni + n 2 ). In view of the constraint of at most six measurements (all of which must be utilized for design optimality), we can write nxn2+ni2(ni+n2)
< =
( n i + n 2 ) 2 / 4 + n 12 (ni + n2) ( 6 - n i 2 ) 2 / 4 + ra12(6-ni2)
=
12-^(n12-2)2
< 12.
The second inequality holds with equality if and only if ni 2 = 2, while the first inequality holds with equality if and only if n\ — n 2 . Thus, the unique D-optimal design corresponds to m = n2 = n\2 = 2. 6.3 If C'T is a treatment contrast, then c'l = 0. Hence,
c'r = c'(r - fl) = J^Jn - i-1 £ T .) = £;£>/*)(,-, - TJ).
556
Solutions to Odd-Numbered Exercises 6.5 E\\(PX - Px)y\\2 = E[ir{{Px - P1)yy'(Px - PJ}] can be written as the sum of a dispersion and a bias term. Consequently MSg simplifies to
t
\t=i
\
t
/
t=i
^___
The null hypothesis is rejected for large values of $-2\log\Lambda$. The asymptotic (as $\min_i n_i \to \infty$) null distribution of this statistic is $\chi^2_{t-1}$.
6.9 It follows from the expression of $P_X$ that $I - P_X = (I - P_{\mathbf 1_{t\times1}}) \otimes (I - P_{\mathbf 1_{b\times1}})$. Thus, $\mathrm{Var}((a\otimes b)'e) = \sigma^2(a\otimes b)'(I - P_X)(a\otimes b)$, where $\eta = (\eta_0 : \eta_1' : \eta_2')'$ with $\eta_0 = \mu + \tfrac1t\sum_{i=1}^t\tau_i + \tfrac1b\sum_{j=1}^b\beta_j$, $\eta_1 = (\tau_1 : \cdots : \tau_t)'$ and $\eta_2 = (\beta_1 : \cdots : \beta_b)'$.
6.15 Take expected value. The term involving $\lambda$ becomes zero. Variance is an attribute of the model errors, which do not change from (6.3.1) to (6.3.12). There is no question of the estimator being the BLUE, as (6.3.12) is not a linear model at all.
6.17 The expected value of the estimator under (6.3.12) is $\sigma^2$ plus $(t-1)^{-1}(b-1)^{-1}\lambda^2\sum_i(\tau_i-\bar\tau)^2\sum_j(\beta_j-\bar\beta)^2$. Ignoring interaction would lead to unnecessarily long confidence intervals for treatment contrasts, even though the center of the interval is expected to lie at the appropriate location (this follows from Exercise 6.15).
6.19 Put $G = \lambda$ and $A = (\tau - \bar\tau\mathbf 1)$.
6.21 If
$$\sum_{i=1}^t\sum_{j=1}^b\sum_{k=1}^m l_{ijk}\,(\mu + \tau_i + \beta_j + \gamma_{ij}) = \sum_{i=1}^t c_i\tau_i$$
Source | Sum of Squares x 1000 | Degrees of Freedom | Mean Square x 1000
Between treatments | $S_\tau = 922.4$ | 3 | $MS_\tau = 307.5$
Between poisons | $S_\beta = 1033.0$ | 2 | $MS_\beta = 516.5$
Interaction | $S_\gamma = 250.1$ | 6 | $MS_\gamma = 41.7$
Error | $R_0^2 = 800.7$ | 36 | $MS_e = 22.2$
Total | $S_t = 3006.2$ | 47 |

Table A.3 ANOVA for Exercise 6.23
for all values of the parameters. By putting $\gamma_{ij} = 1$ for all $j$ (for a fixed $i$) and all other parameters equal to 0 in the above equation, we have $\sum_j\sum_k l_{ijk} = 0$. On the other hand, by putting $\tau_1 = 1$ and all other parameters equal to 0, we have $\sum_j\sum_k l_{1jk} = c_1$. Hence we must have $c_1 = 0$. Likewise, $c_2 = \cdots = c_t = 0$.
6.23 The analysis of variance is given in Table A.3. The GLRT for the three hypotheses are as follows.

Hypothesis | GLRT F-statistic | degrees of freedom | p-value
no difference in effects of poisons | 23.2 | 2, 36 | $3\times10^{-7}$
no difference in treatment effects | 13.8 | 3, 36 | $4\times10^{-6}$
no interaction effect | 1.9 | 6, 36 | .108
In this case, interaction means different treatments being particularly effective (or otherwise) for specific poisons. This analysis points to significance of the main effects but no interaction effect. See Box and Cox (1964) for further analysis with transformed variables.
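The F ratios and p-values above follow directly from the mean squares in Table A.3; a short check, assuming Python with scipy (not part of the original solution):

```python
# Reproduce the GLRT F statistics of Exercise 6.23 from the ANOVA mean squares.
from scipy import stats

MSe, df_e = 800.7 / 36, 36
effects = [("poisons", 1033.0 / 2, 2),
           ("treatments", 922.4 / 3, 3),
           ("interaction", 250.1 / 6, 6)]
for name, ms, df in effects:
    F = ms / MSe
    print(name, round(F, 1), format(stats.f.sf(F, df, df_e), ".1e"))
# roughly: poisons 23.2 (p ~ 3e-07), treatments 13.8 (p ~ 4e-06), interaction 1.9 (p ~ 0.11)
```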
6.25 The hypothesis $\tau_1 = \tau_2$ is not testable at all. The stated hypothesis is $\tau_1 + \bar\gamma_{1\cdot} = \tau_2 + \bar\gamma_{2\cdot}$, which is testable. The BLUE of the difference $\tau_1 + \bar\gamma_{1\cdot} - \tau_2 - \bar\gamma_{2\cdot}$ is $\bar y_{1\cdot\cdot} - \bar y_{2\cdot\cdot}$ (according to the result of Exercise 6.22). Thus, the $t$-statistic is $(\bar y_{1\cdot\cdot} - \bar y_{2\cdot\cdot})\big/\sqrt{2R_0^2/\{bm(n-tb)\}}$, where $R_0^2 = \sum_{i=1}^t\sum_{j=1}^b\sum_{k=1}^m(y_{ijk} - \bar y_{ij\cdot})^2$. The null distribution of this statistic is $t_{n-tb}$.
6.27 We use Proposition 5.3.5. Here, $A = (\mathbf 0_{tb\times 1} : \mathbf 0_{tb\times t} : \mathbf 0_{tb\times b} : I_{tb\times tb})$ and $\xi = \mathbf 0_{tb\times 1}$. A vector $l$ is in $\mathcal C(A')$ if and only if it is of the form $(0 : \mathbf 0_{1\times t} : \mathbf 0_{1\times b} : u')'$, where $u$ is an arbitrary $tb\times 1$ vector. Note that $\mathcal C(X') = \mathcal C\big((\mathbf 1_{t\times1}\otimes\mathbf 1_{b\times1} : I_{t\times t}\otimes\mathbf 1_{b\times1} : \mathbf 1_{t\times1}\otimes I_{b\times b} : I_{t\times t}\otimes I_{b\times b})'\big)$. Therefore, if $l\in \mathcal C(A')\cap \mathcal C(X')$, then there must be a $tb\times1$ vector $k$ for which the two representations of $l$ agree. The bottom part of the resulting equation gives $k = u$, while the top parts give three conditions which are equivalent to $u\in \mathcal C\big((I - P_{\mathbf 1_{t\times1}})\otimes(I - P_{\mathbf 1_{b\times1}})\big)$, which is ensured by the choice $T = (I - P_{\mathbf 1_{t\times1}})\otimes(I - P_{\mathbf 1_{b\times1}})$. The condition $TA\beta = T\xi$ reduces to $\gamma_{ij} - \bar\gamma_{i\cdot} - \bar\gamma_{\cdot j} + \bar\gamma_{\cdot\cdot} = 0$ for all $i$ and $j$. Thus, the testable part of the hypothesis of 'no interaction effect' is that all the LPFs of type (d) mentioned in page 214 are zero. The sum of squares for deviation from this hypothesis is $S_\gamma$, which is the sum of squares of the BLUEs of all these LPFs, with $(t-1)(b-1)$ degrees of freedom. Since $T$ is a projection matrix, the untestable part of $A\beta = \xi$ must be $(I-T)A\beta = (I-T)\xi$. This simplifies to $\bar\gamma_{i\cdot} + \bar\gamma_{\cdot j} - \bar\gamma_{\cdot\cdot} = 0$ for all $i$ and $j$, which is equivalent to the side conditions $\bar\gamma_{i\cdot} = \bar\gamma_{\cdot j} = \bar\gamma_{\cdot\cdot} = 0$ for all $i$ and $j$.
6.29
(a) $\hat y_a$ satisfies the equation $y_{kl} - \bar y_{k\cdot} - \bar y_{\cdot l} + \bar y_{\cdot\cdot} = 0$, which leads to
$$\hat y_a = \{(b-1)(t-1)\}^{-1}\Big[t\sum_{j\ne l} y_{kj} + b\sum_{i\ne k} y_{il} - \sum_{(i,j)\ne(k,l)} y_{ij}\Big].$$
(b) $\hat y_b$ satisfies the equation $y_{kl} - \bar y_{\cdot l} = 0$, which leads to
$$\hat y_b = (t-1)^{-1}\sum_{i\ne k} y_{il}.$$
(c) The degrees of freedom for the above sums of squares decrease by 1 because of the missing observation. Using $R_0^2\big|_{y_{kl}=\hat y_a}$ and $R_H^2\big|_{y_{kl}=\hat y_b}$ in the expression of the GLRT, we have the F-statistic
$$\frac{\displaystyle\sum_i\sum_j(y_{ij}-\bar y_{\cdot j})^2\Big|_{y_{kl}=\hat y_b} - \sum_i\sum_j(y_{ij}-\bar y_{i\cdot}-\bar y_{\cdot j}+\bar y_{\cdot\cdot})^2\Big|_{y_{kl}=\hat y_a}}{\displaystyle\sum_i\sum_j(y_{ij}-\bar y_{i\cdot}-\bar y_{\cdot j}+\bar y_{\cdot\cdot})^2\Big|_{y_{kl}=\hat y_a}}\cdot\frac{n-t-b}{t-1}$$
with $t-1$ and $n-t-b$ degrees of freedom.
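A quick numerical check of the formula for $\hat y_a$ in part (a); a sketch assuming Python with numpy, with an illustrative randomly generated table (not part of the original solution):

```python
# Verify that plugging yhat_a into cell (k, l) makes the interaction-type residual vanish.
import numpy as np

rng = np.random.default_rng(3)
t, b, k, l = 4, 5, 1, 2
y = rng.normal(size=(t, b))

others = y.sum() - y[k, l]            # total of the available observations
row = y[k, :].sum() - y[k, l]         # available total in row k
col = y[:, l].sum() - y[k, l]         # available total in column l
y_a = (t * row + b * col - others) / ((t - 1) * (b - 1))

y[k, l] = y_a
resid = y[k, l] - y[k, :].mean() - y[:, l].mean() + y.mean()
print(abs(resid) < 1e-12)             # True
```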
6.31 Here we have, for $i=1,\dots,t$, $j=1,\dots,h$ and $k=1,\dots,b$,
$$y_{ijk} = \mu + \tau_i + \theta_j + \beta_k + \epsilon_{ijk}.$$
If we let $\beta = (\mu : \tau_1 : \cdots : \tau_t : \theta_1 : \cdots : \theta_h : \beta_1 : \cdots : \beta_b)'$, and arrange the data so that $k$ changes faster than $j$, which changes faster than $i$, then the design matrix is
$$X = (\mathbf 1_t\otimes\mathbf 1_h\otimes\mathbf 1_b \;:\; I_{t\times t}\otimes\mathbf 1_h\otimes\mathbf 1_b \;:\; \mathbf 1_t\otimes I_{h\times h}\otimes\mathbf 1_b \;:\; \mathbf 1_t\otimes\mathbf 1_h\otimes I_{b\times b}).$$
(We write $\mathbf 1_{t\times 1}$ as $\mathbf 1_t$ for brevity.)
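The Kronecker structure of this design matrix is easy to set up numerically; a small illustration assuming Python with numpy, with illustrative values of $t$, $h$, $b$ (not part of the original solution):

```python
# Build the design matrix of Exercise 6.31 from Kronecker products (illustrative sizes).
import numpy as np

t, h, b = 3, 2, 4
one = lambda n: np.ones((n, 1))
X = np.hstack([
    np.kron(np.kron(one(t), one(h)), one(b)),      # general effect
    np.kron(np.kron(np.eye(t), one(h)), one(b)),   # treatment effects
    np.kron(np.kron(one(t), np.eye(h)), one(b)),   # type I block effects
    np.kron(np.kron(one(t), one(h)), np.eye(b)),   # type II block effects
])
print(X.shape, np.linalg.matrix_rank(X))           # (24, 10); rank 1+(t-1)+(h-1)+(b-1) = 7
```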
The relevant projection matrices are $P_\mu = P_{\mathbf 1_t}\otimes P_{\mathbf 1_h}\otimes P_{\mathbf 1_b}$, $P_\tau = (I-P_{\mathbf 1_t})\otimes P_{\mathbf 1_h}\otimes P_{\mathbf 1_b}$, $P_\theta = P_{\mathbf 1_t}\otimes(I-P_{\mathbf 1_h})\otimes P_{\mathbf 1_b}$ and $P_\beta = P_{\mathbf 1_t}\otimes P_{\mathbf 1_h}\otimes(I-P_{\mathbf 1_b})$. The sums of squares are
$$S_\tau = hb\sum_{i=1}^t(\bar y_{i\cdot\cdot}-\bar y_{\cdot\cdot\cdot})^2,\qquad S_\theta = tb\sum_{j=1}^h(\bar y_{\cdot j\cdot}-\bar y_{\cdot\cdot\cdot})^2,\qquad S_\beta = th\sum_{k=1}^b(\bar y_{\cdot\cdot k}-\bar y_{\cdot\cdot\cdot})^2,$$
$$R_0^2 = \sum_{i=1}^t\sum_{j=1}^h\sum_{k=1}^b(y_{ijk}-\bar y_{i\cdot\cdot}-\bar y_{\cdot j\cdot}-\bar y_{\cdot\cdot k}+2\bar y_{\cdot\cdot\cdot})^2.$$
The ANOVA is given in Table A.4.

Source | Sum of Squares | Degrees of Freedom | Mean Square
Between treatments | $S_\tau$ | $t-1$ | $MS_\tau = S_\tau/(t-1)$
Between type I blocks | $S_\theta$ | $h-1$ | $MS_\theta = S_\theta/(h-1)$
Between type II blocks | $S_\beta$ | $b-1$ | $MS_\beta = S_\beta/(b-1)$
Error | $R_0^2$ | $\nu_*$ | $MS_e = R_0^2/\nu_*$
Total | $S_t$ | $thb-1$ |

where $\nu_* = thb-t-h-b+2$.

Table A.4 ANOVA for Exercise 6.31
6.33 The model is
$$y_{ijkl} = \mu + \tau_i + \beta_j + \gamma_{ik} + \epsilon_{ijkl},$$
for $i=1,\dots,t$, $j=1,\dots,b$, $k=1,\dots,d$ and $l=1,\dots,m$, where $\gamma_{ik}$ is the effect of the $k$th dose level of the $i$th drug. If $\beta = (\mu : \tau_1 : \cdots : \tau_t : \beta_1 : \cdots : \beta_b : \gamma_{11} : \cdots : \gamma_{td})'$, and the indices $i$, $k$, $j$ and $l$ change successively faster, then the design matrix can be written as
$$X = (\mathbf 1_t\otimes\mathbf 1_d\otimes\mathbf 1_b \;:\; I_{t\times t}\otimes\mathbf 1_d\otimes\mathbf 1_b \;:\; \mathbf 1_t\otimes\mathbf 1_d\otimes I_{b\times b} \;:\; I_{t\times t}\otimes I_{d\times d}\otimes\mathbf 1_b)\otimes\mathbf 1_m.$$
(We write $\mathbf 1_{t\times1}$ as $\mathbf 1_t$ for brevity.) The projection matrices are
$$P_\mu = P_{\mathbf 1_t}\otimes P_{\mathbf 1_d}\otimes P_{\mathbf 1_b}\otimes P_{\mathbf 1_m},\qquad P_\tau = (I-P_{\mathbf 1_t})\otimes P_{\mathbf 1_d}\otimes P_{\mathbf 1_b}\otimes P_{\mathbf 1_m},$$
$$P_\gamma = I_{t\times t}\otimes(I-P_{\mathbf 1_d})\otimes P_{\mathbf 1_b}\otimes P_{\mathbf 1_m},\qquad P_e = I - \big[I_{t\times t}\otimes I_{d\times d}\otimes P_{\mathbf 1_b} + P_{\mathbf 1_t}\otimes P_{\mathbf 1_d}\otimes(I-P_{\mathbf 1_b})\big]\otimes P_{\mathbf 1_m}.$$
The sums of squares are
$$S_\tau = dbm\sum_{i=1}^t(\bar y_{i\cdot\cdot\cdot}-\bar y_{\cdot\cdot\cdot\cdot})^2,\qquad S_\gamma = bm\sum_{i=1}^t\sum_{k=1}^d(\bar y_{i\cdot k\cdot}-\bar y_{i\cdot\cdot\cdot})^2,\qquad S_\beta = tdm\sum_{j=1}^b(\bar y_{\cdot j\cdot\cdot}-\bar y_{\cdot\cdot\cdot\cdot})^2,$$
$$R_0^2 = \sum_{i=1}^t\sum_{j=1}^b\sum_{k=1}^d\sum_{l=1}^m(y_{ijkl}-\bar y_{i\cdot k\cdot}-\bar y_{\cdot j\cdot\cdot}+\bar y_{\cdot\cdot\cdot\cdot})^2,\qquad S_t = \sum_{i=1}^t\sum_{j=1}^b\sum_{k=1}^d\sum_{l=1}^m(y_{ijkl}-\bar y_{\cdot\cdot\cdot\cdot})^2.$$
The ANOVA is given in Table A.5.

Source | Sum of Squares | Degrees of Freedom | Mean Square
Between drugs | $S_\tau$ | $t-1$ | $MS_\tau = S_\tau/(t-1)$
Between doses | $S_\gamma$ | $t(d-1)$ | $MS_\gamma = S_\gamma/\{t(d-1)\}$
Between blocks | $S_\beta$ | $b-1$ | $MS_\beta = S_\beta/(b-1)$
Error | $R_0^2$ | $\nu_*$ | $MS_e = R_0^2/\nu_*$
Total | $S_t$ | $tdbm-1$ |

where $\nu_* = tdbm-td-b+1$.

Table A.5 ANOVA for Exercise 6.33
6.35 From (6.6.3) we have $\hat\eta = z'(I-P_X)y/z'(I-P_X)z$, that is,
$$\hat\eta = \frac{\sum_i\sum_j(z_{ij}-\bar z_{i\cdot}-\bar z_{\cdot j}+\bar z_{\cdot\cdot})(y_{ij}-\bar y_{i\cdot}-\bar y_{\cdot j}+\bar y_{\cdot\cdot})}{\sum_i\sum_j(z_{ij}-\bar z_{i\cdot}-\bar z_{\cdot j}+\bar z_{\cdot\cdot})^2}.$$
It follows from Remark 6.6.1 that $X\hat\beta = X\hat\beta_0 - \hat\eta X\hat\alpha$, where the elements of $X\hat\beta_0$ and $X\hat\alpha$ corresponding to $y_{ij}$ are $\bar y_{i\cdot}+\bar y_{\cdot j}-\bar y_{\cdot\cdot}$ and $\bar z_{i\cdot}+\bar z_{\cdot j}-\bar z_{\cdot\cdot}$, respectively. Thus, the fitted value of $y_{ij}$ is $(\bar y_{i\cdot}+\bar y_{\cdot j}-\bar y_{\cdot\cdot}) + \hat\eta(z_{ij}-\bar z_{i\cdot}-\bar z_{\cdot j}+\bar z_{\cdot\cdot})$. The variance of $\hat\eta$ is
$$\mathrm{Var}(\hat\eta) = \frac{\sigma^2}{z'(I-P_X)z} = \frac{\sigma^2}{\sum_i\sum_j(z_{ij}-\bar z_{i\cdot}-\bar z_{\cdot j}+\bar z_{\cdot\cdot})^2}.$$
As $X\hat\beta_0$ and $\hat\eta$ are uncorrelated, we have
$$D(X\hat\beta) = D(X\hat\beta_0) + D(X\hat\alpha\,\hat\eta) = \sigma^2\left[P_X + \frac{P_Xzz'P_X}{z'(I-P_X)z}\right].$$
The element of this matrix corresponding to the pair $y_{ij}$ and $y_{kl}$ is $\sigma^2$ times
$$\frac{1}{n} + \Big(\delta_{ik}-\frac{1}{t}\Big)\frac{1}{b} + \Big(\delta_{jl}-\frac{1}{b}\Big)\frac{1}{t} + \frac{(\bar z_{i\cdot}+\bar z_{\cdot j}-\bar z_{\cdot\cdot})(\bar z_{k\cdot}+\bar z_{\cdot l}-\bar z_{\cdot\cdot})}{\sum_i\sum_j(z_{ij}-\bar z_{i\cdot}-\bar z_{\cdot j}+\bar z_{\cdot\cdot})^2},$$
where $\delta_{ij} = 1$ when $i=j$ and 0 otherwise.
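The estimator $\hat\eta$ and the fitted values above are simple functions of the two-way residuals; a sketch assuming Python with numpy, with illustrative data $y$ and covariate $z$ (not part of the original solution):

```python
# Analysis-of-covariance computations of Exercise 6.35 for a t x b layout (illustrative data).
import numpy as np

rng = np.random.default_rng(0)
t, b = 4, 5
y = rng.normal(size=(t, b))
z = rng.normal(size=(t, b))

def sweep(a):
    # residuals a_ij - abar_i. - abar_.j + abar_.. of the two-way additive fit
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

eta_hat = (sweep(z) * sweep(y)).sum() / (sweep(z) ** 2).sum()
fitted = (y.mean(axis=1, keepdims=True) + y.mean(axis=0, keepdims=True) - y.mean()
          + eta_hat * sweep(z))
var_eta_over_sigma2 = 1.0 / (sweep(z) ** 2).sum()   # Var(eta_hat) / sigma^2
print(eta_hat, fitted.shape, var_eta_over_sigma2)
```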
6.37 Following (6.6.2), we have
$$\begin{pmatrix} R_0^2 & r' \\ r & R_z^2 \end{pmatrix} = \begin{pmatrix} y'(I-P_X)y & y'(I-P_X)z \\ z'(I-P_X)y & z'(I-P_X)z \end{pmatrix},$$
where $I-P_X = (I-P_{\mathbf 1_{t\times1}})\otimes(I-P_{\mathbf 1_{b\times1}})$. Under the null hypothesis we can ignore the effect of the treatments, so that $I-P_{X(I-P_{A'})} = (I-P_{\mathbf 1_{t\times1}})\otimes I_{b\times b}$. We have from (6.6.9)
$$\begin{pmatrix} R_{H0}^2 & r_H' \\ r_H & R_{Hz}^2 \end{pmatrix} = \begin{pmatrix} y'(I-P_{X(I-P_{A'})})y & y'(I-P_{X(I-P_{A'})})z \\ z'(I-P_{X(I-P_{A'})})y & z'(I-P_{X(I-P_{A'})})z \end{pmatrix}.$$
Further, from (6.6.7) and (6.6.8) the sums of squares corrected for covariates are
$$R_0^2 = \sum_i\sum_j(y_{ij}-\bar y_{i\cdot}-\bar y_{\cdot j}+\bar y_{\cdot\cdot})^2 - \frac{\big[\sum_i\sum_j(z_{ij}-\bar z_{i\cdot}-\bar z_{\cdot j}+\bar z_{\cdot\cdot})(y_{ij}-\bar y_{i\cdot}-\bar y_{\cdot j}+\bar y_{\cdot\cdot})\big]^2}{\sum_i\sum_j(z_{ij}-\bar z_{i\cdot}-\bar z_{\cdot j}+\bar z_{\cdot\cdot})^2},$$
$$R_H^2 = y'(I-P_{X(I-P_{A'})})y - \frac{[y'(I-P_{X(I-P_{A'})})z]^2}{z'(I-P_{X(I-P_{A'})})z} = \sum_i\sum_j(y_{ij}-\bar y_{\cdot j})^2 - \frac{\big[\sum_i\sum_j(z_{ij}-\bar z_{\cdot j})(y_{ij}-\bar y_{\cdot j})\big]^2}{\sum_i\sum_j(z_{ij}-\bar z_{\cdot j})^2}.$$
It is clear that in this case $\rho(X : z) = \rho(X) + \rho(z)$. Hence, the degrees of freedom associated with $R_0^2$ and $R_H^2$ are $n-t-b$ and $n-b-1$, respectively. The GLRT would reject the null hypothesis for large values of the statistic $\dfrac{(R_H^2-R_0^2)/(t-1)}{R_0^2/(n-t-b)}$, whose null distribution is $F_{t-1,\,n-t-b}$.
6.39 We only have to simplify the GLRT described in the solution of Exercise 6.37. Here, $z$ is the indicator of the cell $(k,l)$:
$$z_{ij} = \begin{cases} 1 & \text{if } i=k,\ j=l,\\ 0 & \text{otherwise,}\end{cases}\qquad
\bar z_{i\cdot} = \begin{cases} 1/b & \text{if } i=k,\\ 0 & \text{otherwise,}\end{cases}\qquad
\bar z_{\cdot j} = \begin{cases} 1/t & \text{if } j=l,\\ 0 & \text{otherwise,}\end{cases}\qquad
\bar z_{\cdot\cdot} = \frac{1}{tb}.$$
Therefore, we have the simplifications
$$R_0^2 = \sum_i\sum_j(y_{ij}-\bar y_{i\cdot}-\bar y_{\cdot j}+\bar y_{\cdot\cdot})^2 - \frac{tb\,(y_{kl}-\bar y_{k\cdot}-\bar y_{\cdot l}+\bar y_{\cdot\cdot})^2}{(t-1)(b-1)},\qquad
R_H^2 = \sum_i\sum_j(y_{ij}-\bar y_{\cdot j})^2 - \frac{t\,(y_{kl}-\bar y_{\cdot l})^2}{t-1}.$$
The GLRT statistic is $\dfrac{(R_H^2-R_0^2)/(t-1)}{R_0^2/(n-t-b)}$, and its null distribution is $F_{t-1,\,n-t-b}$.
6.41
(a) When 2/jt; = 2/a, the covariate correction term vanishes.
(b) The covariate-corrected $R_H^2$ at $y_{kl} = \hat y_a$ simplifies as follows:
$$\sum_i\sum_j(y_{ij}-\bar y_{\cdot j})^2 - \frac{t(\hat y_a-\bar y_{\cdot l})^2}{t-1}
= \sum_i\sum_{j\ne l}(y_{ij}-\bar y_{\cdot j})^2 + \sum_i\Big(y_{il}-\frac{(t-1)\hat y_b+\hat y_a}{t}\Big)^2 - \frac{t}{t-1}\Big(\hat y_a-\frac{(t-1)\hat y_b+\hat y_a}{t}\Big)^2.$$
(c) It is clear from parts (a) and (b) that the value of the test statistic of Exercise 6.39 at $y_{kl} = \hat y_a$ is the same as the test statistic of Exercise 6.29. However, it was shown in Exercise 6.40 that the test statistic of Exercise 6.39 does not depend on $y_{kl}$. This proves the result.
Chapter 7
7.1
(a) Let $\bar y_{i\cdot} = n^{-1}\sum_{j=1}^n y_{ij}$, $i=1,\dots,m$. Then we can use the linear model $\bar y_{i\cdot} = x_i'\beta + \eta_i$, with $E(\eta_i)=0$, $\mathrm{Var}(\eta_i)=\sigma^2/n$ for $i=1,\dots,m$, and $\mathrm{Cov}(\eta_i,\eta_j)=0$ for $j\ne i$. The dispersion matrix is not singular.
(b) Regress $(I-P_{X_2})y$ on $(I-P_{X_2})x_1$, where $y$ and $x_1$ are the columns of values of stack loss and air-flow, respectively, and $X_2$ is the matrix consisting of the columns of values of the other two explanatory variables. The dispersion matrix is $\sigma^2(I-P_{X_2})$, which is singular.
(c) The explanatory variables may also be centered, with no loss of information. Then the centered data model of Example 7.3.5 may be used. The dispersion matrix, $\sigma^2(I-n^{-1}\mathbf 1\mathbf 1')$, is singular. The coefficients are not directly interpretable because of the scaling, but the $t$-ratios for their significance are appropriate for
the original model. The residual sum of squares is appropriate, and tests of all linear hypotheses can be carried out in spite of the loss of some part of the data.
(d) The dispersion matrix (of order $20\times 20$) is singular.
7.3 The sufficiency of the condition p G C(X') is obvious. To prove necessity, let k'y+c be a linear estimator of p'/3 such that E(k'y+c) = p'/3 for all /3 satisfying the condition (I - Py)Xf3 = d. Let /30 be a choice of 0 which satisfies the condition (I — Pv)X[i = d. Then fi1 = fiQ + (7 - P )p is another such choice. Therefore, k'Xj3t + c = p'Pi, i = 0,1. It follows that p'/30 = k'X(30 + c = k'X(3l + c = p'13, = p'(30 + ||(I - Px, )p\\2. Therefore, \\(I - P. )p\\2 = 0 and p must be in C(X'). 7.5 It was shown after (7.3.1) that the BLUE of an estimable LPF A& is AX~y. On the other hand, any LZF B(I — Px)y can be written as B(I-Px)e, see (7.3.3). 7.7 Let k'y be the BLUE of E(l'y). Obviously k'y is uncorrelated with I'y - k'y, which is an LZF. Then Var(k'y) = Cov{l'y,k'y) = 0. Thus, I'y must be an LZF plus a BLUE of zero variance. If V is full rank, then there is no BLUE with zero variance, and hence I'y must be an LZF. 7.9 If {y,Xl3,(T2V) is the model for y, then {C-ly,C-lXp,a2I) is the equivalent model for C~ly. The BLUE of A/3 obtained from the latter (homoscedastic) model is
A0 = AKc-'xyic-'xTic-'xnc-'y) = Aix'v^xyx'v^y, and its dispersion is
C(V).
7.13 Let CC' be a rank-factorization of V. Then C(V : X) = C(C : X) = C((C : X)(C : X)') = C(V + XX'). The expression for the BLUE follows from Proposition 7.7.2(a). 7.15 If z is a vector of LZFs whose elements constitute a generating set, and CC' is a rank-factorization of a~2D(z), then for every left-inverse C~L of C, the elements of the vector C~Lz constitutes a standardized basis set of LZFs. This fact follows from the arguments of Exercise 4.23. Hence, R2 = z'(C-L)'C-Lz = z[a~2D(z)}-Z for any choice of the g-inverse in the latter expression. 7.17 Rank factorize V as CC' and U as KK'. Then C{C) C C(C : XK) C C{C : X). This is equivalent to C(V) C C{V + XUX') C C(V :X). 7.19 Let Ui = P[u and u2 = P'2u. Then u is the sum of the orthogonal vectors Pitti and P2U2- The constraint X(3 + Fu = y is equivalent to QX/3 + QFu = Qy, which can be rewritten as (Q.FP, \Q2FPX
0 \ (uA Q2x)\f3)
_(
QlV \ \Q2y-Q2FP2u2)'
in view of the facts Q1X = 0 and QXFP2 = 0. The matrix on the left hand side has full column rank, and the equation is consistent (as X/3 + Fu = y is consistent). Hence there is a solution to the above equation for every y and u2. As QxFPi has full column rank, u\ is uniquely determined by the first of the above two equations, QxFPiUi = Qxy. The objective function ||u||2 = ||ui|| 2 + ||u 2 || 2 is thus minimized by setting ||tt2||2 = 0. Note that this (like any other) choice of u2 does not violate the constraints as the matrix equations given above always have a solution as long as u\ satisfies the first equation. Comparing the dispersions of the two sides of the equation Q1FPiU\ = QiJ/i w e have Q1FP1D(u1)P'1F'Q[
= ^ Q i F F ' Q i = a2Q1FPP'F'Q[ = a2Q1FP1P[F'Ql1,
as QlFP2 = 0. Since Q^^FPi has a left-inverse, we have D(ui) = a21. Further, Ui must be an LZF, as EKQ.FP.y'Q.y] = (Q1FP1)~1Q1X0 = 0.
xi\ has p(V : X) — p(X) elements. Therefore, its elements must constitute a standardized basis set of LZFs. Comparing the dispersions of the two sides of the equation Q2FP-\Ui +Q2X(3 = Q2y, we have a2Q2FPlP'1F'Q!l
+ Q2D(XP)Q'2
=
a2Q2FPP'F'Q'2.
Thus, Q2D(XP)Q2 = a2Q2FP2P'2F'Q2. Since the columns of Q2 are linearly independent, this matrix has a left-inverse. Hence,
D(Xp) =
u2FP2P'2F'.
7.21 Let CC' be a rank-factorization of V, C~L be a left-inverse of C and F be a matrix such that FV = 0 and p(F) =n- p{V). Then
ra~*(r:*') As y 6 C(V : X) = C(V), one can retrieve y from the above vector by premultiplying the latter with (C : 0). Since C~Ly and F y are independent and the distribution of the latter is free of /3 and a2, we can ignore it for consideration of the information matrix. If follows from the discussion of Section 4.11 that the information matrix for 9 is
(\{C-LX)'C~LX
o
0 \
f-^X'V-X
<*n\ = r
0
0 \
pmy
\ 2(74 / V 2(74 / As C{X) C C(V), the expression does not depend on the choice of the g-inverse. The expressions for the Cramer-Rao lower bounds follow immediately. 7.23
(a) MR is equivalent to the unrestricted model
Mr = {y- XA\AA')-£,X(I where 0 = (I - P ,)6 + XA'(AA')-£.
-
PA,)6,a2V), Therefore, X@ is the
BLUE of X(3 under MR if and only if it is uncorrelated with the LZFs of Mr. The algebraic form of this condition simplifies to orthogonality of the columns of D(X(3) and / - P . Since the column space of the latter matrix is C{X(I — P ,)) , the condition is equivalent to C{D(XP)) result follows from Proposition 7.3.9.
C C{X(I-P
,)). The
(b) If $V$ is nonsingular, $\mathcal C(V)\cap\mathcal C(X)$ simplifies to $\mathcal C(X)$. Hence, the condition of part (a) reduces to $\mathcal C(X) = \mathcal C(X(I-P_{A'}))$, that is, $\rho(X') = \rho((I-P_{A'})X') = \rho(X' : A') - \rho(A')$. The rank condition $\rho(X' : A') = \rho(X') + \rho(A')$ holds if and only if $\mathcal C(X')$ and $\mathcal C(A')$ are virtually disjoint, or $A\beta = \xi$ is a completely untestable hypothesis in $\mathcal M$.
7.25 Using an argument similar to the one leading to (7.9.5), we have that MSE(X~P) - MS{X0R) is nonnegative definite whenever D{AJ3) + a2W-(AP-£)(A0-$)' is nonnegative definite. If follows from Exercise 2.19 that a sufficient condition for the latter is (A/3—£)'[D(Al3) + <72W}-(A/3 - £) < 1.
7.27 Suppose that I'y is an LUE of p'fi1. From Proposition 7.2.3, I'y is almost surely equal to k'y where k'Xipi + k'X2/32 = P'Pi f° r all (31 and /3 2 . Therefore X[k = p and X'2k = 0. It follows that k 6 C(X 2 ) X = C{I - PX2), that is, k is of the form (/ - PX2)m for some vector m. Consequently p = X[(I - Px )m, that is, p G c(x\(i - PX2)). On the other hand, if p 6 C(X[(I - Px )), then there is a vector m such that p = X[(I - PX2)m. Then m'(I - PX2)y is an LUE of p'/3i7.29 The numerator and denominator of the GLRT statistic are maximized with respect to /3 when the respective exponents are minimized. The minimized values are R2H/(2a2) and R%/(2a2), respectively. Hence, the GLRT statistic is m&x(27va2)-dI^1\C'C\-iexp{-R2H/{2a2)] max(2Tra2)-£i^a\ClC\-3
exp[-R2/(2a2)} '
a2
where CC' is a rank-factorization of V. The numerator is maximized when a2 = R2Hjp(V:X), while the denominator is maximized when a2 = R2/p(V:X). Substituting these values, we have
which is a monotone decreasing function of the ratio given in Proposition 7.11.2. The null hypothesis is rejected for small values of I, and hence, for large values of this ratio. It follows from Proposition 7.9.2
and the subsequent discussion that the latter is the ratio of averages of squares of two sets of independent LZFs of variance a2, whenever the null hypothesis holds. The number of LZFs in the two sets are m and n' — r, respectively. Hence, the ratio has Fm^ni^r distribution under Ho- The result follows. 7.31
(a) As yo-Xof3-V'oV-(y-Xl3)~N(O,a2(VOo-V'oV-Vo)), inequality
the
(l/o - XO0 - V'0V-(y - X/3)y[a2(VOo-V'oV-Vo)}~ (yo-Xo0-V'oV-(y-X/3)) < j£ 7 holds with probability 1 - 7. The result follows from Exercise 5.5. (b) Simultaneous confidence intervals for x'Oj/3 - v'0:jV~Xf3, j = l,...,q, with confidence coefficient 1 — a/2 are given by the Scheffe confidence intervals of Section 7.12 with XOJ-X'V~VOJ, m' and a/2 replacing a.j, m and a, respectively. These are
\(x'OjX- -
v'OjV-)X0-yJm'Fm,,n,_1,a/2cP,
(x'OjX- - v'ojV-)X0+y/m'Fm,in,_ria/2cP
l,
j = 1,... ,q. A 100(1 - a/2)% upper confidence limit for a2 is
[0,^/j&_r,l-a/2]Because of the Bonferroni inequality, x'Ojf3 - v'0jV~Xf3, j = l,...,q, and a2 simultaneously belong to their respective confidence intervals with probability at least 1 — a. The worst-case combination of these inequalities, together with part (a) gives the requisite simultaneous tolerance intervals. (c) When VOo - V"QV~V"O is a diagonal matrix, the components of yo-Xo{3-V'oV~(y-Xf3) are independently distributed as 7V(0, <J2(voi — v'oiV~voi)),
i = l,...,q.
Hence, each of t h e in-
equalities \yOi - x'Oi/3 - v'oiV'iy - X0)\ < zl/2a^vOi
-
v'OiV-vOi,
i — 1,..., q, holds with probability 1 — 7. When there are arbitrary number of independent replicates of any combination of
Voi, ,yoq, these inequalities are satisfied by 100(1 — 7)% of these (on the average). The stated result is obtained from the above by using the worst-case combination of the 100(1 - Q / 2 ) % Scheffe confidence intervals of x'Oi(3 - v'OiV~X(3, i = l,...,q, and 100(1 — a/2)% upper confidence limit for er2 given in part (b), together with the Bonferroni inequality. 7.33 As the strata are uncorrelated, we can consider one stratum at a time. The model for the ith stratum is
We have
v - 1 = [(l-pji
+ piUV1
= T^— [
f
- i .
1 - Pi [
r
/
-n " ' 1 >
1 + (Ui - \)Pi
J
Tij being the number of sampled units from the ith stratum. Since
Xs = 1, X'sV-}Xt
simplifies to m/[l + (n< - l)Pi], and
X'.Vjfy,
simplifies to riiySi/[l + (nt - l)pt], where ysi is the sample average within the ith stratum. Hence, the BULE of /j,i is
Ai = {X'.V^X.r^X'.V-ty.
= ysi.
It follows from (7.13.2) that the BLUP of the population total in the ith stratum is l'ys + l'[XrfH + V r . V - H v . - lAi)] = niVsi + (Ni - ni)ySi
+ _£L_i'll' [j 1 - Pi
=
L
11'] (y, - 1M
P
1 + (nj - l)pi
J
Niy.i,
where Ni is the population size in the ith stratum. This is just the expansion estimator, which remains the BLUP in spite of the withinstrata correlation. The MSEP of the BLUP is
o*i'(v -v °i
1 \
v
rr
v-v
* rs*
s s v
)llJ2[il(XrX;-vrsv7s)xsr sr)l
^ "i
„,.._!„ - * sV ss - * «
This expression simplifies to $\sigma_i^2(1-\rho_i)N_i(N_i-n_i)/n_i$.
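The fact that the expansion estimator remains the BLUP under within-stratum intraclass correlation is easy to confirm numerically for one stratum; a sketch assuming Python with numpy, with illustrative values of $N$, $n$, $\rho$ (not part of the original solution):

```python
# Check for one stratum: with intraclass correlation the BLUP of the stratum
# total equals the expansion estimator N * ybar_s (Exercise 7.33).
import numpy as np

rng = np.random.default_rng(4)
N, n, rho, sigma2, mu = 10, 4, 0.3, 2.0, 5.0

V = sigma2 * ((1 - rho) * np.eye(N) + rho * np.ones((N, N)))   # intraclass dispersion
y = rng.multivariate_normal(mu * np.ones(N), V)
ys = y[:n]                                   # sampled units
Vss, Vrs = V[:n, :n], V[n:, :n]
Xs, Xr = np.ones((n, 1)), np.ones((N - n, 1))

beta_hat = np.linalg.solve(Xs.T @ np.linalg.solve(Vss, Xs),
                           Xs.T @ np.linalg.solve(Vss, ys))    # GLS estimate of mu
pred_r = Xr @ beta_hat + Vrs @ np.linalg.solve(Vss, ys - Xs @ beta_hat)
blup_total = ys.sum() + pred_r.sum()
print(np.isclose(blup_total, N * ys.mean()))                   # True
```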
Chapter 8 8.1 With reparametrization, the model reduces to the one-way classification model of Example 8.1.4, for which the result has already been proved. 8.3 Sufficiency of the condition is obvious. In order to prove necessity, note that the LSE of the block contrast /3j1 — (3j2 is y.jr. - y.j2., while the BLUE is Z^i=l 2~ik=l yijik/y/<Tiji _ 2->t=l 2^fc=l Vijikf y/O'iji i=l niji I \lan\
2-ii=l niJ21
y/aih
These two are identical if and only if the coefficients match, that is, aiji = <jij2. Likewise, a^j = cr^j. In order to ensure this for all combinations of indices, we must have all cr^s equal. 8.5 Let y = {y\ : Then
: y'p)',
V
= (v'i
V'PY and e = (el :
: e^)'.
y = ( l p ® X)j8 + [{Ipxp ® X)r] + e]. The dispersion matrix of the error (shown in square brackets) is V =
IPxp ® (X'SX' + a2l). Since V(lp ® X)=l p ®(A-SX'X+<7 2 X)=(l p ®X)(/ p X p ®(SX'X+«7 2 /)), the conclusion follows from Proposition 8.1.2(a). 8.7 Let l'Pxy be an LSE with zero variance, so that IPXVPX1 = 0. It follows that VPxl = 0, that is, Pxl G C(V)X f)C(X). Hence, C(V) X and C{X) are not virtually disjoint. On the other hand, if these two column spaces are not virtually disjoint, then there is a nontrivial vector m lying in the intersection such that Pxm = m, and m'y is easily seen to be an LSE with zero variance. If I'Px y is an LSE with zero variance, then its variance is overestimated whenever l'Pxl > 0. This generally holds, as l'Pxl = 0 if and only if PxI = 0, that is, l'Pxy is zero with probability 1. 8.9 Write the model error as VI. Then X0pi = [I-V(I-PX){(I-PX)V(I-PX)}-(I-PX)}(X0 = XP + [V-
+ VI)
V(I-Px){(I-Px)V(I-Px)}-(I-Px)V]l.
Solutions to Odd-Numbered Exercises The proof of Proposition 7.3.9 implies that the column space of the matrix in front of I in the last expression is C(X) f\C(V). The conclusion follows.
8.11 In this case V(0) = a2 V. It follows from discussion preceding Remark 8.2.5 that the REML estimator of a2 is the minimizer of log | a2 G' G | + Rl/cr2, where Rl is the SSE of the model (y, X/3, a2 V), and G is such that GG' is a rank-factorization of (/ - PX)V{I - Px). Note that log \(T2GG'\ = p((I -PX)V) log(cr2) + log\GG'\. Differentiating the expression to be minimized, we have the derivative with respect to a2 P((I-Px)V)/a2-R20/a\
which is positive for small values of a2 and negative for large values of a2. Setting the derivative equal to zero, we obtain the REML estimator 72 = Rl/p{{I - PX)V) = R2/[p(V : X) - p{X)}. This coincides with the natural unbiased estimator of a2 obtained in Section 7.4. 8.13 The first part follows from derivative calculations such as dl°^} = A^1. The REML estimating equation is obtained from the ML estimating equation by replacing e and V(0) by U'e and U'V(9)U, respectively, where UU' is a rank-factorization of I — Px. 8.15 Let Q t = \{Q 4- Q') so that y'Q^y = y'Qy and Q t is symmetric. Let UAU' be a spectral decomposition of Q^, A being a nonsingular diagonal matrix of non-zero eigenvalues. From the proof of Proposition 8.3.1(b), we have P'X'Q^XP = 0 for all /3 such that (/ - Pv(e)) X(H = (i - Pv(e)) y, as a necessary condition of unbiasedness. The condition 0'X'Q^Xp = 0 is equivalent to U'X/3 = 0. Hence, the above necessary condition is equivalent to U'y being an LZF. From Proposition 7.2.3, there is a matrix U* such that Uy = Uty almost surely and U*X — 0. Define Q_ = U*AU',. This matrix satisfies the requisite conditions. 8.17 The matrix G is equal to I _ , _ „ > I, whose column space does \ u t r (. i ~ yx> / not include (1 : 0)'. Hence, a\ does not have a quadratic unbiased
estimator. However, (tr(XX'XX') * ~ \ tr(XX')
tr(XX')\ tr(J) ) '
This matrix is nonsingular under the given condition. Hence, of is identifiable. 8.19 The statement of Remark 7.3.11 is easy to prove. The REML estimating equation is obtained by replacing Vj, V(9) and e(9) by U'VjU, U'V(O)U and U'e(0), respectively, UU' being a renk-factorization oiI-Px. 8.21 Expand the right hand side of (8.3.11) as 2tr(QV(9)QV(9))
=
= 2tr (Q^^UiU^Q^^UjU'j I
2J2J2afa]\\U'iQUj\\l. i=i j=i
For fixed 0, this is similar to the objective function (8.3.7) which is minimized by the MINQUE under the same conditions of unbiasedness and translation invariance. Therefore, by Proposition 8.3.10, the MIVQUE must have the stated form. The argument holds similarly when 9 is replaced by an approximation, w. However, if w is a function of y, then the approximate MIVQUE may not be a quadratic or unbiased estimator. 8.23 The result follows by simplifying the expression of BLUP given in Proposition 7.13.1 with x0 = 0, v0 = Cov (TJ, X)*=i ^ T i )
8.25
a n dV
=
(a) By making use of the inequality between arithmetic and geometric means, we have nw
^
^
Z^j=l
ei+l
ei +22^i=i
Z^i=l f i
+2^i=l
y,n
2
l e '+il" \ei\
lez+l + le*l )
_
-
A
q-
(b) Writing $e_{i+1}$ as
(
1-1
0
0
-1
1
0
0
0 \ ...
0
1 - 1 /
Write the residual vector e as ( / — H)e. Since Ae and e have zero mean, the probability limit of DW is
tr(AZ?(e)A') _ tr(A(I-H)V(a2, (/>)(!- H)A') tr(D(e)) " tr((I-H)V(
Therefore, each diagonal element of AHVA is bounded from above by 8ca2/n(l — |0|), and the trace has smaller magnitude than 8CCT2/(1 - \(f>\). Thus, 2tr(AHVA') is negligible in comparison to tr(AVA') for large n. Likewise, the magnitude of the (i,j)th element of HVH is bounded from above as
1=1 m = l
(=1 m = l
C2<72 v ^ . , s=0
| s
.
.
. C2<72 ^ s=0
C2CT2 v
| r l y
Therefore, the trace of AHVHA has smaller magnitude than 4c 2 a 2 /(I - |<£|), which is negligible in comparison to tr(AVA')
for large $n$. Thus, $\mathrm{tr}(A(I-H)V(I-H)A')/n$ converges to $2\sigma^2(1-\phi)$. Using a similar argument, $\mathrm{tr}((I-H)V(I-H))/n$ can be shown to converge to $\sigma^2$. Hence, the probability limit of DW is $2(1-\phi)$.
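This limit is easy to see in simulation; a sketch assuming Python with numpy, with illustrative sample size, regressors and $\phi$ (not part of the original solution):

```python
# Empirical check that the Durbin-Watson statistic with AR(1) errors is close to 2(1 - phi).
import numpy as np

rng = np.random.default_rng(1)
n, phi = 5000, 0.6

e = np.zeros(n)                                  # AR(1) errors
for i in range(1, n):
    e[i] = phi * e[i - 1] + rng.normal()
X = np.column_stack([np.ones(n), np.linspace(0, 1, n)])
y = X @ np.array([1.0, 2.0]) + e

resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
DW = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(DW, 2 * (1 - phi))                         # the two numbers should be close for large n
```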
8.27 The BLUE is
Its dispersion is
jD(3) = ^lE f f i( 5 (%))" 1 ] Since y^ - -X^/3 is an LZF in the jth submodel as well as the full model, it is uncorrelated with f)^ and 0. Hence,
(
\
m
m
E»J-* E I I ^ - ^ I I 3
E»i~* E \}\Vi ~ x'Ai)U2 + Wxfiu) - X'M2}
Chapter 9
(
3= 1
)
3=1
m
\
m
E»J-* ECT' [{ni-Q+Gw-PYiDGw))-1^-?)]. 3= 1
)
3=1
Chapter 9 9.1 It is enough to prove that there are additional estimable LPFs in the augmented model if and only if p(Xn) — p(Xm) > 0, and that there are additional LZFs if and only if U > p(Xn) — p(Xm). The first statement follows from the fact that there is a p'/3 such that p € C(X'n) but p i C(X'm) if and only if p(Xn) > p(Xm). The second statement is a consequence of the representation L-p(Xn)+p(Xm) = [p(Xn : Vn)-p(Xn))-[p{Xm : Vm)-p(Xm)],
Solutions to Odd-Numbered Exercises and the fact that the two bracketed terms in the last expression represent the number of elements in a standardized basis set of LZFs for Mn and Mm, respectively (see Proposition 7.4.1). 9.3 The BLUE of A(3 in Mn is the sum of the BLUEs of LxXmP and L2X1/3 in this model (see Exercise 4.15; the result holds in the general linear model also). The BLUE of the first part is LiXmf3m, while the BLUE of the second part, as given by Proposition 9.1.9, is Liyt — IJiVimV^n(ym — Xmj3m). The result follows. The dispersion is LmD{Xm^m){L'm + V-mVmlL\) +Ll[a2Vl - VlmV-mD{ym 9.5
XmpjV^Vml]L\
(a) As e;/(l - hi)1/2 is an LZF with variance a2, it can be regarded as a part of a standardized basis set of LZFs whose sum of squares would produce the error sum of squares. It follows that e?/(I - hi) < e'e = (n - p(X))'?,
that is, r\
p(X). The
equality holds when ef/(l - hi) alone accounts for the entire error sum of squares, that is, all LZFs/residuals uncorrelated with ei are zero. (b) The definitions of r» and U imply U/ri = (
^ ( n - p(X) - 1)
n - p{X) - 1
? ( _ 0 ~ (n - p(X)W - e 2 /(l - hi) ~ « - P&) - A ' The stated result follows. Also, rt-tl{n-p(X)-l
+ t*J-
If follows from Exercise 4.25(c) that hi = 1 implies n = U = 0. 9.7 Note that et = faet-i +«i,t-i +St, vk,t = fa+iCt-i +vk+i,t-i+0kSt for k = 1 , . . . ,r - 2 and Dr_i,t =
Bt=
. 0 VO
0 0i 02
.
0 0 1 0 0 1
0\ 0 0
. . .
.
0r_i 0 0r 0
0 0
1 0/
/ [
, ut=
0 \ 1 6>i
. Qr-2 \9r-lJ
f
a
t \
5t, Ht = I 1
,
and $v_t = 0$. The update equations (9.1.16)-(9.1.21) simplify to the recursions
$$\hat x_{t|t-1} = B_t\hat x_{t-1},\qquad P_{t|t-1} = B_tP_{t-1}B_t' + u_tu_t',$$
$$\hat x_t = \hat x_{t|t-1} + \frac{P_{t|t-1}H_t'}{H_tP_{t|t-1}H_t'}\,(y_t - H_t\hat x_{t|t-1}),\qquad P_t = P_{t|t-1} - \frac{P_{t|t-1}H_t'H_tP_{t|t-1}}{H_tP_{t|t-1}H_t'},$$
$\hat\beta_t$ being obtained from the top part of $\hat x_t$, and $D(\hat\beta_t)$ from the top left block of $P_t$.
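These recursions are simple to code; a sketch assuming Python with numpy, where $B_t$, $u_t$, $H_t$ and the data are placeholders rather than the ARMA-specific matrices of the exercise, and the state-noise contribution $u_tu_t'$ is included as part of the time update:

```python
# One step of the simplified recursions of Exercise 9.7 (scalar observation, no measurement noise).
import numpy as np

def kalman_step(x_prev, P_prev, B_t, u_t, H_t, y_t):
    # time update
    x_pred = B_t @ x_prev
    P_pred = B_t @ P_prev @ B_t.T + u_t @ u_t.T          # includes state-noise term u_t u_t'
    # measurement update; with v_t = 0 the gain denominator is H_t P_pred H_t'
    denom = float(H_t @ P_pred @ H_t.T)
    gain = (P_pred @ H_t.T) / denom
    x_new = x_pred + gain * (y_t - float(H_t @ x_pred))
    P_new = P_pred - gain @ (H_t @ P_pred)
    return x_new, P_new

# toy usage with a two-dimensional state
B = np.array([[0.8, 1.0], [0.0, 0.0]])
u = np.array([[1.0], [0.5]])
H = np.array([[1.0, 0.0]])
x, P = np.zeros((2, 1)), np.eye(2)
for y in [0.3, -0.1, 0.4]:
    x, P = kalman_step(x, P, B, u, H, y)
print(x.ravel(), P)
```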
9.9 It is clear that
n = w, + di0n) - dt0J =wl + (-V,TOV- :
I)(Xjm-Xjn)
: I)Cov(XnJ3m,w,)[D(wi)]-vih
by making use of part (a) of Proposition 9.1.8. The converse holds only if p(D(n))
= h - [p(Xn)
-
p{Xm)).
9.11 The result follows by using the argument given in Section 9.2.5 together with part (d) of Proposition 9.1.8 and Exercise 9.4. 9.13 We can write the hypothesis as TA/3 = 0, where A = (0 : I) and T = (I — Px )X(j). We only have to verify that the matrix T satisfies the condition of Proposition 5-3.6(a), that is, C(A'T') - C{A')f)C{X[k)). It is easy to see that A'T' = X[k){I - Px ) , so that C{A'T') C C(A') n C(X(k)). To complete the proof, we have to establish the equality of the dimensions of the two vector spaces. It follows from Proposition 2.3.2(a) that dim(C(A') n C(X'(k))) = p(A') + p(X{k)) - p(A' : X{k)) = j + p(X(k)) -\j + p(X(h))] = p(X(k)) - p(X(h)) = p(TA). 9.15 The expected values of i> and t are (I — Px
PxJH1 ~ Pxjm
)X^fi^
and X'^ (I -
- PXJ}-V - PxJXU)Pu)> resPectively.
These are estimable in the residual model, since we can write these as
(I-pxJxu)re.Pu)^Xu)V-PxJ{vV-rxJ}-Xv)~.f'u), respectively. Since C(X^)res)
C C(V), we conclude from Remark
7.6.1 that the BLUE of the expected value oft is obtained by replacing /3{j) with {X[j)resV~X(j)res)-X'(j)resV~yres in its expression. In view of Remark 9.4.4, this expression simplifies to D{t)[D(t)]~t or t. The fact that v is the BLUE of its expectation follows from the result of Exercise 9.14. 9.17
(a) When the jth block is deleted, the corresponding block mean becomes non-estimable and the error degrees of freedom reduces by t — \. Note that the LZF rt defined in Proposition 9.2.3 consists dj,... ,etj, defined in (6.3.5). It follows from the form of the corresponding projection matrix that n[<7~2D(ri)]~ri is just the sum of squares of these residuals. Thus, the error sum of squares reduces by £* = 1 e?;. The reduction factor 1 - E!=i e%l^o c a n be regarded as a measure of influence of the jth block on the residual sum of squares. (b) The change in the between treatments sum of squares can be obtained from change in the error sum of squares under the restriction that all blocks are equivalent, as per Proposition 9.2.4. However, an easier way is to calculate the change in the constituent terms of ST, given in Table 6.2. The value of 5 r changes by the factor (b/ST) ZLiKVi- - V-) - eij/ib-l)]2, which can be called a measure of influence of the jth block on the between treatments sum of squares. (c) The GLR.T statistic is altered by the factor
Sr (l ~ E U 4 ) lRl (6-l)£!=iKfc--*-)-e«/(&-l)]2' the revised null distribution being i7i-i,(6-2)(t-i)- The above factor is a measure of influence of the jth block on the GLRT. 9.19
(a) Since l'yres = 0 and l'x{h+i),res = 0> the least squares fitted line through the added variable plot has intercept equal to 0. The slope is equal to the BLUE of Ph+i from the model (yres,X(h+i)tJ.es/3h+i,&2I)- The result follows from the discussion of Section 9.4.4. (b) The residual vector for the model (y, Jf(/1_)_1)/3//l+1),<72J) is v-px(h+1))y
-
('-PXW
-p{I-PX{h))Hh+i))y
=
(J-iVn).,..)j're..
which is the residual of the model (yres,X(h+i)tres/3h+i,<72I)(c) Although the error sum of squares for the two models are identical, the error degrees of freedom are n - 1 and n — p(X(h+l), respectively. The discrepancy is because of the fact that the model (yres,X(h+1):res(3h,+i,(T2I) is too simplistic a model. The dispersion matrix of yres is not a21, but is equal to a2 (I — Px ). If the correct dispersion matrix is used, then the resulting estimator of a2 would be the same as that from the model
$(y, X_{(h+1)}\beta_{(h+1)}, \sigma^2 I)$.
Chapter 10
10.1
(a) The first row of $B$ represents the general effect while the other three rows represent additional effects of the three species. The four columns represent these effects on the four characteristics. The estimable parameters are differences of species effects, the average of the effects (that is, the first row of $B$ plus the average of the last three rows) and all linear combinations of these.
(b) If the indices 1, 2, 3 and 4 are used for the four rows, then the estimates are: $\hat\beta_{11}-\hat\beta_{21} = -0.930$, $\hat\beta_{12}-\hat\beta_{22} = 0.658$, $\hat\beta_{13}-\hat\beta_{23} = -2.798$, $\hat\beta_{14}-\hat\beta_{24} = -1.080$. The estimated dispersion matrix is
$$\begin{pmatrix} 0.01060 & 0.00371 & 0.00670 & 0.00154\\ 0.00371 & 0.00462 & 0.00221 & 0.00131\\ 0.00670 & 0.00221 & 0.00741 & 0.00171\\ 0.00154 & 0.00131 & 0.00171 & 0.00168\end{pmatrix}.$$
10.3 Let the random matrices $Y_{n\times q}$ and $Z_{n\times p}$ be such that $X = (\mathbf 1 : Z)$ and
$$\mathrm{vec}(Y : Z) \sim N\big((\mu_y' : \mu_z')'\otimes\mathbf 1_{n\times 1},\; \Sigma_{(p+q)\times(p+q)}\otimes V_{n\times n}\big).$$
It follows along the lines of the proof of Proposition 3.4.2 that
$$E(Y|X) = (\mathbf 1 : Z)B,\qquad D(\mathrm{vec}(Y)|X) = \Sigma_*\otimes V,$$
where $B = \big(\mu_y' - \mu_z'\Sigma_{zz}^{-1}\Sigma_{zy} \;;\; \Sigma_{zz}^{-1}\Sigma_{zy}\big)$ and $\Sigma_* = \Sigma_{yy} - \Sigma_{yz}\Sigma_{zz}^{-1}\Sigma_{zy}$.
10.5 Let $Z = LY$. Then $D(\mathrm{vec}(Z)) = D((I\otimes L)\,\mathrm{vec}(Y)) = (I\otimes L)(\Sigma\otimes V)(I\otimes L)' = \Sigma\otimes LVL'$.
equivalent to the statistic $(n_1+n_2-k_1+1)(|R_H|-|R_0|)/[(k_1-1)|R_0|]$, whose null distribution is $F_{k_1-1,\,n_1+n_2-k_1+1}$.
10.11 Clubbing the last two columns of $Y$ with $X$, we test for the insignificance of the coefficients of $X$ using the GLRT statistic, which is 0.03555. The null distribution is Wilks' $\Lambda$ with parameters 2, 147 and 3. The p-value is 0.000.
Chapter 11
11.1 If $Ty$ is linearly sufficient, then the BLP of every element of $y$ is a function of $Ty$, irrespective of $\beta$. To prove the converse, note that $E[l'y|Ty] = l'E[y|Ty]$ for all $l$. If $E[y|Ty]$ is only a function of $Ty$ (free of $\beta$), then $E[l'y|Ty]$ is also a function of $Ty$, and $Ty$ must be a linearly sufficient statistic.
11.3
(a) If Ty is linearly minimal sufficient for /3, then according to Proposition 11.1.16, Ty and (y[ : y'2)' are linear functions of one another. Hence, it is enough to work with the latter statistic, (b) If Li and £2 are as in the proof of Proposition 11.1.16, then
(S)-(te)--(!!))
where \ii = LiX/3, i = 1,2. If ZA" is a rank-factorization of X and K/3 = 9, then (y,Z0,cr 2 V) is a reparametrization of the original model. Further,
The p(X) x p(X) matrix H is invertible, since
PW = P{L2Z)
=p{L2X)
= p(L!X) + p(L 2 X) = p(E7iA- 1/2 tf'X) + p(tf2.X) = p(A-1^U'X) + p(U2U'2X) = dim(C(V)nC(X)) + p((I - PV)X) = p(X). In the absence of any known constraint on the parameters, (y[ : y 2 )' is complete and sufficient for (/x^ : /x2)', according to Proposition 3.5.7. Therefore, it is complete and sufficient for 9, and consequently for /3. (c) If there is a known restriction on the parameters, then we can consider the decomposition of the response in the equivalent unrestricted model of Section 11.1.2, as per Proposition 11.1.16. Thus, any linearly minimal sufficient Ty is equivalent to {y'x : y 2 ) for the equivalent unrestricted model. The result of part (b) can then be used. 11.5 If Ty is a linear unbiased estimator of A/3 under the linear model (y,X13,a2V), then D(Ty)>a2T[V-V(I-Px){(I-Px)V-(I-Px)}-(I-Px)V]T' in the sense of the Lowner order. The bound is achieved by the BLUE, Ty. 11.7 Part (a) follows from the definitions of linearly ancillary statistic, linearly maximal ancillary and error space. In order to prove part (b), note that I'y is a linearly ancillary statistic and a function of Ty if and only iff e C(T')C\ET. On the other hand, from Remark 11.1.23, we find that I'y is almost surely equal to 0 if and only if / G C(L'4) — Es D £r. The statement follows from the definition of a linearly complete statistic. Part (c) follows from Proposition 11.1.12. According to Proposition 11.1.11 (Bahadur's theorem), no linear function of a linearly
Solutions to Odd-Numbered Exercises minimal sufficient statistic can be a linear ancillary. In other words, every linear function of such a statistic is almost surely equal to a BLUE or is identically 0. Thus, Ty is linearly minimal sufficient only if C(T') C £s. The other inclusion follows from part (c). This proves the necessity of the condition given in part (d). Sufficiency follows from parts (b) and (c).
11.9
(a) The error space of M-D is C{V(I — J^))- 1 , which coincides with the estimation space of Ai, as per Proposition 11.1.25. The estimation space of MD is
c{v{i - PV{I_PX)))^
= c(v-Hv(i - PX))) = c(x)^,
which coincides with the error space of M.. (b) This result follows from part (a). (c) Let X , = V(I - Px). Then V^X, = (I - Px). Hence, the vector of fitted values of M.D is X.iX'.V-'XJ-X'.V^y = V(I - PX)[(I - PX)V(I - PX)]-(I - Px)y, which is the residual vector in M.. (d) The dual of MD is (y, V(I - Px )8,a2V). C(V(I - Px )) = CiV'X,)1-
However,
= C(I - P x ) x = C(X).
Therefore, the dual of the dual model is only a reparametrization of the original model. 11.11
(a) The inequality E[(Sy - g(0))'B(Sy - g(9))] < E[(Ty - g{6))' B(Ty - g(0))] is preserved when B is replaced by F = B/b. (b) Let Ry = Ty + F{Sy - Ty). Using the fact that F2 < F, we have E[(Ry-g(0))'(Ry-g(e))] = E[(Ty-g(0))'(Ty-g(O))} +2E[(Ty-g(0))'F(Sy-Ty)} + E[(Sy~Ty)'F2(Sy-Ty)} < E[(Ty-g(0))'(Ty-g(0))} +E[(Sy-g(0)yF(Sy-g(0))] - E[{Ty-g{8))'F{Ty-g{0))] < E[(Ty-g(0)y(Ty-g(0))]. Thus, Ty is linearly inadmissible. (c) The result is proved by contradiction using part (b).
11.13 Let FF' be a rank-factorization of B and A = CX. We have E[(Ty-Ap)'B(Ty-AP)] = tv[F'E{(Ty - Ap)(Ty - Ap)'}F] = ahr[F'TVT'F] + tr[F'(T - C)X/3/3'X'{T - C)'F] <
*MF
P HP < a2tv[F'TVT'F] + a2\\F'(T - C)XH~X'{T - C)'F\\. The last step follows from Exercise 2.30(b). The first inequality holds with equality when the magnitude of K'P is a, where KK' is a rank-factorization of H. The second inequality holds with equality when K'P is the eigenvector corresponding to the largest eigenvalue of K~X'(T - C)'FF'(T - C)X(K-)'. 11.15 In this case we have from Proposition 11.2.11 and (11.2.4) 3M
= [l + /itr((X'V- 1 X)- 1 )](X'V- 1 X)- 1 X'V- 1 y,
pm = (x'v-'x + hiy'x'v-'y. In order that these are equal almost surely for all y, we must have (X'V^X + hI)PM = (X'V~lX + hl)pm, which simplifies to {X'V^X^X'V^y
= ^[{X'V-1
Xy^X'V^y
for almost all y, that is, ( X T 1 ! ) " 1 = t r ^ ' V " 1 * ) - 1 ] / . However, the ratio of the traces of the two matrices is p(X). 11.17
(a) When V is nonsingular, the Kuks-Olman estimator of XP can be written as Xpm = XHX'(V + XH'X'^y. However, XHX'
= XiX'V^X+H)-(X'V~1X + H)H-X' = X(X'V-lX + H)-X'V-lXH-X' +X(X'V~1X + H)~X'
= xix'v^x + Hyx'v-^xH-x' + v] Hence, Xpm = XiX'V^X + H)-X'V'ly. (b) When V and U are nonsingular, the BLE of Proposition 11.2.7 Proceeding as in part (a) with H is UX'(V + XUX')-ly. replaced by U~1 and all the g-inverses replaced by inverses, we have U~lX' = {X'V~lX + U-^X'V^lXUX' + V]. The result follows.
584 11.19
Solutions to Odd-Numbered Exercises (a) p'1^1 = p'0 - p'2f32, and p'/3 and p'2f32 are both estimable. (b) The result follows from the fact that a~2D(@2) is the lower right block of (X'X)-1, which simplifies to [X'2{I - Px )X2}~1. (c) Cov(p'P,P2) = <j2p'{X'X)-l(Q : / ) ' ; the result follows by using the block matric inversion formula. (d) According to (7.9.5), MSE&0) - MSEtffl,) = c'D-l[D 02(32)']D-lc, where D = D(J32) = a2[X'2{I - PXi)X2}~1 and c' = Cov(p'(3) = q'D (using parts (b) and (c)). The difference in nonnegative if and only if q'Dq > (q'/32)2, which simplifies to the stated condition. (e) If p2 is considered an additional observation in the multivariate linear model (X2, {I ® -X'1)vec(B), S ® I), and p[ is the corresponding row vector of explanatory variables, then the BLUP of p2 is p'l(X[Xi)~X'1X2. The corresponding prediction error is q'. The matrix X'2(I - Px )X2 is the matrix of error sums and products for the multivariate linear model.
11.21
(a) The result follows from the fact that R2 is a monotonically decreasing function of R2,, and the latter cannot decrease when a variable is dropped. (b) When the subset size is constant, all the three criteria are monotonic functions of cr|.
11.23 Let UAU1 be a spectral decomposition of X'V~1X. Arrange the eigenvalues of X'V~1X in the decreasing order, and partition the matrices U and A suitably so that
UAU' = (U, : U2) (AJ £)(%l)=
t/iAitfi +U2A2U'2,
where the diagonal elements of A2 are small. Thus we have
%c = ( A r l z r " l y ) ;
K = u^ = (tfiAr^ipr'v-y
This leads to the same difference of MSEs as given in page 505 and the same conditions for superiority of the principal components estimator.
11.25 $D_r = \sigma^2(X'X+rI)^{-1}X'X(X'X+rI)^{-1}$. If $\lambda_1\ge\cdots\ge\lambda_k$ are the eigenvalues of $X'X$, then $\sigma^2\lambda_i/(\lambda_i+r)^2$, $i=1,\dots,k$, are the eigenvalues of $D_r$. Each eigenvalue is a decreasing function of $r$. Hence, every eigenvalue of $D_{r_1}$ is strictly larger than the corresponding eigenvalue of $D_{r_2}$ when $r_1 < r_2$.
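A quick numerical illustration of this monotonicity; a sketch assuming Python with numpy, with a randomly generated $X$ (not part of the original solution):

```python
# Eigenvalues of D_r = sigma^2 (X'X + rI)^{-1} X'X (X'X + rI)^{-1} decrease as r increases.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
lam = np.linalg.eigvalsh(X.T @ X)        # eigenvalues of X'X

for r in [0.1, 1.0, 10.0]:
    d = lam / (lam + r) ** 2             # eigenvalues of D_r divided by sigma^2
    print(r, np.sort(d)[::-1])           # each entry shrinks as r grows
```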
11.27 The necessary and sufficient condition for all the parts is $\|X\beta\|^2 < \sigma^2$.
11.29 (a) s > [ft'A'[D(A0)]-A0 - 1]/{0'A'[D(A0)]-A0 + 1]. (b) a2 > /3'A'[a-2D(Ap)}-Af3. (c) s >JJ3'X'[D(X0))-X{3 - 1]/\J3'X'[D(X0)]-X0 + 1], where D(X0) = a2[V-V(I-Px){(I-Px)V(I-Px)}-(I-Px)V}. (d) The simplified conditions for the three parts are 0'A'[ff2A(X'X)-A']-A0 - 1 /3'A'[a2A{X'X)-A']-Ap + l' a2 > 0'A'(X'X)-A0, 8 > l\\X0\?l-\]li\\X0\\2l
>
Bibliography and Author Index
(Italicized numbers within parentheses indicate page numbers where the source is cited.) Abramowitz, M. and Stegun, LA. (1972) Handbook of Mathematical Functions, Dover, New York. (172) Aitken, A.C. (1935) On least squares and linear combination of observations. Proc. Roy. Soc. Edinburgh Sect. A 55, 42-48. (258) Alalouf, I.S. and Styan, G.P.H. (1979) Characterizations of estimability in the general linear model. Ann. Statist. 7, 194-200. (98) Albert, A. (1972) Regression and the Moore-Penrose pseudoinverse, Mathematics in Science and Engineering, 94, Academic, New York. (30) Anderson, T.W. (1971) The Statistical Analysis of Time Series, Wiley Series in Probability and Mathematical Statistics, Wiley, New York. (353) Arnold, S.F., (1981) The Theory of Linear Models and Multivariate Analysis, Wiley Series in Probability and Mathematical Statistics, Wiley, New York. (450, 462) Atkinson, A.C, (1987) Plots, Transformations, and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis, Oxford Statistical Science Series, Oxford University Press, Oxford. (16) Bahadur, R.R. (1957) On unbiased estimates of uniformly minimum variance. Sankhya Ser. A 18, 211-224. (67, 473) 587
Bailie, R.T. (1979) The asymptotic mean squared error of multistep prediction from the regression model with autoregressive errors. J. Amer. Statist. Assoc. 74, 175-184. (356) Baksalary, J.K. and Kala, R. (1981) Linear transformations preserving best linear unbiased estimators in a general Gauss-Markoff model. Ann. Statist. 9, 913-916. (469) Baksalary, J.K. and Markiewicz, A. (1988) Admissible linear estimators in the general Gauss-Markov model. J. Statist. Plann. Inference 19, 349-359. (487) Baksalary, J.K. and Markiewicz, A. (1990) Admissible linear estimators of an arbitrary vector of parametric functions in the general Gauss-Markov model. J. Statist. Plann. Inference 26, 161-171. (489) Baksalary, J.K., Markiewicz, A. and Rao, C.R. (1995) Admissible linear estimation in the general Gauss-Markov model with respect to an arbitrary quadratic risk function. J. Statist. Plann. Inference 44, 341-347. (489) Baksalary, J.K.and Mathew, T. (1990) Rank invariance criterion and its application to the unified theory of least squares. Linear Algebra Appl. 127, 393-401. (268) Baksalary, J.K.and Pordzik, P.R. (1989) Inverse-partitioned-matrix method for the general Gauss-Markov model with linear restrictions. J. Statist. Plann. Inference 23, 133-143. (278) Baksalary, J.K., Rao, C.R. and Markiewicz, A. (1992) A study of the influence of the "natural restrictions" on estimation problems in the singular Gauss-Markov model. J. Statist. Plann. Inference 31, 335-351. (251) Barnard, G.A. (1963) The logic of least squares. J. Roy. Statist. Soc. Ser. B 25, 124-127. (469) Bartlett, M.S. (1937a) Some examples of statistical methods of research in agriculture. J. Roy. Statist. Soc. Suppl. 4, 137-183. (219) Bartlett, M.S. (1937b) Properties of sufficiency and statistical tests. Proc. Roy. Soc. London Ser. A 160, 268-282. (234) Basu, D. (1958) On statistics independent of sufficient statistics. Sankhyd 20, 223-226. (67)
Bates, D.M. and Watts, D.G. (1988) Nonlinear Regression Analysis and its Applications, Wiley Series in Probability and Mathematical Statistics, Wiley, New York. (8) Bekir, E. (1988) A unified solution to the singular and nonsingular linear minimum-variance estimation problem. IEEE Trans. Automatic Control 33, 590-591. {246) Bellman, R. (1960) Introduction to Matrix Analysis, McGraw-Hill, New York. (46) Belsley, D.A. (1991) Conditioning Diagnostics: Collinearity and Weak Data in Regression, Wiley Series in Probability and Mathematical Statistics, Wiley, New York. (137) Belsley, D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley Series in Probability and Mathematical Statistics, Wiley, New York. (13, 16, 145, 405) Bhaumik, D. and Mathew, T., (2001) Optimal data augmentation for the estimation of a linear parametric function in linear models. Sankhya Ser. B 63, 10-26. (389) Bhimasankaram, P. and Jammalamadaka S.R. (1994a) Recursive estimation and testing in general linear models with applications to regression diagnostics. Tamkang J. Math. 25, 353-366. (403) Bhimasankaram, P. and Jammalamadaka S.R. (1994b) Updates of statistics in a general linear model: a statistical interpretation and applications. Comm. Statist. Simulation Comput. 23, 789-801. (371, 381, 403, 416) Bhimasankaram, P. and SahaRay, R. (1997) On a partitioned linear model and some associated reduced models. Linear Algebra Appl. 264, 329-339. (286) Bhimasankaram, P. and Sengupta, D. (1991) Testing for the mean vector of a multivariate normal distribution with a possibly singular dispersion matrix and related results. Statist. Probab. Lett. 11, 473-478. (447,460) Bhimasankaram, P. and Sengupta, D. (1996) The linear zero functions approach to linear models. Sankhya Ser. B 58, 338-351. (254, 527)
590
Bibliography and Author Index
Bhimasankaram, P., Sengupta, D. and Ramanathan, S. (1995) Recursive inference in a general linear model. Sankhyd Ser. A 57, 227-255. (372, 379, 404, 405) Bich, W. (1990) Variances, covariances and restraints in mass metrology. Metrologia 27, 111-116. (24S) Billingsley, P. (1985) Probability and Measure, second edition, Wiley, New York. (56) Bischoff, W. (1993) On £>-optimal designs for linear models under correlated observations with an application to a linear model with multiple response. J. Statist. Plann. Inference 37, 69-80. (315) Bloomfield, P. and Watson, G.S. (1975) The inefficiency of least squares. Biometrika 62, 121-128. (321) Boldin, M.V., Simonova, G.I. and Tyurin, Yu.N. (1997) Sign-based methods in linear statistical models (Translated from the Russian manuscript by D.M. Chibisov). Translations of Mathematical Monographs, 162, American Mathematical Society, Providence, RI. (15) Bose, N.K. and Rao, C.R. (1993) Handbook of Statistics, Vol. 10 (Signal Processing and its Applications), North-Holland, Amsterdam. (364) Bose R.C. (1949) Least Squares Aspects of Analysis of Variance, Inst. Stat. Mimeo. Ser. 9, Chapel Hill, NC. (x) Box, G.E.P. and Cox, D.R. (1964) An analysis of transformations. J. Roy. Statist. Soc. Ser. B 26, 211-246. (236) Box, G.E.P. and Draper, N.R. (1987) Empirical Model-Building and Response Surfaces, Wiley Series in Probability and Mathematical Statistics, Wiley, New York. (13) Brockwell, P.J. and Davis, R.A. (2002) Introduction to Time Series and Forecasting, second edition, Springer Texts in Statistics, Springer-Verlag, New York. (355) Broemeling, L.D. (1985) Bayesian Analysis of Linear Models, Statistics: Textbooks and Monographs, 60, Marcel Dekker, New York. (15) Brown, P.J. (1993) Measurement, Regression and Calibration, Oxford Statistical Science Series, Clarendon, Oxford. (13)
Bibliography and Author Index
591
Brown, R.L., Durbin, J. and Evans, J.M. (1975) Methods of investigating whether a regression relationship is constant over time (with discussion). / . Roy. Statist. Soc. Ser. B 37, 149-192. (375, 377) Brownlee, K.A., (1965) Statistical Theory and Methodology in Science and Engineering, second edition, Wiley, London. (139) Bunke, O. (1975) Minimax linear, ridge and shrunken estimators for linear parameters. Math. Operationsforsch. Statist. 6, 697701. (496) Bunke, H. and Bunke, O. (1974) Identifiability and estimability. Math. Operationsforsch. Statist. 5, 223-233. (99) Buser, S.A. (1977) Mean-variance portfolio selection with either a singular or nonsingular variance-covariance matrix. J. Financial Quant. Anal. 12, 347-361. (245) Carroll, R.J. (1982) Adapting for heteroscedasticity in linear models. Ann. Statist. 10, 1224-1233. (361) Carroll, R.J. and Ruppert, D. (1988) Transformation and Weighting in Regression, Chapman and Hall, New York. (361) Chambers, J.M., (1975) Updating methods for linear models for the addition or deletion of observations. In A Survey of Statistical Design and Linear Models, ed. J.N. Srivastava, North-Holland, Amsterdam, 53-65. (372) Chaubey, Y.P. (1982) Best minimum bias linear estimators in GaussMarkoff model. Comm. Statist. Theory Methods 11, 19591963. (512) Chen, J.H. and Shao, J. (1993) Iterative weighted least squares estimators. Ann. Statist. 21, 1071-1092. (360) Chib, S., Jammalamadaka, S. Rao, and Tiwari, R. (1987) Another look at some results on the recursive estimation in the general linear model. Amer. Statistician 41, 56-58. (403) Chipman, J.S. (1964) On least squares with insufficient observations. J. Amer. Statist. Assoc. 59, 1078-1111. (511) Chow, S.C. and Shao, J. (1991) Estimating drug shelf-life with random batches. Biometrics 47, 1071-1079. (365) Christensen, R. (1991) Linear Models for Multivariate, Time Series
592
Bibliography and Author Index
and Spatial Data, Springer-Verlag, New York. (332, 358, 463) Christensen, R. (1996) Plane Answers to Complex Questions: The Theory of Linear Models, second edition, Springer-Verlag, NewYork (third edition, 2002). (173, 220, 512) Cochran, W.G. (1957) Analysis of covariance: its nature and uses. Biometrics (special issue on Analysis of covariance) 13, 261281. (225) Cohen, A. (1966) All admissible linear estimates of the mean vector. Ann. Math. Statist. 37, 458-463. (488) Cook, R.D. and Weisberg, S. (1994) An Introduction to Regression Graphics, Wiley Series in Probability and Mathematical Statistics, Wiley, New York. (16) Cressie, N.A.C. (1993) Statistics for Spatial Data, Wiley Series in Probability and Mathematical Statistics, Wiley, New York. (357) Daniel, W.W. (1995) Biostatistics: A Foundation for Analysis in the Health Sciences, Wiley Series in Probability and Mathematical Statistics — Applied, Wiley, New York. (177) Dasgupta, A. and Das Gupta, S. (2000) Parametric identifiability and model-preserving constraints, Calcutta Statist. Assoc. Bull. 50, 207-221. (123, 161) Davidian, M. and Carroll, R.J. (1987) Variance function estimation. / . Amer. Statist. Assoc. 82, 1079-1091. (362) Davidson, R. and MacKinnon, J.G. (1993) Estimation and Inference in Econometrics, Oxford University Press, Oxford. (354) Dobson, A.J. (2001) An Introduction to Generalized Linear Models, second edition, Chapman and Hall, London. (9, 16) Dodge, Y. (1985) Analysis of Experiments with Missing Data, Wiley Series in Probability and Mathematical Statistics, Wiley, Chichester. (16) Drygas, H. (1983) Sufficiency and completeness in the general GaussMarkov model. Sankhyd Ser. A 45, 88-98. (469, 474, 476) Drygas, H. (1985) Minimax prediction in linear models. In Linear Statistical Inference (Poznari, 1984), eds. T. Caliriski and W. Klonecki, Lecture Notes in Statistics 35, Springer, Berlin-New York, 48-60. (499)
Bibliography and Author Index
593
Drygas, H. (1996) Spectral methods in linear minimax estimation. Ada Appl. Math. 43, 17-42. (500) Duncan, D.B. and Horn, S.D. (1972) Linear dynamic recursive estimation in the general linear model. J. Arner. Statist. Assoc. 67, 815-821. {391, 393) Eaton, M.L. (1985) The Gauss-Markov theorem in multivariate analysis. In Multivariate Analysis, Part VI, ed. P.R. Krishnaiah, North-Holland, Amsterdam, 177-201. (332) Efron, B. and Tibshirani, R. J. (1993) An Introduction to the Bootstrap, Chapman and Hall, London. (16, 151) Farebrother, R.W. (1988) Linear Least Squares Computations, Mercel Dekker, New York. (372) Fisher, R.A. (1926) The arrangement of field experiments. J. Ministry Agr. 33, 503-513; (included also in Contributions to Mathematical Statistics by R.A. Fisher, Wiley, New York, 1950). (192) Fisher, R.A. (1932) Statistical Methods for Research Workers, Oliver and Boyd, Edinburgh. (224) Fisher, R.A. (1936) The use of multiple measurements in taxonomic problems. Ann. Eugenics, 7, 179-188. (463) Fomby, T.B., Hill, R.C. and Johnson, S.R. (1984) Advanced Econometric Methods, Springer-Verlag, New York. (11) Fu, Y.L. and Tang, S.Y. (1993) Necessary and sufficient conditions that linear estimators of a mixed effects linear model are admissible under matrix loss function. Statistics 24, 303-309. (488) Fuller, W.A. and Rao, J.N.K. (1978) Estimation for a linear regression model with unknown diagonal covariance matrix. Ann. Statist. 6,1149-1158. (360) Gaffke, N. and Heiligers, B. (1989) Bayes, admissible and minimax linear estimators in linear models with restricted parameter space. Statistics 20, 487-508. (492) Galpin, J.S. and Hawkins, D.M. (1984) The use of recursive residuals in checking model fit in linear regression. Amer. Statistician 38, 94-105. (385) Gnot, S. (1983) Bayes estimation in linear models: a coordinate-free approach. J. Multivariate Anal. 13, 40-51. (492)
594
Bibliography and Author Index
Goldman, A.J. and Zelen, M. (1964) Weak generalized inverse and minimum variance unbiased estimation. J. Research Nat. Bureau of Standards 68B, 151-172. {272) Golub, G.H. and Van Loan, C.F. (1996) Matrix Computations, third edition, Johns Hopkins University Press, Baltimore, MD. (30, 33) Gragg, W.B., LeVeque, R.J. and Trangenstein, J.A. (1979) Numerically stable methods for updating regressions. J. Amer. Statist. Assoc. 74, 161-168. (372) Greenberg, B.G. (1953) The use of analysis of covariance and balancing in analytical surveys, Amer. J. Pub. Health 43, 692-699. (225) Gruber, M.H.J. (1990) Regression Estimators: A Comparative Study, Statistical Modeling and Decision Science, Academic, New York. (492, 496) Gruber, M.H.J. (1998) Improving Efficiency by Shrinkage: The JamesStein and Ridge Estimators, Marcel-Dekker, New York. (509) Hahn, G.J. and Hendrickson, R.W. (1971) A table of percentage points of the distribution of the largest absolute value of k student t variates and its applications. Biometrika 58, 323-332. (154) Hallum, C.R., Lewis, T.O. and Boullion, T.L. (1973) Estimation in the restricted general linear model with a positive semidefinite covariance matrix. Comm. Statist. 1, 157-166. (512) Hannan, E. (1970) Multiple Time Series, Wiley, New York. (316) Harter, H.L. (1960) Tables of range and studentized range. Ann. Math. Statist. 31, 1122-1147. (202) Harvey, A.C. and Phillips, D.A. (1979) Maximum likelihood estimation of regression models with autoregressive-moving average disturbances. Biometrika 66, 49-58. (354, 397, 425) Harville, D.A. (1981) Unbiased and minimum-variance unbiased estimation of estimable functions for fixed linear models with arbitrary covariance structure. Ann. Statist. 9, 633-637. (255) Haslett, J. (1999) A simple derivation of deletion diagnostic results for the general linear model with correlated errors. J. Roy. Statist. Soc. Ser. B 61, 603-609. (407) Haslett, S. (1985) Recursive estimation of the general linear model with
Bibliography and Author Index
595
with dependent errors and multiple additional observations. Austral. J. Statist. 27, 183-188. (371, 378, 380) Haslett, S. (1996) Updating linear models with dependent errors to include additional data and / or parameters. Linear Algebra Appl. 237/238, 329-349. {397) Hawkins, D.M. (1991) Diagnostics for use with regression recursive residuals. Technometrics 33, 221-234. (385) Haykin, S. (1991) Advances in Spectrum Analysis and Array Processing, Vol. II, Prentice-Hall, Englewood Cliffs, NJ. (364) Hedayat, A.S. and Majumdar, D. (1985) Combining experiments under Gauss-Markov models. J. Amer. Statist. Assoc. 80, 698-703. (361) Hettmansperger, T.P. and McKean, J.W. (1998) Robust Nonparametric Statistical Methods, Kendall's Library of Statistics, 5, Arnold, London and Wiley, New York. (15) Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (1991) Fundamentals of Exploratory Analysis of Variance, Wiley Series in Probability and Mathematical Statistics, Wiley, New York. (238) Hochberg, Y. and Tamhane, A.C. (1987) Multiple Comparison Procedures, Wiley, New York. (173, 202) Hocking, R.R. (1996) Methods and Applications of Linear Models, Wiley, New York. (173, 218, 224, 345, 503) Hoerl, A.E. and Kennard, R.W. (1970a) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55-67. (506) Hoerl, A.E. and Kennard, R.W. (1970b) Ridge regression: applications to nonorthogonal problems. Technometrics 12, 69-82. (506) Hoffman, K. (1996) A subclass of Bayes linear estimators that are minimax. Ada Appl. Math. 43, 87-95. (500) Hooper, P.M. (1993) Iterative weighted least squares estimation in heteroscedastic linear model. J. Amer. Statist. Assoc. 88, 179184. (360) Hotelling, H. (1951) A generalized T-test and measure of multivariate dispersion. In Proc. Second Berkeley Symp. Math. Statist. Prob., Univ. of California Press, Berkeley, 23-41. (450)
James, W. and Stein, C. (1961) Estimation with quadratic loss. In Proc. Fourth Berkeley Symp. Math. Statist. Prob., 1, Univ. of California Press, Berkeley, 361-379. (486)
Jammalamadaka, S.R. and Sengupta, D. (1999) Changes in the general linear model: a unified approach. Linear Algebra Appl. 289, 225-242. (371, 378, 402)
Jeyaratnam, S. (1982) A sufficient condition on the covariance matrix for F tests in linear models to be valid. Biometrika 69, 679-680. (315)
Judge, G.G., Griffiths, W.E., Hill, R.C. and Lee, T.C. (1980) The Theory and Practice of Econometrics, Wiley, New York. (508)
Judge, G.G. and Takayama, T. (1966) Inequality restrictions in regression analysis. J. Amer. Statist. Assoc. 61, 166-181. (282)
Kala, R. and Klaczynski, K. (1988) Recursive improvement of estimates in a Gauss-Markov model with linear restrictions. Canad. J. Statist. 16, 301-305. (417)
Kalman, R.E. (1960) A new approach to linear filtering and prediction problems. ASME Trans. J. Basic Engrg. 82-D, 35-45. (391)
Kalman, R.E. and Bucy, R.S. (1961) New results in linear filtering and prediction theory. ASME Trans. J. Basic Engrg. 83-D, 95-108. (391)
Kariya, T. (1980) Note on a condition for equality of sample variances in a linear model. J. Amer. Statist. Assoc. 75, 701-703. (314)
Kay, S.M. (1988) Modern Spectral Estimation: Theory and Application, Prentice-Hall, Englewood Cliffs, NJ. (363)
Kempthorne, O. (1952) The Design and Analysis of Experiments, Wiley, New York. (245)
Khuri, A.I. and Cornell, J.A. (1996) Response Surfaces: Designs and Analyses, second edition, Statistics: Textbooks and Monographs 152, Marcel Dekker, New York. (13)
Khuri, A.I., Mathew, T. and Sinha, B.K. (1998) Statistical Tests for Mixed Linear Models, Wiley Series in Probability and Statistics, Wiley, New York. (353)
Kianifard, F. and Swallow, W. (1996) A review of the development and application of recursive residuals in linear models. J. Amer. Statist. Assoc. 91, 391-400. (377, 385)
Klonecki, W. and Zontek, S. (1988) On the structure of admissible linear estimators. J. Multivariate Anal. 24, 11-30. (488)
Knott, M. (1975) On the minimum efficiency of least squares. Biometrika 62, 129-132. (321)
Koch, K.-R. (1999) Parameter Estimation and Hypothesis Testing in Linear Models, second edition, Springer-Verlag, Berlin. (364)
Kohn, R. and Ansley, C.F. (1983) Fixed interval estimation in state-space models when some of the data are missing or aggregated. Biometrika 70, 683-688. (246)
Kornacki, A. (1998) Stability of quadratically and linearly sufficient statistics in general Gauss-Markov model. Random Oper. Stochastic Equations 6, 51-56. (476)
Kourouklis, S. and Paige, C.C. (1981) A constrained least squares approach to the general Gauss-Markov linear model. J. Amer. Statist. Assoc. 76, 620-625. (272, 372)
Kramer, W. (1980) A note on the equality of ordinary least squares and Gauss-Markov estimates in the general linear model. Sankhyā Ser. A 42, 130-131. (311)
Kramer, W. and Donninger, C. (1987) Spatial autocorrelation among errors and the relative efficiency of OLS in the linear regression model. J. Amer. Statist. Assoc. 82, 577-579. (321, 356)
Kshirsagar, A.M. (1983) A Course in Linear Models, Marcel Dekker, New York. (202, 408)
Kuks, J. and Olman, V. (1971) Minimax linear estimation of regression coefficients (In Russian). Izv. Akad. Nauk Eston. SSR 20, 480-482. (496)
Kuks, J. and Olman, V. (1972) Minimax linear estimation of regression coefficients II (In Russian). Izv. Akad. Nauk Eston. SSR 21, 66-72. (494, 496)
LaMotte, L.R. (1978) Bayes linear estimators. Technometrics 20, 281-290. (492)
Lauter, H. (1975) A minimax linear estimator for linear parameters under restrictions in form of inequalities. Math. Operationsforsch. Statist. Ser. Statist. 6, 689-696. (499)
Lawley, D.N. (1938) A generalization of Fisher's Z-test. Biometrika 30, 180-187. (450)
Lehmann, E.L. (1986) Testing Statistical Hypotheses, second edition, Wiley, New York. (85, 86, 163)
Lehmann, E.L. and Casella, G. (1998) Theory of Point Estimation, Springer Texts in Statistics, Springer-Verlag, New York. (69)
Lewis, T.O. and Odell, P.L. (1966) A generalization of the Gauss-Markov theorem. J. Amer. Statist. Assoc. 61, 1063-1066. (510)
Li, Z.-H. and Begg, C.B. (1994) Random effects models for combining results from controlled and uncontrolled studies in a meta-analysis. J. Amer. Statist. Assoc. 89, 1523-1527. (369)
Liew, C.K. (1976) Inequality constrained least-squares estimation. J. Amer. Statist. Assoc. 71, 746-751. (282, 284)
Lin, C.T. (1993) Necessary and sufficient conditions for the least square estimator to be the best estimator in a general Gauss-Markov model. J. Math. Res. Exposition 13, 433-436. (311)
Liu, A. (1996) Estimation of the parameters in two linear models with only some of the parameter vectors identical. Statist. Probab. Lett. 29, 369-375. (361)
Lovell, M.C. and Prescott, E. (1970) Multiple regression with inequality constraints: pretesting bias, hypothesis testing and efficiency. J. Amer. Statist. Assoc. 65, 913-925. (284)
Lunn, A.D. and McNeil, D.R. (1991) Computer-Interactive Data Analysis, Wiley, Chichester. (466)
Marcus, M. and Minc, H. (1988) Introduction to Linear Algebra (Reprint of the 1969 edition), Dover Books on Advanced Mathematics, Dover, New York. (44)
Mathew, T. (1983) Linear estimation with an incorrect dispersion matrix in linear models with a common linear part. J. Amer. Statist. Assoc. 78, 468-471. (311)
Mathew, T. (1985) On inference in a general linear model with an incorrect dispersion matrix. In Linear Statistical Inference (Poznań, 1984), eds. T. Caliński and W. Klonecki, Lecture Notes in Statistics 35, Springer, Berlin-New York, 200-210. (311)
Mathew, T. and Bhimasankaram, P. (1983a) On the robustness of the LRT with respect to specification errors in a linear model. Sankhyā Ser. A 45, 212-225. (315)
Mathew, T. and Bhimasankaram, P. (1983b) On the robustness of the LRT in singular linear models. Sankhyā Ser. A 45, 301-312. (311, 315)
Mathew, T., Rao, C.R. and Sinha, B.K. (1984) Admissible linear estimation in singular linear models. Comm. Statist. Theory Methods 13, 3033-3045. (488)
Mathew, T., Sinha, B.K. and Zhou, L. (1993) Some statistical procedures for combining independent tests. J. Amer. Statist. Assoc. 88, 912-919. (361)
Mayer, L.S. and Wilke, T.A. (1973) On biased estimation in linear models. Technometrics 15, 497-508. (509)
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, second edition, Monographs on Statistics and Applied Probability, Chapman and Hall, London. (8, 16)
McGilchrist, C.A. and Sandland, R.L. (1979) Recursive estimation of the general linear model with dependent errors. J. Roy. Statist. Soc. Ser. B 41, 65-68. (371, 377, 380)
McGilchrist, C.A., Sandland, R.L. and Hennessy, J.L. (1983) Generalized inverses used in recursive estimation of the general linear model. Austral. J. Statist. 25, 321-328. (385)
Miller, R.G. (1981) Simultaneous Statistical Inference, second edition, Springer Series in Statistics, Springer-Verlag, New York. (155, 157, 181, 202)
Mitra, S.K. (1971) Another look at Rao's MINQUE of variance components. Int. Statist. Inst. Bull. 44(2), 279-283. (347)
Mitra, S.K. and Bhimasankaram, P. (1971) Generalized inverses of partitioned matrices and recalculation of least squares estimators for data and model changes. Sankhyā Ser. A 33, 395-410. (371, 374)
Müller, J., Rao, C.R. and Sinha, B.K. (1984) Inference on parameters in a linear model: a review of recent results. In Experimental Design, Statistical Models, and Genetic Statistics, ed. K. Hinkelmann, Statistics 50, Dekker, New York, 277-295. (474)
Müller, J. (1987) Sufficiency and completeness in the linear model. J. Multivariate Anal. 21, 312-323. (476)
Nanayakkara, N. and Cressie, N.A.C. (1991) Robustness to unequal scale and other departures from the classical linear model. In Directions in Robust Statistics and Diagnostics, Part II, eds. W. Stahel and S. Weisberg, IMA Volumes in Mathematics and its Applications 34, Springer, New York, 65-113. (525)
Neuwirth, E. (1985) Sensitivity of linear models with respect to the covariance matrix. In Linear Statistical Inference (Poznań, 1984), eds. T. Caliński and W. Klonecki, Lecture Notes in Statistics 35, Springer, Berlin-New York, 223-230. (306)
Nieto, F.H. and Guerrero, V.M. (1995) Kalman filter for singular and conditional state-space models when the system state and the observational error are correlated. Statist. Probab. Lett. 22, 303-310. (397)
Nordstrom, K. (1985) On a decomposition of the singular Gauss-Markov model. In Linear Statistical Inference (Poznań, 1984), eds. T. Caliński and W. Klonecki, Lecture Notes in Statistics 35, Springer, Berlin-New York, 231-245. (482, 521)
Okamoto, M. (1973) Distinctness of the eigenvalues of a quadratic form in a multivariate sample. Ann. Statist. 1, 763-765. (439)
Oktaba, W., Kornacki, A. and Wawrzosek, J. (1988) Invariant linearly sufficient transformations of the general Gauss-Markoff model: Estimation and testing. Scand. J. Statist. 15, 117-124. (476)
Olea, R.A. (1999) Geostatistics for Engineers and Earth Scientists, Kluwer Academic Publishers, Boston. (357)
Ord, K. (1975) Estimation methods for models of spatial interaction. J. Amer. Statist. Assoc. 70, 120-126. (356)
Peixoto, J.L. (1986) Testable hypotheses in singular fixed linear models. Comm. Statist. Theory Methods 15, 1957-1973. (167)
Pillai, K.C.S. (1955) Some new test criteria in multivariate analysis. Ann. Math. Statist. 26, 117-121. (450)
Pilz, J. (1986) Minimax linear regression estimation with symmetric parameter restrictions. J. Statist. Plann. Inference 13, 297-318. (499)
Plackett, R.L. (1950) Some theorems in least squares. Biometrika 37, 149-157. (371, 374)
Pordzik, P.R. (1992a) A lemma on g-inverse of the bordered matrix and its application to recursive estimation in the restricted model. Comput. Statist. 7, 31-37. (379)
Pordzik, P.R. (1992b) Adjusting of estimates in general linear model with respect to linear restrictions. Statist. Probab. Lett. 15, 125-130. (417)
Press, S.J. (1971) Applied Multivariate Analysis, Holt, Rinehart and Winston, New York. (13)
Puntanen, S. (1987) On the relative goodness of ordinary least squares estimation in the general linear model (Ph.D. dissertation), Acta Univ. Tamper. Ser. A 216, University of Tampere, Finland. (321)
Puntanen, S. (1997) Some further results related to reduced singular linear models. Comm. Statist. Theory Methods 26, 375-385. (287)
Puntanen, S. and Styan, G.P.H. (1989) The equality of the ordinary least squares and the best linear unbiased estimator (with discussion). Amer. Statistician 43, 153-163. (311)
Rao, A.R. and Bhimasankaram, P. (1992) Linear Algebra, Tata McGraw-Hill, New Delhi (second edition, 2000, Hindustan Book Agency, New Delhi). (27, 31)
Rao, C.R. (1967) Least squares theory using an estimated dispersion matrix and its application to measurement of signals. In Proc. Fifth Berkeley Sympos. Math. Statist. and Probability, Vol. I: Statistics, Berkeley, Calif., 1965, eds. L.M. Le Cam and J. Neyman, Univ. of California Press, Berkeley, Calif., 355-372. (311, 325)
Rao, C.R. (1973a) Representations of best linear unbiased estimators in the Gauss-Markoff model with a singular dispersion matrix. J. Multivariate Anal. 3, 276-292. (249, 255)
Rao, C.R. (1973b) Unified theory of least squares. Comm. Statist. Theory Methods 1, 1-8. (264)
Rao, C.R. (1973c) Linear Statistical Inference and its Applications, second edition, Wiley Series in Probability and Mathematical Statistics, Wiley, New York. (30, 269, 447, 463, 511)
Rao, C.R. (1974) Projectors, generalized inverses and the BLUE's. J. Roy. Statist. Soc. Ser. B 36, 442-448. (518)
Rao, C.R. (1976) Estimation of parameters in a linear model. Ann. Statist. 4, 1023-1037. Correction (1979) 7, 696. (488)
Rao, C.R. (1978) Least squares theory for possibly singular models. Canad. J. Statist. 6, 19-23. (278)
Rao, C.R. (1979) Estimation of parameters in the singular Gauss-Markoff model. Comm. Statist. Theory Methods 8, 1353-1358. (255)
Rao, C.R. and Kleffe, J. (1988) Estimation of Variance Components and Applications, North-Holland Series in Statistics and Probability 3, North-Holland, Amsterdam. (335, 346, 353)
Rao, C.R. and Mitra, S.K. (1971) Generalized Inverse of Matrices and its Applications, Wiley, New York. (28, 48, 53)
Rao, C.R., Mitra, S.K. and Bhimasankaram, P. (1972) Determination of a matrix by its subclasses of generalized inverses. Sankhyā Ser. A 34, 5-8. (53)
Rao, C.R. and Toutenburg, H. (1999) Linear Models: Least Squares and Alternatives, Springer Series in Statistics, Springer-Verlag, New York. (15, 279)
Rao, P.S.R.S. (1997) Variance Components Estimation: Mixed Models, Methodologies and Applications, Monographs on Statistics and Applied Probability 78, Chapman and Hall, London. (342)
Rencher, A.C. (2000) Linear Models in Statistics, Wiley Series in Probability and Statistics, Wiley, New York. (406)
Rousseeuw, P.J. and Leroy, A.M. (1987) Robust Regression and Outlier Detection, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, Wiley, New York. (15)
Rowley, J.C.R. (1977) Singularities in econometric models of wage determination based on time series data. In ASA Proceedings of Business and Economic Statistics Section, 616-621. (245)
Roy, S.N. (1953) On a heuristic method of test construction and its use in multivariate analysis. Ann. Math. Statist. 24, 220-238. (448)
Ryan, T.P. (1997) Modern Regression Methods, Wiley Series in Probability and Statistics, Wiley, New York. (16, 406)
Schaffrin, B. (1999) Softly unbiased estimation I: the Gauss-Markov model. Linear Algebra Appl. 289, 285-296. (512)
Schall, R. and Dunne, T.T. (1988) A unified approach to outliers in the general linear model. Sankhyā Ser. B 50, 157-167. (423)
Scheffe, H. (1959) The Analysis of Variance, Wiley, New York. (212)
Schervish, M.J. (1995) Theory of Statistics, Springer Series in Statistics, Springer-Verlag, New York. (67, 68, 78)
Schonfeld, P. and Werner, H.-J. (1987) A note on C. R. Rao's wider definition BLUE in the general Gauss-Markov model. Sankhyā Ser. B 49, 1-8. (255)
Scott, A.J., Rao, J.N.K. and Thomas, D.R. (1990) Weighted least-squares and quasi-likelihood estimation for categorical data under singular models. Linear Algebra Appl. 127, 427-447. (245)
Searle, S.R. (1987) Linear Models for Unbalanced Data, Wiley, New York. (218)
Searle, S.R. (1994) Extending some results and proofs for the singular linear model. Linear Algebra Appl. 210, 139-151. (254)
Searle, S.R., Casella, G. and McCulloch, C.E. (1992) Variance Components, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, Wiley, New York. (345)
Seber, G.A.F. and Wild, C.J. (1989) Nonlinear Regression, Wiley Series in Probability and Mathematical Statistics, Wiley, New York. (8)
Sen, P.K. and Singer, J.M. (1993) Large Sample Methods in Statistics: An Introduction with Applications, Chapman and Hall, New York. (521, 525)
Sengupta, D. (1995) Optimal choice of a new observation in a linear model. Sankhyā Ser. A 57, 137-153. (245, 387, 388, 389)
Sengupta, D. and Bhimasankaram, P. (1997) On the roles of observations in collinearity in the linear model. J. Amer. Statist. Assoc. 92, 1024-1032. (137)
Shah, K.R. and Deo, Sheela S. (1991) Missing plot technique in linear models. Comm. Statist. Theory Methods 20, 3239-3252. (409)
Shaked, U. and Soroka, E. (1987) A simple solution to the singular linear minimum-variance estimation problem. IEEE Trans. Automatic Control 32, 81-84. (246)
Shang, S.F. and Zhang, L. (1993) Linear sufficiency in the general Gauss-Markov model with restrictions on parameter space. Northeast. Math. J. 9, 235-240. (476)
Shao, J. and Tu, D. (1995) The Jackknife and Bootstrap, Springer Series in Statistics, Springer, New York. (16, 151)
Shinozaki, N. (1975) A Study of Generalized Inverse of Matrix and Estimation with Quadratic Loss, Ph.D. dissertation, Keio University, Japan. (488)
Sidak, Z. (1968) On multivariate normal probabilities of rectangles. Ann. Math. Statist. 39, 1425-1434. (154)
Stahlecker, P. and Lauterbach, J. (1989) Approximate linear minimax estimation in regression analysis with ellipsoidal constraints. Comm. Statist. Theory Methods 18, 2755-2784. (499)
Stepniak, C. (1989) Admissible linear estimators in mixed linear models. J. Multivariate Anal. 31, 90-106. (488)
Strand, O.N. (1974) Coefficient errors caused by using the wrong covariance matrix in the general linear model. Ann. Statist. 2, 935-949. (306)
Stulajter, F. (1990) Robustness of the best linear unbiased estimator and predictor in linear regression models. Apl. Mat. 35, 162-168. (306)
Styan, G.P.H. (1973) When does least squares give the best linear unbiased estimate? In Multivariate Statistical Inference (Proc. Res. Sem., Dalhousie Univ., Halifax, N.S., 1972), eds. D.G. Kabe and R.P. Gupta, North-Holland, Amsterdam, 241-246. (311)
Subrahmanyam, M. (1972) A property of simple least squares estimates. Sankhyā Ser. B 34, 355-356. (140)
Swindel, B.F. (1968) On the bias of some least-squares estimators of variance in a general linear model. Biometrika 55, 313-316. (322)
Tilke, C. (1993) The relative efficiency of OLS in the linear regression model with spatially autocorrelated errors. Statist. Papers 34, 263-270. (321)
Titterington, D.M. and Sedransk, J. (1986) Matching and linear regression adjustment in imputation and observational studies. Sankhyā Ser. B 48, 347-367. (13)
Toutenburg, H. (1982) Prior Information in Linear Models, Wiley Series in Probability and Mathematical Statistics, Wiley, New York. (500)
Toyooka, Y. (1982) Prediction error in a linear model with estimated parameters. Biometrika 69, 453-459. (353)
Tukey, J.W. (1949) One degree of freedom for non-additivity. Biometrics 5, 232-242. (209)
Ullah, A., Srivastava, V.K., Magee, L. and Srivastava, A. (1983) Estimation of linear regression model with autocorrelated disturbances. J. Time Ser. Anal. 4, 127-135. (332)
Valliant, R., Dorfman, A.H. and Royall, R.M. (2000) Finite Population Sampling and Inference: A Prediction Approach, Wiley, New York. (297)
van der Genugten, B.B. (1991) Iterated weighted least squares in heteroskedastic linear models. Statistics 22, 495-516. (362)
von Rosen, D. (1990) A matrix formula for testing linear hypotheses in linear models. Linear Algebra Appl. 127, 457-461. (167)
Wang, S.-G. and Chow, S.-C. (1994) Advanced Linear Models: Theory and Applications, Marcel Dekker, New York. (316, 331)
Watson, G.S. (1967) Linear least squares regression. Ann. Math. Statist. 38, 1679-1699. (321)
Werner, H.J. (1990) On inequality constrained generalized least-squares estimation. Linear Algebra Appl. 127, 379-392. (282, 284)
Werner, H.J. and Yapar, C. (1996) On inequality constrained generalized least squares selections in the general possibly singular Gauss-Markov model: a projector theoretical approach. Linear Algebra Appl., Special issue honouring C.R. Rao, 237/238, 359-393. (282)
Williams, E.J. (1959) Regression Analysis, Wiley, New York. (241)
Wu, Q.-G. (1992) On admissibility of estimators for parameters in linear models. In The Development of Statistics: Recent Contributions from China, eds. X.R. Chen, K.T. Fang and C. Yang, Pitman Research Notes in Mathematics Series 258, Longman Sci. Tech., Harlow, co-published in the US with Wiley, New York, 179-198. (489)
Yang, W.L., Cui, H.J. and Sun, G.W. (1987) On best linear unbiased estimation in the restricted general linear model. Statistics 18, 17-20. (301)
Zhou, L. and Mathew, T. (1993) Combining independent tests in linear models. J. Amer. Statist. Assoc. 88, 650-655. (361)
Zimmerman, D.L. and Cressie, N.A.C. (1992) Mean squared prediction error in the spatial linear model with estimated covariance parameters. Ann. Inst. Statist. Math. 44, 27-43. (358)
Zinde-Walsh, V. and Galbraith, J.W. (1991) Estimation of a linear regression model with stationary ARMA(p, q) errors. J. Econometrics 47, 333-357. (354)
Zyskind, G. (1967) On canonical forms, non-negative covariance matrices and best and simple least squares linear estimators in linear models. Ann. Math. Statist. 38, 1092-1109. (311)
Zyskind, G. (1975) Error structure, projections and conditional inverses in linear model theory. In A Survey of Statistical Design and Linear Models, ed. J.N. Srivastava, North-Holland, Amsterdam. (245)
Index
F-distribution, 59 noncentral, 60 t-distribution, 59 noncentral, 60 p-value, 83
A-optimality, 193 abbreviations, xix accelerated failure time, 7 acceptance region, 82 added variable plot, 422, 427 adjusted R2, 530 admissible estimator, 79 admissible linear estimator (ALE), 486, 527, 528 affine estimator, 138, 488 Bayes, 489 affine function, 62 Aitken estimator, 258 alternative hypothesis, 82 analysis of covariance (ANCOVA), 224, 423, 462 adjustment for covariates, 229 balanced two-way classified data, 231 estimation of parameters, 226 multivariate, 462
tests of hypotheses, 228, 232 uses of, 224 analysis of variance (ANOVA), 168, 191 p-way classified data, 220, 239 in general linear model, 289 latin square design, 239 multivariate, 461 nested model, 220, 223, 240 one-way classified data, 194, 201 two-way classified data, 202 multiple observations per cell, 215 single observation per cell, 208 with limited interaction, 211 with missing observations, 218 ancillary statistic, 66, 67, 141 linear, 470 linearly maximal, 470 asymptotic normality, 522 autoregressive (AR) model, 10, 355 autoregressive moving average (ARMA) model, 10, 354 average risk, 78
Bahadur's result, 67
608 linear version, 473 baseline effect, 195 basis of a vector space, 32 orthogonal, 33 orthonormal, 33 basis set of BLUEs, 117, 476 standardized, 117, 479, 482 basis set of LZFs, 114, 478 standardized, 114, 479 Basu's theorem, 67 linear version, 471 Bayes affine estimator, 489, 499 Bayes estimator, 78-81 Bayes linear estimator (BLE), 489 homogeneous, 489 Bayes risk, 79 best linear estimator, 510 best linear minimum bias estimator (BLIMBE), 510 best linear predictor (BLP), 62, 64, 88, 174, 470 in general linear model, 291, 525 in multivariate linear model, 88 best linear unbiased estimator (BLUE), 93, 102, 472 additivity, 139 as function of fitted values, 109 basis set of, 117 dispersion, 111 distribution, 147 existence, 103 generating set of, 141 geometric view, 512 in general linear model, 251, 257, 265 in multivariate linear model, 431, 432 large sample properties, 521 LZF turning into, 218 relation with LSE, 106
Index relation with LZF, 102 standardized basis set of, 141 uniqueness, 104 best linear unbiased predictor (BLUP), 175 in general linear model, 291 in mixed effects model, 352 in multivariate linear model, 457 best quadratic unbiased estimator, 351 bias of an estimator, 70 block, 203 block effect, 203 blocking, 203 Bonferroni confidence interval, 153 Bonferroni inequality, 152, 454 bound Cramer-Rao, 77 on efficiency of LSE, 316 on estimated variance of LSE, 322 on prediction error variance, 385 on type I error probability, 83 calibration, 13 Cauchy-Schwarz inequality, 39 centered data, 253, 256, 260 Central Limit Theorem, 4 change point, 19 Chebyshev's inequality, 247 chemical balance with bias, 144 chi-square distribution, 59 noncentral, 60 Cobb-Douglas model, 19, 142, 186 coefficient of determination, 170 collinearity, 134, 501 condition number, 145 consequences of , 182 exact, 138 variance inflation factor, 137, 144 variance proportions table, 145
Index column rank, 26 column space, 36 of a dispersion matrix, 56 column vector, 25 complete sufficient statistic, 67, 141 for exponential family, 69 linear, 471 completely randomized design (CRD), 195 component of a vector, 39 composite hypothesis, 82 concomitant variable, 224 condition number, 145 conditional mean, 65 confidence band for regression surface, 156, 186 confidence interval for a single linear model parameter, 148 for error variance, 185 for ratio of LPFs, 185, 186 simultaneous, 151 confidence region (set), 85 in general linear model, 290 in linear model, 148, 150 in multivariate linear model, 457 link with testing, 86 consistency of linear equations, 48 of multivariate linear model, 442 of singular linear model, 246 consistent linear unbiased estimator (CLUE), 512 constrained least squares approach, 271 contrast, 143, 196 convergence in distribution, 521 Cook's squared distance, 405 covariance adjustment, 55, 87, 103, 254, 265, 281, 373, 398, 419, 444
609 geometric view, 57 covariate, 224 Cramer-Rao lower bound, 77 for general linear model parameters, 273 for linear model parameters, 128 non-normal case, 133 credible set, 85, 87 critical value, 84 CUSUM plot, 385 D-optimal design, 315 data adipose tissue, 178, 185, 186, 190 air speed experiment, 234 blood samples in ELISA test for HIV, 239 brown trout hemoglobin, 233 compressive strength of hoop trees, 241 Fisher's Iris, 464, 465 Olympic sprint time, 467 stack loss, 140 survival times of poisoned animals, 236 world population, 18, 101, 111, 116, 145, 157, 181, 187 world record times, 17, 139, 187, 189 data exclusion and variable inclusion, 423 degrees of freedom, 118 error, 118, 260 deleted residual, 384 deletion diagnostics, 404 design augmentation, 385 design matrix, 192 designed experiment, 191, 461 determinant of a matrix, 44 DFBETAS, 405
610 DFFITS, 406 diagnostics for linear model, 16 added variable plot, 422 Cook's squared distance, 405 COVRATIO, 426 CUSUM plot, 385 detection of model inadequacy, 384 DFBETAS, 405 DFFITS, 406 Durbin-Watson statistic, 369 fitted values, 108 hat-matrix, 108 leverage, 108, 141 residual, 108, 112, 142 standardized residual, 383 studentized residual, 384 variance inflation factor, 137, 144 variance proportions table, 145 diagonal matrix, 25 dimension of a vector space, 32 direction of arrival estimation, 364 dispersion matrix, 5, 55 singular, 243 distribution F, 59 chi-square, 59 exponential family, 69 multivariate t, 154, 202 multivariate normal, 4, 58 of quadratic forms, 60 prior, 78 singular normal, 58 spherically symmetric, 108, 139 student's t, 59 univariate normal, 58 Wilks' A, 447 Wishert, 447 dual model, 527 Durbin-Watson statistic, 369
Index dynamic model, 353 efficiency, 78 of LSE, 315 of LUE, 299 error approximation, 8 due to ignored factors, 1 measurement, 1 multiplicative and additive, 19 observational, 1 error function (see linear zero function), 96 error space, 109, 483, 527 characterization, 484 error sum of squares, 115 in general linear model, 259 error variance MLE of, 262 natural unbiased estimator of, 116, 259, 260 REML estimator of, 366 UMVUE, 116, 263 errors in variables, 20 estimability of a parameter characterization, 97, 98 in general linear model, 248, 250 in linear model, 97 in multivariate linear model, 432 in variance components model, 333 relation with identifiability, 99 with nuisance parameters, 126, 284 estimation space, 109, 483, 527 characterization, 484 estimator, 70 admissible, 79 affine, 488 Bayes, 79-81
Index least squares, 100 linear, 94, 488 maximum likelihood, 74 minimax, 81 UMVU, 72 unbiased, 70 uniformly minimum variance unbiased, 72, 90 exclusion of observations, 397 exclusion of variables, 410 expansion estimator of population total, 296 expected value, 5, 55 experimental unit, 192 explanatory variable, 2 exponential family, 69 factor, 2 factorization theorem, 68 finite population sampling, 245, 294, 298, 304 Fisher-Cochran theorem, 60 fitted values in general linear model, 253 in linear model, 108 in multivariate linear model, 434 Frobenius norm, 28 full column rank, 27 full rank matrix, 27 Gauss-Markov Theorem, 104, 106, 109 converse, 107 general linear model, 5, 243 Aitken estimator, 258 analysis of variance, 289 arising from multivariate normal distribution, 64 best linear unbiased estimator (BLUE), 251, 257
611 dispersion of, 255 methods for obtaining, 265 canonical decomposition, 479 checking for consistency, 246 confidence region for parameters, 290 constrained parameter space, 249 effect of linear restrictions, 275 error sum of squares, 259 error variance estimation, 258 estimability and identifiability, 248 fitted values, 253 inclusion/exclusion of observations or variables (see updates in linear model), 371 inequality constraints, 282 information matrix for parameters, 272 linear restrictions, 278 maximum likelihood estimation, 261 nuisance parameters, 284 prediction through, 291 reduced normal equation, 286 residuals, 253 dispersion of, 255 stochastic restriction, 279 tests of hypotheses, 287 updates (see updates in linear model), 371 uses of, 244 virtually linear estimators in, 255 weighted least squares estimator (WLSE), 263 generalized inverse (g-inverse) of a matrix, 28 generalized likelihood ratio test (GLRT), 84
in general linear model, 288 in linear model, 166 in multivariate linear model, 445 power of, 171 generalized linear model, 8, 16 generating set of BLUEs, 117, 475, 482 generating set of LZFs, 113 normalized, 436 gradient vector, 49, 74 Gram-Schmidt orthogonalization, 33 growth model, 462
information inequality, 77 information matrix, 76 in general linear model, 272 in linear model, 128 information of single LPF, 128 inner product of vectors, 27 intra-class correlation structure, 308 inverse of a matrix, 28 inverse partitioned matrix (IPM) approach, 268, 379, 387, 404 inversion formulae of matrices, 30 invertible matrix, 29
hat-matrix, 108 Henderson's Method III, 343 Hessian matrix, 49, 74, 75 heteroscedasticity, 358 systematic, 361 homogeneous linear estimator, 488 homoscedastic linear model, 5 honestly significant difference (HSD), 202 Hotelling's T 2 , 451 hypothesis, 82 general linear (see linear hypothesis), 158
James-Stein estimator, 509 Jeffreys' prior, 90
idempotent matrix, 34 identifiability of a parameter in general linear model, 248, 251 in linear model, 97, 99 in multivariate linear model, 432 in variance components model, 333 relation with estimability, 99 identity matrix, 25 inclusion of observations, 372 inclusion of variables, 417 inequality constraint, 282 inference, 66
Kalman filter, 389 Kronecker product, 26, 30, 54, 65, 204, 430 Kuks-Olman estimator, 496, 529 Lowner order, 45, 87, 102, 488, 501, 507, 510 lack of fit, 188 latin square design, 239 Lawley-Hotelling test, 450 least favourable prior, 81, 493, 499, 529 least squares estimator (LSE), 100 efficiency of, 315, 321 large sample properties, 521 left-inverse of a matrix, 28 Lehmann-Scheffe theorem, 73 level of significance, 83 leverage, 108, 141, 399 likelihood equation, 74 likelihood function, 74 likelihood ratio test (LRT), 84 linear equations, 47 linear estimation, 94
Index linear hypothesis, 158 decomposition into testable and untestable parts, 160 decomposition of sum of squares, 164 generalized likelihood ratio test, 166, 288 nested, 173 single degree of freedom, 162 testability, 158 linear minimum bias estimator (LIMBE), 511 linear model, 1 analysis of variance, 168 arising from multivariate normal distribution, 4, 62 assumptions, 5 Bayesian methods, 15 best linear unbiased estimator (BLUE), 102 calibration through, 13 canonical decomposition, 113 change point in, 19 collinearity in, 134, 138, 182 complete sufficient statistic in, 141 confidence band for regression surface, 156 confidence region for parameters, 148 diagnostics (see diagnostics for linear model), 16 distribution of estimators, 147 dual, 527 error space, 109, 483 error variance estimation (see error variance), 113 estimability and identifiability, 97 estimation in, 93 estimation space, 109, 483 fitted values and residuals, 108
613 general (see general linear model), 243 generalized, 8, 16 homoscedastic, 5, 93 in broader sense, 9 information matrix for parameters, 128 least squares estimator (LSE), 100 linear restriction in, 120 matrix-vector form, 5 maximum likelihood estimator (MLE), 107 missing data in, 16 mixed effects (see mixed effects model), 10 multivariate (see multivariate linear model), 429 nonlinear methods of inference, 15 notations, 4 nuisance parameters in, 126 obtained through linearization, 6 prediction through, 12, 174 reasons for choosing, 3 reparametrization, 118, 142 resampling methods, 15 robust methods, 15 singular (see general linear model), 122, 243 tests of linear hypotheses, 158 transformation of variables, 16 UMVUE, 104, 116 uses of, 11 linear parametric function (LPF), 94 admissible linear estimator, 486 Bayes linear estimator, 489 best linear estimator, 510 best linear minimum bias estimator, 510 best linear unbiased estimator, 102
614 confidence interval, 148 consistent linear unbiased estimator, 512 estimability and identifiability, 97 inequality constraint on, 284 least squares estimator, 100 maximum likelihood estimator, 107 minimax linear estimator, 492 principal components estimator, 503 ridge estimator, 506 shrinkage estimator, 508 subset estimator, 500 test of significance, 162 UMVUE, 104 linear prediction, 10 linear regression, 6, 62 arising from multivariate normal distribution, 62, 64 linear restriction, 120 in general linear model, 275, 278 in multivariate linear model, 442 leading to singular model, 244 sequential, 417 linear unbiased estimator (LUE), 94 characterization, 95 efficiency, 299 in general linear model, 248, 249 in multivariate linear model, 432 linear zero function (LZF), 14, 93, 94 as function of residuals, 109 basis set of, 113, 258 BLUE turning into, 142, 164, 200, 207, 218, 443 characterization, 95 gained from exclusion of variables, 413 gained from inclusion of
Index observations, 375 generating set of, 141, 165 in general linear model, 248, 249 in multivariate linear model, 432 lost from exclusion of observations, 400 lost from inclusion of variables, 419 normalized, 435, 466 standardized basis set of, 141 linearity in explanatory variables, 7 in parameters, 7 linearization, 7 of Cobb-Douglas model, 19 of nonlinear regression model, 8 of polynomial regression model, 7 linearly ancillary statistic, 470 characterization, 485 linearly complete statistic, 471 characterization, 485 linearly independent vectors, 26 linearly maximal ancillary, 470 characterization, 485 linearly sufficient statistic, 470, 525 characterization, 485 invariant, 476 link function, 8 logistic regression, 19 loss function, 70 convex, 71 quadratic, 80, 486, 527 squared error, 71, 80, 486 Mahalanobis distance, 461 Mallows' Cp, 503, 531 matrix, 23 blocks of, 26 column rank of, 26 column space of, 36
Index decompositions, 40 determinant of, 44 diagonal, 25 dispersion, 55 full column rank, 27 full rank, 27 full row rank, 27 generalized inverse (g-inverse) of, 28 idempotent, 34 identity, 25 inverse of, 28 inversion formulae, 30 invertible, 29 Kronecker product, 26 Lowner order, 45 left-inverse of, 28 long vector form, 28 Moore-Penrose inverse of, 29 multiplication, 24 negative definite, 27 nonnegative definite, 27 nonsingular, 27, 29 norm of, 51 notations, xxi order of, 23 orthogonal, 29 orthogonal projection matrix for column space of, 38 parallel sum, 53 positive definite, 27 positive semidefinite, 27 rank deficient, 27 rank of, 27, 36, 40 rank-factorization, 40 right-inverse of, 28 row rank of, 26 row space of, 36 semi-orthogonal, 30 singular, 27
615 singular value decomposition (SVD), 40 spectral decomposition, 42 symmetric, 25 spectrum of, 42 trace of, 25 transpose of, 25 variance-covariance, 55 maximal ancillary statistic, 66 linear, 470 maximum likelihood estimation in general linear model, 261 in general linear model with unknown dispersion, 326 in linear model, 107 in mixed effects model, 336 in multivariate linear model, 439 maximum likelihood estimator (MLE), 74 maximum modulus-^ confidence interval, 154 mean squared error (MSE), 71 mean squared error matrix, 71, 278 mean squared error of prediction (MSEP), 174, 293 matrix, 88 met a-analysis, 358, 369 minimal sufficient statistic, 66 linear, 470 minimax estimator, 81 minimax linear estimator (MILE), 492, 528, 529 minimum mean squared error, 62 minimum norm quadratic estimator (MINQE), 346 minimum norm quadratic unbiased estimator (MINQUE), 345, 347 minimum variance quadratic unbiased estimator (MIVQUE), 351
616 missing data, 218, 226, 407 missing plot substitution, 218, 407 misspecified error dispersion matrix, 306 effect on estimated variance of LSEs, 322 efficiency of LSE, 315 when inference is exact, 308 mixed effects model, 4, 10, 311, 332, 365 ANOVA methods, 342 application to signal processing, 364 best linear unbiased predictor (BLUP), 352 best quadratic unbiased estimator, 351 Henderson's method III, 343 identifiability and estimability, 333 minimum norm quadratic estimator (MINQE), 346 minimum norm quadratic unbiased estimator (MINQUE), 345, 347 minimum variance quadratic unbiased estimator (MIVQUE), 351 ML and REML methods, 336 testing of hypothesis, 353 model accelerated failure time, 7 analysis of covariance (see analysis of covariance), 224 autoregressive (AR), 10, 355 autoregressive moving average (ARMA), 10, 354 Cobb-Douglas, 19 conditional, 6 dynamic, 353
Index errors in variables, 20 general linear (see general linear model), 243 generalized linear, 8, 16 growth, 462 linear (see linear model), 1 linear regression, 6 logistic regression, 19 mixed effects (see mixed effects model), 10, 332 multivariate linear [see multivariate linear model), 429 nested, 220 nonlinear regression, 7 one-way classification {see one-way classified data), 194 piecewise linear, 18 polynomial regression, 7 random effects {see mixed effects model), 332 response surface, 20 seemingly unrelated regression (SUR), 310 simultaneous equations, 11 singular linear (see general linear model), 243 state-space, 4, 10 statistical, 1 time series, 4 two-way classification (see two-way classified data), 202 variance components, 333 model building, 422 model-preserving constraint, 122 monotone likelihood ratio (MLR), 83 Moore-Penrose inverse, 29 formula for, 30, 43
Index most powerful test, 83 multicollinearity (see collinearity), 134 multiple comparisons, 172 in general linear model, 288 in multivariate linear model, 454 multiple correlation coefficient, 63 sample, 170, 530 multivariate ANOVA (MANOVA), 461 multivariate linear model, 429 arising from multivariate normal distribution, 465 best linear unbiased estimator (BLUE), 431 dispersion of, 434 confidence regions, 457 consistency of linear restrictions, 442 effect of linear restrictions, 442 error dispersion estimation, 435, 440, 441 error sum of squares and products, 437 estimability and identifiability, 432 fitted values, 434 maximum likelihood estimation, 439 model description, 430 multivariate ANOVA (MANOVA), 461 normalized LZF, 435, 466 one-sample problem, 460 prediction through, 457 residuals, 434 dispersion of, 434 tests of hypotheses, 445 two-sample problem, 460 multivariate normal distribution, 58
617 conditional, 59 leading to general linear model, 64 singular case, 58 negative definite matrix, 27 nested hypotheses, 173 nested model, 220, 240, 364 Neyman-Pearson lemma, 83 noncentrality parameter, 60 nonlinear regression, 7, 8 nonnegative definite matrix, 27 nonsingular matrix, 27 norm of a matrix, 51 Frobenius, 28 norm of a vector, 27 normal equation, 100 reduced, 226 normalized LZF (NLZF), 435, 466 BLUE turning into, 443 nuisance parameter, 126 in designed experiment, 192 in general linear model, 284 reduced normal equation, 286 leading to singular model, 245 null hypothesis, 82 oblique projector, 516, 532 observational studies, 225 observations, 4 independent, 6 omitted variable, 417 one-sample problem, 460 one-way classified data, 194 analysis of variance, 199, 201 estimation of parameters, 196 with between-groups heterogeneity, 308 optimal design, 193 optimization of quadratic forms and functions, 48
618 order of a matrix, 23 orthogonal complement of a vector space, 32 orthogonal decomposition of a vector, 34 orthogonal matrix, 29 orthogonal projection matrix, 34 orthogonal vector spaces, 32 orthogonal vectors, 32 parallel sum of matrices, 53 parameter, 2 partial regression plot, 422 piecewise linear model, 18 Pillai's test, 450 point estimation, 70 polynomial regression, 7 positive definite matrix, 27 positive semidefinite matrix, 27 posterior distribution, 86 power of a test, 82 prediction through general linear model, 291 through linear model, 174 through multivariate linear model, 457 prediction interval, 176 simultaneous, 179 prediction matrix, 108 predictor, 2 best, 174 best linear, 62, 64, 174 best linear unbiased, 175, 291, 457 principal components estimator, 503, 510, 531 principle of substitution, 106 prior distribution, 78 profile analysis, 453, 466 projection matrix, 34 formula for, 35, 43
Index oblique, 516 orthogonal, 34 projection of a vector, 34 protected least significant difference (PLSD), 201 pure error, 188 quadratic form, 27 distribution of, 60 optimzation of, 48 quadratic loss function, 80 quadratically sufficient statistic, 476 random effects model (see mixed effects model), 332 random vector, 55 randomized block design (RBD), 203 rank of a matrix, 27 rank-deficient matrix, 27 rank-factorization, 40, 53 Rao-Blackwell theorem, 71 linear version, 471 ratio estimator of population total, 297 recursive group residual, 378 recursive residual, 375, 377 reduced normal equation, 286 regression, 61 coefficients, 6 confidence band for surface, 156 linear, 6, 62 logistic, 19 nonlinear, 7, 8 parameters, 6 polynomial, 6 regression diagnostics (see diagnostics for linear model), 16 regression estimator of population total, 296 regression line
Index confidence band for, 157, 186 equality of, 187 paralelity of, 187 regressor, 2 rejection region, 82 reparametrization, 118, 142 general form, 120 residual, 108, 142 deleted, 384 dispersion of, 111 in general linear model, 253 in multivariate linear model, 434 recursive, 375, 377 standardized, 383 studentized, 384 variance and covariance, 112 residual sum of squares, 115 residual sum of squares and products, 440 response surface, 13, 20, 186 response variable, 1, 4 restricted/residual maximum likelihood (REML) estimator, 329, 366 in multivariate linear model, 441 ridge estimator, 506, 510, 531 right-inverse of a matrix, 28 risk function, 71 row rank, 26 row vector, 25 row-column design, 220, 239 Roy's union-intersection test, 448 saturated model, 208 scalar, 25 Scheffe confidence interval, 153 seemingly unrelated regression (SUR) model, 310 semi-orthogonal matrix, 30 semivariogram, 357
619 sequential linear restrictions, 417 serial correlation, 353, 369 shrinkage estimator, 508, 532 side-condition, 198, 214 signal detection, 16, 362 simple hypothesis, 82 simple linear regression, 101 simultaneous confidence intervals, 151 Bonferroni, 153 honestly significant difference, 202 maximum modulus-t, 154 protected least significant difference, 201 Scheffe, 153 simultaneous equations model, 11 singular dispersion matrix, 56, 243 singular linear model (see general linear model), 243 uses of, 244 singular matrix, 27 singular value decomposition (SVD), 40 size of a test, 83 small area estimation, 304 Snell's law, 2 solution of linear equations, 47 spatial correlation, 356, 365 spectral decomposition, 42 spectrum of a symmetric matrix, 42 spherically symmetric distribution, 108, 139 spring balance, 129, 138, 232 spring balance with bias, 138, 143 SRSWOR, 296, 298 standardized basis set of BLUEs, 117, 477-479, 482 standardized basis set of LZFs, 114, 258, 479 normalized, 436
620 standardized residual, 383, 405 state-space model, 10, 390 leading to singular linear model, 246 statistic ancillary, 66, 67 complete, 67 complete sufficient, 67, 141 linearly ancillary, 470 linearly complete, 471 linearly maximal ancillary, 470 linearly minimal sufficient, 470 linearly sufficient, 470 maximal ancillary, 66 minimal sufficient, 66 sufficient, 66, 68 stochastic restriction, 279 strongly consistent, 524 student's ^-distribution, 59 noncentral, 60 studentized residual, 384, 406 subset estimator, 500, 529, 530 criteria for selection, 530, 531 sufficient statistic, 66, 68, 71 conditional, 88 linear, 470 linearly minimal, 470 sum of squares, 113 between-groups, 200 corrected for covariates, 231 decomposition of, 117, 164, 167, 192, 261 due to block difference, 207 due to deviation from hypothesis, 169, 210, 235 due to treatment difference, 206 error, 115, 192 lack of fit, 188 pure error, 188 regression, 169, 192
Index residual, 115 total, 169, 192, 200 within-group, 200 sum of squares and products, 230 error, 437 supplementary variable, 224 symmetric matrix, 25 test of hypothesis, 82 in general linear model, 287 in linear model, 158 in multivariate linear model, 445 time series model, 4 autoregressive (AR), 9, 355 autoregressive moving average (ARMA), 10, 354 state-space, 10, 390 tolerance interval, 179 trace of a matrix, 25 translation invariance, 328 transpose of a matrix, 25 treatment, 192, 203 treatment contrast, 196, 204 treatment effect, 195 Tukey's one-degree of freedom test, 211, 235 two-sample problem, 460 two-way classified data, 96, 202 example of BLUE, 105 example of confidence interval, 149 example of Cramer-Rao lower bound, 130 example of eliminating nuisance parameters, 127 example of elliptical confidence region, 151 example of identifying testable part of hypothesis, 161 example of linear zero function, 96
Index example of non-estimable parameter, 98 example of non-reparametrizing linear restriction, 124 example of non-testable hypothesis, 159 example of non-unique substitution estimator, 109 example of reparametrization, 119 example of reparametrizing linear restriction, 120, 123 example of variance of BLUE, 112 exmple of non-reparametrizing linear restriction, 121 interaction in, 207 multiple observations per cell balanced case, 212 unbalanced case, 216 single observation per cell, 203 treatment contrast, 143 with interaction and betweengroups heterogeneity, 309 with missing observations, 218 type I and Type II errors, 82 UMVUE, 72, 141 linear analogue, 104, 472 uniquesness, 73 unbalanced data, 216, 423 unbiased confidence region, 85 unbiased test, 83 unified theory of least squares estimation, 266 uniformly most accurate (UMA) confidence region, 85 unbiased (UMAU), 86 uniformly most powerful (UMP) test, 83-86 unbiased (UMPU), 84, 86, 163 unit vector, 27
621 unknown error dispersion matrix, 305, 324 MLE of unknown parameters, 326 REML estimator, 328 special case of variance components, 332 special cases with correlated error, 353 special cases with uncorrelated error, 358 two-stage estimator, 330 use of prior information, 324 update equations for exclusion of observations, 402 for exclusion of variables, 415 for inclusion of observations, 378 for inclusion of variables, 421 updates in linear model, 371 data exclusion and variable inclusion, 423 exclusion of a single observation, 398 exclusion of a single variable, 412 exclusion of observations, 397 application to deletion diagnostics, 404 application to missing plot substitution, 407 LZF lost, 400 update equations, 402 exclusion of variables, 410 application to model building, 417 LZF gained, 413 update equations, 415 inclusion of a single observation, 372 inclusion of a single variable, 418 inclusion of observations, 372 application to design
augmentation, 385 application to Kalman filter, 389 application to model diagnostics, 383 LZF gained, 375 update equations, 378 inclusion of variables, 417 application to model building, 422 LZF lost, 419 update equations, 421
variable criterion, 1 dependent, 1 endogenous, 1 exogenous, 2 explanatory, 2, 4 controlled, 6 random, 6 independent, 2 response, 1, 4 variance, 6
variance components model, 333 application to signal processing, 364
variance inflation factor, 137, 144
variance proportions table, 145
variogram, 357
vector, 25 column, 25 column space of, 39 component of, 39 inner product, 27 linearly dependent, 26 linearly independent, 26 norm of, 27 notation, 25 order of, 25 orthogonal decomposition of, 34 orthogonal projection matrix for column space of, 39 orthogonality, 32 random, 55 row, 25
vector space, 31 basis of, 32 dimension of, 32 intersection, 32 orthogonal basis of, 33 orthogonal complement, 32 orthogonal projection matrix, 34 orthogonality, 32 orthonormal basis, 33 projection matrix, 34 sum, 32 virtually disjoint, 32
virtually linear estimators, 255
weighing design chemical balance, 144 spring balance, 129, 138, 143, 232
weighted least squares estimator (WLSE), 263
Wilks' Λ statistic, 446 distribution, 447
Wishart distribution, 447