FOURTH EDITION
Applied Multivariate Statistical Analysis
RICHARD A. JOHNSON University of Wisconsin-Madison
DEAN W. WICHERN Texas A&M University
PRENTICE HALL, Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data
Johnson, Richard Arnold.
Applied multivariate statistical analysis / Richard A. Johnson, Dean W. Wichern. -- 4th ed.
p. cm.
Includes bibliographical references and indexes.
ISBN 0-13-834194-X
1. Multivariate analysis. I. Wichern, Dean W. II. Title.
QA278.J63 1998
519.5'35--dc21   97-42907   CIP
Acquisitions Editor: ANN HEATH
Marketing Manager: MELODY MARCUS
Editorial Assistant: MINDY McCLARD
Editorial Director: TIM BOZIK
Editor-in-Chief: JEROME GRANT
Assistant Vice-President of Production and Manufacturing: DAVID W. RICCARDI
Editorial/Production Supervision: RICHARD DeLORENZO
Managing Editor: LINDA MIHATOV BEHRENS
Executive Managing Editor: KATHLEEN SCHIAPARELLI
Manufacturing Buyer: ALAN FISCHER
Manufacturing Manager: TRUDY PISCIOTTI
Marketing Assistant: PATRICK MURPHY
Director of Creative Services: PAULA MAYLAHN
Art Director: JAYNE CONTE
Cover Designer: BRUCE KENSELAAR

© 1998 by Prentice-Hall, Inc.
Simon & Schuster / A Viacom Company
Upper Saddle River, NJ 07458
All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

ISBN 0-13-834194-X
Prentice-Hall International (UK) Limited, London Prentice-Hall of Australia Pty. Limited, Sydney Prentice-Hall Canada Inc., Toronto Prentice-Hall Hispanoamericana, S.A., Mexico Prentice-Hall of India Private Limited, New Delhi Prentice-Hall of Japan, Inc., Tokyo Simon & Schuster Asia Pte. Ltd., Singapore Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
To the memory of my mother and my father. R. A. J.

To Dorothy, Michael, and Andrew. D. W. W.
Contents

PREFACE

PART I  Getting Started

1  ASPECTS OF MULTIVARIATE ANALYSIS
   1.1 Introduction
   1.2 Applications of Multivariate Techniques
   1.3 The Organization of Data
   1.4 Data Displays and Pictorial Representations
   1.5 Distance
   1.6 Final Comments
   Exercises
   References

2  MATRIX ALGEBRA AND RANDOM VECTORS
   2.1 Introduction
   2.2 Some Basics of Matrix and Vector Algebra
   2.3 Positive Definite Matrices
   2.4 A Square-Root Matrix
   2.5 Random Vectors and Matrices
   2.6 Mean Vectors and Covariance Matrices
   2.7 Matrix Inequalities and Maximization
   Supplement 2A  Vectors and Matrices: Basic Concepts
   Exercises
   References

3  SAMPLE GEOMETRY AND RANDOM SAMPLING
   3.1 Introduction
   3.2 The Geometry of the Sample
   3.3 Random Samples and the Expected Values of the Sample Mean and Covariance Matrix
   3.4 Generalized Variance
   3.5 Sample Mean, Covariance, and Correlation as Matrix Operations
   3.6 Sample Values of Linear Combinations of Variables
   Exercises
   References

4  THE MULTIVARIATE NORMAL DISTRIBUTION
   4.1 Introduction
   4.2 The Multivariate Normal Density and Its Properties
   4.3 Sampling from a Multivariate Normal Distribution and Maximum Likelihood Estimation
   4.4 The Sampling Distribution of X̄ and S
   4.5 Large-Sample Behavior of X̄ and S
   4.6 Assessing the Assumption of Normality
   4.7 Detecting Outliers and Data Cleaning
   4.8 Transformations to Near Normality
   Exercises
   References

PART II  Inferences About Multivariate Means and Linear Models

5  INFERENCES ABOUT A MEAN VECTOR
   5.1 Introduction
   5.2 The Plausibility of μ0 as a Value for a Normal Population Mean
   5.3 Hotelling's T² and Likelihood Ratio Tests
   5.4 Confidence Regions and Simultaneous Comparisons of Component Means
   5.5 Large Sample Inferences about a Population Mean Vector
   5.6 Multivariate Quality Control Charts
   5.7 Inferences about Mean Vectors When Some Observations Are Missing
   5.8 Difficulties Due to Time Dependence in Multivariate Observations
   Supplement 5A  Simultaneous Confidence Intervals and Ellipses as Shadows of the p-Dimensional Ellipsoids
   Exercises
   References

6  COMPARISONS OF SEVERAL MULTIVARIATE MEANS
   6.1 Introduction
   6.2 Paired Comparisons and a Repeated Measures Design
   6.3 Comparing Mean Vectors from Two Populations
   6.4 Comparing Several Multivariate Population Means (One-Way MANOVA)
   6.5 Simultaneous Confidence Intervals for Treatment Effects
   6.6 Two-Way Multivariate Analysis of Variance
   6.7 Profile Analysis
   6.8 Repeated Measures Designs and Growth Curves
   6.9 Perspectives and a Strategy for Analyzing Multivariate Models
   Exercises
   References

7  MULTIVARIATE LINEAR REGRESSION MODELS
   7.1 Introduction
   7.2 The Classical Linear Regression Model
   7.3 Least Squares Estimation
   7.4 Inferences About the Regression Model
   7.5 Inferences from the Estimated Regression Function
   7.6 Model Checking and Other Aspects of Regression
   7.7 Multivariate Multiple Regression
   7.8 The Concept of Linear Regression
   7.9 Comparing the Two Formulations of the Regression Model
   7.10 Multiple Regression Models with Time Dependent Errors
   Supplement 7A  The Distribution of the Likelihood Ratio for the Multivariate Multiple Regression Model
   Exercises
   References

PART III  Analysis of Covariance Structure

8  PRINCIPAL COMPONENTS
   8.1 Introduction
   8.2 Population Principal Components
   8.3 Summarizing Sample Variation by Principal Components
   8.4 Graphing the Principal Components
   8.5 Large Sample Inferences
   8.6 Monitoring Quality with Principal Components
   Supplement 8A  The Geometry of the Sample Principal Component Approximation
   Exercises
   References

9  FACTOR ANALYSIS AND INFERENCE FOR STRUCTURED COVARIANCE MATRICES
   9.1 Introduction
   9.2 The Orthogonal Factor Model
   9.3 Methods of Estimation
   9.4 Factor Rotation
   9.5 Factor Scores
   9.6 Perspectives and a Strategy for Factor Analysis
   9.7 Structural Equation Models
   Supplement 9A  Some Computational Details for Maximum Likelihood Estimation
   Exercises
   References

10  CANONICAL CORRELATION ANALYSIS
   10.1 Introduction
   10.2 Canonical Variates and Canonical Correlations
   10.3 Interpreting the Population Canonical Variables
   10.4 The Sample Canonical Variates and Sample Canonical Correlations
   10.5 Additional Sample Descriptive Measures
   10.6 Large Sample Inferences
   Exercises
   References

PART IV  Classification and Grouping Techniques

11  DISCRIMINATION AND CLASSIFICATION
   11.1 Introduction
   11.2 Separation and Classification for Two Populations
   11.3 Classification with Two Multivariate Normal Populations
   11.4 Evaluating Classification Functions
   11.5 Fisher's Discriminant Function-Separation of Populations
   11.6 Classification with Several Populations
   11.7 Fisher's Method for Discriminating among Several Populations
   11.8 Final Comments
   Exercises
   References

12  CLUSTERING, DISTANCE METHODS AND ORDINATION
   12.1 Introduction
   12.2 Similarity Measures
   12.3 Hierarchical Clustering Methods
   12.4 Nonhierarchical Clustering Methods
   12.5 Multidimensional Scaling
   12.6 Correspondence Analysis
   12.7 Biplots for Viewing Sampling Units and Variables
   12.8 Procrustes Analysis: A Method for Comparing Configurations
   Exercises
   References

APPENDIX
   Table 1  Standard Normal Probabilities
   Table 2  Student's t-Distribution Percentage Points
   Table 3  χ² Distribution Percentage Points
   Table 4  F-Distribution Percentage Points (α = .10)
   Table 5  F-Distribution Percentage Points (α = .05)
   Table 6  F-Distribution Percentage Points (α = .01)

DATA INDEX

SUBJECT INDEX
Preface
INTENDED AUDIENCE
This book originally grew out of our lecture notes for an "Applied Multivariate Analysis" course offered jointly by the Statistics Department and the School of Business at the University of Wisconsin-Madison. Applied Multivariate Statistical Analysis, Fourth Edition, is concerned with statistical methods for describing and analyzing multivariate data. Data analysis, while interesting with one variable, becomes truly fascinating and challenging when several variables are involved. Researchers in the biological, physical, and social sciences frequently collect measurements on several variables. Modern computer packages readily provide the numerical results to rather complex statistical analyses. We have tried to provide readers with the supporting knowledge necessary for making proper interpretations, selecting appropriate techniques, and understanding their strengths and weaknesses. We hope our discussions will meet the needs of experimental scientists, in a wide variety of subject matter areas, as a readable introduction to the statistical analysis of multivariate observations.
LEVEL
Our aim is to present the concepts and methods of multivariate analysis at a level that is readily understandable by readers who have taken two or more statistics courses. We emphasize the applications of multivariate methods and, consequently, have attempted to make the mathematics as palatable as possible. We avoid the use of calculus. On the other hand, the concepts of a matrix and of matrix manipulations are important. We do not assume the reader is familiar with matrix algebra. Rather, we introduce matrices as they appear naturally in our discussions, and we then show how they simplify the presentation of multivariate models and techniques.

The introductory account of matrix algebra, in Chapter 2, highlights the more important matrix algebra results as they apply to multivariate analysis. The Chapter 2 supplement provides a summary of matrix algebra results for those with little or no previous exposure to the subject. This supplementary material helps make the book self-contained and is used to complete proofs. The proofs may be ignored on the first reading. In this way we hope to make the book accessible to a wide audience.

In our attempt to make the study of multivariate analysis appealing to a large audience of both practitioners and theoreticians, we have had to sacrifice a consistency of level. Some sections are harder than others. In particular, we have summarized a voluminous amount of material on regression in Chapter 7. The resulting presentation is rather succinct and difficult the first time through. We hope instructors will be able to compensate for the unevenness in level by judiciously choosing those sections, and subsections, appropriate for their students and by toning them down if necessary.
ORGANIZATION AND APPROACH
The methodological "tools" of multivariate analysis are contained in Chapters 5 through 12. These chapters represent the heart of the book, but they cannot be assimilated without much of the material in the introductory Chapters 1 through 4. Even those readers with a good knowledge of matrix algebra or those willing to accept the mathematical results on faith should, at the very least, peruse Chapter 3, Sample Geometry, and Chapter 4, Multivariate Normal Distribution.

Our approach in the methodological chapters is to keep the discussion direct and uncluttered. Typically, we start with a formulation of the population models, delineate the corresponding sample results, and liberally illustrate everything with examples. The examples are of two types: those that are simple and whose calculations can be easily done by hand, and those that rely on real-world data and computer software. These will provide an opportunity to: (1) duplicate our analyses, (2) carry out the analyses dictated by exercises, or (3) analyze the data using methods other than the ones we have used or suggested.

The division of the methodological chapters (5 through 12) into three units allows instructors some flexibility in tailoring a course to their needs. Possible sequences for a one-semester (two-quarter) course are indicated schematically.
[Diagram: two possible course sequences. Each begins with Getting Started (Chapters 1-4). One sequence continues with Inferences About Means (Chapters 5-7) and Analysis of Covariance Structure (Chapters 8-10); the other includes Analysis of Covariance Structure (Chapters 8-10) and Classification and Grouping (Chapters 11 and 12).]
Each instructor will undoubtedly omit certain sections from some chapters to cover a broader collection of topics than is indicated by these two choices.

For most students, we would suggest a quick pass through the first four chapters (concentrating primarily on the material in Chapter 1, Sections 2.1, 2.2, 2.3, 2.5, 2.6, and 3.6, and the "assessing normality" material in Chapter 4) followed by a selection of methodological topics. For example, one might discuss the comparison of mean vectors, principal components, factor analysis, discriminant analysis, and clustering. The discussions could feature the many "worked out" examples included in these sections of the text. Instructors may rely on diagrams and verbal descriptions to teach the corresponding theoretical developments. If the students have uniformly strong mathematical backgrounds, much of the book can successfully be covered in one term.

We have found individual data-analysis projects useful for integrating material from several of the methods chapters. Here, our rather complete treatments of MANOVA, regression analysis, factor analysis, canonical correlation, discriminant analysis, and so forth are helpful, even though they may not be specifically covered in lectures.

CHANGES TO THE FOURTH EDITION

New Material. Users of the previous editions will notice that we have added and updated some examples and exercises, and have expanded the discussions of viewing multivariate data, generalized variance, assessing normality and transformations to normality, simultaneous confidence intervals, repeated measures designs, and cluster analysis. We have also added a number of new sections, including: Detecting Outliers and Data Cleaning (Ch. 4); Multivariate Quality Control Charts, Difficulties Due to Time Dependence in Multivariate Observations (Ch. 5); Repeated Measures Designs and Growth Curves (Ch. 6); Multiple Regression Models with Time Dependent Errors (Ch. 7); Monitoring Quality with Principal Components (Ch. 8); Correspondence Analysis (Ch. 12); Biplots (Ch. 12); and Procrustes Analysis (Ch. 12). We have worked to improve the exposition throughout the text, and have expanded the t-table in the appendix.

Data Disk. Recognizing the importance of modern statistical packages in the analysis of multivariate data, we have added numerous real-data sets. The full data sets used in the book are saved as ASCII files on the Data Disk, which is packaged with each copy of the book. This format will allow easy interface with existing statistical software packages and provide more convenient hands-on data analysis opportunities.

Instructor's Solutions Manual. An Instructor's Solutions Manual (ISBN 0-13-834202-4) containing complete solutions to most of the exercises in the book is available free upon adoption from Prentice Hall.
For information on additional for-sale supplements that may be used with the book or additional titles of interest, please visit the Prentice Hall web site at www.prenhall.com.

ACKNOWLEDGMENTS
We thank our many colleagues who helped improve the applied aspect of the book by contributing their own data sets for examples and exercises. A number of individuals helped guide this revision, and we are grateful for their suggestions: Steve Coad, University of Michigan; Richard Kiltie, University of Florida; Sam Kotz, George Mason University; Shyamal Peddada, University of Virginia; K. Sivakumar, University of Illinois at Chicago; Eric Smith, Virginia Tech; and Stanley Wasserman, University of Illinois at Urbana-Champaign. We also acknowledge the feedback of the students we have taught these past 25 years in our applied multivariate analysis courses. Their comments and suggestions are largely responsible for the present iteration of this work. We would also like to give special thanks to Wai Kwong Cheang for his help with the calculations for many of the new examples.

We must thank Deborah Smith for her valuable work on the Data Disk and Solutions Manual, Steve Verrill for computing assistance throughout, and Alison Pollack for implementing a Chernoff faces program. We are indebted to Cliff Gilman for his assistance with the multi-dimensional scaling examples discussed in Chapter 12. Jacquelyn Forer did most of the typing of the original draft manuscript, and we appreciate her expertise and willingness to endure the cajoling of authors faced with publication deadlines. Finally, we would like to thank Ann Heath, Mindy McClard, Richard DeLorenzo, Brian Baker, Linda Behrens, Alan Fischer, and the rest of the Prentice Hall staff for their help with this project.

R. A. Johnson
D. W. Wichern
CHAPTER 1

Aspects of Multivariate Analysis

1.1 INTRODUCTION
Scientific inquiry is an iterative learning process. Objectives pertaining to the explanation of a social or physical phenomenon must be specified and then tested by gathering and analyzing data. In turn, an analysis of the data gathered by experimentation or observation will usually suggest a modified explanation of the phenomenon. Throughout this iterative learning process, variables are often added or deleted from the study. Thus, the complexities of most phenomena require an investigator to collect observations on many different variables. This book is concerned with statistical methods designed to elicit information from these kinds of data sets. Because the data include simultaneous measurements on many variables, this body of methodology is called multivariate analysis.

The need to understand the relationships between many variables makes multivariate analysis an inherently difficult subject. Often, the human mind is overwhelmed by the sheer bulk of the data. Additionally, more mathematics is required to derive multivariate statistical techniques for making inferences than in a univariate setting. We have chosen to provide explanations based upon algebraic concepts and to avoid the derivations of statistical results that require the calculus of many variables. Our objective is to introduce several useful multivariate techniques in a clear manner, making heavy use of illustrative examples and a minimum of mathematics. Nonetheless, some mathematical sophistication and a desire to think quantitatively will be required.

Most of our emphasis will be on the analysis of measurements obtained without actively controlling or manipulating any of the variables on which the measurements are made. Only in Chapters 6 and 7 shall we treat a few experimental plans (designs) for generating data that prescribe the active manipulation of important variables. Although the experimental design is ordinarily the most important part of a scientific investigation, it is frequently impossible to control the generation of appropriate data in certain disciplines. (This is true, for example, in business, economics, ecology, geology, and sociology.) You should consult [7] and [8] for detailed accounts of design principles that, fortunately, also apply to multivariate situations.

It will become increasingly clear that many multivariate methods are based upon an underlying probability model known as the multivariate normal distribution. Other methods are ad hoc in nature and are justified by logical or common-sense arguments. Regardless of their origin, multivariate techniques must, invariably, be implemented on a computer. Recent advances in computer technology have been accompanied by the development of rather sophisticated statistical software packages, making the implementation step easier.

Multivariate analysis is a "mixed bag." It is difficult to establish a classification scheme for multivariate techniques that both is widely accepted and indicates the appropriateness of the techniques. One classification distinguishes techniques designed to study interdependent relationships from those designed to study dependent relationships. Another classifies techniques according to the number of populations and the number of sets of variables being studied. Chapters in this text are divided into sections according to inference about treatment means, inference about covariance structure, and techniques for sorting or grouping. This should not, however, be considered an attempt to place each method into a slot. Rather, the choice of methods and the types of analyses employed are largely determined by the objectives of the investigation. Below, we list a smaller number of practical problems designed to illustrate the connection between the choice of a statistical method and the objectives of the study. These problems, plus the examples in the text, should provide you with an appreciation for the applicability of multivariate techniques across different fields.

The objectives of scientific investigations to which multivariate methods most naturally lend themselves include the following:
1. Data reduction or structural simplification. The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
2. Sorting and grouping. Groups of "similar" objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
3. Investigation of the dependence among variables. The nature of the relationships among variables is of interest. Are all the variables mutually independent or are one or more variables dependent on the others? If so, how?
4. Prediction. Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables.
5. Hypothesis construction and testing. Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.
We conclude this brief overview of multivariate analysis with a quotation from F. H. C. Marriott [19], page 89. The statement was made in a discussion of cluster analysis, but we feel it is appropriate for a broader range of methods. You should keep it in mind whenever you attempt or read about a data analysis. It allows one to maintain a proper perspective and not be overwhelmed by the elegance of some of the theory:
If the results disagree with informed opinion, do not admit a simple logical interpretation, and do not show up clearly in a graphical presentation, they are probably wrong. There is no magic about numerical methods, and many ways in which they can break down. They are a valuable aid to the interpretation of data, not sausage machines automatically transforming bodies of numbers into packets of scientific fact.
1.2 APPLICATIONS OF MULTIVARIATE TECHNIQUES
The published applications of multivariate methods have increased tremendously in recent years. It is now difficult to cover the variety of real-world applications of these methods with brief discussions, as we did in earlier editions of this book. However, in order to give some indication of the usefulness of multivariate techniques, we offer the following short descriptions of the results of studies from several disciplines. These descriptions are organized according to the categories of objectives given in the previous section. Of course, many of our examples are multifaceted and could be placed in more than one category.

Data reduction or simplification
Sorting and grouping •
Data on several variables related to computer use were employed to create clusters of categories of computer jobs that allow a better determination of existing (or planned) computer utilization. (See [2].)
4
Chap. 1
Aspects of M u ltivariate Analysis • •
•
Measurements of several physiological variables were used to develop a screen ing procedure that discriminates alcoholics from nonalcoholics. (See [25].) Data related to responses to visual stimuli were used to develop a rule for separating people suffering from a multiple-sclerosis-caused visual pathology from those not suffering from the disease. (See Exercise 1.14.) The U. S. Internal Revenue Service uses data collected from tax returns to sort taxpayers into two groups: those that will be audited and those that will not. (See [30].)
Investigation of the dependence among variables • •
•
•
Data on several variables were used to identify factors that were responsible for client success in hiring external consultants. (See [13].) Measurements of variables related to innovation, on the one hand, and vari ables related to the business environment and business organization, on the other hand, were used to discover why some firms are product innovators and some firms are not. (See [5].) Data on variables representing the outcomes of the 10 decathlon events in the Olympics were used to determine the physical factors responsible for suc cess in the decathlon. (See [17].) The associations between measures of risk-taking propensity and measures of socioeconomic characteristics for top-level business executives were used to assess the relation between risk-taking behavior and performance. (See [18].)
Prediction •
•
•
•
The associations between test scores and several high school performance variables and several college performance variables were used to develop pre dictors of success in college. (See [11].) Data on several variables related to the size distribution of sediments were used to develop rules for predicting different depositional environments. (See [9] and [20].) Measurements on several accounting and financial variables were used to develop a method for identifying potentially insolvent property-liability insur ers. (See [27].) Data on several variables for chickweed plants were used to develop a method for predicting the species of a new plant. (See [4].)
Hypotheses testing • Several pollution-related variables were measured to determine whether lev els for a large metropolitan area were roughly constant throughout the week, or whether there was a noticeable difference between weekdays and week ends. (See Exercise 1.6.)
Sec. 1 . 3 •
•
•
The Organ ization of Data
5
Experimental data on several variables were used to see whether the nature of the instructions makes any difference in perceived risks, as quantified by test scores. (See [26].) Data on many variables were used to investigate the differences in structure of American occupations to determine the support for one of two competing sociological theories. (See [16] and [24].) Data on several variables were used to determine whether different types of firms in newly industrialized countries exhibited different patterns of innova tion. (See [15].)
The preceding descriptions offer glimpses into the use of multivariate meth ods in widely diverse fields. 1.3 THE ORGANIZATION OF DATA
Throughout this text, we are going to be concerned with analyzing measurements made on several variables or characteristics. These measurements (commonly called data) must frequently be arranged and displayed in various ways. For exam ple, graphs and tabular arrangements are important aids in data analysis. Summary numbers, which quantitatively portray certain features of the data, are also neces sary to any description. We now introduce the preliminary concepts underlying these first steps of data organization. Arrays
Multivariate data arise whenever an investigator, seeking to understand a social or physical phenomenon, selects a number p ;;;. 1 of variables or characters to record. The values of these variables are all recorded for each distinct item, individual, or
experimental unit.
We will use the notation xi k to indicate the particular value of the kth vari able that is observed on the jth item, or trial. That is,
xi k = measurement of the kth variable on the jth item Consequently, n measurements on p variables can be displayed as follows: Variable 1 Variable 2 Variable k Variable p Item 1: xu x1 2 xl k xl p Item 2: Xz l Xzz X zk Xzp Item j:
xi l
Xi z
xi k
xiP
Item n :
Xn l
Xn z
Xnk
xn p
6
Chap. 1
Aspects of M u ltivariate Analysis
Or we can display these data as a rectangular array, called X, of n rows and p columns: x 1 1 X1 2 . . . x l k . . x 1 p . Xk . . Xp X1 X .
2
X=
22
2
. .
xi 1 xi 2 . . xi k . . . Xjp .
Xn 1 Xn 2 . xnk .
·
2
.
.
. .
. Xnp
The array X, then, contains the data consisting of all of the observations on all of the variables. Example 1 . 1
(A data array)
A selection of four receipts from a university bookstore was obtained in order to investigate the nature of book sales. Each receipt provided, among other things, the number of books sold and the total amount of each sale. Let the first variable be total dollar sales and the second variable be number of books sold. Then we can regard the corresponding numbers on the receipts as four measurements on two variables. Suppose the data, in tabular form, are: Variable 1 (dollar sales):
42
52
48
58
Variable 2 (number of books):
4
5
4
3
Using the notation just introduced, we have
x 1 1 = 42 x1 2 = 4 and the data array X is
x 2 1 = 52 x 22 = 5
X=
with four rows and two columns.
[ ] 42 52 48 58
4 5 4 3
•
Considering data in the form of arrays facilitates the exposition of the subject matter and allows numerical calculations to be performed in an orderly and efficient manner. The efficiency is twofold, as gains are attained in both (1) describing numerical calculations as operations on arrays and (2) the implementation of the calculations on computers, which now use many languages and statistical packages to perform array operations. We consider the manipulation of arrays of numbers in Chapter 2. At this point, we are concerned only with their value as devices for displaying data.
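Readers who want to follow along on a computer can enter the array of Example 1.1 directly. The short sketch below uses Python with NumPy; the names are ours and the book itself does not prescribe any particular package.

```python
import numpy as np

# The bookstore data of Example 1.1: n = 4 receipts (rows), p = 2 variables (columns).
X = np.array([[42, 4],
              [52, 5],
              [48, 4],
              [58, 3]])

n, p = X.shape        # (4, 2)
# NumPy indexes from 0, so the measurement x_jk is stored at X[j - 1, k - 1];
# for example, x_21 (dollar sales on the second receipt) is
x_21 = X[1, 0]        # 52
```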
Descriptive Statistics
A large data set is bulky, and its very mass poses a serious obstacle to any attempt to visually extract pertinent information. Much of the information contained in the data can be assessed by calculating certain summary numbers, known as descriptive statistics. For example, the arithmetic average, or sample mean, is a descriptive statistic that provides a measure of location, that is, a "central value" for a set of numbers. And the average of the squares of the distances of all of the numbers from the mean provides a measure of the spread, or variation, in the numbers.

We shall rely most heavily on descriptive statistics that measure location, variation, and linear association. The formal definitions of these quantities follow.

Let $x_{11}, x_{21}, \ldots, x_{n1}$ be n measurements on the first variable. Then the arithmetic average of these measurements is

\[ \bar{x}_1 = \frac{1}{n}\sum_{j=1}^{n} x_{j1} \]

If the n measurements represent a subset of the full set of measurements that might have been observed, then $\bar{x}_1$ is also called the sample mean for the first variable. We adopt this terminology because the bulk of this book is devoted to procedures designed for analyzing samples of measurements from larger collections.

The sample mean can be computed from the n measurements on each of the p variables, so that, in general, there will be p sample means:

\[ \bar{x}_k = \frac{1}{n}\sum_{j=1}^{n} x_{jk}, \qquad k = 1, 2, \ldots, p \tag{1-1} \]

A measure of spread is provided by the sample variance, defined for n measurements on the first variable as

\[ s_1^2 = \frac{1}{n}\sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2 \]

where $\bar{x}_1$ is the sample mean of the $x_{j1}$'s. In general, for p variables, we have

\[ s_k^2 = \frac{1}{n}\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p \tag{1-2} \]

Two comments are in order. First, many authors define the sample variance with a divisor of n - 1 rather than n. Later we shall see that there are theoretical reasons for doing this, and it is particularly appropriate if the number of measurements, n, is small. The two versions of the sample variance will always be differentiated by displaying the appropriate expression.

Second, although the $s^2$ notation is traditionally used to indicate the sample variance, we shall eventually consider an array of quantities in which the sample variances lie along the main diagonal. In this situation, it is convenient to use double subscripts on the variances in order to indicate their positions in the array. Therefore, we introduce the notation $s_{ii}$ to denote the same variance computed from measurements on the ith variable, and we have the notational identities

\[ s_{kk} = s_k^2 = \frac{1}{n}\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p \tag{1-3} \]

The square root of the sample variance, $\sqrt{s_{ii}}$, is known as the sample standard deviation. This measure of variation is in the same units as the observations.

Consider n pairs of measurements on each of variables 1 and 2:

\[ \begin{bmatrix} x_{11} \\ x_{12} \end{bmatrix}, \begin{bmatrix} x_{21} \\ x_{22} \end{bmatrix}, \ldots, \begin{bmatrix} x_{n1} \\ x_{n2} \end{bmatrix} \]

That is, $x_{j1}$ and $x_{j2}$ are observed on the jth experimental item (j = 1, 2, ..., n). A measure of linear association between the measurements of variables 1 and 2 is provided by the sample covariance

\[ s_{12} = \frac{1}{n}\sum_{j=1}^{n} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2) \]

or the average product of the deviations from their respective means. If large values for one variable are observed in conjunction with large values for the other variable, and the small values also occur together, $s_{12}$ will be positive. If large values from one variable occur with small values for the other variable, $s_{12}$ will be negative. If there is no particular association between the values for the two variables, $s_{12}$ will be approximately zero.

The sample covariance

\[ s_{ik} = \frac{1}{n}\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k), \qquad i = 1, 2, \ldots, p, \quad k = 1, 2, \ldots, p \tag{1-4} \]

measures the association between the ith and kth variables. We note that the covariance reduces to the sample variance when i = k. Moreover, $s_{ik} = s_{ki}$ for all i and k.

The final descriptive statistic considered here is the sample correlation coefficient (or Pearson's product-moment correlation coefficient; see [3]). This measure of the linear association between two variables does not depend on the units of measurement. The sample correlation coefficient for the ith and kth variables is defined as

\[ r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\,\sqrt{s_{kk}}} = \frac{\displaystyle\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)}{\sqrt{\displaystyle\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2}\ \sqrt{\displaystyle\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2}} \tag{1-5} \]

for i = 1, 2, ..., p and k = 1, 2, ..., p. Note $r_{ik} = r_{ki}$ for all i and k.

The sample correlation coefficient is a standardized version of the sample covariance, where the product of the square roots of the sample variances provides the standardization. Notice that $r_{ik}$ has the same value whether n or n - 1 is chosen as the common divisor for $s_{ii}$, $s_{kk}$, and $s_{ik}$.

The sample correlation coefficient $r_{ik}$ can also be viewed as a sample covariance. Suppose the original values $x_{ji}$ and $x_{jk}$ are replaced by standardized values $(x_{ji} - \bar{x}_i)/\sqrt{s_{ii}}$ and $(x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$. The standardized values are commensurable because both sets are centered at zero and expressed in standard deviation units. The sample correlation coefficient is just the sample covariance of the standardized observations.

Although the signs of the sample correlation and the sample covariance are the same, the correlation is ordinarily easier to interpret because its magnitude is bounded. To summarize, the sample correlation r has the following properties:

1. The value of r must be between -1 and +1.
2. Here r measures the strength of the linear association. If r = 0, this implies a lack of linear association between the components. Otherwise, the sign of r indicates the direction of the association: r < 0 implies a tendency for one value in the pair to be larger than its average when the other is smaller than its average; and r > 0 implies a tendency for one value of the pair to be large when the other value is large and also for both values to be small together.
3. The value of $r_{ik}$ remains unchanged if the measurements of the ith variable are changed to $y_{ji} = a x_{ji} + b$, j = 1, 2, ..., n, and the values of the kth variable are changed to $y_{jk} = c x_{jk} + d$, j = 1, 2, ..., n, provided that the constants a and c have the same sign.
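Property 3 is easy to verify numerically. The following small Python sketch is our own illustration; the particular values of a, b, c, and d are arbitrary choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 2.0 * x + rng.normal(size=20)

r_original = np.corrcoef(x, y)[0, 1]
# Rescale both variables: a*x + b and c*y + d with a and c of the same (positive) sign.
r_rescaled = np.corrcoef(3.0 * x + 10.0, 0.5 * y - 4.0)[0, 1]

print(np.isclose(r_original, r_rescaled))   # True: the sample correlation is unchanged
```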
The quantities $s_{ik}$ and $r_{ik}$ do not, in general, convey all there is to know about the association between two variables. Nonlinear associations can exist that are not revealed by these descriptive statistics. Covariance and correlation provide measures of linear association, or association along a line. Their values are less informative for other kinds of association. On the other hand, these quantities can be very sensitive to "wild" observations ("outliers") and may indicate association when, in fact, little exists. In spite of these shortcomings, covariance and correlation coefficients are routinely calculated and analyzed. They provide cogent numerical summaries of association when the data do not exhibit obvious nonlinear patterns of association and when wild observations are not present. Suspect observations must be accounted for by correcting obvious recording mistakes and by taking actions consistent with the identified causes. The values of $s_{ik}$ and $r_{ik}$ should be quoted both with and without these observations.

The sum of squares of the deviations from the mean and the sum of cross-product deviations are often of interest themselves. These quantities are

\[ w_{kk} = \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p \tag{1-6} \]
Chap. 1
Aspects of M u ltivariate Analysis
and
W; k
=
ll
jL: (xji - x;) (xjk - xk ) =l
i
=
1, 2, . . . 'p, k
=
1, 2, . . . 'p
(1-7)
The descriptive statistics computed from n measurements on p variables can also be organized into arrays.

ARRAYS OF BASIC DESCRIPTIVE STATISTICS

Sample means:
\[ \bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix} \]

Sample variances and covariances:
\[ \mathbf{S}_n = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{12} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{bmatrix} \]

Sample correlations:
\[ \mathbf{R} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{12} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{1p} & r_{2p} & \cdots & 1 \end{bmatrix} \]
Consider the data introduced in Example 1.1. Each receipt yields a pair of measurements, total dollar sales, and number of books sold. Find the arrays x, S11, and R.
Sec. 1 . 3
11
The Organ ization of Data
Since there are four receipts, we have a total of four measurements (observations) on each variable. The sample means are
4
:X1
=
� � xj !
:X 2
=
� � xj 2
j= l 4
j= l
=
l ( 42 + 52 + 48 + 58)
=
l{ 4 + 5 + 4 + 3)
=
=
50
4
The sample variances and covariances are
sl l
4
- 4I"' £.J =
(xj ! - x-i ) 2
j =l 2 � ((42 - 50) + (52 - 50) 2 + (48 - 50) 2 + (58 - 50) 2 )
S22 - I
4
=
34
=
- 1.5
(xj 2 - X-2 ) 2 j =l
"' - 4 £.J
=
s1 2
= =
� ((4 - 4) 2 + (5 - 4) 2 + (4 - 4) 2 + (3 - 4) 2 )
=
.5
4
(x - :X ) (x - :X ) j= j i 1 j 2 2 � ((42 - 50) (4 - 4) + (52 - 50) (5 - 4) � �l
+ (48 - 50) (4 - 4) + (58 - 50) (3 - 4)) and sn =
The sample correlation is
r 12
=
[
�Ys;;
S1 2
34 - 1.5 1.5 .5 =
- 1.5
J
\134 Y.5
=
- .36
12
Chap. 1
Aspects of M u ltivariate Analysis
so _
R G raphical Techniques
[ - .361
- .36 1
]
•
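The hand calculations of Example 1.2 can be checked with a few array operations. The NumPy sketch below is our own illustration (the variable names are ours); it reproduces x̄, Sn, and R for the bookstore data.

```python
import numpy as np

# The four bookstore receipts of Example 1.1: dollar sales and number of books sold.
X = np.array([[42.0, 4.0],
              [52.0, 5.0],
              [48.0, 4.0],
              [58.0, 3.0]])

n = X.shape[0]
x_bar = X.mean(axis=0)            # sample mean array: [50., 4.]
D = X - x_bar                     # deviations from the column means
S_n = D.T @ D / n                 # divisor n, as in (1-4): [[34., -1.5], [-1.5, .5]]
d = np.sqrt(np.diag(S_n))         # sample standard deviations
R = S_n / np.outer(d, d)          # sample correlation array, as in (1-5)

print(np.round(R, 2))             # off-diagonal entries are -0.36
```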
Graphical Techniques

Plots are important, but frequently neglected, aids in data analysis. Although it is impossible to simultaneously plot all the measurements made on several variables and study the configurations, plots of individual variables and plots of pairs of variables can still be very informative. Sophisticated computer programs and display equipment allow one the luxury of visually examining data in one, two, or three dimensions with relative ease. On the other hand, many valuable insights can be obtained from the data by constructing plots with paper and pencil. Simple, yet elegant and effective, methods for displaying data are available in [28]. It is good statistical practice to plot pairs of variables and visually inspect the pattern of association.

Consider, then, the following seven pairs of measurements on two variables:
Variable 1 (x1):  3    4    2    6    8    2    5
Variable 2 (x2):  5    5.5  4    7    10   5    7.5
These data are plotted as seven points in two dimensions (each axis representing a variable) in Figure 1.1. The coordinates of the points are determined by the paired measurements: (3, 5), (4, 5.5), ..., (5, 7.5). The resulting two-dimensional plot is known as a scatter diagram or scatter plot.
Figure 1.1  A scatter plot and marginal dot diagrams.
Also shown in Figure 1.1 are separate plots of the observed values of variable 1 and the observed values of variable 2, respectively. These plots are called (marginal) dot diagrams. They can be obtained from the original observations or by projecting the points in the scatter diagram onto each coordinate axis.

The information contained in the single-variable dot diagrams can be used to calculate the sample means $\bar{x}_1$ and $\bar{x}_2$ and the sample variances $s_{11}$ and $s_{22}$. (See Exercise 1.1.) The scatter diagram indicates the orientation of the points, and their coordinates can be used to calculate the sample covariance $s_{12}$. In the scatter diagram of Figure 1.1, large values of $x_1$ occur with large values of $x_2$ and small values of $x_1$ with small values of $x_2$. Hence, $s_{12}$ will be positive.

Dot diagrams and scatter plots contain different kinds of information. The information in the marginal dot diagrams is not sufficient for constructing the scatter plot. As an illustration, suppose the data preceding Figure 1.1 had been paired differently, so that the measurements on the variables $x_1$ and $x_2$ were as follows:
Variable 1 (x1):  5    4    6    2    2    8    3
Variable 2 (x2):  5    5.5  4    7    10   5    7.5
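As the following paragraph explains, this re-pairing reverses the sign of the sample covariance while leaving the single-variable summaries alone. A quick numerical check, written by us in NumPy, confirms the sign change:

```python
import numpy as np

x2 = np.array([5.0, 5.5, 4.0, 7.0, 10.0, 5.0, 7.5])
x1_first  = np.array([3.0, 4.0, 2.0, 6.0, 8.0, 2.0, 5.0])   # pairing used for Figure 1.1
x1_second = np.array([5.0, 4.0, 6.0, 2.0, 2.0, 8.0, 3.0])   # rearranged pairing above

def cov_n(a, b):
    """Sample covariance with divisor n, as in (1-4)."""
    return np.mean((a - a.mean()) * (b - b.mean()))

print(cov_n(x1_first, x2) > 0)    # True: the original pairing gives a positive s12
print(cov_n(x1_second, x2) < 0)   # True: the rearranged pairing gives a negative s12
```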
(We have simply rearranged the values of variable 1.) The scatter and dot diagrams for the "new" data are shown in Figure 1.2. Comparing Figures 1.1 and 1.2, we find that the marginal dot diagrams are the same, but that the scatter diagrams are decidedly different. In Figure 1.2, large values of $x_1$ are paired with small values of $x_2$ and small values of $x_1$ with large values of $x_2$. Consequently, the descriptive statistics for the individual variables $\bar{x}_1$, $\bar{x}_2$, $s_{11}$, and $s_{22}$ remain unchanged, but the sample covariance $s_{12}$, which measures the association between pairs of variables, will now be negative.

The different orientations of the data in Figures 1.1 and 1.2 are not discernible from the marginal dot diagrams alone. At the same time, the fact that the marginal dot diagrams are the same in the two cases is not immediately apparent from the scatter plots.
Figure 1.2  Scatter plot and dot diagrams for rearranged data.
The two types of graphical procedures complement one another; they are not competitors.

The next two examples further illustrate the information that can be conveyed by a graphic display.

Example 1.3
(The effect of unusual observations on sample correlations)
Some financial data representing jobs and productivity for the 16 largest publishing firms appeared in an article in Forbes magazine on April 30, 1990. The data for the pair of variables $x_1$ = employees (jobs) and $x_2$ = profits per employee (productivity) are graphed in Figure 1.3. We have labeled two "unusual" observations. Dun & Bradstreet is the largest firm in terms of number of employees, but is "typical" in terms of profits per employee. Time Warner has a "typical" number of employees, but comparatively small (negative) profits per employee.
Figure 1.3  Profits per employee and number of employees for 16 publishing firms. (Horizontal axis: employees, in thousands; the Dun & Bradstreet and Time Warner points are labeled.)
The sample correlation coefficient computed from the values of $x_1$ and $x_2$ is

\[ r_{12} = \begin{cases} -.39 & \text{for all 16 firms} \\ -.56 & \text{for all firms but Dun \& Bradstreet} \\ -.39 & \text{for all firms but Time Warner} \\ -.50 & \text{for all firms but Dun \& Bradstreet and Time Warner} \end{cases} \]
It is clear that atypical observations can have a considerable effect on the sample correlation coefficient. •
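The Forbes figures themselves are not reproduced in this chapter, but the sensitivity of r to a single atypical point is easy to see with made-up numbers. The sketch below uses purely illustrative, randomly generated data of our own; only the qualitative behavior matters.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(50.0, 10.0, size=15)
y = 0.8 * x + rng.normal(0.0, 5.0, size=15)       # fifteen "typical" points, positively associated

x_all = np.append(x, 120.0)                        # one point far to the right...
y_all = np.append(y, -10.0)                        # ...with an unusually low second coordinate

r_typical = np.corrcoef(x, y)[0, 1]                # strong positive correlation
r_with_outlier = np.corrcoef(x_all, y_all)[0, 1]   # noticeably different (it may even change sign)
print(round(r_typical, 2), round(r_with_outlier, 2))
```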
Example 1 .4 (A scatter plot for baseball data)
In a July 17, 1978, article on money in sports, Sports Illustrated magazine provided data on x1 = player payroll for National League East baseball teams.
We have added data on x2 = won-lost percentage for 1977. The results are given in Table 1.1.

Figure 1.4  Salaries and won-lost percentage from Table 1.1. (Horizontal axis: player payroll in millions of dollars.)
TABLE 1.1  1977 SALARY AND FINAL RECORD FOR THE NATIONAL LEAGUE EAST

Team                    x1 = player payroll    x2 = won-lost percentage
Philadelphia Phillies        3,497,900                .623
Pittsburgh Pirates           2,485,475                .593
St. Louis Cardinals          1,782,875                .512
Chicago Cubs                 1,725,450                .500
Montreal Expos               1,645,575                .463
New York Mets                1,469,800                .395
The scatter plot in Figure 1.4 supports the claim that a championship team can be bought. Of course, this cause-effect relationship cannot be substantiated, because the experiment did not include a random assignment of payrolls. Thus, statistics cannot answer the question: Could the Mets have won with $4 million to spend on player salaries? •

To construct the scatter plot in, for example, Figure 1.4, we have regarded the six paired observations in Table 1.1 as the coordinates of six points in two-dimensional space. The figure allows us to examine visually the grouping of teams with respect to the variables total payroll and won-lost percentage.
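A plot like Figure 1.4 can be drawn directly from Table 1.1. The following matplotlib sketch is our own (the payroll figures are rescaled to millions for the axis, as in the figure):

```python
import matplotlib.pyplot as plt

payroll = [3497900, 2485475, 1782875, 1725450, 1645575, 1469800]   # x1 from Table 1.1
pct     = [.623, .593, .512, .500, .463, .395]                     # x2 from Table 1.1

plt.scatter([x / 1e6 for x in payroll], pct)
plt.xlabel("Player payroll in millions of dollars")
plt.ylabel("Won-lost percentage")
plt.show()
```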
Example 1 .5
(Multiple scatter plots for paper strength measurements)
Paper is manufactured in continuous sheets several feet wide. Because of the orientation of fibers within the paper, it has a different strength when measured in the direction produced by the machine than when measured across, or at right angles to, the machine direction. Table 1.2 shows the measured values of

x1 = density (grams/cubic centimeter)
x2 = strength (pounds) in the machine direction
x3 = strength (pounds) in the cross direction

A novel graphic presentation of these data appears in Figure 1.5, page 18. The scatter plots are arranged as the off-diagonal elements of a covariance array and box plots as the diagonal elements. The latter are on a different scale with this software, so we use only the overall shape to provide information on symmetry and possible outliers for each individual characteristic. The scatter plots can be inspected for patterns and unusual observations. In Figure 1.5, there is one unusual observation: the density of specimen 25. Some of the scatter plots have patterns suggesting that there are two separate clumps of observations.

These scatter plot arrays are further pursued in our discussion of new software graphics in the next section. •

In the general multiresponse situation, p variables are simultaneously recorded on n items. Scatter plots should be made for pairs of important variables and, if the task is not too great to warrant the effort, for all pairs.

Limited as we are to a three-dimensional world, we cannot always picture an entire set of data. However, two further geometric representations of the data provide an important conceptual framework for viewing multivariable statistical methods. In cases where it is possible to capture the essence of the data in three dimensions, these representations can actually be graphed.

n points in p dimensions (p-dimensional scatter plot). Consider the natural extension of the scatter plot to p dimensions, where the p measurements on the jth item represent the coordinates of a point in p-dimensional space. The coordinate axes are taken to correspond to the variables, so that the jth point is $x_{j1}$ units along the first axis, $x_{j2}$ units along the second, ..., $x_{jp}$ units along the pth axis. The resulting plot with n points not only will exhibit the overall pattern of variability, but also will show similarities (and differences) among the n items. Groupings of items will manifest themselves in this representation.
TABLE 1.2  PAPER-QUALITY MEASUREMENTS

Specimens 1-24:
Specimen:       1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Density:        .801 .824 .841 .816 .840 .842 .820 .802 .828 .819 .826 .802 .810 .802 .832 .796 .759 .770 .759 .772 .806 .803 .845 .822
Strength (MD):  121.41 127.70 129.20 131.80 135.10 131.50 126.70 115.10 130.80 124.60 118.31 114.20 120.30 115.70 117.51 109.81 109.10 115.10 118.31 112.60 116.20 118.00 131.00 125.70
Strength (CD):  70.42 72.47 78.20 74.89 71.21 78.39 69.02 73.10 79.28 76.48 70.25 72.88 68.23 68.12 71.62 53.10 50.85 51.68 50.60 53.51 56.53 70.70 74.35 68.29

Specimen 25: density .971, strength (MD) 126.10, strength (CD) 72.10

Specimens 26-41:
Specimen:       26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
Density:        .816 .836 .815 .822 .822 .843 .824 .788 .782 .795 .805 .836 .788 .772 .776 .758
Strength (MD):  125.80 125.50 127.80 130.50 127.90 123.90 124.10 120.80 107.40 120.70 121.91 122.31 110.60 103.51 110.71 113.80
Strength (CD):  70.64 76.33 76.75 80.33 75.68 78.54 71.91 68.22 54.42 70.41 73.68 74.93 53.52 48.93 53.67 52.42

Source: Data courtesy of SONOCO Products, Inc.
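A display in the spirit of Figure 1.5 can be produced with pandas. The sketch below is ours; the file name is hypothetical (it assumes the Table 1.2 measurements have been saved as a plain ASCII file, one specimen per row), and pandas puts histograms rather than box plots on the diagonal.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Hypothetical file: columns are density, machine-direction strength, cross-direction strength.
paper = pd.read_csv("paper_quality.dat", sep=r"\s+",
                    names=["density", "strength_MD", "strength_CD"])

# All pairwise scatter plots, arranged as in Figure 1.5.
scatter_matrix(paper, diagonal="hist", figsize=(6, 6))
plt.show()
```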
Figure 1.5  Scatter plots and boxplots of the paper-quality data from Table 1.2.
p points in n dimensions. The n observations of the p variables can also be regarded as p points in n-dimensional space. Each column of X determines one of the points. The ith column,

\[ \begin{bmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{ni} \end{bmatrix} \]

consisting of all n measurements on the ith variable, determines the ith point. In Chapter 3, we show how the closeness of points in n dimensions can be related to measures of association between the corresponding variables.

1.4 DATA DISPLAYS AND PICTORIAL REPRESENTATIONS
The rapid development of powerful personal computers and workstations has led to a proliferation of sophisticated statistical software for data analysis and graphics. It is often possible, for example, to sit at one's desk and examine the nature of multidimensional data with clever computer-generated pictures. These pictures are valuable aids in understanding data and often prevent many false starts and subsequent inferential problems.

As we shall see in Chapters 8 and 12, there are several techniques that seek to represent p-dimensional observations in few dimensions such that the original distances (or similarities) between pairs of observations are (nearly) preserved. In general, if multidimensional observations can be represented in two dimensions, then outliers, relationships, and distinguishable groupings can often be discerned by eye. We shall discuss and illustrate several methods for displaying multivariate data in two dimensions. One good source for more discussion of graphical methods is [12].
Linking Multiple Two-Dimensional Scatter Plots
One of the more exciting new graphical procedures involves electronically connecting many two-dimensional scatter plots.

Example 1.6 (Linked scatter plots and brushing)

To illustrate linked two-dimensional scatter plots, we refer to the paper-quality data in Table 1.2. These data represent measurements on the variables $x_1$ = density, $x_2$ = strength in the machine direction, and $x_3$ = strength in the cross direction. Figure 1.6 shows two-dimensional scatter plots for pairs of these variables organized as a 3 × 3 array. For example, the picture in the upper left-hand corner of the figure is a scatter plot of the pairs of observations $(x_1, x_3)$. That is, the $x_1$ values are plotted along the horizontal axis, and the $x_3$ values are plotted along the vertical axis.
Figure 1.6  Scatter plots for the paper-quality data of Table 1.2.
The lower right-hand corner of the figure contains a scatter plot of the observations $(x_3, x_1)$. That is, the axes are reversed. Corresponding interpretations hold for the other scatter plots in the figure. Notice that the variables and their three-digit ranges are indicated in the boxes along the SW-NE diagonal.

The operation of marking (selecting) the obvious outlier in the $(x_1, x_3)$ scatter plot of Figure 1.6 creates Figure 1.7(a), where the outlier is labeled as specimen 25 and the same data point is highlighted in all the scatter plots. Specimen 25 also appears to be an outlier in the $(x_1, x_2)$ scatter plot but not in the $(x_2, x_3)$ scatter plot. The operation of deleting this specimen leads to the modified scatter plots of Figure 1.7(b).

From Figure 1.7, we notice that some points in, for example, the $(x_2, x_3)$ scatter plot seem to be disconnected from the others. Selecting these points, using the (dashed) rectangle (see page 22), highlights the selected points in all of the other scatter plots and leads to the display in Figure 1.8(a). Further checking revealed that specimens 16-21, specimen 34, and specimens 38-41 were actually specimens from an older roll of paper that was included in order to have enough plies in the cardboard being manufactured. Deleting the outlier and the cases corresponding to the older paper and adjusting the ranges of the remaining observations leads to the scatter plots in Figure 1.8(b).

The operation of highlighting points corresponding to a selected range of one of the variables is called brushing. Brushing could begin with a rectangle, as in Figure 1.8(a), but then the brush could be moved to provide a sequence of highlighted points.
:· '
. . .. . , .. -�· . . ., : :-' .
.
.
.
.
.
. . ..
# ... .,. .. ... . .. �. ...
. . r . .. . . .. .
#
25
.
. , -_
I
. . .. · . . . . .
Sec. 1 . 4
.
. .. , · : · 25 . ,
..
.
.
••
.
.
I I ....
.
.I
.
. 97 1
. . . . '.. . •.v : . • . . ' . . .� .··. : . . . '
.
.•
I
.
25
25
Cx t )
, · . .. � : · 25• •
.
1 35
1 04
. 758
( x3 )
Cross
48.9
Machine
Density
21
80.3
.. '
( xz )
25
Data Displays and Pictorial Representations
. · . ..., .. .
... . • .: ..
, ..
�� ...
(a )
. . . . , . . -�· . . ... ..._, .
:-' . .. .
#
. .. ,
80.3
. Cross (x3 )
48.9 1 35
( xz )
Machine
. .
.. . .
.I
1 04
.
.
.
.
. .:
.
I I ....
.. .
,
.•
...
' .
.
I
.97 1
( xi )
Density
.758
.. '
. . "' . .. . . .
.
. .# ... . .,. .. . ... . . . ··'· . ... . r . .. . . . . .
.
... . , -.. :. I
.
. ..,. . .. . ,.. ' · .. . . . ... .··.' : . . . '
(b)
. . .�· .·.·
. ·..
� -.s. .., .. .
·:« ,
Figure 1 .7 Modified scatter plots for the paper-quality data with outlier (25) (a) selected and (b) deleted.
Figure 1.8 Modified scatter plots with (a) group of points selected and (b) points, including specimen 25, deleted and the scatter plots rescaled.
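Readers with access to Python can mimic the linked displays of Figures 1.6 through 1.8 with a static scatter-plot matrix in which one selected point is highlighted in every panel. The sketch below uses matplotlib and simulated paper-quality-like data (the actual Table 1.2 values are not reproduced here), so the numerical values and the index of the highlighted specimen are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Simulated stand-in for the paper-quality data: density, machine strength, cross strength.
x1 = rng.normal(0.80, 0.05, 41)                 # density
x2 = 100 + 60 * x1 + rng.normal(0, 5, 41)       # machine-direction strength
x3 = 40 + 40 * x1 + rng.normal(0, 5, 41)        # cross-direction strength
X = np.column_stack([x1, x2, x3])
X[24] = [0.97, 135, 48]                         # plant an unusual point, playing the role of specimen 25
labels = ["Density", "Machine", "Cross"]

fig, axes = plt.subplots(3, 3, figsize=(7, 7))
for i in range(3):
    for j in range(3):
        ax = axes[i, j]
        if i == j:
            # diagonal boxes carry the variable names, as in Figure 1.6
            ax.text(0.5, 0.5, labels[i], ha="center", va="center")
            ax.set_xticks([]); ax.set_yticks([])
        else:
            ax.scatter(X[:, j], X[:, i], s=10)
            # the "brushed" specimen is highlighted in every panel
            ax.scatter(X[24, j], X[24, i], s=40, color="red")
plt.tight_layout()
plt.show()
```

Interactive brushing, in which the highlighted set changes as a rectangle is dragged across one panel, requires interactive graphics software, but the linking idea is the same.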
Data Displays a n d Pictorial Representations
Scatter plots like those in Example 1.6 are extremely useful aids in data analysis. Another important new graphical technique uses software that allows the data analyst to view high-dimensional data as slices of various three-dimensional perspectives. This can be done dynamically and continuously until informative views are obtained. A comprehensive discussion of dynamic graphical methods is available in [1]. A strategy for on-line multivariate exploratory graphical analysis, motivated by the need for a routine procedure for searching for structure in multivariate data, is given in [31].

Example 1.7
(Plots in three dimensions)
Four different measurements of lumber stiffness are given in Table 4.3, page 198. In Example 4.13, specimen (board) 16 and possibly specimen (board) 9 are identified as unusual observations. Figures 1.9(a), (b), and (c) contain perspectives of the stiffness data in the x1, x2, x3 space. These views were obtained by continually rotating and turning the three-dimensional coordinate axes. Spinning the coordinate axes allows one to get a better understanding of the three-dimensional aspects of the data. Figure 1.9(d) gives one picture of the stiffness data in x2, x3, x4 space. Notice that Figures 1.9(a) and (d) visually confirm specimens 9 and 16 as outliers. Specimen 9 is very large in all three coordinates.
Figure 1.9 Three-dimensional perspectives for the lumber stiffness data: (a) outliers clear; (b) outliers masked; (c) specimen 9 large; (d) good view of x2, x3, x4 space.
Aspects of M u ltivariate Analysis
A counterclockwise-like rotation of the axes in Figure 1.9(a) produces Figure 1.9(b), and the two unusual observations are masked in this view. A further spinning of the x2, x3 axes gives Figure 1.9(c); one of the outliers (16) is now hidden. Additional insights can sometimes be gleaned from visual inspection of the slowly spinning data. It is this dynamic aspect that statisticians are just beginning to understand and exploit. •
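A static stand-in for the rotating displays of Figure 1.9 can be produced by plotting the same point cloud from several viewing angles. The following sketch uses matplotlib's 3D axes with simulated data (the lumber-stiffness values of Table 4.3 are not reproduced here), so the two points flagged as unusual are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (needed on older matplotlib versions)

rng = np.random.default_rng(1)
# Simulated, positively correlated stiffness-like measurements on 30 boards.
X = rng.multivariate_normal(mean=[1900, 1700, 1500],
                            cov=[[1e4, 8e3, 7e3], [8e3, 1e4, 8e3], [7e3, 8e3, 1e4]],
                            size=30)
X[8] += 400       # pretend observation 9 is unusually large in every coordinate
X[15, :2] -= 300  # and observation 16 is unusual in the first two coordinates

fig = plt.figure(figsize=(9, 3))
# Three perspectives of the same x1, x2, x3 cloud, mimicking a slow spin of the axes.
for k, (elev, azim) in enumerate([(20, 30), (20, 75), (20, 120)], start=1):
    ax = fig.add_subplot(1, 3, k, projection="3d")
    ax.scatter(X[:, 0], X[:, 1], X[:, 2], s=15)
    ax.scatter(*X[[8, 15]].T, color="red", s=30)   # flagged points
    ax.view_init(elev=elev, azim=azim)
    ax.set_title(f"azim = {azim}")
plt.tight_layout()
plt.show()
```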
Plots like those in Figure 1.9 allow one to identify readily observations that do not conform to the rest of the data and that may heavily influence inferences based on standard data-generating models.

We now turn to two popular pictorial representations of multivariate data in two dimensions: stars and Chernoff faces.

Stars
Suppose each data unit consists of nonnegative observations on p >= 2 variables. In two dimensions, we can construct circles of a fixed (reference) radius with p equally spaced rays emanating from the center of the circle. The lengths of the rays represent the values of the variables. The ends of the rays can be connected with straight lines to form a star. Each star represents a multivariate observation, and the stars can be grouped according to their (subjective) similarities.

It is often helpful, when constructing the stars, to standardize the observations. In this case some of the observations will be negative. The observations can then be reexpressed so that the center of the circle represents the smallest standardized observation within the entire data set.

Example 1.8
(Utility data as stars)
Stars representing the first 5 of the 22 public utility firms in Table 12.5, page 747, are shown in Figure 1.10. There are eight variables; consequently, the stars are distorted octagons. The observations on all variables were standardized. Among the first five utilities, the smallest standardized observation for any variable was -1.6. Treating this value as zero, the variables are plotted on identical scales along eight equiangular rays originating from the center of the circle. The variables are ordered in a clockwise direction, beginning in the 12 o'clock position.

At first glance, none of these utilities appears to be similar to any other. However, because of the way the stars are constructed, each variable gets equal weight in the visual impression. If we concentrate on the variables 6 (sales in KWH use per year) and 8 (total fuel costs in cents per KWH), then Boston Edison and Consolidated Edison are similar (small variable 6, large variable 8), and Arizona Public Service, Central Louisiana Electric, and Commonwealth Edison are similar (moderate variable 6, moderate variable 8). •
25
Figure 1.10 Stars for the first five public utilities: Arizona Public Service (1), Boston Edison Co. (2), Central Louisiana Electric Co. (3), Commonwealth Edison Co. (4), and Consolidated Edison Co. (NY) (5).
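A star plot of the kind just described can be drawn on polar axes: standardize the observations, shift them so that the smallest standardized value corresponds to the center of the circle, place the values on p equally spaced rays, and connect the ray ends. The sketch below does this in Python for two hypothetical firms with made-up standardized values; it does not use the Table 12.5 data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up standardized observations for two "firms" on p = 8 variables.
firms = {"Firm A": np.array([1.2, -0.3, 0.8, -1.1, 0.4, 1.5, -0.6, 0.2]),
         "Firm B": np.array([-0.9, 0.7, -1.4, 0.3, 1.1, -0.2, 0.9, -1.6])}
p = 8
angles = np.linspace(0, 2 * np.pi, p, endpoint=False)
shift = -min(z.min() for z in firms.values())   # smallest standardized value plots at the center

fig, axes = plt.subplots(1, 2, subplot_kw={"projection": "polar"}, figsize=(8, 4))
for ax, (name, z) in zip(axes, firms.items()):
    r = z + shift
    ax.plot(np.append(angles, angles[0]), np.append(r, r[0]))   # close the star
    ax.fill(np.append(angles, angles[0]), np.append(r, r[0]), alpha=0.2)
    ax.set_title(name)
    ax.set_xticks(angles)
    ax.set_xticklabels([f"x{i + 1}" for i in range(p)])
plt.tight_layout()
plt.show()
```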
Chernoff Faces
People react to faces. Chernoff [6] suggested representing p-dimensional observations as a two-dimensional face whose characteristics (face shape, mouth curvature, nose length, eye size, pupil position, and so forth) are determined by the measurements on the p variables.
Aspects of M u ltivariate Analysis
As originally designed, Chernoff faces can handle up to 18 variables. The assignment of variables to facial features is done by the experimenter, and different choices produce different results. Some iteration is usually necessary before satisfactory representations are achieved.

Chernoff faces appear to be most useful for verifying (1) an initial grouping suggested by subject-matter knowledge and intuition or (2) final groupings produced by clustering algorithms.

Example 1.9
(Utility data as Chernoff faces)
From the data in Table 12.5, the 22 public utility companies were represented as Chernoff faces. We have the following correspondences:

Variable                                        Facial characteristic
x1: Fixed-charge coverage                <->    Half-height of face
x2: Rate of return on capital            <->    Face width
x3: Cost per KW capacity in place        <->    Position of center of mouth
x4: Annual load factor                   <->    Slant of eyes
x5: Peak KWH demand growth from 1974     <->    Eccentricity (height/width) of eyes
x6: Sales (KWH use per year)             <->    Half-length of eye
x7: Percent nuclear                      <->    Curvature of mouth
x8: Total fuel costs (cents per KWH)     <->    Length of nose
\
The Chernoff faces are shown in Figure 1.11. We have subjectively grouped "similar" faces into seven clusters. If a smaller number of clusters is desired, we might combine clusters 5, 6, and 7 and, perhaps, clusters 2 and 3 to obtain four or five clusters. For our assignment of variables to facial features, the firms group largely according to geographical location. •

Constructing Chernoff faces is a task that must be done with the aid of a computer. The data are ordinarily standardized within the computer program as part of the process for determining the locations, sizes, and orientations of the facial characteristics. With some training, we can use Chernoff faces to communicate similarities or dissimilarities, as the next example indicates.

Example 1.10
(Using Chernoff faces to show changes over time)
Figure 1.12 illustrates an additional use of Chernoff faces. (See [23].) In the figure, the faces are used to track the financial well-being of a company over time. As indicated, each facial feature represents a single financial indicator, and the longitudinal changes in these indicators are thus evident at a glance. •
Figure 1.11 Chernoff faces for 22 public utilities.
Figure 1.12 Chernoff faces over time.
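There is no standard plotting routine for Chernoff faces in the common Python libraries, but the idea of mapping variables to facial features can be illustrated with a toy sketch built from matplotlib patches. The assignment below (face width, eye size, and mouth curvature driven by three standardized variables) is arbitrary, and the indicator values are made up; it is meant only to suggest how such displays are constructed.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse, Arc

def toy_face(ax, z):
    """Draw a crude face from three standardized values in [-2, 2]:
    z[0] -> face width, z[1] -> eye size, z[2] -> mouth curvature."""
    width = 1.0 + 0.2 * z[0]
    eye = 0.10 + 0.03 * z[1]
    curve = 0.3 * z[2]
    ax.add_patch(Ellipse((0, 0), 2 * width, 2.4, fill=False))           # head outline
    for x in (-0.4, 0.4):                                               # two eyes
        ax.add_patch(Ellipse((x, 0.4), 2 * eye, 2 * eye, fill=False))
    # mouth: an arc bending up or down according to the sign of the curvature variable
    theta = (200, 340) if curve >= 0 else (20, 160)
    ax.add_patch(Arc((0, -0.5), 0.9, 2 * abs(curve) + 0.1, theta1=theta[0], theta2=theta[1]))
    ax.set_xlim(-1.6, 1.6); ax.set_ylim(-1.6, 1.6)
    ax.set_aspect("equal"); ax.axis("off")

rng = np.random.default_rng(3)
fig, axes = plt.subplots(1, 4, figsize=(8, 2.2))
for ax, year in zip(axes, range(1976, 1980)):
    toy_face(ax, rng.uniform(-2, 2, size=3))   # one face per "year" of made-up indicators
    ax.set_title(str(year))
plt.show()
```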
Aspects of M u ltivariate Analysis
Chernoff faces have also been used to display differences in multivariate observations in two dimensions. For example, the two-dimensional coordinate axes might represent latitude and longitude (geographical location), and the faces might represent multivariate measurements on several U.S. cities. Additional examples of this kind are discussed in [29].

There are several ingenious ways to picture multivariate data in two dimensions. We have described some of them. Further advances are possible and will almost certainly take advantage of improved computer graphics.

1.5 DISTANCE
Although they may at first appear formidable, most multivariate techniques are based upon the simple concept of distance. Straight-line, or Euclidean, distance should be familiar. If we consider the point P = (x1, x2) in the plane, the straight-line distance, d(O, P), from P to the origin O = (0, 0) is, according to the Pythagorean theorem,

d(O, P) = \sqrt{x_1^2 + x_2^2}    (1-9)
1
The situation is illustrated in Figure 1.13. In general, if the point P has p coordinates so that P = (x1, x2, ..., xp), the straight-line distance from P to the origin O = (0, 0, ..., 0) is

d(O, P) = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}    (1-10)

(See Chapter 2.) All points (x1, x2, ..., xp) that lie a constant squared distance, such as c^2, from the origin satisfy the equation

d^2(O, P) = x_1^2 + x_2^2 + \cdots + x_p^2 = c^2    (1-11)

Because this is the equation of a hypersphere (a circle if p = 2), points equidistant from the origin lie on a hypersphere.

The straight-line distance between two arbitrary points P and Q with coordinates P = (x1, x2, ..., xp) and Q = (y1, y2, ..., yp) is given by

d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}    (1-12)
···
+ (xP - Yp ) 2
(1-12)
Straight-line, or Euclidean, distance is unsatisfactory for most statistical pur poses. This is because each coordinate contributes equally to the calculation of d ( O. P ) =
jx l · 0
t X
p
1- x r-----+ 1
z
t
-
Figure 1 . 1 3
Distance given by the Pythagorean theorem.
29
When the coordinates represent measurements that are subject to random fluctuations of differing magnitudes, it is often desirable to weight coordinates subject to a great deal of variability less heavily than those that are not highly variable. This suggests a different measure of distance.

Our purpose now is to develop a "statistical" distance that accounts for differences in variation and, in due course, the presence of correlation. Because our choice will depend upon the sample variances and covariances, at this point we use the term statistical distance to distinguish it from ordinary Euclidean distance. It is statistical distance that is fundamental to multivariate analysis.

To begin, we take as fixed the set of observations graphed as the p-dimensional scatter plot. From these, we shall construct a measure of distance from the origin to a point P = (x1, x2, ..., xp). In our arguments, the coordinates (x1, x2, ..., xp) of P can vary to produce different locations for the point. The data that determine distance will, however, remain fixed.

To illustrate, suppose we have n pairs of measurements on two variables. Call the variables x1 and x2, and assume that the x1 measurements vary independently of the x2 measurements.1 In addition, assume that the variability in the x1 measurements is larger than the variability in the x2 measurements. A scatter plot of the data would look something like the one pictured in Figure 1.14.

Glancing at Figure 1.14, we see that values which are a given deviation from the origin in the x1 direction are not as "surprising" or "unusual" as are values equidistant from the origin in the x2 direction. This is because the inherent variability in the x1 direction is greater than the variability in the x2 direction. Consequently, large x1 coordinates (in absolute value) are not as unexpected as large x2 coordinates. It seems reasonable, then, to weight an x2 coordinate more heavily than an x1 coordinate of the same value when computing the "distance" to the origin.

One way to proceed is to divide each coordinate by the sample standard deviation. Therefore, upon division by the standard deviations, we have the "standardized" coordinates x1* = x1/\sqrt{s_{11}} and x2* = x2/\sqrt{s_{22}}.
•
Figure 1.14 A scatter plot with greater variability in the x1 direction than in the x2 direction.
1 At this point, "independently" means that the x2 measurements cannot be predicted with any accuracy from the x1 measurements, and vice versa.
Aspects of M u ltivariate Ana lysis
The standardized coordinates are now on an equal footing with one another. After taking the differences in variability into account, we determine distance using the standard Euclidean formula.

Thus, a statistical distance of the point P = (x1, x2) from the origin O = (0, 0) can be computed from its standardized coordinates x1* = x1/\sqrt{s_{11}} and x2* = x2/\sqrt{s_{22}} as

d(O, P) = \sqrt{(x_1^*)^2 + (x_2^*)^2} = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}}    (1-13)

Comparing (1-13) with (1-9), we see that the difference between the two expressions is due to the weights k1 = 1/s11 and k2 = 1/s22 attached to x1^2 and x2^2 in (1-13). Note that if the sample variances are the same, k1 = k2, and x1^2 and x2^2 will receive the same weight. In cases where the weights are the same, it is convenient to ignore the common divisor and use the usual Euclidean distance formula. In other words, if the variability in the x1 direction is the same as the variability in the x2 direction, and the x1 values vary independently of the x2 values, Euclidean distance is appropriate.

Using (1-13), we see that all points which have coordinates (x1, x2) and are a constant squared distance c^2 from the origin must satisfy

\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} = c^2    (1-14)
Equation (1-14) is the equation of an ellipse centered at the origin whose major and minor axes coincide with the coordinate axes. That is, the statistical distance in (1-13) has an ellipse as the locus of all points a constant distance from the origin. This general case is shown in Figure 1.15.

Example 1.11
(Calculating a statistical distance)
A set of paired measurements (x1, x2) on two variables yields x̄1 = x̄2 = 0, s11 = 4, and s22 = 1. Suppose the x1 measurements are unrelated to the x2 measurements; that is, measurements within a pair vary independently of one another.
Figure 1.15 The ellipse of constant statistical distance d^2(O, P) = x_1^2/s_{11} + x_2^2/s_{22} = c^2.
31
Since the sample variances are unequal, we measure the square of the distance of an arbitrary point P = (x1, x2) to the origin O = (0, 0) by

d^2(O, P) = \frac{x_1^2}{4} + \frac{x_2^2}{1}

All points (x1, x2) that are a constant distance 1 from the origin satisfy the equation

\frac{x_1^2}{4} + \frac{x_2^2}{1} = 1

The coordinates of some points a unit distance from the origin are presented in the following table:

Coordinates (x1, x2)      Distance: x_1^2/4 + x_2^2/1 = 1
(0, 1)                    0^2/4 + 1^2/1 = 1
(0, -1)                   0^2/4 + (-1)^2/1 = 1
(2, 0)                    2^2/4 + 0^2/1 = 1
(1, \sqrt{3}/2)           1^2/4 + (\sqrt{3}/2)^2/1 = 1
1 1 1 1
A plot of the equation x_1^2/4 + x_2^2/1 = 1 is an ellipse centered at (0, 0) whose major axis lies along the x1 coordinate axis and whose minor axis lies along the x2 coordinate axis. The half-lengths of these major and minor axes are \sqrt{4} = 2 and \sqrt{1} = 1, respectively. The ellipse of unit distance is plotted in Figure 1.16. All points on the ellipse are regarded as being the same statistical distance from the origin; in this case, a distance of 1. •
Figure 1.16 Ellipse of unit distance, x_1^2/4 + x_2^2/1 = 1.
Aspects of M u l tivariate Analysis
The expression in (1-13) can be generalized to accommodate the calculation of statistical distance from an arbitrary point P = (x1, x2) to any fixed point Q = (y1, y2). If we assume that the coordinate variables vary independently of one another, the distance from P to Q is given by

d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}}}    (1-15)

The extension of this statistical distance to more than two dimensions is straightforward. Let the points P and Q have p coordinates such that P = (x1, x2, ..., xp) and Q = (y1, y2, ..., yp). Suppose Q is a fixed point [it may be the origin O = (0, 0, ..., 0)] and the coordinate variables vary independently of one another. Let s11, s22, ..., spp be sample variances constructed from n measurements on x1, x2, ..., xp, respectively. Then the statistical distance from P to Q is

d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \cdots + \frac{(x_p - y_p)^2}{s_{pp}}}    (1-16)

All points P that are a constant squared distance from Q lie on a hyperellipsoid centered at Q whose major and minor axes are parallel to the coordinate axes. We note the following:

1. The distance of P to the origin O is obtained by setting y1 = y2 = ... = yp = 0 in (1-16).
2. If s11 = s22 = ... = spp, the Euclidean distance formula in (1-12) is appropriate.
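Equation (1-16) is easily evaluated in code: each squared coordinate difference is divided by the corresponding sample variance before summing. The short Python sketch below is an added illustration, not part of the original text; it also reproduces the unit distances of Example 1.11, where s11 = 4 and s22 = 1.

```python
import numpy as np

def statistical_distance(x, y, sample_vars):
    """Distance (1-16): squared coordinate differences weighted by reciprocal sample variances.
    Assumes the coordinate variables vary independently of one another."""
    x, y, sample_vars = map(np.asarray, (x, y, sample_vars))
    return np.sqrt(np.sum((x - y) ** 2 / sample_vars))

# Example 1.11: s11 = 4, s22 = 1; these points are all unit distance from the origin.
s = [4.0, 1.0]
for p in [(0, 1), (0, -1), (2, 0), (1, np.sqrt(3) / 2)]:
    print(p, statistical_distance(p, (0, 0), s))   # each prints 1.0 (up to rounding)
```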
The distance in (1-16) still does not include most of the important cases we shall encounter, because of the assumption of independent coordinates. The scatter plot in Figure 1.17 depicts a two-dimensional situation in which the x1 measurements do not vary independently of the x2 measurements.
Figure 1.17 A scatter plot for positively correlated measurements and a rotated coordinate system.
Dista nce
In fact, the coordinates of the pairs (x1, x2) exhibit a tendency to be large or small together, and the sample correlation coefficient is positive. Moreover, the variability in the x2 direction is larger than the variability in the x1 direction.

What is a meaningful measure of distance when the variability in the x1 direction is different from the variability in the x2 direction and the variables x1 and x2 are correlated? Actually, we can use what we have already introduced, provided that we look at things in the right way. From Figure 1.17, we see that if we rotate the original coordinate system through the angle θ while keeping the scatter fixed and label the rotated axes x̃1 and x̃2, the scatter in terms of the new axes looks very much like that in Figure 1.14. (You may wish to turn the book to place the x̃1 and x̃2 axes in their customary positions.) This suggests that we calculate the sample variances using the x̃1 and x̃2 coordinates and measure distance as in Equation (1-13). That is, with reference to the x̃1 and x̃2 axes, we define the distance from the point P = (x̃1, x̃2) to the origin O = (0, 0) as

d(O, P) = \sqrt{\frac{\tilde{x}_1^2}{\tilde{s}_{11}} + \frac{\tilde{x}_2^2}{\tilde{s}_{22}}}    (1-17)

where s̃11 and s̃22 denote the sample variances computed with the x̃1 and x̃2 measurements.

The relation between the original coordinates (x1, x2) and the rotated coordinates (x̃1, x̃2) is provided by

\tilde{x}_1 = x_1 \cos(\theta) + x_2 \sin(\theta)
\tilde{x}_2 = -x_1 \sin(\theta) + x_2 \cos(\theta)    (1-18)
sin ( 0)
Given the relations in (1-18), we can formally substitute for x̃1 and x̃2 in (1-17) and express the distance in terms of the original coordinates. After some straightforward algebraic manipulations, the distance from P = (x1, x2) to the origin O = (0, 0) can be written in terms of the original coordinates x1 and x2 of P as

d(O, P) = \sqrt{a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2}    (1-19)

where the a's are numbers such that the distance is nonnegative for all possible values of x1 and x2. Here a11, a12, and a22 are determined by the angle θ, and s11, s12, and s22 are calculated from the original data.2 The particular forms for a11, a12, and a22 are not important at this point.
2 cos ( 0) + 2 sin ( O) cos ( O)s 1 2 2 _ - -,-- --- - sin-( 0)-- .. az-, cos- (O)s 1 1 + 2 sin ( O) cos ( O) s 1 2 and a1 1 cos2 ( 0)s 1 1 _
_
_ _ __ _
a2 , = cos2 ( 0)su
+
__
cos ( 0) sin ( 0) 2 sin ( O) cos ( O)s 1 2
+
sin2 ( 0) s22
- . - --
+
+
-
--
sin2 ( 0) s 2 2
+ +
2 ( 0_!_) · -c- :-=s= i n:_'-' cos ( 0) s22 - 2 sin ( O) cos (.,---,O) s 1 2 2
__
----=-
__
cos2 ( 0) cos2 ( 0) s22 - 2 sin ( O) cos ( O) s 1 2
+
__
2 sin ( 0) s 1 1 2 sin ( 0)sn
- - ----- --- -----·----
sin ( e) cos ( 0) 2 2 sin ( 0) s22 cos (0)s 2 - 2 sin ( O) cos ( O) s 1 2 2
+
+
2 sin ( 0)su
Aspects of M u ltiva riate Analysis
What is important is the appearance of the cross-product term 2a12x1x2 necessitated by the nonzero correlation r12.

Equation (1-19) can be compared with (1-13). The expression in (1-13) can be regarded as a special case of (1-19) with a11 = 1/s11, a22 = 1/s22, and a12 = 0.

In general, the statistical distance of the point P = (x1, x2) from the fixed point Q = (y1, y2) for situations in which the variables are correlated has the general form

d(P, Q) = \sqrt{a_{11}(x_1 - y_1)^2 + 2 a_{12}(x_1 - y_1)(x_2 - y_2) + a_{22}(x_2 - y_2)^2}    (1-20)

and can always be computed once a11, a12, and a22 are known. In addition, the coordinates of all points P = (x1, x2) that are a constant squared distance c^2 from Q satisfy

a_{11}(x_1 - y_1)^2 + 2 a_{12}(x_1 - y_1)(x_2 - y_2) + a_{22}(x_2 - y_2)^2 = c^2    (1-21)

By definition, this is the equation of an ellipse centered at Q. The graph of such an equation is displayed in Figure 1.18. The major (long) and minor (short) axes are indicated. They are parallel to the x̃1 and x̃2 axes. For the choice of a11, a12, and a22 in footnote 2, the x̃1 and x̃2 axes are at an angle θ with respect to the x1 and x2 axes.

The generalization of the distance formulas of (1-19) and (1-20) to p dimensions is straightforward. Let P = (x1, x2, ..., xp) be a point whose coordinates represent variables that are correlated and subject to inherent variability. Let O = (0, 0, ..., 0) denote the origin, and let Q = (y1, y2, ..., yp) be a specified fixed point. Then the distances from P to O and from P to Q have the general forms

d(O, P) = \sqrt{a_{11} x_1^2 + a_{22} x_2^2 + \cdots + a_{pp} x_p^2 + 2 a_{12} x_1 x_2 + 2 a_{13} x_1 x_3 + \cdots + 2 a_{p-1,p} x_{p-1} x_p}    (1-22)

and
Figure 1.18 Ellipse of points a constant distance from the point Q.
d(P, Q) = \sqrt{a_{11}(x_1 - y_1)^2 + a_{22}(x_2 - y_2)^2 + \cdots + a_{pp}(x_p - y_p)^2 + 2 a_{12}(x_1 - y_1)(x_2 - y_2) + 2 a_{13}(x_1 - y_1)(x_3 - y_3) + \cdots + 2 a_{p-1,p}(x_{p-1} - y_{p-1})(x_p - y_p)}    (1-23)
{1-23)
where the a's are numbers such that the distances are always nonnegative.3

We note that the distances in (1-22) and (1-23) are completely determined by the coefficients (weights) a_{ik}, i = 1, 2, ..., p, k = 1, 2, ..., p. These coefficients can be set out in the rectangular array

\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{12} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1p} & a_{2p} & \cdots & a_{pp} \end{bmatrix}    (1-24)

where the a_{ik}'s with i ≠ k are displayed twice, since they are multiplied by 2 in the distance formulas. Consequently, the entries in this array specify the distance functions. The a_{ik}'s cannot be arbitrary numbers; they must be such that the computed distance is nonnegative for every pair of points. (See Exercise 1.10.) Contours of constant distances computed from (1-22) and (1-23) are hyperellipsoids. A hyperellipsoid resembles a football when p = 3; it is impossible to visualize in more than three dimensions.

The need to consider statistical rather than Euclidean distance is illustrated heuristically in Figure 1.19. Figure 1.19 depicts a cluster of points whose center of gravity (sample mean) is indicated by the point Q. Consider the Euclidean distances from the point Q to the point P and the origin O. The Euclidean distance from Q to P is larger than the Euclidean distance from Q to O. However, P appears to be more like the points in the cluster than does the origin. If we take into account the variability of the points in the cluster and measure distance by the statistical distance in (1-20), then Q will be closer to P than to O.
Figure 1.19 A cluster of points relative to a point P and the origin O.
3 The algebraic expressions for the squares of the distances in (1-22) and (1-23) are known as quadratic forms and, in particular, positive definite quadratic forms. It is possible to display these quadratic forms in a simpler manner using matrix algebra; we shall do so in Section 2.3 of Chapter 2.
Aspects of M u l tivariate Ana lysis
This result seems reasonable given the nature of the scatter.

Other measures of distance can be advanced. (See Exercise 1.12.) At times, it is useful to consider distances that are not related to circles or ellipses. Any distance measure d(P, Q) between two points P and Q is valid provided that it satisfies the following properties, where R is any other intermediate point:

d(P, Q) = d(Q, P)
d(P, Q) > 0 if P ≠ Q
d(P, Q) = 0 if P = Q
d(P, Q) ≤ d(P, R) + d(R, Q)    (triangle inequality)    (1-25)
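The distances (1-20) and (1-23) are square roots of quadratic forms in the coordinate differences, so once the array (1-24) of coefficients is specified they can be computed with a single matrix product. The sketch below illustrates this in Python and checks two of the properties in (1-25) numerically; the particular coefficient matrix (a11 = 4, a22 = 1, a12 = -1, as in Exercise 1.11) and the points P, Q, and R are chosen only for illustration.

```python
import numpy as np

def general_distance(x, y, A):
    """Statistical distance (1-23): square root of the quadratic form (x - y)' A (x - y).
    A is the symmetric array (1-24) and must make the form nonnegative."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ A @ d))

A = np.array([[4.0, -1.0],      # a11, a12  (values borrowed from Exercise 1.11)
              [-1.0, 1.0]])     # a12, a22

P, Q, R = (1.0, 2.0), (-1.0, 0.5), (0.0, 0.0)
print(general_distance(P, Q, A) == general_distance(Q, P, A))      # symmetry: d(P, Q) = d(Q, P)
print(general_distance(P, Q, A)
      <= general_distance(P, R, A) + general_distance(R, Q, A))    # triangle inequality
```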
1.6 FINAL COMMENTS
We have attempted to motivate the study of multivariate analysis and to provide you with some rudimentary, but important, methods for organizing, summarizing, and displaying data. In addition, a general concept of distance has been introduced that will be used repeatedly in later chapters.

EXERCISES
1.1. Consider the seven pairs of measurements (x1, x2) plotted in Figure 1.1:

x1:  3   4    2   6   8    2   5
x2:  5   5.5  4   7   10   5   7.5

Calculate the sample means x̄1 and x̄2, the sample variances s11 and s22, and the sample covariance s12.

1.2. A morning newspaper lists the following used-car prices for a foreign compact, with age x1 measured in years and selling price x2 measured in thousands of dollars:

x1:   8     9     10    11    7     5     7     7     5     3
x2:  2.30  1.90  1.00   .70   .30  1.00  1.05   .45   .70   .30

(a) Construct a scatter plot of the data and marginal dot diagrams.
(b) Infer the sign of the sample covariance s12 from the scatter plot.
(c) Compute the sample means x̄1 and x̄2 and the sample variances s11 and s22. Compute the sample covariance s12 and the sample correlation coefficient r12. Interpret these quantities.
Chap. 1
1.3.
Exercises
37
(d) Display the sample mean array x, the sample variance-covariance array S11, and the sample correlation array R using (1-8).
The following are five measurements on the variables x1 , x 2 , and x3:
Xz
2 8 4
9 12 3
6 6 0
5 4 2
8 10 1
Find the arrays x, S11, and R. 1.4. The 10 largest U.S. industrial corporations yield the following data:
x3 = assets x1 = sales x2 = profits (millions of dollars) (millions of dollars) (millions of dollars)
Company General Motors Ford Exxon IBM General Electric Mobil Philip Morris Chrysler Du Pont Texaco
Source: "Fortune 500," (a)
126,974 96,933 86,656 63,438 55,264 50,976 39,069 36,156 35,209 32,416 Fortune,
4,224 3,835 3,510 3,758 3,939 1,809 2,946 359 2,480 2,413
121
173,297 160,893 83,219 77,734 128,344 39,080 38,528 51,038 34,715 25,636
(April 23, 1990), 346-367.
Plot the scatter diagram and marginal dot diagrams for variables x1 and
x2• Comment on the appearance of the diagrams. (b) Compute :X1 , x2 , s 1 1 , s22 , and r1 2 . Interpret r1 2• Sw
Use the data in Exercise 1.4. (a) Plot the scatter diagrams and dot diagrams for ( x 2 , x3) and ( x1 , x3). Com ment on the patterns. (b) Compute the x, S11, and R arrays for ( x1 , x 2 , x3 ) . 1.6. The data in Table 1.3 are 42 measurements on air-pollution variables recorded at 12:00 noon in the Los Angeles area on different days. (See also the air-pollution data on the data disk.) (a) Plot the marginal dot diagrams for all the variables. (b) Construct the x , S11 , and R arrays, and interpret the entries in R. 1.7. You are given the following n = 3 observations on p = 2 variables: 1.5.
Variable 1: Variable 2:
x1 2 = 1 x22 = 2
x3 2 = 4
38
Chap. 1
Aspects of M u ltivariate Analysis
TABLE 1 .3 AI R-POLLUTION DATA
Wind (x 1 )
Solar radiation (x2 )
8 7 7 10 6 8 9 5 7 8 6 6 7 10 10 9 8 8 9 9 10 9 8 5 6 8 6 8 6 10 8 7 5 6 10 8 5 5 7 7 6 8
98 107 103 88 91 90 84 72 82 64 71 91 72 70 72 77 76 71 67 69 62 88 80 30 83 84 78 79 62 37 71 52 48 75 35 85 86 86 79 79 68 40
CO (x3 ) NO (x4 ) N02 (x5) 03 (x6 ) 7 4 4 5 4 5 7 6 5 5 5 4 7 4 4 4 4 5 4 3 5 4 4 3 5 3 4 2 4 3 4 4 6 4 4 4 3 7 7 5 6 4
Source: Data courtesy of Profes or G. C. Tiao.
2 3 3 2 2 2 4 4 1 2 4 2 4 2 1 1 1 3 2 3 3 2 2 3 1 2 2 1 3 1 1 1 5 1 1 1 1 2 4 2 2 3
12 9 5 8 8 12 12 21 11 13 10 12 18 11 8 9 7 16 13 9 14 7 13 5 10 7 11 7 9 7 10 12 8 10 6 9 6 13 9 8 11 6
8 5 6 15 10 12 15 14 11 9 3 7 10 7 10 10 7 4 2 5 4 6 11 2 23 6 11 10 8 2 7 8
4
24 9 10 12 18 25 6 14 5
HC (x7 ) 2 3 3 4 3 4 5 4 3 4 3 3
3
3 3 3 3 4 3 3 4 3 4 3 4 3 3 3 3 3 3 4
3
3 2 2 2 2 3 2 3 2
Chap. 1
Exercises
39
Plot the pairs of observations in the two-dimensional "variable space." That is, construct a two-dimensional scatter plot of the data. (b) Plot the data as two points in the three-dimensional "item space." 1.8. Evaluate the distance of the point P = ( -1, -1) to the point Q = (1, 0) using the Euclidean distance formula in (1-12) with p = 2 and using the statistical distance in (1-20) with a 1 1 = 1/3, a22 = 4/27, and a 1 2 = 1/9. Sketch the locus of points that are a constant squared statistical distance 1 from the point Q. 1.9. Consider the following eight pairs of measurements on two variables x1 and x 2 : (a)
�
1 2 5 6 8 x1 - 6 - 3 - 2 1 -1 2 1 5 3 x2 - 2 - 3 (a)
Plot the data as a scatter diagram, and compute s 1 1 , s22 , and s 1 2 .
(b) Using (1-18), calculate the corresponding measurements on variables x1
and .X2 , assuming that the original coordinate axes are rotated through an angle of 0 = 26° [given cos (26°) = .899 and sin (26°) = .438]. (c) Using the x 1 and .X2 measurements from (b), compute the sample vari ances s l l and s22 . (d) Consider the new pair of measurements (x 1 , x 2 ) = (4, -2). Transform these to measurements on .X1 and .X2 using (1-18), and calculate the dis tance d(O, P) of the new point P = (.X1 , .X2 ) from the origin 0 = (0, 0) using (1-17). Note: You will need s1 1 and s22 from (c). (e) Calculate the distance from P = (4, -2) to the origin 0 = (0, 0) using (1-19) and the expressions for a l l , a22 , and a 12 in footnote 2. Note: You will need s l l , s22 , and s 1 2 from (a). Compare the distance calculated here with the distance calculated using the .X 1 and .X2 values in (d). (Within rounding error, the numbers should be the same.) 1.10. Are the following distance functions valid for distance from the origin? Explain. (a) + 4x � + x 1 x2 = (distance) 2 - 2x� = (distance ) 2 (b) 1.11. Verify that distance defined by (1-20) with a l l = 4, a22 = 1 , and a 1 2 = - 1 sat isfies the first three conditions in (1-25). (The triangle inequality is more dif ficult to verify.) 1.12. Define the distance from the point P = (x 1 , x 2 ) to the origin 0 = (0, 0) as
xi xi
(a)
d(O, P) = max ( l x1 1 , l xz l ) Compute the distance from P = (-3, 4) to the origin.
(b) Plot the locus of points whose squared distance from the origin is 1.
(c) Generalize the foregoing distance expression to points in p dimensions. 1.13.
A large city has major roads laid out in a grid pattern, as indicated in the fol lowing diagram. Streets 1 through 5 run north-south (NS), and streets A
40
Chap. 1
Aspects of M u ltiva riate Ana lysis
through E run east-west (EW). Suppose there ate retail stores located at intersections (A, 2), (E, 3), and (C, 5). Assume the distance along a street between two intersections in either the NS or EW direction is 1 unit. Define the distance between any two intersections (points) on the grid to be the "city block" distance. (For example, the distance between intersections (D, 1) and (C, 2), which we might call d((D, 1), (C, 2)), is given by d((D, 1), (C, 2)) = d ((D , 1), (D, 2)) + d ( (D, 2), (C, 2)) = 1 + 1 = 2. Also, d((D, 1), (C, 2)) = d ((D, 1), (C, 1)) + d (( C, 1), (C, 2)) = 1 + 1 = 2.)
B r---+---+---,_--� C
t---t-----1f--+- -�FP2n
D r---+---+---,_--� E
Locate a supply facility (warehouse) at an intersection such that the sum of the distances from the warehouse to the three retail stores is minimized. The following exercises contain fairly extensive data sets. A computer may be nec
essary for the required calculations.
Table 1.4 contains some of the raw data discussed in Section 1 .2. (See also the multiple-sclerosis data on the disk.) Two different visual stimuli (S1 and S2) produced responses in both the left eye (L) and the right eye (R) of sub jects in the study groups. The values recorded in the table include x 1 (sub ject ' s age); x 2 (total response of both eyes to stimulus S1, that is, S1L + S1R ); x3 (difference between responses of eyes to stimulus S1, j S1L - S1R j); and so forth. (a) Plot the two-dimensional scatter diagram for the variables x 2 and x4 for the multiple-sclerosis group. Comment on the appearance of the diagram. (b) Compute the i , Sn , and R arrays for the non-multiple-sclerosis and mul tiple-sclerosis groups separately. 1.15. Some of the 98 measurements described in Section 1.2 are listed in Table 1.5. (See also the radiotherapy data on the data disk.) The data consist of aver age ratings over the course of treatment for patients undergoing radiother apy. Variables measured include x 1 (number of symptoms, such as sore throat or nausea); x2 (amount of activity, on a 1-5 scale); x3 (amount of sleep, on a 1-5 scale); x4 (amount of food consumed, on a 1-3 scale); x5 (appetite, on a 1-5 scale); and x6 (skin reaction, on a 0-3 scale).
1.14.
Chap. 1
41
Exercises
TABLE 1 .4 M U LTI P LE-SCLEROSIS DATA
Non-Multiple-Sclerosis Group Data Subject number
x3
x4
152.0 138.0 144.0 143.6 148.8
1.6 .4 .0 3.2 .0
198.4 180.8 186.4 194.8 217.6
.0 1.6 .8 .0 .0
154.4 171.2 157.2 175.2 155.0
2.4 1.6 .4 5.6 1.4
205.2 210.4 204.8 235.6 204.4
6.0 .8 .0 .4 .0
x3
x4
Xs
xl
Xz
1 2 3 4 5
18 19 20 20 20
65 66 67 68 69
67 69 73 74 79
(Age) (S1L
+ S1R)
J S1L - S1R J (S2L
+ S2R )
Xs
I S2L - S2R I
Multiple-Sclerosis Group Data Subject number
xl
Xz
1 2 3 4 5
23 25 25 28 29
148.0 195.2 158.0 134.4 190.2
.8 3.2 8.0 .0 14.2
205.4 262.8 209.8 198.4 243.8
.6 .4 12.2 3.2 10.6
25 26 27 28 29
57 58 58 58 59
165.6 238.4 164.0 169.8 199.8
16.8 8.0 .8 .0 4.6
229.2 304.4 216.8 219.2 250.2
15.6 6.0 .8 1.6 1.0
Source: Data courtesy of Dr. G. G. Celesia. Construct the two-dimensional scatter plot for variables x 2 and x3 and the marginal dot diagrams (or histograms). Do there appear to be any errors in the x3 data? (b) Compute the i, Sn , and R arrays. Interpret the pairwise correlations. 1.16. At the start of a study to determine whether exercise or dietary supplements would slow bone loss in older women, an investigator measured the mineral content of bones by photon absorptiometry. Measurements were recorded (a)
42
Chap. 1
Aspects of M u l tiva riate Analysis
TABLE 1 .5 x1
Symptoms
RADIOTH ERAPY DATA
Xz
x3
x4
Activity
Sleep
Eat
.889 2.813 1.454 .294 2.727
1.389 1.437 1.091 .941 2.545
1.555 .999 2.364 1.059 2.819
4.100 .125 6.231 3.000 .889
1.900 1.062 2.769 1.455 1.000
2.800 1.437 1.462 2.090 1.000
Xs
x6
2.222 2.312 2.455 2.000 2.727
Appetite 1.945 2.312 2.909 1.000 4.091
Skin reaction 1.000 2.000 3.000 1 .000 .000
2.000 1.875 2.385 2.273 2.000
2.600 1.563 4.000 3.272 1.000
2.000 .000 2.000 2.000 2.000
Tealesy,. RowsR. N.contValauiesninofg valuandes of lesandthanle1.s0 Source: lectionteproces arthane due1.0tDatmayo erarbeocourtrsomiinettshyeed.ofdatMra cols. Annet x2
x3
x2
x3
for three bones on the dominant and nondominant sides and are shown in Table 1 .6. (See also the mineral-content data on the data disk.) Compute the x, S11, and R arrays. Interpret the pairwise correlations. 1.17. Some of the data described in Section 1 .2 are listed in Table 1 .7. (See also the national-track-records data on the data disk.) The national track records for women in 55 countries can be examined for the relationships among the running events. Compute the x, S11 , and R arrays. Notice the magnitudes of the correlation coefficients as you go from the shorter (100-meter) to the longer (marathon) running distances. Interpret these pairwise correlations. 1.18. Convert the national track records for women in Table 1.7 to speeds mea sured in meters per second. For example, the record speed for the 100-m dash for Argentinian women is 100 m/1 1.61 sec = 8.613 m/sec. Notice that the records for the 800-m, 1500-m, 3000-m and marathon runs are measured in minutes. The marathon is 26.2 miles, or 42,195 meters, long. Compute the x , S11, and R arrays. Notice the magnitudes of the correlation coefficients as you go from the shorter (100 m) to the longer (marathon) running distances. Interpret these pairwise correlations. Compare your results with the results you obtained in Exercise 1 .17. 1.19. Create the scatter plot and box plot displays of Figure 1 .5 for (a) the mineral content data in Table 1.6 and (b) the national-track-records data in Table 1.7. 1.20. Refer to the bankruptcy data in Table 11.4, page 712, and on the data disk. Using appropriate computer software: (a) View the entire data set in x 1 , x 2 , x3 space. Rotate the coordinate axes in various directions. Check for unusual observation s.
Chap. 1
TABLE 1 .6
M I N ERAL CONTENT I N BON ES
Subject number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Dominant radius 1.103 .842 .925 .857 .795 .787 .933 .799 .945 .921 .792 .815 .755 .880 .900 .764 .733 .932 .856 .890 .688 .940 .493 .835 .915
Exercises
43
Radius
Dominant humerus
Humerus
Dominant ulna
Ulna
1.052 .859 .873 .744 .809 .779 .880 .851 .876 .906 .825 .751 .724 .866 .838 .757 .748 .898 .786 .950 .532 .850 .616 .752 .936
2.139 1.873 1.887 1.739 1 .734 1.509 1.695 1.740 1.811 1.954 1.624 2.204 1.508 1.786 1.902 1.743 1.863 2.028 1.390 2.187 1.650 2.334 1.037 1.509 1.971
2.238 1.741 1.809 1 .547 1 .715 1 .474 1.656 1.777 1 .759 2.009 1 .657 1 .846 1.458 1 .811 1.606 1 .794 1 .869 2.032 1.324 2.087 1 .378 2.225 1 .268 1.422 1 .869
.873 .590 .767 .706 .549 .782 .737 .618 .853 .823 .686 .678 .662 .810 .723 .586 .672 .836 .578 .758 .533 .757 .546 .618 .869
.872 .744 .713 .674 .654 .571 .803 .682 .777 .765 .668 .546 .595 .819 .677 .541 .752 .805 .610 .718 .482 .731 .615 .664 .868
Source: Data courtesy of Everet Smith. (b) Highlight the set of points corresponding to the bankrupt firms. Examine
various three-dimensional perspectives. Are there some orientations of three-dimensional space for which the bankrupt firms can be distin guished from the nonbankrupt firms? Are there observations in each of the two groups that are likely to have a significant impact on any rule developed to classify firms based on the sample means, variances, and covariances calculated from these data? ( See Exercise 11.24.) 1.21. Refer to the milk transportation-cost data in Table 6.8, page 366, and on the data disk. Using appropriate computer software: (a) View the entire data set in three dimensions. Rotate the coordinate axes in various directions. Check for unusual observations.
44
Chap. 1
TABLE 1 . 7
Aspects of M u ltiva riate Analysis
NATIONAL TRACK RECORDS FOR WOM EN
Country Argentina Australia Austria Belgium Bermuda Brazil Burma Canada Chile China Colombia Cook Islands Costa Rica Czechoslovakia Denmark Dominican Republic Finland France German Democratic Republic Federal Republic of Germany Great Britain and Northern Ireland Greece Guatemala Hungary India Indonesia Ireland Israel Italy Japan Kenya Korea Democratic People ' s Republic of Korea Luxembourg Malaysia Mauritius Mexico Netherlands
100 m (s)
200 m (s )
400 m (s)
800 m ( min )
1500 m (min )
11.61 11.20 11 .43 11.41 11 .46 11.31 12.14 11 .00 12.00 11.95 11 .60 12.90 11.96 11 .09 11.42 11.79 11.13 11.15
22.94 22.35 23.09 23.04 23.05 23.17 24.47 22.25 24.52 . 24.41 24.00 27.10 24.60 21.97 23.52 24.05 22.39 22.59
54.50 51.08 50.62 52.00 53.30 52.80 55.00 50.06 54.90 54.97 53.26 60.40 58.25 47.99 53.60 56.05 50.14 51.73
2.15 1.98 1 .99 2.00 2.16 2.10 2.18 2.00 2.05 2.08 2.11 2.30 2.21 1.89 2.03 2.24 2.03 2.00
4.43 4.13 4.22 4.14 4.58 4.49 4.45 4.06 4.23 4.33 4.35 4.84 4.68 4.14 4.18 4.74 4.10 4.14
9.79 9.08 9.34 8.88 9.81 9.77 9.51 8.81 9.37 9.31 9.46 11.10 10.43 8.92 8.71 9.89 8.92 8.98
178.52 152.37 159.37 157.85 169.98 168.75 191.02 149.45 171.38 168.48 165.42 233.22 171.80 158.85 151.75 203.88 154.23 155.27
10.81
21.71
48.16
1.93
3.96
8.75
157.68
11.01
22.39
49.75
1.95
4.03
8.59
148.53
11.00 11.79 11.84 1 1.45 11.95 11 .85 11.43 11.45 1 1.29 11.73 11.73 11 .96
22.13 24.08 24.54 23.06 24.28 24.24 23.51 23.57 23.00 24.00 23.88 24.49
50.46 54.93 56.09 51.50 53.60 55.34 53.24 54.90 52.01 53.73 52.70 55.70
1.98 2.07 2.28 2.01 2.10 2.22 2.05 2.10 1.96 2.09 2.00 2.15
4.03 4.35 4.86 4.14 4.32 4.61 4.11 4.25 3.98 4.35 4.15 4.42
8.62 9.87 10.54 8.98 9.98 10.02 8.89 9.37 8.63 9.20 9.20 9.62
149.72 182.20 215.08 156.37 188.03 201.28 149.38 160.48 151.82 150.50 181.05 164.65
12.25 12.03 12.23 11.76 11.89 11.25
25.78 24.96 24.21 25.08 23.62 22.81
51 .20 56.10 55.09 58.10 53.76 52.38
1.97 2.07 2.19 2.27 2.04 1 .99
4.25 4.38 4.69 4.79 4.25 4.06
9.35 9.64 10.46 10.90 9.59 9.01
179.17 174.68 182.17 261.13 158.53 152.48
3000 m Marathon ( min ) ( min )
Chap. 1
TABLE 1 . 7 (continued)
45
NATIONAL TRACK RECO RDS FOR WOM EN
Country
lOO m (s)
200 m (s)
400 m (s)
800 m (min)
1500 m (min)
New Zealand Norway Papua New Guinea Philippines Poland Portugal Rumania Singapore Spain Sweden Switzerland Taiwan Thailand Turkey U.S.A. U.S.S.R. Western Samoa
11.55 11.58 12.25 11.76 11.13 11.81 11 .44 12.30 1 1.80 11.16 11.45 11.22 11.75 11 .98 10.79 1 1.06 12.74
23.13 23.31 25.07 23.54 22.21 24.22 23.46 25.00 23.98 22.82 23.31 22.62 24.46 24.44 21.83 22.19 25.85
51.60 53.12 56.96 54.60 49.29 54.30 51 .20 55.08 53.59 51 .79 53.11 52.50 55.80 56.45 50.62 49.19 58.73
2.02 2.03 2.24 2.19 1.95 2.09 1.92 2.12 2.05 2.02 2.02 2.10 2.20 2.15 1.96 1.89 2.33
4.18 4.01 4.84 4.60 3.99 4.16 3.96 4.52 4.14 4.12 4.07 4.38 4.72 4.37 3.95 3.87 5.81
Source:
Exercises
IAAFIA TFS Track and Field Statistics Handbook for the
1 984
3000 m Marathon (min) (min) 8.76 8.53 10.69 10.16 8.97 8.84 8.53 9.94 9.02 8.84 8.77 9.63 10.28 9.38 8.50 8.45 13.04
145.48 145.48 233.00 200.37 160.82 151 .20 165.45 182.77 162.60 154.48 153.42 177.87 168.45 201.08 142.72 151 .22 306.00
Los Angeles Olympics.
(b) Highlight the set of points corresponding to gasoline trucks. Do any of
the gasoline-truck points appear to be multivariate outliers? (See Exer cise 6.17. ) Are there some orientations of x 1 , x 2 , x3 space for which the set of points representing gasoline trucks can be readily distinguished from the set of points representing diesel trucks? 1.22. Refer to the oxygen-consumption data in Table 6.10, page 369, and on the data disk. Using appropriate computer software: (a) View the entire data set in three dimensions employing various combi nations of three variables to represent the coordinate axes. Begin with the x 1 , x 2 , x3 space. (b) Check this data set for outliers. 1.23. Using the data in Table 11.9, page 724, and on the data disk, represent the cereals in each of the following ways. (a) Stars. (b) Chernoff faces. (Experiment with the assignment of variables to facial characteristics.) 1.24. Using the utility data in Table 12.5, page 747, and on the data disk, represent the public utility companies as Chernoff faces with assignments of variables to facial characteristics different from those considered in Example 1.9. Compare your faces with the faces in Figure 1.11. Are different groupings indicated?
46
Chap. 1
Aspects of M u l tivariate Analysis
Using the data in Table 12.5 and on the data disk, represent the 22 public util ity companies as stars. Visually group the companies into four or five clusters. 1.26. The data in Table 1.8 (see the bull data on the data disk) are the measured characteristics of 76 young (less than two years old) bulls sold at auction. Also included in the table are the selling prices (SalePr) of these bulls. The column headings (variables) are defined as follows: 1.25.
Breed
=
FtFrBody
{
1 Angus 5 Hereford 8 Simental
= Yearling height at
YrHgt
= Fat free body (pounds)
shoulder (inches)
PrctFFB
Frame
= Scale from 1(small)
BkFat
SaleHt
= Sale height at
SaleWt
to 8(large)
shoulder (inches)
= Percent fat-free body
= Back fat (inches)
= Sale weight (pounds)
Compute i , Sn , and R arrays. Interpret the pairwise correlations. Do some of these variables appear to distinguish one breed from another? (b) View the data in three dimensions using the variables Breed, Frame, and BkFat. Rotate the coordinate axes in various directions. Check for out liers. Are the breeds well separated in this coordinate system? (c) Repeat part b using Breed, FtFrBody, and SaleHt. Which three-dimen sional display appears to result in the best separation of the three breeds of bulls? (a)
TABLE 1 .8
DATA ON BU LLS
Breed
SalePr
YrHgt
FtFrBody
PrctFFB
Frame
BkFat
SaleHt
SaleWt
1 1 1 1 1
2200 2250 1625 4600 2150
51.0 51.9 49.9 53.1 51.2
1128 1108 1011 993 996
70.9 72.1 71.6 68.9 68.6
7 7 6 8 7
.25 .25 .15 .35 .25 . .10 .15 .10 .10 .15
54.8 55.3 53.1 56.4 55.0
1720 1575 1410 1595 1488
55.2 54.6 53.9 54.9 55.1
1454 1475 1375 1564 1458
'
8 8 8 8 8
1450 1200 1425 1250 1500
51.4 49.8 50.0 50.1 51.7
997 991 928 990 992
Source: Data courtesy of Mark Ellersieck.
73.4 70.8 70.8 71.0 70.6
7 6 6 6 7
Chap. 1
47
References
REFERENCES
1. Becker, R. A., W. S. Cleveland, and A. R. Wilks. "Dynamic Graphics for Data Analysis." Statistical Science, 2, no. 4 (1987), 355-395.
2. Benjamin, Y., and M. Igbaria. "Clustering Categories for Better Prediction of Computer Resources Utilization." Applied Statistics, 40, no. 2 (1991), 295-307.
3. Bhattacharyya, G. K., and R. A. Johnson. Statistical Concepts and Methods. New York: John Wiley, 1977.
4. Bliss, C. I. Statistics in Biology, Vol. 2. New York: McGraw-Hill, 1967.
5. Capon, N., J. Farley, D. Lehman, and J. Hulbert. "Profiles of Product Innovators among Large U.S. Manufacturers." Management Science, 38, no. 2 (1992), 157-169.
6. Chernoff, H. "Using Faces to Represent Points in K-Dimensional Space Graphically." Journal of the American Statistical Association, 68, no. 342 (1973), 361-368.
7. Cochran, W. G. Sampling Techniques (3d ed.). New York: John Wiley, 1977.
8. Cochran, W. G., and G. M. Cox. Experimental Designs (2d ed.). New York: John Wiley, 1957.
9. Davis, J. C. "Information Contained in Sediment Size Analysis." Mathematical Geology, 2, no. 2 (1970), 105-112.
10. Dawkins, B. "Multivariate Analysis of National Track Records." The American Statistician, 43, no. 2 (1989), 110-115.
11. Dunham, R. B., and D. J. Kravetz. "Canonical Correlation Analysis in a Predictive System." Journal of Experimental Education, 43, no. 4 (1975), 35-42.
12. Everitt, B. Graphical Techniques for Multivariate Data. New York: North-Holland, 1978.
13. Gable, G. G. "A Multidimensional Model of Client Success when Engaging External Consultants." Management Science, 42, no. 8 (1996), 1175-1198.
14. Halinar, J. C. "Principal Component Analysis in Plant Breeding." Unpublished report based on data collected by Dr. F. A. Bliss, University of Wisconsin, 1979.
15. Kim, L., and Y. Kim. "Innovation in a Newly Industrializing Country: A Multiple Discriminant Analysis." Management Science, 31, no. 3 (1985), 312-322.
16. Klatzky, S. R., and R. W. Hodge. "A Canonical Correlation Analysis of Occupational Mobility." Journal of the American Statistical Association, 66, no. 333 (1971), 16-22.
17. Linden, M. "Factor Analytic Study of Olympic Decathlon Data." Research Quarterly, 48, no. 3 (Oct. 1977), 562-568.
18. MacCrimmon, K., and D. Wehrung. "Characteristics of Risk Taking Executives." Management Science, 36, no. 4 (1990), 422-435.
19. Marriott, F. H. C. The Interpretation of Multiple Observations. London: Academic Press, 1974.
20. Mather, P. M. "Study of Factors Influencing Variation in Size Characteristics in Fluvioglacial Sediments." Mathematical Geology, 4, no. 3 (1972), 219-234.
21. Naik, D. N., and R. Khattree. "Revisiting Olympic Track Records: Some Practical Considerations in the Principal Component Analysis." The American Statistician, 50, no. 2 (1996), 140-144.
22. Nason, G. "Three-dimensional Projection Pursuit." Applied Statistics, 44, no. 4 (1995), 411-430.
23. Smith, M., and R. Taffler. "Improving the Communication Function of Published Accounting Statements." Accounting and Business Research, 14, no. 54 (1984), 139-146.
24. Spenner, K. I. "From Generation to Generation: The Transmission of Occupation." Ph.D. dissertation, University of Wisconsin, 1977.
25. Tabakoff, B., et al. "Differences in Platelet Enzyme Activity between Alcoholics and Nonalcoholics." New England Journal of Medicine, 318, no. 3 (1988), 134-139.
26. Timm, N. H. Multivariate Analysis with Applications in Education and Psychology. Monterey, CA: Brooks/Cole, 1975.
27. Trieschmann, J. S., and G. E. Pinches. "A Multivariate Model for Predicting Financially Distressed P-L Insurers." Journal of Risk and Insurance, 40, no. 3 (1973), 327-338.
28. Tukey, J. W. Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.
29. Wainer, H., and D. Thissen. "Graphical Data Analysis." Annual Review of Psychology, 32 (1981), 191-241.
30. Wartzman, R. "Don't Wave a Red Flag at the IRS." The Wall Street Journal (February 24, 1993), C1, C15.
31. Weihs, C., and H. Schmidli. "OMEGA (On Line Multivariate Exploratory Graphical Analysis): Routine Searching for Structure." Statistical Science, 5, no. 2 (1990), 175-226.
CHAPTER
2
Matrix Algebra and Random Vectors

2.1 INTRODUCTION
We saw in Chapter 1 that multivariate data can be conveniently displayed as an array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns is called a matrix of dimension n × p. The study of multivariate methods is greatly facilitated by the use of matrix algebra.

The matrix algebra results presented in this chapter will enable us to concisely state statistical models. Moreover, the formal relations expressed in matrix terms are easily programmed on computers to allow the routine calculation of important statistical quantities.

We begin by introducing some very basic concepts that are essential to both our geometrical interpretations and algebraic explanations of subsequent statistical techniques. If you have not been previously exposed to the rudiments of matrix algebra, you may prefer to follow the brief refresher in the next section by the more detailed review provided in Supplement 2A.
2.2 SOME BASICS OF MATRIX AND VECTOR ALGEBRA

Vectors
J�l : l
An array x of n real numbers x 1 , x2 , x
or
• • •
, x11 is called a x
vector, and it is written as
'
x"
where the prime denotes the operation of transposing a column to a row. 49
50
Chap. 2
M atrix Algebra and Ra ndom Vectors
A vector x can be represented geometrically as a directed line in n dimensions with component x1 along the first axis, x2 along the second axis, . . . , and xn along the nth axis. This is illustrated in Figure 2.1 for n = 3.
2
,"
-----------------�
1- /
I I
/
/
I I I I I
:
, " II I I I I I
:
31 0·�------------+ l �r-� � I
· · - - - - - - - - - - - - - - - - - - �"
,"
[]
Figure 2.1
The vector x ' = [1 , 3, 2].
A vector can be expanded or contracted by multiplying it by a constant c. In particular, we define the vector e x as ex =
cx1 CXz. c�"
That is, ex is the vector obtained by multiplying each element of x by ure 2.2 ( a ) .] Two vectors may be added. Addition of x and y is defined as 2 2
/
)
- tx
Figure 2.2
Scalar multiplication and vector addition.
c. [See Fig
Sec. 2 . 2
51
Some Basics of M atrix and Vector Algebra
so that x + y is the vector with ith element xi + yi.

The sum of two vectors emanating from the origin is the diagonal of the parallelogram formed with the two original vectors as adjacent sides. This geometrical interpretation is illustrated in Figure 2.2(b).

A vector has both direction and length. In n = 2 dimensions, we consider the vector
X =
[ ;J
The length of x, written Lx , is defined to be Lx
=
Yx I2
+ x 22
Geometrically, the length of a vector in two dimensions can be viewed as the hypotenuse of a right triangle. This is demonstrated schematically in Figure 2.3. The length of a vector x ' = [ x1 , x2 , . . . , xn ], with n components, is defined by (2-1) Multiplication of a vector tion (2-1),
x
by a scalar
c changes the length. From Equa
Multiplication by c does not change the direction of the vector x if c > 0. How ever, a negative value of c creates a vector with a direction opposite that of x. From (2-2) 2
xz Figure 2.3
Length of x = v'xf +
x .
i
52
Chap. 2
M atrix Algebra and Random Vectors
it is clear that x is expanded if I c I > 1 and contracted if 0 < I c I < 1. [Recall Fig ure 2.2(a).] Choosing c = L; 1 , we obtain the unit vector L ; 1 x , which has length 1 and lies in the direction of x. A second geometrical concept is angle. Consider two vectors in a plane and the angle 0 between them, as in Figure 2.4. From the figure, 0 can be represented as the difference between the angles 01 and 02 formed by the two vectors and the first coordinate axis. Since, by definition, cos ( 01 )
=
sin ( 01 )
x1
l1__
cos ( 02 )
LX
x2 LX
Ly
Y2 Ly
sin ( 02 )
and cos ( 0 )
=
cos ( 02 - 01 )
=
cos ( 02 ) cos ( 01 )
the angle 0 between the two vectors x' cos ( 02 - 01 )
=
+ sin ( 02 ) sin ( 01 )
[ x 1 , x2 ] and y '
(ll__L ) ( �Lx ) + ( LY2 ) ( �Lx )
=
[y 1 , y2 ] is specified by
x 1 y 1 + X2Yz Lx Ly
(2-3) y y We find it convenient to introduce the inner product of two vectors. For n = 2 dimensions, the inner product of x and y is cos ( O )
=
=
=
With this definition and Equation (2-3) LX
=
Wx
Since cos (90°) = cos (270°) pendicular when x ' y = 0.
=
x' y
cos ( 0 ) 0 and cos ( 0)
=
0 only if x ' y
=
0, x and y are per
2
X
The angle (J between = [x1 , x2 ] and y ' = [y1 , Y2 l ·
Figure 2.4
x'
Sec. 2.2
53
Some Basics of M atrix a n d Vector Algebra
For an arbitrary number of dimensions
x and y as
n, we define the inner product of (2-4)
The inner product is denoted by either x' y or y' x. Using the inner product, we have the natural extension of length and angle to vectors of n components: Lx = length of x = �
(2-5)
=
x' y (2-6) Lx Ly �v� x' x �v� y' y Since, again, cos ( 8 ) = 0 only if x' y = 0, we say that x and y are perpendicular when x' y = 0. x' y
cos ( 8 ) =
Example 2. 1
(Calculating lengths of vectors and the angle between them)
Given the vectors x' = [1, 3, 2] and y' = [ -2, 1, -1], find 3x and x + y. Next, determine the length of x, the length of y, and the angle between x and y. Also, check that the length of 3x is three times the length of x. First,
Next, x' x = 1 2 + 32 + 22 = 14, y' y = ( -2) 2 1 ( - 2) + 3 (1) + 2 ( - 1) = - 1. Therefore, L X = � = Vi4 = 3.742
+ 12 +
( 1 )2 -
=
6, and
x' y
=
Ly = VY'Y = V6 = 2.449
and cos ( O ) so () = 96.3°. Finally L3x = Y3 2 + 92 showing L3x = 3 Lx .
ft Lx L y
+ 62
=
-1 3.742
X 2.449
- .109
\li26 and 3 Lx = 3 Vi4 = \li26
•
54
Chap. 2
M atrix Algebra and Random Vectors
A pair of vectors x and y of the same dimension is said to be linearly depen
dent if there exist constants c1 and c2 , both not zero, such that
A set of vectors x 1 , x 2 , . . . , xk is said to be linearly dependent if there exist constants
c 1 , c2 , . . . , ck, not all zero, such that
(2-7) Linear dependence implies that at least one vector in the set can be written as a linear combination of the other vectors. Vectors of the same dimension that are not linearly dependent are said to be linearly independent. Example 2.2 (Identifying linearly independent vectors)
Consider the set of vectors
Setting
implies that
c 1 + c2 + c 3 = 0 2c1 - 2c3 = 0 c 1 - c2 + c3 = 0 with the unique solution c1 = c2 = c3 = 0. As we cannot find three constants c1 , c2 , and c3 , not all zero, such that c1 x 1 + c2 x 2 + c3x3 = 0, the vectors x 1 , • x 2 , and x 3 are linearly independent. The projection (or shadow) of a vector x on a vector y is 1 . . of x on y = (x' y) y = (x' y) y ProJectiOn -, L y Ly yy where the vector L; 1 y has unit length. The length of the projection is
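The computations in Examples 2.1 and 2.2 can be verified with a few lines of Python using numpy; the sketch below is an illustration added here, not part of the original text. Linear independence is checked through the rank of the matrix whose columns are the vectors: full column rank means the only solution of (2-7) is c1 = c2 = ... = ck = 0.

```python
import numpy as np

# Example 2.1: lengths, sum, and angle for x' = [1, 3, 2] and y' = [-2, 1, -1].
x = np.array([1.0, 3.0, 2.0])
y = np.array([-2.0, 1.0, -1.0])
L_x = np.sqrt(x @ x)                       # sqrt(14) = 3.742
L_y = np.sqrt(y @ y)                       # sqrt(6)  = 2.449
cos_theta = (x @ y) / (L_x * L_y)          # x'y / (L_x L_y) = -0.109
print(3 * x, x + y)                        # [3. 9. 6.] and [-1. 4. 1.]
print(L_x, L_y, np.degrees(np.arccos(cos_theta)))        # 3.742, 2.449, about 96.3 degrees
print(np.isclose(np.sqrt((3 * x) @ (3 * x)), 3 * L_x))   # length of 3x equals 3 times length of x

# Example 2.2: the vectors x1' = [1, 2, 1], x2' = [1, 0, -1], x3' = [1, -2, 1].
M = np.column_stack([[1.0, 2.0, 1.0], [1.0, 0.0, -1.0], [1.0, -2.0, 1.0]])
print(np.linalg.matrix_rank(M) == 3)       # True: the three vectors are linearly independent
```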
(2-8)
-
Length of protection =
I �Y I y
= Lx
\ :'I \ = Lx l cos ( O ) I X
y
where 0 is the angle between x and y. (See Figure 2.5.)
(2-9)
Sec . 2 . 2
Some Basics of M atrix a n d Vector Algebra
( )
x' y y y' y
55
, y
''"""""'-- LX cos (8) -----;)�1
Figure 2.5
The projection of x on y.
Matrices

A matrix is any rectangular array of real numbers. We denote an arbitrary array of n rows and p columns by

A (n×p) = [a₁₁ a₁₂ ··· a₁p; a₂₁ a₂₂ ··· a₂p; ⋮ ; aₙ₁ aₙ₂ ··· aₙp]

Many of the vector concepts just introduced have direct generalizations to matrices. The transpose operation A' of a matrix changes the columns into rows, so that the first column of A becomes the first row of A', the second column becomes the second row, and so forth.

Example 2.3 (The transpose of a matrix)

If

A (2×3) = [3 −1 2; 1 5 4]

then

A' (3×2) = [3 1; −1 5; 2 4]        ■
A matrix may also be multiplied by a constant c. The product cA is the matrix that results from multiplying each element of A by c. Thus

cA (n×p) = [ca₁₁ ca₁₂ ··· ca₁p; ca₂₁ ca₂₂ ··· ca₂p; ⋮ ; caₙ₁ caₙ₂ ··· caₙp]

Two matrices A and B of the same dimensions can be added. The sum A + B has (i, j)th entry a_ij + b_ij.
Example 2.4 (The sum of two matrices and multiplication of a matrix by a constant)

If

A (2×3) = [0 3 1; 1 −1 1]   and   B (2×3) = [1 −2 −3; 2 5 1]

then

4A (2×3) = [0 12 4; 4 −4 4]

and

A + B (2×3) = [0 + 1  3 − 2  1 − 3; 1 + 2  −1 + 5  1 + 1] = [1 1 −2; 3 4 2]        ■
It is also possible to define the multiplication of two matrices if the dimensions of the matrices conform in the following manner: When A is (n × k) and B is (k × p), so that the number of elements in a row of A is the same as the number of elements in a column of B, we can form the matrix product AB. An element of the new matrix AB is formed by taking the inner product of each row of A with each column of B. The matrix product AB is

A (n×k) B (k×p) = the (n × p) matrix whose entry in the ith row and jth column is the inner product of the ith row of A and the jth column of B

or

(i, j) entry of AB = a_{i1}b_{1j} + a_{i2}b_{2j} + ··· + a_{ik}b_{kj} = Σ_{ℓ=1}^{k} a_{iℓ} b_{ℓj}        (2-10)
When k = 4, we have four products to add for each entry in the matrix AB. Thus, for A (n×4) and B (4×p), the ith row of A, [a_{i1} a_{i2} a_{i3} a_{i4}], is matched with the jth column of B, [b_{1j}; b_{2j}; b_{3j}; b_{4j}], and the (i, j)th entry of AB is

a_{i1}b_{1j} + a_{i2}b_{2j} + a_{i3}b_{3j} + a_{i4}b_{4j}
Example 2.5 (Matrix multiplication)

If

A = [3 −1 2; 1 5 4],   B = [−2; 7; 9],   and   C = [2 0; 1 −1]

then

A (2×3) B (3×1) = [3(−2) + (−1)(7) + 2(9); 1(−2) + 5(7) + 4(9)] = [5; 69]   (2 × 1)

and

C (2×2) A (2×3) = [2(3) + 0(1)   2(−1) + 0(5)   2(2) + 0(4);
                   1(3) − 1(1)   1(−1) − 1(5)   1(2) − 1(4)]
                = [6 −2 4; 2 −6 −2]   (2 × 3)        ■
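These products are easy to verify numerically. A minimal sketch assuming Python with numpy (the matrices are those used in the example above):

```python
import numpy as np

A = np.array([[3, -1, 2],
              [1,  5, 4]])
b = np.array([[-2], [7], [9]])        # a 3 x 1 column
C = np.array([[2,  0],
              [1, -1]])

print(A @ b)      # (2 x 3)(3 x 1) = (2 x 1): [[5], [69]]
print(C @ A)      # (2 x 2)(2 x 3) = (2 x 3): [[6, -2, 4], [2, -6, -2]]
```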
When a matrix B consists of a single column, it is customary to use the lowercase b vector notation.

Example 2.6 (Some typical products and their dimensions)

Let

A = [1 −2 3; 2 4 −1],   b = [7; −3; 6],   c = [5; 8; −4],   d = [2; 9]

Then Ab, bc', b'c, and d'Ab are typical products.

Ab = [1(7) + (−2)(−3) + 3(6); 2(7) + 4(−3) + (−1)(6)] = [31; −4]

The product Ab is a vector with dimension equal to the number of rows of A.

b'c = [7 −3 6][5; 8; −4] = [−13]

The product b'c is a 1 × 1 vector or a single number, here −13.

bc' = [7; −3; 6][5 8 −4] = [35 56 −28; −15 −24 12; 30 48 −24]

The product bc' is a matrix whose row dimension equals the dimension of b and whose column dimension equals that of c. This product is unlike b'c, which is a single number.

d'Ab = [2 9][1 −2 3; 2 4 −1][7; −3; 6] = [2 9][31; −4] = [26]

The product d'Ab is a 1 × 1 vector or a single number, here 26. ■
Square matrices will be of special importance in our development of statistical methods. A square matrix is said to be symmetric if A = A' or a_ij = a_ji for all i and j.

Example 2.7 (A symmetric matrix)

The matrix

[3 5; 5 −2]

is symmetric; the matrix

[3 6; 4 −2]

is not symmetric. ■
When two square matrices A and B are of the same dimension, both products AB and BA are defined, although they need not be equal. (See Supplement 2A.) If we let I denote the square matrix with ones on the diagonal and zeros elsewhere, it follows from the definition of matrix multiplication that the (i, j)th entry of AI is a_{i1} × 0 + ··· + a_{i,j−1} × 0 + a_{ij} × 1 + a_{i,j+1} × 0 + ··· + a_{ik} × 0 = a_{ij}, so AI = A. Similarly, IA = A, so

I (k×k) A (k×k) = A (k×k) I (k×k) = A (k×k)   for any A (k×k)        (2-11)
The matrix I acts like 1 in ordinary multiplication (1 · a = a · 1 = a), so it is called the identity matrix. The fundamental scalar relation about the existence of an inverse number a⁻¹ such that a⁻¹a = aa⁻¹ = 1 if a ≠ 0 has the following matrix algebra extension: If there exists a matrix B such that

B (k×k) A (k×k) = A (k×k) B (k×k) = I (k×k)

then B is called the inverse of A and is denoted by A⁻¹.

The technical condition that an inverse exists is that the k columns a₁, a₂, ..., a_k of A are linearly independent. That is, the existence of A⁻¹ is equivalent to

c₁a₁ + c₂a₂ + ··· + c_k a_k = 0   only if   c₁ = ··· = c_k = 0        (2-12)

(See Result 2A.9 in Supplement 2A.)
Example 2.8 (The existence of a matrix inverse)

For

A = [3 2; 4 1]

you may verify that

[−.2 .4; .8 −.6][3 2; 4 1] = [(−.2)3 + (.4)4   (−.2)2 + (.4)1;
                             (.8)3 + (−.6)4   (.8)2 + (−.6)1] = [1 0; 0 1]

so

[−.2 .4; .8 −.6]

is A⁻¹. We note that

c₁[3; 4] + c₂[2; 1] = [0; 0]

implies that c₁ = c₂ = 0, so the columns of A are linearly independent. This confirms the condition stated in (2-12). ■
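As the next paragraph advises, it is good practice to check AA⁻¹ and A⁻¹A against I whenever an inverse is computed by software. A minimal sketch assuming Python with numpy:

```python
import numpy as np

A = np.array([[3.0, 2.0],
              [4.0, 1.0]])
A_inv = np.linalg.inv(A)

print(A_inv)                                  # approximately [[-0.2, 0.4], [0.8, -0.6]]
print(np.allclose(A @ A_inv, np.eye(2)))      # True: A A^{-1} = I
print(np.allclose(A_inv @ A, np.eye(2)))      # True: A^{-1} A = I
```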
A method for computing an inverse, when one exists, is given in Supplement 2A. The routine, but lengthy, calculations are usually relegated to a computer, especially when the dimension is greater than three. Even so, you must be forewarned that if the column sum in (2-12) is nearly 0 for some constants c₁, ..., c_k, then the computer may produce incorrect inverses due to extreme errors in rounding. It is always good to check the products AA⁻¹ and A⁻¹A for equality with I when A⁻¹ is produced by a computer package. (See Exercise 2.10.)

Diagonal matrices have inverses that are easy to compute. For example,

[a₁₁ 0 0 0 0; 0 a₂₂ 0 0 0; 0 0 a₃₃ 0 0; 0 0 0 a₄₄ 0; 0 0 0 0 a₅₅]

has inverse

[1/a₁₁ 0 0 0 0; 0 1/a₂₂ 0 0 0; 0 0 1/a₃₃ 0 0; 0 0 0 1/a₄₄ 0; 0 0 0 0 1/a₅₅]

if all the a_ii ≠ 0.

Another special class of square matrices with which we shall become familiar are the orthogonal matrices, characterized by

QQ' = Q'Q = I   or   Q' = Q⁻¹        (2-13)

The name derives from the property that if Q has ith row q_i', then QQ' = I implies that q_i'q_i = 1 and q_i'q_j = 0 for i ≠ j, so the rows have unit length and are mutually perpendicular (orthogonal). According to the condition Q'Q = I, the columns have the same property.

We conclude our brief introduction to the elements of matrix algebra by introducing a concept fundamental to multivariate statistical analysis. A square matrix A is said to have an eigenvalue λ, with corresponding eigenvector x ≠ 0, if

Ax = λx        (2-14)

Ordinarily, we normalize x so that it has length unity; that is, 1 = x'x. It is convenient to denote normalized eigenvectors by e, and we do so in what follows. Sparing you the details of the derivation (see [1]), we state the following basic result: A k × k square symmetric matrix A has k pairs of eigenvalues and (normalized) eigenvectors λ₁, e₁, λ₂, e₂, ..., λ_k, e_k; the eigenvectors can be chosen to satisfy 1 = e₁'e₁ = ··· = e_k'e_k and to be mutually perpendicular, and they are unique unless two or more eigenvalues are equal.
Example 2.9 (Verifying eigenvalues and eigenvectors)
Let

A = [1 −5; −5 1]

Then, since

[1 −5; −5 1][1/√2; −1/√2] = [6/√2; −6/√2] = 6 [1/√2; −1/√2]

λ₁ = 6 is an eigenvalue, and

e₁' = [1/√2, −1/√2]

is its corresponding normalized eigenvector. You may wish to show that a second eigenvalue–eigenvector pair is λ₂ = −4, e₂' = [1/√2, 1/√2]. ■
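A quick numerical check of these eigenvalue–eigenvector pairs, sketched in Python with numpy (a computed eigenvector may differ in sign; either sign is a valid normalized eigenvector):

```python
import numpy as np

A = np.array([[ 1.0, -5.0],
              [-5.0,  1.0]])

eigenvalues, eigenvectors = np.linalg.eigh(A)   # eigh handles symmetric matrices
print(eigenvalues)                              # [-4., 6.]
for lam, e in zip(eigenvalues, eigenvectors.T): # columns of `eigenvectors` are normalized
    print(np.allclose(A @ e, lam * e))          # True, True: A e = lambda e
```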
A method for calculating the λ's and e's is described in Supplement 2A. It is instructive to do a few sample calculations to understand the technique. We usually rely on a computer when the dimension of the square matrix is greater than two or three.

2.3 POSITIVE DEFINITE MATRICES
The study of the variation and interrelationships in multivariate data is often based upon distances and the assumption that the data are multivariate normally distrib uted. Squared distances (see Chapter 1) and the multivariate normal density can be expressed in terms of matrix products called quadratic forms (see Chapter 4). Con sequently, it should not be surprising that quadratic forms play a central role in multivariate analysis. In this section, we consider quadratic forms that are always nonnegative and the associated positive definite matrices.
Results involving quadratic forms and symmetric matrices are, in many cases, a direct consequence of an expansion for symmetric matrices known as the spectral decomposition. The spectral decomposition of a k × k symmetric matrix A is given by¹

A = λ₁e₁e₁' + λ₂e₂e₂' + ··· + λ_k e_k e_k'        (2-16)

where λ₁, λ₂, ..., λ_k are the eigenvalues of A and e₁, e₂, ..., e_k are the associated normalized eigenvectors. (See also Result 2A.14 in Supplement 2A.) Thus, e_i'e_i = 1 for i = 1, 2, ..., k, and e_i'e_j = 0 for i ≠ j.
Example 2.10 (The spectral decomposition of a matrix)

Consider the symmetric matrix

A = [13 −4 2; −4 13 −2; 2 −2 10]

The eigenvalues obtained from the characteristic equation |A − λI| = 0 are λ₁ = 9, λ₂ = 9, and λ₃ = 18 (Definition 2A.30). The corresponding eigenvectors e₁, e₂, and e₃ are the (normalized) solutions of the equations Ae_i = λ_i e_i for i = 1, 2, 3. Thus, Ae₁ = λe₁ gives

13e₁₁ − 4e₂₁ + 2e₃₁ = 9e₁₁
−4e₁₁ + 13e₂₁ − 2e₃₁ = 9e₂₁
2e₁₁ − 2e₂₁ + 10e₃₁ = 9e₃₁

Moving the terms on the right of the equals sign to the left yields three homogeneous equations in three unknowns, but two of the equations are redundant. Selecting one of the equations and arbitrarily setting e₁₁ = 1 and e₂₁ = 1, we find that e₃₁ = 0. Consequently, the normalized eigenvector is

e₁' = [1/√(1² + 1² + 0²), 1/√(1² + 1² + 0²), 0/√(1² + 1² + 0²)] =

¹ A proof of Equation (2-16) is beyond the scope of this book. The interested reader will find a proof in [5], Chapter 8.
[1/√2, 1/√2, 0], since the sum of the squares of its elements is unity. You may verify that e₂' = [1/√18, −1/√18, −4/√18] is also an eigenvector for 9 = λ₂, and e₃' = [2/3, −2/3, 1/3] is the normalized eigenvector corresponding to the eigenvalue λ₃ = 18. Moreover, e_i'e_j = 0 for i ≠ j.

The spectral decomposition of A is then

A = λ₁e₁e₁' + λ₂e₂e₂' + λ₃e₃e₃'

or

[13 −4 2; −4 13 −2; 2 −2 10]
  = 9 [1/√2; 1/√2; 0][1/√2  1/√2  0] + 9 [1/√18; −1/√18; −4/√18][1/√18  −1/√18  −4/√18]
    + 18 [2/3; −2/3; 1/3][2/3  −2/3  1/3]
  = 9 [1/2 1/2 0; 1/2 1/2 0; 0 0 0]
    + 9 [1/18 −1/18 −4/18; −1/18 1/18 4/18; −4/18 4/18 16/18]
    + 18 [4/9 −4/9 2/9; −4/9 4/9 −2/9; 2/9 −2/9 1/9]

as you may readily verify. ■

The spectral decomposition is an important analytical tool. With it, we are very easily able to demonstrate certain statistical results. The first of these is a matrix explanation of distance, which we now develop.
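The decomposition in Example 2.10 can be reproduced numerically. A minimal sketch in Python with numpy (eigh returns the eigenvalues in ascending order):

```python
import numpy as np

A = np.array([[13.0, -4.0,  2.0],
              [-4.0, 13.0, -2.0],
              [ 2.0, -2.0, 10.0]])

lam, E = np.linalg.eigh(A)               # eigenvalues 9, 9, 18; columns of E are orthonormal eigenvectors

# Rebuild A as the sum of lambda_i * e_i e_i'  (Equation (2-16))
A_rebuilt = sum(l * np.outer(e, e) for l, e in zip(lam, E.T))
print(np.allclose(A, A_rebuilt))         # True
print(np.allclose(E.T @ E, np.eye(3)))   # eigenvectors are mutually perpendicular with unit length
```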
When a k × k symmetric matrix A is such that

0 ≤ x'Ax        (2-17)

for all x' = [x₁, x₂, ..., x_k], A is said to be nonnegative definite. If equality holds in (2-17) only for the vector x' = [0, 0, ..., 0], then A is said to be positive definite. In other words, A is positive definite if

0 < x'Ax        (2-18)

for all vectors x ≠ 0. Because x'Ax has only squared terms x_i² and product terms x_i x_k, it is called a quadratic form.
Example 2.11 (A positive definite quadratic form)

Show that the following quadratic form is positive definite:

3x₁² + 2x₂² − 2√2 x₁x₂

To illustrate the general approach, we first write the quadratic form in matrix notation as

[x₁ x₂][3 −√2; −√2 2][x₁; x₂] = x'Ax

By Definition 2A.30, the eigenvalues of A are the solutions of the equation |A − λI| = 0, or (3 − λ)(2 − λ) − 2 = 0. The solutions are λ₁ = 4 and λ₂ = 1.

Using the spectral decomposition in (2-16), we can write

A (2×2) = λ₁e₁e₁' + λ₂e₂e₂' = 4 e₁e₁' + e₂e₂'

where e₁ and e₂ are the normalized and orthogonal eigenvectors associated with the eigenvalues λ₁ = 4 and λ₂ = 1, respectively. Because 4 and 1 are scalars, premultiplication and postmultiplication of A by x' and x, respectively, where x' = [x₁, x₂] is any nonzero vector, give

x'Ax = 4 x'e₁e₁'x + x'e₂e₂'x = 4y₁² + y₂² ≥ 0

with

y₁ = x'e₁ = e₁'x   and   y₂ = x'e₂ = e₂'x

We now show that y₁ and y₂ are not both zero and, consequently, that x'Ax = 4y₁² + y₂² > 0, or A is positive definite.

From the definitions of y₁ and y₂, we have

[y₁; y₂] = [e₁'; e₂'][x₁; x₂]   or   y (2×1) = E (2×2) x (2×1)

Now E is an orthogonal matrix and hence has inverse E'. Thus, x = E'y. But x is a nonzero vector, and 0 ≠ x = E'y implies that y ≠ 0. ■
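In practice, positive definiteness is usually checked through the eigenvalues, as the next paragraph explains. A sketch of such a check for the matrix of Example 2.11, in Python with numpy:

```python
import numpy as np

A = np.array([[3.0,           -np.sqrt(2.0)],
              [-np.sqrt(2.0),  2.0]])

eigenvalues = np.linalg.eigvalsh(A)
print(eigenvalues)                 # [1., 4.]
print(np.all(eigenvalues > 0))     # True, so x'Ax > 0 for every x != 0

# Spot-check the quadratic form at a few arbitrary nonzero points.
rng = np.random.default_rng(0)
for _ in range(3):
    x = rng.standard_normal(2)
    print(x @ A @ x > 0)           # True
```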
Using the spectral decomposition, we can easily show that a k × k symmetric matrix A is a positive definite matrix if and only if every eigenvalue of A is positive. (See Exercise 2.17.) A is a nonnegative definite matrix if and only if all of its eigenvalues are greater than or equal to zero.

Assume for the moment that the p elements x₁, x₂, ..., x_p of a vector x are realizations of p random variables X₁, X₂, ..., X_p. As we pointed out in Chapter 1, we can regard these elements as the coordinates of a point in p-dimensional space, and the "distance" of the point [x₁, x₂, ..., x_p] to the origin can, and in this case should, be interpreted in terms of standard deviation units. In this way, we can account for the inherent uncertainty (variability) in the observations. Points with the same associated "uncertainty" are regarded as being at the same distance from the origin.

If we use the distance formula introduced in Chapter 1 [see Equation (1-22)], the distance from the origin satisfies the general formula

(distance)² = a₁₁x₁² + a₂₂x₂² + ··· + a_pp x_p² + 2(a₁₂x₁x₂ + a₁₃x₁x₃ + ··· + a_{p−1,p}x_{p−1}x_p)

provided that (distance)² > 0 for all [x₁, x₂, ..., x_p] ≠ [0, 0, ..., 0]. Setting a_ij = a_ji, i ≠ j, i = 1, 2, ..., p, j = 1, 2, ..., p, we have

0 < (distance)² = [x₁, x₂, ..., x_p][a₁₁ a₁₂ ··· a₁p; a₁₂ a₂₂ ··· a₂p; ⋮ ; a₁p a₂p ··· a_pp][x₁; x₂; ⋮ ; x_p]

or

0 < (distance)² = x'Ax   for x ≠ 0        (2-19)

From (2-19), we see that the p × p symmetric matrix A is positive definite. In sum, distance is determined from a positive definite quadratic form x'Ax. Conversely, a positive definite quadratic form can be interpreted as a squared distance.
Comment. Let the square of the distance from the point x' = [x₁, x₂, ..., x_p] to the origin be given by x'Ax, where A is a p × p symmetric positive definite matrix. Then the square of the distance from x to an arbitrary fixed point μ' = [μ₁, μ₂, ..., μ_p] is given by the general expression (x − μ)'A(x − μ).

Expressing distance as the square root of a positive definite quadratic form allows us to give a geometrical interpretation based on the eigenvalues and eigenvectors of the matrix A. For example, suppose p = 2. Then the points x' = [x₁, x₂] of constant distance c from the origin satisfy

x'Ax = a₁₁x₁² + a₂₂x₂² + 2a₁₂x₁x₂ = c²

By the spectral decomposition, as in Example 2.11,

A = λ₁e₁e₁' + λ₂e₂e₂'   so   x'Ax = λ₁(x'e₁)² + λ₂(x'e₂)²

Now, c² = λ₁y₁² + λ₂y₂² is an ellipse in y₁ = x'e₁ and y₂ = x'e₂ because λ₁, λ₂ > 0 when A is positive definite. (See Exercise 2.17.) We easily verify that x = cλ₁^(−1/2) e₁ satisfies x'Ax = λ₁(cλ₁^(−1/2) e₁'e₁)² = c². Similarly, x = cλ₂^(−1/2) e₂ gives the appropriate distance in the e₂ direction. Thus, the points at distance c lie on an ellipse whose axes are given by the eigenvectors of A with lengths proportional to the reciprocals of the square roots of the eigenvalues. The constant of proportionality is c. The situation is illustrated in Figure 2.6.

If p > 2, the points x' = [x₁, x₂, ..., x_p] a constant distance c = √(x'Ax) from the origin lie on hyperellipsoids c² = λ₁(x'e₁)² + ··· + λ_p(x'e_p)², whose axes are given by the eigenvectors of A. The half-length in the direction e_i is equal to c/√λ_i, i = 1, 2, ..., p, where λ₁, λ₂, ..., λ_p are the eigenvalues of A.

Figure 2.6  Points a constant distance c from the origin (p = 2, 1 ≤ λ₁ < λ₂).
2.4 A SQUARE-ROOT MATRIX
The spectral decomposition allows us to express the inverse of a square matrix in terms of its eigenvalues and eigenvectors, and this leads to a useful square-root matrix.

Let A be a k × k positive definite matrix with the spectral decomposition A = Σ_{i=1}^{k} λ_i e_i e_i'. Let the normalized eigenvectors be the columns of another matrix P = [e₁, e₂, ..., e_k]. Then

A (k×k) = Σ_{i=1}^{k} λ_i e_i e_i' = P Λ P'        (2-20)

where PP' = P'P = I and Λ is the diagonal matrix

Λ (k×k) = diag(λ₁, λ₂, ..., λ_k)   with λ_i > 0

Thus,

A⁻¹ = P Λ⁻¹ P' = Σ_{i=1}^{k} (1/λ_i) e_i e_i'        (2-21)

since (PΛ⁻¹P')PΛP' = PΛP'(PΛ⁻¹P') = PP' = I.

Next, let Λ^(1/2) denote the diagonal matrix with √λ_i as the ith diagonal element. The matrix Σ_{i=1}^{k} √λ_i e_i e_i' = PΛ^(1/2)P' is called the square root of A and is denoted by A^(1/2):

A^(1/2) = Σ_{i=1}^{k} √λ_i e_i e_i' = P Λ^(1/2) P'        (2-22)
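A sketch of the square-root construction (2-22), and of the inverse (2-21), in Python with numpy; the matrix A below is an arbitrary positive definite illustration, not one from the text:

```python
import numpy as np

A = np.array([[2.2, 0.4],
              [0.4, 2.8]])                    # an arbitrary positive definite matrix

lam, P = np.linalg.eigh(A)                    # A = P diag(lam) P'
A_half = P @ np.diag(np.sqrt(lam)) @ P.T      # A^{1/2} = P Lambda^{1/2} P', Equation (2-22)
A_inv  = P @ np.diag(1.0 / lam) @ P.T         # A^{-1}  = P Lambda^{-1} P',  Equation (2-21)

print(np.allclose(A_half @ A_half, A))        # A^{1/2} A^{1/2} = A
print(np.allclose(A_inv, np.linalg.inv(A)))   # matches the usual inverse
```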
2.5 RANDOM VECTORS AND MATRICES
A random vector is a vector whose elements are random variables. Similarly, a random matrix is a matrix whose elements are random variables. The expected value of a random matrix (or vector) is the matrix (vector) consisting of the expected values of each of its elements. Specifically, let X = {X_ij} be an n × p random matrix. Then the expected value of X, denoted by E(X), is the n × p matrix of numbers (if they exist)

E(X) = [E(X₁₁) E(X₁₂) ··· E(X₁p); ⋮ ; E(X_{n1}) E(X_{n2}) ··· E(X_{np})]        (2-23)

where, for each element of the matrix,²

E(X_ij) = ∫ x_ij f_ij(x_ij) dx_ij   if X_ij is a continuous random variable with probability density function f_ij(x_ij)
E(X_ij) = Σ_{all x_ij} x_ij p_ij(x_ij)   if X_ij is a discrete random variable with probability function p_ij(x_ij)

Example 2.12 (Computing expected values for discrete random variables)

Suppose p = 2 and n = 1, and consider the random vector X' = [X₁, X₂]. Let the discrete random variable X₁ have the following probability function:

x₁:        −1    0    1
p₁(x₁):    .3   .3   .4

² If you are unfamiliar with calculus, you should concentrate on the interpretation of the expected value and, eventually, variance. Our development is based primarily on the properties of expectation rather than its particular evaluation for continuous or discrete random variables.
Then E(X₁) = Σ_{all x₁} x₁ p₁(x₁) = (−1)(.3) + (0)(.3) + (1)(.4) = .1.

Similarly, let the discrete random variable X₂ have the probability function

x₂:        0    1
p₂(x₂):   .8   .2

Then E(X₂) = Σ_{all x₂} x₂ p₂(x₂) = (0)(.8) + (1)(.2) = .2.

Thus,

E(X) = [E(X₁); E(X₂)] = [.1; .2]        ■
Two results involving the expectation of sums and products of matrices follow directly from the definition of the expected value of a random matrix and the univariate properties of expectation, E(X₁ + Y₁) = E(X₁) + E(Y₁) and E(cX₁) = cE(X₁). Let X and Y be random matrices of the same dimension, and let A and B be conformable matrices of constants. Then (see Exercise 2.40)

E(X + Y) = E(X) + E(Y)
E(AXB) = A E(X) B        (2-24)
2.6 MEAN VECTORS AND COVARIANCE MATRICES
Suppose X' = [X₁, X₂, ..., X_p] is a p × 1 random vector. Then each element of X is a random variable with its own marginal probability distribution. (See Example 2.12.) The marginal means μ_i and variances σ_i² are defined as μ_i = E(X_i) and σ_i² = E(X_i − μ_i)², i = 1, 2, ..., p, respectively. Specifically,

μ_i = ∫ x_i f_i(x_i) dx_i   if X_i is a continuous random variable with probability density function f_i(x_i)
μ_i = Σ_{all x_i} x_i p_i(x_i)   if X_i is a discrete random variable with probability function p_i(x_i)

σ_i² = ∫ (x_i − μ_i)² f_i(x_i) dx_i   if X_i is continuous
σ_i² = Σ_{all x_i} (x_i − μ_i)² p_i(x_i)   if X_i is discrete        (2-25)

It will be convenient in later sections to denote the marginal variances by σ_ii rather than the more traditional σ_i², and consequently, we shall adopt this notation.

The behavior of any pair of random variables, such as X_i and X_k, is described by their joint probability function, and a measure of the linear association between them is provided by the covariance

σ_ik = E(X_i − μ_i)(X_k − μ_k)
     = ∫∫ (x_i − μ_i)(x_k − μ_k) f_ik(x_i, x_k) dx_i dx_k   if X_i, X_k are continuous random variables with joint density function f_ik(x_i, x_k)
     = Σ_{all x_i} Σ_{all x_k} (x_i − μ_i)(x_k − μ_k) p_ik(x_i, x_k)   if X_i, X_k are discrete random variables with joint probability function p_ik(x_i, x_k)        (2-26)
and μ_i and μ_k, i, k = 1, 2, ..., p, are the marginal means. When i = k, the covariance becomes the marginal variance.

More generally, the collective behavior of the p random variables X₁, X₂, ..., X_p or, equivalently, the random vector X = [X₁, X₂, ..., X_p]', is described by a joint probability density function f(x₁, x₂, ..., x_p) = f(x). As we have already noted in this book, f(x) will often be the multivariate normal density function. (See Chapter 4.)

If the joint probability P[X_i ≤ x_i and X_k ≤ x_k] can be written as the product of the corresponding marginal probabilities, so that

P[X_i ≤ x_i and X_k ≤ x_k] = P[X_i ≤ x_i] P[X_k ≤ x_k]        (2-27)

for all pairs of values x_i, x_k, then X_i and X_k are said to be statistically independent. When X_i and X_k are continuous random variables with joint density f_ik(x_i, x_k) and marginal densities f_i(x_i) and f_k(x_k), the independence condition becomes

f_ik(x_i, x_k) = f_i(x_i) f_k(x_k)

for all pairs (x_i, x_k).

The p continuous random variables X₁, X₂, ..., X_p are mutually statistically independent if their joint density can be factored as

f_{12···p}(x₁, x₂, ..., x_p) = f₁(x₁) f₂(x₂) ··· f_p(x_p)        (2-28)

for all p-tuples (x₁, x₂, ..., x_p).

Statistical independence has an important implication for covariance. The factorization in (2-28) implies that Cov(X_i, X_k) = 0. Thus,

Cov(X_i, X_k) = 0   if X_i and X_k are independent        (2-29)
The converse of (2-29) is not true in general; there are situations where Cov(X_i, X_k) = 0, but X_i and X_k are not independent. (See [2].)

The means and covariances of the p × 1 random vector X can be set out as matrices. The expected value of each element is contained in the vector of means μ = E(X), and the p variances σ_ii and the p(p − 1)/2 distinct covariances σ_ik (i < k) are contained in the symmetric variance-covariance matrix Σ = E(X − μ)(X − μ)'. Specifically,

E(X) = [E(X₁); E(X₂); ⋮ ; E(X_p)] = [μ₁; μ₂; ⋮ ; μ_p] = μ        (2-30)

and

Σ = E(X − μ)(X − μ)'
  = E( [X₁ − μ₁; X₂ − μ₂; ⋮ ; X_p − μ_p] [X₁ − μ₁, X₂ − μ₂, ..., X_p − μ_p] )
  = E [ (X₁ − μ₁)²             (X₁ − μ₁)(X₂ − μ₂)   ···  (X₁ − μ₁)(X_p − μ_p)
        (X₂ − μ₂)(X₁ − μ₁)     (X₂ − μ₂)²           ···  (X₂ − μ₂)(X_p − μ_p)
        ⋮
        (X_p − μ_p)(X₁ − μ₁)   (X_p − μ_p)(X₂ − μ₂)  ···  (X_p − μ_p)² ]

  = [ E(X₁ − μ₁)²              E(X₁ − μ₁)(X₂ − μ₂)   ···  E(X₁ − μ₁)(X_p − μ_p)
      E(X₂ − μ₂)(X₁ − μ₁)      E(X₂ − μ₂)²           ···  E(X₂ − μ₂)(X_p − μ_p)
      ⋮
      E(X_p − μ_p)(X₁ − μ₁)    E(X_p − μ_p)(X₂ − μ₂)  ···  E(X_p − μ_p)² ]
or

Σ = Cov(X) = [σ₁₁ σ₁₂ ··· σ₁p; σ₁₂ σ₂₂ ··· σ₂p; ⋮ ; σ₁p σ₂p ··· σ_pp]        (2-31)

Example 2.13 (Computing the covariance matrix)
Find the covariance matrix for the two random variables X₁ and X₂ introduced in Example 2.12 when their joint probability function p₁₂(x₁, x₂) is represented by the entries in the body of the following table:

              x₂
  x₁          0      1     p₁(x₁)
  −1        .24    .06      .3
   0        .16    .14      .3
   1        .40    .00      .4
  p₂(x₂)     .8     .2       1

We have already shown that μ₁ = E(X₁) = .1 and μ₂ = E(X₂) = .2. (See Example 2.12.) In addition,

σ₁₁ = E(X₁ − μ₁)² = Σ_{all x₁} (x₁ − .1)² p₁(x₁)
    = (−1 − .1)²(.3) + (0 − .1)²(.3) + (1 − .1)²(.4) = .69

σ₂₂ = E(X₂ − μ₂)² = Σ_{all x₂} (x₂ − .2)² p₂(x₂)
    = (0 − .2)²(.8) + (1 − .2)²(.2) = .16

σ₁₂ = E(X₁ − μ₁)(X₂ − μ₂) = Σ_{all pairs (x₁, x₂)} (x₁ − .1)(x₂ − .2) p₁₂(x₁, x₂)
    = (−1 − .1)(0 − .2)(.24) + (−1 − .1)(1 − .2)(.06) + ··· + (1 − .1)(1 − .2)(.00) = −.08

and σ₂₁ = σ₁₂ = −.08. Consequently, with X' = [X₁, X₂],

μ = E(X) = [E(X₁); E(X₂)] = [.1; .2]

and

Σ = E(X − μ)(X − μ)'
  = E[ (X₁ − μ₁)²            (X₁ − μ₁)(X₂ − μ₂);
       (X₂ − μ₂)(X₁ − μ₁)    (X₂ − μ₂)² ]
  = [σ₁₁ σ₁₂; σ₂₁ σ₂₂] = [.69 −.08; −.08 .16]        ■
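The calculations in Examples 2.12 and 2.13 are weighted sums over the joint probability table; a minimal sketch in Python with numpy:

```python
import numpy as np

x1_vals = np.array([-1.0, 0.0, 1.0])
x2_vals = np.array([0.0, 1.0])
p12 = np.array([[0.24, 0.06],
                [0.16, 0.14],
                [0.40, 0.00]])        # rows indexed by x1, columns by x2

p1 = p12.sum(axis=1)                  # marginal of X1: [.3, .3, .4]
p2 = p12.sum(axis=0)                  # marginal of X2: [.8, .2]
mu1 = x1_vals @ p1                    # 0.1
mu2 = x2_vals @ p2                    # 0.2

sigma11 = ((x1_vals - mu1) ** 2) @ p1                             # 0.69
sigma22 = ((x2_vals - mu2) ** 2) @ p2                             # 0.16
sigma12 = (np.outer(x1_vals - mu1, x2_vals - mu2) * p12).sum()    # -0.08

print(np.array([[sigma11, sigma12], [sigma12, sigma22]]))
```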
We note that the computation of means, variances, and covariances for discrete random variables involves summation (as in Examples 2.12 and 2.13), while analogous computations for continuous random variables involve integration.

Because σ_ik = E(X_i − μ_i)(X_k − μ_k) = σ_ki, it is convenient to write the matrix appearing in (2-31) as

Σ = E(X − μ)(X − μ)' = [σ₁₁ σ₁₂ ··· σ₁p; σ₁₂ σ₂₂ ··· σ₂p; ⋮ ; σ₁p σ₂p ··· σ_pp]        (2-32)

We shall refer to μ and Σ as the population mean (vector) and population variance-covariance (matrix), respectively.

The multivariate normal distribution is completely specified once the mean vector μ and variance-covariance matrix Σ are given (see Chapter 4), so it is not surprising that these quantities play an important role in many multivariate procedures. It is frequently informative to separate the information contained in variances σ_ii from that contained in measures of association and, in particular, the measure of association known as the population correlation coefficient ρ_ik. The correlation coefficient ρ_ik is defined in terms of the covariance σ_ik and variances σ_ii and σ_kk as

ρ_ik = σ_ik / (√σ_ii √σ_kk)        (2-33)

The correlation coefficient measures the amount of linear association between the random variables X_i and X_k. (See, for example, [2].)
Let the population correlation matrix be the p × p symmetric matrix

ρ = [ σ₁₁/(√σ₁₁√σ₁₁)   σ₁₂/(√σ₁₁√σ₂₂)  ···  σ₁p/(√σ₁₁√σ_pp)
      σ₁₂/(√σ₁₁√σ₂₂)   σ₂₂/(√σ₂₂√σ₂₂)  ···  σ₂p/(√σ₂₂√σ_pp)
      ⋮
      σ₁p/(√σ₁₁√σ_pp)  σ₂p/(√σ₂₂√σ_pp) ···  σ_pp/(√σ_pp√σ_pp) ]

  = [ 1    ρ₁₂  ···  ρ₁p
      ρ₁₂  1    ···  ρ₂p
      ⋮
      ρ₁p  ρ₂p  ···  1 ]        (2-34)

and let the p × p standard deviation matrix be

V^(1/2) = [√σ₁₁ 0 ··· 0; 0 √σ₂₂ ··· 0; ⋮ ; 0 0 ··· √σ_pp]        (2-35)

Then it is easily verified (see Exercise 2.23) that

V^(1/2) ρ V^(1/2) = Σ        (2-36)

and

ρ = (V^(1/2))⁻¹ Σ (V^(1/2))⁻¹        (2-37)

That is, Σ can be obtained from V^(1/2) and ρ, whereas ρ can be obtained from Σ. Moreover, the expression of these relationships in terms of matrix operations allows the calculations to be conveniently implemented on a computer.

Example 2.14 (Computing the correlation matrix from the covariance matrix)
Suppose

Σ = [4 1 2; 1 9 −3; 2 −3 25]

Obtain V^(1/2) and ρ.

Here

V^(1/2) = [√σ₁₁ 0 0; 0 √σ₂₂ 0; 0 0 √σ₃₃] = [2 0 0; 0 3 0; 0 0 5]

and

(V^(1/2))⁻¹ = [1/2 0 0; 0 1/3 0; 0 0 1/5]

Consequently, from (2-37), the correlation matrix ρ is given by

(V^(1/2))⁻¹ Σ (V^(1/2))⁻¹ = [1/2 0 0; 0 1/3 0; 0 0 1/5][4 1 2; 1 9 −3; 2 −3 25][1/2 0 0; 0 1/3 0; 0 0 1/5]
                         = [1 1/6 1/5; 1/6 1 −1/5; 1/5 −1/5 1]        ■
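As the text notes, relations (2-36) and (2-37) are convenient to carry out on a computer. A sketch for the covariance matrix of Example 2.14, in Python with numpy:

```python
import numpy as np

Sigma = np.array([[ 4.0,  1.0,  2.0],
                  [ 1.0,  9.0, -3.0],
                  [ 2.0, -3.0, 25.0]])

V_half = np.diag(np.sqrt(np.diag(Sigma)))          # standard deviation matrix, (2-35)
V_half_inv = np.linalg.inv(V_half)

rho = V_half_inv @ Sigma @ V_half_inv              # Equation (2-37)
print(rho)                                         # [[1, 1/6, 1/5], [1/6, 1, -1/5], [1/5, -1/5, 1]]
print(np.allclose(V_half @ rho @ V_half, Sigma))   # Equation (2-36) recovers Sigma
```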
Partitioning the Covariance Matrix

Often, the characteristics measured on individual trials will fall naturally into two or more groups. As examples, consider measurements of variables representing consumption and income or variables representing personality traits and physical characteristics. One approach to handling these situations is to let the characteristics defining the distinct groups be subsets of the total collection of characteristics. If the total collection is represented by a (p × 1)-dimensional random vector X, the subsets can be regarded as components of X and can be sorted by partitioning X.

In general, we can partition the p characteristics contained in the p × 1 random vector X into, for instance, two groups of size q and p − q, respectively. For example, we can write

X = [X₁; ⋮ ; X_q; X_{q+1}; ⋮ ; X_p] = [X⁽¹⁾; X⁽²⁾]   (q rows; p − q rows)

and

μ = E(X) = [μ₁; ⋮ ; μ_q; μ_{q+1}; ⋮ ; μ_p] = [μ⁽¹⁾; μ⁽²⁾]        (2-38)
From the definitions of the transpose and matrix multiplication,

(X⁽¹⁾ − μ⁽¹⁾)(X⁽²⁾ − μ⁽²⁾)'
  = [X₁ − μ₁; X₂ − μ₂; ⋮ ; X_q − μ_q][X_{q+1} − μ_{q+1}, X_{q+2} − μ_{q+2}, ..., X_p − μ_p]
  = [ (X₁ − μ₁)(X_{q+1} − μ_{q+1})   (X₁ − μ₁)(X_{q+2} − μ_{q+2})   ···  (X₁ − μ₁)(X_p − μ_p)
      (X₂ − μ₂)(X_{q+1} − μ_{q+1})   (X₂ − μ₂)(X_{q+2} − μ_{q+2})   ···  (X₂ − μ₂)(X_p − μ_p)
      ⋮
      (X_q − μ_q)(X_{q+1} − μ_{q+1})  (X_q − μ_q)(X_{q+2} − μ_{q+2})  ···  (X_q − μ_q)(X_p − μ_p) ]

Upon taking the expectation of the matrix (X⁽¹⁾ − μ⁽¹⁾)(X⁽²⁾ − μ⁽²⁾)', we get

E(X⁽¹⁾ − μ⁽¹⁾)(X⁽²⁾ − μ⁽²⁾)' = [σ_{1,q+1} σ_{1,q+2} ··· σ_{1p}; σ_{2,q+1} σ_{2,q+2} ··· σ_{2p}; ⋮ ; σ_{q,q+1} σ_{q,q+2} ··· σ_{qp}] = Σ₁₂        (2-39)

which gives all the covariances, σ_ij, i = 1, 2, ..., q, j = q + 1, q + 2, ..., p, between a component of X⁽¹⁾ and a component of X⁽²⁾. Note that the matrix Σ₁₂ is not necessarily symmetric or even square.

Making use of the partitioning in Equation (2-38), we can easily demonstrate that

(X − μ)(X − μ)' = [ (X⁽¹⁾ − μ⁽¹⁾)(X⁽¹⁾ − μ⁽¹⁾)'   (X⁽¹⁾ − μ⁽¹⁾)(X⁽²⁾ − μ⁽²⁾)'
                    (X⁽²⁾ − μ⁽²⁾)(X⁽¹⁾ − μ⁽¹⁾)'   (X⁽²⁾ − μ⁽²⁾)(X⁽²⁾ − μ⁽²⁾)' ]

and consequently,

Σ (p×p) = E(X − μ)(X − μ)' = [ Σ₁₁  Σ₁₂
                               Σ₂₁  Σ₂₂ ]   (blocks of size q and p − q)

  = [ σ₁₁ ··· σ₁q | σ_{1,q+1} ··· σ₁p
      ⋮                ⋮
      σ_{q1} ··· σ_{qq} | σ_{q,q+1} ··· σ_{qp}
      ------------------------------------------
      σ_{q+1,1} ··· σ_{q+1,q} | σ_{q+1,q+1} ··· σ_{q+1,p}
      ⋮                ⋮
      σ_{p1} ··· σ_{pq} | σ_{p,q+1} ··· σ_{pp} ]        (2-40)

Note that Σ₁₂ = Σ₂₁'. The covariance matrix of X⁽¹⁾ is Σ₁₁, that of X⁽²⁾ is Σ₂₂, and that of elements from X⁽¹⁾ and X⁽²⁾ is Σ₁₂ (or Σ₂₁).

The Mean Vector and Covariance Matrix for Linear Combinations of Random Variables
Recall that if a single random variable, such as X₁, is multiplied by a constant c, then

E(cX₁) = cE(X₁) = cμ₁

and

Var(cX₁) = E(cX₁ − cμ₁)² = c²Var(X₁) = c²σ₁₁

If X₂ is a second random variable and a and b are constants, then, using additional properties of expectation, we get

Cov(aX₁, bX₂) = E(aX₁ − aμ₁)(bX₂ − bμ₂) = ab E(X₁ − μ₁)(X₂ − μ₂) = ab Cov(X₁, X₂) = ab σ₁₂

Finally, for the linear combination aX₁ + bX₂, we have

E(aX₁ + bX₂) = aE(X₁) + bE(X₂) = aμ₁ + bμ₂
Var(aX₁ + bX₂) = E[(aX₁ + bX₂) − (aμ₁ + bμ₂)]²
               = E[a(X₁ − μ₁) + b(X₂ − μ₂)]²
               = E[a²(X₁ − μ₁)² + b²(X₂ − μ₂)² + 2ab(X₁ − μ₁)(X₂ − μ₂)]
               = a²Var(X₁) + b²Var(X₂) + 2ab Cov(X₁, X₂)        (2-41)
With c' = [a, b], aX₁ + bX₂ can be written as

[a b][X₁; X₂] = c'X

Similarly, E(aX₁ + bX₂) = aμ₁ + bμ₂ can be expressed as

[a b][μ₁; μ₂] = c'μ

If we let

Σ = [σ₁₁ σ₁₂; σ₁₂ σ₂₂]

be the variance-covariance matrix of X, Equation (2-41) becomes

Var(aX₁ + bX₂) = Var(c'X) = c'Σc        (2-42)

since

c'Σc = [a b][σ₁₁ σ₁₂; σ₁₂ σ₂₂][a; b] = a²σ₁₁ + 2ab σ₁₂ + b²σ₂₂

The preceding results can be extended to a linear combination of p random variables:

The linear combination c'X = c₁X₁ + ··· + c_pX_p has
  mean = E(c'X) = c'μ
  variance = Var(c'X) = c'Σc        (2-43)
where μ = E(X) and Σ = Cov(X).

In general, consider the q linear combinations of the p random variables X₁, ..., X_p:

Z₁ = c₁₁X₁ + c₁₂X₂ + ··· + c₁pX_p
Z₂ = c₂₁X₁ + c₂₂X₂ + ··· + c₂pX_p
⋮
Z_q = c_{q1}X₁ + c_{q2}X₂ + ··· + c_{qp}X_p

or

Z = [Z₁; Z₂; ⋮ ; Z_q] = [c₁₁ c₁₂ ··· c₁p; c₂₁ c₂₂ ··· c₂p; ⋮ ; c_{q1} c_{q2} ··· c_{qp}][X₁; X₂; ⋮ ; X_p] = CX

The linear combinations Z = CX have

μ_Z = E(Z) = E(CX) = Cμ_X        (2-44)
Σ_Z = Cov(Z) = Cov(CX) = CΣ_X C'        (2-45)

where μ_X and Σ_X are the mean vector and variance-covariance matrix of X, respectively. (See Exercise 2.28 for the computation of the off-diagonal terms in CΣ_X C'.) We shall rely heavily on the result in (2-45) in our discussions of principal components and factor analysis in Chapters 8 and 9.

Example 2.15 (Means and covariances of linear combinations)
(Means and covariances of linear combinations)
Let X' = [ X1 , X2 ] be a random vector with mean vector p� = variance-covariance matrix
[ IL t , �J-2] and
Find the mean vector and covariance matrix for the linear combinations z, = x, - X2 Z2 = x, + X2
or
�n terms of P x and Ix . Here P z = E ( Z ) = Cp x =
and
[ 11 - 11 J [ ,..� J [ /J-t - /J-2 ] 1 2
IL t
+ /J- 2
80
Chap. 2
= [ 11
M atrix Algebra and Random Vectors
[ CTu - 2CT-t z
I z = Cov ( Z ) = C i x C ' _
+
CTzz
u, ,
CTzz
CTt t
-1 1
][
u, 1 u 1 2 u1 2 u22
ul l - CTzz + 2 ut z +
CTzz
X
J
] [ - 11 11 ]
Note that if u1 1 = u22-that is, if 1 and X2 have equal variances-the off diagonal terms in I z vanish. This demonstrates the well-known result that the sum and difference of two random variables with identical variances are • uncorrelated. Partitioning the Sample Mean Vector and Covariance Matrix
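Result (2-45) for Example 2.15 is easy to verify for any particular μ_X and Σ_X. The sketch below, in Python with numpy, uses arbitrary illustrative values (chosen with σ₁₁ = σ₂₂ so the off-diagonal terms vanish):

```python
import numpy as np

C = np.array([[1.0, -1.0],
              [1.0,  1.0]])
mu_X = np.array([2.0, 5.0])               # arbitrary illustrative mean vector
Sigma_X = np.array([[3.0, 1.0],
                    [1.0, 3.0]])          # arbitrary, with sigma11 = sigma22

mu_Z = C @ mu_X                           # [mu1 - mu2, mu1 + mu2]
Sigma_Z = C @ Sigma_X @ C.T               # C Sigma_X C', Equation (2-45)

print(mu_Z)        # [-3., 7.]
print(Sigma_Z)     # [[4., 0.], [0., 8.]]: off-diagonals are 0 because sigma11 = sigma22
```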
Partitioning the Sample Mean Vector and Covariance Matrix

Many of the matrix results in this section have been expressed in terms of population means and variances (covariances). The results in (2-36), (2-37), (2-38), and (2-40) also hold if the population quantities are replaced by their appropriately defined sample counterparts.

Let x̄' = [x̄₁, x̄₂, ..., x̄_p] be the vector of sample averages constructed from n observations on p variables X₁, X₂, ..., X_p, and let

S_n = [ (1/n)Σ_{j=1}^{n}(x_{j1} − x̄₁)²             ···  (1/n)Σ_{j=1}^{n}(x_{j1} − x̄₁)(x_{jp} − x̄_p)
        ⋮
        (1/n)Σ_{j=1}^{n}(x_{jp} − x̄_p)(x_{j1} − x̄₁)  ···  (1/n)Σ_{j=1}^{n}(x_{jp} − x̄_p)² ]

be the corresponding sample variance-covariance matrix.

The sample mean vector and the covariance matrix can be partitioned in order to distinguish quantities corresponding to groups of variables. Thus,

x̄ (p×1) = [x̄₁; ⋮ ; x̄_q; x̄_{q+1}; ⋮ ; x̄_p] = [x̄⁽¹⁾; x̄⁽²⁾]        (2-46)

and

S_n (p×p) = [ S₁₁  S₁₂
              S₂₁  S₂₂ ]   (blocks of size q and p − q)        (2-47)

where x̄⁽¹⁾ and x̄⁽²⁾ are the sample mean vectors constructed from observations x⁽¹⁾ = [x₁, ..., x_q]' and x⁽²⁾ = [x_{q+1}, ..., x_p]', respectively; S₁₁ is the sample covariance matrix computed from observations x⁽¹⁾; S₂₂ is the sample covariance matrix computed from observations x⁽²⁾; and S₁₂ = S₂₁' is the sample covariance matrix for elements of x⁽¹⁾ and elements of x⁽²⁾.
2.7 MATRIX INEQUALITIES AND MAXIMIZATION
Maximization principles play an important role in several multivariate techniques. Linear discriminant analysis, for example, is concerned with allocating observations to predetermined groups. The allocation rule is often a linear function of measurements that maximizes the separation between groups relative to their within-group variability. As another example, principal components are linear combinations of measurements with maximum variability. The matrix inequalities presented in this section will easily allow us to derive certain maximization results, which will be referenced in later chapters.

Cauchy-Schwarz Inequality. Let b and d be any two p × 1 vectors. Then

(b'd)² ≤ (b'b)(d'd)        (2-48)

with equality if and only if b = cd (or d = cb) for some constant c.

Proof. The inequality is obvious if either b = 0 or d = 0. Excluding this possibility, consider the vector b − xd, where x is an arbitrary scalar. Since the length of b − xd is positive for b − xd ≠ 0, in this case

0 < (b − xd)'(b − xd) = b'b − xd'b − b'(xd) + x²d'd = b'b − 2x(b'd) + x²(d'd)

The last expression is quadratic in x. If we complete the square by adding and subtracting the scalar (b'd)²/d'd, we get
0 < b'b − (b'd)²/d'd + [(b'd)²/d'd − 2x(b'd) + x²(d'd)]
  = b'b − (b'd)²/d'd + (d'd)[x − b'd/d'd]²

The term in brackets is zero if we choose x = b'd/d'd, so we conclude that

0 < b'b − (b'd)²/d'd

or (b'd)² < (b'b)(d'd) if b ≠ xd for some x.

Note that if b = cd, 0 = (b − cd)'(b − cd), and the same argument produces (b'd)² = (b'b)(d'd). ■

A simple, but important, extension of the Cauchy-Schwarz inequality follows directly.

Extended Cauchy-Schwarz Inequality. Let b (p×1) and d (p×1) be any two vectors, and let B (p×p) be a positive definite matrix. Then

(b'd)² ≤ (b'Bb)(d'B⁻¹d)        (2-49)

with equality if and only if b = cB⁻¹d (or d = cBb) for some constant c.

Proof. The inequality is obvious when b = 0 or d = 0. For cases other than these, consider the square-root matrix B^(1/2) defined in terms of its eigenvalues λ_i and the normalized eigenvectors e_i as B^(1/2) = Σ_{i=1}^{p} √λ_i e_i e_i'. If we set [see also (2-22)]

B^(−1/2) = Σ_{i=1}^{p} (1/√λ_i) e_i e_i'

it follows that

b'd = b'Id = b'B^(1/2)B^(−1/2)d = (B^(1/2)b)'(B^(−1/2)d)

and the proof is completed by applying the Cauchy-Schwarz inequality to the vectors (B^(1/2)b) and (B^(−1/2)d). ■

The extended Cauchy-Schwarz inequality gives rise to the following maximization result.

Maximization Lemma. Let B (p×p) be positive definite and d (p×1) be a given vector. Then, for an arbitrary nonzero vector x (p×1),

max_{x≠0} (x'd)²/(x'Bx) = d'B⁻¹d        (2-50)

with the maximum attained when x = cB⁻¹d for any constant c ≠ 0.

Proof. By the extended Cauchy-Schwarz inequality, (x'd)² ≤ (x'Bx)(d'B⁻¹d). Because x ≠ 0 and B is positive definite, x'Bx > 0. Dividing both sides of the inequality by the positive scalar x'Bx yields the upper bound

(x'd)²/(x'Bx) ≤ d'B⁻¹d

Taking the maximum over x gives Equation (2-50) because the bound is attained for x = cB⁻¹d. ■

A final maximization result will provide us with an interpretation of eigenvalues.

Maximization of Quadratic Forms for Points on the Unit Sphere. Let B (p×p) be a positive definite matrix with eigenvalues λ₁ ≥ λ₂ ≥ ··· ≥ λ_p ≥ 0 and associated normalized eigenvectors e₁, e₂, ..., e_p. Then

max_{x≠0} x'Bx/x'x = λ₁   (attained when x = e₁)
min_{x≠0} x'Bx/x'x = λ_p   (attained when x = e_p)        (2-51)

Moreover,

max_{x ⊥ e₁,...,e_k} x'Bx/x'x = λ_{k+1}   (attained when x = e_{k+1}, k = 1, 2, ..., p − 1)        (2-52)

where the symbol ⊥ is read "is perpendicular to."

Proof. Let P (p×p) be the orthogonal matrix whose columns are the eigenvectors e₁, e₂, ..., e_p and Λ be the diagonal matrix with eigenvalues λ₁, λ₂, ..., λ_p along the main diagonal. Let B^(1/2) = PΛ^(1/2)P' [see (2-22)] and y (p×1) = P'x. Consequently, x ≠ 0 implies y ≠ 0. Thus,

x'Bx/x'x = x'B^(1/2)B^(1/2)x / x'PP'x = x'PΛ^(1/2)P'PΛ^(1/2)P'x / y'y = y'Λy/y'y
         = Σ_{i=1}^{p} λ_i y_i² / Σ_{i=1}^{p} y_i² ≤ λ₁ Σ_{i=1}^{p} y_i² / Σ_{i=1}^{p} y_i² = λ₁        (2-53)

Setting x = e₁ gives

y = P'e₁ = [1; 0; ⋮ ; 0]

since

e_k'e₁ = 1 for k = 1   and   e_k'e₁ = 0 for k ≠ 1

For this choice of x, y'Λy/y'y = λ₁/1 = λ₁, or

e₁'Be₁/e₁'e₁ = λ₁        (2-54)

A similar argument produces the second part of (2-51).

Now, x = Py = y₁e₁ + y₂e₂ + ··· + y_pe_p, so x ⊥ e₁, ..., e_k implies

0 = e_i'x = y₁e_i'e₁ + ··· + y_pe_i'e_p = y_i,   i ≤ k

Therefore, for x perpendicular to the first k eigenvectors e_i, the left-hand side of the inequality in (2-53) becomes

x'Bx/x'x = Σ_{i=k+1}^{p} λ_i y_i² / Σ_{i=k+1}^{p} y_i²

Taking y_{k+1} = 1, y_{k+2} = ··· = y_p = 0 gives the asserted maximum. ■
For a fixed x₀ ≠ 0, x₀'Bx₀/x₀'x₀ has the same value as x'Bx, where x' = x₀'/√(x₀'x₀) is of unit length. Consequently, Equation (2-51) says that the largest eigenvalue, λ₁, is the maximum value of the quadratic form x'Bx for all points x whose distance from the origin is unity. Similarly, λ_p is the smallest value of the quadratic form for all points x one unit from the origin. The largest and smallest eigenvalues thus represent extreme values of x'Bx for points on the unit sphere. The "intermediate" eigenvalues of the p × p positive definite matrix B also have an interpretation as extreme values when x is further restricted to be perpendicular to the earlier choices.
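The extreme values in (2-51) can be checked numerically by comparing x'Bx/x'x over many random directions with the largest and smallest eigenvalues. A sketch in Python with numpy, using an arbitrary positive definite B:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
B = M @ M.T + 4 * np.eye(4)            # an arbitrary positive definite matrix

lam, E = np.linalg.eigh(B)             # ascending eigenvalues, orthonormal eigenvectors
ratios = []
for _ in range(5000):
    x = rng.standard_normal(4)
    ratios.append(x @ B @ x / (x @ x))

print(max(ratios) <= lam[-1] + 1e-9)   # never exceeds the largest eigenvalue
print(min(ratios) >= lam[0] - 1e-9)    # never falls below the smallest eigenvalue
e1 = E[:, -1]                          # eigenvector of the largest eigenvalue
print(np.isclose(e1 @ B @ e1 / (e1 @ e1), lam[-1]))   # the bound is attained at x = e1
```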
SUPPLEMENT 2A

Vectors and Matrices: Basic Concepts

Vectors

Many concepts, such as a person's health, intellectual abilities, or personality, cannot be adequately quantified as a single number. Rather, several different measurements x₁, x₂, ..., x_m are required.
Definition 2A.1. An m-tuple of real numbers (x₁, x₂, ..., x_i, ..., x_m) arranged in a column is called a vector and is denoted by a boldfaced, lowercase letter. Examples of vectors are columns such as

x = [1; 3]   and   y = [2; −2]

Vectors are said to be equal if their corresponding entries are the same.

Definition 2A.2 (Scalar Multiplication). Let c be an arbitrary scalar. Then the product cx is a vector with ith entry cx_i.

To illustrate scalar multiplication, take c₁ = 5 and c₂ = −1.2. Then

c₁y = 5[2; −2] = [10; −10]   and   c₂y = (−1.2)[2; −2] = [−2.4; 2.4]

Definition 2A.3 (Vector Addition). The sum of two vectors x and y, each having the same number of entries, is that vector

z = x + y   with ith entry   z_i = x_i + y_i

Thus, for example,

[1; 3] + [2; −2] = [3; 1]
Taking the zero vector, 0, to be the m-tuple (0, 0, ..., 0) and the vector −x to be the m-tuple (−x₁, −x₂, ..., −x_m), the two operations of scalar multiplication and vector addition can be combined in a useful manner.

Definition 2A.4. The space of all real m-tuples, with scalar multiplication and vector addition as just defined, is called a vector space.

Definition 2A.5. The vector y = a₁x₁ + a₂x₂ + ··· + a_kx_k is a linear combination of the vectors x₁, x₂, ..., x_k. The set of all linear combinations of x₁, x₂, ..., x_k is called their linear span.

Definition 2A.6. A set of vectors x₁, x₂, ..., x_k is said to be linearly dependent if there exist k numbers (a₁, a₂, ..., a_k), not all zero, such that

a₁x₁ + a₂x₂ + ··· + a_kx_k = 0

Otherwise the set of vectors is said to be linearly independent.

If one of the vectors, for example, x_i, is 0, the set is linearly dependent. (Let a_i be the only nonzero coefficient in Definition 2A.6.)

The familiar vectors with a one as an entry and zeros elsewhere are linearly independent. For m = 4,

x₁ = [1; 0; 0; 0],  x₂ = [0; 1; 0; 0],  x₃ = [0; 0; 1; 0],  x₄ = [0; 0; 0; 1]

so

a₁x₁ + a₂x₂ + a₃x₃ + a₄x₄ = [a₁; a₂; a₃; a₄] = [0; 0; 0; 0]

implies that a₁ = a₂ = a₃ = a₄ = 0.
As another example, let k = 3 and m = 3, and let

x₁ = [1; 1; 1],  x₂ = [2; 5; −1],  x₃ = [0; 1; −1]

Then

2x₁ − x₂ + 3x₃ = 0

Thus, x₁, x₂, x₃ are a linearly dependent set of vectors, since any one can be written as a linear combination of the others (for example, x₂ = 2x₁ + 3x₃).

Definition 2A.7. Any set of m linearly independent vectors is called a basis for the vector space of all m-tuples of real numbers.

Result 2A.1. Every vector can be expressed as a unique linear combination of a fixed basis. ■

With m = 4, the usual choice of a basis is

[1; 0; 0; 0],  [0; 1; 0; 0],  [0; 0; 1; 0],  [0; 0; 0; 1]

These four vectors were shown to be linearly independent. Any vector x can be uniquely expressed as

x₁[1; 0; 0; 0] + x₂[0; 1; 0; 0] + x₃[0; 0; 1; 0] + x₄[0; 0; 0; 1] = [x₁; x₂; x₃; x₄] = x

A vector consisting of m elements may be regarded geometrically as a point in m-dimensional space. For example, with m = 2, the vector x may be regarded as representing the point in the plane with coordinates x₁ and x₂. Vectors have the geometrical properties of length and direction.

(Figure: the vector x' = [x₁, x₂] plotted as a point in the plane.)
Definition 2A.8. The length of a vector of m elements emanating from the origin is given by the Pythagorean formula:

length of x = L_x = √(x₁² + x₂² + ··· + x_m²)

Definition 2A.9. The angle θ between two vectors x and y, both having m entries, is defined from

cos(θ) = (x₁y₁ + x₂y₂ + ··· + x_m y_m)/(L_x L_y)

where L_x = length of x and L_y = length of y, x₁, x₂, ..., x_m are the elements of x, and y₁, y₂, ..., y_m are the elements of y.

Let

x = [−1; 5; 2; −2]   and   y = [4; −3; 0; 1]

Then the length of x, the length of y, and the cosine of the angle between the two vectors are

length of x = √((−1)² + 5² + 2² + (−2)²) = √34 = 5.83
length of y = √(4² + (−3)² + 0² + 1²) = √26 = 5.10

and

cos(θ) = (1/√34)(1/√26)[(−1)4 + 5(−3) + 2(0) + (−2)1] = (1/(5.83 × 5.10))(−21) = −.706

Consequently, θ = 135°.

Definition 2A.10. The inner (or dot) product of two vectors x and y with the same number of entries is defined as the sum of component products:

x₁y₁ + x₂y₂ + ··· + x_m y_m

We use the notation x'y or y'x to denote this inner product.
With the x'y notation, we may express the length of a vector and the cosine of the angle between two vectors as

L_x = length of x = √(x₁² + x₂² + ··· + x_m²) = √(x'x)

cos(θ) = x'y / (√(x'x) √(y'y))

Definition 2A.11. When the angle between two vectors x, y is θ = 90° or 270°, we say that x and y are perpendicular. Since cos(θ) = 0 only if θ = 90° or 270°, the condition becomes

x and y are perpendicular if x'y = 0

We write x ⊥ y.

The basis vectors

[1; 0; 0; 0],  [0; 1; 0; 0],  [0; 0; 1; 0],  [0; 0; 0; 1]

are mutually perpendicular. Also, each has length unity. The same construction holds for any number of entries m.

Result 2A.2

(a) z is perpendicular to every vector if and only if z = 0.
(b) If z is perpendicular to each vector x₁, x₂, ..., x_k, then z is perpendicular to their linear span.
(c) Mutually perpendicular vectors are linearly independent. ■
Definition 2A.12. The projection (or shadow) of a vector x on a vector y is

projection of x on y = (x'y / y'y) y

If y has unit length so that L_y = 1,

projection of x on y = (x'y) y

Result 2A.3 (Gram-Schmidt Process). Given linearly independent vectors x₁, x₂, ..., x_k, there exist mutually perpendicular vectors u₁, u₂, ..., u_k with the same linear span. These may be constructed sequentially by setting

u₁ = x₁
u₂ = x₂ − (x₂'u₁ / u₁'u₁) u₁
⋮
u_k = x_k − (x_k'u₁ / u₁'u₁) u₁ − ··· − (x_k'u_{k−1} / u_{k−1}'u_{k−1}) u_{k−1}

We can also convert the u's to unit length by setting z_j = u_j / √(u_j'u_j). In this construction, (x_k'z_j) z_j is the projection of x_k on z_j, and Σ_{j=1}^{k−1} (x_k'z_j) z_j is the projection of x_k on the linear span of x₁, x₂, ..., x_{k−1}. ■
For example, to construct perpendicular vectors from

x₁ = [4; 0; 0; 2]   and   x₂ = [3; 1; 0; −1]

we take

u₁ = x₁ = [4; 0; 0; 2]

so u₁'u₁ = 4² + 0² + 0² + 2² = 20, and

x₂'u₁ = 3(4) + 1(0) + 0(0) − 1(2) = 10

Thus,

u₂ = x₂ − (10/20) u₁ = [3; 1; 0; −1] − [2; 0; 0; 1] = [1; 1; 0; −2]

and the unit-length vectors are z₁ = u₁/√20 and z₂ = u₂/√6.
Chap. 2
M atrix Algebra and Random Vectors
Matrices
Defin ition 2A. 1 3 . An m k matrix, generally denoted by a boldface uppercase letter such as A, R, I, and so forth, is a rectangular array of elements having m rows and k columns. Examples of matrices are
X
] [ -7
A=
2
0 1 ' 3 4 I=
B
[
OJ
= x4 -23 1 / ' x
[ -�
]
.7 - .3 2 1 ' - .3 1 8
E
I= = [e1 ]
[� n 0 1
0
In our work, the matrix elements will be real numbers or functions taking on val ues in the real numbers. Definition 2A. 1 4. The dimension (abbreviated dim) of an m k matrix is the ordered pair (m, k ) ; m is the row dimension and k is the column dimension. The dimension of a matrix is frequently indicated in parentheses below the letter representing the matrix. Thus, the m k matrix A is denoted by A . In the pre-
X
X
(m x k )
ceding examples, the dimension of the matrix I is 3 be conveyed by writing I . An m
(3 X 3 )
[
X 3, and this information can
A, of arbitrary constants can be written al l a1 z al k az 1 azz az k A . . . . (m X k ) . . am l a m z . a m k or more compactly as A = {a;i }, where the index i refers to the row and the (m X k )
X
k matrix, say,
0 0
• •
•
0
• •
J
index j refers to the column. An m 1 matrix is referred to as a column vector. A 1 k matrix is referred to as a row vector. Since matrices can be considered as vectors side by side, it is nat ural to define multiplication by a scalar and the addition of two matrices with the same dimensions.
X
X
A = {aii } and B = {b;i } are said to (m X k ) (m X k ) be equal, written A = B, if aii = bii • i = 1 , 2, . . . , m, j = 1 , 2, . . . , k. That is, two Definition 2A. 1 5. Two matrices
matrices are equal if:
(a) Their dimensionality is the same.
(b) Every corresponding element is the same.
Definition 2A.16 (Matrix Addition). Let the matrices A and B both be of dimension m × k with arbitrary elements a_ij and b_ij, i = 1, 2, ..., m, j = 1, 2, ..., k, respectively. The sum of the matrices A and B is an m × k matrix C, written C = A + B, such that the arbitrary element of C is given by

c_ij = a_ij + b_ij,   i = 1, 2, ..., m,  j = 1, 2, ..., k

Note that the addition of matrices is defined only for matrices of the same dimension. For example,

[4 3 2; 1 0 5] + [2 5 8; 3 4 2] = [6 8 10; 4 4 7]
       A       +       B       =        C
Definition 2A.17 (Scalar Multiplication). Let c be an arbitrary scalar and A (m×k) = {a_ij}. Then cA = Ac = B (m×k) = {b_ij}, where b_ij = ca_ij = a_ij c, i = 1, 2, ..., m, j = 1, 2, ..., k.

Multiplication of a matrix by a scalar produces a new matrix whose elements are the elements of the original matrix, each multiplied by the scalar. For example, if c = 2,

2 [3 6; −4 5; 2 0] = [3 6; −4 5; 2 0] 2 = [6 12; −8 10; 4 0]
        cA         =         Ac        =        B
Definition 2A.18 (Matrix Subtraction). Let A (m×k) = {a_ij} and B (m×k) = {b_ij} be two matrices of equal dimension. Then the difference between A and B, written A − B, is an m × k matrix C = {c_ij} given by

C = A − B = A + (−1)B

That is, c_ij = a_ij + (−1)b_ij = a_ij − b_ij, i = 1, 2, ..., m, j = 1, 2, ..., k.

Definition 2A.19. Consider the m × k matrix A with arbitrary elements a_ij, i = 1, 2, ..., m, j = 1, 2, ..., k. The transpose of the matrix A, denoted by A', is the k × m matrix with elements a_ji, j = 1, 2, ..., k, i = 1, 2, ..., m. That is, the transpose of the matrix A is obtained from A by interchanging the rows and columns.
As an example, if

A (2×3) = [2 1 3; 7 −4 6]

then

A' (3×2) = [2 7; 1 −4; 3 6]
Result 2A.4. For all matrices A, B, and C (of equal dimension) and scalars c and d, the following hold:

(a) (A + B) + C = A + (B + C)
(b) A + B = B + A
(c) c(A + B) = cA + cB
(d) (c + d)A = cA + dA
(e) (A + B)' = A' + B'
    (That is, the transpose of the sum is equal to the sum of the transposes.)
(f) (cd)A = c(dA)
(g) (cA)' = cA'        ■
Definition 2A.20. If an arbitrary matrix A has the same number of rows and columns, then A is called a square matrix. The matrices I , I, and E given after Definition 2A.13 are square matrices. Definition 2A.21 . Let A be a k k (square) matrix. Then A is said to be symmetric if A = A'. That is, A is symmetric a;i = ai i• i = 1, 2, . . . , k, j = 1, 2, . . . , k.
X
Examples of symmetric matrices are
I =
(3 X 3 )
[� � � ]'
A
(2 X 2)
0 0 1
Defin ition 2A.22.
The
k
=
x
[! �] .
B =
(4 X 4)
k identity matrix, denoted by I , is the (k X k)
square matrix with ones on the main (NW-SE) diagonal and zeros elsewhere. The 3 3 identity matrix is shown before this definition.
X
AB of an m X n X k matrix B = {b;i} isThetheproduct m X k matrix C whose
Defin ition 2A.23 (Matrix Multiplication) .
matrix A = (a;i } and an elements are
C;i
n
= ,L
n
a bi t=l u t
i
= 1, 2,
... , m j
= 1, 2 ,
... , k
Note that for the product AB to be defined, the column dimension of A must equal the row dimension of B. If that is so, then the row dimension of AB equals the row dimension of A, and the column dimension of AB equals the column dimension of B. For example, let
A
(2 X 3)
=
[!
Then
[3
-1 2 4 0 5 (2 X 3)
]
�]
-1 0
[ ! -� ] 4
(3 X 2)
and
B
(3 X 2)
=
] � [:
[ 3211 2031 J [ ee2l 1l ee222 1 J
3
(2 X 2 )
where
e1 1 e1 2 e2 1 e22
= = = =
(3) (3) (3) (4) (4) (3) (4) (4)
+ +
+ +
(- 1) (6) + (2) (4) = 11 (- 1) ( - 2) + (2) (3) = 20 (0) (6) + (5) (4) = 32 (0) ( - 2) + (5) (3) = 31
As an additional example, consider the product of two vectors. Let
Then x' = x' y � [l
Ul
[1 0 -2 3] and 0
-
2
3]
l �l
� [ - 20] � [2 - 3 - 1 - 8] -
� y' x
Note that the product xy is undefined, since x is a 4 1 matrix and y is a 4 1 matrix, so the column dim of x, 1, is unequal to the row dim of y, 4. If x and y are vectors of the same dimension, such as n 1, both of the products x' y and xy' are defined. In particular, y' x = x' y = x 1 y 1 + x2 y2 + . . . + X n Yn ' and xy ' is an n n matrix with i, jth element X;Yj ·
X
X
X X
Resu lt 2A.5. For all matrices A, B, and C (of dimensions such that the indi cated products are defined) and a scalar e ,
(a) c(AB) = (cA)B
A (BC) = (AB) C (c) A (B + C) = AB + AC (d) (B + C)A = BA + CA (e) (AB)' = B' A' (b)
More generally, for any xj such that Axj is defined, (f)
(g)
n
n
2: Ax = A 2: xj j= I j= I j n � (Ax) (Ax)' = A ( �n xj xf ) A'
•
There are several important differences between the algebra of matrices and the algebra of real numbers. Two of these differences are as follows: 1.
Matrix multiplication is, in general, not commutative. That is, in general, AB =I= BA. Several examples will illustrate the failure of the commutative law (for matrices).
but
is not defined.
but
Also,
[�
0 -3
H n [�
]�
[ 0 - 11 J 4
[ -� n �]
0 -3
[
2 1 -3
4
J
[
[ 359 3310 ] 19 - 18 -1 -3 4 10 - 12 26;
[ 11 O J -3 4
]
Su pplement 2A
[ - 23 41 ] [ 40 - 11 ] - [ - 128 - 71 ]
but
2.
97
Vectors and Matrices: Basic Concepts
Let 0 denote the zero matrix, that is, the matrix with zero for every element. In the algebra of real numbers, if the product of two numbers, ab, is zero, then a = 0 or b = 0. In matrix algebra, however, the product of two nonzero matrices may be the zero matrix. Hence,
AB
(m X n)(n x k) does not imply that
A = 0 or B = 0. For example,
It is true, however, that if either
A B
A
(m X n)
0
(m X n)
or
B
(n X k )
0
(n x k)
, then
0 .
(m X k)
(m X n) (n X k)
Definition 2A.24. The
denoted by
0
(m X k)
I A I , is the scalar
determinant of the square k if k
IA I
=
X k matrix A
=
{aii},
1
k
1+ a j:L lJ I A I J I (- 1) /
if k > 1 =l where A 11 is the (k - 1) (k - 1) matrix obtained by deleting the first row and k jth column of A. Also, I A I = :L ai1 1 Ai1 1 ( - l) i + f, with the ith row in place of the =
X
j= l
first row.
1 � !I
Examples of determii1ants (evaluated using Definition 2A.24) are
In general,
=
1 1 4 1 (- 1) 2 + 3 1 6 1 ( - 1) 3
=
1 (4) + 3 (6) ( - 1)
=
- 14
= 3 (39) - 1 ( - 3)
+
6 ( - 57) =
- 222
= 1 (1 ) = 1
If I is the
k k identity matrix, I I I
X
= 1.
al l a1 2 al 3 a2 1 G22 a2 3 a3 1 a3 2 a33
The determinant of any 3 3 matrix can be computed by summing the products of elements along the solid lines and subtracting the products along the dashed lines in the following diagram. This procedure is not valid for matrices of higher dimension, but in general, Definition 2A.24 can be employed to evaluate these determinants.
X
- - - - - -
'
�
'
' �.... -,. ''< �
.... ......
�
\
Y
\
We next want to state a result that describes some properties of the determi nant. However, we must first introduce some notions related to matrix inverses.
Definition 2A.25. The row rank of a matrix is the maximum number of lin early independent rows, considered as vectors (that is, row vectors). The column rank of a matrix is the rank of its set of columns, considered as vectors. For example, let the matrix
A =
[� � - � ] 0
1 -1
The rows of A, written as vectors, were shown to be linearly dependent after Def inition 2A.6. Note that the column rank of A is also 2, since
but columns 1 and 2 are linearly independent. This is no coincidence, as the fol lowing result indicates. Result 2A.6. The row rank and the column rank of a matrix are equal. •
rank of a matrix is either the row rank or the column rank. Definition 2A.26. A square matrix A is nonsingular if A x
Thus, the
implies that x
(k X k)
(k x l )
0
(k X I)
0
(k X k ) ( k X l)
(k X l )
. If a matrix fails to be nonsingular, it is called singular.
Equivalently, a square matrix is nonsingular if its rank is equal to the number of rows (or columns) it has. Note that Ax = x1 a 1 + x2 a 2 + · · · + xk a k , where a; is the ith column of A, so that the condition of nonsingularity is just the statement that the columns of A are linearly independent. Result 2A. 7. Let A be a nonsingular square matrix of dimension Then there is a unique k k matrix B such that
X
where I is the k
AB = BA = I
k
X k. •
X k identity matrix.
Definition 2A.27. The B such that AB = BA = I is called the inverse of A and is denoted by A - I . In fact, if BA = I or AB = I, then B = A - I , and both prod ucts must equal I. For example,
A =
[� �J
has A - I
[-f -n
[ 21 35 ] [ - � -;_7� ] [ t - � ] [ 21 53 ]
since
?_
Result 2A.8.
(a) The inverse of any
-7
7
2 X 2 matrix
is given by A- 1
(b) The inverse of any
_1_ IAI
3 X 3 matrix =
is given by
A- 1
=
1
1Af
l
a2 3 a33 a2 3 a33 a2 1 a22 a3 1 a 3 2
[
azz - az 1
1 ' ' _ , aa3l 1l aa31 22 1 ' aal 1l aa221 2 1 2
In both (a) and (b), it is clear that I A I =I= 0 if the inverse is to exist. (c) In general, A - 1 has j, ith entry [ 1 Aij i / I A I ] ( - 1 ) i + j, where A;j is the matrix • obtained from A by deleting the ith row and jth column. Result 2A.9. For a square matrix A of dimension
equivalent:
(a) A x (k X k ) (k X 1)
(b) I A I
=I= 0.
0
(k X 1}
implies x
(k X 1)
0
(k X 1}
k
X k, the following are
(A is nonsingular).
(c) There exists a matrix A - 1 such that AA - 1
=
A-1 A
=
I .
(k X k)
•
Result 2A. 1 0. Let A and B be square matrices of the same dimension, and let the indicated inverses exist. Then the following hold:
(a) (A - 1 ) ' = (A' ) - 1
•
(b) (AB) - 1 = B - 1 A - 1 The determinant has the following properties. Result 2A. 1 1 .
I I I I
Let A and B be k
(a) A = A'
X k square matrices. I I I I I II
(b) If each element of a row (column) of A is zero, then A = 0 (c) If any two rows (columns) of A are identical, then A = 0 (d) If A is nonsingular, then ! A = 1/ A - 1 ; that is, A A - 1 1 = 1 .
I I l l
li I i I
I
(e) AB = ! A B (t) eA = ck A , where c is a scalar.
I I
You are referred to [5] for proofs of parts of Results 2A.9 and 2A.l l. Some • of these proofs are rather complex and beyond the scope of this book. Definition 2A.28. Let A = {a;1 ) be a k k square matrix. The trace of the matrix A, written tr (A), is the sum of the diagonal elements; that is,
X
k
tr (A) = 2:: a;; · i=1
Result 2A. 1 2.
Let A and B be k
(a) tr (cA) = c tr (A)
(b) tr (A ± B) = tr (A) (c) tr (AB) = tr (BA)
(d) tr (B - 1 AB) = tr (A) (e) tr (AA' ) =
k k
2:: 2:: a5
i= l j= l
X k matrices and c be a scalar.
± tr (B) •
Definition 2A.29. A square matrix A is said to be orthogonal if its rows, considered as vectors, are mutually perpendicular and have unit lengths; that is, AA' = I.

Result 2A.13. A matrix A is orthogonal if and only if A⁻¹ = A'. For an orthogonal matrix, AA' = A'A = I, so the columns are also mutually perpendicular and have unit lengths. •
An example of an orthogonal matrix is

$$\mathbf{A} = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} & -\tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} & -\tfrac{1}{2} & -\tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}$$

Clearly, A = A', so AA' = A'A = AA. We verify that AA = I = AA' = A'A: the product of any row of A with itself is 4(1/4) = 1, and the product of any row with a different row consists of two terms equal to 1/4 and two terms equal to -1/4, which sum to zero. Hence AA = I, so A' = A⁻¹, and A must be an orthogonal matrix.

Square matrices are best understood in terms of quantities called eigenvalues and eigenvectors.
Definition 2A.30. Let A be a k × k square matrix and I be the k × k identity matrix. Then the scalars λ₁, λ₂, ..., λ_k satisfying the polynomial equation |A - λI| = 0 are called the eigenvalues (or characteristic roots) of the matrix A. The equation |A - λI| = 0 (as a function of λ) is called the characteristic equation.

For example, let

$$\mathbf{A} = \begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}$$

Then

$$|\mathbf{A} - \lambda\mathbf{I}| = \begin{vmatrix} 1-\lambda & 0 \\ 1 & 3-\lambda \end{vmatrix} = (1-\lambda)(3-\lambda) = 0$$
implies that there are two roots, λ₁ = 1 and λ₂ = 3. The eigenvalues of A are 3 and 1. Let

$$\mathbf{A} = \begin{bmatrix} 13 & -4 & 2 \\ -4 & 13 & -2 \\ 2 & -2 & 10 \end{bmatrix}$$

Then the equation

$$|\mathbf{A} - \lambda\mathbf{I}| = \begin{vmatrix} 13-\lambda & -4 & 2 \\ -4 & 13-\lambda & -2 \\ 2 & -2 & 10-\lambda \end{vmatrix} = -\lambda^3 + 36\lambda^2 - 405\lambda + 1458 = 0$$
has three roots: λ₁ = 9, λ₂ = 9, and λ₃ = 18; that is, 9, 9, and 18 are the eigenvalues of A.

Definition 2A.31. Let A be a square matrix of dimension k × k and let λ be an eigenvalue of A. If x is a nonzero vector (x ≠ 0) such that

$$\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$$

then x is said to be an eigenvector (characteristic vector) of the matrix A associated with the eigenvalue λ.

An equivalent condition for λ to be a solution of the eigenvalue-eigenvector equation is |A - λI| = 0. This follows because the statement that Ax = λx for some λ and x ≠ 0 implies that

$$(\mathbf{A} - \lambda\mathbf{I})\mathbf{x} = \mathbf{0}$$

That is, the columns of A - λI are linearly dependent so, by Result 2A.9(b), |A - λI| = 0, as asserted. Following Definition 2A.30, we have shown that the eigenvalues of

$$\mathbf{A} = \begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}$$
are λ₁ = 1 and λ₂ = 3. The eigenvectors associated with these eigenvalues can be determined by solving the following equations:

$$\begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 1\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \qquad\qquad \begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 3\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

From the first expression,

$$x_1 = x_1 \qquad x_1 + 3x_2 = x_2$$

or

$$x_1 = -2x_2$$

There are many solutions for x₁ and x₂. Setting x₂ = 1 (arbitrarily) gives x₁ = -2, and hence,

$$\mathbf{x} = \begin{bmatrix} -2 \\ 1 \end{bmatrix}$$

is an eigenvector corresponding to the eigenvalue 1. From the second expression,

$$x_1 = 3x_1 \qquad x_1 + 3x_2 = 3x_2$$

implies that x₁ = 0 and x₂ = 1 (arbitrarily), and hence,

$$\mathbf{x} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

is an eigenvector corresponding to the eigenvalue 3. It is usual practice to determine an eigenvector so that it has length unity. That is, if Ax = λx, we take e = x/√(x'x) as the eigenvector corresponding to λ. For example, the eigenvector for λ₁ = 1 is [-2/√5, 1/√5]'.

Definition 2A.32. A quadratic form Q(x) in the k variables x₁, x₂, ..., x_k is Q(x) = x'Ax, where x' = [x₁, x₂, ..., x_k] and A is a k × k symmetric matrix. Note that a quadratic form can be written as $Q(\mathbf{x}) = \sum_{i=1}^{k}\sum_{j=1}^{k} a_{ij}x_ix_j$. For example, with k = 2, Q(x) = a₁₁x₁² + 2a₁₂x₁x₂ + a₂₂x₂².
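The eigenvalue-eigenvector calculations above are easy to reproduce numerically. The following sketch is an illustration only (it assumes NumPy, which returns unit-length eigenvectors directly).

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 3.0]])

# Eigenvalues and unit-length eigenvectors (one eigenvector per column of `vectors`)
values, vectors = np.linalg.eig(A)
print(np.round(values, 4))      # eigenvalues 1 and 3 (ordering may vary)
print(np.round(vectors, 4))     # for lambda = 1 the column is proportional to [-2, 1]/sqrt(5)

# Check the defining relation A e = lambda e for each pair
for lam, e in zip(values, vectors.T):
    assert np.allclose(A @ e, lam * e)

# The characteristic polynomial: |A - lambda I| = (1 - lambda)(3 - lambda)
lam = 2.0
assert np.isclose(np.linalg.det(A - lam * np.eye(2)), (1 - lam) * (3 - lam))
```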
Any symmetric square matrix can be reconstructed from its eigenvalues and eigenvectors. The particular expression reveals the relative importance of each pair according to the relative size of the eigenvalue and the direction of the eigenvector.

Result 2A.14. The Spectral Decomposition. Let A be a k × k symmetric matrix. Then A can be expressed in terms of its k eigenvalue-eigenvector pairs (λᵢ, eᵢ) as

$$\mathbf{A} = \sum_{i=1}^{k} \lambda_i \mathbf{e}_i \mathbf{e}_i'$$
For example, let

$$\mathbf{A} = \begin{bmatrix} 2.2 & 0.4 \\ 0.4 & 2.8 \end{bmatrix}$$

Then

$$|\mathbf{A} - \lambda\mathbf{I}| = \lambda^2 - 5\lambda + 6.16 - 0.16 = (\lambda - 3)(\lambda - 2)$$

so A has eigenvalues λ₁ = 3 and λ₂ = 2. The corresponding eigenvectors are e₁' = [1/√5, 2/√5] and e₂' = [2/√5, -1/√5], respectively. Consequently,

$$\mathbf{A} = \begin{bmatrix} 2.2 & 0.4 \\ 0.4 & 2.8 \end{bmatrix} = 3\begin{bmatrix} \tfrac{1}{\sqrt{5}} \\ \tfrac{2}{\sqrt{5}} \end{bmatrix}\begin{bmatrix} \tfrac{1}{\sqrt{5}} & \tfrac{2}{\sqrt{5}} \end{bmatrix} + 2\begin{bmatrix} \tfrac{2}{\sqrt{5}} \\ -\tfrac{1}{\sqrt{5}} \end{bmatrix}\begin{bmatrix} \tfrac{2}{\sqrt{5}} & -\tfrac{1}{\sqrt{5}} \end{bmatrix} = \begin{bmatrix} 0.6 & 1.2 \\ 1.2 & 2.4 \end{bmatrix} + \begin{bmatrix} 1.6 & -0.8 \\ -0.8 & 0.4 \end{bmatrix}$$
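As a numerical illustration (not from the text; NumPy assumed), the spectral decomposition of the matrix above can be reassembled from its eigenvalue-eigenvector pairs.

```python
import numpy as np

A = np.array([[2.2, 0.4],
              [0.4, 2.8]])

# For a symmetric matrix, eigh returns real eigenvalues (ascending) and orthonormal eigenvectors
values, vectors = np.linalg.eigh(A)            # values = [2., 3.]

# Result 2A.14: A = sum_i lambda_i e_i e_i'
reconstructed = sum(lam * np.outer(e, e) for lam, e in zip(values, vectors.T))
assert np.allclose(reconstructed, A)

# Print the rank-one pieces of the expansion
for lam, e in zip(values, vectors.T):
    print(lam, np.round(lam * np.outer(e, e), 2))
```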
The ideas that lead to the spectral decomposition can be extended to provide a decomposition for a rectangular, rather than a square, matrix. If A is a rectangular matrix, then the vectors in the expansion of A are the eigenvectors of the square matrices AA' and A'A.

Result 2A.15. Singular-Value Decomposition. Let A be an m × k matrix of real numbers. Then there exist an m × m orthogonal matrix U and a k × k orthogonal matrix V such that

$$\mathbf{A} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}'$$

where the m × k matrix Λ has (i, i) entry λᵢ ≥ 0 for i = 1, 2, ..., min(m, k) and the other entries are zero. The positive constants λᵢ are called the singular values of A. •

The singular-value decomposition can also be expressed as a matrix expansion that depends on the rank r of A. Specifically, there exist r positive constants λ₁, λ₂, ..., λ_r, r orthogonal m × 1 unit vectors u₁, u₂, ..., u_r, and r orthogonal k × 1 unit vectors v₁, v₂, ..., v_r, such that

$$\mathbf{A} = \sum_{i=1}^{r} \lambda_i \mathbf{u}_i \mathbf{v}_i' = \mathbf{U}_r \boldsymbol{\Lambda}_r \mathbf{V}_r'$$
where U_r = [u₁, u₂, ..., u_r], V_r = [v₁, v₂, ..., v_r], and Λ_r is an r × r diagonal matrix with diagonal entries λᵢ. Here AA' has eigenvalue-eigenvector pairs (λᵢ², uᵢ), so

$$\mathbf{A}\mathbf{A}'\mathbf{u}_i = \lambda_i^2\mathbf{u}_i$$

with λ₁², λ₂², ..., λ_r² > 0 = λ²_{r+1}, λ²_{r+2}, ..., λ²_m (for m > k). Then vᵢ = λᵢ⁻¹A'uᵢ. Alternatively, the vᵢ are the eigenvectors of A'A with the same nonzero eigenvalues λᵢ². The matrix expansion for the singular-value decomposition written in terms of the full dimensional matrices U, V, Λ is

$$\mathbf{A}_{(m \times k)} = \mathbf{U}_{(m \times m)}\,\boldsymbol{\Lambda}_{(m \times k)}\,\mathbf{V}'_{(k \times k)}$$
where U has the m orthogonal eigenvectors of AA' as its columns, V has the k orthogonal eigenvectors of A'A as its columns, and Λ is specified in Result 2A.15.

For example, let

$$\mathbf{A} = \begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix}$$

Then

$$\mathbf{A}\mathbf{A}' = \begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix}\begin{bmatrix} 3 & -1 \\ 1 & 3 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 11 & 1 \\ 1 & 11 \end{bmatrix}$$

You may verify that the eigenvalues γ = λ² of AA' satisfy the equation γ² - 22γ + 120 = (γ - 12)(γ - 10) = 0, and consequently, the eigenvalues are γ₁ = λ₁² = 12 and γ₂ = λ₂² = 10. The corresponding eigenvectors are

$$\mathbf{u}_1' = \begin{bmatrix} \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \end{bmatrix} \quad \text{and} \quad \mathbf{u}_2' = \begin{bmatrix} \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} \end{bmatrix}$$

respectively. Also,

$$\mathbf{A}'\mathbf{A} = \begin{bmatrix} 3 & -1 \\ 1 & 3 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix} = \begin{bmatrix} 10 & 0 & 2 \\ 0 & 10 & 4 \\ 2 & 4 & 2 \end{bmatrix}$$

so |A'A - γI| = -γ³ + 22γ² - 120γ = -γ(γ - 12)(γ - 10), and the eigenvalues are γ₁ = λ₁² = 12, γ₂ = λ₂² = 10, and γ₃ = λ₃² = 0. The nonzero eigenvalues are the same as those of AA'. A computer calculation gives the eigenvectors
=
[
1
2
1
] , v2 = [ Vs2 I
]
- 1 0 and v3 =
Vs
,
V6 V6 Eigenvectors v1 and v2 can be verified by checking: {
V6
A' Av ,
=
A' A v, =
1
1 [ Y3Q
[ � � n � [�] � [ � ] [ I� I� n Js [ -n IO Js [ -n I
1
= 12 =
2 Y3Q
= Af v , = Al v,
-5 Y3Q
]
.
Chap. 2 Exercises
Taking A 1 A is
=
VU and
A = [ - 31
A2 =
1 1 3 1
]
1 07
VlO , we find that the singular-value decomposition of
2 V6
1 V6
J+
ViO
[Jz] [� -1
v2
v5
-=1. v5
o]
The equality may be checked by carrying out the operations on the right-hand side. EXERCISES
Let x ' = [5, 1, 3] and y' = [-1, 3, 1]. (a) Graph the two vectors. (b) Find (i) the length of x, (ii) the angle between x and y, and (iii) the pro jection of y on x. (c) Since .X = 3 and y = 1, graph [5 - 3, 1 - 3, 3 - 3] = [2, -2, 0] and [ - 1 - 1, 3 - 1, 1 - 1] = [-2, 2, 0]. 2.2. Given the matrices
2.1.
A=
[ - 41 23 ] '
perform the indicated multiplications.
(a) SA (b) BA (c) A'B' (d) C'B (e) Is AB defined?
2.3.
A = [ � !].
[ 51
]
Verify the following properties of the transpose when
B=
4 2 , and C 0 3
(a) (A')' = A (b) (C')- 1 = (c - t ) ' (c) (AB)' = B' A' (d) For general A and B , (AB)' = B' A'. (m x k)
(k x e)
=
1 08
Chap. 2 Matrix Algebra and Random Vectors 2.4.
2.5.
When A -I and B -I exist, prove each of the following. (a) (A')- 1 = (A - 1 ) ' (b) (AB ) - 1 = B- 1 A - I Hint: Part a can be proved by noting that AA- 1 = I, I = I', and (AA- 1 ) ' = (A - 1 ) ' A'. Part b follows from ( B - 1 A - 1 ) AB = B- 1 (A - 1 A) B = B- 1 B = I .
[ 5 12 ]
Check that
is an orthogonal matrix. 2.6. Let
Q= �1 3 �13 A=
[ - 29 - 26 ]
(a) Is A symmetric? Show that A is positive definite. Let A be as given in Exercise 2.6. (a) Determine the eigenvalues and eigenvectors of A. (b) Write the spectral decomposition of A. (c) Find A -I . (d) Find the eigenvalues and eigenvectors of A -I . (b)
2.7.
2.8.
Given the matrix
A=
[ 21 - 22 ]
find the eigenvalues t\ 1 and t\ 2 and the associated normalized eigenvectors e 1 and e 2 . Determine the spectral decomposition (2-16) of A. 2.9. Let A be as in Exercise 2.8. (a) Find A -I . (b) Compute the eigenvalues and eigenvectors of A - 1 . (c) Write the spectral decomposition of A-I, and compare it with that of A from Exercise 2.8. 2.10. Consider the matrices
A=
[ 44.001
4.001 4.002
]
and B =
4.001 4.001 4.002001
[4
J
These matrices are identical except for a small difference in the (2, 2) posi tion. Moreover, the columns of A (and B ) are nearly linearly dependent. Show that A- 1 = ( - 3 ) B- 1 . Consequently, small changes-perhaps caused by rounding-can give substantially different inverses.
Chap.
2 Exercises
1 09
Show that the determinant of the p p diagonal matrix A = {aii } with aii = 0, i * j, is given by the product of the diagonal elements; thus, \ A \ = a 1 1 a22 aPP . Hint: By Definition 2A.24, \ A \ = au Au + 0 + + 0. Repeat for the sub matrix A 1 1 obtained by deleting the first row and first column of A. 2.U. Show that the determinant of a square symmetric p p matrix A can be expressed as the product of its eigenvalues A 1 , A 2 , , AP ; that is, 2.11.
X
· · ·
· · ·
\A\
=
X
ITf= l A i .
• . •
Hint: From (2-16) and (2-20), A
= PAP' with P'P = I. From Result \ PAP' \ = \ P \ \ AP' \ = \ P \ \ A \ \ P' \ = \ A \ \ 1 \ , since \ I \ = \ P'P \ = \ P' \ \ P \ . Apply Exercise 2.1 1 . Show that \ Q \ = + 1 o r - 1 i f Q i s a p p orthogonal matrix. Hint: \ QQ' \ = \ I \ . Also, from Result 2A.l l , \ QQ' \ = \ Q \ \ Q' \ = \ Q \ 2 • Thus, \ Q \ 2 = \ I \ . Now use Exercise 2.11. Show that Q ' A Q and A have the same eigenvalues if Q is
2A.l l(e),
2.13.
2.14.
2.15. 2.16.
2.17.
2.18.
\A\
=
X
(p X p ) (p Xp) (p X p)
(p Xp)
orthogonal. Hint: Let A be an eigenvalue of A. Then 0 = \ A - AI \ . By Exercise 2.13 and Result 2A.ll(e), we can write 0 = \ Q' \ \ A - AI \ \ Q \ = \ Q'AQ - AI \ , since Q'Q = I. A quadratic form x' Ax is said to be positive definite if the matrix A is posi tive definite. Is the quadratic form 3x f + 3x � - 2x 1x 2 positive definite? Consider an arbitrary n p matrix A. Then A' A is a symmetric p p matrix. Show that A' A is necessarily nonnegative definite. Hint: Set y = Ax so that y ' y = x ' A' Ax. Prove that every eigenvalue of a k k positive definite matrix A is positive. Hint: Consider the definition of an eigenvalue, where Ae = Ae. Multiply on the left by e' so that e' Ae = Ae' e. Consider the sets of points (x 1 , x 2 ) whose "distances" from the origin are given by
X
X
X
c2
=
4x
i + 3x� - 2 V2 x1 x2
for c 2 = 1 and for c 2 = 4. Determine the major and minor axes of the ellipses of constant distances and their associated lengths. Sketch the ellipses of con stant distances and comment on their positions. What will happen as c 2 increases? 2.19.
Let A 1 12 = 2: \IA; e ; e; m
(m X m)
i= l
=
PA1 12 P', where PP'
=
P'P
=
I. (The A /s and
the e ; s are the eigenvalues and associated normalized eigenvectors of the matrix A.) Show Properties (1)-(4) of the square-root matrix in (2-22). 2.20. Determine the square-root matrix A 1 12 , using the matrix A in Exercise 2.3. Also, determine A - l/2 , and show that A 1 /2 A - I /2 = A -l /2 A 1 12 = I. '
110
Chap. 2 Matrix Algebra and Random Vectors 2.21.
(See Result 2A.15) Using the matrix
(a) Calculate A' A and obtain its eigenvalues and eigenvectors.
(b) Calculate AA' and obtain its eigenvalues and eigenvectors. Check that the nonzero eigenvalues are the same as those in part a.
(c) Obtain the singular-value decomposition of A. 2.22.
(See Result 2A. l 5) Using the matrix A =
[4
8
3 6
] -9 8
(a) Calculate AA' and obtain its eigenvalues and eigenvectors.
(b) Calculate A' A and obtain its eigenvalues and eigenvectors. Check that the nonzero eigenvalues are the same as those in part a.
(c) Obtain the singular-value decomposition of A.
Verify the relationships V1 12 pV1 /2 = I and p = (V1 12 ) - 1 I (V1 12 ) -\ where I is the p p population covariance matrix [Equation (2-32)] , p is the p p population correlation matrix [Equation (2-34)] , and V 1 12 is the population standard deviation matrix [Equation (2-35)]. 2.24. Let X have covariance matrix 2.23.
X
X
Find
(a) I - 1 •
2.25.
(b) The eigenvalues and eigenvectors of I . (c) The eigenvalues and eigenvectors of I - 1 .
Let X have covariance matrix
(a) Determine p and V1 12 • 2.26.
(b) Multiply your matrices to check the relation V 112 pV1 /2 = Use I as given in Exercise 2.25. (a) Find p1 3 •
(b) Find the correlation between X1 and �X2 +
� X3 •
I.
Chap. 2 Exercises
111
Derive expressions for the mean and variances of the following linear com binations in terms of the means and covariances of the random variables X1 , X2 , and X3 • (a) X1 - 2 X2 (b) - X1 + 3 X2 (c) X1 + X2 + X3 (e) X1 + 2 X2 - X3 (t) 3 X1 - 4 X2 if X1 and X2 are independent random variables. 2.28. Show that
2.27.
Cov (e1 1 X1 + e1 2 X2 + · · · + e1 P XP , e2 1 X1 + e22 X2 + · · · + e2 P XP ) = c � Ixc2 where c � [ e1 1 , e1 2 , . . . , e 1 p ] and c; = [ e2 1 , e22 , . . . , e2 P ] . This verifies the off-diagonal elements C ixC' in (2-45) or diagonal elements if c 1 c 2 . Hint: By (2-43), Z1 - E (Z1 ) e1 1 (X1 - 11-1 ) + · · · + e1 P (XP - /Lp ) and Z2 - E (Z2 ) e2 1 (X1 - �J- 1 ) + · · · + e2 P (XP - IJ-p ) . So Cov ( Z 1 , Z2 ) = E [ (Z1 - E (Z1 ) ) (Z2 - E (Z2 ) ) ] E [ (e11 (X1 - 11- 1 ) + · · · + e1p (Xp - 11-p ) ) (ez t (Xl - 11-t ) + e22 (X2 - �J- 2 ) + · · · + e2p (Xp - IJ-p ) ) ] . The product (el l (Xl - 11-t ) + e1 2 (Xz - 11- z ) + · · · + e l p (XP - ILp ) ) (ez t (XI - 11-t ) + ezz (Xz - /L z ) + · · · + ez p (Xp - ILp ) )
=
=
= ( �1 e� e (Xe p
p
11- e )
=
=
=
) c �1 ezm (Xm - 11-m ) )
= � � e l f ezm (Xt - ILe H Xm - 11- m )
f= l m = l has expected value p p � � et e ez m uem [ el l• · · · • e l p ] I [ ez t• · · · • e2 p ] ' . f=l m=l Verify the last step by the definition of matrix multiplication. The same steps hold for all elements. 2.29. Consider the arbitrary random vector [X1 , X2 , X3 , X4 , X5 ] ' with mean vector p, [�J- 1 , �J- 2 , �J- 3 , �J- 4 , 11- s ] ' . Partition into
=
=
where
X= X
X = [-i��-J
112
Chap. 2 Matrix Algebra and Random Vectors
Let I be the covariance matrix of X with general element cr;k · Partition I into the covariance matrices of X (l l and x
Partition X as
Let A = [1 2] and B =
[ 21 - 2 ] _1
and consider the linear combinations AX< 1 l and BX<2> . Find (a) E (X< 1 l ) (b) E (AX( l l ) (c) Cov (X(l l )l (d) Cov (AX( l ) (e) E (X<2l ) (f) E (BX<2l ) ( g) Cov ( X <2 > ) (h) Cov (BX<2 l ) (i) Cov (X( I ) ' x<2l ) (j) Cov (AX( l l , BX<2l ) 2.31. Repeat Exercise 2.30, but with A and B replaced by A = [1 - 1] and � = 2.32.
[ 20 - 11 ]
You are given the random vector X' = [X1 , X2 , , X5] with mean vector p,� = [2, 4, -1, 3, 0] and variance-covariance matrix . • •
Chap. 2 Exercises
4 -1 3 -1 I 1 2 l 1 2
Ix =
_
0
0
I
2
I
113
0
-2
0 1 -1 1 -1 6 0 4 1 0 2 -1
Partition X as x�
Xz x = x3 x4 Xs
Let A =
[1 -1] 1
1
and B =
[ 11
1 1
-
1 2
]
and consider the linear combinations AX( ! ) and BX C2l . Find (a) E (X C l l ) (b) E (AXC l l ) (c) Cov (X(l l ) (d) Cov ( AXC l l ) (e) E (XC2 l ) 2 (0 E (BX C l ) X ( g) Cov ( <2 l ) (h) Cov (BX<2 l ) (i) Cov (X( l l , X( 2 ) ) (j) Cov (AX(l l , BXC 2l ) 2.33. Repeat Exercise 2.32, but with X partitioned as
xi Xz X = x3 x4
Xs
[� � �J
[i��-J
and with A and B replaced by A = 2.34.
-
and B =
[ � �J _
Consider the vectors b' = �2, -1, 4, 0] and d' = [ - 1 , 3, -2, 1]. Verify the Cauchy-Schwarz inequality (b' d) 2 .;;; (b'b ) ( d' d).
1 14
Chap.
2
Matrix Algebra and Random Vectors
2.35. Using the vectors b' = [ -4, 3] and d' = [ 1 , 1 ] , verify the extended Cauchy Schwarz inequality (b'd)2 :%; (b'Bb) (d'B- 1 d) if
B
=
[
2 -2 -2 $
]
2.36. Find the maximum and minimum values of the quadratic form 4xr + 4x� + 6x1x2 for all points x' = [x 1 , x 2 ] such that x'x = 1 . 2.37. With A a s given in Exercise 2.6, find the maximum value o f x' Ax for x ' x = 1 . 2.38. Find the maximum and minimum values o f the ratio x ' Ax/x'x for any nonzero vectors x' = [x1 , x2 , x3] if
-�]
�: -2
2.39. Show that
10
s
I
A B C has (i, j) th entry � � a 1 e bf k ckj (=I k=1 (r Xs) (s X I) (I X v)
I
Hint: BC has (t', j )th entry � b a ckj k= l
=
dej · So A (BC) has (i, j ) th element
2.40. Verify (2-24): E(X + Y) = E(X) + E (Y) and E ( AX�) = AE (X) B. Hint: X + Y has Xij + Yij as its (i, j )th element. Now, E (Xij + Yi) E (Xij ) + E ( Yij ) by a univariate property of expectation, and this last quan tity is the (i, j ) th element of E(X) + E(Y). Next (see Exercise 2 .39 ) , AXB has
(i, j ) th entry � � ai l Xa b kj • and by the additive property of expectation,
e k
which is the (i, j )th element of AE(X) B.
Chap.
2 References
115
REFERENCES
1.
1970.
Bellmtaan,charyya, R. G. K., and R. A. Johnson. (2d ed.) New York: McGraw-HNewil , York: 2. John Bhat 3. Wads GraybiWiwlortl,eF.hy,, 1977. A.1969. Belmont, CA: 4. Haltrand,mos1958., P. R. (2d ed.). Princeton, NJ: D. Van Nos 5. Nobl (3d ed.). Englewood Cliffs, NJ: Prenteic, eB.Hal, andl, 1988.J. W. Daniel. Introduction to Matrix Analysis
Statistical Concepts and Methods.
Introduction to Matrices with Applications in Statistics.
Finite Dimensional Vector Spaces
Applied Linear Algebra
CHAPTER
3
Sample Geometry and Random Sampling
3.1 INTRODUCTION

With the vector concepts introduced in the previous chapter, we can now delve deeper into the geometrical interpretations of the descriptive statistics x̄, Sn, and R; we do so in Section 3.2. Many of our explanations use the representation of the columns of X as p vectors in n dimensions. In Section 3.3 we introduce the assumption that the observations constitute a random sample. Simply stated, random sampling implies that (1) measurements taken on different items (or trials) are unrelated to one another and (2) the joint distribution of all p variables remains the same for all items. Ultimately, it is this structure of the random sample that justifies a particular choice of distance and dictates the geometry for the n-dimensional representation of the data. Furthermore, when data can be treated as a random sample, statistical inferences are based on a solid foundation. Returning to geometric interpretations in Section 3.4, we introduce a single number, called generalized variance, to describe variability. This generalization of variance is an integral part of the comparison of multivariate means. In later sections we use matrix algebra to provide concise expressions for the matrix products and sums that allow us to calculate x̄ and Sn directly from the data matrix X. The connection between x̄, Sn, and the means and covariances for linear combinations of variables is also clearly delineated, using the notion of matrix products.
3.2 THE GEOMETRY OF THE SAMPLE
A single multivariate observation is the collection of measurements on p different variables taken on the same item or trial. As in Chapter 1, if n observations have been obtained, the entire data set can be placed in an n × p array (matrix):

$$\mathbf{X}_{(n \times p)} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$
Each row of X represents a multivariate observation. Since the entire set of measurements is often one particular realization of what might have been observed, we say that the data are a sample of size n from a p-variate "population." The sample then consists of n measurements, each of which has p components. As we have seen, the data can be plotted in two different ways. For the p-dimensional scatter plot, the rows of X represent n points in p-dimensional space. We can write

$$\mathbf{X}_{(n \times p)} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{bmatrix} \qquad (3\text{-}1)$$

where x₁' is the 1st (multivariate) observation and xₙ' is the nth (multivariate) observation.
The row vector x_j', representing the jth observation, contains the coordinates of a point. The scatter plot of n points in p-dimensional space provides information on the locations and variability of the points. If the points are regarded as solid spheres, the sample mean vector x̄, given by (1-8), is the center of balance. Variability occurs in more than one direction, and it is quantified by the sample variance-covariance matrix Sn. A single numerical measure of variability is provided by the determinant of the sample variance-covariance matrix. When p is greater than 3, this scatter plot representation cannot actually be graphed. Yet the consideration of the data as n points in p dimensions provides insights that are not readily available from algebraic expressions. Moreover, the concepts illustrated for p = 2 or p = 3 remain valid for the other cases.

Example 3.1 (Computing the mean vector)

Compute the mean vector x̄ from the data matrix

$$\mathbf{X} = \begin{bmatrix} 4 & 1 \\ -1 & 3 \\ 3 & 5 \end{bmatrix}$$
Plot the n = 3 data points in p = 2 space, and locate x̄ on the resulting diagram.

The first point, x₁, has coordinates x₁' = [4, 1]. Similarly, the remaining two points are x₂' = [-1, 3] and x₃' = [3, 5]. Finally,

$$\bar{\mathbf{x}} = \begin{bmatrix} \dfrac{4 - 1 + 3}{3} \\[1ex] \dfrac{1 + 3 + 5}{3} \end{bmatrix} = \begin{bmatrix} 2 \\ 3 \end{bmatrix}$$

Figure 3.1 shows that x̄ is the balance point (center of gravity) of the scatter plot. •

Figure 3.1 A plot of the data matrix X as n = 3 points in p = 2 space.
The alternative geometrical representation is constructed by considering the data as p vectors in n-dimensional space. Here we take the elements of the columns of the data matrix to be the coordinates of the vectors. Let

$$\mathbf{X}_{(n \times p)} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} \mathbf{y}_1 & \vdots & \mathbf{y}_2 & \vdots & \cdots & \vdots & \mathbf{y}_p \end{bmatrix} \qquad (3\text{-}2)$$
Then the coordinates of the first point y₁' = [x₁₁, x₂₁, ..., x_{n1}] are the n measurements on the first variable. In general, the ith point yᵢ' = [x₁ᵢ, x₂ᵢ, ..., x_{ni}] is determined by the n-tuple of all measurements on the ith variable. In this geometrical representation, we depict y₁, ..., y_p as vectors rather than points, as in the p-dimensional scatter plot. We shall be manipulating these quantities shortly using the algebra of vectors discussed in Chapter 2.

Example 3.2 (Data as p vectors in n dimensions)

Plot the following data as p = 2 vectors in n = 3 space:

$$\mathbf{X} = \begin{bmatrix} 4 & 1 \\ -1 & 3 \\ 3 & 5 \end{bmatrix}$$
Here y₁' = [4, -1, 3] and y₂' = [1, 3, 5]. These vectors are shown in Figure 3.2. •

Figure 3.2 A plot of the data matrix X as p = 2 vectors in n = 3 space.
Many of the algebraic expressions we shall encounter in multivariate analysis can be related to the geometrical notions of length, angle, and volume. This is important because geometrical representations ordinarily facilitate understanding and lead to further insights. Unfortunately, we are limited to visualizing objects in three dimensions, and consequently, the n-dimensional representation of the data matrix X may not seem like a particularly useful device for n > 3. It turns out, however, that geometrical relationships and the associated statistical concepts depicted for any three vectors remain valid regardless of their dimension. This follows because three vectors, even if n dimensional, can span no more than a three-dimensional space, just as two vectors with any number of components must lie in a plane. By selecting an appropriate three-dimensional perspective (that is, a portion of the n-dimensional space containing the three vectors of interest), a view is obtained that preserves both lengths and angles. Thus, it is possible, with the right choice of axes, to illustrate certain algebraic statistical concepts in terms of only two or three vectors of any dimension n. Since the specific choice of axes is not relevant to the geometry, we shall always label the coordinate axes 1, 2, and 3.

It is possible to give a geometrical interpretation of the process of finding a sample mean. We start by defining the n × 1 vector 1ₙ' = [1, 1, ..., 1]. (To simplify the notation, the subscript n will be dropped when the dimension of the vector 1ₙ is clear from the context.) The vector 1 forms equal angles with each of the n coordinate axes, so the vector (1/√n)1 has unit length in the equal-angle direction. Consider the vector yᵢ' = [x₁ᵢ, x₂ᵢ, ..., x_{ni}]. The projection of yᵢ on the unit vector (1/√n)1 is, by (2-8),

$$\mathbf{y}_i'\left(\frac{1}{\sqrt{n}}\,\mathbf{1}\right)\frac{1}{\sqrt{n}}\,\mathbf{1} = \frac{x_{1i} + x_{2i} + \cdots + x_{ni}}{n}\,\mathbf{1} = \bar{x}_i\mathbf{1} \qquad (3\text{-}3)$$
That is, the sample mean x̄ᵢ = (x₁ᵢ + x₂ᵢ + ··· + x_{ni})/n = yᵢ'1/n corresponds to the multiple of 1 required to give the projection of yᵢ onto the line determined by 1. Further, for each yᵢ, we have the decomposition

$$\mathbf{y}_i = \bar{x}_i\mathbf{1} + (\mathbf{y}_i - \bar{x}_i\mathbf{1})$$

where x̄ᵢ1 is perpendicular to yᵢ - x̄ᵢ1. The deviation, or mean corrected, vector is

$$\mathbf{d}_i = \mathbf{y}_i - \bar{x}_i\mathbf{1} = \begin{bmatrix} x_{1i} - \bar{x}_i \\ x_{2i} - \bar{x}_i \\ \vdots \\ x_{ni} - \bar{x}_i \end{bmatrix} \qquad (3\text{-}4)$$
The elements of dᵢ are the deviations of the measurements on the ith variable from their sample mean. Decomposition of the yᵢ vectors into mean components and deviation from the mean components is shown in Figure 3.3 for p = 3 and n = 3.

Figure 3.3 The decomposition of yᵢ into a mean component x̄ᵢ1 and a deviation component dᵢ = yᵢ - x̄ᵢ1, i = 1, 2, 3.

Example 3.3 (Decomposing a vector into its mean and deviation components)

Let us carry out the decomposition of yᵢ into x̄ᵢ1 and dᵢ = yᵢ - x̄ᵢ1, i = 1, 2, for the data given in Example 3.2. Here x̄₁ = (4 - 1 + 3)/3 = 2 and x̄₂ = (1 + 3 + 5)/3 = 3, so

$$\bar{x}_1\mathbf{1} = \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix} \quad \text{and} \quad \bar{x}_2\mathbf{1} = \begin{bmatrix} 3 \\ 3 \\ 3 \end{bmatrix}$$

Consequently,

$$\mathbf{d}_1 = \mathbf{y}_1 - \bar{x}_1\mathbf{1} = \begin{bmatrix} 4 \\ -1 \\ 3 \end{bmatrix} - \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix} = \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix}$$

and

$$\mathbf{d}_2 = \mathbf{y}_2 - \bar{x}_2\mathbf{1} = \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} - \begin{bmatrix} 3 \\ 3 \\ 3 \end{bmatrix} = \begin{bmatrix} -2 \\ 0 \\ 2 \end{bmatrix}$$

We note that x̄₁1 and d₁ = y₁ - x̄₁1 are perpendicular, because

$$(\bar{x}_1\mathbf{1})'(\mathbf{y}_1 - \bar{x}_1\mathbf{1}) = \begin{bmatrix} 2 & 2 & 2 \end{bmatrix}\begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} = 4 - 6 + 2 = 0$$

A similar result holds for x̄₂1 and d₂ = y₂ - x̄₂1. The decomposition is

$$\mathbf{y}_1 = \begin{bmatrix} 4 \\ -1 \\ 3 \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix} + \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} \qquad\qquad \mathbf{y}_2 = \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \\ 3 \end{bmatrix} + \begin{bmatrix} -2 \\ 0 \\ 2 \end{bmatrix} \qquad \bullet$$
•
For the time being, we are interested in the deviation (or residual) vectors d ; = Y; - .X; l . A plot of the deviation vectors of Figure 3.3 is given in Figure 3.4 on page 122. We have translated the deviation vectors to the origin without chang ing their lengths or orientations. Now consider the squared lengths of the deviation vectors. Using (2-5) and (3-4), we obtain ' 2, Ld -
.i l «;
d;
n
� - LJ
-
j= l
(x1; - x; ) 2 -
(Length of deviation vector) 2 = sum of squared deviations
(3-5)
1 22
Chap.
3
Sample Geometry and Random Sampling 3
Figure 3.4 The deviation vectors d ; from Figure 3 . 3 .
From (1-3), we see that the squared length is proportional to the variance of the measurements on the ith variable. Equivalently, the length is proportional to the standard deviation. Longer vectors represent more variability than shorter vectors. For any two deviation vectors d; and d k , n
d; d k = 2: (xji - :X; ) (xjk - xk ) j= l
(3-6)
Let O;k denote the angle formed by the vectors d; and d k . From (2-6), we get
� (xji -
d ; d k = Ld;Ldk cos ( O; k )
�� (xji - :xy �� (xjk - xk 2 cos ( ()ik )
or, using (3-5) and (3-6), we obtain :X; ) (xj k
s o that [see (1-5)]
- xk ) =
)
(3-7) The cosine of the angle is the sample correlation coefficient. Thus, if the two devi ation vectors have nearly the same orientation, the sample correlation will be close to 1 . If the two vectors are nearly perpendicular, the sample correlation will be approximately zero. If the two vectors are oriented in nearly opposite directions, the sample correlation will be close to - 1.
Sec.
3.2
The Geometry of the Sample
1 23
Example 3.4 (Calculating Sn and R from deviation vectors)
Given the deviation vectors in Example 3.3, let us compute the sample vari ance-covariance matrix Sn and sample correlation matrix R using the geo metrical concepts just introduced. From Example 3.3,
These vectors, translated to the origin, are shown in Figure 3.5. Now,
or s 1 1 = ¥ · Also,
or s22 = � . Finally,
[n -
d ; d , � [2 - 3 1]
or s 1 2 = - � . Consequently,
� - 2 � 3s ,
3 7 6
2
5
4
3 Figure 3.5
and d 2 .
The deviation vectors d 1
1 24
Chap.
3
Sample Geometry and Random Sampling
and sn =
[ 134 2
-3
�3 \113
-� ] 8
3
'
R
_
- . 1 89
[ 1-. 1 89 1-.189 ]
•
The concepts of length, angle, and projection have provided us with a geo metrical interpretation of the sample. We summarize as follows:
3.3 RANDOM SAMPLES AND THE EXPECTED VALUES OF THE SAMPLE MEAN AND COVARIANCE MATRIX
In order to study the sampling variability of statistics like x and Sn with the ultimate aim of making inferences, we need to make assumptions about the variables whose observed values constitute the data set X. Suppose, then, that the data have not yet been observed, but we intend to col lect n sets of measurements on p variables. Before the measurements are made, their values cannot, in general, be predicted exactly. Consequently, we treat them as random variables. In this context, let the (j, k)-th entry in the data matrix be the random variable Each set of measurements on p variables is a random vec tor, and we have the random matrix
Xik ·
Xi
1 The square of the length and the inner product are (n - l)s;; and (n - l)s; k • respectively, when the divisor n - 1 is used in the definitions of the sample variance and covariance.
Sec.
3.3
Random Samples and the Expected Values
1 25
X
(3-8)
( n Xp) A
random sample can now be defined.
If the row vectors X� , X� , . . . , X� in (3-8) represent independent observations from a common joint distribution with density function f(x) = f(x1 , x 2 , , xp ), then X1 , X 2 , . . . , Xn are said to form a random sample from f(x). Mathematically, x l ' x 2 ' . . . ' xn form a random sample if their joint density function is given by the product f(x1 )f(x 2 ) f(xn ), where f(x) = f(xi 1 , xi 2 , , xiP ) is the density func tion for the jth row vector. Two points connected with the definition of random sample merit special attention: • • •
• • •
• • •
1. The measurements of the p variables in a single trial, such as x; = [ Xj 1 , Xj 2 , . . . , XiP ] , will usually be correlated. Indeed, we expect this to be the case. The measurements from different trials must, however, be independent. 2. The independence of measurements from trial to trial may not hold when the variables are likely to drift over time, as with sets of p stock prices or p eco nomic indicators. Violations of the tentative assumption of independence can have a serious impact on the quality of statistical inferences. The following examples illustrate these remarks. Example 3.5
(Selecting a random sample)
As a preliminary step in designing a permit system for utilizing a wilderness canoe area without overcrowding, a natural-resource manager took a survey of users. The total wilderness area was divided into subregions, and respon dents were asked to give information on the regions visited, lengths of stay, and other variables. The method followed was to select persons randomly (perhaps using a random number table) from all those who entered the wilderness area during a particular week. All persons were equally likely to be in the sample, so the more popular entrances were represented by larger proportions of canoeists. Here one would expect the sample observations to conform closely to the criterion for a random sample from the population of users or potential users. On the other hand, if one of the samplers had waited at a campsite far in the interior of the area and interviewed only canoeists who reached that spot, successive measurements would not be independent. For instance, • lengths of stay would tend to be large.
1 26
Chap.
3
Sample Geometry and Random Sampling
Exam ple 3 .6 (A nonrandom sample)
Because of concerns with future solid-waste disposal, a study was conducted of the gross weight of solid waste generated per year in the United States ("Characteristics of Municipal Solid Wastes in the United States, 1960-2000," Franklin Associates, Ltd.). Estimated amounts attributed to x 1 = paper and paperboard waste and x2 = plastic waste, in millions of tons, are given for selected years in Table 3.1. Should these measurements on x ' = [ X1 , X2 ] be treated as a random sample of size n = 6? No! In fact, both variables are increasing over time. A drift like this would be very rare if the year-to-year values were independent observations from the same distribution. TABlE 3 . 1
SOli D WASTE
Year
1960
1965
1970
1975
1980
1985
x 1 (paper) x2 (plastics)
29.8
37.9 1 .4
43.9 3.0
42.6
53 . 9 7.6
61 .7 9.8
.4
4.4
•
As we have argued heuristically in Chapter 1, the notion of statistical inde pendence has important implications for measuring distance. Euclidean distance appears appropriate if the components of a vector are independent and have the same variances. Suppose we consider the location of the k th column v; = [ X1 k , X2 k , . . . , Xn k ] of X, regarded as a point in n dimensions. The location of this point is determined by the joint probability distribution f (yk ) = f (x l k ' Xz k , . . . ' xn k ) . When the measurements Xl k , Xz k , . . . ' xn k are a ran dom sample, f (yk ) = f (x l k , Xz k , . . . , xn k ) = fk (x l k ) fk (x2 k ) · " fk (xn k ) and conse quently, each coordinate xj k contributes equally to the location through the identical marginal distributions fk (xj k ) . If the n components are not independent or the marginal distributions are not identical, the influence of individual measurements (coordinates) on location is asymmetrical. We would then be led to consider a distance function in which the coordinates were weighted unequally, as in the "statistical" distances or quadratic forms introduced in Chapters 1 and 2. Certain conclusions can be reached concerning the sampling distributions of X and Sn without making further assumptions regarding the form of the underly ing joint distribution of the variables. In particular, we can see how X and Sn fare as point estimators of the corresponding population mean vector p, and covari ance matrix I. Result 3 . 1 . Let X 1 , X 2 , . . . , X n be a random sample _!!o m a joint distribu tion that has mean vector p, and covariance matrix I. Then X is an unbiased esti mator of p,, and its covariance matrix is
Sec.
3.3
Random Samples and the Expected Values
1 27
That is,
E (X) = 1L 1 Cov (X) = - I n
(population mean vector)
( population variance-covariance matrix )
(3- 9)
divided by sample size
For the covariance matrix S,.,
n - 1 I = I - -1 I E(S" ) = -n n Thus,
(3-10) so [n/(n (bias) =
- 1)] S,. is an unbiased estimator of I, while S,. is a biased estimator with E(S,. ) - I = - (1/n) I. Proof Now, X = (X 1 + X 2 + · · · + X,. )/n. The repeated use of the prop
erties of expectation in (2-24) for two vectors gives
= -1 E (X 1 ) + -1 E (X 2 ) + n n =p Next,
···
+ -n1 E (X " ) = -n1 1L + -n1 1L +
(X - 1L) (X - 1L) ' = ( -n1 j�= (Xj - 1L) ) ( -n1 c�=11 (X c - 1L) ) ' 11
n
l
n
l
1 =2 �l (Xj - p) (X c - p) ' n j� = l C= so
···
+
-n1 1L
1 28
Chap.
Cov(X) = E(X - p) (X - p) 1 = �2 (� �� E(Xj - p) (Xc - P-) 1 ) For j =I= e, each entry in E(Xj - p) (Xc - p) 1 is zero because the entry is the covariance between a component of X and a component of X e , and these are inde pendent. [See Exercise 3.17 and (2-29).j ] Therefore, Cov(X) = :2 (� E (Xj - p) (Xj - p) 1 ) Since 'I = (X - p) (Xj - p) 1 is the common population covariance matrix for each Xj , weEhavej Cov(X) = 2n1 (j�=! E(Xj - p) (Xj - P-) 1 ) = 2n1 ( 'I + 'I + · · · + 'I ) n terms = __!__2 (n'I) = ( !)'I n n To obtaih the expected valu of S,; , w0irst note that i - Xi ) (Xj k - Xk ) is the (i, k )th element of (Xj - �_X) (Xj - X) 1 • The matrix(Xjrepresenting sums of squares and cross products can then be written as
3
Sample Geometry and Random Sampling
II
II
j =l X.X 1 - nXX 1 since j�=l (Xj - X) = 0 and nX 1 = i�=l x;. Therefore, its expected value is E (� Xj XJ - n x x ) = � E(Xj Xj ) - nE(XX 1 ) For any random vector V with E(V) = Pv and Cov(V) = 'Iv, we have E(VV 1 ) = 'I v + PvP�· (See Exercise 3.16.) Consequently, and E(-XX- 1 ) = -n1 'I + pp 1 Using these results, we obtain � E( Xj Xj ) - n E( XX 1 ) = n 'I + npp 1 - n ( ;:;-1 'I + pp 1 ) = (n - 1)'I � � = LJ
n
-
11
J
J
3 .4
and thus, since S11 = ( 1 /n ) (� xix; - n X X ' ) , it follows immediately that (n I E ( S ) = --n Result 3.1 shows that the (i, k)th entry, (n - j=2:l (Xi; XJ (Xik - Xk ) , of [n/(n - 1)]S11 is an unbiased estimator of u;k However, the individual sample standard deviations Vi;;, calculated with either n or· n - 1 as a divisor, are not unbi ased estimators of the corresponding population quantities Vii;; . Moreover, the correlation coefficients r;k are not unbiased estimators of the population quantities P;k · However, the bias E (v'i;; ) - Vii;; , or E (r;k ) - P;k, can usually be ignored if the sample size n is moderately large. Consideration of bias motivates a slightly modified definition of the sample vari ance-covariance matrix. Result 3.1 provides us with an unbiased estimator S of I: Sec.
II
- 1)
Generalized Variance
1 29
•
1t1 II
(UN B IASE D) SAMPLE VARIANCE.:..C OVARIANCE MATRIX ··.
II
-
(3-11)
Here S, without a subscript, has (i, k)th entry (n - l)- 1 j=2:l (Xi; X; ) (Xik - Xk ) . This definition of sample covariance is commonly used in many multivariate test statistics. Therefore, it will replace S11 as the sample covariance matrix in most of the material throughout the rest of this book. 3.4 GENERALIZ ED VARIANCE
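The relation between Sn and the unbiased S defined in (3-11) is easy to check numerically. The sketch below is illustrative only and assumes NumPy, whose np.cov uses the n - 1 divisor by default; the data matrix is the small one from Example 3.1.

```python
import numpy as np

X = np.array([[4.0, 1.0],
              [-1.0, 3.0],
              [3.0, 5.0]])
n = X.shape[0]
D = X - X.mean(axis=0)

S_n = (D.T @ D) / n          # divisor n: the matrix called S_n in the text
S = (D.T @ D) / (n - 1)      # divisor n - 1: the unbiased estimator S of (3-11)

assert np.allclose(S, np.cov(X, rowvar=False))   # np.cov defaults to the n - 1 divisor
assert np.allclose(S, S_n * n / (n - 1))         # S = [n/(n - 1)] S_n
print(np.round(S_n, 3))
print(np.round(S, 3))
```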
With a single variable, the sample variance is often used to describe the amount of variation in the measurements on that variable. When p variables are observed on each unit, the variation is described by the sample variance-covariance matrix S=
ls 1 s, P S �z
1 30
Chap.
3
Sample Geometry and Random Sampling
The sample covariance matrix contains p variances and �p (p - 1) potentially dif ferent covariances. Sometimes it is desirable to assign a single numerical value for the variation expressed by S. One choice for a value is the determinant of=S, which reduces to the usual sample variance of a single characteristic when p 1. This determinantz is called the generalized sample variance:
Example 3 . 7 (Calculating a generalized variance)
Employees (x1 ) and profits per employee ( x2 ) for the 16largest publishing firms in the United States are shown in Figure 1.3 . The sample covariance matrix, obtained from the data in the April30, 1990, Forbes magazine article, is [ -68. 252.04 -68.43 s = 43 123.67 J Evaluate the generalized variance. In this case, we compute l s i = (252. 04) (123. 6 7) - c -68. 43) ( -68. 4 3) = 26, 4 87 The generalized sample "ariarice provides one way of writing the information on all variances and covariances as a single number. Of course, when p > 1, some information about the sample is lost in the process. A geometrical interpretation of I S I will help us appreciate its strengths and weaknesses as a descriptive summary. Consider the area generated within the plane by two deviation vectors d1 = y1 - x1l and d2 = y2 - x2 1. Let Ld 1 be the length of d1 and Ld2 the length of d2 . By elementary geometry, we have the diagram •
dl
- - - - - - - - - �t!J- - - - - - - - - - - - - .
cos2 ( e) + sin2 ( e) 1, we and the area of the trapezoid is jLd sin ( e) I Ld2 .canSinceexpress th' area as I
IS
2 Definition 2A.24 defines "determinant" and indicates one method for calculating the value of a determinant.
Sec.
From (3-5) and (3-7), Ld 1 Ld 2 =
and
�� (xi1 - .X1 )2 F .X2 ) 2
3 .4
Generalized Variance
=
V(n - 1) s11
=
V(n - 1) s22
1 31
cos(O) = r1 2
Therefore, Area = (n - 1) � YS; V1 - r?2 = (n - 1) Vs11s22 (1 - r?2 ) (3-13) Also, � Vs22 r1 z ] S22 I (3-14) If we compare (3-14) with (3-13), we see that l S I = (area) 2/(n - 1) 2 Assuming now that l S I = (n - 1) - (p - l l (volume? holds for the volume gen erated in n space by the p - 1 deviation vectors d 1 , d2 , ... , dp - J , we can establish the following general result for p deviation vectors by induction (see [1 ], p. 260): Generalized sample variance = I S I = (n - 1) -p (volume) (3-15) Equation (3-15) says that the generalized sample variance, for a fixed set of data,3 is proportional to the square of the volume generated by the p deviation vectors d 1 = y1 - .X11, d2 = y2 - .X2 1, ... , dP = Yp - .XP l. Figures 3.6(a) and (b) on page 132 show trapezoidal regions, generated by p = 3 residual vectors, corresponding to "large" and "small" generalized variances. For a fixed sample size, it is clear from the geometry that volume, or I S I , will increase when the length of di = Yi - :Xi l ( on/5;;) is increased. In addition, volume will increase if the residual vectors of fixed length are moved until they are at right angles to one another, as in Figure 3.6(a). On the other hand, the volume, or I S I , will be small if just one of the sii is small or one of the deviation vectors lies 2
any
3 If generalized variance is defined in terms of the sample covariance matrix S, = [(n - 1)/n] S, then, using Result 2A. 1 1 , I S, I = l [(n - 1 )/n)IP S I = l [(n - 1)/n)Ip I l S I = [(n - 1)/nJP I S I . Conse quently, using (3-15), we can also write the following: Generalized sample variance = I S, I = n-p (volume) 2 •
1 32
Chap.
3
Sample Geometry and Random Sampling , , 11 \
�
,' I \
3
3
\ , ,, \ I I \ I I \ \ \ "' \ \ I ( '"
Figure 3.6 (a) "Large" generalized sample variance for p = 3 . (b) "Small" gen eralized sample variance for p = 3.
nearly in the (hyper) plane formed by the others, or both. In the second case, the trapezoid has very little height above the plane. This is the situation in Figure 3.6(b ), where d3 lies nearly in the plane formed by d 1 and d2 . Generalized variance also has interpretations in the p-space scatter plot re presentation of the data. The most intuitive' interpretation concerns the spread of the scatter about the sample mean point x = [x , x2 , . . . , xP ] . Consider the mea sure of distance given1 in the comment below (2-19),1 with x playing the role of the fixed point p and s- playing the role of A. With these choices, the coordinates x ' = [ x1 , x 2 , . . . , xp] of the points a constant distance c from x satisfy (3-16) 1 (When p = 1, (x - x) ' S- (x - x) = (x1 - x1 ) 2/s1 1 is the squared distance from xl to xl in standard deviation units. ) Equation (3-16) defines a hyperellipsoid (an ellipse if p = 2) centered at x. It can be shown using integral calculus that the volume of this hyperellipsoid is related to I S 1 . In particular, Volume of {x: (x - x) ' S- 1 (x - x) ,;; c2 } = kP I S I 112cP (3-17) or (Volume of ellipsoid)2 = (constant) (generalized sample variance) where the constant kP is rather formidable.4 A large volume corresponds to a large generalized variance. 4 For those who are curious, kP uated at z.
=
27TP12jp f (p/2), where f (z) denotes the gamma function eval
Sec.
Generalized Variance
3.4
1 33
Although the generalized variance has some intuitively pleasing geometrical interpretations, it suffers from a basic weakness as a descriptive summary of the sample covariance matrix S, as the following example shows. Example 3.8
(Interpreting the generalized variance)
Figure 3.7 gives three scatter plots with very different patterns of correlation. All three data sets have i' = [1, 2], and the covariance matrices are 4 s [� � J r = .8 s = [ � � J r = 0 S = � ] , r = -.8 7
Xz
7 •
• •
•
•
•
•
•
•
•
•
•
XI
7
• •
•
•
• ' . • • • • • I • . .. . "' . •
Xz
[-
• •
-
• •
•
5
•
• • • • • • •
• • •
•
• •
(a)
(b)
• • •
•
7 • • • •
• • •• • •• • •
• • • • ,. •
•
•
• • • •• • •
• •
• •
(c) Figure 3.7
Scatter plots with three different orientations.
• •
•
7
X
I
1 34
Chap.
3
Sample Geometry and Random Sampling
Each covariance matrix S contains the information on the variability of the component variables and also the information required to calculate the correlation coefficient. In this sense, S captures the orientation and size of pattern of scatter. The eigenvalues and eigenvectors extracted from S further describe the pattern in the scatter plot. For the eigenvalues satisfy 0 == (A(A -- 5)9) 2(A- 421) and we determine the eigenvalue-eigenvector pairs A 1 = 9, e � = [ 1 /Yl , 1 /Yl ] and A2 = 1, e� = [1/Yl, -1/Yl]. The mean-centered ellipse, with center x' = [1, 2] for all three cases, is (x i) 'S- 1 (x - x) � c2 To describe this ellipse, as in Section 2.3, with A =I s- I , we notice that if (A, e) is an eigenvalue-eigenvector pair for S, then (A - , e) is an eigenvalue-eigen vector -pair for s-1- 1. That is,- if Se -=1 Ae, then multiplying on the left by s - 1 1 gives S Se = A S e, or S te = A e. Therefore, using the eigenvalues from S, we know that the ellipse extends e VA; in the direction of e; from x. In p = 2 dimensions, the choice c2 = 5. 99 will produce an ellipse that con tains approximately 95 percent of the observations. The vectors 3 \15.99 e 1 and \15.99 e are drawn in Figure 3. 8 (a) on page 135. Notice how the directions are the natural2 axes for the ellipse, and observe that the lengths of these scaled eigenvectors are comparable to the size of the pattern in each direction. Next, for the eigenvalues satisfy 0 = (A - 3) 2 and we arbitrarily choose the eigenvectors so that A 1 = 3, e� = [1, 0] and A2 = 3, e� = [0, 1]. The vectors V3 \15.99 e 1 and V3 \15.99 e2 are drawn in Figure 3. 8 (b). Finally, for [ 5 4 ] the eigenvalues satisfy 0 = (A - 5)2 - ( )2 s = = (A - 9) (A 1) 5 ' and we determine the eigenvalue-eigenvector pairs A 1 = 9, e � = [1 /Yl, -1/Yl] and A = 1, e� = [1 /Yl , 1/Yl ]. The scaled eigen vectors 3 v5.99 e 1 and v5.992 e2 are drawn in Figure 3.8 (c). In two dimensions, we can often sketch the axes of the mean-centered ellipse by eye. However, the eigenvector approach also works for high dimen sions where the data cannot be examined visually. -
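For the first scatter pattern in Example 3.8, the axes of the mean-centered ellipse can be computed from the eigenvalues and eigenvectors of S, exactly as described above. The following sketch is an illustration only (NumPy assumed).

```python
import numpy as np

S = np.array([[5.0, 4.0],
              [4.0, 5.0]])
c2 = 5.99                             # value giving roughly a 95 percent ellipse for p = 2

values, vectors = np.linalg.eigh(S)   # eigenvalues [1., 9.] with orthonormal eigenvectors
half_lengths = np.sqrt(c2) * np.sqrt(values)

# Each axis of (x - xbar)' S^{-1} (x - xbar) <= c^2 extends c * sqrt(lambda_i) along e_i
for lam, e, h in zip(values, vectors.T, half_lengths):
    print(f"eigenvalue {lam:.0f}: axis direction {np.round(e, 3)}, half-length {h:.2f}")

# |S| equals the product of the eigenvalues, which is 9 for all three patterns in the example
assert np.isclose(np.linalg.det(S), np.prod(values))
```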
-
-4
-
-
-
4
Sec.
7
x2
7
•
•
•
I
•
"'
•
•
•
• •
•
•
•
XI
•
•
•
•
•
•
• • • • •
•
•
•
•
•
•
1 35
x2
•
•
7 •
Generalized Variance
3.4
•
•
•
•
•
•
•
• • •
•
•
• • •
•
• •
7
•
X
I
(b)
(a) • •
•
•
7
x2
• •
•
•
• •
• • • •• •
•
•
•
•
• •
7 • • • •
XI
•
(c) Figure 3.8
Axes of the mean-centered 95-percent ellipses for the scatter plots in Figure 3 . 7.
Note: Here the generalized variance I S I gives the same value, I S I = 9, for all three patterns. But generalized variance does not contain any infor mation on the orientation of the patterns. Generalized variance is easier to interpret when the two or more samples (patterns) being compared have nearly the same orientations. Notice that our three patterns of scatter appear to cover approximately the same area. The ellipses that summarize the variability (x
-
i) 'S- 1 ( x - i) � c 2
do have exactly the same area [see (3-17)], since all have l S I 9. =
•
1 36
Chap.
3
Sample Geometry and Random Sampling
As Example 3. 8 demonstrates, different correlation structures are not detected by IS 1. The situation for p > 2 can be even more obscure. Consequently, it is often desirable to provide more than the single number IS I as aA summary of S. From Exercise 2.12, IS I can be expressed as the product A 1 A2 · · · P of the eigenvalues of S. Moreover, the mean-centered ellipsoid based on s- 1 [see (3-16)] has axes whose lengths are proportional to the square roots of the A;'s (see Section 2.3). These eigenvalues then provide information on the variabil ity in all directions in the p-space representation of the data. It is useful, therefore, to report their individual values, as well as their product. We shall pursue this topic later when we discuss principal components. Situations in which the Generalized Sample Variance Is Zero
The generalized sample variance will be zero in certain situations. A generalized variance of zero is indicative of extreme degeneracy, in the sense that at least one column of the matrix of deviations,
[ ] [ - - -] -
X 1I - X , X z, - X-,
X1 1 - X--1 x 1 2 - x-2 · · · x 1P - xP X z l - XI X zz - X z . . . X zp - xP
x;,
xn l
� x'
X
� XI
Xn z
1
x'
� Xz · : : X"P � XP
(3-18) can be expressed as a linear combination of the other columns. As we have shown geometrically, this is a case where one of the deviation vectors-for instance, d; = [xl i - X; , . . . , X, ; - x;]-lies in the (hyper) plane generated by d l , . . . , d;-] , (n X p )
-
(n X I) (I X p)
d i+l , . . . , dp .
Result 3.2. The generalized variance is zero when, and only when, at least one deviation vector lies in the (hyper) plane formed by all linear combinations of the others-that is, when the columns of the matrix of deviations in (3-18) are lin early dependent. Proof. If the columns of the deviation matrix (X - IX') are linearly depen dent, there is a linear combination of the columns such that
= (X -
li' ) a for some a 0 But then, as you may verify, (n - 1)S = (X - lx' ) ' (X - IX' ) and ( n - 1)Sa = ( X - IX' ) ' (X - IX ' ) a = 0 =I=
Sec.
3 .4
Generalized Variance
1 37
so the same a corresponds to a linear dependency, a1 col1 (S) + · · · + aP colP (S) = Sa = 0, in the columns of S. So, by Result 2A.9, I S I = 0. In the other direction, if I S I = 0, then there is some linear combination Sa of the columns of S such that Sa = 0. That is, 0 = (n 1 ) Sa = (X - li' ) ' (X - li' ) a. P�emultiplying by a' yields 0 = a' (X - li' ) ' (X - li' ) a = L fx - tx')a and, for the length to equal zero, we must have (X - li' ) a = 0. Thus, the • columns of (X - li' ) are linearly dependent.
-
Example 3.9
(A case where the generalized variance is zero)
Show that l S I = 0 for
X =
(3 X 3 )
[- -
-� � [ ] � = �]
and determine the degeneracy. Here = [3, 1, 5], so 1 - 3 2 1 � X - li' = 4 3 1 - 1 = 1 -1 -1 4 - 3 0 - 1 4 - 5 The deviation (column) vectors are d; = [ -2, 1 , 1], d� = [1 , 0, - 1], and d� = [0, 1, - 1]. Since d = d 1 2d2 , there is column degeneracy. (Note that there is row degeneracy3 afso. ) +This means that one of the deviation vectors-for example, d 3 , lies in the plane generated by the other two residual vectors. Consequently, the th volume is zero. This case is illustrated in Figure 3.9 and may beree-dimensional verified algebraically by showing that I S I = We have i'
0.
3 6 5
3 4
Figure 3.9 A case where the three dimensional volume is zero ( I S I = 0).
1 38
Chap.
3
Sample Geometry and Random Sampling
s (3 X 3 )
and from Definition 2A.24,
-
-[ �
-
•
3 { 1 i) + GH - � o) + o = � - � = o When large data sets are sent and received electronically, investigators are sometimes unpleasantly surprised to find a case of zero generalized variance, so that S does not have an inverse. We have encountered several such cases, with their associated difficulties, before the situation was unmasked. A singular covariance matrix occurs when, for instance, the data are test scores and the investigator has included variables that are sums of the others. For example, an algebra score and a geometry score could be combined to give a total math score, or class midterm and final exam scores summed to give total points. Once, the total weight of a num ber of chemicals was included along with that of each component. This common practice of creating new variables that are sums of the original variables and then including them in the data set has caused enough lost time that we emphasize the consequences. =
Example 3. 1 0
(Creating new variables that lead to a zero generalized variance)
Consider the data matrix
10 16 10 12 X= 13 3 11 14 where the third column is the sum of first two columns. These data could be the number of successful phone solicitations per day by a part-time and a full time employee, respectively, so the third column is the total number of suc cessful solicitations per day. Show that the generalized variance I S I = 0, and determine the nature of the dependency in the data. 1
9 4 12 2 5 8
Sec.
Generalized Variance
3.4
1 39
We find that the mean corrected data matrix, with entries xjk - xk , is X - lx'
[
-2 -1 -3 3 2 1 1 0 -1 0 2 -2 1 1 0
The resulting covariance matrix is s =
2.5 0 2.5 0 2.5 2.5 2.5 2.5 5.0
]
We verify that, in this case, the generalized variance I s I = 2.5 2 X 5 + 0 + 0 - 2.53 - 2.5 3 - 0 = 0
In general, if the three columns of the data matrix X satisfy a linear constraint a 1 xj l + a2xj 2 + a3xj 3 = a constant for all j, then a 1 x 1 + a2 x 2 + a3 x3 = so that c,
c,
a l (xj l - x l ) + a2 (xj 2 - x2 ) + a3 (xj 3 - x3 ) = 0
for all j. That is,
(X - lx' ) a =
0
and the columns of the mean corrected data matrix are linearly dependent. Thus, the inclusion of the third variable, which is linearly related to the first two, has led to the case of a zero generalized variance. Whenever the columns of the mean corrected data matrix are lin early dependent, ( n - 1 ) Sa = (X - li' ) ' (X - li ' ) a = (X - li ' ) O = 0
and Sa = 0 establishes the linear dependency of the columns of S. Hence, l s i = o. Since Sa = 0 = Oa, we see that a is a scaled eigenvector of S associated with an eigenvalue of zero. This gives rise to an important diagnostic: If we are unaware of any extra variables that are linear combinations of the others, we can find them by calculating the eigenvectors of S and identifying the one associated with a zero eigenvalue. That is, if we were unaware of the depen dency in this example, a computer calculation would find an eigenvalue pro portional to a' = [1, 1, - 1], since
1 40
Chap.
3
[
]
Sample Geometry and Random Sampling
Sa =
2.5 0 2.5 0 2.5 2.5 2.5 2.5 5.0
The coefficients reveal that
for allj In addition, the sum of the first two variables minus the third is a constant c for all n units. Here the third variable is actually the sum of the first two vari ables, so the columns of the original data matrix satisfy a linear constraint with c = 0. Because we have the special case c = 0, the constraint establishes the fact that the columns of the data matrix are linearly dependent. Let us summarize the important equivalent conditions for a generalized vari ance to be zero that we discussed in the preceding example. Whenever a nonzero vector a satisfies one of the following three conditions, it satisfies all of them: ( 1 ) Sa = 0 (2) a' (xj - x) = 0 for allj (3) a'xj = c for allj (c = a'x) l (xj l - xd + l (xj 2 - .X2 ) + ( - l ) (xj 3 - .X3 ) = 0
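The eigenvalue diagnostic just described is easy to apply in practice. The sketch below is illustrative only (NumPy assumed); its data rows are reconstructed so that the third column is the sum of the first two and the covariance matrix matches the one in Example 3.10, and it recovers the dependency from the eigenvector paired with the zero eigenvalue.

```python
import numpy as np

# Data consistent with Example 3.10: third column = sum of the first two columns
X = np.array([[1.0, 9.0, 10.0],
              [4.0, 12.0, 16.0],
              [2.0, 10.0, 12.0],
              [5.0, 8.0, 13.0],
              [3.0, 11.0, 14.0]])

S = np.cov(X, rowvar=False)              # [[2.5, 0, 2.5], [0, 2.5, 2.5], [2.5, 2.5, 5.0]]
print(np.round(np.linalg.det(S), 10))    # 0.0 -- a zero generalized variance

values, vectors = np.linalg.eigh(S)
a = vectors[:, np.argmin(values)]        # eigenvector for the (near) zero eigenvalue
print(np.round(a / a[0], 3))             # proportional to [1, 1, -1]: x1 + x2 - x3 is constant
```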
•
a
ear combicorrenctateiodn a scaloerdof with Theof thleinmean eieiggisenval envect ue data, using is zero. 0.
S
a,
Thethe orlinigearinalcombi nusatiinogn of dat a , is a constant. a,
We showed that if condition (3) is satisfied-that is, if the values for one variable can be expressed in terms of the others-then the generalized variance is zero because S has a zero eigenvalue. In the other direction, if condition (1) holds, then the eigenvector a gives coefficients for the linear dependency of the mean cor rected data. In any statistical analysis, I S I = 0 means that the measurements on some variables should be removed from the study as far as the mathematical computa tions are concerned. The corresponding reduced data matrix will then lead to a covariance matrix of full rank and a nonzero generalized variance. The question of which measurements to remove in degenerate cases is not easy to answer. When there is a choice, one should retain measurements on a (presumed) causal variable instead of those on a secondary characteristic. We shall return to this subject in our discussion of principal components. At this point, we settle for delineating some simple conditions for S to be of full rank or of reduced rank. Result 3.3. If n .;;; p, that is, (sample size) .;;; (number of variables), then IS I = 0 for all samples. Proof. We must show that the rank of S is less than or equal top and then apply Result 2A.9.
Sec.
3.4
Generalized Variance
1 41
For any fixed sample, the n row vectors in (3-18) sum to the zero vector. The existence of this linear combination means that the rank of X - 1X1 is less than or equal to n - 1, which, in turn, is less than or equal to p - 1 because n :s:; p. Since (n - 1) s
(p Xp )
= (X - 1X1) 1 (X - 1X1) (n x p)
(p X n )
the kth column of S, colk (S), can be written as a linear combination of the rows of (X - l.X1) 1 • In particular, (n - 1 ) colk (S) = (X - 1X1 ) 1 colk (X - 1X 1 )
= (xl k - xd row1 ( X - 1X1 ) 1 + . . . + (xn k - xk ) rown (X - 1X1) 1 Since the row vectors of (X - 1X1 ) 1 sum to the zero vector, we can write, for example, row1 {X - 1X1 ) 1 as the negative of the sum of the remaining row vectors. After substituting for row 1 (X - 1X1 ) 1 in the peceding equation, we can express
colk (S) as a linear combination of the at most n - 1 linearly independent row vec tors row2 (X - 1X1) 1, , rown (X - 1X1 ) 1• The rank of S is therefore less than or equal to n - 1, which-as noted at the beginning of the proof-is less than or equal • to p - 1, and S is singular. This implies, from Result 2A.9, that S I = 0. • • •
I
Result 3.4. Let the vectors x_1, x_2, ..., x_n, where x_j' is the jth row of the data matrix X, be realizations of the independent random vectors X_1, X_2, ..., X_n. Then

1. If the linear combination a'X_j has positive variance for each constant vector a ≠ 0, then, provided that p < n, S has full rank with probability 1 and |S| > 0.

2. If, with probability 1, a'X_j is a constant (for example, c) for all j, then |S| = 0.

Proof. (Part 2). If a'X_j = a_1 X_j1 + a_2 X_j2 + ... + a_p X_jp = c with probability 1 for all j, then the sample mean of this linear combination is

c = Σ_{j=1}^{n} (a_1 x_j1 + a_2 x_j2 + ... + a_p x_jp)/n = a_1 x̄_1 + a_2 x̄_2 + ... + a_p x̄_p = a'x̄

Then

[ a'x_1 − a'x̄ ]   [ c − c ]   [ 0 ]
[      ⋮      ] = [   ⋮   ] = [ ⋮ ]
[ a'x_n − a'x̄ ]   [ c − c ]   [ 0 ]

indicating linear dependence; the conclusion follows from Result 3.2. The proof of Part (1) is difficult and can be found in [2]. •
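Result 3.3 can also be seen numerically: with fewer observations than variables, the sample covariance matrix is singular no matter how the data arise. A minimal simulation sketch (NumPy; the random data are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 4, 6                                # sample size <= number of variables
X = rng.normal(size=(n, p))                # arbitrary data

S = np.cov(X, rowvar=False)                # p x p sample covariance matrix
print(np.linalg.matrix_rank(S))            # at most n - 1 = 3
print(np.linalg.det(S))                    # generalized variance |S| = 0 within rounding
```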
Generalized Variance Determined by |R| and Its Geometrical Interpretation

The generalized sample variance is unduly affected by the variability of measurements on a single variable. For example, suppose some s_ii is either large or quite small. Then, geometrically, the corresponding deviation vector d_i = (y_i − x̄_i 1) will be very long or very short and will therefore clearly be an important factor in determining volume. Consequently, it is sometimes useful to scale all the deviation vectors so that they have the same length.

Scaling the residual vectors is equivalent to replacing each original observation x_jk by its standardized value (x_jk − x̄_k)/√s_kk. The sample covariance matrix of the standardized variables is then R, the sample correlation matrix of the original variables. (See Exercise 3.13.) We define the deviation vectors of the standardized variables,

(y_i − x̄_i 1)/√s_ii = [ (x_1i − x̄_i)/√s_ii,  (x_2i − x̄_i)/√s_ii,  ...,  (x_ni − x̄_i)/√s_ii ]',     i = 1, 2, ..., p

Since the resulting vectors all have length √(n − 1), the generalized sample variance of the standardized variables will be large when these vectors are nearly perpendicular and will be small when two or more of these vectors are in almost the same direction. Employing the argument leading to (3-7), we readily find that the cosine of the angle θ_ik between (y_i − x̄_i 1)/√s_ii and (y_k − x̄_k 1)/√s_kk is the sample correlation coefficient r_ik. Therefore, we can make the statement that |R| is large when all the r_ik are nearly zero and it is small when one or more of the r_ik are nearly +1 or −1.

In sum, we have the following result: Let (y_i − x̄_i 1)/√s_ii, i = 1, 2, ..., p, be the deviation vectors of the standardized variables. These deviation vectors lie in the direction of d_i, but have a squared length of n − 1. The volume generated in p-space by the deviation vectors can be related to the generalized sample variance. The same steps that lead to (3-15) produce

(Generalized sample variance of the standardized variables) = |R| = (n − 1)^{-p} (volume)²     (3-20)

Figure 3.10 The volume generated by equal-length deviation vectors of the standardized variables.

The volume generated by deviation vectors of the standardized variables is illustrated in Figure 3.10 for the deviation vectors graphed in Figure 3.6. A comparison of Figures 3.10 and 3.6 reveals that the influence of the d_2 vector (large variability in x_2) on the squared volume |S| is much greater than its influence on the squared volume |R|.

The quantities |S| and |R| are connected by the relationship

|S| = (s_11 s_22 ··· s_pp) |R|     (3-21)

so

(n − 1)^p |S| = (s_11 s_22 ··· s_pp) (n − 1)^p |R|     (3-22)

[The proof of (3-21) is left to the reader as Exercise 3.12.]

Interpreting (3-22) in terms of volumes, we see from (3-15) and (3-20) that the squared volume (n − 1)^p |S| is proportional to the squared volume (n − 1)^p |R|. The constant of proportionality is the product of the variances, which, in turn, is proportional to the product of the squares of the lengths (n − 1)s_ii of the d_i.

Equation (3-21) shows, algebraically, how a change in the measurement scale of X_1, for example, will alter the relationship between the generalized variances. Since |R| is based on standardized measurements, it is unaffected by the change in scale. However, the relative value of |S| will be changed whenever the multiplicative factor s_11 changes.
Example 3.11 (Illustrating the relation between |S| and |R|)

Let us illustrate the relationship in (3-21) for the generalized variances |S| and |R| when p = 3. Suppose

S = [ 4  3  1
(3×3) 3  9  2
      1  2  1 ]

Then s_11 = 4, s_22 = 9, and s_33 = 1. Moreover,

R = [  1   1/2  1/2
      1/2   1   2/3
      1/2  2/3   1  ]

Using Definition 2A.24, we obtain

|S| = 4(9·1 − 2·2)(−1)² + 3(3·1 − 2·1)(−1)³ + 1(3·2 − 9·1)(−1)⁴
    = 4(5) − 3(1) + 1(−3) = 14

|R| = 1(1·1 − (2/3)(2/3))(−1)² + (1/2)((1/2)·1 − (2/3)(1/2))(−1)³ + (1/2)((1/2)(2/3) − 1·(1/2))(−1)⁴
    = (1 − 4/9) − (1/2)(1/2 − 1/3) + (1/2)(1/3 − 1/2) = 7/18

It then follows that

14 = |S| = (s_11 s_22 s_33)|R| = (4)(9)(1)(7/18) = 14     (check)     •
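The arithmetic of Example 3.11 is easy to confirm with a short sketch (NumPy; the matrix S is the one displayed above):

```python
import numpy as np

S = np.array([[4.0, 3.0, 1.0],
              [3.0, 9.0, 2.0],
              [1.0, 2.0, 1.0]])

d = np.sqrt(np.diag(S))                  # standard deviations sqrt(s_ii)
R = S / np.outer(d, d)                   # sample correlation matrix

det_S = np.linalg.det(S)                 # 14
det_R = np.linalg.det(R)                 # 7/18 = 0.3888...
print(det_S, det_R)
print(np.prod(np.diag(S)) * det_R)       # equals |S|, verifying (3-21)
```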
Another Generalization of Variance
We conclude this discussion by mentioning another generalization of variance. Specifically, we define the total sample variance as the sum of the diagonal elements of the sample variance-covariance matrix S. Thus,

Total sample variance = s_11 + s_22 + ··· + s_pp     (3-23)

Example 3.12 (Calculating the total sample variance)

Calculate the total sample variance for the variance-covariance matrices S in Examples 3.7 and 3.9.

From Example 3.7,

S = [ 252.04  -68.43
      -68.43  123.67 ]

and

Total sample variance = s_11 + s_22 = 252.04 + 123.67 = 375.71

From Example 3.9,

S = [  3    -3/2    0
     -3/2    1     1/2
       0    1/2     1  ]

and

Total sample variance = s_11 + s_22 + s_33 = 3 + 1 + 1 = 5     •
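Because (3-23) is just the sum of the diagonal elements of S, it can be computed as a trace. A minimal sketch using the two covariance matrices displayed in Example 3.12:

```python
import numpy as np

S1 = np.array([[252.04, -68.43],
               [-68.43, 123.67]])          # S from Example 3.7
S2 = np.array([[ 3.0, -1.5, 0.0],
               [-1.5,  1.0, 0.5],
               [ 0.0,  0.5, 1.0]])         # S from Example 3.9

print(np.trace(S1))    # 375.71
print(np.trace(S2))    # 5.0
```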
Geometrically, the total sample variance is the sum of the squared lengths of the p deviation vectors d_1 = (y_1 − x̄_1 1), ..., d_p = (y_p − x̄_p 1), divided by n − 1. The total sample variance criterion pays no attention to the orientation (correlation structure) of the residual vectors. For instance, it assigns the same values to both sets of residual vectors (a) and (b) in Figure 3.6.
3.5 SAMPLE MEAN, COVARIANCE, AND CORRELATION AS MATRIX OPERATIONS
We have developed geometrical representations of the data matrix X and the derived descriptive statistics x̄ and S. In addition, it is possible to link algebraically the calculation of x̄ and S directly to X using matrix operations. The resulting expressions, which depict the relation between x̄, S, and the full data set X concisely, are easily programmed on electronic computers.

We have it that x̄_i = (x_1i · 1 + x_2i · 1 + ... + x_ni · 1)/n = y_i'1/n. Therefore,

x̄ = [ x̄_1 ]   [ y_1'1/n ]         [ x_11 x_12 ... x_1n ] [ 1 ]
    [ x̄_2 ] = [ y_2'1/n ] = (1/n) [ x_21 x_22 ... x_2n ] [ 1 ]
    [  ⋮  ]   [    ⋮    ]          [  ⋮    ⋮        ⋮  ] [ ⋮ ]
    [ x̄_p ]   [ y_p'1/n ]          [ x_p1 x_p2 ... x_pn ] [ 1 ]

or

x̄ = (1/n) X'1     (3-24)
That is, x̄ is calculated from the transposed data matrix by postmultiplying by the vector 1 and then multiplying the result by the constant 1/n.

Next, we create an n × p matrix of means by transposing both sides of (3-24) and premultiplying by 1; that is,

1x̄' = (1/n) 11'X = [ x̄_1 x̄_2 ... x̄_p
                     x̄_1 x̄_2 ... x̄_p
                      ⋮   ⋮       ⋮
                     x̄_1 x̄_2 ... x̄_p ]     (3-25)

Subtracting this result from X produces the n × p matrix of deviations (residuals)

X − (1/n) 11'X = [ x_11 − x̄_1  x_12 − x̄_2  ...  x_1p − x̄_p
                   x_21 − x̄_1  x_22 − x̄_2  ...  x_2p − x̄_p
                       ⋮           ⋮                 ⋮
                   x_n1 − x̄_1  x_n2 − x̄_2  ...  x_np − x̄_p ]     (3-26)
Now, the matrix (n − 1)S representing sums of squares and cross-products is just the transpose of the matrix (3-26) times the matrix itself, or

(n − 1)S = (X − (1/n)11'X)'(X − (1/n)11'X)
         = X'(I − (1/n)11')'(I − (1/n)11')X = X'(I − (1/n)11')X

since

(I − (1/n)11')'(I − (1/n)11') = I − (1/n)11' − (1/n)11' + (1/n²)11'11' = I − (1/n)11'
To summarize, the matrix expressions relating x̄ and S to the data set X are

x̄ = (1/n) X'1

S = (1/(n − 1)) X'(I − (1/n)11')X     (3-27)
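The formulas in (3-24) and (3-27) translate directly into matrix code. The sketch below (NumPy, using the small data matrix of Example 3.9) computes x̄ and S this way and compares S with a library routine:

```python
import numpy as np

X = np.array([[1.0, 2.0, 5.0],
              [4.0, 1.0, 6.0],
              [4.0, 0.0, 4.0]])            # data matrix from Example 3.9
n, p = X.shape
one = np.ones((n, 1))

xbar = (X.T @ one) / n                     # (3-24): xbar = (1/n) X'1
H = np.eye(n) - one @ one.T / n            # centering matrix I - (1/n) 1 1'
S = X.T @ H @ X / (n - 1)                  # (3-27)

print(xbar.ravel())                        # [3. 1. 5.]
print(S)
print(np.allclose(S, np.cov(X, rowvar=False)))   # True
```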
The result for S_n is similar, except that 1/n replaces 1/(n − 1) as the first factor.

The relations in (3-27) show clearly how matrix operations on the data matrix X lead to x̄ and S. Once S is computed, it can be related to the sample correlation matrix R. The resulting expression can also be "inverted" to relate R to S. We first define the p × p sample standard deviation matrix D^{1/2} and compute its inverse, (D^{1/2})^{-1} = D^{-1/2}. Let

D^{1/2} = [ √s_11    0    ...    0
  (p×p)       0    √s_22  ...    0
              ⋮      ⋮            ⋮
              0      0    ...  √s_pp ]

Then

D^{-1/2} = [ 1/√s_11     0     ...     0
  (p×p)         0     1/√s_22  ...     0
                ⋮        ⋮              ⋮
                0        0     ...  1/√s_pp ]

Since

S = [ s_11  s_12  ...  s_1p
      s_12  s_22  ...  s_2p
        ⋮     ⋮           ⋮
      s_1p  s_2p  ...  s_pp ]

and

R = [ 1                    s_12/(√s_11 √s_22)  ...  s_1p/(√s_11 √s_pp)
      s_12/(√s_11 √s_22)   1                   ...  s_2p/(√s_22 √s_pp)
        ⋮                    ⋮                         ⋮
      s_1p/(√s_11 √s_pp)   s_2p/(√s_22 √s_pp)  ...  1                  ]     (3-28)

we have

R = D^{-1/2} S D^{-1/2}     (3-29)
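A quick numerical check of (3-29), again a sketch using the covariance matrix S displayed in Example 3.11:

```python
import numpy as np

S = np.array([[4.0, 3.0, 1.0],
              [3.0, 9.0, 2.0],
              [1.0, 2.0, 1.0]])

D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(S)))   # D^{-1/2}
R = D_inv_sqrt @ S @ D_inv_sqrt                   # (3-29)
print(R)    # unit diagonal, off-diagonal entries r_ik = s_ik / sqrt(s_ii s_kk)
```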
Postmultiplying and premultiplying both sides of (3-29) by D^{1/2} and noting that D^{1/2} D^{-1/2} = I gives

D^{1/2} R D^{1/2} = S     (3-30)

That is, R can be obtained from the information in S, whereas S can be obtained from D^{1/2} and R. Equations (3-29) and (3-30) are sample analogs of (2-36) and (2-37).

3.6 SAMPLE VALUES OF LINEAR COMBINATIONS OF VARIABLES
We have introduced linear combinations of p variables in Section 2.6. In many multivariate procedures, we are led naturally to consider a linear combination of the form

c'X = c_1 X_1 + c_2 X_2 + ... + c_p X_p

whose observed value on the jth trial is

c'x_j = c_1 x_j1 + c_2 x_j2 + ... + c_p x_jp,     j = 1, 2, ..., n     (3-31)

The n derived observations in (3-31) have

Sample mean = (c'x_1 + c'x_2 + ... + c'x_n)/n = c'(x_1 + x_2 + ... + x_n)/n = c'x̄     (3-32)

Since (c'x_j − c'x̄)² = (c'(x_j − x̄))² = c'(x_j − x̄)(x_j − x̄)'c, we have

Sample variance = [(c'x_1 − c'x̄)² + (c'x_2 − c'x̄)² + ... + (c'x_n − c'x̄)²] / (n − 1)
                = [c'(x_1 − x̄)(x_1 − x̄)'c + c'(x_2 − x̄)(x_2 − x̄)'c + ... + c'(x_n − x̄)(x_n − x̄)'c] / (n − 1)
                = c' [ (x_1 − x̄)(x_1 − x̄)' + (x_2 − x̄)(x_2 − x̄)' + ... + (x_n − x̄)(x_n − x̄)' ] c / (n − 1)

or

Sample variance of c'X = c'Sc     (3-33)

Equations (3-32) and (3-33) are sample analogs of (2-43). They correspond to substituting the sample quantities x̄ and S for the "population" quantities μ and Σ, respectively, in (2-43).

Now consider a second linear combination
b'X = b_1 X_1 + b_2 X_2 + ... + b_p X_p
whose observed value on the jth trial is

b'x_j = b_1 x_j1 + b_2 x_j2 + ... + b_p x_jp,     j = 1, 2, ..., n     (3-34)

It follows from (3-32) and (3-33) that the sample mean and variance of these derived observations are

Sample mean of b'X = b'x̄
Sample variance of b'X = b'Sb

Moreover, the sample covariance computed from pairs of observations on b'X and c'X is

Sample covariance = [(b'x_1 − b'x̄)(c'x_1 − c'x̄) + (b'x_2 − b'x̄)(c'x_2 − c'x̄) + ... + (b'x_n − b'x̄)(c'x_n − c'x̄)] / (n − 1)
                  = [b'(x_1 − x̄)(x_1 − x̄)'c + b'(x_2 − x̄)(x_2 − x̄)'c + ... + b'(x_n − x̄)(x_n − x̄)'c] / (n − 1)
                  = b' [ (x_1 − x̄)(x_1 − x̄)' + (x_2 − x̄)(x_2 − x̄)' + ... + (x_n − x̄)(x_n − x̄)' ] c / (n − 1)

or

Sample covariance of b'X and c'X = b'Sc     (3-35)
In sum, we have the following result.

Result 3.5. The linear combinations

b'X = b_1 X_1 + b_2 X_2 + ... + b_p X_p
c'X = c_1 X_1 + c_2 X_2 + ... + c_p X_p

have sample means, variances, and covariances that are related to x̄ and S by

Sample mean of b'X = b'x̄
Sample mean of c'X = c'x̄
Sample variance of b'X = b'Sb
Sample variance of c'X = c'Sc
Sample covariance of b'X and c'X = b'Sc     (3-36)     •
Chap.
3
Sample Geometry and Random Sampling
Example 3.13 (Means and covariances for linear combinations)
We shall consider two linear combinations and their derived values for the n = 3 observations given in Example 3.9 as

X = [ 1  2  5
      4  1  6
      4  0  4 ]

Consider the two linear combinations

b'X = [2  2  -1] [X_1, X_2, X_3]' = 2X_1 + 2X_2 − X_3

and

c'X = [1  -1  3] [X_1, X_2, X_3]' = X_1 − X_2 + 3X_3

The means, variances, and covariance will first be evaluated directly and then be evaluated by (3-36).

Observations on these linear combinations are obtained by replacing X_1, X_2, and X_3 with their observed values. For example, the n = 3 observations on b'X are

b'x_1 = 2x_11 + 2x_12 − x_13 = 2(1) + 2(2) − (5) = 1
b'x_2 = 2x_21 + 2x_22 − x_23 = 2(4) + 2(1) − (6) = 4
b'x_3 = 2x_31 + 2x_32 − x_33 = 2(4) + 2(0) − (4) = 4

The sample mean and variance of these values are, respectively,

Sample mean = (1 + 4 + 4)/3 = 3
Sample variance = [(1 − 3)² + (4 − 3)² + (4 − 3)²]/(3 − 1) = 3

In a similar manner, the n = 3 observations on c'X are

c'x_1 = 1x_11 − 1x_12 + 3x_13 = 1(1) − 1(2) + 3(5) = 14
c'x_2 = 1(4) − 1(1) + 3(6) = 21
c'x_3 = 1(4) − 1(0) + 3(4) = 16

and

Sample mean = (14 + 21 + 16)/3 = 17
Sample variance = [(14 − 17)² + (21 − 17)² + (16 − 17)²]/(3 − 1) = 13

Moreover, the sample covariance, computed from the pairs of observations (b'x_1, c'x_1), (b'x_2, c'x_2), and (b'x_3, c'x_3), is

Sample covariance = [(1 − 3)(14 − 17) + (4 − 3)(21 − 17) + (4 − 3)(16 − 17)]/(3 − 1) = 9/2

Alternatively, we use the sample mean vector x̄ and sample covariance matrix S derived from the original data matrix X to calculate the sample means, variances, and covariances for the linear combinations. Thus, if only the descriptive statistics are of interest, we do not even need to calculate the observations b'x_j and c'x_j. From Example 3.9,

x̄ = [ 3        S = [  3    -3/2    0
      1              -3/2    1     1/2
      5 ]              0    1/2     1  ]

Consequently, using (3-36), we find that the two sample means for the derived observations are

Sample mean of b'X = b'x̄ = [2  2  -1][3, 1, 5]' = 3     (check)
Sample mean of c'X = c'x̄ = [1  -1  3][3, 1, 5]' = 17     (check)

Using (3-36), we also have

Sample variance of b'X = b'Sb = [2  2  -1] S [2, 2, -1]' = 3     (check)
Sample variance of c'X = c'Sc = [1  -1  3] S [1, -1, 3]' = 13     (check)
Sample covariance of b'X and c'X = b'Sc = [2  2  -1] S [1, -1, 3]' = 9/2     (check)

As indicated, these last results check with the corresponding sample quantities computed directly from the observations on the linear combinations. •
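The calculations of Example 3.13, and the more general statement in Result 3.6 below, reduce to a few matrix products. A sketch (NumPy, using the Example 3.9 data):

```python
import numpy as np

X = np.array([[1.0, 2.0, 5.0],
              [4.0, 1.0, 6.0],
              [4.0, 0.0, 4.0]])            # data from Example 3.9
b = np.array([2.0, 2.0, -1.0])
c = np.array([1.0, -1.0, 3.0])

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

print(b @ xbar, c @ xbar)                  # sample means: 3, 17
print(b @ S @ b, c @ S @ c, b @ S @ c)     # variances 3, 13 and covariance 4.5

# Stacking b' and c' as the rows of A gives all of these at once (Result 3.6).
A = np.vstack([b, c])
print(A @ xbar)                            # sample mean vector of AX
print(A @ S @ A.T)                         # sample covariance matrix of AX
```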
The sample mean and covariance relations in Result 3.5 pertain to any number of linear combinations. Consider the q linear combinations
a_i1 X_1 + a_i2 X_2 + ... + a_ip X_p,     i = 1, 2, ..., q     (3-37)

These can be expressed in matrix notation as

[ a_11 X_1 + a_12 X_2 + ... + a_1p X_p ]   [ a_11  a_12  ...  a_1p ] [ X_1 ]
[ a_21 X_1 + a_22 X_2 + ... + a_2p X_p ] = [ a_21  a_22  ...  a_2p ] [ X_2 ] = AX     (3-38)
[                  ⋮                   ]   [  ⋮     ⋮           ⋮  ] [  ⋮  ]
[ a_q1 X_1 + a_q2 X_2 + ... + a_qp X_p ]   [ a_q1  a_q2  ...  a_qp ] [ X_p ]
Taking the ith row of A, a_i', to be b' and the kth row of A, a_k', to be c', we see that Equations (3-36) imply that the ith row of AX has sample mean a_i'x̄ and that the ith and kth rows of AX have sample covariance a_i'S a_k. Note that a_i'S a_k is the (i, k)th element of ASA'.

Result 3.6. The q linear combinations AX in (3-38) have sample mean vector Ax̄ and sample covariance matrix ASA'. •

EXERCISES
3.1. Given the data matrix
x � [HJ
(a) Graph the scatter plot in p = 2 dimensions. Locate the sample mean on your diagram.
(b) Sketch the n = 3-dimensional representation of the data, and plot the deviation vectors y1 - x 1 1 and y2 - x 2 1.
(c) Sketch the deviation vectors in (b) emanating from the origin. Calculate
the lengths of these vectors and the cosine of the angle between them. Relate these quantities to S" and R. 3.2. Given the data matri:st
x � [ � -�]
(a) Graph the scatter plot in p = 2 dimensions, and locate the sample mean on your diagram.
(b) Sketch the n = 3-space representation of the data, and pldt the deviation
vectors y 1 - .X1 1 and y2 - .X21. ( c) Sketch the deviation vectors in (b) emanating from the origin. Calculate their lengths and the cosine of the angle between them. Relate these quantities to sll and R. 3.3. Perform the decomposition of y 1 into .X1 1 and y 1 - .X1 1 using the first column of the data matrix in Example 3.9. 3.4. Use the six observations on the variable X1 , in units of millions, from Table 1.1. (a) Find the projection on 1' = [1, 1 , 1 , 1, 1, 1]. (b) Calculate the deviation vector y1 - .X1 1. Relate its length to the sample standard deviation. ( c) Graph (to scale) the triangle formed by y 1 , .X1 1, and y 1 - x 1 1. Identify the length of each component in your graph.
1 54
Chap.
3
Sample Geometry and Random Sampling
1.1.
(d) Repeat Parts a-c for the variable X2 in Table (e) Graph (to scale) the two deviation vectors y 1 - .X1 1 and y2 - x 2 1. Cal culate the value of the angle between them. 3.5. Calculate the generalized sample variance l S I for (a) the data matrix X in Exercise and (b) the data matrix X in Exercise 3.6. Consider the data matrix
3.1
X =
3.2 .
[ -125 432 -223 ]
(a) Calculate the matrix of deviations (residuals), X - li ' . Is this matrix of full rank? Explain.
(b) Determine S and calculate the generalized sample variance I S 1 . Interpret the latter geometrically.
(c) Using the results in (b), calculate the total sample variance. [See
[ 45 45 ] '
3.7. Sketch the solid ellipsoids (x three matrices
s =
s =
[ 01 01 0OJ
i) ' s - 1 (x
[ -45 -45 ] ' -
-
i) �
s =
(3-23).]
1 [see (3-16)] for the
[� �]
[ 1� i � ]
(Note that these matrices have the same generalized variance I S I ) 3.8. Given s =
0 0 1
and S �
=
_
1
_
1
-
_
(a) Calculate the total sample variance for each S . Compare the results. (b) Calculate the generalized sample variance for each S , and compare the
results. Comment on the discrepancies, if any, found between Parts a and b. 3.9. The following data matrix contains data on test scores, with x 1 = score on first test, x2 = score on second test, and x3 = total score on the two tests:
X = [ 12  17  29
      18  20  38
      14  16  30
      20  18  38
      16  19  35 ]
(a) Obtain the mean corrected data matrix, and verify that the columns are linearly dependent. Specify an a' = [a 1 , a2 , a3] vector that establishes the linear dependence.
(b) Obtain the sample covariance matrix S, and verify that the generalized variance is zero. Also, show that Sa = 0, so a can be rescaled to be an eigenvector corresponding to eigenvalue zero.
(c) Verify that the third column of the data matrix is the sum of the first two columns. That is, show that there is linear dependence, with a_1 = 1, a_2 = 1, and a_3 = -1.
3.10. When the generalized variance is zero, it is the columns of the mean corrected data matrix X_c = X − 1x̄' that are linearly dependent, not necessarily those of the data matrix itself. Given the data

X = [ 3  1  0
      6  4  6
      4  2  2
      7  0  3
      5  3  4 ]

(a) Obtain the mean corrected data matrix, and verify that the columns are linearly dependent. Specify an a' = [a_1, a_2, a_3] vector that establishes the dependence.
(b) Obtain the sample covariance matrix S, and verify that the generalized variance is zero.
(c) Show that the columns of the data matrix are not linearly independent in this case.
3.11. Use the sample covariance obtained in Example 3.7 to verify (3-29) and (3-30), which state that R = D^{-1/2} S D^{-1/2} and D^{1/2} R D^{1/2} = S.
3.12. Show that |S| = (s_11 s_22 ··· s_pp)|R|.
Hint: From Equation (3-30), S = D^{1/2} R D^{1/2}. Taking determinants gives |S| = |D^{1/2}| |R| |D^{1/2}|. (See Result 2A.11.) Now examine |D^{1/2}|.
3.13. Given a data matrix X and the resulting sample correlation matrix R, consider the standardized observations (x_jk − x̄_k)/√s_kk, k = 1, 2, ..., p, j = 1, 2, ..., n. Show that these standardized quantities have sample covariance matrix R.
3.14. Consider the data matrix X in Exercise 3.1. We have n = 3 observations on p = 2 variables X_1 and X_2. Form the linear combinations

c'X = [-1  2] [X_1, X_2]' = -X_1 + 2X_2
b'X = [2  3] [X_1, X_2]' = 2X_1 + 3X_2

(a) Evaluate the sample means, variances, and covariance of b'X and c'X from first principles. That is, calculate the observed values of b'X and c'X, and then use the sample mean, variance, and covariance formulas.
(b) Calculate the sample means, variances, and covariance of b'X and c'X using (3-36). Compare the results in (a) and (b).
3.15. Repeat Exercise 3.14 using the data matrix
and the linear combinations and Let V be a vector random variable with mean vector E(V) = =Pv and covari = ance matrix E(V - Pv ) (V - Pv ) ' :Iv. Show that E(VV') :I v + PvP'v · 3.17. Show that, if X and Z are independent, then each component of X is (q X l ) (p x l ) independent of each component of Z. Hint: P [Xl � xl , Xz � X z , . . . , xp � xp and zl � Zl , . . . , zq � Z q ] = P [Xl � X , Xz � X , . . , x � x ] . P [Zl � Zl , . . . , zq �Z q ] by independence. Let x2 , . . .l, xP and zz2 , , Zpq tendp to infinity, to obtain
3.16.
• • •
for all x_1, z_1. So X_1 and Z_1 are independent. Repeat for other pairs.

REFERENCES
1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Eaton, M., and M. Perlman. "The Non-singularity of Generalized Sample Covariance Matrices." Annals of Statistics, 1 (1973), 710-717.
CHAPTER
4
The Multivariate Normal Distribution
4.1 INTRODUCTION
A generalization of the familiar bell-shaped normal density to several dimensions plays a fundamental role in multivariate analysis. In fact, most of the techniques encountered in this book are based on the assumption that the data were generated from a multivariate normal distribution. While real data are never exactly multivariate normal, the normal density is often a useful approximation to the "true" population distribution.

One advantage of the multivariate normal distribution stems from the fact that it is mathematically tractable and "nice" results can be obtained. This is frequently not the case for other data-generating distributions. Of course, mathematical attractiveness per se is of little use to the practitioner. It turns out, however, that normal distributions are useful in practice for two reasons: First, the normal distribution serves as a bona fide population model in some instances; second, the sampling distributions of many multivariate statistics are approximately normal, regardless of the form of the parent population, because of a central-limit effect.

To summarize, many real-world problems fall naturally within the framework of normal theory. The importance of the normal distribution rests on its dual role as both population model for certain natural phenomena and approximate sampling distribution for many statistics.
4.2 THE MULTIVARIATE NORMAL DENSITY AND ITS PROPERTIES
The multivariate normal density is a generalization of the univariate normal density to p ≥ 2 dimensions. Recall that the univariate normal distribution, with mean μ and variance σ², has the probability density function

f(x) = (1/√(2πσ²)) e^{−[(x − μ)/σ]²/2},     −∞ < x < ∞     (4-1)
A plot of this function yields the familiar bell-shaped curve shown in Figure 4.1. Also shown in the figure are approximate areas under the curve within ±1 standard deviation and ±2 standard deviations of the mean. These areas represent probabilities, and thus, for the normal random variable X,

P(μ − σ ≤ X ≤ μ + σ) ≈ .68
P(μ − 2σ ≤ X ≤ μ + 2σ) ≈ .95

Figure 4.1 A normal density with mean μ and variance σ² and selected areas under the curve.
in the exponent of the univariate normal density function measures the square of the distance from x to in standard deviation units. This can be generalized for a on several variables as p X 1 vector x of observations f.1.
(4-3)
The p × 1 vector μ represents the expected value of the random vector X, and the p × p matrix Σ is the variance-covariance matrix of X. [See (2-30) and (2-31).] We shall assume that the symmetric matrix Σ is positive definite, so the expression in (4-3) is the square of the generalized distance from x to μ.

The multivariate normal density is obtained by replacing the univariate distance in (4-2) by the multivariate generalized distance of (4-3) in the density function of (4-1). When this replacement is made, the univariate normalizing constant (2π)^{-1/2}(σ²)^{-1/2}
must be changed to a more general constant that makes the volume under the surface of the multivariate density function unity for any p. This is necessary because, in the multivariate case, probabilities are represented by volumes under the surface over regions defined by intervals of the x_i values. It can be shown (see [1]) that this constant is (2π)^{-p/2}|Σ|^{-1/2}, and consequently, a p-dimensional normal density for the random vector X' = [X_1, X_2, ..., X_p] has the form

f(x) = (1/((2π)^{p/2}|Σ|^{1/2})) e^{−(x − μ)'Σ⁻¹(x − μ)/2}     (4-4)

where −∞ < x_i < ∞, i = 1, 2, ..., p. We shall denote this p-dimensional normal density by N_p(μ, Σ), which is analogous to the normal density in the univariate case.
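To make (4-4) concrete, the short sketch below evaluates the density at a point directly from the formula. The particular μ and Σ are hypothetical values chosen only for illustration:

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Evaluate the N_p(mu, Sigma) density of (4-4) at the point x."""
    p = len(mu)
    diff = x - mu
    dist2 = diff @ np.linalg.inv(Sigma) @ diff      # squared generalized distance (4-3)
    norm_const = (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm_const * np.exp(-dist2 / 2)

# Illustrative parameters (not from the text).
mu = np.array([0.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

print(mvn_density(np.array([0.0, 2.0]), mu, Sigma))   # density height at the mode mu
```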
Example 4. 1
oo,
(Bivariate normal density)
Let us evaluate the p = 2-variate normal density in terms of the individual parameters p. 1 = E (X1 ), p.2 = E (X2 ), a1 1 = Var (X1 ) , a22 = Var(X2 ), and p 1 2 = a1 2 / (� YO; ) = Corr (X1 , X2 ) . Using Result 2A.8, we find that the inverse of the covariance matrix is
Introducing the correlation coefficient p by writing a1 2 = p1 2 � YO; , we obtain a1 1 a22 - af2 = a1 1 a22 ( 1 - pf21)2 , and the squared distance becomes
[
[(
) (
) ( )] 2 ( The last expression is written in terms of the standardized values (x1 - p. 1 ) / � and (x2 - p.2 )/ Y0; . )
_
1 60
Chap. 4 The Multivariate Normal Distribution
Next, s ce I I = 0"1 1 0"22 - O"f2 = 0"1 1 0"22 ( 1 - pf2 ) , we can substitute for I- 1 and inI I I inI (4-4) to get the expression for the bivariate (p = 2) nor mal density involving the individual parameters J.L r , J.L 2 , 0"1 1 , 0"22 , and p 1 2 : ( 4- 6) 2 2 - 1"" 1 exp { - 2 ( 1 -1 P 122 ) [ (X1� ) (X2 - 1"" 2 ) v0; (X 1 - J.L 2 _ 2p 1 2 �IL 1 ) ( x2va; )]} The expression in (4-6) is somewhat unwieldy, and the compact general form in (4-4) is more informative in many ways. On the other hand, the expression in (4-6) is useful for discussing certain properties of the normal distribution. For example, if the random variables X and X2 are uncorrelated, so that p 1 2 = 0, the joint density can be written as the1 product of two univariate normal densi ties each of the form of (4-1 ) . That is, f(x1 , x2 ) = f (x1 )f(x2 ) and X1 and X2 are independent. [See (2-28).] This result is true in general. (See Result 4.5.) Two bivariate distributions with 1 = 0"22 are shown in Figure 4.2 on page 161. In Figure 4.2(a), X1 and X2 0"are1 independent p 2 = 0) . In Figure 4.2(b) , p 1 2 = .75. Notice how the presence of correlation( 1causes the proba • bility to concentrate along a line. From the expression in (4-4) for the density of a p-dimensional normal vari able, it should be clear that the paths of x values yielding a constant height for the density are ellipsoids. That is, the multivariate normal1 density is constant on sur faces where the square of the distance (x - p, ) I - (x - p,) is constant. These paths are called contours: Constantprobability density contour = !all x such that (x - p, ) I - 1 ( x - p, ) = c 2 } = surface of an ellipsoid centered at p, The axes of each ellipsoid of constant density are in the direction of the eigenvectors of I - I , and their lengths- 1 are proportional to the reciprocals of the square- 1 roots of the eigenvalues of I • Fortunately, we can avoid the calculation of I when determining the axes, since these ellipsoids are also determined by the eigenvalues and eigenvectors of I. We state the correspondence formally for later reference. Result 4. 1 . If I is positive definite, so that I - r exists, then ( Ie = A e implies I - 1 e = ±) e so (A, e) is an eigenvalue-eigenvector pair for I corresponding to the pair ( 1/ A, e) for I - I . Also, I - r is positive definite. 11
x
11
+
'
'
Sec.
4.2
The Multivariate Normal Density and its Properties
(a)
(b )
Figure 4.2 = u22
(b) u1 1
Two bivariate normal distributions. (a) u1 1 and p 1 2 = . 75 .
u22
and p 1 2
0.
1 61
1 62
Chap. 4 The Multivariate Normal Distribution
For I positive definite and e � 0 an eigenvector, we have or Moreover, and division by gives Thus, is an eigen value-eigenvector pair for Also, for any by Proof.
e = :I- 1 (:Ie) = :I- 1 (Ae), 0 < e' :Ie = e' (:Ie) = e' (A e) = Ae'e = A. 1 1 :I- e = (1/ A ) e. e = A:I- e, A>0 (1/ A, e) :I-1 • p X 1 x, (2-21)
since each term Aj 1 (x'eY is pnonnegative. In addition, x'ei = 0 for all i only if x = 0. So x � 0 implies that � (1/\ ) (x'eY > 0, and it follows that :I - 1 is posii= l tive definite. The following summarizes these concepts: •
A contour of constant density for a bivariate normal distribution with is obtained in the following example.
0"1 1 = O"zz
Example 4.2
(Contours of the bivariate normal density)
We shall obtain the axes of constant probability density contours for a bivari ate normal distribution when 0"1 1 = 0"2 2 . From (4-7), these axes are given by the eigenvalues and eigenvectors of :I. Here I :I - AI I = 0 becomes Consequently, the eigenvalues are A1 = 0"11 + 0"1 2 and A2 = 0"11 - 0"1 2 . The eigenvector e1 is determined from
Sec. 4.2 The Multivariate Normal Density and its Properties
1 63
or = ( 0"1 1 + 0"1 2 ) e1 0"1 2e1 + 0"1 1 e2 = ( 0"1 1 + 0"1 2 ) e2 These equations imply that e1 = e2 , and after normalization, the first eigen O"n el + 0"1 2 e2
value-eigenvector pair is
Similarly, A2 = 0"1 1 - 0"1 2 yields the eigenvector ; = [lj\12, - lj\12] . When the covariance 0"1 2 (or correlation p1 2 ) is positive, A1 = 0"1 1 + 0"1 2 is the largest eigenvalue, and its associated' eigenvector ; = [1 /\12, lj\12] lies along the 45° line through the point = [JL , JL ] . This is true for any pos itive value of the covariance (correlation).IL Since1 the2 axes of the constant-den sity ellipses are given by ± c � and ± e VA";: 2 [see (4-7) ], and the eigenvectors each have length unity, the major axis will be associated with the largest eigenvalue. For positively correlated normal random variables, then, the major axis of the constant-density ellipses will be along the 45° line through IL· (See Figure 4. 3 . ) When the covariance (correlation) is negative, A 2 = 0"1 1 - 0"1 2 will be the largest eigenvalue, and the major axes of the con stant-density ellipses will lie along a line at right angles to the 45° line through IL· (These results are true only for 0"1 1 = 0"22 .) To summarize, the axes of the ellipses of constant density for a bivari ate normal distribution with 0"1 1 = 0"22 are determined by e
e
e1
e
•
c j a- 1 1 + CY 1 2
c j a- 1 1 - a- 1 2 Figure 4.3 A constant-density contour for a bivariate normal distribution with u1 1 = u2 2 and u1 2 (or p 1 2 > 0).
>
0
1 64
Chap. 4 The M ultivariate Normal Distribution
We show in Result 4.7 that the choice c 2 = x; ( a ) , where x ; ( a ) is the upper ( 100a) th percentile of a chi-square distribution with p degrees of freedom, leads to contours that contain ( 1 - a) X 100% of the probability. Specifically, the follow ing is true for a p-dimensional normal distribution: The solid
ellips<:)id ofiyalues satisfyirig
has probability 1
:(x
-
cp,)' l:-1 (x
p,) ·�
-
-,. a.
xffi;(a)
(4-8)
The constant-density contours containing 50% and 90% of the probability under the bivariate normal surfaces in Figure 4.2 are pictured in Figure 4.4. Xz
Xz
fl z
@ I
f.ll
Figure 4.4
in Figure 4.2.
fl z XI
-# f.l l
XI
The 50% and 90% contours for the bivariate normal distributions
The p-variate normal density in (4-4) has a maximum value when the squared distance in (4-3) is zero-that is, when x = Thus, p is the point of maximum density, or mode, as well as the expected value of X, or mean. The fact that p is the mean of the multivariate normal distribution follows from the symmetry exhibited by the constant-density contours: These contours are centered, or balanced, at p. p,.
Additional Properties of the Multivariate Normal Distribution
Certain properties of the normal distribution will be needed repeatedly in our expla nations of statistical models and methods. These properties make it possible to manipulate normal distributions easily and, as we suggested in Section 4.1, are partly responsible for the popularity of the normal distribution. The key properties, which we shall soon discuss in some mathematical detail, can be stated rather simply. The following are true for a random vector X having a multivariate normal distribution:
Sec. 4.2 The Multivariate Normal Density and its Properties
1 65
Linear combinations of the components of X are normally distributed. All subsets of the components of X have a (multivariate) normal distribution. 3. Zero covariance implies that the corresponding components are indepen dently distributed. 4. The conditional distributions of the components are (multivariate) normal. These statements are reproduced mathematically in the results that follow. Many of these results are illustrated with examples. The proofs that are included should help improve your understanding of matrix manipulations and also lead you to an appreciation for the manner in which the results successively build on themselves. Result 4.2 can be taken as a working definition of the normal distribution. With this in hand, the subsequent properties are almost immediate. Our partial proof of Result 4.2 indicates how the linear combination definition of a normal density relates to the multivariate density in (4-4). Resu lt 4.2. If X is distributed as N (p, I) , then any linear combination of variables a'X = a1 X1 + a2 X2 + · · · + aP XPP is distributed as N (a' p, a' Ia). Also, if a'X is distributed as N(a' p, a' I a) for every a, then X must be Np (p, I). Proof. The expected value and variance of a'X follow from (2-43). Proving that a'X is normally distributed if X is multivariate normal is more difficult. You can find a proof in [1]. The second part of Result 4.2 is also demonstrated in [1]. 1. 2.
•
Example 4.3 (The distribution of a linear combination of the components of a normal random vector)
Consider the linear combination a'X of a multivariate normal random vector determined by the choice a' = [1, 0, ... , 0]. Since a'X
and
�
[1, o,
,o
{l:J
�
x,
1 66
Chap. 4 The Multivariate Normal Distribution
we have a' I a = [1, 0, . . . , 0]
[ ::: ::: ::: ::: j[ �J £T1 p O"zp
£TPP
u1 1
0
and it follows from Result 4.2 that X1 is distributed as N (J-L1 , u1 1 ) . More gen erally, the marginal distribution of any component X; of X is N(J-L;, u;; ) · The next result considers several linear combinations of a multivariate normal vector X. Result 4.3. If X is distributed as NP (p, I) , the linear combinations •
q
A X
(q X p ) ( p X 1 )
[:: �: : : : : : :� j
aq1 x1 + . . . + a q p xp Nq (Ap, AIA' ). X + (p x 1 ) (p x 1 ) Np ( l-t + I).
are distributed as Also, d , where d is a vector of constants, is distributed as d, Proof. The expected value and the covariance matrix of AX follow from (2-45). Any linear combinationE(AX) b' (AX) is a linear combination of X, of the form a'X with a = A'b . Thus, the conclusion concerning AX follows directly from Result 4.2. The second part of the result can be obtained by considering a' (X + d) = a'X + (a' d) , where a'X is distributed as N(a' p, a' I a). It is known from the uni variate case that adding a constant a'd to the random variable a'X leaves the vari ance unchanged and translates the mean to a' + a' d = a' (p + d) . Since a was arbitrary, X + d is distributed as Np ( + d, I). f.L
P
Example 4.4 (The distribution of two linear combinations of the components of a normal random vector)
For X distributed as N3 (p, I), find the distribution of -1 1
By Result 4.3, the distribution of AX is multivariate normal with mean
•
Sec. 4.2 The Multivariate Normal Density and its Properties
1 67
and covariance matrix AIA'
=
[� � -
J
az 3 - azz - a13 azz - 2az 3 + a33 Alternatively, the mean vector Ap, and covariance matrix AIA' may be ver al 2 +
ified by direct calculation of the means and covariances of the two random variables Y1 = X1 - X2 and Y2 = X2 - X3 • We have mentioned that all subsets of a multivariate normal random vector X are themselves normally distributed. We state this property formally as Result 4.4. Result 4.4. All subsets of X are normally distributed. If we respectively par tition X, its mean vector p,, and its covariance matrix I as •
X (p x t )
=
and I
l
(p Xp)
XI (q X I)
l
JL
------------
x2
((p - q) X l )
=
l
(p X I)
(q X q)
i
I2 1
:
I 11
------------
! �
=
l l l (q X I) ·------------
1)
I 12
(q X (p - q))
-----------------
I 22
((p - q) X q) ! ((p - q) X (p - q)) '
JL z
((p -ILq)t X
then X 1 is distributed as Nq (p, 1 , I 1 ). ] in Result 4.3, and the conclusion follows. Proof. Set A = [ I l 0 (q Xp) (q x q) : (q X (p - q)) To apply Result 4.4 to an arbitrary subset of the components of X, we simply relabel the subset of interest as X 1 and select the corresponding component means and covariances as p, 1 and I 1 1 , respectively. • '
1
1 68
Chap. 4 The Multivariate Normal Distribution
[ �: ] . We set
(The distribution of a subset of a normal random vector)
Example 4.5
If X is distributed as N5 ( p, I), find the distribution of
and note that with this assignment, X, rearranged and partitioned as
Xz x4 XI
X =
x3
Xs
or X=
p
=
[-'��'] ,
J.Lz Jl.4 IL 1 Jl. 3 J.Ls
'
(3 X 1 )
Thus, from Result 4.4, for
we have the distribution = Nz
I=
p,
and
I can respectively be U1 2 Uz 3 Uzs U1 4 U3 4---- U---45 ul l u1 3 Ut s u1 3 U3 3 U3 5 U1 5 U3 5 Uss
I I I I I I I I --------------r-----------
Uzz Uz4 Ut z Uz 3 Uzs
Uz4 U44 U1 4 U3 4 U45
[
I I I I I I I I I I I
I1 1 i I1 2
(2 X 2)
-
J
------ t! (2 X 3)-I 2 1 ! I zz (3 X 2) ! (3 X 3) -----
-
( [ J.LzJl.4 ] , [ UUz4zz UU44z4 ] )
I
-
It is clear from this example that the normal distribution for any subset can be expressed by simply selecting the appropriate means and covariances from the original p and I. The formal process of relabeling and partitioning is • unnecessary. We are now in a position to state that zero correlation between normal random variables or sets of normal random variables is equivalent to statistical independence. Result 4.5.
a q1 X
q2
( a ) If X 1 and X 2 are independen t, then Cov (X 1 , X 2 ) = 0, (q 1 X I )
matrix of zeros.
( q2 X I )
Sec. 4.2 The Multivariate Normal Density and its Properties
(b) If
[-�;-] is Nq, + q, ( t��-] , t�!�+�;;-] ) , then X i and X 2 are independent if 1 69
[�;] has multivariate normal distribution ] [-��d--�--J q , + q, ( [}�!_ I 22 ) ILz
and only if I 1 2 = 0. (c) If X 1 and X 2 are independent and are distributed as
Nq , (JL 1 , I 1 1 ) and
Nq, (p. 2 , I 22 ), respectively, then N
'
0' !
(See Exercise 4.14 for partial proofs based upon factoring the den• sity function when I 12 = 0.) Proof.
Example 4.6 (The equivalence of zero covariance and independence for normal variables)
Let X be N3 (p., I) with (3 X i )
Are X1 and X2 independent? What about (X1 , X2 ) and X3 ? Since Xi and X2 have covariance u1 2 = 1 , they are not independent. However, partitioning X and I as
I �
[i- -WJ
=
[ ] I1 1 (2 X 2) I2 1 ( i X 2)
-------
i I1 2
! (2 X 1 )
+- - - - - - ·
! I 22 ! (1 X I)
[�J and X3 have covariance matrix I 1 2 [ � ] . Therefore, (X1 , X ) and X3 are independent by Result 4.5. This implies X3 is indepen
we see that X 1 =
2 dent of Xi and also of X2 •
I
=
•
We pointed out in our discussion of the bivariate normal distribution that = 0 (zero correlation) implied independence because the joint density function 1 2 p [see (4-6)] could then be written as the product of the marginal (normal) densities of X1 and X2 • This fact, which we encouraged you to verify directly, is simply a spe cial case of Result 4.5 with q1 = q2 = 1.
1 70
[-�!-] be distributed as Np (IL, I) with IL [-�;],
Chap. 4 The Multivariate Normal Distribution
[-�U-�-�l?.J , Result 4.6.
Let
X
=
=
I = I I and I I22 l > 0. Then the conditional distribution of X 1 , given 21 : 22 that X 2 = x 2 , is normal and has and
Note that the covariance does not depend on the value x 2 of the conditioning variable. Proof. We shall give an indirect proof. (See Exercise 4.13, which uses the
densities directly.) Take
A (p Xp )
=
so
is jointly normal with covariance matrix AIA' given by
Since X 1 - IL 1 - I 12 I21 (X 2 - IL 2 ) and X 2 - IL 2 have zero covariance, they are independent. Moreover, the quantity X 1 - IL1 - I 12 I2} (X 2 - IL 2 ) has distribution Nq (O, I 1 1 - I 12 I2} I21 ). Given that X 2 = x 2 , IL 1 + I 12 I2} (x 2 - IL 2 ) is a constant. Because X 1 - IL 1 - I 12 I2} (X 2 - IL 2 ) and X 2 - IL 2 are independent, the condi tional distribution of X 1 - IL 1 - I 12 I2} (x 2 - IL 2 ) is the same as the unconditional distribution of X 1 - 1L 1 - I 12 I2} (X 2 - IL 2). Since X 1 - 1L 1 - I 12 I2} (X 2 - 1L2 ) is Nq (O, I 11 - I 12 I221 I21 ), so is the random vector X 1 - IL 1 - I 12 I2} (x 2 - IL 2 ) when X 2 has the particular value x 2 . Equivalently, given that X 2 = x 2 , X 1 is • distributed as Nq (IL 1 + I 12 I2} (x 2 - IL 2), I u - I 12 I2l i21 ).
Sec.
4.2
The Multivariate Normal Density and its Properties
1 71
Example 4. 7
(The conditional density of a bivariate normal distribution) The conditional density of X1 , given that X2 = x2 for any bivariate distribu
tion, is defined by
f(x1 i x2 ) = { conditional density of X1 given that X2 = x2 } where f (x2 ) is the marginal distribution of X2 • If f(x1 , x 2 ) is the bivariate nor mal density, show that f(x1 l x2 ) is
Here a1 1 - a�2 / a22 = a1 1 ( 1 - p f2 ) . The two terms involving x1 - IL l in the exponent of the bivariate normal density [see Equation ( 4-6)] become, apart from the multiplicative constant - 1 /2 ( 1 - p f 2 ) , 2 (x (xl - IL 1 ) IL 1 ) (x2 - IL 2 ) 2P 1 2 l - � al l v al l v� a22 .
.
Because p 1 2 = a1 2 /va;; \f'Ci;, or p 1 2 va;;; va=;; = a1 2 / a22 , the complete exponent is
=
(
a1 2 -1 2) x l - ILl (x 2au ( 1 - P�2 ) a22 z - IL
) 2 - ! (xz - IL 2 ) 2 2
The constant term 27TVa1 1 a22 (1 - p�2 ) also factors as Th \I'Ci; X Th Ya1 1 ( 1 - pf2 ) Dividing the joint density of X1 and X2 by the marginal density
azz
1 72
Chap. 4 The M ultivariate Normal Distribution
and canceling terms yields the conditional density
2 2 1 :: e-(Xt - I£! - (UJ2/U22}(X2 - I£2}j /2 Ut t ( l - pt 2) ---;=�====:= ' vz; v'0"1 1 (1 - p f2 )
- 00
<
Xl
<
00
Thus, with our customary notation, the conditional distribution of X1 given that X2 = x2 is N( p., 1 + ( u1 2 / u22 ) (x2 - p., 2 ) , u1 1 (1 - p f2 ) ) . N ow, I u - I 1 2 I 2i i 2 1 = 0"1 1 - uf2 / u22 = 0"1 1 (1 - P f2 ) and I 1 2 I 2i = o-1 2 / o-22 • agreeing with Result 4.6, which we obtained by an indirect method. • For the multivariate normal situation, it is worth emphasizing the following: 1. All conditional distributions are (multivariate) normal. The conditional mean is of the form
2.
ILq + f3q, q + 1 (xq + 1 - ILq + 1 ) + · · · + f3 q, p (xp - ILp ) where the (3 ' s are defined by
I 12 I22- 1 _
[
(3 1, q + 1 (3 1, q + 2 . . . f3 1 , p f3z, q + 1 f32, q + 2 .· · · f3z, p :
f3q, q + l
:
.
..
:
.
]
(4-9)
f3q, q + 2 . . . f3q, p 3. The conditional covariance, I 1 1 - I 1 2 I 2i I 2 1 , does not depend upon the value(s) of the conditioning variable(s).
We conclude this section by presenting two final properties of multivariate normal random vectors. One has to do with the probability content of the ellip soids of constant density. The other discusses the distribution of another form of linear combinations. The chi-square distribution determines the variability of the sample variance s 2 = s1 1 for samples from a univariate normal population. It also plays a basic role in the multivariate case. Let X be distributed as Np (JL, I) with I I I > 0. Then: 1 (a) (X - p.)' I (X - p.) is distributed as x; , where x; denotes the chi-square distribution with p degrees of freedom. (b) The Np (JL, I) distribution assigns probability 1 - a to the solid ellipsoid { x: (x - p.) ' I - 1 (x - p.) � x; (a) } , where x; (a) denotes the upper (100a)th percentile of the x; distribution. Result 4.7.
Sec.
4.2
The Multivariate Normal Density and its Properties
1 73
Proof. We know that x� is defined as the distribution of the sum Zf + Zi + . . . + Z� , where Z1 , Z2 , . . . , ZP are independent N(O, 1) random vari
ables. Next, by the spectral decomposition [see Equations (2-16) and (2-21) with p A = I, and see Result 4.1], I - 1 = A1 e;e; , where Ie; = ..\;e;, so I - 1 e; = i= 1 ; p (1/..\;)e;. Consequently, (X - p)' I - 1 (X - p) = i 1 (1/A;) (X - p)' e;e; ( x - p) = = p p p (1/ ..\ ;) (e; ( x - JL) ? = [(1/ '\f'A;) e; (X - p)f = Zr , for instance. Now,
�-
�
�
i= 1
Z = A (X - p), where
�
�
i= 1
i= 1
1 e 1 VA, A1 1 e \!'A:: z A I
z = (p X 1 )
[l:J
A (p xp)
=
z
VA.: 1
I
I
e AP P
and X - IL is distributed as Np (O, I). Therefore, by Result 4.3, Z = distributed as Np (O, AIA'), where
A(X - p) is
A I A' =
(p X p) (p Xp) (p Xp )
By Result 4.5, Z1 , Z 2 , . . . , ZP are independent standard normal variables, and we conclude that (X - p)' I - 1 (X - p) has a x� -distribution. For Part b, we note that P[(X - p)' I - 1 (X - p) � c 2] is the probability assigned to the ellipsoid (x - p)' I - 1 (x - p) � c 2 by the density Np ( p , I). But from Part a, P[(X - p)' I - 1 (X - p) � x� (a)] = 1 - a, and Part b holds. •
1 74
Chap. 4 The Multivariate Normal Distribution Remark:
(I nterpretation of statistical distance)
Result 4.7 provides an interpretation of a squared statistical distance. When distributed as Np (JL, I),
X is
is the squared statistical distance from X to the population mean vector JL · If one component has a much larger variance than another, it will contribute less to the squared distance. Moreover, two highly correlated random variables will contribute less than two variables that are nearly uncorrelated. Essentially, the use of the inverse of the covariance matrix, (1) standardizes all of the variables and (2) elim inates the effects of correlation. From the proof of Result 4.7,
(X - JL)' I -1 (X - JL)
In terms of I - � (see (2-22)), Z
I- l (x
= Zr + Zi +
· · ·
+ z;
- JL) has a Np (O, I P ) distribution, and (X - JL) ' I - 1 (X - JL) = (X - JL) ' I - � I - � (X - JL) = Z'Z = Z12 + Z22 + . . . + Zp2 The squared statistical distance is calculated as if, first, the random vector X were =
transformed to p independent standard normal random variables and then the usual squared distance, the sum of the squares of the variables, were applied. Next, consider the linear combination of vector random variables
c ! Xl
+
Cz X z
+
. . .
+
cn x n
=
[X l i X z
(p X n )
. . . ! X nJ c (n X 1 )
(4-10)
This linear combination differs from the linear combinations considered earlier in that it defines a p X 1 vector random variable that is a linear combination of vec tors. Previously, we discussed a single random variable that could be written as a linear combination of other univariate random variables. Let X 1 , X 2 , , X n be mutually independent with X i distributed I). (Note that each Xi has the same covariance matrix I.) Then v l = c l X l + Cz X z + . . . + cn X n
Result 4.8.
as Np (JLi ,
is distributed as NP · · ·
+
• • .
(�j = l ci JLi , (�j = l cl ) I ) . Moreover V1 and V2 = b 1 X 1 + b2 X 2 +
bn X n are jointly multivariate normal with covariance matrix
n
Consequently, V 1 and V2 are independent if b'c = � ci bi = 0. j= l
Sec.
4. 3
Sampling from a Multivariate Normal Distribution
1 77
For the second linear combination of random vectors, we apply Result 4.8 with b 1 = b 2 = b3 = 1 an� b4 = -3 to get mean vector
(b, + b, + b, + b, ) p
�
Op �
and covariance n:atrix
(b f + b i + b� + b�) I
= 12 X
I
=
[
[�]
36 - 12 12 - 12 12 0 0 24 12
]
Finally, the covariance matrix for the two linear combinations of random vec tors is
Every component of the first linear combination of random vectors has zero covariance with every component of the second linear combination of random vectors. If, in addition, each X has a trivariate normal distribution, then the two linear combinations have a joint six-variate normal distribution, and the two linear combinations of vectors are independent. •

4.3 SAMPLING FROM A MULTIVARIATE NORMAL DISTRIBUTION AND MAXIMUM LIKELIHOOD ESTIMATION
We discussed sampling and simple random samples briefly in Chapter 3. In this section, we shall be concerned with samples from a multivariate normal population, in particular, with the sampling distribution of X̄ and S.

The Multivariate Normal Likelihood
Let us assume that the p × 1 vectors X_1, X_2, ..., X_n represent a random sample from a multivariate normal population with mean vector μ and covariance matrix Σ. Since X_1, X_2, ..., X_n are mutually independent and each has distribution N_p(μ, Σ), the joint density function of all the observations is the product of the marginal normal densities:
1 78
Chap. 4 The M ultivariate Normal Distribution
=
( 27Ttp/2 I I J n/2 e 1
1
-
± (x; - �tn:- 1 (x; - �t)/2
j
ol
(4-11 )
When the numerical values of the observations become available, they may be substituted for the xi in Equation (4-11 ). The resulting expression, now consid ered as a function of p and I for the fixed set of observations x 1 , x 2 , . . . , X 11 , is called the likelihood. Many good statistical procedures employ values for the population parame ters that "best" explain the observed data. One meaning of best is to select the parameter values that maximize the joint density evaluated at the observations. This technique is called maximum likelihood estimation, and the maximizing para meter values are called maximum likelihood estimates. At this point, we shall consider maximum likelihood estimation of the para meters p and I for a multivariate normal population. To do so, we take the obser vations x 1 , x 2 , . . . , X11 as fixed and consider the joint density of Equation ( 4-1 1 ) evaluated at these values. The result is the likelihood function. I n order t o simplify matters, we rewrite the likelihood function in another form. We shall need some additional properties for the trace of a square matrix. (The trace of a matrix is the sum of its diagonal elements, and the properties of the trace are discussed in Def inition 2A.28 and Result 2A.12.) Result 4.9. Let A be a k X k symmetric matrix and x be a k X 1 vector. Then:
(a) x' Ax = tr (x' Ax) = tr (Axx') k (b) tr (A) = 2: A ; , where the A; are the eigenvalues of A. i=l
Proof. For Part a, we note that x' Ax is a scalar, so x' Ax = tr (x' Ax). We pointed out in Result 2A.12 that tr (BC) = tr (CB) for any two matrices B and k C of
dimensions m X
k and k X m, respectively. This follows because BC has ,L b;i cii
j= k ( ) as its ith diagonal element, so tr (BC) = � � b;i ci; . Similarly the jth diagonal k ( k ) ( = element of CB is � cii b;i' so tr (CB) = � � cii b;i � � b;i cii ) = tr (BC). 1
m
m
"'
m
Let x' be the matrix B with m = 1, and let Ax play the role of the matrix C. Then tr (x' (Ax)) = tr ((Ax) x'), and the result follows. Part b is proved by using the spectral decomposition of (2-20) to write A = P' AP, where PP' = I and A is a diagonal matrix with entries A 1 , A 2 , . . . , Ak . Therefore, tr (A) = tr (P' AP) = tr (APP' ) = tr (A) = A 1 + ..\ 2 + · · · + A k . •
Sec.
4.3
Sampling from a Multivariate Normal Distribution
1 79
Now the exponent in the joint density in ( 4-1 1 ) can be simplified. By Result 4.9(a),
(xi - p) 1 I-1 (xi - p) = tr [ (xi - p) 1 I - 1 (xi - p)] = tr [I- 1 (x i - p) (x i - p)1]
(4-12)
Next,
n
n
2: ( x - p) 1 I-1 (xi - p) = 2: tr [ (xi - p) 1 I-1 (xi - p)] j= l j= l i =
n
2: tr [I - 1 (xi - p) ( xi - p)1] j= l
since the trace of a sum of matrices is equal to the sum of the traces of the matri-
n
ces, according to Result 2A.12(b). We can add and subtract i = (1/n) 2: xi in j= l n each term (xi - p) in 2: (xi - p) (xi - p)1 to give j= l
n
2: (x i - i + i - p) (xi - i + i - p) 1
j= l
n
n
= 2: (xi - i ) (xi - i )1 + 2: ( i - p) ( i - p) 1 j= l j= l
n
= 2: (xi - i ) (xi - i ) 1 + n ( i - p) ( i - p) 1 j= l
n
(4-14)
n
because the cross-product terms, 2: (xi - i ) ( i - p) 1 and 2: ( i - p ) (xi - i ) 1, j= l j= l are both matrices of zeros. (See Exercise 4.15.) Consequently, using Equations (4-13) and (4-14), we can write the joint density of a random sample from a multi variate normal population as
{ Joint density of } - np /2 _ 2 X I , X 2 , . . . ' x n = (21r) l I 1 111 X exp { - tr [ I-1 ( � (xi - i ) (xi - i ) 1 + n (i - p) (i - p)1 ) J /2 }
(4-15)
1 80
Chap.
4
The M ultivariate Normal Distribution
..
Substituting the observed values x 1 , x 2 , . , X 11 into the joint density yields the like lihood function. We shall denote this function by L (p, I), to stress the fact that it is a function of the (unknown) population parameters p and I. Thus, when the vectors xj contain the specific numbers actually observed, we have
(4-16) It will be convenient in later sections of this book to express the exponent in the likelihood function (4-16) in different ways. In particular, we shall make use of the identity
[ (� (x - x ) (x - x ) 1 + n (x - p) (x - p ) 1 ) ] = tr [ I - 1 ( � ( xj - x ) ( xj - x ) ' ) ] + n tr [I - 1 ( x - p ) ( x - p) 1 ] = tr [ I- 1 ( � ( x - x ) ( x - x ) 1 ) ] + ( x - p) 1 I - I ( x - p) ( 4-17)
tr I - 1
j
j
j
j
Maximum Likelihood Estimation of
n
p and I
The next result will eventually allow us to obtain the maximum likelihood estima tors of p and I. Result 4. 1 0. Given a p X p symmetric positive definite matrix B and a scalar b > 0, it follows that
for all positive definite I , with equality holding only for I = (1/2b) B . (p Xp)
Proof. Let B 1 /2 be the symmetric square root of B [see Equation (2-22)], 1 so B i2B1 12 = B, B 1 /2 B - 1 12 = I, and B - I I2 B - 1 12 = B- 1 . Then tr (I - 1 B) = tr [(I - 1 B 1 12 ) B 1 12] = tr [B 1 12 (I - 1 B 1 12 )]. Let 71 be an eigenvalue of B 1 i2 I - 1 B 1 12 . This matrix is positive definite because y1B 1 i2 I - 1 B 1 12 y = (B 1 12 y) 1 I - 1 (B 1 12 y) > 0 if B 1 12 y ,e 0 or, equivalently, y ,e 0. Thus, the eigenvalues 7Ji of B 1 12 I - 1 B 1 12 are pos itive by Exercise 2.17. Result 4.9(b) then gives
tr (I - I B) = tr (B i i2 I - I B I /2 ) = 2: i= I p
7J i
Sec.
4.3
Sampling from a Multivariate Normal Distribution
1 81
p and I B 1 12 :I- 1 B 1 12 1 = II YJ i by Exercise 2.12. From the properties of determinants i= I in Result 2A.ll, we can write I B 1 /2 :I - 1 B I /2 1 = I B 1 /2 1 1 :I - 1 1 1 B l l2 1 = I :I - I I I B 1 /2 1 1 B l /2 1
�
I :I -l i i B I =
1 1
or
IBI
p
1 m=
I B I /2 :I - 1 B I /2 1 IBI
r
II TJi = i= l
lBf
Combining the results for the trace and the determinant yields
(v
TJi p ± 11J2 1 - tr (I _, Bl /2 e = � �� � e ' " ' = 1 i TJf e-71,/2 IB b I :I I b b But the function TJbe - 71 12 has a maximum, with respect to YJ, of (2b)b e -b, occurring at TJ = 2b. The choice TJi = 2b, for each i, therefore gives _1_ e- tr ( I -'B) /2 I :I I b
,c �
_
}]
_1_ (2b)Pb e - bp IBib
The upper bound is uniquely attained when :I = (1/2b )B, since, for this choice, B I /2 :I - 1 B l /2 = B l /2 (2b) B - 1 W/2 = (2b) I (p X p) and Moreover, I B I /2 :I - 1 B l /2 1 l (2b) I I (2b) P 1 __ IBI IBI I :I I IBI Straightforward substitution for tr [:I - 1 B] and 1/ I :I I b yields the bound asserted .
•
T_!le maximum likelihood estimates of p., and :I are those values-denot�d by
jL and :I-that maximize the function L (p.,, :I) in (4-16). The estimates jL and :I will
depend on the observed values x 1 , x 2 , . . . , x n through the summary statistics x and S.
1 82
Chap. 4 The Multivariate Normal Distribution Result 4. 1 1 . Let X 1 , X 2 , . . . , X n be a random sample from a normal popu lation with mean p and covariance I. Then
μ̂ = X̄     and     Σ̂ = (1/n) Σ_{j=1}^{n} (X_j − X̄)(X_j − X̄)' = ((n − 1)/n) S

are the maximum likelihood estimators of μ and Σ, respectively. Their observed values, x̄ and (1/n) Σ_{j=1}^{n} (x_j − x̄)(x_j − x̄)', are called the maximum likelihood estimates of μ and Σ.
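Before turning to the proof, here is a small numerical illustration of Result 4.11: the maximum likelihood estimates are just x̄ and the covariance matrix computed with divisor n rather than n − 1. The data below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = rng.normal(size=(n, p))                      # illustrative sample of size n

mu_hat = X.mean(axis=0)                          # maximum likelihood estimate of mu
S = np.cov(X, rowvar=False)                      # usual S, divisor n - 1
Sigma_hat = (n - 1) / n * S                      # maximum likelihood estimate of Sigma

# The same estimate computed directly from its definition.
dev = X - mu_hat
print(np.allclose(Sigma_hat, dev.T @ dev / n))   # True
```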
Proof. The exponent in the likelihood function [see Equation from the multiplicative factor - ! , is [see
(4-17)]
4.1,
(4-16)], apart 0
By Result I- 1 is positive definite, so the distance ( i - p) 'I - 1 ( x - p) > unless p = i. Thus, the likelihood is maximized with respect to p at = i. It remains to maximize
fo
n
4. 1 0 with b = n/2 and B = j2:= l (xj - i ) (xj - i ) ', the maximum n occurs at I" = (1/n) 2: (xj - i ) (xj - i ) ' , as stated. j= l The maximum likelihood estimators are random quantities. They are �btained by replacing the observations x 1 , x 2 , , xn in the expressions for fo and over I. By Result
• • •
I with the corresponding random vectors, X 1 , X 2 , . . . , X n .
•
We note that the maximum likelihood estimator X is a random vector and the maximum likelihood estimator :i is a random matrix. The maximum likelihood estimates are their particular values for the given data set. In addition, the maxi mum of the likelihood is
L(fo, I) cz7T1tp/2 e - np/2 _1_ I :i I n /2 or, since l :i l = [(n - 1)/nJP ISI, L (fo, I) = constant X (generalized variance) -n/2
(4-18) (4-19)
Sec.
4.3
Sampling from a Multivariate Normal Distribution
1 83
The generalized variance determines the "peakedness" of the likelihood function and, consequently, is a natural measure of variability when the parent population is multivariate normal. Maximum likelihood estimators possess an invariance property. Let 0 be the maximum likelihood estimator of 0, and consider estimating the parameter h ( O), which is a function of 0. Then the maximum likelihood estimate of
h (O) (See
(4-20)
h(o)
is given by
(same function of ii )
(a function of IJ)
[1] and [13]. ) For example:
1. The maximum likelihood estimator of p.' I - 1 p. is jl i - 1 jL , where jL = X and i = ((n are the maximum likelihood estimators of p. and I, respectively. 2. The maximum likelihood estimator of Vii;; is v'a;; , where
- 1)/n)S
A
aii
- -n1 j=�l (X;j - X ) 2 £.!
;
is the maximum likelihood estimator of U; ; = Var (X;). Sufficient Statistics
(4-15),
From expression the joint density depends on the whole set of observations x 1 , x 2 , . . . , x n only through the sample mean i and the sum-of-squares-and-cross-
n
products matrix � ( xj - i ) ( xj - i ) ' = saying that i and
j= 1
(n -
(n - 1)S. We express this fact by 1)S (or S) are sufficient statistics:
The importance of sufficient statistics for normal populations is that all of the information about p. and I in the data matrix X is contained in i and regardless of the sample size n. This generally is not true for nonnormal populations. Since many multivariate techniques begin with sample means and covariances, it is pru dent to check on the adequacy of the multivariate normal assumption. (See Section If the data cannot be regarded as multivariate normal, techniques that depend solely on i and may be ignoring other useful sample information.
S,
4. 6 . )
S
1 84
Chap. 4 The Multivariate Normal Distribution
4.4 THE SAMPLING DISTRIBUTION OF X AND S
The tentative assumption that X1 , X 2 , . . . , X, constitute a random sample from a normal population with mean p, and covariance I completely determines the sam pling distributions of X and S. Here we present the results on the sampling distri butions of X and S by drawing a parallel with the familiar up.ivariate conclusions. In the univariate case (p = 1), we know that X is normal with mean J.t = (population mean) and variance
-1 (}"2 = population variance
..__..__ _ _____
n
sample size
2)
The result for the multivariate case (p � is analogous in that X has a normal dis tribution with mean p, and covariance matrix (1/n) I. "
(n - 1 ) s 2 = 2: ( Xj - X ) 2 is distribj= l uted as � times a chi-square variable having n - 1 degrees of freedom (d.f.). In For the sample variance, recall that
turn, this chi-square is the distribution of a sum of squares of independent stan dard normal random variables. That is, (n - 1 ) s 2 is distributed as u2 (Zf + + Z,� _ 1 ) = (uZ1 ) 2 + + (uZ11 _ 1 f The individual terms uZi are independently distributed as N(O, u2). It is this latter form that is suitably gener alized to the basic sampling distribution for the sample covariance matrix. The sampling distribution of the sample covariance matrix is called the Wishart distribution, after its discoverer; it is defined as the sum of independent products of multivariate normal random vectors. Specifically, · · ·
· · ·
Wm ( · I I)
= Wishart distribution with m d.f.
= distribution of
m
2: zj z; j= l
(4-22)
where the Zj are each independently distributed as NP (O, I). We summarize the sampling distribution results as follows: '" ,; Let X1 , X2 , , X, be a random s&mple of size n fro:ID a p-variate normal .dis tribution witll mean p �nd covariance mi}trix I. Then: . • •
1. X is distributed as'Np (p, (l/n) I). 2. (n - l)S is distributed as a Wishart random matrix with, n - l:a.L 3. X and S are indepenaent,
Sec.
4.5
Large-Sample Behavior of X and
S
1 85
Because I is unknown, the distribution of X cannot be used directly to make inferences about p,. However, S provides independent information about I, and the distribution of S does not depend on p,. This allows us to construct a statistic for making inferences about p,, as we shall see in Chapter 5. For the present, we record some further results from multivariable distribu tion theory. The following properties of the Wishart distribution are derived directly from its definition as a sum of the independent products, ziz; . Proofs can be found in [1]. Properties of t he Wishart Distribution
1. If A 1 is distributed as Wm 1 (A 1 I I) independently of A 2 , which is distributed as wm2 (A z i i) , then A I + A 2 is distributed as wm l +m2 ( A l + A z i i). That is, the degrees of freedom add. (4-24 ) 2. If A is distributed as W111 ( A I I), then CA C' is distributed as W111 (CA C' I CIC').
Although we do not have any particular need for the probability density func Hon of the Wishart distribution, it may be of some interest to see its rather com plicated form. The density does not exist unless the sample size is greater than the number of variables p. When it does exist, its value at the positive definite matrix A is
n
wn - I ( A I I) =
where
1 A 1 (n
- p - 2 )/ 2e - tr[A l: - 1 ]! 2 ----'----'--- ----=p=----2 P ( n - I ) / 2 1T p (p - I )/4 I <" - I )/ 2
1 I
IT
f ( ! {n - i)) i= l
A positive definite (4-25)
f(-) is the gamma function. (See [1].)
4.5 LARGE-SAMPLE BEHAVIOR OF
X AND S
Suppose the quantity X is determined by a large number of independent causes V1 , V2 , , V11 , where the random variables V; representing the causes have approxi mately the same variability. If X is the sum • . .
X = V1 + V2 + · · · + V" then the central limit theorem applies, and we conclude that X has a distribution which is nearly normal. This is true for virtually any parent distribution of the V; ' s, provided that is large enough. The univariate central limit theorem also tells us that the sampling distribu tion of the sample mean, X, for a large sample size is nearly normal, whatever the form of the underlying population distribution. A similar result holds for many other important univariate statistics.
n
1 86
Chap. 4 The Multivariate Normal Distribution
It turns out that certain multivariate statistics, like X and S, have large-sam ple properties analogous to their univariate counterparts. As the sample size is increased without bound, certain regularities govern the sampling variation in X and S, irrespective of the form of the parent population. Therefore, the conclusions presented in this section do not require multivariate normal populations. The only requirements are that the parent population, whatever its form, have a mean fL and a finite covariance I. Result 4. 1 2 (law of large numbers) . Let Y1 , Y2 , , Yn be independent observations from a population with mean E ( Y;) = J-t. Then . . •
y
=
Y1 + Y2 +
n
· · · + Yn
converges in probability to J-t as n increases without bound. That is, for any pre scribed accuracy e > 0, P [ - e < Y - J-t < e ] approaches unity as n � oo . •
Proof. See [9].
As a direct consequence of the law of large numbers, which says that each Xi converges in probability to J-t i , i = 1 , 2, . . . , p,
X converges in probability to fL
(4-26)
i = Sn ) converges in probability to I
(4-27)
Also, each sample covariance sik converges in probability to O"ik ' i, k = 1 , 2, . . . , p, and S ( or
Statement (4-27) follows from writing n (n
(�i - Xi ) (� k - Xk ) - 1 ) sik = 2: j= i II
= 2: (Xj i - J-t i + ILi - Xi) (Xj k - ILk + f.-tk - Xk ) j=i n
= 2: (�i - J-t J (Xj k - f.-tk) + n (Xi - J-t J (Xk - f.-t k ) j= i Letting Jj = (�i - J-t i) (�k - ILk ) , with E ( Y) = O"ik ' we see that the first term in sik converges to O"ik and the second term converges to zero, by applying the law
of large numbers. The practical interpretation of statements (4-26) and ( 4-27) is that, with high probability, X will be close to /L and S will be close to I whenever the sample size is large. The statement concerning X is made even more precise by a multivariate version of the central limit theorem.
Sec. 4.5 Large-Sample Behavior of X and
S
1 87
Result 4. 1 3 (The central limit theorem). Let X 1 , X 2 , . . . , Xn be indepen dent observations from any population with mean f,1, and finite covariance I. Then
Vn ( X - f,l,) has an approximate NP (O, I) distribution
for large sample sizes. Here n should also be large relative to p. Proof. See [1].
•
The approximation provided by the central limit theorem applies to discrete, as well as continuous, multivariate populations. Mathematically, the limit is exact, and the approach to norma_!!ty is often fairly rapid. Moreover, from the results in Section 4.4, we know that X is exactly normally distributed when the underlying population is normal. Thus, we would expect the central limit theorem approxima tion to be quite good for moderate n when the parent population is nearly normal. As we have seen, when n is large, S is close to I with high probability. Con sequently, replacing I by S in the approximating normal distribution for X will have a negligible effect on subsequent probability calculations. Result 4.7 can be used to show that n ( X - f,l,) ' I - 1 ( X - f,l,) has a xi distribu-
( � )
tion when X is distributed as NP f,l,, I or, equivalently, when
Vn (X - f,1,) has
an NP (O, I)jistribution. '!}1e xi distribu�n is approximately the sampling distrib ution of n ( X - f,l,) ' I - 1 ( X - f,l,) when X is approximately normally distributed. Replacing I - 1 by s-t does not seriously affect this approximation for n large and much greater than p. We summarize the major conclusions of this section as follows:
In the next three sections, we consider ways of verifying the assumption of normality and methods for transforming nonnormal observations into observations that are approximately normal.
1 88
Chap. 4 The Multivariate Normal Distribution
4.6 ASSESSING THE ASSUMPTION OF NORMALITY I
As we have pointed out, most of the statistical techniques discussed in subsequent chapters assume that each vector observation Xj comes from a multivariate normal distribution. On the other hand, in situations where the sample size is large and the techniques depend solely on the behavior of X, or distances involving X of the form n ( X - p)'S- 1 ( X - p), the assumption of normality for the individual obser vations is less crucial. But to some degree, the quality of inferences made by these methods depends on how closely the true parent population resembles the multi variate normal form. It is imperative, then, that procedures exist for detecting cases where the data exhibit moderate to extreme departures from what is expected under multivariate normality. We want to answer this question: Do the observations Xj appear to violate the assumption that they came from a normal population? Based on the properties of normal distributions, we know that all linear combinations of normal variables are normal and the contours of the multivariate normal density are ellipsoids. Therefore, we address these questions: 1. Do the marginal distributions of the elements of X appear to be normal? What about a few linear combinations of the components X;? 2. Do the scatter plots of pairs of observations on different characteristics give the elliptical appearance expected from normal populations? 3. Are there any "wild" observations that should be checked for accuracy?
It will become clear that our investigations of normality will concentrate on the behavior of the observations in one or two dimensions (for example, marginal distributions and scatter plots). As might be expected, it has proved difficult to con struct a "good" overall test of joint normality in more than two dimensions because of the large number of things that can go wrong. To some extent, we must pay a price for concentrating on univariate and bivariate examinations of normality: We can never be sure that we have not missed some feature that is revealed only in higher dimensions. (It is possible, for example, to construct a nonnormal bivariate distribution with normal marginals. [See Exercise 4.8.]) Yet many types of non normality are often reflected in the marginal distributions and scatter plots. More over, for most practical work, one-dimensional and two-dimensional investigations are ordinarily sufficient. Fortunately, pathological data sets that are normal in lower dimensional representations, but nonnormal in higher dimensions, are not frequently encountered in practice. Evaluating the Normality of the Univariate Marginal Distributions
Dot diagrams for smaller n and histograms for n > 25 or so help reveal situations where one tail of a univariate distribution is much longer than the other. If the his togram for a variable X; appears reasonably symmetric, we can check further by
Sec.
4.6
Assessing the Assumption of Normality
1 89
counting the number of observations in certain intervals. A univariate normal dis tribution assigns probability .683 to the interval ( p, ; - � , IL ; + � ) and prob ability .954 to the interval ( P, ; - 2 � , IL ; + 2 � ) . Consequently, with a large sample size n, we expect the observed proportion P; 1 of the observations lying in the interval ( .X; - YS;; , x; + Vi;; ) to be about .683. Similarly, the observed pro portion P ; z of the observations in ( .X; - 2 v'i;; , .X; + 2 � ) should be about .954. Using the normal approximation to the sampling distribution of P; (see [9]), we observe that either (.683) ( .317) n
-'----� --' -..!...
lfii l - .683 1 > 3
=
1 .396
Vn
or (4-29) would indicate departures from an assumed normal distribution for the ith char acteristic. When the observed proportions are too small, parent distributions with thicker tails than the normal are suggested. Plots are always useful devices in any data analysis. Special plots called Q-Q plots can be used to assess the assumption of normality. These plots can be made for the marginal distributions of the sample observations on each variable. They are, in effect, plots of the sample quantile versus the quantile one would expect to observe if the observations actually were normally distributed. When the points lie very nearly along a straight line, the normality assumption remains tenable. Nor mality is suspect if the points deviate from a straight line. Moreover, the pattern of the deviations can provide clues about the nature of the nonnormality. Once the reasons for the nonnormality are identified, corrective action is often possible. (See Section 4.8.) To simplify notation, let x 1 , x 2 , , x, represent n observations on any single characteristic X; . Let x( l l :s;; x<2 l :s;; :s;; x<,l represent these observations after they are ordered according to magnitude. For example, x (z ) is the second smallest observation and x<, l is the largest observation. The x< n ' s are the sample quantiles. When the x< n are distinct, exactly j observations are less than or equal to x< n . (This is theoretically always true when the observations are of the continuous type, which we usually assume.) The proportion jIn of the sample at or to the left of xu> is often approximated by U - D in for analytical convenience. 1 For a standard normal distribution, the quantiles q
· · ·
P [Z
:s;;
q< n ]
=
(fq j) _,
\,12:; e - z'l
1
2dz
=
p (j )
=
�
j -
l
(4-30)
1 The i in the numerator of (j Din is a "continuity" correction. Some authors (see [5] and [10]) have suggested replacing u Din by ( j �)l (n + n. -
-
-
1 90
Chap. 4 The M ultivariate Normal Distribution
(See Table 1 in the appendix.) Here P u ) is the probability of getting a value less than or equal to q U ) in a single drawing from a standard normal population. The idea is to look at the pairs of quantiles (q
(Constructing a Q-Q plot)
A sample of n = 10 observations gives the values in the following table: Ordered observations xu >
- 1.00 - .10 .16 .41 .62 .80 1.26 1.54 1.71 2.30
Probability levels (j - ! } /n
Standard normal quantiles q (i)
.05 .15 .25 .35 .45 .55 .65 .75 .85 .95
- 1.645 - 1 .036 - .674 - .385 - .125 .125 .385 .674 1.036 1.645
3 . 85 .385] = J
1 - z 212 e dz = .65. [See (4-30).] V 27T' Let us now construct the Q-Q plot and comment on its appearance. The Q-Q plot for the foregoing data, which is a plot of the ordered data x(j) against the normal quantiles q(i l ' is shown in Figure 4.5. The pairs of points
Here, for example, P [Z �
•
r.:;-
- oo
X I})
2
•
-2 •
!I
•
•
-
I
0
t
•
•
•
•
2
q(j)
Figure 4.5
Example 4.9.
A Q-Q plot for the data in
2 A better procedure is to plot (m
Sec.
4.6
Assessing the Assumption of Normality
1 91
(q
The calculations required for Q-Q plots are easily programmed for electronic computers. Many statistical programs available commercially are capable of pro ducing such plots. The steps leading to a Q-Q plot are: 1. Order the original observations to get x( l ) , x <2 > , . . . , x (n ) and their correspond ing probability values (1 - D/n, (2 - !}/n, . . . , (n - ! ) jn; 2. Calculate the standard normal quantiles q ( l ) • q <2 > , . . . , q ( n ) ; and 3. Plot the pairs of observations (q ( l ) • x ( l ) ) , (q (2 ) , x (2 ) ) , . . . , (q (n ) • x (n ) ) , and examine the "straightness" of the outcome.
Q-Q plots are not particularly informative unless the sample size is moder ate to large-for instance, n ;;;;. 20. There can be quite a bit of variability in the straightness of the Q-Q plot for small samples, even when the observations are known to come from a normal population. Example 4. 1 0
(A Q-Q plot for radiation data)
The quality-control department of a manufacturer of microwave ovens is required by the federal government to monitor the amount of radiation emit ted when the doors of the ovens are closed. Observations of the radiation emitted through closed doors of n = 42 randomly selected ovens were made. The data are listed in Table 4.1 on page 191. In order to determine the probability of exceeding a prespecified toler ance level, a probability distribution for the radiation emitted was needed. Can we regard the observations here as being normally distributed? A computer was used to assemble the pairs (q
1 92
Chap. 4 The M ultivariate Normal Distribution TABLE 4. 1
RADIATION DATA (DOOR C LOS ED)
Oven no.
Radiation
Oven no.
Radiation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
.15 .09 .18 .10 .05 .12 .08 .05 .08 .10 .07 .02 .01 .10 .10
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
.10 .02 .10 .01 .40 .10 .05 .03 .05 .15 .10 .15 .09 .08 .18
Oven no.
Radiation
31 32 33 34 35 36 37 38 39 40 41 42
.10 .20 .11 .30 .02 .20 .20 .30 .30 .40 .30 .05
Source: Data courtesy of J. D. Cryer.
-2.0
- 1 .0
.0
1 .0
2.0
3.0
Figure 4.6 A Q-Q plot of the radiation data (door closed) from Example 4. 1 0. (The integers in the plot indicate the n u m ber of points occupying the same location.)
Sec.
4.6
Assessing the Assumption of Normality
1 93
II
x n - x ) ( qu l - q ) jL =l ( < rQ = - " / 2 /" ( qu l - 7i ? (x (j ) - x )
\j i�
(4-31 )
\j i�
and a powerful test of normality can b e based o n it. (See [5], [10] , and [ 1 1 ].) For mally, we reject the hypothesis of normality at level of significance a if rQ falls below the appropriate value in Table 4.2. FOR TH E Q-Q PLOT CORRELATI O N COEFFICI ENT TEST FOR NORMALITY
TABLE 4.2 CRITICAL POI NTS
Significance levels a
Sample size
Example 4. 1 1
n
.01
.05
.10
5 10 15 20 25 30 35 40 45 50 55 60 75 100 150 200 300
.8299 .8801 .9126 .9269 .9410 .9479 .9538 .9599 .9632 .9671 .9695 .9720 .9771 .9822 .9879 .9905 .9935
.8788 .9198 .9389 .9508 .9591 .9652 .9682 .9726 .9749 .9768 .9787 .9801 .9838 .9873 .9913 .9931 .9953
.9032 .9351 .9503 .9604 .9665 .9715 .9740 .9771 .9792 .9809 .9822 .9836 .9866 .9895 .9928 .9942 .9960
(A correlation coefficient test for normality)
Let us calculate the correlation coefficient rQ from the Q-Q plot of Example 4.9 (see Figure 4.5) and test for normality. Using the information from Example 4.9, we have x = .770 and w
10
w
L cx(j ) - x ) qc n = 8.584, L Cxu l - x ) 2 = 8.472, and L q fn = 8.795 j= l j=l j=l
1 94
Chap. 4 The Multivariate Normal Distribution
Since q = 0,
rQ
8.584
= \18.472 V8.79s = .994
A test of normality at the 10% level of significance is provided by referring
rQ = .994 to the entry in Table 4.2 corresponding to n = 10 and a = .10. This entry is .9351. Since rQ > .9351 , we do not reject the hypothesis of normality . •
Instead of rQ , some software packages evaluate the original statistic proposed by Shapiro and Wilk [ 11 ] . Its correlation form corresponds to replacing q (j ) by a function of the expected value of standard normal-order statistics and their covari ances. We prefer rQ because it corresponds directly to the points in the normal scores plot. For large sample sizes, the two statistics are nearly the same (see [ 12] ), so either can be used to judge lack of fit. Linear combinations of more than one characteristic can be investigated. Many statisticians suggest plotting e � xi where S e l = A l e ! in which A 1 is the largest eigenvalue of S. Here x; = [xj l , xi 2 , . . . , xiP ] is the jth observation on the p variables X1 , X2 , . . . , XP . The linear combination e;xi corre sponding to the smallest eigenvalue is also frequently singled out for inspection. (See Chapter 8 and [6] for further details.) Evaluating Bivariate Normality
We would like to check on the assumption of normality for all distributions of 2, 3, . . . , p dimensions. However, as we have pointed out, for practical work it is usually sufficient to investigate the univariate and bivariate distributions. We con sidered univariate marginal distributions earlier. It is now of interest to examine the bivariate case. In Chapter 1, we described scatter plots for pairs of characteristics. If the observations were generated from a multivariate normal distribution, each bivari ate distribution would be normal, and the contours of constant density would be ellipses. The scatter plot should conform to this structure by exhibiting an overall pattern that is nearly elliptical. Moreover, by Result 4.7, the set of bivariate outcomes x such that
(x - p, ) 'I - 1 (x - p, ) � xi (.S)
has probability .5. Thus, we should expect roughly the same percentage, 50%, of sample observations to lie in the ellipse given by { all X SUCh that ( X - x:)' S -1 (x - x ) � Xi (. S ) } where we have replaced p by its estimate x and I - 1 by its estimate s - 1 . I f not, the normality assumption is suspect.
Sec. Example 4. 1 2
4.6
Assessing the Assumption of Normality
1 95
(Checking bivariate normality)
Although not a random sample, data consisting of the pairs of observations (x 1 = sales, x 2 = profits) for the 10 largest U.S. industrial corporations are listed in Exercise 1.4. These data give -
X
=
[ 62,309 ] ' 2927
s =
[ 10,005.20
255.76
255.76 14.30
so 8 _1 =
1 77,661.18
[
[ - .000184 .003293
=
' [ .000184 [Xxl -- 62,309 ] - .003293 2927
14.30 - 255.76 - 255.76 10,005.20 - .003293 .128831
]
J X 10_ 5
J X 105
x
10 _ 5
] ] [ xXI -- 62,309 X 10 _ 5 2927
From Table 3 in the appendix, x� (.5) = 1.39. Thus, any observation
x ' = [x 1 , x 2 ] satisfying
2
- .003293 .128831
2
�
1 .39
is on or inside the estimated 50% contour. Otherwise the observation is out side this contour. The first pair of observations in Exercise 1 .4 is [x 1 , x 2 ] ' = [126,974, 4224] . In this case
- 62,309 [ .000184 [ 126,974 4224 - 2927 J - .003293 I
= 4.34 > 1.39
- .003293 .128831
J
- 62,309 [ 126,974 4224 - 2927
J
X 10 - 5
and this point falls outside the 50% contour. The remaining nine points have generalized distances from i of 1.20, .59, .83, 1.88, 1.01, 1 .02, 5.33, .81, and .97, respectively. Since seven of these distances are less than 1 .39, a proportion, .70, of the data falls within the 50% contour. If the observations were nor mally distributed, we would expect about half, or 5, of them to be within this contour. This large a difference in proportions would ordinarily provide evi dence for rejecting the notion of bivariate normality; however, our sample size of 10 is too small to reach this conclusion. (See also Example 4.13.) • Computing the fraction of the points within a contour and subjectively com paring it with the theoretical probability is a useful, but rather rough, procedure. A somewhat more formal method for judging the joint normality of a data set is based on the squared generalized distances - ( xj - -x ) 'S - 1 ( xj - -x ) , 1· - 1 , 2 , . . . , n dj2 (4-32)
1 96
Chap. 4 The M ultivariate Normal Distribution
where x l > x 2 , , X 11 are the sample observations. The procedure we are about to describe is not limited to the bivariate case; it can be used for all p � 2. When the parent population is multivariate normal and both n and n p are greater than 25 or 30, each of the squared distances df , di , . . . , d� should behave like a chi-square random variable. [ See Result 4.7 and Equations (4-26) and (4-27).] Although these distances are not independent or exactly chi-square distributed, it is helpful to plot them as if they were. The resulting plot is called a chi-square plot or gamma plot, because the chi-square distribution is a special case of the more general gamma distribution. (See [6].) To construct the chi-square plot:
-
. . •
1. Order the squared distances 2.
-
d{l ) :;;:;; d{z)
:;;:;;
· · · :;;:;; d{n) .
m
( 4-32) from smallest to largest as
Graph the pairs (qc, p ((j - D in), d{n ), where qc, p ( (j D i n) is the 100 (j ! ) In quantile of the chi-square distribution with p degrees of freedom. -
-
Quantiles are specified in terms of proportions, whereas percentiles are spec ified in terms of percentages. The quantiles qc, p ((j !)In) are related to the upper percentiles of a chi squared distribution. In particular, qc, p ((j - !)In) = x� ( (n - j + ! )In). The plot should resemble a straight line through the origin having slope 1. A systematic curved pattern suggests lack of normality. One or two points far above the line indicate large distances, or outlying observations, that merit further attention. Example 4. 1 3
(Constructing a chi-square plot)
Let us construct a chi-square plot of the generalized distances given in Exam ple 4.12. The ordered distances and the corresponding chi-square percentiles for p = 2 and n = 10 are listed in the following table:
-
e !)
j
d{n
qc, Z ----w-
1 2 3 4 5 6 7 8 9 10
.59 .81 .83 .97 1.01 1.02 1.20 1.88 4.34 5.33
.10 .33 .58 .86 1.20 1 .60 2.10 2.77 3.79 5.99
Sec.
Assessing the Assumption of Normality
4.6
1 97
•
•
•
•
•
•
•
•
•
•
___.___--'----'--------'-----+_ L..o.,___....___
0
2
Figure 4.7
3
4
5
6
qc,2(( j- !)1 10 )
A chi-square plot of the ordered distances in Example 4 . 1 3 .
A graph of the pairs (q c, z ( (j !}/10), d �n ) is shown in Figure 4.7. The points in Figure 4.7 do not lie along the line with slope 1. The smallest distances appear to be too large and the middle distances appear to be too small, relative to the distances expected from bivariate normal populations for samples of size 10. These data do not appear to be bivariate normal; however, the sample size is small, and it is difficult to reach a definitive conclusion. If further analysis of the data were required, it might be reasonable to transform them to observations more nearly bivariate normal. Appropriate transformations are discussed in Section 4.8 . -
•
In addition to inspecting univariate plots and scatter plots, we should check multivariate normality by constructing a chi-squared or d 2 plot. Figure 4.8 on page 198 contains d 2 plots based on two computer-generated samples of 30 four-variate normal random vectors. As expected, the plots have a straight-line pattern, but the top two or three ordered squared distances are quite variable. The next example contains a real data set comparable to the simulated data set that produced the plots in Figure 4.8.
Chap. 4 The M ultivariate Normal Distribution
1 98
10
•
dJ)
8
.:
6
•
4
• •
• ••
•
•
2
0
2
4
6
8
Figure 4.8
10
qc,4 ( (j -
0
�/3 1
0
2
4
6
8
10
qc,4 ( (j -
�/30)
Chi-square plots for two simulated four-variate normal data sets with n = 30.
Example 4. 1 4 (Evaluating multivariate normality for a four-variable data set)
The data in Table 4.3 were obtained by taking four different measures of stiffness, x 1 , x 2 , x3 , and x4 , of each of n = 30 boards. The first measurement involves sending a shock wave down the board, the second measurement is determined while vibrating the board, and the last two measurements are obtained from static tests. The squared distances df = (xj x ) 'S- 1 (x j x ) are also presented in the table. -
-
TABLE 4.3
FOU R M EASU REM ENTS OF STI F F N ESS
Observation no.
xl
Xz
x3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1889 2403 2119 1645 1976 1712 1943 2104 2983 1745 1710 2046 1840 1867 1859
1651 2048 1700 1627 1916 1712 1685 1820 2794 1600 1591 1907 1841 1685 1649
1561 2087 1815 1110 1614 1439 1271 1717 2412 1384 1518 1627 1595 1493 1389
x4
d2
1778 .60 2197 5.48 2222 7.62 1533 5.21 1883 1.40 1546 2.22 1671 4.99 1874 1.49 2581 12.26 .77 1508 1667 1.93 .46 1898 1741 2.70 .13 1678 1714 1.08
Source: Data courtesy of Wil iam Galligan.
Observation no.
xl
Xz
x3
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
1954 1325 1419 1828 1725 2276 1899 1633 2061 1856 1727 2168 1655 2326 1490
2149 1170 1371 1634 1594 2189 1614 1513 1867 1493 1412 1896 1675 2301 1382
1180 1002 1252 1602 1313 1547 1422 1290 1646 1356 1238 1701 1414 2065 1214
x4
d2
1281 16.85 1176 3.50 1308 3.99 1755 1 .36 1646 1.46 2111 9.90 1477 5.06 .80 1516 2037 2.54 1533 4.58 1469 3.40 1834 2.38 1597 3.00 2234 6.28 1284 2.58
Sec.
'Cl
4.6
Assessing the Assumption of Normality
dfj )
@
""'
@
N •
2 00 'Cl ""' N 0
1 99
0
•
••
• ••
•• • ••••
2
•• •
•
••
•
•
• •
•
4 Figure 4.9
•
•
•
8
6
1
0
I
12
qc. 4 ({j- �
) / 30)
A chi-square plot for the data i n Exam ple 4 . 1 4.
The marginal distributions appear quite normal (see Exercise 4.33), with the possible exception of specimen (board) 9. To further evaluate multivariate normality, we constructed the chi square plot shown in Figure 4.9. The two specimens with the largest squared distances are clearly removed from the straight-line pattern. Together, with the next largest point or two, they make the plot appear curved at the upper • end. We will return to a discussion of this plot in Example 4.15.
...
We have discussed some rather simple techniques for checking the normality assumption. Specifically, we advocate calculating the d} , j = 1 , 2, , n [see Equa tion ( 4-32)] and comparing the results with x 2 quantiles. For example, p-variate normality is indicated if:
( 1 - l) (2 - l),
1. Roughly half of the d} are less than or equal to q c, p (.50). 2. A plot of the ordered squared distances d{1 ) :s; d tz) :s; · · · :s; d{n) versus n qc, p � , qc. p � . . . , qc, p �1 respectively, is nearly a straight
(
),
line having slope 1 and which passes through the origin. (See [6] for a more complete exposition of methods for assessing normality.)
200
Chap. 4 The M ultivariate Normal Distribution
We close this section by noting that all measures of goodness of fit suffer the same serious drawback. When the sample size is small, only the most aberrant behavior will be identified as lack of fit. On the other hand, very large samples invariably produce statistically significant lack of fit. Yet the departure from the specified distribution may be very small and technically unimportant to the infer ential conclusions.
4.�
I
DETECTING OUTLIERS AND CLEANING DATA
Most data sets contain one or a few unusual observations that do not seem to belong to the pattern of variability produced by the other observations. With data on a single characteristic, unusual observations are those that are either very large or very small relative to the others. The situation can be more complicated with multivariate data. Before we address the issue of identifying these outliers, we must emphasize that not all outliers are wrong numbers. They may, justifiably, be part of the group and may lead to a better understanding of the phenomena being studied. Outliers are best detected visually whenever this is possible. When the num ber of observations n is large, dot plots are not feasible. When the number of characteristics p is large, the large number of scatter plots p (p 1)/2 may pre vent viewing them all. Even so, we suggest first visually inspecting the data when ever possible. What should we look for? For a single random variable, the problem is one dimensional, and we look for observations that are far from the others. For instance, the dot diagram -
••
• • ••
•
• • •• • • • • •• • • •
• eee •
•
•
• •
4-------+---� x
reveals a single large observation. In the bivariate case, the situation is more complicated. Figure 4.10 on page 201 shows a situation with two unusual observations. The data point circled in the upper right corner of the figure is removed from the pattern, and its second coordinate is large relative to the rest of the x2 mea surements, as shown by the vertical dot diagram. The second outlier, also circled, is far from the elliptical pattern of the rest of the points, but, separately, each of its components has a typical value. This outlier cannot be detected by inspecting the marginal dot diagrams. In higher dimensions, there can be outliers that cannot be detected from the univariate plots or even the bivariate scatter plots. Here a large value of (xj x ) 'S - 1 (xj x ) will suggest an unusual observation, even though it cannot be seen visually. -
-
Sec.
Detecting Outliers and Cleaning Data
4.7
• • • •
••• •••
•
@
•• • •
••
•
•
•
• • •
•
• •• •
.. .
Figure 4. 1 0
..
•
• • • •••• :: : .
•
• • • •
.
•• •
.
:
•
• •
•
201
•
@
• •@
c:> •
.
Two outliers; one univariate and one bivariate.
Steps for Detecting Outliers
1. Make a dot plot for each variable.
Make a scatter plot for each pair of variables. Calculate the standardized values zj k = (xj k - xk ) ��� for j = 1, 2, . . . , n and each column k = 1, 2, , p. Examine these standardized values for large or small values. 4. Calculate the generalized squared distances (x j - x ) 'S- 1 (x j - x ) Examine these distances for unusually large values. In a chi-square plot, these would be the points farthest from the origin.
2. 3.
. . .
.
In step 3, "large" must be interpreted relative to the sample size and number of variables. There are n p standardized values. When n = 100 and p = 5, there
X
202
Chap.
4
The Multivariate Normal Distribution
are 500 values. You expect 1 or 2 of these to exceed 3 or be less than - 3, even if the data came from a multivariate distribution that is exactly normal. As a guide line, 3.5 might be considered large for moderate sample sizes. In step 4, "large" is measured by an appropriate percentile of the chi-square distribution with p degrees of freedom. If the sample size is n = 100, we would expect 5 observations to have values of dJ that exceed the upper fifth percentile of the chi-square distribution. A more extreme percentile must serve to determine observations that do not fit the pattern of the remaining data. The data we presented in Table 4.3 concerning lumber have already been cleaned up somewhat. Similar data sets from the same study also contained data on x5 = tensile strength. Nine observation vectors, out of the total of 1 12, are given as rows in the following table, along with their standardized values.
xl
Xz
x3
x4
Xs
zl
Zz
z3
z4
Zs
1631 1770 1376 1705 1643 1567 1528 1803 1587
1528 1677 1190 1577 1535 1510 1591 1826 1554
1452 1707 723 1332 1510 1301 1714 1748 1352
1559 1738 1285 1703 1494 1405 1685 2746 1554
1602 1785 2791 1664 1582 1553 1698 1764 1551
.06 .64 - 1 .01 .37 .11 - .21 - .38 .78 - .13
- .15 .43 - 1.47 .04 - .12 - .22 .10 1.01 - .05
.05 1.07 -2.87 - .43 .28 - .56 1 .10 1 .23 - .35
.28 .94 - .73 .81 .04 - .28 .75
- .12 .60
Cffi) .26
@]) .13 - .20 - .31 .26 .52 - .32
The standardized values are based on the sample mean and variance, calcu lated from all 112 observations. There are two extreme standardized values. Both are too large with standardized values over 4.5. During their investigation, the researchers recorded measurements by hand in a logbook and then performed cal culations that produced the values given in the table. When they checked their records regarding the values pinpointed by this analysis, errors were discovered. The value x5 = 2791 was corrected to 1241, and x4 = 2746 was corrected to 1670. Incorrect readings on an individual variable are quickly detected by locating a large leading digit for the standardized value. The next example returns to the data on lumber discussed in Example 4.14. Example 4. 1 5
(Detecting outliers in the data on lumber)
Table 4.4 on page 203 contains the data in Table 4.3, along with the stan dardized observations. These data consist of four different measures of stiff ness x 1 , x2 , x3 , and x4 , on each of n = 30 boards. Recall that the first
Sec.
4.8
Transformations To Near Normality
203
measurement involves sending a shock wave down the board, the second measurement is determined while vibrating the board, and the last two measurements are obtained from static tests. The standardized measurements are
xj k - xk k = 1, 2, 3, 4; j = 1, 2, . . . ' 30 Zj k = � '
and the squares of the distances are df = (xj
- x ) ' S - 1 (xj - x ) .
TABLE 4.4 FOU R M EASU REM ENTS OF STI F F N ESS WITH STAN DARDIZED VALU ES
xl
Xz
1889 2403 2119 1645 1976 1712 1943 2104 2983 1745 1710 2046 1840 1867 1859 1954 1325 1419 1828 1725 2276 1899 1633 2061 1856 1727 2168 1655 2326 1490
1651 2048 1700 1627 1916 1712 1685 1820 2794 1600 1591 1907 1841 1685 1649 2149 1170 1371 1634 1594 2189 1614 1513 1867 1493 1412 1896 1675 2301 1382
x3
x4
Observation no.
zl
Zz
z3
1561 2087 1815 1110 1614 1439 1271 1717 2412 1384 1518 1627 1595 1493 1389 1180 1002 1252 1602 1313 1547 1422 1290 1646 1356 1238 1701 1414 2065 1214
1778 2197 2222 1533 1883 1546 1671 1874 2581 1508 1667 1898 1741 1678 1714 1281 1176 1308 1755 1646 2111 1477 1516 2037 1533 1469 1834 1597 2234 1284
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
- .1 1.5 .7 - .8 .2 - .6 .1 .6 3.3 - .5 - .6 .4 - .2 - .1 - .1 .1 - 1.8 - 1 .5 - .2 - .6 1.1 - .0 - .8 .5 - .2 - .6 .8 - .8 1.3 - 1.3
- .3 .9 - .2 - .4 .5 - .1 - .2 .2 3.3 - .5 - .5 .5 .3 - .2 - .3 1 .3 - 1 .8 - 1.2 - .4 - .5 1.4 - .4 - .7 .4 - .8 - 1.1 .5 - .2 1.7 - 1.2
.2 1.9 1.0 - 1.3 .3 - .2 - .8 .7 3.0 - .4 .0 .4 .3 - .1 - .4 - 1.1 - 1.7 - .8 .3 - .6 .1 - .3 - .7 .5 - .5 - .9 .6 - .3 1.8 - 1.0
z4
d2
.60 .2 5.48 1.5 7.62 1.5 - .6 5.21 1.40 .5 - .6 2.22 - .2 4.99 1.49 .5 2.7 <1§> - .7 .77 - .2 1.93 .46 .5 2.70 .0 - .1 .13 - .0 1 .08 - 1.4 <1@) - 1.7 3.50 - 1.3 3.99 1 .36 .1 - .2 1 .46 9.90 1.2 - .8 5.06 - .6 .80 2.54 1.0 - .6 4.58 - .8 3.40 2.38 .3 - .4 3.00 6.28 1.6 - 1.4 2.58
204
Chap. 4 The Multivariate Normal Distribution
The last column in Table 4.4 reveals that specimen 16 is a multivariate outlier, since xl (.005) = 14.86; yet all of the individual measurements are well within their respective univariate scatters. Specimen 9 also has a large d 2 value. The two specimens (9 and 16) with large squared distances stand out as clearly different from the rest of the pattern in Figure 4.9. Once these two points are removed, the remaining pattern conforms to the expected straight line relation. Scatter plots for the lumber stiffness measurements are given in Figure 4.11 on page 205. The solid dots in these figures correspond to speci mens 9 and 16. Although the dot for specimen 16 stands out in all the plots, the dot for specimen 9 is "hidden" in the scatterplot of x3 versus x4 and nearly hidden in that of x1 versus x3 • However, specimen 9 is clearly identified as a multivariate outlier when all four variables are considered. Scientists specializing in the properties of wood conjectured that speci men 9 was unusually clear and therefore very stiff and strong. It would also appear that specimen 16 is a bit unusual, since both of its dynamic measure ments are above average and the two static measurements are low. Unfortu nately, it was not possible to investigate this specimen further because the • material was no longer available. If outliers are identified, they should be examined for content, as was done in the case of the data on lumber stiffness in Example 4.15. Depending upon the nature of the outliers and the objectives of the investigation, outliers may be deleted or appropriately "weighted" in a subsequent analysis. Even though many statistical techniques assume normal populations, those based on the sample mean vectors usually will not be disturbed by a few moderate outliers. Hawkins [7] gives an extensive treatment of the subject of outliers. 4.8 TRANSFORMATIONS TO NEAR NORMALITY
If normality is not a viable assumption, what is the next step? One alternative is to ignore the findings of a normality check and proceed as if the data were normally distributed. This practice is not recommended, since, in many instances, it could lead to incorrect conclusions. A second alternative is to make nonnormal data more "normal looking" by considering transformations of the data. Normal-theory analyses can then be carried out with the suitably transformed data. Transformations are nothing more than a reexpression of the data in differ ent units. For example, when a histogram of positive observations exhibits a long right-hand tail, transforming the observations by taking their logarithms or square roots will often markedly improve the symmetry about the mean and the approxi mation to a normal distribution. It frequently happens that the new units provide more natural expressions of the characteristics being studied. Appropriate transformations are suggested by (1) theoretical considerations or (2) the data themselves (or both). It has been shown theoretically that data
Sec. -
I
I
I
I
I
-
x1
� -
"'
0
!3
-
-
0
•
• 00 0 o%
0
�0 � 0 o 0
-
-
• cP
0� r:S� 0 - � 00 �
-o
- co
..;"'
0 0
"'
0� o0� o
8
ow
I
0 oO •
16
•
9
,�
1 500
0 ;6o • oaS' J o 0o8?fo 0 0
o
0 0Cb 0
•
•
0 2500 Figure 4. 1 1
I
I
•
0
0
o
op
gso
cB
8
•
•
0 0
•
I
0 I
0 00
o c:g o �� 0
10 �
•
co
Cb
0 o o Cb 0 � c:r� 0 0 �
08
I
0000 �0
l§l
00 0
I •I
2400
1 800
0 <6
0
coo �
0�
I l j_ l j_
1 200
205
· Cb � ��
x3
0 0 00 0 Jtco �
o��
10
I I I I I •I
I
•
0 0 0 �66 0 (t19})00 • 0
�
�
I
x2
-
0 0 00
0
I
cP
-
0 0
2500
1 500
Transformations To Near Normality
4.8
- � "' 0
-
� � 0
•
III-
• r-
r- � "' rr-
0
:- � I-
0
r- 8 r-
0
fiSJ
1 000
I
I
I
1 600
I
I
I
2200
I
Scatter plots for the lumber stiffness data with specimens 9 and 1 6 plotted
as solid dots.
which are counts can often be made more normal by taking their square roots. Sim ilarly, the logit transformation applied to proportions and Fisher's z-transformation applied to correlation coefficients yield quantities that are approximately normally distributed.
206
Chap.
4
The M ultivariate Normal Distribution HELPFUL TRANSFORMATIONS Original Stale
Transformed Scale
1. Counts, y
2. 3.
TO N EAR NORMALITY
Propprtions� fi'
Cortelatiqp�;}.·
Vy
lpgit (fi)
�log C � P) (r) � G � ;)
=
Fi&her's z
=
log
In many instances, the choice of a transformation to improve the approxima tion to normality is not obvious. For such cases, it is convenient to let the data sug gest a transformation. A useful family of transformations for this purpose is the family of power transformations. Power transformations are defined only for positive variables. However, this is not as restrictive as it seems, because a single constant can be added to each observation in the data set if some of the values are negative. Let x represent an arbitrary observation. The power family of transforma tions is indexed by a parameter A. A given value for A implies a particular trans formation. For example, consider xA with A = 1. Since x- 1 = 1/x , this choice of A corresponds to the reciprocal transformation. We can trace the family of trans formations as A ranges from negative to positive powers of x . For A = 0, we define 0 x = ln x . A sequence of possible transformations is
-
4 . . . , x - 1 = -1 , x 0 = ln x , x 1 14 = Vx, x 1 12 = X
Vx,
x2, x3,
• • •
�
increases large values of x To select a power transformation, an investigator looks at the marginal dot diagram or histogram and decides whether large values have to be "pulled in" or "pushed out" to improve the symmetry about the mean. Trial-and-error calcula tions with a few of the foregoing transformations should produce an improvement. The final choice should always be examined by a Q-Q plot or other checks to see whether the tentative normal assumption is satisfactory. The transformations we have been discussing are data based in the sense that it is only the appearance of the data themselves that influences the choice of an appropriate transformation. There are no external considerations involved, although the transformation actually used is often determined by some mix of information supplied by the data and extra-data factors, such as simplicity or ease of interpretation. shrinks large values of x
Sec.
4.8
Transformations To Near Normality
207
A convenient analytical method is available for choosing a power transfor mation. We begin by focusing our attention on the univariate case. Box and Cox [3] consider the slightly modified family of power transformations
{
x < A) =
x"-
A lnx
1 A =F O
�-
(4-34)
A = 0
which is continuous in A for x > 0. (See [8].) Given the observations x 1 , x2 , , X11 , the Box-Cox solution for the choice of an appropriate power A is the solution that maximizes the expression • • •
f (A) = -
':1_
2
ln
[ln ji=l x? l - x
We note that x? l is defined in (4-34) and 1 11 x < A l = 2: x? l = �
-n j =
1
+ (A -
-n1 j2:=11 ( X A A- 1 ) 1
j= j 11
1) 2: ln x
_:1__
I
(4-35)
(4-36)
is the arithmetic average of the transformed observations. The first term in (4-35) is, apart from a constant, the logarithm of a normal likelihood function, after max imizing it with respect to the population mean and variance parameters. The calculation of f (A) for many values of A is an easy task for a computer. It is helpful to have a graph of f (A) versus A, as well as a tabular display of the pairs (A, f (A)), in order to study the behavior near the maximizing value A . For instance, if either A = 0 (logarithm) or A = � (square root) is near A , one of these may be pre ferred because of its simplicity. Rather than program the calculation of ( 4-35), some statisticians recommend the equivalent procedure of fixing A, creating the new variable
Yj - A [ ( D r/nr- 1 (A) -
xf
-1
X;
j=
1, . . . , n
(4-37)
and then calculating the sample variance. The minimum of the variance occurs at the same A that maximizes (4-35).
Comment. It is now understood that the transformation obtained by maxi mizing f (A) usually improves the approximation to normality. However, there is no guarantee that even the best choice of A will produce a transformed set of val ues that adequately conform to a normal distribution. The outcomes produced by a transformation selected according to (4-35) should always be carefully examined for possible violations of the tentative assumption of normality. This warning applies with equal force to transformations selected by any other technique.
208
Chap. 4 The Multivariate Normal Distribution Example 4. 1 6
(Determining a power transformation for u nivariate data)
We gave readings of the microwave radiation emitted through the closed doors of n = 42 ovens in Example 4.10. The Q-Q plot of these data in Fig ure 4.6 indicates that the observations deviate from what would be expected if they were normally distributed. Since all the observations are positive, let us perform a power transformation of the data which, we hope, will produce results that are more nearly normal. Restricting our attention to the family of transformations in (4-34), we must find that value of A. maximizing the func tion t' (.A) in (4-35). The pairs (.A, e (A.)) are listed in the following table for several values of .A: A.
t' (.A)
-1.00 -.90 -.80 -.70 -.60 -.50 -.40 -.30 -.20 - .10 .00 .10 .20 (.30
70.52 75.65 80.46 84.94 89.06 92.79 96.10 98.97 101.39 103.35 104.83 105.84 106.39 106.51)
t' (.A)
.40 .50 .60 .70 .80 .90 1.00 1.10 1.20 1.30 1.40 1.50
106.20 105.50 104.43 103.03 101.33 99.34 97.10 94.64 91.96 89.10 86.07 82.88
The curve of t' (.A) versus A. that allows the more exact determination " A. = .28 is shown in Figure 4.12 on page 209. ,.. It is evident from both the table and the plot that a value of A. around .30 maximizes t' (.A). For convenience, we choose A = .25. The data xj were reexpressed as �l /4)
XJ
1/4 - 1 xj = _l__l _ j = 1, 2, . . . ' 42 _
4
and a Q-Q plot was constructed from the transformed quantities. This plot is shown in Figure 4.13 on page 210. The quantile pairs fall very close4 to a straight line, and we would conclude from this evidence that the xp 1 l are • approximately normal.
Sec.
4.�
Transformations To Near Normality
209
e (A.)
-
0
8
0. 1
0.0
0.2
I
0. 3
0.4
0.5
A. = 0.28
Figure 4. 1 2
Plot of e (A) versus A for radiation data (door closed).
Transforming Multivariate Observations
With multivariate observations, a power transformation must be selected for each of the variables. Let ,.\ l , ,.\ 2 , , ,.\ P be the power transformations for the p mea sured characteristics. Each ,.\ k can be selected by • . .
..
maximizing
where x l k , Xz k • . , x,k are the n observations on the kth variable, k = 1, 2, 1 11 1 11 X � 1
- = - � xi�d = - � ( -''---- - ) xYd n i =l n i =l ,.\k
.
(4-38) . . ,p.
Here
(4-39)
is the arithmetic average of the transformed observations. The jth transformed multivariate observation is
21 0
Chap. 4 The Multivariate Normal Distribution x < 1 14) (j )
- .50
- 1 .00
- 1 .50
- 2.00
- 2.50
- 3 .00 - 1 .0
- 2.0
2.0
1 .0
.0
3.0
A Q-Q plot of the transformed radiation data (door closed). {The integers in the plot indicate the number of points occupying the same location.)
Figure 4.1 3
" x '
1
jl
"
A1 j2
x
xJ�-' >
"2
=
-
1
Az "
XJ�P'' Ap
1
"
where A 1 , A 2 , , AP are the values that individually maximize (4-38). The procedure just described is equivalent to making each marginal distribu tion approximately normal. Although normal marginals are not sufficient to ensure that the joint distribution is normal, in practical applications they may be good A P obtained from the pre enough. If not, we could start with the values A 1 , A 2 , ceding transformations and iterate toward the set of values .A' = [A 1 , A 2 , , A P ], which collectively maximizes • • •
• • •
,
. • •
Sec.
=
4.8
Transformations To Near Normality
21 1
- % In I S (A) I + (A 1 - 1) � lnxj 1 + (A2 - 1) � Inxj z 1 "
n
j=
j= 1
+ ... +
n
(Ap - 1) � Inxjp j= i
(4-40)
where S (A) is the sample covariance matrix computed from
XjAlI - 1
At
xJ
XjZA ' - 1
Az
j
=
1, 2, . . . , n
XJPAP - 1 Ap Maximizing (4-40) not only is substantially more difficult than maximizing the individual expressions in ( 4-38), but also is unlikely to yield remarkably better results. The selection method based on Equation (4-40) is equivalent to maximizing a multivariate likelihood over p. , I and A, whereas the method based on (4-38) cor responds to maximizing the kth univariate likelihood over f-Lk , (Ykk • and Ak . The lat ter likelihood is generated by pretending there is some Ak for which the observations (x/( - 1) / A k, j = 1, 2, . . . , n have a normal distribution. See [3] and [2] for detailed discussions of the univariate and multivariate cases, respectively. (Also, see [8].) Example 4. 1 7
(Determining power transformations for bivariate data)
Radiation measurements were also recorded through the open doors of the = 42 microwave ovens introduced in Example 4.10. The amount of radia tion emitted through the open doors of these ovens is listed in Table 4.5 on page 212. In accordance with the procedure outlined in Example 4.16, a power transformation for these data was se l,_e cted by maximizing f(A) in (4-35). The approximate maximizing value was A = .30. Figure 4.14 on page 213 shows Q-Q plots of the untransformed and transformed door-open radiation data. (These data were actually transformed by taking the fourth root, as in Exam ple 4.16.) It is clear from the figure that the transformed data are more nearly normal, although the normal approximation is not as good as it was for the door-closed data. Let us denote the door-closed data by x1 1 , x2 1 , . . . , x4 2 •1 and the door open data by xw x22 , . . . , x4 2, 2 • Choosing a power transformation for each set by maximizing the expression in (4-35) is equivalent to maximizing fk (A) in
n
21 2
Chap. 4 The M ultivariate Normal Distribution TABLE 4.5
RADIATION DATA (DOOR OPEN)
Oven no.
Radiation
Oven no.
Radiation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
.30 .09 .30 .10 .10 .12 .09 .10 .09 .10 .07 .05 .01 .45 .12
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
.20 .04 .10 .01 .60 .12 .10 .05 .05 .15 .30 .15 .09 .09 .28
Oven no.
Radiation
31 32 33 34 35 36 37 38 39 40 41 42
.10 .10 .10 .30 .12 .25 .20 .40 .33 .32 .12 .12
Source: Data courtesy of J.D. Cryer. (4-38) with k = 1, 2. Thus,� using the outcomes from Example 4.16 and the foregoing results, we have A 1 = .30 and A 2 = .30. These powers were deter mined for the marginal distributions of x1 and x2 . We can consider the joint distribution of x 1 and x2 and simultaneously determine the pair of powers (A 1 , A 2 ) that makes this joint distribution approximately bivariate normal. To do this, we must maximize t' (A 1 , A 2 ) in (4-40) with respect to both A 1 and A 2 . We computed t'(A 1 , A 2 ) for a grid of A 1 , A 2 values covering 0 � A 1 � .50 and 0 � A 2 � .50, and we constructed the contour plot sQ.o�n in Figure 4.15 on page 214. We see that the maximum occurs at about (A 1 , A 2 ) = (.16, .16). The "best" power transformations for this bivariate case do not differ substantially from those obtained by considering each marginal distribution . •
As we saw in Example 4.17, making each marginal distribution approximately normal is roughly equivalent to addressing the bivariate distribution directly and making it approximately normal. It is generally easier to select appropriate trans formations for the marginal distributions than for the joint distributions.
Sec.
Transformations To Near Normality
4.8
•
.60
.45 •
. 30
•
•
4
•
• •
2 .15 •
.0 - 2.0
3
•
- 1 .0
5
6
9
2
1 .0
.0
2.0
3.0
( a) t4 > x
.00
- .60
- 1 .20
- 1 .80
- 2.40
- 3.00
(b)
Figure 4. 1 4 Q-Q plots of (a) the original and (b) the transformed radiation data (with door open). (The integers in the plot indicate the n u m ber of points occupying the same location).
21 3
214
Chap. 4 The M ultivariate Normal Distribution
0.5
0.4 0.3
0.2
0
0. 1
0.0
0.0
0. 1
Figure 4. 1 5
223
225.9
0.2
224
�
223 222 221 0.5 0.4 0.3
AI
Contour plot of C (.A 1 , .A2 ) for the radiation data.
EXERCISES
4.1. Consider a bivariate normal distribution with IL l = 1, IL z = 3, u1 1 = 2, u22 = 1 and p 1 2 = - 8 (a) Write out the bivariate normal density. (b) Write out the squared statistical distance expression (x - p)' :I - 1 (x - p) as a quadratic function of x1 and x2 • 4.2. Consider a bivariate normal population with �L 1 = 0, IL z = 2, u1 1 = 2, u22 = 1, and p 1 2 = .5. (a) Write out the bivariate normal density. (b) Write out the squared generalized distance expression (x - p)' :I- 1 (x - p) as a function of x1 and x2 • (c) Determine ( and sketch) the constant-density contour that contains 50% of the probability. .
.
Chap. 4 Exercises 4.3. Let X be N3 (p.., l:) with p.. ' =
21 5
[-3, 1, 4] and
� H - � �J
!
Which of the following random variables are independent? Explain. (a) and (b) and (c) and
X1 X2 X2 X3 (X1 , X2 ) X3 X X (d) I + 2 and X3 2 (e) X2 and X2 - �X1 - X3 = [2, -3, 1] and
4.4. Let X be N3 (p.., l:) with p.. '
�3X1 [:- 2X�2 +nX3 •
!
[ �: J are independent.
(a) Find the distribution of
2 1
(b) Relabel the variables if necessary, and find a X vector a such that and
X2
-
a'
4.5. Specify each of the following. ( a ) The conditional distribution of X1 , given that bution in Exercise (b) The conditional distribution of given that joint distribution in Exercise (c) The conditional distribution of given that joint distribution in Exercise 4.6. Let X be distributed as N3 (p.., l:), where p.. ' =
4.2.
l: =
X, 4.3. 2 x, 4.4. 3
[ � � -�]2
X2 = x2 for the joint distri X1 = x1 and X3 = x3 for the XI = xl and Xz = Xz for the [1, -1, 2] and
-1 0
Which of the following random variables are independent? Explain. ( a) and X2 (b) xl and (c) and (d) and (e ) and xl 4.7. Refer to Exercise and specify each of the following. = ( a) The conditional distribution of given that (b) The conditional distribution of X1 , given that = and =
X1 x X2 X33 (X , X ) X XI 1 3 + 3X2 z - 2X3 4.6
X1 ,
X2
X X23
x• x23 X3 x3 •
21 6
Chap.
4
The Multivariate Normal Distribution
4.8. (Example of a nonnormal bivariate distribution with normal marginals.) Let $X_1$ be $N(0, 1)$, and let

$$X_2 = \begin{cases} -X_1 & \text{if } -1 \le X_1 \le 1 \\ X_1 & \text{otherwise} \end{cases}$$

Show each of the following.
(a) $X_2$ also has an $N(0, 1)$ distribution.
(b) $X_1$ and $X_2$ do not have a bivariate normal distribution.

Hint:
(a) Since $X_1$ is $N(0, 1)$, $P[-1 < X_1 \le x] = P[-x \le X_1 < 1]$ for any $x$. When $-1 < x_2 < 1$, $P[X_2 \le x_2] = P[X_2 \le -1] + P[-1 < X_2 \le x_2] = P[X_1 \le -1] + P[-1 < -X_1 \le x_2] = P[X_1 \le -1] + P[-x_2 \le X_1 < 1]$. But $P[-x_2 \le X_1 < 1] = P[-1 < X_1 \le x_2]$ from the symmetry argument in the first line of this hint. Thus, $P[X_2 \le x_2] = P[X_1 \le -1] + P[-1 < X_1 \le x_2] = P[X_1 \le x_2]$, which is a standard normal probability.
(b) Consider the linear combination $X_1 - X_2$, which equals zero with probability $P[|X_1| > 1] = .3174$.

4.9. Refer to Exercise 4.8, but modify the construction by replacing the break point 1 by $c$ so that

$$X_2 = \begin{cases} -X_1 & \text{if } -c \le X_1 \le c \\ X_1 & \text{elsewhere} \end{cases}$$

Show that $c$ can be chosen so that $\operatorname{Cov}(X_1, X_2) = 0$, but that the two random variables are not independent.

Hint: For $c = 0$, evaluate $\operatorname{Cov}(X_1, X_2) = E[X_1(X_1)]$. For $c$ very large, evaluate $\operatorname{Cov}(X_1, X_2) \approx E[X_1(-X_1)]$.
4.10. Show each of the following.

(a) $\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = |\mathbf{A}|\,|\mathbf{B}|$

(b) $\begin{vmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = |\mathbf{A}|\,|\mathbf{B}|$ for $|\mathbf{A}| \ne 0$

Hint:
(a) Write $\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = \begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix} \begin{vmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix}$. Expanding the determinant $\begin{vmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix}$ by the first row (see Definition 2A.24) gives 1 times a determinant of the same form, with the order of $\mathbf{I}$ reduced by one. This procedure is repeated until $1 \times |\mathbf{B}|$ is obtained. Similarly, expanding the determinant $\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix}$ by the last row gives $\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix} = |\mathbf{A}|$.
(b) Write $\begin{vmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = \begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} \begin{vmatrix} \mathbf{I} & \mathbf{A}^{-1}\mathbf{C} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix}$. But expanding the determinant $\begin{vmatrix} \mathbf{I} & \mathbf{A}^{-1}\mathbf{C} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix}$ by the last row gives $\begin{vmatrix} \mathbf{I} & \mathbf{A}^{-1}\mathbf{C} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix} = 1$. Now use the result in Part a.

4.11. Show that, if $\mathbf{A}$ is square,

$$|\mathbf{A}| = |\mathbf{A}_{22}|\,|\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}| \quad \text{for } |\mathbf{A}_{22}| \ne 0$$
$$|\mathbf{A}| = |\mathbf{A}_{11}|\,|\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}| \quad \text{for } |\mathbf{A}_{11}| \ne 0$$

Hint: Partition $\mathbf{A}$ and verify that

$$\begin{bmatrix} \mathbf{I} & -\mathbf{A}_{12}\mathbf{A}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}\begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{I} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{0} \\ \mathbf{0}' & \mathbf{A}_{22} \end{bmatrix}$$

Take determinants on both sides of this equality. Use Exercise 4.10 for the first and third determinants on the left and for the determinant on the right. The second equality for $|\mathbf{A}|$ follows by considering the analogous product that isolates $\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}$ in the lower right-hand block.

4.12. Show that, for $\mathbf{A}$ symmetric,

$$\mathbf{A}^{-1} = \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{I} \end{bmatrix}\begin{bmatrix} (\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})^{-1} & \mathbf{0} \\ \mathbf{0}' & \mathbf{A}_{22}^{-1} \end{bmatrix}\begin{bmatrix} \mathbf{I} & -\mathbf{A}_{12}\mathbf{A}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}$$

Thus, $(\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})^{-1}$ is the upper left-hand block of $\mathbf{A}^{-1}$.

Hint: Premultiply the expression in the hint to Exercise 4.11 by $\begin{bmatrix} \mathbf{I} & -\mathbf{A}_{12}\mathbf{A}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}^{-1}$ and postmultiply by $\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{I} \end{bmatrix}^{-1}$. Take inverses of the resulting expression.
4.13. Show the following if $|\boldsymbol{\Sigma}| \ne 0$.
(a) Check that $|\boldsymbol{\Sigma}| = |\boldsymbol{\Sigma}_{22}|\,|\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}|$. (Note that $|\boldsymbol{\Sigma}|$ can be factored into the product of contributions from the marginal and conditional distributions.)
(b) Check that

$$(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = [\mathbf{x}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)]'(\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})^{-1}[\mathbf{x}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)] + (\mathbf{x}_2 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$$

(Thus, the joint density exponent can be written as the sum of two terms corresponding to contributions from the conditional and marginal distributions.)
(c) Given the results in Parts a and b, identify the marginal distribution of $\mathbf{X}_2$ and the conditional distribution of $\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2$.

Hint:
(a) Apply Exercise 4.11.
(b) Note from Exercise 4.12 that we can write $(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})$ as

$$\begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix}'\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} & \mathbf{I} \end{bmatrix}\begin{bmatrix} (\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})^{-1} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22}^{-1} \end{bmatrix}\begin{bmatrix} \mathbf{I} & -\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}\begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix}$$

If we group the product so that

$$\begin{bmatrix} \mathbf{I} & -\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}\begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2) \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix}$$

the result follows.

4.14. If $\mathbf{X}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $|\boldsymbol{\Sigma}| \ne 0$, show that the joint density can be written as the product of marginal densities for

$$\mathbf{X}_1 \; (q \times 1) \quad \text{and} \quad \mathbf{X}_2 \; ((p - q) \times 1) \quad \text{if } \boldsymbol{\Sigma}_{12} = \mathbf{0} \; (q \times (p - q))$$

Hint: Show by block multiplication that $\begin{bmatrix} \boldsymbol{\Sigma}_{11}^{-1} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22}^{-1} \end{bmatrix}$ is the inverse of $\boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22} \end{bmatrix}$. Then write

$$(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = [(\mathbf{x}_1 - \boldsymbol{\mu}_1)', (\mathbf{x}_2 - \boldsymbol{\mu}_2)']\begin{bmatrix} \boldsymbol{\Sigma}_{11}^{-1} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22}^{-1} \end{bmatrix}\begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix} = (\mathbf{x}_1 - \boldsymbol{\mu}_1)'\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{x}_1 - \boldsymbol{\mu}_1) + (\mathbf{x}_2 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$$

Note that $|\boldsymbol{\Sigma}| = |\boldsymbol{\Sigma}_{11}|\,|\boldsymbol{\Sigma}_{22}|$ from Exercise 4.10(a). Now factor the joint density.

4.15. Show that $\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}} - \boldsymbol{\mu})'$ and $\sum_{j=1}^{n}(\bar{\mathbf{x}} - \boldsymbol{\mu})(\mathbf{x}_j - \bar{\mathbf{x}})'$ are both $p \times p$ matrices of zeros. Here $\mathbf{x}_j' = [x_{j1}, x_{j2}, \ldots, x_{jp}]$, $j = 1, 2, \ldots, n$, and $\bar{\mathbf{x}} = \frac{1}{n}\sum_{j=1}^{n}\mathbf{x}_j$.
4.16. Let $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, and $\mathbf{X}_4$ be independent $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ random vectors.
(a) Find the marginal distributions for each of the random vectors

$$\mathbf{V}_1 = \tfrac{1}{4}\mathbf{X}_1 - \tfrac{1}{4}\mathbf{X}_2 + \tfrac{1}{4}\mathbf{X}_3 - \tfrac{1}{4}\mathbf{X}_4 \quad \text{and} \quad \mathbf{V}_2 = \tfrac{1}{4}\mathbf{X}_1 + \tfrac{1}{4}\mathbf{X}_2 - \tfrac{1}{4}\mathbf{X}_3 - \tfrac{1}{4}\mathbf{X}_4$$

(b) Find the joint density of the random vectors $\mathbf{V}_1$ and $\mathbf{V}_2$ defined in (a).

4.17. Let $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, $\mathbf{X}_4$, and $\mathbf{X}_5$ be independent and identically distributed random vectors with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Find the mean vector and covariance matrices for each of the two linear combinations of random vectors

$$\tfrac{1}{5}\mathbf{X}_1 + \tfrac{1}{5}\mathbf{X}_2 + \tfrac{1}{5}\mathbf{X}_3 + \tfrac{1}{5}\mathbf{X}_4 + \tfrac{1}{5}\mathbf{X}_5$$

and

$$\mathbf{X}_1 - \mathbf{X}_2 + \mathbf{X}_3 - \mathbf{X}_4 + \mathbf{X}_5$$

in terms of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. Also, obtain the covariance between the two linear combinations of random vectors.

4.18. Find the maximum likelihood estimates of the $2 \times 1$ mean vector $\boldsymbol{\mu}$ and the $2 \times 2$ covariance matrix $\boldsymbol{\Sigma}$ based on the random sample given in the text from a bivariate normal population.
4.19. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{20}$ be a random sample of size $n = 20$ from an $N_6(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ population. Specify each of the following completely.
(a) The distribution of $(\mathbf{X}_1 - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X}_1 - \boldsymbol{\mu})$.
(b) The distributions of $\bar{\mathbf{X}}$ and $\sqrt{n}(\bar{\mathbf{X}} - \boldsymbol{\mu})$.
(c) The distribution of $(n - 1)\mathbf{S}$.

4.20. For the random variables $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{20}$ in Exercise 4.19, specify the distribution of $\mathbf{B}(19\mathbf{S})\mathbf{B}'$ in each case.

(a) $\mathbf{B} = \begin{bmatrix} 1 & -\tfrac{1}{2} & -\tfrac{1}{2} & 0 & 0 & 0 \\ 0 & 0 & 0 & -\tfrac{1}{2} & -\tfrac{1}{2} & 1 \end{bmatrix}$

(b) $\mathbf{B} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{bmatrix}$

4.21. Let $\mathbf{X}_1, \ldots, \mathbf{X}_{60}$ be a random sample of size 60 from a four-variate normal distribution having mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. Specify each of the following completely.
(a) The distribution of $\bar{\mathbf{X}}$
(b) The distribution of $(\mathbf{X}_1 - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X}_1 - \boldsymbol{\mu})$
(c) The distribution of $n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu})$
(d) The approximate distribution of $n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu})$

4.22. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{75}$ be a random sample from a population distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. What is the approximate distribution of each of the following?
(a) $\bar{\mathbf{X}}$
(b) $n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu})$

4.23. Consider the annual rates of return (including dividends) on the Dow-Jones industrial average for the years 1963-1972. These data, multiplied by 100, are 20.6, 18.7, 14.2, -15.7, 19.0, 7.7, -11.6, 8.8, 9.8, and 18.2. Use these 10 observations to complete the following.
(a) Construct a Q-Q plot. Do the data seem to be normally distributed? Explain.
(b) Carry out a test of normality based on the correlation coefficient $r_Q$. [See (4-31).] Let the significance level be $\alpha = .10$.

4.24. Exercise 1.4 contains data on three variables for the 10 largest industrial corporations as of April 1990. For the sales ($x_1$) and profits ($x_2$) data:
(a) Construct Q-Q plots. Do these data appear to be normally distributed? Explain.
(b) Carry out a test of normality based on the correlation coefficient $r_Q$. [See (4-31).] Set the significance level at $\alpha = .10$. Do the results of these tests corroborate the results in Part a?

4.25. Refer to the data for the 10 largest industrial corporations in Exercise 1.4. Construct a chi-square plot using all three variables. The chi-square quantiles are

0.103  0.325  0.575  0.862  1.196  1.597  2.100  2.773  3.794  5.991

4.26. Exercise 1.2 gives the age $x_1$, measured in years, as well as the selling price $x_2$, measured in thousands of dollars, for $n = 10$ used cars. These data are reproduced as follows:

  x1:   3     5     5     7     7     7     8     9    10    11
  x2: 2.30  1.90  1.00   .70   .30  1.00  1.05   .45   .70   .30

(a) Use the results of Exercise 1.2 to calculate the squared statistical distances $(\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}})$, $j = 1, 2, \ldots, 10$, where $\mathbf{x}_j' = [x_{j1}, x_{j2}]$.
(b) Using the distances in Part a, determine the proportion of the observations falling within the estimated 50% probability contour of a bivariate normal distribution.
(c) Order the distances in Part a and construct a chi-square plot.
(d) Given the results in Parts b and c, are these data approximately bivariate normal? Explain.

4.27. Consider the radiation data (with door closed) in Example 4.10. Construct a Q-Q plot for the natural logarithms of these data. [Note that the natural-logarithm transformation corresponds to the value $\lambda = 0$ in (4-34).] Do the natural logarithms appear to be normally distributed? Compare your results with Figure 4.13. Does the choice $\lambda = \tfrac{1}{4}$ or $\lambda = 0$ make much difference in this case?

The following exercises may require a computer.

4.28. Consider the air-pollution data given in Table 1.3. Construct a Q-Q plot for the solar radiation measurements and carry out a test for normality based on the correlation coefficient $r_Q$ [see (4-31)]. Let $\alpha = .05$ and use the entry corresponding to $n = 40$ in Table 4.2.

4.29. Given the air-pollution data in Table 1.3, examine the pairs $X_5 = \text{NO}_2$ and $X_6 = \text{O}_3$ for bivariate normality.
(a) Calculate statistical distances $(\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}})$, $j = 1, 2, \ldots, 42$, where $\mathbf{x}_j' = [x_{j5}, x_{j6}]$.
(b) Determine the proportion of observations $\mathbf{x}_j' = [x_{j5}, x_{j6}]$, $j = 1, 2, \ldots, 42$, falling within the approximate 50% probability contour of a bivariate normal distribution.
(c) Construct a chi-square plot of the ordered distances in Part a.

4.30. Consider the used-car data in Exercise 4.26.
(a) Determine the power transformation $\lambda_1$ that makes the $x_1$ values approximately normal. Construct a Q-Q plot for the transformed data.
(b) Determine the power transformation $\lambda_2$ that makes the $x_2$ values approximately normal. Construct a Q-Q plot for the transformed data.
(c) Determine the power transformations $\boldsymbol{\lambda}' = [\lambda_1, \lambda_2]$ that make the $[x_1, x_2]$ values jointly normal using (4-40). Compare the results with those obtained in Parts a and b.

4.31. Examine the marginal normality of the observations on variables $X_1, X_2, \ldots, X_5$ for the multiple-sclerosis data in Table 1.4. Treat the non-multiple-sclerosis and multiple-sclerosis groups separately. Use whatever methodology, including transformations, you feel is appropriate.

4.32. Examine the marginal normality of the observations on variables $X_1, X_2, \ldots, X_6$ for the radiotherapy data in Table 1.5. Use whatever methodology, including transformations, you feel is appropriate.

4.33. Examine the marginal and bivariate normality of the observations on variables $X_1$, $X_2$, $X_3$, and $X_4$ for the data in Table 4.3.

4.34. Examine the data on bone mineral content in Table 1.6 for marginal and bivariate normality.

4.35. Examine the data on paper-quality measurements in Table 1.2 for marginal and multivariate normality.

4.36. Examine the data on women's national track records in Table 1.7 for marginal and multivariate normality.

4.37. Refer to Exercise 1.18. Convert the women's track records in Table 1.7 to speeds measured in meters per second. Examine the data on speeds for marginal and multivariate normality.

4.38. Examine the data on bulls in Table 1.8 for marginal and multivariate normality. Consider only the variables YrHgt, FtFrBody, PrctFFB, BkFat, SaleHt, and SaleWt.
CHAPTER 5

Inferences about a Mean Vector

5.1 INTRODUCTION

This chapter is the first of the methodological sections of the book. We shall now use the concepts and results set forth in Chapters 1 through 4 to develop techniques for analyzing data. A large part of any analysis is concerned with inference, that is, reaching valid conclusions concerning a population on the basis of information from a sample.
At this point, we shall concentrate on inferences about a population mean vector and its component parts. Although we introduce statistical inference through initial discussions of tests of hypotheses, our ultimate aim is to present a full statistical analysis of the component means based on simultaneous confidence statements.
One of the central messages of multivariate analysis is that p correlated variables must be analyzed jointly. This principle is exemplified by the methods presented in this chapter.

5.2 THE PLAUSIBILITY OF $\boldsymbol{\mu}_0$ AS A VALUE FOR A NORMAL POPULATION MEAN

Let us start by recalling the univariate theory for determining whether a specific value $\mu_0$ is a plausible value for the population mean $\mu$. From the point of view of hypothesis testing, this problem can be formulated as a test of the competing hypotheses

$$H_0: \mu = \mu_0 \quad \text{and} \quad H_1: \mu \ne \mu_0$$

Here $H_0$ is the null hypothesis and $H_1$ is the (two-sided) alternative hypothesis. If $X_1, X_2, \ldots, X_n$ denote a random sample from a normal population, the appropriate test statistic is

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}, \quad \text{where } \bar{X} = \frac{1}{n}\sum_{j=1}^{n} X_j \text{ and } s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(X_j - \bar{X})^2$$

This test statistic has a Student's t-distribution with $n - 1$ degrees of freedom (d.f.). We reject $H_0$, that $\mu_0$ is a plausible value of $\mu$, if the observed $|t|$ exceeds a specified percentage point of a t-distribution with $n - 1$ d.f. Rejecting $H_0$ when $|t|$ is large is equivalent to rejecting $H_0$ if its square,

$$t^2 = \frac{(\bar{X} - \mu_0)^2}{s^2/n} = n(\bar{X} - \mu_0)(s^2)^{-1}(\bar{X} - \mu_0) \tag{5-1}$$

is large. The variable $t^2$ in (5-1) is the square of the distance from the sample mean $\bar{X}$ to the test value $\mu_0$. The units of distance are expressed in terms of $s/\sqrt{n}$, or estimated standard deviations of $\bar{X}$. Once $\bar{X}$ and $s^2$ are observed, the test becomes: Reject $H_0$ in favor of $H_1$, at significance level $\alpha$, if

$$n(\bar{x} - \mu_0)(s^2)^{-1}(\bar{x} - \mu_0) > t_{n-1}^2(\alpha/2) \tag{5-2}$$

where $t_{n-1}(\alpha/2)$ denotes the upper $100(\alpha/2)$th percentile of the t-distribution with $n - 1$ d.f. If $H_0$ is not rejected, we conclude that $\mu_0$ is a plausible value for the normal population mean. Are there other values of $\mu$ which are also consistent with the data? The answer is yes! In fact, there is always a set of plausible values for a normal population mean.
From the well-known correspondence between acceptance regions for tests of $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$ and confidence intervals for $\mu$, we have

{Do not reject $H_0: \mu = \mu_0$ at level $\alpha$} or $\left|\dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}\right| \le t_{n-1}(\alpha/2)$

is equivalent to

{$\mu_0$ lies in the $100(1 - \alpha)\%$ confidence interval $\bar{x} \pm t_{n-1}(\alpha/2)\dfrac{s}{\sqrt{n}}$}

or

$$\bar{x} - t_{n-1}(\alpha/2)\frac{s}{\sqrt{n}} \le \mu_0 \le \bar{x} + t_{n-1}(\alpha/2)\frac{s}{\sqrt{n}} \tag{5-3}$$

The confidence interval consists of all those values $\mu_0$ that would not be rejected by the level $\alpha$ test of $H_0: \mu = \mu_0$.
Before the sample is selected, the $100(1 - \alpha)\%$ confidence interval in (5-3) is a random interval because the endpoints depend upon the random variables $\bar{X}$ and $s$. The probability that the interval contains $\mu$ is $1 - \alpha$; among large numbers of such independent intervals, approximately $100(1 - \alpha)\%$ of them will contain $\mu$.
Consider now the problem of determining whether a given $p \times 1$ vector $\boldsymbol{\mu}_0$ is a plausible value for the mean of a multivariate normal distribution. We shall proceed by analogy to the univariate development just presented. A natural generalization of the squared distance in (5-1) is its multivariate analog

$$T^2 = n(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0) \tag{5-4}$$

where

$$\bar{\mathbf{X}}_{(p \times 1)} = \frac{1}{n}\sum_{j=1}^{n}\mathbf{X}_j, \quad \mathbf{S}_{(p \times p)} = \frac{1}{n-1}\sum_{j=1}^{n}(\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})', \quad \text{and} \quad \boldsymbol{\mu}_{0\,(p \times 1)} = \begin{bmatrix} \mu_{10} \\ \vdots \\ \mu_{p0} \end{bmatrix}$$

The statistic $T^2$ is called Hotelling's $T^2$ in honor of Harold Hotelling, a pioneer in multivariate analysis, who first obtained its sampling distribution. Here $(1/n)\mathbf{S}$ is the estimated covariance matrix of $\bar{\mathbf{X}}$. (See Result 3.1.)
If the observed statistical distance $T^2$ is too large, that is, if $\bar{\mathbf{x}}$ is "too far" from $\boldsymbol{\mu}_0$, the hypothesis $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ is rejected. It turns out that special tables of $T^2$ percentage points are not required for formal tests of hypotheses. This is true because

$$T^2 \text{ is distributed as } \frac{(n-1)p}{(n-p)}F_{p,n-p} \tag{5-5}$$

where $F_{p,n-p}$ denotes a random variable with an F-distribution with $p$ and $n - p$ d.f. To summarize, we have the following:

Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from an $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ population. Then with $\bar{\mathbf{X}} = \frac{1}{n}\sum_{j=1}^{n}\mathbf{X}_j$ and $\mathbf{S} = \frac{1}{n-1}\sum_{j=1}^{n}(\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})'$,

$$\alpha = P\!\left[T^2 > \frac{(n-1)p}{(n-p)}F_{p,n-p}(\alpha)\right] = P\!\left[n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}) > \frac{(n-1)p}{(n-p)}F_{p,n-p}(\alpha)\right] \tag{5-6}$$

whatever the true $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. Here $F_{p,n-p}(\alpha)$ is the upper $(100\alpha)$th percentile of the $F_{p,n-p}$ distribution.

Statement (5-6) leads immediately to a test of the hypothesis $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ versus $H_1: \boldsymbol{\mu} \ne \boldsymbol{\mu}_0$. At the $\alpha$ level of significance, we reject $H_0$ in favor of $H_1$ if the observed

$$T^2 = n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) > \frac{(n-1)p}{(n-p)}F_{p,n-p}(\alpha) \tag{5-7}$$

It is informative to discuss the nature of the $T^2$-distribution briefly and its correspondence with the univariate test statistic. In Section 4.4, we described the manner in which the Wishart distribution generalizes the chi-square distribution. We can write

$$T^2 = \sqrt{n}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\left(\frac{\sum_{j=1}^{n}(\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})'}{n-1}\right)^{-1}\sqrt{n}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)$$

which combines a normal, $N_p(\mathbf{0}, \boldsymbol{\Sigma})$, random vector and a Wishart, $W_{p,n-1}(\boldsymbol{\Sigma})$, random matrix in the form

$$T^2_{p,n-1} = \begin{pmatrix} \text{multivariate normal} \\ \text{random vector} \end{pmatrix}'\begin{pmatrix} \text{Wishart random matrix} \\ \text{d.f.} \end{pmatrix}^{-1}\begin{pmatrix} \text{multivariate normal} \\ \text{random vector} \end{pmatrix} \tag{5-8}$$

This is analogous to

$$t^2 = \sqrt{n}(\bar{X} - \mu_0)(s^2)^{-1}\sqrt{n}(\bar{X} - \mu_0)$$

or

$$t^2_{n-1} = \begin{pmatrix} \text{normal} \\ \text{random variable} \end{pmatrix}\begin{pmatrix} \text{(scaled) chi-square random variable} \\ \text{d.f.} \end{pmatrix}^{-1}\begin{pmatrix} \text{normal} \\ \text{random variable} \end{pmatrix}$$

for the univariate case. Since the multivariate normal and Wishart random variables are independently distributed [see (4-23)], their joint density function is the product of the marginal normal and Wishart distributions. Using calculus, the distribution (5-5) of $T^2$ as given above can be derived from this joint distribution and the representation (5-8).
It is rare, in multivariate situations, to be content with a test of $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$, where all of the mean vector components are specified under the null hypothesis. Ordinarily, it is preferable to find regions of $\boldsymbol{\mu}$ values that are plausible in light of the observed data. We shall return to this issue in Section 5.4.
Example 5.1 (Evaluating T²)

Let the data matrix for a random sample of size n = 3 from a bivariate normal population be

$$\mathbf{X} = \begin{bmatrix} 6 & 9 \\ 10 & 6 \\ 8 & 3 \end{bmatrix}$$

Evaluate the observed $T^2$ for $\boldsymbol{\mu}_0' = [9, 5]$. What is the sampling distribution of $T^2$ in this case?
We find

$$\bar{\mathbf{x}} = \begin{bmatrix} \dfrac{6 + 10 + 8}{3} \\ \dfrac{9 + 6 + 3}{3} \end{bmatrix} = \begin{bmatrix} 8 \\ 6 \end{bmatrix}$$

and

$$s_{11} = \frac{(6-8)^2 + (10-8)^2 + (8-8)^2}{2} = 4$$
$$s_{12} = \frac{(6-8)(9-6) + (10-8)(6-6) + (8-8)(3-6)}{2} = -3$$
$$s_{22} = \frac{(9-6)^2 + (6-6)^2 + (3-6)^2}{2} = 9$$

so

$$\mathbf{S} = \begin{bmatrix} 4 & -3 \\ -3 & 9 \end{bmatrix}$$

Thus,

$$\mathbf{S}^{-1} = \frac{1}{(4)(9) - (-3)(-3)}\begin{bmatrix} 9 & 3 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} \tfrac{9}{27} & \tfrac{3}{27} \\ \tfrac{3}{27} & \tfrac{4}{27} \end{bmatrix}$$

and, from (5-4),

$$T^2 = 3[8 - 9, \; 6 - 5]\begin{bmatrix} \tfrac{9}{27} & \tfrac{3}{27} \\ \tfrac{3}{27} & \tfrac{4}{27} \end{bmatrix}\begin{bmatrix} 8 - 9 \\ 6 - 5 \end{bmatrix} = 3[-1, \; 1]\begin{bmatrix} -\tfrac{6}{27} \\ \tfrac{1}{27} \end{bmatrix} = \frac{7}{9}$$

Before the sample is selected, $T^2$ has the distribution of a

$$\frac{(3-1)2}{(3-2)}F_{2,3-2} = 4F_{2,1}$$

random variable. ■
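The computation in Example 5.1 can be checked numerically. The following is a minimal sketch, assuming NumPy is available (NumPy is not part of the text); it reproduces the sample mean, covariance matrix, and the value $T^2 = 7/9$.

```python
# Minimal numeric check of Example 5.1 (assumes NumPy, which is not in the text).
import numpy as np

X = np.array([[6.0, 9.0],
              [10.0, 6.0],
              [8.0, 3.0]])          # n = 3 observations on p = 2 variables
mu0 = np.array([9.0, 5.0])          # hypothesized mean vector

n, p = X.shape
xbar = X.mean(axis=0)               # sample mean vector [8, 6]
S = np.cov(X, rowvar=False)         # sample covariance with divisor n - 1
d = xbar - mu0
T2 = n * d @ np.linalg.solve(S, d)  # Hotelling's T^2 from (5-4)

print(xbar)   # [8. 6.]
print(S)      # [[ 4. -3.]  [-3.  9.]]
print(T2)     # 0.777...  (= 7/9, as in the example)
```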
The next example illustrates a test of the hypothesis $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ using data collected as part of a search for new diagnostic techniques at the University of Wisconsin Medical School.

Example 5.2 (Testing a multivariate mean vector with T²)

Perspiration from 20 healthy females was analyzed. Three components, $X_1$ = sweat rate, $X_2$ = sodium content, and $X_3$ = potassium content, were measured, and the results, which we call the sweat data, are presented in Table 5.1.

TABLE 5.1 SWEAT DATA

  Individual   x1 (Sweat rate)   x2 (Sodium)   x3 (Potassium)
       1             3.7             48.5            9.3
       2             5.7             65.1            8.0
       3             3.8             47.2           10.9
       4             3.2             53.2           12.0
       5             3.1             55.5            9.7
       6             4.6             36.1            7.9
       7             2.4             24.8           14.0
       8             7.2             33.1            7.6
       9             6.7             47.4            8.5
      10             5.4             54.1           11.3
      11             3.9             36.9           12.7
      12             4.5             58.8           12.3
      13             3.5             27.8            9.8
      14             4.5             40.2            8.4
      15             1.5             13.5           10.1
      16             8.5             56.4            7.1
      17             4.5             71.6            8.2
      18             6.5             52.8           10.9
      19             4.1             44.1           11.2
      20             5.5             40.9            9.4

Source: Courtesy of Dr. Gerald Bargman.

Test the hypothesis $H_0: \boldsymbol{\mu}' = [4, 50, 10]$ against $H_1: \boldsymbol{\mu}' \ne [4, 50, 10]$ at level of significance $\alpha = .10$.
Computer calculations provide

$$\bar{\mathbf{x}} = \begin{bmatrix} 4.640 \\ 45.400 \\ 9.965 \end{bmatrix}, \quad \mathbf{S} = \begin{bmatrix} 2.879 & 10.010 & -1.810 \\ 10.010 & 199.788 & -5.640 \\ -1.810 & -5.640 & 3.628 \end{bmatrix}$$

and

$$\mathbf{S}^{-1} = \begin{bmatrix} .586 & -.022 & .258 \\ -.022 & .006 & -.002 \\ .258 & -.002 & .402 \end{bmatrix}$$

We evaluate

$$T^2 = 20[4.640 - 4, \; 45.400 - 50, \; 9.965 - 10]\begin{bmatrix} .586 & -.022 & .258 \\ -.022 & .006 & -.002 \\ .258 & -.002 & .402 \end{bmatrix}\begin{bmatrix} 4.640 - 4 \\ 45.400 - 50 \\ 9.965 - 10 \end{bmatrix} = 20[.640, \; -4.600, \; -.035]\begin{bmatrix} .467 \\ -.042 \\ .160 \end{bmatrix} = 9.74$$

Comparing the observed $T^2 = 9.74$ with the critical value

$$\frac{(n-1)p}{(n-p)}F_{p,n-p}(.10) = \frac{19(3)}{17}F_{3,17}(.10) = 3.353(2.44) = 8.18$$

we see that $T^2 = 9.74 > 8.18$, and consequently, we reject $H_0$ at the 10% level of significance.
We note that $H_0$ will be rejected if one or more of the component means, or some combination of means, differs too much from the hypothesized values [4, 50, 10]. At this point, we have no idea which of these hypothesized values may not be supported by the data.
We have assumed that the sweat data are multivariate normal. The Q-Q plots constructed from the marginal distributions of $X_1$, $X_2$, and $X_3$ all approximate straight lines. Moreover, scatter plots for pairs of observations have approximate elliptical shapes, and we conclude that the normality assumption was reasonable in this case. (See Exercise 5.4.) ■
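A sketch of the same test in code follows, assuming NumPy/SciPy (not used in the text) and the rounded summary statistics reported above; small discrepancies from the printed numbers reflect that rounding.

```python
# Sketch of the T^2 test of Example 5.2 from the reported summary statistics.
import numpy as np
from scipy.stats import f

n, p = 20, 3
xbar = np.array([4.640, 45.400, 9.965])
S = np.array([[ 2.879,  10.010, -1.810],
              [10.010, 199.788, -5.640],
              [-1.810,  -5.640,  3.628]])
mu0 = np.array([4.0, 50.0, 10.0])

d = xbar - mu0
T2 = n * d @ np.linalg.solve(S, d)                    # about 9.74
crit = (n - 1) * p / (n - p) * f.ppf(0.90, p, n - p)  # (5-6) with alpha = .10, about 8.18
print(T2, crit, T2 > crit)                            # True: reject H0 at the 10% level
```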
One feature of the $T^2$-statistic is that it is invariant (unchanged) under changes in the units of measurements for $\mathbf{X}$ of the form

$$\mathbf{Y}_{(p \times 1)} = \mathbf{C}_{(p \times p)}\mathbf{X}_{(p \times 1)} + \mathbf{d}_{(p \times 1)}, \quad \mathbf{C} \text{ nonsingular} \tag{5-9}$$

A transformation of the observations of this kind arises when a constant $b_i$ is subtracted from the ith variable to form $X_i - b_i$ and the result is multiplied by a constant $a_i > 0$ to get $a_i(X_i - b_i)$. Premultiplication of the centered and scaled quantities $a_i(X_i - b_i)$ by any nonsingular matrix will yield Equation (5-9). As an example, the operations involved in changing $X_i$ to $a_i(X_i - b_i)$ correspond exactly to the process of converting temperature from a Fahrenheit to a Celsius reading.
Given observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ and the transformation in (5-9), it immediately follows from Result 3.6 that

$$\bar{\mathbf{y}} = \mathbf{C}\bar{\mathbf{x}} + \mathbf{d} \quad \text{and} \quad \mathbf{S}_y = \frac{1}{n-1}\sum_{j=1}^{n}(\mathbf{y}_j - \bar{\mathbf{y}})(\mathbf{y}_j - \bar{\mathbf{y}})' = \mathbf{C}\mathbf{S}\mathbf{C}'$$

Moreover, by (2-24) and (2-45),

$$\boldsymbol{\mu}_Y = E(\mathbf{Y}) = E(\mathbf{C}\mathbf{X} + \mathbf{d}) = E(\mathbf{C}\mathbf{X}) + E(\mathbf{d}) = \mathbf{C}\boldsymbol{\mu} + \mathbf{d}$$

Therefore, $T^2$ computed with the y's and a hypothesized value $\boldsymbol{\mu}_{Y,0} = \mathbf{C}\boldsymbol{\mu}_0 + \mathbf{d}$ is

$$T^2 = n(\bar{\mathbf{y}} - \boldsymbol{\mu}_{Y,0})'\mathbf{S}_y^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_{Y,0}) = n(\mathbf{C}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0))'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0))$$
$$= n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{C}'(\mathbf{C}')^{-1}\mathbf{S}^{-1}\mathbf{C}^{-1}\mathbf{C}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) = n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)$$

The last expression is recognized as the value of $T^2$ computed with the x's.
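The invariance property can also be verified numerically. The sketch below assumes NumPy; the data matrix, the nonsingular matrix C, and the shift d are arbitrary choices made purely for the demonstration and do not come from the text.

```python
# Numerical illustration of the invariance of T^2 under Y = C X + d (5-9).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3))                 # any sample works for the identity
mu0 = np.array([0.1, -0.2, 0.3])

def t2(data, mu):
    n = data.shape[0]
    d = data.mean(axis=0) - mu
    S = np.cov(data, rowvar=False)
    return n * d @ np.linalg.solve(S, d)

C = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.5, 0.3],
              [0.1, 0.0, 1.0]])               # any nonsingular matrix
d = np.array([1.0, -2.0, 0.5])

Y = X @ C.T + d                               # y_j = C x_j + d
print(np.isclose(t2(X, mu0), t2(Y, C @ mu0 + d)))   # True: T^2 is unchanged
```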
5.3 HOTELLING'S T² AND LIKELIHOOD RATIO TESTS
We introduced the $T^2$-statistic by analogy with the univariate squared distance $t^2$. There is a general principle for constructing test procedures called the likelihood ratio method, and the $T^2$-statistic can be derived as the likelihood ratio test of $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$. The general theory of likelihood ratio tests is beyond the scope of this book. (See [3] for a treatment of the topic.) Likelihood ratio tests have several optimal properties for reasonably large samples, and they are particularly convenient for hypotheses formulated in terms of multivariate normal parameters.
We know from (4-18) that the maximum of the multivariate normal likelihood as $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are varied over their possible values is given by

$$\max_{\boldsymbol{\mu}, \boldsymbol{\Sigma}} L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}\,|\hat{\boldsymbol{\Sigma}}|^{n/2}}\,e^{-np/2} \tag{5-10}$$

where

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' \quad \text{and} \quad \hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} = \frac{1}{n}\sum_{j=1}^{n}\mathbf{x}_j$$

are the maximum likelihood estimates. Recall that $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ are those choices for $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ that best explain the observed values of the random sample.
Under the hypothesis $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$, the normal likelihood specializes to

$$L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}\,|\boldsymbol{\Sigma}|^{n/2}}\exp\!\left(-\frac{1}{2}\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu}_0)\right)$$

The mean $\boldsymbol{\mu}_0$ is now fixed, but $\boldsymbol{\Sigma}$ can be varied to find the value that is "most likely" to have led, with $\boldsymbol{\mu}_0$ fixed, to the observed sample. This value is obtained by maximizing $L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})$ with respect to $\boldsymbol{\Sigma}$.
Following the steps in (4-13), the exponent in $L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})$ may be written as

$$-\frac{1}{2}\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu}_0) = -\frac{1}{2}\operatorname{tr}\!\left[\boldsymbol{\Sigma}^{-1}\left(\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)'\right)\right]$$

Applying Result 4.10 with $\mathbf{B} = \sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)'$ and $b = n/2$, we have

$$\max_{\boldsymbol{\Sigma}} L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}\,|\hat{\boldsymbol{\Sigma}}_0|^{n/2}}\,e^{-np/2} \tag{5-11}$$

with

$$\hat{\boldsymbol{\Sigma}}_0 = \frac{1}{n}\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)'$$

To determine whether $\boldsymbol{\mu}_0$ is a plausible value of $\boldsymbol{\mu}$, the maximum of $L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})$ is compared with the unrestricted maximum of $L(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. The resulting ratio is called the likelihood ratio statistic. Using Equations (5-10) and (5-11), we get

$$\text{Likelihood ratio} = \Lambda = \frac{\max_{\boldsymbol{\Sigma}} L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})}{\max_{\boldsymbol{\mu}, \boldsymbol{\Sigma}} L(\boldsymbol{\mu}, \boldsymbol{\Sigma})} = \left(\frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_0|}\right)^{n/2} \tag{5-12}$$

The equivalent statistic $\Lambda^{2/n} = |\hat{\boldsymbol{\Sigma}}|/|\hat{\boldsymbol{\Sigma}}_0|$ is called Wilks' lambda. If the observed value of this likelihood ratio is too small, the hypothesis $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ is unlikely to be true and is, therefore, rejected. Specifically, the likelihood ratio test of $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ against $H_1: \boldsymbol{\mu} \ne \boldsymbol{\mu}_0$ rejects $H_0$ if

$$\Lambda = \left(\frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_0|}\right)^{n/2} < c_{\alpha} \tag{5-13}$$

where $c_{\alpha}$ is the lower $(100\alpha)$th percentile of the distribution of $\Lambda$. (Note that the likelihood ratio test statistic is a power of the ratio of generalized variances.) Fortunately, because of the following relation between $T^2$ and $\Lambda$, we do not need the distribution of the latter to carry out the test.
Result 5.1. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from an $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ population. Then the test in (5-7) based on $T^2$ is equivalent to the likelihood ratio test of $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ versus $H_1: \boldsymbol{\mu} \ne \boldsymbol{\mu}_0$ because

$$\Lambda^{2/n} = \left(1 + \frac{T^2}{n-1}\right)^{-1}$$

Proof. Let the $(p + 1) \times (p + 1)$ matrix

$$\mathbf{A} = \begin{bmatrix} \displaystyle\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' & \sqrt{n}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) \\ \sqrt{n}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)' & -1 \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}$$

By Exercise 4.11, $|\mathbf{A}| = |\mathbf{A}_{22}|\,|\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}| = |\mathbf{A}_{11}|\,|\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}|$, from which we obtain

$$(-1)\left|\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' + n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\right| = \left|\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\right|\left(-1 - n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\left(\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\right)^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)\right)$$

Since, by (4-14),

$$\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)' = \sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}} + \bar{\mathbf{x}} - \boldsymbol{\mu}_0)(\mathbf{x}_j - \bar{\mathbf{x}} + \bar{\mathbf{x}} - \boldsymbol{\mu}_0)' = \sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' + n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'$$

the foregoing equality involving determinants can be written

$$(-1)\left|\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)'\right| = \left|\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\right|(-1)\left(1 + \frac{T^2}{n-1}\right)$$

or

$$|n\hat{\boldsymbol{\Sigma}}_0| = |n\hat{\boldsymbol{\Sigma}}|\left(1 + \frac{T^2}{n-1}\right)$$

Thus,

$$\Lambda^{2/n} = \frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_0|} = \left(1 + \frac{T^2}{n-1}\right)^{-1} \tag{5-14}$$

Here $H_0$ is rejected for small values of $\Lambda^{2/n}$ or, equivalently, large values of $T^2$. The critical values of $T^2$ are determined by (5-6). ■

Incidentally, relation (5-14) shows that $T^2$ may be calculated from two determinants, thus avoiding the computation of $\mathbf{S}^{-1}$. Solving (5-14) for $T^2$, we have

$$T^2 = \frac{(n-1)|\hat{\boldsymbol{\Sigma}}_0|}{|\hat{\boldsymbol{\Sigma}}|} - (n-1) = \frac{(n-1)\left|\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)'\right|}{\left|\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\right|} - (n-1) \tag{5-15}$$
Likelihood ratio tests are common in multivariate analysis. Their optimal large-sample properties hold in very general contexts, as we shall indicate shortly. They are well suited for the testing situations considered in this book. Likelihood ratio methods yield test statistics that reduce to the familiar F- and t-statistics in univariate situations.
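A short sketch of the determinant form (5-15), assuming NumPy (not part of the text); the Example 5.1 numbers are reused so the answer can be checked against 7/9.

```python
# T^2 computed from the two determinants in (5-15), without inverting S.
import numpy as np

X = np.array([[6.0, 9.0], [10.0, 6.0], [8.0, 3.0]])
mu0 = np.array([9.0, 5.0])
n = X.shape[0]
xbar = X.mean(axis=0)

A0 = (X - mu0).T @ (X - mu0)      # sum of (x_j - mu_0)(x_j - mu_0)'
A  = (X - xbar).T @ (X - xbar)    # sum of (x_j - xbar)(x_j - xbar)'

T2_det = (n - 1) * np.linalg.det(A0) / np.linalg.det(A) - (n - 1)
print(T2_det)                     # 0.777..., matching the direct formula (5-4)
```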
General Likelihood Ratio Method

We shall now consider the general likelihood ratio method. Let $\boldsymbol{\theta}$ be a vector consisting of all the unknown population parameters, and let $L(\boldsymbol{\theta})$ be the likelihood function obtained by evaluating the joint density of $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ at their observed values $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$. The parameter vector $\boldsymbol{\theta}$ takes its value in the parameter set $\Theta$. For example, in the p-dimensional multivariate normal case, $\boldsymbol{\theta}' = [\mu_1, \ldots, \mu_p, \sigma_{11}, \ldots, \sigma_{1p}, \sigma_{22}, \ldots, \sigma_{2p}, \ldots, \sigma_{p-1,p}, \sigma_{pp}]$ and $\Theta$ consists of the union of the p-dimensional space, where $-\infty < \mu_1 < \infty, \ldots, -\infty < \mu_p < \infty$, and the $[p(p+1)/2]$-dimensional space of variances and covariances such that $\boldsymbol{\Sigma}$ is positive definite. Therefore, $\Theta$ has dimension $\nu = p + p(p+1)/2$. Under the null hypothesis $H_0: \boldsymbol{\theta} = \boldsymbol{\theta}_0$, $\boldsymbol{\theta}$ is restricted to lie in a subset $\Theta_0$ of $\Theta$. For the multivariate normal situation with $\boldsymbol{\mu} = \boldsymbol{\mu}_0$ and $\boldsymbol{\Sigma}$ unspecified, $\Theta_0 = \{\mu_1 = \mu_{10}, \mu_2 = \mu_{20}, \ldots, \mu_p = \mu_{p0};\ \sigma_{11}, \ldots, \sigma_{1p}, \sigma_{22}, \ldots, \sigma_{2p}, \ldots, \sigma_{p-1,p}, \sigma_{pp} \text{ with } \boldsymbol{\Sigma} \text{ positive definite}\}$, so $\Theta_0$ has dimension $\nu_0 = 0 + p(p+1)/2 = p(p+1)/2$.
A likelihood ratio test of $H_0: \boldsymbol{\theta} \in \Theta_0$ rejects $H_0$ in favor of $H_1: \boldsymbol{\theta} \notin \Theta_0$ if

$$\Lambda = \frac{\max_{\boldsymbol{\theta} \in \Theta_0} L(\boldsymbol{\theta})}{\max_{\boldsymbol{\theta} \in \Theta} L(\boldsymbol{\theta})} < c \tag{5-16}$$

where $c$ is a suitably chosen constant. Intuitively, we reject $H_0$ if the maximum of the likelihood obtained by allowing $\boldsymbol{\theta}$ to vary over the set $\Theta_0$ is much smaller than the maximum of the likelihood obtained by varying $\boldsymbol{\theta}$ over all values in $\Theta$. When the maximum in the numerator of expression (5-16) is much smaller than the maximum in the denominator, $\Theta_0$ does not contain plausible values for $\boldsymbol{\theta}$.
In each application of the likelihood ratio method, we must obtain the sampling distribution of the likelihood-ratio test statistic $\Lambda$. Then $c$ can be selected to produce a test with a specified significance level $\alpha$. However, when the sample size is large and certain regularity conditions are satisfied, the sampling distribution of $-2\ln\Lambda$ is well approximated by a chi-square distribution. This attractive feature accounts, in part, for the popularity of likelihood ratio procedures.

Result 5.2. When the sample size n is large,

$$-2\ln\Lambda = -2\ln\!\left(\frac{\max_{\boldsymbol{\theta} \in \Theta_0} L(\boldsymbol{\theta})}{\max_{\boldsymbol{\theta} \in \Theta} L(\boldsymbol{\theta})}\right)$$

is, approximately, a $\chi^2_{\nu - \nu_0}$ random variable. Here the degrees of freedom are $\nu - \nu_0 = (\text{dimension of } \Theta) - (\text{dimension of } \Theta_0)$. ■

Statistical tests are compared on the basis of their power, which is defined as the curve or surface whose height is $P[\text{test rejects } H_0 \mid \boldsymbol{\theta}]$, evaluated at each parameter vector $\boldsymbol{\theta}$. Power measures the ability of a test to reject $H_0$ when it is not true. In the rare situation where $\boldsymbol{\theta} = \boldsymbol{\theta}_0$ is completely specified under $H_0$ and the alternative $H_1$ consists of the single specified value $\boldsymbol{\theta} = \boldsymbol{\theta}_1$, the likelihood ratio test has the highest power among all tests with the same significance level $\alpha = P[\text{test rejects } H_0 \mid \boldsymbol{\theta} = \boldsymbol{\theta}_0]$. In many single-parameter cases ($\theta$ has one component), the likelihood ratio test is uniformly most powerful against all alternatives to one side of $H_0: \theta = \theta_0$. In other cases, this property holds approximately for large samples.
We shall not give the technical details required for discussing the optimal properties of likelihood ratio tests in the multivariate situation. The general import of these properties, for our purposes, is that they have the highest possible (average) power when the sample size is large.
5.4 CONFIDENCE REGIONS AND SIMULTANEOUS COMPARISONS OF COMPONENT MEANS

To obtain our primary method for making inferences from a sample, we need to extend the concept of a univariate confidence interval to a multivariate confidence region. Let $\boldsymbol{\theta}$ be a vector of unknown population parameters and $\Theta$ be the set of all possible values of $\boldsymbol{\theta}$. A confidence region is a region of likely $\boldsymbol{\theta}$ values. This region is determined by the data, and for the moment, we shall denote it by $R(\mathbf{X})$, where $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n]'$ is the data matrix.
The region $R(\mathbf{X})$ is said to be a $100(1 - \alpha)\%$ confidence region if, before the sample is selected,

$$P[R(\mathbf{X}) \text{ will cover the true } \boldsymbol{\theta}] = 1 - \alpha \tag{5-17}$$

This probability is calculated under the true, but unknown, value of $\boldsymbol{\theta}$.
The confidence region for the mean $\boldsymbol{\mu}$ of a p-dimensional normal population is available from (5-6). Before the sample is selected,

$$P\!\left[n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}) \le \frac{(n-1)p}{(n-p)}F_{p,n-p}(\alpha)\right] = 1 - \alpha$$

whatever the values of the unknown $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. In words, $\bar{\mathbf{X}}$ will be within $[(n-1)pF_{p,n-p}(\alpha)/(n-p)]^{1/2}$ of $\boldsymbol{\mu}$, with probability $1 - \alpha$, provided that distance is defined in terms of $n\mathbf{S}^{-1}$. For a particular sample, $\bar{\mathbf{x}}$ and $\mathbf{S}$ can be computed, and the inequality

$$n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) \le \frac{(n-1)p}{(n-p)}F_{p,n-p}(\alpha) \tag{5-18}$$

will define a region $R(\mathbf{X})$ within the space of all possible parameter values. In this case, the region will be an ellipsoid centered at $\bar{\mathbf{x}}$. This ellipsoid is the $100(1 - \alpha)\%$ confidence region for $\boldsymbol{\mu}$.
To determine whether any $\boldsymbol{\mu}_0$ falls within the confidence region (is a plausible value for $\boldsymbol{\mu}$), we need to compute the generalized squared distance $n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)$ and compare it with $[p(n-1)/(n-p)]F_{p,n-p}(\alpha)$. If the squared distance is larger than $[p(n-1)/(n-p)]F_{p,n-p}(\alpha)$, $\boldsymbol{\mu}_0$ is not in the confidence region. Since this is analogous to testing $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ versus $H_1: \boldsymbol{\mu} \ne \boldsymbol{\mu}_0$ [see (5-7)], we see that the confidence region of (5-18) consists of all $\boldsymbol{\mu}_0$ vectors for which the $T^2$-test would not reject $H_0$ in favor of $H_1$ at significance level $\alpha$.
For $p \ge 4$, we cannot graph the joint confidence region for $\boldsymbol{\mu}$. However, we can calculate the axes of the confidence ellipsoid and their relative lengths. These are determined from the eigenvalues $\lambda_i$ and eigenvectors $\mathbf{e}_i$ of $\mathbf{S}$. As in (4-7), the directions and lengths of the axes of

$$n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) \le c^2 = \frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)$$

are determined by going $\sqrt{\lambda_i}\,c/\sqrt{n} = \sqrt{\lambda_i}\sqrt{p(n-1)F_{p,n-p}(\alpha)/(n(n-p))}$ units along the eigenvectors $\mathbf{e}_i$. Beginning at the center $\bar{\mathbf{x}}$, the axes of the confidence ellipsoid are

$$\pm\sqrt{\lambda_i}\sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)}\;\mathbf{e}_i \quad \text{where } \mathbf{S}\mathbf{e}_i = \lambda_i\mathbf{e}_i, \quad i = 1, 2, \ldots, p \tag{5-19}$$

The ratios of the $\lambda_i$'s will help identify relative amounts of elongation along pairs of axes.

Example 5.3 (Constructing a confidence ellipse for μ)
Data for radiation from microwave ovens were introduced in Examples 4.10 and 4.17. Let

$$x_1 = \sqrt[4]{\text{measured radiation with door closed}}$$

and

$$x_2 = \sqrt[4]{\text{measured radiation with door open}}$$

For the n = 42 pairs of transformed observations, we find that

$$\bar{\mathbf{x}} = \begin{bmatrix} .564 \\ .603 \end{bmatrix}, \quad \mathbf{S} = \begin{bmatrix} .0144 & .0117 \\ .0117 & .0146 \end{bmatrix}, \quad \mathbf{S}^{-1} = \begin{bmatrix} 203.018 & -163.391 \\ -163.391 & 200.228 \end{bmatrix}$$

The eigenvalue and eigenvector pairs for S are

$$\lambda_1 = .026, \quad \mathbf{e}_1' = [.704, .710]$$
$$\lambda_2 = .002, \quad \mathbf{e}_2' = [-.710, .704]$$

The 95% confidence ellipse for $\boldsymbol{\mu}$ consists of all values $(\mu_1, \mu_2)$ satisfying

$$42[.564 - \mu_1, \; .603 - \mu_2]\begin{bmatrix} 203.018 & -163.391 \\ -163.391 & 200.228 \end{bmatrix}\begin{bmatrix} .564 - \mu_1 \\ .603 - \mu_2 \end{bmatrix} \le \frac{2(41)}{40}F_{2,40}(.05)$$

or, since $F_{2,40}(.05) = 3.23$,

$$42(203.018)(.564 - \mu_1)^2 + 42(200.228)(.603 - \mu_2)^2 - 84(163.391)(.564 - \mu_1)(.603 - \mu_2) \le 6.62$$

To see whether $\boldsymbol{\mu}' = [.562, .589]$ is in the confidence region, we compute

$$42(203.018)(.564 - .562)^2 + 42(200.228)(.603 - .589)^2 - 84(163.391)(.564 - .562)(.603 - .589) = 1.30 \le 6.62$$

We conclude that $\boldsymbol{\mu}' = [.562, .589]$ is in the region. Equivalently, a test of $H_0: \boldsymbol{\mu} = \begin{bmatrix} .562 \\ .589 \end{bmatrix}$ in favor of $H_1: \boldsymbol{\mu} \ne \begin{bmatrix} .562 \\ .589 \end{bmatrix}$ would not be rejected at the $\alpha = .05$ level of significance.
The joint confidence ellipsoid is plotted in Figure 5.1. The center is at $\bar{\mathbf{x}}' = [.564, .603]$, and the half-lengths of the major and minor axes are given by

$$\sqrt{\lambda_1}\sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)} = \sqrt{.026}\sqrt{\frac{2(41)}{42(40)}(3.23)} = .064$$

and

$$\sqrt{\lambda_2}\sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)} = \sqrt{.002}\sqrt{\frac{2(41)}{42(40)}(3.23)} = .018$$

respectively. The axes lie along $\mathbf{e}_1' = [.704, .710]$ and $\mathbf{e}_2' = [-.710, .704]$ when these vectors are plotted with $\bar{\mathbf{x}}$ as the origin.

[Figure 5.1 A 95% confidence ellipse for μ based on microwave-radiation data.]

An indication of the elongation of the confidence ellipse is provided by the ratio of the lengths of the major and minor axes. This ratio is

$$\frac{2\sqrt{\lambda_1}\sqrt{\dfrac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)}}{2\sqrt{\lambda_2}\sqrt{\dfrac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)}} = \frac{\sqrt{\lambda_1}}{\sqrt{\lambda_2}} = \frac{.161}{.045} = 3.6$$

The length of the major axis is 3.6 times the length of the minor axis. ■
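The ellipse computations of Example 5.3 can be sketched in a few lines of code. This assumes NumPy/SciPy (not used in the text) and the rounded summary statistics above, so the printed values differ slightly from the text's, which were based on more decimal places.

```python
# Sketch of the 95% confidence-ellipse computations of Example 5.3.
import numpy as np
from scipy.stats import f

n, p = 42, 2
xbar = np.array([0.564, 0.603])
S = np.array([[0.0144, 0.0117],
              [0.0117, 0.0146]])

c2 = p * (n - 1) / (n - p) * f.ppf(0.95, p, n - p)   # about 6.62
eigval, eigvec = np.linalg.eigh(S)                    # eigenpairs of S
half_lengths = np.sqrt(eigval) * np.sqrt(c2 / n)      # roughly .02 and .064

mu0 = np.array([0.562, 0.589])
d = xbar - mu0
dist2 = n * d @ np.linalg.solve(S, d)                 # about 1.3
print(half_lengths, dist2, dist2 <= c2)               # mu0 lies inside the ellipse
```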
Simultaneous Confidence Statements
While the confidence region $n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) \le c^2$, for $c$ a constant, correctly assesses the joint knowledge concerning plausible values of $\boldsymbol{\mu}$, any summary of conclusions ordinarily includes confidence statements about the individual component means. In so doing, we adopt the attitude that all of the separate confidence statements should hold simultaneously with a specified high probability. It is the guarantee of a specified probability against any statement being incorrect that motivates the term simultaneous confidence intervals. We begin by considering simultaneous confidence statements which are intimately related to the joint confidence region based on the $T^2$-statistic.
Let $\mathbf{X}$ have an $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution and form the linear combination

$$Z = a_1X_1 + a_2X_2 + \cdots + a_pX_p = \mathbf{a}'\mathbf{X}$$

From (2-43),

$$\mu_Z = E(Z) = \mathbf{a}'\boldsymbol{\mu} \quad \text{and} \quad \sigma_Z^2 = \operatorname{Var}(Z) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}$$

Moreover, by Result 4.2, $Z$ has an $N(\mathbf{a}'\boldsymbol{\mu}, \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a})$ distribution. If a random sample $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ from the $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ population is available, a corresponding sample of Z's can be created by taking linear combinations. Thus,

$$Z_j = a_1X_{j1} + a_2X_{j2} + \cdots + a_pX_{jp} = \mathbf{a}'\mathbf{X}_j, \quad j = 1, 2, \ldots, n$$

The sample mean and variance of the observed values $z_1, z_2, \ldots, z_n$ are, by (3-36),

$$\bar{z} = \mathbf{a}'\bar{\mathbf{x}} \quad \text{and} \quad s_z^2 = \mathbf{a}'\mathbf{S}\mathbf{a}$$

where $\bar{\mathbf{x}}$ and $\mathbf{S}$ are the sample mean vector and covariance matrix of the $\mathbf{x}_j$'s, respectively.
Simultaneous confidence intervals can be developed from a consideration of confidence intervals for $\mathbf{a}'\boldsymbol{\mu}$ for various choices of $\mathbf{a}$. The argument proceeds as follows.
For $\mathbf{a}$ fixed and $\sigma_Z^2$ unknown, a $100(1 - \alpha)\%$ confidence interval for $\mu_Z = \mathbf{a}'\boldsymbol{\mu}$ is based on Student's t-ratio

$$t = \frac{\bar{z} - \mu_Z}{s_z/\sqrt{n}} = \frac{\sqrt{n}(\mathbf{a}'\bar{\mathbf{x}} - \mathbf{a}'\boldsymbol{\mu})}{\sqrt{\mathbf{a}'\mathbf{S}\mathbf{a}}} \tag{5-20}$$

and leads to the statement

$$\bar{z} - t_{n-1}(\alpha/2)\frac{s_z}{\sqrt{n}} \le \mu_Z \le \bar{z} + t_{n-1}(\alpha/2)\frac{s_z}{\sqrt{n}}$$

or

$$\mathbf{a}'\bar{\mathbf{x}} - t_{n-1}(\alpha/2)\frac{\sqrt{\mathbf{a}'\mathbf{S}\mathbf{a}}}{\sqrt{n}} \le \mathbf{a}'\boldsymbol{\mu} \le \mathbf{a}'\bar{\mathbf{x}} + t_{n-1}(\alpha/2)\frac{\sqrt{\mathbf{a}'\mathbf{S}\mathbf{a}}}{\sqrt{n}} \tag{5-21}$$

where $t_{n-1}(\alpha/2)$ is the upper $100(\alpha/2)$th percentile of a t-distribution with $n - 1$ d.f.
Inequality (5-21) can be interpreted as a statement about the components of the mean vector $\boldsymbol{\mu}$. For example, with $\mathbf{a}' = [1, 0, \ldots, 0]$, $\mathbf{a}'\boldsymbol{\mu} = \mu_1$, and (5-21) becomes the usual confidence interval for a normal population mean. (Note, in this case, that $\mathbf{a}'\mathbf{S}\mathbf{a} = s_{11}$.) Clearly, we could make several confidence statements about the components of $\boldsymbol{\mu}$, each with associated confidence coefficient $1 - \alpha$, by choosing different coefficient vectors $\mathbf{a}$. However, the confidence associated with all of the statements taken together is not $1 - \alpha$.
Intuitively, it would be desirable to associate a "collective" confidence coefficient of $1 - \alpha$ with the confidence intervals that can be generated by all choices of $\mathbf{a}$. However, a price must be paid for the convenience of a large simultaneous confidence coefficient: intervals that are wider (less precise) than the interval of (5-21) for a specific choice of $\mathbf{a}$.
Given a data set $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ and a particular $\mathbf{a}$, the confidence interval in (5-21) is that set of $\mathbf{a}'\boldsymbol{\mu}$ values for which

$$|t| = \left|\frac{\sqrt{n}(\mathbf{a}'\bar{\mathbf{x}} - \mathbf{a}'\boldsymbol{\mu})}{\sqrt{\mathbf{a}'\mathbf{S}\mathbf{a}}}\right| \le t_{n-1}(\alpha/2)$$

or, equivalently,

$$t^2 = \frac{n(\mathbf{a}'\bar{\mathbf{x}} - \mathbf{a}'\boldsymbol{\mu})^2}{\mathbf{a}'\mathbf{S}\mathbf{a}} = \frac{n(\mathbf{a}'(\bar{\mathbf{x}} - \boldsymbol{\mu}))^2}{\mathbf{a}'\mathbf{S}\mathbf{a}} \le t_{n-1}^2(\alpha/2) \tag{5-22}$$

A simultaneous confidence region is given by the set of $\mathbf{a}'\boldsymbol{\mu}$ values such that $t^2$ is relatively small for all choices of $\mathbf{a}$. It seems reasonable to expect that the constant $t_{n-1}^2(\alpha/2)$ in (5-22) will be replaced by a larger value, $c^2$, when statements are developed for many choices of $\mathbf{a}$.
Considering the values of $\mathbf{a}$ for which $t^2 \le c^2$, we are naturally led to the determination of

$$\max_{\mathbf{a}} t^2 = \max_{\mathbf{a}} \frac{n(\mathbf{a}'(\bar{\mathbf{x}} - \boldsymbol{\mu}))^2}{\mathbf{a}'\mathbf{S}\mathbf{a}}$$

Using the maximization lemma (2-50) with $\mathbf{x} = \mathbf{a}$, $\mathbf{d} = (\bar{\mathbf{x}} - \boldsymbol{\mu})$, and $\mathbf{B} = \mathbf{S}$, we get

$$\max_{\mathbf{a}} \frac{n(\mathbf{a}'(\bar{\mathbf{x}} - \boldsymbol{\mu}))^2}{\mathbf{a}'\mathbf{S}\mathbf{a}} = n\left[\max_{\mathbf{a}} \frac{(\mathbf{a}'(\bar{\mathbf{x}} - \boldsymbol{\mu}))^2}{\mathbf{a}'\mathbf{S}\mathbf{a}}\right] = n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) = T^2 \tag{5-23}$$

with the maximum occurring for $\mathbf{a}$ proportional to $\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu})$.

Result 5.3. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from an $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ population with $\boldsymbol{\Sigma}$ positive definite. Then, simultaneously for all $\mathbf{a}$, the interval

$$\left(\mathbf{a}'\bar{\mathbf{X}} - \sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)\,\mathbf{a}'\mathbf{S}\mathbf{a}}, \quad \mathbf{a}'\bar{\mathbf{X}} + \sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)\,\mathbf{a}'\mathbf{S}\mathbf{a}}\right)$$

will contain $\mathbf{a}'\boldsymbol{\mu}$ with probability $1 - \alpha$.

Proof. From (5-23),

$$t^2 = \frac{n(\mathbf{a}'\bar{\mathbf{x}} - \mathbf{a}'\boldsymbol{\mu})^2}{\mathbf{a}'\mathbf{S}\mathbf{a}} \le c^2 \quad \text{implies} \quad \mathbf{a}'\bar{\mathbf{x}} - c\sqrt{\frac{\mathbf{a}'\mathbf{S}\mathbf{a}}{n}} \le \mathbf{a}'\boldsymbol{\mu} \le \mathbf{a}'\bar{\mathbf{x}} + c\sqrt{\frac{\mathbf{a}'\mathbf{S}\mathbf{a}}{n}}$$

for every $\mathbf{a}$. Choosing $c^2 = p(n-1)F_{p,n-p}(\alpha)/(n-p)$ [see (5-6)] gives intervals that will contain $\mathbf{a}'\boldsymbol{\mu}$ for all $\mathbf{a}$, with probability $1 - \alpha = P[T^2 \le c^2]$. ■

It is convenient to refer to the simultaneous intervals of Result 5.3 as $T^2$-intervals, since the coverage probability is determined by the distribution of $T^2$. The successive choices $\mathbf{a}' = [1, 0, \ldots, 0]$, $\mathbf{a}' = [0, 1, \ldots, 0]$, and so on through $\mathbf{a}' = [0, 0, \ldots, 1]$ for the $T^2$-intervals allow us to conclude that

$$\bar{x}_1 - \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{11}}{n}} \le \mu_1 \le \bar{x}_1 + \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{11}}{n}}$$
$$\bar{x}_2 - \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{22}}{n}} \le \mu_2 \le \bar{x}_2 + \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{22}}{n}}$$
$$\vdots$$
$$\bar{x}_p - \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{pp}}{n}} \le \mu_p \le \bar{x}_p + \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{pp}}{n}} \tag{5-24}$$

all hold simultaneously with confidence coefficient $1 - \alpha$. Note that, without modifying the coefficient $1 - \alpha$, we can make statements about the differences $\mu_i - \mu_k$ corresponding to $\mathbf{a}' = [0, \ldots, 0, a_i, 0, \ldots, 0, a_k, 0, \ldots, 0]$ where $a_i = 1$ and $a_k = -1$. In this case $\mathbf{a}'\mathbf{S}\mathbf{a} = s_{ii} - 2s_{ik} + s_{kk}$, and we have the statement

$$\bar{x}_i - \bar{x}_k - \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{ii} - 2s_{ik} + s_{kk}}{n}} \le \mu_i - \mu_k \le \bar{x}_i - \bar{x}_k + \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{ii} - 2s_{ik} + s_{kk}}{n}} \tag{5-25}$$

The simultaneous $T^2$ confidence intervals are ideal for "data snooping." The confidence coefficient $1 - \alpha$ remains unchanged for any choice of $\mathbf{a}$, so linear combinations of the components $\mu_i$ that merit inspection based upon an examination of the data can be estimated.
In addition, according to the results in Supplement 5A, we can include the statements about $(\mu_i, \mu_k)$ belonging to the sample mean-centered ellipses

$$n[\bar{x}_i - \mu_i, \; \bar{x}_k - \mu_k]\begin{bmatrix} s_{ii} & s_{ik} \\ s_{ik} & s_{kk} \end{bmatrix}^{-1}\begin{bmatrix} \bar{x}_i - \mu_i \\ \bar{x}_k - \mu_k \end{bmatrix} \le \frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha) \tag{5-26}$$
In Example 5.3, we obtained the 95% confidence ellipse for the means of the fourth roots of the door-closed and door-open microwave radiation mea-
Sec. 5.4 Confidence Regions and Simultaneous Comparisons of Component Means
243
surements. The 95% simultaneous T2 intervals for the two component means are, from (5-24), ( X-1 - \jfp((nn - p)1) Fp, n -p (.05) \j--;rs:; , X1 + \j/p((nn _-p)1 ) Fp, n -p ( .05) \j--;rs:; ) /2 (41) �·0144 ( ·564 - \j/2 (41) �:9_144 ) or ( ·516' ·612) 40 3 · 23 42 ' · 564 + \j 40 3 · 23 42 �-2 • x2 + \j�W ( n - 1 ) Fp, n -p (.05) \j--;; ( X-2 - \jfP(n (}; -p)1) Fp, n - p (.05)- \j-;; ( n p) /s;; ) ( .603 - �2 ��1 ) 3.23 �.o�;6 , .603 + �2 ��1 ) 3.23 �.o�;6 ) or ( .555, .651) In Figure 5.2, we have redrawn the 95% confidence ellipse from Exam ple 5.3. The 95% simultaneous intervals are shown as shadows, or projections, of this ellipse on the axes of the component means. _
_
_
_
_
•
Jl z
N \0
ci
I
- - - - - - - - - · -
0.500
-
-
-
-
-
-
0.552
-
-
-
-
-
-
-
-
-
-
-
0.604
Simultaneous T2-intervals for the component means as shadows of the confidence ellipse on the axes-microwave radiation data.
Figure 5.2
244
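The shadow intervals of Example 5.4 can be reproduced directly from (5-24). The sketch below assumes NumPy/SciPy (not used in the text) and reuses the rounded summary statistics of Example 5.3.

```python
# Sketch of the simultaneous T^2 intervals (5-24) for the microwave data.
import numpy as np
from scipy.stats import f

n, p = 42, 2
xbar = np.array([0.564, 0.603])
S = np.array([[0.0144, 0.0117],
              [0.0117, 0.0146]])

c = np.sqrt(p * (n - 1) / (n - p) * f.ppf(0.95, p, n - p))
half_widths = c * np.sqrt(np.diag(S) / n)
print(np.column_stack([xbar - half_widths, xbar + half_widths]))
# approximately (.516, .612) for mu_1 and (.555, .651) for mu_2
```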
Example 5.5 (Constructing simultaneous confidence intervals and ellipses)

The scores obtained by n = 87 college students on the College Level Examination Program (CLEP) subtest $X_1$ and the College Qualification Test (CQT) subtests $X_2$ and $X_3$ are given in Table 5.2 for $X_1$ = social science and history, $X_2$ = verbal, and $X_3$ = science. These data give

$$\bar{\mathbf{x}} = \begin{bmatrix} 527.74 \\ 54.69 \\ 25.13 \end{bmatrix} \quad \text{and} \quad \mathbf{S} = \begin{bmatrix} 5691.34 & 600.51 & 217.25 \\ 600.51 & 126.05 & 23.37 \\ 217.25 & 23.37 & 23.11 \end{bmatrix}$$

Let us compute the 95% simultaneous confidence intervals for $\mu_1$, $\mu_2$, and $\mu_3$.

TABLE 5.2 COLLEGE TEST DATA
(x1 = social science and history, x2 = verbal, x3 = science)

  Ind.   x1   x2  x3  |  Ind.   x1   x2  x3
    1   468   41  26  |   45   494   41  24
    2   428   39  26  |   46   541   47  25
    3   514   53  21  |   47   362   36  17
    4   547   67  33  |   48   408   28  17
    5   614   61  27  |   49   594   68  23
    6   501   67  29  |   50   501   25  26
    7   421   46  22  |   51   687   75  33
    8   527   50  23  |   52   633   52  31
    9   527   55  19  |   53   647   67  29
   10   620   72  32  |   54   647   65  34
   11   587   63  31  |   55   614   59  25
   12   541   59  19  |   56   633   65  28
   13   561   53  26  |   57   448   55  24
   14   468   62  20  |   58   408   51  19
   15   614   65  28  |   59   441   35  22
   16   527   48  21  |   60   435   60  20
   17   507   32  27  |   61   501   54  21
   18   580   64  21  |   62   507   42  24
   19   507   59  21  |   63   620   71  36
   20   521   54  23  |   64   415   52  20
   21   574   52  25  |   65   554   69  30
   22   587   64  31  |   66   348   28  18
   23   488   51  27  |   67   468   49  25
   24   488   62  18  |   68   507   54  26
   25   587   56  26  |   69   527   47  31
   26   421   38  16  |   70   527   47  26
   27   481   52  26  |   71   435   50  28
   28   428   40  19  |   72   660   70  25
   29   640   65  25  |   73   733   73  33
   30   574   61  28  |   74   507   45  28
   31   547   64  27  |   75   527   62  29
   32   580   64  28  |   76   428   37  19
   33   494   53  26  |   77   481   48  23
   34   554   51  21  |   78   507   61  19
   35   647   58  23  |   79   527   66  23
   36   507   65  23  |   80   488   41  28
   37   454   52  28  |   81   607   69  28
   38   427   57  21  |   82   561   59  34
   39   521   66  26  |   83   614   70  23
   40   468   57  14  |   84   527   49  30
   41   587   55  30  |   85   474   41  16
   42   507   61  31  |   86   441   47  26
   43   574   54  31  |   87   607   67  32
   44   507   53  23  |

Source: Data courtesy of Richard W. Johnson.

We have

$$\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha) = \frac{3(87-1)}{(87-3)}F_{3,84}(.05) = \frac{3(86)}{84}(2.7) = 8.29$$

and we obtain the simultaneous confidence statements [see (5-24)]

$$527.74 - \sqrt{8.29}\sqrt{\frac{5691.34}{87}} \le \mu_1 \le 527.74 + \sqrt{8.29}\sqrt{\frac{5691.34}{87}} \quad \text{or} \quad 504.45 \le \mu_1 \le 551.03$$

$$54.69 - \sqrt{8.29}\sqrt{\frac{126.05}{87}} \le \mu_2 \le 54.69 + \sqrt{8.29}\sqrt{\frac{126.05}{87}} \quad \text{or} \quad 51.22 \le \mu_2 \le 58.16$$

$$25.13 - \sqrt{8.29}\sqrt{\frac{23.11}{87}} \le \mu_3 \le 25.13 + \sqrt{8.29}\sqrt{\frac{23.11}{87}} \quad \text{or} \quad 23.65 \le \mu_3 \le 26.61$$

With the possible exception of the verbal scores, the marginal Q-Q plots and two-dimensional scatter plots do not reveal any serious departures from normality for the college qualification test data. (See Exercise 5.16.) Moreover, the sample size is large enough to justify the methodology, even though the data are not quite normally distributed. (See Section 5.5.)
The simultaneous $T^2$-intervals above are wider than univariate intervals because all three must hold with 95% confidence. They may also be wider than necessary, because, with the same confidence, we can make statements about differences. For instance, with $\mathbf{a}' = [0, 1, -1]$, the interval for $\mu_2 - \mu_3$ has endpoints

$$(\bar{x}_2 - \bar{x}_3) \pm \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(.05)}\sqrt{\frac{s_{22} + s_{33} - 2s_{23}}{n}} = (54.69 - 25.13) \pm \sqrt{8.29}\sqrt{\frac{126.05 + 23.11 - 2(23.37)}{87}} = 29.56 \pm 3.12$$

so (26.44, 32.68) is a 95% confidence interval for $\mu_2 - \mu_3$. Simultaneous intervals can also be constructed for the other differences.
Finally, we can construct confidence ellipses for pairs of means, and the same 95% confidence holds. For example, for the pair $(\mu_2, \mu_3)$, we have

$$87[54.69 - \mu_2, \; 25.13 - \mu_3]\begin{bmatrix} 126.05 & 23.37 \\ 23.37 & 23.11 \end{bmatrix}^{-1}\begin{bmatrix} 54.69 - \mu_2 \\ 25.13 - \mu_3 \end{bmatrix} = 0.849(54.69 - \mu_2)^2 + 4.633(25.13 - \mu_3)^2 - 2(0.859)(54.69 - \mu_2)(25.13 - \mu_3) \le 8.29$$

This ellipse is shown in Figure 5.3, along with the 95% confidence ellipses for the other two pairs of means. The projections or shadows of these ellipses on the axes are also indicated, and these projections are the $T^2$-intervals. ■

[Figure 5.3 95% confidence ellipses for pairs of means and the simultaneous T²-intervals (college test data).]
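The interval computations of Example 5.5 follow directly from Result 5.3. A sketch, assuming NumPy/SciPy (not used in the text) and the reported summary statistics:

```python
# Simultaneous T^2 intervals and the mu_2 - mu_3 interval of Example 5.5.
import numpy as np
from scipy.stats import f

n, p = 87, 3
xbar = np.array([527.74, 54.69, 25.13])
S = np.array([[5691.34, 600.51, 217.25],
              [ 600.51, 126.05,  23.37],
              [ 217.25,  23.37,  23.11]])

c2 = p * (n - 1) / (n - p) * f.ppf(0.95, p, n - p)      # about 8.3

def t2_interval(a):
    centre = a @ xbar
    half = np.sqrt(c2) * np.sqrt(a @ S @ a / n)
    return centre - half, centre + half

for a in np.eye(p):
    print(t2_interval(a))      # about (504, 551), (51.2, 58.2), (23.6, 26.6)

print(t2_interval(np.array([0.0, 1.0, -1.0])))   # about (26.4, 32.7) for mu_2 - mu_3
```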
An alternative approach to the construction of confidence intervals=is to consider the components one at a time, as suggested by (5-21) with a' [0, . . . , 0, a;, 0, . . . , 0] where a; 1. This approach ignores the covariance structure of the vari ables and leads to the intervals xl - tn - l (a/2) �n � P- 1 � xl + til - l (a/2) � X2 l11 _ 1 (a/2) �n � /J-z � x2 + (11 _ 1 (a/2) f; (5-27) p
/J- ; =
-
x + tn - l (a/2) fj xp - tn - l (a/2) fj n � p.P � p Although prior to sampling, the ith interval has probability 1 a of cover ing we do not know what to assert, in general, about the probability of all -
!J- ; ,
Sec.
00 ll)
- -
r
5.4
Confidence Regions and Simultaneous Comparisons of Component Means
- - - - - - - - - - - - -
247
�-�---�- 1 J.l 3
0 ll)
500
522
544
522
500
- -1- I
50.5
544
�-�--�-,-:_:-:_ - - - - - - - - - 54.5
58.5
2 Figure 5 . 3 95% confidence ellipses for pairs of means and the simultaneous T -inter vals-college test data.
intervals containing their respective t-t/s. As we have pointed out, this probability is not 1 - a. To shed some light on the problem, consider the special case where the obser vations have a joint normal distribution and I
� rr� ' :, l0 0
,
�]
(Jpp
Since the observations on the first variable are independent of those on the second variable, and so on, the product rule for independent events can be applied, and before the sample is selected, P [all t-intervals in ( 5-27) contain the t-t/ s] (1 a ) (1 - a ) · · · (1 - a ) (1
a) P
248
Chap. 5 Inferences about a Mean Vector
If 1 - a = .95 and p = 6, this probability is (.95)6 = .74. To guarantee a probability of 1 - a that all of the statements about the component means hold simultaneously, the individual intervals must be wider than the sep�rate t-intervals ; just how much wider depends on both p and n , as well as on 1 - a. For 1 - a = .95, n = 15, and p = 4, the multipliers of y;;:;;; in (5-24) and (5-27) are
�p((n - 1)) Fp. n -p ( .OS ) = �----u4 (14) (3.36) = 4.14 n p and t11 (.025) = 2.145, respectively. Consequently, in this case the simultaneous intervals_ 1 are 100 (4.14 - 2.145)/2.145 = 93% wider than those derived from the _
one-at-a-time t method. Table 5.3 gives some critical distance multipliers for one-at-a-time t-intervals computed according to (5-21), as well2 as the corresponding simultaneous T2-inter vals. In general, the width of the T -intervals, relative to the t-intervals, increases for fixed n as p increases and decreases for fixed p as n increases. CRITICAL DISTANCE M U LTIPLI ERS FOR O N E-AT-A-TI M E t I NTERVALS AN D T2-I NTERVALS FOR SELECTED n AN D p ( 1 .95)
TABLE 5.3
a=
� ((nn - 1)p) Fp, n -p ( .OS ) p p = 10 p=4 _
n
15 25 50 100 00
fn - 1 (.025)
2.145 2.064 2.010 1.970 1.960
4.14 3.60 3.31 3.19 3.08
11.52 6.39 5.05 4.61 4.28
5.3 is a bit unfair, since the confidence level The comparison implied ofby TTable 2 fixed n and p, is .95, and the overn , associated with any collection -intervals, for t intervals, for the same individual of all confidence associated with a collection t intervals are too less than .95. The one-at-a-time can, as we have seen, be much about, say, all statements confideqce level for separate short to maintain an overall information possible best the look at them as means. Nevertheless, we issometimes pconcerning if the one Moreover, inference to be made. the null hypothesis, a mean, if this the only only when the T2 test rejects at-a-time intervals are calculated some researchers think2 they may more accurately represent the information about the means than the T intervals do.
Sec. 5.4 Confidence Regions and Simultaneous Comparisons of Component Means
249
The T2 intervals are too wide if they are applied only to the p component means. To see why, consider the confidence ellipse and the simultaneous intervals shown in Figure 5.2. If lies in its T2 interval and J.L z lies in its T2 interval, then ( J.L 1 , J.Lz) lies in the rectangle formed by these two intervals. This rectangle con tains the confidence ellipse and more. The confidence ellipse is smaller, but has probability . 95 of covering the mean vector p with its component means J.Lt and J.Lz · Consequently, the probability of covering the two individual means J.L 1 and J.Lz will be larger than .95 for the rectangle formed by the T2 intervals. This result leads us to consider a second approach to making multiple comparisons known as the Bonferroni method. 1-L t
The Bonferroni Method of Multiple Comparisons
Often, attention is restricted to a small number of individual confidence statements. In these situations it is possible to db better than the simultaneous intervals of Result 5.3. If the number m of specified component means J.L; or linear combina tions a' p = a1 J.t + a2 J.L2 + . . + ap J.tp is small, simultaneous confidence intervals can be developed that are shorter (more precise) than the simultaneous T2-inter vals. The alternative method for multiple comparisons is called the Bonferroni method, because it is developed from a probability inequality carrying that name. Suppose that, prior to the collection of data, confidence statements about m linear combinations a; p, a�p, . . . , a;11 /L are required. Let C; denote a confidence statement about the value of a; p with P[C; true] = 1 - a;, i = 1, 2, . . . , m. Now (see Exercise 5.6), [at least one ci false] :;::: 1 L ( C; false) = 1 - L ( 1 ( ci true)) i=l i= l (5-28) = 1 - (a 1 + a2 + . . . + a111 ) Inequality (5-28), a special case of the Bonferroni inequality, allows an inves tigator to control the overall error rate a 1 + a 2 + + an. , regardless of the cor relation structure behind the confidence statements. There is also the flexibility of controlling the error rate for a group of important statements and balancing it by another choice for the less important statements. Let us develop simultaneous interval estimates for the restricted set consist ing of the components J.L; of p. Lacking information on the relative importance of these components, we consider the individual t-intervals i = 1, 2, . . . , m with a; = a/m. Since P[X; ± tn - t (a/2m) � contains J.L;] 1 - a/m, i 1, 2, . . . , m, we have, from (5-28), 1
·
p
Ill
p
Ill
· · ·
- p
250
Chap. 5 Inferences about a Mean Vector
= 1 - a
m
terms
Therefore, with an overall confidence level greater than or equal to 1 - a, we can make the following m = p statements:
( �) i;t, _ 1 ( 2�) i;-
X1 - t, _ 1 2
Xz
-
�
IL1
�
�
/Lz
�
c : ) i;Xz + tn - 1 ( 2�) iii Xr + t, _ r
(5-29)
The statements in (5-29) can be compared with those in (5-24). The percent age point t (a/2p) replaces Y (n - 1)pFp, n - p (a)/(n - p), but otherwise the intervals are,_of1 the same structure. Example 5.6
(Constructing Bonferroni simultaneous confidence intervals and comparing them with T2 intervals)
Let us return to the microwave oven radiation data in Examples 5.3 and 5.4. We shall obtain the simultaneous 95% Bonferroni confidence intervals for the means, �t 1 and ILz • of the fourth roots of the door-closed and door-open mea surements with a; = .05/2, i = 1, 2. We make use of the results in Example 5.3, noting that n = 42 and t4 1 (.05/2(2)) = t4 1 (.0125) = 2.327, to get .0144 rs:; = .564 ± 2.327 1� or .521 �t 1 .607 x-1 ± t4 1 ( .0125 ) y;
.0146 or .560 r;;;, = .603 ± 2.327 1� ± t41 (.0125) -y-;;IL z .646 Figure 5.4 on page 251 shows the 95% T2 simultaneous confidence intervals for J.L 1 , J.Lz from Figure 5.2, along with the corresponding 95% Bonferroni intervals. For each component mean, the Bonferroni interval falls within the T2 interval. Consequently, the rectangular (j oint) region formed by the two Bonferroni intervals is contained in the rectangular region formed by the two T2 intervals. If we are interested only in the component2 means, the Bon ferroni intervals provide more precise estimates than the T intervals. On the other hand, the 95% confidence region for p gives the plausible values for x-2
�
�
�
�
Sec. 5.4 Confidence Regions and Simultaneous Comparisons of Component Means
'"' '"' ci .65 1 .64 6
251
- - - - - - - - - -' - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ' ....,
Bonferroni
.560 .555
' '
...
- - - - - - - - - . -'-'-- -� ' ' '
' ' ' ' '
' ' '
' '
' '
1 --------,-------,--------,--�----� � � .------L-, .607 "6 1 2 . 5 1 6 _52 1
0 . 500
0.552
0.604
2
Figure 5.4 The 95% T and 95% Bonferroni simultaneous confidence intervals for the component means-microwave radiation data.
the pairs (JL 1 , JLz ) when the correlation between the measured variables is taken into account. The Bonferroni intervals for linear combinations a' p and the analogous T2intervals (recall Result 5.3) have the same general form: a'X ± (critical value) � Consequently, in every instance where a; = a/m, Length of Bonferroni interval tn - l (a/2m) 2 Length of T -interval �p (n - 1 ) F n (a) (5-30) •
n-p
p,
-
p
which does not depend on the random quantities X and S. As we have pointed out, for a small number m of specified parametric functions a' p, the Bonferroni intervals will always be shorter. How much shorter is indicated in Table 5.4 for selected n and p.
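The entries of Table 5.4 below follow directly from (5-30). A sketch of the calculation, assuming SciPy (not used in the text):

```python
# Ratio (5-30) of Bonferroni to T^2 interval lengths for selected n and m = p.
from scipy.stats import t, f

def length_ratio(n, p, alpha=0.05):
    bonf = t.ppf(1 - alpha / (2 * p), n - 1)
    tsq = (p * (n - 1) / (n - p) * f.ppf(1 - alpha, p, n - p)) ** 0.5
    return bonf / tsq

for n in (15, 25, 50, 100):
    print(n, [round(length_ratio(n, p), 2) for p in (2, 4, 10)])
# each row should match the corresponding row of Table 5.4 to two decimals
```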
TABLE 5.4 (LENGTH OF BONFERRONI INTERVAL)/(LENGTH OF T²-INTERVAL) FOR 1 − α = .95 AND α_i = .05/m

                    m = p
   n          2        4       10
   15        .88      .69      .29
   25        .90      .75      .48
   50        .91      .78      .58
   100       .91      .80      .62
   ∞         .91      .81      .66
We see from Table 5.4 that the Bonferroni method provides shorter intervals when m = p. Because they are easy to apply and provide the relatively short confidence intervals needed for inference, we will often apply simultaneous t-intervals based on the Bonferroni method.

5.5 LARGE SAMPLE INFERENCES ABOUT A POPULATION MEAN VECTOR
When the sample size is large, tests of hypotheses and confidence regions for μ can be constructed without the assumption of a normal population. As illustrated in Exercises 5.13, 5.14, and 5.15, for large n, we are able to make inferences about the population mean even though the parent distribution is discrete. In fact, serious departures from a normal population can be overcome by large sample sizes. Both tests of hypotheses and simultaneous confidence statements will then possess (approximately) their nominal levels.

The advantages associated with large samples may be partially offset by a loss in sample information caused by using only the summary statistics x̄ and S. On the other hand, since (x̄, S) is a sufficient summary for normal populations [see (4-21)], the closer the underlying population is to multivariate normal, the more efficiently the sample information will be utilized in making inferences.

All large-sample inferences about μ are based on a χ²-distribution. From (4-28), we know that (X̄ − μ)'(n⁻¹S)⁻¹(X̄ − μ) = n(X̄ − μ)'S⁻¹(X̄ − μ) is approximately χ² with p d.f., and thus,

   P[n(X̄ − μ)'S⁻¹(X̄ − μ) ≤ χ²_p(α)] ≈ 1 − α          (5-31)

where χ²_p(α) is the upper (100α)th percentile of the χ²_p-distribution.

Equation (5-31) immediately leads to large sample tests of hypotheses and simultaneous confidence regions. These procedures are summarized in Results 5.4 and 5.5.
Result 5.4. Let X_1, X_2, ..., X_n be a random sample from a population with mean μ and positive definite covariance matrix Σ. When n − p is large, the hypothesis H_0: μ = μ_0 is rejected in favor of H_1: μ ≠ μ_0, at a level of significance approximately α, if the observed

   n(x̄ − μ_0)'S⁻¹(x̄ − μ_0) > χ²_p(α)

Here χ²_p(α) is the upper (100α)th percentile of a chi-square distribution with p d.f. ■
Comparing the test in Result 5.4 with the corresponding normal theory test in (5-7), we see that the test statistics have the same structure, but the critical values are different. A closer examination, however, reveals that both tests yield essentially the same result in situations where the χ²-test of Result 5.4 is appropriate. This follows directly from the fact that (n − 1)p F_{p,n−p}(α)/(n − p) and χ²_p(α) are approximately equal for n large relative to p. (See Tables 3 and 4 in the appendix.)

Result 5.5. Let X_1, X_2, ..., X_n be a random sample from a population with mean μ and positive definite covariance Σ. If n − p is large,

   a'X̄ ± √(χ²_p(α)) √(a'Sa/n)

will contain a'μ, for every a, with probability approximately 1 − α. Consequently, we can make the 100(1 − α)% simultaneous confidence statements

   x̄_1 ± √(χ²_p(α)) √(s_11/n)   contains μ_1
   x̄_2 ± √(χ²_p(α)) √(s_22/n)   contains μ_2
     ⋮
   x̄_p ± √(χ²_p(α)) √(s_pp/n)   contains μ_p

and, in addition, for all pairs (μ_i, μ_k), i, k = 1, 2, ..., p, the sample mean-centered ellipses

   n [x̄_i − μ_i, x̄_k − μ_k] [ s_ii  s_ik ; s_ik  s_kk ]⁻¹ [ x̄_i − μ_i ; x̄_k − μ_k ] ≤ χ²_p(α)   contain (μ_i, μ_k)

Proof. The first part follows from Result 5A.1, with c² = χ²_p(α). The probability level is a consequence of (5-31). The statements for the μ_i are obtained by the special choices a' = [0, ..., 0, a_i, 0, ..., 0], where a_i = 1, i = 1, 2, ..., p. The ellipsoids for pairs of means follow from Result 5A.2 with c² = χ²_p(α). The overall confidence level of approximately 1 − α for all statements is, once again, a result of the large sample distribution theory summarized in (5-31). ■
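The large-sample test of Result 5.4 amounts to a single matrix computation. A minimal Python sketch (the function name and the requirement that the data arrive as an n × p array are our own choices) is:

import numpy as np
from scipy import stats

def large_sample_mean_test(X, mu0, alpha=0.05):
    """Reject H0: mu = mu0 when n (xbar - mu0)' S^{-1} (xbar - mu0) exceeds chi2_p(alpha)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)            # sample covariance, divisor n - 1
    diff = xbar - np.asarray(mu0, dtype=float)
    stat = n * diff @ np.linalg.solve(S, diff)
    crit = stats.chi2.ppf(1 - alpha, p)
    return stat, crit, stat > crit

For the normal-theory version of the test, the critical value χ²_p(α) would simply be replaced by (n − 1)p F_{p,n−p}(α)/(n − p).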
It is good statistical practice to subject these large sample inference proce dures to the same checks required of the normal-theory methods. Although small to moderate departures from normality do not cause any difficulties for n large, extreme deviations could cause problems. Specifically, the true error rate may be far removed from the nominal level a. If, on the basis of Q-Q plots and other investigative devices, outliers and other forms of extreme departures are indicated (see, for example, [2]), appropriate corrective actions, including transformations, are desirable. Methods for testing mean vectors of symmetric multivariate distrib utions that are relatively insensitive to departures from normality are discussed in [10]. In some instances, Results 5.4 and 5.5 are useful only for very large samples. The next example allows us to illustrate the construction of large sample simultaneous statements for single mean components. Example 5.7
(Constructing large sample simultaneous confidence intervals)
A music educator tested thousands of Finnish students on their native musical ability in order to set national norms in Finland. Summary statistics for part of the data set are given in Table 5.5. These statistics are based on a sample of n = 96 Finnish 12th graders.

TABLE 5.5 MUSICAL APTITUDE PROFILE MEANS AND STANDARD DEVIATIONS FOR 96 12TH-GRADE FINNISH STUDENTS PARTICIPATING IN A STANDARDIZATION PROGRAM

                               Raw score
   Variable            Mean (x̄_i)   Standard deviation (√s_ii)
   x1 = melody            28.1              5.76
   x2 = harmony           26.6              5.85
   x3 = tempo             35.4              3.82
   x4 = meter             34.2              5.12
   x5 = phrasing          23.6              3.76
   x6 = balance           22.0              3.93
   x7 = style             22.7              4.03

   Source: Data courtesy of V. Sell.

Let us construct 90% simultaneous confidence intervals for the individual mean components μ_i, i = 1, 2, ..., 7. From Result 5.5, simultaneous 90% confidence limits are given by x̄_i ± √(χ²_7(.10)) √(s_ii/n), i = 1, 2, ..., 7, where χ²_7(.10) = 12.02. Thus, with approximately 90% confidence,
   28.1 ± √12.02 (5.76/√96)   contains μ_1   or   26.06 ≤ μ_1 ≤ 30.14
   26.6 ± √12.02 (5.85/√96)   contains μ_2   or   24.53 ≤ μ_2 ≤ 28.67
   35.4 ± √12.02 (3.82/√96)   contains μ_3   or   34.05 ≤ μ_3 ≤ 36.75
   34.2 ± √12.02 (5.12/√96)   contains μ_4   or   32.39 ≤ μ_4 ≤ 36.01
   23.6 ± √12.02 (3.76/√96)   contains μ_5   or   22.27 ≤ μ_5 ≤ 24.93
   22.0 ± √12.02 (3.93/√96)   contains μ_6   or   20.61 ≤ μ_6 ≤ 23.39
   22.7 ± √12.02 (4.03/√96)   contains μ_7   or   21.27 ≤ μ_7 ≤ 24.13

Based, perhaps, upon thousands of American students, the investigator could hypothesize the musical aptitude profile to be

   μ_0' = [31, 27, 34, 31, 23, 22, 22]

We see from the simultaneous statements above that the melody, tempo, and meter components of μ_0 do not appear to be plausible values for the corresponding means of Finnish scores. ■

When the sample size is large, the one-at-a-time confidence intervals for individual means are

   x̄_i − z(α/2) √(s_ii/n) ≤ μ_i ≤ x̄_i + z(α/2) √(s_ii/n),   i = 1, 2, ..., p

where z(α/2) is the upper 100(α/2)th percentile of the standard normal distribution. The Bonferroni simultaneous confidence intervals for the m = p statements about the individual means take the same form, but use the modified percentile z(α/2p) to give

   x̄_i − z(α/2p) √(s_ii/n) ≤ μ_i ≤ x̄_i + z(α/2p) √(s_ii/n),   i = 1, 2, ..., p

Table 5.6 gives the individual, Bonferroni, and chi-square-based (or shadow of the confidence ellipsoid) intervals for the musical aptitude data in Example 5.7.
TABLE 5.6 THE LARGE-SAMPLE 95% INDIVIDUAL, BONFERRONI, AND T²-INTERVALS FOR THE MUSICAL APTITUDE DATA

   The one-at-a-time confidence intervals use z(.025) = 1.96.
   The simultaneous Bonferroni intervals use z(.025/7) = 2.69.
   The simultaneous T², or shadows of the ellipsoid, use χ²_7(.05) = 14.07.

                      One-at-a-time       Bonferroni        Shadow of Ellipsoid
   Variable           Lower   Upper      Lower   Upper        Lower   Upper
   x1 = melody        26.95   29.25      26.52   29.68        25.90   30.30
   x2 = harmony       25.43   27.77      24.99   28.21        24.36   28.84
   x3 = tempo         34.64   36.16      34.35   36.45        33.94   36.86
   x4 = meter         33.18   35.22      32.79   35.61        32.24   36.16
   x5 = phrasing      22.85   24.35      22.57   24.63        22.16   25.04
   x6 = balance       21.21   22.79      20.92   23.08        20.50   23.50
   x7 = style         21.89   23.51      21.59   23.81        21.16   24.24
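All three sets of limits in Table 5.6 follow from the summary statistics in Table 5.5 and one critical value per method. A short Python sketch (array names are ours) that reproduces the table to rounding error is:

import numpy as np
from scipy import stats

n = 96
xbar = np.array([28.1, 26.6, 35.4, 34.2, 23.6, 22.0, 22.7])   # Table 5.5 means
sd   = np.array([5.76, 5.85, 3.82, 5.12, 3.76, 3.93, 4.03])   # Table 5.5 standard deviations
se = sd / np.sqrt(n)
p, alpha = len(xbar), 0.05

crits = {
    "one-at-a-time": stats.norm.ppf(1 - alpha / 2),          # 1.96
    "Bonferroni":    stats.norm.ppf(1 - alpha / (2 * p)),    # 2.69
    "T2 shadow":     np.sqrt(stats.chi2.ppf(1 - alpha, p)),  # sqrt(14.07)
}
for name, c in crits.items():
    print(name, np.round(np.c_[xbar - c * se, xbar + c * se], 2))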
Although the sample size may be large, some statisticians prefer to retain the F- and t-based percentiles rather than use the chi-square or standard normal-based percentiles. The latter constants are the infinite sample size limits of the former constants. The F and t percentiles produce larger intervals and, hence, are more conservative. Table 5.7 gives the individual, Bonferroni, and F-based, or shadow of the confidence ellipsoid, intervals for the musical aptitude data.

TABLE 5.7 THE 95% INDIVIDUAL, BONFERRONI, AND T²-INTERVALS FOR THE MUSICAL APTITUDE DATA

   The one-at-a-time confidence intervals use t_95(.025) = 1.99.
   The simultaneous Bonferroni intervals use t_95(.025/7) = 2.75.
   The simultaneous T², or shadows of the ellipsoid, use F_{7,89}(.05) = 2.11.

                      One-at-a-time       Bonferroni        Shadow of Ellipsoid
   Variable           Lower   Upper      Lower   Upper        Lower   Upper
   x1 = melody        26.93   29.27      26.48   29.72        25.76   30.44
   x2 = harmony       25.41   27.79      24.96   28.24        24.23   28.97
   x3 = tempo         34.63   36.17      34.33   36.47        33.85   36.95
   x4 = meter         33.16   35.24      32.76   35.64        32.12   36.28
   x5 = phrasing      22.84   24.36      22.54   24.66        22.07   25.13
   x6 = balance       21.20   22.80      20.90   23.10        20.41   23.59
   x7 = style         21.88   23.52      21.57   23.83        21.07   24.33

Comparing Table 5.7 with Table 5.6, we see that all of the intervals in Table 5.7 are larger. However, with the relatively large sample size n = 96, the differences are typically in the third, or tenths, digit.

5.6 MULTIVARIATE QUALITY CONTROL CHARTS
To improve the quality of goods and services, data need to be examined for causes of variation. When a manufacturing process is continuously producing items or when we are monitoring activities of a service, data should be collected to evaluate the capabilities and stability of the process. When a process is stable, the variation is produced by common causes that are always present, and no one cause is a major source of variation.

The purpose of any control chart is to identify occurrences of special causes of variation that come from outside of the usual process. These causes of variation often indicate a need for a timely repair, but they can also suggest improvements to the process. Control charts make the variation visible and allow one to distinguish common from special causes of variation.

A control chart typically consists of data plotted in time order and horizontal lines, called control limits, that indicate the amount of variation due to common causes.

One useful control chart is the X̄ chart (read X-bar chart). To create an X̄ chart:

1. Plot the individual observations or sample means in time order.
2. Create and plot the centerline x̄, the sample mean of all of the observations.
3. Calculate and plot the control limits given by

   Upper control limit (UCL) = x̄ + 3(standard deviation)
   Lower control limit (LCL) = x̄ − 3(standard deviation)

The standard deviation in the control limits is the estimated standard deviation of the observations being plotted. For single observations, it is often the sample standard deviation. If the means of subsamples of size m are plotted, then the standard deviation is the sample standard deviation divided by √m. The control limits of plus and minus three standard deviations are chosen so that there is a very small chance, assuming normally distributed data, of falsely signaling an out-of-control observation, that is, an observation suggesting a special cause of variation.
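The three steps above translate directly into a few lines of code. A minimal sketch (the function name and the assumption that the observations arrive as a one-dimensional array are ours) is:

import numpy as np

def xbar_chart_limits(x, subsample_size=1):
    """Centerline and 3-sigma control limits for an X-bar chart of individual values or subsample means."""
    x = np.asarray(x, dtype=float)
    center = x.mean()
    sd = x.std(ddof=1) / np.sqrt(subsample_size)
    return center - 3 * sd, center, center + 3 * sd

Applied to the legal appearances overtime hours of Example 5.8 below, this gives approximately LCL = 1737, centerline 3558, and UCL = 5379.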
Example 5.8 (Creating a univariate control chart)
The Madison, Wisconsin, police department regularly monitors many of its activities as part of an ongoing quality improvement program. Table 5.8 gives the data on five different kinds of overtime hours. Each observation represents a total for 12 pay periods, or about half a year.
TABLE 5.8 FIVE TYPES OF OVERTIME HOURS FOR THE MADISON, WISCONSIN, POLICE DEPARTMENT

   x1                  x2                 x3          x4         x5
   Legal Appearances   Extraordinary      Holdover    COA¹       Meeting
   Hours               Event Hours        Hours       Hours      Hours
   3387                2200               1181        14,861      236
   3109                 875               3532        11,367      310
   2670                 957               2502        13,329     1182
   3125                1758               4510        12,328     1208
   3469                 868               3032        12,847     1385
   3120                 398               2130        13,979     1053
   3671                1603               1982        13,528     1046
   4531                 523               4675        12,699     1100
   3678                2034               2354        13,534     1349
   3238                1136               4606        11,609     1150
   3135                5326               3044        14,189     1216
   5217                1658               3340        15,052      660
   3728                1945               2111        12,236      299
   3506                 344               1291        15,482      206
   3824                 807               1365        14,900      239
   3516                1223               1175        15,078      161

   ¹Compensatory overtime allowed.
We examine the stability of the legal appearances overtime hours. A computer calculation gives x̄_1 = 3558. Since individual values will be plotted, x̄_1 is also the centerline. Also, the sample standard deviation is √s_11 = 607, and the control limits are

   UCL = x̄_1 + 3(√s_11) = 3558 + 3(607) = 5379
   LCL = x̄_1 − 3(√s_11) = 3558 − 3(607) = 1737

The data, along with the centerline and control limits, are plotted as an X̄ chart in Figure 5.5. The legal appearances overtime hours are stable over the period in which the data were collected. The variation in overtime hours appears to be due to common causes, so no special-cause variation is indicated. ■

Figure 5.5 The X̄ chart for x1 = legal appearances overtime hours.

With more than one important characteristic, a multivariate approach should be used to monitor process stability. Such an approach can account for correlations between characteristics and will control the overall probability of falsely signaling a special cause of variation when one is not present. High correlations among the
variables can make it impossible to assess the overall error rate that is implied by a large number of univariate charts.

The two most common multivariate charts are (i) the ellipse format chart and (ii) the T² chart. Two cases that arise in practice need to be treated differently:

1. Monitoring the stability of a given sample of multivariate observations
2. Setting a control region for future observations

Initially, we consider the use of multivariate control procedures for a sample of multivariate observations x_1, x_2, ..., x_n. Later, we discuss these procedures when the observations are subgroup means.

Charts for Monitoring a Sample of Individual Multivariate Observations for Stability
We assume that X_1, X_2, ..., X_n are independently distributed as N_p(μ, Σ). By Result 4.8,

   X_j − X̄ = (1 − 1/n)X_j − (1/n)X_1 − ··· − (1/n)X_{j−1} − (1/n)X_{j+1} − ··· − (1/n)X_n

has

   Cov(X_j − X̄) = (1 − 1/n)²Σ + (n − 1)n⁻²Σ = ((n − 1)/n)Σ

so X_j − X̄ is normally distributed with mean 0 and covariance matrix ((n − 1)/n)Σ. However, X_j − X̄ is not independent of the sample covariance matrix S, so we use the approximate chi-square distribution to set control limits.

Ellipse Format Chart. The ellipse format chart for a bivariate control region is the more intuitive of the charts, but its approach is limited to two variables. The two characteristics on the jth unit are plotted as a pair (x_j1, x_j2). The 95% ellipse consists of all x that satisfy

   (x − x̄)'S⁻¹(x − x̄) ≤ χ²_2(.05)          (5-32)

Example 5.9 (An ellipse format chart for overtime hours)
Let us refer to Example 5.8 and create a quality ellipse for the pair of overtime characteristics (legal appearances, extraordinary event) hours. A computer calculation gives

   x̄ = [3558, 1478]'   and   S = [ 367,884.7   −72,093.8 ; −72,093.8   1,399,053.1 ]

We illustrate the ellipse format chart using the 99% ellipse, which consists of all x that satisfy (x − x̄)'S⁻¹(x − x̄) ≤ χ²_2(.01). Here p = 2, so χ²_2(.01) = 9.21, and the ellipse becomes

   (s_11 s_22 / (s_11 s_22 − s_12²)) [ (x_1 − x̄_1)²/s_11 − 2 s_12 (x_1 − x̄_1)(x_2 − x̄_2)/(s_11 s_22) + (x_2 − x̄_2)²/s_22 ]

      = (367,884.7 × 1,399,053.1 / (367,884.7 × 1,399,053.1 − (−72,093.8)²))
        × [ (x_1 − 3558)²/367,884.7 − 2(−72,093.8)(x_1 − 3558)(x_2 − 1478)/(367,884.7 × 1,399,053.1) + (x_2 − 1478)²/1,399,053.1 ] ≤ 9.21

This ellipse format chart is graphed, along with the pairs of data, in Figure 5.6.
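Whether a given pair of overtime values lies inside this 99% ellipse can be checked without expanding the quadratic form. A minimal sketch (the data array simply repeats the x1 and x2 columns of Table 5.8; variable names are ours) is:

import numpy as np
from scipy import stats

# (legal appearances, extraordinary event) overtime hours from Table 5.8
X = np.array([[3387, 2200], [3109,  875], [2670,  957], [3125, 1758],
              [3469,  868], [3120,  398], [3671, 1603], [4531,  523],
              [3678, 2034], [3238, 1136], [3135, 5326], [5217, 1658],
              [3728, 1945], [3506,  344], [3824,  807], [3516, 1223]], dtype=float)

xbar = X.mean(axis=0)                        # approximately (3558, 1478)
S = np.cov(X, rowvar=False)                  # approximately the S shown above
Sinv = np.linalg.inv(S)
d2 = np.einsum('ij,jk,ik->i', X - xbar, Sinv, X - xbar)   # (x - xbar)' S^{-1} (x - xbar)

crit = stats.chi2.ppf(0.99, df=2)            # 9.21
print(np.where(d2 > crit)[0] + 1)            # period 11 falls outside the 99% ellipse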
Figure 5.6 The quality control 99% ellipse for legal appearances and extraordinary event overtime.
Notice that one point, indicated with an arrow, is definitely outside of the ellipse. When a point is out of the control region, individual X̄ charts are constructed. The X̄ chart for x1 was given in Figure 5.5; that for x2 is given in Figure 5.7.

When the lower control limit is less than zero for data that must be nonnegative, it is generally set to zero. The LCL = 0 limit is shown by the dashed line in Figure 5.7.

Was there a special cause of the single point for extraordinary event overtime that is far outside the upper control limit in Figure 5.7? During this period, the United States bombed a foreign capital, and students at Madison were protesting. A majority of the extraordinary overtime was used in that four-week period. Although, by its very definition, extraordinary overtime occurs only when special events occur and is therefore unpredictable, it still has a certain stability. ■

T² Chart. A T² chart can be applied to a large number of characteristics. Unlike the ellipse format, it is not limited to two variables. Moreover, the points are displayed in time order rather than as a scatter plot, and this makes patterns and trends visible.
Chap. 5 Inferences about a Mean Vector
Extraordinary Event Hours 6000 UCL = 5027
5000 4000 " :::>
Oj > Oj
· ;; '6 .5
:::> "0
3000 2000 x1 = 1 478
1 000 0 - 1 000 - 2000
LCL = - 207 1
- 3000 0
10
5
15
Observation Number
Figure 5.7
The X chart for
x2 =
extraordinary event hours.
For the jth point, we calculate the T² statistic

   T²_j = (x_j − x̄)'S⁻¹(x_j − x̄)          (5-33)

We then plot the T² values on a time axis. The lower control limit is zero, and we use the upper control limit

   UCL = χ²_p(.05)   or, sometimes,   χ²_p(.01)

There is no centerline in the T² chart. Notice that the T² statistic is the same as the quantity d_j² used to test normality in Section 4.6.

Example 5.10 (A T² chart for overtime hours)
Using the police department data in Example 5.8, we construct a T² plot based on the two variables x1 = legal appearances hours and x2 = extraordinary event hours. T² charts with more than two variables are considered in Exercise 5.23. We take α = .01 to be consistent with the ellipse format chart in Example 5.9.

The T² chart in Figure 5.8 reveals that the pair (legal appearances, extraordinary event) hours for period 11 is out of control. Further investigation, as in Example 5.9, confirms that this is due to the large value of extraordinary event overtime during that period. ■
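The T² values plotted in Figure 5.8 are just the quadratic forms (5-33) evaluated at each period. Continuing the sketch given after Example 5.9 (it reuses the array X, the sample mean xbar, and the inverse covariance Sinv defined there), one possible computation is:

# T^2 value (5-33) for each period, compared with the chi-square control limit
import numpy as np
from scipy import stats

T2 = np.einsum('ij,jk,ik->i', X - xbar, Sinv, X - xbar)
UCL = stats.chi2.ppf(0.99, df=2)             # alpha = .01, as in Example 5.10

for period, t2 in enumerate(T2, start=1):
    flag = "out of control" if t2 > UCL else ""
    print(f"{period:2d}  {t2:6.2f}  {flag}")  # only period 11 exceeds the UCL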
Figure 5.8 The T² chart for legal appearances hours and extraordinary event hours, α = .01.
When the multivariate T² chart signals that the jth unit is out of control, it should be determined which variables are responsible. A modified region based on Bonferroni intervals is frequently chosen for this purpose. The kth variable is out of control if x_jk does not lie in the interval

   (x̄_k − t_{n−1}(.005/p) √s_kk ,  x̄_k + t_{n−1}(.005/p) √s_kk)

where p is the total number of measured variables.

Control Regions for Future Individual Observations
The goal now is to use data x_1, x_2, ..., x_n, collected when a process is stable, to set a control region for a future observation x or future observations. The region in which a future observation is expected to lie is called a forecast, or prediction, region. If the process is stable, we take the observations to be independently distributed as N_p(μ, Σ). Because these regions are of more general importance than just for monitoring quality, we give the basic distribution theory as Result 5.6.

Result 5.6. Let X_1, X_2, ..., X_n be independently distributed as N_p(μ, Σ), and let X be a future observation from the same distribution. Then

   T² = (n/(n + 1)) (X − X̄)'S⁻¹(X − X̄)   is distributed as   ((n − 1)p/(n − p)) F_{p,n−p}
and a 100(1 − α)% p-dimensional prediction ellipsoid is given by all x satisfying

   (x − x̄)'S⁻¹(x − x̄) ≤ ((n² − 1)p/(n(n − p))) F_{p,n−p}(α)

Proof. We first note that X − X̄ has mean 0. Since X is a future observation, X and X̄ are independent, so

   Cov(X − X̄) = Cov(X) + Cov(X̄) = Σ + (1/n)Σ = ((n + 1)/n)Σ

and, by Result 4.8, √(n/(n + 1)) (X − X̄) is distributed as N_p(0, Σ). Now,

   √(n/(n + 1)) (X − X̄)' S⁻¹ √(n/(n + 1)) (X − X̄)

which combines a multivariate normal, N_p(0, Σ), random vector and an independent Wishart, W_{p,n−1}(Σ), random matrix in the form

   (multivariate normal vector)' (Wishart random matrix / d.f.)⁻¹ (multivariate normal vector)

has the scaled F distribution claimed according to (5-8) and the discussion on page 226. The constant for the ellipsoid follows from (5-6). ■

Note that the prediction region in Result 5.6 for a future observed value x is an ellipsoid. It is centered at the initial sample mean x̄, and its axes are determined by the eigenvectors of S. Since

   P[(X − X̄)'S⁻¹(X − X̄) ≤ ((n² − 1)p/(n(n − p))) F_{p,n−p}(α)] = 1 − α

before any new observations are taken, the probability that X will fall in the prediction ellipse is 1 − α.

Keep in mind that the current observations must be stable before they can be used to determine control regions for future observations. Based on Result 5.6, we obtain the two charts for future observations.
With p = 2, the 95% prediction ellipse in Result 5.6 specializes to z (5-34) (x - x) S (x - x) :;;;;; (nn (n - 1)2Z) Fz. n - 2 (.05) Any future observation x is declared to be out of control if it falls out of the con trol ellipse. -
I
-1
-
_
Sec. Example 5 . 1 1
5.6
Multivariate Quality Control Charts
265
(A Control ellipse for future overtime hours)
In Example 5. 9 , we checked the stability of legal appearances and extraordi nary event overtime hours. Let's use these data to determine a control region for future pairs of values. From Example 5.9 and Figure 5. 6 , we find that the pair of values for period 11 were out of control. removed this point and determined the new 99% ellipse. All of the points are then in control, so they can serve to deter mine the 95% prediction region just defined for p = 2. This control ellipse is shown in Figure 5.9 along with the initial 15 stable observations. Any future observation falling in the ellipse is regarded as stable or in control. An observation outside of the ellipse represents a potential out-of control observation or special-cause variation. We
•
T2 Chart for Future Observations
For each new observation x, plot T2 = +n _1 (x - i ) ' S- 1 (x - i ) in time order. Set LCL = 0, and take _ n
II)
E
•
0 0 0 N
•
.E> 0 c II) > i:Ll
� c
�0 "' tl >< i:Ll
• •
0 0
•
::
0 0 ll'l
• •
•
•
-+ •
•
• •
0 0 0
"i'
1 500
2500
3500
4500
Appearances Overtime
5500
Figure 5.9 The 95% control ellipse for future legal appearances and extraordinary event overtime.
266
Chap. 5 Inferences about a Mean Vector
UCL = (n(n - 1p))p Fp, n - p (.05) Points above the upper control limit represent potential special cause variation and suggest that the process in question should be examined to determine whether immediate corrective action is warranted. _
Control Charts Based on Subsample Means
It is assumed that each random vector of observations from the process is inde pendently distributed as NP (O, I). We proceed differently when the sampling pro cedure specifies that m > 1 units be selected, at the same time, from the process. From the first sample, we determine its sample mean X1 and covariance matrix S 1 . When the population is normal, these two random quantities are independent. For a general subsample mean Xi , Xi - X has a normal distribution with mean 0 and n - 1 Cov ( Xi - X ) = ( 1 - -1 ) 2 Cov ( -Xi ) + -2- Cov ( X-1 ) = (n - 1) I -
where
=
n
n
nm
X = -n1 i�=n l -X.J As will be described in Section 6.4, the sample covariances from the n sub samples can be combined to give a single estimate (called Spooled in Chapter 6) of the common covariance I. This pooled estimate is 1 S = (S 1 + S 2 + · + S n ) n Here (nm - n) S is independent of each X and, therefore, of their mean X. Further, (nm - n) S is distributed as a Wisharti random matrix with nm - n degrees of freedom. Notice that we are estimating I internally from the data col lected in any given period. These estimators are combined to give a single esti mator with a large number of degrees of freedom. Consequently, =
-
. .
(5-3 5)
is distributed as (nm - n)p (nm - n - p
+ 1) Fp, nm - n - p + l
Ellipse Format Chart. In an analogous fashion to our discussion on individual multivariate observations, the ellipse format chart for pairs of subsample means is

   (x̄ − x̿)'S⁻¹(x̄ − x̿) ≤ ((n − 1)(m − 1)2/(m(nm − n − 1))) F_{2,nm−n−1}(.05)          (5-36)

although the right-hand side is usually approximated as χ²_2(.05)/m.

Subsamples corresponding to points outside of the control ellipse should be carefully checked for changes in the behavior of the quality characteristics being measured. The interested reader is referred to [9] for additional discussion.

T² Chart. To construct a T² chart with subsample data and p characteristics, we plot the quantity

   T²_j = m(X̄_j − X̿)'S⁻¹(X̄_j − X̿)

for j = 1, 2, ..., n, where the

   UCL = ((n − 1)(m − 1)p/(nm − n − p + 1)) F_{p,nm−n−p+1}(.05)

The UCL is often approximated as χ²_p(.05) when n is large. Values of T²_j that exceed the UCL correspond to potentially out-of-control or special cause variation, which should be checked. (See [9].)
=
-
+
(nm - n)p (nm - n - p + 1) Fp, nm - n - p + 1 Control Ellipse for Future Subsample Means. The prediction ellipse for a future subsample mean for p = 2 characteristics is defined by the set of all x such that
268
Chap. 5 Inferences about a Mean Vector
+ 1) (m - 1 ) 2 F 1 (.05) (5-37) (i - x )'S - 1 (i - x ) .:::; (nm(nm - n - 1) where, again, the right-hand side is usually approximated as xi (.05)/m. Chart for Future Subsample Means. As before, we bring n/ ( n + 1 ) into the control limit and plot the quantity 2, nm _ n
_
T2
for future sample means in chronological order. The upper control limit is then UCL = ((nmn +-l n) (m- p- +1 ) p1 ) FP (.05) The UCL is often approximated as xi (.05) when n is large. Points outside of the prediction ellipse or above the UCL suggest that the current values of the quality characteristics are different in some way from those of the previous stable process. This may be good or bad, but almost certainly war rants a careful search for the reasons for the change. ·
nm - n - p + t
5.7 I NFERENCES ABOUT MEAN VECTORS WHEN SOME OBSERVATIONS ARE MISSING
Often, some components of a vector observation are unavailable. This may occur because of a breakdown in the recording equipment or because of the unwilling ness of a respondent to answer a particular item on a survey questionnaire. The best way to handle incomplete observations, or missing values, depends, to a large extent, on the experimental context. If the pattern of missing values is closely tied to the value of the response, such as people with extremely high incomes who refuse to respond in a survey on salaries, subsequent inferences may be seriously biased. To date, no statistical techniques have been developed for these cases. However, we are able to treat situations where data are missing at random-that is, cases in which the chance mechanism responsible for the missing values is not influenced by the value of the variables. A general approach for computing maximum likelihood estimates from incomplete data is given by Dempster, Laird, and Rubin [5]. Their technique, called the EM algorithm, consists of an iterative calculation involving two steps. We call them the prediction and estimation steps: Prediction step. Given some estimate ii of the unknown parameters, predict the contribution of any missing observation to the (complete-data) sufficient statistics. 2. Estimation step. Use the predicted sufficient statistics to compute a revised estimate of the parameters. L
Sec. 5.7 Inferences About Mean Vectors When Some Observations are Missing
269
The calculation cycles from one step to the other, until the revised estimates do not differ appreciably from the estimate obtained in the previous iteration. When the observations X 1 , X 2 , , X n are a random sample from ap-variate normal population, the prediction-estimation algorithm is based on the complete data sufficient statistics [see (4-21)] n T1 = � Xj = nX j= l and n T2 = � xj x; (n - 1)S + n X X ' j= l In this case, the algorithm proceeds as follows: We assume that the population mean and variance-It and I, respectively-are unknown and must be estimated. Prediction step. For each vector xj with missing values, let x]I> denote the missing denote those components which are available. Thus, x � = [x 1 >' x<2 >'] Given estimates ji and I from the estimation step, use the mean of the conditional1 normal distribution of x(ll , given x(2) , to estimate the missing values. That is, x .( l l = E (X< 1 > j x<2 > · I ) = + I 1 2 I-221 (x<2> - 11 (2)) (5-38) estimates the contribution of xp> to T1 • Next, the predicted contribution of xp> to T2 is �?> x} 1 > : = E (X} 1 >x? > ' i xJ2 > ; ji, I) = I1 1 - I12I2i i2 1 + xp > xpr (5-39) and • . .
1
1
'
1
-
•
J
1
1
II.
,, ( ! )
' ,_ ,
.-
� = E (X< 1 >X(2)' j x(2> · ,, I ) 1
1
1
J
I
.-
1
, ,_,
=
xJ( l > x(2) ' 1
The contributions in (5-38) and (5-39) are summed over all x �ith m�sing components. The results are combined with the sample data to yieldj T1 and T2. Estimation step. Compute the revised maximum likelihood estimates ( see Result 4.11): 1 - T-2 - ILIL (5-40) n We illustrate the computational aspects of the prediction-estimation algo rithm in Example 5.12. � •
1 If all the components xi are missing, set ii.i
-
=
-
ji and ii.iii.j
- -,
=
I
+
p.;;: .
270
Chap. 5 Inferences about a Mean Vector
Example 5. 1 2 (Illustrating the EM algorithm)
Estimate the normal population mean and covariance I using the incom plete data set p
Here n = 4, p = 3, and parts of observation vectors x 1 and x4 are missing. We obtain the initial sample averages 7 +-5 = 6, I-L- 2 _- 0 + 2 � - 1 , I-L3- = 3 + 6 + 2 + 5 = 4 IL- l = 2 3 4 from the available observations. Substituting these averages for any missing values, so that .X = 6, for example, we can obtain initial covariance esti mates. We shall construct these estimates using the divisor n because the algorithm eventually produces the maximum likelihood estimate I. Thus, (6 - 6) 2 + (7 - 6) 2 + (5 - 6) 2 + (6 - 6) 2 1 (]"1 1 = 4 2 1 0"22 = 2' (6 - 6) (0 - 1) + (7 - 6) (2 - 1) + (5 6) (1 - 1) + (6 - 6) (1 - 1) 4 1 4 3 0:1 3 = 1 0"2 3 = 4' The prediction step consists of using the initial estimates ji and i to predict the contributions of the missing values to the sufficient statistics T1 and T2 . [See (5-38) and (5-39).] The first component of x 1 is missing, so we partition ji and I as 11
-
�
and predict
3 x11 - J.L1 + I1 2 I22 [ XX1132 - ILIL- 32 J - 6 + b:, 1 ] [ 24' 42 ] - ' [ 03 41 ] 5 .73 + (5 .73) 2 32.99
Sec. 5.7 Inferences About Mean Vectors When Some Observations are Missing
271
-
_
- - 1 -
1
_
_
�
_
�
-
-
_
_
=
For the two missing components of x4 , we partition ji and I as and predict
[ �:: ] E( [�:: ] l x43 5; ) [ �: ] [�] + [ � J (�) - 1 (5 - 4) [ �:� ] =
=
ji , I
=
=
for the contribution to T1 . Also, from (5 -39),
[ t n [�J
and
[ Xx442t ] x43 ) (
=
[ 61 ..34 ] (5 ) [ 32.6.50 ] =
272
Chap. 5 Inferences about a Mean Vector
are the contributions to T2 . Thus, the predicted complete-data sufficient sta tistics are
[ix1112 ++ xx22 1 ++ xx3321 ++ x4�412 ] - [ 5.730 ++ 72 ++ 51 ++ 6.1.43 ] x1 3 + x23 + x3 + x43 3 + 6 + 2 + 5 _
[ 24.16.4.031003 ]
[ 32.0 +997(+2)72++5(1)52 ++41.8.0267 02 + 22 + 12 + 1.97 17.18 + 7(6) + 5(2) + 32 0(3) + 2(6) + 1(2) + 6.5 32 + 62 + 22 + 52 ] [ 148.27.2057 27.6.2977 101.20.5180 ] 101.18 20.50 74.00 This completes one prediction step. The next estimation step, using (5-40), provides the revised estimates2 6.03 24. 1 3 1 ii = ;; T1 = � 4.30 = 1. 08 16.00 4.00 - 1I = - Tz - jiji' n 148.05 27.27 101.18 6.03 = i 27.27 6.97 20.50 - 1. 08 [6.03 1. 08 4.00] 101.18 20.50 74.00 4.00 1 .61 .33 1.11 .33 .59 .83 l l.17 .83 2.50 Note that 0:11 = .61 and 0:22 = .59 are larger than the corresponding by replacing the missing observations on the first initial estimates obtained and second variables by the sample means of the remaining values. The third variance estimate 0: remains unchanged, because it is not affected by the missing components.33
[
[ ][ ]
]
][ ]
2 The final entries in I are exact to two decimal places.
Sec.
5.8
Difficulties Due to Time Dependence in Multivariate Observations
273
The iteration between the prediction and estimation steps continues until the elements of μ̃ and Σ̃ remain essentially unchanged. Calculations of this sort are easily handled with a computer. ■
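A minimal sketch of this prediction-estimation cycle (the function name, the use of np.nan to flag missing components, and the simple convergence check are our own choices; the data matrix is the one from Example 5.12) is:

import numpy as np

def em_normal_missing(X, n_iter=50, tol=1e-6):
    """Prediction-estimation (EM) iterations (5-38)-(5-40) for a normal sample with values missing at random."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)                      # assumes each row has at least one observed component
    # Initial estimates: substitute column means for missing values; divisor n for Sigma
    X0 = np.where(miss, np.nanmean(X, axis=0), X)
    mu = X0.mean(axis=0)
    Sigma = (X0 - mu).T @ (X0 - mu) / n
    for _ in range(n_iter):
        T1 = np.zeros(p)
        T2 = np.zeros((p, p))
        for j in range(n):
            m, o = miss[j], ~miss[j]
            xj = X[j].copy()
            C = np.zeros((p, p))
            if m.any():
                # Prediction step (5-38): conditional mean of the missing part given the observed part
                B = np.linalg.solve(Sigma[np.ix_(o, o)], Sigma[np.ix_(o, m)])
                xj[m] = mu[m] + B.T @ (X[j, o] - mu[o])
                # Conditional covariance enters the second-moment contribution, as in (5-39)
                C[np.ix_(m, m)] = Sigma[np.ix_(m, m)] - Sigma[np.ix_(m, o)] @ B
            T1 += xj
            T2 += np.outer(xj, xj) + C
        # Estimation step (5-40)
        mu_new, Sigma_new = T1 / n, T2 / n - np.outer(T1 / n, T1 / n)
        converged = np.max(np.abs(mu_new - mu)) < tol and np.max(np.abs(Sigma_new - Sigma)) < tol
        mu, Sigma = mu_new, Sigma_new
        if converged:
            break
    return mu, Sigma

# The incomplete data set of Example 5.12 (np.nan marks a missing component)
X = np.array([[np.nan, 0, 3], [7, 2, 6], [5, 1, 2], [np.nan, np.nan, 5]], dtype=float)
print(em_normal_missing(X, n_iter=1))       # one cycle reproduces mu of about (6.03, 1.08, 4.00)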
i
Once final estimates jL and are obtained and relatively few missing com ponents occur in X, it seems reasonable to treat all IL such that n (jL - IL ) 'I - I (jL
100(1 -
- IL )
:o:::;
(5-41)
x; (a)
as an approximate a)% confidence ellipsoid. The simultaneous confidence statements would then follow as in Section but with x replaced by jL and S replaced by
i.
5.5,
Caution. The prediction-estimation algorithm we discussed is developed on the basis that component observations are missing at random. If missing values are related to the response levels, then handling the missing values as suggested may introduce serious biases into the estimation procedures. Typically, missing values are related to the responses being measured. Consequently, we must be dubious of any computational scheme that fills in values as if they were lost at random. When more than a few values are missing, it is imperative that the investigator search for the systematic causes that created them. 5.8 DIFFICULTIES DUE TO TIME DEPENDENCE IN MULTIVARIATE OBSERVATIONS
For the methods described in this chapter, we have assumed that the multivariate observations X1 , X 2 , . . . , X, constitute a random sample; that is, they are indepen dent of one another. If the observations are collected over time, this assumption may not be valid. The presence of even a moderate amount of time dependence among the observations can cause serious difficulties for tests, confidence regions and simultaneous confidence intervals, that are all constructed assuming that inde pendence holds. We will illustrate the nature of the difficulty when the time dependence can be represented as a multivariate first order autoregressive model. Let the model p X random vector X1 follow the multivariate
1
AR(1)
(AR(1))
[
(5-42)
where the E1 are independent and identically distributed with E e1] 0 and Cov (e1 ) = I e and all of the eigenvalues of the coefficient matrix are between and Under this model Cov (X� ' X1_j ) = «<>j i x where
-1
1.
:L «<>j ie «<>'j oc
Ix
=
j=O
274
Chap. 5 Inferences about a Mean Vector
The AR ( 1) model (5-42) relates the observation at time t, to the observation at time t - 1, through the coefficient matrix «1». Further, the autoregressive model says the observations are independent, under multivariate normality, if all the entries in the coefficient matrix «<» are 0. The name autoregressive model comes from the fact that ( 5-42) looks like a multivariate version of a regression with X1 as the depen dent variable and the previous value X1_1 as the independent variable. As shown in Johnson and Langeland [8], ·
S
=
n
1 � ) X ) ' --7 Ix (X1 X (X1 � 1
_
and
(5-43) where the arrows indicate convergence in probabHity. Moreover, for large n, Vn ( X - p) is approximately normal with mean 0 and covariance matrix given by (5-43). To make the calculations easy, suppose the underlying process has «<» = ¢1 where I ¢ 1 < 1. Now consider the large sample nominal 95% confidence ellipsoid for p.. •
- p.)' S - 1 ( X -
..::; x; (.05)} This ellipsoid has large sample coverage probability .95 i f the observations are { all p such that n ( X
p.)
independent. If the observations are related by our autoregressive model, however, this ellipsoid has coverage probability

   P[χ²_p ≤ ((1 − φ)/(1 + φ)) χ²_p(.05)]

Table 5.9 shows how the coverage probability is related to the coefficient φ and the number of variables p.

TABLE 5.9 COVERAGE PROBABILITY OF THE NOMINAL 95% CONFIDENCE ELLIPSOID

                            φ
   p        −.25       0       .25       .5
   1        .989     .950     .871     .742
   2        .993     .950     .834     .632
   5        .998     .950     .751     .405
   10       .999     .950     .641     .193
   15      1.000     .950     .548     .090
According to Table 5.9, the coverage probability can drop very low, to .632, even for the bivariate case. The independence assumption is crucial, and the results based on this assumption can be very misleading if the observations are, in fact, dependent.
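The entries of Table 5.9 follow directly from the stated coverage probability. A minimal sketch (assuming the AR(1) coefficient matrix Φ = φI, as in the text) is:

from scipy import stats

def ar1_coverage(phi, p, alpha=0.05):
    """Coverage of the nominal 95% ellipsoid when the observations follow the AR(1) model with Phi = phi * I."""
    crit = stats.chi2.ppf(1 - alpha, p)
    return stats.chi2.cdf((1 - phi) / (1 + phi) * crit, p)

# Reproduce Table 5.9: for example, phi = .5 and p = 2 gives about .632
for p in (1, 2, 5, 10, 15):
    print(p, [round(ar1_coverage(phi, p), 3) for phi in (-0.25, 0.0, 0.25, 0.5)])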
SUPPLEMEN T SA
Simultaneous Confidence In tervals and Ellipses as Shadows of the p-Dimensional Ellipsoids
We begin this supplementary section by establishing the general result concerning the projection (shadow) of an ellipsoid onto a line. Result 5A. 1 . Let the constant c > 0 and positive definite p p matrix A determine the ellipsoid {z: Z1 A -l z :;;;; c2}. For a given vector u 0, and z belong ing to the ellipsoid, the ( Projection (shadow) of ) = c � 2 U1U {z1 A- l z :;;;; c } on u which extends from 0� along u with length c Vu 1 Au/u 1 u . When u is a unit vector, the shadow extends c units, so Z1 u I :;;;; c � . The shadow also extends c � units in the -u direction. I Proof. By Definition 2A.l2, the projection of any z on u is given by (z1u)u/u1u. Its squared length is (z1uf/n1n. We want to maximize this shadow in over all z with Z12 A - lz :;;;; c2• The- 1 extended Cauchy-Schwarz inequality states that (b1d) :;;;-; l(b'Bd)(d'B d), with equality when b = kB - 1 d. Setting b(2-49) = z, d = u, and B = A , we obtain (u'u) (length of projection) 2 = (z1u) 2 :;;;; (z' A- 1 z) (u' Au) for all z: z' A - l z :;;;; c2 The choice z = cAn/� yields equalities and thus gives the maximum shadow, besides belonging to the boundary of the ellipsoid. That is, z' A - l z = ¥-
0
276
X
Supplement SA Simultaneous Confidence Intervals and Ellipses
277
c 2 u' Au/u' Au = c 2 for this z that provides the longest shadow. Consequently, the projection of the ellipsoid on u is c � u/u 'u, and its length is c Vu ' Au/u ' u . With the unit vector e0 = u/Vu'U , the projection extends The projection of the ellipsoid also extends the same length in the direction -u .
•
Suppose that the ellipsoid {z : z' A - 1 z .;;; c 2 } is given and that U = [u 1 ! u 2 ] is arbitrary but of rank two. Then z in the ellipsoid . . that for all U, U'z is in the ellipsoid 1mphes 1 2 based on A - and c based on (U' AU) - 1 and c 2
{
{
}
Result SA.2.
}
or
z' A - 1 z .;;; c 2 implies that (U' z)' (U'AUf 1 (U' z) .;;; c 2 Proof.
We
first
establish
a
basic
for all U
inequality.
Set
P
A 1 12 U (U' AU f 1 U' A 1 12 , where A = A 1 12 A 1 12 . Note that P = P ' and P2 = P, so (I - P) P ' = P - P2 = 0. Next, using A - 1 = A -l /2 A - l /2 , we write z' A - l z = (A - ll2 z)' (A - 1 12 z) and A - ll2 z = PA - 1 12 z + (I - P)A - 1 12 z. Then z' A -l z = (A - ll2 z) ' (A -ll2 z) = (PA - t 12 z + (I - P) A - t 12 z) ' (PA - t 12 z + (I - P) A - t 12 z) = (PA - ll2 z) ' (PA - 1 12 z) + ( (I - P) A - 1 12 z) ' ( (I - P) A - ll2 z) •
Since z' A - 1 z .;;; c 2 and U was arbitrary, the result follows.
Our next result establishes the two-dimensional confidence ellipse as a pro jection of the p-dimensional ellipsoid. (See Figure on page Projection on a plane is simplest when the two vectors u 1 and u 2 determining the plane are first converted to perpendicular vectors of unit length. (See Result
5.10
278. )
2A. 3 . )
Given the ellipsoid {z : z' A - 1 z .;;; c 2 } and two perpendicular unit vectors u 1 and u 2 , the projection (or shadow) of {z' A - l z .;;; c 2 } on the u 1 , u 2 plane results in the two-dimensional ellipse {(U' z)' (U' AU)- 1 (U' z) .;;; c 2 }, where Result SA.3 .
U = [u 1 i u 2 ] .
278
Chap. 5 Inferences about a Mean Vector 3
Figure 5 . 1 0 The shadow of the ellipsoid z ' A- 1 z .;; c2 on the u 1 , u 2 plane is an ellipse.
Proof
2A.3, the projection of a vector z on the u 1 , u 2 plane is z (u{ z) u 1 + (u� z) u 2 = [u 1 ! u 2 ] [ u � ] = UU' z UzZ
By Result
The projection of the ellipsoid {z : z' A - I z :;;;; c 2 } consists of all UU' z with z' A - I z :;;;; c 2 . Consider the two coordinates U' z of the projection U (U' z). Let z belong to the set {z : z' A - l z :;;;; c 2 } so that UU' z belongs to the shadow of the ellip soid. By Result
5A.2,
so the ellipse {(U' z)' (U' AU) - 1 (U' z) :;;;; c 2 } contains the coefficient vectors for the shadow of the ellipsoid. Let Ua be a vector in the u 1 , u 2 plane whose coefficients a belong to the ellipse {a' (U' AU) - 1 a :;;;; c2 }. If we set z = AU (U' AU) - 1 a, it follows that
U' z = U' AU (U' AUr 1 a = a and
z' A - l z = a' (U' AU) - 1 U' AA - I AU (U' AU) - 1 a = a' (U' AU) - 1 a :;;;; c 2 Thus, U ' z belongs to the coefficient vector ellipse, and z belongs to the ellipsoid z' A - I z :;;;; c 2 . Consequently, the ellipse contains only coefficient vectors from the • projection of {z: z' A - I z :;;;; c 2 } onto the u 1 , u 2 plane. Remark. Projecting the ellipsoid z' A - I z :;;;; c 2 first to the u 1 , u 2 plane and then to the line u 1 is the same as projecting it directly to the line determined by u 1 • In the context of confidence ellipsoids, the shadows of the two-dimensional ellipses give the single component intervals.
Chap. 5 Exercises
2<
q
Results 5A.2 and 5A.3 remain valid if linearly independent columns.
Remark. :;;;;
p
[2
EXERCISES
U
= [u 1 , . . . , u q ]
279
consists of
Evaluate T2, for testing H0 : p' = [7, 11], using the data 12 8 9 X= 68 109 (b) Specify the distribution of T2 for the situation in (a). conclusion do you reach? (c) Using (a) and (b), test H0 at the a = .05 level. What 2 5.2. Using the data in Example 5.1, verify that T remains unchanged if each observation xj , j = 1, 2, 3, is replaced by Cxj , where 5.1. (a)
c =
Note that the observations
J
[: - : ]
(10 - 6) (8 - 3) ] ' (10 + 6) (8 + 3) 5.3. (a) Use expression (5-15) to evaluate T2 for the data in Exercise 5.1. (b) Use the data in Exercise 5.1 to evaluate A in (5-13). Also, evaluate Wilks ' lambda. 5.4. Use the sweat data in Table 5.1. (See Example 5. 2 . ) (a) Determine the axes of the 90% confidence ellipsoid for p. Determine the lengths of these axes. (b) Construct Q-Q plots for the observations on sweat rate, sodium content, and potassium content, respectively. Construct the three possible scatter plots for pairs of observations. Does the multivariate normal assumption seem justified in this case? Comment. 5.5. The quantities i, S, and s - 1 are given in Example 5.3 for the transformed H0 : p ' = hypothesis null the of test a Conduct data. microwave-radiation [.55, .60] at the a = .05 level of significance. Is your result consistent with the 95% confidence ellipse for pictured in Figure 5.1? Explain.
[(6(6 +- 9)9)
yield the data matrix
JL
280
Chap. 5 Inferences about a Mean Vector
Verify the Bonferroni inequality in (5-28) for m = 3. Hint: A Venn diagram for the three events C1 , C2 , and C3 may help. 5.7. Use the sweat data in Table 5.1. (See Example 5.2.) Find simultaneous 95% T2 confidence intervals for �-t1 , �-t 2 , and �-t 3 using Result 5.3. Construct the 95% Bonferroni intervals using (5-29). Compare the two sets of intervals. 5.8. From (5-23), we know that T2 is equal to the largest squared univariate 1 t-value constructed from the linear combination a ' xj with a = s - (i - p0 ) . Using the results in ExaPlple 5.3 and the H0 i n Exercise 5.5, evaluate a for the transformed microwave-radiation data. Verify that the t 2 -value computed with this a is equal to T2 in Exercise 5.5. 5.9. A physical anthropologist performed a mineral analysis of nine ancient Peru vian hairs. The results for the chromium (x1 ) and strontium (x2 ) levels, in parts per million (p.p.m.), were as follows: 5.6.
x1 (Cr)
.48
x2 (Sr)
12.57
2.19
.55
.74
.66
.93
73.68 1 1 .13
20.03
20.29
.78
4.64
40.53
.37
.43 1 .08
Source: Benfer and others, "Mineral Analyno.sis 3of(1Anci978)e,nt277-282. Peruvian Hair," can Journal of Physical Anthropology,
.22
Ameri-
48,
It is known that low levels (less than or equal to .100 p.p.m.) of chromium suggest the presence of diabetes, while strontium is an indication of animal protein intake. (a) Construct and plot a 90% joint confidence ellipse for the population mean vector p' = [�-t1 , �-t 2 ] , assuming that these nine Peruvian hairs represent a random sample from individuals belonging to a particular ancient Peru vian culture. (b) Obtain the individual simultaneous 90% confidence intervals for �-t 1 and �-t 2 by "projecting" the ellipse constructed in Part a on each coordinate axis. (Alternatively, we could use Result 5.3.) Does it appear as if this Peruvian culture has a mean strontium level of 10? That is, are any of the points (�-t 1 arbitrary, 10) in the confidence regions? Is [.30, 10] ' a plausi ble value for p? Discuss. (c) Do these data appear to be bivariate normal? Discuss their status with ref erence to Q-Q plots and a scatter diagram. If the data are not bivariate normal, what implications does this have for the results in Parts a and b? (d) Repeat the analysis with the obvious "outlying" observation removed. Do the inferences change? Comment.
Chap. 5 Exercises
281
5.10. Given the data
with missing components, use the prediction-estimation algorithm of Section 5.7 to estimate p, and I. Determine the initial estimates, and iterate to find the revised estimates. 5.11. Determine the approximate distribution of ln ( I i I / I I0 I ) for the sweat data in Table 5.1. (See Result 5.2.) 5.U. Create a table similar to Table 5.4 using the entries (length of one-at-a-time t-interval)/(length of Bonferroni t-interval).
first
Exercises
5. 13, 5. 1 4,
-n
and
5. 15
refer to the following information:
Frequently, some or all of the population characteristics of interest are in the form of Each individual in the population may then be described in terms of the attributes it possesses. For convenience, attributes are usually numerically coded with respect to their presence or absence. If we let the variable X pertain to a specific attribute, then we can distinguish between the presence or absence of this attribute by defining
attributes.
X=
{�
if attribute present if attribute absent
In this way, we can assign numerical values to qualitative characteristics. When attributes are numerically coded as 0 1 variables, a random sam ple from the population of interest results in statistics that consist of the of the number of sample items that have each distinct set of characteristics. If the sample counts are large, methods for producing simultaneous confidence statements can be easily adapted to situations involving proportions. We consider the situation where an individual with a particular combi nation of attributes cah be classified into one of q + 1 mutually exclusive and exhaustive categories. The corresponding probabilities are denoted by p1 , p2 , • • • , P q ' P q + I · Since the categories include all possibilities, we take P q + l = 1 (p1 + p2 + · · · + P q ). An individual from category k will be assigned the ((q + 1) X 1) vector value [0, . . . , 0, 1, 0, . . . , 0] ' with 1 in the kth position.
-
-
counts
282
Chap. 5 Inferences about a Mean Vector
The probability distribution for an observation from the population of individuals in q + 1 mutually exclusive and exhaustive categories is known as the It has the following structure:
multinomial distribution.
Category:
1
2
k
q
q+1
1 0 0
0 1 0
0
0 0 0
0 0 0
0 1 0
0 1
Outcome (value):
Probability (proportion):
=
0
0
P1
Pz
0 1 0 0
Pq
Pq + l
, n,
=1
q
-
L Pi
i= l
n
Let Xj , j 1, 2, . . be a random sample of size from the multino mial distribution. The kth component, J0 k , of Xj is 1 if the observation (indi vidual) is from category k and is 0 otherwise. The random sample X 1 , X2 , . . . , Xn can be converted to a sample pro portion vector, which, given the nature of the preceding observations, is a sample mean vector. Thus, .
= !n ± xj j= l
with
E (p) =
p
=
and
[5J
az. q + l
n,
For large the approximate sampling distribution of p is provided by the central limit theorem. We have Vn
(p - p) is approximately N(O, I)
Chap. 5 Exercises
283
where the elements of I are ukk = Pk (1 - P k ) and u;k = - p ;P k · The nor mal approximation remains valid when ukk is estimated by G-kk = Pk (1 - Pk ) and u;k is estimated by G-;k = -P ; P k , i * k. Since each individual must belong to exactly one category, Xq+i ,j = 1 (X1 j + X2 j + · · · + Xq ) , so P q + l = 1 � ( p 1 + p 2 + . . . + Pq ) , and as a result, I has rank q. The usual inverse of I does not exist, but it is still pos sible to develop simultaneous 100(1 - a)% confidence intervals for all linear combinations A
a' p.
Result. Let X1 , X 2 , , X, be a random sample from a q + 1 category multinomial distribution with P [Xj k = 1 ] = p k , k = 1, 2, . . . , q + 1, j = 1, 2, . . . , n. Approximate simultaneous 100 (1 - a) % confidence regions for all linear combinations = a 1 p 1 + a 2 p 2 + · · · + aq+ l Pq+ l are given by the observed values of • . .
a'p
a'p :!:: Yx� ( a ) �/a'ia -n
provided that n - q is large. Here
n
p = (1/n) jLl Xj , and I = [ G-;d is a A
=
(q + 1) X (q + 1) matrix with G-kk = Pk (1 - Pk) and G-i k = - p i p k , i #- k. Also, x� ( a ) is the upper (100 a) th percentile of the chi-square distribution • with q d.f.
In this result, the requirement that n - q is large is interpreted to mean npk is about 20 or more for each category.
We have only touched on the possibilities for the analysis of categori cal data. Complete discussions of categorical data analysis are available in [1] and [4]. 5.13. Let Xj ; and Xj k be the ith and kth components, respectively, of Xj . Show that JL ; = E (Xj ;) = P ; and U;; = Var (Xj ; ) = P ; (1 - p ; ) , i 1, 2, . . . , p . (b) Show that u;k = Cov (Xj i • Xj k) = -p ip k , i #- k. Why must this covariance necessarily be negative? 5.14. As part of a larger marketing research project, a consultant for the Bank of Shorewood wants to know the proportion of savers that uses the bank ' s facil ities as their primary vehicle for saving. The consultant would also like to know the proportions of savers who use the three major competitors: Bank B, Bank C, and Bank D. Each individual contacted in a survey responded to the following question:
(a)
284
Chap. 5 Inferences about a Mean Vector
I
Which bank is your primary savings bank?
I
I
of Response: ShBank Bank B Bank C Bank D orewood
I
Another Bank
No I Savings
A sample of n = 355 people with savings accounts produced the fol lowing counts when asked to indicate their primary savings banks (the peo ple with no savings will be ignored in the comparison of savers, so there are five categories ) :
Bank (category )
Bank of Shorewood Bank B Bank C Bank D Another bank
Observed number
105
119
56
25
Population proportion
P1
Pz
P3
P4
Observed sample proportion
P1
105 355
.30
50
I
Total
n = 355
Ps = 1 (p l + P z + P 3 + P4 )
P z = .33 P 3 = .16 P 4 = .07 Ps = .14
Let the population proportions be
p 1 = proportion of savers at Bank of Shorewood p 2 = proportion of savers at Bank B
p 3 = proportion of savers at Bank C
p4 = proportion of savers at Bank D
1 - (p 1 +
p2 + p3 + p4 ) = proportion of savers at other banks
(a) Construct simultaneous 95% confidence intervals for p 1 , p 2 , , p 5 • (b) Construct a simultaneous 95% confidence interval that allows a compari son of the Bank of Shorewood with its major competitor, Bank B. Inter pret this interval. 5.15. In order to assess the prevalence of a drug problem among high school stu dents in a particular city, a random sample of 200 students from the city's five • • •
Chap. 5 Exercises
285
high schools were surveyed. One of the survey questions and the corre sponding responses are given below. What is your typical weekly marijuana usage? Category
Number of responses
None
Moderate (1-3 joints)
Heavy (4 or more joints)
117
62
21
Construct 95% simultaneous confidence intervals for the three proportions = 1 - (p l + P 2 ).
P 1 , P 2 , and P3
The following exercises may require a computer.
5.16. Use the college test data in Table 5.2. (See Example 5.5.) (a) Test the null hypothesis H0 : p ' = [500, 50, 30] versus H1 : p ' ¥- [500, 50, 30] at the a = .05 level of significance. Suppose [500, 50, 30] ' represent average scores for thousands of college students over the last 10 years. Is there reason to believe that the group of students represented by the scores in Table 5.2 is scoring differently? Explain. (b) Determine the lengths and directions for the axes of the 95% confidence ellipsoid for p. (c) Construct Q-Q plots from the marginal distributions of social science and history, verbal, and science scores. Also, construct the three possible scat ter diagrams from the pairs of observations on different variables. Do these data appear to be normally distributed? Discuss. 5.17. Measurements of 1 = stiffness and = bending strength for a sample of = 30 pieces of a particular grade of lumber are given in Table 5.10 on page 286. The units are pounds/(inches . Using the data in the table: (a) Construct and sketch a 95% confidence ellipse for the pair [IL 1 , IL 2 ] ' , where IL 1 = E(X1 ) and IL 2 = E (X2 ). (b) Suppose IL 1 o = 2000 and IL 2 o = 10,000 represent "typical" values for stiff ness and bending strength, respectively. Given the result in (a), are the data in Table 5.10 consistent with these values? Explain. (c) Is the bivariate normal distribution a viable population model? Explain with reference to Q-Q plots and a scatter diagram. 5.18. A wildlife ecologist measured = tail length (in millimeters) and = wing length (in millimeters) for a sample of = 45 female hook-billed kites. These data are displayed in Table 5.11 on page 286. Using the data in the table: (a) Find and sketch the 95% confidence ellipse for the population means IL l and IL 2 • Suppose it is known that IL l = 190 mm and IL 2 = 275 mm for
n
x
?
x1
x2
n
x2
TABLE 5. 1 0
LU M BER DATA
xl
Xz
(Stiffness: modulus of elasticity)
XI
(Bending strength)
1232 1115 2205 1897 1932 1612 1598 1804 1752 2067 2365 1646 1579 1880 1773
Xz
(Stiffness: modulus of elasticity)
(Bending strength)
1712 1932 1820 1900 2426 1558 1470 1858 1587 2208 1487 2206 2332 2540 2322
7749 6818 9307 6457 10,102 7414 7556 7833 8309 9559 6255 10,723 5430 12,090 10,072
4175 6652 7612 10,914 10,850 7627 6954 8365 9469 6410 10,327 7320 8196 9709 10,370
Source: Data courtesy of U.S. Forest Products Laboratory. TABLE 5 . 1 1
B I R D DATA
xl
Xz
XI
Xz
XI
Xz
(Tail length)
(Wing length)
(Tail length)
(Wing length)
(Tail length)
(Wing length)
191 197 208 180 180 188 210 196 191 179 208 202 200 192 199
284 285 288 273 275 280 283 288 271 257 289 285 272 282 280
186 197 201 190 209 187 207 178 202 205 190 189 211 216 189
266 285 295 282 305 285 297 268 271 285 280 277 310 305 274
173 194 198 180 190 191 196 207 209 179 186 174 181 189 188
271 280 300 272 292 286 285 286 303 261 262 245 250 262 258
Source: Data courtesy of S. Temple. 286
Chap. 5
Exercises
287
male hook-billed kites. Are these plausible values for the mean tail length
and mean wing length for the female birds? Explain.
(b) Construct the simultaneous 95% T2-intervals for f.L t and f.L z and the 95%
Bonferroni intervals for f.L t and f.Lz . Compare the two sets of intervals. What advantage, if any, do the T2-intervals have over the Bonferroni intervals? (c) Is the bivariate normal distribution a viable population model? Explain with reference to Q-Q plots and a scatter diagram. 5.19. Using the data on bone mineral content in Table 1.6, construct the 95% Bon ferroni intervals for the individual means. Also, find the 95% simultaneous T2 -intervals. Compare the two sets of intervals. 5.20. A portion of the data contained in Table 6.6 in Chapter 6 is reproduced immediately below. M I LK TRANSPORTATION-COST DATA
Fuel (x1 ) 16.44 7.19 9.92 4.24 1 1.20 14.25 13.50 13.32 29.1 1 12.68 7.51 9.90 10.25 11.11 12.17 10.24 10.18 8.88 12.34 8.51 26.16 12.95 16.93 14.70 10.32
Capital (x3 ) 12.43 2.70 1 .35 5.78 5.05 5.78 10.98 14.27 15.09 7.61 5.80 3.63 5.07 6.15 14.26 2.59 6.05 2.70 7.73 14.02 17.44 8.24 13.37 10.78 5.16
1 1.23 3:92 9.75 7.78 10.67 9.88 10.60 9.45 3.28 10.23 8.13 9.13 10.17 7.61 14.39 6.09 12.14 12.23 1 1 .68 12.01 16.89 7.18 17.59 14.58 17.00
288
Chap. 5 Inferences about a Mean Vector
5.21.
5.22.
5.23.
5.24.
These data represent various costs associated with transporting milk from farms to dairy plants for gasoline trucks. Only the first 25 multivariate obser vations for gasoline trucks are given. Observations 9 and 21 have been iden tified as outliers from the full data set of 36 observations. (See [2].) (a) Construct Q-Q plots of the marginal distributions of fuel, repair, and cap ital costs. Also, construct the three possible scatter diagrams from the pairs of observations on different variables. Are the outliers evident? Repeat the Q-Q plots and the scatter diagrams with the apparent outliers removed. Do the data now appear to be normally distributed? Discuss. (b) Construct 95% Bonferroni intervals for the individual cost means. Also, find the 95% T2-intervals. Compare the two sets of intervals. Using the Madison, Wisconsin, Police Department data in Table 5.8, con struct individual X charts for x3 = holdover hours and x4 = COA hours. Do these individual process characteristics seem to be in control? (That is, are they stable?) Comment. Refer to Exercise 5.21 . Using the data on the holdover and COA overtime hours, cbnstruct a quality ellipse and a T2 chart. Does the process repre sented by the bivariate observations appear to be in control? (That is, is it stable?) Comment. Do you learn something from the multivariate control charts that was not apparent in the individual X charts? Construct a T2 chart using the data on x 1 = legal appearances overtime hours, x2 = extraordinary event overtime hours, and x3 = holdover overtime hours from Table 5.8. Compare this chart with the chart in Figure 5.8 of Example 5.10. Does plotting T2 with an additional characteristic change your conclusion about process stability? Explain. Using the data on x3 = holdover hours and x4 = COA hours from Table 5.8, construct a prediction ellipse for a future observation x' = (x3 , x4 ) . Remem ber, a prediction ellipse should be calculated from a stable process. Interpret the result.
REFERENCES
12.. AgrBacon-estiS, hone, A. J., and W. K. Fung. "A NewNew GrYoraphik: Johncal MetWihlodey, f1988.or Detecting Single and Mul(1987)tip,le153-162. Outliers in Univariate and Multivariate Data." no. 2 3. BickelSan, P.FrJ.anci, andsco:K.HolA.dDoks um.1977. enD ay, 4. Bishop, Y. M. M., S.Cambr E. Feiindberg, and: TheP. W.MIHolT Prlaend.s , 1975. ge, MA. 5. Demps Dat(B),a vitaeno.rt,hA.e1EMP.(1977), N.AlM.,gor1-38.Laiithmrd,(wandithD.DiB.scusRubision)n..""Maximum Likelihood from Incomplete Categorical Data Analysis.
Applied Statistics,
36,
Mathematical Statistics: Basic Ideas and Selected Top
ics,
Discrete Multivariate Analysis:
Theory and Practice.
Journal of the Royal Statistical Society
39,
Chap. 5 References
289
6. Hart(1l958),ey, H.174-194."Maximum Likelihood Estimation from Incomplete Data." 7. Har(1971)tle,y,783-808. H. and R. R. Hocking. "The Analysis of Incomplete Data." 8. Johns nations Test for Det(1991)ectinIgnsSertituitael Corof Matreolahn,tematioR.n iniA.calMulandSttaitvT.iarstiicLangel astemonogr Samplandeas.ph,"A" EdsLinear. BloCombi 299-313. 9.10. TiRyan,ku, M.T. P.L., and M. Singh. "Robust Statistics forck,TesH. tetina!Newg Mean York:VectJohnors ofWiMulley, t1989.ivari at(1e982),Dis985-1001. tributions. " no. 9 14
Biometrics,
0.
Biometrics,
0.,
Topics in Statistical Dependence.
Statistical Methods for Quality Improvement.
Communications in Statistics-Theory and Methods,
11,
27
CHAPTER
6
Comparisons
of Several
Multivariate Means
6. 1 I NTRODUCTION
The ideas developed in Chapter 5 can be extended to handle problems involving the comparison of several mean vectors. The theory is a little more complicated and rests on an assumption of multivariate norlnal distributions or large sample sizes. Similarly, the notation becomes a bit cumbersome. To circumvent these prob lems, we shall often review univariate procedures for comparing several means and then generalize to the corresponding multivariate cases by analogy. The numerical examples we present will help cement the concepts. Because comparisons of means frequently (and should) emanate from designed experiments, we take the opportunity to discuss some of the tenets of good experimental practice. A design, useful in behavioral studies, is explicitly considered, along with modifications required to analyze
growth curves.
repeated measures
We begin by considering pairs of mean vectors. In later sections, we discuss seve ral comparisons among mean vectors arranged according to levels of treat ment. The corresponding test statistics depend upon a partitioning of the total vari ation into pieces of variation attributable to the treatment sources and error. This partitioning is known as the (MANOVA).
multivariate analysis of variance
290
Sec.
Paired Comparisons and a Repeated Measures Design
6.2
291
6.2 PAIRED COMPARISONS AND A REPEATED MEASURES DESIGN Paired Comparisons
Measurements are often recorded under different sets of experimental conditions to see whether the responses differ significantly over these sets. For example, the efficacy of a new drug or of a saturation advertising campaign may be determined by comparing measurements before the "treatment" (drug or advertising) with treatments can be those after the treatment. In other situations, administered to the same or similar experimental units, and responses can be com pared to assess the effects of the treatments. One rational approach to comparing two treatments, or the presence and or absence of a single treatment, is to assign both treatments to the units (individuals, stores, plots of land, and so forth) . The paired responses may then be analyzed by computing their differences, thereby eliminating much of the influence of extraneous unit-to-unit variation. In the single response (univariate) case, let Xj 1 denote the response to treat ment 1 (or the response before treatment), and let Xj 2 denote the response to treat ment 2 (or the response after treatment) for the jth trial. That is, (Xj 1 , Xj 2 ) are measurements recorded on the jth unit or jth pair of like units. By design, the differences
two or more
same identical
n
j = 1, 2,
... , n
(6-1)
should reflect only the differential effects of the treatments. Given that the differences Dj in (6 - 1) represent independent observations from an N ( 8, 0"3 ) distribution, the variable D - 8 t = --sd/Vn
(6-2)
where D
1 ll
= -- � D1. and s 2 =
1
ll
D
n j£..= l n - 1 1£..�= 1 ( D ) 2 has a t-distribution with n - 1 d.f. Consequently, an a-level test of d
��
.
J
-
(zero mean difference for treatments) versus
(6-3)
292
Chap.
6
Comparisons of Several Multivariate Means
may be conducted by comparing Jt/ with tn_1 (a/2)-the upper 100(a/2)th per centile of a t-distribution with n - 1 d.f. A 100(1 - a)% confidence interval for the mean difference 8 = E(Xj l - Xi2 ) is provided by the statement � d - tn_1 (a/2) Vn :;;;; 8 :;;;; d + tn - l (a/2) Vn (6-4) (For example, see [7].) Additional notation is required for the multivariate extension of the paired comparison procedure. It is necessary to distinguish between responses, two treat ments, and n experimental units. We label the responses within the jth unit as x1j1 = variable 1 under treatment 1 X1. j 2 = variable 2 Under. treatment 1 . . under treatment 1 = variable p x1jp x;j-;-;;---vari�ioleiunder-trea1menf1 X2i2. = variable 2 under.. treatment 2 . . X2i P = variable under treatment 2 and the paired-difference random variables become Di 1 = X1 i 1 - Xz j D.j z = x1 j 2 -. x2 j 2l (6-5) -
�
p
p
p
p
Let Dj [Di 1 , Di2 , ... , DiP ], and assume, for j = 1, 2, E(D) � B �
[ !1
. . . , n,
and Cov(D1) � I,
that (6-6)
If, in addition, D1 , D2 , ... , Dn are independent Np ( o, :Id ) random vectors, infer ences about the vector of mean differences o can be based upon a T2-statistic. Specifically, (6-7) where n 1 �n ( D . - -D ) (D. - -D ) ' (6-8) D = n1 � Di and S d = -n - 1 j� =l i= 1 -
1
1
Sec. Result 6. 1 .
6.2
Paired Comparisons and a Repeated Measures Design
293
Let the differences D 1 , 0 2 , . . . , D11 be a random sample from an
Np (o, :I") population. Then
T2 = n ( D - o) ' S;J1 ( D - o)
is distributed as an [(n - l )p/(n - p)] Fp, n - p random variable, whatever the true o and :I" . If n and n - p are both large, T 2 is approximately distributed as a x; random variable, regardless of the form of the underlying population of differences. Proof. The exact distribution of T 2 is a restatement of the summary in (5-6),
with vectors of differences for the observation vectors. The approximate distribu • tion of T 2 , for n and n - p large, follows from (4-28).
The condition o = 0 is equivalent to "no average difference between the two treatments." For the ith variable, oi > 0 implies that treatment 2 is larger, on aver age, than treatment 1. In general, inferences about o can be made using Result 6.1. Given the observed differences dj = [di 1 , di 2 , .. . , di P ] , j = 1 , 2, . . . , n , corre sponding to the random variables in (6-5), an a-level test of H0 : o = 0 versus H1 : o ¥= 0 for an NP(o, l:d) population rejects H0 if the observed
T2 - n -d , S"- cd
>
_
(n - 1)p p) Fp, n - p (a) (n
where Fp, n -p (a) is t�e upper (100a) th percentile of an F-distribution with p and n - p d.f. Here d and S" are given by (6-8) . A 100 (1 - a)% confidence region for o consists of all o such that -
-
( d - o) , S"-1 ( d - o)
�
(n - 1)p .-(--� .) FP n-p (a) n n - p
(6-9)
·
Also, 100 (1 - a)% simultaneous confidence intervals for the individual mean differences 8; are given by 0; .
+ . d; -
. -:-• - /sf, /r-(n-----,.. -1p))-p-Fp n - p. (a) \f-;:
\j
(n
_
(6-10)
,
where d; is the ith element of d and s� is the ith diagonal element of Sd. For n - p large, [(n - 1)p/(n - p)] Fp, n - p (a) == x; (a) and normality need not be assumed. The Bonferroni 100 ( 1 - a)% simultaneous confidence intervals for the individual mean differences are where tn _ 1 ( a/2p) is the upper 100 ( a/2p) th percentile of wit4 n - 1 d.f.
(6-10a) a
t-distribution
294
Chap.
6
Comparisons of Several Multivariate Means
Example 6. 1
(Checking for a mean difference with paired observations)
Municipal wastewater treatment plants are required by law to monitor their discharges into rivers and streams on a regular basis. Concern about the reliability of data from one of these self-monitoring programs led to a study in which samples of effluent were divided and sent to two laborato ries for testing. One-half of each sample was sent to the Wisconsin State Laboratory of Hygiene, and one-half was sent to a private commercial lab oratory routinely used in the monitoring program. Measurements of bio chemical oxygen demand (BOD) and suspended solids (SS) were obtained, for n = 11 sample splits, from the two laboratories. The data are displayed in Table 6 . 1 . TABLE 6 . 1
Sample j
EFFLU ENT DATA
Commercial lab x1 j 1 (BOD) x 1 j 2 (SS)
1
18 8 11 34 28 71 43 33 20
6
7 8 9 10 11
15 13
25 28 36 35 15 44 42 54 34 29 39
27 23 64 44 30 75 26 1 24 54 30 14
6 6
2 3 4 5
State lab of hygiene x2j 2 (SS) x2j 1 (BOD)
22 29 31 64 30 64 56 20 21
Source: Data courtesy of S. Weber. Do the two laboratories ' chemical analyses agree? If differences exist, what is their nature? The T 2-statistic for testing H0 : o ' = [ 8 1 , 82 ] = [0 , 0] is constructed from the differences of paired observations:
d 1 j = x 1 j 1 - x2 j l
- 1 9 -22 - 18 -27 -4
12
10
42
15 - 1
- 10 - 14 17 11
9
4 - 19
-4 60 - 2 1 0
-7
Sec.
6.2
Here d =
Paired Comparisons and a Repeated Measures Design
[ �: ] [ -9.13.2376 ] '
sd =
[ 199.26 88.38 ] 88.38 418. 6 1
295
- .0012 ] [ - 9.36 ] - 13 .6 T2 = 1l [-9 ·36' 13 ·27] [ -..00055 012 .0026 13.27 (. 05) . p,nand- p conclude [p(n - 1)/(n - p)]F Taking a = .0=5, we find Tthat 2 = H0 reject we 7, 4 9. > 6 13. [2(10)/9]F • (.05) 9.47. Sincedifference between the measurements of the that there 2is9 a nonzero meanfrom of the data, that the commer two laboratories. It appears, BODinspection measurements and higher SS measure cial lab tends to produce lower ments than the State Lab of Hygiene. The 95% simultaneous confidence intervals for the mean differences 81 and 82 can be computed using (6-10). These intervals are 199.-2-6 (n - 1)p F (a) sd, = -9. 3 6 9. 4 7 p p ::!: v ,n --;,__ � 11 � (n p) � or ( -22.46, 3.74) 418.61 or ( -5. 7 1, 32.25) 8 2 : 13.27 ::!: v �ll The 95%= simultaneous confidence intervals include zero, yet the hypothesis H0 : B 0 was rejected at the 5% level. What are we to conclude? The evidence points towards real differences. The point B = 0 falls out side the 95% confidence region for B (see Exercise 6.1), and this result is con sistent with the T2-test. The 95% simultaneous confidence coefficient applies to the entire set of intervals that could be constructed for all possible linear combinations of the form a 8 + a 2 8 2• The particular intervals correspond ing to the choices (a1 = 1, a12 1= 0) and (a 1 = 0, a 2 = 1) contain zero. Other choices of a1 and a2 will produce simultaneous intervals that do not contain zero. (If the hypothesis H0 : B = 0 were not rejected, then all simultaneous intervals would include zero. ) The Bonferroni simultaneous intervals also cover zero. (See Exercise 6. 2 . ) Our analysis assumed a normal distribution for the Dj . In fact, the situ ation is further complicated by the presence of one or, possibly, two outliers. (See Exercise 6. 3 . ) The numerical results of this example illustrate an unusual circumstance that can occur when making inferences. These data can be transformed to data more nearly normal, but with such a small sample, it is difficult to remove the effects of the outlier(s). (See Exercise 6. 4 . ) • and
�
�
r,::-;;::. ':J.4 /
r,::-;;::.
296
Chap.
6
Comparisons of Several Multivariate Means
The experimenter in Example 6.1 actually divided a sample by first shaking it and then pouring it rapidly back and forth into two bottles for chemical analysis. This was prudent because a simple division of the sample into two pieces obtained by pouring the top half into one bottle and the remainder into another bottle might result in more suspended solids in the lower half due to settling. The two labora tories would then not be working with the same, or even like, experimental units, and the conclusions would not pertain to laboratory competence, measuring tech niques, and so forth. Whenever an investigator can control the assignment of treatments to exper imental units, an appropriate pairing of units and a randomized assignment of treatments can enhance the statistical analysis. Differences, if any, between sup posedly identical units must be identified and most-alike units paired. Further, a random assignment of treatment 1 to one unit and treatment 2 to the other unit will help eliminate the systematic effects of uncontrolled sources of variation. Ran domization can be implemented by flipping a coin to determine whether the first unit in a pair receives treatment 1 (heads) or treatment 2 (tails). The remaining treatment is then assigned to the other unit. A separate independent randomiza tion is conducted for each pair. One can conceive of the process as follows: Experimental Design for Paired Comparisons
Like pairs of experimental units
3
2
D
D
{6
D
D
Treatments I and 2 assigned at random
Treatments I and 2 assigned at random
t
t
t
Treatments I and 2 assigned at random
n
D · · · D
• • •
• • •
t
Treatments I and 2 assigned at random
We2 conclude our discussion of paired comparisons by noting that d and S and hence T , may be calculated from the full-sample quantities and S. Here i is the 2p 1 vector of sample averages for the p variables on the twoi treatments given by (6-11) and S is the 2p 2p matrix of sample variances and covariances arranged as d,
X
x
s
l l sll
s 1z
Sz t
S zz
(p X p) ( p X p)
( p X p) (p Xp)
(6-12)
Sec.
6.2
Paired Comparisons and a Repeated Measures Design
297
The matrix S 1 1 contains the sample variances and covariances for the p variables on treatment 1. Similarly, S contains the sample variances and covariances com puted for the p variables on 22treatment 2. Finally, S 1 2 = S� 1 are the matrices of sam ple covariances computed from observations on pairs of treatment 1 and treatment 2 variables. Defining the matrix 0 .. .. .. 0 i -1 - . . . 1 0. i 0 1 (6-13) (p X 2p) 0 ... 1 i 0 0 ... i (p + 1) st column we can verify (see Exercise 6.9) that j = 1, 2, . . . , n (6-14) d = Ci and Sd = CSC' Thus, (6-15) and it is not necessary first to calculate the differences d1 , d2 , . . . , d11 • On the other hand, it is wise to calculate these differences in order to check normality and the assumption of a random sample. Each row c! of the matrix C in (6-13) is a contrast vector, because its elements sum to zero. Attention is usually centered on contrasts when comparing treatments. Each contrast is perpendicular to the vector 1' = [1, 1, ... , 1] since c! 1 = 0. The component 1'xj , representing the overall treatment sum, is ignored by the test sta tistic T2 presented in this section. c
[!
I
0 ...
I
• •
: I I
j]
A Repeated Measures Design for Comparing Treatments
Another generalization of the univariate paired t-statistic arises in situations where treatments are compared with respect to a single response variable. Each subject or experimental unit receives each treatment once over successive periods of time. The jth observation is q
j = 1, 2, .. , n .
298
Chap.
6
Comparisons of Several Multivariate Means
where � ; is the response to the ith treatment on the jth unit. The name repeated measures stems from the fact that all treatments are administered to each unit.
For comparative purposes, we consider contrasts of the components of
p, = E(Xj ). These could be
or
[:� � :: ] [ � [ ][ _: � IL t � /L q
IL 2 - IL t IL3 - IL z
=
i
- � -� . 0
0
-1 1 0 0 -1 1
··
�] [ �� ]
: -i
0 0 0
�q
= C1 p,
0 0 0
Both C1 and C2 are called contrast matrices, because their - 1 rows are linearly independent and each is a contrast vector. The nature of the design eliminates much of the influence of unit-to-unit variation on treatment comparisons. Of course, the experimenter should randomize the order in which the treatments are presented to each subject. When the treatment means are equal, C1 p, = C2 p, = 0. In general, the hypothesis that= there are no differences in treatments (equal treatment means) becomes Cp, 0 for any choice of the contrast matrix C. Consequently, based on the contrasts Cxj in the observations, we have means Ci and covariance matrix CSC', and we test C p, = 0 using the T 2-statistic IL q
IL q - t
0 0
0 0 0
q
Sec.
6.2
Paired Comparisons and a Repeated Measures Design
299
It can be shown that T2 does not depend on the particular choice of C. 1 A confidence region for contrasts Cp,, with the mean of a normal popula tion, is determined by the set of all Cp, such that p, n (Cx- - Cp,) 1 (CSC 1 ) (Cx- - Cp,) (n(n- 1) ( q -1)1) Fq - ! , n - q + ! ( a ) (6-17) q where i and S are as defined in (6-16). Consequently, simultaneous 100(1 - a) % confidence intervals for single contrasts C 1 p, for any contrast vectors of interest are given by (see Result 5A.1) �(n - 1) (q - 1) Fq - ! , n - q + ! ( a) 'J� (6-18) c 1 p, : c 1 _X ± ----;;(n q 1) -l
�
_
_
+
+
Example 6.2 (Testing for equal treatments in a repeated measures design)
Improved anesthetics are often developed by first studying their effects on animals. In one study, 19 dogs were initially given the drug pentobarbital. Each dog was then administered carbon dioxide ( C02 ) at each of two pres sure levels. Next, halothane (H) was added, and the administration of (C02 ) was repeated. The response, milliseconds between heartbeats, was measured for the four treatment combinations: Present
Halothane Absent
Low
High
C0 2 pressure
Table 6.2 contains the four measurements for each of the 19 dogs, where Treatment 1 = high C02 pressure without H Treatment 2 = low C02 pressure without H Treatment 3 = high C02 pressure with H Treatment 4 = low C02 pressure with H 1 Any pair of contrast matrices C1 and C2 must be related by C1 = BC2 , with B nonsingular. This follows because each C has the largest possible number, q - 1 , of linearly independent rows, all per pendicular to the vector 1. Then (BC2 ) ' (BC2 SC� B ' ) - 1 (BC2 ) = C� B ' (B ' ) - 1 (C2 SC; ) - 1 B - t oc2 = 2 1 C� (C2SC� ) - C2 , so T computed with C2 or C1 = BC2 gives the same result.
300
Chap.
6
Comparisons of Several Multivariate Means TABLE 6.2 SLEEPI NG-DOG DATA
Dog 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
1 426 253 359 432 405 324 310 326 375 286 349 429 348 412 347 434 364 420 397
Treatment 2 3 609 556 236 392 433 349 431 522 426 513 438 507 312 410 326 350 447 547 286 403 382 473 410 488 377 447 473 472 326 455 458 637 367 432 395 508 556 645
4 600 395 357 600 513 539 456 504 548 422 497 547 514 446 468 524 469 531 625
Source: Data courtesy of Dr. J. Atlee. We shall analyze the anesthetizing effects of C02 pressure and halothane from this repeated-measures design. There are three treatment contrasts that might be of interest in the experiment. Let JL 1 , JL 2 , JL3 , and JL 4 correspond to the mean responses for treatments 1, 2, 3, and 4, respectively. Then Halothane contrast representing the difference between the presence and ( JL3 + JL 4 ) - ( JL 1 + JL 2 ) absence of halothane ( C0between 2 contrast representing the difference ) ( JL 1 + JL3 ) - ( JL 2 + JL 4 ) high and low C02 pressure Contrast representing the influence ( JL 1 + JL 4 ) of halothane on C02 pressure differences ( JL 2 + JL3 ) ( H - C0 2 pressure "interaction" )
(= = (=
)
)
Sec.
6.2
Paired Comparisons and a Repeated Measures Design
301
With p! = [JL 1 , JLz , JL3 , JL4] , the contrast matrix C is -� = 1 =- �1 - �1 �1 The data (see Table 6.2) give =
[!�!:�! ]
479.26 and 502.89 It can be verified that 209. 3 1 C i = -60. 05 ; -12.79 and i
[
T2 =
[
c
]
S
=
-
[�� : �
]
7963.14 2943.49 5303.98 6851.32 2295.35 4065.44 4499.63 4878
CSC '
=
[
J
9432.32 1098.92 927.62 1098.92 5195.84 914.54 927.62 914.54 7557.44
n (C i ) ' (CSC ' ) - 1 (C i )
=
19(6.11) = 116
]
With a = .05, 18(3) (3.24) = 10. 94 18(3) F3, 6 ( . 05) = � (n - 1 ) (q - 1) = F a n � ) ( l + l , q q q 1) (n From (6-16), T2 = 116 > 10. 94, and we reject H0 : Cp. = 0 (no treatment effects). To see which of the contrasts are responsible for the rejection of H0 we construct 95% simultaneous confidence intervals for these contrasts. From, (6-18), the contrast JL 2 ) = halothane influence JL4 ) - (JL 1 c { IL = (JL 3 is estimated by the interval 18(3) F3' 1 6 ( .05) �c{ Sc1 = 209.31 10 94 �9432.32 16 19 19 = 209. 3 1 ± 73. 7 0 where c{ is the first row of C. Similarly, the remaining contrasts are estimated by _
1
+
+
+
+ � r;-;;-;;; v 1U.�I4
302
Chap.
6
Comparisons of Several Multivariate Means
C02 pressure influence =
- 60.05 ± ViD.94 � 51��.84 - 60.05 ± 54.70
(JL1 + JL3 ) - ( JL 2 + JL4 ) :
H-C02 pressure "interaction" = ( JL 1 + JL4 ) - ( JL 2 + JL3 ) :
7557.44 = - 12.79 ± 65 .97 - 12.79 ± .. � �19v 1U.�4
The first confidence interval implies that there is a halothane effect. The presence of halothane produces longer times between heartbeats. This occurs at both levels of C02 pressure, since the H - C02 pressure interaction con trast, (JL1 + JL4 ) - (J.Lz - JLJ ) , is not significantly different from zero. (See the third confidence interval.) The second confidence interval indicates that there is an effect due to C0 2 pressure: The lower C02 pressure produces longer times between heartbeats. Some caution must be exercised in our interpretation of the results because the trials with halothane must follow those without. The apparent H-effect may be due to a time trend. (Ideally, the time order of all treatments • should be determined at random.)
(6-16) (6-16).
The test in is appropriate when the covariance matrix, Cov (X) = I, cannot be assumed to have any special structure. If it is reasonable to assume that I has a particular structure, tests designed with this structure in mind have higher power than the one in (For I with the equal correlation structure see a discussion of the "randomized block" design in or
(8-14),
[10] [16] . )
6.3 COMPARING MEAN VECTORS FROM TWO POPULATIONS
A T 2 -statistic for testing the equality of vector means from two multivariate pop ulations can be developed by analogy with the univariate procedure. (See for a discussion of the univariate case.) This T 2 -statistic is appropriate for comparing responses from one set of experimental settings (population 1) with indepen dent responses from another set of experimental settings (population The com parison can be made without explicitly controlling for unit-to-unit variability, as in the paired-comparison case. If possible, the experimental units should be randomly assigned to the sets of experimental conditions. Randomization will, to some extent, mitigate the effect of unit-to-unit variability in a subsequent comparison of treatments. Although some precision is lost relative to paired comparisons, the inferences in the two-popula tion case are, ordinarily, applicable to a more general collection of experimental units simply because unit homogeneity is not required.
[7]
2).
Sec.
6.3
Comparing Mean Vectors from Two Populations
303
1
Consider a random sample of size n 1 from population and a sample of size n2 from population 2. The observations on p variables can be arranged as: Summary statistics
Sample (Population 1) Xu • X 1 2 • . . · • x l n , (Population 2) X 2 1 • X zz , · · · • X zn,
In this notation, the first subscript-1 or 2-denotes the population. We want to make inferences about (mean vector of population 1) - (mean vector of population 2) = p,1 - p, 2 • For instance, we shall want to answer the ques tion, Is p, 1 = p, 2 (or, equivalently, is p, 1 l h = 0)? Also, if p, 1 - p, 2 "# 0, which component means are different? With a few tentative assumptions, we are able to provide answers to these questions. -
Assumptions Concerning the Structure of the Data
1. The sample X u , X 1 2 , . . . , X 1 11 , , is a random sample of size n 1 from a p-vari ate population with mean vector p, 1 and covariance matrix I1 • 2. The sample X 2 1 , X 22 , . . . , X 2 11, , is a :random sample of size n2 from a p-vari ate population with mean vector p, 2 and covariance matrix I 2 • 3. Also, X u , X 1 2 , . . . , X 1 11 , are independent of X 2 1 , X 22 , . . . , X 2 11 , .
(6-19)
We shall see later that, for large samples, this structure is sufficient for mak ing inferences about the p X 1 vector p, 1 - p, 2 . However, when the sample sizes n 1 and n2 are small, more assumptions are needed. Further Assu m ptions when
n1
and
n2
Are Small
1. Both populations are multivariate normal. 2. Also, I1 = I 2 (same covariance matrix).
(6-20)
The second assumption, that I 1 = I2 , is much stronger than its univ�riate coun terpart. Here we are assuming that several pairs of variances and covariances are nearly equal.
304
Chap.
Comparisons of Several Multivariate Means
6
nl
When I 1 = I2 = I, j�= l (x 1 i - x1 ) (x 1 i - x1 )' is an estimate of (n 1 - 1)I and of (n2 - 1)I. Consequently, we can pool � (x - x 2 ) (x 2i - x 2 )' is an estimate . j = l 2i the information in both samples in order to estimate the common covariance I. We set � (x 1 - i1 ) ( x 1i - x1 )' + � (x 2i - x2 ) ( x 2i - x2 )' j= l i j= 1 s pooled = n 1 + n2 - 2 (6-21) Since j�= l (x 1 i - x1 ) (x 1 i - id' has n1 - 1 d.f. and j�= l (x2i - x 2 ) (x2i - x 2 )' has n2 - 1 d.f., the divisor (n1 - 1) + (n2 - 1) in (6-21) is obtained by combining the two component degrees of freedom. [See (4-24).] Additional support for the pool ing procedure comes from consideration of the multivariate normal likelihood. (See Exercise 6.11.) To test the hypothesis that p - p = 80 a specified vector, we consider the squared statistical distance from x11 - x 22 to 80 •, Now, E(X 1 - X 2 ) = E(X 1 ) - E( X 2 ) = p 1 - p2 Since the independence assumption in (6-19) implies that X 1 and X 2 are indepen dent and thus Cov(X 1 , X 2 ) = 0 (see Result 4. 5 ), by (3-9), it follows that Cov(X1- - -X 2 ) = Cov(X- 1 ) + Cov(X- 2 ) = -nt1 I + -n1z I = ( -n1t + -n1z ) I (6-22) Because S pooted estimates I, we see that (�1 + �J s pooled is an estimator of Cov(X 1 - X 2 ). The likelihood ratio test of Ho : IL t - IL z = Bo is based on the square of the statistical distance, T2, and is given by (see [1]), Reject H0 if 1 yz = ( x t - X z - Bo ) ' [ (� + �J spoote d r ( x t - X z - Bo ) > c z (6-23) n2
�
�
�------------------�------------------
nl
n2
t
Sec.
6.3
Comparing Mean Vectors from Two Populations
305
where the critical distance c 2 is determined from the distribution of the two-sam ple T 2-statistic. Result 6.2. If X 1 1 , X 1 2 , . . . , X 1 11 1 is a random sample of size n 1 from NP (p.. 1 , :I) and X 2 1 , X 22 , . . . , X 2 112 is an independent random sample of size n2 from Np ( f.L2 , :I) , then
is distributed as
(
P [ (x l - X2 - ( P.. t - P2 )) ' [ ( �, Consequently,
+
2)p Fp, 111 + 112 - p - 1 - p - 1)
( n l + nz n , + n2
-
1 l �J spoo cctr (X l - X 2 - (p.. 1 - p..2 ))
�
c2
]=1-a (6-24)
where
Proof. X 1 - X2
We first note that
= -n1J X 1 1 + -n1l X 1 2 + . . . + -n1, X I n
I
1
1 1 - - X 2 1 - - X 22 - . . . - - X 2n2 n n n
2
z
z
is distributed as
c 1 = c2 = . . · = C111 = 1 /n 1 and C11 1 + 1 = C111 + 2 = = C111 +n2 = (4-23), / (n1 - 1 )S1 is distributed as W111 _ 1 (I) and (n2 - 1)S 2 as W112 _ 1 (I) By assumption, the X 1 / s and the X 2 / s are independent, so ( n 1 - 1)S1 and ( n2 - 1 ) S 2 are also independent. From (4-24), (n1 - 1)S1 + (n 2 - 1 ) S 2 is then dis tributed as W111 + 112 _ 2 ( I ) . Therefore, 4.8,
with by Result to According - 1 n2 •
..·
306
Chap. 6
(
Comparisons of Severa l M u ltivariate Means
)(
) (
I = multivariate normal ' Wishart random matrix - multivariate normal random vector
d.f.
random vector
)
which is the T2 -distribution specified in for the relation to F.]
•
(5-8), with n replaced by n1 + n2 1. [See (5-5) We are primarily interested in confidence regions for p 1 - p 2 . From (6-24), we conclude that all p1 - p within squared statistical distance c 2 of x 1 - x con -
2 2 stitute the confidence region. This region is an ellipsoid centered at the observed difference x 1 - x 2 and whose axes are determined by the eigenvalues and eigen vectors of s pooled (or s;;oled ) . Example 6.3
(Constructing a confidence region for the difference of two mean vectors)
Fifty bars of soap are manufactured in each of two ways. Two characteristics, X1 = lather and X2 = mildness, are measured. The summary statistics for bars produced by methods and are
1 2 [8.3 ] 4.1 ' [10.2 ] Xz 3. 9 '
-
95%
=
Obtain a confidence region for p 1 - p 2 . We first note that S1 and S 2 are approximately equal, so that it is rea sonable to pool them. Hence, from
Also,
(6-21), 49 49 [ 2 1 Spooled = 98 S I + 98 S z = 1 5 J
[ -1..29 ] so the confidence ellipse is centered at [ -1. 9 , .2] ' . The eigenvalues and eigen vectors of s pooled are obtained from the equation 0 = ! Spooled A I J = 1 2 -1 A 5 1 A I = A 2 - 7A + 9 X I - Xz =
-
_
Sec.
6.3
Comparing Mean Vectors from Two Populations
and Consequently, = ± v' so = the corresponding eigenvectors, e 1 and e 2 , determined from
A (7
A1 5. 303
49 - 36)/2.
are e1
=
i=
[ ..299057 ]
and e z =
307
A2 = 1. 697, and
1, 2
[ .957 ] - .290
6.2, (__!_n n2__!_) = ( 501 501 ) { 98)(97)(2) F2' 97 ( • 05) -- •25 1 since F2, 9 7 (. 05) = 3.1. The confidence ellipse extends
By Result
+
cz
+
1.15
.65
units along the eigenvector e ; , or units in the e 1 direction and units in the direction. The 95% confidence ellipse is shown in Figure Clearly, = 0 is not in the ellipse, and we conclude that the two methods of 1 manufacturing soap produce different results. It appears as if the two processes produce bars of soap with about the same mildness but those • from the second process have more lather ).
e2 p. - p.2
6.1.
(X1
(X2),
2.0
- 1 .0
Figure 6 . 1
P.l - P.2 .
95% confidence ellipse for
308
Chap. 6
Com parisons of Several M u ltivariate Means
Simultaneous Confidence Intervals
It is possible to derive simultaneous confidence intervals for the components of the vector p 1 - p 2 . These confidence intervals are developed from a consideration of all possible linear combinations of the differences in the mean vectors. It is assumed that the parent multivariate populations are normal with a common covariance I . Result 6.3.
/
Let c2 = [(n1 + n2 - 2 ) p (n1 + n2 - p
With probability 1 - a,
(
-
1 ) ] Fp , n1 + nz - p - 1 ( a ) .
)
/ + l. spooled a a' ( X 1 - X2) ± c a' l. \j n t nz will cover a' (!L 1 - p2) for all a. In particular f-t 1 i - f-t z i will be covered by (Xl i - Xz J ± C
1 ( _!_ + l.nz ) sii, pooled
\j n l
for i = 1, 2, . . . , p
Proof. Consider univariate linear combinations of the observations
given by a'X 1 j = a 1 X1j1 + a 2 X1j2 + . . . + apXtjp and a'X 2j = a 1 X2 j 1 + a 2 X2j2 + � · + apXZ jp · T� se linear combinations have�ample means and covariances a'X 1 , a'S 1 a and a'X 2 , a'S 2 a, respectively, where X 1 , S1 , and X 2, S2 are the mean and covariance statistics for the two original samples. ( See Result 3.5.) When both parent populations have the same covariance, sf' a = a'S 1 a and s� a = a'S 2 a are both estimators of a' I a, the common population variance of the li�ear combina tions a'X 1 and a'X 2 • Pooling these estimators, we obtain
2 sa, pooled
(n 1 - 1) sL + (n2 - 1 ) s�. a (n 1 + n2 - 2)
(6-25) To test H0 : a' (p 1 - p 2) = a' 80 , on the basis o f the a'X 1j and a'X 2j , we can form the square of the univariate two-sample £-statistic (6-26)
Sec.
- -
6.3
Comparing Mean Vectors from Two Populations
t; :,.;;; ( X I - X 2 - ( JL I
a
= Tz
(2-50),
- JLz ) ) '
= ( X 1 - X 2 - (p1 - JL2 ) ) and
[( n1 + n1 ) Spooled ]- I ( -X I - -X z - (JLI - JLz ) )
According to the maximization lemma with d B (1/n 1 + 1/n2 ) S pooled in
=
309
1
z
= P [ T2 :,.;;; c2] = P [t; :,.;;; c2, for all a] = [ I a ' ( X I - X z ) - a ' (p i - JLz ) I :,.;;; c �a ' (�I + � J s pooled a for all a] where c 2 i s selected according to Result 6. 2 . Remark. For testing H0 : p 1 - p 2 = 0, the linear combination a' ( i 1 - i 2 ),
for all
(1 - a )
¥-
0. Thus,
p
•
with coefficient vector a ex: s;�oted ( i 1 - i 2 ), quantifies the largest population dif ference. That is, if T 2 rejects H0 , then a' ( i 1 - i 2 ) is likely to have a nonzero mean. Frequently, we try to interpret the components of this linear combination for both subject matter and statistical importance. Example 6.4 (Calculating simultaneous confidence i ntervals for the differences in mean components)
= 45
= 55
Samples of sizes n 1 and n2 were taken of Wisconsin homeowners with and without air-conditioning, respectively. (Data courtesy of Statistical Laboratory, University of Wisconsin.) Two measurements of electrical usage (in kilowatt hours) were considered. The first is a measure of total on-peak consumption (X1 ) during July, and the second is a measure of total off-peak consumption (X2 ) during July. The resulting summary statistics are
4 = [ 204. 556.6 J ' -Xz = [ 130.0 ' 355.0 J XI
13825.3 23823.4 = [ 23823. 4 73107. 4 J ' [ 8632.0 19616.7 ] s 2 = 19616.7 55964. 5 ' sl
n1
= 45
n2
= 55
(The off-peak consumption is higher than the on-peak consumption because there are more off-peak hours in a month.) Let us find 95% simultaneous confidence intervals for the differences in the mean components. Although there appears to be somewhat of a discrepancy in the sample variances, for illustrative purposes we proceed to a calculation of the pooled sample covariance matrix. Here
310
Chap.
6
Comparisons of Several Multivariate Means
S pooled =
- 1 S + n2 - 1 S + n 2 - 2 1 nl + n 2 - 2 2 -
nl n,
_
and
10963.7 [ 21505.5
21505.5 63661.3
]
= (2.02) (3.1) = 6.26 With p { - p� = [ JL 1 1 - JL 2 1 , JL 1 2 - JL 22 ] , the 95% simultaneous confidence intervals for the population differences are 1L 1 1 - JL2 1 : (204.4 - 130.0) ± V6.26 or
�( 415 + 5�) 10963.7
�( 415 + 515 ) 63661.3
21.7 ,;; IL1 1 - JL 2 1 ,;; 127.1 JL 1 2 - JL22 : (556.6 - 355.0) ± V6.26 or
74.7 ,;; JL 1 2 - JL 22 ,;; 328.5 We conclude that there is a difference in electrical consumption between those with air-conditioning and those without. This difference is evident in both on-peak and off-peak consumption. The 95% confidence ellipse for p 1 - p 2 is determined from the eigen value-eigenvector pairs .-\ 1 = 71323.5, e { = [.336, .942] and .-\ 2 = 3301 .5, e� = [.942, - .336]. Since
and
we obtain the 95% confidence ellipse for p1 - p 2 sketched in Figure 6.2 on page 311. Because the confidence ellipse for the difference in means does not cover 0 ' = [0, 0], the T2 -statistic will reject H0 : p 1 - p 2 = 0 at the 5% level.
Sec.
6.3
Comparing Mean Vectors from Two Populations
p;
-
Jt�
Figure 6.2
=
31 1
95% confidence ellipse for (p. l l - IL2 t 1 IL 1 2 - P-22 ) .
The coefficient vector for the linear combination most responsible for rejec • tion is proportional to 2 . (See Exercise 6.7.)
S�o1oled (x1 - x ) The Bonferroni 100(1 - a)% simultaneous confidence intervals for the p pop
f.Lz; : ( :Xli - Xz; ) ± t,1+,2 - z ( 2: ) �(�1 + � J sii, pooled where t 1 _ (a/2p) is the upper 100(a/2p)th percentile of a !-distribution with n 1 + n2 ,-1 + 22 d.f.2 I1 I2 When I 1 I2 , we are unable to find a "distance" measure like T 2 , whose distrib ution does not depend on the unknowns I 1 and I2 . Bartlett ' s test [3] is used to test the equality of I1 and I2 in terms of generalized variances. Unfortunately, the con
ulation mean differences are Jl-1 ; -
The Two-Sample Situation when
=F
=F
clusions can be seriously misleading when the populations are nonnormal. Non normality and unequal covariances cannot be separated with Bartlett ' s test. A method of testing the equality of two covariance matrices that is less sensitive to the assumption of multivariate normality has been proposed by Tiku and Balakr ishnan [17]. However, more practical experience is needed with this test before we can recommend it unconditionally. We suggest, without much factual support, that any discrepancy of the order = 4 a2 , ;;, or vice versa, is probably serious. This is true in the univariate case. ;; a1 , The size of the discrepancies that are critical in the multivariate situation probably depends, to a large extent, on the number of variables A transformation may improve things when the marginal variances are quite different. However, for n 1 and n2 large, we can avoid the complexities due to unequal covariance matrices.
p.
312
Chap.
6
Comparisons of Several Multivariate Means
Resu lt 6.4. Let the sample sizes be such that n1 - p and n 2 - p are large. Then, an approximate 100 (1 - a)% confidence ellipsoid for p1 - p 2 is given by all IL 1 - /L z satisfying
where x; (a) is the upper (100a) th percentile of a chi-square distribution with p d.f. Also, 100 (1 - a)% simultaneous confidence intervals for all linear combina tions a' (p1 - p 2 ) are provided by
(
)
a' (p1 - p2 ) belongs to a' (i1 - i 2 ) ± Vx; (a) /a' _!_ s 1 + _!_ S 2 a nz "V n l Proof
From
(6-22) and (3-9),
E ( X 1 - X 2 ) = P- 1 - 1L2
and
By the central limit theorem, X 1 - X 2 is nearly NP [p1 - p 2 , n 1 1 I1 + n2 1 I 2 ] . If I1 and I 2 were known, the square of the statistical distance from X 1 - X 2 to IL 1 - p 2 would be This squared distance has an approximate x;-distribution, by Result 4.7. When n1 and n2 are large, with high probability, s l will be close to I I and s 2 will be close to I 2 . Consequently, the approximation holds with S1 and S 2 in place of I1 and I 2 , respectively. The results concerning the simultaneous confidence intervals follow from • Result 5A. l . Remark.
__!_ S 1 n1
+
If n 1 = n2 = n, then (n - 1)/(n + n - 2) = 1/2, so __!_ S 2 = l_ (S 1 + S 2 ) = (n - 1) S 1 + (n - l ) S 2 l + l n n n+nn n2
2
(
)
With equal sample sizes, the large sample procedure is essentially the same as the procedure based on the pooled covariance matrix. (See Result In one dimen-
6. 2 . )
Sec.
6.3
Comparing Mean Vectors from Two Populations
sion, it is well known that the effect of unequal variances is least when n 1 greatest when n 1 is much less than n2 or vice versa.
31 3 =
n2
and
(Large sample prdcedures for inferences about the difference in means)
Example 6.5
13825.3 [ 23823.4
8632.0 ] + 551 [ 19616.7
We shall analyze the electrical-consumption data discussed in Example 6.4 using the large-sample approach. We first calculate 1 n1
81
+ n12 S z
=
-
1 45
[ 464.17 886.08
23823.4 73107.4
886.08 2642.15
[
]
]
19616.7 55964.5
J
The 95% simultaneous confidence intervals for the linear combinations
a' ( p l - P z )
[1, 0] /L ' ' f.Lz ' 1 f.L 2 f.Lzz
=
[0 1] [ f.Lf.L 11 21 -- f.Lzf.Lzz1 J
and
,
are (see Result 6.4)
74.4 ± \15.99 V464.17
f.L1 1 - f.L2 1 :
=
f.L1 1 - f.Lz 1
=
f.L1 2 - /Lzz
or (21.7, 127.1)
201.6 ± \15.99 V2642.15 or
(75.8, 327.4)
Notice that these intervals differ negligibly from the intervals in Example 6.4, where the pooling procedure was employed. The T 2-statistic for testing H0 : P 1 - P z = 0 is -' 1 1 Tz = x , - X- z l ' - s I - S z [ x- l - -X z ] n n2
[-
[
]
- 130.0 ] [ 464.17 886.08 ] [ 204.4 - 130.0 ] [ 204.4 556.6 - 355.0 886.08 2642.15 556.6 - 355.0 = [74.4 201.6] (10 - 4 ) [ 59.874 - 20.080 [ 74.4 15 · 66 10.519 J 201.6 J - 20.080 1
I
+
-
I
=
For a = .05, the critical value is xi (.05) = 5.99 and, since T 2 = 15.66 > . 2 = Xz (.05) 5.99, we reJect H0 . The most critical linear combination leading to the rejection of H0 has coefficient vector
314
Chap.
6
Comparisons of Several Multivariate Means A
3
ex
( n1
s
+ s2 1 n2 1 1
)-�(-x 1 - -x 2 ) = (l0 - 4) [ .041 [ .063 J
59 , 874 - 20.080 - 20.080 10.519
] [ 74.4 ] 201.6
The difference in off-peak electrical consumption between those with air-conditioning and those without contributes more than the corresponding difference in an-peak consumption to the rejection of H0 : p1 - p 2 = 0. • A statistic similar to T2 that is less sensitive to outlying observations for small and moderately sized samples has been developed by Tiku and Singh [18] . How ever, if the sample size is moderate to large, Hotelling ' s T 2 is remarkably unaf fected by slight departures from normality and/or the presence of a few outliers.
6.4
COMPARING SEVERAL MULTIVARIATE POPULATION MEANS (ONE-WAY MANOVA)
Often, more than two populations need to be compared. Random samples, col lected from each of g populations, are arranged as Population 1: X 1 1 , X 1 2 , . . . , X 1 n , Population 2: X 2 1 , X 22 , . . . , X 2 n,
(6-27)
Population g: X8 1 , X 8 2 , . . . , X 8 "" MANOVA is used first to investigate whether the population mean vectors are the same and, if not, which mean components differ significantly. Assum ptions about the Structure of the Data for One-way MANOVA
1. X c 1 , X c 2 , , X c" is a random sample of size n e from a population with mean /L c , e = 1, 2: . . . , g. The random samples from different populations are . • .
independent. 2. All populations have a common covariance matrix I. 3 . Each population is multivariate normal.
Condition 3 can be relaxed by appealing to the central limit theorem (Result 4.13) when the sample sizes ne are large. A review of the univariate analysis of variance (ANOV A) will facilitate our discussion of the multivariate assumptions and solution methods.
Sec.
6.4
Comparing Several Multivariate Population Means (One-Way Manova)
315
A Summary of Univariate ANOVA
Xe Xe ... , Xe N(JLe , f.Le f.Lz · ·· f.Lg , f.L, f.L e f.L (JLe - f.L) f.L e JL Te Tc f.Lc - f.L · Tc reparameterization Tc f.L + f.Lc ( C th population ) ( overall ) ( C th population ) (6-28)
n, is a random In the univariate situation, the assumptions are that 1 , 2 , if) population, e = 1, 2, . . . , g, and that the random samples sample from an are independent. Although the null hypothesis of equality of means could be for = = mulated as JL 1 = it is customary to regard as the sum of an over all mean component, such as and a component due to the specific population. For instance, we can write = + or = + where = Populations usually correspond to different sets of experimental conditions, and therefore, it is convenient to investigate the deviations associated with the C th population (treatment). The
(treatment) effect
mean
mean
leads to a restatement of the hypothesis of equality of means. The null hypothe sis becomes
The response gestive form
Xci ' distributed as N(JL + Tc , a2) , can be expressed in the sug Xc i =
f.L
+
(overall mean)
ec i g
Tc
ee i
( treatment ) ( random ) +
effect
(6-29)
error
N(O,
a2 ) random variables. To define uniquely the where the are independent model parameters and their least squares estimates, it is customary to impose the constraint ,L
C=l
ncrc = 0.
Motivated by the decomposition in (6-29), the analysis of variance is based upon an analogous decomposition of the observations,
(observation)
(
-
X
overall sample mean
) ( +
(xc - x)
estimated treatment effect
)
+
(xci - xc) (residual)
(6-30)
f.L, 1- = x) is an estimate of Tc , and (xei - xc ) is ecj · e (xe -
where x is an estimate of an estimate of the error
316
Cha p . 6
Comparisons o f Several M u ltiva riate Means
Example 6.6
(The sum of squares decomposition for u nivariate ANOVA)
Consider the following independent samples. Population 1: 9, 6, 9 Population 2: 0, 2 Population 3: 3, 1, 2
Since, for example, .X3 = (3 + 1 + 2)/3 = 2 and .X = (9 + 6 + 9 + 0 + 2 + 3 + 1 + 2)/8 = 4, we find that
= 4 + (2 - 4) + (3 - 2)
= 4 + ( - 2) + 1
(� � ) (: : ) ( -� -� ) (-� -� )
Repeating this operation for each observation, we obtain the arrays 9
4
+ + 3 1 2 4 4 4 -2 -2 -2 observation = mean + treatment effect +
(xcj )
( .X)
1
4
( .Xc - .X )
1 -1 0 residual
(xcj - .Xc )
The question of equality of means is answered by assessing whether the contribution of the treatment array is large relative to the residuals. (Our esti g mates rc = .Xc - .X of Tc always satisfy 2: nc rc = 0. Under H0 , each rc is an C=l
estimate of zero.) If the treatment contribution is large, H0 should be rejected. The size of an array is quantified by stringing the rows of the array out into a vector and calculating its squared length. This quantity is called the sum of squares (SS). For the observations, we construct the vector y' = [9, 6, 9, 0, 2, 3, 1, 2]. Its squared length is ssobs = 92 + 62 + 92 + 02 + 22 + 3 2 + 1 2 + 22 = 216 Similarly ss mean sstr
= 42 + 42 + 42 + 42 + 42 + 42 + 42 + 42 = 8 (42 ) = 128
= 42 + 42 + 42 + ( - 3) 2 + ( - 3) 2 + ( - 2) 2 + ( - 2) 2 + ( - 2? = 3 (42 ) + 2 ( - 3) 2 + 3 ( - 2) 2 = 78
Sec. 6.4
31 7
Comparing Several Multivariate Popu lation Means (One-Way Manova)
and the residual sum of squares is The sums of squares satisfy the same decomposition, (6-30), as the observa tions. Consequently, or 216 = 128 + 78 + 10. The breakup into sums of squares apportions vari ability in the combined samples into mean, treatment, and residual (error) components. An analysis of variance proceeds by comparing the relative sizes of SS1r and SSres . If H0 is true, variances computed from SS1r and SSres should be approximately equal. The sum of squares decomposition illustrated numerically in Example 6.6 is so basic that the algebraic equivalent will now be developed. Subtracting :X from both sides of (6-30) and squaring gives •
We can sum both sides over j, note that 2: (xei - :Xe ) = 0, and obtain n,
j= l
�
2: (xei - :X ) z = ne ( :Xe - :X ) z
j=l
Next, summing both sides over e we get
�
+ 2: (xej - :Xe ) z j= l
(6-31)
(
) ( ) + ( SSres ) SS tr = total (corrected) SS between (samples) SS within (samples) SS or SScor
g
�
2: 2: xii = (n 1 + n2 f=l j= l ( SSobs ) =
+
··· +
n1)x 2
g
g
�
+ 2: ne ( :Xe - :X ) 2 + 2: 2: (xei - :Xe )2 + + (6-32) f=l
f=l j=l
318
Chap.
6
Comparisons of Several Multivariate Means
(6-32),
In the course of establishing we have verified that the arrays repre senting the mean, treatment effects, and residuals are That is, these arrays, considered as vectors, are perpendicular whatever the observation vector y ' = [x l l , . . . , Xln l • x2 1 • . . . , Xzn, • . . . , Xg n) · Consequently, we could obtain ssres by subtraction, without having to calculate the individual residuals, because ss res = SS obs - SS mean - SS tr " However, this is false economy because plots of the residu als provide checks on the assumptions of the model. The vector representations of the arrays involved in the decomposition also have geometric interpretations that provide the degrees of freedom. For an arbitrary set of observations, let [x1 1 , . . . , x 1 111 , x2 1 , . . . , x2 11, , . . . , xg n , ] = y ' . The observation vector y can lie anywhere in = + n2 + · · · + g dimensions; the mean vector .X1 = [x, . . . , x] ' must lie along the equiangular line of 1, and the treat ment effect vector
orthogonal.
(6-30)
n n1
(.XI
-
x)
1 1 0 0 0 0
}"·
+ ( xz
-
x)
0 0 1 + nz 1 0 0
}
= (:X\
-
n
+ ( xg - x )
0 0 0 0 1 n, 1
}
x ) u 1 + ( x2 - x ) u 2 + · · · + (xg - x ) ug
g
lies in the hyperplane of linear combinations of the vectors u 1 , u 2 , , ug. Since 1 = u 1 + u 2 + · · · + ug, the mean vector also lies in this hyperplane, and it is perpendicular to the treatment vector. (See Exercise Thus, the mean vector has the freedom to lie anywhere along the one-dimensional equiangular line, and the treatment vector has the freedom to lie anywhere in the other g - 1 dimen sions. The residual vector, e = y - ( .X1 ) - [ (.X1 - .X ) u 1 + · · · + (xg - .X ) ug] is perpendicular to both the mean vector and the treatment effect vector and has the = freedom to lie anywhere in the subspace of dimension that is perpendicular to their hyperplane. To summarize, we attribute d.f. to SS mean , d.f. to SS1" and n - g = ( 1 + n2 + · · · + ng ) - g d.f. to SS res · The total number of degrees of freedom is = n 1 + n 2 + · · · + ng. Alternatively, by appealing to the univariate distribution theory, we find that these are the degrees of freedom for the chi-square distributions associated with the corresponding sums of squares. The calculations of the sums of squares and the associated degrees of free dom are conveniently summarized by an ANOVA table.
always
. • .
6. 1 0. )
n n
1
n - (g - 1) - 1 n - g g-1
Sec.
6.4
Comparing Several Multivariate Population Means (One-Way Manova)
321
can be written as
The sum over j of the middle two expressions is the zero matrix, because n,
j�=l (xCi - X c) = 0. Hence, summing the cross product over C and j yields g g g + ( nc ) ) )' cj c ) ' j � � � � C= l j = I (x x xf x C=l (X. x (xe x C=l j�= l (x cj - xc) (x cj - xd
(
11(
total (corrected) sum of squares and cross products
The
) (
treatment (�etween) sum of squares and cross products
) (
�
)
residual (Within) sum of squares and cross products (6-36)
within sum of squares and cross products matrix can be expressed as W=
g
Ill
C=l� j�=l (x ej - xc Hx cj - ic)' (6-37)
where S c is the sample covariance matrix for the C th sample. This matrix is a gen eralization of the S pooled matrix encountered in the two-sample case. It plays a dominant role in testing for the presence of treatment effects. Analogous to the univariate result, the hypothesis of no treatment effects,
(n 1 + n2 - 2)
H0 : T1 = Tz = · · · = Tg = 0 is tested by considering the relative sizes of the treatment and residual sums of squares and cross products. Equivalently, we may consider the relative sizes of the
322
Chap.
Comparisons of Several Multivariate Means
6
residual and total (corrected) sum of squares and cross products. Formally, we summarize the calculations leading to the test statistic in a MANOVA table. MANOVA TABLE FOR COM PARI NG POPULATIO N M EAN VECTORS
Source of variation
Matrix of sum of squares and cross products (SSP)
= CL=g l ne ( ie - x ) ( ie - x ) ' g W = L L (x ej - ie H x ej - ie ) ' C= l j= l B
Treatment Residual (Error) Total (corrected for the mean)
ll(
g B + W = L L (x ej - i ) (x ej - x ) ' C = l j= l n,
Degrees of freedom ( d.f.)
g-1 g L ne - g C=l g L ne - 1 C=l
This table is exactly the same form, component by component, as the ANOVA table, except that squares of scalars are replaced by their vector counterparts. For example, ( .Xe - .X ) 2 becomes ( ie - x ) (i e The degrees of freedom corre spond to the univariate geometry and also to some multivariate distribution theory involving Wishart densities. (See [1].) One test of H0 : T1 T2 · · · Tg 0 involves generalized variances. We reject H0 if the ratio of generalized variances
x )'.
= = = =
A*
lwl
IB + WI
A* =
�� � (xej - ieHx ej - ie ) ' I l �l j� (xej - x ) (x ej - x ) ' l
(6-38)
I
is too small. The quantity I W I / B + W I , proposed originally by Wilks (see [20]), corresponds to the equivalent form (6-33) of the F-test of H0 : no treatment effects in the univariate case. Wilks' lambda has the virtue of being convenient and related to the likelihood ratio criterion. 2 The exact distribution of can be as
2 Wilks' lambda can also be expressed as a function of the eigenvalues of A , A , 1 2 A* =
fi ;�
(1 A)
A*
• • •
, A s of w- 1 B
1 1 + A; where s = min (p, g - 1 ) , the rank of B. Other statistics for checking the equality o f several multivari ate means, such as Pillai's statistic, the Lawley-Hotelling statistic, and Roy's largest root statistic, can also be written as particular functions of the eigenvalues of w- 1 B. For large samples, all of these sta tistics are, essentially, equivalent. (See the additional discussion on page 357.)
Sec.
6.4
Comparing Several Multivariate Population Means (One-Way Manova) D I STRI BUTION OF WI LKS' LAM BDA,
TAB LE 6.3
No. of variables
No. of groups
p=1 p=2
g ;;. 2
p ;;. 1
g
WI/I B
+ wJ
Sampling distribution for multivariate normal data
( Lne - g) C - *A * ) Fg - U:n, - g A g- 1 ( Lne - g - 1 ) e - \lA* ) F2(g - 1). 2(�n, - g - l) g- 1 y'A* ( Lne - p - 1 ) C - *A * ) Fp. ���. - p - t p A ( Lne - p - 2 ) e - \lA* ) Fzp, Z (�n, - p - 2) VA* p �
g ;;. 2
�
=2 g= 3
p ;;. 1
A* = J
323
�
�
derived for the special cases listed in Table 6.3. For other cases and large sample sizes, a modification of due to Bartlett (see [4]) can be used to test H0 • Bartlett (see [4]) has shown that if H0 is true and is large,
A*
Lne = n
- ( n - 1 - (p +2 g) ) ln A * = - ( n - 1 - (p +2 g) ) ln ( I B J w+ Jw J )
(6-39)
has approximately a chi-square distribution with p (g - 1) d.f. Consequently, for Lne = n large, we reject H0 at significance level a if - ( n - 1 - (p +2 g) ) ln ( B J w+ lW I ) (6-40)
J
where x; (g - ) (a) is the upper (100a) th percentile of a chi-square distribution with p(g - 1) d.f. l
Example 6.8 (A Manova table and Wil ks' lambda for testing the equality of three mean vectors)
Suppose an additional variable is observed along with the variable introduced in Example 6.6. The sample sizes are 3, 2, and 3. Arranging the observation pairs X ej in rows, we obtain
n1 = n2 =
n3 =
324
Chap.
6
Comparisons of Several Multivariate Means
[� ] [� ] [ � ] [�] [�] [�] [�] [�]
= [ ! ], and i = [:]
with i 1
x2 = [� ],
X3
= [ � ],
We have already expressed the observations on the first variable a s the sum of an overall mean, treatment effect, and residual in our discussion of uni variate ANOVA. We found that
(observation)
(mean)
( treatment ) effect
(residual)
and
= 128 + 78 + 10 Total SS (corrected) = SS obs - SS mean = 216 - 128 216
88
( =� =� � � -� =� ( ( + + ) ( ) ) ) =
Repeating this operation for the observations on the second variable, we have
! �
7
5
8 9 7
5 5 5
(observation)
(mean)
-1
( treatment ) effect 3
3
3
3
0
1 -1
(residual)
and
= 200 + 48 + 24 Total SS (corrected) = SS obs - SSmean = 272 - 200 272
=
72
These two single-component analyses must be augmented with the sum of entry-by-entry cross products in order to complete the entries in the
Sec.
6.4
Comparing Several Multivariate Population Means (One-Way Manova)
325
MANOVA table. Proceeding row by row in the arrays for the two variables, we obtain the cross product contributions: Mean: 4 (5) + 4 (5)
+
···
+ 4 (5) = 8 (4) (5) = 160
Treatment: 3 (4) ( - 1 ) + 2 ( - 3) ( - 3) + 3 ( - 2) (3) = - 12 Residual: 1 ( - 1 ) + ( - 2) ( - 2) + 1 (3) + ( - 1 ) (2) + Total: 9 (3) + 6 (2) + 9 (7) + 0 (4) +
···
·· ·
+ 0 (- 1) = 1
+ 2 (7) = 149
Total ( corrected ) cross product = total cross product - mean cross product = 149 - 160 = - 1 1 Thus, the MANOVA table takes the following form:
Source of variation
Matrix of sum of squares and cross products
Degrees of freedom
78 - 12 48 - 12
3 - 1 = 2
[
Treatment
[ 101 88 [ - 11
Residual Total ( corrected)
] ] 2� - 11 ] 72
] = [ - 7812
Equation (6-36) is verified by noting that
[
88 - 11 - 11 72
Using (6-38), we get A* =
IWI IB + WI
I
88 - 11 - 11 72
1
- 12 48
3 + 2 + 3 - 3 = 5 7
] + [ 101 241 ]
10 (24) - ( 1 ) 2 239 = = 0385 88 (72) - ( - 1 1 ) 2 6,215 •
326
Chap.
6
Comparisons of Several Multivariate Means
=
=
Since p 2 and g 3, Table 6.3 indicates that an exact test (assuming normality and equal group covariance matrices) of H0 : T1 T2 T3 0 (no treatment effects) versus H1 : at least one -rc # 0 is available. To carry out the test, we compare the test statistic
= = =
( 1 - VA* ) (2-nc - g - 1) = ( 1 - Y.0385) (8 - 3 - 1 ) = 8. 1 9 (g - 1) 3 - 1 VA* Y.0385 with a percentage point of an F-distribution having v1 = 2 (g - 1) = 4 and v2 = 2 ( 2-n c - g - 1) = 8 d.f. Since 8. 1 9 > F4,8 (.01) = 7.01, we reject H0 at the a = .01 level and conclude that treatment differences exist. •
When the number of variables, p, is large, the MANOVA table is usually not constructed. Still, it is good practice to have the computer print the matrices B and W so that especially large entries can be located. Also, the residual vectors
should be examined for normality and the presence of outliers using the techniques discussed in Section 4.6 and 4.7 of Chapter 4. Example 6.9 (A multivariate analysis of Wisconsin n u rsing home data)
The Wisconsin Department of Health and Social Services reimburses nursing homes in the state for the services provided. The department develops a set of formulas for rates for each facility, based on factors such as level of care, mean wage rate, and average wage rate in the state. Nursing homes can be classified on the basis of ownership (private party, nonprofit organization, and government) and certification (skilled nurs ing facility, intermediate care facility, or a combination of the two). One purpose of a recent study was to investigate the effects of owner ship or certification (or both) on costs. Four costs, computed on a per-patient day basis and measured in hours per patient day, were selected for analysis: X1 cost of nursing labor, X2 cost of dietary labor, X3 cost of plant operation and maintenance labor, and X4 cost of housekeeping and laun dry labor. A total of n 516 observations on each of the p 4 cost variables were initially separated according to ownership. Summary statistics for each of the g 3 groups are given below.
=
=
=
=
=
=
=
Sec.
6.4
Comparing Several Multivariate Population Means (One-Way Manova)
Number of observations
Group e = 1 (private) e = 2 (nonprofit) e = 3 (government)
327
[ ]' [ ]' [ ] Sample mean vectors
n1 = 271 n2 = 138
XI =
n3 = 107
2.066 .480 .082 .360
X2 =
2 167 .596 .124 .418
x3 =
2.273 .521 .125 .383
3
2: nc = 516
f=l
[ [
Sample covariance matrices
29 1
[
J J J Source: Data courtesy of State of Wisconsin Department of Health and Social Services. s,
s3 =
- .001 .011 .002 .000 .001 .010 .003 .000
O
.017 .030 .003 - .000 .004 .018 .006 .001
O
261
s2 =
561 .011 .025 .001 .004 .005 .037 .007 .002
O
Since the Sc ' s seem to be reasonably compatible,3 they were pooled [see (6-37)] to obtain VV
[ ! !��
= (n1
1
1)S1
-
8
:
]
+ ( n2 - 1 ) S 2 + ( n3 - 1 ) S 3
8.200 1.695 .633 1.484 9.581 2.428 .394 6.538
3 However, a normal-theory test of H0 : I 1 cance level because of the large sample sizes.
=
I2
=
I3 would reject H0 at any reasonable signifi
328
Chap.
6
Comparisons of Several Multivariate Means
[ l
Also,
2.136 .519 .102 .380
and
� B =
1
(-
- x) ( - x) -
e-'-'= ne Xe - Xe
'
-
[
3.475 1.111 1 .225 . .821 .453 .235 .584 .610 .230 .304
l
To test H0 : T1 = T2 = T3 (no ownership effects or, equivalently, no difference in average costs among the three types of owners-private, nonprofit, and government), we can use the result in Table 6.3 for g = 3. Computer-based calculations give
A* = and
( 2- nc a
p -
p
!WI IB + WI
.7714
2 ) ( 1 - � ) = ( 516 - 4 - 2 ) ( 1 - Y.7714 ) = 17_67 VA*
4
Y.77i4
Let = .01 , so that F2 (4) , Z ( S i o / · 0 1 ) ='= x� (.01 ) /8 = 2.51. Since 17.67 > F8 1 0 2 0 (.01 ) == 2.51,' we reject H0 at the 1% level and conclude that average costs differ, depending on type of ownership. It is informative to compare the results based on this "exact" test with those obtained using the large-sample procedure summarized in (6-39) and = = 516 is large, and H0 can be (6-40). For the present example, tested at the = .01 level by comparing
,
a
- (n - 1 -
2- ne n
(p
+ g ) /2) ln
( I :� ) = - 511.5 ln (.7714) = 132.76 IB I
with x; (g - 1 ) (.01) = xn 01) = 20.09. Since 132.76 > x� (.01 ) = 20.09, we reject H0 at the 1% level. This result is consistent with the result based on the • foregoing F-statistic.
Sec.
6.5
6.5
329
Simultaneous Confidence Intervals for Treatment Effects
SIMULTANEOUS CONFIDENCE INTERVALS FOR TREATMENT EFFECTS
When the hypothesis of equal treatment effects is rejected, those effects that led to the rejection of the hypothesis are of interest. For pairwise comparisons, the Bon ferroni approach (see Section 5.4) can be used to construct simultaneous confi dence intervals for the components of the differences Tk - Tc (or IL k - JLe ) . These intervals are shorter than those obtained for all contrasts, and they require critical values only for the univariate t-statistic. Let Tk ; be the ith component of Tk . Since Tk is estimated by 7-k = ik - i (6-41 } and 1-k ; - Tc ; = xk; - X e; is the difference between two independent sample means. The two-sample t-based confidence interval is valid with an appropriately modified a. Notice that
( nk
. 1 Var (7-k I- - 1-c I- ) = Var (Xk I- - Xc I- ) = --
-
+
1
)
- uI. I.
n(
where u;; is the ith diagonal element of I. As suggested by (6-37), Var (Xk i - Xe; ) is estimated by dividing the corresponding element of W by its degrees of freedom. That is, -
(
1 Var (Xk l. - XC l- ) = nk -
1
+ -
)
w 11 ..
-
nf n +
g
+ where W; ; is the ith diagonal element of W and n = n1 ng . It remains to apportion the error rate over the numerous confidence statements. Relation (5-28} still applies. There are p variables and g (g - 1)/2 pairwise differ ences, so each two-sample t-interval will employ the critical value t11 _ g (a/2m), where m = pg (g - 1 } /2 (6-42 }
···
is the number of simultaneous confidence statements. Result 6.5. Let n =
(1 -
a} ,
g
_L nk . For the model in (6-34), with confidence at least
k= l
Tk i - Te ; belongs to xk i - Xe;
) + ± tll - g ( pg (ga- 1 ) ) \j/__!!J _jj_ _ (_!_ _!_ n - g n k ne
for all components i = 1, . . . , p and all differences e < ith diagonal element of W.
k = 1, . . . , g. Here W; ; is the
330
Chap.
6
Comparisons of Several Multivariate Means
We shall illustrate the construction of simultaneous interval estimates for the pairwise differences in treatment means using the nursing-home data introduced in Example 6.9. Example 6. 1 0
(Simultaneous intervals for treatment differences-N u rsing Homes)
We saw in Example 6.9 that average costs for nursing homes differ, depend ing on the type of ownership. We can use Result 6.5 to estimate the magni tudes of the differences. A comparison of the variable X3 , costs of plant operation and maintenance labor, between privately owned nursing homes and government-owned nursing homes can be made by estimating r1 3 - r33 • Using (6-35) and the information in Example 6.9, we have 7-1 = ( i 1 - i ) = w
=
[
Consequently,
- .020
T3 = ( -x 3 - -x ) = �
182.962 4.408 8.200 1.695 .633 1 .484 9.581 2.428 .394
.137 .002 .023 .003
7-1 3 - 7-33 = - .020 - .023 = - .043 138 107 = 516, so that
+ + ' ( __!__ + __!__ )
and n = 271
[=:: ].
[]
�
=
1
( + _ ) 1.484 = .00614 107 516 - 3
1 ' 'J 271
'J n 1 n3 n - g = 4 and g = 3, for 95% simultaneous confidence statements we
Since p require that t5 1 3 (.05/4 (3) 2) = 2.87. (See Appendix, Table 1.) The 95% simultaneous confidence statement is
/(__!__ + __!__ )
� r1 3 - r33 belongs to 7-1 3 - 7-33 ± t5 1 3 (.00208) n3 n - g 'J n 1 = - .043 ± 2.87 (.00614)
- .043 ± .018, or ( - .061, - .025)
We conclude that the average maintenance and labor cost for government owned nursing homes is higher by .025 to .061 hour per patient day than for pri vately owned nursing homes. With the same 95% confidence, we can say that r1 3 - r2 3 belongs to the interval ( - .058, - .026)
Sec.
6.6
Two-Way Multivariate Analysis of Variance
331
and
T2 3 - T33 belongs to the interval ( - .021, .019) Thus, a difference in this cost exists between private and nonprofit nursing homes, but no difference is · observed between nonprofit and government • nursing homes. 6.6 TWO-WAY MULTIVARIATE ANALYSIS OF VARIANCE
Following our approach to the one-way MANOVA, we shall briefly review the analysis for a univariate two-way fixed-effects model and then simply generalize to the multivariate case by analogy. Univariate Two-Way Fixed-Effects Model with I nteraction
We assume that measurements are recorded at various levels of two factors. In some cases, these experimental conditions represent levels of a single treatment arranged within several blocks. The particular experimental design employed will not concern us in this book. (See [9] and [10] for discussions of experimental design.) We shall, however, assume that observations at different combinations of experimental conditions are independent of one another. Let the two sets of experimental conditions be the levels of, for instance, fac tor 1 and factor 2, respectively.4 Suppose there are g levels of factor 1 and b levels of factor 2, and that n independent observations can be observed at each of the gb combinations of levels. Denoting the rth observation at level e of factor 1 and level k of factor 2 by Xekr • we specify the univariate two-way model as
Xekr = IL
+ Te + f3k + Yek + eekr
= 1, 2, . k = 1, 2, r = 1, 2, . . e
.
.
. . .
g where :L Te f=l
.
(6-43)
,g
'b ,n
:L f3k = :L Ye k = :L Yek = 0 and the eekr are independent b
k=l
g
b
f=l
k=t
N(O, u2 ) random variables. Here p., represents an overall level, Te represents the fixed effect of factor 1, f3k represents the fixed effect of factor 2, and Yt k is the inter
action between factor 1 and factor 2. The expected response at the f th level of fac tor 1 and the kth level of factor 2 is thus 4 The use of the term factor to indicate an experimental condition is convenient. The factors dis cussed here should not be confused with the unobservable factors considered in Chapter 9 in the con text of factor analysis.
332
Chap.
6
Comparisons of Several Multivariate Means
(
) (
+
mean overall = response level
+
+
) + ( effect of ) ( effect of ) + ( fac�or 1-fa_ctor 2 ) factor 1
e = 1, 2, . . , 8,
+
k
.
factor 2
= 1, 2, .
'Ye k
mteraction
. .
,b
(6-44)
The presence of interaction, 'Yt k • implies that the factor effects are not addi tive and complicates the interpretation of the results. Figures 6.3(a) and (b) show expected responses as a function of the factor levels with and without interaction, respectively. The absense of interaction means 'Ye k 0 for all e and k. In a manner analogous to (6-44), each observation can be decomposed as
=
Level I of factor I Level 3 of factor I Level 2 of factor I
3
2
4
Level of factor 2 (a)
Level 3 of factor I Level I
of factor I
Level 2 of factor I
3
2
4
Level of factor 2 (b) Figure 6.3
i nteraction .
Curves for expected responses (a) with interaction and (b) without
Sec.
Two-Way Multivariate Analysis of Variance
6.6
333
where x is the overall average, xe . is the average for the l th level of factor 1 , x . k is the average for the kth level of factor 2, and Xe k is the average for the l th level of factor 1 an d the kth level of factor 2. Squaring and summing the devia tions (xe kr - x ) gives g
b
n
���
f=l k= l r=l
= f�= l bn (xe . - x ) 2 g
(xe kr - x ) 2
+
or SScor
g
b
��
f=l k=l
= SSfac I + SSfac
+
b
k= l
n (xe k - xe . - x. k
2 +
SSin t
- x )2
� gn (x. k
+
+
x )2
(6-46)
SS res
The corresponding degrees of freedom associated with the sums of squares in the breakup in (6-46) are gbn - 1 = (g - 1 )
+
( b - 1)
+
(g - 1 ) ( b - 1 )
+
gb ( n - 1 )
( 6-47)
The ANOVA table takes the following form: ANOVA TABLE FOR COM PARI NG EFFECTS O F TWO FACTORS AN D TH E I R I NTERACTION
Source of variation Factor 1 Factor 2 Interaction Residual (Error) Total (corrected)
Degrees of freedom ( d.f.)
Sum of squares ( SS )
= f�= l bn (xe . - x ) 2 2 ssfac 2 = � gn ( x. k - x ) k= ! SS int = � � n (xe k - xe . - x. k f=l k=l SS res = � � � (xe kr - xe k ) 2 f=l k=l r=l ssfac l
g
g- 1
b
SS cor
g
b
g
b
n
= f�= l k�= l r�= l (xe kr - x ) 2 g
b
n
b - 1 +
x )2
(g - 1 )(b - 1 ) gb (n - 1 )
gbn - 1
334
Chap.
6
Comparisons of Several Multivariate Means
- 1),
The F-ratios of the mean squares, SSrac 1 / (g - 1 ) , SSrac 2 / (b and ss int / (g - 1 ) (b - 1 ) to the mean square, ss res f (gb ( n - 1 ) ) can be used to test for the effects of factor 1 , factor 2, and factor 1-factor 2 interaction, respectively. ( See
[7] for a discussion of univariate two-way analysis of variance. )
Multivariate Two-Way Fixed-Effects Model with Interaction
Proceeding by analogy, we specify the two-way fixed-effects model for a response consisting of p components [see
(6-43)] X ek r = JL + Te + {Jk + 'Yek + e f k r e = 1, 2, . . . , g
vector
(6-48)
k = 1 , 2, . . . ' b
r = 1 , 2, . . . , n
p
b g = � 'Yek = � 'Yek = 0. The vectors are all of order 1, Te {Jk €=! k =! f = l k =l and the eu, are independent Np (O, I) random vectors. Thus, the responses consist of measurements replicated n times at each of the possible combinations of lev
g where �
b =�
X
p
els of factors 1 and 2. Following we can decompose the observation vectors
(6-45), X ek r as X ek r = i + ( ie. - i ) + ( i. k - i ) + ( iek - ie . - i. k + i ) + (x ek r - iek ) (6-49) where i is the overall average of the observation vectors, ie. is the average of the observation vectors at the fth level of factor 1, i. k is the average of the observa tion vectors at the kth level of factor 2, and iek is the average of the observation vectors at the fth level of factor 1 and the kth level of factor 2. Straightforward generalizations of (6-46) and (6-47) give the breakups of the sum of squares and cross products and degrees of freedom: n
g
f=l k =l r=l (x ek r - i ) (x ek r - i ) ' = f=l� bn ( ie. - i ) ( ie. - i ) ' + k�=l gn( i. k - i ) ( i. k - i ) ' + f�= l k�=l n( iek - ie. - i. k + i ) ( iek - ie. - i. k + i ) ' (6-50) + f=l�g k�=lb r=l� (xf.k r - iek ) (xekr - ie d ' gb n - 1 = (g - 1 ) + (b - 1 ) + (g - 1 ) (b - 1) + gb ( n - 1) (6-51) g b � � �
b
g
b
11
Sec.
6.6
Two-Way Multivariate Analysis of Variance
335
Again, the generalization from the univariate to the multivariate analysis consists simply of replacing a scalar such as with the corrresponding matrix
(xe . - xf
(xe . - x) (x e . - x)'.
The MANOVA table is the following:
MANOVA TAB LE FOR COM PARI N G FACTORS AND TH EIR I NTERACTIO N
Factor
1
Factor 2 Interaction Residual (Error)
Degrees of freedom ( d.f. )
Matrix of sum of squares and cross products ( SSP)
Source of variation
=
g
� bn( ie . - x) (xe . - x )' f=l b SSPrac z = � gn(x . k - x) (x . k - x)' k =l g b SSPi nt = � � n( iek - ie . - x . k + x) (x u - ie . - x . k + x)' f=l k =l g b = SSPres f=l� k�=l r=l� (xu, - Xek Hxek r - Xek )' SSPrac t
n
Total ( corrected)
SSPcor
=
g b
f=l� k�=! r=!� (xek r - x) (x ek r - x )' n
g-1 b-1 (g - 1) (b - 1) gb(n - 1) gbn - 1
A test ( the likelihood ratio test) 5 of
Ho : 'Yu =
1'1 2
=
... = 'Yg b = 0
( no interaction effects )
(6-52)
versus
H1 : At least one
'Yek * 0
is conducted by rejecting H0 for small values of the ratio (6-53) 5 The likelihood test procedures require that p "" gb (n inite (with probability 1).
-
1 ), so that SSP,., will be positive def
336
Chap.
6
Comparisons of Several Multivariate Means
For large samples, Wilks ' lambda, A * , can be referred to a chi-square percentile. Using Bartlett ' s multiplier (see [6]) to improve the chi-square approximation, we reject H0 : y1 1 = y1 2 = = 'Ygb = 0 at the a level if
[
- gb (n - 1 ) -
p
···
+ 1 - (g - 1 ) ( b - 1 ) ] In A * > xfg - t ) (b - t ) p (a) 2
(6-54)
where A * is given by (6-53) and x{g- ! ) (b - t ) p (a) is the upper (100a) th percentile of a chi-square distribution with (g - 1) (b - 1)p d.f. Ordinarily, the test for interaction is carried out before the tests for main fac tor effects. If interaction effects exist, the factor effects do not have a clear inter pretation. From a practical standpoint, it is not advisable to proceed with the additional multivariate tests. Instead, p univariate two-way analyses of variance (one for each variable) are often conducted to see whether the interaction appears in some responses but not others. Those responses without interaction may be interpreted in terms of additive factor 1 and 2 effects, provided that the latter effects exist. In any event, interaction plots similar to Figure 6.3, but with treat ment sample means replacing expected values, best clarify the relative magnitudes of the main and interaction effects. In the multivariate model, we test for factor 1 and factor 2 main effects as fol lows. First, consider the hypotheses H0 : T1 = T2 = · · · = Tg = 0 and H1 : at least one Te � 0. These hypotheses specify no factor 1 effects and some factor 1 effects, respectively. Let
A* =
I SSPres I I SSPfac 1 + SSP res I
(6-55)
so that small values of A * are consistent with H1 • Using Bartlett ' s correction, the likelihood ratio test is:
+ 1 - (g - 1 ) ] In A * > X(g2 - l) p (a)
Reject H0 : T1 = T2 = · · · = Tg = 0 (no factor 1 effects) at level a if
[
- gb (n - 1) -
p
2
(6-56)
where A * is given by (6-55) and x {g - i ) p (a) is the upper (lOOa) th percentile of a chi-square distribution with (g - 1)p d.f. In a similar manner, factor 2 effects are tested by considering H0 : /1 1 /12 = · · · = {Jb = 0 and H1 : at least one {Jk � 0. Small values of (6-57) are consistent with H1 • Once again, for large samples and using Bartlett ' s correc tion: Reject H0 : {11 = /12 = · · · = {Jb = 0 (no factor 2 effects) at level a if
- [ gb(n
Sec. -
p + 1 - (b - 1) ] lo A * > X (b - !) p (a)
6.6
1) -
Two-Way Multivariate Analysis of Variance 2
2
337
(6-58)
where is given by (6-57) and xfb - !) p (a) is the upper (100a) th percentile of a chi-square distribution with degrees of freedom. Simultaneous confidence intervals for contrasts in the model parameters can provide insights into the nature of the factor effects. Results comparable to Result 6.5 are available for the two-way model. When interaction effects are negligible, we may concentrate on contrasts in the factor and factor 2 main effects. The Bon ferroni approach applies to the components of the differences 'Te 7'111 of the fac tor 1 effects and the components of fJk - fJq of the factor 2 effects, respectively. The 100 (1 - a) % simultaneous confidence intervals for Te ; T111 ; are
A*
(b - 1)p
1
Te ; -
T,, ;
belongs to ( Xc;. - Xm ; . )
-
{E;;2 ± tv ( pg(ga- 1 ) ) v--:: b,;,
b (n - 1), Eu is the ith diagonal element of E = SSP and - xm i· isgthe ith component of Xe. - XIII • ' Similarly, the 100 (1 - a) % simultaneous confidence intervals for fJk i - /3q ; are ( pb (ba 1) ) v{E;;2 {3 k i - f3q ; belongs to (x .k; - x. q ; ) ± t v --:: g;; (6-60)
where v =
X e ;.
(6-59) res '
_
where v and Eu are as just defined and x.k; - x.q; is the ith component of x.k - x . q ·
Comment. We have considered the multivariate two-way model with repli cations. That is, the model allows for replications of the responses at each com bination of factor levels. This enables us to examine the "interaction" of the factors. If only one observation vector is available at each combination of factor levels, the two-way model does not allow for the possibility of a general interaction term 'Yc k · The corresponding MANOVA table includes only factor 1 , factor 2, and residual sources of variation as components of the total variation. (See Exercise
n
6.13.)
Example 6. 1 1
(A two-way multivariate analysis of variance of plastic fil m data)
The optimum conditions for extruding plastic film have been examined using a technique called Evolutionary Operation. (See [8] .) In the course of the study that was done, three responses-X1 = tear resistance, X2 = gloss, and X3 = opacity-were measured at two levels of the factors, and The measurements were repeated = 5 times at each combination of the factor levels. The data are displayed in Table 6.4.
amount of an additive.
rate of extrusion n
338
Chap.
6
Comparisons of Several Multivariate Means TABLE 6.4 PLASTIC F I LM DATA
x1 =
tear resistance,
x2 =
gloss, and
x3 =
opacity
Factor 2: Amount of additive Low (1.0%)
High (1.5%)
Xr [6.5 [6.2 Low ( - 10%) [5.8 [6.5 [6.5
x2 9.5 9.9 9.6 9.6 9.2
x3 4.4] 6.4] 3.0] 4.1] 0.8]
Xr [6.9 [7.2 [6.9 [6.1 [6.3
Xz 9.1 10.0 9.9 9.5 9.4
x3 5.7] 2.0] 3.9] 1.9] 5.7]
Xr [6.7 [6.6 [7.2 [7.1 [6.8
x2 9.1 9.3 8.3 8.4 8.5
x3 2.8] 4.1] 3.8] 1 .6] 3.4]
Xr [7. 1 [7.0 [7.2 [7.5 [7.6
Xz 9.2 8.8 9.7 10.1 9.2
x3 8.4] 5.2] 6.9] 2.7 ] 1 .9]
Factor 1: Change in rate of extrusion
-
High (10%)
-
-
The matrices of the appropriate sum of squares and cross products were calculated (see the SAS statistical software output in Panel 6.1), leading to the following MANOVA table:
Source of variation Factor 1: change of extruisni.Ornate Factor 2: amount addr"tl.veof Interaction Residual Total (corrected)
d.f. SSP [ 1.7405 -1.1.35005045 -..78395555 ] 1 . 4 205 .7605 ..66825125 1.1.79325305 ] 1 [[ .0005 .0165 4.�90055 ] .5445 3.1.49605685 1 [ 1.764(] 2..06200280 -3.-.05520700 ] 16 64.9240 [ 4.2655 -.5.70855855 -1.23959095 ] 19 74.2055
Sec. PAN EL 6.1
6.6
Two-Way Multivariate Analysis of Variance
339
SAS ANALYSIS F O R EXAM PLE 6. 1 1 U S I N G P ROC G L M .
title 'MAN OVA'; data fi l m ; i nfi le 'T6-4.dat'; i n put x 1 x2 x3 facto r1 factor2; proc g l m data = fi l m ; class factor1 facto r2; model x1 x2 x3 facto r1 factor2 facto r1 * facto r2 /ss3; m a n ova h = facto r1 facto r2 factor1 * factor2 /pri nte; means factor1 facto r2;
PROGRAM COMMANDS
=
Genera l Linear Models Proced u re Class Level I nfo rmatio n
J
Dependent ,V ariable:
Leve ls Cl ass V a l u es FACTOR 1 2 0 1 FACTO R2 2 0 1 N u m ber of o bservations i n d ata set = 20
X1 J
S o u rce Model E rror Corrected Tota l
S o u rce FACTOR1
I
I
FACTOR2 FACTOR1*f,.XCTOR2 DependentVa riable:
S o u rce Model E rror Corrected Total
OUTPUT
DF 3 16 19
S u m of Squares 2.501 50000 1 .76400000 4.26550000
Mean S q u a re 0.83383333 0 . 1 1 025000
R-Sq u a re 0.586449
4.893724
c.v.
Root M S E 0.332039
DF
Type Ill SS
Mean S q u a re
F Va l u e
1 .74050000
1 .7 4050000 0.76050000 0.00050000
1 5.79 6.90 0.00
0.00 1 1 0 .0 1 83 0.947 1
DF 3 16 19
S u m of S q u a res 2 . 45750000 2.62800000 5.08550000
Mean S q u a re 0.8 1 9 1 6667 0 . 1 6425000
F Va l u e 4.99
Pr > F 0.0 1 25
R-Sq u a re 0.483237
4.350807
c.v.
Root M S E 0 . 405278
.· 1 x2 j
o:!>oo5oooo 0.76050000
F Va l u e 7.56
Pr > F 0.0023
X1 Mean 6.78500000 Pr
>
F
X2 Mean 9 . 3 1 500000
340
Chap.
6
Comparisons of Several Multivariate Means
PANEL 6. 1
(continued)
S o u rce
OF
Type I l l SS
F Va l u e
1 .30050000 0.6 1 250000 0.54450000
7.92 3.73 3.32
0.0125 0.07 1 4 0.0874
Mean S q u a re 3.09383333 4.05775000
F Va l u e 0.76
Pr > F 0 . 53 1 5
S u m of OF S q u a res 3 9.281 50000 1 6 64.92400000 1 9 7 4.20550000
S o u rce Model E rror Co rrected Tota l
R-Sq uare 0 . 1 25078
Root MSE 2 . 0 1 4386
c.v.
51 . 1 9151
Pr
Pr
X1 X2 X3
0.42050000 4.90050000 3.96050000
e
=
i:rr()r, '$�&cP Matti��!! X2 0.02 2.628 -0.552
X1 1 .764 0.02 -3.07
0. 1 0 1 .2 1 0.98
>
X3 - 3.07 - 0 . 552 64.924
the =
Type Ill SS&CP Matrix fo r FACTO R 1 S 1 M 0.5 =
Pil l a i's Trace Hote l l i ng-Lawley Trace Roy's G reatest Root
0.61 8 1 4 1 62 1 .6 1 877 1 88 1 .6 1 877 1 88
=
7 . 5543 7.5543 7.5543
N
E =
=
6
Error SS&CP Matrix
3 3 3
14 14 14
F
0.75 1 7 0.2881 0.3379
Manova Test Criteria and Exact F Stati stics for
H
F
X 3 Mean 3.93500000
S o u rce
I
>
Mean S q u a re
0.0030 0.0030 0.0030
PANE L 6. 1
Sec.
Two-Way Multivariate Analysis of Variance
6.6
(contin ued)
I Hypothf!!��� ofno��erall FAE:TOR2
M a n ova Test Criteria a n d Exact F Statistics for the
Effect
341
I
E = E rror SS&CP M atrix H = Type I l l SS&CP Matrix fo r FACTOR2 N = 6 M = 0.5 S = 1
:Wilks' Lamqda
;Statistic
0.476965 1 0 0.91 1 9 1 832 0.91 1 9 1 832
Pi l lai's Trace H ote l l i ng-Lawley Trace R oy's G reatest Root
Hypothesis
.::: · :�}.2556
N u m OF
4.2556 4.2556 4.2556
3
3 3 3
14 14 14
0.0247 0.0247 0 .0247
M a n ova Test Criteria a n d Exact F Statistics fo r
the
of
no
Overall FACTOR 1 *FACTOR2 Effect
H = Type I l l SS&CP M atrix fo r FACTOR 1 * FACTOR2 N = 6 M = 0.5 S = 1
·
'
E = E rror SS&CP M atrix
Lariloda 0.22289424 0.2868261 4 0. 286826 1 4
Pi l l ai's Trace H otel l i n g-Lawley Trace Roy's G reatest Root Level of FACTOR 1 0
N 10 10
14 14 14
0.30 1 8 0.30 1 8 0.30 1 8
- - - - - - - - - X2 - - - - - - - - -
Mean 6. 49000000 7.08000000
Mean 9 . 57 000000 9 .06000000
SD 0.420 1 85 1 4 0.32249031
Mean 6.59000000 6.98000000
Level of FACTO R2 0
SD 0 .29832868 0 . 5758086 1
- - - - - - - - - X3 - - - - - - - - N 10 10
Mean 3.79000000 4.08000000
--------- X1 --------N 10 10
3 3 3
--------- X1 ---------
Level of FACTOR 1 0
Level of FACTO R2 0
1 .3385 1 .3385 1 .3385
SD 0. 40674863 0.47328638
SD 1 .8537949 1 2 . 1 82 1 49 8 1 - - - - - - - - - X2 - - - - - - - - Mean 9 . 1 4000000 9. 49000000
- - - - - - - - - X3 - - - - - - - - N 10 10
Mean 3.44000000 4.43000000
SD 1 .55077042 2.30 1 23 1 55
SD 0.560 1 587 1 0.42804465
342
Chap.
6
Comparisons of Several Multivariate Means
=
To test for interaction, we compute
275.7098 7771 I SSPres I 354.7906 • I SSPint + SSP I For (g - 1) (b - 1) = 1, 1 - A * ) (gb (n - 1) - p + 1)/2 F- ( ( I (g - ) ( b - 1) - p I + 1) /2 A* has an exact F-distribution with v1 I (g - 1 ) ( b - 1) - p I + 1 and v2 gb (n - 1) - p + 1 d.f. ( See [1].) For our example, 1 - .7771 ) (2 (2) (4) - 3 + 1)/2 1. 34 F ( .7771 ( 1 1 (1) - 3 1 + 1)/2 v, ( 1 1 (1) - 3 1 + 1) = 3 v2 = (2 (2) (4) - 3 + 1) = 14 and F3 1 (.05) = 3.34. Since F 1.34 < F3 , 14 (.05) 3.34, we do not reject the hypothesis H0 : y 1 1 y 12 y21 y22 0 (no interaction effects). A*
=
res
1
=
=
=
=
=
,
4
=
=
=
=
=
=
Note that the approximate chi-square statistic for this test is
- [2 (2) (4) - (3 + 1 - 1 (1))/2] ln (.7771) 3.66, from (6-54). Since xj (.05) 7.81, we would reach the same conclusion as provided by the =
=
exact F-test. To test for factor 1 and factor 2 effects (see page 336), we calculate
I SSPres I A* 1 I SSPfac 1 + SSP res I _
=
275.7098 722.0212
= .38 1 9
275.7098 527.1347
=
and
A *z For both g
-1
=
I SSPres I I SSPrac2 + SSPres I 1 and b - 1 = 1, =
_(
=
5230
Fl -
1 - A � ) (gb (n - 1) - p + 1)/2 ( I (g - 1) - p I + 1) /2 A�
Fz -
( 1 - Ai ) (gb (n - 1) - p + 1 ) /2
and _
Ai
( l (b - 1) - p l + 1)/2
Sec.
6.7
Profile Analysis
343
- 1) - p I + 1, = = + 1, v2I (g= gb(n - 1) - p v2+ 1, l (b ( 1 - .3819 ) (16 - 3 + 1)/2 = 7.55 F1 = .3819 ( 1 1 - 3 1 + 1)/2 ( 1 - .5230 ) (16 - 3 + 1)/2 4.26 Fz = .5230 ( 1 1 - 3 1 + 1) /2
+
have F-distributions with degrees of freedom 1) - p 1 and v1 = 1) - P I respectively. (See [ 1] . ) In our case,
gb(n -
v1
=
and
+1
3 Vz (16 - 3 + 1) = 14 From before, F3 14 ( .05) 3.34. We have F1 = 7.55 > F3 , 1 4 ( .05) = 3.34, , and therefore, we reject H0 : -r1 = -r2 = 0 (no factor 1 effects) at the 5% level. Similarly, F2 = 4.26 > F3 ,1 4 (.05) = 3.34, and we reject H0 : {J1 = {J2 = 0 (no factor 2 effects) at the 5% level. We conclude that both the change in rate of v1
= 11 - 31
=
=
=
extrusion
amount of additive
and the affect the responses, and they do so in an additive manner. The of the effects of factors 1 and 2 on the responses is explored in Exercise 6.15. In that exercise, simultaneous confidence intervals for con trasts in the components of Te and {Jk are considered. •
nature
6.1 PROFILE ANALYSIS
Profile analysis pertains to situations in which a battery ofp treatments (tests, ques tions, and so forth) are administered to two or more groups of subjects. All responses must be expressed in similar units. Further, it is assumed that the responses for the different groups are independent of one another. Ordinarily, we might pose the question, Are the population mean vectors the same? In profile analysis, the question of equality of mean vectors is divided into several specific possibilities. Consider the population means /L � = [ JL 1 1 , 11-1 2 , 11- 1 3 , JL 1 4 ] representing the average responses to four treatments for the first group. A plot of these means, connected by straight lines, is shown in Figure 6.4 on page 344. This broken-line graph is the for population 1. Profiles can be constructed for each popula tion (group). We shall concentrate on two groups. Let p ; = [ JL 1 1 , JL 1 2 , . . , fL i p ] and /L� = [ JL2 1 , JL22 , . . . , IL z p ] be the mean responses to p treatments for populations 1 and 2, respectively. The hypothesis H0 : p1 = p 2 implies that the treatments have the same (average) effect on the two populations. In terms of the population profiles, we can formulate the question of equality in a stepwise fashion.
profile
.
344
Chap.
6
Comparisons of Several Multivariate Means
Mean response
�
- - - - - - - - - ---- - - - - - - - - - - - - - - -
2
3
Figure 6.4 p = 4.
4
= =
The population profile
=
1. Are the profiles parallel? Equivalently: Is H0 1 : p.,1 ; - IL J i - l /L z; - p., 2 ; _ 1 , i 2, 3, . . . , p, acceptable? 2. Assuming that the profiles are parallel, are the profiles coincident? 6 Equivalently: Is H0 2 : p., 1 ; /L z; , i 1, 2, . . . , p, acceptable? 3. Assuming that the profiles are coincident, are the profiles level? That is, are all the means equal to the same constant? Equivalently: Is H0 3 : p.,1 1 p., 1 2 · · · fLi p /L2 1 /L zz · · · /L z p acceptable?
=
= = =
=
=
= =
The null hypothesis in stage 1 can be written where
[=
C is the contrast matrix c
((p - ! ) Xp)
-1
� �
1 0 0 -1 1 0 0 0 0
J !l
(6-61)
For independent samples of sizes n 1 and n 2 from the two populations, the null hypothesis can be tested by constructing the transformed observations j
=
j
= 1, 2, . . . , n2
and
1, 2 ,
. . . , n1
6 The question, 'Assuming that the profiles are parallel, are the profiles linear?' is considered in Exercise 6.12. The null hypothesis of parallel linear profiles can be written, H0 : (J.t l i + J.t2 ; ) ( J.t i i - l + J.t2 , _ 1 ) = (J.t 1 , _ 1 + J.t2, _ 1 ) - ( J.t i i - 2 + J.t2 , _ 2 ) , i = 3, . . . , p. Although this hypothesis may be of interest in a particular situation, in practice the question of whether two parallel profiles are the same (coincident ) , whatever their nature, is usually of greater interest.
Sec.
6.7
Profile Analysis
345
These have sample mean vectors C x 1 and Cx2 , respectively, and pooled covari ance matrix Since the two sets of transformed observations have Np - I ( Cp 1 , C!.C' ) and NP _ 1 ( Cp2 , C!.C' ) distributions, respectively, an application of Result 6.2 provides a test for parallel profiles.
CSpooiect C '.
Reject T2
H0 1 : {:p1
=
TEST FOR PARALLEL PROFI LES FOR TWO NORMAL POPULATIONS
Cp 2 (parallel profiles ) at level
= (it - Xz)'C' [ (�t �)cspoolectC' rt +
a
if
C ( xt - x 2 )
>
cz
(6-62)
where
When the profiles are parallel, the first is either above the second ( JL i i > JL 2 i , for all i), or vice versa. Under this condition, the profiles will be coincident only if the total heights f.Lt t JL1 2 JL 2p = 1 ' p2 are f.Lt p = 1 ' p 1 and JL 2 1 JL 22 equal. Therefore, the null hypothesis at stage 2 can be written in the equivalent form
+ + ··· +
Ho 2 : 1 ' Itt =
+ + ··· +
1 ' P2
We can then test H0 2 with the usual two-sample t-statistic based on the univariate observations 1 ' x1j, j = 1, 2, . . . , n 1 , and 1 ' x 2j , j = 1, 2, . . . , n 2 . TEST FORCOINCIDENT PROFILES, G IVEN THAT PROFI LES ARE PARALLEL
For two norinal populations: Reject H02 : at level a if
=
(
1/
�( _!_nt _!_nz ) s
-pooledl )
( -x i - -X z )
+
1
1
.
2
>
1'p 1
= 1'p2
(profiles coincident)
( ) = Fl,n1 + n2-2 ( a)
t n,2 + nz-2 � 2
(6-63)
346
Chap.
Comparisons of Several Multivariate Means
6
For coincident profiles, x 1 1 , x 1 2 , , x1 n I and x 2 1 , x 22 , . . . , x 2 n2 are all observations from the same normal population. The next step is to see whether all variables have the same mean, so that the common profile is level. When H0 1 and H0 2 are tenable, the common mean vector p. is estimated, using all 2 observations, by • . .
•
n1 + n
If the common profile is level, f.l- 1 3 can be written as
= = ··· = f.l- z
Jl-p ,
and the null hypothesis at stage
where C is given by (6-61). Consequently, we have the following test. TEST FOR LEVEL PROFILES, GIVEN. THAT PROFILES ARE COINCIDENT
Fl)r two noi"mal pop:glations: ' '' i':
, ' �' ;o,: ,
:(\
Reject H03 : . (;p. '<;?o";,";
=
0
atlevel (profiles leyel) -
where s is the sample covariance matrix based on all nl c
Example 6. 1 2
z
=
(nl
/ (n. 1
+ nz
-
+ n 2.
1) (p -
p
+
+ nz
f
a
if
(6-64)
observations and
1) F 1) p - l, n, + n, - p + l (a)
-
.
. · .·
(A profile analysis of love and marriage data)
E.
As part of a larger study of love and marriage, Hatfield, a sociologist, sur veyed adults with respect to their marriage "contributions" and "outcomes" and their levels of "passionate" and "companionate" love. Recently married males and females were asked to respond to the following questions, using the 8-point scale in the figure below. 1. All things considered, how would you describe
the marriage? 2. All things considered, how would you describe the marriage?
your contributions to your outcomes from
Sec.
Profile Analysis
6. 7
347
;;.-. "' - >
:c ·;: bJ) ' til
0 O.
:= v.l
8
7
6
5
4
3
2
Subjects were also asked to respond to the following questions, using the 5-point scale shown. love that you feel for your partner? 3. What is the level of love that you feel for your partner? 4. What is the level of
passionate companionate
None at all
Very little
Some
A great deal
Tremendous amount
2
3
4
5
Let
= an 8-point scale response to Question 1 x2 = an 8-point scale response to Question 2 = a 5-point scale response to Question 3 = a 5-point scale response to Question 4 x1
x3
x4
and the two populations be defined as
= married men Population 2 = married women
Population 1
p=
4 ques The population means are the average responses to the tions for the populations of males and females. Assuming a common covari ance matrix �. it is of interest to see whether the profiles of males and females are the same. 30 females gave the sample 30 males and A sample of mean vectors
n1
[
n2
J
6.833 7.033 3.967 ' 4.700 (males)
and pooled covariance matrix
[
6.633 7.000 4.000 4.533
J
(females)
348
Chap.
6
Comparisons of Several Multivariate Means Sample mean response .XCi
6
4 Key:
X- X
2
0-
-o
Males Females
[
2 Figure 6.5
3
4
l
Sample profiles for marriage-love responses.
s pooled
.606 - .262 .066 .161
.262 .637 .173 .143
.066 .173 .810 .029
.161 .143 .029 .306
The sample mean vectors are plotted as sample profiles in Figure 6.5. Since the sample sizes are reasonably large, we shall use the normal the ory methodology, even though the data, which are integers, are clearly non normal. To test for parallelism (H0 1 : Cp1 Cp 2 ), we compute
[ [
and
=
-1 1 0 -1 0 0 .719 - .268 - .125 - .268 1.101 - .751 - .125 - .751 1.058
0 -1 1 0
]
-!l
] [ I J _ : : - � _: n l .167
- .167 - .066 .200
Sec.
Thus, 2
T
= [- .167 ' - .066, .200] (io + ro) -
1
[ .719
6.7
Profile Analysis
349
] [ -.- 16667 ]
- .268 - .125 - 1 - .268 1.101 - .751 - .125 - .751 1.058
.0 .200
= 15 ( .067) = 1.005 Moreover, with a = .05, c 2 = [(30 + 30 - 2) (4 - 1)/(30 + 30 - 4)]F3, 56 (.05) = we conclude that the hypothesis of Since T 2 = 1.005 < 3.11 (2.8) = is tenable. Given the plot in Figure
8.7.
8.7,
6.5,
parallel profiles for men and women this finding is not surprising. Assuming that the profiles are parallel, we can test for coincident pro files. To test H0 2 : 1'p,1 1'p, 2 (profiles coincident), we need:
=
Sum of elements in ( i 1 - i 2 ) Sum of elements in Spooled Using
(6-63), we obtain T2
a=
=( =
v cto
.+367
=
1' ( i 1 - i 2 )
.367
= 1' S pooled 1 = 4.207
ro) 4.027
=
=
)
2
.501
With .05, F1.58 (.05) 4.0, and T2 .501 < F1 . 58 (.05) 4.0, we cannot reject the hypothesis that the profiles are coincident. That is, the responses of men and women to the four questions posed appear to be the same. We could now test for level profiles; however, it does not make sense to carry out this test for our example, since Questions 1 and 2 were measured on a scale of 1-8, while Questions 3 and 4 were measured on a scale of 1-5. The incompatibility of these scales makes the test for level profiles meaning less and illustrates the need for similar measurements in order to carry out a • complete profile analysis.
=
When the sample sizes are small, a profile analysis will depend on the nor mality assumption. This assumption can be checked, using methods discussed in Chapter 4, with the original observations Xe i or the contrast observations Cxei · The analysis of profiles for several populations proceeds in much the same fashion as that for two populations. In fact, the general measures of comparison are analogous to those just discussed. (See [13].)
350
Chap.
6
Comparisons of Several Multivariate Means
6.8 REPEATED MEASURES DESIGNS AND GROWTH CURVES
repeated measures
As we said earlier, the term refers to situations where the same characteristic is observed, at different times or locations, on the same subject.
(a) The observations on a subject may correspond to different treatments as in Example 6.2 where the time between heartbeats was measured under the 2 X 2 treatment combinations applied to each dog. The treatments need to be compared when the responses on the same subject are correlated. (b) A single treatment may be applied to each subject and a single characteristic observed over a period of time. For instance, we could measure the weight of a puppy at birth and then once a month. It is the curve traced by a typical dog that must be modeled. In this context, we refer to the curve as a When some subjects receive one treatment and others another treat ment, the growth curves for the treatments need to be compared.
growth curve.
To illustrate the growth curve model introduced by Potthoff and Roy [15], we consider calcium measurements of the dominant ulna bone in older women. Besides an initial reading, Table 6.5 gives readings after one year, two years, and TABLE 6.5 CALC I U M M EASU REM ENTS
ON TH E DOM I NANT U LNA CONTROL G RO U P
Subject
Initial
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
87.3 59.0 76.7 70.6 54.9 78.2 73.7 61.8 85.3 82.3 68.6 67.8 66.2 81.0 72.3 72.38
Mean
1 year 86.9 60.2 76.5 76.1 55.1 75.3 70.8 68.7 84.4 86.9 65.4 69.2 67.0 82.3 74.6 73.29
2 year 86.7 60.0 75.7 72.1 57.2 69.1 71.8 68.2 79.2 79.4 72.3 66.3 67.0 86.8 75.3 72.47
Source: Data courtesy of Everet Smith
3 year 75.5 53.6 69.5 65.3 49.0 67.6 74.6 57.4 67.0 77.4 60.8 57.9 56.2 73.9 66.1 64.79
Sec.
6.8
Repeated Measures Designs and Growth Curves
351
three years for the control group. Readings obtained by photon absorptiometry from the same subject are correlated but those from different subjects should be independent. The model assumes that the same covariance matrix � holds for each subject. Unlike univariate approaches, this model does not require the four mea surements to have equal variances. A profile, constructed from the four sample means ( :X 1 , :X2 , :X3 , :X4 ) , summarizes the growth which here is a loss of calcium over time. Can the growth pattern be adequately represented by a polynomial in time? When the p measurements on all subjects are taken at times t1 , t2 , , tP , the Potthoff-Roy model for quadratic growth becomes
[� [ J
X1 2 E [ X] = E
xp
=
f3o + {31 t1 + f32ti f3o + {31 2 + {32t�
�
f3o + {31 tp + {32t;
]
• • •
where the i-th mean JL; is the quadratic expression evaluated at t;. Usually groups need to be compared. Table 6.6 gives the calcium measure ments for a second �et of women, the treatment group, that received special help with diet and a regular exercise program. TABLE 6.6 CALC I U M M EASU REM ENTS
O N TH E DOMI NANT U LNA TREATM ENT G RO U P
Subject
Initial
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
83.8 65.3 81.2 75.4 55.3 70.3 76.5 66.0 76.7 77.2 67.3 50.3 57.7 74.3 74.0 57.3 69.29
Mean
1 year 85.5 66.9 79.5 76.7 58.3 72.3 79.9 70.9 79.0 74.0 70.7 51.4 57.0 77.7 74.7 56.0 70.66
2 year 86.2 67.0 84.5 74.3 59.1 7P · 6 80.4 70.3 76.9 77.8 68.9 53.6 57.5 72.6 74.5 64.7 71.18
Source: Data courtesy of Everet Smith
3 year 81.2 60.6 75.2 66.7 54.2 68.6 71.6 64.1 70.3 67.9 65.9 48.0 51.5 68.0 65.7 53.0 64.53
352
Chap.
6
Comparisons of Several Multivariate Means
When a study involves several treatment groups, an extra subscript is needed as in the one-way MANOVA mo del. Let be the vectors of I measurements on the subjects in group €, for € 1, . . . , g .
l
nc
]
X 0, X c 2 , ... , X e n
ne
=
Assumptions. All of the X ci are independent and have the same covariance matrix l:. Under the quadratic growth model, the mean vectors are f3co + f3c1 t1 + f3c 2 t f E[X ci ] f3e o + f3e 1 :t2 + f3e2 d f3eo + f3e 1 tP + f3c 2 t� =
where
] l B � � � �:� 1
(1 t f
and
tp tp
1
fJe
[ f3cof3c1 ]
=
(6-65)
f3e2
I f a q-th order polynomial is fit to the growth data, then
1 1
t1 .. .. .. t qq1 t2 t 2
and
B= 1
tp . . . t pq
/Je o /Je t
fJe =
(6-66)
/Jc q
Under the assumption of multivariate normality, the maximum likelihood estimators of the are
fJe
(6-67) where
S pooled with N
= (N 1_ g) ((n1 - 1 ) S1 + · · · + (ng - 1 ) S ) = N 1_ g W g
= C=1:Lg ne , is the pooled estimator of the common covariance matrix l:. The
estimated covariances of the maximum likelihood estimators are -
Cov
where k
=
(N -
fJc ) -- -nkc ( B ' Sp-oole1 ct B ) - 1
(A
g ) (N
for
- g - 1)/(N - g - p +
0 t.
-
-
1 , 2,
• • •
,g
(6-68)
q ) (N - g - p + q + 1).
Sec.
6.8
Repeated Measures Designs and Growth Curves
353
Also, P e and ph are independent, for C #- h, so their covariance is 0. We can formally test that a q-th order polynomial is adequate. The model is fit without restrictions, the error sum of squares and cross products matrix is just the within groups W which has N - g degrees of freedom. Under a q-th order poly nomial, the error sum of squares and cross products g nt (6-69) wq L L ( X cj - BPc ) ( Xej - BP e ) '
+
=
C = I j= I
has ng - g p - q - 1 degrees of freedom. The likelihood ratio test of the null hypothesis that the q- order polynomial is adequate, can be based on Wilks ' lambda
(6-70)
+
Under the polynomial growth model, there are q 1 terms instead of the p means for each of the g groups. Thus there are (p - q - 1) g fewer parameters. For large sample sizes, the null hypothesis that the polynomial is adequate is rejected if
- ( N - 21 (p - q - 2 - g) ) In A
Example 6. 1 3
*
> X
(6-71)
(Fitting a quadratic g rowth curve to calcium loss)
[=
]
Refer to the data in Tables 6.5 and 6.6. Fit the model for quadratic growth. A computer calculation gives
[ P1 , P2 ]
73.0701 70.1387 3.6444 4.0900 - 2.0274 - 1.8534
so the estimated growth curves are
73.07 (2.58) Treatment group: 70.14 (2.50) Control group:
where
B' S�a�IcctB
[=
+ 3.64t - 2.03t 2 ( .83) ( .28) + 4.09t - 1.85 t 2 ( .80)
( .27)
93.1744 - 5.8368 0.2184 - 5.8368 9.5699 - 3.0240 0.2184 - 3.0240 1.1051
]
and, by (6-68), the standard errors given below the parameter estimates were obtained by dividing the diagonal elements by n c and taking the square root.
354
Chap.
6
Comparisons of Several Multivariate Means
Examination of the estimates and the standard errors reveals that the t2 terms are needed. Loss of calcium is predicted after 3 years for both groups. Further, there does not seem to be any substantial difference between the two groups. ' Wilks lambda for testing the null hypothesis that the quadratic growth model is adequate, becomes 2726.282 2660.749 2369.308 2335. 9 12 2660.749 2756.009 2343. 5 14 2327.961 2369.308 2343. 5 14 2301. 7 14 2098.544 2335. 961 2098.544 2277.452 .7627 A _- / / W / /_ _- 2781.091712 2327. 2698.589 2363. 228 2362.253 Wz 2698.589 2832.430 2331.235 2381.160 2363.228 2331.235 2303. 687 2089. 996 2362.253 2381.160 2089.996 2314.485 Since, with a = . 0 1, ( 2 + g) ) In A * = - ( 31 - � (4 - 2 - 2 + 2) ) ln.7627 - (N - � p = 8. 1 5 < xf4 _ 2 _ 1 l 2 (.0 1 ) = 9. 2 1 we fail to reject the adequacy of the quadratic fit at a = .01. Since the P-value is less than .05 there is, however, some evidence that the quadratic does not fit well. We could, without restricting to quadratic loss, test for parallel and coincident calcium loss using profile analysis. The Potthoff and Roy growth curvc:_ model holds for more general designs than one-way MANOVA. However, the Pe are no longer given by (6-67) and the expression for its covariance matrix becomes more complicated than (6-68). We refer the reader to [12] for more examples and further tests. There are many other modifications to the model treated here. They include (a) Dropping the restriction to polynomial growth. Use non-linear parametric models or even non-parametric splines. (b) Restricting the covariance matrix to a special form such as equally correlated responses on the same individual. (c) Observing more than one response variable, over time, on the same individ ual. This results in a multivariate version of the growth curve model.
l l
] ]
- q -
•
Sec.
6.9
Perspectives and a Strategy for Analyzing M ultivariate Models
355
6.9 PERSPECTIVES AND A STRATEGY FOR ANALYZING MULTIVARIATE MODELS
We emphasize that, with several characteristics, it is important to control the over all probability of making any incorrect decision. This is particularly important when testing for the equality of two or more treatments as the examples in this chapter indicate. A single multivariate test, with its associated single P-value, is preferable to performing a large number of univariate tests. The outcome tells us whether or not it is worthwhile to look closer on a variable by variable analysis and group by group. A single multivariate test is recommended over, say, univariate tests because, as the next example demonstrates, univariate tests ignore important infor mation and can give misleading results. p
Example 6. 1 4 (Comparing mu ltivariate and univariate tests for the differences in means)
Suppose we collect measurements on two variables X1 and X2 for ten ran domly selected experimental units from each of two groups. The hypotheti cal data are shown below and displayed as scatterplots and marginal dot diagrams in Figure 6.6 on page 356. Group 5. 0 3. 0 1 4.5 3.2 1 6.0 3. 5 1 6.0 4.6 1 6.2 5.6 1 6.9 5. 2 1 6.8 6.0 1 5. 3 5. 5 1 6.6 7.3 1 7. 3 6.�9-----------2 5 1 --4�6--------4 4.9 2 4.0 4.5. 19 2 3.6.28 5.4 2 6. 1 5. 0 7.0 22 2 7.5. 31 4.6. 76 2 5. 8 7.8 2 6.8 8.0 2 Xz
--0
356
Chap.
6
Comparisons of Several Multivariate Means
++
loll L:!:..3J
8
0 + +o ++ o + 8o ++ O + 0 0 0
7
+ 0
6 5 4 3 4
+
+
++
+
+ 0 +
+
+
0 0 0 001 0 0 0 0 � xl �--------+ r---------r +---------�------�
----
Figure 6.6
+
7
6
5
Scatter plots and marginal dot diagrams for the data from two
groups.
It is clear from the horizontal marginal dot diagram that there is con siderable overlap in the x1 values for the two groups. Similarly, the vertical marginal dot diagram shows there is considerable overlap in the x2 values for the two groups. The scatterplots suggest there is fairly strong positive corre lation between the two variables for each group, and that, although there is some overlap, the group 1 measurements are generally to the southeast of the group 2 measurements. Let p{ = [ �t , �t ] be the population mean vector for the first group, and let p� = [ �t21 ,1 �t1 22 ]12be the population mean vector for the second group. Using the x observations, a univariate analysis of variance gives F 2.46 with v1 = 1 1and v2 = 18 degrees of freedom. Consequently, we cannot reject H0: �t 1 1 = �t21 at any reasonable significance level (F1 , 1 (.10) = 3. 0 1). Using the x2 observations, a univariate analysis of variance gives F 2. 68 with v1 = 1 and v2 18 degrees of freedom. Again, we cannot reject H0: �t 12 ILzz at any reasonable significance level. The univariate tests suggest there is no difference between the component means for the two groups, and hence we conclude p1 = Pz . On the other hand, if we use 2Hotelling ' s T2 to2 test(1 )for(2) the equality of the mean vectors, we find T = 17.29 > c = � 7 F2, 17 (. 0 1) = 2.118 6.11 = 12.94 and we reject H0: p1 = p2 at the 1% level. The multi variate test takes into account the positive correlation between the two mea surements for each group--information that is unfortunately ignored by the univariate tests. This T2 test is equivalent to the MANOVA test (6-38). =
8
=
=
=
X
•
Sec.
Perspectives and a Strategy for Analyzing M ultivariate Models
6.9
357
Example 6. 1 4 demonstrates the efficacy of a multivariate test relative to its univariate counterparts. We encountered exactly this situation with the effluent data in Example 6.1. In the context of random samples from several populations (recall the one way MANOVA in Section 6.4), multivariate tests are based on the matrices g g W = 2: 2: (x ei - i e H x ei - i e ) ' and B = 2: n e ( i e - i ) ( i e - i ) ' Throughout this chapter, we have used Wilks' lambda statistic A * I B + W I which is equivalent to the likelihood ratio test. Three other multivariate test sta tistics are regularly included in the output of statistical packages. -1] Lawley-Hotelling Trace = tr[BW Pillai Trace = tr [ B (B + W) - 1 ] Roy's largest root = maximum eigenvalue of W (B + W) - 1 All four of these tests appear to be nearly equivalent for extremely large sam ples. For moderate sample sizes, all comparisons are based on what is necessarily a limited number of cases studied by simulation. From the simulations reported to date, the first three tests have similar power, while the last, Roy 's test, behaves dif ferently. Its power is best only when there is a single non-zero eigenvalue and, at the same time, the power is large. This may approximate situations where a large difference exists in just one characteristic and it is between one group and all of the others. There is also some suggestion that Pillai ' s trace is slightly more robust against non-normality. However, we suggest trying transformations on the original data when the residuals are non-normal. All four statistics apply in the two-way setting and in even more complicated MANOVA. More discussion is given in terms of the multivariate regression model in Chapter 7. When, and only when, the multivariate tests signals a difference, or departure from the null hypothesis, do we probe deeper. We recommend calculating the Bon feronni intervals for all pairs of groups and all characteristics. The simultaneous confidence statements determined from the shadows of the confidence ellipse are, typically, too large. The one-at-a-time intervals may be suggestive of differences that merit further study but, with the current data, cannot be taken as conclusive evidence for the existence of differences. We summarize the procedure developed in this chapter for comparing treatments. The first step is to check the data for out liers using visual displays and other calculations. n,
f=l j=l
f=l
358
Chap.
6
Comparisons of Several Multivariate Means A
ST;�ATEGY .. FQR T;t-JE MULT;l\'ARIATE,.s�OMPARISON OF TREATMENTS
1. Try to identify outliers. Check the data group by group for outliers . . Also chegk the collectioq. ofresidP,�l vectors fiom any fitted .model fo:r 1 ' outliers. Be aware of any outliers so calculations can be performed with and without them. •
· .
a multivariate · test of hypothesis. Our choice i s the likelihood ratio test, which is equiv;:tlent to Wilks' lambda test.
2. Perform
·
3� Calculate the Bonferonni simultaneous confidence intervals. If the mul;. tivaria e test reve als a differe ce , then p oceed to calculate the treatments, and all onni confidence intervals for all pairs of groups characteristics. If no differences ary significant, try looking at Bonfer- ' · . roni intervals for the larger set of re�ponses that includes the differences and sums of pai s of respo ses .
t
n
r
n
r
or
Bonfer
·
We must issue one caution concerning the proposed strategy. It may be the case that differences would appear in only one of the many characteristics and, further, the differences hold for only a few treatment combinations. Then, these few active differences may become lost among all the inactive ones. That is, the overall test may not show significance whereas a univariate test restricted to the specific active variable would detect the difference. The best preventative is a good experimental design. To design an effective experiment when one specific variable is expected to produce differences, do not include too many other variables that are not expected to show differences among the treatments. EXERCISES
Construct and sketch a joint 95% confidence region for the mean difference vector using the effluent data and results in Example 6.1. Note that the point = 0 falls outside the 95% contour. Is this result consistent with the test of H0: = 0 considered in Example 6.1? Explain. 6.2. Using the information in Example 6.1, construct the 95% Bonferroni simul taneous intervals for the components of the mean difference vector Com pare the lengths of these intervals with those of the simultaneous intervals constructed in the example. 6.3. The data corresponding to sample 8 in Table 6.1 seem unusually large. Remove sample 8. Construct a joint 95% confidence region for the mean dif ference vector and the 95% Bonferroni simultaneous intervals for the com ponents of the mean difference vector. Are the results consistent with a test
6.1.
o o
o
o.
o
Chap.
Exercises
6
359
of H0 : = 0? Discuss. Does the "outlier" make a difference in the analysis of these data? 6.4. Refer to Example 6. 1 . (a) Redo the analysis in Example 6. 1 after transforming the pairs of obser vations to ln(BOD) and ln(SS). (b) Construct the 95% Bonferroni simultaneous intervals for the components of the mean vector of transformed variables. (c) Discuss any possible violation of the assumption of a bivariate normal dis tribution for the difference vectors of transformed observations. 6.5. A researcher considered three indices measuring the severity of heart attacks. The values of these indices for n = 40 heart-attack patients arriving at a hos pital emergency room produced the summary statistics 101. 3 63. 0 71.0 46.1 = 57. 3 and = 63.0 80.2 55. 6 i S 50.4 71.0 55. 6 97.4 (a) All three indices are evaluated for each patient. Test for the equality of mean indices using (6-16) with a = .05. (b) Judge the differences in pairs of mean indices using 95% simultaneous confidence intervals. [See (6-18). ] 6.6. Use the data for treatments 2 and 3 in Exercise 6. 8 . (a) Calculate S pooled . (b) Test H0 : p, 2 - p, 3 = 0 employing a two-sample approach with a = . 0 1. (c) Construct 99% simultaneous confidence intervals for the differences JLz; - IL3i • i = 1, 2. 6.7. Using the summary statistics for the electricity-demand data given in Exam ple 6.4, compute T2 and test the hypothesis H0: p, 1 - p, 2 = 0, assuming that I 1 = I 2 . Set a = . 05. Also, determine the linear combination of mean com ponents most responsible for the rejection of H0. 6.8. Observations on two responses are collected for three treatments. The observation vectors [ :: J are 8
8
[
[ ]
Treatment 1: Treatment 2: Treatment 3:
[�], [ �]. [�].
[�], [ !]. [�].
]
[ :]. [: ]. [ �] [� ] [ �]. [�]
360
Chap.
6
Comparisons of Several Multivariate Means
(a) Break up the observations into mean, treatment, and residual components, as in (6-35). Construct the corresponding arrays for each variable. (See Example 6.8.)
(b) Using the information in Part a, construct the one-way MANOVA table.
(c) Evaluate Wilks' lambda, Λ*, and use Table 6.3 to test for treatment effects. Set α = .01. Repeat the test using the chi-square approximation with Bartlett's correction. [See (6-39).] Compare the conclusions.

6.9. Using the contrast matrix C in (6-13), verify the relationships d_j = C x_j, d̄ = C x̄, and S_d = C S C' in (6-14).

6.10. Consider the univariate one-way decomposition of the observation x_ℓj given by (6-30). Show that the mean vector x̄1 is always perpendicular to the treatment effect vector

    (x̄1 − x̄)u1 + (x̄2 − x̄)u2 + ··· + (x̄g − x̄)ug

where u_ℓ is the n × 1 vector with ones in the n_ℓ positions corresponding to the ℓth group of observations and zeros elsewhere.

6.11. A likelihood argument provides additional support for pooling the two independent sample covariance matrices to estimate a common covariance matrix in the case of two normal populations. Give the likelihood function, L(μ1, μ2, Σ), for two independent samples of sizes n1 and n2 from N_p(μ1, Σ) and N_p(μ2, Σ) populations, respectively. Show that this likelihood is maximized by the choices μ̂1 = x̄1, μ̂2 = x̄2, and

    Σ̂ = (1/(n1 + n2)) [(n1 − 1)S1 + (n2 − 1)S2]

Hint: Use (4-16) and the maximization Result 4.10.

6.12. (Test for linear profiles, given that the profiles are parallel.) Let μ1' = [μ11, μ12, ..., μ1p] and μ2' = [μ21, μ22, ..., μ2p] be the mean responses to p treatments for populations 1 and 2, respectively. Assume that the profiles given by the two mean vectors are parallel.
(a) Show that the hypothesis that the profiles are linear can be written as

    H0: (μ1i + μ2i) − (μ1,i−1 + μ2,i−1) = (μ1,i−1 + μ2,i−1) − (μ1,i−2 + μ2,i−2),   i = 3, ..., p

or as H0: C(μ1 + μ2) = 0, where the (p − 2) × p matrix
    C = [ 1  −2   1   0  ···  0   0   0
          0   1  −2   1  ···  0   0   0
          ⋮                            ⋮
          0   0   0   0  ···  1  −2   1 ]

(b) Following an argument similar to the one leading to (6-62), we reject H0: C(μ1 + μ2) = 0 at level α if

    T² = (x̄1 + x̄2)' C' [ (1/n1 + 1/n2) C S_pooled C' ]⁻¹ C (x̄1 + x̄2) > c²

where

    c² = ((n1 + n2 − 2)(p − 2) / (n1 + n2 − p + 1)) F_{p−2, n1+n2−p+1}(α)

Let n1 = 30, n2 = 30, x̄1' = [6.4, 6.8, 7.3, 7.0], x̄2' = [4.3, 4.9, 5.3, 5.1], and

    S_pooled = [ .61  .26  .07  .16
                 .26  .64  .17  .14
                 .07  .17  .81  .03
                 .16  .14  .03  .31 ]

Test for linear profiles, assuming that the profiles are parallel. Use α = .05.

6.13. (Two-way MANOVA without replications.) Consider the observations on two responses, x1 and x2, displayed in the form of the following two-way table (note that there is a single observation vector at each combination of factor levels):

                              Factor 2
                  Level 1   Level 2   Level 3   Level 4
    Factor 1  Level 1
              Level 2
              Level 3
With no replications, the two-way MANOVA model is

    X_ℓk = μ + τ_ℓ + β_k + e_ℓk;     Σ_{ℓ=1}^{g} τ_ℓ = Σ_{k=1}^{b} β_k = 0

where the e_ℓk are independent N_p(0, Σ) random vectors.
(a) Decompose the observations for each of the two variables as

    x_ℓk = x̄ + (x̄_ℓ· − x̄) + (x̄_·k − x̄) + (x_ℓk − x̄_ℓ· − x̄_·k + x̄)

similar to the arrays in Example 6.8. For each response, this decomposition will result in several 3 × 4 matrices. Here x̄ is the overall average, x̄_ℓ· is the average for the ℓth level of factor 1, and x̄_·k is the average for the kth level of factor 2.
(b) Regard the rows of the matrices in Part a as strung out in a single "long" vector, and compute the sums of squares

    SS_tot = SS_mean + SS_fac1 + SS_fac2 + SS_res

and sums of cross products

    SCP_tot = SCP_mean + SCP_fac1 + SCP_fac2 + SCP_res

Consequently, obtain the matrices SSP_cor, SSP_fac1, SSP_fac2, and SSP_res with degrees of freedom gb − 1, g − 1, b − 1, and (g − 1)(b − 1), respectively.
(c) Summarize the calculations in Part b in a MANOVA table.
Hint: This MANOVA table is consistent with the two-way MANOVA table for comparing factors and their interactions where n = 1. Note that, with n = 1, SSP_int in the general two-way MANOVA table is a zero matrix with zero degrees of freedom. The matrix of interaction sum of squares and cross products now becomes the residual sum of squares and cross products matrix.
(d) Given the summary in Part c, test for factor 1 and factor 2 main effects at the α = .05 level.
Hint: Use the results in (6-56) and (6-58) with gb(n − 1) replaced by (g − 1)(b − 1).
Note: The tests require that p ≤ (g − 1)(b − 1) so that SSP_res will be positive definite (with probability 1).
6.14. A replicate of the experiment in Exercise 6.13 yields the following data:
                              Factor 2
                  Level 1   Level 2   Level 3   Level 4
    Factor 1  Level 1
              Level 2
              Level 3

(a) Use these data to decompose each of the two measurements in the observation vector as

    x_ℓk = x̄ + (x̄_ℓ· − x̄) + (x̄_·k − x̄) + (x_ℓk − x̄_ℓ· − x̄_·k + x̄)

where x̄ is the overall average, x̄_ℓ· is the average for the ℓth level of factor 1, and x̄_·k is the average for the kth level of factor 2. Form the corresponding arrays for each of the two responses. (A sketch of this decomposition in code follows Exercise 6.15.)
(b) Combine the preceding data with the data in Exercise 6.13, and carry out the necessary calculations to complete the general two-way MANOVA table.
(c) Given the results in Part b, test for interactions, and if the interactions do not exist, test for factor 1 and factor 2 main effects. Use the likelihood ratio test with α = .05.
(d) If main effects, but no interactions, exist, examine the nature of the main effects by constructing Bonferroni simultaneous 95% confidence intervals for differences of the components of the factor effect parameters.

6.15. Refer to Example 6.11.
(a) Carry out approximate chi-square (likelihood ratio) tests for the factor 1 and factor 2 effects. Set α = .05. Compare these results with the results for the exact F-tests given in the example. Explain any differences.
(b) Using (6-59), construct simultaneous 95% confidence intervals for differences in the factor 1 effect parameters for pairs of the three responses. Interpret these intervals. Repeat these calculations for factor 2 effect parameters.
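The mean, factor, and residual arrays asked for in Exercises 6.13 and 6.14 can be organized as in the following numpy sketch. The 3 × 4 array of bivariate observations below is a randomly generated stand-in (the exercise's actual observation vectors should be substituted), so the printed numbers are illustrative only.

```python
import numpy as np

# Hypothetical 3 x 4 x 2 array: g = 3 levels of factor 1, b = 4 levels of factor 2,
# one bivariate observation per cell.  Replace with the exercise's observation vectors.
rng = np.random.default_rng(1)
X = rng.integers(-8, 15, size=(3, 4, 2)).astype(float)

g, b, p = X.shape
xbar = X.mean(axis=(0, 1))                    # overall average, one value per response
row = X.mean(axis=1) - xbar                   # factor 1 effects, g x p
col = X.mean(axis=0) - xbar                   # factor 2 effects, b x p
resid = X - xbar - row[:, None, :] - col[None, :, :]

def sscp(A):
    """Sum of squares and cross products of the p-variate entries of A."""
    A2 = A.reshape(-1, p)
    return A2.T @ A2

SSP_mean = g * b * np.outer(xbar, xbar)
SSP_fac1 = b * row.T @ row                    # degrees of freedom g - 1
SSP_fac2 = g * col.T @ col                    # degrees of freedom b - 1
SSP_res  = sscp(resid)                        # degrees of freedom (g - 1)(b - 1)
SSP_tot  = sscp(X)                            # total (uncorrected) sums of squares

print(np.allclose(SSP_tot, SSP_mean + SSP_fac1 + SSP_fac2 + SSP_res))   # True
```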
The following exercises may require the use of a computer.

6.16. Four measures of the response stiffness on each of 30 boards are listed in Table 4.3 (see Example 4.14). The measures, on a given board, are repeated in the sense that they were made one after another. Assuming that the measures of stiffness arise from 4 treatments, test for the equality of treatments in a repeated measures design context. Set α = .05. Construct a 95% (simultaneous) confidence interval for a contrast in the mean levels representing a comparison of the dynamic measurements with the static measurements.

6.17. Jolicoeur and Mosimann [11] studied the relationship of size and shape for painted turtles. Table 6.7 contains their measurements on the carapaces of 24 female and 24 male turtles.

TABLE 6.7  CARAPACE MEASUREMENTS (IN MILLIMETERS) FOR PAINTED TURTLES

                 Female                             Male
    Length    Width    Height         Length    Width    Height
     (x1)      (x2)      (x3)           (x1)      (x2)      (x3)
       98        81        38             93        74        37
      103        84        38             94        78        35
      103        86        42             96        80        35
      105        86        42            101        84        39
      109        88        44            102        85        38
      123        92        50            103        81        37
      123        95        46            104        83        39
      133        99        51            106        83        39
      133       102        51            107        82        38
      133       102        51            112        89        40
      134       100        48            113        88        40
      136       102        49            114        86        40
      138        98        51            116        90        43
      138        99        51            117        90        41
      141       105        53            117        91        41
      147       108        57            119        93        41
      149       107        55            120        89        40
      153       107        56            120        93        44
      155       115        63            121        95        42
      155       117        60            125        93        45
      158       115        62            127        96        45
      159       118        63            128        95        45
      162       124        61            131        95        46
      177       132        67            135       106        47
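A minimal sketch of the two-sample comparison used in parts (a) to (c) that follow is given below. It assumes numpy and scipy are available and that `female` and `male` are 24 × 3 arrays holding the Table 6.7 columns; taking logarithms first is one option suggested by the exercise's hint.

```python
import numpy as np
from scipy import stats

def two_sample_T2(x, y, alpha=0.05):
    """Pooled two-sample Hotelling T^2 and the scaled F critical value."""
    n1, p = x.shape
    n2 = y.shape[0]
    d = x.mean(axis=0) - y.mean(axis=0)
    S_pool = ((n1 - 1) * np.cov(x, rowvar=False) +
              (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    T2 = d @ np.linalg.solve((1 / n1 + 1 / n2) * S_pool, d)
    crit = ((n1 + n2 - 2) * p / (n1 + n2 - p - 1)
            * stats.f.ppf(1 - alpha, p, n1 + n2 - p - 1))
    return T2, crit

# Example call, after reading the table into `female` and `male`:
# T2, crit = two_sample_T2(np.log(female), np.log(male))
```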
(a) Test for equality of the two population mean vectors using α = .05.
(b) If the hypothesis in Part a is rejected, find the linear combination of mean components most responsible for rejecting H0.
(c) Find simultaneous confidence intervals for the component mean differences. Compare with the Bonferroni intervals.
Hint: You may wish to consider logarithmic transformations of the observations.

6.18. In the first phase of a study of the cost of transporting milk from farms to dairy plants, a survey was taken of firms engaged in milk transportation. Cost data on X1 = fuel, X2 = repair, and X3 = capital, all measured on a per-mile basis, are presented in Table 6.8 on page 366 for n1 = 36 gasoline and n2 = 23 diesel trucks.
(a) Test for differences in the mean cost vectors. Set α = .01.
(b) If the hypothesis of equal cost vectors is rejected in Part a, find the linear combination of mean components most responsible for the rejection.
(c) Construct 99% simultaneous confidence intervals for the pairs of mean components. Which costs, if any, appear to be quite different?
(d) Comment on the validity of the assumptions used in your analysis. Note in particular that observations 9 and 21 for gasoline trucks have been identified as multivariate outliers. (See Exercise 5.20 and [2].) Repeat Part a with these observations deleted. Comment on the results.

6.19. The tail lengths in millimeters (x1) and wing lengths in millimeters (x2) for 45 male hook-billed kites are given in Table 6.9 on page 367. Similar measurements for female hook-billed kites were given in Table 5.11.
(a) Plot the male hook-billed kite data as a scatter diagram, and (visually) check for outliers. (Note, in particular, observation 31 with x1 = 284.)
(b) Test for equality of mean vectors for the populations of male and female hook-billed kites. Set α = .05. If H0: μ1 − μ2 = 0 is rejected, find the linear combination most responsible for the rejection of H0. (You may want to eliminate any outliers found in Part a for the male hook-billed kite data before conducting this test. Alternatively, you may want to interpret x1 = 284 for observation 31 as a misprint and conduct the test with x1 = 184 for this observation. Does it make any difference in this case how observation 31 for the male hook-billed kite data is treated?)
(c) Determine the 95% confidence region for μ1 − μ2 and 95% simultaneous confidence intervals for the components of μ1 − μ2.
(d) Are male or female birds generally larger?

6.20. Using Moody's bond ratings, samples of 20 Aa (middle-high quality) corporate bonds and 20 Baa (top-medium quality) corporate bonds were selected. For each of the corresponding companies, the ratios
    X1 = current ratio (a measure of short-term liquidity)
    X2 = long-term interest rate (a measure of interest coverage)
    X3 = debt-to-equity ratio (a measure of financial risk or leverage)
    X4 = rate of return on equity (a measure of profitability)
TABLE 6.8  MILK TRANSPORTATION-COST DATA
(x1 = fuel, x2 = repair, x3 = capital, all per mile; gasoline trucks, n1 = 36, followed by diesel trucks, n2 = 23)
xl
16.44 7.19 9.92 4.24 11.20 14.25 13.50 13. 32 29.11 12. 68 7. 5 1 9.90 10.25 11.11 12. 1 7 10.24 10. 1 8 8.88 12.34 8. 5 1 26.16 12.95 16. 93 14.70 10.32 8.98 9.70 12.72 9.49 8.22 13.70 8.21 15.86 9.18 12.49 17.32
Xz
12.43 2.70 1.35 5.78 5.05 5.78 10.98 14.27 15.09 7.61 5.80 3.63 5.07 6. 1 5 14.26 2.59 6.05 2.70 7.73 14.02 17.44 8.24 13.37 10.78 5.16 4.49 11.59 8.63 2. 1 6 7.95 11.22 9.85 11.42 9. 1 8 4.67 6.86
x3
11.23 3. 92 9.75 7.78 10.67 9.88 10.60 9.45 3.28 10.23 8. 1 3 9. 1 3 10. 1 7 7. 6 1 14.39 6.09 12. 1 4 12.23 11.68 12.01 16.89 7. 1 8 17.59 14.58 17.00 4.26 6.83 5.59 6.23 6.72 4.91 8. 1 7 13.06 9.49 11.94 4.44
xl
8.50 7. 42 10.28 10.16 12.79 9. 60 6.47 11.35 9. 1 5 9.70 9.77 11. 61 9.09 8. 5 3 8.29 15.90 11.94 9.54 10.43 10.87 7.13 11.88 12.03
Source: Data courtesy of M. Keaton.
Xz
12.26 5.13 3.32 14.72 4.17 12.72 8.89 9.95 2.94 5.06 17.86 11.75 13.25 10. 1 4 6.22 12. 90 5. 69 16.77 17. 65 21.52 13.22 12. 1 8 9.22
x3
9.1 1 17.15 11.23 5. 99 29.28 11.00 19.00 14.53 13.68 20.84 35. 1 8 17.00 20.66 17.45 16.38 19.09 14.77 22.66 10. 66 28.47 19.44 21.20 23.09
TABLE 6.9  MALE HOOK-BILLED KITE DATA

      x1        x2            x1        x2            x1        x2
    (Tail     (Wing         (Tail     (Wing         (Tail     (Wing
    length)   length)       length)   length)       length)   length)
      180       278           185       282           284       277
      186       277           195       285           176       281
      206       308           183       276           185       287
      184       290           202       308           191       295
      177       273           177       254           177       267
      177       284           177       268           197       310
      176       267           170       260           199       299
      200       281           186       274           190       273
      191       287           177       272           180       278
      193       271           178       266           189       280
      212       302           192       281           194       290
      181       254           204       276           186       287
      195       297           191       290           191       286
      187       281           178       265           187       288
      190       284           177       275           186       275

Source: Data courtesy of S. Temple.
were recorded. The summary statistics are as follows:

Aa bond companies:  n1 = 20,  x̄1 = [2.287, 12.600, .347, 14.830]', and

    S1 = [  .459    .254   −.026   −.244
            .254  27.465   −.589   −.267
           −.026   −.589    .030    .102
           −.244   −.267    .102   6.854 ]

Baa bond companies:  n2 = 20,  x̄2 = [2.404, 7.155, .524, 12.840]',

    S2 = [  .944   −.089    .002   −.719
           −.089  16.432   −.400  19.044
            .002   −.400    .024   −.094
           −.719  19.044   −.094  61.854 ]

and

    S_pooled = [  .701    .083   −.012   −.481
                  .083  21.949   −.494   9.388
                 −.012   −.494    .027    .004
                 −.481   9.388    .004  34.354 ]
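A hedged numpy/scipy sketch of the pooled two-sample T² computation asked for in parts (a) through (c) below is given here; the variable names are ours, and the summary statistics are exactly those just listed.

```python
import numpy as np
from scipy import stats

n1, n2, p = 20, 20, 4
xbar1 = np.array([2.287, 12.600, 0.347, 14.830])
xbar2 = np.array([2.404, 7.155, 0.524, 12.840])
S_pooled = np.array([[ 0.701,  0.083, -0.012, -0.481],
                     [ 0.083, 21.949, -0.494,  9.388],
                     [-0.012, -0.494,  0.027,  0.004],
                     [-0.481,  9.388,  0.004, 34.354]])

# Hotelling's T^2 for H0: mu1 - mu2 = 0 with the pooled covariance matrix.
d = xbar1 - xbar2
T2 = d @ np.linalg.solve((1 / n1 + 1 / n2) * S_pooled, d)

# Scaled F critical value at alpha = .05.
alpha = 0.05
c2 = ((n1 + n2 - 2) * p / (n1 + n2 - p - 1)
      * stats.f.ppf(1 - alpha, p, n1 + n2 - p - 1))
print(T2, c2, T2 > c2)

# Linear combination most responsible for rejection (proportional to S_pooled^{-1} d).
a_hat = np.linalg.solve(S_pooled, d)
print(a_hat)
```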
(a) Does pooling appear reasonable here? Comment on the pooling procedure in this case.
(b) Are the financial characteristics of firms with Aa bonds different from those with Baa bonds? Using the pooled covariance matrix, test for the equality of mean vectors. Set α = .05.
(c) Calculate the linear combinations of mean components most responsible for rejecting H0: μ1 − μ2 = 0 in Part b.
(d) Bond rating companies are interested in a company's ability to satisfy its outstanding debt obligations as they mature. Does it appear as if one or more of the foregoing financial ratios might be useful in helping to classify a bond as "high" or "medium" quality? Explain.

6.21. Researchers interested in assessing pulmonary function in nonpathological populations asked subjects to run on a treadmill until exhaustion. Samples of air were collected at definite intervals and the gas contents analyzed. The results on 4 measures of oxygen consumption for 25 males and 25 females are given in Table 6.10 on page 369. The variables were
    X1 = resting volume O2 (L/min)
    X2 = resting volume O2 (mL/kg/min)
    X3 = maximum volume O2 (L/min)
    X4 = maximum volume O2 (mL/kg/min)
(a) Look for gender differences by testing for equality of group means. Use α = .05. If you reject H0: μ1 − μ2 = 0, find the linear combination most responsible.
(b) Construct the 95% simultaneous confidence intervals for each μ1i − μ2i, i = 1, 2, 3, 4. Compare with the corresponding Bonferroni intervals.
(c) The data in Table 6.10 were collected from graduate-student volunteers, and thus they do not represent a random sample. Comment on the possible implications of this information.

6.22. Construct a one-way MANOVA using the width measurements from the iris data in Table 11.5. Construct 95% simultaneous confidence intervals for differences in mean components for the two responses for each pair of populations. Comment on the validity of the assumption that Σ1 = Σ2 = Σ3.

6.23. Construct a one-way MANOVA of the crude-oil data listed in Table 11.7. Construct 95% simultaneous confidence intervals to determine which mean components differ among the populations. (You may want to consider transformations of the data to make them more closely conform to the usual MANOVA assumptions.)

6.24. A project was designed to investigate how consumers in Green Bay, Wisconsin, would react to an electrical time-of-use pricing scheme. The cost of electricity during peak periods for some customers was set at eight times the
TABLE 6.10  OXYGEN-CONSUMPTION DATA
Females Males X Resting 02 Resting2 02 Maximum 02 Maximum 02 Resting! 02 Resting2 02 Maximum 02 Maximum 02 (L/min) (mL/kg/min) (L/min) (mL/kg/min) (L/min) (mL/kg/min) (L/min) (mL/kg/min) 33. 85 0.34 1. 93 3.7 1 2. 87 30.87 0.29 5.04 35.82 0.39 2.5 1 5.08 43. 85 3. 38 0.28 3.95 36.40 0.48 2. 3 1 5.13 44. 5 1 0. 3 1 4.88 37. 87 0.31 1. 90 3. 95 3. 60 46.00 5. 97 0.30 38.30 0.36 2.32 4.57 5. 5 1 3.11 47.02 0.28 39.19 0.33 2.49 4.07 48.50 3.95 1.74 0.11 39. 2 1 0.43 2.12 4.77 4.39 48.75 0.25 4.66 39. 94 0.48 1.98 6.69 3. 5 0 48.86 0.26 5.28 42. 4 1 0.21 2.25 3.71 2. 82 48.92 7.32 0.39 28.97 0.32 1.7 1 4.35 48.38 3. 5 9 6.22 0. 3 7 37.80 0.54 2.76 7.89 3.47 4.20 50. 5 6 0. 3 1 31.10 0.32 2.10 5. 37 3.07 51.15 5.10 0.35 38.30 0.40 2.50 4. 95 4.43 4.46 55. 34 0.29 51.80 0.31 3. 06 4.97 5.60 3.5 6 56.67 0.33 37. 60 0.44 2. 40 6.68 2. 80 3.86 58.49 0.18 36.78 2.58 0.32 4.01 4.80 49.99 3.31 0.28 46.16 3.05 0.50 6.43 6.69 42.25 3.29 0.44 38.95 1.85 0.36 4.55 5.99 51.70 3.10 0.22 40.60 2.43 0.48 6.30 4.80 5.73 63.30 0.34 43.69 2.58 5.12 0.40 6.00 3.06 46.23 0.30 30.40 1. 97 0.42 6.04 4.77 55.08 3.85 0. 3 1 39.46 2.03 0.55 6.45 5.00 5.16 58. 80 0.27 39.34 2.32 0.50 11.05 5.55 0.66 57.46 5. 23 34.86 0.34 2. 48 4.27 5.23 4. 00 0.37 50.35 35.07 2.25 0.40 4.58 5.37 2.82 0.35 32.48 xl
x
x3
4.13
0
w
�
Source: Data courtesy of S. Rokicki.
x4
x
x3
x4
cost of electricity during off-peak hours. Hourly consumption (in kilowatt-hours) was measured on a hot summer day in July and compared, for both the test group and the control group, with baseline consumption measured on a similar day before the experimental rates began. The responses,

    log(current consumption) − log(baseline consumption)

for the hours ending 9 A.M., 11 A.M. (a peak hour), 1 P.M., and 3 P.M. (a peak hour) produced the following summary statistics:

    Test group:     n1 = 28,  x̄1 = [.153, −.231, −.322, −.339]'
    Control group:  n2 = 58,  x̄2 = [.151, .180, .256, .257]'

and

    S_pooled = [ .804  .355  .228  .232
                 .355  .722  .233  .199
                 .228  .233  .592  .239
                 .232  .199  .239  .479 ]
Source: Data courtesy of Statistical Laboratory, University of Wisconsin.

Perform a profile analysis. Does time-of-use pricing seem to make a difference in electrical consumption? What is the nature of this difference, if any? Comment. (Use a significance level of α = .05 for any statistical tests.)

6.25. As part of the study of love and marriage in Example 6.12, a sample of husbands and wives were asked to respond to these questions:
    1. What is the level of passionate love you feel for your partner?
    2. What is the level of passionate love that your partner feels for you?
    3. What is the level of companionate love that you feel for your partner?
    4. What is the level of companionate love that your partner feels for you?
The responses were recorded on the following 5-point scale.

    None at all   Very little   Some   A great deal   Tremendous amount
         1             2          3          4                 5

Thirty husbands and 30 wives gave the responses in Table 6.11, where X1 = a 5-point-scale response to Question 1, X2 = a 5-point-scale response to Question 2, X3 = a 5-point-scale response to Question 3, and X4 = a 5-point-scale response to Question 4.
(a) Plot the mean vectors for husbands and wives as sample profiles.
TABLE 6.11  SPOUSE DATA

Husband rating wife (x1, x2, x3, x4)
2 5 4 4 3 3 3 4 4 4 4 5 4 4 4 3 4 5 5 4 4 4 3 5 5 3 4 3 4 4
Xz
3 5 5 3 3 3 4 4 5 4 4 5 4 3 4 3 5 5 5 4 4 4 4 3 5 3 4 3 4 4
x3
5 4 5 4 5 4 4 5 5 3 5 4 4 5 5 4 4 5 4 4 4 4 5 5 3 4 4 5 3 5
x4
5 4 5 4 5 5 4 5 5 3 5 4 4 5 5 5 4 5 4 4 4 4 5 5 3 4 4 5 3 5
Wife rating husband (x1, x2, x3, x4)
4 4 4 4 4 3 4 3 4 3 4 5 4 4 4 3 5 4 3 5 5 4 2 3 4 4 4 3 4 4
Xz
4 5 4 5 4 3 3 4 4 4 5 5 4 4 4 4 5 5 4 3 3 5 5 4 3 4 4 4 4 4
x3
5 5 5 5 5 4 5 5 5 4 5 5 5 4 5 4 5 4 4 4 4 4 5 5 5 4 5 4 5 5
x4
5 5 5 5 5 4 4 5 4 4 5 5 5 4 5 4 5 4 4 4 4 4 5 5 5 4 5 4 4 5
Source: Data courtesy of E. Hatfield.

(b) Is the husband rating wife profile parallel to the wife rating husband profile? Test for parallel profiles with α = .05. If the profiles appear to be parallel, test for coincident profiles at the same level of significance. Finally, if the profiles are coincident, test for level profiles with α = .05. What conclusion(s) can be drawn from this analysis?
6.26. Two species of biting flies (genus Leptoconops) are so similar morphologically that for many years they were thought to be the same. Biological differences such as sex ratios of emerging flies and biting habits were found to exist. Do the taxonomic data listed in part in Table 6.12 on page 373 and on the computer disk indicate any difference in the two species L. carteri and L. torrens? Test for the equality of the two population mean vectors using α = .05. If the hypothesis of equal mean vectors is rejected, determine the mean components (or linear combinations of mean components) most responsible for rejecting H0. Justify your use of normal-theory methods for these data.

6.27. Using the data on bone mineral content in Table 1.6, investigate equality between the dominant and nondominant bones.
(a) Test using α = .05.
(b) Construct 95% simultaneous confidence intervals for the mean differences.
(c) Construct the Bonferroni 95% simultaneous intervals, and compare these with the intervals in Part b.

6.28. Table 6.13 on page 374 contains the bone mineral contents, for the first 24 subjects in Table 1.6, 1 year after their participation in an experimental program. Compare the data from both tables to determine whether there has been bone loss.
(a) Test using α = .05.
(b) Construct 95% simultaneous confidence intervals for the mean differences.
(c) Construct the Bonferroni 95% simultaneous intervals, and compare these with the intervals in Part b.

6.29. Peanuts are an important crop in parts of the southern United States. In an effort to develop improved plants, crop scientists routinely compare varieties with respect to several variables. The data for one two-factor experiment are given in Table 6.14 on page 375. Three varieties (5, 6, and 8) were crossed with two geographical locations (1, 2), and, in this case, the three variables representing yield and the two most important grade-grain characteristics were measured. The three variables are:
    X1 = Yield (plot weight)
    X2 = Sound mature kernels (weight in grams-maximum of 250 grams)
    X3 = Seed size (weight, in grams, of 100 seeds)
There were two replications of the experiment.
(a) Perform a two-factor MANOVA using the data in Table 6.14. Test for a location effect, a variety effect, and a location-variety interaction. Use α = .05.
(b) Analyze the residuals from Part a. Do the usual MANOVA assumptions appear to be satisfied? Discuss.
(c) Using the results in Part a, can we conclude that the location and/or variety effects are additive? If not, does the interaction effect show up for some variables, but not for others? Check by running three separate univariate two-factor ANOVA's.
(d) Larger numbers correspond to better yield and grade-grain characteristics. Using location 2, can we conclude that one variety is better than the
TABLE 6.12  BITING-FLY DATA
(x1 = wing length, x2 = wing width, x3 = third palp length, x4 = third palp width, x5 = fourth palp length, x6 = length of antennal segment 12, x7 = length of antennal segment 13)
L. torrens
L. carteri
8587 9492 919096 9291 87 106 105 103 100 109 10495 10490 104 8694 10382 103 101 103 10099 100 1109999 10395 101 10399 10599
Xz
4138 4443 444243 4341 38 4746 4441 4445 444040 46 404819 4143 4345 414443 454442 464743 435047 47
of ) ( Un�h of ) ( Thirdpalp ) (Thirdpalp ) (palpourth) ( Ungili antennal antennal x3
length 323136 3235 363636 3635 3834 353436 3635 3437 37 3738 354239 4044 424340 413538 363838 4037 4039
x4
Xs
width length 141315 222725 1714 28 26 161712 262624 1411 2324 2631 141515 1413 242723 232930 141515 2230 1412 1114 2531 3325 121514 1514 252932 151618 313431 1417 3336 141516 323131 151414 322337 1614 3334
x6
x7
segment 12 1389 1099 999 9 1010 101110 109 99 10 96 109 9119 101011 9109 108 1111 1112 7
segment 13 1389 1099 999 10 1011 101010 101010 1010 7109 998 101011 10 10109 108 1111 1011 7
Source: Data courtesy of William Atchley.
TABLE 6.13  MINERAL CONTENT IN BONES (AFTER 1 YEAR)
Subject Dominant Dominant Dominant number radius Radius humerus Humerus ulna Ulna 1 1.027 1. 051 2.268 2.246 .869 . 964 2 . 602 .689 . 857 .817 1.7 18 1.7 10 .765 .738 3 . 875 .880 1.953 1.756 .761 .698 1.443 4 .698 1.668 .873 . 5 51 . 6 19 1.661 5 . 8 11 .813 1.643 .753 . 5 15 1.378 6 .734 1.3 96 .640 .708 .787 1.686 7 . 947 .865 1. 851 .687 . 7 15 1. 8 15 .886 .806 1.742 8 .844 .656 1.776 .991 .923 1. 931 9 .869 .789 2. 1 06 . 925 1. 933 . 977 10 .654 .726 1. 651 .826 1. 609 .825 11 . 692 .526 1.980 .765 2.352 .851 12 . 670 .580 1.420 .770 13 .730 1.470 .823 .773 1.809 .875 1.846 .912 14 .746 .729 1.579 15 .826 1.842 . 905 .656 .506 1.860 .727 1. 747 .756 16 .693 .740 1.941 .764 1.923 .765 17 .883 .785 1. 997 .914 2. 1 90 .932 18 .577 .627 1.228 19 .782 1.242 .843 . 802 .769 1.999 . 906 2. 1 64 .879 20 . 540 .498 1. 3 30 1. 5 73 .537 .673 21 . 804 .779 2. 1 59 .900 2. 1 30 . 949 22 .570 .634 1.265 .637 1.041 .463 23 .585 .640 1.411 .743 1.442 .776 24
Source: Data courtesy of Everett Smith.

other two for each characteristic? Discuss your answer, using 95% Bonferroni simultaneous intervals for pairs of varieties.

TABLE 6.14  PEANUT DATA

    Factor 1    Factor 2       x1          x2          x3
    Location    Variety      Yield     SdMatKer    SeedSize
        1          5         195.3       153.1        51.4
        1          5         194.3       167.7        53.7
        2          5         189.7       139.5        55.5
        2          5         180.4       121.1        44.4
        1          6         203.0       156.8        49.8
        1          6         195.9       166.0        45.8
        2          6         202.7       166.1        60.4
        2          6         197.6       161.8        54.1
        1          8         193.5       164.5        57.8
        1          8         187.0       165.1        58.6
        2          8         201.5       166.8        65.0
        2          8         200.0       173.8        67.2

Source: Data courtesy of Yolanda Lopez.

6.30. Refer to Example 6.13.
(a) Plot the profiles, the components of x̄1 versus time and those of x̄2 versus time, on the same graph. Comment on the comparison.
(b) Test that linear growth is adequate. Take α = .01.

6.31. Refer to Example 6.13 but treat all 31 subjects as a single group. The maximum likelihood estimate of the (q + 1) × 1 vector β is

    β̂ = (B' S⁻¹ B)⁻¹ B' S⁻¹ x̄

where S is the sample covariance matrix. The estimated covariances of the maximum likelihood estimators are

    Côv(β̂) = ((n − 1)(n − 2) / ((n − 1 − p + q)(n − p + q) n)) (B' S⁻¹ B)⁻¹

Fit a quadratic growth curve to this single group and comment on the fit.
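As a rough illustration of the estimators quoted in Exercise 6.31, the sketch below codes the two formulas directly; the design matrix B, the ages, and the simulated data are hypothetical stand-ins (they are not the B, S, and x̄ of Example 6.13), and numpy is assumed to be available.

```python
import numpy as np

def growth_curve_mle(xbar, S, B, n):
    """ML estimate of beta and its estimated covariance for E[X] = B beta,
    using the formulas quoted in Exercise 6.31."""
    p, q1 = B.shape                       # q1 = q + 1
    q = q1 - 1
    Sinv = np.linalg.inv(S)
    M = B.T @ Sinv @ B
    beta_hat = np.linalg.solve(M, B.T @ Sinv @ xbar)
    k = (n - 1) * (n - 2) / ((n - 1 - p + q) * (n - p + q) * n)
    cov_beta = k * np.linalg.inv(M)
    return beta_hat, cov_beta

# Hypothetical quadratic growth curve over p = 4 equally spaced ages.
ages = np.array([8.0, 10.0, 12.0, 14.0])
B = np.column_stack([np.ones(4), ages, ages**2])     # p x (q + 1), q = 2

rng = np.random.default_rng(0)
X = rng.normal(size=(31, 4)) + ages                   # stand-in for 31 observation vectors
beta_hat, cov_beta = growth_curve_mle(X.mean(axis=0), np.cov(X, rowvar=False), B, n=31)
print(beta_hat)
print(cov_beta)
```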
"
REFERENCES

1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Bacon-Shone, J., and W. K. Fung. "A New Graphical Method for Detecting Single and Multiple Outliers in Univariate and Multivariate Data." Applied Statistics, 36, no. 2 (1987), 153-162.
3. Bartlett, M. S. "Properties of Sufficiency and Statistical Tests." Proceedings of the Royal Society of London (A), 160 (1937), 268-282.
4. Bartlett, M. S. "Further Aspects of the Theory of Multiple Regression." Proceedings of the Cambridge Philosophical Society, 34 (1938), 33-40.
5. Bartlett, M. S. "Multivariate Analysis." Journal of the Royal Statistical Society Supplement (B), 9 (1947), 176-197.
6. Bartlett, M. S. "A Note on the Multiplying Factors for Various χ² Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.
7. Bhattacharyya, G. K., and R. A. Johnson. Statistical Concepts and Methods. New York: John Wiley, 1977.
8. Box, G. E. P., and N. R. Draper. Evolutionary Operation: A Statistical Method for Process Improvement. New York: John Wiley, 1969.
9. Box, G. E. P., W. G. Hunter, and J. S. Hunter. Statistics for Experimenters. New York: John Wiley, 1978.
10. John, P. W. M. Statistical Design and Analysis of Experiments. New York: Macmillan, 1971.
11. Jolicoeur, P., and J. E. Mosimann. "Size and Shape Variation in the Painted Turtle: A Principal Component Analysis." Growth, 24 (1960), 339-354.
12. Kshirsagar, A. M., and W. B. Smith. Growth Curves. New York: Marcel Dekker, 1995.
13. Morrison, D. F. Multivariate Statistical Methods (2d ed.). New York: McGraw-Hill, 1976.
14. Pearson, E. S., and H. Hartley, eds. Biometrika Tables for Statisticians, vol. I. Cambridge, England: Cambridge University Press, 1972.
15. Potthoff, R. F., and S. N. Roy. "A Generalized Multivariate Analysis of Variance Model Useful Especially for Growth Curve Problems." Biometrika, 51 (1964), 313-326.
16. Scheffé, H. The Analysis of Variance. New York: John Wiley, 1959.
17. Tiku, M. L., and N. Balakrishnan. "Testing the Equality of Variance-Covariance Matrices the Robust Way." Communications in Statistics-Theory and Methods, 14, no. 12 (1985), 3033-3051.
18. Tiku, M. L., and M. Singh. "Robust Statistics for Testing Mean Vectors of Multivariate Distributions." Communications in Statistics-Theory and Methods, 11, no. 9 (1982), 985-1001.
19. Timm, N. H. Multivariate Analysis with Applications in Education and Psychology. Monterey, California: Brooks/Cole, 1975.
20. Wilks, S. S. "Certain Generalizations in the Analysis of Variance." Biometrika, 24 (1932), 471-494.
CHAPTER 7

Multivariate Linear Regression Models

7.1 INTRODUCTION
Regression analysis is the statistical methodology for predicting values of one or more response (dependent) variables from a collection of predictor (independent) variable values. It can also be used for assessing the effects of the predictor variables on the responses. Unfortunately, the name regression, culled from the title of the first paper on the subject by F. Galton [13], in no way reflects either the importance or breadth of application of this methodology.

In this chapter, we first discuss the multiple regression model for the prediction of a single response. This model is then generalized to handle the prediction of several dependent variables. Our treatment must be somewhat terse, as a vast literature exists on the subject. (If you are interested in pursuing regression analysis, see the following books, in ascending order of difficulty: Bowerman and O'Connell [5], Neter, Kutner, Nachtsheim, and Wasserman [16], Draper and Smith [11], Seber [19], and Goldberger [14].) Our abbreviated treatment highlights the regression assumptions and their consequences, alternative formulations of the regression model, and the general applicability of regression techniques to seemingly different situations.

7.2 THE CLASSICAL LINEAR REGRESSION MODEL
Let z1, z2, ..., zr be r predictor variables thought to be related to a response variable Y. For example, with r = 4, we might have

    Y  = current market value of home

and

    z1 = square feet of living area
    z2 = location (indicator for zone of city)
    z3 = appraised value last year
    z4 = quality of construction (price per square foot)

The classical linear regression model states that Y is composed of a mean, which depends in a continuous manner on the z_i's, and a random error ε, which accounts for measurement error and the effects of other variables not explicitly considered in the model. The values of the predictor variables recorded from the experiment or set by the investigator are treated as fixed. The error (and hence the response) is viewed as a random variable whose behavior is characterized by a set of distributional assumptions.

Specifically, the linear regression model with a single response takes the form

    Y = β0 + β1 z1 + ··· + βr zr + ε
    [Response] = [mean (depending on z1, z2, ..., zr)] + [error]

The term linear refers to the fact that the mean is a linear function of the unknown parameters β0, β1, ..., βr. The predictor variables may or may not enter the model as first-order terms.

With n independent observations on Y and the associated values of the z_i, the complete model becomes

    Y1 = β0 + β1 z11 + β2 z12 + ··· + βr z1r + ε1
    Y2 = β0 + β1 z21 + β2 z22 + ··· + βr z2r + ε2
     ⋮
    Yn = β0 + β1 zn1 + β2 zn2 + ··· + βr znr + εn                       (7-1)

where the error terms are assumed to have the following properties:
    1. E(εj) = 0;
    2. Var(εj) = σ² (constant); and
    3. Cov(εj, εk) = 0,  j ≠ k.                                         (7-2)

In matrix notation, (7-1) becomes

    [ Y1 ]   [ 1  z11  z12  ···  z1r ] [ β0 ]   [ ε1 ]
    [ Y2 ] = [ 1  z21  z22  ···  z2r ] [ β1 ] + [ ε2 ]
    [  ⋮ ]   [ ⋮    ⋮    ⋮          ⋮ ] [  ⋮ ]   [  ⋮ ]
    [ Yn ]   [ 1  zn1  zn2  ···  znr ] [ βr ]   [ εn ]
or

    Y = Z β + ε

and the specifications in (7-2) become:
    1. E(ε) = 0; and
    2. Cov(ε) = E(εε') = σ² I.

Note that a one in the first column of the design matrix Z is the multiplier of the constant term β0. It is customary to introduce the artificial variable zj0 = 1, so that

    β0 + β1 zj1 + ··· + βr zjr = β0 zj0 + β1 zj1 + ··· + βr zjr

Each column of Z consists of the n values of the corresponding predictor variable, while the jth row of Z contains the values for all predictor variables on the jth trial.

CLASSICAL LINEAR REGRESSION MODEL

    Y = Z β + ε,     E(ε) = 0,     Cov(ε) = σ² I                        (7-3)

with Y of dimension n × 1, Z of dimension n × (r + 1), β of dimension (r + 1) × 1, and ε of dimension n × 1, where β and σ² are unknown parameters and the design matrix Z has jth row [zj0, zj1, ..., zjr].
(Fitting a straight-line regression model)
Determine the linear regression model for fitting a straight line Mean response = E(Y) = /30 + {31 z 1 to the data 0 1 2 3 4 1 4 3 8 9
380
Chap.
7
Multivariate Linear Regression Models
Before the responses Y = [ , , , Y5 ] ' are observed, the errors e = [e1 , e2 , , e5 ] ' are random,Y1andY2 we can write Y = ZfJ + e where • • •
• • .
The data for this model are contained in the observed response vector y and the design matrix Z, where 1 1 0 4 1 1 y= 3 , Z= 1 2 8 1 3 9 1 4 Note that we can handle a quadratic expression for the mean response by introducing the term f32 z2 , with z2 = z f . The linear regression model for the jth trial in this latter case is or Yj
•
= f3o + f3t zj l + f3z zll + ej
Example 7.2 (The design matrix for one-way ANOVA as a regression model)
Determine the design matrix if the linear regression model is applied to the one-way ANOVA situation in Example 6.6. We create so-called dummy variables to handle the three population means: = + T1 , z = + T2 , and /-L3 = + T3 We set 1 the observation is if the observation is from population 1 z2 = from population 2 0 otherwise otherwise if the observation is from population 3 otherwise 1-L t
1-L
1-L
1-L
{
1-L
if
•
Sec.
7.3
Least Squares Estimation
381
= 1, 2, ... ' 8 where we arrange the observations from the three populations in sequence. Thus, we obtain the observed response vector and design matrix 1 1 0 0 9 1 1 0 0 6 1 1 0 0 9 0 z 1 0 1 0 y 1 0 1 0 2 1 0 0 1 3 1 0 0 1 1 1 0 0 1 2 The construction of dummy variables, as in Example 7.2, allows the whole of analysis of variance to be treated within the multiple linear regression framework. j
( 8 X 4)
(8 X I )
•
7.3 LEAST SQUARES ESTIMATION
One of the objectives of regression analysis is to develop an equation that will allow the investigator to predict the response for given values of the predictor variables. Thus, it is necessary to "fit" the model in (7-3) to the observed yj corresponding to the known values 1, zj l , . . . , Zjr · That is, we must determine the values for the regression coefficients f3 and the error variance 0'2 consistent with the available data. Let b be trial values for {3. Consider the difference yj - b0 - b 1 zj l - . . · - b,zj, between the observed response yj and the value b0 + b 1 Zj 1 + · · · + b,zj, that would be expected if b were the "true" parameter vector. Typically, the differences yj - b0 - b 1 Zj - · · · - b, zj , will not be zero, because the response fluctuates (in a manner characterized by the error term assumptions) about its expected value. The method of least squares selects b so as to minimize the sum of the squares of the differences: (7-4) S (b) = 2: (yj - b0 - b 1 zj 1 - . . . - b,zj ,) 2 j= 1 = (y - Zb) (y - Zb) The coefficients b chosen by the least squares criterion are called least squares esti mates of the regression parameters {3. They will henceforth be denoted by p to emphasize their role as estimates of {3. 1
It
I
382
Chap. 7 Multivariate Linear Regression Models
The coefficients PA are consisten1 with Jhe data in theAsense that they produce estimated (fitted) mean responses, {30 + {3 z + + zjr • the sum of whose squares of the differences from the observed y1j isj 1 as small asf3rpossible. The deviations j = 1, 2, . . . , n (7-5) are called residuals. The vector of residuals e = y - zp contains the information about the remaining unknown parameter rr2• (See Result 7.2. ) n. 1 The least squares estimate of Result 7. 1 . Let Z have full rank r + 1 p in (7-3) is given by ···
A
:,;;;
p = ( Z' Z ) - 1 Z' y
Let = z = Hy denote the fitted values of y, where H = Z ( Z' Z) - 1 Z' is called "hat"y matrix.p Then the residuals e=y-
y
= [I - Z ( Z' Z ) -1 Z'] y = ( I - H ) y
satisfy Z' e = 0 and Y' e = 0. Also, the fl
A A
residual sum ofsquares = 2: ( yj - f3 o - {3 1 Zj l =I j
···
A
- f3rZjr ) 2 = e' e
= y' [I - Z ( Z' Z ) - 1 Z'] y = y' y - y' ZP Proof. p = ( Z' Z ) - 1 Z' y e = y - y = y - zp = 1 [I - Z ( Z' Z ) Z'] y. [I - Z ( Z' Z ) - 1 Z']
Let
as asserted. Then satisfies: 1. [I - Z ( Z' Z ) - 1 Z']' = [I - Z ( Z' Z ) - 1 Z'] (symmetric) ; 2. [I - Z ( Z' Z ) -1 Z'] [I - Z ( Z' Z ) - 1 Z'] = I - 2Z ( Z' Z ) - 1 Z' + Z ( Z' Z ) - 1 Z' Z ( Z' Z ) - 1 Z' = [I - Z ( Z' Z )- 1 Z'] (idempotent) ; 3.
The matrix
(7-6)
Z' [I - Z ( Z' Z ) -1 Z' ] = Z' - Z' = 0.
Consequently, Z'e = Z' (y - y) = Z' [I - Z ( Z' Z ) - 1 Z'] y = 0, so Y' e = P ' Z' e = 0. Additionally, e' e = y' [I - Z ( Z' Z ) - 1 Z'] [I - Z ( Z' Z ) -1 Z'] y = y' [I - Z ( Z' Z ) - 1 Z'] y = y' y - y' Zp . To verify the expression for p , we write cise
1
If z is not full rank, (Z' z) - 1 is replaced by (Z' Z ) - , a generalized inverse of Z' Z. (See Exer
7.6.)
Sec.
7.3
Least Squares Estimation
y - Z b = y - z p + z p - Zb = y - z p + z ( p - b )
so
S ( b ) = ( y - Zb ) ' (y - Zb)
383
+ ( P - b) ' Z' Z ( P - b )
(y - ZP ) ' (y - Z P )
+ 2 (y - ZP ) ' Z ( P - b) = (y - z p) ' ( y - z p) + ( P - b) ' Z' Z ( P - b ) since (y - zp) ' Z = e' Z = 0' . The first teqn in S (b) does not depend on b and the second is th� squared length of Z ( f1 - b) . Because Z has full rank, Z ( f1 - b ) * 0 if f1 * b , so the minimum sum of squares is unique and occurs for b = p = ( Z' Z ) - 1 Z'y. Note that ( Z' Z ) - 1 exists since Z'Z has rank + 1 n. (If Z'Z is not of full rank, Z' Za = 0 for some a * 0, but then a' Z' Za = 0 or Za = 0, which contradicts Z having full rank + 1.) Result 7.1 shows how the least squares estimates p and the residuals e can be obtained from the design matrix Z and responses y by simple matrix operations. A
r
r
Exatnple 7.3
(Calculating the least squares estimates, the residuals, and the residual sum of sq uares)
:s;;
•
Calculate the least square estimates /3, the residuals e, and the residual sum of squares for a straight-line model Yi = f3o + {31 zi l + si fit to the data 0 1 2 3 4 :� I 1 4 3 8 9 We have A
Z'
y
1 4 3 8 9
Z'Z
( Z'Z ) - 1
Z'y
384
Chap.
7
M ultivariate Linear Regression Models
Consequently, and the fitted equation is y = 1 + 2z
The vector of fitted (predicted) values is Y
so
=
A
zp
1 0 1 1 = 1 2 1 3 1 4 1
e=y-y=
The residual sum of squares is
4 3 8 9
[�] 1
3 5 7 9
1
3 5 7 9 0 1 -2 1 0
0 1 e' e = [o 1 - 2 1 o] - 2 = 02 + 1 2 + ( - 2) 2 + 1 2 + 02 = 6 1 0
•
Sum-of-Squares Decomposition
According to Result y' y = 2: Yl satisfies j= 1 II
7.1,
Y' e
= 0,
so the total response sum of squares
y' y = (y + y - y) ' (y + y - y) = (y + e) ' (y + e) = Y'Y + e' e (7-7) Since the first column of Z is 1, the condition Z' e = 0 includes the requirement 0 = 1' e = 2: ej = 2: yj - 2: yj , or y = y. Subtracting ny 2 = n (.Y ) 2 from both ! =1 j= 1 sides of the jdecomposition inj =(7-7), we obtain the basic decomposition of the sum n
n
of squares about the mean:
n
Sec.
y ' y - ny 2
or
(
=
7.3
Least Squares Estimation
385
n <.Y ? + e' e
Y'Y -
)( )
(7-8) + � if ) l j= l = j j= I total sum regression ( residual (error) ) sum of + sum of squares of squares about mean squares The preceding sum of squares decomposition suggests that the quality of the mod els fit can be measured by the coefficient of determination � II
)
( yj - y 2
=
� n
ll
( yj - y 2
=
� ( yj n
"
" '2
-
ei j= l
� ( yj "
---�--�
)
y 2
j= l
- y )2
j=l
.L..J
(7-9)
The quantity R 2 gives the proportion of the total variation in the 2 Y/S "explained" by, or attributable to, the predictor variables z , . Here R (or the multiple correlation coefficient R + VJi2) equalsz 11, zif2 ,the. . . , fitted equation passes t,_hrough all thAe dataA points, so tpat ej 0 for all j. At the other extreme, R 2 is 0 if {30 y and {3 1 {3 2 {3 , 0. In this case, the predictor variables z1 , z 2 , , z , have no influence on the response. =
=
=
=
=
. . .
=
=
• • •
Geometry of Least Squares
A geometrical interpretation of the least squares technique highlights the nature of the concept. According to the classical linear regression model, Mean response vector
=
E (Y)
=
ZfJ
=
f3o
l1 ] l l l l �
1
+
[3
Zz
1
�1 1 + . . . +
z, l
{3 ,
z , z , 1
�
4r
Thus, E (Y) is a linear combiriation of the columns of Z. As fJ varies, ZfJ spans the model plane of all linear combinations. Usually, the observation vector y will not lie in the model plane, because of the random error e; that is, y is not (exactly) a linear combination of the columns of Z. Recall that y ZfJ + vector ( error ) ( response ) in model vector vector plane
( )
386
Chap.
7
M ultivariate Linear Regression Models
Once the observations become available, the least squares solution is derived from the deviation vector y - Zb = (observation vector) - (vector in model plane) The squared length (y - Zb) ' (y - Zb) is the sum of squares S (b) . As illustrated in Figure 7.1, S (b) is as small as possible when b is selected such that Zb is the point in the model plane closest to y. This point occurs at the tip of the �erpen dicular projection of y on the plane. That is, for the choice b = {3, y = Z{J is the projection of y on the plane= consisting of all linear combinations of the columns of Z . The residual vector e y - y is perpendicular to that plane. This geometry holds even when Z is not of full rank. 3
Figure 7.1
Least squares as a projection for n = 3, r = 1 .
When Z has full rank, the projection operation is expressed analytically as multiplication by the matrix Z (Z' Z) - 1 Z' . To see this, we use the spectral decom position (2-16) to write where A 1 ;,: A z ;,: A r+ J > 0 are the eigenvalues of Z'Z and ep e 2 , . . . , e r+ I are the corresponding eigenvectors. If Z is of full rank, · · · ;,:
1 .! _!_ - e + e �+ (Z' Z) - 1 = A e 1 e � + _A _ e 2 e; + · · · + � A r+ I r I I z 1 Z. q; = Aj 1 12 Ze;, 1 ' 2 2 I 2 = 2 I k. I i / 1 / = k i= i 0 / e / A = e� ' A :A k k k q i q k = ' - Il' k- e i Z' Zek
Consider
ll j
of Then which is a linear combinationif of the columns That or if I
I
Sec.
Least Squares Estimation
7.3
387
is, the + 1 vectors q ; are mutually perpendicular and have unit length. Their linear combinations span the space of all linear combinations of the columns of Z. Moreover, r+l r+l ; ; Z (Z' Z) - 1 Z' = A;- 1 Ze;e z' = r
;
L i= l
Lqq i= l
According to Result 2A.2 and Definition 2A.12, the projection of y) on a (r+l r+l linear combination of { q 1 , q 2 , . . . , q + l is � ( q ; y ) q = � q ; q ; Y = Z (Z' Z) - 1 Z' y = z{J . Thus, multiplication by Z (Z' Z) - 1 Z' projects a vector onto the space spanned by the columns of Z.2 1 Similarly, [I - Z (Z' Z) Z'] is the matrix for the projection of y on the plane perpendicular to the plane spanned by the columns of Z. ,
;
t
Sampling Properties of Classical Least Squares Estimators
The least squares estimator p and the residuals e have the sampling properties detailed in the next result Result 7.2. Under the general linear regression model in (7-3), the least squares estimator {J = (Z' Z) - 1 Z' Y has E ( P ) = {3 and Cov( p ) = a.2 (Z' Z) - 1 The residuals e have the properties E ( e) = 0 and Cov( e) = 0"2 [I - Z (Z' Z)- 1 Z'] 0" 2 [I - H] Also, E (i ' e) (n 1)0'2, so defining e ' i - Y' [I - Z (Z' z) - 1 Z'] Y Y' [I - H] Y s 2 = ---n ( + 1) n 1 n 1 we have - r -
-
r
r -
r -
Moreover, {3 and i are uncorrelated. "
A1
2 If Z is not of full rank, we can use the generalized inverse (Z' z)-
;;;.
Az
;;;.
· · ·
Z (Z' Z ) - z' =
;;;.
A,, + 1
r1 +
1
L
i= l
>
0
=
A,, + 2
=
· · ·
=
A,+ 1 ,
as
described
in
=
+I
L A;- 1 e ; e ; ,
r1
i= 1
Exercise
7.6.
where Then
q;q; has rank r1 + 1 and generates the unique projection of y on the space spanned
by the linearly independent columns of Z. This is true for any choice of the generalized inverse. (See [19].)
388
Chap.
7
M u ltiva riate Linear Reg ression Models
Proof. Before the response Y = Z{J + e is observed, it is a random vector. Now, p = (Z'Z)- 1 Z'Y = (Z'Z) - 1 Z' (Z{J + e) = fJ + (Z'Z) - 1 Z'e e = [I - Z(Z'Z) - 1 Z']Y (7-10) = [I - Z(Z'Z) - 1 Z'] [Z{J + e] = [I - Z(Z'Z) - 1 Z'] e since [I - Z(Z'Z) - 1 Z']Z = Z - Z = 0. From (2-24) and (2-45), E( p ) = fJ + (Z'Z) - 1 Z' E(e) = fJ Cov( p ) = (Z'Z) - 1 Z'Cov(e)Z(Z'Z) - 1 = u2 (Z'Z) - 1 Z'Z(Z'Z) - 1 = u2 (Z'Z) - 1 E(e) = [I - Z(Z'Z) - 1 Z']E(e) = 0 Cov(e) = [I - Z(Z'Z) - 1 Z']Cov(e)[I - Z(Z'Z)- 1 Z']' = u2 [I - Z (Z' Z) - I Z'] where the last equality follows from (7-6). Also, Cov( p, e) = E[( p - {J)e'] = (Z'Z) - 1 Z' E(ee')[I - Z(Z'Z) - 1 Z'] = u2 (Z'Z) - 1 Z'[I - Z(Z'Z) - 1 Z'] = 0 because Z' [I - Z(Z'Z) - 1 Z'] = 0. From (7-10), (7-6), and Result 4. 9 , e ' e = e'[I - Z(Z'Z) - 1 Z'][I - Z(Z'Z) - 1 Z'] e = e' [I - Z(Z'Z) - 1 Z'] e = tr[e' (I - Z(Z'Z)- 1 Z') e] = tr([I - Z(Z'Z)- 1 Z'] ee') Now, for an arbitrary n n random matrix W, E(tr(W)) = E( W11 + W22 + · · · + Wnn ) = E( W11 ) + E( W22 ) + · · · + E( Wnn ) = tr[E(W)] Thus, using Result 2A.12, we obtain X
Sec. 7 . 3
Least Squares Esti mation
389
E(e' e) = tr ( [I - Z (Z' Z)- 1 Z' ] E (ee' ) ) u2 tr [I - Z (Z' Z) - 1 Z' ] u2 tr (I) - u2 tr [Z (Z' Z) - l Z' ] = u2 n - u2 tr[(Z' Z) - 1 Z' Z ] I nu 2 -
= =
u2 tr [ (r+ l) X (r+ l ) J
=
= u2 (n - r - 1) and the result for s2 = e ' e/(n r - 1) follows.
•
-
The least squares estimator [3 possesses a minimum variance property that was first established by Gauss. The following result concerns "best" estimators of linear parametric functions of the form c' f3 = c0{30 + c1 {31 + + c, {3, for any c. Result 7.3 (Gauss' 3 least squares theorem). Let Y = Z/3 + e, where E (e) = 0, Cov (e) = u2 I, and Z has full rank + 1. For any c, the estimator c' f3 = cof3o + c1 {3 1 + . . . + c,{3, of c' f3 has the smallest possible variance among all linear estimator of the form ···
r
A
A
A
A
that are unbiased for c' {3. Proof. For any fixed c, let a'Y be any unbiased estimator of c' {3. Then c' {3, whatever the value of {3. Also, by assumption, E(a'Y) E (a' Y) = E (a'Z/3 + a' e) = a'Z/3. Equating the two expected value expressions yields a'Z/3 = c' f3 or ( c' - a'Z) f3 = 0 for all {3, including the choice f3 = (c' - a'Z) ' . This implies that c' = a' Z for any unbiased estimator. Now, c' [3 = c' (Z' Z)-1 Z' Y = a*' Y with a* = Z (Z' z)- 1 c. Moreover, from Result 7.2 E ( [3 ) = {3, so c' [3 = a*' Y is an unbiased estimator of c' {3. Thus, for any a satisfying the unbiased requirement c' = a'Z, Var (a' Y) = Var (a' Z/3 + a' e) = Var (a' e) = a'Iu 2 a = =
u2 (a - a* + a*) ' (a - a* + a*) u2 [(a - a*) ' (a - a*) + a * ' a *]
3 Much later, Markov proved a less general result, which misled many writers into attaching his name to this theorem.
390
Chap.
7
M ultivariate Linear Regression Models
since (a - a*)'a* = (a - a*)'Z(Z'Z) - 1 c = O from the condition (a - a*)'Z = a'Z - a*'Z = c' - c' = O'. Because a* isfixed and (a - a*)' (a - a*) ispositi�e unless a = a*, Var(a'Y) is minimized by the choice a*'Y = c' (Z' z) - 1 Z' Y = c' {J. •
This powerful result states that substitution of [3 for {J leads to the best esti mator of c' {J for any c of interest. In statistical terminology, the estimator c' {J is called the best (minimum-variance) linear unbiased estimator (BLUE) of c' {J. 7.4 I N FERENCES ABOUT THE REGRESSION MODEL
We describe inferential procedures based on the classical linear regression model in (7-3) with the additional (tentative) assumption that the errors e have a normal distribution. Methods for checking the general adequacy of the model are consid ered in Section 7.6. I nferences concerning the Regression Parameters
Before we can assess the importance of particular variables in the regression function (7-11)
we must determine the sampling distributions of {J and the residual sum of squares, To do we shall assume that the errors e have a normal distribution. Result 7.4. Let Y = Z{J + e, where Z has full rank r + 1 and e is distrib uted as Nn (0, u2 1) . Then,.. the maximum likelihood estimator of {J is the same as the least squares estimator {J. Moreover, is distributed as and is distributed independently of the residuals = Y - zp . Further, is distributed as where a2 is the maximum likelihood estimator of u2• Proof. Given the data and the normal assumption for the errors, the likeli hood function for {J, u2 is n = jII= ! yl2;1 e - ej 2u e I e.
"
SO,
e
(]'
'/
'
=
Sec.
7.4
Inferences About the Regression Model
391
For a fixed value a2, the likelihood is maximized by minimizing (y - Z/3) ' (y - ZfJ). But this minimization yields the least squares estimate j1 = (Z' Z) - 1 Z'y, which does not depend upon a2 • Therefore, under the normal assumption, the maximum likelihood al:)d least squares approaches provide the same estimator fJ.A Next, maximizing L ( fJ, a2 ) over a2 [see (4-18)] gives A (y - ZP ) ' (y - ZP ) (7-12) L ( fJ, a�z ) - (27r /21( £T z ) " /2 - n /2 wh ere a�z n t From (7-10), we can express [1 and i as linear combinations of the normal vari ables e. Specifically, _
_
e
[-�] � [/��z\�.��ii�· �;] � [-�] k=(�(i.� ��;z; ] • � a Because is fixed, Result implies the joint normality of and Their mean vec tors and covariance matrices were obtained in Result Again, using we get eov ( [-� ]) � ov � o-+( z :)��f -1 - :: -zc�; z)-:;z.-j + Ae
+
Z
4.3
[1
7.2.
i.
(7-6),
A c ( e) A '
Since Cov ( [1, i ) = 0 for the normal random vectors p and i, these vectors are independent. (See Result 4.5.) Next, let (A, e) be any-eigenvalue-eigenvector pair for I - Z (Z' Z) - 1 Z'. 1 Then, by (7-6), [I - Z (Z' Z) 1 Z'f = [I - Z (Z' Z) - Z'] so
Ae = [I - Z (Z' Z) - 1 Z']e = [I - Z (Z' Z) - 1 Z'fe = A [I - Z (Z' Z) - 1 Z']e = A 2 e That is, A = 0 or 1. Now, tr [I - Z (Z' Z) - 1 Z'] = -n1 - r - 1 (see the proof of Result 7.2), and from Result 4.9, tr [I - Z (Z' Z) Z'] = A + A + · · · + A11, where A 1 ;;;. A z ;;;. · · · ;;;. A11 are the eigenvalues of [I - Z (Z'1 Z) - z1 Z'.] Conse quently, exactly n - r - 1 values of Ai equal one, and the rest are zero. It then
follows from the spectral decomposition that
(7-13)
where e , e , . . . , e r are the normalized eigenvectors associated with the eigen values A11 =2 A z = n - -=I An - r - I = 1. Let ···
V=
----------·
e n, - r - 1
e
392
Chap.
7
Multivariate Linear Regression Models
Then V is normal with mean vector 0 and
That is, the V; are independent N(O, u 2 ) and by (7-10),
n u 2 = e' e = e ' [l - Z ( Z' Z ) -1 Z'] e = V12 + V22 + . . .
lS. d'lStn'b ute d lT 2 Xn2 - r - 1 '
+ Vn2 r - 1 -
•
A confidence ellipsoid for P is easily constructed. It is expressed in terms of the estimated covariance matrix s 2 ( Z' Z )- 1 , where s 2 = e ' e/(n - r - 1 ) .
=
Resu lt 7.5. Let Y zp + e, where Z has full rank r + 1 and N11 (0, u21 ) . Then a 100 (1 - a)% confidence region for p is given by
e is
( fJ - P ) ' Z' Z ( /J - P ) :;;;; (r + 1 ) s 2 Fr + 1 , n - r - 1 (a)
where F,+ 1 , 11 _ ,_ 1 (a) is the upper (100a)th percentile of an F-distribution with
r + 1 and n - r - 1 d.f. Also, simultaneous 100 (1 - a)% confidence intervals for the /3; are given by � i ± VVar ( �; ) V (r + 1 ) Fr + 1, n - r- 1 (a) , i = 0, 1, . . . , r
where Ya; (�;) is the diagonal element of s 2 ( Z' Z ) -1 corresponding to � ; ·
Consip er the symmetric square-root matrix ( Z' Z ) 1 12 • [See (2-22) .] ( Z' Z ) 1 12 ( /J - /J ) and note that E (V) = 0,
Proof.
Set V
=
Cov (V)
= ( Z' Z ) 1 12Cov ( p ) ( Z' Z ) 1 12 = u2 ( Z' Z ) 1 12 ( Z' Z ) -1 ( Z' Z ) 1 12 = u2 1
and V is normally distributed, since it consists of linear combinations of the � ; s. Therefore, V' V = (p - /J ) ' ( Z' Z ) 1 12 ( Z' Z ) 1 12 (p - p) = (p - /J ) ' ( Z' Z ) (p - p) is distributed as u 2 x;+ t · By Result 7.4 (n - r - 1 ) s 2 = e' e is distributed as u 2 x�_ , _ 1 , independently of p and, hence, independently of V. Consequently, [ X;+ d (r + 1) ]/[x� - r - tf ( n - r - 1 ) ] = [ V ' V/ (r + 1) ] /s 2 has an F,+ 1 , n - r - 1 dis tribution, and the confidence ellipsoid for p follows. Projecting this ellipsoid for ( P - /J ) using Result 5A.1 with A -I = Z' Z/s 2 , c 2 = ( r + 1 ) F,+ 1 , 11 _ ,_ 1 (a), and u = [0, . . . , 0, 1, 0, . . . , O ] ' yields l /3; - �; I :;;;; V(r + 1 ) F,+ l n - r - 1 ( a) V \la;(�; ) , II where \la;(�;) is the diagonal element of s 2 ( Z' Z ) - 1 corresponding to � ; · The confidence ellipsoid is centered at the maximum likelihood estimate p, and its orientation and size are determined by the eigenvalues and eigenvectors of Z'Z. If an eigenvalue is nearly zero, the confidence ellipsoid will be very long in the direction of the corresponding eigenvector. '
A
Sec.
7.4
Inferences About the Regression Model
393
Practitioners often ignore the "simultaneous" confidence property of the interval estimates in Result 7.5. Instead, they replace ( r + 1 ) F, + I , n - r - 1 (a) with the one-at-a-time t value t11 _ , _ 1 (a/2) and use the intervals
(7-14) when searching for important predictor variables. Example 7.4 (Fitting a regression model to real-estate data)
The assessment data in Table 7.1 were gathered from 20 homes in a Milwau kee, Wisconsin, neighborhood. Fit the regression model Yi = f3o + /3 1 Zi l + /3zZjz + ei where z 1 = total dwelling size (in hundreds of square feet), z 2 = assessed value (in thousands of dollars), and Y = selling price (in thousands of dollars), to these data using the method of least squares. A computer calculation yields TABLE 7. 1
REAL-ESTATE DATA
zl Total dwelling size (100 ft 2)
15.31 15.20 16.25 14.33 14.57 17.33 14.48 14.91 15.25 13.89 15.18 14.44 14.87 18.63 15.20 25.76 19.05 15.37 18.06 16.35
Zz Assessed value
($1000) 57.3 63.8 65.4 57.0 63.8 63.2 60.2 57.7 56.4 55.6 62.6 63.4 60.2 67.2 57.1 89.6 68.6 60.1 66.3 65.8
y Selling price
($1000) 74.8 74.0 72.9 70.0 74.9 76.0 72.0 73.5 74.5 73.5 71.5 71.0 78.9 86.5 68.0 102.0 84.0 69.0 88.0 76.0
394
Chap.
7
[
Multivariate Linear Regression Models
5.1523 (Z' Z)-1 = .2544 .0512 - .1463 - .0172 .0067 and
p
= (Z' Z)- 1 Z'y =
[ ]
]
30.967 2.634 .045
Thus, the fitted equation is
y = 30.967 + 2.634z 1 + .045 z 2
(7 .88)
(.785)
(.285)
with s = 3.473. The numbers in parentheses are the estimated standard devi ations of the least squares coefficients. Also, R 2 = .834, indicating that the data exhibit a strong regression relationship. ( See Panel 7.1 on page 395, which contains the regression analysis of these data using the SAS statistical software package. ) If the residuals e pass the diagnostic checks described in Section 7.6, the fitted equation could be used to predict the selling price of another house in the neighborhood from its size and assessed value. We note that a 95% confidence interval for {3 2 [ see (7-14)] is given by
�2
±
t1 7 (. o25) 'V Var'< � 2 ) = . o45
±
2.1l o (.285)
or
(- .556, .647) Since the confidence interval includes {3 2 = 0, the variable z 2 might be
dropped from the regression model and the analysis repeated with the single predictor variable z 1 • Given dwelling size, assessed value seems to add little • to the prediction of selling price. Likelihood Ratio Tests for the Regression Parameters
Part of regression analysis is concerned with assessing the effects of particular pre dictor variables on the response variable. One null hypothesis of interest states that certain of the z; ' s do not influence the response Y. These predictors will be labeled Zq + l • Zq + 2 , . . . , z,. The statement that Zq + l • Zq + 2 , . . . , z, do not influence Y trans lates into the statistical hypothesis
f3q + 2 = . . . = {3, = 0 or H0 : fJ (2) = where /J(2) = [ f3q + l • /3q + 2 • . . . , /3, ] ' . H0 : f3q + J =
0
(7-15)
Sec. 7.4 Inferences About the Regression Model PAN EL 7. 1
395
SAS ANALYSI S FOR EXAM PLE 7.4 U S I N G P ROC REG .
title ' Regression Ana lysis'; d ata estate; i nfi l e 'D-1 .dat'; i n put z 1 z2 y; proc reg d ata estate; model y z 1 z2;
PROGRAM COMMANDS
=
=
Model: M O D E L 1 Dependent Variable: Y
OUTPUT
Ana lysis of Variance S o u rce Model E rro r C Tota l
Sum of Sq u a res 1 032 .87506 204.99494 1 237 .87000
DF 2 17 19
·.
Deep Mean c.v.
Mean Square 5 1 6. 43753 1 2.05853
F va l ue 42.828
j R-sq uare ,
�.47254 1
76.55000 4. 53630
Adj R-sq
Prob > F 0.0001
0.8 1 49
Pa ra m eter Est i mates Vari a b l e I NTERCEP z1 z2
DF
.-P -ara-:m -e..., te ..r" .,·
Estimate . �Q�9��fi66' 2;634400
·o:o4S184
Setting
z=
[
1
' 's1:a"ndard
. en'fir
T fo r HO: Para m eter 0 3.929 3.353 0. 1 58 =
·
'7�88220844 '
o:2861 a21.1 .
o17Ss59a72
Z1 ! Z n X ( q + l ) i n X ( r -z q )
]'
•.
fJ
=
l ] ((q + 1) X 1 ) fJ fJ (2 ) ((r - q)(l)x 1 )
-----------
+ e = Z 1 fJ(ll + Z 2 fJ(2) + e [-�pwJ ( 2) j
we can express the general linear model as
Y = ZfJ + e = [Z 1 j Z 2 ]
Prob > I T I 0.00 1 1 0.0038 0.8760
Under the null hypothesis H0 : fJ(2 ) = 0, Y = Z 1 fJ(l ) of H0 is based on the
+ e. The likelihood ratio test
396
Chap.
7
Multivariate Linear Regression Models
Extra sum of squares
=
SS res (Zd
-
SS res (Z)
(7-16)
Result 7.6. Let Z have full rank r + 1 and e be distributed as Nn (0, a.21 ) . The likelihood ratio test of H0: /3(2) = 0 is equival�nt to a test,_ of H0 based on the extra sum of squares in (7-16) and s 2 = (y - Z/3 ) ' (y - Z/3 )/(n r - 1 ) . In particular, the likelihood ratio test rejects H0 if -
where Fr - q, n - r - l (a) is the upper (100a)th percentile of an F-distribution with r q and n - r - 1 d.f. -
Proof.
Given the data and the normal assumption, the likelihood associated with the parameters f3 and u2 is
with the maximum occurring at jJ = (Z' Z)- 1 Z' y and u 2 = (y - ZP ) ' (y - ZP )/n. Under the restriction of the null hypothesis, = Z1/3 ( 1 ) + e and
Y
max L ( /3( 1 ) • u2 )
fJ(J )• u'
where the maximum occurs at /J(l)
Rejecting H0 : /3 (2)
=
1
e ( 27T)n f2 u� ln
- n/2
= (Z� Z1 ) - 1 Z1 y. Moreover,
= 0 for small values of the likelihood ratio
Sec.
7.4
Inferences About the Regression Model
397
is equivalent to rejecting H0 for large values of ( 8r - 8 2 ) I 82 or its scaled version,
n (8r - 82 )/(r - q ) n8 2/(n - r - 1 ) The F-ratio above has an F-distribution with r Result 7.1 1 with m = 1.)
-
q and n - r - 1 d.f. (See [18] or •
Comment. The likelihood ratio test is implemented as follows. To test whether all coefficients in a subset are zero, fit the model with and without the terms corresponding to these coefficients. The improvement in the residual sum of squares (the extra sum of squares) is compared to the residual sum of squares for the full model via the F-ratio. The same procedure applies even in analysis of vari ance situations where Z is not of full rank. 4 More generally, it is possible to formulate null hypotheses concerning r - q linear combinations of fJ of the form H0 : CfJ = A0• Let the ( r - q ) X ( r + 1 ) matrix C have full rank, let A0 0, and consider =
H0: Cβ = 0

(This null hypothesis reduces to the previous choice when C = [0 | I_{(r-q) x (r-q)}].)
Under the full model, Cβ̂ is distributed as N_{r-q}(Cβ, σ²C(Z'Z)^{-1}C'). We reject H0: Cβ = 0 at level α if 0 does not lie in the 100(1 - α)% confidence ellipsoid for Cβ. Equivalently, we reject H0: Cβ = 0 if

[ (Cβ̂)'(C(Z'Z)^{-1}C')^{-1}(Cβ̂) / (r - q) ] / s²  >  F_{r-q, n-r-1}(α)     (7-17)

where s² = (y - Zβ̂)'(y - Zβ̂)/(n - r - 1) and F_{r-q, n-r-1}(α) is the upper (100α)th percentile of an F-distribution with r - q and n - r - 1 d.f. The test in (7-17) is the likelihood ratio test, and the numerator in the F-ratio is the extra residual sum of squares incurred by fitting the model subject to the restriction that Cβ = 0. (See [21].)

⁴ In situations where Z is not of full rank, rank(Z) replaces r + 1 and rank(Z1) replaces q + 1 in Result 7.6.
The next example illustrates how unbalanced experimental designs are easily handled by the general theory just described.

Example 7.5 (Testing the importance of additional predictors using the extra sum-of-squares approach)
Male and female patrons rated the service in three establishments (locations) of a large restaurant chain. The service ratings were converted into an index. Table 7.2 contains the data for n = 18 customers. Each data point in the table is categorized according to location (1, 2, or 3) and gender (male = 0 and female = 1). This categorization has the format of a two-way table with unequal numbers of observations per cell. For instance, the combination of location 1 and male has 5 responses, while the combination of location 2 and female has 2 responses. Introducing three dummy variables to account for location and two dummy variables to account for gender, we can develop a regression model linking the service index Y to location, gender, and their "interaction" using the design matrix displayed following Table 7.2.
TABLE 7.2  RESTAURANT-SERVICE DATA

Location   Gender   Service (Y)
   1          0        15.2
   1          0        21.2
   1          0        27.3
   1          0        21.2
   1          0        21.2
   1          1        36.4
   1          1        92.4
   2          0        27.3
   2          0        15.2
   2          0         9.1
   2          0        18.2
   2          0        50.0
   2          1        44.0
   2          1        63.6
   3          0        15.2
   3          0        30.3
   3          1        36.4
   3          1        40.9
The design matrix Z is 18 x 12. Its distinct rows, with column groups labeled constant, location, gender, and interaction, are

  constant | location | gender | interaction
     1       1  0  0    1  0    1  0  0  0  0  0     (5 responses: location 1, male)
     1       1  0  0    0  1    0  1  0  0  0  0     (2 responses: location 1, female)
     1       0  1  0    1  0    0  0  1  0  0  0     (5 responses: location 2, male)
     1       0  1  0    0  1    0  0  0  1  0  0     (2 responses: location 2, female)
     1       0  0  1    1  0    0  0  0  0  1  0     (2 responses: location 3, male)
     1       0  0  1    0  1    0  0  0  0  0  1     (2 responses: location 3, female)

with each distinct row repeated the indicated number of times. The coefficient vector can be set out as

β = [β0, β1, β2, β3, τ1, τ2, γ11, γ12, γ21, γ22, γ31, γ32]'

where the βi's (i > 0) represent the effects of the locations on the determination of service, the τi's represent the effects of gender on the service index, and the γik's represent the location-gender interaction effects.

The design matrix Z is not of full rank. (For instance, column 1 equals the sum of columns 2-4 or columns 5-6.) In fact, rank(Z) = 6. For the complete model, results from a computer program give
SSres(Z) = 2977.4

and n - rank(Z) = 18 - 6 = 12. The model without the interaction terms has the design matrix Z1 consisting of the first six columns of Z. We find that

SSres(Z1) = 3419.1

with n - rank(Z1) = 18 - 4 = 14. To test H0: γ11 = γ12 = γ21 = γ22 = γ31 = γ32 = 0 (no location-gender interaction), we compute
F = [ (SSres(Z1) - SSres(Z)) / (6 - 4) ] / s² = [ (3419.1 - 2977.4)/2 ] / (2977.4/12) = 0.89

where s² = SSres(Z)/12. The F-ratio may be compared with an appropriate percentage point of an F-distribution with 2 and 12 d.f. This F-ratio is not significant for any reasonable significance level α. Consequently, we conclude that the service index does not depend upon any location-gender interaction, and these terms can be dropped from the model. (These calculations can also be checked directly from the raw data in Table 7.2, as in the sketch following this example.)

Using the extra sum-of-squares approach, we may verify that there is no difference between locations (no location effect), but that gender is significant; that is, males and females do not give the same ratings to service.

In analysis-of-variance situations where the cell counts are unequal, the variation in the response attributable to different predictor variables and their interactions cannot usually be separated into independent amounts. To evaluate the relative influences of the predictors on the response in this case, it is necessary to fit the model with and without the terms in question and compute the appropriate F-test statistics. •
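The NumPy sketch below (ours, not from the original text) rebuilds the dummy-variable design matrices of Example 7.5 from Table 7.2 and recomputes the residual sums of squares and the F-ratio. Because the design matrices are rank deficient, the fits are obtained with np.linalg.lstsq; the degrees of freedom use rank(Z) = 6 and rank(Z1) = 4, as in the example.

import numpy as np

# Restaurant-service data from Table 7.2: (location, gender, service index)
data = [(1,0,15.2),(1,0,21.2),(1,0,27.3),(1,0,21.2),(1,0,21.2),(1,1,36.4),(1,1,92.4),
        (2,0,27.3),(2,0,15.2),(2,0, 9.1),(2,0,18.2),(2,0,50.0),(2,1,44.0),(2,1,63.6),
        (3,0,15.2),(3,0,30.3),(3,1,36.4),(3,1,40.9)]
loc = np.array([d[0] for d in data])
sex = np.array([d[1] for d in data])
y   = np.array([d[2] for d in data])

const = np.ones(len(y))
L = np.column_stack([(loc == k).astype(float) for k in (1, 2, 3)])    # location dummies
G = np.column_stack([(sex == k).astype(float) for k in (0, 1)])       # gender dummies
inter = np.column_stack([L[:, i] * G[:, j] for i in range(3) for j in range(2)])  # interactions

Z1 = np.column_stack([const, L, G])     # additive model (rank 4)
Z  = np.column_stack([Z1, inter])       # full model with interactions (rank 6)

def ss_res(X):
    # lstsq tolerates the rank-deficient design matrices used in ANOVA-type models
    fit, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ fit
    return r @ r

ss_full, ss_red = ss_res(Z), ss_res(Z1)
F = ((ss_red - ss_full) / (6 - 4)) / (ss_full / (18 - 6))
print(round(ss_full, 1), round(ss_red, 1), round(F, 2))   # expect roughly 2977.4, 3419.1, 0.89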
7.5 INFERENCES FROM THE ESTIMATED REGRESSION FUNCTION
Once an investigator is satisfied with the fitted regression model, it can be used to solve two prediction problems. Let z0 = [1, z01, ..., z0r]' be selected values for the predictor variables. Then z0 and β̂ can be used (1) to estimate the regression function β0 + β1 z01 + ... + βr z0r at z0 and (2) to estimate the value of the response Y at z0.

Estimating the Regression Function at z0

Let Y0 denote the value of the response when the predictor variables have values z0 = [1, z01, ..., z0r]'. According to the model in (7-3), the expected value of Y0 is

E(Y0 | z0) = β0 + β1 z01 + ... + βr z0r = z0'β     (7-18)

Its least squares estimate is z0'β̂.
Result 7.7. For the linear regression model in (7-3), z0'β̂ is the unbiased linear estimator of E(Y0 | z0) with minimum variance, Var(z0'β̂) = z0'(Z'Z)^{-1}z0 σ². If the errors ε are normally distributed, then a 100(1 - α)% confidence interval for E(Y0 | z0) = z0'β is provided by

z0'β̂ ± t_{n-r-1}(α/2) s √( z0'(Z'Z)^{-1}z0 )
where t_{n-r-1}(α/2) is the upper 100(α/2)th percentile of a t-distribution with n - r - 1 d.f.
Proof. For a fixed z0, z0'β̂ is just a linear combination of the β̂i's, so Result 7.3 applies. Also, Var(z0'β̂) = z0' Cov(β̂) z0 = z0'(Z'Z)^{-1}z0 σ², since Cov(β̂) = σ²(Z'Z)^{-1} by Result 7.2. Under the further assumption that ε is normally distributed, Result 7.4 asserts that β̂ is N_{r+1}(β, σ²(Z'Z)^{-1}) independently of s²/σ², which is distributed as χ²_{n-r-1}/(n - r - 1). Consequently, the linear combination z0'β̂ is N(z0'β, σ² z0'(Z'Z)^{-1}z0) and

[ (z0'β̂ - z0'β) / √( σ² z0'(Z'Z)^{-1}z0 ) ] / √( s²/σ² )  =  (z0'β̂ - z0'β) / ( s √( z0'(Z'Z)^{-1}z0 ) )

is t_{n-r-1}. The confidence interval follows. •
Forecasting a New Observation at z0

Prediction of a new observation, such as Y0, at z0 = [1, z01, ..., z0r]' is more uncertain than estimating the expected value of Y0. According to the regression model of (7-3),

Y0 = β0 + β1 z01 + ... + βr z0r + ε0

or

(new response Y0) = (expected value of Y0 at z0) + (new error)

where ε0 is distributed as N(0, σ²) and is independent of ε and, hence, of β̂ and s². The errors ε influence the estimators β̂ and s² through the responses Y, but ε0 does not.
Result 7.8. Given the linear regression model of (7-3), a new observation Y0 has the unbiased predictor

z0'β̂ = β̂0 + β̂1 z01 + ... + β̂r z0r

The variance of the forecast error Y0 - z0'β̂ is

Var(Y0 - z0'β̂) = σ²(1 + z0'(Z'Z)^{-1}z0)

When the errors ε have a normal distribution, a 100(1 - α)% prediction interval for Y0 is given by

z0'β̂ ± t_{n-r-1}(α/2) √( s²(1 + z0'(Z'Z)^{-1}z0) )
where t_{n-r-1}(α/2) is the upper 100(α/2)th percentile of a t-distribution with n - r - 1 degrees of freedom.

Proof. We forecast Y0 by z0'β̂, which estimates E(Y0 | z0). By Result 7.7, z0'β̂ has E(z0'β̂) = z0'β and Var(z0'β̂) = z0'(Z'Z)^{-1}z0 σ². The forecast error is then Y0 - z0'β̂ = z0'β + ε0 - z0'β̂ = ε0 + z0'(β - β̂). Thus, E(Y0 - z0'β̂) = E(ε0) + E(z0'(β - β̂)) = 0, so the predictor is unbiased. Since ε0 and β̂ are independent,

Var(Y0 - z0'β̂) = Var(ε0) + Var(z0'β̂) = σ² + z0'(Z'Z)^{-1}z0 σ² = σ²(1 + z0'(Z'Z)^{-1}z0)

If it is further assumed that ε has a normal distribution, then β̂ is normally distributed, and so is the linear combination Y0 - z0'β̂. Consequently, (Y0 - z0'β̂)/√( σ²(1 + z0'(Z'Z)^{-1}z0) ) is distributed as N(0, 1). Dividing this ratio by √( s²/σ² ), which is distributed as √( χ²_{n-r-1}/(n - r - 1) ), we obtain

(Y0 - z0'β̂) / √( s²(1 + z0'(Z'Z)^{-1}z0) )

which is t_{n-r-1}. The prediction interval follows immediately. •

The prediction interval for Y0 is wider than the confidence interval for estimating the value of the regression function E(Y0 | z0) = z0'β. The additional uncertainty in forecasting Y0, which is represented by the extra term s² in the expression s²(1 + z0'(Z'Z)^{-1}z0), comes from the presence of the unknown error term ε0.

Example 7.6 (Interval estimates for a mean response and a future response)
Companies considering the purchase of a computer must first assess their future needs in order to determine the proper equipment. A computer scientist collected data from seven similar company sites so that a forecast equation of computer-hardware requirements for inventory management could be developed. The data are given in Table 7.3 for

z1 = customer orders (in thousands)
z2 = add-delete item count (in thousands)
Y  = CPU (central processing unit) time (in hours)

Construct a 95% confidence interval for the mean CPU time, E(Y0 | z0) = β0 + β1 z01 + β2 z02, at z0 = [1, 130, 7.5]'. Also, find a 95% prediction interval for a new facility's CPU requirement Y0 corresponding to the same z0.
TABLE 7.3  COMPUTER DATA

 z1 (Orders)   z2 (Add-delete items)   Y (CPU time)
    123.5             2.108               141.5
    146.1             9.213               168.9
    133.9             1.905               154.8
    128.5              .815               146.5
    151.5             1.061               172.8
    136.2             8.603               160.1
     92.0             1.125               108.5

Source: Data taken from H. P. Artis, Forecasting Computer Requirements: A Forecaster's Dilemma (Piscataway, NJ: Bell Laboratories, 1979).
A computer program provides the estimated regression function

ŷ = 8.42 + 1.08 z1 + .42 z2

(Z'Z)^{-1} = [ 8.17969   -.06411    .08831
               -.06411    .00052   -.00107
                .08831   -.00107    .01440 ]

and s = 1.204. Consequently,

z0'β̂ = 8.42 + 1.08(130) + .42(7.5) = 151.97

and s √( z0'(Z'Z)^{-1}z0 ) = 1.204(.58928) = .71. We have t4(.025) = 2.776, so the 95% confidence interval for the mean CPU time at z0 is

z0'β̂ ± t4(.025) s √( z0'(Z'Z)^{-1}z0 ) = 151.97 ± 2.776(.71)

or (150.00, 153.94). Since s √( 1 + z0'(Z'Z)^{-1}z0 ) = (1.204)(1.16071) = 1.40, a 95% prediction interval for the CPU time at a new facility with conditions z0 is

z0'β̂ ± t4(.025) s √( 1 + z0'(Z'Z)^{-1}z0 ) = 151.97 ± 2.776(1.40)

or (148.08, 155.86). •
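The calculations of Example 7.6 can be reproduced from the raw data of Table 7.3 with the short NumPy sketch below (ours, not from the original text; the t percentile is copied from a t-table rather than computed).

import numpy as np

# Computer data from Table 7.3: z1 (orders), z2 (add-delete items), Y (CPU time)
z1 = np.array([123.5, 146.1, 133.9, 128.5, 151.5, 136.2, 92.0])
z2 = np.array([2.108, 9.213, 1.905, 0.815, 1.061, 8.603, 1.125])
y  = np.array([141.5, 168.9, 154.8, 146.5, 172.8, 160.1, 108.5])

Z = np.column_stack([np.ones(7), z1, z2])
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
resid = y - Z @ beta_hat
n, r = 7, 2
s2 = resid @ resid / (n - r - 1)

z0 = np.array([1.0, 130.0, 7.5])
fit0 = z0 @ beta_hat                       # estimate of E(Y0 | z0), about 151.97
h0 = z0 @ np.linalg.solve(Z.T @ Z, z0)     # z0'(Z'Z)^{-1} z0
t = 2.776                                  # t_4(.025), from a t-table

ci = (fit0 - t * np.sqrt(s2 * h0),       fit0 + t * np.sqrt(s2 * h0))
pi = (fit0 - t * np.sqrt(s2 * (1 + h0)), fit0 + t * np.sqrt(s2 * (1 + h0)))
print(ci, pi)    # roughly (150.0, 153.9) and (148.1, 155.9), up to rounding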
7.6 MODEL CHECKING AND OTHER ASPECTS OF REGRESSION

Does the Model Fit?
Assuming that the model is "correct," we have used the estimated regression function to make inferences. Of course, it is imperative to examine the adequacy of the model before the estimated function becomes a permanent part of the decision-making apparatus. All the sample information on lack of fit is contained in the residuals

ε̂1 = y1 - β̂0 - β̂1 z11 - ... - β̂r z1r
ε̂2 = y2 - β̂0 - β̂1 z21 - ... - β̂r z2r
  ⋮
ε̂n = yn - β̂0 - β̂1 zn1 - ... - β̂r znr

or

ε̂ = [I - Z(Z'Z)^{-1}Z'] y = [I - H] y     (7-19)
If the model is valid, each residual ε̂j is an estimate of the error εj, which is assumed to be a normal random variable with mean zero and variance σ². Although the residuals ε̂ have expected value 0, their covariance matrix σ²[I - Z(Z'Z)^{-1}Z'] = σ²[I - H] is not diagonal. Residuals have unequal variances and nonzero correlations. Fortunately, the correlations are often small and the variances are nearly equal.

Because the residuals ε̂ have covariance matrix σ²[I - H], the variances of the ε̂j can vary greatly if the diagonal elements of H, the leverages hjj, are substantially different. Consequently, many statisticians prefer graphical diagnostics based on studentized residuals. Using the residual mean square s² as an estimate of σ², we have

Var̂(ε̂j) = (1 - hjj) s²,   j = 1, 2, ..., n     (7-20)

and the studentized residuals are

ε̂j* = ε̂j / √( s²(1 - hjj) ),   j = 1, 2, ..., n     (7-21)

We expect the studentized residuals to look, approximately, like independent drawings from an N(0, 1) distribution. Some software packages go one step further and studentize ε̂j using the delete-one estimated variance s²(j), which is the residual mean square when the jth observation is dropped from the analysis.

Residuals should be plotted in various ways to detect possible anomalies. For general diagnostic purposes, the following are useful graphs:
1. Plot the residuals ε̂j against the predicted values ŷj = β̂0 + β̂1 zj1 + ... + β̂r zjr. Departures from the assumptions of the model are typically indicated by two types of phenomena:
   (a) A dependence of the residuals on the predicted value. This is illustrated in Figure 7.2(a). The numerical calculations are incorrect, or a β0 term has been omitted from the model.
   (b) The variance is not constant. The pattern of residuals may be funnel shaped, as in Figure 7.2(b), so that there is large variability for large ŷ and small variability for small ŷ. If this is the case, the variance of the error is not constant, and transformations or a weighted least squares approach (or both) are required. (See Exercise 7.3.) In Figure 7.2(d), the residuals form a horizontal band. This is ideal and indicates equal variances and no dependence on ŷ.
2. Plot the residuals ε̂j against a predictor variable, such as z1, or products of predictor variables, such as z1² or z1 z2. A systematic pattern in these plots suggests the need for more terms in the model. This situation is illustrated in Figure 7.2(c).
3. Q-Q plots and histograms. Do the errors appear to be normally distributed? To answer this question, the residuals ε̂j or ε̂j* can be examined using the techniques discussed in Section 4.6. The Q-Q plots, histograms, and dot diagrams help to detect the presence of unusual observations or severe departures from normality that may require special attention in the analysis. If n is large, minor departures from normality will not greatly affect inferences about β.

Figure 7.2  Residual plots (panels (a)-(d)).

4. Plot the residuals versus time. The assumption of independence is crucial, but hard to check. If the data are naturally chronological, a plot of the residuals versus time may reveal a systematic pattern. (A plot of the positions of the residuals in space may also reveal associations among the errors.) For instance, residuals that increase over time indicate a strong positive dependence. A statistical test of independence can be constructed from the first autocorrelation,
2: ej ej - l
j=2
___
(7-22)
of residuals from adjacent periods. A popular test based on the statistic � (ej - ej _ 1 )2�� e} 2(1 - r1 ) is called the Durbin-Watson test. (See [12] for a description of this test and tables of critical values.) ==
Example 7.7
(Residual plots)
Three residual plots for the computer data discussed in Example 7.6 are shown in Figure 7.3 on page; 407. The sample size n = 7 is really too small to allow definitive judgments however, it appears as if the regression assump tions are tenable. If several observations of the response are available for the same values of the predictor variables, then a formal test for lack of fit can be carried out. (See [11] for a discussion of the pure-error lack-of-fit test. ) •
Leverage and Influence
Although a residual analysis is useful in assessing the fit of a model, departures from the regression model are often hidden by the fitting process. For example, there may be "outliers" in either the response or explanatory variables that can have a considerable effect on the analysis, yet are not easily detected from an examination of residual plots. In fact, these outliers may determine the fit. The leverage hjj is associated with the jth data point and measures, in the space of the explanatory variables, how far the jth observation is from the other n - 1 observations. For simple linear regression with one explanatory variable z,
Sec.
Model Checking and Other Aspects of Regression
7.6
407
(b )
(a)
(c) Figure 7.3
Residual plots for the computer data of Example 7.6.
h". .
= -n1 +
)2 The average leverage is ( + 1)/n. (See Exercise 7. 8 . ) For a data point with high leverage, h approaches 1 and the prediction at z is almost solely determined by yi , the rest ofii the data having little to say about thei matter. This follows because (change in yi ) = hii (change in yi ) , provided that other y values remain fixed. Observations that significantly affect inferences drawn from the data are said to be influential. Methods for assessing influence are typically based on the change in the vector of parameter estimates, {J, when observations are deleted. Plots based upon leverage and influence statistics and their use in diagnostic check ing of regression models are described in [2], [4], and [9]. These references are rec ommended for anyone involved in an analysis of regression models. If, after the diagnostic checks, no serious violations of the assumptions are detected, we can make inferences about fJ and the future Y values with some assurance that we will not be misled. n
� (zi -
j= l
r
A
z
408
Chap.
7
Multivariate Linear Regression Models
Additional Problems in Linear Regression
We shall briefly discuss several important aspects of regression that deserve and receive extensive treatments in texts devoted to regression analysis. (See [11], [19], and [8]). Selecting predictor variables from a large set. In practice, it is often dif ficult to formulate an appropriate regression fupction immediately. Which predic tor variables should be included? What form should the regression function take? When the list of possible predictor variables is very large, not all of the variables can be included in the regression function. Techniques and computer programs designed to select the "qest" subset of predictors are now readily avail able. The good ones try all subsets: z 1 alone, z2 alone, . . . , z 1 and z , The best choice is2 decided by examining some criterion quantity like R 2• [See2 . (. 7. -• 9).] How ever, R always increases with the inclusion of additional predictor variables. Although this2 problem can be circumvented b,y using the adj�sted R 2, R2 1 - (1 - R ) (n - 1)/(n r 1), a better statistic for selecting variables seems to be Mallow's CP statistic (see [10]), (residual sum of squares for subset model with p parameters, including an intercept) (n - 2p ) (residual variance for full model) A plot of the pairs (p, C ), one for each subset of predictors, will indicate models that forecast the observedP responses well. Good models typically have (p, CP ) coordinates near the 45° line. In Figure 7.4 on page 409, we have circled the point corresponding to the "best" subset predictor variables. If the list of predictor variables is very long, cost considerations limit the num ber of models that can be examined. Another approach, called stepwise regression (see [11 ]), attempts to select important predictors without considering all the pos sibilities. The procedure can be described by listing the basic steps (algorithm) involved in the computations: Step 1. All possible simple linear regressions are considered. The predictor vari able that explains the largest significant proportion of the variation in Y (the variable that has the largest correlation with the response) is the first variable to enter the regression function. . Step 2. The next variable to enter is the one (out of those not yet included) that makes the largest significant contribution to the regression sum of squares. The significance of the contribution is determined by an F-test. (See Result 7.6.) The value ofthe F-statistic that must be exceeded before the contri bution of a variable is deemed significant is often called the F to enter. Step 3. Once an additional variable has been included in the equation, the indi vidual contributions to the regression sum of squares of the other varicp
(=
--
)-
of
=
Sec.
• col
• (3)
(2) e
7.6
Model Checking and Other Aspects of Regression
409
• (2 , 3)
• ( 1 , 3)
Numbers in parentheses correspond to predicator variables
Figure 7.4 CP plot for computer data from Exam ple 7.6 with three predictor variables (z1 = orders, z2 = add-delete count, z3 = n u m ber of items; see the example and original source).
abies already in the equation are checked for significance using F-tests. If the F-statistic is less than the one (called the F to remove) corresponding to a prescribed significance level, the variable is deleted from the regres sion function. Step 4. Steps 2 and 3 are repeated until all possible additions are nonsignificant and all possible deletions are significant. At this point the selection stops. Because of the step-by-step procedure, there is no guarantee that this approach will select, for example, the best three variables for prediction. A second drawback is that the (automatic) selection methods are not capable of indicating when transformations of variables are useful. Colinearity. If Z is not of full rank, some linear combination, such as Za, must equal 0. In this situation, the columns are said to be colinear. This implies that Z'Z does not have an inverse. For most regression analyses, it is unlikely that Za 0 exactly. Yet, if linear combinations of the columns of Z exist that are nearly 0, the calculation of (Z' Z) - 1 is numerically unstable. Typically, the diagonal entries of (Z' Z) - 1 will be large. This yields large estimated variances for the ffi /s and it is then difficult to detect the "significant" regression coefficients /3; · The problems =
410
Chap.
7
Multivariate Linear Regression Models
caused by colinearity can be overcome somewhat by (1) deleting one of a pair of predictor variables that are strongly correlated or (2) relating the response Y to the principal components of the predictor variables-that is, the rows zj of Z are treated as a sample, and the first few principal components are calculated as is sub sequently described in Section 8.3. The response Y is then regressed on these new predictor variables. Bias caused by a misspecified model. Suppose some important predictor variables are omitted from the proposed regression model. That is, suppose the true model has Z = [Z1 i Z2 ] with rank r + 1 and (7-23)
= z1 {J(l l + Zz fJ(z) + e where E ( e) = 0 and Var(e) = u2 1. However, the investigator unknowingly fits a model using only the first predictors by minimizing the error sum of squares (Y - Z1 {J
(7-24)
That is, [3(1 ) is a biased estimator of {J(l ) unless the columns of Z1 are perpendicu lar to those of Z2 (that is, z; Z2 = 9) . If important variables are missing from the model, the least squares estimates {J(l ) may be misleading. 7.7 MULTIVARIATE MULTIPLE REGRESSION
In this section, we consider the problem of modeling the relationship between m responses Y1 , Y2 , . . . , Ym and a single set of predictor variables Z p z2 , . . . , z, . Each response is assumed to follow its own regression model, so that y1 Yz
f3o1 + /311 Z1 + . . . + f3r1 Z, + e 1 f3oz + f312Z1 + . . . + f3,zZr + ez
(7-25)
= f3om + f31mZ1 + . . . + f3rm Zr + em The error term e = [e 1 , e 2 , . . . , em ] ' has E ( e) = 0 and Var(e) = I. Thus, the error terms associated with different responses may be correlated. To establish notation conforming to the classical linear regression model, let [zj o • zj 1 , . . . , zj ,] denote the values of the predictor variables for the jth trial, let Ym
Sec.
Multivariate M ultiple Regression
7.7
l' "
41 1
= [Yi 1 , Yi 2 , • • notation, • , Yj m ] ' be the responses, and let ei = [ ei 1 , ei 2 , • • . , ei 111 ] ' be the the design matrix errors. In matrix
Yi
zl l
z:. o
Zz o Zz 1
z
(n X (r + I ))
z,, Zz ,.
Zn l . . . z
j
Il l'
J
is the same as that for the single-response regression model. [See (7-3) .] The other matrix quantities have multivariate counterparts. Set
l f3;Iolll f3;oz f3om f3 ll J : :J l:t� y
l l Y1 2 · · · Yl m Yz 1 Yzz · · · Yzm
y
(n X m )
nz . . . Y;, m
··· {3 1 1 {31 2 · · · 1 m . . . {3 ,. 1 {3,. z · · · f3 r m
..
f3
((r + l ) X m )
el l e1 2 • • • e z l Bzz · · ·
E
(n X m)
. .
.
e 1 111 S zm
e , l e nz · · · e , m
. . regressif!: .. . .Th ., �. ,mu.ltivq,.ria. �.f: Un."·e" . ar · IJ m.Qd.e.l... is .
�',, "
"
,
.: / . _
y
(n X m)
. •. ... .. ,e
z .
. . ··
.
. .•.
f3
(n X (r + l)) ((r + l) X m)
+
E (n x m)
i, k
l, :4, . �····• rr!:.
The m observatiolls on the jth trial have covariance matrix l {uik}, but observations from different trials are unGor.related. Here and uik �re . "UDk:nown parap:tet�fs; the �esi�? matrix Z ha� jth row [zlq• zip , Zir,].
(J
=
. . .
41 2
Chap.
7
Multivariate Linear Regression Models
Simply stated, the ith response Y(iJ follows the linear regression model i = 1, 2, . . . , m (7-27) with Cov eU) ) = uii I. However, the errors for different responses on the same trial can be( correlated. Given the outcomes Y and the values of the predictor variables Z with full column rank, we determine the least squares estimates Pul exclusively from the observations u on the ith response. In conformity with the single-response solu tion, we take Y J (7-28)
Collecting these univariate least squares estimates, we obtain or
{3 = [ P < 1J j p (2) j · · · l P <mJ l = ( Z' Z ) - 1 Z' [ Y< IJ j Y(2) j
···
j Y<mJl
(7-29)
For any choice of parameters B = [ (l) j b(2) j j b(m)] , the matrix of errors is Y - ZB. The error sum of squares andbcross-products matrix is (Y - ZB) ' (Y - ZB) ···
[
( Y< 1 l - Zb< l J ) ' ( Y<1 l - Zb < l J ) . ( Y(mJ - Zb (mJ ) ' ( Y( IJ - Zb < l J )
(
( Y,,, - Zb '" ) ' Y<m> - Zb <m> l ( Y(m ) - Zb(m ) ) ( Y(m ) - Zb (m ) )
]
(7-30)
The selection b(i ) = f1u> minimizes the ith diagonal sum of squares ( Y(i ) - Zb (i ) ) ' ( Y(i ) - Zb(i ) ) . Conse q_uently, tr[(Y - ZB) ' (Y - ZB)] is minimized by the choice B = {3. Also, the generalized variance I (Y - ZB) ' - ZB) I is minimized by the least squares estimates {J. (See Exercise 7 . 11 (Yfor an additional generalized sum of squares property. ) Using the least squares estimates {3, we can form the matrices of Predicted values: Y = z{J = Z ( Z' Z) - 1 Z' Y i = Y - Y = [I - Z ( Z' Z ) - 1 Z ' ] Y (7-31) Residuals: The amongregression, the residuals, predicted values,multiple and columns of Z,orthogonality which hold inconditions classical linear hold in multivariate regres sion. They follow from Z' [I - Z ( Z' Z) - 1 Z'] = Z' - Z' = 0. Specifically, A
Z' i = Z' [I - Z ( Z' Z ) - 1 Z' ] Y = 0
(7-32)
Sec.
7. 7
Multivariate Multiple Regression
41 3
so the residuals ecil are perpendicular to the columns of Z. Also, Y ' E:
=
P ' Z' [I - Z ( Z' Z ) - 1 Z' ] Y
=
(7-33)
0
confirming that the predicted values Ycil are perpendicular to all residual vectors e(k) · Because Y Y + E:, =
or
Y ' Y = (Y + E:) ' (Y + E:)
=
Y ' Y + E:' E: + 0 + 0'
( total sum of squares ) ( predicted sum of squares ) + and cross products and cross products Y' Y
+
(residualof squares(error)andsum) E:' E:
cross products
(7-34)
The residual sum of squares and cross products can also be written as E: ' E:
=
Y ' Y - Y' Y
=
Y ' Y - {J ' Z' Z P
(7-35)
To illustrate the calculations of {3, Y, and E:, we fit a straight-line regression model (see Panel 7.2 on page 415 for SAS output),
Example 7.8
(Fitting a multivariate straight-line regression model)
j = 1, 2, . . . ' 5 to two responses Y1 and Y2 using the data in Example 7.3. These data, aug mented by observations on an additional response, are: Yt
0 1 -1
1 4 -1
2 3 2
3
8
4 9 2
3 Y2 The design matrix Z remains unchanged from the single-response problem. We find that Z' =
[1
1 1 1 1 0 1 2 3 4
J
( Z' Z ) - t =
[ - .26 - .2.1 ] .
41 4
Chap.
7
Multivariate Linear Regression Models
and
-1 1 1 1 1 1 ] 1 Z Y = [ 0 1 2 3 4 2 3 I
2
so
From Example 7.3, Hence, The fitted values are generated from j\ = 1 + 2z1 and y2 = -1 + z2 . Collectively, Y = zp =
and
1 0 1 1 1 2 [ � -� ] 1 3 1 4
1 -1 3 0 5 1 7 9
2
3
OJ
-1 [� 1 1 -1 Note that 1 -1 h/ 1 - 2 1 � J 53 01 y = [ � -1 1 1 [� �] 7 2
e=Y-Y=
1
2 1
E
-
9
3
Sec. 7 . 7
M u ltiva riate M u ltiple Reg ression
41 5
Since Y ' Y = [ -11 Y' Y =
[165 45
1 -1 4 3 8 43 -12 [ 171 43] -1 2 3 �] 8 3 43 19 9 2 45] and = [ 6 -2] -2 4 15 E' E
the sum of squares and cross products decomposition Y ' Y = Y' Y + is easily verified. e' e
•
PAN EL 7.2 SAS ANALYSI S FOR EXAM PLE 7.8 U S I N G PROC. G L M . title ' M u ltiva riate Regression Ana lysis'; data m ra; i nfi le 'E7-8.d at'; i n put y 1 y2 z 1 ; proc g l m data m ra; m od el y 1 y2 z 1 /ss3; m a n ova h z 1 /pri nte;
PROGRAM COMMANDS
=
=
=
I·
D.ep.endentVa ria ?i e:
S o u rce Model E rror Co rrected Total
S o u rce Z1
INTERCEPT ) ,,
OUTPUT
DF 1 3 4
Sum of Sq u a res 40.00000000 6.00000000 46.00000000
Mean S q u a re 40.00000000 2.00000000
R-Sq u a re 0.869565
28.28427
c.v.
Root M S E 1 .4 1 42 1 4
Type I l l SS 40.00000000
Mean Sq u a re 40.00000000
DF
Parameter Z1
YtJ
Genera l Linear Models Proced u re
' - :·;�,;
Estimate
1.000000000
2.ooooooooo'
T for HO: Para m eter 0 0.91 4.47 =
F Va l u e 20.00
Pr > F 0.0208
Y 1 Mean 5.00000000 F Va l u e 20.00 Pr > I T I 0.4286 0.0208
Pr > F 0.0208 Std E rror of Esti m ate 1 .095445 1 2 0.4472 1 360
(continued)
416
Chap.
7
Multivariate Linear Regression Models PAN EL 7.2 (continued)
I [)ependent
Variable:
Y21
OF 1 3 4
S u m of S q u a res 1 0.00000000 4.00000000 1 4.00000000
Mean Sq u a re 1 0.00000000 1 .33333333
R-Sq u a re 0.71 4286
1 1 5.4701
c.v.
Root MSE 1 . 1 5470 1
OF 1
Type Ill SS 1 0.00000000
Mean S q u a re 1 0.00000000
S o u rce Model E rror Corrected Tota l
S o u rce Z1
T for HO: Para m eter = 0 -1 . 1 2 2.74
j
EstimatJ -1 .000000000'
Para meter
I NTE RCEPT
1.000000000
Z1
I�
=
Error SS
$
V1
CP Matrix
I
Pr > F 0.07 1 4
F Va l u e 7.50
V2 M e a n 1 .00000000 Pr > F 0.07 1 4
F Va l u e 7 .50 Pr > ITI 0.3450 0.07 1 4
Std E rro r of Est i m ate 0.894427 1 9 0.3651 4837
Y2
Y1 Y2 Manova Test Criteria a n d Exact F Statistics fo r the Hypothesis of no Overa l l Z1 Effect E = E rror SS&CP Matrix H = Type I l l SS&CP Matrix fo r Z 1 N = O M = O S = 1 Statistic W i l ks' Lambda P i l l ai's Trace H ote l l i ng-Lawley Trace Roy's G reatest Root
Va l u e 0.06250000 0.93750000 1 5.00000000 1 5.00000000
F 1 5.0000 1 5.0000 1 5.0000 1 5.0000
N u m OF 2 2 2 2
Pr > F 0 .0625 0.0625 0.0625 0.0625
Den O F 2 2 2 2
For the least squares estimator {J [p ! P(z) ! · ! P(m) J determined under the multivariate multiple regression model(l)(7-26) with full rank (Z) = r + 1 < n, ..
Result 7.9.
and
i, k = 1, 2, . . . , m
Sec.
7. 7
Multivariate Multiple Regression
41 7
The residuals = [ e<1 l [ e<2l [ · · · [ e(ml] = Y - Z/3 satisfy E ( e(i J ) = 0 and E (e(; J e(kJ ) = (n - r - 1)u;k • so E ( £ ) = 0 and E ( n - 1r - 1 £ £ ) = Also, and fj are uncorrelated. Proof. The ith response follows the multiple regression model Y(i J = Z {J(i ) + e(i ) , E ( e(i ) ) = 0, and E ( e(i ) e(i ) ) = O";; I Also, as in (7-10), Pu l - PuJ = (Z'Z)- 1 Z'Y<; J - fJ
A
A
Af A
� �
i
so E (fJ
i' i
A
i.
418
Chap.
7
M ultivariate Linear Regression Models
We first consider the problem of estimating the mean vector when the pre dictor variables have the values z0 = [ 1, z0 1 , , z0,] ' . The mean of the ith response variable is z� {J(i ) • and this is estimated by z�P(i ) , the ith component of the fitted regression relationship. Collectively, , a = [ ' " . ' " . ··· . ' " 1 (7-37) Zop Zo p(I) i Zo p(2) i i Zo p is an unbiased estimator z� f3 since E(z�P(i ) ) "= z� E ( �(i) ) = z� {J(i ) for each com ponent. From the covariance matrix for p(i ) and p(k) • the estimation errors z� {J(i ) - z�p (i ) have covariances • • •
(m)
E [z �( {J(i ) - P(i ) ) ( p(k) - P(k) ) ' z 0] = z�(E ( {J(i) - P(i ) ) ( {J(k) - P(k) ) ' ) z 0
(7-38)
The related problem is that of forecasting a new observation vector Y0 = [Y01, Y0 2 , Y0111 ] ' at z 0 • According to the regression model, Y0; = {J(i ) + e0; where the "new" error e0 = [e01 , e0 2 , , e0111 ] ' is independent of the z�errors and satisfies E ( e0; ) = 0 and E (e0;eo k ) = a; k · The forecast error for the ith component of Y0 is • • •
,
• • •
E
Y0; - z�P(i ) = Y0; - z� {J(i ) + z� {J(i ) - z�p (i ) = e0; - z� ( P(i ) - P(i ) ) E (Y0; - z�p (i ) ) = E ( e0;) - z�E ( P(i ) - {J(i ) ) = 0, z�P (i ) unbiased predictor Yo ;· E (Y0; - z�P (i ) ) (Yo k - z�P(k) ) = E ( e0; - z� ( p(i ) - f3(i ) ) ) ( e0 k - z� ( p (k) - P(k) ) ) = E(e0;e0 k ) + z�E( Pu > - {J(i ) ) ( p ( k) - {J( k) ) ' z 0 - z� E( ( P(i ) - P u > ) e ok ) - E (e o ; ( P(k) - P (k) ) ' ) z o = a; k (1 + (7-39) /J (i ) = e(i ) + fJ(i ) E ( ( P(i ) - /J(i ) ) e0 k ) = 0 e0 . E ( e0; ( P( k) - /J(k) ) ') .
so
indicating that of The forecast errors have covariances
is an
z�(Z'Z) - 1 z0) Note that since is indepen (Z'Z) - 1 Z' dent of A similar result holds for Maximum likelihood estimators and their distributions can be obtained when the errors have a normal distribution. Result 7. 1 0. Let the multivariate multiple regression model in (7-26) hold with full rank (Z) = r + 1, n (r + 1) + m, and let the errors have a normal distribution. Then E
�
E
Sec.
7.7
Multivariate Multiple Regression
419
A
is the maximum likelihood estimator of fJ and- 1 fJ has a normal distribution with E ( /3 ) = fJ and Cov ( p (i ) ' p ( k) ) = u;k (Z' Z) . Also, jj is independent of the maximum likelihood estimator of the positive definite l: given by and
is distributed as wp. n - (l:) Proof. According to the regression model, the likelihood is determined from the data Y = [Y1 , Y2 , . . . , Yn ] ' whose rows are independent, with Yj distributed as N111 ( {J ' zj, l:). We first note that Y - Z {J = [Y 1 - {J ' z 1 , Y2 - {J ' z2 , . . . , Y11 - {J ' zn ] ' so n (Y - z{J) ' (Y - z{J) = 2: (Yj - {J ' zj ) (Yj - {J ' zj ) ' j= l and n 11 2: (Yj - fJ ' zJ 'l:-1 (Yj - {J ' zj ) = 2: tr [(Yj - fJ ' zJ 'l: - 1 (Yj - {J ' zj)] j= 1 j= l = j2:= 1 tr [l:- 1 (Yj - fJ ' zJ (Yj - {J ' zj) '] = tr [l:- 1 (Y - z{J) ' (Y - z {J ) ' ] (7-40) Another preliminary calculati�m will enable us to express the likelihood in a sim ple form. Since = Y - Z{J satisfies Z' = 0 [see (7-32)], ni
r- 1
II
e
(Y - z{J) ' (Y - z{J)
e
= [Y - z jj + z ( jj - {J)] ' [Y - z jj + z ( jj - {J)] = (Y - z jj ) ' (Y - z jj ) + ( p - {J) ' Z' Z( jj - {J) = + ( p - {J) ' Z' Z( jj - {J) Using (7-40) and (7-41), we obtain the likelihood L ( {J, l:) = n (2 :) m f2 l l:l 1 /2 e' e
e- i (yj - P' zj)'I- I (yj - P' zj)
(7-41)
420
Chap.
Multivariate Linear Regression Models
7
The1 2 matrix z ( {J - fJ) I- 1 ( {J - fJ) ' Z ' is the form A' A, with A = I - 1 ( {J - fJ)' Z' , and, from Exercise 2 . 16 , it is nonnegative definite. Therefore, its eigenvalues are nonnegative also. Since, by Result 4.9, 1 tr[z ( {J - fJ) I - ( {J - fJ) ' Z'] is the sum of its eigenvalues, this trace will equal its minimum value, zero, if fJ = /3. This choice is unique because Z is of full rank and P(i) - P(i) * 0, implies that Z ( p(i) - p(i) ) * 0, in which case tr[z ( {J - fJ) I- 1 ( {J - fJ) ' Z'] ;;;.: c' I - 1 c > 0, where c' is any nonzero row of z ( {J - fJ). Applying Result 4.10 with B = b = n /2 , and p = m, we find 1 that {J and i = n- are the maximum likelihood estimators of fJ and I, respectively, and i' i,
i' i
L ( {J, I )
A
P (i)
/2 -c e -nm---(n) mn/2 - nm/2 e n mn 2 2 (2 1T) / 1 £ ' £ 1 / (2 1T ) mn /2 1 i l n/2
1
_
---
(7-42)
--
It remains to establish the distributional results. From (7-36), we know that and e(i) are linear combinations of the elements of e. Specifically, p (i) = (Z' Z) - 1 Z' e<; J + p(i ) e(i) = [I - Z (Z' Z) - 1 Z' ] e(iJ •
i = 1, 2, . . . , m
Therefore, by Result 4.3, /3(1 ) • /3(2) • . . . ' P(m) ' e(1 ) • e(Z) • . . . ' e(m) are jointly normal Their mean vectors and covariance matrices are given in Result 7.9. Since and fJ have a zero covariance matrix,n by- r Result 4.5 they are independent. Further, as in 1 (7-13), [I - Z (Z' Z) - 1 Z' ] = 2: eee�, where e�ek = 0, e * k, and e�ee = 1. f=1 Set Vc = e ' ec = [e(1) ec, e(z) ec, . . . , e(m) ee] ' = en e1 + ec z Ez + . . . + ec n en . Be cause Ve , e = 1, 2, . . . , n - r - 1, are linear combinations of the elements of e, they have a joint normal distribution with E (Vc ) = E ( e ' ) e e = 0. Also, by Result 4.8, Vc and Vk have covariance matrix (e�ek ) I = (O) I = 0 if e * k. Conse quently, the Vc are independently distributed as Nm (O, I) . Finally, i
i'i
n - r- 1
n -r- 1
f=1 (4-22)
f=1
= e ' [I - Z (Z' Z) - 1 Z' ] e = 2: e ' eee� e = 2: Vev;
•
which has the wp, n - r - 1 (I) distribution, by . Result 7.10 provides additional supp_9rt for using least squares estimates. When the errors are normally distributed, fJ and n- 1 are the maximum likeli hood estimators of fJ and I , respectively. Therefore, for large samples, they have nearly the smallest possible variances. Comment. The multivariate multiple regression model poses no new computational problems. Least squares (maximum likelihood) estimates, p (i) = (Z' Z) -1 Z' Y(i ) , are computed individually for each response variable. i' i
Sec. 7 . 7
M u ltivariate M u ltiple Reg ression
421
Note, however, that the model requires that the same predictor variables be used for all responses. Once a multivariate multiple regression model has been fit to the data, it should be subjected to the diagnostic checks described in Section 7. 6 for the sin gle-response model. The residual vectors [ej 1 , ej 2 , , ejm l can be examined for normality or outliers using the techniques in Section 4.6. The remainder of this section is devoted to brief discussions of inference for the normal theory multivariate multiple regression model. Extended accounts of these procedures appear in [1] and [21] . • • •
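As a quick check on the statement that multivariate least squares reduces to m univariate fits sharing one design matrix, the following NumPy sketch (ours, not from the original text) recomputes β̂, the residuals, and the decomposition (7-34) for the data of Example 7.8.

import numpy as np

# Data of Example 7.8: one predictor z1 and two responses
Z = np.column_stack([np.ones(5), np.arange(5.0)])                         # n x (r + 1), r = 1
Y = np.array([[1, -1], [4, -1], [3, 2], [8, 3], [9, 2]], dtype=float)     # n x m, m = 2

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)   # each column is (Z'Z)^{-1} Z' Y_(i), as in (7-29)
Y_hat = Z @ beta_hat
resid = Y - Y_hat

print(beta_hat)          # columns should be about [1, 2] and [-1, 1]
print(resid.T @ resid)   # about [[6, -2], [-2, 4]]
# the decomposition (7-34): Y'Y = Yhat'Yhat + resid'resid
print(np.allclose(Y.T @ Y, Y_hat.T @ Y_hat + resid.T @ resid))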
Likelihood Ratio Tests for Regression Parameters
[ ]
The multiresponse analog of (7 1 5) , the hypothesis that the responses do not depend on Zq + l ' Zq + z , . . , z,, becomes -
.
H0 : P(2) = 0 where
fJ =
p(l) ((q + l) X m) ((r -/3q)��x; m)
(7-43)
Setting Z = [ (n xZ(q1+ l)) J: (n XZ(r2 q)) , we can write the general model as - J ----
---
Under H0 : P (2) = 0, Y = Z1 /J(l) + e and the likelihood ratio test of H0 is based on the quantities involved in the extra sum of squares and cross products
= (Y - zJj(l )) ' (Y - Z1 P(l) ) (Y - z {J ) ' ( Y - z {J) = n (i 1 - i ) where {J(l) = (Z{ Z 1 ) - 1 Z{ Y and i 1 = n - 1 (Y - Z1 P(l) ) ' (Y - Z1 P(1J A,
From (7-42), the likelihood ratio, can be expressed in terms of general ized variances: max L ( fJ(l) ' I ) L ( P(l) ' i l ) = max L , (7-44) ( fJ I) L ( p, i ) A
_ P:_.:_ ( I )•_ I_ ----;--=----,-(3, :£
422
Chap.
7
Multivariate Linear Regression Models
Equivalently, Wilks' lambda statistic can be used. Result 7. 1 1 . Let the multivariate multiple regression model of (7-26) hold with Z of full rank r + 1 and (r + 1) + m n . Let the errors e be normally dis tri�uted. }Jnder H0: fJ(z) = 0, ni is distributed as Wp. n - r - 1 (I) independently of n (I 1 - I) which, in turn, is distributed as Wp., (I). The likelihood ratio test of H0 is equivalent to rejecting H0 for large values of I n� I - 2 In = - n In ( I � I ) = - n In l ni + n (I 1 - I) l I I1 I 5 For n large, the modified statistic A
�
_
A
q
A
A
has, to a close approximation, a chi-square distribution with m (r - q) d.f. • Proof. (See Supplement 7A.) If Z is not of full rank, but has rank r + 1, then fJ = (Z' Z) - Z' Y, where (Z' Z) - is the generalized inverse discussed in1 [18]. (See Exercise 7.6 also.) The dis tributional conclusions stated in Result 7.11 remain the same, provided that r is replaced by r1 and q + 1 by rank(Z1 ). However, not all hypotheses concerning fJ can be tested due to the lack of uniqueness in the identification of fJ caused by the linear dependencies among the columns of Z. Nevertheless, the generalized inverse allows all of the important MANOVA models to be analyzed as special cases of the multivariate multiple regression model. A
Example 7.9
(Testing the im portance of additional predictors with a multivariate response)
The service in three locations of a large restaurant chain was rated according to two measures of quality by male and female patrons. The first service-qual ity index was introduced in Example 7.5. Suppose we consider a regression model that allows for the effects of location, gender, and the location-gender interaction on both serviCe-quality indices. The design matrix (see Example 7.5) remains the same for the two-response situation. We shall illustrate the test of no location-gender interaction in either response using Result 7.11. A computer program provides 5 -
Technically, both approximation.
n
r
and
n
m
should also be large to obtain a good chi-square
( residual sum of squares ) and cross products ( extra sum of squares ) and cross products
1021.72 ] n I = [ 2977.39 1021.72 2050.95 [ 441.76 246.16 ] = n (I 1 - I) = 246.16 366.12 Let f3c2> be the matrix of interaction parameters for the two responses. Although the sample size n = 18 is =not large, we shall illustrate the calcula tions involved in the test of H0: {3(2) 0 given in Result 7.11. Setting a = .05, we test H0 by referring ln� l 1) - [ n - r1 - 1 - !2 (m - r1 + q 1 + 1) ] 1n ( 1 ni + n(I 1 - I) = - [ 18 - 5 - 1 - � (2 - 5 + 3 + 1) ] ln ( .7605) = 3.28 to a chi-square =percentage point with m ( r1 - q 1 ) = 2 (2) = 4 d.f. Since 3.28 < x� (.05) 9.49, we do not reject H0 at the 5% level. The interaction terms are not needed. More generally, we could consider a null hypothesis of the form H0: C/3 = r0, where C is - q) and is of full rank - q). For the choices C = [0 i I(r ] and (rr0+ =1)0, this null hypothesis(r becomes H0: C/3 = /3(2) = 0, the case considered earlier. It can be shown that the extra sum of squares and cross products generated by the hypothesis H0 is n( I 1 - i) = (cp - r0) '( C(Z ' Z) - 1 C ' ) - 1 (Cp - r0) Under the null hypothesis, the statistic n (I 1 - I) is distributed as W (I) inde pendently of i. This distribution theory can be employed to develop a test of H0: C/3 = r0 similar to the test discussed in Result 7.11. (See, for example, [21].) Sec.
7.7
Multivariate Multiple Regression
423
=
A
A
A
A
•
X
(r - q) X (r - q)
,_
q
Other Multivariate Test Statistics
Tests other than the likelihood ratio test have been proposed for testing H0: /3 c2) = 0 in the multivariate multiple regression model. Popular computer-package programs routinely calculate four multivariate test statistics. To connect with their output, we introduce some alternative notation. Let E be the p p error, or residual, sum of squares and cross products matrix E = ni that results from fitting the full model. The p p hypothesis, or extra, sum of squares and cross products matrix X
X
424
Chap.
7
M ultivariate Linear Regression Models
The statistics can be defined in terms of E and H directly, or in terms of the nonzero eigenvalues TJr 1Jz � 1Js 2f HE - � where s = min(p, r - q). Equivalently, they are the roots of I (I 1 - I ) - TJI I = 0. The definitions are: l E I ---' Wilks' lambda = ITs 1 +1 TJ; = --,--I E -'H I ---;+ -TJ_;- = tr[H(H + E) - 1 ] Pillai 's trace = � _ 1 + TJ; Hotelling-Lawley trace = �1 TJ; = tr [H E 1 ] Roy 's greatest root = 1 +TJ r TJI Roy's test selects the coefficient vector so that the univariate F-statistic based on has its maximum possible value. When several of the eigenvalues ; are moderately large, Roy's test will perform poorly relative to the other three.TJSim ulation studies suggest that its power will be best when there is only one large eigenvalue. Charts and tables of critical values are available for Roy' s test. (See [17] and [15].) Wilks ' lambda, Roy ' s greatest root, and the Hotelling-Lawley trace test are nearly equivalent for large sample sizes. If there is a large discrepancy in the reported P-values for the four tests, the eigenvalues and vectors may lead to an interpretation. In this text, we report Wilks' lambda, which is the likelihood ratio test. �
�
•··
--
i= I s
i=l s
-
i=
a a ' Yi
a
Predictions from Multivariate Multiple Regressions
Suppose the model Y = Z fJ + with normal errors has been fit and checked for any inadequacies. If the model is adequate, it can be employed for predictive purposes. One problem is to predict the mean responses corresponding to fixed values z0 of the predictor variables. Inferences about the mean responses can be made using the distribution theory in Result 7.10. From this result, we determine that P ' z0 is distributed as Nm ( fJ ' z0, z� ( Z' Z ) - 1 z0I) and ni is independently distributed as wn -r- r ( I ) The unknown value of the regression function at 0 is So, from the discussion of the T2-statistic in Section 5.2. we can write z fJ ' z0• E,
E,
Sec. 7.7 Multivariate Multiple Regression yz
and the
( P ' z0 - {J' z ) ' ( Vz� ( Z' Z) - 1 :0
=
A )-I ( P ' z - {J ' z ) n I 1 r nVz�tZ ' Z) -1 :0
425
(7-45)
( _ : _ I ) - I ( {J ' z0 - {J ' z0 ) 1 n
100 (1 - a)% confidence ellipsoid for {J' z0 is provided by the inequality P ' Zo ) '
( fJ ' Zo -
, , ,;:;; z0 ( Z Z ) - 1 z0
[ ( m (n - r - 1) ) Fm 11 _ r 111 ( a)] (7-46) n-r-m ·
_
where F111, 11 _ , _ 111 (a) is the upper (lOOa)th percentile of an F-distribution with m and n - r - m d.f. The 100 (1 - a)% simultaneous confidence intervals for E ( Y; ) z�fJu> are
�( mn(n--r r--m1) ) Fm n - r-- m ( a) �Zo, ( Z , Z ) _ 1 Zo ( n - nr - 1 a;; ) ' =
A
·
i
=
1, 2, . . . , m
(7-47)
where p (i ) is the ith column of {J and U;; is the ith diagonal element of I . The second prediction problem is concerned with forecasting new responses Y0 {J' z0 + e0 at z0• Here Eo is independent of e. Now, Y0 - P ' z0 ( {J - P ) ' z0 + E0 is distributed as N111 (0, (1 + z� ( Z ' Z) - 1 z0 ) I ) =
=
independently of n i , so the
100 (1 - a)% prediction ellipsoid for Y0 becomes
---.,::::.; ( 1 + z0' ( Z , Z )
�(
)
_1
z0 )
[( mn(n--r r--m1) ) F111 n - r -m ( a) ] ·
(
(7-48)
)
- a)% simultaneous prediction intervals for the individual responses ¥0; are m (n - r - 1) n Fm, n - r - m a ) (1 + z0, (Z , Z) _ 1 z0 ) n - r - 0';; , n-r-m 1 i 1, 2, . . . , m (7-49) where P (i ) • U;;, and Fm , n - r - m ( a ) are the same quantities appearing in (7-47). Com paring (7-47) and (7-49), we see that the prediction intervals for the actual values
The 100(1
( �
A
=
of the response variables are wider than the corresponding intervals for the expected values. The extra width reflects the presence of the random error e o ; ·
426
Chap.
7
Multivariate Linear Regression Models
Example 7. 1 0 (Constructing a confidence ellipse and a prediction ellipse for bivariate responses)
Y2,
A second response variable was measured for the computer-requirement prob lem discussed in Example 7.6. Measurements on the response disk input/out put capacity, corresponding to the z 1 and z 2 values in that example were
Y2 = [301.8, 396.1, 328.2, 307.4, 362.4, 369.5, 229.1] Obtain the 95% confidence ellipse for /1 1 z0 and the 95% prediction ellipse for Y0 = [ Y0 1 , Y0 2 ] 1 for a site with the configuration z0 = [1, 130, 7.5] 1 • Computer calculations provide the fitted equation y2 = 14.14 + 2.25z 1 + 5.67z2 with s = 1.812. Thus, [3(2) [14.14, 2.25, 5.67] 1 • From Example 7.6, I
=
[3( 1 ) = [8.42, 1.08, 42] 1, z � P( l ) = 151.97, and z� ( Z1 Z ) - 1 z0 = .34725
We find that
�
zP and
(2 )
=
14.14 + 2.25 (130) + 5.67 (7.5) = 349.17
151.97 ] [ 349.17
Since
n = 7, r = 2, and m
o/3(
from
[z
I
- 151.97,
I
[-�;!!!_� .]
/3 1Zo = zo /3(2) is, 5.30 ] - \ [ z �/3( 1 ) - 151.97 ] 13.13 z�/3(2) - 349.17 2 4) (.34725) [ ( � ) F2. 3 (.05) ]
2, a 95% confidence ellipse for
zo/3(2) - 349.17] (4) [ 5.80 5.30
(7-46), the set 1)
=
�
Sec.
7.8
The Concept of Linear Regression
427
with F2 3 (.05) = 9.55. This ellipse is centered at (151.97, 349.17). Its orienta tion and the lengths of the major and minor axes can be determined from the eigenvalues and eigenvectors of n i . Comparing (7-46) and (7-48), we see that the only change required for the calculation of the 95% prediction ellipse is to replace = .34725 with + = 1.34725. Thus, the 95% prediction ellipse for is also centered at (151.97, 349.17), but is larger than the con Y0 = fidence ellipse. Both ellipses are sketched in Figure 7.5. It is the prediction ellipse that is relevant to the determination of com• puter requirements for a particular site with the given
z�(Z'Z) - 1 z0
1 z�(Z'Z)-1z0 [Y0 1 , Y0 2]'
z0 •
Response 2
d
380
�
360
340
Prediction ellipse
onfidence ellipse
Figure 7.5
95% confidence and prediction ellipses for the computer data with two responses.
0
7.8 THE CONCEPT OF LINEAR REGRESSION
The classical linear regression model is concerned with the association between a single dependent variable and a collection of predictor variables z 1 , z 2 , , z,. The regression model that we have considered treats as a random variable whose mean depends upon fixed values of the z; ' s . This mean is assumed to be a linear function of the regression coefficients .•. , The linear regression model also arises in a different setting. Suppose all the variables Z1 , Z2 , , are random and have a joint distribution, not necessar ily normal, with mean vector p and covariance matrix :I . Partition-
Y
Y,
• . .
•••
Y
{30 , {31 , {3, .
Z,
(r+ l) X l
ing p and I in an obvious fashion, we write
[ ]
(r + l) X (r+ l)
I I
and
I=
I
y O" yy : Uz X i ( 1 X r)
(1 1 ) rX l):-rx��: (rX r) (;�I
428
Chap.
7
Multivariate Linear Regression Models
with (7-50)
Uz y = [uyz , Uyz , . . . , u yz ] 1 I
2
can be taken to have full rank.6 Consider the problem of predicting Y using the linear predictor = b0 + b 1 Z1 + + b,Z, = b0 + b1 Z (7-51) For a given predictor of the form of (7-51), the error in the prediction of Y is prediction error = Y - b0 - b 1 Z1 - - b,Z, = Y - b0 - b1 Z (7-52) Because this error is random, it is customary to select b0 and b to minimize the I zz
r
···
···
mean square error = E ( Y - b0 - b1 Z) 2
(7-53) Now the mean square error depends on the joint distribution of Y and Z only through the parameters p and I. It is possible to express the "optimal" linear pre
dictor in terms of these latter quantities. Result 7 1 2 The linear predictor {30 + fJ 1 Z with coefficients f3o = J.Ly - /31 JL z has minimum mean square among all linear predictors of the response Y. Its mean square error is E ( Y - {30 - /31 Z) 2 = E (Y - J.Ly - u�y izi (Z - Pz ) ) 2 = u yy - u�y i zi uz y Also, {30 + {J1 Z = J.Ly + u�y izi (Z - pz ) is the linear predictor having maxi mum correlation with Y; that is, Carr (Y, {30 + /31 Z) = max Carr ( Y, b0 + b1 Z) .
.
�
bo, b
Proof.
we get
E(Y -
Writing
I
�-j = {J1 UI zz fJ = Uz y ,.. z z Uz y yy O"yy b0 + b1 Z = b 0 + b1 Z - ( J.Ly - b1 p z ) + ( J.Ly - b1 Pz ),
b0 - b1 Z)2 = E [ Y - J.Ly - (bi Z - b1 Pz ) + ( J.Ly - b0 - b1 pz )JZ
= E(Y - J.Ly ) 2 + E (b1 (Z - Pz ) ) 2 + ( J.Ly - b0 - b1 p z ) 2 - 2E [ b1 (Z - p z ) (Y - J.Ly)] = Uyy + b1 I zz b + ( J.Ly - b0 - b1 Pzf - 2b1 Uz y
6 If I zz is not of full rank, one variable-for example, Zk---can be written as a linear combina tion of the other Z;'s and thus is redundant in forming the linear regression function Z' {J. That is, Z may be replaced by any subset of components whose nonsingular covariance matrix has the same rank as I z z ·
Sec.
7.8
The Concept of Linear Regression
429
Adding and subtracting u�yizi uz y , we obtain
=
The mean square error is minimized by taking b = I i i uz y {J, making the last term zero, and then choosing be b0 = f.L y - (I ii uz y ) 1 JL z = {30 to make the third term zero. The minimum mean square error is thus uyy - u�yizi uz y· Next, we note that Cov (b 0 + b1 Z, Y ) = Cov (b1 Z, Y ) h1 Uz y so
=
[ Corr (b0 + b 1 Z, Y ) ] 2 _- Uyy[b1(b1uzIyJzzZ b) ,
for all b0 , b
Employing the extended Cauchy-Schwartz inequality of (2-49) with B = I zz , we obtain
or
with equality for b = I i i uz y = {J. The alternative expression for the maximum correlation follows from the equation u�yizi uz y = u�y {J = u�y iziizz fJ = •
fJ 1 Izz fJ·
The correlation between Y and its best linear predictor is called the popula
tion multiple correlation coefficient
I
Uz y �...., z- 1zUz y Uyy
PY( Z) = +
(7-54)
The square of the population multiple correlation coefficient, p�(Z) , is called the pop ulation coefficient of determination. Note that, unlike other correlation coefficients, the multiple correlation coefficient is a positive square root, so ::::;; PY( Z) ::::;; 1 . The population coefficient of determination has an important interpretation. From Result 7.12, the mean square error in using {30 + {J1 Z to forecast Y is
0
(
I
)
2 y �...., z- z1 Uz y Uz -1 (Z) ) (7-55) Uyy - Uz1 y izz U U (1 U PY yy y yy yy Uz Uyy _
0,
_
If P�(Z) = there is no predictive power in Z . At the other extreme, p�(Z) = 1 implies that Y can be predicted with no error.
430
Chap.
7
Multivariate Linear Regression Models
Example 7. 1 1
(Determining the best linear predictor, its mean square error, and the multiple correlation coefficient)
[-�!'�-L��x] [ !Q_l_�- -=-�-]
Given the mean vector and covariance matrix of Y, Z1 , Z2 , and I = Uz y i Izz = 1 i 7 3 i -1 i 3 2 determine (a) the best linear predictor + {31 Z1 + {32 Z2 , (b) its mean square error, and (c) the multiple correlation{30 coefficient. Also, verify that the mean square error equals Uyy{ 1 - p�(z) ). First, -1 fJ = I i i uzv = [ � � [ -� J [ _ : : �:� ] [ - � J [ -� J J f3o = #Ly - fJ 1 ILz = 5 - [1, -2) [ � J = 3 so the best linear predictor is {30 + {J 1 Z = 3 + Z1 - 2Z2 • The mean square error is Uy y - Uz y �"'-zz- 1 Uz y = 10 - [ 1, - 1 ] [ - ..46 -1..64 J [ _ 11 J = 10 - 3 = 7 and the multiple correlation coefficient is -1 y #o = .548 Uz y �"'-zzUz P Y(Z) = Uyy Note that uyy{ 1 - Phz> ) = 10(1 - fo ) = 7 is the mean square error. • It is possible to show (see Exercise 7. 5 ) that 1 - PY2 (Z) -- p 1y y (7-56) where p Y Y is the upper left-hand corner of the inverse of the correlation matrix determined from I. The restriction to linear predictors is closely connected to the assumption of normality. Specifically, if we take __
I
I
I
I
y
Z21 to be distributed as z
z,
N, +
1 (p,, I)
Sec.
7.8
The Concept of Linear Regression
then the conditional distribution of Y with
431
z 1 , z2 , ... , z, fixed (see Result 4.6) is
N( J.Ly + u�y izi (z - J.tz ), uyy - u�y izi uz y )
The mean of this conditional distribution is the linear predictor in Result 7.12. That is, (7-57) E(Yiz 1 , z2 , , z, ) = J.Ly + u�yiz i (z - p,z ) = f3o + fJ'z and w e conclude that E(Y iz 1 , z2 , , z , ) is the best linear predictor of Y when the • . .
• • •
population is N, + 1 ( p, , I) . The conditional expectation of Y in (7-57) is called the
linear regression function .
E(Yiz 1 , z2 , , z, ) [18])
When the population is not normal, the regression function need not be of the form {30 + z . Nevertheless, it can be shown (see that (Y 1 , whatever its form, predicts Y with the smallest mean square er ror. Fortunately, this wider optimality among all estimators is possessed by the linear predictor when the population is normal.
fJ'
E iz , z2 , z, ),
• • •
• • •
Result 7. 1 3.
[-�S.Y-z !'y_f-�-: S�zzr..]
Suppose the joint distribution of Y and and S =
Z is N,+ ( p,, I) . Let 1
be the sample mean vector and sample covariance matrix, respectively, for a ran dom sample of size n from this population. Then the maximum likelihood estima tors of the coefficients in the linear predictor are
P = S z� Sz y,
fi o
= Y - s� y S z� Z = Y - P'Z
Consequently, the maximum likelihood estimator of the linear regression function is
fi o + P ' z =
Y
+ s� y Sz � (z - Z )
and the maximum likelihood estimator of the mean square error I
n-1 A O" yy.z - n (Syy - S z/ y zz Sz y --
Proof.
s-1 )
E [ Y - {30 - fJ'Z] 2 is
We use Result 4.11 and the invariance property of maximum likeli hood estimators. [See (4-20).] Since, from Result 7.12,
f3o = J.Ly - (I i i uz y ) ' p, z , and mean square error = uyy. z = uyy - u�y izi uz y
432
Chap.
7
Multivariate Linear Regression Models
[-r-J
the conclusions follow upon substitution of the maximum likelihood estimators p� -
z
for and I =
:.] [-��-Uz �y-�-��: I zz
•
It is customary to change the divisor from n to n - ( r + 1 ) in the estimator of the mean square error, cr yy. z = E ( Y - {30 - {J ' Z) 2 , in order to obtain the unbiased estimator
( n - 1 ) (S yy - S z y S zz Sz y ) ,
n-r-1
-1
n -
r -
1
(7-58)
Example 7.12 (Maximum likelihood estimate of the regression function–single response)

For the computer data of Example 7.6, the n = 7 observations on Y (CPU time), Z1 (orders), and Z2 (add–delete items) give the sample mean vector and sample covariance matrix:

\[
\hat{\mu} = \begin{bmatrix} \bar{y} \\ \bar{z} \end{bmatrix} = \begin{bmatrix} 150.44 \\ 130.24 \\ 3.547 \end{bmatrix}, \qquad
S = \begin{bmatrix} s_{YY} & s_{ZY}' \\ s_{ZY} & S_{ZZ} \end{bmatrix}
= \begin{bmatrix} 467.913 & 418.763 & 35.983 \\ 418.763 & 377.200 & 28.034 \\ 35.983 & 28.034 & 13.657 \end{bmatrix}
\]

Assuming that Y, Z1, and Z2 are jointly normal, obtain the estimated regression function and the estimated mean square error.

Result 7.13 gives the maximum likelihood estimates

\[
\hat{\beta} = S_{ZZ}^{-1}s_{ZY} = \begin{bmatrix} .003128 & -.006422 \\ -.006422 & .086404 \end{bmatrix}\begin{bmatrix} 418.763 \\ 35.983 \end{bmatrix} = \begin{bmatrix} 1.079 \\ .420 \end{bmatrix}
\]

\[
\hat{\beta}_0 = \bar{y} - \hat{\beta}'\bar{z} = 150.44 - [1.079, .420]\begin{bmatrix} 130.24 \\ 3.547 \end{bmatrix} = 150.44 - 142.019 = 8.421
\]

and the estimated regression function

\[
\hat{\beta}_0 + \hat{\beta}'z = 8.42 + 1.08z_1 + .42z_2
\]

The maximum likelihood estimate of the mean square error arising from the prediction of Y with this regression function is

\[
\Bigl(\frac{n-1}{n}\Bigr)\bigl(s_{YY} - s_{ZY}'S_{ZZ}^{-1}s_{ZY}\bigr)
= \Bigl(\frac{6}{7}\Bigr)\Bigl(467.913 - [418.763, 35.983]\begin{bmatrix} .003128 & -.006422 \\ -.006422 & .086404 \end{bmatrix}\begin{bmatrix} 418.763 \\ 35.983 \end{bmatrix}\Bigr) = .894 \qquad\text{•}
\]
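The estimates in Example 7.12 follow directly from the partitioned sample moments. A short NumPy sketch (ours, not the book's) that reproduces them:

import numpy as np

# Sample moments from Example 7.12 (Y = CPU time, Z1 = orders, Z2 = add-delete items)
ybar, zbar = 150.44, np.array([130.24, 3.547])
s_YY = 467.913
s_ZY = np.array([418.763, 35.983])
S_ZZ = np.array([[377.200, 28.034],
                 [ 28.034, 13.657]])
n = 7

beta_hat = np.linalg.solve(S_ZZ, s_ZY)                                 # approx [1.079, 0.420]
beta0_hat = ybar - beta_hat @ zbar                                     # approx 8.42
sigma_hat = (n - 1) / n * (s_YY - s_ZY @ np.linalg.solve(S_ZZ, s_ZY))  # approx 0.894

print(beta_hat, beta0_hat, sigma_hat)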
Prediction of Several Variables
The extension of the previous results to the prediction of several responses Y1, Y2, ..., Y_m is almost immediate. We present this extension for normal populations.

Suppose

\[
\begin{bmatrix} Y_{(m\times 1)} \\ Z_{(r\times 1)} \end{bmatrix} \text{ is distributed as } N_{m+r}(\mu, \Sigma)
\quad\text{with}\quad
\mu = \begin{bmatrix} \mu_{Y\,(m\times 1)} \\ \mu_{Z\,(r\times 1)} \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix} \Sigma_{YY} & \Sigma_{YZ} \\ \Sigma_{ZY} & \Sigma_{ZZ} \end{bmatrix}
\]

By Result 4.6, the conditional expectation of [Y1, Y2, ..., Y_m]', given the fixed values z1, z2, ..., z_r of the predictor variables, is

\[
E[Y \mid z_1, z_2, \ldots, z_r] = \mu_Y + \Sigma_{YZ}\Sigma_{ZZ}^{-1}(z - \mu_Z) \tag{7-59}
\]

This conditional expected value, considered as a function of z1, z2, ..., z_r, is called the multivariate regression of the vector Y on Z. It is composed of m univariate regressions. For instance, the first component of the conditional mean vector is μ_{Y1} + Σ_{Y1Z}Σ_{ZZ}⁻¹(z − μ_Z) = E(Y1 | z1, z2, ..., z_r), which minimizes the mean square error for the prediction of Y1. The m × r matrix β = Σ_{YZ}Σ_{ZZ}⁻¹ is called the matrix of regression coefficients.

The error of prediction vector

\[
Y - \mu_Y - \Sigma_{YZ}\Sigma_{ZZ}^{-1}(Z - \mu_Z)
\]

has the expected squares and cross-products matrix

\[
\Sigma_{YY\cdot Z} = E[Y - \mu_Y - \Sigma_{YZ}\Sigma_{ZZ}^{-1}(Z - \mu_Z)][Y - \mu_Y - \Sigma_{YZ}\Sigma_{ZZ}^{-1}(Z - \mu_Z)]'
\]
\[
= \Sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}(\Sigma_{YZ})' - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY} + \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZZ}\Sigma_{ZZ}^{-1}(\Sigma_{YZ})'
= \Sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY} \tag{7-60}
\]
Because β and Σ are typically unknown, they must be estimated from a random sample in order to construct the multivariate linear predictor and determine expected prediction errors.

Result 7.14. Suppose Y and Z are jointly distributed as N_{m+r}(μ, Σ). Then the regression of the vector Y on Z is

\[
\beta_0 + \beta z = \mu_Y - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\mu_Z + \Sigma_{YZ}\Sigma_{ZZ}^{-1}z = \mu_Y + \Sigma_{YZ}\Sigma_{ZZ}^{-1}(z - \mu_Z)
\]

The expected squares and cross-products matrix for the errors is

\[
E(Y - \beta_0 - \beta Z)(Y - \beta_0 - \beta Z)' = \Sigma_{YY\cdot Z} = \Sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY}
\]

Based on a random sample of size n, the maximum likelihood estimator of the regression function is

\[
\hat{\beta}_0 + \hat{\beta}z = \bar{y} + S_{YZ}S_{ZZ}^{-1}(z - \bar{z})
\]

and the maximum likelihood estimator of Σ_YY·Z is

\[
\hat{\Sigma}_{YY\cdot Z} = \Bigl(\frac{n-1}{n}\Bigr)\bigl(S_{YY} - S_{YZ}S_{ZZ}^{-1}S_{ZY}\bigr)
\]

Proof. The regression function and the covariance matrix for the prediction errors follow from Result 4.6. Using the relationships

\[
\beta_0 + \beta z = \mu_Y + \Sigma_{YZ}\Sigma_{ZZ}^{-1}(z - \mu_Z)
\]
\[
\Sigma_{YY\cdot Z} = \Sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY} = \Sigma_{YY} - \beta\Sigma_{ZZ}\beta'
\]

we deduce the maximum likelihood statements from the invariance property [see (4-20)] of maximum likelihood estimators upon substitution of

\[
\hat{\mu} = \begin{bmatrix} \bar{y} \\ \bar{z} \end{bmatrix}; \qquad
\hat{\Sigma} = \begin{bmatrix} \hat{\Sigma}_{YY} & \hat{\Sigma}_{YZ} \\ \hat{\Sigma}_{ZY} & \hat{\Sigma}_{ZZ} \end{bmatrix}
= \Bigl(\frac{n-1}{n}\Bigr)S
= \Bigl(\frac{n-1}{n}\Bigr)\begin{bmatrix} S_{YY} & S_{YZ} \\ S_{ZY} & S_{ZZ} \end{bmatrix} \qquad\qquad\text{•}
\]
It can be shown that an unbiased estimator of Σ_YY·Z is

\[
\Bigl(\frac{n-1}{n-r-1}\Bigr)\bigl(S_{YY} - S_{YZ}S_{ZZ}^{-1}S_{ZY}\bigr)
= \frac{1}{n-r-1}\sum_{j=1}^{n}(Y_j - \hat{\beta}_0 - \hat{\beta}Z_j)(Y_j - \hat{\beta}_0 - \hat{\beta}Z_j)' \tag{7-61}
\]
Example 7.13 (Maximum likelihood estimates of the regression functions–two responses)

We return to the computer data given in Examples 7.6 and 7.10. For Y1 = CPU time, Y2 = disk I/O capacity, Z1 = orders, and Z2 = add–delete items, we have

\[
\hat{\mu} = \begin{bmatrix} \bar{y} \\ \bar{z} \end{bmatrix} = \begin{bmatrix} 150.44 \\ 327.79 \\ 130.24 \\ 3.547 \end{bmatrix}
\]

and

\[
S = \begin{bmatrix} S_{YY} & S_{YZ} \\ S_{ZY} & S_{ZZ} \end{bmatrix}
= \begin{bmatrix}
467.913 & 1148.536 & 418.763 & 35.983 \\
1148.536 & 3072.491 & 1008.976 & 140.558 \\
418.763 & 1008.976 & 377.200 & 28.034 \\
35.983 & 140.558 & 28.034 & 13.657
\end{bmatrix}
\]

Assuming normality, we find that the estimated regression function is

\[
\hat{\beta}_0 + \hat{\beta}z = \bar{y} + S_{YZ}S_{ZZ}^{-1}(z - \bar{z})
\]
\[
= \begin{bmatrix} 150.44 \\ 327.79 \end{bmatrix}
+ \begin{bmatrix} 418.763 & 35.983 \\ 1008.976 & 140.558 \end{bmatrix}
\begin{bmatrix} .003128 & -.006422 \\ -.006422 & .086404 \end{bmatrix}
\begin{bmatrix} z_1 - 130.24 \\ z_2 - 3.547 \end{bmatrix}
\]
\[
= \begin{bmatrix} 150.44 \\ 327.79 \end{bmatrix}
+ \begin{bmatrix} 1.079(z_1 - 130.24) + .420(z_2 - 3.547) \\ 2.254(z_1 - 130.24) + 5.665(z_2 - 3.547) \end{bmatrix}
\]

Thus, the minimum mean square error predictor of Y1 is

\[
150.44 + 1.079(z_1 - 130.24) + .420(z_2 - 3.547) = 8.42 + 1.08z_1 + .42z_2
\]

Similarly, the best predictor of Y2 is

\[
14.14 + 2.25z_1 + 5.67z_2
\]

The maximum likelihood estimate of the expected squared errors and cross-products matrix Σ_YY·Z is given by

\[
\Bigl(\frac{n-1}{n}\Bigr)\bigl(S_{YY} - S_{YZ}S_{ZZ}^{-1}S_{ZY}\bigr)
= \Bigl(\frac{6}{7}\Bigr)\Bigl(\begin{bmatrix} 467.913 & 1148.536 \\ 1148.536 & 3072.491 \end{bmatrix}
- \begin{bmatrix} 418.763 & 35.983 \\ 1008.976 & 140.558 \end{bmatrix}
\begin{bmatrix} .003128 & -.006422 \\ -.006422 & .086404 \end{bmatrix}
\begin{bmatrix} 418.763 & 1008.976 \\ 35.983 & 140.558 \end{bmatrix}\Bigr)
\]
\[
= \Bigl(\frac{6}{7}\Bigr)\begin{bmatrix} 1.043 & 1.042 \\ 1.042 & 2.572 \end{bmatrix}
= \begin{bmatrix} .894 & .893 \\ .893 & 2.205 \end{bmatrix}
\]

The first estimated regression function, 8.42 + 1.08z1 + .42z2, and the associated mean square error, .894, are the same as those in Example 7.12 for the single-response case. Similarly, the second estimated regression function, 14.14 + 2.25z1 + 5.67z2, is the same as that given in Example 7.10.

We see that the data enable us to predict the first response, Y1, with smaller error than the second response, Y2. The positive covariance .893 indicates that overprediction (underprediction) of CPU time tends to be accompanied by overprediction (underprediction) of disk capacity. •
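For the two-response case, the same computations apply with matrices in place of vectors. Below is a NumPy sketch (not from the text) using the partitioned moments of Example 7.13; small numerical differences from the printed results reflect rounding in the moments shown above.

import numpy as np

# Partitioned sample moments from Example 7.13 (two responses, two predictors)
ybar = np.array([150.44, 327.79])
zbar = np.array([130.24, 3.547])
S_YY = np.array([[467.913, 1148.536],
                 [1148.536, 3072.491]])
S_YZ = np.array([[418.763, 35.983],
                 [1008.976, 140.558]])
S_ZZ = np.array([[377.200, 28.034],
                 [28.034, 13.657]])
n = 7

B = S_YZ @ np.linalg.inv(S_ZZ)          # matrix of slope estimates, rows approx [1.08, .42] and [2.25, 5.67]
b0 = ybar - B @ zbar                     # intercepts, approx [8.42, 14.14]
Sigma_YYz = (n - 1) / n * (S_YY - S_YZ @ np.linalg.inv(S_ZZ) @ S_YZ.T)   # error covariance estimate

print(B, b0, Sigma_YYz)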
Comment. Result 7.14 states that the assumption of a joint normal distribution for the whole collection Y1, Y2, ..., Y_m, Z1, Z2, ..., Z_r leads to the prediction equations

\[
\hat{y}_1 = \hat{\beta}_{01} + \hat{\beta}_{11}z_1 + \cdots + \hat{\beta}_{r1}z_r
\]
\[
\hat{y}_2 = \hat{\beta}_{02} + \hat{\beta}_{12}z_1 + \cdots + \hat{\beta}_{r2}z_r
\]
\[
\vdots
\]
\[
\hat{y}_m = \hat{\beta}_{0m} + \hat{\beta}_{1m}z_1 + \cdots + \hat{\beta}_{rm}z_r
\]

We note the following:

1. The same values, z1, z2, ..., z_r, are used to predict each Y_i.
2. The β̂_ik are estimates of the (i, k)th entry of the regression coefficient matrix β = Σ_YZ Σ_ZZ⁻¹ for i, k ≥ 1.

We conclude this discussion of the regression problem by introducing one further correlation coefficient.
Partial Correlation Coefficient
Consider the pair of errors

\[
Y_1 - \mu_{Y_1} - \Sigma_{Y_1Z}\Sigma_{ZZ}^{-1}(Z - \mu_Z)
\qquad\text{and}\qquad
Y_2 - \mu_{Y_2} - \Sigma_{Y_2Z}\Sigma_{ZZ}^{-1}(Z - \mu_Z)
\]

obtained from using the best linear predictors to predict Y1 and Y2. Their correlation, determined from the error covariance matrix Σ_YY·Z = Σ_YY − Σ_YZ Σ_ZZ⁻¹ Σ_ZY, measures the association between Y1 and Y2 after eliminating the effects of Z1, Z2, ..., Z_r.

We define the partial correlation coefficient between Y1 and Y2, eliminating Z1, Z2, ..., Z_r, by

\[
\rho_{Y_1Y_2\cdot Z} = \frac{\sigma_{Y_1Y_2\cdot Z}}{\sqrt{\sigma_{Y_1Y_1\cdot Z}}\sqrt{\sigma_{Y_2Y_2\cdot Z}}} \tag{7-62}
\]

where σ_{Y_iY_k·Z} is the (i, k)th entry in the matrix Σ_YY·Z = Σ_YY − Σ_YZ Σ_ZZ⁻¹ Σ_ZY. The corresponding sample partial correlation coefficient is

\[
r_{Y_1Y_2\cdot Z} = \frac{s_{Y_1Y_2\cdot Z}}{\sqrt{s_{Y_1Y_1\cdot Z}}\sqrt{s_{Y_2Y_2\cdot Z}}} \tag{7-63}
\]

with s_{Y_iY_k·Z} the (i, k)th element of S_YY − S_YZ S_ZZ⁻¹ S_ZY. Assuming that Y and Z have a joint multivariate normal distribution, we find that the sample partial correlation coefficient in (7-63) is the maximum likelihood estimator of the partial correlation coefficient in (7-62).

Example 7.14 (Calculating a partial correlation)

From the computer data in Example 7.13,

\[
S_{YY} - S_{YZ}S_{ZZ}^{-1}S_{ZY} = \begin{bmatrix} 1.043 & 1.042 \\ 1.042 & 2.572 \end{bmatrix}
\]

Therefore, by (7-63),

\[
r_{Y_1Y_2\cdot Z} = \frac{1.042}{\sqrt{1.043}\sqrt{2.572}} = .64
\]

Calculating the ordinary correlation coefficient, we obtain r_{Y1Y2} = .96. Comparing the two correlation coefficients, we see that the association between Y1 and Y2 has been sharply reduced after eliminating the effects of the variables Z on both responses. •
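The sample partial correlation in (7-63) and the ordinary correlation can be compared with a few lines of NumPy (again a sketch, not the book's code; exact values differ slightly from Example 7.14 because the printed moments are rounded).

import numpy as np

S_YY = np.array([[467.913, 1148.536],
                 [1148.536, 3072.491]])
S_YZ = np.array([[418.763, 35.983],
                 [1008.976, 140.558]])
S_ZZ = np.array([[377.200, 28.034],
                 [28.034, 13.657]])

S_YYz = S_YY - S_YZ @ np.linalg.inv(S_ZZ) @ S_YZ.T      # error covariance matrix S_YY - S_YZ S_ZZ^{-1} S_ZY
r_partial = S_YYz[0, 1] / np.sqrt(S_YYz[0, 0] * S_YYz[1, 1])   # sample partial correlation r_{Y1 Y2 . Z}
r_ordinary = S_YY[0, 1] / np.sqrt(S_YY[0, 0] * S_YY[1, 1])     # ordinary correlation, approx .96

print(r_partial, r_ordinary)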
7.9 COMPARING THE TWO FORMULATIONS OF THE REGRESSION MODEL
In Sections 7.2 and 7.7, we presented the multiple regression models for one and several response variables, respectively. In these treatments, the predictor variables had fixed values z_j at the jth trial. Alternatively, we can start, as in Section 7.8, with a set of variables that have a joint normal distribution. The process of conditioning on one subset of variables in order to predict values of the other set leads to a conditional expectation that is a multiple regression model. The two approaches to multiple regression are related. To show this relationship explicitly, we introduce two minor variants of the regression model formulation.

Mean Corrected Form of the Regression Model

For any response variable Y, the multiple regression model asserts that

\[
Y_j = \beta_0 + \beta_1 z_{j1} + \cdots + \beta_r z_{jr} + \varepsilon_j
\]

The predictor variables can be "centered" by subtracting their means. For instance, β1 z_{j1} = β1(z_{j1} − z̄1) + β1 z̄1, and we can write

\[
Y_j = \beta_* + \beta_1(z_{j1} - \bar{z}_1) + \cdots + \beta_r(z_{jr} - \bar{z}_r) + \varepsilon_j \tag{7-64}
\]

with β_* = β0 + β1 z̄1 + ··· + β_r z̄_r. The mean corrected design matrix corresponding to the reparameterization in (7-64) is

\[
Z_c = \begin{bmatrix}
1 & z_{11} - \bar{z}_1 & \cdots & z_{1r} - \bar{z}_r \\
1 & z_{21} - \bar{z}_1 & \cdots & z_{2r} - \bar{z}_r \\
\vdots & \vdots & & \vdots \\
1 & z_{n1} - \bar{z}_1 & \cdots & z_{nr} - \bar{z}_r
\end{bmatrix}
\]

where the last r columns are each perpendicular to the first column, since

\[
\sum_{j=1}^{n} 1\,(z_{ji} - \bar{z}_i) = 0, \qquad i = 1, 2, \ldots, r
\]

Further, setting Z_c = [1 | Z_{c2}] with Z'_{c2}1 = 0, we obtain
\[
\begin{bmatrix} \hat{\beta}_* \\ \hat{\beta}_c \end{bmatrix}
= \begin{bmatrix} n & 0' \\ 0 & Z_{c2}'Z_{c2} \end{bmatrix}^{-1}
\begin{bmatrix} n\bar{y} \\ Z_{c2}'y \end{bmatrix}
= \begin{bmatrix} \bar{y} \\ (Z_{c2}'Z_{c2})^{-1}Z_{c2}'y \end{bmatrix} \tag{7-65}
\]

That is, the regression coefficients [β1, β2, ..., β_r]' are unbiasedly estimated by (Z'_{c2}Z_{c2})⁻¹Z'_{c2}y and β_* is estimated by ȳ. Because the definitions of β1, β2, ..., β_r remain unchanged by the reparameterization in (7-64), their best estimates computed from the design matrix Z_c are exactly the same as the best estimates computed from the design matrix Z. Thus, setting β̂_c = [β̂1, β̂2, ..., β̂_r]', the linear predictor of Y can be written as

\[
\hat{y} = \hat{\beta}_* + \hat{\beta}_c'(z - \bar{z}) \tag{7-66}
\]

with (z − z̄) = [z1 − z̄1, z2 − z̄2, ..., z_r − z̄_r]'. Finally,

\[
\operatorname{Cov}\begin{bmatrix} \hat{\beta}_* \\ \hat{\beta}_c \end{bmatrix}
= \sigma^2(Z_c'Z_c)^{-1}
= \begin{bmatrix} \sigma^2/n & 0' \\ 0 & \sigma^2(Z_{c2}'Z_{c2})^{-1} \end{bmatrix} \tag{7-67}
\]

Comment. The multivariate multiple regression model yields the same mean corrected design matrix for each response. The least squares estimates of the coefficient vectors for the ith response are given by

\[
\hat{\beta}_{c(i)} = (Z_{c2}'Z_{c2})^{-1}Z_{c2}'y_{(i)}, \qquad i = 1, 2, \ldots, m \tag{7-68}
\]
Sometimes, for even further numerical stability, "standardized" input variables (z_{ji} − z̄_i)/√((n − 1)s_{z_iz_i}) are used. In this case, the slope coefficients β_i in the regression model are replaced by β̃_i = β_i √((n − 1)s_{z_iz_i}). The least squares estimates of the beta coefficients β̃_i become β̂̃_i = β̂_i √((n − 1)s_{z_iz_i}), i = 1, 2, ..., r. These relationships hold for each response in the multivariate multiple regression situation as well.
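A quick numerical check of the mean corrected parameterization, using made-up data (this sketch is not from the text): the slope estimates from the centered design matrix agree with those from the original design matrix, and the estimated intercept in the centered form is ȳ.

import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 2
Z = rng.normal(size=(n, r))                         # made-up predictor values
y = 3 + Z @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=n)

# Full design matrix [1, z1, z2] versus mean corrected design matrix [1, z1 - zbar1, z2 - zbar2]
X  = np.column_stack([np.ones(n), Z])
Xc = np.column_stack([np.ones(n), Z - Z.mean(axis=0)])

b  = np.linalg.lstsq(X,  y, rcond=None)[0]          # [beta0_hat, beta1_hat, beta2_hat]
bc = np.linalg.lstsq(Xc, y, rcond=None)[0]          # [beta*_hat, beta1_hat, beta2_hat]

# Slopes agree; the mean corrected intercept estimate equals ybar
print(np.allclose(b[1:], bc[1:]), np.isclose(bc[0], y.mean()))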
Relating the Formulations
When the variables Y, Z1, Z2, ..., Z_r are jointly normal, the estimated predictor of Y (see Result 7.13) is

\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}'z = \bar{y} + s_{ZY}'S_{ZZ}^{-1}(z - \bar{z}) \tag{7-69}
\]

where the estimation procedure leads naturally to the introduction of centered z_i's. Recall from the mean corrected form of the regression model that the best linear predictor of Y [see (7-66)] is

\[
\hat{y} = \hat{\beta}_* + \hat{\beta}_c'(z - \bar{z})
\]

with β̂_* = ȳ and β̂'_c = y'Z_{c2}(Z'_{c2}Z_{c2})⁻¹. Comparing (7-66) and (7-69), we see that β̂_* = ȳ and β̂_c = β̂, since⁷

\[
y'Z_{c2}(Z_{c2}'Z_{c2})^{-1} = s_{ZY}'S_{ZZ}^{-1} \tag{7-70}
\]

Therefore, both the normal theory conditional mean and the classical regression model approaches yield exactly the same linear predictors. A similar argument indicates that the best linear predictors of the responses in the two multivariate multiple regression setups are also exactly the same.

Example 7.15
(Two approaches yield the same linear predictor)
The computer data with the single response Y1 = CPU time were analyzed in Example 7.6 using the classical linear regression model. The same data were analyzed again in Example 7.12, assuming that the variables Y1, Z1, and Z2 were jointly normal so that the best predictor of Y1 is the conditional mean of Y1 given z1 and z2. Both approaches yielded the same predictor,

\[
\hat{y} = 8.42 + 1.08z_1 + .42z_2
\]
•
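The agreement of the two formulations is easy to confirm numerically. In the sketch below (made-up data, not from the text), the classical least squares fit and the conditional-mean predictor built from the sample mean vector and covariance matrix, as in Result 7.13, give identical coefficients.

import numpy as np

rng = np.random.default_rng(1)
n = 50
Z = rng.normal(size=(n, 2))
y = 4 + Z @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Classical regression: least squares on the design matrix [1, z1, z2]
X = np.column_stack([np.ones(n), Z])
b_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Conditional-mean approach (Result 7.13): use the sample mean vector and covariance matrix
S = np.cov(np.column_stack([y, Z]), rowvar=False)
beta_hat = np.linalg.solve(S[1:, 1:], S[1:, 0])      # S_ZZ^{-1} s_ZY
beta0_hat = y.mean() - beta_hat @ Z.mean(axis=0)

print(np.allclose(b_ls, np.concatenate([[beta0_hat], beta_hat])))   # True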
Although the two formulations of the linear prediction problem yield the same predictor equations, conceptually they are quite different. For the model in (7-3) or (7-26), the values of the input variables are assumed to be set by the experimenter. In the conditional mean model of (7-57) or (7-59), the values of the predictor variables are random variables that are observed along with the values of the response variable(s). The assumptions underlying the second approach are more stringent, but they yield an optimal predictor among all choices, rather than merely among linear predictors.

⁷The identity in (7-70) is established by writing y = (y − ȳ1) + ȳ1, so that

\[
y'Z_{c2} = (y - \bar{y}1)'Z_{c2} + \bar{y}1'Z_{c2} = (y - \bar{y}1)'Z_{c2} + 0' = (y - \bar{y}1)'Z_{c2}
\]

Consequently,

\[
y'Z_{c2}(Z_{c2}'Z_{c2})^{-1} = (y - \bar{y}1)'Z_{c2}(Z_{c2}'Z_{c2})^{-1} = (n-1)s_{ZY}'[(n-1)S_{ZZ}]^{-1} = s_{ZY}'S_{ZZ}^{-1}
\]
We close by noting that the multivariate regression calculations in either case can be couched in terms of the sample mean vectors ȳ and z̄ and the sample sums of squares and cross-products:

\[
\begin{bmatrix}
\sum_{j=1}^{n}(y_j - \bar{y})(y_j - \bar{y})' & \sum_{j=1}^{n}(y_j - \bar{y})(z_j - \bar{z})' \\
\sum_{j=1}^{n}(z_j - \bar{z})(y_j - \bar{y})' & \sum_{j=1}^{n}(z_j - \bar{z})(z_j - \bar{z})'
\end{bmatrix}
=
\begin{bmatrix}
Y_c'Y_c & Y_c'Z_{c2} \\
Z_{c2}'Y_c & Z_{c2}'Z_{c2}
\end{bmatrix}
\]

This is the only information necessary to compute the estimated regression coefficients and their estimated covariances. Of course, an important part of regression analysis is model checking. This requires the residuals (errors), which must be calculated using all the original data.

7.10 MULTIPLE REGRESSION MODELS WITH TIME DEPENDENT ERRORS
For data collected over time, observations in different time periods are often related, or autocorrelated. Consequently, in a regression context, the observations on the dependent variable or, equivalently, the errors, cannot be independent. As indicated in our discussion of dependence in Section 5.8, time dependence in the observations can invalidate inferences made using the usual independence assumption. Similarly, inferences in regression can be misleading when regression models are fit to time ordered data and the standard regression assumptions are used. This issue is important, so in the example that follows, we not only show how to detect the presence of time dependence, but also how to incorporate this dependence into the multiple regression model.

Example 7.16 (Incorporating Time Dependent Errors in a Regression Model)

Power companies must have enough natural gas to heat all of their customers' homes and businesses, particularly during the coldest days of the year. A major component of the planning process is a forecasting exercise based on a model relating the sendouts of natural gas to factors, like temperature, that clearly have some relationship to the amount of gas consumed. More gas is required on cold days. Rather than use the daily average temperature, it is customary to use degree heating days (DHD) = 65 degrees − daily average temperature. A large number for DHD indicates a cold day. Wind speed, again a 24-hour average, can also be a factor in the sendout amount. Because many businesses close for the weekend, the demand for natural gas is typically less
TABLE 7.4 NATURAL GAS DATA

     Y          Z1       Z2          Z3          Z4
  Sendout      DHD     DHDLag    Windspeed    Weekend
    227         32       30          12           1
    236         31       32           8           1
    228         30       31           8           0
    252         34       30           8           0
    238         28       34          12           0
     ⋮           ⋮        ⋮           ⋮            ⋮
    333         46       41           8           0
    266         33       46           8           0
    280         38       33          18           0
    386         52       38          22           0
    415         57       52          18           0
on a weekend day. Data on these variables for one winter in a major northern city are shown, in part, in Table 7.4. (See the data disk for the complete data set. There are n = 63 observations.)

Initially, we developed a regression model relating gas sendout to degree heating days, wind speed, and a weekend dummy variable. Other variables likely to have some effect on natural gas consumption, like percent cloud cover, are subsumed in the error term. After several attempted fits, we decided to include not only the current DHD but also that of the previous day. (The degree heating day lagged one time period is denoted by DHDLag in Table 7.4.) The fitted model is

\[
\text{Sendout} = 1.858 + 5.874\,\text{DHD} + 1.405\,\text{DHDLag} + 1.315\,\text{Windspeed} - 15.857\,\text{Weekend}
\]

with R² = .952. All the coefficients, with the exception of the intercept, are significant, and it looks like we have a very good fit. (The intercept term could be dropped. When this is done, the results do not change substantially.) However, if we calculate the correlation of the residuals that are adjacent in time, the lag 1 autocorrelation, we get

\[
\text{lag 1 autocorrelation} = r_1(\hat{\varepsilon}) = \frac{\sum_{j=2}^{n}\hat{\varepsilon}_j\hat{\varepsilon}_{j-1}}{\sum_{j=1}^{n}\hat{\varepsilon}_j^2} = .52
\]
The value, .52, of the lag 1 autocorrelation is too large to be ignored. A plot of the residual autocorrelations for the first 15 lags shows that there might also be some dependence among the errors 7 time periods, or one week, apart. This amount of dependence invalidates the t-tests and P-values associated with the coefficients in the model.

The first step towards correcting the model is to replace the presumed independent errors in the regression model for sendout with a possibly dependent series of noise terms N_j. That is, we formulate a regression model for the N_j in which we relate each N_j to its previous value N_{j−1}, its value one week ago, N_{j−7}, and an independent error ε_j. Thus, we consider

\[
N_j = \phi_1 N_{j-1} + \phi_2 N_{j-7} + \varepsilon_j
\]

where the ε_j are independent normal random variables with mean 0 and variance σ². The form of the equation for N_j is known as an autoregressive model. (See [7].)

The SAS commands and part of the output from fitting this combined regression model for sendout with an autoregressive model for the noise are shown in Panel 7.3 on page 444. The fitted model is

\[
\text{Sendout} = 2.130 + 5.810\,\text{DHD} + 1.426\,\text{DHDLag} + 1.207\,\text{Windspeed} - 10.109\,\text{Weekend}
\]

and the time dependence in the noise terms is estimated by

\[
N_j = .470N_{j-1} + .240N_{j-7} + \varepsilon_j
\]

The variance of ε is estimated to be σ̂² = 228.89.

From Panel 7.3, we see that the autocorrelations of the residuals from the enriched model are all negligible. Each is within two estimated standard errors of 0. Also, a weighted sum of squares of residual autocorrelations for a group of consecutive lags is not large as judged by the P-value for this statistic. That is, there is no reason to reject the hypothesis that a group of consecutive autocorrelations are simultaneously equal to 0. The groups examined in Panel 7.3 are those for lags 1-6, 1-12, 1-18, and 1-24.

The noise is now adequately modeled. The tests concerning the coefficient of each predictor variable, the significance of the regression, and so forth, are now valid.⁸ The intercept term in the final model can be dropped. When this is done, there is very little change in the resulting model. The enriched model has better forecasting potential and can now be used to forecast sendout of natural gas for given values of the predictor variables. We will not pursue prediction here, since it involves ideas beyond the scope of this book. (See [7].) •

⁸These tests are obtained by the extra sum of squares procedure but applied to the regression plus autoregressive noise model. The tests are those described in the computer output.
PANEL 7.3  SAS ANALYSIS FOR EXAMPLE 7.16 USING PROC ARIMA

PROGRAM COMMANDS

data a;
  infile 'T7-4.dat';
  time = _n_;
  input obsend dhd dhdlag wind xweekend;
proc arima data = a;
  identify var = obsend crosscor = ( dhd dhdlag wind xweekend );
  estimate p = (1 7) method = ml input = ( dhd dhdlag wind xweekend ) plot;
  estimate p = (1 7) noconstant method = ml input = ( dhd dhdlag wind xweekend ) plot;

OUTPUT

ARIMA Procedure
Maximum Likelihood Estimation

Parameter   Estimate   Approx. Std Error   T Ratio   Lag   Shift   Variable
MU             2.130        13.12340         0.16      0     0     OBSEND
AR1,1           .470          .11779         3.99      1     0     OBSEND
AR1,2           .240          .11528         2.08      7     0     OBSEND
NUM1           5.810          .24047        24.16      0     0     DHD
NUM2           1.426          .24932         5.72      0     0     DHDLAG
NUM3           1.207          .44681         2.70      0     0     WIND
NUM4         -10.109         6.03445        -1.68      0     0     XWEEKEND

Constant Estimate      0.61770069
Variance Estimate      228.894
Std Error Estimate     15.129244
AIC                    1528.49032
SBC                    1543.492264
Number of Residuals =  63

Autocorrelation Check of Residuals

To Lag   Chi Square   DF   Autocorrelations
   6        6.04       4    0.079   0.012   0.022   0.192  -0.127   0.161
  12       10.27      10    0.144  -0.067  -0.111  -0.056  -0.056  -0.108
  18       15.92      16    0.013   0.106  -0.137  -0.170  -0.079   0.018
  24       23.44      22    0.018   0.004   0.250  -0.080  -0.069  -0.051

Autocorrelation Plot of Residuals

Lag   Covariance     Correlation
  0    228.894         1.00000
  1     18.194945      0.07949
  2      2.763255      0.01207
  3      5.038727      0.02201
  4     44.059835      0.19249
  5    -29.118892     -0.12722
  6     36.904291      0.16123
  7     33.008858      0.14421
  8    -15.424015     -0.06738
  9    -25.379057     -0.11088
 10    -12.890888     -0.05632
 11    -12.777280     -0.05582
 12    -24.825623     -0.10846
 13      2.970197      0.01298
 14     24.150168      0.10551
 15    -31.407314     -0.13721

(The printed output also shows a bar chart of these autocorrelations; "." marks two standard errors.)
When modeling relationships using time ordered data, regression models with noise structures that allow for the time dependence are often useful. Modern software packages, like SAS, allow the analyst to easily fit these expanded models.
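The book's analysis uses SAS PROC ARIMA (Panel 7.3). The diagnostic step of Example 7.16, fitting ordinary least squares and checking the lag 1 residual autocorrelation, can be sketched in Python as follows. This assumes T7-4.dat contains plain whitespace-separated columns in the order of the SAS input statement (sendout, DHD, DHDLag, windspeed, weekend); it illustrates the diagnostic only and does not replace the autoregressive-error fit.

import numpy as np

# Diagnostic sketch for time dependent errors (not the SAS analysis in Panel 7.3).
data = np.loadtxt("T7-4.dat")
y, Z = data[:, 0], data[:, 1:]

# Ordinary least squares fit of sendout on the four predictors plus an intercept
X = np.column_stack([np.ones(len(y)), Z])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Lag 1 autocorrelation of the residuals, as in Example 7.16;
# a value near .5 signals that an independent-errors model is inadequate.
r1 = np.sum(resid[1:] * resid[:-1]) / np.sum(resid**2)
print(beta_hat, r1)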
SUPPLEMENT 7A

The Distribution of the Likelihood Ratio for the Multivariate Multiple Regression Model

The development in this supplement establishes Result 7.11.

We know that nΣ̂ = Y′(I − Z(Z′Z)⁻¹Z′)Y and, under H₀, nΣ̂₁ = Y′[I − Z₁(Z₁′Z₁)⁻¹Z₁′]Y with Y = Z₁β^(1) + ε. Set P = [I − Z(Z′Z)⁻¹Z′]. Since

\[
0 = [I - Z(Z'Z)^{-1}Z']Z = [I - Z(Z'Z)^{-1}Z'][Z_1 \mid Z_2] = [PZ_1 \mid PZ_2]
\]

the columns of Z are perpendicular to P. Thus, we can write

\[
n\hat{\Sigma} = (Z\beta + \varepsilon)'P(Z\beta + \varepsilon) = \varepsilon'P\varepsilon, \qquad
n\hat{\Sigma}_1 = (Z_1\beta^{(1)} + \varepsilon)'P_1(Z_1\beta^{(1)} + \varepsilon) = \varepsilon'P_1\varepsilon
\]

where P₁ = I − Z₁(Z₁′Z₁)⁻¹Z₁′. We then use the Gram-Schmidt process (see Result 2A.3) to construct the orthonormal vectors [g₁, g₂, ..., g_{q+1}] = G from the columns of Z₁. Then we continue, obtaining the orthonormal set from [G, Z₂], and finally complete the set to n dimensions by constructing an arbitrary orthonormal set of n − r − 1 vectors orthogonal to the previous vectors. Consequently, we have

g₁, g₂, ..., g_{q+1} (from columns of Z₁); g_{q+2}, g_{q+3}, ..., g_{r+1} (from columns of Z₂ but perpendicular to columns of Z₁); g_{r+2}, g_{r+3}, ..., g_n (an arbitrary set of orthonormal vectors orthogonal to the columns of Z).

Let (λ, e) be an eigenvalue-eigenvector pair of Z₁(Z₁′Z₁)⁻¹Z₁′. Then, since [Z₁(Z₁′Z₁)⁻¹Z₁′][Z₁(Z₁′Z₁)⁻¹Z₁′] = Z₁(Z₁′Z₁)⁻¹Z₁′, it follows that

\[
\lambda e = Z_1(Z_1'Z_1)^{-1}Z_1'e = (Z_1(Z_1'Z_1)^{-1}Z_1')^2 e = \lambda\,Z_1(Z_1'Z_1)^{-1}Z_1'e = \lambda^2 e
\]

and the eigenvalues of Z₁(Z₁′Z₁)⁻¹Z₁′ are 0 or 1. Moreover, tr(Z₁(Z₁′Z₁)⁻¹Z₁′) = tr((Z₁′Z₁)⁻¹Z₁′Z₁) = tr(I_{(q+1)×(q+1)}) = q + 1 = λ₁ + λ₂ + ··· + λ_{q+1}, where λ₁ ≥ λ₂ ≥ ··· ≥ λ_{q+1} > 0 are the eigenvalues of Z₁(Z₁′Z₁)⁻¹Z₁′. This shows that Z₁(Z₁′Z₁)⁻¹Z₁′ has q + 1 eigenvalues equal to 1. Now, (Z₁(Z₁′Z₁)⁻¹Z₁′)Z₁ = Z₁, so any linear combination Z₁b_ℓ of unit length is an eigenvector corresponding to the eigenvalue 1. The orthonormal vectors g_ℓ, ℓ = 1, 2, ..., q + 1, are therefore eigenvectors of Z₁(Z₁′Z₁)⁻¹Z₁′, since they are formed by taking particular linear combinations of the columns of Z₁. By the spectral decomposition (2-16), we have Z₁(Z₁′Z₁)⁻¹Z₁′ = Σ_{ℓ=1}^{q+1} g_ℓ g_ℓ′. Similarly, by writing (Z(Z′Z)⁻¹Z′)Z = Z, we readily see that the linear combination Zb_ℓ = g_ℓ, for example, is an eigenvector of Z(Z′Z)⁻¹Z′ with eigenvalue λ = 1, so that Z(Z′Z)⁻¹Z′ = Σ_{ℓ=1}^{r+1} g_ℓ g_ℓ′.

Continuing, we have PZ = [I − Z(Z′Z)⁻¹Z′]Z = Z − Z = 0, so g_ℓ = Zb_ℓ, ℓ ≤ r + 1, are eigenvectors of P with eigenvalue λ = 0. Also, from the way the g_ℓ, ℓ > r + 1, were constructed, Z′g_ℓ = 0, so that Pg_ℓ = g_ℓ. Consequently, these g_ℓ's are eigenvectors of P corresponding to the n − r − 1 unit eigenvalues. By the spectral decomposition (2-16), P = Σ_{ℓ=r+2}^{n} g_ℓ g_ℓ′ and

\[
n\hat{\Sigma} = \varepsilon'P\varepsilon = \sum_{\ell=r+2}^{n}(\varepsilon'g_\ell)(\varepsilon'g_\ell)' = \sum_{\ell=r+2}^{n}V_\ell V_\ell'
\]

where V_ℓ = ε′g_ℓ = [g_ℓ′ε_(1), ..., g_ℓ′ε_(m)]′ = [V_{ℓ1}, ..., V_{ℓm}]′. Because Cov(V_{ℓi}, V_{jk}) = E(g_ℓ′ε_(i)ε_(k)′g_j) = σ_{ik} g_ℓ′g_j = 0 for ℓ ≠ j, the V_ℓ are independently distributed as N_m(0, Σ). Consequently, by (4-22), nΣ̂ is distributed as W_{p,n−r−1}(Σ). In the same manner,

\[
P_1 g_\ell = \begin{cases} g_\ell & \ell > q + 1 \\ 0 & \ell \leq q + 1 \end{cases}
\]

so P₁ = Σ_{ℓ=q+2}^{n} g_ℓ g_ℓ′. We can write the extra sum of squares and cross products as

\[
n(\hat{\Sigma}_1 - \hat{\Sigma}) = \varepsilon'(P_1 - P)\varepsilon = \sum_{\ell=q+2}^{r+1}(\varepsilon'g_\ell)(\varepsilon'g_\ell)' = \sum_{\ell=q+2}^{r+1}V_\ell V_\ell'
\]

where the V_ℓ are independently distributed as N_m(0, Σ). By (4-22), n(Σ̂₁ − Σ̂) is distributed as W_{p,r−q}(Σ) independently of nΣ̂, since n(Σ̂₁ − Σ̂) involves a different set of independent V_ℓ's.

The large sample distribution for −[n − r − 1 − ½(m − r + q + 1)] ln(|Σ̂|/|Σ̂₁|) follows from Result 5.2, with ν − ν₀ = m(m + 1)/2 + m(r + 1) − m(m + 1)/2 − m(q + 1) = m(r − q) d.f. The use of (n − r − 1 − ½(m − r + q + 1)) instead of n in the statistic is due to Bartlett [3] following Box [6], and it improves the chi-square approximation.
EXERCISES
7.1.
Given the data Zj
I
10
5 9
7 3
19 25
8 13
11 7 +
fit the linear regression model Yj = {30 + f3J.. zj l ej , j = 1 , 2, . . . , 6. Specif ically, calculate the least squares estimates {3, the fitted values y, the residuals e, and the residual sum of squares, e ' e. 7.2. Given the data zZ I y
2
5 3 9
10 2 15
7 3 3
19
6
25
11 7 7
18 9 13
fit the regression model j = 1, 2 , ' 6. to the standardized form (see page 439) of the variables z , and z • From this fit, deduce the corresponding fitted regression equation 1for the 2original (not standardized) variables. 7.3. (Weighted least squares estimators.) Let 0 0 0
y,
y
(n X I)
=
+ e z (n X (r + l)) ((r+ l) X l ) (n X I)
where E ( e) = 0 but E ( ee' ) = u2 V, with V (n n) known and positive definite. For V of full rank, show that the weighted least squares estimator is X
If u2 is unknown,1 it may be estimated, unbiasedly, by (n - r - 1 ) - 1 (Y - ZP w) ' V - (Y - ZP w) · Hint: y-I /2 Y = cv - I !Z z ) f3 + v - l/Z e is of the classical linear regression form Y * = z * p + e* , with E ( e* ) = 0 and E ( e* e * ' ) = u 2 1. Thus, P = p * = * * -1 X
w
(Z z ) Z *' Y *.
Chap.
7
Exercises
449
Use the weighted least squares estimator in Exercise= 7. 3 to derive= an expres sion for the estimate of2the slope f3 in the model Y zi + ei , j= 1,2 2, ... , n, u zJ . Com when (a) Var(ei ) u , (b) Var(ei ) = u2 zi , andi(c) f3Var(e) ment on the manner ill which the unequal variances for the errors influence the optimal choice of f3w · 7.5. Establish (7-56): Ph z ) = 1 - 1 / Hint: From (7-55) and Exercise 4.9.
7.4.
=
p Y Y.
From result 2A. 8(c) , uh I Iz z i / J i i , where uYY is the entry of I - 1 in the first- 1 row and first column. Since (see Exercise 2.23) p = v- 112I V- 112 and p- 1 = ( V-= 1 12::£v- 1 12) -1 = V112::£ - 1 V 1 12, the entry in the (1, 1) position of p is 7.6. (Generalized inverse ofZ'Z) A matrix (Z'Z) - is called a generalized inverse of Z'Z if Z'Z(Z'Z)-Z'Z = Z'Z. Let r1 + 1 = rank(Z) and suppose A 1 A2 A, 0 are the nonzero eigenvalues of Z' Z with corresponding eigenvectors+ 1 e>1 , e 2 , . .. , e,, + (a) Show that +I (Z'Z) - i�= l Aj 1 e; e; is a generalized inverse of Z' Z. (b) The coefficients jJ that minimize the sum of squared errors (y - Z /3 )' (y - Z /3 ) satisfy the normal eJuations (Z'Z) P = Z'y. Show that these equations are satisfied for any f3 such that Z /3 is the projection of y on the columns of Z. (c) Show that z p Z(Z'Z)-Z'y is the projection of y on the columns of Z. (See Footnote 2 in this chapter.) (d) Show directly that jJ (Z'Z) - Z'y is a solution to the normal equations (Z'Z) [(Z'Z) - Z'y] = Z'y. Hint: (b) If z p is the projection, then y - z p is perpendicular to the columns of Z. (d) The eigenvalue-eigenvector requirement implies that (Z' Z) (Aj 1 e;) = e; for i r11 + 1 and= 0 = e; (Z' Z) e; for > r1 + 1. Therefore, (Z'Z) (Aj e;)e;z' e;e;z'. Summing over i gives =
pYY
;;;;.
;;;;. · • · ;;;;.
u Y Y CT y y ·
'
1 •
=
r1
=
=
:s;;
i
=
since e;z' = 0 for i > r1 + 1.
IZ' = Z'
450
Chap.
7
M ultivariate Linear Regression Models 7.7.
Suppose the classical regression model is, with rank (Z) = r + 1, written as Y = Z1 /J(2) + fJ(l) + Z 2 (n X l) (n X (q + 1)) ((q + 1 ) X 1) (n X ( r - q)) (( r - q) X 1 ) (n X l) where rank(Z1 ) = q + 1 and rank(Z2) = r - q. If the parameters {J(2) are identified beforehand as being of primary interest, show that a 100(1 - a) % confidence region for {J(2) is given by 6
( P <2> - {J(2) ) ' ( Z � Z 2 - Z � Z1 ( z ; z 1 ) - 1 Z ; z 2 ] ( p <2> - fJ<2> ) .:;;; s 2 (r - q ) Fr - q. n - r - 1 (a)
By Exercise 4.10, with 1 's and 2's interchanged, [ e 2n c 221 2 ] C 22 = [ Z 2' Z 2 - Z 2' Z 1 (Z 1' Z 1 ) - 1 Z 1' z 2] - 1 -1 = where (Z' ) z ' C 1 C Multiply by the square-root matrix ( C 22 ) - 1 12 , and conclude that 22 2 2 ( C ) - 1 1 ( p (2) - {J(2) ) / u is N(O, 1) , so that ( P <2> - fJ<2> ) ' ( C 22 ) - 1 ( p <2> - fJ<2> ) is u 2 x;- q . 7.8. Recall that the hat matrix is defined by H = Z (Z' Z ) - 1 Z' with diagonal ele ments hjj . (a) Show that H is an idempotent matrix. [See Result 7.1 and ( 7 -6 ) ] . n (b) Show that 0 < hjj < 1, j = 1, 2, . . . , n, and that � hjj = r + 1, where r j= l is the number of independent variables in the regression model. (In fact, (1/n) hjj < 1.) (c) Verify, for the simple linear regression model with one independent vari able z, that the leverage, hjj• is given by Hint:
.:;;;
(zj - zY 1 h". . = + n --'-n � (zj - z ) 2 j= l -
7.9.
-
---
Consider the following data on one predictor variable' z1 and two responses Y1 and Y2 : -2 - 1 0 1 2 2 1 4 5 3 Y1 3 2 1 3 1 Y2 Determine the least squares estimates of the parameters in the straight-line regression model Yj l = f3o1 + {3 1 1 Zj t + ej 1 lf2 = f3o2 + f31 2Zj l + ej 2 • j = 1, 2, 3, 4, 5
Chap.
7
Exercises
451 E
Also, calculate the matrices of fitted values Y and residuals with Y = [y1 l y2 ] . Verify the sum of squares and cross products decomposition �
E'E
Y' Y = Y ' Y +
Using the results from Exercise 7.9, calculate each of the following. (a) A 95% confidence interval for the mean response E ( Y01) = /30 1 + /31 1 z0 1 corresponding to z01 = 0.5. (b) A 95% prediction interval for the response Y01 corresponding to z01 = 0.5. (c) A 95% prediction region for the responses Y0 1 and Y0 2 corresponding to z0 1 = 0.5. 7.11. (Generalized least squares for multivariate multiple regression.) Let A be a positive definite matrix, so that dJ (B) = (yj - B' zj ) ' A (yj - B' z) is a squared statistical distance f.rom the jth- 1observation yj to its regression B' zj . Show that the choice B = fJ = (Z' Z) Z' Y minimizes the sum of squared statistical distances, j=Ll df (B), for any choice of positive definite A. Choices for A include I - t and I. Hint: Repeat the steps in (7-40) and (7-41) with I - 1 replaced by A. 7.U. Given the mean vector and covariance matrix of Y, Z1 , and Z2 , 7.10.
n
determine each of the following. (a) The best linear predictor {30 + {31 Z1 + {32 Z2 of Y. (b) The mean square error of the best linear predictor. (c) The population multiple correlation coefficient. (d) The partial correlation coefficient p z z 7.13. The test scores for college students described in Example 5.5 have
[ zz1 ] [ ] [ y
z =
2
:z3
=
527.74 54.69 , 25.13
s =
I
.
2
.
5691.34 600.51 126.05 217.25 23.37 23.11
]
Assume joint normality. (a) Obtain the maximum likelihood estimates of the parameters for predict ing zl from z2 and z3 (b) Evaluate the estimated multiple correlation coefficient Rz z (c) Determine the estimated partial correlation coefficient R z 7.14. Twenty-five portfolio managers were evaluated in terms of their performance. Suppose Y represents the rate of return achieved over a period of time, Z1 is the manager's attitude toward risk measured on a five-point scale 0
1 (Z2'
3) ·
1 ' z2 · z3 .
452
Chap. 7 Multivariate Linear Regression Models
from "very conservative" to "very risky," and Z is years of experience in the investment business. The observed correlation 2coefficients between pairs of variables are Zz zl .82 1 .0 - .35 R = - .35 1.0 - .60 . 82 - .60 1.0 (a) Interpret the sample correlation coefficients ryz1 = - .35 and ryz2 = - . 8 2. (b) Calculate the partial correlation coefficient ryz1 • z2 and interpret this quantity with respect to the interpretation provided for ryz1 in Part a.
[
y
]
The following exercises may require the use of a computer.
Use the real-estate data in Table 7.1 and the linear regression model in Example 7.4. (a) Verify the results in Example 7.4. (b) Analyze the residuals to check the adequacy of the model. (See Section 7.6. ) (c) Generate a 95% prediction interval for the selling price (Y0 ) corresponding to total dwelling size z 1 = 17 and assessed value z2 = 46. (d) Carry out a likelihood ratio test of H0 : {3 2 = 0 with a significance level of a = .05. Should the original model be modified? Discuss. 7.16. Calculate a CP plot corresponding to the possible linear regressions involving the real-estate data in Table 7.1. 7.17. Consider the Fortune 500 data in Exercise 1 .4. (a) Fit a linear regression model to these data using profits as the dependent variable and sales and assets as the independent variables. (b) Analyze the residuals to check the adequacy of the model. Compute the leverages associated with the data points. Does one (or more) of these companies stand out as an outlier in the set of independent variable data points? (c) Generate a 95% prediction interval for profits corresponding to sales of 40,000 (millions of dollars) and assets of 70,000 (millions of dollars). (d) Carry out a likelihood ratio test of H0 : {32 = 0 with a significance level of a = .05. Should the original model be modified? Discuss. 7.18. Calculate a CP plot corresponding to the possible regressions involving the Fortune 500 data in Exercise 1.4. 7.19. Satellite applications motivated the development of a silver-zinc battery. Table 7.4 contains failure data collected to characterize the performance of the battery during its life cycle. Use these data. (a) Find the estimated linear regression of ln(Y) on an appropriate ("best") subset of predictor variables. (b) Plot the residuals from the fitted model chosen in Part a to check the nor mal assumption. 7.15.
Chap.
7
Exercises
BATIERY-FAI LU RE DATA
TABLE 7.4
zl
z3
453 y
z4
EndZs of Depth of charge Charge Discharge discharge to (% of rated Temperature voltage Cycles rate rate (volts) failure CCC) (amps) (amps) ampere-hours) 101. 2.00 40. 60.0 .375 3.13 141. 1.99 30. 76.8 1.000 3.13 96. 2.00 20. 60.0 1.000 3.13 125. 1. 98 20. 60.0 1.000 3.13 43. 2. 0 1 10. 43.2 1.625 3.13 16. 2. 0 0 20. 60.0 1.625 3.13 188. 2. 0 2 20. 60.0 1.625 3.13 10. 2. 0 1 10. 76.8 .375 5.00 3. 1.99 10. 43.2 1.000 5.00 386. 2. 0 1 30. 43.2 1.000 5.00 45. 2.00 100.0 20. 1.000 5.00 2. 1. 99 10. 76.8 1.625 5.00 2. 0 1 76. 10. 76.8 .375 1.25 1.000 1.25 78. 43.2 1. 99 10. 1.000 1.25 76.8 30. 2.00 160. 1.000 1.25 3. 2.00 60.0 0. 216. 43.2 1.625 1.25 30. 1. 99 73 . 2.00 20. 1.625 1.25 60.0 314. .375 3.13 30. 76.8 1.99 .375 3.13 20. 60.0 170. 2.00 Zz
Source: Selected from S. Sidik, H. Leibecki, and J. Bozek,NASA Technical Memorandum 81556 Cleveland: Lewis Research Center, 1980).
Failure of Silver-Zinc Cells
with Competing Failure Modes-Preliminary Data Analysis,
(
Using the battery-failure data in Table 7.4, regress ln(Y) on the first princi pal component of the predictor variables z1 , z2 , , z (See Section 8.3.) Compare the result with the fitted model obtained in Exercise 7.19(a). 7.21. Consider the air-pollution data in Table 1. 3 . Let Y1 = N02 and Y2 = 03 be the two responses (pollutants) corresponding to the predictor variables Z1 = wind and z2 = solar radiation. (a) Perform a regression analysis using only the first response Y (i) Suggest and fit appropriate linear regression models. (ii) Analyze the residuals. (iii) Construct a 95% prediction interval for N02 corresponding to z1 = 10 and z 2 = 80.
7.20.
• • •
5
•
1
•
454
Chap.
7
Multivariate Linear Regression Models
Perform a multivariate multiple regression analysis using both responses (i) Suggest and fit appropriate linear regression models. (ii) Analyze the residuals. (iii) Construct a 95% prediction ellipse for both N02 and 03 for z 1 = 10 and z2 = 80. Compare this ellipse with the prediction interval in Part a (iii). Comment. 7.22. Using the data on bone mineral content in Table 1 .6: (a) Perform a regression analysis by fitting the response for the dominant radius bone to the measurements on the last four bones. (i) Suggest and fit appropriate linear regression models. (ii) Analyze the residuals. (b) Perform a multivariate multiple regression analysis by fitting the responses from both radius bones. 7.23. Using the data on the characteristics of bulls sold at auction in Table 1.8: (a) Perform a regression analysis using the response Y1 = SalePr and the pre dictor variables Breed, YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt. (i) Determine the "best" regression equation by retaining only those predictor variables that are individually significant. (ii) Using the best fitting model, construct a 95% prediction interval for selling price for a set of predictor variable values that are not in the original data set. (iii) Examine the residuals from the best fitting model. (b) Repeat the analysis in Part a, using the natural logarithm of the sales price as the response. That is, set Y1 = Ln(SalePr). Which analysis do you prefer? Why? 7.24. Using the data on the characteristics of bulls sold at auction in Table 1.8: (a) Perform a regression analysis, using only the response Y1 SaleHt and the predictor variables Z1 = YrHgt and Z2 = FtFrBody. (i) Fit an appropriate model and analyze the residuals. (ii) Construct a 95% prediction interval for SaleHt corresponding to z 1 = 50.5 and z 2 = 970. (b) Perform a multivariate regression analysis with the responses Y1 = SaleHt and Y2 = SaleWt and the predictors Z1 = YrHgt and Z2 = FtFrBody. (i) Fit an appropriate multivariate model and analyze the residuals. (ii) Construct a 95% prediction ellipse for both SaleHt and SaleWt for z 1 = 50.5 and z 2 = 970. Compare this ellipse with the prediction interval in Part a (ii). Comment. 7.25 Amitriptyline is prescribed by some physicians as an antidepressant. How ever, there are also conjectured side effects that seem to be related to the use of the drug: irregular heartbeat, abnormal blood pressures, and irregular waves on the electrocardiogram, among other things. Data gathered on 17 (b)
Y1 and Y2•
=
Chap.
7
Exercises
455
TABLE 7.5 AM ITRI PlYLI N E DATA
Yz AMI 3149 653 810 448 844 1450 493 941 547 392 1283 458 722 384 501 405 1520 Source: See [20].
YJ TOT 3389 1101 1131 596 896 1767 807 1111 645 628 1360 652 860 500 781 1070 1754
z GENl 1 1 0 1 1 1 1 0 1 1 1 1 1 0 0 0 1
Zz AMT 7500 1975 3600 675 750 2500 350 1500 375 1050 3000 450 1750 2000 4500 1500 3000
z3 Z4 Zs QRS PR DIAP 220 0 140 200 0 100 205 60 111 160 60 120 83 185 70 180 60 80 154 80 98 93 200 70 137 60 105 74 167 60 180 60 80 160 64 60 135 90 79 160 60 80 180 0 100 170 90 120 180 0 129
patients who were admitted to the hospital after an amitriptyline overdose are given in Table 7. 5 . The two response variables are Y1 = Total TCAD plasma level (TOT) Y2 = Amount of amitriptyline present in TCAD plasma level (AMI) The five predictor variables are Z1 = Gender: 1 if female, 0 if male (GEN) Z2 = Amount of antidepressants taken at time of overdose (AMT) Z3 = PR wave measurement (PR) Z4 = Diastolic blood pressure (DIAP) Z5 = QRS wave measurement (QRS) (a) Perform a regression analysis using only the first response Y1 • (i) Suggest and fit appropriate linear regression models. (ii) Analyze the residuals. (iii) Construct a 95% prediction interval for Total TCAD for z1 = 1, z2 = 1200, z3 = 140, z4 = 70, and z5 = 85.
456
Chap.
7
M ultivariate Linear Regression Models
(b) (c)
Repeat Part a using the second response Y2 • Perform a multivariate multiple regression analysis using both responses Y1 and Y2 • (i) Suggest and fit appropriate linear regression models. (ii) Analyze the residuals. (iii) Construct a 95% prediction ellipse for both Total TCAD and Amount of amitriptyline for z 1 = 1, z2 = 1200, z3 = 140, z4 = 70, and = 85. Compare this ellipse with the prediction intervals in Partsz5a and b. Comment.
REFERENCES

1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Atkinson, A. C. Plots, Transformations and Regression. Oxford, England: Oxford University Press, 1985.
3. Bartlett, M. S. "A Note on Multiplying Factors for Various Chi-Squared Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.
4. Belsley, D. A., E. Kuh, and R. E. Welsch. Regression Diagnostics. New York: John Wiley, 1980.
5. Bowerman, B. L., and R. T. O'Connell. Linear Statistical Models: An Applied Approach (2d ed.). Boston: PWS-Kent, 1990.
6. Box, G. E. P. "A General Distribution Theory for a Class of Likelihood Criteria." Biometrika, 36 (1949), 317-346.
7. Box, G. E. P., G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting and Control (3d ed.). Englewood Cliffs, NJ: Prentice Hall, 1994.
8. Chatterjee, S., and B. Price. Regression Analysis by Example. New York: John Wiley, 1977.
9. Cook, R. D., and S. Weisberg. Residuals and Influence in Regression. London: Chapman and Hall, 1982.
10. Daniel, C., and F. S. Wood. Fitting Equations to Data (2d ed.). New York: John Wiley, 1980.
11. Draper, N. R., and H. Smith. Applied Regression Analysis (2d ed.). New York: John Wiley, 1981.
12. Durbin, J., and G. S. Watson. "Testing for Serial Correlation in Least Squares Regression, II." Biometrika, 38 (1951), 159-178.
13. Galton, F. "Regression Towards Mediocrity in Hereditary Stature." Journal of the Anthropological Institute, 15 (1885), 246-263.
14. Goldberger, A. S. Econometric Theory. New York: John Wiley, 1964.
15. Heck, D. L. "Charts of Some Upper Percentage Points of the Distribution of the Largest Characteristic Root." Annals of Mathematical Statistics, 31 (1960), 625-642.
16. Neter, J., M. Kutner, C. Nachtsheim, and W. Wasserman. Applied Linear Regression Models (3d ed.). Chicago: Richard D. Irwin, 1996.
17. Pillai, K. C. S. "Upper Percentage Points of the Largest Root of a Matrix in Multivariate Analysis." Biometrika, 54 (1967), 189-193.
18. Rao, C. R. Linear Statistical Inference and Its Applications (2d ed.). New York: John Wiley, 1973.
19. Seber, G. A. F. Linear Regression Analysis. New York: John Wiley, 1977.
20. Rudorfer, M. V. "Cardiovascular Changes and Plasma Drug Levels after Amitriptyline Overdose." Journal of Toxicology-Clinical Toxicology, 19 (1982), 67-71.
21. Timm, N. H. Multivariate Analysis with Applications in Education and Psychology. Monterey, CA: Brooks/Cole, 1975.
CHAPTER 8

Principal Components

8.1 INTRODUCTION
A principal component analysis is concerned with explaining the variance-covari ance structure of a set of variables through a few linear combinations of these vari ables. Its general objectives are (1) data reduction and (2) interpretation. Although p components are required to reproduce the total system variabil ity, often much of this variability can be accounted for by a small number k of the principal components. If so, there is (almost) as much information in the k com ponents as there is in the original p variables. The k principal components can then replace the initial p variables, and the original data set, consisting of n measure ments on p variables, is reduced to a data set consisting of n measurements on k principal components. An analysis of principal components often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result. A good example of this is provided by the stock market data discussed in Example 8.5. Analyses of principal components are more of a means to an end rather than an end in themselves, because they frequently serve as intermediate steps in much larger investigations. For example, principal components may be inputs to a mul tiple regression (see Chapter 7) or cluster analysis (see Chapter 12). Moreover, (scaled) principal components are one "factoring" of the covariance matrix for the factor analysis model considered in Chapter 9. 8.2 POPULATION PRINCIPAL COMPONENTS
Algebraically, principal components are particular linear combinations of the p random variables X1 , X2 , XP . Geometrically, these linear combinations repre sent the selection of a new coordinate system obtained by rotating the original sys. . • ,
458
Sec.
8.2
Population Principal Components
459
tern with X1 , X2 , . . . , XP as the coordinate axes. The new axes represent the direc tions with maximum variability and provide a simpler and more parsimonious description of the covariance structure. As we shall see, principal components depend solely on the covariance matrix I (or the correlation matrix p) of X1 , X2 , . . . , XP . Their development does not require a multivariate normal assumption. On the other hand, principal compo nents derived for multivariate normal populations have useful interpretations in terms of the constant-density ellipsoids. Further, inferences can be made from the sample components when the population is multivariate normal. (See Section 8. 5 . ) Let the random vector X' = [X1 , X , , X ] have the covariance matrix I with eigenvalues A1 ;;:;. A2 ;;:;. • AP 0.2 p Consider the linear combinations Y1 = a � X = a1 1 X1 + a1 2X2 + · · · + a1 P XP Y2 = a ; x = a 2 1X1 + a 22X2 + · · · + a 2P XP (8-1) • • •
• •
;;:;.
;;:;.
Then, using (2-45), we obtain Var(Y;) = a;Ia; i = 1 , 2, . . . , p (8-2) Cov(Y;, Yk) = a;Iak i, k = 1 ' 2, . . . ' p (8-3) The principal components are those uncorrelated linear combinations Y1 , Y2 , ••• , YP whose variances in (8-2) are as large as possible. The first principal component is the linear combination with maximum vari ance. That is, it maximizes Var(Y1 ) = a{Ia1 • It is clear that Var(Y1 ) = a{Ia1 can be increased by multiplying any a1 by some constant. To eliminate this inde terminacy, it is convenient to restrict attention to coefficient vectors of unit length. We therefore define First principal component = linear combination a{ X that maximizes Var (a{ X) subject to a{ a1 = 1 Second principal component = linear combination a{ X that maximizes Var (a� X) subject to a� a2 = 1 and Cov(a{X, a�X) = 0 At the ith step, ith principal component = linear combination a;x that maximizes Var(a;x) subject to a; a; = 1 and Cov(a;x, a�X) = 0 for k < i
460
Chap.
8
Principal Components
Result 8. 1 . Let I be the covariance matrix associated with the random vec tor X' = [X1 , X2 , , XP] . Let I have the eigenvalue-eigenvector pairs (A1 , e1 ) , A2 (A2 , e2 ) , , (Ap , ep ) where A 1 0 . Then the ith principal com AP ponent is given by Y; = e; x = ei l X1 + e;2X2 + · · · + e;P XP , i = 1, 2, . . , p (8-4) With these choices, Var(Y;) = e; Ie; = A; i = 1, 2, , p (8-5) Cov(Y;, Yk) = e; Iek = 0 i * k If some A; are equal, the choices of the corresponding coefficient vectors e;, and hence Y;, are not unique. Proof. We know from (2-51), with B = I, that a'Ia = Al (attained when a = e1 ) max a'a O * a But e{ e 1 = 1 since the eigenvectors are normalized. Thus, • . .
;a.
• • •
;a.
• • •
;a.
;a.
.
. . .
--
Similarly, using (2-52), we get = 1, 2, ... , p - 1 Forthechoicea = ek + 1 , with e� + 1 e; = O, for i = 1,2, ... , k andk = 1, 2, . , p - 1, e� + 1 Iek + t f e� + t ek + l = e� + 1 Iek + t = Var(Yk +t ) But e� + 1 (Iek + l ) = Ak + l e� + l ek + l = Ak + 1 so Var(Yk + d = Ak + t · It remains to show that e; perpendicular to ek (that is, ef ek = 0, i i= k) gives Cov(Y;, Yk) = 0. Now, the eigenvectors of I are orthogonal if all the eigenvalues A 1 , A 2 , , AP are distinct. If the eigenvalues are not all distinct, the eigenvectors corresponding to common eigenvalues may be chosen to be orthogonal. Therefore, for any two eigen vectors e; and ek , ef ek = 0, i i= k. Since Iek = Akek , premultiplication by e/ gives Cov(Y;, Yk) = ef iek = ef Akek = Ake/ ek = 0 for any i i= k, and the proof is complete. From Result 8.1, the principal components are uncorrelated and have vari ances equal to the eigenvalues of I. k
. .
. • .
•
Sec.
8.2
Population Principal Components
461
Result 8.2. Let X' = [X , X2 , , X ] have covariance matrix I, with eigen value-eigenvector pairs (A1 , e11 ) , (A2 , e2 ) ,p , ( Ap , ep ) where A 1 ;;;:.: A2 ;;;:.: AP ;;;:.: 0. Let Y1 = e{ X, Y2 = e� X, . . . , YP = e� X be the principal components. Then p p + uPP = :L Var(X;) = A 1 + A2 + + A P = :L Var(Y;) u 1 1 + u22 + i= l i= l Proof. From Definition 2A. 2 8, u + u22 + + u = tr(I). From (2-20) with A = I, we can write I = PAP'1 1 where A is the PPdiagonal matrix of eigenvalues and P = [e1 , e2 , , ep] so that PP' = P'P = I. Using Result 2A.12(c), we have tr(I) = tr(PAP') = tr(AP'P) = tr(A) = A 1 + A2 + + AP Thus, p p :L Var(X;) = tr(I) = tr(A) = :L Var(Y;) i= l i= l Result 8.2 says that Total population variance = u1 1 + u22 + + uPP (8-6) = A 1 + A2 + + AP and consequently, the proportion of total variance due to (explained by) the kth principal component is Proportion of total population variance = Ak due to kth principal A 1 + A2 + + AP k = l, 2, . . , p (8-7) component If most (for instance, 80 to 90%) of the total population variance, for large p, can be attributed to the first one, two, or three components, then these components can "replace" the original p variables without much losse of information. Each component of the coefficient vector e/ = [ i l , . . . , ei k • . . . , e;p ] also mer its inspection. The magnitude of eik measures the importance of the kth variable to the ith principal component, irrespective of the other variables. In particular, e; k is proportional to the correlation coefficient between Y; and Xk . Result 8.3. If Y = e{ X, Y2 = e� X, . . . , YP = e� X are the principal com ponents obtained from 1the covariance matrix I, then e . k \IA, i, k = 1, 2, . . . , p (8-8) P V CTkk ···
• • •
• • •
;;;:.:
···
···
···
• • •
···
•
(
···
)
Y;. X•
-
···
l l . �
···
.
462
Chap.
8
Principal Components
are the correlation coefficients between the components Y; and the variables Xk . Here (A 1 , e 1 ) , (A2 , e2 ) , , (Ap , ep ) are the eigenvalue-eigenvector pairs for I. Proof. Set a� = [0, ... , 0, 1, 0, ... , 0] so that Xk = a�X and Cov(Xk , Y;) = Cov( a� X, e; x) = a� Ie ; , according to (2-45). Since Ie ; = A;e;, Cov(Xk , Y;) a� A ; e ; = A;e;k · Then Var(Y;) = A; [see (8-5)] and Var(Xd = ukk yield Cov(Y;, Xk ) i, k = 1, 2, .. . ' p Pv,, x. _v'var(Y;) v'var(Xk ) . • •
•
Although the correlations of the variables with the principal components often help to interpret the components, they measure only the univariate contri bution of an individual X to a component Y. That is, they do not indicate the importance of an X to a component Yin the presence of the other X's. For this rea son, some statisticians (see, for example, Rencher [17]) recommend that only the coefficients ei k • and not the correlations, be used to interpret the components. Although the coefficients and the correlations can lead to different rankings as measures of the importance of the variables toc a given component, it is our expe rience that these rankings are often not appre iably different. In practice, variables with relatively large coefficients (in absolute value) tend to have relatively large correlations, so the two measures of importance, the first multivariate and the sec ond univariate, frequently give similar results. We recommend that both the coef ficients and the correlations be examined to help interpret the principal components. hypothetical example illustrates the contents of Results 8. 1 , 8.2, andThe8.following 3.
Exampl e
8. 1
(Calculating the population principal components)
Suppose the random variables X1 , X2 and X3 have the covariance matrix
It may be verified that the eigenvalue-eigenvector pairs are A1 =
5.83, A2 = 2. 0 0, A3 = 0. 1 7,
e{ =
[. 383, -.924, 0] e� = [0, 0, 1] e; = [. 924, . 3 83, 0]
Therefore, the principal components become
Sec.
8.2
Population Principal Components
463
Y1 = e{ X = .383X1 - . 924X2 The variable X is one of the principal components, because it is uncorrelated with the other 3two variables. Equation (8-5) can be demonstrated from first principles. For example, Var(Y1 ) = Var(.383X1 - . 924X2 ) = (. 383) 2 Var(X1 ) + (- . 924) 2 Var(X2 ) + 2(.383) (-. 924)Cov(X1 , X2 ) = .147(1) + . 854(5) - . 7 08( - 2) = 5. 83 = A 1 Cov ( Y1 , Y2 ) = Cov ( .383X1 - .924X2 , X3 ) = .383 Cov (X1 , X3 ) - . 924 Cov (X2 , X3 ) = . 3 83(0) - .924(0) = 0 It is also readily apparent that a11 + a22 + a33 = 1 + 5 + 2 = A 1 + A2 + A3 = 5.83 + 2.00 + .17 validating Equation (8-6) for this example. The proportion of total variance accounted for by the first principal component is A 1 /(A 1 + A 2 + A 3 ) = 5. 83/8 = .73. Further, the first two components account for a proportion (5. 83 + 2)/8 = . 98 ofthe population variance. In this case, the components Y1 and Y2 could replace the original three variables with little loss of information. Next, using (8-8), we obtain .383� = . 925 eu = Py,, x, = � 11 e 1 � = -. 924 V5.83 = 998 Py, , x, = 2vo:;; Vs · Notice here that the variable X , with coefficient -. 924, receives the greatest weight in the component Y1 .2 It also has the largest corr�lation (in absolute value) with Y1 . The correlation of X1 , with Y1 , .925, is almost as large as that for X2 , indicating that the variables are about equally important to the first principal component. The relative sizes of the coefficients of X1 ya::,
_
464
Chap.
8
Principal Components
and X2 suggest, however, that X2 contributes more to the determination of Y1 than does X1 • Since, in this case, both coefficients are reasonably large and they have opposite signs, we would argue that both variables aid in the inter pretation of Y1 • Finally, � V2 Py2· x, = p Y2. x2 = 0 and py2· x, = -�=\12 = 1 (as it should) The remaining correlations can be neglected, since the third component is • unimportant. It is informative to consider principal components derived from multivariate normal random variables. Suppose X is distributed as Np (p, I). We know from (4-7) that the density of X is constant on the p centered ellipsoids 1 2 ( x - p ) 1 I - ( x - p) = c which have axes ± c V"i:; ei, i = 1, 2, . . . , p, where the ( Ai , e i ) are the eigen value-eigenvector pairs of I. A point lying on the ith axis of the ellipsoid will have coordinates proportional to e; = [ei l , ei2 , . . . , eip ] in the coordinate system that has origin p and axes that are parallel to the original axes x1 , x2 , . . . , xP . It will be con venient to set p = 0 in the argument that follows. 1 1 From our discussion in Section 2.3 with A = I - , we can write 2 2 2 2 c = X1 I - 1 x = : ( e{ x ) + : ( e� x ) + + : ( e; x ) 2 p I where e{ x, e� x, . . . , e; x are recognized as the principal components of x. Setting y1 = e1 x , y2 = e2 x , . . . , yP = eP x , we h ave ···
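The eigenvalue-eigenvector calculations of Example 8.1 can be reproduced numerically. Below is a minimal NumPy sketch (not part of the text); note that eigenvectors are determined only up to sign.

import numpy as np

# Covariance matrix of Example 8.1
Sigma = np.array([[ 1.0, -2.0, 0.0],
                  [-2.0,  5.0, 0.0],
                  [ 0.0,  0.0, 2.0]])

# eigh returns eigenvalues in ascending order; reverse so lambda1 >= lambda2 >= lambda3
vals, vecs = np.linalg.eigh(Sigma)
vals, vecs = vals[::-1], vecs[:, ::-1]

print(vals)            # approx [5.83, 2.00, 0.17]
print(vecs[:, 0])      # first principal component coefficients, approx +/-[.383, -.924, 0]

# Correlations between Y_i and X_k: rho = e_ik * sqrt(lambda_i) / sqrt(sigma_kk)
rho = vecs * np.sqrt(vals) / np.sqrt(np.diag(Sigma))[:, None]
print(rho[:, 0])       # approx +/-[.925, -.998, 0]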
1
I
I
and this equation defines an ellipsoid (since A , A 2 , , AP are positive) in a coor dinate system with axes y1 , y2 , , Yp lying in the1 directions of e 1 , e 2 , , eP , respec tively. If A1 is the largest eigenvalue, then the major axis lies in the direction e 1 • The remaining minor axes lie in the directions defined by e2 , , eP . To summarize, the principal components y1 = e{ x, y2 = e� x, . . . , Y = e; x lie in the directions of the axes of a constant density ellipsoid. Therefore, anyp point on the i th ellipsoid axis has x coordinates proportional to e; = [ei ei2 , , eip ] and, necessarily, principal component coordinates of the form [0, . . . , 0, Yi • 0, . . . , 0] . When p 0, it is the mean-centered principal component y1 = e; (x - p) that has mean 0 and lies in the direction ei. • • •
• . .
• • •
• • •
1 ,
• • •
¥-
1
This can b e done without loss o f generality because the norma! random vector X can always be translated to the normal random vector W == X - p. and E (W) = 0 . However Cov(X) = Cov(W).
Sec.
8. 2
Population Principal Components
465
Figure 8.1 The constant density ellipse x ' l: - 1 x = c2 and the principal components y1 , y2 for a bivariate normal random vector X having mean 0.
Jt = O p = .75
A constant density ellipse and the principal components for a bivariate nor mal random vector with p 0 and .75 are shown in Figure 8.1. We see that the principal components are obtained by rotating the original coordinate axes through an angle ()until they coincide with the axes of the constant density ellipse. This result holds for p > 2 dimensions as well. =
p =
Principal Components Obtained from Standard ized Variables
Principal components may also be obtained for the standardized variables 22
In matrix notation,
=
� (Xz - J.Lz ) vo:;
(8-9)
(8-10) where the diagonal standard deviation matrix V 112 is defined in (2-35). Clearly, E(Z) = 0 and
466
Chap.
8
Principal Components
by (2-37). The principal components of Z may be obtained from the eigenvectors of the correlation matrix ρ of X. All our previous results apply, with some simplifications, since the variance of each Z_i is unity. We shall continue to use the notation Y_i to refer to the ith principal component and (λ_i, e_i) for the eigenvalue-eigenvector pair from either ρ or Σ. However, the (λ_i, e_i) derived from Σ are, in general, not the same as the ones derived from ρ.

Result 8.4. The ith principal component of the standardized variables Z′ = [Z1, Z2, ..., Zp] with Cov(Z) = ρ, is given by

Y_i = e_i′Z = e_i′(V^(1/2))⁻¹(X − μ),   i = 1, 2, ..., p

Moreover,

Σ_{i=1}^p Var(Y_i) = Σ_{i=1}^p Var(Z_i) = p                                    (8-11)

and

ρ_{Y_i,Z_k} = e_ik √λ_i,   i, k = 1, 2, ..., p

In this case, (λ1, e1), (λ2, e2), ..., (λp, ep) are the eigenvalue-eigenvector pairs for ρ, with λ1 ≥ λ2 ≥ ··· ≥ λp ≥ 0.

Proof. Result 8.4 follows from Results 8.1, 8.2, and 8.3, with Z1, Z2, ..., Zp in place of X1, X2, ..., Xp and ρ in place of Σ. ■

We see from (8-11) that the total (standardized variables) population variance is simply p, the sum of the diagonal elements of the matrix ρ. Using (8-7) with Z in place of X, we find that the proportion of total variance explained by the kth principal component of Z is

(Proportion of (standardized) population variance due to kth principal component) = λ_k/p,   k = 1, 2, ..., p        (8-12)

where the λ_k's are the eigenvalues of ρ.

Example 8.2 (Principal components obtained from covariance and correlation matrices are different)

Consider the covariance matrix

Σ = [ 1    4
      4  100 ]
and the derived correlation matrix

ρ = [ 1   .4
     .4    1 ]
The eigenvalue-eigenvector pairs from Σ are

λ1 = 100.16,   e1′ = [.040, .999]
λ2 = .84,      e2′ = [.999, −.040]

Similarly, the eigenvalue-eigenvector pairs from ρ are

λ1 = 1 + ρ = 1.4,   e1′ = [.707, .707]
λ2 = 1 − ρ = .6,    e2′ = [.707, −.707]

The respective principal components become

Σ:   Y1 = .040X1 + .999X2
     Y2 = .999X1 − .040X2

and

ρ:   Y1 = .707Z1 + .707Z2 = .707((X1 − μ1)/1) + .707((X2 − μ2)/10)
        = .707(X1 − μ1) + .0707(X2 − μ2)
     Y2 = .707Z1 − .707Z2 = .707((X1 − μ1)/1) − .707((X2 − μ2)/10)
        = .707(X1 − μ1) − .0707(X2 − μ2)

Because of its large variance, X2 completely dominates the first principal component determined from Σ. Moreover, this first principal component explains a proportion

λ1/(λ1 + λ2) = 100.16/101 = .992

of the total population variance. When the variables X1 and X2 are standardized, however, the resulting variables contribute equally to the principal components determined from ρ. Using Result 8.4, we obtain

ρ_{Y1,Z1} = e11 √λ1 = .707√1.4 = .837
and

ρ_{Y1,Z2} = e12 √λ1 = .707√1.4 = .837

In this case, the first principal component explains a proportion

λ1/p = 1.4/2 = .7

of the total (standardized) population variance.

Most strikingly, we see that the relative importance of the variables to, for instance, the first principal component is greatly affected by the standardization. When the first principal component obtained from ρ is expressed in terms of X1 and X2, the relative magnitudes of the weights .707 and .0707 are in direct opposition to those of the weights .040 and .999 attached to these variables in the principal component obtained from Σ. ■

The preceding example demonstrates that the principal components derived from Σ are different from those derived from ρ. Furthermore, one set of principal components is not a simple function of the other. This suggests that the standardization is not inconsequential.

Variables should probably be standardized if they are measured on scales with widely differing ranges or if the units of measurement are not commensurate. For example, if X1 represents annual sales in the $10,000 to $350,000 range and X2 is the ratio (net annual income)/(total assets) that falls in the .01 to .60 range, then the total variation will be due almost exclusively to dollar sales. In this case, we would expect a single (important) principal component with a heavy weighting of X1. Alternatively, if both variables are standardized, their subsequent magnitudes will be of the same order, and X2 (or Z2) will play a larger role in the construction of the components. This behavior was observed in Example 8.2.
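Example 8.2 is easy to reproduce numerically. Below is a minimal Python/NumPy sketch, our own illustration rather than part of the original text, that extracts the eigenvalue-eigenvector pairs from Σ and from ρ and shows how differently the two sets of components weight X1 and X2.

```python
import numpy as np

Sigma = np.array([[1.0,   4.0],
                  [4.0, 100.0]])

# Correlation matrix derived from Sigma
d = np.sqrt(np.diag(Sigma))
Rho = Sigma / np.outer(d, d)

def eigen_pairs(M):
    # eigh gives ascending eigenvalues; flip to descending order
    vals, vecs = np.linalg.eigh(M)
    idx = np.argsort(vals)[::-1]
    return vals[idx], vecs[:, idx]

vals_S, vecs_S = eigen_pairs(Sigma)   # approx [100.16, 0.84]; e1 ~ [.040, .999] (up to sign)
vals_R, vecs_R = eigen_pairs(Rho)     # [1.4, 0.6];            e1 ~ [.707, .707] (up to sign)

print(vals_S, vecs_S[:, 0])
print(vals_R, vecs_R[:, 0])
print(vals_S[0] / vals_S.sum())       # ~ .992, proportion explained under Sigma
print(vals_R[0] / vals_R.sum())       # .7, proportion explained under rho
```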
Principal Components for Covariance Matrices with Special Structures
There are certain patterned covariance and correlation matrices whose principal components can be expressed in simple forms. Suppose Σ is the diagonal matrix

Σ = [ σ11   0   ···   0
       0   σ22  ···   0
       ⋮    ⋮    ⋱    ⋮
       0    0   ···  σpp ]                                                     (8-13)

Setting e_i′ = [0, ..., 0, 1, 0, ..., 0], with 1 in the ith position, we observe that multiplying Σ by e_i reproduces the ith column of Σ, which is σii times e_i, or

Σe_i = σii e_i
and we conclude that (σii, e_i) is the ith eigenvalue-eigenvector pair. Since the linear combination e_i′X = X_i, the set of principal components is just the original set of uncorrelated random variables.

For a covariance matrix with the pattern of (8-13), nothing is gained by extracting the principal components. From another point of view, if X is distributed as N_p(μ, Σ), the contours of constant density are ellipsoids whose axes already lie in the directions of maximum variation. Consequently, there is no need to rotate the coordinate system.

Standardization does not substantially alter the situation for the Σ in (8-13). In that case, ρ = I, the p × p identity matrix. Clearly, ρe_i = 1e_i, so the eigenvalue 1 has multiplicity p and e_i′ = [0, ..., 0, 1, 0, ..., 0], i = 1, 2, ..., p, are convenient choices for the eigenvectors. Consequently, the principal components determined from ρ are also the original variables Z1, ..., Zp. Moreover, in this case of equal eigenvalues, the multivariate normal ellipsoids of constant density are spheroids.

Another patterned covariance matrix, which often describes the correspondence among certain biological variables such as the sizes of living things, has the general form

Σ = [ σ²    ρσ²  ···  ρσ²
      ρσ²   σ²   ···  ρσ²
       ⋮     ⋮    ⋱    ⋮
      ρσ²  ρσ²   ···   σ² ]                                                    (8-14)

The resulting correlation matrix

ρ = [ 1  ρ  ···  ρ
      ρ  1  ···  ρ
      ⋮  ⋮   ⋱   ⋮
      ρ  ρ  ···  1 ]                                                           (8-15)

is also the covariance matrix of the standardized variables. The matrix in (8-15) implies that the variables X1, X2, ..., Xp are equally correlated.

It is not difficult to show (see Exercise 8.5) that the p eigenvalues of the correlation matrix (8-15) can be divided into two groups. When ρ is positive, the largest is
λ1 = 1 + (p − 1)ρ                                                              (8-16)

with associated eigenvector

e1′ = [1/√p, 1/√p, ..., 1/√p]

The remaining p − 1 eigenvalues are

λ2 = λ3 = ··· = λp = 1 − ρ                                                     (8-17)

and one choice for their eigenvectors is

e2′ = [1/√(1·2), −1/√(1·2), 0, ..., 0]
e3′ = [1/√(2·3), 1/√(2·3), −2/√(2·3), 0, ..., 0]
  ⋮
e_i′ = [1/√((i−1)i), ..., 1/√((i−1)i), −(i−1)/√((i−1)i), 0, ..., 0]
  ⋮
e_p′ = [1/√((p−1)p), ..., 1/√((p−1)p), −(p−1)/√((p−1)p)]

The first principal component

Y1 = e1′X = (1/√p) Σ_{i=1}^p X_i

is proportional to the sum of the p original variables. It might be regarded as an "index" with equal weights. This principal component explains a proportion

λ1/p = (1 + (p − 1)ρ)/p = ρ + (1 − ρ)/p                                        (8-18)

of the total population variation. We see that λ1/p ≐ ρ for ρ close to 1 or p large. For example, if ρ = .80 and p = 5, the first component explains 84% of the total variance. When ρ is near 1, the last p − 1 components collectively contribute very little to the total variance and can often be neglected.

If the standardized variables Z1, Z2, ..., Zp have a multivariate normal distribution with a covariance matrix given by (8-15), then the ellipsoids of constant density are "cigar shaped," with the major axis proportional to the first principal component Y1 = (1/√p)[1, 1, ..., 1]X. This principal component is the projection
of X on the equiangular line 1′ = [1, 1, ..., 1]. The minor axes (and remaining principal components) occur in spherically symmetric directions perpendicular to the major axis (and first principal component).
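The closed-form eigenstructure in (8-16) and (8-17) is easy to check numerically. The short Python sketch below is our own illustration; the values ρ = .80 and p = 5 simply echo the example above.

```python
import numpy as np

p, rho = 5, 0.80
R = (1 - rho) * np.eye(p) + rho * np.ones((p, p))   # equicorrelation matrix (8-15)

vals = np.sort(np.linalg.eigvalsh(R))[::-1]          # eigenvalues in descending order
print(vals)                                          # [4.2, 0.2, 0.2, 0.2, 0.2]
print(1 + (p - 1) * rho, 1 - rho)                    # 4.2 and 0.2, matching (8-16) and (8-17)
print(vals[0] / p)                                   # 0.84: proportion explained by Y1, as in (8-18)
```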
8.3 SUMMARIZING SAMPLE VARIATION BY PRINCIPAL COMPONENTS

We now have the framework necessary to study the problem of summarizing the variation in n measurements on p variables with a few judiciously chosen linear combinations. Suppose the data x1, x2, ..., xn represent n independent drawings from some p-dimensional population with mean vector μ and covariance matrix Σ. These data yield the sample mean vector x̄, the sample covariance matrix S, and the sample correlation matrix R.

Our objective in this section will be to construct uncorrelated linear combinations of the measured characteristics that account for much of the variation in the sample. The uncorrelated combinations with the largest variances will be called the sample principal components.

Recall that the n values of any linear combination

a1′x_j,   j = 1, 2, ..., n

have sample mean a1′x̄ and sample variance a1′Sa1. Also, the pairs of values (a1′x_j, a2′x_j), for two linear combinations, have sample covariance a1′Sa2 [see (3-36)].

The sample principal components are defined as those linear combinations which have maximum sample variance. As with the population quantities, we restrict the coefficient vectors a_i to satisfy a_i′a_i = 1. Specifically,

First sample principal component = linear combination a1′x_j that maximizes the sample variance of a1′x_j subject to a1′a1 = 1

Second sample principal component = linear combination a2′x_j that maximizes the sample variance of a2′x_j subject to a2′a2 = 1 and zero sample covariance for the pairs (a1′x_j, a2′x_j)

At the ith step, we have

ith sample principal component = linear combination a_i′x_j that maximizes the sample variance of a_i′x_j subject to a_i′a_i = 1 and zero sample covariance for all pairs (a_i′x_j, a_k′x_j), k < i

The first principal component maximizes a1′Sa1 or, equivalently,

a1′Sa1 / (a1′a1)                                                               (8-19)
By (2-51), the maximum is the largest eigenvalue λ̂1, attained for the choice â1 = eigenvector ê1 of S. Successive choices of â_i maximize (8-19) subject to 0 = â_i′Sê_k = â_i′λ̂_k ê_k, or â_i perpendicular to ê_k. Thus, as in the proofs of Results 8.1-8.3, we obtain the following results concerning sample principal components:

If S = {s_ik} is the p × p sample covariance matrix with eigenvalue-eigenvector pairs (λ̂1, ê1), (λ̂2, ê2), ..., (λ̂p, êp), the ith sample principal component is given by

ŷ_i = ê_i′x = ê_i1 x1 + ê_i2 x2 + ··· + ê_ip xp,   i = 1, 2, ..., p

where λ̂1 ≥ λ̂2 ≥ ··· ≥ λ̂p ≥ 0 and x is any observation on the variables X1, X2, ..., Xp. Also,

Sample variance(ŷ_k) = λ̂_k,   k = 1, 2, ..., p
Sample covariance(ŷ_i, ŷ_k) = 0,   i ≠ k

In addition,

Total sample variance = Σ_{i=1}^p s_ii = λ̂1 + λ̂2 + ··· + λ̂p

and

r_{ŷ_i, x_k} = ê_ik √λ̂_i / √s_kk,   i, k = 1, 2, ..., p                        (8-20)

We shall denote the sample principal components by ŷ1, ŷ2, ..., ŷp, irrespective of whether they are obtained from S or R.² The components constructed from S and R are not the same, in general, but it will be clear from the context which matrix is being used, and the single notation ŷ_i is convenient. It is also convenient to label the component coefficient vectors ê_i and the component variances λ̂_i for both situations.

² Sample principal components can also be obtained from Σ̂ = S_n, the maximum likelihood estimate of the covariance matrix Σ, if the X_j are normally distributed. (See Result 4.11.) In this case, provided that the eigenvalues of Σ are distinct, the sample principal components can be viewed as the maximum likelihood estimates of the corresponding population counterparts. (See [1].) We shall not consider Σ̂ because the assumption of normality is not required in this section. Also, Σ̂ has eigenvalues [(n − 1)/n]λ̂_i and corresponding eigenvectors ê_i, where (λ̂_i, ê_i) are the eigenvalue-eigenvector pairs for S. Thus, both S and Σ̂ give the same sample principal components ê_i′x [see (8-20)] and the same proportion of explained variance λ̂_i/(λ̂1 + λ̂2 + ··· + λ̂p). Finally, both S and Σ̂ give the same sample correlation matrix R, so if the variables are standardized, the choice of S or Σ̂ is irrelevant.
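The quantities in (8-20) all come from a single eigen-decomposition of S. The following Python sketch is our own minimal illustration; the data matrix X is simulated, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))            # n = 30 observations on p = 4 variables

S = np.cov(X, rowvar=False)             # sample covariance matrix (divisor n - 1)
lam, E = np.linalg.eigh(S)              # eigenvalues ascending, eigenvectors in columns
order = np.argsort(lam)[::-1]
lam, E = lam[order], E[:, order]        # lambda_1 >= ... >= lambda_p; e_i = E[:, i]

Y = (X - X.mean(axis=0)) @ E            # component values y_ji = e_i'(x_j - xbar)

print(np.allclose(Y.var(axis=0, ddof=1), lam))   # sample variances equal the eigenvalues
print(np.allclose(lam.sum(), np.trace(S)))        # total sample variance is preserved

# Correlations r_{y_i, x_k} = e_ik * sqrt(lambda_i) / sqrt(s_kk), as in (8-20)
r = E.T * np.sqrt(lam)[:, None] / np.sqrt(np.diag(S))[None, :]
print(np.round(r, 3))
```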
The observations x_j are often "centered" by subtracting x̄. This has no effect on the sample covariance matrix S and gives the ith principal component

ŷ_i = ê_i′(x − x̄),   i = 1, 2, ..., p                                          (8-21)

for any observation vector x. If we consider the values of the ith component

ŷ_ji = ê_i′(x_j − x̄),   j = 1, 2, ..., n                                       (8-22)

generated by substituting each observation x_j for the arbitrary x in (8-21), then

ȳ_i = (1/n) Σ_{j=1}^n ê_i′(x_j − x̄) = (1/n) ê_i′( Σ_{j=1}^n (x_j − x̄) ) = (1/n) ê_i′0 = 0        (8-23)

That is, the sample mean of each principal component is zero. The sample variances are still given by the λ̂_i's, as in (8-20).

Example 8.3 (Summarizing sample variability with two sample principal components)

A census provided information, by tract, on five socioeconomic variables for the Madison, Wisconsin, area. The data from 14 tracts are listed in Table 8.5 in the exercises at the end of this chapter. These data produced the following summary statistics:

x̄′ = [4.32, 14.01, 1.95, 2.17, 2.45]

where the entries are, in order, total population (thousands), median school years, total employment (thousands), health services employment (hundreds), and median home value ($10,000s), and

S = [  4.308   1.683   1.803   2.155   −.253
       1.683   1.768    .588    .177    .176
       1.803    .588    .801   1.065   −.158
       2.155    .177   1.065   1.970   −.357
       −.253    .176   −.158   −.357    .504 ]

Can the sample variation be summarized by one or two principal components?
We find the following:

COEFFICIENTS FOR THE PRINCIPAL COMPONENTS
(Correlation Coefficients in Parentheses)

Variable                      ê1 (r_{ŷ1,x})   ê2 (r_{ŷ2,x})     ê3       ê4       ê5
Total population               .781 (.99)     −.071 (−.04)    .004     .542    −.302
Median school years            .306 (.61)     −.764 (−.76)   −.162    −.545    −.010
Total employment               .334 (.98)      .083 (.12)     .015     .050     .937
Health services employment     .426 (.80)      .579 (.55)     .220    −.636    −.173
Median home value             −.054 (−.20)    −.262 (−.49)    .962    −.051     .024

Variance (λ̂_i):                 6.931           1.786          .390     .230     .014
Cumulative percentage
of total variance               74.1            93.2           97.4     99.9     100

The first principal component explains 74.1% of the total sample variance. The first two principal components, collectively, explain 93.2% of the total sample variance. Consequently, sample variation is summarized very well by two principal components and a reduction in the data from 14 observations on 5 variables to 14 observations on 2 principal components is reasonable.

Given the foregoing component coefficients, the first principal component appears to be essentially a weighted average of the first four variables. The second principal component appears to contrast health services employment with a weighted average of median school years and median home value. ■

As we said in our discussion of the population components, the component coefficients ê_ik and the correlations r_{ŷ_i,x_k} should both be examined to interpret the principal components. The correlations allow for differences in the variances of the original variables, but only measure the importance of an individual X without regard to the other X's making up the component. We notice in Example 8.3, however, that the correlation coefficients displayed in the table confirm the interpretation provided by the component coefficients.
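The eigenvalues and coefficient vectors in the table can be recomputed from the printed S alone. A Python sketch, ours rather than part of the original analysis, follows; eigenvector signs are arbitrary and may come out flipped.

```python
import numpy as np

# Sample covariance matrix for the 14 Madison census tracts (Example 8.3)
S = np.array([[ 4.308, 1.683,  1.803,  2.155, -0.253],
              [ 1.683, 1.768,  0.588,  0.177,  0.176],
              [ 1.803, 0.588,  0.801,  1.065, -0.158],
              [ 2.155, 0.177,  1.065,  1.970, -0.357],
              [-0.253, 0.176, -0.158, -0.357,  0.504]])

lam, E = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, E = lam[order], E[:, order]

print(np.round(lam, 3))                          # ~ [6.931, 1.786, .390, .230, .014]
print(np.round(np.cumsum(lam) / lam.sum(), 3))   # ~ [.741, .932, .974, .999, 1.000]
print(np.round(E[:, 0], 3))                      # ~ [.781, .306, .334, .426, -.054] (up to sign)
```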
The Number of Principal Components
There is always the question of how many components to retain. There is no definitive answer to this question. Things to consider include the amount of total sample variance explained, the relative sizes of the eigenvalues (the variances of the sample components), and the subject-matter interpretations of the components. In addition, as we discuss later, a component associated with an eigenvalue near zero and, hence, deemed unimportant, may indicate an unsuspected linear dependency in the data.

A useful visual aid to determining an appropriate number of principal components is a scree plot.³ With the eigenvalues ordered from largest to smallest, a scree plot is a plot of λ̂_i versus i, the magnitude of an eigenvalue versus its number. To determine the appropriate number of components, we look for an elbow (bend) in the scree plot. The number of components is taken to be the point at which the remaining eigenvalues are relatively small and all about the same size. Figure 8.2 shows a scree plot for a situation with six principal components.

An elbow occurs in the plot in Figure 8.2 at about i = 3. That is, the eigenvalues after λ̂2 are all relatively small and about the same size. In this case, it
Figure 8.2 A scree plot.

³ Scree is the rock debris at the bottom of a cliff.
appears, without any other evidence, that two (or perhaps three) sample principal components effectively summarize the total sample variance.

Example 8.4 (Summarizing sample variability with one sample principal component)
In a study of size and shape relationships for painted turtles, Jolicoeur and Mosimann [11] measured carapace length, width, and height. Their data, reproduced in Exercise 6.17, Table 6.5, suggest an analysis in terms of logarithms. (Jolicoeur [10] generally suggests a logarithmic transformation in studies of size-and-shape relationships.) Perform a principal component analysis.

The natural logarithms of the dimensions of 24 male turtles have sample mean vector x̄′ = [4.725, 4.478, 3.703] and covariance matrix

S = 10⁻³ × [ 11.072   8.019   8.160
              8.019   6.417   6.005
              8.160   6.005   6.773 ]

A principal component analysis (see Panel 8.1 for the output from the SAS statistical software package) yields the following summary:
COEFFICIENTS FOR PRINCIPAL COMPONENTS
(Correlation Coefficients in Parentheses)

Variable       ê1 (r_{ŷ1,x_k})     ê2        ê3
ln(length)      .683 (.99)       −.159     −.713
ln(width)       .510 (.97)       −.594      .622
ln(height)      .523 (.97)        .788      .324

Variance (λ̂_i):  23.30 × 10⁻³    .60 × 10⁻³   .36 × 10⁻³
Cumulative percentage
of total variance    96.1           98.5       100

A scree plot is shown in Figure 8.3. The very distinct elbow in this plot occurs at i = 2. There is clearly one dominant principal component.

The first principal component, which explains 96% of the total variance, has an interesting subject-matter interpretation. Since

ŷ1 = .683 ln(length) + .510 ln(width) + .523 ln(height)
   = ln[(length)^.683 (width)^.510 (height)^.523]

the first principal component may be viewed as the ln(volume) of a box with adjusted dimensions. For instance, the adjusted height is (height)^.523, which accounts, in some sense, for the rounded shape of the carapace. ■
PANEL 8.1 SAS ANALYSIS FOR EXAMPLE 8.4 USING PROC PRINCOMP.

PROGRAM COMMANDS:

    title 'Principal Component Analysis';
    data turtle;
    infile 'E8-4.dat';
    input length width height;
    x1 = log(length); x2 = log(width); x3 = log(height);
    proc princomp cov data = turtle out = result;
    var x1 x2 x3;

OUTPUT:

    Principal Components Analysis
    24 Observations   3 Variables

    Simple Statistics
                X1            X2            X3
    Mean   4.725443647   4.477573765   3.703185794
    StD    0.105223590   0.080104466   0.082296771

    Covariance Matrix
             X1             X2             X3
    X1  0.0110720040   0.0080191419   0.0081596480
    X2  0.0080191419   0.0064167255   0.0060052707
    X3  0.0081596480   0.0060052707   0.0067727585

    Total Variance = 0.024261488

             Eigenvalue   Difference   Proportion   Cumulative
    PRIN1     0.023303     0.022705     0.960508      0.96051
    PRIN2     0.000598     0.000238     0.024661      0.98517
    PRIN3     0.000360                  0.014832      1.00000

    Eigenvectors
            PRIN1       PRIN2       PRIN3
    X1     0.683102    −.159479    −.712697
    X2     0.510220    −.594012     0.621953
    X3     0.523        0.788490     0.324
Figure 8.3 A scree plot for the turtle data.
Interpretation of the Sample Principal Components
The sample principal components have several interpretations. First, suppose the underlying distribution of X is nearly N_p(μ, Σ). Then the sample principal components ŷ_i = ê_i′(x − x̄) are realizations of the population principal components Y_i = e_i′(X − μ), which have an N_p(0, Λ) distribution. The diagonal matrix Λ has entries λ1, λ2, ..., λp, and (λ_i, e_i) are the eigenvalue-eigenvector pairs of Σ.

Also, from the sample values x_j, we can approximate μ by x̄ and Σ by S. If S is positive definite, the contour consisting of all p × 1 vectors x satisfying

(x − x̄)′S⁻¹(x − x̄) = c²                                                       (8-24)

estimates the constant density contour (x − μ)′Σ⁻¹(x − μ) = c² of the underlying normal density. The approximate contours can be drawn on the scatter plot to indicate the normal distribution that generated the data. The normality assumption is useful for the inference procedures discussed in Section 8.5, but it is not required for the development of the properties of the sample principal components summarized in (8-20).

Even when the normal assumption is suspect and the scatter plot may depart somewhat from an elliptical pattern, we can still extract the eigenvalues from S and obtain the sample principal components. Geometrically, the data may be plotted as points in p-space. The data can then be expressed in the new coordinates, which coincide with the axes of the contour of (8-24). Now, (8-24) defines a hyperellipsoid that is centered at x̄ and whose axes are given by the eigenvectors of S⁻¹ or, equivalently, of S. (See Section 2.3 and Result 4.1, with S in place of Σ.) The
lengths of these hyperellipsoid axes are proportional to √λ̂_i, i = 1, 2, ..., p, where λ̂1 ≥ λ̂2 ≥ ··· ≥ λ̂p ≥ 0 are the eigenvalues of S.

Because ê_i has length 1, the absolute value of the ith principal component, |ŷ_i| = |ê_i′(x − x̄)|, gives the length of the projection of the vector (x − x̄) on the unit vector ê_i. [See (2-8) and (2-9).] Thus, the sample principal components ŷ_i = ê_i′(x − x̄), i = 1, 2, ..., p, lie along the axes of the hyperellipsoid, and their absolute values are the lengths of the projections of x − x̄ in the directions of the axes ê_i. Consequently, the sample principal components can be viewed as the result of translating the origin of the original coordinate system to x̄ and then rotating the coordinate axes until they pass through the scatter in the directions of maximum variance.

The geometrical interpretation of the sample principal components is illustrated in Figure 8.4 for p = 2. Figure 8.4(a) shows an ellipse of constant distance, centered at x̄, with λ̂1 > λ̂2. The sample principal components are well determined. They lie along the axes of the ellipse in the perpendicular directions of maximum sample variance. Figure 8.4(b) shows a constant distance ellipse, centered at x̄, with λ̂1 = λ̂2. In this case, the axes of the ellipse (circle) of constant distance are not uniquely determined and can lie in any two perpendicular directions, including the directions of the original coordinate axes. Similarly, the sample principal components can lie in any two perpendicular directions, including those of the original coordinate axes. When the contours of constant distance are nearly circular or, equivalently, when the eigenvalues of S are nearly equal, the sample variation is homogeneous in all directions. It is then not possible to represent the data well in fewer than p dimensions.

If the last few eigenvalues λ̂_i are sufficiently small such that the variation in the corresponding ê_i directions is negligible, the last few sample principal components can often be ignored, and the data can be adequately approximated by their representations in the space of the retained components. (See Section 8.4.)
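The projection interpretation can be made concrete with a few lines of code. The following Python sketch is our own illustration; the data are simulated purely to have something to project. It verifies that the component values are exactly the projections of the centered observations onto the unit eigenvectors of S.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0], cov=[[4, 2], [2, 3]], size=200)

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]                  # descending order of eigenvalues

centered = X - xbar
Y = centered @ E                                 # y_ji = e_i'(x_j - xbar)

# |y_j1| is the length of the projection of (x_j - xbar) on the unit vector e_1
proj_lengths = np.abs(centered @ E[:, 0])
print(np.allclose(proj_lengths, np.abs(Y[:, 0])))

# Axes of the estimated constant-distance ellipse are proportional to sqrt(lambda_i)
print(np.sqrt(lam))
```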
Figure 8.4 Sample principal components and ellipses of constant distance (x − x̄)′S⁻¹(x − x̄) = c²: (a) λ̂1 > λ̂2; (b) λ̂1 = λ̂2.
Finally, Supplement 8A gives a further result concerning the role of the sample principal components when directly approximating the mean-centered data x_j − x̄.

Standardizing the Sample Principal Components
Sample principal components are, in general, not invariant with respect to changes in scale. (See Exercise 8.2.) As we mentioned in the treatment of population components, variables measured on different scales or on a common scale with widely differing ranges are often standardized. For the sample, standardization is accomplished by constructing

z_j = [ (x_j1 − x̄1)/√s11,  (x_j2 − x̄2)/√s22,  ...,  (x_jp − x̄p)/√spp ]′,   j = 1, 2, ..., n        (8-25)

The n × p data matrix of standardized observations

Z = [z1′; z2′; ...; zn′] = {z_jk},  where  z_jk = (x_jk − x̄_k)/√s_kk,  j = 1, 2, ..., n,  k = 1, 2, ..., p        (8-26)

yields the sample mean vector [see (3-24)]

z̄ = (1/n)(1′Z)′ = (1/n)Z′1 = (1/n)[ Σ_{j=1}^n (x_j1 − x̄1)/√s11, ..., Σ_{j=1}^n (x_jp − x̄p)/√spp ]′ = 0        (8-27)
and sample covariance matrix [see (3-27)]

S_z = (1/(n − 1)) (Z − (1/n)11′Z)′(Z − (1/n)11′Z) = (1/(n − 1)) (Z − 1z̄′)′(Z − 1z̄′) = (1/(n − 1)) Z′Z

    = (1/(n − 1)) [ (n − 1)s11/s11           (n − 1)s12/(√s11 √s22)   ···   (n − 1)s1p/(√s11 √spp)
                    (n − 1)s12/(√s11 √s22)   (n − 1)s22/s22           ···   (n − 1)s2p/(√s22 √spp)
                          ⋮                        ⋮                   ⋱          ⋮
                    (n − 1)s1p/(√s11 √spp)   (n − 1)s2p/(√s22 √spp)   ···   (n − 1)spp/spp ]  =  R        (8-28)

The sample principal components of the standardized observations are given by (8-20), with the matrix R in place of S. Since the observations are already "centered" by construction, there is no need to write the components in the form of (8-21).

If z1, z2, ..., zn are standardized observations with covariance matrix R, the ith sample principal component is

ŷ_i = ê_i′z = ê_i1 z1 + ê_i2 z2 + ··· + ê_ip zp,   i = 1, 2, ..., p

where (λ̂_i, ê_i) is the ith eigenvalue-eigenvector pair of R with λ̂1 ≥ λ̂2 ≥ ··· ≥ λ̂p ≥ 0. Also,

Sample variance(ŷ_i) = λ̂_i,   i = 1, 2, ..., p
Sample covariance(ŷ_i, ŷ_k) = 0,   i ≠ k                                       (8-29)

In addition,

Total (standardized) sample variance = tr(R) = p = λ̂1 + λ̂2 + ··· + λ̂p

and

r_{ŷ_i, z_k} = ê_ik √λ̂_i,   i, k = 1, 2, ..., p

Using (8-29), we see that the proportion of the total sample variance explained by the ith sample principal component is
(Proportion of (standardized) sample variance due to ith sample principal component) = λ̂_i/p,   i = 1, 2, ..., p        (8-30)
A rule of thumb suggests retaining only those components whose variances λ̂_i are greater than unity or, equivalently, only those components which, individually, explain at least a proportion 1/p of the total variance. This rule does not have a great deal of theoretical support, however, and it should not be applied blindly. As we have mentioned, a scree plot is also useful for selecting the appropriate number of components.
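When the variables are standardized, the whole computation reduces to an eigen-decomposition of the sample correlation matrix. A minimal Python sketch illustrating (8-29) and (8-30) follows; it is our own illustration, and X is any n × p data array you supply (here simulated).

```python
import numpy as np

def pca_from_correlation(X):
    """Sample principal components of the standardized observations (uses R, not S)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized data, as in (8-25)
    R = np.corrcoef(X, rowvar=False)                   # sample correlation matrix
    lam, E = np.linalg.eigh(R)
    order = np.argsort(lam)[::-1]
    lam, E = lam[order], E[:, order]
    Y = Z @ E                                          # sample principal components
    prop = lam / R.shape[0]                            # lambda_i / p, as in (8-30)
    return lam, E, Y, prop

# Example with simulated data
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3)) @ np.array([[1.0, 0.5, 0.2],
                                         [0.0, 1.0, 0.4],
                                         [0.0, 0.0, 1.0]])
lam, E, Y, prop = pca_from_correlation(X)
print(np.round(lam, 3), np.round(prop, 3))   # eigenvalues sum to p; proportions sum to 1
```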
Example 8.5
(Sample principal components from standardized data)
The weekly rates of return for five stocks (Allied Chemical, du Pont, Union Carbide, Exxon, and Texaco) listed on the New York Stock Exchange were determined for the period January 1975 through December 1976. The weekly rates of return are defined as (current Friday closing price − previous Friday closing price)/(previous Friday closing price), adjusted for stock splits and dividends. The data are listed in Table 8.4 in the exercises. The observations in 100 successive weeks appear to be independently distributed, but the rates of return across stocks are correlated, since, as one expects, stocks tend to move together in response to general economic conditions.

Let x1, x2, ..., x5 denote observed weekly rates of return for Allied Chemical, du Pont, Union Carbide, Exxon, and Texaco, respectively. Then
x̄′ = [.0054, .0048, .0057, .0063, .0037]

and

R = [ 1.000   .577   .509   .387   .462
       .577  1.000   .599   .389   .322
       .509   .599  1.000   .436   .426
       .387   .389   .436  1.000   .523
       .462   .322   .426   .523  1.000 ]

We note that R is the covariance matrix of the standardized observations

z1 = (x1 − x̄1)/√s11,  z2 = (x2 − x̄2)/√s22,  ...,  z5 = (x5 − x̄5)/√s55

The eigenvalues and corresponding normalized eigenvectors of R, determined by a computer, are

λ̂1 = 2.857,   ê1′ = [.464, .457, .470, .421, .421]
λ̂2 = .809,    ê2′ = [.240, .509, .260, −.526, −.582]
λ̂3 = .540,   ê3′ = [−.612, .178, .335, .541, −.435]
λ̂4 = .452,   ê4′ = [.387, .206, −.662, .472, −.382]
λ̂5 = .343,   ê5′ = [−.451, .676, −.400, −.176, .385]

Using the standardized variables, we obtain the first two sample principal components

ŷ1 = ê1′z = .464z1 + .457z2 + .470z3 + .421z4 + .421z5
y2 = e� z = .240z1 + .509z 2 + .260z 3 - .526z4 - .582zs
These components, which account for
of the total (standardized) sample variance, have interesting interpretations. The first component is a roughly equally weighted sum, or "index," of the five stocks. This component might be called a general stock-market component, or simply a market component. The second component represents a contrast between the chemical stocks (Allied Chemical, du Pont, and Union Carbide) and the oil stocks (Exxon and Texaco). It might be called an industry component. Thus, we see that most of the variation in these stock returns is due to market activity and uncorrelated industry activity. This interpretation of stock price behavior has also been suggested by King [12]. The remaining components are not easy to interpret and, collectively, represent variation that is probably specific to each stock. In any event, they do not explain much of the total sample variance. This example provides a case where it seems sensible to retain a component (y2 ) associated with an eigenvalue less than unity. •
Example 8.6 (Components from a correlation matrix with a special structure)
Geneticists are often concerned with the inheritance of characteristics that can be measured several times during an animal' s lifetime. Body weight (in grams) for n = 150 female mice were obtained immediately after the birth of their first 4 litters.4 The sample mean vector and sample correlation matrix were, respectively, XI
[39.88,
4 Data courtesy of J. J. Rutledge.
45.08,
48.11,
49.95]
484
Chap.
8
Principal Components
and
r
1.000 .7501 .6329 �63 .7501 1.000 .6925 .7386 R= .6329 .6925 1.000 .6625 .6363 .7386 .6625 1.000
The eigenvalues of this matrix are
]
and A 4 = .217 We note that the first eigenvalue is nearly equal to 1 + (p - 1 )r 1 + (4 - 1) (.6854) = 3.056, where r is the arithmetic average of the off diagonal �lements of R. The remaining_ eigenv11lues are small and about equal, although A 4 is somewhat smaller than A 2 and A 3 . Thus, there is some evidence population correlation matrix p may be of the "equal that the corresponding correlation" form of (8-15). This notion is explored further in Example 8.9. The first principal component A I = 3.085, A 2 = .382, A
A
A3 = .342, A
A
.9 1 = e{ z = .49z 1 + .52z2 + .49z3 + .50z4 100( A J fp)% = 100(3.058/4)% = 76%
of the total variance. accounts for Although the average postbirth weights increase over time, the variation in weights is fairly well explained by the first principal component with (nearly) equal coefficients. Comment. An unusually small value for the last eigenvalue from either the sample covariance or correlation matrix can indicate an unnoticed linear depen dency in the data set. If this occurs, one (or more) of the variables is redundant and should be deleted. Consider a situation where x1 , x2 , and x3 are subtest scores and the total score x4 is the sum x1 + x2 + x3 • Then, although the linear combination e' x = [1, 1, 1, - 1]x = x1 + x2 + x3 - x4 is always zero, rounding error in the computation of eigenvalues may lead to a small nonzero value. If the linear expression relating x4 to (x1 , x2 , x3 ) was initially overlooked, the smallest eigen value-eigenvector pair should provide a clue to its existence. Thus, although "large" eigenvalues and the corresponding eigenvectors are important in a principal component analysis, eigenvalues very close to zero should not be routinely ignored. The eigenvectors associated with these latter eigenvalues may point out linear dependencies in the data set that can cause interpretive and computational problems in a subsequent analysis. •
8.4 GRAPHING THE PRINCIPAL COMPONENTS
Plots of the principal components can reveal suspect observations, as well as pro vide checks on the assumption of normality. Since the principal components are linear combinations of the original variables, it is not unreasonable to expect them
Sec.
Graphing the Principal Components
8.4
485
to be nearly normal. It is often necessary to verify that the first few principal com ponents are approximately normally distributed when they are to be used as the input data for additional analyses. The last principal components can help pinpoint suspect observations. Each observation can be expressed as a linear combination xj = (xj e1 ) e1 + (xj C-z ) ez + . . . + (xj ep ) ep = .Yj 1 e 1 + Yi z ez + . · + Yip ep e1 , e2 , . . . , eP S. .
of the complete set of eigenvectors of Thus, the magnitudes of the last principal components. . determine how well the first few fit the observations. That is, yj l e1 + Yjz Cz + . + Yj, q - 1 eq - 1 differs from xj by Yjq eq + . . . + Yjp ep , the square of whose length is Yfq + + YJP . Suspect observations will often be such that at least one of the coordinates Y • . . . , Yip contributing to this squared length will be large. (See Supplement 8A foriqmore general approximation results. ) The following statements summarize these ideas. 1. To help check the normal assumption, construct scatter diagrams for pairs of the first few principal components. Also, make Q-Q plots from the sample values generated by each principal component. 2. Construct scatter diagrams and Q-Q plots for the last few principal compo nents. These help identify suspect observations. Example 8. 7 (Plotting the principal components for the turtle data) We illustrate the plotting of principal components for the data on male tur tles discussed in Example 8.4. The three sample principal components are ···
y 1 = .683(x 1 j/2 = - .159(x1 j/3 = - .713(x 1 x1 =
- 4.725) + .510(x2 - 4.478) + .523(x3 - 3.703) - 4.725) - .594(x2 - 4.478) + . 788(x3 - 3.703) - 4.725) + .622(x2 - 4.478) + .324(x3 - 3.703) where ln(length), x2 = ln(width), and x3 = ln(height), respectively. Figure 8.5 shows the Q-Q plot for j/2 and Figure 8.6 shows the scatter
plot of (j/ 1 , j/2 ). The observation for the first turtle is circled and lies in the @I
••
0. •
• ••
.,.
,,, ,
•
�
Figure 8.5 A Q-Q plot for the second principal component y2 from the data on male turtles .
486
Chap.
8
Principal Components •
.3
.I A
YJ
•
3
•
• • • • •
-.1
-.
•
••
•
•
•
•
•
• •
•
=· Figure 8.6
Scatter plot of the principal components y1 and y2 of the data on male turtles.
A
y2
lower right corner of the scatter plot and in the upper right corner of the Q-Q plot; it may be suspect. This point should have been checked for recording errors, or the turtle should have been examined for structural anomalies. Apart from the first turtle, the scatter plot appears to be reasonably ellipti cal. The plots for the other sets of principal components do not indicate any substantial departures from normality. The diagnostics involving principal components apply equally well to the checking of assumptions for a multivariate multiple regression model. In fact, hav ing fit any model by any method of estimation, it is prudent to consider the . vector = (observation . vector) - ( vector. of predicted ) Restdual ( estimate d) va ues or (8-31) ii Yi fJ ,zi j = 1, 2, ... , n for the multivariate linear model. Principal components, derived from the covari ance matrix of the residuals, 1 � ( )( ) (8-32) n-p. can be scrutinized in the same manner as those determined from a random sam ple. You should be aware that there are linear dependencies among the residuals from a linear regression analysis, so the last eigenvalues will be zero, within rounding error. •
1
"
(p X l)
(p X I )
- £.i
-
j=l
(p X I)
� E· J
-
" E· J
" � £. - E· J
J
'
Sec.
8.5
Large Sample Inferences
487
8.5 LARGE SAMPLE INFERENCES
We have seen that the eigenvalues and eigenvectors of the covariance (correlation) matrix are the essence of a principal component analysis. The eigenvectors deter mine the directions of maximum variability, and the eigenvalues specify the vari ances. When the first few eigenvalues are much larger than the rest, most of the total variance can be "explained" in fewer than p dimensions. In practice, decisions regarding the quality of the principal component approximation must be made on the basis of the eigenvalue-eigenvector pairs ( A ;, e; ) extracted from s or R. Because of sampling variation, these eigenvalues and eigenvectors will diffp from their underlying population counterparts. The sampling distributions of A ; and e; are difficult to derive and beyond the scope of this book. If you are interested, you can find some of these derivations for multi variate normal populations in [1 ], [2], and [5]. We shall simply summarize the per tinent large sample results. Large Sample Properties of
AI and e/
Currently available results concerning large sample confidence intervals for A ; and e; assume that the observations X 1 , X 2 , ... , X are a random sample from a normal population. It must also be assumed that the "(unknown) eigenvalues of :I are dis tinct and positive, so that A 1 > A2 > > AP > 0. The one exception is the case where the number of equal eigenvalues is known. Usually the conclusions for dis tinct eigenvalues are applied, unless there is a strong reason to believe that :I has a special structure that yields equal eigenvalues. Even when the normal assumption is violated, the confidence intervals obtained in this manner still provide some indi cation of the uncertainty in A ; and e; . Anderson [2] and Girshick [5] have established the following large sample distribution theory for the eigenvalues .A = [ A 1 , . . . , A p ] and eigenvectors el , . . . , ep of s : 1. Let A be the diagonal matrix of eigenvalues A 1 , ... , AP of :I, then Vn (A - A) is approximately NP (0, 2A 2 ) . 2. Let A
···
I
then 'fi1 ( e; - e ; ) is approximately N, (0, E; ). Each A ; is distributed independently of the elements of the associated e; . Result 1 implies that, for n large, the A; are independently distributed. Moreover, A ; has an approximate N(A;, 2Al/n) distribution. Using this normal distribu tion, we obtain P [ I A ; - A; l z(a/2)A; Y2/n] = 1 - a. A large sample 100(1 - a)% confidence interval for A; is thus provided by 3.
A
�
488
Chap.
8
Principal Components
----
��==�
A1 A
----
A
-
� ===-
A1
-------- -(1 - z(a/2) V2f;t ) A
(1 + z(a/2) V2f;t ) where z ( a/2) is the upper 100( a/2)th percentile of a standard normal distribution. Bonferroni-type simultaneous 100(1 - a)% intervals for m A,' s are obtained by replacing z(a/2) with z(a/2m). '(See Section 5.4.) Result 2 implies that the e, s are normally distributed about the correspond �
'
�
(8-33)
ing e,'s for large samples. The elements of each 1 are correlated, and the correla tion depends to a large extent on the separatione of the eigenvalues A 1 , A 2 , , AP (which is unknown) and the sample size n. Approximate standard errors for tl}_e coeffici�nts e1k are given by the square root� of' the diagonal elements of (1/n)E1 where E 1 is derived from E 1 by substituting A, s for the A,' s and e,' s for the e,' s. Exam ple 8.8 (Constructing a confidence interval for A1) We shall obtain a 95% confidence interval for A1 , the variance of the first pop ulation principal component, using the stock price data listed in Table 8.4. Assume that the stock rates of return represent independent drawings from an N5 (JL, I) population, where I is positive definite with distinct eigen values A 1 > A2 > > A 5 > 0. Since n = 100 is large, we can use (8-33) with � = 1 to construct a 95% confidence interval for A 1 . From Exercise 8. 1 0, A 1 = .0036 and in addition, z(.025) = 1.96. Therefore, with 95% confidence, 0036 .0036 A or .0028 A 1 .0050 1 ( 1 - 1.96�60 ) (1 + 1. 96�) Whenever an eigenvalue is large, such as 100 or even 1000, the intervals gen erated by (8-33) can be quite wide, for reasonable confidence levels, even though is fairly large. In general, the confidence interval gets wider at the same rate that A1 gets larger. Consequently, some care must be exercised)n' dropping or retain ing principal components based on an examination of the A , s. • • •
···
.
-------===---
�
�
�
�
•
fJ:.
Testing for the Equal Correlation Structure
The special correlation structure Cov(X1, Xk) = V(J"1 1(J"k k or Corr(X1, Xk) = all i 1= k, is one important structure in which the eigenvalues of I are not distinct and the previous results do not apply. To test for this structure, let p,
Ho : P = Po = ( p X p)
fi : Il
p,
Sec.
Large Sample Inferences
8.5
489
and H, : p * Po
test of H0 versus H may be based on a likelihood ratio statistic, but Lawley [14] has demonstrated that1 an equivalent test procedure can be constructed from the off-diagonal elements of R. Lawley 's procedure requires the quantities A
-r
-- p(p
2 -
r 1) 2:i <2: k ik
(p - 1) 2 [1 - (1 - r?]
(8-34)
y = p - (p - 2) (1 - r ) 2
It is evident that rk is the average of the off-diagonal elements in the kth column (or row) of R and r is the overall average of the off-diagonal elements. The large sample approximate a-level test is to reject H0 in favor of H1 if � - - 2] 2 ( a ) (8-35) T - (1( n -1)) 2 ["" ( ri k - ) 2 k = i ( k ) > X(p + l )(p - 2) /2 1< k where x [P + l )(p 2)12 ( a ) is the upper (lOOa)th percentile of a chi-square distribution with (p + 1) (p - 2)/2 d.f. _
-
_
- r
LJ LJ
r
A
- 1' LJ
r
- r
_
Example 8.9
[
{Testing for equ icorrelation structure)
]
From Example 8.6, the sample correlation matrix constructed from the post birth weights of female mice is R
1.0 .7501 .6329 .6363 .7501 1.0 .6925 .7386 = .6329 .6925 1.0 .6625 .6363 .7386 .6625 1.0
We shall use this correlation matrix to illustrate the large sample test in (8-35). Here p = 4, and we set H, P
� Po �
H, : p * Po
Using (8-34) and (8-35), we obtain
[� � � �]
490
Chap . 8
Principal Components
-rl = 31 (.7501 + .6329 + .6363) = .6731, ,.2 = .7271,
,.3 = .6626, ,.4 = .6791 -r = 2 (.7501 + .6329 + .6363 + .6925 + .7386 + .6625) 4(3)
.6855
2:2: (r;k - r) 2 = (.7501 - .6855) 2
i
(.6329 - .6855) 2 + . . . + (.6625 - .6855) 2 = .01277 +
2: c rk - r) 2 = (.6731 - .6855) 2 + . . . + (.6791 - .6855) 2 = .00245 4
k=l
and
1'
=
(4 - 1) 2 [1 - (1 - .6855) 2] = 2 1329 • 4 - (4 - 2) (1 - .6855) 2
__
- 1) [.01277 - (2.1329) (.00245)] = 11.4 T = (1(150 6855) 2 Since (p + 1) (p - 2)/2 = 5(2)/2 = 5, the 5% critical value for the test in (8-35) is xh.05) = 11.07. The value of our test statistic is approximately equal to the large sample 5% critical point, so the evidence against H0 (equal
correlations) is strong, but not overwhelming. As we saw in ExaJilple 8.6, the smallest eigenvalues A , A 3 , and A 4 are slightly different, with A 4 being somewhat smaller than the 2other two. Con sequently, with the large sample size in this problem, small differences from • the equal correlation structure show up as statistically significant. 8.6 MONITORING QUALITY WITH PRINCIPAL COMPONENTS
In Section 5.6, we2 introduced multivariate control charts, including the quality ellipse and the T chart. Today, with electronic and other automated methods of data collection, it is not uncommon for data to be collected on 10 or 20 process variables. Major chemical and drug companies report measuring over 100 process variables, along including temperature,process. pressure,Evenconcentration, and weight, at various positions the production with 10 variables to monitor, there are 45 pairs for which to create quality ellipses. Clearly, another approach is required to both visually display important quantities and still have the sensitivity to detect special causes of variation.
Sec.
8.6
Monitoring Quality with Principal Components
491
Checking a Given Set of Measurements for Stability
Let 1 , X 2 , , X11 be a random sample from a multivariate normal distribution withXmean and covariance matrix I. We consider the first two sample principal components, Yj l = e{ (xj - i ) and Yj 2 = e� (xj - i ) . Additional principal com ponents could be considered, but two are easier to inspect visually, and, of any two components the first two explain the largest cumulative proportion of the total sample variance. If a process is stable over time, so that the measured characteristics are influ enced only by variations in common causes, then the values of the first two princi pal components should be stable. Conversely, if the principal components remain stable over time, the common effects that influence the process are likely to remain constant. To monitor quality using principal components, we consider a two-part procedure. The first part of the procedure is to construct an ellipse format chart for the pairs of values C Yn , yi 2 ) for j = 1, 2, . . . , n. By (8-20) , the sample variance of the first principal component y 1 is given by the largest eigenvalue A and the sample variance of the second principal compo nent y2 is the second-largest eigenvalue A 2 • The two sample components are uncor related, so the quality ellipse for n large (see Section 5.6) reduces to the collection of pairs of possible values ( y1 , y2 ) such that A2 A 2 Yz .,:;: xz ( a ) ( 8-36 ) 2 AI A • • •
JL
1 ,
h "
Example 8. 1 0
+
,..
2
�
(An ellipse format chart based on the first two principal components)
Refer to the police department overtime data given in Table 5.8. Table 8.1 contains the five normalized eigenvectors and eigenvalues of the sample covariance matrix S. The first two sample components explain 82% of the total variance. EIGENVECTO RS AN D EIG ENVALUES OF SAM PLE COVARIANC E MATRIX F O R POLICE DEPARTM ENT DATA
TABLE 8. 1
Variable Appearances overtime (x1 ) Extraordinary event (x2 ) Holdover hours (x3 ) COA hours (x ) Meeting hours (x45 ) "
AI
el
ez
e3
e4
es
.046 .629 - .643 .432 -.048 .985 - .077 -.151 - .007 .039 .582 .107 .250 -.392 -.658 .734 .069 .503 .397 -.213 .081 .586 .107 .784 -.155 2,770,226 1,429,206 628,129 221,138 99,824
492
Chap.
8
Principal Components TABLE 8.2 VALUES OF TH E PRI N C I PAL COM PO N ENTS FOR TH E POLICE DEPARTM ENT DATA
Period 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
yj l
2044.9 -2143.7 -177.8 -2186.2 -878.6 563.2 403.1 -1988.9 132.8 -2787.3 283.4 761. 6 -498.3 2366.2 1917.8 2187.7
Yj 2
Yj3
Yj4
Yj s
588.2 425.8 -189. 1 -209.8 -686.2 883.6 -565. 9 -441.5 -464.6 707.5 736.3 38.2 450.5 -184.0 443.7 -325.3 -545.7 115. 7 296.4 437. 5 -1045.4 281.2 620. 5 142.7 66.8 340.6 -135. 5 521.2 -801.8 -1437.3 -148.8 61.6 563.7 125.3 68.2 611.5 7. 8 169. 4 -202.3 -213. 4 3936. 9 -0. 9 276.2 -159. 6 256.0 -2153. 6 -418. 8 28.2 244.7 966.5 -1142.3 182.6 -1193.7 -165.5 270.6 -344. 9 -782.0 -82. 9 -196.8 -89. 9 -373.8 170.1 -84.1 -250.2
The sample values for all five components are displayed in Table 8.2. Let us construct a 95% ellipse format chart using the first two sample principal components and plot the 16 pairs of component values. Since n = 16, we find that x� (.05) = 5.99, and the ellipse becomes z 2 :"' 1 :.... 2 5.99 Al A2 This ellipse is shown in Figure 8.7 on page 493, along with the data. One point is out of control, because the second principal component for this point has a large value. Scanning Table 8.2, we see that this is the value 3936.9 for period 11. According to the entries of e2 in Table 8.1, the second principal component is essentially extraordin'ary event overtime hours. The principal component approach has led us to the same conclusion we came to in Example 5. 9 . • In the event that special causes are likely to produce shocks to the system, the second part of our two-part procedure-that is, a second chart-is required. This chart is created from the information in the principal components not involved in the ellipse format chart. +
�
Sec.
< or:
8 0 0
0 0
§I
• •
,.
• •
Monitoring Quality with Principal Components
8.6
•
+·
•
•
•
493
•
•
• •
Figure 8.7
The 95% control ellipse based on the first two principal components of overtime hours.
Consider the deviation vector X - p., and assume that X is distributed as Even without the normal assumption, j - can be expressed as the sum of its projections on the eigenvectors of I. X JL Np ( JL , I ) .
X - JL = (X - p.) ' e1 e1 + ( X - p.) ' e 2 e 2
or
+ ( X - p.) ' e3e3 + . . . + ( X - p.) ' eP eP
(8-37) where Yi = (X - JL ) ' e i is the population i th principal component centered to have mean 0. The approximation to X - JL by the first two principal components has the form Y1 e1 + Y1 e2 • This leaves an unexplained component of X - JL - Y1 e1 - Y2 e 2
Let E = , e , . . . , ep] be the orthogonal matrix whose columns are the eigen vectors of[e1I. The2 orthogonal transformation of the unexplained part,
494
Chap.
8
Principal Components
=
0 0 Y, 0 0 Yz y3 0 0 3 E' (X - p Y1 e 1 - Y2 e 2 ) y (2) y y 0 0 p p so the last p - 2 principal components are obtained as an orthogonal transforma tion of the approximation errors. Rather than base the T2 chart on the approxi mation errors, we can, equivalently, base it on these last principal components. Recall that Var(Y;) = A; for i = 1, 2, .. , p and Cov(Y;, Yk) 0 for i =F k. Consequently, the statistic Y(2) 'Iv<�•· y(2)Y(Z) • based on the last p 2 population principal components, becomes yz Yj y z (8-38) Y, Yz y
-
-
=
.
- + � + . . + ____!!_ ·
A3
A4
AP
This is just the sum of the squares of p 2 independent standard normal variables and so has a chi-square distribution with p - 2 degrees of freedom. In terms of the sample data, the principal components and eigenvalues must be estimated. Because the coefficients of the linear combinations e; are also estimates, the principal components do not have a normal distribution even when the popula tion is normal. However, it is customary to create a T2 chart based on the statistic -
= which involves the estimated eigenvalues and vectors. Further, it is usual to appeal to the large sample approximation described by (8-38) and set the upper control limit of the chart as UCL = = This statistic is based on high-dimensional data. For example, when = 20 variables are measured, it uses the information in the 18-dimensional space per T2 j
T2 T2
�
Yjz3 A3 A
c2
+
�
YJz4 A4 A
+ ... +
�
YizP AP A
x� _ 2 ( a ) .
p
pendicular to the first two eigenvectors e1 and e2 • Still, this T2 based on the unex plained variation in the original observations is reported as highly effective in picking up special causes of variation. Example 8.1 1
(A T2 chart for the unexplained (orthogonal) overtime hours)
Consider the quality control analysis of the police department overtime hours in Example 8. 1 0. The first part of the quality monitoring procedure, the qual ity ellipse based on the first two principal components, was shown in Figure
Sec.
8.6
Monitoring Quality with Principal Components
495
To illustrate the second step of the two-step monitoring procedure, we create the chart for the other principal components. Since p = 5, this chart is based on 5 2 = 3 dimensions, and the upper control limit is xj (.05) = 7.81. Using the eigenvalues and the values of the principal components, given in Example 8.10, we plot the time sequence of values 2 YA 23 YA ·4 YA 2·s n = +A 3 + +A 4 + +A s 2 where the first value is T = .891 and so on. The T 2 chart is shown in Figure 8.8. 8.7.
-
0
5
10
15 Period
Figure 8.8
A T 2 chart based on the last three principal components of overtime hours.
Since points 12 and 13 exceed or are near the upper control limit, some thing has happened during these periods. We note that they are just beyond the period in which the extraordinary event overtime hours peaked. From Table 8.2, hi is large in period 12, and from Table 8.1, the large coefficients in e 3 belong to legal appearances, holdover, and COA hours. Was there some adjusting of these other categories following the period extraor • dinary hours peaked?
Controlling Future Values
Previously, we considered checking whether a given series of multivariate obser vations was stable by considering separately the first two principal components and then the last p 2. Because the chi-square distribution was used to approximate -
496
Chap.
8
Principal Components
the UCL of the T2 chart and the critical distance for the ellipse format chart, no further modifications are necessary for monitoring future values. Example 8. 1 2
(Control ellipse for future principal com ponents)
In Example 8.10, we determined that case 11 was out of control. We drop this point and recalculate the eigenvalues and eigenvectors based on the covari ance of the remaining 15 observations. The results are shown in Table 8.3.
EIG ENVECTORS AN D EIG ENVALU ES FROM TH E 1 5 STABLE OBSERVATIO N S
TABLE 8.3
el
Appearances overtime (x1 ) Extraordinary event (x2 ) Holdover hours (x3 ) COA hours (x Meeting hours (x45 ))
ez
e3
e4
e5
.304 .530 .479 .049 .629 - .212 .007 - .260 .939 -.078 -.662 - .158 - .437 - .089 .582 .731 - .336 - .291 .503 - .123 .632 - .752 .081 -.159 - .058 A I 2,964,749.9 672,995.1 396,596.5 194,401.0 92,760.3 "
The principal components have changed. The component consisting pri marily of extraordinary event overtime is now the third principal component and is not included in the chart of the first two. Because our initial sample size is only 16, dropping a single case can make a substantial difference. Usu ally, at least 50 or more observations are needed, from stable operation of the process, in order to set future limits. Figure 8.9 on page 497 gives the 99% prediction (8-36) ellipse for future pairs of values for the new first two principal components of overtime. The • 15 stable pairs of principal components are also shown. In some applications of multivariate control in the chemical and pharmaceu tical industries, more than 100 variables are monitored simultaneously. These include numerous process variables as well as quality variables. Typically, the space orthogonal to the first few principal components has a dimension greater than 100 and some of the eigenvalues are very small. An alternative approach (see [13]) to constructing a control chart, that avoids the difficulty caused by dividing a small squared principal component by a very small eigenvalue, has been successfully applied. To implement this approach, we proceed as follows. For each stable observation, take the sum of squares of its unexplained component d�i = (xi
-
x
-
yj l e 1 - Yjz e2 ) ' (xi
Note that, by inserting EE ' = I, we also have
-
x
-
yj l e 1 - Yjz ez )
Sec.
8.6
Monitoring Quality with Principal Components
497
0 0 < ;>-.
N
s
•
0 0 0
•
•
s I
•
+.
•
•
• •
•
•
•
•
Figure 8.9 A 99% ellipse format chart for the first two principal components of future overtime.
which is just the sum of squares of the neglected principal components. Using either form, the d�i are plotted versus j to create a control chart. The lower limit of the chart is 0 and the upper limit is set by approximating the distri bution of d�i as the distribution of a constant c times a chi-square random variable with degrees of freedom. For the chi-square approximation, the constant c and degrees of freedom are chosen to match the sample mean and variance of the d�i ' j = 1, 2, .. , n. In particular, we set v
.
and determine
c = S�z
and The upper control limit is then ex� ( a ) , where a = .05 or .01. 2d�
---=
v
SUPPLEMEN T BA
The G eometry of the Sample Principal Componen t Approxima tion
In this supplement, we shall present interpretations for approximations to the data based on the first r sample principal components. The interpretations of both the p-dimensional scatter plot and the n-dimensional representation rely on the algebraic result that follows. We consider approximations of the form A = [a1 , a 2 , . . . , a11] ' to the mean corrected data matrix (n X p) [x 1 - X, X z - X, . . . , X 11 - x ] '
The error of approximation is quantified as the sum of the np squared errors n n P (SA-l) 2: (xj - i - a) ' (xj - i - aj ) = 2: 2: (xj i - X; - ajY j= l
j= l i= l
Let (nAxp) be any matrix with rank(A) r < min(p, n). Let E, = [e , ez , . . . ' e,] , where e; is the ith eigenvector of s. The error of approxima tion suml of squares in (SA-l) is minimized by the choice Result 8A. 1
A
A =
l i)' J (x 1 ( X z - -x ) I
: _ ,
�
A
( x11 - x )
so the jth column of its transpose A ' is A
498
A
,
�
�
�
A
,
E,E, = [y 1 , y2 , . . . , y,]E,
Supplement 8A The Geometry of the Sample Principal Component Approximation
499
where are the values of the first r sample principal components for the jth unit. Moreover, 11
� (xi - x - a) ' (xi - x - a) = ( n - l) (Ar+ l + i= l S. A r + 1 � • • • � AP "
"
···
+ ip )
Proof. Consider first any A whose transpose A' has columns a_j that are a linear combination of a fixed set of r perpendicular vectors u_1, u_2, ..., u_r, so that U = [u_1, u_2, ..., u_r] satisfies U'U = I. For fixed U, x_j - x̄ is best approximated by its projection on the space spanned by u_1, u_2, ..., u_r (see Result 2A.3), or

(x_j - x̄)'u_1 u_1 + (x_j - x̄)'u_2 u_2 + ··· + (x_j - x̄)'u_r u_r

  = [u_1, u_2, ..., u_r] [ u_1'(x_j - x̄)
                           u_2'(x_j - x̄)
                                ⋮
                           u_r'(x_j - x̄) ]  = UU'(x_j - x̄)     (8A-2)

This follows because, for an arbitrary vector b_j,

x_j - x̄ - Ub_j = x_j - x̄ - UU'(x_j - x̄) + UU'(x_j - x̄) - Ub_j
             = (I - UU')(x_j - x̄) + U(U'(x_j - x̄) - b_j)

so the error sum of squares is

(x_j - x̄ - Ub_j)'(x_j - x̄ - Ub_j) = (x_j - x̄)'(I - UU')(x_j - x̄) + 0 + (U'(x_j - x̄) - b_j)'(U'(x_j - x̄) - b_j)

where the cross product vanishes because (I - UU')U = U - UU'U = U - U = 0. The last term is positive unless b_j is chosen so that b_j = U'(x_j - x̄) and Ub_j = UU'(x_j - x̄) is the projection of x_j - x̄ on the plane. Further, with the choice a_j = Ub_j = UU'(x_j - x̄), (8A-1) becomes

Σ_{j=1}^n (x_j - x̄ - UU'(x_j - x̄))'(x_j - x̄ - UU'(x_j - x̄)) = Σ_{j=1}^n (x_j - x̄)'(I - UU')(x_j - x̄)

  = Σ_{j=1}^n (x_j - x̄)'(x_j - x̄) - Σ_{j=1}^n (x_j - x̄)'UU'(x_j - x̄)     (8A-3)
We are now in a position to minimize the error over choices of U by maximizing the last term in (8A-3). By the properties of trace (see Result 2A.12),

Σ_{j=1}^n (x_j - x̄)'UU'(x_j - x̄) = Σ_{j=1}^n tr[(x_j - x̄)'UU'(x_j - x̄)]

  = Σ_{j=1}^n tr[UU'(x_j - x̄)(x_j - x̄)'] = (n - 1) tr[UU'S] = (n - 1) tr[U'SU]     (8A-4)

That is, the best choice for U maximizes the sum of the diagonal elements of U'SU. From (8-19), selecting u_1 to maximize u_1'Su_1, the first diagonal element of U'SU, gives u_1 = ê_1. For u_2 perpendicular to ê_1, u_2'Su_2 is maximized by ê_2. [See (2-52).] Continuing, we find that Û = [ê_1, ê_2, ..., ê_r] = Ê_r and Â' = Ê_r Ê_r' [x_1 - x̄, x_2 - x̄, ..., x_n - x̄], as asserted.

With this choice the ith diagonal element of Û'SÛ is ê_i'Sê_i = ê_i'(λ̂_i ê_i) = λ̂_i, so tr[Û'SÛ] = λ̂_1 + λ̂_2 + ··· + λ̂_r. Also, Σ_{j=1}^n (x_j - x̄)'(x_j - x̄) = tr[Σ_{j=1}^n (x_j - x̄)(x_j - x̄)'] = (n - 1) tr(S) = (n - 1)(λ̂_1 + λ̂_2 + ··· + λ̂_p). Let U = Û in (8A-3), and the error bound follows. ■
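Result 8A.1 can be checked numerically. The sketch below is illustrative Python code on simulated data (the names n, p, r and the random data are ours); it verifies that the rank-r approximation built from the first r eigenvectors of S attains an error sum of squares of (n - 1)(λ̂_{r+1} + ··· + λ̂_p).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r = 50, 4, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))    # arbitrary correlated data

xbar = X.mean(axis=0)
dev = X - xbar                                           # rows are (x_j - xbar)'
S = np.cov(X, rowvar=False)

eigval, eigvec = np.linalg.eigh(S)
order = np.argsort(eigval)[::-1]                         # decreasing eigenvalues
eigval, eigvec = eigval[order], eigvec[:, order]
E_r = eigvec[:, :r]

A_hat = dev @ E_r @ E_r.T                                # best rank-r approximation of the deviations
error_ss = np.sum((dev - A_hat) ** 2)                    # error sum of squares in (8A-1)
bound = (n - 1) * eigval[r:].sum()                       # (n - 1)(lambda_{r+1} + ... + lambda_p)

print(round(error_ss, 6), round(bound, 6))               # the two numbers agree
```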
The p-Dimensional Geometrical Interpretation
The geometrical interpretations involve the determination of best approximating planes to the p-dimensional scatter plot. The plane through the origin, determined by u_1, u_2, ..., u_r, consists of all points x with

x = b_1 u_1 + b_2 u_2 + ··· + b_r u_r = Ub,   for some b

This plane, translated to pass through a, becomes a + Ub for some b. We want to select the r-dimensional plane a + Ub that minimizes the sum of squared distances Σ_{j=1}^n d_j² between the observations x_j and the plane. If x_j is approximated by a + Ub_j with Σ_{j=1}^n b_j = 0,⁵ then

Σ_{j=1}^n (x_j - a - Ub_j)'(x_j - a - Ub_j)
  = Σ_{j=1}^n (x_j - x̄ - Ub_j + x̄ - a)'(x_j - x̄ - Ub_j + x̄ - a)
  = Σ_{j=1}^n (x_j - x̄ - Ub_j)'(x_j - x̄ - Ub_j) + n(x̄ - a)'(x̄ - a)
  ≥ Σ_{j=1}^n (x_j - x̄ - Ê_r Ê_r'(x_j - x̄))'(x_j - x̄ - Ê_r Ê_r'(x_j - x̄))

by Result 8A.1, since [Ub_1, ..., Ub_n] = A' has rank(A) ≤ r. The lower bound is reached by taking a = x̄, so the plane passes through the sample mean. This plane is determined by ê_1, ê_2, ..., ê_r. The coefficients of ê_k are ê_k'(x_j - x̄) = ŷ_{jk}, the kth sample principal component evaluated at the jth observation.

⁵ If Σ_{j=1}^n b_j = n b̄ ≠ 0, use a + Ub_j = (a + Ub̄) + U(b_j - b̄) = a* + Ub_j*.

The approximating plane interpretation of sample principal components is illustrated in Figure 8.10.

Figure 8.10 The r = 2-dimensional plane that approximates the scatter plot by minimizing Σ_{j=1}^n d_j².

An alternative interpretation can be given. The investigator places a plane through x̄ and moves it about to obtain the largest spread among the shadows of the observations. From (8A-2), the projection of the deviation x_j - x̄ on the plane Ub is v_j = UU'(x_j - x̄). Now, v̄ = 0 and the sum of the squared lengths of the projection deviations

Σ_{j=1}^n v_j'v_j = Σ_{j=1}^n (x_j - x̄)'UU'(x_j - x̄) = (n - 1) tr[U'SU]

is maximized by U = Ê_r. Also, since v̄ = 0,

(n - 1) S_v = Σ_{j=1}^n (v_j - v̄)(v_j - v̄)' = Σ_{j=1}^n v_j v_j'

and this plane also maximizes the total variance tr(S_v).
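A companion numerical check of the projection interpretation (again an illustrative sketch with simulated data of our own choosing): the projections onto the plane spanned by ê_1 and ê_2 have mean zero, and their total variance equals λ̂_1 + λ̂_2, the maximum attainable.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

dev = X - X.mean(axis=0)
S = np.cov(X, rowvar=False)
eigval, eigvec = np.linalg.eigh(S)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

U = eigvec[:, :2]                          # plane through xbar spanned by e_1, e_2
V = dev @ U @ U.T                          # projections v_j = UU'(x_j - xbar)

print(np.allclose(V.mean(axis=0), 0))      # vbar = 0
S_v = np.cov(V, rowvar=False)
print(round(np.trace(S_v), 6), round(eigval[:2].sum(), 6))   # tr(S_v) = lambda_1 + lambda_2
```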
The n-Dimensional Geometrical Interpretation
Let us now consider, by columns, the approximation of the mean-centered data matrix by A. For r = 1, the ith column [x_{1i} - x̄_i, x_{2i} - x̄_i, ..., x_{ni} - x̄_i]' is approximated by a multiple c_i b of a fixed vector b = [b_1, b_2, ..., b_n]'. The square of the length of the error of approximation is

L_i² = Σ_{j=1}^n (x_{ji} - x̄_i - c_i b_j)²

Considering A (n × p) to be of rank one, we conclude from Result 8A.1 that the choice of b proportional to [ŷ_{11}, ŷ_{21}, ..., ŷ_{n1}]' minimizes the sum of squared lengths Σ_{i=1}^p L_i². That is, the best direction is determined by the vector of values of the first principal component. This is illustrated in Figure 8.11(a) on page 503. Note that the longer deviation vectors (the larger s_{ii}'s) have the most influence on the minimization of Σ_{i=1}^p L_i².

If the variables are first standardized, the resulting vector [(x_{1i} - x̄_i)/√s_{ii}, (x_{2i} - x̄_i)/√s_{ii}, ..., (x_{ni} - x̄_i)/√s_{ii}]' has squared length n - 1 for all variables, and each vector exerts equal influence on the choice of direction. [See Figure 8.11(b).]

In either case, the vector b is moved around in n-space to minimize the sum of the squares of the distances Σ_{i=1}^p L_i², where L_i² is the squared distance between [x_{1i} - x̄_i, x_{2i} - x̄_i, ..., x_{ni} - x̄_i]' and its projection on the line determined by b. The second principal component minimizes the same quantity among all vectors perpendicular to the first choice.
Figure 8.11 The first sample principal component, ŷ_1, minimizes the sum of the squares of the distances, L_i², from the deviation vectors, d_i = [x_{1i} - x̄_i, x_{2i} - x̄_i, ..., x_{ni} - x̄_i]', to a line. (a) Principal component of S. (b) Principal component of R.
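The n-dimensional picture can also be illustrated numerically. In the sketch below (simulated data; the helper name total_sq_error is ours), the direction given by the first principal component's score vector yields a total squared error Σ L_i² no larger than that of an arbitrary direction b.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

dev = X - X.mean(axis=0)                     # columns are the deviation vectors d_i in n-space
S = np.cov(X, rowvar=False)
eigval, eigvec = np.linalg.eigh(S)
b_pc = dev @ eigvec[:, np.argmax(eigval)]    # score vector of the first principal component

def total_sq_error(b, D):
    """Sum over columns of the squared distance from each column of D to its projection on span{b}."""
    u = b / np.linalg.norm(b)
    proj = np.outer(u, u @ D)                # projection of every column onto the line
    return np.sum((D - proj) ** 2)

b_random = rng.normal(size=n)
print(total_sq_error(b_pc, dev) <= total_sq_error(b_random, dev))   # True
```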
EXERCISES
8.1. Determine the population principal components Y_1 and Y_2 for the covariance matrix

Also, calculate the proportion of the total population variance explained by the first principal component.

8.2. Convert the covariance matrix in Exercise 8.1 to a correlation matrix ρ.
(a) Determine the principal components Y_1 and Y_2 from ρ and compute the proportion of total population variance explained by Y_1.
(b) Compare the components calculated in Part a with those obtained in Exercise 8.1. Are they the same? Should they be?
(c) Compute the correlations ρ_{Y_1, Z_1}, ρ_{Y_1, Z_2}, and ρ_{Y_2, Z_1}.

8.3. Let

Determine the principal components Y_1, Y_2, and Y_3. What can you say about the eigenvectors (and principal components) associated with eigenvalues that are not distinct?
8.4. Find the principal components and the proportion of the total population variance explained by each when the covariance matrix is

    -1/√2 < ρ < 1/√2
8.5. (a) Find the eigenvalues of the correlation matrix

Are your results consistent with (8-16) and (8-17)?
(b) Verify the eigenvalue-eigenvector pairs for the p × p matrix ρ given in (8-15).

8.6. Data on x_1 = sales and x_2 = profits for the 10 largest U.S. industrial corporations were listed in Exercise 1.4 of Chapter 1. From Example 4.12,

x̄ = [62,309, 2,927]'   and   S = 10^5 × [ 10,005.20   255.76
                                            255.76      14.30 ]

(a) Determine the sample principal components and their variances for these data. (You may need the quadratic formula to solve for the eigenvalues of S.)
(b) Find the proportion of the total sample variance explained by ŷ_1.
(c) Sketch the constant density ellipse (x - x̄)'S⁻¹(x - x̄) = 1.4, and indicate the principal components ŷ_1 and ŷ_2 on your graph.
(d) Compute the correlation coefficients r_{ŷ_1, x_k}, k = 1, 2. What interpretation, if any, can you give to the first principal component?

8.7. Convert the covariance matrix S in Exercise 8.6 to a sample correlation matrix R.
(a) Find the sample principal components ŷ_1, ŷ_2 and their variances.
(b) Compute the proportion of the total sample variance explained by ŷ_1.
(c) Compute the correlation coefficients r_{ŷ_1, z_k}, k = 1, 2. Interpret ŷ_1.
(d) Compare the components obtained in Part a with those obtained in Exercise 8.6(a). Given the original data displayed in Exercise 1.4, do you feel that it is better to determine principal components from the sample covariance matrix or sample correlation matrix? Explain.

8.8. Use the results in Example 8.5.
(a) Compute the correlations r_{ŷ_i, z_k} for i = 1, 2 and k = 1, 2, ..., 5. Do these correlations reinforce the interpretations given to the first two components? Explain.
(b) Test the hypothesis

H_0: ρ = ρ_0 = [ 1  ρ  ρ  ρ  ρ
                 ρ  1  ρ  ρ  ρ
                 ρ  ρ  1  ρ  ρ
                 ρ  ρ  ρ  1  ρ
                 ρ  ρ  ρ  ρ  1 ]

versus

H_1: ρ ≠ ρ_0
at the 5% level of significance. List any assumptions required in carrying out this test.

8.9. (A test that all variables are independent.)
(a) Consider the normal theory likelihood ratio test of H_0: Σ is the diagonal matrix

Σ_0 = [ σ_11    0   ···    0
          0   σ_22  ···    0
          ⋮           ⋱    ⋮
          0     0   ···  σ_pp ],   σ_ii > 0

Show that the test is: Reject H_0 if

Λ = |S|^{n/2} / Π_{i=1}^p s_ii^{n/2} = |R|^{n/2} < c

For a large sample size, -2 ln Λ is approximately χ²_{p(p-1)/2}. Bartlett [3] suggests that the test statistic -2[1 - (2p + 11)/6n] ln Λ be used in place of -2 ln Λ. This results in an improved chi-square approximation. The large sample α critical point is χ²_{p(p-1)/2}(α). Note that testing Σ = Σ_0 is the same as testing ρ = I.

(b) Show that the likelihood ratio test of H_0: Σ = σ²I rejects H_0 if

Λ = |S|^{n/2} / (tr(S)/p)^{np/2} = [ Π_{i=1}^p λ̂_i^{1/p} / ((1/p) Σ_{i=1}^p λ̂_i) ]^{np/2} = [ geometric mean λ̂_i / arithmetic mean λ̂_i ]^{np/2} < c

For a large sample size, Bartlett [3] suggests that -2[1 - (2p² + p + 2)/6pn] ln Λ is approximately χ²_{(p+2)(p-1)/2}. Thus, the large sample α critical point is χ²_{(p+2)(p-1)/2}(α). This test is called a sphericity test, because the constant density contours are spheres when Σ = σ²I.

Hint:
(a) max L(μ, Σ) is given by (5-10), and max L(μ, Σ_0) is the product of the univariate likelihoods, Π_{i=1}^p (2π)^{-n/2} σ̂_ii^{-n/2} exp[-Σ_{j=1}^n (x_{ji} - μ_i)²/2σ̂_ii]. Hence, μ̂_i = (1/n) Σ_{j=1}^n x_{ji} and σ̂_ii = (1/n) Σ_{j=1}^n (x_{ji} - x̄_i)². The divisor n cancels in Λ, so S may be used.
(b) Verify σ̂² = [Σ_{j=1}^n (x_{j1} - x̄_1)² + ··· + Σ_{j=1}^n (x_{jp} - x̄_p)²]/np under H_0. Again, the divisors n cancel in the statistic, so S may be used. Use Result 5.2 to calculate the chi-square degrees of freedom.
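The two statistics in Exercise 8.9 are straightforward to compute. The following sketch is our own code, with an assumed correlation matrix, assumed eigenvalues of S, and an assumed sample size; it evaluates the Bartlett-corrected test statistics and their large-sample critical points.

```python
import numpy as np
from scipy.stats import chi2

# assumed inputs: a sample correlation matrix R, eigenvalues of S, and sample size n
R = np.array([[1.0, 0.4, 0.2],
              [0.4, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
lam = np.array([2.1, 0.8, 0.4])          # hypothetical eigenvalues of S
n, p = 100, 3

# (a) independence: -2 ln Lambda = -n ln|R|, with Bartlett's multiplying factor
stat_a = -(n - (2 * p + 11) / 6.0) * np.log(np.linalg.det(R))
df_a = p * (p - 1) / 2
print("independence:", round(stat_a, 3), "critical point:", round(chi2.ppf(0.95, df_a), 3))

# (b) sphericity: ratio of geometric to arithmetic mean of the eigenvalues
lnLambda = (n / 2.0) * (np.log(lam).sum() - p * np.log(lam.mean()))
stat_b = -2 * (1 - (2 * p**2 + p + 2) / (6.0 * p * n)) * lnLambda
df_b = (p + 2) * (p - 1) / 2
print("sphericity:", round(stat_b, 3), "critical point:", round(chi2.ppf(0.95, df_b), 3))
```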
The following exercises require the use of a computer.
8.10. The weekly rates of return for five stocks listed on the New York Stock Exchange are given in Table 8.4. (See the stock-price data on the disk.)
(a) Construct the sample covariance matrix S, and find the sample principal components in (8-20). (Note that the sample mean vector x̄ is displayed in Example 8.5.)
(b) Determine the proportion of the total sample variance explained by the first three principal components. Interpret these components.
(c) Construct Bonferroni simultaneous 90% confidence intervals for the variances λ_1, λ_2, and λ_3 of the first three population components Y_1, Y_2, and Y_3.
(d) Given the results in Parts a-c, do you feel that the stock rates-of-return data can be summarized in fewer than five dimensions? Explain.

8.11. Consider the census-tract data listed in Table 8.5 on page 508. Suppose the observations on X_5 = median value home were recorded in thousands, rather than ten thousands, of dollars; that is, multiply all the numbers listed in the sixth column of the table by 10.
(a) Construct the sample covariance matrix S for the census-tract data when X_5 = median value home is recorded in thousands of dollars. (Note that this covariance matrix can be obtained from the covariance matrix given in Example 8.3 by multiplying the off-diagonal elements in the fifth column and row by 10 and the diagonal element s_55 by 100. Why?)
(b) Obtain the eigenvalue-eigenvector pairs and the first two sample principal components for the covariance matrix in Part a.
(c) Compute the proportion of total variance explained by the first two principal components obtained in Part b. Calculate the correlation coefficients, r_{ŷ_i, x_k}, and interpret these components if possible. Compare your
TABLE 8.4 STOCK-PRICE DATA (WEEKLY RATE OF RETURN)

Weeks 1-10:
Allied Chemical:  .000000  .027027  .122807  .057031  .063670  .003521  -.045614  .058823  .000000  .006944
Du Pont:          .000000  -.044855  .060773  .029948  -.003793  .050761  -.033007  .041719  -.019417  -.025990
Union Carbide:    .000000  -.003030  .088146  .066808  -.039788  .082873  .002551  .081425  .002353  .007042
Exxon:            .039473  -.014466  .086238  .013513  -.018644  .074265  -.009646  -.014610  .001647  -.041118
Texaco:           -.000000  .043478  .078124  .019512  -.024154  .049504  -.028301  .014563  -.028708  -.024630

Weeks 91-100:
Allied Chemical:  -.044068  .039007  -.039457  .039568  -.031142  .000000  .021429  .045454  .050167  .019108
Du Pont:          .020704  .038540  -.029297  .024145  -.007941  -.020080  .049180  .046375  .036380  -.033303
Union Carbide:    -.006224  .024988  -.065844  -.006608  .011080  -.006579  .006622  .074561  .004082  .008362
Exxon:            -.018518  -.028301  -.015837  .028423  .007537  .029925  -.002421  .014563  -.011961  .033898
Texaco:           .004694  .032710  -.045758  -.009661  .014634  -.004807  .028985  .018779  .009216  .004566
results with the results in Example 8.3. What can you say about the effects of this change in scale on the principal components?

8.12. Consider the air-pollution data listed in Table 1.3. Your job is to summarize these data in fewer than p = 7 dimensions if possible. Conduct a principal component analysis of the data using both the covariance matrix S and the correlation matrix R. What have you learned? Does it make any difference which matrix is chosen for analysis? Can the data be summarized in three or fewer dimensions? Can you interpret the principal components?

8.13. In the radiotherapy data listed in Table 1.5 (see also the radiotherapy data on the disk), the n = 98 observations on p = 6 variables represent patients' reactions to radiotherapy.
(a) Obtain the covariance and correlation matrices S and R for these data.
(b) Pick one of the matrices S or R (justify your choice), and determine the eigenvalues and eigenvectors. Prepare a table showing, in decreasing order of size, the percent that each eigenvalue contributes to the total sample variance.
TABLE 8.5 CENSUS-TRACT DATA
tNotionse:mayObsenotrvatconsionstiftruotme aadjraandomcent censsampluset.racts are likely to be correlated. That is, these 14 observaGiven the results in Part b, decide on the number of important sample principal components. Is it possible to summarize the radiotherapy data with a single reaction-index component? Explain. (d) Prepare a table of the correlation coefficients between each principal component you decide to retain and the original variables. If possible, interpret the components. 8.14. Perform a principal comp0nent analysis using the sample covariance matrix of the sweat data given in E�mmple 5.2 . Construct a Q-Q plot for each of the important principal components. Are there any suspect observations? Explain. 8.15. The four sample standard deviations for the postbirth weights discussed in Example 8.6 are � 32. 9909, � 33.5918, vs;-; 36.5534, and � 37.3517 Use these and the correlations given in Example 8.6 to construct the sample covariance matrix S. Perform a principal component analysis using S. 8.16. Over a period of five years in the 1990s, yearly samples of fishermen on 28 lakes in Wisconsin were asked to report the time they spent fishing and how many of each type of game fish they caught. Their responses were then con verted to a catch rate per hour for (c)
=
x_1 = Bluegill              x_4 = Largemouth bass
x_2 = Black crappie         x_5 = Northern pike
x_3 = Smallmouth bass       x_6 = Walleye

The estimated correlation matrix (courtesy of Jodi Barnet)

          x_1      x_2      x_3      x_4      x_5      x_6
R =  [   1       .4919    .2636    .4653   -.2277    .0652
         .4919   1        .3127    .3506   -.1917    .2045
         .2636   .3127    1        .4108    .0647    .2493
         .4653   .3506    .4108    1       -.2249    .2293
        -.2277  -.1917    .0647   -.2249    1       -.2144
         .0652   .2045    .2493    .2293   -.2144    1     ]
is based on a sample of about 120. (There were a few missing values.)

Fish caught by the same fisherman live alongside of each other, so the data should provide some evidence on how the fish group. The first four fish belong to the centrarchids, the most plentiful family. The walleye is the most popular fish to eat.
(a) Comment on the pattern of correlation within the centrarchid family x_1 through x_4. Does the walleye appear to group with the other fish?
(b) Perform a principal component analysis using only x_1 through x_4. Interpret your results.
(c) Perform a principal component analysis using all six variables. Interpret your results.

8.17. Using the data on bone mineral content in Table 1.6, perform a principal component analysis of S.

8.18. The data on national track records for women are listed in Table 1.7.
(a) Obtain the sample correlation matrix R for these data, and determine its eigenvalues and eigenvectors.
(b) Determine the first two principal components for the standardized variables. Prepare a table showing the correlations of the standardized variables with the components, and the cumulative percentage of the total (standardized) sample variance explained by the two components.
(c) Interpret the two principal components obtained in Part b. (Note that the first component is essentially a normalized unit vector and might measure the athletic excellence of a given nation. The second component might measure the relative strength of a nation at the various running distances.)
(d) Rank the nations based on their score on the first principal component. Does this ranking correspond with your intuitive notion of athletic excellence for the various countries?

8.19. Refer to Exercise 8.18. Convert the national track records for women in Table 1.7 to speeds measured in meters per second. Notice that the records for 800 m, 1500 m, 3000 m, and the marathon are given in minutes. The
marathon is 26.2 miles, or 42,195 meters, long. Perform a principal component analysis using the covariance matrix S of the speed data. Compare the results with the results in Exercise 8.18. Do your interpretations of the components differ? If the nations are ranked on the basis of their score on the first principal component, does the subsequent ranking differ from that in Exercise 8.18? Which analysis do you prefer? Why?

8.20. The data on national track records for men are listed in Table 8.6. (See also the data on national track records for men on the disk.) Repeat the principal component analysis outlined in Exercise 8.18 for the men. Are the results consistent with those obtained from the women's data?
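For Exercises 8.19 and 8.21, the conversion of the records to speeds in meters per second might be sketched as follows (illustrative Python code; only the first two rows of Table 8.6 are used here, and the full analysis would replace them with the complete data matrix).

```python
import numpy as np

distances = np.array([100, 200, 400, 800, 1500, 5000, 10000, 42195])    # event lengths in meters
in_minutes = np.array([False, False, False, True, True, True, True, True])

# first two rows of Table 8.6 (Argentina, Australia); the full analysis uses all nations
records = np.array([
    [10.39, 20.81, 46.84, 1.81, 3.70, 14.04, 29.36, 137.72],
    [10.31, 20.06, 44.84, 1.74, 3.57, 13.28, 27.66, 128.30],
])

seconds = np.where(in_minutes, records * 60.0, records)
speeds = distances / seconds                       # meters per second, one row per nation
print(np.round(speeds, 2))

# with the complete data matrix, the principal components of the speeds come from
# S = np.cov(speeds, rowvar=False) and np.linalg.eigh(S), as in the earlier sketches
```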
NATIONAL TRACK RECORDS FOR MEN
Country Argentina Australia Austria Belgium Bermuda Brazil Burma Canada Chile China Colombia Cook Islands Costa Rica Czechoslovakia Denmark Dominican Republic Finland France German Democratic Republic Federal Republic of Germany Great Britain and Northern Ireland Greece Guatemala Hungary
lOO m (s) 10.39 10. 3 1 10.44 10.34 10.28 10.22 10.64 10. 1 7 10.34 10. 5 1 10.43 12. 1 8 10.94 10.35 10.56 10. 1 4 10.43 10.11 10. 1 2 10. 1 6 10.11 10.22 10.98 10.26
200 m (s) 20.81 20.06 20.81 20.68 20.58 20.43 21.52 20.22 20.80 21.04 21.05 23.20 21. 90 20.65 20.52 20.65 20.69 20.38 20.33 20.37 20.21 20.71 21.82 20.62
400 m (s) 46.84 44.84 46.82 45.04 45.91 45.21 48.30 45.68 46.20 47.30 46.10 52.94 48.66 45.64 45.89 46.80 45.49 45.28 44.87 44.50 44.93 46.56 48.40 46.02
800 m (min) 1.8 1 1.74 1.79 1.73 1.80 1.73 1.80 1.76 1.79 1.81 1.82 2.02 1. 87 1.76 1.78 1. 82 1.74 1.73 1.73 1.73 1.70 1.78 1.89 1.77
1500 m (min) 3.70 3. 57 3.60 3. 60 3.75 3. 66 3. 85 3. 63 3. 7 1 3.73 3. 74 4.24 3. 84 3.58 3. 6 1 3. 82 3. 6 1 3.57 3.56 3.53 3.51 3. 64 3. 80 3. 62
5000 m (min) 14. 04 13.28 13.26 13.22 14.68 13.62 14.45 13. 5 5 13. 6 1 13. 90 13.49 16. 70 14.03 13.42 13. 5 0 14.91 13.27 13.34 13.17 13.21 13. 0 1 14.59 14.16 13.49
10,000 m (min) 29.36 27.66 27.72 27.45 30.55 28.62 30.28 28.09 29.30 29.13 27.88 35.38 28.81 28.19 28.11 31.45 27.52 27.97 27.42 27. 6 1 27. 5 1 28.45 30.11 28.44
Marathon (mins) 137.72 128.30 135.90 129.95 146.62 133. 1 3 139. 95 130.15 134.03 133.53 131.35 164.70 136.58 134.32 130.78 154. 1 2 130.87 132.30 129. 92 132.23 129. 1 3 134.60 139.33 132.58 (continued )
TABLE 8.6 (continued)
Country India Indonesia Ireland Israel Italy Japan Kenya Korea Democratic People ' s Republic of Korea Luxembourg Malaysia Mauritius Mexico Netherlands New Zealand Norway Papua New Guinea Philippines Poland Portugal Rumania Singapore Spain Sweden Switzerland Taipei Thailand Turkey USA USSR Western Samoa
lOO m (s) 10. 60 10.59 10. 6 1 10.71 10.01 10.34 10.46 10.34 10. 9 1 10.35 10.40 11.19 10.42 10.52 10. 5 1 10.55 10.96 10.78 10. 1 6 10.53 10.41 10.38 10.42 10.25 10.37 10. 5 9 10.39 10. 7 1 9.93 10.07 10.82
200 m (s) 21.42 21.49 20.96 21. 00 19.72 20.81 20.66 20.89 21. 94 20.77 20.92 22.45 21.30 20.95 20.88 21.16 21.78 21.64 20.24 21.17 20.98 21.28 20.77 20.61 20.46 21.29 21.09 21.43 19.75 20.00 21.86
400 m (s) 45.73 47.80 46.30 47.80 45.26 45.86 44.92 46.90 47.30 47.40 46.30 47.70 46. 1 0 45. 1 0 46.10 46.71 47.90 46.24 45.36 46.70 45.87 47.40 45.98 45.63 45.78 46.80 47.91 47.60 43.86 44.60 49.00
800 m (min) 1.76 1. 84 1.79 1.77 1.73 1.79 1.73 1.79 1. 85 1.82 1. 82 1.88 1. 80 1.74 1.74 1.76 1.90 1.81 1.76 1.79 1.76 1.88 1.76 1.77 1.78 1.79 1.83 1.79 1.73 1.75 2.02
1500 m (min) 3.73 3. 92 3.56 3.72 3. 60 3. 64 3.55 3.77 3.77 3.67 3. 80 3. 83 3. 65 3.62 3.54 3.62 4.01 3. 83 3.60 3. 62 3.64 3. 89 3.55 3.61 3.55 3.77 3. 84 3. 67 3.53 3.59 4.24
5000 m (min) 13.77 14.73 13.32 13.66 13.23 13.41 13.10 13. 96 14. 1 3 13.64 14.64 15.06 13.46 13. 3 6 13.2 1 13.34 14.72 14.74 13.29 13.13 13.25 15.11 13.31 13.29 13.22 14.07 15.23 13.56 13.20 13.20 16.28
10,000 m (min) 28.81 30.79 27.81 28.93 27.52 27.72 27.38 29.23 29.67 29.08 31.01 31.77 27.95 27. 6 1 27.70 27.69 31.36 30.64 27.89 27.38 27.67 31.32 27.73 27.94 27. 9 1 30.07 32.65 28.58 27.43 27.53 34.71
Marathon (mins) 131.98 148.83 132.35 137.55 131.08 128.63 129.75 136.25 130.87 141.27 154. 1 0 152.23 129.20 129.02 128.98 131.48 148.22 145.27 131.58 128.65 132.50 157.77 131.57 130.63 131.20 139.27 149. 90 131.50 128.22 130.55 161.83
SOURCE: IAAF/ATFS Track and Field Statistics Handbook for the 1984 Los Angeles Olympics.

8.21. Refer to Exercise 8.20. Convert the national track records for men in Table 8.6 to speeds measured in meters per second. Notice that the records for 800 m, 1500 m, 5000 m, 10,000 m, and the marathon are given in minutes. The marathon is 26.2 miles, or 42,195 meters, long. Perform a principal component analysis using the covariance matrix S of the speed data. Compare the results with the results in Exercise 8.20. Which analysis do you prefer? Why?

8.22. Consider the data on bulls in Table 1.8. Utilizing the seven variables YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt, perform a principal component analysis using the covariance matrix S and the correlation matrix R. Your analysis should include the following:
(a) Determine the appropriate number of components to effectively summarize the sample variability. Construct a scree plot to aid your determination.
(b) Interpret the sample principal components.
(c) Do you think it is possible to develop a "body size" or "body configuration" index from the data on the seven variables above? Explain.
(d) Using the values for the first two principal components, plot the data in a two-dimensional space with ŷ_1 along the vertical axis and ŷ_2 along the horizontal axis. Can you distinguish groups representing the three breeds of cattle? Are there any outliers?
(e) Construct a Q-Q plot using the first principal component. Interpret the plot.

8.23. Refer to Example 8.10 and the data in Table 5.8, page 258. Add the variable x_6 = regular overtime hours whose values are (read across)
6187   7336   6988   6964   8425   6778   5922   7307
7679   8259   10954  9353   6291   4969   4825   6019
and redo Example 8.10.

8.24. Refer to the police overtime hours data in Example 8.10. Construct an alternate control chart, based on the sum of squares d²_{Uj}, to monitor the unexplained variation in the original observations summarized by the additional principal components.

REFERENCES
1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Anderson, T. W. "Asymptotic Theory for Principal Components Analysis." Annals of Mathematical Statistics, 34 (1963), 122-148.
3. Bartlett, M. S. "A Note on Multiplying Factors for Various Chi-Squared Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.
4. Dawkins, B. "Multivariate Analysis of National Track Records." The American Statistician, 43 (1989), 110-115.
5. Girschick, M. A. "On the Sampling Theory of Roots of Determinantal Equations." Annals of Mathematical Statistics, 10 (1939), 203-224.
6. Hotelling, H. "Analysis of a Complex of Statistical Variables into Principal Components." Journal of Educational Psychology, 24 (1933), 417-441, 498-520.
7. Hotelling, H. "The Most Predictable Criterion." Journal of Educational Psychology, 26 (1935), 139-142.
8. Hotelling, H. "Simplified Calculation of Principal Components." Psychometrika, 1 (1936), 27-35.
9. Hotelling, H. "Relations between Two Sets of Variates." Biometrika, 28 (1936), 321-377.
10. Jolicoeur, P. "The Multivariate Generalization of the Allometry Equation." Biometrics, 19 (1963), 497-499.
11. Jolicoeur, P., and J. E. Mosimann. "Size and Shape Variation in the Painted Turtle: A Principal Component Analysis." Growth, 24 (1960), 339-354.
12. King, B. "Market and Industry Factors in Stock Price Behavior." Journal of Business, 39 (1966), 139-190.
13. Kourti, T., and J. MacGregor. "Multivariate SPC Methods for Process and Product Monitoring." Journal of Quality Technology, 28 (1996), 409-428.
14. Lawley, D. N. "On Testing a Set of Correlation Coefficients for Equality." Annals of Mathematical Statistics, 34 (1963), 149-151.
15. Maxwell, A. E. Multivariate Analysis in Behavioural Research. London: Chapman and Hall, 1977.
16. Rao, C. R. Linear Statistical Inference and Its Applications (2d ed.). New York: John Wiley, 1973.
17. Rencher, A. C. "Interpretation of Canonical Discriminant Functions, Canonical Variates and Principal Components." The American Statistician, 46 (1992), 217-225.
9
Factor Analysis and Inference for Structured Covariance Matrices 9. 1 INTRODUCTION
Factor analysis has provoked rather turbulent controversy throughout its history. Its modern beginnings lie in the early 20th-century attempts of Karl Pearson, Charles Spearman, and others to define and measure intelligence. Because of this early association with constructs such as intelligence, factor analysis was nurtured and developed primarily by scientists interested in psychometrics. Arguments over the psychological interpretations of several early studies and the lack of powerful computing facilities impeded its initial development as a statistical method. The advent of high-speed computers has generated a renewed interest in the theoreti cal and computational aspects of factor analysis. Most of the original techniques have been abandoned and early controversies resolved in the wake of recent devel opments. It is still true, however, that each application of the technique must be examined on its own merits to determine its success. The essential purpose of factor analysis is to describe, if possible, the covari ance relationships among many variables in terms of a few underlying, but unob servable, random quantities called factors. Basically, the factor model is motivated by the following argument: Suppose variables can be grouped by their correlations. That is, suppose all variables within a particular group are highly correlated among themselves, but have relatively small correlations with variables in a different group. Then it is conceivable that each group of variables represents a single under lying construct, or factor, that is responsible for the observed correlations. For example, correlations from the group of test scores in classics, French, English, mathematics, and music collected by Spearman suggested an underlying "intelli gence" factor. A second group of variables, representing physical-fitness scores, if available, might correspond to another factor. It is this type of structure that fac tor analysis seeks to confirm. 514
Sec.
9.2
The Orthogonal Factor Model
51 5
Factor analysis can be considered an extension of principal component analy sis. Both can be viewed as attempts to approximate the covariance matrix I. How ever, the approximation based on the factor analysis model is more elaborate. The primary question in factor analysis is whether the data are consistent with a pre scribed structure. 9.2 THE ORTHOGONAL FACTOR MODEL
The observable random vector X, with p components, has mean p, and covariance matrix I. The factor model postulates that X is linearly dependent upon a few unobservable random variables F1 , F2 , . . . , F111 , called common factors, and p addi tional1 sources of variation c: 1 , c:2 , . . . , c:P , called errors or, sometimes, specific fac tors. In particular, the factor analysis model is xl - fL 1 x2 fL 2
fl l FI + e 1 2 F2 + f21 F1 + e22F2 +
+ eJ mFm + c: l + e2 m Fm + c: 2
xp
fP 1 F1 + fp 2 F2 +
+ ep m Fm + c:P
J.Lp
(9-1)
or, in matrix notation, X - p, (p X J )
L
F
(p X m ) (m X I )
+
E
(p X 1 )
(9-2)
The coefficient f;i is called the loading of the ith variable on the jth factor, so the matrix L is the matrix of factor loadings. Note that the ith specific factor is associated only with the ith response X;. The p deviations X1 - J.L 1 , X2 - J.Lz , . . . , XP - J.Lp are expressed in terms of p + m random vari ables , F2 , . . . , Fn, c: 1 , c:2 , . . . , c:P which are unobservable. This distinguishes the factor F1model of (9-2) from the multivariate regression model in (7-26), in which the independent variables [whose position is occupied by F in (9-2)] can be observed. With so many unobservable quantities, a direct verification of the factor model from observations on XI , x2 , . . . , xp is hopeless. However, with some addi tional assumptions about the random vectors F and e, the model in (9-2) implies certain covariance relationships, which can be checked. e;
1 As Maxwell [22] points out, in many investigations the e , tend to be combinations of measure ment error and factors that are uniquely associated with the individual variables.
516
Chap.
9
Factor Analysis an d Inference for Structured Covariance Matrices
We assume that E(F) (m0X Cov(F) E[FF'] =
E(e)
0
(p X I )
'
I
=
'
I)
Cov(e) E[ee'] =
(m X m)
[!' 1 J 0
=
'II
( p X p)
1/Jz
(9-3)
0
and that F and e are independent, so Cov(e, F) E(eF') (p X0m) These assumptions and the relation in (9-2) constitute the orthogonal factor model.2 =
·
ORTHOGGNAL 'FAC:rrali N1boE.t·· WITH
m
COMMON FACTORS
The orthogonal factor model implies a covariance structure for X. From the model in (9-4), 2 Allowing the factors F to be correlated so that Cov (F) is not diagonal gives the oblique factor model. The oblique model presents some additional estimation difficulties and will not be discussed in this book. (See (20].)
Sec.
(X - �-t) (X - �-t ) '
9.2
The Orthogonal Factor Model
=
(LF + e) (LF + e) '
=
(LF + e) ( (LF) ' + e' )
=
LF (LF) ' + e (LF) ' + LFe' + ee'
51 7
so that I
=
Cov (X)
=
LE (FF' ) L' + E ( eF' ) L' + LE (Fe' ) + E ( ee' )
=
LL' + 'II
according to (9-3). Also, by the model in Cov (X, F) = E (X - �-t) F'
=
E (X - �-t) (X - �-t) '
(9-4), (X - IL ) F' =
= (LF + e) F' LE (FF' ) + E ( eF' ) = L.
=
LFF' + eF', so
COVARIANCE STRUCTURE FOR T H E ORTHOGONAL FACTOR MODEL
l. Cov (X)
=
LL' + 'II
or
2.
Cov (X, F)
or
=
L
(9-5)
The model X - /L = LF + e is linear in the common factors. If the p responses X are, in fact, related to underlying factors, but the relationship is non linear, such as in X1 - f.J- 1 = f 11 F1 F3 + t: 1 , X2 - f.J-2 = f2 1 F2F3 + t: 2 , and so forth, then the covariance structure LL' + 'II given by (9-5) may not be adequate. The very important assumption of linearity is inherent in the formulation of the traditional factor model. That portion of the variance of the ith variable contributed by the m common factors is called the ith communality. That portion of Var (X; ) = u;; due to the specific factor is often called the uniqueness, or specific variance. Denoting the ith communality by hf , we see from (9-5) that
518
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
Var (X; ) or
=
+ specific variance
communality
and
= I = LL' '\ft
(9-6)
1, 2, . . . , p
i
The ith communality is the sum of squares of the loadings of the ith variable on the m common factors. Example 9. 1
[ ] I= ]=[ ][
(Verifying the relation
Consider the covariance matrix
[
19 30 2 12
The equality
or
+
19 30 2 12 30 57 5 23 2 5 38 47 12 23 47 68
=
4 7 -1 1
1 2 6 8
for two factors)
30 2 12 57 5 23 5 38 47 23 47 68
;
7 -1 2 6
I = LL' '\ft I +
may be verified by matrix algebra. Therefore, by an m 2 orthogonal factor model. Since
L= '\ft=
has the structure produced
Sec.
The Orthogonal Factor Model
9.2
(9-6), e?, + er2
519
the communality of X1 is, from
2 2 = 4 + 1 hf = and the variance of X1 can be decomposed as a" =
(C r, + C fz ) + 1/11
=
=
17
h f + 1/1 1
or
19 variance
2
+ =
communality
17 + 2
+ specific variance
A similar breakdown occurs for the other variables.
•
The factor model assumes that the p + p (p - 1) /2 = p (p + 1) /2 vari ances and covariances for X can be reproduced from the pm factor loadings f;i and the p specific variances 1/J;. When m = p, any covariance matrix � can be repro duced exactly as LL' [see (9-11)], so W can be the zero matrix. However, it is when m is small relative to p that factor analysis is most useful. In this case, the factor model provides a "simple" explanation of the covariation in X with fewer para meters than the p (p + 1) /2 parameters in �- For example, if X contains p = 12 variables, and the factor model in (9-4) with m = 2 is appropriate, then the p (p + 1)/2 = 12(13)/2 = 78 elements of � are described in terms of the mp + p = 12(2) + 12 = 36 parameters eii and 1/J; of the factor model. Unfortunately for the factor analyst, most covariance matrices cannot be fac tored as LL' + 'IT, where the number of factors m is much less than p. The fol lowing example demonstrates one of the problems that can arise when attempting to determine the parameters f;i and 1/J; from the variances and covariances of the observable variables. Example 9.2
(Nonexistence of a proper solution)
[ ]
Let p = 3 and m = 1, and suppose the random variables X1 , X2 , and X3 have the positive definite covariance matrix
1 .9 .7 .9 1 .4 .7 .4 1
� Using the factor model in
(9-4), we obtain xl
-
JL1
X2 f.Lz X3 - f.L3 -
= = =
C 1 1 F1 + e 1 e2 1 F1 + 62 f3 1 F1 + e3
520
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
The covariance structure in (9-5) implies that I
or
= LL' + 'II
.90 = 1=
e 11 e2 , ei 1 + 1/Jz
The pair of equations .70 = .40 =
implies that
.70 = e 1 1 e31 .40 = e2 1 e31 1 = ej, + 1/13
e 1 ,e31 ezl e31
Substituting this result for e2 1 in the equation .90 = £1 1 £2 1 yields ef 1 = 1.575, or e,1 = ± 1.255. Since Var(F1 ) = 1 (by assumption) and Var(X ) = 1 , e1 = Cov(X , F1 ) = Corr(X1 , F1 ). Now, a correlation coef ficient 1cannot be 1greater than1 unity (in absolute value), so, from this point of view, I el l I = 1.255 is too large. Also, the equation 1 = e? 1 + 1/11 , Or 1/11 = 1 - e ? 1 gives 1/1, = 1 - 1.575 = - .575 which is unsatisfactory, since it gives a negative value for Var(e1 ) = I/J1 • Thus, for this example with m = 1, it is possible to get a unique numer ical solution to the equations I = LL' + However, the solution is not consistent with the statistical interpretation of the coefficients, so it is not a • proper solution. When > 1, there is always some inherent ambiguity associated with the factor model.= To see this, let T be any m m orthogonal matrix, so that TT' = T'T I. Then the expression in (9-2) can be written '11 .
m
X
X
-
p,
= LF + e = LTT'F + e = L * F * + e
(9-7)
Sec.
9. 3
Methods of Estimation
521
where L*
= LT and F * = T'F
Since E (F * )
= T' E (F) = 0
and Cov (F * )
= T' Cov (F) T = T'T =
I ( m x m)
it is impossible, on the basis of observations on X, to distinguish the loadings L from the loadings L * . That is, the factors F and F * = T'F have the same statisti cal properties, and even though the loadings L * are, in general, different from the loadings L, they both generate the same covariance matrix �. That is, (9-8) � = LL' + 'II = LTT'L' + 'II = (L * ) (L * ) ' + 'II This ambiguity provides the rationale for "factor rotation," since orthogonal matri ces correspond to rotations (and reflections) of the coordinate system for X. Factor Ioa:dings L the loadings '
• ·
'
'
are
both give�the �same elements ofLL' =
'
determined
only
L * = LT
'
"
up to and
' an orthog€ma1 matrix
T.
L
Thus, (9-9)
repre��ntation. The ��mmunalities, �\t�n by the diag�al (L *)(L *)', are also tmaffected by the choice
of T.
:
The analysis of the factor model proceeds by imposing conditions that allow one to uniquely estimate L and 'fl. The loading matrix is then rotated (multiplied by an orthogonal matrix), where the rotation is determined by some "ease-of-inter pretation" criterion. Once the loadings and specific variances are obtained, factors are identified, anq estimated values for the factors themselves (called factor scores ) are frequently constructed. 9.3 METHODS OF ESTIMATION
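The rotation ambiguity is easy to verify numerically. In the sketch below (hypothetical loadings and specific variances of our own choosing, not values from the text), L and L* = LT reproduce exactly the same matrix LL' + Ψ for any orthogonal T.

```python
import numpy as np

rng = np.random.default_rng(4)
p, m = 5, 2
L = rng.normal(size=(p, m))                      # hypothetical loading matrix
Psi = np.diag(rng.uniform(0.1, 0.5, size=p))     # hypothetical specific variances

# build an orthogonal m x m matrix T from a QR decomposition
T, _ = np.linalg.qr(rng.normal(size=(m, m)))

L_star = L @ T                                   # rotated loadings
Sigma = L @ L.T + Psi
Sigma_star = L_star @ L_star.T + Psi

print(np.allclose(Sigma, Sigma_star))            # True: the covariance structure is unchanged
print(np.allclose(T @ T.T, np.eye(m)))           # T is orthogonal
```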
Given observations x 1 , x2 , , x n on p generally correlated variables, factor analy sis seeks to answer the question, Does the factor model of (9-4 ), with a small num ber of factors, adequately represent the data? In essence, we tackle this statistical model-building problem by trying to verify the covariance relationship in (9-5). • • •
522
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
The sample covariance matrix S is an estimator of the unknown population covariance matrix �- If the off-diagonal elements of S are small or those of the sample correlation matrix R essentially zero, the variables are not related, and a factor analysis will not prove useful. In these circumstances, the specific factors play the dominant role, whereas the major aim of factor analysis is to determine a few important common factors. If � appears to deviate significantly from a diagonal matrix, then a factor model can be entertained, and the initial problem is one of estimating the factor loadings l;i and specific variances 1/J; · We shall consider two of the most popular methods of parameter estimation, the principal component (and the related princi pal factor) method and the maximum likelihood method. The solution from either method can be rotated in order to simplify the interpretation of factors, as described in Section 9.4. It is always prudent to try more than one method of solu tion; if the factor model is appropriate for the problem at hand, the solutions should be consistent with one another. Current estimation and rotation methods require iterative calculations that must be done on a computer. Several computer programs are now available for this purpose. The Principal Component (and Principal Factor) Method
The spectral decomposition of (2-20) provides us with one factoring of the covari ance matrix �- Let � have eigenvalue-eigenvector pairs (A.; , e; ) with A 1 ;;;. A 2 ;;;. · · · ;;;. AP ;;;. 0. Then � = A 1 e 1 e; � �
+ A 2 e 2 e� + � �
[ v A 1 e 1 l v A2 e2 l
···
···
+ APePe;
� �
l
v
AP e ] p
� e� -\.!A;-;;-· -----------·
(9-10)
This fits the prescribed covariance structure for the factor analysis model having as many factors as variables (m = p) and specific variances 1/J; = 0 for all i. The load ing matrix has jth column given by � ei . That is, we can write L
L'
+ 0 = LL'
(9-11)
Apart from the scale factor � ' the factor loadings on the jth factor are the coef ficients for the jth principal component of the population. Although the factor analysis representation of � in (9-11) is exact, it is not particularly useful: It employs as many common factors as there are variables and does not allow for any variation in the specific factors e in (9-4). We prefer mod(p X p)
(p X p ) ( p X p )
(p X p )
Sec.
9. 3
Methods of Estimation
523
els that explain the covariance structure in terms of just a few common factors. One approach, when the last p - m eigenvalues are small, is to neglect the contribu tion of A m + l e m + l e�, + l + · · · + AP eP e; to I in (9-10). Neglecting this contribution, we obtain the approximation
L
L'
(p X m) (m Xp )
(9-12)
The approximate representation in (9-12) assumes that the specific factors e in (9-4) are of minor importance and can also be ignored in the factoring of I. If spe cific factors are included in the model, their variances may be taken to be the diag onal elements of I - LL', where LL' is as defined in (9-12). Allowing for specific factors, we find that the approximation becomes
I = LL' +
'}I
(9-13)
where "'i
=
(J"ii
m
- � e� for i = 1, 2, . . . ' p.
[xn ] [:XI ] [xil -:XI ]
j=l
To apply this approach to a data set x 1 , x 2 , , x n , it is customary first to center the observations by subtracting the sample mean i. The centered observations • • .
.,
-
.
�
�: r :: � ::
(xxi1 -:Xt )
j
= 1 , 2, . . . , n
(9-14)
have the same sample covariance matrix S as the original observations. In cases where the units of the variables are not commensurate, it is usually desirable to work with the standardized variables
zi
v's l l
=
( j z - Xz )
VSzz
j
= 1, 2, . . . , n
524
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
whose sample covariance matrix is the sample correlation matrix R of the obser vations x 1 , x 2 , . . . , x n . Standardization avoids the problems of having one variable with large variance unduly influencing the determination of factor loadings. The representation in (9-13), when applied to the sample covariance matrix S or the sample correlation matrix R, is known as the principal component solution. The name follows from the fact that the factor loadings are the scaled coefficients of the first few sample principal components. (See Chapter 8.) ,n:\. PRI NClPAL C@MPONENT SQLUTION OF
THfFACTOR MODEL
1il;b prifi�p!n ooitlponell.fiactor:iiklysisbi the s1fropie c��ariaric�· mattbt.s s�ecified in
tenlls
{�£;� eph;j�!here 1ll: ;;.
A 2 ,�
factors. Then the matrix of estimated • • •
:· ;!';(.,
:"�'>> '•
��
' � ' ;,:, · · ··.> -
,;}�:0
�
Ll/ ' . �Q '
.
The
.;.<�, ,
.'
. ,·;,·��' ,! ,
giveri by
.,
,
diagonal elements of the
, l!r , ,�
are estimated as
principal
�alysis,
of the
dmtri:x is: �btained:by staJ!ting with R intjlace of.'S, ·.·
is
• • •
' �.
�={1 f. }]
Commuri lities
.
factor lo adfugs {iii}
Th e estimate_g_specific variances are provided by the
�ittrix s,�:�
is
(A 1 , e 1 ), (A 2 , e2 )� �J?h · Le�;�. < ;R;,i�e th� 1.tlumbe�:i9f COrt}�on
of i1s �igenval�e-eigenvector pairs
component . factor
sample
correlation
; �;
For the principal component solution, the estimated loadings for a given fac tor do not change as the number of factors is increased. For example, if m = 1, i = [ � e t l and if m = 2, i = [ � e 1 i � e2 ] , where (A 1, e1 ) and (A z , ez ) are the first two eigenvalue-eigenvector pairs for S (or R). By the definition of �i ' the diagonal elements of S are equal to the diagonal elements of LL ' + iii . However, the off-diagonal elements of S are not usually reproduced by LL ' + iii . How, then, do we select the number of factors m ? If the number of common factors is not determined by a priori considera tions, such as by theory or the work of other researchers, the choice of m can be
Sec. 9.3 Methods of Estimation
525
based on the estimated eigenvalues in much the same manner as with principal components. Consider the residual matrix
s - (LL'
+
'ii )
(9-18)
resulting from the approximation of S by the principal component solution. The diagonal elements are zero, and if the other elements are also small, we may sub jectively take the m factor model to be appropriate. Analytically, we have (see Exercise 9.5) Sum of squared entries of (S - (LL'
+
'ii ) )
::;:;;
A! + t +
···
+ .A;
(9-19)
Consequently, a small value for the sum of the squares of the neglected eigenval ues implies a small value for the sum of the squared errors of approximation. Ideally, the contributions of the first few factors to the sample variances of the variables should be large. The contribution to the sample variance s;; from the first common factor is tf1 . The contribution to the total sample variance, s1 1 + s22 + · · · + sPP = tr (S), from the first common factor is then
(
-z + - + f 1 1 ti1
) { .
···
+ -epz l
=
... ... �
�
( V A 1 e1 ) ' ( V A 1 e1 )
=
A
A1
since the eigenvector e 1 has unit length. In general, Proportion of total samp Ie vanance due to jth factor
=
s u + szz + Aj A
p
···
+ Spp for a factor analysis of S for a factor analysis of R (9-20)
Criterion (9-20) is frequently used as a heuristic device for determining the appro priate number of common factors. The number of common factors retained in the model is increased until a "suitable proportion" of the total sample variance has been explained. Another convention, frequently encountered in packaged computer pro grams, is to set m equal to the number of eigenvalues of R greater than one if the sample correlation matrix is factored, or equal to the number of positive eigenval ues of S if the sample covariance matrix is factored. These rules of thumb should not be applied indiscriminately. For example, m = p if the rule for S is obeyed, since all the eigenvalues are expected to be positive for large sample sizes. The best approach is to retain few rather than many factors, assuming that they provide a satisfactory interpretation of the data and yield a satisfactory fit to S or R. Example 9.3
(Factor analysis of consumer-preference data)
In a consumer-preference study, a random sample of customers were asked to rate several attributes of a new product. The responses, on a 7-point
526
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
semantic differential scale, were tabulated and the attribute correlation matrix constructed. The correlation matrix is presented next:
Attribute (Variable) Taste Good buy for money Flavor Suitable for snack Provides lots of energy
1 2 3 4 5
4 5 1 3 2 1 .00 .02 ® .42 .01 .02 1.00 .13 .71 @ .96 .13 1.00 .50 .11 .42 .71 .50 1 .00 @J .01 .85 .11 .79 1 .00
It is clear from the circled entries in the correlation matrix that variables 1 and 3 and variables 2 and 5 form groups. Variable 4 is "closer" to the (2, 5) group than the (1, 3) group. Given these results and the small number of vari ables, we might expect that the apparent linear relationships between the variables can be explained in terms of, at most, two or three common factors. The first two eigenvalues, A 1 = 2.85 and A 2 = 1 .81 , of R are the only eigenvalues greater than unity. Moreover, m = 2 common factors will account for a cumulative proportion A
A1
; A 2 = 2.85 ; 1.81 = .93 A
of the total (standardized) sample variance. The estimated factor loadings, communalities, and specific variances, obtained using (9-15), (9-16), and (9-17), are given in Table 9.1. Now,
LL'
+ '\fl =
+
.56 .82 .78 - .53 .65 .75 .94 - .10 .80 - .54 .02 0 0 0 0
0 .12 0 0 0
0 0 .02 0 0
[ .56.82 0 0 0 .11 0
.78 .65 .94 .80 - .53 .75 - .10 - .54 0 0 0 0 .07
1.00
0.1 1.00
]
.97 .11 1 .00
.44 .79 .53 1 .00
.00 .91 .11 .81 1 .00
nearly reproduces the correlation matrix R. Thus, on a purely descriptive basis, we would judge a two-factor model with the factor loadings displayed in Table 9.1 as providing a good fit to the data. The communalities (.98, .88, .98, .89, .93) indicate that the two factors account for a large percentage of the sample variance of each variable.
Sec.
9.3
Methods of Estimation
527
TABLE 9. 1
Estimated factor loadings e ij = vr A ; eij F2 PI -
Variable 1. Taste 2. Good buy for money 3. Flavor 4. Suitable for snack 5. Provides lots of energy
Communalities h;- 2
Specific variances
;r,i
=1
.56
.82
.98
.02
.78 .65
- .53 .75
.88 .98
.12 .02
.94
- .11
.89
.11
.80
- .54
.93
.07
Eigenvalues
2.85
1.81
Cumulative proportion of total ( standardized) sample variance
.571
.932
-
hf
We shall not interpret the factors at this point. As we noted in Section 9.2, the factors ( and loadings ) are unique up to an orthogonal rotation. A rotation of the factors often reveals a simple structure and aids interpretation. We shall consider this example again ( see Example 9.9 and Panel 9.1) after • factor rotation has been discussed. Example 9.4 (Factor analysis of stock-price data)
Stock-price data consisting of n = 100 weekly rates of return on p = 5 stocks were introduced in Example 8.5. In that example, the first two sample prin cipal components were obtained from R. Taking m = 1 and m = 2, we can easily obtain principal component solutions to the orthogonal factor model. Specifically, the estimated factor loadings are the sample principal component coefficients ( eigenvectors of R), scaled by the square root of the correspond ing eigenvalues. The estimated factor loadings, communalities, specific vari ances, and proportion of total ( standardized ) sample variance explained by each factor for the m = 1 and m = 2 factor solutions are displayed in Table 2_.2. TE e co�munalities are given by (9-17). So, for example, with m = 2, h� = e�1 + e�2 = (.783) 2 + (- .217) 2 = .66.
528
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
TABLE 9.2
Two-factor solution
One-factor solution
Ft
Specific variances � � 1/1; = 1 - h ;2
Allied Chemical Du Pont Union Carbide Exxon Texaco
.783 .773 .794 .713 .712
.39 .40 .37 .49 .49
Cumulative proportion of total (standardized) sample variance explained
.571
Estimated factor loadings Variable 1. 2. 3. 4. 5.
Ft
Fz
Specific variances � � 1/1; = 1 - h;2
.783 .773 .794 .713 .712
- .217 - .458 - .234 .472 .524
.34 .19 .31 .27 .22
.571
.733
Estimated factor loadings
The residual matrix corresponding to the solution for m = 2 factors is
�� � R - LL' - W =
- .127 0 0 - .127 - .164 - .122 .055 - .069 .017 .012
.017 - .164 - .069 .012 .055 - .122 0 - .019 - .017 - .019 0 - .232 0 .232 - .017
The proportion of the total variance explained by the two-factor solution is appreciably larger than that for the one-factor solution. However, for m = 2, LL' produces numbers that are, in general, larger than the sample correla tions. This is particularly true for r4 5 • It seems fairly clear that the first factor, F1 , represents general economic conditions and might be called a market factor. All of the stocks load highly on this factor, and the loadings are about equ,a l. The second factor contrasts the chemical stocks with the oil stocks. (The chemicals have relatively large negative loadings, and the oils have large positive loadings, on the factor.) Thus, F2 seems to differentiate stocks in different industries and might be called an industry factor. To summarize, rates of return appear to be deter mined by general market conditions and activities that are unique to the dif ferent industries, as well as a residual or firm specific factor. This is essentially the conclusion reached by an examination of the sample principal compo nents in Example 8.5. •
Sec. 9.3 Methods of Estimation
529
A Modified Approach-the Principal Factor Solution
A modification of the principal component approach is sometimes considered. We describe the reasoning in terms of a factor analysis of R, although the procedure is also appropriate for S. If the factor model p = LL' + 'if is correctly specified, the m common factors should account for the off-diagonal elements of p, as well as the communality portions of the diagonal elements Pii
= 1 = hf + 1/J;
If the specific factor contribution 1/J; is removed from the diagonal or, equivalently, the 1 replaced by hl, the resulting matrix is p 'if = LL' . Suppose, now, that initial estimates lfJ;* of the specific variances are available. Then replacing the ith diagonal element of R by h;* 2 = 1 1/J( , we obtain a "reduced" sample correlation matrix -
R,
�
l
h r2 r12
r1 p
r1: P' � r2P'
hP
'
't
l
-
Now, apart from sampling variation, all of the elements of the reduced sample cor relation matrix R, should be accounted for by the m common factors. In particu lar, R, is factored as
R
r =
L* L* ' r
r
where L� = {t'�l are the estimated loadings. The principal factor method of factor analysis employs the estimates
L* = [�1 eA *l 'i �2 e' 2* 'i . . . 'i �lll e' lll* ] r
1
Ill
-
" e *.2
LJ j= l
(9-21)
(9-22)
I)
where ( A ;*, er), i = 1, 2, . . . , m are the ( largest ) eigenvalue-eigenvector pairs determined from R,. In turn, the communalities would then be ( re ) estimated by Ill
'h * 2 = " e *. z .LJ l
j= l
l]
(9-23)
The principal factor solution can be obtained iteratively, with the communality esti mates of (9-23) becoming the initial estimates for the next stage. In the spirit of the principal component solution, consideration of the estimated eigenvalues At , Ai , . . . , A; helps determine the number of common factors to retain. An added complication is that now some of the eigenvalues may be neg ative, due to the use of initial communality estimates. Ideally, we should take the
530
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
number of common factors equal to the rank of the reduced population matrix. Unfortunately, this rank is not always well determined from R r , and some judg ment is necessary. Although there are many choices for initial estimates of specific variances, the most popular choice, when one is working with a correlation matrix, is 1/Jt = 1/rii , where r ii is the ith diagonal element of R - 1 . The initial communality estimates then become
h *2 = 1 I
-
,/,* 'f'1 = 1
1
(9-24)
which is equal to the square of the multiple correlation coefficient between Xi and the other p - 1 variables. The relation to the multiple correlation coefficient means that ht 2 can be calculated even when R is not of full rank. For factoring S, the initial specific variance estimates use s ii , the diagonal elements of s- I . Further discussion of these and other initial estimates is contained in [12]. Although the principal component method for R can be regarded as a prin cipal factor method with initial communality estimates of unity, or specific vari ances equal to zero, the two are philosophically and geometrically different. (See [12].) In practice, however, the two frequently produce comparable factor loadings if the number of variables is large and the number of common factors is small. We do not pursue the principal factor solution, since, to our minds, the solu tion methods that have the most to recomend them are the principal component method and the maximum likelihood method, which we discuss next. The Maximum Likelihood Method
If the common factors F and the specific factors e can be assumed to be normally distributed, then maximum likelihood estimates of the factor loadings and specific variances may be obtained. When Fi and ei are jointly normal, the observations Xi - p, = LFi + ei are then normal, and from (4-16) , the likelihood is
L ( p, , I ) = (2 7Tf q' I I � - � e - G)tr [ I-' (� (x; - x) (x; - x)' + n (x - ttHi - tt l ' ) ] = (2 7T) X
- (n -
2
p
l )p
II � - -2-I ) e - (I) tr [ I _1(i�" (x; - x) (x; - x) ·)] (n -
J
(2 7Tf 2 1 I I - 2 e -
( 2II) (x -
2
, ._.
tt l ...
1
- (x - tt )
(9-25)
which depends on L and '}I through = LL' + W. This model is still not well defined, because of the multiplicity of choices for L made possible by orthogonal transformations. It is desirable to make L well defined by imposing the computa tionally convenient uniqueness condition
I
a diagonal matrix
(9-26)
Sec.
Methods of Estimation
9.3
531
The maximum likelihood estimates L and '\If must be obtained by numerical maximization of (9-25). Fortunately, efficient computer programs now exist that enable one to get these estimates rather easily. We summarize some facts about maximum likelihood estimators and, for now, rely on a computer to perform the numerical details. A
A
Let X 1 , X 2 , . . . , X11 be a random sample from NP (p,, I ) , where is the covariance matrix for the m common factor model of (9-4). The maximum likelihood estimators L, q, , and jl = i maximize (9-25) subject to i ' q, - l i being diagonal. The maximum likelihood estimates of the communalities are '\If
Result 9. 1 .
I
= LL' +
for i so
= 1, 2, . . . , p
( Proportion of total sample ) = .e"" t2j + .e"' 22j + . . . + .ep"' 2j variance due to jth factor
s 1 1 + s22 + · · · + sPP
(9-27)
(9-28)
Proof. By the invariance property of maximum likelihood estimate� (see Section 4.3), functions of L and '\If are estimated by the same functions of L and q, . In particular, the communaliti es hf = .Cf1 + . · + .e�" have maximum likeli hood estimates h f = .Cf1 + . . . + .Cfm · •
.
A
If, as in (8-10), the variables are standardized so that Z then the covariance matrix p of Z has the representation
= v - 1 12 (X
- p, ) , (9-29)
Thus, p has a factorization analogous to (9-5) with loading matrix L z = v - 1 12 L and specific variance matrix '\ffz = v - t/2 '\fi'V - 1 /2 • By the invariance property of maximum likelihood estimators, the maximum likelihood estimator of p is
fJ = c v- l /2 i ) ( v - 1 /2 i ) ' + -v - 1 /2 q, -v - 1 /2 = L Z L� + A
A
'\If z A
(9-30)
where v - 1 12 and i are the maximum likelihood estimators of v - 1 /2 and L, respec tively. (See Supplement 9A.) As a consequence of the factorization of (9-30), whenever the maximum like lihood analysis pertains to the correlation matrix, we call i = 1, 2, . . . , p
(9-31)
the maximum likelihood estimates of the communalities, and we evaluate the importance of the factors on the basis of
532
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
( Proportion of total (standardized) ) = C fj + C ij + . . . + e;j p
sample variance due to jth factor
(9-32)
To avoid more tedious notations, the preceding e J s denote the elements of Lz . A
A
Comment. Ordinarily, the observations are standardized, and a sample cor relation matrix is factor analyzed. The sample correlation matrix R is inserted for [ (n - 1 ) /n] S in the likelihood function of (9-25), and the maximum likelihood estimates Lz and 'II z are obtained using a computer. Although the likelihood in (9-25) is appropriate for S, not R, surprisingly, this practice is equivalent to obtain ing the maximum likelihood estimates L and q, based on the sample covariance matrix S, setting L z = v - 1 12 L and q,z = y - 1 12 ,J, v - l /2 . Here v - 1 /2 is the diago nal matrix with the reciprocal of the sample standard deviations (computed with the divisor on the main diagonal. A Goins._ in the other direction, given the estimated loadings Lz and specific variances 'II z obtained from R, we find that the resulting maximum likelihood estimates for a factor analysis of the covariance matrix [(n - 1 ) /n] S are L = V 1 12 L z and q, = V 1 12 ,J, z V 1 /2' or A
A
Vn)
and where au is the sample variance computed with divisor n. The distinction between divisors can be ignored with principal component solutions. The equivalency between factoring S and R has apparently been confused in many published discussions of factor analysis. (See Supplement 9A.) Example 9.5
(Factor analysis of stock-price data using the maximum likelihood method)
The stock-price data of Examples 8.5 and 9.4 were reanalyzed assuming an m = 2 factor model and using the maximum likelihood method. The esti mated factor loadings, communalities, specific variances, and proportion of total (standardized) sample variance explained by each factor are in Table 9.3. The corresponding figures for the m 2 factor solution obtained by the principal component method (see Example 9.4) are also provided. The com munalities corresponding,... to the maximum likelihood factoring of R are of the form [see (9-31)] h"' i = Cl1 + C"' ;22 • So, for example,
=
hf = (.684) 2 + ( . 1 8 9) 2 = .50
Sec. 9.3 Methods of Estimation
533
TABLE 9.3
Principal components
Maximum likelihood Estimated factor loadings
F!
F2
Specific variances � I- = 1 - h I2
Allied Chemical Du Pont Union Carbide Exxon Texaco
.684 .694 .681 .621 .792
.189 .517 .248 - .073 - .442
.50 .25 .47 .61 .18
Cumulative proportion of total (standardized) sample variance explained
.485
.598
Variable 1. 2. 3. 4. 5.
Estimated factor loadings
F!
F2
.783 .773 .794 .713 .712
- .217 - .458 - .234 .412 .524
.571
.733
Specific variances �-I = 1 - ft2I .34 .19 .31 .27 .22
The residual matrix is
R - LL' -
A
'\fl =
.005 - .004 - .024 - .004 0 .000 0 - .003 - .004 .005 .031 - .004 0 - .004 - .003 .031 - .000 0 - .024 - .004 0 .000 - .004 - .000 - .004 A
The elements of R - L L ' - '\fl are much smaller than those of the residual matrix corresponding to the principal component factoring of R presented in Example 9.4. On this basis, we prefer the maximum likelihood approach and typically feature it in subsequent examples. The cumulative proportion of the total sample variance explained by the factors is larger for principal component factoring than for maximum like lihood factoring. It is not surprising that this criterion typically favors princi pal component factoring. Loadings obtained by a principal component factor analysis are related to the principal components, which have, by design, a variance optimizing property. [See the discussion preceding (8-19)]. Focusing attention on the maximum likelihood solution, we see that all variables have large positive loadings on F1 . We call this factor the market factor, as we did in the principal component solution. The interpretation of the second factor, however, is not as clear as it appeared to be in the princi pal component solution. The signs of the factor loadings are consistent with a contrast, or industry factor, but the magnitudes are small in some cases, and one might identify this factor as a comparison between Du Pont and Texaco.
534
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
The patterns of the initial factor loadings for the maxjmym ijkelihood solu tion are constrained by the uniqueness condition that L ' 'lt - 1 L be a diagonal matrix. Therefore, useful factor patterns are often not revealed until the fac • tors are rotated (see Section 9.4).
Example 9.6
(Factor analysis of olympic decathlon data)
Linden [21] conducted a factor analytic study of Olympic decathlon scores since World War II. Altogether, 160 complete starts were made by 139 ath letes. 3 The scores for each of the 10 decathlon events were standardized, and a sample correlation matrix was factor analyzed by the methods of principal components and maximum likelihood. Linden reports that the distributions of standard scores were normal or approximately normal for each of the ten decathlon events. The sample correlation matrix, based on n = 160 starts, is 100-m Long Shot High 400-m 110-m run ump put ump run hurdles 1.0 .59 .35 .34 .63 .40 1 .0 .42 .51 .49 .52 1.0 .38 .19 .36 1.0 .29 .46 1.0 .34 1.0
j
R=
j
Discus .28 .31 .73 .27 .17 .32 1.0
Pole Jave- 1500-m vault lin run - .07 .20 .11 .21 .36 .09 .24 .44 - .08 .39 .17 .18 .23 .39 .13 .33 .18 .00 .24 .34 - .02 1.0 .24 .17 1.0 - .00 1 .0
From a principal component factor analysis perspective, the first four eigenvalues, 3.78, 1.52, 1.11, .91, of R suggest a factor solution with m = 3 or m = 4. A subsequent interpretation of the factor loadings reinforces the choice m = 4. The principal component and maximum likelihood solution methods were applied to Linden ' s correlation matrix and yielded the estimated factor loadings, communalities, and specific variance contributions in Table 9.4. 4 3 Because of the potential correlation between successive scores by athletes who competed in more than one Olympic game, an analysis was also done using 139 scores representing different ath letes. The score for an athlete who participated more than once was selected at random. The results were virtually identical to those based on all 160 scores. 4 The output of this table was produced by the BMDP statistical software package. The output from the SAS program is identical for the principal component solution and very similar for the maxi mum likelihood solution. For this example the solution to the likelihood equations produces a Hey wood case. That is, the estimated loadings are such that some specific variances are negative. Consequently, the software package may not run unless the Heywood case option is selected. With that option, the program obtains a feasible solution by slightly adjusting the loadings so that all specific vari ance estimates are nonnegative. A Heywood case is suggested in this example by the .00 values for the specific variances for the shot put and the 1500-m run.
TABLE 9.4
Principal component
F3
F4
Specific variances �-l = 1 - hI2
.217 .184 - .535 .134 .551
- .520 - .193 .047 .139 - .084
- .206 .092 - . 175 .396 - .419
.16 .30 .19 .35 .13
- .090 .065 - . 139 .156 .376
.341 .433 .990 .406 .245
.687 .621 .538 .434 . 147
.042 - .521 .087 - .439 .596
- .161 . 109 .411 .372 .658
.345 - .234 .440 - .235 - .279
.38 .28 .34 .43 .11
- .021 - .063 .155 - .026 .998
.361 .425 .728 .030 .229 .264 .441 - .010 .000 .059
.388 .019 .394 .098 .000
.38
.53
.64
-.55
.61 -
Estimated factor loadings 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.
Variable
Fl
100-m run Long jump Shot put High jump 400-m run 100 m hurdles Discus Pole vault Javelin 1500-m run
.691 .789 .702 .674 .620
Cumulative proportion of total variance explained
-- -
Ill w Ill
- -·
Maximum likelihood
Fz
-
.73 -- - --
Estimated factor loadings
Fl
Fz
.12 -
.37
F3
F4
.830 - . 1 69 .595 .275 .000 .000 .336 .445 .671 - .137
Specific variances � l- = 1 - h I2 .16 .38 .00 .50 .33 .54 .46 .70 .80 .00
536
Chap . 9
Factor Analysis and I nference for Structu red Cova riance M atrices
In this case, the two solution methods produced very different results. For the principal component factorization, all events except the 1500-meter run have large positive loadings on the first factor. This factor might be labeled general athletic ability. The remaining factors cannot be easily interpreted to our minds. Factor 2 appears to contrast running ability with throwing ability, or "arm strength." Factor 3 appears to contrast running endurance (1500-meter run) with running speed (100-meter run), although there is a relatively high pole-vault loading on this factor. Factor 4 is a mys tery at this point. For the maximum likelihood method, the 1500-meter run is the only variable with a large loading on the first factor. This factor might be called a running endurance factor. The second factor appears to be primarily a strength factor (discus and shot put load highly on this factor), and the third factor might be running speed, since the 100-meter and 400-meter runs load highly on this factor. Again, the fourth factor is not easily identified, although it may have something to do with jumping ability or leg strength. We shall return to an interpretation of the factors in Example 9.11 after a discussion of factor rotation. The four-factor principal component solution accounts for much of the total (standardized) sample variance, although the estimated specific vari ances are large in some cases (for example, the javelin and hurdles). This sug gests that some events might require unique or specific attributes not required for the other events. The four-factor maximum likelihood solution accounts for less of the total sample variance, but, a� the f�llowing residual matrices indicate, the maximum likelihood estimates L and W do a ,!?etter job of repro ducing R than the principal component estimates L and V : Principal component: R - LL' - V = 0 - .075 - .030 - .001 - .047 - .096 - .027 .114 .051 - .016
0 - .010 - .056 - .077 - .092 - .041 - .042 .042 .017
0 .042 - .020 - .032 - .031 - .034 - .158 .056
p
0 - .024 .022 0 - .122 0 .014 - .001 - .017 0 .009 .067 - .129 - .215 0 .041 - .254 - .005 .036 - .022 .062 - .109 - .112 .076 .020 - .091
0
Sec.
9.3
Methods of Estimation
537
Maximum likelihood: ,..
,..
,..
R - LL ' - 'It = 0 .000 .000 .012 .000 - .0 1 2 .004 .000 - .0 1 8 .000
0 .000 .002 - .002 .006 - .025 - .009 - .000 .000
0 .000 0 .000 - .033 0 .001 - .000 .028 0 - .000 - .034 - .002 .036 - .000 .006 .008 - .012 .052 - .013 - .000 - .045 .000 .000 .000 .000
0 .043 .016 .000
0 .091 .000
0 .000
0 •
A Large Sample Test for the Number of Common Factors
The assumption of a normal population leads directly to a test of the adequacy of the model. Suppose the m common factor model holds. In this case I = LL ' + 'It, and testing the adequacy of the m common factor model is equivalent to testing
H0 : I = L (p X p)
( 9-33 )
L' + 'It
(p X m) (m X p)
(p X p)
versus H 1 : I any other positive definite matrix. When I does not have any special �orm, the maximum of the likelihood function [see ( 4-18 ) and Result 4.11 with I = ( (n - 1)/n) S = S11 ] is proportional to ( 9-34 )
Under H0 , I is restricted to have the form of ( 9-33 ) . I ll this p�se, th � maximurp of th� likelihood function [see ( 9-25 ) with jl = i and I = L L ' + 'It , where L and 'It are the maximum likelihood estimates of L and 'It, respectively] is pro portional to
(
[ c� (xj - i) (xj - x) ' )])
l i l -"12 exp - � tr i - l
= I LL ' + � � - "12 exp ( - ! n tr [( LL +
� ) - 1 S11 ] )
( 9-35 )
Using Result 5.2, ( 9-34) , and ( 9-35 ) , we find that the likelihood ratio statistic for testing H0 is
538
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
likelihood under H0 ] _ 2 In A = _ 2 n [ maximized maximized likelihood 1
(9-36)
with degrees of freedom,
= !P (P 1) - [p (m + 1) - !m(m - 1 ) ] = ! [ (p - m) 2 - p - m ] Supplement 9A indicates that tr ( I - I s n ) - p = 0 provided that I = the maximum likelihood estimate of I = + Thus, we have - 2 ln A = n ln ( i� l ) v - v0
+
LL1
'11 .
l
LL
I
(9-37) + ,p. is (9-38)
Bartlett [4] has shown that the chi-square approximation to the sampling dis tribution of -2 In A can be improved by replacing n in (9-38) with the multiplica tive factor (n - 1 - (2p + 4m + 5)/6) . Using Bartlett ' s correction, 5 we reject H0 at the a level of significance if (n - 1 - (2p
l
ii ,p+ 4m + 5)/6) ln I �Sn+ 1 I
> X[(p2 - m)' - p - m]/2 (a )
(9-39)
provided that n and n - p are large. Since the number of degrees of freedom, H (p - m ) 2 - p - m] , must be positive, it follows that
(9-40) m < ! (2p + 1 - Y8p + 1 ) in order to apply the test (9-39). Comment. In implementing the test in (9-39), we are testing for the ade quacy of the m common factor model by comparing the generalized variances I ii 1 + ,V I and I Sn 1 . If n is large and m is small relative to p, the hypothesis H0
=
will usually be rejected, leading to a retention of more common factors. However, LL 1 + ,P. may be close enough to Sn so that adding more factors does not provide additional insights, even though those factors are "significant." Some judg ment must be exercised in the choice of m.
I
Example 9.7
(Testing for two common factors)
The two-factor maximum likelihood analysis of the stock-price data was pre sented in Example 9.5. The residual matrix there suggests that a two-factor 5 Many factor analysts obtain an approximate maximum likelihood estimate by replacing S" with the unbiased estimate S = [n/ (n - l) J Sn and then minimizing In [ l: [ + tr [l:- 1 S]. The dual substitu tion of S and the approximate maximum likelihood estimator into the test statistic of (9-39) does not affect its large sample properties.
Sec.
9.3
Methods of Estimation
solution may be adequate. Test the hypothesis H0 : I = LL' +
m = 2, at level a = .05.
539 W,
with
The test statistic in (9-39) is based on the ratio of generalized variances
Ii i
lSJ
� ii = l > + l
l sn l Let v - 1/2 be the diagonal matrix such that v - 1/2 Sn v - 1 /2 = R . By the prop erties of determinants (see Result 2A.11), l v - 1 /2 1 1 LL ' + � l l v - 1 /2 1 = l v -1 12LL ' v - 1 /2 + v -1/2 � y - 1/2 1 and
Consequently,
l i I = l v - 1 12 1 I LL ' + � I l v - 112 1 l sn l lSJ l v -112 1 l v - 112 1 l y - 112 LL ' V - 112 + y - 1/2 � y - 1/2 l l v - 1/2 s n v - 1/2 1 I L .t '. + � . 1 IRI
(9-4 1)
by (9-30). From Example 9.5, we determine 1.000 .572 1.000 .513 .602 1.000 .411 .393 .405 1 .000 L L � 1 I . �+ . . 458 .322 . 430 .523 1.000 = .194414 = 1 .0065 1.000 .193163 IRI .577 1 .000 .509 .599 1.000 .387 .389 . 436 1.000 .462 .322 .426 .5 23 1.000 Using Bartlett ' s correction, we evaluate the test statistic in (9-39):
LL � [n - 1 - (2p + 4m + 5)/6] 1n I ' + I I Sn
[
= 100
l
-
1 - (lO + : + 5) ] ln ( l.0065) = .62
540
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
Since H CP - m ) 2 - p - m ] = H C 5 - 2? - 5 - 2] = 1 , the 5% critical value x f (.05) = 3.84 is not exceeded, and we fail to reject H0. We conclude the that data do not contradict a two-factor model. In fact, the observed sig nificance level, or P-value, P [xf > .62] = .43 implies that H0 would not be • rejected at any reasonable level. L11rg � sample variances and covariances for the maximum likelihood esti mates eij> 1/1 ; have been derived when these estimates have been determined from the sample covariance matrix S. (See [20].) The expressions are, in general, quite complicated. 9.4 FACTOR ROTATION
As we indicated in Section 9.2, all factor loadings obtained from the initial load ings by an orthogonal transformation have the same ability to reproduce the covariance (or correlation) matrix. [See (9-8).] From matrix algebra, we know that an orthogonal transformation corresponds to a rigid rotation (or reflection) of the coordinate axes. For this reason, an orthogonal transformation of the factor load ings, as well as the implied orthogonal transformation of the factors, is called fac
tor rotati9n.
If L is the p X m matrix of estimated factor loadings obtained by any method ( principal component, maximum likelihood, and so forth) then
i
*
= LT '
where TT'
=
T'T
=I
(9-42)
is a p X m matrix of "rotated" loadings. Moreover, the estimated covariance (or correlation) matrix remains unchanged, since (9-43)
Equation (9-43)A indicates that the residual matrix, s n - LL ' - A-tit = A A s n - L * L * I - '11 , remaip. s unchanged. Moreover, the specific variances 1/J;, and hence the communalities ,M ' are unaltered. Thus, from a mathematical viewpoint, it is immaterial whether L or L * is obtained. Since the original loadings may not be readily interpretable, it is usual practice to rotate them until a "simpler structure" is achieved. The rationale is very much akin to sharpening the focus of a microscope in order to see the detail more clearly. Ideally, we should like to see a pattern of loadings such that each variable loads highly on a single factor and has small to moderate loadings on the remain ing factors. However, it is not always possible to get this simple structure, although the rotated loadings for the decathlon data discussed in Example 9.11 provide a nearly ideal pattern. We shall concentrate on graphical and analytical methods for determining an orthogonal rotation to a simple structure. When m = 2, or the common factors are
Sec.
9.4
Factor Rotation
541
considered two at a time, the transformation to a simple structure can frequently be determined graphically. The uncorrelated common factors are regarded as unit v�ctors along perpendicular coordinate axes. A plot of the pairs of factor loadings ( £ i 1 , £i 2 ) yields p points, each point corresponding to a variable. The coordinate axes can then be visually rotated through an angle-call it
[ -cossin
A
i* = L
( p X 2)
T =
where T =
[ cossin
(9-44)
T
J - sin
(p X 2) ( 2 X 2)
sin
clockwise rotation counterclockwise rotation
The relationship in (9-44) is rarely implemented in a two-dimensional graph ical analysis. In this situation, clusters of variables are often apparent by eye, and these clusters enable one to identify the common factors without having to inspect the magnitudes of the rotated loadings. On the other hand, for m > 2, orientations are not easily visualized, and the magnitudes of the rotated loadings must be inspected to find a meaningful interpretation of the original data. The choice of an orthogonal matrix T that satisfies an analytical measure of simple structure will be considered shortly. Example 9.8
(A first look at factor rotation)
Lawley and Maxwell [20] present the sample correlation matrix of examina tion scores in p = 6 subject areas for n = 220 male students. The correlation matrix is Gaelic English History Arithmetic Algebra Geometry
1.0
R=
.439 1.0
.410 .351 1.0
.288 .354 .164 1.0
.329 .320 .190 .595 1.0
.248 .329 .181 .470 .464 1.0
and a maximum likelihood solution for m = 2 common factors yields the estimates in Table 9.5. All the variables have positive loadings on the first factor. Lawley and Maxwell suggest that this factor reflects the overall response of the students to instruction and might be labeled a general intelligence factor. Half the load ings are positive and half are negative on the second factor. A factor with this
542
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices TABLE 9.5
Estimated factor loadings 1. 2. 3. 4. 5. 6.
Variable
Fl
Gaelic English History Arithmetic Algebra Geometry
.553 .568 .392 .740 .724 .595
Communalities
F2
h2
.429 .288 .450 - .273 - .211 - . 132
.490 .406 .356 .623 .569 .372
I
pattern of loadings is called a bipolar factor. (The assignment of negative and positive poles is arbitrary, because the signs of the loadings on a factor can be reversed without affecting the analysis.) This factor is not easily identified, but is such that individuals who get above-average scores on the verbal tests get above-average scores on the factor. Individuals with above-average scores on the mathematical tests get below-average scores on the factor. Perhaps this factor can be classified as a "math-nonmath" factor. The factor loading pairs (f i 1 , f i 2 ) are plotted as points in Figure 9.1 . The points are labeled with the numbers of the corresponding variables. Also shown is a clockwise orthogonal rotation of the coordinate axes through an angle of cf> = 20°. This angle was chosen so that one of the new axes passes through ( f 4 1 . f 4 2 ). When this is done, all the points fall in the first quadrant (the factor loadings are all positive), and the two distinct clusters of variables are more clearly revealed. The mathematical test variables load highly on F: and have negligible loadings on F; . The first factor might be called a mathematical-ability factor. Similarly, the three verbal test variables have high loadings on F; and mod erate to small loadings on F: . The second factor might be labeled a verbalF2 .5
F*2 I
I I I I I I
•3 •I •2
Figure 9.1
scores.
Factor rotation for test
Sec. 9.4 Factor Rotation
543
TABLE 9.6
Variable 1 . Gaelic
2. English 3. History
4. Arithmetic 5. Algebra 6. Geometry
Estimated rotated factor loadings F*2 p1* . 3 69
.433
.211 .789
.752
�
.594 .467 .558 .001 .054 .083
Communalities h I* 2 = h I2 .490 .406 .356
.623
.568
.372
ability factor. The general-intelligence factor identified initially is submerged
in the factors F: and Ft . The rotated factor loadings obtained from (9-44) with cp ='= 20° and the corresponding communality estimates are shown in Table 9.6. The magni tudes of the rotated factor loadings reinforce the interpretation of the factors suggested by Figure 9.1. The " " communality A A estimates A A , are unchanged by the orthogonal rotation, since L L ' = LTT' L ' = L * L "' ' , and the communalities are the diagonal elements of these matrices. We point out that Figure 9.1 suggests an oblique rotation of the coordi nates. One new axis would pass through the cluster { 1 , 2, 3} and the other through the { 4, 5, 6 } group. Oblique rotations are so named because they cor respond to a nonrigid rotation of coordinate axes leading to new axes that are not perpendicular. It is apparent, however, that the interpretation of the oblique factors for this example would be much the same as that given pre viously for an orthogonal rotation. • Kaiser [19] has suggested an analytical measure of simple structure known as the varimax (or normal varimax) criterion. Define f;� = C;� /h; to be the rotated coefficients scaled by the square root of the communalities. Then the (normal) vari max procedure selects the orthogonal transformation T that makes (9-45) as large as possible. Scaling the rotated coefficients e ;� has the effect of giving variables with small communalities relatively more weight in the determination of simple structure. After the transformation T is determined, the loadings f;7 are multiplied by h; so that the original communalities are preserved.
544
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
Although (9-45) looks rather forbidding, it has a simple interpretation. In words, (scaled) loadings for ) V ii= I ( variance of squar;�ths offactor 0(
(9-46)
Effectively, maximizing V corresponds to "spreading out" the squares of the load ings on each factor as much as possible. Therefore, we hope to find group� of large and negligible coefficients in any column of the rotated loadings matrix L * . Computing algorithms exist for maximizing V, and most popular factor analy sis computer programs (for example, the statistical software packages SAS, SPSS, and BMDP) provide varimax rotations. As might be expected, varimax rotations of factor loadings obtained by different solution methods ( principal components, maximum likelihood, and so forth) will not, in general, coincide. Also, the pattern of rotated loadings may change considerably if additional common factors are included in the rotation. If a dominant single factor exists, it will generally be obscured by any orthogonal rotation. On the other hand, it can always be held fixed and the remaining factors rotated. Example 9.9
(Rotated loadings for the consumer-preference data)
Let us return to the marketing data discussed in Example 9.3. The original factor loadings (obtained by the principal component method), the commu nalities, and the (varimax) rotated factor loadings are shown in Table 9.7. (See the SAS statistical software output in Panel 9.1.) It is clear that variables 2, 4, and 5 define factor 1 (high loadings on fac tor 1, small or negligible loadings on factor 2), while variables 1 and 3 define factor 2 (high loadings on factor 2, small or negligible loadings on factor 1 ) . Variable 4 is most closely aligned with factor 1 , although it has aspects of the TABLE 9.7
Estimated factor loadings
Rotated estimated factor loadings
FI
Fz
F1*
Taste Good buy for money Flavor Suitable for snack Provides lots of energy
.56 .78 .65 .94 .80
.82 - .52 .75 - .10 - .54
.02
®
Cumulative proportion of total (standardized) sample variance explained
GiJ7
.43 - .02
.571
.932
.507
.932
Variable 1. 2. 3. 4. 5.
.13
F*2
@
- .01
(}§)
Communalities "z h;
.98 .88 .98 .89 .93
Sec.
9 .4
Factor Rotation
545
trait represented by factor 2. We might call factor 1 a nutritional factor and factor 2 a taste factor. The factor loadings for the variables are pictured with respect to the • original and ( varimax) rotated factor axes in Figure 9.2. I
F*2
Fz
.5
0
I
I
I
I
I
..... -- .5
I
I
.....
I
I•
I
• 3
Fl
• 1 .0
.....
4
..... .....
...,._
Ff
Factor rotation for hypothetical marketing data.
Figure 9.2
Rotation of factor loadings is recommended particularly for loadings obtained by maximum likelihood, since the initial values are constrained to satisfy the uniqueness condition that i>4r -1 L be a diagonal matrix. This condition is con venient for computational purposes, but may not lead to factors that can easily be interpreted. PAN EL 9. 1
SAS ANALYSI S FOR EXAM PLE
9.9 U S I N G P ROC FACTOR.
t i t l e 'Facto r Ana lysis'; data cons u m e r(type=co rr) ; _type_ = 'CO R R'; i n put _name_ $ taste m o n ey flavor snack energy; ca rds; 1 .00 taste .02 1 .00 m o ney flavor .96 . 1 3 1 .00 snack . 42 .71 .50 1 .00 energy .01 .85 .1 1 .79 1 .00
PROGRAM COMMANDS
proc factor res data=co n s u m e r m ethod=prin nfact=2 rotate=vari m ax preplot plot; va r taste m o ney flavor snack energy;
(continued)
546
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices PAN El 9. 1 fhiti a l
(continued)
F�ctorMethod: �ri�cipaiComponents
O UTPUT
Prior Com m u n a l ity Esti m ates: O N E Eigenva l u es o f t h e Correlation Matrix: Total
Eigenva l u e Difference
Proportion
.cumulative
5 Average
=
=
1
1 2.853090 1 .046758
2 1 .806332 1 .60 1 842
3 0.204490 0 . 1 0208 1
4 0. 1 02409 0.068732
5 0.033677
0.5706
0.36 1 3
0.0409 0.9728
0.0205 0.9933
0.0067 1 .0000
0.570.6
0.93 1 9
2 facto rs w i l l be reta ined by the N FACTOR criterion. Factor Pattern
TASTE MONEY FLAVOR S NACK E N E RGY
FACTORi
0.55986
0.64534
. >Q;'7'?7�Q.
0.79821'
• ·· .
' FACTOR.2 0.8 1 610
:·�0.524,@!) 0.74795 0 1 0 492
0.939 1 1
·:
-
FiiJ
TASTE 0.9796 1
M O N EY
0.878920
"0.54323 .
Total
=
4.659423
F LAVOR
S NACK
E N ERGY
0.975883
0.892928
0.93223 1
Rotation M ethd'd: Varimax Rotated Factor Pattern
TASTE M O N EY FLAVOR S NACK E N ERGY
FACTOR 1
o.o197o
·· . 0.1285.6 ' ···
0.93744
.
Q.965.49
0.84244
. FACTOR2
> o.9894a i. 0.9794:7 0.42805
-0.0 11 23
�0.01 5(ii3
Variance exp l a i ned by each factor FACTOR 1 2.537396
FACTO R2 2 . 1 22027
Sec. Example 9. 1 0
9.4
Factor Rotation
547
(Rotated loadings for the stock-price data)
Table 9.8 shows the initial and rotated maximum likelihood estimates of the factor loadings for the stock-price data of Examples 8.5 and 9.5. An m = 2 factor model is assumed. The estimated specific variances and cumulative proportions of the total ( standardized ) sample variance explained by each factor are also given. An interpretation of the factors suggested by the unrotated loadings was presented in Example 9.5. We identified market and industry factors. The rotated loadings indicate that the chemical stocks ( Allied Chemical, Du Pont, and Union Carbide ) load highly on the first factor, while the oil stocks ( Exxon and Texaco ) load highly on the second factor. (Although the rotated loadings obtained from the principal component solution are not dis played, the same phenomenon is observed for them. ) The two rotated fac tors, together, differentiate the industries. It is difficult for us to label these factors intelligently. Factor 1 represents those unique economic forces that cause chemical stocks to move together. Factor 2 appears to represent eco nomic conditions affecting oil stocks. As we have noted, a general factor ( that is, one on which all the vari ables load highly) tends to be "destroyed after rotation." For this reason, in cases where a general factor is evident, an orthogonal rotation is sometimes • performed with the general factor loadings fixed. 6 TABLE 9.8
Maximum likelihood estimates of factor loadings
Rotated estimated factor loadings p2* F1*
Variable
F,
Fz
Allied Chemical Du Pont Union Carbide Exxon Texaco
.684 .694 .681 .621 .792
.189 .517 .248 -.073 - .442
.601 .850 .643 .365 .208
.598
.335
Cumulative proportion of total sample variance explained
.485
.377 .164 .335
aiD3
Specific variances
� · = 1 - Jr z l
l
.50 .25 .47 .61 .18
.598
6 Some general-purpose factor analysis programs allow one to fix loadings associated with certain factors and to rotate the remaining factors.
548
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
Example 9.1 1
(Rotated loadings for the Olympic decathlon data)
The estimated factor loadings and specific variances for the Olympic decathlon data were presented in Example 9.6. These quantities were derived for an m = 4 factor model, using both principal component and maximum likelihood solution methods. The interpretation of all the underlying factors was not immediately evident. A varimax rotation [see (9-45)] was performed to see whether the rotated factor loadings would provide additional insights. The varimax rotated loadings for the m = 4 factor solutions are displayed in Table 9.9, along with the specific variances. Apart from the estimated load ings, rotation will affect only the distribution of the proportions of the total sample variance explained by each factor. The cumulative proportion of the total sample variance explained for all factors does not change. The rotated factor loadings for both methods of solution point to the same underlying attributes, although factors 1 and 2 are not in the same order. We see that shot put, discus, and javelin load highly on a factor, and, following Linden [21 ], this factor might be called explosive arm strength. Sim ilarly, high jump, 110-meter hurdles, pole vault, and-to some extent-long jump load highly on another factor. Linden labeled this factor explosive leg strength. The 100-meter run, 400-meter run, and-again to some extent-the long jump load highly on a third factor. This factor could be called running speed. Finally, the 1500-meter run loads highly and the 400-meter run loads moderately on the fourth factor. Linden called this factor running endurance. As he notes, "The basic functions indicated in this study are mainly consis tent with the traditional classification of track and field athletics." Plots of rotated maximum likelihood loadings for factors pairs (1, 2) and (1, 3) are displayed in Figure 9.3 on page 550. The points are generally grouped along the factor axes. Plots of rotated principal component loadings B are very similar. Oblique Rotations
Orthogonal rotations are appropriate for a factor model in which the common fac tors are assumed to be independent. Many investigators in social sciences consider oblique (nonorthogonal) rotations, as well as orthogonal rotations. The former are often suggested after one views the estimated factor loadings and do not follow from our postulated model. Nevertheless, an oblique rotation is frequently a use ful aid in factor analysis. If we regard the m common factors as coordinate axes, the point with the m coordinates (e i l ' ei 2 ' . . . , eim ) represents the position of the ith variable in the fac tor space. Assuming that the variables are grouped into nonoverlapping clusters, an orthogonal rotation to a simple structure corresponds to a rigid rotation of the coordinate axes such that the axes, after rotation, pass as closely to the clusters as
Sec.
9.4
Factor Rotation
549
TABLE 9.9
Principal component
Variable
100-m
p1*
Estimated rotated factor loadings, f;� p*
2
p3*
p4*
Maximum likelihood
Specific variances �-I = 1 - J;I2
pI*
Estimated rotated factor loadings, e i� p*
2
p3*
run
1 .884 1 .136
.156 -.113
.16
.167 1 .857 1
Long jump
1 .631 1 .194
.30
Shot put
[��!�J - .006
.240 [��22J l .58o 1
.245 1 .825 1
.223 - .148
.19
1 .966 1
High jump
.239
400-m run
110-m
hurdles Discus Pole vault Javelin
1500-m
run
p4*
.246 - .138
I
I
.16 .38
.154
.200 - .058
.00
.150
1 .750 1
.076
.35
.242
.173
1 .632 1
.113
.50
1 .797 1 .075
.102
.468
.13
.055 1 .709 1
.236
.330
.33
.404 .153 1 .635 1 - .170 .186 1 .814 1 .147 - .079
.38 .28
.205 .261 1 .697 1 .133
1 .589 1 - .071 .180 - .009
.54 .46
.217 .141
.34 .43
.137
.078 .019
1 .513 1 .175
.116 .002
.70 .80
.112 1 .934 1
.11
-.055
.056
113
1 .990 1
.00
.18
.34
.50
.61
-.036 .176 1 .762 1 - .048 1 .735 1 .110 .045 -.041
[��X�J
Cumulative of total sample explained
� - = 1 - fz 2
.011
proportion
variance
Specific variances
.21
.42
.61
.73
550
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices Factor 3
5
Factor I
0
9
7
•
8
3•
•
Factor I
-5
-5
- 8 -6 -4 - 2
0
2
4
6
8
- 8 -6 -4 - 2
0
2
4
6
8
Figure 9.3 Rotated maximum likelihood loadings for factor pairs (1 , 2) and (1 , 3) decathlon data . (The numbers in the figures correspond to variables . )
possible. An oblique rotation to a simple structure corresponds to a nonrigid rota tion of the coordinate system such that the rotated axes (no longer perpendicular) pass (nearly) through the clusters. An oblique rotation seeks to express each vari able in terms of a minimum number of factors-preferably, a single factor. Oblique rotations are discussed in several sources (see, for example, [12] or [20]) and will not be pursued in this book. 9.5 FACTOR SCORES
In factor analysis, interest is usually centered on the parameters in the factor model. However, the estimated values of the common factors, called factor scores, may also be required. These quantities are often used for diagnostic purposes, as well as inputs to a subsequent analysis. Factor scores are not estimates of unknown parameters in the usual sense. Rather, they are estimates of values for the unobserved random factor vectors Fj, j = 1 , 2, . . . , n. That is, factor scores
fj
= estimate of the values fj attained by Fj (jth case)
Sec . 9 . 5
Factor Scores
551
The estimation situation is complicated by the fact that the unobserved quantities fj and ej outnumber the observed xj . To overcome this difficulty, some rather heuristic, but reasoned, approaches to the problem of estimating factor values have been advanced. We describe two of these approaches. Both of the factor score approaches have two elements in common: A
A
1. They treat the estimated factor loadings f ;j and specific variances 1/J; as if they were the true values. 2. They involve linear transformations of the original data, perhaps centered or standardized. Typically, the estimated rotated loadings, rather than the origi nal estimated loadings, are used to compute factor scores. The computational formulas, as given in this section, do not change when rotated loadings are substituted for unrotated loadings, so we will not differentiate between them. The Weighted Least Squares Method
Suppose first that the mean vector p, the factor loadings L, and the specific vari ance 'II are known for the factor model X (p x
I)
f.L
(p X
L I)
F
( p X m) (m X I )
[
+
e (p X I )
[3]p]
Further, regard the specific factors e ' = e 1 , e 2 , . . . , e as errors. Since Var (e; ) = 1/J; , i = 1, 2, . . . , p, need not be equal, Bartlett has suggested that weighted least squares be used to estimate the common factor values. The sum of the squares of the errors, weighted by the reciprocal of their variances, is e 2: _i_ = z
P
i=l
1/1;
7.3)
e ' 'll - 1 e
=
(x
- f.L -
Lf) ' w- 1 (x
A
- f.L -
Bartlett proposed choosing the estimates f of f to minimize (see Exercise is
(9-48),
Motivated by we take the estimates £ , ,J, , and jL obtain the factor scores for the jth case as
Lf)
(9-47)
(9-47). The solution (9-48)
= x as the true values and
(9-49)
552
Chap. 9
Factor Analysis and I nference for Structu red Cova riance M atrices
When L and lJ1 are determined by the maximum likelihood method, these estimates must satisfy the uniqueness condition, L 1 W - l L = A , a diagonal matrix. We then have the following: A
A
FACTOR SCORES OBTAIN�D BY WEIGHTED LEAST SQUARES FROM THE MAXIMUM LIKELIHOOD ESTIMATES
ij
or' ifthe
=
A."CtL'lJf�t(�i - x),
d:!'!<' . · · · ·
-
correlation matriX is factored
f. J
wh.ere zj
=
=
= =
·
(LI q,-ti, )�til q,-t z . A
.
1
A
Z
A
Z
a; L�w;
;�
o-t/Z (x
z
i
J
Z ··
Z
J
p, ) =
1; 2, . . . , n
j = l, 2, . . , , n
) , as in (s-25),and jJ = izi� z1 ,
(9-50)
+
q,
z•
The factor scores generated by (9-50) have sample mean vector 0 and zero sample covariances. (See Exercise 9.16.) If rotated loadings i * = iT are used in place of the original loadings in (9-50), ,.. ]. = 1, 2, . . . , n. "' by fi"""' * = T , fj, "" * , are related to fi the subsequent factor scores, fi Comment. If the factor loadings are estimated by the principal component method, it is customary to generate factor scores using an unweighted (ordinary) least squares procedure. Implicitly, this amounts to assuming that the 1/J; are equal or nearly equal. The factor scores are then
or
ij = (i 1 i ) -1 i � ( xj - x ) fj = (L�LJ - 1 L1zZj
for standardized data. Since L
A
= [ � e 1 ! YA �e 2 ! · · · ! VA;: e m ] [see (9-15)], we have
f. = J
(9-51 )
Sec. 9 . 5
Factor Scores
553
For these factor scores, (sample mean) and II 1 f.f ' L n - 1 I. = 1 I I "
--
A
=I
(sample covariance) A
Comparing (9-51) with (8-21), we see that the fi are nothing more than the first m (scaled) principal components, evaluated at xi . The Regression Method
Starting again with the original factor model X - p = LF + e, we initially treat the loadings matrix L and specific variance matrix '\)1 as known. When the common factors F and the specific factors (or errors) e are jointly normally distributed with means and covariances given by (9-3), the linear combination X - p = LF + e has an NP (0, LL' + V ) distribution. (See Result 4.3.) Moreover, the joint distrib ution of (X - p) and F is Nm + p (O, I *) , where I = LL' + v i L i (p X m ) (p Xp) (9-52) j (m P m + p } (m Xp) : ( m X m}
[=
+ ;:
--------i-,- ----------i '
] _____ ___ _
and 0 is an ( m + p) X 1 vector of zeros. Using Result 4.6, we find that the con ditional distribution of F I x is multivariate normal with mean = E (F J x) L' I - 1 (x - p) L' (LL' + v) - 1 ( x - p ) (9-53)
=
=
and
= Cov (F J x ) = I - L' I- 1 L = I - L' (LL' + v) - 1 L (9-54) The quantities L' (LL' + v) - 1 in (9-53) are the coefficients in a (multivariate) covariance
regression of the factors on the variables. Estimates of these coefficients produce factor scores that are analogous to the estimates of the conditional mean values in multivariate regression analysis. (See Chapter 7.) Consequently, given any vector of observations xi, and taking the maximum likelihood estimates i and 4t as the true values, we see that the jth factor score vector is given by (9-55) j = 1, 2, . . . , n fi = i • i - 1 (xi - i ) = L ' (LL ' + 4t ) - 1 (xi - i ) , The calculation of fi in (9-55) can be simplified by using the matrix identity (see Exercise 9.())
(m Xp }
(p Xp)
(m X m}
(m Xp } (p Xp }
(9-56)
554
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
This identity allows us to compare the factor scores in (9-55), generated by the regression argument, with those generated by the weigh ted least squares pr�ce dure [see (9-50)]. Temporarily, we denote the former by Cf and the latter by rfs. Then, using (9-56), we obtain
rfs = ( I / � - 1 £) - 1 (1 + L ' � -1 i ) rf = (I + (L ' � - 1 i ) -1) rf
(9-57)
For maximum likelihood estimates ( L ' � - 1 i ) - 1 = .l- 1 and if the elements of this diagonal matrix are close to zero, the regression and generalized least squares methods will give nearly the same factor scores. In an attempt to reduce the effects of a ( possibly) incorrect determination of the number of factors, practitioners tend to calculate the factor scores in (9-55) by using S (the original sample covariance matrix) instead of i = LL ' + �. We then have the following: FACTOR
f.J
or,
=
SCORES OBTAI NED BY REG RESSION
L ' S- 1 (x.J -
j = 1, 2,
i )'
if a correlation matrix is factored,
j where, see (8-25),
zi = n-112(xi
-
i ) and
=
p
1 , 2,
=
. . .
. . .
,n
,n
izi.�
+
�z
Again, if rotated loadings i * = i T are used in place of the original loadings in (9-58), the subsequent factor scores rt are related to fj by j
= 1, 2, . . . , n
A numerical measure of agreement between the factor scores generated from two different calculation methods is provided by the sample correlation coefficient between scores on the same factor. Of the methods presented, none is recom mended as uniformly superior. Example 9. 1 2
(Computing factor scores)
We shall illustrate the computation of factor scores by the least squares and regression methods using the stock-price data discussed in Example 9.10. A maximum likelihood solution from R gave the estimated rotated loadings and specific variances
Sec. 9.5 Factor Scores
i* z
=
.601 .850 .643 .365 .208
.377 .164 .335 .507 .883
A
and 'II z
=
555
.50 0 0 0 0 0 .25 0 0 0 0 0 .47 0 0 0 0 0 .61 0 0 0 0 0 .18
The vector of standardized observations, z'
= [.50, - 1.40, - .20, - .70, 1.40]
yields the following scores on factors 1 and 2: Weighted least squares (9-50):
[ - 1 .8 ] 2.0
Regression (9-58):
f = L* ' R_ 1 z z
[
.657 .222 .050 - .210 = .187 .037 - .185 .013 .107 .864
]
.50 - 1 .40 - .20 - .70 1 .40
[ - 1 .2 ] 1 .4
In this case, the two methods produce somewhat different results. All of the regression factor scores, obtained using (9-58), are plotted in Figure 9.4. • Comment. Factor scores with a rather pleasing intuitive property can be constructed very simply. Group the variables with high ( say, greater than .40) load ings on a factor. The scores for factor 1 are then formed by summing the ( stan dardized ) observed values of the variables in the group, combined according to the sign of the loadings. The factor scores for factor 2 are the sums of the standardized observations corresponding to variables with high loadings on factor 2, and so forth. Data reduction is accomplished by replacing the standardized data by these simple factor scores. The simple factor scores are frequently highly correlated with the fac tor scores obtained by the more complex least squares and regression methods. Example 9. 1 3
(Creating simple summary scores from factor analysis groupings)
The principal component factor analysis of the stock price data in Example 9.4 produced the estimated loadings
556
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices 3
2
I
•
••
0
1 '-
•
•
•
•
•
•
•• • •
. ... .
•
•
• ••• • ••• • •
•
•
•
• • •
•
I
I
-2
-1
•
•
•• •
...
• • • • • • • • • • • • • • $ • •• $• • ••• • • •
-2 -
-3
•
• •
I
•
•
I-
r-
-
I
e l
$
•
•
-
•
•
•
• •
•
-
-
I
I 2
0
3
Figure 9.4 Factor scores using (9-58) for factors 1 and 2 of the stock-price data (maximum likelihood estimates of the factor loadings).
L=
.784 - .216 .773 - .458 .795 - .234 .712 .473 .524 .712
and i *
= LT =
.746 .889 .766 .258 .226
.323 .128 .316 .815 .854
For each factor, take the largest loadings in L as equal in magnitude, and neglect the smaller loadings. Thus, we create the linear combinations
/1 = X1 /2 = X4 A
A
X3 + Xs - X2 + X2 +
+
X4
+
Xs
as a summary. In practice, we would standardize these new variables. If, instead of L, we start with the varimax rotated loadings L * , the sim ple factor scores would be
/1 = x1 J2 = X4 A
+ x2 + +
Xs
x3
Sec. 9 . 6
Perspectives and a Strategy for Factor Ana lysis
557
The identification of high loadings and negligible loadings is really quite sub jective. Linear compounds that make subject-matter sense are preferable. • Although multivariate normality is often assumed for the variables in a fac tor analysis, it is very difficult to justify the assumption for a large number of vari ables. As we pointed out in Chapter 4, marginal transformations may help. Similarly, the factor scores may or may not be normally distributed. Bivariate scat ter plots of factor scores can produce all sorts of nonelliptical shapes. Plots of fac tor scores should be examined prior to using these scores in other analyses. They can reveal outlying values and the extent of the ( possible) nonnormality. 9.6 PERSPECTIVES AND A STRATEGY FOR FACTOR ANALYSIS
There are many decisions that must be made in any factor analytic study. Proba bly the most important decision is the choice of m, the number of common factors. Although a large sample test of the adequacy of a model is available for a given m, it is suitable only for data that are approximately normally distributed. Moreover, the test will most assuredly reject the model for small m if the number of variables and observations is large. Yet this is the situation when factor analysis provides a useful approximation. Most often, the final choice of m is based on some combi nation of (1) the proportion of the sample variance explained, (2) subject-matter knowledge, and (3) the "reasonableness" of the results. The choice of the solution method and type of rotation is a less crucial deci sion. In fact, the most satisfactory factor analyses are those in which rotations are tried with more than one method and all the results substantially confirm the same factor structure. At the present time, factor analysis still maintains the flavor of an art, and no single strategy should yet be "chiseled into stone." We suggest and illustrate one reasonable option: 1.
Perform a principal component factor analysis. This method is particularly appropriate for a first pass through the data. (It is not required that R or S be nonsingular.) (a) Look for suspicious observations by plotting the factor scores. Also, cal culate standardized scores for each observation and squared distances as described in Section 4.6. (b) Try a varimax rotation.
2. 3.
Perform a maximum likelihood factor analysis, including a varimax rotation. Compare the solutions obtained from the two factor analyses. (a) Do the loadings group in the same manner?
(b) Plot factor scores obtained for principal components against scores from the maximum likelihood analysis.
558
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices 4.
Repeat the first three steps for other numbers ofcommon factors m. Do extra fac
5.
For large data sets, split them in half and perform a factor analysis on each part. Compare the two results with each other and with that obtained from
tors necessarily contribute to the understanding and interpretation of the data?
the complete data set to check the stability of the solution. (The data might be divided at random or by placing the first half of the cases in one group and the second half of the cases in the other group. )
Example 9. 1 4 (Factor analysis of chicken-bone data)
We present the results of several factor analyses on bone and skull measure ments of white leghorn fowl. The original data were taken from Dunn [10] . Factor analysis of Dunn ' s data was originally considered b y Wright [26] , who started his analysis from a different correlation matrix than the one we use. The full data set consists of n = 276 measurements on bone dimensions: Head: Leg: Wing:
length { xlx skull 2 skull breadth length { x3x4 = femur tibia length length { x humerus ulna length = =
=
X5 = 6=
The sample correlation matrix
R=
1.000 .505 .569 .602 .621 .603 .505 1 .000 .422 .467 .482 .450 .569 .422 1.000 .926 .877 .878 .602 .467 .926 1 .000 .874 .894 .621 .482 .877 .874 1.000 .937 .603 .450 .878 .894 .937 1 .000
was factor analyzed by the principal component and maximum likelihood methods for an m =3 factor model. The results are given in Table 9.10. 7 After rotation, the two methods of solution appear to give somewhat different results. Focusing our attention on the principal component method and the cumulative proportion of the total sample variance explained, we see that a three-factor solution appears to be warranted. The third factor explains a "significant" amount of additional sample variation. The first factor appears 7 Notice the estimated specific variance of .00 for tibia length in the maximum likelihood solu tion. This suggests that maximizing the likelihood function may produce a Heywood case. Readers attempting to replicate our results should try the Hey(wood) option if SAS or similar software is used.
Sec. TABLE 9. 1 0
9.6
Perspectives and a Strategy for Factor Analysis
559
FACTOR ANALYSI S O F CH ICKEN-BO N E DATA
Principal Component
Variable
Estimated factor loadings Rotated estimated loadings p3* p2* p1* F2 F3 F1
Skull length Skull breadth Femur length Tibia length Humerus length Ulna length
.741 .604 .929 .943 .948 .945
.350 .720 - .233 - . 175 - .143 - .189
.573 - .340 - .075 - .067 - .045 - .047
Cumulative proportion of total ( standardized ) sample variance explained
.743
.873
.950
1. 2. 3. 4. 5. 6.
.355 .235
�
.904 .888
�
.576
.244
(. 949 )
( .902 )
.164 .212 .228 .192
.21 1 .218 .252 .283 .264
.763
.950
1/1 ; "
.00 .00 .08 .08 .08 .07
Maximum Likelihood
Variable 1. 2. 3. 4. 5. 6.
Estimated factor loadings Rotated estimated loadings p2* p1* p2* F2 F1 F3
Skull length Skull breadth Femur length Tibia length Humerus length Ulna length
.602 .467 .926 1 .000 .874 .894
.214 .177 .145 .000 .463 .336
.286 .652 - .057 - .000 - .012 - .039
Cumulative proportion of total ( standardized ) sample variance explained
.667
.738
.823
.467 .21 1
�
.936 .831
�
.559
c:J 2 .289 .345 .362 .325
.128 .050 .084 - .073 .396 .272
.779
.823
1/1 ; "
.51 .33 . 12 .00 .02 .09
to be a body-size factor dominated by wing and leg dimensions. The second and third factors, collectively, represent skull dimension and might be given the same names as the variables, skull breadth and skull length, respectively. The rotated maximum likelihood factor loadings are consistent with those generated by the principal component method for the first factor, but
560
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
not for factors 2 and 3. For the maximum likelihood method, the second fac tor appears to represent head size. The meaning of the third factor is unclear, and it is probably not needed. Further support for retaining three or fewer factors is provided by the residual matrix obtained from the maximum likelihood estimates:
R -
.000 - .000 .000 - .003 .001 .000 .000 .000 .000 .000 - .001 .000 .000 .000 .000 .004 - .001 - .001 .000 - .000 .000
i L' - 'It" = z
z
z
All the entries in this matrix are very small. We shall pursue the m = 3 factor model in this example. An m = 2 factor model is considered in Exercise 9.10. Factor scores for factors 1 and 2 produced from (9-58) with the rotated maximum likelihood estimates are plotted in Figure 9.5. Plots of this kind allow us to identify observations that, for one reason or another, are not
3
""
I
�
I
I
2 t-
• • • • • •• • • • • • •$ • • • • •••• • • • •• • . .$$ • •• • • • • • $• · · $• • • • $ • •$ • • •$• • • $ $ $. . $ • •• •$ • ••• • •• •• • $ ••$ ••
�
0
••• •••• $ $ • $ •• •• $ • •$ • $ $ • • • • $• $$ • •• $$ • $ • • $• • • • • •• • • • • • •
-2 -
•
•
• • • $$ • • • • • $• • • • • • • • • •• $$ • • $ • $ • • • •
'
$ •
•
•
•
Figure 9.5
I -1
0
(�
-
•
I -2
-
•
l t-
-3
•
• •
(� I t-
I
I
L 2
Factor scores for the first two factors of chicken-bone data.
3
Sec.
Perspectives and a Strategy for Factor Analysis
9.6
561
consistent with the remaining observations. Potential outliers are circled in the figure. It is also of interest to plot pairs of factor scores obtained using the prin cipal component and maximum likelihood estimates of factor loadings. For the chicken-hope data, plots of pairs of factor scores are given in Figure 9.6. If the loadings on a p artic�l ar factor agree, the pairs of scores should cluster tightly about the 45° line through the origin. Sets of loadings that do not agree will produce factor scores that deviate from this pattern. If the latter occurs, it is usually associated with the last factors and may suggest that the number of factors is too large. That is, the last factors are not meaningful. This seems I
2.7
I
I
I
I
I
-
1 .8
1-
.9
1-
-2.7
I
I
I
1 1 1 1 11 1 1 I 11 I I I 32 1 1 1 2 I 12 1 1 21 1 1 I 1 1 2 32 1 I 1 4 2 63 1 I I 2 1 24 3 I I 33 1 1 2 2 1 I 2I 17 2 1
I
I
I
1-
1-
1 I
1-
I
-
1 3 -
-
'"
""
l l 43 4 3�:t t
1 - 1 .8
I
1
0
- .9
I
Principal component
Maximum likelihood
1 2 243 3 1 4 13 2231 5 2 I1 1 4 1 1 22 1 I 12121 1 1 5 25 1 1 11 133 I I I 2 II1 I I I I 1 I 1
-
-
-
-3.6 r--
-
1
I
I
-3.5 - 3.0
I
I
L
- 2.5 - 2.0 - 1 .5
I
- 1 .0
I
- .5
0
I
.5
I
1 .0
I
1 .5
I
2.0
( a) First factor
Figure 9.6
I
2.5
Pairs of factor scores for the chicken-bone data. (Loadings are esti mated by principal component and maximum likelihood methods.)
I
3.0
562
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices I
I
I
I Principal component
I
I
I
I
I
I
I
l
9.0 I-
eD
I
-
7.5 I-
6.0
-
!--
-
4.5 I-
3.0
1 .5
-
-
-
0
-
I II I I I 2 12 1 1 I 1 2322 1 1 23 4623 1 22 4 3 3 1 2 1 4 46C61 1
2 1 362 5 5 7 2 1 2 1A3 8 3 7 3 1 1 1 52 5 1 1 1 223 3 1 I Ill 21 I - 1 .5 I I II I I I I I I I -3.00 - 2.25 - 1 .50 - .75 0 .75
-
Maximum likelihood
-
-
I 1 .50
I 2.25
I 3.00
I 3.75
I 4.50
I 5 .25
I 6.00
I 6.75
I 7.50
(b ) Second factor
Figure 9.6
(continued)
to be the case with the third factor in the chicken-bone data, as indicated by Plot ( c) in Figure 9.6. Plots of pairs of factor scores using estimated loadings from two solu tion methods are also good tools for detecting outliers. If the sets of loadings for a factor tend to agree, outliers will appear as points in the neighborhood of the 45° line, but far from the origin and the cluster of the remaining points. It is clear from Plot (b ) in Figure 9.6 that one of the 276 observations is not consistent with the others. It has an unusually large F2 score. When this point, [39.1, 39.3, 75.7, 115, 73.4, 69.1], was removed and the analysis repeated, the loadings were not altered appreciably. When the data set is large, it should be divided into two ( roughly) equal sets, and a factor analysis should be performed on each half. The results of
Sec. I
I
3.00
I
I
Perspectives and a Strategy for Factor Analysis
9.6
I
I Principal component
-
I
I
I
I
I
I
563 I -
I
2.25
-
I
1 .50
.75
2
-
I I
I I I
- .75 r-
1-
I
I
I
2
I
r- 1
I I I 1 Ill
I
1ii 1 I 21
1
I II I I I
I
1 1
I
I I I I
I I
2
1
I I
I I 1 I l l I
I
I
I
I
- 3.0 - 2.4 - 1 .8 - 1 .2
-"-"
,
I
I I Il l I I I 3 I I I
I I
I
I
I
I
I
I
-
1
�
I
Maximum likelihood I
I
I I
-
I
I
I
I
-"-
�..
1 I 2 I 2 2 I 211 II 21 1
-
I
I I I
I
I
-
2 I
I
- .6
I
1
111
-
I
I
I I l l I I
II
I I
I
1
22 121 1 3141 I I 1 2
22
1 12 I 2
I
I
I l l
I
2 I II 2 I
I
- 2.25 r-
- 3.00
I I
I
I I I 2 I I 2 1 I I 3 1 1 21 1 1 I
I I
- 1 .50
I
1 21 32 2 I I I 2 1 1 I I I
-
0
I
2
0
I
I
.6
1 .2
_l
1 .8
j_
2.4
_l
3.0
j_
3.6
_l
4.2
_l
-
4.8
(c ) Third factor
Figure 9.6
(continued)
The results of these analyses can be compared with each other and with the analysis for the full data set to test the stability of the solution. If the results are consistent with one another, confidence in the solution is increased.

The chicken-bone data were divided into two sets of n₁ = 137 and n₂ = 139 observations, respectively. The resulting sample correlation matrices were

R₁ =
1.000
 .696  1.000
 .588   .540  1.000
 .639   .575   .901  1.000
 .694   .606   .844   .835  1.000
 .660   .584   .866   .863   .931  1.000
and

R₂ =
1.000
 .366  1.000
 .572   .352  1.000
 .587   .406   .950  1.000
 .587   .420   .909   .911  1.000
 .598   .386   .894   .927   .940  1.000
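The split-half comparison is easy to reproduce numerically. The sketch below is only an illustration (it assumes NumPy is available; the helper functions and the varimax routine are written here, not taken from the text): it extracts three principal component factors from each half's correlation matrix and rotates them. Column order and signs of the rotated loadings are arbitrary, so corresponding factors must be matched by inspection against Table 9.11.

```python
import numpy as np

def corr_from_lower(rows):
    """Build a full correlation matrix from its lower-triangular rows."""
    p = len(rows)
    R = np.zeros((p, p))
    for i, row in enumerate(rows):
        R[i, : i + 1] = row
    return R + np.tril(R, -1).T

# Sample correlation matrices for the two halves of the chicken-bone data
R1 = corr_from_lower([[1.000],
                      [.696, 1.000],
                      [.588, .540, 1.000],
                      [.639, .575, .901, 1.000],
                      [.694, .606, .844, .835, 1.000],
                      [.660, .584, .866, .863, .931, 1.000]])
R2 = corr_from_lower([[1.000],
                      [.366, 1.000],
                      [.572, .352, 1.000],
                      [.587, .406, .950, 1.000],
                      [.587, .420, .909, .911, 1.000],
                      [.598, .386, .894, .927, .940, 1.000]])

def pc_loadings(R, m):
    """Principal component factor loadings (first m factors)."""
    vals, vecs = np.linalg.eigh(R)
    order = np.argsort(vals)[::-1][:m]
    return vecs[:, order] * np.sqrt(vals[order])

def varimax(L, n_iter=100, tol=1e-8):
    """Kaiser's varimax rotation of a loading matrix (SVD form)."""
    p, m = L.shape
    T = np.eye(m)
    d_old = 0.0
    for _ in range(n_iter):
        B = L @ T
        U, s, Vt = np.linalg.svd(L.T @ (B ** 3 - B @ np.diag((B ** 2).sum(axis=0)) / p))
        T = U @ Vt
        if d_old != 0.0 and s.sum() / d_old < 1.0 + tol:
            break
        d_old = s.sum()
    return L @ T

for R in (R1, R2):
    print(np.round(varimax(pc_loadings(R, m=3)), 3))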
The rotated estimated loadings, specific variances, and proportion of the total (standardized) sample variance explained for a principal component solution of an m = 3 factor model are given in Table 9.11. The results for the two halves of the chicken-bone measurements are very similar. Factors F₂* and F₃* interchange with respect to their labels, skull length and skull breadth, but they collectively seem to represent head size. The first factor, F₁*, again appears to be a body-size factor dominated by leg and wing dimensions. These are the same interpretations we gave to the results from a principal component factor analysis of the entire set of data. The solution is remarkably stable, and we can be fairly confident that the large loadings are "real." As we have pointed out, however, three factors are probably too many.
TABLE 9.11

First set (n₁ = 137 observations)
Rotated estimated factor loadings

Variable                   F₁*      F₂*      F₃*      ψ̂ᵢ
1. Skull length            .360     .361    (.853)    .01
2. Skull breadth           .303    (.899)    .312     .00
3. Femur length                     .175     .238     .08
4. Tibia length            .877     .270     .242     .10
5. Humerus length          .830     .247     .395     .11
6. Ulna length             .871     .332     .231     .08
Cumulative proportion of total (standardized)
sample variance explained  .546     .743     .940

Second set (n₂ = 139 observations)
Rotated estimated factor loadings

Variable                   F₁*      F₂*      F₃*      ψ̂ᵢ
1. Skull length            .352    (.921)    .167     .00
2. Skull breadth           .203     .145    (.968)    .00
3. Femur length            .930     .239     .130     .06
4. Tibia length            .925     .248     .187     .05
5. Humerus length          .912     .252     .208     .06
6. Ulna length             .914     .272     .168     .06
Cumulative proportion of total (standardized)
sample variance explained  .593     .780     .962
A one- or two-factor model is surely sufficient for the chicken-bone data, and you are encouraged to repeat the analyses here with fewer factors and alternative solution methods. (See Exercise 9.10.) •

Factor analysis has a tremendous intuitive appeal for the behavioral and social sciences. In these areas, it is natural to regard multivariate observations on animal and human processes as manifestations of underlying unobservable "traits." Factor analysis provides a way of explaining the observed variability in behavior in terms of these traits.

Still, when all is said and done, factor analysis remains very subjective. Our examples, in common with most published sources, consist of situations in which the factor analysis model provides reasonable explanations in terms of a few interpretable factors. In practice, the vast majority of attempted factor analyses do not yield such clear-cut results. Unfortunately, the criterion for judging the quality of any factor analysis has not been well quantified. Rather, that quality seems to depend on a

WOW criterion

If, while scrutinizing the factor analysis, the investigator can shout "Wow, I understand these factors," the application is deemed successful.

9.7 STRUCTURAL EQUATION MODELS
Structural equation models are sets of linear equations used to specify phenomena in terms of their presumed cause-and-effect variables. In their most general form, the models allow for variables that cannot be measured directly. Structural equation models are particularly helpful in the social and behavioral sciences and have been used to study the relationship between social status and achievement, the determinants of firm profitability, discrimination in employment, the efficacy of social action programs, and other interesting mechanisms. (See [2], [5], [6], [7], [9], [11], [13], [14], [15], [17], and [23] for additional theory and applications.)

Computer software for specifying, fitting, and evaluating structural equation models has been developed by Joreskog and Sorbom and is now widely available as the LISREL (Linear Structural Relationships) system. (See [18].)

The LISREL Model
Using the notation of Joreskog and Sorbom [18], the LISREL model is given by the equations

$$
\underset{(m \times 1)}{\boldsymbol{\eta}}
= \underset{(m \times m)}{\mathbf{B}}\,\underset{(m \times 1)}{\boldsymbol{\eta}}
+ \underset{(m \times n)}{\boldsymbol{\Gamma}}\,\underset{(n \times 1)}{\boldsymbol{\xi}}
+ \underset{(m \times 1)}{\boldsymbol{\zeta}}
\tag{9-59}
$$

$$
\underset{(p \times 1)}{\mathbf{Y}} = \underset{(p \times m)}{\boldsymbol{\Lambda}_y}\,\underset{(m \times 1)}{\boldsymbol{\eta}} + \underset{(p \times 1)}{\boldsymbol{\varepsilon}}
\qquad\qquad
\underset{(q \times 1)}{\mathbf{X}} = \underset{(q \times n)}{\boldsymbol{\Lambda}_x}\,\underset{(n \times 1)}{\boldsymbol{\xi}} + \underset{(q \times 1)}{\boldsymbol{\delta}}
\tag{9-60}
$$
with

$$
E(\boldsymbol{\zeta}) = \mathbf{0},\ \ \operatorname{Cov}(\boldsymbol{\zeta}) = \boldsymbol{\Psi};\qquad
E(\boldsymbol{\varepsilon}) = \mathbf{0},\ \ \operatorname{Cov}(\boldsymbol{\varepsilon}) = \boldsymbol{\Theta}_{\varepsilon};\qquad
E(\boldsymbol{\delta}) = \mathbf{0},\ \ \operatorname{Cov}(\boldsymbol{\delta}) = \boldsymbol{\Theta}_{\delta}
\tag{9-61}
$$
Here ζ, ε, and δ are mutually uncorrelated; Cov(ξ) = Φ; ζ is uncorrelated with ξ; ε is uncorrelated with η; δ is uncorrelated with ξ; B has zeros on the diagonal; and I − B is nonsingular. In addition to the assumptions (9-61), we take E(ξ) = 0 and E(η) = 0.

The quantities ξ and η in (9-59) are the cause-and-effect variables, respectively, and, ordinarily, are not directly observed. They are sometimes called latent variables. The quantities Y and X in (9-60) are variables that are linearly related to η and ξ through the coefficient matrices Λ_y and Λ_x, and these variables can be measured. Their observed values constitute the data. Equations (9-60) are sometimes called the measurement equations.
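To make the roles of the quantities in (9-59)-(9-61) concrete, the following sketch simulates observations from a small model of this form. The dimensions mirror the artificial-data example later in this section, but the numerical parameter values are hypothetical choices made only for illustration; they are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and parameter values, chosen only for illustration
m, n, p, q = 1, 1, 2, 2
B         = np.zeros((m, m))            # zeros on the diagonal; I - B nonsingular
Gamma     = np.array([[2.0]])           # (m x n)
Lam_y     = np.array([[1.0], [0.8]])    # (p x m)
Lam_x     = np.array([[1.5], [1.0]])    # (q x n)
Phi       = np.array([[1.0]])           # Cov(xi)
Psi       = np.array([[0.5]])           # Cov(zeta)
Theta_eps = np.diag([0.3, 0.4])         # Cov(eps)
Theta_del = np.diag([0.2, 0.5])         # Cov(delta)

N = 50_000
xi    = rng.multivariate_normal(np.zeros(n), Phi, size=N)
zeta  = rng.multivariate_normal(np.zeros(m), Psi, size=N)
eps   = rng.multivariate_normal(np.zeros(p), Theta_eps, size=N)
delta = rng.multivariate_normal(np.zeros(q), Theta_del, size=N)

# (9-59): eta = B eta + Gamma xi + zeta, so eta = (I - B)^{-1}(Gamma xi + zeta)
eta = np.linalg.solve(np.eye(m) - B, Gamma @ xi.T + zeta.T).T
# (9-60): the measurement equations
Y = eta @ Lam_y.T + eps
X = xi @ Lam_x.T + delta

# The sample covariance of [Y, X] approximates the structure in (9-62)-(9-63)
print(np.round(np.cov(np.hstack([Y, X]), rowvar=False), 2))
```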
Construction of a Path Diagram
A distinction is made between variables that are not influenced by other variables in the system (exogenous variables) and variables that are affected by others (endogenous variables). With each of the latter dependent variables is associated a residual. Certain conventions govern the drawing of a path diagram, which is constructed as follows (directed arrows represent a path):

1. A straight arrow is drawn to each dependent (endogenous) variable from each of its sources.
2. A straight arrow is also drawn to each dependent variable from its residual.
3. A curved, double-headed arrow is drawn between each pair of independent (exogenous) variables thought to have nonzero correlation.
The curved arrow for correlation is indicative of the symmetrical nature of a correlation coefficient. The other connections are directional, as indicated by the single-headed arrow.
Example 9.15 (A path diagram and the structural equation)
Consider the following path diagram for the cause-and-effect variables:
This path diagram, with m = 2 and n = 3, corresponds to the structural equation [see (9-59)]
with Cov(ξ₁, ξ₂) = φ₁, Cov(ξ₂, ξ₃) = φ₂, Cov(ξ₁, ξ₃) = φ₃, and Cov(ζ₁, ζ₂) = 0. We see, for example, that η₁ depends upon ξ₁, ξ₂, and η₂.

Path diagrams are useful aids for formulating structural models. Because they indicate the direction and nature of the causality, the diagrams force the investigator to think about the problem. •

The model in (9-59) and (9-60) has a very rich structure and includes several important submodels as special cases. For example, with a judicious choice of dimensions and variables, it is possible to define the multivariate linear regression model and the factor analysis model.

Covariance Structure
Because η and ξ are not observed, the LISREL model cannot be verified directly. However, as in factor analysis, the model and assumptions imply a certain covariance structure, which can be checked. Define the data vector [Y′, X′]′. Then

$$
\operatorname{Cov}\!\begin{pmatrix}\mathbf{Y}\\ \mathbf{X}\end{pmatrix}
= \underset{(p+q)\times(p+q)}{\boldsymbol{\Sigma}}
= \begin{bmatrix}
\underset{(p\times p)}{\boldsymbol{\Sigma}_{11}} & \underset{(p\times q)}{\boldsymbol{\Sigma}_{12}}\\[3pt]
\underset{(q\times p)}{\boldsymbol{\Sigma}_{21}} & \underset{(q\times q)}{\boldsymbol{\Sigma}_{22}}
\end{bmatrix}
= \begin{bmatrix}
\operatorname{Cov}(\mathbf{Y}) & \operatorname{Cov}(\mathbf{Y},\mathbf{X})\\
\operatorname{Cov}(\mathbf{X},\mathbf{Y}) & \operatorname{Cov}(\mathbf{X})
\end{bmatrix}
\tag{9-62}
$$
where, with B = 0 to simplify the discussion,

$$
\begin{aligned}
\operatorname{Cov}(\mathbf{Y}) &= E(\mathbf{Y}\mathbf{Y}') = \boldsymbol{\Lambda}_y \operatorname{Cov}(\boldsymbol{\eta}) \boldsymbol{\Lambda}_y' + \boldsymbol{\Theta}_{\varepsilon}
 = \boldsymbol{\Lambda}_y(\boldsymbol{\Gamma}\boldsymbol{\Phi}\boldsymbol{\Gamma}' + \boldsymbol{\Psi})\boldsymbol{\Lambda}_y' + \boldsymbol{\Theta}_{\varepsilon}\\
\operatorname{Cov}(\mathbf{X}) &= E(\mathbf{X}\mathbf{X}') = \boldsymbol{\Lambda}_x \operatorname{Cov}(\boldsymbol{\xi}) \boldsymbol{\Lambda}_x' + \boldsymbol{\Theta}_{\delta}
 = \boldsymbol{\Lambda}_x\boldsymbol{\Phi}\boldsymbol{\Lambda}_x' + \boldsymbol{\Theta}_{\delta}\\
\operatorname{Cov}(\mathbf{Y},\mathbf{X}) &= E(\mathbf{Y}\mathbf{X}') = E\big[(\boldsymbol{\Lambda}_y(\boldsymbol{\Gamma}\boldsymbol{\xi} + \boldsymbol{\zeta}) + \boldsymbol{\varepsilon})(\boldsymbol{\Lambda}_x\boldsymbol{\xi} + \boldsymbol{\delta})'\big]
 = \boldsymbol{\Lambda}_y\boldsymbol{\Gamma}\boldsymbol{\Phi}\boldsymbol{\Lambda}_x' = [\operatorname{Cov}(\mathbf{X},\mathbf{Y})]'
\end{aligned}
\tag{9-63}
$$
The covariances are (nonlinear) functions of the model parameters Λ_x, Λ_y, Γ, Φ, Ψ, Θ_ε, and Θ_δ. (Recall that we set B = 0.) Given n multivariate observations [y_j′, x_j′]′, j = 1, 2, ..., n, the sample covariance matrix [see (3-11)]

$$
\underset{(p+q)\times(p+q)}{\mathbf{S}}
= \begin{bmatrix}
\underset{(p\times p)}{\mathbf{S}_{11}} & \underset{(p\times q)}{\mathbf{S}_{12}}\\[3pt]
\underset{(q\times p)}{\mathbf{S}_{21}} & \underset{(q\times q)}{\mathbf{S}_{22}}
\end{bmatrix}
$$
can be constructed and partitioned in a manner conformable to Σ. The information in S is used to estimate the model parameters. Specifically, we set
$$
\hat{\boldsymbol{\Sigma}} = \mathbf{S}
\tag{9-64}
$$
and solve the resulting equations.

Estimation
Unfortunately, the equations in (9-64) often cannot be solved explicitly. An iterative search routine that begins with initial parameter estimates must be used to produce a matrix Σ̂ that closely approximates S. The search routines use a criterion function that measures the discrepancy between Σ̂ and S. The LISREL program currently uses a "least squares" criterion and a "maximum likelihood" criterion to estimate the model parameters. (See [18] for details.)

The next example, with hypothetical data, illustrates a case where the parameter estimates can be obtained from (9-64).
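The kind of search such routines perform can be sketched in a few lines of code. The sketch below is only an illustration, not LISREL's actual algorithm: it minimizes an unweighted least-squares discrepancy between S and the covariance structure implied by the model of Example 9.16, which follows, with SciPy's general-purpose optimizer standing in for a specialized routine.

```python
import numpy as np
from scipy.optimize import minimize

# Sample covariance matrix used in Example 9.16 (below)
S = np.array([[ 14.3, -27.6,   6.4,  3.2],
              [-27.6,  55.4, -12.8, -6.4],
              [  6.4, -12.8,   3.7,  1.6],
              [  3.2,  -6.4,   1.6,  1.1]])

def implied_cov(theta):
    """Sigma(theta) for the m = n = 1, p = q = 2 model with B = 0."""
    lam1, lam2, gam, phi, psi, t1, t2, t3, t4 = theta
    Ly = np.array([[1.0], [lam1]])      # Lambda_y, first loading fixed at 1
    Lx = np.array([[lam2], [1.0]])      # Lambda_x, second loading fixed at 1
    var_eta = gam ** 2 * phi + psi      # Var(eta) when B = 0
    S11 = var_eta * (Ly @ Ly.T) + np.diag([t1, t2])
    S22 = phi * (Lx @ Lx.T) + np.diag([t3, t4])
    S12 = gam * phi * (Ly @ Lx.T)
    return np.block([[S11, S12], [S12.T, S22]])

def discrepancy(theta):
    return np.sum((S - implied_cov(theta)) ** 2)   # unweighted least squares

start = np.array([-1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
fit = minimize(discrepancy, start)
print(np.round(fit.x, 2))   # compare with the algebraic solution in Example 9.16
```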
Example 9.16 (Estimation for a structural equation model: artificial data)

Take m = 1, n = 1, p = 2, q = 2, and B = 0 in (9-59) and (9-60). Then

$$
\eta = \gamma\xi + \zeta,\qquad
\mathbf{Y} = \begin{bmatrix}1\\ \lambda_1\end{bmatrix}\eta + \boldsymbol{\varepsilon},\qquad
\mathbf{X} = \begin{bmatrix}\lambda_2\\ 1\end{bmatrix}\xi + \boldsymbol{\delta}
$$

Additionally, let Var(ξ) = φ, Var(ζ) = ψ, and

$$
\operatorname{Cov}(\boldsymbol{\varepsilon}) = \begin{bmatrix}\theta_1 & 0\\ 0 & \theta_2\end{bmatrix},\qquad
\operatorname{Cov}(\boldsymbol{\delta}) = \begin{bmatrix}\theta_3 & 0\\ 0 & \theta_4\end{bmatrix}
$$

Using (9-63), we get

$$
\boldsymbol{\Sigma} =
\begin{bmatrix}
\gamma^2\phi+\psi+\theta_1 & \lambda_1(\gamma^2\phi+\psi) & \gamma\phi\lambda_2 & \gamma\phi\\
\lambda_1(\gamma^2\phi+\psi) & \lambda_1^2(\gamma^2\phi+\psi)+\theta_2 & \lambda_1\gamma\phi\lambda_2 & \lambda_1\gamma\phi\\
\gamma\phi\lambda_2 & \lambda_1\gamma\phi\lambda_2 & \lambda_2^2\phi+\theta_3 & \lambda_2\phi\\
\gamma\phi & \lambda_1\gamma\phi & \lambda_2\phi & \phi+\theta_4
\end{bmatrix}
$$
Data are collected on Y and X, and the sample covariance matrix

$$
\mathbf{S} = \begin{bmatrix}
14.3 & -27.6 & 6.4 & 3.2\\
-27.6 & 55.4 & -12.8 & -6.4\\
6.4 & -12.8 & 3.7 & 1.6\\
3.2 & -6.4 & 1.6 & 1.1
\end{bmatrix}
$$

is constructed.
Setting Σ̂ = S, we obtain the equations:

(i) (γ̂²φ̂ + ψ̂) + θ̂₁ = 14.3
(ii) λ̂₁(γ̂²φ̂ + ψ̂) = −27.6
(iii) λ̂₁²(γ̂²φ̂ + ψ̂) + θ̂₂ = 55.4
(iv) λ̂₂γ̂φ̂ = 6.4
(v) γ̂φ̂ = 3.2
(vi) λ̂₁λ̂₂(γ̂φ̂) = −12.8
(vii) λ̂₁(γ̂φ̂) = −6.4
(viii) φ̂λ̂₂² + θ̂₃ = 3.7
(ix) φ̂λ̂₂ = 1.6
(x) φ̂ + θ̂₄ = 1.1

From (iv) and (v), λ̂₂ = 2.0. Then, from (ix), φ̂ = .8; from (v), γ̂ = 4; from (vii), λ̂₁ = −2; and so forth. The reader may verify that

$$
\hat{\boldsymbol{\Theta}}_{\varepsilon} = \begin{bmatrix} .5 & 0\\ 0 & .2\end{bmatrix}
\qquad\text{and}\qquad
\hat{\boldsymbol{\Theta}}_{\delta} = \begin{bmatrix} .5 & 0\\ 0 & .3\end{bmatrix}
$$
The columns of the Λ_y and Λ_x matrices contain 1's in order to fix the scale of the unobserved η and ξ. In this example, the units of η have been chosen to be the same as those of Y₁. The units of ξ are taken to be those of X₂. (Any constant can be used to remove the scale ambiguity associated with the variables η and ξ. The number 1 is a convenient choice.)

In this example, the structural variables, η and ξ, might be the performance of the firm and managerial talent, respectively. The former cannot be measured directly, but indicators of the firm's performance, for example, Y₁ = profit and Y₂ = common stock price, can be measured. Similarly, indicators of managerial talent, say, X₁ = years of chief executive experience and X₂ = memberships on boards of directors, can be measured. Assuming that a firm's performance is caused, to a large degree, by managerial talent, we are led to the foregoing model. •

In general, to estimate the model parameters, we need more equations than there are unknowns. Consequently, if t is the total number of unknown parameters, p and q must be such that

$$
t \le (p + q)(p + q + 1)/2
\tag{9-65}
$$
Condition (9-65), however, does not guarantee that all parameters can be estimated uniquely. The final fit of the model must be assessed carefully. Individual parameter estimates, along with the entries in the residual matrix S − Σ̂, should be examined. Parameter estimates should have appropriate signs and magnitudes. For example, iterative parameter estimation routines operate over the entire parameter space and may not yield variance estimates that are positive. Entries in the residual matrix should be uniformly small.
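As a quick check of (9-65) against the model of Example 9.16: that model has the nine unknown parameters λ₁, λ₂, γ, φ, ψ, θ₁, θ₂, θ₃, θ₄, while setting Σ̂ = S supplies

$$
\tfrac{1}{2}(p + q)(p + q + 1) = \tfrac{1}{2}(2 + 2)(2 + 2 + 1) = 10
$$

distinct equations, so t = 9 ≤ 10 and (9-65) is satisfied. In that example the ten equations happen to be mutually consistent with the nine estimates, so the residual matrix S − Σ̂ is exactly zero there.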
Model-Fitting Strategy
In linear structural equation models, interest is often centered on the values of the parameters and the associated "effects." Predicted values for the variables are not easily obtained, unless the model reduces to a variant of the multivariate linear regression model.
A useful model-fitting strategy consists of the following:

1. If possible, generate parameter estimates using several criteria (for example, least squares, maximum likelihood), and compare the estimates.
   (a) Are the signs and magnitudes consistent?
   (b) Are all variance estimates positive?
   (c) Are the residual matrices, S − Σ̂, similar?
2. Do the analysis with both S and R, the sample correlation matrix. What effect does standardizing the observable variables have on the outcome?
3. Split large data sets in half, and perform Steps 1 and 2 on each half. Compare the results with each other and with the result for the complete data set to check the stability of the solution.
A model that gives consistent results for Steps 1-3 is probably a reasonable one. Inconsistent results point toward a model that is not completely supported by the data and needs to be modified or abandoned.
SUPPLEMENT 9A
Some Computational Details for Maximum Likelihood Estima tion Although a simple analytical expression cannot be obtained for the maximum like lihood estimators i and W, they can be shown to satisfy certain equations. Not sur prisingly, the nconditions are stated in terms of the maximum likelihood estimator
= (1/n) :2 ( Xi - X) (Xi - X) ' of an unstructured covariance matrix. Some j= 1 factor analysts employ the usual sample covariance S, but still use the title maxi mum likelihood to refer to resulting estimates. This modification, referenced in Sn
Footnote 5 of this chapter, n amounts to employing the likelihood obtained from the
Wishart distribution of :2 ( Xi - X) ( Xi - X) ' and ignoring the minor contribu i= l tion due to the normal density for X. The factor analysis of R is, of course, unaf fected by the choice of S n or S, since they both produce the same correlation matrix. Result 9A. 1 . Let x 1 ' X 2 , . . . ' xn be a random"sample from a normal popula tion. The maximum likelihood estimates L and '\II are obtained by maximizing (9-25) subject to the uniqueness condition in (9-26). They satisfy
(9A-1) so the jth column of ,j,- 1 12 £ is the (nonnormalized) eigenvector of ,j, - 1 12 Sn w - 1 /2 corresponding to eigenvalue 1 + � ; · Here
Sn
n
= n - 1 :2 (xi - i ) (xi - i ) ' = n- 1 (n- 1 ) S and j= 1
Also, at convergence, 572
� � ;;;;. � 2 ;;;;. · · · ;;;;. � m
�; = ith diagonal element of sn - ii I and
tr ( I - 1 S" )
+
573
(9A-2)
=p
We avoid the details of the proof. However, it is evident that jL = i and a consideration of the log-likelihood leads to the maximization of - (n/2) [ln i i i tr (I-1 Sn ) 1 over L and '\}1. Equivalently, since Sn and p are con stant with respect to the maximization, we minimize subject to L' l}l-1 L
=
A,
(9A-3) a diagonal matrix.
Comment. Lawley and Maxwell [20], along with many others who do fac tor analysis, use the unbiased estimate S of the covariance matrix instead of the maximum likelihood estimate Sw Now, ( n - 1 ) S has, for normal data, a Wishart distribution. [See (4-21) and (4-23).] If we ignore the contribution to the likelihood in (9-25) from the second term involving ( p, - i ) , then maximizing the reduced likelihood over L and '\}1 is equivalent to maximizing the Wishart likelihood n Likelihood ex: I I I - (Il - 1 ) /2 e - l< - 1 ) /2] tr[:£- Is]
over L and
'\}1.
Equivalently, we can minimize
or, as in (9A-3),
ln i i i
+
tr (I - 1 S) - ln l s i - p
Under these conditions, Result 9A-1 holds with S in place of Sn . Also, for large n, S and Sn are almost identical, and the corresponding maximum likelihood esti mates, i and � . would be similar. For testing the factor model [see (9-39)], I ii � I should be compared with I S" l if the actual likelihood of (9-25) is � I should be compared with I S I if the foregoing Wishart employed, and I ii likelihood is used to derive i and � .
1+
1+
Recom mended Computational Scheme
For m > 1, the condition L' l}I- 1 L = A effectively imposes m (m - 1 ) /2 con straints on the elements of L and '\}1, and the likelihood equations are solved, sub ject to these contraints, in an iterative fashion. One procedure is the following:
574
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
[16]
1. Compute initial estimates of the specific variances suggests setting
1/11 , 1/12 , , 1/Jp · Joreskog • • •
(9A-4) �·' = ( 1 - l_2 . m ) (�) s" where is the ith diagonal element of s - 1 • Given � ' compute the first m distinct eigenvalues, A 1 > A 2 > ··· > A m > 1, and corresponding eigenvectors, el' ez, ... ' em, of the "uniqueness-rescaled" covariance matrix s* = � -1/z s n � -1 ;2 (9A-5) Let E [ el i ez i .. . i em ] be the m matrix of normalized eigenvectors and A A A A = diag [ 1 . 2 , . . . , m ] be the� m m diagonal matrix of eigenvalues. From (9A-1), A = I + A and E = -1/2 LA -112. Thus, we obtain the estimates (9A-6) Substitute i obtained in (9A-6) into the likelihood function (9A-3), and min imize the result with respect to � 1 , � 2 , ... , �p · A numerical search routine must be used. The values � 1 , � 2 , ... , �P obtained from this minimization are employed at Step (2) to create a new i. Steps (2) and (3) are repeated until convergence-that is, until the differences between successive values of e and � are negligible. It often happens that the objective function in (9A-3) has a rel ative minimum corresponding to negative values for some �;· This solution is clearly inadmissible and is said to be jmproper, or a Heywood case. For most pack aged computer programs, negative 1/J ; , if they occur on a particular iteration, are changed to small positive numbers before proceeding with the next step. p = LzL: + '��z When I has the factor analysis structure I = LL ' + '11 , p can be factored as p = v -1/2 Iy - 1 /2 = ( v -1/2 L) ( v - 1 /2 L) ' + v - 1 /2\}ly-1/2 = L Z L� + '��z · The loading matrix for the standardized variables is Lz = v -t /2 L, and the corresponding specific variance matrix is '��z = v -1/2 '11v - l l2 , where v -1 /2 is the diagonal matrix with ith diagonal element u;; 1 12 . If R is substituted for S n in the objective function of (9A-3), the investigator minimizes p
s
ii
2.
p
X
X
3.
ij
i
Comment.
Maximum Likelihood Estimators of
(9A-7)
Chap . 9
575
Exercises
Introducing the diagonal matrix V 1 12, whose ith diagonal element is the square root of the ith diagonal element of 8 11 , we can write the objective function in (9A-7) as ln
( I V 1 12 I lL Z L� + :z i i V 1 12 I ) I V 1 12 I I R I I V112 I
+ tr [ (LzL � + '11, ) - 1 V - 1 12 V1 12 RV 1 12 V - 112]
= ln
( I
-
p
I S" I
+ tr [ ( ( V 1 /2L , ) (v t i2L , ) ' + vl l2 wz v 1 /2) - 1 S"]
-
P (9A-8) A
A
The last inequality follows because the maximum likelihood estimates L and 'II minimize the objective function (9A-3). [Equality holds in (9A-8) for Lz = v - 1 /2L and ,f,z = v - 1 ;2-q, V - 1 12. ] Therefore, minimizing (9A-7) over Lz and 'liz is equiv alent to obtaining L and .q, from s n and estimating Lz = v-1/2L by Lz = v - l/2L and '��z = v- 112 \{ly- 1 /2 by .q,z = v - l/2,f, v - 1 12. The rationale for the latter pro cedure comes from the invariance property of maximum likelihood estimators. [See ( 4-2 ). ]
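The recommended computational scheme above is straightforward to prototype. The sketch below is only an illustration: for brevity it replaces the numerical minimization over the ψ̂ᵢ in Step 3 with a simple diagonal update, so it is not the full maximum likelihood procedure, but it shows the rescaling and eigen-analysis of (9A-4)-(9A-6) in action on the small correlation matrix of Exercise 9.1.

```python
import numpy as np

# Correlation matrix from Exercise 9.1 (p = 3), used purely for illustration
Sn = np.array([[1.00, 0.63, 0.45],
               [0.63, 1.00, 0.35],
               [0.45, 0.35, 1.00]])
p, m = Sn.shape[0], 1

# Step 1: starting values as in (9A-4): psi_i = (1 - m/(2p)) * 1/s^{ii}
psi = (1.0 - 0.5 * m / p) / np.diag(np.linalg.inv(Sn))

for _ in range(50):
    # Step 2: eigen-analysis of the uniqueness-rescaled matrix (9A-5)
    Dinv = np.diag(psi ** -0.5)
    Sstar = Dinv @ Sn @ Dinv
    vals, vecs = np.linalg.eigh(Sstar)
    order = np.argsort(vals)[::-1][:m]
    lam, E = vals[order], vecs[:, order]
    # Loadings from (9A-6): L = Psi^{1/2} E (Lambda - I)^{1/2}
    L = np.diag(psi ** 0.5) @ E @ np.diag(np.sqrt(np.maximum(lam - 1.0, 0.0)))
    # Simplified psi update (the text instead minimizes (9A-3) numerically)
    psi_new = np.diag(Sn - L @ L.T)
    if np.max(np.abs(psi_new - psi)) < 1e-8:
        psi = psi_new
        break
    psi = psi_new

print(np.round(L.ravel(), 3), np.round(psi, 3))   # compare with Exercise 9.1
```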
0
EXERCISES
9.1.
[1.0 1.0 .35 ] 1.0
Show that the covariance matrix p
for the p by the m
=
.63 .45
.63 .45 .35
= 3 standardized random variables Z1 , Z2 , and Z3 can be generated = 1 factor model
where Var (F1 )
=
1 = 0, and
1, Cov (e, F )
576
Chap.
9
[ 0.19 0�1
Factor Analysis and Inference for Structured Covariance Matrices �
=
Cov(e) =
0 0
That is, write p in the form p = LL' + �. 9.2. Use the information in Exercise 9.1. (a) Calculate communalities hf , i = 1, 2, 3, and interpret these quantities. (b) Calculate Corr(Z;, F1 ) for i = 1, 2, 3. Which variable might carry the greatest weight in "naming" the common factor? Why? 9.3. The eigenvalues and eigenvectors of the correlation matrix p in Exercise 9. 1 are A I = 1. 9 6, e ; = [.625, . 5 93, . 5 07), A 2 = . 6 8, e ; = [-. 2 19, -. 4 91, . 843) A 3 = . 3 6, e ; = [. 7 49, - . 6 38, -.177) (a) Assuming an m = 1 factor model, calculate the loading matrix L and matrix of specific variances � using the principal component solution method. Compare the results with those in Exercise 9.1. (b) What proportion of the total population variance is explained by the first common factor? 9.4. Given p and � in Exercise 9.1 and an m = 1 factor model, calculate the reduced correlationLmatrix jJ = p - � and the principal factor solution for the loading matrix . Is the result consistent with the information in Exercise 9.1? Should it be? 9.5. Establish the inequality (9-19). Hint: Since S - LL' - � has zeros on the diagonal, (sum of squared entries of S - ii:> - qf) (sum of squared entries of S - LL' ) N ow, S - LL' = ll 111 + 1 e111 + 1 em + 1 + + ll P eP eP = (2) A ( 2) P('2) , wh ere . !A c2> = [ t\,"+ 1 i · i ep J and A. (2) is the diagonal matrix with elements m + 1 , . . . , AP . "Us� (sum of squared entries of A) = tr AA' and tr[P c2J A (2) A (2) lP(2) ] = tr [A c2> A <2> ]. 9.6. Verify the following matrix identities. (a) ( I + L' � - 1 L) - 1 L' � - 1 L = I - (I + L' � - 1 L) - 1 Hint: Premultiply bothI sides by (I + L' �- 1 L ) . � - 1 L (I + L' � - 1 L) - I L' � - I (b) (LL' + � ) - 1 = � Hint: Postmultiply both sides by (LL' + � ) and use (a). - -
-
A \
�
A
.
_
Af
· · ·
A \
A
Af
p A
A
1'\
Chap. 9
Exercises
577
L' (LL' + '\]f)- 1 = (I + L''\]f- 1 L) - 1 L''\]f - 1 Hint: Postmultiply the result in 1 (b) by L, use (a), and take the trans pose, noting that (LL' + '\]f) - , '\)f - 1 , and ( I + L''\]f - 1 L) - 1 are sym metric matrices. 9.7. (The factor model parameterization need not be unique.) Let the factor model with p = 2 and m = 1 prevail. Show that 0"11 -- eA "2 + 1/11 , O"zz = f i 1 + 1/Jz and, for given 0"11 , 0"22 , and 0"1 2 , there is an infinity of choices for L and 9.8. (Unique but improper solution: Heywood case.) Consider an m = 1 factor model for the population with covariance matrix 1 .4 . 9 I = .4 1 . 7 .9 .7 1 Show that there is a unique choice of L and with I = LL' + but that I/J3 < 0, so the choice is not admissible. 9.9. In a study of liquor preference in France, Stoetzel [25] collected preference rankings of p = 9 liquor types from n = 1442 individuals. A factor analysis of the 9 9 sample correlation matrix of rank orderings gave the following estimated loadings: (c)
'\)1,
[ ]
'\)1
'\)1,
X
Variable (X1 ) Liquors Kirsch Mirabelle Rum Marc Whiskey Calvados Cognac Armagnac *
Estimated factor loadings F,
.64 .50 .46 .17 -.2299 -. -.49 -.52 -.60
Fz
. 02 -.06 -.24 .74 .66 -..2008 -.03 -. 1 7
F3
.16 -.10 -.19 .97* -.39 .09 -.04 .42 .14
*This figure is too high. It exceeds the maximum value of .64, as a result of the approximation method used by Stoetzel to obtain the estimated factor loadings.
578
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
Given these results, Stoetzel concluded the following: The major principle of liquor preference in France is the distinction between sweet and strong liquors. The second motivating element is price, which can be understood by remembering that liquor is both an expensive commodity and an item of con spicuous consumption. Except in the case of the two most popular and least expensive items (rum and marc), this second factor plays a much smaller role in producing preference judgments. The third factor concerns the sociologi cal and primarily the regional, variability of the judgments. (See [25], p. 11.)' (a) Given what you know about the various liquors involved, does Stoetzel s interpretation seem reasonable? (b) Plot the loading pairs for the first two factors. Conduct a graphical orthogonal rotation of the factor axes. Generate approximate rotated loadings. Interpret the rotated loadings for the first two factors. Does your interpretation agree with Stoetzel's interpretation of these factors from the unrotated loadings? Explain. 9.10. The correlation matrix for chicken-bone measurements (see Example 9. 1 4) is 1.000 .505 1.000 .569 .422 1.000 .602 .467 .926 1.000 .621 .482 .877 .874 1.000 .603 .450 .878 .894 . 937 1.000 The following estimated factor loadings were extracted by the maximum like lihood procedure: Varimax Estimated rotated estimated factor loadings factor loadings Variable p1* F, Fz F* 1. Skull length . 602 .200 .484 .411 2. Skull breadth .467 .154 .375 .319 3. Femur length . 926 .143 . 603 . 7 17 4. Tibia length 1.000 .000 . 5 19 .855 5. Humerus length .874 .476 .861 .499 6. Ulna length .894 . 327 .744 .594 Using the unrotated estimated factor loadings, obtain the maximum likeli hood estimates of the following. (a) The specific variances. 2
Chap. 9
Exercises
579
The communalities. The proportion of variance explained by each factor. The residual matrix R LzL� � z · 9.11. Refer to Exercise 9. 1 0. Compute the value of the varimax criterion using both unrotated and rotated estimated factor loadings. Comment on the results. 9.U. The covariance matrix for the logarithms of turtle measurements (see Exam ple 8.4) is 11.072 s = w - 3 8. 0 19 6. 4 17 8. 1 60 6.005 6.773 The following maximum likelihood estimates of the factor loadings for an m = 1 model were obtained: (b)
(c) (d)
-
-
]
[
Estimated factor loadings
Variable 1. ln(length) 2. ln(width) 3. ln(height)
FI
.1022 .0752 .0765
Using the estimated factor loadings, obtain the maximum likelihood esti mates of each of the following. (a) Specific variances. (b) Communalities. (c) Proportion of variance exp} ajned by the factor. (d) The residual matrix S n L L ' 'II . Hint: Convert S to 811 • 9.13. Refer to Exercise 9. 1 2. Compute the test statistic in (9-39). Indicate why a test of H0 : I = LL' + 'II (with m = 1) versus H1 : I unrestricted cannot be carried out for this example. [See (9-40).] 9.14. The maximum likelihood factor loading estimates are given in (9A-6) by -
-
Verify, for this choice, that where
a A
=
A
-
I
is a diagonal matrix.
580
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
Hirschey and Wichern [15] investigate the consistency, determinants, and uses of accounting and market-value measures of profitability. As part of their study, a factor analysis of accounting profit measures and market esti mates of economic profits was conducted. The correlation matrix of account ing historical, accounting replacement, and market-value measures of profitability for a sample of firms operating in 1977 is as follows: Variable HRA HRE HRS RRA RRE RRS Q REV Historical return on assets, HRA 1.000 Historical return on equity, HRE .738 1. 000 Historical return on sales, HRS .731 .520 1.000 Replacement return on assets, RRA . 828 . 688 . 652 1.000 Replacement return on equity, RRE .681 .831 . 5 13 . 887 1. 000 Replacement return on sales, RRS .712 .543 . 826 . 867 .692 1.000 Market Q ratio, Q .625 .322 .579 . 639 .419 . 608 1.000 Market relative excess value, REV . 604 .303 . 6 17 .563 .352 . 6 10 . 937 1.000 The following rotated principal component estimates of factor loadings for an m = 3 factor model were obtained: Estimated factor loadings Variable Fz F3 Ft .433 . 6 12 .499 Historical return on assets Historical return on equity .125 . 892 .234 Historical return on sales .296 .238 . 887 .406 .708 .483 Replacement return on assets Replacement return on equity .198 .895 .283 .331 .414 .789 Replacement return on sales . 928 .160 .294 Market Q ratio . 9 10 .079 .355 Market relative excess value Cumulative proportion .287 . 628 .908 of total variance explained (a) Using the estimated factor loadings, determine the specific variances and communalities. (b) Determine the residual matrix, R izi� ir z· Given this information and the cumulative proportion of total variance explained in the preced ing table, does an m = 3 factor model appear appropriate for these data? 9.15.
-
-
Chap. 9
Exercises
581
Assuming that estimated loadings less than .4 are small, interpret the three factors. Does it appear, for example, that market-value measures provide evidence of profitability distinct from that provided by account ing measures? Can you separate accounting historical measures of prof itability from accounting replacement measures? 9.16. Verify that factor scores constructed according to (9-50) have sample mean vector 0 and zero sample covariances. 9.17. Consider the LISREL model in Example 9.16. Interchange 1 and A1 in the parameter vector AY ' and interchange A 2 and 1 in the parameter vector Ax Using the S matrix provided in the example, solve for the model parameters. Explain why the scales of the structural variables and �must be fixed. (c)
TJ
The following exercises require the use of a computer.
Refer to Exercise 5.16 concerning the numbers of fish caught. (a) Using only the measurements x 1 - x4, obtain the principal component solution for factor models with m = 1 and m = 2. (b) Using only the measurements x 1 - x4, obtain the maximum likelihood solution for factor models with m = 1 and m = 2. (c) Rotate your solutions in Parts (a) and (b). Compare the solutions and comment on them. Interpret each factor. (d) Perform a factor analysis using the measurements x1 - x6 • Determine a reasonable number of factors m, and compare the principal component and maximum likelihood solutions after rotation. Interpret the factors. 9.19. A firm is attempting to evaluate the quality of its sales staff and is trying to find an examination or series of tests that may reveal the potential for good performance in sales. The firm has selected a random sample of 50 sales people and has evaluated each on 3 measures of performance: growth of sales, profitability of sales, and new-account sales. These measures have been converted to a scale, on which 100 indicates "average" performance. Each of the 50 individuals took each of 4 tests, which purported to measure creativity, mechanical reasoning, abstract reasoning, and mathematical abil ity, respectively. The n = 50 observations on p = 7 variables are listed in Table 9.12. (a) Assume an orthogonal factor model for the standardized variables Zi = ( Xi - JL i )/� , i = 1, 2, . . . , 7. Obtain either the principal compo nent solution or the maximum likelihood solution for m = 2 and m = 3 common factors. (b) Given your solution in (a), obtain the rotated loadings for m = 2 and m = 3. Compare the two sets of rotated loadings. Interpret the m = 2 and m = 3 factor solutions. c) List the estimated communalities, specific variances, and ££ ' + 4r for the m = 2 and m = 3 solutions. Compare the results. Which choice of m do you prefer at this point? Why? 9.18.
(
582
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
TABLE 9. 1 2 SALESPEOPLE DATA
Sales Salesperson growth (x1 ) 1 93.0 2 88.8 3 95.0 4 101.3 5 102.0 6 95.8 7 95.5 8 110.8 9 102.8 10 106.8 11 103. 3 12 99.5 13 103.5 14 99.5 15 100.0 16 81.5 17 101. 3 18 103.3 19 95.3 20 99.5 88. 5 21 22 99.3 87.5 23 24 105.3 25 107.0 93.3 26 106.8 27 106.8 28 92.3 29 106.3 30 106.0 31 88.3 32 96.0 33 94.3 34 106.5 35
Index of: Sales profitability (x2 ) 96.0 91.8 100.3 103.8 107. 8 97.5 99.5 122.0 108.3 120.5 109.8 111. 8 112.5 105.5 107.0 93.5 105.3 110.8 104.3 105.3 95. 3 115. 0 92.5 114.0 121.0 102.0 118.0 120.0 90.8 121. 0 119.5 92.8 103.3 94.5 121.5
Score on: Mechanical Abstract Newaccount Creativity reasoning reasoning sales (x3 ) test (x4 ) test (x5 ) test (x6 ) 12 09 97.8 09 10 10 96.8 07 12 09 08 99.0 12 14 106.8 13 12 15 103.0 10 14 11 10 99.3 12 09 09 99.0 20 15 115.3 18 13 17 10 103.8 11 18 102.0 14 12 17 104.0 12 18 08 10 100.3 17 11 16 107.0 10 11 08 102.3 08 10 13 102.8 05 09 07 95.0 11 12 11 102.8 14 11 11 103.5 13 14 05 103.0 11 17 17 106.3 07 12 10 95.8 11 11 05 104.3 07 09 09 95.8 12 15 12 105.3 12 19 16 109.0 07 15 10 97.8 12 16 14 107.3 11 16 10 104.8 13 10 08 99.8 11 17 09 104.5 10 15 18 110.5 08 11 13 96.8 11 15 07 100.5 11 12 10 99.0 10 17 18 110.5
Mathernatics test (x7 ) 20 15 26 29 32 21 25 51 31 39 32 31 34 34 34 16 32 35 30 27 15 42 16 37 39 23 39 49 17 44 43 10 27 19 42 (continued)
Chap. 9
TABLE 9. 1 2
Salesperson 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
Exercises
583
(continued)
Sales growth (x1 ) 106.5 92.0 102.0 108.3 106.8 102.5 92.5 102.8 83.3 94.8 103.5 89.5 84.3 104.3 106.0
Index of: Sales profitability (x2 ) 115. 5 99.5 99.8 122.3 119.0 109. 3 102.5 113. 8 87.3 101. 8 112.0 96.0 89.8 109.5 118.5
Score on: Mechanical Abstract Newaccount Creativity reasoning reasoning sales (x3 ) test (x4 ) test (x5 ) test (x6 ) 14 13 08 107.0 08 16 18 103.5 14 12 13 103.3 12 19 15 108.5 12 20 14 106.8 13 17 09 103. 8 06 15 13 99.3 10 20 17 106.8 09 05 01 96.3 11 16 99.8 07 12 13 18 110.8 11 15 07 97.3 08 08 94.3 08 12 12 106.5 14 11 16 105.0 12
Mathernatics test (x7 ) 47 18 28 41 37 32 23 32 15 24 37 14 09 36 39
Conduct a test of H0 : LL' + '\fl versus H : I LL' + '\fl for both m 2 and m 3 at the .01 level. With1 these results and those in Parts b and c, which choice of m appears to be the best? (e) Suppose a new salesperson, selected at random, obtains the test scores x' = [x1 , x2 , , x7 ] [ 110, 98, 105, 15, 18, 12, 35]. Calculate the sales person ' s factor score using the weighted least squares method and the regression method. Note: The components of x must be standardized using the sample means and variances calculated from the original data. 9.20. Using the air-pollution variables X1 , X2 , X5 , and X6 given in Table 1. 3 , gen erate the sample covariance matrix. (a) Obtain the principal component solution to a factor model with m 1 and m 2. (b) Find the maximum likelihood estimates of L and '\fl for m 1 and m 2. (c) Compare the factorization obtained by the principal component and max imum likelihood methods. 9.21. Perform a varimax rotation of both m = 2 solutions in Exercise 9. 2 0. Inter pret the results. Are the principal component and maximum likelihood solu tions consistent with each other? (d)
=
=
• • •
=
a =
=I=
=
=
=
=
584
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
Refer to Exercise 9.20. (a) Calculate the factor scores from the = 2 maximum likelihood esti mates by (i) weighted least squares in (9-50) and (ii) the regression approach of (9-58). (b) Find the factor scores from the principal component solution, using (9-51). (c) Compare the three sets of factor scores. 9.23. Repeat Exercise 9. 2 0, starting from the sample correlation matrix. Interpret the factors for the = 1 and = 2 solutions. Does it make a difference if R, rather than S, is factored? Explain. 9.24. Perform a factor analysis of the census-tract data in Table 8.2. Start with R and obtain both the maximum likelihood and principal component solutions. Comment on your choice of Your analysis should include factor rotation and the computation of factor scores. 9.25. Perform a factor analysis of the "stiffness" measurements given in Table 4.3 and discussed in Example 4.14. Compute factor scores, and check for outliers in the data. Use the sample covariance matrix S. 9.26. Consider the mice-weight data in Example 8.6. Start with the sample covari ance matrix. (See Exercise 8. 1 5 for � .) (a) Obtain the principal component solution to the factor model with = 1 and = 2. (b) Find the maximum likelihood estimates of the loadings and specific vari ances for = 1 and = 2. (c) Perform a varimax rotation of the solutions in Parts a and b. 9.27. Repeat Exercise 9. 26 by factoring R instead of the sample covariance matrix S. Also, for the mouse with standardized weights [.8, - .2, - .6, 1 .5], obtain the factor scores using the maximum likelihood estimates of the loadings and Equation (9-58) . 9.28. Perform a factor analysis of the national track records for women given in Table 1 .7. Use the sample covariance matrix S and interpret the factors. Compute factor scores, and check for outliers in the data. Repeat the analy sis with the sample correlation matrix R. Does it make a difference if R, rather than S, is factored? Explain. 9.29. Refer to Exercise 9.28. Convert the national track records for women to speeds measured in meters per second. (See Exercise 8 . 19. ) Perform a factor analysis of the speed data. Use the sample covariance matrix S and interpret the factors. Compute factor scores, and check for outliers in the data. Repeat the analysis with the sample correlation matrix R. Does it make a difference if R, rather than S, is factored? Explain. Compare your results with the results in Exercises 9.28. Which analysis do you prefer? Why? 9.30. Perform a factor analysis of the national track records for men given in Table 8.6. Repeat the steps given in Exercise 9.28. Is the appropriate factor model for the men's data different from the one for the women 's data? If not, are m
9.22.
m
m
m.
m
m
m
m
Chap. 9
References
585
the interpretations of the factors roughly the same? If the models are differ ent, explain the differences. 9.31. Refer to Exercise 9.30. Convert the national track records for men to speeds measured in meters per second. (See Exercise 8.2 1.) Perform a factor analy sis of the speed data. Use the sample covariance matrix S and interpret the factors. Compute factor scores, and check for outliers in the data. Repeat the analysis with the sample correlation matrix R. Does it make a difference if R, rather than S, is factored? Explain. Compare your results with the results in Exercises 9.30. Which analysis do you prefer? Why? 9.32. Perform a factor analysis of the data on bulls given in Table 1. 8 . Use the seven variables YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt. Factor the sample covariance matrix S and interpret the factors. Compute fac tor scores, and check for outliers. Repeat the analysis with the sample corre lation matrix R. Compare the results obtained from S with the results from R. Does it make a difference if R, rather than S, is factored? Explain. REFERENCES An Introduction to Multivariate Statistical Methods (
)
1. Anders oln,ey,T.1984.W. 2d ed. . New York: John Wi 3.2. BartBarthleolt o, mew, M. (S.1937)D."TheJ., 97-104. Statistical Conception of Mental Factors." London: Grif in, 1987. 4. Barttions.le"t , M. S. "A Note on Multiplying Factors for Var(1954)ious, 296-298. Chi-Squared Approxima 5. Bentler, P. M. "Multivaria(t1e980),Anal419-456. ysis with Latent Variables: Causal Models." 6. Bielby, W.(1977)T., and, 137-161. R. M. Hauser "Structural Equation Models." 7.8. DiBolxlon,en, W.K. A.J., ed. NewBerkYork: John: UniWilveersy, 1989. el e y, CA. ity of Cal i f o r n i a Pr e s , 1979. 9. Press, Duncan,1975. New York: Academic 10. Dunn, L. C. "The Effect of Inbre(edi1928)ng ,on1-112.the Bones of the Fowl." 11. Gol(d1972)berger,, 979-1001. A. S. "Structural Equation Methods in the Social Sciences." 12.13. Hayduk, Harmon, L.H.A.H. Chicago: The UniverBalsitytiofmore:ChicTheago Johns Pres , Hop 1967. ki n s Uni v er s i t y Pres s , 1987. 14. Heise, D. R. New York: John Wiley, 1975. Latent Variable Models and Factor Analysis.
British Journal of Psy
chology,
28
Journal of the Royal Statistical Society (B) ,
16
Annual
Review of Psychology,
31
Annual Review of Soci
ology,
3
Structural Equations with Latent Variables.
BMDP Biomedical Computer Programs.
0.
Q. Introduction to Structural Equation Models.
Storrs Agricultural
Experimental Station Bulletin,
52
Econometrica,
40
Modern Factor Analysis.
Structural Equation Modeling with LISREL.
Causal Analysis.
586
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
15. iHitarbisclihey,ty: ConsM., iandstency,D. DetW. eWirmcihern.nants"Account in" g and Market-Value Measures of Prof and Us e s. 984), o375-383. 16. Joreskog,no.K. 4G.(1"Fact r Analysis by LeaseditteSquares andleMaxiin, A.mumRalsLitokn,eliandhood.H." S.InWilf. d by K. Ens New kYork:og, K.JohnG., andWileD.y, 1975.Sorbom. 17. Jores Cambr idge,D.MA:Sorbom.Abt Books, 1979. 18. Jores k og, K. , and Chicago: Scientific Sof t w ar e I n t e rnat i o nal , 1996. 19. Kaiser, H. F. "The(1958)Vari, 18m7-200. ax Criterion for Analytic Rotation in Factor Analysis." 20. YorLawlke: y,AmerD. N.ic,anandElsA.eviE.er Maxwel ln. g Co., 1971. 2d ed. . New Publ i s h i 21. Linden,no. 3M.(1977)"A Fact, 562-568. or Analytic Study of Olympic Decathlon Data." 22. Maxwel l, A. E. London: Chapman and Hal l , 1977. 23. MiEnvil err,onment D. "St.a"ble in the Saddle" CEO Tenurno. e1 and(1991),the 34--Mat5c2.h between Organization and 24.25. MorStoetrizselon,, J.D."AF.Factor Analysis of Liquor Prefere2dnce.ed." . New York: McGraw-Hil , 1976. 26. Wr(1960) ight, S., edi7-11."Theted byInterpKempt retationhorofnMule andtivarotihaertesSys. Amestems., l"AIn: Iowa State University Pres , 1954, 11-33. Journal of Business and Economic Sta
tistics,
2,
Sta
tistical Methods for Digital Computers,
Advances in Factor Analysis and Structural Equation
Models.
LISREL
8:
User's Reference Guide,
Psy
chometrika,
23
Factor Analysis as a Statistical Method
(
)
Research Quarterly,
48,
Multivariate Analysis in Behavioral Research.
Management Science,
37,
Multivariate Statistical Methods
(
)
Journal of Advertising Research,
1
Statistics and Mathematics in
Biology,
0.
CHAPTER 10
Canonical Correlation Analysis

10.1 INTRODUCTION
Canonical correlation analysis seeks to identify and quantify the associations between two sets of variables. H. Hotelling ([5], [6]), who initially developed the technique, provided the example of relating arithmetic speed and arithmetic power to reading speed and reading power. (See Exercise 10.9.) Other examples include relating governmental policy variables with economic goal variables and relating college "performance" variables with precollege "achievement" variables.

Canonical correlation analysis focuses on the correlation between a linear combination of the variables in one set and a linear combination of the variables in another set. The idea is first to determine the pair of linear combinations having the largest correlation. Next, we determine the pair of linear combinations having the largest correlation among all pairs uncorrelated with the initially selected pair, and so on. The pairs of linear combinations are called the canonical variables, and their correlations are called canonical correlations.

The canonical correlations measure the strength of association between the two sets of variables. The maximization aspect of the technique represents an attempt to concentrate a high-dimensional relationship between two sets of variables into a few pairs of canonical variables.

10.2 CANONICAL VARIATES AND CANONICAL CORRELATIONS
We shall be interested in measures of association between two groups of variables. The first group, of p variables, is represented by the (p × 1) random vector X⁽¹⁾. The second group, of q variables, is represented by the (q × 1) random vector X⁽²⁾. We assume, in the theoretical development, that X⁽¹⁾ represents the smaller set, so that p ≤ q.
X
:o::::;
587
588
Chap.
10
Canonical Correlation Analysis
For the random vectors X(l l and X (2) , let E (X (l l ) (l l; Cov cx< I l ) I ll E (x
JL
=
=
JL
=
=
X1 )
((p + q) X
=
(10-2)
has mean vector (10-3)
and covariance matrix I E(X - p.) (X - p.)' (p + q) X (p + q) =
[ ]
In i I1 2 (10-4) I(q x2p)1 !! (qIxzzq) The covariances between pairs of variables from different sets-one variable from X (l l , one variable from x<2l-are contained in I 1 2 or, equivalently, in I21 . That is, the pq elements of I 1 2 measure the association between the two sets. When p and q are relatively large, interpreting the elements of I 1 2 collectively is ordinarily hopeless. Moreover, it is often linear combinations of variables that are interesting and useful for predictive or comparative purposes. The main taskl of canonical correlation analysis is to summarize the associations between the x
--------
--------
I
Sec.
1 0.2
Canonical Variates and Canonical Correlations
589
sets in terms of a few carefully chosen covariances (or correlations) rather than the pq covariances in I 1 2 • Linear combinations provide simple summary measures of a set of variables. Set U = a' X ( t ) (10-5) V = b' X (2) for some pair of coefficient vectors a and b. Then, using (10-5) and (2-45), we obtain Var(U ) = a' Cov ( X (I l ) a = a' I1 1 a (10-6) Var(V ) = b' Cov (X<2> ) b = b'I22 b Cov ( U, V ) = a' Cov ( x ( l>, x<2 > ) b = a' I 1 2 b We shall seek coefficient vectors a and b such that X (2)
(10-7)
is as large as possible. We define the following: The first pair of canonical variables, or first canonical variate pair, is the pair of linear combinations U1 , V1 having unit variances, which maximize the cor relation (10-7); The second pair of canonical variables, or second canonical variate pair, is the pair of linear combinations U2 , V2 having unit variances, which maximize the correlation (10-7) among all choices that are uncorrelated with the first pair of canonical variables. At the kth step: The kth pair of canonical variables, or kth canonical variate pair, is the pair of linear combinations Uk, Vk having unit variances, which maximize the cor relation (10-7) among all choices uncorrelated with the previous k 1 canon ical variable pairs. The correlation between the kth pair of canonical variables is called the kth canon ical correlation. The following result gives the necessary details for obtaining the canonical variables and their correlations. q and let the random vectors x and x<2> have Result 1 0. 1 . Suppose p (q X l ) (p X l ) Cov ( X ( l l ) = I1 1 , Cov ( X<2 > ) = I22 and Cov ( X ( l l , X ( 2 ) ) = I1 2 where I has q X q) X ) X q) -
:o::;
(p p
(
(p
590
Chap.
10
Canonical Correlation Analysis
full rank. For coefficient vectors (p aX I ) and (q bX I) , form the linear combinations U = a' X ( I ) and V = b'X<2>. Then max Corr ( U, V) = pj a, b attained by the linear combinations (first canonical variate pair) U1 = e{ I !l 12 X (l) and V1 = f{ I2":f/2 X<2> '-.r--'
'-v--'
a{
b{
The kth pair of canonical variates, k = 2, 3, ... , p, � - l /2 x<2 > vk = rk· .... uk - ek' �_.., I-Il /2 x ( l ) 22 maximizes -
among those linear combinations uncorrelated with the preceding 1, 2, ... , k - 1 canonical variables. � .... � 1;2 - � .... H ere p*1 2 ;;;;. p*2 2 ;;;;. ;;;;. P*p 2 are t h e etgenva ues of � .... 1-11;2� .... 1 2 �.... 22 2 1 1-1 , and e 1 , e2 , . . . , eP are the associated (p 1) eigenvectors. (The quantities p1 2 , p� 2 , , p; 2 are also the p largest eigenvalues of the matrix I:Z] i2 I 2 1 I!l i 1 2 I2F2 with corresponding (q 1) eigenvectors f 1 , f2 , . . . , fP . Each f proport10na to � .... 22- 1;2�....2 1 �•1-11;2 e ; . ) The canonical variates have the properties Var(Uk) = Var(Vd = 1 Cov(Uk, Uc) = Corr(Uk, Uc) = 0 k * f Cov(Vk, Ve) = Corr(Vk, Ve) = 0 k i= f Cov(Uk, Ve) = Corr(Uk, Vc) = 0 k i= f for k, e = 1, 2, . . . , p. Proof. We assume that I 1 1 and I 22 are nonsingular. 1 Introduce the symmet ric square-root matrices 2Ig2 and I��2 with I 1 1 = I g2 Ig2 and I!l = I !li2 I !F2 . 2 [See (2-22). ] Set c = Ig a and d = I�� b, so a = I!ll2 c and b = I:Z]i2 d. Then ·
• • •
X
. . •
· ; IS
·
1
Corr ( a' X ( l) ' b' X (2) )
I
X
=
a· � ..., 1 2 b Ya' I 1 1 a Yb' I22 b
� - 1 /2� � - 1 /2 = c · ..., 1 1 ..., 1 2...,2 2 d � \l'd'd
(10-8)
1 I f I1 1 o r I 22 is singular, one or more variables may be deleted from the appropriate set, and the linear combinations a'X(I) and b ' X(2 ) can be expressed in terms of the reduced set. If p > rank (I1 ) = p 1 , then the nonzero canonical correlations are p� , . . . , p;, . 2
Sec.
1 0.2
Canonical Variates and Canonical Correlations
591
By the Cauchy-Schwarz inequality (2-48), Since
(2-51)
c 1 Ii l 12 I 1 2 Iz"F2 d :,;;; ( c 1 I !l12 I 1 2 I Z21 I 2 1 I 1? f2 c ) 1 12 ( d 1 d ) 1 12 pXp Iil 12 I 1 2 I2] I 2 1 I!l /2
is a
yields
(10-9)
symmetric matrix, the maximization result
� - � � � -11 /2 C :,;;; i C C C 1 �"'-1-11 /2�"'- 1 2"'-22 ll "'-2 1 "'-1 Iil 12 I 1 2 I2d I2 1 I !l f2 . A1 . 2 2 1 e Ii/ I2F I2 1 1. ( b 1 X< 2 > ) = � I
\
(10-10)
where A 1 is the largest eigenvalue of Equality occurs in (10-10) for c =e 1 , a normalized eigenvalue associated with Equality also holds in (10-9) if d is proportional to Thus, max (10-11 ) Corr a1X ( l> , a, b with2 equality occurring for a = I i{ /2 c = Ii/ /2 e 1 and with b proportional to 2 2 I2F I2F I 2 1 Iil 1 e 1 , where the sign is selected to give positive correlation. We take b = :_tz-j12 f 1 . This last correspondence follows by multiplying both sides of � - 1 /2� � - 1 � � - 1 /2 ) e = e ( "'-1 1 "'- 1 2"'-22 "'-2 1 "'- t t l ll l 1 \
' y1e ldmg
� - 1 /2� � - 1 /2 ' . "'-22 "'-2 1 "'-1 1 � - 1 /2� � - � � � - 1 12 ( � - 1 12� � - 1 /2 e ) = , ( � - 1 12 � � - 1 /2 e ) (10- 12) llt "'-22 "'-21 "'-1 1 1 "'-22 "'-2 1 "'-1 1 "'-1 2 "'-22 "'-22 "'-2 1 "'-1 1 1 � �-1 2 �"'-t-t1 /2� � - 1 "" 1 ( llt , e 1 ) . . "" 1 2 "" 22 2 1 "" 1 1 / 2 2 / of i2F I 2 1 I il e 1 -is (A 1 , f 1 ) -with f 1 f1 I2F2 I2 1 Iil i1 2 I2F2 . v1 = r; :.tz-j f2 X (2) = e{ I !l /2 X ( I ) p 'f = � . ( ) = e{ Ii/ 12 I 1 1 I!l 12 e 1 = e{ e 1 = 1, ( ) = 1. 1 a 1 X< 1 > = C 1 Iil /2 X< 1 >
b
Y
Th us, 1'f 1s an e1genva ue-e1genvector pau. for , then an eigenvalue-eigen the normalized form The sign for is chosen to give a positive vector pair for correlation. We have demonstrated that UJ and are the first pair of canonical variables and that their correlation is Also, Var U1 and similarly, Var V1 Continuing, we note that U and an arbitrary linear combination are uncorrelated if 1 2� � 1 2 0 = Cov ( U1 • c � � "'- -I It /2 X< 1 > ) - e 1� � "'-1-1 ; "'-1 1 "'-1-1 1 c - e 11 c ' At the kth stage, we require that c e 1 , e2 , . , ek - t · The maximization result (2-52) then yields 1 2� � - � �"'- t �"'- -t tJ /2 c :o::; /l kc c C1 � for c e 1 , . . . , e k - 1 z "'-1-1 / "'- t z "'- zz and by (10-8), '
.
..L
..
\
I
..L
592
Chap.
10
Canonical Correlation Analysis
with equality2 for1 e k or a = 2 I;;}12 2 e k and b I2F2 fk , as before. Thus, Uk ei i 1l f X< > and Vk = fi i2F X< >, are the kth canonical pair, and they have correlation � p� . Although we did not explicitly require the Vk to be uncorrelated, ifk =l= f � p Also, Cov(Uk, Ve ) ei i1lf2 I 1 2 I2F2 fe = 0, ifk =l= f � p since r; is a multiple of ei i1l f2 I1 2 I2F2 by (10-12). If{2)the original variables are standardized with z ( J ) = rzp> , Z�1 > , . . . ' Z�1 > ] ' 2 2 2 and z [Z� ) ' Z� ) ' ' Z� ) ] from first principles, the canonical variates are of the form Uk -- ak' zO> -- ek' p 11- 1 12 z< 1 > vk (10-13) - bk' Z(2) -- f'k p22- 1 12 z<2> 2> Here, Cov(zO > ) . = p11 , Cov(Z<2>-)1 /2 p22 ,- 1Cov(zO>, and f k are the e1genvectors of Pu P1 2 Pk22 P2 1P 11- 1 /2 z - p ( l > ) ak 1 (X fi> - JL f!> ) + ak 2 (X� 1 > - JL � 1 > ) + ... + akp (X�l ) - JL�1 ) ) (X f l > - JL f!> ) (X� 1 ) JL� 1 > ) + ak ak 2 1 - � va::; va;;_ vu:::, �1 �2 c =
=
=
=
=
=
•
I'
. • .
=
=
=
=
.
···
=
_
_
U; ; .
where Var(xfi > ) i 1, 2, ... , p. Therefore, the canonical coefficients for the standardized variables, z p> cxp>) - JLf!> ) ;v;;; ' are simply related to the canonical coefficients attached to the original variables xp > . Specifically, if a; is the coefficient vector for the kth canonical variate Uk, then a; vg2 is the coefficient vector for the kth canonical variate constructed from the standardized variables =
=
=
Sec.
1 0.2
Canonical Variates and Canonical Correlations
593
zO l . Here Vj{2 is the diagonal matrix with ith diagonal element va:;; . Similarly, 2 b� V�� is the coefficient vector for the canonical variate constructed from the set of standardized variablesY Z (2) . In2lthis case vW is the diagonal matrix with ith diag onal element va:;; = Var(Xf ) . The canonical correlations are unchanged by the standardization. However, the choice of the coefficient vectors ak , bk will not be umque 1" f P*k 2 = P*k 2+ i · The relationship between the canonical coefficients of the standardized vari ables and the canonical coefficients of the original variables follows from the spe cial structure of the matrix (see also (10-16)) I!l12 I 12 I2d I21I !l 12 (or p ],1 12 P 12 Pzi P21 PI1112 ) and, in this book, is unique to canonical correlation analysis. For example, in prin cipal component analysis, if a� is the coefficient vector for the kth principal com ponent obtained from I, then a� ( X - J.t) = a� V 1 12 z, but we cannot infer that a� V 1 /2 is the coefficient vector for the kth principal component derived from p . 0
Example 1 0. 1
(Calculating canonical variates and canonical correlations for standardized variables)
ll.O
l
2l = [Zf2l , Z�2l ]' Suppose z( ll = [Zf' l , Z�l) ]' are standardized1 lvariables and z< 2 are also standardized variables. Let = [Z< , z< l ]' and .4 i .5 .6 ] : _ , _ �_.(! _ = 5� 1 � � � � Cov (Z) = ��!_' . .3 : 1.0 .2 L P21 : P22 J .6 .4 i .2 1.0 Then 681 - .2229 p 11- 1 /2 = [ -.1.20229 1. 0681 J - .2083 P2i [ -.1.20417 083 1.0417 J and .2178 ] P1- 1/2 P12P22- 1 P2 1 P 11- 1 /2 - [ ._42371 178 _1096 The e1genvalues, p1* 2 , p*2 2 , of p 11- 1 /2 p1 2 p22- 1 p2 1p 11- 1 /2 are obtamed from - A .2178 1 -- (.4371 - A) (.1096 - A) - (2. 1 78)2 = 1 . 4 371 .2178 _ 1096 A = A 2 - .5467 A + . 0005 Z
-- 2 ---- -·- - --: -----· -
O
0
0
_
594
Chap.
10
Canonical Correlation Analysis
yielding Pt 2 = .5458 and Pi 2 = .0009. The eigenvector e 1 follows from the vector equation [ .4371 .2178 e r = (.5458) e r .2178 . 1 096 J Thus, e{ [. 8947, .4466] and .8561 ] a r = Pu- 1 /2 e r = [ .2776
959 .2292 ] [ .8561] = [ .4026] b1 P22- 1 Pzr 31 = [ ..53209 .3542 .2776 .5443 We must scale b1 so that Var(V1 ) = Var(b{Z<2> ) = b{p22 b1 = 1 The vector [.4026, .5443]' gives oc
°26 ] = .5460 [.4026, .5443] [1 .·02 1.·20 ] [..54443 Using Y.5460 = .7389, we take 1 [ .4026 ] = [ .5448 ] bl = .7389 .5443 .7366
The first pair of canonical variates is U1 = a; z( l > = . 86zp> + .28Z �1 > VI = b{ Z(Z) = .54Zf2> + .74Zfl and their canonical correlation is Pt = v'Pf2 = v':54s8 = .74 possible between linear combinations of vari This is the largest(l)correlation ables from the z and z (Z) sets. The second canonical correlation, Pi = v:oOo9 = .03, is very small, and consequently, the second pair of canonical variates, although uncorre lated with members of the first pair, conveys very little information about the association between sets. (The calculation of the second pair of canonical variates is considered in Exercise 10. 5 . )
Sec.
1 0.3
Interpreting the Population Canonical Variables
595
We note that U and V , apart from a scale change, are not much dif ferent from the pair 1 1 z p > J = 3zp> + Z�1 > if1 = a' z< 1 > = [3 , 1] [ z!t> zz1z>> v1 = b'z = [1 , 1] [ f J = z F > + Z�2 > For these variates, Var( U1 ) = a' p11 a = 12.4 Var(V1 ) = b' p22 b = 2.4 Cov(U1 , V1 ) = a' p1 2 b = 4.0 and 4.0 = 73 Vi2.4 VzA . The correlation between the rather simple and, perhaps, easily interpretable linear combinations U1 , V1 is almost the maximum value Pt = .74. • The procedure for obtaining the canonical variates presented in Result 10. 1 has certain advantages. The symmetric matrices, whose eigenvectors determine the canonical coefficients, are readily handled2 by computer routines. Moreover, writing 2 = the coefficient vectors as a k = I !ll ek and bk I2}1 fk facilitates analytic descriptions and their geometric interpretations. To ease the computational burden, many people prefer to get the canonical correlations from the eigenvalue equation (10-15) The coefficient vectors a and b follow directly from the eigenvector equations I !t1 It zi2i izt a = p * 2 a (10-16) The matrices I !l i1 2 I2i i2 1 and I2ii2 1 I !li1 2 are, in general, not symmetric. (See Exercise 10.4 for more details.) 1 0. 3 INTERPRETING THE POPULATION CANONICAL VARIABLES
Canonical variables are, in general, artificial. That is, they have no physical mean ing. If the original variables x<1 > and X (2) are) used, thez> canonical coefficients a and b have units proportional to those of the x
596
Chap.
10
Canonical Correlation Analysis
are standardized to have zero means and unit variances, the canonical coefficients have no units of measurement, and they must be interpreted in terms of the stan dardized variables. Result 10.1 gives the technical definitions of the canonical variables and canon ical correlations. In this section, we concentrate on interpreting these quantities. Identifying the Canonical Variables
Even though the canonical variables are artificial, they can often be "identified" in terms of the subject-matter variables. Many times this identification is aided by computing the correlations between the canonical variates and the original variables. These correlations, however, must be interpreted with caution. They provide only univariate information, in the sense that they do not indicate how the original variables contribute jointly to the canonical analyses. (See, for example, [11].) For this reason, many investigators prefer to assess the contributions of the original variables directly from the standardized coefficients (10-13).

Let A = [a1, a2, ..., ap]' and B = [b1, b2, ..., bq]', so that the vectors of canonical variables are

U = A X(1)      V = B X(2)      (10-17)

where we are primarily interested in the first p canonical variables in V. Then

Cov(U, X(1)) = Cov(A X(1), X(1)) = A Σ11      (10-18)

Because Var(Ui) = 1, Corr(Ui, Xk(1)) is obtained by dividing Cov(Ui, Xk(1)) by √Var(Xk(1)) = σkk^{1/2}. Equivalently, Corr(Ui, Xk(1)) = Cov(Ui, σkk^{-1/2} Xk(1)). Introducing the (p x p) diagonal matrix V11^{-1/2} with kth diagonal element σkk^{-1/2}, we have, in matrix terms,

ρ_{U, X(1)} = Corr(U, X(1)) = Cov(U, V11^{-1/2} X(1)) = Cov(A X(1), V11^{-1/2} X(1)) = A Σ11 V11^{-1/2}

Similar calculations for the pairs (U, X(2)), (V, X(2)) and (V, X(1)) yield

ρ_{U, X(1)} = A Σ11 V11^{-1/2}      ρ_{V, X(2)} = B Σ22 V22^{-1/2}
                                                              (10-19)
ρ_{U, X(2)} = A Σ12 V22^{-1/2}      ρ_{V, X(1)} = B Σ21 V11^{-1/2}

where V22^{-1/2} is the (q x q) diagonal matrix with ith diagonal element [Var(Xi(2))]^{-1/2}.

Canonical variables derived from standardized variables are sometimes interpreted by computing the correlations. Thus,

ρ_{U, z(1)} = Az ρ11      ρ_{V, z(2)} = Bz ρ22
                                                              (10-20)
ρ_{U, z(2)} = Az ρ12      ρ_{V, z(1)} = Bz ρ21

where Az (p x p) and Bz (q x q) are the matrices whose rows contain the canonical coefficients for the z(1) and z(2) sets, respectively. The correlations in the matrices displayed in (10-20) have the same numerical values as those appearing in (10-19); that is, ρ_{U, X(1)} = ρ_{U, z(1)}, and so forth. This follows because, for example, ρ_{U, X(1)} = A Σ11 V11^{-1/2} = A V11^{1/2} V11^{-1/2} Σ11 V11^{-1/2} = Az ρ11 = ρ_{U, z(1)}. The correlations are unaffected by the standardization.
Example 10.2 (Computing correlations between canonical variates and their component variables)

Compute the correlations between the first pair of canonical variates and their component variables for the situation considered in Example 10.1.

The variables in Example 10.1 are already standardized, so equation (10-20) is applicable. For the standardized variables,

ρ11 = [1.0  .4; .4  1.0]      ρ22 = [1.0  .2; .2  1.0]      ρ12 = [.5  .6; .3  .4]

With p = 1, Az = [.86, .28] and Bz = [.54, .74], so

ρ_{U1, z(1)} = Az ρ11 = [.86, .28] [1.0  .4; .4  1.0] = [.97, .62]

and

ρ_{V1, z(2)} = Bz ρ22 = [.54, .74] [1.0  .2; .2  1.0] = [.69, .85]

We conclude that, of the two variables in the set z(1), the first is most closely associated with the canonical variate U1. Of the two variables in the set z(2), the second is most closely associated with V1. In this case, the correlations reinforce the information supplied by the standardized coefficients Az and Bz. However, the correlations elevate the relative importance of Z2(1) in
the first set and Z1(2) in the second set because they ignore the contribution of the remaining variable in each set.

From (10-20), we also obtain the correlations

ρ_{U1, z(2)} = Az ρ12 = [.86, .28] [.5  .6; .3  .4] = [.51, .63]

and

ρ_{V1, z(1)} = Bz ρ21 = Bz ρ12' = [.54, .74] [.5  .3; .6  .4] = [.71, .46]

Later, in our discussion of the sample canonical variates, we shall comment on the interpretation of these last correlations. ■

The correlations ρ_{U1, x(1)} and ρ_{V1, x(2)} can help supply meanings for the canonical variates. The spirit is the same as in principal component analysis, where the correlations between the principal components and their associated variables may provide subject-matter interpretations for the components.
Canonical Correlations as Generalizations of Other Correlation Coefficients
First, the canonical correlation generalizes the correlation between two variables. When X(1) and X(2) each consist of a single variable, so that p = q = 1,

|Corr(a'X(1), b'X(2))| = |Corr(X1(1), X1(2))|   for all nonzero a, b

Therefore, the "canonical variates" U1 = X1(1) and V1 = X1(2) have correlation ρ1* = |Corr(X1(1), X1(2))|. When X(1) and X(2) have more components, setting a' = [0, ..., 0, 1, 0, ..., 0] with 1 in the ith position and b' = [0, ..., 0, 1, 0, ..., 0] with 1 in the kth position yields

|Corr(Xi(1), Xk(2))| = |Corr(a'X(1), b'X(2))| ≤ max over a, b of Corr(a'X(1), b'X(2)) = ρ1*      (10-21)

That is, the first canonical correlation is larger than the absolute value of any entry in ρ12 = V11^{-1/2} Σ12 V22^{-1/2}.

Second, the multiple correlation coefficient of X1(1) with X(2) [see (7-54)] is a special case of a canonical correlation when X(1) has the single element X1(1) (p = 1). Recall that the multiple correlation is

max over b of Corr(X1(1), b'X(2)) = ρ1*   for p = 1      (10-22)

When p > 1, ρ1* is larger than each of the multiple correlations of Xi(1) with X(2) or the multiple correlations of Xi(2) with X(1).
Finally, we note that

ρk* = max over b of Corr(Uk, b'X(2)),    k = 1, 2, ..., p      (10-23)

from the proof of Result 10.1. Similarly,

ρk* = max over a of Corr(a'X(1), Vk),    k = 1, 2, ..., p      (10-24)

That is, the canonical correlations are also the multiple correlation coefficients of Uk with X(2) and of Vk with X(1). The largest value, ρ1*^2, is sometimes regarded as a measure of set "overlap."
The First r Canonical Variables as a Summary of Variability

The change of coordinates from X(1) to U = A X(1) and from X(2) to V = B X(2) is chosen to maximize Corr(U1, V1) and, successively, Corr(Ui, Vi), where (Ui, Vi) have zero correlation with the previous pairs (U1, V1), (U2, V2), ..., (U_{i-1}, V_{i-1}). Correlation between the sets X(1) and X(2) has thus been channeled into the canonical variate pairs, but the variates are chosen for correlation, not to summarize the variation within each set. Consequently, the leading canonical variates may be poor summaries of the variability in their own sets, as the next example illustrates.
Example 10.3 (Canonical correlation as a poor summary of variability)

Consider the covariance matrix of [X1(1), X2(1), X1(2), X2(2)]':

[ 100    0  |   0     0 ]
[   0    1  |  .95    0 ]
[  -----------------------]
[   0   .95 |   1     0 ]
[   0    0  |   0   100 ]

The reader may verify (see Exercise 10.1) that the first pair of canonical variates U1 = X2(1) and V1 = X1(2) has correlation ρ1* = .95. Yet U1 = X2(1) provides a very poor summary of the variability in the first set. Most of the variability in this set is in X1(1), which is uncorrelated with U1. The same situation is true for V1 = X1(2) in the second set. ■
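A quick numerical check of this example is easy to run. The following is a minimal sketch (my own illustration, not from the text) in Python, using the covariance matrix of Example 10.3 as given above; the canonical correlation from the eigenvalue equation (10-15) should come out near .95 even though U1 carries only about one percent of the first set's total variance.

```python
import numpy as np

# Covariance matrix of (X1(1), X2(1), X1(2), X2(2)) as displayed in Example 10.3.
Sigma = np.array([[100.0, 0.0,  0.0,   0.0],
                  [  0.0, 1.0,  0.95,  0.0],
                  [  0.0, 0.95, 1.0,   0.0],
                  [  0.0, 0.0,  0.0, 100.0]])
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

# Squared canonical correlations: eigenvalues of S11^{-1} S12 S22^{-1} S21, as in (10-15).
M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
rho_sq = np.sort(np.linalg.eigvals(M).real)[::-1]
print(np.sqrt(rho_sq))              # approx [0.95, 0.0]

# Share of the first set's total variance carried by U1 = X2(1): tiny.
print(S11[1, 1] / np.trace(S11))    # 1/101, about one percent
```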
A Geometrical Interpretation of the Population Canonical Correlation Analysis

A geometrical interpretation of the procedure for selecting canonical variables provides some valuable insights into the nature of a canonical correlation analysis. The transformation

U = A X(1)

from X(1) to U gives

Cov(U) = A Σ11 A' = I

The rows of A are ak' = ek' Σ11^{-1/2}, so A = E' Σ11^{-1/2} = E' P1 Λ1^{-1/2} P1', where E' is an orthogonal matrix with rows ek' and Σ11 = P1 Λ1 P1' is the spectral decomposition of Σ11. Consequently, U = A X(1) = E' P1 Λ1^{-1/2} P1' X(1) can be interpreted as (1) a transformation of X(1) to uncorrelated standardized principal components, followed by (2) a rigid (orthogonal) rotation P1 determined by Σ11, and then (3) another rotation E' determined from the full covariance matrix Σ. A similar interpretation applies to V = B X(2).
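As a small illustration of this geometry (my own sketch, not from the text), the whitening step A = E' Σ11^{-1/2} can be checked numerically: whatever orthogonal matrix plays the role of E', the product A Σ11 A' reduces to the identity, so the canonical variates U have unit variances and zero correlations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Any positive definite Sigma11 (a 2 x 2 example) and any orthogonal matrix Q in the role of E'.
Sigma11 = np.array([[2.0, 0.8], [0.8, 1.0]])
lam, P1 = np.linalg.eigh(Sigma11)                  # spectral decomposition Sigma11 = P1 diag(lam) P1'
Sigma11_inv_sqrt = P1 @ np.diag(lam ** -0.5) @ P1.T
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))       # random orthogonal matrix

A = Q @ Sigma11_inv_sqrt                           # rows play the role of e_k' Sigma11^{-1/2}
print(np.round(A @ Sigma11 @ A.T, 10))             # identity matrix: Cov(U) = A Sigma11 A' = I
```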
1 0.4 THE SAMPLE CANONICAL VARIATES AND SAMPLE CANONICAL CORRELATIONS
A random sample of n observations on each of the (p + q) variables X(1), X(2) can be assembled into the n x (p + q) data matrix

[X(1) | X(2)] = [ x11(1) ... x1p(1) | x11(2) ... x1q(2) ]
                [   .          .    |    .          .   ]      (10-25)
                [ xn1(1) ... xnp(1) | xn1(2) ... xnq(2) ]

The vector of sample means can be organized as

x̄ = [ x̄(1) ]     where  x̄(1) = (1/n) Σ_{j=1}^{n} xj(1),   x̄(2) = (1/n) Σ_{j=1}^{n} xj(2)      (10-26)
    [ x̄(2) ]

Similarly, the sample covariance matrix can be arranged analogous to the representation (10-4). Thus,

S = [ S11  S12 ]
    [ S21  S22 ]

where

Skl = (1/(n - 1)) Σ_{j=1}^{n} (xj(k) - x̄(k)) (xj(l) - x̄(l))',    k, l = 1, 2      (10-27)

The linear combinations

U = a'x(1),      V = b'x(2)      (10-28)

have sample correlation [see (3-36)]

r_{U,V} = a'S12 b / ( √(a'S11 a) √(b'S22 b) )      (10-29)
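Sketched below (my own illustration, not from the text) is how (10-27)-(10-29) translate into code: partition the sample covariance matrix of the combined data and evaluate the sample correlation of a chosen pair of linear combinations. Applying the same eigenvalue computation as in (10-15), with the Skl blocks in place of the Σkl, gives the sample canonical correlations described in Result 10.2 below.

```python
import numpy as np

def sample_canonical_correlations(X1, X2):
    """X1 is n x p, X2 is n x q (p <= q); returns the sample canonical correlations."""
    n, p = X1.shape
    S = np.cov(np.hstack([X1, X2]), rowvar=False)    # divisor n - 1, as in (10-27)
    S11, S12 = S[:p, :p], S[:p, p:]
    S21, S22 = S[p:, :p], S[p:, p:]
    M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
    rho_sq = np.sort(np.linalg.eigvals(M).real)[::-1]
    return np.sqrt(np.clip(rho_sq, 0.0, 1.0))

def sample_corr(a, b, S11, S12, S22):
    """Sample correlation (10-29) of the linear combinations a'x(1) and b'x(2)."""
    return (a @ S12 @ b) / np.sqrt((a @ S11 @ a) * (b @ S22 @ b))
```

For instance, feeding data whose correlation structure matches the chicken-bone matrix of Example 10.4 (or working directly with the R blocks in place of the S blocks) should return sample canonical correlations near .63 and .06.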
The first pair of sample canonical variates is the pair of linear combinations Û1, V̂1 having unit sample variances that maximize the ratio (10-29). In general, the kth pair of sample canonical variates is the pair of linear combinations Ûk, V̂k having unit sample variances that maximize the ratio (10-29) among those linear combinations uncorrelated with the previous k - 1 sample canonical variates. The sample correlation between Ûk and V̂k is called the kth sample canonical correlation.

The sample canonical variates and the sample canonical correlations can be obtained from the sample covariance matrices S11, S12 = S21', and S22 in a manner consistent with the population case described in Result 10.1.

Result 10.2. Let ρ̂1*^2 ≥ ρ̂2*^2 ≥ ... ≥ ρ̂p*^2 be the p ordered eigenvalues of S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} with corresponding eigenvectors ê1, ê2, ..., êp, where the Skl are defined in (10-27) and p ≤ q. Let f̂1, f̂2, ..., f̂p be the eigenvectors of S22^{-1/2} S21 S11^{-1} S12 S22^{-1/2}, where the first p f̂'s may be obtained from f̂k = (1/ρ̂k*) S22^{-1/2} S21 S11^{-1/2} êk, k = 1, 2, ..., p. Then the kth sample canonical variate pair(2) is

Ûk = êk' S11^{-1/2} x(1),      V̂k = f̂k' S22^{-1/2} x(2)

where x(1) and x(2) are the values of the variables X(1) and X(2) for a particular experimental unit. The quantities ρ̂1*, ρ̂2*, ..., ρ̂p* are the sample canonical correlations.(3)

(2) When the distribution is normal, the maximum likelihood method can be employed using Σ̂ = Sn in place of S. The sample canonical correlations ρ̂k* are, therefore, the maximum likelihood estimates of ρk*, and √(n/(n - 1)) âk, √(n/(n - 1)) b̂k are the maximum likelihood estimates of ak and bk, respectively.
(3) If p > rank(S12) = p1, the nonzero sample canonical correlations are ρ̂1*, ..., ρ̂p1*.
The sample canonical variates have unit sample variances

s_{Ûk,Ûk} = s_{V̂k,V̂k} = 1      (10-30)

and their sample correlations are

r_{Ûk,Ûl} = 0,   r_{V̂k,V̂l} = 0,   r_{Ûk,V̂l} = 0,    k ≠ l      (10-31)

The interpretation of Ûk, V̂k is often aided by computing the sample correlations between the canonical variates and the variables in the sets x(1) and x(2). We define the matrices

Â = [â1, â2, ..., âp]'      B̂ = [b̂1, b̂2, ..., b̂q]'      (10-32)

whose rows are the coefficient vectors for the sample canonical variates.(4) Analogous to (10-17), we have

Û = Â x(1),      V̂ = B̂ x(2)      (10-33)

and we can define

R_{Û,x(1)} = matrix of sample correlations of Û with x(1)
R_{V̂,x(2)} = matrix of sample correlations of V̂ with x(2)
R_{Û,x(2)} = matrix of sample correlations of Û with x(2)
R_{V̂,x(1)} = matrix of sample correlations of V̂ with x(1)

Corresponding to (10-19), we have

R_{Û,x(1)} = Â S11 D11^{-1/2}      R_{V̂,x(2)} = B̂ S22 D22^{-1/2}
                                                              (10-34)
R_{Û,x(2)} = Â S12 D22^{-1/2}      R_{V̂,x(1)} = B̂ S21 D11^{-1/2}

where D11^{-1/2} is the (p x p) diagonal matrix with ith diagonal element (sample var(xi(1)))^{-1/2} and D22^{-1/2} is the (q x q) diagonal matrix with ith diagonal element (sample var(xi(2)))^{-1/2}.

(4) The vectors b̂_{p1+1} = S22^{-1/2} f̂_{p1+1}, b̂_{p1+2} = S22^{-1/2} f̂_{p1+2}, ..., b̂q = S22^{-1/2} f̂q are determined from a choice of the last q - p1 mutually orthogonal eigenvectors f̂ associated with the zero eigenvalue of S22^{-1/2} S21 S11^{-1} S12 S22^{-1/2}.

Comment. If the observations are standardized [see (8-25)], the data matrix becomes
Z = [Z(1) | Z(2)]

and the sample canonical variates become

Û = Âz z(1),      V̂ = B̂z z(2)      (10-35)

where Âz = Â D11^{1/2} and B̂z = B̂ D22^{1/2}. The sample canonical correlations are unaffected by the standardization. The correlations displayed in (10-34) remain unchanged and may be calculated, for standardized observations, by substituting Âz for Â, B̂z for B̂, and R for S. Note that D11^{-1/2} = I (p x p) and D22^{-1/2} = I (q x q) for standardized observations.
=
Example 1 0.4 (Canonical correlation analysis of the chicken-bone data)
In example 9.14, data consisting of bone and skull measurements of white leghorn fowl were described. From this example, the chicken-bone mea surements for > skull length Head (X ) : { xp Xf1 l skull breadth length { xf2l> femur Leg ( x )· X fZ length tibia have the sample correlation matrix =
(I)
=
=
0
R�
t-:;;-tt�J �
=
l-l:.�O��---���2:d i 6� ---'��l .505 i .569
.602
.602
.467 i .926 1 .0
canonical correlation analysis of the head and leg sets of variables using R produces the two canonical correlations and corresponding pairs of variables A
Sec.
1 0.4
The Sample Canonical Variates and Sample Canonical Correlations
V1 = v-1 =
.631
605
.781z�l2l + .345z�l2l .o6od > + . 944d >
and .856d2ll + 1.106z�21 l .057 Vl!_22 == - 2.648d > + 2.475z� > Here zP l , i = 1, 2 and zFl, i = 1, 2 are the standardized data values for sets 1 and 2, respectively. The preceding results were taken from the SAS statis tical software output shown in Panel 10.1. In addition, the correlations of the original variables with the canonical variables are highlighted in that Panel. A
-
•
Example 1 0.5
(Canonical correlation analysis of job satisfaction)
As part of a larger study of the effects of organizational structure on "job sat isfaction," Dunham [4] investigated the extent to which measures of job sat isfaction are related to job characteristics. Using a survey instrument, Dunham obtained measurements of p = 5 job characteristics and q = 7 job satisfaction variables for = 784 executives from the corporate branch of a large retail merchandising corporation. Are measures of job satisfaction associated with job characteristics? The answer may have implications for job design. The2l original job characteristic variables, X (ll and job satisfaction vari ables, x< were respectively defined as n
PAN EL 1 0. 1
SAS ANALYSI S FOR EXAM PLE 10 . 4 U S I N G P ROC CANCORR.
title 'Ca n o nical Co rrelatio n Ana lysis'; d ata sku l l (type corr); _type_ = 'CORR'; i n p ut_na m e_ $ x 1 x2 x3 x4; ca rds; x 1 1 .0 x2 .505 1 .0 x3 .422 1 .0 . 569 x4 .602 . 467 .926 1 .0 =
proc cancorr d ata sku l l vprefix var x 1 x2; with x3 x4; =
=
head wprefix
PROGRAM COMMANDS
=
leg;
(continued)
606
Chap.
10
Canonical Correlation Analysis PAN El l O. l
2
(continued)
I
I
Ca nonical Correlati o n Ana lysis Approx Adj usted Canon ical Canon ica l Sta n d a rd Corre l ation E rror Correlation
I I
0 ..
6310�5.1
· 0.056794
I
0.6282 9 1
S q u a red c a n o n ical
Co rrelation 0 .398268 0 .003226
0.036286 0.060 1 08
tfipiEmtf�r �he 'VAR' V� riabl es i
Raw Ca!l9!lica i
[;l]
J-IEAD1
0.3445068301
2
Raw Canonical
� 4
Coefficients
OUTPUT
HEAD2
0.7807924389
- 0.8559731 84
1 . 1 0618351 45 forthe 'WITH' Va riables
I
LEG2
. LEG 1
-2.648H>6338
··· o:o6o��l)8775 0.943948961
2.47493889 1 3
Canon ica l Struct u re
Correlations Between the 'VAWVa riables and Their Ca nonical Variables X1
Correlations
X2
HEAD1 0.9548 0.7388
H EAD2 - 0.2974 0.6739
(see ( 1 0-3 4 ) )
Between the 'WITH' Va riables and Thei r Canonica l Variables X3 X4
and
LEG 1 0.9343 0.9997
LEG2 - 0.3564 0.0227
(see ( 1 0-3 4 ) )
Oorf�l�tlons !3e�een · · f h� 'VA R ' va ri a bl�s · ·
the Canonical Variables of the 'WITH' Variables X1 X2
�nd �ne p��pqi9al
LE G 1 0.6025 0. 4663
LEG2 - 0 .0 1 69 0.0383
(see ( 1 0-3 4 ) )
· ��rJI!.I)Ie�. o:f.IH�· .�\IA.Fl' Y� ti� bles
Correlations Between the 'WITH' Va riables
X3 X4
HEAD1 0.5897 0.6309
H EAD2 - 0.0202 0.00 1 3
(see ( 1 0-3 4 ) )
Sec.
1 0.4
The Sample Canonical Variates and Sample Canonical Correlations
607
feedback task significance 1 ) task variety x< = task identity autonomy supervisor satisfaction x (2) career-future satisfaction x<2z) 2 financial satisfaction x<3 ) workload satisfaction x (2 ) = Xiz ) 2 ( company identification x ) kind-of-worksatisfaction x<2 ) 2 ( general satisfaction x ) Responses for variables x
1
5
6 7
1.0 i .33 .32 .20 .19 .30 .37 .21 i .30 . 2 1 .16 .08 .27 . 3 5 .20 .49 1.0 .53 .57 1.0 i . 3 1 .23 .14 .07 .24 .37 .18 i .24 .22 .12 .19 .21 .29 .16 .49 .46 .48 1.0 .51 .53 . 5 7 . 5 7 1. 0 ! 1 �.38 .32 . 1 7 .23 .32 .36 .27 ---�3 3---- -.3o---- - 31----�24---- � 3-s- r o--------------------------------------------------. .32 .21 . 2 3 . 2 2 . 3 2 ! . 4 3 1.0 .20 .16 .14 .12 .17 ! .27 .33 1.0 .19 .08 . 07 .19 .23 1 .24 .26 .25 1.0 .30 . 27 . 24 .21 .32 i .34 .54 .46 .28 1.0 .37 .35 .37 .29 .36 ! .37 .32 .29 .30 .35 1.0 .21 . 20 .18 .16 .27 i .40 .58 .45 .27 .59 . 3 1 1.0 The min(p, q) = min(5, 7) = 5 sample canonical correlations and the sample canonical variate coefficient vectors (from Dunham [4]) are displayed in the following table: '
'
'
0\
i
CANO N I CAL VARIATE C O E F F I C I ENTS AN D CAN O N I CAL CO RRELATIO N S
a1' ·
0
a� : a� : � I
a4 :
as' ··
zP>
Standardiz1) ed variab1 les
z�l )
z1
zi >
z �l )
.42 .21 .17 -.02 .44 -.30 .65 .85 -.29 -.81 -.86 .47 -.19 -.49 . 95 .76 -.06 -.12 -1.14 -.25 .27 1.01 -1.04 .1 6 .32
PI
z i2 )
�*
.55 .23 .12 .08 .05
A
b{ : A
b� : A
b� : A
b� : A
b� :
z �2 )
Stand12> ardized2 variables 2) z
zi >
z�
z �2 )
z �2 )
.42 .22 -.03 .01 .29 .52 -.12 .03 -.42 .08 -.91 .14 .59 -.02 .58 -.76 -.41 -.07 .19 -.43 .92 .23 .49 .52 -.47 .34 -.69 -.37 -.52 -.63 .41 .21 .76 .02 .10
Sec.
1 0.4
The Sample Canonical Variates and Sample Canonical Correlations
609
For example, the first sample canonical variate pair is VA 1 = .42z�2> + .22z�2> - .03z�2> + .Olzi2> + .29z �2> + .52z�2> - .12z�2> with sample canonical correlation pfA = .55. According)o the coefficients, U is primarily a feedback and autonomy variable, while V 1 represents supervisor,1 career-future, and kind-of-work satisfaction, along with company identificaJion. A To e_rovide interpretations for U and V 1 , the Asample correlations between U and its component variables1 and between V 1 and its component variables were computed. Also, the following table shows the sample corre lations between variables in one set and the first sample canonical variate of the other set. These correlations can be calculated using (10-34). 1
SAM PLE CORRELATIONS B ETWEEN ORIGI NAL VARIAB LES AN D CAN O N I CAL VARIABLES
x
Sample canonical variates A VA I ul .83 .46 .74 . 4 1 .75 .42 .62 .34 .85 .48
Sample canonical variates A VA I ul .42 . 7 5 .35 .65 .21 .39 .21 .37 .36 .65 .44 .80 .28 .50
X (2) variables 1. Supervisor satisfaction 2. Career-future satisfaction 3. Financial satisfaction 4. Workload satisfaction 5. Company identification 6. Kind-of-work satisfaction 7. General satisfaction
All five job characteristic ':.a. riables have roughly the �arne correlations with the first canonical variate U From this standpoint, U 1 might be inter preted as a job characteristic "index." This differs from the preferred inter pretation, based on coefficients, where the task variables are pot important. The other member of the first canonical variate pair, V seems to be representing, primarily, supervisor satisfaction, career-future satisfaction, comp�ny identification, and kind-of-work satisfaction. As the variables sug gest, V1 might be regarded as a job satisfaction-company identification index. This agrees with the preceding interpretation based on the canonical coeffi cients of the z}2>'s . The sample correlation between the two indices U1 and V1 is iJi = .55. There appears to be some overlap between job characteristics and job satisfaction. We explore this issue further in Example 10.7. • 1•
1,
610
Chap.
10
Canonical Correlation Analysis
Scatter plots of the first ( U1 , l\ ) pair may reveal atypical observations xi requiring further study. If the canq_ nic�l corr�lations M, p;, ... are also moderately large, scatter plots of the pairs ( U2 , V2) , ( U3 , V3) , may also be helpful in this respect. Many analysts suggest plotting "significant" canonical variates against their component variables as an aid in subject-matter interpretation. These plots rein force the correlation coefficients in (10-34). If the sample size is large, it is often desirable to split the sample in half. The first half of the sample can be used to construct and evaluate the sample canoni cal variates and canonical correlations. The results can then be "validated" with the remaining observations. The change (if any) in the nature of the canonical analysis will provide an indication of the sampling variability and the stability of the conclusions. • • •
1 0. 5 ADDITIONAL SAMPLE DESCRIPTIVE MEASURES
If the canonical variates are "good" summaries of their respective sets of variables, then the associations between variables can be described in terms of the canonical variates and their correlations. It is useful to have summary measures of the extent to which the canonical variates account for the variation in their respective sets. It is also useful, on occasion, to calculate the proportion of variance in one set of vari ables explained by the canonical variates of the other set. Matrices of Errors of Approximations
A and B defined in (1Q-32),)et a ang b we can write (10-36) x < 1 J = A - I U '· (q x q) (q x 1 ) (p X 1 ) (q X 1 ) (p Xp) (p X 1 ) I , and (p X p) I ' (q x q) Ul
s1 1
S zz
= ( A - 1 ) ( A - 1 ) ' = a' + a (2) a <2> ' + =
(B -1 ) (B - 1 ) '
=
b
+
b(ZJb(zJ,
...
p; a b
' a
a
'
+ ··· + +
+ ... +
b(qJ b(qJ'
(10-37)
Sec.
1 0.5
Additional Sample Descriptive Measures
61 1
Since x (I) = A - t U and U has sample covariance I, the first r columns of A - 1 contain the sample covariances> of the first r canonical variates 0 , 02, , 0, �ith1 their component variables xp , X�1�, X11 > . �imilarly, the first 1r columns of B contain the sample covariances of V 1 , V2 , , V, with their component variables. If only the first r canonical pairs are used, so that for instance, • • •
• •
.,._,
• • •
x- ( 1 )
_- [ A
: a ( I ) : aA ( 2) ::
and
·.
. :: aA (r) J
[ S� ] :
A
u,
(10-38 )
then S 1 2 is approximated by sample Cov (x ( l> , x <2> ) . Continuing, we see that the matrices of errors of approximation are S l l - (a (l) a ( l ) l
+
a (2) a (2) '
+ ... +
a (r) a (r) ' )
= a (r + l ) a (r + l) '
+ ... +
a (P) a (P) '
s 22 - ( b ( l )b (l )l
+
b (2 )b (2 ) 1
+ ... +
b(r)b(r)l )
=
b (r+ l )b (r + 1 ) 1
+ ... +
b(q) b (q)l
S 1 2 - c rr a ( l> b < 1 > '
+
iJ� a ( 2)b <2 > '
+ ... +
r ; a (r> b <'> ' )
= PA r*+ 1 a (r + l) b (r + 1 )1
+ ... +
PA p* a (p)b( P) I
(10-39 )
The approximation error matrices (10-39) may be interpreted as descriptive summaries of how well the first r sample canonical variates reproduce the sample covariance matrices. Patterns of large entries in the rows and/or columns of the approximation error matrices indicate a poor "fit" to the corresponding variable(s). Ordinarily, the first r variates do a better job of reproducing the elements of S 1 2 = S� 1 than the elements of S1 1 or S 22 . Mathematically, this occurs because the residual matrix in the former case is directly related to the smallest p - r sample canonical correlations. These correlations are usually all close to zero. On the other hand, the residual matrices associated with the approximations to the matrices S 1 1 and S22 depend only on the last p - r and q r coefficient vectors. The elements in these vectors may be relatively large, and hence, the residual matrices can have "large" entries. For standardized observations, R k 1 replaces S k 1 and a �k) , b�l) replace a , b(l) in (10-39) . -
61 2
Chap.
10
Canonical Correlation Analysis
Exam ple 1 0.6
(Calculating matrices of errors of approximation)
In Example 10.4, we obtained the canonical correlations between the two head and the two leg variables for white leghorn fowl. Starting with the sam ple correlation matrix .505 i .569 .602 R� ��;;-f-:;;-] � --'���---!'�u- ��-2 ---- ��-
ll.O
.602
h
.467 : .926
1 .0
l
we obtained the two sets of canonical correlations and variables U1 = .781z\1 l + .345z�l ) vl = .060z\2 ) + .944z�2 )
M = .631
and
�*
= - .856z \1 l + 1 .106z�l ) V2 = - 2.648z\2l + 2.475z�2l A
p 2 = .057
l!._2
where z�l ) , i = 1, 2 and zf2l , i = 1, 2 are the standardized data values for sets 1 and 2, respectively. We first calculate (see Panel 10.1) A.-1 = z
B- 1 z
.345 ] - 1 [ .9548 [ - .781 .856 1.106 .7388 [ .9997 .9343 - .3564 .0227 J
- .2974 .6739
J
Consequently, the matrices of errors of approximation created by using only the first canonical pair are [ - .2974 ] [ - .3564 .0227] ( .057) .6739
[
R1 1 -
sample Cov (z(l l)
.006 - .000 - .014 .001
J
[ -:���: J .2974 [ - .088 - .200 .200 .454 J [-
.6739]
] [- .3564 - sample Cov (z<2l) [- .3564 .0227 .127 - .008 ] [ - .008 .001 Sec.
R 22
1 0 .5
Additional Sample Descriptive Measures
61 3
.0227]
where z ( l ) ' z (Z ) are given by (10-38) with r = 1 and a(l) z replace a
Proportions of Explained Sample Variance
When the observations are standardized, the sample covariance matrices Sk1 are correlatiOJl matrixes Rkt· The canonical ,..coefficieJlt vectors are the rows of the matrices A z and B. and the columns of A� 1 and B; 1 are the sample correlations between the canonical variates and their component variables. Specifically sample Cov (z< l ) , U ) = sample Cov (A� 1 U , U ) = A�1 and
sample Cov (z<2 l, V ) so
[ [
J J
' z 11 1 1 r V' 2 , z 11 1 1 · · · r UP, . . r u2, zi ll • r UP, zi' ' .A-z 1 .. .. . . r uP' z,\ 1 ' r U2 , z,\ 1 1 , z 1121 r V, 2, z 1211 · · · r vP, · · · , z2121 rvP, rV, 2 , z2121 (10-40) 8 -1 .. z ... . ry" z,\21 ri/2, z,\21 r vP' z,\21 where ru z o) and rv z '') are the sample correlation coefficients between the quantities with subscripts. Using (10-37) with standardized observations, we obtain ,.
k
,,
k
ru 1 , z l 1 ' r u " zi l ) : ru z ( l ) rv z ( 2 ' rVI , zi21 p
p
I'
I
Total (standardized) sample variance in first set
= tr (R 1 1 ) = tr (a�1 > a�1 ) , + a�2la�2) , + . . . + a <j'> a <:> , ) = P
(10-41a)
614
Chap.
10
Canonical Correlation Analysis
Total (standardized) sample variance in second set = tr (R 22 ) = tr ( bzo>t;z< 1 >' + b(z2)b(z2l ' + + b(zq ) b(zq) ' ) · · ·
=
q
(10-41b)
Since the correlations in the first r < p columns of A� 1 and :8; 1 involve only the ,... ,... ,... ,... ,... sample canonical variates U1 , U2 , , U, and V1 , V2 , , V,, respectively, we define the contributions of the first r canonical variates to the total (standardized) sample variances as "
• . .
. . •
l tr ( 3 (z1 ) 3 (l) z
""r � ""P r � z(l) + 3 (z2) 3 (z2) ' + . . . + 3 (z') 3 (z') ' ) = � i= l k=
tr ( b�l) b�1 ) '
+ b�2l b�2) ' + . . . + b �') b�'l' ) = ± f r�' . zf>
and
l
u. ,,
i= l k= !
(
k
)
The proportions of total (standardized) sample variances "explained by" the first r canonical variates then become
R z(l > i u, . u, . . . . , u, = 2
proportion of total standardized sample varian,_ce i� first se; explained by U 1 , U2 , . . . , U, 1 tr ( 3 (z ) 3 oz >, + . . . + a� (r) . a� (.r) ' ) r p "" "" £../ � r o,. z ll) i= 1 k= l p k
and
RzI'l- l 2
V1 , V2 , A
A
. . . , V, A
=
(
proportion of total standardized sample varianc� in �econd �et explained by V1 , V2 , . . . , V, tr ( b(zl ) b(zJ ) , + . . . + b(z') b(z') ' )
)
(10-42)
q
Descriptive measures (10-42) provide some indication of how well the canon ical variates represent their respective sets. They provide single-number descrip tions of the matrices of errors. In particular,
Sec.
1 -
p
tr [R 1 1 - aA (l ) aA (l) ' - aA (2 ) aA ( 2) ' z
z
z
z
• • •
1 0. 6
Large-Sample Inferences
- aA (r) aA (r) ' ] z
z
1 - R2t' ' l . z
61 5
u, , u, . . . . , u, •
•
according to (10-41) and (10-42). Example 1 0.7
(Calculating proportions of sample variance explained by canonical variates)
Consider the job characteristic-job satisfaction data discussed in Example 10.5. Using the table of sample correlation coefficients presented in that example, we find that 1 5 1 2 2 2 ro , , z\' ' = 5 [ (.83) + (.74) + . . . + (.85) ] R;o,ID, = 5
�
.58
1 7 1 2 2 2 = ' R;<' lv, 7 2: r�, z
The first sample canonical variate U1 of the job characteristics set accounts f9r 58% of the set's total sample variance. The first sample canonical variate V1 of the job satisfaction s�t explains 37% of the set's total sample variap.ce. We might thus infer that U1 is a "better" representative of its set t�an V1 is of its set. The interested reader may wish to see how well U 1 and V 1 repro duce the correlation matrices R 1 1 and Rzz , respectively. [See (10-39).] • 1 0.6 LARGE SAMPLE I NFERENCES
When l: 1 2 = 0, a ' X ( ! ) and b ' X ( 2 ) have covariance a ' l: 1 2 b = 0 for all vectors a and b. Consequently, all the canonical correlations must be zero, and there is no point in pursuing a canonical correlation analysis. The next result provides a way of test ing l: 1 2 = 0, for large samples. Result 1 0.3. Let
xj =
[-�f�:-J
be a random sample from an NP
I
+q
j
= 1 , 2,
... , n
(p.,, I ) population with
61 6
Chap.
10
Canonical Correlation Analysis
Then the likelihood ratio test of H0 : I 12
= 0 versus H1 : I1 2 �X�
for large values of
-=!=
0 rejects H0
�X�
(10-43)
s = tt�+�;;-]
where
pq
is the unbiased estimator of I. For large n, the test statistic (10-43) is approxi mately distributed as a chi-square random variable with d.f. See Kshirsagar [8].
Proof.
•
The likelihood ratio statistic (10-43) compares the sample generalized vari ance under H0, namely,
q
! (p
with the unrestricted generalized variance I S 1 . Bartlett [3] suggests replacing the multiplicative factor n in the likelihood ratio statistic with the factor n - 1 + + 1) to improve the x 2 approximation to the sampling distribution of -2 I n A. Thus, for n and n + q) large, we Reject H0 : I 12 -
(
= O (p{ = Pi =
n -
1 -
···
- (p
= p: = 0) at significance level a if
� (p + q + 1 ) ) In JJ (1 - '/J;*2 ) > x;q (a)
pq
(10-44)
where x; q (a) is the upper (100a)th percentile of a chi-square distribution with d.f. If the null hypothesis H0 : I 12 = 0 (p{ = Pi = · · · = p: = 0) is rejected, it is natural to examine the "significance" of the individual canonical correlations. Since the canonical correlations are ordered from the largest to the smallest, we can begin by assuming that the first canonical correlation is nonzero and the remaining - 1 canonical correlations are zero. If this hypothesis is rejected, we assume that the first two canonical correlations are nonzero, but the remaining - 2 canonical correlations are zero, and so forth. Let the implied sequence of hypotheses be
p
p
Hg : p{ H� : p/
-=/= -=/=
0, Pi
-=/=
..
0, . , p:
0 for some i
�
-=/=
0, p:
k+1
+1
=
···
= p1� = 0
(10-45 )
Sec.
1 0.6
Large-Sample Inferences
61 7
Bartlett [2] has argued that the kth hypothesis in (10-45) can be tested by the like lihood ratio criterion. Specifically,
- ( n - 1 - .!2 (p + q + 1) ) In i=k+ IT I (1 - p;*2 ) > Xfr k)(q - k) ( a )
Reject H�"l at significance level a if
(pXfp - k) (q - k) ( a)
(10-46)
where is the upper (100a)th percentile of a chi-square distribution with - k) (q - k) d.f. We point out that the test statistic in (10-46) involves p II (1 - P;* 2 ) , the "residual" after the first k sample canonical correlations have
i=k+
I
been removed from the total criterion A
H<�k)
p
Z/n = II i= ( 1 - p;�z ) .
l
If the members of the sequence H0 , H&1 l , H&2 l , and so forth, are tested one at a time until is not rejected for some k, the overall significance level is not a and, in fact, would be difficult to determine. Another defect of this procedure is the tendency it induces to conclude that a null hypothesis is correct simply because it is not rejected. To summarize, the overall test of significance in Result 10.3 is useful for mul tivariate normal data. The sequential tests implied by (10-46) should be interpreted with caution and are, perhaps, best regarded as rough guides for selecting the num ber of important caqonical variates. Example 1 0.8
(Testing the significance of the canonical correlations for the job satisfaction data)
Test the significance of the canonical correlations exhibited by the job char acteristics-job satisfaction data introduced in Example 10.5. All the test statistics of immediate interest are summarized in the table on page 618. From Example 10.5, = 784, p = 5, q = 7, fJ: = .55, A * = · 05 · PA z* = · 23 , PA 3* = · 12 , PA 4* = ·08 , and Ps Assuming multivariate normal data, we find that the first two canonical correlations, p: and p; , appear to be nonzero, although with the very large sample size, small deviations from zero will show up as statistically significant. From a practical point of view, the second (and subsequent) sample canoni cal correlations can probably be ignored, since (1) they are reasonably small in magnitude and (2) the corresponding canonical variates explain v ery little • of the sample variation in the variable sets X ( ! ) and x (z) .
n
p
The distribution theory associated with the sample canonical correlations and the sample canonical variate coefficients is extremely complex (apart from the = 1 and q = 1 situations), even in the null case, :11 2 = 0. The reader interested in the distribution theory is referred to Kshirsagar [8].
0'1
...
00
TEST RESULTS

1. Null hypothesis H0: Σ12 = 0 (ρ1* = ... = ρ5* = 0).
   Observed test statistic (Bartlett correction):
   -(n - 1 - (1/2)(p + q + 1)) ln Π_{i=1}^{5} (1 - ρ̂i*^2) = (784 - 1 - (1/2)(5 + 7 + 1)) x [-ln(.6453)] = 340.1
   Degrees of freedom: pq = 5(7) = 35;  upper 1% point of the chi-square distribution: χ²_35(.01) = 57.
   Conclusion: Reject H0.

2. Null hypothesis H0(1): ρ1* ≠ 0, ρ2* = ... = ρ5* = 0.
   Observed test statistic: -(n - 1 - (1/2)(p + q + 1)) ln Π_{i=2}^{5} (1 - ρ̂i*^2) = 60.4
   Degrees of freedom: (p - 1)(q - 1) = 24;  upper 1% point: χ²_24(.01) = 42.98.
   Conclusion: Reject H0.

3. Null hypothesis H0(2): ρ1* ≠ 0, ρ2* ≠ 0, ρ3* = ρ4* = ρ5* = 0.
   Observed test statistic: -(n - 1 - (1/2)(p + q + 1)) ln Π_{i=3}^{5} (1 - ρ̂i*^2) = 18.2
   Degrees of freedom: (p - 2)(q - 2) = 15;  upper 1% point: χ²_15(.01) = 30.58.
   Conclusion: Do not reject H0.
=
=
H0 •
Chap. 1 0
Exercises
61 9
EXERCISES
10.1. Consider the covariance matrix given in Example 10.3:
Verify that the first pair of canonical variates are u = X� 1 >, VI = xf> with l canonical correlation p: = .95. 10.2. The (2 X 1) random vectors x< 1 > and x<2> have the joint mean vector and joint covariance matrix
(a) Calculate the canonical correlations p: , Pi . (b) Determine the canonical variate pairs ( U1 , V1 ) and ( U2 , V2 ) . (c) Let = [ U1 , U2 ] ' and V = [ V1 , V2 ] ' . From first principles, evaluate
U
Compare your results with the properties in Result 10. 1 . 10.3. Let z( l l = V!?f2 ( X< 1 l - p.(ll ) and Z(2 ) = V;]I2 ( X<2 > - p.<2 > ) be two sets of standardized variables. If p: , Pi , . . , p; are the canonical correlations for the X(!), x<2> sets and ( U;, V; ) = ( a; x ) , i = 1, 2, . . . , p, are the associated canonical variates, determine the canonical correlations and canonical variates for the z(ll, z<2> sets. That is, express the canonical cor relations and canonical variate coefficient vectors for the z(ll, Z(2 ) sets in terms of those for the x(l>, x<2> sets. 10.4. (Alternative calculation of canonical correlations and variates.) Show that, if A; is an eigenvalue of I i/12 I 1 2 I 2:i I 2 1 I !l/2 with associated eigenvector e;, A; is also an eigenvalue of I !l i 1 2 I ;] I 2 1 with eigenvector I il12 e;. .
620
Chap. 1 0
Canonical Correlation Analysis
0 = I I 1l f2 I I I 1l f2 I 1 2 I2i i 2 t i!l12 - A ii i i i g2 1 I I !l i1 2 I2i i 2 1 - A il I 10.5. Use the information in Example 10.1. (a) Find the eigenvalues of I!l I 1 2 I221 I 2 1 and verify that these eigenvalues are the same as the eigenvalues of I !l 12 I 1 2 I221 I2 1 I !l 12 • (b) Determine the second pair of canonical variates ( U2 , V2 ) and verify, from first principles, that their correlation is the second canonical cor relation Pt = .03. 10.6. Show that the canonical correlations are invariant under nonsingular linear transformations of the x< l J, x ( 2) variables of the form c x< l) and (p X p) (p X 1 ) 2 D x< > . (q X q) ( q x l )
�!t_z!?�-J Consider any ( ����!-] ) [-DI��U��_j _ . C DI 22 D ' 21 '
Hint Consider Cov l x(2) n ·
=
i
linear combination a{ ( CX( l l ) = a ' x < t ) with a ' = a{ C. Similarly, consider b{ ( DX<2> ) = b' X<2> with b' = b{ D. The choices a{ = e ' I 1F2 C - 1 and b{ = f' I2F2 D - 1 give the maximum correlation.
10.7. Let p1 2 =
[ � � J and p1 1 = p22 = [ � �]. corresponding to the equal
correlation structure where x < t > and x <2> each have two components. (a) Determine the canonical variates corresponding to the nonzero canoni cal correlation. (b) Generalize the results in Part a to the case where x< 1 > has p components and x <2) has q � p components. Hint: p1 2 = p ll ' , where 1 is a (p X 1) column vector of 1 ' s and 1' is a (q X 1) row vector of 1 ' s. Note that p 1 1 = [1 + (p - 1 ) p] 1 so P1F2 1 = [1 + (p - 1 ) pr l/2 1. 10.8. (Correlation for angular measurements.) Some observations, such as wind direction, are in the form of angles. An angle 82 can be represented as the pair X (2) = [cos ( 82 ) , sin ( 82 ) ] ' . "- (a) Show that b' X <2 > = v'bf=--+ b---:�:: cos ( 82 ' - {3) where b 1 / v'bf + b� cos (f3) and b2 / v'bf + bi_ = sin ( f3) . Hint: cos ( 82 - {3) = cos ( 82 ) cos ( f3) + sin ( 82 )sin ( f3) . (b) Let X ( l) have a single component xp> . Show that the single canonical cor relation is p: = max Corr (Xf!> , cos ( 82 - {3) ) . Selecting the canonical f3 variable V1 amounts to selecting a new origin {3 for the angle 82 • (See Johnson and Wehrly [7].) 1
Chap. 1 0
Exercises
621
(c) Let xp l be ozone (in parts per million) and ()2 = wind direction mea sured from the north. Nineteen observations made in downtown Mil waukee, Wisconsin, give the sample correlation matrix
]
[
ozone cos ( 02 ) sin ( 02 ) .694 1.0 ! .166 --------�-------------------.166 l 1.0 - .051 .694 i - .051 1 .0
Find the sample canonical c�rrelation Pi and the canonical variate £\ representing the new origin f3 . (d) Suppose x
a ) , cos ( 02 - /3 ) )
(e) Twenty-one observations on the 6:00 A.M. and noon wind directions give the correlation matrix
R =
l l .O
]
cos ( 01 ) sin ( 0 1 ) cos ( 02 ) sin ( 02 ) - .291 l .440 .372 - .291 1.0 ! - .205 .243 � .181 .440 - .205 l 1.0 .372 .243 i .181 1.0
-------------------
----------------------
Find the sample canonical correlation p{ and U
1 ,
V
1•
The following exercises may require a computer. 10.9. H. Hotelling [5] reports that n = 140 seventh-grade children received four
] [ � t�;+��:J � - '��- - �� �ss:d -�'�?�- - - '��
tests on Xf!l = reading speed, X�!) = reading power, Xfl = arithmetic speed, and X�2l = arithmetic power. The correlations for performance are 1.0
R
.0586
.6328 i
.2412
.0655 !
.4248 1 .0
.0586
(a) Find all the sample canonical correlations and the sample canonical variates. (b) Stating any assumptions you make, test the hypotheses
Ho : :I 1 2 = P1 2 = 0 Ht : :I12 = P1 2 * 0
( p{ = Pt = 0 )
622
Chap. 1 0
Canon ical Correlation Ana lysis
at the
a=
.05 level of significance. If H0 is rejected, test
with a significance level of a = .05. Does reading ability (as measured by the two tests) correlate with arithmetic ability (as measured by the two tests)? Discuss. (c) Evaluate the matrices of approximation errors for R 1 1 , R22 , and R 1 2 determined by the first sample canonical variate pair U 1 , V 1 • 10.10. In a study of poverty, crime, and deterrence, Parker and Smith 10 report certain summary crime statistics in various states for the years 1 970 and 1973. A portion of their sample correlation matrix is A
A
[ ]
The variables are
xp> = 1973 nonprimary homicides x�t ) = 1 973 primary homicides (homicides involving family or acquaintances) XfZl = 1970 severity of punishment (median months served) X�2l = 1970 certainty of punishment (number of admissions to prison divided by number of homicides)
(a) Find the sample canonical correlations. (b) Determine the first canonical pair 0 1 , V and interpret these quantities. 10.11. Example 8.5 presents the correlation matrix obtained from n = 100 succes sive weekly rates of return for five stocks. Perform a canonical correlation analysis with X ( l ) = 2 3 ' the rates of return for the chemical companies, and X (2 = [ , the rates of return for the oil companies. 10.12. A random sample of n = 70 families will be surveyed to determine the asso ciation between certain "demographic" variables and certain "consump tion" variables. 1
(!) (l) (l) ) [XxfZ>X, X�2lX] ' ]' I
'
'
Chap. 1 0
Let: Criterion set Predictor set
Exercises
= annual frequency of dining at a restaurant { xp> Xi 1 > = annual frequency of attending movies
{
X[2 > x?> X12 >
623
= age of head of household = annual family income = educational level of head of household
Suppose 70 observations on the preceding variables give the sample corre lation matrix
R12J R = tR11 Rzl j· R22 -------
I1 2
-------
!
I I I I
1.0
---·�Q �_:Q ___
.26
.67
p12
.34
__
t-
___________________
.33 ! 1 .0 .59 l .37 .34 i .21
1 .0 .35 1 .0
(a) Determine the sample canonical correlations, and test the hypothesis H0: = 0 (or, equivalently, = 0) at the a = .05 level. If H0 is rejected, test for the significance (a = .05) of the first canonical correlation. (b) Using standardized variables, construct the canonical variates corre sponding to the "significant" canonical correlation(s). (c) Using the results in Parts a and b, prepare a table showing the canonical variate coefficients (for "significant" canonical correlations) and the sam ple correlations of the canonical variates with their component variables. (d) Given the information iii (c), interpret the canonical variates. (e) Do the demographic variables have something to say about the con sumption variables? Do the consumption variables provide much infor mation about the demographic variables? 10.13. Waugh [12] provides information about n = 138 samples of Canadian hard red spring wheat and the flour made from the samples. The p = 5 wheat measurements (in standardized form) were
= = z11> = d> = z� > = zfi> z� 1 >
The q
1 1
kernel texture test weight damaged kernels foreign material crude protein in the wheat
= 4 (standardized) flour measurements were zf> = wheat per barrel of flour
624
Chap . 1 0
Canonical Correlation Ana lysis
d2 ) zf>
= ash in flour = crude protein in flour zf> = glpten quality index The sample correlation matrix was R
=
1.0 .754 1 .0 - .690 - .712 1.0 .323 1.0 - .446 - .515 - .334- 1 .0 - .444 -----.692 -- .412 ------
�s27 -- =-.3s3_ r _i�o-- - ---------------------------- .479 - .419 .361 .461 - .505 i .251 1.0 .780 .542 - .546 - .393 .737 i - .490 - .434 1 .0 - .152 - .102 .172 - .019 - .148 i .250 - .079 - .163 1.0
-= .6os --= �722
.737
(a) Find the sample canonical variates corresponding to significant ( at the a = .01 level) canonical correlations. (b) Interpret the first sample canonical variates U 1 , V 1 • Do they in some sense represent the overall quality of the wheat and flour, respectively? (c) What proportion of the total sample variance of the first set z(l> is explained by the canonical variate U 1 ? What proportion of the to;al sample variance of the z<2> set is explained by the canonical variate V1 ? Discuss your answers. 10.14. Consider the correlation matrix of profitability measures given in Exercise 9.15. Let x< t > = [xp> , X�l) , . . . , X� I) ] ' be the vector of variables represent ing accounting measures of profitability, and let X(2) = [XfZ> , X�2 > ] ' be the vector of variables representing the two market measures of profitability. Partition the sample correlation matrix accordingly, and perform a canonical correlation analysis. Specifically: (a) Determine the first sample canonical va'riates U V and their correla tion. Interpret these canonical variates. (b) Let z(l> and Z(2) be the sets of standardized variables corresponding to x(l> and x<2>, respectively. What proportion 9_f the total sample variance of z< t > is explained by the canonical variate U1 ? What proportion of the total sample variance of z(2) is explained by the canonical variate V 1 ? Discuss your answers. 10.15. Observations on four measures of stiffness are given in Table 4.3 and dis cussed in Example 4.14. Use the data in the table to construct the sample A
A
A
1,
A
1
Chap. 1 0
Exercises
625
covariance matrix S. Let X(ll = [Xf! l , X� 1 l ] ' be the vector of variables rep resenting the dynamic measures of stiffness ( shock wave, vibration ) , and let x<2 l = [Xf2l , Xfl ] ' be the vector of variables representing the static mea sures of stiffness. Perform a canonical correlation analysis of these data. 10.16. Andrews and Herzberg [1] give data obtained from a study of a comparison of nondiabetic and diabetic patients. Three primary variables,
= glucose intolerance x�l) = insulin response to oral glucose x�l) = insulin resistance
X}ll
and two secondary variables,
= relative weight = fasting plasma glucose were measured. The data for n = 46 nondiabetic patients yield the covari XFl Xfl
ance matrix
1 106.000 396.700 i 108.400 .787 26.230 396.700 2382.000 i 1 143.000 - .214 - 23.960
___ !Q���QQ___ g�?..�Q9Q_l_?!_�?�9_Q9____ ���?2 ___=�9�?�Q_ .787 - .214 ! 2.189 26.230 - 23.960 i - 20.840
.016 .216
.216 70.560
Determine the sample canonical variates and their correlations. Interpret these quantities. Are the first canonical variates good summary measures of their respective sets of variables? Explain. Test for the significance of the canonical relations with a = .05. 10.17. Data concerning a person's desire to smoke and psychological and physical state were collected for n = 110 subj ects. The data were responses, coded 1 to 5, to each of 12 questions (variables ) . The four standardized measure ments related to the desire to smoke are defined as
z}l l z2 l d ll
zi1 l
= = = =
smoking 1 ( first wording ) smoking 2 ( second wording ) smoking 3 ( third wording ) smoking 4 ( fourth wording )
The eight standardized measurements related to the psychological and phys ical s rate are given by
zf2 l
= concentration
626
Chap. 1 0
Ca nonical Correlation Analysis
zi2l d2 l zi2 l
d2l z�2) d2 l z�2l
= = = = = = =
annoyance sleepiness tenseness alertness irritability tiredness contentedness
t-:;�-i-��-;j
The correlation matrix constructed from the data is
where Rl l
R12
R 22
=
Ll �
=
.785 .810 .775 .785 1.000 .816 .813 .810 .816 1.000 .845 .775 .813 .845 1.000
= R� I =
=
R
l
.086 .200 .041 .228
.144 .119 .060 .122
.140 .211 .126 .277
]
.222 .301 .120 .214
.101 .223 .039 .201
.189 .221 .108 . 156
.199 .274 .139 .271
239 .235 .100 .171
]
1.000 .562 .457 .579 .802 .595 .512 .492 .562 1.000 .360 .705 .578 .796 .413 .739 .457 .360 1.000 .273 .606 .337 .798 .240 .579 .705 .273 1 .000 .594 .725 .364 .711 .802 .578 .606 .594 1 .000 .605 .698 .605 .595 .796 .337 .725 .605 1.000 .428 .697 .512 .413 .798 .364 .698 .428 1 .000 .394 .492 .739 .240 .71 1 .605 .697 .394 1 .000
Determine the sample canonical variates and their correlations. Interpret these quantities. Are the first canonical variates good summary measures of their respective sets of variables? Explain.
Chap. 1 0
References
627
REFERENCES
1. Andrews, D. F., and A. M. Herzberg. Data. New York: Springer-Verlag, 1985.
2. Bartlett, M. S. "Further Aspects of the Theory of Multiple Regression." Proceedings of the Cambridge Philosophical Society, 34 (1938), 33-40.
3. Bartlett, M. S. "A Note on Tests of Significance in Multivariate Analysis." Proceedings of the Cambridge Philosophical Society, 35 (1939), 180-185.
4. Dunham, R. B. "Reaction to Job Characteristics: Moderating Effects of the Organization." Academy of Management Journal, 20, no. 1 (1977), 42-65.
5. Hotelling, H. "The Most Predictable Criterion." Journal of Educational Psychology, 26 (1935), 139-142.
6. Hotelling, H. "Relations Between Two Sets of Variates." Biometrika, 28 (1936), 321-377.
7. Johnson, R. A., and T. Wehrly. "Measures and Models for Angular Correlation and Angular-Linear Correlation." Journal of the Royal Statistical Society (B), 39 (1977), 222-229.
8. Kshirsagar, A. M. Multivariate Analysis. New York: Marcel Dekker, Inc., 1972.
9. Lawley, D. N. "Tests of Significance in Canonical Analysis." Biometrika, 46 (1959), 59-66.
10. Parker, R. N., and M. D. Smith. "Deterrence, Poverty, and Type of Homicide." American Journal of Sociology, 85 (1979), 614-624.
11. Rencher, A. C. "Interpretation of Canonical Discriminant Functions, Canonical Variates and Principal Components." The American Statistician, 46 (1992), 217-225.
12. Waugh, F. W. "Regression Between Sets of Variates." Econometrica, 10 (1942), 290-310.
10
CHAPTER
11
Discrimination and Classification 1 1 . 1 INTRODUCTION
Discrimination and classification are multivariate techniques concerned with sepa rating distinct sets of objects (or observations) and with allocating new objects (observations) to previously defined groups. Discriminant analysis is rather exploratory in nature. As a separative procedure, it is often employed on a one time basis in order to investigate observed differences when causal relationships are not well understood. Classification procedures are less exploratory in the sense that they lead to well-defined rules, which can be used for assigning new objects. Classification ordinarily requires more problem structure than discrimination does. Thus, the immediate goals of discrimination and classification, respectively, are as follows. Goal 1 . T o describe, either graphically (in three o r fewer dimensions) or alge braically, the differential features of objects (observations) from sev eral known collections (populations). We try to find "discriminants" whose numerical values are such that the collections are separated as much as possible. Goal 2. To sort objects (observations) into two or more labeled classes. The emphasis is on deriving a rule that can be used to optimally assign new objects to the labeled classes. We shall follow convention and use the term discrimination to refer to Goal 1. This terminology was introduced by R. A. Fisher [9] in the first modern treat ment of separative problems. A more descriptive term for this goal, however, is separation. We shall refer to the second goal as classification or allocation. A function that separates objects may sometimes serve as an allocator, and, conversely, a rule that allocates objects may suggest a discriminatory procedure. In practice, Goals 1 and 2 frequently overlap, and the distinction between separation and allocation becomes blurred. 629
630
Chap. 1 1
Discri m i n ation and Classification
1 1 .2 SEPARATION AND CLASSIFICATION FOR TWO POPULATIONS
To fix ideas, let us list situations in which one may be interested in (1) separating two classes of objects or (2) assigning a new object to one of two classes (or both). It is convenient to label the classes π1 and π2. The objects are ordinarily separated or classified on the basis of measurements on, for instance, p associated random variables X' = [X1, X2, ..., Xp]. The observed values of X differ to some extent from one class to the other.(1) We can think of the totality of values from the first class as being the population of x values for π1 and those from the second class as the population of x values for π2. These two populations can then be described by probability density functions f1(x) and f2(x), and consequently, we can talk of assigning observations to populations or objects to classes interchangeably.

You may recall that some of the examples of the following separation-classification situations were introduced in Chapter 1.

Populations π1 and π2 and measured variables X:

1. Solvent and distressed property-liability insurance companies. Variables: total assets, cost of stocks and bonds, market value of stocks and bonds, loss expenses, surplus, amount of premiums written.
2. Nonulcer dyspeptics (those with upset stomach problems) and controls ("normal"). Variables: measures of anxiety, dependence, guilt, perfectionism.
3. Federalist Papers written by James Madison and those written by Alexander Hamilton. Variables: frequencies of different words and lengths of sentences.
4. Two species of chickweed. Variables: sepal and petal length, petal cleft depth, bract length, scarious tip length, pollen diameter.
5. Purchasers of a new product and laggards (those "slow" to purchase). Variables: education, income, family size, amount of previous brand switching.
6. Successful or unsuccessful (fail to graduate) college students. Variables: entrance examination scores, high school grade point average, number of high school activities.
7. Males and females. Variables: anthropological measurements, like circumference and volume, on ancient skulls.
8. Good and poor credit risks. Variables: income, age, number of credit cards, family size.
9. Alcoholics and nonalcoholics. Variables: activity of monoamine oxidase enzyme, activity of adenylate cyclase enzyme.
(1) If the values of X were not very different for objects in π1 and π2, there would be no problem; that is, the classes would be indistinguishable, and new objects could be assigned to either class indiscriminately.
We see from item 5, for example, that objects (consumers) are to be separated into two labeled classes ("purchasers" and "laggards") on the basis of observed values of presumably relevant variables (education, income, and so forth). In the terminology of observation and population, we want to identify an observation of the form x' = [x1 (education), x2 (income), x3 (family size), x4 (amount of brand switching)] as population π1, purchasers, or population π2, laggards.

At this point, we shall concentrate on classification for two populations, returning to separation in Section 11.5.

Allocation or classification rules are usually developed from "learning" samples. Measured characteristics of randomly selected objects known to come from each of the two populations are examined for differences. Essentially, the set of all possible sample outcomes is divided into two regions, R1 and R2, such that if a new observation falls in R1, it is allocated to population π1, and if it falls in R2, we allocate it to population π2. Thus, one set of observed values favors π1, while the other set of values favors π2.

You may wonder at this point how it is we know that some observations belong to a particular population, but we are unsure about others. (This, of course, is what makes classification a problem!) Several conditions can give rise to this apparent anomaly (see [19]):

1. Incomplete knowledge of future performance.
   Examples: In the past, extreme values of certain financial variables were observed 2 years prior to a firm's subsequent bankruptcy. Classifying another firm as sound or distressed on the basis of observed values of these leading indicators may allow the officers to take corrective action, if necessary, before it is too late.
   A medical school applications office might want to classify an applicant as likely to become M.D. or unlikely to become M.D. on the basis of test scores and other college records. Here the actual determination can be made only at the end of several years of training.

2. "Perfect" information requires destroying the object.
   Example: The lifetime of a calculator battery is determined by using it until it fails, and the strength of a piece of lumber is obtained by loading it until it breaks. Failed products cannot be sold. One would like to classify products as good or bad (not meeting specifications) on the basis of certain preliminary measurements.

3. Unavailable or expensive information.
   Examples: It is assumed that certain of the Federalist Papers were written by James Madison or Alexander Hamilton because they signed them. Others of the Papers, however, were unsigned and it is of interest to determine which of the two men wrote the unsigned Papers. Clearly, we cannot ask them. Word frequencies and sentence lengths may help classify the disputed Papers.
   Many medical problems can be identified conclusively only by conducting an expensive operation. Usually, one would like to diagnose an illness from
Chap. 1 1
Discri m i n ation and Classification
easily observed, yet potentially fallible, external symptoms. This approach helps avoid needless-and expensive-operations. It should be clear from these examples that classification rules cannot usually provide an error-free method of assignment. This is because there may not be a clear distinction between the measured characteristics of the populations; that is, the groups may overlap. It is then possible, for example, to incorrectly classify a 7Tz object as belonging to 7T 1 or a 7T 1 object as belonging to 7T2 • Example 1 1 . 1
(Discriminating owners from nonowners of riding mowers)
Consider two groups in a city: 7T1 , riding-mower owners, and 7Tz , those with out riding mowers-that is, nonowners. In order to identify the best sales prospects for an intensive sales campaign, a riding-mower manufacturer is interested in classifying families as prospective owners or nonowners on the basis of x 1 = income and x2 = lot size. Random samples of n 1 = 12 current owners and n2 = 12 current nonowners yield the values in Table 11.1. TABLE 1 1 . 1 7T 1 :
7T2 : Nonowners
Riding-mower owners
x i (Incorp.e in $1000s)
x2 (Lot size2 in 1000 ft )
xi (Income in $1000s)
x2 (Lot size in 1000 ttl)
60.0 85.5 64.8 61.5 87.0 110.1 108.0 82.8 69.0 93.0 51.0 81.0
18.4 16.8 21.6 20.8 23.6 19.2 17.6 22.4 20.0 20.8 22.0 20.0
75.0 52.8 64.8 43.2 84.0 49.2 59.4 66.0 47.4 33.0 51.0 63.0
19.6 20.8 17.2 20.4 17.6 17.6 16.0 18.4 16.4 18.8 14.0 14.8
These data are plotted in figure 11.1. We see that riding-mower own ers tend to pave larger income� and bigger lots than nonowners, although income seems tp be a better "discriminator" than lot size. On the other hand, there is some overlap between the two groups. If, for example, we were to allocate those valqes of (xi , x2 ) that fall into region R I (as determined by the solid line in the figure) to 7T 1 , mower owners, and those (xi , x2 ) values which fall into R2 to 7Tz , nonowners, we would make some mistakes. Some riding-
Sec. 1 1 . 2
Separation and Classification for Two Popu lations
633
xz 24 ti � 1:! "'
;:l 0" "' """ 0 "' "0 c: "' "' ;:l 0
•
0
16
0
£ .s
OJ N ' Vi
0
..J
o •
60
Riding-mower owners Nonowners
90
1 20
Income in thousands of dollars
Figure 1 1 . 1
Income and lot size for riding-mower owners and nonowners.
mower owners would be incorrectly classified as nonowners and, conversely, some nonowners as owners. The idea is to create a rule (regions R 1 and R2 ) that minimizes the chances of making these mistakes. (See Exercise 11.2.) • A good classification procedure should result in few misclassifications. In other words, the chances, or probabilities, of misclassification should be small. As we shall see, there are additional features that an "optimal" classification rule should possess. It may be that one class or population has a greater likelihood of occurrence than another because one of the two populations is relatively much larger than the other. For example, there tend to be more financially sound firms than bankrupt firms. As another example, one species of chickweed may be more prevalent than another. An optimal classification rule should take these "prior probabilities of occutrence" into account. If we really believe that the (prior) probability of a finan cially distressed and ultimately bankrupted firm is very small, then one should clas sify a randomly selected firm as nonbankrupt unless the data overwhelmingly favors bankruptcy. Another aspect of classification is cost. Suppose that classifying a 1r 1 object as belonging to 1r2 represents a more serious error than classifying a 7Tz object as belonging to 1r 1 • Then one should be cautious about making the former assign ment. As an example, failing to. diagnose a potentially fatal illness is substantially more "costly" than concluding that the disease is present when, in fact, it is not. An optimal classification procedure should, whenever possible, account for the costs associated with misclassification.
634
Chap. 1 1
Discri m i nation and Classification
Let f1 (x) and f2 (x) be the probability density functions associated with the 1 vector random variable X for the populations 1r1 and 1r2 , respectively. An object with associated measurements x must be assigned to either 1r1 or 1r2 . Let !1 be the sample space-that is, the collection of all possible observations x. Let R1 be that set of x values for which we classify objects as 1r1 and R 2 = n R 1 be the remaining x values for which we classify objects as 1r2 . Since every object must be assigned to one and only one of the two populations, the sets R1 and R2 are mutu ally exclusive and exhaustive. For p = 2, we might have a case like the one pic tured in Figure 11.2.
p
X
-
Figure 1 1 .2
Classification regions for two populations .
The conditional probability, P(2 1 1 ) , of classifying an object as fact, it is from 1r1 is P (2 1 1) = P (X
E
R2 l 1r1 ) =
J
R2 = !.l - R 1
1r2 when, in
/1 (x) dx
Similarly, the conditional probability, P (2 1 1 ) , of classifying an object as it is really from 1r2 is P ( 1 1 2)
= P (X R1 l 1r2 ) = E
J /2 (x) dx R,
(11-1)
1r1 when (11-2)
The integral sign in (11-1) represents the volume formed by the density function
f1 (x) over the region R2• Similarly, the integral sign in (11-2) represents the vol ume formed by f2 (x) over the region R1 • This is illustrated in Figure 11.3 for the univariate case, p = 1. Let p 1 be the prior probability of 1r1 and p 2 be the prior probability of 1r2 , where p 1 + p2 = 1 . Then the overall probabilities of correctly or incorrectly clas
sifying objects can be derived as the product of the prior and conditional classifi cation probabilities:
Sec . 1 1 .2
j
P( l i 2) = f2 (x) dx
Classify as
Figure 1 1 . 3
635
Separation and Classification for Two Popu lations
P(2 i l l =
1t 1
jt1 (x ) dx
Rz
Classify as
n2
Misclassification probabilities for hypothetical classification regions
when p = 1 .
P (observation is correctly classified as 1r1 ) = P (observation comes from '7T1 and is correctly classified as '7T 1 ) P (observation is misclassified as 1r 1 ) = P (observation comes from 1r2 and is misclassified as 1r1 ) P (observation is correctly classified as 1r2 )
= P (observation comes from '7T2
and is correctly classified as '7T2 )
P ( observation is misclassified as 1r2 ) = P ( observation comes from 1r1 and is misclassified as '7T2 )
= P (X
E
R2 1 '7T 1 )P ( '7T 1 ) = P (2 1 1)p 1 (1 1-3)
Classification schemes are often evaluated in terms of their misclassification probabilities (see Section 11.4), but this ignores misclassification cost. For exam ple, even a seemingly small probability such as .06 = P (2 1 1 ) may be too large if the cost of making an incorrect assignment to '7Tz is extremely high. A rule that ignores costs may cause problems. The costs of misclassification can be defined by a cost matrix:
True population:
Classify as: '7Tz c (2 1 1 ) 0
(11-4)
636
Chap. 1 1
Discrimination and Classification
The costs are (1) zero for correct classification, (2) c (1 j 2) when an observation from 'TT'z is incorrectly classified as 'TT' p and (3) c(2 j 1) when a 7T 1 observation is incorrectly classified as 2• For any rule, the average, or expected cost of misclassification (ECM) is pro vided by multiplying the off-diagonal entries in (11-4) by their probabilities of occurrence, obtained from (11-3). Consequently, 7r
ECM
= c (2 j 1) P(2 j 1)p 1 + c (1 j 2) P(1 j 2)p2
(11-5)
A reasonable classification rule should have an ECM as small, or nearly as small, as possible. Result 1 1 . 1 . The regions R 1 and R2 that minimize the ECM are defined by the values x for which the following inequalities hold
( )( ) co �t ) ( denratio�ity ) ( ratio pr::���lity . ratiO (x) ( c (1 j 2) ) ( ) Rz : /1 (x) c (2 j 1) p1 /2 prior ( cost ) ( denratiO�ity ) ( ratio .. probabrhty . . ::;, c (1 j 2) Pz R t ·. /1f (x) c (2 j 1) p 1 2 (x) �
�
<
<
Proof.
See Exercise
11.3.
(
Pz
ratio
) )
(11-6)
•
It is clear from (11-6) that the implementation of the minimum ECM rule requires (1) the density function ratio evaluated at a new observation x0, (2) the cost ratio, and (3) the prior probability ratio. The appearance of ratios in the def inition of the optimal classification regions is significant. Often, it is much easier to specify the ratios than their component parts. For example, it may be difficult to specify the costs (in appropriate units ) of classifying a student as college material when, in fact, he or she is not and classify ing a student as not college material, when, in fact, he or she is. The cost to tax payers of educating a college dropout for 2 years, for instance, can be roughly assessed. The cost to the university and society of not educating a capable student is more difficult to determine. However, it may be that a realistic number for the ratio of these misclassification costs can be obtained. Whatever the units of mea surement, not admitting a prospective college graduate may be five times more
Sec. 1 1 . 2
Separation and Classification for Two Populations
637
costly, over a suitable time horizon, than admitting an eventual dropout. In this case, the cost ratio is five. It is interesting to consider the classification regions defined in (11-6) for some special cases. SPECIAL CASES OF MINIMUM
(a) p2 /p1
(b)
c
=
1 (equal prior probabilities)
(lj2) I c62ll)
(c) p2
EXPECTED COST REGIONS
=
1 ( equal 1llisclassification
fP1 c(1 ! i);c(2 ! i\
costs)
P2/p1 1/(c(i 12)/t(z!i))
( eq'\;lal ptj9r prQglilbilities a11� eq'l;lal misclassWcati
=
or
=
Rz :
fl ( x))•.·
fz
<
X
When the prior probabilities are unknown, they are often taken to be equal, and the minimum ECM rule involves comparing the ratio of the population densi ties to the ratio of the appropriate misclassification costs. If the misclassification cost ratio is indeterminate, it is usually taken to be unity, and the population den sity ratio is compared with the ratio of the prior probabilities. (Note that the prior probabilities are in the reverse order of the densities.) Finally, when both the prior probability and misclassification cost ratios are unity, or one ratio is the reciprocal of the other, the optimal classification regions are determined simply by compar ing the values of the density functions. In this case, if x0 is a new observation and /1 (x0 ) //2 (x0 ) � 1-that is, /1 (x0 ) � [2 (x0 )-we assign x0 to 7T1 • On the other hand, if /1 (x0 )//2 (x0 ) < 1, or /1 (x0 ) < /2 (x0 ) , we assign x0 to 1r2 • It is common practice to arbitrarily use case (c) in (11-7) for classification. This is tantamount to assuming equal prior probabilities and equal misclassifica tion costs for the minimum ECM rule. 2 2 This is the justification generally provided. It is also equivalent to assuming the prior probabil ity ratio to be the reciprocal of the misclassification cost ratio.
638
Chap. 1 1
Discri mi nation and Classification
Example 1 1 .2
(Classifying a new observation into one of the two populations)
A researcher has enough data available to estimate the density functions
/1 (x) and /2 (x) associated with populations 7T 1 and 7T2 , respectively. Suppose c(2 l l) = 5 units and c(l l 2) = 10 units. In addition, it is known that about 20% of all objects (for which the measurements x can be recorded) belong to 7T2 . Thus, the prior probabilities are p 1 = .8 and p 2 = .2. Given the prior probabilities and costs of misclassification, we can use
(11-6) to derive the classification regions R1 and R2 • Specifically, we have
C2) (:�) = . 5 C5°) (:�) = .5
Suppose the density functions evaluated at a new observation x0 give /1 (x0) = 3 and /2 (x0) = .4. Do we classify the new observation as 7T1 or 7Tz ? To answer the question, we form the ratio .
.5 obtained before. Since /1 ( X 0) = . 75 ( c (l l 2) ) ( Pz ) . 5 /2 (x0) c (2 l l) p 1
and compare it with
>
we find that
•
x0 R1 and classify it as belonging to 7T 1 . E
Criteria other than the expected cost of misclassification can be used to derive "optimal" classification procedures. For example, one might ignore the costs of misclassification and choose R1 and R2 to minimize the total probability of mis classification (TPM): TPM = P (misclassifying a 7T1 observation or misclassifying a 7T2 observation)
= P (observation comes from 7T1 and is misclassified)
+ P (observation comes from 7Tz and is misclassified) (11-8)
Mathematically, this problem is equivalent to minimizing the expected cost of mis classification when the costs of misclassification are equal. Consequently, the opti mal regions in this case are given by (b) in (11-7).
Sec. 1 1 . 3
639
Classification with Two M u ltiva riate Normal Populations
We could also allocate a new observation x0 to the population with the largest "posterior" probability P ( 71'; l x0) . By Bayes's rule, the posterior probabilities are p ( 71' I I X o )
P ( 71' occurs and we observe x0 ) = ---'--'-1 P (we observe x0 ) ----'-'-'--------
P (we observe x0 1 71' 1 )P( 71'1 )
+ P (we observe x0 I 'T1'2 ) P ( 71'2 )
(11-9)
Classifying an observation x0 as 71'1 when P( 71'1 l x0 ) > P ( 71'2 l x0 ) is equivalent to using the (b) rule for total probability of misclassification in (1 1-7) because the denominators in (11-9) are the same. However, computing the probabilities of the populations 71' 1 and 'TI'z after observing x0 (hence the name posterior probabilities) is frequently useful for purposes of identifying the less clear-cut assignments. 1 1 . 3 CLASSIFICATION WITH TWO MULTIVARIATE NORMAL POPULATIONS
Classification procedures based on normal populations predominate in statistical practice because of their simplicity and reasonably high efficiency across a wide variety of population models. We now assume that f1 (x) and f2 (x) are multivari ate normal densities, the first with mean vector p 1 and covariance matrix l: 1 and the second with mean vector p 2 and covariance matrix l: 2 . The special case of equal covariance matrices leads to a particularly simple linear classification statistic. Classification of Normal Populations when
l: 1 = l: 2 = l:
Suppose that the joint densities of X' = [X1 , X2 , . . . , Xp] for populations 71' and 71'2 are given by 1
for i = 1 , 2 (11-10) p1 , p 2 , and l: are known. Then, after Suppose also that the population parameters 1 cancellation of the terms (271' )PI2 I l: 1 12 the minimum ECM regions in (11-6) become
640
Chap. 1 1
Discri m i n ation and Classification
(�g: �n (�:) exp [ - � (x - p 1 ) 1 I - t (x - P t ) + � (x - p 2 ) 1I - 1 (x - p 2 ) ] ( £Q_0_ c (2J1) ) ( PPzt ) (11-11) �
R2 :
<
Given these regions R 1 and R2 , we can construct the classification rule given in the following result. Result 1 1 .2. Let the populations 7Tt and 7Tz be described by multivariate normal densities of the form Then the allocation rule that minimizes the ECM is as follows: Allocate x0 to 7T 1 if
(11-10).
+ J.tz Allocate x0 to
7Tz otherwise.
)
�
In
[ ( £Q_0_ c (2J1) ) ( Pzp1 ) ] (11-12)
(11-11)
Proof Since the quantities in are nonnegative for all x, we can take their natural logarithms and preserve the order of the inequalities. Moreover (see Exercise
11.5 ),
1 (x - p 1 )1I - 1 (x - Pt ) + 21 (x - p2 ) 1I - 1 (x - p2 ) 1 (11-13) = (p, - J.tz )':I - ' x - 2 (p i - J.tz ) 1 I - 1 ( J.tl + J.tz )
-2
and consequently,
Rz : ( J.t t - J.tz ) 1 I - I x
-
1
2 (p i
-I
- J.tz ) 1 I (p, + J.tz )
The minimum ECM classification rule follows.
<
In
c Jl ) pz ) ] [ ( c (1J2) (2 ) ( P t (11-14)
•
Sec. 1 1 . 3
Classification with Two M u ltivariate Normal Pop u l ations
641
In most practical situations, the population quantities p1 , p2 , and :I are unknown, so the rule (11-12) must be modified. Wald [27] and Anderson [2] have suggested replacing the population parameters by their sample counterparts. Suppose, then, that we have n 1 observations of the multivariate random vari able X' = [X1 , X2 , . . . , Xp ] from 1r1 and n 2 measurements of this quantity from 7Tz , with n 1 + n2 - 2 � p. Then the respective data matrices are
l(
x; l x,z I
x n,
j
(11-15)
From these data matrices, the sample mean vectors and covariance matrices are determined by
n, x l = n1 2: x , J , l J= l (p X l ) -
1 n, X z = n 2: X z} ' z J= l (p x l ) -
sl
1 n , (x - (x1 - ) n l - 1 � 11 i 1 ) 1 i1 '
Sz (p Xp)
1 n , (x - ) (x - ) ' nz - 1 � zJ i z zJ Xz
(p Xp)
(11-16)
Since it is assumed that the parent populations have the same covariance matrix :I, the sample covariance matrices S1 and S2 are combined ( pooled) to derive a sin gle, unbiased estimate of :I as in (6-21). In particular, the weighted average S pooled =
[ (n l - 1)n1 +- (n1 z - 1) ] S l + [ (nl - 1n)z +- (n1 z - 1) ] S z
(11-17)
is an unbiased estimate of :I if the data matrices XI and x2 contain random sam ples from the populations 7T1 and 7Tz , respectively. Substituting i1 for p 1 , i 2 for p2 , and Spooled for :I in (11-12) gives the "sam ple" classification rule:
642
Chap. 1 1
Discri m i nation and Classification
Alloc;lte
(-
THE ESTI MATED MINIMUM ECM RU LE FOR TWO NORMAL · POPULATIONS x a to 7T1
- ) 'S· -1
x1 - Xz
Allocate
.
if
.
pooled X o -
2
l.( - · x1
- ) m[(c(1c(211)1 2))· (pz;)· J
-. - )· ,8-1 ( Xz
pooled x1 + Xz
Pl .
(11�18)
x0 :to 7T{Otherwise.
If, in (11-18),
;;;;.
(�) ( Pz ) = 1 c (2 l l)
p1
then ln(1) = 0, and the estimated minimum ECM mle for two normal populations amounts to comparing the scalar variable YA
= ( X- 1
- ' -1 - Xz ) s pooled X
= aA / X
(11-19)
evaluated at x0, with the number
(11-20) where and That is, the estimated minimum ECM mle for two normal populations is tanta mount to creating two univariate populations for the y values by taking an appro priate linear combination of the observations from populations 7T1 and 7Tz and then assigning a new observation x0 to 7T1 or 7T2 , depending upon whether Yo = a'x0 falls to the right or left of the midpoint m between the two univariate means y 1 and )12 • Once parameter estimates are inserted for the corresponding unknown pop ulation quantities, there is no assurance that the resulting rule will minimize the
Sec. 1 1 . 3
Classification with Two M u ltivariate Normal Populations
643
expected cost of misclassification in a particular application. This is because the optimal rule in (11-12) was derived assuming that the multivariate normal densities /1 (x) and f2 (x) were known completely. Expression (11-18) is simply an estimate of the optimal rule. However, it seems reasonable to expect that it should perform well if the sample sizes are large.3 To summarize, if the data appear to be multivariate normal,4 the classification statistic to the left of the inequality in (11-18) can be calculated for each new obser vation x0 . These observations are classified by comparing the values of the statis tic with the value of I n [(c (1 ! 2)/c (2 ! 1)) (p 2 /p 1 )]. Example 1 1 .3
(Classification with two normal populations common I and equal costs)
This example is adapted from a study [4] concerned with the detection of hemophilia A carriers. (See also Exercise 11.32.) To construct a procedure for detecting potential hemophilia A carriers, blood samples were assayed for two groups of women and measurements on the two variables,
X1 = log 1 0 (AHF activity) X2 = log1 0 (AHF-like antigen) recorded. (" AHF" denotes antihemophilic factor.) The first group of n 1 = 30 women were selected from a population of women who did not carry the hemophilia gene. This group was called the normal group. The second group of n2 = 22 women was selected from known hemophilia A carriers ( daugh ters of hemophiliacs, mothers with more than one hemophilic son, and moth ers with one hemophilic son and other hemophilic relatives). This group was called the obligatory carriers. The pairs of observations (x 1 , x2 ) for the two groups are plotted in Figure 11.4. Also shown are estimated contours con taining 50% and 95% of the probability for bivariate normal distributions cen tered at i1 and i 2 , respectively. Their common covariance matrix was taken as the pooled sample covariance matrix Spooled . In this example, bivariate nor mal distributions seem to fit the data fairly well. The investigators (see [4]) provide the information i
and
1=
[
]
- .0065 - .0390 '
[ - .2483 ] .0262
3 As the sample sizes increase, i 1 , x 2 , and Spooled become, with probability approaching 1, indis tinguishable from p, 1 , p, 2 , and �. respectively [see (4-26) and (4-27)]. 4 At the very least, the marginal frequency distributions of the observations on each variable can be checked for normality. This must be done for the samples from both populations. Often, some vari ables must be transformed in order to make them more "normal looking." (See Sections 4.6 and 4.8.)
644
Chap. 1 1
Discrimi nation and Classification x 2 = log 1 0
( AHF-like antigen )
.4 .
3
.2 .I 0 -.I
-
.
•
Normals
0
3
Obligatory carriers
L...-..J--'---'----L---* X 1
- .7
-
.
5
-.1
- .3
.I
.3
[ -131.158 90.423
=
log 1 0 ( AHF activity )
Figure 1 1 .4 Scatter plots of [ log1 0 (AHF activity), log 1 0 (AHF-Iike antigen) ] for the normal group and obligatory hemophilia A carriers. -1
s pooled =
- 90.423 108.147
]
Therefore, the equal costs and equal priors discriminant function [see (11-19)] is
= [.2418 - .0652]
[ -131.158 90.423
= 37.61x1 - 28.92x2 Moreover,
[37.61 - 28 . 92] .Yz = a'x 2 = [37.61 - 28.92]
- 90.423 108.147
[ - .0065 ]
x1 [ J x2 J
.2483 ] [ - _0262 = - 10.1o - .0390
.88
and the midpoint between these means [see (11-20)] is
m = ! C .Y1 + .Y2 ) = ! (.88 - 10.10) = - 4.61 Measurements of AHF activity and AHF-like antigen on a woman who may be a hemophilia A carrier give x1 = - .210 and x2 = - .044. Should this woman be classified as 7T1 (normal) or 7Tz (obligatory carrier)?
Sec. 1 1 . 3
Classification with Two M u ltiva riate Normal Popu lations
645
Using (11-18) with equal costs and equal priors so that ln(1) = 0, we obtain Allocate X o to 7T l if Yo = a' X o Allocate X o to 7Tz if Y o = a' X o where x� = [ - .210, - .044] . Since Y o = a' X o = [37.61 - 28.92]
�
<
.210 [-- .044 J
m
- 4.61
m
- 4.61
- 6.62
<
- 4.61
we classify the woman as 7Tz , an obligatory carrier. The new observation is indicated by a star in Figure 11.4. We see that it falls within the estimated .50 probability contour of population 7Tz and about on the estimated .95 proba bility contour of population 7T 1 . Thus, the classification is not clear cut. Suppose now that the prior probabilities of group membership are known. For example, suppose the blood yielding the foregoing x 1 and x2 mea surements is drawn from the maternal first cousin of a hemophiliac. Then the genetic chance of being a hemophilia A carrier in this case is .25. Conse quently, the prior probabilities of group membership are p 1 = .75 and p 2 = .25. Assuming, somewhat unrealistically, that the costs of misclassifica tion are equal, so that c (1 1 2) = c (2 1 1 ) , and using the classification statistic Or W = a' x0 - m with a' x o = - 6.62, we have w
X� = [ - .210, - .044] ,
= - 6.62 - ( - 4.61 )
Applying (11-18), we see that w
= - 2.01
<
ln
m = - 4.61,
and
- 2.01
[ �: J = In [ :�� ] = - 1.10
and we classify the woman as 7Tz , an obligatory carrier.
•
Scaling
The coefficient vector a = s �ololed ( X I - X z ) is unique only up to a multiplicative constant, so, for c ¥= 0, any vector ca will also serve as discriminant coefficients. The vector a is frequently "scaled" or "normalized" to ease the interpreta tion of its elements. Two of the most commonly employed normalizations are:
646
Chap. 1 1
Discri m i n ation and Classification
1. Set
a*
(11-21)
so that a* has unit length. 2. Set
a*
(11-22)
so that the first element of the new coefficient vector a* is 1. In both cases, a* is of the form ca. For normalization (1), c = (a ' a ) - 112 and for (2), c = a1 1 . The magnitudes of at , a; , . . . , a; in (11-21) all lie in the interval [ - 1 , 1 ] . In (11-22), at = 1 and a; , . . . , a; are expressed as multiples of at . Constraining the a;* to the interval [ - 1, 1 ] usually facilitates a visual comparison of the coefficients. Similarly, expressing the coefficients as multiples of a;* allows one to readily assess the relative importance (vis-a-vis X1 ) of variables X2 , . . . , XP as discriminators. Normalizing the a;' s is recommended only if the X variables have been stan dardized. If this is not the case, a great deal of care must be exercised in inter preting the results. Classification of Normal Populations When
I 1 * I2
As might be expected, the classification rules are more complicated when the pop ulation covariance matrices are unequal. Consider the multivariate normal densities in (11-10) with I ;, i = 1, 2, replac ing I . Thus, the covariance matrices, as well as the mean vectors, are different from one another for the two populations. As we have seen, the regions of minimum ECM and minimum total probability of misclassification (TPM) depend on the ratio of the densities, f1 (x ) //2 (x) , or, equivalently, the natural logarithm of the density ratio, ln [/1 ( x ) /f2 ( x) ] = ln [.t; ( x ) ] - ln [f2 (x ) ] . When the multivariate normal densities have different covariance structures, the terms in the density ratio involving I I ; 1 1 /2 do not cancel as they do when I 1 = I 2 • Moreover, the quadratic forms in the exponents of f1 (x ) and f2 ( x ) do not combine to give the rather sim ple result in (11-13). Substituting multivariate normal densities with different covariance matrices into (1 1-6) gives, after taking natural logarithms and simplifying (see Exercise 1 1.15), the classification regions
Sec. 1 1 . 3
[ ( cc (2(1 11 2)1 ) ) ( Pp2I ) ] c (1 1 2) ) ( p 2 ) ] ln [ ( c (2 1 1 ) P 1
Classification with Two M u ltivariate Normal Popu l ations
_ l_ X1 (I I- 1 - I2- 1 )x 2
+
"- 1 I I- 1 - rn 21 I 2- 1 )x ( r1
_ l_ x 1 (I 1- 1 - I 2- 1 )x 2
+
(p { I ] 1 - p21 I2- 1 )x -
k
�
647
ln
k <
(11-23)
where k
-2
(
- 1 1n I I I I ) I I2 1
+ 2 fLI ""'l 1(
I
� - I IL - f.Lz ""'� 2 f.L2 ) 1 I
-1
(11-24)
The classification regions are defined by quadratic functions of x. When I 1 = I 2 , the quadratic term, - �x � (I] 1 - I2 1 )x, disappears, and the regions defined by (11-23) reduce to those defined by (11-14). The classification rule for general multivariate normal populations follows directly from (11-23). Resu lt 1 1 .3 . Let the populations 7T1 and 7T2 be described by multivariate normal densities with mean vectors and covariance matrices p 1 , I 1 and p 2 , I 2 , respectively. The allocation rule that minimizes the expected cost of misclassifica tion is given by:
Allocate x0 to
7T 1
if
� - 1 ) Xo - 21 x o1 ( �""'J- 1 - ""'z
+
( f.L J� �""' 1- 1 - f.Lz� �""' 2- I ) Xo -
Allocate x0 to 1r2 otherwise. Here k is set out in (11-24).
k
>-
�
1n
[ ( cc (2( 1 11 2)1 ) ) ( PzP! ) ] •
In practice, the classification rule in Result 11.3 is implemented by substitut ing the sample quantities x i 2 , S 1 , and S2 (see (11-16)) for p 1 , p 2 , I 1 , and I 2 , respectively.5 1 ,
5 The inequalities n 1 > p and n2 > p must both hold for S] 1 and 82 1 to exist. These quantities are used in place of l:] 1 and l:2 1 , respectively, in the sample analog (1 1-25).
648
Chap. 1 1
Discri m i n ation and Classification QUADRATIC. CLASSIFICATION RULE
(NORMAL
POPULATIO N S WITH
UNEQUAL
COVARIANCE MATRI C ES)
Allocate
x0
to
_ !..2 x'o (S-1 1 Allocate
7T1
if
) xo Sd 2
x 0 to 7T2
·
+
) xo - x2' S-1 ( x1' S':"1 2 1
otherwise.
-
k
�
2)) (Pz)J ln [(c(11 c (2 1 1) P!
(11-25)
Classification with quadratic functions is rather awkward in more than two dimensions and can lead to some strange results. This is particularly true when the data are not ( essentially) multivariate normal. Figure 11.5 ( a ) shows the equal costs and equal priors rule based on the ide alized case of two normal distributions with different variances. This quadratic rule leads to a region R 1 consisting of two disj oint sets of points. In many applications, the lower tail for the 7T 1 distribution will be smaller than that prescribed by a normal distribution. Then, as shown in Figure 1 1 .5 ( b ) , the lower part of the region R 1 , produced by the quadratic procedure, does not line up well with the population distributions and can lead to large error rates. A serious weakness of the quadratic rule is that it is sensitive to departures from normality. If the data are not multivariate normal, two options are available. First, the nonnormal data can be transformed to data more nearly normal, and a test for the equality of covariance matrices can be conducted to see whether the linear rule ( 11-18 ) or the quadratic rule ( 11-25 ) is appropriate. Transformations are discussed in Chapter 4. (The usual tests for covariance homogeneity are greatly affected by nonnormality. The conversion of nonnormal data to normal data must be done before this testing is carried out. ) Second, we can use a linear ( or quadratic) rule without worrying about the form of the parent populations and hope that it will work reasonably well. Studies ( see [20] and [21 ]) have shown, however, that there are nonnormal cases where a linear classification function performs poorly, even though the population covari ance matrices are the same. The moral is to always check the performance of any classification procedure. At the very least, this should be done with the data sets used to build the classifier. Ideally, there will be enough data available to provide for "training" samples and "validation" samples. The training samples can be used to develop the classification function, and the validation samples can be used to evaluate its performance.
Sec . 1 1 .4
Eva l uati n g C lassification F u n ctions
649
(a)
(b) Figure 1 1 .5 Quadratic rules for (a) two normal distributions with unequal variances and (b) two distributions, one of which is nonnormal-rule not appropriate.
1 1 .4 EVALUATING CLASSIFICATION FUNCTIONS
One important way of judging the performance of any classification procedure is to calculate its "error rates," or misclassification probabilities. When the forms of the parent populations are known completely, misclassification probabilities can be calculated with relative ease, as we show in Example 11.4. Because parent popula tions are rarely known, we shall concentrate on the error rates associated with the sample classification function. Once this classification function is constructed, a measure of its performance in future samples is of interest. From (11-8), the total probability of misclassification is TPM
= p 1 J /1 (x)dx + p2 J f2 (x) dx R2
R1
650
Chap. 1 1
Discri m i n ation and Classification
The smallest value of this quantity, obtained by a judicious choice of R 1 and R2 , is called the optimum error rate (OER).
Thus, the OER is the error rate for the minimum TPM classification rule. Example 1 1 .4 (Calculating misclassification probabilities)
Let us derive an expression for the optimum error rate when p 1 = p 2 = � and /1 ( x ) and f2 ( x ) are the multivariate normal densities in (11-10). Now, the minimum ECM and minimum TPM classification rules coin cide when c (1 1 2) = c (2 1 1 ). Because the prior probabilities are also equal, the minimum TPM classification regions are defined for normal populations by (11-12), with In
[ ( � g � �� ) ( �� ) ] = 0. We find that
R 1 : ( IL t - ILz ) 'I- l x - HILt - 1Lz ) 'I - 1 (1L t Rz : ( IL t - ILz ) 'I- l x - HILt - 1Lz ) 'I- 1 (1L t
+ ILz ) ;;;;. + ILz ) <
0 0
These sets can be expressed in terms of y = (IL1 - 1Lz ) ' I - 1 x = a ' x as R t (y ) : Y ;;;;. � (ILt - 1Lz ) ' I - 1 (1Lt + ILz ) Rz (y ) : Y < HILt - 1Lz ) ' I- 1 ( 1Lt + ILz )
But Y is a linear combination of normal random variables, so the probability densities of Y, /1 (y ) and f2 (y ) , are univariate normal (see Result 4.2) with means and a variance given by
f.Lt Y = a ' ILt = (ILt - 1Lz ) ' I -11Lt f.Lz y = a ' ILz = ( ILt - ILz ) ' I-l .IL z a� = a 'I a = (ILt - 1Lz ) ' I - 1 (1Lt - ILz ) = fJ. 2 Now, TPM = � P [misclassifying a 1r1 observation as 1r2 ]
+ � P [ misclassifying a 2 observation as 1 ] 1T
1r
Sec. 1 1 .4
651
Eval uati ng Classification F u n ctions
But P [misclassifying a 17'1 observation as 17'2 ]
= P (2 1 1 )
= P [Y < HIL t - 1Lz ) ' I, - 1 (1L t + ILz ) ] =
p
( y - IL t Y
(
Uy
<
1 HILt - ILz ) ' I, - I ( IL t + ILz ) - ( IL t - 1Lz ) ' I- 1L t �
---------------------------------
A
)
where ) is the cumulative distribution function of a standard normal ran dom variable. Similarly, •
P [misclassifying a 17'2 observation as 17't l
These misclassification probabilities are shown in Figure 1 1 .6. Therefore, the optimum error rate is OER = minimum TPM
=
� ( -2A ) + � ( -2A ) = ( -2A )
If, for example, A 2 = ( IL 1 - ILz ) ' I, - l ( IL 1 - ILz ) = 2.56, A = \12.56 = 1.6, and, using Table 1 in the appendix, we obtain Minimum TPM
Figure 1 1 .6
=
( -� ·6 ) = ( -
.8)
= .2119
The misclassification probabilities based on Y.
(11-27) then
652
Chap. 1 1
Discri m i n ation and Classification
The optimal classification rule here will incorrectly allocate about items to one population or the other.
21% of the •
11.4
Example illustrates how the optimum error rate can be calculated when the population density functions are known. If, as is usually the case, certain pop ulation parameters appearing in allocation rules must be estimated from the sam ple, then the evaluation of error rates is not straightforward. The performance of sample classification functions can, in principle, be eval uated by calculating the actual error rate (AER), AER = p 1
(11-28)
f. f1 (x) d x + p2 J. f2 (x) d x R2
RI
where R 1 and R 2 represent the classification regions determined by samples of size n 1 and n 2 , respectively." For ex�mple, if the classification function in is employed, the regions R 1 and R 2 are defined by the set of x ' s for which the fol lowing inequalities are satisfied.
1 ( -x t - -Xz ) ' spooledx - t ( -Xt + -Xz ) -t - 2 ( -Xt - -Xz ) ' s pooled 1 -t ( x- l + -Xz ) ( -x l - Xz- ) ' spooled - t x - 2 ( -Xt - -Xz ) ' s pooled
R" 1 : R" z:
;;;;.:
<
(11-18)
1 1
1 c 2) P (1 n [ ( c 1 ) ( Pz ) (2 1) t J c (1 1 2) P ] n [ ( c 1 ) ( Pz ) (2 1) t
The AER indicates how the sample classification function will perform in future samples. Like the optimal error rate, it cannot, in general, be calculated, because it depends on the unknown density functions /1 ( x ) and /2 ( x ) . However, an estimate of a quantity related to the actual error rate can be calculated, and this estimate will be discussed shortly. There is a measure of performance that does not depend on the form of the parent populations and that can be calculated for any classification procedure. This measure, called the apparent error rate (APER), is defined as the fraction of observations in the training sample that are misclassified by the sample classifica tion function. The apparent error rate can be easily calculated from the confusion matrix, which shows actual versus predicted group membership. For n 1 observations from 1r1 and n2 observations from 1r2 , the confusion matrix has the form Predicted membership Actual membership where
1Tz
1Tz
(11-29)
Sec. 1 1 .4
n1 c n1 M n2 c n2M
Eval u ati ng Classification F u n ctions
653
= number of 7T1 items �orrectly classified as 7T 1 items = number of 7T1 items misclassified as 7Tz items = number of 7Tz items �orrectly classified = number of 7Tz items misclassified
The apparent error rate is then
(11-30) which is recognized as the proportion of items in the training set that are misclassified. Example 1 1 .5
(Calculating the apparent error rate)
11.1
Consider the classification regions R 1 and R2 shown in Figure for the rid ing-mower data. In this case, observations northeast of the solid line are clas sified as 7T1 , mower owners; observations southwest of the solid line are classified as 7T2 , nonowners. Notice that some observations are misclassified. The confusion matrix is Predicted membership 7T 1 : riding-mower owners 7T2 : nonowners 7T 1 :
ridingmower owners
7T2 :
nonowners
Actual membership
c� : �2 ) 100% = ( 2: ) 100% = 16.7%
The apparent error rate, expressed as a percentage, is APER =
•
The APER is intuitively appealing and easy to calculate. Unfortunately, it tends to underestimate the AER, and the problem does not disappear unless the sample sizes n 1 and n 2 are very large. Essentially, this optimistic estimate occurs because the data used to build the classification function are also used to evaluate it. Error-rate estimates can be constructed that are better than the apparent error rate, remain relatively easy to calculate, and do not require distributional assumptions. One procedure is to split the total sample into a training sample and a validation sample. The training sample is used to construct the classification func tion, and the validation sample is used to evaluate it. The error rate is determined
654
Chap. 1 1
Discri m i n ation and Classification
by the proportion misclassified in the validation sample. Although this method overcomes the bias problem by not using the same data to both build and judge the classification function, it suffers from two main defects: (i) It requires large samples. (ii) The function evaluated is not the function of interest. Ultimately, almost all
of the data must be used to construct the classification function. If not, valu able information may be lost.
A second approach that seems to work well is called Lachenbruch ' s "hold out" procedure6 (see also Lachenbruch and Mickey [22]): 1. Start with the 1r1 group of observations. Omit one observation from this
group, and develop a classification function based on the remaining n 1 1 , n 2 observations. 2. Classify the "holdout" observation, using the function constructed in Step 1 . 3. Repeat Steps 1 and 2 until all o f the 1r 1 observations are classified. Let nflfj be the number of holdout ( H ) observations misclassified in this group. 4. Repeat Steps 1 through 3 for the 1r2 observations. Let n�Ifj be the number of holdout observations misclassified in this group. -
Estimates P (2 j 1) and P (l j 2) of the conditional misclassification probabili ties in (11-1) and (11-2) are then given by
P (2 j 1)
=
.fo (1 J 2)
=
n n l n ___1M_ nz
___1M_
(11-31)
and the total proportion misclassified, ( nflf) + n�lfj ) / ( n 1 + n2 ) , is, for moderate samples, a nearly unbiased estimate of the expected actual error rate, E (AER).
Lachenbruch ' s holdout method is computationally feasible when used in con junction with the linear classification statistics in (11-18) or (1 1-19). It is offered as an option in some readily available discriminant analysis computer programs. 6 Lachenbruch's holdout procedure is sometimes referred to as jackknifing or cross-validation.
Sec. 1 1 .4
Example 1 1 .6
Eva l u ating Classification F u n ctions
655
(Calculating an estimate of the error rate using the holdout procedu re)
We shall illustrate Lachenbruch ' s holdout procedure and the calculation of error rate estimates for the equal costs and equal priors version of (1 1-18). Consider the following data matrices and descriptive statistics. (We shall assume that the n1 = n2 = bivariate observations were selected randomly from two populations 1r1 and ?Tz with a common covariance matrix.)
3
X,
u �n � [! n
�[a � x, � [ a
2S ,
..
zs ,
[ - 22 - 28 ]
� [ -� �J -
The pooled covariance matrix is
1 S pooled = 4 (2S t + 2 S2 ) =
[
1 -1 -1
4
]
Using S pooled • the rest of the data, and Rule (11-18) with equal costs and equal priors, we may classify the sample observations. You may then verify (see Exercise 1 1.19) that the confusion matrix is Classify as:
1T1 1Tz 2 1 l True population: 1T 1Tz 1 2 and consequently,
2
APER (apparent error rate) = 6 =
[ 43 10 ] ; -x t H = [ 39.5 ] ;
.33
Holding out the first observation X� = [2, 12] from x l ' we calculate xlH =
s
� [l S l H + 2Sz] = � [ =·�
The new pooled covariance matrix, sH, pooled • is
SH, pooled =
and 1 S I H =
.[ 51 21 ]
�� J
656
Chap. 1 1
Discri m i n ation and Classification
[
with inverse7
1 10 1 1 SH,- pooled = 8 1 2.5
J
It is computationally quicker to classify the holdout observation x l H on the basis of its sample distances from the group means i1 H and i 2 . This procedure is equivalent to computing the value of the linear function y = a� x H = (i l H - Xz ) ' S/f�pooledXH and comparing it to the midpoint m H = Hx1 H - Xz ) ' S/f�pooled ( i l H + Xz ) . [See (11-19) and (11-20).] Thus with x� = [2, 12] we have
[ J[ ]
Squared distance from i l H = (x H - i l H ) ' S/f� pooled (x H - i l H ) 2 - 3.5 1 10 1 = [2 - 3.5 12 - 9] 8 1 2.5 12 - 9
[
[ 122 -- 47
Squared distance from i 2 = (x H - i2 ) ' S/f�pooled (x H - Xz )
10 1 = [2 - 4 12 - 7] _!_8 1 2.5
J
= 4.5
J = 10.3
Since the distance from x H to i1 H is smaller than the distance from x H to i 2 , we classify x H as a 1r1 observation. In this case, the classification is correct. If x� [4, 10] is withheld, i1 H and S/f� pooled become
=
x lH =
-1
_!_
and S H, pooled 8
[ 164
We find that
(x H - X r H ) ' S/f�pooled (x H - i l H ) = [4 - 2.5 10 - 10]
= 4.5
(x H - Xz ) ' S /f�pooled (x H - Xz ) = [4 - 4 10 - 7]
= 2.8
l
J
[ � � ] [ :o � ]
�[ ! 1
4 2.5
1
25
- 2 _ �
2�5] [ 1� -=. �]
and consequently, we would incorrectly assign x� = [4, 10] to 1r2 • Holding out x� = [3, 8] leads to incorrectly assigning this observation to 7Tz as well. Thus, nfl!) = 2.
7 A matrix identity due to Bartlett [3] allows for the quick calculation of SJ1.1 pooled directly from S �o�Ied · Thus one does not have to recompute the inverse after withholding each observation. (See Exer
cise 1 1 .20.)
Sec. 1 1 .4
[ ]
657
Eva l uati ng Classification F u n ctions
[ 3.5 ] ;
]
Turning to the second group, suppose x� = [5, 7] is withheld. Then
x2 H =
3 9 4 5 ;
-X z H =
and 1 S2 11 =
7
The new pooled covariance matrix is
[
[
SH. poolcd = 3 [2S , + 1 S z H ] = 3 - 4 1
1
[
with inverse
1_ 16 4 1 S H- , pooled -_ 24 4 2.5 We find that
2.5 - 4 16
]
.5 - 2 _2 8
]
; [ 1� 2�5 ] [;�:O J 3 [ 16 4 [ 5 - 3.5 7 - 7] J 7-7] 24 4 2.5
(x11 - x , ) 'S!f� pooled (x H - x , ) = [5 - 3 7 - 10] 4 = 4.8
(x H - i2 H ) ' S!f� pooled (x H - i2 H ) = [5 - 3.5 = 4.5
and x� = [5, 7] is correctly assigned to 7T2 • When x� = [3, 9] is withheld,
; [ 1� 2�5 ] [: � :o ]
(x11 - i1 ) ' S!f� pooled (x11 - i1 ) = [3 - 3 9 - 10] 4 = .3
; [ � 2\ J [ 39 -_ 4�5 J
1 (x H - i2 H ) ' S!f.1 pooled (x H - i2 H ) = [3 - 4.5 9 - 6] 4 = 4.5
and x� = [3, 9] is incorrectly assigned to 7T1 • Finally, withholding x� = [ 4, 5] leads to correctly classifying this observation as 7T2 • Thus, nff) = 1 . An estimate of the expected actual error rate i s provided by A
E (AER) =
(H ) + n 1M n1 +
( H ) -2 1 nzM = + = .5 n2 3 + 3
658
Chap. 1 1
Discri m i n ation and Classification
Hence, we see that the apparent error rate APER = .33 is an optimistic mea sure of performance. Of course, in practice, sample sizes are larger than those we have considered here, and the difference between APER and E (AER) may not be as large. • If you are interested in pursuing the approaches to estimating classification error rates, see [21 ] . The next example illustrates a difficulty that can arise when the variance of the discriminant is not the same for both populations. Example 1 1 . 7 (Classifying Alaskan and Canadian salmon)
The salmon fishery is a valuable resource for both the United States and Canada. Because it is a limited resource, it must be managed efficiently. Moreover, since more than one country is involved, problems must be solved equitably. That is, Alaskan commercial fishermen cannot catch too many Canadian salmon and vice versa. These fish have a remarkable life cycle. They are born in freshwater streams and after a year or two swim into the ocean. After a couple of years in salt water, they return to their place of birth to spawn and die. At the time they are about to return as mature fish, they are harvested while still in the ocean. To help regulate catches, samples of fish taken duri�g the harvest must be identified as coming from Alaskan or Canadian waters. The fish carry some information about their birthplace in the growth rings on their scales. Typically, the rings associated with freshwater growth are smaller for the Alaskan-born than for the Canadian-born salmon. Table 1 1 .2 gives the diam eters of the growth ring regions, magnified 100 times, where
X2
xl = diameter of rings for the first-year freshwater growth (hundredths of an inch)
= diameter of rings for the first-year marine growth (hundredths of an inch)
2
In addition, females are coded as 1 and males are coded as 2. Training samples of sizes n 1 = 50 Alaskan-born aqd n = 50 Canadian born salmon yield the summary statistics
[ 429.660 98.380 ] ' [ 137.460 ] Xz = 366.620 '
81 82
- 188.093 ] [ - 260.608 188.093 1399.086 ' = [ 326.090 133.505 133.505 893.261 J
=
The data appear to satisfy the assumption of bivariate normal distributions (see Exercise 1 1 .31 ) , but the covariance matrices may differ. However, to
Sec. 1 1 .4
659
Eva l u ati ng Classification F u n ctions
TABLE 1 1 .2 SALMON DATA (G ROWTH-RI NG DIAM ETERS)
Canadian
Alaskan Gender
Freshwater
Marine
Gender
Freshwater
Marine
2 1 1 2 1 2 1 2 2 1 1 2 1 2 2 1 2 2 2 1 1 2 2 2 2 2 1 2 1 2 1 1 1 1 1 1 1
108 131 105 86 99 87 94 117 79 99 1 14 123 123 109 1 12 104 111 126 105 119 114 100 84 102 101 85 109 106 82 118 105 121 85 83 53 95 76
368 355 469 506 402 423 440 489 432 403 428 372 372 420 394 407 422 423 434 474 396 470 399 429 469 444 397 442 431 381 388 403 451 453 427 411 442
1 1 1 2 2 2 1 2 1 2 2 1 1 2 1 1 1 2 2 1 2 1 1 2 2 2 1 2 1 2 1 2 1 1 2 2 1
129 148 179 152 166 124 156 131 140 144 149 108 135 170 152 153 152 136 122 148 90 145 123 145 1 15 134 117 126 118 120 153 150 154 155 109 117 128
420 371 407 381 377 389 419 345 362 345 393 330 355 386 301 397 301 438 306 383 385 337 364 376 354 383 355 345 379 369 403 354 390 349 325 344 400
660
Chap. 1 1
Discri m i n ation and Classification
(continued)
TABLE 1 1 .2
Alaskan
Canadian
Gender
Freshwater
Marine
Gender
Freshwater
Marine
1 2 1 2 2 1 2 1 1 2 1 2 1
95 87 70 84 91 74 101 80 95 92 99 94 87
426 402 397 511 469 451 474 398 433 404 481 491 480
1 2 2 1 1 2 1 2 2 2 1 1 1
144 163 145 133 128 123 144 140 150 124 125 153 108
403 370 355 375 383 349 373 388 339 341 346 352 339
1
Gender Key: f e mal e ; 2 mal e . Source: K. A. Jensen and B. Van Alen of the State of Alaska DepartmDatentaofcourFisthesandy ofGame. =
=
illustrate a point concerning misclassification probabilities, we will use the lin ear classification procedure. The classification procedure, using equal costs and equal prior proba bilities, yields the holdout estimated error rates Predicted membership 7T2 : Canadian Alaskan 7Tz : Canadian 7TJ :
Actual membership
6 49
44 1
based on the linear classification function [see (11-19)]
y
= - 5.54121 - .12839x 1 + .05194x2
There is some difference in the sample standard deviations of y for the two populations:
n Alaskan Canadian
50 50
Sample Mean 4.144
-4.147
Sample Standard Deviation 3.253 2.450
Sec. 1 1 . 5
Figure 1 1 . 7
Fisher's Discri m i nant Function-Separation of Populations
661
Schematic of normal densities for linear discriminant-salmon
data.
Although the overall error rate (7/100, or 7%) is quite low, there is an unfairness here. It is less likely that a Canadian-born salmon will be misclas sified as Alaskan-born, rather than vice versa. Figure 1 1 .7, which shows the two normal densities for the linear discriminant y explains this phenomenon. Use of the midpoint between the two sample means does not make the two misclassification probabilities equal. It clearly penalizes the population with the largest variance. Thus, blind adherence to the linear classification proce dure can be unwise. • It should be intuitively clear that good classification (low error rates) will depend upon the separation of the populations. The farther apart the groups, the more likely it is that a useful classification rule can be developed. This separative goal, alluded to in Section 11.1, is explored further in Sections 1 1 .5 and 1 1 .7. As we shall see, allocation rules appropriate for the case involving equal prior probabilities and equal misclassification costs correspond to functions designed to maximally separate populations. It is in this situation that we begin to lose the dis tinction between classification and separation. 1 1 .5 FISHER'S DISCRIMINANT FUNCTION-SEPARATION OF POPULATIONS
Fisher [9] actually arrived at the linear classification statistic (11-19) using an entirely different argument from the one in Section 1 1 .3. Fisher ' s idea was to trans form the multivariate observations x to univariate observations y such that the y ' s derived from populations 1r1 and 1r2 were separated as much as possible. Fisher suggested taking linear combinations of x to create y ' s because they are simple enough functions of the x to be handled easily. Fisher ' s approach does not assume that the populations are normal. It does, however, implicitly assume that the pop ulation covariance matrices are equal, because a pooled estimate of the common covariance matrix is used. A fixed linear combination of the x ' s takes the values y 1 1 , Y w . . . , Y r n , for the observations from the first population and the values y2 1 , J22 , , Yz nz for the obser vations from the second population. The separation of these two sets of univariate y ' s is assessed in terms of the difference between )11 and )12 expressed in standard deviation units. That is, • . .
662
Chap. 1 1
Discri m i n ation and Classification
. separatiOn
=
I .Y1 - .Yz l ,
sY
is the pooled estimate of the variance. The objective is to select the linear combi nation of the x to achieve maximum separation of the sample means y1 and y2 . Result 1 1 .4. The linear combination .9
mizes the ratio
(
Squared distance between sample means of y (Sample variance of y )
= a' X
=
( x l - X z ) ' s;�oled X maxi
) (a' x1 - a' x2 ) 2 a' s pooled a (a' d ) 2 a' s pooled a
(11-33)
over all possible coefficient vectors 1a where d ( x1 - x 2 ) . The maximum of the ratio (11-33) is D 2 ( x1 - x 2 ) ' S;ooied ( x l - i 2 ) . =
=
Proof. The maximum of the ratio in (11-33) is given by applying (2-50) directly. Thus, setting d ( x1 - x 2 ) , we have =
-
(a' d ? - 1 c -XI - Xz ) = D 2 - 1 led d - ( -XI - -Xz ) ' spooled d ' spoo m!'IX A A a a ' S pooled a
where D 2 is the sample squared distance between the two means. Note that s; in (11-33) may be calculated as
•
(11-34)
w1th y l j = aA f x1 j a nd y2j •
=
aAf x 2j .
Example 1 1 .8 (Fisher's linear discriminant for the hemophilia data)
A study concerned with the detection of hemophilia A carriers was intro duced in Example 1 1 .3. Recall that the equal costs and equal priors linear dis criminant function was
Sec. 1 1 .5
Fisher's Discri m i nant F u n ction-Separation of Popu lations
663
This linear discriminant function is Fisher ' s linear function, which max imally separates the two populations, and the maximum separation in the samples is
[ -131.158 90.423
( x i - Xz )' S �o1oied ( x i - Xz ) [.2418, - .0652] 10.98
- 90.423 108.147
J
.2418 [ - .0652
J
•
Fisher ' s solution to the separation problem can also be used to classify new observations. AN ALLOCATION RULE BASED ON FISH ER'S PISCRI M I NANJ FUNCTION8
Allocate x0 to
1r1 if
or
Allocate x0 to
1r2
Yo
if
-
m
;;.
o
(11-35)
Yo < m
or
Yo - m < o
The procedure (11-33) is illustrated, schematically, for p = 2 in Figure 11.8. All points in the scatter plots are pro ected onto a line in the direction a , and this direction is varied until the samples are maximally separated. Fisher ' s linear discriminant function in (11-35) was developed under the assumption that the two populations, whatever their form, have a common covari ance matrix. Consequently, it may not be surprising that Fisher ' s method corre sponds to a particular case of the minimum expected-cost-of-misclassification rule. The first term, y = ( x1 - x 2 ) ' S��oled x , in the classification rule (11-18) is the lin ear function obtained by Fisher that maximizes the univariate "between" samples
j
8 We must have (n , does not exist.
+ n2
-
2)
;,
p; otherwise
spoolc� is singular, and the usual inverse, s ;�oled •
664
Chap. 1 1
Discri m i n ation and Classification
y=
x2 • • • • : • 1[ 2 • •. • . . . . . . . •. • • •. • • · � �.·� X2 . . · : .; � :· : . . . .,.. -\ . ..,· : . : . • • .. . ,.. ..... . . .... : : . . . . •• :. � o • • . . 0• • • • o o 0 • o• o 0• 0 o 1t o o g o 00 0 I 0 0 0 0 o o o0o --o co 0 0 o o o o o "" o oo 8 o o o o o ,.. ,.....c 0 o0 O o o o 0o QO O 0 0 0 0 0o 00 0
a'x
�
x,
Figure 1 1 .8 A pictorial representation of Fisher's procedure for two popula tions with p = 2.
variability relative to the "within" samples variability. [See (11-33).] The entire expression
is frequently called
1 Xz ) 'S��oled x - H x 1 - X z ) ' S �ooicd ( i i Xz ) ' S ��oled [x - H x 1 + Xz ) ]
+
Xz ) (11-36)
Anderson's classification function (statistic). Once again, if
[ (c (1 1 2 ) /c (2 1 1 ) ) (p2 /p i )] = 1, so that ln [ ( c (1 1 2 )/ c (2 1 1 ) ) (p 2 /p 1 )] = 0, Rule (11-18) is comparable to Rule (11-35), based on Fisher ' s linear discriminant func
tion. Thus, provided that the two normal populations have the same covariance matrix, Fisher ' s classification rule is equivalent to the minimum ECM rule with equal prior probabilities and equal costs of misclassification. In sum, for two populations, the maximum relative separation that can be obtained by considering linear combinations of the multivariate observations is equal to the distance D 2• This is convenient because D 2 can be used, in certain sit uations, to test whether the population means p 1 and p 2 differ significantly. Con sequently, a test for differences in mean vectors can be viewed as a test for the "significance" of the separation that can be achieved. Suppose the populations 7T I and 7Tz are multivariate normal with a common covariance matrix I. Then, as in Section 6 . 3, a test of H0 : PI = p 2 versus H1 : PI #- p 2 is accomplished by referring
n2 - p - 1 ) ( n i n2 ) D 2 ( n 1 + n2 - 2 )p n 1 + n2 to an F-distribution with v1 = p and v2 = n 1 + n2 - p - 1 d.f. If H0 is rejected,
( n1
+
we can conclude that the separation between the two populations 7T1 and 7T2 is significant.
Sec. 1 1 .6
Classification with Severa l Popu lations
665
Comment. Significant separation does not necessarily imply good classifi cation. As we have seen in Section 1 1 .4, the efficacy of a classification procedure can be evaluated independently of any test of separation. On the other hand, if the separation is not significant, the search for a useful classification rule will probably prove fruitless. 1 1 .6 CLASSIFICATION WITH SEVERAL POPULATIONS lh
theory, the generalization of classification procedures from 2 to g ;;;; 2 groups is straightforward. However, not much is known about the properties of the corre sponding sample classification functions, and in particular, their error rates have not been fully investigated. The "robustness" of the two group linear classification statistics to, for instance, unequal covariarlces or nonnormal distributions can be studied with com puter generate q sampling experiments.9 For more than two populations, this approach does not lead to general conclusions, because the properties depend on where the populations are located, and there are far too many configurations to study conveniently. As before, our approach in this section will be to develop the theoretically optimal rules and then indicate the modifications required for real-world applications. The Minimum Expected Cost of Misclassification Method
Let !; (x) be the density associated with population TT; , i = 1, 2, . . . , g. [For the most part, we shall take f; (x) to be a multivariate normal density, but this is unnec essary for the development of the general theory.] Let
P ; = the prior probability of population TT; ,
l=
1 , 2, . . . ' g
c(k I i) = the cost of allocating an item to TTk when, in fact, it belongs to TT; , for k , i = 1, 2, . . . , g For k = i, c (i l i ) = 0. Finally, let Rk be the set of x ' s classified as TTk and P (k l i )
= P (classifying item as TTk i 'TT; ) =
J !; (x) dx Rk
9 Here robustness refers to the deterioration in error rates caused by using a classification pro cedure with data that do not conform to the assumptions on which the procedure was based. It is very difficult to study the robustness of classification procedures analytically. However, data from a wide variety of distributions with different covariance structures can be easily generated on a computer. The performance of various classification rules can then be evaluated using computer-gen erated "samples" from these distributions.
666
Chap. 1 1
Discri m i n ation and Classification
for k , i = 1, 2, . . . , g with P (i l i ) = 1 -
7Tg
g
2: P (k i i ) .
k=l k*i
The conditional expected cost of misclassifying an x from 1r 1 into 1r2 , or 1r3 , . . . , or is
g
+
ECM ( 1 ) = P (2 i 1 ) c (2 1 1)
P (3 i 1 ) c (3 i 1 )
+ ... +
P (g l 1 )c (g l 1 )
= 2: P ( k l 1 )c ( k l 1 ) k=2
This conditional expected cost occurs with prior probability p1 , the probability of 1r1 . In a similar manner, we can obtain the conditional expected costs of misclas sification ECM ( 2) , . . . , ECM(g). Multiplying each conditional ECM by its prior probability and summing gives the overall ECM: ECM = p1 ECM (1 )
=
+
p 2 ECM (2)
P1 ( �2 P (k i 1 )c (k i 1 ) )
+
+ ... +
PgECM (g)
p2
P (k i 2 ) c (k i 2)
(�
1
Pg (:t: P (k i g)c (k i g) ) = �P; � P (k i i )c (k i i ) 1 + ... +
(
k*2
)
k*i
) (11-37)
g
Determining an optimal classification procedure amounts to choosing the mutually exclusive and exhaustive classification regions R 1 , R2 , . . . , R such that (11-37) is a minimum.
g
7T
=
Result 1 1 .5. The classification regions that minimize the ECM (11-37) are defined by allocating x to that population k , k 1, 2, . . . , g, for which
2: p;.f; (x)c( k l i )
i= l i*k
(1 1-38)
is smallest. If a tie occurs, x can be assigned to any of the tied populations. Proof.
See Anderson [2].
•
Sec. 1 1 .6
667
Classification with Severa l Popu lations
Suppose all the misclassification costs are equal, in which case the minimum expected cost of misclassification rule is the minimum total probability of mis classification rule. (Without loss of generality, we can set all the misclassification costs equal to 1.) Using the argument leading to (11-38), we would allocate x to that population 7Tk, k = 1, 2, . . . , g, for which g
2: pJ;(x) ii = k * I
(11-39)
is smallest. Now, (11-39) will be smallest when the omitted term, P k fk (x), is largest. Consequently, when the misclassification costs are the same, the minimum expected cost of misclassification rule has the following rather simple form. M I N I M U M ECM CLASSIFICATION RULE WITH · EQUAL MISCLASSIFICAT1(1)N CO$\TS
Allocate x to
or,
7Tk
if
equivalently,
Allocate 1 to
7Tk
ln pdk (x)
>
ln p;.t;(x) for.all i =I= k
It is interesting to note that the classification rule in ( 1 1 -40) is identical to the one that maximizes the "posterior" probability P( 7Tk J x) P ( x comes from 7Tk given that x was observed) , where P ( 7rk J x ) =
;dk (x)
; (x) i2: = P .f; I
(prior) X (likelihood) for k = 1 , 2, . . . , g I [(prior) X (likelihood) ] (11-42)
Equation (11-42) is the generalization of Equation (1 1-9) to g ;;;;: 2 groups. You should keep in mind that, in general, the minimum ECM rules have three components: prior probabilities, misclassification costs, and density functions. These components must be specified (or estimated) before the rules can be implemented.
668
Chap. 1 1
Discri m i nation and Classification
Example 1 1 .9 (Classifying a new observation into one of three known populations)
Let us assign an observation x0 to one of the g = 3 populations 1r1 , 1r2 , or 1r3 , given the following hypothetical prior probabilities, misclassification costs, and density values: True population
7T l 7Tz 7T3
Classify as: Prior probabilities: Densities at x0:
7Tl
7Tz
7T3
c (1 / 1) = 0 c (2 / 1 ) = 10 c (3 / 1 ) = 50
c (1 / 2) = 500 c (2 / 2) = 0 c (3 / 2) = 200
c (1 / 3) = 100 c (2 / 3 ) = 50 c (3 / 3) = 0
p1 = .05 /1 (x0) = .01
Pz = .60 fz (x o ) = .85
P 3 = .35 f3 (x o ) = 2
We shall use the minimum ECM procedures.
3
The values of .2, p;/;(x0 ) c ( k / i ) [ see (11-38)] are i=l i*k
k = 1 : pz f2 (x0) c (1 / 2)
+
= (.60) (.85) (500) k = 2 : p J f1 (x0 ) c (2 / 1 )
+
= (.05) (.01 ) (10) k = 3 : p Jf1 (x0)c (3 / 1 )
+
= (.05) (.01 ) (50) 3
p JI3 (x0 ) c (1 / 3 ) +
(.35) (2) (100) = 325
p3f3 (x0 )c (2 / 3 )
+
(.35) (2) (50) = 35.055
pz f2 (x0 ) c (3 / 2)
+
(.60) (.85) (200 ) = 102.025
Since .2, P ;[; (x0) c (k / i ) is smallest for k = 2, we would allocate x0 to 7T2 . i=l i*k
If all costs of misclassification were equal, we would assign x0 according to ( 11-40 ), which requires only the products
Pd1 (x0) = (.05) (.01 ) = .0005 Pz /2 (x0 ) = (.60) (.85) = .510 P3 /3 (x0) = (.35) (2) = .700 Since
Sec. 1 1 .6
669
Classification with Several Pop u l ations
we should allocate x₀ to π₃. Equivalently, calculating the posterior probabilities [see (11-42)], we obtain
P(\pi_1 \mid x_0) = \frac{p_1 f_1(x_0)}{\sum_{i=1}^{3} p_i f_i(x_0)} = \frac{(.05)(.01)}{(.05)(.01) + (.60)(.85) + (.35)(2)} = \frac{.0005}{1.2105} = .0004

P(\pi_2 \mid x_0) = \frac{p_2 f_2(x_0)}{\sum_{i=1}^{3} p_i f_i(x_0)} = \frac{(.60)(.85)}{1.2105} = \frac{.510}{1.2105} = .421

P(\pi_3 \mid x_0) = \frac{p_3 f_3(x_0)}{\sum_{i=1}^{3} p_i f_i(x_0)} = \frac{(.35)(2)}{1.2105} = \frac{.700}{1.2105} = .578
We see that x₀ is allocated to π₃, the population with the largest posterior probability. ■
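The calculations in Example 11.9 are easy to verify numerically. The short Python sketch below (NumPy is assumed; the arrays simply restate the hypothetical priors, costs, and density values above) reproduces the three expected-cost sums and the posterior probabilities.

```python
import numpy as np

# Hypothetical inputs from Example 11.9
p = np.array([0.05, 0.60, 0.35])            # prior probabilities p_i
f = np.array([0.01, 0.85, 2.00])            # density values f_i(x0)
c = np.array([[  0, 500, 100],              # c[k, i] = cost of classifying as population k+1
              [ 10,   0,  50],              # when the observation is really from population i+1
              [ 50, 200,   0]])

# Expected cost of allocating x0 to each population [see (11-38)];
# the i = k term vanishes because the diagonal costs c(i|i) are zero
ecm = np.array([np.sum(p * f * c[k]) for k in range(3)])
print(ecm)                   # approximately [325.0, 35.005, 102.025]
print(ecm.argmin() + 1)      # 2 -> allocate x0 to pi_2

# Posterior probabilities [see (11-42)], used when all costs are equal
posterior = p * f / np.sum(p * f)
print(np.round(posterior, 4))    # approximately [.0004, .4213, .5783]
print(posterior.argmax() + 1)    # 3 -> allocate x0 to pi_3
```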
Classification with Normal Populations

An important special case occurs when the

f_i(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_i|^{1/2}} \exp\!\Big[ -\frac{1}{2}(x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) \Big],   i = 1, 2, ..., g      (11-43)
are multivariate normal densities with mean vectors μ_i and covariance matrices Σ_i. If, further, c(i|i) = 0 and c(k|i) = 1 for k ≠ i (or, equivalently, the misclassification costs are all equal), then (11-41) becomes

Allocate x to π_k if

\ln p_k f_k(x) = \ln p_k - \frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_k| - \frac{1}{2}(x - \mu_k)' \Sigma_k^{-1}(x - \mu_k) = \max_i \ln p_i f_i(x)      (11-44)

The constant (p/2) ln(2π) can be ignored in (11-44), since it is the same for all populations. We therefore define the quadratic discrimination score for the ith population to be
d_i^Q(x) = -\frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}(x - \mu_i)' \Sigma_i^{-1}(x - \mu_i) + \ln p_i,   i = 1, 2, ..., g      (11-45)
The quadratic score d_i^Q(x) is composed of contributions from the generalized variance |Σ_i|, the prior probability p_i, and the square of the distance from x to the population mean μ_i. Note, however, that a different distance function, with a different orientation and size of the constant-distance ellipsoid, must be used for each population. Using discriminant scores, we find that the classification rule (11-44) becomes the following:
MINIMUM TOTAL PROBABILITY OF MISCLASSIFICATION (TPM) RULE FOR NORMAL POPULATIONS WITH UNEQUAL Σ_i

Allocate x to π_k if

the quadratic score d_k^Q(x) = the largest of d_1^Q(x), d_2^Q(x), ..., d_g^Q(x)      (11-46)

where d_i^Q(x) is given by (11-45).
In practice, the μ_i and Σ_i are unknown, but a training set of correctly classified observations is often available for the construction of estimates. The relevant sample quantities for population π_i are

x̄_i = sample mean vector
S_i = sample covariance matrix
n_i = sample size

The estimate of the quadratic discrimination score d_i^Q(x) is then
\hat{d}_i^Q(x) = -\frac{1}{2}\ln|S_i| - \frac{1}{2}(x - \bar{x}_i)' S_i^{-1}(x - \bar{x}_i) + \ln p_i,   i = 1, 2, ..., g      (11-47)
and the classification rule based on the sample is as follows:
ESTIMATED MINIMUM (TPM) RULE FOR SEVERAL NORMAL POPULATIONS WITH UNEQUAL Σ_i

Allocate x to π_k if

the quadratic score \hat{d}_k^Q(x) = the largest of \hat{d}_1^Q(x), \hat{d}_2^Q(x), ..., \hat{d}_g^Q(x)      (11-48)

where \hat{d}_i^Q(x) is given by (11-47).
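A minimal Python sketch of the estimated quadratic rule (11-47)-(11-48) follows; the function and variable names are my own and NumPy is assumed. Each population contributes its sample mean, sample covariance matrix, and prior probability, and a new observation is assigned to the population with the largest estimated quadratic score.

```python
import numpy as np

def quadratic_score(x, xbar, S, prior):
    """Estimated quadratic discriminant score, equation (11-47)."""
    dev = x - xbar
    return (-0.5 * np.log(np.linalg.det(S))
            - 0.5 * dev @ np.linalg.inv(S) @ dev
            + np.log(prior))

def classify_quadratic(x, means, covs, priors):
    """Estimated minimum TPM rule (11-48): allocate x to the largest score."""
    scores = [quadratic_score(x, m, S, p) for m, S, p in zip(means, covs, priors)]
    return int(np.argmax(scores)) + 1, scores   # 1-based population label and the scores
```

When the covariance matrices are taken to be equal, the log-determinant and the purely quadratic term in x become common to all populations, and the same scores reduce to the linear rule developed next.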
A simplification is possible if the population covariance matrices, Σ_i, are equal. When Σ_i = Σ for i = 1, 2, ..., g, the discriminant score in (11-45) becomes

d_i^Q(x) = -\frac{1}{2}\ln|\Sigma| - \frac{1}{2} x' \Sigma^{-1} x + \mu_i' \Sigma^{-1} x - \frac{1}{2} \mu_i' \Sigma^{-1} \mu_i + \ln p_i

The first two terms are the same for d_1^Q(x), d_2^Q(x), ..., d_g^Q(x), and, consequently, they can be ignored for allocative purposes. The remaining terms consist of a constant c_i = \ln p_i - \frac{1}{2}\mu_i' \Sigma^{-1}\mu_i and a linear combination of the components of x. Next, define the linear discriminant score
d_i(x) = \mu_i' \Sigma^{-1} x - \frac{1}{2}\mu_i' \Sigma^{-1} \mu_i + \ln p_i   for i = 1, 2, ..., g      (11-49)
An estimate \hat{d}_i(x) of the linear discriminant score d_i(x) is based on the pooled estimate of Σ,

S_{pooled} = \frac{1}{n_1 + n_2 + \cdots + n_g - g}\big( (n_1 - 1)S_1 + (n_2 - 1)S_2 + \cdots + (n_g - 1)S_g \big)      (11-50)

and is given by

\hat{d}_i(x) = \bar{x}_i' S_{pooled}^{-1} x - \frac{1}{2}\bar{x}_i' S_{pooled}^{-1} \bar{x}_i + \ln p_i      (11-51)

for i = 1, 2, ..., g. Consequently, we have the following:
ESTIMATED MINIMUM TPM RULE FOR EQUAL-COVARIANCE NORMAL POPULATIONS

Allocate x to π_k if

the linear discriminant score \hat{d}_k(x) = the largest of \hat{d}_1(x), \hat{d}_2(x), ..., \hat{d}_g(x)      (11-52)

with \hat{d}_i(x) given by (11-51).
Comment. Expression (11-49) is a convenient linear function of x. An equivalent classifier for the equal-covariance case can be obtained from (11-45) by ignoring the constant term, -\frac{1}{2}\ln|\Sigma|. The result, with sample estimates inserted for unknown population quantities, can then be interpreted in terms of the squared distances

D_i^2(x) = (x - \bar{x}_i)' S_{pooled}^{-1} (x - \bar{x}_i)      (11-53)

from x to the sample mean vector \bar{x}_i. The allocatory rule is then

Assign x to the population π_i for which -\frac{1}{2} D_i^2(x) + \ln p_i is largest      (11-54)
We see that this rule, or, equivalently, (11-52), assigns x to the "closest" population. (The distance measure is penalized by ln p_i.) If the prior probabilities are unknown, the usual procedure is to set p_1 = p_2 = ··· = p_g = 1/g. An observation is then assigned to the closest population.

Example 11.10 (Calculating sample discriminant scores, assuming a common covariance matrix)
Let us calculate the linear discriminant scores based on data from g = 3 populations assumed to be bivariate normal with a common covariance matrix. Random samples from the populations π₁, π₂, and π₃, along with the sample mean vectors and covariance matrices, are as follows:

π₁ (n₁ = 3):  X_1 = \begin{bmatrix} -2 & 5 \\ 0 & 3 \\ -1 & 1 \end{bmatrix},   \bar{x}_1 = \begin{bmatrix} -1 \\ 3 \end{bmatrix},   S_1 = \begin{bmatrix} 1 & -1 \\ -1 & 4 \end{bmatrix}

π₂ (n₂ = 3):  X_2 = \begin{bmatrix} 0 & 6 \\ 2 & 4 \\ 1 & 2 \end{bmatrix},   \bar{x}_2 = \begin{bmatrix} 1 \\ 4 \end{bmatrix},   S_2 = \begin{bmatrix} 1 & -1 \\ -1 & 4 \end{bmatrix}

π₃ (n₃ = 3):  X_3 = \begin{bmatrix} 1 & -2 \\ 0 & 0 \\ -1 & -4 \end{bmatrix},   \bar{x}_3 = \begin{bmatrix} 0 \\ -2 \end{bmatrix},   S_3 = \begin{bmatrix} 1 & 1 \\ 1 & 4 \end{bmatrix}
Given that p₁ = p₂ = .25 and p₃ = .50, let us classify the observation x₀' = [x₀₁, x₀₂] = [-2, -1] according to (11-52). From (11-50),

S_{pooled} = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2 + (n_3 - 1)S_3}{n_1 + n_2 + n_3 - 3} = \frac{2}{6}\begin{bmatrix} 1+1+1 & -1-1+1 \\ -1-1+1 & 4+4+4 \end{bmatrix} = \begin{bmatrix} 1 & -\frac{1}{3} \\ -\frac{1}{3} & 4 \end{bmatrix}

so

S_{pooled}^{-1} = \frac{1}{35}\begin{bmatrix} 36 & 3 \\ 3 & 9 \end{bmatrix}

Next,

\bar{x}_1' S_{pooled}^{-1} = [-1 \;\; 3]\,\frac{1}{35}\begin{bmatrix} 36 & 3 \\ 3 & 9 \end{bmatrix} = \frac{1}{35}[-27 \;\; 24]

and

\bar{x}_1' S_{pooled}^{-1}\bar{x}_1 = \frac{1}{35}[-27 \;\; 24]\begin{bmatrix} -1 \\ 3 \end{bmatrix} = \frac{99}{35}

so
\hat{d}_1(x_0) = \ln(.25) + \Big(\frac{-27}{35}\Big) x_{01} + \Big(\frac{24}{35}\Big) x_{02} - \frac{1}{2}\Big(\frac{99}{35}\Big)

Notice the linear form of \hat{d}_1(x_0) = constant + (constant)x₀₁ + (constant)x₀₂. In a similar manner,

\bar{x}_2' S_{pooled}^{-1} = [1 \;\; 4]\,\frac{1}{35}\begin{bmatrix} 36 & 3 \\ 3 & 9 \end{bmatrix} = \frac{1}{35}[48 \;\; 39],  \qquad \bar{x}_2' S_{pooled}^{-1}\bar{x}_2 = \frac{204}{35}

and

\hat{d}_2(x_0) = \ln(.25) + \Big(\frac{48}{35}\Big)x_{01} + \Big(\frac{39}{35}\Big)x_{02} - \frac{1}{2}\Big(\frac{204}{35}\Big)

Finally,

\bar{x}_3' S_{pooled}^{-1} = [0 \;\; -2]\,\frac{1}{35}\begin{bmatrix} 36 & 3 \\ 3 & 9 \end{bmatrix} = \frac{1}{35}[-6 \;\; -18],  \qquad \bar{x}_3' S_{pooled}^{-1}\bar{x}_3 = \frac{36}{35}

and

\hat{d}_3(x_0) = \ln(.50) + \Big(\frac{-6}{35}\Big)x_{01} + \Big(\frac{-18}{35}\Big)x_{02} - \frac{1}{2}\Big(\frac{36}{35}\Big)

Substituting the numerical values x₀₁ = -2 and x₀₂ = -1 gives

\hat{d}_1(x_0) = -1.386 + \Big(\frac{-27}{35}\Big)(-2) + \Big(\frac{24}{35}\Big)(-1) - \frac{99}{70} = -1.943

\hat{d}_2(x_0) = -1.386 + \Big(\frac{48}{35}\Big)(-2) + \Big(\frac{39}{35}\Big)(-1) - \frac{204}{70} = -8.158

\hat{d}_3(x_0) = -.693 + \Big(\frac{-6}{35}\Big)(-2) + \Big(\frac{-18}{35}\Big)(-1) - \frac{36}{70} = -.350

Since \hat{d}_3(x_0) = -.350 is the largest discriminant score, we allocate x₀ to π₃. ■
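Example 11.10's hand computation is easily reproduced in code. The Python sketch below (NumPy assumed; the arrays restate the sample means, covariance matrices, and priors above) builds S_pooled from (11-50), evaluates the linear scores (11-51) at x₀' = [-2, -1], and recovers the values -1.943, -8.158, and -.350 up to rounding.

```python
import numpy as np

# Sample summaries from Example 11.10 (g = 3 bivariate populations, n_i = 3)
xbar = [np.array([-1., 3.]), np.array([1., 4.]), np.array([0., -2.])]
S    = [np.array([[1., -1.], [-1., 4.]]),
        np.array([[1., -1.], [-1., 4.]]),
        np.array([[1.,  1.], [ 1., 4.]])]
n      = [3, 3, 3]
priors = [0.25, 0.25, 0.50]

# Pooled covariance matrix, equation (11-50)
S_pooled = sum((ni - 1) * Si for ni, Si in zip(n, S)) / (sum(n) - len(n))
S_inv = np.linalg.inv(S_pooled)

def d_hat(x, xb, prior):
    """Linear discriminant score, equation (11-51)."""
    return xb @ S_inv @ x - 0.5 * xb @ S_inv @ xb + np.log(prior)

x0 = np.array([-2., -1.])
scores = [d_hat(x0, xb, p) for xb, p in zip(xbar, priors)]
print(np.round(scores, 3))     # approximately [-1.943, -8.158, -0.350]
print(np.argmax(scores) + 1)   # 3 -> allocate x0 to pi_3
```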
Example 11.11 (Classifying a potential business-school graduate student)

The admission officer of a business school has used an "index" of undergraduate grade point average (GPA) and graduate management aptitude test (GMAT) scores to help decide which applicants should be admitted to the school's graduate programs. Figure 11.9 shows pairs of x₁ = GPA, x₂ = GMAT values for groups of recent applicants who have been categorized as π₁: admit; π₂: do not admit; and π₃: borderline.¹⁰

¹⁰ In this case, the populations are artificial in the sense that they have been created by the admissions officer. On the other hand, experience has shown that applicants with high GPA and high GMAT scores generally do well in a graduate program; those with low readings on these variables generally experience difficulty.
[Figure 11.9: Scatter plot of (x₁ = GPA, x₂ = GMAT) for applicants to a graduate school of business who have been classified as admit (A), do not admit (B), or borderline (C).]
The data pictured are listed in Table 11.6. (See Exercise 11.29.) These data yield (see the SAS statistical software output in Panel 11.1)

n_1 = 31,  \bar{x}_1 = \begin{bmatrix} 3.40 \\ 561.23 \end{bmatrix}   \qquad n_2 = 28,  \bar{x}_2 = \begin{bmatrix} 2.48 \\ 447.07 \end{bmatrix}   \qquad n_3 = 26,  \bar{x}_3 = \begin{bmatrix} 2.99 \\ 446.23 \end{bmatrix}

\bar{x} = \begin{bmatrix} 2.97 \\ 488.45 \end{bmatrix},  \qquad S_{pooled} = \begin{bmatrix} .0361 & -2.0188 \\ -2.0188 & 3655.9011 \end{bmatrix}
Suppose a new applicant has an undergraduate GPA of x₁ = 3.21 and a GMAT score of x₂ = 497. Let us classify this applicant using the rule in (11-54) with equal prior probabilities. With x₀' = [3.21, 497], the sample distances are

D_1^2(x_0) = (x_0 - \bar{x}_1)' S_{pooled}^{-1} (x_0 - \bar{x}_1)
           = [3.21 - 3.40, \; 497 - 561.23] \begin{bmatrix} 28.6096 & .0158 \\ .0158 & .0003 \end{bmatrix} \begin{bmatrix} 3.21 - 3.40 \\ 497 - 561.23 \end{bmatrix} = 2.58
PANEL 11.1  SAS ANALYSIS FOR ADMISSION DATA USING PROC DISCRIM.

PROGRAM COMMANDS

title 'Discriminant Analysis';
data gpa;
  infile 'T11-6.dat';
  input admit $ gpa gmat;
proc discrim data = gpa method = normal pool = yes
  manova wcov pcov listerr crosslisterr;
  priors 'admit' = .3333 'notadmit' = .3333 'border' = .3333;
  class admit;
  var gpa gmat;

OUTPUT

DISCRIMINANT ANALYSIS     85 Observations    84 DF Total
                           2 Variables       82 DF Within Classes
                           3 Classes          2 DF Between Classes

Class Level Information
ADMIT       Frequency    Weight      Proportion    Prior Probability
admit          31        31.0000      0.364706        0.333333
border         26        26.0000      0.305882        0.333333
notadmit       28        28.0000      0.329412        0.333333

Within-Class Covariance Matrices

ADMIT = admit,  DF = 30
Variable        GPA           GMAT
GPA          0.043558       0.058097
GMAT         0.058097    4618.247312

ADMIT = border,  DF = 25
Variable        GPA           GMAT
GPA          0.029692      -5.403846
GMAT        -5.403846    2246.904615

ADMIT = notadmit,  DF = 27
Variable        GPA           GMAT
GPA          0.033649      -1.192037
GMAT        -1.192037    3891.253968

Pooled Within-Class Covariance Matrix,  DF = 82
Variable        GPA           GMAT
GPA          0.036068      -2.018759
GMAT        -2.018759    3655.901121

(continued)
D_2^2(x_0) = (x_0 - \bar{x}_2)' S_{pooled}^{-1} (x_0 - \bar{x}_2) = 17.10

D_3^2(x_0) = (x_0 - \bar{x}_3)' S_{pooled}^{-1} (x_0 - \bar{x}_3) = 2.47

Since the distance from x₀' = [3.21, 497] to the group mean x̄₃ is smallest, we assign this applicant to π₃, borderline. ■
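The three squared distances, and the posterior probabilities that SAS reports in Panel 11.1, can be checked directly from the summary statistics above. A small Python sketch (NumPy assumed; the means and pooled covariance matrix are copied from Example 11.11) follows; small discrepancies from the printed values reflect rounding in the reported summary statistics.

```python
import numpy as np

xbar = [np.array([3.40, 561.23]),     # pi_1: admit
        np.array([2.48, 447.07]),     # pi_2: do not admit
        np.array([2.99, 446.23])]     # pi_3: borderline
S_pooled = np.array([[ 0.0361,   -2.0188],
                     [-2.0188, 3655.9011]])
S_inv = np.linalg.inv(S_pooled)

x0 = np.array([3.21, 497.0])
D2 = np.array([(x0 - m) @ S_inv @ (x0 - m) for m in xbar])
print(np.round(D2, 2))        # close to [2.58, 17.10, 2.47] reported above
print(np.argmin(D2) + 1)      # 3 -> classify as borderline

# Posterior probabilities as defined in Panel 11.1: exp(-.5 D_j^2) / sum_k exp(-.5 D_k^2)
post = np.exp(-0.5 * D2)
print(np.round(post / post.sum(), 3))   # borderline again has the largest posterior
```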
PANEL 11.1  (continued)

Multivariate Statistics and F Approximations,  S = 2,  M = -0.5,  N = 39.5
Statistic                  Value         F Value    Num DF   Den DF   Pr > F
Wilks' Lambda            0.12637661      73.4257       4       162    0.0001
Pillai's Trace           1.00963002      41.7973       4       164    0.0001
Hotelling-Lawley Trace   5.83665601     116.7331       4       160    0.0001
Roy's Greatest Root      5.64604452     231.4878       2        82    0.0001
NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.

LINEAR DISCRIMINANT FUNCTION
Constant = -.5 X̄_i' COV^{-1} X̄_i + ln PRIOR_i      Coefficient Vector = COV^{-1} X̄_i
ADMIT            admit         border        notadmit
CONSTANT      -241.47030     -178.41437     -134.99753
GPA            106.24991       92.66953       78.08637
GMAT             0.21218        0.17323        0.16541

Classification Results for Calibration Data: WORK.GPA
Resubstitution Results using Linear Discriminant Function
Generalized Squared Distance Function: D_i^2(X) = (X - X̄_i)' COV^{-1} (X - X̄_i)
Posterior Probability of Membership in each ADMIT: Pr(j|X) = exp(-.5 D_j^2(X)) / SUM_k exp(-.5 D_k^2(X))

                              Posterior Probability of Membership in ADMIT:
Obs   From ADMIT   Classified into ADMIT     admit    border   notadmit
  2     admit          border *              0.1202   0.8778    0.0020
  3     admit          border *              0.3654   0.6342    0.0004
 24     admit          border *              0.4766   0.5234    0.0000
 31     admit          border *              0.2964   0.7032    0.0004
 58     notadmit       border *              0.0001   0.7550    0.2450
 59     notadmit       border *              0.0001   0.8673    0.1326
 66     border         admit *               0.5336   0.4664    0.0000

* Misclassified observation

(continued)
PANEL 11.1  (continued)

Classification Summary for Calibration Data: WORK.GPA
Cross-validation Summary using Linear Discriminant Function
Generalized Squared Distance Function: D_i^2(X) = (X - X̄_(X)i)' COV_(X)^{-1} (X - X̄_(X)i)
Posterior Probability of Membership in each ADMIT: Pr(j|X) = exp(-.5 D_j^2(X)) / SUM_k exp(-.5 D_k^2(X))

Number of Observations and Percent Classified into ADMIT:
From ADMIT      admit        border       notadmit      Total
admit          26  83.87     5  16.13     0   0.00     31  100.00
border          1   3.85    24  92.31     1   3.85     26  100.00
notadmit        0   0.00     2   7.14    26  92.86     28  100.00
Total          27  31.76    31  36.47    27  31.76     85  100.00
Priors         0.3333       0.3333       0.3333

Error Count Estimates for ADMIT:
            admit     border    notadmit    Total
Rate        0.1613    0.0769     0.0714     0.1032
Priors      0.3333    0.3333     0.3333
for all i = 1, 2, . . . , g. Adding - ln (p dpi ) = ln ( p )pk ) to both sides of the preceding inequality gives the alternative form of the classification rule that minimizes the total proba bility of misclassification. Thus, we
Allocate x to π_k if

(\mu_k - \mu_i)' \Sigma^{-1} x - \frac{1}{2}(\mu_k - \mu_i)' \Sigma^{-1}(\mu_k + \mu_i) \ge \ln\Big(\frac{p_i}{p_k}\Big)      (11-55)
for all i = 1, 2, ..., g. Now, denote the left-hand side of (11-55) by d_{ki}(x). Then the conditions in (11-55) define classification regions R_1, R_2, ..., R_g, which are separated by (hyper)planes. This follows because d_{ki}(x) is a linear combination of the components of x. For example, when g = 3, the classification region R_1 consists of all x satisfying

d_{1i}(x) \ge \ln\Big(\frac{p_i}{p_1}\Big)   for i = 2, 3

That is, R_1 consists of those x for which

d_{12}(x) = (\mu_1 - \mu_2)' \Sigma^{-1} x - \frac{1}{2}(\mu_1 - \mu_2)' \Sigma^{-1}(\mu_1 + \mu_2) \ge \ln\Big(\frac{p_2}{p_1}\Big)

and, simultaneously,

d_{13}(x) = (\mu_1 - \mu_3)' \Sigma^{-1} x - \frac{1}{2}(\mu_1 - \mu_3)' \Sigma^{-1}(\mu_1 + \mu_3) \ge \ln\Big(\frac{p_3}{p_1}\Big)

Assuming that μ₁, μ₂, and μ₃ do not lie along a straight line, the equations d_{12}(x) = ln(p₂/p₁) and d_{13}(x) = ln(p₃/p₁) define two intersecting hyperplanes that delineate R_1 in the p-dimensional variable space. The term ln(p₂/p₁) places the plane closer to μ₁ than μ₂ if p₂ is greater than p₁. The regions R_1, R_2, and R_3 are shown in Figure 11.10 for the case of two variables. The picture is the same for more variables if we graph the plane that contains the three mean vectors.
[Figure 11.10: The classification regions R₁, R₂, and R₃ for the linear minimum TPM rule (p₁ = ¼, p₂ = ½, p₃ = ¼).]
The sample version of the alternative form in (11-55) is obtained by substituting x̄_i for μ_i and inserting the pooled sample covariance matrix S_pooled for Σ. When \sum_{i=1}^{g}(n_i - 1) \ge p, so that S_{pooled}^{-1} exists, this sample analog becomes

Allocate x to π_k if

\hat{d}_{ki}(x) = (\bar{x}_k - \bar{x}_i)' S_{pooled}^{-1} x - \frac{1}{2}(\bar{x}_k - \bar{x}_i)' S_{pooled}^{-1}(\bar{x}_k + \bar{x}_i) \ge \ln\Big(\frac{p_i}{p_k}\Big)   for all i ≠ k      (11-56)

Given the fixed training set values x̄_i and S_pooled, \hat{d}_{ki}(x) is a linear function of the components of x. Therefore, the classification regions defined by (11-56), or, equivalently, by (11-52), are also bounded by hyperplanes, as in Figure 11.10. As with the sample linear discriminant rule of (11-52), if the prior probabilities are difficult to assess, they are frequently all taken to be equal. In this case, ln(p_i/p_k) = 0 for all pairs.

Because they employ estimates of population parameters, the sample classification rules (11-48) and (11-52) may no longer be optimal. Their performance, however, can be evaluated using Lachenbruch's holdout procedure. If n_{iM}^{(H)} is the number of misclassified holdout observations in the ith group, i = 1, 2, ..., g, then an estimate of the expected actual error rate, E(AER), is provided by
\hat{E}(\text{AER}) = \frac{\sum_{i=1}^{g} n_{iM}^{(H)}}{\sum_{i=1}^{g} n_i}      (11-57)
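A minimal Python sketch of Lachenbruch's holdout (leave-one-out) estimate of E(AER) for the linear rule is given below. NumPy is assumed and the function names are my own; the loop simply refits the summary statistics with one observation held out at a time, classifies that observation with the linear scores of (11-51), and counts the misclassifications n_{iM}^{(H)}.

```python
import numpy as np

def pooled_cov(groups):
    """S_pooled of (11-50) from a list of (n_i x p) data arrays."""
    W = sum((len(X) - 1) * np.cov(X, rowvar=False) for X in groups)
    return W / (sum(len(X) for X in groups) - len(groups))

def linear_scores(x, means, S_inv, priors):
    """Linear discriminant scores (11-51) for one observation."""
    return [m @ S_inv @ x - 0.5 * m @ S_inv @ m + np.log(p)
            for m, p in zip(means, priors)]

def holdout_error_rate(groups, priors):
    """Lachenbruch's estimate of E(AER), equation (11-57)."""
    n_mis, n_total = 0, 0
    for i, X in enumerate(groups):
        for j in range(len(X)):
            # Hold out observation j of group i and refit the summaries
            reduced = [np.delete(X, j, axis=0) if k == i else G
                       for k, G in enumerate(groups)]
            means = [G.mean(axis=0) for G in reduced]
            S_inv = np.linalg.inv(pooled_cov(reduced))
            scores = linear_scores(X[j], means, S_inv, priors)
            n_mis += (int(np.argmax(scores)) != i)
            n_total += 1
    return n_mis / n_total
```

This is the same kind of leave-one-out calculation that produced the cross-validation summary in Panel 11.1 for the admission data.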
Example 11.12 (Effective classification with fewer variables)

In his pioneering work on discriminant functions, Fisher [8] presented an analysis of data collected by Anderson [1] on three species of iris flowers. (See Table 11.5, Exercise 11.27.) Let the classes be defined as

π₁: Iris setosa;  π₂: Iris versicolor;  π₃: Iris virginica

The following four variables were measured from 50 plants of each species.

X₁ = sepal length,  X₂ = sepal width
X₃ = petal length,  X₄ = petal width
Using all the data in Table 11.5, a linear discriminant analysis produced the confusion matrix
                                        Predicted membership
                          π₁: Setosa   π₂: Versicolor   π₃: Virginica   Percent correct
Actual       π₁: Setosa       50              0                0              100
membership   π₂: Versicolor    0             48                2               96
             π₃: Virginica     0              1               49               98

The elements in this matrix were generated using the holdout procedure, so [see (11-57)]

\hat{E}(\text{AER}) = \frac{3}{150} = .02
The error rate, 2%, is low. Often, it is possible to achieve effective classification with fewer variables. It is good practice to try all the variables one at a time, two at a time, three at a time, and so forth, to see how well they classify compared to the discriminant function, which uses all the variables. If we adopt the holdout estimate of the expected AER as our criterion, we find for the data on irises:

Single variable       Misclassification rate
X₁                           .253
X₂                           .480
X₃                           .053
X₄                           .040

Pairs of variables    Misclassification rate
X₁, X₂                       .207
X₁, X₃                       .040
X₁, X₄                       .040
X₂, X₃                       .047
X₂, X₄                       .040
X₃, X₄                       .040
We see that the single variable X₄ = petal width does a very good job of distinguishing the three species of iris. Moreover, very little is gained by including more variables. Box plots of X₄ = petal width are shown in Figure 11.11 for the three species of iris.

[Figure 11.11: Box plots of petal width for the three species of iris.]
It is clear from the figure that petal width separates the three groups quite well, with, for example, the petal widths for Iris setosa much smaller than the petal widths for Iris virginica. Darroch and Mosimann [6] have suggested that these species of iris may be discriminated on the basis of "shape" or scale-free information alone. Let Y₁ = X₁/X₂ be the sepal shape and Y₂ = X₃/X₄ be the petal shape. The use of the variables Y₁ and Y₂ for discrimination is explored in Exercise 11.28.

The selection of appropriate variables to use in a discriminant analysis is often difficult. A summary such as the one in this example allows the investigator to make reasonable and simple choices based on the ultimate criteria of how well the procedure classifies its target objects. ■
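The 2% holdout error rate for the full four-variable rule is easy to reproduce with standard software. A sketch using scikit-learn (assumed to be installed; its bundled copy of the Anderson iris data is used in place of Table 11.5) follows, with leave-one-out cross-validation playing the role of Lachenbruch's holdout procedure.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)      # 150 flowers, 4 measurements, 3 species
lda = LinearDiscriminantAnalysis()     # priors default to the observed class proportions (equal here)

# Leave-one-out accuracy; 1 - accuracy estimates E(AER)
acc = cross_val_score(lda, X, y, cv=LeaveOneOut()).mean()
print(round(1 - acc, 3))               # should be close to the .02 reported above
```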
Our discussion has tended to emphasize the linear discriminant rule of (11-52) or (11-56), and many commercial computer programs are based upon it. Although the linear discriminant rule has a simple structure, you must remember that it was derived under the rather strong assumptions of multivariate normality and equal covariances. Before implementing a linear classification rule, these tentative assumptions should be checked in the order multivariate normality and then equality of covariances. If one or both of these assumptions is violated, improved classification may be possible if the data are first suitably transformed.

The quadratic rules are an alternative to classification with linear discriminant functions. They are appropriate if normality appears to hold, but the assumption of equal covariance matrices is seriously violated. However, the assumption of normality seems to be more critical for quadratic rules than linear rules. If doubt exists as to the appropriateness of a linear or quadratic rule, both rules can be constructed and their error rates examined using Lachenbruch's holdout procedure.

11.7 FISHER'S METHOD FOR DISCRIMINATING AMONG SEVERAL POPULATIONS
Fisher also proposed an extension of his discriminant method, discussed in Section 11.5, to several populations. The motivation behind the Fisher discriminant analysis is the need to obtain a reasonable representation of the populations that involves only a few linear combinations of the observations, such as a₁'x, a₂'x, and a₃'x. His approach has several advantages when one is interested in separating several populations for (1) visual inspection or (2) graphical descriptive purposes. It allows for the following:

1. Convenient representations of the g populations that reduce the dimension from a very large number of characteristics to a relatively few linear combinations. Of course, some information, needed for optimal classification, may be lost, unless the population means lie completely in the lower dimensional space selected.
2. Plotting of the means of the first two or three linear combinations (discriminants). This helps display the relationships and possible groupings of the populations.
3. Scatter plots of the sample values of the first two discriminants, which can indicate outliers or other abnormalities in the data.

The primary purpose of Fisher's discriminant analysis is to separate populations. It can, however, also be used to classify, and we shall indicate this use. It is not necessary to assume that the g populations are multivariate normal. However, we do assume that the p × p population covariance matrices are equal and of full rank.¹¹ That is, Σ₁ = Σ₂ = ··· = Σ_g = Σ. Let μ̄ denote the mean vector of the combined populations and B_μ the between groups sums of cross products, so that
B_\mu = \sum_{i=1}^{g} (\mu_i - \bar{\mu})(\mu_i - \bar{\mu})'   \quad \text{where} \quad \bar{\mu} = \frac{1}{g}\sum_{i=1}^{g}\mu_i      (11-58)

We consider the linear combination

Y = a'X
which has expected value

E(Y) = a'E(X \mid \pi_i) = a'\mu_i   for population π_i

and variance

\text{Var}(Y) = a'\,\text{Cov}(X)\,a = a'\Sigma a   for all populations

¹¹ If not, we let P = [e₁, ..., e_q] be the eigenvectors of Σ corresponding to nonzero eigenvalues [λ₁, ..., λ_q]. Then we replace X by P'X, which has a full-rank covariance matrix P'ΣP.
Consequently, the expected value μ_{iY} = a'μ_i changes as the population from which X is selected changes. We first define the overall mean

\bar{\mu}_Y = \frac{1}{g}\sum_{i=1}^{g}\mu_{iY} = \frac{1}{g}\sum_{i=1}^{g} a'\mu_i = a'\Big(\frac{1}{g}\sum_{i=1}^{g}\mu_i\Big) = a'\bar{\mu}

and form the ratio

\frac{\left(\begin{array}{c}\text{Sum of squared distances from}\\ \text{populations to overall mean of } Y\end{array}\right)}{(\text{Variance of } Y)} = \frac{\sum_{i=1}^{g}(a'\mu_i - a'\bar{\mu})^2}{a'\Sigma a} = \frac{a'\Big(\sum_{i=1}^{g}(\mu_i - \bar{\mu})(\mu_i - \bar{\mu})'\Big)a}{a'\Sigma a}

or

\frac{a' B_\mu a}{a'\Sigma a}      (11-59)
The ratio in (11-59) measures the variability between the groups of Y-values relative to the common variability within groups. We can then select a to maximize this ratio.

Ordinarily, Σ and the μ_i are unavailable, but we have a training set consisting of correctly classified observations. Suppose the training set consists of a random sample of size n_i from population π_i, i = 1, 2, ..., g. Denote the n_i × p data set, from population π_i, by X_i and its jth row by x_ij'. After first constructing the sample mean vectors

\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}

and the covariance matrices S_i, i = 1, 2, ..., g, we define the "overall average" vector
\bar{x} = \frac{\sum_{i=1}^{g} n_i \bar{x}_i}{\sum_{i=1}^{g} n_i} = \frac{\sum_{i=1}^{g}\sum_{j=1}^{n_i} x_{ij}}{\sum_{i=1}^{g} n_i}

which is the p × 1 vector average taken over all of the sample observations in the training set.

Next, we define the sample between groups matrix B, which includes the sample sizes. Let

B = \sum_{i=1}^{g} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})'      (11-60)

Also, an estimate of Σ is based on the sample within groups matrix

W = \sum_{i=1}^{g} (n_i - 1) S_i = \sum_{i=1}^{g}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)'      (11-61)

Consequently, W/(n₁ + n₂ + ··· + n_g - g) = S_pooled is the estimate of Σ. Before presenting the sample discriminants, we note that W is the constant (n₁ + n₂ + ··· + n_g - g) times S_pooled, so the same â that maximizes â'Bâ/â'S_pooled â also maximizes â'Bâ/â'Wâ. Moreover, we can present the optimizing â in the more customary form as eigenvectors ê_i of W⁻¹B, because if W⁻¹Bê = λ̂ê then S_pooled⁻¹Bê = λ̂(n₁ + n₂ + ··· + n_g - g)ê.
FISHER'S SAMPLE LINEAR DISCRIMINANTS

Let \hat{\lambda}_1, \hat{\lambda}_2, ..., \hat{\lambda}_s > 0 denote the s ≤ min(g - 1, p) nonzero eigenvalues of W^{-1}B and \hat{e}_1, ..., \hat{e}_s be the corresponding eigenvectors (scaled so that \hat{e}' S_{pooled}\hat{e} = 1). Then the vector of coefficients \hat{a} that maximizes the ratio

\frac{\hat{a}' B \hat{a}}{\hat{a}' W \hat{a}} = \frac{\hat{a}' \Big[ \sum_{i=1}^{g} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})' \Big] \hat{a}}{\hat{a}' \Big[ \sum_{i=1}^{g}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)' \Big] \hat{a}}      (11-62)

is given by \hat{a}_1 = \hat{e}_1. The linear combination \hat{a}_1'x is called the sample first discriminant. The choice \hat{a}_2 = \hat{e}_2 produces the sample second discriminant, \hat{a}_2'x, and, continuing, we obtain \hat{a}_k'x = sample kth discriminant, k ≤ s.
Exercise 11.21 outlines the derivation of the Fisher discriminants. The discriminants will not have zero covariance for each random sample X_i. Rather, the condition

\hat{a}_j' S_{pooled}\, \hat{a}_k = \begin{cases} 1 & \text{if } j = k \le s \\ 0 & \text{otherwise} \end{cases}      (11-63)
will be satisfied. The use of S_pooled is appropriate because we tentatively assumed that the g population covariance matrices were equal.

Example 11.13 (Calculating Fisher's sample discriminants for three populations)
Consider the observations on p = 2 variables from g = 3 populations given in Example 11.10. Assuming that the populations have a common covariance matrix Σ, let us obtain the Fisher discriminants. The data are

π₁ (n₁ = 3):  X_1 = \begin{bmatrix} -2 & 5 \\ 0 & 3 \\ -1 & 1 \end{bmatrix}   π₂ (n₂ = 3):  X_2 = \begin{bmatrix} 0 & 6 \\ 2 & 4 \\ 1 & 2 \end{bmatrix}   π₃ (n₃ = 3):  X_3 = \begin{bmatrix} 1 & -2 \\ 0 & 0 \\ -1 & -4 \end{bmatrix}

In Example 11.10, we found that

\bar{x}_1 = \begin{bmatrix} -1 \\ 3 \end{bmatrix},  \quad \bar{x}_2 = \begin{bmatrix} 1 \\ 4 \end{bmatrix},  \quad \bar{x}_3 = \begin{bmatrix} 0 \\ -2 \end{bmatrix},   \quad \text{so} \quad \bar{x} = \begin{bmatrix} 0 \\ 5/3 \end{bmatrix}

and

B = \sum_{i=1}^{3} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})' = \begin{bmatrix} 6 & 3 \\ 3 & 62 \end{bmatrix}
W = \sum_{i=1}^{3}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)' = (n_1 + n_2 + n_3 - 3)\, S_{pooled} = \begin{bmatrix} 6 & -2 \\ -2 & 24 \end{bmatrix}

W^{-1} = \frac{1}{140}\begin{bmatrix} 24 & 2 \\ 2 & 6 \end{bmatrix};  \qquad W^{-1}B = \begin{bmatrix} 1.07143 & 1.4 \\ .21429 & 2.7 \end{bmatrix}

To solve for the s ≤ min(g - 1, p) = min(2, 2) = 2 nonzero eigenvalues of W^{-1}B, we must solve

|W^{-1}B - \hat{\lambda} I| = \begin{vmatrix} 1.07143 - \hat{\lambda} & 1.4 \\ .21429 & 2.7 - \hat{\lambda} \end{vmatrix} = 0

or
(1.07143 - \hat{\lambda})(2.7 - \hat{\lambda}) - (1.4)(.21429) = \hat{\lambda}^2 - 3.77143\hat{\lambda} + 2.5929 = 0

Using the quadratic formula, we find that \hat{\lambda}_1 = 2.8671 and \hat{\lambda}_2 = .9044. The normalized eigenvectors \hat{a}_1 and \hat{a}_2 are obtained by solving

(W^{-1}B - \hat{\lambda}_i I)\,\hat{a}_i = 0,   i = 1, 2

and scaling the results such that \hat{a}_i' S_{pooled}\,\hat{a}_i = 1. For example, the solution of

(W^{-1}B - \hat{\lambda}_1 I)\,\hat{a}_1 = \begin{bmatrix} 1.07143 - 2.8671 & 1.4 \\ .21429 & 2.7 - 2.8671 \end{bmatrix}\begin{bmatrix} \hat{a}_{11} \\ \hat{a}_{12} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}

is, after the normalization \hat{a}_1' S_{pooled}\,\hat{a}_1 = 1,

\hat{a}_1' = [.386 \;\; .495]

Similarly, \hat{a}_2' = [.938 \;\; -.112]. The two discriminants are

\hat{y}_1 = \hat{a}_1' x = [.386 \;\; .495]\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = .386 x_1 + .495 x_2

\hat{y}_2 = \hat{a}_2' x = [.938 \;\; -.112]\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = .938 x_1 - .112 x_2   ■
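The eigen-analysis in Example 11.13 can be checked numerically. In the Python sketch below (NumPy assumed), the eigenvectors of W⁻¹B are rescaled so that â'S_pooled â = 1, which should recover λ̂₁ ≈ 2.87 and λ̂₂ ≈ .90 with coefficient vectors close to [.386, .495] and [.938, -.112] (possibly with the opposite sign, which does not affect their use).

```python
import numpy as np

B = np.array([[6., 3.], [3., 62.]])        # between groups matrix (11-60)
W = np.array([[6., -2.], [-2., 24.]])      # within groups matrix (11-61)
S_pooled = W / (3 + 3 + 3 - 3)

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(W) @ B)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
for lam, e in zip(eigvals[order], eigvecs[:, order].T):
    a = e / np.sqrt(e @ S_pooled @ e)      # scale so that a' S_pooled a = 1
    print(round(float(lam), 4), np.round(a, 3))
# approximately: 2.8671 [0.386 0.495]  and  0.9044 [0.938 -0.112]
```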
Example 11.14 (Fisher's discriminants for the crude-oil data)
Gerrild and Lantz [12] collected crude-oil samples from sandstone in the Elk Hills, California, petroleum reserve. These crude oils can be assigned to one of the three stratigraphic units (populations)

π₁: Wilhelm sandstone
π₂: Sub-Mulinia sandstone
π₃: Upper sandstone

on the basis of their chemistry. For illustrative purposes, we consider only the five variables:

X₁ = vanadium (in percent ash)
X₂ = √iron (in percent ash)
X₃ = √beryllium (in percent ash)
X₄ = 1/[saturated hydrocarbons (in percent area)]
X₅ = aromatic hydrocarbons (in percent area)
The first three variables are trace elements, and the last two are determined from a segment of the curve produced by a gas chromatograph chemical analysis. Table 11.7 (see Exercise 11.30) gives the values of the five original variables (vanadium, iron, beryllium, saturated hydrocarbons, and aromatic hydrocarbons) for 56 cases whose population assignment was certain. A computer calculation yields the summary statistics

\bar{x}_1 = \begin{bmatrix} 3.229 \\ 6.587 \\ .303 \\ .150 \\ 11.540 \end{bmatrix},  \quad \bar{x}_2 = \begin{bmatrix} 4.445 \\ 5.667 \\ .344 \\ .157 \\ 5.484 \end{bmatrix},  \quad \bar{x}_3 = \begin{bmatrix} 7.226 \\ 4.634 \\ .598 \\ .223 \\ 5.768 \end{bmatrix},  \quad \bar{x} = \begin{bmatrix} 6.180 \\ 5.081 \\ .511 \\ .201 \\ 6.434 \end{bmatrix}

and

(n_1 + n_2 + n_3 - 3)\,S_{pooled} = (38 + 11 + 7 - 3)\,S_{pooled} = W = \begin{bmatrix} 187.575 & & & & \\ 1.957 & 141.789 & & & \\ -4.031 & 2.128 & 3.580 & & \\ 1.092 & -.143 & -.284 & .077 & \\ 79.672 & -28.243 & 2.559 & -.996 & 338.023 \end{bmatrix}
There are at most s = min(g - 1, p) = min(2, 5) = 2 positive eigenvalues of W⁻¹B, and they are 4.354 and .559. The centered Fisher linear discriminants are

\hat{y}_1 = .312(x_1 - 6.180) - .710(x_2 - 5.081) + 2.764(x_3 - .511) + 11.809(x_4 - .201) - .235(x_5 - 6.434)

\hat{y}_2 = .169(x_1 - 6.180) - .245(x_2 - 5.081) - 2.046(x_3 - .511) - 24.453(x_4 - .201) - .378(x_5 - 6.434)

The separation of the three group means is fully explained in the two-dimensional "discriminant space." The group means and the scatter of the individual observations in the discriminant coordinate system are shown in Figure 11.12. The separation is quite good. ■
[Figure 11.12: Crude-oil samples in discriminant space. Different symbols mark the Wilhelm, Sub-Mulinia, and Upper sandstone groups, and the three group mean coordinates are also indicated.]

Example 11.15 (Plotting sports data in two-dimensional discriminant space)
Investigators interested in sports psychology administered the Minnesota Multiphasic Personality Inventory (MMPI) to 670 letter winners at the University of Wisconsin in Madison. The sports involved and the coefficients in the two discriminant functions are given in Table 11.3. A plot of the group means using the first two discriminant scores is shown in Figure 11.13. Here the separation on the basis of the MMPI scores is not good, although a test for the equality of means is significant at the 5% level. (This is due to the large sample sizes.)

While the discriminant coefficients suggest that the first discriminant is most closely related to the L and Ma scales, and the second discriminant is most closely associated with the D and Pt scales, we will give the interpretation provided by the investigators.

TABLE 11.3

Sport         Sample size      MMPI Scale   First discriminant   Second discriminant
Football          158          QE                 .055                 -.098
Basketball         42          L                 -.194                  .046
Baseball           79          F                 -.047                 -.099
Crew               61          K                  .053                 -.017
Fencing            50          Hs                 .077                 -.076
Golf               28          D                  .049                  .183
Gymnastics         26          Hy                -.028                  .031
Hockey             28          Pd                 .001                 -.069
Swimming           51          Mf                -.074                 -.076
Tennis             31          Pa                 .189                  .088
Track              52          Pt                 .025                 -.188
Wrestling          64          Sc                -.046                  .088
                               Ma                -.103                  .053
                               Si                 .041                  .016

Source: W. Morgan and R. W. Johnson.

The first discriminant, which accounted for 34.4% of the common variance, was highly correlated with the Mf scale (r = -.78). The second discriminant, which accounted for an additional 18.3% of the variance, was most
6 • Swimming
.4 • Fencing • Wrestling
.2
• Tennis
Hockey • -+----+----,f---1--lf---t- First discriminant .8 .6 .2 - .4 - .2 .4 • - .8 6 • Football Track • Gymnastics • Crew • - .2 Baseball -
.
• Golf
- .4
-
Figure 1 1 . 1 3
.
• Basketball
6
The discriminant means y' = [ y1 , y2 ] for each sport.
Sec. 1 1 . 7
Fisher's M ethod for Discri m i nati ng Among Several Populations
691
highly related to scores on the Sc, F. and D scales (r' s = .66, .54, and .50, respectively). The investigators suggest that the first discriminant best repre sents an interest dimension; the second discriminant reflects psychological adjustment. Ideally, the standardized discriminant function coefficients should be examined to assess the importance of a variable in the presence of other vari ables. (See [25].) Correlation coefficients indicate only how each variable by itself distinguishes the groups, ignoring the contributions of the other vari ables. Unfortunately, in this case, the standardized discriminant coefficients were unavailable. In general, plots should also be made of other pairs of the first few dis criminants. In addition, scatter plots of the discriminant scores for pairs of discriminants can be made for each sport. Under the assumption of multi variate normality, the unit ellipse (circle) centered at the discriminant mean vector y should contain approximately a proportion P [ (Y - p y ) ' (Y - p y ) :::;
1] = P [xi :::; 1]
of the points when two discriminants are plotted.
.39
•
Using Fisher's Discriminants to Classify Objects
Fisher ' s discriminants were derived for the purpose of obtaining a low-dimensional representation of the data that separates the populations as much as possible. Although they were derived from considerations of separation, the discriminants also provide the basis for a classification rule. We first explain the connection in terms of the population discriminants a ; X . Setting Yk
= a� X = kth discriminant, k
s
:::;
(11-64)
we conclude that
under population 7T; and covariance matrix I , for all populations. (See Exer cise 11.21.) Because the components of Y have unit variances and zero covariances, the appropriate measure of squared distance from Y = to /L ; y is
y
(y - /L; y ) ' (y - /L; y ) = L (yi - PwY s
j= I
I
692
Chap. 1 1
Discri m i n ation and Classification
A reasonable classification rule is one that assigns y to population 1Tk if the square of the distance from y to p., k v is smaller than the square of the distance from y to J.t ; v for i * k. If only r of the discriminants are used for allocation, the rule is Allocate x to 1Tk if
2: ( yi - 1-t k Y/ = 2: [ aj ( x - Pk ) F i= l j= l r
r
r
::;;; 2: [aj ( x - P., ;)f
for all i
*k
(11-65)
j=l
Before relating this classification procedure to those of Section 11.6, we look more closely at the restriction on the number of discriminants. From Exer cise 11.21,
s = number of discriminants = number of nonzero eigenvalues of I- 1 BJL
Now, I- 1 BJL is p
or of I - 1/2B Jt I - l /Z
X
p, so s
::;;; p. Further, the g vectors
J.t 1 - P., , J.tz - P., , · · · , J.tg - P.,
(11-66)
satisfy (p.,1 - /i ) + ( p.,2 - /i ) + · · · + (p.,g - /i ) = g/i - gp., = 0. That is, the first difference p., 1 - /i can be written as a linear combination of the last g - 1 differences. Linear combinations of the g vectors in (11-66) determine a hyperplane of dimension q ::;;; g - 1 . Taking any vector e perpendicular to every P., ; - p.,, and hence the hyperplane, gives
g
g
i= l
i=l
B e = 2: ( P., ; - /i ) (p.,; - /i ) ' e = 2: ( J.t ; - p., )0 = 0 Jt
so There are p - q orthogonal eigenvectors corresponding to the zero eigenvalue. This implies that there are q or fewer nonz ero eigenvalues. Since it is always true that q ::;;; g - 1, the number of nonzero eigenvalues s must satisfy s ::;;; min (p, g - 1).
Sec. 1 1 . 7
Fisher's M ethod for Discrimi nati n g Among Several Popu lations
693
Thus, there is no loss of information for discrimination by plotting in two dimensions if the following conditions hold. Number of variables
Number of populations
Maximum number of discriminants
Any p Any p
g=2 g=3 Any g
1 2 2
p=2
We now present an important relation between the classification rule (11-65) and the "normal theory" discriminant scores [see (11-49)], � - 1 IL ; d; ( x ) - IL;1 �..., - 1 x - z1 /L ;1 ...,
+ 1n P ;
or, equivalently,
obtained by adding the same constant - !x' I - 1 x to each d; (x) . Result 1 1 .6. Let yj
1 I - 1 ;2BI - 12 . Then p
= a; x, where aj = I- 1 /2 ej and ej is an eigenvector of p
1 2: (yj - llwY = 2: [aj (x - p,; ) ] Z = (x - P, ; ) ' I - (x - p,; ) j= 1 j= 1 I
p
2: ( yj - JL ;y ) 2 , is constant for all j= S+ 1 populations i = 1, 2, . . . , g so only the first s discriminants yj , or 2: (yj - JL ;yy, j=1 contribute to the classification. Also, if the prior probabilities are such that p 1 = p 2 = . . . = Pg = 1/g, the rule (11-65) with r = s is equivalent to the population version of the minimum TPM rule (11-52) . 1 Proof. The squared distance (x - p,; ) ' I - (x - /L; ) (x - /L; ) ' I- 1 12EE ' I- 1 12 (x - p,; ) , where ( x - p,; ) ' I - 1 12I - 1 12 (x - /L;) E = [e 1 , e 2 , . . . , e ] is the orthogonal matrix whose columns are eigenvectors of I - 1 12B �' I - 1 12 • (See Exercise 11.21. ) If A 1
�
•
.
.
�
As
> 0 = As + 1
p
=
..
·
=
A , P
1
s
I
694
Chap. 1 1
Discri m i n ation and Classification
Since I - 1 /2e I.
= a. or a ' = e ' I- 1 12 l
l
l
'
l p-;)] a � (x a 2 (x - IL; )
a� (x - IL; )
and
(x - IL; ) ' I - 1 12EE ' I- 1 12 (x - IL; ) = _L [aj (x - IL JF j= l Next, each aj = I- 1 /2ej , j > s, is an ( unsealed ) eigenvector of I - 1 B ,.. with eigenvalue zero. As shown in the discussion below (11-66), aj is perpendicular to every IL; - /i and hence to (IL k - IL ) - ( IL; - /i ) = ILk - IL; for i, k = 1, 2, . . . , g. The condition 0 = a,: ( ILk - IL; ) = f.L k Y - f.L ; v implies that p yj - f.L k Y = Yj - f.L w so ,L (yj - f.L ; v) 2 is constant for all i = 1, 2, . . . , g. Therej=s+ • fore, only the first s discriminants yj need to be used for classification. p
I
I
'
I
I
I
We now state the classification rule based on the first discriminants.
r ,.,: s sample
FI�H ER��s CL;ASSIFI��TIC)f.;J PR(l)CEDURE' · BAS�D ON SAMPLE DISCRIMINANTS Allocate x to
Trk
if
for all i ¥ k
When the prior probabilities are such that p 1 = p 2 = · · · = Pg = 1 /g and r = s, rule (11-67) is equivalent to rule (11-52), which is based on the largest linear discriminant score. In addition, if r < s discriminants are used for classification,
there is a loss of squared distance, or score, of ,L [ aj (x - i; ) F for each popu p
s
j = r+ l
lation 7T; where 2: [a; ( X - i; ) F is the part useful for classification.
j = r+ l
Sec. 1 1 . 7
Fisher's M ethod for Discrimi nating Among Several Populations
695
Example 1 1 . 1 6 (Classifying a new observation with Fisher's discriminants)
Let us use the Fisher discriminants
= a { X = .386x l + .495x2 .92 = a�x = .938x1 - .112x2 j\
from Example 11.13 to classify the new observation x� ance with (11-67). Inserting x � = [x0 1 , x0 2 ] = [1 3], we have
= [1 3] in accord
j/ 1 = .386x0 1 + .495x0 2 = .386(1 ) + .495(3) = 1.87 j/2 = .938x0 1 - .112x0 2 = .938(1 ) - .112(3) = .60 Moreover, Ykj
[ -� ] = 1.10 - .1121 [-� = - 1.27 J
= a; xk , so that (see Example 1 1.13 ) Y1 , = a{ x 1 = [.386 .495]
5'12 = a�x 1 = [.938 Similarly,
= a { x2 = Y22 = a� x2 = )13 , = a{ x 3 = )132 = a�x 3 = Y2 1
2.37 .49 - .99 .22
Finally, the smallest value of
2
2 2 ( = L yj Ykj ) L [a/ (x - x d F j= l j= l for k
= 1, 2, 3, must be identified. Using the preceding numbers gives 2
L (yj - Y!j ) 2
j= l 2
= (1.87 - 1.10) 2 + (.60 + 1.27 ) 2 = 4.09
L ( - Y2Y = (1.87 - 2.37 ) 2 + (.60 - .49 ) 2 = .26 j = l yj 2
L (yj - Y3j ) 2
j= l
= (1.87 + .99 ) 2 + (.60 - .22) 2 = 8.32
696
Chap. 1 1
Discrimination and Classification 2
Smallest distance
9 �• -Y 2 -1
Figure 1 1 . 1 4 The points 9' = [y] l Y2l , y; = [ Yn , :VnJ, Y� = [ :V2 1 , yn], and Y� = [ }13 1 , :Vd in the classification plane.
-1
2
Since the minimum of 2: (yj - Yk) 2 occurs when j=l
k = 2, we allocate x0 to
population 1r2 • The situation, in terms of the classifiers yj , is illustrated schematically in Figure 11 .14. • Comment: When two linear discriminant functions are used for classifica tion, observations are assigned to populations based on Euclidean distances in the two-dimensional discriminant space. Up to this point, we have not shown why the first few discriminants are more important than the last few. Their relative importance becomes apparent from their contribution to a numerical measure of spread of the populations. Consider the separatory measure
11� = 2: ( P; - /i ) ' I - 1 (/L; - /i ) g
i= I
(11-68)
where
and ( P ; /i ) ' !, - 1 (p; - /i ) is the squared statistical distance from the ith popu lation mean /L ; to the centroid /i. It can be shown ( see Exercise 1 1 .22) that 11� = A 1 + A z + · · · + AP where the A 1 � A z � · · · � A 5 are the nonzero eigen values of !,- 1 B ( or !, - 1 /2B!, - 1 12) and A s + I • . . , AP are the zero eigenvalues. The separation given by 11 � can be reproduced in terms of discriminant means. The first discriminant, Y1 = e{ !,- 1 12 X has mean s J.L; v, = e { !, - 1 12 /L; and the -
.
Sec. 1 1 . 8
F i n a l Comments
697
g
squared distance � (J..t ; y1 - /.Z yY of the /-t; y1's from the central value i= l J..t y1 = e ; I 1 1 2/i is A1 . (See Exercise 11.22.) Since � i can also be written as -
�i = A 1 g
+ A2 + ··· + AP
= � (P,;y - /iy ) ' (P,; y - /iy) i=l
f.LyY + � ( J.L ; y2 - 1-ZY/ + · · · + � ( J.L ; y, - /.Z y) 2 g
g
it follows that the first discriminant makes the largest single contribution, A 1 , to the separative measure �� - In general, the rth discriminant, Y,. = e/ I - 1 12X, con tributes A,. to �� - If the next s - r eigenvalues (recall that As + ! = As +z = = AP = 0) are such that A,. + 1 A,. + 2 As is small com pared to A 1 A 2 A,., then the last discriminants Y,. + 1 , Y,. + 2 , . . . , Ys can be neglected without appreciably decreasing the amount of separation. 1 2 Not much is known about the efficacy of the allocation rule (11-67). Some insight is provided by computer-generated sampling experiments, and Lachenbruch [21] summarizes its performance in particular cases. The development of the pop ulation result in (11-65) required a common covariance matrix I. If this is essen tially true and the samples are reasonably large, rule (11-67) should perform fairly well. In any event, its performance can be checked by computing estimated error rates. Specifically, Lachenbruch's estimate of the expected actual error rate given by (11-57) should be calculated.
··· + + ··· +
+
+ ··· +
1 1 .8 FINAL COMMENTS Including Qualitative Variables
Our discussion in this chapter assumes that the discriminatory or classificatory vari ables, X1 , X2 , . . . , XP have natural units of measurement. That is, each variable can, in principle, assume any real number, and these numbers can be recorded. Often, a qualitative or categorical variable may be a useful discriminator (classifier). For example, the presence or absence of a characteristic such as the color red may be a worthwhile classifier. This situation is frequently handled by creating a variable X whose numerical value is 1 if the object possesses the characteristic and zero if the object does not possess the characteristic. The variable is then treated like the measured variables in the usual discrimination and classification procedures. 12
See [17] for further optimal dimension-reducing properties .
698
Chap. 1 1
Discri m i n ation and Classification
There is very little theory available to handle the case in which some vari ables are continuous and some qualitative. Computer simulation experiments (see [20]) indicate that Fisher ' s linear discriminant function can perform poorly or sat isfactorily, depending upon the correlations between the qualitative and continu ous variables. As Krzanowski [20]-notes: "A low correlation in one population but a high correlation in the other, or a change in the sign of the correlations between the two populations could indicate conditions unfavorable to Fisher ' s lin ear discriminant function." This is a troublesome area and one that needs further study. When a number of variables are of the 0-1 type, it may be better to consider an alternative approach, called the logistic regression approach to classification. (See [15].) The probability of membership in the first group, P I ( x ) , is modeled directly as P I (x)
= 1
e a + {J'x + e o: +fJ'x
for the two-population problem. The a needs to be adjusted to accommodate a prior distribution, and it may not be easy to include costs. If the populations are nearly normal with equal covariance matrices, the linear classification approach is best. Classification Trees
An approach to classification completely different from the methods discussed in the previous sections of this chapter has been developed. (see [5].) It is very com puter intensive and its implementation is only now becoming widespread. The new approach, called classification and regression trees (CART), is closely related to divisive clustering techniques. (See Chapter 12.) Initially, all objects are considered as a single group. The group is split into two subgroups using, say, high values of a variable for one group and low values for the other. The two subgroups are then each split using the values of a second variable. The splitting process continues until a suitable stopping point is reached. The values of the splitting variables can be ordered or unordered categories. It is this feature that makes the CART procedure so general. For example, suppose subjects are to be dassified as 7TI : heart-attack prone 1r2 :
not heart-attack prone
on the basis of age, weight, and exercise activity. In this case, the CART procedure can be diagrammed as the tree shown in Figure 1 1 . 1 5. The branches of the tree actually correspond to divisions in the sample space. The region R I , defined as being over 45, being overweight, and undertaking no regular exercise, could be
Sec. 1 1 . 8
F i n a l Comments
699
No
1t 1 :
1t 2 :
Heart -attack prone Not heart-attack prone
Figure 1 1 . 1 5
A classification tree.
used to classify a subject as 1r1 : heart-attack prone. The CART procedure would try splitting on different ages, as well as first splitting on weight or on the amount of exercise. The CART methodology is not tied to an underlying population probability distribution of characteristics. Nor is it tied to a particular optimality criterion. In practice, the procedure requires hundreds of objects and, often, many variables. The resulting tree is very complicated. Subjective judgments must be used to prune the tree so that it ends with groups of several objects rather than all single objects. Each terminal group is then assigned to the population holding the majority mem bership. A new object can then be classified according to its ultimate group. Breiman, Friedman, Olshen, and Stone [5] have developed special-purpose software for implementing a CART analysis. Their program uses several intelligent rules for splitting and usually produces a tree that separates groups reasonably well. Neural Networks
A neural network (NN) is a computer-intensive, algorithmic procedure for trans forming inputs into desired outputs using highly connected networks of relatively simple processing units (neurons or nodes). Neural networks are modeled after the neural activity in the human brain. The three essential features, then, of an NN are the basic computing units (neurons or nodes), the network architecture describing
700
Chap. 1 1
Discri m ination and Classification
the connections between the computing units, and the training algorithm used to find values of the network parameters (weights ) for performing a particular task. The computing units are connected to one another in the sense that the out put from one unit can serve as part of the input to another unit. Each computing unit transforms an input to an output using some prespecified function that is typ ically monotone, but otherwise arbitrary. This function depends on constants (para meters ) whose values must be determined with a training set of inputs and outputs. Network architecture is the organization of computing units and the types of connections permitted. In statistical applications, the computing units are arranged in a series of layers with connections between nodes in different layers, but not between nodes in the same layer. The layer receiving the initial inputs is called the input layer. The final layer is called the output layer. Any layers between the input and output layers are called hidden layers. A simple schematic representation of a multilayer NN is shown in Figure 1 1.16. Neural networks can be used for discrimination and classification. When they are so used, the input variables are the measured group characteristics X1 , X2 , . . . , XP , and the output variables are categorical variables indicating group membership. Current practical experience indicates that properly constructed neural networks perform about as well as logistic regression and the discriminant
Output
t
t
Middle (hidden)
Input
Figure 1 1 . 1 6
A neural network with one hidden layer.
t
Sec. 1 1 . 8
F i n a l Com m ents
701
functions we have discussed in this chapter. Reference [26] contains a good dis cussion of the use of neural networks in applied statistics. Selection of Variables
In some applications of discriminant analysis, data are available on a large number of variables. Mucciardi and Gose [23] discuss a discriminant analysis based on 157 variables. 13 In this case, it would obviously be desirable to select a relatively small subset of variables that would contain almost as much information as the original collection. This is the objective of stepwise discriminant analysis, and several pop ular commercial computer programs have such a capability. If a stepwise discriminant analysis (or any variable selection method) is employed, the results should be interpreted with caution. (See [24].) There is no guarantee that the subset selected is "best," regardless of the criterion used to make the selection. For example, subsets selected on the basis of minimizing the apparent error rate or maximizing "discriminatory power" may perform poorly in future samples. Problems associated with variable-selection procedures are magni fied if there are large correlations among the variables or between linear combi nations of the variables. Choosing a subset of variables that seems to be optimal for a given data set is especially disturbing if classification is the objective. At the very least, the derived classification function should be evaluated with a validation sample. As Murray [24] suggests, a better idea might be to split the sample into a number of batches and determine the "best" subset for each batch. The number of times a given vari able appears in the best subsets provides a measure of the worth of that variable for future classification. Testing for Group Differences
We have pointed out, in connection with two group classification, that effective allocation is probably not possible unless the populations are well separated. The same is true for the many group situation. Classification is ordinarily not attempted, unless the population mean vectors differ significantly from one another. Assuming that the data are nearly multivariate normal, with a common covariance matrix, MANOVA can be performed to test for differences in the pop ulation mean vectors. Although apparent significant differences do not automati cally imply effective classification, testing is a necessary first step. If no significant differences are found, constructing classification rules will probably be a waste of time. 13 Imagine the problems of verifying the assumption of 157-variate normality and simultaneously estimating, for example, the 12,403 parameters of the 157 X 157 presumed common covariance matrix!
702
Chap. 1 1
D iscri m i nation and Classification
G raphics
Sophisticated computer graphics now allow one visually to examine multivariate data in two and three dimensions. Thus, groupings in the variable space for any choice of two or three variables can often be discerned by eye. In this way, poten tially important classifying variables are often identified and outlying, or " atypi cal," observations revealed. Visual displays are important aids in discrimination and classification, and their use is likely to increase as the hardware and associated computer programs become readily available. Frequently, as much can be learned from a visual examination as by a complex numerical analysis. Practical Considerations Regarding Multivariate Normality
The interplay between the choice of tentative assumptions and the form of the resulting classifier is important. Consider Figure 1 1 .17, which shows the kidney shaped density contours from two very nonnormal densities. In this case, the normal theory linear (or even quadratic) classification rule will be inadequate com pared to another choice. That is, linear discrimination here is inappropriate. Often discrimination is attempted with a large number of variables, some of which are of the presence-absence, or 0-1 , type. In these situations and in others with restricted ranges for the variables, multivariate normality may not be a sensi ble assumption. As we have seen, classification based on Fisher ' s linear discrimi nants can be optimal from a minimum ECM or minimum TPM point of view only when multivariate normality holds. How are we to interpret these quantities when normality is clearly not viable? In the absence of multivariate normality, Fisher' s linear discriminants can be viewed as providing an approximation to the total sample information. The values of the first few discriminants themselves can be checked for normality and rule (11-67) employed. Since the discriminants are linear combinations of a large num-
j
"Linear classification" boundary \
Contor of /2 ( x )
�
X/
"Good classification" boundary
\
X/
X
X
I
Contour of f1 ( x )
\
\
\
Figure 1 1 . 1 7 \
xi
Two nonnormal populations for which linear discrimination is inappropriate.
Chap. 1 1
Exercises
703
ber of variables, they will often be nearly normal. Of course, one must keep in mind that the first few discriminants are an incomplete summary of the original sample information. Classification rules based on this restricted set may perform poorly, while optimal rules derived from all of the sample information may per form well. EXERCISES
11.1. Consider the two data sets
for which
and
S pooled
=
[� �]
(a) Calculate the linear discriminant function in (1 1-19). (b) Classify the observation x b = [2 7] as population 1r1 or population 7T2 , using Rule (1 1-18) with equal priors and equal costs. 11.2. (a) Develop a linear classification function for the data in Example 11.1 using (11-19). (b) Using the function in (a) and (11-20), construct the "confusion matrix" by classifying the given observations. Compare your classification results with those of Figure 11.1, where the classification regions were deter mined "by eye." (See Example 11.5. ) (c) Given the results in (b), calculate the apparent error rate (APER). (d) State any assumptions you make to justify the use of the method in Parts a and b. 11.3. Prove Result 11.1. Hint: Substituting the integral expressions for P (2 j 1 ) and P ( 1 j 2 ) given by (1 1-1) and (1 1-2) , respectively, into (11-5) yields ECM Noting that f!
= c (2 j 1 )p1
J f1 (x) dx R2
+
c (1 j 2 )p 2
J /2 (x) dx R1
= R 1 U R2 , so that the total probability
704
Chap. 1 1
Discri m i nation and Classification
1 =
J /1 (x) dx = J
we can write
n
R1
f1 (x) dx
[ J /1 (x) dx ]
ECM = c (2 l l )p1 1 -
R1
J
+
+
R2
f1 (x) dx
c (1 1 2)p 2
J [c (1 1 2)pd2 (x) - c (2 1 1 )p t f1 (x)]dx
J /2 (x) dx R1
By the additive property of integrals (volumes), ECM =
R,
+
c ( 2 1 1)p 1
Now, p1 , p 2 , c (1 1 2) , and c (2 1 1 ) are nonnegative. In addition, f1 (x) and f2 (x) are nonnegative for all x and are the only quantities in ECM that depend on x . Thus, ECM is minimized if R 1 includes those values x for which the integrand
and excludes those x for which this quantity is positive. 11.4. A researcher wants to determine a procedure for discriminating between two multivariate populations. The researcher has enough data available to estimate the density functions f1 (x) and f2 (x) associated with populations 7T 1 and 7T2 , respectively. Let c (2 1 1 ) = 5 0 (this is the cost of assigning items as 1r2 , given that 7T1 is true) and c (1 1 2) = 100. In addition, it is known that about 20% of all possible items (for which the measurements x can be recorded) belong to 1r2 • (a) Give the minimum ECM rule (in general form) for assigning a new item to one of the two populations. (b) Measurements recorded on a new item yield the density values f1 (x) = .3 and f2 (x) = .5. Given the preceding information, assign this item to population 1r 1 or population 7T2 • 11.5. Show that - ! (x - p.J ' I - 1 (x - p. 1 ) + ! (x p. 2 ) 'I- 1 (x - JL 2 ) -
= (p.l - J.tz ) 'I- l x - !
+
P. z )
[see Equation (11-13).] 11.6. Consider the linear function Y = a' X. Let E (X) = p. 1 and Cov (X) = I if X belongs to population 1r1 • Let E (X) = p. 2 and Cov (X) = I if X belongs to population 1r2 • Let m = h�-t1 y + /-L z y) = � (a' p.1 + a' p. 2 ) . Given that a' = (p. 1 - p.2 ) ' I _ , , show each of the following. (a) E (a' X I 1r 1 ) - m = a' p. 1 - m > 0
Chap. 1 1
(b) E ( a' X I 1r2 ) - m
11.7.
11.8.
11.9.
= a' p2
Exercises
705
- m <
0 Hint: Recall that I is of full rank and is positive definite, so I - 1 exists and is positive definite. Let f1 (x) = H 1 - l x l ) for l x l <S 1 and f2 (x ) = � (1 - l x - .5 1 ) for - .5 <S X <S 1.5. (a) Sketch the two densities. (b) Identify the classification regions when p1 = p 2 and c (1 1 2 ) = c (2 1 1 ) . (c) Identify the classification regions when p1 = .2 and c (1 1 2 ) = c (2 1 1 ) . Refer to Exercise 11.7. Let f1 (x) be the same as in that exercise, but take f2 (x) = H 2 - l x - .5 1 ) for - 1 .5 <S x <S 2.5. (a) Sketch the two densities. (b) Determine the classification regions when p1 = p 2 and c (1 1 2) = c (2 1 1) . For g = 2 groups, show that the ratio in (11-59) is proportional to the ratio
(
Squared distance between means of Y (Variance of Y)
)
( a' f.L J - a' P 2 ) 2 a' Ia a' ( f.L 1 - P 2 H,__P_,_J a' Ia
-'---
_,_f.L-=-2 ) '_a
_
(a' o )2 a' Ia
where o = ( p 1 - p 2 ) is the difference in mean vectors. This ratio is the population counterpart of (11-33). Show that the ratio is maximized by the linear combination for any c * 0. Hint: Note that ( P ; - f.i ) (p; - f.i ) ' = � (p 1 - p 2 ) (p 1 - p 2 ) ' for i = 1, 2, where f.i = } ( p1 + p2 ) . 11.10. Suppose that n 1 = 1 1 and n2 = 1 2 observations are made o n two random variables X1 and X2 , where X1 and X2 are assumed to have a bivariate nor mal distribution with a common covariance matrix I , but possibly different mean vectors p 1 and p 2 • The sample mean vectors and pooled covariance matrix are XI s pooled
[ =�l [�] - 1.1 ] = [ - 7.3 4.8 1.1 -Xz =
(a) Test for the difference in population mean vectors using Hotelling ' s two sample T2-statistic. Let = .10. a
706
Chap. 1 1
Discri m i n ation and Classification
(b) Construct Fisher ' s (sample) linear discriminant function. [See (1 1-19) and (11-35).] (c) Assign the observation x b = [0 1] to either population 1r 1 or 7T2 • Assume equal costs and equal prior probabilities. 11.11. Suppose a univariate random variable X has a normal distribution with vari ance 4. If X is from population 1r1 , its mean is 10; if it is from population 1r2 , its mean is 14. Assume equal prior probabilities for the events A 1 = X is from population 1r1 and A2 = X is from population 1r2 , and assume that the misclassification costs c (2 j 1 ) and c (1 j 2) are equal (for instance, $10). We decide that we shall allocate (classify) X to population 1r1 if X � c, for some c to be determined, and to population 7Tz if X > c. Let B1 be the event X is classified into population 1r1 and B2 be the event X is classified into popu lation 1r2 • Make a table showing the following: P (B1 j A2), P (B2 j A 1 ) , P (A1 and B2), P (A2 and B1) , P (misclassification), and expected cost for various values of c. For what choice of c is expected cost minimized? The table should take the following form:
c
P (B 1 j A 2) P (B2 j A 1) P (A 1 and B2) P (A 2 and B 1) P (error)
Expected cost
10 14
11.12. 11.13. 11.14.
11.15.
What is the value of the minimum expected cost? Repeat Exercise 11.11 if the prior probabilities of A1 and A2 are equal, but c (2 j 1) = $5 and c (1 j 2) = $15. Repeat Exercise 11.11 if the prior probabilities of A1 and A2 are P(A1) = .25 and P (A2) = .75 and the misclassification costs are as in Exer cise 11.12. Consider the discriminant functions derived in Example 1 1 .3. Normalize a using (11-21) and (11-22). Compute the two midpoints m 7 and mi corre sponding to the two choices of normalized vectors, say, a 7 and ar . Classify x b = [ - .210, - .044] with the function _y; = a * ' x0 for the two cases. Are the results consistent with the classification obtained for the case of equal prior probabilities in Example 11.3? Should they be? Derive the expressions in (11-23) from (11-6) when f1 (x) and f2 (x) are multivariate normal densities with means f.L 1 , f.L z and covariances !, 1 , !, 2 , respectively.
Chap. 1 1
Exercises
707
11.16. Suppose x comes from one of two populations:
7T1 : Normal with mean p 1 and covariance matrix � 1 . 1r2 : Normal with mean p2 and covariance matrix � 2 . If the respective density functions are denoted by f1 (x) and f2 (x) , find the expression for the quadratic discriminator
[f2 (x) ]
= ln f (x) ,
Q If � 1
= �2 = �. for instance, verify that Q becomes ( P t - P- 2 ) ' � - J x
-
! ( Pt
-
P- 2 ) ' � - ' (p , + P- 2 )
11.17. Suppose populations 7T 1 and 7T2 are as follows:
Population
7Tl Distribution Mean p Variance-Covariance �
[
Normal
7T2
J [
[10, 15] ' 18 12 12 32
Normal
�]
[10, 25] ' 20 -7 -
Assume equal prior probabilities and misclassifications costs of c (2 J 1) = $10 and c (1 J 2) = $73.89. Find the posterior probabilities of populations 7T1 and 7T2 , P (7T1 i x) and P( 7T2 J x) , the value of the quadratic discriminator Q in Exercise 11.16, and the classification for each value of x in the following table:
X
Q Classification
[10, 15] ' [12, 17] ' [30, 35] ' (Note:
Use an increment of 2 in each coordinate-11 points in all.)
Show each of the following on a graph of the x 1 , x2 plane.
(a) The mean of each population.
708
Chap. 1 1
D iscri m i n ation and Classification
(b) The ellipse of minimal area with probability .95 of containing x for each population. (c) The region R1 (for population ?T 1 ) and the region n - R1 = R 2 (for population ?T2). (d) The 11 points classified in the table. 11.18. If B is defined as c (p1 - p 2 ) (p1 - p 2 ) ' for some constant c, verify that e = ci - 1 (p1 - p2 ) is in fact an (unsealed) eigenvector of I- 1 B, where I is a covariance matrix. 11.19. (a) Using the original data sets X L and x2 given in Example 11.6, calculate i ; , S ; , i = 1, 2, and S pooled • verifying the results provided for these quan tities in the example. (b) Using the calculations in Part a, compute Fisher ' s linear discriminant function, and use it to classify the sample observations according to Rule (11-35). Verify that the confusion matrix given in Example 1 1 .6 is correct. (c) Classify the sample observations on the basis of smallest squared dis tance Df (x) of the observations from the group means i 1 and i 2 . [See (11-54).] Compare the results with those in Part b. Comment. 11.20. The matrix identity (see Bartlett [3])
n - 3 _1 1 S H_ , pooled - n 2 S pooled _
_
(
+
ck 1 1 - ck ( X H - x- k ) I s;ooled ( X H - xk )
where
ck -
nk (nk - 1 ) (n - 2)
allows the calculation of S i/rooled from s;�oled · Verify this identity using the data from Example 11.6. Specifically, set n = n1 + n2 , k = 1, and x;; = [2, 12] . Calculate Sif� pooled using the full data s;otoled and i 1 , and com pare the result with Si{� poolect in Example 11.6. 11.21. Let A1 � A 2 � · • · � As > 0 denote the s ,s_; min (g - 1, p) nonzero eigen 1 values of I- B�< and e1 , e 2 , . . . , es the corresponding eigenvectors (scaled so that e ' Ie 1 ). Show that the vector of coefficients a that maximizes the ratio
=
a'
[ � (J.t; - fi ) (p; - /i ) ' ] a a' Ia
Chap. 1 1
Exercises
709
is given by a1 = e 1 . The linear combination a{ X is called the first discrimi nant. Show that the value a 2 = e 2 maximizes the ratio subject to Cov (a{ X , a;x) = 0. The linear combination a;x is called the second dis cnmmant. Continuing, ak = e k max1m1zes the ratio subject to 0 = Cov (a� X, a;' X ) , i < k, and a� X is called the kth discriminant. Also, Var (a( X) = 1, i = 1, . . . , s. [See (11-62) for the sample equivalent.] Hint: We first convert the maximization problem to one already solved. By the spectral decomposition in (2-20), I = P' AP where A is a diagonal matrix with positive elements A;. Let A 1/2 denote the diagonal matrix with elements � - By (2-22), the symmetric square-root matrix I 1 12 = P' A 1 12P and its inverse I- 1 12 = P' A - l /2P satisfy I 1 12I 1 12 = I , I 1 12I - 1 12 = I I - 1 12I 1 12 and I - 1 12I - 1 12 = I -1 . Next, set and u ' I - 1 12B IL I -1 12u u ' u = a' I 1 12I 1 12a = a' I a a' I 1 12I - 1 12B IL I - 1 12I 1 12a = a' B ILa . Consequently, the problem reduces to so
maximizing
u'u over u. From (2-51), the maximum of this ratio is A 1 , the largest eigenvalue of I - 1 12B IL I - 1 12 . This maximum occurs when u = e 1 , the normalized eigen vector associated with A 1 . Because e 1 = u = I 1 12 a 1 , or a 1 = I- 1 12e 1 ,
e { I - 1 ;2II- 1 12e 1 = e{ I - 1 12I 1 12I 1 12I- 1 12e 1 = e 1 maximizes the preceding ratio when u = e 2 , the normalized eigenvector corresponding to A 2 . For this choice, a 2 = I- 1 12e 2 , and Cov (a;X, a{ X ) = a; Ia 1 = e� I - 1 12II- 1 12e 1 = e� e 1 = 0, since e 2 _1_ e 1 . Similarly, Var ( a ; x ) = a; I a 2 = e� e 2 = 1. Continue in this fash ion for the remaining discriminants. Note that if A and e are an eigen value-eigenvector pair of I - 1/2BIL I - 1 12 , then I - 1 /zB IL I - 1 /2e = Ae
Var (a{ X ) = a{ I a1 = e{ e 1 = 1 . By (2-52), u _1_
and multiplication on the left by I - 1 /2 gives
A I - 1 12e or I - 1 B /L ( I - 1 12e) = A ( I - 1 12e) Thus, I- 1 B IL has the same eigenvalues as I- 1 12B IL I- 1 12 , but the corre sponding eigenvector is proportional to I- 1 12e = a, as asserted. 11.22. Show that �� = A 1 + A 2 + · · · + AP = A 1 + A 2 + · · · + A s , where A 1 , A 2 , . . . , A s are the nonzero eigenvalues of I - 1 BIL (or I- 1 12BIL I- 1 12 ) and I - 1 ;2I- 1 12BIL I- 1 12e
=
71 0
Chap. 1 1
D iscri m i n ation and Classification
2 + · · · + A, is the resulting separation when only the first r discriminants, Y1 , Y2 , . . . , Y, are used. Hint: Let P be the orthogonal matrix whose ith row e; is the eigenvector of I- 1/2B �< I - ' 12 corresponding to the ith largest eigenvalue, i = 1 , 2, . . . , p .
.:l � is given by (11-68). Also, show that A 1 Consider
y
(p X I )
+
Y,
e � I - lf2x
Y,
e ' I - 1 1 2X
y{J
e p' I- 1 12X
s
(P-;
_
A
=
PI - 1 12X
/i ) ' I- 1/2p' pi - l l2 (p., ;
_
,.,_ )
( P-; - p., ) ' I- ' ( P-; - It )
I!
Therefore, .:l� = L ( P.,; y - py ) ' ( P.,; y - /i y ). Using Y1 , we have
i= I
g
L ( JL ; yl - /i y/
i=l
g
=
because e 1 has eigenvalue g
" ( " ·y i = ,-, -, .L.; I
L e ; I - l /2 ( P.,; - Ji ) (p.,; - p., ) ' I - 1 /z e ,
i=l
Similarly, Y2 produces
A1 . li
-)
r- Y,
2
=
e2' I - 1 12B I - 1 12e 2
and Y" produces
Thus,
.:l�
g
=
L ( P., ; y - /i y ) ' ( P.,; y - /i y )
i= I
I'
= A
2
Chap. 1 1
Exercises
71 1
since A ,. + 1 = · · · = A, = 0. If only the first r discriminants are used, their contribution to Ll � is A 1 + A 2 + · · · + A , .
The following exercises require the use of a computer. 11.23. Consider the data given in Exercise 1.14. (a) Check the marginal distributions of the x;' s in both the multiple-sclero sis (MS) group and non-multiple-sclerosis (NMS) group for normality by graphing the corresponding observations as normal probability plots. Suggest appropriate data transformations if the normality assumption is suspect. (b) Assume that I 1 = I2 = I. Construct Fisher ' s linear discriminant func tion. Do all the variables in the discriminant function appear to be important? Discuss your answer. Develop a classification rule assuming equal prior probabilities and equal costs of misclassification. (c) Using the results in (b), calculate the apparent error rate. If computing resources allow, calculate an estimate of the expected actual error rate using Lachenbruch ' s holdout procedure. Compare the two error rates. 11.24. Annual financial data are collected for bankrupt firms approximately 2 years prior to their bankruptcy and for financially sound firms at about the same time. The data on four variables, X1 = CF/TD = (cash flow)/(total debt), X2 = NI/TA = (net income)/(total assets), X3 = CA/CL = (current assets)/(current liabilities), and X4 = CAINS = (current assets)/(net sales), are given in Table 11.4. (a) Using a different symbol for each group, plot the data for the pairs of observations (x 1 , x2 ), (x 1 , x3 ), and (x 1 , x4 ). Does it appear as if the data are approximately bivariate normal for any of these pairs of variables? (b) Using the n 1 = 21 pairs of observations (x1 , x 2 ) for bankrupt firms and the n2 = 25 pairs of observations (x1 , x 2 ) for nonbankrupt firms, calcu late the sample mean vectors x 1 and x 2 and the sample covariance matri ces S 1 and S2 . (c) Using the results in (b) and assuming that both random samples are from bivariate normal populations, construct the classification rule (11-25) with p 1 = p2 and c (1 1 2) = c (2 1 1) . (d) Evaluate the performance of the classification rule developed in (c) by computing the apparent error rat� (APER) from (11-30) and the esti mated expected actual error rate E (AER) from (11-32). (e) Repeat Parts c and d, assuming that p 1 = 05, p 2 = .95, and c (1 1 2) = c (2 1 1 ). Is this choice of prior probabilities reasonable? Explain. (f) Using the results in (b), form the pooled covariance matrix S poolect , and construct Fisher ' s sample linear discriminant function in (1 1-19). Use this function to classify the sample observations and evaluate the APER.
71 2
Chap. 1 1
Discri m i nation and Classification
Is Fisher ' s linear discriminant function a sensible choice for a classifier in this case? Explain. (g) Repeat Parts b-e using the observation pairs (x 1 , x3 ) and (x 1 , x4). Do some variables appear to be better classifiers than others? Explain. (h) Repeat Parts b-e using observations on all four variables (X1 , X2 ,
X3 , X4).
11.25. The annual financial data listed in Table 1 1 .4 have been analyzed by Johnson [ 18] with a view toward detecting influential observations in a discriminant analysis. Consider variables X1 = CF/TD and X3 = CA/CL. TABLE 1 1 .4 BAN KRU PTCY DATA
Row
CF XI = TD
NI -X2 = -TA
x3 = CA CL
x4 = CA NS
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 1 2 3 4 5 6 7 8
- .45 - .56 .06 - .07 - .10 - .14 .04 - .06 .07 - .13 - .23 .07 .01 - .28 .15 .37 - .08 .05 .01 .12 - .28 .51 .08 .38 .19 .32 .31 .12 - .02
- .41 - .31 .02 - .09 - .09 - .07 .01 - .06 - .01 - .14 - .30 .02 .00 - .23 .05 .11 - .08 .03 - .00 .11 - .27 .10 .02 .11 .05 .07 .05 .05 .02
1.09 1 .51 1.01 1 .45 1.56 .71 1 .50 1.37 1.37 1.42 .33 1 .31 2.15 1 .19 1.88 1.99 1.51 1.68 1 .26 1.14 1.27 2.49 2.01 3.27 2.25 4.24 4.45 2.52 2.05
.45 .16 .40 .26 .67 .28 .71 .40 .34 .44 .18 .25 .70 .66 .27 .38 .42 .95 .60 .17 .51 .54 .53 .35 .33 .63 .69 .69 .35
Population
11'; ,
i = 1, 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
1
(continued)
Chap. 1 1
(continued)
TABLE 1 1 .4
CF TD
XI
Row
.22 .17 .15 - .10 .14 .14 .15 .16 .29 .54 - .33 .48 .56 .20 .47 .17 .58
9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 1r1
71 3
Exercises
NI X2 = TA
x3 = CA CL
x4 = CA NS
.08 .07 .05 - .01 - .03 .07 .06 .05 .06 .11 - .09 .09 .11 .08 .14 .04 .04
2.35 1.80 2.17 2.50 .46 2.61 2.23 2.31 1.84 2.33 3.01 1.24 4.29 1.99 2.92 2.45 5.06
.40 .52 .55 .58 .26 .52 .56 .20 .38 .48 .47 .18 .45 .30 .45 .14 .13
-
7T2
Population 7Ti ,
i = 1, 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Legend: irms. s. Source: 1968, 1969,0: bankr1970,upt1971,firms1972; Moody'1: nons Ibankr ndusturiptal fManual =
=
(a) Using the data on variables X1 and X3 , construct Fisher ' s linear dis
criminant function. Use this function to classify the sample observations and evaluate the APER. [See (11-35) and (11-30).] Plot the data and the discriminant line in the (x 1 , x3) coordinate system. (b) Johnson has argued that the multivariate observations in rows 16 for bankrupt firms and 13 for sound firms are influential. Using the X1 , X3 data, calculate Fisher ' s linear discriminant function with only data point 16 for bankrupt firms deleted. Repeat this procedure with only data point 13 for sound firms deleted. Plot the respective discriminant lines on the scatter in part a, and calculate the APERs, ignoring the deleted point in each case. Does deleting either of these multivariate observations make a difference? (Note that neither of the potentially influential data points is particularly "distant" from the center of its respective scatter.) 11.26. Using the data in Table 1 1 .4, define a binary response variable Z that assumes the value 0 if a firm is bankrupt and 1 if a firm is not bankrupt. Let X CA/CL, and consider the straight-line regression of Z on X. Although a binary response variable does not meet the standard regression assump tions, it is still reasonable to consider predicting the response with a linear function of the predictor(s). =
714
Chap. 1 1
Discri m i n ation and Classification
(a) Use least squares to determine the fitted straight line for the X, Z data.
Plot the fitted values for bankrupt firms as a dot diagram on the inter val [0, 1 ]. Repeat this procedure for nonbankrupt firms and overlay the two dot diagrams. A reasonable discrimination rule is to predict that a firm will go bankrupt if its fitted value is closer to 0 than to 1 . That is, the fitted value is less than .5. Similarly, a firm is predicted to be sound if its fitted value is greater than . 5 . Use this decision rule to classify the sample firms. Calculate the APER. (b) Repeat the analysis in Part a using all four variables, X1 , , X4 • Is there any change in the APER? Do data points 16 for bankrupt firms and 13 for nonbankrupt firms stand out as influential? • • •
11.27. The data in Table 1 1 .5 contain observations on X2 = sepal width and X4 = petal width for samples from three species of iris. There are n 1 = n 2 = n3 = 50 observations in each sample. (a) Plot the data in the (x2 , x4 ) variable space. Do the observations for the three groups appear to be bivariate normal? (b) Assume that the samples are from bivariate normal populations with a common covariance matrix. Test the hypothesis H0 : p 1 = p 2 = p3 ver sus H1 : at least one IL; is different from the others at the a = . 05 signif icance level. Is the assumption of a common covariance matrix reasonable in this case? Explain. (c) Assuming that the populations are bivariate normal, construct the qua dratic discriminate scores d? (x) given by ( 11-47) with p 1 = p 2 = p 3 = � · Using Rule (11-48), classify the new observation x� = [3.5 1 . 75] into population 7T 1 , 7T2 , or 7T3 • (d) Assume that the covariance matrices I; are the same for all three b�vari ate normal populations. Construct the linear discriminate score d; (x) given by ( 11-51), and use it to assign x� = [3.5 1 .75] to one of the pop ulations 7T; , i = 1, 2, 3 according to (11-52). Take p 1 = p 2 = p 3 = � · Compare the results in Parts c and d. Which approach do you prefer? Explain. (e) Assuming equal covariance matrices and bivariate normal populations, and supposing that p 1 = p2 = p 3 = � ' allocate x� = [3.5 1 .75] to 7T 1 , 7T2 , or 7T3 using Rule (11-56). Compare the result with that in Part d. Delineate the classification regions R 1 , R 2 , and R 3 on your graph from Part a determined by the linear functions d k ; (x0 ) in (11-56). ( f) Using the linear discriminant scores fr qm Part d, classify the sample observations. Calculate the APER and E (AER). (To calculate the lat ter, you should use Lachenbruch ' s holdout procedure. [See (11-57).]) 11.28. Darroch and Mosimann [6] have argued that the three species of iris indi cated in Table 1 1 . 5 can be discriminated on the basis of "shape" or scale-
Chap . 1 1
TABLE 1 1 .5 7T1 :
DATA
Iris setosa
1r2 :
Iris versicolor
Sepal Petal Petal width length width
x,
Xz
x3
x4
x,
Xz
5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2
3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1
1.4 1.4 1.3 1.5 1 .4 1.7 1 .4 1.5 1.4 1.5 1 .5 1.6 1.4 1.1 1 .2 1 .5 1.3 1.4 1 .7 1 .5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1 .5 1.5 1.4 1.5
0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1 0.1 0.2 0.4 0.4 0.3 0.3 0.3 0.2 0.4 0.2 0.5 0.2 0.2 0.4 0.2 0.2 0.2 0.2 0.4 0.1 0.2 0.2
7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0
4.9
71 5
O N I R I S ES
Sepal length
5.5
Exercises
Sepal Sepal length width
7T3 :
Iris virginica
Petal Petal Sepal Sepal Petal Petal length width length width length width x3
x4
x,
Xz
x3
x4
4.7 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5
1.4 1.5 1.5 1.3 1.5 1.3 1.6 1.0 1.3 1.4 1.0 1.5 1.0 1.4 1.3 1.4 1.5 1.0 1.5 1.1 1.8 1.3 1.5 1.2 1.3 1.4 1.4 1.7 1.5 1.0 1.1 1.0 1.2 1.6 1.5
6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1
3.3 2.7 3.0 2.9 3.0 3.0 2.5 2.9 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6
6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6
2.5 1.9 2.1 1.8 2.2 2.1 1 .7 1.8 1.8 2.5 2.0 1 .9 2.1 2.0 2.4 2.3 1.8 2.2 2.3 1.5 2.3 2.0 2.0 1.8 2.1 1.8 1.8 1 .8 2.1 1.6 1.9 2.0 2.2 1.5 1.4
(continued)
71 6
Chap . 1 1
TABlE 1 1 .5
Discri m i n ation and Classification
(continued)
7T1 : Iris setosa Sepal length
xi
5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0
Sepal Petal Petal width length width
Xz
x3
x4
3.2 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3
1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1 .6 1.9 1.4 1 .6 1.4 1.5 1.4
0.2 0.2 0.1 0.2 0.2 0.3 0.3 0.2 0.6 0.4 0.3 0.2 0.2 0.2 0.2
Source: Anderson
1r3 : Iris virginica
7Tz : Iris versicolor Sepal Sepal length width
XI
Xz
6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7
3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8
Petal Petal Sepal Sepal Petal Petal length width length width length width
x3
x4
xi
Xz
x3
x4
4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1
1.6 1.5 1.3 1 .3 1.3 1.2 1.4 1.2 1 .0 1.3 1.2 1.3 1.3 1.1 1.3
7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
3.0 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0
6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
2.3 2.4 1.8 1.8 2.1 2.4 2.3 1.9 2.3 2.5 2.3 1.9 2.0 2.3 1.8
[1]. free information alone. Let Y1 = X1 /X2 be sepal shape and Y2 = XiX4 be petal shape. (a) Plot the data in the (log Y1 , log Y2 ) variable space. Do the observations for the three groups appear to be bivariate normal? (b) Assuming equal covariance matrices and bivariate normal populations, and supposing that p 1 = p2 = p3 = � , construct the linear discriminant scores d; (x) given by ( 11-51 ) using both variables log Y1 , log Y2 and each variable individually. Calculate the APERs. (c) Using the linear discriminant functions from Part b, calculate the hold out estimates of the expected AERs, and fill in the following summary table: Variable(s) log Y1 1og Y2 log Y1 , log Y2
Misclassification rate
Chap . 1 1
Exercises
71 7
Compare the preceding misclassification rates with those in the summary tables in Example 11.12. Does it appear as if information on shape alone is an effective discriminator for these species of iris? (d) Compare the corresponding error rates in Parts b and c. Given the scat ter plot in Part a, would you expect these rates to differ much? Explain. 11.29. The GPA and GMAT data alluded to in Example 1 1 . 1 1 are listed in Table 1 1 .6. (a) Using these data, calculate i1 , i 2 , i3 , i , and S pooled and thus verify the results for these quantities given in Example 1 1 . 1 1 . (b) Calculate w- t and B and the eigenvalues and eigenvectors o f w- t B. Use the linear discriminants derived from these eigenvectors to classify the new observation x� = [3.21 497] into one of the populations 1r 1 : admit; 1r2 : not admit; and 1r3 : borderline. Does the classification agree with that in Example 11.11? Should it? Explain. 11.30. Gerrild and Lantz [12] chemically analyzed crude-oil samples from three zones of sandstone:
1r1 : Wilhelm 1r2 : Sub-Mulinia 1r3 : Upper (Mulinia, second subscales, first subscales) The values of the trace elements
X1 = vanadium (in percent ash) X2 = iron (in percent ash) x3 = beryllium (in percent ash) and two measures of hydrocarbons,
x4 = saturated hydrocarbons (in percent area) X5 = aromatic hydrocarbons (in percent area) are presented for 56 cases in Table 11.7 on page 719. The last two measure ments are determined from areas under a gas-liquid chromatography curve. (a) Obtain the estimated minimum TPM rule, assuming normality. Comment on the adequacy of the assumption of normality. (b) Determine the estimate of E(AER) using Lachenbruch ' s holdout pro cedure. Also, give the confusion matrix. (c) Consider various transformations of the data to normality (see Example 11.14), and repeat Parts a and b.
71 8
Chap. 1 1
TABLE 1 1 .6
Discri m i n ation and Classification
AD M I S S I O N DATA FOR G RAD UATE S C H O O L O F B U S I N ES S
1r1 : Admit
1r2 : Do not admit
Applicant no.
GPA
GMAT
Applicant no.
GPA
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
2.96 3.14 3.22 3.29 3.69 3.46 3.03 3.19 3.63 3.59 3.30 3.40 3.50 3.78 3.44 3.48 3.47 3.35 3.39 3.28 3.21 3.58 3.33 3.40 3.38 3.26 3.60 3.37 3.80 3.76 3.24
596 473 482 527 505 693 626 663 447 588 563 553 572 591 692 528 552 520 543 523 530 564 565 431 605 664 609 559 521 646 467
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
2.54 2.43 2.20 2.36 2.57 2.35 2.51 2.51 2.36 2.36 2.66 2.68 2.48 2.46 2.63 2.44 2.13 2.41 2.55 2.31 2.41 2.19 2.35 2.60 2.55 2.72 2.85 2.90
(xl )
(x2 )
(x l )
1r3 : Borderline
GMAT Applicant no. (x2 )
446 425 474 531 542 406 412 458 399 482 420 414 533 509 504 336 408 469 538 505 489 411 321 394 528 399 381 384
60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
GPA
GMAT
2.86 2.85 3.14 3.28 2.89 3.15 3.50 2.89 2.80 3.13 3.01 2.79 2.89 2.91 2.75 2.73 3.12 3.08 3.03 3.00 3.03 3.05 2.85 3.01 3.03 3.04
494 496 419 371 447 313 402 485 444 416 471 490 431 446 546 467 463 440 419 509 438 399 483 453 414 446
(x l )
(x2 )
Chap. 1 1
TAB LE
7T t
7Tz
7T3
1 1 .7
Exercises
C RU D E-O I L DATA
xt
Xz
x3
x4
Xs
3.9 2.7 2.8 3.1 3.5 3.9 2.7 5.0 3.4 1.2 8.4 4.2 4.2 3.9 3.9 7.3 4.4 3.0 6.3 1.7 7.3 7.8 7.8 7.8 9.5 7.7 11.0 8.0 8.4 10.d 7.3 9.5 8.4 8.4 9.5 7.2 4.0 6.7
51.0 49.0 36.0 45.0 46.0 43.0 35.0 47.0 32.0 12.0 17.0 36.0 35.0 41.0 36.0 32.0 46.0 30.0 13.0 5.6 24.0 1S.O 25.0 26.0 17.0 14.0 20.0 14.0 18.0 18.0 15.0 22.0 15.0 17.0 25.0 22.0 12.0 52.0
0.20 0.07 0.30 0.08 0.10 0.07 0.00 0.07 0.20 0.00 0.07 0.50 0.50 0.10 0.07 0.30 0.07 0.00 0.50 1.00 0.00 0.50 0.70 1.00 0.05 0.30 0.50 0.30 0.20 0.10 0.05 0.30 0.20 0.20 0.50 1.00 0.50 0.50
7.06 7.14 7.00 7.20 7.81 6.25 5.11 7.06 5.82 5.54 6.31 9.25 5.69 5.63 6.19 8.02 7.54 5.12 4.24 5.69 4.34 3.92 5.39 5.02 3.52 4.65 4.27 4.32 4.38 3.06 3.76 3.98 5.02 4.42 4.44 4.70 5.71 4.80
12.19 12.23 11.30 13.01 12.63 10.42 9.00 6.10 4.69 3.15 4.55 4.95 2.22 2.94 2.27 12.92 5.76 10.77 8.2 7 4.64 2.99 6.09 6.20 2.50 5.71 8.63 8.40 7.87 7.98 7.67 6.84 5.02 10.12 8.25 5.95 3.49 6.32 3.20 (continued)
719
720
Chap. 1 1
D iscri m i n ation and Classification
TABLE 1 1 . 7 (continued) XI
9.0 7.8 4.5 6.2 5.6 9.0 8.4 9.5 9.0 6.2 7.3 3.6 6.2 7.3 4.1 5.4 5.0 6.2
x2
x3
x4
27.0 29.0 41.0 34.0 20.0 17.0 20.0 19.0 20.0 16.0 20.0 15.0 34.0 22.0 29.0 29.0 34.0 27.0
0.30 1.50 0.50 0.70 0.50 0.20 0.10 0.50 0.50 0.05 0.50 0.70 0.07 0.00 0.70 0.20 0.70 0.30
3.69 6.72 3.33 7.56 5.07 4.39 3.74 3.72 5.97 4.23 4.39 7.00 4.84 4.13 5.78 4.64 4.21 3.97
Xs
3.30 5.75 2.27 6.93 6.70 8.33 3.77 7.37 11.17 4.18 3.50 4.82 2.37 2.70 7.76 2.65 6.50 2.97
11.31. Refer to the data on salmon in Table 1 1 .2. (a) Plot the bivariate data for the two groups of salmon. Are the sizes and orientation of the scatters roughly the same? Do bivariate normal dis tributions with a common covariance matrix appear to be viable popu lation models for the Alaskan and Canadian salmon? (b) Using a linear discriminant function for two normal populations with equal priors and equal costs [see (11-19)], construct dot diagrams of the discriminant scores for the two groups. Does it appear as if the growth ring diameters separate the two groups reasonably well? Explain. (c) Repeat the analysis in Example 1 1 .7 for the male and female salmon separately. Is it easier to discriminate Alaskan male salmon from Cana dian male salmon than it is to discriminate the females in the two groups? Is gender (male or female) likely to be a useful discriminatory variable? 11.32. Data on hemophilia A carriers, similar to those used in Example 1 1 .3, are listed in Table 11.8. (See [14].) Using these data: (a) Investigate the assumption of bivariate normality for the two groups. (b) Obtain the sample linear discriminant function, assuming equal prior probabilities, and estimate the error rate using the holdout procedure.
Chap. 1 1
TABLE
Group
1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 .8
Exercises
721
H E M O P H I LIA DATA
Noncarriers ( 7T 1 ) log 1 0 (AHF log 10 (AHF activity) antigen)
- .0056 - .1698 - .3469 - .0894 - .1679 - .0836 - .1979 - .0762 - .1913 - .1092 - .5268 - .0842 - .0225 .0084 - .1827 .1237 - .4702 - .1519 .0006 - .2015 - .1932 .1507 - .1259 - .1551 - .1952 .0291 - .2228 - .0997 - .1972 - .0867
- .1657 - .1585 - .1879 .0064 .0713 .0106 - .0005 .0392 - .2123 - .1 190 -.4773 .0248 - .0580 .0782 - .1 138 .2140 - .3099 - .0686 - .1153 - .0498 - .2293 .0933 - .0669 - .1232 - .1007 .0442 - .1710 - .0733 - .0607 - .0560
Group
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Obligatory carriers ( 1r2 ) log1 0 (AHF log 1 0 (AHF antigen) activity)
- .3478 - .3618 - .4986 - .5015 - .1326 - .6911 - .3608 - .4535 - .3479 - .3539 - .4719 - .3610 - .3226 - .4319 - .2734 - .5573 - .3755 - .4950 - .5107 - .1652 - .2447 - .4232 - .2375 - .2205 - .2154 - .3447 - .2540 - .3778 - .4046 - .0639 - .3351 - .0149 -.0312 - .1740 - .1416 - .1508
.1151 - .2008 - .0860 - .2984 .0097 - .3390 .1237 - .1682 - .1721 .0722 - .1079 - .0399 .1670 - .0687 - .0020 .0548 - .1865 - .0153 - .2483 .2132 - .0407 - .0998 .2876 .0046 - .0219 .0097 - .0573 - .2682 - .1162 .1569 - .1368 .1539 .1400 - .0776 .1642 .1137 (continued)
722 TABLE
Chap. 1 1 1 1 .8
Group
Discri m i n ation and Classification
(continued)
Noncarriers ( 1r1 ) log 1 0 (AHF log 1 0 (AHF activity) antigen)
Group
Obligatory carriers ( 1r2 ) log1 0 (AHF log 1 0 (AHF activity) antigen)
2 2 2 2 2 2 2 2 2 Source: See
- .0964 - .2642 - .0234 - .3352 - .1878 - .1744 - .4055 - .2444 - .4784
.0531 .0867 .0804 .0875 .2510 .1892 - .2418 .1614 .0282
[14].
(c) Classify the following 10 new cases using the discriminant function in Part b. N EW CAS ES REQU I RI N G CLAS S I F I CATI O N
Case
log 1 0 (AHF activity)
log1 0 (AHF antigen)
1 2 3 4 5 6 7 8
- .112 - .059 .064 - .043 - .050 - .094 - .123 - .011 - .210 - .126
- .279 - .068 .012 - .052 - .098 - .113 - .143 - .037 - .090 - .019
9
10
(d) Repeat Parts a-c, assuming that the prior probability of obligatory car riers (group 2) is � and that of noncarriers (group 1) is � 11.33. Consider the data on bulls in Table 1 .8. (a) Using the variables YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt, calculate Fisher ' s linear discriminants, and classify the bulls as Angus, Hereford, or Simental. Calculate an estimate of E (AER) using the holdout procedure. Classify a bull with characteristics YrHgt = 50, FtFrBody = 1000, PrctFFB = 73, Frame = 7, BkFat = .17,
Chap. 1 1
References
723
SaleHt = 54, and SaleWt = 1525 as one of the three breeds. Plot the discriminant scores for the bulls in the two-dimensional discriminant space using different plotting symbols to identify the three groups. (b) Is there a subset of the original seven variables that is almost as good for discriminating among the three breeds? Explore this possibility by com puting the estimated E(AER) for various subsets. 11.34. Table 1 1 .9 contains data on breakfast cereals produced by three different American manufacturers: General Mills (G), Kellogg (K), and Quaker (Q). Assuming multivariate normal data with a common covariance matrix, equal costs, and equal priors, classify the cereal brands according to manufacturer. Compute the estimated E(AER) using the holdout procedure. Interpret the coefficients of the discriminant functions. Does it appear as if some manu facturers are associated with more "nutritional" cereals (high protein, low fat, high fiber, low sugar, and so forth) than others? Plot the cereals in the two-dimensional discriminant space, using different plotting symbols to identify the three manufacturers. REFERENCES
Anderson, E. "The Irises of the Gaspe Peninsula." 2. John AndersWioln,ey,T. W. (2d ed.). New York: Bartlet , M. S. "An Inverse Matrix Adjustment Arising in Discriminant Analysis." Bouma, B. N. , et al. "Evaluation of the Detectino.on 2Rate of Hemophilia Carriers." Friedman, R.InOlc., shen, and C. Stone. BelDarBreimrmoontch,an,, J.L.CA:,N.J.,Wads w ort h , andno. J. E. Mos241imann.-252."Canonical and Principal Components of Shape." A. "Pitcfs.al"sJourin tnhale Application ofno.Discriminant Analysis in Business, FiFiEinssheanceer,nbeiR.ands, A.R.Economi "The Use of Multiple Measurements in Taxonomic Problems." Fisher, R. A. "The Statistical Utilization of Multiple Measurements." re Approaches to Clustering via Maximum LiGeiGaneskels ieharood.,liS.ngam"Di" sS.cri"Clminaats iiofn,icatAlioloncatandono.ryMiandxtuSepar atorNewy, LinYorearkAs: Academi pects." Icn Pres , edi t e d by J. Van Ryzi n , pp. 12. PlGerriioceneld, P.SandM.,Uniandts, R.ElkJ.HiLantl s Oiz. l"Chemi Field, CalcalifAnalorniay."sis of Crude Oil Samples from 1.
Bulletin of the American Iris Society,
59
(1939), 2-5.
An Introduction to Multivariate Statistical Methods 1984.
3.
Annals of Mathematical Statistics,
4.
22
(1951), 107-1 1 1 .
tical Methods for Clinical Decision Making,
5.
Statis (1975) , 339-350. Classification and Regression Trees.
1,
1 984.
6.
Biometrika,
7.
12 ,
1 (1985),
of Finance,
8.
Eugenics,
9.
ics,
10.
8
7
32,
3 (1977), 875-900.
Annals of
(1936) , 179-188.
Annals of Eugen
(1938), 376-386.
Applied Statistics,
11.
and Clustering,
38,
3 (1989), 455-466. 301-330.
Classification 1 977.
75 U. S. Geological Survey Open-File
Report, 1969.
TABLE 1 1 .9
�
�
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
DATA ON B RA N D S O F C EREAL B rand
Manufacturer
Calories
Protein
Fat
Sodium
Fiber
Carbohydrates
Sugar
Potassium
Group
Apple_Cinnamon_Cheerios Cheerios Cocoa_Puffs Count_Chocula Golden_Grahams Honey_Nut_Cheerios Kix Lucky_Charms Multi_Grain_Cheerios Oatmeal_Raisin_Crisp Raisin_Nut_Bran Total Corn Flakes Tota Raisi-;:i_Bran Total_Whole_Grain Trix Wheaties Wheaties_Honey_Gold All_Bran Apple_Jacks Corn_Flakes Corn_Pops Cracklin'_Oat_Bran Crispix Froot_Loops Frosted_Flakes Frosted_Mini_Wheats Fruitful_Bran J ust_Right_Crunchy_Nuggets Mueslix_Crispy_Blend Nut&Honey_Crunch Nutri-grain_Almond-Raisin Nutri-grain_Wheat Product_1 9 Raisin Bran Rice_Krispies Smacks Special_K Cap'n'Crunch Honey_Graham_Ohs Life Puffed_Rice
G G G G G G G G G G G G G G G G G K K K K K K K K K K K K K K K K K K K K Q Q Q Q Q Q
110 110 110 110 1 10 110 110 110 100 130 1 00 110 140 100 110 100 110 70 110 100 110 110 110 110 110 100 120 1 10 1 60 120 140 90 100 120 1 10 1 10 110 120 120 1 00 50 50 100
2 6 1 1 1 3 2 2 2 3 3 2 3 3 1 3 2 4 2 2 1 3 2 2 1 3 3 2 3 2 3 3 3 3 2 2 6 1 1 4 1 2 5
2 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 0 0 0 3 0 1 0 0 0 1 2 1 2 0 0 1 0 1 0 2 2 2 0 0 2
180 290 180 180 280 250 260 1 80 220 1 70 1 40 200 1 90 200 1 40 200 200 260 125 290 90 140 220 125 200 0 240 170 150 1 90 220 1 70 320 210 290 70 230 220 220 150 0 0 0
1 .5 2.0 0.0 0.0 0.0 1 .5 0.0 0.0 2.0 1 .5 2.5 0.0 4.0 3.0 0.0 3.0 1 .0 9.0 1.0 1.0 1 .0 4.0 1.0 1.0 1 .0 3.0 5.0 1.0 3.0 0.0 3.0 3.0 1 .0 5.0 0.0 1 .0 1 .0 0.0 1 .0 2.0 0.0 1 .0 2.7
10.5 17.0 12.0 12.0 15.0 11.5 21.0 12.0 15.0 13.5 10.5 21.0 15.0 16.0 13.0 17.0 16.0 7.0 1 1 .0 21.0 13.0 10.0 21.0 1 1 .0 14.0 14.0 14.0 17.0 17.0 15.0 21.0 18.0 20.0 14.0 22.0 9.0 16.0 12.0 12.0 12.0 13.0 10.0 1 .0
10 1 13 13 9 10 3 12 6 10 8 3 14 3 12 3 8 5 14 2 12 7 3 13 11 7 12 6 13 9 7 2 3 12 3 15 3 12 11 6 0 0 1
70 1 05 55 65 45 90 40 55 90 120 140 35 230 1 10 25 110 60 320 30 35 20 1 60 30 30 25 1 00 1 90 60 1 60 40 130 90 45 240 35 40 55 35 45 95 15 50 110
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
C
42 Puffed_Wheat 43 Quaker_Oatmeal Source:
Data courtesy of Chad Dacus.
1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3
Chap. 1 1
References
725
13. NewGnanades ikJohnan, R.Wiley, 1977. York: 14. Anal Habbema, J. D. F.Us, J.inHermans , EsandtimK.atiVanon." DenIn Broek. "A Stepwise Discriminant ysis Program g Dens i t y Vienna: Physica, 1974. New York: John Wiley, 1981. 15.16. HiHand,l s, M.D."Alpp.J. lo101-110. cat,io1-31.n Rules and Their Error Rates." ( 1 966) 17. Hudl t, R., andDimR.ensA.ionalJohnsRepron.es"Lientnatearions.Di"sIcnrimination and Some Further Resediutletds onby BesJ. Vant eLower Ryzi n , pp. 371-394. New York: Academi c Pres s , 1977. 18. Johns enta iBayes al ObsiaenrvFratiaomewor ns for kAl."location, Separation, and the Deton,erW.min"TheationDetof Preno.ctoibabio3n(of1l987)itIinesfl,iun369-381. Kendal l, M.ki, G.W. J. "The PerformanceNewof FiYorsher'k:sHafLinnearer PresDiscsr,im1975.inant Function under 20.19. NonKrzanows 2 (1977)Haf,n191-200. 21.22. Lachenbruch, LachenbrOptiumch,al Condi P.P. A.A.t,ioandns."M. R. Mickey. "EsNewtno.imYork: er PrRates ,e1975. at i o n of Error s in Discriminant Anal y s i s . " no. 1 ( 1 968) , 1-11. 23. Mucci E. E. Gostione.Pr"AoperCompar Subs1023-1etas0r31.dofi, A.PatN.te,rnandRecogni ties." ison of Seven Techniques for Choos(1971),ing 24. Murray, G. D. "A Cautno.ionar3 (y1977)Not,e246-250. on Selection of Variables in Discriminant Analysis." 25. Rencher, A.inciC.pal"InComponent terpretatiosn."of Canonical Discriminant Funct(1992),ions, 217-225. Canonical Vari at e s and Pr 26.27. WalStern,d, H.A.S"On. "Neurala StaNettistiwcalorkPrs oinblApplem AriedisStingatiisntitchs.e" Clas ification of an(1I996)ndiv,id205-214. ual into One of Two Gr o ups. " ( 1 944) , 145-162. 28. Welch, B. L. "Note on Discriminant Functions." (1939), 218-220. Methods for Statistical Data Analysis of Multivariate Observations,
Compstat
Statistics,
1 974,
Proc. Computational
Discrimination and Classification,
Journal of the Royal Statistical Soci
ety (B) ,
28
Classification and Clustering,
Journal of Business and
Economic Statistics,
5,
Multivariate Analysis.
Technometrics, 19, Discriminant Analysis.
Technometrics,
10,
IEEE Trans. Computers,
Applied Statistics,
26,
The American Statistician ,
46
Technometrics,
Annals of Mathematical Statistics,
15
Biometrika,
31
38,
C20
CHAPTER
12
Clustering, Distance Methods, and Ordination 1 2. 1 INTRODUCTION
Rudimentary, exploratory procedures are often quite helpful in understanding the complex nature of multivariate relationships. For example, throughout this book, we have emphasized the value of data plots. In this chapter, we shall discuss some additional displays based on certain measures of distance and suggested step-by step rules (algorithms) for grouping objects (variables or items). Searching the data for a structure of "natural" groupings is an important exploratory technique. Groupings can provide an informal means for assessing dimensionality, identifying outliers, and suggesting interesting hypotheses concerning relationships. Grouping, or clustering, is distinct from the classification methods discussed in the previous chapter. Classification pertains to a known number of groups, and the operational objective is to assign new observations to one of these groups. Clus ter analysis is a more primitive technique in that no assumptions are made con cerning the number of groups or the group structure. Grouping is done on the basis of similarities or distances (dissimilarities). The inputs required are similarity mea sures or data from which similarities can be computed. To illustrate the nature of the difficulty in defining a natural grouping, con sider sorting the 16 face cards in an ordinary deck of playing cards into clusters of similar objects. Some groupings are illustrated in Figure 12. 1 . It is immediately clear that meaningful partitions depend on the definition of similar. In most practical applications of cluster analysis, the investigator knows enough about the problem to distinguish "good" groupings from "bad" groupings. Why not enumerate all possible groupings and select the "best" ones for further study? For the playing-card example, there is one way to form a single group of 16 face cards, there are 32,767 ways to partition the face cards into two groups (of varying sizes), there are 7,141,686 ways to sort the face cards into three groups (of 726
Sec. 1 2 . 1
• • • • A DODD K DDDD Q DDDD 1 DDDD
I ntrod u ction
727
• • • •
�DODD (b ) Individual suits
(a) Individual cards
• • • •
;oo (c ) Black and red suits
(d ) Major and minor suits (bridge )
• • • •
• • • •
(e) Hearts plus queen of spades and other suits (hearts )
(f) Like face cards
;�o
Figure 1 2. 1
Grouping face cards.
varying sizes), and so on. 1 Obviously, time constraints make it impossible to deter mine the best groupings of similar objects from a list of all possible structures. Even large computers are easily overwhelmed by the typically large number of cases, so one must settle for algorithms that search for good, but not necessarily the best, groupings. To summarize, the basic objective in cluster analysis is to discover natural groupings of the items (or variables). In turn, we must first develop a quantitative scale on which to measure the association (similarity) between objects. Section 12.2 is devoted to a discussion of similarity measures. After that section, we describe a few of the more common algorithms for sorting objects into groups. Even without the precise notion of a natural grouping, we are often able to group objects in two- or three-dimensional plots by eye. Stars and Chernoff faces, the second kind given by
n
(1/k !) ± ( - 1)k-i � j". ( See [1].) Adding these numbers for k
1 The number of ways of sorting j�O
obj ects into k nonempty groups is a Stirling number of
() }
groups, we obtain the total number of possible ways to sort
n
obj ects into groups.
=
.
1 , 2, . .
, n
728
Chap. 1 2
Cl usteri ng, Dista nce Methods, and Ord i n ation
discussed in Section 1.4, have been used for this purpose. (See Examples 1.8 and 1.9. ) Additional procedures for depicting high-dimensional observations in two dimensions such that similar objects are, in some sense, close to one another are considered in Sections 12.5-12.7. 1 2.2 SIMILARITY MEASURES
Most efforts to produce a rather simple group structure from a complex data set require a measure of "closeness," or "similarity." There is often a great deal of sub jectivity involved in the choice of a similarity measure. Important considerations include the nature of the variables (discrete, continuous, binary), scales of mea surement (nominal, ordinal, interval, ratio), and subject matter knowledge. When items (units or cases) are clustered, proximity is usually indicated by some sort of distance. On the other hand, variables are usually grouped on the basis of correlation coefficients or like measures of association. Distances and Similarity Coefficients for Pai rs of Items
We discussed the notion of distance in Chapter 1, Section 1 .5. Recall that the Euclidean (straight-line) distance between two p-dimensional observations (items) x = [x 1 , x2 , , xP ] ' and y = [y1 , y2 , , yp ] ' is, from (1-12), ( 12-1 ) d ( x , y) = Y (x1 - y 1 ) 2 + (x2 - y2 ) 2 + · · · + (xp - Yp ? • • .
• • •
= Y (x - y) ' (x - y) The statistical distance between the same two observations is of the form [see ( 1 -23 ) ]
d (x, y) = Y (x - y) ' A (x - y)
( 12-2 )
Ordinarily, A = S - I , where S contains the sample variances and covariances. How ever, without prior knowledge of the distinct groups, these sample quantities can not be computed. For this reason, Euclidean distance is often preferred for clustering. Another distance measure is the Minkowski metric ( 12-3 )
For m = 1, d (x, y) measures the "city-block" distance between two points in p dimensions. For m = 2, d (x, y) becomes the Euclidean distance. In general, vary ing m changes the weight given to larger and smaller differences. Two additional popular measures of "distance" or dissimilarity are given by the Canberra metric and the Czekanowski coefficient. Both of these measures are defined for nonnegative variables only. We have
Sec. 1 2. 2
S i m i l a rity Measu res
Canberra metric:
729
(12-4) p
Czekanowski coefficient:
d ( y) x,
1 -
2 2: min (x;, Y; )
i = l____ p
2: (x;
i= I
+
Y; )
(12-5)
Whenever possible, it is advisable to use "true" distances-that is, distances satis fying the distance properties of (1-25)-for clustering objects. On the other hand, most clustering algorithms will accept subjectively assigned distance numbers that may not satisfy, for example, the triangle inequality. The following example illustrates how rudimentary groupings can be formed by simply reorganizing the elements of the distance matrix. Example
1 2. 1
(Clustering by shading a distance matrix)
Table 12.1 gives the Euclidean distances between pairs of 22 U.S. public util ity companies, based on the data listed in Table 12.5 after they have been standardized. Because the distance matrix is large, it is difficult to visually select firms that are close together (similar). However, the graphical method of shading allows us to pick out clusters of similar firms quite easily. First, distances are arranged into several classes (for instance, 15 or fewer), based on their magnitudes. Next, all distances within a given class are replaced by a common symbol with a certain shade of gray. Darker symbols correspond to smaller distances. Finally, the distance matrix is reorganized so that items with common symbols appear in contiguous locations along the main diagonal. Groups of similar items correspond to patches of dark shadings. From Figure 12.2 on page 731, we see that firms 1, 18, 19, and 14 form a group, firms 22, 10, 13, 20, and 4 form a group, firms 9 and 3 form a group, firms 3 and 6 form a group, and so forth. The groups (9, 3) and (3, 6) overlap, as do other groups in the diagram. Firms 11, 5, and 17 appear to stand alone. • When items cannot be represented by meaningful p-dimensional measurements, pairs of items are often compared on the basis of the presence or absence of certain characteristics. Similar items have more characteristics in common than do dissimilar items. The presence or absence of a characteristic can be described mathematically by introducing a binary variable, which assumes the value 1 if the characteristic is pre sent and the value 0 if the characteristic is absent. For p = 5 binary variables, for instance, the "scores" for two items i and k might be arranged as follows:
Item i Item k
1
2
Variables 3
4
5
1 1
0 1
0 0
1 1
1 0
TABLE 1 2. 1
D I STANCES B ETWEEN
22
UTI LITIES
-
.....
w
e
Firm no.
1 2 3 4 5 6
7 8 9
1
.00 3.10 3.68 2.46 4.12 3.61
2 .00 4.92 2.16 3.85 4.22
3.90 3.45 2.74 3.89 3.25 3.96
10 3.10 2.71 11 3.49 4.79 12 3.22 2.43 13 3.96 3.43 14 2.11 4.32 15 2.59 2.50
16 17 18 19 20 21 22
4.03 4.40 1 .88 2.41 3.17 3.45 2.51
4.84 3.62 2.90 4.63 3.00 2.32 2.42
3
4
5
.00 .00 4.1 1 .00 4.47 4.13 2.99 3 .20 4.60
6
.00
4.22 3.97 4.60 3 .35 4.99 3 .69 5.16 4.91 2.75 3.75 4.49 3.73 3.93 5.90 4.03 4.39 2.74 5.16
5.26 6.36 2.72 3.18 3.73 5.09 4. 1 1
1 .49 4.86 3 .50 2.58 3 .23 3.19
4.97 4.89 2.65 3.46 1 .82 3.88 2.58
4.05 6.46 3.60 4.76 4.82 4.26
5.82 5.63 4.34 5.13 4.39 3.64 3.77
7
8
9
.00 4.36 2.80
.00 3.59
.00
3 .83 4.51 6.00 6.00 3 .74 1 .66 4.55 5 .01 3 .47 4.91 4.07 2.93
5.84 6.10 2.85 2.58 2.91 4.63 4.03
5 .04 4.58 2.95 4.52 3.54 2.68 4.00
3.67 3 .46 4.06 4.14 4.34 3 .85
2.20 5.43 3.24 4.11 4.09 3.98 3.24
3.57 5.18 2.74 3 .66 3 .82 4.11
3.63 4.90 2.43 4. 1 1 2.95 3.74 3.21
10
.00 5 .08 3 .94 1 .41 3 .61 4.26
4.53 5.48 3.07 4.13 2.05 4.36 2.56
11
12
13
14
.00 5.21 .00 5.31 4.50 .00 4.32 4.34 4.39 .00 4.74 2.33 5.10 4.24 3.43 4.75 3.95 4.52 5 .35 4.88 3.44
4.62 3.50 2.45 4.41 3.43 1 .38 3.00
4.41 5.61 3.78 5.01 2.23 4.94 2.74
5.17 5.56 2.30 1 .88 3.74 4.93 3.51
15
.00
5.18 3.40 3.00 4.03 3.78 2.10 3.35
16
17
.00 5.56 3.97 5.23 4.82 4.57 3.46
.00 4.43 6.09 4.87 3.10 3.63
18
19
20
.00 2.47 .00 .00 2.92 3.90 3.19 4.97 4.15 2.55 3.97 2.62
21
22
.00 3.01
.00
Sec. 1 2 . 2
18 19 14 9 3 6 22 10 13 20 4 7 12 21 IS 2 II 16 8 5 17
less than to to to to to to greater than
2.563 2.994 3 .424 3.66 1 3.963 4. 1 35
Firm no.
+ M M +
+ eX e S. + + M + M ll. X X + - ll S. e X N
•+ e X ll. M - e X
+ ll M M M M e S. S. X M S. X M M M M X M
-·
••
Figure 1 2.2
731
2.563 2.994 3 .424 3 .66 1 3.963 4. 1 35 4.45 8 4.458
M M M M M M M M M M M M -
S. M e e ll. - + ll + + s. . XM S. M - • - S. e
ll. M X+
S i m i l a rity Measures
:+:+ X
+e -- ++
e -
X X e + X M MX
\
�MM
M
M
Shaded distance matrix for 22 utilities.
In this case, there are two 1-1 matches, one 0-0 match, and two mismatches. Let xij be the score (1 or 0) of the jth binary variable on the ith item and xkj be the score (again, 1 or 0) of the jth variable on the k th item, j = 1 , 2, . . . , p . Consequently,
{�
if X;j if X;j p
= xkj -=1=
xkj
0
(12-6)
and the squared Euclidean distance, 2: (xij - xkj ) 2 , provides a count of the numj� I ber of mismatches. A large distance corresponds to many mismatches-that is, dis similar items. From the preceding display, the square of the distance between items i and k would be
732
Chap. 1 2
C l ustering, Distance Meth ods, and O rd i nation 5
2: (x;j - xky = ( 1 - 1) 2 + (0 - 1 ) 2 + (0 - 0) 2 + (1 - 1 ) 2 + (1 - 0) 2 j= l
=2 Although a distance based on (12-6) might be used to measure similarity, it suffers from weighting the 1-1 and 0--D matches equally. In some cases, a 1-1 match is a stronger indication of similarity than a 0-0 match. For instance, in grouping people, the evidence that two persons both read ancient Greek is stronger evidence of similarity than the absence of this ability. Thus, it might be reasonable to dis count the 0-0 matches or even disregard them completely. To allow for differen tial treatment of the 1-1 matches and the 0--D matches, several schemes for defining similarity coefficients have been suggested. To introduce these schemes, let us arrange the frequencies of matches and mismatches for items i and k in the form of a contingency table: Item k
Item i Totals
1 0
1
0
Totals
a c
b d
a+b c+d
a+c
b+d
p=a+b+c+d
(12-7)
In this table, a represents the frequency of 1-1 matches, b is the frequency of 1-0 matches, and so forth. Given the foregoing five pairs of binary outcomes, a = 2 and b = c = d = 1 . Table 12.2 lists common similarity coefficients defined i n terms o f the fre quencies in (12-7). A short rationale follows each definition. Coefficients 1, 2 and 3 in the table are monotonically related. Suppose coef ficient 1 is calculated for two contingency tables, Table I and Table II. Then if (a1 + d1 ) /p � ( a1 1 + d1 1 )/p, we also have 2 (a1 + d1 )/[2 (a1 + d1 ) + b1 + cr J � 2 ( au + du ) / [2 (au + d1 1 ) + b1 1 + c11 ] , and coefficient 3 will be at least as large for Table I as it is for Table II. (See Exercise 12.4.) Coefficients 5, 6, and 7 also retain their relative orders. Monotonicity is important, because some clustering procedures are not affected if the definition of similarity is changed in a manner that leaves the relative orderings of similarities unchanged. The single linkage and complete linkage hier archical procedures discussed in Section 12.3 are not affected. For these methods, any choice of the coefficients 1, 2 and 3 in Table 12.2 will produce the same group ings. Similarly, any choice of the coefficients 5, 6, and 7 will yield identical groupings.
Sec. 1 2 . 2
TABLE
1 2. 2
733
S i m i l a rity Measures
S I M I LARITY COEFFICI ENTS F O R CLUSTERI N G ITEM S *
Rationale
Coefficient
a+d 1. -p
1-1 matches and 0-0 matches. Double weight for 1-1 matches and matches.
Equal weights for
2. 2 (a +2 (ad)++d)b + c a+d 3 · a + d + 2 (b + c )
----�--�----
{[_
4. p a -5. -a+b+c
0-0
Double weight for unmatched pairs. No 0-0 matches in numerator. No 0-0 matches in numerator or denominator. (The 0-0 matches are treated as irrelevant. )
2a --6. ---2a + b + c
No 0-0 matches in numerator or denominator. Double weight for 1-1 matches.
a a + 2 (b + c )
7. ------
No 0-0 matches in numerator or denominator. Double weight for unmatched pairs.
a 8. -b +c
Ratio of matches to mismatches with 0-0 matches excluded.
*
[p binary variables; see (12-7).]
Example
1 2.2
(Calculating the values of a similarity coefficient)
Suppose five individuals possess the following characteristics:
Height
1 68 in
Individual Individual 2 Individual 3 Individual Individual 5
73 in 67 in in in
4 64 76
Weight 140 lb
185 lb 165 lb
120 lb 210 lb
Eye color
Hair color
Handedness
Gender
green brown blue brown brown
blond brown blond brown brown
right right right right left
female male male female male
734
Chap. 1 2
Cl usteri ng, Distance Methods, and Ord i n ation
{� Xz = { � X3 = { �
Define six binary variables X1 , X2 , X3 , X4, X5 , X6 as
x1 =
height height
<
;,
72 in. 72 in.
weight weight
;,
<
150 lb 150 lb
{� Xs = { � x6 = { �
brown eyes otherwise
The scores for individuals 1 and 2 on the p =
Individual
1 2
right handed left handed female male
6 binary variables are
0 1
0 1
0 1
blond hair not blond hair
X4 -
1 0
1 1
1 0
and the number of matches and mismatches are indicated in the two-way array
Individual 2
Individual 1
1
0
Total
1 0
1 3
2 0
3 3
Totals
4
2
6
Employing similarity coefficient 1 , which gives equal weight to matches, we compute
a+d p
1
+0 6
1
6
Continuing with similarity coefficient 1, we calculate the remaining similar ity numbers for pairs of individuals. These are displayed in the 5 X 5 sym metric matrix
Sec. 1 2 .2
S i m i l a rity Measures
735
Individual 1 2 Individual 3 4
5
1 1
2
3
I 6
1 3 6 3
1
4
6 4
6
0
CD 6
2 6 2 6
4
5
1 2 6
1
Based on the magnitudes of the similarity coefficient, we should con clude that individuals 2 and 5 are most similar and individuals 1 and 5 are least similar. Other pairs fall between these extremes. If we were to divide the individuals into two relatively homogeneous subgroups on the basis of the similarity numbers, we might form the subgroups (1 3 4) and (2 5). Note that X3 = 0 implies an absence of brown eyes, so that two people, one with blue eyes and one with green eyes, will yield a 0-0 match. Conse quently, it may be inappropriate to use similarity coefficient 1, 2, or 3 because • these coefficients give the same weights to 1-1 and 0-0 matches. We have described the construction of distances and similarities. It is always possible to construct similarities from distances. For example, we might set (12-8) where 0 < sik � 1 is the similarity between items i and k and dik is the corre sponding distance. However, distances that must satisfy (1-25) cannot always be constructed from similarities. As Gower [5, 6] has shown, this can be done only if the matrix of similarities is nonnegative definite. With the nonnegative definite condition, and with the maximum similarity scaled so that su = 1 , (12-9) has the properties of a distance. Similarities and Association Measures for Pairs of Variables
Thus far, we have discussed similarity measures for items. In some applications, it is the variables, rather than the items, that must be grouped. Similarity measures for variables often take the form of sample correlation coefficients. Moreover, in some clustering applications, negative correlations are replaced by their absolute values.
736
Chap. 1 2
Cl ustering, Distance Methods, and Ordi nation
When the variables are binary, the data can again be arranged in the form of a contingency table. This time, however, the variables, rather than the items, delin eate the categories. For each pair of variables, there are n items categorized in the table. With the usual 0 and 1 coding, the table becomes as follows: Variable k
Variable i
1 0 Totals
1
0
Totals
a c a+c
b d b+d
a+b c+d 11 = a + b + c + d
(12-10)
For instance, variable i equals 1 and variable k equals 0 for b of the 11 items. The usual product moment correlation formula applied to the binary vari ables in the contingency table of (12-10) gives (see Exercise 12.3)
ad --- -be -r = ---[ ( a + b ) (c + d ) ( a + c) ( b + d ) ] 1 12
(12-1 1)
This number can be taken as a measure of the similarity between the two variables. The correlation coefficient in (12-1 1) is related to the chi-square statistic ( r2 = x 2 /11) for testing the independence of two categorical variables. For n fixed, a large similarity (or correlation) is consistent with the absence of independence. Given the table in (12-10), measures of association (or similarity) exactly anal ogous to the ones listed in Table 12.2 can be developed. The only change required is the substitution of 11 (the number of items) for p (the number of variables). Concluding Comments on Similarity
To summarize this section, we note that there are many ways to measure the sim ilarity between pairs of objects. It appears that most practitioners use distances [see (12-1) through (12-5)] or the coefficients in Table 12.2 to cluster items and corre lations to cluster variables. However, at times, inputs to clustering algorithms may be simple frequencies. Example
1 2. 3
(Measuring the similarities of
11
languages)
The meanings of words change with the course of history. However, the mean ing of the numbers 1, 2, 3, . . . represents one conspicuous exception. Thus, a first comparison of languages might be based on the numerals alone. Table 12.3 gives the first 10 numbers in English, Polish, Hungarian, and eight other modern European languages. (Only languages that use the Roman alphabet are considered, and accent marks, cedillas, diereses, etc., are omitted.) A cur sory examination of the spelling of the numerals in the table suggests that the
TABLE 1 2.3
..... w .....
N U M ERALS I N
11
LAN G UAG ES
English (E)
Norwegian (N)
Danish (Da)
Dutch (Du)
German (G)
French (Fr)
Spanish (Sp)
Italian (I)
Polish (P)
Hungarian (H)
Finnish (Fi)
one two three four five six seven eight nine ten
en to tre fire fern seks sju atte ni ti
en to tre fire fern seks syv otte
een twee drie vier vijf zes zeven acht negen tien
eins zwei drei vier funf sechs sieben acht neun zehn
un deux trois quatre cinq
uno dos tres cuatro cmco seis siete ocho nueve diez
uno due tre quattro cinque sei sette otto nove dieci
jeden dwa trzy cztery piec szesc siedem osiem dziewiec dziesiec
egy ketto harom negy ot hat het nyolc kilenc tiz
yksi kaksi kolme neua viisi kuusi seitseman kahdeksan yhdeksan kymmenen
lll
ti
SIX
sept huit neuf dix
738
Chap. 1 2
Cl usteri ng, Distance Methods, and Ord i n ation
first five languages (English, Norwegian, Danish, Dutch, and German) are very much alike. French, Spanish, and Italian are in even closer agreement. Hungarian and Finnish seem to stand by themselves, and Polish has some of the characteristics of the languages in each of the larger subgroups. The words for 1 in French, Spanish, and Italian all begin with u . For illustrative purposes, we might compare languages by looking at the first let ters of the numbers. We call the words for the same number in two different languages concordant if they have the same first letter and discordant if they do not. From Table 12.3, the table of concordances (frequencies of matching first initials) for the numbers 1-10 is given in Table 12.4. We see that English TABLE IN
E N Da Du G Fr Sp I p
H Fi
11
1 2.4
CONCORDANT F I RST LETIERS F O R N U M B ERS
LAN G UAG ES
E 10 8 8 3 4 4 4 4 3 1 1
N Da Du 10 9 5 6
4 4 4 3 2 1
10 4 5
5 5
4 4 2 1
5
10 1 1 1 0 2 1
G
10 3 3 3 2 1 1
Fr
Sp
I
10 8 9
10 9
10
0 1
0 1
0 1
5
7
6
P
H
Fi
10 0 1
10 2
10
and Norwegian have the same first letter for 8 of the 10 word pairs. The remaining frequencies were calculated in the same manner. The results in Table 12.4 confirm our initial visual impression of Table 12.3. That is, English, Norwegian, Danish, Dutch, and German seem to form a group. French, Spanish, Italian, and Polish might be grouped together, • whereas Hungarian and Finnish appear to stand alone. In our examples so far, we have used our visual impression of similarity or distance measures to form groups. We now discuss less subjective schemes for cre ating clusters. 1 2. 3 HIERARCHICAL CLUSTERING METHODS
We can rarely examine all grouping possibilities, even with the largest and fastest computers. Because of this problem, a wide variety of clustering algorithms have emerged that find "reasonable" clusters without having to look at all configurations.
Sec. 1 2 . 3
H ierarchical Clustering Methods
739
Hierarchical clustering techniques proceed by either a series of successive mergers or a series of successive divisions. Agglomerative hierarchical methods start with the individual objects. Thus, there are initially as many clusters as objects. The most similar objects are first grouped, and these initial groups are merged accord ing to their similarities. Eventually, as the similarity decreases, all subgroups are fused into a single cluster. Divisive hierarchical methods work in the opposite direction. An initial single group of objects is divided into two subgroups such that the objects in one sub group are "far from" the objects in the other. These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects-that is, until each object forms a group. The results of both agglomerative and divisive methods may be displayed in the form of a two-dimensional diagram known as a dendrogram. As we shall see, the dendrogram illustrates the mergers or divisions that have been made at suc cessive levels. In this section we shall concentrate on agglomerative hierarchical procedures and, in particular, linkage methods. Excellent elementary discussions of divisive hier archical procedures and other agglomerative techniques are available in [2] and [4]. Linkage methods are suitable for clustering items, as well as variables. This is not true for all hierarchical agglomerative procedures. We shall discuss, in turn, single linkage (minimum distance or nearest neighbor), complete linkage (maxi mum distance or farthest neighbor), and average linkage (average distance). The merging of clusters under the three linkage criteria is illustrated schematically in Figure 12.3. I
/
l 1e
\
I
-
"
II
/
\
"
----
"
I
\ l
/
-
___.j- 4 \ 2.-;----"
-- -
/
;3,
-- -
Cluster distance
\ l . s; /
(a) -
----
2•
......_ _
\ I I
"
/
dl 3
+
dl4
+ dl 5 + d23 +
d24
+
d25
6 (c) Figure 1 2.3
lntercluster distance (dissimilarity) for (a) single linkage, (b) com plete linkage, and (c) average linkage.
740
Chap. 1 2
Clusteri ng, Distance Methods, and Ordination
From the figure, we see that single linkage results when groups are fused according to the distance between their nearest members. Complete linkage occurs when groups are fused according to the distance between their farthest members. For average linkage, groups are fused according to the average distance between pairs of members in the respective sets. The following are the steps in the agglomerative hierarchical clustering algo rithm for grouping N objects (items or variables): Start with N clusters, each containing a single entity and an N X N symmet ric matrix of distances (or similarities) D = { di k } . 2. Search the distance matrix for the nearest (most similar) pair of clusters. Let the distance between "most similar" clusters U and V be duv · 3. Merge clusters U and V. Label the newly formed cluster ( UV). Update the entries in the distance matrix by (a) deleting the rows and columns corre sponding to clusters U and V and (b) adding a row and column giving the dis tances between cluster ( UV) and the remaining clusters. 4. Repeat Steps 2 and 3 a total of N 1 times. (All objects will be in a single cluster after the algorithm terminates.) Record the identity of clusters that are merged and the levels (distances or similarities) at which the mergers take place. (12-12) 1.
-
The ideas behind any clustering procedure are probably best conveyed through examples, which we shall present after brief discussions of the input and algorithmic components of the linkage methods. Single Linkage
The inputs to a single linkage algorithm can be distances or similarities between pairs of objects. Groups are formed from the individual entities by merging near est neighbors, where the term nearest neighbor connotes the smallest distance or largest similarity. Initially, we must find the smallest distance in D = { di d and merge the cor responding objects, say, U and V, to get the cluster (UV). For Step 3 of the gen eral algorithm of (12-12), the distances between ( UV) and any other cluster W are computed by (12-13) d( UV) W = min { d w • dv w l Here the quantities du w and dv w are the distances between the nearest neighbors of clusters U and W and clusters V and W, respectively. u
The results of single linkage clustering can be graphically displayed in the form of a dendrogram, or tree diagram. The branches in the tree represent clusters. The branches come together (merge) at nodes whose positions along a distance (or similarity) axis indicate the level at which the fusions occur. Dendrograms for some specific cases are considered in the following examples.
Sec. 1 2 . 3
Example
1 2. 4
H ierarch ical Cl ustering M ethods
741
(Clustering using single linkage)
To illustrate the single linkage algorithm, we consider the hypothetical dis tances between pairs of five objects as follows:
D = { di d
1 2 3 4 5
1 2 0 9 0 3 7 6 5 11 10
3 4 5 0 9 0 G) 8 0
Treating each object as a cluster, we commence clustering by merging the two closest items. Since
objects 5 and 3 are merged to form the cluster (35). To implement the next level of clustering, we need the distances between the cluster (35) and the remaining objects, 1, 2, and 4. The nearest neighbor distances are d(3S ) J = min { d3 1 , d5 d = min { 3, 11 } = 3 d(3s)z = min { d3 2 , d5 2 } = min { 7, 10 } = 7 d(35 ) 4 = min { d34 , d5 4 } = min { 9, 8 } = 8 Deleting the rows and columns of D corresponding to objects 3 and 5, and adding a row and column for the cluster (35), we obtain the new distance matrix (35) 1 2 4
(35) 1 2 4
lt : � J
The smallest distance between pairs of clusters is now d(35 l 1 = 3, and we merge cluster (1) with cluster (35) to get the next cluster, (135). Calculating d( 135 >2 = min { d( 35> 2 , d1 2 } = min { 7, 9 } = 7 d( 135 ) 4 = min { d( 35>4 , d1 4 } = min { 8, 6 }
=
6
we find that the distance matrix for the next level of clustering is
742
Chap. 1 2
Clusteri ng, Distance Methods, and Ord i nation
(135) 2 4
[
(135) 2 4 0 7 0 6 G)
J
The minimum nearest neighbor distance between pairs of clusters is d4 2 = 5, and we merge objects 4 and 2 to get the cluster (24). At this point we have two distinct clusters, (135) and (24). Their near est neighbor distance is d ( 1 35 ) ( 24 )
=
min { d( I 3S ) 2 , d( I 3S ) 4 }
=
min { 7, 6 }
=
6
The final distance matrix becomes (135) (24)
(135) (24)
[®
0
J
Consequently, clusters (135) and (24) are merged to form a single cluster of all five objects, (12345), when the nearest neighbor distance reaches 6.
�
0
2
3
5
2
4
Objects
Figure 1 2.4 Single linkage dendrogram for distances between five objects.
The dendrogram picturing the hierarchical clustering just concluded is shown in Figure 12.4. The groupings and the distance levels at which they • occur are clearly illustrated by the dendrogram. In typical applications of hierarchical clustering, the intermediate results-where the objects are sorted into a moderate number of clusters-are of chief interest. Example
1 2.5
(Single linkage clustering of
11
languages)
Consider the array of concordances in Table 12.4 representing the closeness between the numbers 1-10 in l l languages. To develop a matrix of distances, we subtract the concordances from the perfect agreement figure of 10 that each language has with itself. The subsequent assignments of distances are
Sec. 1 2. 3
E N Da Du G Fr Sp I p
H Fi
E 0 2 2 7 6 6 6 6 7 9 9
Hierarchical Cl ustering M ethods
N Da Du G Fr Sp
I
p
743
H Fi
0
CD 0 5 4 6 6 6 7 8 9
0 5 9 9 9 10 8 9
6 5 6 5 5 6 8 9
0 7 7 7 8 9 9
0 2
0
CD CD 0
5 3 4 0 10 10 10 10 0 9 9 9 9 8
0
We first search for the minimum distance between pairs of languages
( clusters ) . The minimum distance, 1, occurs between Danish and Norwegian,
Italian and French, and Italian and Spanish. Numbering the languages in the order in which they appear across the top of the array, we have
d3 2
=
d86
1;
=
1;
and
Since d76 = 2, we can merge only clusters 8 and 6 or clusters 8 and 7. We cannot merge clusters 6, 7, and 8 at level 1. We choose first to merge 6 and 8, and then to update the distance matrix and merge 2 and 3 to obtain the clusters (68) and (23). Subsequent computer calculations produce the den drogram in Figure 12.5. From the dendrogram, we see that Norwegian and Danish, and also French and Italian, cluster at the minimum distance ( maximum similarity ) level. When the allowable distance is increased, English is added to the Nor wegian-Danish group, and Spanish merges with the French-Italian group. Notice that Hungarian and Finnish are more similar to each other than to the other clusters of languages. However, these two clusters ( languages ) do not 10 8
� 0
6
0 4 "'
2 0 E
N
Da
Fr
Sp
P
Languages
Du
G
H
Fi
Figure 1 2.5 Single linkage dendrograms for distances between numbers in 1 1 languages.
744
Chap. 1 2
Clusteri ng, Dista nce Methods, and Ordination
merge until the distance between nearest neighbors has increased substan tially. Finally, all the clusters of languages are merged into a single cluster at • the largest nearest neighbor distance, 9. Since single linkage joins clusters by the shortest link between them, the tech nique cannot discern poorly separated clusters. [See Figure 12.6(a).] On the other hand, single linkage is one of the few clustering methods that can delineate nonel lipsoidal clusters. The tendency of single linkage to pick out long stringlike clusters is known as chaining. [See Figure 12.6(b ).] Chaining can be misleading if items at opposite ends of the chain are, in fact, quite dissimilar.
�
Variable 2
Variable 2
Nonelliptical / - -... , configurations I ' , , - '\ \ I ' I , '\ I
:• Elliptical : ! .... configurations :..•. ,.. •. • ••. • • •
•
•••• •
--
·· :.� · · .·:: •
'------ ·variable I (a) Single linkage confused by near overlap
Figure 1 2.6
.,., '-----Variable I ,
_ _ _ _ _ _
(b ) Chaining effect
Single linkage clusters.
The clusters formed by the single linkage method will be unchanged by any assignment of distance (similarity) that gives the same relative orderings as the ini tial distances (similarities). In particular, any one of a set of similarity coefficients from Table 12.2 that are monotonic to one another will produce the same clustering. Complete Linkage
Complete linkage clustering proceeds in much the same manner as single linkage clusterings, with one important exception: At each stage, the distance (similarity) between clusters is determined by the distance (similarity) between the two ele ments, one from each cluster, that are most distant. Thus, complete linkage ensures that all items in a cluster are within some maximum distance (or minimum simi larity) of each other. The general agglomerative algorithm again starts by finding the minimum entry in D = ( d;k } and merging the corresponding objects, such as U and V, to get cluster ( UV). For Step 3 of the general algorithm in (12-12), the distances between ( UV) and any other cluster W are computed by (12-14)
Sec. 1 2 . 3
H iera rch ical Clusteri ng M ethods
745
Here d uw and dv w are the distances between the most distant members of clusters U and W and clusters V and W, respectively. Example
1 2.6
(Clustering using complete linkage)
Let us return to the distance matrix introduced in Example 12.4 :
D
=
{ d;d
=
1 2 3 4 5
2 3 4 5
1 0
0
9
3
7
0
5 9 0 1 1 10 Q) 8 0 6
At the first stage, objects 3 and 5 are merged, since they are most similar. This gives the cluster (35). At stage 2, we compute d( J S ) I
=
max { d3 1 , d5 d
=
max { 3, 1 1 }
d( Js )z
=
max { d3 2 , d5 2 }
=
10
d(Js ) 4
=
max { d34 , d5 4 }
=
9
=
11
and the modified distance matrix becomes
(35) 1 2 4 The next merger occurs between the most similar groups, 2 and 4, to give the cluster (24). At stage 3, we have d ( 24 ) ( J S )
=
max { d2 (JS ) • d4 (Js ) l
d(24 ) 1
=
max { d2 1 , d4 d
=
=
max { lO, 9 }
=
10
9
and the distance matrix
(35) ( 24 ) 1
(24) 1
&
J
The next merger produces the cluster (124). At the final stage, the groups (35) and (124) are merged as the single cluster (12345) at level
746
Chap. 1 2
Cl ustering, Distance Methods, and Ord i n ation
2
4
3
5
Objects
Figure 1 2.7 Complete linkage dendrogram for distances between five objects.
d< 124) ( 3s ) = max {d1 ( 3s ) ' d< 24) ( 3s ) l = max { l l , 10)
11
The dendrogram is given in Figure 12.7.
•
Comparing Figures 12.4 and 12.7, we see that the dendrograms for single link age and complete linkage differ in the allocation of object 1 to previous groups. Example
1 2. 7
(Complete linkage clustering of
11
languages)
In Example 12.5, we presented a distance matrix for numbers in 1 1 languages. The complete linkage clustering algorithm applied to this distance matrix pro duces the dendrogram shown in Figure 12.8. Comparing Figures 12.8 and 12.5, we see that both hierarchical methods yield the English-Norwegian-Danish and the French-Italian-Spanish lan guage groups. Polish is merged with French-Italian-Spanish at an intermedi ate level. In addition, both methods merge Hungarian and Finnish only at the penultimate stage. However, the two methods handle German and Dutch differently. Sin gle linkage merges German and Dutch at an intermediate distance, and these two languages remain a cluster until the final merger. Complete linkage
E
N
lli
G
�
� Languages
P
�
H
Fi
Figure 1 2.8 Complete linkage dendrogram for distances between numbers in 1 1 languages.
Sec . 1 2 . 3
H ierarchical C l ustering M ethods
747
merges German with the English-Norwegian-Danish group at an intermedi ate level. Dutch remains a cluster by itself until it is merged with the English Norwegian-Danish-German and French-Italian-Spanish-Polish groups at a higher distance level. The final complete linkage merger involves two clus • ters. The final merger in single linkage involves three clusters. Example
1 2.8
(Clustering variables using complete linkage)
Data collected on 22 U.S. public utility companies for the year 1975 are listed in Table 12.5. Although it is more interesting to group companies, we shall TABLE
1 2.5
P U B LI C UTI LITY DATA
(1 975) Variables
Company 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22.
Arizona Public Service Boston Edison Co. Central Louisiana Electric Co. Commonwealth Edison Co. Consolidated Edison Co. (N.Y.) Florida Power & Light Co. Hawaiian Electric Co. Idaho Power Co. Kentucky Utilities Co. Madison Gas & Electric Co. Nevada Power Co. New England Electric Co. Northern States Power Co. Oklahoma Gas & Electric Co. Pacific Gas & Electric Co. Puget Sound Power & Light Co. San Diego Gas & Electric Co. The Southern Co. Texas Utilities Co. Wisconsin Electric Power Co. United Illuminating Co. Virginia Electric & Power Co. X1 : X2 : X3 : X4 : X5 : X6 : X7 : X8 :
XI
Xz
x3
1.06 .89 1.43 1.02 1 .49 1.32 1.22 1.10 1.34 1.12 .75 1.13 1.15 1.09 .96 1.16 .76 1.05 1.16 1.20 1.04 1 .07
9.2 10.3 15.4 11.2 8.8 13.5 12.2 9.2 13.0 12.4 7.5 10.9 12.7 12.0 7.6 9.9 6.4 12.6 11.7 11.8 8.6 9.3
151. 202. 113. 168. 192. 111. 175. 245. 168. 197. 173. 178. 199. 96. 164. 252. 136. 150. 104. 148. 204. 174.
KEY: FiRatxed-e ofcharretguerncoveron capiagetratal. io (income/debt). Cost perloKWad facapaci ty in place. Annual ct o r . Peak SalPerecsentKWH (KnuclWHdemand usear.e per gryear).owth from 1974 to 1975. Tot a l fuel cos t s ( c ent s per KWH) . Source: Data courtesy of H. E. Thompson.
x7
Xs
1.6 9077. 0. 54.4 2.2 5088. 25.3 57.9 53.0 3.4 9212. 0. .3 6423. 34.3 56.0 51.2 1.0 3300. 15.6 60.0 -2.2 11 127. 22.5 2.2 7642. 0. 67.6 57.0 3.3 13082. 0. 7.2 8406. 0. 60.4 2.7 6455. 39.2 53.0 6.5 17441. 0. 51.5 3.7 6154. 0. 62.0 53.7 6.4 7179. 50.2 1.4 9673. 0. 49.8 .9 62.2 -0. 1 6468. 9.2 15991. 0. 56.0 9.0 5714. 8.3 61.9 56.7 2.7 10140. 0. 54.0 -2.1 13507. 0. 59.9 3.5 7287. 41.1 3.5 6650. 0. 61.0 5.9 10093. 26.6 54.3
.628 1.555 1.058 .700 2.044 1 .241 1 .652 .309 .862 .623 .768 1.897 .527 .588 1.400 .620 1 .920 1.108 .636 .702 2.116 1.306
x4
Xs
x6
748
Chap. 1 2
C l u steri ng, Dista nce Methods, and O rd i n ation
TABLE 1 2.6
CORRELATI O N S B ETWEEN PAI RS OF VARIAB LES
( P U B L I C UTI LITY DATA)
XI
x2
1 .000 .643 - .103 - .082 - .259 - .152 .045 - .013
1.000 - .348 - .086 - .260 - .010 .211 - .328
x7
Xs
1 .000 .100 1.000 .435 .034 1 .000 .176 1 .000 .028 - .288 .115 - .164 - .019 - .374 1 .000 .005 .486 - .007 - .561 - .185
1 .000
x3
x4
Xs
x6
see here how the complete linkage algorithm can be used to cluster variables. We measure the similarity between pairs of variables by the product-moment correlation coefficient. The correlation matrix is given in Table 12.6. When the sample correlations are used as similarity measures, variables with large negative correlations are regarded as very dissimilar; variables with large positive correlations are regarded as very similar. In this case, the "dis tance" between clusters is measured as the smallest similarity between mem bers of the corresponding clusters. The complete linkage algorithm, applied to the foregoing similarity matrix, yields the dendrogram in Figure 12.9. We see that variables 1 and 2 (fixed-charge coverage ratio and rate of return on capital), variables 4 and 8 (annual load factor and total fuel costs), and variables 3 and 5 (cost per kilowatt capacity in place and peak kilo watthour demand growth) cluster at intermediate "similarity" levels. Variables 7 (percent nuclear) and 6 (sales) remain by themselves until the final stages. The final merger brings together the (12478) group and the (356) group. • As in single linkage, a "new" assignment of distances (similarities) that have the same relative orderings as the initial distances will not change the configuration of the complete linkage clusters. Average Linkage
Average linkage treats the distance between two clusters as the average distance between all pairs of items where one member of a pair belongs to each cluster. Again, the input to the average linkage algorithm may be distances or simi larities, and the method can be used to group objects or variables. The average linkage algorithm proceeds in the manner of the general algorithm of ( 12-12). We begin by searching the distance matrix D = {dik J to find the nearest (most similar)
Sec. 1 2 . 3
749
H ierarch ical Cl usteri ng M ethods
- .4 - .2 12 .s
.., " t:: 0 � 0
I
I
0
I
.2
.4
·a
]
.6
Vi
-�
.-- -'----
.8 1 .0 2
7
4
3
8
5
6
Variables
Figure 1 2.9 Complete linkage dendrogram for similarities among eight utility company variables.
objects-for example, U and V. These objects are merged to form the cluster ( UV). For Step 3 of the general agglomerative algorithm, the distances between ( UV) and and other cluster W are determined by
d( U V) W
=
2:: 2:: dik
i k N( U V) NW
(12-15)
where dik is the distance between object i in the cluster ( UV) and object k in the cluster W, and N( uv ) and Nw are the number of items in clusters (UV) and W, respectively. Example 1 2.9
(Average linkage clustering of 1 1 languages)
The average linkage algorithm was applied to the "distances" between ll lan guages given in Example 12.5. The resulting dendrogram is displayed in Fig ure 12.10. 10 u c
"
� 0
8 6
4 2 0 E
N
Da
G
Du
Fr
Sp
Languages
p
H
Fi
Figure 1 2. 1 0 Average linkage dendrogram for distances between numbers in 1 1 languages.
750
Chap. 1 2
Clustering, Dista nce Methods, and Ord i nation
A comparison of the dendrogram in Figure 12.10 with the correspond ing single linkage dendrogram (Figure 12.5) and complete linkage dendro gram (Figure 12.8) indicates that average linkage yields a configuration very much like the complete linkage configuration. However, because distance is defined differently for each case, it is not surprising that mergers take place • at different levels. Example 1 2. 1 0 (Average linkage clustering of public utilities)
An average linkage algorithm applied to the Euclidean distances between 22 public utilities (see Table 12.1) produced the dendrogram in Figure 12. 1 1 . 4
3 �
., u c
� i5
2
I
I
I
I
II
n
0 I
1 8 19 14 9
3
6 22 I 0 1 3 20 4
7 12 21 15 2 I I 16 8
5 17
Public utility companies
Figure 1 2. 1 1
Average linkage dendrogram for distances between 22 public utility companies.
Concentrating on the intermediate clusters, we see that the utility com panies tend to group according to geographical location. For example, one intermediate cluster contains the firms 1 (Arizona Public Service), 18 (The Southern Company-primarily Georgia and Alabama), 19 (Texas Utilities Company), and 14 (Oklahoma Gas and Electric Company). There are some exceptions. The cluster (7, 12, 21, 15, 2) contains firms on the eastern seaboard and in the far west. On the other hand, all these firms are located near the coasts. Notice that Consolidated Edison Company of New York and San Diego Gas and Electric Company stand by themselves until the final amalgamation stages. It is, perhaps, not surprising that utility firms with similar locations (or types of locations) cluster. One would expect regulated firms in the same area to use, basically, the same type of fuel(s) for power plants and face common markets. Consequently, types of generation, costs, growth rates, and so forth should be relatively homogeneous among these firms. This is apparently reflected in the hierarchical clustering. •
Sec. 1 2 . 3
Hiera rch ical Cl usteri ng M ethods
751
For average linkage clustering, changes in the assignment of distances (simi larities) can affect the arrangement of the final configuration of clusters, even though the changes preserve relative orderings. Ward's Hierarchical Clustering Method
Ward [22] considered hierarchical clustering procedures based on minimizing the 'loss of information ' from joining two groups. This method is usually implemented with loss of information taken to be an increase in an error sum of squares criterion, ESS. First, for a given cluster k, let ESSk be the sum of the squared devi ations of every item in the cluster from the cluster mean (centroid). If there are currently K clusters, define ESS as the sum of the ESSk or ESS = ESS 1 + ESS2 + . . . + ESS K . At each step in the analysis, union of every possible pair of clus ters is considered, and the two clusters whose combination results in the smallest increase in ESS (minimum loss of information) are joined. Initially, each cluster consists of a single item, and, if there are N items, ESSk = 0, k = 1, 2, . . . , N, so ESS = 0. At the other extreme, when all the clusters are combined in a single group of N items, the value of ESS is given by ESS
=
N
2: (xi i= l
- x ) ' (xi - x )
where xi is the multivariate measurement associated with the jth item and x is the mean of all the items. The results of Ward ' s method can be displayed as a dendrogram. The verti cal axis gives the values of ESS at which the mergers occur. Ward ' s method is based on the notion that the clusters of multivariate obser vations are expected to be roughly elliptically shaped. It is a hierarchical precursor to non-hierarchical clustering methods that optimize some criterion for dividing data into a given number of elliptical groups. We discuss non-hierarchical cluster ing procedures in the next section. Additional discussion of optimization methods of cluster analysis is contained in [4] . Example
1 2. 1 1
(Clustering pure malt scotch whiskies)
Virtually all the world ' s pure malt Scotch whiskies are produced in Scot land. In one study (see [14] ) , 68 binary variables were created measuring characteristics of Scotch whiskey that can be broadly classified as color, nose, body, palate, and finish. For example, there were 14 color charac teristics (descriptions) , including white wine, yellow, very pale, pale, bronze, full amber, red, and so forth. LaPointe qnd Legendre clus tered 1 0 9 pure malt Scotch whiskies, each from a different distillery. The
752
Chap. 1 2
Cl usteri ng, Dista nce Methods, and Ordination
investigators were interested in determining the major types of single-malt whiskies, their chief characteristics, and the best representative. In addi tion, they wanted to know whether the groups produced by the hierarchi cal clustering procedure corresponded to different geographical regions, since it is known that whiskies are affected by local soil, temperature, and water conditions. Scaled binary variables were used with Ward's method to group the 109 pure (single-) malt Scotch whiskies. The resulting dendrogram is shown in Figure 12.12. (An average linkage procedure applied to a similarity matrix produced almost exactly the same classification.) The groups labelled A-L in the figure are the 12 groups of similar Scotches identified by the investigators. A follow-up analysis suggested that these 12 groups have a large geographic component in the sense that Scotches with similar characteristics tend to be produced by distilleries that are located reasonably close to one another. Consequently, the investigators concluded: "The relationship with geographic features was demonstrated, supporting the hypothesis that whiskies are affected not only by distillery secrets and tradi tions but also by factors dependent on region such as water, soil, microcli mate, temperature and even air quality." • Final Comments-Hierarchical Procedures
There are many agglomerative hierarchical clustering procedures besides single linkage, complete linkage, and average linkage. However, all the agglomerative procedures follow the basic algorithm of (12-12). As with most clustering methods, sources of error and variation are not for mally considered in hierarchical procedures. This means that a clustering method will be sensitive to outliers, or "noise points." In hierarchical clustering, there is no provision for a reallocation of objects that may have been "incorrectly" grouped at an early stage. Consequently, the final configuration of clusters should always be carefully examined to see whether it is sensible. For a particular problem, it is a good idea to try several clustering methods and, within a given method, a couple different ways of assigning distances (simi larities). If the outcomes from the several methods are (roughly) consistent with one another, perhaps a case for "natural" groupings can be advanced. The stability of a hierarchical solution can sometimes be checked by applying the clustering algorithm before and after small errors (perturbations) have been added to the data units. If the groups are fairly well distinguished, the clusterings before perturbation and after perturbation should agree. Common values (ties) in the similarity or distance matrix can produce multi ple solutions to a hierarchical clustering problem. That is, the dendrograms corre sponding to different treatments of the tied similarities (distances) can be different, particularly at the lower levels. This is not an inherent problem of any method;
Sec. 1 2 . 3
1 .0
2
3
I
6
I
0.5
0.2
I
A
I
�
.----
c
D
�
r-
Fr-Lr Lji
,---
r--
r--1
G
�
H
I
I
_rLr
.----
J
K� I
� '----
y
L
Figure 1 2. 1 2
whiskies.
753
Number of groups
12
0.7
H ierarchical Cl ustering M ethods
�
0.0 I
Aberfeldy Laphroaig Aberlour Macallan Balvenie Lochside Dalmore Glendullan Highland Park Ardmore Port Ellen Blair Athol Auchentoshan Coleburn Balblair Kinclaith Inchmurrin Caol lla Edradour Aultmore Benromach Cardhu Miltonduff Glen Deveron Bunnahabhain Glen Scotia Springbank Tomintoul Glenglassaugh Rose bank Bruichladdich Deanston Glentauchers Glen Mhor Glen Spey Bowmore Lon grow Glenlochy Glenfarclas Glen Albyn Glen Grant North Port Glengoyne Balmenach Glenesk Knockdhu Convalmore Glendronach Mortlach Glenordie Tormore Glen Elgin Glen Garioch Glencadam Teaninich
Glenugie Scapa Singleton Millburn Benrinnes Strathisla Glenturret Glenlivet Oban Clynelish Talisker Glenmorangie Ben Nevis Speyburn Littlemill Bladnoch Inverleven Pulteney Glenburgie Glenallachie Dalwhinnie Knockando Benriach Glenkinchie Tullibardine lnchgower Cragganmore Longmorn Glen Moray Tamnavulin Glenfiddich Fettercairn Ladyburn Tobermory Ardberg Lagavulin Dufftown Glenury Royal Jura Tamdhu Linkwood Saint Magdalene Glenlossie Tomatin Craigellachie Brackla Dailuaine Dallas Dhu Glen Keith Glenrothes Banff Caperdonich Lochnagar Imperial
A dendrogram for similarities between 1 09 pure malt Scotch
754
Cha p . 1 2
Cl usteri ng, Distance Methods, and O rd i n ation
rather, multiple solutions occur for particular type of data. Multiple solutions are not necessarily bad, but the user needs to know of their existence so that the group ings (dendrograms) can be properly interpreted and different groupings ( dendro grams) compared to assess their overlap. A further discussion of this issue appears in [18] . Some data sets and hierarchical clustering methods can produce inversions. (See [18] .) An inversion occurs when an object joins an existing cluster at a smaller distance (greater similarity) than that of a previous consolidation. An inversion is represented two different ways in the following diagram:
30
32 30
32
20
20
0 A
B
C
D
(i)
A
B
C
D
(ii)
In this example, the clustering method joins A and B at distance 20. At the next step, C is added to the group (AB) at distance 32. Because of the nature of the clustering algorithm, D is added to group (ABC) at distance 30, a smaller dis tance than the distance at which C joined (AB). In (i) the inversion is indicated by a dendrogram with crossover. In (ii), the inversion is indicated by a dendrogram with a nonmonotonic scale. Inversions can occur when there is no clear cluster structure and are gener ally associated with two hierarchical clustering algorithms known as the centroid method and the median method. The hierarchical procedures discussed in this book are not prone to inversions.
1 2.4 NONHIERARCHICAL CLUSTERING METHODS
Nonhierarchical clustering techniques are designed to group items, rather than vari ables, into a collection of K clusters. The number of clusters, K, may either be spec ified in advance or determined as part of the clustering procedure. B ecause a matrix of distances (similarities) does not have to be determined, and the basic data
Sec. 1 2 .4
Nonh ierarchical Clustering M ethods
755
do not have to be stored during the computer run, nonhierarchical methods can be applied to much larger data sets than can hierarchical techniques. Nonhierarchical methods start from either (1) an initial partition of items into groups or (2) an initial set of seed points, which will form the nuclei of clusters. Good choices for starting configurations should be free of overt biases. One way to start is to randomly select seed points from among the items or to randomly par tition the items into initial groups. In this section, we discuss one of tbe more popular nonhierarchical proce dures, the K-means method. K-means Method
MacQueen [16] suggests the term K-m eans for describing an algorithm of his that assigns each item to the cluster having the nearest centroid (mean). In its simplest version, the process is composed of these three steps: 1.
Partition the items into K initial clusters. 2. Proceed through the list of items, assigning an item to the cluster whose cen troid (mean) is nearest. (Distance is usually computed using Euclidean distance with either standardized or unstandardized observations.) Recalculate the cen troid for the cluster receiving the new item and for the cluster losing the item. 3. Repeat Step 2 until no more reassignments take place. (12-16) Rather than starting with a partition of all items into K preliminary groups in Step 1, we could specify K initial centroids (seed points) and then proceed to Step 2. The final assignment of items to clusters will be, to some extent, dependent upon the initial partition or the initial selection of seed points. Experience suggests that most major changes in assignment occur with the first reallocation step. Example
1 2. 1 2
(Clustering using the K-means method)
Suppose we measure two variables X1 and X2 for each of four items A, B, and D. The data are given in the following table:
C,
Observations Item
XI
Xz
A
5 -1 1 -3
3 1 -2 -2
B c D
The objective is to divide these items into K = 2 clusters such that the items within a cluster are closer to one another than they are to the items in dif ferent clusters. To implement the K = 2-means method, we arbitrarily parti tion the items into two clusters, such as (AB) and (CD ), and compute the coordinates 0\ , x2 ) of the cluster centroid (mean). Thus, at Step 1, we have:
756
Chap. 1 2
Cl usteri ng, Distance Methods, and O rd i nation
Coordinates of centroid
xl
Cluster
(AB )
5 + (- 1) 2
(CD )
1 + ( - 3) 2
-
x2 =
2
=
-1
3 + 1
2
2
-2 +
( - 2)
-- =
2
=
-2
At Step 2, we compute the Euclidean distance of each item from the group centroids and reassign each item to the nearest group. If an item is moved from the initial configuration, the cluster centroids (means) must be updated before proceeding. We compute the squared distances
d2 (A, (AB )) = (5 - 2) 2 + (3 - 2) 2 = 10 d 2 (A , (CD ) ) = (5 + 1) 2 + (3 + 2) 2 = 61 Since A is closer to cluster (AB) than to cluster (CD ), it is not reassigned.
Continuing, we get
d 2 (B, (AB )) = ( - 1 - 2) 2 + (1 - 2) 2 = 10 d 2 (B, (CD )) = ( - 1 + 1) 2 + (1 + 2) 2 = 9 and, consequently, B is reassigned to cluster (CD), giving cluster (BCD ) and
the following updated coordinates of the centroid:
Coordinates of centroid
x-2
Cluster
A (BCD )
5
3
-1
-1
Again, each item is checked for reassignment. Computing the squared distances gives the following: Squared distances to group centroids Item Cluster
A
B
c
D
A (BCD )
52
0
40 4
41
89
5
5
Sec. 1 2 .4
Nonh ierarchical Clusteri ng M ethods
757
We see that each item is currently assigned to the cluster with the near est centroid (mean), and the process stops. The final K = 2 clusters are A • and (BCD ). To check the stability of the clustering, it is desirable to rerun the algorithm with a new initial partition. Once clusters are determined, intuitions concerning their interpretations are aided by rearran,sing the list of items so that those in the first cluster appear first, those in the second cluster appear next, and so forth. A table of the cluster centroids (means) and within-cluster variances also helps to delineate group differences. Example
1 2. 1 3
(K-means clustering of public util ities)
Let us return to the problem of clustering public utilities using the data in Table 12.5. The K-means algorithm for several choices of K was run. We pre sent a summary of the results for K = 4 and K = 5. In general, the choice of a particular K is not clear cut and depends upon subject-matter knowledge, as well as data-based appraisals. (Data-based appraisals might include choos ing K so as to maximize the between-cluster variability relative to the within cluster variability. Relevant measures might include I W I / I B + W I [see (6-38)] and tr (W - 1 B).) The summary is as follows: K=4 Cluster
Number of firms
1
5
2
6
3
5
4
6
{ { {
{
Firms Idaho Power Co. (8), Nevada Power Co. (11), Puget Sound Power & Light Co. (16), Virginia Electric & Power Co. (22), Kentucky Utilities Co. (9). Central Louisiana Electric Co. (3), Oklahoma Gas & Electric Co. (14), The Southern Co. (18), Texas Utilities Co. (19), Arizona Public Service (1), Florida Power & Light Co. (6). New England Electric Co. ( 12), Pacific Gas & Electric Co. (15), San Diego Gas & Electric Co. (17), United Illuminating Co. (21), Hawaiian Electric Co. (7). Consolidated Edison Co. (N.Y.) (5), Boston Edison Co. (2), Madison Gas & Electric Co. (10), Northern States Power Co. (13), Wisconsin Electric Power Co. (20), Commonwealth Edison Co. (4).
758
Chap. 1 2
Cl usteri ng, Dista nce Methods, and Ord i n ation
� [�
Distances between Cluster Centers 4 3 2 1
3 4
K=5 Cluster
Number of firms
1
5
2
6
3
5
4
2
5
4
{ { {
3. 8 3.29 3.05
0 3.56 2.84
0 3.18
J
Firms Nevada Power Co. (11), Puget Sound Power & Light Co. (16), Idaho Power Co. (8), Virginia Electric & Power Co. (22), Kentucky Utilities Co. (9). Central Louisiana Electric Co. (3), Texas Utilities Co. (19), Oklahoma Gas & Electric Co. (14), The Southern Co. (18), Arizona Public Service (1), Florida Power & Light Co. (6). New England Electric Co. (12), Pacific Gas & Electric Co. (15), San Diego Gas & Electric Co. (17), United Illuminating Co. (21), Hawaiian Electric Co. (7).
{ Consolidated Edison Co. (N.Y.) (5), Boston Edison Co. (2). { Commonwealth Edison Co. (4), Madison Gas & Electric Co. (10),
Northern States Power Co. (13), Wisconsin Electric Power Co. (20). Distances between Cluster Centers 1 4 5 2 3 1 0 2 3.08 0 3.29 3.56 0 3 4 3.63 3.46 2.63 0 3.18 2.99 3.81 2.89 0 5
The cluster profiles (K = 5 ) shown in Figure 12.13 order the eight vari ables according to the ratios of their between-cluster variability to their within-cluster variability. [For univariate F-ratios, see Section 6.4.] We h ave mean square percent nuclear between clusters 3.335 = 13 1 = Fnuc = .255 mean square percent nuclear within clusters ·
Cluster profiles - variables are ordered by F-ratio size
-- 1 -•
Percent nuclear Total fuel costs Sales Cost per KW capacity in place Annual load factor Peak KWH demand growth Rate of return on capital FIXed-charge coverage ratio
-- 1 --- 1 --= = - 1 - 1 --
----=--=--=- �!-=--=-� ���� •
•
2-
•
--2-
3--3-
-33--3 ----2----2---3---2----3- - - - - - - - - - - - 3- - - - - - - - - - - ------2-----2-2--
•
•
Each column describes a cluster. The cluster number is printed at the mean of each variable. Dashes indicate one standard deviation above and below mean.
Figure 1 2. 1 3
� ....
Cluster profiles (K = 5) for public utility data .
•
•
--4-4-4--4-4 --
-
-����==!===-•
-5 5--5---5--
--- 5 ---
-5-
--5 --•
1 3. 1
F-ratio
-5-
1 2 .4 8.4 5.1 4 .4 2 .7 2 .4 0.5
760
Chap. 1 2
Clusteri ng, Distance Methods, and Ord i nation
so firms within different clusters are widely separated with respect to percent nuclear, but firms within the same cluster show little percent nuclear varia tion. Fuel costs (FUELC) and annual sales (SALES) also seem to be of some importance in distinguishing the clusters. Reviewing the firms in the five clusters, it is apparent that the K-means method gives results generally consistent with the average linkage hierarchi cal method. (See Example 12.10.) Firms with common or compatible geo graphical locations cluster. Also, the firms in a given cluster seem to be • roughly the same in terms of percent nuclear. We must caution, as we have throughout the book, that the importance of
individual variables in clustering must be judged from a multivariate perspective. All of the variables (multivariate observations) determine the cluster means and the
reassignment of items. In addition, the values of the descriptive statistics measuring the importance of individual variables are functions of the number of clusters and the final configuration of the clusters. On the other hand, descriptive measures can be helpful, after the fact, in assessing the "success" of the clustering procedure. Final Comments-Nonhierarchical Procedu res
There are strong arguments for not fixing the number of clusters, K, in advance, including the following: 1. If two or more seed points inadvertently lie within a single cluster, their
resulting clusters will be poorly differentiated. 2. The existence of an outlier might produce at least one group with very dis perse items. 3. Even if the population is known to consist of K groups, the sampling method may be such that data from the rarest group do not appear in the sample. Forcing the data into K groups would lead to nonsensical clusters.
In cases where a single run of the algorithm requires the user to specify K, it is always a good idea to rerun the algorithm for several choices. Discussions of other nonhierarchical clustering procedures are available in [2], [4], and [10]. 1 2.5 MULTIDIMENSIONAL SCALING
This section begins a discussion of methods for displaying (transformed) multi variate data in low-dimensional space. We have already considered this issue when
Sec. 1 2 . 5
M u ltidi mensional Sca l i n g
761
we discussed plotting scores on, say, the first two principal components or the scores on the first two linear discriminants. The methods we are about to discuss differ frotn these procedures in the sense that their primary objective is to "fit" the original data into a low-dimensional coordinate system such that any distortion caused by a reduction in dimensionality is minimized. Distortion generally refers to the similarities or dissimilarities (distances) among the original data points. Although Euclidean distance may be used to measure the closeness of points in the final low-dimensional configuration, the notion of similarity or dissimilarity depends upon the underlying technique for its definition. A low-dimensional plot of the kind we are alluding to is called an ordination of the data. Multidimensional scaling techniques deal with the following problem: For a set of observed similarities (or distances) between every pair of N items, find a rep resentation of the items in few dimensions such that the interitem proximities "nearly match" the original similarities (or distances). It may not be possible to match exactly the ordering of the original similari ties (distances). Consequently, scaling techniques attempt to find configurations in q ,;;; N - 1 dimensions such that the match is as close as possible. The numerical measure of closeness is called the stress. It is possible to arrange the N items in a low-dimensional coordinate system using only the rank orders of the N (N - 1)/2 original similarities (distances), and not their magnitudes. When only this ordinal information is used to obtain a geo metric representation, the process is called nonmetric multidimensional scaling. If the actual magnitudes of the original similarities (distances) are used to obtain a geometric representation in q dimensions, the process is called metric multidimen sional scaling. Metric multidimensional scaling is also known as principal coordi
nate analysis.
Scaling techniques were developed by Shepard (see [19] for a review of early work), Kruskal [11, 12, 13], and others. A good summary of the history, theory, and applications of multidimensional scaling is contained in [23]. Multidimensional scal ing invariably requires the use of a computer, and several good computer programs are now available for the purpose. The Basic Algorithm
For N items, there are M = N (N - 1)/2 similarities (distances) between pairs of different items. These similarities constitute the basic data. (In cases where the sim ilarities cannot be easily quantified as, for example, the similarity between two col ors, the rank orders of the similarities are the basic data.) Assuming no ties, the similarities can be arranged in a strictly ascending order as (12-17)
Here S;1 k 1 is the smallest of the M similarities. The subscript i1 k 1 indicates the pair of items that are least similar-that is, the items with rank 1 in the similarity order-
762
Chap.
12
Clustering, Distance Methods, and Ordination
ing. Other subscripts are interpreted in the same manner. We wantl to find a q-dimensional configuration of the N items such that the distances, dil , between pairs of items match the ordering in (12-17). If the distances are laid out in a man ner corresponding to that ordering, a perfect match occurs when (12-18)
That is, the descending ordering of the distances in q dimensions is exactly analo gous to the ascending ordering of the initial similarities. As long as the order in (12-18) is preserved, the magnitudes of the distances are unimportant. For a given value of q, it may not be possible to find a configuration of points whose pairwise distances are monotonically related to the original similarities. Kruskal [11] proposed a measure of the extent to which a geometrical representa tion falls short of a perfect match. This measure, the stress, is defined as (12-19) in
The d;�) 's the stress formula are numbers known Jo lsatisfy (12-18); that is, they are monotonically related to the similarities. The dil 's are not distances in the sense that they satisfy the usual distance properties of (1-25). They are merely ref erence numbers used to judge the nonmonotonicity of the observed dl%) 's. The idea is to find a representation of the items as points in q-dimensions such that the stress is as small as possible. Kruskal [11] suggests the stress be infor mally interpreted according to the following guidelines: Stress
Goodness of fit
Poor Fair (12-20) Good Excellent Perfect Goodness of fit refers to the monotonic relationship between the similarities and the final distances. A second measure of discrepancy, introduced by Takane et al. [21] is becom ing the preferred criterion. Fqr a given dimension q, this measure, denoted by SStress, replaces the d;k's and d;k's in (12-19) by their squares and is given by 20% 10% 5% 2.5% 0%
Sec. 1 2 . 5
SStress
M u ltid i mensional Sca l i n g
763
(12-21)
=
The value of SStress is always between 0 and 1 . Any value less than .1 is typically taken to mean that there is a good representation of the objects by the points in the given configuration. Once items are located in q dimensions, their q X 1 vectors of coordinates can be treated as multivariate observations. For display purposes, it is convenient to represent this q-dimensional scatter plot in terms of its principal component axes. (See Chapter 8.) We have written the stress measure as a function of q, the number of dimen sions for the geometrical representation. For each q, the configuration leading to the minimum stress can be obtained. As q increases, minimum stress will, within rounding error, decrease and will be zero for q = N 1 . Beginning with q = 1, a plot of these stress(q) numbers versus q can be constructed. The value of q for which this plot begins to level off may be selected as the "best" choice of the dimensionality. That is, we look for an "elbow" in the stress-dimensionality plot. The entire multidimensional scaling algorithm is summarized in these steps: -
1. For N items, obtain the
M
N(N 1 ) /2 similarities (distances) between distinct pairs of items. Order the similarities as in (12-17) . (Distances are ordered from largest to smallest.) If similarities (distances) cannot be com puted, the rank orders must be specified. 2. Using a trial configuration in q dimensions, determine the interitem distances di� ) and numbers J i
-
We have assumed that the initial similarity values are symmetric (sik = ski ) , that there are n o ties, and that there are n o missing observations. Kruskal [ 1 1 , 12] has suggested methods for handling asymmetries, ties, and missing observations. In addition, there are now multidimensional scaling computer programs that will
764
Chap. 1 2
Cl ustering, Dista nce Methods, and O rdination
handle not only Euclidean distance, but any distance of the Minkowski type. [See (12-3).] The next three examples illustrate multidimensional scaling with distances as the initial (dis)similarity measures. Example
1 2. 1 4
(Multidimensional scaling of U.S. cities)
Table 12.7 displays the airline distances between pairs of selected U.S. cities. Since the cities naturally lie in a two-dimensional space (a nearly level part of the curved surface of the earth), it is not surprising that multidimen sional scaling with q = 2 will locate these items about as they occur on a map. Note that if the distances in the table are ordered from largest to small est-that is, from a least similar to most similar-the first position is occu pied by dBoston L A = 3052. A multicli��nsional scaling plot for q = 2 dimensions is shown in Figure 12.14. The axes lie along the sample principal components of the scatter plot. A plot of stress(q) versus q is shown in Figure 12.15 on page 766. Since stress(1) X 100% = 12%, a representation of the cities in one dimension (along a single axis) is not unreasonable. The "elbow" of the stress function occurs at q = 2. Here stress(2) X 100% = 0.8%, and the "fit" is almost perfect. The plot in Figure 12.15 indicates that q = 2 is the best choice for the dimension of the final configuration. Note that the stress actually increases Spokane •
Boston •
.8 -
Columbus Indianapolis • • • Cincinnati • St. Louis
.4 :-
0 11-
Los Angeles
Dallas
•
•
Atlanta Memphis • Little Rock
e
•
Tampa
- .4 1-
•
1-
- .8
I-
I
- 2.0
I - 1 .5
Figure 1 2. 1 4
sional scaling.
I - 1 .0
I - .5
I 0
I
.5
L
1 .0
A geometrical representation of cities produced by multidimen
1 .5
TABLE 1 2. 7
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12)
�
AI R LI N E- D I STAN C E DATA
Atlanta
Boston
Cincinnati
Columbus
D allas
Indianapolis
Little Rock
Los Angeles
Memphis
St. Louis
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
Spokane Tampa
(11)
(12)
0 1068 461 549 805 508 505 2197 366 558 2467 467
0 867 769 1819 941 1494 3052 1355 1178 2747 1379
0 107 943 108 618 2186 502 338 2067 928
0 1050 172 725 2245 586 409 2131 985
0 882 325 1403 464 645 1891 1077
0 562 2080 436 234 1 959 975
0 1701 137 353 1988 912
0 1831 1848 1227 2480
0 294 2042 779
0 1820 1016
0 2821
0
766
Chap. 1 2
Cl ustering, Dista nce Methods, and Ordi nation Stress
.12
.10
.08
.06
.04
.02
2
4
6
q
Figure 1 2 . 1 5 Stress function for airline distances between cities.
for q = 3. This anomaly can occur for extremely small values of stress because of difficulties with the numerical search procedure used to locate the minimum stress. • Example 1 2. 1 5
(Multidimensional scaling of public utilities)
Let us try to represent the 22 public utility firms discussed in Example 12.8 as points in a low-dimensional space. The measures of (dis)similarities between pairs of firms are the Euclidean distances listed in Table 12. 1 . Mul tidimensional scaling in q = 1, 2, . . . , 6 dimensions produced the stress func tion shown in Figure 12.16. The stress function in Figure 12.16 has no sharp elbow. The plot appears to level out at "good" values of stress (less than or equal to 5%) in the neigh borhood of q = 4. A good four-dimensional representation of the utilities is achievable, but difficult to display. We show a plot of the utility configuration obtained in q = 2 dimensions in Figure 12.17. The axes lie along the sample principal components of the final scatter. Although the stress for two dimensions is rather high (stress(2) X 100% = 19% ), the distances between firms in Figure 12.17 are not wildly inconsis tent with the clustering results presented earlier in this chapter. For example, the midwest utilities-Commonwealth Edison, Wisconsin Electric Power (WEPCO), Madison Gas and Electric (MG & E), and Northern States Power (NSP)-are close together (similar). Texas Utilities and Oklahoma Gas and
Sec. 1 2 .5
767
M u ltidi mensional Sca l i n g
Stress
Figure 1 2. 1 6
1 .5 -
Stress function for distances between utilities.
San Dieg. G&E •
1 .0 -
• Pac. G&E
•
e N . Eng. El.
Kent Uti!. •
Bost. Ed. VEPCO •
o -
- .5 -
- 1.0
Pug. Sd. Po. •
•
Idaho Po.
Southern Co. •
•
•
e WEPCO
Flor. Po. & Lt. •
Common. Ed .
• M.G.&.E.
Ariz. Pub. Ser.
- Nev. Po. •
Con. Ed. •
Haw. El. •
Unit. Ill. Co. •
.5 -
q
8
6
4
2
0
e Qk. G. & E. • Tex. Uti!.
•
Cent. Louis.
NSP . •
- 1 .5 -
I - 1 .5 Figure 1 2. 1 7
I - 1 .0
I - .5
I 0
I .5
I 1 .0
A geometrical representation of utilities produ ced by m u ltidi mensional scaling.
I
1 .5
768
Chap. 1 2
Cl usteri ng, Dista nce Methods, and O rd i n ation
Electric (Ok. G & E) are also very close together (similar). Other utilities tend to group according to geographical locations or similar environments. The utilities cannot be positioned in two dimensions such that the interutility distances di
(Multidimensional scaling of u niversities)
Data related to 25 U.S. universities are given in Table 12.9 on page 782. (See Example 12.18.) These data give the average SAT score of entering freshmen, percent of freshmen in top 10% of high school class, percent of applicants accepted, student-faculty ratio, estimated annual expense, and graduation rate (% ). A metric multidimensional scaling algorithm applied to the stan dardized university data gives the two-dimensional representation shown in Figure 12.18. Notice how the private universities cluster on the right of the 4 1-
2 1-
TexasA&M
PennS tate
o -
UVirginia NotreDame Brown UCBerkeley G �6�!?Jwn Duke Harvard DartmollthPrinceton UMichigan y: 1 Stanford UPen Northwestern "columbia MIT a e UChicago
UWisconsin
Purdue
CarnegieMellon -2 -
JohnsHopkins
CalTech
-4 l
-4 Figure 1 2. 1 8
-2
_l
0
I
2
A two-dimensional representation of universities prod uced by metric multidimensional scaling.
Sec. 1 2 .5
M u ltidimensional Sca l i n g
769
4 -
2 -
UCBerkeley TexasA&M
PennState UMichigan
o -
Purdue
. NotreDameGeorgetown Brown Princeton UVirginia Cornell Duke Dartmouth Harvard UPenn Stanford Northwestern Yale Columbia MIT UChicago
UWisconsin CarnegieMellon
-2 -
JohnsHopkins
-4 r-
I
-4
I
-2
I
0
CalTech
I
2
Figure 1 2. 1 9
A two-dimensional representation of u niversities produced by nonmetric multidimensional scaling.
plot while the large public universities are, generally, on the left. A nonmetric multidimensional scaling two-dimensional configuration is shown in Figure 12.19. For this example, the metric and nonmetric scaling representations are very similar, with the two dimensional stress value being approximately 10% • for both scalings. Classical metric scaling, or principal coordinate analysis, is equivalent to plot ing the principal components. Different software programs choose the signs of the appropriate eigenvectors differently, so at first sight, two solutions may appear to be different. However, the solutions will coincide with a reflection of one or more of the axes. ( See [17].) To summarize, the key objective of multidimensional scaling procedures is a low-dimensional picture. Whenever multivariate data can be presented graphically in two or three dimensions, visual inspection can greatly aid interpretations.
770
Chap. 1 2
Cl ustering, Dista nce Methods, and Ord i n ation
When the multivariate observations are naturally numerical, and Euclid ean distances in p-dimensions, dlf > , can be computed, we can seek a q < p dimensional representation by minimizing (12-23) In this alternative approach, the Euclidean distances in p and q dimensions are compared directly. Techniques for obtaining low-dimensional representations by minimizing are called nonlinear mappings. The final goodness of fit of any low-dimensional representation can be depicted graphically by minimal spanning trees. (See [10) for a further discussion of these topics.)
E
1 2.6 CORRESPONDENCE ANALYSIS
Developed by the French, correspondence analysis is a graphical procedure for rep resenting associations in a table of frequencies or counts. We will concentrate on a two-way table of frequencies or contingency table. If the contingency table has I rows and J columns, the plot produced by correspondence analysis contains two Sets of points: a set of I points corresponding to the rows and a set of J points cor responding to the columns. The positions of the points reflect associations. Row points that are close together indicate rows that have similar profiles (conditional distributions) across the columns. Column points that are close together indicate columns with similar profiles (conditional distributions) down the rows. Finally, row points that are close to column points represent combinations that occur more frequently than would be expected from an independence model that is, a model in which the row categories are unrelated to the column categories. The usual output from a correspondence analysis includes the "best" two dimensional representation of the data, along with the coordinates of the plotted points, and a measure (called the inertia) of the amount of information retained in each dimension. Before briefly discussing the algebraic development of contingency analysis, it is helpful to illustrate the ideas we have introduced with an example. Example 1 2. 1 7 (Correspondence analysis of archaeological data)
Table 12.8 contains the frequencies (counts) of J = 4 different types of pot tery (called potsherds) found at I = 7 archaeological sites in an area of the American Southwest. If we divide the frequencies in each row (archaeologi cal site) by the corresponding row total, we obtain a profile of types of pot tery. The profiles for the different sites (rows) are shown in a bar graph in Figure 12.20(a). The widths of the bars are proportional to the total row fre quencies. In general, the profiles are different; however, the profiles for sites P1 and P2 are similar, as are the profiles for sites P4 and P5.
Sec. 1 2 . 6
TABLE 1 2.8
771
Correspondence Analysis
F R EQU E N C I ES O F TYP ES OF POTIERY
Type Site
A
B
c
D
Total
PO P1 P2 P3 P4 P5 P6
30 53 73 20 46 45 16
10 4 1 6 36 6 28
10 16 41 1 37 59 169
39 2 1 4 13 10 5
89 75 116 31 132 120 218
Total
283
91
333
74
781
Source: Data courtesy of M. J. Tret er. ( Type By Site )
( Site by Type )
I
0.75
Ll
pO pi
p2 p3 p4
p5
p6
Site
I
0 . 75
� Cii
0.5
0.25
I ,_;
0
a
c
b Type
d
(b)
(a) Figure 1 2.20
Site and pottery type profiles for the data in Table 1 2 . 8 .
The archaeological site profiles for different types of pottery (columns) are shown in a bar graph in Figure 12.20(b ). The site profiles are constructed using the column totals. The bars in the figure appear to be quite different from one another. This suggests that the various types of pottery are not dis tributed over the archaeological sites in the same way. The two-dimensional plot from a correspondence analysis 2 of the pot tery type-site data is shown in Figure 12.21. The plot in Figure 12.21 indicates, for example, that sites P1 and P2 have similar pottery type profiles (the two points are close together), and sites 2 The
JMP software was used for a correspondence analysis of the data in Table
12.8.
772
Chap. 1 2
Cl ustering, Distance M ethods, and Ord i n ation A.f = .28 (55% ) 1 .0 -
a
a p3
-0.5 a
X A
pi
a p2
0.0
a p5
- 0.5 -
a p4
-
X D
pO
"i = . 1 7 (33%)
�
a p6
X C
- 1 .0 I
I
- 1 .0
-0.5
[!I Figure 1 2.21
I
0.0
0.5
I
1 .0
c2 Type
Site
A correspondence analysis plot of the pottery type-site data.
PO and P6 have very different profiles (tl:le points are far apart). The individ ual points representing the types of pottery are spread out, indicating that their archaeological site profiles are quite different. These findings are con sistent with the profiles pictured in Figure 12.20. Notice that the points PO and D are quite close together and separated from the remaining points. This indicates that pottery type D tends to be associated, almost exclusively, with site PO. Similarly, pottery type A tends to be associated with site P1 and, to lesser degrees, with sites P2 and P3. Pottery type B is associated with sites P4 and P5, and pottery type C tends to be asso ciated, again, almost exclusively, with site P6. Since the archaeological sites represent different periods, these associations are of considerable interest to archaeologists. The number A � = .28 at the end of the first coordinate axis in the two dimensional plot is the inertia associated with the first dimension. This iner tia is 55% of the total inertia. The inertia associated with the second dimension is A i = .17, and the second dimension accounts for 33% of the total inertia. Together, the two dimensions account for 55% + 33% = 88% of the total inertia. Since, in this case, the data could be exactly represented in three dimensions, relatively little information (variation) is lost by represent ing the data in the two-dimensional plot of Figure 12.21. Equivalently, we may regard this plot as the best two-dimensional representation of the multi-
Sec . 1 2 . 6
Correspondence Analysis
773
dimensional scatter of row points and the multidimensional scatter of column points. The combined inertia of 88% suggests that the representation "fits" the data well. In this example, the graphical output from a correspondence analysis shows the nature of the associations in the contingency table quite clearly. • Algebraic Development of Correspondence Analysis
To begin, let X, with elements xii • be an I X J two-way table of unsealed frequen cies or counts. In our discussion we take I > J and assume that X is of full column rank J. The rows and columns of the contingency table X correspond to different categories of two different characteristics. As an example, the array of frequencies of different pottery types at different archaeological sites shown in Table 12.8 is a contingency table with I = 7 archaeological sites and J = 4 pottery types. It is convenient to base the graphical representation of association in a con tingency table on a suitably centered and scaled matrix. If n is the total of the fre quencies in X, we first construct a matrix of proportions P = pii } by dividing each element of X by n. Hence
{
X· ·
= __!!_ n, i=
P ii
1 , 2, . . . , I, j
=
1, 2, . . . , J, or
p
(/ X J)
= .!.n X (/ X J )
(12-24)
The matrix P is called the correspondence matrix. Next, P is centered by subtract ing the product of the row total and the column total for each entry. This opera tion produces Pii
= P ii
-
r;ci , i = 1, 2, . . . , I, j = 1, 2, . . . , J, or P = P
- rc'
(12-25)
where
i = 1, 2, . . . , I, or r; = '2: P ii = '2: __!!_ , i= 1 j=1 n J
J
I
I
X· ·
X· ·
P ii = '2: __!!_ , ci = i'2: i= 1 n =t
and 1'
r - r
=
j
=
1 , 2, . . . , J,
r
(/ X 1 )
c
or
[1, 1 , . . . , 1 ] . We note that rank ( P )
= 0.
(J X l) ,;::;;
J
= p
1
= P'
1
(/ X J) (I X 1 )
(12-26)
(J X /) (/ X 1 )
- 1 since P 1
= P1
- rc'1
=
Define the diagonal matrices
D, =
diag ( r1 , r2 , . . . , r1 ) and
and construct the scaled matrix
p*
(/ X J)
so that the (i, j )th entry of P * is
= n-1 /2 r
(/ X /)
De = P
diag ( c 1 , c2 , . . . , c1 )
D -112 c
(/ X J) (J X J )
(12-27) (12-28)
774
Chap. 1 2
Cl ustering, Distance Methods, and O rd i n ation
P ij - T;Cj P;*j = vr:c, , I I
z
= 1, 2, . . . , I, j = 1 , 2,
. .
. ,J
(12-29)
We are now ready to outline the steps leading to a picture of the association in a two-way table. Step I. Find the singular value decomposition of p * (see Result 2A.15). We have
p* =
(I X !)
where rank (P * ) = rank ( P )
�
V' A u / X (J - 1 ) (J - 1 ) X (J - 1 ) (J - 1 ) X J
J -
(12-30)
1,
U' U = V' V = I
and the diagonal matrix A = diag (,\ 1 , ,\ 2 , , ,\1 _ 1 ) contains the singular values, ordered from largest to smallest, along its main diagonal. Step 2. Set U = D� I2 U and_V = D�I2 V, then, using (12-28) and (12-30), the singular value decomposition of P is • • •
J- 1
,\ u v ' ji = p - rc' = UAV' = " £,; 1 1 1
(12-31) j 1 where uj is the jth column vector of U and vj is the jth column vector of V. In this representation, the left and right singular vectors are normalized to have unit lengths in the metrics D; 1 and D;\ respectively. That is, .
.
=
o r- 1 u ((J - 1 ) X /) (/ X /) (/ X (J - 1 ))
o c- 1 ((J - 1) X J) (J X J)
v
(.1 X (J - 1 ))
I
(12-32)
(J - 1) X (J - I )
The columns of U define the coordinate axes for the points representing the col umn profiles of P. Similarly, the columns of V define the coordinate axes for the points representing the row profiles of P.
Step 3. Calculate the coordinates of the row profiles y
( / X (J - 1 ))
=
A U o r- 1 (/ X /) (/ X (J - 1)) ((J - 1 ) X (J - l))
(12-33)
and the coordinates of the column profiles z
(J X (J - 1))
=
V A o; 1 (I X !) (J X (.I - 1)) ((J - l) X (J - 1))
(12-34)
The first two columns of Y contain the pairs of coordinates of the row points in the best two-dimensional representation of the data. The first two columns of Z con tain the pairs of coordinates of the column points in the best two-dimensional rep resentation of the data. The points corresponding to these two sets of coordinates can be superimposed on the same graph, and produce a picture like the one in Fig-
Sec. 1 2 . 6
Correspondence Analysis
775
ure 12.21. For a set of row points, or for a set of column points, Euclidean distance in the two-dimensional plot corresponds to a statistical distance between pairs of row (column) profiles in the original data. It is important to remember that there is no direct distance relation between a point representing a row profile and a point representing a column profile.
Step 4. The inertias displayed at the ends of the coordinate axes in the two dimensional plot are the squares of the singular values corresponding to these dimensions. The total inertia is defined as the sum of the squares of all the non zero singular values. Total inertia
=
K
� 11}
(12-35)
i=l
where A 1 ;;;;. A 2 ;;;;. , ;;;;. A K > 0 are the non-zero diagonal elements of A. Here P K = rank ( ) and, ordinarily, rank ( P ) = min (/ 1, J - 1 ) . • • •
-
Approximating the Row (Column) Profiles
Using the singular value decomposition (see (12-31)) J- 1 P = P - rc ' = � A k iik v� we have, for some K
< J
k=l
- 1 , the approximation
P
=
rc
'
K
+ � A k iik v� k =l
=
"
P
(12-36)
(12-37)
Among all matrices of rank K + 1 (or less), the matix P on the right hand side of the approximation minimizes the generalized least squares criterion trace (D; 1 (P - P )D; 1 (P - P ) ' ) I f w e focus, for example, on the rows of P and P , then the rows o f P define a (K + ! )-dimensional subspace that is closest to the rows of P in terms of a weighted sum of squared distances. 3 It is in this sense that we are using the word best in our previous discussion. Using expressions (12-33), (12-34) and (12-31), we can write P in terms of the matrices of coordinates Y and Z as (12-38) P = rc ' + D YA - 1 Z' D r
c
or 3 We could make the same argument using the columns of P and P .
776
Chap . 1 2
Clusteri ng, Dista nce Methods, and Ordi nation
1- 1
p ij = r;cj + r;cj ,2 Y;kZj k / A k (12-3 9) k= l where Yi k is the (i, k )th entry of Y, and zj k is the (j, k )th entry of Z. For an approximation of dimension K + 1 (12-39) runs from 1 to K.
< J
- 1, the sum of the right hand side of
Chi-square Analysis of Association and Correspondence Analysis
The rationale for the correspondence analysis we have described is based on a Chi square analysis of association in a two-way contingency table. The x 2 statistic for measuring the degree of association between the row and column variables in a two-way contingency table with I rows and J columns is ( oij - Eij ) 2 2 "" (12-40) x = £...
E;j
i,j
where O;j = xij is the observed frequency for the (i, j )th cell and E;j = nr;cj is the frequency expected in the (i, j)th cell if the row variable is independent (unrelated) to the column variable. After a little manipulation, and using (12-29), we make same sign can write 2 2 = n "" ( P ;j - r;cj ) = n "" P ;* 2 (12-41) £... X £... j
i, j
r;cj
i, j
In matrix notation, 2 1 K_ = trace(D; (P - rc' )D; 1 ( P - rc' ) ' ) = trace(P * P * ' ) = n A
.2 P ;? i,j
(12-42)
A
Let P = rc' . Since rank (P ) = rank (rc') = min rank (r, c) = 1, then we see from (12-37) and the discussion immediately following it, that rc' is the best rank 1 approximation (corresponding to the independence model) to P, and x 2/n is the discrepancy (in a generalized least squares sense) between P and rc' . It is this dis crepancy, this difference from independence, that correspondence analysis attempts to picture. Inertia
The ith row profile r; with jth element
r;j =
_
x;/n
( ,2 xij ) In }
=
p ij
7;'
j = 1, 2,
..
. ,J
is the ith row of X divided by its sum. Thus, the matrix of row profiles is given by
Sec. 1 2 . 6
Correspondence Ana lysis
R
777
(12-43)
/X]
Similarly, the column profiles cj , j = 1, 2, . . . , 1, are the columns of X divided by their sums so that the ith element of cj is _
C;j =
P ;j
l
- '
cj
= 1, 2, . . . ' I
In matrix notation c
-
]XI
=
[_, J cl cZ .
_,
�:,
= o; 1 P'
(12-44)
.I X ] ] X /
Profiles for our potsherd data are plotted in Figure 12.20. Consider the weighted average R' r of the row profiles, or row centroid. Now R ' r = P ' D; ' r = P' D; ' Pl by (12-26). Further o; ' P l = 1 because by (12-43), D; ' P has entries pjr;. Finally c = P' l so (12-45) Similarly, the weighted average C' c of the column profiles, or column centroid, is (12-46) We are now in a position to define inertia. The total inertia is the weighted sum of the squared distances of the row pro files (or column profiles) to the centroid. Consequently, it is a measure of the over all variation, or differences, in the points representing the row profiles (or column profiles). It can be shown 4 that the inertia associated with the row points is the same as the inertia associated with the column points. Thus I
J
Inertia = 2: r; Ci; - c ) ' D ; ' (r; - c ) = 2: cj (cj - r ) ' D; ' (cj - r ) j= l i= l
(12-47)
or Inertia = 2: r; 2: 1
( P;/ r;
-
cj )2/cj = 2: cj 2: ( P ;/ cj 1
-
rY/r; (12-48)
4 See, for example, Greenacre [9] .
778
Chap. 1 2
Cl ustering, Dista nce Methods, and Ord i n ation
Using the relation J- 1
P - rc' = 2: A k iik v� k=1 the scaling (12-28), and the orthogonality conditions for the ii ' s and the v ' s
x 2/n = trace( D ; 1 ( P - rc' )D; 1 ( P - rc' ) ' ) = 2: A� J- 1
Inertia =
k=l
(12-49)
The rows of the matrices Y and Z give the coordinates of the sets of centered row and column profiles. We can show, using (12-32), that
Y' D Y = r
A2 = Z' D Z c
(12-50)
so the weighted sum of squares of the row points ' coordinates along the k th axis and the weighted sum of squares of the column points ' coordinates along the kth axis are each equal to A� called the kth principal inertia. (These are the A� values printed along the axes in Figure 12.21.) J- 1
How do we interpret a large value for the proportion (Af + A;)/2: Jtf? k=1 Geometrically, we can say that the associations in the centered data are well represented by points in a plane, and this best approximating plane accounts for nearly all the variation in the data beyond that accounted for by the independence model. Algebraically, we can say that the approximation
is very good or, equivalently, that
Final Comments
Correspondence analysis is primarily a graphical technique designed to represent associations in a low-dimensional space. It can be regarded as a scaling method, and can be viewed as a complement to other methods such as multidimensional scaling ( Section 12.5) and biplots ( Section 12.7). Correspondence analysis also h as links to principal component analysis ( Chapter 8) and canonical correlation analy sis ( Chapter 10). The book by Greenacre [9] is a good choice for learning more about correspondence analysis.
Sec. 1 2 . 7
Biplots for Viewi ng Sampling U n its and Va riables
779
1 2.7 BIPLOTS FOR VIEWING SAMPLING UNITS AND VARIABLES
A biplot is a graphical representation of the information in an n X p data matrix. The bi- refers to the two kinds of information contained in a data matrix. The information in the rows pertains to samples or sampling units and that in the columns pertains to variables. When there are only two variables, scatter plots can represent the informa tion on both the sampling units and the variables in a single diagram. This permits the visual inspection of the position of one sampling unit relative to another and the relative importance of each of the two variables to the position of any unit. With several variables, one can construct a matrix array of scatter plots, but there is no one single plot of the sampling units. On the other hand, a two-dimen sional plot of the sampling units can be obtained by graphing the first two princi pal components, as in Section 8.4. The idea behind biplots is to add the information about the variables to the principal component graph. Figure 12.22 gives an example of a biplot for the public utilities data in Table 12.5. Nev. Po. 3
Pug. Sd. Po. X6 2
Idaho Po.
X5
Ok. G. & E. Tex. Uti!.
X3
0
-I
San Dieg. G&E
XI
Flor. Po. & Lt.
Unit. Ill. Co. N. En El .
Con. Ed. Haw. EI.
-2
Figure 1 2.22
.
X8
A biplot of the data on public utilities.
780
Chap. 1 2
Clusteri ng, Distance Methods, and O rd i nation
You can see how the companies group together and which variables con tribute to their positioning within this representation. For instance, X4 = annual load factor and X8 = total fuel costs are primarily responsible for the grouping of the mostly coastal companies in the lower right. The two variables X1 = fixed charge ratio and X2 = rate of return on capital put the Florida and Louisiana com panies together. Constructing Biplots
The construction of a biplot proceeds from the sample principal components. According to Result 8A.1 , the best two-dimensional approximation to the data matrix X approximates the jth observation xi in terms of the sample values of the first two principal components. In particular, (12-51) where el and e2 are the first two eigenvectors of s or, equivalently, of = ( n - 1 ) S. Here Xc denotes the mean corrected data matrix with rows (xi - x ) ' . The eigenvectors determine a plane, and the coordinates of the jth unit (row) are the pair of values of the principal components, (yi 1 , yi 2 ) . To include the information on the variables in this plot, we consider the pair of eigenvectors ( e1 , e2 ) . These eigenvectors are the coefficient vectors for the first t;_w o sample principal components. Consequently, each row of the matrix E = [ e1 , e2 ] positions a variable in the graph, and the magnitudes of the coefficients (the coordinates of the variable) show the weightings that variable has in each prin cipal component. The positions of the variables in the plot are indicated by a vector. Usually, statistical computer programs include a multiplier so that the lengths of all of the vectors can be suitably adjusted and plotted on the same axes as the sampling units. Units that are close to a variable likely have high values on that variable. To interpret a new point x 0 , we plot its principal components E (x0 - x ). A direct approach to obtaining a biplot starts from the singular value decom position (see Result 2A.15), which first expresses the n X p mean corrected matrix Xc as
X�Xc
XC =
n Xp
u
A V'
n Xp p Xp p X p
(12-52)
where A = diag ( ,.\ 1 , ,.\2 , . . . , ,.\P ) and V is an orthogonalAmatrix whose columns are the eigenvectors of X�Xc = (I} - 1 ) S. That is, V = E = [ e1 , e2 , . . . , ep] . Multi plying (12-52) on the right by E , we find (12-53) where the jth row of the left-hand side,
Sec . 1 2 . 7
Biplots for Viewi ng Sampling U n its a n d Varia b l es
781
is just the value of the principal components for the jth item. That is, U A contains all of the values of the principal components, while V = E contains the coefficients that define the principal components. The best rank 2 approximation to Xc is obtained by replacing A by A * = diag (A 1 , A 2 , 0, . . . , 0) . This result, called the Eckart-Young theorem, was established in Result 8.A.l. The approximation is then (12-54) where y 1 is the n X 1 vector of values of the first principal component and y2 is the X 1 vector of values of the second principal component. In the biplot, each row of the data matrix, or item, is represented by the point located by the pair of values of the principal components. The ith column of the data matrix, or variable, is represented as an arrow from the origin to the point with coordinates ( eu , e2 J, the entries in the ith column of the second matrix [e 1 , e2 ] ' in the approximation (12-54). This scale may not be compatible with that of the principal components, so an arbitrary multiplier can be introduced that adjusts all of the vectors by the same amount. The idea of a biplot, to represent both units and variables in the same plot, extends to canonical correlation analysis, multidimensional scaling, and even more complicated nonlinear techniques. (See [7].)
n
Example 1 2. 1 8 (A biplot of universities and their characteristics)
Table 12.9 gives the data on some universities for certain variables used to compare or rank major universities. These variables include X1 = average SAT score of new freshmen, X2 = percentage of new freshmen in top 10% of high school class, X3 = percentage of applicants accepted, X4 = student-fac ulty ratio, X5 = estimated annual expenses and X6 = graduation rate (%). Because two of the variables, SAT and Expenses, are on a much differ ent scale from that of the other variables, we standardize the data and base our biplot on the matrix of standardized observations zj . The biplot is given in Figure 12.23 on page 783. Notice how Cal Tech and Johns Hopkins are off by themselves; the vari able Expense is mostly responsible for this positioning. The large state uni versities in our sample are to the left in the biplot, and most of the private schools are on the right. Large values for the variables SAT, ToplO, and Grad are associated with the private school group. Northwestern lies in the middle • of the biplot.
782
Chap. 1 2
Cl usteri ng, Distance M ethods, and Ord i n ation
TABLE 1 2.9
DATA O N U N IVERS ITI ES
University
SAT
Top10
Accept
SFRatio
Expenses
Grad
Harvard Princeton Yale Stanford MIT Duke CalTech Dartmouth Brown JohnsHopkins UChicago UPenn Cornell Northwestern Columbia NotreDame UVirginia Georgetown CarnegieMellon UMichigan UCBerkeley UWisconsin PennS tate Purdue TexasA&M
14.00 13.75 13.75 13.60 13.80 13.15 14.15 13.40 13.10 13.05 12.90 12.85 12.80 12.60 13.10 12.55 12.25 12.55 12.60 11.80 12.40 10.85 10.81 10.05 10.75
91 91 95 90 94 90 100 89 89 75 75 80 83 85 76 81 77 74 62 65 95 40 38 28 49
14 14 19 20 30 30 25 23 22 44 50 36 33 39 24 42 44 24 59 68 40 69 54 90 67
11 8 11 12 10 12 6 10 13 7 13 11 13 11 12 13 14 12 9 16 17 15 18 19 25
39.525 30.220 43.514 36.450 34.870 31.585 63.575 32.162 22.704 58.691 38.380 27.553 21.864 28.052 31.510 15.122 13.349 20.126 25.026 15.470 15.140 11.857 10.185 9.066 8.704
97 95 96 93 91 95 81 95 94 87 87 90 90 89 88 94 92 92 72 85 78 71 80 69 67
Source:
U. S. News
&
World Report, September
18, 1995, 126. p.
1 2.8 PROCRUSTES ANALYSIS: A METHOD FOR COMPARING CONFIGURATIONS
Starting with a given n X n matrix of distances D, or similarities S, that relate n objects, two or more configurations can be obtained using different techniques. The possible methods include both metric and nonmetric multidimensional scaling. The question naturally arises as to how well the solutions coincide. Figures 12.18 and 12.19 in Example 12.16 respectively give the metric multidimensional scaling (principal coordinate analysis) and nonmetric multidimensional scaling solutions for the data on universities. The two configurations appear to be quite similar, but a quantitative measure would be useful. A numerical comparison of two configurations,
Sec. 1 2 . 8
Procrustes Analysis: A Method for Comparing Confi g u rations
783
2
SFRatio
UVirginia NotreDame Brown UCBerkeley Georgetown Cornell
Grad
PennState
TexasA&M
UMichigan 0
Purdue
UWisconsin
UChicago
Accept
-1
Expense CamegieMellon -2
JohnsHopkins CalTech -4 Figure 1 2.23
-2
0
2
A biplot of the data on universities.
obtained by moving one configuration so that it aligns best with the other, is called
Procrustes analysis, after the innkeeper Procrustes, in Greek mythology, who would either stretch or lop off customers ' limbs so they would fit his bed. Constructing the Procrustes Measure of Agreement
Suppose the n X p matrix X * contains the coordinates of the n points obtained for plotting with technique 1 and the n X q matrix Y * contains the coordinates from technique 2, where q � p . By adding columns of zeros to Y * , if necessary, we can assume that X * and Y * both have the same dimension n X p . To determine how compatible the two configurations are, we move, say, the second configuration to
784
Chap. 1 2
Clusteri ng, Distance Methods, and Ord i nation
match the first by shifting each point by the same amount and rotating or reflect ing the configuration about the coordinate axes. 5 Mathematically, we translate by a vector b and multiply by an orthogonal matrix Q so that the coordinates of the jth point yj are transformed to
Qyj
+b
The vector b and orthogonal matrix Q are then varied to order to minimize the sum, over all n points, of squared distances (12-55) between xj and the transformed coordinates Q yj + b obtained for the second tech nique. We take, as a measure of fit, or agreement, between the two configurations, the residual sum of squares
PR 2
=
II
min � ( xj - Q yj - b ) ' ( xj - Q yj - b ) Q, b j = l
(12-56)
The next result shows how to evaluate this Procrustes residual sum of squares mea sure of agreement and determines the Procrustes rotation of Y * relative to X * . Let the n X p configurations X * and Y * both be centered so that all rows have mean zero. Then II p II PR 2 = � xj xj + � yj yj - 2 � A ; j= l j= l i=l * * * * (12-57) = tr [X X ' ) + tr [Y Y ' J - 2tr [ A ) Result 1 2. 1
where
A
=
diag (A1 , A 2 , . . . , Ap ) and the minimizing transformation is p b = 0 Q = � v;u ; = VU' i=l A
Here
A
(12-58)
A, U, and V are obtained from the singular value decomposition ll
� yj xj j= l
=
Y*' X* ( p X 11 ) (n Xp )
=
U
A
V'
( p Xp) ( p Xp) ( p X p)
5 Sibson [20] has proposed a numerical measure of the agreement between two configurations, given by the coefficient
For identical configurations, 'Y completed.
=
'Y
=
l
-
[tr (Y * 'X * X * ' Y * ) 1 12 j 2 tr (X * ' X * )tr (Y * ' Y * )
0. If necessary, 'Y can be computed after a Procrustes analysis has been
Sec. 1 2 . 8
Procr y stes Analy� is: A Method for Comparing Confi g u rations
( � xj = 0 and j� yj = 0 ) we have Proof.
785
Because the configurations are centered to have zero means .
2: (xj - Q yj - b) ' (xj - Qyj - b) = 2: (xj - Qy)' (xj - Q y) + nb ' b j= l j= l The last term is nonnegative, so the best fit occurs for b = 0. Consequently, we need only consider n
n
"
= min 2: (xj - Qyj ) ' (xj - Q yj ) = 2: xj xj + 2: yj yj - 2max 2: xj Q yj Q j= l Q j= l j= l j= l Using xj Q yj = tr [Qyj xj ] , we find that the expression being maximized becomes PR2
n
n
n
n
� xj Q yj = j� tr [Q yjxj ] = tr [ Q j� yjxj ]
By the singular value decomposition, n
p = 2: A;u;v; j= l = [v1 , v2 , . . . , vp ] are p X p orthogonal matri
� yj xf = Y * ' X * =
j= l = [u 1 , u 2 , . . . , up ] and V
where U ces. Consequently,
U A V'
� xj Q yj = tr [ Q ( � A;u;v; ) ] = � A;tr [Qu;v; ) ]
The variable quantity in the ith term
tr [Qu;v; ] = v; Q u;
has an upper bound of 1 as can be seen by applying the Cauchy-Schwarz inequal ity (2-48 ) with b = Q u; and d = v;. That is,
v; Q u;
�
Vv; Q Q ' v; � = v;r;;
X
1 = :J-
Each of these p terms can be maximized by the same choice Q = V U ' . With this choice,
0 0 v; Q u; = v; vu ' u ; = [0, . . . , 0, 1, 0 , . . . , 0 ] 1 = 1 0 0
786
Chap. 1 2
Clusteri ng, Dista nce Methods, and Ord i n ation
Therefore, ll
- 2max :L x] Q yj = - 2(A 1 Q j= l
+
A2
+ ... +
Ap )
Finally, we verify that Q Q' = VU' UV' = VIP V ' = IP , so Q is p matrix, as required. Example 1 2. 1 9
X
p
•
orthogonal
(Procrustes analysis of the data on universities)
Two configurations, produced by metric and non metric multidimensional scaling, of data on universities are given Example 12.16. The two configura tions appear to be quite close. There is a two-dimensional array of coordi nates for each of the two scaling methods. Initially, the sum of squared distances is
25
= 4.090 :L ( xj - y)' ( xj - y) ; j=l
9992 .0405 ] [ -..0405 .9992 11 .999 0.000 ] A= [ 4 0.000 21.418
A computer calculation gives u
=
v
'
=
[ - 1.0000 .0069
J
.0069 1.0000 '
According to Result 12.1, to better align these two solutions, we multiply the nonmetric scaling solution by the orthogonal matrix
Q =
± i= l
V;u ;
=
VU'
=
9994 - .0336 [ ..0336 . 9994 J
This corresponds to clockwise rotation of the nonmetri� solution by about 2 degrees. After rotation, the sum of squared distances, 4�090, is reduced to the Procrustes measure of fit
25 PR 2 = :L xj xj j =l
+
25
2
j= l
j= l
:L yj yj - 2 :L A; = 3.627
•
Example 1 2.20 (Procrustes analysis and additional ordinations of data on forests)
Data were collected on the populations of eight species of trees growing on ten upland sites in southern Wisconsin. These data are shown in Table 12.10. The metric, or principal coordinate, solution and nonmetric multidi mensional scaling solution are shown in Figures 12.24 and 12.25.
Sec. 1 2 . 8
TABLE 1 2. 1 0
787
Procrustes Analysis: A Method for Compari n g Confi g u rations WISCO N S I N FOREST DATA
Site Tree
1
2
3
4
5
6
7
8
9
10
BurOak Black Oak WhiteOak Red Oak AmericanElm Basswood Ironwood SugarMaple
9 8 5 3 2 0 0 0
8 9 4 4 2 0 0 0
3 8 9 0 4 0 0 0
5 7 9 6 5 0 0 0
6 0 7 9 6 2 0 0
0 0 7 8 0 7 0 5
5 0 4 7 5 6 7 4
0 0 6 6 0 6 4 8
0 0 0 4 2 7 6 8
0 0 2 3 5 6 5 9
Source: See
[15] .
4 1-
2 1S3 Sl S2
SIO
0 1S7
S4
S9
ss
S6
-2 1-
S5
I
I
I
I
-2
0
2
4
Figure 1 2.24
Metric multidimensional scaling of the data on forests .
Using the coordinates of the points in Figures 12.24 and 12.25, we obtain the initial sum of squared distances for fit:
2: (xi - y) ' (xi - y) = 26.437 10
i= I
788
Chap. 1 2
C lusteri ng, Distance M ethods, and O rd i n ation
SIO
4 -
S3
2 1S9
Sl
0
S2
-
S8 S4
-2 1S7 S5
I
-2 Figure 1 2.25
S6
I
I
I
0
2
4
Nonmetric multidimensional scaling of the data on forests.
.9826 [ -- .1856 A = [ 41.8035 0.000
] 0.000 ] 21.5566
A computer calculation gives u
=
- .1856 , .9826
v
=
[ - .9935 .1 136 -
- .1136 .9935
J
According to Result 12.1, to better align these two solutions, we multiply the nonmetric scaling solution by the orthogonal matrix
Q =
± V;u; = VU' = [
i=l
_
.9973 .0728 .0728 .9973
J
This corresponds to clockwise rotation of the nonmetric solution by about 4 degrees. After rotation, the sum of squared distances, 26.437, is reduced to the Procrustes measure of fit
Sec. 1 2. 8
Procrustes Analysis: A Method for Com paring Config u rations
PR 2 = L xj xi + L yj yi 1 l 10
10
j=
j=
789
2
- 2 L A i = 25.430 i 1 =
We note that the sampling sites seem to fall along a curve in both pictures. This could lead to a one-dimensional nonlinear ordination of the data. A qua dratic or other curve could be fit to the points. By adding a scale to the curve, we would obtain a one-dimensional ordination. It is informative to view the Wisconsin forest data when both sampling units and variables are shown. A correspondence analysis applied to the data produces the plot in Figure 12.26. The biplot is shown in Figure 12.27. 5
2 -
6
Ironwood
BlackOak
1 -
SugarMaple 4 8
BurOak 0
--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
�Dl�i�anE4u - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
7
1
Basswood
3
2
WhiteOak
- 1 1-
10
'
9
lltedOak I
-2
I
-1 Figure 1 2.26
'
i
0
I
I
2
The correspondence analysis plot of the data on forests.
All of the plots tell similar stories. Sites 1-5 tend to be associated with species of oak trees, while sites 7-10 tend to be associated with basswood, iron wood, and sugar maples. American elm trees are distributed over most sites, but are more closely associated with the lower numbered sites. There is almost a continuum of sites distinguished by the different species of trees. •
790
Chap. 1 2
Cl usteri ng, Distance Methods, and O rd i n ation
3
2
9
3 I
10
2 BlackOak
Ironwood SugarMaple
0
-I
B urOak
8 Basswood
7
4 WhiteOak
6
-2
RedOak
5 -2
-I Figure 1 2.27
0
3
2
The biplot of the data on forests.
EXERCISES
12.1. Certain characteristics associated with a few recent U.S. presidents are listed in Table 12.11. TABLE 1 2. 1 1
President 1. 2. 3. 4. 5. 6.
R. Reagan J. Carter G. Ford R. Nixon L. Johnson J. Kennedy
Birthplace Elected (region of first United States) term? Midwest South Midwest West South East
Yes Yes No Yes No Yes
Party Republican Democrat Republican Republican Democrat Democrat
Prior U.S. Served as congressional experience? vice president? No No Yes Yes Yes Yes
No No Yes Yes Yes No
Chap. 1 2
Exercises
791
(a) Introducing appropriate binary variables, calculate similarity coefficient 1 in Table 12.2 for pairs of presidents.
Hint: You may use birthplace as South, non-South.
(b) Proceeding as in Part a, calculate similarity coefficients 2 and 3 in Table 12.2. Verify the monotonicity relation of coefficients 1, 2, and 3 by dis playing the order of the 15 similarities for each coefficient. 12.2. Repeat Exercise 12.1 using similarity coefficients 5, 6, and 7 in Table 12.2. 12.3. Show that the sample correlation coefficient [see (12-11)] can be written as r
=
ad - be b) (a + c) (b + d ) ( c + d )p ; z
�
--------------------------
[ (a +
for two 0-1 binary variables with the following frequencies: Variable 2 0 1
Variable 1
0
1
a
b
c
d
12.4. Show that the monotonicity property holds for the similarity coefficients 1 , 2 , and 3 in Table 12.2. Hint: (b + c) p - (a + d ) . So, for instance, =
a + d a + d + 2 (b +
c)
1 1 + 2 [pj(a + d ) - 1 ]
This equation relates coefficients 3 and 1. Find analogous representations for the other pairs. 12.5. Consider the matrix of distances 1
1 2 3 4
u
2
3
0 2 3
0 4
4
J
Cluster the four items using each of the following procedures. (a) Single linkage hierarchical procedure. (b) Complete linkage hierarchical procedure. (c) Average linkage hierarchical procedure. Draw the dendrograms and compare the results in (a), (b), and (c).
792
Chap. 1 2
Cl ustering, Distance Methods, and O rd i nation
12.6. The distances between pairs of five items are as follows:
1 2 3
4 5
1 0 4 6 1 6
2 0 9 7 3
3
4
5
0 10 5
0 8
0
Cluster the five items using the single linkage, complete linkage, and average linkage hierarchical methods. Draw the dendrograms and compare the results. 12.7. Sample correlations for five stocks were given in Example 8.5. These corre lations, rounded to two decimal places, are reproduced as follows:
Allied Chemical Du Pont Union Carbide Exxon Texaco
Union Allied Chemical Du Pont Carbide Exxon Texaco 1 1 .58 1 .51 .60 .44 1 .39 .39 52 1 .46 .32 .43
Treating the sample correlations as similarity measures, cluster the stocks using the �ingle linkage and complete linkage hierarchical procedures. Draw the dendrqgrams and compare the results. 12.8. Using the distances in Example 12.4, cluster the items using the average linkage hierarchical procedure . praw the dendrogram. Compare the results with those in Examples 12.4 AndI 12.6. 12.9. The vocabulary "richness" of a text can be quantitatively described by counting the words u:;ed once, the words used twice, and so forth. Based on these counts, a linguist proposed the following distances between chapters of the Old Testanwrii book La� entad ons (data courtesy of Y. T. Radday and M. A. Pollatscbek): ·
'
1 Lamentations chapter
1 2 3
4 5
Lamentations chapter 3 4 2
5
0 0 .76 0 2.97 .80 4.88 4.17 .21 0 3.86 1.92 1 .51 .51 0
Chap. 1 2
Exercises
793
Cluster the chapters of Lamentations using the three linkage hierarchical methods we have discussed. Draw the dendrograms and compare the results. 12.10. Use Ward ' s method to cluster the 4 items whose measurements on a single variable X are given in the table below. Measurements X
Item 1
2 1 5 8
2 3 4
(a) Initially, each item is a cluster and we have the clusters {1}
{2}
Show that ESS = 0, as it must.
13 }
{4}
(b) If we join clusters { 1 } and { 2 } , the new cluster { 12 } has
and the ESS associated with the grouping { 12 } , {3 } , { 4 } is ESS = .5 + 0 + 0 = .5. The increase in ESS (loss of information) from the first step to the current step in .5 0 = .5. Complete the following table by determining the increase in ESS for all the possibilities at step 2. -
Increase in ESS
Clusters {12 } { 13 } { 14 } {1 } {1} {1}
{3 } {2} {2 } { 23} { 24 } {2}
{4} {4} {3} {4} {3 } { 34 }
.5
(c) Complete the last two steps, and construct the dendrogram showing the
values of ESS at which the mergers take place.
794
Chap. 1 2
Cl ustering, Distance Methods, and O rd i n ation
12.11. Suppose we measure two variables Xt and X2 for four items A, B, C, and D. The data are as follows:
Observations Item
xt
x2
A B
5 1 -1 3
4 -2 1 1
c D
Use the K-means clustering technique to divide the items into K = 2 clus ters. Start with the initial groups (AB) and (CD ). 12.12. Repeat Example 12.12, starting with the initial groups (A C) and (BD ). Compare your solution with the solution in the example. Are they the same? Graph the items in terms of their (xt , x2 ) coordinates, and comment on the solutions. 12.13. Repeat Example 12.12, but start at the bottom of the list of items, and pro ceed up in the order D, C, B, A. Begin with the initial groups (AB ) and (CD ). [The first potential reassignment will be based on the distances d2 (D, (AB ) ) and d2 (D, (CD ) ).] Compare your solution with the solution in the example. Are they the same? Should they be the same? 12.14. � uppose the centered matrix P = P - rc ' is approximated by the matrix P = A t u t v� , where A t is the largest singular value in the singular-value decomposition of P, and ut and vt are the corresponding left and right sin gular vectors, normalized to have unit lengths in the metrics n; t and o ; t respectively. Using the generalized least squares criterion, trace (D; t (P - P ) D ; t ( P - P ) ' ) show that the weighted sum of squared deviations of the elements of P from 1- 1 J- 1 the corresponding elements of P is 2: AU 2: Af, where A 1 , A 2 , . . . , A1_ 1 are
k=2
k= l
the singular values for P. 12.15. Show that the coordinates Y and Z of the two sets of points in a corre spondence analysis are related by Y = o r- 1 PZA - 1 and Z = o c- 1 P' YA - t
The following exercises require the use of a computer.
12.16. Table 1 1 .9 lists measurements on 8 variables for 43 breakfast cereals. (a) Using the data in the table, calculate the Euclidean distances between pairs of cereal brands.
Chap. 1 2
12.17. 12.18.
12.19. 12.20.
12.21.
12.22.
12.23.
12.24. 12.25.
Exercises
795
(b) Treating the distances calculated in (a) as measures of (dis)similarity, cluster the cereals using the single linkage and complete linkage hierar chical procedures. Construct dendrograms and compare the results. Input the data in Table 11.9 into a K-means clustering program. Cluster the cereals into K = 2, 3, and 4 groups. Compare the results with those in Exer cise 12.16. The national track records data for women are given in Table 1 .7. (a) Using the data in Table 1.7, calculate the Euclidean distances between pairs of countries. (b) Treating the distances in (a) as measures of (dis)similarity, cluster the countries using the single linkage and complete linkage hierarchical pro cedures. Construct dendrograms and compare the results. (c) Input the data in Table 1.7 into a K-means clustering program. Cluster the countries into groups using several values of K. Compare the results with those in Part b. Repeat Exercise 12.18 using the national track records data for men given in Table 8.5. Compare the results with those of Exercise 12.18. Explain any differences. Table 12.12 gives the road distances between 12 Wisconsin cities and cities in neighboring states. Locate the cities in q = 1, 2, and 3 dimensions using multidimensional scaling. Plot the minimum stress(q) versus q and interpret the graph. Compare the two-dimensional multidimensional scaling configu ration with the locations of the cities on a map from an atlas. Table 12.13 on page 797 gives the "distances" between certain archaeologi cal sites from different periods, based upon the frequencies of different types of potsherds found at the sites. Given these distances, determine the coordinates of the sites in q = 3, 4, and 5 dimensions using multidimensional scaling. Plot the minimum stress(q) versus q and interpret the graph. If pos sible, locate the sites in two dimensions (the first two principal components) using the coordinates for the q = 5-dimensional solution. (Treat the sites as variables.) Noting the periods associated with the sites, interpret the two dimensional configuration. A sample of n = 1660 people is cross-classified according to mental health status and socioeconomic status in Table 12.14 on page 798. Perform a correspondence analysis of these data. Interpret the results. Can the associations in the data be well represented in one dimension? A sample of 901 individuals was cross-classified according to three cate gories of income and four categories of job satisfaction. The results are given in Table 12.15 on page 798. Perform a correspondence analysis of these data. Interpret the results. Perform a correspondence analysis of the data on forests listed in Table 12.10, and verify Figure 12.26 given in Example 12.20. Construct a biplot of the pottery data in Table 12.8. Interpret the biplot. Is the biplot consistent with the correspondence analysis plot in Figure 12.21? Discuss your answer.
� ....
TABLE 1 2. 1 2 Appleton (1) (1) (2)
(3) (4) (5) (6) (7) (8) (9) (10) (11) (12)
D I STANCES B ETWE E N C ITI ES I N W I S CO N S I N A N D C ITI ES I N N E I G H B O R I N G STATES
Fort Beloit Atkinson Madison Marshfield (2) (3) (4) (5)
0 130
0
98
33
0
102 103 100 149
50
36 1 64 54 58 359 166
315 91 1 96
257 186
1 85
73 33 377 1 86 94 3 04 97
119 287 1 13
Milwaukee
(6)
Monroe Superior Wausau
(7)
(8)
(9)
Dubuque St. Paul Chicago (10) (12) (11)
0 1 38
77 47 330 139 95 258 146
0 1 84 170 219
0 107
0
45 186
181
161 276
322
362 1 86 61 289
93
130
394 168
0 223
467
351 162
0 215
0
175 275
274 184
0
395
0
TABLE 1 2. 1 3
(1) (2) (3) (4) (5) (6) (7) (8) (9)
D I STAN C E S B ETWEEN ARCHAEOLO G I CAL S ITES
P1980918 (1)
P1931131 (2)
P1550960 (3)
P1530987 (4)
Pl361024 (5)
P1351005 (6)
Pl340945 (7)
P1311137 (8)
P1301062 (9)
0 2.202 1 .004 1 . 108 1.122 0.914 0.914 2.056 1.608
0 2.025 1.943 1.870 2.070 2.186 2.055 1 .722
0 0.233 0.719 0.719 0.452 1 .986 1 .358
0 0.541 0.679 0.681 1 .990 1 .168
0 0.539 1 .102 1.963 0.681
0 0.916 2.056 1.005
0 2.027 1.719
0 1 .991
0
P1980918
P198
Source: Data CourrefetressytoofsitM.e J. Tretdatteer.d
KEY:
� ....
A.D.
0918, P1931131
P193
refers to site dated
A.D.
1131,
and so forth.
798
Chap. 1 2
C l usteri ng, Dista nce M ethods, and O rd i n ation
TABlE 1 2. 1 4
M ENTAl H EALTH STATUS AN D SOC I O ECO N O M I C STATUS DATA
Parental Socioeconomic Status Mental Health Status
A ( High)
B
c
D
E ( Low)
Well Mild symptom formation Moderate symptom formation Impaired
121 188 112 86
57 105 65 60
72 141 77 94
36 97 54 78
21 71 54 71
SourOplecr,e:andAdaptT. A.edC.froRenni m datea, in Srole, L., T. S. Langner, S. T. Michael, P. Kirkpatrick, M. K. rev. ed. (New York: NYU Pres , 1978).
Mental Health in the Metropolis: The Midtown Manhatten Study,
TABlE 1 2. 1 5
I N C O M E AND J O B SATI S FACT I O N DATA
Job Satisfaction Income
<
$25,000 $25,000-$50,000 > $50,000
Very dissatisfied
Somewhat dissatisfied
Moderately satisfied
Very satisfied
42 13 7
62 28 18
184 81 54
207 113 92
YorSource:k: JohnAdaptWieledy,fr1990.om dat) a in Table 8.2 in Agresti, A.,
Categorical Data Analysis
(New
12.26. Construct a biplot of the mental health and socioeconomic data in Table 12.14. Interpret the biplot. Is the biplot consistent with the correspondence analysis plot in Exercise 12.22? Discuss your answer. 12.27. Construct a biplot of the income and job satisfaction in Table 12.15. Inter pret the biplot. Is the biplot consistent with the correspondence analysis plot in Exercise 12.23? Discuss your answer. 12.28. Using the archaeological data in Table 12.13, determine the two-dimen sional metric and nonmetric multidimensional scaling plots. (See Exercise 12.21). Given the coordinates of the points in each of these plots, perform a Procrustes analysis. Interpret the results. REFERENCES
1. Abramowi , and I.ce,A.NatStieogun,nal Bureau eds. of Standards Applied Mathematical SeriU.eSs.. Depar1964.tmenttzof, M.Commer 2. Anderberg, M. R. New York: Academic Press, 1973. Handbook of Mathematical Functions.
55,
Cluster Analysis for Applications.
Chap. 1 2
799
References
3. Cormack, R. M. "A Revieno.w of3 Cl(1971)as ifi,c321-367. ation (with discussion)." 4.5. Gower, Everit , J.B.S.C. "Some Distance (Pr3dopered.)ti.eLondon: Edward Arnol d, 1993.or Methods Used in s of Lat e nt Root and Vect MultivarJ.iatC.e Anal"Mulytsivisar."iate Analysis and(1Mul966)t,id325-338. 6. Gower, imensional Geometry." ( 1 967), 13-25. 7.8. Gnanades Gower, J. iC.kan,, andR.D. J. Hand. London: Chapman and Hall, 1996. New Yor k : John Wi l e y, 1977. 9. emiGreenacr e,, M.1984.J. London: Acad c Pr e s Wiley, 1975.s of Fit to a Nonmetric 11.10. Hart KrHypotusikgalhan,es, J.iJ.s.B."A."Multidimensional no.Scal1inNew(g1964)by YorOpt, 1-ki27.m: Johnizing Goodnes 12. Kruskal, J. B. "Nonm1964)etric, 115-129. Multidimensional Scaling: A Numerical Method. " no. 1 ( 13. KronuQuant skal, J.itaB.tiv, eandApplM.icWiatioshns. "Mul tidSociimensal SciionalencesScal, 07-ing.0"11.SageBeverUniverly HislitsyandPaperLondon: series i n t h e Sage Publnte,icF-atJio, nsand, 1978.P. Legendre. "A Clas ification of Pure Malt Scotch Whiskies." 14. LaPoi 1 (1994)ds., 237-257. 15. LudwigNew, J. A.York: , and John J. F.no.Reynol Wilehy,ods1988.for Clas ification and Analysis of Multivariate Obser 16. MacQueen, J. B. "Some Met vations."Berkeley, CA: University of California Pres (1967), 281-297. 17. Mardi a1979., K. V., J. T. Kent, and J. M. Bibby. New York: Academic Pres s , 18. Morgan, uniqueness and Inversions in Cluster Analy s19. Shepard, is." B.R.J.N.T."Mul, andtiA.dimP.ensno.G.io1Ray.nal(1995)Scal"Non-, i117-134. ng, Tree-Fit ing, and Clustering." no. 4468 (1980) , 390-398. 20. Sibson, R. "Studies in the (Robus t234-238. ness of Multidimensional Scaling" 1 978) , 21. Takane, dimensioY.nal, F.Scal(1977)inYoung, g:, 7-67. AlterandnatiJ.ngDeLeasLeeuw. t Squares"Non-wimthetOptric IinmdialviScaldualinDig ffFeaterencesures."Multi 22. Ward, Jr., J. H. "Hierarchical Grouping(t1o963)Opt, i236-244. mize an Objective Function." 23. Young, F.HiW.l sd,alande, NJ:R. M.LawrHamer ence .Erlbaum Associates, Publishers, 1987.
Journal of the Royal
Statistical Society (A) ,
134,
Cluster Analysis
Biometrika,
53
The Statistician,
17
Biplots.
Methods for Statistical Data Analysis of Multivariate Observations.
Theory and Applications of Correspondence Analysis.
Clustering Algorithms.
Psychometrika,
29,
Psy
chometrika,
29,
Applied Statistics,
43,
Statistical Ecology-a Primer on Methods and Com
puting.
Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Proba
bility,
1,
Multivariate Analysis.
Applied Statistics,
44,
Science,
210,
Journal of the Royal
Statistical Society (B) ,
40
W.
cometrika,
Psy
42
Journal of
the American Statistical Association,
58
Multidimensional Scaling: History, Theory, and Appli
cations.
Appendix ,.
Table 1 Standard Normal Probabilities Table 2 Student ' s t-Distribution Percentage Points Table 3 x2 Distribution Percentage Points Table 4 F-Distribution Percentage Points (a = .10) Table 5 F-Distribution Percentage Points (a = .05) Table 6 F-Distribution Percentage Points (a = .01)
800
Append ix
TABLE 1
z
STAN DARD N O RMAL PROBAB I LITI ES
.00
.0 1
.02
.03
.04
.05
.06
.07
.08
.09
. 5040 . 5438 .5832 .62 1 7 .65 9 1 .6950
. 5080 . 5478 .587 1 .6255 .6628
. 5 1 20
. 5 1 60 .5557 .5948 .633 1
.5239 .5636 .6026 .6406 .6772
.53 1 9 .57 1 4 .6 1 03 .6480 . 6844
.6985 .7324 .7642 .7939 . 82 1 2
.70 1 9 .7357 .7673 .7967 .8238
. 5 1 99 . 5 5 96 . 5 987 .6368 .6736 . 7088 . 7422 .7734 . 8023 . 8289
.5279
. 55 1 7 .59 1 0 .6293 .6664
.5359 .5753 .6 1 4 1 .65 1 7 . 6 879 . 7224
.6 .7 .8 .9
. 5000 .5398 . 5793 .6 1 79 .6554 . 69 1 5 . 7257 .7 5 80 .788 1 . 8 1 59
1 .0
. 84 1 3
. 8485
.8686
. 8708
. 8508 .8729
.853 1
. 8643
. 8438 . 8665
. 846 1
1.1 1 .2
. 8 749
. 8 849
. 8869
.8888
.8907
. 8925
.8944
1 .3
. 9032
.9099
1 .6
. 9 1 92 .9332 . 9452
. 9066 .9222 .9357 . 9474
. 9082
1 .4 1 .5
. 9049 . 9207 .9345 . 9463
. 9236 .9370 . 9484
1 .7
. 9554
.9564
1 .8 1 .9
. 964 1 . 97 1 3
.9649 . 97 1 9
.9573 .9656 .9726
.9582 .9664 .9732
2.0 2. 1
. 9772 .982 1 . 986 1 .9893 . 99 1 8 .9938 .9953 . 9965 . 9974 .998 1
.9778 .9826 . 9864 .9896 . 9920 .9940 .9955 . 9966 .9975 .9982
.9783 . 9830 .9868 .9898 .9922 . 994 1 .9956 . 9967 .9976 . 9982
.9788 .9834
.9987 . 9990 . 9993 .9995 . 9997
.9987 .999 1 . 9993 . 9995 .9997 .9998
. 9987 . 999 1 . 9994 . 9995 . 9997 .9998
.0 .1 .2 .3 .4 .5
2. 2
2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3. 1 3.2 3.3 3.4 3.5
.9998
.729 1 .76 1 1 . 79 1 0 . 8 1 86
.6700 . 7054 .7389 . 7703 .7995 . 8264
.5675 .6064 .6443 .6808 .7 1 57 . 7486 . 7794 . 8078 . 8340
. 7 1 90 . 75 1 7 . 7823 . 8 1 06 .8365
, 85 54
. 8577
. 8599
. 862 1
. 8770
. 8790
.88 1 0
. 8830
. 8962
. 8980
. 8997
. 90 1 5
.9 1 1 5
.91 3 1
. 9 1 47
. 9 1 62
. 9 1 77
. 925 1 .9382
. 9265 .9394
.9279 .9406
. 9306 . 9429
. 93 1 9 . 944 1
. 9495
. 9505
. 95 1 5
. 9292 . 94 1 8 . 9525
.9535
. 9545
.959 1
.9599
. 967 1 .9738
.9678 .9744
. 9608 . 9686 . 9750
.96 1 6 . 9693 .9756
. 9625 . 9699 .976 1
.9633 . 9706 . 9767
.9798 . 9842
.9803 .9846
. 9808 . 9850
.98 1 2 . 9854
. 98 1 7 .9857
.987 1 . 990 1 .9925 .9943 .9957 . 9968 .9977 .. .9983
.9793 .9838 .9875 .9904 .9927 .9945 .9959 . 9969 .9977 . 9984
.9878
.988 1
. 9906 .9929 .9946 . 9960 . 9970 .9978 .9984
.9909 .993 1 .9948 .996 1 . 997 1 . 9979 . 9985
.9884 .99 1 r .9932 . 9949 . 9962 . 9972 . 9979
. 99 1 6 .9936 .9952 . 9964 . 9974 .998 1 .9986
.9988 . 999 1 . 9994 .9996 .9997
.9988 .9992 . 9994 . 9996 .9997
. 9989 .9992 . 9994 . 9996 . 9997
.9989 . 9992 . 9994 . 9996 . 9997
. 99 89 . 9992 . 9995 .9996 . 9997
.9887 . 99 1 3 . 9934 .995 1 . 9963 .9973 .9980 . 9986 . 9990 . 9993 . 9995 . 9996 . 9997
. 9990 .9993 . 9995 . 9997 .9998
.9998
. 9998
.9998
. 9998
.9998
.9998
. 9998
. 7 1 23 .7454 . 7764 . 805 1 .83 1 5
.9985
. 7549 .7852 . 8 1 33 . 8 3 89
.9890
801
802
Appendix
TABLE 2
d.f. p
I 2 3 4 5 6
7 8 9
10 II 12 13 14 15 16
A 0
.250
. 1 00
.050
.025
1 .000 .816 . 765 .74 1 . 727 .7 1 8 .71 1 . 706 .703
3 .078 1 . 886 1 .6 3 8 1 .533 1 .476 1 .440 1 .4 1 5 1 .3 97
6.3 1 4
.700 .697 .695 .694
1 . 372 1 .363 1 . 356 1 .350 1 . 345 1 . 34 1
2. 920 2.353 2. 1 32 2 .0 1 5 1 . 943 1 . 895 1 . 860 1 . 833 1 .8 1 2 1 . 796 1 . 782
.692 .69 1
18 19
.690 .689 .688 .688
17
STU D ENT' S t-D I STRI B UTION PERC ENTAG E P O I NTS
20
.687
2I
.686
22 23 24 25 26 27 28
.686 .685 .685 .684 .684 .684 .683
29 30 40 60 1 20
.683 .683 .68 1 .679 .677 .674
00
1 .3 8 3
1 .77 1
1 .3 3 7 1 .333
1 . 76 1 1 . 753 1 .746 1 . 740
1 .3 3 0 1 . 328 1 . 325
1 . 734 1 . 729 1 . 725 1 . 72 I
.0 1 0
.00833
.00625
.005
1 2. 706
3 1 . 82 1
38. 1 90
50.923
4.303 3 . 1 82 2.776 2.57 1 2.447 2.365 2. 306 2.262
6.965 4.54 1 3 . 747 3 . 365 3 . 1 43 2.998 2. 896
7.649 4.857 3 . 96 1 3 .534 3.287 3 . 1 28 3 .0 1 6
8. 860 5 . 3 92 4.3 1 5 3.8 1 0 3.52 1
63.657 9.925 5 . 84 1 4.604 4.032 3 .707 3 .499 3.355
2.228 2.20 1 2. 1 79 2. 1 60 2. 1 45 2. 1 3 1 2. 1 20 2. 1 1 0 2. 1 0 1 2.093 2.086
1 .3 I I 1 . 3 IO 1 . 303 1 .296 I .289
1 .7 I 7 1 .7 I 4 1.7I I 1 . 708 1 .706 1 .703 1 .70 1 1 .699 1 .697 1 .684 1 .6 7 I 1 .658
2.080 2 .074 2.069 2.064 2.060 2.056 2.052 2.048 2.045 2.042 2.02 1 2.000 1 .980
1 .282
1 .645
1 .960
I .323 1 .3 2 I 1 .3 I 9 I .3 I 8 1 .3 I 6 1 .3 1 5 1 .3 1 4 1 .3 1 3
a
tv ( a)
2.82 1 2. 764 2 .7 1 8 2.68 1 2.650 2.624 2.602 2.583 2.567 2.552 2.539 2.528 2.5 I 8 2.508 2. 500 2.492 2.485 2.479 2.473 2.467 2.462 2.457 2.423 2.390 2.358 2.326
2.933 2 . 8 70 2.820 2.779 2. 746 2. 7 1 8 2.694 2.673 2.655 2.639 2.625 2.6 I 3 2.60 I 2.59 I 2 . 5 82 2 . 5 74 2.566 2.559 2.552 2 . 546 2.54 1 2.536 2.499 2.463 2.428 2.394
3.335 3 . 206 3. 1 1 1 3.038 2.98 1 2.934 2. 896 2. 864 2.837 2.8 1 3 2 . 793
3 . 250 3 . 1 69 3 . 1 06 3.055 3 .0 1 2 2.977 2.947 2.92 1 2.898
2.775 2.759 2. 744
2.878 2.86 1 2 . 845
2.732
2.83 I 2.8 I 9
2.720 2.7 I O 2. 700 2.692 2.684 2.676 2.669 2.663 2.657 2.6 1 6 2.575 2.536 2 .498
2.807 2.797 2.787 2.779 2.77 1 2 . 763 2.756 2.750 2. 704 2.660 2.6 1 7 2.576
.0025 1 27.32 1
1 4 .089
7.453
5 . 5 98
4.773 4 .3 1 7 4.029 3.833 3 .690 3.58 1 3 .497 3 .428 3.372 3.326 3 .286 3.252 3 .222 3 . 1 97 3 . 1 74 3 . 1 53 3. 1 35 3.1 19 3 . I 04 3.09 1 3.078 3.067 3 .057 3 .047
3 .038 3 .030 2.97 1 2.9 1 5 2.860
2.8 1 3
Appendix
x2 D I STRI B UT I O N P ERC ENTAG E POI NTS
TABLE 3
I� X 3 (a)
d . f.
I
2 3 4 5 6 7 8 9 10 II
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 40 50 60 70 80 90 1 00
.990
.950
.900
. 500
.0002 .02
.004 . 10 .35 .7 1 1.15 1 .64 2. 1 7 2.73 3.33 3 .94 4.57 5.23 5.89 6.57 7.26 7.96 8.67 9.39 1 0. 1 2 1 0.85 1 1 . 59 1 2.34 1 3 .09 1 3 .85 1 4.6 1 1 5.38 1 6. 1 5 1 6.93 1 7. 7 1 1 8.49 26. 5 1 34.76 43. 1 9 5 1 .74 60. 39 69. 1 3 77.93
.02 .2 1 .58 1 .06 1 .6 1 2.20 2.83 3 .49 4. 1 7 4.87 5.58 6.30 7.04 7.79 8.55 9.3 1 1 0.09 1 0. 86 1 1 .65 1 2.44 1 3 .24 1 4.04 1 4.85 1 5.66 1 6.47 1 7.29 1 8. 1 1 1 8.94 1 9.77 20.60 29.05 37.69 46.46 55.33 64.28 73.29 82.36
.45 1 . 39 2.37 3.36 4.35 5.35 6.35 7.34 8.34 9.34 1 0.34 1 1 . 34 1 2.34 1 3 .34 1 4.34 1 5 .34 1 6.34 1 7 .34 1 8.34 1 9.34 20.34 2 1 .34 22.34 23.34 24. 34 25.34 26.34 27.34 28.34 29.34 39.34 49.33 59.33 69.33 79.33 89.33 99.33
.I I
.30 .55 .87 1 . 24 1 .65 2.09 2.56 3 .05 3.57 4. 1 1 4.66 5.23 5.8 1 6.41 7.0 1 7.63 8.26 8.90 9.54 1 0.20 1 0.86 1 1 . 52 1 2. 20 1 2.88 1 3 .56 1 4.26 1 4.95 22. 1 6 29. 7 1 3 7.48 45.44 53.54 6 1 .75 70.06
a
x2
. 1 00
.050
.025
.0 1 0
.005
2.7 1 4.6 1 6.25 7.78 9.24 1 0.64 1 2.02 1 3 .36 1 4.68 1 5.99 1 7.28 1 8.55 1 9. 8 1 2 1 .06 22.3 1 23.54 24.77 25.99 27.20 28.4 1 29.62 30. 8 1 32.0 1 33.20 34.3 8 35.56 36.74 37.92 39.09 40.26 5 1 .8 1 63. 1 7 74.40 85.53 96. 58 1 07.57 1 1 8.50
3 . 84 5.99 7.8 1 9.49 1 1 .07 1 2.59 1 4.07 1 5. 5 1 1 6.92 1 8.3 1 1 9.68 2 1 .03 22.36 23.68 25 .00 26.30 27.59 28.87 30. 1 4 3 1 .4 1 32.67 33.92 35. 1 7 36.42 37.65 38.89 40. 1 1 4 1 .34 42.56 43 .77 55.76 67.50 79.08 90. 53 1 0 1 .88 1 1 3. 1 5 1 24.34
5 .02 7.38 9.35 1 1.14 1 2.83 1 4.45 1 6. 0 1 1 7.53 1 9.02 20.48 2 1 .92 2 3 . 34 24.74 26. 1 2 27.49 28.85 30. 1 9 3 1 . 53 32.85 34. 1 7 3 5 .48 36.78 3 8.08 39.36 40.65 4 1 .92 43. 1 9 44.46 45.72 46.98 59.34 7 1 .42 83.30 95.02 1 06.63 1 1 8. 1 4 1 29.56
6.63 9.2 1 1 1 . 34 1 3 .28 1 5 .09 16.81 1 8.48 20.09 2 1 .67 23.2 1 24.72 26.22 27.69 29. 1 4 30. 5 8 3 2.00 33.4 1 34. 8 1 36. 1 9 37.57 38.93 40.29 4 1 .64 42.98 44. 3 1 45.64 46.96 48.28 49.59 50.89 63.69 76. 1 5 88.38 1 00.43 1 1 2. 3 3 1 24. 1 2 1 35.8 1
7.88 1 0.60 1 2. 84 1 4.86 1 6 .75 1 8. 5 5 20.28 2 1 .95 23.59 25. 1 9 26.76 28.30 29.82 3 1 .3 2 32.80 34.27 3 5 .72 37. 1 6 38.58 40.00 4 1 .40 42.80 44. 1 8 45.56 46.93 48.29 49.64 50.99 52.34 53.67 66.77 79.49 9 1 .95 1 04. 2 1 1 1 6.32 1 28.30 1 40. 1 7
803
I )>
-c -c ro ::J a.
x·
TABLE 4
F-DISTRIBUTION PERCENTAGE POINTS (α = .10)
[Table body: the entries are the upper percentage points F_{ν1,ν2}(.10), defined by P[F_{ν1,ν2} > F_{ν1,ν2}(.10)] = .10, tabulated for numerator degrees of freedom ν1 = 1, 2, ..., 10, 12, 15, 20, 25, 30, 40, 60 (columns) and denominator degrees of freedom ν2 = 1, 2, ..., 30, 40, 60, 120, ∞ (rows).]
TABLE 5
F-DISTRIBUTION PERCENTAGE POINTS (α = .05)
[Table body: the entries are the upper percentage points F_{ν1,ν2}(.05), defined by P[F_{ν1,ν2} > F_{ν1,ν2}(.05)] = .05, tabulated for numerator degrees of freedom ν1 = 1, 2, ..., 10, 12, 15, 20, 25, 30, 40, 60 (columns) and denominator degrees of freedom ν2 = 1, 2, ..., 30, 40, 60, 120, ∞ (rows).]
TABLE 6
F-DISTRIBUTION PERCENTAGE POINTS (α = .01)
[Table body: the entries are the upper percentage points F_{ν1,ν2}(.01), defined by P[F_{ν1,ν2} > F_{ν1,ν2}(.01)] = .01, tabulated for numerator degrees of freedom ν1 = 1, 2, ..., 10, 12, 15, 20, 25, 30, 40, 60 (columns) and denominator degrees of freedom ν2 = 1, 2, ..., 30, 40, 60, 120, ∞ (rows).]
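The entries of Tables 4, 5, and 6 can be recomputed in the same way. A minimal sketch, again assuming Python with SciPy, evaluates F_{ν1,ν2}(α) over the tabulated grid of numerator and denominator degrees of freedom.

# Minimal sketch (assumes Python with SciPy): reproduce the F-distribution
# upper percentage points F_{nu1,nu2}(alpha), defined by P[F_{nu1,nu2} > F_{nu1,nu2}(alpha)] = alpha.
from scipy.stats import f as f_dist

alphas = [0.10, 0.05, 0.01]                                   # Tables 4, 5, 6
num_df = list(range(1, 11)) + [12, 15, 20, 25, 30, 40, 60]    # nu1 (columns)
den_df = list(range(1, 31)) + [40, 60, 120]                   # nu2 (rows)

for alpha in alphas:
    print(f"Upper {100 * alpha:g}% points of F")
    for nu2 in den_df:
        # isf returns c with P[F_{nu1,nu2} > c] = alpha;
        # for example, f_dist.isf(0.05, 4, 20) is approximately 2.87.
        row = [f_dist.isf(alpha, nu1, nu2) for nu1 in num_df]
        print(f"{nu2:4d} " + " ".join(f"{c:7.2f}" for c in row))
    # The infinite-denominator row of the printed tables is the limiting case
    # F_{nu1, infinity}(alpha) = chi2_{nu1}(alpha) / nu1.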
Data Index

Admission, 718; examples, 674, 717
Airline distances, 765; examples, 764
Air pollution, 38; examples, 37, 221, 453, 507, 583
Amitriptyline, 455; examples, 454
Archeological site distances, 797; examples, 795, 798
Bankruptcy, 712; examples, 42, 711, 713
Battery failure, 453; examples, 452, 453
Bird, 286; examples, 285
Biting fly, 373; examples, 372
Bonds, 367; examples, 365
Bones (mineral content), 43, 374; examples, 41, 46, 222, 285, 372, 454, 509
Breakfast cereal, 724; examples, 45, 723, 794, 795
Bull, 46; examples, 46, 222, 454, 512, 585, 722
Calcium (bones), 350, 351; examples, 353, 374
Carapace (painted turtles), 364, 579; examples, 364-65, 476, 485, 579
Census tract, 508; examples, 473, 506, 584
College test scores, 245; examples, 244, 285, 451
Computer data, 403, 426; examples, 402, 406, 426, 432, 435, 437, 440
Crime, 622; examples, 622
Crude oil, 719; examples, 368, 687, 717
Diabetic, 625; examples, 625
Effluent, 294; examples, 294, 358, 359
Electrical consumption, 309; examples, 309, 313, 359
Electrical time-of-use pricing, 370; examples, 368
Examination scores, 541; examples, 541
Financial, 37; examples, 14, 37, 130, 144, 195, 220, 452, 504
Forest, 787; examples, 786, 795
Fowl, 558; examples, 558, 578, 604, 612
Hair (Peruvian), 280; examples, 280
Hemophilia, 643, 644, 721, 722; examples, 643, 662, 720
Hook-billed kite, 367; examples, 365
Iris, 715; examples, 368, 680, 714
Job satisfaction/characteristics, 607, 798; examples, 605, 615, 617, 795, 798
Lamentations, 792; examples, 792
Love and marriage, 347-48; examples, 346
Lumber, 286; examples, 285
Mental health, 798; examples, 795, 798
Mice, 483, 484, 508; examples, 483, 489, 508, 584
Milk transportation cost, 287, 366; examples, 43, 287, 365
Multiple sclerosis, 41; examples, 40, 222, 711
Musical aptitude, 254; examples, 254
National track records, 44, 510; examples, 42, 222, 509, 511, 584, 585, 795
Natural gas, 442; examples, 441
Numerals, 737; examples, 736, 742, 746, 749
Nursing home, 327; examples, 326, 330
Olympic decathlon, 534; examples, 534, 548
Overtime (police), 258; examples, 257, 260, 262, 265, 288, 491, 494, 496, 512
Oxygen consumption, 369; examples, 45, 368
Paper quality, 17; examples, 16, 19, 222
Peanut, 375; examples, 372
Plastic film, 338; examples, 337, 363
Pottery, 771; examples, 770, 795
Profitability, 580; examples, 580, 624
Public utility, 747; examples, 24, 26, 45, 46, 729, 747, 750, 757, 764
Radiation, 192, 212; examples, 191, 208, 211, 221, 237, 242, 250, 279, 280
Radiotherapy, 42; examples, 40, 222, 507
Reading/arithmetic test scores, 621; examples, 621
Real estate, 393; examples, 393, 452
Road distances, 796; examples, 795
Salmon, 659; examples, 658, 720
Sleeping dog, 300; examples, 299
Smoking, 626; examples, 625
Spouse, 371; examples, 370
Stiffness (lumber), 198, 203; examples, 23, 198, 202, 222, 364, 584, 624
Stock price, 507; examples, 482, 488, 506, 527, 532, 538, 545, 554, 622, 792
Sweat, 229; examples, 229, 279, 280, 508
University, 782; examples, 768, 781, 786
Wheat, 624; examples, 623
Subject Index

Analysis of variance, multivariate: one-way, 320-28; two-way, 334-43, 361-62
Analysis of variance, univariate: one-way, 315-20, 380-81; two-way, 331-34
ANOVA (see Analysis of variance, univariate)
Autocorrelation, 442-43
Autoregressive model, 443
Average linkage (see Cluster analysis)
Biplot, 779
Bonferroni intervals: comparison with T² intervals, 251-52; definition, 249; for means, 250, 293, 311; for treatment effects, 329, 337
Canonical correlation analysis: canonical correlations, 587, 589-90, 598, 602; canonical variables, 587, 589-90, 602; correlation coefficients in, 596-97, 602-04; definition of, 589, 602; errors of approximation, 610-11; geometry of, 600; interpretation of, 595-600; population, 589-90; sample, 602; tests of hypothesis in, 615-17; variance explained, 613-15
CART, 698
Central-limit theorem, 187
Characteristic equation, 102
Characteristic roots (see Eigenvalues)
Characteristic vectors (see Eigenvectors)
Chernoff faces, 25
Chi-square plots, 196
Classification: Anderson statistic, 664; Bayes' rule (ECM), 636, 639, 667; confusion matrix, 652; error rates, 650-53; expected cost, 636, 665-66; interpretations in, 671-72, 678-80; Lachenbruch holdout procedure, 654, 680; linear discriminant functions, 639-40, 642, 661-64, 671-72, 685; misclassification probabilities, 634-35, 638; with normal populations, 639-49, 669-83; quadratic discriminant function, 646-48, 669-71; qualitative variables, 697; selection of variables, 701; for several groups, 665-83, 691-97; for two groups, 630-39, 639-49, 663
Classification trees, 698
Cluster analysis: algorithm, 740, 755; average linkage, 748-51; complete linkage, 744-48; dendrogram, 740; hierarchical, 738-54; inversions in, 754; K-means, 755-60; similarity and distance, 735; similarity coefficients, 733, 736; single linkage, 740-44; Ward's method, 751
Coefficient of determination, 385, 429
Communality, 517
Complete linkage (see Cluster analysis)
Confidence intervals: mean of normal population, 225; simultaneous, 241-42, 250, 253, 283, 293, 329, 337
Confidence regions: for contrasts, 229; definition, 235-36; for difference of mean vectors, 305, 312; for mean vectors, 236, 252-53; for paired comparisons, 293
Contingency table, 770
Contrast matrix, 298
Contrast vector, 297
Control chart: definition, 257; ellipse format, 260, 267, 491-92; for subsample means, 266, 268; multivariate, 259, 493-94, 496-97; T² chart, 261, 265, 268, 494-95
Control regions: definition, 263; for future observations, 263-64, 267, 496
Correlation: autocorrelation, 442-43; coefficient of, 8, 73; geometrical interpretation of sample, 124; multiple, 385, 429, 598-99; partial, 437; sample, 8
Correlation matrix: population, 73; sample, 10, 123; tests of hypotheses for equicorrelation, 488-89
Correspondence analysis: algebraic development, 773-75; correspondence matrix, 773; inertia, 772, 775, 776
Correspondence matrix, 773
Covariance: definition of, 71; of linear combinations, 77-80; sample, 8
Covariance matrix: definitions of, 71; distribution of, 184; factor analysis models for, 517; geometrical interpretation of sample, 124, 130-36; large sample behavior, 186; as matrix operation, 147; partitioning, 77, 81; population, 73; sample, 129
Dendrogram, 740
Descriptive statistics: correlation coefficient, 8; covariance, 8; mean, 7; variance, 7
Design matrix, 379, 411, 438
Determinant: computation of, 97-98; product of eigenvalues, 109
Discriminant function (see Classification)
Distance: Canberra, 729; Czekanowski, 729; development of, 28-36, 65; Euclidean, 28; Minkowski, 728; properties, 36; statistical, 29, 34-35
Distributions: chi-square (table), 803; F (table), 804-09; multinomial, 281-82; normal (table), 801; Q-Q plot correlation coefficient (table), 193; t (table), 802; Wishart, 184
Eigenvalues, 102
Eigenvectors, 103
EM algorithm, 268
Estimation: generalized least squares, 448, 451; least squares, 381-90; maximum likelihood, 177-83; minimum variance, 389; unbiased, 126-29, 389
Estimator (see Estimation)
Expected value, 68
Experimental unit, 5
Factor analysis: bipolar factor, 542; common factors, 515-16; communalities, 517-18; computational details, 572-75; of correlation matrix, 524, 529-30, 574-75; Heywood cases, 574, 577; least squares (Bartlett) computation of factor scores, 551; loadings, 515-16; maximum likelihood estimation in, 530-32; nonuniqueness of loadings, 520-21; oblique rotation, 543, 548; orthogonal factor model, 516; principal component estimation in, 522-28; principal factor estimation in, 529-30; regression computation of factor scores, 554; residual matrix, 525; rotation of factors, 540-50; specific factors, 515-16; specific variance, 517-18; strategy for, 557-58; testing for the number of factors, 537-40; varimax criterion, 543
Factor loading matrix, 515
Factor scores, 550-57
Fisher's linear discriminants: population, 705, 708-09; sample, 662, 685; scaling, 645
Gamma plot, 196
Gauss (Markov) theorem, 389
Generalized inverse, 387, 449
Generalized least squares (see Estimation)
Generalized variance: geometric interpretation of sample, 130-31, 143; sample, 130, 142; situations where zero, 136-41
General linear model: design matrix for, 379, 411; multivariate, 410-27; univariate, 377-81
Geometry: of classification, 678-80; generalized variance, 130-31, 143; of least squares, 385-87; of principal components, 498-503; of sample, 124
Gram-Schmidt process, 90
Graphical techniques: biplot, 779; Chernoff faces, 25; marginal dot diagrams, 13; n points in p dimensions, 16; p points in n dimensions, 19; scatter diagram (plot), 12, 19; stars, 24
Growth curve, 350
Hat matrix, 382, 450
Heywood cases (see Factor analysis)
Hotelling's T² (see T²-statistic)
Independence: definition, 70-71; of multivariate normal variables, 168-69; of sample mean and covariance matrix, 184; tests of hypotheses for, 515
Inequalities: Cauchy-Schwarz, 81; extended Cauchy-Schwarz, 82
Inertia, 772, 775, 776
Influential observations, 407
Invariance of maximum likelihood estimators, 183
Item (individual), 5
K-means (see Cluster analysis)
Lawley-Hotelling trace statistic, 357
Leverage, 404, 407
Likelihood function, 177
Likelihood ratio tests: definition, 234-35; limiting distribution, 235; in regression, 394-97, 421-24; and T², 233
Linear combinations of vectors, 87, 174
Linear combinations of variables: mean of, 78-79; normal populations, 165-66; sample covariances of, 149, 153; sample means of, 149, 153; variances and covariances of, 78-79
Linear structural relationships, 565
LISREL, 565-70
Logistic regression, 698
MANOVA (see Analysis of variance, multivariate)
Matrices: addition of, 93; characteristic equation of, 102; correspondence, 773; definition of, 55, 92; determinant of, 97, 109; dimension of, 92; eigenvalues of, 60, 102; eigenvectors of, 60, 103; generalized inverses of, 387, 449; identity, 59, 94; inverses of, 59, 99; multiplication of, 94, 114; orthogonal, 60, 101; partitioned, 75, 77, 80, 81; positive definite, 61, 64; products of, 56, 94, 95; random, 68; rank of, 99; scalar multiplication in, 93; singular and nonsingular, 99; singular-value decomposition, 105, 774, 780; spectral decomposition, 62, 104; square root, 67; symmetric, 58, 94; trace of, 101; transpose of, 55, 93
Maxima and minima (with matrices), 81-85
Maximum likelihood estimation: development, 177-83; invariance property of, 183; in regression, 390, 418-20, 431-36
Mean, 68
Mean vector: definition, 71; distribution of, 184; large sample behavior, 186-87; as matrix operation, 145-46; partitioning, 75, 80; sample, 10, 80
Minimal spanning tree, 770
Missing observations, 268-73
Multicollinearity, 409
Multidimensional scaling: algorithm, 761-62; development, 760-70; sstress, 762-63; stress, 762
Multiple comparisons (see Simultaneous confidence intervals)
Multiple correlation coefficient: population, 429, 598-99; sample, 385
Multiple regression (see Regression and General linear model)
Multivariate analysis of variance (see Analysis of variance, multivariate)
Multivariate control chart (see Control chart)
Multivariate normal distribution (see Normal distribution, multivariate)
Neural network, 699
Nonlinear mapping, 770
Nonlinear ordination, 789
Normal distribution: bivariate, 159; checking for normality, 188-200; conditional, 170; constant density contours, 162, 464; marginal, 165, 167; maximum likelihood estimation in, 182; multivariate, 158-77; properties of, 164-77; transformations to, 204-14
Normal equations, 449
Normal probability plots (see Q-Q plots)
Outliers: definition, 200; detection of, 201
Paired comparisons, 291-97
Partial correlation, 437
Partitioned matrix: definition, 75, 77, 80, 81; determinant of, 216-17; inverse of, 217
Path diagram, 566
Pillai's trace statistic, 357
Plots: biplot, 779; Cp, 408; factor scores, 555, 557; gamma (or chi-square), 196; principal components, 484-86; Q-Q, 189, 405; residual, 405-06; scree, 475
Positive definite (see Quadratic forms)
Posterior probabilities, 639, 667
Principal component analysis: correlation coefficients in, 461-62, 472, 481; for correlation matrix, 466, 481; definition of, 459-60, 471-72; equicorrelation matrix, 469-70, 488-89; geometry of, 498-503; interpretation of, 464-65, 478-80; large-sample theory of, 487-90; monitoring quality with, 490-97; plots, 484-86; population, 458-71; reduction of dimensionality by, 498-500; sample, 471-84; tests of hypotheses in, 488-90, 505-06; variance explained, 461, 466, 482
Procrustes analysis: development, 782-90; measure of agreement, 783-86; rotation, 784
Profile analysis, 343-49
Proportions: large-sample inferences, 282-83; multinomial distribution, 281-82
Q-Q plots: correlation coefficient, 193; critical values, 193; description, 189-94
Quadratic forms: definition, 64, 104; extrema of, 83; nonnegative definite, 64; positive definite, 61, 64
Random matrix, 68
Random sample, 125
Regression (see also General linear model): autoregressive model, 443; assumptions, 379, 390, 411, 418-19; coefficient of determination, 385, 429; confidence regions in, 392, 400-01, 425, 450; Cp plot, 408; decomposition of sum of squares, 385, 413; extra sum of squares and cross products, 396-97, 421; fitted values, 382, 412; forecast errors in, 401; Gauss theorem in, 389; geometric interpretation of, 385-87; least squares estimates, 382, 412; likelihood ratio tests in, 394-97, 421-24; maximum likelihood estimation in, 390, 418-20, 431-36; multivariate, 410-27; regression coefficients, 381, 433; regression function, 390, 431; residual analysis in, 404-07; residuals, 382, 405, 412; residual sum of squares and cross products, 382, 413; sampling properties of estimators, 387-89, 416-17; selection of variables, 408-09; univariate, 377-81; weighted least squares, 448; with time-dependent errors, 441-45
Regression coefficients (see Regression)
Repeated measures designs, 297-302, 350-54
Residuals, 382, 405, 412
Roy's largest root, 357
Sample: geometry, 117-24
Sample splitting, 558, 571, 653
Scree plot, 475
Simultaneous confidence ellipses: as projections, 277-78
Simultaneous confidence intervals: comparisons of, 246-48, 249-52; for components of mean vectors, 241-42, 250, 253; for contrasts, 299; development, 239-42; for differences in mean vectors, 308, 311, 312; for paired comparisons, 293; as projections, 276; for regression coefficients, 392; for treatment effects, 329, 337
Single linkage (see Cluster analysis)
Singular matrix, 99
Singular-value decomposition, 105, 774, 780
Special causes (of variation), 257
Specific variance, 517-18
Spectral decomposition, 62, 104
SStress, 762-63
Standard deviation: population, 74; sample, 8
Standard deviation matrix: population, 74; sample, 147
Standardized observations, 480
Standardized variables, 465
Stars, 24
Strategy for multivariate comparisons, 358
Stress, 762
Structural equation models, 565-71
Studentized residuals, 404
Sufficient statistics, 183
Sums of squares and cross products matrices: between, 321; total, 321; within, 321
Time dependence (in multivariate observations), 273-75, 441-45
T²-statistic: definition of, 226; distribution of, 226-27; invariance property of, 230; in quality control, 261, 265, 268, 494-95; in profile analysis, 345; for repeated measures designs, 298; single-sample, 226; two-sample, 305
Trace of a matrix, 101
Transformations of data, 204-14
Variables: canonical, 589-90, 602; dummy, 380; endogenous, 566; exogenous, 566; latent, 566; predictor, 377; response, 377; standardized, 465
Variance: definition, 69-70; generalized, 130, 142; geometrical interpretation of, 124; total sample, 144, 472, 481, 613-14
Varimax rotation criterion, 543
Vectors: addition, 50-51, 86; angle between, 52, 89; basis, 88; definition of, 49, 86; inner product, 52, 89; length of, 51, 89; linearly dependent, 54, 87; linearly independent, 54, 87; linear span, 87; perpendicular (orthogonal), 53, 90; projection of, 54, 90; random, 68; scalar multiplication, 86; unit, 52; vector space, 87
Wilks' lambda, 232, 322
Wishart distribution, 184