CHAPMAN & HALL TEXTS IN STATISTICAL SCIENCE SERIES Editors: Dr Chris Chatfield Reader in Statistics, School of Mathematical Sciences, University of Bath, UK
Professor Jim V. Zidek Department of Statistics, University of British Columbia, Canada
OTHER TITLES IN THE SERIES INCLUDE: Practical Statistics for Medical Research D.G. Altman Interpreting Data A.J.B. Anderson Statistical Methods for SPC and TQM D. Bissell Statistics in Research and Development Second edition R. Caulcutt The Analysis of Time Series Fifth edition C. Chatfield Problem Solving - A statistician's guide C. Chatfield Introduction to Multivariate Analysis C. Chatfield and A.J. Collins Modelling Binary Data D. Collett Modelling Survival Data in Medical Research D. Collett Applied Statistics D.R. Cox and E.J. Snell Statistical Analysis of Reliability Data M.J. Crowder, A.C. Kimber, T.J. Sweeting and R.L. Smith An Introduction to Generalized Linear Models A.J. Dobson Introduction to Optimization Methods and their Applications in Statistics B.S. Everitt Multivariate Statistics - A practical approach B. Flury and H. Riedwyl Readings in Decision Analysis S. French Bayesian Data Analysis A. Gelman, J.B. Carlin, H.S. Stern and D.B. Rubin Practical Longitudinal Data Analysis D.J. Hand and M.J. Crowder Multivariate Analysis of Variance and Repeated Measures D.J. Hand and C.C. Taylor
The Theory of Linear Models B. Jorgensen Modeling and Analysis of Stochastic Systems V.G. Kulkarni Statistics for Accountants S. Letchford Statistical Theory Fourth edition B. Lindgren Randomization and Monte Carlo Methods in Biology B.F.J. Manly Statistical Methods in Agriculture and Experimental Biology Second edition R. Mead, R.N. Curnow and A.M. Hasted Statistics in Engineering A.V. Metcalfe Elements of Simulation B.J.T. Morgan Probability - Methods and measurement A. O'Hagan Essential Statistics Second edition D.G. Rees Large Sample Methods in Statistics P.K. Sen and J.M. Singer Decision Analysis - A Bayesian approach J.Q. Smith Applied Nonparametric Statistical Methods Second edition P. Sprent Elementary Applications of Probability Theory Second edition H.C. Tuckwell Statistical Process Control - Theory and practice Third edition G.B. Wetherill and D.W. Brown Applied Bayesian Forecasting and Time Series Analysis A. Pole, M. West and J. Harrison
Full information on the complete range of Chapman & Hall statistics books is available from the publisher.
COMPUTER-AIDED MULTIVARIATE ANALYSIS Third edition
A.A. Afifi Dean and Professor of Biostatistics School of Public Health and Professor of Biomathematics University of California, Los Angeles USA and
V. Clark Professor Emeritus of Biostatistics and Biomathematics University of California, Los Angeles USA
CHAPMAN & HALL/CRC Boca Raton London New York Washington, D.C.
Library of Congress Cataloging-in-Publication Data Catalog record is available from the Library of Congress.
This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use. Apart from any fair dealing for the purpose of research or private study, or criticism or review, as permitted under the UK Copyright Designs and Patents Act, 1988, this publication may not be reproduced, stored or transmitted, in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without the prior permission in writing of the publishers, or in the case of reprographic reproduction only in accordance with the terms of the licenses issued by the Copyright Licensing Agency in the UK, or in accordance with the terms of the license issued by the appropriate Reproduction Rights Organization outside the UK. The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC for such copying. Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe.
© 1997 by Chapman & Hall/CRC First edition 1984 Second edition 1990 Third edition 1996 Reprinted 1998 First CRC Press reprint 1999 © 1984, 1990 by Van Nostrand Reinhold © 1996 by Chapman & Hall
No claim to original U.S. Government works International Standard Book Number 0-412-73060-X Library of Congress Catalog Card Number 96-83484 Printed in the United States of America 2 3 4 5 6 7 8 9 0 Printed on acid-free paper
Contents
Preface xiii
Preface to the second edition xvii
Preface to the first edition xix
Part One Preparation for Analysis 1
1 What is multivariate analysis? 3 1.1 How is multivariate analysis defined? 3 1.2 Examples of studies in which multivariate analysis is useful 3 1.3 Multivariate analyses discussed in this book 6 1.4 Organization and content of the book 9 References 11 2 Characterizing data for future analyses 12 2.1 Variables: their definition, classification and use 12 2.2 Defining statistical variables 12 2.3 How variables are classified: Stevens's classification system 13 2.4 How variables are used in data analysis 16 2.5 Examples of classifying variables 17 2.6 Other characteristics of data 18 Summary 18 References 19 Further reading 19 Problems 19 3 Preparing for data analysis 21 3.1 Processing the data so they can be analyzed 21 3.2 Choice of computer for statistical analysis 22 3.3 Choice of a statistical package 23 3.4 Techniques for data entry 28
3.5 Data management for statistics 34 3.6 Data example: Los Angeles depression study 40 Summary 43 References 45 Further reading 46 Problems 46
4 Data screening and data transformation 48 4.1 Making transformations and assessing normality and independence 48 4.2 Common transformations 48 4.3 Assessing the need for and selecting a transformation 54 4.4 Assessing independence 64 Summary 67 References 67 Further reading 68 Problems 68 5 Selecting appropriate analyses 71 5.1 Which analyses? 71 5.2 Why selection of analyses is often difficult 71 5.3 Appropriate statistical measures under Stevens's classification 72 5.4 Appropriate multivariate analyses under Stevens's classification 76 Summary 79 References 79 Further reading 80 Problems 80 Part Two Applied Regression Analysis 83 6 Simple linear regression and correlation 85 6.1 Using linear regression and correlation to examine the relationship between two variables 85 6.2 When are regression and correlation used? 85 6.3 Data example 86 6.4 Description of methods of regression: fixed-X case 88 6.5 Description of methods of regression and correlation: variable-X case 93 6.6 Interpretation of results: fixed-X case 94 6.7 Interpretation of results: variable-X case 96 6.8 Further examination of computer output 100
6.9 Robustness and transformations for regression analysis 108 6.10 Other options in computer programs 111 6.11 Special applications of regression 112 6.12 Discussion of computer programs 115 6.13 What to watch out for 117 Summary 118 References 118 Further reading 120 Problems 121 7 Multiple regression and correlation 124 7.1 Using multiple linear regression to examine the relationship between one dependent variable and multiple independent variables 124 7.2 When are multiple regression and correlation used? 125 7.3 Data example 125 7.4 Description of techniques: fixed-X case 128 7.5 Description of techniques: variable-X case 130 7.6 How to interpret the results: fixed-X case 137 7.7 How to interpret the results: variable-X case 140 7.8 Residual analysis and transformations 143 7.9 Other options in computer programs 148 7.10 Discussion of computer programs 154 7.11 What to watch out for 157 Summary 160 References 160 Further reading 161 Problems 162
8 Variable selection in regression analysis 166 8.1 Using variable selection techniques in multiple regression analysis 166 8.2 When are variable selection methods used? 166 8.3 Data example 167 8.4 Criteria for variable selection 170 8.5 A general F test 173 8.6 Stepwise regression 175 8.7 Subset regression 181 8.8 Discussion of computer programs 185 8.9 Discussion and extensions 187 8.10 What to watch out for 191 Summary 193 References 193
Further reading 194 Problems 194
9 Special regression topics 197 9.1 Special topics in regression analysis 197 9.2 Missing values in regression analysis 197 9.3 Dummy variables 202 9.4 Constraints on parameters 209 9.5 Methods for obtaining a regression equation when multicollinearity is present 212 9.6 Ridge regression 214 Summary 219 References 220 Further reading 221 Problems 221 Part Three Multivariate Analysis 225 10 Canonical correlation analysis 227 10.1 Using canonical correlation analysis to analyze two sets of variables 227 10.2 When is canonical correlation analysis used? 227 10.3 Data example 228 10.4 Basic concepts of canonical correlation 229 10.5 Other topics related to canonical correlation 234 10.6 Discussion of computer programs 237 10.7 What to watch out for 239 Summary 240 References 241 Further reading 241 Problems 241
11 Discriminant analysis 243 11.1 Using discriminant analysis to classify cases 243 11.2 When is discriminant analysis used? 244 11.3 Data example 245 11.4 Basic concepts of classification 246 11.5 Theoretical background 253 11.6 Interpretation 255 11.7 Adjusting the value of the dividing point 259 11.8 How good is the discriminant function? 262 11.9 Testing for the contributions of classification variables 265 11.10 Variable selection 266
11.11 Classification into more than two groups 267 11.12 Use of canonical correlation in discriminant function analysis 269 11.13 Discussion of computer programs 272 11.14 What to watch out for 275 Summary 276 References 277 Further reading 277 Problems 278
12 Logistic regression 281 12.1 Using logistic regression to analyze a dichotomous outcome variable 281 12.2 When is logistic regression used? 281 12.3 Data example 282 12.4 Basic concepts of logistic regression 283 12.5 Interpretation: categorical variables 285 12.6 Interpretation: continuous and mixed variables 288 12.7 Refining and evaluating logistic regression analysis 289 12.8 Applications of logistic regression 296 12.9 Discussion of computer programs 299 12.10 What to watch out for 301 Summary 302 References 302 Further reading 303 Problems 304 13 Regression analysis using survival data 306 13.1 Using survival analysis to analyze time-to-event data 306 13.2 When is survival analysis used? 306 13.3 Data examples 307 13.4 Survival functions 309 13.5 Common distributions used in survival analysis 314 13.6 The log-linear regression model 317 13.7 The Cox proportional hazards regression model 319 13.8 Some comparisons of the log-linear, Cox and logistic regression models 320 13.9 Discussion of computer programs 324 13.10 What to watch out for 326 Summary 327 References 328 Further reading 328 Problems 328
14 Principal components analysis 330 14.1 Using principal components analysis to understand intercorrelations 330 14.2 When is principal components analysis used? 330 14.3 Data example 331 14.4 Basic concepts of principal components analysis 333 14.5 Interpretation 336 14.6 Use of principal components analysis in regression and other applications 345 14.7 Discussion of computer programs 348 14.8 What to watch out for 350 Summary 351 References 352 Further reading 352 Problems 352 15 Factor analysis 354 15.1 Using factor analysis to examine the relationship among P variables 354 15.2 When is factor analysis used? 354 15.3 Data example 355 15.4 Basic concepts of factor analysis 356 15.5 Initial factor extraction: principal components analysis 358 15.6 Initial factor extraction: iterated principal components 362 15.7 Factor rotations 365 15.8 Assigning factor scores to individuals 371 15.9 An application of factor analysis to the depression data 372 15.10 Discussion of computer programs 374 15.11 What to watch out for 376 Summary 377 References 378 Further reading 379 Problems 379 16 Cluster analysis 381 16.1 Using cluster analysis to group cases 381 16.2 When is cluster analysis used? 381 16.3 Data example 383 16.4 Basic concepts: initial analysis and distance measures 385 16.5 Analytical clustering techniques 391 16.6 Cluster analysis for financial data set 398 16.7 Discussion of computer programs 404 16.8 What to watch out for 406
Summary 406 References 407 Further reading 407 Problems 408 17 Log-linear analysis 410 17.1 Using log-linear models to analyze categorical data 410 17.2 When is log-linear analysis used? 410 17.3 Data example 411 17.4 Notation and sample considerations 413 17.5 Tests of hypotheses and models for two-way tables 415 17.6 Example of a two-way table 419 17.7 Models for multiway tables 421 17.8 Tests of hypotheses for multiway tables: exploratory model building 425 17.9 Tests of hypotheses: specific models 431 17.10 Sample size issues 432 17.11 The logit model 434 17.12 Discussion of computer programs 437 17.13 What to watch out for 439 Summary 440 References 440 Further reading 441 Problems 441 Appendix A Lung function data 443 Table A.1 Code book for lung function data set 444 Table A.2 Lung function data set 445 Further reading 446 Appendix B Lung cancer survival data 446 Table B.1 Lung cancer data 447 Index 449
Preface This book has been written for investigators, specifically behavioral scientists, biomedical scientists, and industrial or academic researchers who wish to perform multivariate statistical analyses on their data and understand the results. It has been written so that it can either be used as a self-guided textbook or as a text in an applied course in multivariate analysis. In addition, we believe that the book will be helpful to many statisticians who have been trained in conventional mathematical statistics, who are now working as statistical consultants and need to give explanations to clients who lack sufficient background in mathematics. We do not present mathematical derivations of the techniques in this book; rather we rely on geometric and graphical arguments and on examples to illustrate them. The mathematical level has been kept deliberately low, with no mathematics beyond the high-school level required. The derivations of the techniques are referenced. The original derivations for most of the current techniques were done 50 years ago, so we feel that the applications of these techniques to real-life problems are the 'fun' part now. We have assumed that the reader has taken a basic course in statistics that includes tests of hypotheses. Many computer programs use analysis of variance, and that part of the results of the program can only be understood if the reader is familiar with one-way analysis of variance.
WHAT HAS BEEN ADDED TO THE THIRD EDITION A new chapter on log-linear analysis has been added, as well as considerably more material on data management and data cleaning. In this new edition we expanded our emphasis on the use of the personal computer and incorporated three more PC statistical packages, namely STATA, STATISTICA and SYSTAT, in addition to BMDP, SAS and SPSS. The new packages contain multivariate programs which may not have quite as many features as the big three. Since the new programs were originally written for the PC, they tend to be less expensive, they take less computer memory to run and their output is sometimes geared better to a computer screen. Four data sets used for homework problems and examples are available
from the publisher on disk to assist users in trying the multivariate techniques on their own. We have made changes in each chapter, ranging from rather minor ones to some that are quite extensive. But we have tried to keep the chapters short, and continue to emphasize the main features of the techniques rather than getting lost among the various branches that have grown here and there. Copious references are included for the reader who wishes to go into more details on a particular technique. APPROACH OF THIS BOOK The book has been written in a modular fashion. Part One, consisting of five chapters, provides examples of the multivariate techniques, discusses characterizing data for analysis, computer programs, data entry, data management, data clean-up, missing values and transformations, and presents a rough guide to assist in the choice of an appropriate multivariate analysis. These topics have been included since many investigators have more difficulty with these preliminary steps than with running the multivariate analyses. Also, if these steps are not done with care, the results of the statistical analysis can be faulty. In the remaining chapters a standard format has been used. The first four sections of each chapter include a discussion of when the technique is used, a data example and the basic assumptions and concepts of the technique. In subsequent sections, more detailed aspects of the technique are presented. At the end of each chapter, a summary table is given showing which features are available in the six packaged programs. Also a section entitled 'What to watch out for' is included to warn the reader about common problems related to data sets that might lead to biased statistical results. Here, we relied on our own experience in consulting rather than only listing the formal treatment of the subject. Finally, references plus further reading and a problem set are included at the end of each chapter. Part Two is on regression analysis. Chapter 6 deals with simple linear regression and is included for review purposes, to introduce our notation and to provide a more complete discussion of outliers than is found in some texts. Chapters 7-9 are concerned with multiple linear regression. Multiple linear regression analysis is very heavily used in practice and provides the foundation for understanding many concepts relating to residual analysis, transformations, choice of variables, missing values, dummy variables and multicollinearity. Since these concepts are essential to a good grasp of multivariate analysis, we thought it useful to include these chapters in the book. Part Three includes chapters on canonical correlation analysis, discriminant analysis, logistic regression analysis, regression analysis on survival data, principal components analysis, factor analysis, cluster analysis and log-linear analysis.
We have received many helpful and different suggestions from instructors and reviewers on the ordering of these chapters. For example, one reviewer uses the following order when teaching: principal components, factor analysis, cluster analysis and then canonical analysis. Another prefers having logistic regression and analysis of survival data directly after regression analysis. Obviously, the user looking for information on a particular technique can go directly to that chapter and instructors have a wide choice of sensible orderings to choose from. Since many users of multivariate statistics restrict themselves to the analyses available in their package program, we emphasize the features included in the six computer packages more than most other statistical books. The material on the features of the computer programs can be easily skipped or assigned for self-study if the book is used as a text for a course. The discussion of the multivariate methods themselves does not depend on the material concerning the computer packages. We thank Michal Fridman-Sudit and David Zhang for their editorial and technical assistance in preparing the third edition. A.A. Afifi Virginia Clark
Preface to the second edition The emphasis in the second edition of Computer-Aided Multivariate Analysis continues to be on performing and understanding multivariate analysis, not on the necessary mathematical derivations. We added new features while keeping the descriptions brief and to the point in order to maintain the book at a reasonable length. In this new edition, additional emphasis is placed on using personal computers, particularly in data entry and editing. While continuing to use the BMDP, SAS and SPSS packages for working out the statistical examples, we also describe their data editing and data management features. New material has also been included to enable the reader to make the choice of transformations more straightforward. Beginning with Chapter 6, a new section, entitled 'What to Watch Out For', is included in each chapter. These sections attempt to warn the reader about common problems related to the data set that may lead to biased statistical results. We relied on our own experience in consulting rather than simply listing the assumptions made in deriving the techniques. Statisticians sometimes get nervous about nonstatisticians performing multivariate analyses without necessarily being well versed in their mathematical derivations. We hope that the new sections will provide some warning of when the results should not be blindly trusted and when it may be necessary to consult an expert statistical methodologist. Since the first edition, many items on residual analysis have been added to the output of most multivariate and especially regression analysis programs. Accordingly, we expanded and updated our discussion of the use of residuals to detect outliers, check for lack of independence and assess normality. We also explained the use of principal components in regression analysis in the presence of multicollinearity. In discussing logistic regression, we presented information on evaluating how well the equation predicts outcomes and how to use the receiver operating characteristic (ROC) curves in this regard. We added a chapter in this edition on regression analysis using survival data. Here the dependent variable is the length of time until a defined event occurs. Both the log-linear, or accelerated failure time model, and the Cox proportional hazards models are presented along with examples of their use.
We compare the interpretation of the coefficients resulting from the two models. For survival data, the outcome can be classified into one of two possibilities, success or failure, and logistic regression can be used for analyzing the data. We present a comparison of this type of analysis with the other two regression methods for survival data. To compensate for the addition of this chapter, we deleted from the first edition the chapter on nonlinear regression. We also placed the chapter on canonical correlation immediately after the regression chapters, since this technique may be viewed as a generalization of multiple correlation analysis. Descriptions of computer output in all chapters have been updated to include new features appearing since the first edition. New problems have been added to the problem sets, and references were expanded to incorporate recent literature. We thank Stella Grosser for her help in writing new problems, updating the references, deciphering new computer output and checking the manuscript. We also thank Mary Hunter, Evalon Witt and Jackie Champion for their typing and general assistance with the manuscript for the second edition. A.A. Afifi Virginia Clark
Preface to the first edition This book has been written for investigators, specifically behavioral scientists, biomedical scientists, econometricians, and industrial users who wish to perform multivariate statistical analyses on their data and understand the results. In addition, we believe that the book will be helpful to many statisticians who have been trained in conventional mathematical statistics (where applications of the techniques were not discussed) and who are now working as statistical consultants. Statistical consultants overall should find the book useful in assisting them in giving explanations to clients who lack sufficient background in mathematics. We do not present mathematical derivations of the techniques in this book but, rather, rely on geometric and graphical arguments and on examples to illustrate them. The mathematical level has been kept deliberately low, with no mathematics beyond high-school level required. Ample references are included for those who wish to see the derivations of the results. We have assumed that you have taken a basic course in statistics and are familiar with statistics such as the mean and the standard deviation. Also, tests of hypotheses are presented in several chapters, and we assume you are familiar with the basic concept of testing a null hypothesis. Many of the computer programs utilize analysis of variance, and that part of the programs can only be understood if you are familiar with one-way analysis of variance. APPROACH OF THE BOOK The content and organizational features of this book are discussed in detail in Chapter 1, section 1.4. Because no university-level mathematics is assumed, some topics often found in books on multivariate analysis have not been included. For example, there is no theoretical discussion of sampling from a multivariate normal distribution or the Wishart distribution, and the usual chapter on matrices has not been included. Also, we point out that the book is not intended to be a comprehensive text on multivariate analysis. The choice of topics included reflects our preferences and experience as consulting statisticians. For example, we deliberately excluded multivariate analysis of variance because we felt that it is not as commonly used as the other topics we describe. On the other hand, we would have liked to include
the log-linear model for analyzing multivariate categorical data, but we decided against this for fear that doing so would have taken us far afield and would have added greatly to the length of the book. The multivariate analyses have been discussed more as separate techniques than as special cases arising from some general framework. The advantage of the approach used here is that we can concentrate on explaining how to analyze a certain type of data by using output from readily available computer programs in order to answer realistic questions. The disadvantage is that the theoretical interrelationships among some of the techniques are not highlighted. USES OF THE BOOK This book can be used as a text in an applied statistics course or in a continuing education course. We have used preliminary versions of this book in teaching behavioral scientists, epidemiologists, and applied statisticians. It is possible to start with Chapter 6 or 7 and cover the remaining chapters easily in one semester if the students have had a solid background in basic statistics. Two data sets are included that can be used for homework, and a set of problems is included at the end of each chapter except the first. COMPUTER ORIENTATION The original derivations for most of the current multivariate techniques were done over forty years ago. We feel that the application of these techniques to real-life problems is now the 'fun' part of this field. The presence of computer packages and the availability of computers have removed many of the tedious aspects of this discipline so that we can concentrate on thinking about the nature of the scientific problem itself and on what we can learn from our data by using multivariate analysis. Because the multivariate techniques require a lot of computations, we have assumed that packaged programs will be used. Here we bring together discussions of data entry, data screening, data reduction and data analysis aimed at helping you to perform these functions, understand the basic methodology and determine what insights into your data you can gain from multivariate analyses. Examples of control statements are given for various computer runs used in the text. Also included are discussions of the options available in the different statistical packages and how they can be used to achieve the desired output. ACKNOWLEDGMENTS Our most obvious acknowledgment is due to those who have created the computer-program packages for statistical analysis. Many of the applications
of multivariate analysis that are now prevalent in the literature are directly due to the development and availability of these programs. The efforts of the programmers and statisticians who managed and wrote the BMDP, SAS, SPSS and other statistical packages have made it easier for statisticians and researchers to actually use the methods developed by Hotelling, Wilks, Fisher and others in the 1930s or even earlier. While we have enjoyed longtime associations with members of the BMDP group, we also regularly use SAS and SPSS-X programs in our work, and we have endeavored to reflect this interest in our book. We recognize that other statistical packages are in use, and we hope that our book will be of help to their users as well. We are indebted to Dr Ralph Frerichs and Dr Carol Aneshensel for the use of a sample data set from the Los Angeles Depression Study, which appears in many of the data examples. We wish to thank Dr Roger Detels for the use of lung function data taken from households from the UCLA population studies of chronic obstructive respiratory disease in Los Angeles. We also thank Forbes Magazine for allowing us to use financial data from their publications. We particularly thank Welden Clark and Nanni Afifi for reviewing drafts and adding to the discussion of the financial data. Helpful reviews and suggestions were also obtained from Dr Mary Ann Hill, Dr Roberta Madison, Mr Alexander Kugushev and several anonymous reviewers. Welden Clark has, in addition, helped with programming and computer support in the draft revisions and checking, and in preparation of the bibliographies and data sets. We would further like to thank Dr Tim Morgan, Mr Steven Lewis and Dr Jack Lee for their help in preparing the data sets. The BMDP Statcat (trademark of BMDP Statistical Software, Inc.) desktop computer with the UNIX (trademark of Bell Telephone Laboratories, Inc.) program system has served for text processing as well as further statistical analyses. The help provided by Jerry Toporek of BMDP and Howard Gordon of Network Research Corporation is appreciated. In addition we would like to thank our copyeditor, Carol Beal, for her efforts in making the manuscript more readable both for the readers and for the typesetter. The major portion of the typing was performed by Mrs Anne Eiseman; we thank her also for keeping track of all the drafts and teaching material derived in its production. Additional typing was carried out by Mrs Judie Milton, Mrs Indira Moghaddan and Mrs Esther Najera. A.A. Afifi Virginia Clark
Part One Preparation for Analysis
1 What is multivariate analysis? 1.1 HOW IS MULTIVARIATE ANALYSIS DEFINED? The expression multivariate analysis is used to describe analyses of data that are multivariate in the sense that numerous observations or variables are obtained for each individual or unit studied. In a typical survey 30 to 100 questions are asked of each respondent. In describing the financial status of a company, an investor may wish to examine five to ten measures of the company's performance. Commonly, the answers to some of these measures are interrelated. The challenge of disentangling complicated interrelationships among various measures on the same individual or unit and of interpreting these results is what makes multivariate analysis a rewarding activity for the investigator. Often results are obtained that could not be attained without multivariate analysis. In the next section of this chapter several studies are described in which the use of multivariate analyses is essential to understanding the underlying problem. Section 1.3 gives a listing and a very brief description of the multivariate analysis techniques discussed in this book. Section 1.4 then outlines the organization of the book.
1.2 EXAMPLES OF STUDIES IN WHICH MULTIVARIATE ANALYSIS IS USEFUL The studies described in the following subsections illustrate various multivariate analysis techniques. Some are used later in the book as examples. Depression study example
The data for the depression study have been obtained from a complex, random, multiethnic sample of 1000 adult residents of Los Angeles County. The study was a panel or longitudinal design where the same respondents were interviewed four times between May 1979 and July 1980. About three-fourths of the respondents were reinterviewed for all four interviews.
The field work for the survey was conducted by professional interviewers from the Institute for Social Science Research at UCLA. This research is an epidemiological study of depression and help-seeking behavior among free-living (noninstitutionalized) adults. The major objectives are to provide estimates of the prevalence and incidence of depression and to identify causal factors and outcomes associated with this condition. The factors examined include demographic variables, life events stressors, physical health status, health care use, medication use, lifestyle, and social support networks. The major instrument used for classifying depression is the Depression Index (CESD) of the National Institute of Mental Health, Center for Epidemiological Studies. A discussion of this index and the resulting prevalence of depression in this sample is given in Frerichs, Aneshensel and Clark (1981). The longitudinal design of the study offers advantages for assessing causal priorities since the time sequence allows us to rule out certain potential causal links. Nonexperimental data of this type cannot directly be used to establish causal relationships, but models based on an explicit theoretical framework can be tested to determine if they are consistent with the data. An example of such model testing is given in Aneshensel and Frerichs (1982). Data from the first time period of the depression study are presented in Chapter 3. Only a subset of the factors measured on a sample of the respondents is included in order to keep the data set easily comprehensible. These data are used several times in subsequent chapters to illustrate some of the multivariate techniques presented in this book. Bank loan study The managers of a bank need some way to improve their prediction of which borrowers will successfully pay back a type of bank loan. They have data from the past on the characteristics of persons to whom the bank has lent money and the subsequent record of how well the person has repaid the loan. Loan payers can be classified into several types: those who met all of the terms of the loan, those who eventually repaid the loan but often did not meet deadlines, and those who simply defaulted. They also have information on age, sex, income, other indebtedness, length of residence, type of residence, family size, occupation and the reason for the loan. The question is, can a simple rating system be devised that will help the bank personnel improve their prediction rate and lessen the time it takes to approve loans? The methods described in Chapters 11 and 12 can be used to answer this question. Chronic respiratory disease study The purpose of the ongoing respiratory disease study is to determine the effects of various types of smog on lung function of children and adults in
the Los Angeles area. Because they could not randomly assign people to live in areas that had different levels of pollutants, the investigators were very concerned about the interaction that might exist between the locations where persons chose to live and their values on various lung function tests. The investigators picked four areas of quite different types of air pollution and are measuring various demographic and other responses on all persons over seven years old who live there. These areas were chosen so that they are close to an air-monitoring station. The researchers are taking measurements at two points in time and are using the change in lung function over time as well as the levels at the two periods as outcome measures to assess the effects of air pollution. The investigators have had to do the lung function tests by using a mobile unit in the field, and much effort has gone into problems of validating the accuracy of the field observations. A discussion of the particular lung function measurements used for one of the four areas can be found in Detels et al. (1975). In the analysis of the data, adjustments must be made for sex, age, height and smoking status of each person. Over 15 000 respondents have been examined and interviewed in this study. The original data analyses were restricted to the first collection period, but now analyses include both time periods. The data set is being used to answer numerous questions concerning effects of air pollution, smoking, occupation, etc. on different lung function measurements. For example, since the investigators obtained measurements on all family members seven years old and older, it is possible to assess the effects of having parents who smoke on the lung function of their children (Tashkin et al., 1984). Studies of this type require multivariate analyses so that investigators can arrive at plausible scientific conclusions that could explain the resulting lung function levels. A subset of this data set is included in Appendix A. Lung function and associated data are given for nonsmoking families for the father, mother and up to three children ages 7-17. Assessor office example
Local civil laws often require that the amount of property tax a homeowner pays be a percentage of the current value of the property. Local assessor's offices are charged with the function of estimating current value. Current value can be estimated by finding comparable homes that have been recently sold and using some sort of an average selling price as an estimate of the price of those properties not sold. Alternatively, the sample of sold homes can indicate certain relationships between selling price and several other characteristics such as the size of the lot, the size of the livable area, the number of bathrooms, the location etc. These relationships can then be incorporated into a mathematical equation
used to estimate the current selling price from those other characteristics. Multiple regression analysis methods discussed in Chapters 7-9 can be used by many assessor's offices for this purpose (Tchira, 1973). 1.3 MULTIVARIATE ANALYSES DISCUSSED IN THIS BOOK In this section a brief description of the major multivariate techniques covered in this book is presented. To keep the statistical vocabulary to a minimum, we illustrate the descriptions by examples.
Simple linear regression A nutritionist wishes to study the effects of early calcium intake on the bone density of postmenopausal women. She can measure the bone density of the arm (radial bone), in grams per square centimeter, by using a noninvasive device. Women who are at risk of hip fractures because of too low a bone density will show low arm bone density also. The nutritionist intends to sample a group of elderly churchgoing women. For women over 65 years of age, she will plot calcium intake as a teenager (obtained by asking the women about their consumption of high-calcium foods during their teens) on the horizontal axis and arm bone density (measured) on the vertical axis. She expects the radial bone density to be lower in women who had a lower calcium intake. The nutritionist plans to fit a simple linear regression equation and test whether the slope of the regression line is zero. In this example a single outcome factor is being predicted by a single predictor factor. Simple linear regression as used in this case would not be considered multivariate by some statisticians, but it is included in this book to introduce the topic of multiple regression.
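The book carries out such computations in the statistical packages compared in later chapters; purely as an illustrative aside (not from the text), a fit of this kind can be sketched in a few lines of Python, with made-up calcium and bone-density values standing in for the nutritionist's data.

```python
# Illustrative sketch only: simple linear regression of arm bone density on
# teenage calcium intake, with a test of the null hypothesis that the slope is zero.
# The numbers below are invented for the example.
import numpy as np
from scipy import stats

calcium = np.array([820, 640, 1100, 730, 950, 600, 880, 1020])        # mg/day
density = np.array([0.61, 0.55, 0.70, 0.58, 0.66, 0.52, 0.63, 0.68])  # g/cm^2

fit = stats.linregress(calcium, density)
print(f"slope     = {fit.slope:.5f} g/cm^2 per mg/day")
print(f"intercept = {fit.intercept:.3f} g/cm^2")
print(f"two-sided p-value for H0: slope = 0 is {fit.pvalue:.4f}")
```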
Multiple linear regression A manager is interested in determining which factors predict the dollar value of sales of the firm's personal computers. Aggregate data on population size, income, educational level, proportion of population living in metropolitan areas etc. have been collected for 30 areas. As a first step, a multiple linear regression equation is computed, where dollar sales is the outcome factor and the other factors are considered as candidates for predictor factors. A linear combination of the predictor factors is used to predict the outcome or response factor.
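Again as a hedged illustration rather than anything from the book, the same kind of equation can be fitted by ordinary least squares in Python; the area-level data below are simulated, and the printed summary plays the role of the package output discussed in Chapters 7-9.

```python
# Illustrative sketch only: multiple linear regression of sales on several
# area-level predictors, using simulated data for 30 areas.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
areas = pd.DataFrame({
    "population": rng.uniform(50, 500, 30),   # thousands of residents (made up)
    "income": rng.uniform(20, 60, 30),        # mean income in $1000s (made up)
    "education": rng.uniform(10, 16, 30),     # mean years of schooling (made up)
    "metro": rng.uniform(0, 1, 30),           # proportion living in metropolitan areas
})
sales = 2.0 * areas["population"] + 5.0 * areas["income"] + rng.normal(0, 40, 30)

X = sm.add_constant(areas)                    # add an intercept term
fit = sm.OLS(sales, X).fit()
print(fit.summary())                          # coefficients, t tests, R-squared
```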
Canonical correlation A psychiatrist wishes to correlate levels of both depression and physical well-being from data on age, sex, income, number of contacts per month
with family and friends, and marital status. This problem is different from the one posed in the multiple linear regression example because more than one outcome factor is being predicted. The investigator wishes to determine the linear function of age, sex, income, contacts per month and marital status that is most highly correlated with a linear function of depression and physical well-being. After these two linear functions, called canonical variables, are determined, the investigator will test to see whether there is a statistically significant (canonical) correlation between scores from the two linear functions and whether a reasonable interpretation can be made of the two sets of coefficients from the functions.
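As a sketch only (the variables below are random stand-ins, not the psychiatrist's data), the pair of linear functions and their canonical correlation can be computed with scikit-learn's CCA:

```python
# Illustrative sketch only: the first canonical pair relating five predictors
# to two outcome measures (depression and physical well-being), on random data.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))    # age, sex, income, contacts per month, marital status
Y = rng.normal(size=(200, 2))    # depression score, physical well-being score

cca = CCA(n_components=1)
X_scores, Y_scores = cca.fit_transform(X, Y)        # scores on the canonical variables
r = np.corrcoef(X_scores[:, 0], Y_scores[:, 0])[0, 1]
print(f"first canonical correlation = {r:.3f}")
print("coefficients for the predictor set:", cca.x_weights_.ravel().round(3))
print("coefficients for the outcome set:  ", cca.y_weights_.ravel().round(3))
```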
Discriminant function analysis A large sample of initially disease-free men over 50 years of age from a community has been followed to see who subsequently has a diagnosed heart attack. At the initial visit, blood was drawn from each man, and numerous determinations were made from it, including serum cholesterol, phospholipids and blood glucose. The investigator would like to determine a linear function of these and possibly other measurements that would be useful in predicting who would and who would not get a heart attack within ten years. That is, the investigator wishes to derive a classification (discriminant) function that would help determine whether or not a middle-aged man is likely to have a heart attack.
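A minimal sketch of such a classification function, again with simulated blood measurements rather than real study data, might use scikit-learn's linear discriminant analysis:

```python
# Illustrative sketch only: a linear discriminant function for classifying men
# as likely or unlikely to have a heart attack within ten years (simulated data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))          # serum cholesterol, phospholipids, blood glucose
y = rng.integers(0, 2, size=300)       # 1 = diagnosed heart attack within ten years

lda = LinearDiscriminantAnalysis().fit(X, y)
print("discriminant coefficients:", lda.coef_.ravel().round(3))

new_man = np.array([[1.2, -0.3, 0.8]])  # standardized measurements for a new subject
print("predicted group:", lda.predict(new_man)[0])
print("estimated group probabilities:", lda.predict_proba(new_man).round(3))
```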
Logistic regression A television station staff has classified movies according to whether they have a high or low proportion of the viewing audience when shown. The staff has also measured factors such as the length and the type of story and the characteristics of the actors. Many of the characteristics are discrete yes-no or categorical types of data. The investigator may use logistic regression because some of the data do not meet the assumptions for statistical inference used in discriminant function analysis, but they do meet the assumptions for logistic regression. In logistic regression we derive an equation to estimate the probability of capturing a high proportion of the audience.
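Purely for illustration, with invented movie characteristics in place of the station's data, the estimated probability comes from a fitted logistic equation such as the following sketch:

```python
# Illustrative sketch only: logistic regression estimating the probability that
# a movie captures a high proportion of the viewing audience (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
length = rng.normal(100, 15, 120)          # running time in minutes
drama = rng.integers(0, 2, 120)            # 1 = drama, 0 = other story type
star = rng.integers(0, 2, 120)             # 1 = well-known lead actor
high_share = rng.integers(0, 2, 120)       # 1 = high audience proportion

X = sm.add_constant(np.column_stack([length, drama, star]))
fit = sm.Logit(high_share, X).fit(disp=False)
print("coefficients on the logit scale:", fit.params.round(3))
print("odds ratios for the predictors: ", np.exp(fit.params[1:]).round(3))
print("estimated probabilities, first five movies:", fit.predict(X[:5]).round(3))
```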
Survival analysis An administrator of a large health maintenance organization (HMO) has collected data since 1970 on length of employment in years for their physicians who are either family practitioners or internists. Some of the physicians are still employed, but many have left. For those still employed, the administrator can only know that their ultimate length of employment will be greater than their current length of employment. The administrator wishes to describe
the distribution of length of employment for each type of physician, determine the possible effects of factors such as gender and location of work, and test whether or not the length of employment is the same for the two specialties. Survival analysis, or event history analysis (as it is often called by behavioral scientists), can be used to analyze the distribution of time to an event such as quitting work, having a relapse of a disease, or dying of cancer.
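As one hedged illustration (using the third-party lifelines package and simulated employment records, not the HMO's data), the censored times can be summarized with a Kaplan-Meier estimate and the covariate effects with a Cox proportional hazards model, both discussed in Chapter 13:

```python
# Illustrative sketch only: censored length-of-employment data analyzed with a
# Kaplan-Meier estimate and a Cox proportional hazards model (lifelines package).
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter

rng = np.random.default_rng(4)
n = 200
df = pd.DataFrame({
    "years": np.round(rng.exponential(6.0, n), 2),  # simulated employment times
    "left": rng.integers(0, 2, n),                  # 1 = left, 0 = still employed (censored)
    "internist": rng.integers(0, 2, n),             # specialty indicator
    "female": rng.integers(0, 2, n),                # gender indicator
})

km = KaplanMeierFitter().fit(df["years"], event_observed=df["left"])
print("estimated median length of employment:", km.median_survival_time_)

cox = CoxPHFitter().fit(df, duration_col="years", event_col="left")
cox.print_summary()                                 # hazard ratios for specialty and gender
```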
Principal components analysis An investigator has made a number of measurements of lung function on a sample of adult males who do not smoke. In these tests each man is told to inhale deeply and then blow out as fast and as much as possible into a spirometer, which makes a trace of the volume of air expired over time. The maximum or forced vital capacity (FVC) is measured as the difference between maximum inspiration and maximum expiration. Also, the amount of air expired in the first second (FEV1), the forced mid-expiratory flow rate (FEF 25-75), the maximal expiratory flow rate at 50% of forced vital capacity (V50) and other measures of lung function are calculated from this trace. Since all these measures are made from the same flow-volume curve for each man, they are highly interrelated. From past experience it is known that some of these measures are more interrelated than others and that they measure airway resistance in different sections of the airway. The investigator performs a principal components analysis to determine whether a new set of measurements called principal components can be obtained. These principal components will be linear functions of the original lung function measurements and will be uncorrelated with each other. It is hoped that the first two or three principal components will explain most of the variation in the original lung function measurements among the men. Also, it is anticipated that some operational meaning can be attached to these linear functions that will aid in their interpretation. The investigator may decide to do future analyses on these uncorrelated principal components rather than on the original data. One advantage of this method is that often fewer principal components are needed than original variables. Also, since the principal components are uncorrelated, future computations and explanations can be simplified.
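A compact sketch of the computation, on simulated and deliberately correlated stand-ins for the lung function measurements, shows the two quantities an investigator would look at first: the proportion of variance explained and the loadings of the first component.

```python
# Illustrative sketch only: principal components of four correlated lung function
# measures (simulated stand-ins for FVC, FEV1, FEF 25-75 and V50), after standardizing.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
common = rng.normal(size=(150, 1))     # a shared component makes the columns correlate
lung = np.hstack([common + 0.3 * rng.normal(size=(150, 1)) for _ in range(4)])

pca = PCA().fit(StandardScaler().fit_transform(lung))
print("proportion of variance explained:", pca.explained_variance_ratio_.round(3))
print("loadings of the first principal component:", pca.components_[0].round(3))
```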
Factor analysis An investigator has asked each respondent in a survey whether he or she strongly agrees, agrees, is undecided, disagrees, or strongly disagrees with 15 statements concerning attitudes toward inflation. As a first step, the investigator will do a factor analysis on the resulting data to determine which statements belong together in sets that are uncorrelated with other sets. The particular statements that form a single set will be examined to obtain a better
understanding of attitudes toward inflation. Scores derived from each set or factor will be used in subsequent analysis to predict consumer spending. Cluster analysis Investigators have made numerous measurements on a sample of patients who have been classified as being depressed. They wish to determine, on the basis of their measurements, whether these patients can be classified by type of depression. That is, is it possible to determine distinct types of depressed patients by performing a cluster analysis on patient scores on various tests? Unlike the investigator of men who do or do not get heart attacks, these investigators do not possess a set of individuals whose type of depression can be known before the analysis is performed (see Andreasen and Grove, 1982, for an example). Nevertheless, the investigators want to separate the patients into separate groups and to examine the resulting groups to see whether distinct types do exist and, if so, what their characteristics are. Log-linear analysis An epidemiologist in a medical study wishes to examine the interrelationships among the use of substances that are thought to be risk factors for disease. These include four risk factors where the answers have been summarized into categories. The risk factors are smoking tobacco (yes at present, former smoker, never smoked), drinking (yes, no), marijuana use (yes, no) and other illicit drug use (yes, no). Previous studies have shown that people who drink are more apt than nondrinkers to smoke cigarettes, but the investigator wants to study the associations among the use of these four substances simultaneously. 1.4 ORGANIZATION AND CONTENT OF THE BOOK This book is organized into three major parts. Part One (Chapters 1-5) deals with data preparation, entry, screening, transformations and decisions about likely choices for analysis. Part Two (Chapters 6-9) deals with regression analysis. Part Three (Chapters 10-17) deals with a number of multivariate analyses. Statisticians disagree on whether or not regression is properly considered as part of multivariate analysis. We have tried to avoid this argument by including regression in the book, but as a separate part. Statisticians certainly agree that regression is an important technique for dealing with problems having multiple variables. In Part Two on regression analysis we have included various topics, such as dummy variables, that are used in Part Three. Chapters 2-5 are concerned with data preparation and the choice of what analysis to use. First, variables and how they are classified are discussed in Chapter 2. The next two chapters concentrate on the practical problems of getting data into the computer, getting rid of erroneous values, checking
assumptions of normality and independence, creating new variables and preparing a useful code book. The use of personal computers for data entry and analysis is discussed. The choice of appropriate statistical analyses is discussed in Chapter 5. Readers who are familiar with handling data sets on computers could skip these initial chapters and go directly to Chapter 6. However, formal coursework in statistics often leaves an investigator unprepared for the complications and difficulties involved in real data sets. The material in Chapters 2-5 was deliberately included to fill this gap in preparing investigators for real world data problems. For a course limited to multivariate analysis, Chapters 2-5 can be omitted if a carefully prepared data set is used for analysis. The depression data set, presented in Section 3.6, has been modified to make it directly usable for multivariate data analysis but the user may wish to subtract one from the variables 2, 31, 33 and 34 to change the results to zeros and ones. Also, the lung function data presented in Appendix A and the lung cancer data presented in Appendix B can be used directly. These data, along with the data in Table 8.1, are available on disk from the publisher. In Chapters 6-17 we follow a standard format. The topics discussed in each chapter are given, followed by a discussion of when the techniques are used. Then the basic concepts and formulas are explained. Further interpretation, and data examples with topics chosen that relate directly to the techniques, follow. Finally, a summary of the available computer output that may be obtained from six statistical packaged programs is presented. We conclude each chapter with a discussion of pitfalls to avoid when performing the analyses described. As much as possible, we have tried to make each chapter self-contained. However, Chapters 11 and 12, on discriminant analysis and logistic regression, are somewhat interrelated, as are Chapters 14 and 15, covering principal components and factor analyses. References for further information on each topic are given at the end of each chapter. Most of the references at the end of the chapters do require more mathematics than this book, but special emphasis can be placed on references that include examples. References requiring a strong mathematical background are preceded by an asterisk. If you wish primarily to learn the concepts involved in the multivariate techniques and are not as interested in performing the analysis, then a conceptual introduction to multivariate analysis can be found in Kachigan (1986). We believe that the best way to learn multivariate analysis is to do it on data that the investigator is familiar with. No book can illustrate all the features found in computer output for a real-life data set. Learning multivariate analysis is similar to learning to swim: you can go to lectures, but the real learning occurs when you get into the water.
REFERENCES
Andreasen, N.C. and Grove, W.M. (1982). The classification of depression: Traditional versus mathematical approaches. American Journal of Psychiatry, 139, 45-52.
Aneshensel, C.S. and Frerichs, R.R. (1982). Stress, support, and depression: A longitudinal causal model. Journal of Community Psychology, 10, 363-76.
Detels, R., Coulson, A., Tashkin, D. and Rokaw, S. (1975). Reliability of plethysmography, the single breath test, and spirometry in population studies. Bulletin de Physiopathologie Respiratoire, 11, 9-30.
Frerichs, R.R., Aneshensel, C.S. and Clark, V.A. (1981). Prevalence of depression in Los Angeles County. American Journal of Epidemiology, 113, 691-99.
Kachigan, S.K. (1986). Statistical Analysis: An Interdisciplinary Introduction to Univariate and Multivariate Methods. Radius Press, New York.
Tashkin, D.P., Clark, V.A., Simmons, M., Reems, C., Coulson, A.H., Bourque, L.B., Sayre, J.W., Detels, R. and Rokaw, S. (1984). The UCLA population studies of chronic obstructive respiratory disease. VII. Relationship between parents' smoking and children's lung function. American Review of Respiratory Disease, 129, 891-97.
Tchira, A.A. (1973). Stepwise regression applied to a residential income valuation system. Assessors Journal, 8, 23-35.
2
Characterizing data for future analyses 2.1 VARIABLES: THEIR DEFINITION, CLASSIFICATION AND USE In performing multivariate analysis, the investigator deals with numerous variables. In this chapter, we define what a variable is in section 2.2. Section 2.3 presents a method of classifying variables that is sometimes useful in multivariate analysis since it allows one to check that a commonly used analysis has not been missed. Section 2.4 explains how variables are used in analysis and gives the common terminology for distinguishing between the two major uses of variables. Section 2.5 gives some examples of classifying variables and section 2.6 discusses other characteristics of data and gives references to exploratory data analysis.
2.2 DEFINING STATISTICAL VARIABLES The word variable is used in statistically oriented literature to indicate a characteristic or property that is possible to measure. When we measure something, we make a numerical model of the thing being measured. We follow some rule for assigning a number to each level of the particular characteristic being measured. For example, height of a person is a variable. We assign a numerical value to correspond to each person's height. Two people who are equally tall are assigned the same numeric value. On the other hand, two people of different heights are assigned two different values. Measurements of a variable gain their meaning from the fact that there exists unique correspondence between the assigned numbers and the levels of the property being measured. Thus two people with different assigned heights are not equally tall. Conversely, if a variable has the same assigned value for all individuals in a group, then this variable does not convey useful information about individuals in the group. Physical measurements, such as height and weight, can be measured directly by using physical instruments. On the other hand, properties such as reasoning ability or the state of depression of a person must be measured
indirectly. We might choose a particular intelligence test and define the variable 'intelligence' to be the score achieved on this test. Similarly, we may define the variable 'depression' as the number of positive responses to a series of questions. Although what we wish to measure is the degree of depression, we end up with a count of yes answers to some questions. These examples point out a fundamental difference between direct physical measurements and abstract variables. Often the question of how to measure a certain property can be perplexing. For example, if the property we wish to measure is the cost of keeping the air clean in a particular area, we may be able to come up with a reasonable estimate, although different analysts may produce different estimates. The problem becomes much more difficult if we wish to estimate the benefits of clean air. On any given individual or thing we may measure several different characteristics. We would then be dealing with several variables, such as age, height, annual income, race, sex and level of depression of a certain individual. Similarly, we can measure characteristics of a corporation, such as various financial measures. In this book we are concerned with analyzing data sets consisting of measurements on several variables for each individual in a given sample. We use the symbol P to denote the number of variables and the symbol N to denote the number of individuals, observations, cases or sampling units. 2.3 HOW VARIABLES ARE CLASSIFIED: STEVENS'S CLASSIFICATION SYSTEM In the determination of the appropriate statistical analysis for a given set of data, it is useful to classify variables by type. One method for classifying variables is by the degree of sophistication evident in the way they are measured. For example, we can measure height of people according to whether the top of their head exceeds a mark on the wall; if yes, they are tall; and if no, they are short. On the other hand, we can also measure height in centimeters or inches. The latter technique is a more sophisticated way of measuring height. As a scientific discipline advances, measurements of the variables with which it deals become very sophisticated. Various attempts have been made to formalize variable classification. A commonly accepted system is that proposed by Stevens (1951). In this system, measurements are classified as nominal, ordinal, interval or ratio. In deriving this classification, Stevens characterized each of the four types by a transformation that would not change a measurement's classification. In the subsections that follow, rather than discuss the mathematical details of these transformations, we present the practical implications for data analysis. As with many classification schemes, Stevens's system is useful for some purposes but not for others. It should be used as a general guide to assist
in characterizing the data and to make sure that a useful analysis is not overlooked. However, it should not be used as a rigid rule that ignores the purpose of the analysis or limits its scope (Velleman and Wilkinson, 1993).
Nominal variables

With nominal variables each observation belongs to one of several distinct categories. The categories are not necessarily numerical, although numbers may be used to represent them. For example, 'sex' is a nominal variable. An individual's gender is either male or female. We may use any two symbols, such as M and F, to represent the two categories. In computerized data analysis, numbers are used as the symbols since many computer programs are designed to handle only numerical symbols. Since the categories may be arranged in any desired order, any set of numbers can be used to represent them. For example, we may use 0 and 1 to represent males and females, respectively. We may also use 1 and 2 to avoid confusing zeros with blanks. Any two other numbers can be used as long as they are used consistently.

An investigator may rename the categories, thus performing a numerical operation. In doing so, the investigator must preserve the uniqueness of each category. Stevens expressed this last idea as a 'basic empirical operation' that preserves the category to which the observation belongs. For example, two males must have the same value on the variable 'sex', regardless of the two numbers chosen for the categories. Table 2.1 summarizes these ideas and presents further examples. Nominal variables with more than two categories, such as race or religion, may present special challenges to the multivariate data analyst. Some ways of dealing with these variables are presented in Chapter 9.

Table 2.1 Stevens's measurement system

Type of measurement   Basic empirical operation                         Examples
Nominal               Determination of equality of categories           Company names; race; religion; basketball players' numbers
Ordinal               Determination of greater than or less than        Hardness of minerals; socioeconomic status; rankings of wines
                      (ranking)
Interval              Determination of equality of differences          Temperature, in degrees Fahrenheit; calendar dates
                      between levels
Ratio                 Determination of equality of ratios of levels     Height; weight; density; difference in time
Ordinal variables

Categories are used for ordinal variables as well, but there also exists a known order among them. For example, in the Mohs Hardness Scale, minerals and rocks are classified according to ten levels of hardness. The hardest mineral is diamond and the softest is talc (Pough, 1976). Any ten numbers can be used to represent the categories, as long as they are ordered in magnitude. For instance, the integers 1-10 would be natural to use. On the other hand, any sequence of increasing numbers may also be used. Thus the basic empirical operation defining ordinal variables is whether one observation is greater than another. For example, we must be able to determine whether one mineral is harder than another. Hardness can be tested easily by noting which mineral can scratch the other. Note that for most ordinal variables there is an underlying continuum being approximated by artificial categories. For example, in the above hardness scale fluorite is defined as having a hardness of 4, and calcite, 3. However, there is a range of hardness between these two numbers not accounted for by the scale.

Often investigators classify people, or ask them to classify themselves, along some continuum. For example, a physician may classify a patient's disease status as none = 1, mild = 2, moderate = 3 and severe = 4. Clearly increasing numbers indicate increasing severity, but it is not certain that the difference between not having an illness and having a mild case is the same as between having a mild case and a moderate case. Hence, according to Stevens's classification system, this is an ordinal variable.
Interval variables

An interval variable is a special ordinal variable in which the differences between successive values are always the same. For example, the variable 'temperature', in degrees Fahrenheit, is measured on the interval scale since the difference between 12° and 13° is the same as the difference between 13° and 14°, or the difference between any two successive temperatures. In contrast, the Mohs Hardness Scale does not satisfy this condition since the intervals between successive categories are not necessarily the same. The scale must satisfy the basic empirical operation of preserving the equality of intervals.
Ratio variables

Ratio variables are interval variables with a natural point representing the origin of measurement, i.e. a natural zero point. For instance, height is a ratio variable since zero height is a naturally defined point on the scale. We may change the unit of measurement (e.g. centimeters to inches), but we would still preserve the zero point and also the ratio of any two values of height. Temperature is not a ratio variable since we may choose the zero point arbitrarily, thus not preserving ratios. There is an interesting relationship between interval and ratio variables. The difference between two interval variables is a ratio variable. For example, although time of day is measured on the interval scale, the length of a time period is a ratio variable since it has a natural zero point.

Other classifications

Other methods of classifying variables have also been proposed (Coombs, 1964). Many authors use the term categorical to refer to nominal and ordinal variables where categories are used. We mention, in addition, that variables may be classified as discrete or continuous. A variable is called continuous if it can take on any value in a specified range. Thus the height of an individual may be 70 in. or 70.4539 in. Any numerical value in a certain range is a conceivable height. A variable that is not continuous is called discrete. A discrete variable may take on only certain specified values. For example, counts are discrete variables since only zero or positive integers are allowed. In fact, all nominal and ordinal variables are discrete. Interval and ratio variables can be continuous or discrete. This latter classification carries over to the possible distributions assumed in the analysis. For instance, the normal distribution is often used to describe the distribution of continuous variables.

Statistical analyses have been developed for various types of variables. In Chapter 5 a guide to selecting the appropriate descriptive measures and multivariate analyses will be presented. The choice depends on how the variables are used in the analysis, a topic that is discussed next.

2.4 HOW VARIABLES ARE USED IN DATA ANALYSIS

The type of data analysis required in a specific situation is also related to the way in which each variable in the data set is used. Variables may be used to measure outcomes or to explain why a particular outcome resulted. For example, in the treatment of a given disease a specific drug may be used. The outcome may be a discrete variable classified as 'cured' or 'not cured'. Also, the outcome may depend on several characteristics of the patient such as age, genetic background and severity of the disease. These characteristics are sometimes called explanatory or predictor variables. Equivalently, we may
call the outcome the dependent variable and the characteristics the independent variables. The latter terminology is very common in statistical literature. This choice of terminology is unfortunate in that the 'independent' variables do not have to be statistically independent of each other. Indeed, these independent variables are usually interrelated in a complex way. Another disadvantage of this terminology is that the common connotation of the words implies a causal model, an assumption not needed for the multivariate analyses described in this book. In spite of these drawbacks, the widespread use of these terms forces us to adopt them.

In other situations the dependent variable may be treated as a continuous variable. For example, in household survey data we may wish to relate monthly expenditure on cosmetics per household to several explanatory or independent variables such as the number of individuals in the household, their gender and the household income.

In some situations the roles that the various variables play are not obvious and may also change, depending on the question being addressed. Thus a data set for a certain group of people may contain observations on their sex, age, diet, weight and blood pressure. In one analysis we may use weight as a dependent variable with height, sex, age and diet as the independent variables. In another analysis blood pressure might be the dependent variable, with weight and other variables considered as independent variables.

In certain exploratory analyses all the variables may be used as one set with no regard to whether they are dependent or independent. For example, in the social sciences a large number of variables may be defined initially, followed by attempts to combine them into a smaller number of summary variables. In such an analysis the original variables are not classified as dependent or independent. The summary variables may later be used to possibly explain certain outcomes or dependent variables. In Chapter 5 multivariate analyses described in this book will be characterized by situations in which they apply according to the types of variables analyzed and the roles they play in the analysis.

2.5 EXAMPLES OF CLASSIFYING VARIABLES

In the depression data example several variables are measured on the nominal scale: sex, marital status, employment and religion. The general health scale is an example of an ordinal variable. Income and age are both ratio variables. No interval variable is included in the data set. A partial listing and a codebook for this data set are given in Chapter 3.

One of the questions that may be addressed in analyzing this data is 'Which factors are related to the degree of psychological depression of a person?' The variable 'cases' may be used as the dependent or outcome variable since an individual is considered a case if his or her score on the depression scale
exceeds a certain level. 'Cases' is an ordinal variable, although it can be considered nominal because it has only two categories. The independent variable could be any or all of the other variables (except ID and measures of depression). Examples of analyses without regard to variable roles are given in Chapters 14 and 15 using the variables C1 to C20 in an attempt to summarize them into a small number of components or factors.

Sometimes, the Stevens classification system is difficult to apply, and two investigators could disagree on a given variable. For example, there may be disagreement about the ordering of the categories of a socioeconomic status variable. Thus the status of blue-collar occupations with respect to the status of certain white-collar occupations might change over time or from culture to culture. So such a variable might be difficult to justify as an ordinal variable, but we would be throwing away valuable information if we used it as a nominal variable. Despite these difficulties, the Stevens system is useful in making decisions on appropriate statistical analysis, as will be discussed in Chapter 5.
2.6 OTHER CHARACTERISTICS OF DATA

Data are often characterized by whether the measurements are accurately taken and are relatively error free, and by whether they meet the assumptions that were used in deriving statistical tests and confidence intervals. Often, an investigator knows that some of the variables are likely to have observations that have errors. If the effect of an error causes the numerical value of an observation to be not in line with the numerical values of most of the other observations, these extreme values may be called outliers and should be considered for removal from the analysis. But other observations may not be accurate and still be within the range of most of the observations. Data sets that contain a sizeable portion of inaccurate data or errors are called 'dirty' data sets.

Special statistical methods have been developed that are resistant to the effects of dirty data. Other statistical methods, called robust methods, are insensitive to departures from underlying model assumptions. In this book, we do not present these methods but discuss finding outliers and give methods of determining if the data meet the assumptions. For further information on statistical methods that are well suited for dirty data or require few assumptions, see Velleman and Hoaglin (1981), Hoaglin, Mosteller and Tukey (1983) or Fox and Long (1990).

SUMMARY

In this chapter statistical variables were defined. Their types and the roles they play in data analysis were discussed.
These concepts can affect the choice of analyses to be performed, as will be discussed in Chapter 5.
REFERENCES

References preceded by an asterisk require strong mathematical background.

*Coombs, C.H. (1964). A Theory of Data. Wiley, New York.
Fox, J. and Long, J.S., eds (1990). Modern Methods of Data Analysis. Sage, Newbury Park, CA.
*Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (1983). Understanding Robust and Exploratory Data Analysis. Wiley, New York.
Pough, F.H. (1976). Field Guide to Rocks and Minerals, 4th edn. Houghton Mifflin, Boston, MA.
Stevens, S.S. (1951). Mathematics, measurement, and psychophysics, in Handbook of Experimental Psychology (ed. S.S. Stevens). Wiley, New York.
Velleman, P.F. and Hoaglin, D.C. (1981). Applications, Basics and Computing of Exploratory Data Analysis. Duxbury Press, Boston, MA.
Velleman, P.F. and Wilkinson, L. (1993). Nominal, ordinal, interval and ratio typologies are misleading. The American Statistician, 47, 65-72.
FURTHER READING

Churchman, C.W. and Ratoosh, P., eds (1959). Measurement: Definition and Theories. Wiley, New York.
Ellis, B. (1966). Basic Concepts of Measurement. Cambridge University Press, London.
Torgerson, W.S. (1958). Theory and Methods of Scaling. Wiley, New York.
PROBLEMS

2.1 Classify the following types of data by using Stevens's measurement system: decibels of noise level, father's occupation, parts per million of an impurity in water, density of a piece of bone, rating of a wine by one judge, net profit of a firm, and score on an aptitude test.
2.2 In a survey of users of a walk-in mental health clinic, data have been obtained on sex, age, household roster, race, education level (number of years in school), family income, reason for coming to the clinic, symptoms, and scores on screening examination. The investigator wishes to determine what variables affect whether or not coercion by the family, friends, or a governmental agency was used to get the patient to the clinic. Classify the data according to Stevens's measurement system. What would you consider to be possible independent variables? Dependent variables? Do you expect the dependent variables to be independent of each other?
2.3 For the chronic respiratory study data presented in Appendix A, classify each variable according to the Stevens scale and according to whether it is discrete or continuous. Pose two possible research questions and decide on the appropriate dependent and independent variables.
2.4 From a field of statistical application (perhaps your own field of specialty), describe a data set and repeat the procedures described in Problem 2.3.
2.5 If the RELIG variable described in Table 3.4 of this text was recoded 1 = Catholic, 2 = Protestant, 3 = Jewish, 4 = none and 5 = other, would this meet the basic empirical operation as defined by Stevens for an ordinal variable?
2.6 Give an example of nominal, ordinal, interval and ratio variables from a field of application you are familiar with.
2.7 Data that are ordinal are often analyzed by methods that Stevens reserved for interval data. Give reasons why thoughtful investigators often do this.
3 Preparing for data analysis

3.1 PROCESSING THE DATA SO THEY CAN BE ANALYZED

Once the data are available from a study there are still a number of steps that must be undertaken to get them into shape for analysis. This is particularly true when multivariate analyses are planned since these analyses are often done on large data sets. In this chapter we provide information on topics related to data processing.

In section 3.2 we discuss available computer choices to perform the analysis. In this book we are assuming that the analyses will be run on a PC or workstation, although the computation can also be performed on mainframe computers. Section 3.3 describes the statistical packages used in this book. Note that only a few statistical packages offer an extensive selection of multivariate analyses. On the other hand, almost all statistical packages and even some of the spreadsheet programs include multiple regression as an option, if that is your only interest.

The next topic discussed is data entry (section 3.4). Here the methods used depend on the size of the data set. For a small data set there are a variety of options since cost and efficiency are not important factors. Also, in that case the data can be easily screened for errors simply by visual inspection. But for large data sets, careful planning of data entry is suggested since costs are an important consideration along with getting an error-free data set available for analysis. Here we summarize the data input options available in the statistical packages used in this book and discuss the useful options.

Section 3.5 covers statistical data management techniques used to get a data set in shape for analysis. The operations used and the software available in the various packages are described. Topics such as combining data sets and selecting data subsets, missing values, outliers, transformations and saving results are discussed. Finally, in section 3.6 we introduce a multivariate data set that will be widely used in this book and summarize the data in a codebook.
We want to stress that the topics discussed in this chapter can be time consuming and frustrating to perform when large data sets are involved. Often the amount of time used for data entry, editing and screening can far exceed that used on statistical analysis. It is very helpful to either have computer expertise yourself or have someone that you can get advice from occasionally.

3.2 CHOICE OF COMPUTER FOR STATISTICAL ANALYSIS

In this book we are mainly concerned with analyzing data on several variables simultaneously, using computers. We describe how the computer is used for this purpose and the steps taken prior to performing multivariate statistical analyses. As an introduction, we first discuss how to decide what type of computer to use for the three main operations: data entry, data management and statistical analysis.

At opposite extremes, one can use either a personal computer on a desk or a mainframe computer from a terminal for all three operations. By mainframe computer, we mean a large computer typically in a computer center serving an institution or corporation and staffed by systems programmers, analysts and consultants. A mainframe computer generally has large memory and many data storage devices, and provides services to many users at one time. By personal computer (PC) we mean a machine of desk or laptop size designed to be used by one person at a time and generally for one task at a time. (Personal computers are sometimes referred to as microcomputers.)

Between these extremes there are several alternatives. The desktop computer may be linked to other desktop computers in a local area network to share data and programs, and peripherals such as printers and file servers. The personal computers or terminals might be high-capability machines commonly known as technical workstations, which provide faster computation, more memory, and high-resolution graphics. Any of these personal computers, terminals or workstations might be linked to larger minicomputers, often known as departmental computers, which might in themselves be linked to a computer-center mainframe.

Many investigators find it convenient to perform data entry and data management on personal computers. Using a personal computer makes it possible to place the machine close to the data source. Laptop computers can be used in field work and personal computers are often used to capture data directly from laboratory equipment. In telephone surveys, interviewers can enter the data directly into the PC without leaving the office. Further discussion on different methods of data entry is given in section 3.4.

For data management and preliminary data screening, some of the same arguments apply. These activities are often done together with or soon after
data entry so that the same PC can be used. The availability of numerous data entry, spreadsheet and statistical programs that can be used for data management also encourages the use of the PC. A discussion of data management and preliminary data screening is given in section 3.5.

For multivariate statistical analysis, either the mainframe or the personal computer can be used. One restriction is that performing elaborate analyses on huge data sets using a personal computer requires a fast processor chip and sufficient memory and hard disk space. If an Intel 386 or earlier processor is used, then a math coprocessor is necessary.

The advantages of using a personal computer include the ability to do things your own way without being dependent on the structure and software choice of a central computer facility, the multiplicity of programs available to perform statistical analyses on the PC, and the ease of going from data entry to data management to data screening to statistical analysis to report writing and presentation all on the same PC with a printer close at hand. The newer operating systems, such as Windows and OS/2, also make the transfer of information from one package to another simpler.

One advantage of using the mainframe is that much of the work is done for you. Somebody else enters the statistical package into the computer, makes sure the computer is running, provides back-up copies of your data and provides consultation. Storage space is available for very large data sets. Shared printer facilities often have superior features. The central facility does the purchasing of the software and arranges for obtaining updates. An intermediate operation that is available at many workplaces is a medium-size computer with workstations linked to it. This provides many of the advantages of both mainframes and PCs.

Often the final decision rests on cost considerations that depend on the local circumstance. For example, statistical software is still more expensive than typical spreadsheet or wordprocessing programs. Some users who have to supply their own statistical software find it less expensive to perform some of the operations on a spreadsheet or database program and buy a 'smaller' statistical package. The computer and software used may depend on what is made available by one's employer. In that case, a 'larger' statistical package makes good sense since its cost can be split among numerous users.

3.3 CHOICE OF A STATISTICAL PACKAGE

Whether the investigator decides to use a PC, workstation or mainframe, there is a wide choice of statistical packages available. Many of these packages, however, are quite specialized and do not include many of the multivariate analyses given in this book. For example, there are several statistical packages that provide excellent graphics or detailed results for nonparametric analyses but are more useful for other types of work. In choosing a package for
multivariate analysis, we recommend that you consider the statistical analyses listed in Table 5.2 and check whether the package includes them. Another feature that distinguishes among the statistical packages is whether they were originally written for mainframe computers or for the PC. Packages written for mainframe computers tend to be more general and comprehensive. They also take more computer memory (or hard disk) to store the program and are often more expensive. In this book we use three programs originally written for the mainframe and three originally written for the PC. In some cases the statistical package is sold as a single unit and in others you purchase a basic package, but you have a choice of additional programs so you can buy what you need.
Ease of use

Some packages are easier to use than others, although many of us find this difficult to judge; we like what we are familiar with. In general the packages that are simplest to use have two characteristics. First, they have fewer options to choose from and these options are provided automatically by the program with little need for 'programming' by the user. Second, they use the 'point and click' method of choosing what is done rather than writing out statements. This latter method is even simpler to learn if the package uses a common set of commands that are similar to ones found in word processing, spreadsheet or database management packages. Currently, statistical packages that are 'true' Microsoft Windows or Windows NT packages (not those that simply run under Windows) have this characteristic. A similar statement holds for OS/2 and for X-Windows.

On the other hand, programs with extensive options have obvious advantages. Also, the use of written statements (or commands) allows you to have a written record of what you have done. This record can be particularly useful in large-scale data analyses that extend over a considerable period of time and involve numerous investigators. Many current point-and-click programs do not leave the user with any audit trail of what choices have been made.
Packages used in this book

In this book, we make specific reference to three general-purpose statistical packages that were originally written for the mainframe and three packages that were originally written for the PC. For the mainframe packages, we use the generic names of the packages without distinguishing between the mainframe and PC versions. In Table 3.1, the packages are listed in alphabetical order. The three mainframe packages are BMDP, SAS and SPSS, and the PC packages are STATA, STATISTICA and SYSTAT. A summary of the versions we used is given in Table 3.1. We also note whether these versions run on PCs, medium-size computers with workstations, or on mainframe computers. For packages used on a PC, we note which operating system can be used with the listed version and how much RAM memory and hard disk memory are required.
Table 3.1 Statistical packages and their platform requirements

Versions (as of 3/1994)
  BMDP: Regular (7.0), Dynamic (7.0), New System (NS); SAS: 6.08; SPSS: 5.0, 6.0; STATA: 4.0, Regular (R) and Intercooled (S); STATISTICA: 3.1, 4.3; SYSTAT: 5.03
Medium-sized computers and workstations
  BMDP: Yes; SAS: Yes; SPSS: Yes; STATA: Yes; STATISTICA: No; SYSTAT: No
Mainframes
  BMDP: Yes; SAS: Yes; SPSS: Yes; STATA: No; STATISTICA: No; SYSTAT: No
PC operating systems
  BMDP: DOS (7.0), Windows (NS); SAS: OS/2, Windows; SPSS: DOS (5.0), Windows (6.0); STATA: DOS and OS/2 (R and S), Windows; STATISTICA: DOS (3.1), Windows (4.3); SYSTAT: DOS, Windows
Minimum hard disk for PC
  BMDP: 8-20MB (7.0), 3MB (NS); SAS: 54MB plus 61MB optional; SPSS: 20-40MB; STATA: 2MB; STATISTICA: 15MB (3.1), 13MB (4.3); SYSTAT: 8MB
Minimum RAM for PC
  BMDP: 4MB; SAS: 8MB; SPSS: 8MB; STATA: 512KB (R), 4MB (S); STATISTICA: 640KB (3.1), 4MB (4.3); SYSTAT: 4MB
In general, if a program uses Windows or OS/2, it will run faster with 8MB of RAM memory, although several of the packages listed in Table 3.1 can get by with 4MB. Note also that programs that use Windows may be quite different in their use of the Windows operating system. They range from some that simply start in Windows and then switch to a command-driven mode to others that have fully integrated the Windows point-and-click or drag-and-drop features into the package.

Among the mainframe packages, SPSS and BMDP adopt the philosophy of offering a number of comprehensive programs, each with its own options and variants, for performing portions of the analyses. The SAS package, on the other hand, offers a large number of limited-purpose procedures, some with a number of options. The SAS philosophy is that the user should string together a sequence of procedures to perform the desired analysis. The wide range of procedures offered contributes to the need for a large memory requirement for SAS. SAS also offers a lot of versatility to the user.

The three PC packages tend to require less hard disk space. The least space is required by the Regular version of STATA and the most by STATISTICA, with SYSTAT in the middle. As a general rule, the more hard disk memory is required, the more features are available and the higher the price. STATA is similar to SAS in that an analysis consists of a sequence of commands, some with their own options. STATISTICA and SYSTAT, like BMDP and SPSS, offer a number of self-contained analyses with options. Each of the computer packages will be featured in one of the chapters.

BMDP also has a Windows version called New System which presently has regression analysis but not the other multivariate procedures. BMDP New System and STATISTICA make the most sophisticated use of the Windows environment, with many common and procedure-specific options available to the user. SYSTAT and STATA adopt a philosophy that combines the command-driven and Windows environments. As of this writing, it is also possible to obtain a version of SAS that is in the above-described hybrid stage.

User manuals
Each of the six computer packages has several manuals. In the following we list the manuals used in this book. (The editions listed also refer to the versions of the program used.)
BMDP
BMDP/Dynamic User's Guide, Release 7, 1992 (includes data entry)
BMDP Statistical Software Manual: The Data Manager, Release 7, Volume 3, 1992
BMDP Statistical Software Manual, Release 7, Volumes 1 and 2, 1992 (general statistical programs)
BMDP User's Digest, 1992 (quick reference for BMDP programs)
BMDP New System for Windows User's Guide, 1994

SAS
SAS/FSP Users Guide, Release 6.03, 1990 (includes data entry)
SAS Procedures Guide, Version 6, 1990 (describes procedures used in data management, graphics, reporting and descriptive statistics)
SAS Language Reference, Version 6, 1990 (describes DATA and PROC steps and other options)
SAS/STAT User's Guide, Version 6, Volumes 1 and 2, 1990 (general statistical procedures)

SPSS
SPSS Data Entry II, 1988
SPSS for Windows Base System User's Guide, Release 6.0, 1993 (includes introduction, data management, descriptive statistics)
SPSS for Windows Professional Statistics, Release 6.0, 1993
SPSS for Windows Advanced Statistics, Release 6.0, 1993

STATA
Getting Started with STATA for Windows, 1995
Reference Manual, Release 4.0, Volume 1, 1995 (includes introduction, graphics and tutorial)
Reference Manual, Release 4.0, Volume 2, 1995 (includes data management, statistics and graphics)
Reference Manual, Release 4.0, Volume 3, 1995 (includes statistics, graphics and matrix commands)

SYSTAT
SYSTAT: Getting Started, Version 5, 1992 (includes introduction and overview)
SYSTAT: Data, Version 5, 1992 (includes entering data and data management)
SYSTAT: Graphics, Version 5, 1992 (includes all graphical plots)
SYSTAT: Statistics, Version 5, 1992 (includes all statistical analyses)

STATISTICA
STATISTICA: Quick Reference, 1994 (includes a summary introduction)
STATISTICA, Volume 1: Conventions and Statistics 1, 1994 (includes data management, basic statistics, regression and ANOVA)
STATISTICA, Volume 2: Graphics, 1994 (includes all graphics)
STATISTICA, Volume 3: Statistics 2, 1994 (includes multivariate analyses and megafile manager)
When you are learning to use a package for the first time, there is no substitute for reading the manuals. The examples are often particularly helpful. However, at times the sheer number of options presented in these programs may seem confusing, and advice from an experienced user may save you time. Many programs offer what are called default options, and it often helps to use these when you run a program for the first time. In this book, we frequently recommend which options to use. On-line HELP is also available in some of the programs. This is especially useful when it is programmed to offer information needed for the part of the program you are currently using (context sensitive).

There are numerous statistical packages that are not included in this book. We have tried to choose those that offered a wide range of multivariate techniques. This is a volatile area with new packages being offered by numerous software firms. For information on other packages, you can refer to the statistical computing software review sections of The American Statistician, PC Magazine or journals in your own field of interest.
3.4 TECHNIQUES FOR DATA ENTRY

Appropriate techniques for entering data for analysis depend mainly on the size of the data set and the form in which the data set is stored. As discussed below, all statistical packages use data in a spreadsheet (or rectangular) format. Each column represents a specific variable and each row has the data record for a case or observation. The variables are in the same order for each case. For example, for the depression data set given later in this chapter, looking only at the first three variables and four cases, we have

ID   Sex   Age
 1    2     68
 2    1     58
 3    2     45
 4    2     50

where for the variable 'sex', 1 = male and 2 = female, and 'age' is given in years. Normally each row represents an individual case. What is needed in each row depends on the unit of analysis for the study. By unit of analysis, we mean what is being studied in the analysis. If the individual is the unit of analysis, as it usually is, then the data set just given is in a form suitable for analysis.

Another situation is where the individuals belong to one household, and the unit of analysis is the household. Alternatively, for a company, the unit of analysis may be a sales district and sales made by different salespersons in each district are recorded. Such data sets are called hierarchical data sets and their form can get to be quite complex. Some statistical packages have
limited capacity to handle hierarchical data sets. In other cases, the investigator may have to use a relational data base package to first get the data set into the rectangular or spreadsheet form used in the statistical package.

Either one or two steps are involved in data entry. The first one is actually entering the data into the computer. If the data are not entered directly into the statistical package being used, a second step of transferring the data to the desired statistical package must be performed.
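As an aside on the household example above, one simple way to flatten such a hierarchical file is to aggregate the person-level records to the household level before analysis. The following is only a minimal SAS sketch under assumed names: the data set persons, its household identifier hhid and its income variable are all hypothetical.

proc means data=persons noprint nway;
  class hhid;                                        * one output row per household;
  var income;
  output out=households n=n_members sum=hh_income;   * household size and total income;
run;

The resulting households data set is rectangular, with the household as the unit of analysis.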
Data entry

Before entering the actual data in most statistical, spreadsheet or database management packages, the investigator first names the file where the data are stored, states how many variables will be entered, names the variables and provides information on these variables. Note that in the example just given we listed three variables which were named for easy use later. The file could be called 'depress'. Statistical packages commonly allow the user to designate the format and type of variable, e.g. numeric or alphabetic, calendar date, categorical, or Stevens's classification. They allow you to specify missing value codes, the length of each variable and the placement of the decimal points. Each program has slightly different features so it is critical to read the appropriate manual or HELP statements, particularly if a large data set is being entered.

The two commonly used formats for data entry are the spreadsheet and the form. By spreadsheet, we mean the format given previously where the columns are the variables and the rows the cases. This method of entry allows the user to see the input from previous records, which often gives useful clues if an error in entry is made. The spreadsheet method is very commonly used, particularly when all the variables can be seen on the screen without scrolling. Many persons doing data entry are familiar with this method due to their experience with spreadsheet programs, so they prefer it.

With the form method, only one record, the one being currently entered, is on view on the screen. There are several reasons for using the form method. An entry form can be made to look like the original data collection form so that the data entry person sees data in the same place on the screen as it is in the collection form. A large number of variables for each case can be seen on a computer monitor screen and they can be arranged in a two-dimensional array instead of just the one-dimensional array available for each case in the spreadsheet format. Flipping pages (screens) in a display may be simpler than scrolling left or right for data entry. Short coding comments can be included on the screen to assist in data entry. Also, if the data set includes alphabetical information such as short answers to open-ended questions, then the form method is preferred.

The choice between these two formats is largely a matter of personal
preference, but in general the spreadsheet is used for data sets with a small or medium number of variables and the form is used for a larger number.

To make the discussion more concrete, we present the features given in a specific data entry package. The SPSS data entry program provides a good mix of features that are useful in entering large data sets. It allows either spreadsheet or form entry and switching back and forth between the two modes. In addition to the features already mentioned, SPSS provides what is called 'skip and fill'. In medical studies and surveys, it is common that if the answer to a certain question is no, a series of additional questions can then be skipped. For example, subjects might be asked if they ever smoked, and if the answer is yes they are asked a series of questions on smoking history. But if they answer no, these questions are not asked and the interviewer skips to the next section of the interview. The skip-and-fill option allows the investigator to specify that if a person answers no, the smoking history questions are automatically filled in with specified values and the entry cursor moves to the start of the next section. This saves a lot of entry time and possible errors.

Another feature available in many packages is range checking. Here the investigator can enter upper and lower values for each variable. If the data entry person enters a value that is either lower than the low value or higher than the high value, the data entry program provides a warning. For example, for the variable 'sex', if an investigator specifies 1 and 2 as possible values and the data entry person hits a 3 by mistake, the program issues a warning. This feature, along with input by forms or spreadsheet, is available also in SAS and BMDP. Each program has its own set of features and the reader is encouraged to examine them before entering medium or large data sets, to take advantage of them.

Mechanisms of entering data
Data can be entered for statistical computation from different sources. We will discuss four of them.
1. entering the data along with the program or procedure statements for a batch-process run;
2. using the data entry features of the statistical package you intend to use;
3. entering the data from an outside file which is constructed without the use of the statistical package;
4. importing the data from another package using Windows or Windows NT.
The first method can only be used with a limited number of programs which use program or procedure statements, for example BMDP or SAS. It
is only recommended for very small data sets. For example, an SAS data set called 'depress' could be made on a personal computer by stating:

data depress;
input id sex age;
cards;
1 2 68
2 1 58
3 2 45
4 2 50
;
run;
Similar types of statements can be used for the other programs which use the spreadsheet type of format. The disadvantage of this type of data entry is that there are only limited editing features available to the person entering the data. No checks are made as to whether or not the data are within reasonable ranges for this data set. For example, all respondents were supposed to be 18 years old or older, but there is no automatic check to verify that the age of the third person, who was 45 years old, was not erroneously entered as 15 years. Another disadvantage is that the data set disappears after the program is run unless additional statements are made (the actual statements depend on the type of computer used, mainframe or PC). In small data sets, the ability to save the data set, edit typing and have range checks performed is not as important as in larger data sets.

The second strategy is to use the data entry package or system provided by the statistical program you wish to use. This is always a safe choice as it means that the data set is in the form required by the program and no data transfer problems will arise. Table 3.2 summarizes the built-in data entry features of the six statistical packages used in this book. Note that for SAS, Proc COMPARE can be used to verify the data after they are entered. In general, as can be seen in Table 3.2, SPSS, SAS and BMDP have more data entry features than STATA, STATISTICA or SYSTAT.

The third method is to use another statistical program, data entry package, wordprocessor, spreadsheet or data management program to enter the data into a spreadsheet format. The advantage of this method is that an available program that you are familiar with can be used to enter the data. But then the data have to be transferred into the desired statistical program. This method is recommended for STATA, STATISTICA or SYSTAT when using large data sets.
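Two of the points above can be illustrated with short SAS sketches; all names in them are hypothetical. First, since a data set created with cards disappears at the end of the run, one common way to keep it is to copy it to a permanent library:

libname mylib 'c:\mydata';      * hypothetical directory for permanent SAS data sets;

data mylib.depress;
  set depress;                  * copy the temporary data set to the permanent library;
run;

Second, if the data have been keyed twice into two files as a check on entry errors, Proc COMPARE will report any disagreements between them (here assuming two hypothetical data sets entry1 and entry2, both containing the case identifier id):

proc sort data=entry1; by id; run;
proc sort data=entry2; by id; run;

proc compare base=entry1 compare=entry2;
  id id;                        * match the two files on the case identifier;
run;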
Several techniques for performing this transfer are available. Many statistical packages provide transfer capabilities for particular spreadsheet or other programs (Table 3.3). So one suggestion is to first check the manual or HELP for the statistical package you wish to use to see which types of data files it can import.
Table 3.2 Built-in data entry features of the statistical packages

                    BMDP     SAS        SPSS     STATA    STATISTICA    SYSTAT
Spreadsheet entry   Yes      Yes        Yes      Yes      Yes           Yes
Form entry          Yes      Yes        Yes      No       No            No
Range check         Yes      Yes        Yes      No       No            No
Logical check       Yes      Yes        Yes      No       No            No
Skip and fill       No       Use SCL    Yes      No       No            No
Verify mode         Yes      No         Yes      No       No            No
A widely used transfer method is to create an ASCII file from the data set. ASCII (American Standard Code for Information Interchange) files can be created by almost any spreadsheet, data management or wordprocessing program. These files are sometimes referred to as DOS files on IBM-compatible PCs. Instructions for reading ASCII files are given in the statistical packages. For mainframe computers, EBCDIC (Extended Binary-Coded Decimal Information Code) may be used. The disadvantage of transferring ASCII files is that typically only the data are transferred, and variable names and information concerning the variables have to be reentered into the statistical package. This is a minor problem if there are not too many variables. If this process appears to be difficult, or if the investigator wishes to retain the variable names, then they can run a special-purpose program such as DBMS/COPY or DATA JUNCTION that will copy datafiles created by a wide range of programs and put them into the right format for access by any of a wide range of statistical packages. Further information on transferring data between programs is given in Poor (1990).

Finally, if the data entry program and the statistical package both use the Windows or Windows NT operating system, then three methods of transferring data may be considered depending on what is implemented in the programs. First, the data in the data entry program may be highlighted and then placed on a clipboard and moved to the statistical package program. Second, dynamic data exchange (DDE) can be used to transfer data. Here the data set in the statistical package is dynamically linked to the data set in the entry program. If you correct a variable for a particular case in the entry program, the identical change is made in the data set in the statistical package. Third, object linking and embedding (OLE) can be used to share data between a program used for data entry and statistical analysis. Here also the data entry program can be used to edit the data in the statistical program. The investigator can activate the data entry program from within the statistical package program.
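Returning to the ASCII route described above, a minimal SAS sketch for reading such a raw file might look like the following; the file name and variable list are hypothetical, and the variable information has to be retyped because only the data values travel with an ASCII file:

data survey;
  infile 'survey.dat';         * hypothetical ASCII file exported from another program;
  input id sex age income;    * variable names must be supplied again;
run;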
Table 3.3 Data management features of the statistical packages

Merging data sets
  BMDP: JOIN; SAS: MERGE statement; SPSS: JOIN with MATCH command; STATA: MERGE; STATISTICA: Yes; SYSTAT: MERGE with USE/SAVE commands
Adding data sets
  BMDP: MERGE; SAS: Proc APPEND or set statement; SPSS: JOIN with ADD command; STATA: APPEND; STATISTICA: Yes; SYSTAT: APPEND
Hierarchical data sets
  BMDP: PACK and UNPACK, EXTRACT; SAS: Write multiple output statements: RETAIN; SPSS: No; STATA: RESHAPE; STATISTICA: MegaFile Manager; SYSTAT: Use OUTPUT
Importing data (types)
  BMDP: ASCII, spreadsheets, databases, statistical packages; SAS: ASCII, ACCESS: spreadsheets, databases; SPSS: ASCII, spreadsheets, databases; STATA: ASCII, spreadsheets, databases; STATISTICA: ASCII, spreadsheets, databases; SYSTAT: ASCII, spreadsheets, databases
Exporting data (types)
  BMDP: ASCII, statistical packages; SAS: ASCII, ACCESS: spreadsheets, databases; SPSS: ASCII, spreadsheets, databases; STATA: ASCII, spreadsheets, databases; STATISTICA: ASCII, spreadsheets, databases; SYSTAT: ASCII, spreadsheets, databases
Calendar dates
  BMDP: Yes; SAS: Yes; SPSS: Yes; STATA: Yes; STATISTICA: Yes; SYSTAT: Yes
Transposes data
  BMDP: PACK; SAS: Proc TRANSPOSE; SPSS: FLIP; STATA: xpose; STATISTICA: Edit menu Transpose; SYSTAT: TRANSPOSE
Range limit checks
  BMDP: Yes; SAS: Yes; SPSS: Yes; STATA: Yes; STATISTICA: Yes; SYSTAT: No
Missing value imputation
  BMDP: Yes, several methods; SAS: User program; SPSS: Mean substitution; STATA: Regression substitution; STATISTICA: Mean substitution; SYSTAT: No
If you have a very large data set to enter, it is often sensible to use the services of a professional data entering service. A good service can be very fast and can offer different levels of data checking and advice on which data entry method to use. But whether or not a professional service is used, the following suggestions may be helpful for data entry.

1. Where possible, code information in numbers not letters.
2. Code information in the most detailed form you will ever need. You can use the statistical program to aggregate the data into coarser groupings later. For example, it is better to record age as the exact age at the last birthday rather than to record the ten-year age interval into which it falls.
3. The use of range checks or maximum and minimum values can eliminate the entry of extreme values but they do not guard against an entry error that falls within the range. If minimizing errors is crucial then the data can be entered twice into separate data files. One data file can be subtracted from the other and the resulting nonzeros examined. Alternatively, some data entry programs have a verify mode where the user is warned if the first entry does not agree with the second one (SPSS or BMDP).
4. If the data are stored on a personal computer, then backup copies should be made on disks. Backups should be updated regularly as changes are made in the data set. Particularly when using Windows programs, if dynamic linking is possible between analysis output and the dataset, it is critical to keep an unaltered data set.
5. For each variable, use a code to indicate missing values. The various programs each have their own way to indicate missing values. The manuals or HELP statements should be consulted so that you can match what they require with what you do.

To summarize, there are three important considerations in data entry: accuracy, cost and ease of use of the data file. Whichever system is used, the investigator should ensure that the data file is free of typing errors, that time and money are not wasted and that the data file is readily available for future data management and statistical analysis.

3.5 DATA MANAGEMENT FOR STATISTICS

Prior to statistical analysis, it is often necessary to make some changes in the data set. This manipulation of the data set is called data management. The term data management is also used in other fields to describe a somewhat different set of activities that are infrequently used in statistical data management. For example, large companies often have very large sets of data that are stored in a complex system of files and can be accessed in a
sequential fashion by different types of employees. This type of data management is not a common problem in statistical data management. Here we will limit the discussion to the more commonly used options in statistical data management programs. The manuals describing the programs should be read for further details. Table 3.3 summarizes common data management options in the six programs described in this book.

Combining data sets

Combining data sets is an operation that is commonly performed. For example, in biomedical studies, data may be taken from medical history forms, a questionnaire and laboratory results for each patient. These data for a group of patients need to be combined into a single rectangular data set where the rows are the different patients and the columns are the combined history, questionnaire and laboratory variables. In longitudinal studies of voting intentions, the questionnaire results for each respondent must be combined across time periods in order to analyze change in voting intentions of an individual over time.

There are essentially two steps in this operation. The first is sorting on some variable (called a key variable in SPSS, STATISTICA, SYSTAT and BMDP, a BY variable or common variable in SAS, and an identifying variable in STATA) which must be included in the separate data sets to be merged. Usually this key variable is an identification or ID variable (case number). The second step is combining the separate data sets side-by-side, matching the correct records with the correct person using the key variable.

Sometimes one or more of the data sets are missing for an individual. For example, in a longitudinal study it may not be possible to locate a respondent for one or more of the interviews. In such a case, a symbol or symbols indicating missing values will be inserted into the spaces for the missing data items by the program. This is done so you will end up with a rectangular data set or file, in which information for an individual is put into the proper row, and missing data are so identified.

Data sets can be combined in the manner described above in SAS by using the MERGE statement followed by a BY statement and the variable(s) to be used to match the records. (The data must first be sorted by the values of the matching variable, say ID.) An UPDATE statement can also be used to add variables to a master file. In SPSS, you simply use the JOIN MATCH command followed by the data files to be merged if you are certain that the cases are already listed in precisely the same order and each case is present in all the data files. Otherwise, you first sort the separate data files on the key variable and use the JOIN MATCH command followed by the BY key variable. In BMDP, you first sort on the key variable then use the JOIN instructions to obtain a single matched data file. In STATA, you USE the
first data file and then use a MERGE key variable USING the second data file statement. STATISTICA and SYSTAT also have a MERGE function that can be performed either with or without a key variable. In any case, it is highly desirable to list the data set to determine that the merging was done in the manner that you intended. If the data set is large, then only the first and last 25 or so cases need to be listed to see that the results are correct. If the separate data sets are expected to have missing values, you need to list sufficient cases so you can see that missing records are correctly handled.

Another common way of combining data sets is to put one data set at the end of another data set or to interleave the cases together based on some key variable. For example, an investigator may have data sets that are collected at different places and then combined together. In an education study, student records could be combined from two high schools, with one simply placed at the bottom of the other set. This is done by using the Proc APPEND in SAS. The MERGE paragraph in BMDP can be used to either combine end to end or to interleave data files. In SPSS the JOIN command with the keyword ADD can be used to combine cases from two to five data files or, with specification of a key variable, to interleave. In STATA and SYSTAT, the APPEND command is used. In STATISTICA, the MERGE procedure is used.

It is also possible to update the data files with later information using the editing functions of the package. Thus a single data file can be obtained that contains the latest information, if this is desired for analysis. This option can also be used to replace data that were originally entered incorrectly.

When using a statistical package that does not have provision for merging data sets, it is recommended that a spreadsheet program be used to perform the merging and then, after a rectangular data file is obtained, the resulting data file can be transferred to the desired statistical package. In general, the newer spreadsheet programs have excellent facilities for combining data sets side-by-side or for adding new cases.
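As a concrete illustration of the SAS sort-and-merge approach described above, a minimal sketch might look like the following, where the data sets history and labs are hypothetical and each contains the key variable id:

proc sort data=history; by id; run;
proc sort data=labs; by id; run;

data combined;
  merge history labs;           * match records side-by-side;
  by id;                        * the key variable;
run;

Listing the first and last few cases of the combined file, as suggested above, is then a quick check that the matching was done as intended.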
Missing values

As will be discussed in later chapters, most multivariate analyses require complete data on all the variables used in the analysis. If even one variable has a missing value for a case, that case will not be used. Most statistical packages provide programs that indicate how many cases were used in computing common univariate statistics such as means and standard deviations (or say how many cases were missing). Thus it is simple to find which variables have few or numerous missing values.

Some programs can also indicate how many missing values there are for each case. Other programs allow you to transpose or flip your data file so
the rows become the columns and the columns become the rows (Table 3.3). Thus the cases and variables are switched as far as the statistical package is concerned. The number of missing values by case can be found by computing univariate statistics on the transposed data.

Once the pattern of missing data is determined, a decision must be made on how to obtain a complete data set for multivariate analysis. Most statisticians agree on the following guidelines.

1. If a variable is missing in a high proportion of cases, then that variable
should be deleted.
2. If a case is missing variables that you wish to analyze, then that case should be deleted.

Following these guidelines will give you a data set with a minimum of missing values. You may wish to check if there is anything special about the cases that have numerous missing data as this might give you insight on problems in data collection. Further discussion on other options available when there are missing data is given in section 9.2 or in Bourque and Clark (1992). These options include replacing missing values with estimated values.

Missing values can occur in two ways. First, the data may be missing from the start. In this case, the investigator enters a code for missing values at the time the data are entered into the computer. One option is to enter a symbol that the statistical package being used will automatically recognize as a missing value. For example, a period, an asterisk (*) or a blank space may be recognized as a missing value by some programs. Commonly, a numerical value is used that is outside the range of possible values. For example, for the variable 'sex' (with 1 = male and 2 = female) a missing code could be 9. A string of 9s is often used; thus, for the weight of a person 999 could be used as a missing code. Then that value is declared to be missing. For example, for SAS, one could state

if sex = 9 then sex = .;
Similar statements are used for the other programs. The reader should check the manual for the precise statement.

If the data have been entered into a spreadsheet program, then commonly blanks are used for missing values. In this case, the newer spreadsheet (and wordprocessor) programs have search-and-replace features that will allow you to replace all the blanks with the missing value symbol that your statistical package automatically recognizes. This replacement should be done before the data file is transferred to the statistical package.

The second way in which values can be considered missing is if the data values are beyond the range of the stated maximum or minimum values. For example, in BMDP, when the data set is printed, the variables are either
assigned their true numerical value or called MISSING, GT MAX (greater than maximum), or LT MIN (less than minimum). Cases that have missing or out-of-range values can be printed for examination. The pattern of missing values can be determined from the BMDP AM program. Further discussion of ways of handling missing values in multivariate data analysis is given in section 9.2 and in the reference and further reading sections at the end of this chapter (Anderson, Basilevsky and Hum, 1983; Ford, 1983; Kalton and Kasprzyk, 1986; Little and Rubin, 1987, 1990).
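The same bookkeeping can be sketched outside the six packages as well; the short Python fragment below (using the pandas and numpy libraries, with a hypothetical file name and the missing-value code 9 for sex as in the example above) declares the code to be missing and then counts missing values by variable and by case.

    import numpy as np
    import pandas as pd

    data = pd.read_csv('depress.csv')

    # Declare 9 to be the missing-value code for sex.
    data['sex'] = data['sex'].replace(9, np.nan)

    # Number of missing values for each variable.
    print(data.isna().sum())

    # Number of missing values for each case (no transposing needed here).
    print(data.isna().sum(axis=1))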
Detection of outliers
Outliers are observations that appear inconsistent with the remainder of the data set (Barnett and Lewis, 1994). One method for determining outliers has already been discussed, namely setting minimum and maximum values. By applying these limits, extreme or unreasonable outliers are prevented from entering the data set.

Often, observations are obtained that seem quite high or low but are not impossible. These values are the most difficult ones to cope with. Should they be removed or not? Statisticians differ in their opinions, from 'if in doubt, throw it out' to the point of view that it is unethical to remove an outlier for fear of biasing the results. The investigator may wish to eliminate these outliers from the analyses but report them along with the statistical analysis. Another possibility is to run the analyses twice, both with the outliers and without them, to see if they make an appreciable difference in the results. Most investigators would hesitate, for example, to report rejecting a null hypothesis if the removal of an outlier would result in the hypothesis not being rejected.

A review of formal tests for detection of outliers is given in Dunn and Clark (1987) and in Barnett and Lewis (1994). To make the formal tests you usually must assume normality of the data. Some of the formal tests are known to be quite sensitive to nonnormality and should only be used when you are convinced that this assumption is reasonable. Often an alpha level of 0.10 or 0.15 is used for testing if it is suspected that outliers are not extremely unusual. Smaller values of alpha can be used if outliers are thought to be rare.

The data can be examined one variable at a time by using histograms and box plots if the variable is measured on the interval or ratio scale. A questionable value would be one that is separated from the remaining observations. For nominal or ordinal data, the frequency of each outcome can be noted. If a recorded outcome is impossible, it can be declared missing. If a particular outcome occurs only once or twice, the investigator may wish to consolidate that outcome with a similar one. We will return to the subject
of outliers in connection with the statistical analyses starting in Chapter 6, but mainly the discussion in this book is not based on formal tests.

Transformations of the data

Transformations are commonly made either to create new variables with a form more suitable for analysis or to achieve an approximate normal distribution. Here we discuss the first possibility. Transformations to achieve approximate normality are discussed in Chapter 4.

Transformations to create new variables can either be performed as a step in data management or can be included later when the analyses are being performed. It is recommended that they be done as a part of data management. The advantage of this is that the new variables are created once and for all, and sets of instructions for running data analyses from then on do not have to include the data transformation statements. This results in shorter sets of instructions with less repetition and chance for errors when the data are being analyzed. This is almost essential if several investigators are analyzing the same data set.

One common use of transformations occurs in the analysis of questionnaire data. Often the results from several questions are combined to form a new variable. For example, in studying the effects of smoking on lung function it is common to ask first a question such as: Have you ever smoked cigarettes?
    yes ____        no ____
If the subjects answer no, they skip a set of questions and go on to another topic. If they answer yes, they are questioned further about the amount in terms of packs per day and length of time they smoked (in years). From this
information, a new pack-year variable is created that is the number of years times the average number of packs. For the person who has never smoked, the answer is zero. Transformation statements are used to create the new variable.

Each package offers a slightly different set of transformation statements, but some general options exist. The programs allow you to select cases that meet certain specifications using IF statements. Here, for instance, if the response is no to whether the person has ever smoked, the new variable pack-years should be set to zero. If the response is yes, then pack-years is computed by multiplying the average amount smoked by the length of time smoked. This sort of arithmetic operation is provided for, and the new variable is added to the end of the data set.

Additional options include taking means of a set of variables or the maximum value of a set of variables. Another common arithmetic transformation involves simply changing the numerical values coded for a nominal or ordinal
variable. For example, for the depression data set, sex was coded male = 1 and female = 2. In some of the analyses used in this book, we have recoded that to male = 0 and female = 1 by simply subtracting one from the given value.
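As an illustration of what such transformation statements might look like outside the six packages, the following Python sketch creates the pack-year variable and recodes sex; the smoking variable names (SMOKED, PACKSDAY, YEARSSMK) and the file name are hypothetical.

    import numpy as np
    import pandas as pd

    data = pd.read_csv('survey.csv')

    # Pack-years: zero for never-smokers, otherwise packs per day times years smoked.
    data['PACKYRS'] = np.where(data['SMOKED'] == 'no',
                               0,
                               data['PACKSDAY'] * data['YEARSSMK'])

    # Recode sex from 1 = male, 2 = female to 0 = male, 1 = female.
    data['SEX'] = data['SEX'] - 1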
Saving the results

After the data have been screened for missing values and outliers, and transformations made to form new variables, the results are saved in a master file that can be used for analysis. We recommend that a copy or copies of this master file be made on a disk or other storage device that can be stored outside the computer. A summary of the decisions made in data screening and of the transformations used should be stored with the master file. Enough information should be stored so that the investigator can later describe what steps were taken in data management.

If the data management was performed by typing in control language, it is recommended that a copy of this control language be stored along with the data sets. Then, should the need arise, the manipulation can be redone by simply editing the control language instructions rather than completely recreating them. If the data management was performed interactively (point and click), then it is recommended that multiple copies be saved along the way until you are perfectly satisfied with the results, and that the Windows notepad or some similar memo facility be used to document your steps.

3.6 DATA EXAMPLE: LOS ANGELES DEPRESSION STUDY

In this section we discuss a data set that will be used in several succeeding chapters to illustrate multivariate analyses. The depression study itself is described in Chapter 1. The data given here are from a subset of 294 observations randomly chosen from the original 1000. This subset of observations is large enough to provide a good illustration of the statistical techniques but small enough to be manageable. Only data from the first time period are included. Variables are chosen so that they would be easily understood and would be sensible to use in the multivariate statistical analyses described in Chapters 6-17. The code book, the variables used and the data set are described below.

Code book
In multivariate analysis, the investigator often works with a data set that has numerous variables, perhaps hundreds of them. An important step in making the data set understandable is to create a written code book that can be given to all the users. The code book should contain a description of
each variable and the variable name given to each variable for use in the statistical package. Most statistical packages have limits on the length of the variable names, so abbreviations are used. Often blank spaces are not allowed; dashes or underscores are included. Some statistical packages have certain words that may not be used as variable names. The variables should be listed in the same order as they are in the data file. The code book serves as a guide and record for all users of the data set and for future documentation of the results.

Table 3.4 contains a code book for the depression data set. In the first column the variable number is listed, since that is often the simplest way to refer to the variables in the computer. A variable name is given next, and this name is used in later data analysis. These names were chosen to be eight characters or less so that they could be used by all the package programs. It is helpful to choose variable names that are easy to remember and are descriptive of the variables, but short to reduce space in the display. Finally, a description of each variable is given in the last column of Table 3.4. For nominal or ordinal data, the numbers used to code each answer are listed. For interval or ratio data, the units used are included. Note that income is given in thousands of dollars per year for the household; thus an income of 15 would be $15 000 per year. Additional information that is sometimes given includes the number of cases that have missing values, how missing values are coded, the largest and smallest value for that variable, and simple descriptive statistics such as frequencies for each answer for nominal or ordinal data, and means and standard deviations for interval or ratio data. We note that one package (STATA) can produce a code book for its users that includes much of the information just described.
Depression variables

The 20 items used in the depression scale are variables 9-28 and are named C1, C2, ..., C20. (The wording of each item is given later in the text.) Each item was written on a card and the respondent was asked to tell the interviewer the number that best describes how often he or she felt or behaved this way during the past week. Thus respondents who answered item C2, 'I felt depressed', could respond 0-3, depending on whether this particular item applied to them rarely or none of the time (less than 1 day: 0), some or a little of the time (1-2 days: 1), occasionally or a moderate amount of the time (3-4 days: 2), or most or all of the time (5-7 days: 3).

Most of the items are worded in a negative fashion, but items C8-C11 are positively worded. For example, C8 is 'I felt that I was as good as other people'. For positively worded items the scores are reflected: that is, a score of 3 is changed to be 0, 2 is changed to 1, 1 is changed to 2, and 0 is changed to 3. In this way, when the total score of all 20 items is obtained by summation of variables C1-C20, a large score indicates a person who is depressed. This sum is the 29th variable, named CESD.

Persons whose CESD score is greater than or equal to 16 are classified as depressed, since this value is the common cutoff point used in the literature (Frerichs, Aneshensel and Clark, 1981). These persons are given a score of 1 in variable 30, the CASES variable. The particular depression scale employed here was developed for use in community surveys of noninstitutionalized respondents (Comstock and Helsing, 1976; Radloff, 1977).
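A sketch of this scoring in Python (for illustration only; it assumes the raw, unreflected item responses are stored in columns C1-C20 of a hypothetical file) is given below.

    import pandas as pd

    data = pd.read_csv('depress_raw.csv')
    items = ['C' + str(i) for i in range(1, 21)]

    # Reflect the positively worded items C8-C11: 3 becomes 0, 2 becomes 1, and so on.
    for item in ['C8', 'C9', 'C10', 'C11']:
        data[item] = 3 - data[item]

    # CESD is the sum of the 20 items; CASES flags scores of 16 or more as depressed.
    data['CESD'] = data[items].sum(axis=1)
    data['CASES'] = (data['CESD'] >= 16).astype(int)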
Table 3.4 Code book for depression data

Variable number   Variable name   Description
1                 ID              Identification number from 1 to 294
2                 SEX             1 = male; 2 = female
3                 AGE             Age in years at last birthday
4                 MARITAL         1 = never married; 2 = married; 3 = divorced; 4 = separated; 5 = widowed
5                 EDUCAT          1 = less than high school; 2 = some high school; 3 = finished high school; 4 = some college; 5 = finished bachelor's degree; 6 = finished master's degree; 7 = finished doctorate
6                 EMPLOY          1 = full time; 2 = part time; 3 = unemployed; 4 = retired; 5 = houseperson; 6 = in school; 7 = other
7                 INCOME          Thousands of dollars per year
8                 RELIG           1 = Protestant; 2 = Catholic; 3 = Jewish; 4 = none; 5 = other
9-28              C1-C20          'Please look at this card and tell me the number that best describes how often you felt or behaved this way during the past week.' 20 items from depression scale (already reflected; see text). 0 = rarely or none of the time (less than 1 day); 1 = some or a little of the time (1-2 days); 2 = occasionally or a moderate amount of the time (3-4 days); 3 = most or all of the time (5-7 days)
29                CESD            Sum of C1-C20; 0 = lowest level possible; 60 = highest level possible
30                CASES           0 = normal; 1 = depressed, where depressed is CESD >= 16
31                DRINK           Regular drinker? 1 = yes; 2 = no
32                HEALTH          General health? 1 = excellent; 2 = good; 3 = fair; 4 = poor
33                REGDOC          Have a regular doctor? 1 = yes; 2 = no
34                TREAT           Has a doctor prescribed or recommended that you take medicine, medical treatments, or change your way of living in such areas as smoking, special diet, exercise, or drinking? 1 = yes; 2 = no
35                BEDDAYS         Spent entire day(s) in bed in last two months? 0 = no; 1 = yes
36                ACUTEILL        Any acute illness in last two months? 0 = no; 1 = yes
37                CHRONILL        Any chronic illness in last year? 0 = no; 1 = yes
Data set

As can be seen by examining the code book given in Table 3.4, demographic data (variables 2-8), depression data (variables 9-30) and general health data (variables 32-37) are included in this data set. Variable 31, DRINK, was included so that it would be possible to determine if an association exists between drinking and depression. Frerichs et al. (1981) have already noted a lack of association between smoking and scores on the depression scale.

The actual data for the first 49 of the 294 respondents are listed in Table 3.5. The rest of the data set, along with the lung function data set, the chemical companies data set and the survival data set, are available on disk. To obtain the disk, write to the publisher.

SUMMARY

In this chapter we discussed the steps necessary before statistical analysis can begin. The first of these is the decision of what computer and software packages to use. Once this decision is made, data entry and data management can be started. Figure 3.1 summarizes the steps taken in data entry and data management. Note that investigators often alter the order of these operations. For example, some prefer to check for missing data and outliers and to make transformations prior to combining the data sets. This is particularly true in analyzing longitudinal data, when the first data set may be available well before the others. This may also be an iterative process in that determining errors during data management may lead to entering new data to replace erroneous values.
Six statistical packages - BMDP, SAS, SPSS, STATA, SYSTAT and STATISTICA - were noted as the packages used in this book. In evaluating a package it is often helpful to examine the data entry and data management features it offers. The tasks performed in data entry and management are often much more difficult and time consuming than running the statistical analyses, so a package that is easy and intuitive to use for these operations is a real help. If the package available to you lacks needed features, then you may wish to perform these operations in one of the newer spreadsheet or relational database packages and then transfer the results to your statistical package.
Table 3.5 Depression data (values of the variables in Table 3.4 for the first 49 of the 294 respondents)
Figure 3.1 Preparing Data for Statistical Analysis. Data Entry (assigning attributes to data; entering data; screening out-of-range values) -> Data Management (combining data sets; finding patterns of missing data; detecting additional outliers; transformations of data; checking results) -> Saving working data set -> Creating a code book.
REFERENCES
Anderson, A.B., Basilevsky, A. and Hum, D.P.J. (1983). Missing data: A review of the literature, in Handbook of Survey Research (eds P.H. Rossi, J.D. Wright and A.B. Anderson). Academic Press, New York, pp. 415-94.
Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data, 3rd edn. Wiley, New York.
Bourque, L.B. and Clark, V.A. (1992). Processing Data: The Survey Example. Sage, Newbury Park, CA.
Comstock, G.W. and Helsing, K.J. (1976). Symptoms of depression in two communities. Psychological Medicine, 6, 551-63.
Dunn, O.J. and Clark, V.A. (1987). Applied Statistics: Analysis of Variance and Regression, 2nd edn. Wiley, New York.
Ford, B.L. (1983). An overview of hot-deck procedures, in Incomplete Data in Sample Surveys, Vol. 2, Theory and Bibliographies (eds W.G. Madow, I. Olkin and D.B. Rubin). Academic Press, New York, pp. 185-207.
Frerichs, R.R., Aneshensel, C.S. and Clark, V.A. (1981). Prevalence of depression in Los Angeles County. American Journal of Epidemiology, 113, 691-99.
Frerichs, R.R., Aneshensel, C.S., Clark, V.A. and Yokopenic, P. (1981). Smoking and depression: A community survey. American Journal of Public Health, 71, 637-40.
Kalton, G. and Kasprzyk, D. (1986). The treatment of missing survey data. Survey Methodology, 12, 1-16.
Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing Data. John Wiley, New York.
Little, R.J.A. and Rubin, D.B. (1990). The analysis of social science data with missing values, in Modern Methods of Data Analysis (eds J. Fox and J.S. Long). Sage, Newbury Park, CA, pp. 374-409.
Poor, A. (1990). The Data Exchange. Dow Jones-Irwin, Homewood, IL.
Radloff, L.S. (1977). The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1, 385-401.
FURTHER READING

Aneshensel, C.S., Frerichs, R.R. and Clark, V.A. (1981). Family roles and sex differences in depression. Journal of Health and Social Behavior, 22, 379-93.
Raskin, R. (1989). Statistical software for the PC: Testing for significance. PC Magazine, 8(5), 103-98.
Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA.
PROBLEMS

3.1 Enter the data set given in Table 8.1, Chemical companies' financial performance (section 8.3), using a data entry program of your choice. Make a code book for this data set.
3.2 Using the data set entered in Problem 3.1, delete the P/E variable for the Dow Chemical company and D/E for Stauffer Chemical and Nalco Chemical in a way appropriate for the statistical package you are using. Then, use the missing value features in your statistical package to find the missing values and replace them with an imputed value.
3.3 Using your statistical package, compute a scatter diagram of income versus employment status from the depression data set. From the data in this table, decide if there are any adults whose income is unusual considering their employment status. Are there any adults in the data set whom you think are unusual?
3.4 Transfer a data set from a spreadsheet program into your statistical package.
3.5 Describe the person in the depression data set who has the highest total CESD score.
3.6 Using a statistical program, compute histograms for mothers' and fathers' heights and weights from the lung function data set described in section 1.2 (Appendix A). Describe cases that you consider to be outliers.
3.7 From the lung function data set (Problem 3.6), determine how many families have one child, two children, and three children between the ages of 7 and 12.
3.8 From the lung function data set, produce a two-way table of gender of child 1 versus gender of child 2 (for families with at least two children). Comment.
3.9 For the depression data set, determine if any of the variables have observations that do not fall within the ranges given in Table 3.4, code book for depression data.
3.10 For the statistical package you intend to use, describe how you would add data from three more time periods for the same subjects to the depression data set.
3.11 For the lung function data set, create a new variable called AGEDIFF = (age of child 1) - (age of child 2) for families with at least two children. Produce a frequency count of this variable. Are there any negative values? Comment.
3.12 Combine the results from the following two questions into a single variable:
    a. Have you been sick during the last two weeks? Yes, go to b. ____  No ____
    b. How many days were you sick? ____
4 Data screening and data transformation

4.1 MAKING TRANSFORMATIONS AND ASSESSING NORMALITY AND INDEPENDENCE

In section 3.5 we discussed the use of transformations to create new variables. In this chapter we discuss transforming the data to obtain a distribution that is approximately normal. Section 4.2 shows how transformations change the shape of distributions. Section 4.3 discusses several methods for deciding when a transformation should be made and how to find a suitable transformation. An iterative scheme is proposed that helps to zero in on a good transformation. Statistical tests for normality are evaluated. Section 4.4 presents simple graphical methods for determining if the data are independent.

Each computer package offers the users information to help decide if their data are normally distributed. The packages provide convenient methods for transforming the data to achieve approximate normality. They also include output for checking the independence of the observations. Hence the assumption of independent, normally distributed data that appears in many statistical tests can be, at least approximately, assessed. Note that most investigators will try to discard the most obvious outliers prior to assessing normality because such outliers can grossly distort the distribution.
4.2 COMMON TRANSFORMATIONS

In the analysis of data it is often useful to transform certain variables before performing the analyses. Examples of such uses are found in the next section and in Chapter 6. In this section we present some of the most commonly used transformations. If you are familiar with this subject, you may wish to skip to the next section.

To develop a feel for transformations, let us examine a plot of transformed values versus the original values of the variable. To begin with, a plot of values of a variable X against itself produces a 45° diagonal line going through the origin, as shown in Figure 4.1(a). One of the most commonly performed transformations is taking the logarithm (log) to the base 10.
Figure 4.1 Plots of Selected Transformations of Variable X: (a) plot of X versus X; (b) plot of log10 X versus X; (c) plot of loge X versus X; (d) plot of the square root of X, or X^(1/2), versus X; (e) plot of 1/X, or X^(-1), versus X.
Recall that the logarithm Y of a number X satisfies the relationship X = 10^Y. That is, the logarithm of X is the power Y to which 10 must be raised in order to produce X. As shown in Figure 4.1(b), the logarithm of 10 is 1 since 10 = 10^1. Similarly, the logarithm of 1 is 0 since 1 = 10^0, and the logarithm of 100 is 2 since 100 = 10^2. Other values of logarithms can be obtained from tables of common logarithms or from a hand calculator with a log function. All statistical packages discussed in this book allow the user to make this transformation as well as others.

Note that an increase in X from 1 to 10 increases the logarithm from 0 to 1, that is, an increase of one unit. Similarly, an increase in X from 10 to 100 increases the logarithm also by one unit. For larger numbers it takes a greater increase in X to produce a small increase in log X. Thus the logarithmic transformation has the effect of stretching small values of X and condensing large values of X.
Figure 4.1 (Continued): (f) plot of e^X versus X.
Note also that the logarithm of any number less than 1 is negative, and the logarithm of a value of X that is less than or equal to 0 is not defined. In practice, if negative or zero values of X are possible, the investigator may first add an appropriate constant to each value of X, thus making them all positive prior to taking the logarithms. The choice of the additive constant can have an important effect on the statistical properties of the transformed variable, as will be seen in the next section. The value added must be larger than the magnitude of the minimum value of X.

Logarithms can be taken to any base. A familiar base is the number e = 2.7183.... Logarithms taken to the base e are called natural logarithms and are denoted by loge or ln. The natural logarithm of X is the power to which e must be raised to produce X. There is a simple relationship between the natural and common logarithms, namely

    loge X = 2.3026 log10 X
    log10 X = 0.4343 loge X
Figure 4.1(c) shows that a graph of loge X versus X has the same shape as log10 X, with the only difference being in the vertical scale; i.e. loge X is larger than log10 X for X > 1 and smaller than log10 X for X < 1. The natural logarithm is used frequently in theoretical studies because of certain appealing mathematical properties.

Another class of transformations is known as power transformations. For example, the transformation X^2 (X raised to the power of 2) is used frequently in statistical formulas such as computing the variance. The most commonly used power transformations are the square root (X raised to the power 1/2) and the inverse 1/X (X raised to the power -1). Figure 4.1(d) shows a plot of the square root of X versus X. Note that this function is also not defined for negative values of X. Compared with taking the logarithm, taking the square root also progressively condenses the values of X as X increases. However, the degree of condensation is not as severe as in the case of logarithms. That is, the square root of X tapers off slower than log X, as can be seen by comparing Figure 4.1(d) with Figure 4.1(b) or 4.1(c).

Unlike X^(1/2) and log X, the function 1/X decreases with increasing X, as shown in Figure 4.1(e). To obtain an increasing function, you may use -1/X. One way to characterize transformations is to note the power to which X is raised. We denote that power by p. For the square root transformation, p = 1/2, and for the inverse transformation, p = -1. The logarithmic transformation can be thought of as corresponding to a value of p = 0 (Tukey, 1977). We can think of these three transformations as ranging over p values of -1, 0 and 1/2. The effects of these transformations in reducing a long-tailed distribution to the right are greater as the value of p decreases from 1 (the value for no transformation) to 1/2 to 0 to -1. Smaller values of p result in a greater degree of transformation for p < 1. Thus, a p of -2 would result in a greater degree of transformation than a p of -1. The logarithmic transformation reduces a long tail more than a square root transformation (Figure 4.1(d) versus 4.1(b) or 4.1(c)) but not as much as the inverse transformation. The changes in the amount of transformation depending on the value of p provide the background for one method of choosing an appropriate transformation. In the next section, after a discussion of how to assess whether you have a normal distribution, we give several strategies to assist in choosing appropriate p values.

Finally, exponential functions are also sometimes used in data analysis. An exponential function of X may be thought of as the antilogarithm of X: for example, the antilogarithm of X to the base 10 is 10 raised to the power X; similarly, the antilogarithm of X to the base e is e raised to the power X. The exponential function e^X is illustrated in Figure 4.1(f). The function 10^X has the same shape but increases faster than e^X. Both have the opposite effect of taking logarithms; i.e. they increasingly stretch the larger values of X. Additional discussion of these interrelationships can be found in Tukey (1977).
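The stretching and condensing described above can also be seen numerically. The short Python sketch below (illustrative only) evaluates several of the transformations of Figure 4.1 at a few values of X.

    import numpy as np

    x = np.array([1.0, 10.0, 100.0, 1000.0])

    print(np.log10(x))   # 0, 1, 2, 3: each tenfold increase in X adds one unit
    print(np.log(x))     # natural logs are 2.3026 times the common logs
    print(np.sqrt(x))    # p = 1/2 condenses large X, but less severely than the log
    print(1.0 / x)       # p = -1 is decreasing, so -1/X is used when an
                         # increasing transformation is wanted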
4.3 ASSESSING THE NEED FOR AND SELECTING A TRANSFORMATION

In the theoretical development of statistical methods some assumptions are usually made regarding the distribution of the variables being analyzed. Often the form of the distribution is specified. The most commonly assumed distribution for continuous observations is the normal, or Gaussian, distribution. Although the assumption is sometimes not crucial to the validity of the results, some check of normality is advisable in the early stages of analysis. In this section, methods for assessing normality and for choosing a transformation to induce normality are presented.

Assessing normality using histograms

The left graph of Figure 4.2(a) illustrates the appearance of an ideal histogram, or density function, of normally distributed data. The values of the variable X are plotted on the horizontal axis. The range of X is partitioned into numerous intervals of equal length and the proportion of observations in each interval is plotted on the vertical axis. The mean is in the center of the distribution and is equal to zero in this hypothetical histogram. The distribution is symmetric about the mean; that is, intervals equidistant from the mean have equal proportions of observations (or the same height in the histogram). If you place a mirror vertically at the mean of a symmetric histogram, the right side should be a mirror image of the left side. A distribution may be symmetric and still not be normal, but often distributions that are symmetric are close to normal. If the population distribution is normal, the sample histogram should resemble this famous symmetric bell-shaped Gaussian curve. For small sample sizes, the sample histogram may be irregular in appearance and the assessment of normality difficult.

All six statistical packages described in Chapter 3 plot histograms for specific variables in the data set. The best fitting normal density function can also be superimposed on the histogram. Such graphs, produced by some packages, can enable you to make a crude judgment of how well the normal distribution approximates the histogram.

Assessing symmetry using box plots

Symmetry of the empirical distribution can be assessed by examining certain percentiles (or quantiles). Suppose the observations for a variable such as income are ordered from smallest to largest. The person with an income at the 25th percentile would have an income such that 25% of the individuals have an income less than or equal to that person's. This income is denoted as P(25). Equivalently, the 0.25 quantile, often written as Q(0.25), is the income that divides the cases into two groups, where a fraction 0.25 of the cases has an income less than or equal to Q(0.25) and a fraction 0.75 greater.
Figure 4.2 Plots of Hypothetical Histograms and the Resulting Normal Probability Plots from those Distributions: (a) normal distribution; (b) rectangular distribution; (c) lognormal (skewed right) distribution; (d) negative lognormal (skewed left) distribution; (e) 1/normal distribution; (f) high-tail distribution. Each panel shows a histogram of X and the corresponding normal probability plot of expected normal values Z versus X.
Numerically, P(25) = Q(0.25). For further discussion on quantiles and how they are computed in statistical packages, see Chambers et al. (1983) and Frigge, Hoaglin and Iglewicz (1989).

Some quantiles are widely used. One such quantile is the sample median Q(0.5). The median divides the observations into two equal parts. The median of ordered variables is either the middle observation, if the number of observations N is odd, or the average of the middle two, if N is even. Theoretically, for the normal distribution, the median equals the mean. In samples from normal distributions, they may not be precisely equal but they should not be far different. Two other widely used quantiles are Q(0.25) and Q(0.75). These are called quartiles since, along with the median, they divide the values of a variable into four equal parts. If the distribution is symmetric, then the difference between the median and Q(0.25) would equal the difference between Q(0.75) and the median. This is displayed graphically in box plots. Q(0.75) and Q(0.25) are plotted at the top and the bottom of the box (or rectangle) and the median is denoted either by a line or a dot within the box. Dashed lines (called whiskers) extend from the ends of the box out to what are called adjacent values. The numerical values of the adjacent values and quartiles are not precisely the same in all statistical packages (Frigge, Hoaglin and Iglewicz, 1989). For example, instead of quartiles, SYSTAT computes hinges, which are approximately equal to the quartiles.
Here, we will give the definition given in Chambers et al. (1983). The top adjacent value is the largest value of a case that is less than or equal to Q(0.75) plus 1.5(Q(0.75) - Q(0.25)). The bottom adjacent value is the smallest value of a case that is greater than or equal to Q(0.25) minus 1.5(Q(0.75) - Q(0.25)). Values that are beyond the adjacent values are sometimes examined to see if they are outliers. Symmetry can be assessed from box plots primarily by looking at the distances from the median to the top of the box and from the median to the bottom of the box. If these two lengths are decidedly unequal, then symmetry would be questionable. If the distribution is not symmetric, then it is not normal. All six packages can produce box plots.
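The quantities used in a box plot are easy to compute directly; the following Python sketch (illustrative only, with an artificial random sample standing in for the data) uses the definition of the adjacent values quoted above.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)              # stand-in for an interval-scale variable

    q1, median, q3 = np.percentile(x, [25, 50, 75])

    # Adjacent values as defined in Chambers et al. (1983).
    top_adjacent = x[x <= q3 + 1.5 * (q3 - q1)].max()
    bottom_adjacent = x[x >= q1 - 1.5 * (q3 - q1)].min()

    # Values beyond the adjacent values are candidates for examination as outliers.
    outliers = x[(x > top_adjacent) | (x < bottom_adjacent)]

    # Decidedly unequal half-distances suggest an asymmetric distribution.
    print(median - q1, q3 - median, outliers)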
Assessing normality using normal probability or normal quantile plots

Normal probability plots present an appealing option for checking for normality. One axis of the probability plot shows the values of X and the other shows the expected values, Z, of X if its distribution were exactly normal. The computation of these expected Z values is discussed in Johnson and Wichern (1988). Equivalently, this graph is a plot of the cumulative distribution found in the data set, with the vertical axis adjusted to produce a straight
line if the data followed an exact normal distribution. Thus if the data were from a normal distribution, the normal probability plot should approximate a straight line, as shown in the right-hand graph of Figure 4.2(a). In this graph, values of the variable X are shown on the horizontal axis and values of the expected normal are shown on the vertical axis. Even when the data are normally distributed, if the sample size is small, the normal probability plot may not be perfectly straight, especially at the extremes. You should concentrate your attention on the middle 80% or 90% of the graph and see if that is approximately a straight line. Most investigators can visually assess whether or not a set of connected points follows a straight line, so this method of determining whether or not data are normally distributed is highly recommended. In some packages, such as STATA, the X and Z axes are interchanged. For your package, you should examine the axes to see which represents the data (X) and which represents the expected values (Z).

The remainder of Figure 4.2 illustrates other distributions with their corresponding normal probability plots. The purpose of these plots is to help you associate the appearance of the probability plot with the shape of the histogram or frequency distribution. One especially common departure from normality is illustrated in Figure 4.2(c), where the distribution is skewed to the right (has a longer tail on the right side). In this case, the data actually follow a log-normal distribution. Note that the curvature resembles a quarter of a circle (or an upside-down bowl) with the ends pointing downward. In this case we can create a new variable by taking the logarithm of X to either the base 10 or e. A normal probability plot of log X will follow a straight line. If the axes in Figure 4.2(c) were interchanged so X was on the vertical axis, then the bowl would be right side up with the ends pointing upward. It would look like Figure 4.2(d).

Figures 4.2(b) and 4.2(f) show distributions that have higher tails than a normal distribution (more extreme observations). The corresponding normal probability plot is an inverted S. Figure 4.2(e) illustrates a distribution whose tails have fewer observations than a normal does. It was in fact obtained by taking the inverse of normally distributed data. The resulting normal probability plot is S-shaped. If the inverse of the data (1/X) were plotted, then a straight-line normal probability plot would be obtained. If X were plotted on the vertical axis, then Figure 4.2(e) would illustrate a heavy-tailed distribution and Figures 4.2(b) and 4.2(f) would illustrate distributions that have fewer observations than expected in the tails of a normal distribution.

The advantage of using the normal probability plot is that it not only tells you if the data are approximately normally distributed, but it also offers insight into how the data are distributed if they are not normally distributed. This insight is helpful in deciding what transformation to use to induce normality.
Many computer programs also produce theoretical quantile-quantile plots. Here a theoretical distribution such as a normal is plotted against the empirical quantiles of the data. When a normal distribution is used, they are often called normal quantile plots. These are interpreted similarly to normal probability plots, with the difference being that quantile plots emphasize the tails more than the center of the distribution. Normal quantile plots may have the variable X plotted on the vertical axis in some packages. For further information on their interpretation, see Chambers et al. (1983) or Gan, Koehler and Thompson (1991). Normal probability or normal quantile plots are available also in all six packages.
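Such plots can also be produced outside the six packages; the following Python sketch (illustrative only) uses scipy and matplotlib to draw a normal probability plot for a skewed variable and for its logarithm.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.lognormal(size=200)      # a variable skewed to the right, as in Figure 4.2(c)

    # Ordered data plotted against expected normal values; an approximately
    # straight line indicates approximate normality.
    stats.probplot(x, dist='norm', plot=plt)
    plt.show()

    # The same plot for log X should be much straighter.
    stats.probplot(np.log(x), dist='norm', plot=plt)
    plt.show()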
Selecting a transformation to induce normality

As we noted in the previous discussion, transforming the data can sometimes produce an approximate normal distribution. In some cases the appropriate transformation is known. Thus, for example, the square root transformation is used with Poisson-distributed variables. Such variables represent counts of events occurring randomly in space or time with a small probability such that the occurrence of one event does not affect another. Examples include the number of cells per unit area on a slide of biological material, the number of incoming phone calls per second in a telephone exchange, and counts of radioactive material per unit of time. Similarly, the logarithmic transformation has been used on variables such as household income, the time elapsing before a specified number of Poisson events have occurred, systolic blood pressure of older individuals, and many other variables with long tails to the right. Finally, the inverse transformation has been used frequently on the length of time it takes an individual or an animal to complete a certain task. Bennett and Franklin (1954) illustrate the use of transformations for counted data, and Tukey (1977) gives numerous examples of variables and appropriate transformations on them to produce approximate normality. Tukey (1957), Box and Cox (1964), Draper and Hunter (1969) and Bickel and Doksum (1981) discuss some systematic ways to find an appropriate transformation.

Another strategy for deciding what transformation to use is to progress up or down the values of p depending upon the shape of the normal probability plot. If the plot looks like Figure 4.2(c), then a value of p less than 1 is tried. Suppose p = 1/2 is tried. If this is not sufficient and the normal probability plot is still curved downward at the ends, then p = 0, i.e. the logarithmic transformation, can be tried. If the plot appears to be almost correct but a little more transformation is needed, a positive constant can be subtracted from each X prior to the transformation. If the transformation appears to be a little too much, a positive constant can be added. Thus, various
transformations of the form (X + C)^p are tried until the normal probability plot is as straight as possible. An example of the use of this type of transformation occurs when considering systolic blood pressure (SBP) for adult males. It has been found that a transformation such as log(SBP - 75) results in data that appear to be normally distributed. The investigator decreases or increases the value of p until the resulting observations appear to have close to a normal distribution. Note that this does not mean that theoretically you have a normal distribution, since you are only working with a single sample. Particularly if the sample is small, the transformation that is best for your sample may not be the one that is best for the entire population. For this reason, most investigators tend to round off the numerical values of p. For example, if they found that p = 0.46 worked best with their data, they might actually use p = 0.5 or the square root transformation.

All the statistical packages discussed in Chapter 3 allow the user to perform a large variety of transformations. STATA has an option, called ladder of powers, that automatically produces power transformations for various values of p.

Hines and O'Hara Hines (1987) developed a method for reducing the number of iterations needed by providing a graph which produces a suggested value of p from information available in most statistical packages. Using their graph one can directly estimate a reasonable value of p. Their method is based on the use of quantiles. What are needed for their method are the values of the median and a pair of symmetric quantiles. As noted earlier, the normal distribution is symmetric about the median. The difference between the median and a lower quantile (say Q(0.2)) should equal the difference between the upper symmetric quantile (say Q(0.8)) and the median if the data are normally distributed. By choosing a value of p that results in those differences being equal, one is choosing a transformation that tends to make the data approximately normally distributed. Using Figure 4.3 (kindly supplied by W.G.S. Hines and R.J. O'Hara Hines), one plots the ratio of the lower quantile of X to the median on the vertical axis and the ratio of the median to the upper quantile on the horizontal axis for at least one set of symmetric quantiles. The resulting p is read off the closest curve.

For example, in the depression data set given in Table 3.4 a variable called INCOME is listed. This is family income in thousands of dollars. The median is 15, Q(0.25) = 9, and Q(0.75) is 28. Hence the median is not halfway between the two quantiles (which in this case are the quartiles), indicating a nonsymmetric distribution with the long tail to the right. If we plot 9/15 = 0.60 on the vertical axis and 15/28 on the horizontal axis of Figure 4.3, we get a value of p of approximately -1/2. Since -1/2 lies between 0 and -1, trying first the log transformation seems reasonable. The median for log(INCOME) is 1.18, Q(0.25) = 0.95 and Q(0.75) = 1.45. The median - 0.95 = 0.23 and 1.45 - median = 0.27, so from the quartiles it appears that the data are still slightly skewed to the right.
Figure 4.3 Determination of Value of p in the Power Transformation to Produce Approximate Normality. (Curves are labeled by the power p; the vertical axis is the ratio of the lower quantile of X to the median, and the horizontal axis is the ratio of the median to the upper quantile.)
t?e
th:~~ TA, through the command 'lnskewO', considers all transformations of
app ~In (X- k) and finds the value of k which makes the skewness set r~~lmately equal to 0. For the variable INCOME in the depression data tra~st e value of k chosen by the program is k = - 2.01415. For the ormed variable In(INCOME + 2.01415) the value of the skewness is
62
Data screening and data transformation
-0.00016 compared with 1.2168 for INCOME. Note that adding a constant reduces the effect of taking the log. Examining the histograms with the normal density imposed upon them shows that the transformed variable fits a normal distribution much better than the original data. There does not appear to be much difference between the log transformation suggested by Figure 4. 3 and the transformation obtained from STATA when the histograms are examined. Quantiles or the equivalent percentiles can be estimated from all six statistical packages. It should be noted that not every distribution can be transformed to a normal distribution. For example, variable 29 in the depression data set is the sum of the results from the 20.;.item depression scale. The mode (the most commonly occurring score) is at zero, thus making it virtually impossible to transform these. CESD scores to a normal distribution. However, if a distribution has a single mode in the interior of the distribution, then it is usually not too difficult to find a distribution that appears at least closer to normal than the original one. Ideally, you should search for a transformation that has the proper scientific meaning in the context of your data.
Statistical tests for normality It is also possible to do formal tests for normality. These tests can be done to see if the null hypothesis of a normal distribution is rejected and hence whether a transformation should be considered. They can also be used after transformations to assist in assessing the effect of the transformation. A commonly used procedure is the Shapiro-Wilk W statistic (Dunn and Clark, 1987). This test has been shown to have a good power against a wide range of nonnormal distributions (see D' Agostino, Belanger and D' Agostino, 1990, for a discussion of this test and others). Several of the statistical programs give the value of Wand the associated p value for testing that the data carne from a normal distribution. If the value of p is small, then the data may not be ~on~idered normally distributed For. exam~le, we used the comman~ 'swilk' m STATA to perform the Shaprro-Wdk test on INCOME an In (INCOME+ 2.01415) from the depression data set. The p values for the two variables were 0.000? an.d 0.2134?, r.es~ectively. ~hese results indicat~ that INCOME has a distnbution that Is sigmficantly different from norma' while. the transform~d data fit a norm~l.distribution reasonabl~ well. A rno;~ practical transformation for data analysis IS In (INCOME + 2). This transform d variable is also fitted well with a normal distribution (p = 0.217). · · an Some programs also compute the Kolmogorov -Smirnov D statistiC 5 approximate p values. Here the null hypoth~sis is rejected for la~g~ D val~~s and small p values. Afifi and Azen (1979) discuss the charactensttcs of t
Assessing the need for a transformation
63
lso possible to perform a chi~square goodness of fit test. Note that test. Jt~s al(olmogorov-Smirnov D test and the chi-square test have poor both t eroperties. In effect, if you use. the~e tests you will ten~ to reject the power P othesis when your sample stze IS large and accept It when your null hYP 'ze is small. As one statistician put it, they are to some extent a . aJllP1e s1 . . 8 . ted way of determmmg your samp1e size. 1 cornP ~~:er way of testing for normality is to examine the skewness of your · · a d'tstn'b· ut10n · IS. · If t h e .AnoSkewness is a measure of h ow nonsymmetnc data. are symmetrically or normally distributed, the computed skewness Will dataclose to · zero. The numenca · 1 v~1ue o f t h e ratio · o f t h e sk ewness to Its · be ·dard error can be compared With normal Z tables, and symmetry, and ~:ce normality, ":ou~d be rejected if t.he ~b~~lute value of the .ratio is la.rge. positive skewness mdicates that the distnbutton has a long tall to the nght and probably looks like Figu~e 4.2(c~. (Note that i~ is i~portant !o rem~ve outliers or erroneous observatiOns pnor to performmg this test.) DAgostmo, Belanger and D' Agostino (1990) give an approximate formula for the test of skewness for sample sizes > 8. Skewness is computed in all six packages. BMDP and STATISTICA provide estimates of the standard error that can be ·used with large sample sizes. STATA computes the test of D'Agostino, Belanger and D' Agostino (1990) via the Sk test command. In general, a distribution with skewness greater than one will appear to be noticeably skewed unless the sample size is very small. Very little is known about what significance levels a should be chosen to compare with the p value obtained from formal tests of normality. So the sense of increased preciseness gained by performing a formal test over examining a plot is somewhat of an illusion. From the normal probability plot you can both decide whether the data are normally distributed and get a suggestion about what transformation to use. .
Ass
.
essmg the need for a transformation
!~egeneral, transformations
are more effective in inducing normality when llle standard deviation of the untransformed variable is large relative to the tra~~o~he. standard deviation divided by the mean is less than ~; then the divid d atiOn may not be necessary. Alternatively, if the largest observation not t e b by the. smallest observation is less than 2, then the data are likely on t~e ;e~uffictently v~riable for the transformation to have a decisive effect rneani ults (Hoaghn, Mosteller and Tukey, 1983). These rules are not Steven~~ul fo~ dat~ without a natural zero (interval data but not ratio by V~Iues ; ~~asstficatiOn~. For exampl.e, the above. rules will.result in di~erent Without lllperature 1s measured m Fahrenheit or Celsms. For vanables a natural zero, it is often possible to estimate a natural minimum
64
Data screening and data transformation
value below which observations are seldom if ever found. For such data, the rule of thumb is if Largest value - natural minimum 2 smallest observation - natural minimum < then a transformation is not likely to be useful. Other investigators examine how far the normal probability plot is from a straight line. If the amount of curvature is slight, they would not bother to transform the data. Note that the usefulness of transformations is difficult to evaluate when the sample sizes are small or if numerous outliers are present. In deciding whether to make a transformation, you may wish to perform the analysis with. and without the proposed transformation. Examining the results will frequently convince you that the conclusions are not altered after making the transformation, In this case it is preferable to present the results in terms of the most easily interpretable units. And it is often helpful to conform to the customs of the particular field of investigation. Sometimes, transformations are made to simplify later a~alyses rather than to approximate normal distributions. For example, it is known that FEV1 (forced expiratory volume in 1 s) and FVC (forced vital capacity) decrease in adults as they grow older. (See section 1.3 for a discussion of these variables.) Some researchers will take the ratjo FEVl/FVC and work with it because this ratio is less dependent on age. Using a variable that is independent of age can make analyses of a sample including adults of varying ages much simpler. In a practical sense, then, the researcher can use the transformation capabilities of the computer program packages to create new variables to be added to the set of initial variables rather than only modify and replace them. If transformations alter the results, then you should select the transformation that makes the data conform as much as possible to the assumptions. If a particular transformation is selected, then all analyses should be perforilled on the transformed data, and the results should be presented in terms of the transformed values. Inferences and statements of results should reflect this fact.
4.4 ASSESSING INDEPENDENCE

Measurements on two or more variables collected from the same individual are not expected to be, nor are they assumed to be, independent of each other. On the other hand, independence of observations collected from different individuals or items is an assumption made in the theoretical derivation of most multivariate statistical analyses. This assumption is crucial to the validity of the results, and violating it may result in erroneous
conclusions. Unfortunately, little is published about the quantitative effects of various degrees of nonindependence. Also, few practical methods exist for checking whether the assumption is valid.

In situations where the observations are collected from people, it is frequently safe to assume independence of observations collected from different individuals. Dependence could exist if a factor or factors exist that affect all of the individuals in a similar manner with respect to the variables being measured. For example, political attitudes of adult members of the same household cannot be expected to be independent. Inferences that assume independence are not valid when drawn from such responses. Similarly, biological data on twins or siblings are usually not independent.

Data collected in the form of a sequence, either in time or space, can also be dependent. For example, observations of temperature on successive days are likely to be dependent. In those cases it is useful to plot the data in the appropriate sequence and search for trends or changes over time. Some programs allow you to plot an outcome variable against the program's own ID variable. Other programs do not, so the safe procedure is to always type in a variable or variables that represent the order in which the observations are taken and any other factor that you think might result in lack of independence, such as location or time. Also, if more complex sampling methods are used, then information on clusters or strata should be entered.

Figure 4.4 presents a series of plots that illustrate typical outcomes. Values of the outcome variable being considered appear on the vertical axis and values of order of time, location or observation identification number (ID) appear on the horizontal axis. If the plot resembles that shown in Figure 4.4(a), little reason exists for suspecting lack of independence or nonrandomness. The data in that plot were, in fact, obtained as a sequence of random standard normal numbers. In contrast, Figure 4.4(b) shows data that exhibit a positive trend. Figure 4.4(c) is an example of a temporary shift in the level of the observations, followed by a return to the initial level. This result may occur, for example, in laboratory data when equipment is temporarily out of calibration or when a substitute technician, following different procedures, is used temporarily. Finally, a common situation occurring in some business or industrial data is the existence of seasonal cycles, as shown in Figure 4.4(d).

Sequence plots or scatter diagrams are available in all six statistical packages discussed in this book and are widely available elsewhere. If a two-dimensional display of location is desired, with the outcome variable under consideration on the vertical axis, then this can be displayed in three dimensions, as some packages allow.

Some formal tests for the randomness of a sequence of observations are given in Brownlee (1965) and Bennett and Franklin (1954). One such test is presented in Chapter 6 of this book in the context of regression and correlation analysis.
Figure 4.4 Graphs of Hypothetical Data Sequences Showing Lack of Independence: (a) random observations; (b) positive trend; (c) temporary shift; (d) cycle. (Each panel plots the outcome against its sequence order.)
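A minimal sketch of such a sequence plot, with a lag-one correlation as a rough numerical check on independence, is given below; the use of NumPy/Matplotlib and the stand-in data are assumptions for illustration, not part of the packages discussed in the text.

```python
# Sketch: plot observations in the order they were collected and
# compute a lag-1 correlation as a rough check on independence.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
y = rng.standard_normal(50)          # stand-in for an outcome recorded in sequence
order = np.arange(1, len(y) + 1)     # time, location or ID order

plt.plot(order, y, marker="o", linestyle="-")
plt.xlabel("Sequence")
plt.ylabel("Outcome")
plt.title("Sequence plot for checking independence")
plt.show()

lag1 = np.corrcoef(y[:-1], y[1:])[0, 1]   # near 0 is consistent with independence
print(f"lag-1 correlation = {lag1:.2f}")
```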
If you are dealing with series of observations you may also wish to study the area of forecasting and time series analysis. Some books on this subject are included in the further reading section at the end of this chapter and in the reference and further reading sections of Chapter 6.
SUMMARY

In this chapter we emphasized methods for determining if the data were normally distributed and for finding suitable transformations to make the data closer to a normal distribution. We also discussed methods for checking whether the data were independent.

No special order is best for all situations in data screening. However, most investigators would delete the obvious outliers first. They may then check the assumption of normality, attempt some transformations, and recheck the data for outliers. Other combinations of data-screening activities are usually dictated by the problem. We wish to emphasize that data screening could be the most time-consuming and costly portion of data analysis. Investigators should not underestimate this aspect. If the data are not screened properly, much of the analysis may have to be repeated, resulting in an unnecessary waste of time and resources. Once the data have been carefully screened, the investigator will then be in a position to select the appropriate analysis to answer specific questions. In Chapter 5 we present a guide to the selection of the appropriate data analysis.
REFERENCES

References preceded by an asterisk require a strong mathematical background.
Afifi, A.A. and Azen, S.P. (1979). Statistical Analysis: A Computer Oriented Approach, 2nd edn. Academic Press, New York.
Bennett, C.A. and Franklin, N.L. (1954). Statistical Analysis in Chemistry and the Chemical Industry. Wiley, New York.
*Bickel, P.J. and Doksum, K.A. (1981). An analysis of transformations revisited. Journal of the American Statistical Association, 76, 296-311.
*Box, G.E.P. and Cox, D.R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26, 211-52.
Brownlee, K.A. (1965). Statistical Theory and Methodology in Science and Engineering, 2nd edn. Wiley, New York.
Chambers, J.M., Cleveland, W.S., Kleiner, B. and Tukey, P.A. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA.
D'Agostino, R.B., Belanger, A. and D'Agostino, R.B. Jr (1990). A suggestion for using powerful and informative tests of normality. The American Statistician, 44, 316-21.
Draper, N.R. and Hunter, W.G. (1969). Transformations: Some examples revisited. Technometrics, 11, 23-40.
Dunn, O.J. and Clark, V.A. (1987). Applied Statistics: Analysis of Variance and Regression. Wiley, New York.
Frigge, M., Hoaglin, D.C. and Iglewicz, B. (1989). Some implementations of the boxplot. The American Statistician, 43, 50-4.
Gan, F.F., Koehler, K.J. and Thompson, J.C. (1991). Probability plots and distribution curves for assessing the fit of probability models. The American Statistician, 45, 14-21.
Hines, W.G.S. and O'Hara Hines, R.J. (1987). Quick graphical power-law transformation selection. The American Statistician, 41, 21-4.
Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (eds) (1983). Understanding Robust and Exploratory Data Analysis. Wiley, New York.
Johnson, R.A. and Wichern, D.W. (1988). Applied Multivariate Statistical Analysis, 2nd edn. Prentice-Hall, Englewood Cliffs, NJ.
*Tukey, J.W. (1957). On the comparative anatomy of transformations. Annals of Mathematical Statistics, 28, 602-32.
Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA.
FURTHER READING

Bartlett, M.S. (1947). The use of transformations. Biometrics, 3, 39-52.
*Box, G.E.P. and Jenkins, G.M. (1976). Time Series Analysis: Forecasting and Control, rev. edn. Holden-Day, San Francisco.
Dixon, W.J. and Massey, F.J. (1983). Introduction to Statistical Analysis, 4th edn. McGraw-Hill, New York.
Fox, J. and Long, J.S. (eds) (1990). Modern Methods for Data Analysis. Sage, Newbury Park, CA.
McCleary, R. and Hay, R.A. (1980). Applied Time Series Analysis for the Social Sciences. Sage, Beverly Hills, CA.
Mage, D.T. (1982). An objective graphical method for testing normal distribution assumptions using probability plots. The American Statistician, 36, 116-20.
*Montgomery, D.C. and Johnson, L.A. (1976). Forecasting and Time Series Analysis. McGraw-Hill, New York.
Ostrom, C.W. (1978). Time Series Analysis: Regression Techniques. Sage, Beverly Hills, CA.
PROBLEMS

4.1 Using the depression data set given in Table 3.5, create a variable equal to the negative of one divided by the cubic root of income. Display a normal probability plot of the new variable.
4.2 Take the logarithm to the base 10 of the income variable in the depression data set. Compare the histogram of income with the histogram of log(INCOME). Also, compare the normal probability plots of income and log(INCOME).
4.3 Repeat Problems 4.1 and 4.2, taking the square root of income.
4.4 Generate a set of 100 random normal deviates by using a computer program package. Display a histogram and normal probability plot of these values. Square these numbers by using transformations. Compare histograms and normal probability plots of the logarithms and the square roots of the set of squared normal deviates.
4.5 Take the logarithm of the CESD score plus 1 and compare the histograms of CESD and log(CESD + 1). (A small constant must be added to CESD because
CESD can be zero.)
4.6 The accompanying data are from the New York Stock Exchange Composite Index for the period August 9 through September 17, 1982. Run a program to plot these data in order to assess the lack of independence of successive observations. (Note that the data cover five days per week, except for a holiday on Monday, September 6.) This time period saw an unusually rapid rise in stock prices (especially for August), coming after a protracted falling market. Compare this time series with prices for the current year.
Day   Month    Index
 9    Aug.     59.3
10    Aug.     59.1
11    Aug.     60.0
12    Aug.     58.8
13    Aug.     59.5
16    Aug.     59.8
17    Aug.     62.4
18    Aug.     62.3
19    Aug.     62.6
20    Aug.     64.6
23    Aug.     66.4
24    Aug.     66.1
25    Aug.     67.4
26    Aug.     68.0
27    Aug.     67.2
30    Aug.     67.5
31    Aug.     68.5
 1    Sept.    67.9
 2    Sept.    69.0
 3    Sept.    70.3
 6    Sept.    (holiday)
 7    Sept.    69.6
 8    Sept.    70.0
 9    Sept.    70.0
10    Sept.    69.4
13    Sept.    70.0
14    Sept.    70.6
15    Sept.    71.2
16    Sept.    71.0
17    Sept.    70.4
4.7 Obtain a normal probability plot of the data set given in Problem 4.6. Suppose that you had been ignorant of the lack of independence of these data and had treated them as if they were independent samples. Assess whether they are normally distributed.
4.8 Obtain normal probability plots of mothers' and fathers' weights from the lung function data set in Appendix A. Discuss whether or not you consider weight to be normally distributed in the population from which this sample is taken.
4.9 Generate ten random normal deviates. Display a probability plot of these data. Suppose you did not know the origin of these data: would you conclude they were normally distributed? What is your conclusion based on the Shapiro-Wilk test? Do ten observations provide sufficient information to check normality?
4.10 Repeat Problem 4.8 with the weights expressed in ounces instead of pounds. How do your conclusions change? Obtain normal probability plots of the logarithm of mothers' weights expressed in pounds and then in ounces, and compare.
4.11 From the variables ACUTEILL and BEDDAYS described in Table 3.4, use arithmetic operations to create a single variable that takes on the value 1 if the person has been both bedridden and acutely ill in the last two months and that takes on the value 0 otherwise.
5
Selecting appropriate analyses

5.1 WHICH ANALYSES?

When you have a data set and wish to analyze it using the computer, obvious questions that arise are 'What descriptive measures should be used to examine the data?' and 'What statistical analyses should be performed?' Section 5.2 explains why sometimes these are confusing questions, particularly to investigators who lack experience in real-life data analysis. Section 5.3 presents measures that are useful in describing data. Usually, after the data have been 'cleaned' and any needed transformations made, the next step is to obtain a set of descriptive statistics and graphs. It is also useful to obtain the number of missing values for each variable. The suggested descriptive statistics and graphs are guided by Stevens's classification system introduced in Chapter 2. Section 5.4 presents a table that summarizes suggested multivariate analyses and indicates in which chapters of this book they can be found. Readers with experience in data analysis may be familiar with all or parts of this chapter.
5.2 WHY SELECTION OF ANALYSES IS OFTEN DIFFICULT

There are two reasons why deciding what descriptive measures or analyses to perform and report is often difficult for an investigator with real-life data. First, in statistics textbooks, statistical methods are presented in a logical order from the viewpoint of learning statistics but not from the viewpoint of doing data analysis by using statistics. Most texts are either mathematical statistics texts, or are imitations of them with the mathematics simplified or left out. Also, when learning statistics for the first time, the student often finds mastering the techniques themselves tough enough without worrying about how to use them in the future. The second reason is that real data often contain mixtures of types of data, which makes the choice of analysis somewhat arbitrary. Two trained statisticians presented with the same set of data will often opt for different ways of analyzing the set, depending
on what assumptions they are willing to take into account in the interpretation of the analysis. Acquiring a sense of when it is safe to ignore assumptions is difficult both to learn and to teach. Here, for the most part, an empirical approach will be suggested. For example, it is often a good idea to perform several different analyses, one where all the assumptions are met and one where some are not, and compare the results. The idea is to use statistics to obtain insights into the data and to determine how the system under study works.

One point to keep in mind is that the examples presented in many statistics books are often ones the authors have selected after a long period of working with a particular technique. Thus they usually are 'ideal' examples, designed to suit the technique being discussed. This feature makes learning the technique simpler but does not provide insight into its use in typical real-life situations. In this book we will attempt to be more flexible than standard textbooks so that you will gain experience with commonly encountered difficulties.

In the next section, suggested graphical and descriptive statistics are given for several types of data collected for analysis. Note, however, that these suggestions should not be applied rigidly in all situations. They are meant to be a framework for assisting the investigator in analyzing and reporting the data.

5.3 APPROPRIATE STATISTICAL MEASURES UNDER STEVENS'S CLASSIFICATION

In Chapter 2, Stevens's system of classifying variables into nominal (naming results), ordinal (determination of greater than or less than), interval (equal intervals between successive values of a variable) and ratio (equal intervals and a true zero point) was presented. In this section this system is used to obtain suggested descriptive measures. Table 5.1 shows appropriate graphical and computed measures for each type of variable. It is important to note that the descriptive measures are appropriate to the type of variable listed on the left and to all below it. Note also that Σ signifies addition, Π multiplication and P percentile in Table 5.1.

Measures for nominal data

For nominal data, the order of the numbers has no meaning. For example, in the depression data set the respondent's religion was coded 1 = Protestant, 2 = Catholic, 3 = Jewish, 4 = none, and 5 = other. Any other five distinct numbers could be chosen, and their order could be changed without changing the empirical operation of equality. The measures used to describe these data should not imply a sense of order.
Table 5.1 Descriptive measures depending upon Stevens's scale

Classification  Graphical measures          Measures of the center           Measures of the variability
                                            of a distribution                of a distribution
---------------------------------------------------------------------------------------------------------
Nominal         Bar graphs, pie charts      Mode                             Binomial or multinomial variance
Ordinal         Histogram                   Median                           Range; P75 - P25
Interval        Histogram with areas        Mean = X̄                         Standard deviation = S
                measurable
Ratio           Histogram with areas        Geometric mean = (Π Xi)^(1/N);   Coefficient of variation = S/X̄
                measurable                  Harmonic mean = N / Σ(1/Xi)
Suitable graphical measures for nominal data are bar graphs and pie charts. In the above example, these graphical measures show the proportion of respondents who have each of the five responses for religion. The length of the bar represents the proportion for the bar graph and the size (or angle) of the piece represents the proportion for the pie chart. Fox and Long (1990) note that both of these graphical methods can be successfully used by observers to make estimates of proportions or counts. Cleveland (1985) is less impressed with the so-called stacked bar graphs (where each bar is divided into a number of subdistances based on another variable, say gender). It is difficult to compare the subdistances in stacked bar graphs since they all do not start at a common base. Bar graphs and pie charts are available in all six packages.
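For readers working outside those packages, a minimal Matplotlib sketch of both displays follows; the category counts shown are invented solely to illustrate the plotting calls.

```python
# Sketch: bar graph and pie chart of a nominal variable (religion).
import matplotlib.pyplot as plt

labels = ["Protestant", "Catholic", "Jewish", "None", "Other"]
counts = [120, 80, 30, 45, 19]        # hypothetical counts, not the depression data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.bar(labels, counts)                # bar length shows the count in each category
ax1.set_ylabel("Number of respondents")
ax2.pie(counts, labels=labels)         # slice angle shows each category's proportion
ax2.set_title("Religion")
plt.tight_layout()
plt.show()
```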
Note that there are two types of pie charts: the 'value' and the 'count' pie charts. Some packages only have the former. The count pie chart has a pie piece corresponding to each category of a given variable. On the other hand, the value pie charts represent the sum of values for each of a group of variables in the pie pieces. Counts can be produced by creating a dummy variable for each category: suppose you create a variable called 'rel1' which equals 1 if religion = 1 and 0 otherwise. Similarly, you can create 'rel2', 'rel3', 'rel4' and 'rel5'. The sum of rel1 over the cases will then equal the number of cases for which religion = 1; and so forth. The value pie chart with rel1 to rel5 will then have five pieces representing the five categories of religion.

The mode, or the outcome of a variable that occurs most frequently, is the
only appropriate measure of the center of the distribution. For a variable such as sex, where only two outcomes are available, the variability, or variance, of the proportion of cases who are male (female) can be measured by the binomial variance, which is

Estimated variance of p = p(1 - p)/N

where p is the proportion of respondents who are males (females). If more than two outcomes are possible, then the variance of the ith proportion is given by

Estimated variance of p_i = p_i(1 - p_i)/N
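In code these variance estimates are one line each; the sketch below simply applies the formulas above to hypothetical counts.

```python
# Sketch: estimated variances of sample proportions (binomial / multinomial).
import numpy as np

counts = np.array([162, 132])           # hypothetical male/female counts
N = counts.sum()
p = counts / N
var_p = p * (1 - p) / N                 # p(1 - p)/N for each proportion
print(p, var_p)

# With more than two outcomes the same formula applies to each p_i:
religion_counts = np.array([120, 80, 30, 45, 19])
p_i = religion_counts / religion_counts.sum()
var_p_i = p_i * (1 - p_i) / religion_counts.sum()
print(np.round(var_p_i, 5))
```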
Measures for ordinal data

For ordinal variables, order or ranking does have relevance, and so more descriptive measures are available. In addition to the pie charts and bar graphs used for nominal data, histograms can now be used. The area under the histogram still has no meaning because the intervals between successive numbers are not necessarily equal. For example, in the depression data set a general health question was asked and later coded 1 = excellent, 2 = good, 3 = fair and 4 = poor. The distance between 1 and 2 is not necessarily equal to the distance between 3 and 4 when these numbers are used to identify answers to the health question.

An appropriate measure of the center of the distribution is the median. Roughly speaking, the median is the value of the variable that half the respondents exceed and half do not. The range, or largest minus smallest value occurring, is a measure of how variable or disperse the distribution is. Another measure that is sometimes reported is the difference between two percentiles. For example, sometimes the fifth percentile, P(5), is subtracted from the 95th percentile, P(95). As explained in section 4.3, P(5) equals the quantile Q(0.05). Some investigators prefer to report the interquartile range, Q(0.75) - Q(0.25), or the quartile deviation, (Q(0.75) - Q(0.25))/2.

Histograms are available in all six statistical packages used in this book. Also, medians, Q(0.25) and Q(0.75), are widely available. Percentiles and quartiles for each distinct value of a variable are available in BMDP, SAS, SPSS and SYSTAT. STATA has values for a wide range of percentiles and in STATISTICA any quantile can be computed.
Measures for interval data

For interval data the full range of descriptive statistics generally used is available to the investigator. This set includes graphical measures such as
histograms, with the area under the histogram now having meaning. The mean and standard deviation can now be used also. These well-known descriptive statistics are part of the output in most programs and hence are easily obtainable.

Measures for ratio data

The additional measures available for ratio data are seldom used. The geometric mean (GM) is sometimes used when the log transformation is used, since
log GM = (Σ log X)/N
It is also used when computing the mean of a process where there is a constant rate of change. For example, suppose a rapidly growing city has a population of 2500 in 1970 and a population of 5000 according to the 1980 census. The 1975 population (halfway between 1970 and 1980) can be estimated as

GM = (Π Xi)^(1/2) = (2500 × 5000)^(1/2) = 3536
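A minimal sketch of that calculation in Python, also showing the equivalent route through the log transformation mentioned above, is given below; the NumPy usage is an assumption for illustration.

```python
# Sketch: geometric mean via the definition and via the log transformation.
import numpy as np

x = np.array([2500.0, 5000.0])            # 1970 and 1980 census counts
gm_direct = np.prod(x) ** (1 / len(x))    # (product of X_i)^(1/N)
gm_logs = np.exp(np.mean(np.log(x)))      # equivalent: exp of the mean of log X
print(round(gm_direct), round(gm_logs))   # about 3536 for the 1975 estimate
```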
The harmonic mean (HM) is the reciprocal of the arithmetic mean of the reciprocals of the data. It is used for obtaining a mean of rates when the quantity in the numerator is fixed. For example, if an investigator wishes to analyze distance per unit of time that N cars require to run a fixed distance, then the harmonic mean should be used.

The coefficient of variation can be used to compare the variability of distributions that have different means. It is a unitless statistic. Note that the geometric and harmonic mean can both be computed with any package by first transforming the data. The coefficient of variation is less widely available but is simple to compute in any case since it is the standard deviation divided by the mean.

Stretching assumptions

In the data analyses given in the next section, it is the ordinal variables that often cause confusion. Some statisticians treat them as if they were nominal data, either splitting them into two categories if they are dependent variables or using the dummy variables described in section 9.3 if they are independent variables. Other statisticians treat them as if they were interval data. It is usually possible to assume that the underlying scale is continuous, and that
because of a lack of a sophisticated measuring instrument, the investigator is not measuring with an interval scale.

One important question is how far off is the ordinal scale from an interval scale? If it is close, then using an interval scale makes sense; otherwise it is more questionable. There are many more statistical analyses available for interval data than for ordinal data, so there are real incentives for ignoring the distinction between interval and ordinal data. Velleman and Wilkinson (1993) proposed that the data analysis used should be guided by what you hope to learn from the data, not by rigid adherence to Stevens's classification system. Further discussion of this point and a method for converting ordinal or rank variables to interval variables is given in Abelson and Tukey (1959, 1963). Although assumptions are often stretched so that ordinal data can be treated as interval, this stretching should not be done with nominal data, because complete nonsense is likely to result.
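As an illustration of the dummy-variable option mentioned above, the following sketch uses pandas to expand a nominal variable into indicator columns; the variable name and the codes are hypothetical.

```python
# Sketch: dummy (indicator) variables for a nominal independent variable.
import pandas as pd

df = pd.DataFrame({"religion": [1, 2, 1, 4, 5, 3, 1, 2]})   # hypothetical codes 1-5
dummies = pd.get_dummies(df["religion"], prefix="rel")      # rel_1, rel_2, ... columns
print(dummies.sum())   # column sums give the count in each category
```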
5.4 APPROPRIATE MULTIVARIATE ANALYSES UNDER STEVENS'S CLASSIFICATION

To decide on possible analyses, we must classify variables as follows:

1. independent versus dependent;
2. nominal or ordinal versus interval or ratio.
The classification of independent or dependent may differ from analysis to analysis, but the classification into Stevens's system usually remains constant throughout the analysis phase of the study. Once these classifications are determined, it is possible to refer to Table 5.2 and decide what analysis should be considered. This table should be used as a general guide to analysis selection rather than as a strict set of rules.

In Table 5.2 nominal and ordinal variables have been combined because this book does not cover analyses appropriate only to nominal or ordinal data separately. An extensive summary of measures and tests for these types of variables is given in Reynolds (1977) and Hildebrand, Laing and Rosenthal (1977). Interval and ratio variables have also been combined because the same analyses are used for both types of variables. There are many measures of association and many statistical methods not listed in the table. For further information on choosing analyses appropriate to various data types, see Gage (1963) or Andrews et al. (1981).

The first row of Table 5.2 includes analyses that can be done if there are no dependent variables. Note that if there is only one variable, it can be considered either dependent or independent. A single independent variable that is either interval or ratio can be screened by methods given in Chapters 3 and 4, and descriptive statistics can be obtained from many statistical
Table 5.2 Suggested data analysis under Stevens's classification

Dependent variable(s): none
  Independent nominal or ordinal, 1 variable: χ² goodness of fit
  Independent nominal or ordinal, >1 variable: measures of association; χ² test of independence; log-linear model (17)
  Independent interval or ratio, 1 variable: univariate statistics (e.g. one-sample t tests); descriptive measures (5); tests for normality (4)
  Independent interval or ratio, >1 variable: correlation matrix (7); principal components (14); factor analysis (15); cluster analysis (16)

Dependent nominal or ordinal, 1 variable
  Independent nominal or ordinal, 1 variable: χ² test; Fisher's exact test
  Independent nominal or ordinal, >1 variable: log-linear model (17); logistic regression (12)
  Independent interval or ratio, 1 variable: discriminant function (11); logistic regression (12); univariate statistics (e.g. two-sample t tests)
  Independent interval or ratio, >1 variable: discriminant function (11); logistic regression (12)

Dependent nominal or ordinal, >1 variable
  Independent nominal or ordinal, 1 variable: log-linear model (17)
  Independent nominal or ordinal, >1 variable: log-linear model (17)
  Independent interval or ratio, 1 variable: discriminant function (11)
  Independent interval or ratio, >1 variable: discriminant function (11)

Dependent interval or ratio, 1 variable
  Independent nominal or ordinal, 1 variable: t-test; analysis of variance; survival analysis (13)
  Independent nominal or ordinal, >1 variable: analysis of variance; multiple-classification analysis; survival analysis (13)
  Independent interval or ratio, 1 variable: linear regression (6); correlation (6); survival analysis (13)
  Independent interval or ratio, >1 variable: multiple regression (7, 8, 9); survival analysis (13)

Dependent interval or ratio, >1 variable
  Independent nominal or ordinal, 1 variable: multivariate analysis of variance; analysis of variance on principal components; Hotelling's T²; profile analysis (16)
  Independent nominal or ordinal, >1 variable: multivariate analysis of variance; analysis of variance on principal components
  Independent interval or ratio, 1 variable: canonical correlation (10)
  Independent interval or ratio, >1 variable: canonical correlation (10); path analysis; structural models (LISREL, EQS)
programs. If there are several interval or ratio independent variables, then several techniques are listed in the table.

In Table 5.2 the numbers in parentheses following some techniques refer to the chapters of this book where those techniques are described. For example, to determine the advantages of doing a principal components analysis and how to obtain results and interpret them, you would consult Chapter 14. A very brief description of this technique is also given in Chapter 1. If no number in parentheses is given, then that technique is not discussed in this book. For interval or ratio dependent variables and nominal or ordinal independent variables, analysis of variance is the appropriate technique. Analysis of variance is not discussed in this book; for discussions of this topic, see Winer (1971), Dunn and Clark (1987), Box, Hunter and Hunter (1978) or Miller (1986). Multivariate analysis of variance and Hotelling's T² are discussed in Afifi and Azen (1979) and Morrison (1976). Discussion of multiple-classification analysis can be found in Andrews et al. (1973). Structural models are discussed in Duncan (1975). The term log-linear models used here applies to the analysis of nominal and ordinal data (Chapter 17). The same term is used in Chapter 13 in connection with one method of survival analysis.

Table 5.2 provides a general guide for what analyses should be considered. We do not mean that other analyses couldn't be done but simply that the usual analyses are the ones that are listed. For example, methods of performing discriminant function analyses have been studied for noninterval variables, but this technique was originally derived with interval or ratio data. Judgment will be called for when the investigator has, for example, five independent variables, three of which are interval, while one is ordinal and one is nominal, with one dependent variable that is interval. Most investigators would use multiple regression, as indicated in Table 5.2. They might pretend that the one ordinal variable is interval and use dummy variables for the nominal variable (Chapter 9). Another possibility is to categorize all the independent variables and to perform an analysis of variance on the data. That is, analyses that require fewer assumptions in terms of types of variables can always be done. Sometimes, both analyses are done and the results are compared. Because the packaged programs are so simple to run, multiple analyses are a realistic option.

In the examples given in Chapters 6-17, the data used will often not be ideal. In some chapters a data set has been created that fits all the usual assumptions to explain the technique, but then a nonideal, real-life data set is also run and analyzed. It should be noted that when inappropriate variables are used in a statistical analysis, the association between the statistical results and the real-life situation is weakened. However, the statistical models do not have to fit perfectly in order for the investigator to obtain useful information from them.
SUMMARY

The Stevens system of classification of variables can be a helpful tool in deciding on the choice of descriptive measures as well as in sophisticated data analyses. In this chapter we presented a table to assist the investigator in each of these two areas. A beginning data analyst may benefit from practicing the advice given in this chapter and from consulting more experienced researchers.

The recommended analyses are intended as general guidelines and are by no means exclusive. It is a good idea to try more than one way of analyzing the data whenever possible. Also, special situations may require specific analyses, perhaps ones not covered thoroughly in this book. Some investigators may wish to consult the detailed recommendations given in Andrews et al. (1981).
REFERENCES

References preceded by an asterisk require a strong mathematical background.

Abelson, R.P. and Tukey, J.W. (1959). Efficient conversion of nonmetric information into metric information. Proceedings of the Social Statistics Section, American Statistical Association, 226-30.
*Abelson, R.P. and Tukey, J.W. (1963). Efficient utilization of non-numerical information in quantitative analysis: General theory and the case of simple order. Annals of Mathematical Statistics, 34, 1347-69.
Afifi, A.A. and Azen, S.P. (1979). Statistical Analysis: A Computer Oriented Approach, 2nd edn. Academic Press, New York.
Andrews, F.M., Klem, L., Davidson, T.N., O'Malley, P.M. and Rodgers, W.L. (1981). A Guide for Selecting Statistical Techniques for Analyzing Social Sciences Data, 2nd edn. Institute for Social Research, University of Michigan, Ann Arbor.
Box, G.E.P., Hunter, W.G. and Hunter, J.S. (1978). Statistics for Experimenters. Wiley, New York.
Cleveland, W.S. (1985). The Elements of Graphing Data. Wadsworth, Monterey, CA.
Duncan, O.D. (1975). Introduction to Structural Equation Models. Academic Press, New York.
Dunn, O.J. and Clark, V.A. (1987). Applied Statistics: Analysis of Variance and Regression. Wiley, New York.
Fox, J. and Long, J.S. (eds) (1990). Modern Methods of Data Analysis. Sage, Newbury Park, CA.
Gage, N.L. (ed.) (1963). Handbook of Research on Teaching. Rand McNally, Chicago.
Hildebrand, D.K., Laing, J.D. and Rosenthal, H. (1977). Analysis of Ordinal Data. Sage, Beverly Hills, CA.
Miller, R.G. Jr (1986). Beyond ANOVA: Basics of Applied Statistics. Wiley, New York.
*Morrison, D.F. (1976). Multivariate Statistical Methods. McGraw-Hill, New York.
Reynolds, H.T. (1977). Analysis of Nominal Data. Sage, Beverly Hills, CA.
Velleman, P.F. and Wilkinson, L. (1993). Nominal, ordinal, interval and ratio typologies are misleading. The American Statistician, 47, 65-72.
Winer, B.J. (1971). Statistical Principles in Experimental Design, 2nd edn. McGraw-Hill, New York.

FURTHER READING

Andrews, F.M., Morgan, J., Sonquist, J. and Klem, L. (1973). Multiple Classification Analysis, 2nd edn. Institute for Social Research, University of Michigan, Ann Arbor.
*Joreskog, K.G. and Sorbom, D. (1986). LISREL VI: Analysis of Linear Structural Relationships by the Method of Maximum Likelihood, Instrumental Variables, and Least Squares Methods. Scientific Software, Mooresville, IN.
Stevens, S.S. (1951). Mathematics, measurement and psychophysics, in Handbook of Experimental Psychology (ed. S. Stevens). Wiley, New York.

PROBLEMS

5.1 Compute an appropriate measure of the center of the distribution for the following variables from the depression data set: MARITAL, INCOME, AGE and HEALTH.
5.2 An investigator is attempting to determine the health effects on families of living in crowded urban apartments. Several characteristics of the apartment have been measured, including square feet of living area per person, cleanliness, and age of the apartment. Several illness characteristics for the families have also been measured, such as number of infectious diseases and number of bed days per month for each child, and overall health rating for the mother. Suggest an analysis to use with these data.
5.3 A coach has made numerous measurements on successful basketball players, such as height, weight and strength. He also knows which position each player is successful at. He would like to obtain a function from these data that would predict which position a new player would be best at. Suggest an analysis to use with these data.
5.4 A college admissions committee wishes to predict which prospective students will successfully graduate. To do so, the committee intends to obtain the college grade point averages for a sample of college seniors and compare these with their high school grade point averages and Scholastic Aptitude Test scores. Which analysis should the committee use?
5.5 Data on men and women who have died have been obtained from health maintenance organization records. These data include age at death, height and weight, and several physiological and life-style measurements such as blood pressure, smoking status, dietary intake and usual amount of exercise. The immediate and underlying causes of death are also available. From these data we would like to find out which variables predict death due to various underlying causes. (This procedure is known as risk factor analysis.) Suggest possible analyses.
5.6 Large amounts of data are available from the United Nations and other international organizations on each country and sovereign state of the world, including health, education and commercial data. An economist would like to invent a descriptive system for the degree of development of each country on the basis of these data. Suggest possible analyses.
5.7 For the data described in Problem 5.6 we wish to put together similar countries into groups. Suggest possible analyses.
5.8 For the data described in Problem 5.6 we wish to relate health data such as infant mortality (the proportion of children dying before the age of one year) and life expectancy (the expected age at death of a person born today if the death rates remain unchanged) to other data such as gross national product per capita, percentage of people older than 15 who can read and write (literacy), average daily calorie intake per capita, average energy consumption per year per capita, and number of persons per practicing physician. Suggest possible analyses. What other variables would you include in your analysis?
5.9 A member of the admissions committee notices that there are several women with high high school grade point averages but low SAT scores. He wonders if this pattern holds for both men and women in general, only for women in general, or only in a few cases. Suggest ways to analyze this problem.
5.10 Two methods are currently used to treat a particular type of cancer. It is suspected that one of the treatments is twice as effective as the other in prolonging survival regardless of the severity of the disease at diagnosis. A study is carried out. After the data are collected, what analysis should the investigators use?
5.11 A psychologist would like to predict whether or not a respondent in the depression study described in Chapter 3 is depressed. To do this, she would like to use the information contained in the following variables: MARITAL, INCOME and AGE. Suggest analyses.
5.12 Using the data described in Table A.1, an investigator would like to predict a child's lung function based on that of the parents and the area they live in. What analyses would be appropriate to use?
5.13 In the depression study, information was obtained on the respondent's religion (Chapter 3). Describe why you think it is incorrect to obtain an average score for religion across the 294 respondents.
Part Two

Applied Regression Analysis
6
Simple linear regression and correlation

6.1 USING LINEAR REGRESSION AND CORRELATION TO EXAMINE THE RELATIONSHIP BETWEEN TWO VARIABLES

Simple linear regression analysis is commonly performed when investigators wish to examine the relationship between two interval or ratio variables. Section 6.2 describes the two basic models used in linear regression analysis and a data example is given in section 6.3. Section 6.4 presents the assumptions, methodology and usual output from the first model while section 6.5 does the same for the second model. Sections 6.6 and 6.7 discuss the interpretation of the results for the two models. In section 6.8 a variety of useful output options that are available from statistical programs are described. These outputs include standardized regression coefficients, the regression analysis of variance table, determining whether or not the relationship is linear, and the use of residuals to find outliers. Section 6.9 defines robustness in statistical analysis and discusses how critical the various assumptions are. The use of transformations for simple linear regression is also described. In section 6.10 regression through the origin and weighted regression are introduced, and in section 6.11 a variety of uses of linear regression are given. A brief discussion of how to obtain the computer output given in this chapter is found in section 6.12. Finally, section 6.13 describes what to watch out for in regression analysis.

If you are reading about simple linear regression for the first time, skip sections 6.9, 6.10 and 6.11 in your first reading. If this chapter is a review for you, you can skim most of it, but read the above-mentioned sections in detail.
6.2 WHEN ARE REGRESSION AND CORRELATION USED?

The methods described in this chapter are appropriate for studying the relationship between two variables X and Y. By convention, X is called the independent variable and is plotted on the horizontal axis. The variable Y is called the dependent variable and is plotted on the vertical axis. The dependent
variable is assumed to be continuous, while the independent variable may be continuous or discrete.

The data for regression analysis can arise in two forms.

1. Fixed-X case. The values of X are selected by the researchers or forced on them by the nature of the situation. For example, in the problem of predicting the sales for a company, the total sales are given for each year. Year is the fixed-X variable, and its values are imposed on the investigator by nature. In an experiment to determine the growth of a plant as a function of temperature, a researcher could randomly assign plants to three different preset temperatures that are maintained in three greenhouses. The three temperature values then become the fixed values for X.
2. Variable-X case. The values of X and Y are both random variables. In this situation, cases are selected randomly from the population, and both X and Y are measured. All survey data are of this type, whereby individuals are chosen and various characteristics are measured on each.

Regression and correlation analysis can be used for either of two main purposes.

1. Descriptive. The kind of relationship and its strength are examined.
2. Predictive. The equation relating Y and X can be used to predict the value of Y for a given value of X. Prediction intervals can also be used to indicate a likely range of the predicted value of Y.
6.3 DATA EXAMPLE

In this section we present an example used in the remainder of the chapter to illustrate the methods of regression and correlation. Lung function data were obtained from an epidemiological study of households living in four areas with different amounts and types of air pollution. The data set used in this book is a subset of the total data. In this chapter we use only the data taken on the fathers, all of whom are nonsmokers (see Appendix A for more details).

One of the major early indicators of reduced respiratory function is FEV1, or forced expiratory volume in the first second (amount of air exhaled in 1 s). Since it is known that taller males tend to have higher FEV1, we wish to determine the relationship between height and FEV1. We exclude the data from the mothers as several studies have shown a different relationship for women. The sample size is 150. These data belong to the variable-X case, where X is height (in inches) and Y is FEV1 (in liters). Here we may be concerned with describing the relationship between FEV1 and height, a descriptive purpose. We may also use the resulting equation to determine expected or normal FEV1 for a given height, a predictive use.
[Scatterplot from the LUNG.STA data file: FEV1 (liters) on the vertical axis against FHEIGHT, the father's height in inches, on the horizontal axis, with the fitted line y = -4.087 + 0.118x.]

Figure 6.1 Scatter Diagram and Regression Line of FEV1 Versus Height for Fathers in Lung Function Data.
In Figure 6.1 a scatter diagram of the data is produced by the STATISTICA package. In this graph, height is given on the horizontal axis since it is the independent variable and FEV1 is given on the vertical axis since it is the dependent variable. Note that heights have been rounded to the nearest inch in the original data and the program marked every four inches on the horizontal axis. The circles in Figure 6.1 represent the location of the data. There does appear to be a tendency for taller men to have higher FEV1. The program also draws the regression line in the graph. The line is tilted upwards, indicating that we expect larger values of FEV1 with larger values of height. We will define the regression line in section 6.4. The equation of the regression line is given under the title as

Y = -4.087 + 0.118X
The program also includes a '+ eps', which we will ignore until section 6.6. The quantity 0.118 in front of X is greater than zero, indicating that as we increase X, Y will increase. For example, we would expect a father who is 70 inches tall to have an FEV1 value of

FEV1 = -4.087 + (0.118)(70) = 4.173

or FEV1 = 4.17 (rounded off).
or FEV1 = 4.17 (rounded oft)
88
Simple linear regression and correlation
If the height was 66 inches then we would expect an FEV1 value of only 3.7o To take an extreme example, suppose a father was 2 feet tall. Then th · ~quation would predict a. negative val~e of FE~l ( -1.~55). This exam pi: illustrates the danger of usmg the regression equatiOn outside the appropriat range. A safe policy is to restrict the use of the equation to the range of th: X observed in the sample. In order to get more information about these men, we requested descriptive s~tistics. These inclu~ed the samp~e size N = 150, the mean for each variable (X= 69.260 inches for height and Y = 4.093liters for FEV1) and the standard deviations (Sx = 2.779 and Sr = 0.651). Note that the mean height is approximately in the middle of the heights and the mean FEV1 is approximately in the middle of the FEV1 values in Figure 6.1. The various statistical packages include a wide range of options for displaying scatter plots and regression lines. Different symbols and colors can often be used for the points. This is particularly simple to do with the Windows-based programs. Usually, however, the default option is fine unless other options are desirable for inclusion in a report or an article. One option that is useful for reports is controlling the numerical values to be displayed on the axes. Plots with numerical values not corresponding to what the reader expects or to what others have used are sometimes difficult to interpret. 6.4 DESCRIPTION OF METHODS OF REGRESSION: FIXED-X CASE In this section we present the background assumptions, models and formulas necessary for understanding simple linear regression. The theoretical background is simpler for the fixed- X case than for the variable- X case, so we will begin with it. Assumptions and background For each value of X we conceptualize a distribution of values of Y. This distribution is described partly by its mean and variance at each fixed X value: Mean of Y values at a given X = ex
+ {JX
Variance of Y values at a given X
= a2
and
The basic idea of simple linear regression is that the means of Y lie ~ne~ straight line when plotted against X. Secondly, the variance of Y at a .gtV is X is assumed to be the same for all values of X. The latter assurnpuonteS called homoscedasticity, or homogeneity of variance. Figure 6.2 illustra
Methods of regression: fixed-X case
89
y
/
?
,.""'
//
~---------L------~------~---------X
0
Figure 6~2 Simple Linear Regression Model for Fixed X's.
the distribution of Y at three values of X. Note from Figure 6.2 that a 2 is not the variance of all the Y 's from their mean but is, instead, the variance of Y at a given X. It is clear that the means lie on a straight line in the range of concern and that the variance or the degree of variation is the same at the different values of X. Outside the range of concern it is immaterial to ~ur analysis what the curve looks like, and in most practical situations, linearity will hold over a limited range of X. This figure illustrates that extrapolation of a linear relationship beyond the range of concern can be dangerous. The expression ex + {JX, relating the mean of Y to X, is called the population regression equation. Figure 6.3 illustrates the meaning of the parameters ex ~nd /3. The parameter ex is the intercept of this line. That is, it is the mean of th When X = 0. The slope f3 is the amount of change in the mean of Y when th: Value of X is increased by one unit. A negative value of f3 signifies that rnean of Y decreases as X increases.
least
squares method 1'he . the farameters ct and f3 are estimated from a sample collected according to Xed-X model. The sample estimates of ex and {3 are denoted by A and
90
Simple linear regression and correlation
0
Figure 6.3 Theoretical Regression Line Illustrating a and {3 Geometri
y
30
20
10
{indicates deviation, or residual 0
5
10
Figure 6.4 Simple Data Example Illustrating Computations of 0 Figure 6.1.
Methods of regression: fixed- X case
91
ectively, and the resulting regression line is called the sample least
. equa t"Jon. B' resP es regression sq~; illustrate the method, we consider a hypothetical sample of four points,
XisfixedatX 1 = 5,X 2 = 5,X 3 = 10andX4 = 10. The sample values whefrer.e y = 14, Y 2 = 17, Y 3 = 27 and Y 4 = 22. These points are plotted f a 1 . ? Figure 6A. . . . . tn The output from the computer would mclude the followmg mformabon:
X
y
MEAN 7.5 20.0
ST.DEV. 2.8868 5.7155
REGRESSION LINE Y = 6.5
+ 1.8X
RES.MS.
8.5
The least squares method finds the line that minimizes the sum of squared vertical deviations from each point in the sample to the point on the line corresponding to the X-value. It can be shown mathematically that the least squares line is
Y= A+ BX where L(X- X)(Y- Y)
B=
L(X -X)2
and
A= Y -BX Here X and Y denote the sample means of X and Y, and f denotes the predicted value of Y for a given X. The deviation (or residual) for the first point is computed as follows: Y(1)- Y(1) = 14- [6.5
+ 1.8(5)] =
-1.5
s· .
~nularly, the other residuals are + 1.5, -2.5 and + 2.5, respectively. The 8pr~ of the squares of these deviations is 17.0. No other line can be fitted to
· ~h~ce a. smaller su~ of squared dev.iations than 17.0. . corn estimate of a 2 ts called the residual mean square (RES. MS.) and ts PUted as
S2 =RES. MS.= L(Y- Y')
2
N-_2
'~'hen lllinusU;ber N- 2, called the resi~ual deg~ees ~f fr~edom, is the sample ~ize N .._ 2 e n~~ber of parameters m the hne (m this case, ex and {3). Usmg as a dtvtsor in computing S 2 produces an unbiased estimate of a 2 • In
92
Simple linear regression and correlation
the example, S2
= RES. MS. = ~ = 8.5 4
2
The square root of the residual mean square is called the standard deviatio of the estimate and is denoted by S. n Packaged regression programs will also produce the standard errors of A and B. These statistics are computed as
and
Confidence and prediction intervals For each value of X under consideration a population of Y-values is assumed to exist. Confidence intervals and tests of hypotheses concerning the intercept, slope, and line may be made with assurance when three assumptions hold.
1. The Y-values are assumed to be normally distributed. 2. Their means lie on a straight line. 3. Their variances are all equal. For example, confidence for the slope B can be computed by using the standard error of B. The confidence interval (CI) for B is CI
= B + t[SE(B)]
where t is the 100(1 - rt./2) percentile of the t distribution with N- Zdf (degrees of freedom). Similarly, the confidence interval for A is CI = A
+ t[SE(A)]
where the same degrees of freedom are used for t. . . ways: The value t computed for a particular X can be mterpreted m two
1. Y is the point estimate of the mean of Y at that value of X. f x.. 0 2. Yis the estimate of the Yvalue for any individual with the given value . . . . . . terval The mvestigator may supplement these pomt esttmates wtth 10 e of estimates. The confidence interval (CI) for the mean of Y at a given valu
Regression and correlation: variable-X case . y X*, is
J(, sa
1
CI =
(X* _ .X)2
f + tS [ N + L(X- X)2
93
J
112
tis the 100(1- rt./2) percentile of the t distribution with N- 2df. wh;~: an individual. Y-.val~e the confidence .int~r~al is called .the prediction The p·redictiOn mterval.(PI) for an mdiVIdual Yat X* IS computed as intervaI (PI)· PI
=
1 (X*- X)2 ]1/2 y + tS [ 1 + N + (X - X)2
L
where t is the same as for the confidence interval for the mean of Y. In summary, for the fixed-X case we have presented the model for simple regression analysis and methods for estimating the para1neters of the model. Later in the chapter we will return to this model and present special cases a~d other uses. Next, we present the variable-X model. 6.5 DESCRIPTION OF METHODS OF REGRESSION AND CORRELATION: VARIABLE-X CASE in this section we present, for the variable-X case, material similar to that given in the previous section. For this model both X and Y are random variables measured on cases that are randomly selected from a population. One example is the lung function data set, where FEV1 was predicted from height. The fixed-X regression model applies in this case when we treat the X-values as if they Were preselected. (This technique is justifiable theoretically by conditioning ~~the ~-valu~s tha~ happened to be obtaine~ in the sample.) The.refore all t e PreviOus discusston and formulas are precisely the same for this case as or ~he fixed-X case. In addition, since both X and Yare considered random ~ar;ables, other parameters can be useful for describing the model. These f.tnc u~e the means and variances for X and Y over the entire population (J.lx, ax and cr~ ). The sample estimates for these parameters are usually included setc~:uter. output. For exa~ple, in the an~ysis of~he.lung f~n~tion data A. e estimates were obtamed by requestmg descnptive statistics. cauesd a hmeasure of how the variables X and Y vary together, a parameter of X a~ e Po~ulation covariance is often estimated. The population covariance the e f d Y Is defined as the average of the product (X - J.lx) ( Y - J.ly) over to in~ Ire population. This parameter is denoted by cr xY. If X and Y tend in.cre~ease together, cr xY will be positive. If, on the other hand, one tends to So t~ as the other decreases, cr xY will be negative. at the magnitude of crxY is standardized, its value is divided by the
i;'
94
Simple linear regression and correlation
product axay. The resulting parameter, denoted by p, is called the prod moment correlation coefficient, or simply the correlation coefficient. The valu Uct
eor
.
axy axay
p=--
lies between -1 and + 1, inclusive. The sample estimate for pis the sampl . coem· e correI ation c1ent r, or ·
where
s XY-
L(X- X)(Y- Y) N- 1
The sample statistic r also lies between -1 and + 1, inclusive. Further interpretation of the correlation coefficient is given in Thorndike (1978). Tests of hypotheses and confidence intervals for the variable- X case require that X and Y be jointly normally distributed. Formally, this requirement is that X and Y follow a bivariate normal distribution. Examples of the appearance of bivariate normal distributions are given in section 6.7. If this condition is true, it can be shown that Y also satisfies the three conditions for the fixed-X case. 6.6 INTERPRETATION OF RESULTS: FIXED-X CASE In this section we present methods for interpreting the results of a regression 0~~ 1 First~ t~e ty~ of the sample. must be determined. If it .is a fixed- X saiil;r~ the statistics of ~nterest a~e the mt~rcept and ~lope of the lme and the standadY error of the estimate; pomt and mterval estimates for ex and f3 have alre been discussed. ·ng The investigator may also be interested in testing hypotheses concernt the parameters. A commonly used test is for the null hypothesis
The test statistic is t = _(B_ _ f3..:..;_o)___,[L=--(X _ _X_)_z]_t_Jz
s
Interpretation of results: fixed-X case
95
e Sis the square ro~t of the residual mean .squa~e and the computed wher f t is compared with the tabled t percentiles with N- 2 degrees of val~~ to obtain the P value. Many computer programs will print the free do rd error of B. Then the t statistic is simply stan a
.
B-~
t
= SE(B)
A cornrnon value of flo is {3 0 = 0, indicating independence of X and Y, i.e. h rnean value of Y does not change as X changes. t ~ests concerning ex can also be performed for the null hypothesis H 0 : (J. == a , using 0 A- cx 0
Values of this statistic can also be compared with tables oft percentiles with N - 2 degrees of freedom to obtain the t value. If the standard error of A is printed by the program, the test statistic can be computed simply as t
A-a0 = SE(A)
For example, to test whether the line passes through the origin, the investigator would test the hypothesis α₀ = 0.
It should be noted that rejecting the null hypothesis H₀: β = 0 is itself no indication of the magnitude of the slope. An observed B = 0.1, for instance, might be found significantly different from zero, while a slope of B = 1.0 might be considered inconsequential in a particular application. The importance and strength of a relationship between Y and X is a separate question from the question of whether certain parameters are significantly different from zero. The test of the hypothesis β = 0 is a preliminary step to determine whether the magnitude of B should be further examined. If the null hypothesis is rejected, then the magnitude of the effect of X on Y should be investigated.
One way of investigating the magnitude of the effect of a typical X value is to multiply B by X̄ and to contrast this result with Ȳ. If BX̄ is small relative to Ȳ, then the magnitude of the effect of B in predicting Y is small. Another interpretation of B can be obtained by first deciding on two typical values of X, say X₁ and X₂, and then calculating the difference B(X₂ − X₁). This difference measures the change in Y when X goes from X₁ to X₂.
To infer causality, we must justify that all other factors possibly affecting Y have been controlled in the study. One way of accomplishing this control is to design an experiment in which such intervening factors are held fixed while only the variable X is set at various levels. Standard statistical wisdom
also requires randomization in the assignment to the various X levels in the hope of controlling for any other factors not accounted for (Box, 1966).

6.7 INTERPRETATION OF RESULTS: VARIABLE-X CASE

In this section we present methods for interpreting the results of a regression and correlation output. In particular, we will look at the ellipse of concentration and the coefficient of correlation.
For the variable-X model the regression line and its interpretation remain valid. Strictly speaking, however, causality cannot be inferred from this model. Here we are concerned with the bivariate distribution of X and Y. We can safely estimate the means, variances, covariance and correlation of X and Y, i.e. the distribution of pairs of values of X and Y measured on the same individual. (Although these parameter estimates are printed by the computer, they are meaningless in the fixed-X model.) For the variable-X model the interpretations of the means and variances of X and Y are the usual measures of location (center of the distribution) and variability. We will now concentrate on how the correlation coefficient should be interpreted.

Ellipse of concentration

The bivariate distribution of X and Y is best interpreted by a look at the scatter diagram from a random sample. If the sample comes from a bivariate normal distribution, the data will tend to cluster around the means of X and Y and will approximate an ellipse called the ellipse of concentration. Note in Figure 6.1 that the points representing the data could be enclosed by an ellipse of concentration. An ellipse can be characterized by the following:

1. The center.
2. The major axis, i.e. the line going from the center to the farthest point on the ellipse.
3. The minor axis, i.e. the line going from the center to the nearest point on the ellipse (the minor axis is always perpendicular to the major axis).
4. The ratio of the length of the minor axis to the length of the major axis. If this ratio is small, the ellipse is thin and elongated; otherwise, the ellipse is fat and rounded.
Interpreting the correlation coefficient

For ellipses of concentration the center is at the point defined by the means of X and Y. The directions and lengths of the major and minor axes are determined by the two variances and the correlation coefficient. For fixed values of the variances, the ratio of the length of the minor axis to that of the
major axis, and hence the shape of the ellipse, is determined by the correlation coefficient ρ.
In Figure 6.5 we represent ellipses of concentration for various bivariate normal distributions in which the means of X and Y are both zero and the variances are both one (standardized X's and Y's). The case ρ = 0, Figure 6.5(a), represents independence of X and Y. That is, the value of one variable has no effect on the value of the other, and the ellipse is a perfect circle. Higher values of ρ correspond to more elongated ellipses, as indicated in Figure 6.5(b)-(e). Monette (in Fox and Long, 1990) gives further interpretation of the correlation coefficient in relation to the ellipse.

[Figure 6.5 Hypothetical Ellipses of Concentration for Various ρ Values: (a) ρ = 0.00; (b) ρ = 0.15; (c) ρ = 0.33; (d) ρ = 0.6; (e) ρ = 0.88.]

We see that for very high values of ρ one variable conveys a lot of information about the other. That is, if we are given a value of X, we can guess the corresponding Y value quite accurately. We can do so because the range of the possible values of Y for a given X is determined by the width of the ellipse at that value of X. This width is small for a large value of ρ. For negative values of ρ similar ellipses could be drawn where the major axis (long axis) has a negative slope, i.e. in the northwest/southeast direction.
Another interpretation of ρ stems from the concept of the conditional distribution. For a specific value of X the distribution of the Y values is called the conditional distribution of Y given that value of X. The word given is translated symbolically by a vertical line, so Y given X is written as Y|X. The variance of the conditional distribution of Y, variance(Y|X), can be expressed as

    σ²_Y|X = σ²_Y (1 − ρ²)
Note that σ_Y|X = σ as given in section 6.4. This equation can be written in another form:

    ρ² = (σ²_Y − σ²_Y|X) / σ²_Y

The term σ²_Y|X measures the variance of Y when X has a specific fixed value. Therefore this equation states that ρ² is the proportion of the variance of Y reduced because of knowledge of X. This result is often loosely expressed by saying that ρ² is the proportion of the variance of Y 'explained' by X. A better interpretation of ρ is to note that

    (1 − ρ²)^(1/2) = σ_Y|X / σ_Y
This value is a measure of the proportion of the standard deviation of Y not explained by X. For example, if ρ = +0.8, then 64% of the variance of Y is explained by X. However, (1 − 0.8²)^(1/2) = 0.6, saying that 60% of the standard deviation of Y is not explained by X. Since the standard deviation is a better measure of variability than the variance, it is seen that when ρ = 0.8, more than half of the variability of Y is still not explained by X. If instead of using ρ from the population we use r² from a sample, then

    S²_Y|X = S² = [(N − 1)/(N − 2)] S²_Y (1 − r²)
and the results must be adjusted for sample size.
An important property of the correlation coefficient is that its value is not affected by the units of X or Y or any linear transformation of X or Y. For instance, X was measured in inches in the example shown in Figure 6.1, but the correlation between height and FEV1 is the same if we change the units of height to centimeters or the units of FEV1 to milliliters. In general, adding (subtracting) a constant to either variable or multiplying either variable by a constant will not alter the value of the correlation. Since Ŷ = A + BX, it follows that the correlation between Y and Ŷ is the same as that between Y and X.
If we make the additional assumption that X and Y have a bivariate normal distribution, then it is possible to test the null hypothesis H₀: ρ = 0 by computing the test statistic

    t = r(N − 2)^(1/2) / (1 − r²)^(1/2)
with N − 2 degrees of freedom. For the fathers from the lung function data set given in Figure 6.1, r = 0.504. To test the hypothesis H₀: ρ = 0 versus the alternative H₁: ρ ≠ 0, we compute

    t = 0.504(150 − 2)^(1/2) / (1 − 0.504²)^(1/2) = 7.099

with 148 degrees of freedom. This statistic results in P < 0.0001, and the observed r is significantly different from zero.
Tests of null hypotheses other than ρ = 0 and confidence intervals for ρ can be found in many textbooks (Brownlee, 1965; Dunn and Clark, 1987; Afifi and Azen, 1979). As before, a test of ρ = 0 should be made before attempting to interpret the magnitude of the sample correlation coefficient. Note that the test of ρ = 0 is equivalent to the test of β = 0 given earlier.
All of the above interpretations were made with the assumption that the data follow a bivariate normal distribution, which implies that the mean of Y is related to X in a linear fashion. If the regression of Y on X is nonlinear,
it is conceivable that the sample correlation coefficient is near zero when Y is, in fact, strongly related to X. For example, in Figure 6.6 we can quite accurately predict Y from X (in fact, the points fit the curve Y = 100 − (X − 10)² exactly). However, the sample correlation coefficient is r = 0.0. (An appropriate regression equation can be fitted by the techniques of polynomial regression - section 7.8. Also, in section 6.9 we discuss the role of transformations in reducing nonlinearities.)

[Figure 6.6 Example of Nonlinear Regression.]
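The following sketch (not from the text) carries out the t test of H₀: ρ = 0 and then applies it to data generated from the exact quadratic relationship of Figure 6.6; the X values 0, 1, ..., 20 are an assumption made only for illustration.

    import numpy as np
    from scipy import stats

    def corr_test(x, y):
        # t test of H0: rho = 0 using t = r * sqrt(N - 2) / sqrt(1 - r^2)
        n = len(x)
        r = np.corrcoef(x, y)[0, 1]
        t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
        p = 2 * stats.t.sf(abs(t), df=n - 2)
        return r, t, p

    x = np.arange(0.0, 21.0)        # assumed X values for the curve of Figure 6.6
    y = 100 - (x - 10) ** 2         # Y is determined exactly by X
    print(corr_test(x, y))          # r is essentially zero despite the exact relationship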
6.8 FURTHER EXAMINATION OF COMPUTER OUTPUT

Most packaged regression programs include other useful statistics in the printed output. To obtain some of this output, the investigator usually has to run one of the multiple regression programs discussed in Chapter 7. In this section we introduce some of these statistics in the context of simple linear regression.

Standardized regression coefficient

The standardized regression coefficient is the slope in the regression equation if X and Y are standardized. Standardization of X and Y is achieved by subtracting the respective means from each set of observations and dividing
the differences by the respective standard deviations. The resulting set of standardized sample values will have a mean of zero and a standard deviation of one for both X and Y. After standardization the intercept in the regression equation will be zero, and for simple linear regression (one X variable) the standardized slope will be equal to the correlation coefficient r. In multiple regression, where several X variables are used, the standardized regression coefficients help quantify the relative contribution of each X variable (Chapter 7).
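As an illustrative sketch (not part of the original text), the code below standardizes X and Y for a hypothetical data set and verifies that the resulting slope equals r.

    import numpy as np

    x = np.array([5.0, 5.0, 10.0, 10.0])   # hypothetical data
    y = np.array([14.0, 17.0, 27.0, 22.0])

    zx = (x - x.mean()) / x.std(ddof=1)    # standardized X
    zy = (y - y.mean()) / y.std(ddof=1)    # standardized Y

    b_std = np.sum(zx * zy) / np.sum(zx ** 2)   # slope of the regression of zy on zx
    r = np.corrcoef(x, y)[0, 1]
    print(b_std, r)                             # the two numbers agree; the intercept is zero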
Analysis of variance table

The test for H₀: β = 0 was discussed in section 6.6 using the t statistic. This test allows one-sided or two-sided alternatives. When the two-sided alternative is chosen, it is possible to represent the test in the form of an analysis of variance (ANOVA) table. A typical ANOVA table is represented in Table 6.1.
If X were useless in predicting Y, our best guess of the Y value would be Ȳ regardless of the value of X. To measure how different our fitted line Ŷ is from Ȳ, we calculate the sum of squares for regression as Σ(Ŷ − Ȳ)², summed over each data point. (Note that Ȳ is the average of all the Y values.) The residual mean square is a measure of how poorly or how well the regression line fits the actual data points. A large residual mean square indicates a poor fit. The F ratio is, in fact, the squared value of the t statistic described in section 6.6 for testing H₀: β = 0.

Table 6.1 ANOVA table for simple linear regression
Source of variation    Sums of squares    df       Mean square        F
Regression             Σ(Ŷ − Ȳ)²          1        SS_reg/1           MS_reg/MS_res
Residual               Σ(Y − Ŷ)²          N − 2    SS_res/(N − 2)
Total                  Σ(Y − Ȳ)²          N − 1

Table 6.2 ANOVA example from Figure 6.1

Source of variation    Sums of squares    df      Mean square    F
Regression             16.0532            1       16.0532        50.50
Residual               47.0451            148     0.3179
Total                  63.0983            149
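A small computational sketch (not from the text) that assembles the ANOVA quantities for a hypothetical data set and confirms that the F ratio equals the square of the slope t statistic:

    import numpy as np

    x = np.array([5.0, 5.0, 10.0, 10.0])   # hypothetical data
    y = np.array([14.0, 17.0, 27.0, 22.0])
    n = len(x)

    sxx = np.sum((x - x.mean()) ** 2)
    b = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    a = y.mean() - b * x.mean()
    y_hat = a + b * x

    ss_reg = np.sum((y_hat - y.mean()) ** 2)    # regression sum of squares
    ss_res = np.sum((y - y_hat) ** 2)           # residual sum of squares
    ms_res = ss_res / (n - 2)                   # residual mean square
    f_ratio = (ss_reg / 1) / ms_res

    t_b = b / (np.sqrt(ms_res) / np.sqrt(sxx))  # t statistic for H0: beta = 0
    print(f_ratio, t_b ** 2)                    # F equals t squared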
Table 6.2 shows the ANOVA table for the fathers from the lung function data set. Note that the F ratio of 50.50 is the square of the t value of 7.099 obtained when testing that the population correlation was zero. It is not precisely so in this example because we rounded the results before computing the value of t. We can compute S by taking the square root of the residual mean square, 0.3179, to get 0.5638.

Data screening in simple linear regression

For simple linear regression, a scatter diagram such as that shown in Figure 6.1 is one of the best tools for determining whether or not the data fit the basic model. Most researchers find it simplest to examine the plot of Y against X. Alternatively, the residuals e = Y − Ŷ can be plotted against X. Table 6.3 shows the data for the hypothetical example presented in Figure 6.4. Also shown are the predicted values Ŷ and the residuals e. Note that, as expected, the mean of the Ŷ values is equal to Ȳ. Also, it will always be the case that the mean of the residuals is zero. The variance of the residuals from the sample regression line will be discussed later in this section.
Examples of three possible scatter diagrams and residual plots are illustrated in Figure 6.7. In Figure 6.7(a), the idealized bivariate (X, Y) normal distribution model is illustrated, using contours similar to those in Figure 6.5. In the accompanying residual plot, the residuals plotted against X would also approximate an ellipse. An investigator could make several conclusions from Figure 6.7(a). One important conclusion is that no evidence exists against the linearity of the regression of Y on X. Also, there is no evidence for the existence of outliers (discussed later in this section). In addition, the normality assumption used when confidence intervals are calculated or statistical tests are performed is not obviously violated.
In Figure 6.7(b) the ellipse is replaced by a fan-shaped figure. This shape suggests that as X increases, the standard deviation of Y also increases. Note that the assumption of linearity is not obviously violated but that the assumption of homoscedasticity does not hold. In this case, the use of weighted least squares is recommended (section 6.10).
Table 6.3 Hypothetical data example from Figure 6.4

        X      Y     Ŷ = 6.5 + 1.8X    e = Y − Ŷ = Residual
1       5      14    15.5              −1.5
2       5      17    15.5               1.5
3       10     27    24.5               2.5
4       10     22    24.5              −2.5
Mean    7.5    20    20.0               0
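The fitted values and residuals of Table 6.3 can be reproduced with a few lines of Python (an illustrative sketch, not part of the original text):

    import numpy as np

    x = np.array([5.0, 5.0, 10.0, 10.0])
    y = np.array([14.0, 17.0, 27.0, 22.0])

    y_hat = 6.5 + 1.8 * x     # fitted line of Table 6.3
    e = y - y_hat             # residuals
    print(y_hat)              # [15.5 15.5 24.5 24.5]
    print(e)                  # [-1.5  1.5  2.5 -2.5]
    print(e.mean())           # 0.0; the residuals always average to zero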
[Figure 6.7 Hypothetical Scatter Plots and Corresponding Residual Plots: (a) bivariate normal; (b) standard deviation of Y increases with X; (c) regression of Y not linear in X. Each panel pairs a scatter diagram of Y against X with a plot of the residuals Y − Ŷ against X.]
In Figure 6.7(c) the ellipse is replaced by a crescent-shaped form, indicating that the regression of Y is not linear in X. One possibility for solving this problem is to fit a quadratic equation as discussed in Chapter 7. Another possibility is to transform X into log X or some other function of X, and then fit a straight line to the transformed X values and Y. This concept will be discussed in section 6.9.
Formal tests exist for testing the linearity of the simple regression equation when multiple values of Y are available for at least some of the X values. However, most investigators assess linearity by plotting either Y against X or Y − Ŷ against X. The scatter diagram with Y plotted against X is often simpler to interpret. Lack of fit to the model may be easier to assess from the residual plots as shown on the right-hand side of Figure 6.7. Since the residuals always have a zero mean, it is useful to draw a horizontal line through the zero point on the vertical axis as has been done in Figure 6.7. This will aid in checking whether the residuals are symmetric around their mean (which is expected if the residuals are normally distributed). Unusual clusters of points can alert the investigator to possible anomalies in the data.
The distribution of the residuals about zero is made up of two components. One is called a random component and reflects incomplete prediction of Y from X and/or imprecision in measuring Y. But if the linearity assumption is not correct, then a second component will be mixed with the first, reflecting lack of fit to the model. Most formal analyses of residuals only assume that the first component is present. In the following discussion it will be assumed that there is no appreciable effect of nonlinearity in the analysis of residuals. It is important that the investigator assess the linearity assumption if further detailed analysis of residuals is performed.
Most multiple regression programs provide lists and plots of the residuals. The investigator can either use these programs, even though a simple linear regression is being performed, or obtain the residuals by using the transformation Y − (A + BX) = Y − Ŷ = e and then proceed with the plots. The raw sample regression residuals e have unequal variances and are slightly correlated. Their magnitude depends on the variation of Y about the regression line. Most multiple regression programs provide numerous adjusted residuals that have been found useful in regression analysis. For simple linear regression, the various forms of residuals have been most useful in drawing attention to important outliers. The detection of outliers was discussed in Chapter 3 and the simplest of the techniques discussed there, plotting histograms, can be applied to residuals. That is, histograms of the residual values can be plotted and examined for extremes.
In recent years, procedures for the detection of outliers in regression programs have focused on three types of outliers (Chatterjee and Hadi, 1988; Fox, 1991):

1. outliers in Y from the regression line;
2. outliers in X;
3. outliers that have a large influence on the estimate of the slope coefficient.

The programs include numerous types of residuals and other statistics so that the user can detect these three types of outliers. The more commonly used ones will be discussed here (for a more detailed discussion see Belsley, Kuh and
Welsch, 1980; Cook and Weisberg, 1982; Fox, 1991; Chatterjee and Hadi, 1988). A listing of the options in packaged programs is deferred to section 7.10 since they are mostly available in the multiple regression programs. The discussion is presented in this chapter since plots and formulas are easier to understand for the simple linear regression model.
Outliers in Y

Since the sample residuals do not all have the same variance and their magnitude depends on how closely the points lie to the straight line, they are often simpler to interpret if they are standardized. If we analyze only a single variable Y, then it can be standardized by computing (Y − Ȳ)/S_Y, which will have a mean of zero and a standard deviation of one. As discussed in Chapter 3, formal tests for detection of outliers are available, but often researchers simply investigate all standardized residuals that are larger than a given magnitude, say three. This general rule is based on the fact that, if the data are normally distributed, the chance of getting a value greater in magnitude than three is very small. This simple rule does not take sample size into consideration but is still widely used. A comparable standardization for a residual in regression analysis would be (Y − Ŷ)/S or e/S, but this does not take the unequal variances into account. The adjusted residual that accomplishes this is usually called a studentized residual:

    studentized residual = e / [S(1 − h)^(1/2)]

where h is a quantity called leverage (to be defined later when outliers in X are discussed). Note that this residual is called a standardized residual in BMDP programs; Chatterjee and Hadi (1988) call it an internally studentized residual. (Standard nomenclature in this area is yet to be finalized and it is safest to read the description in the computer program used to be sure of the interpretation.) In addition, since a single outlier can greatly affect the regression line (particularly in small samples), what are often called deleted studentized residuals can be obtained. These residuals are computed in the same manner as studentized residuals, with the exception that the ith deleted studentized residual is computed from a regression line fitted to all but the ith observation. Deleted residuals have two advantages. First, they remove the effect of an extreme outlier in assessing the effect of that outlier. Second, in simple linear regression, if the errors are normally distributed, the deleted studentized residuals follow a Student t distribution with N − 3 degrees of freedom. The deleted studentized residuals are also given different names by different authors and programs. Chatterjee and Hadi (1988) call them externally studentized residuals. They are called studentized residuals in STATA and SYSTAT, RSTUDENT in SAS, DSTRRESID in SPSS and DSTRESID in BMDP.
These and other types of residuals can either be obtained in the form of plots of the desired residual against Ŷ or X, or in lists. The plots are useful for large samples or quick scanning. The lists of the residuals (sometimes accompanied by a simple plot of their distance from zero for each observation) are useful for identifying the actual observation that has a large residual. Once the observations that have large residuals are identified, the researcher can examine those cases more carefully to decide whether to leave them in, correct them, or declare them missing values.
Outliers in X

Possible outliers in X are measured by statistics called leverage statistics. One measure of leverage for simple linear regression is called h, where

    h = 1/N + (X − X̄)² / Σ(X − X̄)²
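As an illustrative sketch (not from the text), the following Python code computes the leverages h and the studentized residuals for a small hypothetical sample in which the last observation has an extreme X value:

    import numpy as np

    x = np.array([5.0, 5.0, 10.0, 10.0, 30.0])    # the last X value is deliberately extreme
    y = np.array([14.0, 17.0, 27.0, 22.0, 60.0])  # hypothetical responses
    n = len(x)

    sxx = np.sum((x - x.mean()) ** 2)
    b = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    a = y.mean() - b * x.mean()
    e = y - (a + b * x)
    s = np.sqrt(np.sum(e ** 2) / (n - 2))

    h = 1.0 / n + (x - x.mean()) ** 2 / sxx       # leverage of each observation
    stud = e / (s * np.sqrt(1 - h))               # studentized residuals
    print(h)      # the extreme X value receives by far the largest leverage
    print(stud)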
When X is far from X̄, then the leverage is large, and vice versa. The size of h is limited to the range

    1/N ≤ h ≤ 1
The leverage h for the ith observation tells how much Y for that observation contributes to the corresponding Ŷ. If we change Y by a quantity ΔY for the ith observation, then hΔY is the resulting change in the corresponding Ŷ. Observations with large leverages possess the potential for having a large effect on the slope of the line. It has been suggested for simple linear regression that a leverage is considered large if the value of h for the ith observation is greater than 4/N (Fox and Long, 1990). Figure 6.8 includes some observations that illustrate the difference between points that are outliers in X and in Y. Point 1 is an outlier in Y (large residual), but has low leverage since it is close to X̄. It will affect the estimate of the intercept but not the slope; it will tend to increase the estimate of S and hence the standard error of the slope coefficient B. Point 2 has high leverage, but will not affect the estimate of the slope much because it is not an outlier in Y. Point 3 is both a high leverage point and an outlier in Y, so it will tend to reduce the value of the slope coefficient B and to tip the regression line downward. It will also affect the estimates of A and S, and thus have a large effect on the statements concerning the regression line. Note that this is true even though the residual e is less for point 3 than it is for point 1. Thus, looking solely at residuals
in Y may not tell the whole story, and leverage statistics are important to examine if outliers are a concern.

[Figure 6.8 Illustration of the Effect of Outliers.]

Influential observations
A direct approach to the detection of outliers is to determine the influence of each observation on the slope coefficient B. Cook (1977) derived a function called Cook's distance, which provides a scaled distance between the value of B when all observations are used and B(−i), the slope when the ith observation is omitted. This distance is computed for each of the N observations. Observations resulting in large values of Cook's distance should be examined as possible influential points or outliers. Cook (1977) suggests comparing the Cook's distance with the percentiles of the F distribution with P + 1 and N − P − 1 degrees of freedom. Here, P is the number of independent variables. Cook's distances exceeding the 95th percentile are recommended for careful examination and possible removal from the data set.
Other distance measures have been proposed (Chatterjee and Hadi, 1988). The modified Cook's distance measures the effect of an observation on both the slope B and on the variance S². The Welsch-Kuh distance measure (also called DFFITS) is the scaled distance between Ŷ and Ŷ(−i), i.e. Ŷ derived with the ith observation deleted. Large values of DFFITS also indicate an influential observation. DFFITS tends to measure the influence on B, A and S² simultaneously.
A general lesson from the research work on outliers in regression analysis
is that, when one examines either the scatter diagram of Y versus X or the plot of the residuals versus X, more attention should be given to the points that are outliers in both X and Y than to those that are only outliers in Y.

Lack of independence among residuals

When the observations can be ordered in time or place, plots of the residuals
similar to those given in Figure 4.3 can be made and the discussion in section 4.4 applies here directly. If the observations are independent, then successive residuals should not be appreciably correlated. The serial correlation, which is simply the correlation between successive residuals, can be used to assess lack of independence. For a sufficiently large N, the significance levels for the usual test of ρ = 0 apply approximately to the serial correlation. Another test statistic available in some packaged programs is the Durbin-Watson statistic. The Durbin-Watson statistic is approximately equal to 2(1 − serial correlation). Thus when the serial correlation is zero, the Durbin-Watson statistic is close to two. Tables of significance levels can be found in Theil (1971). The Durbin-Watson statistic is used to test whether the serial correlation is zero when it is assumed that the correlation between successive residuals is restricted to a correlation between immediately adjacent residuals.
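The serial correlation and the Durbin-Watson statistic can be computed directly; the sketch below (not from the text) uses artificial residuals that are independent by construction, so both diagnostics should be unremarkable.

    import numpy as np

    def durbin_watson(e):
        # Sum of squared successive differences divided by the sum of squared residuals
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    rng = np.random.default_rng(0)
    e = rng.normal(size=100)                        # stand-in for regression residuals

    serial_r = np.corrcoef(e[:-1], e[1:])[0, 1]     # correlation between successive residuals
    print(durbin_watson(e), 2 * (1 - serial_r))     # the two values are close for large N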
Normality of residuals

Some regression programs provide normal probability plots of the residuals to enable the user to decide whether the data approximate a normal distribution. If the residuals are not normally distributed, then the distribution of Y at each value of X is not normal. For simple linear regression with the variable-X model, many researchers assess bivariate normality by examining the scatter diagram of Y versus X to see if the points approximately fall within an ellipse.
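A normal probability plot of the residuals can be produced with SciPy; the sketch below (not part of the original text) uses artificial residuals in place of those from a fitted regression.

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    e = rng.normal(size=150)                     # stand-in for regression residuals

    stats.probplot(e, dist="norm", plot=plt)     # normal probability (Q-Q) plot
    plt.title("Normal probability plot of residuals")
    plt.show()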
6.9 ROBUSTNESS AND TRANSFORMATIONS FOR REGRESSION ANALYSIS
. section · we define the concept of robustness m · statistic · · a1 an alysis ' and In this we discuss the role of transformations in regression and correlation. Robustness and assumptions . .. . . . bout tbe RegressiOn and correlatiOn analysts make certam assumptm~s .a e tbat population from which the data were obtained. A robust analysiS JS on se of is useful even though all the assumptions are not met. For the ~ur~~uted, fitting a straight line, we assume that the Y values are normally dtstfl d we the population regression equation is linear in the range of concern, an
Robustness and transformations
109
.. f y is the same for all values of X. Linearity can be checked varia~ce lloy and transformations can help straighten out a nonlinear regression graphtca , line· . ssumption of homogeneity of variance is not crucial for the resulting The a ares line. In fact, the least squares estimates of a and f3 are unbiased )east ~qu or not the assumption is valid. However, if glaring irregularities of wh:t e~ exist, weighted least squares can improve the fit. In this case the v~Ia~~ are chosen to b~ pr~portional ~o the inverse of the _vari~nce. For w g le if the variance Is a hnear functiOn of X, then the wetght Is 1/X. examP ' · of t h e· Y va1ues of each va1ue ~ f X Is · rnad e The assumption of normality only when tests of hypotheses are .performed. o~ con~dence mtervals. are calculated. It is generally a~reed m the statl~tlcal literature. that slig~t departures from this assumption do not appreciably alter our mferences tf the sample size is sufficiently large. The lack of randomness in the sample can seriously invalidate our inferences. Confidence intervals are often optimistically narrow because the sample is not truly a random one from the whole population to which we wish to generalize. In all of the preceding analyses linearity of the relationship between X and Y was assumed. Thus careful examination of the scatter diagram should be the first step in any regression analysis. It is advisable to explore various transformations of Y and/or X if nonlinearity of the original measurements is apparent. Tr~nsformations
The subject of transformations has been discussed in detail in the literature (e.g. Hald, 1952; Mosteller and Tukey, 1977). In this subsection we present some typical graphs of the relationship between Y and X and some practical transformations. d' In. Chapter 4 we discussed the effects of transformations on the frequency 0~stnbution. There it was shown that taking the logarithm or square root rnaa ~umber condensed the magnitude of larger numbers and stretched the gre gtnitude of values less than one. Conversely, raising a number to a power a er tha . h one. Thes none st~etc es the lar~e value~ and condenses _the values less t~an to str . he properties are useful m selectmg the appropriate transformatiOn M atg ten out a nonlinear graph of one variable as a function of another. linea;s~eller and Tukey (1977) present typical regression curves that are not corn 111 In X as belonging to one of the quadrants of Figure 6.9(a). A very quadraon case is illustrated in Figure 6.9(b), which is represented by the fourth Ill' . 1 . F' . Ight bnt oft"'~ '~ ctrc e In 1gure 6.9(a). For example, the curve in Figure 6.9(b) Anothere made linear by transforming X to log X, to -1/X or to X 1 12 . Possibility would be to transform Y to Y 2 . The other three cases
110
Simple linear regression and correlation Shape of curve
/ log X
-II X
IV
I
III
II
JX
log X -1/X
.JX
a. Curves Not Linear in X
y Curve IV
~---------------------X
b. Detail of Fourth Quadrant FigUre 6.9 Choice of Transformation: Typical Curves and Appropriate Transformation.
are also indicated in Figure 6.9(a). The remaining quadrants are interpreted ~n a similar fashion. Other transformations could also be attempted, such as powers other than those indicated. It may also be useful to first add or subtract a constant fror all values of X or Y and then take a power or logarithms. For e:x~II1P e, sometimes taking log X does not straighten out the curve sufficte~~~) Subtracting a constant C (which must be smaller than the smallest X va and then t~ki~~ the logarithm has a greater effect. . . . fan The avadabthty of packaged programs greatly facilitates the chotce 0 . ns appropriate transformation. New variables can be created that are functJDeW 0 of the original variables:, and scatter diagrams can be obtained of the
1
Other options in computer programs
111
ed variables. Visual inspection will often indicate the best transformation. Also, the magnitude of the correlation coefficient r will tra~sfor7he best linear fit since it is a measure of linear association. Attention iodtc~;e lso be paid to transformations that are commonly used in the field shoU Ii~ation and that have a particular scientific basis or physical rationale. ofaPP. the transformation is selected, all subsequent estimates and tests are t". d 1 s· h . b. . once erforrned in terms of the trans1o~e va ues. mc~ t e vana le t? be P d. ted is usually the dependent variable Y, transformmg Y can complicate pre tc h 1. . . . h .f . the interpretation of t e :esu tm~g regr:sston equation mor~ t an 1 .x ~s transformed. For example, 1flog X ts used mstead ofX,the resultmg equation Is Y =A+ Blog 10 X
This equation presents no problems in interpreting the predicted values of Y, and most investigators accept the transformation of log 10 X as reasonable in certain situations. However; if log Y is used instead of Y, the resulting equation is log 10 Y= A+ BX Then the predicted value of Y, say Y*, must be detransformed, that is, Y* = toA-Bx
Thus slight biases in fitting log Y could be detransformed into large biases in predicting Y. For this reason most investigators look for transformations of X first.
6·10 OTHER OPTIONS IN COMPUTER PROGRAMS In this .section we discuss two options available from computer programs:
regressiOn through the origin and weighted regression. Regression through the origin
Sometime. . . . . . . throu h t s an.t~vesttga~or Is convmced th~t the regre~s1~n Ime should pass g he ongm. In thts case the appropnate modelts simply the mean of Y= {3X
l'hat is, th 1. . . . option of e . nterc.ept IS forced to be zero. The programs usually gtve the Ustng thts model and estimate f3 as
112
Simple linear regression and correlation
To test H 0 : f3
=
f3 0 , the test statistic is
where
s = [I(Y- BX)2]1/2 N-1 and t has N - 1 degrees of freedom. Weighted least squares regression The investigator may also request a weighted least squares regression line. In weighted least squares each observation is given an individual weight reflecting its importance or degree of variability. There are three common situations in which a weighted regression line is appropriate. 1. The variance of the distribution at a given X is a function of the X value. An example of this situation was shown in Figure 6.7(b). 2. Each Y observation is, in fact, fhe mean of several determinations, and that number varies from one value of X to another. 3. The investigator wishes to assign different levels of importance to different points. For example, data from different countries could be weighted either by the size of the population or by the perceived accuracy of the data.
Formulas for weighted least squares are discussed in Dunn and Clark (1987) and Brownlee (1965). In case 1 the weights are the inverse of the variances of the point. In case 2 the weights are the number of determinations at each X point. In case 3 the investigator must make up numerical weights to reflect the perception of importance or accuracy of the data points. In weighted least squares regression the estimates of rx and f3 and ~ther statistics are adjusted to reflect these special characteristics of the observatwns. In most situations the weights will not affect the results appreciably unles~ they are quite different from each other. Since it is considerably more wdord · 1east squares. regr~sston · ~quauon, · · ts · recommen e It r than to compute a weighted that one of the computer programs hsted m section 6.12 be used, rathe hand calculations. 6.11 SPECIAL APPLICATIONS OF REGRESSION . . . we d'tscuss some Important . In t h .ts section app1'tcattons of regres st'on analystS . ueS· that require some caution when the investigator is using regression techntq
Special applications of regression
113
calibration on situation in laboratories or industry is one in which an instrument A co~~esigned to measure a certain characteristic needs calibration. For that JS Ie the concentration of a certain chemical could be measured by an exarnPrne,nt For calibration of the instrument several compounds with known · stru · 111 ncentrations, denoted by X, co~ld be used, and the measurements, denoted co Y, could be determined b~ the Instrument for each compo~nd. As a second by pie a costly or destructive method of measurement that IS very accurate, ~~ ' . denoted by X, co~ld be compared With anot?er method of ~eas~rem~nt, denoted by Y, that IS a _less. costly or nondestructive method. In either situation re than one determination ofthe level of Ycould be made for each value of X. rn~he object of the analysis is to derive, from the calibration data, a formula or a graph that can be used to predict the unknown value of X from the measured Yvalue in future determinations. We will present two methods (indirect and direct) that have been developed and a technique for choosing between them. .
.
.
.
1. Indirect method. Since X is assumed known with little error and Y is the random variable, this classical method begins with finding the usual regression of Yon X,
Y=
A+ BX
Then for a determination Y* the investigator obtains the indirect estimate of X, Xin' as
X.
=
m
2 ·
Y*- A B
T~is method is illustrated in Figure 6.10(a). ~arect method. Here we pretend that X is the dependent variable and Y ~s the independent variable and fit the regression of X on Y as illustrated 1 p· n tgure 6.10(b). We denote the estimate of X by Xdr; thus
Xdr =
c + DY* ·
forTa~ compare th~ two methods; we compute the quantities ~in and Xdr g of the data m the sample. Then we compute the correlation between and X, denoted by r(in), and the correlation between Xd and X deinnoted b r ' carrel . . Y r(dr). We. then. use the ~ethod ·tha.t results .in the higher 967 )ahon. Further discussion of thts probl~m Is foun~ m Krutchkoff It. ' Be~kson (1969), Shukla (1972) and Lwm and Mantz (1982). test ~hadvtsable, before u~in~ ei~her of the~e methods, that the in~estigator ether the slope B ts stgmficantly dtfferent from zero. Thts test can
o
114
Simple linear regression and correlation X
y 1\
Y=A+BX
"
x~c+ny
"
Xdr~------------~-
..___ _ _ _ _ _
..~-
___ X
1\
Y*
X in
a. Indirect Method
y
b. Direct Method
Figure 6.10 Illustration of Indirect and Direct Calibration Methods.
be done by the usual t test. If the hypothesis H 0 :{3 = 0 is not rejected, this result is an indication that the instrument should not be used, and the investigator should not use either equation.
Forecasting Forecasting is the technique of analyzing historical data in order to provide estimates of the future values of certain variables of interest. If data taken over time (X) approximate a straight line, forecasters might assume that this relationship would continue in the future and would use simple linear regression to fit this relationship. This line is extended to some point in the future. The difficulty in using this method lies in the fact that we are never sure that the linear trend will continue in the future. At any rate, it is advisable not to make long-range forecasts by using this method. Further discussion and additional methods are found in Montgomery and Johnson (1976), Levenbach and Cleary (1981) and Chatfield (1984). Paired data Another situation where regression analysis may be appropriate is th~ ~aired-s~mple case. A paired sample cons~sts. ~f observatio~s taken a~c:d tm:te pmnts (pre- and post-) on the ~ai?e md~vtdual or consists of rn~d the pairs, where one member of the pair Is subJeCt to one treatment a other is a. cont~ol or is subject to anothe~ treatment. d as a Many Investigators apply only the patred t test to such data an 'fhe result may miss an important relationship that could exist in the data.k for 100 investigator should, at least, plot a scatter diagram of the data and . eerns possible linear and nonlinear relationships. If a linear relatiooshiP.~ed ill appropriate, then the techniques of regression and correlation descfl
Discussion of computer programs
115
. h pter are useful in describing this relationship. If a nonlinear relationship th~sc ~ransformations should be considered. extsts, residuals fOCU S 00 . . times, performing the regression analysis may be only a preliminary Solll~o obtaining the residuals, which are the quantities of interest to be step d for future analysis. For example, in the bone density study described ~av~hapter 1, the variable age has a major effect on predicting bone density :C elderly women. <:>ne method o_f taking a~e into a.cc?unt in future ~naly~es uld be to fit a simple regression equation predicting bone density, with :; as the only indepe~dent varia?le. Then futur~ analys~s o~ bone density could be performed ustng the residuals from this equatiOn mstead of the original bone density measurements. These residuals would have a mean of zero. Examination of subgroup means is then easy to interpret since their difference from zero can be easily seen. Adjusted estimates of Y
An alternative method to using residuals is to use the slope coefficient B from the regression equation to obtain an age-adjusted bone density for each individual, which is an estimate of their bone density at the average age for the entire sample. This adjusted bone density is computed for each individual from the following equation, adJusted Y = Y + B(X - X) where ~ and Y are the values for each individual. The adjusted Y has a mean Y and thus allows the researcher to work with numbers of the usual magnitude. 6 12 DISCUSSION OF COMPUTER PROGRAMS · AU . . output . "10r simp . 1e 1'mear regressiOn . since . this the . package . . s provi'de suffi Cient co"" Isla Widely Used analysis. This will not be the case when we discuss more ex mult' . d . we•up Will . I variate proce . ures m subsequent chapters. In those chapters has a pr?vide a table that shows whether or not each statistical package So Particular feature. . d.m this . chapter can only he obtained from sotnernef of the st. a t'ts t'tcs mentwne ciescribo d t?e packages by using the multiple linear regression programs linear r: 10 ~ection 7.10. These programs can be used to perform simple We Used gression as well. This was the case with STATISTICA, the package to obtain the results for the fathers given in Figure 6.1 and elsewhere
116
Simple linear regression and correlation
····················
ou''•~•••
'•••
u, ••••••••••••
·0.6
0
0.6
1.2
1.8
Residuals
Figure 6.11 Normal Probability Plot of the Residuals of the Regression of FEVl on Height for Fathers in the Lung Function Data. .
in this chapter. The multiple regression program was run and results that applied only to multiple regression were ignored. Since STATISTICA is a Windows-based package the analysis was performed by clicking on the appropriate choices and then examining the output. First, descriptive statistics were obtained by choosing 'Basic Statistics/Tables' from the STATISTICA module switcher (which appears first when STATISTICA is accessed). This was followed by clicking 'Descriptive Statistics'. To obtain the scatter plot, 'Graph' was chosen from the Windows menu at the top of the screen and then '2D' from the pull-down list and finally 'Scatter Plots'; To obtain further output, 'Analysis' was chosen from the Windows rn~nu and 'Other Statistics' chosen. From this the multiple linear regression opuon was picked. As with other Windows-based packages the print option was chosen from the 'File' menu at the top of the .screen. hr STA!ISTICA also provides numerous .graphs ~f resi~uals and o;d ea regressio~ o~tpu_t. For example, to check if t~~ residuals m Y f~llo~s for normal distnbutton, we ran a normal probability plot of the residua in the regression equation predicting FEVl from height. As can be seen Figure 6.11, the residuals appear to be normally distributed. ters. Note that we will feature different statistical packages in different chaP0 ugll For simple linear regression, all the packages provide more than ~ first output. The non-Windows-based packages may take a little longer t e time but all of them provide the needed output.
What to watch out for
117
WHAT TO WATCH OUT FOR . 6 13 tions 6.8 and 6.9, methods for checking for outliers, normality, l~~~~eneity of varia~ce, and inde~enden~e were pr~sented alon.g with .a h . f discussion of the Importance of mcluding checks m the analysts. In thts bne· n other cautiOns · · d. WI·n b e emph astze sectlo , Sampling process. In the development of the theory for linear regression, 1· the sample is assumed to be obtained randomly in such a way that it represents the whole population. Often, convenience samples, which are sarnples of easily available cases, are taken for economic or other reasons. In surveys of the general population, more complex sampling methods are often employed. The net effect of this lack of simple random sampling from a defined population is likely to be an underestimate of the variance and possibly bias in the regression line. Researchers should describe their sampling procedures carefully so their results can be assessed with this in mind 2. Errors in variables. Random errors in measuring Y do not bias the slope and intercept estimates but do affect the estimate of the variance. Errors in measuring X produce estimates of f3 that are biased towards zero. Methods for estimating regression equations that are resistant to outliers are given in Hoaglin, Mosteller. and Tukey (1983) and Fuller (1987). Such methods should be considered if measurement error is likely to be large. 3. The use of nominal or ordinal, rather than interval or ratio, data in regression analysis has several possible consequences. First, often the number of distinct values can be limited. For example, in the depression data set, the self-reported health scale has only four possible outcomes: excellent, good, fair or poor. It has been shown that when interval data are grouped into seven or eight categories, later analyses are little affected; but coarser grouping may lead to biased results. Secondly, the intervals between the numbers assigned to the scale values (l, 2, 3 and 4 for the health scale, for example) may not be properly chosen. Perhaps poor health should be given a value of 5 as it is an extreme answer and is more tha~ one unit from fair health (Abelson and Tukey, 1963). 4 · Chotce of model. Particularly for the variable- X model, it is essential not only that the model be in the correct analytic form, i.e. linear if a linear ~.del ~s assumed, but also that the appropriate X variables be chosen. X ~sa~lll b~ discuss~. in mo.re detail in ~hapter 8. Here, sin~ only. one ( table Is used, tt Is possible that an Important confoundmg vanable ao~~ correlate~ with bot~ X and Y) may be omitted. This could result in is gh correlation coeffiCient between X and Y and a well-fitting line that de~~~sed by. the omission of the confounding v~riable. For example, FEV1 to ases With age for adults as does bone denstty. Hence, we would expect In see spurious correlation and regression between bone density and FEVl. section 6.11, the use of residuals to assist in solving this problem was
118
Simple linear regression and correlation
discussed and further results will be given in the following chapters on . Ie regression . ana1ysis. . muIttp 5. In using the computed regression equation to predict Y from X it . ~dvisable to restrict the pre~ctio~ to X ~alues ~ithin the range obs,erve~ m t~e sai?ple, unless the tnv~stlga~or Is ~rtam t.hat the same linear relationship between Y and X Is· valid outs1de of this range. 6. An observed correlation between Y and X should not be interpreted t mean a causal relationship between X and Y regardless of the magnitud~ of the correlation. The correlation may be due to a causal effect of X on Y, of Y on X, or of other variables on both X and Y. Causation should be inferred only after a careful examination of a variety of theoretical and experimental evidence, not merely from statistical studies based on correlation. 7. In most multivariate studies, Y is regressed on more than an X variable. This subject will be treated in the next three chapters. SUMMARY In this chapter we gave a conventional presentation of simple linear regression, similar to the presentations found in many statistical textbooks. Thus the reader familiar with the subject should be able to make the transition to the mode we follow in the remainder of the book. We made a clear distinction between random- and fixed-X variable regression. Whereas most of the statistics computed apply to both cases, certain statistics apply only to the random-X case. Computer programs are written to provide more output than is sometimes needed or appropriate. You should be aware of which model is being assumed so that you can make the proper interpretation of the results. If packaged programs are used, it may be sufficient to run one of the simple plotting programs in order to obtain both the plot and the desired statistics. It is good practice to plot many variables against each other in the prelimi~ary stages of data analysis. This practice allows you to examine the data m a straightforward fashion. However, it is also advisable to use the more sophisticated programs as warm-ups for the more complex data analyses . . . this discussed in later c~apters.. 1 The concepts of simple lmear regressiOn and correlatiOn presented ? f 0 chapter can be extended in several directions. Chapter 7 treats the fitung r linear regression equations to more than one independent vaiiable. Ch.apt~n 8 gives methods for selecting independent variables. Additional top~cs·~g regression analysis are covered in Chapter 9. These topics inclu?e mtsSI values, dummy variables, segmented regression and ridge regresswn. REFERENCES
References preceded by an asterisk require strong mathematical background.
References
119
General "'
T~ke~,
R.P. and J.W. (_1963). Efficient utilization of no?-numerical Atu.eison, rmation in quantitative analysts: General theory and the case of stmple order. ~a/s of Mathematical Statistics, 34, 1347-69.
.
:~.A. and Azen, S.P. (1979). Statistical Analysis: A Aii z~d edn; Academic Press, New York. .
Computer Oriented Approach, . G.E.P. (1966). Use and abuse of regressiOn. Technometrzcs, 8, 625-9. ::~nlee,K.A. (1965). Statistical Theory and Methodology in Science and Engineering, znd edn. Wiley, New York. . . . . . Dunn, OJ. and Clark, V:A. (1987). Applzed Statzstzcs: Analysis of Varzance and Regression, 2nd edn. Wiley, New York. . Fox, J. and Long, J.S. (eds) (1990). Modern Methods of Data Analyszs. Sage, Newbury Park, CA. Fuller, W.A. (1987). Measurement Error Models. Wiley, New York. Hald, A. (1952). Statistical Theory with Engineering Applications. Wiley, New York. Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (eds) (1983). Understanding Robust and Exploratory Data Analysis. Wiley, New York. Mosteller, F. and Tukey, J.W. (1977). Data Analysis and Regression. AddisonWesley, Reading, MA. Theil, H. (1971). Principles of Econometrics. Wiley, New York. Thorndike, R.M. (1978). Correlational Procedures for Research. Gardner Press, New York.
Calibration Berkson, J, (1969). Estimation of a linear function for a calibration line. Technometrics, 11, 649-60. Ktutchko.ff, R.G. (1967). Classical and inverse regression methods of calibration. Technometrics, 9, 425-40. Lwin, T. and Maritz, J.S. (1982). An analysis of the linear-calibration controversy from the perspective of compound estimation. Technometrics, 24, 235-'42. Shukla, G.K; (1972). On the problem of calibration. Technometrics, 14, 547-51.
Influential observations and outlier detection
Bel~ley, D.~.,
Kuh, E. and Welsch, R:E. (1_980). !l-egression diagnostics~· Identifying Cha'fu~ntza/ Data and _Sources of Colmea~Z!Y: Wiley, ~e~ ":ork. . . terJee, S. and Hadt, A.S. (1988). Sensztzvzty analyszs in lmear regresswn. Wiley, ewy · k · · ·· · C0 N . or . . · ~~· ~-D. (1977). Detection of influential observations in linear regression. Technometrics, c0 ; 5-18. . %\ R·~· and Weisberg, S. (1982). Residuals and influence in regression. Chapman Fox J a, London. ' · (1991). Regression Diagnostics. Sage, Newbury Park, CA.
1
l'itne s · er•es and forecasting Chatfield . . . . .. . & liail C. (1996). The Analyszs of Tzme Serzes: An mtroductwn, 5th edn. Chapman leve .b • London. p~cach, H. and Cleary, J.P. (1981). The Beginning Forecaster: The Forecasting ess through Data Analysis. Lifetime Learning, Belmont, CA.
120
Simple linear regression and correlation
Montgomery, D.C. and Johnson, L.A. (1976). Forecasting and Time Series Analys· McGraw-Hill, New York. zs.
FURTHER READING General Allen, D.M. and Cady, F.B. (1982). Analyzing Experimental Data by Regressi Lifetime Learning, Belmont, CA. on. Dixon, W.J. and Massey, F.J. (1983). Introduction to Statistical Analysis, 4th edn McGraw-Hill, New York. · KJitgaard, R. (1985). Data Analysisfor Development. Oxford University Press, New york Lewis-Beck, M.S.( 1980). Applied Regression: An Introduction. Sage, Newbury Park, CA: Montgomery, J?.C. and Peck, E.A. (1992). Introduction to Linear Regression Analysis, 2nd edn. Wiley, New York. Neter, J., Wasserman, W. and Kutner, M.H. (1990). Applied Linear Statistical Models 3rd edn. Richard D. Irwin, Homewood, IL. ' Rawlin~s, J.O. (1988). Applied Regression: A Research Tool. Wadsworth & Brooks/Cole, Pacific Grove, CA. Williams, F.J. (1959). Regression Analysis. Wiley, New York.
Calibration *Hunter, W.G. and Lamboy, W.F. (1981). A Bayesian analysis ofthe linear calibration problem. Technometrics 23, 323-43. Naszodi, L.J. (1978). Elimination of bias in the course of calibration. Technometrics, 20, 201-6.
Influential observations and outlier detection Barnett, V. and Lewis, T. (1984). Outliers in Statistical Data, 2nd edn. Wiley, New York. *Cook, R.D. (1979). Influential observations in linear regression. Journal of the American Statistical Association, 74, 169-74. . *Cook, .R.D. and W:isb~rg, S. ~1980). C~aracteri~ation of an empirical infl~enr, function for detectmg mfluential cases m regressiOn. Technometrzcs, 22, 495 5~ Velleman, P.F. and Welsch, R.E. (1981). Efficient computing of regression diagnostics. The American Statistician, 35, 234-42.
Transformations . . . c rd UniversitY Atkmson, A.C. (1985). Plots, Transformatzons and Regresszons. 0 XtO Press, New York. . nd the Bennett, C.A. and Franklin, N.L. (1954). Statistical Analysis in Chemzstry a Chemical Industry. Wiley, New York. . . visited. *Bickel, P.J. and Doksum, K.A. (1981). An analysts of transformations re Journal of the American Statistical Associati?n, 76, 296-311.. of the *Box, G.E.P. and Cox, D.R. (1964). An analysts of transformations. Journa 1 Royal Statistical Society Series B, 26, 211-52. . ressioTl· Carroll, R.J. and Ruppert, D. (1988). Transformation and Weighting zn Reg Chapman & Hall, London.
Problems
121
. N R. and Hunter, W.G. (1969). Transformations: Some examples revisited. praP~hno",netrics, 11~ 2.3-40. . . . . . . A. (1952). Statzstzcal Theory wzth Engzneerzng Applzcatzons. ~d~y, New Yo~k. IJal ' 1 ki C.J. (1970). The performance of some rough tests for btvanate normahty l{oW~:re~nd after coordinate transformations to normal~ty. Technometri~s, 12,51 ~-44. bl.ler F. and Tukey, J.W. (1977). Data Analyszs and Regresszon. Addtson}\-1oste • . Wesley Readmg, MA. . J~W. (1957). On the comparative anatomy of transformations. Annals of keY' *TU}1/athematzcal · Statzstzcs, . . 28, 602 - 32· .
1J
Time series and forecasting *Box, G.E.P. and Jenkins, G.M. (197~). Time Series Analysis: Forecasting and Control, rev. edn. Holden-Day, San Franctsco. . . . Diggle, P.J. (1990). Time Series: A Biostatistical Introduction. Oxford Umverstty Press, New York. Levenbach, H. and Cleary, J.P. (1982). The Professional Forecaster: The Forecasting Process through Data Analysis. Lifetime Learning, Belmont, CA. McCleary, R. and Hay, R.A. (1980). Applied Time Series Analysis for the Social Sciences. Sage, Beverly Hills, CA.
PROBLEMS 6.1
In Table 8.1, financial performance data of 30 chemical companies are presented. Use growth in earnings per share, labeled EPSS, as the dependent variable and growth in sales, labeled SALESGRS, as the independent variable. (A description of these variables is given in section 8.3.) Plot the data, compute a regression line, and test that {3 = 0 and ex = 0. Are earnings affected by sales growth for these chemical companies? Which company's earnings were highest, considering its growth in sales? 6.2 From the family lung function data set in Appendix A, perform a regression analysis of weight on height for fathers. Repeat for mothers. Determine the correlation coefficient and the regression equation for fathers and mothers. Test that the coefficients are significantly different from zero for both sexes. Also, find the standardized regression equation and report it. Would you suggest removing the woman who weights 267lb from the data set? Discuss why the 6 3 correlation for fathers appears higher than that for mothers. · !~ Problem 4.6, the New York Stock Exchange Composite Inde~ for August 9 rough September 17, 1982, was presented. Data for the daily volume of transactions, in millions of shares, for all of the stocks traded on the exchange ~:d inci_uded in the Composite Index ~re given b~low, together with ~he ho:postte Index values, for the same peno~ a~ that m _Problem ~-6. Descn~e D v_olume appears to be affected by the pnce mdex, usmg regression analysts. escnbe whether or not the residuals from your regression analysis are serially ~~~~~lated: Plot the in~ex versus time and volume versus time, and describe the 6-4 F honshtps you see m these plots. 0 a ~ the ~ata in Problem 6.3, pretend that the index increases linearly in time n use hnear regression to obtain an equation to forecast the index value as
122
Simple linear regression and correlation Day
Month
Index
Volume
9 10
Aug. Aug. Aug. Aug. Aug. Aug. Aug. Aug. Aug. Aug. Aug. Aug. Aug. Aug. Aug. Aug. Aug. Sept. Sept. Sept. Sept. Sept. Sept. Sept. Sept. Sept. Sept. Sept. Sept. Sept.
59.3 59.1 60.0 58.8 59.5 59.8 62.4 62.3 62.6 64.6 66.4 66.1 67.4 68.0 67.2 67.5 68.5 67.9 69.0 70.3
63 63 59 59 53 66 106 150 93 113 129 143 123 160 87 70 100 98 87 150
69.6 70.0 70.0 69.4 70.0 70.6 71.2 71.0 70.4
81 91 87 82
11 12 13 16 17 18 19 20 23 24 25 26 27 30 31 1 2 3 6 7 8 9 10 13 14 15 16 17
71 98 83 93
77
a function of time. Using 'volume' as a weight variable, obtain a weighted least squares forecasting equation. Does weighted least squares help the fit? Obtain a recent value of the index (from a newspaper). Does either forecasting equation predict the true value correctly (or at least somewhere near it)? Explain.
6.5 Repeat Problem 6.2 using log(weight) and log(height) in place of the original variables. Using graphical and numerical devices, decide whether the transformation helps.
6.6 Examine the plot you produced in Problem 6.1 and choose some transformation for X and/or Y and repeat the analysis described there. Compare the correlation coefficients for the original and transformed variables, and decide whether the transformation helped. If so, which transformation was helpful?
6.7 Using the depression data set described in Table 3.4, perform a regression
analysis of depression, as measured by CESD, on income. Plot the residuals. Does the normality assumption appear to be met? Repeat using the logarithm of CESD instead of CESD. Is the fit improved?
6.8 (Continuation of Problem 6.7.) Calculate the variance of CESD for observations in each of the groups defined by income as follows: INCOME < 30, INCOME between 30 and 39 inclusive, INCOME between 40 and 49 inclusive, INCOME between 50 and 59 inclusive, INCOME > 59. For each observation, define a variable WEIGHT equal to 1 divided by the variance of CESD within the income group to which it belongs. Obtain a weighted least squares regression of the untransformed variable CESD on income, using the values of WEIGHT as weights. Compare the results with the unweighted regression analysis. Is the fit improved by weighting?
6.9 From the depression data set described in Table 3.4 create a data set containing only the variables AGE and INCOME. (a) Find the regression of income on age. (b) Successively add and then delete each of the following points:
AGE    INCOME
42     120
80     150
180    15
and repeat the regression each time with the single extra point. How does the regression equation change? Which of the new points are outliers? Which are influential?
Use the family lung function data described in Appendix A for the next four problems.
6.10 For the oldest child, perform the following regression analyses: FEV1 on weight, FEV1 on height, FVC on weight, and FVC on height. Note the values of the slope and correlation coefficient for each regression and test whether they are equal to zero. Discuss whether height or weight is more strongly associated with lung function in the oldest child.
6.11 What is the correlation between height and weight in the oldest child? Would your answer to the last part of Problem 6.10 change if ρ = 1? ρ = -1? ρ = 0?
6.12 Examine the residual plot from the regression of FEV1 on height for the oldest child. Choose an appropriate transformation, perform the regression with the transformed variable, and compare the results (statistics, plots) with the original regression analysis.
6.13 For the mother, perform a regression of FEV1 on weight. Test whether the coefficients are zero. Plot the regression line on a scatter diagram of MFEV1 versus MWEIGHT. On this plot, identify the following groups of points: group 1: ID = 12, 33, 45, 42, 94, 144; group 2: ID = 7, 94, 105, 107, 115, 141, 144.
Remove the observations in group 1 and repeat the regression analysis. How does the line change? Repeat for group 2.
7 Multiple regression and correlation

7.1 USING MULTIPLE LINEAR REGRESSION TO EXAMINE THE RELATIONSHIP BETWEEN ONE DEPENDENT VARIABLE AND MULTIPLE INDEPENDENT VARIABLES

Multiple regression is performed when an investigator wishes to examine the relationship between a single dependent (outcome) variable Y and a set of independent (predictor or explanatory) variables X1 to XP. The dependent variable is of the continuous type. The X variables are also usually continuous, although they can be discrete. In section 7.2 the two basic models used for multiple regression are introduced, and a data example is given in the next section. The background assumptions, model and necessary formulas for the fixed-X model are given in section 7.4, while section 7.5 provides the assumptions and additional formulas for statistics that can be used for the variable-X model. Tests to assist in interpreting the fixed-X model are presented in section 7.6, and similar information is given for the variable-X model in section 7.7. Section 7.8 discusses the use of residuals to evaluate whether the model is appropriate and to find outliers. Three methods of changing the model to make it more appropriate for the data are given in that section: transformations, polynomial regression and interaction terms. Multicollinearity is defined and methods for recognizing it are also explained in section 7.8. Section 7.9 presents several other options available when performing regression analysis, namely regression through the origin, weighted regression and testing whether the regressions of two subgroups are equal. Section 7.10 discusses how the numerical results in this chapter were obtained using STATA, and section 7.11 explains what to watch out for when performing a multiple regression analysis.

There are two additional chapters in this book on regression analysis. Chapter 8 presents methodology used to choose independent variables when the investigator is uncertain which variables to include in the model. Chapter 9 discusses missing values and dummy variables (used when some of the independent variables are discrete), and gives methods for handling multicollinearity.
7.2 WHEN ARE MULTIPLE REGRESSION AND CORRELATION USED?

The methods described in this chapter are appropriate for studying the relationship between several X variables and one Y variable. By convention, the X variables are called independent variables, although they do not have to be statistically independent and are permitted to be intercorrelated. The Y variable is called the dependent variable.

As in Chapter 6, the data for multiple regression analysis can come from one of two situations.

1. Fixed-X case. The levels of the various X's are selected by the researcher or dictated by the nature of the situation. For example, in a chemical process an investigator can set the temperature, the pressure and the length of time that a vat of material is agitated, and then measure the concentration of a certain chemical. A regression analysis can be performed to describe or explain the relationship between the independent variables and the dependent variable (the concentration of a certain chemical) or to predict the dependent variable.
2. Variable-X case. A random sample of individuals is taken from a population, and the X variables and the Y variable are measured on each individual. For example, a sample of adults might be taken from Los Angeles and information obtained on their age, income and education in order to see whether these variables predict their attitudes toward air pollution. Regression and correlation analyses can be performed in the variable-X case. Both descriptive and predictive information can be obtained from the results.
Multiple regression is one of the most commonly used statistical methods, and an understanding of its use will help you comprehend other multivariate techniques.

7.3 DATA EXAMPLE

In Chapter 6 the data for fathers from the lung function data set were analyzed. These data fit the variable-X case. Height was used as the X variable in order to predict FEV1, and the following equation was obtained:
FEV1 = -4.087 + 0.118 (height in inches)
However, FEV1 tends to decrease with age for adults, so we should be able to predict it better if we use both height and age as independent variables in a multiple regression equation. We expect the slope coefficient for age to be negative and the slope coefficient for height to be positive.
Figure 7.1 Hypothetical Representation of Simple and Multiple Regression Equations of FEV1 on Age and Height: (a) simple regression of FEV1 on age; (b) simple regression of FEV1 on height; (c) multiple regression of FEV1 on age and height.
A geometric representation of the simple regression of FEV1 on age and height, respectively, is shown in Figures 7.1(a) and 7.1(b). The multiple regression equation is represented by a plane, as shown in Figure 7.1(c). Note that the plane slopes upward as a function of height and downward as a function of age. A hypothetical individual whose FEV1 is large relative to his age and height appears above both simple regression lines as well as above the multiple regression plane.

For illustrative purposes Figure 7.2 shows how a plane can be constructed. Constructing such a regression plane involves the following steps.
Figure 7.2 Visualizing the Construction of a Plane: strings stretched between lines drawn on the X1,Y wall and the X2,Y wall form the plane; the mean of Y at X1 and X2 lies on the plane vertically above the point (X1, X2) on the floor.
1. Draw lines on the X1, Y wall, setting X2 = 0.
2. Draw lines on the X2, Y wall, setting X1 = 0.
3. Drive nails into the walls at the lines drawn in steps 1 and 2.
4. Connect the pairs of nails by strings and tighten the strings.

The resulting strings in step 4 form a plane. This plane is the regression plane of Y on X1 and X2. The mean of Y at a given X1 and X2 is the point on the plane vertically above the point (X1, X2).
Data for the fathers in the lung function data set were analyzed by STATA, and the following descriptive statistics were printed:

Variable   Obs   Mean    Std.Dev.   Min    Max
Age        150   40.13   6.89       26     59
Height     150   69.26   2.78       61     76
FEV1       150   4.09    0.65       2.50   5.85

The 'regress' program from STATA produced the following regression equation:
FEV1 = -2.761 - 0.027(age) + 0.114(height)
As expected, there is a positive slope associated with height and a negative slope with age. For predictive purposes we would expect a 30-year-old male whose height was 70 inches to have an FEV1 value of 4.41 liters. Note that a value of -4.087 + 0.118(70) = 4.17 liters would be obtained
for a father in the same sample with a height of 70 inches using the first equation (not taking age into account). In the single-predictor equation the coefficient for height was 0.118. This value is the rate of change of FEV1 for fathers as a function of height when no other variables are taken into account. With two predictors the coefficient for height is 0.114, which is interpreted as the rate of change of FEV1 as a function of height after adjusting for age. The latter slope is also called the partial regression coefficient of FEV1 on height after adjusting for age. Even when both equations are derived from the same sample, the simple and partial regression coefficients may not be equal. The output of this program includes other items that will be discussed later in this chapter.
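To make the arithmetic concrete, the two fitted equations can be evaluated directly. The short Python sketch below uses only the coefficients quoted above; the function names are illustrative, and rounding to two decimals reproduces the 4.17- and 4.41-liter predictions.

```python
# Predicted FEV1 (liters) for a father who is 70 inches tall and 30 years old,
# using the single-predictor and the two-predictor equations quoted above.

def fev1_height_only(height):
    # Simple regression of FEV1 on height (Chapter 6 equation)
    return -4.087 + 0.118 * height

def fev1_height_age(height, age):
    # Multiple regression of FEV1 on age and height
    return -2.761 - 0.027 * age + 0.114 * height

print(round(fev1_height_only(70), 2))     # about 4.17 liters
print(round(fev1_height_age(70, 30), 2))  # about 4.41 liters
```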
7.4 DESCRIPTION OF TECHNIQUES: FIXED-X CASE

In this section we present the background assumptions, model and formulas necessary for an understanding of multiple linear regression for the fixed-X case. Computations can quickly become tedious when there are several independent variables, so we assume that you will obtain output from a packaged computer program. Therefore we present a minimum of formulas and place the main emphasis on the techniques and interpretation of results. This section is slow reading and requires concentration.

Since there is more than one X variable, we use the notation X1, X2, ..., XP to represent P possible variables. In packaged programs these variables may appear in the output as X(1), X1, VAR1 etc. For the fixed-X case, values of the X variables are assumed to be fixed in advance in the sample. At each combination of levels of the X variables, we conceptualize a distribution of the values of Y. This distribution of Y values is assumed to have a mean value equal to α + β1X1 + β2X2 + ... + βPXP and a variance equal to σ² at given levels of X1, X2, ..., XP.

When P = 2, the surface is a plane, as depicted in Figure 7.1(c) or 7.2. The parameter β1 is the rate of change of the mean of Y as a function of X1 when the value of X2 is held fixed. Similarly, β2 is the rate of change of the mean of Y as a function of X2 when X1 is fixed. Thus β1 and β2 are the slopes of the regression plane with regard to X1 and X2.

When P > 2, the regression plane generalizes to a so-called hyperplane, which cannot be represented geometrically on two-dimensional paper. Some people conceive the vertical axis as always representing the Y variable, and they think of all of the X variables as being represented by the horizontal plane. In this situation the hyperplane can still be imagined as in Figure 7.1(c). The parameters β1, β2, ..., βP represent the slope of the regression hyperplane with respect to X1, X2, ..., XP, respectively. The betas are called partial regression coefficients. For example, β1 is the rate of change of the
mean of Y as a function of X1 when the levels of X2, ..., XP are held fixed. In this sense it represents the change of the mean of Y as a function of X1 after adjusting for X2, ..., XP.

Again we assume that the variance of Y is homogeneous over the range of concern of the X variables. Usually, for the fixed-X case, P is rarely larger than three or four.
Least squares method As for simple linear regression, the met?od of least squares is used to obtain timates of the parameters. These estimates, denoted by A, B 1, B 2 , ••• , Bp, printed in the o~tput of ~ny multiple regression program: Th~ estimate A is usually labeled 'mtercept, and B 1, B 2 , ••• , B P are usually giVen m tabular form under the label 'coefficient' or 'regression coefficient' by variable name. The formulas for these estimates are mathematically derived by minimizing the sums of the squared vertical deviations. Formulas may be found inmany of the books listed in the further reading section at the end of this chapter. The predicted value Y for a given set of values Xi, Xi, ... , is then calculated as
:re
x;
The estimate of a 2 is the residual mean square, obtained as
The square root of the residual mean square is called the standard error of the estimate. It represents the variation unexplained by the regression plane. £ Packaged programs usually print the value of S2 and the standard errors or t~e. regression coefficients (and sometimes for the intercept). These ~uanttttes are useful in computing confidence intervals and prediction Intervals around the computed f. These intervals can be computed from the OUtp t · u In the following manner. Pred' •
ICtlon and confidence intervals
~o~tn; regressi.on programs can compute confidence and prediction intervals Pro at specified values of the X variables. If this is not an option in the the grarn w_e need access to the estimated covariances or correlations among llM:~~ession coefficients. These can be found in the regression programs of • SAS, SPSS, STATISTICA and SYSTAT or in the correlate program
130
Multiple regression and correlation
in s. T . ATA. Given this information, we can compute the estimated varian ~y~
2
" SN VarY=
~
-2 * -2 +[(X *1 -X 1 ) VarB 1 +(X 2 -X 2 ) VarB 2 +···
* Xp) - 2 VarBp] + (Xp+ [2(Xi- X 1)(X!- X 7)Cov(B 1 ,B 2 )
+ 2(Xi- X1)(X!- X3 )Cov(B 1,B 3 ) +
...]
The variances (Var) of the various Bi are computed as the squares of the standard errors of the Bb i going from 1 to P. The covariances (Cov) are computed from the standard errors and the correlations among the regression coefficients. For example, Cov(B 1 ,B 2 ) =(standard error B 1 ) (standard error B 2 )[Corr(B 1, B 2 )]
If ~ is interpreted as an estimate of the mean of Y at Xi, Xi, ... , then the confidence interval for this mean is computed as
x;,
* = Y. . + t(Var Y) . . 1/2 CI (mean Y at X *1 ,X *2 , ••. ,Xp) where t is the -100(1 - rt./2) percentile of the t distribution with N- P- 1 degrees of freedom. When Y is interpreted as an estimate of the Y value for an individual, then the variance of Y is increased by S2 , similar to what is done in simple linear regression. Then the prediction interval is PI (individual Y at Xi, Xi, ... ,X;)=
Y + t(S 2 + Var Y) 112
where tis the same as it is for the confidence interval above. Note that these intervals require the additional assumption that for any set of levels of X the values of Y are normally distributed. . . le In summary, for the fixed-X case we presented the model for multtp linear regression analysis and discussed estimates of the parameters. Next, we present the variable-X model. 7.5 DESCRIPTION OF TECHNIQUES: VARIABLE-X CASE . . . b d . bl measured For this model the X's and the Y vana le are ran om vana es . the 18 on cases that are randomly selected from a population. An example ere lung function data set where FEVl, age, height and other variables ~ous measured on individuals in households. As in Chapter 6, the p~e;:e~X discussio~ and f~IT?ul~s given for t~e fixed- X cas~ .ap~ly to the varta riable case. (Thts result ts JUSttfiable theorettcally by condttiOmng on the X va
Description of techniques: variable-X case
131
values that happen to be obt~ined in t~e .sam~le.~ Fu.rthermore, since the X · . bles are also random vanables, the JOint distnbutwn of Y, X 1 , X 2 , •.• , X P varta . of interest. . . . . 15 When there .Is o.nly one X vanable, It was shown I.n Chapt~r 6 th.at the . riate distnbutwn of X and Y can be charactenzed by Its regiOn of bJvacentration. T hese regwns . . are en·Ipses I'f t he d ata come from a b'Ivanate co~al distribution. Two such ellipses and their corresponding regression 0 ~ es are illustrated in Figures 7.3(a) and 7.3(b). tnWhen two X variables exist, the regions of concentration become three-
Age a. Regression of FEVl on Age
Height b. Regression ofFEVl on Height
-tf >
Height
c. Regression of FEV 1 on Age and Height
~
.
gure 7 3 H ypot h etlca . 1 Regtons . . Line o f Concentration and Corresponding Regression sand Planes for the Population Variable-X Model. i
dimensional forms. These forms take the shape of ellipsoids if the joint distribution is multivariate normal. In Figure 7.3(c) such an ellipsoid with the corresponding regression plane of Y on X1 and X2 is illustrated. The regression plane passes through the point representing the population means of the three variables and intersects the vertical axis at a distance α from the origin. Its position is determined by the slope coefficients β1 and β2. When P > 2, we can imagine the horizontal plane representing all the X variables and the vertical axis representing Y. The ellipsoid of concentration becomes the so-called hyperellipsoid.
Estimation of parameters

For the variable-X case several additional parameters are needed to characterize the joint distribution. These include the population means μ1, μ2, ..., μP and μY, the population standard deviations σ1, σ2, ..., σP and σY, and the covariances of the X and Y variables. The variances and covariances are usually arranged in the form of a square array called the covariance matrix. For example, if P = 3, the covariance matrix is a four-by-four array of the form
           X1    X2    X3  |  Y
    X1                     |
    X2                     |
    X3                     |
    -----------------------+-----
    Y                      |
A dashed line is included to separate the dependent variable Y from the independent variables. The estimates of the means, variances and covariances are available in the output of most regression programs. The estimated variances and covariances are denoted by S instead of σ.

In addition to the estimated covariance matrix, an estimated correlation matrix is also available, which is given in the same format:
           X1    X2    X3  |  Y
    X1                     |
    X2                     |
    X3                     |
    -----------------------+-----
    Y                      |
Since the correlation of a variable with itself must be equal to 1, the diagonal elements of the correlation matrix are equal to 1. The off-diagonal elements are the simple correlation coefficients described in section 6.5. As before, the numerical values always lie between +1 and -1.

As an example, the following covariance matrix is obtained from the lung function data set for the fathers. For the sake of illustration, we include a third X variable, weight, given in pounds.
              Age     Height   Weight  |  FEV1
    Age       47.47   -1.08     -3.65  |  -1.39
    Height    -1.08    7.72     34.70  |   0.91
    Weight    -3.65   34.70    573.80  |   2.07
    ------------------------------------+-------
    FEV1      -1.39    0.91      2.07  |   0.42
Note that this matrix is symmetric, that is, if a mirror were placed along the diagonal, then the elements in the lower triangle would be mirror images of those in the upper triangle. For example, the covariance of age and height is -1.08, the same as the covariance between height and age. Symmetry holds also for the correlation matrix that follows:
              Age     Height   Weight  |  FEV1
    Age       1.00    -0.06     -0.02  |  -0.31
    Height   -0.06     1.00      0.52  |   0.50
    Weight   -0.02     0.52      1.00  |   0.13
    ------------------------------------+-------
    FEV1     -0.31     0.50      0.13  |   1.00
Note that age is not highly correlated with height or weight but that height and weight have a substantial positive correlation, as might be expected. The largest correlation between an independent variable and FEV1 is between height and FEV1. Usually the correlations are easier to interpret than the covariances since they always lie between -1 and +1, and zero signifies no correlation. All the correlations are computed from the appropriate elements of the covariance matrix. For example, the correlation between age and height is
-0.06 = -1.08 / [(47.47)^1/2 (7.72)^1/2]

Note that the correlations are computed between Y and each X as well as among the X variables. We will return to interpreting these simple correlations in section 7.7.
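The same arithmetic applies to every entry of the covariance matrix. The short Python sketch below, using the fathers' covariance matrix printed above, divides each covariance by the product of the two standard deviations; the age-height entry reproduces the -0.06 just computed.

```python
# Covariance matrix for age, height, weight and FEV1 (fathers), as printed above
labels = ["Age", "Height", "Weight", "FEV1"]
cov = [
    [47.47, -1.08,  -3.65, -1.39],
    [-1.08,  7.72,  34.70,  0.91],
    [-3.65, 34.70, 573.80,  2.07],
    [-1.39,  0.91,   2.07,  0.42],
]

sd = [cov[i][i] ** 0.5 for i in range(4)]                       # standard deviations
corr = [[cov[i][j] / (sd[i] * sd[j]) for j in range(4)] for i in range(4)]

for lab, row in zip(labels, corr):
    print(lab, [round(r, 2) for r in row])   # rounds to the matrix shown above
```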
Multiple correlation

So far, we have discussed the concept of correlation between two variables. It is also possible to describe the strength of the linear relationship between Y and a set of X variables by using the multiple correlation coefficient. In the population we will denote this multiple correlation by ℛ. It represents the simple correlation between Y and the corresponding point on the regression plane for all possible combinations of the X variables. Each individual in the population has a Y value and a corresponding point on the plane computed as
Y' = α + β1X1 + β2X2 + ... + βPXP
The correlation ℛ is the population simple correlation between all such Y and Y' values. The numerical value of ℛ cannot be negative. The maximum possible value for ℛ is 1.0, indicating a perfect fit of the plane to the points in the population.

Another interpretation of the ℛ coefficient for multivariate normal distributions involves the concept of the conditional distribution. This distribution describes all of the Y values of individuals whose X values are specified at certain levels. The variance of the conditional distribution is the variance of the Y values about the regression plane in a vertical direction. For multivariate normal distributions this variance is the same at all combinations of levels of the X variables and is denoted by σ². The following fundamental expression relates ℛ to σ² and σY²:

σ² = σY²(1 - ℛ²)
Rearrangement of this expression shows that

ℛ² = (σY² - σ²)/σY² = 1 - σ²/σY²
When the variance about the plane, σ², is small relative to σY², then the squared multiple correlation ℛ² is close to 1. When the variance about the plane is almost as large as the variance σY² of Y, then ℛ² is close to zero.
In this case the regression plane does not fit the Y values much better than μY. Thus the squared multiple correlation represents the proportion of the variation in Y accounted for by the regression plane. As in the case of simple linear regression, another interpretation of ℛ is that (1 - ℛ²)^1/2 × 100 is the percentage of σY not explained by X1 to XP.

Note that σY² and σ² can be estimated from a computer output as follows:

SY² = Σ(Y - Ȳ)²/(N - 1) = total sum of squares/(N - 1)

and

S² = Σ(Y - Ŷ)²/(N - P - 1) = residual sum of squares/(N - P - 1)
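As a small numerical illustration of these two estimates, the Python sketch below uses made-up Y values and fitted values (only the formulas come from the text) and prints SY², S², and the corresponding degrees-of-freedom-adjusted estimate 1 - S²/SY² of the squared multiple correlation.

```python
# Made-up illustration: observed Y and fitted values from some regression with P = 2
y     = [3.1, 3.8, 4.0, 4.4, 4.9, 3.5, 4.2, 4.6]
y_hat = [3.3, 3.6, 4.1, 4.3, 4.7, 3.6, 4.1, 4.5]
n, p = len(y), 2

y_bar = sum(y) / n
total_ss    = sum((yi - y_bar) ** 2 for yi in y)
residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

s2_y = total_ss / (n - 1)         # estimate of the variance of Y
s2   = residual_ss / (n - p - 1)  # estimate of the variance about the plane
print(s2_y, s2, 1 - s2 / s2_y)    # last value estimates the squared multiple correlation
```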
Partial correlation

Another correlation coefficient is useful in measuring the degree of dependence between two variables after adjusting for the linear effect of one or more of the other X variables. For example, suppose an educator is interested in the correlation between the total scores of two tests, T1 and T2, given to twelfth graders. Since both scores are probably related to the student's IQ, it would be reasonable to first remove the linear effect of IQ from both T1 and T2 and then find the correlation between the adjusted scores. The resulting correlation coefficient is called a partial correlation coefficient between T1 and T2 given IQ.

For this example we first derive the regression equations of T1 on IQ and T2 on IQ. These equations are displayed in Figures 7.4(a) and 7.4(b). Consider an individual whose IQ is IQ* and whose actual scores are T1* and T2*. The test scores predicted by the population regression lines are T̂1 and T̂2, respectively. The adjusted test scores are the residuals T1* - T̂1 and T2* - T̂2. These adjusted scores are calculated for each individual in the population, and the simple correlation between them is computed. The resulting value is defined to be the population partial correlation coefficient between T1 and T2 with the linear effects of IQ removed.

Formulas exist for computing partial correlations directly from the population simple correlations or other parameters without obtaining the residuals. In the above case suppose that the simple correlation for T1 and T2 is denoted by ρ12, for T1 and IQ by ρ1q, and for T2 and IQ by ρ2q. Then the partial correlation of T1 and T2 given IQ is derived as

partial ρ = (ρ12 - ρ1q ρ2q) / [(1 - ρ1q²)^1/2 (1 - ρ2q²)^1/2]
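A minimal Python sketch of this computation follows; the three correlation values supplied to the function are hypothetical numbers chosen only for illustration, while the formula itself is the one just given.

```python
def partial_corr(r_12, r_1q, r_2q):
    """Partial correlation of T1 and T2 given a third variable (e.g. IQ),
    computed from the three simple correlations."""
    return (r_12 - r_1q * r_2q) / (((1 - r_1q ** 2) * (1 - r_2q ** 2)) ** 0.5)

# Hypothetical test-score example: r(T1,T2) = 0.70, r(T1,IQ) = 0.60, r(T2,IQ) = 0.65
print(round(partial_corr(0.70, 0.60, 0.65), 2))  # about 0.51
```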
Figure 7.4 Hypothetical Population Regressions of T1 and T2 Scores, Illustrating the Computation of a Partial Correlation Coefficient: (a) regression line for T1; (b) regression line for T2.
In general, for any three variables denoted by i, j and k,

ρij.k = (ρij - ρik ρjk) / [(1 - ρik²)^1/2 (1 - ρjk²)^1/2]

It is also possible to compute partial correlations between any two variables after removing the linear effect of two or more other variables. In this case the residuals are deviations from the regression planes on the variables whose
effects are removed. The partial correlation is the simple correlation between these residuals.

Packaged computer programs, discussed further in section 7.10, calculate estimates of multiple and partial correlations. We note that the formula given above can be used with sample simple correlations to obtain sample partial correlations.

This section has covered the basic concepts in multiple regression and correlation. Additional items found in output will be presented in section 7.8 and in Chapters 8 and 9.

7.6 HOW TO INTERPRET THE RESULTS: FIXED-X CASE

In this section we discuss interpretation of the estimated parameters and present some tests of hypotheses for the fixed-X case.

As mentioned previously, the regression analysis might be performed for prediction: deriving an equation to predict Y from the X variables. This situation was discussed in section 7.4. The analysis can also be performed for description: an understanding of the relationship between the Y and X variables. In the fixed-X case the number of X variables is generally small. The values of the slopes describe how Y changes with changes in the levels of the X variables. In most situations a plane does not describe the response surface for the whole possible range of X values. Therefore it is important to define the region of concern for which the plane applies. The magnitudes of the β's and their estimates depend on this choice.

In the interpretation of the relative magnitudes of the estimated B's, it is useful to compute B1X1, B2X2, ..., BPXP and compare the resulting values. A large (relative) value of the magnitude of BiXi indicates a relatively important contribution of the variable Xi. Here the X's represent typical values within the region of concern.

If we restrict the range of concern, the additive model, α + β1X1 + β2X2 + ... + βPXP, is often an appropriate description of the underlying relationship. It is also sometimes useful to incorporate interactions or nonlinearities. This step can partly be achieved by transformations and is discussed in section 7.8. Examination of residuals can help you assess the adequacy of the model, and this analysis is also discussed in section 7.8.
AnalySis · of variance
~~~a test of whether the regression plane is at all helpful in predicting the UsedesT~f Y, the ANOVA table printed in most regression outputs can be ·
e null hypothesis being tested is Ho:/31
=
f3z
= ··· =
f3p
=
0
138
Multiple regression and correlation
Table 7.1 ANOV A Table for multiple regression Source of variation
Sums of squares
df
Mean square
F
Regression Residual
I:(Y- Y)2 I:(Y- Yj2
p
SSreg/P SSres/(N -
MSre8 /MS res
Total
I:(Y- Y)2
N-1
N-P-1
P - 1)
That is, the mean of Y is as accurate in predicting Y as the regression plane. The ANOVA table for multiple regression has the form given in Table 7.1, which is similar to that of Table 6.1. When the fitted plane differs from a horizontal plane in a significant fashion then the term L (f- f)2 will be large relative to the residuals from the plan~ L(Y- ¥)2 • This result is the case for4hefathers where F = 36.81 withP = 2 degrees offreedom in the numerator and N- P - 1 = 147 degrees of freedom in the denominator (Tables 7.1 and 7.2). F = 36.81 can be compared with the printed F value in standard tables with v 1 (numerator) degrees of freedom across the top of the table and v2 (denominator) degrees of freedom in the first column of the table for an appropriate ct. For example, for ct = 0.05, v1 = 2, and v2 = 120 (147 degrees of freedom are not tabled, so we take the next lower level), the tabled F(1 - ct) = F(0.95) = 3.07, from the body of the table. Since the computed F of 36.81 is much larger than the tabled F of 3.07, the null hypothesis is rejected. Table 7.2 ANOV A example from the lung function data (fathers) Source of variation
Sum of squares
df
Mean square
F
Regression Residual
21.0570 42.0413
2 147
10.5285 0.2860
36.81
Total
63.0983
· · · ts · pnn · ted in the The P value can be determmed more precisely. In fact, It . bles output asP= 0.0000: this could be reported asP< 0.0001. Th~~ the~a~~Vl. age and height together help significantly in predicting an indtvtdual s sion If t~is blanket hypothesis is rejected, then the de~~ee to which ~her~~~=~ the equatiOn fits the data can be assessed by examimng a quantity ca coefficient of determination, defined as . . . sum of squares regression f t tal coefficient of determmatiOn = __,. sum o squares o
How to interpret the results: fixed-X case
139
If the regression sum of squares is not in the program output, it can be ·ned by subtraction, as follows: obtaJ ssion sum of squares= total sum of squares- residual sum of squares regre For the lathers,
. . . = 21.0570 . = 0. 334, coeffi ctent of d eterminatiOn 63 0983 This value is an indication of the reduction in the variance of Y achieved by using X 1 and X 2 a.s predictors. In this case the variance .around the regression plane is 33:4% less than t~e v~ria~ce of the original Y values. Numeric~lly, the coefficient of determinatiOn Is equal to the square of the multiple correlation coefficient, and therefore it is called RSQUARE in the output of some computer programs. Although the multiple correlation coefficient is not meaningful in the fixed- X case, the interpretation of its square is valid. Other tests of hypotheses Tests of hypotheses can be used to assess whether variables are contributing significantly to the regression equation. It is possible to test these variables either singularly or in groups. For any specific varible Xi we can test the null hypothesis
by computing
B.-o
t=-'-SE(Bj)
~~d performing a one- or two-sided t test with N - P - 1 degrees of freedom. . ese t statistics are often printed in the output of packaged programs. Other ~~ograms print the corresponding F statistics (t 2 =F), which can test the . . · ~hnull ~ypoth.esis against a two-sided alt~mative. . . can en thts test Is performed for each X vanable, the JOint stgmficance level ltla:ot be determined. A method designed to overcome this uncertainty the es use o~ the so-called Bonferroni inequality. In this method, to compute VaJu~PPro~nate joint P value for any test statistic, we multiply the single P test.l'~~t~t?ed from th~ printout by the n~mber of~ variables we wish to For JOint P value IS then compared With the desued overall level. example, to test that age alone has an effect, we use Ho:f3t
=
0
140
Multiple regression and correlation
and from the computer output =
t
-0.0266-0 = -418 0.00637 ..
with 147 degrees of freedom or P = 0.00005. This result is· also hi hi significant, indicatin~ that ~he eq~ation using age and height is significa~tiY better than an equation usmg height alone~ Y Mos~ compute~ outputs will_ ~splay the P level of .the t statistic for each regressiOn coeffiCient. To get JOint P values accordmg to the Bonferron· inequality, we multiply each individual P value by the number of X variable; before we compare it with the overall significance level ex. For example, if P = 0.015 and there are two X's, then 2(0.015) = 0.03, which could be compared with an ex level of 0.05. 7.7 HOW TO INTERPRET THE RESULTS: VARIABLE-X CASE In the variable-X ca~e all of the Y and X variables are considered to be. random variables. The means and standard deviations printed by packaged programs are used to estimate the corresponding population parameters. Frequently, the covariance and/or the correlation matrices are printed. In the remainder of this section we discuss the interpretation of the various correlation and regression coefficients found in the output. Correlation matrix Most people find the correlation matrix more useful than the covariance matrix since all of the correlations are limited to the range -1 to + 1. Note that because of the symmetry of these matrices, some programs will p~nt only the terms on one side of the diagonal. The diagonal terms of a correlation matrix will always be 1 and thus sometimes are not printed. . . a Initial screening of the correlation matrix is helpful in obtammg d preliminary impression of the interrelationships among the X va~a~les ~~s between Y and each of the X variables. One way of accom?bshmgf the 0 screening is to first determine a cutoff point on the magmtude r in . t' greateIting correlatiOn, for example, 0.3 or 0.4. Then each corre1awn magnitude than this number is underlined or highlighted. T~e r~~us· i.e. pattern can give a ~isual impre.ssio.n of the un~erlying .inter~elauon~e~~1 i 011s highlighted correlations are mdicattons of possible relationships. Co~ b}es are near + 1 or -1 among the X variables indicate that the two van a nside! nearly perfect functions of each other, and the investigator should. co A.lso, dropping one of them since they convey nearly the same infonnauon.
How to interpret the results: variable-X case
141
. . . hysical situation leads the investigator to expect larger correlations 1f the those found in the printed matrix, then the presence of outliers in the than r ofnonlinearities among the variables may be causing the discrepancies. da: ~sts of significance are desired, the following test can be performed. To test the hypothesis
use the statistic given by r(N- 2) 112
t
= (l _
7 2)112
where p is a particular population simple correlation and r is the estimated simple correlation. This statistic can be compared with the t table value with df = N - 2 to obtain one-sided or two-sided P values. Again, t 2 = F with 1 and N - 2 degrees of freedom can be used for two-sided alternatives. Some programs, such as the SPSS procedure CORRELATIONS, will print the;: P value for each correlation in the matrix. Since several tests are being performed in this case, it may be advisable to use the Bonferroni correction to the P value, i.e. each P is multiplied by the number of correlations (say M). This number of correlations can be determined as follows:
M=
(no. of rows)(no. of rows - 1) . 2
After this adjustment to the P values is made, they can be compared with the nominal significance level ex.
Standardized coefficients The interpretations ofthe regression plane, the associated regression coefficients, and the standard error around the plane. are the same as those in the fixed- X ca~~ All the tests _presented in sectio_n 7.6 app~y he~e as well: • CQeffl~ther way to mterpre~ the r~gresston coefficients IS to exami~e standardized and c ents. These are pnnted m the output of many regression programs an be computed easily as . . d B ·= B . (standard deviation of Xi) t d ardIze san · ' ' standard deviation of Y
'these c . . . . . Were Oeffic,ents are the ones that would be obtamed If the Y and X vanables Varia~~=ndardized prior to performing the regression a~alysi~. When the X s are uncorrelated or have very weak correlation with each other,
142
Multiple regression and correlation
the standardized coefficients of the various X variables can be dir 1 compa~ed i~ order to determine th~ relative contributio~ of each toe~~y regression lme. The larger the magmtude of the standardized Bi, them e Xi contributes to the prediction of Y. Comparing the unstandardizedo~ directl~ d?~s not achieve th~s result because of ~he differe~t u~its and degrees of vanabthty of the X vanables. The regression equatiOn Itself should b reported for future use in terms of the unstandardized coefficients so th ~ predic~ion can be made directly fro~ the raw X variab~es. ~hen standardiz~ coefficients are used, the standardized slope coefficrent Is the amount of change in the mean of the standardized Y values when the value of xis increased by one standard deviation, keeping the other X variables constant. If the X variables are intercorrelated, then the usual standardized coefficients are difficult to interpret (Bring, 1994). The standard deviation of Xi used in computing the standardized Bi should be replaced by a partial standard deviation of Xi which is adjusted for the multiple correlation of Xi with the other X variables included in the regression equation. The partial standard deviation is equal to the usual standard deviation multiplied by the square root of one minus the squared multiple correlation of Xi with the other X)s and then multiplied by the square root of (N- 1)/(N- P). The quantity one minus the squared multiple correlation of Xi with the other X's is called tolerance (its inverse is called the variance inflation factor) and is available from some programs (see section 7.9 where multicollinearity is discussed). For further discussion of adjusted standardized coefficients see Bring (1994). Multiple and partial correlations As mentioned earlier, the multiple correlation coefficient is a measure of the strength of the linear relationship between Y and the set of variables X 1' X 2 , ...,Xp. The multiple correlation has another useful property: it is ~he highest possible simple correlation between Y and any linear combi?atto.n of X 1 to Xp· This property explains why R (the computed correlatt.o~) ts never negative. In this sense the least squares regression plane ma~untz~ the correlation between the set of X variables and the dependent vartable · · . · plane It therefore presents a numencal measure of how well the regression e 1 fits the Y values. When the multiple correlation R is close to zero) the Ptne barely predicts Y better than simply using Y to predict Y. A value of R c os to 1 indicates a very. good ~t. . . . R2 is an As in the case of stmple linear re.gre~siOn, dis~ussed m Cha~ter 6, fitting estimate of the proportional reductiOn m the vanance of Y achteved by d the the plane. Again, the proportion of the standard deviation of Y arounthesis plane is estimated by (1 - R 2 ) 1 ' 2 .If a test of significance is desired, the hypo H 0 :~
= 0
Residual analysis and transformations
143
e tested by the statistic can b ·
F
=
R 2 /P (1 - R 2 )/(N- P- 1)
.h. his compared with a tabled F with P and N - P - 1 degrees of freedom. . • • 1 . Since this hypothesis ts eqmva ent to
W tC
llo:f31 =
fJ2 = · · · = fJp = 0
the F statistiC is e~uivalent ~o the one ~alculated i~ Table 7.1. (See section SA for an explanation of adJusted multiple correlations.) · As mentioned earlier, the partial correlation coefficient is a measure of the linear relationship between two variables after adjusting for the linear effect of a group of other variables. For example, suppose a regression of Yon X 1 and X 2 is fitted. The square of the partial correlation of Y on X 3 after adjusting X 1 and X 2 is the proportion of the variance of Y reduced by using X 3 as an additional X variable. The hypothesis
H 0 :partial p = 0 can be tested by using a test statistic similar to the one for simpler, namely t =
(partial r)(N- Q - 2) 112 [1 - (partial r 2 )] 112
--------------:---:---;;--~-
where Q is the number of variables adjusted for. The value of t is compared with the tabled t value with N - Q - 2 degrees of freedom. The square of the t statistic is an F statistic with df = 1 and N - Q - 2; the F statistic may be used to test the same hypothesis against a two-sided alternative. In the special case where Y is regressed on X 1 to X p, a test that·the partial c~?"el~tion between Y and any of the X variables, say Xi, is zero after a Justmg for the remaining X variables is equivalent to testing that the regr · . . esston coefficient pi ts zero. Pl~n Ch~pter 8, where variable selection is discussed, partial correlations will Y an Important role. 78 ' RESIDUAL ANALYSIS AND TRANSFORMATIONS
!~!~' secti_on we discuss the use of residual analysis for finding outliers, tel1lls ?rmattons, polynomial regression and the incorporation of interaction see di Into th~ equation. For further discussion of residuals and transformations agnosttc plots for the effect of an additional variable in section 8.9.
144
Multiple regression and correlation
Residuals The. use of resi~uals in the ~a~e of simple li~ear regres.sion w~s discussed i sectiOn 6.8. Restdual analysts Is even more Important m multiple regres . n analysis because the scatter diagrams of Y and all the X variables simultaneo~~n cannot be portrayed on two-dimensional paper or computer screens. If th Y are only two X variables and a single Y variable, programs do exist ~re showing three-dimensional plots. But unless these plots can be easily rotat ~r finding an outlier is often difficult and lack of independence and the ne:d fO{ transformations can be obscured. Fortun.ately~ the use of re~iduals c~anges lit!le co?ceptually between simple and multiple linear regression. The mformatton giVen on types of residuals (raw, standardized, studentized and deleted) and the various statistics given in section 6.8 all apply here. The references cited in that section are also useful in the case of multiple regression. The formulas for some of the statistics for example h (a leverage statistic) become more complicated, but basically' the same interpretations apply. The values of hare obtained from the diagonal of a matrix called the 'hat' matrix (hence the use of the symbol h) but a large value of h still indicates an outlier in at least one X variable. The three types of outliers discussed in section 6.8 (outliers in Y, outliers in X or h, and influential points) can be detected by residual analysis in multiple :regression analysis; Scatter plots of the various types of residuals against Y are the simplest to do and we recommend that, as a minimum, one of each of the three types be plotted against Y whenever any new regression analysis is performed. Plots of the residuals against each X are also helpful in showing whether a large residual is associated with a particular value of one of the X variables. Since an outlier in Y is more critical when it is also an outlier in X, many investigators plot deleted studentized resid~als on the vertical axis versus leverage on the horizontal axis and look for pomts that are iii the upper right-hand corner of the scatter plot. . If an outlier is found then it is important to determine which cases cont.atn the suspected outliers. Examination of a listing of the three types o~ outh:~s in a spreadsheet format is a common way to determine the case while at. u: same time allowing the investigator to see the values of Y and the ~an~as X variables. Sometimes this process provides a clue that an outlier tt've . a certam . subset of the cases. In some of t h e newe r interachave occurred m 0 programs, it is also possible to identify outliers using a mouse and ! can 8 that information transferred to the data spreadsheet. Removal of outlt~ect of be chosen as an option, thus enabling the user to quickly see the e mend discarding an outlier on the regression plane. In any case, we reco~Jier(s) performing the multiple regression both with and without suspected ou to see the actual difference in the multiple regression equation. find a The methods used for finding an outlier were originally derived to
Residual analysis and transformations
145
. outlier. These same methods can be used to find multiple outliers, 1 stng e gh in some cases the presence of an outlier can be partially hidden by althOU sence of other outliers. Note also that the presence of several outliers ~betK~e same area ~ill hav.e a lar~er effect th~n one outlier alo~e, so it is ~ 0 . rtant to examme multiple outliers for possible removal. For this purpose, tiilPo ggest use of the methods for single outliers, deleting the worst one first weds~hen repeating the process in subsequent iterations (see Chatterjee and ~adi, 1988, for ~dditi~nal methods); . . , . As discussed m sect10~ 6.8,. Cooks dts~ance: ~od1fied .cooks distance .or DFFITS should be exammed m order to Identify mfluentlal cases. ChatterJee nd Hadi (1988) suggest that a case may be considered influential if its ~FFITS value is greater than 2[(P + 1)/(N- P- 1)]t in absolute value, or if its Cook's or modified Cook's distance is greater than the 95th percentile of the F distribution with P + 1 and N - P - 1 degrees of freedom. As was mentioned in section 6.8, an observation is considered an outlier in Y if its residual or studentized residual is large. Reasonable cutoff values are 2(P + 1)/N for the ith diagonal leverage value hii and 2 for the studentized residual. An observation is considered influential if it is an outlier in both X andY. We recommend examination of the regression equations with and without the influential observations. The Durbin-Watson statistic or the serial correlation of the residuals can be used to assess independence of the residuals (section 6.8). Normal probability plots of the residuals can be used to assess the normality of the error variable. Transformations !n examining the adequacy of the multiple linear regression model, the ~vest~gator ~ay ~onder whether the tra~sformations .of some or. all of the th:an~bles might Im~rove the fit. (~ee section 6.9 for revie~.) For this pur~ose F'orrestdual pl~ts agams~ the X van~bles may suggest certam transformat10~s. p· example, If the residuals agamst Xi take the humped form shown m s~gure 6.7(c), it may be useful to transform Xi. Reference to Figure 6.9(a) lo:g:.ts that log Xi ~ay be the appropriate tra~sform~tion. In this situa~ion Wh ' can be used m place of or together With Xi m the same equation. an ~n both Xi and log Xi are used as predictors, this method of analysis is vart:~:ance of attempting to impr~~e the fi~ by including an additional squ e. Other commonly used addttiOnal vanables are the square (X~) and are root (Xl/2) f. . . . ' Variable . i o appropn~te Xi va?ables; o.ther candtdates may new order s altogether. Plots agamst candidate vanables should be obtamed in Plots to check whether the residuals are related to them. We note that these Correi:re rno~t informative when the X variable being considered is not highly ted With other X variables.
?e
146
Multiple regression and correlation
Cau~i~n shoul~ be taken at this st~ge not to be delug~d by a rnultit of additional vanables. For example, 1t may be preferable m certain c Ude use log xi instead of xi and x;; the interpretation of the equation beases to more difficult as the number of predictors increases. The subject of vac?~es selection will be discussed in Chapter 8. na le
Polynomial regression
A subject related to variable transformations is the so-called polynom. · F or examp1e,. suppose .that t he mvesttgat?r · · Ia1 reg~ess10n. ~tarts with a single vanable X and that the scatter dtagram of Yon X mdicates a curviline · function, as shown in Figure 7.5(a). An appropriate regression equation ~ this case has the form
This equation is, in effect, a multiple regression equation with X 1 = X and X 2 = X 2 • Thus the multiple regression programs can be used to obtain the estimated curve. Similarly, for Figure 7.5(b) the regression curve has the form " Y= A+ B 1 X
+ B2 X 2 + B3X 3
which is a multiple regression equation with X 1 =X, X 2 = X 2 and X 3 = X 3 . Both of these equations are examples of polynomial regression equations with degrees two and three, respectively. Some computer packages have special programs that perform polynomial regression without the need for the user to make transformations (section 7.10). Interactions
Another issue in model fitting is to determine whether the X variables interact. If the effects of two variables Xi and Xi are not interactive, then they appear as BiXi + BiXi in the regression equation. In this case the effects of the tw.o 15 variables are said to be additive. Another way of expressing this concept . . . di . tenns to say that there IS nO InteraCtiOn between Xi and Xi. If the ad uve d ot for these variables do not completely specify their effects on the depen~~is variable Y, then interaction of Xi and Xi is said to be pr~sent. ·cal . . many . . . F or exampie·' m chemi phenomenon can be observed m situations. fl .tion processes the additive effects oftwo agents are often not an accurate re e:Seot. of their combined effect since synergies or catalytic effects are often P~e tWO Similarly, in studies of economic growth the interactions between t It is major factors labor and capital are important in predicti~g outcorn;~ssioO therefore often advisable to include additional variables m the reg dd the equation to represent interactions. A commonly used practice i~ to :etween product XiXi to the set of X variables to represent the interactiOn
Residual analysis and transformations
147
y
-I
/
......... ....
/
--~--------------------------------------X
a. Single Maximum Value
y
---r---------------------------------------x b. One Maximum and One Minimum Value
F·
Igure 7.5 Hypothetical Polynomial Regression Curves.
)(
in~ and Xi· The use of so-called dummy variables is a method of incorporating ~ractions and will be discussed in Chapter 9. 8 a 1" a check for the presence of interactions, tables can be constructed from itu lSt of residuals, as follows. The ranges of Xi and Xi are divided into ea:vais, as shown in the examples in Tables 7.3 and 7.4. The residuals for Cell are found from the output and simply classified as positive or
148
Multiple regression and correlation
Table 7.3 Examination of positive residuals for detecting interactions: no intera t' ------------------------------------------------------clon xi(%)
x,
Low
Medium
High
Low Medium High
50 52 49
52 48 50
48 50 51
Table 7.4 Examination of positive residuals for detecting interactions: interactions present Xi(%)
xi
Low
Medium
High
Low Medium High
20
40 50 60
45 60 80
40
55
negative. The percentage of positive residuals in each cell is then recorded. If these percentages are around 50%, as shown in Table 7.3, then no interaction term is required. If the percentages vary greatly from 50% in the cells, as in Table 7.4, then an interaction term could most likely improve the fit. Note that when both xi and xj are low, the percentage of positive individuals is only 20%, indicating that most of the patients lie below the regression plane. The opposite is true for the case when Xi and X. are high. This search for interaction effects should be made after consideration of transformations and selection of additional variables.
7.9 OTHER OPTIONS IN COMPUTER PROGRAMS Because regression analysis is a very commonly used technique, packag~~ programs offer a bewildering number of options to the user. But do ~~t in tempted into using options that you do not understand well. As a gut elar selection options, we briefly discuss in this section some of the more .popu ones. Other options are often, but not always, described in user gutdes.sioJJ The options for the use of weights for the observations and for regre~ioO· through th~ origin were disc~ssed in Sectio~ 6.10 for s~ple linear re.g~~ssioll !hese ?ptlons are a~so available for multiple regressiOn an~ the dJS b the m section 6.10 apphes here as well. Be aware that regresston throug
Other options in computer programs
149
eans that the mean of Y is assumed to be zero when all Xi equal assumption is often unrealistic. For a. detailed discuss~on of how zero. . us regression programs handle the no-mtercept regressiOn model, b vano . t e okunade, Chang an~ Evans (199~). . . see ction 7.5 we descnbed the matnx of correlations among the estimated In se and we indicated how this matrix can be used to find the covariances slopes, the slopes. Covariances can be obtained directly as optional output among in sorne programs. . .
or1gtn;:is
Multicollinearity In practice, a proble~ called multicollinearity ~ccurs whe~ so~e o~ the X ariables are highly mtercorrelated. Here we dtscuss multtcolhneanty and ~he options available in computer programs to determine if it exists. In section 9.5, methods are presented for lessening its effects. A considerable amount of effort is devoted to this problem (e.g. Wonnacott and Wonnacott, 1979; Chatterjee and Price, 1991; Belsey, Kuh and Welsch, 1980; Fox, 1991). When multicollinearity is present, the computed estimates of the regression coefficients are unstable and have large standard errors. Some computations may also be inaccurate. For multiple regression the squared standard error of the ith slope coefficient can be written as
2
where S is the residual mean square, Si is the standard deviation of the ith X variable, and Ri is the multiple correlation between the ith X variable and all the other X variables. Note that as Ri approaches one, the denominator of the last part of the equation gets smaller and can become very close to ~~~·This r~sults ~n dividing o~e by a num~er which is close to zero and t ·efore th1s fract10n can be qu1te large. Thts leads to a large value for the 8 andard error of the slope coefficient. fa ~ome statistical packages print out this fraction, called the variance inflation st~n~ra~VIF). Note that the square .root of Y!F directly affects .the s~e of ~he fact . d error of the slope coefficient. The mverse of the vanance mflat10n ~~~s called tol~r~nce (~ne minus the squared multiple correlation between toler t~e remammg X s) and some programs print this out. When the say gance Is small, say less than 0.01, or the variance inflation factor is large, of th reater than 100, then it is recommended that the investigator use one in ~=d~ethods given in section 9.5. Note that multicollinearity is uncommon tend t lCal research, using data on patients, or in social research, since people 0 be highly variable and the measuring methods not precise.
xi
150
Multiple regression and correlation
Comparing regression planes Sometimes, it ~s desirable to exa~ine the regress.ion equa~ions for subgrou of the populatiOn. For example, dtfferent regression equations for subgro Ps subjected to different treatments can be derived. In the program it is 0 ~Ps convenient to designate the subgroups as various levels of a grouping variab~n In the lung function data set, for instance; we presented the regressi e. equation for 150 males: on FEV1 = -2.761 -0.0267 (age)+ 0.114 (height) An equation can also be obtained for 150 females, as FEV1
=
-2.211- 0.0200 (age)+ 0.0926 (height)
In Table 7.5 the statistical results for males and females combined, as well the separate results, are presented.
Table 7.5 Statistical output for the lung function data set for males and females
Mean Overall (N Intercept Age Height FEV1
= 300, df = 38.80 66.68 3.53
Standard deviation
Regression coefficient
Standard error
Standardized regression coefficient
-6.737 -0.0186 0.1649
0.00444 0.00833
-0.160 0.757
-2.761 -0.0266 0.1144
0.00637 0.01579
-0.282 0.489
-2.211 -0.0200 0.0926
0.00504 0.01370
-0.276 0.469
297) 6.91 3.69 0.81
= 0.76, R 2 = 0.57, S = 0.528 Males (N m = 150, df = 147)
R
Intercept Age Height FEVl
6.89 2.78 0.65
= 0.33, S = 0.535 Females (Nr = 150, df = 147) R
= 0.58
40.13 69.26 4.09
1
R2
Intercept Age Height FEV1 R
= 0.54, R2
37.56 64.09 2.97
6.71 2.47 0.49
= 0.29, S = 0.413
-----
Other options in computer programs
151
expected, the males tend to be older than the females (these are couples .~ t least one child age seven or older), taller and have larger average wttv:. The regression coefficients for males and females are quite similar FE . g the same sign and similar magnitude. ha:tbis type of analysis it is useful for an investigator to present the means . nd standard deviations along with the regressio~ coefficients. This_ information a ·.Jd enable a reader of the report to try typical values of the mdependent . equatiOn . m . order to assess t he numenca . 1 eftiects .wouiables in the regression v;rthe independent variables. Presentation of these results also makes it 0 ossible to assess the characteristics of the sample under study. For example, ~would not be sensible to try to make.inferences from this sample regression ~lane to males ~h~ we_re 90 years ~ld. Si~ce the mean age is .40 years and the standard deviatiOn Is 6.89, there IS unhkely to be any male tn the sample age 90 years (section 6.13, item 5). We can make predictions for males and females who are, say, 30 and 50 years old and 66 inches tall. From the two equations presented in Table 7.5 we would predict FEV 1 to be as follows:
Age (years) Subgroup Males Females
30
50
3.96
3.43
3.30
2.90
~hus, ev~n though the equations for males and ~emales look quite simila~, e e predtcted FEV1 for females of the same height and age as a male Is Xpected to be less. fliThe standardized regression coefficient is also useful in assessing the relative ~ ects of the two variables. For the overall group, the. ratio of the t:stan?ardized coefficients for height and age is 0.1649/0.0186 = 8.87 but ove ratio of_ the standardized coefficients is 0.757/0.160 = 4. 73. Both show an th;:whelmm~ effect of height over age, although it is not quite striking for Wta:ndardtzed coefficients. to c hen there are only two X variables, as in this example, it is not necessary 0111 or a P~te the partial standard deviations (section 7.7) since the correlation si111 ~~ With height is the same as the correlation of height with age. This FP Ifies the computation of the standardized coefficients. 4.2 7°f~ rnales alone, the ~atio for t~e stan~ar?ized coefficients is 1.73 ~ersus r the unstandardized coeffiCients. Similarly, for females, the ratto for
152
Multiple regression and correlation
the standardized coefficients is 1.70 versus 4.63 for the unstandardized. Th the relative contribution of age is somewhat closer to that of height tha u~, appears. to be from the unstandardized coefficients for males and fetn ~It separately. But height is always the most important variable. Often w~ es appears to b.e a m.ajor effect when you l~ok at the unsta~dardized coefficien~: b~comes qutte mmor when you examme the standardized coefficients, and
mceversa. Another interesting result from Table 7.5 is that the multiple correlation of FEV.l for the overall group (R = 0.76) is greater than .the multiple correlatiOns for males (R = 0.58) or females (R = 0.54). Why dtd this occur? Let us look at the effect of the height variable first. This can be explained partially by the greater effect of height on FEVl for the overall group than the effect of height for males and females separately (standardized coefficients 0.757 versus 0.489 or 0.469). Secondly, the average height for males is over 5 inches more than for females (male heights are about two standard deviations greater than for females). 1n Figure 6.5, it was shown that larger correlations occur when the data points fall in long thin ellipses (large ratio of major to minor axes). If we plot FEVl against height for the overall group we would get a long thin ellipse. This is because the ellipse for men lies to the right and higher than the ellipse for women, resulting in an overall ellipse (when the two groups are combined) that is long and narrow. Males have higher FEVl and are taller than women. The effect of age on FEVl for the combined group was much less than height (standardized coefficient of only -0.160). If we plot FEVl versus age for males and for females, the ellipse for the males is above that for females but only slightly to the right (men have higher FEVl but are only 2.5 years older than women or about 0.4 of a standard deviation greater). This will produce a thicker or more rounded ellipse, resulting in a smaller correlation between age and FEVl for the combined group. When the effects of the height and age are combined for the overall gro~p, the effect of height predominates and results in the higher multiple correlatton for the overall group than for males or females separately. h . If the. re~ression coefficients ar~ divided by their standard errorsrl;: highly stgmficant t values are obtamed. The computer output from S d ce gave the values of t along with the significance levels and 95% ~o~fi ence intervals for the slope coefficients and the intercept. The levels of stgntfica~on . . reports or astens . k s are pace 1 d best'de the regresstallY m are often mcluded coefficients to indicate their level of significance. The asterisks are then usu explained in footnotes to the table. . . dividual The standard errors could also be used to test the equality of the 10 ssion coefficients for the two groups. For example to compare the reg~esis of coefficients for height for men and women and to test the null hypot e
Other options in computer programs
153
or -
0.1144-0.0926 = 104 2 2 112 . z- (0.01579 + 0.01370 ) The computed value of Z can be compared with the percentiles of the normal distribution from standard tables to obtain an approximate P value for large sarnples. In this case, the coefficients are not significantly different at the usual p = 0.05 level The standard deviation around the regression plane (i.e. the square root of the residual mean square) is useful to the readers in making comparisons with their own results and in obtaining confidence intervals at the point where all the X's take on their mean value (section 7.4). As mentioned earlier, the grouping option is useful in cbmparing relevant subgroups such as males and females or smokers and nonsmokers. An F test is available as optional output from BMDP1R and the SAS GLM procedure to check whether the regression equations derived from the different subgroups are significantly different from each other. It is also straightforward to do using information provided by most of the programs, as we will demonstrate. This test is an example of the general likelihood ratio F test that is often used in statistics and is given in its usual form in section 8.5. For our lung function data example, the null hypothesis H 0 is that a single population plane for both males and females combined is the same as true plane for each group separately. The alternative hypothesis H 1 is that different planes should be fitted to males and females separately. The residual sums ~squares obtained from (.Y - Y)Z where the two separate planes are fitted .Ill always be less than or equal to the residual sum of squares from the ~Ingle plane for the overall sample. These residual sums of squares are printed ;~t~e analysis of variance table that accompanies regression programs. Table ·Tis an example of such a program for males. pi he. general F test compares the residual sums of squares when a single rea~e Is fitted to what is obtained when two planes are fitted. If these two lf~t ~als are not very different, then the null hypothesis cannot be rejected. th . hng two separate planes results in a much smaller residual sum of squares, th~n the null hypothesis is rejected. The smaller residual sum of squares for bet two Planes indicates that the regression coefficients differ beyond chance int:een the two groups. This difference could be a difference in either the rcepts or the slopes or both. In this test, normality of errors is assumed.
L
154
Multiple regression and correlation
The F statistic for the test is
F = [SSres(Ho)- SSres(H 1)]/[df(H0 ) - df(H 1)] SSres(H 1)/df(H 1 ) In the lung function data set, we have
F = [82.6451 - (42.0413 + 25.0797)]/[297 - (147 (42.0413 + 25.0797)/(147 + 147)
+ 147)]
where 82.6451 is the residu~l sum of_ squares for the overall plane with 297 degrees of freedom, 42.1413ts the ~estdual ~urn of squares for males with 147 degrees of freedom, and 25.0797 Is the residual sum of squares for females with 147 degrees of freedom also. Thus, F- (82.6451- 67.1210)/3 _ 5.1747 _ . 22 67 67.1210/294 - 0.2283 ·
with 3 and 294 degrees of freedom. This F value corresponds toP< 0.0001 and indicates that a better prediction can be obtained by fitting separate planes for males and females. Other topics in regression analysis and corresponding computer options will be discussed in the next two chapters; In particular, selecting a subset of predictor variables is given in Chapter 8. 7.10 DISCUSSION OF COMPUTER PROGRAMS Almost all general purpose statistical packages and some spreadsheet softw?Ie have multiple regression programs. Although the number of features vanes; most of the programs have a choice of options. In this section we show ~ow we obtained the computer output in this chapter, and summarize the multiple regression output options for the six statistical packages used in this bo~:~ The lung function data set is a rectangular data set where the rows are d families and the columns are the variables, given first for males, then repe~tteof . £ n· . the unt for fe~~es and f~r up to three children. Ther~ ~re 150 am tes,_ ation analysts ts the family. We can use the data set as It 1s to analyze the 1 ~0 :s for for the males or for the females separately by choosing the vana females ~ales (FSEX, F AGE, ...) when ~nalyzing fathers and those for hen we (MSEX, MAGE, ...) when analyZJng mothers (Table A.l). But w ed to . p Ianes as was d one m . secfton 7·9' we ne00 for compare the two regression 3 make each parent a unit of analysis so the sample size becom,es aci<:ed' mothers and fathers combined. To do this the data set must be unP . . sAS and so that the females are listed below the males. As examples, we will give the commands used to do thts tn
Discussion of computer programs
155
ATA. These ex~mples make little sense unless you either ~ead the manu~s Sf familiar With the package. Note that any of the SIX packages Will or ;r:m this operation. It is particularly simple in the true Windows packages per ~e copy-and-paste options of Windows can be used. unpack a data set in SAS, we can use multiple output statements in he ~ata step. In the following SAS program, each iteration of the data step t ds in a single record, and the two output statements cause two records, r~e for the mother and one for the father, to be written to the output data ~et. The second data step writes the new file as ASCII.
as;.
** Lung-function data from Afifi & Clark ***; *** Unpack data for mothers & fathers ***; data c.momsdads; infile 'c:\work\aavc\lung1.dat'; input id area fsex fage fheight fweight ffvc ffev1 msex mage mheight mweight mfvc mfev1 ocsex ocage ocheight ocweight ocfvc ocfev1 mcsex mcage mcheight mcweight mcfvc mcfev1 ycsex ycage ycheight ycweight ycfvc ycfev1; ** output father **; sex= fsex; age= fage; height= fheight; weight= fweight; fvc = ffvc; fev1 = ffev1 ; output; ** output mother **; sex= msex; age= mage; height= mheight; weight= mweight; fvc = mfvc; fev1 = mfev1 ; output; drop fsex--ycfev1; run; data temp; ** output as ascii **; set c.momsdads file 'c:\work\aavc\momsdads.dat'; PUt id area sex age height weight fvc fev1; run; Proc . print', tttle 'Lung function data from Afifi & Clark'; run·,
Wi~n STATA, the packed and unpacked forms of the data set are called the l.lsine and long forms. We can convert between the wide and long forms by for .g the reshape command. The following commands unpack five records inroeach. of the records in the lung function data set. Note that there is rrnatton not only for males and females, but also for three children, thus
156
Multiple regression and correlation
giving five records for each family. The following instructions will re . 1 a d~ta set with (five record~ per family) x (150 families)= 750 cases.s~ \ •n family has less than three children, then only the data for the actual ch1·ld he will be included, the rest will be missing. ren dictionary using lung1.dat { * Lung function data from Afifi & Clark. id area sex1 age1 height1 weight1 fvc1 fev11 sex2 age2 height2 weight2_ fvc2 fev12 sex3 age3 heig ht3 weight3 fvc3 fev1 3 sex4 age4 height4 weight4 fvc4 fev14 sex5 age5 height5 weight5 fvc5 fev15
} (151 observations read) • • • • •
reshape reshape reshape reshape reshape
groups member 1-5 vars sex age height weight fvc fev1 cons id area long query
Four options must be specified when using the reshape command. In the above example, the group option tells STATA that the variables of the five family members can be distinguished by the suffix 1-5 in the original data set and that these suffixes are to be saved as a variable called 'member' in the new data set. The vars option tells which of these variables are repeated for each family member and thus need to be unpacked. The cons (constant) option specifies which of the variables are constant within family and are to be output to the new data set. To perform the conversion, either the long (for unpacking) or the wide (for packing) option must be stated. Finally, the query option can be used to confirm the results. To obtain the means and standard deviations in STATA, we used the statement . by member: summarize
This gave the results for each of the five family members separately. Similarly, . by member: corr age height weight fev1
produced the correlation matrix for each family member separately. The statement . regress fev1 age height if member < 3
d mothers produced the regression output for the fathers (member = 1) an and (member= 2) combined (N = 300). To obtain regressions for the fathers mothers separately, we used the statement: . by member: regress fev1 age height if member < 3
What to watch out for
157
The regression options for the six statistical packages are summarized in . ble 7.6. BMDP identifies its m~ltiple r~gression pro~rams .by usi.ng the 'fa R The column for BMDP hsts lR if the output Is available m that }etterra~ since that is the basic regression program for BMDP. If the output ~rog t there, then another program is listed. Note that most of the output in ~:~s also available along with additional options in 2R and 9R. BMDP statements are writt~~ in sentences that. are grouped into ~aragraphs ~see tbe rnanual) ..In additiOn, the same .opttons can be found tn the multiple rnear regressiOn program of the Wtndows BMDP New System package. ~ere, choose Analyze followed by Regression and Multiple Regression. Reg is the main regression program for SAS. We will emphasize the regression output from that. program. ~oncise i~structions are type~ ~o tell the program what regression analysts you wtsh to perform. Stmdarly, REGRESSION is the general regression program for SPSS. For STATA, the regress program is used but the user also has to use other programs to get the desired output The index of STATA is helpful in finding the various options you want. The instructions that need to be typed in STATA are very brief. In STATISTICA, the user chooses Linear Regression from the Statistica Module Switcher and then points and clicks through their choices. In SYSTAT, the MGLH, signifying multiple general linear hypothesis, is chosen from the STATS menu and then Regression is selected. Once there, the variables, cases and desired options are selected. It should be noted that all of the packages have a wide range of options for multiple linear regression. These same programs include other output that will be discussed in Chapters 8 and 9. Note also that in regression analysis you are fitting a model to the data without being 100% certain what the 'true' model is. We recommend that You carefully review your objectives before performing the regression analysis to d~~ermine just what is needed. A second run can always be done to obtain addtttonal output. Once you find the output that answers your objectives) we recommend writing out the result in English o.r highlighting the printed output while this objective is still fresh in mind. ilMost investigators aim for the simplest model or regression equation that :il! ~~ovide a goo~ prediction of t~e dependent variable. In Chapter 8, we Wh' tscuss the chmce of the X vanables for the case when you are not sure Ich X variables to include in your model. ?.JJ WHAT TO WATCH OUT FOR
l'he discussi· d . k . . . ll1Ulti on ~n cauti~nary re~~r s gtven m .s~tlon 6.13. also apply to th Pie regres.ston analysts. In addition, because It lS not posstble to display regr e .results m a two-dimensional scatter diagram in the case of multiple esston) checks for some of the assumptions (such as independent normally
au
Output
BMDP
SAS
SPSS
STAT A
STATISTICA
SYSTAT
Matrix output Covariance matrix of X's Correlation matrix of X's Correlation matrix of B's
lR lR lR
CORR REG REG
REGRESSION REGRESSION REGRESSION
correlate correlate correlate
Linear Regress Linear Regress Linear Regress
MGLH MGLH MGLH
lR
REG
REGRESSION
regress
Linear Regress
MGLH
lR
REG
REGRESSION
regress
Linear Regress
MGLH
lR 9R 9R 9R lR lR lR 6R
REG REG REG REG REG REG REG REG
REGRESSlON REGRESSION REGRESSION REGRESSION REGRESSION REGRESSION REGRESSION REGRESSION
regre8s regress regress regress regress regress regress pcorr
Linear Linear Linear Linear Linear Linear Linear Linear
Regress Regress Regress Regress Regress Regress Regress Regress
MGLH MGLH MGLH MGLH MGLH MGLH MGLH MGLH
Additional options Weighted regression Regression through zero Regression for subgroups Test equality of subgroups Highlight points in plots
lR 1R lR lR User's Guide
REG REG REG GLM REG
REGRESSION REGRESSION REGRESSION
regress regress use if
Linear Regress Linear Regress include if
MGLH MGLH use if
Linear Regress
Editor
Outliers in Y Raw residuals Deleted studentized residuals Other types of residuals
lR 2R, 9R 2R,9R
REG REG REG
REGRESSION REGRESSION REGRESSION
fit fit fit
Linear Regress
MGLH MGLH
Outliers in X Leverage or h Other leverage measures
2R, 9R 2R,9R
REG
REGRESSION REGRESSION
fit
2R,9R 2R,9R 2R,9R 2R,9R
REG
REGRESSION
fit
-
Regression equation Slope and intercept coefficients Standardized slope coefficients Standard error of B Standard error of A Test pop. B = 0 Test pol A= 0 ANOV Multiple correlation Test pop. multiple corr Partial correlations
=0
Influence measures Cooks distance D Modified Cook's D DFFI1S
Other inf\uence measures Checks ~or otber assumptions 'Io\erance 'l at\ance \nf\at\on {actor 'i>e't\a\ cone\ation o\ res\dua\s Uut'l:>an-~a\sOn s\a\\s\\c ~~\. 'ru:!mo~6\.'j o\ ~aria.nce
REG REG
lR
REG
lR
REG REG REG REG
9R
Linear Regress MGLH Linear Regress Linear Regress
fit regress REGRESSION REGRESSION
MGLH Linear Regress
core core
MGLH
Linear Regress Linear Regress
MGLH MGLH MGLH
What to watch out for
159
. ·buted error terms) are somewhat more difficult to perform. But since dtstrt · ion programs are so widely used, they contain a large number of re~~:~s to assist in checking the data. At a minimum, outliers and norma~ity oP ld routinely be checked. At least one measure of the three types (outhers ~ho;, outliers in. ~' and influential. points) s~ould be obta~ed along with a tn mal probabthty plot of the restduals agamst f and agamst each X. no~ften two investigators collecting similar data at two locations obtain slopecoefficients that ar.e not si~il~r in m~gnitud~ in the~r multiple r.egression . uation. One explanatiOn of this mconststency ts that tf the X vanables are e~t independent, the magnitudes of the coefficients depend on which other ~ariables are included in the regression equation. If the two investigators do not have the same variables in their equations, then they should not expect the slope coefficients for a specific variable to be the same. This is one reason why they are called partial regression coefficients. For example, suppose Y can be predicted very well from X 1 and X 2 but investigator A ignores the x 2 • The expected value of the slope coefficient B1 for investigator A will be different from {3 1, the expected value for the second investigator. Hence, they will tend to get different results. The difference will be small if the correlation between the two X variables is small, but if they are highly correlated even the sign of the slope coefficient can change. In order to get results that can be easily interpreted, it is important to include the proper variables in the regression model. In addition, the variables should be included in the model in the proper form such that the linearity assumption holds. Some investigators discard all the variables that are not significant at a certain P value, say P < 0.05, but this can sometimes lead to problems in interpreting other slope coefficients, as they can be altered by the omission of that variable. More on variable selection will be given in Chapter 8. Another problem that can arise in multiple regression is multicollinearity, ~entioned in section 7.9. This is more apt to happen in economic data than m data on individuals, due to the high variability among persons. The bro~rams ~ill wa~ the us~r if ~his occurs. The problem is sometimes solved dfsdtscard_mg a van able or, tf this does ~ot make. sense b~cause of t~e pr~ blems f cussed m the above paragraph, speetal techmques extst to obtam esttmates 0 the slope coefficients when there is multicollinearity (Chapter 9). sh~no.ther practica~ problem in perfo~ing multiple regression analyses is rn· ~kmg sample stze. If many X vanables are used and each has some l~sstng values, then many cases may be excluded because of missing values. a e ~rograms themselves can contribute to this problem. Suppose you specify co:rtes of ~egression programs with slightly different X variables. To save Co Puter ttme the SAS REG procedure only computes the covariance or a ~~l~tion matrix once using all the variables. Hence, any variable that has reg ssmg va1ue for a case causes that case to be excluded from all of the ression analyses, not just from the analyses that include that variable.
160
Multiple regression and correlation
When a large portion of the sample is missing, then the problems of lllaki inferenc~ regarding the parent p~pulation can be difficult: In any case, t~g sample stze should be 'large' relative to the number of vanables in orde e obtain stable results. Some statisticians believe that if the sample size is r to at least five to ten times the number of variables, then interpretation of results is risky. e
r:;;t
SUMMARY In this chapter we presented a subject in which the use of the computer is almost a must. The concepts underlying multiple regression were presented along with a fairly detailed discussion of packaged regression programs. Although much of the material in this chapter is found in standard textbooks, we have emphasized an understanding and presentation of the output of packaged programs. This philosophy is one we will follow in the rest of the book. In fact, in future chapters the emphasis on interpretation of output will be even more pronounced. Certain topics included in this chapter but not covered in usual courses include how to handle interactions among the X variables and how to compare regression equations from different subgroups. We also included discussion of recent developments, such as the examination of residuals and influential observations. A review of the literature on these and other topics in regression analysis is found in Hocking (1983). Several texts contain excellent presentations of the subject of regression analysis. We have included a wide selection of texts in the further reading section.
REFERENCES References preceded by an asterisk require strong mathematical background. Helsley, D.A., Kuh, E. and Welsch, R.E. (1980). Regression Diagnostics: Identifying _Influential Data and Sources ~~ Colline~rity. Wile~, New York. . . tician, Bnng, I. (1994). How to standardiZe regression coefficients. The Amerzcan Stalls 48, 2_09-13. . . . . d . Wiley, ChatterJee, S. and Pnce, B. (1991). Regresswn Analyszs by Example, 2nd e n New York. . Wiley, Chatterjee, S. and Hadi, A.S. (1988). Sensitivity Analysis in Linear Regresszon. New York. hnometrics, Fox, J. (1991). Regression Diagnosti~s. ~age, Newb~ry Park, CA. Hocking, R.R. (1983). Developments m linear regression methodology. Tee 25, 219-29. . .. nalysis of Okunade, A.A., Chang, C.F. an~ _Ev~ns, R.D. (199_3): Comparattve ~American regression output summary statistics m common statistical packages. Th Statistician, 47, 298-303. . NeW yorle· Wonnacott, R.J. and Wonnacott, T.H. (l919).Econometrics, 2nd edn. Wiley,
Further reading
161
HER READING
fVR T c.JI. (1982).
Interpreting and .u~ing Regres~ion. Sage, Bevetl~ Hills, CA. AcbeA A and Azen, S.P. (1979). Statzstzca!Analyszs: A Computer Orzented Approach, Afifi, 'e
162
Multiple regression and correlation
PROBLEMS 7.1
7.2 7.3 7.4
7.5
7.6
7.7
U~ing the. chemical c~mpanies data given i~ Table 8.1, predict t pncejearnmgs (P/E) ratio from the debt-to-equity (D/E) ratio, the an he dividends. divided by the latest 12-months' earnings-per-s~~:l (PAYOUTR1), and the percentage net profit margin (NPM1). Obt . e 1 the correlation matrix, and check the intercorrelations of the variabf n Summarize the results, including appropriate tests of hypotheses. es. Fit the regression plane for fathers using FFVC as the dependent variabi and age and height as the independent variables. e ~rite up the results for Problem 7.2, that would be suitable for inclusion m a report. Include table(s) that present the results the reader should see Fit the regression plane for mothers using MFVC as the dependen~ variable and age and height the independent variables. Summarize the results in a tabular form. Test whether the regression results for mothers and fathers are significantly different. From the depression data set described in Table 3.4, predict the reported level of depression as given by CESD, using INCOME and AGE as independent variables, for females and males separately. Analyze the residuals and decide whether or not it is reasonable to assume that they follow a normal distribution. Search for a suitable transformation for CESD if the normality assumption in Problem 7.5 cannot be made. State why you are not able to find an ideal transformation if that is the case. Using a statistical package of your choice, create a hypothetical data set which you will use for exercises in this chapter and some of the following chapters. Begin by generating 100 independent cases for each of ten variables using the standard normal distribution (means = 0 and variances = 1). Call the new variables Xl, X2, ... , X9; Y. Use the following seeds in generating these 100 cases.
X1 seed X2 seed X3 seed X4 seed X5 seed X6 seed X7 seed X8 seed X9 seed Y seed
36541 43893 45671 65431 98753 78965 67893 34521 98431 67895
. r each of You now have ten independent, random, normal numbe~ fo ndard 100 cases. The population mean is 0 and the populatwn st~iableS deviation is 1. Further transformations are done to make the va
Problems
163
intercorrelated. The transformations are accomplished by making some of the variables functions of other variables, as follows: X1 = 5*X1 X2 = 3*X2 X3 = X1 + X2 + 4*X3 X4=X4 X5 = 4*X5 X6 = X5 - X4 + 6*X6 X7 == 2*X7 XB = X7 + 2*X8 X9 = 4*X9 y = 5 + X1 + 2*X2 + X3
+ 1O*Y
We now have created a random sample of 100 cases on 10 variables: Xl, X2, X3, X4, X5, X6, X7, X8, X9, Y. The population distribution is multivariate normal. It can be shown that the population means and variances are as follows: Population
X1
X2
X3
X4
X5
X6
X7
X8
X9
Y
Mean Variance
0 25
0 9
0 50
0 l
0 16
0 53
0 4
0 8
0 16
5 297
The population correlation matrix is as follows:
Xl X2
X3 X4
xs
X6 X7
X8 X9 y
X1
X2
X3
X4
X5
X6
X1
X8
X9
y
1 0 0.71 0 0 0 0 0 0 0.58
0 1 0.42 0 0 0 0 0 0 0.52
0.71 0.42 1 0 0 0 0 0 0 0.76
0 0 0 1 0 -0.14 0 0 0 0
0 0 0 0 1 0.55 0 0 0 0
0 0 0 -0.14 0.55 1 0 0 0 0
0 0 0 0 0 0 1 0.71 0 0
0 0 0 0 0 0 0.71 1 0 0
0 0 0 0 0 0 0 0 1 0
0.58 0.52 0.76 0 0 0 0 0 0 1
The population squared multiple correlation coefficient between Y and Xl to X9 is 0.6633, between Y and Xl, X2, X3 is 0.6633, and between Y and X4 to X9 is zero. Also, the population regression line of Y on Xl to X9 has tX = 5, {3 1 = 1, /3 2 = 2, /3 3 = 1, /3 4 = {3 5 = · · · = /3 9 = 0.
164
7.8 7.9
7.10
7.11
7.12
7.13
7.14
7.15
Multiple regression and correlation Now, using the data you have generated, obtain the sample statistics from a computer packaged program, and compare them with th parameters given above. Using the Bonferroni inequality, test the sim le correlations, and determine which are significantly different from ze~oe <:omment. · Repeat Problem 7.7 using another statistical package and see if you get the same sample. (<:ontinuation of Problem 7.7.) <:alculatethepopulation partial correlation coefficient between X2 and X3 after removing the linear effect of Xl. Is it larger or smaller than p2:3? Explain. Also, obtain the corresponding sample partial correlation. Test whether it is equal to zero. (<:ontinuation of Problem 7.7.) Using a multiple regression program perform an analysis, with the dependent variable = Y and the independen~ variables= Xl to X9, on the 100 generated cases~ Summarize the results and state whether they came out the way you expected them to considering how the data were generated. Perform appropriate tests of' hypotheses. <:omment. (Continuation of Problem 7.5.) Fit a regression plane for CESD on IN<:OME and AGE for males and females combined. Test whether the regression plane is helpful in predicting the values of CESD. Find a 95% prediction interval for a female with IN<:OME = 17 and AGE = 29 using this regression. Do the same using the regression calculated in problem 7.5, and compare. (<:ontinuation of Problem 7.11.) For the regression of CESD on INCOME and AGE, choose 15 observations that appear to be influential or outlying. State your criteria, delete these points, and repeat the regression. Summarize the differences in the results of the regression analyses. For the lung function data in Appendix A, find the regression of FEVl on weight and height for the fathers. Divide each of the two explanatory variables into two intervals: greater than, and less than or equal to the respective median. Is there an interaction between the two explanatory variables? . EV1 (Continuation of Problem 7.13.) Find the partial correlation of~ le and age given height for the oldest child, and compare it to the sunp e . . ·~r~ correlation between FEV1 and age of the oldest child. Is ett . bout significantly different from zero? Based on these results a~d wtt een doing any further calculations, comment on the relationship betW O<:HEIGHT and O<:AGE in these data. . fi d the (<:ontinuation of Prob~em _7.13.) (a) Fo~. the. oldest chtl~'... ~eight, regression of FEVl on (1) wetght and age; (u) hetght and age, (m) each · · s In weight_ and age. Compare the three regression equation · 1 (b) regression, which coefficients are significantly different from zero.
Problems
165
Find the correlation matrix for the four variables. Which pair of variables is most highly correlated? least correlated? Heuristically, how might this explain the results of part (a)? .1 Repeat Prob~em 7.15(a) for fathers~ measur~ents instead of those of 7 6 the oldest children. Are the regression coefficients more stable? Why?
8 Variable selection in regression analysis 8.1 USING VARIABLE SELECTION TECHNIQUES IN MULTIPLE REGRESSION ANALYSIS The variable selection techniques provided in statistical programs can assist in making the choice among numerous independent or predictor varibles. Section 8.2 discusses when these techniques are used and section 8.3 introduces an example chosen because it illustrates clearly the differences among the techniques. Section 8.4 presents an explanation of the criteria used in deciding how many and which variables to include in the regression model. Section 8.5 describes the general F test, a widely used test in regression analysis that provides a criterion for variable selection. Most variable selection techniques consist of successive steps. Several methods of selecting the variable that is 'best' at each step of the process are given in section 8.6. Section 8.7 describes the options available to find the 'best' variables using subset regression. Section 8.8 presents a discussion of obtaining the output given in this chapter by SAS and summarizes what output the six statistical packages provide. Data screening to ensure that the data follow the regression model is discussed in section 8.9. Partial regression plots are also presented there. Section 8:1° describes what to watch out for in performing these variable selectiOn techniques. Since investigators often overestimate what data selection met~ods will do for them, the last two sections contain references and further readmg.
sED?
8.2 WHEN ARE VARIABLE SELECTION METHODS U h. h variables In Chapter 7 it was assumed that the investigators knew w. IC fi ed-.X they wished to include in the model. Th~s is usually ~he case ~n ~~. also model, where often only two to five vanables are bemg constde h retical is the case frequently when the investigator is working from a t t~;es the 1 model and is choosing the variables that fit this model.. But s~me 0 ut what investigator has only partial knowledge or is interested m fi.ndtng variables can be used to best predict a given dependent var:able: ns where Variable selection methods are used mainly in exploratory sttuatto '
!t
Data example
167
anY independen~ variables have been measured ~nd a final ~odel explaining ~e dependent vanable has not been reached. Vanable selectiOn methods are 1 eful for example, in a survey situation in which numerous characteristics us d r~sponses of each individual are recorded in order to determine some a~aracteristic of the respondent, such as level of depression. The investigator ~ay have prior justificat~o~ for u~ing certain variables but may be open to suggestions for the remammg vanables. For example, ag~ and g~nder ~ave been shown to relate to levels of the CESD (the depression vanable giVen in the depression study described in Chapter 3). An investigator might wish to enter age and gender as independent variables and try additional variables in an exploratory fashion. The set of independent variables can be broken down into logical subsets. First, the usual demographic variables that will be entered first, namely age and gender. Second, a set of variables that other investigators have shown to affect CESD, such as perceived health status. Finally, the investigator may wish to explore the relative value of including another set of variables after the effects of age, gender and perceived health status have been taken into account. In this case, it is partly a model-driven regression analysis and partly an exploratory regression analysis. The programs described in this chapter allow analysis that is either partially or completely exploratory. The researcher may have one of two goals in mind: l. to use the resulting regression equation to identify variables that best
explain the level of the independent variable-a descriptive or explanatory purpose; 2; To obtain an equation that predicts the level of the dependent variable with as little error as possible-a predictive purpose. The variable selection methods discussed in this chapter can sometimes serve one purpose better than the other, as will be discussed after the methods are Presented. S.3 DATA EXAMPLE
1'able 8.1 presents various characteristics reported by the 30 largest chemical ~ornpanies; the data are taken from a January 1981 issue of Forbes. This a~ set w~l henc_efort~ be called the chemical companies data. he vanables hsted m Table 8.1 are defined as follows (Brigham, 1980):
• pIE: Price-to-earnings ratio, which is the price of one share of common
~~ock divided by the earnings per share for the past year. This ratio shows
e dollar amount investors are willing to pay for the stock per dollar of • ~ rrent earnings of the company. ~R5: percent rate of return on total capital (invested plus debt) averaged er the past five years. ~
0
.
.
Table 8.1 Chemical companies' financial performance Company
P/E
ROR5 (%)
D/E
SALESGR5 (%)
Diamond Shamrock Dow Chemical Stauffer Chemical E. I. duPont Union Carbide Pennwalt W. R. Grace Hercules Monsanto AJnericanCyanamid Celanese Allied Chemical Rohm&Haas Reichhold Chemicals Lubrizol Nalco Chemical Sun Chemical cabot International Minerals & Chemical Dexter Freeport Minerals Air Products & Chemicals Mallinckrodt Thiokol Witco Chemical Ethyl Ferro Liquid Air of North America Wi\\iams Companies
9 8 8 9 5 6 10 9 11 9 7
13.0 13.0 13.0 12.2 10.0 9.8 9.9 10.3 9.5 9.9 7.9 7.3 7.8 6.5 24.9 24.6 14.9 13.8 13.5 14.9 15.4 11.6 14.2 13.8 12.0 11.0 13.8 11.5 6.4 3.8
0.7 0.7 0.4 0.2 0.4 0.5 0.5 0.3 0.4 0.4 0.4 0.6 0.4 0.4 0.0 0.0 1.1 0.6 0.5 0.3 0.3 0.4 0.2 0.1 0.5 0.3 0.2 0.4 0.7 0.6
20.2 17.2 14.5 12;9 13.6 12.1 10.2 11.4 13.5 12.1 10.8 15.4 11.0 18.7 16.2 16.1 13.7 20.9 14.3 29.1 15.2 18.7 16.7 12.6 15.0 12.8 14.9 15.4 16.1 6.8
~ona
Mean ~tan.da:td
de'l\at\on
7 7 10 13 14 5 6 10 12 14 13 12 12 7 7 6 12 9 14 9.37
12.01
0.42
2.BO
4.50
0:23
14.91 4.05
EPSS (%) 15.5 12.7 15.1 11.1 8.0 14.5 7.0 8.7 5.9 4.2 16.0 4.9 3.0 -3.1 16.9 16.9 48.9 36.0 16.0 22.8 15.1 22.1 18.7 18.0 14~9
10.8 9.6 11.7 -2.8 -11.1
NPM1 (%) 7.2 7.3 7.9 5.4 6.7 3.8 4.8 4.5 3.5 4.6 3.4 5.1 5.6 1.3 12.5 11.2 5.8 10.9 8.4 4.9 21.9 8.1 8.2 5.6 3.6 5.0 4.4 7.2 6.8 0.9
12.93
6.55
11.15
3.92
PAYOUTR1 0.43 0.38 0.41 0.57 0.32 0.51 0.38 0.48 0.57 0.49 0.49 0.27 0.32 0.38 0.32 0.47 0.10 0.16 0.40 0.36 0.23 0.20 0.37 0.34 0.36 0.34 0.31 0.51 0.22 1.00 0.39 0.16
Data example •
e • • •
169
. d bt-to-equity (invested capital) ratio for the past year. This ratio D~· e the extent to which management is using borrowed funds to indtcates . erate the company. op ESGR5: percent annual compound gro~th rate of. sales, computed SAL the roost recent five years compared With the prevmus five years. fr';;s: percent annual compound growth in e.arnings per s~are, computed E the most recent five years compared With the precedmg five years. ~l: percent net profit margin, which is the net profits divided by the net sales for the past ye~~' expres.s~d as a percentage. . PAYOUTRl: annual dividend dtvided by t~e latest 12:month.earnmgs er share. This value represents the proportion of earnmgs patd out to ~hareholders rather than retained to operate and expand the company.
The P/E ratio is usually high for growth stocks and low for mature or troubled firms. Company managers generally want high P/E ratios, because high ratios make it possible to raise substantial amounts of capital for a small number of shares and make acquisitions of other companies easier. Also, investors consider P/E ratios of companies, both over time and in relation to similar companies, as a factor in valuation of stocks for possible purchase and/or sale. Therefore it is of interest to investigate which of the other variables reported in Table 8.1 influence the level of the P/E ratio. In this chapter we use the regression of the P /E ratio on the remaining variables to illustrate various variable selection procedures. Note that all the firms of the data set are from the chemical industry; such a regression equation could vary appreciably from one industry to another. . To get a preliminary impression of the data, examine the means and standard deviations shown at the bottom of Table 8.1. Also examine Table S.2,. which shows the simple correlation matrix. Note that P jE, the dependent ~:na~Ie, is mos.t highly correlated with D/E. The association, however,. is .gattve..Thus Investors tend to pay less per dollar earned for compames With relatively heavy indebtedness. The sample correlations ofP jE withROR5, Table 82 C
--... ·
----
.
.
orre1atton matnx of chemical companies data P/E
ROR5
D/E
SALESGR5 EPS5 NPM1
PAYOUTR1
PfE aoR.s
1 0.32 1 DtE -0.47 -0.46 SALESGRs 0.13 0.36 EPss NP?\it - 0.20 0.56 0.35 0.62 PA VOtJTR.l 0.33 -0.30
~
1 -0.02 1 0.19 0.39 -0.22 0.25 -0.16 -0.45
1 0.15 1 -0.57 -0.43
1
170
Variable selection in regression analysis
NPMl and PAYOUTRl range from 0.32 to 0.35 and thus are quite s· . Investors tend to pay more per dollar earned for stocks of comp .. 1llli1ar. higher r~tes of ret~rn on ~otal capital, high~r ~et profi~ ~argins a~~e~. With proportiOn of earmngs paid out to them as dividends. Similar interpret 1 ~her can be made for the other correlations. attons If a single independent variable were to be selected for 'explain' , the variable of choice would be D/E since it is the most highly ~ng P/E, . . . rre ated But clearly there are other vanables that represent differences in mana · style~ dynamics of growth and efficiency of operation that should !:s:e~t considered. e It is possible to derive a regression equation using all the independ variables, as discussed in Chapter 7. However, some of the independ en~ variables are strongly interrelated. For example, from Table 8.2 we see t~~t growth of earnings (EPS5) and growth of sales (SALESGR5) are positively correlated,suggesting that both variables measure related aspects of dynamically growing companies. Further, growth of earnings (EPS5) and growth of sales (SALESGR5) are positively correlated with return on total capital (ROR5), suggesting that dynamically growing companies show higher returns than mature, stable companies. In contrast, growth of earnings (EPS5) and growth of sales (SALESGR45) are both negatively correlated with the proportion of earnings paid out to stockholders (PAYOUTRl), suggesting that earnings must be plowed back into operations to achieve growth. The profit margin (NPMl) shows the highest positive correlation with rate of return (ROR5), suggesting that efficient conversion of sales into earnings is consistent with high return on total capital employed. Since the independent variables are interrelated, it may be better to use a subset of the independent variables to derive the regression equation. Most investigators prefer an equation with a small number of variables since s.uc? 15 an equation will be easier to interpret. For future predictive purposes It f · · 1 set o often possible to do at least as well with a subset as with the tota. . 1 independent variables; Methods and criteria for subset selection are gtv;~ 1°y . sections. . . h thods are iatr the subsequent It should be noted that t ese me dited sensitive to gross outliers. The data sets should therefore be carefully e prior to analysis.
1
8.4 CRITERIA FOR VARIABLE SELECTION . stigator has In man~ si~uati~ns wher~ regr~ssion an~lysis i.s usefu~, the mveuation. rhe strong JUStification for mcludmg certam vanables m t~e eq dies or to justification may be to produce results comparable to previous stu onceived conform to accepted theory. But often the investigator has no predabtes. It ~s~essment of t~e im?ortance of ~orne oral~ of the.independent ::useful. Ism the latter situatiOn that vanable selectiOn procedures can ·
Criteria for variable selection
171
·able selection procedure requires a criterion for deciding how AnY v~ which variables to select. As discussed in Chapter 7, the least manY an ethod of estimation minimizes the residual sum of squares (RSS) squares;: regression plane [RSS = L (Y ~ Y)Z)]. Therefore an implicit a~ou~o~ ~s the v~lue of RSS. In deciding between al~emative subsets of crtt~rbl the investigator would select the one producmg the smaller RSS van~ en~'terion were used in a mechanical fashion. Note, however, that if thlS C . . RSS = _L(Y- Y) 2 (1- R 2 ) where R is the multi~l~ correlatio~ coefficient.. Therefore. minimizing ~S~ is ivalent to maximtzmg the multiple correlation coeffiCient. If the cntenon ~f~aximizing R were used, the investigator would always select all of the :ndependent variables, because the value of R will never decrease by including additional variables. Since the multiple correlation coefficient, on the average, overestimates the population correlation coefficient, the investigator may be misled into including too many variables. For example, if the population multiple correlation coefficient is, in fact, equal to zero, the average of all possible values of R 2 from samples of size N from a multivariate normal population iS P/(N- 1), where P is the number of independent variables (Stuart and Ord, 1991). An estimated multiple correlation coefficient that reduces the bias is the adjusted multiple correlation coefficient, denoted by R. It is related to R by the following equation: if.2
= R2
-
P( 1 - R2)
N-P-1 ~here Pis the number of independent variables in the equation (Theil, 1971). · ote that in this chapter the notation P will sometimes signify less than the total number of available independent variables. Note that in many texts ~nd .~mputer manuals, the formulas are written with a value, say P', that 1 .ang~ es the number of parameters being estimated. In the usual case where · Inter · · · . . 'be·tng cept · ls mcluded. m the regression model. . ' there are P + 1 parameters · P' ~ pestimate~, .one for each of the P va~abl~s plus the intercept thus be pr ~ 1. This IS the reason the formulas m this and other texts may not . · eclsely the same l'he in · · rnaxitn· ve~ttgator may proceed now to select the independent variables that 2 result i~e R •• Note that exch~_ding some independent variables may, in fact, Cherni a highe~ value of R 2 • As will be seen, this result occurs for the /{~ Wi~ab1compames data. Note also that if N is very large relative toP, then cl.iffere e approximately equal to R 2 • Conversely, maximizing iP can give nt results from those obtained in maximizing R 2 when N is small.
172
Variable selection in regression analysis
Another method suggested by statisticians is to minimize the r .d square, which is defined as est Ual mean RMS or RES. MS. =
RSS
N-P-1
.
This quantity is related to the adjusted R? as follows:
jp
=1_
(RMS) s~
where
s:
does not involve any independent variables, minimizing RMS · Since equivalent to maximizing iP. Is Anothe~ quantity used in variable selection and found in standard computer packages Is the so-called CP criterion. The theoretical derivation of this quantity is beyond the scope of this book (Mallows, 1973). However, Cp can be expressed as follows: RMS 1) + (P CP = (N- P- 1) ( fj2-
+ 1)
where RMS is the residual mean square based on the P selected variables and 8 2 is the RMS derived from the total set of independent variables. The quantity CP is the sum of two components, P + 1 and the remainder .of the expression. While P + 1 increases as we choose more independent vanables) the other part of C P will tend to decrease. When all variables are chos':~' P + 1 is at its maximum but the other part of CP is zero since RMS = fJ Many investigators recommend selecting those independent variables tha minimize the value of CP. . . . . . • , . 0 madoD Another recommended cntenon for vanable selection IS Ak.a1ke s 1 ~ r and 1 criterion (AIC). This criterion was developed to find an optuna hown parsimonious model (not entering unnecessary variables). It ha~ beentunder to be equivalent to CP for large sample sizes (asymptotically eqUivaler s both so~~ .con~itions, (~ishi, 19~4)). This means t.hat for very large satll~:; lead cntena will result m selectmg the same vanables. However, theY f AIC see to different results in small samples. For a general develo~ment/ mula for Bozdogan (1987). The similarity to CP can be seen by usmg a or estimating AIC given by Atkinson (1980),
t
RSS
AIC=~+2(P+ (J
1)
A general F test
173
. the residual sum of squares. As more variables are added the whe~e RS~r~s of the formula gets smaller but P + 1 gets larger. A minimum .. . fracuoll ~IC is desired. value of equation is used m BMDP New System. SAS uses a different 1'he a~~;:stimating ~IC ~iven by N~sh_i (1984). formula· h above criteria will lead to similar results, but not always. Other .of~enh~:e been proposed that tend to allow either fewer or ~or~ variables ~ntena ation. It is not recommended that any of these cntena be used Ill the e~~lly under all circumstances, particularly when the sample size is mec~~cactical methods for using these criteria will be discussed in sections sma · d r8 7 Hocking (1976) compared these criteria (except for AIC) and 8.6 an ·. · . . . s and .·th·er, .... recommended the followmg uses. 0
1. In a given sample the value ?f the .unadjusted R can 2
?e used as a measure
of data description. Many mvestlgators exclude vanables that add only a very small amount to the R 2 value obtained from the remaining variables. 2. If the object of the regression is extrapolation or estimation of the regression parameters, the investigator is advised to select those variables that maximize the adjusted R 2 (or, equivalently, minimize RMS). This criterion may result in too many variables if the purpose is prediction (Stuart and Ord, 1991). 3. For prediction, finding the variables that make the value of CP approximately equal toP + 1 or minimizing AIC is a reasonble strategy. (For extensions to AIC that have been shown to work well in simulation studies see Bozdogan, 1987.) The above discussion was concerned with criteria for judging among alternative subsets of independent variables. The numerical values of some of these criteria are used to decide which variables enter the model in the statistical packages. In other cases, the value of the criterion is simply printed ~t but is not used to choose which variables enter directly by the program. cho~e that. the investigators can use the printed values to form their own Oice of tnd d . . . . next epen ent vanables. How to select the candidate subsets is our . concern we first d' · Be~ore descn'b'mg meth ods to perform the selection, though, subset 0 f ~scuss the use of the F test for determining the effectiveness of a Independent variables relative to the total set. 8.s A G S ENERAL F TEST i Uppose We a . . . . 11 the regr .re convmced that the vanables X 1 , X 2 , ••• , X P should be used "an abies Xession equa t'Ion. Sup.pose also that measurements on Q additional ~~the addif+ h X P+~· .. •, X P +Q• are available. Before deciding whether any at, as a g tonal vanables should be included, we can test the hypothesis roup, the Q variables do not improve the regression equation.
174
Variable selection in regression analysis
If the regression equation in the population has the form
Y=
+ P1X1 + PzXz + ... + fJpXp + PP+1Xp+ 1 + ··· + PP+QXP+Q + e
C(
we test the hypothesis H 0 :{3P+l = PP+z = · · · = PP+Q = 0. To perf! test, we ~rst obt~i_n an equation that includes all th~ ~ + Q variab~: the we obtam the residual sum of squares (RSSP+Q). Similarly, we obt ? and 1 equation that includes only the first P variables and the corresp: ~.an residual sum of squares (RSSp). Then the test statistic is computed asn tng F=
(RSSp- RSSP+Q)/Q RSSP+Q/(N- p- Q- 1)
The numerator measures the improvement in the equation from using the additional Q variables. This quantity is never negative. The hypothesis is rejected if the computed F exceeds the tabled F(1 - tX) with Q and N - P - Q - 1 degrees of freedom. This very general test is sometimes referred to as the generalized linear hypothesis test. Essentially, this same test was used in section 7.9 to test whether or not it is necessary to report the regression analyses by subgroups. The quantities P and Q can take on any integer values greater than or equal to one.· For example, suppose that six variables are available. If we take P equal to 5 and Q equal to 1, then we are testing H 0 :{3 6 = 0 in the equation Y = tX + {3 1X 1 + {3 2 X 2 + {3 3 X 3 + {34 X 4 + f3 5 X 5 + f3 6 X 6 +e. This test is the same as the test that was discussed in section 7.6 for the significance of a single regression coeffic~ent. . . . . d As another example, m the chemtcal compames data tt was alrea Y observed that D/E is the best single predictor of the P/E ratio. A re~ev~t hypothesis is whether the remaining five variables impr?ve the~ pred~c~: obtained by D/E alone. Two regressions were run (one With all six vana and one with just D/E), and the results were as follows: RSS 6 = 103.06 RSS 1 =176.08 p
=
1, Q = 5
Therefore the test statistic is
-
(176.08 - 103.06)/5 - 14.60 = 3.26 F- 103.06/(30- 1- 5-1)- 4.48 . . h the value with 5 and 23 degrees of freedom. Comparing this value wtt
round
Stepwise regression
17 5
d statistical tables, we find the P value to be less than 0.025. Thus in stand:: ignificance level we conclude that one or more of the other five at the S;o ~ nificantly improves the prediction of the P/E ratio. Note, variables t:;t the D/E ratio is the most highly correlated variable with the howtwe~, .
PfE ~~t:~ieetion proc~ss. affects.t~e inference so that the t~e P valu~ of the
T~ unknown, but tt.Is per~aps .greater than 0.025. Strtctl! ~peakmg,, the 1 test allinear hypothesis test ts vahd only when the hypothesis Is deternuned gener · · · th · d ta . r to exammmg e a ; . . .. . . . pnob. eneral linear hypothesis test Is the basts for several selection reg . d' h . . . du·res· as will be dtscusse m t e next two sections. proce ' · 8. 6 STEPWISE REGRESSION The variable selection problem can be described as considering certain subsets of independent variables and selecting that subset that either maximizes or minimizes an appropriate criterion. Two obvious subsets are the best single variable and the complete set of independent variables, as considered in section 8.5. The problem lies in selecting an intermediate subset that may be better than both these extremes. In this section we discuss some methods for making this choice.
Forward selectiQn method S~Iecting the best single variable is a simple matter: we choose the variable Wtth the highest absolute value of the simple correlation with Y. In the chemical companies data example D/E is the best single predictor of P/E ~:au.se the correlation between D(E an? P/E is -0.47, which has a larger gmtude than any other correlation wtth P/E (Table 8.2). NJ~~h~ose ~second variable to, combine with D/E, we .could. naively select This ch~~nce It has these~ond htghest a?solute correlatt?n Wtth P/E (0.3.5). DfE ~may not be Wise, however, smce another vanable together wtth One ~ay give a higher multiple correlation than D/E combined with NPMl. R2 :h:ategy th~refore is to search for that variable that maximizes multiple choosin n ~ombmed with D/E. Note that this procedure is equivalent to that i~ ~e s~ond variab~e to m~nimize the r~sidual su~ of squares gi:en t~e second v P~ 1D the equatt?n: Thts proced~re Is also eqmvalent to choosmg W~th PJB afteanable t.o maxn~uze the magmt~de of the partial correlation Will also 111 .r ~emovmg the lmear effect of D/E. The variable thus selected corretatio a~muze the F statistic for testing that the above-mentioned partial &etteralli: Is zero (section 7.7). This F test is a special application of the In the c~ar ?YPothesis test described in section 8.5. emicai companies data example the partial correlations between
or:_
176
Variable selection in regression analysis
P /E and each of the other variables after removing the linear efli are as follows: ect of D;E Partial correlations RORS SALESGR5 EPSS NPMl PAYOUTRl
0.126 0.138 -0.114 0.285 0.286
Therefore, the method described above, called the forward selection method wo~ld choose D/E. a~ the first variable and PAYOUTRl as the second vanable. Note that It IS almost a toss-up between PAYOUTRl and NPMl as the second variable. The computer programs implementing this method will ignore such reasoning and select PAYOUTRl as the second variable because 0.286 is greater than 0.285. But to the investigator the choice may be more difficult since NPMl and PAYOUTRl measure quite different aspects of the companies. Similarly, the third variable chosen by the forward selection procedure is the variable with the highest absolute partial correlation with P/E after removing the linear effects of D/E and PAYOUTRl. Again, this procedure is equivalent to maximizing multiple R and F and minimizing the residual mean square, given that D/E and PAYOUTRl are kept in the equation. The forward selection method proceeds in this manner, each time adding one variable to the variables previously selected, until a specified stopping rule is satisfied. The most commonly used stopping rule in packaged programs is based on the F test of the hypothesis that the partial correlation ?f the variable entered is equal to zero. One version of the stopping rule tenmn.ates entering variables when the computed value ofF is less than a specified value. This cutoff value is often called the minimum F-to-enter. Id Equivalently, the P value corresponding to the computed F stati~tic cou · be calculate~ and the forward selection stopped when ~his P value IS t~:~~~~ than a specified level. Note that here also the P value IS affected by t be that the variables are selected from the data and therefore should no used in the hypothesis-testing context.. . .. -to-enter ntile Bendel and Afifi (1977) compared vanous levels of the mmimum F used in forward selection. A recommended value i~ the FmP~;~ze is corresponding to a P value equal to 0.15. For example, if the sa . P0 f the F large, the recommended minimum F-to-enter is the 85th percentl1e distribution with 1 and oo degrees of freedom, or 2.07. the basis Any of the criteria discussed in section 8.4 could also be used as din tbe for a stopping rule. For instance, the multiple R 2 is always presente
Stepwise regression
177
f the commonly used programs whenever an additional variable is outputd The user can examine this series of values of R 2 and stop the process selectethe increase in R 2 is a ver~ small amount. Alternatively, the se?es of w~en d R,2 values can be exammed. The process stops when the adjusted adJuste . . -z . rnaxumzed. R 18 n example, data from the chemical companies analysis are given in ~;eaS.3. Using the P value of 0.15 (or the minimum F of 2.07) as a cutoff, ra uld enter only the first four variables since the computed F-to-enter we ;~ fifth variable is only 0.38. In examining the multiple iP, we note that f~ng from five t~ six va.riables increasedR 2 only by 0.001. It is therefore gbvious that the s1xth vanable should not be entered. However, other readers ~ay disagree about the. practical value of .a 0:007 increa~e in R 2• ob~ained by including the fifth vanable. We would~ mclmed not to mclude 1t. Fmally, in examination of the adjusted multiple R 2 we note that it is also maximized by including only the first four variables. This selection of variables by the forward selection method for the chemical companies data agrees well with our understanding gained from study of the correh:ttions between variables. As we noted in section 8.3, return on total capital (ROR5) is correlated with NPMl, SALESGR5 and EPS5 and so adds little to the regression when they are already included. Similarly, the correlation between SALESG R5 and EPS5, both measuring aspects of company growth, corroborates the result shown in Table 8.3 that EPS5 adds little to the regression after SALESGR5 is already included. It is interesting to note that the computed F -to-enter follows no particular ~attem as we include additional variables. Note also that even though R 2 Is always increasing, the amount by which it increases varies as new variables :re adde~. Finally, this example verifies that the adjusted multiple R2 increases 0 a maxtmum and then decreases. The other criteria discussed in section 8.4 are Cp and AIC. These criteria are also recommended for stopping rules. Specifically, the combination of 0
Table 8.3 Forward selection of variables for chemical companies

Variables added    Computed F-to-enter    Multiple R²    Adjusted R²
1. D/E                    8.09               0.224          0.197
2. PAYOUTR1               2.49               0.290          0.237
3. NPM1                   9.17               0.475          0.414
4. SALESGR5               3.39               0.538          0.464
5. EPS5                   0.38               0.545          0.450
6. ROR5                   0.06               0.546          0.427
Specifically, the combination of variables minimizing their values can be chosen. A numerical example for this procedure will be given in section 8.7.

Computer programs may also terminate the process of forward selection when the tolerance level (section 7.9) is smaller than a specified minimum tolerance.
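As an illustration of the mechanics of forward selection with an F-to-enter rule, a minimal sketch is given below in Python rather than in any of the packages discussed later in this chapter. The data file name and column names are hypothetical, patterned after Table 8.1.

import numpy as np
import pandas as pd
from scipy import stats

def rss(X, y):
    # Residual sum of squares from the least squares fit of y on X (intercept included).
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def forward_selection(df, y_name, x_names, p_enter=0.15):
    y = df[y_name].to_numpy(dtype=float)
    selected, remaining = [], list(x_names)
    while remaining:
        if selected:
            rss_old = rss(df[selected].to_numpy(dtype=float), y)
        else:
            rss_old = float(((y - y.mean()) ** 2).sum())   # step 0: only the mean of Y
        best = None
        for cand in remaining:
            rss_new = rss(df[selected + [cand]].to_numpy(dtype=float), y)
            df_resid = len(y) - (len(selected) + 2)        # n - (variables + intercept)
            f_to_enter = (rss_old - rss_new) / (rss_new / df_resid)
            p_val = stats.f.sf(f_to_enter, 1, df_resid)
            if best is None or f_to_enter > best[1]:
                best = (cand, f_to_enter, p_val)
        if best[2] > p_enter:                              # stopping rule on the P value
            break
        selected.append(best[0])
        remaining.remove(best[0])
        print(f"entered {best[0]:10s}  F-to-enter = {best[1]:6.2f}  P = {best[2]:.3f}")
    return selected

# chem = pd.read_csv("chemco.csv")   # hypothetical file containing the Table 8.1 variables
# forward_selection(chem, "PE", ["ROR5", "DE", "SALESGR5", "EPS5", "NPM1", "PAYOUTR1"])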
Backward elimination method

An alternative strategy for variable selection is the backward elimination method. This technique begins with all of the variables in the equation and proceeds by eliminating the least useful variables one at a time.

As an example, Table 8.4 lists the regression coefficients (both standardized and unstandardized), the P value, and the corresponding computed F for testing that each coefficient is zero for the chemical company data. The F statistic here is called the computed F-to-remove. Since ROR5 has the smallest computed F-to-remove, it is a candidate for removal. The user must specify the maximum F-to-remove; if the computed F-to-remove is less than that maximum, ROR5 is removed.

No recommended value can be given for the maximum F-to-remove. We suggest, however, that a reasonable choice is the 70th percentile of the F distribution (or, equivalently, P = 0.30). For a large value of the residual degrees of freedom, this procedure results in a maximum F-to-remove of 1.07.

In Table 8.4, since the computed F-to-remove for ROR5 is 0.06, this variable is removed first. Note that ROR5 also has the smallest standardized regression coefficient and the largest P value. The backward elimination procedure proceeds by computing a new equation with the remaining five variables and examining the computed F-to-remove for another likely candidate. The process continues until no variable can be removed according to the stopping rule.

Table 8.4 Backward elimination: coefficients and F and P values for chemical companies
                     Coefficients
Variable      Unstandardized   Standardized   F-to-remove      P
Intercept          1.24             -              -           -
NPM1               0.35            0.49           3.94        0.06
PAYOUTR1           9.86            0.56          12.97        0.001
SALESGR5           0.20            0.29           3.69        0.07
D/E               -2.55           -0.21           3.03        0.09
EPS5              -0.04           -0.15           0.38        0.55
ROR5               0.07            0.04           0.06        0.81
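The cutoffs quoted above (2.07 for the minimum F-to-enter and 1.07 for the maximum F-to-remove) can be checked directly from percentiles of the F distribution; for example, a short Python check:

from scipy import stats

print(round(stats.f.ppf(0.85, 1, 10_000), 2))   # about 2.07, the recommended minimum F-to-enter
print(round(stats.f.ppf(0.70, 1, 10_000), 2))   # about 1.07, the recommended maximum F-to-remove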
In comparing the forward selection and the backward elimination methods, we note that one advantage of the former is that it involves a smaller amount of computation than the latter. However, it may happen that two or more variables can together be a good predictive set while each variable taken alone is not very effective. In this case backward elimination would produce a better equation than forward selection. Neither method is expected to produce the best possible equation for a given number of variables to be included (other than one or the total set).

Stepwise procedure

One very commonly used technique that combines both of the above methods is called the stepwise procedure. In fact, the forward selection method is often called the forward stepwise method. In this case a step consists of adding a variable to the predictive equation. At step 0 the only variable used is the mean of the Y variable. At that step the program normally prints the computed F-to-enter for each variable. At step 1 the variable with the highest computed F-to-enter is entered; and so forth.

Similarly, the backward elimination method is often called the backward stepwise method. At step 0 the computed F-to-remove is calculated for each variable. In successive steps the variables are removed one at a time, as described above.

The standard stepwise regression programs do forward selection with the option of removing some variables already selected. Thus at step 0 only the Y mean is included. At step 1 the variable with the highest computed F-to-enter is selected. At step 2 a second variable is entered if any variable qualifies (i.e. if at least one computed F-to-enter is greater than the minimum F-to-enter). After the second variable is entered, the F-to-remove is computed for both variables. If either of them is lower than the maximum F-to-remove, that variable is removed. If not, a third variable is included if its computed F-to-enter is large enough. In successive steps this process is repeated. For a given equation, variables with small enough computed F-to-remove values are removed, and the variables with large enough computed F-to-enter values are included. The process stops when no variables can be deleted or added.

The choice of the minimum F-to-enter and the maximum F-to-remove affects both the nature of the selection process and the number of variables selected. For instance, if the maximum F-to-remove is much smaller than the minimum F-to-enter, then the process is essentially forward selection. In any case the minimum F-to-enter must be larger than the maximum F-to-remove; otherwise, the same variable will be entered and removed continuously. In many situations it is useful for the investigator to examine the full sequence until all variables are entered. This step can be accomplished by setting the minimum F-to-enter equal to a small value, such as 0.1 (or a corresponding P value of 0.99). In that case the maximum F-to-remove must then be smaller than 0.1.
After examining this sequence, the investigator can make a second run, using other F values. An alternative using BMDP 2R is to choose a double set of F-to-enter and F-to-remove values. The first set is chosen with small F levels and allows numerous variables, but not necessarily all, to enter in a forward stepwise mode. After all the variables are entered that meet this liberal set of F levels, the program operates in a backward stepwise manner using the second set of F levels, which are the levels the investigator wishes to actually use (see the manual for an example). In this way, stepwise selection is used not on all of the candidate variables but rather on those that have a better chance of being useful.

Values that have been found to be useful in practice are a minimum F-to-enter equal to 2.07 and a maximum F-to-remove of 1.07. With these recommended values, for example, the results for the chemical companies data are identical to those of the forward selection method presented in Table 8.3 up to step 4. At step 4 the variables in the equation and their computed F-to-remove are as shown in Table 8.5. Also shown in the table are the variables not in the equation and their computed F-to-enter. Since all the computed F-to-remove values are larger than the maximum F-to-remove of 1.07, none of the variables is removed. Also, since both of the computed F-to-enter values are smaller than the minimum F-to-enter of 2.07, no new variables are entered. The process terminates with four variables in the equation.

There are many situations in which the investigator may wish to include certain variables in the equation. For instance, the theory underlying the subject of the study may dictate that a certain variable or variables be used. The investigator may also wish to include certain variables in order for the results to be comparable with previously published studies. Similarly, the investigator may wish to consider two variables when both provide nearly the same computed F-to-enter. If the variable with the slightly lower F-to-enter is the preferred variable from other viewpoints, the investigator may force it to be included. This preference may arise from cost or simplicity considerations.
Table 8.5 Step 4 of the stepwise procedure for chemical companies data example

Variables      Computed        Variables not    Computed
in equation    F-to-remove     in equation      F-to-enter
D/E                3.03        EPS5                 0.38
PAYOUTR1          13.34        ROR5                 0.06
NPM1               9.51
SALESGR5           3.39
Most stepwise computer programs offer the user the option of forcing variables at the beginning of or later in the stepwise sequence, as desired.

We emphasize that none of the procedures described here are guaranteed or expected to produce the best possible regression equation. This comment applies particularly to the intermediate steps where the number of variables selected is more than one but less than P - 1. In the majority of applications of the stepwise method, investigators attempt to select two to five variables from the available set of independent variables. Programs that examine the best possible subsets of a given size are also available. These programs are discussed next.

8.7 SUBSET REGRESSION

The availability of the computer makes it possible to compute the multiple R² and the regression equation for all possible subsets of variables. For a specified subset size the 'best' subset of variables is the one that has the largest multiple R². The investigator can thus compare the best subsets of different sizes and choose the preferred subset. The preference is based on a compromise between the value of the multiple R² and the subset size. In section 8.4 three criteria were discussed for making this comparison, namely, adjusted R², Cp and AIC.

For a small number of independent variables, say three to six, the investigator may indeed be well advised to obtain all possible subsets. This technique allows detailed examination of all the regression models so that an appropriate model can be selected. SAS will evaluate all possible subsets if there are 11 or fewer independent variables. By using the BEST option, SAS will print out the best n sets of independent variables for n = 1, 2, 3, etc. BMDP 9R, using a method developed by Furnival and Wilson (1974), also prints out the best n subsets for various size subsets. Both SAS and BMDP use R², adjusted R² and Cp for selection criteria. For a given size, these three criteria are equivalent. But if size is not specified, they may select subsets of different sizes.

Table 8.6 includes part of the output produced by running SAS REG for the chemical companies data. As before, the best single variable is D/E, with a multiple R² of 0.224. Included in Table 8.6 are the next two best candidates if only one variable is to be used. Note that if either NPM1 or PAYOUTR1 is chosen instead of D/E, then the value of R² drops by about half. The relatively low value of R² for any of these variables reinforces our understanding that any one independent variable alone does not do well in 'explaining' the variation in the dependent variable P/E.

The best combination of two variables is NPM1 and PAYOUTR1, with a multiple R² of 0.408. This combination does not include D/E, as would be the case in stepwise regression (Table 8.3).
Table 8.6 Best three subsets for one, two, three or four variables for chemical companies data

Number of
variables   Names of variables                        R²      Adjusted R²    Cp      AIC
1           D/Eᵃ                                     0.224       0.197      13.30    57.1
1           NPM1                                     0.123       0.092      18.40    60.8
1           PAYOUTR1                                 0.109       0.077      19.15    61.3
2           NPM1, PAYOUTR1                           0.408       0.364       6.00    51.0
2           PAYOUTR1, ROR5                           0.297       0.245      11.63    56.2
2           ROR5, EPS5                               0.292       0.240      11.85    56.3
3           PAYOUTR1, NPM1, SALESGR5                 0.482       0.422       4.26    49.0
3           PAYOUTR1, NPM1, D/Eᵃ                     0.475       0.414       4.60    49.4
3           PAYOUTR1, NPM1, ROR5                     0.431       0.365       6.84    51.8
4           PAYOUTR1, NPM1, SALESGR5, D/Eᵃ           0.538       0.464       3.42    47.6
4           PAYOUTR1, NPM1, SALESGR5, EPS5           0.498       0.418       5.40    50.0
4           PAYOUTR1, NPM1, SALESGR5, ROR5           0.488       0.406       5.94    50.6
Best 5      PAYOUTR1, NPM1, SALESGR5, D/E, EPS5ᵃ     0.545       0.450       5.06    49.1
All 6       D/E, PAYOUTR1, NPM1, SALESGR5, EPS5,     0.546       0.427       7.00    51.0
            ROR5

ᵃ Combinations selected by the stepwise procedure.
In stepwise regression the two variables selected were D/E and PAYOUTR1, with a multiple R² of 0.290. So in this case stepwise regression does not come close to selecting the best combination of two variables. Here stepwise regression resulted in the fourth-best choice. Also, the third-best combination of two variables is essentially as good as the second best. The best combination of two variables, NPM1 and PAYOUTR1, as chosen by SAS REG, is interesting in light of our earlier interpretation of the variables in section 8.3. Variable NPM1 measures the efficiency of the operation in converting sales to earnings, while PAYOUTR1 measures the intention to plow earnings back into the company or distribute them to stockholders. These are quite different aspects of current company behavior. In contrast, the debt-to-equity ratio D/E may be, in large part, a historical carry-over from past operations or a reflection of management style.

For the best combination of three variables the value of the multiple R² is 0.482, only slightly better than the stepwise choice. If D/E is a lot simpler to obtain than SALESGR5, the stepwise selection might be preferred since the loss in the multiple R² is negligible.
Here the investigator, when given consideration of different subsets, might prefer the first (NPM1, PAYOUTR1, SALESGR5) on theoretical grounds, since it is the only option that explicitly includes a measure of growth (SALESGR5). (You should also examine the other variable combinations in light of the above discussion.)

Summarizing the results in the form of Table 8.6 is advisable. Then plotting the best combination of one, two, three, four, five or six variables helps the investigator decide how many variables to use. For example, Figure 8.1 shows a plot of the multiple R² and the adjusted R² versus the number of variables included in the best combination for the chemical companies data. Note that R² is a nondecreasing function. However, it levels off after four variables. The adjusted R² reaches its maximum with four variables (D/E, NPM1, PAYOUTR1 and SALESGR5) and decreases with five and six variables.

Figure 8.2 shows Cp versus P (the number of variables) for the best combinations for the chemical companies data. The same combination of four variables selected by the adjusted R² criterion minimizes Cp.
Figure 8.1 Multiple R² and Adjusted R² versus P for Best Subset with P Variables for Chemical Companies Data. (Variables: 1 = ROR5, 2 = D/E, 3 = SALESGR5, 4 = EPS5, 5 = NPM1, 6 = PAYOUTR1.)
Figure 8.2 Cp versus P for Best Subset with P Variables for Chemical Companies Data. (Variables: 1 = ROR5, 2 = D/E, 3 = SALESGR5, 4 = EPS5, 5 = NPM1, 6 = PAYOUTR1.)
A similar graph in Figure 8.3 shows that AIC is also minimized by the same choice of four variables. In this particular example all three criteria agree. However, in other situations the criteria may select different numbers of variables. In such cases the investigator's judgment must be used in making the final choice. Even in this case an investigator may prefer to use only three variables, such as NPM1, PAYOUTR1 and SALESGR5. The value of having these tables and figures is that they inform the investigator of how much is being given up, as estimated by the sample on hand.

You should be aware that variable selection procedures are highly dependent on the particular sample being used. For example, data from different years could give very different results. Also, any tests of hypotheses should be viewed with extreme caution since the significance levels are not valid when variable selection is performed. For further information on subset regression see Miller (1990).
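For readers who want to reproduce this kind of all-subsets summary outside the packages described in section 8.8, a minimal Python sketch is given below. The file and variable names are hypothetical, patterned after Table 8.1, and AIC is defined only up to an additive constant, so the printed values may be shifted relative to those reported by a particular program even though the ordering of subsets is the same.

from itertools import combinations
import numpy as np
import pandas as pd

def fit_stats(X, y):
    # Return the residual sum of squares and number of parameters (intercept included).
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid), X1.shape[1]

def all_subsets(df, y_name, x_names):
    y = df[y_name].to_numpy(dtype=float)
    n = len(y)
    tss = float(((y - y.mean()) ** 2).sum())
    rss_full, p_full = fit_stats(df[list(x_names)].to_numpy(dtype=float), y)
    mse_full = rss_full / (n - p_full)            # estimate of sigma squared from the full model
    rows = []
    for k in range(1, len(x_names) + 1):
        for subset in combinations(x_names, k):
            rss, p = fit_stats(df[list(subset)].to_numpy(dtype=float), y)
            r2 = 1 - rss / tss
            adj_r2 = 1 - (rss / (n - p)) / (tss / (n - 1))
            cp = rss / mse_full - (n - 2 * p)     # Mallows' Cp
            aic = n * np.log(rss / n) + 2 * p     # one common form of AIC
            rows.append((subset, r2, adj_r2, cp, aic))
    return pd.DataFrame(rows, columns=["subset", "R2", "adjR2", "Cp", "AIC"])

# chem = pd.read_csv("chemco.csv")   # hypothetical file
# tab = all_subsets(chem, "PE", ["ROR5", "DE", "SALESGR5", "EPS5", "NPM1", "PAYOUTR1"])
# print(tab.sort_values("Cp").head())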
Figure 8.3 AIC Versus P for Best Subset with P Variables for Chemical Companies Data.
8.8 DISCUSSION OF COMPUTER PROGRAMS
In Table 8.7 a summary of the options available in the six statistical packages is given. The major difference is that only BMDP and SAS offer all or best subset selection. Note that by repeatedly selecting which variables are in the regression model manually, you can perform subset regression for yourself with any of the programs. This would be quite a bit of work if you have numerous independent variables. Alternatively, by relaxing the entry criterion for forward stepwise you can often reduce the candidate variables to a smaller set and then try backward stepwise as an alternative.

SAS was used to obtain the forward stepwise and best subset results, operating on a data file called 'chemco'. The model was written out with the desired dependent and independent variables under consideration. The basic PROC statements for stepwise regression using SAS REG are as follows:
proc reg data = chemco outest = chem1;
model pe = ror5 de salesgr5 eps5 npm1 payoutr1
/selection = stepwise;
run;
The data file is called chemco and the stepwise option is used for selection.
Table 8.7 Summary of computer output for various variable selection methods Output Variable selection statistic (controls order of selection) F-to-enter or remove Alpha for F R2 Adjusted R 2 Cp
Variable selection criteria (printed in output) R2 Adjusted R 2
cp
AIC Selection method Interactive mode of entry Forward Backward Stepwise Forcing variable entry Best subsets All subsets Swapping of variables Effects of variable plots Partial residual plots Partial regression plots Augmented partial residual plot
BMDP
SAS
SPSS
STATA
STATISTICA
2R
REG
REGRESSION REGRESSION
regress
Linear Regress
9R 9R 9R
REG REG REG
2R, 9R 2R, 9R 9R New System
REG REG REG REG
REGRESSION REGRESSION
regress regress
2R 2R 2R 2R 2R 9R 9R 2R
REG REG REG REG REG REG REG REG
REGRESSION REGRESSION REGRESSION REGRESSION REGRESSION
lR, 2R ·New system
REG
SYSTAT
MGLH
regress regress regress
Linear Regress Linear Regress
Linear Regress
MGLH MGLH
MGLH MGLH MGLH MGLH MGLH
REGRESSION REGRESSION
fit fit fit
Linear Regress
MGLH MGLH
For other data sets, if too few variables enter the model, the entry levels can be changed from the default of 0.15 for stepwise to 0.25 by using SLE = 0.25 in the model statement.

The subset selection method was used to obtain subset regression and to get the partial regression residual plots (section 8.9) by adding the word 'partial'. 'Details' was added to get additional output as follows:

proc reg data = chemco outest = chem2;
model pe = ror5 de salesgr5 eps5 npm1 payoutr1
/selection = rsquare adjrsq cp aic partial details;
run;
The data file called chemco had been created prior to the run and the variables were named so they could be referred to by name. Four selection criteria were listed to obtain the results for Table 8.6. The manual furnishes numerous examples that are helpful in understanding the instructions.

The most intuitive interactive method for entering and removing variables is available in BMDP New System, where the program sets up two windows, one for the variables in the model and one for the candidate variables. Variables can be dragged back and forth between the two windows using the mouse. Output automatically appears that shows the results of the changes. SYSTAT is also very straightforward to use and the output adapts to the variables the user types in. All the statistical packages have tried to make their stepwise programs attractive to users as these are widely used options.
8.9 DISCUSSION AND EXTENSIONS

The process leading to a final regression equation involves several steps and tends to be iterative. The investigator should begin with data screening, eliminating blunders and possibly some outliers and deleting variables with an excessive proportion of missing values. Initial regression equations may be forced to include certain variables as dictated by the underlying situation. Other variables are then added on the basis of selection procedures outlined in this chapter. The choice of the procedure depends on which computers and programs are accessible to the investigator.

The results of this phase of the computation will include various alternative equations. For example, if the forward selection procedure is used, a regression equation at each step can be obtained. Thus in the chemical companies data the equations resulting from the first four steps are as given in Table 8.8. Note that the coefficients for given variables could vary appreciably from one step to another. In general, coefficients for variables that are independent of the remaining predictor variables tend to have stable coefficients.
Table 8.8 Regression coefficients for chemical companies: first four steps

                              Regression coefficients
Step    Intercept      D/E      PAYOUTR1     NPM1     SALESGR5
1         11.81       -5.86         -          -          -
2          9.86       -5.34        4.46        -          -
3          5.08       -3.44        8.61       0.36        -
4          1.28       -3.16       10.75       0.35       0.19
The effect of a variable whose coefficient is highly dependent on the other variables in the equation is more difficult to interpret than the effect of a variable with stable coefficients.

A particularly annoying situation occurs when the sign of a coefficient changes from one step to the next. A method for avoiding this difficulty is to impose restrictions on the coefficients (Chapter 9).

For some of the promising equations in the sequence given in Table 8.6, another program could be run to obtain extensive residual analysis and to determine influential cases. Examination of this output may suggest further data screening. The whole process could then be repeated.

Several procedures can be used. One strategy is to obtain plots of outliers in Y, outliers in X or leverage, and influential points after each step. This can result in a lot of graphical output, so often investigators will just use those plots for the final equation selected. Note that if you decide not to use the regression equation as given in the final step, the diagnostics associated with the final step do not apply to your regression equation and the program should be re-run to obtain the appropriate diagnostics.

Another procedure that works well when variables are added one at a time is to examine what are called added variable or partial plots. These plots are useful in assessing the linearity of the candidate variable to be added, given the effects of the variable(s) already in the equation. They are also useful in detecting influential points. Here we will describe three types of partial plots that vary in their usefulness to detect nonlinearity or influential points (Chatterjee and Hadi, 1988; Fox and Long, 1990; Draper and Smith, 1981; Cook and Weisberg, 1982). To clarify the plots, some formulas will be given. Suppose we have already fitted an equation given by

Y = A + BX

where BX refers to one or more X's already in the equation, and we want to evaluate a candidate variable Xk. First, several residuals need to be defined. The residual

e(Y.X) = Y - (A + BX)
is the usual residual in Y when the equation with the X variable(s) only is used. Next, the residual in Y when both the X variable(s) and Xk are used is given by

e(Y.X, Xk) = Y - (A' + B'X + B'kXk)

Then, using Xk as if it were a dependent variable being predicted by the other X variable(s), we have

e(Xk.X) = Xk - (A" + B"X)

Partial regression plots (also called added variable plots) have e(Y.X) plotted on the vertical axis and e(Xk.X) plotted on the horizontal axis. Partial regression plots are used to assess the nonlinearity of the Xk variable and influential points. If the points do not appear to follow a linear relationship, then lack of linearity of Y as a function of Xk is possible (after taking the effect of X into account). In this case, transformations of Xk should be considered. Also, points that are distant from the majority of the points are candidates for being outliers. Points that are outliers on both the vertical and horizontal axes have the most influence. Figure 8.4 presents such a plot where P/E is Y, the variables NPM1 and PAYOUTR1 are the X variables, and SALESGR5 is Xk. This plot was obtained using the SAS partial option.

If a straight line is fitted to the points in Figure 8.4, we would get a line going through the origin (0, 0) with a slope Bk. The residuals from such a line are e(Y.X, Xk). In examining Figure 8.4, it can be seen that the means of e(Y.X) and e(Xk.X) are both zero. If a straight line was fitted to the points it would have a positive slope. There is no indication of nonlinearity so there is no reason to consider transforming the variable SALESGR5.
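The two sets of residuals that make up such a plot are easy to compute directly. A minimal Python sketch, assuming the Table 8.1 variable names and a hypothetical data file, is given below; it is a cross-check of what the SAS partial option produces rather than a description of that option.

import numpy as np
import pandas as pd

def residuals(X, y):
    # Residuals from the least squares regression of y on X (intercept included).
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return y - X1 @ beta

# chem = pd.read_csv("chemco.csv")                        # hypothetical file
# X  = chem[["NPM1", "PAYOUTR1"]].to_numpy(dtype=float)   # variables already in the equation
# y  = chem["PE"].to_numpy(dtype=float)
# xk = chem["SALESGR5"].to_numpy(dtype=float)             # candidate variable
# e_y_x  = residuals(X, y)    # e(Y.X):  vertical axis of the partial regression plot
# e_xk_x = residuals(X, xk)   # e(Xk.X): horizontal axis
# The slope of the least squares line through these points is Bk, the coefficient
# SALESGR5 would receive if it were added to the equation.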
PE;
I
5 +
+
1
1
1 1
1
Q +
2 1
1 1
1
1 1
-s + -4
11 1
+ 1 1
1
--+- -----+-----6
1 1 1 1 1 1 111
1
1
+ +------+------+------+------+------+------+------+------+--
-2
0
2
4
SALESGR5
};'·
lgute 8 4
.
· · Partial Regression Residual Plot.
6
8
10
12
14
There is one point to the right that is somewhat separated from the other points, particularly in the horizontal direction. In other words, using NPM1 and PAYOUTR1 to predict SALESGR5, we have one company that had a larger SALESGR5 than might be expected.

In general, if Xk is statistically independent of the X variable(s) already in the equation, then its residuals would have a standard deviation that is approximately equal to the standard deviation of the original data for variable Xk. Otherwise the residuals would have a smaller standard deviation. In this example SALESGR5 has a correlation of -0.45 with PAYOUTR1 and is also correlated with NPM1 (Table 8.2), so we would expect the standard deviation of the residuals to be smaller, and it is.

The one isolated point does not appear to be an outlier in terms of the Y variable, P/E. Examining the raw data in Table 8.1 shows one company, Dexter, that appears to have a large SALESGR5 relative to its size of NPM1 and PAYOUTR1. To determine if that is the company in question, the PAINT option in SAS could be used. Since it is not an outlier in Y, we did not remove it.

Partial residual plots (also called component plus residual plots) have e(Y.X, Xk) + BkXk on the vertical axis and Xk on the horizontal axis. A straight line fitted to this plot will also have a slope of Bk and the residuals are e(Y.X, Xk). However, the appearance of partial regression and partial residual plots can be quite different. The latter plot may also be used to assess linearity of Xk and warn the user when a transformation is needed. When there is a strong multiple correlation between Xk and the X variable(s) already entered, this plot should not be used to assess the relationship between Y and Xk (taking into account the other X variable(s)) as it can give an incorrect impression. It may be difficult to assess influential points from this plot (Chatterjee and Hadi, 1988).

Mallows (1986) suggested an additional component to be added to partial residual plots, and this plot is called an augmented partial residual plot. The square of Xk is added to the equation, and the residuals from this augmented equation plus BkXk + Bk+1Xk² are plotted on the vertical axis while Xk is again plotted on the horizontal axis. If linearity exists, this augmented partial residual plot and the partial residual plot will look similar. Mallows included some plots that demonstrate the ease of interpretation with augmented partial residual plots. When nonlinearity exists, the augmented plot will look different from the partial residual plot and is simpler to assess.

An alternative approach to variable selection is the so-called stagewise regression procedure. In this method the first stage (step) is the same as in forward stepwise regression, i.e. the ordinary regression equation of Y on the most highly correlated independent X variable is computed. Residuals from this equation are also computed. In the second step the X variable that is most highly correlated with these residuals is selected. Here the residuals are considered the 'new' Y variables.
The regression coefficient of these residuals on the X variable selected in step 2 is computed. The constant and regression coefficient from the first step are combined with the regression coefficient from the second step to produce the equation with two X variables. This equation is the two-stage regression equation. The process can be continued to any number of desired stages.

The resulting stagewise regression equation does not fit the data as well as the least squares equation using the same variables (unless the X variables are independent in the sample). However, stagewise regression has some desirable advantages in certain applications. In particular, econometric models often use stagewise regression to adjust the data for some factor, such as a trend or seasonality. Another feature is that the coefficients of the variables already entered are preserved from one stage to the next. For example, for the chemical companies data the coefficient for D/E would be preserved over the successive stages of a stagewise regression. This result can be contrasted with the changing coefficients in the stepwise process summarized in Table 8.8. In behavioral science applications the investigator can determine whether the addition of a variable reflecting a score or an attitude scale improves prediction of an outcome over using only demographic variables.
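A minimal sketch of this stagewise procedure, written in Python with hypothetical array names, is given below; the X variables are supplied in the order the stages are to be run.

import numpy as np

def simple_fit(x, y):
    # Intercept and slope of the least squares line of y on a single x.
    slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

def stagewise(y, x_columns):
    coefs, resid, intercept = [], y.astype(float).copy(), 0.0
    for x in x_columns:
        a, b = simple_fit(x, resid)   # regress the current residuals on the next X
        intercept += a
        coefs.append(b)               # earlier coefficients are never revised
        resid = resid - (a + b * x)   # residuals carried forward to the next stage
    return intercept, coefs

# Example with the chemical companies data (hypothetical arrays pe, de, payoutr1):
# intercept, coefs = stagewise(pe, [de, payoutr1])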
8.10 WHAT TO WATCH OUT FOR

The variable selection techniques described in this chapter are used for the variable-X regression model. Therefore, the problem areas discussed at the ends of Chapters 6 and 7 also apply to this chapter. Only possible problems connected with variable selection will be mentioned here. Areas of special concern are as follows.
1. Selection of the best possible variables to be checked for inclusion in the model is extremely important. As discussed in section 7.11, if an appropriate set of variables is not chosen, then the coefficients obtained for the 'good' variables will be biased by the lack of inclusion of other needed variables (if the 'good' variables and the needed variables are correlated). Lack of inclusion of needed or missing variables also results in an overestimate of the standard errors of the slopes of the variables used. Also, the predicted Y is biased unless the missing variables are uncorrelated with the variables included or have zero slope coefficients (Chatterjee and Price, 1991). If the best candidate variables are not measured in the first place, then no method of variable selection can make up for it later.
2. Since the programs select the 'best' variable or set of variables for testing in the various variable-selection methods, the level of significance of the resulting F test cannot be read from a table of the F distribution. Any levels of significance should be viewed with caution. The test should be viewed as a screening device, not as a test of significance.
3. Minor differences can affect the order of entry of the variables. For example, in forward stepwise selection the next variable chosen is the one with the largest F-to-enter. If the F-to-enter is 4.444444 for variable one and 4.444443 for variable two, then variable one enters first unless some method of forcing is used or the investigator runs the program in an interactive mode and intervenes. The rest of the analysis can be affected greatly by the result of this toss-up situation.
4. Spurious variables can enter, especially for small samples. These variables may enter due to outliers or other problems in the data set. Whatever the cause, investigators will occasionally obtain a variable from mechanical use of variable-selection methods that no other investigator can duplicate and which would not be expected for any theoretical reason. Flack and Chang (1987) discuss situations in which the frequency of selecting such 'noise' variables is high.
5. Methods using Cp and AIC, which depend on a good estimate of σ², may not actually be better in practice than other methods if a proper estimate for σ² does not exist.
6. Especially in forward stepwise selection, it may be useful to perform regression diagnostics on the entering variable given the variables already entered, as described in the previous section.
7. Another reason for being cautious about the statements made concerning the results of stepwise or best subsets regression analysis is that usually we are using a single sample to estimate the population regression equation. The results from another sample from the same population may be different enough to result in different variables entering the equation or a different order of entry. If they have a large sample, some investigators try to check this by splitting their sample in half, running separate regression analyses on both halves, and then comparing the results. If these differ markedly, then the reason for this difference should be explored. Another test is to split the sample, using three-fourths of the sample to obtain the regression equation, and then applying the results of the equation to the other fourth. That is, the unused X's and Y's from the one-fourth sample are used in the equation, and the simple correlation of the Y's and the predicted Y's is computed and compared to the multiple correlation from the same data. This simple correlation is usually smaller than the multiple correlation from the original regression equation computed from the three-fourths sample. It can also be useful to compare the distribution of the residuals from the three-fourths sample to that of the unused one-fourth sample. If the multiple correlation and the residuals are markedly different in the one-fourth sample, using the multiple regression for prediction in the future may be questionable. (A sketch of this split-sample check is given after this list.)
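The following Python sketch illustrates the split-sample check described in item 7. The array names are hypothetical: X is an n-by-p array of the selected predictors and y is the outcome.

import numpy as np

def fit(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def split_sample_check(X, y, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_fit = int(0.75 * len(y))                 # three-fourths used for fitting
    fit_i, test_i = idx[:n_fit], idx[n_fit:]
    beta = fit(X[fit_i], y[fit_i])
    mult_r = np.corrcoef(y[fit_i], predict(beta, X[fit_i]))[0, 1]    # multiple R in the fitting sample
    simple_r = np.corrcoef(y[test_i], predict(beta, X[test_i]))[0, 1]  # correlation in the held-out fourth
    return mult_r, simple_r                     # simple_r is usually the smaller of the two

# mult_r, simple_r = split_sample_check(X, y)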
SUMMARY

Variable selection techniques are helpful in situations in which many independent variables exist and no standard model is available. The methods described in this chapter constitute one part of the process of making the final selection of one or more equations.

Although these methods are useful in exploratory situations, they should not be used as a substitute for modeling based on the underlying scientific problem. In particular, the variables selected by the methods described here depend on the particular sample on hand, especially when the sample size is small. Also, significance levels are not valid in variable selection situations.

These techniques can, however, be an important component in the modeling process. They frequently suggest new variables to be considered. They also help cut down the number of variables so that the problem becomes less cumbersome both conceptually and computationally. For these reasons variable selection procedures are among the most popular of the packaged options. The ideas underlying variable selection have also been incorporated into other multivariate programs (e.g. Chapters 11, 12 and 13).

In the intermediate stages of analysis the investigator can consider several equations as suggested by one or more of the variable selection procedures. The equations selected should be considered tentative until the same variables are chosen repeatedly. Consensus will also occur when the same variables are selected by different subsamples within the same data set or by different investigators using separate data sets.
REFERENCES

References preceded by an asterisk require strong mathematical background.

*Atkinson, A.C. (1980). A note on the generalized information criterion for a choice of a model. Biometrika, 67, 413-18.
Bendel, R.B. and Afifi, A.A. (1977). Comparison of stopping rules in forward stepwise regression. Journal of the American Statistical Association, 72, 46-53.
*Bozdogan, H. (1987). Model selection and Akaike's Information Criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345-70.
Brigham, E.F. (1980). Fundamentals of Financial Management, 2nd edn. Dryden Press, Hinsdale, IL.
Chatterjee, S. and Hadi, A.S. (1988). Sensitivity Analysis in Linear Regression. Wiley, New York.
Chatterjee, S. and Price, B. (1991). Regression Analysis by Example, 2nd edn. Wiley, New York.
*Cook, R.D. and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman & Hall, London.
Draper, N.R. and Smith, H. (1981). Applied Regression Analysis, 2nd edn. Wiley, New York.
Flack, V.F. and Chang, P.C. (1987). Frequency of selecting noise variables in a subset regression analysis: A simulation study. The American Statistician, 41, 84-86.
Fox, J. and Long, J.S. (1990). Modern Methods of Data Analysis. Sage, Newbury Park, CA.
*Furnival, G.M. and Wilson, R.W. (1974). Regressions by leaps and bounds. Technometrics, 16, 499-512.
Gorman, J.W. and Toman, R.J. (1966). Selection of variables for fitting equations to data. Technometrics, 8, 27-51.
Hocking, R.R. (1976). The analysis and selection of variables in linear regression. Biometrics, 32, 1-50.
Mallows, C.L. (1973). Some comments on Cp. Technometrics, 15, 661-76.
Mallows, C.L. (1986). Augmented partial residuals. Technometrics, 28, 313-19.
Miller, A.J. (1990). Subset Selection in Regression. Chapman & Hall, London.
*Nishi, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. The Annals of Statistics, 12, 758-65.
*Stuart, A. and Ord, J.K. (1991). Advanced Theory of Statistics, Vol. 2. Oxford University Press, New York.
Theil, H. (1971). Principles of Econometrics. Wiley, New York.
FURTHER READING

*Beale, E.M.L., Kendall, M.G. and Mann, D.W. (1967). The discarding of variables in multivariate analysis. Biometrika, 54, 357-66.
*Efroymson, M.A. (1960). Multiple regression analysis, in Mathematical Methods for Digital Computers (eds Ralston and Wilf). Wiley, New York.
Forsythe, A.B., Engleman, L., Jennrich, R. and May, P.R.A. (1973). Stopping rules for variable selection in multiple regression. Journal of the American Statistical Association, 68, 75-77.
Hocking, R.R. (1972). Criteria for selection of a subset regression: Which one should be used. Technometrics, 14, 967-70.
Mantel, N. (1970). Why stepdown procedures in variable selection. Technometrics, 12, 621-26.
Myers, R.H. (1986). Classical and Modern Regression with Applications. Duxbury Press, Boston, MA.
Pope, P.T. and Webster, J.T. (1972). The use of an F-statistic in stepwise regression procedures. Technometrics, 14, 327-40.
Pyne, D.A. and Lawing, W.D. (1974). A note on the use of the Cp statistic and its relation to stepwise variable selection procedures. Technical Report no. 210, Johns Hopkins University.
Wilkinson, L. and Dallal, G.E. (1981). Tests of significance in forward selection regression with an F-to-enter stopping rule. Technometrics, 23, 377-80.
PROBLEMS

8.1 Use the depression data set given in Table 3.5. Using CESD as the dependent variable, and age, income and level of education as the independent variables, run a forward stepwise regression program to determine which of the independent variables predict level of depression for women.
8.2 Repeat Problem 8.1, using subset regression, and compare the results.
8.3 Forbes gives, each year, the same variables listed in Table 8.1 for the chemical industry. The changes in lines of business and company mergers resulted in a somewhat different list of chemical companies in 1982. We have selected a subset of 13 companies that are listed in both years and whose main product is chemicals. Table 8.9 includes data for both years.
Table 8.9 Chemical companies' financial performance as of 1980 and 1982 (subset of 13 of the chemical companies)

Company              Symbol   P/E   ROR5   D/E   SALESGR5   EPS5   NPM1   PAYOUTR1
1980
Diamond Shamrock      dia       9   13.0   0.7     20.2     15.5    7.2     0.43
Dow Chemical          dow       8   13.0   0.7     17.2     12.7    7.3     0.38
Stauffer Chemical     stf       8   13.0   0.4     14.5     15.1    7.9     0.41
E. I. duPont          dd        9   12.2   0.2     12.9     11.1    5.4     0.57
Union Carbide         uk        5   10.0   0.4     13.6      8.0    6.7     0.32
Pennwalt              psm       6    9.8   0.5     12.1     14.5    3.8     0.51
W. R. Grace           gra      10    9.9   0.5     10.2      7.0    4.8     0.38
Hercules              hpc       9   10.3   0.3     11.4      8.7    4.5     0.48
Monsanto              mtc      11    9.5   0.4     13.5      5.9    3.5     0.57
American Cyanamid     acy       9    9.9   0.4     12.1      4.2    4.6     0.49
Celanese              cz        7    7.9   0.4     10.8     16.0    3.4     0.49
Allied Chemical       acd       7    7.3   0.6     15.4      4.9    5.1     0.27
Rohm & Haas           roh       7    7.8   0.4     11.0      3.0    5.6     0.32

1982
Diamond Shamrock      dia       8    9.8   0.5     19.7     -1.4    5.0     0.68
Dow Chemical          dow      13   10.3   0.7     14.9      3.5    3.6     0.88
Stauffer Chemical     stf       9   11.5   0.3     10.6      7.6    8.7     0.46
E. I. duPont          dd        9   12.7   0.6     19.7     11.7    3.1     0.65
Union Carbide         uk        9    9.2   0.3     10.4      4.1    4.6     0.56
Pennwalt              psm      10    8.3   0.4      8.5      3.7    3.1     0.74
W. R. Grace           gra       6   11.4   0.6     10.4      9.0    5.5     0.39
Hercules              hpc      14   10.0   0.4     10.3      9.0    3.2     0.71
Monsanto              mtc      10    8.2   0.3     10.8     -0.2    5.6     0.45
American Cyanamid     acy      12    9.5   0.3     11.3      3.5    4.0     0.60
Celanese              cz       15    8.3   0.6     10.3      8.9    1.7     1.23
Allied Corpᵃ          ald       6    8.2   0.3     17.1      4.8    4.4     0.36
Rohm & Haas           roh      12   10.2   0.3     10.6     16.3    4.3     0.45

Source: Forbes, vol. 127, no. 1 (January 5, 1981) and Forbes, vol. 131, no. 1 (January 3, 1983).
ᵃ Name and symbol changed.
Do a forward stepwise regression analysis, using P/E as the dependent variable and ROR5, D/E, SALESGR5, EPS5, NPM1 and PAYOUTR1 as independent variables, on both years' data, and compare the results. Note that this period showed little growth for this subset of companies, and the variable(s) entered should be evaluated with that idea in mind.
8.4 For adult males it has been demonstrated that age and height are useful in predicting FEV1. Using the data given in Appendix A, determine whether the regression plane can be improved by also including weight.
8.5 Using the data given in Table 8.1, repeat the analyses described in this chapter with (P/E)^1/2 as the dependent variable instead of P/E. Do the results change much? Does it make sense to use the square root transformation?
8.6 Use the data you generated from Problem 7.7, where X1, X2, ..., X9 are the independent variables and Y is the dependent variable. Use the generalized linear hypothesis test to test the hypothesis that β4 = β5 = ... = β9 = 0. Comment in light of what you know about the population parameters.
8.7 For the data from Problem 7.7, perform a variable selection analysis, using the methods described in this chapter. Comment on the results in view of the population parameters.
8.8 In Problem 7.7 the population multiple R² of Y on X4, X5, ..., X9 is zero. However, from the sample alone we don't know this result. Perform a variable selection analysis on X4 to X9, using your sample, and comment on the results.
8.9 (a) For the lung function data set in Appendix A with age, height, weight and FVC as the candidate independent variables, use subset regression to find which variables best predict FEV1 in the oldest child. State the criteria you use to decide. (b) Repeat, using forward selection and backward elimination. Compare with part (a).
8.10 Force the variables you selected in Problem 8.9(a) into the regression equation with OCFEV1 as the dependent variable, and test whether including the FEV1 of the parents (i.e. the variables MFEV1 and FFEV1 taken as a pair) in the equation significantly improves the regression.
8.11 Using the methods described in this chapter and the family lung function data in Appendix A, and choosing from among the variables OCAGE, OCWEIGHT, MHEIGHT, MWEIGHT, FHEIGHT and FWEIGHT, select the variables that best predict height in the oldest child. Show your analysis.
8.12 From among the candidate variables given in Problem 8.11, find the subset of three variables that best predicts height in the oldest child, separately for boys and girls. Are the two sets the same? Find the best subset of three variables for the group as a whole. Does adding OCSEX into the regression equation improve the fit?
9

Special regression topics

9.1 SPECIAL TOPICS IN REGRESSION ANALYSIS

In this chapter we present brief descriptions of several topics in regression analysis - some that might occur because of problems with the data set and some that are extensions of the basic analysis. Section 9.2 discusses methods used to alleviate the effects of missing values. Section 9.3 defines dummy variables and shows how to code them. It also includes a discussion of the interpretation of the regression equation when dummy variables are used. In section 9.4 the use of linear constraints on the regression coefficients is explained and an example is given on how it can be used for spline regression. In section 9.5 methods for checking for multicollinearity are discussed. Some easy methods that are sometimes helpful are given. Finally, ridge regression, a method for doing regression analysis when multicollinearity is present, is presented in section 9.6.
9.2 MISSING VALUES IN REGRESSION ANALYSIS

We included in section 3.5 a discussion of how missing values occur and how to find which variables or cases have missing values. Missing values can be an aggravating problem in regression analysis since often too many cases are excluded for missing values in standard packaged programs. The reason for this is the way statistical software excludes cases. SAS is typical in how missing values are handled. If you include variables in the VAR statement in SAS that have any missing values, those cases will be excluded from the regression analysis even though you do not include the particular variables with missing values in the regression equation. The REG procedure assumes you may want to use all the variables and deletes cases with any missing variables. Sometimes by careful pruning of the X variables included in the VAR statement for particular regression equations, dramatically fewer cases with missing values will be deleted.

The subject of randomly missing values in regression analysis has received wide attention from statisticians.
Afifi and Elashoff (1966) present a review of the literature up to 1966, and Hartley and Hocking (1971) present a simple taxonomy of incomplete data problems (also Little, 1982). Little (1992) presents a review of methods for handling missing X's in regression analysis that is both current and informative.

Types of missing values

Missing values are often classified into three types (Little and Rubin, 1987, 1990). The first of these types is called missing completely at random (MCAR). For example, if a chemist knocks over a test tube or loses a part of a record, then these observations could be considered MCAR. Here, we assume that being missing is independent of both X and Y. In other words, the missing cases are a random subsample of the original sample. In such a situation there should be no major difference between those cases with missing values and those without, on either the X or the Y variables. This can be checked by dividing the cases into groups based on whether or not they have missing values. The groups can then be compared by a series of t tests or by more complicated multivariate analyses (e.g. the analysis given in Chapter 12). But it should be noted that lack of a significant difference does not necessarily prove that the data are missing completely at random. Sometimes the chance of making a beta error is quite large. For example, if the original sample is split into two samples, one with a small number of cases, then the power of the test could be low. It is always recommended that the investigator examine the pattern of the missing values. Clusters of missing data may be an indication of lack of MCAR. This can be done by examining the data array of any of the statistical or spreadsheet programs and noting where the missing values fall. Programs such as BMDP AM or BMDP New System give an array where the values that are missing are filled in and the data that are present are left blank so it is easy to see patterns.

The second type of missing value is called missing at random (MAR). Here the probability that Y is missing depends on the value of X but not Y. In this case the observed values of Y are not a random subsample of the original sample of Y, but they are a random sample of the subsamples within subclasses defined by values of X. For example, poorly educated responders may have trouble answering some questions. Thus although they are included as cases, they may have missing values on the difficult questions. Now, suppose that education is an X variable and the answer to a difficult question is the Y variable; then if the uneducated who respond were like the uneducated in the population, the data would be missing at random.

The third type of missing data (nonrandom) occurs when being missing is related to both X and Y. For example, if we were predicting levels of depression and both being depressed (Y) and being poor (X) influence whether or not a person responded, then the third type of missing data has occurred.
Such missing data may show up as 'don't knows' or missing values that are difficult or impossible to relate clearly to any variable. In practice, missing values often occur for bizarre reasons that are difficult to fit into any statistical model. Even careful data screening can cause missing values, since observations that are declared outliers are commonly treated as missing values.

Options for treating missing values

The strategies that are used to handle missing values in regression analysis depend on what types of missing data occur. For nonrandom missing data, there is no theoretical justification for using any of the formal methods for treating missing values. In this case, some investigators analyze the complete cases only and place caveats about the population to which inferences can be drawn. In such a situation, the best hope is that the number of incomplete cases is small.

In contrast, when the data are missing completely at random, several options are available since the subject of randomly missing observations in regression analysis has received wide attention from statisticians. The simplest option is to use only the cases with complete data. Using this option when the data are MCAR will result in unbiased estimates of the regression parameters. The difficulties with this approach are the loss of information from discarding cases and possible unequal sample sizes from one analysis to another. If only Y values are missing, there is no advantage in doing anything other than using the complete cases.

Another method is imputation of the missing values. Various methods are used to fill in, or impute, the missing data and then perform the regression analysis on the completed data set. In most cases these methods are used on the X variables, although they can be used on the Y variable. Note that there is only one Y variable but often there are numerous candidates for X variables, and a missing value in any one of the X variables will cause a case to be missing.

One method of filling in the missing values is mean substitution. Here the missing values for a given variable are replaced by the mean value for that variable.

A more complicated imputation method involves first computing a temporary regression equation where the variable with missing value(s) is considered temporarily to be a dependent variable, with the independent variables being variables that are good predictors. Then this equation is used to 'predict' the missing values. For example, suppose that for a sample of schoolchildren some heights are missing. The investigator could first derive the regressions of height on age for boys and girls and then use these regressions to predict the missing heights. If the children have a wide range of ages, this method presents an improvement over mean substitution (for simple regression, see Afifi and Elashoff, 1969a, b).
The variable height could then be used as an X variable in a future desired regression model or as a Y variable. These methods have been called deterministic methods.

A variation on the above method is to add to the predicted value of the temporary dependent variable a residual value chosen at random from the residuals from the regression equation (Kalton and Kasprzyk, 1986). This method is called a stochastic substitution. The objective is to obtain values for the missing data that are similar to those that are not missing. Note that when a regression equation is used without adding a randomly chosen residual, all the missing values are filled in with values that are right on the regression line (or plane). This same technique can be used with mean substitution, where the 'residuals' are residuals or deviations from the mean.

'Hot deck' imputation, where the missing data are replaced by a result from a case that is not missing and which is similar on other variables to the case that has the missing data, can also be used (David et al., 1986; Ford, 1983). More recent work has focused on multiple imputation (Little, 1992; Little and Rubin, 1990). This subject is beyond the scope of this book.

Another nonimputation method, which applies to regression analysis with normally distributed X variables, was studied by Orchard and Woodbury (1972) and further developed by Beale and Little (1975). This method is based on the theoretical principle of maximum likelihood used in statistical estimation and it has good theoretical properties. It involves direct analysis of the incomplete data by maximum likelihood methods. Further explanation of this method is beyond the scope of this text (Little and Rubin, 1987).

There are other methods that do not use imputation. One that is available in several statistical packages is called pairwise deletion of cases. Pairwise deletion is an alternative method for determining which variables should be deleted in computing the variance-covariance or the correlation array. As stated earlier, most statistical packages delete a case if any of the variables listed for possible inclusion has a missing value. Pairwise deletion is done separately for each pair of variables actually used in the regression analysis. The result is that any two covariances (or correlations) could be computed using somewhat different cases. A case is not removed because it is missing data in, say, one variable; only the covariances or correlations computed using that variable are affected.

In spite of this large selection of possible techniques available for handling missing values when the data are missing completely at random, we still recommend using only the cases with complete observations in most instances. We believe it is the most understandable analysis, it certainly is the simplest to do, and it leads to unbiased results.

If the sample size is small, or if for other reasons it is desirable to adjust for missing values when the data are MCAR, we would suggest the maximum likelihood method if the data are normally distributed (Little and Rubin, 1990, say that this method does well even when the data are not normal and present methods to alleviate lack of normality).
The drawback here is that, among the six statistical packages we are reviewing, the method is available only as a programmed feature in the BMDP AM program.

Imputation has the advantage that if you impute missing values for one or more cases, the imputed values can be used in several analyses. But imputation tends to bias the estimates of the variances and covariances, leading to biased estimates of the regression coefficients. It also artificially inflates the sample size, and the programs will then treat the data set, which is a mixture of real and imputed data, as if it contained all real data. Confidence intervals and levels of significance can be grossly overstated. This is especially true of mean imputation. The only advantages of mean substitution are that it is very simple and can easily be done without a special option in a statistical package. If very few missing values were imputed in a sizable data set, the bias may not be too serious. Imputation using the regression approach, with or without the stochastic adjustment, or hot deck imputation has the advantage of replacing the missing values with data that are more similar to other data in the set. It makes good use of whatever information is available in the data set. But it is hard to predict what the possible biases will be, and the problem of trusting confidence limits and P values still exists. It also takes considerable time and effort to do unless a statistical package is used that provides this option.

We do not recommend the pairwise deletion methods unless the investigator performs runs with other methods also as a check. In simulations it has been shown to work in some instances and to lead to problems in others. It may result in making the computations inaccurate or impossible to do because of the inconsistencies introduced by different sample sizes for different correlations. But it is available in several statistical packages so it is not difficult to perform.

All in all, if the data are missing completely at random, in most analyses the investigator is well advised to simply analyze the complete cases only and also report how many cases were deleted due to missing values.

When the data are missing at random (MAR), then using complete cases does not result in unbiased estimates of all the parameters. For example, suppose that simple linear regression is being done and the scatter diagram is missing more points than would be expected for some X values - perhaps more Y values are missing for small X values. Then the estimates of the mean and variance of Y may be quite biased, but the estimate of the regression line may be satisfactory (Little and Rubin, 1987). In addition, other investigators doing studies with the same variables often have a similar pattern of missing values, so the sample results may be comparable informally if not with formal tests. Since most investigators use only data with no missing values in their regression analyses, this remains a recommended procedure for MAR data.
If an investigator wishes to adjust for missing values, the maximum likelihood method is recommended as a more theoretically appropriate solution when the data are missing at random. Weights, which are usually used for nonresponse (a case not being included at all in the sample), can also be used for missing values. Greater weight is given to the cases that have a larger amount of missing data. A discussion of this can be found in Little and Rubin (1987).
Different statistical packages offer different options for handling missing values. BMDP, in the AM program, offers several options to describe the pattern of missing values - where they are located, whether pairs of variables have missing values in the same cases, whether cases with missing values are extreme observations, and whether values are missing at random. The program then will estimate a correlation or covariance matrix that can subsequently be saved and used by the regression programs, using either maximum likelihood estimation, a version of pairwise deletion, or the complete cases only. Their version of pairwise deletion, called ALLVALUE, computes the covariance using a mean that is based on all cases present for the variables. It is recalculated to yield an acceptable matrix for computations if the results indicate that computational problems exist (nonpositive semidefinite matrix). Imputation can be done using mean or regression substitution. The program has several options to assist the user in finding suitable variables for predicting the missing data by regression analysis.
SAS does not offer missing data options other than deleting cases with any incomplete data. SPSS offers pairwise deletion and mean substitution. It also offers the option to include cases with missing values assigned by the user, with the missing values being treated as a valid observation. For example, if a variable with a missing value was assigned 99, this would be treated as a valid outcome. If you are using dummy variables, discussed in the next section, or log-linear analysis, discussed in Chapter 17, this option allows you to treat lack of an answer as a useful piece of information.
STATA will perform regression imputation using the impute command. For each missing value in a given case, an estimate is computed from the regression of that variable on all the other variables for which data exist in the case in question. STATISTICA offers the possibility of mean substitution and pairwise deletion. In SYSTAT, pairwise deletion is available.
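To make the regression and stochastic substitution ideas above concrete, here is a minimal Python sketch; it is not the implementation used by any of the packages just named, the data frame and column names are hypothetical, and only a single pass over one target variable is shown.

import numpy as np
import pandas as pd

def regression_impute(df, target, predictors, stochastic=True, seed=0):
    """Fill missing values of `target` by regressing it on `predictors`
    using the complete cases, optionally adding a randomly chosen
    observed residual (stochastic substitution)."""
    rng = np.random.default_rng(seed)
    complete = df.dropna(subset=[target] + predictors)
    X = np.column_stack([np.ones(len(complete))] + [complete[p] for p in predictors])
    y = complete[target].to_numpy()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef

    out = df.copy()
    missing = out[target].isna() & out[predictors].notna().all(axis=1)
    Xm = np.column_stack([np.ones(missing.sum())] + [out.loc[missing, p] for p in predictors])
    imputed = Xm @ coef
    if stochastic:
        # add a residual drawn at random from the observed residuals
        imputed = imputed + rng.choice(residuals, size=len(imputed))
    out.loc[missing, target] = imputed
    return out

# hypothetical usage:
# filled = regression_impute(data, target='height', predictors=['weight', 'age'])

With stochastic=False the function reduces to the deterministic regression substitution described above, in which every imputed value lies exactly on the fitted plane.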
9.3 DUMMY VARIABLES
Often X variables desired for inclusion in a regression model are not continuous. Such variables could be either nominal or ordinal. Ordinal measurements represent variables with an underlying scale. For example, the severity of a burn could be classified as mild, moderate or severe. These burns are commonly called first-, second- and third-degree burns, respectively.
The X variable representing these categories may be coded 1, 2 or 3. The data from such ordinal variables could be analyzed by using the numerical coding and treated as though they were continuous (or interval) data. This method takes advantage of the underlying order of the data. On the other hand, this analysis assumes equal intervals between successive values of the variable. Thus in the burn example, we would assume that the difference between a first- and a second-degree burn is equivalent to the difference between a second- and a third-degree burn. As was discussed in Chapter 5, an appropriate choice of scale may be helpful in assigning numerical values to the outcome that would reflect the varying lengths of intervals between successive responses.
In contrast, the investigator may ignore the underlying order for ordinal variables and treat them as though they were nominal variables. In this section we discuss methods for using one or more nominal X variables in regression analysis along with the role of interactions.

A single binary variable
We will begin with a simple example to illustrate the technique. Suppose that the dependent variable Y is yearly income in dollars and the independent variable X is the sex of the respondent (male or female). To represent sex, we create a dummy (or indicator) variable D = 0 if the respondent is male and D = 1 if the respondent is female. (All the major computer packages offer the option of recoding the data in this form.) The sample regression equation can then be written as Y = A + BD. The value of Y is
Y = A        if D = 0

and

Y = A + B    if D = 1
Since our best estimate of Y for a given group is that group's mean, A is estimated as the average income for males (D = 0) and A + B is the average income for females (D = 1). The regression coefficient B is therefore B = Ȳfemales - Ȳmales. In effect, males are considered the reference group; females' income is measured by how much it differs from males' income.
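A small sketch with invented income figures confirms this interpretation of A and B; it assumes nothing beyond ordinary least squares.

import numpy as np

# invented incomes (in dollars) for illustration only
income = np.array([30000, 27000, 35000, 24000, 26000, 31000], dtype=float)
d      = np.array([0,     0,     0,     1,     1,     1    ], dtype=float)  # 0 = male, 1 = female

# least squares fit of income = A + B*D
X = np.column_stack([np.ones_like(d), d])
A, B = np.linalg.lstsq(X, income, rcond=None)[0]

print(A, income[d == 0].mean())      # A equals the male mean
print(A + B, income[d == 1].mean())  # A + B equals the female mean
print(B)                             # B = female mean - male mean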
A second method of coding dummy variables

An alternative way of coding the dummy variables is

D* = -1    for males

and

D* = +1    for females

In this case the regression equation would have the form

Y = A* + B*D*

The average income for males is now

A* - B*    (when D* = -1)

and for females it is

A* + B*    (when D* = +1)

Thus

A* = (1/2)(Ȳmales + Ȳfemales)

and

B* = Ȳfemales - (1/2)(Ȳmales + Ȳfemales)

or

B* = (1/2)(Ȳfemales - Ȳmales)
In this case neither males nor females are designated as the reference group.

A nominal variable with several categories

For another example we consider an X variable with k > 2 categories. Suppose income is now to be related to the religion of the respondent. Religion is classified as Catholic (C), Protestant (P), Jewish (J) and other (O). The religion variable can be represented by the dummy variables that follow:
Religion    D1    D2    D3
C           1     0     0
P           0     1     0
J           0     0     1
O           0     0     0
Note that to represent the four categories, we need only three dummy variables. In general, to represent k categories, we need k - 1 dummy variables. Here the three variables represent C, P and J, respectively. For example, D1 = 1 if Catholic and zero otherwise; D2 = 1 if Protestant and zero otherwise; and D3 = 1 if Jewish and zero otherwise. The 'other' group has a value of zero on each of the three dummy variables.
The estimated regression equation will have the form

Y = A + B1D1 + B2D2 + B3D3

The average incomes for the four groups are

ȲC = A + B1
ȲP = A + B2
ȲJ = A + B3
ȲO = A

Therefore,

B1 = ȲC - A = ȲC - ȲO
B2 = ȲP - A = ȲP - ȲO
B3 = ȲJ - A = ȲJ - ȲO

Thus the group 'other' is taken as the reference group to which all the others are compared. Although the analysis is independent of which group is chosen as the reference group, the investigator should select the group that makes the interpretation of the B's most meaningful. The mean of the reference group is the constant A in the equation. The chosen reference group should have a fairly large sample size.
If none of the groups represents a natural choice of a reference group, an alternative choice of assigning values to the dummy variables is as follows:
Religion    D1*    D2*    D3*
C           1      0      0
P           0      1      0
J           0      0      1
O           -1     -1     -1
As before, Catholic has a value of 1 on D1*, Protestant a value of 1 on D2*, and Jewish a value of 1 on D3*. However, 'other' has a value of -1 on each of the three dummy variables. Note that zero is also a possible value of these variables.
The estimated regression equation is now

Y = A* + B1*D1* + B2*D2* + B3*D3*

The group means are

ȲC = A* + B1*
ȲP = A* + B2*
ȲJ = A* + B3*
ȲO = A* - B1* - B2* - B3*

In this case the constant A* is the unweighted average of the four group means. Thus

B1* = ȲC - (1/4)(ȲC + ȲP + ȲJ + ȲO)

B2* = ȲP - (1/4)(ȲC + ȲP + ȲJ + ȲO)

and

B3* = ȲJ - (1/4)(ȲC + ȲP + ȲJ + ȲO)
Note that with this choice of dummy variables there is no reference group. Also, there is no slope coefficient corresponding to the 'other' group. As before, the choice of the group to receive -1 values is arbitrary. That is, any of the four religious groups could have been selected for this purpose.
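Both coding schemes for a k-category variable can be generated mechanically. The following Python sketch, with an invented religion vector and invented incomes, is one way to do it; it also verifies that the reference-group coefficients reproduce the differences from the 'other' group mean.

import numpy as np
import pandas as pd

religion = pd.Series(['C', 'P', 'J', 'O', 'C', 'P', 'O', 'J'])

# reference-group (0/1) coding: 'O' is the reference, so its column is dropped
ref = pd.get_dummies(religion).astype(float)[['C', 'P', 'J']]

# -1/+1 ("effect") coding: same three columns, but 'other' rows get -1 everywhere
eff = ref.copy()
eff.loc[religion == 'O', :] = -1.0

# with an invented income column, the least squares coefficients for `ref`
# reproduce the group means: A = mean of 'O'; B1, B2, B3 = C, P, J means minus it
income = np.array([12, 15, 20, 10, 14, 13, 11, 22], dtype=float)
X = np.column_stack([np.ones(len(income)), ref.to_numpy()])
coefs = np.linalg.lstsq(X, income, rcond=None)[0]
print(coefs)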
One nominal and one interval variable

Another example, which includes one dummy variable and one continuous variable, is the following. Suppose an investigator wants to relate vital capacity (Y) to height (X) for men and women (D) in a restricted age group. One model for the population regression equation is

Y = α + βX + δD + e

where D = 0 for females and D = 1 for males. This equation is a multiple regression equation with X = X1 and D = X2. This equation breaks down to an equation for females,

Y = α + βX + e

and for males,

Y = α + βX + δ + e

or

Y = (α + δ) + βX + e

Figure 9.1(a) illustrates both equations. Note that this model forces the equations for males and females to have the same slope β. The only difference between the two equations is the intercept: α for females and (α + δ) for males.
Interaction
A model that does not force the lines to be parallel is

Y = α + βX + δD + γ(XD) + e

This equation is a multiple regression equation with X = X1, D = X2, and XD = X3. The variable X3 can be generated on the computer as the product of the other two variables. With this model the equation for females (D = 0) is

Y = α + βX + e

and for males it is

Y = α + βX + δ + γX + e

or

Y = (α + δ) + (β + γ)X + e

The equations for this model are illustrated in Figure 9.1(b) for the case where δ is a positive quantity.
Thus with this model we allow both the intercept and the slope to be different for males and females. The quantity δ is the difference between the two intercepts, and γ is the difference between the two slopes. Note that some investigators would call the product XD the interaction between height and sex. With this model we can test the hypothesis of no interaction, i.e. γ = 0, and thus decide whether parallel lines are appropriate. This test can be done by using the methods discussed in Chapter 7, since this is an ordinary multiple regression model.
For further discussion of the effects of interaction in regression analysis see section 7.8 of this book or Jaccard, Turrisi and Wan (1990).
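As an illustration of that test, here is a sketch only, with simulated heights and an assumed statsmodels installation rather than any of the packages discussed in this book; the model with the XD term is compared with the model without it.

import numpy as np
import statsmodels.api as sm

# simulated data: height (X), sex dummy (D) and vital capacity (Y)
rng = np.random.default_rng(1)
n = 100
D = rng.integers(0, 2, n)                       # 0 = female, 1 = male
X = rng.normal(66, 3, n) + 4 * D
Y = 1.0 + 0.05 * X + 0.6 * D + 0.02 * X * D + rng.normal(0, 0.3, n)

XD = X * D                                      # the interaction term
full = sm.OLS(Y, sm.add_constant(np.column_stack([X, D, XD]))).fit()
reduced = sm.OLS(Y, sm.add_constant(np.column_stack([X, D]))).fit()

# the t test of the interaction coefficient appears in full.summary();
# equivalently, compare the two models by an F test
f_stat, p_value, _ = full.compare_f_test(reduced)
print(full.params[-1], p_value)

A small P value is evidence against parallel lines, i.e. against γ = 0.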
[Figure 9.1 Vital Capacity Versus Height for Males and Females. Panel (b): Lines are Not Parallel (Interaction). The horizontal axis is height; the female line is labelled Y = α + βX.]
Extensions

We can use these ideas and apply them to situations where there may be several X variables and several D (or dummy) variables. For example, we can estimate vital capacity by using age and height for males and females and for smokers and nonsmokers. The selection of the model - i.e. whether or not to include interaction - and the interpretation of the resulting equations must be done with caution. The previous methodology for variable selection applies here as well, with appropriate attention given to interpretation. For example, if a nominal variable such as religion is split into three new dummy variables with 'other' used originally as a reference group, a stepwise program may enter only one of these three dummy variables. Suppose the variable D1 is entered, where D1 = 1 signifies Catholic and D1 = 0 signifies Protestant, Jewish or other. In this case the reference group becomes Protestant, Jewish or other. Care must be taken in the interpretation of the results to report the proper reference group. Sometimes, investigators will force in the three dummy variables D1, D2, D3 if any one of them is entered in order to keep the reference group as it was originally chosen. This feature can be accomplished easily by using the subsets-of-variables option in BMDP 2R. The investigator is advised to write out the separate equation for each subgroup, as was shown in the previous examples. This technique will help clarify the interpretation of the regression coefficients and the implied reference group (if any). Also, it may be advisable to select a meaningful reference group prior to selecting the model equations rather than rely on the computer program to do it for you.
Another alternative to using dummy variables is to find separate regression equations for each level of the nominal or ordinal variable. For example, in the prediction of FEV1 from age and height, a better prediction can be achieved if an equation is found for males and a separate one for females. Females are not simply 'short' men. If the sample size is adequate, it is a good procedure to check whether it is necessary to include an interaction term. If the slopes 'seem' equal, no interaction term is necessary; otherwise, such terms should be included. Formal tests exist for this purpose, but they are beyond the scope of this book (Graybill, 1976).
Dummy variables can be created from any nominal, ordinal or interval data using the 'if-then-' options in the various statistical packages. Alternatively, the data can be entered into the data spreadsheet in a form already suitable for use as dummy variables.

9.4 CONSTRAINTS ON PARAMETERS

Some packaged programs, known as nonlinear regression programs, offer the user the option of restricting the range of possible values of the parameter estimates. In addition, some programs (e.g. BMDP 3R, BMDP AR, the STATISTICA nonlinear estimation module, STATA cnsreg, and SAS REG) offer the option of imposing linear constraints on the parameters. These constraints take the form

C1β1 + C2β2 + ··· + CPβP = C
where β1, β2, ..., βP are the parameters in the regression equation and C1, C2, ..., CP and C are constants supplied by the user. The program finds estimates of the parameters restricted to satisfy this constraint as well as any others supplied.
Although some of these programs are intended for nonlinear regression, they also provide a convenient method of performing a linear regression with constraints on the parameters. For example, suppose that the coefficient of the first variable in the regression equation was demonstrated from previous research to have a specified value, such as β1 = 2.0. Then the constraint would simply be
C1 = 1,   C2 = ··· = CP = 0   and   C = 2.0

or

1 · β1 = 2.0
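If a package with a constraint option is not available, an equality constraint of this simple kind can also be imposed by moving the constrained term to the left-hand side and fitting the remaining coefficients by ordinary least squares. The sketch below uses invented data and fixes β1 at 2.0.

import numpy as np

rng = np.random.default_rng(2)
n = 50
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.8 * X[:, 2] + rng.normal(0, 0.2, n)

# subtract the constrained term and regress the remainder on the other X's
y_star = y - 2.0 * X[:, 0]
Z = np.column_stack([np.ones(n), X[:, 1], X[:, 2]])
coefs, *_ = np.linalg.lstsq(Z, y_star, rcond=None)
print(coefs)   # intercept, beta2, beta3 estimated with beta1 fixed at 2.0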
Another example is an inequality constraint, as in the situation where coefficients are required to be nonnegative. For example, if β2 ≥ 0 is required, this constraint can also be supplied to the program.
The use of linear constraints offers a simple solution to the problem known as spline regression or segmented-curve regression. For instance, in economic applications we may want to relate the consumption function Y to the level of aggregate disposable income X. A possible nonlinear relationship is a linear function up to some level X0, i.e. for X ≤ X0, and another linear function for X > X0. As illustrated in Figure 9.2, the equation for X ≤ X0 is
Y = α1 + β1X + e

and for X > X0 it is

Y = α2 + β2X + e

The two curves must meet at X = X0. This condition produces the linear constraint

α1 + β1X0 = α2 + β2X0
211
y
a 1 :: 2
fJ,
a2 fJ:~
0
= 0.5 = -3
=1
L-------------~----------------x
X0 = 10
Figure 9.2 Segmented Regression Curve with X= X 0 as the Change Point.
Equivalently, this constraint can be written as
To formulate this problem as a multiple linear regression equation, we first define a dummy variable D such that
D=O
if
D= 1
if
~~me Programs, for example STATA mkspline and STATISTICA piecewise thiear regression, can be used to produce a regression equation which satisfies 8 cond" · · can b Itton. If your package does not have such an option, the problem into
:;~lved ~s fol~ows. The segmented regression equation can be combined multiple hnear equation as
\Vhen X
~ Xo, then D = 0, and this equation becomes
212
Special regression topics
Therefore y 1 becomes
= tX 1
and y3 =
/3 1 . When X> X 0 , then D = 1) and thee
. quahon
Therefore y 1 + y2 = tX 2 and y3 + y4 = {3 2 • With this model a nonlinear regression program can be employed . the restriction With
or
or
Yz
+ XoY4 = 0
For example, suppose that X 0 is known to be 10 and that the fitted multiple regression equation is estimated to be Y
= 2 - 5D + 0.5X + 0.5DX + e
Then we estimate tX 1 = y 1 as 2, /31 = y3 as 0.5) tX 2 . = y 1 + y2 as 2- 5 = -3, and /3 2 = y3 + y4 as 0.5 + 0.5 = 1. These results are pictures in Figure 9.2. Further examples can be found in Draper and Smith (1981). Where there are more than two segments with known values of X at which the segments intersect, a similar procedure using dummy variables could be employed. When these points of intersection are unknown, more complicated estimation procedures are necessary. Quandt (1972) presents a method. for estimating two regression equations when an unknown number ofthepo~nts belong to the first and second equation. The method involves numencal maximization techniques and is beyond the scope of this book. . 1 Another kind of restriction occurs when the value of the dependent vanab e is cons~raine~ to be. above (or below) ~ certain. limit. For ~xample, it;;~ be physically Impossible for Y to be negatiVe. Tobm (1958) denved a proce . t for obtaining am ultiple linear regression equation satisfying such a constraJ.ll · 9.5 METHODS FOR OBTAINING A REGRESSION EQUATION WHEN MULTICOLLINEARITY IS PRESENT
dellt In section 7.9, we discussed multicollinearity (i.e. when the indeP~f111 ate variables are highly intercorrelated) and showed how it affected .the. es suallY of the standard error of the regression coefficients. MulticollineantY ts u
Regression equation when multicollinearity is present
213
. ted by examining tol~r~nce (one .minus the squared mu~tiple correlation detec ·X. and the remammg X vanables already entered m the model) or ?et~ee:se 'the variance inflation factor. A commonly used cutoff value is 111 1ts "ece is less than 0.01 or variance inflation factor greater than 100. Since toler~ ranee coefficient is itself quite unstable, sometimes reliable slope th~~~i:nts can be o~tained even if the tolera~~e is less ~han 0.01. •. coMulticollinearity. ~s also detected exam~m~g the siZe of .the condition . d (or the conditiOn number). Thts quantity ts computed m SAS REG 10 e~dure and SYSTAT Regression in MGLH as the square root of the ratio ~;~~e largest eigenvalue to ~he smallest eigenva~u.e of the .covariance matrix of the X variables (see sectiOn 14.5 for a defi~ttion of ~tgenvalues). It can readily be com~uted. fro.m ~ny progra~ t~at gt:es the eigenvalues. A large condition index Is an mdtcatwn of multtcolhneanty. Belsley, Kuh and Welsch (1980) indicate that values greater than 30 can be an indication of serious problems. A useful method for checking stability is to standardize the data, compute the regular slope coefficients from the standardized data, and then compare thein to standardized coefficients obtained from the original regression equation (Dillon and Goldstein, 1984). To standardize the data, new variables are created whereby the sample mean is subtracted from each observation and the resulting difference is divided by the standard deviation. This is done for each independent and dependent variable, causing each variable to have a .zero mean and a standard deviation of one. If there are no appreciable differences between the standardized slope coefficients from the original equation and the regular slope coefficients from the standardized data, then there is less chance of a problem due to multicollinearity. Standardizing the !'sis also a suggested procedure: to stabilize the estimate of the condition mdex (Chatterjee and Hadi, 1988). th~other i~dication. of lac~ of st~bility is a change in the coefficients when u . sample stze changes. It ts possible to take a random sample of the cases r Slng. features provided by the statistical packages. If the investigator c~~~o~Iy. ~plits the sample in half and obtains widely different slope But Cients In the two halves, this may indicate a multicollinearity problem. B ~ote that unstable coefficients could also be caused by outliers. are ~tor~ shifting to formal techniques for handling multicollinearity, there ~rst isr~~htforward w~ys to detec~ the ~ossible cause of the proble~. _The Inflate th heck for _outhers. A_n outher whtch has large leverage can artifiCially and oth. e correl_atwn coefficients between the particular X variable involved be enouerhX vanables. Removal of outlier(s) that inflate the correlations may l'he g to solve the problem. that Wi~~~ond step is to check for relationships among some of the X variables blood p ead to problems. For example, if a researcher included mean arterial ressure along with systolic and diastolic blood pressure as X variables,
?Y
214
Special regression topics
then problems will occur since there is a linear relationship among th three variables. In this case, the simple solution is to remove one of the these ~ar~ables. Also, if .two X va?ables are highly intercorrelated, multicollinea~~e Is hkely and agam removmg one of them could solve the problem. Wh Y this solution is used, note that the estimate of the slope coefficient for thee; varia~le is keptw~llikely not be the sa~e as ~t would be if bot~ intercorrelated X vanables were mcluded. Hence the mvestigator has to decide whether thi solution creates conceptual problems in interpreting the results of th~ regression model. Another technique that may be helpful if multicollinearity is not too severe is to perform the regression analysis on the standardized variables or 00 so-called centered variables, where the mean is subtracted from each variable. (both the X variables and Y variable). Centering reduces the magnitude of the variables and thus tends to lessen rounding errors. When centered variables are used, the intercept , is zero but the estimates of the slope coefficients should remain unchanged. With standardized variables the intercept is zero and the estimates of the coefficients are different from those obtained using the original data (section 7. 7). Some statistical packages always use the correlation array in their regression computations to lessen computational problems and then modify them to compute the usual regression coefficients. If the above methods are insufficient to alleviate the multicollinearity, then the use of ridge regression, which produces biased estimates, can be used, as discussed in the next section.
9.6 RIDGE REGRESSION In the previous section, methods for handling multicollinearity by either excluding variables or modifying them were given, and also in Chapter 8 where variable selection methods were described. However, there are situations where the investigator wishes to use several independent variables that are highly intercorrelated. For example, such will be the case when the ind~pen~en~ variables are the prices of commodities or highly related physwlogt.ca variables on animals. There are two major methods for performing regr~sstc;; analysis when multicollinearity exists. The first method, which wtl~ ·: explained next, uses a technique call~d ridge re~ression. T?e se~ond t~chn;1~. uses a principal components regression and Will be descnbed m sectton Theoretical background
'dge One solution to the problem of multicollinearity is the so-calle1 ;~70; regression procedure (Marquardt and Snee, 1975; Hoerl and Kennar ' tbe Gunst ~nd Mason, 198?)· In effect, rid~e reg~ession artificial~y redu:~:able correlations among the mdependent vanables m order to o btam roor
Ridge regression
215
·mates of the regre~sion coefficients. Note that although such estimates estl b' sed they may, m fact, produce a smaller mean square error for the are Ia ' timates. . . . es To explain the ~ncept of ndge re.gressiOn, v:e ~ust follow throu~h a . etical presentation. Readers not mterested m this theory may skip to ~~:oliscussion of how to perform the analysis, using ordinary least squares ression programs. re~e will restrict our presentati~n to t?e standardized form of the o bser:ations, . where the mean of each vanable Is subtracted from the observatiOn and ~~~ difference is divided by the standard deviation of the variable. The suiting least squares regression equation will automatically have a zero :tercept and standardized regression coefficients. These coefficients are functions of only the correlation matrix among the X variables, namely,
1
'12
r13
'12
1      r12    r13    ...    r1P
r12    1      r23    ...    r2P
r13    r23    1      ...    r3P
...
r1P    r2P    r3P    ...    1
1+k
r 13
r 23
r23 1 + k ... l+k
l'he Val etll.pi . ue of k can be any positive number and ncauy, as will be described later.
IS
usually determined
216
Special regression topics
Example of ridge regression For P = 2- Le. with two independent variables X 1 and X 2 -the least squares standardized coefficients are computed as
and
The ridge estimators turn out to be b*=ru-[r 12 /(1+k)]r 2 1 1 - [r 12 /(1 + k)] 2
y(
1 ) 1+ k
and b* 2
= ' 2 r - [r 12 /(1 + k)Jrn( 1 ) 1-[r12 /(1+k)] 2
1+k
Note that the main difference between the ridge and least squares co is that r 12 is replaced by r 12 /(1 + k), thus artificially reducing the co between X 1 and X 2 • For a numerical example, suppose that r 12 = 0.9, r u = 0.3, and r Then the standardized least squares estimates are b = 0.3 - (0.9)(0.5) 1
1 - (0.9)
= -0.79
2
and - 0.5 - (0.9)(0.3) - 1 21 b2 1 - (0.9)2 - . For a value of k = 0.4 the ridge estimates are
*- 0.3- [0.9/(1 + 0.4)](0.5) (
b1
-
1 - [0.9/(1
+ 0.4)] 2
1 ) = -0.026 1 + 0.4
and b* = 0.5 - [0.9 /(1 + 0.4)] (0,3) ( 1 ) = 0.374 2 2 1 - [0.9/(1 + 0.4)] l + 0.4
Ridge regression
217
. that both coefficients are reduced in magnitude from the original slope 'N~~cients. We com~uted several other ridge est~ates fork= 0.05 to ?.5. co actice these estimates are plotted as a function of k, and the resulting Inaprh, as shown ' · p· 93 . n· d h • 1D · tgure .. , Is ca e t e ndge trace. gr /he ridge trace should be supplemented by a plot of the residual mean re of the regression equation. From the ridge trace together with the sq~~ual mean square graph, it becomes apparent when the coefficients begin rest · an mcreasmg · · va1ue of k prod uces a sma11 effiect on the . stabilize. That Is, :efficients. The ·value. of k at the. point of stabili~ation (s.ay k*) is selected to compute the final n?ge coe~eten_ts. The fi~al ndge estimates are thus a compromise between bms and Inflated coe~ctents. For the above example, as seen from Figure 9.3, the value of k* = 0.2
b* 1.2
0.8
0.6 0.4
0.2
-0;2 -0.4
'--0.8 ).f·
lgure 9.3
.
Ridge Regression Coefficients for Various Values of k (Ridge Trace).
218
Special regression topics
seems to represent that compromise. Values of k greater than 0.2 d 0 produce appreciable changes in the coefficients, although this decisio ~ot subjective one on o~r part. Unfortunately, no objective crit~rion hasnb~ a developed to detenmne k* from the sample. Proponents of ndge regr . en agree that the ridge estimates will give better predictions in the popu~::i~n although ~hey ~o not fit the samp~e as w~ll as th~ le~st squares estimate~' Further dtscusswn of the properties of ndge estimators can be found . · Gruber (1989). · 10
Implementation on the computer In practice, ordinary least squares regression programs such as those discussed in Chapter 7 can be used to obtain ridge estimates. As usual, we denote the dependent variable by Y and the P independent variables by X 1, X 2 , ••. , X P·- First, the Y and X variables are standardized by creating new variables in which we subtract the mean and divide the difference by the standard deviation for . each variable. Then, P additional dummy observations are added to the data set. In these dummy observations the value of Y is always set equal to zero. For a specified value of k the values of the X variables are defined as follows: 1. Compute [k(N- 1)r'2 • Suppose N
= 20 and k = 0.2; then [0.2(19)] 112 =
1.95. 2. The first dummy observation has X 1 = [k(N- 1)] 112 , X2 = 0, ... , X P = 0 and Y = 0. 3. The second dummy observation has X 1 = 0, X 2 = [k(N- 1)] 112 , X3 = 0, ... , X P = 0, Y = 0 etc. 4. The Pth dummy observation has X 1 = 0, X 2 = 0, X 3 = 0, ... , X p-1 = 0, Xp = [k(N- 1)r' 2 and Y = 0. For a numerical example, suppose P = 2, N additional dummy observations are as follows:
= 20
and k
= 0.2.
The two
y 1.95 0
0 1.95
0 0
'ons we use With the N regular observations and these P dummy observatt : forced 15 an ordinary least squares regression program in which the intercept ut are 00 1 to be zero. The regression coefficients obtained from the . ~esidual automatically the ridge coefficients for the specified k. The resultmg
mean square should also be noted. Repeating this process for various values of k will provide the information necessary to plot the ridge trace of the coefficients, similar to Figure 9.3, and the residual mean square. The usual range of interest for k is between zero and one.
Ridge regression programs allow the user to fit the regression model starting with the correlation matrix. With these programs the user can also perform ridge regression directly by modifying the correlation matrix as shown previously in this section (replace the 1's down the diagonal with 1 + k). This is recommended for SAS, SPSS and SYSTAT. Note that SYSTAT has a worked-out example in their matrix chapter. For STATA the method of implementation just given is suggested. Both BMDP 4R and the STATISTICA regression program will perform ridge regression directly and provide suitable output from the results.
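The augmented-data recipe just described can be written in a few lines. The sketch below assumes standardized variables and a no-intercept least squares fit, exactly as in the steps above; the function name is ours.

import numpy as np

def ridge_by_augmentation(X, y, k):
    """Standardized ridge coefficients obtained by adding the P dummy
    observations described above and running least squares through the
    origin (no intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    ys = (y - y.mean()) / y.std(ddof=1)

    c = np.sqrt(k * (n - 1))                 # [k(N-1)]^(1/2)
    X_aug = np.vstack([Xs, c * np.eye(p)])   # one dummy row per X variable
    y_aug = np.concatenate([ys, np.zeros(p)])

    coef, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return coef

# the ridge trace is obtained by calling this for a grid of k values,
# e.g. for k in np.arange(0.0, 1.01, 0.05), and plotting the coefficients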
SUMMARY

In this chapter we reviewed methods for handling some special problems in regression analysis and introduced some recent advances in the subject.
We reviewed the literature on missing values and recommended that the largest possible complete sample be selected from the data set and then used in the analysis. In certain special circumstances it may be useful to replace missing values with estimates from the sample. Methods for imputation were given and reference made as to where to obtain further information.
We gave a detailed discussion of the use of dummy variables in regression analysis. Several situations exist where dummy variables are very useful. These include incorporating nominal or ordinal variables in equations. The ideas explained here can also be used in other multivariate analyses. One reason we went into detail in this chapter is to enable you to adapt these methods to other multivariate techniques.
Two methods were discussed for producing 'biased' parameter estimates, i.e. estimates that do not average out to the true parameter value. The first method is placing constraints on the parameters. This method is useful when you wish to restrict the estimates to a certain range or when natural constraints on some function of the parameters must be satisfied. The second biased regression technique is ridge regression. This method is used only when one must employ variables that are known to be highly intercorrelated.
You should also consider examining the data set for outliers in X and possible redundant variables. The tolerance or variance inflation option provides a warning of which variable is causing the problem when you use a forward selection or forward stepwise selection method.
REFERENCES

References preceded by an asterisk require strong mathematical background.
Missing data in regression analysis

Afifi, A.A. and Elashoff, R.M. (1966). Missing observations in multivariate statistics. I: Review of the literature. Journal of the American Statistical Association, 61, 595-604.
Afifi, A.A. and Elashoff, R.M. (1969a). Missing observations in multivariate statistics. III: Large sample analysis of simple linear regression. Journal of the American Statistical Association, 64, 337-58.
Afifi, A.A. and Elashoff, R.M. (1969b). Missing observations in multivariate statistics. IV: A note on simple linear regression. Journal of the American Statistical Association, 64, 359-65.
*Beale, E.M.L. and Little, R.J.A. (1975). Missing values in multivariate analysis. Journal of the Royal Statistical Society, 37, 129-45.
David, M., Little, R.J.A., Samuhel, M.E. and Triest, R.K. (1986). Alternative methods for CPS income imputation. Journal of the American Statistical Association, 81, 29-41.
Dillon, W.R. and Goldstein, M. (1984). Multivariate Analysis: Methods and Applications. Wiley, New York.
Ford, D.L. (1983). An overview of hot-deck procedures, in Incomplete Data in Sample Surveys, Vol. 2, Theory and Bibliographies (eds W.G. Madow, I. Olkin and D.B. Rubin). Academic Press, New York, pp. 185-207.
Hartley, H.O. and Hocking, R.R. (1971). The analysis of incomplete data. Biometrics, 27, 783-824.
Kalton, G. and Kasprzyk, D. (1986). The treatment of missing survey data. Survey Methodology, 12, 1-16.
Little, R.J.A. (1982). Models for nonresponse in sample surveys. Journal of the American Statistical Association, 77, 237-50.
Little, R.J.A. (1992). Regression with missing X's: a review. Journal of the American Statistical Association, 87, 1227-37.
Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing Data. Wiley, New York.
Little, R.J.A. and Rubin, D.B. (1990). The analysis of social science data with missing values, in Modern Methods of Data Analysis (eds J. Fox and J.S. Long). Sage, Newbury Park, CA.
*Orchard, T. and Woodbury, M.A. (1972). A missing information principle: Theory and application, in Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, 1, 697-715.
Ridge regression

*Gruber, M. (1989). Regression Estimators: A Comparative Study. Academic Press, San Diego.
Gunst, R.F. and Mason, R.L. (1980). Regression Analysis and Its Application. Dekker, New York.
Hoerl, A.E. and Kennard, R.W. (1970). Ridge regression: Application to nonorthogonal problems. Technometrics, 12, 55-82.
Marquardt, D.W. and Snee, R.D. (1975). Ridge regression in practice. American Statistician, 29, 3-20.
Segmented (spline) regression

*Quandt, R.E. (1972). New approaches to estimating switching regressions. Journal of the American Statistical Association, 67, 306-10.
Other regression

Belsley, D.A., Kuh, E. and Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York.
Chatterjee, S. and Hadi, A.S. (1988). Sensitivity Analysis in Linear Regression. Wiley, New York.
*Draper, N.R. and Smith, H. (1981). Applied Regression Analysis, 2nd edn. Wiley, New York.
*Graybill, F.A. (1976). Theory and Application of the Linear Model. Duxbury Press, N. Scituate, MA.
Jaccard, J., Turrisi, R. and Wan, C.K. (1990). Interaction Effects in Multiple Regression. Sage, Newbury Park, CA.
*Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica, 26, 24-36.
FURTHER READING

Ridge regression

Fennessey, J. and D'Amico, R. (1980). Collinearity, ridge regression, and investigator judgement. Social Methods and Research, 8, 309-40.
Myers, R.H. (1986). Classical and Modern Regression with Applications. Duxbury Press, Boston.
Segmented (spline) regression

Quandt, R.E. (1958). The estimation of the parameters of a linear regression system obeying two separate regimes. Journal of the American Statistical Association, 53, 873-80.
*Quandt, R.E. (1960). Tests of the hypothesis that a linear regression system obeys two separate regimes. Journal of the American Statistical Association, 55, 324-31.
Other regression

Aiken, L.S. and West, S.G. (1991). Multiple Regression: Testing and Interpreting Interactions. Sage, Newbury Park, CA.
PROBLEMS

9.1 In the depression data set described in Chapter 3, data on educational level, age, sex and income are presented for a sample of adults from Los Angeles
Special regression topics
County. Fit a regression plane with income as the dependent variabl other variables as independent variables. Use a dummy variable for the an~ the SEX that was originally coded 1,2 by stating SEX = SEX - 1. Which es:a~Iabie reference group? x ts the 9.2 Repeat Problem 9.1, but now use a dummy variable for education 0 .. education level into three categories: did not complete high schooi IVIde the at least high school, and completed at least a bachelor's degree C~completed . . d h . mpare th mterpretat10n you woul make of t e effects of education on inco · e me In this problem and in Problem 9.1. 9.3 In the depression data set, determine whether religion has an effect 0 · · n Income · d d · bl 1 d wh en . use . as an m epen. ent vana e a. ong·With age, sex and educationa11eve} 9.4 Draw a ndge trace for the accompanymg data. ·
Variable Case
X1
X2
X3
y
1 2 3 4 5 6 7 8
0.46 0.06 1.49 1.02 1.39 0.91 1.18 1.00
0.96 0.53 1.87 0.27 0.04 0.37 0.70 0.43
6.42 5.53 8.37 5.37 5.44 6.28 6.88 6.43
3.46 2.25 5.69 2.36 2.65 3.31 3.89 3.27
Mean Standard deviation
0.939 0.475
0.646 0.566
6.340 0.988
3.360 1.100
9.5 Use the data in Appendix A. For the parents we wish to relate Y =weight to X = height for both men and women in a single equation. Using dummY variables, write an equation for this purpose, including an interaction ter~~ Interpret the parameters. Run a regression analysis, and test whether the tr~e of change of weight versus height is the same for men and women. Interpre results wih the aid of appropriate graphs. . d girl· 9.6 (Continuation of Problem 9.5.) Do a similar analysis for the first boy an Include age and age squared in the regression equation. •jdeal' 9.7 U~like the r~al data used in Prob~em 9.?, the accompanying datar ~~erican weights published by the Metropolitan Life Insurance Company fo_ -framed men and women. Compute Y = midpoint of weight range for med~um that tile men and women for the various heights shown in the table. Pre~endi:O~Jeill 9.5, results represent a real sample, repeat the analysis requested m p and compare the results of the two analyses.
Problems Weights of men (lb)a
Height
Small frame
Medium frame
Large frame
5 ft 2 in. 5 ft 3 in. 5ft 4in. 5 ft 5 in. 5ft' 6in. 5ft 7in. 5ft 8in. 5 ft 9 in. 5ft lOin. 5 ft 11 in. 6ft Oin. 6ft lin. 6ft 2in. 6ft 3in. 6ft 4in.
128-134 130-136 132-138 134-140 136-142 138-145 140-148 142-151 144-154 146-157 149-160 152-164 155-168 158-172 162-176
131-141 133-143 135-145 137-148 139-151 142-154 145-157 148-160 151-163 154-166 157-170 160-174 164-178 167-182 171-187
138-150 140-153 142-156 144-160 146-164 149-168 152-172 155-176 158-180 161-184 164-188 168-192 172-197 176-202 181-207
a Including
5lb of clothing, and shoes with 1 in. heels.
Weights of women (lb)a
Height 4ft lOin. 4ft 11 in. 5ft Oin. Sft 1 in. 5ft 2 in. 5 ft 3in. 5ft 4in. 5 ft 5 in. 5ft 6in. 5ft 7in. 5ft 8 in. Sft 9in. 5 ft lOin. 5ft llin. 6ft Oin. '
---
'
Small frame
Medium frame
Large frame
102-111 103-113 104-115 106-118 108-121 111-124 114-127 117-130 120-133 123-136 126-139 129-142 132-145 135-148 138-151
109.:...121 111-123 113-126 115-129 118-132 121-135 124-138 127-141 130-144 133-147 136-150 139-153 142-156 145-159 148-162
118-131 120-134 122-137 125-140 128-143 131-147 134-151 137-155 140-159 143-163 146-167 149-170 152-173 155-176 158-179
• Incluct· tng 3lb of clothing, and shoes with 1 in. heels.
223
224
Special regression topics
9.8 Use the data described in Problem 7.7. Since some of the X var' b intercorrelated, it may be usef~l to do a ridge regression analysis 0 ; ; les are to X9. Perform such an analysts, and compare the results to those of p on Xt 7.10 and 8.7. roblems 9.9 (Continuation of Problem 9.7.) Using the data in the table given i p 9.7, compute the midpoints of weight range for all frame sizes fo~ roblem women ~eparately. Pretending that th~ results. re~re~ent a real sampl::~ ;nd each hetght has three Y values associated With It mstead of one re hat . . . • peat the · . ~n~lysts of ~roblem 9.7. Pr?duce the app~opnate pl?ts. Now repeat wit mdtcator vanables for the different frame sizes, choosmg one as a refl h erence . . · group. I nc1u d· e any necessary mteraction terms. 9.10 Take the family lung function data in Appendix A and delete (label as miss" the height of the middle child for every family with ID divisible by 6, tha~n.g) 1 families 6, 12, 18 etc. (To find these, l?ok for ~hose IDs ~ith Il?/6 = integer Pa :~ of (ID/6).) Delete. the FEVl of t_he. m1ddle child for famthes wtth ID divisible by 10. Assume these data are mtssmg completely at random. Try the various imputation methods suggested in this chapter to fill in the missing data. Find the regression ofFEVl on height and weight using the imputed values. Compare the results using the imputed data with those from the original data set. 9.11 In the depression data set, define Y = the square root of total depression score (CESD), X 1 = log (income), X 2 = Age, X 3 = Health and X 4 = Bed days. Set X 1 = missing whenever X 3 = 4 (poor health). Also set X 2 = missing whenever X 2 is between 50 and 59 (inclusive). Are these data missing at random? Try various imputation methods and obtain the regression of Yon X 1 , ... , X 4 with the imputed values as well as using the complete cases only. Compare these results with those obtained from the original data (nothing missing). 9.12 Using the family lung function data, relate FEVl to height for the oldest child in the three ways: simple linear regression (Problem 6.9), regression of FEVl o? height squared, and spline regression (split at HEI = 64). Which method IS preferable? . 9.13 Another way to answer the question of interaction between the independent variables in Problem 7.13 is to define a dummy variable that indicates whe.ther an observation is above the median weight, and an equivalent variable for het~ht. . 1 . 1 d" an interaction Relate FEV1 for the fath~rs to these d~mmy_ vanab ~s, me u mg results term. Test the hypothesis that there Is no mteractiOn. Compare your · with those of Problem 7.13. . . FEVl 9.14 Perform a ridge regression analysis of the family lung functiOn data usmg 0 fthe of the oldest child as the dependent variable and height, weight and age oldest child as the independent variables. . h oldest 9.15 Using the family lung function data, find the regression of hetght for the seX of child on mother's and father's height. Include a dummy variable fort e the child and any necessary interaction. terms. . CESP as the 9.16 Using dummy variables, run a regression analysts that relate~ . chapter dependent variable to marital status in the depression data set gtven. m d groUP• · . · · the combme s 3. Do it separately for males and females. Repeat usmg . t"100 terrl1 · but including a dummy variable for sex and any necessary mterac Compare and interpret the results of the two regressions.
part Three Multivariate Analysis
10 Canonical correlation analysis 10.1 USING CANONICAL CORRELATION ANALYSIS TO ANALYZE TWO SETS OF VARIABLES Section 10.2 discusses canonical correlation analysis and gives some examples ofits use. Section 10.3 introduces a data example using the depression data set. Section 10.4 presents basic concepts and tests of hypotheses, and section 10.5 describes a variety of output options. The output available from the six computer programs is summarized in section 10.6 and precautionary remarks are given in section 10.7.
10.2 WHEN IS CANONICAL CORRELATION ANALYSIS USED? The technique of canonical correlation analysis is best understood by considering it as an extension of multiple regression and correlation analysis. In multiple regression analysis we find the best linear combination of P variables, X 1 , X 2 , ••• , X p, to predict one variable Y. The multiple correlation coefficient is the simple correlation between Y and its predicted value Y. In multiple regression and correlation analysis our concern was therefore to examine the relationship between the X variables and the single Y variable. b In. canonical correlation analysis we examine the linear relationships te~:e.en a set ?f X varia?les and a ~et of more_ tha_n one Y variabl~. The and~tque conststs of findt~g several h~ea~ combmatwns of_the X _vanables w he same number of hnear combmattons of the Y vanables m such a t:y that these linear combinations best express the correlations between the th; sets. T?ose linear combinations are called the canonical variables, and can~o:relatwns between corresponding pairs of canonical variables are called nical correlations. 1 out~ a common application of this technique the Y's are interpreted as Pred?~e or dependent variables, while the X's represent independent or vari~~~~ve va~ables. T?e Y. vari~bles. may _be hard~r to m_easure than the X s, as tn the cahbratton s1tuatwn dtscussed m section 6.11.
228
Canonical correlation analysis
Canonical correlation analysis applies to situations in which re . techniques are appropriate and where there exists more than one degresston variable. It is especially useful when the dependent or criterion varia~~ndent moderately intercorrelated; so it does not make sense to treat them sep es are ignoring the interdependence. Another useful application is for atrat~ly, independence between the sets of Y and X variables. This applicati estJ~g 00 be discussed further in section 10.4. Wtll An example of an application is given by Waugh (1942); he studied h. relationship between characteristics of certain varieties of wheat t ·~ characteristics of the resulting flour. Waugh was able to conclude ~n desirable wheat was high in texture, density and protein content, and loat on .damaged kernels and foreign materials. Similarly, good flour should ha: high crude protein content and low scores on wheat per barrel of flour an~ ash in. flour. Canonic~! correlation has. also. been used i~ psychology by Meredtth (1964) to calibrate two sets of tntelhgence tests giVen to the same individuals. In addition, it has been used to relate linear combinations of personality scales to linear combinations of achievement tests (Tatsuoka; 1988)~ Hopkins (1969) discusses several health-related applications of canonical correlation analysis, including, for example, a relationship between illness and housing quality. Canonical correlation analysis is one ofthe less commonly used multivariate techniques. Its limited use may be due, in part, to the difficulty often encountered in trying to interpret the results. Also, prior to the advent of computers the calculations seemed forbidding. 10.3 DATA EXAMPLE The depression data set presented in previous chapters is used again ~ere to illustrate canonical correlat~on analysis. yve select two dependent vanabl~~ CESD and health. The vanable CESD ts the sum of the scores on th~ depression scale items; thus a high score indicates likelihood of depress~~ Likewis~, '~ealth' is a rating scale from 1 to~' where 4 sigt.d~es P00~ ~~d~s and 1 stgmfies excellent health. The set of mdependent vanables tn . , . years,. 'educauon' ds 'sex', transformed so that 0 = male and 1 = female; ' age' m from 1 = less than high school up to 7 = doctorate; and 'income' in thousan of dollars per year. d 10.2. The summary statistics for the data are given in Tables 10:1 an 5sible Note that the average score on the depression scale (CESD) is 8.9 tn .a P? g ao range of 0 to 60. The average on the health variables is 1.8, indtcatl~rage · 1· between exce11ent an d g? od· The average percetved health level falmg . havscboo1 educational level of 3.5 shows that an average person has fimshed htg and perhaps attended some college. . cesO Examination of the correlation matrix in Table 10.2 shows netther
Basic concepts of canonical correlation
229
Table 10.1 Means and standard deviations for depression data set Variable
Mean
Standard deviation
CESD Health Sex Age Education Income
8.88 1.77 0.62 44.41 3.48 20.57
8.82 0.84 0.49 18.09 1.31 15.29
Table 10.2 Correlation matrix for depression data set
CESD
Health Sex Age Education Income
CESD
Health
Sex
Age
Education
Income
1
0.212 1
0.124 0.098 1
-0.164' 0.308 0.044 1
-0.101 -0.270 -0.106 -0.208 1
-0.158 -0.183 -0.180 -0.192 0.492 1
nor health is highly correlated with any of the independent variables. In fact, the highest correlation in this matrix is between education and income. Also, CESD is negatively correlated with age, education and income (the younger, less-educated and lower-income person tends to be more depressed). The Positive correlation between CESD and sex shows that females tend to be more depressed than males. Persons who perceived their health as good are more apt to be high on income and education and low on age. deIn the follo~ing sections .we will examine the rela~ionship between the i pendent vanables (perceived health and depressiOn) and the set of ndependent variables. 10 4 · BASIC CONCEPTS OF CANONICAL CORRELATION Supp . X~> Xose we Wish to study the relationship between. a set of va~ables as th 2 ,: • • 'X P and another set Y1 , Y2 ,. .. , YQ. The X vanables can be vtewed depened Independent or predictor variables, while the Y's are considered mean 0~nt or out~ome variables. We assume that in any given sample the the sa each vanable has been subtracted from the original data so that how t~ple means of all X and Y variables are zero. In this section we discuss and Wee degree of association between the two sets of variables is assessed Present some related tests of hypotheses.
230
Canonical correlation analysis
First canonical correlation The basic idea of canonical correlation analysis begins with finding one r combination of the Y's, say tnear
and one linear combination of the X's, say
For any particular choice ~f t~e. coefl~.cients (a's and b's) we can compute values of U1 and V1 for each mdtvtdual m the sample. From the N individuals in the sample we can then compute the simple correlation between the N pairs of U1 and V1 values in the usual manner. The resulting correlation depends on the choice of the a's and b's. In canonical correlation analysis we select values of a and b coefficients so as to maximize the correlation between U1 and V1 . With this particular choice the resulting linear combination U1 is called the first canonical variable of the Y's and V1 is called the first canonical variable of the X's. Note that both U1 and V1 have a mean of zero. The resulting correlation between U1 and V1 is called the first canonical correlation. The square of the first canonical correlation is often called the first eigenvalue. The first canonical correlation is thus the highest possible correlation between a linear combination of theX's and a linear combination of the Y's. In this sense it is the maximum linear correlation between the set of X variables and the set of Y variables~ The first canonical correlation is analogous to the multiple correlation coefficient between a single Y varia_ble and the set of X variables. The difference is that in canonical correlatiOn analysis we have several Y variables and we must find a linear combination of them also. The SAS CANCORR program computed the a's and b's as shown inTab1~ 10.3. Note that the first set of coefficients is used to compute the values 0 the canonical variables U 1 and V1· • n Table 10.4 shows the process used to compute the canonical correlatto · .
.
.
.
.
Table 10.3 Canomcal correlation coeffiCients for first correlation (depresston Coefficients b1 b2 b3 b4
(sex) a 1 = -0.055 (CESD) (age) a 2 = 1.17 (health) -0.29 (education) +0.005 (income)
= 0.051 = 0.048 = =
Standardized coefficients
data set)
---
___. 0 (CESP)
0.025 (sex) a1 = -0.49 health) 0.871 (age) a 2 = +0.9 82 ( . b3 = -0.383 (education) b4 = 0.082 (income) _______. b1 b2
= =
Table 10.4 Computation of correlation between U1 and V1 Individual 1 2 3
Sex b1 (X 1
Age
-
X 1)
+ b2(X 2 - X2 )
Education
+ b 2 (X 2
-
Income
X3 ) + b4 (X 4
-
X4 )
=
v1
0.051(1 - 0.62) + 0.048(68- 44.4)- 0.29(2- 3.48) + 0.0054(1 - 20.57) == 1.49 0.051(0- 0.62) + 0.048(58- 44.4)- 029(4- 3.48) + 0.0054(15- 20.57) = 0.44 0.051(1- 0.62) + 0.048(45- 44.4)- 0.29(3- 3.48) + 0.0054(28- 20.57) = 0.23
294 Correlation between U1 and V1 = canonical correlation =
0.405~
CESD U1 =
a1(Y1 -
Health
y t)
+
a2(Y2-
Y 2)
0.76 = -0.055(0- 8.88) + 1.17(2- 1.77) -0.64 = -0.055(4- 8.88) + 1.17(1- 1.77) 0.54 = -0.055(4- 8.88) + 1.17(2- 1.77)
232
Canonical correlation analysis
For each individual we compute V1 from the b coefficients and the indiv'd , X variable values after subtracting the means. We do the same for u \,~al 8 computations are shown for the first three individuals. The cor~~~ .ese coefficient is then computed from the 294 values of U1 and V1 • Note th ~tt~n variances of U1 and V1 are each equal to 1. a t e The standardized coefficients are also shown in Table 10.3, and they to be used with the standardized variables. The standardized coefficients are be ~bt.ained by multiplying ~ach un~tandardized coefficient by the stand~~~ deviat~on of the corresp?ndmg vanable. For example, the unstandardized coefficient of y 1 (CESD) IS a 1 = -0.0555, and from Table 10.1 the standard deviation of y 1 is 8.82. Therefore the standardized coefficient of y 1 is -0.490. In this example the resulting canonical correlation is 0.405. This value represents the highest possible correlation between any linear combination of the independent variables .and any linear combination of the dependent variables. In particular, it is larger than any simple correlation between an x variable and a y variable (Table 10.2). One method for interpreting the linear combination is by examining the standardized coefficients. For the x variables the canonical variable is determined largely by age and education. Thus a person who is relatively old and relatively uneducated would score high on canonical variable V1 . The canonical variable based on the y's gives a large positive weight to the perceived health variables and a negative weight to CESD. Thus a person with a high health value (perceived poor health) and a low depression score would score high on canonical variable U1. In contrast, a young person with relatively high education would score low on V1 , and a person in good perceived health but relatively depressed would score low on U1 . Sometimes, because of high intercorrelations between two variables in the same set, one variable may result in another having a small coefficient (Levine, 1977) and thus make the interpretation difficult. No very high correlations within a set existed in the present example. b In summary, we conclude that older but uneducated people tend to e relatively undepressed although they perceive their health as relatively p~or. Because the first canonical correlation is the largest possible, this impresston . . . . hi l . f the data. ts the strongest concluswn we can make from t s ana ysis o the However, there may be other important conclusions to be drawn from data, which will be discussed next.
Other canonical correlations
. 's tS Additional interpretation of the relationship between the x's and the :ding obtained by deriving other sets of canonical variables and their corres~oble Jlz canonical correlations. Specifically, we derive a second canonical va?able Vz (linear combination of the x's) and a corresponding canonical varta
Basic concepts of canonical correlation
233
mbination of the y's). The coefficients for these linear combinations (Iiileabr c~n so that the following conditions aFe met. are c os . uncorrelated with vl and ul. 1 Vz ts . . d . h 17 d U . Vz is uncorrelat~. Wit yl an 1" 2· Subject to conditions 1 and 2, U2 and have the maximum possible 3. . correlation.
v;
v;
is called the second canonical correlation . rrelation between U2 and The co · · h . 1· corre1atwn. . d will necessanly be less t an or equa1 to t h· e fi rst canomca anln our example the sec~nd s~t of canonical variables expressed in terms of the standardized coefficients IS
Vi
== 0.396(sex)- 0.443(age)- 0.448(education)- 0.555(income)
and U2 == 0.899(CESD)
+ 0.288(health)
Note that U2 gives a high positive weight to CESD and a low positive weight to health. In contrast, V2 gives approximately the same moderate weight to all four variables, with sex having the only positive coefficient. A large value of V2 is associated with young, poor, uneducated females. A large value of U2 is associated with a high value of CESD (depressed) and to a lesser degree with a high value of health (poor perceived health). The value of the second canonical correlation is 0.266. In general, this process can be continued to obtain other sets of canonical variables U3 , V3 ; U4 , V4 ; etc. The maximum number of canonical correlations and their corresponding sets of canonical variables is equal to the minimum of P (the number of x variables) and Q (the number of y variables). In our data. ex~mple P = 4 and Q = 2, so the maximum number of canonical correlations is two.
Tests of hypotheses
Most computer programs print the coefficients for all of the canonical variables, the values of the canonical correlations and the values of the canonical variables for each individual in the sample (canonical variable scores). Another common feature is a test of the null hypothesis that the k smallest population canonical correlations are zero. Two tests may be found, either Bartlett's chi-square test (Bartlett, 1941; Lawley, 1959) or an approximate F test (Rao, 1973). These tests were derived with the assumption that the X's and Y's are jointly distributed according to a multivariate normal distribution. A large chi-square or a large F is an indication that not all of those k population correlations are zero.
In our data example, the approximate F test that both population canonical correlations are zero was computed by the CANCORR procedure to be F = 9.68 with 8 and 576 degrees of freedom and P = 0.0001. So we can conclude that at least one population canonical correlation is nonzero and proceed to test that the smallest one is zero. The F value equals 7.31 with 3 and 289 degrees of freedom. The P value is 0.0001 again, and we conclude that both canonical correlations are significantly different from zero. Similar results were obtained from Bartlett's test from BMDP and STATISTICA.
In data sets with more variables these tests can be a useful guide for selecting the number of significant canonical correlations. The test results are examined to determine at which step the remaining canonical correlations can be considered zero. In this case, as in stepwise regression, the significance level should not be interpreted literally.
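As an illustration, here is a sketch of Bartlett's chi-square test that the k smallest population canonical correlations are zero, written from the standard formula rather than copied from any package; the two sample correlations are the values reported in the text.

import math

n, P, Q = 294, 4, 2
r = [0.405, 0.266]                       # sample canonical correlations

def bartlett_chi2(start):
    # Test that canonical correlations r[start], r[start+1], ... are all zero.
    lam = 1.0
    for ri in r[start:]:
        lam *= (1.0 - ri ** 2)
    chi2 = -(n - 1 - (P + Q + 1) / 2.0) * math.log(lam)
    df = (P - start) * (Q - start)
    return chi2, df

print(bartlett_chi2(0))   # test that both correlations are zero
print(bartlett_chi2(1))   # test that only the smallest correlation is zero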
10.5 OTHER TOPICS RELATED TO CANONICAL CORRELATION
In this section we discuss some useful optional output available from packaged programs.
Plots of canonical variable scores
A useful option available in some programs is a plot of the canonical variable scores Vi versus Ui. In Figure 10.1 we show a scatter diagram of U1 versus V1 for the depression data. The first individual from Table 10.4 is indicated on the graph. The degree of scatter gives the impression of a somewhat weak but significant canonical correlation (0.405). For multivariate normal data the graph would approximate an ellipse of concentration.
Such a plot can be useful in highlighting unusual cases in the sample as possible outliers or blunders. For example, the individual with the lowest value on U1 is case number 289. This individual is a 19-year-old female with some high school education and with $28 000-per-year income. These scores produce a value of V1 = -0.73. Also, this woman perceives her health as excellent (1) and is very depressed (CESD = 47), resulting in U1 = -3.02. This individual, then, represents an extreme case in that she is uneducated and young but has a good income. In spite of the fact that she perceives her health as excellent, she is extremely depressed. Although this case gives an unusual combination, it is not necessarily a blunder.
The plot of U1 versus V1 does not result in an apparently nonlinear scatter diagram, nor does it look like a bivariate normal distribution (elliptical in shape). It may be that the skewness present in the CESD distribution has resulted in a somewhat skewed pattern for U1 even though health has a greater overall effect on the first canonical variable. If this pattern were more
Figure 10.1 Plot of 294 Pairs of Values of the Canonical Variables U1 and V1 for the Depression Data Set (Canonical Correlation = 0.405).
extreme, it might be worthwhile to consider transformations on some of the variables such as CESD.

Interpretation of canonical variables
Another useful optional output is the set of correlations between the canonical variables and the original variables used in deriving them. This output
provides a way of interpreting the canonical variables when some of the variables within either the set of independent or the set of dependent variables are highly intercorrelated with each other (Cooley and Lohnes, 1971; Levine, 1977). For the depression data example these correlations are as shown in Table 10.5. These correlations are sometimes called canonical variable loadings. Other terms are canonical loadings and canonical structural coefficients (see Dillon and Goldstein, 1984, or Thompson, 1984, for additional interpretations of these coefficients).
Since the canonical variable loadings can be interpreted as simple correlations between each variable and the canonical variable, they are useful in understanding the relationship between the original variables and the canonical variables. When the variables used in one canonical variable are uncorrelated, the canonical variable loadings are equal to the standardized canonical variable coefficients. When some of the original variables are highly intercorrelated, the loadings and the coefficients can be quite different. It is in these cases that some statisticians find it simpler to try to interpret the canonical variable loadings rather than the canonical variable coefficients. For example, suppose that there are two X variables that are highly positively correlated, and that each is positively correlated with the canonical variable. Then it is possible that one canonical variable coefficient will be positive and one negative, while the canonical variable loadings are both positive, the result one expects.
In the present data set, the intercorrelations among the variables are neither zero nor strong. Comparing the results of Tables 10.5 and 10.3 for the first canonical variable shows that the standardized coefficients have the same signs as the canonical variable loadings, but with somewhat different magnitudes.
Table 10.5 Correlations between canonical variables and corresponding variables (depression data set)

                 U1         U2
CESD           -0.281      0.960
Health          0.878      0.478

                 V1         V2
Sex             0.089      0.525
Age             0.936     -0.225
Education      -0.532     -0.636
Income         -0.254     -0.737
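As a minimal sketch (with assumed variable names and synthetic data), canonical variable loadings such as those in Table 10.5 are simply the ordinary correlations between each original variable and the canonical variable scores.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(294, 4))                     # placeholder for sex, age, education, income
v_scores = X @ np.array([0.1, -0.5, 0.3, 0.2])    # hypothetical canonical variable V1

loadings = [np.corrcoef(X[:, j], v_scores)[0, 1] for j in range(X.shape[1])]
print(np.round(loadings, 3))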
Redundancy analysis
The average of the squared canonical variable loadings for the first canonical variable V1 gives the proportion of the variance in the X variables explained by the first canonical variate. The same is true for U1 and the Y's. Similar results hold for each of the other canonical variates. For example, for U1 we have [(-0.281)² + 0.878²]/2 = 0.425, or less than half of the variance in the Y's is explained by the first canonical variate. Sometimes the proportion of variance explained is quite low, even though there is a large canonical correlation. This may be due to only one or two variables having a major influence on the canonical variate.
The above computations provide one aspect of what is known as redundancy analysis. In addition, SAS CANCORR, STATISTICA Canonical analysis and BMDP 6M can compute a quantity called the redundancy coefficient, which is also useful in evaluating the adequacy of the prediction from the canonical analysis (Muller, 1981). The coefficient is a measure of the average proportion of variance in the Y set that is accounted for by the V set. It is comparable to the squared multiple correlation in multiple linear regression analysis. It is also possible to obtain the proportion of variance in the X variables that is explained by the U variables, but this is usually of less interest.
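The proportion-of-variance computation described above is short enough to verify by hand; the following sketch uses the two U1 loadings quoted in the text.

u1_loadings = [-0.281, 0.878]        # correlations of CESD and health with U1
proportion_explained = sum(l ** 2 for l in u1_loadings) / len(u1_loadings)
print(round(proportion_explained, 3))    # about 0.425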
10.6 DISCUSSION OF COMPUTER PROGRAMS
The BMDP, SAS, STATA and STATISTICA packages each contain a special purpose canonical correlation program. The SPSS MANOVA and the SYSTAT MGLH procedures will also perform canonical correlation analysis. In general, the interpretation of the results is simpler from special purpose programs since all the output relates directly to canonical correlation analysis. The various options for these programs are summarized in Table 10.6.
For the data example described in section 10.3 we want to relate reported health and depression levels to several typical demographic variables. The SAS CANCORR program was used to obtain the results reported in this chapter. The data set 'depress1.dat' was located in drive B. The CANCORR procedure is very straightforward to run. We simply call the procedure and specify the following:
FILENAME TEST1 'B:depress1.dat';
DATA ONE; INFILE TEST1;
   INPUT v1 SEX AGE v4 EDUCAT v6 INCOME v8-v28 CESD v30-v31
         HEALTH v33-v37;
PROC CANCORR DATA = ONE;
   VAR CESD HEALTH;
   WITH SEX AGE EDUCAT INCOME;
RUN;
Table 10.6 Summary of computer output for canonical correlation analysis

Output                                BMDP   SAS       SPSS     STATA     STATISTICA   SYSTAT
Correlation matrix                    6M     CANCORR   MANOVA   correl    Canonical    CORR
Canonical correlations                6M     CANCORR   MANOVA   canon     Canonical    MGLH
Canonical coefficients                6M     CANCORR   MANOVA   canon
Standardized canonical coefficients   6M     CANCORR   MANOVA             Canonical    MGLH
Canonical variable loadings           6M     CANCORR   MANOVA             Canonical    MGLH
Canonical variable scores             6M     CANCORR            predict   Canonical    MGLH
Plot of canonical variable scores     6M     PLOT               graph     Canonical    GRAPH
Bartlett's test and P value           6M               MANOVA             Canonical    CORR
Wilks' lambda and F approx.                  CANCORR   MANOVA                          MGLH
Redundancy analysis                   6M     CANCORR                      Canonical
The OUT = data set option will produce a new data set that includes the original variables plus the canonical variable scores, which can then be plotted.
The BMDP 6M program has a canonical paragraph where you state which variables are the dependent (first) canonical variables and which are the independent (second) canonical variables. Print and plot statements are used to obtain additional output and plots. The BMDP output is comprehensive; in fact, Figure 10.1 was produced by BMDP 6M.
The output in STATA is less complete but it does include estimates for the standard errors of the canonical coefficients, along with confidence limits, and a test that they are zero. The STATA canon program simply asks for the dependent and independent variables. The original data could be standardized prior to running canon to obtain standardized coefficients. The canonical variable scores could be obtained from the canonical coefficients following the example in Table 10.4 and they could then be plotted.
The STATISTICA canonical correlation options are also quite complete. In STATISTICA the results are obtained by first telling the program which variables you wish to use and then pointing and clicking on the options you desire. In STATISTICA the standardized coefficients are called canonical weights. The results labeled factor structure are the correlations between the canonical variables and the corresponding variables given in Table 10.5. The numerical value of the second canonical correlation is given with the chi-square test results.
With SPSS MANOVA the user specifies the variables and output. An example is included in the manual. SYSTAT MGLH is similar except you have to run the program twice, interchanging the dependent and independent variables to get both sets of canonical coefficients. If you use the point-and-click options, then include an '&' between the independent variable names typed in the Effects window. Neither of these programs will automatically save or plot canonical variable scores but, as noted earlier, they can be obtained by using transformations. Since SYSTAT and STATISTICA print the standardized coefficients, the first step is to obtain the unstandardized coefficients by dividing by the standard deviation of the corresponding variable (section 10.4). Then the computation given in Table 10.4 is performed using transformation options.
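The hand computation of canonical variable scores described above (Table 10.4 style) is a weighted sum of the mean-deviated variables. The sketch below illustrates it with hypothetical coefficients and means; only the CESD coefficient is taken from the text, and the remaining numbers are placeholders.

import numpy as np

coef = np.array([-0.0555, 0.10])   # a1 for CESD (Table 10.3) and a hypothetical a2 for health
means = np.array([8.9, 1.8])       # hypothetical sample means of CESD and health
case = np.array([4.0, 2.0])        # one individual's CESD and health values

u_score = coef @ (case - means)    # canonical variable score for this individual
print(round(u_score, 3))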
10.7 WHAT TO WATCH OUT FOR

Because canonical correlation analysis can be viewed as an extension of multiple linear regression analysis, many of the precautionary remarks made at the ends of Chapters 6-8 apply here. The user should be aware of the following points:
1. The sample should be representative of the population to which the investigator wishes to make inferences. A simple random sample has this property. If this is not attainable, the investigator should at least make sure that the cases are selected in such a way that the full range of observations occurring in the population can occur in the sample. If the range is artificially restricted, the estimates of the correlations will be affected.
2. Poor reliability of the measurements can result in lower estimates of the correlations among the X's and among the Y's.
3. A search for outliers should be made by obtaining histograms and scatter diagrams of pairs of variables.
4. Stepwise procedures are not available in the programs described in this chapter. Variables that contribute little and are not needed for theoretical models should be candidates for removal. It may be necessary to run the programs several times to arrive at a reasonable choice of variables.
5. The investigator should check that the canonical correlation is large enough to make examination of the coefficients worthwhile. In particular, it is important that the correlation not be due to just one dependent variable and one independent variable. The proportion of variance should be examined, and if it is small then it may be sensible to reduce the number of variables in the model.
6. If the sample size is large enough, it is advisable to split it, run a canonical analysis on both halves, and compare the results to see if they are similar.
7. If the canonical coefficients and the canonical variable loadings differ considerably (i.e. if they have different signs), then both should be examined carefully to aid in interpreting the results. Problems of interpretation are often more difficult in the second or third canonical variates than in the first. The condition that subsequent linear combinations of the variables be independent of those already obtained places restrictions on the results that may be difficult to understand.
8. Tests of hypotheses regarding canonical correlations assume that the joint distribution of the X's and Y's is multivariate normal. This assumption should be checked if such tests are to be reported.
9. Since canonical correlation uses both a set of Y variables and a set of X variables, the total number of variables included in the analysis may be quite large. This can increase the problem of many cases not being used because of missing values. Either careful choice of variables or imputation techniques may be required.
SUMMARY
tiofl . 1 correla In this chapter we presented the basic concepts of can~ntca lysis. roe analys~s, ~n extension of multiple. regr~ssion and correlatiOn a~~epeodeot extensiOn ts that the dependent vanable ts replaced by two or mor
variables. If Q, the number of dependent variables, equals 1, then canonical correlation reduces to multiple regression analysis.
In general, the resulting canonical correlations quantify the strength of the association between the dependent and independent sets of variables. The derived canonical variables show which combinations of the original variables best exhibit this association. The canonical variables can be interpreted in a manner similar to the interpretation of principal components or factors, as will be explained in Chapters 14 and 15.
REFERENCES
References preceded by an asterisk require strong mathematical background.
*Bartlett, M.S. (1941). The statistical significance of canonical correlations. Biometrika, 32, 29-38.
*Cooley, W.W. and Lohnes, P.R. (1971). Multivariate Data Analysis. Wiley, New York.
Dillon, W.R. and Goldstein, M. (1984). Multivariate Analysis: Methods and Applications. Wiley, New York.
Hopkins, C.E. (1969). Statistical analysis by canonical correlation: A computer application. Health Services Research, 4 (Winter), 304-12.
*Lawley, D.N. (1959). Tests of significance in canonical analysis. Biometrika, 46, 59-66.
Levine, M.S. (1977). Canonical Analysis and Factor Comparison, Sage University Paper. Sage, Beverly Hills, CA.
Meredith, W. (1964). Canonical correlation with fallible data. Psychometrika, 29, 55-65.
Muller, K.E. (1981). Relationships between redundancy analysis, canonical correlation, and multivariate regression. Psychometrika, 46, 139-42.
*Rao, C.R. (1973). Linear Statistical Inference. Wiley, New York.
*Tatsuoka, M.M. (1988). Multivariate Analysis: Techniques for Educational and Psychological Research, 2nd edn. Wiley, New York.
Thompson, B. (1984). Canonical Correlation Analysis. Sage, Beverly Hills, CA.
Waugh, F.V. (1942). Regressions between sets of variables. Econometrica, 10, 290-310.
FURTHER READING
*Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321-77.
*Morrison, D.F. (1976). Multivariate Statistical Methods, 2nd edn. McGraw-Hill, New York.
Muller, K.E. (1982). Understanding canonical correlation through the general linear model and principal components. The American Statistician, 36, 342-54.
Thorndike, R.M. (1978). Correlational Procedures for Research. Gardner Press, New York.
PROBLEMS
10.1 For the depression data set, perform a canonical correlation analysis between the following:
• Set 1: AGE, MARITAL (married versus other), EDUCAT (high school or less versus other), EMPLOY (full-time versus other) and INCOME.
• Set 2: the last seven variables.
Canonical correlation analysis
Perform separate analyses for men and women. Interpret the result 10.2 For the data set given in Appendix A, perform a canonical correlatios. on height, weight, FVC, and FEV1 for fathers versus the same var7a~~alysis mothers. Interpret the results. es for 10.3 For the chemical companies data given in Table 8.1, perform a . corre~a~ion a~alysis us!ng P/E and E~S5 as d~pende~t variablesc::~mcal r~m~mmg vanab~es as mdep~nde~t vanables. W~tte t~e Interpretations of the stgmficant canomcal correlations tn terms of theu vanables. the 10.4 Using the data described in Appendix A, perform a canonical correl f analysis using height, weight, FVC and FEVl of the oldest child a a I~n dependent variables and the same measurements for the parents a: ~he independent variables. Interpret the results. e 10.5 Generate ~he sa~ple data forX_1,X2, ... ,X9, Y as in P~oblem 7.7. For each of the followmg pa~rs of s~ts of variables, per~orm a canomcal correlation analysis between the vanables m Set 1 and those m Set 2. Set 1 a. X1, X2, X3 b. X4, X5, X6 c. Y,X9
Set 2 X4-X9 X7, X8, X9 X1, X2,X3
Interpret the results in light of wh.at you know about the population. 10.6 For the depression data set run a canonical analysis using HEALTH, ACUTEILL, CHRONILL and BEDDAYS as dependent variables and AGE, SEX (transformed to 0, 1), EDUCAT and INCOME as independent variables. Interpret the results.
11 Discriminant analysis USING DISCRIMINANT ANALYSIS TO 1 1 1. CLASSIFY CASES Discriminant analysis is used to ~lassi~y a ~ase into one of ~wo ?r ~ore ·· ulations. Two methods are gtven m this book for classrficatton mto ~~~ulations or groups, discriminant analysis and logistic regr_ession anal~sis (Chapter 12). In both of these analyses, you must know which populatiOn the individual belongs to for the sample initially being analyzed (also called the training sample). Either analysis will yield information that can be used to classify future individuals whose population membership is unknown into a population. If population membership is unknown in the training sample, then cluster analysis given in Chapter 16 could be used. Sections 11.2 and 11.3 provide further discussion on when discriminant function analysis is useful and introduce an example that will show how to classify individuals as depressed or not depressed. Section 11.4 presents the basic concepts used in classification as given by Fisher (1936) and shows how they apply to the example. The interpretation of the effects of each variable based on the discriminant function coefficients is also presented. The a~sumptions made in making inferences from discriminant analysis are discussed in section 11.5. Interpretation of various output options from the statistical programs for ~wo ~opulations is given in section 11.6. Often in practice, the numbers of r:~~~.n ~he ~wo populations are unequal; in section 11.7 a method of adjusting is hit ~s Is gtv~n. Also, if the seriousness of misclassification into one group intogd~r t~a~ ID the other, a technique for incorporating costs of misclassification gives ~~~~~mant function analysis is given in the same section. Section 11.8 Will cl:ssi~bonal ~ethods for determining how well the discriminant function to dete .~ ca~es In the overall population. Section 11.9 presents formal tests c~ses t~~~~ If the discriminan~ function results in better da~sification of dtscrirnin ance .alone. In sectwn 11.10, the use of stepwise procedures in Sectiona~~.fun~tton analysis is ~isc~ss~d. . . 11 t~ classify · ~ntroduces how discnmmant function analysts can be used cltscrirnina cases tn to more than two groups and section 11.12 discusses how Correiatio nt function analysis can be viewed as a special case of canonical n analysis.
244
Discriminant analysis
Section 11.13 summarizes the output from the various computer and section 11.14lists what to watch out for in doing discriminanf;~!ra.Ills analysis. Chon
11.2 WHEN IS DISCRIMINANT ANALYSIS USED? Discriminant analysis techniques are used to classify individuals into . . ( . ) one of two or more a1tematiVe groups or popu1attons on the basis of a set of measurements. The populations are ~nown to be distinct, and each individual bel~ngs to one ?f them. Thes~ techniques c.an al.so be used to identify which vanables contnbute to makmg the classrficahon. Thus, as in regressio analysis, we have two uses, prediction and description. · n As an example, consider an archeologist who wishes to determine which of two possible tribes created a particular statue found in a dig. The archeologist takes measurements on several characteristics of the statue and must decide whether these measurements are more likely to have come from the distribution characterizing the statues of one tribe or from the other tribe's distribution. These distributions are based on data from statues known to have been created by members of one tribe or the other. The problem of classification is therefore to guess who made the newly found statue on the basis of measurements obtained from statues whose identities are certain. The measurements on the new statue may consist of a single observation, such as its height. However, we would then expect a low degree of accuracy in classifying the new statue since there may be quite a bit of overlap in the distribution of heights of statues from the two tribes. If, on the other hand, the classification is based on several characteristics, we would have more confidence in the prediction. The discriminant analysis methods described in this chapter are multivariate techniques in the sense that they employ several measurements. 'de As another example, consi~er a loan officer.at a bank w~o wi~~es t~ d;c~de whether to approve an apphcant's automobile loan. This decisiOn. Is. to by determining whether the applicant's characteristics are more stmllar of those of persons who in the past repaid loans successfully or to those ast . on these two groups, available from ding P persons who defaulted.lnformatton records, would include factors such as age, income, marital status, outs tan coroes debt, and home ownership. . A third example which is described in detail in the next sect!on, betber · h to predtctd w · ' data set (Chapters 1 and 3). We WIS presse d from the depressiOn an individual living in the community is more or less likely to be e logistic on the basis of readily available information on the individual. . The examples just mentioned could also be analyzed ustng regression as will be discussed in Chapter 12.
Data example
245
pATA EXAMPLE 11 3 ' .ibed in Chapter 1, the depression data set was collected for individuals As ~~scr in Los Angeles County. To illustrate the ideas described in this rest we will develop a method for estimating whether an individual is ~bar~~ be depressed. For the purposes of this example 'depression' is defined Itke y . ore of 16 or greater on the CESD scale (see the code book given in bY~ scJ.4). This information is given in the variable called 'cases'. We will ra ~be estimation on demographic and other characteristics of the individual. base variables used are ed ucatton · a~d'.mcom~. W e ~ay_a1so WIS. · h to d et~mune · The hether we can improve our predictiOn by mcludmg mformatton on Illness, wx or age. Additional variables are an overall health rating, number of bed ~:ys in the pas~ two _months (0 if less than eig~t days, 1 if eigh~ ~r more), acute illness (1 if yes m the past two months, 0 If no) and chrome Illness (0 if none, 1 if one or more). The first step in examining the data is to obtain descriptive measures of each of the groups. Table 11.1 lists the means and standard deviations for each variable in both groups. Note that in the depressed group, group II,
m;
Table 11.1 Means and standard deviations for nondepressed and depressed adults in Los Angeles County

                                           Group I, nondepressed (N = 244)    Group II, depressed (N = 50)
Variable                                   Mean       Standard deviation      Mean       Standard deviation
Sex (male = 1, female = 2)                 1.59a      0.49                    1.80a      0.40
Age (in years)                             45.2       18.1                    40.4       17.4
Education (1 to 7, 7 high)                 3.55       1.33                    3.16       1.17
Income (thousands of dollars per year)    21.68a     15.98                   15.20a      9.84
Health index (1 to 4, 1 excellent)         2.06a      0.98                    1.71a      0.80
Bed days (0: less than 8 days per year;
   1: 8 or more days)                      0.17a      0.39                    0.42a      0.50
Acute conditions (0 = no, 1 = yes)         0.28       0.45                    0.38       0.49
Chronic conditions (0 = none,
   1 = one or more)                        0.48       0.50                    0.62       0.49

a For a test of equal means, P less than 0.01, assuming a normal distribution.
we have a higher percentage of females, a lower average age, a lower educational level and lower incomes. The standard deviations in the two groups are similar except for income, where they are slightly different. The variances for income are 255.4 for the nondepressed group and 96.8 for the depressed group. The ratio of the variance for the nondepressed group to that for the depressed is 2.64 for income, which increases the impression of difference in variation between the two groups as well as in mean values. The variances for the other variables are more similar.
Note also that the health characteristics of the depressed group are generally worse than those of the nondepressed, even though the members of the depressed group tend to be younger on the average. Because sex is coded males = 1 and females = 2, the average sex of 1.80 indicates that 80% of the depressed group are females. Similarly, 59% of the nondepressed individuals are female.
Suppose that we wish to predict whether or not individuals are depressed on the basis of their incomes. Examination of Table 11.1 shows that the mean value for depressed individuals is significantly lower than that for the nondepressed. Thus, intuitively, we would classify those with lower incomes as depressed and those with higher incomes as nondepressed. Similarly, we may classify the individuals on the basis of age alone, or sex alone, etc. However, as in the case of regression analysis, the use of several variables simultaneously can be superior to the use of any one variable. The methodology for achieving this result will be explained in the next sections.
11.4 BASIC CONCEPTS OF CLASSIFICATION
In this section, we present the underlying concepts of classification as given by Fisher (1936) and give an example illustrating its use. We also briefly discuss the coefficients from the Fisher discriminant function.
Statisticians have formulated different ways of performing and evaluating discriminant function analysis. One method of evaluating the results uses what are called classification functions. This approach will be described next. Necessary computations are given in sections 11.6, 11.11 and 11.13. In addition, discriminant function analysis can be viewed as a special case of canonical correlation analysis presented in Chapter 10. Many of the programs print out canonical coefficients and graphical output based upon canonical variates. This approach will be discussed in section 11.12.
In general, when discriminant function analysis is used to discriminate between two groups, Fisher's method and classification functions are used. When the discrimination is among three or more groups, various canonical coefficients are reported (sections 11.11 and 11.12). The canonical coefficients can be used for the two group case but there is no advantage in doing so. Many investigators find the two group comparisons easier to interpret and
therefore perform several such comparisons when there are more than two groups. For example, one group can be used as a reference or control group and the other groups compared one at a time to it. Alternatively, if you have three groups, you might wish to run both the discriminant analysis for all three groups simultaneously (section 11.11) as well as the three pairwise analyses between the three groups two at a time.
Most of this chapter will be devoted to the two group case as that is most straightforward to understand and most commonly used.
Principal ideas
Suppose that an individual may belong to one of two populations. We begin by considering how an individual can be classified into one of these populations on the basis of a measurement of one characteristic, say X. Suppose that we have a representative sample from each population, enabling us to estimate the distributions of X and their means. Typically, these distributions can be represented as in Figure 11.1.
From the figure it is intuitively obvious that a low value of X would lead us to classify an individual into population II and a high value would lead us to classify an individual into population I. To define what is meant by low or high, we must select a dividing point. If we denote this dividing point
Figure 11.1 Hypothetical Frequency Distributions of Two Populations Showing Percentage of Cases Incorrectly Classified. (The two shaded regions show the percentage of members of population I incorrectly classified into population II and the percentage of members of population II incorrectly classified into population I.)
by C, then we would classify an individual into population I if X ≥ C. For any given value of C we would be incurring a certain percentage of error. If the individual came from population I but the measured X were less than C, we would incorrectly classify the individual into population II, and vice versa. These two types of errors are illustrated in Figure 11.1. If we can assume that the two populations have the same variance, then the usual value of C is

C = (XI + XII)/2

This value ensures that the two probabilities of error are equal.
The idealized situation illustrated in Figure 11.1 is rarely found in practice. In real-life situations the degree of overlap of the two distributions is frequently large, and the variances are rarely precisely equal. For example, in the depression data the income distributions for the depressed and nondepressed individuals do overlap to a large degree, as illustrated in Figure 11.2. The usual dividing point is

C = (15.20 + 21.68)/2 = 18.44
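As a minimal sketch of this single-variable rule, the dividing point is halfway between the two group mean incomes from Table 11.1, and a case is assigned to whichever side of C its income falls on.

mean_not_depressed = 21.68     # mean income, group I (Table 11.1)
mean_depressed = 15.20         # mean income, group II

C = (mean_not_depressed + mean_depressed) / 2      # 18.44

def classify(income):
    return "not depressed" if income >= C else "depressed"

print(C, classify(28.0), classify(10.0))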
As can be seen from Figure 11.2, the percentage errors are rather large.

Figure 11.2 Distribution of Income for Depressed and Nondepressed Individuals Showing Effects of a Dividing Point at an Income of $18 440.
Table 11.2 Classification of individuals as depressed or not depressed on the basis of income alone

                             Classified as
Actual status                Not depressed    Depressed    % correct
Not depressed (N = 244)           123            121          50.4
Depressed (N = 50)                 19             31          62.0
Total N = 294                     142            152          52.4
The exact data on the errors are shown in Table 11.2. These numbers were obtained by first checking whether each individual's income was greater than or equal to $18.44 x 10^3 and then determining whether the individual was correctly classified. For example, of the 244 nondepressed individuals, 123 had incomes greater than or equal to $18.44 x 10^3 and were therefore correctly classified as not depressed (Table 11.2). Similarly, of the 50 depressed individuals, 31 were correctly classified. The total number of correctly classified individuals is 123 + 31 = 154, amounting to 52.4% of the total sample of 294 individuals, as shown in Table 11.2. Thus, although the mean incomes were significantly different from each other (P < 0.01), income alone is not very successful in identifying whether an individual is depressed.
Combining two or more variables may provide better classification. Note that the number of variables used must be less than NI + NII - 1. For two variables X1 and X2 concentration ellipses may be illustrated as shown in Figure 11.3 (see section 7.5 for an explanation of concentration ellipses). Figure 11.3 also illustrates the univariate distributions of X1 and X2 separately. The univariate distribution of X1 is what is obtained if the values of X2 are ignored. On the basis of X1 alone, and its corresponding dividing point C1, a relatively large amount of error of misclassification would be encountered. Similar results occur for X2 and its corresponding dividing point C2. To use both variables simultaneously, we need to divide the plane into two regions, each corresponding to one population, and classify individuals accordingly. A simple way of defining the two regions is via a straight line through the points of intersection of the two concentration ellipses, as shown in Figure 11.3.
The percentage of individuals from population II incorrectly classified is shown by the crosshatched areas. The shaded areas show the percentage of individuals from population I who are misclassified. The errors incurred by using two variables are often much smaller than those incurred by using either variable alone. In the illustration in Figure 11.3 this result is, in fact, the case.
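The counting that produces a table such as Table 11.2 can be sketched as follows; the labels are synthetic placeholders, not the depression data.

def classification_table(truth, predicted):
    # Cross-classify the true group against the predicted group and
    # compute the percentage correctly classified.
    counts = {}
    for t, p in zip(truth, predicted):
        counts[(t, p)] = counts.get((t, p), 0) + 1
    correct = sum(n for (t, p), n in counts.items() if t == p)
    return counts, 100.0 * correct / len(truth)

truth     = ["not", "not", "dep", "dep", "not"]
predicted = ["not", "dep", "dep", "not", "not"]
print(classification_table(truth, predicted))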
Figure 11.3 Classification into Two Groups on the Basis of Two Variables. (The dividing line is the line Z = C; cases falling in Region I are classified into population I.)
The dividing line was represented by Fisher (1936) as an equation Z = C, where Z is a linear combination of X1 and X2 and C is a constant defined as follows:

C = (ZI + ZII)/2

where ZI is the average value of Z in population I and ZII is the average value of Z for population II.
In this book we will call Z the Fisher discriminant function, written as

Z = a1X1 + a2X2

for the two-variable case. The formulas for computing the coefficients a1 and a2 can be found in Fisher (1936), Lachenbruch (1975) or Afifi and Azen (1979). For each individual from each population, the value of Z is calculated. When the frequency distributions of Z are plotted separately for each
Figure 11.4 Frequency Distributions of Z for Populations I and II.
population, the result is as illustrated in Figure 11.4. In this case, the bivariate classification problem with X1 and X2 is reduced to a univariate situation using the single variable Z.

Example
As an example of this technique for the depression data, it may be better to use both income and age to classify depressed individuals. Program BMDP 7M was used to obtain the Fisher discriminant function. Unfortunately, this equation cannot be obtained directly from the output, and some intermediate computations must be made, as explained in section 11.6. The result is
Z = 0.0209(age) + 0.0336(income)
The mean Z value for each group can be obtained as follows, using the means from Table 11.1:

mean Z = 0.0209(mean age) + 0.0336(mean income)

Thus

Z(not depressed) = 0.0209(45.2) + 0.0336(21.68) = 1.67

and

Z(depressed) = 0.0209(40.4) + 0.0336(15.20) = 1.36
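The text refers the reader elsewhere for the coefficient formulas. As a hedged illustration, the standard two-group solution is a = S^(-1)(xbar_I - xbar_II), where S is the pooled covariance matrix; the numbers below are the pooled covariances and group means for age and income given later in this chapter.

import numpy as np

S_pooled = np.array([[324.8, -57.7],
                     [-57.7, 228.6]])
xbar_I = np.array([45.2, 21.68])        # not depressed: mean age, mean income
xbar_II = np.array([40.4, 15.20])       # depressed

a = np.linalg.solve(S_pooled, xbar_I - xbar_II)
print(np.round(a, 4))   # close to the (0.0209, 0.0336) reported in the text;
                        # small differences come from rounding of the inputs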
Figure 11.5 Classification of Individuals as Depressed or Not Depressed on the Basis of Income and Age.
The dividing point is therefore

C = (1.67 + 1.36)/2 = 1.515

An individual is then classified as depressed if his or her Z value is less than 1.52. For two variables it is possible to illustrate the classification procedure, as shown in Figure 11.5. The dividing line is a graph of the equation Z = C, i.e.

0.0209(age) + 0.0336(income) = 1.515
An individual falling in the region above the dividing line is classified as not depressed. Note that, indeed, very few depressed persons fall far above the dividing line.
To measure the degree of success of the classification procedure in this sample, we must count how many of each group are correctly classified. The computer program produces these counts automatically; they are shown in Table 11.3. Note that 63.1% of the nondepressed are correctly classified. This value is compared with 50.4%, which results when income alone is used (Table 11.2). The percentage of depressed correctly classified is comparable
Table 11.3 Classification of individuals as depressed or not depressed on the basis of income and age

                             Classified as
Actual status                Not depressed    Depressed    % correct
Not depressed (N = 244)           154             90          63.1
Depressed (N = 50)                 20             30          60.0
Total N = 294                     174            120          62.6
in both tables. Combining age with income improved the overall percentage of correct classification from 52.4% to 62.6%.
Interpretation of coefficients
In addition to its use for classification, the Fisher discriminant function is helpful in indicating the direction and degree to which each variable contributes to the classification. The first thing to examine is the sign of each coefficient: if it is positive, the individuals with larger values of the corresponding variable tend to belong to population I, and vice versa. In the depression data example, both coefficients are positive, indicating that large values of both variables are associated with a lack of depression. In more complex examples, comparisons of those variables having positive coefficients with those having negative coefficients can be revealing. To quantify the magnitude of the discrimination, the investigator may find standardized coefficients helpful, as explained in section 11.6.
The concept of discriminant functions applies as well to situations where there are more than two variables, say X1, X2, ..., Xp. As in multiple linear regression, it is often sufficient to select a small number of variables. Variable selection will be discussed in section 11.10. In the next section we present some necessary theoretical background for discriminant function analysis.

11.5 THEORETICAL BACKGROUND
In deriving his linear discriminant function, Fisher (1936) did not have to make any distributional assumptions for the variables used in classification. Fisher denoted the discriminant function by
Z = a1X1 + a2X2 + ... + apXp
As in the previous section, we denote the two mean values of Z by ZI and ZII. We also denote the pooled sample variance of Z by S²Z (this statistic is
similar to the pooled variance used in the standard two-sample t test, e.g. Dixon and Massey, 1983). To measure how 'far apart' the two groups are in terms of values of Z, we compute

D² = (ZI - ZII)²/S²Z
Fisher selected the coefficients a1, a2, ..., ap so that D² has the maximum possible value. The term D² can be interpreted as the squared distance between the means of the standardized values of Z. A larger value of D² indicates that it is easier to discriminate between the two groups. The quantity D² is called the Mahalanobis distance. Both the ai and D² are functions of the group means and the pooled variances and covariances of the variables. The manuals for the statistical package programs make frequent use of these formulas, and you will find a readable presentation of them in Lachenbruch (1975) or Klecka (1980).
Some distributional assumptions make it possible to develop further statistical procedures relating to the problem of classification. These procedures include tests of hypotheses for the usefulness of some or all of the variables and methods for estimating errors of classification. The variables used for classification are denoted by X1, X2, ..., Xp. The standard model makes the assumption that for each of the two populations these variables have a multivariate normal distribution. It further assumes that the covariance matrix is the same in both populations. However, the mean values for a given variable may be different in the two populations. We further assume that we have a random sample from each of the populations. The sample sizes are denoted by NI and NII.
Alternatively, we may think of the two populations as subpopulations of a single population. For example, in the depression data the original population consists of all adults over 18 years old in Los Angeles County. Its two subpopulations are the depressed and nondepressed. A single sample was collected and later diagnosed to form two subsamples.
If we examine the variables age and income to see if they meet the theoretical assumptions made in performing discriminant function analysis, it becomes clear that they do not. We examined the distribution of income in section 4.3, where it was found that income had quite a skewed distribution, and a logarithmic transformation could be used to reduce that skewness. Theoretically, if a set of data follows a multivariate normal distribution, then each variable in the set will be univariately normally distributed. But this does not work vice versa. Each variable can be univariately normally distributed without the set of variables being multivariately normally distributed. When each variable is univariately normal, however, it is very likely that the multivariate distribution is approximately normal, so that most investigators are satisfied with checking the variables one at a time.
In addition, the variances for income in the depressed and nondepressed groups appear to be different. A formal test for equal covariance matrices (see section 7.5 for a description and example of a covariance matrix) in the depressed and nondepressed groups was rejected at a P = 0.00054 level. This data set clearly does not meet the theoretical assumptions for performing tests of hypotheses and making inferences based on the assumptions, since one of the variables is not normally distributed and the covariance matrices are not equal.
However, it is not clear that formal tests of hypotheses of underlying assumptions are the best way of determining when a statistical analysis, such as discriminant function analysis, should be used. Numerous statistical researchers have studied how discriminant function analysis performs when different distributions, sample sizes etc. are tried. This research has been done using simulations. Here, the researcher may take a multivariate normal distribution with all the variances equal to one and another distribution with all variances, say, equal to four. The distributions are also designed to have means that are a fixed distance apart. Then, samples of a given size are taken from each distribution and the researcher sees how well discriminant function analysis performs (what percentage of the two samples is correctly classified). This process is repeated for numerous sets of two samples. The same process is then repeated for some other method of classification such as logistic regression analysis (Chapter 12) or a more complex type of discriminant function analysis. The method of choice is then the one which empirically has the highest proportion of cases correctly classified.
If an investigator has data that are similar to what has been studied using simulations, then this is often used to guide the choice of analysis since it shows what works in practice. The difficulty with this method is that the simulations seldom precisely fit the data set that an investigator has, so usually some judgment is needed to decide what to trust.
In the example using age and income, even though the covariance matrices were not equal they were close enough to justify using regular discriminant function analysis for classification, assuming the multivariate data were normally distributed (Marks and Dunn, 1974). Further results will be given in section 11.13.

11.6 INTERPRETATION
In this section we present various methods for interpreting discriminant functions. Specifically, we discuss the regression analogy, computations of the coefficients, standardized coefficients and posterior probabilities.

Regression analogy
A useful connection exists between regression and discriminant analyses. For the regression interpretation we think of the classification variables X1, X2, ..., Xp
as the independent variables. The dependent variable is a dummy variable indicating the population from which each observation comes. Specifically,

Y = NII/(NI + NII)

if the observation comes from population I, and

Y = -NI/(NI + NII)

if the observation comes from population II. For instance, for the depression data Y = 50/(244 + 50) if the individual is not depressed and Y = -244/(244 + 50) if the individual is depressed.
When the usual multiple regression analysis is performed, the resulting regression coefficients are proportional to the discriminant function coefficients a1, a2, ..., ap (Lachenbruch, 1975). The value of the resulting multiple correlation coefficient R is related to the Mahalanobis D² by the following formula:

D² = [(NI + NII)(NI + NII - 2)/(NI NII)] [R²/(1 - R²)]
Hence from a multiple regression program it is possible to obtain the coefficients of the discriminant function and the value of D². The Z's for each group can be obtained by multiplying each coefficient by the corresponding variable's sample mean and adding the results. The dividing point C can then be computed as C = (ZI + ZII)/2.
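The regression analogy can be sketched as follows; the data are synthetic placeholders, and the point is only that the fitted slopes from the dummy-variable regression are proportional to the discriminant coefficients.

import numpy as np

rng = np.random.default_rng(2)
n_I, n_II = 244, 50
X = np.vstack([rng.normal(loc=[45, 22], scale=[18, 15], size=(n_I, 2)),
               rng.normal(loc=[40, 15], scale=[17, 10], size=(n_II, 2))])
y = np.concatenate([np.full(n_I, n_II / (n_I + n_II)),       # population I coding
                    np.full(n_II, -n_I / (n_I + n_II))])     # population II coding

design = np.column_stack([np.ones(len(y)), X])               # intercept plus age, income
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coef[1:])   # proportional to the discriminant coefficients a1, a2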
As in regression analysis, some of the independent variables (or classification variables) may be dummy variables (section 9.3). In the depression example we may, for instance, use sex as one of the classification variables by treating it as a dummy variable. Research has shown that even though such variables do not follow a normal distribution, their use in linear discriminant analysis can still help improve the classification.
Computing the Fisher discriminant function
In the discriminant analysis programs to be discussed in section 11.13, some computations must be performed to obtain the values of the discriminant coefficients. Some programs (BMDP 7M and STATISTICA Discriminant) print what is called a 'classification function' for each group. SPSS
Table 11.4 Classification function and discriminant coefficients for age and income from BMDP 7M

                 Classification function
Variables        Group I, not depressed    Group II, depressed    Discriminant function
Age                    0.1634                   0.1425               0.0209 = a1
Income                 0.1360                   0.1024               0.0336 = a2
Constant              -5.8641                  -4.3483               1.5158 = C
DISCRIMINANT calls these 'classification function coefficients', the SAS DISCRIM procedure calls them the 'linearized discriminant functions' and SYSTAT calls them the 'group classification functions'. For each population the coefficients are printed for each variable. The discriminant function coefficients a1, a2, ..., ap are then obtained by subtraction.
As an example, we again consider the depression data using age and income. The classification functions are shown in Table 11.4. The coefficient a1 for age is 0.1634 - 0.1425 = 0.0209. For income, a2 = 0.1360 - 0.1024 = 0.0336. The dividing point C is also obtained by subtraction, but in reverse order. Thus C = -4.3483 - (-5.8641) = 1.5158. (This agrees closely with the previously computed value C = 1.515, which we use throughout this chapter.) For more than two variables the same procedure is used to obtain a1, a2, ..., ap and C.
In practice each case is assigned to the group for which it has the largest classification function score. For example, the first respondent in the depression data set is 68 years old and has an income of $4000. The classification function score for not depressed is 0.1634(68) + 0.1360(4) - 5.8641 = 5.7911 and for depressed it is 5.7513, so she would be assigned to the not depressed group.

Renaming the groups
If we wish to put the depressed persons into group I and the nondepressed into group II, we can do so by reversing the zero and one values for the 'cases' variable. In the data used for our example, 'cases' equals 1 if a person is depressed and 0 if not depressed. However, we can make cases equal 0 if a person is depressed and 1 if a person is not depressed. The BMDP TRANSFORM paragraph for this conversion is as follows:

TRANSFORM
   IF (CASES EQ 1) THEN CASES = 0.
   IF (CASES EQ 0) THEN CASES = 1.
Discriminant analysis
Note that this reversal does not change the classification functions but . changes their order so that all the signs in the linear discriminant fusunpiy are changed. The new constant and discriminant function are nction -1.515 and -0.0209(age)- 0.0336(income), respectively. The ability to discriminate is exactly the same, and the number of individ correctly classified is the same. ua1s Standardized coefficients
As in the case of regression analysis, the values of a1, a2, ..., ap are not directly comparable. However, an impression of the relative effect of each variable on the discriminant function can be obtained from the standardized discriminant coefficients. This technique involves the use of the pooled (or within-group) covariance matrix from the computer output. In the original example this covariance matrix is as follows:

              Age       Income
Age          324.8      -57.7
Income       -57.7      228.6
Thus the pooled standard deviations are (324.8)^1/2 = 18.02 for age and (228.6)^1/2 = 15.12 for income. The standardized coefficients are obtained by multiplying the a's by the corresponding pooled standard deviations. Hence the standardized discriminant coefficients are

(0.0209)(18.02) = 0.377 for age

and

(0.0336)(15.12) = 0.505 for income

It is therefore seen that income has a slightly larger effect on the discriminant function than age.

Posterior probabilities
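The standardization just described is a one-line computation; the sketch below uses the pooled variances from the matrix above (compare with the values 0.377 and 0.505 quoted in the text; small differences reflect rounding of the coefficients).

coefs = {"age": 0.0209, "income": 0.0336}
pooled_sd = {"age": 324.8 ** 0.5, "income": 228.6 ** 0.5}   # about 18.02 and 15.12

standardized = {v: coefs[v] * pooled_sd[v] for v in coefs}
print({v: round(s, 3) for v, s in standardized.items()})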
Thus far the classification procedure assigned an individual to either group I or group II. Since there is always a possibility of making the wrong classification, we may wish to compute the probability that the individual has come from one group or the other. We can compute such a probability under the multivariate normal model discussed in section 11.5. The formula is

probability of belonging to population I = 1/[1 + exp(-Z + C)]
where exp(-Z + C) indicates e raised to the power (-Z + C), as discussed in Truett, Cornfield and Kannel (1967). The probability of belonging to population II is one minus the probability of belonging to population I.
For example, suppose that an individual from the depression study is 42 years old and earns $24 x 10^3 income per year. For that individual the value of the discriminant function is
Z = 0.0209(42) + 0.0336(24) = 1.718

Since C = 1.515, and therefore Z is greater than C, we classify the individual as not depressed (in population I). To determine how likely this person is to be not depressed, we compute the probability

1/[1 + exp(-1.718 + 1.515)] = 0.55

The probability of being depressed is 1 - 0.55 = 0.45. Thus this individual is only slightly more likely to be not depressed than to be depressed.
Several packaged programs compute the probabilities of belonging to both groups for each individual in the sample. In some programs these probabilities are called the posterior probabilities since they express the probability of belonging to a particular population posterior to (i.e. after) performing the analysis. Posterior probabilities offer a valuable method of interpreting classification results. The investigator may wish to classify only those individuals whose probabilities clearly favor one group over the other. Judgment could be withheld for individuals whose posterior probabilities are close to 0.5. In the next section another type of probability, called prior probability, will be defined and used to modify the dividing point.
Finally, we note that the discriminant function presented here is a sample estimate of the population discriminant function. We would compute the latter if we had the actual values of the population parameters. If the populations were both multivariate normal with equal covariance matrices, then the population discriminant classification procedure would be optimal; i.e. no other classification procedure would produce a smaller total classification error (Anderson, 1984).
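As a sketch of the posterior probability formula quoted above, the following reproduces the worked example using the discriminant score reported in the text.

import math

def posterior_not_depressed(z, c=1.515):
    return 1.0 / (1.0 + math.exp(-z + c))

z = 1.718      # discriminant score for the individual in the text
print(round(posterior_not_depressed(z), 2))   # 0.55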
ll.? ADJUSTING THE VALUE OF THE DIVIDING POINT In th' · can ~s s~ction we indicate how prior probabilities and costs of misclassification e tncorporated into the choice of the dividing point C. Inca· l' rporating prior probabilities into the choice of C hus far the d' .. d. . h d h . .. Peree t ' IVI mg pmnt C as been use as t e pomt producmg an equal n age of errors of both types, i.e. the probability of misclassifying an
individual from population I into population II, or vice versa. This can be seen in Figure 11.4. But the choice of the value of C can be made to produce any desired ratio of these probabilities of errors. To explain how this choice is made, we must introduce the concept of prior probabilities. Since the two populations constitute an overall population, it is of interest to examine their relative size. The prior probability of population I is the probability that an individual selected at random actually comes from population I. In other words, it is the proportion of individuals in the overall population who fall in population I. This proportion is denoted by qI.
In the depression data the definition of a depressed person was originally designed so that 20% of the population would be designated as depressed and 80% nondepressed. Therefore the prior probability of not being depressed (population I) is qI = 0.8. Likewise, qII = 1 - qI = 0.2. Without knowing any of the characteristics of a given individual, we would thus be inclined to classify him or her as nondepressed, since 80% were in that group. In this case we would be correct 80% of the time. This example offers an intuitive interpretation of prior probabilities. Note, however, that we would be always wrong in identifying depressed individuals.
The theoretical choice of the dividing point C is made so that the total probability of misclassification is minimized. This total probability is defined as qI times the probability of misclassifying an individual from population I into population II plus qII times the probability of misclassifying an individual from population II into population I:

qI · Prob(II given I) + qII · Prob(I given II)

Under the multivariate normal model mentioned in section 11.5 the optimal choice of the dividing point C is

C = (ZI + ZII)/2 + ln(qII/qI)
. h where ln stands for the natural loganthm. Note t at q11 /% = 1 and lnq11 /% = 0. In this case Cis
C=
ZI
+ Zu
·r qi =
1
_l
qu-
2'
then
if
2
1
:::: 2. :::~ q~ the
Thus in the previous sections we have been implicitly assuming that q. For the depression data we have seen that % = 0.8, and there theoretical dividing point should be C
=
1.515 + ln(0.25)
=
1.515 - 1.386 = 0.129
0
In examining the data, we see that using this dividing point classifies all of the nondepressed individuals correctly and all of the depressed individuals incorrectly. Therefore the probability of classifying a nondepressed individual (population I) as depressed (population II) is zero. On the other hand, the probability of classifying a depressed individual (population II) as nondepressed (population I) is 1. Therefore the total probability of misclassification is (0.8)(0) + (0.2)(1) = 0.2. When C = 1.515 was used, the two probabilities of misclassification were 0.379 and 0.400, respectively (Table 11.3). In that case the total probability of misclassification is (0.8)(0.379) + (0.2)(0.400) = 0.383. This result verifies that the theoretical dividing point did produce a smaller value of this total probability of misclassification.

In practice, however, it is not appealing to identify none of the depressed individuals. If the purpose of classification were preliminary screening, we would be willing to incorrectly label some individuals as depressed in order to avoid missing too many of those who are truly depressed. In practice, we would choose various values of C and for each value determine the two probabilities of misclassification. The desired choice of C would be made when some balance of these two is achieved.

Incorporating costs into the choice of C

One method of weighting the errors is to determine the relative costs of the two types of misclassification. For example, suppose that it is four times as serious to falsely label a depressed individual as nondepressed as it is to label a nondepressed individual as depressed. These costs can be denoted as

cost(II given I) = 1 and cost(I given II) = 4

The dividing point C can then be chosen to minimize the total cost of misclassification, namely

q_I · Prob(II given I) · cost(II given I) + q_II · Prob(I given II) · cost(I given II)

The choice of C that achieves this minimization is

C = (Z̄_I + Z̄_II)/2 + K
where K = ln[q_II · cost(I given II) / (q_I · cost(II given I))]
In the depression example the value of K is

K = ln[0.2(4)/0.8(1)] = ln 1 = 0

In other words, this numerical choice of costs of misclassification and the use of prior probabilities counteract each other, so that C = 1.515, the same value obtained without incorporating costs and prior probabilities.

Many of the computer programs allow the user to adjust the prior probabilities but not the costs. We can trick a program into incorporating costs into the prior probabilities, as the following example illustrates. Suppose q_I = 0.4, q_II = 0.6, cost(II given I) = 5 and cost(I given II) = 1. Then

adjusted q_I = q_I · cost(II given I) = (0.4)(5) = 2

and

adjusted q_II = q_II · cost(I given II) = (0.6)(1) = 0.6

Since the prior probabilities must add up to one (not 2 + 0.6 = 2.6), we further adjust the computed priors so that their sum is one, i.e.

adjusted q_I = 2/2.6 = 0.77 and adjusted q_II = 0.6/2.6 = 0.23
Finally, it is important to note that incorporating the prior probabilities and costs of misclassification alters only the choice of the dividing point C. It does not affect the computation of the coefficients a1, a2, ..., aP in the discriminant function. If the computer program does not allow the option of incorporating those quantities, you can easily modify the dividing point as was done in the above example.
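These adjustments are easy to script. The following sketch (in Python; the function and variable names are ours and purely illustrative) computes the adjusted dividing point from the priors and costs, and reproduces the "adjusted prior" trick described above with the numbers used in this section.

import math

def dividing_point(zbar_avg, q1, q2, cost21=1.0, cost12=1.0):
    # C = (Zbar_I + Zbar_II)/2 + ln[ q_II*cost(I given II) / (q_I*cost(II given I)) ]
    k = math.log((q2 * cost12) / (q1 * cost21))
    return zbar_avg + k

# Depression example: (Zbar_I + Zbar_II)/2 = 1.515, priors 0.8 and 0.2
print(dividing_point(1.515, q1=0.8, q2=0.2))                      # 0.129 (priors only)
print(dividing_point(1.515, q1=0.8, q2=0.2, cost21=1, cost12=4))  # 1.515 (costs cancel the priors)

# Folding costs into "adjusted" priors for a program that accepts priors only
q1, q2, cost21, cost12 = 0.4, 0.6, 5.0, 1.0
a1, a2 = q1 * cost21, q2 * cost12
total = a1 + a2
print(a1 / total, a2 / total)   # about 0.77 and 0.23, as in the text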
11.8 HOW GOOD IS THE DISCRIMINANT FUNCTION?

A measure of goodness for the classification procedure consists of the two probabilities of misclassification, probability (II given I) and probability (I given II). Various methods exist for estimating these probabilities. One method, called the empirical method, was used in the previous example. That is, we applied the discriminant function to the same samples used for deriving it and computed the proportion incorrectly classified from each group
(Tables 11.2 and 11.3). This process is a form of validation of the discriminant function. Although this method is intuitively appealing, it does produce biased estimates. In fact, the resulting proportions underestimate the true probabilities of misclassification, because the same sample is used for deriving and validating the discriminant function.

Ideally, we would like to derive the function from one sample and apply it to another sample to estimate the proportion misclassified. This procedure is called cross-validation, and it produces unbiased estimates. The investigator can achieve cross-validation by randomly splitting the original sample from each group into two subsamples: one for deriving the discriminant function and one for cross-validating it.

The investigator may be hesitant to split the sample if it is small. An alternative method sometimes used in this case, which imitates splitting the samples, is called the jackknife procedure. In this method we exclude one observation from the first group and compute the discriminant function on the basis of the remaining observations. We then classify the excluded observation. This procedure is repeated for each observation in the first sample. The proportion of misclassified individuals is the jackknife estimate of Prob(II given I). A similar procedure is used to estimate Prob(I given II). This method produces nearly unbiased estimators. Some programs offer this option.

If we accept the multivariate normal model with equal covariance matrices, theoretical estimates of the probabilities are also available and require only an estimate of D². The formulas are

estimated Prob(II given I) = area to the left of (K - D²/2)/D under the standard normal curve

and

estimated Prob(I given II) = area to the left of (-K - D²/2)/D under the standard normal curve

where K = ln[q_II · cost(I given II) / (q_I · cost(II given I))]
If K = 0, these two estimates are each equal to the area to the left of (-D/2) under the standard normal curve. For example, in the depression example D² = 0.319 and K = 0. Therefore D/2 = 0.282, and the area to the left of
-0.282 is 0.389. From this method we estimate both Prob(II given I) and Prob(I given II) as 0.39. This method is particularly useful if the discriminant function is derived from a regression program, since D² can be easily computed from R² (section 11.6).

Unfortunately, this last method also underestimates the true probabilities of misclassification. An unbiased estimator of the population Mahalanobis D² is

unbiased D² = [(N_I + N_II - P - 3)/(N_I + N_II - 2)] D² - P(1/N_I + 1/N_II)
In the depression example we have

unbiased D² = [(50 + 244 - 2 - 3)/(50 + 244 - 2)](0.319) - 2(1/50 + 1/244) = 0.316 - 0.048 = 0.268
The resulting area is computed in a similar fashion to the last method. Since unbiased D/2 = 0.259, the resulting area to the left of -D/2 is 0.398. In comparing this result with the estimate based on the biased D², we note that the difference is small because (1) only two variables are used and (2) the sample sizes are fairly large. On the other hand, if the number of variables P were close to the total sample size (N_I + N_II), the two estimates could be very different from each other. Whenever possible, it is recommended that the investigator obtain at least some of the above estimates of errors of misclassification and the corresponding probabilities of correct prediction.

To evaluate how well a particular discriminant function is performing, the investigator may also find it useful to compute the probability of correct prediction based on pure guessing. The procedure is as follows. Suppose that the prior probability of belonging to population I is known to be q_I. Then q_II = 1 - q_I. One way to classify individuals using these probabilities alone is to imagine a coin that comes up heads with probability q_I and tails with probability q_II. Every time an individual is to be classified, the coin is tossed. The individual is classified into population I if the coin comes up heads and into population II if it is tails. Overall, a proportion q_I of all individuals will be classified into population I.

Next, the investigator computes the total probability of correct classification. Recall that the probability that a person comes from population I is q_I, and the probability that any individual is classified into population I is also q_I. Therefore the probability that a person comes from population I and is correctly classified into population I is q_I². Similarly, q_II² is the probability that an individual comes from population II and is correctly classified into population II. Thus the total probability of correct classification using only
knowledge of the prior probabilities is q_I² + q_II². Note that the lowest possible value of this probability occurs when q_I = q_II = 0.5, i.e. when the individual is equally likely to come from either population. In that case q_I² + q_II² = 0.5. Using this method for the depression example, with q_I = 0.8 and q_II = 0.2, q_I² + q_II² = 0.68. Thus we would expect more than two-thirds of the individuals to be correctly classified if we simply flipped a coin that comes up heads 80% of the time. Note, however, that we would be wrong on 80% of the depressed individuals, a situation we may not be willing to tolerate. (In this context you might recall the role of costs of misclassification discussed in section 11.7.)
11.9 TESTING FOR THE CONTRIBUTIONS OF CLASSIFICATION VARIABLES

Can we classify individuals by using variables available to us better than we can by chance alone? One answer to this question assumes the multivariate normal model presented in section 11.4. The question can be formulated as a hypothesis-testing problem. The null hypothesis being tested is that none of the variables improves the classification based on chance alone. Equivalent null hypotheses are that the two population means for each variable are identical, or that the population D² is zero. The test statistic for the null hypothesis is
F = [(N_I + N_II - P - 1)/(P(N_I + N_II - 2))] × [N_I N_II/(N_I + N_II)] × D²

with degrees of freedom of P and N_I + N_II - P - 1 (Rao, 1973). The P value is the tail area to the right of the computed test statistic. We point out that N_I N_II D²/(N_I + N_II) is known as the two-sample Hotelling T², derived originally for testing the equality of two sets of means (Morrison, 1976).

For the depression example using age and income, the computed F value is

F = [(244 + 50 - 2 - 1)/(2(244 + 50 - 2))] × [(244 × 50)/(244 + 50)] × 0.319 = 6.60

with 2 and 291 degrees of freedom. The P value for this test is less than 0.005.
Thus these two variables together significantly improve the prediction based on chance alone. Equivalently, there is statistical evidence that the population means are not identical in both groups. It should be noted that most existing computer programs do not print the value of D². However, from the printed value of the above F statistic we can compute D² as follows:

D² = [P(N_I + N_II)(N_I + N_II - 2) / ((N_I N_II)(N_I + N_II - P - 1))] F
Another useful test is whether one additional variable improves the discrimination. Suppose that the population D² based on X1, X2, ..., XP variables is denoted by pop D²_P. We wish to test whether an additional variable X_(P+1) will significantly increase the pop D², i.e. we test the hypothesis that pop D²_(P+1) = pop D²_P. The test statistic under the multivariate normal model is also an F statistic and is given by

F = [(N_I + N_II - P - 2)(N_I N_II)(D²_(P+1) - D²_P)] / [(N_I + N_II)(N_I + N_II - 2) + N_I N_II D²_P]

with 1 and (N_I + N_II - P - 2) degrees of freedom (Rao, 1965).

For example, in the depression data we wish to test the hypothesis that age improves the discriminant function when combined with income. The D² for income alone is 0.183, and the D² for income and age is 0.319. Thus with P = 1,

F = [(50 + 244 - 1 - 2)(50 × 244)(0.319 - 0.183)] / [(294)(292) + 50 × 244 × 0.183] = 5.48

with 1 and 291 degrees of freedom. The P value for this test is equal to 0.02. Thus age significantly improves the classification when combined with income.

A generalization of this last test allows for the checking of the contribution of several additional variables simultaneously. Specifically, if we start with X1, X2, ..., XP variables, we can test whether X_(P+1), ..., X_(P+Q) variables improve the prediction. We test the hypothesis that pop D²_(P+Q) = pop D²_P. For the multivariate normal model the test statistic is
F = [(N_I + N_II - P - Q - 1)/Q] × [N_I N_II (D²_(P+Q) - D²_P)] / [(N_I + N_II)(N_I + N_II - 2) + N_I N_II D²_P]
with Q and (N_I + N_II - P - Q - 1) degrees of freedom.

The last two formulas for F are useful in variable selection, as will be shown in the next section.

11.10 VARIABLE SELECTION

Recall that there is an analogy between regression analysis and discriminant function analysis. Therefore much of the discussion of variable selection given in Chapter 8 applies to selecting variables for classification into two groups. In fact, the computer programs discussed in Chapter 8 may be used here as well. These include, in particular, stepwise regression programs and subset regression programs. In addition, some computer programs are available for performing stepwise discriminant analysis. They employ the same concepts discussed in connection with stepwise regression analysis.
In discriminant function analysis, instead of testing whether the value of multiple R² is altered by adding (or deleting) a variable, we test whether the value of pop D² is altered by adding or deleting variables. The F statistic given in section 11.9 is used for this purpose. As before, the user may specify a value for the F-to-enter and F-to-remove values. For F-to-enter, Costanza and Afifi (1979) recommend using a value corresponding to a P of 0.15. No recommended value from research can be given for the F-to-remove value, but a reasonable choice may be a P of 0.30.
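The F-to-enter computation from section 11.9 is easy to reproduce by hand or in code. The sketch below (Python; illustrative names, multivariate normal model assumed) tests whether adding one variable increases pop D², using the depression values D² = 0.183 for income alone and D² = 0.319 for income and age.

def f_to_enter(d2_small, d2_big, n1, n2, p, q=1):
    # F for adding q variables to a p-variable discriminant function;
    # degrees of freedom are q and n1 + n2 - p - q - 1.
    num = (n1 + n2 - p - q - 1) / q * n1 * n2 * (d2_big - d2_small)
    den = (n1 + n2) * (n1 + n2 - 2) + n1 * n2 * d2_small
    return num / den

# Does age add to income?  One variable (P = 1) already in, Q = 1 added.
print(f_to_enter(0.183, 0.319, n1=50, n2=244, p=1))   # about 5.48, with 1 and 291 d.f.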
11.11 CLASSIFICATION INTO MORE THAN TWO GROUPS

A comprehensive discussion of classification into more than two groups is beyond the scope of this book. However, several books do include detailed discussion of this subject, including McLachlan (1992), Tatsuoka (1988), Lachenbruch (1975), Morrison (1976) and Afifi and Azen (1979). In this section we summarize the classification procedure for a multivariate normal model and discuss an example.

We now assume that an individual is to be classified into one of k populations, for k ≥ 2, on the basis of the values of P variables X1, X2, ..., XP. We assume that in each of the k populations the P variables have a multivariate normal distribution, with the same covariance matrix. A typical packaged computer program will compute, from a sample from each population, a classification function. In applications, the variable values from a given individual are substituted in each classification function, and the individual is classified into the population corresponding to the highest classification function.

Now we will consider an example. In the depression data set values of CESD can vary from 0 to 60 (Chapter 3). In the previous runs, persons were classified as depressed if their CESD scores were greater than or equal to 16; otherwise, they were classified as not depressed. Another possibility is to divide the individuals into k = 3 groups: those who deny any symptoms of depression (CESD score = 0), those who have CESD scores between 1 and 15 inclusive and those who have scores of 16 or greater. Arbitrarily, we call these three groups lowdep (1), meddep (2) and highdep (3).

The variables considered for entry in a stepwise fashion are sex, age, education, income, health, bed days and chronic illness. The method of entering variables used in this program is similar to that described for forward stepwise regression in Chapter 8.

Partial results for this example are given in Table 11.5. Note that not all the variables entered the discriminant function.

The approximate F statistic given in Table 11.5 tests the null hypothesis that the means of the three groups are equal for all variables simultaneously.
Table 11.5 Partial printout from BMDP 7M for classification into more than two groups, using the depression data with k = 3 groups

APPROXIMATE F-STATISTIC 4.347   DEGREES OF FREEDOM 12.00  574.50

F-MATRIX          DEGREES OF FREEDOM = 6  286
              lowdep    meddep
   meddep       2.36
   highdep      7.12      5.58

CLASSIFICATION FUNCTIONS
GROUP                 lowdep        meddep       highdep
VARIABLE
 2 sex               7.07529       7.49617       8.14165
 3 age               0.16774       0.13935       0.11698
 5 educat            2.54993       2.82551       2.68116
 7 income            0.10533       0.09005       0.06537
32 health            2.13954       2.75024       3.10425
35 beddays          -0.97394      -0.80246       0.46685
   CONSTANT        -17.62107     -18.54811     -18.81630

COEFFICIENTS FOR CANONICAL VARIABLES
VARIABLE
 2 sex              -0.73103      -0.01977
 3 age               0.03167       0.02531
 5 educat           -0.00617      -0.65481
 7 income            0.02758      -0.00067
32 health           -0.57524      -0.68822
35 beddays          -1.13644       1.13117
   CONSTANT          0.49631       2.17769
From standard tables we observe that the P value is very small, indicating that the null hypothesis should be rejected.

The F matrix part of Table 11.5 gives F statistics for testing the equality of means for each pair of groups. The F value (2.36) for the lowdep versus meddep groups is barely significant at the 5% level. The other F statistics indicate a significant difference between the highdep group and each of the other two. This result partially justifies our previous analysis of two groups, where highdep is the depressed group and lowdep plus meddep constitute the nondepressed group.

The classification functions are also shown in Table 11.5. To classify a new individual into one of the three groups, we first evaluate each of the three classification functions, using that individual's scores on all the variables that entered the function. Then the individual is assigned to the group for which the computed classification function is the highest (a sketch of this rule in code is given at the end of this section). If we are interested in the groups taken two at a time, the corresponding pair of classification functions could be subtracted from each other to produce a discriminant function, as explained in section 11.6.

In the example considered here we divided a discrete variable into three subsets to get the three groups. Often when we are working with nominal data, three or more separate groups exist; examples include classification by religion or by race of individuals. Thus the capability that discriminant function analysis has of handling more than two groups is a useful feature in some applications.
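To make the "largest classification function" rule concrete, here is a small sketch that evaluates the three classification functions of Table 11.5 for one person and picks the group with the largest value. The coefficients are copied from the table; the respondent's profile and the coding of the predictors are assumptions made purely for illustration.

# Coefficients from Table 11.5, in the order: sex, age, educat, income, health, beddays
functions = {
    "lowdep":  ([7.07529, 0.16774, 2.54993, 0.10533, 2.13954, -0.97394], -17.62107),
    "meddep":  ([7.49617, 0.13935, 2.82551, 0.09005, 2.75024, -0.80246], -18.54811),
    "highdep": ([8.14165, 0.11698, 2.68116, 0.06537, 3.10425,  0.46685], -18.81630),
}

def classify(profile):
    # Return the group whose classification function is largest for this profile
    scores = {group: sum(c * x for c, x in zip(coefs, profile)) + const
              for group, (coefs, const) in functions.items()}
    return max(scores, key=scores.get), scores

# Hypothetical respondent (values chosen only to show the mechanics;
# the coding of sex, education, income, health and bed days follows the data set)
group, scores = classify([2, 35, 3, 15, 2, 0])
print(group, scores)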
11.12 USE OF CANONICAL CORRELATION IN DISCRIMINANT FUNCTION ANALYSIS

Recall that there is an analogy between regression and discriminant analysis for two groups. It is unfortunate that this analogy does not extend to the case of k > 2. There is, however, a correspondence between canonical correlation and classification into several populations.

We begin by defining a set of new variables called Y1, Y2, ..., Y(k-1). These are dummy or indicator variables that show which group each member of the sample came from. Note that as discussed in Chapter 9, we need k - 1 dummy variables to describe k groups. For example, suppose that there are k = 4 groups. Then the dummy variables Y1, Y2 and Y3 are formed as follows:

Group    Y1    Y2    Y3
  1       1     0     0
  2       0     1     0
  3       0     0     1
  4       0     0     0

Thus an individual coming from group 1 would be assigned a value of 1 on Y1, 0 on Y2, and 0 on Y3, etc. If we make Q = k - 1, then we have a sample with two sets of variables, Y1, ..., YQ and X1, X2, ..., XP. We now perform a canonical correlation analysis of these variables. This analysis will result in a set of Ui variables and a set of Vi variables. As explained in Chapter 10, the number of these pairs of variables is the smaller of P and Q. Thus this number is the smaller of P and k - 1.

The variable V1 is the linear combination of the X variables with the maximum correlation with U1. In this sense, V1 maximizes the correlation with the dummy variables representing the groups, and therefore it exhibits the maximum difference among the groups. Similarly, V2 exhibits this maximum difference with the condition that V2 is uncorrelated with V1, etc.
Once derived, the variables Vi should be examined as discussed in Chapter 10 in an effort to give a meaningful interpretation. Such an interpretation can be based on the magnitude of the standardized coefficients given in the outputs. The variables with the largest standardized coefficients can help give 'names' to the canonical discriminant functions. To further assist the user in this interpretation, the SPSS DISCRIMINANT procedure offers the option of rotating the Vi variables by the varimax method (explained in Chapter 15, for factor analysis).
Use of canonical correlation
271
~
....ACJ .....\1l
3.75
~
;7 ~
1.)
·~
8\1l
3.00
1
0
m
2.25
mh ml m h
1.50
0.75
0.00
-0.75
m
m
m
m
ml * l • m hm h 1 ml lhm rn mm m m m h m * l mm l rn m*lm l h m ml m h mh h * m l'ln h hh m m m h *mmm m m m m m*m ~ m h h m m h ~ mm mh ml rn f7\hmrrm m m m h m h mm m \V m rrrn m ~rrimmnJITI m m m ~n m h h hrnrmJm m l m m mm m l m *m *mhm mm m l l m m lmri1*hm m * mm *urn m mm In mm m hl m mrrun m m h m m m m m nun mlmm m m hm m m m m m hhh mm m rn mmm m mrn m m nun
m
m rn* hl
m
1
*
m
m
1
m
h
m
-2.25
m m
-3.oo
m
l h
h m
-1.6
-0.8
0.0
0.8
1.6
2.4
Canonical Variable 1
f.'· '&ure 116
grouPs. • 'Plot of the Canonical Variables for the Depression Data Set with k
=
3
The resulting pair of histograms for the two groups should be examined, rather than the scatter diagram. In other words, there is no advantage to using the more general canonical procedure. The advantage of the canonical procedure is that it aids in interpreting the results when there are three or more groups.

When the X variables are highly intercorrelated, it is sometimes difficult to interpret even the standardized canonical coefficients since the effect of two correlated variables may be shared by both variables (Dillon and Goldstein, 1984). In this case the canonical loadings, which give the simple correlation between each variable and the computed score for that canonical variable, should be examined. Variables with a high correlation make a large contribution to the canonical variable. These canonical loadings are called 'pooled within canonical structure' in SAS, 'canonical loadings' in SYSTAT, 'pooled within-groups correlations' in SPSS and 'factor structure' in STATISTICA.

Various options and extensions for classification are offered in standard packaged programs, some of which are presented in the next section. In particular, the stepwise procedure of variable selection is available with most standard programs. However, unless the investigator is familiar with the complexities of a given option, we recommend the use of the default options. For most standard packaged programs, the variable to be entered is selected to maximize a test statistic called Wilks' lambda. The one exception to this procedure is that we recommend modifying the values of the F-to-enter and F-to-remove as discussed in the previous section.

When the number of groups k is greater than two, the investigator may wish to examine the discrimination between the groups taken two at a time. This examination will serve to highlight specific differences between any two groups, and the results may be easier to interpret. Another possibility is to contrast each group with the remaining groups taken together.

11.13 DISCUSSION OF COMPUTER PROGRAMS

The computer output for five of the six packages in this book is summarized in Table 11.6. STATA does not have a discriminant analysis program so the regression option discussed in section 11.6 would be used to perform discriminant function analysis in STATA. In that case, once the regression coefficients are obtained then histograms of the predicted values can be obtained for each of the two groups. The user can assess how successful the discriminant analysis was by noting the amount of overlap between the two histograms and how far apart the mean values are.

The D² given in Table 11.6 is a measure of the distance between each point representing an individual and the point representing the estimated population mean. A small value of D² indicates that the individual is like a typical individual in that population and hence probably belongs to that population.
Table 11.6 Summary of computer output for discriminant function analysis

Output listed (rows of the table): pooled covariances and correlations; covariances and correlations by group; classification function; D²; F statistics; Wilks' lambda; stepwise options; classification tables; jackknife classification table; cross-validation with subsample; canonical coefficients; standardized canonical coefficients; canonical loadings; canonical plots; prior probabilities; posterior probabilities; quadratic discriminant analysis; nonparametric discriminant analysis.

Programs providing this output: BMDP (7M and/or 5M, depending on the item); SAS (STEPDISC, DISCRIM and CANDISC, with 'All'(a) indicating all three); SPSS (DISCRIMINANT); STATISTICA (Discriminant); SYSTAT (MGLH).

(a) All means STEPDISC, DISCRIM and CANDISC.
(b) Covariance matrices by groups.
BMDP has two discriminant function programs. The one used so far in this chapter is called 7M, which we ran using the default option of equal prior probabilities. 7M performs stepwise discriminant function analysis for two or more groups. It has numerous options, as can be seen in Table 11.6. The current version has added features which assist in determining if each variable has outliers or follows a normal distribution. The second discriminant function program, 5M, performs linear and quadratic discriminant function analysis. It does not compute a stepwise discriminant analysis or the jackknife validation data but it does allow the user to compute successive models in an interactive fashion.

The discriminant function analyses discussed in this chapter have been restricted to 'linear' discriminant analyses analogous to linear regression. Quadratic discriminant function analysis uses the separate group covariance matrices instead of the pooled covariance matrix for computing the various output. For further discussion of quadratic discriminant function analysis see Lachenbruch (1975). Quadratic discriminant analysis is theoretically appropriate when the data are multivariately normally distributed but the covariance matrices are quite unequal. In the depression example that was given in this chapter, the null hypothesis of equal covariance matrices was rejected (section 11.5). The variable 'age' had a similar variance in the two groups but the variance of income was higher in the nondepressed group. Also, the covariance between age and income was a small positive number for the depressed respondents but negative and larger in magnitude for the nondepressed (older people and smaller income). We did not test if the data were multivariate normal. When we tried quadratic discrimination on this data set the percentage correctly classified went down somewhat, not up.

We also tried making a transformation on income to see if that would improve the percentage correctly classified. Using the suggested transformation given in section 4.3, we created a new variable equal to log(income + 2) and used that with age. The variances now were similar but the covariances were still of opposite sign as before. The linear discriminant function came out with the same percentage classified correctly as given in section 11.4 without the transformation although slightly more respondents were classified as normal than before the transformation.

Marks and Dunn (1974) showed that the decision to use a quadratic discriminant function depended not only on the ratio of the variances but also on the sample size, separation between the groups and the number of variables. They did not consider cases with equal variances but unequal covariances (which we had after transformation). With small sample sizes and variances not too different, linear was preferred to quadratic discrimination. When the ratio of the variances was two or less, little difference was seen and even ratios up to four resulted in not too much difference under
most conditions. Multivariate normality was assumed in all their simulations.

SAS has three discriminant function procedures, DISCRIM, CANDISC and STEPDISC. DISCRIM is a comprehensive procedure that does linear, quadratic and nonparametric (not discussed here) discriminant analysis. STEPDISC performs stepwise discriminant function analysis and CANDISC performs canonical discriminant function analysis.

The SPSS DISCRIMINANT program is a general purpose program that performs stepwise discriminant analysis and provides clearly labeled output. It also provides a test for equality of covariance matrices.

The STATISTICA discriminant program is quite complete and is quick to run. With STATISTICA and using the equal prior option, we get the same output as with the other packages with the exception of the constants of the classification function. They came out somewhat different and the resulting classification table was also different.

SYSTAT uses the MGLH program so you do get some output that does not relate directly to discriminant function analysis, but the discriminant output is clearly labeled. SYSTAT is planning a new discriminant analysis program called Discrim that will include both linear and quadratic functions. The variables for the linear function can be entered using forward or backward stepwise selection. The output for this program appears to be very complete.
11.14 WHAT TO WATCH OUT FOR
Partly because there is more than one group, discriminant function analysis has additional aspects to watch out for beyond those discussed in regression analysis. A thorough discussion of possible problems is given in Lachenbruch (1977). A list of important trouble areas is as follows:

1. Theoretically, a simple random sample from each population is assumed. As this is often not feasible, the sample taken should be examined for possible biasing factors.
2. It is critical that the correct group identification be made. For example, in the example in this chapter it was assumed that all persons had a correct score on the CESD scale so that they could be correctly identified as normal or depressed. Had we identified some individuals incorrectly, this would increase the reported error rates given in the computer output. It is particularly troublesome if one group contains more misidentifications than another.
3. The choice of variables is also important. Analogies can be made here to regression analysis. Similar to regression analysis, it is important to remove outliers, make necessary transformations on the variables, and check
4.
5.
Discriminant analysis
independence of the cases. It is also a matter of concern if th considerably more missing values in one group than another. ere are Multivariate normality is assumed in discriminant function analysi computing posterior probabilities or performing statistical tests. ;h::en of dummy variables ha~ bee~ s~ow~ not to cause un_due problems, bse very skewed or long-tailed dtstnbutwns for some vanables can increaUt the total error rate. se Another assumption is equal covariance matrices in the groups. If covariance matrix is very different from the other, then quadratic discriminone function analy~is should be considered for two ~roups (Lachenbruch, 1 ~~~ or transformatiOns should be made on the vanables that have the greate t difference in their variances. s If the sample size is sufficient, researchers sometimes obtain a discriminant function from one-half or two-thirds of their data and apply it to the remaining data to see if the same proportion of cases are classified correctly in the two subsamples. Often when the results are applied to a different sample, the proportion classified correctly is smaller. If the discriminant function is to be used to classify individuals, it is-important that the original sample come from a population that is similar to the one it will be applied to in the future. If some of the variables are dichotomous and one of the outcomes rarely occurs, then logistic regression analysis (discussed in the next chapter) should be considered. Logistic regression does not require the assumption of multivariate normality or equal covariances for inferences and should be considered as an alternative to discriminant analysis when these assumptions are not met or approximated. As in regression analysis, significance levels ofF should not be taken .as valid if variable selection methods such as forward or backward stepwise methods are used. Also, caution should be taken when using small samples w_ith numer~us v~riables (Renc~er and Larson, 1~80). If a nominale~~ dtscrete vanable ts transformed mto a dummy vanable, we recomm that only two binary outcomes per variable be considered.
9
6.
7.
8.
SUMMARY . . . . . . . h ·que dating In thts chapter we dtscussed dtscnmtnant function analysts, a tee !11 d tioO back to at least 1936. However, its popularity began with the mtro wu~s to . the 1960s. T he met·h o d' s ongma. . . I concern _ sed for of large-scale computers m classify an individual into one of several populations. It IS ~Is~ u ariable explanatory purposes to identify the relative contributions of a smg e v or a group of variables to the classification. ulatioOS· In this chapter our main emphasis was on the case of two pop f the use 0 We gave some theoretical background and presented an example
Further reading
277
kaged program for this situation. We also discussed the case of more of a pac two groups. . . . tball sted readers may pursue the subJect of classtficatton further by Intere . . . Iting the references. In particular, Lachenbruch (1975), McLachhn consu) and James (1985) present a comprehensive discussion of the subject. (19 92. ..
REFERENCES References preceded by an asterisk require strong mathematical background. Afifi, A.A. and Azen ~.P. (1979). Statistical Analysis: A Computer Oriented Approach, 2nd edn. Academic Press, New Yor~. . . . . . . *Anderson, T.W. (1984). An Introductton to Multivanate Statistical Analysis, 2nd edn. Wiley, New York. Costanza, M.C. and Afifi, A.A. (1979). Comparison of stopping rules for forward stepwise discriminant analysis. Journal of the American Statistical Association, 74, 777:_85. Dillon, W.R. and Goldstein, M. (1984). Multivariate Analysis: Methods and Applications. Wiley, New York. Dixon, W.J. and Massey, F.J. (1983). Introduction to Statistical Analysis, 4th edn. McGraw-Hill, New York. Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals Dj Eugenics, 7, 179-88. James, M. (1985). Classification Algorithms. Wiley, New York. Klecka, W.R. (1980). Discriminant Analysis. Sage, Beverly Hills. Lachenbruch, P.A. (1975). Discriminant Analysis. Hafner Press, New York. Lachenbruch, P.A. (1977). Some misuses of discriminant analysis. Methods of .. Information in Medicine, 16, 255-8. . *Marks, S. and Dunn, O.J. (1974). Discriminant functions when covariance matrices * are un~qual. Journal of the American Statistical Association, 346, 555-9. MWc~achhn, G.J. (1992). Discriminant Analysis and Statistical Pattern Recognition. dey, New Yotk. *Myorrikson, D.F. (1976). Multivariate Statistical Methods, 2nd edn. McGraw-Hill, New or .
Ra~~CD~ (1965). Cova~ance adjustm~nt ~nd related p.roblems.in ~ultivariate ~nalysis, Sym~osmm on Multivanate 87 _ 10~ton · Academic Press,. New Yotk.
Analysis, Multzvarzate Analyszs 1, pp.
*Rao
"'Ren~~·~·11973).
Linear Inference and its .APP_licati?n, 2nd. edn. Wi~ey, ~e~ ': ork. analy ~ .C. and Larson, S.F. (1980). Bias m Wilks' A m stepwise discnmmant *Tat Sis. Technometrics 22 349-56 · SUok . ' ' ' Psychof ~.M. (1988). Multivariate Analysis: Techniques for Educational and Truett, J., ~~cal Research, 2nd edn. Wiley, New Yor~. . . . corona hrnfield~ J. an~ Kannell, W. (1967). Multivanate analysis of the nsk of ry eart disease m Framingham. Journal of Chronic Diseases, 20, 511-24.
FDRl' . l-IER READING lianct, D.J
· (19 81). Discrimination and Classification. Wiley, New York.
PROBLEMS

11.1 Using the depression data set, perform a stepwise discriminant function analysis with age, sex, log(income), bed days and health as possible variables. Compare the results with those given in section 11.13.
11.2 For the data shown in Table 8.1, divide the chemical companies into two groups: group I consists of those companies with a P/E less than 9, and group II consists of those companies with a P/E greater than or equal to 9. Group I should be considered mature or troubled firms, and group II should be considered growth firms. Perform a discriminant function analysis, using ROR5, D/E, SALESGR5, EPS5, NPM1 and PAYOUTR1. Assume equal prior probabilities and costs of misclassification. Test the hypothesis that the population D² = 0. Produce a graph of the posterior probability of belonging to group I versus the value of the discriminant function. Estimate the probabilities of misclassification by several methods.
11.3 (Continuation of Problem 11.2.) Test whether D/E alone does as good a classification job as all six variables.
11.4 (Continuation of Problem 11.2.) Choose a different set of prior probabilities and costs of misclassification that seem reasonable to you and repeat the analysis.
11.5 (Continuation of Problem 11.2.) Perform a variable selection analysis, using stepwise and best-subset programs. Compare the results with those of the variable selection analysis given in Chapter 8.
11.6 (Continuation of Problem 11.2.) Now divide the companies into three groups: group I consists of those companies with a P/E of 7 or less, group II consists of those companies with a P/E of 8 to 10, and group III consists of those companies with a P/E greater than or equal to 11. Perform a stepwise discriminant function analysis, using these three groups and the same variables as in Problem 11.2. Comment.
11.7 In this problem you will modify the data set created in Problem 7.7 to make it suitable for the theoretical exercises in discriminant analysis. Generate the sample data for X1, X2, ..., X9 as in Problem 7.7 (Y is not used here). Then for the first 50 cases, add 6 to X1, add 3 to X2, add 5 to X3, and leave the values for X4 to X9 as they are. For the last 50 cases, leave all the data as they are. Thus the first 50 cases represent a random sample from a multivariate normal population (called population I) with the following means: 6 for X1, 3 for X2, 5 for X3, and zero for X4 to X9. The last 50 observations represent a random sample from a multivariate normal population (called population II) whose mean is zero for each variable. The population Mahalanobis D² for each variable separately are as follows: 1.44 for X1, 1 for X2, 0.5 for X3, and zero for each of X4 to X9. It can be shown that the population D² is as follows: 3.44 for X1 to X9; 3.44 for X1, X2 and X3; and zero for X4 to X9. For all nine variables the population discriminant function has the following coefficients: 0.49 for X1, 0.5833 for X2, -0.25 for X3, and zero for each of X4 to X9. The population errors of misclassification are

Prob(I given II) = Prob(II given I) = 0.177

Now perform a discriminant function analysis on the data you constructed,
using all nine variables. Compare the results of the sample with what you know about the populations.
11.8 (Continuation of Problem 11.7.) Perform a similar analysis, using only X1, X2 and X3. Test the hypothesis that these three variables do as well as all nine in classifying the observations. Comment.
11.9 (Continuation of Problem 11.7.) Do a variable selection analysis for all nine variables. Comment.
11.10 (Continuation of Problem 11.7.) Do a variable selection analysis, using variables X4 to X9 only. Comment.
11.11 From the family lung function data in Appendix A create a data set containing only those families from Burbank and Long Beach (AREA = 1 or 3). The observations now belong to one of two AREA-defined groups. (a) Assuming equal prior probabilities and costs, perform a discriminant function analysis for the fathers using FEV1 and FVC. (b) Now choose prior probabilities based on the population of each of these two California cities (at the time the study was conducted, the population of Burbank was 84 625 and that of Long Beach was 361 334), and assume the cost of Long Beach given Burbank is twice that of Burbank given Long Beach. Repeat the analysis of part (a) with these assumptions. Compare and comment.
11.12 At the time the study was conducted, the population of Lancaster was 48 027 while Glendora had 38 654 residents. Using prior probabilities based on these population figures and those given in Problem 11.11, and the entire lung function data set (with four AREA-defined groups), perform a discriminant function analysis on the fathers. Use FEV1 and FVC as classifying variables. Do you think the lung function measurements are useful in distinguishing among the four areas?
11.13 (a) In the family lung function data in Appendix A divide the fathers into two groups: group I with FEV1 less than or equal to 4.09, and group II with FEV1 greater than 4.09. Assuming equal prior probabilities and costs, perform a stepwise discriminant function analysis using the variables height, weight, age and FVC. What are the probabilities of misclassification? (b) Use the classification function you found in part (a) to classify first the mothers and then the oldest children. What are the probabilities of misclassification? Why is the assumption of equal prior probabilities not realistic?
11.14 Divide the oldest children in the family lung function data set into two groups based on weight: less than or equal to 101 versus greater than 101. Perform a stepwise discriminant function analysis using the variables OCHEIGHT, OCAGE, MHEIGHT, MWEIGHT, FHEIGHT and FWEIGHT. Now temporarily remove from the data set all observations with OCWEIGHT between 87 and 115 inclusive. Repeat the analysis and compare the results.
11.15 Is it possible to distinguish between men and women in the depression data set on the basis of income and level of depression? What is the classification function? What are your prior probabilities? Test whether the following variables help discriminate: EDUCAT, EMPLOY, HEALTH.
11.16 Find the table of ideal weights given in Problem 9.7 and calculate the midpoint of each weight range for men and women. Pretending these represent
I
I
.•
a real sample, perform a discriminant function analysis to classify as male or female on the basis of height and weight. How could frame size be used in the analysis? What happens if you do?
12 Logistic regression

12.1 USING LOGISTIC REGRESSION TO ANALYZE A DICHOTOMOUS OUTCOME VARIABLE
In Chapter 11 we presented a method originally developed to classify individuals into one of two possible populations using continuous predictor variables. From this chapter, you will learn another classification method which applies to discrete or continuous variables. Sections 12.2 and 12.3 present examples of the use of logistic regression analysis and illustrate the appearance of the logistic function. In section 12.4, a definition of odds is given along with the basic model assumed in logistic regression. The advantages of the use of maximum likelihood estimates computed in logistic regression are also given there. Sections 12.5 and 12.6 discuss the interpretation of categorical and continuous predictor variables, respectively. Methods of variable selection, checking the fit of the model and evaluating how well logistic regression predicts outcomes are given in section 12.7. Section 12.8 discusses the use of logistic regression in cross-sectional, case-control and matched samples. A summary of the output of the six statistical packages is given in section 12.9 and what to watch out for when interpreting logistic regression is discussed in section 12.10.

12.2 WHEN IS LOGISTIC REGRESSION USED?
Logistic regression can be used whenever an individual is to be classified into one of two populations. Thus, it is an alternative to the discriminant analysis presented in Chapter 11. When there are more than two groups, a method called polychotomous or generalized logistic regression analysis can be used, but that technique will not be discussed in this book (see BMDP PR or SAS LOGISTIC or STATA mlogit for programs and Demaris, 1992, for an example).

In the past most of the applications of logistic regression were in the medical field. It has been used, for example, to calculate the risk of developing heart disease as a function of certain personal and behavioral characteristics such as age, weight, blood pressure, cholesterol level and smoking history. Similarly, logistic regression can be used to decide which characteristics are predictive of teenage pregnancy. Different variables such as grade point
average in elementary school, religion or family size could be used to predict whether or not a teenager will become pregnant. In industry, operational units could be classified as successful or not according to some objective criteria. Then several characteristics of the units could be measured and logistic regression could be used to determine which characteristics best predict success.

The linear discriminant function gives rise to the logistic posterior probability when the multivariate normal model is assumed (see discussion of posterior probabilities in section 11.6). Logistic regression represents an alternative method of classification when the multivariate normal model is not justified. As we discuss in this chapter, logistic regression analysis is applicable for any combination of discrete and continuous variables. (If the multivariate normal model with equal covariance matrices is applicable, the methods discussed in Chapter 11 would result in a better classification procedure and require less computer time to analyze.) Aldrich and Nelson (1984) provide graphical and analytical comparisons of these and other methods.

Logistic regression analysis requires knowledge of both the dependent (or outcome) variable and the independent (or predictor) variables in the sample being analyzed. The results can be used in future classification when only the predictor variables are known.

12.3 DATA EXAMPLE

The same depression data set described in Chapter 11 will be used in this chapter. However, the first group will consist of people who are depressed rather than not depressed. Based on the discussion given in section 11.6, this redefinition of the groups implies that the discriminant function based on age and income is
Z = -0.0209(age) - 0.0336(income)
with a dividing point C = -1.515. Assuming equal prior probabilities, the posterior probability of being depressed is
Prob(depressed) = 1 / (1 + exp[-1.515 + 0.0209(age) + 0.0336(income)])
For a given individual with a discriminant function value of Z, we can write this posterior probability as

P_Z = 1 / (1 + e^(C - Z))

As a function of Z, the probability P_Z has the logistic form shown in Figure 12.1. Note that P_Z is always positive; in fact, it must be between 0 and 1
Figure 12.1 Logistic Function for the Depression Data Set.
because it is a probability. The minimum age is 18 years, and the minimum income is $2 × 10³. These minimums result in a Z value of -0.443 and a probability P_Z = 0.745, as plotted. When Z = -1.515, the dividing point C, then P_Z = 0.5. Larger values of Z occur when age is younger and/or income is lower. For an older person with a higher income, the probability of being depressed is low.

The same data and other data from this depression set will be used later in the chapter to illustrate a different method of computing the probability of being depressed.
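The posterior probability above is easy to evaluate for any age and income. The sketch below (Python; the function name is ours and purely illustrative) uses the discriminant coefficients and dividing point quoted earlier and reproduces the P_Z = 0.745 value plotted for the minimum age and income.

import math

def prob_depressed(age, income, c=-1.515):
    # Posterior probability 1 / (1 + exp(C - Z)) with Z = -0.0209*age - 0.0336*income
    z = -0.0209 * age - 0.0336 * income
    return 1.0 / (1.0 + math.exp(c - z))

print(prob_depressed(18, 2))    # about 0.745, the minimum-age, minimum-income case
print(prob_depressed(60, 30))   # an older, higher-income person gives a smaller probability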
12.4 BASIC CONCEPTS OF LOGISTIC REGRESSION

The logistic function

P_Z = 1 / (1 + e^(C - Z))
may be transformed to produce a new interpretation. Specifically, we define the odds as the following ratio:

odds = P_Z / (1 - P_Z)
Computing the odds is a commonly used technique ofinterpretingprobabil't' (Fleiss, 1981). F~r example, in sp~rts we may s~y that the odds are 3 ;~e~ that one team wtll defeat another m a game. Thts statement means that th · favored team has a probability of 3/(3 + 1) of winning, since 0.7 ~ (1 - 0.75) = 3/1. Note that as the value of Pz varies from 0 to 1, the odds vary from o to oo. When Pz = 0.5, the odds are 1. On the odds scale the values from o to 1 correspond to values of Pz from 0 to 0.5. On the other hand, values of p from 0.5 to 1.0 result in odds of 1 to oo. Taking the logarithm of the odds will cure this asymmetry. When Pz = 0, ln(odds) = -oo; when Pz = 0.5, ln(odds) = 0.0; and when Pz = 1.0, ln(odds) = + oo. The term logit is sometimes used instead of In (odds). By taking the natural logarithm of the odds and performing some algebraic manipulation, we obtain
5
In words, the logarithm of the odds is linear in the discriminant function Z. Since Z = a 1 X 1 + a 2 X 2 + · · · + apX P (from Chapter 11), ln(odds) is seen to be linear in the original variables. If we rewrite - C + Z as a+ b1X 1 + b2 X 2 + ··· + bpX p, the equation regarding ln(odds) to the discriminant function is
. same form as the mu1tip . 1e 1'mear regress.ion equation This equation is in the son the (Chapter 7), where a = - C and b1 = ai for i = 1 to P. For thts rea. and logistic function has been called the multiple logistic regressi~n equar;;;~~nts. 1 the coefficients in the equation can be interpreted as regres.st~n coe ( dds) 1 The fundamental assumption in logistic regression analysts ts that n :ade is linearly related to the independent variables. No assumptions ~re major regarding the distributions of the X variables. In fact,. one of t ~·10 000 s. advantages of this method is that the X variables may be discrete or con The model assumed is ln(odds) =ex+
/31 X 1 + /3 2 X 2 + · .. + f3pXP
Interpretation: categorical variables
285
,....,s of the probability of belonging to population I, the equation can
Jn tell"
.
be written as
probabilit~ of belonging = 1 to population I 1 + exp[ -(a+ /3 1 X 1 + {3 2 X 2 + · .. + {3pXp)J
:s
Th. equation is called the logistic regression equation. mentioned earl~er, the technique of linear discriminant analysis can be d to compute estimates of the parameters a, {3 1 , {3 2 , ••• , f3p· However, the usethod of maximum likelihood produces estimates that depend only on the ~~stic model. The.maxim~m ~ik~lihood esti~ates s?ould, therefore, be.more obust than the hnear discnmmant functiOn estimates. However, If the ~istribution of the X variables is, in fact, multivariate normal, then the discriminant analysis method requires a smaller sample size to achieve the same precision as the maxim urn likelihood method (Efron, 197 5). The maximum likelihood estimates of the probabilities of belonging to one population or the other are preferred to the discriminant function estimates when the X's are nonnormal (Halperin, Blackwelder and Verter, 1971; Press and Wilson, 1978). The estimates of the coefficients or the probabilities derived from the two methods will rarely be substantially different from each other, whether or not the multivariate normality assumption is satisfied. An exception to this statement is given in O'Hara et al. (1982); they demonstrate that the use of discriminant function analysis may lead to underestimation of the coefficients when all the X's are categorical and the probability of the outcome is small. Most logistic regression programs use the method of maximum likelihood to comp11te estimates of the parameters. The procedure is iterative and a theoretical discussion of it is beyond the scope of this book. When the number 0 ~ va:i~?les is large (say more than 12), the computer time required may be P ohibittvely long. Some programs such as BMDP LR allow the investigator to use a relatively fast approximation to the maximum likelihood method and SAS LOGISTIC uses weighted-least-squares estimation. 12 5 · INTERPRETATION: CATEGORICAL VARIABLES
~~eo~fthe m~st important benefits of the logistic model is that it allows the
the rno~:tegonc~l X vari~bles .. An.y num.ber o~ such variable~ can be us~d in With t 1. Th.e simplest situatiOn Is one m which we have a smgle X vanable to Pre~? possibl~ values. For example, for the depression data we can attempt Sho.ws t~t ~h~ ~s depressed on the basis of the individual's sex. Table 12.1 If th ~ tndividuals classified by depression and sex. Similar~ Individual is female, then the odds of being depressed are 40/143. Y. for males the odds of being depressed are 10/101. The ratio of
286
Logistic regression Table 12.1 Classification of individuals by depression level and sex Depression Sex
Yes
No
Total
Female (1) Male (0)
40 10
143 101
183 111
Total
50
244
294
these odds is

    odds ratio = (40/143) / (10/101) = (40 × 101) / (10 × 143) = 2.825

The odds of a female being depressed are 2.825 times that of a male. Note that we could just as well compute the odds ratio of not being depressed. In this case we have

    odds ratio = (143/40) / (101/10) = 0.354
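As a brief illustration (not part of the packaged programs discussed in this book), the odds and the odds ratio from a 2 × 2 table such as Table 12.1 can be computed directly; the counts below are those from Table 12.1.

```python
# Odds and odds ratio for the depression-by-sex table (Table 12.1)
depressed = {"female": 40, "male": 10}
not_depressed = {"female": 143, "male": 101}

odds_female = depressed["female"] / not_depressed["female"]   # 40/143
odds_male = depressed["male"] / not_depressed["male"]         # 10/101

odds_ratio = odds_female / odds_male                          # 2.825
print(round(odds_female, 3), round(odds_male, 3), round(odds_ratio, 3))
```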
The concept of odds ratio is used extensively in biomedical applications (Fleiss, 1981). It is a measure of association of a binary variable (risk factor) with the occurrence of a given event (disease) (see Reynolds, 1977, for applications in behavioral science).

To represent a variable such as sex, we customarily use a dummy variable: X = 0 if male and X = 1 if female. The logistic regression equation can then be written as

    Prob(depressed) = 1 / (1 + e^-(α + βX))

The sample estimates of the parameters are

    a = estimate of α = -2.313
    b = estimate of β = 1.039

We note that the estimate of β is the natural logarithm of the odds ratio of females to males, or

    1.039 = ln 2.825
Equivalently,

    odds ratio = e^b = e^1.039 = 2.825

Also, the estimate of a is the natural logarithm of the odds for males (X = 0), or

    -2.313 = ln(10/101)

When there is only a single dichotomous variable, it is not worthwhile to perform a logistic regression analysis. However, in a multivariate logistic equation the value of the coefficient of a dichotomous variable can be related to the odds ratio in a manner similar to that outlined above. For example, for the depression data, if we include age, sex and income in the same logistic model, the estimated equation is

    Prob(depressed) = 1 / (1 + exp{-[-0.676 - 0.021(age) - 0.037(income) + 0.929(sex)]})
Since sex is a 0, 1 variable, its coefficient can be given an interesting interpretation. The quantity e^0.929 = 2.582 may be interpreted as the odds ratio of being depressed if female, after adjusting for the linear effects of age and income. It is important to note that such an interpretation is valid only when we do not include the interaction of the dichotomous variable with any of the other variables. Further discussion of this subject may be found in Breslow and Day (1980), Schlesselman (1982) and Hosmer and Lemeshow (1989). In particular, the case of several dummy variables is discussed in some detail in these books.

Approximate confidence intervals for the odds ratio for a binary variable can be computed using the slope coefficient b and its standard error listed in the output. For example, 95% approximate confidence limits for the odds ratio can be computed as exp(b ± 1.96 × standard error of b).

The odds ratio provides a directly understandable statistic for the relationship between the outcome variable and a specified predictor variable (given all the other predictor variables in the model). The odds ratio of 2.582 for sex can be directly interpreted as indicated previously. In contrast, the estimated β coefficient is not linearly related to the probability of the occurrence of P_z, since P_z = 1/(1 + e^(c-z)), a nonlinear equation (Demaris, 1990). Since the slope coefficient for the variable 'sex' is not linearly related to the probability of being depressed, it is difficult to interpret in an intuitive manner.
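A minimal sketch of this calculation follows. The coefficient for sex is the 0.929 reported above; the standard error used here is a hypothetical value standing in for the one that would be read from the program output, since it is not given in the text.

```python
import math

b = 0.929        # estimated coefficient for sex (from the fitted equation above)
se_b = 0.362     # hypothetical standard error; in practice read from program output

odds_ratio = math.exp(b)               # point estimate of the odds ratio, e^b
lower = math.exp(b - 1.96 * se_b)      # approximate 95% lower limit
upper = math.exp(b + 1.96 * se_b)      # approximate 95% upper limit
print(round(odds_ratio, 3), round(lower, 3), round(upper, 3))
```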
12.6 INTERPRETATION: CONTINUOUS AND MIXED VARIABLES

To show the similarity between the discriminant function analysis we performed in Chapter 11 and logistic regression analysis, we performed a logistic regression on age and income in the depression data set. The estimates obtained from the logistic regression analysis are as follows:

Term        Coefficient    Standard error
Age           -0.020          0.009
Income        -0.041          0.014
Constant       0.028          0.487
So the estimate of α is 0.028, of β1 is -0.020, and of β2 is -0.041. The equation for ln(odds), or logit, is estimated by 0.028 - 0.020(age) - 0.041(income). The coefficients -0.020 and -0.041 are interpreted in the same manner as they are interpreted in a multiple linear regression equation, where the dependent variable is ln[P1/(1 - P1)] and where P1 is the probability given by the logistic regression equation, estimated as

    Prob(depressed) = 1 / (1 + exp{-[0.028 - 0.020(age) - 0.041(income)]})
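A minimal sketch of evaluating this estimated equation is given below; the particular age and income values are hypothetical and are used only to show the arithmetic.

```python
import math

def prob_depressed(age, income):
    """Estimated probability of depression from the fitted equation above."""
    logit = 0.028 - 0.020 * age - 0.041 * income
    return 1 / (1 + math.exp(-logit))

# A hypothetical person aged 40 with income 20 (in the units of the depression data set)
print(round(prob_depressed(40, 20), 3))
```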
Recall that the coefficients for age and income in the discriminant function are -0.0209 and -0.0336, respectively. These estimates are within one standard error of -0.020 and -0.041, respectively, the estimates obtained from the logistic regression. The constant 0.028 corresponds to a dividing point of -0.028. This value is different from the dividing point of -1.515 obtained by the discriminant analysis. The explanation for this discrepancy is that the logistic regression program implicitly uses prior probability estimates obtained from the sample. In this example these prior probabilities are 244/294 and 50/294, respectively. When these prior probabilities are used, the discriminant function dividing point is -1.515 + ln(q_II/q_I) = -1.515 + 1.585 = 0.07. This value is closer to the value 0.028 obtained by the logistic regression program than is -1.515.

For a continuous variable X with slope coefficient b, the quantity exp(b) is interpreted as the ratio of the odds for a person with value (X + 1) relative to the odds for a person with value X. Therefore, exp(b) is the incremental odds ratio corresponding to an increase of one unit in the variable X, assuming that the values of all other X variables remain unchanged. The incremental odds ratio corresponding to a change of k units in X is exp(kb). For example, for
the depression data, the odds of depression can be written as

    P_z/(1 - P_z) = exp[0.028 - 0.020(age) - 0.041(income)]

For the variable age, a reasonable increment is k = 10 years. The incremental odds ratio corresponding to an increase of 10 years in age can be computed as exp(10b), or exp(10 × -0.020) = exp(-0.200) = 0.819. In other words, the odds of depression are estimated to be 0.819 times what they would be if a person were ten years younger. An odds ratio of one signifies no effect, but in this case increasing the age has the effect of lowering the odds of depression. If the computed value of exp(10b) had been two, then the odds of depression would have doubled for that incremental change in the variable. This statistic can be called the ten-year odds ratio for age or, more generally, the k incremental odds ratio for variable X (Hosmer and Lemeshow, 1989).

Ninety-five percent confidence intervals for the above statistic can be computed from exp[kb ± k(1.96)se(b)], where se stands for standard error. For example, for age

    exp(10 × -0.020 ± 10 × 1.96 × 0.009)   or   exp(-0.376 to -0.024)

or

    0.687 < odds ratio < 0.976

Note that for each coefficient, many of the programs will supply an asymptotic standard error of the coefficient. These standard errors can be used to test hypotheses and obtain confidence intervals. For example, to test the hypothesis that the coefficient for age is zero, we compute the test statistic

    Z = (-0.020 - 0)/0.009 = -2.22

For large samples an approximate P value can be obtained by comparing this Z statistic to percentiles of the standard normal distribution.
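The following sketch simply reproduces this arithmetic for the age coefficient; it is an illustration, not output from any of the packages discussed later in the chapter.

```python
import math

b, se_b, k = -0.020, 0.009, 10   # age coefficient, its standard error, a ten-year increment

incremental_or = math.exp(k * b)                  # ten-year odds ratio for age
lower = math.exp(k * b - k * 1.96 * se_b)         # lower 95% limit
upper = math.exp(k * b + k * 1.96 * se_b)         # upper 95% limit
z = (b - 0) / se_b                                # Wald test statistic for beta = 0

print(round(incremental_or, 3), round(lower, 3), round(upper, 3), round(z, 2))
```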
12.7 REFINING AND EVALUATING LOGISTIC REGRESSION ANALYSIS

Many of the same strategies used in linear regression analysis or discriminant function analysis to obtain a good model, or to decide if the computed model is useful in practice, also apply to logistic regression analysis. In this section
we describe typical steps taken in logistic regression analysis in the order that they are performed.
Variable selection

As with regression analysis, often the investigator has in mind a predetermined set of predictor variables to predict group membership. For example, in section 12.2 we discussed using logistic regression to predict future heart attacks in adults. In this case the data came from longitudinal studies of initially healthy adults who were followed over time to obtain periodic measurements. Some individuals eventually had heart attacks and some did not. The investigators had a set of variables which they thought would predict who would have a heart attack. These variables were used first in discriminant analyses and later in logistic regression models. Thus, the choice of the predictor variables was determined in advance.

In other examples of logistic regression applications, one of the variables indicates which of two treatments each individual received and other variables describe quantities or conditions that may be related to the success or failure of the treatments. The analysis is then performed using treatment (say experimental and control) and the other variables that affect outcome, such as age or gender, as predictor variables. Again, in this example the investigator is usually quite sure of which variables to include in the logistic regression equation. The investigator wishes to find out if the effect of the treatment is significant after adjusting for the other predictor variables.

But in exploratory situations, the investigator may not be certain about which variables to include in the analysis. Since it is not advisable for computational reasons to use large numbers of predictor variables, often the variables are screened one at a time to see which ones are associated with the dichotomous outcome variable. This screening can be done by using the common chi-square test of association described in section 17.5 for discrete predictor variables or a t test of equal group means for continuous predictor variables. A fairly large significance level is usually chosen for this screening so that useful predictors will not be missed (perhaps α = 0.15). Then, all variables with P values less than, say, P = 0.15 are kept for further analysis. Such a screening will often reduce the number of variables to ten or less. Note that the difficulties in using a large number of observations will depend on the statistical package used (dynamic memory allocation or not) and the size and speed of the memory of the PC.

After this initial screening, stepwise logistic regression analysis can be performed. The test for whether a single variable improves the prediction forms the basis for the stepwise logistic regression procedure. The principle is the same as that used in linear regression or stepwise discriminant function analysis. Forward and backward procedures, as well as different criteria for
deciding how many and which variables to select, are available. These tests are typically based on chi-square statistics. A large chi-square or a small P value indicates that the variable should be added in forward stepwise selection. In some programs this statistic is called the 'improvement chi-square'. As in regression analysis, any levels of significance should be viewed with caution. The test should be viewed as a screening device, not a test of significance.

Several simulation studies have shown that logistic regression provides better estimates of the coefficients and better prediction than discriminant function analysis when the distribution of the predictor variables is not multivariate normal. However, a simulation study found that discriminant function analysis often did better in correctly selecting variables than logistic regression when the data were log-normally distributed or were a mixture of dichotomous and log-normal variables. Even with all dichotomous predictor variables, the discriminant function did as well as logistic regression analysis. These simulations were performed for sample sizes of 50 and 100 in each group (O'Gorman and Woolson, 1991).

Many of the logistic regression programs also include a chi-square statistic that tests whether or not the total set of variables in the logistic regression equation is useless in classifying individuals. A large value of this statistic (sometimes called the model chi-square) is an indication that the variables are useful in classification. This test is analogous to testing the hypothesis that the population Mahalanobis D2 is zero in discriminant analysis (section 11.9).

Checking the fit of the model
When first examining the data, one should check that there are no gross outliers in the predictor variables. Note that the variables do not have to be normally distributed, so formal tests for outliers may not apply. We recommend visually checking graphical output (box plots or histograms) for continuous variables and frequencies for the discrete variables. After performing the analysis, residuals from the logistic regression can be plotted and the cases that are large in magnitude examined. As in linear regression, outliers in the dependent variable and in the independent or predictor variables should be examined. Some programs also can provide a measure that is useful in detecting influential observations (section 6.8). Other types of residuals, namely deviance and Pearson residuals, are available in some programs and can be useful in identifying cases that do not fit the model. These are defined somewhat differently in each program, so the user should read the respective manuals.

Several approaches have been proposed for testing how well the fitted logistic regression model fits the data. Note that we are not concerned with normality, but we are assuming that the model given in section 12.4 holds. Most approaches rely on the idea of comparing an observed number of
individuals with the number expected if the fitted model were valid. The observed (O) and expected (E) numbers are combined to form a chi-square (χ2) statistic called the goodness-of-fit χ2. Large values of the test statistic indicate a poor fit of the model. Equivalently, small P values are an indication of poor fit. Here we discuss two different approaches to the goodness-of-fit test.

The classical approach begins with identifying different combinations of values of the variables used in the equation, called patterns. For example, if there are two dichotomous variables, sex and employment status, there are four distinct patterns: male employed, male unemployed, female employed and female unemployed. For each of these combinations we count the observed number of individuals (O) in populations I and II. Similarly, for each of these individuals we compute the probability of being in population I and II from the computed logistic regression equation. The sum of these probabilities for a given pattern is denoted by E. The goodness-of-fit test statistic is computed as

    goodness-of-fit χ2 = Σ 2O ln(O/E)

where the summation is extended over all the distinct patterns. The BMDP LR program computes this statistic and the corresponding P value. The reader is warned, though, that this statistic may give misleading results if the number of distinct patterns is large, since this situation may lead to small values of E. A large number of distinct patterns occurs when continuous variables are used. These distinct patterns are also used in some of the statistical packages to display information on residuals. That is, the residuals are displayed for each distinct pattern rather than for each case. Differences between the observed proportion of assignments to one of the two outcomes and the predicted proportion are computed for each pattern.

Another goodness-of-fit approach was developed by Lemeshow and Hosmer (1982). In this approach the probability of belonging to population I is calculated for every individual in the sample, and the resulting numbers are arranged in increasing order. The range of probability values is then divided into ten groups (deciles). For each decile the observed number of individuals in population I is computed (O). Also, the expected numbers (E) are computed by adding the logistic probabilities for all individuals in each decile. The goodness-of-fit statistic is calculated from the Pearson chi-square statistic as

    goodness-of-fit χ2 = Σ (O - E)2/E

where the summation extends over the two populations and the ten deciles.
293
A similar statistic and its P value are given in the ST ATA logistic lfit . nand BMDP LR. The procedure calculates the observed and expected optl~ers for all individuals, or in each category defined by the predictor nu~ables if they are categorical. The above summation then extends over all va11 · ll . . 1 individuals or over a categones, respective~· . All of these goodness-of-fit tests are approxunate. an~ reqmre large sample . s A large goodness-of-fit (or a small P value) mdtcates that the fit may stz: be good. Unfortunately, if they indicate that a poor fit exists, they do ~~t say what is causing it ~r how to modify. the ~ariables to improve the fit. Note that we are assummg that the relationship between ln(odds) and the predictor va.riables is li~ear (sect.ion 12.4). This. assumptio? can be checked for the contmuous predictor vanables, one vanable at a tune (Hosmer and Lemeshow, 1989; Demaris, 1992). Finally, we note that, just as in linear- regression, variables that represent the interaction between two predictor variables can be added to the model (section 7.8). These interaction variables may improve the fit of the model.
Evaluating how well logistic regression predicts outcomes If the purpose of performing the logistic regression analysis is to obtain an equation to predict into which of two groups a person or thing could be classified, then the investigator would want to know how much to trust such predictions. In other words, can the equation predict correctly a high proportion of the time? This question is different from asking about the statistical significance, as it is possible to obtain statistically significant results that do not predict very well. In discriminant function programs, output is available to assist in this evaluation, for example using the classification table w.here the number predicted correctly in each group is displayed. Also, the ~stogram and scatter diagram plot from such programs provide visual tnfo.rmation on group separation. Here we will discuss what information is available from logistic regression programs that can be used to evaluate Prediction.
of~ t.he investigator wishes to classify cases, a cutoff point on the probability
by ;I~g depressed, fo~ example, must be foun~. This cuto~ ~oint is deno~ed is c We would classify a person as depressed Ifthe probability of depressiOn · greater than or equal top . Some st . . c . . . user h attstical packages provide a great deal of mformatwn to help the a spec. ~ose a cutpoint and some do not. Others treat logistic regression as c.ase of nonlinear regression and provide output similar to that of outp~ e. hnear regression (STATISTICA, SYSTAT). The STATISTICA Probabil~n.cludes the predicted probabilities for each individual. These depresse tties can be sa~ed and separated by whether the person was actually d or not. Two histograms (one for depressed and one fornondepressed)
llluittt
294
Logistic regression
can t~en ~e formed to see where a reasonable cutpoint would fall. On cutpomt IS chosen then a two-way table program can be used to m ~a classification table similar to Table 1~.3.
100 ~
·--e _g
80
0
'\j
-
A..
60
0
~
'0"' u 40 ......
1:
0
'"' 20 A.. Q)
0
0.05
0.1
0.15
0.2
0.25
Cutoff Point Figure 12.2 Percentage of Individuals Correctly Classified by Logistic Regre
ssion·
mentioned above. In this way the user can choose the cutoff point giving minimum loss.

STATA logistic can print a classification table for any cutoff point you supply. In addition to a two-way table similar to Table 11.3, other statistics are printed. The user can quickly try a set of cutoff points and zero in on a good one.

Both BMDP and STATA can also plot an ROC (receiver operating characteristic) curve. ROC was originally proposed for signal-detection work when the signals were not always correctly received. In an ROC curve, the proportion of depressed persons correctly classified as depressed is plotted on the vertical axis for the various cutpoints. This is often called the true positive fraction, or sensitivity in medical research. On the horizontal axis is the proportion of nondepressed persons classified as depressed (called the false positive fraction, or one minus specificity, in medical studies). Swets (1973) gives a summary of its use in psychology and Metz (1978) presents a discussion of its use in medical studies (see also Kraemer, 1988). SAS also gives the sensitivity, specificity and false negative rates.

Three ROC curves are drawn in Figure 12.3. The top one represents a hypothetical curve which would be obtained from a logistic regression equation that resulted in excellent prediction. Even for small values of the proportion of normal persons incorrectly classified as depressed, it would be
[Figure 12.3 ROC Curve from Logistic Regression for the Depression Data Set. True positive fraction (sensitivity) is plotted against false positive fraction (1 - specificity) for three curves: a hypothetical curve with excellent prediction, the curve obtained when predicting depression using sex and age, and a hypothetical curve with poor prediction.]
possible to get a large proportion of depressed persons correctly classified as depressed. The middle curve is what was actually obtained from the prediction of depression using just age and sex. Here we can see that to obtain a high proportion, say 0.80, of depressed persons classified as depressed results in a proportion of about 0.65 of the normal persons classified as depressed, an unacceptable level. The lower hypothetical curve (straight line) represents the chance-alone assignment (i.e. flipping a coin). The closeness of the middle curve to the lower curve shows that we perhaps need other predictor variables in order to obtain a better logistic regression model. This is the case even though our model was significant at the P = 0.009 level.

If we believe that the prevalence of depression is low, then a cutpoint on the lower part of the ROC curve is chosen, since most of the population is normal and we do not want too many normals classified as depressed. A case would be called depressed only if we were quite sure the person were depressed. This is called a strict threshold. The downside of this approach is that many depressed persons would be missed. If we thought we were taking persons from a population with a high rate of depression, then a cutpoint higher up the curve should be chosen. Here a person is called depressed if there is any indication that the person might fall in that group. Very few depressed persons would be missed, but many normals would be called depressed. This type of cutpoint is called a lax threshold.

Note that the curve must pass through the points (0,0) and (1,1). The maximum area under the curve is one. The numerical value of this area would be close to one if the prediction is excellent and close to one-half if it is poor. The ROC curve can be useful in deciding which of two logistic regression models to use. All else being equal, the one with the greater area would be chosen. Alternatively, you might wish to choose the one with the greatest height of the ROC curve at the desired cutoff point. Both BMDP and STATA give the area under the curve.
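The points on such a curve can also be tabulated directly from the fitted probabilities. The sketch below is an illustration under the assumption that the outcomes and probabilities are available as arrays; it is not output from the packages discussed here.

```python
import numpy as np

def roc_points(y, p, cutoffs):
    """Sensitivity and false positive fraction for each candidate cutoff.
    y: 0/1 outcomes, p: predicted probabilities from the fitted equation."""
    y = np.asarray(y)
    p = np.asarray(p)
    points = []
    for c in cutoffs:
        predicted = (p >= c).astype(int)              # classify as depressed if p >= cutoff
        tp = np.sum((predicted == 1) & (y == 1))      # depressed, classified depressed
        fp = np.sum((predicted == 1) & (y == 0))      # not depressed, classified depressed
        sensitivity = tp / np.sum(y == 1)             # true positive fraction
        false_positive_fraction = fp / np.sum(y == 0) # 1 - specificity
        points.append((c, sensitivity, false_positive_fraction))
    return points
```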
12.8 APPLICATIONS OF LOGISTIC REGRESSION

Multiple logistic regression equations are often used to estimate the probability of a certain event occurring to a given individual. Examples of such events are failure to repay a loan, the occurrence of a heart attack, or death from lung cancer. In such applications the period of time in which such an event is to occur must be specified. For example, the event might be a heart attack occurring within ten years from the start of observation.

For estimation of the equation a sample is needed in which each individual has been observed for the specified period of time and values of a set of relevant variables have been obtained at or up to the start of the observation. Such a sample can be selected and used in the following two ways.
1. A sample is selected in a random manner and observed for the specified period of time. This sample is called a cross-sectional sample. From this single sample two subsamples result, namely those who experience the event and those who do not. The methods described in this chapter are then used to obtain the estimated logistic regression equation. This equation can be applied directly to a new member of the population from which the original sample was obtained. This application assumes that the population is in a steady state, i.e. no major changes occur that alter the relationship between the variables and the occurrence of the event. Use of the equation with a different population may require an adjustment, as described in the next paragraph.

2. The second way of obtaining a sample is to select two random samples, one for which the event occurred and one for which the event did not occur. This sample is called a case-control sample. Values of the predictive variables must be obtained in a retrospective fashion, i.e. from past records or recollection. The data can be used to estimate the logistic regression equation. This method has the advantage of enabling us to specify the number of individuals with or without the event. In application of the equation the regression coefficients b1, b2, ..., bp are valid. However, the constant a must be adjusted to reflect the true population proportion of individuals with the event. To make the adjustment, we need an estimate of the probability P of the event in the target population. The adjusted constant a* is then computed as follows:

    a* = a + ln[P n0 /((1 - P) n1)]

where

    a  = the constant obtained from the program
    P  = estimate of the probability of the event in the target population
    n0 = the number of individuals in the sample for whom the event did not occur
    n1 = the number of individuals in the sample for whom the event did occur
Further details regarding this subject can be found in Bock and Afifi (1988).

For example, in the depression data suppose that we wish to estimate the probability of being depressed on the basis of sex, age and income. From the depression data we obtain an equation for the probability of being depressed as follows:

    Prob(depressed) = 1 / (1 + exp{-[-0.676 - 0.021(age) - 0.037(income) + 0.929(sex)]})
This equation is derived from data on an urban population with a depression rate of approximately 20%. Since the sample has 50/294, or 17%, depressed individuals, we must adjust the constant. Since a = -0.676, P = 0.20, n0 = 244 and n1 = 50, we compute the adjusted constant as

    a* = -0.676 + ln[(0.2)(244)/((0.8)(50))]
       = -0.676 + ln(1.22)
       = -0.676 + 0.199
       = -0.477

Therefore the equation used for estimating the probability of being depressed is

    Prob(depressed) = 1 / (1 + exp{-[-0.477 - 0.021(age) - 0.037(income) + 0.929(sex)]})
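The adjustment itself is a one-line calculation; the sketch below simply reproduces it with the numbers used above.

```python
import math

a = -0.676         # constant from the fitted case-control equation
P = 0.20           # estimated probability of depression in the target population
n0, n1 = 244, 50   # sample counts without and with the event

a_star = a + math.log(P * n0 / ((1 - P) * n1))
print(round(a_star, 3))   # -0.477
```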
This can be compared to the equation given in section 12.5.

Another method of sampling is to select pairs of individuals matched on certain characteristics. One member of the pair has the event and the other does not. Data from such matched studies may be analyzed by the logistic regression programs described in the next section after making certain adjustments. Holford, White and Kelsey (1978) and Kleinbaum (1994) describe these adjustments for one-to-one matching; Breslow and Day (1980) give a theoretical discussion of the subject. Woolson and Lachenbruch (1982) describe adjustments that allow a least-squares approach to be used for the same estimation, and Hosmer and Lemeshow (1989) discuss many-to-one matching.

As an example of one-to-one matching, consider a study of melanoma among workers in a large company. Since this disease is not common, the new cases are sampled over a five-year period. These workers are then matched by gender and age intervals to other workers from the same company who did not have melanoma. The result is a matched-pair sample. The researchers may be concerned about a particular set of exposure variables, such as exposure to some chemicals or working in a particular part of the plant. There may be other variables whose effect the researchers also want to control for, although they were not considered matching variables. All these variables, except those that were used in the matching, can be considered as independent variables in a logistic regression analysis. Three adjustments have to be made to running the usual logistic regression program in order to accommodate matched-pairs data.
1. For each of the N pairs, compute the difference X(case) - X(control) for each X variable. These differences will form the new variables for a single sample of size N to be used in the logistic regression equation.
2. Create a new outcome variable Y with a value of 1 for each of the N pairs.
3. Set the constant term in the logistic regression equation equal to zero.

Schlesselman (1982) compares the output that would be obtained from a pair-matched logistic regression analysis to that obtained if the matching is ignored. He explains that if the matching is ignored and the matching variables are associated with the exposure variables, then the estimates of the odds ratio of the exposure variables will be biased toward no effect (or an odds ratio of 1). If the matching variables are unassociated with the exposure variables, then it is not necessary to use the form of the analysis that accounts for matching. Furthermore, in the latter case, using the analysis that accounts for matching would result in larger estimates of the standard error of the slope coefficients.
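Returning to the three adjustments listed above, a minimal sketch of preparing the data is given below; the array names are hypothetical, and the resulting differences and outcome would then be supplied to a logistic regression program with the constant term set to zero.

```python
import numpy as np

def matched_pair_data(x_case, x_control):
    """x_case, x_control: N x p arrays of predictor values, one row per matched pair."""
    diffs = np.asarray(x_case) - np.asarray(x_control)  # step 1: case - control differences
    y = np.ones(diffs.shape[0])                          # step 2: outcome of 1 for every pair
    # step 3: fit the logistic regression of y on diffs with no constant term
    return diffs, y
```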
12.9 DISCUSSION OF COMPUTER PROGRAMS
BMDP, SAS, SPSS and STATA have specific logistic regression programs. STATISTICA and SYSTAT perform logistic regression as a part of nonlinear regression analysis. The options for all these programs are given in Table 12.2.

SAS has LOGISTIC, which is a general purpose program with numerous features. In addition to the logistic regression model discussed in this chapter, it also performs logistic regression for the case where the outcome variable can take on two or more ordinal values. It can also be used to do probit regression analysis and extreme value analysis. SAS also provides global measures for assessing the fit of the model, namely Akaike's Information Criterion (AIC), Schwartz's Criterion and -2 log(likelihood), where a small value of the statistic indicates a desirable model. It also provides several rank correlation statistics between observed and predicted probabilities. Stepwise options are available. SAS has an especially rich mix of diagnostic statistics that may be useful in detecting observations that are outliers or do not fit the model.

BMDP has two programs, LR and PR. The LR program performs logistic regression analysis for the binary outcome variable case discussed in this chapter. The PR program computes logistic regression analysis when the outcome variable takes on two or more values and is either nominal or ordinal. The LR program provides stepwise options, several methods for checking for the goodness of fit, regression diagnostics for individual cases, and considerable output to assist the user in finding a suitable cutpoint, including the ROC curve. It provides features that assist the user in performing
---===---=---
--~
-
~---==---
Table 12.2 Summary of computer output for logistic regression analysis Output
BMDP
SAS
SPSS
STAT A
STATISTICA
SYSTATa
Beta coefficients Standard error of coefficients Odds ratios Stepwise variable selection Model chi-square Goodness-of-fit chi-square Residuals Classification table cutpoint Classification table multiple cutpoints Correlation matrix of coefficients Cost matrix ROC curve Predicted probability for cases Polychotomous logistic regression
LR LR LR LR LR LR LR
LOGISTIC LOGISTIC
LOGISTIC LOGISTIC
Logistic Logistic
LOGISTIC LOGISTIC
LOGISTIC LOGISTIC
Nonlin, Logit N onlin, Logit Logit Logit Logit
LOGISTIC LOGISTIC
LOGISTIC LOGISTIC
logit logit logistic swlogis logistic logistic logistic
a
Logit is a supplementary module.
Logistic
Logit Logit
Logistic
Nonlin
logistic
LR LR LR LR LR PR
Logistic
LOGISTIC LOGISTIC LOGISTIC
correlate LOGISTIC
logistic logistic mlogit
Logit
What to watch out for
301
. (1c regression for matched case-control studies. It can also generate logls y variables for categorical data with two or more values. SPSS LOGISTIC program also has a good choice of available options. e~forms stepwise selection and can generate dummy variables using It p rous methods (see the CONTRASTS subcommand). It also provides a nurne f . d. . c h ·ch assortment o regressiOn tag~ostlcs .or eac case. . . . rt The STATAp~ogram also provides an e~cellent ~et of ~ptwns for logtstlc ssion analysts. It has almost all the optiOns available many of the other regre . . f 1 . . ackages and ts ver~ easy to use. It ts one o on y two programs that pr~vt~e ph odds ratios, thetr standard errors, z values and 95% confidence limits t ;ich are widely used in medical applications. It provides both Pearson and ;eviance residual along with a range of other diagnostic statistics. It also prints a summary of the residuals that is useful in assessing individual cases. It also computes an ROC curve. sT ATISTICA and SYSTAT treat logistic regression as a regression model. In STATISTICA, logistic regression can be selected as an option in the Nonlinear Estimation module. STATISTICA provides numerous diagnostic residuals and plots that are useful in finding cases that do not fit the model. SYSTAT uses the Nonlin program to perform logistic regression. SYSTAT also offers a supplementary module called Logit that performs logistic regression. Logit provides a full range of the usual logistic regression output along with some output that is not available in other programs.
du;;
12.10 WHAT TO WATCH OUT FOR When researchers first read that they do not have to assume a multivariate normal distribution to use logistic regression analysis, they often leap to the conclusion that no assumptions at all are needed for this technique. U~fortunately, as with any statistical analysis, we still need some assumptions: astm 1 th P e random sample, correct knowledge of group membership, and data at are free of miscellaneous errors and outliers. Additional points to watch owt · · or Include the following. I. The d 1 . . . . mo e assumes that In( odds) ts linearly related to the mdependent vrnanables. This should be checked using goodness-of-fit measures or other eans .d d d t provt e by the program. It may require transformations of the 2. a a to achieve this. 1 co~s~o~putations use an iterative procedure that can be quite timerese ming on some personal computers. To reduce computer time, 3. Log~r~hers sometimes limit the number of variables. 1 lon .shc. regression should not be used to evaluate risk factors in gitUdtnal studies in which the studies are of different durations (see
302
Logistic regression
Woodbury, Manton and Stallard, 1981, for an extensive discu . 8810 n of problems in interpreting the logistic function in this context). 4. The coefficient for a variable in a logistic regression equation dep d the other variables included in the modeL The coefficients for t~n s on variable, when included with different sets of variables, could be quite d~same 5. If a matched analysis is performed, any variable used for matching c erent. also be used as an independent variable. annat 6; Th~re ~re circu~stanc~s where the m~ximum likelihood method of estimatmg coefficients will not produce estimates. If one predictor vari bl ?e.co~es a perfect predictor (.say t~e v~riable seldom is positive but w~e~ It Is, It has a one-to-one relatiOnship wtth the outcome) then that variabt· must be excluded to get a maximum likelihood solution. This case ca~ be handled by a program called LogXact from CYTEL Software Corporation in Cambridge, Massachusetts. The LogXact program is also recommended when the data are all discrete and there are numerous empty cells or small sample sizes (see Hirji, Mehta, and Patel, 1987, for an explanation of the method of computation).
SUMMARY In this chapter we presented another method of classifying individuals into one of two populations. It is based on the assumption that the logarithm of the odds of belonging to one population is a linear function of the variables used for classification. The result is that the probability of belonging to the population is a multiple logistic function. We described in some detail how this relatively new method can be implemented by packaged programs, and we made the comparison with the linear discriminant function. Thus logistic regression is appropriate ~hen both ~ate~orical and continuous varia~les ~reused, while the linear discrinun:~ functiOn 1s preferable when the multtvanate normal model can be assum d We also described situations in which logistic regression is a useful metho of analysis. ·00 · · gressi Aldrich and Nelson (1984) and Demaris (1992) discuss logistic re . ble as one of a variety of models used in situations where the dependent vanaide is dichotomous. Hosmer and Lemeshow (1989) and Kleinbaum (1994) prov an extensive treatment of the subject of logistic regression.
REFERENCES . . . I b ckground. . !Jodefs. References preceded by an astensk reqmre strong mathemauca a . .. L . d Probtt Aldrich, J.H. and Nelson, F.D. (1984). Lmear Probabzhty, ogzt, an Sage, Newbury Park, CA.
Further reading
303
. J. and Afifi, A.~. (1988). Estimation. of pr?bability using the logistic model in sock, · ective studies. Computers and Bzomedzcal Research, 21, 449-70. ret~: N.E. a~d ~ay, N.E. (1980). Statistical Method_s i~ Cancer Research. IARC *Bres~entific Publications No .. 32. W~rl_d Health .OrgamzattOn, L~~ns, France.. SCI . A. (1990). Interpreting logistic regression results: a cntical commentary. pernan~al of the Marriage and the Family, 52,271-77. . jou~s A. (1992). Logit Modeling: Practical Applications. Sage, Newbury Park, CA. pernar~: (1975). The efficiency oflogistic regression compared to normal discriminant Efron, lysis. Journal of the American Statistical Association, 70, 892-98. ~na1 L. (1981). Statistical Methods. for Rates and Proportions. Wiley, New York. Ft~ss,ri~ M., Blackwelder, W.C. and Verter, J.I. (1971). Estimation of the multivariate FI ~:istic risk function: A comparison o~ the. discriminant function and maximum rkelihood approach. Journal of Chrome Dzseases, 24, 125-58. *Hi~ji, K.F., Meh~a, C.R. and Patel, N.R. ~1987). C?~puting di~tr~butions for exact logistic regressiOn. Journal of the Amerzcan Statzstz~al 4ssoczatzon, 82, 1110-1 ~ Holford; T.R., Whit~, C. and !<-elsey, J.L. (1978): M~ltivanate analyses for matched case~control studies. Amerzcan Journal of Epzdemzology, 107, 245-56. Hosmer, D. W., J rand Lemeshow, S. (1989). Applied Logistic Regression. Wiley, New York. Kleinbaum, D.G. (1994). Logistic Regression: a Self-learning Text. Springer-Verlag, New York. Kraemer, H.C. (198~). Assessment of 2 x 2 associations: Generalization of signaldetection methodology. The American Statistician, 42, 37-49. Lemeshow, S. and Hosmer, D. W. (1982). A review of goodness-of-fit statistics for use in the development oflogistic regression models. American Journal ofEpidemiology, 115,.92-106. Metz, C.E. (1978). Basic .principles of ROC analysis. Seminars in Nuclear Medicine, 8, 283-98. O'Gorman,T.W. and Woolson, R.F. (1991). Variable selection to discriminate between two groups: stepwise logistic regression or stepwise discriminant analysis? The . , American Statistician, 45, 187-93. 0 H~ra, T.F., Hosmer, D.W., Lemeshow, S. and Hartz, S.C. (1982). A comparison of discriminant function and maximum likelihood estimates of logistic coefficients p for categorical data, University of Massachusetts, Amherst, MA. ress, S.J._and Wilson, S. (1978). Choosing between logistic regression and discriminant Re a:alysis. Journal of the American Statistical Society, 73, 699-705. Sc~I olds, H.T. (1977). Analysis of Nominal Data. Sage, Beverly Hills, CA. . Swe:ssselman, J.J. (1982). C~se-Control_ Studies. Oxf~r~ "f!niversity Press, ~ew York. . ~-~-973). The receiver operatmg characteristic m psychology. Sczence, 182,
1
990
Woodbury M A . chr · . ' . · ., Manton, K.G. and Stallard, E. (1981). Longitudinal models for lnc;mc ?Isease risk: An evaluation of logistic multiple regression and alternatives. Woots~natzonal Journal of Epidemiology, 10, 187-97. case-n, R.F. and Lachenbruch, P.A. (1982). Regression analysis of matched control data. American Journal of Epidemiology, 115, 444-52.
F'u R.
~. l'IiER READING
sP
tlttain E . ,.. Rep~rt 980). Probability of Developing Coronary Heart Disease~ Technical Cox, D.R ( · Stanford University, Division of Biostatistics, Stanford, CA. · 1970). Analysis of Binary Data. Methuen, London.
304
Logistic regression
PROBLEMS 12.1 If the probability of an individual getting a hit in baseball is 0.20 th odds of getting a hit are 0.25. Check to determine that the previous 'staten the · ernent is true. Would you prefer to be told that your chances are one in five of a h" or that for every four hitless times at bat you can expect to get one hit? It 12.2 Using the formula odds= P/(1 - P), fill in the accompanying table. ·
Odds
p
0.25 0.5 1.0 1.5 2.0 2.5 3.0 5.0
0.20 0.5
0.75
12.3 The accompanying table presents the number of individuals by smoking and disease status. What are the odds that a smoker will get disease A? That a nonsmoker will get disease A? What is the odds ratio?
Disease A Smoking Yes No Total
Yes
No
Total
80 20
120 280
200 300
100
400
500
J · · regression · ana1ysts · · Wlt · h t he same vana · bles and data used a 1ogtsttc 12.4 Peuorm in the example in section 11.13 and compare the results: . !1.2. Perform a logistic regression analysis on the data descnbed m Problem 12.5 Compare the two analyses. the . . . Compare 12.6 Repeat Problem 11.5, usmg stepwise logtstic regressiOn. results. the Repeat Problem 11. 7, using stepwise logistic regression. Compare 12.7 results. the Repeat Problem 11.10, using stepwise logistic regression. Compare 12.8 results. (a) Using the depression data set, fill in the following table: 12.9
Problems
305
Sex Regular drinker
Female
Male
Total
YeS
No Total
12.10
12.11 12.12 12.13 12.14
12.15
12 16 ·
What are the odds that a woman is a regular drinker? That a man is a regular drinker? What is the odds ratio? (b) Repeat the tabulation and calculations for part (a) separately now for people who are depresseli and those who are not. Compare the odds ratios for the two groups. (c) Run a logistic regression analysis with DRINK as the dependent variable and CESD and SEX as the independent variables. Include an interaction term. Is it significant? How does this relate to part (b) above? Using the depression data set and stepwise logistic regression, describe the probability of an acute illness as a function of age, education, income, depression and regular drinking. Repeat Problem 12.10, but for chronic rather than acute illness. Repeat Problem 11.13(a), using stepwise logistic regression. Compare the results. Repeat Problem 11.14, using stepwise logistic regression. Compare the results. Define low FEVl to be an FEV1 measurement below the median FEVl of the fathers in the family lung function data set given in Appendix A. What are the odds that a father in this data set has low FEV1? What are the odds that a father from Glendora has low FEVl? What are the odds that a father from Long Beach does? What is the odds ratio of having low FEVl for these two groups? Using the definition of low FEV1 given in Problem 12.14, perform a logistic regression of low FEV1 on area for the fathers. Include all four areas and use a dummy variable for each area. What is the intercept term? Is it what you expected (or could have expected)? ~or the family lung function data set, define a new variable VALLEY (residence In San Gabriel or San Fernando Valley) to be one if the family lives in Burbank ~ Glendora and zero otherwise. Perform a stepwise logistic regression of h~LLEY on mother's age and FEV1, father's age and FEVl, and number of c lldren (1, 2 or 3). Are these useful predictor variables?
13 Regression analysis using survival data

13.1 USING SURVIVAL ANALYSIS TO ANALYZE TIME-TO-EVENT DATA
In Chapters 11 and 12, we presented methods for classifying individuals into one of two possible populations and for computing the probability of belonging to one of the populations. We examined the use of discriminant function analysis and logistic regression for identifying important variables and developing functions for describing the risk of occurrence of a given event. In this chapter, we present another method for further quantifying the probability of occurrence of an event, such as death or termination of employment, but here the emphasis will be on the length of time until the event occurs, not simply whether or not it occurs.

Section 13.2 gives examples of when survival analysis is used and explains how it is used. Section 13.3 presents a hypothetical example of survival data and defines censoring. In section 13.4 we describe four types of distribution-related functions used in survival analysis and show how they relate to each other. Section 13.5 presents the distributions commonly used in survival analysis and gives figures to show how the two most commonly used distributions look. The normal distribution that is often assumed in statistical analyses is seldom used in survival analysis, so it is necessary to think in terms of other distributions.

Section 13.6 introduces a log-linear regression model to express an assumed relationship between the predictor variables and the log of the survival time. Section 13.7 describes an alternative model to fit the data, called Cox's proportional hazards model. In section 13.8, we discuss the relationship between the log-linear model and Cox's model, and between Cox's model and logistic regression. The output of the statistical packages is described in section 13.9 and what to watch out for is given in section 13.10.
13.2 WHEN IS SURVIVAL ANALYSIS USED?

Survival analysis can be used to analyze data on the length of time it takes for a specific event to occur. This technique takes on different names, depending on
the particular application on hand. For example, if the event under consideration is the death of a person, animal or plant, then the name survival analysis is used. If the event is failure of a manufactured item, e.g. a light bulb, then one speaks of failure time analysis or reliability theory (Nelson). The term event history analysis is used by social scientists to describe applications in their fields (Allison, 1984; Yamaguchi, 1991). For example, analysis of the length of time it takes an employee to retire or resign from a given job could be called event history analysis. In this chapter, we will use the term survival analysis to mean any of the analyses just mentioned.

Survival analysis is a way of describing the distribution of the length of time to a given event. Suppose the event is termination of employment. We could simply draw a histogram of the length of time individuals are employed. Alternatively, we could use log length of employment as a dependent variable and determine if it can be predicted by variables such as age, gender, education level, or type of position (this will be discussed in section 13.6). Another possibility would be to use the Cox regression model as described in section 13.7.

Readers interested in a comprehensive coverage of the subject of survival analysis are advised to study one of the texts given in the reference and further reading sections. In this chapter, our objective is to describe regression-type techniques that allow the user to examine the relationship between length of survival and a set of explanatory variables. The explanatory variables, often called covariates, can be either continuous, such as age or income, or they can be dummy variables that denote treatment group. The material in sections 13.3-13.5 is intended as a brief summary of the background necessary to understand the remainder of the chapter.
13.3 DATA EXAMPLES

In this section, we begin with a description of a hypothetical situation to illustrate the types of data typically collected. This will be followed by a real data example taken from a cancer study.

In a survival study we can begin at a particular point in time, say 1980, and start to enroll patients. For each patient we collect baseline data, that is, data at the time of enrollment, and begin follow-up. We then record other information about the patient during follow-up and, especially, the time at which the patient dies if this occurs before termination of the study, say 1990. If some patients are still alive, we simply record the fact that they survived up to ten years. However, we may enroll patients throughout a certain period, say the first eight years of the study. For each patient we similarly record the baseline and other data and the length of survival, or the length of time of study (if the patient does not die during the period of observation). Typical examples of five patients are shown in Figure 13.1 to illustrate the various
possibilities. Patient 1 started in 1980 and died 1t years after p . started in 1982 and also died during the study period. Patient 3. en~hent .2 1988 and died 1t years later. These three patients were observed untile~ed tn Patient 4 was observed for two years, but was lost to follow-up. All w keath. about this patient is that survival was more than two years. This is an e~ now of what is known as a censored observation. Another censored obser:mple is represented by patient 5 who enrolled for 3t years and was still ali~~oan the end of the study. t The data s~own i~ Figure 13.1 ca_n be repr~sente~ as if all patients enrolled at t~e same tm~e (Ftgure 13.2). Thts makes tt ea~ter to compare the length of time the patients were followed. Note that t~ts r~presentation implicitly assumes that there have been no changes over tune m the conditions of the study or in the types of patients enrolled. One simple analysis of these data would be to compute the so-called T year survival rate, for example T = 3. For all five patients we determine whether it is known that they survived three years: • Patient 1 did not survive three years. • Patient 2 did survive three years and the time of later death is known. • Patient 3 did not survive three years, but should not be counted in the analysis since the maximum period of observation would have been less than three years. • Patient 4 was censored in less than three years and does not count for this analysis. • Patient 5 was censored, but was known to live for more than three years and counts in this analysis.
[Figure 13.1 Schematic Presentation of Five Patients in a Survival Study. Each patient's period of observation is drawn against calendar year (1980-1990), with deaths and losses to follow-up marked.]

[Figure 13.2 Five Patients Represented by a Common Starting Time. Deaths, loss to follow-up and withdrawal alive are marked along years of observation (0-10).]
Note that three of the five patients can be used for the analysis and two cannot. The three-year survival rate is then calculated as:

    (number of patients known to survive 3 years) / (number of patients observed at risk for 3 years)

In this expression, 'patients at risk' means that they were enrolled more than three years before termination of the study. In the above example, two patients (patients 2 and 5) were known to survive three years, three patients (patients 1, 2 and 5) were observed at risk for three years, and therefore the three-year survival rate = 2/3. In succeeding sections, we discuss methods that use more information from each observation.

We end this section by describing a real data set used in various illustrations in this chapter. The data were taken from a multicenter clinical trial of patients with lung cancer who, in addition to the standard treatment, received either the experimental treatment known as BCG or the control treatment consisting of a saline solution injection (Gail et al., 1984). The total length of the study was approximately ten years. We selected a sample of 401 patients for analysis. The patients in the study were selected to be without distant metastasis or other life-threatening conditions, and thus represented early stages of cancer. Data selected for presentation in this book are described in a code book shown as Table 13.1. In the binary variables, we used 0 to denote good or favorable values and 1 to denote unfavorable values. For HIST, subtract 1 to obtain 0, 1 data. The data are described and partially listed in Appendix B.
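Returning to the T-year survival rate calculation above, a minimal sketch is given below. The tuple layout, with each patient represented by the observed time, whether death was observed and the maximum follow-up that was possible for that patient, is illustrative and is not taken from the study data.

```python
def t_year_survival_rate(patients, t):
    """patients: list of (observed_time, died, max_possible_followup) tuples."""
    known_to_survive = sum(1 for time, died, max_fu in patients
                           if time >= t)                    # known to survive at least t years
    at_risk = sum(1 for time, died, max_fu in patients
                  if max_fu >= t and (died or time >= t))   # usable for the t-year rate
    return known_to_survive / at_risk
```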
310
Regression. analysis using survival data
Table 13.1 Code book for lung cancer data a Variable number 1 2
ID Staget
3
Stagen
4
Hist
5
Treat
6
Perfbl
7
Poinf
8
Smokfu
9
Smokbl
10 11
a
Variable name
Days Death
Description
-----
Identification number from 1 to ~ Tumor size 0 =small 1 =large Stage of nodes 0 =early 1 =late Histology 1 =squamous cells 2 = other types of cells Treatment 0 = saline (control) 1 = BCG (experiment) Performance status at baseline 0 =good 1 =poor Post-operative infection 0 =no l =yes Smoking status at follow-up 1 =smokers 2 =ex-smokers 3 = never smokers Smoking status at baseline 1 =smokers 2 = ex-smokers 3 = never smokers Length of observation time in days Status at end of observation time 0 =alive (censored) 1 = dead
Missing data are denoted as MIS in Appendix Band with a full point (period) in the data-set disk.
tll of variable under consideration is the length of time to death, a histogr~ng a the data from a sample can be constructed either by hand or by u; ver'f 0 packaged program. If we imagine that the data were available ~ togra.fll large sample ofthe population, then we can construct the frequencY _hiS sucb a. using a large number of narrow frequency intervals. In constructtnS
Survival functions
311
~
~
~
tt
4)
;>:
-;1
Relative Frequency Histogram
Approximate Death Density
(1:1
'd)
~
0
Time
Figure 13.3 Relative Frequency Histogram and Death Density.
histogram, we use the relative frequency as the vertical axis so that the total area under the frequency curve is equal to one. This relative frequency curve is obtained by connecting the midpoints of the tops of the rectangles of the histogram (Figure 13.3). This curve represents an approximation of what is known as the death density function. The actual death density function can be itmigined as, the result of letting the sample get larger and larger - the number of intervals gets larger and larger while the width of each interval gets smaller and smaller. In the limit, the death density is therefore a smooth curve. Figure 13.4 presents an example of a typical death density function. Note that this curve is not symmetric, since most observed death density functions are not. In particular, it is not the familiar bell-shaped normal density function. The death density is usually denoted by f(t). For any given time t, the area ~~~er~he curve .to the left of tis t~e prop?rtion o~i~dividuals in the populati?n d , hdte up to time t. As a functiOn of time t, thts ts known as the cumulative t~~~e d~stributio~ function an.d is denoted by F(t). The area und~r the cur~e latt nght o~ t ts 1 ~ F(t), smce the total area under the curve ts one. Thts sur~~ ~roportion of individuals, denoted by S(t), is the proportion of those Pres~VI~g· at least to time t and is called the survival function. A graphic a~aly~e:hon of both ~(t) a~d ~(t) ~s given i? Figure. 13.5. In many statistic~! F1gure lJ~he cumulati~e dtstnbu~w~ ~unctiOn F(t) ts presented as show~ m function 5((a). In survtval analysts, It ts more customary to plot the survival 8 We Will t) as shown in Figu~e 13.5(b). . . ~o, 'We r next define a quantity called the hazard functwn. Before domg 1S tneas eturn. to an interpretation of the death density function, f(t). If time Probab~~ed tn very small units, say seconds, then f(t) is the likelihood, or tty, that a randomly selected individual from the population dies in
312
Regression analysis using survival data
f(t) or death density
Time Figure 13.4 Graph of Death Density, Cumulative Death Distribution Function and Survival Function.
the interval between t and t + 1, where 1 represents one second. In this context; the population consists of all individuals regardless of their time of death. On the other hand, we may restrict our attention to only those members of the population who are known to be alive at time t. The probability that a randomly selected one of those members dies between t and t + 1 is called the hazard function, and is denoted by h(t). Mathematically, we can write h(t) = f(t)/S(t). Other names of the hazard function are the force of mortality, instantaneous death rate or failure rate. To summarize, we described four functions: f(t), F(t), S(t) and h(t). Mathe~atically, any three. of the f~ur. c~ be obtained from the fourt~ (Le~ 1992; Miller, 1981). In surviVal studies It Is Important to compute an estimate survival function. A theoretical distribution can be assumed, for example, one of those discussed i~ t~e n~xt section. In. that ~ase, we need to estima:; the parameters of the distnbutton and substitute m the formula .for a obtain !ts es~im~te ~or any time t. This is called a parametric esttmate.f the t?eor~ttcal distnbutto~ cannot be assu~ed ~ased on ou~ knowle~ge 0Most situatiOn, we can obtam a nonparametr1c estimated survival functiOn· this packaged programs (section 13.9) use the Kaplan-Meier method ~or tioll purpose. Readers interested in the details of survival function esumading should consult one of the texts cited in the reference and further rea
Figure 13.5 Cumulative Death Distribution Function and Survival Function. (a) Cumulative Death Distribution Function, F(t), versus time; (b) Survival Function, S(t) = 1 − F(t), versus time.
sections. Many of these texts also discuss estimation of the hazard function and death density function.

13.5 COMMON DISTRIBUTIONS USED IN SURVIVAL ANALYSIS

The simplest model for survival analysis assumes that the hazard function is constant across time, that is h(t) = λ, where λ is any constant greater than zero. This results in the exponential death density f(t) = λ exp(−λt) and an exponential survival function S(t) = exp(−λt). Graphically, the hazard function and the death density function are displayed in Figures 13.6(a) and 13.7(a). This model assumes that having survived up to a given point in time has no effect on the probability of dying in the next instant. Although simple, this model has been successful in describing many phenomena observed in real life. For example, it has been demonstrated that the exponential distribution closely describes the length of time from the first to the second myocardial infarction in humans and the time from diagnosis to death for some cancers. The exponential distribution can be easily recognized from a flat (constant) hazard function plot. Such plots are available in the output of many packaged programs, as we will discuss in section 13.9.
If the hazard function is not constant, the Weibull distribution should be considered. For this distribution, the hazard function may be expressed as h(t) = αλ(λt)^(α−1). The expressions for the density function, the cumulative distribution function and the survival function can be found in most of the references given at the end of the chapter. Figures 13.6(b) and 13.7(b) show plots of the hazard and density functions for λ = 1 and α = 0.5, 1.5 and 2.5. The value of α determines the shape of the distribution and for that reason it is called the shape parameter or index. Furthermore, as may be seen in Figure 13.6(b), the value of α determines whether the hazard function increases or decreases over time. Namely, when α < 1 the hazard function is decreasing and when α > 1 it is increasing. When α = 1 the hazard is constant, and the Weibull and exponential distributions are identical. In that case, the exponential distribution is used. The value of λ determines how much the distribution is stretched, and therefore is called the scale parameter.
The Weibull distribution is used extensively in practice. Section 13.6 describes a model in which this distribution is assumed. However, the reader should note that other distributions are sometimes used. These include the log-normal, gamma and others (e.g. Cox and Oakes, 1984).
A good way of deciding whether or not the Weibull distribution fits a set of data is to obtain a plot of log(−log S(t)) versus log time, and check whether the graph approximates a straight line (section 13.9). If it does, then the Weibull distribution is appropriate and the methods described in section 13.6 can be used. If not, either another distribution may be assumed or the method described in section 13.7 can be used.
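As a purely illustrative aside, the hazard and survival functions named above are easy to evaluate numerically. The short Python sketch below uses the parameter values of Figures 13.6 and 13.7 and assumes the standard Weibull survival function S(t) = exp(−(λt)^α), which reduces to the exponential form when α = 1.

# Illustrative sketch of the exponential and Weibull hazard and survival functions.
import numpy as np

def exponential_hazard(t, lam):
    return np.full_like(t, lam, dtype=float)        # h(t) = lambda, constant in time

def weibull_hazard(t, lam, alpha):
    return alpha * lam * (lam * t) ** (alpha - 1)   # h(t) = alpha*lambda*(lambda*t)^(alpha - 1)

def weibull_survival(t, lam, alpha):
    return np.exp(-(lam * t) ** alpha)              # assumed standard form; exp(-lambda*t) when alpha = 1

t = np.linspace(0.01, 3, 300)
for alpha in (0.5, 1.0, 1.5, 2.5):
    h = weibull_hazard(t, lam=1.0, alpha=alpha)
    print(f"alpha = {alpha}: hazard at t = 1 is {h[np.argmin(np.abs(t - 1))]:.3f}")
# Note: log(-log S(t)) = alpha*log(lambda) + alpha*log(t), which is why a straight
# line in log(-log S(t)) versus log(t) suggests a Weibull fit.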
Figure 13.6 Hazard Functions for the Exponential and Weibull Distributions. (a) Exponential distribution with λ = 1; (b) Weibull distribution with λ = 1 and α = 0.5, 1.5 and 2.5. (Vertical axis: hazard function; horizontal axis: time.)
Figure 13.7 Death Density Functions for the Exponential and Weibull Distributions. (a) Exponential distribution with λ = 1; (b) Weibull distribution with λ = 1 and α = 0.5, 1.5 and 2.5. (Vertical axis: death density; horizontal axis: time.)
13.6 THE LOG-LINEAR REGRESSION MODEL

In this section, we describe the use of multiple linear regression to study the relationship between survival time and a set of explanatory variables. Suppose that t is survival time and X1, X2, ..., XP are the independent or explanatory variables. Let Y = log(t) be the dependent variable, where natural logarithms are used. Then the model assumes a linear relationship between log(t) and the X's. The model equation is

log(t) = α + β1X1 + β2X2 + ··· + βPXP + e

where e is an error term. This model is known as the log-linear regression model since the log of survival time is a linear function of the X's. If the distribution of log(t) were normal and if no censored observations existed in the data set, it would be possible to use the regression methods described in Chapter 7 to analyze the data. However, in most practical situations some of the observations are censored, as was described in section 13.3. Furthermore, log(t) is usually not normally distributed (t is often assumed to have a Weibull distribution). For those reasons, the method of maximum likelihood is used to obtain estimates of the βi's and their standard errors. When the Weibull distribution is assumed, the log-linear model is sometimes known as the accelerated life or accelerated failure time model (Kalbfleisch and Prentice, 1980).
For the data example presented in section 13.3, it is of interest to study the relationship between length of survival time (since admission to the study) and the explanatory variables (variables 2-9) of Table 13.1. For the sake of illustration, we restrict our discussion to the variables Staget (tumor size: 0 = small, 1 = large), Perfbl (performance status at baseline: 0 = good, 1 = poor), Treat (treatment: 0 = control, 1 = experimental) and Poinf (postoperative infection: 0 = no, 1 = yes).
A simple analysis that can shed some light on the effect of these variables is to determine the percentage of those who died in the two categories of each of the variables. Table 13.2 gives the percentage dying, and the results of a chi-square test of the hypothesis that the proportion of deaths is the same for each category. Based on this analysis, we may conclude that of the four variables, all except Treat may affect the likelihood of death.
This simple analysis may be misleading, since it does not take into account the length of survival or the simultaneous effects of the explanatory variables. A previous analysis of these data (Gail et al., 1984) has demonstrated that the data fit the Weibull model well enough to justify this assumption. We estimated S(t) from the data set and plotted log(−log S(t)) versus log(t) for small and large tumor sizes separately. The plots are approximately linear, further justifying the Weibull distribution assumption.
Assuming a Weibull distribution, the LIFEREG procedure of SAS was used to analyze the data. Table 13.3 displays some of the resulting output.
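The analysis reported below was produced with SAS LIFEREG. Purely for illustration, a roughly comparable accelerated failure time fit could be sketched in Python as follows; the lifelines package, the simulated data and the column names (time, death, Staget, Perfbl, Treat, Poinf) are assumptions and not part of the original analysis.

# Hypothetical sketch of an accelerated failure time (Weibull) fit; illustrative only.
import numpy as np
import pandas as pd
from lifelines import WeibullAFTFitter

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Staget": rng.integers(0, 2, n),
    "Perfbl": rng.integers(0, 2, n),
    "Treat":  rng.integers(0, 2, n),
    "Poinf":  rng.integers(0, 2, n),
})
# Simulated survival times in days: shorter, on average, when the risk indicators equal 1.
scale = np.exp(7.0 - 0.6*df["Staget"] - 0.6*df["Perfbl"] - 0.7*df["Poinf"] + 0.1*df["Treat"])
event_time = scale * rng.weibull(1.1, n)
censor_time = rng.uniform(200, 3000, n)
df["time"] = np.minimum(event_time, censor_time)
df["death"] = (event_time <= censor_time).astype(int)   # 1 = died, 0 = censored

aft = WeibullAFTFitter()
aft.fit(df, duration_col="time", event_col="death")
aft.print_summary()          # coefficients are on the log(survival time) scale, as in the log-linear model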
Table 13.2 Percentage of deaths versus explanatory variables

Variable                 Deaths (%)    P value
Staget   Small             42.7
         Large             60.1        <0.01
Perfbl   Good              48.4
         Poor              64.5         0.02
Treat    Control           49.2
         Experiment        52.4         0.52
Poinf    No                49.7
         Yes               75.0         0.03
Table 13.3 Log-linear model for lung cancer data: results

Variable     Estimate    Standard error    Two-sided P value
Intercept      7.96          0.21               <0.01
Staget        -0.59          0.16               <0.01
Perfbl        -0.60          0.20               <0.01
Poinf         -0.71          0.31               <0.02
Treat          0.08          0.15                0.59
Shown are the maximum-likelihood estimates of the intercept (α) and regression (βi) coefficients along with their estimated standard errors, and P values for testing whether the corresponding parameters are zero. Since all of the regression coefficients are negative except Treat, a value of one for any of the three status variables is associated with shorter survival time than is a value of zero. For example, for the variable Staget, a large tumor is associated with a shorter survival time. Furthermore, the P values confirm the same results obtained from the previous simple analysis: three of the variables are significantly associated with survival (at the 5% significance level), whereas treatment is not. The information in Table 13.3 can also be used to obtain approximate confidence intervals for the parameters. Again, for the variable Staget, an approximate 95% confidence interval for β1 is

−0.59 ± (1.96)(0.16)

that is, −0.90 < β1 < −0.28.
The log-linear regression equation can be used to estimate typical survival times for selected cases. For example, for the least-serious case, when each explanatory variable equals zero, we have log(t) = 7.96. Since this is a natural logarithm, t = exp(7.96) = 2864 days = 7.8 years. This may be an unrealistic estimate of survival time caused by the extremely long tail of the distribution. At the other extreme, if every variable has a value of one, t = exp(6.14) = 464 days, or somewhat greater than one year.
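The confidence interval and predicted survival times above are simple arithmetic on the reported estimates. As an illustrative check (values copied from Table 13.3), one might compute:

# Reproducing the hand calculation above from the reported estimates.
import numpy as np

coef = {"Intercept": 7.96, "Staget": -0.59, "Perfbl": -0.60, "Poinf": -0.71, "Treat": 0.08}
se_staget = 0.16

lo, hi = coef["Staget"] - 1.96 * se_staget, coef["Staget"] + 1.96 * se_staget
print(f"95% CI for the Staget coefficient: ({lo:.2f}, {hi:.2f})")     # about (-0.90, -0.28)

best_case  = coef["Intercept"]                    # every explanatory variable = 0
worst_case = sum(coef.values())                   # every explanatory variable = 1
print(f"Predicted survival, best case : {np.exp(best_case):7.0f} days")
print(f"Predicted survival, worst case: {np.exp(worst_case):7.0f} days")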
13.7 THE COX PROPORTIONAL HAZARDS REGRESSION MODEL
In this section, we describe the use of another method of modeling the relationship between survival time and a set of explanatory variables. In section 13.4, we defined the hazard function and used the symbol h(t) to indicate that it is a function of time t. Suppose that we use X, with no subscripts, as shorthand for all the Xi variables. Since the hazard function may depend on t and X, we now need to use the notation h(t, X). The idea behind the Cox model is to express h(t, X) as the product of two parts: one that depends on t only and another that depends on the Xi only. In symbols, the basic model is

h(t, X) = h0(t) exp(β1X1 + β2X2 + ··· + βPXP)

where h0(t) does not depend on the Xi. If all Xi's are zero, then the second part of the equation would be equal to 1 and h(t, X) reduces to h0(t). For this reason, h0(t) is sometimes called the baseline hazard function.
In order to further understand the model, suppose that we have a single explanatory variable X1, such that X1 = 1 if the subject is from group 1 and X1 = 0 if the subject is from group 2. For group 1, the hazard function is
h(t, 1) = h0(t) exp(β1)

Similarly, for group 2, the hazard function is

h(t, 0) = h0(t) exp(0) = h0(t)

The ratio of these two hazard functions is

h(t, 1)/h(t, 0) = exp(β1)

which is a constant that does not depend on time. In other words, the hazard function for group 1 is proportional to the hazard function for group 2. This property motivated D.R. Cox, the inventor of this model, to call it the proportional hazards regression model.
Another way to understand the model is to think in terms of two individuals in the study, each with a given set of values of the explanatory variables.
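As an illustrative sketch only, a Cox proportional hazards model of this form could be fit in Python roughly as follows; the lifelines package and the simulated two-variable data set are assumptions, not the BMDP 2L analysis reported below.

# Hypothetical Cox proportional hazards fit on simulated data; illustrative only.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({"Staget": rng.integers(0, 2, n), "Treat": rng.integers(0, 2, n)})
baseline = rng.exponential(1000, n)                     # baseline survival times in days
event_time = baseline * np.exp(-0.5 * df["Staget"])     # large tumors shorten survival (higher hazard)
censor_time = rng.uniform(200, 2500, n)
df["time"] = np.minimum(event_time, censor_time)
df["death"] = (event_time <= censor_time).astype(int)

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="death")
cph.print_summary()            # each coefficient is a log hazard ratio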
The Cox model assumes that the ratio of the hazard functions for these two individuals, say h1(t)/h2(t), is a constant that does not depend on time. The reader should note that it is difficult to check this proportionality assumption directly or visually in practice. An indirect method to check this assumption is to study a plot of log(−log S(t)) versus t for different values of a given explanatory variable. This plot should exhibit an approximately constant difference between the curves corresponding to the levels of the explanatory variable.
Packaged programs use the maximum likelihood method to obtain estimates of the parameters and their standard errors. Some programs also allow variable selection by stepwise procedures similar to those we described for other models. One such program is BMDP 2L, which we have used to analyze the same lung cancer survival data described earlier. In this analysis we did not use the stepwise feature but rather obtained the output for the same variables as in section 13.6. The results are given in Table 13.4.
Note that all except one of the signs of the coefficients in Table 13.4 are the opposites of those in Table 13.3. This is due to the fact that the Cox model describes the hazard function, whereas the log-linear model describes survival time. The reversal in sign indicates that a long survival time corresponds to low hazard and vice versa. To check the assumption of proportionality, we obtained a plot of log(−log S(t)) versus t for small and large tumors (variable Staget). The results, given in Figure 13.8, show that an approximately constant difference exists between the two curves, making the proportionality assumption a reasonable one. For this data set, it is reasonable to interpret the data in terms of the results of either this or the log-linear model. In the next section, we compare the two models with each other as well as with the logistic regression model presented in Chapter 12.

13.8 SOME COMPARISONS OF THE LOG-LINEAR, COX AND LOGISTIC REGRESSION MODELS

Log-linear versus Cox

In Tables 13.3 and 13.4, the estimates of the coefficients from the log-linear and Cox models are reported for the lung cancer data.

Table 13.4 Cox's model for lung cancer data: results
Variable     Estimate    Standard error    Two-sided P value
Staget         0.54          0.14               <0.01
Perfbl         0.53          0.19               <0.01
Poinf          0.67          0.28               <0.025
Treat          0.07          0.14               >0.50
Figure 13.8 Computer-Generated Graph of log(−log S(t)) Versus t for Lung Cancer Data (A = Large Tumor, B = Small Tumor). (Horizontal axis: days, 0 to 3200.)
As noted in the previous section, the coefficients in the two tables have opposite signs except for the variable Treat, which was nonsignificant. Also, in this particular example the magnitudes of the coefficients are not too different. For example, for the variable Staget, the magnitude of the coefficient was 0.59 for the log-linear model and 0.54 for the Cox model. One could ask, is there some general relationship between the coefficients obtained by using these two different techniques?
An answer to this question can be found when the Weibull distribution is assumed in fitting the log-linear model. For this distribution, the population coefficients B have the following relationship

B(Cox) = −(Shape) × B(log-linear)
where (Shape) is the value of the shape parameter α. For a mathematical proof of this relationship, see Kalbfleisch and Prentice (1980, page 35). The sample coefficients will give an approximation to this relationship. For example, the sample estimate of shape is 0.918 for the lung cancer data and −(0.918)(−0.59) = +0.54, satisfying the relationship exactly for the Staget variable. But for Perfbl, (−0.918)(−0.60) = +0.55, close but not exactly equal to the +0.53 given in Table 13.4.
If one is trying to decide which of these two models to use, then there are several points to consider. A first step would be to determine if either model fits the data, by obtaining the plots suggested in sections 13.5 and 13.7:

1. Plot log(−log S(t)) versus t for the Cox model separately for each category of the major dichotomous independent variables to see if the proportionality assumption holds.
2. Plot log(−log S(t)) versus log(t) to see if a Weibull fits, or plot −log S(t) versus t to see if a straight line results, implying an exponential distribution fits. (The Weibull and exponential distributions are commonly used for the log-linear model.)
In practice, these plots are sometimes helpful in discriminating between the two models, but often both models appear to fit equally well. This is the case with the lung cancer data. Neither plot looks perfect and neither is hopelessly poor.

Cox versus logistic regression
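As a hedged illustration of how such diagnostic curves might be produced outside the packages discussed in section 13.9, the sketch below estimates S(t) by the Kaplan-Meier method in each group and plots log(−log S(t)) against t; the lifelines and matplotlib packages and the simulated data are assumptions.

# Illustrative proportionality-check plot: roughly parallel curves suggest proportional hazards.
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

def log_minus_log_curve(times, events):
    kmf = KaplanMeierFitter().fit(times, event_observed=events)
    s = kmf.survival_function_.iloc[:, 0]
    s = s[(s > 0) & (s < 1)]                  # log(-log S) is undefined at S = 0 or S = 1
    return s.index.values, np.log(-np.log(s.values))

rng = np.random.default_rng(2)
small, small_evt = rng.exponential(900, 80), np.ones(80, dtype=int)
large, large_evt = rng.exponential(500, 80), np.ones(80, dtype=int)

for label, t, e in [("small tumor", small, small_evt), ("large tumor", large, large_evt)]:
    x, y = log_minus_log_curve(t, e)
    plt.plot(x, y, label=label)
plt.xlabel("time (days)")
plt.ylabel("log(-log S(t))")
plt.legend()
plt.show()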
In Chapter 12, logistic regression analysis was presented as a method of classifying individuals into one of two categories. In analyzing survival data, one possibility is to classify individuals as dead or alive after a fixed duration of follow-up. Thus we create two groups: those who survived and those who did not survive beyond this fixed duration. Logistic regression can be used to analyze the data from those two groups with the objective of estimating the probability of surviving the given length of time.
In the lung cancer data set, for example, the patients could be separated into two groups: those who died in less than one year and those who survived one year or more. In using this method, only patients whose results are known, or could have been known, for at least one year should be considered for analysis. These are the same patients that could be used to compute a one-year survival rate as discussed in section 13.3. Thus, excluded would be all patients who enrolled less than one year from the end of the study (whether they lived or died) and patients who had a censored survival time of less than one year. Logistic regression coefficients can be computed to describe the relationship between the independent variables and the log odds of death.
Several statisticians (Ingram and Kleinman, 1989; Green and Symons, 1983; Mickey, 1985) have examined the relationships between the results obtained from logistic regression analysis and those from the Cox model. Ingram and Kleinman (1989) assumed that the true population distribution followed a Weibull distribution with one independent variable that took on two values, and that the two values could be thought of as group membership. Using a technique known as Monte Carlo simulation, the authors obtained the following results.
(same sign and magnitude) for logistic regression and the Cox model when the patients were followed for a short time. They classified the cases as alive if they survived the follow-up period. Note that for a short time period, relatively few patients die. The range of survival times for those who die would be less than for a longer period. As the length of follow-up increased, the logistic regression coefficients increased in magnitude but those for the Cox model stayed the same. The magnitude of the standard error of the coefficients was somewhat larger for the logistic regression model. The estimates of the standard error for the Cox model decreased as the follow-up time increased. 2 · Censoring. The logistic regression coefficients became very biased when there was greater censoring in one group than in the other, but the regress.ion coefficients from the Cox model remained unbiased (50% censonng was used). 3 · ~;~n .the proportio~ dying is .small (10%. a~d 19% in the two groups), the sttmate~ regre.ssto? coefficients were simtlar for the two methods. As tnodproportton dymg mcreased, the regression coefficients for the Cox re el ~tayed the same and their standard errors decreased. The logistic 4. \}resston coefficients increased as the proportion dying increased. ~vunor · · . . · · effect VIolatton.s of the proportiOnal hazards assumptton had only a small viol ?n the estimates of the coefficients for the Cox model. More serious or t~b~ns resulted in bias when the proportion dying is large or the effect e Independent variable is large.
324
Regression analysis using survival data
Green and Symons (1983) have concluded that the regression coeffi . for the logistic regression and the Cox model will be similar when the outCients event is unco~m.on, the eff~ct of the inde~endent variab~es is weak, an~o~e follow~ up penod IS short. Mickey (1985) denved an approXImate mathem t' he relationship between the formulas for the two sets of population coeffi~ teal and concluded that if less than 20% die, then the coefficients should a ents within 12%, all else being equal. gree The use of the Cox model is preferable to logistic regression if one wish to compare results with other researchers who chose different follow- es periods. Logistic regression coefficients may vary with follow-up time up proportion dying, and therefore use of logistic regression requires care ~~ describing the results and in making comparisons with other studies. For a short follow-up period of fixed duration (when there is an obvious choice of the duration), some researchers may find logistic regression simpler to interpret and the options available in logistic regression programs useful. If it is thought that different independent variables best predict survival at different time periods (for example, less than two months versus greater than two months after an operation), then a separate logistic regression analysis on each of the time periods may be sensible. 13.9 DISCUSSION OF COMPUTER PROGRAMS Table 13.5 summarizes the options available in the five statistical packages that perform either or both of the regression analyses presented in this chapter. Note that some survival analysis programs do not emphasize the regression modeling discussed here; Instead, they perform life table analysis, make plots of the hazard and survival function, and test if two or more survival distribution are equal. Such programs are often useful to run in conjuncti~n with the programs discussed in this section. Examples are LIFETEST tn SAS, 1L in BMDP, SURVIVAL in SPSS, ltable or survival in STATA and Life table in STATISTICA. Before entering the data. for regression an.alysis of s.urvival data, the u~~~ should check what the optwns are for entenng dates tn the program. M f 0 programs have been designed so that the user can simply type the dat~ e the starting and end points and the program will compute the length of tl~e until the event such as death occurs or until the case is censored. Bust e · form of the date that th e program allows.uldombe user has to follow the precise have several options and others are quite restricted. The manual sbo ·n the checked to avoid later problems. . LIFEREG from SAS was used to obtain the log-linear coeffictents; fault cancer example. The program uses the Weibull distribution as the ernelY option. In addition a wide range of distributions can be assume~, ~a fhe the exponential, log-normal, log-logistic, gamma, normal and logisuc.
Table 13.5 Summary of computer output for survival regression analysis Output
BMDP
SAS
STATA
STATISTICA
Data entry with dates Cox's proportional hazards model Log-linear or accelerated failure model Number of censored cases Graphical goodness-of-fit (residuals) Covariance matrix of coefficients Coefficients and standard errors Significance test coefficients Stepwise available Weights of cases Quantiles Diagnostic plots Time dependent covariates Competing risk analysis
2L 2L 2L 2L 2L 2L 2L 2L
Informat
dates cox weibull tabs urn predict correlate cox, weibull cox, weibull swcox, swweib fweight, aweights
Survival Cox Survival Survival Survival Survival Survival Survival
survival cox
Survival
a Supplementary
module.
LIFEREG LIFEREG PLOT LIFEREG LIFEREG LIFEREG
2L
2L 2L 2L
LIFE REG LIFE REG LIFETEST
SYSTATa Survival Survival Survival Survival Survival Survival Survival
Survival Surviv~l
Survival
326
Regression analysis using survival data
INFORMA T statement should be checked to find the options for date T . program can use data that are either censored on the left or right. s. his BMDP 2L was used for Cox's proportional hazards model. 2L incl d both Cox's and the log-linear model (Weibull, exponential, log-normal u es log-logistic dist~buti?ns). Stepwise fea~ures a~e available for both mod:~: Plots of censonng ttme and other dtagnostlc plots are available M · advan~d f~atures o~ 2L incl~de the use o~ time-varying covaria~es a~~ competmg nsk analysts ~Kalbfleisch and Prentice, 1980; Cox and Oakes, 19S4). STATA also has a wtde range of features. The program can handle date It has both Cox's regression analysis and the log-linear model (Weib~· exponential and normal (see cnreg) distributions). For a specified model. i~ computes estimated hazard ratios (odds ratios) for a one-unit change in ;he independent variable. Stepwise regression is also available for these models and time-varying covariates can be included for Cox's model. STATISTICA also can compute either the Cox or log-linear (exponential, log-normal or normal distribution) model. The user first chooses Survival Analysis, then Regression Models, and then scrolls to the model of choice. Survival curves are available in the Kaplan-Meier or Life Table options listed under Survival Analysis. The Survival program in SYSTAT is included in a supplementary module. This module includes the usual life table analysis in addition to Cox's model and the log-linear model. 13.10 WHAT TO WATCH OUT FOR In addition to the remarks made in the regression analysis chapters, several potential problems exist for survival regression analysis. 1. If subjects are entered into the st~dy over a long.time period, th~ research~ needs to check that those entenng early are hke those entenng later. this is not the case, the sample may be a mixture of persons from different populations. For example, if the sample was taken from employ~:; records 1960-80 and the employees hired in the 1970s were substantt on different from those hired in the 1960s, putting everyone back to a comro of starting point does not take into account the change in the ty~e rrn employees over time. It may be necessary to stratify by year and ~e~·~ate 1 separate analyses, or to use a dummy independent variable to lfl the two time periods. . occurs 2. The analyses described in this chapter assume that censonng teristic . charac mdependently of any poss1'ble treatment eftiect or subiect 'J • • t relates (including survival time). If persons with a given charactensuc tha ly tbeO to the independent variables, including treatment, are censor:d ~he~ever this assumption is violated and the above analyses are not valtd.
Summary
3.
4.
_ 5.
327
nsiderable censoring exists, it is recommended to check this assumption ~; comparing the ~a~tems of censoring among subgroups of .subjects ~ith different chara~tensttcs. Programs. such as BMDP lL provtde graphical output illustratmg when the censonng occurs. A researcher ca~ also ~eve~se the roles of censored and non-censored values by temporanly swttchmg the labels and calling the censored values 'deaths'. The researcher can then use the methods described in this chapter to study patterns in time to 'death', in subgroups of subjects. If such examination reveals that the censoring pattern is not random, then other methods need to be used for analysis. Unfortunately, the little that is known about analyzing non-randomly censored observations is beyond the level of this book (e.g. Kalbfleisch and Prentice, 1980). Also, to date such analyses have not been incorporated in general statistical packages. _ In comparisons among studies it is critical to have the starting and end points defined in precisely the same way in each study. For example, death is a common end point for medical studies of life-threatening diseases such as cancer. But there is often variation in how the starting point is defined. It may be time of first diagnosis, time of entry into the study, just prior to an operation, post operation, etc. In medical studies, it is suggested that a starting point be used that is similar in the course of the disease for all patients. If the subjects are followed until the end point occurs, it is important that careful and comprehensive follow-up procedures be used. One type of subject should not be followed more carefully than another, as differential censoring can occur. Note that although the survival analysis procedures allow for censoring, more reliable results are obtained if there is less censoring. The investigator should be cautious concerning stepwise results with small sample sizes or in the presence of extensive censoring.
SUMMARY
~~this chapter, we presented two methods for performing regression analysis
as~= the depende~tv~riable was time until the occurrence of an event, such
log-1' ath or termmatton of employment. These methods are called the haz IUear or accelerated failure time model, and the Cox or proportional sornards model. One special feature of measuring time to an event is that e ofthe sub'~e~ts may not be followed long e~ough for the event to occur, so the Censo ~ are classified as censored. The regressiOn analyses allow for such or th nng as long as it is independent of the characteristics of the subjects arnone treatment they receive. In addition, we discussed the relationship tnode~ the results obtained when the log-linear, Cox or logistic regression s are used.
Further information on survival analysis at a modest mathematical level can be found in Allison (1984) or Lee (1992). Applied books also include Harris and Albert (1991) and Yamaguchi (1991). For those interested in industrial applications, Nelson (1982) is highly recommended.

REFERENCES

References preceded by an asterisk require strong mathematical background.
Allison, P.D. (1984). Event History Analysis, #46. Sage, Newbury Park, CA.
*Cox, D.R. and Oakes, D. (1984). Analysis of Survival Data. Chapman & Hall, London.
Gail, M.H., Eagan, R.T., Feld, R., Ginsberg, R., Goodell, B., Hill, Holmes, E.C., Lukeman, J.M., Mountain, C.F., Oldham, R.K., Pearson, F.G., Wright, P.W., Lake, W.H. and the Lung Cancer Study Group (1984). Prognostic factors in patients with resected stage 1 non-small lung cancer. Cancer, 9, 1802-13.
Green, M.S. and Symons, M.J. (1983). A comparison of the logistic risk function and the proportional hazards model in prospective epidemiologic studies. Journal of Chronic Diseases, 36, 715-24.
Harris, E.K. and Albert, A. (1991). Survivorship Analysis for Clinical Studies. Marcel Dekker, New York.
Ingram, D.D. and Kleinman, J.C. (1989). Empirical comparison of proportional hazards and logistic regression models. Statistics in Medicine, 8, 525-38.
*Kalbfleisch, J.D. and Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York.
Lee, E.T. (1992). Statistical Methods for Survival Data Analysis, 2nd edn. Lifetime Learning, Belmont, CA.
Mickey, M.R. (1985). Multivariable analysis of one-year graft survival, in Clinical Kidney Transplants (ed. P.I. Terasaki). UCLA Tissue Typing Laboratory, Los Angeles.
*Miller, R.G. (1981). Survival Analysis. Wiley, New York.
Nelson, W. (1982). Applied Life Data Analysis. Wiley, New York.
Yamaguchi, K. (1991). Event History Analysis. Sage, Newbury Park, CA.
FURTHER READING

Abbott, R.D. (1985). Logistic regression in survival analysis. American Journal of Epidemiology, 121, 465-71.
*Lawless, J.F. (1982). Statistical Models and Methods for Lifetime Data. Wiley, New York.
London, D. (1988). Survival Models and their Estimation. ACTEX, Winsted, CT.
Slonim-Nevo, V. and Clark, V.A. (1989). An illustration of survival analysis: factors affecting contraceptive discontinuation among American teenagers. Social Work, 25, 7-19.
PROBLEMS
The following problems all refer to the lung cancer data given in Appendix B and described in Table 13.1.
13.1 (a) Find the effect of Stagen and Hist upon survival by fitting a log-linear model. Check any assumptions and evaluate the fit using the graphical methods described in this chapter.
     (b) What happens in part (a) if you include Staget in your model along with Stagen and Hist as predictors of survival?
13.2 Repeat Problem 13.1, using a Cox proportional hazards model instead of a log-linear model. Compare the results.
13.3 Do the patterns of censoring appear to be the same for smokers at baseline, ex-smokers at baseline, and nonsmokers at baseline? What about for those who are smokers, ex-smokers, and nonsmokers at follow-up?
13.4 Assuming a log-linear model for survival, does smoking status (i.e. the variables Smokbl and Smokfu) significantly affect survival?
13.5 Repeat Problem 13.4 assuming a proportional hazards model.
13.6 Assuming a log-linear model, do the effects of smoking status upon survival change depending on the tumor size at diagnosis?
13.7 Repeat Problem 13.6 assuming a proportional hazards model.
13.8 Define a variable Smokchng that measures change in smoking status between baseline and follow-up, so that Smokchng equals 0 if there is no change and 1 if there is a change.
     (a) Is there an association between Staget and Smokchng?
     (b) What effect does Smokchng have upon survival? Is it significant? Is the effect the same if the smoking status variables are also included in the model as independent variables?
     (c) How else might you assign values to a variable that measures change in smoking status? Repeat part (a) using the new values.
(b) What hap~ens in pa~t (a) if you i~clude Staget in your model along with Stagen and Htst as predtctors of survival? Repeat Problem 13.1, using a Cox proportional hazards model instead of a log-linear. Compare the results. Do the patterns of censoring appear to be the same for smokers at baseline, ex-smokers at baseline, and nonsmokers at baseline? What about for those who are smokers; ex-smokers, and nonsmokers at follow-up? Assuming a log-linear model for survival, does smoking status (i.e. the variables Smokbl and Smokfu) significantly affect survival? Repeat Problem 13.4 assuming a proportional hazards model. Assuming a log-linear model, do the effects of smoking status upon survival change depending on the tumor size at diagnosis? Repeat Problem 13.6 assuming a proportional hazards model. Define a variable Smokchng that measures change in smoking status between baseline and follow-up, so that Smokchng equals 0 if there is no change and 1 if there is a change. (a) Is there art association between Staget and· Smokchng? (b) What effect does Smokchng have upon survival? Is it significant? Is the effect the same if the smoking status variables are also included in the model as independent variables? (c) How else might you assign values to a variable that measures change in smoking status. Repeat part (a) using the new values.
14 Principal components analysis 14.1 USING PRINCIPAL COMPONENTS ANALYSIS TO UNDERSTAND INTERCORRELATIONS Principal components analysis is used when a simpler representation is desired for a set of intercorrelated variables. Section 14.2 explains the advantages of using principal components analysis and section 14.3 gives a hypothetical data example used in section 14.4 to explain the basic concepts. Section 14.5 contains advice on how to determine the number of components to be retained, how to transform coefficients to compute correlations, and how to use standardized variables. It also gives an example from the depression data set. Section 14.6 discusses some uses of principal components analysis including how to handle multicollinearity in a regression model. Section 14.7 summarizes the output of the statistical computer packages and section 14.8 lists what to watch out for.
14.2 WHEN IS PRINCIPAL COMPONENTS ANALYSIS USED?

Principal components analysis is performed in order to simplify the description of a set of interrelated variables. In principal components analysis the variables are treated equally, i.e. they are not divided into dependent and independent variables, as in regression analysis.
The technique can be summarized as a method of transforming the original variables into new, uncorrelated variables. The new variables are called the principal components. Each principal component is a linear combination of the original variables. One measure of the amount of information conveyed by each principal component is its variance. For this reason the principal components are arranged in order of decreasing variance. Thus the most informative principal component is the first, and the least informative is the last (a variable with zero variance does not distinguish between members of the population).
An investigator may wish to reduce the dimensionality of the problem, i.e. reduce the number of variables without losing much of the information. This objective can be achieved by choosing to analyze only the first few principal components. The principal components not analyzed convey only a small amount of information since their variances are small. This technique is attractive for another reason, namely that the principal components are not intercorrelated. Thus instead of analyzing a large number of original variables with complex interrelationships, the investigator can analyze a small number of uncorrelated principal components.
The selected principal components may also be used to test for their normality. If the principal components are not normally distributed, then neither are the original variables. Another use of the principal components is to search for outliers. A histogram of each of the principal components can identify those individuals with very large or very small values; these values are candidates for outliers or blunders.
In regression analysis it is sometimes useful to obtain the first few principal components corresponding to the X variables and then perform the regression on the selected components. This tactic is useful for overcoming the problem of multicollinearity since the principal components are uncorrelated (Chatterjee and Price, 1991).
Principal components analysis can also be viewed as a step toward factor analysis (Chapter 15). Principal components analysis is considered to be an exploratory technique that may be useful in gaining a better understanding of the interrelationships among the variables. This idea will be discussed further in section 14.5.
The original application of principal components analysis was in the field of education testing. Hotelling (1933) developed this technique and showed that there are two major components to responses on entry-examination tests: verbal and quantitative ability. Principal components and factor analysis are also used extensively in psychological applications in an attempt to discover underlying structure. In addition, principal components analysis has been used in biological and medical applications (Seal, 1964; Morrison, 1976).
14.3 DATA EXAMPLE

The depression data set will be used later in the chapter to illustrate the technique. However, to simplify the exposition of the basic concepts, we generated a hypothetical data set. These hypothetical data consist of 100 random pairs of observations, X1 and X2. The population distribution of X1 is normal with mean 100 and variance 100. For X2 the distribution is normal with mean 50 and variance 50. The population correlation between X1 and X2 is 1/2^(1/2) ≈ 0.707. Figure 14.1 shows a scatter diagram of the 100 random pairs of points. The two variables are denoted in this graph by
Figure 14.1 Computer-Generated Scatter Plot of VARIABLE 3 (NORM2) Versus VARIABLE 2 (NORM1) for the 100 Hypothetical Data Points.
Table 14.1 Sample statistics for hypothetical data set

Statistic              NORM1      NORM2
N                        100        100
Mean                  101.63      50.71
Standard deviation     10.47       7.44
Variance              109.63      55.44
NORMl and NORM2. The sample statistics are shown in Table 14.1. The sample correlation is r = 0.757. The advantage of this data set is that it consists of only two variables, so it is easily plotted. In addition, it satisfies the usual normality assumption made in statistical theory.
14.4 BASIC CONCEPTS OF PRINCIPAL COMPONENTS ANALYSIS

Again, to simplify the exposition of the basic concepts, we present first the case of two variables X1 and X2. Later we discuss the general case of P variables. Suppose that we have a random sample of N observations on X1 and X2. For ease of interpretation we subtract the sample mean from each observation, thus obtaining
x1 = X1 − mean(X1)  and  x2 = X2 − mean(X2)

Note that this technique makes the means of x1 and x2 equal to zero but does not alter the sample variances S1² and S2² or the correlation r. The basic idea is to create two new variables, C1 and C2, called the principal components. These new variables are linear functions of x1 and x2 and can therefore be written as

C1 = a11x1 + a12x2
C2 = a21x1 + a22x2
We note that for any set of values of the coefficients a11, a12, a21, a22, we can introduce the N observed x1 and x2 and obtain N values of C1 and C2. The means and variances of the N values of C1 and C2 are

mean C1 = mean C2 = 0
Var C1 = a11²S1² + a12²S2² + 2a11a12rS1S2

and

Var C2 = a21²S1² + a22²S2² + 2a21a22rS1S2
where Si² = Var Xi. The coefficients are chosen to satisfy three requirements:

1. Var C1 is as large as possible.
2. The N values of C1 and C2 are uncorrelated.
3. a11² + a12² = a21² + a22² = 1.

The mathematical solution for the coefficients was derived by Hotelling (1933). The solution is illustrated graphically in Figure 14.2.
Figure 14.2 Illustration of Two Principal Components C1 and C2.
335
onents analysis amounts to rotating the original x 1 and x 2 axes to new cot11Pnd c2 axes. The angle of rotation is determined uniquely by the Ct \ernents just stated. For a given point x 1, x 2 (Figure 14.2) the values of requ~d c are found by drawing perpendicular lines to the new C 1 and C2 C1 a TheN values of C 1 thus obtained will have the largest variance according a:e~quirement 1. !heN values of C 1 and C 2 ~11 ~ave a zero correlation. t our hypothetical data example, the two pnncipal components are .n 1 c1 = 0.841xl + 0.539x2 c2
= - 0.539x1 + 0.841x2
where x 1 = NORM1 - mean(NORM1) and x2
= NORM2 - mean(NORM2)
Note that 0.841 2 + 0.539 2 = 1 and ( -0.539) 2 + 0.841 2 = 1 as required by requirement 3 above. Also note that, for the two-variable case only, au = a22 and a12 = -a2t· Figure 14.3 illustrates the x 1 and x 2 axes (after subtracting the means) and ~he rotated cl and c2 principal component axes. Also drawn in the graph Is an ellipse of concentration of the original bivariate normal distribution. ~he variance of C1 is 147.44, and the variance of C2 is 17.59. These two ~~nances are commo~ly known as the first an~ ~econd eigenvalues, respectively v~:onyms usedfor eigenvalue are characte~Istlc ro~t, latent roo!, and p~op~r e . e). Note that the sum of these two vanances Is 165.03. This quantity IS th~al to the sum of the original two variances (109.63 + 55.40 = 165.03). rotls ~esult will always be the case, i.e. the total variance is preserved under of ~~on ?f the principal co~pon~nts. Note also that the_ lengths of the axes stand elhpse _of concentratiOn (Figure 14.3) are proportiOnal to the sample are th~rd deviations of C 1 ~nd C 2, respe~tively. These ~tandard devia~ions 14. 3 th square roots of the eigenvalues. It Is therefore easily seen from Figure than .at C 1 has a larger variance than C 2. In fact, C 1 has a larger variance lh etther of the original variables x 1 and x 2 • Bach ese ?as.ic ideas are easily extended to the case of P variables x 1, x 2 , ••• , Xp. Pnnc1pal component is a linear combination of the x variables.
336
Principal components analysis
Xz
110
120
130
140
NORM I Figure 14.3 Plot of Principal Components for Bivariate Hypothetical Data.
Coefficients of these linear combinations are chosen to satisfy the following three requirements: VarC 2 ~ · • · ~ Var Cp. 2. The values of any two principal components are uncorrelated. 3. For any principal component the sum of the squares of the coefficients is one; 1. Var C 1
~
In other words, C 1 is the linear combination with the largest variance. Subject to the condition that it is uncorrelated with C1 , C2 is the linear combination with the largest variance. Similarly, C 3 has the largest variance subject to the condition that it is uncorrelated with C 1 and C2; etc. Thel . . . · · 1 tota Var Ci are the eigenvalues. These P vanances add up to the ong1na . r variance. In some packaged programs the set of coefficients of ~he hn~:r combination for the ith principal component is called the ith e1genvec (also known as the characteristic or latent vector).
14.5 INTERPRETATION · . ed for In this section we discuss how many components should be retatn iableS· further analysis and we present the analysis for standardized x var Application to the depression data set is given as an example.
Interpretation .
NoJJl
337
ber of components retained
.
entioned earlier, one of the objectives of prin~ipal components analysis ~s ~uction of dimensionality. Since the principal components are arranged ~s ~ecreasing order of variance, we may select the first few as representatives :the original set o~ v.ariables. The n~mber of com~onents sele~ted may be determined by exammmg ~he proport~on of total van~nce e~pla.med. by each component. !he cumulative. proport~on _oftot~l vanance m_dicates to_ the investigator JUSt how much informatiOn Is retamed by selectmg a specified umber of components. 0 In the hypothetical example the total variance is 165.03. The variance of the first component is 147.44, which is 147.44/165.03 = 0.893, or 89.3%, of the total variance. It can be argued that this amount is a sufficient percentage of the total variation, and therefore the first principal component is a reasonable representative of the two original variables NORM1 and NORM2 (see Morrison, 1976, or Eastment and Krzanowski, 1982, for further discussion). Various rules have been proposed for deciding how many components to retain but none ofthem appear to work well in all circumstances. Nevertheless, they do provide some guidance on this topic. The discussion given above relates to one rule, that of keeping a sufficient number of principal components to explain a certain percentage of the total variance. One common cutoff point is 80%. If we used that cutoff point, we would only retain one principal component in the above example. But the use of that rule may not be satisfactory. You may have to retain too many principal components each of which explaining only a small percentage of the variation to reach the 80% limit. Or there may be a principal component you wish to retain that puts you over the cutoff point. . Other rules have approached the subject from the opposite end, that is to d~scard principal components with the smallest variances. Dunteman (1989) ~Iscu~ses several of these rules. One is to discard all components that have vanance less than 70j P percent of the total variance. In our example, that ~fou~d be 35% of the total variance of 57.1 which is larger than the variance co t e second component or 17.59 so we would not retain the second va~Ponent. Other statisticians use 100/P instead of 70/P percent of the total nance · Oth . . that . er vanattons of t h'Is type of rule are not to keep any components 'ran~~~,atn s~~ll p~oportions of the variance since they may represent ~imply Prin . vanatwn m the data. For example, some users do not retain any O~~Palcomponent that has a variance of less than 5% of the total variance. axis ers advocateplottingtheprincipal componentnumberon the horizontal an exversus the individual eigenvalues on the vertical axis (Figure 14.5(b) for lines ~~P.le using the depression data). The idea is to use a cutoff point where and /~In~ng consecutive points are relatively steep left of the cutoff point e atiVely flat right of the cutoff point. Such plots are called scree plots.
338
Principal components analysis
In practice this sometimes works and other times there is no clea h in the slope. r c ange Since principal components analysis is often used as an exploratory ~any investig~tors a~gue for not taking any of the above rules seriou~ethoct, mstead they will retam as many components as they can either inte y, and ·rpret or are useful in future analyses. Transforming coefficients to correlations To interpret the meaning of the first principal component, we recall that it was expressed as
The coefficient 0.841 can be transformed into a correlation between x 1 and C 1 . In general, the correlation between the ith principal component and the jth x variable is au(Var Ci) 112 rii = (Var xi) 112 where aii is the coefficient of xi for the ith principal component. For example, the correlation between c1 and x1 is . 0.84(147.44) 112 ru = (109.63)1/2 = 0.975 and the correlation between C 1 and x 2 is . 1/2 = 0.539(147.44) = 0 880 r12 (55.40)1/2 . Note that both of these correlations are fairly high and positive. As. can ~ seen from Figure 14.3, when either x 1 or x 2 increases, so will C1. This resunt occurs often in principal components analysis whereby the first compone is positively correlated with all of the original variables. Using standardized variables
. 1 prior to Investigators frequently prefer to standardize the x ~an~b e~ hieved performing the principal components analysis. StandardiZatto~ IS aciysis is by dividing each variable by its sample standard deviation. Thts ana riaoce 0 then equivalent to analyzing the correlation matrix instead of the ~ "~atriX, matrix. When we derive the principal components from the correlatton the interpretation becomes easier in two ways.
Interpretation
339
The total variance is s~m~ly the number o~variables P, and ~he p~oportion 1. explained by each pnncipal component IS the correspondmg eigenvalue divided by P. correlation between the ith principal component Ci and the jth z. The . . variable IS X} IS rii
= aii(Var Ci) 112
Therefore for a given Ci we can compare the aii to quantify the relative degree of dependence of Ci on each of the standardized variables. This correlation is called the factor loading in some computer programs.
In our hypothetical example the correlation matrix is as follows:
0.757 1.000
1.000 0.757
Here S 1 and S 2 are the standard deviations of the first and second variables. Analyzing this matrix results in the following two principal components:
c1
= O70
.
7
which explains 1.757/2 x 100
C2
NORM1
s1
= 87.8%
= O707 NORM1 •
s1
O
+ .707
NORM2
s2
of the total variance, and _ O
.707
NORM2
s2
explaining 0.243/2 x 100 = 12.2% of the total variance. So the first principal Component is equally correlated with the two standardized variables. The sec~nd Principal component is also equally correlated with the two standardized vanabies, but in the opposite direction. b T:e case of two standardized variables is illustrated in Figure 14.4. Forcing standard deviations to be one causes the vast majority of the data to bot e co t . . Ill named In the square shown in the figure (e.g. 99.7% of normal data ofut~:an Within plus o~ minus three standar~ d~viations of the me~n). Be.cause dire . symmetry of thiS square, the first pnncipal component will be m the 1 ctton of the 45° line for the case of two variates. co~ terms of the original variables NORM1 and NORM2, the principal Ponents based on the correlation matrix are
c, "'0.707NO~Ml + 0.707NO~M2 = 0.707N~~l + 0.707N~.~:·1:2 ~ 0.0675NORM1 + 0.0953NORM2
340
Principal components analysis
Figure 14.4 Principal Components for Two Standardized Variables.
Similarly, C2
= + 0.0675NORM1- 0.0953NORM2
Note that these results are very different from the principal components obtained from the covariance matrix (section 14.4). This is the case in general. . · ce In fact, there Is no easy way to convert the results based on the covat;t~ matrix into those based on the correlation matrix, or vice versa. The m:aJontY . . . mpensates thell of researc?ers prefer to use the corr~latton matnx becaus~ ~~ ~o for the umts of measurement of the dtfferent variables. But if tt ts used, all interpretations must be made in terms of the standardized variables. Analysis of depression data set al . f 1 data set, the . . I p the N ext, we present a pnnctpa components an ysts o a rea depression data. We select for this example the 20 items that make ~es are CESD scale. Each item is a statement to which the response categofl
Interpretation
341
d·nal. The answer rarely or none of the time (less than 1 day) is coded as or 8 ~me or a little of the time (1-2 days) as 1, occasionally or a moderate O, 0 unt of the time (3-4 days) as 2, and most or all of the time (5-7 days) am3 The values of the response categories are reversed for the positive-affect ;s ~s (see Table 14.2 for a listing of the items) so that a high score indicates ~~elihood of depression. The CESD score is simply a sum of the scores for these 20 items: . . .. . We emphasize that these vanables do not satisfy the assumptions often ade in statistics of a multivariate normal distribution. In fact, they cannot :en be considered to be continuous variables. However, they are typical of what is found in real-life applications. In this example we used the correlation matrix in order to be consistent with the factor analysis we present in Chapter 15. The eigenvalues (variances of the principal components) are plotted in Figure 14.5(b). Since the correlation matrix is used, the total variance is the number of variables, 20. By dividing each eigenvalue by 20 and multiplying by 100, we obtain the percentage of total variance explained by each principal component. Adding these percentages successively produces the cumulative percentages plotted in 14.5(a). These eigenvalues and cumulative percentages are found in the output of standard packaged programs. They enable the user to determine whether and how many principal components should be used. If the variables were uncorrelated to begin with, then each principal component, based on the correlation matrix, would explain the same percentage of the total variance, namely 100/P. If this were the case, a principal components analysis would be unnecessary. Typically, the first principal component explains a much larger percentage than the remaining components, as shown in Figure 14.5. . Ideally, we wish to obtain a small number of principal components, say two or three, which explain a large percentage of the total variance, say 80% or more. In this example, as is the case in many applications, this ideal is not achieved. We therefore must compromise by choosing as few principal co~ponents as possible to explain a reasonable percentage of the total ~~"1an~e. _A rule of thumb adopt~~ by many investigators is to select only v . Pnnctpal components explammg at least 100/P percent of the total an~nce (at least 5% in our example). This rule applies whether the covariance 0 e .co~relation matrix is used. In our example we would select the first ex P_nnctpal components if we followed this rule. These five components 1 th~t am 59% of the total variance, as seen in Figure 14.5. Note, however, Wout~he next two components each explain nearly 5%. Some investigators tot th.erefore select the first seven components, which explain 69% of the 1 vanance a I . Pti~ ~hould be explained that the eigenvalues are estimated variances of the · Clpa} components and are therefore subject to large sample variations.
fi::
342
Principal components analysis
Table 14.2 Principal components analysis for standardized CESD scale · (depression data set) · Items
----
Principal component Item
1
2
3
0.2774 0.3132 0.2678 02436 0.2868 Q.2206
0.1450 -0.0271 0.1547 0.3194 0.0497 -0.0534
4
5
Negative affect 1. I felt that I could not shake off the blues even with the help of my family or friends. 2. I felt depressed. 3. I felt lonely. 4. I had crying spells. 5. I felt sad. 6. I felt fearful. 7. I thought my life had been a failure.
0.2844
0.1644 -0.0190 -0.0761 -0.0870
Positive affect 8. I felt that I was as good as other people. 9. I felt hopeful about the future. 10. I was happy. 11. I enjoyed life.
0.1081 0.1758 0.2766 0.2433
0.3045 0.1103 -Q.5567 -0.0976 0.1690 -0.3962 -0.0146 Qjill 0.0454 -0.0835 0.0084 0.3651 0.1048 -0.1314 0.0414 0.2419
Somatic and retarded activity 12. I was bothered by things that usually don't bother me. 13. I did not feel like eating; my appetite was poor. 14. I felt that everything was an effort. 15. My sleep was restless. 16. I could not 'get going'. 17. I had trouble keeping my mind on what I was doing. 18. I talked less than usual.
0.0577 -0.0027 0.0883 0.0316 0.2478 0.0244 0.0346 0.2472 -0.2183 0.1769 -0.0716 -0.1729 0.1384 0.2794 -0.0411 0.2242 0.1823 -0.3399
0.1790
- 0.2300
0.1634
0.1451
0.0368
0.1259
-0.2126
0.2645 -Q.5400
0.0953
0.1803 0.2004 0.1924
-0.4015 -0.1014 -0.2461 -0.0847 -0.2098 0.2703 0.0312 0. 0834 -0.4174 -0.1850 -0.0467 - 0·0399
0.2097 0.1717
-0.3905 -0.0860 -0.0684 -0.0153 0.2019 -0.0629
Interpersonal 19. People were unfriendly. 20. I felt that people disliked me.
0.1315 0.2357
03349 -0.0569 -0.6326 -0.023 2 - . 909 0.2283 -0.1932 -0.2404~
Eigenvalues or Var Ci Cumulative proportion explained 0.5/(Var CY' 2
7.055 0.353 0.188
1.486 0.427 0.410
1.231 0.489 0.451
1.066 o.542
- 0·049 ~ 275
°·
t.Ol~ 59
°
1
0.48~
Interpretation
•
•
•
•
• •
• • •
343
• • • • • • •
• • • •
0
5
10
15
20
Principal Components Number
a. Cumulative Percentage Q)
::s
~
>
= ....b.o Q)
~
7
•
6 5
4 3 2
• 1
0
• • • • ••• • • • • • 5
10
15
20
Principal Components Number
b. Eigenvalue si~~~ l4.5 Eigenvalues and Cumulative Percentages of Total Variance for Depres-
}1'·
uata.
344
Principal components analysis
Arbitrary cutoff points should thus not be taken too seriously. Ideall investigator should make the selection of the number of principal compoy, the on the basis of some underlying theory. nents Once the number of principal components is selected, the investig t ~hould exa~ne the coefficients defining ea~h of them i~ orde~ to assig: ~~ mterpretatwn to the components. As was discussed earlter, a htgh coeffic· • • bl . . . tent · · ·1 of a p~ncipa component o~ a gtven vana .e ~s an Indication of high correlatiOn between that vanable and the pnncipal component (see th formulas given earlier). Principal components are interpreted in the contex~ of the variables with high coefficients. For the depression data example, Table 14;2 shows the coefficients for the first five components. For each principal component the variables with a correlation greater than 0.5 with that component are underlined. For the sake of illustration this value was taken as a cutoff point. Recall that the correlation rii is aii (Var Ci) 112 and therefore a coefficient aii is underlined if it exceeds 0.5(Var Ci) 1' 2 (Table 14.2). As the table shows, many variables are highly correlated (greater than 0.5) with the first component. Note also that the correlations of all the variables with the first component are positive (recall that the scaling of the response for items 8-11 was reversed so that a high score indicates likelihood of depression). The first principal component can therefore be viewed as a weighted average of most of the items. A high value of C1 is an indication that the respondent had many of the symptoms of depression. On the other hand, the only item with more than 0.5 absolute correlation with C2 is item 16, although items 17 and 14 have absolute correlations close to 0.5. The second principal component, therefore, can be interpreted as .a measure oflethargy or energy. A low value of C2 is an indication of a lethargtc state and a high value of C2 is an indication of a high level of energy. By construction, cl and c2 are uncorrelated. . Similarly, C 3 measures the respondent's feeling toward how others percetve him or her; a low value is an indication that the respondent believes that people are unfriendly. Similar interpretations can be made for the other .tw~ principal components in terms of the items corresponding to the underhne coefficients. · ·d a1 The numerical values of Ci for i = 1 to 5, or 1 to 7, for each ind~t uof can be used in subsequent analyses. For example, we could use ~he v ~e sis C1 instead of the CESD score as a dependent variable in a regression a~ a ~ a . . Problem 8. 1. H a d t he fi rst component explatne b tter such as that gtven m higher proportion of the variance, this procedure might have been e than simply us.ing a sum of the s~ores. . . . . he resultS The depressiOn da~a exampl~ Illustrates a situatiOn ID which ~ figure: are not clear-cut. This conclusiOn may be reached from observUlg ts to 14.5, where we saw that it is difficult to decide how many componen
It is not possible to explain a very high proportion of the total variance
with a small number of principal components. Also, the interpretation of the components in Table 14.2 is not straightforward. This is frequently the case in real-life situations. Occasionally, situations do come up in which the results are clear-cut. For examples of such situations, see Seal (1964), Cooley and Lohnes (1971) or Harris (1975).

In biological examples where distance measurements are made on particular parts of the body, the first component is often called the size component. For example, in studying the effects of operations in young children who have deformed skulls, various points on the skull are chosen and the distances between these points are measured. Often, 15 or more distances are computed. When principal components analysis is done on these distances, the first component would likely have only positive coefficients. It is called the size component, since children who have overall large skulls would score high on this component. The second component would likely be a mixture of negative and positive coefficients and would contrast one set of distances with another. The second component would then be called shape since it contrasts, say, children who have predominant foreheads with those who don't. In some examples, a high proportion of the variance can be explained by a size and shape component, thus not only reducing the number of variables to analyze but also providing insight into the objects being measured.

Tests of hypotheses regarding principal components are discussed in Lawley (1956), Cooley and Lohnes (1971), Jackson and Hearne (1973), Jackson (1991) and Morrison (1976). These tests have not been implemented in the computer programs since the data are often not multivariate normal, and most users view principal components analysis as an exploratory technique.
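As an informal illustration of the cutoff rule described above (not part of the original text), the short Python sketch below computes the correlations aij(Var Ci)^(1/2) between standardized variables and their principal components and flags those exceeding 0.5 in absolute value; the simulated data and variable names are assumptions made only for this example.

import numpy as np

# Illustrative data only: 200 cases on five correlated variables (not the depression items).
rng = np.random.default_rng(0)
M = np.array([[1.0, 0.6, 0.0, 0.0, 0.2],
              [0.0, 1.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.4, 0.0],
              [0.0, 0.0, 0.0, 1.0, 0.7],
              [0.0, 0.0, 0.0, 0.0, 1.0]])
X = rng.standard_normal((200, 5)) @ M

R = np.corrcoef(X, rowvar=False)          # correlation matrix of the variables
var_c, A = np.linalg.eigh(R)              # eigenvalues (Var Ci) and coefficient vectors aij
order = np.argsort(var_c)[::-1]           # sort components by decreasing variance
var_c, A = var_c[order], A[:, order]

loadings = A * np.sqrt(var_c)             # correlation of variable j with component i: aij*sqrt(Var Ci)
print(np.round(loadings, 3))
print(np.abs(loadings) > 0.5)             # entries that would be underlined for interpretation

Equivalently, a coefficient aij is flagged whenever it exceeds 0.5/(Var Ci)^(1/2), which is the cutoff column shown with Table 14.2.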
14.6 USE OF PRINCIPAL COMPONENTS ANALYSIS IN REGRESSION AND OTHER APPLICATIONS

As mentioned in section 9.5, when multicollinearity is present, principal components analysis may be used to alleviate the problem. The advantage of using principal components analysis is that it both helps in understanding which variables are causing the multicollinearity and provides a method for obtaining stable (though biased) estimates of the slope coefficients. If serious multicollinearity exists, the use of these coefficients will result in larger residuals in the data from which they were obtained and smaller multiple correlation than when least squares is used (the same holds for ridge regression). But the estimated standard error of the slopes could be smaller and the resulting equation could predict better in a fresh sample. In Chapter 9 we noted that both BMDP 4R and STATISTICA regression perform ridge regression directly. However, only BMDP 4R performs directly all the computations necessary for principal components regression. SAS REG does
print the eigenvalues, thus alerting the user if principal components regression is sensible, and provides information useful in deciding how many principal components to retain.

So far in this chapter we have concentrated our attention on the first few components that explain the highest proportion of the variance. To understand the use of principal components in regression analysis, it is useful to consider what information is available in, say, the last or Pth component. The eigenvalue of that component will tell us how much of the total variance in the X's it explains. Suppose we consider standardized data to simplify the magnitudes. If the eigenvalue of the last principal component (which must be smallest) is almost one, then the simple correlations among the X variables must be close to zero. With low or zero simple correlations, the lengths of the principal component axes within the ellipse of concentration (Figure 14.3) are all nearly the same and multicollinearity is not a problem. At the other extreme, if the numerical value of the last eigenvalue is close to zero, then the length of its principal axis within the ellipse of concentration is very small. Since this eigenvalue can also be considered as the variance of the Pth component, Cp, the variance of Cp is close to zero. For standardized data with zero mean, the mean of Cp is zero and its variance is close to zero. Approximately, we have

0 = aP1 x1 + aP2 x2 + ... + aPP xP

a linear relationship among the variables. The values of the coefficients of the principal components can give information on this interrelationship. For example, if two variables were almost equal to each other, then the value of Cp might be 0 = 0.707x1 - 0.707x2. For an example taken from actual data, see Chatterjee and Price (1991). In other words, when multicollinearity exists, the examination of the last few principal components will provide information on which variables are causing the problem and on the nature of the interrelationship among them. It may be that two variables are almost a linear function of each other, or that two variables can almost be added to obtain a third variable.

In principal component regression analysis programs, first the principal components of the X or independent variables are found. Then the regression equation is computed using the principal components instead of the original X variables. From the regression equation using all the principal components it is possible to obtain the original regression equation in the X's, since there is a linear relationship between the X's and the C's (section 14.4). But what the special programs mentioned above allow the researcher to do is obtain a regression equation using the X variables where one or more of the last principal components are not used in the computation. Based on the size of the eigenvalue (how small it is) and the possible relationship among the variables displayed in the later principal components, the user can decide
how many components to discard. The default option for 4R uses 0.01 as the cutpoint for eigenvalues. A regression equation in the X's can be obtained even if only one component is kept, but usually only one or two are discarded. The standard error of the slope coefficients when the components with very small eigenvalues are discarded tends to be smaller than when all are kept.

The use of principal component regression analysis is especially useful when the investigator understands the interrelationships of the X variables well enough to interpret these last few components. Knowledge of the variables can help in deciding what to discard.

To perform principal components regression analysis without BMDP 4R, you can first perform a principal components analysis, examine the eigenvalues and discard the components with eigenvalues less than, say, 0.01. Suppose the last eigenvalue whose value is greater than 0.01 is the kth one. Then, compute and save the values of C1, C2, ..., Ck for each case. These can be called principal component scores. Note that programs such as SAS PRINCOMP can do this for you. Using the desired dependent variable, you can then perform a regression analysis with the k Ci's as independent variables. After you obtain the regression analysis, replace the C's by X's using the k equations expressing C's as a function of X's (see section 14.4 for examples of the equations). You will now have a regression equation in terms of the X's. If you begin with standardized variables, you must use standardized variables throughout. Likewise, if you begin with unstandardized variables you must use them throughout.

Principal components analysis can be used to assist in a wide range of statistical applications (Dunteman, 1989). One of these applications is finding outliers among a set of variables. It is easier to see outliers in two-dimensional space in the common scatter diagram than in other graphical output. For multivariate data, if the first two components explain a sizable proportion of the variance among the P variables, then a display of the first two principal component scores in a scatter diagram should give a good graphical indication of the data set. Values of C1 for each case are plotted on the horizontal axis and values of C2 on the vertical axis. Points can be considered outliers if they do not lie close to the bulk of the others. Scatter diagrams of the first two components have also been used to spot separate clusters of cases (see Figure 16.8 (section 16.5) for an example of what clusters look like).

In general, when you want to analyze a sizable number of correlated variables, principal components analysis can be considered. For example, if you have six correlated variables and you consider an analysis of variance on each one, the results of these six analyses would not be independent since the variables are correlated. But if you first do a principal components analysis and come up with fewer uncorrelated components, interpreting the results may be more valid and there would be fewer analyses to do.
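A minimal sketch of the manual principal components regression recipe described above, written in Python with numpy; the simulated data, the variable names and the 0.01 eigenvalue cutoff are illustrative assumptions only, not the BMDP 4R or SAS implementation.

import numpy as np

rng = np.random.default_rng(1)
n = 100
Z = rng.standard_normal((n, 4))
Z[:, 3] = Z[:, 0] + Z[:, 1] + 0.01 * rng.standard_normal(n)     # built-in multicollinearity
y = Z @ np.array([1.0, -0.5, 0.3, 0.2]) + rng.standard_normal(n)

# Work with standardized variables throughout, as recommended in the text.
X = (Z - Z.mean(axis=0)) / Z.std(axis=0)
ys = (y - y.mean()) / y.std()

var_c, A = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(var_c)[::-1]
var_c, A = var_c[order], A[:, order]

keep = var_c > 0.01                              # discard components with near-zero eigenvalues
C = X @ A[:, keep]                               # principal component scores C1, ..., Ck
gamma = np.linalg.lstsq(C, ys, rcond=None)[0]    # regression of y on the retained components

beta = A[:, keep] @ gamma                        # slope coefficients on the standardized X's
print(np.round(beta, 3))

Because one near-collinear direction is dropped, the resulting slope estimates are slightly biased but more stable than least squares estimates computed from all of the components.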
The same would hold true for discriminant function analysis with several groups of highly correlated data. If you use the principal component scores derived from the total data set instead of the original X's, you will have uncorrelated components and their contribution to the discriminant analysis can be evaluated using analysis of variance. Note that the first component may not be the one that is most useful in discriminating among the groups or between subsets of the groups, since group membership is a separate variable from the others.

Similarly, principal components analysis can be performed on the dependent and independent variables separately prior to canonical correlation analysis. If each set of variables is highly intercorrelated, then the two principal component analyses can reduce the number of variables in each set and could possibly result in a canonical correlation analysis that is easier to interpret. But if the correlations among a set of variables are near zero, there is no advantage in performing principal components analysis since you already have variables with low intercorrelations and you would need to keep a lot of components to explain a high proportion of the variance.

14.7 DISCUSSION OF COMPUTER PROGRAMS

The principal components output for the six computer packages is summarized in Table 14.3. SAS is the only package that provides a completely separate program for principal components analysis. The rest of the packages combine it with factor analysis (which will be described in the next chapter) or, in the case of BMDP, it can also be obtained from their 4R regression program.

The most straightforward program is the PRINCOMP procedure in SAS. In the depression example, we stored a SAS data file called 'depress' in the computer and it was only necessary to type
proc princomp data = in.depress;
  var c1-c20;
to tell the computer what to run. As can be seen from Table 14.3, SAS provides comprehensive output for either the correlation or the covariance matrix option.

For BMDP there are two possibilities. One is to use the 4R principal components regression program. In this case, the user takes any variable other than those needed for the principal components analysis, declares it as the dependent variable and uses the desired X variables as the independent variables in the regression analysis. To obtain all the principal components, use 'limit = .000001, .000001.' in the regression paragraph. This should be small enough for most applications. The part of the output relating to the regression analysis is ignored, and the principal components and the
Table 14.3 Summary of computer output for principal component analysis

Output                                  BMDP     SAS       SPSS       STATA      STATISTICA  SYSTAT
Covariance matrix                       4M       PRINCOMP             correlate
Correlation matrix                      4M,4R    PRINCOMP  FACTOR     correlate  Factor      Factor
Coefficients from raw data              4R       PRINCOMP             factor     Factor      Factor
Coefficients from standardized data     4R       PRINCOMP  FACTOR     factor     Factor      Factor
Correlation between x's and C's         4M       FACTOR    FACTOR     factor     Factor      Factor
Eigenvalues                             4M,4R    PRINCOMP  FACTOR     factor     Factor      Factor
Cum. proportion of variance explained   4M,4R    PRINCOMP  FACTOR     factor     Factor      Factor
Principal component scores saved        4M(a)    PRINCOMP  FACTOR(a)  factor     Factor(a)   Factor(a)
Plot of principal components            4M(a)    PRINCOMP                        Factor(a)   Factor(a)

(a) Scores must be multiplied by the square root of the corresponding eigenvalue.
eigenvalues are then obtained directly from the program. Either raw (use the 'no stand' option) or standardized coefficients can be obtained.

Alternatively, the BMDP 4M program can be used, stating 'constant = 0.' and, if raw data are preferred, 'form = cova.' in the factor paragraph. Also state 'method = none' in the rotate paragraph. The 4M program was written for factor analysis (the subject of the next chapter). The desired output is called unrotated factor loadings. To obtain principal component coefficients from the loadings for standardized data, multiply each factor loading by the square root of the corresponding eigenvalue. This multiplication must be done on the scores computed from factor loadings in order to get principal component scores from 4M, SPSS, STATISTICA and SYSTAT.

The FACTOR program in SPSS can be used to perform principal components analysis. The only option available is correlation, so the results are standardized. The user should specify 'EXTRACTION = PC' and also 'CRITERIA = FACTORS(P)', where P is the total number of variables. Adjusting the scores as described above is necessary.

In STATA, the 'pc' option is used along with the covariance option if the user desires that raw data be used. The score option is used to compute the principal component scores, which can then be saved or plotted using other program statements.

In STATISTICA, the Factor Analysis module is switched on and then the principal component extraction method is chosen. The correlation matrix is the default option, but the covariance matrix can be inputted directly and analyzed to produce principal components for unstandardized data. No method of rotation should be chosen. The available output includes factor loadings and scores that should be multiplied by the square root of the corresponding eigenvalues as explained for BMDP 4M.

The SYSTAT factor program includes options for principal components analysis that give all the output you expect from a principal components program. First, under results, select Extended. To perform principal components analysis, select the Factor program and then select principal components. Use no method of rotation. Adjusting the scores as in BMDP 4M is necessary for SYSTAT also.

14.8 WHAT TO WATCH OUT FOR

Principal components analysis is mainly used as an exploratory technique with little use of statistical tests or confidence intervals. Because of this, formal distributional assumptions are not necessary. Nevertheless, it is obviously simpler to interpret if certain conditions hold.

1. As with all statistical analyses, interpreting data from a random sample of a well-defined population is simpler than interpreting data from a sample that is taken in a haphazard fashion.
2. If the observations arise from or are transformed to a symmetric distribution, the results are easier to understand than if highly skewed distributions are analyzed. Obvious outliers should be searched for and removed. If the data follow a multivariate normal distribution, then ellipses such as that pictured in Figure 14.4 are obtained and the principal components can be interpreted as axes of the ellipses. Statistical tests that can be found in some packaged programs usually require the assumption of multivariate normality. Note that the principal components offer a method for checking (in an informal sense) whether or not some data follow a multivariate normal distribution. For example, suppose that by using C1 and C2, 75% of the variance in the original variables can be explained. Their numerical values are obtained for each case. Then the N values of C1 are plotted using a cumulative normal probability plot, and checked to see if a straight line is obtained. A similar check is done for C2. If both principal components are normally distributed, this lends some credence to the data set having a multivariate normal distribution. A scatter plot of C1 versus C2 can also be used to spot outliers. PRINCOMP allows the user to store the principal component scores of each case for future analysis.
3. If principal components analysis is to be used to check for redundancy in the data set (as discussed in section 14.6), then it is important that the observations be measured accurately. Otherwise, it may be difficult to detect the interrelationships among the X variables.

SUMMARY

In the two previous chapters we presented methods of selecting variables for regression and discriminant function analyses. These methods include stepwise and subset procedures. In those analyses a dependent variable is present, implicitly or explicitly. In this chapter we presented another method for summarizing the data. It differs from variable selection procedures in two ways.
1. No dependent variable exists.
2. Variables are not eliminated but rather summary variables - i.e. principal components - are computed from all of the original variables.
The major ideas underlying the method of principal components analysis were presented in this chapter. We also discussed how to decide on the number of principal components retained and how to use them in subsequent analyses. Further, methods for attaching interpretations or 'names' to the selected principal components were given.

A detailed discussion (at an introductory level) and further examples of the use of principal components analysis can be found in Dunteman (1989). At a higher mathematical level, Joliffe (1986) and Flury (1988) describe recent developments.
REFERENCES

References preceded by an asterisk require strong mathematical background.

Chatterjee, S. and Price, B. (1991). Regression Analysis by Example, 2nd edn. Wiley, New York.
*Cooley, W.W. and Lohnes, P.R. (1971). Multivariate Data Analysis. Wiley, New York.
Dunteman, G.H. (1989). Principal Components Analysis, Sage University Papers. Sage, Newbury Park, CA.
Eastment, H.T. and Krzanowski, W.J. (1982). Cross-validatory choice of the number of components from a principal component analysis. Technometrics, 24, 73-77.
*Flury, B. (1988). Common Principal Components and Related Multivariate Models. Wiley, New York.
Harris, R.J. (1985). A Primer of Multivariate Statistics, 2nd edn. Academic Press, New York.
*Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417-41.
Jackson, J.E. (1991). A User's Guide to Principal Components. Wiley, New York.
Jackson, J.E. and Hearne, F.T. (1973). Relationships among coefficients of vectors in principal components. Technometrics, 15, 601-10.
*Joliffe, I.T. (1986). Principal Components Analysis. Springer-Verlag, New York.
*Lawley, D.N. (1956). Tests of significance for the latent roots of covariance and correlation matrices. Biometrika, 43, 128-36.
Morrison, D.F. (1976). Multivariate Statistical Methods, 2nd edn. McGraw-Hill, New York.
Seal, H. (1964). Multivariate Statistical Analysis for Biologists. Wiley, New York.
FURTHER READING

Dunn, O.J. and Clark, V.A. (1987). Applied Statistics: Analysis of Variance and Regression, 2nd edn. Wiley, New York.
*Tatsuoka, M.M. (1988). Multivariate Analysis: Techniques for Educational and Psychological Research, 2nd edn. Wiley, New York.
PROBLEMS

14.1 For the depression data set, perform a principal components analysis on the last seven variables DRINK-CHRONILL (Table 3.4). Interpret the results.
14.2 (Continuation of Problem 14.1.) Perform a regression analysis of CASES on the last seven variables as well as on the principal components. What does this regression represent? Interpret the results.
14.3 For the data generated in Problem 7.7, perform a principal components analysis on X1, X2, ..., X9. Compare the results with what is known about the population.
14.4 (Continuation of Problem 14.3.) Perform the regression of Y on the principal components. Compare the results with the multiple regression of Y on X1 to X9.
14.5 Perform a principal components analysis on the data in Table 8.1 (not including the variable P/E). Interpret the components. Then perform a regression analysis with P/E as the dependent variable, using the relevant principal components. Compare the results with those in Chapter 8.
14.6 Using the family lung function data in Appendix A, define a new variable RATIO = FEV1/FVC for the fathers. What is the correlation between RATIO and FEV1? Between RATIO and FVC? Perform a principal components analysis on FEV1 and FVC, plotting the results. Perform a principal components analysis on FEV1, FVC and RATIO. Discuss the results.
14.7 Using the family lung function data, perform a principal components analysis on age, height and weight for the oldest child.
14.8 (Continuation of Problem 14.7.) Perform a regression of FEV1 for the oldest child on the principal components found in Problem 14.7. Compare the results to those from Problem 7.15.
14.9 Using the family lung function data, perform a principal components analysis on mother's height, weight, age, FEV1 and FVC. Use the covariance matrix, then repeat using the correlation matrix. Compare the results and comment.
14.10 Perform a principal components analysis on AGE and INCOME using the depression data set. Include all the additional data points listed in Problem 6.9(b). Plot the original variables and the principal components. Indicate the added points on the graph, and discuss the results.
15 Factor analysis
15.1 USING FACTOR ANALYSIS TO EXAMINE THE RELATIONSHIP AMONG P VARIABLES
From this chapter you will learn a useful extension of principal components analysis called factor analysis that will enable you to obtain more distinct new summary variables. The methods given in this chapter for describing the interrelationships among the variables and obtaining new variables have been deliberately limited to a small subset of what is available in the literature in order to make the explanation more understandable. Section 15.2 discusses when factor analysis is used and section 15.3 presents a hypothetical data example that was generated to display the features of factor analysis. The basic model assumed when performing factor analysis is given in section 15.4. Section 15.5 discusses initial factor extraction using principal components analysis and section 15.6 presents initial factor extraction using iterated principal components (principal factor analysis). Section 15.7 presents a commonly used method of orthogonal rotation and a method of oblique rotation. Section 15.8 discusses the process of assigning factor scores to individuals. Section 15.9 contains an additional example of factor analysis using the CESD scale items introduced in section 14.5. A summary of the output of the statistical packages is given in section 15.10 and what to watch out for is given in section 15.11.
15.2 WHEN IS FACTOR ANALYSIS USED?
Factor analysis is similar to principal components analysis in that it is a technique for examining the interrelationships among a set of variables. Both of these techniques differ from regression analysis in that we do not have a dependent variable to be explained by a set of independent variables. However, principal components analysis and factor analysis also differ from each other. In principal components analysis the major objective is to select a number of components that explain as much of the total variance as possible. The values of the principal components for a given individual are relatively simple to
355
compute and interpret. On the other hand, the factors obtained in factor to alysis are selected mainly to explain the interrelationships among the a~ginal variables. Ideally, the number of factors expected is known in advance. ~be rnajor emphasis is placed on obtaining easily understandable factors that convey the essential information contained in the original set of variables. Areas of application of factor analysis are similar to those mentioned in section 14.2 for principal components analysis. Chiefly, applications have ~orne from the social sciences, particularly psychometrics. It has been used mainly to explore the underlying structure of a set of variables. It has also been used in assessing what items to include in scales and to explore the interrelationships among the scores on different items. A certain degree of resistance to using factor analysis in other disciplines has been prevalent, perhaps because of the heuristic nature of the technique and the special jargon employed. Also, the multiplicity of methods available to perform factor analysis leaves some investigators uncertain of their· results. In this chapter no attempt will be made to present a comprehensive treatment of the subject. Rather, we adopt a simple geometric approach to explain only the most important concepts and options available in standard packaged programs. Interested readers should refer to the reference and further reading sections for texts that give more comprehensive treatments of factor analysis. In particular, Dillon and Goldstein (1984), Everitt (1984), Bartholomew (1987) and Long (1983) discuss confirmatory factor analysis and linear structural models, subjects not discussed in this book. A discussion of the use of factor analysis in the natural sciences is given in Reyment and Joreskog (1993).
15.3 DATA ·EXAMPLE As in Chapter 14, we generated a hypothetical data set to help us present the fundamental concepts. In addition, the same depression data set used in Chapter 14 will be subjected to a factor analysis here. This example should serve to illustrate the differences between principal components and factor analysis. It will also provide a real-life application. X The hypothetical data set consists of 100 data points on five variables, 1 . 'hX 2 • ···,X 5 • The data were generated from a multivariate normal distribution zero means. Note that most statistical packages provide the option of gwn ene r sh ra ~ng normal data. The sample means and standard deviations are o;n In Table 15.1. The correlation matrix is presented in Table 15.2. e note, on examining the correlation matrix, that there exists a high Correi t' and a Ion between X 4 and X 5 , a moderately high correlation between X 1 lh X 2' and a moderate correlation between X 3 and each of X 4 and X 5 . e remaining correlations are fairly low.
Table 15.1 Means and standard deviations of 100 hypothetical data points

Variable    Mean      Standard deviation
X1          0.163     1.047
X2          0.142     1.489
X3          0.098     0.966
X4         -0.039     2.185
X5         -0.013     2.319
Table 15.2 Correlation matrix of 100 hypothetical data points

        X1       X2       X3       X4       X5
X1      1.000
X2      0.757    1.000
X3      0.047    0.054    1.000
X4      0.115    0.176    0.531    1.000
X5      0.279    0.322    0.521    0.942    1.000
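Data with this kind of structure can be generated directly. The following Python sketch is only an illustration (it treats the correlations of Table 15.2 and the standard deviations of Table 15.1 as if they were the population values); it draws 100 observations from a multivariate normal distribution with zero means and prints the sample correlation matrix for comparison.

import numpy as np

R = np.array([[1.000, 0.757, 0.047, 0.115, 0.279],
              [0.757, 1.000, 0.054, 0.176, 0.322],
              [0.047, 0.054, 1.000, 0.531, 0.521],
              [0.115, 0.176, 0.531, 1.000, 0.942],
              [0.279, 0.322, 0.521, 0.942, 1.000]])
sd = np.array([1.047, 1.489, 0.966, 2.185, 2.319])
cov = R * np.outer(sd, sd)                  # covariance matrix implied by R and the SDs

rng = np.random.default_rng(2024)
X = rng.multivariate_normal(mean=np.zeros(5), cov=cov, size=100)

print(np.round(X.mean(axis=0), 3))                  # sample means, close to zero
print(np.round(np.corrcoef(X, rowvar=False), 3))    # sample correlations, close to Table 15.2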
15.4 BASIC CONCEPTS OF FACTOR ANALYSIS

In factor analysis we begin with a set of variables X1, X2, ..., XP. These variables are usually standardized by the computer program so that their variances are each equal to one and their covariances are correlation coefficients. In the remainder of this chapter we therefore assume that each xi is a standardized variable, i.e. xi = (Xi - X̄i)/Si. In the jargon of factor analysis the xi's are called the original or response variables.

The object of factor analysis is to represent each of these variables as a linear combination of a smaller set of common factors plus a factor unique to each of the response variables. We express this representation as

x1 = l11 F1 + l12 F2 + ... + l1m Fm + e1
x2 = l21 F1 + l22 F2 + ... + l2m Fm + e2
...
where the following assumptions are made:

1. m is the number of common factors (typically this number is much smaller than P).
2. F1, F2, ..., Fm are the common factors. These factors are assumed to have zero means and unit variances.
3. lij is the coefficient of Fj in the linear combination describing xi. This term is called the loading of the ith variable on the jth common factor.
4. e1, e2, ..., eP are unique factors, each relating to one of the original variables.
+ l 12 F 2 + e 1 + l22F2 + e2
Each of the five scores consists of two parts: a part due to the common factors F 1 and F 2 and a part due to the unique factor for that test. The common factors F 1 and F 2 might be considered the individual's verbal and quantitative abilities, respectively. The unique factors express the individual variation on each test score. The unique factor includes all other effects that keep the common factors from completely defining a particular xi. In a sample of N individuals we can express the equations by adding a subscript to each xi, Fi and ei to represent the individual. For the sake of simplicity this subscript was omitted in the model presented here. The factor model is, in a sense, the mirror image of the principal components model, where each principal component is expressed as a linear combination of the variables. Also, the number of principal components is equal to the number of original variables (although we may not use all of the principal components). On the other hand, in factor analysis we choose the number of factors to be smaller than the number of response variables. Ideally, the ~mber of factors should be known in advance, although this is often not ~he case. However, as we will discuss later, it is possible to allow the data emselves to determine this number. als~he factor model~ by breaki~g each response ~ariabl~ xi into tw? par~s, " . breaks the vanance of xi mto two parts. Smce xi ts standardiZed, tts artance is equal to one and is composed of the following two parts:
~·
the communality, i.e. the part of the variance that is due to the common factors; · the specificity, i.e. the part of the variance that is due to the unique factor ei.
Denoting the communality of xi by
h~ and the specificity by uf, we can write
358
Factor analysis
the variance of xi as Var xi= 1 = hf + uf_ In words, the variance of x the communality plus the specificity. i equals The numerical aspects of factor analysis are concerned with fi d' estimates of the factor loadings (lii) and the communalities (hf). The~ tng many ways available to numerically solve for these quantities. rhe sol~/re process is c~lled initial factor extraction.. T~~ next two. sections discuss t~~~ such extractiOn methods. Once a set of IDltial factors Is obtained, the . step m . t h e ana1ysis . IS . to o btam . new f:actors, call ed the rotated facto next maJor in order to improve the interpretation. Methods of rotation are discussed ~s, • Ill sectiOn 15.7. In any factor analysis the number m of common factors is required. As mentioned earlier, this number is, ideally, known prior to the analysis. If it is not known, most investigators use a default option available in standard computer programs whereby the number of factors is the number of eigenvalues greater than one (see Chapter 14 for the definition and discussion of eigenvalues). Also, since the numerical results are highly dependent on the chosen number m, many investigators run the analysis with several values in an effort to get further insights into their data.
15.5 INITIAL FACTOR EXTRACTION: PRINCIPAL COMPONENTS ANALYSIS In this section and the next we discuss two methods for the initial extraction of common factors. We begin with the principal componen!s_~nalysis method,_ which can be fou_!l~tin mo_st of the staill:Iafc.ffactor-·amilysis programs. The ;b~ic-iaeais-to cboose the fitsi rri ptindpalcomponents and modify thern to lfiLthtt[actor model defined in the previous section. The reason for choosi~g the first-mpnncipa:Ccomponents, nitherthan anfothers, is that they explain the greatest proportion of the variance and are therefore the most important. Note that the principal components are also uncorrelated and thus present an attrac!ive choice as fa~tors. . . .. . ach satisfy the assumpti~n of umt vana~ce~ of the fa~tors, we divtde: 'th pnncipal component by Its standard devtatto~. Th~t ts, .we. define t!n~nt common factor F1.as F.= C1.j(Var C1-) 1 ' 2 , where Ciis theJthpnnCipalcomp h'p ~ . . , h 1 ·uons 1 To express each vanable xim terms of the Fi s, we first recall t e ~e a between the variables xi and the principal components Ci. Spectfically,
!o.
C1 =
a 11 x 1
C2 =
a21X1
Cp
+ a 12 x 2 + ··· + a 1j.Xp + a22X2 + ... + azpXp
= aP1x 1 + ap2 X 2 + .. · + appXp
. verted It may be shown mathematically that this set of equations can be m
Initial factor extraction: principal components analysis
359
to express the x/s as functions of the C/s. The result is (Harmon, 1976; Afifi d Azen, 1979):
an
+ a21C2 + ··· + aplCP a12Cl + a22C2 + ... + aP2CP
X1 = auCl
x2
=
Note that the rows of the first set of equations become the columns of the second set of equations. . 112 Now since Fi = Ci/(Var C) , it follows that Ci = FiVar C) 1' 2, and we can then express the ith equation as x 1 = aHF 1(Var C 1)112 + a 2iF2(Var C 2 ) 112 + ··· + aPiFp(Var Cp) 1' 2 This last equation is now modified in two ways. 1. We use the notation lu = aii(Var C) 1 ' 2 for the first m components. 2. We combine the last P- m terms and denote the result by ei. That is, . + aPiFP (VarCp . .)1/2 ei = am+i,iFm+l (VarCm+l) 1/2 + ··· With these manipulations we have now expressed each variable xi as
xi= luFl
+ li2F2 + ... + limFm + ei
for i = 1, 2, ... , P. In other words, we have transformed the principal components model to produce the factor model. For later use, note also that when the original variables are standardized, the factor loading lu turns out to be the correlation between xi and Fi (section 14.5). The matrix of factor loadings is sometimes called the pattern matrix. When the factor loadings are correlations between the x/s and F/s as they are here, it is also called the factor structure matrix (Dillon and Goldstein, 1984). Furthermore, it can be shown mathematically that the communality of xi is hf = 1?1 + 1?2 + ··· + lfm· For example, in our hypothetical data set the eigenvalues of the correlation lllatrix (or Var CJ are Var C 1 = 2.578 Var C 2 = 1.567 Var C 3 = 0.571 Var C4
=
0.241
Var C 5 = 0.043 Total = 5.000
360
Factor analysis
Note that the sum 5.0 is ~qual toP, the t~tal. number of variables. Based 0 the rule of thumb of selectmg only those pnnctpal components correspond· n to eigenvalues of one or more, we select m = 2. We obtain the Princ·tng components analysis factor loadings, the 1i}s, shown in Table 15.3. foal example; the loadin~ of x 1 on F 1 is 111 = ?.511 and on F 2 is 112 = 0.?&~~ Thus the first equatwn of the factor modelts x1
= 0.511F 1 + 0.782F 2 + e 1
Table 15.3 also shows the variance explained by each factor. For example the variance explained by F 1 is 2.578, or 51.6%, of the total variance of 5.' The communality column, hf, in Table 15.3 shows the part of the variance of each variable explained by the common factors. For example,
hi = 0.511
2
+ 0.782 2 =
0.873
Finally, the specificity u~ is the part of the variance not explained by the common factors. In this example we use standardized variables xi whose variances are each equal to one. Therefore for each xi, uf = 1 - hf. It should be noted that the sum of the communalities is equal to the cumulative part of the total variance explained by the common factors. In this example it is seen that 0.873
+ 0.875 + 0.586 + 0.898 + 0.913
= 4.145
and 2.578
+ 1.567 =
4.145
Table 15.3 Initial factor analysis summary for hypothetical data set from principal components extraction method Factor loadings Communality Variable
xl x2
.x3 x4 Xs
Variance explained Percentage
F1
h~ ·I
F2
0.511 0.553 0.631 0.866 0.929
0.782 0.754 -0.433 -0.386 -0.225
2.578 51.6
1.567 31.3
4.145 82.9
2
Ui
0.127 0.125 0.414 0.102 0.087
0.873 0.875 0.586 0.898 0.913
'Lh? =
Specificity
Lu? =
----
0.855 17.1
------
Initial factor extraction: principal components analysis
361
thUS verifying the ab~ve stat~ment. Similarly, the sum of th~. specificities is . ual to the total vanance mmus the sum of the communalities. eQ A valuable graphical aid to interpreting the factors is the factor diagram bown in Figure 15.1. Unlike the usual scatter diagrams where each point sepresents an individual case and each axis a variable, in this graph each ~oint represents ~ response variable and each axis a comm~n factor. For example, the pomt labeled 1 represents the response vanable x 1 . The coordinates of that point are the loadings of x 1 on F 1 and F 2 , respectively. The other four points represent the remaining four variables. Since the factor loadings are the correlations between the standardized variables and the factors, the range of values shown on the axes of the factor diagram is -1 to + 1. It can be seen from Figure 15.1 that x 1 and x 2 load more on F 2 than they do on F 1 • Conversely, x 3 , x 4 and x 5 load more on
-----,.2 1
I
I I I I I I I I
Factor 1
}?·
•gure 15.1 Factor Diagram for Principal Components Extraction Method.
362
Factor analysis
F 1 than on F 2 • However, these distinctions are not very clear-cut a d
technique o~ fa~tor rotation will p_roduce c~earer res~ts (section 1S.7)~ the An exammatton of the correlatiOn matnx shown m Table 15.2 confi that x 1 and x 2 form ~ block of correlated variables. Similarly, x 4 and~s form another block, wtth x 3 also correlated to them. These two major bl0 k5 are only weakly correlated with each other. (Note that the correlation m ~. s for the standardized x's is the same as that for the unstandardized X' s~ nx
15.6 INITIAL ·FACTOR EXTRACTION: ITERATED PRINCIPAL COMPONENTS The second method of extracting initial factors is a modification of the principal components analysis method. It has different names in different packages. It is called principal factor analysis or principal axis factoring approach in many texts. · To understand this method, you should recall that the communality is the part of the variance of each variable associated with the common factors. The principle underlying the iterated solution states that we should perform the factor analysis by using the communalities in place ofthe original variance. This principle entails substituting communality estimates for the 1's representing the variances of the standardized variables along the diagonal of the correlation matrix. With 1's in the diagonal we are factoring the total variance of the variables; with communalities in the diagonal we are factoring the variance associated with the common factors. Thus with communalities along the diagonal we select those common factors that maximize the total communality. Many factor analysts consider maximizing the total communality a more attractive objective than maximizing the total proportion of the explaine.d variance, as is done in the principal components method. The problem ts that communalities are not known before the factor analysis is performed. Some initial estimates of the communalities must be obtaiT\ed prior to t;e analysis. Various procedures exist, and we recommend, in the absenc~ 0 a 1 priori estimates; that the investigator use the default option in the parttcu :; program since the resulting factor solution is usually little affected by t initial communality estimates. . ated . · t the Iter The steps performed by a packaged program m carrymg ou factor extraction are summarized as follows: 1. Find the initial communality estimates.
2. Substitute the communalities for the diagonal elements (1's) correlation matrix. 3. Extract m principal components from the modified matrix.
in the
Initial factor extraction: principal components analysis
363
:rvfultiply the principal components coefficients by the standard deviation 4 · of the respective principal components to obtain factor loadings. S. compute new commu~a~iti~s from the ~omputed factor loading~ .. Replace the communahttes m step 2 wtth these new communalities and 6 · repeat steps 3, 4 and 5. This step constitutes an iteration. 7. continue iterating, stopping when the communalities stay essentially the same in the last two iterations. For our hypothetical data example the results of using this method are shown in Table 15.4. In comparing this table with Table 15.3, we note that the total communality is higher for the principal components method (82.9% versus 74.6%). This result is generally the case since in the iterative method we are factoring the total communality, which is by necessity smaller than the total variance. The individual loadings do not seem to be very different in the two methods; compare Figures 15.1 and 15.2. The only apparent difference in loadings is in the case of the third variable. Note that Figure 15.2, the factor diagram for the iterated method, is constructed in the same way as Figure 15.1. We point out that the factor loadings extracted by the iterated method depend on the number of factors extracted. For example, the loadings for the first factor would depend on whether we extract two or three common factors. This condition does not exist for the uniterated principal components method. It should also be pointed out that it is possible to obtain negative variances and eigenvalues. This occurs because the 1's in the diagonal of the correlation matrix have been replaced by the communality estimates which can be considerably less than one. This results in a matrix that does not necessarily have positive eigenvalues. These negative eigenvalues and the factors associated with them should not be used in the analysis. Table 15.4 Initial factor analysis summary for hypothetical data set from principal factor extraction method Factor loadings Variable xl x2
x3 x4
--Xs
Varia 'P nee explained __ercentage .
Communality
Fl
F2
h~I
0.470 0.510 0.481 0.888 0.956
0.734 0.704 -0.258 -0.402 -0.233
0.759 0.756 0.298 0.949 0.968
2.413 48.3
1.317 26.3
'Eh~= 3.730 74.6
Specificity
u;
0.241 0.244 0.702 0.051 0.032
L;u? =
1.270 25.4
364
Factor analysis
1
-----"1.2 I I I
I I I
I I
I
Factor 1
•3
•5
Figure 15.2 Factor Diagram for Iterated Principal Factor Extraction Method.
Regardless of the choice of the extraction method, if the investigator does not have a preconceived number of factors (m) from knowledge of the subject matter, then several methods are available for choosing one numerically. Reviews of these methods are given in Harmon (1976), Gorsuch (1983) ~nd Thorndike (1978). In section 15.4 the commonly used technique of includt~g any factor whose eigenvalue is greater than or equal to one was used. ~hts criterion is based on theoretical rat1onales developed using true populaU~~ correlation coefficients. It is commonly thought to yield about one factor of every three to five variables. It appears to correctly estimate the num~er ot factors when the communalities are high and the number of variables ts_ ned too large. If the communalities are low, then the number of factors obtattnbat . "1 r to with a cutoff equal to the average communality tends to be stmt a
Factor rotations
365
btained by the principal components method with a cutoff equal to one. ~s an altemativ~ method, called the ~ree ~ethod, some investigators will simPlY plo~ the eige~values on the vertical axis versus the nu~ber of factors · the honzontal axts and look for the place where a change m the slope of 11 :he curve connecting successive points occurs. Examining the eigenvalues listed in section 15.5, we see that compared with the first two eigenvalues (2.578 and 1.567), t~e values. of. the remaining ei?envalues are low ~0.571, o.241 and 0.043). This result tndicates that m = 2 IS a reasonable choice. In terms of initial factor extraction methods the iterated principal factor solution is the method employed most frequently by social scientists, the main users of factor analysis. For theoretical reasons many mathematical statisticians are more comfortable with the principal components method. Many other methods are available that may be preferred in particular situations (Table 15.7 in section 15.9). In particular, if the investigator is convinced that the factor model is valid and that the variables have a multivariate normal distribution, then the maximum likelihood (ML) met)lod should be used. If these assumptions hold, then the ML procedure enables the investigator to perform certain tests of hypotheses or compute confidence intervals (Lawley and Maxwell, 1971). In addition, the ML estimates of the factor loadings are invariant (do not change) with changes in scale of the original variables. Note that in section 14.5, it was mentioned that for principal components analysis there was no easy way to go back and forth between the coefficients obtained from standardized and unstandardized data. The same is true for the principal components method of factor extraction. This invariance is therefore a major advantage of the ML extraction method. Finally, the ML procedure is the one used in confirmatory factor analysis (Long, 1983; Dillon and Goldstein, 1984). In confirmatory factor analysis, the investigator specifies constraints on the outcome of the factor analysis. For example, zero loadings may be hypothesized for some variables on a specified factor or some of the factors ~ay be assumed to be u~co~related with other factors. Statistical tests ~an e performed to determme if the data 'confirm' the assumed constramts (Long, 1983). . . . . .
15 ·7 FACTOR ROTATIONS
Re~all
that the main purpose of factor analysis is to derive from the data ~~ly inter~retable common factors. T.he initial factors, however, are often fi cult to Interpret For the hypothetical data example we noted that the t~:~ factor is ~ssentially ~n a~erage of all variables. We also n~te~ ear~ier is the factor mterpretattons In that example are not clear-cut. Thts sttuatton i .0.ften the case in practice, regardless of the method used to extract the nt hal factors. ·
366
Factor analysis
Fortunately, it is possible to find new factors whose loadings are ea . interpret. These new factors, called the rotated factors, are selected sSier to (ide~lly) some of the loadings are very large (near + 1) and the rem~i~~at loadmgs are very small (near zero). Conversely, we would ideally wish ;g any given variable, that it have a high loading on only one factor. If th'. ~r · · eas! t~ ~ve_ · eachf:actor an I~terpreta~ton · . arising from. ISis th t h e_ case, It_Is vanables wtth which It IS highly correlated (high loadmgs). e Theoretically, factor rotations can be done in an infinite number of wa If you are interested in detailed descriptions of these methods, refer to so ys. of the books listed in the reference and further reading sections. In this sectime we highlight only two typical factor rotation techniques. The most commo~n used technique is the varimax rotation. This method is the default option i~ most packaged programs and we present it first. The other technique discussed later in this section, is called oblique rotation; we describe on~ commonly used oblique rotation, the direct quartimin rotation. Varimax rotation We have reproduced in Figure 15.3 the principal component factors that were shown in Figure 15.1 so that you can get an intuitive feeling for what factor rotation does. As shown in Figure 15.3, factor rotation consists of finding new axes to represent the factors. These new axes are selected so that they go through clusters or subgroups of the points representing the response variables. The varimax procedure further restricts the new axes to being orthogonal (perpendicular) to each other. Figure 15.3 shows the results of the rotation. Note that the new axis representing rotated factor 1 goes through the cluster of variables x 3 , x 4 and x 5 • The axis for rotated factor 2 is close to the cluster of variables x 1 and x 2 but cannot go through them since it is restricted to being orthogonal to factor 1. . The result of this varimax rotation is that variables x 1 and x 2 have high loadings on rotated factor 2 and nearly zero loadings on rotated factor 1. Similarly, ~ 3 , x 4 and x 5 load heavily on rotated factor 1 but not on rota~~ f~ctor 2. Ftgure 15.3 clearly shows that th~ rotated factors are or!hog~ In smce the angle between the axes representmg the rotat~d fact_ors IS 90 h statistical terms the orthogonality of the rotated factors IS eqmvalent to t e . . . the sum fact that the~ are uncorrel~ted with e~ch. other: ComputatiOnally, the varamax rotataon IS achieved by maximiZing ther of the variances of the squared factor loadings within each factor. Fur lit; these factor loadings are adjusted by dividing each of them by the commu;:jser of the corresponding variable. This adjustment is kn?wn. as the t tends normalization (~arman, 1976;_Afifi an~ Azen, _1979). This ad~~stm:;it were to equalize the Impact of vanables with varymg communahu~s. ce the not used, the variables with higher communalities would highly mfluen final solution.
Factor rotations
367
Rotated factor 2
Factor 1
Rotated factor 1
Figure 15.3 Varimax Orthogonal Rotation for Factor Diagram of Figure 15.1: Rotated Axes Go Through Clusters of Variables and are Perpendicular to Each Other.
Any method of rotation may be applied to any initially extracted factors. ~?r example, in Figure 15.4 we show the varimax rotation of the factors lllJ.tially extracted by the iterated principal factor method (Figure 15.2). Note ~he similarity of the graphs shown in Figures 15.3 and 15.4. This similarity ~~furth~r reinforced by examination of t~e two sets of rotated factor lo~d~.gs own tn Tables 15.5 and 15.6. In thts example both methods of tmttal extraction produced a rotated factor 1 associated mainly with x 3 , x 4 and x 5 , and a rotated factor 2 associated with x 1 and x 2 • The points in Figures 15.3 ~d 15.4 are plots of the rotated loadings (Tables 15.5 and 15.6) with respect the rotated axes. th~n comparing Tables 15.5 and 15.6 with Tabl~s 15.3 and.15.4, w~ note th~t communalities are unchanged after the vanmax rotatiOn. Thts result ts
368
Factor analysis
Rotated factor 2
Factor 1
Rotated factor 1
Figure 15.4 Varimax Orthogonal Rotation for Factor Diagram of Figure 15.2.
always the case for any orthogonal rotation. Note also that the percentage of the variance explained by the rotated factor 1 is less than that explained by unrotated factor 1. However, the cumulative percentage of the vari~nce explained by all common factors remains the same after orthogonal rotatwn. Furthermore, the loadings of any rotated factor depend on how many other factors are sele.cted, regardless of the method of initial extraction. Oblique rotation
Some factor analysts are willing to relax the restriction of orthogonalitY ~ therotated factors, thus permitting a further degree of flexibility. Nonorth~~onin rotations are called oblique rotations. The origin of the term oblique te~ot geometry, whereby two crossing lines are called oblique if they are
Factor rotations
369
table 15.5 Varimax rotated factors: principal components extraction
Factor loadings a Communality F1
Variable
0.055 0.105 0.763 0.943 0.918
Xt
Xz X3
.x4 Xs
Variance explained Percentage a
F2
2.328 46.6
h~
'
0.9331 0.929 -0.062 0.095 0.266
0.873 0.875 0.586 0.898 0.913
1.817 36.3
4.145 82.9
Vertical lines indicate large loadings.
Table 15.6 Varimax rotated factors: iterated principal factors extraction
Factor loadingsa Communality Variable
F1
xt Xz
x3 x4 Xs
Variance explained Percentage a Vertical
F2
h~I
0.063 0.112 0.546 0.972 0.951
0.869 0.862 0.003 0.070 0.251
0.759 0.756 0.298 0.949 0.968
2.164 43.3
1.566 31.3
3.730 74.6
lines indicate large loadings.
Perpendicular to each other. Oblique rotated factors are correlated with each other, and in some applications it may be harmless or even desirable to have correlated common factors (M ulaik, 1972). l' A commonly used oblique rotation is called the direct quartimin procedure. shhe res~lts of applying this method to our hypothetical data example are th own tn Figures 15.5 and 15.6. Note that the rotated factors go neatly £ rough the centers of the two variable clusters. The angle between the rotated .actors is not 90°. In fact, the cosine of the angle between the two factor axes 18 the sample correlation between them. For example, for the rotated principal ~ornponents the correlation between rotated factor 1 and rotated factor 2 is ·165, which is the cosine of 80.5.
370
Factor analysis
Rotated factor 2
Factor 1
Rotated factor 1
Figure 15.5 Direct Quartmin Oblique Rotation for Factor Diagram of Figure 15.1: Rotated Axes Go Through Clusters of Variables But are not Required to be Perpendicular to Each Other.
For oblique rotations, the reported pattern and structure matrices are somewhat different. The structure matrix gives the correlations between ~ and Fj. Also, the statements made concerning the proportion of th~ to~ variance explained by the communalities being unchanged by rotatton not hold for oblique rotations. . ns In general, a factor analysis computer program will print the correlat•~ed between rotated factors if an oblique rotation is performed. The r~tallY oblique factor loadings also depend on how many factors are selected. Ft7~nd we note that packaged programs generally offer a choice of orthogona oblique rotations, as discussed in section 15.10.
°
Assigning factor scores to individuals
371
Rotated factor 2
Factor 1
3 Rotated factor 1
Figure 15.6 Direct Quartmin Oblique Rotation for Factor Diagram of Figure 15.2.
15.8 ASSIGNING FACTOR SCORES TO INDIVIDUALS
?nee the initial extraction of factors and the factor rotations are performed, ~ tnay be of interest to obtain the score an individual has for each factor.
for example, if two factors, verbal and quantitative abilities, were derived rom a set of test scores, it would be desirable to determine equations for ~Otnputing an individual's scores on these two factors from a set oftest scores. Uch equations are linear functions of the original variables. i :~eoretically, it is conceivable to construct factor score equations in an a~ Ulte number of ways .(Gorsuch, .1983). B.ut perha?s the simplest way is to t d the values of the vanables loadmg heav1ly on a gtven factor. For example, ror the hypothetical data discussed earlier we would obtain the score of Otated factor 1 as x 3 + x 4 + x 5 for a given individual. Similarly, the score
372
Factor analysis
of rotated factor 2 in that example would be x 1 + x 2 . In effect, the f: analysis identifies x 3 , x 4 and x 5 as a subgroup of intercorrelated vari a~:or One way of combining the information conveyed by the three variab~ e~. simply to add them up. A similar approach is used for x 1 and x . In es ts 2 applications such a simple approach may be sufficient. sorne More sophisticated approaches do exist for computing factor scores. Th so-called regression procedure is commonly used to compute factor sco e This meth~d combines the interc~r~elations among the xi variables and :~~ factor loadmgs to produce quantities called factor score coefficients. Th · · a 1·mear f:as h.ton to comb.me the values of th ese coeffi ctents are used m standardized xi's into factor scores. For example, in the hypothetical dat: example using principal component factors rotated by the varimax method (Table 15.5), the scores for rotated factor 1 are obtained as factor score 1 = -0.076x 1
-
0.053x 2
+ 0.350x 3 + 0.414x 4 + 0.384x 5
The large factor score coefficients for x 3 , x 4 and x 5 correspond to the large factor loadings shown in Table 15.5. Note also that the coefficients of x 1 and x 2 are close to zero and those of x 3 , x 4 and x 5 are each approximately 0.4. Factor score 1 can therefore be approximated by 0.4x 3 + 0.4x 4 + 0.4x 5 = 0.4(x 3 + x 4 + x 5 ), which is proportional to the simple additive factor score given in the previous paragraph. The factor scores can themselves be used as data for additional analyses. Many statistical packages facilitate this by offering the user the option of storing the factor scores in a file to be used for subsequent analyses. Before using the factor scores in other analyses, the user should check for possible outliers either by univariate methods such as box plots or displaying the factor scores two factors at .a time in scatter diagrams. The discussion given in earlier chapters on detection of outliers, and checks for normality and independence apply to the factor scores if they are to be used in later analyses. 15.9 AN APPLICATION OF FACTOR ANALYSIS TO THE DEPRESSION DATA . . ts analysis In Chapter 14 we presented the results of a pnnctpal componen . d the for the 20 CESD items in the depression data set. Table 14.2 bste. tent coefficients for the first five principal components. However, to be consiS(not with published literature on this subject, we now choose m = 4 facto~~cipal 5) and proceed to perform an orthogonal varimax rotation using the prt tated components as the method of factor extraction. The results for the ro factors are given in Table 15.7. that, as In comparing these results ~ith those in Table 14.!, we no~e does tbe expected, rotated factor 1 explams less of the total vanance tba
Table 15.7 Varimax rotation, principal component factors for standardized CESD scale items (depression data set)

                                                            Factor loadings
Item                                                        F1      F2      F3      F4      Communality (h2)

Negative affect
 1. I felt that I could not shake off the blues
    even with the help of my family or friends.            0.638   0.146   0.268   0.280   0.5784
 2. I felt depressed.                                       0.773   0.296   0.272  -0.003   0.7598
 3. I felt lonely.                                          0.726   0.054   0.275   0.052   0.6082
 4. I had crying spells.                                    0.630  -0.061   0.168   0.430   0.6141
 5. I felt sad.                                             0.797   0.172   0.160   0.016   0.6907
 6. I felt fearful.                                         0.624   0.234  -0.018   0.031   0.4448
 7. I thought my life had been a failure.                   0.592   0.157   0.359   0.337   0.6173

Positive affect
 8. I felt that I was as good as other people.              0.093  -0.051   0.109   0.737   0.5655
 9. I felt hopeful about the future.                        0.238   0.033   0.621   0.105   0.4540
10. I was happy.                                            0.557   0.253   0.378   0.184   0.5516
11. I enjoyed life.                                         0.498   0.147   0.407   0.146   0.4569

Somatic and retarded activity
12. I was bothered by things that usually
    don't bother me.                                        0.449   0.389  -0.049  -0.065   0.3600
13. I did not feel like eating; my appetite was poor.       0.070   0.504  -0.173   0.535   0.5760
14. I felt that everything was an effort.                   0.117   0.695   0.180   0.127   0.5459
15. My sleep was restless.                                  0.491   0.419  -0.123  -0.089   0.4396
16. I could not 'get going'.                                0.196   0.672   0.263  -0.070   0.5646
17. I had trouble keeping my mind on what I was doing.      0.270   0.664   0.192   0.000   0.5508
18. I talked less than usual.                               0.409   0.212  -0.026   0.223   0.2628

Interpersonal
19. People were unfriendly.                                -0.015   0.237   0.746  -0.088   0.6202
20. I felt that people disliked me.                         0.358   0.091   0.506   0.429   0.5770

Variance explained                                          4.795   2.381   2.111   1.551   10.838
Percentage                                                  24.0    11.9    10.6     7.8    54.2
In comparing these results with those in Table 14.2, we note that, as expected, rotated factor 1 explains less of the total variance than does the first principal component. The four rotated factors together explain the same proportion of the total variance as do the first four principal components (54.2%). This result occurs because the unrotated factors were the principal components.

In interpreting the factors, we see that factor 1 loads heavily on variables 1-7. These items are known as negative-affect items. Factor 2 loads mainly on items 12-18, which measure somatic and retarded activity. Factor 3 loads on the two interpersonal items, 19 and 20, as well as some positive-affect items (9 and 11). Factor 4 does not represent a clear pattern. Its highest loading is associated with the positive-affect item 8. The factors can thus be loosely identified as negative affect, somatic and retarded activity, interpersonal relations, and positive affect, respectively. Overall, these rotated factors are much easier to interpret than the original principal components.

This example is an illustration of the usefulness of factor analysis in identifying important interrelations among measured variables. Further results and discussion regarding the application of factor analysis to the CESD scale may be found in Radloff (1977) and Clark et al. (1981).

15.10 DISCUSSION OF COMPUTER PROGRAMS

Table 15.8 summarizes the computer output that relates to topics discussed in this chapter from the six statistical programs.

BMDP 4M provides a wide selection of options. In addition to the options listed in Table 15.8, 4M also computes Mahalanobis distances from each case to the centroid of all cases; this distance is useful in detecting outliers. 4M provides a histogram of the eigenvalues instead of the scree plot. This program also presents information on Cronbach's alpha, a widely used statistic for assessing reliability in scales.

SAS FACTOR has a wide selection of methods of extraction and rotation as well as a variety of other options. One choice that is attractive if an investigator is uncertain whether to use an orthogonal or oblique rotation is to specify the promax rotation method. With this option a varimax rotation is obtained first, followed by an oblique rotation, so the two methods can be compared in a single run. SAS also prints residual correlations that are useful in assessing the goodness-of-fit of the factor model. These residual correlations are the differences between the actual correlations among the variables and the correlations among the variables estimated from the factor model.

There is one important difference in the default options of SAS and BMDP for the number of factors retained. If the user does not indicate how many factors to retain, the BMDP default cutoff point for the size of the eigenvalue is equal to one. The default cutoff point in SAS depends on the method of initial extraction. Hence, different numbers of factors may be retained by SAS and BMDP when the default option is used.
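None of this requires a commercial package; the extraction-plus-rotation sequence used in section 15.9 can be approximated with a short NumPy sketch. The code below is a generic textbook varimax algorithm rather than the routine any of the programs in Table 15.8 actually use, it omits refinements such as Kaiser normalization, and `data` and `m` are placeholders for the analyst's own data matrix and choice of the number of factors.

```python
import numpy as np

def principal_component_loadings(data, m):
    """Loadings of the first m principal component factors of the correlation matrix."""
    r = np.corrcoef(data, rowvar=False)
    eigval, eigvec = np.linalg.eigh(r)
    order = np.argsort(eigval)[::-1][:m]          # largest eigenvalues first
    return eigvec[:, order] * np.sqrt(eigval[order])

def varimax(loadings, max_iter=100, tol=1e-6):
    """Plain varimax rotation of a loading matrix (no Kaiser normalization)."""
    p, m = loadings.shape
    rotation = np.eye(m)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        if s.sum() < criterion * (1 + tol):       # stop when the criterion no longer improves
            break
        criterion = s.sum()
    return loadings @ rotation

# rotated_loadings = varimax(principal_component_loadings(data, m=4))
```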
Table 15.8 Summary of computer output for factor analysis

[The table lists, for each item of output, the program that produces it: BMDP (4M), SAS (FACTOR), SPSS (FACTOR), STATA (factor and score) and the Factor modules of STATISTICA and SYSTAT. The rows cover the matrix analyzed (correlation, covariance); the method of initial extraction (principal components, iterative principal factor, maximum likelihood, alpha, image, Harris, least squares, Kaiser's little jiffy); communality estimates (unaltered diagonal, squared multiple correlation, maximum correlation in row, user-supplied estimates); criteria for the number of factors (specified number, minimum eigenvalue, cumulative eigenvalues); orthogonal rotations (varimax, equamax, quartimax, orthomax with gamma, promax); oblique rotations (direct quartimin, promax with gamma, orthoblique, Procrustes, direct oblimin, Harris-Kaiser); factor scores (scores for individuals, saved factor scores, factor score coefficients); and other output (inverse correlation matrix, factor structure matrix, plots of factor loadings, scree plots).]
SPSS also has a wide selection of options, especially for initial extraction methods. They include a test of the null hypothesis that all the correlations among the variables are zero. If this test is not rejected, then it does not make sense to perform factor analysis, as there is no significant correlation to explain.

STATA has somewhat fewer options than BMDP, SAS and SPSS. STATA has an additional method of producing factor scores called Bartlett's method that is intended to produce unbiased factors (see manual).

In STATISTICA, the rotations labeled normalized will yield the results found in most programs. It also includes the centroid method, which is a multiple-group method (Gorsuch, 1983). STATISTICA also produces residual correlations.

SYSTAT is appreciably enlarging its factor analysis program in the next version. Additional planned output includes the maximum likelihood method of extraction and the direct oblimin method of oblique rotation. Plots of the factor loadings will also be included.

In addition to the statistical packages used for exploratory factor analysis mentioned above, there are at least three packages that perform confirmatory factor analysis. The best known is LISREL by Joreskog and Sorbom. Two other programs, which can handle both categorical and continuous data, are EQS 4.0 by Bentler from BMDP (Los Angeles, CA 90025, USA) and LISCOMP 2nd Edition by Muthen (Scientific Software, Inc., Mooresville, IN 46158, USA). A comparison of these and four other confirmatory factor analysis programs is given by Waller (1993).
15.11 WHAT TO WATCH OUT FOR

In performing factor analysis, a linear model is assumed where each variable is seen as a linear function of common factors plus a unique factor. The method tends to yield more understandable results when at least moderate correlations exist among the variables. Also, since a simple correlation coefficient measures the full correlation between two variables when a linear relationship exists between them, factor analysis will fit the data better when only linear relationships exist among the variables. Specific points to look out for include:

1. The original sample should reflect the composition of the target population. Outliers should be screened, and linearity among the variables checked. It may be necessary to use transformations to linearize the relationships among the variables.
2. Do not accept the number of factors produced by the default options without checking that it makes sense. The results can change drastically depending on the numbers of factors used. Note that the choice among
different numbers of factors depends mainly on which is most interpretable to the investigator. The cutoff points for the choice of the number of factors do not take into account the variability of the results and can be considered to be rather arbitrary. It is advised that several different numbers of factors be tried, especially if the first run yields results that are unexpected.
3. Ideally, factor analyses should have at least two variables with non-zero weights per factor. If each factor has only a single variable, it can be considered a unique factor and one might as well be analyzing the original variables instead of the factors.
4. If the investigator expects the factors to be correlated (as might be the case in a factor analysis of items in a scale), then oblique factor analysis should be tried.
5. The ML method assumes multivariate normality. It also can take more computer time than many of the other methods. It is the method to use if statistical tests are desired.
6. Usually, the results of a factor analysis are evaluated by how much sense they make to the investigator rather than by use of formal statistical tests of hypotheses. Gorsuch (1983) and Dillon and Goldstein (1984) present comparisons among the various methods of extraction and rotation that can help the reader in evaluating the results.
7. As mentioned earlier, the factor analysis discussed in this chapter is an exploratory technique that can be an aid to analytical thinking. Statisticians have criticized factor analysis because it has sometimes been used by some investigators in place of sound theoretical arguments. However, when carefully used, for example to motivate or corroborate a theory rather than replace it, factor analysis can be a useful tool to be employed in conjunction with other statistical methods.
SUMMARY

In this chapter we presented the most essential features of exploratory factor analysis, a technique heavily used in social science research. The major steps in factor analysis are factor extraction, factor rotation, and factor score computation. We gave an example using the depression data set that demonstrated this process.

The main value of factor analysis is that it gives the investigator insight into the interrelationships present in the variables and it offers a method for summarizing them. Factor analysis often suggests certain hypotheses to be further examined in future research. On the one hand, the investigator should not hesitate to replace the results of any factor analysis by scientifically based theories derived at a later time. On the other hand, in the absence of such
theory, factor analysis is a convenient and useful tool for searching for relationships among variables.

When factor analysis is used as an exploratory technique, there does not exist an optimal way of performing it. This is particularly true of methods of rotation. We advise the investigator to try various combinations of extraction and rotation of main factors (including oblique methods). Also, different choices of the number of factors should be tried. Subjective judgment is then needed to select those results that seem most appealing, on the basis of the investigator's own knowledge of the underlying subject matter.

Since either factor analysis or principal components analysis can be performed on the same data set, one obvious question is which should be used? The answer to that question depends mainly on the purpose of the analysis. Principal components analysis is performed to obtain a small set of uncorrelated linear combinations of the observed variables that account for as much of the total variance as possible. Since the principal components are linear combinations of observed variables, they are sometimes called manifest variables. They can be thought of as being obtained by a rotation of the axes.

In factor analysis, each observation is assumed to be a linear combination of the unobservable common factors and the unique factor. Here the user is assuming that underlying factors, such as intelligence or attitude, result in what is observed. The common factors are not observable and are sometimes called latent factors. When the iterative principal components method of extraction is performed, the variances of the observations are ignored and the off-diagonal covariances (correlations) are used. (This is contrasted with principal components, which are computed using the variances and the covariances.) Factor analysis attempts to explain the correlations among the variables with as few factors as possible. In general, the loadings for the factor analysis will be proportional to the principal component coefficients if the communalities are approximately equal. This often occurs when all the communalities are large.

In this chapter and the previous one, we discussed several situations in which factor analysis and principal components analysis can be useful. In some cases, both techniques can be used and are recommended. In other situations, only one technique is applicable. The above discussion, and a periodic scan of the literature, should help guide the reader in this part of statistics which may be considered more 'art' than 'science'.

REFERENCES

References preceded by an asterisk require strong mathematical background.

Afifi, A.A. and Azen, S.P. (1979). Statistical Analysis: A Computer Oriented Approach, 2nd edn. Academic Press, New York.
*Bartholomew, D.J. (1987). Latent Variable Models and Factor Analysis. Charles Griffin and Company, London.
Clark, V.A., Aneshensel, C.S., Frerichs, R.R. and Morgan, T.M. (1981). Analysis of effects of sex and age in response to items on the CES-D scale. Psychiatry Research, 5, 171-81.
Dillon, W.R. and Goldstein, M. (1984). Multivariate Analysis: Methods and Applications. Wiley, New York.
Everitt, B.S. (1984). An Introduction to Latent Variable Models. Chapman & Hall, London.
*Gorsuch, R.L. (1983). Factor Analysis, 2nd edn. L. Erlbaum Associates, Hillsdale, NJ.
*Harmon, H.H. (1976). Modern Factor Analysis. University of Chicago Press, Chicago.
*Lawley, D.N. and Maxwell, A.E. (1971). Factor Analysis as a Statistical Method, 2nd edn. Elsevier, New York.
Long, J.S. (1983). Confirmatory Factor Analysis. Sage, Newbury Park, CA.
Mulaik, S.A. (1972). The Foundations of Factor Analysis. McGraw-Hill, New York.
Radloff, L.S. (1977). The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1, 385-401.
Reyment, R.A. and Joreskog, K.G. (1993). Applied Factor Analysis in the Natural Sciences. Cambridge University Press, New York.
Thorndike, R.M. (1978). Correlational Procedures for Research. Gardner Press, New York.
Waller, N.G. (1993). Seven confirmatory factor analysis programs - EQS, EZPATH, LINCS, LISCOMP, LISREL-7, SIMPLIS, and CALIS. Applied Psychological Measurement, 11, 73-100.
FURTHER READING

Kim, J.O. and Mueller, C.W. (1978a). Introduction to Factor Analysis. Sage, Beverly Hills, CA.
Kim, J.O. (1978b). Factor Analysis. Sage, Beverly Hills, CA.
Long, J.S. (1983). Covariance Structural Models: An Introduction to LISREL. Sage, Newbury Park, CA.
Yates, A. (1987). Multivariate Exploratory Data Analysis: A Perspective on Exploratory Factor Analysis. State University of New York Press, Albany, NY.
PROBLEMS

15.1 The CESD scale items (C1-C20) from the depression data set in Chapter 3 were used to obtain the factor loadings listed in Table 15.7. The initial factor solution was obtained from the principal components method, and a varimax rotation was performed. Analyze this same data set by using an oblique rotation such as the direct quartimin procedure. Compare the results.
15.2 Repeat the analysis of Problem 15.1 and Table 15.7, but use an iterated principal factor solution instead of the principal components method. Compare the results.
15.3 Another method of factor extraction, maximum likelihood, was mentioned in section 15.6 but not discussed in detail. Use one of the packages which offers this option to analyze the data along with an oblique and orthogonal rotation. Compare the results with those in Problems 15.1 and 15.2, and comment on the tests of hypotheses produced by the program, if any.
15.4 Perform a factor analysis on the data in Table 8.1 and explain any insights this factor analysis gives you.
15.5 For the data generated in Problem 7.7, perform four factor analyses, using two different initial extraction methods and both orthogonal and oblique rotations. Interpret the results.
15.6 Separate the depression data set into two subgroups, men and women. Using four factors, repeat the factor analysis in Table 15.7. Compare the results of your two factor analyses to each other and to the results in Table 15.7.
15.7 For the depression data set, perform four factor analyses on the last seven variables DRINK-CHRONILL (Table 3.4). Use two different initial extraction methods, and both orthogonal and oblique rotations. Interpret the results. Compare the results to those from Problem 14.1.
16 Cluster analysis
16.1 USING CLUSTER ANALYSIS TO GROUP CASES

Cluster analysis is usually done in an attempt to combine cases into groups when the group membership is not known prior to the analysis. Section 16.2 presents examples of the use of cluster analysis. Section 16.3 introduces two data examples to be used in the remainder of the chapter to illustrate cluster analysis. Section 16.4 describes two exploratory graphical methods for clustering and defines several distance measures that can be used to measure how close two cases are to each other. Section 16.5 discusses two widely used methods of cluster analysis: hierarchical (or join) clustering and K-means clustering. In section 16.6, these two methods are applied to the second data set introduced in section 16.3. A discussion and listing of the options in five statistical packages are given in section 16.7. Section 16.8 presents what to watch out for in performing cluster analysis.
16.2 WHEN IS CLUSTER ANALYSIS USED?
Cluster analysis is a technique for grouping individuals or objects into unknown groups. It differs from other methods of classification, such as discriminant analysis, in that in cluster analysis the number and characteristics of the groups are to be derived from the data and are not usually known prior to the analysis.

In biology, cluster analysis has been used for decades in the area of taxonomy. In taxonomy, living things are classified into arbitrary groups on the basis of their characteristics. The classification proceeds from the most general to the most specific, in steps. For example, classifications for domestic roses and for modern humans are illustrated in Figure 16.1 (Wilson et al., 1973). The most general classification is kingdom, followed by phylum, subphylum, etc. The use of cluster analysis in taxonomy is explained by Sneath and Sokal (1973).

Cluster analysis has been used in medicine to assign patients to specific diagnostic categories on the basis of their presenting symptoms and signs.
A. Modern humans
   KINGDOM: Animalia (animals)
   PHYLUM: Chordata (chordates)
   SUBPHYLUM: Vertebrata (vertebrates)
   CLASS: Mammalia (mammals)
   ORDER: Primates (primates)
   FAMILY: Hominidae (humans and close relatives)
   GENUS: Homo (modern humans and precursors)
   SPECIES: sapiens (modern humans)

B. Domestic rose
   KINGDOM: Plantae (plants)
   PHYLUM: Tracheophyta (vascular plants)
   SUBPHYLUM: Pteropsida (ferns and seed plants)
   CLASS: Dicotyledoneae (dicots)
   ORDER: Rosales (saxifrages, pittosporums, sweet gum, plane trees, roses, and relatives)
   FAMILY: Rosaceae (cherry, plum, hawthorn, roses, and relatives)
   GENUS: Rosa (roses)
   SPECIES: gallica (domestic roses)

Figure 16.1 Example of Taxonomic Classification.
In particular, cluster analysis has been used in classifying types of depression (e.g. Andreasen and Grove, 1982). It has also been used in anthropology to classify stone tools, shards or fossil remains by the civilization that produced them. Consumers can be clustered on the basis of their choice of purchases in marketing research. In short, it is possible to find applications of cluster analysis in virtually any field of research.

We point out that cluster analysis is highly empirical. Different methods can lead to different groupings, both in number and in content. Furthermore, since the groups are not known a priori, it is usually difficult to judge whether the results make sense in the context of the problem being studied.

It is also possible to cluster the variables rather than the cases. Clustering of variables is sometimes used in analyzing the items in a scale to determine which items tend to be close together in terms of the individual's response to them. Clustering variables can be considered an alternative to factor analysis, although the output is quite different. Some statistical packages allow the user to either cluster cases or variables in the same program while
others have separate programs for clustering cases and variables (section 16.7). In this chapter, we discuss in detail only clustering of cases.

16.3 DATA EXAMPLE

A hypothetical data set was created to illustrate several of the concepts discussed in this chapter. Figure 16.2 shows a plot of five observations for the two variables X1 and X2. This small data set will simplify the presentation since the analysis can be performed by hand.

Another data set we will use includes financial performance data from the January 1981 issue of Forbes. The variables used are those defined in section 8.3. Table 16.1 shows the data for 25 companies from three industries: chemical companies (the first 14 of the 31 discussed in section 8.3), health care companies and supermarket companies. The column labeled 'Type' in Table 16.1 lists the abbreviations Chem, Heal and Groc for these three industries. In section 16.6 we will use two clustering techniques to group these companies and then check the agreement with their industrial type. These three industries were selected because they represent different stages of growth, different product lines, different management philosophies, different labor and capital requirements, etc. Among the chemical companies all of the large diversified firms were selected. From the major supermarket chains, the top six rated
Figure 16.2 Plot of Hypothetical Cluster Data Points.
Table 16.1 Financial performance data for diversified chemical, health and supermarket companies

Type  Symbol  Obs.  ROR5   D/E   SALESGR5  EPS5   NPM1  P/E   PAYOUTR1
Chem  dia      1    13.0   0.7     20.2    15.5   7.2    9    0.426398
Chem  dow      2    13.0   0.7     17.2    12.7   7.3    8    0.380693
Chem  stf      3    13.0   0.4     14.5    15.1   7.9    8    0.406780
Chem  dd       4    12.2   0.2     12.9    11.1   5.4    9    0.568182
Chem  uk       5    10.0   0.4     13.6     8.0   6.7    5    0.324544
Chem  psm      6     9.8   0.5     12.1    14.5   3.8    6    0.508083
Chem  gra      7     9.9   0.5     10.2     7.0   4.8   10    0.378913
Chem  hpc      8    10.3   0.3     11.4     8.7   4.5    9    0.481928
Chem  mtc      9     9.5   0.4     13.5     5.9   3.5   11    0.573248
Chem  acy     10     9.9   0.4     12.1     4.2   4.6    9    0.490798
Chem  cz      11     7.9   0.4     10.8    16.0   3.4    7    0.489130
Chem  ald     12     7.3   0.6     15.4     4.9   5.1    7    0.272277
Chem  rom     13     7.8   0.4     11.0     3.0   5.6    7    0.315646
Chem  rei     14     6.5   0.4     18.7    -3.1   1.3   10    0.384000
Heal  hum     15     9.2   2.7     39.8    34.4   5.8   21    0.390879
Heal  hca     16     8.9   0.9     27.8    23.5   6.7   22    0.161290
Heal  nme     17     8.4   1.2     38.7    24.6   4.9   19    0.303030
Heal  ami     18     9.0   1.1     22.1    21.9   6.0   19    0.303318
Heal  ahs     19    12.9   0.3     16.0    16.2   5.7   14    0.287500
Groc  lks     20    15.2   0.7     15.3    11.6   1.5    8    0.598930
Groc  win     21    18.4   0.2     15.0    11.6   1.6    9    0.578313
Groc  sgl     22     9.9   1.6      9.6    24.3   1.0    6    0.194946
Groc  slc     23     9.9   1.1     17.9    15.3   1.6    8    0.321070
Groc  kr      24    10.2   0.5     12.6    18.0   0.9    6    0.453731
Groc  sa      25     9.2   1.0     11.6     4.5   0.8    7    0.594966

Mean                10.45  0.70    16.79   13.18  4.30  10.16 0.408
Standard deviation   2.65  0.54     7.91    8.37  2.25   4.89 0.124
for return on equity were included. In the health care industry four of the five companies included were those connected with hospital management; the remaining company involves hospital supplies and equipment.

16.4 BASIC CONCEPTS: INITIAL ANALYSIS AND DISTANCE MEASURES

In this section we present some preliminary graphical techniques for clustering. Then we discuss distance measures that will be useful in later sections.
Scatter diagrams

Prior to using any of the analytical clustering procedures (section 16.5), most investigators begin with simple graphical displays of their data. In the case of two variables a scatter diagram can be very helpful in displaying some of the main characteristics of the underlying clusters. In the hypothetical data example shown in Figure 16.2, the points closest to each other are points 1 and 2. This observation may lead us to consider these two points as one cluster. Another cluster might contain points 3 and 4, with point 5 perhaps constituting a third cluster. On the other hand, some investigators may consider points 3, 4 and 5 as the second cluster. This example illustrates the indeterminacy of cluster analysis, since even the number of clusters is usually unknown. Note that the concept of closeness was implicitly used in defining the clusters. Later in this section we expand on this concept by presenting several definitions of distance.

If the number of variables is small, it is possible to examine scatter diagrams of each pair of variables and search for possible clusters. But this technique may become unwieldy if the number of variables exceeds four, particularly if the number of points is large.

Profile diagram
A helpful technique for a moderate number of variables is a profile diagram. To plot a profile of an individual case in the sample, the investigator customarily first standardizes the data by subtracting the mean and dividing by the standard deviation for each variable. However, this step is omitted by some researchers, especially if the units of measurement of the variables are comparable. In the financial data example the units are not the same, so standardization seems helpful. The standardized financial data for the 25 companies are shown in Table 16.2. A profile diagram, as shown in Figure 16.3, lists the variables along the horizontal axis and the standardized value scale along the vertical axis. Each point on the graph indicates the value of the corresponding variable. The profile for the first company in the sample has been plotted in Figure 16.3. The points are connected in order to facilitate
Table 16.2 Standardized financial performance data for diversified chemical, health and supermarket companies

Type  Symbol  Obs.   ROR5    D/E    SALESGR5  EPS5    NPM1    P/E     PAYOUTR1
Chem  dia      1     0.963  -0.007    0.431    0.277   1.289  -0.237    0.151
Chem  dow      2     0.963  -0.007    0.052   -0.057   1.334  -0.442   -0.193
Chem  stf      3     0.963  -0.559   -0.290    0.230   1.601  -0.442   -0.007
Chem  dd       4     0.661  -0.927   -0.492   -0.248   0.488  -0.237    1.291
Chem  uk       5    -0.171  -0.559   -0.403   -0.618   1.067  -1.056   -0.668
Chem  psm      6    -0.246  -0.375   -0.593    0.158  -0.224  -0.851    0.807
Chem  gra      7    -0.209  -0.375   -0.833   -0.737   0.221  -0.033   -0.231
Chem  hpc      8    -0.057  -0.743   -0.681   -0.534   0.087  -0.237    0.597
Chem  mtc      9    -0.360  -0.559   -0.416   -0.869  -0.358   0.172    1.331
Chem  acy     10    -0.209  -0.559   -0.593   -1.072   0.132  -0.237    0.668
Chem  cz      11    -0.964  -0.559   -0.757    0.337  -0.402  -0.647    0.655
Chem  ald     12    -1.191  -0.191   -0.176   -0.988   0.354  -0.647   -1.089
Chem  rom     13    -1.002  -0.559   -0.732   -1.215   0.577  -0.647   -0.740
Chem  rei     14    -1.494  -0.559    0.241   -1.943  -1.337  -0.033   -0.190
Heal  hum     15    -0.473   3.672    2.908    2.534   0.666   2.218   -0.135
Heal  hca     16    -0.587   0.361    1.366    1.233   1.067   2.422   -1.981
Heal  nme     17    -0.775   0.913    2.769    1.364   0.265   1.809   -0.841
Heal  ami     18    -0.549   0.729    0.671    1.042   0.755   1.809   -0.839
Heal  ahs     19     0.925  -0.743   -0.100    0.361   0.621   0.786   -0.966
Groc  lks     20     1.794  -0.007   -0.189   -0.188  -1.248  -0.442    1.538
Groc  win     21     3.004  -0.927   -0.226   -0.188  -1.204  -0.237    1.372
Groc  sgl     22    -0.209   1.649   -0.909    1.328  -1.471  -0.851   -1.710
Groc  slc     23    -0.209   0.729    0.140    0.254  -1.204  -0.442   -0.696
Groc  kr      24    -0.095  -0.375   -0.530    0.576  -1.515  -0.851    0.370
Groc  sa      25    -0.473   0.545   -0.656   -1.036  -1.560  -0.647    1.506

Source: Forbes, vol. 121, no. 1 (January 5, 1981).
Figure 16.3 Profile Diagram of a Chemical Company (dia) Using Standardized Financial Performance Data.
the visual interpretation. We see that this company hovers around the mean, being at most 1.3 standard deviations away on any variable.

A preliminary clustering procedure is to graph the profiles of all cases on the same diagram. To illustrate this procedure, we plotted the profiles of seven companies (15-21) in Figure 16.4. To avoid unnecessary clutter, we reversed the sign of the values of ROR5 and PAYOUTR1. Using the original data on these two variables would have caused the lines connecting the points to cross each other excessively, thus making it difficult to identify single company profiles. Examining Figure 16.4, we note the following.
1. Companies 20 and 21 are very similar.
2. Companies 16 and 18 are similar.
Figure 16.4 Profile Plot of Health and Supermarket Companies with Standardized Financial Performance Data. (Legend: filled circle, hospital company; cross, supermarket company (20 and 21).)
3. Companies 15 and 17 are similar.
4. Company 19 stands alone.

Thus it is possible to identify the above four clusters. It is also conceivable to identify three clusters: (15, 16, 17, 18), (20, 21) and (19). These clusters are consistent with the types of the companies, especially noting that company 19 deals with hospital supplies.

Although this technique's effectiveness is not affected by the number of variables, it fails when the number of observations is large. In Figure 16.4
the impression is clear, because we plotted only seven companies. Plotting all 25 companies would have produced too cluttered a picture.
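For readers who want to draw such profile plots themselves, the following matplotlib sketch standardizes a data matrix and plots one connected line per case. It is only an illustration; the function name, marker choice and the idea of passing a list of company labels are ours rather than anything prescribed by the packages discussed in this chapter.

```python
import numpy as np
import matplotlib.pyplot as plt

def profile_plot(data, case_labels, var_names):
    """Profile diagram: variables on the horizontal axis, standardized
    values on the vertical axis, one connected line per case."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    x = np.arange(len(var_names))
    for row, label in zip(z, case_labels):
        plt.plot(x, row, marker="o", label=str(label))
    plt.xticks(x, var_names, rotation=45)
    plt.axhline(0.0, linewidth=0.5)      # zero line: the mean of each standardized variable
    plt.ylabel("Standardized value")
    plt.legend()
    plt.show()
```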
Distance measures

For a large data set, analytical methods such as those described in the next section are necessary. All of these methods require defining some measure of closeness or similarity of two observations. The converse of similarity is distance. Before defining distance measures, though, we warn the investigator that many of the analytical techniques are particularly sensitive to outliers. Some preliminary checking for outliers and blunders is therefore advisable. This check may be facilitated by the graphical methods just described.

The most commonly used distance measure is the Euclidian distance. In two dimensions, suppose that two points have coordinates (X11, X21) and (X12, X22), respectively. Then the Euclidian distance between the two points is defined as
distance = [(X11 - X12)^2 + (X21 - X22)^2]^1/2

For example, the distance between points 4 and 5 in Figure 16.2 is

distance(points 4, 5) = [(5 - 7)^2 + (8 - 5)^2]^1/2 = 13^1/2 = 3.61
For P variables the Euclidian distance is the square root of the sum of the squared differences between the coordinates of each variable for the two observations. Unfortunately, the Euclidian distance is not invariant to changes in scale, and the results can change appreciably by simply changing the units of measurement. In computer program output the distances between all possible pairs of points are usually summarized in the form of a matrix. For example, for our hypothetical data, the Euclidian distances between the five points are as given in Table 16.3. Since the distance between a point and itself is zero, the diagonal elements of this matrix are always zero. Also, since the distances are symmetric, many programs print only the distances above or below the diagonal.

Table 16.3 Euclidian distance between five hypothetical points
       1      2      3      4      5
1      0      1.00   5.39   7.21   8.06
2      1.00   0      4.47   6.40   7.62
3      5.39   4.47   0      2.24   5.10
4      7.21   6.40   2.24   0      3.61
5      8.06   7.62   5.10   3.61   0
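A distance matrix of this kind is easy to compute with SciPy. The sketch below is purely illustrative: only points 4 and 5 have coordinates quoted in the text, so the array here contains just those two points; the same call would be applied to the full five-point data read from Figure 16.2.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# (X1, X2) coordinates; points 4 and 5 are taken from the worked example above
points = np.array([
    [5.0, 8.0],   # point 4
    [7.0, 5.0],   # point 5
])

dist = squareform(pdist(points, metric="euclidean"))
print(np.round(dist, 2))   # off-diagonal entry 3.61, as in the text
# metric="cityblock" would give the city-block distance discussed next
```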
Since the square root operation does not change the order of how close the points are to each other, some programs use the sum of the squared differences instead of the Euclidian distance (i.e. they don't take the square root). Another option available in some programs is to replace the squared differences by another power of the absolute differences. For example, if the power of 1 is chosen, the distance is the sum of the absolute differences of the coordinates. This distance is the so-called city-block distance. In two dimensions it is the distance you must walk to get from one point to another in a city divided into rectangular blocks.

Several other definitions of distance exist (Gower, 1971). Here we will give only one more commonly used definition, the Mahalanobis distance discussed earlier in Chapter 11. In effect the Mahalanobis distance is a generalization of the idea of standardization. The squared Euclidian distance based on standardized variables is the sum of the squared differences, each divided by the appropriate variance. When the variables are correlated, a distance can be defined to take this correlation into account. The Mahalanobis distance does just that. For two variables X1 and X2, suppose that the sample variances are S1^2 and S2^2, respectively, and that the correlation is r. The squared Euclidian distance based on the original values is

(Euclidian distance)^2 = (X11 - X12)^2 + (X21 - X22)^2

The same quantity based on standardized variables is

(standardized Euclidian distance)^2 = (X11 - X12)^2/S1^2 + (X21 - X22)^2/S2^2
If r = 0, then the last quantity is also the Mahalanobis distance. If r is not 0, the Mahalanobis distance is

D^2 = [1/(1 - r^2)] [(X11 - X12)^2/S1^2 + (X21 - X22)^2/S2^2 - 2r(X11 - X12)(X21 - X22)/(S1 S2)]
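The two-variable formula translates directly into code. The sketch below is our own illustration with made-up standard deviations and correlation, not values taken from the data sets in this chapter.

```python
def mahalanobis_sq_two_vars(x, y, s1, s2, r):
    """Squared Mahalanobis distance D**2 between cases x and y, each a pair
    (variable 1, variable 2), using the two-variable formula above."""
    d1, d2 = x[0] - y[0], x[1] - y[1]
    return (d1**2 / s1**2 + d2**2 / s2**2
            - 2.0 * r * d1 * d2 / (s1 * s2)) / (1.0 - r**2)

# Hypothetical standard deviations and correlation (not from the text):
print(mahalanobis_sq_two_vars((5.0, 8.0), (7.0, 5.0), s1=2.0, s2=3.0, r=0.5))
# With r = 0 the result reduces to the squared standardized Euclidian distance.
```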
For more than two variables the Mahalanobis distance is easily defined in terms of vectors and matrices (e.g. Afifi and Azen, 1979). Some computer programs, such as BMDP KM, offer the Mahalanobis distance as an option, with estimates of the sample variances and correlations obtained from the data. It is noted that, strictly speaking, the within-group sample covariance matrix should be used. However, before the investigator finds clusters, no such estimates exist, and the total group covariance matrix is the only one available. When the latter is used, the computed Mahalanobis distance is then somewhat different from that presented in Chapter 11.

One method for obtaining the within-cluster sample covariance matrix is to recompute it after each assignment of observations to each affected cluster.
Thus, the within-cluster covariance matrix changes as the computation proceeds. These within-cluster covariance matrices are then pooled across clusters. This is an available option in the BMDP KM program. In the ACECLUS procedure in SAS, a different approach is used. Using a special algorithm, the within-cluster sample covariance matrix is estimated without knowledge of which points are in which clusters (see the discussion of the AGK method in the manual). In using this method it is assumed that the clusters are multivariately normally distributed with equal covariance matrices.

In most situations different distance measures yield different distance matrices, in turn leading to different clusters. When variables have different units, it is advisable to standardize the data before computing distances. Further, when high positive or negative correlations exist among the variables, it may be helpful to consider techniques of standardization that take the covariances as well as the variances into account. An example is given in the ACECLUS write-up using real data where actual cluster membership is known. It is therefore possible to say which observations are misclassified, and the results are striking.

Several rather different types of 'distance' measures are available in STATISTICA and SYSTAT, namely measures relating to the correlation between variables (if you are clustering variables) or between cases (if you are clustering cases). One measure is 1 - rij, where rij is the Pearson correlation between the ith variable (case) and the jth variable (case). Two variables (cases) that are highly correlated are thus assigned a small distance between them. For a discussion of problems that may occur in interpreting this measure see Dillon and Goldstein (1984). Other measures of association, not discussed in this book, are available in these and other packages for clustering nominal and ordinal data.

In the next section we discuss two of the more commonly used analytical cluster techniques. These techniques make use of the distance functions defined.

16.5 ANALYTICAL CLUSTERING TECHNIQUES

The commonly used methods of clustering fall into two general categories: hierarchical and nonhierarchical. First, we discuss the hierarchical techniques.

Hierarchical clustering

Hierarchical methods can be either agglomerative or divisive. In the agglomerative methods we begin with N clusters, i.e. each observation constitutes its own cluster. In successive steps we combine the two closest clusters, thus reducing the number of clusters by one in each step. In the final step all observations are grouped into one cluster. In some statistical packages the agglomerative method is called Join. In divisive methods we
begin with one cluster containing all of the observations. In successive steps we split off the cases that are most dissimilar to the remaining ones. Most of the commonly used programs are of the agglomerative type, and we therefore do not discuss divisive methods further.

The centroid procedure is a widely used example of an agglomerative method. In the centroid method the distance between two clusters is defined as the distance between the group centroids (the centroid is the point whose coordinates are the means of all the observations in the cluster). If a cluster has one observation, then the centroid is the observation itself. The process proceeds by combining groups according to the distance between their centroids, the groups with the shortest distance being combined first.

The centroid method is illustrated in Figure 16.5 for our hypothetical data. Initially, the closest two centroids (points) of the five hypothetical observations plotted in Figure 16.2 are points 1 and 2, so they are combined first and their centroid is obtained in step 1. In step 2, centroids (points) 3 and 4 are combined (and their centroid is obtained), since they are the closest now that points 1 and 2 have been replaced by their centroid. At step 3 the centroid of points 3 and 4 and centroid (point) 5 are combined, and the centroid is
x"2 8
7 6 5
5 (1,2,3,4,5) Step 4
4
3 2
2 • Centroids
1
0 .
.
.
•
Figure 16.5 Hierarchical Cluster Analysts Usmg Unstandardtzed Hypothettca1 Set.
pata
obtained. Finally, at the last step the centroid of points 1 and 2 and the centroid of points 3, 4 and 5 are combined to form a single group.

Figure 16.6 illustrates the clustering steps based on the standardized hypothetical data. The results are identical to the previous ones, although this is not the case in general. We could also have used the city-block distance. This distance is available by that name in many programs. As noted earlier in the discussion of distance measures, this distance can also be obtained from 'power' measures.

Several other methods can be used to define the distance between two clusters. These are grouped in the computer programs under the heading of
Figure 16.6 Hierarchical Cluster Analysis Using Standardized Hypothetical Data Set.
linkage methods. In these programs, the linkage distance is the distance between two clusters defined according to one of these methods.

With two variables and a large number of data points, the representation of the steps in a graph similar to Figure 16.5 can get too cluttered to interpret. Also, if the number of variables is more than two, such a graph is not feasible. A clever device called the dendrogram or tree graph has therefore been incorporated into packaged computer programs to summarize the clustering at successive steps. The dendrogram for the hypothetical data set is illustrated in Figure 16.7. The horizontal axis lists the observations in a particular order. In this example the natural order is convenient. The vertical axis shows the successive steps. At step 1 points 1 and 2 are combined. Similarly, points 3 and 4 are combined in step 2; point 5 is combined with cluster (3, 4) in step 3; and finally clusters (1, 2) and (3, 4, 5) are combined. In each step two clusters are combined.

A tree graph can be obtained from BMDP 2M, SPSS CLUSTER, STATISTICA Join and SYSTAT Join. It can be obtained from SAS by calling the TREE procedure, using the output from the CLUSTER program. In SAS TREE the number of clusters is printed on the vertical axis instead of the step number shown in Figure 16.7. SPSS and STATISTICA provide vertical tree plots called icicles. Vertical icicle plots look like a row of icicles hanging from the edge of a roof. The first cases to be joined at the bottom of the graph are listed side by side and will form a thicker icicle. Moving upward, icicles are joined to form successively thicker ones until in the last step all the cases are joined together at the edge of the roof.
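Outside these packages, agglomerative clustering and dendrograms are available in SciPy. The sketch below is an illustration only: the data matrix `z` is a placeholder standing in for the standardized observations (cases in rows), and the centroid linkage named here mirrors the centroid method described above rather than any particular package's default.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Placeholder data: rows are cases, columns are standardized variables.
rng = np.random.default_rng(0)
z = rng.standard_normal((25, 7))

# Agglomerative clustering; 'centroid' corresponds to the centroid method,
# while 'single', 'complete', 'average' and 'ward' are other linkage choices.
merges = linkage(z, method="centroid", metric="euclidean")

dendrogram(merges, labels=list(range(1, 26)))
plt.ylabel("Distance between merged clusters")
plt.show()
```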
Figure 16.7 Dendrogram for Hierarchical Cluster Analysis of Hypothetical Data Set.
The programs offer additional options for combining clusters. The manuals or Help statements should be read for descriptions of these various options.

Hierarchical procedures are appealing in a taxonomic application. Such procedures can be misleading, however, in certain situations. For example, an undesirable early combination persists throughout the analysis and may lead to artificial results. The investigator may wish to perform the analysis several times after deleting certain suspect observations. For large sample sizes the printed dendrograms become very large and unwieldy to read. One statistician noted that they were more like wallpaper than comprehensible results for large N.

An important problem is how to select the number of clusters. No standard objective procedure exists for making the selection. The distances between clusters at successive steps may serve as a guide. The investigator can stop when the distance exceeds a specified value or when the successive differences in distances between steps make a sudden jump. Also, the underlying situation may suggest a natural number of clusters. If such a number is known, a particularly appropriate technique is the K-means clustering technique of MacQueen (1967).
The K-means clustering is a popular nonhierarchical clustering technique. For a specified number of clusters K the basic algorithm proceeds in the following steps. 1. Divide the data into K initial clusters. The members of these clusters may be specified by the user or rna y be selected by the program,. according to an arbitrary procedure. 2. Calculate the means or centroids of the K clusters. 3. For a given case, calculate its distance to each centroid. If the case is closest to the centroid of its own cluster, leave it in that cluster; otherwise, reassign it to the cluster whose centroid is closest to it. 4. Repeat step 3 for each case. S; Repeat steps 2, 3 and 4 until no cases are reassigned.
d Individual programs implement the basic algorithm in different ways. The ctfault option of BMD~ KM begins b~ cons~d~ring all o~ the. data as one T~ster. For the hypothetical data set thts step ts Illustrated m Ftgure 16.8(a). e program then searches for the variable with the highest variance, in this ~seX 1· The original cluster is now split into two clusters, using the midrange 1 as the dividing point, as shown in Figure 16,8(b). If the data are standardized, then each variable has a variance of one. In that case the ~ariance with the smallest range is selected to make the split. The program, In general, proceeds in this manner by further splitting the clusters until the
396
Cluster analysis
x2
Xz
8
8
7 6
7 6
5
5
4 3
4 3
2
2.
1
0
1
2
3
4
5 6 7 8
XI
a. Start with All Points in One Cluster
0
I
2 3 4
5
6. 7
8 XI
b. Cluster Is Split into 1\vo Clusters at Midrange of X1 (Variable with Largest Variance)
Xz 8
7 6
5 4
d. Every Point Is Now Closest to Centroid of Its Own Cluster; Stop
3 2
1 0
1
2
3 4
5 6
7
8
xt
c. Point 3 Is Closer to Centroid of Cluster (1,2,3) and Stays assigned to Cluster (1 ,2,3)
Figure 16.8 Successive Steps inK-means Cluster Analysis forK= 2, Using BDMP KM for Hypothetical Data Set with Default Options.
specified number K is achieved. That is, it successively finds that particular variable and the cluster producing the largest variance and splits that clust~r accordingly, until K clusters are obtained. At this stage, step 1 of the basiC algorithm is completed and the program proceeds with the other steps.
Analytical clustering techniques
397
For the hypothetical data example with K = 2, it is seen that every case already belongs to the cluster whose centroid is closest to it (Figure 16.8(c)). For example, point 3 is closer to the centroid of cluster (1, 2, 3) than it is to the centroid of cluster (4, 5). Therefore it is not reassigned. Similar results bold for the other cases. Thus the algorithm stops, with the two clusters being selected being cluster (1, 2, 3) and cluster (4, 5). The SAS procedure FASTCLUS is recommended especially for large data sets. The user specifies the maximum number of clusters allowed, and the program starts by first selecting cluster 'seeds', which are used as initial guesses of the means of the clusters. The first observation with no missing values in the data set is selected as the first seed. The next complete observation that is separated from the first seed by an optional, specified minimum distance becomes the second seed (the default minimum distance is zero). The program then proceeds to assign each observation to the cluster with the nearest seed. The user can decide whether to update the cluster seeds by cluster means each time an observation is assigned, using the DRIFT option, or only after all observations are assigned. Limiting seed replacement results in the program using less computer time. Note that the initial, and possibly the final, results depend on the order of the observations in the data set. Specifying the initial cluster seeds can lessen this dependence. Finally, we note that BMDP KM also permits the user to specify seed points. For either program judicious choice of the seeds or leaders can improve the results. SPSS QUICK CLUSTER program can cluster a large number of cases quickly and with minimal computer resources. Different initial procedures are employed depending upon whether the user knows the cluster center or not. If the user does not specify the number of clusters, the program assumes that two clusters are desired. As mentioned earlier, the selection of the number of clusters is a troublesome problem. If no natural choice of K is available, it is useful to try various values and compare the results. One aid to this examination is a plot of the points in the clusters, labeled by the cluster they belong to. If the number of variables is more than two, some two-dimensional plots are available from certain programs, such as BMDP KM and SAS CLUSTER. ACECLUS or FASTCLUS. Another possibility is to plot the first two Principal components, labeled by cluster membership. A second aid is to examine the F ratio for testing the hypothesis that the cluster means are equal. An alternative possibility is to supply the data with each duster as a group as input to a K group discriminant analysis program. Such a program ~ould then produce a test statistic for the equality of means of all variables sunultaneously, as was discussed in Chapter 11. Then for each variable, separately or as a group, comparison of the P values for various values of 1( may be a helpful indication of the value of K to be selected. Note, however, that these P values are valid only for comparative purposes and not as
398
Cluster analysis
significance levels in a hypothesis-testing sense, even if the normalit assumption is justified. The individual F statistics for each variable indicaty the ~elative importance. of the variable~ in d~termining cluster membershi; (agam, they are not vahd for hypothesis testmg). Since cluster anal~sis is. an e~piri~al techniq~e~ it may be ~dvisab~e to try several approaches m a given situatiOn. In addition to the hierarchical and K-means approaches discussed above, several other methods are available in the literature cited in the reference and further reading sections. It should be noted that the K ..;means approach is gaining acceptability in the literature over the hierarchical approach. In any case, unless the underlying clusters are clearly separated, different methods can produce widely different results. Even with the same program, different options can produce quite different results. It may also be appropriate to perform other analyses prior to doing a cluster analysis. For example, a principal components analysis may be performed on the entire data set and a small number of independent components selected. A cluster analysis can be run on the values of the selected components. Similarly, factor analysis could be performed first, followed by cluster analysis offactor scores. This procedure has the advantage of reducing the original number of variables and making the interpretations possibly easier, especially if meaningful factors have been selected. 16.6 CLUSTER ANALYSIS FOR FINANCIAL DATA SET In this section we apply some of the standard procedures to the financial performance data set shown in Table 16.1. In all our runs the data are first standardized as shown in Table 16.2. Recall that in cluster analysis the total sample is considered as a single sample. Thus the information on type of company is not used to derive the dusters. However, this information will be used to interpret the results of the various analyses. Hierarchical clustering The dendrogram or tree is shown in Figure 16.9. Default options including the centroid method with Euclidian distance were used with the standardized data. The horizontal axis lists the observation numbers in a particular order, which prevents the lines in the dendrogram froin crossing each other. One result of this arrangement is that certain subgroups appearing near each other on the horizontal axis constitute clusters at various steps. Note tha~ the distance is shown on the right vertical axis. These distances are measure being the centers of the two clusters just joined. On the left vertical axis the number of clusters is listed. In the figure, companies 1, 2 and 3 form a sin~Ie. cluster, with the gr?upin~ being completed when there are 22 clusters. Stmtlarly, at the oppostte en
Cluster analysis for financial data set
I
U)
J.<
$
U)
1
1-
·~
[)
.... 0
J.<
" E
.a ~
z
2 3 t--1-
4
-
1.36
-
-
~
1-
r-.._
-
.--'-
1-
-
1-
-
1-
....
1-
15 ,.... 16 .....
l _..._
-
17 18 19 20 r21 r-
-
25
-
-
L
9 1-
22 23 24
_j_=
4.47 3.70 3.68 3.13 2.78 2.68 2.16 2;05 2.03 1.95 1.80 1.78 1.56 1.54 1.41
-
5 16 17 18 110 11 12 13 14
G)
u
I
1-
399
:
n~
l
...s::"' a)
i5
- 1.31 - 1.21 - 0.96
~
-
0.84
- 0.84
-
_[]
1 2 3 19 5 13 12 6 11 9 8 10 7 4 24 23 25 14 21 20 22 16 18 17 15
0.83 0.65 0.60 0.00
Company Number
Figure 16.9 Dendrogram of Standardized Financial Performance Data Set.
15, 17, 18 and 16 (all health care companies) form a single cluster at the step in which there are two clusters. Company 22 stays by itself until there are only three clusters. The distance axis indicates how disparate the two clusters just joined are. The distances are progressively increasing. As a general rule, large increases in the sequence should be a signal for examining those particular steps. A large distance indicates that at least two very different observations exist, one from each of the two clusters just combined. Note, for example, the large increase in distance occurring when we combine the last two clusters into a single cluster.

It is instructive to look at the industry groups and ask why the cluster analyses, as in Figure 16.9, for example, did not completely differentiate them. It is clear that the clustering is quite effective. First, 13 of the chemical companies, all except no. 14 (rei), are clustered together with only one nonchemical firm, no. 19 (ahs), when the number of clusters is eight. This
result is impressive when one considers that these are large diversified companies with varied emphasis, ranging from industrial chemicals to textiles to oil and gas production.

At the level of nine clusters three of the four hospital management firms, nos 16, 17 and 18 (hca, nme and ami), have also been clustered together, and the other, no. 15 (hum), is added to that cluster before it is aggregated with any nonhospital management firms. A look at the data in Tables 16.1 and 16.2 shows that no. 15 (hum) has a clearly different D/E value from the others, suggesting a more highly leveraged operation and probably a different management style. The misfit in this health group is no. 19 (ahs), clustered with the chemical firms instead of with the hospital management firms. Further examination shows that, in fact, no. 19 is a large, established supplies and equipment firm, probably more akin to drug firms than to the fast-growing hospital management business, and so it could be expected to share some financial characteristics with the chemical firms.

The grocery firms do not cluster tightly. In scanning the data of Tables 16.1 and 16.2 we note that they vary substantially on most variables. In particular, no. 22 (sgl) is highly leveraged (high D/E), and no. 21 (win) has low leverage (low D/E) relative to the others. Further, if we examine other characteristics in this data set, important disparities show up. Three of the six, nos 21, 24 and 25 (win, kr, sa), are three of the four largest United States grocery supermarket chains. Two others, nos 20 and 22 (lks, sgl), have a diversified mix of grocery, drug, department and other stores. The remaining firm, no. 23 (slc), concentrates on convenience stores (7-Eleven) and has a substantial franchising operation. Thus the six, while all grocery-related, are quite different from each other.

Various K-means clusters

Since these companies do not present a natural application of hierarchical clustering such as taxonomy, the K-means procedure may be more appropriate. The natural value of K is three since there are three types of companies. The first four columns of Table 16.4 show the results of one run of the SAS FASTCLUS procedure and three runs of BMDP KM. The numbers in each column indicate the cluster to which each company is assigned. In columns (1) and (2) the default options of the two programs were used with K = 3. In column (3) two values of K, 2 and 3, were specified in the same run. Note that the results are different from those in column (2) although the same program was used. The difference is due to the fact that in column (3) the initial three groups are selected by splitting one of the two clusters obtained for K = 2, whereas in column (2) the initial splitting is done before any reassignment is carried out (section 16.5). Different results still were obtained when the Mahalanobis distance option was used, as shown in column (4).
Table 16.4 Companies clustered together from K-means standardized analysis (financial performance data set)
Columns: (1) FASTCLUS, K = 3, default; (2) BMDP KM, K = 3, default; (3) BMDP KM, K = 2, 3, default; (4) BMDP KM, K = 3, Mahalanobis; (5) summary of the four runs, K = 3; (6) BMDP KM, K = 4, Mahalanobis.

Company  Type   (1)   (2)   (3)   (4)   (5)    (6)
 1       Chem    1     1     1     1     1      1
 2       Chem    1     1     1     1     1      1
 3       Chem    1     1     1     1     1      1
 4       Chem    1     1     1     1     1      1
 5       Chem    3     1     3     1     1, 3   4
 6       Chem    1     3     3     1     1, 3   3
 7       Chem    3     1     3     1     1, 3   4
 8       Chem    1     1     3     1     1      4
 9       Chem    1     3     3     1     1, 3   4
10       Chem    1     1     3     1     1      4
11       Chem    3     3     3     1     3      3
12       Chem    3     1     3     1     1, 3   4
13       Chem    3     1     3     1     1, 3   4
14       Chem    3     3     3     1     3      4
15       Heal    2     2     2     2     2      2
16       Heal    2     2     2     2     2      2
17       Heal    2     2     2     2     2      2
18       Heal    2     2     2     2     2      2
19       Heal    1     1     1     3     1      1
20       Groc    1     3     1     1     1      1
21       Groc    1     3     1     1     1      1
22       Groc    3     3     3     3     3      3
23       Groc    3     3     3     3     3      3
24       Groc    3     3     3     3     3      3
25       Groc    1     3     3     1     1, 3   4
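Column (5) is a simple majority vote over the four runs, as described in the text that follows. A short sketch of that vote in Python, with the run columns transcribed from Table 16.4:

```python
import pandas as pd

# Cluster assignments of the 25 companies in the four K = 3 runs shown in
# columns (1)-(4) of Table 16.4 (company 1 is the first entry, and so on).
runs = pd.DataFrame({
    "run1": [1,1,1,1,3,1,3,1,1,1,3,3,3,3,2,2,2,2,1,1,1,3,3,3,1],
    "run2": [1,1,1,1,1,3,1,1,3,1,3,1,1,3,2,2,2,2,1,3,3,3,3,3,3],
    "run3": [1,1,1,1,3,3,3,3,3,3,3,3,3,3,2,2,2,2,1,1,1,3,3,3,3],
    "run4": [1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,3,1,1,3,3,3,1],
})

def consensus(row, needed=3):
    """Majority vote across runs: assign a cluster only when at least
    `needed` of the four runs agree, otherwise list the tied clusters."""
    counts = row.value_counts()
    if counts.iloc[0] >= needed:
        return str(counts.index[0])
    tied = sorted(counts[counts == counts.max()].index)
    return ", ".join(str(k) for k in tied)

# Reproduces column (5) of Table 16.4, e.g. '1, 3' for company 5.
print(runs.apply(consensus, axis=1))
```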
One method of summarizing the first four columns is shown in column (5). For example, all four runs agreed on assigning companies 1-4 to cluster 1. Similarly, every company assigned unanimously to a cluster is so indicated. Other companies were assigned to a cluster if three out of four runs agreed on it. For example, company 8 was assigned to cluster 1 and company 18 was assigned to cluster 2. Seven of the companies could not be assigned to a unique cluster. Each received two votes for cluster 1 and two votes for cluster 3. It seems, therefore, that the value of K should perhaps be 4, not 3.

With K = 4 and the BMDP KM program the results are as shown in column (6). This run seems basically to group the companies labeled 1 and 3 in column (5) into a new cluster, 4. Otherwise, with the exception of companies 8, 10 and 14, all other companies remain in their original clusters. Cluster 1 now consists of companies 1, 2, 3, 4 (chemicals), 19 (health), and 20, 21 (grocery). Cluster 2 consists exclusively of health companies. Cluster 3 consists of three grocery companies and two chemicals. Finally, cluster 4 consists mostly of chemical companies. It thus seems, with some exceptions, that the companies within an industry have similar financial data. For example, an exception is company 19, a company that deals with hospital supplies. This exception was evident in the hierarchical results as well as in K-means, where company 19 joined the chemical companies and not the other health companies.

Profile plots of means

Graphical output is very useful in assessing the results of cluster analysis. As with most exploratory techniques, it is the graphs that often provide the best understanding of the output. Figure 16.10 shows the cluster profiles based on the means of the standardized variables (K = 4). Note that the values of ROR5 and PAYOUTR1 are plotted with the signs reversed in order to keep the lines from crossing too much. It is immediately apparent that cluster 2 is very different from the remaining three clusters. Cluster 1 is also quite different from clusters 3 and 4. The latter two clusters are most different on EPS5. Thus the four health companies forming cluster 2 seem to clearly average higher SALESGR5 and P/E. This cluster mean profile is a particularly appropriate display of the K-means output, because the objective of the analysis is to make these means as widely separated as possible and because there are usually a small number of clusters to be plotted.

Use of F tests

For quantification of the relative importance of each variable, the univariate F ratio for testing the equality of each variable's means in the K clusters is given in the output of BMDP KM, SPSS QUICK CLUSTER, SAS FASTCLUS, STATISTICA K-Means and SYSTAT Kmeans. A large F value is an indication that the corresponding variable is useful in separating the clusters.
Figure 16.10 Profile of Cluster Means for Four Clusters (Financial Performance Data Set).
For example, the largest F value is for P/E, indicating that the clusters are most different from each other in terms of this variable. Conversely, the F ratio for PAYOUTR1 is very small, indicating that this variable does not play an important role in defining the clusters for the particular industry groups and companies analyzed. Note that these F ratios are used for comparing the variables only, not for hypothesis-testing purposes.
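These descriptive F ratios are one-way analysis of variance F statistics computed variable by variable across the cluster labels, so they can be reproduced outside of the packages. A minimal sketch; the data frame of standardized variables and the vector of K = 4 labels are assumed to exist from the earlier run:

```python
import numpy as np
import pandas as pd
from scipy import stats

def cluster_f_ratios(data: pd.DataFrame, labels) -> pd.Series:
    """One-way ANOVA F ratio of each variable across the cluster labels."""
    labels = np.asarray(labels)
    ratios = {}
    for col in data.columns:
        groups = [data[col].values[labels == k] for k in np.unique(labels)]
        f, _ = stats.f_oneway(*groups)
        ratios[col] = f
    # Large F values flag the variables that separate the clusters most.
    # As noted in the text, the attached P values are not valid significance
    # tests, because the clusters were themselves built from these data.
    return pd.Series(ratios).sort_values(ascending=False)

# Usage with the (assumed) standardized data frame and K-means labels:
# print(cluster_f_ratios(fin_std, kmeans_labels))
```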
The above example illustrates how these techniques are used in an exploratory manner. The results serve to point out some possible natural groupings as a first step in generating scientific hypotheses for the purpose of further analyses. A great deal of judgment is required for an interpretation of the results since the outputs of different runs may be different and sometimes even contradictory. Indeed, this result points out the desirability of making several runs on the same data in an attempt to reach some consensus.

16.7 DISCUSSION OF COMPUTER PROGRAMS

BMDP, SAS, SPSS, STATISTICA and SYSTAT all have cluster analysis programs. Since cluster analysis is largely an empirical technique, the options for performing it vary from program to program. It is important to read the manuals or Help statements to determine what the program you use actually does. We have discussed only a small portion of the total options available in the various programs. More complete discussions of the various methods are given in the cluster analysis texts listed in the bibliography. One of the difficulties with cluster analysis is that there is such a wide range of available methods that the choice among them is sometimes difficult.

BMDP offers four cluster analysis programs: 1M clusters variables, 2M does hierarchical clustering of cases, KM performs K-means clustering of cases and 3M constructs block clusters for categorical data. We only summarize the options available in 2M and KM in Table 16.5. With this and other packages, additional options are offered that are not listed in Table 16.5.

SAS offers five procedures. ACECLUS can be used to preprocess data for subsequent use by CLUSTER or FASTCLUS. It computes approximate estimates of the pooled within-cluster covariance matrix assuming multivariate clusters with equal covariance matrices. CLUSTER is used to obtain hierarchical cluster analysis. This procedure offers a wide variety of options. FASTCLUS is used to perform K-means clustering. VARCLUS clusters the variables instead of the cases and TREE produces dendrograms for the CLUSTER procedure.

SPSS offers two programs: CLUSTER and QUICK CLUSTER. CLUSTER performs hierarchical cluster analysis and QUICK CLUSTER does K-means clustering. CLUSTER can also be used to cluster variables. The SPSS programs are attractive partially because of the rich choice of options and partially because of the readability of the manual.

STATISTICA offers a wide choice of options, particularly in its hierarchical program (called Join). Either Join or K-Means will cluster cases or variables. They also have a program that performs two-way joining of cases and variables. Being Windows based, all three programs are easy to run.
Table 16.5 Summary of computer output for cluster analysis

Hierarchical analysis
  Programs: BMDP 2M; SAS CLUSTER (with ACECLUS and TREE); SPSS CLUSTER (with DESCRIPTIVES); STATISTICA Join; SYSTAT Join (with Standardize)
  Standardization: not standardized; standardized; estimation of within covariances
  Distance: Euclidian; squared Euclidian distance; sum of pth power of absolute distances; 1 - Pearson r; other options
  Linkage methods: centroid; nearest neighbor (single linkage); kth nearest neighbor; complete linkage; Ward's method; other options
  Printed output: distances between combined clusters; dendrogram (tree); vertical icicle; initial distance between observations

K-means analysis
  Programs: BMDP KM; SAS FASTCLUS (with STANDARD and ACECLUS); SPSS QUICK CLUSTER (with DESCRIPTIVES); STATISTICA K-Means; SYSTAT Kmeans (with Standardize)
  Standardization: not standardized; standardized; estimation of within covariances
  Distance: Euclidian; Mahalanobis
  Distance measured from: leader or seed point; centroid or center
  Printed output: distances between cluster centers; distances between points and center; cluster means and standard deviations; F ratios; cluster mean profiles; plot of cluster membership; save cluster membership
SYSTAT has two programs: Corr, which can be used to provide input to their cluster analysis program, and Cluster, which performs hierarchical (Join) or K-means cluster analysis. Join can cluster either cases or variables. If you wish to cluster variables with K-means, then you need to transpose the data first (make the rows the columns and vice versa). In the next version, the options for these programs will be expanded. Eight methods of computing distances will be available for each of these methods. Cluster mean profiles will be displayed for K-Means clustering and other output will be expanded.

16.8 WHAT TO WATCH OUT FOR

Cluster analysis is a heuristic technique for classifying cases into groups when knowledge of the actual group membership is unknown. There are numerous methods for performing the analysis, without good guidelines for choosing among them.

1. Unless there is considerable separation among the inherent groups, it is not realistic to expect very clear results with cluster analysis. In particular, if the observations are distributed in a nonlinear manner, it may be difficult to achieve distinct groups. For example, if we have two variables and cluster 1 is distributed such that a scatter plot has a U or crescent shape, while cluster 2 is in the empty space in the middle of the crescent, then most cluster analysis techniques would not be successful in separating the two groups.
2. Cluster analysis is quite sensitive to outliers. In fact, it is sometimes used to find outliers. The data should be carefully screened before running cluster programs.
3. Some of the computer programs supply F ratios, but it should be kept in mind that the data have been separated in a particular fashion prior to making the test, so the interpretation of the significance level of the tests is suspect. Usually, the major decision as to whether the cluster analysis has been successful should depend on whether the results make intuitive sense.
4. Most of the methods are not scale invariant, so it makes sense to try several methods of standardization and observe the effect on the results (a sketch of such a check follows this list). If the program offers the opportunity of standardizing by the pooled within-cluster covariance matrix, this option should be tried. Note that for large samples, this could increase the computer time considerably.
5. Some investigators split their data and run cluster analysis on both halves to see if they obtain similar results. Such agreement would give further credence to the conclusions.
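Point 4 in the list above can be checked directly by repeating the analysis under different scalings, as noted there. A minimal sketch using SciPy's hierarchical clustering; the data matrix X is an assumed stand-in for whatever cases-by-variables array is being clustered:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

# X is an (n cases) x (p variables) array of raw measurements (assumed).
def two_group_solution(data):
    """Ward hierarchical clustering of cases, cut at two clusters."""
    return fcluster(linkage(data, method="ward"), t=2, criterion="maxclust")

raw_labels = two_group_solution(X)          # variables on their raw scales
std_labels = two_group_solution(zscore(X))  # z-score standardized variables

# Cross-tabulate the two solutions; cluster numbers are arbitrary, so look
# for a table concentrated in two cells.  A diffuse table means the result
# is scale dependent and the choice of standardization matters (point 4).
table = np.zeros((2, 2), dtype=int)
for r, s in zip(raw_labels, std_labels):
    table[r - 1, s - 1] += 1
print(table)
```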
SUMMARY

In this chapter we presented a topic still in its evolutionary form. Unlike subjects discussed in previous chapters, cluster analysis has not yet gained
a standard methodology. Nonetheless, a number of techniques have been developed for dividing a multivariate sample, the composition of which is not known in advance, into several groups. In view of the state of the subject, we opted for presenting some of the techniques that have been incorporated into the standard packaged programs. These include a hierarchical and a nonhierarchical technique. We also explained the use of profile plots for small data sets. Since the results of any clustering procedure are often not definitive, it is advisable to perform more than one analysis and attempt to collate the results.
REFERENCES

Afifi, A.A. and Azen, S.P. (1979). Statistical Analysis: A Computer Oriented Approach, 2nd edn. Academic Press, New York.
Andreasen, N.C. and Grove, W.M. (1982). The classification of depression: traditional versus mathematical approaches. American Journal of Psychiatry, 139, 45-52.
Dillon, W.R. and Goldstein, M. (1984). Multivariate Analysis: Methods and Applications. Wiley, New York.
Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 857-74.
MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281-97.
Sneath, P.H. and Sokal, R.R. (1973). Numerical Taxonomy. Freeman, San Francisco.
Wilson, E.O., Eisner, T., Briggs, W.R., Dickerson, R.E., Metzenberg, R.L., O'Brien, R.D., Susman, M. and Boggs, W.E. (1973). Life on Earth. Sinauer Associates, Sunderland, MA.
FURTHER READING

Aldenderfer, M.S. and Blashfield, R.K. (1986). Cluster Analysis. Sage, Newbury Park, CA.
Anderberg, M.R. (1973). Cluster Analysis for Applications. Academic Press, New York.
Cormack, R.M. (1971). A review of classification. Journal of the Royal Statistical Society, Series A, 134, 321-67.
Everitt, B. (1993). Cluster Analysis, 2nd edn. Wiley, New York.
Gordon, A.D. (1981). Classification. Chapman & Hall, London.
Hand, D.J. (1981). Discrimination and Classification. Wiley, New York.
Hartigan, J.A. (1975). Clustering Algorithms. Wiley, New York.
Johnson, R.A. and Wichern, D.W. (1988). Applied Multivariate Statistical Analysis, 2nd edn. Prentice-Hall, Englewood Cliffs, NJ.
Johnson, S.C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241-44.
Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
Tou, J.T. and Gonzalez, R.C. (1974). Pattern Recognition Principles. Addison-Wesley, Reading, MA.
Ward, J.H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236-44.
PROBLEMS

16.1 For the depression data set, use the last seven variables to perform a cluster analysis producing two groups. Compare the distribution of CESD and CASES in the groups. Compare also the distribution of sex in the groups. Try two different programs with different options. Comment on your results.
16.2 For the situation described in Problem 7.7, modify the data for X1, X2, ..., X9 as follows.
     • For the first 25 cases, add 10 to X1, X2, X3.
     • For the next 25 cases, add 10 to X4, X5, X6.
     • For the next 25 cases, add 10 to X7, X8.
     • For the last 25 cases, add 10 to X9.
     Now perform a cluster analysis to produce four clusters. Use two different programs with different options. Compare the derived clusters with the above groups.
16.3 Perform a cluster analysis on the chemical company data in Table 8.1, using the K-means method for K = 2, 3, 4.
16.4 For the accompanying small, hypothetical data set, plot the data by using methods given in this chapter, and perform both hierarchical and K-means clustering with K = 2.
Cases    X1    X2
1        11    11
2        10    10
3        10    11
4         8     4
5         9     4
6         5     5
7         3    11
8         8    12
16.5 Describe how you would expect guards, forwards and centers in basketball to cluster on the basis of size or other variables. Which variables should be measured?
16.6 For the family lung function data in Appendix A, perform a hierarchical cluster analysis using the variables FEV1 and FVC for mothers, fathers and oldest children. Compare the distribution of area of residence in the resulting groups.
16.7 Repeat Problem 16.6, using AREA as an additional clustering variable. Comment on your results.
16.8 Repeat Problem 16.6, using the K-means method for K = 4. Compare the results with the four clusters produced in Problem 16.6.
16.9 Create a data set from the family lung function data in Appendix A as follows. It will have three times the number of observations that the original data set has - the first third of the observations will be the mothers' measurements, the second third those of the fathers, and the final third those of the oldest children.
     Perform a cluster analysis, first producing three groups and then two groups, using the variables AGE, HEIGHT, WEIGHT, FEV1 and FVC. Do the mothers, fathers and oldest children cluster on the basis of any of these variables?
16.10 Repeat Problem 16.1, using the variables age, income and education instead of the last seven variables.
17 Log-linear analysis
17.1 USING LOG-LINEAR MODELS TO ANALYZE CATEGORICAL DATA

Log-linear models are useful in analyzing data that are commonly described as categorical, discrete, nominal or ordinal. We include in this chapter several numerical examples to help clarify the main ideas. Section 17.2 discusses when log-linear analysis is used in practice and section 17.3 presents numerical examples from the depression data set. The notation and sampling considerations are given in section 17.4 and the usual model and tests of hypotheses are presented in section 17.5 for the two-way table. Section 17.6 presents a worked out example for the two-way table. In section 17.7 models are given when there are more than two variables, and tests of hypotheses are explained for exploratory situations in section 17.8. Testing when certain preconceived relationships exist among the variables is discussed in section 17.9, and the issue of small or very large sample sizes is reviewed in section 17.10. Logit models are defined in section 17.11 and their relationship to log-linear and logistic models is discussed. The options available in the various computer programs are presented in section 17.12 while section 17.13 discusses what to watch out for in applying log-linear models.
17.2 WHEN IS LOG-LINEAR ANALYSIS USED?

Log-linear analysis helps in understanding patterns of association among categorical variables. The number of categorical variables in one analysis can vary from two up to an arbitrary P. Rarely are values of P as great as six used in actual practice and P equal to three or four is most common.

In log-linear analysis the categorical variables are not divided into dependent and independent variables. In this regard, log-linear analysis is similar to principal components or factor analysis. The emphasis is on understanding the interrelationships among the variables rather than on prediction.

If some of the variables are continuous, they must be made categorical by grouping the data, as was done in section 11.11, prior to performing the
17.3 DATA EXAMPLE The depression data set described earlier (Chapter 3) will be used in this chapter. Here we will work with categorical data (either nominal or ordinal) or data that have been divided into separate categories by transforming or recoding them. This data set has 294 observations and is part of a single sample taken from Los Angeles County. Suppose we wished to study the association between gender and income. Gender is given by the variable SEX and income has been measured in thousands of dollars. We split the variable INCOME into high (greater than 19) and low (less than or equal to 19). The number 19 denotes a yearly income of$19 000. This is an arbitrary choice on our part and does not reflect family size or other factors. This recoding of the data can be done in most programs
Table 17.1 Classification of individuals by income and sex
                    INCOME
SEX           Low       High      Total
Female        125        58        183
Male           54        57        111
Total         179       115        294
This recoding of the data can be done in most programs using the IF-THEN transformation statement (see manuals for your program) or by using special grouping options that are present in some programs. For the depression data set this results in a two-way table (Table 17.1).

The totals in the right-hand column and bottom row of Table 17.1 are called marginal totals and the interior of the table is divided into four cells where the frequencies 125, 58, 54 and 57 are located. By simply looking at Table 17.1, one can see that there is a tendency for females to have low income. The entry of 125 females with low income is appreciably higher than the other frequencies in this table. Among the females 125/183 or 68.3% have low income while among the males 54/111 or 48.6% have low income. If we perform a chi-square test of independence, we obtain a chi-square value of 11.21 with one degree of freedom resulting in P = 0.0008. We can compute the odds ratio using the results from section 12.5 as (125 x 57)/(58 x 54) = 2.27, i.e. the odds of a female having low income are 2.27 times those of a male.

The data in Table 17.1 are assumed to be taken from a cross-sectional survey, and the null hypothesis just tested is that gender is independent of income. Here we assume that the total sample size was determined prior to data collection. In section 17.4 we discuss another method of sampling.

Now, suppose we are interested in including a third variable. As we shall see, that is when the advantage of the log-linear model comes in. Consider the variable called TREAT (Table 3.4) which was coded yes if the respondent was treated by a medical doctor and no otherwise. If we also categorize the respondents by TREAT as well as SEX and INCOME, then we have a three-way table (Table 17.2). Note that the rows in Table 17.2 have been split into TREAT no and yes (for example, the 125 females with low income have been split into 52 with no treatment and 73 that were treated). In other words, where before we had rows and columns, we now have rows (SEX), columns (INCOME) and layers (TREAT). The largest frequency occurs among the females with low income and medical treatment, but it is difficult to describe the results in Table 17.2 without doing some formal statistical analysis. If we proceed one step further and make a four-way frequency table, interpretation becomes very difficult, but let us see what such a table looks like.
Table 17.2 Classification by income, sex and treatment
                              INCOME
TREAT      SEX           Low       High      Total
No         Female         52        24         76
           Male           34        36         70
Yes        Female         73        34        107
           Male           20        21         41
Total                    179       115        294
We considered adding the variable CASES which was defined as a score of 16 or more on the CESD (depression) scale (section 11.3). When we tried this, we got very low frequencies in some of the cells. Low frequencies are a common occurrence when you start splitting the data into numerous variables. (Actually, the problem is one of small theoretical rather than observed frequencies; but small theoretical frequencies often result in small observed frequencies.) To avoid this problem, we split the CESD score into low (less than or equal to 10) and high (11 or above). A common choice of the place to split a continuous variable is the median, if there is no theoretical reason for choosing other values, since this results in an equal or near equal split of the frequencies. If any of our variables had three or more subcategories, our problem of cells with very small frequencies would have been even worse, i.e. the table would have been too sparse. If we had had a larger sample size, we would have greater freedom in what variables to consider and the number of subcategories for each variable. According to our definition, the high depression score category includes persons who would not be considered depressed by the conventional clinical cutpoint. Thus the need to obtain adequate frequencies in cells has affected the quality of our analysis.

Table 17.3 gives the results when four variables are considered simultaneously. We included in the table several marginal totals. This table is obviously difficult to interpret without the use of a statistical model. Note also how the numbers of respondents in the various cells have decreased compared to Table 17.2, particularly in the last row for high CESD and medically treated males. Before discussing the log-linear model, we introduce some notation and discuss two common methods of sampling.

17.4 NOTATION AND SAMPLE CONSIDERATIONS

The sample results given in Table 17.1 were obtained from a single sample of size N = 294 respondents.
Table 17.3 Classification by income, sex, treatment and CESD
                                        INCOME
CESD      TREAT      SEX           Low       High      Total
Low       No         Female         33        20         53
                     Male           23        30         53
                     Total          56        50        106
          Yes        Female         48        20         68
                     Male           16        16         32
                     Total          64        36        100
High      No         Female         19         4         23
                     Male           11         6         17
                     Total          30        10         40
          Yes        Female         25        14         39
                     Male            4         5          9
                     Total          29        19         48

Total     TREAT      No        146
                     Yes       148
          CESD       Low       206
                     High       88
          SEX        Female    183
                     Male      111
          INCOME     Low       179
                     High      115
Table 17.4 Notation for two-way frequency table from a single sample
                    Column
Row           1          2          Total
1             f11        f12        f1+
2             f21        f22        f2+
Total         f+1        f+2        f++ = N
Table 17.4 presents the usual notation for this case. Here f11 (or 125) is the sample frequency in the first row (females) and column (low income), where the first subscript denotes the row and the second the column. The total for the first row is denoted by f1+ (or 183 females), where the '+' in place of the column subscript symbolizes that we added across the columns. The total for the first column is f+1 and for the second column f+2. The symbol fij denotes the observed frequency in an arbitrary row i and column j. For a three-way table three subscripts are needed, with the third one denoting layer; four-way tables require four subscripts, etc.

If we divide the entire table by N or f++, then the resulting values will be the sample proportions denoted by pij, and the sample table of proportions is shown in Table 17.5. The proportion in the ith row and jth column is denoted by pij. The sample proportions are estimates of the population proportions that will be denoted by πij. The total sample proportion for the ith row is denoted by pi+ and the total proportion for the jth column is given by p+j. Similar notation is used for the population proportions.

Suppose instead that we took a random sample of respondents from a population that had high income and a second independent random sample of respondents from a population of those with low income. In this case, we will assume that both f+1 and f+2 can be specified prior to data collection. Here, the hypothesis usually tested is that the population proportion of low income people who are female is the same as the population proportion of high income people who are female. This is the so-called null hypothesis of homogeneity. The usual way of testing this hypothesis is a chi-square test, the same as for testing the hypothesis of independence.

17.5 TESTS OF HYPOTHESES AND MODELS FOR TWO-WAY TABLES

In order to perform the chi-square test for a single sample, we need to calculate the expected frequencies under the null hypothesis of independence. The expected frequency is given by

    μij = N πij

Table 17.5 Notation for two-way table of proportions
                    Column
Row           1          2          Total
1             p11        p12        p1+
2             p21        p22        p2+
Total         p+1        p+2        p++ = 1
where the Greek letters denote population parameters, i denotes the ith row and j the jth column. Theoretically, if two events are independent, then the probability of both of them occurring is the product of their individual probabilities. For example, the probability of obtaining two heads when you toss two fair coins is (1/2)(1/2) = 1/4. If you multiply the probability of an event occurring by the number of trials, then you get the expected number of outcomes of that event. For example, if you toss two fair coins eight times then the expected number of times in which you obtain two heads is (8)(1/4) = 2 times. Returning to two-way tables, the expected frequency in the ijth cell when the null hypothesis of independence is true is given by

    μij = N πi+ π+j
where πi+ denotes the marginal probability in the ith row and π+j denotes the marginal probability in the jth column. In performing the test, sample proportions estimated from the margins of the table are used in place of the population parameters, and their values are compared to the actual frequencies in the interior of the table using the chi-square statistic. For example, for Table 17.1, the expected value in the first cell is 294(183/294)(179/294) = 111.4, which is then compared to the actual frequency of 125. Here, more females have low income than we would have expected if income and gender were independent.
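These quantities are easy to verify directly from the cell counts of Table 17.1. A minimal sketch in Python, using SciPy for the test of independence quoted in section 17.3:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from Table 17.1: rows are SEX (female, male),
# columns are INCOME (low, high).
observed = np.array([[125, 58],
                     [54, 57]])
n = observed.sum()

# Expected frequencies under independence: N * p_i+ * p_+j.
expected = n * np.outer(observed.sum(axis=1) / n, observed.sum(axis=0) / n)
print(expected.round(1))                 # first cell is approximately 111.4

# Pearson chi-square test of independence (no continuity correction).
chi2, p, df, _ = chi2_contingency(observed, correction=False)
print(round(chi2, 2), df, round(p, 4))   # approximately 11.21, 1, 0.0008

# Odds of low income for females relative to males.
print(observed[0, 0] * observed[1, 1] / (observed[0, 1] * observed[1, 0]))  # about 2.27
```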
Log-linear models

The log-linear model is, as its name implies, a linear model of the logs (to the base e) of the population parameters. If we take the log of μij when the null hypothesis is true, we have

    log μij = log N + log πi+ + log π+j
since the log of a product of three quantities is the sum of the logs of the three quantities. This linear model is usually written as
    log μij = λ + λ_A(i) + λ_B(j)        (independence model)
where A denotes rows and B columns, and the λ's are used to denote either log N or log π. There are many possible values of the λ's, and so to make them unique two constraints are commonly used, namely that the sum of the λ_A(i) = 0 and the sum of the λ_B(j) = 0.
Now if the null hypothesis is not true, the log-linear model must be augmented. To allow for association between the row and column variables, a fourth parameter is added to the model to denote the association between the row and column variable:

    log μij = λ + λ_A(i) + λ_B(j) + λ_AB(ij)        (association model)
where λ_AB(ij) measures the degree of association between the rows and columns. Note that the association terms are also required to add up to zero across rows and across columns. This latter model is sometimes called the saturated model since it contains as many free parameters as there are cells in the interior of the frequency table. The two models only differ by the last term that is added when the null hypothesis is not true. Hence, the null hypothesis of independence can be written

    H0: λ_AB(ij) = 0
for all values of i and j. Note that the log-linear model looks a lot like the two-way analysis of variance model

    μij = μ + αi + βj + (αβ)ij
where constraints are also used (sum of the row means = 0, sum of the column means = 0, and sums of the interaction terms across rows and across columns = 0) (see Wickens, 1989, section 3.10 for further discussion of when and how this analogy can be useful).

In performing the test, two goodness-of-fit statistics can be used. One of these is the usual Pearson chi-square goodness-of-fit test. It is the same as the regular chi-square test of independence, namely

    X² = Σ (observed frequency - expected frequency)² / expected frequency
where the observed frequency is the actual frequency and the expected frequency is computed under the null hypothesis of independence from the margins of the table. This is done for each cell in the interior of the table and then summed for all cells. For the log-linear model test, the expected frequencies are computed from the results predicted by the log-linear model. To perform the test of no association, the expected frequencies are computed when the λ_AB(ij) term is not included in the model (independence model above). The estimated parameters of the log-linear model are obtained as solutions to a set of likelihood equations (Agresti, 1990, section 6.1.2).
If a small chi-square value is obtained (indicating a good fit) then there is
no need to include the term measuring association. The simple additive or independence model of the effects of the two variables is sufficient to explain the data. An even simpler model may also be appropriate to explain the data. To check for this possibility, we test the hypothesis that there are no row effects (i.e. all λ_A(i) = 0). The Pearson chi-square statistic is calculated in the same way, where the expected frequencies are obtained from the model that satisfies the null hypothesis.

The idea is to find the simplest model that will yield a reasonably good fit. In general, if the model does not fit, i.e. chi-square is large, you add more terms to the model. If the model fits all right, you try to find an even simpler model that is almost as good. What you want is the simplest model that fits well. In this section and the next we are assuming that the sample was from a survey of a single population. For a two-way table, this does not involve looking at many models. But for four- or five-way tables, several models are usually given a close look and you may need to run the log-linear program several times both to find the best fitting model and to get a full description of the selected model and how it fits.

Another chi-square test statistic is reported in some of the computer programs. This is called G² or likelihood ratio chi-square. Again, the larger the value of G², the poorer the fit and the more evidence against the null hypothesis. This test statistic can be written as

    G² = 2 Σ (observed frequency) log(observed frequency / expected frequency)

where the summation is over all the cells. Note that when the observed and expected frequencies are equal, their ratio is 1 and log(1) = 0, so that the cell will contribute nothing to G². The results for this chi-square are usually close to the Pearson chi-square test statistic but not precisely the same. For large sample sizes they are what statisticians call 'asymptotically equivalent'. That is, for large sample sizes the difference in the results should approach zero. Many statisticians like to use the likelihood ratio test because it is additive in so-called nested or hierarchical models (one model is a subset of the other). However, we would suggest that you use the Pearson chi-square if there is any problem with some of the cells having small frequencies, since the Pearson chi-square has been shown to be better in most instances for small sample sizes (see Agresti, 1990, section 7.7.3 for a more detailed discussion on this point). Otherwise, the likelihood ratio chi-square is recommended.
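For the independence model of Table 17.1 both statistics can be computed directly from the observed and expected counts. A short sketch, with the expected counts formed from the margins as in section 17.5:

```python
import numpy as np

observed = np.array([[125, 58],
                     [54, 57]], dtype=float)
n = observed.sum()
expected = n * np.outer(observed.sum(axis=1) / n, observed.sum(axis=0) / n)

# Pearson goodness-of-fit chi-square.
x2 = ((observed - expected) ** 2 / expected).sum()

# Likelihood ratio chi-square; a cell with observed = expected contributes 0.
g2 = 2 * (observed * np.log(observed / expected)).sum()

print(round(x2, 2), round(g2, 2))   # approximately 11.21 and 11.15
```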
In order to interpret the size of chi-square, it is necessary to know the degrees of freedom associated with each of the computed chi-squares. The degrees of freedom for the goodness-of-fit chi-square are given by the number of cells in the interior of the table minus the number of free parameters being estimated in the model. The number of free parameters is equal to the number of parameters minus the number of independent constraints. For example, for Table 17.1 we have four cells. For rows, we have one free parameter, or 2 - 1, since the sum of the two row parameters is zero. In general, we would have A - 1 free parameters for A rows. Similarly, for column parameters we have one free parameter and, in general, we have B - 1 free parameters. For the association term λ_AB(ij) the sum over i and the sum over j must equal zero. Given these constraints, only (A - 1)(B - 1) of these terms are free. There is also one parameter representing the overall mean. Thus for Table 17.1 the number of free parameters for the independence model is 1 + (2 - 1) + (2 - 1) = 3. For the association model, there are 1 + (2 - 1) + (2 - 1) + (2 - 1)(2 - 1) = 4
free parameters. Hence, the number of degrees of freedom is 4 - 3 = 1 for the independence model and zero for the association or saturated model. For higher than two-way models computing degrees of freedom can be very complicated. Fortunately, the statistical packages compute this for us and also give the P value for the chi-square with the computed degrees of freedom.

17.6 EXAMPLE OF A TWO-WAY TABLE

We analyze Table 17.1 as an example of the two-way models and hypotheses used in the previous section. If we analyze these data using the usual Pearson chi-square test we would reject the null hypothesis of no association, with a chi-square value of 11.21 and a P = 0.0008 with one degree of freedom: (number of rows - 1) x (number of columns - 1) or (2 - 1)(2 - 1) = 1.

To understand the log-linear model, we go through some of the steps of fitting the model to the data. The first step is to take the natural logs of the data in the interior of Table 17.1 as shown in Table 17.6. The margins of the table are then computed by taking the means of the appropriate log fij. Note that this is similar to what one would do for a two-way analysis of variance (for example, 4.444 is the average of 4.828 and 4.060). The various λ's are then estimated from these data using the maximum likelihood method (Agresti, 1990). Table 17.7 gives the estimates of the λ's for the four cells in Table 17.6. Here, the variable SEX is the row variable with A = 2 rows and INCOME is the column variable with B = 2 columns. The parameter λ is estimated by the overall mean from Table 17.6.
Table 17.6 Natural logs of entries in Table 17.1
                    INCOME
SEX           Low        High       Mean
Female        4.828      4.060      4.444
Male          3.989      4.043      4.016
Mean          4.409      4.052      4.230
Table 17.7 Estimates of parameters of log-linear model for Table 17.6
Cell (i, j)     λ         λ_A(i)     λ_B(j)     λ_AB(ij)
1, 1            4.230      0.214      0.178      0.205
1, 2            4.230      0.214     -0.178     -0.205
2, 1            4.230     -0.214      0.178     -0.205
2, 2            4.230     -0.214     -0.178      0.205
For cell (1, 1), λ_A(1) is estimated as 0.214 = 4.444 - 4.230 (marginal means from Table 17.6) and λ_B(1) as 0.178 = 4.409 - 4.230 (minus rounding error of 0.001). For the association term for the first cell, we have 4.828 - (4.230 + 0.214 + 0.178) = 0.206 (within rounding error of 0.001). The estimates in the remaining cells are computed in a similar fashion.

Note that the last three columns all add to zero. The numerical values for when i = 1 and when i = 2 sum to zero: 0.214 + (-0.214) = 0. Likewise for j = 1 and j = 2, we have 0.178 + (-0.178) = 0. For the association terms we sum both over i and over j (cells (1, 1) + (1, 2) = 0, cells (2, 1) + (2, 2) = 0, cells (1, 1) + (2, 1) = 0 and cells (1, 2) + (2, 2) = 0). This is analogous to the two-way analysis of variance where the main effects and interaction terms sum to zero. In this example with only two rows and two columns, each pair of estimates of the association terms (±0.205) and the row (±0.214) and column (±0.178) effects are equal except for their signs. If we had chosen an example with three or more rows and/or columns, pairs of numbers would not all be the same, but row, column and interaction effects would still sum to zero.

Note that λ_SEX(1) is positive, indicating that there are more females than males in the sample. Likewise, λ_INCOME(1) is positive, indicating more low income respondents than high. Also, λ_SEXxINC(11) and λ_SEXxINC(22) are positive, indicating that there are more respondents in the first cell (1, 1) and the last (2, 2) than would occur if a strictly additive model held with no association (there is an excess of females with low income and males with high income).

Statistical packages provide estimates of the parameters of the log-linear model along with their standard errors or the ratio of the estimate to its
standard error to assist in identifying which parameters are significantly different from zero. The programs also present results for testing that the association between the row and column variables is zero. Here they are computing goodness-of-fit chi-square statistics for the model with the association term included as well as for the model without it. The chi-square test statistic and its degrees of freedom are obtained by taking the difference in these chi-squares and their degrees of freedom. A large test statistic would indicate that the association is significant and should be included in the model. The Pearson chi-square for this test is 11.21 with one degree of freedom, giving P = 0.0008, or the same as was given for the regular chi-square test mentioned earlier. The real advantage of using the log-linear model will become evident when we increase the number of rows and columns in the table and when we analyze more than two variables.

The G² or likelihood ratio chi-square test statistic was calculated as 11.15 with a P = 0.00084, very similar to the Pearson chi-square in this case. We reject the null hypothesis of no association and would say that income and gender are associated. There is also a significant chi-square value for the row and for the column effects. Thus, we would reject the null hypotheses of there being equal numbers of males and females and of low and high income respondents. These last two tests are often of little interest to the investigator, who usually wants to test for associations. In this example, the fact that we had more females than males is a reflection of an unequal response rate for males and females in the original sample and may be a matter of concern for some analyses.

The odds ratio is related to the association parameters in the log-linear model. For the two-way table with only two rows and columns, the relationship is very straightforward, namely

    log_e(odds ratio) = 4 λ_AB(11)
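The entries of Tables 17.6 and 17.7 and this relationship can be reproduced with a few lines of arithmetic. A minimal sketch; for the saturated 2 x 2 model the fitted cell means equal the observed counts, so the logged counts can be decomposed directly:

```python
import numpy as np

counts = np.array([[125, 58],
                   [54, 57]], dtype=float)   # Table 17.1
logf = np.log(counts)                        # Table 17.6

grand = logf.mean()                          # lambda
row_eff = logf.mean(axis=1) - grand          # lambda_A(i), SEX effects
col_eff = logf.mean(axis=0) - grand          # lambda_B(j), INCOME effects
assoc = logf - grand - row_eff[:, None] - col_eff[None, :]   # lambda_AB(ij)

print(round(grand, 3))           # 4.230
print(row_eff.round(3))          # 0.214, -0.214
print(col_eff.round(3))          # 0.178, -0.178
print(assoc.round(3))            # the +/-0.205 pattern of Table 17.7

# Check the relation log(odds ratio) = 4 * lambda_AB(1,1).
print(round(np.exp(4 * assoc[0, 0]), 3))     # approximately 2.27
```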
In the SEX by INCOME example the odds ratio is 2.275 and the computed value from the association parameter using the above equation is given by 2.270 (in this case they are not precisely equal since only three decimal places were reported for the association parameter).

17.7 MODELS FOR MULTIWAY TABLES

The log-linear model for the three-way table can be used to illustrate the different types of independence or lack of association that may exist in multiway tables. The simplest model is given by the complete independence (mutual independence) model. In order to demonstrate the models we will first redo Table 17.2 in symbols so the correspondence between the symbols and the table will be clear.
Table 17.8 Notation for three-way table of proportions

Partial layer 1 (layer C = 1)
                    Column B
Row A         1          2          Total
1             p111       p121       p1+1
2             p211       p221       p2+1
Total         p+11       p+21       p++1

Partial layer 2 (layer C = 2)
                    Column B
Row A         1          2          Total
1             p112       p122       p1+2
2             p212       p222       p2+2
Total         p+12       p+22       p++2

Marginal table
                    Column B
Row A         1          2          Total
1             p11+       p12+       p1++
2             p21+       p22+       p2++
Total         p+1+       p+2+       p+++
We have written Table 17.8 in terms of the sample proportions that can be computed from a table such as Table 17.2 by dividing the entries in the entire table by N. The marginal table given at the bottom is the sum of the top two-way tables (for layer C = 1 plus for layer C = 2). If we knew the true probabilities in the population, then the pijk's would be replaced by πijk's. Here i denotes the ith row, j the jth column and k the kth layer, and a '+' signifies a sum over that variable. Another marginal table of variable A versus C could be made using the column margins (column labeled Total) of the top two partial layer tables. A third marginal table for variables B and C can be made of the row marginals for the top two partial layer tables; some programs can produce all of these tables on request. Next we consider some special cases with and without associations.

Complete independence model
In this case, all variables are unassociated with each other. This is also called mutual independence. The probability in any cell (i, j, k) is then given by

    πijk = πi++ π+j+ π++k
where πi++ is the marginal probability for the ith row (the lower marginal table row total), π+j+ is the marginal probability for the jth column (the lower marginal table column total) and π++k is the marginal probability for the kth layer (the total for layer 1 and layer 2 tables). The sample estimates
for the right side of the equation are pi++, p+j+ and p++k, which can be found in Table 17.8. When complete independence holds, the three-way log-linear model is given by

    log μijk = λ + λ_A(i) + λ_B(j) + λ_C(k)
The model includes no parameters that signify any association.

One variable independent of the other two
If variable C is unrelated to (jointly independent of) both variable A and variable B, then the only variables that can be associated are A and B. Hence, only parameters denoting the association of A and B need be included in the model. The three-way log-linear model then becomes

    log μijk = λ + λ_A(i) + λ_B(j) + λ_C(k) + λ_AB(ij)
Saying that C is jointly independent of A and B is analogous to saying that the simple correlations between variables C and A and variables C and B are both zero in regression analysis. Note that, similarly, A could be jointly independent of B and C, or B could be jointly independent of A and C.

Conditional independence model
Suppose A and B are independent of each other for the cases in layer 1 of C in Table 17.8 and also independent of each other for the cases in layer 2 of C in Table 17.8. Then A and B are said to be conditionally independent in layer k of C. This model includes associations between variables A and C and variables B and C but not the association between variables A and B. The model can be written as

    log μijk = λ + λ_A(i) + λ_B(j) + λ_C(k) + λ_AC(ik) + λ_BC(jk)
If we say that A and B are independent of each other conditional on C, this is analogous to the partial correlation of A and B given C being zero (Wickens, 1989, sections 3.3-3.6). Note that, similarly, the A and C association term or the B and C association term could be omitted to produce other conditional independence models.

Models with all two-way associations

It is also possible to have a model with all three two-way associations present.
This is called the homogeneous association model and would be written as

    log μijk = λ + λ_A(i) + λ_B(j) + λ_C(k) + λ_AB(ij) + λ_AC(ik) + λ_BC(jk)
In this case all three two-way association terms are included. This would be analogous to an analysis of variance model with the main effects and all two-way interactions included.
Saturated model

The final model is the saturated model, which also includes a three-way association term, λ_ABC(ijk), in addition to the parameters in the last model. In this model there are as many free parameters as cells in the table and the fit of the model to the log of the cell frequencies is perfect. Here we are assuming that there is a three-way association among variables A, B and C that is above and beyond the two-way interactions plus the row, column and layer effects. These higher order interaction terms are not common in real-life data but significant ones sometimes occur.
Associations in partial and marginal tables

Some anomalies can occur in multiway tables. It is possible, for example, to have what are called positive associations in the partial layers 1 and 2 of Table 17.8 while having a negative association in the marginal table at the bottom of Table 17.8. A positive association is one where the odds ratio is greater than one and a negative association has an odds ratio that lies between zero and one. An example of this is given in Agresti (1990, section 5.2). Thus, the associations found in the partial layer tables can be quite different from the associations among the marginal sums. It is important to carefully examine the partial and marginal frequency tables computed by the statistical packages in an attempt to determine both from the data and from the test results how to interpret the outcome.
Hierarchical log-linear models

When one is examining log-linear models from multiway tables, there are a large number of possible models. One restriction on the models examined that is often used is to look at hierarchical models. Hierarchical models obey the following rule: whenever the model contains higher order λ's (higher order associations or interactions) with certain variables, it also contains the lower order effects with these same variables. For instance, if λ_AB(ij) is included in the model, then λ, λ_A(i) and λ_B(j) must be included. (An example of a nonhierarchical model would be a model only including λ, λ_A(i) and λ_BC(jk).) A hierarchical model is analogous to keeping the main effects in analysis of variance if an interaction is present. Most investigators find hierarchical models easier to interpret than nonhierarchical ones and restrict their search to this type. Note that when you get up to even four variables there are a very large number of possible models, so although this restriction may seem
limiting, actually it is a help in reducing the number of models and restricting the choice to ones that are easier to explain.
Four-way and higher dimensional models

As stated in section 17.2, the use of log-linear analysis for more than four variables is rare since, in those cases, the interpretation becomes difficult and insufficient sample sizes are often a problem. Investigators who desire to either use log-linear analyses a great deal or analyze higher order models should read texts such as Agresti (1990), Wickens (1991) or Bishop, Feinberg and Holland (1975). For a four-way model, the log-linear model can be written as follows. (Here we have omitted the ijkl subscripts to shorten and simplify the appearance of the equation.)

    log μ = λ + λ_A + λ_B + λ_C + λ_D
              + λ_AB + λ_AC + λ_AD + λ_BC + λ_BD + λ_CD
              + λ_ABC + λ_ABD + λ_ACD + λ_BCD + λ_ABCD
In all, there are over a hundred possible hierarchical models to explore for a four-way table. Obviously, it becomes necessary to restrict the ones examined. In the next section, we describe how to go about navigating through this sea of possible models.
17.8 TESTS OF HYPOTHESES FOR MULTIWAY TABLES: EXPLORATORY MODEL BUILDING

Since there are so many possible tests that can be performed in three- or higher-way tables, most investigators try to follow some procedure to restrict their search. Here, we will describe some of the strategies used and discuss tests to assist in finding a suitable model. In this section we consider a single sample with no restrictions on the role of any variable. In section 17.9 strategies will be discussed for finding models when some restrictions apply.

There are two general strategies. One is to start with just the main effects and add more interaction terms until a good-fitting model is obtained. The other is to start with a saturated or near-saturated model and remove the terms that appear to be least needed. The analogy can be made to forward selection or backward elimination in regression analysis. Exploratory model building for multiway tables is often more difficult to do than for linear regression analysis. The output from the computer programs requires more attention and the user may have to run the program several times before a suitable model is found, especially if several of the variables have significant
interaction terms. In other cases, perhaps only two of the variables have a significant interaction so the results are quite obvious.
Adding terms

For the first strategy, we could start with the main effects (single order factors) only. In Table 17.2, we presented data for a three-way classification using the variables INCOME (i), SEX (s) and TREAT (t). We ran log-linear analyses of these data using the BMDP 4F program, which uses the first letter of each variable name to identify that variable. When we included just the main effects i, s and t in the model, we got a likelihood ratio (goodness-of-fit) chi-square of 24.08 with four degrees of freedom and P = 0.0001, indicating a very poor fit. So we tried adding the three second order interaction terms (si, it and st). The likelihood ratio chi-square dropped to 0.00 with one degree of freedom and P = 0.9728, indicating an excellent fit. But we may not need all three second order interaction terms, so we tried just including si. When we specify a model with si, the program automatically includes i and s, since we are looking at hierarchical models. However, in this case we did need to specify t in order to include the three main effects. Here we got a likelihood ratio chi-square of 12.93 with three degrees of freedom and P = 0.0048. Thus, although this was an improvement over just the main effect terms, it was not a good fit. Finally, we added st to the model and got a very good fit with a likelihood ratio chi-square of 0.00 with two degrees of freedom and P = 0.9994. Therefore, our choice of a model would be one with all three main effects and the INCOME and SEX interaction as well as the SEX and TREAT interaction.
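These comparisons are not tied to BMDP 4F. Each hierarchical log-linear model can be fitted as a Poisson regression on the cell counts of Table 17.2, and the deviance of the fit is the likelihood ratio (goodness-of-fit) chi-square quoted above. A sketch using the statsmodels package:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Cell counts of Table 17.2 (SEX s, TREAT t, INCOME i).
cells = pd.DataFrame({
    "s": ["F", "F", "M", "M", "F", "F", "M", "M"],
    "t": ["No"] * 4 + ["Yes"] * 4,
    "i": ["Low", "High"] * 4,
    "count": [52, 24, 34, 36, 73, 34, 20, 21],
})

def g2(formula):
    """Deviance (G^2) and residual df of a log-linear model fitted as a
    Poisson regression on the cell counts."""
    fit = smf.glm(formula, data=cells, family=sm.families.Poisson()).fit()
    return round(fit.deviance, 2), int(fit.df_resid)

print(g2("count ~ s + t + i"))               # main effects only: should match 24.08 with 4 df
print(g2("count ~ s + t + i + s:i"))         # add si: should match 12.93 with 3 df
print(g2("count ~ s + t + i + s:i + s:t"))   # add st: near 0.00 with 2 df
```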
Eliminating terms

The second strategy, used when the investigator has a single sample (cross-sectional survey) and there is no prior knowledge about how the variables should be interrelated, is to examine the chi-square statistics starting with fairly high order interaction term(s) included. This will usually have a small value of the goodness-of-fit chi-square, indicating an acceptable model. Note that for the saturated model, the value of the goodness-of-fit chi-square will be zero. But a simpler model may also provide a good starting point. After eliminating nonsignificant terms, less complex models are examined to find one that still gives a reasonably good fit. Significant fourth or higher order interaction terms are not common and so, regardless of the number of variables, the investigator rarely has to start with a model with more than a four-way interaction. Three-way interactions are usually sufficient.
We instructed the program to run a model that includes the three-factor association term (λ_ABC) so that we would see the results for the model with all possible terms in it. Note that if we specify that the third order interaction term be included then the program will automatically include all second order terms, since, at this point, we are assuming a hierarchical model. First, we obtain results for three tests. The first tests that all main effects (λ_A, λ_B and λ_C) are zero. The second tests that all two-factor associations (λ_AB, λ_AC and λ_BC) are zero. The third tests that the third order association term (λ_ABC) is zero. The results from BMDP 4F were as follows:

Simultaneous Test that All K-Factor Interactions Are Zero

K-Factor   DF   LR Chi-Sq      P         Pearson Chi-Sq      P
1          3      31.87        0.00000        36.67          0.00000
2          3      24.08        0.00002        24.65          0.00002
3          1       0.00        0.97262         0.00          0.97276
These test results were obtained from the goodness-of-fit chi-squares for the following models:

1. all terms included;
2. all two-factor association and first order (main effect) terms included;
3. all first order terms included;
4. only the overall mean included.
Note that we are assuming hierarchical models and, therefore, the overall mean is included in all models. The chi-squares listed above were then computed by subtraction as follows. The LR Chi-Sq G² of 31.87 was computed by subtracting the goodness-of-fit chi-square for model 3 from the goodness-of-fit chi-square for model 4. The chi-square of 24.08 was computed by subtracting the chi-square for model 2 from the chi-square for model 3. The 0.00 chi-square was computed by subtracting the chi-square for model 1 from the chi-square for model 2. Similar subtractions were performed for the Pearson chi-square.

We start our exploration at the bottom of the table. The third order interaction term was not even near significant by either the likelihood ratio or the Pearson test, being 0.00 (first two decimal places were zero). Now we figure we only have to be concerned with the two-factor association and main effect terms. Since we have a large chi-square when testing that all K-factors for K = 2 are zero, we decide that we have to include some measures of association between two variables. But we do not know which two-way associations to include.

The next step is to try to determine which second order association terms to include. There are two common ways of doing this. One is to examine
the ratio of the estimates of each second order interaction term to its estimated standard error and keep the terms that have a ratio of, say, two or more (Agresti, 1990, section 7.2). The ratios were greater than three for both the SEX by INCOME interaction and the SEX by TREAT interaction. The third ratio was not greater than two, so that term should not be kept. The other common method involves examining output for individual tests of the three possible two-factor associations. The 4F program does two tests, one labeled partial association and one marginal association, as follows:
Effect   DF   Partial Association        Marginal Association
              Chi-Sq          P          Chi-Sq          P
is        1    10.67     0.0011           11.15     0.0008
it        1     0.00     0.9927            0.48     0.4895
st        1    12.45     0.0004           12.93     0.0003
In the partial association test, the program computes the difference in chi-square when all three second order association terms are included and when one is deleted. This is done for each association term separately. A large value of chi-square (small P value) indicates that deleting that association term results in a poor fit and the term should be kept. The marginal association test is done on the margins. For example, to test the SEX by INCOME association term, the program sums across TREAT to get a marginal table and obtains the chi-square for that table. This test may or may not give the same results as the partial test. If both tests result in a large chi-square, then that term should be kept. Also, if both result in a small chi-square value, then the term should be deleted. It is less certain what to do if the two tests disagree. One strategy would be to keep the interaction and report the disagreement. The results in our example indicate that we should retain the INCOME by SEX and the SEX by TREAT association terms, but there is no need for the INCOME by TREAT association term in the model. Here the partial and marginal association tests agree even though they do not result in precisely the same numerical values. These results also agree with our previous analysis. To check our results we ran a model with si and st (which also includes i, s and t). The fitted model has a very small goodness-of-fit chi-square, indicating a good fit. In summary, our analysis indicates that there is a significant association between gender and income as well as between gender and getting medical treatment, but there was no significant association between income and getting medical treatment and no significant third order association. Thus we obtained the conditional independence model described in section 17.7: INCOME and TREAT are conditionally independent given SEX.
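As a sketch of how a marginal association test works, the following Python fragment collapses a hypothetical three-way table over TREAT and computes the likelihood ratio chi-square for independence in the resulting two-way margin (one degree of freedom). The counts are again made up for illustration, not the Table 17.2 data.

import numpy as np
from scipy.stats import chi2

# Hypothetical counts indexed as [income, sex, treat].
obs = np.array([[[15, 25], [30, 10]],
                [[20, 30], [35, 15]]], dtype=float)

# Marginal test of the income-by-sex term: sum over TREAT, then test
# independence in the 2 x 2 margin.
margin = obs.sum(axis=2)
n = margin.sum()
expected = np.outer(margin.sum(axis=1), margin.sum(axis=0)) / n
g2 = 2.0 * np.sum(margin * np.log(margin / expected))
print(g2, chi2.sf(g2, df=1))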
Stepwise selection
Some computer programs offer stepwise options to assist the user in the choice of an appropriate model. In BMDP 4F both forward selection and backward elimination are available, but backward elimination (backward stepwise) is the preferred method (Benedetti and Brown, 1978; Oler, 1985). The criterion used for testing is the P value from the likelihood ratio chi-square test. The program keeps eliminating terms that have the least effect on the model until the test for lack of fit is significant. The procedure continues in a fashion comparable to stepwise procedures in regression or discriminant function analysis. Again, it should be noted that with all this multiple testing, levels of P should not be taken seriously. We recommend trying backward elimination more as a check to make sure that there is not a better model than the one that you have previously chosen. The process should not be relied on to automatically find the 'best' model. Nevertheless, it does serve as a useful check on previous results and should be tried.

As an example, the data in Table 17.3, which include four variables (SEX (s), INCOME (i), TREAT (t) and CESD (c)), were analyzed using backward elimination. We started with a model with all three-way interaction terms included (isc, itc, stc, ist). The program eliminates at each step the term with the largest P value for the likelihood ratio chi-square test (the least effect on the model). In this case, we instructed the program to stop when the largest P value among the terms eligible for elimination was less than 0.05. The program deleted each three-way interaction term one at a time and computed the chi-square, degrees of freedom and the P value. The three-way interaction term with the largest P value was ist, so it was eliminated in the first step, leaving isc, itc and stc. At the second step, the largest P value occurred when isc was tested, so it was eliminated, keeping itc and stc and also si. Note that we are testing only three-way interaction terms at this step and thus the program keeps all second order interactions. At step 3, the interaction stc has the largest P value, so it is removed and the resulting model is itc, si, sc and st. The terms it, tc and ic are contained in itc and are thus also included, since we have a hierarchical model. At step 4, itc was not removed and the term with the largest P value was sc, a second order interaction term, so sc was eliminated; the model at the end of step 4 is itc, si and st. At the next step the largest P value was 0.0360, for itc, which is less than 0.05, so it stays in and the process stops since no term should be removed. So the program selected the final model as itc, si and st. The main effects i, t, c and s must also be included, since they occur in at least one of the interaction terms.
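The backward elimination logic itself is straightforward. The schematic Python sketch below assumes a hypothetical function fit_loglinear(terms) that returns the goodness-of-fit chi-square and degrees of freedom for a hierarchical model containing the given terms (in practice this is what a package such as BMDP 4F reports); it also glosses over the bookkeeping needed to offer only removable terms, that is, terms not contained in a retained higher order term.

from scipy.stats import chi2

def backward_eliminate(terms, fit_loglinear, p_to_remove=0.05):
    """Repeatedly drop the term whose removal changes the fit least,
    stopping when every candidate removal would be significant."""
    terms = list(terms)
    g2_full, df_full = fit_loglinear(terms)
    while terms:
        # For each term, the test chi-square is the increase in the
        # goodness-of-fit chi-square when that term is deleted.
        candidates = []
        for term in terms:
            reduced = [t for t in terms if t != term]
            g2_red, df_red = fit_loglinear(reduced)
            p = chi2.sf(g2_red - g2_full, df_red - df_full)
            candidates.append((p, term, g2_red, df_red))
        p, term, g2_red, df_red = max(candidates)
        if p < p_to_remove:        # removing any term would significantly hurt the fit
            break
        terms.remove(term)
        g2_full, df_full = g2_red, df_red
    return terms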
f;~.';"
_og-.Jnear ana;Jysts
These results can be compared with the likelihood ratio chi-square tests for partial associations obtained when all three-way interaction terms were included, as given below:

Term   DF   LR Chi-Sq        P
i       1       14.04   0.0002
s       1       17.01   0.0000
t       1        0.01   0.9072
c       1       48.72   0.0000
si      1        9.91   0.0016
it      1        0.00   0.9655
ic      1        1.17   0.2799
st      1       11.98   0.0005
sc      1        2.34   0.1259
tc      1        0.32   0.5733
ist     1        0.01   0.9274
isc     1        0.01   0.9205
itc     1        4.74   0.0294
stc     1        1.23   0.2680
Here also we would include the itc three-way interaction term, and the si and st two-way interaction terms. We also need to include the it, ic, tc and t terms, even though they are not significant, if we include itc, in order to keep the model hierarchical. The model is written itc, si and st, since the it, ic and tc terms are automatically included according to the hierarchical principle. In this example we retained a three-way interaction term among INCOME, TREAT and CESD, although no pair of the terms involved in the three-way interaction had a significant two-way interaction. This illustrates some of the complexity that can occur when interpreting multiway tables. There was also a significant interaction between SEX and INCOME, as we had found earlier, and also between SEX and TREAT, with women being more apt to be treated by a physician.

Model fit

The fit of the model can be examined in detail for each cell in the table. This is often a useful procedure to follow, particularly when two alternative models are almost equally good, since it allows the user to see where the lack of model fit may occur. For example, there may be small differences throughout the table or perhaps larger differences in one or two cells. The latter would point to either a poor model fit or possibly errors in the data. This procedure is somewhat comparable to residual analysis after fitting a regression model.
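A small Python sketch of this kind of cell-by-cell examination, using the quantities defined in the paragraph that follows and purely illustrative observed and expected counts:

import numpy as np

# Illustrative observed counts and expected values from some fitted model.
observed = np.array([23.0, 7.0, 41.0, 19.0])
expected = np.array([20.5, 9.5, 44.0, 16.0])

difference = observed - expected
pearson_components = difference ** 2 / expected        # large values flag poorly fitted cells
standardized_deviates = difference / np.sqrt(expected) # roughly N(0, 1) in large samples

for cell, (d, comp, z) in enumerate(zip(difference, pearson_components,
                                        standardized_deviates)):
    flag = " <-- inspect" if abs(z) > 1.96 else ""
    print(f"cell {cell}: diff={d:6.2f}  component={comp:5.2f}  deviate={z:5.2f}{flag}")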
The expected values from the model are computed for each cell. Given this information, several methods of comparing the expected values with the actual frequencies are possible. The simplest is just to examine the difference, observed minus expected. But many find it easier to interpret the components of chi-square for each cell. For the Pearson chi-square this is simply (observed - expected)²/expected, computed for each cell. Large values of these components indicate where the model does not fit well. Standardized deviates are also available and many users prefer to examine them, for much the same reasons that standardized residuals are used in regression analysis. These are computed as (observed - expected)/(square root of expected). The standardized deviates are asymptotically normally distributed but are somewhat less variable than normal deviates (Agresti, 1990, section 7.3). Thus, standardized deviates greater than 1.96 or less than -1.96 will be at least two standard deviations away from the expected value and are candidates for inspection.

17.9 TESTS OF HYPOTHESES: SPECIFIC MODELS

The log-linear model is sometimes used when the sample is more complicated than a single random sample from the population. For example, suppose a sample of males and a separate sample of females are taken. This is called a stratified sample. Here we assume that the sample size in each subgroup or stratum (males and females) is fixed. This implies that the marginal totals for the variable SEX are fixed (unlike in the case of a single random sample, where the marginal totals for sex are random). When we collect a stratified sample, we have two independent samples, one for each gender. In this case we should always include the main effect for SEX, λ_SEX, in the model whether or not it is significant according to the chi-square test. This is done since we want to test the other effects when the main effects of SEX are accounted for. If we stratify by both income and gender, and take independent samples with fixed sample sizes among low income males, high income males, low income females and high income females, then we should include the λ_SEX + λ_INC + λ_SEX×INC terms in the model. Furthermore, in section 8.2, in connection with regression analysis, we indicated that investigators often have prior justification for including some variables and wish to check whether other variables improve prediction. Similarly, in log-linear modeling, it often makes the interpretation of the model conceptually simpler if certain variables are included, even if this increases the number of variables in the model. Since the results obtained in log-linear models depend on the particular variables included, care should be taken to keep in variables that make logical sense. If, theoretically, a main effect and/or certain association terms belong in a model to make the tests for other variables reasonable, then they should be kept.
As mentioned earlier, the goodness-of-fit chi-square for any specified model measures how well the model fits the data: the smaller the chi-square, the better the fit. Also, it is possible to test a specific model against another model obtained by adding or deleting some terms. Suppose that Model 2 is obtained from Model 1 by deleting certain interaction and/or main effect terms. Then we can test whether Model 2 fits the data as well as Model 1 as follows:

test chi-square = (goodness-of-fit chi-square for Model 2) - (goodness-of-fit chi-square for Model 1)
test degrees of freedom = (degrees of freedom for Model 2) - (degrees of freedom for Model 1)

If the test chi-square value is large (or the P value is small), then we conclude that Model 2 is not as good as Model 1, i.e. the deleted terms do improve the fit. This test is similar to the general F-test discussed in section 8.5 for regression analysis. It can be used to test two theoretical models against each other or in exploratory situations, as was done in the previous section. For example, in the SEX (s), INCOME (i), TREAT (t) data, suppose we believe that the ist term should not be included. Then our base model is st, is, it. Thus Model 1 includes these interactions and all main effects. We could then begin by testing the hypothesis that INCOME has no effect by formulating Model 2 to include only the st interaction (as well as s and t). If the P value of the test chi-square is small, we would keep INCOME in the model and proceed with the various methods described in section 17.8 to further refine the model. In some cases, the investigator may conceive of some variables as explanatory and some as response variables. The log-linear model was not derived originally for this situation but it can be used if care is taken with the model. If this is the case, it is important to include the highest order association term among the explanatory variables in the model (Agresti, 1990, section 7.2; Wickens, 1989, section 7.1). For example, if A, B and C are explanatory variables and D is considered a response variable, then the model should include the λ_ABC third order association term and other terms introduced by the hierarchical model, along with the λ_AD, λ_BD and λ_CD terms.
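A minimal sketch of this test, taking the two goodness-of-fit chi-squares and their degrees of freedom as given (for example, copied from program output); the numbers in the example are the ones quoted earlier for the Table 17.2 data.

from scipy.stats import chi2

def compare_models(g2_model2, df_model2, g2_model1, df_model1):
    """Test whether the smaller Model 2 fits as well as the larger Model 1."""
    test_chi_square = g2_model2 - g2_model1
    test_df = df_model2 - df_model1
    p_value = chi2.sf(test_chi_square, test_df)
    return test_chi_square, test_df, p_value

# Example: main effects only (Model 2) versus main effects plus all two-way
# terms (Model 1), using the likelihood ratio chi-squares quoted earlier.
print(compare_models(24.08, 4, 0.00, 1))   # test chi-square 24.08 on 3 df, P < 0.001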
17.10 SAMPLE SIZE ISSUES

The sample size used in a multiway frequency table can result in unexpected answers either because it is too small or too large. As with any statistical analysis, when the sample is small there is a tendency not to detect effects
unless these effects are large. This is true for all statistical hypothesis testing. The problem of low power due to small samples is common, especially in fields where it is difficult or expensive to obtain large samples. On the other hand, if the sample is very large, then very small and unimportant effects can be statistically significant.

There is an additional difficulty in the common chi-square test. Here, we are using the chi-square distribution (which is continuous) to approximate the distribution of a statistic computed from frequencies (which are discrete numbers); any statistic computed from the frequencies can take on only a limited (finite) number of values. Theoretically, the approximation has been shown to be accurate as the sample size approaches infinity. The critical criteria are the sizes of the expected values that are used in the chi-square test. For small tables (only four or so cells) the minimum expected frequency in each cell is thought to be two or three. For larger tables, the minimum expected frequency is supposed to be around one, although for tables with a lot of cells, some could be somewhat less than one.

Clogg and Eliason (1987) point out three possible problems that can occur with sparse tables. The first is essentially the one given above: the goodness-of-fit chi-square may not have the theoretical null distribution. Second, sparse tables can result in difficulties in estimating the various λ parameters. Third, the estimate of λ divided by its standard error may be far from normal, so inferences based on examining the size of this ratio may be misleading.

For the log-linear model, we are looking at differences in chi-squares. Fortunately, when the sample size is adequate for the chi-square tests, it is also adequate for the difference tests (Wickens, 1989, section 5.7; Haberman, 1977). One of the reasons that it is difficult to make up a single rule is that, in some tables, the proportions in the subcategories in the margins are almost equal, whereas in other tables they are very unequal. Unequal proportions in the margins mean that the total sample size has to be much larger to get a reasonably sized expected frequency in some of the cells. It is recommended that the components of chi-square for the individual cells be examined if a model is fit to a table that has many sparse cells, to make sure that most of the goodness-of-fit chi-square is not due to one or two cells with small expected value(s). In some cases, a particular cell might even have a zero observed frequency. This can occur due to sampling variation (called a sampling zero) or because it is impossible to get a frequency in the cell (no pregnant males), which is called a structural zero. In the latter case, some of the computer programs will exclude cells that have been identified as having structural zeros from the computation.

On the other hand, if the sample size is very large, as it sometimes is when large samples from the census or other governmental institutions are analyzed,
·":_:.·::c
~-og-anear
ana ysts
the resulting log-linear analysis will tend to fit a very complicated model, perhaps even a saturated model (Knoke and Burke, 1980). The size of some of the higher order association terms may be small, but they will still be significantly different from zero when tested, due to the large sample size. In this case, it is suggested that a model be fitted simply with the main effect terms included and G² (the likelihood ratio chi-square) be computed for this baseline model. Then alternative models are compared to this baseline model to see the proportional improvement in G². This is done using the following formula:

(G²(baseline) - G²(alternative)) / G²(baseline)
As more terms are added, it is suggested that you stop when the numerical value of the above formula is above 0.90 and the effects of additional terms are not worth the added complexity in the model.

17.11 THE LOGIT MODEL
Log-linear models treat all the variables equally and attempt to clarify the interrelationships among all of them. However, in many situations the investigator may consider one of the variables as outcome or dependent and the rest of the variables as explanatory or independent. For example, in the depression data set, we may wish to understand the relationship of TREAT (outcome) and the explanatory variables SEX and INCOME. We can begin by considering the log-linear model of the three variables with the following values:

I (= INCOME) = 0 if low (19 or less), 1 if high (20 or more)
S (= SEX)    = 0 if male, 1 if female
T (= TREAT)  = 0 if treatment is recommended, 1 if no treatment is recommended
Analysis of this model shows that the three-way interaction is unnecessary (chi-square = 0.0012 and P = 0.97). Therefore we consider the model which includes only the two-way interactions. The model equation is

log μ_ijk = λ + λ_I(i) + λ_S(j) + λ_T(k) + λ_IS(ij) + λ_IT(ik) + λ_ST(jk)
The estimates of the parameters are obtained from BMDP 4F as follows:

λ_I(0)     =  0.178 = -λ_I(1)
λ_S(0)     = -0.225 = -λ_S(1)
λ_T(0)     = -0.048 = -λ_T(1)
λ_IS(0,0)  = -0.206 =  λ_IS(1,1) = -λ_IS(1,0) = -λ_IS(0,1)
λ_IT(0,0)  = -0.001 =  λ_IT(1,1) = -λ_IT(1,0) = -λ_IT(0,1)
λ_ST(0,0)  = -0.219 =  λ_ST(1,1) = -λ_ST(1,0) = -λ_ST(0,1)
These numbers do not readily shed light on treatment as a function of sex and income. One method for better understanding this relationship is the logit model. First, we consider the odds of no treatment, i.e. the probability of no recommended treatment divided by the probability of recommended treatment. Thus,

odds = π_ij1 / π_ij0

for a person for whom INCOME = i and SEX = j. This can equivalently be written as

odds = μ_ij1 / μ_ij0

since μ_ijk = N π_ijk. Although this is an easily interpreted quantity, statisticians have found it easier to deal mathematically with the logit, or the logarithm of the odds. (Note that this is also what we did in logistic regression.) Thus,

logit = log(odds) = log(μ_ij1 / μ_ij0)

This equation may be equivalently written as

logit = log μ_ij1 - log μ_ij0

which, after substituting the log-linear model, becomes

[λ + λ_I(i) + λ_S(j) + λ_T(1) + λ_IS(ij) + λ_IT(i1) + λ_ST(j1)]
- [λ + λ_I(i) + λ_S(j) + λ_T(0) + λ_IS(ij) + λ_IT(i0) + λ_ST(j0)]
which reduces to

(λ_T(1) - λ_T(0)) + (λ_IT(i1) - λ_IT(i0)) + (λ_ST(j1) - λ_ST(j0))

But recall that the λ parameters sum to zero across treatment categories, i.e. λ_T(0) = -λ_T(1), and hence

λ_T(1) - λ_T(0) = 2λ_T(1)
Similar manipulations for the other parameters reduce the expression for the logit to

logit = 2λ_T(1) + 2λ_IT(i1) + 2λ_ST(j1)

Note that this is a linear equation of the form

logit = constant + effect of income on treatment + effect of sex on treatment

For example, to compute the logit or log odds of no recommended treatment for a high income (i = 1) female (j = 1), we identify
λ_T(1) = 0.048, λ_IT(1,1) = -0.001, λ_ST(1,1) = -0.219

and substitute in the logit equation to obtain

logit = 0.096 - 0.002 - 0.438 = -0.344

By exponentiating, we compute the odds of no recommended treatment as exp(-0.344) = 0.71. Conversely, the odds of recommended treatment are exp(0.344) = 1.411. Similarly, for a low income male, we compute the logit of no recommended treatment as 0.536, or odds of 1.71. Similar manipulations can be performed to produce a logit model for any log-linear model in which one of the variables represents a binary outcome. Demaris (1992) and Agresti (1990) discuss this subject in further detail, including situations in which the outcome and/or the explanatory variables may have more than two categories. In order to understand the relationship between the logit and logistic models, we define the following variables:

Y = TREAT (= 0 for yes and 1 for no)
X1 = SEX (= 0 for male and 1 for female)
X2 = INCOME (= 0 for low and 1 for high)
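A minimal Python sketch of the arithmetic that converts the λ estimates given above into logits and odds (the function and variable names are illustrative only, not from any particular package):

import math

# Lambda estimates from the fitted model (see above); the first index of the
# two-way terms is INCOME or SEX, the second index is TREAT = 1 (no treatment).
lam_T1 = 0.048
lam_IT = {0: 0.001, 1: -0.001}   # lambda_IT(i,1) for INCOME i = 0, 1
lam_ST = {0: 0.219, 1: -0.219}   # lambda_ST(j,1) for SEX j = 0, 1

def logit_no_treatment(income, sex):
    """logit = 2*lambda_T(1) + 2*lambda_IT(i,1) + 2*lambda_ST(j,1)."""
    return 2 * lam_T1 + 2 * lam_IT[income] + 2 * lam_ST[sex]

# High income female: logit = -0.344, odds of no treatment = exp(-0.344) = 0.71
print(logit_no_treatment(1, 1), math.exp(logit_no_treatment(1, 1)))
# Low income male: logit = 0.536, odds = 1.71
print(logit_no_treatment(0, 0), math.exp(logit_no_treatment(0, 0)))

# Coefficients of the equivalent logit (logistic) equation with X1 = SEX, X2 = INCOME
intercept = logit_no_treatment(0, 0)                                   # 0.536
coef_sex = logit_no_treatment(0, 1) - logit_no_treatment(0, 0)         # -0.876
coef_income = logit_no_treatment(1, 0) - logit_no_treatment(0, 0)      # -0.004
print(intercept, coef_sex, coef_income)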
Some numerical calculations show that the logit equation for the above example can be written as

log[ prob(Y = 1) / prob(Y = 0) ] = 0.536 - 0.876 X1 - 0.004 X2
which is precisely in the form of the logistic regression model presented in Chapter 12. In fact, these coefficients agree within round-off error with the results of logistic regression programs which we ran on the same data. In general, it can be shown that any logit model is equivalent to a logistic regression model. However, whereas logit models include only categorical (discrete) variables, logistic regression models can handle discrete and continuous variables. On the other hand, logit models including discrete and continuous variables are also used and are called generalized logit models (Agresti, 1990). For these reasons, the terms 'logit models' and 'logistic regression models' are used interchangeably in the literature.

17.12 DISCUSSION OF COMPUTER PROGRAMS

Of the six packages described in this book, BMDP, SAS, SPSS, STATISTICA and SYSTAT fit log-linear models. The features of these programs are summarized in Table 17.9. BMDP has a wide range of output in its 4F program. Note also that SPSS has a useful selection in its HILOGLINEAR program when combined with its CROSSTAB program. The CATMOD procedure in SAS includes log-linear modeling along with several other methods. In fact, it is possible to perform log-linear analysis with any of these five packages. However, programs that examine several models and allow you to get a printed output of the results in one run appear to be the simplest to use. Printing the output is recommended for log-linear analysis because it often takes some study to decide which model is desired.

When using BMDP 4F, one can either enter the specific model to be tested or indicate what level of association (for example, all three-way) to include. This should be the most complete model you want to test. Several of the programs also have options for performing 'stepwise' operations. SPSS provides the backward elimination method. STATISTICA starts off in a forward addition mode until it includes all significant k-way interactions and then does backward elimination. BMDP provides either forward addition or backward elimination. The forward selection method has been shown to have low power unless the sample size is large. When backward elimination is used, the user specifies the starting model. Terms are then deleted as either a 'simple' or a 'multiple' effect; here simple signifies that one term is removed at a time and multiple means that a combination of terms is removed at each step.
Table 17.9 Summary of computer output for log-linear analysis. For each type of output (multiway table, Pearson's chi-square, likelihood ratio chi-square, test of K-factor interactions, differences in chi-squares, stepwise selection, partial association, marginal association, structural zeros, delta, expected values, residuals, standardized residuals, components of LR chi-square, components of Pearson chi-square, lambda estimates), the table lists the procedure that produces it in each package: BMDP 4F (all of the listed output), SAS FREQ and CATMOD, SPSS CROSSTAB and HILOGLINEAR, STATISTICA Two-way tables and Log-linear, and SYSTAT Tables.
We recommend the simple option, particularly for deletion; the simple option was used in section 17.8. SPSS and STATISTICA use the simple option and BMDP offers both simple and multiple.

The delta option allows the user to add an arbitrary constant to the frequency in each cell. Some investigators who prefer Yates's correction in Pearson chi-square tests add 0.5, particularly when they suspect they have a table that has several cells with low or zero frequencies. This addition tends to result in simpler models and can be overly conservative. It is not recommended that it be uniformly used in log-linear analysis. If a constant is added, the user should try various values and consider constants that are considerably smaller than 0.5. In most programs the default option is delta = 0.00, but in STATISTICA it is set at 0.5.

SYSTAT has a new log-linear program in draft form for their next version. This new program, called Loglin, provides considerably more options than their present Tables program, including a rich assortment of information on the fit of each cell frequency and estimates of the λ's. Although STATA does not include a log-linear program, its logit procedure fits logit models to a binary outcome and produces statistics useful for testing hypotheses about the parameters.

17.13 WHAT TO WATCH OUT FOR
Log-linear analysis requires special care and the results are often difficult for investigators to interpret. In particular, the following should be noted.
1. The choice of variables for inclusion is as important here as it is in multiple regression. If a major variable is omitted you are in essence working with a marginal table (summed over the omitted variable). This can lead to the wrong conclusions since the results for partial and marginal tables can be quite different (as noted in section 17.7). If the best variables are not considered, no method of log-linear modeling will make up for this lack.
2. It is necessary to carefully consider how many and what subcategories to choose for each of the categorical variables. The results can change dramatically depending on which subcategories are combined or kept separate. We recommend that this be done based on the purpose of the analysis and what you think makes sense. Usually, the results are easier to explain if fewer subcategories are used.
3. Since numerous test results are often examined, the significance levels should be viewed with caution.
4. In general, it is easier to interpret the results if the user has a limited number of models to check. The investigator should consider how the sample was taken and any other factors that could assist in specifying the model.
5. If the model that you expect to fit well does not, then you should look at the cell residuals to determine where it fails. This may yield some insight into the reason for the lack of fit. Unless you get a very good fit, residual analysis is an important step.
6. Collapsing tables by eliminating variables should not be done without first considering how this will affect the conclusions. If there is a causal order to the variables (for example gender preceding education preceding income) it is best to eliminate a variable such as income that comes later in the causal order (see Wickens, 1989, for further information on collapsing tables).
7. Clogg and Eliason (1987) provide some specific tables for checking whether or not the correct degrees of freedom are used in computer programs. Some programs use incorrect degrees of freedom for models that have particular patterns of structural zeros or sampling zeros.
SUMMARY

Log-linear analysis is used when the investigator has several categorical variables (nominal or ordinal variables) and wishes to understand the interrelationships among them. It is a useful method for either exploratory or confirmatory analysis, and it provides an effective way of analyzing categorical data. The results from log-linear (or logit) analysis can also be compared to what you obtain from logistic regression analysis (if categorical or dummy variables are used for the independent variables).

Log-linear analysis can be used to model special sampling schemes in addition to the ones mentioned in this chapter. Bishop, Feinberg and Holland (1975) provide material on these additional models. Log-linear analysis is a relatively new statistical technique, so there is not the wealth of experience and study on model building that there is with regression analysis. This forces users to employ more of their own judgment in interpreting the results.

In this chapter we provided a comprehensive introduction to the topic of exploratory model building. We also described the general method for testing the fit of specific models. We recommend that the reader who is approaching the subject for the first time analyze various subsets of the data presented in this book and also some data from a familiar field of application. Experience gained from examining and refining several log-linear models is the best way to become comfortable with this complex topic.

REFERENCES

References preceded by an asterisk require strong mathematical background.

Afifi, A.A. and Azen, S.P. (1979). Statistical Analysis: A Computer Oriented Approach, 2nd edn. Academic Press, New York.
Agresti, A. (1984). Analysis of Ordinal Categorical Data. Wiley, New York.
Agresti, A. (1990). Categorical Data Analysis. Wiley, New York.
Benedetti, J.K. and Brown, M.B. (1978). Strategies for the selection of log-linear models. Biometrics, 34, 680-86.
*Bishop, Y.M., Feinberg, S.E. and Holland, P.W. (1975). Discrete Multivariate Analysis. MIT Press, Cambridge, MA.
Clogg, C.C. and Eliason, S.R. (1987). Some common problems in log-linear analysis. Sociological Methods and Research, 16, 8-44.
Demaris, A. (1992). Logit Modeling: Practical Applications. Sage, Newbury Park, CA.
Dixon, W.J. and Massey, F.J., Jr (1983). Introduction to Statistical Analysis, 4th edn. McGraw-Hill, New York.
Fleiss, J.L. (1981). Statistical Methods for Rates and Proportions. Wiley, New York.
Haberman, S.J. (1977). Log-linear models and frequency tables with small expected cell counts. Annals of Statistics, 5, 1148-69.
Hildebrand, D.K., Laing, J.D. and Rosenthal, H. (1977). Analysis of Ordinal Data. Sage, Beverly Hills, CA.
Knoke, D. and Burke, P.J. (1980). Log-linear Models. Sage, Newbury Park, CA.
Oler, J. (1985). Noncentrality parameters in chi-squared goodness-of-fit analyses with an application to log-linear procedures. Journal of the American Statistical Association, 80, 181-89.
Wickens, T.D. (1989). Multiway Contingency Tables Analysis for the Social Sciences. Lawrence Erlbaum Associates, Hillsdale, NJ.
FURTHER READING

*Anderson, E.B. (1990). The Statistical Analysis of Categorical Data. Springer-Verlag, New York.
Christensen, R. (1990). Log-linear Models. Springer-Verlag, New York.
Everitt, B.S. (1992). The Analysis of Contingency Tables. Chapman & Hall, London.
Feinberg, S.E. (1980). The Analysis of Cross-Classified Categorical Data, 2nd edn. MIT Press, Cambridge, MA.
Ishii-Kuntz, M. (1994). Ordinal Log-linear Models. Sage, Newbury Park, CA.
Upton, G.J.G. (1978). The Analysis of Cross-Tabulated Data. Wiley, New York.
PROBLEMS

17.1 Using the variables AGE and CESD from the depression data, make two new categorical variables called CAGE and CCESD. For CAGE, group together all respondents that are 35 years old or less, 36 up to and including 55, and over 55 into three groups. Also group the CESD into two groups: those with a CESD of ten or less and those with a CESD greater than ten. Compute a two-way table and obtain the usual Pearson chi-square statistic for the test of independence. Also, run a log-linear analysis and compare the results.
17.2 Run a log-linear analysis using the variables DRINK, CESD (split at ten or less) and TREAT from the depression data set to see if there is any significant association among these variables.
17.3 Split the variable EDUCAT from the depression data set into two groups, those who completed high school and those who did not. Also split INCOME at 18 or less and greater than 18 (this is thousands of dollars). Run a log-linear analysis of SEX versus grouped EDUCAT versus grouped INCOME and describe what associations you find. Does the causal ordering of these variables suggested in paragraph 6 of section 17.13 help in understanding the results?
17.4 Perform a log-linear analysis using data from the lung cancer data set described in section 13.3. Check if there are any significant associations among Staget, Stagen and Hist. Which variable is resulting in a small number of observations in the cells?
17.5 Run a four-way log-linear analysis using the variables Stagen, Hist, Smokfu and Death in the lung cancer data set. Report on the significant associations that you find.
17.6 Rerun the analyses given in Problem 17.4 adding a delta of 0.01, 0.1 and 0.5 to see if it changes any of the results.
17.7 In Problem 17.5, eliminate the four-way interaction, then compare the two models: the one with and the one without Smokfu. Interpret the results.
Appendix A Lung function data
The lung function data set includes information on nonsmoking families from the UCLA study of chronic obstructive respiratory disease (CORD). In the CORD study persons seven years old and older from four areas (Burbank, Lancaster, Long Beach, and Glendora) were sampled, and information was obtained from them at two time periods. The data set presented here is a subset including families with both a mother and a father, and one, two, or three children between the ages of 7 and 17 who answered the questionnaire and took the lung function tests at the first time period. The purpose of the CORD study was to determine the effects of different types of air pollutants on respiratory function, but numerous other types of studies have been performed on this data set. Further information concerning the CORD study can be found in the Bibliography at the end of this appendix.

The code book in Table A.1 summarizes the information in this subset of data. After the first two variables, note that the same set of data is given for fathers (F), mothers (M), oldest child (OC), middle child (MC), and youngest child (YC). These data are available on disk from the publisher. The data on the disk are in ASCII format (variables separated by blanks). The FEV1 and FVC data on the disk have been written with decimal points, so the FVC for the first father is written as 3.91 and FEV1 is 3.23 in liters. The first 25 cases are printed in this Appendix. They do not show the separators since we wanted to get all the data for one family on one line. The format is F3.0, F1.0, 5(F1.0, 2F2.0, F3.0, 2F3.2). Note that in the listing of the data in Table A.2 columns of blanks have been added to isolate the data for different members of the family. The format statement and the code book assume no blanks. Some families have only one or two children between the ages of 7 and 17. If there is only one child, it is listed as the oldest child. Thus there are numerous missing values in the data for the middle and youngest child.

Forced vital capacity (FVC) is the volume of air, in liters, that can be expelled by the participant after having breathed in as deeply as possible, that is, full expiration regardless of how long it takes. The measurement of
FVC is affected by the ability of the participant to understand the instructions and by the amount of effort made to breathe in deeply and to expel as much air as possible. For this reason seven years is the practical lower age limit at which valid and reliable data can be obtained. Variable FEVl is a measure of the volume of the air expelled in the first second after the start of expiration. Variable FEVl or the ratio FEVl/FVC has been used as an outcome variable in many studies of the effects of smoking on lung function. The authors wish to thank Dean Roger Detels, principal investigator of the CORD project, for the use of tp.is data set and Miss Cathleen Reems for assembling it. Table A.l Code book for lung function data set Variable number 1 2
Variable name
Description
ID AREA
1-150 1 =Burbank 2 = Lancaster 3 = Long Beach 4 =Glendora 1 =male Age, father Height (in.), father Weight (lb), father FVC father FEV1 father 2 =female Age, mother Height (in.), mother Weight (lb), mother FVC mother FEV1 mother Sex, oldest child 1 =male 2 =female Age, oldest child Height, oldest chi~d Weight, oldest chdd FVC oldest child FEV1 oldest child Sex, middle child Age, middle child Height, middle child Weight, middle child FVC middle child FEV1 middle child Sex, youngest child Age, youngest child . Height, youngest ch~d Weight, youngest child FVC youngest child FEV1 youngest child
14 15
FSEX FAGE FREIGHT FWEIGHT FFVC FFEV1 MSEX MAGE MHEIGHT MWEIGHT MFVC MFEV1 OCSEX
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
OCAGE OCHEIGHT OCWEIGHT OCFVC OCFEV1 MCSEX MCAGE MCHEIGHT MCWEIGHT MCFVC MCFEVl YCSEX YCAGE YCHEIGHT YCWEIGHT YCFVC YCFEV1
3 4 5 6 7 8 9 10 11
12 13
Table A.2 Lung function data set
Columns 123 1 2 3 4 5
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
4
1 56789012345678
2 3 90123456789012
34567890123456
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
(father) 15361161391323 14072198441395 12669210445347 13468187433374 14661121354290 14472153610491 13564145345339 14569166484419 14568180489429 13066166550449 14670188473390 15068179391334 13174210526398 13767195468401 14070190522456 13271170501411 13774198595414 15472223361285 13466176467383 14069178618479 13369176651472 13172163512463 15173215499416 13071163518470 13967151458377
(mother) 24362136370331 23866160411347 22759114309265 23658123265206 23962128245233 23666125349306 22768206492425 24563115342271 24168144357313 22667156364345 24465136348243 24864179232186 23066143399365 23267164508357 23559112303257 23268138448354 23562145360292 24765122363275 23057110260204 23763131350287 23165142428319 23066147417355 24963122299294 22965125206202 23464176418351
(child 1) 21259115296279 11056 66323239 1 850 59114111 21157106256185 11661 8826024 7 11567100389355 21154 70218163 11567153460388 21465144289272 2 949 52192142 11768145501381 21150 54152144 2 852 61148132 21055 61224205 21156 92244206 2 955 61169149 11361103296250 11159 84216213 2 849 52171142 21664117371353 21162101297258 1 749 55164157 11770180496436 1 850 50158144 11771164567449
4
5 6 78901234567890
7 12345678901234
(child 2)
(child 3)
1 949 56159130 21260 85268234 11357 87276237
21050 53154143 21055 72195169
2 954 81193187 11262108257235 l 750 61236153
1 854 87241165 2 749 50118108 21058 85249218 2 954 58202184 11257 85254214 2 850 52134131 11567148391287 21361145266242
11264105326299
446
Appendix B
FURTHER READING Detels, R., Rokaw, S.N., Coulson, A.H., Tashkin, D.P., Sayre, J.W. and Massey, F.J. (1979). The UCLA population studies of chronic obstructive respiratory disease. I. Methodology and comparison oflung function in areas of high and low pollution. American Journal of Epidemiology, 109, 33-58. Detels, R., Sayre, J.W., Coulson, A.H., Rokaw, S.N., Massey, F.J., Tashkin, D.P. and Wu, M. (1981). The UCLA population studies of chronic obstructive respiratory disease. IV. Respiratory effects of long-term exposure to photochemical oxidants, nitrogen dioxide, and sulfates on current and never smokers. American Review of Respiratory Disease, 124, 673-80. Rokaw, S;N., Detels, R., Coulson, A.H., Sayre, J.W., Tashkin, D.P., Allwright, S.S. and Massey, F.J. (1980). The UCLA population studies of chronic respiratory disease. Ill Comparison of pulmonary function in three communities exposed to photochemical oxidants, multiple primary pollutants, or minimal pollutants. Chest, 78, 252-62. Tashkin, D.P., Detels, R., Coulson, A.H., Rokaw, S.N. and Sayre, J.W. (1979). The UCLA population studies of chronic obstructive respiratory disease. II. Determination of reliability and estimation of sensitivity and specificity. Environmental Research, 20, 403-24.
Appendix B Lung cancer survival data In section 13.3 a description of this data set and a codebook are given. In the data available on disk, a full point (period) has been used to identify missing values. The data on the disk are in ASCII format (variables separated by blanks). The first 40 cases of the data are reprinted in Table B.l. The authors wish to thank Dr E. Carmack Holmes, Professor of Surgical Oncology at the UCLA School of Medicine, for his kind permission to use the data.
Table B.l Lung cancer data ID
Staget
Stagen
Hist
Treat
Perfbl
Poinf
Smokfu
Smokbl
Days
Death
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 0 1 1 0 0 1 1 1 0 0 0 1 1 1 1 1 0 1 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1 1 0 1 1 1
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1
0 1 0 1 0 1 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 2 2
1 1 2 1 1 1 1 1 2 2 3 1 1 1 1 1 1 2 3 2 2
2926 590 2803 2762 2616 2716 1485 2456 500 2307 1200 1073 2209 2394 2383 3425 694 2738 3180 204 3180 2847 849 2568 524 2504 3183 3116 2940 2638 504 726 287 744 1376 314 3532 187 2924 1259
0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 l 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
2 2 2 2 1 1 1 1 1
MIS 1
MIS MIS 2 3 1 1 1 2 1
MIS
2 3
MIS MIS 1 2
MIS MIS
2 2 2
MIS 3 1 1
MIS 2 2
MIS
2
MIS
2 2
MIS 1 1
1 2 2 2 1 3 1 1 1 1
2 2 1 1 2 2
1 0 0 1 l 1 1 1 1 0 1 0 1
Index
Accelerated failure time model 317 Adjusted multiple correlation 171, 177, 181-3 Agglomerative clustering 391-5 AIC 172-3, 185 Antilogarithm 53 ASCII files 32 Augmented partial residual plots 190 Bartlett's chi-square test for canonical correlation 233 Bonferroni inequality 139 Calibration 113-14 Canonical correlation analysis 6-7, 227-41 Bartlett's chi-square test 233 canonical correlation 229, 232-3 discriminant function 269-72 loadings 236 structural coefficients 236 variable loadings 236 variable scores 234 variables 230, 232-3 coefficients 230--3 computer programs 237-40 correlation matrix 229 first canonical correlation 230-2 interpretation 230--3 plots 234-5 redundancy analysis 237 second canonical correlation 232-3 standardized coefficients 232 tests of hypotheses 233-4 what to watch for 240--41 Canonical discriminant function 269-72
Canonical variables 230, 269-70 Categorical variables 16 CESD 4, 228, 341 Classification of variables 13-16 Cluster analysis 9, 381-409 agglomerative clustering 391-5 centroid 392-3 city-block distance 390 computer programs 404-406 dendrogram 394, 399-400 distance matrix 389 distance measures 389-91 Euclidian distance 390-91 F test 398, 402-403 graphical techniques 385-9 hierarchical clustering 391-5, 398-400 icicles 394 K-means clustering 395-8, 400-403 linkage methods 394 Mahalanobis distance 390 number of clusters 395, 397-8 outliers 389 profile diagram 385-8 profile plot of means 402-403 scatter diagrams 383, 385 seeds 397 similarity 389 standardized distance 390 standardized variables 390 taxonomic classification 381-2 tree graph 394, 399-400 what to watch for 406 Code book construction 40-42 Coefficient of determination 138 Coefficient of variation 73, 75 Combining data sets 35-6 Conditional distribution 134
450
Index
Confidence intervals 92-3, 129, 287, 289 Continuous variables 16 Cook's distance 107, 145 Correlation 94, 96-9, 132-7, 140-43, 169,181-3,229-33,256,338,356 adjusted multiple 171, 177, 181-3 matrix 132-3, 140-41, 169, 229, 338, 356 multiple 134-5, 142-3, 229, 256 partial 135, 143 simple 94, 96-9 Costs of misclassification 261-2 Covariance matrix 132-3 cp criterion 172, 184 Cox's proportional hazards model 319-20 Cox versus logistic regression 322-4 Cox versus log:-linear regression 320-22 Cumulative distribution 58, 312-14 Data entry 28-32 Data management capabilities 33-6 Data screening 36-9, 54-9, 62-4, 64-6, 102-109 independence 64-6, 108 normality 54-9, 62-4 outliers 104-108, 347 Data transfer 32 Dependent variables 17, 85-6, 125 Depression code book 41-2 Depression study 3-4, 40-2, 228-9, 245,341-4,372-4 CESD 3-4, 228, 341 code book 42 data 43 definition of depression 245 Discrete variables 16 Discriminant analysis 7, 243-79 canonical variables 269-72 classification 243-52 classification function 257 coefficients 257 computer programs 272-5 cost of misclassification 261-2 cross-validation 263 D 2 253-4, 263-4 description 244 dividing point 248, 259-62 dummy variable 256 Fisher discriminant function 250-3
Hotelling T 2 265 jackknife 263 Mahalanobis distance 253-4, 256, 263-4, 272 minimizing misclassification 260 more than two groups 267-9 pooled variance of discriminant function 253 posterior probabilities 258-9 prediction 244 prediction by guessing 264 prior probabilities 259-61 quadratic discriminant analysis 274 regression analogy 255-6 renaming groups 257 standardized coefficients 258 tests of hypotheses 265-6 variable selection 266-7 what to watch for 275-6 Wilks' lambda 272 Dummy variables 202-9, 256, 269, 286 Durbin-Watson statistic 108 e as base 52 Eigenvalues 335-6, 341-3, 359, 365 Ellipse of concentration 96-8, 234, 249-50 Event history analysis 307 Exponential function 53 Factor analysis 8, 354-79 based 9:£!. correlation 356, 362 common factors 357 communality 357-8, 360; 362-3 computer programs 374-6 direct quartimin rotation 369-70 eigenvalues 359, 365-76 factor diagram 361 factor loadings 357-8, 360, 363 factor model 356-7 factor rotation 365:._70 factor score coefficients 371-2 factor scores 371-2 factor structure matrix 359 initial factor extraction 359-65 iterated principal C<'mponents 362-3 Kaiser normalization 366 loading of ith variable 357-8, 360 latent factors 378 Mahalanobis distance 374
Index maximum likelihood 365 nonorthogonal rotations 368-71 number of factors 364-5, 374 oblique rotation 368-71, 374 orthogonal rotation 366-8 outliers 376 pattern matrix 359 principal components 358-62 principal axis factoring 362 principal .factor analysis 362-5 regression procedure 372 rotated factors 365~71 scree method 365 specificity 357, 360 standardized .X 356, 372 storing factor scores 372 unique factors 357 varimax rotation 366-8 what to watch for 376-7 Failure time analysis 307 Fisher discriminant function 250-53 Forced expiratory volume 1 sec (FEV1) 8, 64, 86, 125 Forced vital capacity (FVC) 8, 64 Forecasting 114 General F test 153-4, 173-4 Geometric mean 75 Harmonic mean 75 Hierarchical clustering 391-5, 398-400 Homoscedasticity 88 Hotelling fl 265 Independence, assessing 64-7, 108, 145, 415-16, 421, 427-8, 431-2 Independent variables 17, 85, 125 Indicator variables 203 Influence of observation 107 Interaction 146-8, 207-208, 287 Interquartile range 74 Interval variables 15, 74 Jackknife procedure 263 JOIN 35-6 JOIN MATCH 35 K-means clustering 395-8, 400-403 Kolmogorov-Smirnov D test 63 Least squares method 89-91, 129
451
Leverage 105-106, 144 Likelihood ratio chi-square 418, 421, 434 Log-linear regression model 317-19 Logarithmic transformation 48-53, 111
Logistic regression 8, 281-305 adjust constant 297 applications 296-9 assumption 284-5 case-control sample 297-9 categorical variables 285-7 coefficients 284-9 computer programs 299-301 confidence intervals categorical data 287 confidence intervals continuous data 289 continuous variables 288-9 cross-sectional sample 297 cutoff point 293-4 dummy variables 285-7 goodness of fit chi-square 292-3 improvement chi-square 291 interaction 287, 293 logarithm odds 284 logistic function 283-4 logit 284 matched samples 297-9 maximum likelihood 285 model fit 291-3 odds 284 odds ratio 286-7 prior probabilities 288 probability population 285, 288 ROC curves 295-6 sensitivity 295 specificity 295 standard error 287-9 stepwise variable selection 290-91 versus Cox's regression model 322-4 what to watch for 301-302 Logit model 435-7 Log-linear analysis 9, 410-42 both explanatory and response variables 432 comparison with logistic 437 computer programs 437-9 conditional independence model 423 degrees offreedom 419, 440
452
Index
Log-linear analysis contd exploratory model construction 425-30 fit of model 430-31 hierarchical models 411, 418, 424 homogenous association model 423-4 likelihood ratio chi-square 418, 421,434 logit model 435-7 marginal association test 428 multiway frequency tables 411-14, 421-37 mutual independence model 422 notation 415 odds ratio 421 one variable jointly independent model 423 partial association test 428 Pearson chi-square 417-18, 421, 433 sample size 432-4 sampling 415, 431-2 saturated model 417, 424 standardized deviates 431 stepwise selection 429-30 structural zeros 433 tests ofhypotheses 415, 416, 421, 427-8,431-2 what to watch for 439-40 Lung cancer code book 310 Lung cancer survival data 448 Lung function code book 444 Lung function data 446 Lung function definition 8, 64 Mahalanobis distance 253-4, 256, 263-4, 390 Median 74 MERGE 35-6 Missing at random 198 Missing completely at random 198 Missing values 36-8, 197-202 imputation 199 maximum likelihood substitution 200 mean substitution 199 Mode 73 Multicollinearity 149, 212-19, 331, 345-7 Multiple regression 6, 124-224 additive model 137, 146
adjusted multiple correlation 171, 177, 181-3 AIC 172-3, 185 analysis of variance 137.:....8 augmented partial residual plots 190 backward elimination 178-9 Bonferroni inequality 139-40 coefficient of determination 139 comparing regression planes 150-53 computer programs 154-7, 185-7 conditional distribution 134 confidence intervals 130 Cook's distance 145 correlation matrix 132-3, 140-41 covariance matrix 132-3 CP criterion 172, 184 descriptive 142, 167 dummy variables 202-209 Durbin-Watson statistic 145 fixed X case 128-30, 137-43 forcing variables 181 forward selection 175-8 general F test 153, 173-5 hat matrix 144 hyperellipsoid 132 indicator variables 202-209 interactions 146-7, 207-208 least squares method 129 leverage 144 linear constraints.on parameters 209-12 maximum F-to-remove 178-9, 180 minimum F-to-enter 176-7, 180 missing values 197-202 multicollinearity 149, 212-19, 345-7 multiple correlation 134-5, 142-3, 171 normal probability plots 145 normality assumption 130, 132, 134, 145 outliers 144 partial correlation 135-7, 143 partial regression coefficient I 28-9 partial regression plots 189-90 partial residual plots I 90 polynomial regression I 46-7 prediction intervals I 30 predictive I 67; 17 3 principal components 345-7
Index reference group 203-205 regression plane 126-7 regression through origin 148 residual inean square 129, 172 residual sums of squares 139, 197 residuals 144-5, 148 ridge regression 214-19 ridge trace 217 RSQUARE 139 segmented curve regression 210-12 serial correlation 145 spline regression 210-12 stagewise regression 19<J-91 standard error of estimate 129 standardized coefficients 141-2, 213 stepwise selection 175-81 stopping rule 176-7 subset regression 181-4 testing regression planes 150-51 tests of f3 = 0 139 tolerance 149 transformations 145--6 variable selection 166-93 variable- X case 13<J-37, 140-43 variance inflation factor 149 weighted regression 148 what to watch for 157-60, 191-2 Multivariate analysis, definition 3 Multiway frequency tables 411, 412-14,421-37 N number of cases, sampling units 13 Nominal variables 14-15, 72-4 Normal distribution 54-5, 131 Normal probability plots ·57-8, 108 Normality 57-8, 62-3, 108 normal probability plots 57-8, 108 tests 62-3 Odds 284 Odds ratio 286-7, 421 Ordinal variables 15, 74 Outliers 38-9, 104-I08, 144-5, 331, 347, 376, 389 Outliers in X I 06--107, I 44 Outliers in Y I05-106, 144 P number of variables
13 Package programs 23-8 Paired data 115 Partial regression plots 189 Partial residual plots 190
453
Percentiles 74 Poisson distribution 59 Polynomial regression 146-7 Posterior probabilities 258-9 Power transformations 53, 59--62 Prediction intervals 93, 130 Principal components 8, 330-53 characteristic roots 335 coefficients 333-5 computer programs 348-50 correlation 338-9 cumulative percentage variance 341-3 definition 333 eigenvalues 335-6, 341-3 eigenvector 336 ellipse of concentration 335 interpretation of coefficients 334-5, 344 latent root 335 multicollinearity 345-7 normality 331 number of components 337, 341-4 outliers 331, 347 reduction of dimensionality 337 regression analysis 345-7 scree plots 337 shape component 345 size component. 345 standardized x 338-40 tests of hypotheses 345 variance of components 334 what to watch for 350-4 Quartile deviation 74 Range 74 Ratio variables 16; 75 Reflected variables 41 Regression 85-224 multiple linear 6, 124-224 polynomial 146-7 principal component 345-7 ridge 214-19 segmented-curve 2I 0--12 simple linear 6, 85-I 23 spline 210-12 · stagewise I 90-91 Replacing missing data 199-201 Residuals 91, 104-105, 108, 144-5, 148 Ridge regression 214-19
454
Index
ROC curves 295-6 Rotated factors 365-71 Saving data 40 Scale parameters 314 Segmented-curve regression 210-12 Selecting appropriate analysis 71-9 Serial correlation 108, 145 Shape parameter 314 Shapiro-Wilk test 62 Simple linear regression 6, 85-123 adjusted Y 115 analysis of variance 101 assumptions 88-9, 108-109 calibration 113-14 computer programs 115-16 confidence intervals 92 Cook's distance 107 correlation 94, 97-9 covariance 93 descriptive 86 Durbin-Watson statistic 108 ellipse of concentration 96-8 :fixed-X case 86, 88-93, 94-6 forecasting 114 h statistic 105-106 influence of observation 107 intercept 89-90 leverage 105-106 linearity 109 outliers 105-108 prediction intervals 93 predictive 86 regression line 88 regression through origin 111-12 residual analysis 102-106 residual degrees of freedom 91 residual mean square 91-2 residuals 91, 103-104, 115 robustness 108-109 serial correlation 108 slope 89 standard error 92, 95 standardized coefficients 100-101 standardized residuals 105 studentized residuals 105 tests of hypotheses 94-5, 101 transformations 109-11 variable-X case 86, 93-4, 96-IOO weighted least squares 112 what to watch for I 17-18
Skewness 63 Spline regression 210-12 Square root transformation 53, 59 Stagewise regression 190-91 Standardized regression coefficients 100--101, 141, 213 Standardized residuals 105 Stepwise selection 175-81, 274-5, 290--91,429-30 Stevens's classification system 13-14, 72-8 selecting analysis 7~-8 Studentized residuals 105 Subset regression 181-4 Suggested analyses 76-8 Survival ana1ysis 7-8, 306-29 accelerated life model 317-19 computer programs 324-6 Cox's proportional hazards model 319-20 Cox versus logistic regression 322-4 Cox versus log-linear regression 320--22 cumulative death distribution function 311-13 death density function 311-12 exponential distribution 314-16 hazard function 312, 314-15 log-linear regression model 317-19 log~linear versus Cox's model 320--22 proportional hazards model 319 scale parameter 314 shape parameter 314 survival function 311-14 T-year survival rate 308-309 Weibull distribution 314-17 what to watch for 326-7 Testing regression planes 138, 150-53 Tolerance 149 Transformations 39-40, 48--64, 109-11, 145-6 assessing need 63-4 common transformations 48-53, 109-11 I data manipUlajtion 39-40 graphical methods 54-62, 109-111
Index in multiple regression 145-6 in simple linear regression 109-111 logarithmic 49-53, 111 normal probability plots 55-8 power 53, 59-62
square root 50, 53 Variable definition 12 Var-iance inflation factor 149 Weighted least squares 112, 148
455