Categorical Variables in
Developmental Research
Categorical Variables in Developmental Research: Methods of Analysis
Edited by
Alexander von Eye Michigan State University East Lansing, Michigan
Clifford C. Clogg* Pennsylvania State University, University Park, Pennsylvania (*Deceased)
ACADEMIC PRESS
San Diego New York Boston
London
Sydney Tokyo Toronto
This book is printed on acid-free paper. Copyright © 1996 by ACADEMIC PRESS, INC. All Rights Reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Academic Press, Inc. A Division of Harcourt Brace & Company, 525 B Street, Suite 1900, San Diego, California 92101-4495
United Kingdom Edition published by Academic Press Limited, 24-28 Oval Road, London NW1 7DX
Library of Congress Cataloging-in-Publication Data
Categorical variables in developmental research : methods of analysis / edited by Alexander von Eye, Clifford C. Clogg
p. cm. Includes bibliographical references and index.
ISBN 0-12-724965-6
1. Psychometrics. 2. Psychology--Statistical methods. 3. Categories (Mathematics) I. Eye, Alexander von. II. Clogg, Clifford C.
BF39.C294 1995 155'.072--dc20 95-21950
PRINTED IN THE UNITED STATES OF AMERICA
95 96 97 98 99 00 QW 9 8 7 6 5 4 3 2 1
Contents
Contributors xi
Preface xiii
Acknowledgments xvii
In Memoriam xix

PART 1  Measurement and Repeated Observations of Categorical Data

1. Measurement Criteria for Choosing among Models with Graded Responses
   David Andrich
   1. Introduction 3
   2. Measurement Criteria for a Model for Graded Responses 4
   3. Models for Graded Responses 9
   4. Examples 28
   5. Summary and Discussion 32
   References 34

2. Growth Modeling with Binary Responses
   Bengt O. Muthén
   1. Introduction 37
   2. Conventional Modeling and Estimation with Binary Longitudinal Data 39
   3. More General Binary Growth Modeling 42
   4. Analyses 46
   5. Conclusions 52
   References 52

3. Probit Models for the Analysis of Limited Dependent Panel Data
   Gerhard Arminger
   1. Introduction 55
   2. Model Specification 56
   3. Estimation Method 61
   4. Analysis of Production Output from German Business Test Data 68
   5. Conclusion 72
   References 73

PART 2  Catastrophe Theory

4. Catastrophe Analysis of Discontinuous Development
   Han L. J. van der Maas and Peter C. M. Molenaar
   1. Introduction 77
   2. Catastrophe Theory 79
   3. Issues in Conservation 80
   4. The Cusp Model 84
   5. Empirical Studies 89
   6. Discussion 101
   References 104

5. Catastrophe Theory of Stage Transitions in Metrical and Discrete Stochastic Systems
   Peter C. M. Molenaar and Pascal Hartelman
   1. Introduction 107
   2. Elementary Catastrophe Theory 111
   3. Catastrophe Theory for Metrical Stochastic Systems 115
   4. Catastrophe Theory for Discrete Stochastic Systems 125
   5. General Discussion and Conclusion 128
   References 129

PART 3  Latent Class and Log-Linear Models

6. Some Practical Issues Related to the Estimation of Latent Class and Latent Transition Parameters
   Linda M. Collins, Penny L. Fidler, and Stuart E. Wugalter
   1. Introduction 133
   2. Methods 137
   3. Discussion 144
   References 146

7. Contingency Tables and Between-Subject Variability
   Thomas D. Wickens
   1. Introduction 147
   2. Association Variability 148
   3. The Simulation Procedure 150
   4. Tests Based on Multinomial Variability 152
   5. Tests Based on Between-Subject Variability 156
   6. Procedures with Two Types of Variability 161
   7. Discussion 163
   References 167

8. Assessing Reliability of Categorical Measurements Using Latent Class Models
   Clifford C. Clogg and Wendy D. Manning
   1. Introduction 169
   2. The Latent Class Model: A Nonparametric Method of Assessing Reliability 171
   3. Reliability of Dichotomous Measurements in a Prototypical Case 174
   4. Assessment of Reliability by Group or by Time 178
   5. Conclusion 181
   References 182

9. Partitioning Chi-Square: Something Old, Something New, Something Borrowed, but Nothing BLUE (Just ML)
   David Rindskopf
   1. Introduction 183
   2. Partitioning Independence Models 184
   3. Analyzing Change and Stability 190
   4. How to Partition Chi-Square 195
   5. Discussion 199
   References 201

10. Nonstandard Log-Linear Models for Measuring Change in Categorical Variables
   Alexander von Eye and Christiane Spiel
   1. Introduction 203
   2. Bowker's Test 204
   3. Log-Linear Models for Axial Symmetry 205
   4. Axial Symmetry in Terms of a Nonstandard Log-Linear Model 206
   5. Group Comparisons 209
   6. Quasi-Symmetry 210
   7. Discussion 213
   References 214

11. Application of the Multigraph Representation of Hierarchical Log-Linear Models
   H. J. Khamis
   1. Introduction 215
   2. Notation and Review 216
   3. The Generator Multigraph 217
   4. Maximum Likelihood Estimation and Fundamental Conditional Independencies 219
   5. Examples 222
   6. Summary 228
   References 229

PART 4  Applications

12. Correlation and Categorization under a Matching Hypothesis
   Michael J. Rovine and Alexander von Eye
   1. Introduction 233
   2. An Interesting Plot 234
   3. The Binomial Effect Size Display 236
   4. An Organizing Principle for Interval-Level Variables 237
   5. Definition of the Matching Hypothesis 237
   6. A Data Example 238
   7. Correlation as a Count of Matches 241
   8. Correlation as a Count of How Many Fall within a Set Range 243
   9. Data Simulation 244
   10. Building Uncertainties from Rounding Error into the Interpretation of a Correlation 246
   11. Discussion 247
   References 248

13. Residualized Categorical Phenotypes and Behavioral Genetic Modeling
   Scott L. Hershberger
   1. The Problem 249
   2. Weighted Least-Squares Estimation 250
   3. Proportional Effects Genotype-Environment Correlation Model 253
   4. Method 256
   5. Results 258
   6. Conclusions 271
   References 273

Index 275
Contributors
Numbers in parentheses indicate the pages on which the authors' contributions begin.
David Andrich (3) School of Education, Murdoch University, Murdoch,
Western Australia 6150, Australia
Gerhard Arminger (55) Department of Economy, Bergische Universität, Wuppertal, Germany
Clifford C. Clogg (169) Departments of Sociology and Statistics and Population Research Institute, Pennsylvania State University, University Park, Pennsylvania 16802
Linda M. Collins (133) The Methodology Center and Department of Human Development and Family Studies, Pennsylvania State University, University Park, Pennsylvania 16802
Penny L. Fidler (133) J. P. Guilford Laboratory of Quantitative Psychology, University of Southern California, Los Angeles, California 90089
Pascal Hartelman (107) Faculty of Psychology, University of Amsterdam, The Netherlands
Scott L. Hershberger (249) Department of Psychology, University of Kansas, Lawrence, Kansas 66045
H. J. Khamis (215) Statistical Consulting Center and Department of Community Health, School of Medicine, Wright State University, Dayton, Ohio 45435
Wendy D. Manning (169) Department of Sociology, Bowling Green State University, Bowling Green, Ohio 43403
Peter C. M. Molenaar (77, 107) Faculty of Psychology, University of Amsterdam, The Netherlands
Bengt O. Muthén (37) Graduate School of Education, University of California, Los Angeles, Los Angeles, California 90095
David Rindskopf (183) Educational Psychology, City University of New York Graduate School, Chestnut Ridge, New York 10977
Michael J. Rovine (233) Human Development and Family Studies, Pennsylvania State University, University Park, Pennsylvania 16802
Christiane Spiel (203) Department of Psychology, University of Vienna, Vienna, Austria
Han L. J. van der Maas (77) Department of Developmental Psychology, University of Amsterdam, The Netherlands
Alexander von Eye (203, 233) Department of Psychology, Michigan State University, East Lansing, Michigan 48824
Thomas D. Wickens (147) Department of Psychology, University of California, Los Angeles, Los Angeles, California 90095
Stuart E. Wugalter (133) J. P. Guilford Laboratory of Quantitative Psychology, University of Southern California, Los Angeles, California 90089
Preface
Categorical variables come in many forms. Examples include classes that are naturally categorical such as gender or species; groups that have been formed by definition, such as classes of belief systems or classifications into users versus abusers of alcohol; groups that result from analysis of data, such as latent classes; classes that reflect grading of, for instance, performance or age; or categories that were formed for the purpose of discriminating as, for instance, nosological units. Developmental researchers have been reluctant to include categorical variables in their studies. The main reason for this reluctance is that it often is thought difficult to include categorical variables in plans for data analysis. Methods and computer programs for continuous variables seem to be more readily accessible. This volume presents methods for analysis of categorical data in developmental research. Thus, it fills the void perceived by many, by providing
information, in a very understandable fashion, about powerful methods for analysis of categorical data. These methods go beyond the more elementary tabulation and chi-square methods still in widespread use. This volume covers a broad range of methods, concepts, and approaches. It is subdivided into the following four sections:
1. Measurement and Repeated Observations of Categorical Data
2. Catastrophe Theory
3. Latent Class and Log-Linear Models
4. Applications
The first section, Measurement and Repeated Observations of Categorical Data, contains three chapters. The first of these, by David Andrich, discusses measurement criteria for choosing among models with graded responses. Specifically, this chapter is concerned with the important issues of criteria for measurement and choice of a model that satisfies these criteria. Proper selection of a measurement model ensures that measurement characteristics can be fully exploited for data analysis. The second chapter in the first section is by Bengt Muthén. Titled Growth Modeling with Binary Responses, it describes methods for analysis of individual differences in growth or decline. It presents modeling with random intercepts and slopes when responses are binary. The importance of this paper lies in its demonstration of how to allow for restrictions on thresholds across time and differences across time in the variances of the underlying variables. The third chapter in this section, by Gerhard Arminger, is on Probit Models for the Analysis of Limited-Dependent Panel Data. It presents methods for analysis of equidistant panel data. Such data often involve metric and nonmetric variables that may be censored metric, dichotomously ordered, and unordered categorical. The chapter presents an extension of Heckman's (1981) models. The new methods allow researchers to analyze censored metric and ordered categorical variables and any blend of such variables using general threshold models. The second section of this volume contains two chapters on Catastrophe Theory. The first of these chapters, authored by Han van der Maas and Peter Molenaar, is Catastrophe Analysis of Discontinuous Development. This chapter presents results from application of catastrophe theory to the study of infants' acquisition of the skill of conservation. This skill, well-known since Piaget's investigations, is a very important step in a child's cognitive development. However, proper methodological analysis has been elusive. This paper presents methods and results at both the individual and the aggregate level. The second chapter in this section, by Peter Molenaar and Pascal Hartelman, addresses issues of Catastrophe Theory of Stage Transitions in Metrical and Discrete Stochastic Systems. This paper is conceptual in nature. It first gives an outline of elementary catastrophe theory and catastrophe theory for stochastic metrical systems. The chapter presents a principled approach to dealing with
problems from Cobb's approach and introduces an approach for dealing with catastrophe theory of discrete stochastic systems. The importance of this chapter lies in the new course it charts for analysis of both continuous and categorical developmental change processes using catastrophe theory. The third section covers Latent Class and Log-Linear Models. Because of the widespread use of these models, they were allotted more space in this volume. The section contains six chapters. The first chapter, by Linda Collins, Penny Fidler, and Stuart Wugalter, addresses Some Practical Issues Related to Estimation of Latent Class and Latent Transition Parameters. Both of the issues addressed are of major practical relevance. The first concerns estimability of parameters for latent class and latent transition models when sample sizes are small. The second issue concerns the calculation of standard errors. The paper presents computer simulations that yielded very encouraging results. The second chapter in this section, by Thomas Wickens, covers issues of Contingency Tables and Between-Subject Variability. Specifically, this chapter addresses problems that arise when between-subject variability prevents researchers from aggregating over subjects. Like the chapter before, this chapter presents simulation studies that show that various statistical tests differ in Type-I error rates and power when there is between-subject variability. The importance of this contribution lies in the presentation of recommendations for analysis of categorical data with between-subject variability. The third chapter in this section, by Clifford Clogg and Wendy Manning, is Assessing Reliability of Categorical Measurements Using Latent Class Models. This paper addresses the important topic of reliability assessment in categorical variables. The authors propose using the framework of the latent class model for assessing reliability of categorical variables. The model can be applied without assuming sufficiency of pairwise correlations and without assuming a special form for the underlying latent variable. The importance of this paper lies in that it provides a nonparametric approach to reliability assessment. The fourth chapter, by David Rindskopf, is Partitioning Chi-Square: Something Old, Something New, Something Borrowed, but Nothing BLUE (Just ML). Partitioning chi-square is performed to identify reasons for departure from independence models. Rindskopf introduces methods for partitioning the likelihood chi-square. These methods avoid problems with the Lancaster methods of partitioning the Pearson chi-square. They are exact without needing complex adjustment formulas. Change-related hypotheses can be addressed using Rindskopf's methods. The fifth chapter, by Alexander von Eye and Christiane Spiel, is Extending the Bowker Test for Symmetry Using Nonstandard Log-Linear Models for Measuring Change in Categorical Variables. The chapter presents three ways to formulate tests of axial symmetry in square cross-classifications: the Bowker test, standard log-linear models, and nonstandard log-linear models. The advantage
of the latter is that one can devise designs that allow one to simultaneously test symmetry and other developmental hypotheses. The sixth chapter in this section, by Harry Khamis, introduces readers to the Application of the Multigraph Representation of Hierarchical Log-Linear Models. The chapter focuses on hierarchical log-linear models. It introduces the generator multigraph and shows, using such graph-theoretic concepts as maximum spanning trees and edge cutsets, that the generator multigraph provides a useful tool for representing and interpreting hierarchical log-linear models. The fourth section of this volume contains application-oriented chapters. The first of these, written by Michael Rovine and Alexander von Eye, reinterprets Correlation and Categorization under a Matching Hypothesis. The authors show the relationship between the magnitude of a correlation coefficient and the number of times cases fall into certain segments of the range of values. The relationship of the correlation coefficient with the binomial effect size display is shown. The second chapter of this section, contributed by Scott Hershberger, discusses methods for Residualized Categorical Phenotypes and Behavioral Genetic Modeling. The paper presents methods for behavior genetic modeling of dichotomous variables that describe dichotomous phenotypes.
Alexander von Eye Clifford C. Clogg
Acknowledgments
We are indebted to many people who supported us during the production phase of this book. First there is the Dean of the Pennsylvania State College of Health and Human Development, Gerald McClearn. We thank him for his generous support. We thank Tina M. Meyers for help and support, particularly in phases of moves and transition. We thank a number of outside reviewers for investing time and effort to improve the quality of the chapters. Among them are Holger Weßels of the NIH, Ralph Levine of MSU, and Phil Wood of the University of Missouri, who all read and commented on several chapters. We thank the authors of this volume for submitting excellent chapters and for responding to our requests for changes in a very professional fashion. We thank Academic Press, and especially Nikki Fine, for their interest in this project and for smooth and flexible collaboration during all phases of production. Most of all we thank our families, the ones we come from and the ones we live in. They make it possible for us to be what we are.
In Memoriam
Clifford C. Clogg died on May 7, 1995. In Cliff we all lose a caring friend, a dedicated family man, a generous colleague, and a sociologist and statistician of the highest caliber. Cliff was a person with convictions, heart, and an incredibly sharp mind; he was a person to admire and to like. We needed him then and we still do now. The manuscripts for this book were submitted to the publisher two weeks before this tragedy. This includes the preface, which I left unchanged. Alexander von Eye
PART 1
Measurement and Repeated Observations of Categorical Data
Measurement Criteria for Choosing among Models with Graded Responses
David Andrich
Murdoch University Western Australia
1. INTRODUCTION It is generally accepted that measurement has been central to the advancement of empirical physical science. The prototype of measurement is the use of an instrument to map the amount of a property on a real line divided into equal intervals by thresholds sufficiently fine that their own width can be ignored, and in its elementary form this is understood readily by young school children. However, the function of measurement in science goes much deeper: it is central to the simultaneous definition of variables and the formulation of quantitative scientific theories and laws (Kuhn, 1961). Furthermore, when proper measurement has taken place, these laws take on a simple multiplicative structure (Ramsay, 1975). Although central to formulating physical laws, it is understood that measurement inevitably contains error. In expressions of deterministic theories, these errors are considered sufficiently small that they are ignored. In practice, the mean of independent repeated measurements, the variance of which is inversely
proportional to the number of measurements, can be taken to increase precision to a point where errors can indeed be ignored. It is also understood that instruments have operating ranges; in principle, however, the measurement of an entity is not a function of the operating range of any one instrument, but of the size of the entity. Graded responses of one kind or another are used in social and other sciences when no measuring instrument is available, and these kinds of graded responses mirror the prototype of measurement in important ways. First, the property is envisaged to be continuous, such as an ability to perform in some domain, or an intensity of attitude, or the degree of a disease; second, the continuum is partitioned into ordered adjacent (contiguous) intervals, usually termed categories, that correspond to the units of length on the continuum. In elementary treatments of graded responses, the prototype of measurement is followed closely in that the successive categories are simply assigned successive integers, and these are then treated as measurements. In advanced treatments, a model with a random component is formalized for the response and classification processes, the sizes of the intervals are not presumed equal, and the number of categories is finite. This chapter is concerned with criteria, and the choice of a model that satisfies these criteria, so that the full force of measurement can be exploited with graded responses of this kind. The criteria are not applied to ordered variables where instruments for measurement already exist, such as age, height, income expressed in a given currency, and the like, which in a well-defined sense already meet the criteria. Although one new mathematical result is presented, this chapter is also not about statistical matters such as estimation, fit, and the like, which are already well established in the literature. Instead, it is about looking at a relatively familiar statistical situation from a relatively nonstandard perspective of measurement in science. Whether a variable is defined through levels of graded responses that characterize more or less of a property, or whether it is defined through the special case of measurement in which the accumulation of successive amounts of the property can be characterized in equal units, central to its definition is an understanding of what constitutes more or less of the property and what brings about changes in the property. It is in expressing this relationship between the definition and changes in terms of a model for measurement that generalizes to graded responses, and in articulating a perspective of empirical enquiry that backs up this expression, that this chapter contributes to the theme of the analysis of categorical variables in developmental research.
2. MEASUREMENT CRITERIA FOR A MODEL FOR GRADED RESPONSES

In this section, three features of the relationship between measurement and theory are developed as criteria for models for measurement: first, the dominant
direction of the relationship between theory and measurement; second, the structure of the models that might be expected to apply when measurements have been used; and third, the invariance of the measurement under different partitions of the continuum. These are termed measurement criteria. It is stressed that in this argument, the criteria established are independent of and a priori to any data to which they might apply. In effect, the model chosen is a formal rendition of the criteria; it expresses in mathematical terms the requirements to which the graded responses must conform if they are to be like measurements, and therefore the model itself must exhibit these criteria. Thus the model is not a description of any set of data, although it is expected that data sets composed of graded responses can be made to conform to the model, and that even some existing ones may do so. Moreover, the criteria are not assumptions about any set of data that might be analyzed by the model. If data collected in the form of graded responses do not accord with the model, then they do not meet the criteria embedded in the model, but this will not be evidence against the criteria or the model. Thus it is argued that graded responses, just like measurements, should subscribe to certain properties that can be expressed in mathematical terms, and also that the data should conform to the chosen model and not the other way around; that is, the model should not be chosen to summarize the data. This position may seem nonstandard, and because it is an aspect of a different perspective, it has been declared at the outset. It is not, however, novel, having been presented in one form or another by Thurstone (1928), Guttman (1950), and Rasch (1960/1980), and reinforced by Duncan (1984) and Wright (1984).
2.1. Theory Precedes Measurement Thomas Kuhn is well known for his theory of scientific revolutions (Kuhn, 1970). In this chapter, I will invoke a part of his case, apparently much less known than the revolutionary theory itself, concerning the function of measurement in science (Kuhn, 1961) in which he stands the relationship between measurement and theory as traditionally perceived on its head: In text books, the numbers that result from measurements usually appear as the archetypes of the "irreducible and stubborn facts" to which the scientist must, by struggle, make his theories conform. But scientific practice, as seen through the journal literature, the scientist often seems rather to be struggling with the facts, trying to force them to conformity with a theory he does not doubt. Quantitative facts cease to seem simply "the given." They must be fought for and with, and in this fight the theory with which they are to be compared proves the most potent weapon. Often scientists cannot get numbers that compare well with theory until they know what numbers they should be making nature yield. (Kuhn, 1961, p. 171)
Kuhn (1961) elaborates the . . . "paper's most persistent thesis: The road from scientific law to scientific measurement can rarely be traveled in the reverse direction" (p. 219, emphasis in original). If this road can be seldom traveled in the physical sciences, then it is unlikely to be traveled in the social ' sciences. Yet, I suggest that social scientists attempt to travel this route most of the time by modeling available data, that is, by trying to find models that will account for the data as they appear. In relentlessly searching for statistical models that will account for the data as given, and finding them, the social scientist will eschew one of the main functions of measurement, the identification of anomalies: To the extent that measurement and quantitative technique play an especially significant role in scientific discovery, they do so precisely because, by displaying serious anomaly, they tell scientists when and where to look for a new qualitative phenomenon. To the nature of that phenomenon, they usually provide no clues. (Kuhn, 1961, p. 180) And this is because When measurement departs from theory, it is likely to yield mere numbers, and their very neutrality make them particularly sterile as a source of remedial suggestions. But numbers register the departure from theory with an authority and finesse that no qualitative technique can duplicate, and that departure often is enough to start a search. (Kuhn, 1961, p. 180) Although relevant in general, these remarks are specifically relevant to the role of measurement and therefore to the role that graded responses can have. Measurement formalizes quantitatively and efficiently a theoretical concept that can be summarized as a variable in terms of degree, similar in kind but greater or lesser in intensity, in terms of more or less, greater or smaller, stronger, or weaker, better or worse, and so on, that is to be studied empirically. If it is granted that the variable is an expression of a theory, that is, that it is an expression of what constitutes more or less, greater or smaller, and so on, according to the theory, then when studied empirically, it becomes important to invoke another principle of scientific enquiry, that of falsifiability (Popper, 1961). Although Popper and Kuhn have disagreed on significant aspects of the philosophy of science, the idea that any attempt at measurement arises from theory, and that measurement may provide, sooner or later, evidence against the theory, is common in both philosophies. The implication of this principle to the case of graded responses is that even the operational ordering of the categories, which defines the meaning of more or less of the variable, should be treated as a hypothesis. Clearly, when a researcher decides on the operational definition of the ordered categories, there is a strong conviction about their ordering, and any chosen model should reflect this. However, if the categories are to be treated as a hypothesis, then the model used to characterize the response process must itself
also have the ordering as a hypothesis. Thus, the model should not provide an ordering irrespective of the data but should permit evidence to the contrary to arise: If the data refute the ordering, then an anomaly that must be explained is revealed. The explanation might involve no more than identifying a coding error, although most likely it will clarify the theory: however, whatever its source, the anomaly must be explained.
2.2. Fundamental Measurement and Laws in Physical Science Because of the strong analogy to be made between measurement in the physical sciences and the construction and operation of graded responses, any model chosen should be consistent with how measurements operate in the physical sciences. As already noted, graded responses differ from traditional measurements in that (a) they are finite in number, (b) the categories (units) are not equal in size, and (c) assignment of an entity to a category contains an uncertainty that must be formalized. Therefore, the model should specialize to the case in which the number of categories from an origin is, in principle, not finite, in which the sizes of the categories are equal (a constant unit), and in which measurement precision is increased to the point where the random component can be ignored. In general terms, the model should specialize to fundamental measurement. Fundamental measurement is an abstraction from the idea of concatenation of entities in which the successive addition of measurements corresponds to the concatenation (Wright, 1985, 1988). Measurements of mass exemplify fundamental measurement, in that if two entities are amalgamated, then the amalgamated entity behaves with respect to acceleration as a single body with a mass the sum of the masses of the original entities. It seems that because successive concatenations of a unit can be expressed as a multiplication of the unit, measurement leads to laws that have a multiplicative structure among the variables: "Throughout the gigantic range of physical knowledge, numerical laws assume a remarkably simple form provided fundamental measurement has taken place" (Ramsay, 1975, p. 262) and "Virtually all the laws of physics can be expressed numerically as multiplications or divisions of measurements" (p. 258). Thus it might be expected that any model that is to form a criterion for measurement would have this multiplicative structure, although this alone would not be sufficient to meet the criterion: it would still have to be demonstrated that when specialized to the case of physical measurement, the model leads to increased precision as the units are made smaller, and that the model reflects the process of concatenation.
2.3. Invariance of Location across an Arbitrary Partitioning of the Continuum Because an entity can be measured with different units, it is expected that the location of any entity on the continuum should be invariant when measured in
different units, although with smaller units greater precision is expected. This difference in units corresponds to different partitions of the continuum. Therefore, this invariance of location across different partitions of the continuum should be met by graded responses (of which measurement is a special case).
2.4. Summary of Measurement Criteria

In summary, the measurement criteria for a model for graded responses are
1. it should contain the ordering of the categories as a falsifiable hypothesis,
2. it should specialize to fundamental measurement, and
3. the location of an entity should be invariant under different partitions of the continuum.
It should not come as a surprise that the preceding criteria are in some sense related, but they have been listed separately to focus on the salient features of measurement and its function. A model that satisfies these criteria provides a special opportunity for learning about variables in the form of graded responses. In the next section, these criteria are taken in reverse order, and the model that satisfies the first criterion by definition is presented first and then is shown to satisfy the other two criteria as well.
2.5. Statistical Criteria

Partly because contrasts are helpful in any exposition, and partly because there is historically an alternate model for graded responses, this alternate model is outlined here. This model, too, has been presented as satisfying certain principles of scientific enquiry (McCullagh, 1980, 1985) for graded responses. These criteria, in contrast to what have been referred to as measurement criteria, will be referred to as statistical criteria:
(i) If, as is usually the case, the response scale contains useful and scientifically important information such as order or factorial structure, the statistical model should take this information into account.
(ii) If the response categories are on an arbitrary but ordinal scale, it is nearly always appropriate to consider models that are invariant under the grouping of adjacent categories (McCullagh, 1980).
(iii) If sufficient care is taken in the design and execution of the experiment or survey, standard probability distributions such as the Poisson or multinomial may be applicable, but usually the data are overdispersed . . . (McCullagh, 1985, p. 39).
These statistical criteria seem compatible with the measurement criteria. However, it will be seen that the models chosen to meet each set of criteria are in
fact incompatible and serve different situations. It is in part to help understand the basis of the choices and the change in perspective entailed in choosing the former model ahead of the latter for graded responses that the measurement criteria have been set in a wider context of the relationship between data, theory, and models. The distinction made here between measurement and statistical criteria is compatible with the same kind of distinction made in Duncan and Stenbeck (1988).
3. MODELS FOR GRADED RESPONSES The model that satisfies the criteria of invariance with graded responses arises from the Rasch (1961) class of models, and its properties have been elaborated by Andersen (1977) and Andrich (1978, 1985). The historically prior model is based on the work of Thurstone (Edwards & Thurstone, 1952) and has been further elaborated by Samejima (1969), Bock (1975), and McCullagh (1980, 1985).
3.1. Rasch Cumulative Threshold Model Rasch (1961) derived a class of models for measurement explicitly from the requirements of invariant comparisons. Because a measurement is the comparison of a number of units to a standard unit, and therefore the comparison of two measurements is indirectly a comparison of these comparisons with a standard, it is perhaps not surprising that the relationship can be taken in reverse, so that if conditions are imposed on comparisons, they may lead to measurement. Rasch's parsimonious criteria are the following. A. "The comparison between two stimuli should be independent of which particular individuals were instrumental for the comparison; and it should also be independent of which other stimuli within the considered class were or might also have been compared." B. "Symmetrically, a comparison between two individuals should be independent of which particular stimuli within the class considered were instrumental for comparison; and it should also be independent of which other individuals were also compared, on the same or on some other occasion (Rasch, 1961, p. 322)." The probabilistic models that satisfy these conditions have sufficient statistics for the parameters of the individuals and for the stimuli, which will be referred to in general now as entities and instruments, respectively. The existence of sufficient statistics means that the outcome space can be partitioned so that within the subspaces the distribution of responses depends only on either the
10
David Andrich
parameters of the instrument or the entity, but not on both, thus providing the required invariance of comparison, either among instruments or among entities. The model for the probability Pr{xpi } of a response with respect to entity p and instrument i that satisfies these conditions of invariance can be expressed in different forms, one efficient form being Pr{xpi } = "rki + xpi[3p , ~1 exp 'Ypi k= 1
where ~pi = ~
exp - ~
x -- 0
-oO<~p
-rki + Xpi~p
-oc<'rxi<
oo, x =
1, m,
(1)
is a normalizing factor ensuring that the
k=l
probabilities sum to 1.0, J3p is the location of entity p, and -rxi, x = 1. . . . mi are m thresholds of instrument i which partition the continuum into m i + 1 categories, where "rxi > "rx(i_ 1), x = 2 . . . . m, and Xpi E {0,1,2 . . . . m} indicates the c o u n t of the number of thresholds exceeded: x therefore denotes the category of any response, just as in the prototype of measurement, and x = 0 denotes that no threshold has been exceeded. For completeness, although it does not exist, let "roi- 0. Because of the structure of the threshold parameters in Equation (1), the model is termed the c u m u l a t i v e t h r e s h o l d m o d e l (CTM), and because ~pi is always the sum of the numerators, it will not be redefined in subsequent expressions of the equation. Figure 1 shows its probability distribution. 3.1.1. Invariance under different partitions of the continuum The invariance property can be demonstrated readily, and has been (e.g., Andersen, 1977), but because a special point is to be emphasized, it will be demonstrated again briefly. Suppose there are two instruments i and j with m i and mj, m i 4= my, thresholds and m i + 1 and mj + 1 categories, respectively, that are supposed to elicit the same property in any entity p and the responses are supposed to conform to the model of Equation (1). Because the instruments have different numbers of thresholds, they must partition the continuum in two different ways. (They could of course partition the continuum in different ways even if they had the same number of thresholds.) Retaining the generality of different numbers of categories, Pr{(Xpi,Xpj) JrpiJ = Xpi + xpj} =
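As a concrete numerical illustration of Equation (1), the following sketch in Python with NumPy (the function name and the values of β_p and the thresholds are hypothetical, chosen only for illustration) computes the CTM category probabilities for one entity responding to one instrument.

```python
import numpy as np

def ctm_probabilities(beta, thresholds):
    """Category probabilities under the cumulative threshold model, Equation (1).

    beta       : location of the entity on the latent continuum (beta_p).
    thresholds : the ordered thresholds tau_1i, ..., tau_mi of the instrument.
    Returns an array of m + 1 probabilities for the categories x = 0, ..., m.
    """
    tau = np.asarray(thresholds, dtype=float)
    # Numerator for category x is exp(-(tau_1 + ... + tau_x) + x * beta); x = 0 gives exp(0) = 1.
    cum_tau = np.concatenate(([0.0], np.cumsum(tau)))
    x = np.arange(tau.size + 1)
    numerators = np.exp(-cum_tau + x * beta)
    return numerators / numerators.sum()          # division by gamma_pi

# Hypothetical entity and instrument: beta_p = 0.5 and three ordered thresholds.
probs = ctm_probabilities(0.5, [-1.0, 0.0, 1.2])
print(np.round(probs, 3), probs.sum())            # four probabilities over x = 0..3, summing to 1
```

Only the cumulative sums of the thresholds and the entity location enter the computation, which is why the thresholds act as cumulative quantities in the model's name.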
3.1.1. Invariance under different partitions of the continuum

The invariance property can be demonstrated readily, and has been (e.g., Andersen, 1977), but because a special point is to be emphasized, it will be demonstrated again briefly. Suppose there are two instruments i and j with m_i and m_j, m_i ≠ m_j, thresholds and m_i + 1 and m_j + 1 categories, respectively, that are supposed to elicit the same property in any entity p, and the responses are supposed to conform to the model of Equation (1). Because the instruments have different numbers of thresholds, they must partition the continuum in two different ways. (They could of course partition the continuum in different ways even if they had the same number of thresholds.) Retaining the generality of different numbers of categories,

\Pr\{(x_{pi}, x_{pj}) \mid r_{pij} = x_{pi} + x_{pj}\} = \frac{1}{\gamma_{r_{pij}}} \exp\Bigl(-\sum_{k=1}^{x_{pi}} \tau_{ki} - \sum_{k=1}^{x_{pj}} \tau_{kj}\Bigr),   (2)

where \gamma_{r_{pij}} = \sum_{x_{pi} = \max(0,\, r - m_j)}^{\min(r,\, m_i)} \exp\Bigl(-\sum_{k=1}^{x_{pi}} \tau_{ki} - \sum_{k=1}^{x_{pj}} \tau_{kj}\Bigr), the sum being over all pairs (x_pi, x_pj) with x_pi + x_pj = r.
FIGURE 1. Probability of response in each category as a function of the location of the entity and the thresholds according to the Rasch model.
Equation (2) does not involve the location parameter of the entity; therefore, when generalized, it permits the estimation of the threshold parameters of different instruments with different numbers of categories, and the thresholds so estimated reflect the different partitions of the continuum independently of the location parameters β_p of the entities p, p = 1, . . . , P. To illustrate Equation (2), suppose m_i = 3, m_j = 4 for instruments i and j, respectively. Then the outcome space for the response of entity p to each of these instruments is shown in the following matrix, in which each element r_pij is the sum of the outcomes; that is, r_pij = x_pi + x_pj.
r_pij                        x_pj
                  0     1     2     3     4
          0       0     1     2     3     4
  x_pi    1       1     2     3     4     5
          2       2     3     4     5     6
          3       3     4     5     6     7
For each total score r_pij, except 0 and m_i + m_j, there is more than one way in which the same total score can be obtained. Thus r_pij partitions the outcome space into subspaces. For example, r_pij = 2 can be obtained from {(2,0), (1,1), (0,2)}. The probability of each of these pairs of outcomes for entity p, taken to be independent, is given by
\Pr\{2,0\} = \frac{\exp(-\tau_{1i} - \tau_{2i} + 2\beta_p)}{\gamma_{pi}} \cdot \frac{1}{\gamma_{pj}},

\Pr\{1,1\} = \frac{\exp(-\tau_{1i} + \beta_p)}{\gamma_{pi}} \cdot \frac{\exp(-\tau_{1j} + \beta_p)}{\gamma_{pj}},

\Pr\{0,2\} = \frac{1}{\gamma_{pi}} \cdot \frac{\exp(-\tau_{1j} - \tau_{2j} + 2\beta_p)}{\gamma_{pj}}.

The probability of the sum r_pij = 2, therefore, is given by

\Pr\{r_{pij} = 2\} = \Pr\{(2,0), (1,1), (0,2)\} = \frac{\exp(-\tau_{1i} - \tau_{2i} + 2\beta_p) + \exp(-\tau_{1i} - \tau_{1j} + 2\beta_p) + \exp(-\tau_{1j} - \tau_{2j} + 2\beta_p)}{\gamma_{pi}\,\gamma_{pj}}.

Consider now the probability of any one of these outcomes, say (0,2), given that the total is r_pij = 2. This conditional probability is given by

\Pr\{(0,2) \mid x_{pi} + x_{pj} = 2\} = \frac{\Pr\{(0,2)\}}{\Pr\{(2,0), (1,1), (0,2)\}}
= \frac{\exp(-\tau_{1j} - \tau_{2j} + 2\beta_p)/(\gamma_{pi}\gamma_{pj})}{[\exp(-\tau_{1i} - \tau_{2i} + 2\beta_p) + \exp(-\tau_{1i} - \tau_{1j} + 2\beta_p) + \exp(-\tau_{1j} - \tau_{2j} + 2\beta_p)]/(\gamma_{pi}\gamma_{pj})}
= \frac{\exp(2\beta_p)\exp(-\tau_{1j} - \tau_{2j})}{\exp(2\beta_p)[\exp(-\tau_{1i} - \tau_{2i}) + \exp(-\tau_{1i} - \tau_{1j}) + \exp(-\tau_{1j} - \tau_{2j})]}
= \frac{\exp(-\tau_{1j} - \tau_{2j})}{\exp(-\tau_{1i} - \tau_{2i}) + \exp(-\tau_{1i} - \tau_{1j}) + \exp(-\tau_{1j} - \tau_{2j})},
which is independent of the location of the entity β_p and is the special case of Equation (2). The symmetrical case can also be readily demonstrated. Suppose two entities p and q are classified using instrument i with m_i thresholds. Then, let S(x_pi, x_qi) = {(x_pi, x_qi), x = 0, . . . , m_i} be the outcome space of all possible pairs of responses, which is partitioned into subspaces S*(x_pi, x_qi) = {(x_pi, x_qi), (x_qi, x_pi)} of pairs for x_pi ≠ x_qi, x_pi, x_qi ∈ {0, 1, 2, . . . , m_i}. Then,
"~pqi
exp{Xpi~p i + Xqi6qi},
(3)
where ~lpqi--" exp{xpi~p + Xqi~q} + exp{Xqi~3p + Xpi~q} is independent of the thresholds of instrument i, that is, independent of the partition of the continuum. For example, consider an instrument i with m i -- 3 and outcomes Xpi = 1, Xqi "-- 2 for entities p and q, respectively. The outcome space for the response of entities p and q is given by the following matrix.
1. Measurement Criteria for Choosing among Models with Graded Responses
13
q
S(xpi,Xqi) p
Then for X p i - 1, Pr{ (lpi,2qi)[S* ( lpi,2qi) }
0 1 2 3
0
1
2
(0,0) (1,0) (2,0) (3,0)
(0,1) (1,1) (2,1) (3,1)
(0,2) (0,3) (1,2) (1,3) (2,2)(2,3) (3,2) (3,3)
Xqi = 2,
3
S*(lpi,2qi ) - {(lpi,2qi),
(2pi, lqi)} ,
and
exp(-'rli + [3p)exp(-'r~i- T2i-+- 2~q)/'~pfypj [exp(-'rli + [3p)exp(-'rli- q'z/ -+- 213p) + e x p ( - ' r l i - T2i-+- [3p)eXp(-'rli + ~q)]/'~pi'~pj e x p ( - ' r l i - "rli- "rzi)exp(l[3 p + 213q) e x p ( - ' r i i - - r l i - 'rzi)[exp(l[3 p + 213q) + exp(2[3p + l[3q)] exp([3p + 213q) exp([3p + 213q) + exp(2[3p + [3q) which is a special case of Equation (3). When generalized, Equation (3) could be used to estimate the parameters of many entities, although usually, alternate methods are used to estimate these parameters. The point is that Equation (3) itself is independent of the parameters of the instruments and how they partition the continuum. It is important, however, to appreciate that this invariance is of the location of the entities and the thresholds of the instruments, and not of the distributions of the outcomes themselves. In fact, the distributions themselves are not invariant in that if the data conform to the model, and two adjacent categories are collapsed, then the data will no longer conform to the model (Andersen, 1977" Andrich, 1978, 1992" Jansen & Roskam, 1986" Rasch, 1966). This may seem surprising, and it is central to the contrast between the measurement criteria of this chapter and the statistical criteria of McCullagh (1985) for a model for graded responses. It will be considered again when the alternate model is constructed. Although the preceding analysis demonstrates that the model does have the relevant invariance properties, as indicated earlier, this model is unique in having these properties for discrete data, resting as it does on the existence of sufficient statistics (Andersen, 1977).
3.1.2. Parametric structure and specialization to equal units from a natural origin Equation (1) has the required parametric structure of fundamental measurement expressed in the additive or logarithmic form. To see how the
14
David Andrich
model specializes to the case of physical measurement, it is instructive to use the multiplicative form, characteristic of equations in physical laws. Let log ~p = [3p and log o~x = %, then Equation (1) becomes
Pr{xpi} =
1
~p'"
-xpi- ,
'~pi H
m~i > O,
~p > O,
(4)
COki
k=O
where now the parameters must be greater than 0. Thus in the model, and in the multiplicative metric, a value of 0 provides the natural origin, as "roi -~ 0, tOoi ~ 1. Suppose that the first threshold in instrument i has the value o.)i as before, and then that each of the successive thresholds is an additional distance coi from the previous threshold. Thus the first threshold can be conceived of as one unit from the origin, with 03i being the unit of measurement, just as in the prototype of measurement. Then, the thresholds take on the value Xpi
COoi =
1
,
O.)xi = X(Di,
X = 1
,.
9
and
m,
.,
~I COki --- Xpilt'~xpi . w
!
9
k=O
Next, suppose that the number of categories (units) is not finite, but that in principle, it may be any count of the number of units" this being determined by the amount of the property and not by the finite range of a particular instruxpi
ment. Replacing m by ~ and I-I tOki by Xpi!to~," in Equation (4) gives k=0
Pr{xpi } -
However, with ~pi
~r x ~ O
I
(GI~,)~, '' ~ ,
"Ypi
X!
x - 0.....
(~#(.Di) xpi
~.
(5)
= exp(~p/eOi), Equation (5) is the 9
9
Poisson distribution, giving, in the conventional mode of expression,
Pr{xpi} = e -;~'/~~ (G/~
x!
(6)
Thus, repeated measurements with error specialize to the Poisson distribution as a function of the location of the entity and the size of the unit of the instrument. This result in itself, novel and perhaps surprising, has many implications. Lest they detract from the point of the chapter, only two will be considered here, the precision of measurement as a function of the size of the unit and the addition of measurements in relation to concatenation of entities. These are central to the specialization of graded responses to the physical measure-
1. Measurement Criteria for Choos&g among Models with Graded Responses
15
ment and can be studied by using only the well-known result that the mean and variance of a Poisson distribution are given by its parameter: E[Xpi] -- V[Xpi] ~- ~p/(D i.
Suppose that, as in physical measurement, a unit toI = 1 of instrument I is chosen as a standard. In this unit, ~p/tOI = ~p, giving E[Xpl ] = ~p, and just as in physical measurement, any observed count of units exceeding the origin is taken immediately as an unbiased estimate of the location of the entity. o)I Now, suppose that the units are reduced to o~n = m in some instrument n, n
and that the count in these units is denoted by the random variable Xpn. Then in these units
E[Xp n]
_
o~1/n
__
?/~p.
Thus reducing the units to (1/n)th of an original unit gives n times the value in the new units. Let the values of the variable Xp, be denoted by ~tF, n(1) - when expressed in standard units of instrument I: then, if xp,, E {0,1,2,3,4 . . . . } in units .(I){ 01234 } of to,,, ~p,, E ........ in units of toI. (For example, if the original
nnnnn
units are in centimeters and the new ones are in millimeters, then measurements in millimeters when expressed in centimeters will be 1/lOth their millimeter values.) Therefore, expressed in terms of the original unit ~oI,
n
n
n
as required to show the invariance of location of entity p when measured with different instruments when the measurement is carried out in different units (but expressed in the same unit). In some sense this result is in fact rather trivial, but it is important because it highlights the need to distinguish between the effect of changing the unit in relation to the original unit in making the actual measurements and the effect of reexpressing the measurement in the new unit in terms of the original unit. In the former, the changed unit actually changes the parameter of the distribution, and therefore the probabilities, whereas in the latter, the parameter and the probabilities do not change, only the values of the measurements change into the original units. The same logic then can be used to establish the variance, and this is not trivial. Thus v[.~'21
=
v
~
n
:
-n~
v[x,j
=
n2
=-,
n
16
DavidAndrich
indicating that when the unit is (1/n)th of the original unit, the variance is also 1/nth of the original variance when expressed in the original metric and therefore increases the precision in the location of ~p. Although beyond the scope of this chapter, it can be readily shown that if the measurements are Poisson distributed according to Equation (6), then the distribution of the m e a n of n independent measurements in a particular unit is identical to the distribution of the measurements in a unit (1/n)th the size of that unit. Thus taking n measurements and finding the mean to estimate the location has an identical effect as taking one measurement in a unit (1/n)th the size of the original unit. This result itself is rather impressive, but here the only point emphasized is that it is consistent with what is expected in measurements, namely, that a measurement in smaller units should be more precise than a measurement in larger units. Figure 2 shows the distribution of measurements of Equation (5) for an entity with ~p = 1 in standard units, and when the units are halved successively up to units of to//8. The figure clearly shows that the distribution also tends to the normal, and this arises because as the value of its parameter increases, the Poisson distribution tends to the normal. This is all as expected in physical measurement: When expressed in the original metric, the expected value remains constant, while the precision becomes greater as the unit is reduced and the distribution of measurements is expected to become normal. If the unit is made
0.6
"
Pr{x} o.5 0 0.4
Unit =1
9 Unit = 1/2 9 Unit = 1/4 9 Unit = 1/8
( 0.3
0.2
0.1
0.0 0.0
0.5
1.0
= 1
1.5
2.0
2.5
3.0
3.5
Measurement x
FIGURE2. Distributions of measurements of entity of size 1 and units starting with 1.0 then halved for each new set of measurements.
1. Measurement Criteria for Choosing among Models with Graded Responses
17
sufficiently small relative to the size of the entity measured, then it might be small enough to be ignored in a deterministic theory. In addition to behaving as measurements with respect to their location, dispersion, and distribution in physical science, the model also incorporates the characteristic of concatenation. As is well known, the sum of two Poisson distributions is Poisson again, with the new parameter equal to the sum of parameters of the individual distributions. Suppose entities p and q with location values ~p and ~q, respectively, are concatenated and then measured with instrument i with unit coi. Then, from Equation (6) and some straightforward algebra, Pr{xp+q = Xp + Xq} = e
with E[xp+q]
_ ~,,+~q 0,, [(~p + ~q)/O~i]~,'+',/Xp+q!
+ eq =
~
, 1.oi
(7)
indicating that the value of the new concatenated entity is the sum of the values of the original entities and that an estimate of this value is the sum of the measurements of the original entities.
3.1.3. Falsification of the ordering of the categories Rasch (1961) originally specialized a m u l t i d i m e n s i o n a l version of a response model that satisfied criteria A and B and expressed it in the form equivalent to Pr{xpi} =
1
eXp{Kxi
@
+xi~p}'
(8)
"~pi
where the coefficient Kxi and scoring function +xi characterized each category x. Andersen (1977) showed that the model had sufficient statistics only if +(x+ 1)i - +xi = +xi - +(~-l)i and that categories x and x = 1 could be combined only if +(~+1)i = +~i. Andrich (1978) constructed the K'S and + ' s in terms of the prototype of measurement. The steps in the construction, reviewed here with an emphasis on falsification of threshold order, are 1. assume in the first instance independent dichotomous decisions at each threshold x, x = 1, m; 2. characterize the dichotomous decision by the model Pr{y~p i = 1 } = e x p O[xi(~ p -"rxi)/T]xpi , Pr{yxp i = 0 } = 1]Ylxpi, where Yxpi is a Bernoulli random variable in which Yxpi = 1 denotes threshold x is exceeded, Yxpi = 0 denotes threshold x is not exceeded, %i is the discrimination at threshold x, Tlxpi -= 1 + exp O[xi(~ p -- Txi); and 3. restrict and renormalize the original outcome space 1~ = { ( 0 , 0 , 0 . . . 0), (1,0,0 . . . . 0) . . . . (1,1,1 . . . . 1) } of all possible 2 m outcomes to the subspace 1~' of those outcomes which conform to the Guttman pattern consistent with
18
David Andrich
the threshold order according to (0,0,0 . . . . O) corresponding to xpi = 0 , ( 1 , 0 , 0 . . . O) corresponding to xpi = 1, (1,1,0 . . . . O) corresponding to xpi = 2 , and (1,1,1 . . . . 1) corresponding to xpi = m. This restriction makes the final outcome at each threshold dependent on the location of all thresholds (Andrich, 1992). For example, consider the case of a continuum partitioned into two thresholds. Then t} = {(0,0), Pr{ (0,0) } = Pr{(1,O)} = Pr{ (0,1) } = Pr{(1,1)} =
(1,0), (0,1), (1,1) } and ( 1)( 1 )/"qlpi'q2pi, [exp oLli([31, - Tli)](1)/"qlpi'q2pi , (1)[exp c~2i([3p - T2i)]/TllpiTl2pi , [exp OLli([3 p -- "rli) exp oL2i([3 p -- 'T2i ) ]/'lq l l, iTI21, i.
However, the outcome must be in one of only three categories, and this is given by the subspace 1 1 ' = {(0,0), (1,0), (1,1)}, in which case a response in the first category x~,i = 0 implies failure at both thresholds, xpi = 1 implies success at the first threshold and failure at the second, and xpi = 2 implies success at both thresholds. The element (0,1) is excluded because it implies exceeding the second more difficult threshold and failing the first, easier threshold, which violates the ordering of the thresholds. If the element (1,0) had been excluded and the element (0,1) included, then the intended ordering of the thresholds and direction of the continuum would have been reversed. The choice of elements which are legitimate, and therefore others which are inadmissible, is emphasized because it is a key step in defining the direction of the continuum and the interpretation of the application of the model. The probability of the outcomes conditional on the subspace t}', and after some simplification, is given by Pr{(O,O) ID'} = Pr{x~,i = 0 } =
(1)(1)
'~pi 1 Pr{(1,O) lt)'} = Pr{xpi = 1} = -- {exp(-otli'rli "~pi 1
+ otii[3p) }
Pr{(1,O) lO'} = Pr{x~; = 2} = - - { e x p ( - o t j i - r l i - ot2i'rzi + (otli + o~zi)[3p)}. "~i,i
Defining the successive elements of the Guttman pattern {(0,0), (1,0), (1,1)} by the integer random variable X, x E {0,1,2}, and generalizing, gives
1. Measurement Criteria for Choosing among Models with Graded Responses
Pr{xpi-
19
1 0} = ~ exp{Kx/ + d~xi[3p}, "Ypi
which is Equation (8) in which +xi and K~i are identified as +oi---0 +xi
--
and
OLli nt- OL2i -[-
9 9 9 -[- OLxi
K0/------0 Kxi
x-
--
--OLliTil
1. . . .
--
OL2i'r2i
9 9 9 --c~ .viT r~i ,
m,
for -oo < T x i ~ oo, X - - 1 . . . . m; "rxi > "r(~_ ~)i, x - 2 . . . . m. Thus the scoring functions +~i are the sums of the successive discriminations at the thresholds, and if %; > 0, x - 1. . . . m, as would normally be required, the successive scoring functions increase, qb(x+1); > 4'~/x - 1. . . . m. If the discriminations are constrained to be equal, O L l i - - O L 2 i 0~3i = . . . OLxi = . . . c~,,,i = oti > 0, then the scoring functions and the category coefficients become +o = 0 +xi
and
- - XOLi
Ko = 0 Kxi
--
--OL(TIi
-[- T 2 i
-q- " " " -'[- T x i ) ,
X --
1. . . .
m.
It can now be seen that the successive scoring functions have the property ( b ~ + l - (b~ = ( b ~ - (bx-l = a specified by Andersen (1977). Thus the mathematical requirement of sufficient statistics, which arises from the requirement of invariant comparisons, leads to the specialization of e q u a l d i s c r i m i n a t i o n s at the t h r e s h o l d s , exactly as in the prototype of measurement. Furthermore, if at threshold x, % i - 0, then (b(~+~)~- (bxi, which explains why categories can be combined only if threshold x does not discriminate, that is, if it is in any case artificial. At the same time, if a threshold does not discriminate, then the categories should not be combined. This analysis of the construction of the CTM provides an explanation of the known result that the collapsing of categories with such a model has nontrivial implications for interpreting relationships among variables (Clogg and Shihadeh, 1994). The parameter oL can be absorbed into the other parameters without loss of generality, giving the model of Equation (1), in which the scoring functions now are simply the successive counts x E {0,1,2 . . . . m} of the number of thresholds exceeded, as in the prototype of measurement. The category coefficients then become simply the opposite of the sums of the thresholds up to the category of the coefficient x, giving in summary +oi-0 ~)xi -- Xi
and
K0/--0 Kxi
-'- - - ( T I i
-[- T 2 i
-[-
9 9 9 -nt- T x i ) ,
X --
1. . . .
m,
and the resultant model is the CTM. It is important to note that the scoring of successive categories with successive integers d o e s n o t r e q u i r e a n y a s s u m p t i o n t h a t the d i s t a n c e s b e t w e e n t h r e s h o l d s are e q u a l - - t h e s e differences are estimated through the estimates of the thresholds.
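To make the working of the CTM concrete, the following minimal sketch (not part of the chapter; the location and threshold values are invented for illustration) computes the category probabilities Pr{x} = exp(κ_x + xβ_p)/γ_{pi}, with κ_x taken as the negative cumulative sum of the thresholds, exactly as summarized above.

```python
# Minimal sketch of CTM category probabilities; values are illustrative only.
import numpy as np

def ctm_probabilities(beta, tau):
    """Category probabilities for x = 0..m under the CTM."""
    tau = np.asarray(tau, dtype=float)
    kappa = np.concatenate(([0.0], -np.cumsum(tau)))     # kappa_0 = 0
    x = np.arange(len(tau) + 1)
    numer = np.exp(kappa + x * beta)
    return numer / numer.sum()                            # division by gamma_pi

if __name__ == "__main__":
    # Hypothetical location and ordered thresholds, not estimates from the text.
    print(ctm_probabilities(beta=0.5, tau=[-1.0, 0.0, 1.0]))
```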
The requirement of equal discriminations at the thresholds, and the strict ordering of the thresholds, is therefore central to the construction of the model. However, it turns out that there is nothing in the structure of the parameters, or in the way the summary statistics appear in any of the estimation equations (irrespective of method of estimation), that constrains the threshold estimates to be ordered; in the estimates, the parameters may appear in an order different from that intended. If they appear in the incorrect order, it is taken that the thresholds are not operating as intended, and that the hypothesis of threshold order must be rejected. This reversal of order in the estimates will happen readily if the discriminations at the thresholds are not equal in the data and the data are analyzed using the CTM, which requires equal discriminations.

A common reaction to this evidence that the model permits the estimates to be disordered is that the disordered estimates should simply be accepted -- either there is something wrong with the model or the parameters should be interpreted as showing something other than that the data fail to satisfy the ordering requirement. This perspective is supported by the feature that the usual statistical tests of fit based on the chi-square distribution can be constructed and that the data can be shown to fit the model according to such a test of fit. However, such reasoning does not take into account that none of these tests of fit, which operate with different powers, are necessary and sufficient to conclude that the data fit the model. One hypothesis is that the thresholds are ordered correctly. Thus the fit can be satisfied with a global test of fit or with respect to some specific hypothesis about the data, even though some other specific hypothesis may have to be rejected. In addition, such tests of fit involve the data in the estimates of parameters, and degrees of freedom are lost to the test of fit: the test of fit checks whether, given these estimates and the model, the detail of these very same data can be recovered. In the CTM, in which the threshold estimates can show any order, the test of fit involves those estimates and so it may not reveal any misfit, and, in particular, it cannot in itself reveal anything special about the reversed estimates of the thresholds. It is a test of fit of the internal consistency of the data given the parameter estimates, and features of the parameter estimates, such as order, have to be tested separately. Thus, evidence from this kind of statistical test of fit is incomplete and does not obviate the need to study the estimates in relation to other evidence and other criteria related to the construction of the model.

This argument is so central to the application of the model that it is now elaborated from an alternative perspective. This perspective involves the log-odds of successive categories:

log(Pr{x_{pi}} / Pr{(x-1)_{pi}}) = [(β_p - δ_i) - τ_{xi}].                    (9)

If τ_{(x+1)i} > τ_{xi}, then from Equation (9),
log(Pr{x_{pi}} / Pr{(x-1)_{pi}}) - log(Pr{(x+1)_{pi}} / Pr{x_{pi}})
= [(β_p - δ_i) - τ_{xi}] - [(β_p - δ_i) - τ_{(x+1)i}] = τ_{(x+1)i} - τ_{xi} > 0,

it follows that

(Pr{x_{pi}})² / (Pr{(x-1)_{pi}} Pr{(x+1)_{pi}}) > 1,
which ensures a unimodal distribution of x_{pi}. Specifically, no matter how close two successive thresholds are, if τ_{(x)i} < τ_{(x+1)i} and if τ_{(x)i} < β_p < τ_{(x+1)i}, then the probability of the outcome x is greater than the probability of an outcome in any other category. Figure 3 illustrates this feature, where thresholds τ_{3i} and τ_{4i} are close together. When successive thresholds are in an identical location, or reversed, unimodality no longer holds. The unimodal shape when the thresholds are ordered is consistent with interpreting the values x_{pi} to parallel measurements, because the distribution of x_{pi} is then simply a distribution of random error conditional on the location of the entity and the thresholds, and upon replication, regression effects would produce a unimodal distribution. The Poisson distribution, which has been shown to be the special case when thresholds are equidistant in the multiplicative metric, exemplifies this feature. Other random error distributions, such as the binomial and negative binomial, as well as the normal and other continuous distributions, all are unimodal.
FIGURE 3. Probability of response in seven categories with thresholds 3 and 4 close together.
It is stressed, however, that unimodality is a property of the model, and that there is no guarantee that in any set of data this unimodal property would hold -- that is an empirical issue -- and if it does not hold, then reversed threshold estimates will be obtained. When one tests for the normal, binomial, or Poisson distributions in usual circumstances, one can readily find that the data do not fit the model because they are bimodal. To return to the idea that the threshold order is a hypothesis about the data, there must be a myriad of ways in empirical work in which data may be collected in graded responses where the discriminations at the thresholds will not be equal, or where some thresholds do not discriminate at all, or where some may even discriminate negatively, so that a nonunimodal distribution, with reversed estimates of thresholds, is obtained. If the estimates show threshold disorder, then it is taken that the intended ordering has been refuted by the data and that an anomaly has been disclosed, an anomaly that needs substantive investigation. The CTM permits the data to reveal whether or not they are ordered.

3.1.4. Summary of CTM and measurement criteria

The CTM therefore satisfies the criteria specified, and also therefore characterizes the responses in ordered response categories in a way that reflects measurement. The only difference is that the categories into which the continuum is partitioned are not of equal size and the number of categories is finite. This means that an estimate of the location is not available explicitly but must be obtained by solving an equation iteratively; otherwise, it has the same features. How each of the criteria, and how the model's characterization of these criteria, can be exploited in data is reflected in the examples provided in the next section, after the competing model is presented.
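The unimodality argument above can also be checked numerically. The following sketch (illustrative threshold values only, not taken from the chapter) verifies that with strictly ordered thresholds the CTM category probabilities are unimodal for any location β, and that swapping two thresholds can produce a distribution that is not unimodal.

```python
# Numerical check of the unimodality property of the CTM; made-up thresholds.
import numpy as np

def ctm_probabilities(beta, tau):
    kappa = np.concatenate(([0.0], -np.cumsum(np.asarray(tau, float))))
    numer = np.exp(kappa + np.arange(len(tau) + 1) * beta)
    return numer / numer.sum()

def is_unimodal(p):
    d = np.sign(np.diff(p))
    d = d[d != 0]
    # unimodal: once the sequence starts decreasing it never increases again
    return not np.any((d[:-1] < 0) & (d[1:] > 0))

ordered_tau = [-1.5, -0.5, 0.5, 1.5]
swapped_tau = [-1.5, 0.5, -0.5, 1.5]        # thresholds 2 and 3 reversed
for tau in (ordered_tau, swapped_tau):
    flags = [is_unimodal(ctm_probabilities(b, tau)) for b in np.linspace(-3, 3, 61)]
    print(tau, "unimodal for all beta:", all(flags))
```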
3.2. Thurstone Cumulative Probability Model

The model historically prior to the CTM for ordered response categories is also based on the partitioning of a continuum into categories. The derivation of the model is based on the plausible assumption of a single continuous response process, so that the entity is classified into a category depending on the realization of this process. If τ_{1i}, τ_{2i}, τ_{3i}, ..., τ_{xi}, ..., τ_{mi} are again m ordered thresholds dividing the continuum into m + 1 categories, then Figure 4 shows the construction and the continuous process that Thurstone originally assumed to be normal. However, because it is much more tractable, and because with a scaling constant it is virtually indistinguishable from the normal (Bock, 1975), the process is now often assumed to be the double exponential density function f(y) = exp(y) / (1 + exp(y))². In addition, various parameterizations have been used, but the differences among these are not relevant to the chapter; therefore only the one that is the simplest and closest to the parameterization of the CTM, the one with only the location and threshold parameters, is examined.
FIGURE 4. The double exponential density function with four thresholds.
Thus if Y_{pi} is a random continuous process on the continuum about β_p, and if successive categories are denoted by successive integers x_{pi}, then an outcome τ_{xi} ≤ Y_{pi} ≤ τ_{(x+1)i} leads to the outcome x_{pi}, with x_{pi} = 0 if Y_{pi} ≤ τ_{1i}, and x_{pi} = m if Y_{pi} ≥ τ_{mi}. According to the double exponential with only the location parameter β_p,

Pr{Y_{pi} > τ_{xi}} = ∫_{τ_{xi}}^{∞} e^{y_{pi} - β_p} / (1 + e^{y_{pi} - β_p})² dy_{pi} = e^{β_p - τ_{xi}} / (1 + e^{β_p - τ_{xi}}).

Let Pr*{x_{pi}} = Pr{x_{pi}} + Pr{(x+1)_{pi}} + Pr{(x+2)_{pi}} + ... + Pr{m_{pi}}; then

Pr{Y_{pi} > τ_{xi}} = Pr*{x_{pi}} = e^{β_p - τ_{xi}} / (1 + e^{β_p - τ_{xi}}).                    (10)
Thus in contrast to the CTM, in which the successive thresholds are accumulated in the relevant expression, in this model it is the successive probabilities that are accumulated: hence the designation cumulative probability model (CPM). This model may now be assessed according to the measurement criteria specified in section 2.
3.2.1. Invariance under different partitions of the continuum

First, the probability for each value of the outcomes x_{pi} is given by Pr{x_{pi}} = Pr*{x_{pi}} - Pr*{(x+1)_{pi}}, x = 1, ..., m - 1, that is,

Pr{x_{pi}} = e^{β_p - τ_{xi}} / (1 + e^{β_p - τ_{xi}}) - e^{β_p - τ_{(x+1)i}} / (1 + e^{β_p - τ_{(x+1)i}}),   x = 1, ..., m - 1,                    (11)

and

Pr{0_{pi}} = 1 - e^{β_p - τ_{1i}} / (1 + e^{β_p - τ_{1i}}),        Pr{m_{pi}} = e^{β_p - τ_{mi}} / (1 + e^{β_p - τ_{mi}}).
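For comparison with the CTM sketch given earlier, the following minimal sketch (illustrative values only, not from the chapter) computes the CPM category probabilities of Equation (11) as differences of adjacent cumulative probabilities of the form in Equation (10).

```python
# Minimal sketch of CPM category probabilities; values are illustrative only.
import numpy as np

def cpm_probabilities(beta, tau):
    """Category probabilities for x = 0..m under the CPM of Equation (11)."""
    tau = np.asarray(tau, dtype=float)
    cum = 1.0 / (1.0 + np.exp(-(beta - tau)))     # Pr*{x} = Pr{Y > tau_x}, x = 1..m
    cum = np.concatenate(([1.0], cum, [0.0]))     # Pr*{0} = 1 and Pr*{m+1} = 0
    return cum[:-1] - cum[1:]                     # Pr{x} = Pr*{x} - Pr*{x+1}

if __name__ == "__main__":
    print(cpm_probabilities(beta=0.5, tau=[-1.0, 0.0, 1.0]))
```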
In this model, no sufficient statistics exist. Therefore, the kind of invariance through the separation of parameters and the conditioning out of one set of parameters while the other set is estimated, as established in the CTM, cannot be demonstrated. However, and in terms of the statistical criteria presented in McCullagh (1985) and previously quoted, another kind of invariance does exist. Because there is only the one random process, characterized by the distribution of Y_{pi}, this distribution itself remains invariant under different partitions of the continuum; that is, it remains invariant with different numbers of thresholds at different locations. Even so, this does not ensure that the estimate of β_p will be the same under these different partitions. It may turn out that in empirical or simulated data there is not a great change in the estimates under different partitions, but in the model itself, there is no property from which it follows that the location of the entity would be invariant.

3.2.2. Structure and specialization

Although it contains, in part, an additive structure within each of the two terms of the probability Pr{x_{pi}}, because the two terms are related by a difference, the structure of the model is not multiplicative (or additive in the logarithmic metric), as found when fundamental measurement has taken place, and this also precludes the possibility of sufficient statistics.

Next, consider the effect of increasing the number of thresholds. Figure 5 shows the distribution of the response process Y_{pi} as in Figure 4, but with another threshold placed between the original thresholds.

FIGURE 5. The double exponential density function with seven thresholds.

It is evident that the distribution has neither narrowed nor changed in location as a result of increasing the number of thresholds. In principle, this means that the estimate of the location β_p will not be improved, that is, the precision of the estimate will not be improved by the addition of thresholds and the reduction in the size of the unit. Thus although the location of the observed value x_{pi} is itself now within a narrower region, it has the same probability of being beyond any specified value from β_p under both sets of thresholds. Therefore, it does not have the effect that as the number of thresholds increases, the precision of the location of β_p also increases.

Interestingly, it is this very feature that might make the model applicable to another kind of situation, and it is perhaps the metaphor of this situation that has led it to be used with graded responses. The single distribution of the process Y_{pi}, with a location and a variance that do not change with different partitions of the continuum, implies that the variance is a fixed property of the distribution and not a property of the process of measurement or construction of the categories. An example might be helpful. Suppose persons are measured for their height and then classified into class intervals. On the assumption of no errors of measurement, this distribution has both a mean and a variance, and if the data were partitioned into different sets of class intervals, then an invariance of distribution might be expected. In this case, the variance is not a function of errors of measurement, but is a variation that describes the differences among the persons when the construction of different class intervals occurs post hoc, that is, after the data are collected with a particular unit of measurement. In this case, the parameter to be identified is not simply the location of an entity, in which case the smaller the unit the narrower the expected error distribution and the greater the precision of measurement, but two parameters, one for the location and one for the variance of entities. It seems that the CPM is constructed from this kind of situation and that it might therefore be appropriate to apply it to such situations. For example, if the probability of the classification of members of a population is to be characterized as a function of some explanatory variable, say income as a function of age, then the CPM may be appropriate. This case is different in principle from that of graded responses which are in the process of being constructed to characterize the meaning of more or less of a variable, and where the model is used to locate each single entity as precisely as possible in parallel to measurement in physical science.

3.2.3. Falsification

The strict ordering of threshold estimates is ensured in the CPM. This is because the estimates of the thresholds are a function of the
cumulative probabilities in the model, specifically of the log-odds of the cumulative probabilities relative to their complement, which can be seen from

log(Pr*{x_{pi}} / (1 - Pr*{x_{pi}})) = β_p - τ_{xi}.                    (12)
Because the Pr*{x_{pi}} are estimated by the cumulative frequencies beyond τ_{xi}, they are strictly decreasing, and so the threshold estimates will never be disordered. (Two successive thresholds may have the same value if the frequency in a category is 0.) Because the model has this strict ordering built into it, it is considered a strength when viewed from the perspective of the statistical criteria. However, from the point of view of the measurement criteria, it means that no matter how badly or incorrectly the data have been collected in relation to the theory from which they arise, or how incorrect the theory, the data will always show a strict ordering of thresholds. Then, in part because of the model and in part because of the perspective that leads to the model, no anomalies in ordering will ever appear. This means that the ordering is strictly a property of the model and not of the data, and in fact is independent of the data. Because the statistical tests of fit based on the chi-square distribution are again not necessary and sufficient, it is possible to obtain tests of fit that show that the data fit the model even when, from inspection of the data, it is evident that a threshold is not discriminating. One of the upcoming examples reveals this.
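The point that the ordering of CPM threshold estimates is a property of the model rather than of the data can be illustrated with a small sketch. Assuming, purely for illustration, a single group with its location fixed at zero and made-up frequencies in which the middle categories barely differ, the thresholds recovered through Equation (12) from the cumulative proportions are nevertheless strictly ordered.

```python
# Illustration of why CPM thresholds from cumulative proportions are always ordered.
import numpy as np

def cpm_thresholds_from_counts(counts, beta=0.0):
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    # proportion of responses in category x or above, for x = 1..m
    p_star = counts[::-1].cumsum()[::-1][1:] / n
    return beta - np.log(p_star / (1.0 - p_star))   # Equation (12) solved for tau_x

# Hypothetical frequencies in which the middle boundary hardly discriminates:
print(cpm_thresholds_from_counts([40, 25, 25, 10]))
```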
3.2.4. Summary of the CPM in relation to the measurement criteria

It is evident that the CPM satisfies none of the measurement criteria: it does not provide invariance of location, it does not specialize to the case of measurement when precision is to be increased by increasing the number of thresholds, and it does not permit the ordering of the response categories to be falsified. The reason it has none of these features is that it arises from a situation in which the distribution is not that of measurement error, but of a property already measured, in which the entities are described by a mean and a variance.
3.3. Combining and Juxtaposing the CTM and the CPM

It is possible to consider some combination of the ideas associated with the two principles for deriving models for ordered categories when one has groups of entities to be compared, and when the comparison is essentially with respect to the location parameter of the group even though it is known that each entity within each group g has its own location. In the presence of error of measurement of individual entities, it is possible to superimpose the distribution of the entities, as defined in the CPM, on the probability of the classification of each entity in a standard way:
Pr{x_{gi}} = Σ_{β_p} Pr{x_{pi} | β_p ∈ g} Pr{β_p}     or     Pr{x_{gi}} = ∫_{β_p} Pr{x_{pi} | β_p ∈ g} f(β_p) dβ_p,
for a discrete or continuous distribution of β, respectively. However, it would still be necessary to assume a shape for the distribution of β. If the distribution for the entities themselves is assumed to be unimodal (say, based on the normal or double exponential or gamma distributions), then the reversal of the thresholds may imply either a violation of the operation of the graded responses or a violation of the unimodal assumption of the distribution of the entities.

Alternatively, and more simply, it is possible to apply the CTM directly to the location of the groups. Then, if the groups are assumed to have homogeneous unimodal distributions differing only in their locations, the variation among entities within a group would be absorbed into the locations of the thresholds. In this case, the reversal of thresholds again would reflect either on the operation of the graded responses or on the assumption of a unimodal distribution of the entities. It should be stressed, however, that in the case in which estimates of individual entities are characterized by two sets of graded responses, when the parameters of the entity are conditioned out, then the reversal of threshold estimates reflects only on the graded responses.

Estimation and other features of the CTM and the CPM are well established in the literature for various design structures and therefore these will not be discussed here. The CTM may be identified as a member of the power series distributions (Leunbach, 1976; Noack, 1950), and can also be identified as a log-linear model (Andrich, 1979; Duncan, 1986; Goodman, 1981). This means that the statistical properties that are relevant for estimation and fit are known. However, it should be stressed that the CTM is not chosen for graded responses because it can be identified in this way statistically, but because it turns out to be the case that the CTM, chosen because it meets the criteria for measurement, can be so identified. The CPM is established as a standard link function in the analysis of generalized linear models (McCullagh, 1980).

If the task were simply to model a particular set of data, that is, to describe the data with a model that fits according to some general statistical criterion, then very little difference would be found between the CTM and the CPM. In addition, if the criterion of threshold order is not taken as a hypothesis but as given, and if the distribution of outcomes is defined by a location and dispersion that are to be taken as invariant across different partitions of the continuum, then the application of the CPM becomes attractive. However, if the graded responses are used to locate an entity as accurately as possible, as in measurement, so that the dispersion is only a function of the accuracy of the instrument, and if the ordering of the thresholds is taken as a hypothesis about the variable, then the application of the CTM is required.
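A minimal sketch of the superimposition just described, assuming (purely for illustration) a normal distribution of entity locations within a group and a simple quadrature over that distribution; neither the distributional form nor the numerical values come from the chapter.

```python
# Sketch of marginalizing the CTM over an assumed within-group distribution of beta.
import numpy as np

def ctm_probabilities(beta, tau):
    kappa = np.concatenate(([0.0], -np.cumsum(np.asarray(tau, float))))
    numer = np.exp(kappa + np.arange(len(tau) + 1) * beta)
    return numer / numer.sum()

def group_probabilities(mu, sigma, tau, n_points=41):
    """Pr{x_gi} = integral of Pr{x | beta} f(beta) dbeta by simple quadrature."""
    betas = np.linspace(mu - 4 * sigma, mu + 4 * sigma, n_points)
    weights = np.exp(-0.5 * ((betas - mu) / sigma) ** 2)
    weights /= weights.sum()
    probs = np.array([ctm_probabilities(b, tau) for b in betas])
    return weights @ probs

print(group_probabilities(mu=-0.3, sigma=1.0, tau=[-1.0, 0.0, 1.0]))
```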
4. EXAMPLES

4.1. Example 1: Boys' Dreams

The first example follows up an analysis of the distribution of dreams among 223 boys aged 5 to 15 by McCullagh (1980) using the CPM. These data, taken from Maxwell (1961) and reproduced in Table 1 in a rearranged format, have also been analyzed by Nelder and Wedderburn (1972) using a log-linear model with a linear scoring function for location but without concern for any threshold parameters. For convenience of interpretation here, the order of the table has been reversed so that the most severe disturbance of dreams has the highest rating. Because it concerns the development through age of the severity of dreams along a variable defined through a set of graded responses, the example is directly relevant to the theme of this book. The data are analyzed here according to the CTM, but with only one instrument i, and with the location μ_g of each group g characterizing the location of an entity. The model then takes the form

Pr{x_{gi}} = (1/γ_{gi}) exp( -Σ_{k=1}^{x} τ_k + xμ_g ),     -∞ < μ_g < ∞,   -∞ < τ_k < ∞,   x = 1, ..., m.
As an aside, note that log[Pr{x_{gi}} / Pr{(x-1)_{gi}}] = μ_g - τ_x, which permits a linear regression-type analysis between the ordered thresholds τ_x, x = 1, ..., m, of the dependent variable and μ_g, g = 1, ..., 5, of the explanatory variable.
TABLE 1. Frequency of Disturbed Dreams among Boys Aged 5 to 15

                 Degree of suffering from disturbed dreams
                 Not severe                          Very severe
Age        x:        0         1         2         3        μ̂_g
5-7                  7         4         3         7        .281
8-9                 10        15        11        13        .337
10-11               23         9        11         7       -.034
12-13               28         9        12        10       -.013
14-15               32         5         4         3       -.573

τ̂_x                        0.481     -.261     -.220
χ²₈ = 9.631, p < .292;  β̂ = -.320
This approach to the analysis of ordered dependent variables, but without specific concern for the empirical ordering of the thresholds, is described in Goodman (1981) and now exists in some statistical packages featuring log-linear analyses.

Table 1 also shows the estimates of the group effects μ_g and of the thresholds τ_x, and a chi-square test of fit of the recovery of the frequencies according to the model. In describing the data, the model is adequate: χ² = 9.631 on 8 df, p < .292. Before making a study of the thresholds, it is also evident that the age groups 5 to 7 and 8 to 9 have very similar location values, and likewise age groups 10 to 11 and 12 to 13, and that when this grouping is taken into account, the basic shift is for the dreams to decrease in severity as a function of age. To highlight this feature, Table 2 shows a reanalysis according to the CTM but with ages 5 to 9 combined and with ages 10 to 13 combined. Again, the data accord well with the model: χ² = 6.738 on 4 df, p < .150. This grouping, which is rather straightforward in that it involves no more than making class intervals larger than originally set, has a major implication for understanding the rate of change along the variable as a function of age: changes are noticeable in the range of 5 to 13 years in intervals of 4 years, with a rather sharper change in the age group of 14 to 15 years.

Returning to the thresholds, in both analyses the estimates show reversals, with the second and third thresholds having lesser values than the first threshold, and the second and third thresholds being close in value: an anomaly has been disclosed. To study this anomaly, recall that in the construction of the CTM, equal discriminations at the thresholds are required, and that if a category does not discriminate, but the model assumes it does, then it is likely that reversed threshold estimates will be obtained. Moreover, if a threshold does not discriminate, then the two categories it separates could and should be collapsed.
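As noted in the aside above, the CTM implies log[Pr{x_{gi}}/Pr{(x-1)_{gi}}] = μ_g - τ_x, so the observed adjacent-category log-odds computed directly from Table 1 should decompose into an age effect and a threshold effect. The following sketch (a supplementary illustration, not part of the original analysis) computes these log-odds from the frequencies in Table 1; the middle comparison, between categories 1 and 2, is the one that shows no clear trend with age.

```python
# Observed adjacent-category log-odds from the Table 1 frequencies.
import numpy as np

counts = np.array([[ 7,  4,  3,  7],     # ages 5-7,   categories 0-3
                   [10, 15, 11, 13],     # ages 8-9
                   [23,  9, 11,  7],     # ages 10-11
                   [28,  9, 12, 10],     # ages 12-13
                   [32,  5,  4,  3]])    # ages 14-15

log_odds = np.log(counts[:, 1:] / counts[:, :-1])   # log(n_{g,x} / n_{g,x-1})
for age, row in zip(["5-7", "8-9", "10-11", "12-13", "14-15"], log_odds):
    print(age, np.round(row, 2))
```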
TABLE 2. Frequency of Disturbed Dreams among Boys Aged 5 to 15 with Age Groups 5 to 9, 10 to 13, and 14 to 15

                 Degree of suffering from disturbed dreams
                 Not severe                          Very severe
Age        x:        0         1         2         3        μ̂_g
5-9                 17        19        14        20        0.412
10-13               51        18        23        17        0.069
14-15               32         5         4         3       -0.482

τ̂_x                        0.482    -0.261    -0.221
χ²₄ = 6.738, p < .150;  β̂ = -.411
Directed by this perspective, it can be seen just by looking at the data that the middle threshold does not discriminate -- frequencies in the categories on either side of it are about equal irrespective of age. That is, even though the overall grading of the severity of dreams changes as a function of age in the extreme two categories, the relative distribution in the middle two categories does not change.

A number of points regarding the relationship between a theory and the construction of graded responses can be illustrated using this example. First, there is a need for a theory that relates changes in the minds of boys as they become older to the definition of degrees of severity of dreams. Thus, because it is not age as such, but the changes that take place through aging, that are relevant, it is not sufficient to observe that severity of dreams is related to age and that a statistical model can be found that accounts for the data. Ideally, the theory comes before the data are collected, but the need for a theory arises even if the relation is observed after the data are collected. In this case, the theory should bear on the observation that changes in the 5 to 13 year range are noticeable only in class intervals of 4 years and are more sharply noticeable in the same direction in the age group of 14 to 15 years, and on how this relates to the severity of dreams. Second, the changes in dreaming that occur through age and the intensity of the variable defined through the graded responses are defined simultaneously: the locations of the thresholds are an integral part of the definition of the intensity of the variable -- they define what it means to have more or less severe dreams. Third, although there is a trend in the change in severity as a function of what happens through a change in age, there is no change in severity as characterized by the frequencies in the middle two categories: the change occurs only in relation to the extreme categories, and this, it should be stressed, is a characteristic of the data (which the model of analysis exposed). If the middle two categories operated consistently with the extreme categories, with the third being more severe than the second, then they too should show the same trend, even if it is not as strong as with the extreme categories. If the two extreme categories were not present, but only the middle two, then no trend would be observed. Although the emphasis here is theoretical, to help make the point it is noted that there could be undesirable practical consequences in ignoring the evidence: for example, suppose treatments were recommended on the basis of the classification; then different treatments might be recommended for classifications in the middle two categories, which from this empirical evidence are no better than random.

If the classifications in the middle two categories are combined, which is justified under the specific hypothesis that the middle two categories do not discriminate, then the estimates of the parameters and the general test of fit are as shown in Table 3. The data again fit the model, but in addition, the two thresholds are now in the correct order.
TABLE 3. Frequency of Disturbed Dreams among Boys Aged 5 to 15 with Age Groups 5 to 9, 10 to 13, and 14 to 15, and the Original Middle Two Categories Combined

                 Degree of suffering from disturbed dreams
                 Not severe                Very severe
Age        x:        0         1         2        μ̂_g
5-9                 17        33        20        .708
10-13               51        41        17        .068
14-15               32         9         3       -.777

τ̂_x                         -.414      .414
χ²₂ = 2.632, p < .268;  β̂ = -.637
However, the task is not simply to model the data, but to understand their meaning. Therefore, the implication of this analysis is not that the model accounts for the data and therefore that closure on the analysis of the data has been reached. Quite the contrary: the implication is that the definition of what constitutes the middle two categories of severity, how these might be related to changes in boys' dreams with age, and how the variable of severity has been operationalized all have to be examined from the point of view of the relevant substantive theory. In addition, the assumption of some kind of homogeneity of the boys' locations in each age group should be examined for any possible splitting into two or more subpopulations, especially on the basis of some background factor that might be relevant to severity of dreams, perhaps recent family or personal trauma or the like. In the words of Kuhn quoted earlier, an anomaly has been disclosed with quantitative finesse, although as to the source of the anomaly there is no clue: a search for an explanation is required. Unfortunately, Maxwell (1961) provides no additional substantive information regarding the source of the data, and indeed from a statistical fit perspective no further action is required -- the model accounts for the data.
4.2. Example 2: Serological Readings

The second example involves 12 serological readings classified into five levels of reaction for each of 12 blood cells, analyzed by Fisher (1958, p. 290). If the readings from the sera in the five levels of reaction are scored by successive integers beginning with 0, then the data may be arranged as in Table 4. Fisher described a method of analysis in which he assigned the score of 0 to the first level, a score of 1 to the last, and then estimated the values for the remaining categories by maximizing the sum of squares of the main effects relative to the total sum of squares.
TABLE 4. Serological Readings of 12 Cells

            Frequency of level of reaction of 12 sera/cell
Cell     x:      0       1       2       3       4       μ̂_g
1                0       1       8       3       0       -.031
2                0       4       8       0       0      -2.184
3                0       0      12       0       0      -0.794
4                1       1      10       0       0      -1.903
5                0       1       7       4       0        .273
6                0       1       8       3       4       1.224
7                0       0       3       7       2       1.770
8                0       0       5       6       1       1.307
9                0       0       6       6       4       1.695
10               0       4       8       0       0      -2.184
11               0       1      10       1       0      -0.794
12               0       0       4       6       2       1.620

τ̂_x                  -3.543  -2.095   2.357   3.282
χ²₃₃ = 25.58, p < .818;  β̂ = -.909
This is exactly the same number of constraints as required in the analysis using the CTM, and as in the CTM, no ordering is forced on the estimates of the weights. Table 4 also shows the parameter estimates and the test of fit according to the CTM, and shows both that the data accord with the model and that the threshold estimates are in the correct order. One purpose in reproducing this simple example is to appreciate Fisher's simple and powerful remark, consistent with the perspective in this chapter, on the evidence that the estimates of the values in his analysis (which correspond to the thresholds in the CTM) were in the correct order: "It will be observed that the numerical values . . . lie . . . in the proper order for increasing reaction. This is not a consequence of the procedure by which they have been obtained, but a property of the data examined" (Fisher, 1958, p. 294).
5. SUMMARY AND DISCUSSION

This chapter discusses the criteria for choosing among models for analyzing graded responses, which are ubiquitous in the social, biological, and other sciences. The observation from which these criteria and the choice of model were pursued is that graded responses are used in those situations in which measuring instruments of the kind found in the physical sciences are not available for dependent variables, and that the graded responses therefore have as important a role in empirical enquiry as do measurements, especially in developmental research. Graded responses are similar to measurements in that, just as units do in measurement, they partition a latent unidimensional continuum into adjacent intervals. They both define, therefore, what it means to have more or less of the property in question, which is considered central to the definition of the variable, to the operationalization of the theory from which the variable arises, and to an understanding of development and change along the variable. Graded responses differ from measurements in that they are finite in number, they are not of equal length, and generally the origin is arbitrary. Therefore, it was considered appropriate to set criteria for a model for graded responses in the wider context of the role played by measurements in the physical sciences, in particular, that the model is required to

1. locate the entity on the continuum in a way that is invariant under different partitions of the continuum, that is, invariant across a choice of different units;
2. specialize to the case of measurements where the graded response categories are equal in size, where there is a natural origin, where the number of categories is not finite, and where a decrease in the size of categories increases the precision of the estimate of the location of the entity; and
3. permit the data collected to falsify the theory of ordering of the graded responses.

The Rasch model for graded responses was shown to satisfy these criteria. For purposes of exposition, the model based on the work of Thurstone, which has also been advocated for use with graded responses, was contrasted with the Rasch model and was shown not to meet the same criteria. Instead, it was shown that the Thurstone model provides invariant distributions under different partitions of the continuum (rather than invariant locations), greater precision of location of each observation (rather than of the location of the entity), and ensures that the graded responses are ordered irrespective of the data, that is, that the ordering is a property of the model.

The differences in the approaches of the two models go beyond the way they describe the data, because if they are used descriptively, that is, to summarize the data, then in general if one model is satisfactory the other will also be satisfactory. A constant theme of this chapter is that because the models reflect different understandings of the relationship between theory, measurement, and data, to appreciate the difference between the two models a different perspective is required -- in short, a different understanding of scientific modeling when measurement or graded responses are involved. In other words, the two
models are not simply alternative descriptions of the data from within the same scientific framework, but reflect different scientific frameworks.
ACKNOWLEDGMENTS

This research was supported by the Australian Research Council. Clifford C. Clogg and Tzuwei Cheng read the manuscript and made constructive comments.
REFERENCES

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69-81.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 357-374.
Andrich, D. (1979). A model for contingency tables having an ordered response classification. Biometrics, 35, 403-415.
Andrich, D. (1985). An elaboration of Guttman scaling with Rasch models for measurement. In N. B. Tuma (Ed.), Sociological methodology (Chapter 2, pp. 33-80). San Francisco: Jossey-Bass.
Andrich, D. (1992, June 22-26). On the function of fundamental measurement in the social sciences: The objective measurement of subjective meaning. Invited Keynote Address, International Conference, Social Science Methodology, Research Committee on Logic and Methodology of the International Sociological Association, Trento, Italy.
Bock, R. D. (1975). Multivariate statistical methods in behavioral research. New York: McGraw-Hill.
Clogg, C. C., & Shihadeh, E. S. (1994). Statistical models for ordinal variables. Thousand Oaks: SAGE Publications.
Duncan, O. D. (1984). The latent trait approach in survey research. In C. F. Turner & E. Martin (Eds.), Surveying subjective phenomena (Vol. 1, pp. 210-229). New York: Russell Sage Foundation.
Duncan, O. D. (1986). Probability, disposition, and the inconsistency of attitudes and behavior. Synthese, 68, 65-98.
Duncan, O. D., & Stenbeck, M. (1988). Panels and cohorts: Design and model in the study of voting turnout. In C. C. Clogg (Ed.), Sociological methodology (pp. 1-35). Washington, DC: American Sociological Association.
Edwards, A. L., & Thurstone, L. L. (1952). An internal consistency check for scale values determined by the method of successive integers. Psychometrika, 17, 169-180.
Fisher, R. A. (1958). Statistical methods for research workers (13th ed.). New York: Hafner.
Goodman, L. A. (1981). Three elementary views of log linear models for the analysis of cross-classifications having ordered categories. In S. Leinhardt (Ed.), Sociological methodology (pp. 193-239). San Francisco: Jossey-Bass.
Guttman, L. (1950). The basis for scalogram analysis. In S. A. Stouffer (Ed.), Measurement and prediction (pp. 60-90). New York: Wiley.
Jansen, P. G. W., & Roskam, E. E. (1986). Latent trait models and dichotomization of graded responses. Psychometrika, 51(1), 69-91.
Kuhn, T. S. (1961). The function of measurement in modern physical science. Isis, 52(Pt. 2), 161-193.
Kuhn, T. S. (1970). The structure of scientific revolutions (2nd enlarged ed.). Chicago: University of Chicago Press.
Leunbach, G. (1976). A probabilistic measurement model for assessing whether tests measure the same personal factor. Copenhagen, Denmark: Danish Institute for Educational Research.
Maxwell, A. E. (1961). Analysis of qualitative data. London: Methuen.
McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society, Series B, 42(2), 109-142.
McCullagh, P. (1985). Statistical and scientific aspects of models for qualitative data. In P. Nijkamp, H. Leitner, & N. Wrigley (Eds.), Measuring the unmeasurable (pp. 39-49). Dordrecht, The Netherlands: Martinus Nijhoff.
Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A, 135, 370-384.
Noack, A. (1950). A class of random variables with discrete distributions. Annals of Mathematical Statistics, 21, 127-132.
Popper, K. (1961). The logic of scientific discovery. New York: Science Editions.
Ramsay, J. O. (1975). Review of the book Foundations of measurement, Vol. 1, by D. H. Krantz, R. D. Luce, P. Suppes, and A. Tversky. Psychometrika, 40, 257-262.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research. (Expanded edition with foreword and afterword by B. D. Wright. Chicago: University of Chicago Press, 1980.)
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In J. Neyman (Ed.), Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 4, pp. 321-334). Berkeley: University of California Press.
Rasch, G. (1966). An individualistic approach to item analysis. In P. F. Lazarsfeld & N. W. Henry (Eds.), Readings in mathematical social science (pp. 89-108). Chicago: Science Research Associates.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monographs, 34(2, No. 17).
Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, 33, 529-554.
Wright, B. D. (1984). Despair and hope for educational measurement. Contemporary Education Review, 3, 281-288.
Wright, B. D. (1985). Additivity in psychological measurement. In E. E. Roskam (Ed.), Measurement and personality assessment (Selected papers). 23rd International Congress of Psychology, 8, 101-111.
Wright, B. D. (1988). Campbell concatenation for mental testing. Special Interest Group: Rasch Measurement, 2(1), 3-4.
Growth Modeling with Binary Responses

Bengt O. Muthén
University of California, Los Angeles
Los Angeles, California 90024
1. INTRODUCTION

In developmental research it is natural to place an emphasis on the study of individual differences in change, growth, and decline over time. It is useful to formulate a longitudinal model for these processes to assess the amount of individual variation and to relate the individual variation to background information on the individuals. Random coefficient growth modeling (see, e.g., Laird & Ware, 1982; Rutter & Elashoff, 1994) is suitable for such longitudinal analysis. It goes beyond conventional structural equation modeling of longitudinal data with its focus on autoregressive models (see, e.g., Jöreskog & Sörbom, 1977; Wheaton, Muthén, Alwin, & Summers, 1977) in that it describes individual differences in the longitudinal processes. It is instructive to consider first some examples of longitudinal studies with binary responses in which random coefficient modeling has been used.
1.1. Example 1: Decline in Depression

Gibbons and Bock (1987) reported on the longitudinal analyses of data on Danish psychiatric patients. A total of 100 clinically depressed patients were given
one of two drugs and were observed five times weekly. The response variable was recorded as recovered (scored 0) or still depressed (scored 1). Information was collected on time-invariant covariates such as severity of illness and age. Time-varying covariates included the plasma level of the drug. The patients were divided into two groups. One group received the drug imipramine and the other received chlorimipramine. The object of the study was to find out which of the two drugs resulted in the sharpest decline in the probability that the patient still felt depressed.
1.2. Example 2: Change/Stability of Neuroticism

Muthén (1983) studied the data described in Henderson, Byrne, and Duncan-Jones (1981) for 231 Canberra adults interviewed four times at 4-month intervals regarding aspects of "neurotic illness." In a short form of a general health questionnaire, the four questions asked were "In the last month have you suffered from any of the following? Anxiety. Depression. Irritability. Nervousness." A yes response was denoted 1 and no was denoted 0. Time-invariant covariates included gender and a measure of long-term susceptibility to neurosis (the N scale from the Eysenck Personality Inventory). Time-varying covariates included life events in the four months prior to the interview. The object of the study was to assess the stability over time of the level of neuroticism of this population of individuals.
1.3. Example 3: Correlated Observations on Asthma Attacks

Stiratelli, Laird, and Ware (1984) studied data on daily observations of 64 asthmatics living in Garden Grove, California. The response variable was the presence or the absence of an asthma attack recorded over a period of about 7 months. Time-invariant covariates included gender, age, and history of hay fever. Time-varying covariates included air pollution and weather conditions. The object of the study was to assess the relative importance of various risk factors for increased probability of asthma attacks.
1.4. Contrasting the Examples

It is interesting to contrast these three examples. Example 3 illustrates the fact that often the longitudinal structure of the data is only a nuisance. Here, the interest is the same as in regression analysis. However, observations over time for the same individual are correlated so that the usual assumption of independent observations does not hold. The focus is on how to do the regression analysis while properly taking into account the nonindependence. In Example 2, the longitudinal structure of the data is not a nuisance but is essential to the analysis.
The longitudinal process is one in which no particular trend over time is expected; the interest is in assessing how much responses vary over time for a typical individual. Observations fluctuate up and down over time for each individual, and it is the amount of fluctuation that is the focus of the study. Example 1 involves a further elaboration of the longitudinal study. Here, observations not only fluctuate over time for a given individual but also follow a decreasing trend over time.

This chapter focuses on situations illustrated by Examples 1 and 2. The aim of this chapter is to discuss conventional random effects modeling for binary longitudinal responses and to compare that with a generalized random effects model for longitudinal data which draws on techniques used in latent variable modeling. In section 2, the conventional modeling and estimation is presented. Section 3 critiques this approach and gives a more general formulation. Section 4 presents a small Monte Carlo study in which the more general approach is studied, and analyses of real data are also presented.
2. CONVENTIONAL MODELING AND ESTIMATION WITH BINARY LONGITUDINAL DATA

Consider a binary variable y and a corresponding continuous latent response variable y* for which τ is a threshold parameter determining the y outcomes: y = 1 when y* > τ and y = 0 otherwise. Here, the progress over time of the latent response variable y* is described as

y*_{ik} = α_i + β_i t_k + γ_k v_{ik} + ζ_{ik},                    (1)

where i denotes an individual, t_k denotes a time-related variable with t_k = k (e.g., k = 0, 1, 2, ..., K - 1), α_i is a random intercept at t = 0, β_i is a random slope, the γ_k are fixed slopes, v_{ik} is a time-varying covariate, and ζ_{ik} is a residual, ζ_{ik} ~ N(0, ψ_ζ). Furthermore,

α_i = μ_α + π_α w_i + ζ_{αi},
β_i = μ_β + π_β w_i + ζ_{βi},                    (2)

where μ_α, μ_β, π_α, and π_β are parameters, w_i is a time-invariant covariate, and ζ_α, ζ_β are residuals assumed to have a bivariate normal distribution with zero means and covariance matrix

Ψ_αβ = [ ψ_αα   ψ_αβ ]
       [ ψ_αβ   ψ_ββ ].                    (3)

With t_k = k and a linear function of time, for example, k = 0, 1, 2, ..., K - 1, the variables α and β can be interpreted as the initial status level and the rate of growth/decline, respectively.
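A minimal simulation sketch of the conventional model in Equations (1) and (2); all parameter values below are invented for illustration and are not estimates from any of the examples.

```python
# Simulation sketch of the random-coefficient growth model for binary responses.
import numpy as np

rng = np.random.default_rng(0)
n, K = 500, 4
t = np.arange(K)                               # t_k = k
mu_a, mu_b, pi_a, pi_b = 0.0, -0.3, 0.5, 0.1   # made-up values
psi_ab = np.array([[0.5, -0.1], [-0.1, 0.2]])  # covariance of (zeta_alpha, zeta_beta)
gamma = np.full(K, 0.3)                        # slopes of the time-varying covariate
tau, psi_zeta = 0.0, 1.0                       # conventional standardization

w = rng.normal(size=n)                          # time-invariant covariate
v = rng.normal(size=(n, K))                     # time-varying covariate
zeta_ab = rng.multivariate_normal([0, 0], psi_ab, size=n)
alpha = mu_a + pi_a * w + zeta_ab[:, 0]
beta = mu_b + pi_b * w + zeta_ab[:, 1]

y_star = alpha[:, None] + beta[:, None] * t + gamma * v \
         + rng.normal(scale=np.sqrt(psi_zeta), size=(n, K))
y = (y_star > tau).astype(int)                  # observed binary responses
print(y.mean(axis=0))                           # proportion of 1s at each time point
```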
The residuals ζ_{ik} are commonly assumed to be uncorrelated across time. In line with Gibbons and Bock (1987), however, a first-order autoregressive structure over time for these residuals is presented. Letting x = (w, v)', the model implies multivariate normality for y* conditional on x with
E(y*_i | x) = T (μ_α + π_α w_i,  μ_β + π_β w_i)' + (γ_0 v_{i0}, γ_1 v_{i1}, ..., γ_{K-1} v_{iK-1})'                    (4)
V(y* Ix) = Ttlr~/f~T ' + t~;
9 pK-1
p
13 2
1 p
p 1
o pK-2
.
.
.
... ...
. pK--3
. . .
pK-- 1 pK-2 pK--3
(5) 1
where with linear growth or decline
T = [ 1    0     ]
    [ 1    1     ]
    [ 1    2     ]
    [ ...  ...   ]
    [ 1    K - 1 ],                    (6)

and Ψ_αβ is the 2 × 2 covariance matrix in Eq. (3). For given t_k, w, and v, the model expresses the probability of a certain observed response y_{ik} as a function of the random coefficients α and β,
P(y_{ik} = 1 | α_i, β_i, x) = P(y*_{ik} > τ | α_i, β_i, x) = ∫_τ^∞ φ(s | z, ψ_ζ) ds,                    (7)

where φ is a (univariate) normal density, ψ_ζ denotes the standard deviation of ζ, and

z = α_i + β_i t_k + γ_k v_{ik}.                    (8)

To identify the model, the standardization τ = 0, ψ_ζ = 1 can be used as in conventional probit regression (see, e.g., Gibbons & Bock, 1987).
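The conditional moment structure of Equations (4) through (8) can be assembled directly. The following sketch (illustrative parameter values and covariate pattern, not output from any analysis in the chapter) builds T, the conditional mean and covariance of y*, and the conditional response probabilities of Equation (7) for K = 4 time points.

```python
# Sketch of the conditional moments and response probabilities; values illustrative.
import numpy as np
from scipy.stats import norm

K = 4
T = np.column_stack([np.ones(K), np.arange(K)])        # Equation (6)
psi_ab = np.array([[0.5, -0.1], [-0.1, 0.2]])          # Psi_alpha,beta from Eq. (3)
rho, psi_zeta = 0.3, 1.0
ar1 = rho ** np.abs(np.subtract.outer(np.arange(K), np.arange(K)))

mu_a, mu_b, pi_a, pi_b = 0.0, -0.3, 0.5, 0.1
gamma = np.full(K, 0.3)
w, v = 1.0, np.array([0.2, -0.1, 0.0, 0.4])            # one covariate pattern

mean = T @ np.array([mu_a + pi_a * w, mu_b + pi_b * w]) + gamma * v   # Eq. (4)
cov = T @ psi_ab @ T.T + psi_zeta * ar1                               # Eq. (5)

# Equations (7)-(8): P(y_ik = 1 | alpha_i, beta_i, x), with tau = 0
alpha_i, beta_i, tau = 0.2, -0.4, 0.0
z = alpha_i + beta_i * np.arange(K) + gamma * v
print(norm.sf(tau, loc=z, scale=np.sqrt(psi_zeta)))
```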
The probability of a certain response may be expressed as

P(y_0, ..., y_{K-1} | x) = ∫_{-∞}^{+∞} ∫_{-∞}^{+∞} P(y_0, y_1, ..., y_{K-1} | α, β, x) φ(α, β | x) dα dβ,                    (9)

where

P(y_0, y_1, ..., y_{K-1} | α, β, x) = ∫_{c(y_{i0})} ... ∫_{c(y_{iK-1})} φ(y*_{i0}, y*_{i1}, ..., y*_{iK-1} | α, β, x) dy*_{i0} ... dy*_{iK-1},                    (10)
where c(y_{ik}) denotes the integration domain for y*_{ik} given that the kth variable takes on the value y_{ik}. Here, the integration domain is either (-∞, τ) or (τ, +∞). In the special case of uncorrelated residuals, that is, ρ = 0 in Equation (5), the y* variables are independent when conditioning on α, β, and x, so that P(y_0, y_1, ..., y_{K-1} | α, β, x) simplifies considerably,

P(y_0, y_1, ..., y_{K-1} | α, β, x) = Π_{k=0}^{K-1} ∫_{c(y_{ik})} φ(y*_{ik} | α, β, x) dy*_{ik}.                    (11)
In this case, only univariate normal distribution functions are involved, so that the essential computations of Equation (9) involve the two-dimensional integral over α and β. Perhaps because of the computational simplifications, the special case of ρ = 0 appears to be the standard model used in growth analysis with binary response. This model was used in Gibbons and Bock (1987; see also Gibbons & Hedeker, 1993). The analogous model with logit link was studied in Stiratelli et al. (1984) and in Zeger and Karim (1991). Gibbons and Bock (1987) considered maximum likelihood estimation using Fisher scoring and EM procedures developed for binary factor analysis in Bock and Lieberman (1970) and Bock and Aitkin (1981). Stiratelli et al. (1984) considered restricted maximum likelihood using the EM algorithm. Gibbons and Bock (1987) used a computational simplification obtained by orthogonalizing the bivariate normal variables α and β using a Cholesky factor so that the bivariate normal density is written as a product of two univariate normal densities. They used numerical integration by Gauss-Hermite quadrature, with the weights being the product of the one-dimensional weights. For the case of ρ ≠ 0, Gibbons and Bock (1987) used the Clark algorithm to approximate the probabilities of the multivariate normal distribution for y* in Equation (10). Even when ρ = 0, the computations are heavy when there is a large number of distinct x values in the sample. Zeger and Karim (1991) employed a Bayesian approach using the Gibbs sampler algorithm. For recent overviews, see Fitzmaurice, Laird, and Rotnitzky (1993); Longford (1993); Diggle, Liang, and Zeger (1994); and Rutter and Elashoff (1994).
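For the ρ = 0 case, the response-pattern probability of Equation (9) reduces to a two-dimensional integral over (α, β), which is what the Gauss-Hermite approach described above evaluates. The following sketch (hypothetical parameter values; a simplified stand-in for the procedures cited, not a reproduction of them) orthogonalizes (α, β) with a Cholesky factor and applies a product quadrature rule.

```python
# Sketch: response-pattern probability under rho = 0 via Gauss-Hermite quadrature.
import numpy as np
from scipy.stats import norm

def pattern_probability(y, v, w, params, n_q=15):
    mu, pi, psi_ab, gamma, tau, psi_zeta = params
    K = len(y)
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_q)   # probabilists' rule
    weights = weights / np.sqrt(2 * np.pi)                     # standard-normal weights
    L = np.linalg.cholesky(psi_ab)                             # orthogonalization
    prob = 0.0
    for a_node, a_w in zip(nodes, weights):
        for b_node, b_w in zip(nodes, weights):
            ab = np.array([mu[0] + pi[0] * w, mu[1] + pi[1] * w]) \
                 + L @ np.array([a_node, b_node])
            z = ab[0] + ab[1] * np.arange(K) + gamma * v
            p1 = norm.sf(tau, loc=z, scale=np.sqrt(psi_zeta))  # Equation (7)
            prob += a_w * b_w * np.prod(np.where(y == 1, p1, 1 - p1))
    return prob

params = (np.array([0.0, -0.3]), np.array([0.5, 0.1]),
          np.array([[0.5, -0.1], [-0.1, 0.2]]), np.full(4, 0.3), 0.0, 1.0)
print(pattern_probability(np.array([1, 1, 0, 0]), np.zeros(4), w=0.5, params=params))
```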
3. MORE GENERAL BINARY GROWTH MODELING

3.1. Critique of Conventional Approaches

In this section, weaknesses in conventional growth modeling with binary data are presented, along with a more general model and its estimation. The maximum likelihood approach to binary growth modeling leads to heavy computations when ρ ≠ 0. This seems to have caused a tendency to restrict modeling to an assumption of ρ = 0. Experience with continuous response variables, that is, when y ≡ y*, indicates that ρ = 0 is not always a realistic assumption. The assumption of a single ρ parameter that is different from zero, as in the Gibbons-Bock first-order autoregressive model, also may not be realistic in some cases. Instead, it appears necessary to include a separate parameter for at least the correlations among residuals that are adjacent in time.

Furthermore, the conventional model specification of τ = 0, ψ_ζ = 1 has no effect when, as in standard probit regression, there is only a single equation that is being estimated. It is important to note, however, that this is not the case in longitudinal analysis. The longitudinal analysis can be characterized as a multivariate probit regression in which the multivariate response consists of the same response variable at different time points. This has the following consequences.

First, the standardization of τ to zero at all time points needs clarification. In the binary case, this does not lead to incorrect results but does not show the generalization to the case of ordered categorical response or to the case of multiple indicators. The threshold τ is a parameter describing a measurement characteristic of the variable y, namely, the level (proportion) of y with zero values on x. Because the same y variable is measured at all time points, equality of this measurement characteristic over time is the natural model specification. In the binary case, however, the equality of the level of y across time points is accomplished by μ_α in Equation (2) affecting y equally over time as a result of the unit coefficient of α_i in Equation (1), which is not explicitly shown. Setting τ = 0 is therefore correct, although an equivalent specification would take τ as a parameter held equal over time points while fixing μ_α at zero. In the ordered categorical case, however, there are several τ parameters involved for a y variable and equality over time of such τ's is called for. In this case, μ_α cannot be separately estimated but may be fixed at zero. The multiple indicator case will be discussed in the next section.

Second, ψ_ζ is the standard deviation of the residual variation of the latent response variable y*, and fixing it at unity implicitly assumes that the residual variation has the same value over time. This is not realistic because over time different sources of variation not accounted for by the time-varying variable v_{ik} are likely to be introduced. Again, experience with continuous response variables indicates that the residual variance often changes over time.
In presentations using the logit version of the model, the parameters τ and ψ_ζ are usually not mentioned (see, e.g., Diggle et al., 1994). This is probably because the threshold formulation, often used in the probit case, is seldom used in the logit case. This has inadvertently led to an unnecessarily restrictive logit formulation in growth modeling.
3.2. The Approach of Muthén

An important methodological consideration is whether computational difficulties should lead to a simplified model, such as using ρ = 0 or ψ_ζ = 1, or whether it is better to maintain a general model and instead use a simpler estimator. Here, I describe the latter approach, building on the model of Equations (1) and (2) to consider a more general model and a limited-information estimator. First, the ζ_{ik} variables of Equation (1) are allowed to be correlated among themselves and are allowed to have different variances over time. Second, multiple indicators y_{ikj}, j = 1, 2, ..., p, are allowed at each time point,

y*_{ikj} = λ_j η_{ik} + ε_{ikj},                    (12)

where λ_j is a measurement (slope) parameter for indicator j, ε_{ikj} is a measurement error residual for variable j at time k, and y_{ikj} = 1 if y*_{ikj} > τ_j. The multiple indicator case is illustrated by Example 2, in which four measurements of a single construct, "neurotic illness" (η), were considered. Given that the τ's and the λ's are measurement parameters, a natural model specification would impose equality over time for each of these parameters. Using normality assumptions for all three types of residuals, ζ, ζ_α and ζ_β, and ε, again leads to a multivariate probit regression model. This generalized binary growth model is a special case of the structural equation model of Muthén (1983, 1984). The longitudinal modeling issues just discussed were also brought up in Muthén (1983), where a random intercept model like Equations (1), (2), and (12) was fitted to the Example 2 data. The problems with standardization issues related to τ and ψ_ζ have also been emphasized by Arminger (see, e.g., ch. 3, this volume) and Muthén and Christofferson (1981) in the context of structural equation modeling.

In the approach of Muthén (1983, 1984), conditional mean and covariance matrix expressions corresponding to Equations (4) and (5) are considered. This is sufficient given the conditional normality assumptions. Muthén (1983, 1984) introduced a diagonal scaling matrix Δ containing the inverses of the conditional standard deviations of the latent response variables at each time point,

Δ = diag[V(y* | x)]^{-1/2}.                    (13)
Muthén (1983, 1984) describes three model parts. Using the single-indicator growth model example of Equations (4) and (5),
σ_1 = Δ T μ,                    (14)
σ_2 = Δ [T π   γ],                    (15)
σ_3 = Ω = Δ (T Ψ_αβ T' + Ψ_ζζ) Δ,                    (16)

where Ψ_ζζ is the K × K covariance matrix of ζ (cf. Eq. 5). The three parts correspond to the intercepts, slopes, and residual correlation matrix of a multivariate probit regression. Note that Ω is a correlation matrix. The general model of Muthén (1983, 1984), including multiple indicators as well as multiple factors at each time point, can be expressed as follows. Consider a set of measurement relations for a p-dimensional vector y*,

y* = Λη + ε,                    (17)

and a set of structural relations for an m-dimensional vector of latent variable constructs η,

η = α + Bη + Γx + ζ,                    (18)

where Λ, α, B, Γ are parameter arrays, ε is a residual (measurement error) vector with mean zero and covariance matrix Θ, and ζ is a residual vector with mean zero and covariance matrix Ψ. The scaling matrix Δ is also included in this general framework, as is the ability to analyze independent samples from multiple populations simultaneously. Muthén (1983, 1984, 1987) used a least-squares estimator where, with σ = (σ_1', σ_2', σ_3')',

F = (s - σ)' W^{-1} (s - σ),                    (19)
where the s elements are arranged in line with σ and are maximum likelihood estimates of the intercepts, slopes, and residual correlations. Here, s_1 and s_2 are estimates from probit regressions of each y variable on all the x variables, whereas each s_3 element is a residual correlation from a bivariate probit regression of a pair of y variables regressed on all x variables. A generalized least-squares estimator is obtained when the weight matrix W is a consistent estimate of the asymptotic covariance matrix of s. In this case, a chi-square test of model fit is obtained as n·F, where n is the sample size and F refers to the minimum value of the function in Equation (19). (For additional technical details on the asymptotic theory behind this approach, see Muthén & Satorra, 1995.) Muthén (1984, 1987) presented a general computer program, LISCOMP, which carries out these calculations.
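The limited-information fit function of Equation (19) is a simple quadratic form once the first-stage statistics s, the model-implied values σ, and the weight matrix W are available. The sketch below uses placeholder vectors only, to show the computation of F and of the chi-square statistic n·F; it does not implement the probit first stage.

```python
# Sketch of evaluating the weighted least-squares fit function of Equation (19).
import numpy as np

def fit_function(s, sigma, W):
    d = s - sigma
    return float(d @ np.linalg.solve(W, d))

s = np.array([0.10, -0.05, 0.20, 0.45, 0.40])       # stacked s1, s2, s3 elements (placeholders)
sigma = np.array([0.12, -0.02, 0.18, 0.43, 0.41])   # model-implied counterparts (placeholders)
W = np.diag([0.01, 0.01, 0.02, 0.02, 0.02])         # consistent estimate of acov(s) (placeholder)
n = 500
F = fit_function(s, sigma, W)
print("F =", round(F, 3), "  chi-square = n*F =", round(n * F, 2))
```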
3.3. Model Identification

Given that the response variables are categorical, the general binary growth model needs to be studied carefully in terms of parameter identification. Under
the normality assumptions of the model, the number of distinct elements of σ represents the total number of parameters that can be identified; therefore, the number of growth model parameters can be no larger than this. The growth model parameters are identified if and only if they are identified in terms of the elements of σ. It is instructive to consider first the conditional y* variance in some detail for the case of binary growth modeling. Let [Δ]_kk^(-2) denote the conditional variance of y*_k given x. For simplicity, the focus is on the case with linear growth. With four time points, the conditional variances of y* can be expressed in model parameter terms as

[Δ]_00^(-2) = ψ_αα + ψ_ε0ε0   (20)
[Δ]_11^(-2) = ψ_αα + 2ψ_αβ + ψ_ββ + ψ_ε1ε1   (21)
[Δ]_22^(-2) = ψ_αα + 4ψ_αβ + 4ψ_ββ + ψ_ε2ε2   (22)
[Δ]_33^(-2) = ψ_αα + 6ψ_αβ + 9ψ_ββ + ψ_ε3ε3   (23)
Note that the Δ elements are different because of across-time differences in the contributions from Ψ as well as from ψ_εε. In Equation (5), using the Gibbons-Bock standardization of ψ_εε = 1, there are no free ψ_εε parameters to be estimated. Contrary to this conventional approach, there are four different ψ_εε parameters in Equations (20) through (23). Because the y* variables are not directly observed, not all of these parameters are identifiable. Instead of assuming ψ_εε = 1 for all time points, as in Gibbons-Bock, the first diagonal element of Δ can be fixed to unity, corresponding to the first time point. For the remaining time points, the Δ elements in Equations (21) through (23) are the unrestricted parameters instead of the residual y* variances ψ_εε. The residual variances are not taken as free parameters to be estimated, but can be obtained from the other parameters using Equations (20) through (23). Allowing the Δ parameters to be different across time allows the residual variances to be different across time. With four time points, this adds three parameters to the model relative to the conventional model. The (co)variance-related parameters of the model are in this case the three free Δ elements (not the ψ's) and the three elements of Ψ. With covariates, added parameters are μ_α, μ_β, π_α, π_β, and γ_0, . . . , γ_{K-1}. It can be shown that with four time points, covariances between pairs of residuals at adjacent time points can also be identified in this model (see Muthén & Liu, 1994). As opposed to the case of a single response variable, multiple-indicator models allow for separate identification of the residual variances and the measurement errors of each indicator. In this case, there is the additional advantage that the residual variances for the latent variable constructs η are identified at all time points. Multiple-indicator models would assume equality of the measurement parameters (the τ's and the λ's) for the same response variable across time. In this case, the μ_α intercept is fixed at zero. The Δ matrix scaling is generalized as follows.
ized as follows. The scaling factors of A are fixed at the first time point for all of the indicators to eliminate the indeterminacy of scale for each different y* variable (corresponding to each indicator) at this time point. The scaling factors of A are free for all indicators at later time points so that the measurement error variances are not restricted to be equal across time points for the same response variable.
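The variance expressions in Equations (20) through (23) are simply the diagonal of T Ψ T′ plus the time-specific residual variances. A small sketch with invented parameter values shows how the Δ scaling factors follow from them; none of the numbers below are taken from the chapter.

```python
import numpy as np

# Linear growth design: intercept and slope scores for four time points.
T = np.array([[1, 0], [1, 1], [1, 2], [1, 3]], dtype=float)

# Illustrative parameter values only.
Psi = np.array([[0.5, -0.1],
                [-0.1, 0.1]])              # covariance matrix of alpha/beta residuals
psi_eps = np.array([0.5, 0.6, 0.5, 0.2])   # time-specific residual variances

# Conditional variances of y* at each time point (Eqs. 20-23).
var_ystar = np.diag(T @ Psi @ T.T) + psi_eps

# Delta holds inverse conditional standard deviations; the first element is
# then fixed at unity by rescaling, as described in the identification discussion.
Delta = 1.0 / np.sqrt(var_ystar)
Delta_scaled = Delta / Delta[0]
print(var_ystar, Delta_scaled)
```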
3.4. Implementation in Latent Variable Modeling Software

In the case of continuous response variables, Meredith and Tisak (1984, 1990) have shown that the random coefficient model of the previous section can be formulated as a latent variable model. For applications in psychology, see McArdle and Epstein (1987); for applications in education, see Muthén (1993) and Willett and Sayer (1993); and for applications in mental health, see Muthén (1983, 1991). For a pedagogical introduction to the continuous case, see Muthén, Khoo, and Nelson Goff (1994) and Willett and Sayer (1993). Muthén (1983, 1993) pointed out that this idea could be carried over to the binary and ordered categorical case. The basic idea is easy to describe. In Equation (1), α_i is unobserved and varies randomly across individuals. Hence, it is a latent variable. Furthermore, in the product term β_i t_k, β_i is a latent variable multiplied by a term t_k which is constant over individuals and can therefore be treated as a parameter. The t_k's may be fixed as in Equation (6), but with three or more time points they may be estimated for the third and later time points to represent nonlinear growth. More than one growth factor may also be used.
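A brief simulation sketch of this latent variable formulation for a single binary response without covariates may clarify the idea; all parameter values here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 1000, 4
t = np.arange(K)                          # fixed time scores 0, 1, 2, 3

# Growth factors: random intercept alpha and random slope beta.
mu = np.array([0.5, -0.5])                # means of alpha and beta
Psi = np.array([[0.5, -0.1],
                [-0.1, 0.1]])             # covariance of alpha and beta
ab = rng.multivariate_normal(mu, Psi, size=n)

# Latent responses y* = alpha + beta * t_k + eps, then dichotomize at zero.
eps = rng.normal(scale=1.0, size=(n, K))
y_star = ab[:, [0]] + ab[:, [1]] * t + eps
y = (y_star > 0).astype(int)
print(y.mean(axis=0))                     # declining proportions over time
```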
4. ANALYSES

Simulated and real data will now be used to illustrate analyses using the general growth model with binary data.
4.1. A Monte Carlo Study

A limited Monte Carlo study was carried out to demonstrate the sampling behavior of the generalized least-squares estimator in the binary case. The model chosen for the study has a single binary response variable observed at four time points. There is one time-invariant covariate and one time-varying covariate (one for each time point). The simulated data can be thought of as being in line with the Example 1 situation, in which the probability of a problem behavior declines over time. Linear decline is specified with T as in Equation (6). Both the random intercept (α) and the random slope (β) show individual variation, as represented both by their common dependence on the time-invariant covariate (w) and by their residual variation (ζ_α and ζ_β). The α variable regression has a
positive intercept (μ_α) and a positive slope (π_α) for w, whereas the β variable regression has a negative intercept (μ_β) and a negative slope (π_β) for w. In this way, the time-invariant covariate can be seen as a risk factor which, with increasing value, increases α, that is, it increases the initial probability of the problem, and decreases β, making the rate of decline larger (this latter means that the higher the risk factor value, the more likely the improvement is in the problem behavior over time). The regression coefficient for the response variable on the time-varying covariates (the γ_k's) is positive and the same at all time points. The residual variances (ψ_εε) are changing over time, and there is a nonzero residual covariance between adjacent pairs of residuals that is assumed to be equal. The time-varying covariates are correlated 0.5 and are each correlated 0.25 with the time-invariant covariate. All covariates have means of 0 and variances of 1. The population values of the model parameters are given in Table 1.
TABLE 1
Monte Carlo Study: 500 Replications Using LISCOMP GLS for Binary Response Variables

Parameter               True value    n = 1000              n = 250
μ_α                      0.50          0.50 (0.05, 0.05)     0.50 (0.10, 0.09)
μ_β                     -0.50         -0.51 (0.04, 0.04)    -0.50 (0.10, 0.09)
π_α                      0.50          0.50 (0.05, 0.05)     0.50 (0.10, 0.10)
π_β                     -0.50         -0.51 (0.04, 0.04)    -0.50 (0.09, 0.09)
γ_1                      0.70          0.70 (0.06, 0.05)     0.70 (0.11, 0.11)
γ_2                      0.70          0.72 (0.11, 0.11)     0.76 (0.40, 0.29)
γ_3                      0.70          0.71 (0.09, 0.09)     0.71 (0.19, 0.18)
γ_4                      0.70          0.71 (0.08, 0.08)     0.71 (0.18, 0.17)
ψ_αα                     0.50          0.49 (0.13, 0.13)     0.48 (0.29, 0.26)
ψ_βα                    -0.10         -0.09 (0.06, 0.05)    -0.08 (0.13, 0.11)
ψ_ββ                     0.10          0.10 (0.04, 0.03)     0.10 (0.09, 0.08)
ψ_k+1,k                  0.20          0.22 (0.09, 0.08)     0.26 (0.31, 0.22)
Δ_11                     1.00          0.99 (0.13, 0.13)     1.01 (0.29, 0.25)
Δ_22                     1.00          1.00 (0.11, 0.11)     1.02 (0.25, 0.22)
Δ_33                     1.00          1.00 (0.11, 0.11)     1.03 (0.25, 0.23)
χ² Average (df = 15)                  14.75                 15.51
SD                                     5.53                  5.61
5% Reject proportion                   5.2                   5.4
1% Reject proportion                   1.0                   2.0

Note. In parentheses are empirical standard deviations and mean estimated standard errors. df, degrees of freedom; SD, standard deviation.
Two sample sizes were used, a larger sample size of n = 1000 and a smaller sample of n = 250. Multivariate normal data were generated for y* and x, and the y* variables were dichotomized at zero. The generalized least-squares estimator was used. The parameter values chosen (see Table 1) imply that the proportion of y = 1 at the four time points is .64, .50, .34, and .25. The parameters estimated were μ_α, μ_β, π_α, π_β, γ_1, γ_2, γ_3, γ_4, ψ_αα, ψ_βα, ψ_ββ, ψ_k+1,k (a single parameter), [Δ]_11, [Δ]_22, and [Δ]_33. The threshold parameter τ was fixed at zero and the scaling factor [Δ]_00 was fixed at one. As discussed in section 3.3, the four residual variances ψ_εε are not free parameters to be estimated, but they are still allowed to differ freely across time (their population values are .5, .6, .5, and .2) because the Δ parameters are free. The degrees of freedom for the chi-square model test is 15. A total of 500 replications were used for both sample sizes in Table 1. Table 1 gives the parameter estimates, the empirical standard deviation of the estimates across the 500 replications, the mean of the estimated standard errors for the 500 replications, and a summary of the chi-square test of model fit for the 500 replications. As seen in Table 1, the estimates for the n = 1000 case show almost no parameter bias, the empirical variation is very close to the mean of the standard errors, and the chi-square test behaves correctly. As expected, the empirical standard deviations roughly double when the sample size is reduced to a quarter, from 1000 to 250. Exceptions to this, however, are the regression slope for the second time point, γ_2, and the residual covariance ψ_k+1,k. The cause for these anomalies needs further research. In these cases, the standard errors are also strongly underestimated. In the remaining cases, the standard errors agree rather well with the empirical variation, with perhaps a minor tendency to underestimate the standard errors for the (co)variance-related parameters of ψ and Δ. At n = 250, the variation in the regression intercept and slope parameter (μ and π) estimates is low enough for the hypotheses of zero values to be rejected at the 5% level. For the (co)variance-related parameters of ψ, however, this is not the case, and the Δ parameters also have relatively large variation. The chi-square test behavior at n = 250 is quite good.
4.2. Analysis of Example 2 Data

The model used for the preceding simulation study will now be applied to the Example 2 data of neurotic illness as described previously (for more details, see Henderson et al., 1981). Each of the four response variables will be modeled separately. They can also be analyzed together as multiple indicators of neurotic illness, but this will not be done here. Previous longitudinal analyses of these data were done in Muthén (1983, 1991). Summaries of the data are given in Table 2.
TABLE 2
Descriptive Statistics for Example 2 Data

Response variables (percentage yes)
                 Time 1   Time 2   Time 3   Time 4
Anxiety           26.4     16.5     15.6     16.0
Depression        25.5     14.7     17.7     13.9
Irritability      40.3     31.2     28.1     29.0
Nervousness       24.2     19.0     16.0     15.6

Covariates
         Means   Variances
N         9.31     20.66
L1        3.86      6.54
L2        3.17      5.89
L3        2.58      4.90
L4        2.42      5.27

Correlations
        N      L1     L2     L3     L4
N      1.00
L1     0.22   1.00
L2     0.16   0.54   1.00
L3     0.18   0.50   0.49   1.00
L4     0.21   0.53   0.49   0.51   1.00
As is shown in Table 2, there is a certain drop from the first to the remaining occasions in the proportion of people answering yes to the neuroticism items. There is also a corresponding drop in the mean of the life event score. Because the latter is used as a time-varying covariate, this means that the data could be fit by a model that does not include a factor for a decline in the response variable, but which instead uses only a random intercept factor model. Given previous analysis results, gender is dropped as a time-invariant covariate. Only the N score, the long-term susceptibility to neurosis, will be used to predict the variation in the random intercept factor. Two types of models will be fit to the data. First, the general binary growth model will be fit, allowing for across-time variation in the latent response variable residual variance and nonzero covariances between pairs of residuals (the residual covariances are restricted to being equal). Second, the conventional binary growth model, in which these features are not allowed for, will be fit as a comparison. In both cases, the same quantities as in Table 1 will be studied, along with two types of summary statistics. One summary statistic is R², that is, the proportion of variation in the α factor accounted for by N. A second statistic is the proportion that the α factor variation makes up of the total variation in the latent response variable y*, calculated at all time points. The α factor
represents individual variation in a neurotic illness trait, variation that is present at all time points. In addition to this variation, the y* variation is also influenced by time-specific variation caused by time-varying, measured covariates (the L's) and time-specific unmeasured residuals (the ε's). In this way, the proportion is a time-varying measure of how predominant the trait variation is in the responses. Table 3 shows the results for the general binary growth model. The model fits each of the four response variables very well. As expected, the N score has a significantly positive influence (π_α) on the random intercept factor, and the L scores have significantly positive influences (the γ's) on the probability of yes answers for the response variables. For none of these four response variables is the residual covariance significantly different from zero. Note, however, from the simulation study at n = 250, that the variation in this estimate is quite large and that a large sample size is required to reject zero covariance. As shown in the simulation study, the point estimate of the covariance may be of reasonable magnitude.
TABLE 3
Analysis of Example 2 Data Using the General Growth Model

              Anxiety         Depression      Irritability    Nervousness
μ_α          -1.46 (0.21)    -2.56 (0.34)    -1.21 (0.20)    -2.43 (0.27)
π_α           0.06 (0.01)     0.13 (0.02)     0.06 (0.01)     0.14 (0.02)
γ_1           0.08 (0.03)     0.14 (0.04)     0.09 (0.03)     0.08 (0.03)
γ_2           0.05 (0.02)     0.02 (0.03)     0.08 (0.02)     0.03 (0.02)
γ_3           0.06 (0.03)     0.12 (0.03)     0.07 (0.02)     0.05 (0.03)
γ_4           0.03 (0.02)     0.04 (0.05)     0.06 (0.02)     0.01 (0.03)
ψ_αα          0.31 (0.08)     0.39 (0.10)     0.27 (0.07)     0.69 (0.09)
ψ_k,k+1      -0.01 (0.05)    -0.12 (0.10)    -0.04 (0.04)    -0.11 (0.06)
Δ_11          1.37 (0.19)     0.88 (0.15)     1.41 (0.22)     1.01 (0.11)
Δ_22          1.48 (0.21)     1.08 (0.19)     1.43 (0.24)     1.14 (0.11)
Δ_33          1.20 (0.18)     0.88 (0.14)     1.32 (0.22)     1.00 (0.11)
χ²(19)       16.49           21.73           26.02           23.30
p-value        .624            .298            .130            .225
R²_α          0.19            0.47            0.22            0.37
P_1           0.34            0.50            0.30            0.75
P_2           0.31            0.41            0.28            0.52
P_3           0.31            0.42            0.29            0.54
P_4           0.30            0.41            0.28            0.52
The model with nonzero residual covariance is therefore maintained. In this particular application, the estimate is small. The scaling factors of Δ are not significantly different from unity at the 5% level for all but one of the cases. Because in this model there is no β factor, Δ is a function of the α factor residual variance ψ_αα and the residual variance ψ_εε (cf. Eqs. 20-23). Unit values for the Δ scaling factors would therefore indicate that the residual variances are constant over time in this application. Note, however, from the Table 1 simulation results, that the sampling variation in the Δ estimates is quite large at n = 250, which makes it difficult to reject equality of residual variances over time. The Table 1 results also indicate that the point estimates for Δ are good. Table 4 shows the results for the conventional binary growth model. This model cannot be rejected at the 5% level in these applications. The parameter estimates are, in most cases, similar to those for the generalized model of Table 3. Differences do, however, show up in the values for the trait variance proportions, labeled P_1 through P_4 in Tables 3 and 4. Relative to the more general model, the conventional model overestimates these proportions for three out of the four response variables. For example, the conventional model indicates that there is a considerable dominance of trait variation in the response variable Nervousness, with a proportion of .81 for the last three time points (see Table 4).
TABLE 4
Analysis of Example 2 Data Using the Conventional Growth Model

              Anxiety         Depression      Irritability    Nervousness
μ_α          -1.79 (0.17)    -2.40 (0.16)    -1.56 (0.16)    -2.56 (0.23)
π_α           0.07 (0.01)     0.12 (0.01)     0.08 (0.01)     0.15 (0.02)
γ_1           0.12 (0.02)     0.13 (0.02)     0.12 (0.02)     0.09 (0.03)
γ_2           0.06 (0.03)     0.03 (0.03)     0.10 (0.02)     0.03 (0.02)
γ_3           0.05 (0.04)     0.10 (0.03)     0.09 (0.03)     0.03 (0.03)
γ_4           0.04 (0.03)     0.05 (0.03)     0.07 (0.03)     0.01 (0.03)
ψ_αα          0.50 (0.51)     0.31 (0.06)     0.42 (0.06)     0.72 (0.05)
χ²(23)       26.12           26.29           33.97           29.02
p-value        .295            .287            .066            .180
R²_α          0.17            0.49            0.24            0.39
P_1           0.50            0.43            0.45            0.78
P_2           0.54            0.47            0.46            0.81
P_3           0.54            0.45            0.47            0.81
P_4           0.54            0.46            0.48            0.81
The more general model of Table 3 points to a much lower range of values for the last three time points, .52 to .54.
5. CONCLUSIONS

This chapter has discussed a general framework for longitudinal analysis with binary response variables. As compared with conventional random effects modeling with binary response, this general approach allows for residuals that are correlated over time and variances that vary over time. It also allows for multiple indicators of latent variable constructs, in which case it is possible to identify separately residual variation and measurement error variation. The more general model can be estimated by a limited-information generalized least-squares estimator. The general approach fits into an existing latent variable modeling framework for which software has been developed. A Monte Carlo study showed that the limited-information generalized least-squares estimator performed well with sample sizes at least as low as n = 250. At this sample size, the sampling variability is not unduly large for the regression parameters of the model, but it is rather high for the (co)variance-related parameters of the model. Analyses of a real data set indicated that key estimates obtained by the conventional model are not always markedly different from those obtained by the more general model, but that the conventional model can lead to quite different conclusions about certain aspects of the phenomenon that is being modeled. The general approach should be of value for developmental studies, in which variables are often binary, or very skewed and essentially binary. The general model allows for a flexible analysis which has so far been used very little with binary responses. A multiple-cohort analysis of this type is carried out in Muthén and Muthén (1995), which describes the development of heavy drinking over age for young adults.
ACKNOWLEDGMENTS This research was supported by Grant AA 08651-01 from NIAAA for the project "Psychometric Advances for Alcohol and Depression Studies" and Grant 40859 from the National Institute of Mental Health.
REFERENCES

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179-197.
Diggle, P. J., Liang, K. Y., & Zeger, S. (1994). Analysis of longitudinal data. Oxford: Oxford University Press.
Fitzmaurice, G. M., Laird, N. M., & Rotnitzky, A. G. (1993). Regression models for discrete longitudinal data. Statistical Science, 8, 284-309.
Gibbons, R. D., & Bock, R. D. (1987). Trend in correlated proportions. Psychometrika, 52, 113-124.
Gibbons, R. D., & Hedeker, D. R. (1993). Application of random-effects probit regression models (Technical report). Chicago: University of Illinois at Chicago, UIC Biometric Laboratory.
Henderson, A. S., Byrne, D. G., & Duncan-Jones, P. (1981). Neurosis and the social environment. Sydney: Academic Press.
Jöreskog, K. G., & Sörbom, D. (1977). Statistical models and methods for analysis of longitudinal data. In D. J. Aigner & A. S. Goldberger (Eds.), Latent variables in socio-economic models (pp. 285-325). Amsterdam: North-Holland.
Laird, N. M., & Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963-974.
Longford, N. T. (1993). Random coefficient models. Oxford: Oxford University Press.
McArdle, J. J., & Epstein, D. (1987). Latent growth curves within developmental structural equation models. Child Development, 58, 110-133.
Meredith, W., & Tisak, J. (1984). "Tuckerizing" curves. Paper presented at the annual meetings of the Psychometric Society, Santa Barbara, CA.
Meredith, W., & Tisak, J. (1990). Latent curve analysis. Psychometrika, 55, 107-122.
Muthén, B. (1983). Latent variable structural equation modeling with categorical data. Journal of Econometrics, 22, 43-65.
Muthén, B. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49, 115-132.
Muthén, B. (1987). LISCOMP. Analysis of linear structural equations with a comprehensive measurement model. Theoretical integration and user's guide. Mooresville, IN: Scientific Software.
Muthén, B. (1991). Analysis of longitudinal data using latent variable models with varying parameters. In L. Collins & J. Horn (Eds.), Best methods for the analysis of change: Recent advances, unanswered questions, future directions (pp. 1-17). Washington, DC: American Psychological Association.
Muthén, B. (1993). Latent variable modeling of growth with missing data and multilevel data. In C. R. Rao & C. M. Cuadras (Eds.), Multivariate analysis: Future directions 2 (pp. 199-210). Amsterdam: North-Holland.
Muthén, B., & Christofferson, A. (1981). Simultaneous factor analysis of dichotomous variables in several groups. Psychometrika, 46, 407-419.
Muthén, B., Khoo, S. T., & Nelson Goff, G. (1994). Longitudinal studies of achievement growth using latent variable modeling (Technical report). Los Angeles: UCLA, Graduate School of Education.
Muthén, B., & Liu, G. (1994). Identification issues related to binary growth modeling (Technical report). Manuscript in preparation.
Muthén, B., & Muthén, L. (1995). Longitudinal modeling of non-normal data with latent variable techniques: Applications to developmental curves of heavy drinking among young adults (Technical report). Manuscript in preparation.
Muthén, B., & Satorra, A. (1995). Technical aspects of Muthén's LISCOMP approach to estimation of latent variable relations with a comprehensive measurement model. Psychometrika.
Rutter, C. M., & Elashoff, R. M. (1994). Analysis of longitudinal data: Random coefficient regression modeling. Statistics in Medicine, 13, 1211-1231.
Stiratelli, R., Laird, N., & Ware, J. H. (1984). Random-effects models for serial observations with binary response. Biometrics, 40, 961-971.
Wheaton, B., Muthén, B., Alwin, D., & Summers, G. (1977). Assessing reliability and stability in panel models. In D. R. Heise (Ed.), Sociological methodology 1977 (pp. 84-136). San Francisco: Jossey-Bass.
Willett, J. B., & Sayer, A. G. (1993). Using covariance structure analysis to detect correlates and predictors of individual change over time. Psychological Bulletin.
Zeger, S. L., & Karim, M. R. (1991). Generalized linear models with random effects: A Gibbs sampling approach. Journal of the American Statistical Association, 86, 79-86.
Probit Models for the Analysis of Limited Dependent Panel Data

Gerhard Arminger
Bergische Universität Wuppertal, Germany
1. INTRODUCTION

Panel data are observations from a random sample of n elements from a population collected over a fixed number T of time points. Usually, n is fairly large (e.g., approximately 5000 households in the German Socio-Economic Panel [GSOEP; Wagner, Schupp, & Rendtel, 1991] or approximately 4000 firms in the 1993 panel of the German bureau of labor) and T is fairly small (T = 10 panel waves in the GSOEP). Usually, the time points t = 1, . . . , T are equidistant. In psychology, panel data are often referred to as repeated measurements; in epidemiology they are referred to as cohort data. In analyzing panel data using regression models, one often finds that the dependent variables of interest are nonmetric, that is, dichotomous, censored metric, and ordered or unordered categorical. Typical examples are found in the following areas:

• The analysis of individual behavior in the labor market. Flaig, Licht, and Steiner (1993) model and test hypotheses about whether a person is unemployed
or not at time t depending on variables such as unemployment at t - 1, length of former unemployment, education, professional experience, and other personal characteristics, using data from the GSOEP. Here, the dependent variable is dichotomous with the categories "employed" versus "unemployed."

• The analysis of the behavior of individual firms. Arminger and Ronning (1991) analyze the simultaneous dependence of changes in output, prices, and stock using data from the German business test conducted quarterly by the IFO Institute in Munich. Here, the dependent variables of change in output (less, equal, and more than at time t - 1) and change in price are ordered trichotomous, and the variable stock is metric as it is measured in production months. The special problem in specifying a model for these data is that trichotomous variables appear on the left and on the right side of regression equations.

• The analysis of marketing data. Marketing data often come in the form of preferences measured on a Likert scale with five or seven ordered categorical outcomes, such as "1 = I like product A very much" to "5 = I don't like product A at all." Again, the variables are ordered categorical, and often a great number of variables must be reduced to a smaller set of variables by using factor analytic models for ordered categorical data. Ordered categorical variables may again appear on the left and on the right side of regression equations.

In the following section, the construction principles of Heckman (1981a, 1981b) for the specification and estimation of dichotomous outcomes in panel data are extended to include ordered categorical and censored dependent variables as well as simultaneous equation systems of nonmetric dependent variables. The parameters of models in which strict exogeneity of error terms holds can be estimated by assuming multivariate normality of the error terms in threshold models and by using conditional polychoric and polyserial covariance coefficients in the framework of mean and covariance structure models for nonmetric dependent variables. These models and estimation techniques have been introduced by Muthén (1984) and extended by Küsters (1987) and Schepers and Arminger (1992). The estimation of special models in which strict exogeneity of error terms does not hold is also briefly discussed. Special attention is given to the problem of initial states. As an example, the trichotomous output variable of a four-wave panel of 656 firms from the German business test conducted by the IFO Institute is analyzed.
2. MODEL SPECIFICATION

2.1. Heckman's Model for Dichotomous Variables

Heckman (1981a, chap. 3.3) considers the following model for an unobserved variable y*_it, i = 1, . . . , n, t = 1, . . . , T, where i denotes the individual and t denotes a sequence of equispaced time points:
y*_it = μ_it + ε*_it.   (1)
The term μ_it is the expected value of y*_it, which itself may be written as a function of the exogenous variables x_it, lagged observed variables y_{i,t-j}, and lagged unobserved variables y*_{i,t-j}:

μ_it = x_it β + Σ_{j=1}^{∞} γ_{t-j,t} y_{i,t-j} + Σ_{j=1}^{∞} λ_{j,t-j} Π_{l=1}^{j} y_{i,t-l} + Σ_{k=1}^{K} φ_k y*_{i,t-k}.   (2)
The error terms ε*_it are collected in a vector and are assumed to be normally distributed:

ε*_i = (ε*_i1, . . . , ε*_iT)′,   ε*_i ~ N(0, Σ).   (3)
A threshold model is used to model the relation between the unobserved variable y*_it and the observed variable y_it:

y_it = 1 if y*_it > 0,   y_it = 0 if y*_it ≤ 0.   (4)
The unobserved variable y*_it is considered to be a disposition or utility that is connected to the observed dependent variable y_it through a dichotomous threshold model with threshold 0. The values y*_it, y_it and the 1 × P vectors x_it are collected in T × 1 vectors y*_i, y_i and the T × P matrix X_i. The random variables {y_i, X_i} are identically and independently distributed, which corresponds to a simple random sample from a population. The model specification consists of determining the structures of the systematic part μ_it and the stochastic part ε*_it of the model. A discussion of the different parts of the model specification for y*_it is found in Heckman (1981a) and in Hamerle and Ronning (1995). Here, only the most important elements of the specification are repeated. The first component of μ_it in Equation (2) is the variation induced by possible time-varying explanatory variables x_it. The P × 1 parameter vector β is time-constant in this model, but can be changed to P × 1 parameter vectors β_t, t = 1, . . . , T that vary over time. The second component of μ_it captures the influence of former states of the observed dependent variable y_{i,t-j}, j ≥ 1, which is called true state dependence. If γ_{t-1,t} = γ_1 ≠ 0 and γ_{t-j,t} = 0 for all j > 1 and t = 1, . . . , T, we have a simple Markov model. Note that the inclusion of former states requires knowledge of the initial states y_i0, y_{i,-1}, . . . , depending on the specification of the parameters γ_{t-j,t}. If the initial states are known and nonstochastic, they can be included in the vectors x_it as additional explanatory variables. If the initial states are themselves outcomes of the process that generates y_it, the distribution of the initial states must be taken into account, as discussed by Heckman (1981b) and in Section 3 of this chapter. Note that the effects of the former states y_{i,t-j} may change for each time point. This is captured by γ_{t-j,t}. In most applications, γ_{t-j,t} is set to γ_{t-j} and almost all of the parameters are set to 0.
The third component of μ_it models the dependence of y*_it on the duration of the state y_{i,t} = 1. Again, the effects of duration may be different for each time length. These different effects are parameterized in λ_{j,t-j}. If λ_{j,t-j} = λ, this is the simple case of a linear duration effect. The fourth component of μ_it models the dependence of y*_it on former values y*_{i,t-j} of the unobserved endogenous variables. Structures that incorporate this component are called models with habit formation or habit persistence. The idea behind such models is that y*_it does not depend on the actual former state of the observed variable, but rather on the former disposition or habit identified with y*_{i,t-j} instead of y_{i,t-j}. If the initial dispositions y*_{i,0}, y*_{i,-1}, . . . , y*_{i,-(K-1)} are known and nonstochastic, they may be included in the list of explanatory variables. Otherwise, assumptions about the distribution of the initial variables y*_{i,0}, y*_{i,-1}, . . . , y*_{i,-(K-1)} must be made. Now turn to the specification of the error term ε*_it in Equation (3). The error term is often decomposed in the form

ε*_it = α_i + ε_it,   (5)
where α_i denotes the error term which varies across individuals but not over time and which may be considered to be an unobserved heterogeneity just as in the metric case (cf. Hsiao, 1986). The values of α_i may be considered as fixed effects for each i or may be considered as random effects such that α_i ~ N(0, σ²_α). In the first case, α_i is an individual-specific parameter. If y_it is modeled by y_it = x_it β + α_i + ε_it and is actually observed, as in the metric case, then α_i may be eliminated by taking the first differences y_it - y_{i,t-1} = (x_it - x_{i,t-1})β + ε_it - ε_{i,t-1}. This technique does not work for nonmetric models such as the probit model. In the case of a dichotomous logit model, the α_i's may be eliminated by conditioning on a sufficient statistic, as shown in Hsiao (1986) and Hamerle and Ronning (1995). If α_i is a random variable, then it is assumed to be uncorrelated with x_it and ε_it. If ε_it has constant variance σ²_ε and is serially uncorrelated, then ε*_i has the typical covariance structure

V(ε*_i) = σ²_α 11′ + σ²_ε I,

a T × T matrix with diagonal elements σ²_α + σ²_ε and off-diagonal elements σ²_α, where 1 is a T × 1 vector of ones and I is the T × T identity matrix. More generally, a serial or a factor analytic structure may be assumed for V(ε*_i). (Details are found in Heckman, 1981a, or Arminger, 1992.) Discussion of the estimation of the parameters of this model under various assumptions is deferred to Section 3.
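A small sketch of this compound-symmetry structure, with purely illustrative variance components:

```python
import numpy as np

def random_effects_cov(T, var_alpha, var_eps):
    """V(eps*_i) = var_alpha * 11' + var_eps * I for a random intercept model."""
    one = np.ones((T, 1))
    return var_alpha * (one @ one.T) + var_eps * np.eye(T)

# Diagonal elements are var_alpha + var_eps, off-diagonal elements var_alpha.
print(random_effects_cov(T=4, var_alpha=0.5, var_eps=1.0))
```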
2.2. Extension to General Threshold Models

We now extend Heckman's (1981a) dichotomous models in a systematic way to censored, metrically classified, and ordered categorical dependent variables and to simultaneous equation systems that allow as dependent variables any mixture of metric and/or limited dependent variables. Only random effect models are considered. The T × 1 vector y*_i of utilities y*_it is formulated as a multivariate linear regression model

y*_i = γ + Π x_i + ε_i,

where x_i is a R × 1 vector of observed explanatory variables, γ is a T × 1 vector of regression constants, and Π is a T × R matrix of regression coefficients. The T × 1 vector of error terms ε_i follows a normal distribution, ε_i ~ N(0, Σ). Note that there is a slight change in notation compared with section 2.1. The R × 1 vector x_i may be interpreted as the vectorized form of X_i in section 2.1 and may additionally include dummy variables denoting the lagged values y_{i,t-1}, y_{i,t-2}, . . . and the duration of observed states. The Heckman model of the form μ_it = x_it β, t = 1, . . . , T, which in matrix form is written as μ_i = X_i β, is then written as μ_i = Π x_i with

Π = diag(β′, β′, . . . , β′),   x_i = (x′_i1, x′_i2, . . . , x′_iT)′,

that is, Π is block-diagonal with β′ in the tth row block.
If the regression parameters are not serially constant, β′ in the first row is replaced by β′_1, in the second row by β′_2, and so forth. Together with the specification of Σ through a model for unobserved heterogeneity and for serial correlation, the preceding specification of y*_i = γ + Π x_i + ε_i yields a conditional mean and covariance structure in the latent variable vector y*_i, with y*_i ~ N(γ + Π x_i, Σ). The model is now extended by allowing not only the dichotomous threshold model of Equation (4), but any one of the following threshold models that maps y*_t onto the observed variable y_t (cf. Schepers, Arminger, & Küsters, 1991); a small code sketch of these mappings follows the list. For convenience, the case index i = 1, . . . , n is omitted.

• Variable y_t is metric (identity relation). Examples are variables such as monthly income and psychological test scores.

y_t = y*_t.   (6)
• Variable y_t is ordered categorical with unknown thresholds τ_t,1 < τ_t,2 < . . . < τ_t,K_t and categories y_t = 1, . . . , K_t + 1 (ordinal probit relation; McKelvey & Zavoina, 1975). Examples are dichotomous variables such as employment (employed vs. unemployed) and five-point Likert scales with categories ranging from "I like very much" (1) to "I don't like at all" (5).

y_t = k if y*_t ∈ [τ_t,k-1, τ_t,k), with [τ_t,0, τ_t,1) = (-∞, τ_t,1) and τ_t,K_t+1 = +∞.   (7)
Note that for reasons of identification, the threshold τ_t,1 is set to 0 and the variance of the reduced form error term σ²_t is set to 1. The parameters in β_t are only identified up to scale. If one considers simultaneous equation models or analyzes two or more panel waves simultaneously, only hypotheses of proportionality of regression coefficients across equations can be tested in general. Hypotheses of equality of regression coefficients across equations can only be tested under additional, and sometimes nontestable, assumptions (Sobel & Arminger, 1992).

• Classified metric variables may be treated analogously to the ordinal probit case, with the difference that the class limits are now used as known thresholds (Stewart, 1983). No identification restrictions are necessary. An important example is grouped income with known class boundaries.

• Variable y_t is one-sided censored with a threshold value τ_t,1 known a priori (tobit relation; Tobin, 1958). The classical example is household expenditure for durable goods, such as cars, when only a subset of the households in a sample acquires a car during a given time interval.

y_t = y*_t if y*_t > τ_t,1,   y_t = τ_t,1 if y*_t ≤ τ_t,1.   (8)
• Variable y_t is double-sided censored with threshold values τ_t,1 < τ_t,2 known a priori (two-limit probit relation; Rosett & Nelson, 1975).

y_t = τ_t,1 if y*_t ≤ τ_t,1,   y_t = y*_t if τ_t,1 < y*_t < τ_t,2,   y_t = τ_t,2 if τ_t,2 ≤ y*_t.   (9)
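The following sketch illustrates these threshold mappings for given latent values y*; the threshold values are purely illustrative.

```python
import numpy as np

def ordinal(y_star, thresholds):
    """Ordered categorical: category k if y* falls in [tau_{k-1}, tau_k)."""
    return np.searchsorted(np.asarray(thresholds), y_star, side="right") + 1

def tobit(y_star, tau):
    """One-sided censoring at the known threshold tau."""
    return np.maximum(y_star, tau)

def two_limit(y_star, tau1, tau2):
    """Double-sided censoring at tau1 < tau2."""
    return np.clip(y_star, tau1, tau2)

y_star = np.array([-1.2, -0.1, 0.4, 1.7])
print(ordinal(y_star, thresholds=[0.0, 1.0]))   # categories 1, 1, 2, 3
print(tobit(y_star, tau=0.0))
print(two_limit(y_star, tau1=-0.5, tau2=0.5))
```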
In the case of general threshold models, several modifications of the dichotomous models for panel data have to be considered. First, the state dependence must be modified for nondichotomous dependent variables. In the case of a censored dependent variable, there should be a dummy variable d_{i,t-j} that takes on the value of 1 if the variable y_{i,t-j} has been observed and a value of 0 if y_{i,t-j} takes on the threshold value. For metrically classified variables and ordered categorical variables, a K × 1 vector of dummy variables d*_{i,t-j} must be defined. The dummy variable d*_{i,t-j,k} equals 1 if y_{i,t-j} falls in category k, k = 2, . . . , K + 1, and is 0 otherwise.
Second, the identification restrictions for the variances (r 2t, t = 1. . . . . T of q-it and for the thresholds collected in the vector "r~t) for each panel wave t must be chosen with great care. In my opinion, the threshold vectors "r~t) should be set equal across panel waves, otherwise the meaning of the categories of an ordered categorical variable is assumed to vary across time. This restriction immediately implies that, except for the first wave, the variances o',2 need not be restricted but can vary across the panel waves. For censored dependent vari2 ables, no restrictions for o-t are necessary. A further extension concerns the specification of models for a system of variables. Instead of considering only one variable over T waves, one may consider a vector Yit of H dependent variables over time. Each component of Yit is denoted by Yith, h - 1 . . . . . H. The vector of dependent variables Yi of the ith observation is then a H . T • 1 vector of dependent variables observed at T time points. Each latent variable Yith at each time point is then mapped onto the observation Yith through a threshold model of the form given previously (different thresholds may be used for each component). The covariance matrix E of e; contains in this case not only the conditional serial covariance structure for each variable Yth but also the conditional covariance structure between the variables across all time points. An example for such a model is found in Arminger and Ronning (1991). e.
e.
3. ESTIMATION METHOD

The estimation of the parameters of the models discussed in section 2 is usually performed with full information or sequential marginal limited-information maximum likelihood (ML) estimation methods. To set up the likelihood equation, an important distinction must be made concerning whether the error terms ε_it are strictly exogenous or not (cf. Keane & Runkle, 1992). ε_it is strictly exogenous if ε_it is assumed to be independent of all present, past, and future values of the explanatory variables. Examples of strictly exogenous error terms are included in the following models, in which x_it is assumed to be scalar and the error terms α_i and ε_it are assumed to be independent of x_it and of each other.
y*_it = β_0 + β_1 x_it + α_i + ε_it,   (10)
y*_it = β_0 + β_1 x_it + γ_1 y_{i,t-1} + ε_it.   (11)
The first model is a model without state dependence y_{i,t-1} but with unobserved heterogeneity α_i. The second model includes state dependence but not unobserved heterogeneity. However, if both models are combined, then the error term ε*_it = α_i + ε_it is not strictly exogenous:

y*_it = β_0 + β_1 x_it + γ_1 y_{i,t-1} + α_i + ε_it.   (12)
Here, ε*_it is correlated with the regressor y_{i,t-1} through the unobserved heterogeneity term α_i. Hence, the likelihood set up in subsection 3.2 must be used for ML estimation, in which α_i is integrated out. However, if state dependence is replaced by habit persistence, then the error term is again strictly exogenous. Let the initial value y*_i0 be given by

y*_i0 = ν_0 + β_0 x_i0 + ε_i0,   (13)

and the following waves by

y*_it = ν_t + β_1 x_it + α_i + ε_it,   (14)

where ε_i0, α_i, and ε_it are independent of each other and of all values x_i0, x_i1, . . . , x_iT. Successive substitution of y*_i0 in the equation for y*_i1, and of y*_i1 in the equation for y*_i2, and so forth, yields the result that the error terms of the reduced form are serially correlated but are independent of all values of x_i0, x_i1, . . . , x_iT.
3.1. Sequential Marginal ML Estimation Under Strict Exogeneity

In this section, strict exogeneity of errors is assumed. For general mean and covariance structures, it is assumed that a P × 1 vector y*_i of latent dependent variables follows a multivariate normal distribution with conditional mean and covariance:

E(y*_i | x_i) = γ(θ) + Π(θ) x_i,   V(y*_i | x_i) = Σ(θ).   (15)
In the analysis of panel data, P equals T if a univariate dependent variable is analyzed or P equals H . T if a multivariate dependent variable is analyzed. ~/(O) Is a P x 1 vector of regression constants and 1-I(O) is a P X R matrix of reduced form regression coefficients, x i Is a R x 1 vector of explanatory variables, 2~(O) is the P x P covariance matrix of the errors of the reduced form, and O is the q x 1 vector of structural parameters to be estimated. The reduced form parameters ~/(O), II(O), and ~ ( 0 ) are continously differentiable functions of a common vector O. Typical examples are simultaneous equation systems Yi = Byi + Fxi + ~i
with
~i "~ N(0, f~),
(16)
with the reduced form parameters II(O) = ( I -
B)-~F
and
~(O) = (I - B)- 111(1 - B)-
I',
(17)
and confirmatory factor analysis Yi = Axli + ~-i
with
rli "" N(0, ~ )
and
ei ~" N(0, |
(18)
and the reduced form parameters I I ( O ) = 0,
E(O) = A ~ A ' + |
(19)
3. Probit Model and Limited Dependent Panel Data
63
In the first example, O consists of the structural parameter matrices B, F, and 11. In the second example, 0 consists of A, ~, | The estimation of the structural parameter vector and the threshold parameters collected in 0 from the observed data vector y; proceeds in three stages. This section is based on Schepers et al. (1991). Computation of the estimates with the MECOSA program is described in Schepers and Arminger (1992). 1. In the first stage, the threshold parameters "r, the reduced form coefficients y and II of the regression equation, and the reduced form error variance 00,2 of the tth equation are estimated using marginal ML. Note that this first stage is the estimation of the mean structure without restrictions as in Equation (15). The parameters to be estimated in the tth equation are the thresholds denoted by the vector % the regression constant denoted by y,, the regression coefficients, 2 The that is, the tth row of H denoted by Hr., and the variance denoted by 00,. marginal estimation is performed using ordinary univariate regression, tobit, and ordinal probit regression. 2. In the second stage, the problem is to estimate the covariances of the error terms in the reduced form equations. Note that in this stage the covariances are estimated without parametric restrictions. Because the errors are assumed to be normally distributed and strongly consistent estimators of the reduced form coefficients have already been obtained in the first stage, the estimation problem reduces to maximizing the loglikelihood function
10(000) : k In P(Yit, Yij ]Xi' "r,, "~t, fit . ,^2~,, 2rj, yj, ^ flj., 6.2, % 1 , ^
(20)
i=1
in which P(Yi,, Yijlxi, "r,, ~/,, I],., or,^2,?j, %, (Ij., 6"2, 000)is the bivariate probability of Yit and Yiy given x i and the reduced form coefficients. A typical example of this bivariate probability is the case when Yt and yj are both ordinal. Then the probability that Yit = k and yij = 1 is given by P(Yit --- k, Yij = l[xi) -- f'~,ll,') s
~ ( Y t ' * Yj* ] ~-Lit' 00t' "2 ~Lij' 002 , 00o)dyj*d y , *,
(21)
^ 2 p~a, ~ 000)is in which ~i, = ~/, + II,.xi, O~ij = ~/j + (-lj.xi and 0(y,,* yj.* I tx,, 00,, the bivariate normal density function. Note that in the ordinal case 6.2 = 6.2 = 1. Hence, 000 is a correlation coefficient called the polychoric correlation coefficient. The loglikelihood function lo(00tj ) has to be modified accordingly if variables with other measurement levels are used. Note that the covariances 000 are the covariances of the error terms in the equations for y,, t = 1. . . . P conditional on x i. The estimated thresholds ?,, the reduced form coefficients ~/, and lit., the vailances 00,, ^ 2 and the covariances 6.o from all equations are then collected in a vec-
64
GerhardArminger
tor ~,, which depends on the sample size n. For the final estimation stage, a strongly consistent estimate of the asymptotic covariance matrix W of ~,, is computed. This estimate is denoted by I,V,,. The asymptotic covariance matrix W is difficult to derive because the estimates of 6-u of the second stage depend on the estimated coefficients 'Tf, ~ f = t, j of the first stage. The various elements of the asymptotic covariance matrix W are given in Ktisters (1987). The estimate ff'n is computed in MECOSA by using analytical first order and numerical second order derivatives of the first and second stage loglikelihood function. 3. In the third stage, the vector of K of thresholds, the reduced form regression coefficients, and the reduced form covariance matrix is written as a function of the structural parameters of interest, collected in the parameter vector O. The parameter vector 0 is then estimated by minimizing the quadratic form
1-~If.,O'~,
Q,,(O) - ( ~ , , -
K(O))tW,-~I(I~n-K(O)),
(22)
which is a minim distance approach based on the asymptotic normality of the estimators of the reduced form coefficients. The vector ~,, is asymptotically normal with expected value K(O) and covariance matrix W. Because W, is a strongly consistent estimate of W, the quadratic form Qn(O) is centrally xZ-distributed with p - q degrees of freedom if the model is specified correctly and the sample size is sufficiently large. The number p indicates the number of elements in ~:,,, and q is the number of elements in O. The computation of Wn is quite cumbersome for models with many parameters. The function Q,,(a3) is minimized using the Davidon-Fletcher-Powell algorithm with numerical first derivatives. The program MECOSAfollows these three estimation stages. In the third stage, the facilities of GAUSS are fully exploited to estimate parameters under arbitrary restrictions. The parameter vector K(O) may be defined using the matrix language and the procedure facility of GAUSS. Consequently, arbitrary restrictions can be placed on -r(O), ~/(~), II(O), and ~(O), including the restrictions placed on V(%* = ~(0) in the analysis of panel data. The MECOSAprogram provides estimates of the reduced form parameters II(O) in the first stage by regressing all dependent variables in Yi = (Y/~. . . . . Y/T)' on all regressors collected in xi without the usual restriction that the effect of xi, o n Yi,t-j, J >- 1 is 0. This restriction has to be put into the program in the third stage. If the model includes lagged dependent variables, it may be necessary to restrict the parameter estimates already in the first stage. In this case, one can trick the program into restricted estimation in the first stage by running the first stage of MECOSA T times separately for each Yit which is then regressed on xit. In the second stage, the standard batch input of MECOSA has to be corrected to deal with the T estimation results from the first stage. The polychoric covariances and/or correlation coefficients of the reduced form are then computed under the restrictions of the first stage. :,r:.
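As an aside on the second estimation stage described above, the bivariate probability in Equation (21) is a rectangle probability under a bivariate normal distribution. The sketch below evaluates one such cell probability for two dichotomous variables; the conditional means, thresholds, and correlation are invented, and infinite integration limits are approximated by a large finite bound.

```python
import numpy as np
from scipy.stats import multivariate_normal

def cell_prob(lo, hi, mean, corr):
    """P(lo1 < y_t* <= hi1, lo2 < y_j* <= hi2) for standardized bivariate normal errors."""
    cov = np.array([[1.0, corr], [corr, 1.0]])
    F = lambda a, b: multivariate_normal.cdf([a, b], mean=mean, cov=cov)
    # Inclusion-exclusion over the four corners of the rectangle.
    return F(hi[0], hi[1]) - F(lo[0], hi[1]) - F(hi[0], lo[1]) + F(lo[0], lo[1])

BIG = 8.0                       # stands in for +infinity
tau_t, tau_j = 0.0, 0.0         # single thresholds: the dichotomous case
mu = [0.3, -0.2]                # conditional means gamma + Pi x for one observation
# Probability that both observed variables fall in the upper category.
p22 = cell_prob(lo=(tau_t, tau_j), hi=(BIG, BIG), mean=mu, corr=0.4)
print(p22)
```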
3.2. ML Estimation for a Model with State Dependence and Unobserved Heterogeneity

A model with state dependence and unobserved heterogeneity α_i ~ N(0, σ²_α) for a dichotomous outcome y_it ∈ {0, 1} will now be considered. Such a model has been found to be useful in labor market research, in which y_it denotes employment, which may depend on former employment status, on observed exogenous variables, and on unobserved variables collected in α_i that vary across individuals but not across time. The initial value y_i0 is assumed to be fixed. The model is written as

y*_it = β′x_it + γ y_{i,t-1} + α_i + ε_it.   (23)
The ε_it are assumed to be serially independent, where ε_it ~ N(0, σ²_ε), and are assumed to be independent of α_i. If the variance σ²_ε is set to 1, then the conditional individual likelihood function for β and γ given α may be formulated as

L_i(β, γ | η) = Π_{t=1}^{T} [Φ(β′x_it + γ y_{i,t-1} + σ_α η)]^{y_it} [1 - Φ(β′x_it + γ y_{i,t-1} + σ_α η)]^{1-y_it},   (24)
where Φ(·) is the standard normal cumulative distribution function and η ~ N(0, 1), such that α = σ_α η. Hence, L_i(β, γ | η) is the product of T conditional probit likelihood functions. To estimate β, γ, and σ_α, the likelihood function is set up by integrating over η and computing the product over all individuals i = 1, . . . , n:

L(β, γ, σ_α) = Π_{i=1}^{n} ∫_{-∞}^{∞} L_i(β, γ | η) φ(η) dη.   (25)
3.3. The Problem of Initial States For the estimation of the general Heckman model, it was assumed that the initial states for a model with state dependence or with habit persistence are known and nonstochastic or that information about the initial states could be used in the list of explanatory variables. If this is not the case, one must assume that either a new process begins at the first wave of the panel or that the error terms
66
GerhardArminger
of the process for the initial states are uncorrelated with the error terms of the T panel waves (cf. Heckman, 1981 a, 198 lb). Note that the first assumption holds for the attrition process in a panel where the dependent variable is the participation of a person in panel wave t given that this person has participated in wave 1. At the first wave, all persons participate. Hence, the initial states are known and a new process starts at time 1. Heckman (1981b) considers the following special case of the initial states problem: 6"it-- 6 -+- "YYi t--~ -+- O~i -1- ~-it
Yit =
1 if ' y i . t > 0 0 if Yir <- 0.
with
(26)
The random effects are oLi "~ i~(0, (7a), 2 E(oLi~-it ) = O, ~-it --~ I~(0, o'~) with o'~2 = 1. In this model, the only initial value is Y;o. What happens if Y/o is not known and nonstochastic but is determined by the same process as Yit for t = 1. . . . . T? In this case, Yio is dependent on oti and the sample conditional likelihood function for Yit, t - 1 . . . . . T given Yio is given by integration over the unobserved variable a which may be written as a -- o'~xl with ~1 ~"- I~(0, 1):
L([3o, ~l, cry) =
i= 1
~
zc t =
(Op(~3 q- ~lYi,t-1
+ O'aT]) y`'
(1 -- Oi~(~ + "YYi,t-1 -k- O'oLTI))l-yi' )<
P(Yio [cr~xl)OP(rl)dxl,
(27)
where cI)(.) is the standard normal density. The term P(Yiol oO is the probability that the random variable Yio is either > 0 or-< 0. Possible specifications of P(Yiola) by introducing restrictions on the process generating the exogenous variables xit and application of the estimation method of Kiefer and Wolfowitz (1956) are discussed by Heckman (1981b). However, for all practical purposes this estimation method is too cumbersome. As a first simple solution for this estimation problem, Heckman proposes the estimation of a i as a parameter by ML methods. This method, however, yields only consistent estimates of the structural parameters 13 and "y if T ~ w (cf. Andersen, 1973; Neyman & Scott, 1948), which cannot be assumed for panel data. The limited Monte Carlo evidence given by Heckman (1981b) for T = 8 panel waves in the model of Equation (27) is certainly not sufficient to recommend the ML estimation of cr i in conjunction with 13 and "y as a general solution. The second simple solution proposed by Heckman (1981b) is the substitution of P(Yiola) by F(xio~) ), where Xio is a vector of explanatory variables for Yio and F(.) is the distribution function of the random variable Yio = Xio ~ -at- ~-iO"
(28)
3. Probit Model and Limited Dependent Panel Data
67
This ad hoc solution is simply the attempt to replace the unobserved heterogeneity in Y;o by observed heterogeneity in X~o. The error term is allowed to be correlated with OLi and % for t = 1. . . . . T without restrictions to capture the serial correlation between eio and % = o~; + Eit induced by o~;. In practice, this procedure amounts to the inclusion of Yio in the vector of dependent variables and additional parameters in the augmented covariance matrix for E i = ( E i 0 , 9 If %9 = oLi + % with V(c~i) = O"2 and V(ei, ) = O"G, 2 then the augell9. . . . . eiT)'. mented covariance matrix is given by 2 O'G0
2 0"2 + O'e
t3"10
v(~2) =
~2o
~
9 0"70
"2 O-oL
~
2+
"2 O-oL
~
2
.
9 9 9
(29)
2 q_ 2 O-c~ O"e
The limited Monte Carlo simulation of Heckman (1981b) shows that this approach works better than the simultaneous ML method to yield consistent estimates of 13 and y. This solution for the initial states problem depends crucially on the specification of a model for the initial states. The key idea is to specify a model for the initial states that gives as good predictions as possible. The model for the initial states need not be the same model as for Yit, t - 1 . . . . . T. The errors eio, ei_ j, ei_ K are allowed to correlate freely with %. If the model for the initial states is misspecified, the simple solution will work poorly 9 A problem that is easier to handle occurs if habit persistence instead of state dependence is taken into account. Assume that the following model is formu2 g ( E i t ) _. o e2, and E(Eitq.is ) "-- 0 for t 4: s: l a t e d w i t h Eit* __ ot i -k- q-it, W ( ~ = O-c~, Yit = xit[3 + ~/Yi,t-I + eit,
t = 1. . . .
T.
(30)
Furthermore, we assume like Bhargava and Sargan (1983) for the metric case, that the start value Yio has been generated by the same process a s Yi,t, that is, Yio = Xio~ -k- ~/Yi,- 1 + Eio.
(31)
This implies that y*go is determined by :xz
Yio = ~
~/k(Xi,-k[3 + e/,-k) = Yio q- ~i,-k,
(32)
k=O .._~
k *
where ~i.-k k=O ~/ e/,-k" Assume that the optimal predictor of Yio given all observed values x,,, t - 1. . . . . T of the exogenous variables is given by Yio -- X i ~ '
(33)
68
GerhardArminger
where 8 is the vector of regression coefficients for the vector Then Y;o may be written as
x i -- (Xio,
Xi
I,
9
9
9
,
x;r).
Yio =
(34)
Yio -nt- RiO'
where Uio ~ I~ o"2) and Uio is independent of oti and %. The model for the augmented vector Yi - (Y;o, Y T ' ) ' can then be written in the form of a simultaneous equation system: Yi = B y i
+ Fx;
(35)
+ e. i ,
with the (T + 1) X (T + 1) parameter matrix B and the (T + l) X q parameter matrix F" 0 ~ B=
0 o
. . . ...
0 o
0 o
y
...
0
0
,
F=
8 0
0 13
0 0
0
0
13 . . .
9
,
o o . . . y o
oo
. . . ...
0 0 0
(36)
.
o...
j3
The vector of explanatory variables )~i is "~i = (X/o, Xil ..... X i T ) ' . The covariance matrix V({~) of the augmented vector e i = (eio, e i) is given by the elements (cf. Bhargava & Sargan, 1983): 2 O'er -- ~ 2
v~176 =1 VOt - -
2 F O"u
2 O'cx
1-y 2+
V tt = O'oL V,s -
t
2 O'e 1 - - ,,~2
2
er a ,
2
O'e ,
t =
1, . . . , T
t 4= s 9
The reduced form parameters estimated in the first two stages of MECOSA are given by H=(I-B)-~F
and
~=(I-B)-~V(~)(I-B)
-~'.
(37)
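A small sketch of computing the reduced form parameters in Equation (37) from structural matrices B, Γ, and V(ε̃); the matrices below are illustrative for one initial wave plus two panel waves and do not correspond to the empirical example.

```python
import numpy as np

def reduced_form(B, Gamma, V_eps):
    """Pi = (I - B)^{-1} Gamma and Sigma = (I - B)^{-1} V(eps) (I - B)^{-1}'."""
    A = np.linalg.inv(np.eye(B.shape[0]) - B)
    return A @ Gamma, A @ V_eps @ A.T

# Illustrative structural matrices: gamma is the habit-persistence coefficient,
# delta predicts the initial wave, beta is the effect of the wave-specific x.
gamma, beta = 0.6, 0.4
delta = np.array([0.2, 0.1, 0.1])
B = np.array([[0.0, 0.0, 0.0],
              [gamma, 0.0, 0.0],
              [0.0, gamma, 0.0]])
Gamma = np.array([delta,
                  [0.0, beta, 0.0],
                  [0.0, 0.0, beta]])
V_eps = np.array([[1.2, 0.30, 0.25],
                  [0.30, 1.5, 0.50],
                  [0.25, 0.50, 1.5]])
Pi, Sigma = reduced_form(B, Gamma, V_eps)
print(Pi, Sigma, sep="\n")
```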
4. ANALYSIS OF PRODUCTION OUTPUT FROM GERMAN BUSINESS TEST DATA

To illustrate the model in section 2 and the estimation methods of section 3.1, I have analyzed business test data from the IFO Institute. Four waves obtained in August and November 1987 and February and May 1988 from 656 German firms are presented. The variables and codes used in the analysis are given in Table 1.
3. Probit Model and Limited Dependent Panel Data
69
TABLE 1 Questions and Variables from the IFO Business Test
Question Our domestic production activity concerning XY has been with regard to the month before brisker (3), unchanged (2), weaker
Measurement level
Variable in model
Variable name
Ordinal
Ay, = y, - y,_ l
Output O
(1). At the moment, our stock of unsold finished products of XY corresponds to: no stock, less than 0.5/4, 1/4 . . . . . 6/4, more than 6/4, o r . . . months of production.
Metric
At the moment, our stock of raw materials for the production of XY corresponds to: no stock, less than 0.5/4, 1/4 . . . . . 6/4, more than 6/4, o r . . . months of production.
Metric
At the moment, we feel that our stock of orders of XY is relatively big (e.g., extended delivery period) (3), stayed the same (2), be rather unfavorable (I).
Ordinal
a,-
Under elimination of mere seasonal fluctuations, our business situation for XY during the coming 6 months will be rather favorable (3), stay more or less the same (2), be rather unfavorable (1).
Ordinal
d,,,+, - d,
Business expectation GL
With regard to the month before, for us the demand situation (at home and abroad) has improved (3), stayed the same (2), decreased ( 1).
Ordinal
Ad,=d,-d,_l
Demand a t t D
Number of employes in the business (grouped).
Metric
In k
LNE
Production activity at time t - 1 with 1 if Ay,_ ~ = 1 and 0 otherwise.
Dummy
u,
Output OL1
Production activity at time t - 1 with 1 if Ay,_ ~ = 3 and 0 otherwise.
Dummy
v,
Output OL2
ft-i
Stock of finished products LFP
rt rt-l
Stock of raw material LRP
f,
In
a~'
Order stock AB
70
GerhardArminger
The variables are indexed by the number of the panel wave; hence O o, LFP o, LRP o. . . . refer to the variables at time 0, and O~, LFP~, LRP~ . . . . refer to the variables at time 1. The variables O (output), AB (order stack), GL (business expectation), and D (demand) are only measured as ordered categorical variables, the variables LFP, LRP, and LNE are metric, and OL 1 and OL2 are d u m m y variables. The model that is specified now is set out in Arminger and Ronning (1991). The process described by the model is conditional on wave 0, hence the process is not modeled for this wave. AYt = P~t + ~3tl(at-1 -- a t - l )
~rt-1
+ [3,2(dt,t+ 1 - dr) + ~,3st + ~/tl( In K)
(
+~/,3 In ~
+
'~t4blt
+
"~t5Vt
+%t=
1, 2, 3,
(38)
Where e, --- In(0, cr2) is assumed to be serially uncorrelated. This model describes the change in output at time t as a function of the stock of orders at time t - 1, business expectation at time t, the number of employees, the relative change of stock in raw material from t - 1 to t, the relative change in the stock of finished products from t - 1 to t, and the states of the output variable at time t - 1. The variable In K, that is, the number of employees, does not change over time. Unobserved heterogeneity is not incorporated. The variable s t is defined as s t = [(dt - dr_ ~) - (d t_ ~,t - d , _ ~)],
(39)
which may be interpreted as a shock or surprise effect. If (dt - d t _ ~ ) > ( d t - ~ . , - d,_,), then the shock is positive, that is, the demand is higher than the past expectation; otherwise, the shock is 0 or negative. The effect of s t may be estimated by setting the parameter of the demand (dt - dt-~) to [3,3 and the parameter of the past business expectation (dr.,-~ - d r - ~ ) to -[3t3. Note that the variables Ay,, (a,_ 1 - a,_ 1), (d,.,_ l - dr), and ( d , - dr_ 1) are only observed at an ordinal scale where the following observation rule is supposed to hold:
Ot =
i
if if if
Ayt_<.r] 1)
(40)
"r I < A y t <-- "r~2l) A y t > "r~21).
The observation rules for AB t, GL t, and D t are analogous. For identification, the first threshold is set to 0. Because AB t, GL t, and Dt are only observed at an ordinal scale, it is assumed that ( a t - l - at*-l), (dt,t-~ - dr), and ( d r - d r - ~ ) are endogenous variables with means conditioned by the exogenous variables In K,
\
In ~
rt-I/
,
In
:,-I
, u t,
71
3. Probit Model and Limited Dependent Panel Data
and v, and a multivariate normal error vector that is uncorrelated with % The model for all endogenous variables in the first wave is therefore given by a0 - a0 d21 - dl d I - do
IX1 [&2 -
d l , o - do Yl - Yo
q-
~4
~5 Yll
-q-
~1,3
Y12
0 0
0 0
0 0
0 0
0
0
0
0
0 [~51
0 ~52
0 [~53 -InK
Y14
Y,5~
/ln
"~31
"~32
'~34
'~/42
~44
Y35 ] Y45 ]
~
'~/41 Y51
Y52
Y54
Y55 /
Ik
\
0 0
a0 - a0 d2, l - d l
0
d I - do
0 0
dl, o - d o Yl - Yo
ra
In
f~
+
.
(41)
u1 P1
The models for wave 2 and wave 3 are constructed in similar ways. Note that the variable GL occurs two times in the first wave as GL o and GL~. In the second wave, GL~ must be taken from the first wave. Hence the whole model consists of 13 equations. The focus here, however, is only on Equations 5, 9, and 13, that is, y~ - Y o , Y 2 - Y ~ , and Y3 -- Y2. The first model does not take into account restrictions by proportionality of coefficients for each wave and unobserved heterogeneity. The parameter estimates obtained from MECOSA are shown in Table 2. The pseudo R2s of McKelvey and Zavoina (1975) show that only a small portion of the variance of the output is explained. The output increases in the second wave in comparison to the first and third waves. Judged by the z-values, the variables stock order (AB) and surprise effect are more important than business expectation (GL). The firms react primarily to shocks of the recent past. If the shock has been positive, more output is produced. The variables stock of raw materials and stock of unsold finished products are of lesser importance than the dependence on the state of the period before. Here, however, only the decrease of output in the past period matters. The covariances between the errors are rather small, indicating that the assumption of uncorrelated errors over time given the former states is correct. In Table 3, the results of the restricted parameter estimation under the hypothesis of proportionality of the regression coefficients except for the constants and the effect of the number of employees are shown. The hypothesis of proportionality is not rejected at the 0.05 test level. The error variance of the second wave is greater in absolute terms than the error variances of the first and second waves as judged from the inverse of the proportionality coefficient ~.
72
GerhardArminger
Unrestricted Parameter Estimates for IFO Output Model
TABLE 2
Explanatory variables
"r~ a"2 IX (a,_~ - a~;~) d,+ ~., - d, s, In K In
Wave 1
2.175 0.604 0.439 0.053 0.413 --0.049
r,
0 (28.989) (1.613) (5.685) (1.046) (10.993) (-- 1.433)
Wave 2
2.175 1.147 0.305 0.089 0.312 --0.023
0 (28.989) (3.281) (4.161) (2.319) (9.844) (--0.707)
0.097 (0.679)
--0.077 (0.481)
Wave 3
2.175 0.599 0.293 0.125 0.385 0.015
0 (28.989) (2.793) (6.236) (2.822) (13.532) (0.468)
0.086(0.632)
Ft-- 1
f, ft-~ u, v,
0.238 (2.973)
-0.031 (-0.369)
-0.058 (-0.796)
-0.543 (-3.040) 0.129 (1.039)
-0.834 (-5.916) -0.204 (-1.697)
-0.227 (-2.386) 0.429 (3.217)
R 2MZ
0.089
0.184
0.116
0.514 O. 127 0.045
0.715 -0.010
0.616
In
Covariances Wave l Wave 2 Wave 3 Note: z-values are in parentheses.
5. CONCLUSION The models discussed in this chapter serve as an introduction to the wide range of models that have been and may be formulated for the analysis of nonmetric longitudinal data with few time points observed in many independent sample elements. Other models include extensions of legit models and of loglinear models for count data (cf. Hamerle & Ronning, 1995). The estimation methods given here have been proven to be useful. However, new approaches for estimation are emerging. Beck and Gibbons (1994) exploit new techniques for highdimensional numerical integration. Muth6n and Arminger (1994) report first results for using the Gibbs sampler to describe the a posteriori distribution of parameters in MECOSA-type models.
ACKNOWLEDGMENTS I am grateful to the IFO Institute Munich for providing the business test data and to Professor G. Ronning of the University of Ttibingen and to R. Jung of the University of Konstanz for preparing
3. Probit Model and Limited Dependent Panel Data
TABLE3
Restricted Parameter Estimates for IFO Output Model
Explanatory variables
-rI "r2 tx (a,_~ - a~"__~) d,+~.,- d, s, In K In
73
Wave 1
2.133 0.850 0.393 0.097 0.413 --0.073
0 (30.454) (2.514) (9.130) (3.3169) (13.255) (--2.292)
Wave 2
2.133 1.212 0.393 0.097 0.413 --0.036
0 (30.454) (3.051) (9.130) (3.3169) (13.255) (--0.881)
Wave 3
2.133 0.731 0.393 0.097 0.413 0.006
0 (30.454) (3.072) (9.130) (3.3169) (13.255) (0.172)
r, rt-l
0.125 (1.505)
0.125 (1.505)
0.125 (1.505)
In f' fr-J u, v,
0.036 (0.774)
0.036 (0.774)
0.036 (0.774)
--0.501 (--7.466) 0.014 (0.224)
--0.501 (--7.466) 0.014 (0.224)
--0.501 (--7.466) 0.014 (0.224)
0.753
0.908
df
15
X2 Statistic
5.781
Note. z-values are in parentheses.
the data. Helpful comments on an earlier version of the chapter have been provided by an unkown reviewer. Comments should be sent to Gerhard Arminger, Bergische Universit~it Wuppertal, Department of Economics (FB 6), D-42097 Wuppertal, Germany.
REFERENCES Andersen, E. B. (1973). Conditional inference and models for measuring. Copenhagen: Mentalhygiejnisk Forsknings Institut. Arminger, G. (1992). Analyzing panel dam with non-metric dependent variables: Probit models, generalized estimating equations, missing data and absorbing states, (Discuss. Pap. No. 59). Berlin: Deutsches Institut for Wirtschaftsforschung. Arminger, G., & Ronning, G. (1991). Ein Strukturmodell ftir Preis-, Produktions- und Lagerhaltungsentscheidungen von Firmen, IFO-STUDIEN. Zeischrift fiir Empirische Wirtschaftsforschung, 37, 229-254. Bhargava, A. and Sargan, G. D. (1983). Estimating Dynamic Random Effect Models from Panel Data Covering Short Time Periods, Econometrica, 51, 1635-1359. Bock, R. D., & Gibbons, R. D. (1994). High-dimensional multivariate probit analysis. Unpublished manuscript. University of Chicago, Department of Psychology. Flaig, G., Licht, G., & Steiner, V. (1993). Testing for state dependence effects in a dynamic model of male unemployment behavior (Discuss. Pap. No. 93-07). Mannheim: Zentrum ftir Europ~iische Wirtschaftsforschung Gmbh. Hamerle, A., & Ronning, G. (1995). Analysis of discrete panel data. In G. Arminger, C. C. Clogg,
74
GerhardArminger
& M. E. Sobel (Eds.), Handbook of statistical modeling for the behavioral sciences. (pp. 401-451). New York: Plenum. Heckman, J. J. (1981a). Statistical models for discrete panel data. In C. F. Manski & D. McFadden (Eds.), Structural analysis of discrete data with econometric applications (pp. 114-178). Heckman, J. J. (198 l b). The incidental parameters problem and the problem of initial conditions in estimating a discrete time-discrete stochastic process. In C. E Manski & D. McFadden (Eds.), Structural analysis of discrete data with econometric applications (pp. 179-195). Hsiao, C. (1986). Analysis of panel data. Cambridge, MA: Cambridge University Press. Keane, M. E, & Runkle, D. E. (1992). On the estimation of panel-data models with serial correlation when instruments are not strictly exogenous. Journal of Business & Economic Statistics, 10(1), 1-29. Kiefer, J., & Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887-906. KiJsters, U. (1987). Hierarchische Mittelwert- und Kovarianzstrukturmodelle mit nichtmetrischen endogenen Variablen. Heidelberg: Physica Verlag. McKelvey, R. D., & Zavoina, W. (1975). A statistical model for the analysis of ordinal level dependent variables. Journal of Mathematical Sociology, 4, 103-120. Muth6n, B. O. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49, 115-132. Muth6n, B. O., & Arminger, G. (1994). Bayesian latent variable regression for binary and continuous response variables using the Gibbs sampler. Unpublished manuscript, UCLA, Graduate School of Education. Neyman, J., & Scott, E. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1-32. Rosett, R. N., & Nelson, F. D. (1975). Estimation of the two-limit probit regression model. Econometrica, 43, 141-146. Schepers, A., & Arminger, G. (1992). MECOSA: A program for the analysis of general mean- and covariance structures with non-metric variables, user guide. Fraunenfeld, Switzerland: SLI-AG. Schepers, A., Arminger, G., & Kiisters, U. (1991). The analysis of non-metric endogenous variables in latent variable models: The MECOSA approach. In E Gruber (Ed.), Econometric decision models: New methods of modeling and applications (pp. 459-472). Heidelberg: Springer-Verlag. Sobel, M., & Arminger, G. (1992). Modeling household fertility decisions: A nonlinear simultaneous probit model. Journal of the American Statistical Association, 87, 38-47. Stewart, M. B. (1983). On least squares estimation when the dependent variable is grouped. Review of Economic Studies, 50, 737-753. Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica, 26, 24-36. Wagner, G., Schupp, J., & Rendtel, U. (1991). The Socio-Economic Panel (SOEP)for GermanyMethods of production and management of longitudinal data (Discuss. Pap. No. 3 l a). Berlin: Deutsches Institut ftir Wirtschaftsforschung.
..P.A.R.T~ Catastrophe Theory
This Page Intentionally Left Blank
Catastrophe Analysis of Discontinuous Development 9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
Hart L. J. van der Maas and Peter C. M. Mo/enaar University of Amsterdam The Netherlands
1. INTRODUCTION The theme of this book is categorical data analysis. Advanced statistical tools are being presented for the study of discrete things behaving discretely. Many relevant social science questions can be captured by these tools, but there is also tension between categorical and noncategorical data analysis which centers on ordinal, interval, and ratio scaled data. Zeeman (1993) presents an original view on this conflict (see Table 1). He distinguishes four types of applied mathematics according to whether things are discrete or continuous, and whether their behavior is discrete or continuous. Of special interest is Pandora's box. In contrast to the other three types of applied mathematics, discrete behavior of continuous things often gives rise to controversies. The mathematical modeling of music, the harmonics of vibrating strings, led to a major debate in the eighteenth century. The foundations of quantum theory are still controversial. A final example is catastrophe theory, which is concerned with the modeling of discontinuities. Researchers in developmental psychology are well aware of the controversial nature of discontinuities in, for example, cognitive development. Categorical Variables in Developmental Research." Methods of Analysis Copyright 9 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.
77
78
H. L. J. van der Maas and P. C. M. Molenaar
TABLE 1 Four Types of Applied Mathematics Things Behavior
Discrete
Continuous
Discrete
Dice Symmetry DISCRETE BOX Finite probability Finite groups
Continuous
Planets Populations TIME BOX Ordinary differential equations
Music, harmony Light Discontinuities PANDORA'S BOX Fourier series Quantum theory Catastrophe theory Waves Elasticity CONTINUOUS BOX Partial differential equations
Note.
From Zeeman (1993).
Note that categorical behavior is not explicitly included in Zeeman's boxes. Categorical data, nominal measurements, should be labeled as discrete, whereas measurements on an ordinal, interval, or ratio level are covered by continuous behavior. Probably, the critical distinction here is in quantitative and qualitative differences in responding. If this distinction is crucial, then the definition of discreteness in catastrophe theory includes categorical data. We will not discuss further the classification of Zeeman. We hope that it illustrates the relationship between categorical data analysis and this chapter, the goal of which is to present our results, obtained by the application of catastrophe theory, to the study of conservation acquisition. Conservation acquisition is a kind of benchmark problem in the study of discontinuous cognitive developmerit. Discontinuous development is strongly associated with the stage theory of Piaget (1960, 1971; Piaget & Inhelder, 1969). The complexity of debate on this theory is enormous (the Pandora box), and in our presentation we will necessarily neglect a large body of literature on alternative models, other relevant data, and severe criticism on Piaget's work. It is common knowledge that Piaget's theory, especially concerning stages and discontinuities, has lost its dominant position in developmental psychology. Yet, we will defend that our results generally confirm Piaget's ideas on the transition of nonconservation to conservation. We start with a short introduction to catastrophe theory, issues in conservation research, and alternative models of conservation acquisition. We proceed by explaining our so-called "cusp" model of conservation acquisition and the possibilities to test this model. The last part of this chapter discusses the experi-
4. Catastrophe Theory
7'0
ments that we conducted to collect evidence for the model. We first explain our test of conservation, a computer variant of the traditional test. Finally, data from a cross-sectional and from a longitudinal experiment are discussed in relation to three phenomena of catastrophe models: sudden jumps, anomalous variance, and hysteresis.
2. CATASTROPHETHEORY There are several good introductions to catastrophe theory. In order of increasing difficulty, we can recommend Zeeman (1976), Saunders (1980), Poston and Stewart (1978), Gilmore (1981), and Castrigiano and Hayes (1993). Catastrophe theory is concerned with the classification of equilibrium behavior of systems in the neighborhood of singularities of different degrees. Singularities are points where, besides the first derivative, higher order derivatives of the potential function are zero. The mathematical basis of catastrophe theory consists of a proof that the dynamics of systems in such singular points can be locally modeled by seven elementary catastrophes (for systems with up to four independent variables). The elementary behavior of systems in singular points depends only on the number of independent variables. In the case of two independent variables, systems characterized by singular behavior can be transformed (by a well-defined set of transformations) to the so-called cusp form. Our preliminary model of conservation acquisition is formulated as a cusp model. In contrast to the well-known quadratic minima, the equilibrium behavior of singular systems is characterized by strange behavior such as sudden jumps and splitting of equilibria. The quadratic minimum is assumed in the application of linear and nonlinear regression models. In each point of the regression line a normal distribution of observed scores is expected. In contrast, singular minima lead to bimodal or multimodal distributions. If the system moves between modes, this change is called a qualitative change (of a quantitative variable). The comparison with regression analysis is helpful. As in regression models, catastrophe theory distinguishes between dependent (behavioral) and independent variables (control variables). These variables are related by a deterministic formula which can be adjusted for statistical analysis. The cusp catastrophe is denoted by
V(X; a, b) = 1/4 X 4 - 1/2 a X
2 -
b X,
(1)
which has as equilibria (first derivative to zero): X 3 -
a X-
b = O.
(2)
80
H. L. J. van der Maas and P. C. M. Molenaar
If we compare the latter function to regression models in the general form of
X=f(a,b),
(3)
we can see an important difference. The cusp function, Equation (2), is an implicit function, a cubic that cannot be formulated in the general regression form. This difference has far-reaching consequences. In the cusp, sudden jumps occur under small continuous variations of the independent variables a and b. In contrast, in the regression models, either linear or nonlinear small continuous variation of independent variables may lead to an acceleration in X but not to genuine discontinuities. Of course f, in the regression model, can be an explicit discontinuous threshold function, but then a purely descriptive position is taken. In catastrophe functions, the equilibrium surfaces are continuous (smooth), hence discontinuities are not built in. In catastrophe theory, qualitative changes in quantitative behavior are modeled in a way that is clearly distinguished from not only (nonlinear) regression models but also from Markov chain models (Brainerd, 1979) and special formulations of Rasch models (Wilson, 1989).
3. ISSUES IN CONSERVATION Conservation is the invariance of certain properties in spite of transformation of form. An example of a conservation of liquid task is shown in Figure 1. The liquid of one of two identical glasses (B) is poured into a third, nonidentical class (C). The child's task is to judge whether the quantity in C is now more than, equal to, or less than in A. A conserver understands that the quantities remain equal, a nonconserver systematically chooses the glass with the highest level as having more. A typical property of conservation tests is that the lowest test score is systematically below chance score.
FIGURE 1. A standard equality item of conservation of liquid. In the initial situation A and B have to be compared; in the final situation, after the transformation, A and C are to be compared (as indicated by the arrows below the glasses).
4. Catastrophe Theory
81
Piaget and co-workers used conservation repeatedly as an example to explain the general equilibration theory. Piaget's idea of epigenetic development implies that children actively construct new knowledge. In conservation development, a number of events can be differentiated: (a) (b) (c) (d)
focusing on the height of the liquid columns only (nonconservation) focusing on the width of the liquid columns only the conflict between the height and the width cue the conflict between the height cue and the conserving transformation (e) constructing new operations, f and g (f) focusing on the conserving transformation (g) understanding of compensation (multiplication of height and width) In Piaget's model, the sequence of events is probably a, b, c, e, f, and g. In the Piagetian view, f and g are connected. Whether d should be included, or that d is connected to c, is unclear. Many authors give alternative sequences and deny the active construction of the new operations. Bruner (1967), for instance, argues that f and g already exist in the nonconservation period but are hampered by the perceptual misleading factor. Peill (1975) gives an excellent overview of the various proposals about the sequence of events a to g. Thousands of studies have been conducted to uncover the secrets of conservation development. Many models have been introduced which have been tested in training settings and in sequence and concurrence designs by test criteria varying from generalization to habituation. Mostly, consensus has not been reached. In our opinion, conservation (or nonconservation) cannot be reduced to a perceptual, linguistic, or social artifact of the conservation test procedure, although perceptual, linguistic, and social factors do play a role, especially in the transitional phase. In a later section, we will explain how we developed a computer test of conservation to overcome the major limitations of clinical conservation testing. An important approach to the study of conservation focuses on conservation strategies (also rules and operations). At present, the best methodology of assessing conservation and several related abilities has been introduced by Siegler (1981, 1986). He uses several item types to distinguish between four rules. These rules roughly coincide with Piaget's phases in conservation development. Rule 1: Users focus on the dominant dimension (height). Rule 2: Users follow rule 1 except when values of the dominant dimensions are equal, then they base their judgment on the subordinate dimension (width). Rule 3: Users do as rule 2 users except when both the values of the dominant and the subordinate dimensions differ, then they guess. Rule 4: Users multiply the dimensional values (in the case of the balance beam) and/or understand that the transformation does not affect the quantity of liquid. Multiplying dimensions (or compensation) is
82
H. L. J. van der Maas and P. C. M. Molenaar
correct on conservation tasks too, but more difficult and rarely used (Siegler, 1981). Siegler's methodology has been criticized mainly for rule 3. Guessing (muddling through) is a rather general strategy of another order. Undoubtedly, it can occur, and should be controlled for. Several alternative strategies have been proposed, for instance, addition of dimensions (Anderson & Cuneo, 1978), maximization (Kerkman & Wright, 1988), and the buggy rule (van Maanen, Been, & Sijtsma, 1989). The validation of these alternatives will take a lot of test items but is not impossible. Another important concept in the study of conservation is that of the cognitive conflict. Piaget himself associated the concept of cognitive conflict to conservation. Most authors agree that conservation performance on Piagetian conservation tasks should be explained as (or can be described as) a conflict between perceptual and cognitive factors. Concepts such as salience, field dependency, and perceptual misleadingness have been used to define the perceptual factor in conservation (Bruner, 1967; Odom, 1972; Pascual-Leone, 1989). Siegler applies it in the dominant/subordinate distinction. We will use this factor as an independent variable in our model. The cognitive factor is less clearly understood. It can be defined as cognitive capacity, short-term memory, or as a more specific cognitive process such as learning about quantities. Also, the very general concept of maturation can be used. In cusp models, one has to choose two independent variables. In the case of conservation, this very important choice is difficult, which is one of the reasons to communicate our cusp model as preliminary. Flavell and Wohlwill (1969) took Piaget's view on the acquisition of conservation as a starting point. Piaget distinguished between four phases: nonconservation, pretransitional, transitional, and conservation. Flavell and Wohlwill (1969) formulated these Piagetian phases in the form of a simple equation: P( + ) = Pa * Pb l-k,
(4)
where P ( + ) is the probability that a child given person parameters Pa and k and item parameter Pb succeeds on a particular task. Pa is the probability that the relevant operation is functional in a given child. Pb is a difficulty coefficient between 0 and 1. The parameter k is the weight to be attached to the Pb factor in a given child. In the first phase, Pa and k are supposed to be 0, and consequently P ( + ) is 0. In the second phase, Pa changes from 0 to 1, whereas k remains 0. For a task of intermediate difficulty, (Pb = .5) P ( + ) = .25. According to Piaget, and Flavell and Wohlwill add this to their model, the child should manifest oscillations and intermediary forms of reasoning. Notice that this does not follow from the equation. The third phase is a period of stabilization and consolidation, Pa is 1 and k increases. In the fourth phase, Pa and k are 1 and full conservation has been reached.
4. Catastrophe Theory
83
Of course, this model also has its disadvantages. It predicts a kind of growth curve that is typical for many models and it predicts little besides what it assumes. In addition, the empirical verification suffers from the lack of operationalization of the parameters. As Flavell and Wohlwill admit, task analysis will not easily lead to an estimation of Pb, Pa is a nonmeasurable competence factor, and k does not have a clear psychological meaning and therefore also cannot be assessed. Yet, the reader will notice that this 25-year-old model has a close relationship to modern latent trait models. The simplest one, the Rasch model, can be denoted as P ( + ) = 1/(1 +
e-(Pa-Pb)).
(5)
Pa is the latent trait and Pb is the item difficulty. Several modifications are known. Parameters for discrimination, guessing, and so on, can be added. Pa and Pb have in both models a similar meaning. In latent trait models, one cannot find a direct equivalent for k, although some latent trait models contain person-specific weights for item parameters. If we compare Equations (4) and (5), we can conclude that Equation (4) is a kind of early latent trait model. The theory of latent trait models has developed to a dominant approach in current psychometrics. Much is known of the statistical properties of these models. The most appealing advantage is that item parameters can be estimated independent of the sample of subjects and person parameters independent of the sample of items. Besides assumptions as local independence, some constraints on the item-specific curves are required. For instance, in the Rasch model, item curves have equal discriminatory power. More complex latent trait models do not demand this, but other constraints remain. The important constraint for the discussion here is that in all latent trait models, for fixed values of Pa and Pb, one and only one value of P ( + ) is predicted. The item characteristic curve suggested by catastrophe theory does not accommodate this last constraint. Hence, we can conclude that the behavior in the Flavell and Wohlwill model, similar to more advanced latent trait models (e.g., Saltus, proposed by Wilson, 1989), does not vary discontinuously. The behavior may show a continuous acceleration in the increase of P(+), but is not saltatory in a strict sense. A second important conclusion is that the model of Flavell and Wohlwill, contrary to their suggestion, does not predict the oscillations and intermediary forms of reasoning in the transitional phases. We will see that the cusp model explicitly predicts these phenomena. This latter point is also of importance with regard to Markov models of Brainerd (1979) and the transition model of Pascual-Leone (1970). In Molenaar (1986) it is explained why such models do not test discontinuities. Of course, many more models on conservation are proposed, often in a nonmathematical fashion; but as far as they concern discontinuity explicitly, it seems that they will be in
84
H. L. J. van der Maas and P. C. M. Molenaar
a c c o r d a n c e with the c a t a s t r o p h e interpretation or they will not, that is, b e l o n g to the class o f m o d e l s d i s c u s s e d in this chapter.
4. THE CUSP MODEL In this section, w e explain the cusp m o d e l in general and its specification as a m o d e l o f c o n s e r v a t i o n . T h e type o f cusp m o d e l w e apply, typical by a rotation o f axes, is also f o u n d in Z e e m a n (1976). It is especially useful if the d y n a m i c s are u n d e r s t o o d in terms o f a conflict. In our application o f the cusp m o d e l , w e use p e r c e p t u a l m i s l e a d i n g n e s s and cognitive capacity as i n d e p e n dent variables and c o n s e r v a t i o n as a d e p e n d e n t variable. T h e m o d e l is r e p r e s e n t e d in F i g u r e 2.
FIGURE 2. The cusp model of conservation acquisition holds two independent and one behavioral variable: perceptual misleadingness, cognitive capacity, and conservation, respectively. The cusp surface is shown above the plane defined by the independent variables. The cusp surface has a folded form at the front that vanishes at the back of the surface. The folding implies that for certain values of the independent variables, three values of the behavioral variable, that is, three modes, are predicted. The mode in the middle is unstable and compelling and therefore called inaccessible. The remaining two modes lead to bimodality. The area in the control plane, defined by the independent variables, for which more behavioral modes exist is the bifurcation set. If the system is in its upper mode and moves from right to the left, a sudden jump will take place to the lower mode at the moment that the upper mode disappears. If the system then moves to the right, another jump to the upper mode will occur when leaving the bifurcation set at the right. This change in position of jumps is hysteresis. The placement of groups R, NC, TR, and C is explained in the text. Van der Maas and Molenaar (1992). Copyright 1992 by the American Psychological Association. Adapted by the permission of the publisher.
4. Catastrophe Theory
85
A three-dimensional surface consisting of the solutions of Equation (2) represents the behavior for different values of the independent variables. The independent variables define the control plane beneath the surface. The bifurcation set consists of independent values for which three models of behavior exists. Two of the three models are stable, the one in the middle is unstable. Outside the bifurcation set, only one mode exists. Changes in independent variables lead to continuous changes in the behavior variable, except for places on the edges of the bifurcation set; there, where the upper or lower mode of behavior disappears, a sudden jump takes place to the other stable mode. In terms of conservation, we interpret this to mean that an increase in cognitive capacity, or a decrease in perceptual misleadingness, leads to an increase of conservation, in most cases continuously and sometimes saltatory. The reverse process is also possible, but is usually not the dominant direction in development. However, artificial manipulation of cognitive capacity (by shared attention tasks) or misleadingness (by stimulus manipulation) should lead to saltatory regressions. This cusp model suggests the discrimination of four groups of children. At the neutral point, the back of the surface, the probability of a correct response equals chance level. Both independent values have low values. At the left front, the scores are below chance level. Perceptual misleadingness is high, capacity is low. In the middle, both independent values are high. Two behavior modes are possible, above and below chance level. Finally, at the right front, the high scores occur. Here, perceptual misleadingness is low and capacity is high. Associated with these states are four groups which we call residual group (R), nonconservers (NC), transitional group (TR), and conservers (C), respectively. The residual group consists of children who guess or follow irrelevant strategies because of a lack of understanding of test instructions or a lack of interest. This group is normally not incorporated into models of conservation development. In this model, it is expected that this group is characterized by low values for both independent variables. The nonconserver and conserver children are included in each model of conservation development. Most models also include a transitional group, defined, however, according to all kinds of criteria. In the cusp model, the transition period is defined as the period when the subjects stay in the bifurcation set. In this period, the possibility of the sudden jump is present. The sudden jump itself is part of this transition phase. The transition includes more than just the sudden jump. There are at least seven other behavioral phenomena typical for the transition period (see section 4.2.).
4.1. Routes Through Control Plane The classification of four groups suggests an expected sequence, but this does not directly follow from the cusp model. We need an additional assumption on
86
H. L. J. van der Maas and P. C. M. Molenaar
the direction of change of independent variables as a function of age (or as a function of experimental manipulation). To get at a sequence of residual, nonconservation, transitional, and conservation, it should be assumed that at first, both factors are low, then, perceptual misleadingness increases, third, cognitive capacity increases too, and fourth, perceptual misleadingness decreases back to a low level. We discuss this assumption because from the literature we know that very young children indeed score at chance level, 4- to 6-year-olds score below chance level, and later on, they score above chance level (McShane & Morrison, 1983). The phases of the model of Flavell and Wohlwill (1969) coincide with the nonconservation, transitional, and conservation sequence. The phase in which very young children score at chance level is not included in their model nor in the model of Piaget. This is partly because of issues of measurement. Guessing is not helpful on the classical Piagetian test of conservation in which valid verbal arguments are required. Other test criteria (judgment-only criterion or measures of looking time and surprise) allow for false positive classifications, whereas Piaget's criterion suffers from false negatives. In the cusp model, we assume the use of the judgmentonly criterion of conservation. What should be clear now is that some assumption on the change of the independent variables is necessary to specify a sequence of behavior. Of course, this is also required in the model of Flavell and Wohlwill (1969).
4.2. Test of the Cusp Model: The Catastrophe Flags The transitional period, associated with the bifurcation set, is characterized by eight catastrophe flags derived from catastrophe theory by Gilmore (1981). Three of these flags can also occur outside the bifurcation set and may predict the transition. Together, they can be used to test the model. These catastrophe flags are sudden jump, bimodality, inaccessibility, hysteresis, divergence, anomalous variance, divergence of linear response, and critical slowing down. Some are well known in developmental research, others only intuitively or not at all. The sudden jump is the most obvious criterion. However, in the case of developmental research, quite problematic. Empirical verification of this criterion requires a dense time-series design. In spite of the many studies on conservation development, a statistically reliable demonstration of the sudden jump in conservation ability is lacking. In the following, we present data that demonstrate the sudden jump. Bimodality is also known in conservation research. In van der Maas and Molenaar (1992), we present a reanalysis of the data of Bentler (1970) which clearly demonstrates bimodality. Although this seems to be the most simple criterion, applicable to cross-sectional group data of the behavior variable only, some is-
4. Catastrophe Theory
87
sues arise. In the prediction of bimodal score distribution, we combine two flags, bimodality and inaccessibility. Inaccessibility is implied by the unstable mode in between two stable modes. In a mixture distribution model (Everitt & Hand, 1981), this inaccessible mode misses. In these models, a mixture of two normal (or binomial, etc.) distributions is fitted on empirical distributions: F ( X ) = p N(Ixl, 0.1) + (1 - p) N ( I x 2 , 0"2) ,
(6)
where N is the normal distribution, p defines the proportions in the modes, and tx and 0. are the characteristics of the modes. This mixture model takes five parameters, whereas a mixture of binomials takes three parameters. The number of components can be varied and tested in a hierarchical procedure. An alternative formulation is found in the work of Cobb and Zacks (1985). They apply the cusp equation to define distributions of a stochastic cusp model: F ( X ) = ~. e -(z4-az2-bz)
Z-
sX-
l,
(7)
where a and b (independent values) define the form of the distribution, s and 1 linearly scale X, and h is the integration constant. The four parameters a, b, s, and 1 are estimated. It is possible to fit unimodal and bimodal distributions by constraints on the parameters a and b and compare the fits. The method of Cobb has some limitations (see Molenaar, this volume), and it is difficult to fit to data (e.g., in computing h). Yet, it takes into account the impact of the inaccessible mode and the computational problems are solvable. A comparison of possible forms of Equations (6) and (7) leads to the conclusion that their relationship is rather complex. In Equation (6), distributions can be fitted to data defined by IX~ = IX2 and 0.1 > > 0.2. Such distributions are not allowed in Equation (7), the modes must be separated by the inaccessible mode. Hysteresis is easily demonstrated in simpler physical catastrophic processes, but is probably very difficult in psychological experimentation. The degree of the hysteresis effect, the distance between the jumps up and down, depends on the disturbance or noise in the system (leading to the so-called Maxwell condition). Later on we discuss our first attempt to detect hysteresis in conservation. Divergence has a close relation to what is usually called a bifurcation. In terms of the chosen independent variables it means that if, in the case of a residual child, perceptual misleadingness as well as cognitive capacity are increased, the paths split between the upper and lower sheets, that is, between the high and low scores. Two children with almost similar start values (both independent variables very low) and following the same path through the control plane can show strongly diverging behavior. Again, this is not easily found in an experiment. The manipulation of only one independent variable is already difficult. Another choice of independent variables (e.g., a motivation factor or
88
H. L. J. van der Maas and P. C. M. Molenaar
optimal conditions factor in another rotation of independent variables) may lead to an empirical test. Anomalous variance is a very important flag. Gilmore (1981) proves that the variance of the behavioral variable increases strongly in the neighborhood of the bifurcation set and that drops occur in the correlation structure of various behavioral measures in this period. In developmental literature, many kinds of anomalous behaviors, from oscillations to rare intermediate strategies, are known. In our view, this prediction, used here as a criterion, is what Flavell and Wohlwill call "manifestations of oscillations and intermediary forms of reasoning." Divergence of linear response and critical slowing down, the last flags, concern the reaction to perturbations. They imply the occurrence of large oscillations and a delayed recovery of equilibrium behavior, respectively. These flags are not studied in our experiments. Reaction time measures are possibly the solution for testing these two flags. The flags differ importantly in strength. Bimodality and the sudden jump, certainly when they are tested by regression analysis and mixture distributions, are not unique for the cusp model. It is very difficult to differentiate acceleration and cusp models by these flags. Inaccessibility changes this statement somewhat. It is unique, but as indicated before, usually incorporated in the bimodality test. Anomalous variance is unique, as other mathematical models do not derive this prediction from the model itself. In this respect, the many instances of oscillations and intermediary forms of reasoning count as evidence for our model. Hysteresis and divergence are certainly unique for catastrophe models. Divergence of linear response and critical slowing down seem to be unique, too. Such a rough description of the strength and importance of each flag is only valid in contrast with predictions of alternative models. In the case of conservation research, bimodality and the sudden jump are thought of as necessary but not sufficient criteria, the others as both necessary and sufficient. The general strategy should be to find as many flags as possible. The demonstration of the concurrence of all flags forms the most convincing argument for the discontinuity hypothesis. Yet, the flags constitute an indirect method for testing catastrophe models. However, the advantage is that, except for divergence and hysteresis, the flags only pertain to the behavioral variable. We suggested a perceptual and a cognitive factor in our cusp model of conservation acquisition, but the test of the majority of flags does not depend on this choice. These flags do answer the question whether a discontinuity, as restrictively defined by catastrophe theory, occurs or not. The definitions of hysteresis and divergence include variation along the independent variables. In the case of the cusp conservation model, we can use variation of perceptual cues of conservation items or divided attention tasks to
4. Catastrophe Theory
89
force the occurrence of hysteresis and divergence. On the other hand, it is a pity that these appealing flags cannot be demonstrated without knowledge of the independent variables. On the other hand, by these flags, hypotheses about the independent variables can be tested.
4.3. Statistical Fit of Cusp Models A direct test of the cusp model of conservation acquisition requires a statistical fit of the cusp model to measurements of the dependent as well as the independent variables. The statistical fit of catastrophe models to empirical data is difficult and a developing area of research. In the comparison of Equations (2) and (3), the cusp models and (nonlinear) regression models, we did not discuss the statistical fit. It is common practice to fit regression models with one dependent and two independent variables to data and to test whether the fit is sufficient. Fitting data to the cusp model is much more problematic. There have been two attempts to fit Equation (2) in terms of Equation (3) (Guastello, 1988; Oliva, Desarbo, Day, & Jedidi, 1987). Alexander, Herbert, DeShon, and Hanges (1992) heavily criticize the method of Guastello. The main problem of Guastello's method puts forth the odd characteristic that the fit increases when the measurement error increases. Gemcat by Oliva et al. (1987) does not have this problem but needs an additional penalty function for the inaccessible mode. A different approach is taken by Cobb and Zacks (Cobb & Zacks, 1985). They developed stochastic catastrophe functions which can be fitted to data. In Equation (7), the basic function is shown. If one fits a cusp model, a and b are, for example, linear functions of observed independent variables. Examples of applications of Cobb' s method can be found in Ta' eed, Ta' eed, and Wright (1988) and in Stewart and Peregoy (1983). We are currently modifying and testing Cobb's method. Simulation studies should reveal the characteristics of parameter estimates and test statistics, and the requirements on the data. At present, we have not fit data of measurements of conservation, perceptual misleadingness, and cognitive capacity directly to the cusp model of conservation acquisition.
5. EMPIRICAL STUDIES 5.1. Introduction In the experiments, we have concentrated on the catastrophe flags. Evidence for bimodality, inaccessibility, and anomalous variance can be found in the literature (van der Maas & Molenaar, 1992). This evidence comes from
gO
H. L. J. van der Maas and P. C. M. Molenaar
predominantly cross-sectional experiments. Analyses of time series from longitudinal experiments have been rarely applied. The focus on cross-sectional research hampers the assessment of the other flags, most importantly, the sudden jump itself. We do not know of any study that demonstrates the sudden jump (or sharp continuous acceleration) convincingly. This is, of course, a difficult task, as dense time series are required. The presented evidence for anomalous variance is also achieved by crosssectional studies. In such studies, conservation scores on some test are collected for a sample of children in the appropriate age range. Children with consistent low and high scores are classified as nonconservers and conservers, respectively. The remaining children are classified as transitional. Then, some measure of anomalous variance is applied (rare strategies, inconsistent responses to a second set of items, inconsistencies in verbal/nonverbal behavior). By an analysis of variance (or better, by a nonparametric equivalent) it is decided whether the transitional group shows anomalous variance more than the other groups. In comparison with the strong demands, in the form of the catastrophe flags, that we put on the detection of a transition, this procedure of detecting transitional subjects is rather imprecise. However, it is a very common procedure in conservation research, and is also applied to decide whether training exclusively benefits transitional subjects. In light of the importance of anomalous variance for our verification, we want a more severe test. For these reasons, a longitudinal experiment is required. Special requirements are dense time series and precise operationalizations for the flags. This appears to be a difficult task when we rely on the clinical test procedure of conservation. Many retests in a short period creates a heavy load on our resources, as well as on the time of the children and the continuation of normal school activities. Moreover, this clinical test procedure has been heavily criticized. We chose to construct a new test of conservation. We describe this test and its statistical properties elsewhere (van der Maas, 1995), therefore only a short summary will be given here.
5.2. Instrument: A Computer Test of Conservation 5.2.1. Items In the clinical test procedure, the child is interviewed by an experimenter who shows the pouring of liquids to the child and asks a verbal explanation for the judgment. Only a few tasks are used and the verbal justification is crucial in scoring the response and detecting the strategy that is applied by the child. The rule assessment methodology of Siegler (1981), discussed previously, has a different approach. Siegler applies many more items that are designed to detect strategies on the basis of the judgments only. Although verbal justifications have an additional value, the level of conservation performance is determined by simple responses to the items only.
4. CatastropheTheory
91
The computer test is based on four out of six item types of Siegler's methodology. We call them the guess equality, guess inequality, standard equality, and standard inequality item types. The guess equality and guess inequality item types compare with Siegler's dominant and equal items. In the guess item types, the dimensions of the glasses are equal. In the equality variant amounts, heights and widths of the liquid are equal before and after the transformation. In the inequality variant, one of the glasses has less liquid, is equal in width, but differs in height. In these item types, the perceptual height cue points to the correct response and should therefore be correctly solved by all children. Consequently, these items can be used to detect children who apply guessing or irrelevant strategies. In view of the criticism of the judgment-only criterion concerning the possibility of guessing, items like this should be included. The standard item types, equality and inequality, compare with the conflict equal and subordinate items of Siegler. The standard equality item is shown in Figure 1. The dimensions of the glasses and liquids differ, whereas the amounts are equal. In the initial situation, the dimensions are clearly equal. Understanding the conserving transformation is sufficient to understand that the amounts are equal in the final situation, although multiplication of dimensions in the final situations suffice as well. The standard inequality item starts with an initial situation in which widths are equal but heights and amounts differ. In the final situation, the liquid of one of the glasses is poured into a glass of such width that the heights of the liquid columns are then exactly equal (see Fig. 3). 5.2.2. Strategies For all of these item types, a large variation in dimensions is allowed. For example, in the item in Figure 3, the differences in width can be made more or less salient. We expect that these variations do not alter the classification of children. Siegler makes the same assumption. A close examination of the results of Ferretti and Butterfield (1986) show that only very large differences in dimensional values have a positive effect on classification. Siegler uses six item types (each four items) to distinguish between four rules. We use instead the two standard item types (Figs. 1 and 3) and classify according to the following strategy classification schema (Table 2).
FIGURE 3. A standard inequality item of conservation of liquid.
92
H. L. J. van der Maas and P. C. M. Molenaar
The conservation items have three answer alternatives: left more, equal, and right more. These three alternatives can be interpreted as the responses highest more, equal, and widest more in the case of equality items, and in the responses equal, smallest more, and widest more in the case of inequality items. The combination of responses on equality and inequality items is interpreted in terms of a strategy. Children who consistently chose highest more on equality items and equal on inequality items follow the nonconserver height rule (NC.h), or, in Siegler's terms, rule 1. Conservers (C) or rule 4 users chose the correct response on both equality and inequality items. Some children will fail the equality items but succeed on the inequality items (widest more). If they prefer the highest more response on the equality items, they are called NC.p, or rule 2 users in Siegler's terms. If they chose the widest more response, they probably focus on width instead of height. The dominant and subordinate dimensions are exchanged (i.e., from a to b in the list of events in conservation development). We call this the nonconserver width rule (NC.w). The nonconserver equal rule ( N C . = ) refers to the possibility that children prefer the equal response to all items, including the inequality items. We initially interpret this rule as an overgeneralization strategy. Children, discovering that the amounts are equal on the equality items, may generalize this to inequality items. If this interpretation is correct, N C . - m a y be a typical transitional strategy. The cell denoted by i l does not have a clear interpretation. Why should a child focus on width in the case of equality items, and on the height in the case of inequality items. This combination makes no sense to us. The same is true for the second row of Table 2, determined by the smallest more response on inequality items. This response should not occur at all, so this row is expected to be empty. The results of the experiments will show how much children apply the strategies related to the nine cells. If a response pattern has an equal distance to two cells, then it is classified as a tie. The number of ties should be small. Notice that ties can only occur when more than one equality and more than one inequality item is applied. The possibility of ties implies a possibility of a formal test of this strategy classification procedure.
TABLE2 Strategy Classification Schema Standard equality items Highest more
Equal
Widest more
Standard
Equal
NC.h/rule 1
NC. =
i1
Inequality
Smallest more
i2
i3
i4
Items
Widest more
NC.p/rule 2
C/rule 4
NC.w
4. Catastrophe Theory
93
The NC.=, NC.w, and the i l rules do not occur in Siegler's classification of rules, although they could be directly assessed in Siegler's methodology. Rule 3 of Siegler, muddling through or guessing on conflict items, does not occur in this classification schema. Proposed alternatives for rule 3, like addition or maximization, are also not included. Additional item types are required to detect these strategies. 5.2.3. Test Items are shown on a computer screen. Three marked keys on the keyboard are used for responding. They represent this side more (i.e., left), equal, that side more (right). First, the initial situation appears, two glasses filled with liquid are shown. The subject has to judge this initial situation first. Then, an arrow denotes the pouring of liquid into a third glass which appears on the screen at the same time as the arrow; the liquid disappears from the second glass and appears in the third. Small arrows below the glasses point to the glasses that should be compared (A and C). In the practice phase of the computer test, the pouring is, except by the arrow, also indicated by the sound of pouring liquid. This schematic presentation of conservation items is easily understood by the subjects. The application of the strategy classification schema is more reliable when more items are used. We chose to apply one guess equality, one guess inequality, three standard equality, and three standard inequality items in the longitudinal experiment, and two guess equality, two guess inequality, four standard equality, and four standard inequality items in the cross-sectional studies. Guessers are detected by responses to guess items and the initial situations of the standard items, which are all non-misleading situations. Two types of data are obtained, number correct and strategies. Apart from the preceding test items, four other items are applied. We call these items compensation construction items. On the computer screen, two glasses are shown, one filled and one empty. The child is asked to fill the empty glass until the amounts of the liquids are equal. The subject can both increase and decrease the amount until he or she is satisfied with the result. Notice that a conserving transformation does not take place, hence the correct amount can only be achieved by compensation. We will only refer briefly to this additional part of the computer test. It is, however, interesting from a methodological point of view as the scoring of these items is not discrete but continuous, that is, on a ratio scale.
5.3. Experiment 1: Reliability and Validity

The first experiment is a cross-sectional study in which 94 subjects ranging in age from 6 to 11 years participated. We administered both the clinical conservation test (Goldschmid & Bentler, 1968) and an extended computer test.
The reliability of the conservation items of the computer test turned out to be .88. Four of the 94 subjects did not pass the guess criteria; they failed the clinical conservation test as well. This clinical test consisted of a standard equality and a standard inequality item, scored according to three criteria: Piagetian, Goldschmid and Bentler, and judgment-only. The classifications obtained by these three criteria correlate above .96; hence, to ease presentation, we only apply the Piagetian criterion. The correlation between the numbers correct on the clinical test and the computer test is .75. In terms of classifications, 79 of 94 subjects are classified concordantly as nonconservers and conservers. A more concise summary is given in Table 3. Here we can see that the classifications in NC.h and C do not differ importantly. For the NC.p, NC.=, and NC.b strategies, no definite statements can be made because of the small number of subjects. It seems that the ties appear among both conservers and nonconservers on the clinical test. The percentages correct on both tests show that the tests do not differ in difficulty (62% vs. 59%, paired t-test: t = .98, df = 93, p = .33). We can state that the computer test seems to be reliable and valid (if the clinical test is taken as the criterion).
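The two agreement statistics reported above (the correlation between number-correct scores and the paired t-test on difficulty) can be computed with standard routines. The sketch below is a generic illustration with made-up scores, not the study's data.

```python
# Sketch of the agreement analyses: correlation between number-correct scores on the
# two tests and a paired t-test comparing their difficulty.
import numpy as np
from scipy import stats

def agreement_summary(clinical, computer):
    clinical = np.asarray(clinical, dtype=float)
    computer = np.asarray(computer, dtype=float)
    r, r_p = stats.pearsonr(clinical, computer)   # validity: association of scores
    t, t_p = stats.ttest_rel(clinical, computer)  # difficulty: paired comparison
    return {"r": r, "r_p": r_p, "t": t, "t_p": t_p}

# Made-up number-correct scores for five children:
print(agreement_summary([2, 3, 7, 8, 5], [1, 4, 8, 8, 6]))
```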
5.4. Experiment 2: Longitudinal Investigation of the Sudden Jump and Anomalous Variance

In the second study, 101 subjects from four classes of one school participated. At the beginning of the experiment, the ages varied between 6.2 and 10.6 years. The four classes are parallel classes containing children of age groups 6, 7, and 8 years.
TABLE 3  Classifications on the Computer Test Versus the Clinical Test Consisting of an Equality and an Inequality Item

Clinical test scores (=, ≠)    NC.h    NC.p    NC.=    C     NC.b    Ties    Guess    Total
0, 0                             19       3       0     1       0       4        3       30
1, 0                              1       0       0     2       0       0        1        4
0, 1                              2       2       0     3       0       0        0        7
1, 1                              2       1       2    40       1       7        0       53

Note. Four score patterns of correct (1) and incorrect (0) on the equality (=) and inequality (≠) items can occur on the clinical test. These are compared with the strategies found on the computer test (strategies i1 to i4 did not occur at all). The cells contain the number of subjects.
We placed a computer in each class and individually trained the children to use it. During 7 months, 11 sessions took place. Except for the first session, children took the test by themselves as part of their normal individual education. This method of closely following subjects is similar to what Siegler and Jenkins (1989) call the microgenetic method, except that we did not collect the verbal statements on responses that they use as additional information. We can call our method a computerized microgenetic method. Its main advantage is that it takes less effort, because, ideally, the investigator only has to back up computer diskettes.

5.4.1. Sudden jump

To find evidence for the sudden jump, we classified subjects into four groups, nonconservers, transitional, conservers, and residual, on the basis of the strategies applied. The transitional subjects are subjects who applied both conserver and nonconserver strategies during the experiment. Twenty-four of the 101 subjects show a sharp increase in the use of conserver strategies. We corrected the time series for latency of transition points. The resulting individual plots are shown in Figure 4. This figure demonstrates a very sharp increase in the conservation score. To judge how sharp, we applied a multiple regression analysis in which the conservation score serves as the dependent variable and the session (linear) and a binary template of a jump serve as independent variables. This jump indicator consists of zeros for sessions 1 to 10 and ones for sessions 11 to 19. Together, the independent variables explain 88% of the variance (F(2, 183) = 666.6, p = .0001). The t values associated with the beta coefficients of the independent variables are t = 1.565 (p = .12) and t = 20.25 (p = .0001) for the linear and jump indicators, respectively. Actually, the jump indicator explains more variance than a sixth-order polynomial of the session variable. Yet, this statistical result does not prove that this sharp increase is catastrophic. The data can also be explained by a continuous acceleration model, as the density of sessions over time may be insufficient. What this plot does prove is that this large increase in conservation level takes place within 3 weeks, between two sessions, for 24 of 31 potentially transitional subjects.

For the remaining 70 subjects, classified as nonconservers, conservers, and residual subjects on the basis of strategy use, no significant increases in scores are found. The mean scores on sessions 1 and 11 for these subjects do not differ significantly, F(1, 114) = .008, p = .92. The scores of these subjects stay at a constant level. The transitional group shows important individual differences. A few subjects show regressions to the nonconserver responses, some apply rare strategies during some sessions, but the majority show a sharp increase between sessions. One subject jumped within one test session, showing consistent nonconserver scores on all preceding sessions and conserver scores on all subsequent test sessions.
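The jump regression just described can be sketched as follows. The data in the example are an idealized step function rather than the study's scores; the session count (19 aligned sessions) and the jump template (0 before session 11, 1 from session 11 on) follow the description above.

```python
# Sketch of the jump regression: conservation score regressed on session number and a
# binary jump indicator, fitted by ordinary least squares.
import numpy as np

def jump_regression(scores, sessions, jump_session):
    scores = np.asarray(scores, dtype=float)
    sessions = np.asarray(sessions, dtype=float)
    jump = (sessions >= jump_session).astype(float)
    X = np.column_stack([np.ones_like(sessions), sessions, jump])
    beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
    fitted = X @ beta
    r2 = 1 - np.sum((scores - fitted) ** 2) / np.sum((scores - scores.mean()) ** 2)
    return beta, r2

sessions = np.arange(1, 20)                    # 19 sessions after alignment
scores = np.where(sessions < 11, 1.0, 7.0)     # idealized pre/post-jump scores
beta, r2 = jump_regression(scores, sessions, jump_session=11)
print(beta, r2)   # the jump coefficient absorbs nearly all of the explained variance
```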
FIGURE 4. Individual transition plots aligned to time point 11. Each panel shows one transitional subject; the horizontal axis gives the test session (0-20) and the vertical axis the raw score on the conservation test (0-8). Two guess items and six standard items were used.
5.4.2. Anomalous variance

A special problem in the analysis of anomalous variance is that the conservation test consists of dichotomous items. The variance depends on the test score. We explored three solutions to this problem, based on inconsistency, alternations, and transitional strategies. The inconsistency measure is obtained from an addition to the computer test, a repetition of four standard items at the end of the test. These additional items are not included in the test score. The inconsistency measure is the number of responses that differ from the responses on the first presentation. We do not present the results here. In summary, inconsistencies do occur in the responses of transitional subjects but occur more in the responses of residual subjects. The other two measures concern the application of strategies. The analysis of alternations did not indicate a transitional characteristic. The analysis of strategies, however, did. To explain this, we will now describe the results in terms of strategies. The other measures are discussed in van der Maas, Walma van der Molen, and Molenaar (1995).

The responses to the six standard items are analyzed by the strategy classification schema. The results are shown in Table 4. This table shows several important things. The number of ties is low; hence, in the large majority of patterns, the classification schema applies well. The NC.h and the C strategy are dominant (80%). The other cells are almost empty or concern uncommon strategies. NC.p (rule 2 in Siegler's classification), NC.=, i1, and NC.w make up 14% of the response patterns. NC.w, focusing on the subordinate dimension, is very rare.
TABLE 4  Responses to Six Standard Items Analyzed by the Strategy Classification Schema

                                 Equality items
Inequality items     Highest more      Equal             Widest more
Equal                NC.h  417 (15)    NC.=  36 (9)      i1    20 (5)
Smallest more        i2      4 (0)     i3     5 (3)      i4     2 (2)
Widest more          NC.p   50 (11)    C    258 (4)      NC.w  11 (5)

Ties: 43 (8)    Missing: 265

Note. For 11 sessions × 101 subjects, 1111 response patterns can be classified. There were 265 missing patterns. Forty-three response patterns, ties, could not be classified because of equal distance to the ideal patterns associated with the nine cells. The 803 remaining patterns can be uniquely classified and are distributed as shown in the table. The number of guessers is displayed in parentheses. A response pattern is classified as a guess strategy if two guess items are incorrect or if one guess item and more than 25% of the responses to the non-misleading initial situations of the standard items are incorrect.
A statistical test should reveal whether the small number of NC.w patterns (6 if guessers are removed) can be ascribed to chance or cannot be neglected. Latent class analysis may be of help here.

For the classification of time series, that is, the response patterns over 11 sessions, we make use of the translation of raw scores into strategies. The classification is rather simple. If NC.h occurs at least once, and C does not occur, the series is classified as NC. If C occurs at least once, and NC.h does not occur, the series is classified as C. If both C and NC.h occur in the time series, it is classified as TR. The remaining subjects and those who apply guess and irrelevant (i2, i3, and i4) strategies on the majority of sessions are classified as R. According to these criteria, there are 31 TR, 42 NC, 20 C, and 8 R subjects. Note that this classification of time series does not depend on the use of the uncommon strategies NC.p, NC.=, i1, and NC.w. Thus we can look at the frequency of use of these strategies in the four groups of subjects (see Table 5). NC.p, rule 2 in Siegler's classification, does not seem to be a transitional characteristic; in most cases, it is used by the subjects of the nonconserver group. The strategies i1 and NC.w are too rare to justify any conclusion. In 15 of 27 cases of application of NC.=, it is applied by the transitional subjects. The null hypothesis that NC.= is distributed evenly over the four groups is rejected (χ² = 11.3, df = 3, p = .01). Furthermore, seven transitional subjects apply NC.= just before they start using the C strategy. This analysis demonstrates much more convincingly than cross-sectional studies that transitional subjects manifest what has been called intermediary forms of reasoning. Whether we can ascribe the occurrence of the NC.= strategy to anomalous variance is not entirely clear.
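The classification rule for the longitudinal series and the test on the distribution of NC.= can be sketched as follows. The sketch is ours: the residual rule is simplified (the full rule also assigns children who mostly guess or use irrelevant strategies to R), and the expected counts for the chi-square test are assumed to be proportional to each group's total number of classified patterns in Table 5, an assumption that reproduces the reported value of about 11.3.

```python
# Sketch of the time-series classification (NC, C, TR, R) and of the test of whether
# NC.= use is spread evenly over the four groups.
import numpy as np
from scipy import stats

def classify_series(strategies):
    """strategies: the per-session strategy labels of one child."""
    has_nch, has_c = "NC.h" in strategies, "C" in strategies
    if has_nch and has_c:
        return "TR"
    if has_nch:
        return "NC"
    if has_c:
        return "C"
    return "R"          # simplified residual rule

print(classify_series(["NC.h", "NC.h", "NC.=", "C", "C"]))   # TR

nc_eq  = np.array([5, 4, 15, 3])          # NC.= patterns in NC, C, TR, R (Table 5)
totals = np.array([381, 157, 249, 59])    # classified patterns per group (Table 5)
expected = nc_eq.sum() * totals / totals.sum()
chi2, p = stats.chisquare(nc_eq, f_exp=expected)
print(round(chi2, 1), round(p, 3))        # about 11.3 and .01
```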
TABLE 5  Application of Uncommon Strategies by Nonconservers, Conservers, Transitional, and Residual Subjects

                                    Group
Strategy        NC              C              TR              R
NC.p            21 (5.5%)       4 (2.5%)        7 (2.8%)       7 (11.8%)
NC.w             1 (0.3%)       0 (0%)          3 (1.2%)       2 (3.4%)
NC.=             5 (1.3%)       4 (2.5%)       15 (6.0%)       3 (5.1%)
i1               5 (1.3%)       0 (0%)          7 (2.8%)       3 (5.1%)
Other          349 (91.6%)    149 (95%)       217 (87%)       44 (74.6%)
Total          381 (100%)     157 (100%)      249 (100%)      59 (100%)

Note. Other strategies are C, NC.h, guess, ties, and irrelevant strategies (i2 to i4). Raw numbers and percentages of use in each group are displayed.
Note that the search for anomalous strategies stems from the fact that the traditional variance is not useful for a binomially distributed test score. In this respect, the compensation construction items are of interest. These measure conservation (in fact, compensation) ability on a continuous ratio scale (amount of liquid or height of liquid level). With this measure, a more direct test of anomalous variance may be possible.
5.5. Experiment 3: Hysteresis

The last flag that we will discuss here is hysteresis. We build the cusp model of conservation acquisition on the assumption that perceptual and cognitive factors conflict with each other. Hysteresis should occur under systematic increasing and decreasing variation in these factors. Preece (1980) proposed a cusp model that is a close cousin of ours. Perceptual misleadingness plays a role in his model, too. We will not explain Preece's model here, but it differs from our preliminary model, for example, with respect to the behavior modes to which the system jumps. The difference can be described in terms of the conservation events (see section 3). Preece's model concerns event c, whereas our model concerns event d, resulting in a choice between the height more and weight more responses and between the height more and equal responses, respectively. So Preece expects a hysteresis effect between incorrect responses, whereas we expect one between correct and incorrect responses.

The idea suggested by Preece is to vary the misleading cue in conservation tasks. In a conservation of weight test, a ball of clay is rolled into sausages of lengths 10, 20, 40, 80, 40, 20, and 10 cm, respectively (see Table 6). In this test, a child shows hysteresis if he or she changes the judgment twice: once during the increasing sequence and once during the decreasing sequence. If both jumps take place, hysteresis occurs either according to the delay convention (Hyst D) or to the Maxwell convention (Hyst M). In the latter case, the jumps take place at the same position, for instance, between 20 and 40 cm. If only one jump occurs, we classify the pattern as "jump." This weight test was performed as a clinical observation test and was administered to 65 children. Forty-three subjects correctly solved all items. Two subjects judged consistently that the standard ball was heavier, and two judged the sausage as heavier. In Table 6 the response patterns of all subjects are shown; the consistent subjects are in the lower part. The codes -1, 0, and 1 mean ball more, equal weight, and sausage more, respectively.
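A simple reading of these classification rules can be sketched in code. The function below is our own simplified version: it labels a single seven-item sequence as a jump, Maxwell hysteresis (both transitions at the same boundary), or delay hysteresis, and it distinguishes the reversed variant (denoted Hyst D' further below) by whether the upward transition occurs at a larger or a smaller sausage length than the downward one.

```python
# Sketch of the hysteresis classification for one 10-20-40-80-40-20-10 cm sequence.
# Responses are coded as in Table 6: -1 = ball more, 0 = equal, 1 = sausage more.
LENGTHS = [10, 20, 40, 80, 40, 20, 10]

def classify_pattern(responses):
    changes = [i for i in range(1, 7) if responses[i] != responses[i - 1]]
    if not changes:
        return "consistent"
    if len(changes) == 1:
        return "jump"
    up = [i for i in changes if i <= 3]      # change during the increasing series
    down = [i for i in changes if i >= 4]    # change during the decreasing series
    if len(up) == 1 and len(down) == 1 and len(changes) == 2:
        up_boundary = {LENGTHS[up[0] - 1], LENGTHS[up[0]]}
        down_boundary = {LENGTHS[down[0] - 1], LENGTHS[down[0]]}
        if up_boundary == down_boundary:
            return "hyst M"                  # Maxwell convention: same position
        return "hyst D" if min(up_boundary) >= min(down_boundary) else "hyst D'"
    return "other"

print(classify_pattern([-1, -1, 1, 1, 1, 1, -1]))   # first subject's pattern: hyst D
print(classify_pattern([0, 0, 0, -1, -1, 0, 0]))    # subject 42's pattern: hysteresis
```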
TABLE 6  Response Patterns of 65 Children in a Cross-Sectional Experiment on Hysteresis

nr    10 cm    20 cm    40 cm    80 cm    40 cm    20 cm    10 cm    Pattern    Alternation
39
-1 -1 -1 -1 0 0 0 -1 0 0 0 1 1 0 1 -1 1 1 1 0 0 0
-1
1
-1 -1
1 1
0 0 1 1 -1 0 0 1 -1 0 1 0 0 0
1 -1 0 0 0 1 0 0 1 -1 1 0 0 -1 -1 0 0 1 -1 1
hyst D hyst D' jump hyst M hyst D' hyst D hyst D' jump hyst D' hyst D' jump
inc/inc inc/inc inc/inc inc/inc c/inc c/inc c/inc c/inc c/inc c/inc c/inc
- 1 1 0
- 1 1 0
- 1 1 0
63 11 51 42 47 18
3 21 40 49 1 62 64 65 25 32 27
1 1
1
-1
-1
1 1 -1
1 -1
1
1
0 0 -1 -1 -1 -1 0 -1 -1 0 0 -1 1 -1 - 1 -1 0 1
-1 0 -1 -1 -1 -1 1 -1 -1 -1 0 1 1 0 1 -1 0 -1
-1 -1 -1 -1 -1 -1 1 1 1 0 0 1 0 -1 1 -1 0 0
1 1 0 -1 -1 -1 -1 -1 1 -1 1 0 0 0 -1 1 1 0 1 1
- 1 1 0
- 1 1 0
- 1 1 0
- 1 1 0
1
-1
0 0 0 -1
2 Subjects 2 Subjects 43 Subjects
Note. Explanation of abbreviations is given in the text.
The first subject shows a good example of a hysteresis response pattern. When the ball is rolled into a sausage of a length of 10 cm, he judges the ball as heavier. Then, when the sausage of 10 cm is rolled into a length of 20 cm, the child persists in his opinion. But when the sausage of 20 cm is rolled into a sausage of 40 cm, he changes his judgment: now the roll is heavier. He continues to think this when the sausage reaches its maximum length of 80 cm. Then the sequence is reversed. The roll of 80 cm is folded to 40 cm; the child judges that the roll is still heavier. This continues until the roll is folded to 10 cm. On this last item the child changes his opinion: the ball is heavier again. The first subject is tested two times. The second time, hysteresis according to the delay convention again takes place. However, the jumps have exchanged relative position. The jump takes place earlier in the increasing rather than in the decreasing sequence of items. We denoted this strange phenomenon as Hyst D'. This subject shows hysteresis between incorrect responses (inc/inc); subject 42 shows hysteresis between correct and incorrect responses (c/inc). This subject judges the ball and the sausage to be equal until the sausage is rolled into a
length of 80 cm. Then the ball is heavier. The child persists in this judgment when the sausage is folded to 40 cm. Finally, in the last two items, the child's responses are "equal" again. Eleven instances of hysteresis and jumps occur among the tested children. The other patterns are not interpretable in these terms.

The hysteresis results on the conservation of weight test are not conclusive. It would be naive to expect a convincing demonstration of hysteresis on the first attempt. The number of items in the sequence and the size of the steps between items are difficult to choose. Perhaps another kind of test is needed to vary misleadingness continuously. Furthermore, only the transitional subjects in the sample are expected to show these phenomena. In this regard, the results in Table 6 are promising. We also constructed a computer test of liquid conservation for hysteresis. Only two instances of the jump pattern and four of the hysteresis pattern occurred in the responses of 80 subjects (from the longitudinal sample). Three of these subjects came from the transitional group. One of them responded as a nonconserver until we applied the hysteresis test (between sessions 7 and 8) and responded as a conserver in all subsequent sessions.

These tests of hysteresis have some relationship to a typical aspect of Piaget's clinical test procedure of conservation. This aspect is called resistance to countersuggestion, which is not applied in the computer test of conservation. It means that, after the child has made his or her judgment and justification of this judgment, the experimenter suggests the opposite judgment and justification to the child. If the child does not resist this countersuggestion, he or she is classified as transitional. That countersuggestions work has been shown in many studies. Whether subjects who accept countersuggestion are transitional (or socially adaptive) is not always clear; as said before, cross-sectional studies have serious limitations in deciding this. The method of countersuggestion can be interpreted in the cusp model as pushing the behavior into the new or old mode of behavior to see whether this mode still or already exists and has some stability. If countersuggestion is resisted, the second mode apparently does not exist. In the cusp model, the second mode only exists in the bifurcation set. Note that we defined the transition as the period during which the system remains in this set. If the subject switches opinions back and forth as a result of repeated countersuggestions, this behavior can be taken to be hysteresis. Whether the manipulation of countersuggestion yields an independent variable or not is less clear.
6. DISCUSSION

In this chapter, we gave an overview of our work on the application of catastrophe theory to the problem of conservation development. Not all of the ideas
or all of the empirical results are discussed, but the most important outcomes are presented. What can be learned from this investigation? One may think that catastrophe theory is a rather complex and extensive tool for something so simple as conservation learning. Some readers may hold the saltatory development of conservation as evident in itself (see Zahler and Sussmann, 1977). With respect to the first criticism, we state that there are theoretical and empirical reasons for using catastrophe theory. Theoretically, the reason for applying catastrophe theory is that the rather vague concept of epigenesis in Piaget's theory can be understood as equivalent to self-organization in nonlinear system theory. Empirically, the lack of criteria for testing saltatory development, implied in Piagetian theory, has been an important reason for the loss of interest in this important paradigm (van der Maas & Molenaar, 1992).

The second criticism centers on the argument that our conservation items are dichotomous items, so behavior is necessarily discrete. This criticism is not in accordance with test theory. A set of dichotomous items yields, in principle, a continuous scale. On other tests in which a set of dichotomous items of the same difficulty is used, the sum score is unimodally distributed. In contrast, the sum scores on the conservation test are bimodally distributed. Yet, the use of dichotomous items complicates the interpretation of results. It raises a question about the relationship between catastrophe models and standard scaling techniques, especially latent trait theory. This is a subject that needs considerable study. We suggested the use of compensation construction items to exclude these complications radically. This criticism of intrinsic discreteness is not commonly raised by developmental psychologists. The traditional rejection of Piaget's theory assumes the opposite hypothesis of continuous development.

With regard to this position the presented data are telling. First, the sudden jump in the responses of a quarter of our longitudinal sample has not been established before. It is disputable whether this observation excludes alternative models that predict sharp continuous accelerations. We can argue that at least one subject showed the jump within a test session, during a pause of approximately 30 s between two parts of the computer test. Hence, we demonstrated an immediate sudden jump for at least one subject. Second, evidence is presented for the occurrence of transitional strategies. Note that, as this concerns a result from a longitudinal study, we are much more certain that the subjects classified as transitional are indeed transitional. Transitional subjects appear to respond "equal" to both equality and inequality items. This may be interpreted to mean that they overgeneralize their discovery of the solution of standard equality items to the inequality items. Verbal justifications are required to decide whether this interpretation is correct. In our view the
demonstration of a typical transitional strategy is quite important; to Piaget's idea of knowledge construction, this result is confirmatory. It also makes an argument for using the rule assessment methodology in studying Piagetian concepts. We presented this result under the heading of anomalous variance. We relate the catastrophe flag anomalous variance to what Flavell and Wohlwill (1969) call oscillations and intermediary forms of reasoning. We suggest that the findings of Perry, Breckinridge Church, and Goldin Meadow (1988) concerning inconsistencies between nonverbal gestures and verbal justifications should be understood in terms of this flag, too. Third, the few instances of hysteresis response patterns suggest a choice for the cusp model of conservation. We admit that more experimentation is required here. The test should be improved, and perhaps we need other designs and manipulations. We have already suggested using countersuggestion as a manipulation. Finally, evidence for other flags can be found in the literature, especially for bimodality and inaccessibility, for which strong evidence already exists (Bentler, 1970; van der Maas, 1995). Divergence, divergence of linear response, and critical slowing down are not demonstrated. Although the presented results together make a rather strong argument for the hypothesis of discontinuous development, we will attempt to find evidence for these flags, too.

The presented evidence for the catastrophe flags offers a strong argument for the discontinuity hypothesis concerning conservation development. That is, the data bear evidence for catastrophe models of conservation development. However, we cannot claim that this evidence directly applies to the specific cusp model that we presented. The choice of independent variables is not fully justified. The hysteresis experiment gives some evidence for perceptual misleadingness as an independent variable, but this variable is, for instance, also used by Preece (1980). On the other hand, there are some theoretical problems for which the model of Preece (1980) can be rejected (see van der Maas & Molenaar, 1992). We did not put forward evidence for cognitive capacity as an independent variable; in future studies, we will focus on this issue. However, it is in accordance with the views of many supporters and opponents of Piaget that conservation acquisition should be understood in terms of a conflict between perceptual and cognitive factors.

We hope that we have made clear that a catastrophe approach to discontinuous behavior has fruitful implications. Catastrophe theory concerns qualitative (categorical) behavior of continuous variables. It suggests a complex relation between continuous and categorical variables that falls outside the scope of standard categorical models and data analysis methods. Yet, the catastrophe models are not unrelated to standard notions; the question of how catastrophe models should be incorporated in standard techniques is far from answered.
ACKNOWLEDGMENTS

This research was supported by the Dutch organization for scientific research (NWO) and the Department of Psychology of the University of Amsterdam (UVA).
REFERENCES

Alexander, R. A., Herbert, G. R., DeShon, R. P., & Hanges, P. J. (1992). An examination of least squares regression modeling of catastrophe theory. Psychological Bulletin, 111(2), 366-374.
Anderson, N. H., & Cuneo, D. O. (1978). The height + width rule in children's judgments of quantity. Journal of Experimental Psychology: General, 107(4), 335-378.
Bentler, P. M. (1970). Evidence regarding stages in the development of conservation. Perceptual and Motor Skills, 31, 855-859.
Brainerd, C. J. (1979). Markovian interpretations of conservation learning. Psychological Review, 86, 181-213.
Bruner, J. S. (1967). On the conservation of liquids. In J. S. Bruner, R. R. Olver, & P. M. Greenfield (Eds.), Studies in cognitive growth (pp. 183-207). New York: Wiley.
Castrigiano, D. P. L., & Hayes, S. A. (1993). Catastrophe theory. Reading, MA: Addison-Wesley.
Cobb, L., & Zacks, S. (1985). Applications of catastrophe theory for statistical modeling in the biosciences. Journal of the American Statistical Association, 80(392), 793-802.
Everitt, B. S., & Hand, D. J. (1981). Finite mixture distributions. London: Chapman & Hall.
Ferretti, R. P., & Butterfield, E. C. (1986). Are children's rule assessment classifications invariant across instances of problem types? Child Development, 57, 1419-1428.
Flavell, J. H., & Wohlwill, J. F. (1969). Formal and functional aspects of cognitive development. In D. Elkind & J. H. Flavell (Eds.), Studies in cognitive development: Essays in honor of Jean Piaget (pp. 67-120). New York: Oxford University Press.
Gilmore, R. (1981). Catastrophe theory for scientists and engineers. New York: Wiley.
Goldschmid, M. L., & Bentler, P. M. (1968). Manual conservation assessment kit. San Diego, CA: Educational and Industrial Testing Service.
Guastello, S. J. (1988). Catastrophe modeling of the accident process: Organizational subunit size. Psychological Bulletin, 103(2), 246-255.
Kerkman, D. D., & Wright, J. C. (1988). An exegesis of compensation development: Sequential decision theory and information integration theory. Developmental Review, 8, 323-360.
McShane, J., & Morrison, D. L. (1983). How young children pour equal quantities: A case of pseudoconservation. Journal of Experimental Child Psychology, 35, 21-29.
Molenaar, P. C. M. (1986). Issues with a rule-sampling theory of conservation learning from a structuralist point of view. Human Development, 29, 137-144.
Odom, R. D. (1972). Effects of perceptual salience on the recall of relevant and incidental dimensional values: A developmental study. Journal of Experimental Psychology, 92, 185-291.
Oliva, T. A., Desarbo, W. S., Day, D. L., & Jedidi, K. (1987). GEMCAT: A general multivariate methodology for estimating catastrophe models. Behavioral Science, 32, 121-137.
Pascual-Leone, J. (1970). A mathematical model for the transition rule in Piaget's developmental stages. Acta Psychologica, 32, 301-345.
Pascual-Leone, J. (1989). An organismic process model of Witkin's field-dependency-independency. In T. Globerson & T. Zelniker (Eds.), Cognitive style and cognitive development (Vol. 3, pp. 36-70). Norwood, NJ: Ablex.
Peill, E. J. (1975). Invention and discovery of reality: The acquisition of conservation of amount. London: Wiley.
Perry, M., Breckinridge Church, R., & Goldin Meadow, S. (1988). Transitional knowledge in the acquisition of concepts. Cognitive Development, 3, 359-400.
Piaget, J. (1960). The general problems of the psychobiological development of the child. In J. M. Tanner & B. Inhelder (Eds.), Discussions on child development (Vol. 4, pp. 3-27). London: Tavistock.
Piaget, J. (1971). The theory of stages in cognitive development. In D. R. Green, M. P. Ford, & G. B. Flamer (Eds.), Measurement and Piaget (pp. 1-11). New York: McGraw-Hill.
Piaget, J., & Inhelder, B. (1969). The psychology of the child. New York: Basic Books.
Poston, T., & Stewart, I. (1978). Catastrophe theory and its applications. London: Pitman.
Preece, P. F. W. (1980). A geometrical model of Piagetian conservation. Psychological Reports, 46, 143-148.
Saunders, P. T. (1980). An introduction to catastrophe theory. Cambridge, UK: Cambridge University Press.
Siegler, R. S. (1981). Developmental sequences within and between concepts. Monographs of the Society for Research in Child Development, 46(2), 84.
Siegler, R. S. (1986). Unities across domains in children's strategy choices. Minnesota Symposia on Child Psychology, 19, 1-46.
Siegler, R. S., & Jenkins, E. (1989). How children discover new strategies. Hillsdale, NJ: Erlbaum.
Stewart, I. N., & Peregoy, P. L. (1983). Catastrophe theory modeling in psychology. Psychological Bulletin, 94(2), 336-362.
Ta'eed, L. K., Ta'eed, O., & Wright, J. E. (1988). Determinants involved in the perception of the Necker Cube: An application of catastrophe theory. Behavioral Science, 33, 97-115.
van Maanen, L., Been, P., & Sijtsma, K. (1989). The linear logistic test model and heterogeneity of cognitive strategies. In E. E. Roskam (Ed.), Mathematical psychology in progress (pp. 267-287). Berlin: Springer-Verlag.
van der Maas, H. L. J., & Molenaar, P. C. M. (1992). Stagewise cognitive development: An application of catastrophe theory. Psychological Review, 99(3), 395-417.
van der Maas, H. L. J. (1995). Nonverbal assessment of conservation. Manuscript submitted for publication.
van der Maas, H. L. J., Walma van der Molen, J., & Molenaar, P. C. M. (1995). Discontinuity in conservation acquisition: A longitudinal experiment. Manuscript submitted for publication.
Wilson, M. (1989). Saltus: A psychometric model of discontinuity in cognitive development. Psychological Bulletin, 105(2), 276-289.
Zahler, R. S., & Sussmann, H. J. (1977). Claims and accomplishments of applied catastrophe theory. Nature (London), 269(10), 759-763.
Zeeman, E. C. (1976). Catastrophe theory. Scientific American, 234(4), 65-83.
Zeeman, E. C. (1993). Controversy in science, the ideas of Daniel Bernoulli and René Thom. Johann Bernoulli Lecture. Groningen, The Netherlands: University of Groningen.
Catastrophe Theory of Stage Transitions in Metrical and Discrete Stochastic Systems
Peter C. M. Molenaar and Pascal Hartelman University of Amsterdam The Netherlands
1. INTRODUCTION

The current spectacular progress in the analysis of nonlinear systems has important implications for theories of development. Until recently, central theoretical constructs in developmental biology and psychology, such as epigenesis and emergence, seemed to defy any attempt at causal modeling. For instance, when addressing the problem of causal inference in experimental studies of development, Wohlwill (1973, p. 319) believed that it would be impossible to isolate sufficient causes of normal developmental processes, which he described as "acting independently of particular specifiable external agents or conditions." As to this, the situation has changed considerably in that we now have available a range of nonlinear dynamical models that can explain the self-regulating mechanisms of normal development (such as "catch-up growth" after temporary
inhibition, which served as the empirical example in Wohlwill's discussion of causal inference). Moreover, the mathematical theory of nonlinear dynamics has provided innovative, rigorous approaches to the empirical study of epigenetical processes (cf. Molenaar, 1986). This chapter is devoted to one of these approaches.

The founders of modern nonlinear dynamics have always been quite aware of the relevance of their work to the analysis of developmental processes. Prigogine's work with Nicolis on applications of nonequilibrium thermodynamics is an excellent example (Nicolis & Prigogine, 1977), as is Prigogine's (1980) classic monograph on the physics of becoming. Haken (1983) applied his synergistic approach to self-organization in evolutionary processes. And last but not least, Thom's (1975) main treatise of catastrophe theory is entirely inspired by morphogenetics. A common theme in all of these applications is the causal modeling of self-organization in nonlinear developmental systems. Furthermore, this is accomplished by using the same basic paradigm: bifurcation analysis.

In a bifurcation analysis, a nonlinear system is subjected to smooth variation of its parameters. One might conceive of this variation as the result of maturation, giving rise to slow continuous changes (so-called quasi-static changes) in the system parameters. Most of the time this only yields continuous variation of the behavior of the system, but for particular parameter values, the system may undergo a sudden change in which new types of behavior emerge and/or old behavior types disappear. Such a discontinuous change in a system's behavior marks the point when its dynamics become unstable, after which a spontaneous shift occurs to a new, stable dynamical regime. Hence, sudden transitions in the dynamics of systems undergoing smooth quasi-static parameter variation constitute a hallmark of self-organization. Notice that bifurcation analysis is a mathematical technique which is applied to a mathematical model of a given nonlinear system. For many developmental processes, however, no adequate mathematical model is available. For instance, it is unknown what would constitute a proper dynamical model of (Piaget's theory of) cognitive development. One could proceed by making an educated guess (called an "Ansatz") of what might be a plausible model and subject that model to a bifurcation analysis. This approach is commonly used in applied synergism (cf. Haken, 1983). Yet, in the context of cognitive development, even the formulation of a plausible Ansatz is a formidable endeavor. Far too little is known about the dynamics of cognitive systems, and their operation can only indirectly be observed. To study self-organization in such ill-defined systems, a mathematical technique is needed that does not require the availability of specific dynamic models. This is provided by catastrophe theory.

Catastrophe theory deals with the critical points or equilibria of gradient systems, that is, systems whose dynamics are governed by the gradient of a so-called potential. A potential is a smooth scalar function of the state variables of a dynamical system. For the moment, a potential can be best conceived of as a
convenient mathematical construct; later, we present possible interpretations in the context of developmental processes. Let x(t) denote the vector of state variables at time t (we will denote vectors and matrices by bold lower- and uppercase letters). The rate of change of the state variables in a gradient system, then, equals the gradient of a potential V[x(t)] with respect to these state variables: d/dt x(t) = grad_x V[x(t)], where grad_x = [∂/∂x_1, ∂/∂x_2, ..., ∂/∂x_n] and ∂/∂x_i denotes the partial derivative with respect to the state coordinate x_i. This implies that the equilibria of a gradient system are given by the zeroes of grad_x V[x]. Catastrophe theory, however, does not deal with an individual gradient system but yields a characterization of the equilibria of entire families of parameterized gradient systems: grad_x V[x; c] = 0, where c is a set of parameters. Hence catastrophe theory is the study of how the critical points of V[x; c] change as the parameters in c change.

The main result of catastrophe theory can be summarized as follows: The equilibria of a subset of all potentials V[x; c] can be characterized in terms of a few canonical forms that depend only on the number of parameters in c and on the rank of the Hessian, that is, the matrix of second-order derivatives of V[x; c] with respect to the state variables. If the number of parameters in c does not exceed 7 and if the rank of the Hessian is not less than n − 2, where n is the dimension of the state vector x, then this result holds. The implications of this main result of catastrophe theory are profound: For gradient systems, one only needs to specify two numbers to give a characterization of the way in which their equilibria change as the parameters change. This is precisely the kind of result that is needed for the study of self-organization in ill-defined dynamical systems. In fact, Gilmore (1981) derived characteristic features of sudden changes in equilibria that do not even depend on these two numbers but instead require an absolute minimum of a priori information for their application. These so-called catastrophe flags are extensively discussed in van der Maas and Molenaar (1992) and in the chapter by van der Maas and Molenaar in this volume.

A key question now presents itself: Are gradient systems general enough to capture the essential self-organizing characteristics of most developmental processes? Apart from the catastrophe flags, which apply to any gradient (and conservative Hamiltonian) system, catastrophe theory proper applies to a subset of the gradient systems, and the entire class of gradient systems itself only constitutes a subset of the set of all possible dynamical systems. Notice that a definite answer to this question can only be obtained by empirical means. For instance, whether or not a gradient system constitutes a satisfactory causal model of cognitive development (or, more realistically, of an aspect of cognitive development such as conservation acquisition) can only be determined in dedicated empirical research. Having said this, however, one can put forward some a priori considerations that are indicative of the viability of applications of catastrophe theory to biological and psychological developmental processes. As alluded to earlier,
these processes are in general ill-defined in that we are unable to give a complete specification of the underlying dynamics. If catastrophe theory is applied and the (unknown) true dynamics do not conform to a gradient system, then what is the error of approximation? More specifically, let the true dynamical equation be d/dt x(t) = F[x(t); c], where F now is an arbitrary vector-valued function. Then, according to Jackson (1989, p. 117): "If F[x; c] = 0 implies that grad_x V[x; c] = 0, for some [scalar potential] V, then Thom's theorem [i.e., catastrophe theory] can be applied to V[x; c]. It is not required that F = grad_x V everywhere in the [state] space." Hence, if the equilibria of a nongradient system can be locally represented by the equilibria of a potential, then there is no error of approximation. The class of potential systems, that is, systems whose dynamics depend on the gradient of a potential, includes gradient systems, (conservative) Hamiltonian systems, and Hamiltonian systems with positive damping. Thus the class of potential systems is rather large, covering many physical processes. Huseyin (1986) has extended the catastrophe theory program to this class of potential systems. Furthermore, Huseyin shows that the (so-called static) instabilities of autonomous nonpotential systems can be characterized in much the same way as those of potential systems (thus corroborating Jackson's observation). Taken together, these a priori considerations imply that the range of applicability of catastrophe theory may be rather large (although one would like to have a more detailed formal specification of the approximation error in the general case). This is a fortunate state of affairs because catastrophe theory may be the only principled method available to study self-organization in many ill-defined systems.

In what follows, we will first present an outline of elementary catastrophe theory for metrical deterministic gradient systems. This presentation is based on Gilmore (1981) and follows a constructive approach that is distinct from the more abstract, geometrical approaches followed by most other authors (e.g., Thom, 1975). Then, we move on to a consideration of catastrophe theory for stochastic metrical systems. After having presented a typology of stochastic systems to indicate the type of systems we are interested in, a concise overview is given of Cobb's innovative work in this area (e.g., Cobb & Zacks, 1985). It will be argued that Cobb's approach does not meet the basic tenets of catastrophe theory, and therefore a revision of his approach is formulated that agrees with the goals of elementary catastrophe theory. Even then, however, there remain some fundamental problems with extending the catastrophe theory program to metrical stochastic systems. We discuss these problems in a heuristic way and suggest a principled approach to deal with them. This part of the chapter draws on ongoing research and therefore the reader should not expect definite solutions. The same can be said about the final part of the chapter dealing with catastrophe theory for discrete stochastic systems. Here, we restrict the discussion
to a characterization of these systems in a way that makes them compatible with the catastrophe theory program.
2. ELEMENTARY CATASTROPHE THEORY

The basic tenet of catastrophe theory is to arrive at a classification of the possible equilibrium forms of gradient systems undergoing continuous quasi-static variation of their parameters. This implies a classification of the critical points of perturbed parameterized potentials V[x; c]. Critical points belong to the same class if there exists a diffeomorphic relationship between the corresponding potentials. Roughly speaking, a diffeomorphism changes x and/or c in a smooth manner (i.e., a local diffeomorphism at a point p is an invertible coordinate transformation for which derivatives of arbitrary order exist). Hence the program of elementary catastrophe theory is effected by smooth coordinate transformations. Actually the picture is more complicated, as will be indicated in a later section, but this suffices for our present purposes.
2.1. Coordinate Transformations

We consider here transformations of a coordinate system [x_1, x_2, ..., x_n] for processes in R^n. Such coordinate transformations play an important role in effecting the program of elementary catastrophe theory and will also figure predominantly in the extension of this program to stochastic metrical systems. A transformation of the coordinate system [x_1, x_2, ..., x_n] into [y_1, y_2, ..., y_n] then is defined by

y_i = y_i(x_1, x_2, ..., x_n),  i = 1, ..., n,    (1)
112
Peter c. M. Molenaarand Pascal Hartelman
then these two potentials are qualitatively similar. In particular, the number and types of their critical points is the same. The way in which coordinate transformations are used in effecting the program of catastrophe theory can now be summarized as follows. A potential is expressed in a Taylor series expansion about its critical point. A coordinate transformation is sought that reduces this potential to a canonical form. All potentials that can be reduced to the same canonical form then constitute an equivalence class having the same type of critical point. This will be elaborated more fully in the next sections.
2.2. Morse Form To start, we consider a critical point x* of a potential V[x] at which the Hessian is nonsingular: grad x V[x] = 0 at x*;
V_ij = {∂^2 V[x]/∂x_i ∂x_j | i, j = 1, ..., n} is nonsingular at x*.    (2)
Only the quadratic and higher degree terms of the Taylor series expansion of V[x] about x* will be nonzero. It now can be proved that there always exists a coordinate transformation so that, in a neighborhood of x*, V[x] can be expressed as

U[y] = Σ_{i=1,n} λ_i y_i^2,    (3)

where λ_i are the eigenvalues of the Hessian matrix V_ij. This simple quadratic form is the canonical Morse form for critical points of a potential at which the Hessian is nonsingular. The constructive proof given by Gilmore (1981, pp. 20-23) is elementary and consists of counting the number of disposable coefficients in the Taylor series expansion of a coordinate transformation that can be used to kill (i.e., transform to zero) coefficients in the Taylor series expansion of the transformed potential. From Equation (3) it follows that all critical points of potentials at which the Hessian is nonsingular belong to a single equivalence class. Hence, all of the equilibria of arbitrary gradient systems for which the Hessian is nonsingular are qualitatively similar.
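The Morse/non-Morse distinction can be illustrated numerically: approximate the Hessian of a potential at a critical point by finite differences and inspect its eigenvalues. The potential in the sketch below is our own toy example, chosen so that one eigenvalue vanishes at the origin; it is not taken from the chapter.

```python
# Numerical check of whether a critical point is Morse (nonsingular Hessian).
# Example potential: V(x, y) = x**4 + y**2, which has a degenerate critical point
# at the origin (the second derivative in the x direction vanishes there).
import numpy as np

def hessian(V, x, h=1e-4):
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (V(x + e_i + e_j) - V(x + e_i - e_j)
                       - V(x - e_i + e_j) + V(x - e_i - e_j)) / (4 * h * h)
    return H

V = lambda p: p[0] ** 4 + p[1] ** 2
eigenvalues = np.linalg.eigvalsh(hessian(V, np.array([0.0, 0.0])))
print(eigenvalues)   # one eigenvalue is (numerically) zero: a non-Morse critical point
```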
2.3. Thom Forms

If the Hessian at a critical point of a potential is singular (has zero eigenvalues), then it cannot be transformed to the Morse form, Equation (3). Suppose that the Hessian V_ij of a potential V[x] has m zero eigenvalues (in technical jargon: has nullity m). Then, it follows from the so-called splitting lemma that this potential can be reduced to

U[y] = F_dg[y_1, y_2, ..., y_m] + Σ_{i=m+1,n} λ_i y_i^2.    (4)
Thus the potential can be reduced to a form that consists of two parts: a degenerate part F_dg associated with the m degenerate eigenvalues of the Hessian V_ij and another part that is simply the Morse form associated with the nondegenerate eigenvalues of V_ij. The y coordinates associated with degenerate and nondegenerate directions in state space are split, hence the name of the lemma. Notice that in Equation (4) the part conforming to the Morse form already has a canonical form. Can the degenerate part F_dg also be transformed into a canonical form? To answer this question, we first note that, thus far, the parameters c have been left implicit because they play no role in the Morse form. That is, there is only one canonical Morse form, irrespective of the number of parameters in c. To arrive at canonical forms for F_dg, however, its dependence on c has to be made explicit: F_dg[y_1, ..., y_m; c*], where c* denotes the value of c at the critical point. We will see that canonical forms for F_dg depend on the number k of parameters in c as well as on m. Specifically, such canonical forms only exist if m is not larger than 2 and k is at most 7. To illustrate, let m = 1 and let there be k parameters. Then, the Taylor series expansion of F_dg about the degenerate critical point, as obtained from the splitting lemma, is given by
F_dg[y_1; c*] = Σ_i t_i y_1^i,  i = 3, 4, ...,    (5)
where it is understood that the critical point has been translated to y* = 0, and where the term for i = 0 is unimportant, the term for i = 1 is lacking because t_1 = 0 at a critical point, and the term for i = 2 is lacking because of the degeneracy. It can be shown (cf. Gilmore, 1981, pp. 26-27) that coordinate transformations of both y_1 and c can reduce Equation (5) to the following canonical form:

CG[1; k] = s z^(k+2);  s = 1 or −1,    (6)
where, in general, CG[m; k] stands for the so-called catastrophe germ of a k-parameter potential whose Hessian at the critical point has m vanishing eigenvalues, and where s is a dummy variable (there are in fact two canonical forms; one is the negative of the other). The canonical form, Equation (6), is based on a Taylor series expansion in terms of the state variables of the potential. This implies that Equation (6) is valid in an open neighborhood of the critical point in state space. But it is only valid at the point c = c*, because no Taylor series expansion of the parameters has been considered. The extension of the validity of Equation (6) to a neighborhood of the critical point in parameter space amounts to considering the effect of perturbations of c* on the catastrophe germ. The constructive approach followed by Gilmore to establish this extension is the same as before, that is, determine which terms in the Taylor series expansion of the perturbed catastrophe germ can always be removed by a coordinate transformation. In this way, a canonical form of all possible perturbations of a catastrophe germ is obtained,
which will be denoted by Pert[m; k]. Note that the canonical perturbations are different for different catastrophe germs and thus depend on m and k. The final result now reads, schematically,

Cat[m; k] = CG[m; k] + Pert[m; k],    (7)

where Cat[m; k] stands for (elementary) catastrophe. The canonical forms of Equation (7) are called the Thom forms. In particular, if k = 2, then it can be shown that the canonical perturbation of Equation (6) is given by Pert[1; 2] = c_1 z + c_2 z^2, yielding the cusp catastrophe, which figures predominantly in the chapter by van der Maas and Molenaar:

Cat[1; 2] = s z^4 + c_2 z^2 + c_1 z;  s = 1 or −1.    (8)
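The equilibrium structure of the cusp form (8) is easy to explore numerically: for given control values (c_1, c_2) the equilibria are the real roots of dV/dz = 0, and the stable modes are those roots at which the second derivative is positive. The sketch below takes s = 1; the parameter values in the example are arbitrary illustrations.

```python
# Equilibria of the cusp potential V(z) = z**4 + c2*z**2 + c1*z (Equation (8), s = 1).
# dV/dz = 4z**3 + 2*c2*z + c1; minima (stable modes) satisfy 12z**2 + 2*c2 > 0.
import numpy as np

def cusp_stable_modes(c1, c2):
    roots = np.roots([4.0, 0.0, 2.0 * c2, c1])
    real = roots[np.abs(roots.imag) < 1e-8].real
    return sorted(z for z in real if 12.0 * z ** 2 + 2.0 * c2 > 0)

print(cusp_stable_modes(c1=0.0, c2=1.0))    # one stable mode: outside the cusp region
print(cusp_stable_modes(c1=0.0, c2=-1.0))   # two stable modes: inside the bifurcation set
```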
2.4. Discussion

Our heuristic outline of elementary catastrophe theory focuses on the role of coordinate transformations in the derivation of canonical forms for the critical points of potentials. There are several reasons for choosing this particular point of view. The mathematical theory underlying Equation (7) is very abstract and therefore may not be readily accessible to developmental biologists and psychologists. The constructive transformational approach used by Gilmore (1981) may fare better in creating a valid mental picture within the confines of a single chapter. Also, this constructive approach may be less known because most textbooks present a much more geometrically inspired perspective on catastrophe theory (e.g., Castrigiano & Hayes, 1993; Poston & Stewart, 1978; Thom, 1975). Last but not least, as will be indicated in the next section, it turns out that the problems arising from the extension of catastrophe theory to stochastic metrical systems are closely linked to the way in which coordinate transformations for these systems are defined.

As alluded to earlier, elementary catastrophe theory constitutes a powerful approach to the study of self-organization in ill-defined systems. Self-organization typically takes place in a "punctuated" way, in which the stability of the equilibrium of a system gradually decreases and becomes degenerate as a result of parameter variation over an extended period of time, after which a sudden shift to a new, qualitatively different equilibrium occurs. The ongoing behavior of a system is organized around its equilibrium states and therefore a qualitative shift in equilibrium will manifest itself by the emergence of new behavioral types. The results of catastrophe theory, based on mathematical deduction, show that under mild conditions (see the quotation of Jackson, 1989, in section 1) the occurrence of such stage transitions in epigenetical processes can be detected and modeled without a priori knowledge of the system dynamics. To apply elementary catastrophe theory, one only has to identify the degenerate
coordinates of a system's equilibrium point and the parameters whose quasi-static variation induces the loss of stability. In the jargon of catastrophe theory, the degenerate coordinates are called the behavioral variables, and the parameters controlling system stability are called the control variables. Given a set of empirical measurements of candidate behavioral and control variables, and allowing for all possible coordinate transformations, the appropriate Thom forms then can be fitted to the data to determine whether a genuine stage transition is actually present and to which class it belongs.

In closing this section, it must be noted that our outline of catastrophe theory, although providing a convenient stepping stone to the next section, is incomplete in almost all respects. It is one thing to present a constructive approach; it is quite another to specify the necessary and sufficient conditions for such an approach to hold. Also, we do not give a complete list of the Thom forms. All of this is contained in the excellent monograph by Castrigiano and Hayes (1993).
3. CATASTROPHE THEORY FOR METRICAL STOCHASTIC SYSTEMS

Elementary catastrophe theory pertains to deterministic processes generated by gradient systems. In contrast, real developmental processes encountered in the biological and social sciences almost never can be forecasted with complete certainty and therefore do not constitute deterministic but stochastic processes. To extend catastrophe theory to the analysis of these stochastic processes, one first has to specify the relevant ways in which gradient systems can be generalized to include random influences. There are several distinct ways in which a gradient system can become stochastic (cf. Molenaar, 1990), but it will be sufficient for our purposes to restrict attention to two of them. First, consider the system:

d/dt x(t) = grad_x V[x]
y(t) = x(t) + e(t),    (9)

where e(t) denotes random measurement noise. Here, the manifest process y(t) is stochastic, but the latent gradient system is still deterministic. The random measurement noise in (9) does not affect the system dynamics and thus prediction of x(t) by means of sequential nonlinear regression techniques will become perfect when t approaches infinity. In contrast, the role of random influences is entirely different in the following system:

d/dt x(t) = grad_x V[x] + w(t)
y(t) = x(t),    (10)

where the random process w(t), called the innovations process, directly affects the system dynamics. This implies that x(t) can never be predicted with complete certainty, even if t approaches infinity.
In what follows, we restrict attention to (10), where the system dynamics is intrinsically stochastic. This constitutes by far the most interesting (and, admittedly, also the most problematic) case. Although it is possible to allow for measurement noise e(t) in (10), we will avoid this complication because it is not essential to our purposes. Hence the second equation of (10) can be dropped because it involves a trivial identity transformation, after which x(t) will denote the manifest process. We are thus left with the first equation of (10), which constitutes an instance of a stochastic differential equation (SDE). Following the notational conventions appropriate for SDEs (to be explained in the next section), the starting point of our discussion is then given by the following metrical stochastic system:

dx(t) = grad_x V[x] dt + dw(t).    (11)
3.1. Stochastic Differential Equations

The theory of SDEs can be formulated in a number of ways (e.g., employing martingale methods or semigroup methods; cf. Revuz & Yor, 1991). We will have to leave these elegant formal constructions for what they are, however, and concentrate on the mere presentation of a few basic notions that are of importance. The first of these is the Wiener process w(t). Consider the following process in discrete time: A (presumably drunken) man moves along a line, taking, at random, steps to the left or to the right with equal probability. We want to know the probability that he reaches a given point on the line after a given elapsed time. This process is the well-known random walk. If now both the unit of time and the step size become infinitesimally small, then the one-dimensional Wiener process w(t) is obtained. Given that this Wiener process starts at the origin, w(0) = 0, the probability that it reaches point w at time t, p(w(t) = w | w(0) = 0), can be shown to be Gaussian with mean zero and variance proportional to t. The sample paths of a Wiener process are continuous, that is, each realization is a continuous function of time. Yet, it can be shown (cf. Gardiner, 1990) that these sample paths are nondifferentiable. This implies that a realization of the Wiener process is exceedingly irregular, as is also reflected by the linear dependence of the (conditional) variance on elapsed time t. Of particular importance is the statistical independence of the increments of w(t). Let δw(t_i) = w(t_i) - w(t_{i-1}) denote these increments, where t_i, i = 1, 2, ..., n, is a partition of the time axis; then the joint probability of the δw(t_i) is the product of normal distributions, each with mean zero and variance δt_i = t_i - t_{i-1}. We now define the differential dw(t) by letting δt_i become infinitesimally small. But w(t) is nondifferentiable, hence dw(t) does not exist. On the other hand, the integral of dw(t) exists and is the Wiener process w(t). In particular:
w(t) - w(0) = ∫_0^t dw(s),    (12)

and this integral equation can be interpreted consistently. Thus dw(t) can be conceived of as a convenient abstraction to be used to arrive at a consistent definition of stochastic differential equations. We are now in a position to provide a definition of an n-dimensional SDE. Let w(t) denote an n-dimensional Wiener process with uncorrelated components: E[w_i(t) w_j(s)] = min(t, s)δ_ij, where E denotes the expectation operator and δ_ij is Kronecker's delta, that is, δ_ij = 1 if i = j and δ_ij = 0 otherwise. In addition, let a[x] denote an n-dimensional so-called drift vector and B[x] an (n × n)-dimensional diffusion matrix. Both drift and diffusion are arbitrary functions of x(t). Then consider the SDE

dx(t) = a[x] dt + B[x] dw(t),    (13)
which again can be conceived of as the formal derivative of the integral equation

x(t) - x(0) = ∫_0^t a[x] dt + ∫_0^t B[x] dw(t).    (14)
Note that if a[x] = grad_x V[x] and B[x] = I_n, that is, the (n × n) identity matrix, then Equation (13) reduces to Equation (11). For our purposes, it suffices to consider two aspects of Equation (13). First, it should be noted that this SDE does not obey the usual transformation rules for differential equations. Instead, the famous Ito formula describes the way in which Equation (13) transforms under a change of variables. Specifically, let n = 1 for ease of presentation and consider an arbitrary smooth function of x(t): f[x]. Then, Ito's formula is given by

df[x] = {a[x]f'[x] + b[x]²f''[x]/2} dt + b[x]f'[x] dw(t),    (15)
where f'[x] and f''[x] denote, respectively, the first- and second-order derivatives of f[x] with respect to x(t). Second, it should be noted that, given initial conditions, Equation (13) specifies the evolution of the probability density function of x(t) as a function of time t. Because the drift and diffusion terms in Equation (13) do not explicitly depend on time, it follows that the probability density function of x(t) becomes stationary if t approaches infinity. That is,

p(x, t) → p(x)  as  t → ∞.    (16)
The stationary probability density p(x) is a function of the drift a[x] and the diffusion matrix B[x]. This functional dependence takes on a particularly simple form if x(t) obeys the so-called detailed balance condition. Heuristically speaking, a process satisfies detailed balance if in the stationary situation each possible transition (from x(t) = x_1 to x(t + dt) = x_2) balances with the reverse transition (cf. Gardiner, 1990, pp. 148-170). If this is the case, that is, if a[x] and B[x] obey the formal criteria for detailed balance, then the stationary probability density is given by

p(x) = exp(-Φ[x]),    (17)

where

Φ[x] = ∫^x dy · z[a, B, y]
z_i[a, B, x] = Σ_k B_ik^{-2}[x] {2a_k[x] - Σ_j ∂/∂x_j B_kj^2[x]},    i = 1, 2, ..., n.
The dot in the first equation denotes the vector inner product (of the n-dimensional vectors dy and z). The important point conveyed by (17) is that under detailed balance, the stationary probability density of a homogeneous process x(t) is given by p(x) = exp(-Φ[x]), where Φ[x] is a potential. Moreover, this potential Φ[x] is a nonlinear function of the drift and diffusion terms in the SDE.
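To give the Wiener process described in this subsection a concrete form, the short simulation below (a minimal illustrative sketch in Python; the step size, interval length, number of replications, and random seed are arbitrary choices) approximates w(t) by accumulating independent normal increments and checks that the variance of w(t) is indeed proportional to the elapsed time t.

    import numpy as np

    rng = np.random.default_rng(1)
    dt = 0.001                # time step used to approximate the continuous process
    T = 1.0                   # length of the observation interval
    n_steps = int(T / dt)
    n_paths = 5000            # number of independent realizations

    # Each path is the cumulative sum of independent N(0, dt) increments.
    increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    w = np.cumsum(increments, axis=1)

    # Var[w(t)] should be approximately equal to t for every t in (0, T].
    for t in (0.25, 0.5, 1.0):
        idx = int(t / dt) - 1
        print(f"t = {t:4.2f}   empirical Var[w(t)] = {w[:, idx].var():.4f}")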
3.2. Cobb's Approach

In a number of papers, Cobb has elaborated a stochastic catastrophe theory for metrical systems (e.g., Cobb, 1978, 1981; Cobb, Koppstein, & Chen, 1983). We will focus on the paper by Cobb and Zacks (1985) in which the link with the SDEs is explicitly addressed. What follows is a synopsis of the relevant parts of the latter paper. Let n = 1 and consider the following instance of Equation (13):

dx(t) = grad_x V[x] dt + b[x] dw(t),    (18)

which also can be regarded as a slight generalization of Equation (11). Now the first step in Cobb and Zacks's approach consists of substituting an elementary catastrophe for the potential V[x]. In particular, V[x] is taken to be the cusp catastrophe given by Equation (8). In the second step, the stationary probability density of Equation (18) is determined. An application of (17) to (18) yields, after some rewriting, the stationary density in the form as given by Cobb and Zacks:

p(x) = exp(2 ∫^x {g[y]/b²[y]} dy),    (19)
where g[x] = V'[x] - b'[x]b[x], and the apostrophe denotes differentiation with respect to x. Cobb and Zacks then proceed by considering several functional forms for b²[x]: a constant, a linear, or a quadratic function of the state x, and so on. This yields distinct family types of stationary densities, one for each functional form of b²[x] (and g[x]).
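As a numerical illustration of Equation (19) (a sketch with arbitrarily chosen ingredients, not taken from Cobb and Zacks: a cusp-type drift and two hypothetical diffusion forms), the code below evaluates the stationary density on a grid for a constant and for a state-dependent b²[x], showing that the same potential V[x] yields densities from different families.

    import numpy as np

    x = np.linspace(-3.0, 3.0, 2001)

    def v_prime(z):
        return 2.0 * z - z ** 3          # gradient of a cusp-type potential V[x]

    def stationary_density(b, b_prime):
        """Normalized version of Equation (19) evaluated on the grid x."""
        g = v_prime(x) - b_prime(x) * b(x)
        integrand = 2.0 * g / b(x) ** 2
        # cumulative trapezoidal integral of the integrand from x[0] up to x
        cumint = np.concatenate(([0.0], np.cumsum(
            0.5 * (integrand[1:] + integrand[:-1]) * np.diff(x))))
        p = np.exp(cumint - cumint.max())            # rescale for numerical stability
        return p / (p.sum() * (x[1] - x[0]))         # normalize to integrate to one

    def modes(p):
        return x[np.where((p[1:-1] > p[:-2]) & (p[1:-1] > p[2:]))[0] + 1]

    # Constant diffusion, b^2[x] = 1
    p_const = stationary_density(lambda z: np.ones_like(z), lambda z: np.zeros_like(z))
    # State-dependent diffusion, b^2[x] = 1 + x^2
    p_state = stationary_density(lambda z: np.sqrt(1.0 + z ** 2),
                                 lambda z: z / np.sqrt(1.0 + z ** 2))

    print("modes, constant diffusion:       ", modes(p_const))
    print("modes, state-dependent diffusion:", modes(p_state))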
3.3. Issues with Cobb's Approach

As alluded to earlier, the gist of catastrophe theory is to arrive at a minimal classification of the possible equilibrium forms of gradient systems undergoing quasi-static parameter variation. This is accomplished by transforming away all inessential details of particular instantiations of gradient systems and concentrating on the canonical forms thus obtained. In this way, an infinite variety of equilibrium forms is reduced to a minimal classificatory scheme that can be applied without specification of the underlying dynamical system equations. It seems, however, that Cobb's approach does not conform to this basic reductionistic tenet of catastrophe theory. The typology of stationary probability densities arrived at in his approach depends on b²[x] and, through g[x], on V[x]. This implies that each elementary catastrophe that is substituted for V[x] gives rise to a potentially infinite number of density types, one for each distinct form of b²[x]. Hence, it would seem that Cobb's approach creates a plethora of canonical forms, which reintroduces the need to consider the details of each system in order to determine its equilibrium type.

Note, however, that this presumed state of affairs rests on the assumption that Equation (19) indeed represents the proper canonical form of stationary probability densities. That is, (19) should represent what is left after transforming away all inessential details of particular instantiations of stochastic gradient systems. But obviously this is not the case. It is always possible to find a change of variable y = f[x] such that application of Ito's formula, Equation (15), to Equation (18) will yield a transformed SDE in which the diffusion term b²[y] = 1. In fact, for each given instance of Equation (18), the diffusion term can be transformed into an infinite variety of forms by judicious application of Ito's formula. Consequently, the stationary density types in Cobb's approach are not canonical in the intended sense because they depend on the specific forms of diffusion terms, whereas these diffusion terms are not invariant under smooth coordinate transformations.
3.4. An Alternative Approach

The classificatory scheme of Cobb depends on a bivariate feature vector: the potential V[x] and the diffusion term b²[x]. Given a fixed form of V[x], one can get different density types for different forms of b²[x]; and vice versa, given a
fixed form of b²[x], different density types are obtained for different forms of V[x]. Pertaining to this, it was shown in the previous section that the diffusion term is not invariant under smooth coordinate transformations and therefore is not suitable to index canonical forms. Hence, one plausible alternative approach consists of trying to redefine the feature vector underlying the classificatory scheme so that it no longer depends on the diffusion term in the way envisaged by Cobb. This could be accomplished by simply removing the diffusion term from the feature vector, which would result in a univariate feature vector composed only of the potential. Note, however, that this implies a redefinition of the featured potential itself. We address this important point first.

The potential in Cobb's approach is defined as the entity whose gradient yields the drift term in Equation (18). This potential is a component of g[x], where g[x] itself is a component of the stationary density, Equation (19). One can only recover Cobb's potential from the stationary density (19) if the diffusion term b²[x] is given, which is why Cobb's approach depends on a bivariate feature vector. Thus in order to arrive at a classificatory scheme of stationary densities that depends only on a potential, Cobb's definition of the featured potential will not do. Instead, we will have to turn to the definition of the stationary probability density given by Equation (17), taking n = 1: p(x) = exp(-Φ[x]). Note that the potential Φ[x] in Equation (17) is entirely different from Cobb's potential V[x] underlying the drift term in Equation (18). In terms of Cobb's scheme based on (18), Φ[x] is a nonlinear function of V[x] and the diffusion term b²[x]. Moreover, Φ[x] completely characterizes stationary densities and hence can be used as the univariate feature indexing canonical forms for these densities.

In summary, it would seem that a minimal classificatory scheme for stationary probability densities can be based on the canonical forms of the potential Φ[x] in Equation (17). It is then required that x(t) obey the detailed balance condition (which is automatically the case if n = 1). Taking Φ[x] as the featured potential implies that, in contrast with Cobb's approach, no distinct reference is made to the drift term and/or the diffusion term in an SDE. This alleviates the problems in Cobb's approach, which, as alluded to earlier, arise from transformation of the diffusion term. We now can concentrate on the way Φ[x] behaves under coordinate transformations, in which the transformations concerned are given by the Ito formula, Equation (15). The main question then becomes: Does application of Ito's formula (15) to Φ[x] in (17) yield the same elementary catastrophes as the application of local diffeomorphisms, Equation (1), to the potential associated with a deterministic gradient system? In other words, is there a direct transfer of elementary catastrophe theory to metrical stochastic systems? The answer is, unfortunately, negative and leads to some rather fundamental problems that we have studied intensely for the past year.
3.4.1. The main problem

It can be shown that the collection of all possible potential forms in Equation (17) constitutes a single equivalence class (type). That is, each given form of Φ[x] can be transformed into any other possible form by means of Ito's formula. In the words of Zeeman (1988), all potential forms in Equation (17) are "diffeomorphically measure equivalent." For instance, let Φ[x] be the cusp catastrophe Cat[1;2] given by Equation (8), and let the values of the control variables c_1 and c_2 be chosen so that (8) has three critical points (two of which are stable and one is unstable). Then, the associated stationary probability density p(x) = exp(-Cat[1;2]) has two modes corresponding to the two stable critical points of the cusp catastrophe. This bimodal density with cuspoid potential, however, can always be transformed by means of Ito's formula into a unimodal density for which the potential has Morse form (Hartelman, van der Maas, & Molenaar, 1995). Thus the distinction between Thom forms and Morse forms, a distinction that is central to the program of catastrophe theory, collapses at the level of modes of stationary densities of metrical stochastic systems.

To illustrate this collapse at the level of modes of stationary densities, consider the simulated realization of dx(t) = (2x(t) - x(t)³) dt + dw(t), 0 < t < 100, shown in Figure 1. The drift term of x(t) is given by the gradient of the cusp catastrophe and the diffusion term is 1; in the notation of Equation (13), a[x] = 2x(t) - x(t)³ and b[x] = 1. It then follows directly from Equation (17) that the stationary probability density of x(t) is given by p(x) = c⁻¹exp{-2(x⁴/4 - x²)}, where c⁻¹ is a normalizing constant. Hence, the stationary density p(x) is bimodal.
FIGURE 1. Simulated realization by means of numerical integration of a univariate SDE, dx(t) = a[x] dt + b[x] dw(t), 0 < t < 100, where the drift a[x] = 2x - x³ and the diffusion term b[x] = 1.
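A realization of this kind is easy to reproduce. The following sketch (an illustration with an arbitrary step size and random seed, not the authors' original program) integrates the SDE of Figure 1 by the Euler-Maruyama method and tabulates a histogram of the realized values, which should display the two modes of the stationary density, as in Figure 2.

    import numpy as np

    rng = np.random.default_rng(7)
    dt = 0.001
    n_steps = int(100 / dt)                  # 0 < t < 100
    x = np.empty(n_steps)
    x[0] = 0.0

    # Euler-Maruyama integration of dx = (2x - x^3) dt + dw
    for i in range(1, n_steps):
        drift = 2.0 * x[i - 1] - x[i - 1] ** 3
        x[i] = x[i - 1] + drift * dt + rng.normal(0.0, np.sqrt(dt))

    counts, edges = np.histogram(x, bins=40)
    for c, lo, hi in zip(counts, edges[:-1], edges[1:]):
        print(f"[{lo:6.2f}, {hi:6.2f})  {'*' * (60 * c // counts.max())}")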
A rough estimate of p(x) can be obtained by determining a histogram of the realized values shown in Figure 1. This histogram is presented in Figure 2 and clearly has the expected bimodal form. We next consider the following transformation of x(t): y(t) = f[x] = x(t) + x(t)³. An application of Ito's formula (15) this time yields the drift and diffusion terms of the y(t) process: a[y] = 8x(t) + 5x(t)³ - 3x(t)⁵, b[y] = 1 + 3x(t)². Figure 3 shows a realization of this y(t) process, where again 0 < t < 100, and Figure 4 shows the histogram associated with this realization. It turns out that this histogram, and hence the stationary density of the transformed x(t) process, is now unimodal, illustrating the main problem discussed in this section.

3.4.2. Zeeman's solution

The stationary density, Equation (17), constitutes a probability measure. Smooth coordinate transformations (diffeomorphisms) in the space of measures are defined by Ito's formula. In contrast, the potential associated with a deterministic gradient system constitutes a function. The space of continuous functions is dual to the space of measures, and smooth coordinate transformations in function space are defined by Equation (1). We have already seen that the crucial distinction in function space between Morse forms and Thom forms collapses in measure space because these forms are diffeomorphically measure equivalent. To avoid this collapse, Zeeman (1988) proposes to treat the measure defined by Equation (17) as a function.
FIGURE 2. Histogram of the realized x(t) values depicted in Figure 1.
FIGURE 3. Simulated realization by means of numerical integration, 0 < t < 100, of the transformed x(t) process depicted in Figure 1. The transformation is y(t) = f[x] = x + x³.
More specifically, he states (1988):

We regard the steady state u [our stationary density (17)] as a tool, a quantitative property of v [the deterministic gradient system acting as drift term in an SDE]. And then, having constructed the tool, we can use the tool in any way we please to study v. . . . Therefore we can apply the qualitative theory of functions to the tool to obtain a qualitative description of v. (p. 132)
FIGURE 4. Histogram of the realized y(t) values depicted in Figure 3.
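The transformed realization and its histogram can be reproduced in the same manner. The sketch below (an illustration repeating the Euler-Maruyama simulation given after Figure 1 and then applying y = x + x³ directly to the simulated path; the step size and seed are again arbitrary) prints a histogram of the transformed values, of the kind shown in Figure 4.

    import numpy as np

    rng = np.random.default_rng(7)
    dt = 0.001
    x = np.empty(int(100 / dt))
    x[0] = 0.0
    for i in range(1, x.size):               # same SDE as in the sketch after Figure 1
        drift = 2.0 * x[i - 1] - x[i - 1] ** 3
        x[i] = x[i - 1] + drift * dt + rng.normal(0.0, np.sqrt(dt))

    y = x + x ** 3                           # the transformation y(t) = f[x] = x + x^3

    counts, edges = np.histogram(y, bins=40)
    for c, lo, hi in zip(counts, edges[:-1], edges[1:]):
        print(f"[{lo:7.2f}, {hi:7.2f})  {'*' * (60 * c // counts.max())}")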
An application of Zeeman's solution to avoid the collapse of distinct canonical forms seems to be straightforward. For a given stochastic process, one first determines the stationary probability density. Then the potential Φ[x] associated with this stationary density can be conceived of as a function; that is, Φ[x] is treated in the same way as the potential V[x] associated with deterministic gradient systems and thus is amenable to smooth coordinate transformations defined by Equation (1). It then follows that the elementary catastrophes obtained for V[x] also yield the canonical forms for the critical points of Φ[x] (and hence for the modes of the associated stationary density).

3.4.3. Discussion

Zeeman's solution is presented in a fundamental paper (Zeeman, 1988) that introduces a new definition of the stability of dynamical systems. It is one of the rare technical publications dealing with catastrophe theory of stochastic systems, apart from Cobb's work, and presents a challenging and innovative point of view. Here, however, we have to restrict our discussion to the shift from measure space to function space proposed by Zeeman. It is clear that this shift prevents the collapse of Morse and Thom forms referred to in section 3.4.1. However, there are some issues that require further scrutiny. Perhaps the most important of these concerns the "timing" of the shift, that is, at which stage in the derivation of the stationary density in measure space does one shift to a function space perspective? To elaborate, suppose we are given a realization x(t), for t in the interval [0, T], of a homogeneous stochastic process. Then first the stationary density has to be estimated, for instance, by a cubic spline fit to the empirical histogram. Let the estimated potential obtained be denoted by Φ[x | T]. Is this the potential that according to Zeeman's approach has to be treated as the function characterizing the observed process? Or should Ito's formula be applied to Φ[x | T], yielding a transformed potential Φ[f[x] | T], before the shift to a function space perspective is made? It would seem that Ito's transformation should not be made, because that would open up the door again for the problem discussed in section 3.4.1 (e.g., Φ[x | T] might be a Thom form which is Ito transformed to a Morse form Φ[f[x] | T]). Accordingly, it appears to be the estimated potential Φ[x | T] in the stationary density of the given realization x(t) to which Zeeman's solution applies. This assigns a special status to the actual measurements, to their operationalization, dimensionality, and scale. For instance, it may make a difference to Zeeman's solution whether the behavior of a system is measured in terms of amplitude or in terms of energy (which is proportional to the second power of amplitude). This special status of the given (chosen) measurements, however, will be problematic for the social sciences, where fundamental measures (dimensions) are rare and it is often a matter of taste as to which measurement scales are being used.

It should be noted that Zeeman's aims in his 1988 paper are different from
ours. Zeeman introduces SDEs with known properties (e.g., their variance ε, where ε > 0, is small) to define a so-called ε-smoothing of the equilibria of deterministic gradient systems and thereby arrive at new definitions of equivalence and stability of gradient systems. In social scientific applications, however, such a priori information is almost always lacking. It then appears that Zeeman's solution presents some problems of its own that require further elaboration. Yet, a shift from the space of measures to function space as stipulated by Zeeman would seem unavoidable in order to prevent the collapse of canonical forms for the potentials in stationary densities. Moreover, the focus on stationary probability densities in stochastic catastrophe theory is plausible because these stationary densities constitute direct analogues of the equilibria of deterministic gradient systems that figure in elementary catastrophe theory. If, however, one were to change the focus on stationary densities, then one could look for an alternative characterization of stochastic processes for which the canonical forms in measure space keep the distinction between elementary catastrophes intact. In such an approach, it would no longer be necessary to shift from a measure space to a function space perspective. In closing this section, we present the outline of one such alternative approach.

To reiterate, we are looking for an alternative measure that characterizes the equilibria of homogeneous stochastic systems while avoiding the problems with the stationary probability density mentioned in section 3.4.1. Such a characterization should not cause a collapse into a single canonical form under Ito transformation, but instead keep intact in measure space the distinct canonical forms of elementary catastrophe theory. Recently, Hartelman has identified such an alternative characterization: the number of level-crossings n_l(T) of a stochastic process, where l is the level and [0, T] the interval of observation (cf. Florens-Zmirou, 1991, for a detailed discussion of level-crossings in SDEs). It can be shown (Hartelman et al., 1995) that the number of modes of the measure n_l(T) stays invariant under Ito transformation. This would seem sufficient to establish something approaching a one-to-one correspondence between canonical forms of n_l(T) in measure space and elementary catastrophes in function space. Thus it appears to be possible to arrive at a sensible catastrophe theory of stochastic systems in measure space which does not require the shift to a function space perspective underlying Zeeman's approach.
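The level-crossing idea can be illustrated informally as follows (a sketch of the idea only, not the formal construction of Hartelman et al., 1995; the simulated process, levels, and seed are arbitrary): count, for a grid of levels l, how often a simulated path crosses each level. Because a strictly monotone transformation y = f[x] maps a crossing of x at level l into a crossing of y at level f(l), the profile of crossing counts is unaffected by such a transformation.

    import numpy as np

    rng = np.random.default_rng(7)
    dt = 0.001
    x = np.empty(int(100 / dt))
    x[0] = 0.0
    for i in range(1, x.size):               # cusp drift, unit diffusion, as in section 3.4.1
        drift = 2.0 * x[i - 1] - x[i - 1] ** 3
        x[i] = x[i - 1] + drift * dt + rng.normal(0.0, np.sqrt(dt))

    def n_crossings(path, level):
        """Number of times the path crosses the given level."""
        s = np.sign(path - level)
        return int(np.sum(s[1:] * s[:-1] < 0))

    levels = np.linspace(-2.0, 2.0, 17)
    n_x = [n_crossings(x, l) for l in levels]

    # Crossing counts are unchanged by the monotone transformation y = x + x^3.
    y = x + x ** 3
    n_y = [n_crossings(y, l + l ** 3) for l in levels]

    for l, a, b in zip(levels, n_x, n_y):
        print(f"level {l:6.2f}   crossings of x: {a:6d}   of y at f(l): {b:6d}")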
4. CATASTROPHE THEORY FOR DISCRETE STOCHASTIC SYSTEMS

Until now, we have been considering metrical deterministic and stochastic systems, that is, systems whose state (behavioral) variables are real-valued. Moreover, we have considered the evolution of these systems in continuous
time under smooth quasistatic variation of the system parameters (control variables). In this setting of continuous variation of state variables, parameters, and time, it was shown that discrete changes (elementary catastrophes) emerge in the equilibrium forms of the systems concerned. These discrete changes in equilibrium type constitute a rather special instance of what is the main theme of this book, categorical variables. In this section, however, we consider categorical-valued processes in another, more direct sense. In particular, we outline a possible extension of the program of catastrophe theory to stochastic systems with categorical state variables. The deliberations will remain at a rather general level, mainly concentrating on the possibility of deriving potentials for discrete stochastic systems as a prerequisite for the application of catastrophe theory. We start with potential theory for denumerable Markov chains in discrete time.
4.1. Potentials for Markov Chains

Let S denote the set of categories making up the state space of a Markov chain. We denote representative elements of S by i, j, k, and so on. Let x_n, where n is integer-valued, represent the Markov chain. Then it holds that

Pr[x_{n+1} = i | x_n = j, x_{n-1} = k, ..., x_0 = l] = Pr[x_{n+1} = i | x_n = j],    (20)
where Pr[a | b] denotes the conditional probability of a given b. In addition, we assume that the Markov chain is homogeneous:

Pr[x_{n+1} = i | x_n = j] = Pr[x_{n+m+1} = i | x_{n+m} = j].    (21)
For the state space S, the complete set of conditional probabilities given by Equation (21) can be conveniently collected in a square, so-called transition matrix P. Then, given the starting probabilities Pr[x_0 = i], the measure on the state space S of a Markov chain is completely determined by the transition matrix P. Now the question we want to address is whether a potential can be defined for a Markov chain. Perhaps the most extensive treatment of this question can be found in a monograph by Kemeny, Snell, and Knapp (1976). They show that almost all results of potential theory generalize to Markov chains, although some require further assumptions that are analogous to the detailed balance condition for metrical stochastic systems alluded to earlier in section 3.1. More specifically, they present Markov chain analogues of potential theory concepts such as charge, potential operator, harmonic function, and so on. In particular, the Markov chain analogues of the potential are

lim c_l(I + P + ... + Pⁿ)  or  lim (I + P + ... + Pⁿ)c_r,    (22)
where lim denotes the limit for n approaching infinity, c_l is a finite normed row vector called the left charge, and c_r a finite normed column vector called the right charge.
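A small numerical illustration of the right-charge analogue in Equation (22) may help (this is an arbitrary example, not taken from Kemeny et al., 1976): for a finite chain in which all states shown are transient, the partial sums converge and equal the fundamental matrix (I - P)⁻¹ applied to the charge.

    import numpy as np

    # Substochastic transition matrix among three transient states; each row sums
    # to less than one, the remaining probability mass being absorbed elsewhere.
    # The numbers are an arbitrary example.
    P = np.array([[0.2, 0.3, 0.1],
                  [0.1, 0.4, 0.2],
                  [0.3, 0.1, 0.3]])

    c_r = np.array([1.0, 0.0, 2.0])          # a right charge (column vector)

    # Partial sums (I + P + ... + P^n) c_r
    partial = np.zeros(3)
    P_power = np.eye(3)
    for n in range(200):
        partial = partial + P_power @ c_r
        P_power = P_power @ P

    # Closed form of the limit: the fundamental matrix (I - P)^{-1} applied to c_r
    potential = np.linalg.solve(np.eye(3) - P, c_r)
    print(partial)                            # the partial sums have converged
    print(potential)                          # ... to the potential function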
In closing this section, note that Equation (22) gives two Markov chain analogues of the potential, one associated with a left charge c_l and the other with a right charge c_r. Kemeny et al. (1976) refer to the analogue associated with the right charge as the potential function, whereas the analogue associated with the left charge is denoted by the potential measure. To appreciate this difference, note that Kemeny et al. define Lebesgue integration on the denumerable state space S of a Markov chain by a vector inner product in which a nonnegative row vector constitutes the measure and a column vector constitutes the function (1976, p. 23). This would seem to imply that only the potential function associated with the right charge can be conceived of as the Markov chain analogue of the potential in elementary catastrophe theory. But regarding the potential measure associated with the left charge, they state (Kemeny et al., 1976, p. 182): "Classically, potentials are left as point functions and are never transformed into set functions because such a transformation is frequently impossible. In Markov chain potential theory, however, every column vector can be transformed into a row vector by the duality mapping." Note that this resembles the shift from measure space to function space in Zeeman's approach to stochastic catastrophe theory for metrical stochastic systems (cf. section 3.4.2). At present, however, nothing more can be said about this interesting correspondence because a catastrophe theoretical analysis of the Markov chain potentials given by Equation (22) is still lacking.
4.2. Potentials for Markov Processes

A continuous time homogeneous Markov process x(t) with denumerable state space S obeys conditions similar to Equations (20) and (21). For such processes, it is possible to derive a so-called master equation (e.g., Van Kampen, 1981). Denoting Pr[x(t) = i] by p_i(t), the master equation is given by

dp_i(t)/dt = Σ_j [w_ij p_j(t) - w_ji p_i(t)],    (23)
where w_ij is the transition probability per unit time from state j to state i. If for the state space S the complete set of transition probabilities is collected in a square matrix W, where the ith diagonal element of W is defined by w_ii = -Σ_j w_ji for j unequal to i, then we obtain a representation that is similar to the one for Markov chains given in the preceding section. On the basis of this representation, Markov process analogues for potential theory concepts can again be derived. For this, we refer to the large literature on potential theory for Markov processes, including the momentous monograph by Doob (1984).
4.3. Discussion

The message of the preceding two sections can be summarized in a simple statement: there exists a well-developed potential theory for denumerable Markov
chains and processes. This would seem to provide us with a convenient stepping stone for the application of catastrophe theory to these types of stochastic systems. Unfortunately, such a catastrophe theoretical analysis has not yet been undertaken. At present, we do not know whether an application of catastrophe theory to the potential analogues of Markov chains and processes will prove to be fruitful. Perhaps alternative approaches may fare better, such as the approximation of denumerable Markov processes by stochastic differential equations to generalize the results presented in section 3.

We have concentrated on the representation of homogeneous Markov chains and processes in terms of the transition matrices P and W, respectively. It should be noted that these representations are inherently linear and therefore would seem to be uninteresting for further catastrophe theory analysis in which the focus is on nonlinear potentials. Yet, the latter conjecture does not hold: for instance, Gilmore (1981) shows how the program of catastrophe theory is extended to matrices. More specifically, suppose that the transition rates w_ij in the master equation (23) are nonlinear functions in i, j and depend on the control variables c. Suppose also that the detailed balance condition holds, implying that the W matrix is symmetric. Then it follows that the stationary probability density of the Markov process is given by the eigenvector of W associated with the largest (zero) eigenvalue (cf. Van Kampen, 1981, p. 126). The components of this eigenvector are nonlinear functions in the states i, j, and will depend smoothly on the control variables in c (Gilmore, 1981). Hence, despite the linearity of the master equation in W, its stationary solution provides the proper setting for further catastrophe theoretical analysis.
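The statement about the stationary solution can be checked numerically. The sketch below (an arbitrary three-state example with hypothetical rates; it does not impose detailed balance, which is not needed for this particular check) integrates the master equation (23) for a small W matrix and compares the result with the eigenvector of W belonging to the zero eigenvalue.

    import numpy as np

    # Off-diagonal entries w_ij: transition probability per unit time from state j to i.
    W = np.array([[0.0, 0.5, 0.2],
                  [0.3, 0.0, 0.4],
                  [0.6, 0.1, 0.0]])
    # Diagonal entries w_ii = -sum_j w_ji, so that each column of W sums to zero.
    W = W - np.diag(W.sum(axis=0))

    # Crude Euler integration of dp/dt = W p from an arbitrary starting distribution.
    p = np.array([1.0, 0.0, 0.0])
    dt = 0.01
    for _ in range(20000):
        p = p + dt * (W @ p)

    # Stationary distribution: eigenvector of W associated with the zero eigenvalue.
    vals, vecs = np.linalg.eig(W)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    v = v / v.sum()

    print(p)        # distribution after integrating the master equation
    print(v)        # zero-eigenvalue eigenvector; the two should agree closely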
5. GENERAL DISCUSSION AND CONCLUSION

Catastrophe theory is an advanced mathematical subject, and so is the (potential) theory of stochastic processes. In this chapter, we have considered some forms of integration of these two theories. Of course, this is an enormous task, and we could only present a few preliminary remarks about possible approaches. Yet, the task has to be addressed in order to arrive at principled applications of catastrophe theory to the stochastic processes typically observed in the biological and social sciences. As a first contribution to that end, we presented two main results, one negative and one positive. The negative result pertains to Cobb's approach to catastrophe theory for metrical stochastic systems, which was shown to be flawed in one rather fundamental respect. The positive result concerns the presentation of principled ways to extend catastrophe theory to metrical stochastic systems, in particular Zeeman's approach.

It should be remembered that in this chapter we introduced several assumptions of a restrictive nature. For instance, it was assumed that the processes un-
der scrutiny are homogeneous and obey the detailed balance condition. Although these assumptions considerably eased the presentation, it remains to be established whether they are necessary and plausible. In fact, this also applies to the basic concept of a potential; we considered its possibly restrictive nature in section 1, but still have to address its interpretation in the context of developmental processes. We could take a formal route and refer to the stationary density of Equation (17): p(x) = exp(-Φ[x]). If Φ[x] has a Morse form, then Equation (17) reduces to the (multivariate) normal density. In the latter case, the potential is proportional to the quadratic form x'Φx, where Φ denotes the inverse of the covariance matrix of x, and hence this potential can be interpreted as an analogue of energy. More informally, we can use the interpretation given by Helbing (1994) of the potential in his model for the behavior of pairwise interacting individuals. According to Helbing, this potential can be understood as a social field. Analogously, the potential in the stationary density (17) of developmental processes could be interpreted as a developmental field. This would correspond nicely with the prominence of field concepts in mathematical biology and embryology (e.g., Meinhardt, 1982).

In conclusion, the extension of the program of elementary catastrophe theory to stochastic systems leads to a number of deep and challenging questions which for the most part are still unexplored. The issues at stake not only pertain to the formal framework of catastrophe theory for stochastic systems but also relate to basic questions in developmental theory, for instance, the question of how to arrive at an integral conceptualization of intra- and interindividual differences in the timing and pattern (horizontal and vertical decalages) of discrete stage transitions marking the emergence of qualitatively new behavior, or the specification of causal models of probabilistic epigenesis. We hope to have made clear that there exist principled approaches to resolve these questions. Thus we expect that further elaboration of the program of stochastic catastrophe theory will lead to fundamental progress in developmental theory.
REFERENCES

Castrigiano, D. P. L., & Hayes, S. A. (1993). Catastrophe theory. Reading, MA: Addison-Wesley.
Cobb, L. (1978). Stochastic catastrophe models and multimodal distributions. Behavioral Science, 23, 360-374.
Cobb, L. (1981). Parameter estimation for the cusp catastrophe model. Behavioral Science, 26, 75-78.
Cobb, L., Koppstein, P., & Chen, N. H. (1983). Estimation and moment recursion relations for multimodal distributions of the exponential family. Journal of the American Statistical Association, 78, 124-130.
Cobb, L., & Zacks, S. (1985). Applications of catastrophe theory for statistical modeling in the biosciences. Journal of the American Statistical Association, 80, 793-802.
Doob, J. L. (1984). Classical potential theory and its probabilistic counterpart. New York: Springer-Verlag.
Florens-Zmirou, D. (1991). Statistics on crossings of discretized diffusions and local time. Stochastic Processes and their Applications, 39, 139-151.
Gardiner, C. W. (1990). Handbook of stochastic methods for physics, chemistry and the natural sciences (2nd ed.). Berlin: Springer-Verlag.
Gilmore, R. (1981). Catastrophe theory for scientists and engineers. New York: Wiley.
Haken, H. (1983). Advanced synergetics. Berlin: Springer-Verlag.
Hartelman, P., van der Maas, H. L. J., & Molenaar, P. C. M. (1995). Catastrophe analysis of stochastic metrical systems in measure space. Amsterdam: University of Amsterdam.
Helbing, D. (1994). A mathematical model for the behavior of individuals in a social field. Journal of Mathematical Sociology, 19, 189-219.
Huseyin, K. (1986). Multiple parameter stability theory and its applications: Bifurcations, catastrophes, instabilities. Oxford: Clarendon Press.
Jackson, E. A. (1989). Perspectives of nonlinear dynamics (Vol. 1). Cambridge, UK: Cambridge University Press.
Kemeny, J. G., Snell, J. L., & Knapp, A. W. (1976). Denumerable Markov chains. New York: Springer-Verlag.
Meinhardt, H. (1982). Models of biological pattern formation. London: Academic Press.
Molenaar, P. C. M. (1986). On the impossibility of acquiring more powerful structures: A neglected alternative. Human Development, 29, 245-251.
Molenaar, P. C. M. (1990). Neural network simulation of a discrete model of continuous effects of irrelevant stimuli. Acta Psychologica, 74, 237-258.
Nicolis, G., & Prigogine, I. (1977). Self-organization in nonequilibrium systems. New York: Wiley.
Poston, T., & Stewart, I. N. (1978). Catastrophe theory and its applications. London: Pitman.
Prigogine, I. (1980). From being to becoming: Time and complexity in the physical sciences. San Francisco: Freeman.
Revuz, D., & Yor, M. (1991). Continuous martingales and Brownian motion. Berlin: Springer-Verlag.
Thom, R. (1975). Structural stability and morphogenesis. Reading, MA: Benjamin.
van der Maas, H. L. J., & Molenaar, P. C. M. (1992). Stagewise cognitive development: An application of catastrophe theory. Psychological Review, 99, 395-417.
Van Kampen, N. G. (1981). Stochastic processes in physics and chemistry. Amsterdam: North-Holland.
Wohlwill, J. F. (1973). The study of behavioral development. New York: Academic Press.
Zeeman, E. C. (1988). Stability of dynamical systems. Nonlinearity, 1, 115-155.
PART 3

Latent Class and Log-Linear Models
Some Practical Issues Related to the Estimation of Latent Class and Latent Transition Parameters
Linda M. Collins
Pennsylvania State University, University Park, Pennsylvania

Stuart E. Wugalter
University of Southern California, Los Angeles, California

Penny L. Fidler
University of Southern California, Los Angeles, California
1. INTRODUCTION

Sometimes developmental models involve qualitatively different subgroups of individuals, such as learning disabled versus normal learners, children who have been abused versus children who have not, or substance use abstainers versus experimenters versus regular users. Other developmental models hypothesize that individuals go through a sequence of discrete stages, such as the stages in Piaget's theories of development (Piaget, 1952) or Kohlberg's (1969) stages of moral development. Both types of models are characterized by the involvement
of discrete latent variables. The first models can be tested by means of a measurement model for discrete latent variables, namely, latent class analysis (Clogg & Goodman, 1984; Goodman, 1974; Lazarsfeld & Henry, 1968). Stage sequential models can be tested by means of latent transition analysis (LTA) (Collins & Wugalter, 1992; see also Hagenaars & Luijkx, 1987; van de Pol, Langeheine, & de Jong, 1989), which is an extension of latent class models to longitudinal data.

Typically, in latent class and latent transition models, data are analyzed in the form of frequencies or proportions associated with "response patterns." Conceptually, a response pattern is a set of possible subject responses to the manifest variables. For example, suppose the variables are two yes/no items. Then, No,No is one response pattern, No,Yes is another response pattern, and so on. Technically, a response pattern proportion is simply a cell proportion, in which the cell is part of the cross-classification of all the manifest variables involved in the problem of interest.

Latent class models usually involve data collected at one time only. For example, let Y = {i, j, k} represent a response pattern made up of response i to Item 1, response j to Item 2, and response k to Item 3. Then,

P(Y) = Σ_{c=1}^{C} γ_c ρ_{i|c} ρ_{j|c} ρ_{k|c},
where γ_c is the proportion of individuals in latent class c, and ρ_{i|c} is the probability of response i to Item 1, conditional on membership in latent class c. These parameters are conceptually similar to factor loadings in that they reflect the strength of the relationship between the manifest variables and the latent variable. They are different from factor loadings in that they are estimated probabilities, not regression coefficients. In latent class models, a strong relationship between the latent and manifest variables is reflected in a strong ρ parameter, that is, a parameter close to zero or one. We refer to the ρ parameters as measurement parameters.

Latent transition models can involve some variables that are measured only once (because they are unchanging, such as an experimental condition) and other variables that are measured longitudinally (because they are expected to change over time, such as stage membership). Let Y = {m, i, j, k, i', j', k', i'', j'', k''} represent a response pattern that is made up of a single response to a manifest indicator (m) of an exogenous variable and responses to three items at times t (i, j, and k), t + 1 (i', j', and k'), and t + 2 (i'', j'', and k''). Then, the estimated proportion of a particular response pattern P(Y) is expressed as follows (for a first-order model):
P(Y) = Σ_{c=1}^{C} Σ_{p=1}^{S} Σ_{q=1}^{S} Σ_{r=1}^{S} γ_c ρ_{m|c} δ_{p|c} ρ_{i|p,c} ρ_{j|p,c} ρ_{k|p,c} τ_{q|p,c} ρ_{i'|q,c} ρ_{j'|q,c} ρ_{k'|q,c} τ_{r|q,c} ρ_{i''|r,c} ρ_{j''|r,c} ρ_{k''|r,c},
where "Vc is the proportion in latent class c. In this model, individuals do not change latent class membership over time. The factor 8p Ic is the proportion in latent status p at time t conditional on membership in latent class c. This array of parameters represents the distribution across latent statuses at the first occasion of membership. Latent status membership can and does change over time. The factor Tqlp, c is an element of the latent transition probability matrix. These parameters represent how the sample changes latent status membership over time. The factor Pilp, c is the probability of response i to item 1 at time t, conditional on membership in latent status p at time t and on membership in latent class c. These measurement parameters are interpreted in the same way as their counterparts in latent class models. The purpose of this chapter is to address two practical issues related to parameter estimation in latent class and latent transition models. These practical issues arise frequently in developmental research. The first issue has to do with parameter estimation when the number of subjects is small in relation to the number of response patterns. As latent class and latent transition models become used more extensively, it is inevitable that researchers will attempt to test increasingly complex models. In our work, for example, we test models of early substance use onset in samples of adolescents. These models often involve four or more manifest variables used as indicators, measured at two or more time points. The cross-classification of all of these manifest variables measured repeatedly can become quite large. For a problem involving four dichotomous variables measured at two times, there are 28 - 256 cells in the cross-classification. If there are more than two occasions of measurement, if the items have more than two response options, and/or if there are additional items, the table becomes much larger. The size of the cross-classification is not a problem in and of itself, but it often indirectly causes a problem when the sample size is small in relation to the number of cells, that is, under conditions of "sparseness." There is a general, informal consensus that a minimum sample size in relation to the number of cells (N/k) is needed to perform latent class analyses successfully. However, to our knowledge, there are no firm guidelines, either about what the minimum N or N/k should be or about what are the likely consequences if one performs analyses on data that are too sparse. The source of the concern about sparseness is partly the well-established literature on the difficulties of conducting hypothesis testing under sparseness conditions. This literature has shown in numerous settings (see Read & Cressie, 1988) that when the data matrix is sparse, the distributions of commonly used goodness-of-fit indices diverge considerably from that of the chi-squared, making accurate hypothesis testing impossible. Collins, Fidler, Wugalter, and Long (1993; see also Holt & Macready, 1989) showed that under conditions of sparseness in latent class models, the expectation of the likelihood ratio statistic can
be considerably less than the expectation of the chi-square distribution, and that although the expectation of the Pearson statistic is much closer to that of the chi-square, its variance is unacceptably large. Collins et al. (1993) recommended using Monte Carlo simulations to estimate the expectation of the test statistic under the null hypothesis.

The demonstrated problems with hypothesis testing under sparseness conditions have left researchers uneasy about parameter estimation under these conditions. It is not known what the effects of sparseness are, if any, on parameter estimation. One possible consequence of sparseness is bias, that is, the expectation of the parameter estimates could be greater than or less than the parameter being estimated. Another consequence could be an unacceptably large mean squared error (MSE), indicating a lack of precision in parameter estimation. Also, it is not known what the limits of sparseness are, that is, how small can the number of subjects be in relation to the number of cells if parameters are to be estimated with acceptable accuracy? Another important question related to sparseness is whether, given a fixed N, adding a manifest indicator is a benefit, because it increases the amount of information, or a detriment, because it increases sparseness.

The second practical issue addressed by this chapter is the estimation of standard errors for parameters. Estimation for latent class and latent transition models is usually performed using the EM algorithm (Dempster, Laird, & Rubin, 1977). Less often, estimation is performed using Fisher's method of scoring (Dayton & Macready, 1976). The EM algorithm has the advantage of being a very robust estimation procedure. Its disadvantages are that it is slow compared with Fisher's method of scoring, and that it does not yield standard errors of the parameters it estimates. Unlike the Fisher scoring method, the EM algorithm does not require computation of the information matrix, hence standard errors are not a by-product of the procedure. It has been suggested (e.g., Rao, 1965, p. 305) that standard errors can be obtained by inverting the information matrix computed at the final iteration of the EM procedure. This sounds reasonable, but just how well this works has never been investigated. It also remains to be seen how well standard errors can be estimated in very sparse data matrices.
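To make the estimation machinery concrete, here is a minimal sketch (an illustration in Python, not the program used in the study) of EM for a two-latent-class model with dichotomous items. Each iteration computes the posterior class probabilities of every response pattern (E step) and then re-estimates γ and the ρ parameters as weighted proportions (M step). The response-pattern frequencies at the bottom are hypothetical values chosen only for illustration.

    import numpy as np

    def em_two_class(patterns, counts, n_iter=200, seed=0):
        """EM estimation of a two-latent-class model for dichotomous items.

        patterns : (n_patterns, n_items) array of 0/1 responses
        counts   : observed frequency of each response pattern
        Returns (gamma, rho), where rho[c, j] = P(item j endorsed | class c).
        """
        rng = np.random.default_rng(seed)
        n_items = patterns.shape[1]
        gamma = np.array([0.5, 0.5])
        rho = rng.uniform(0.3, 0.7, size=(2, n_items))      # start values

        for _ in range(n_iter):
            # E step: P(pattern | class), joint with gamma, posterior over classes
            cond = np.ones((patterns.shape[0], 2))
            for c in range(2):
                cond[:, c] = np.prod(rho[c] ** patterns *
                                     (1 - rho[c]) ** (1 - patterns), axis=1)
            joint = cond * gamma
            post = joint / joint.sum(axis=1, keepdims=True)

            # M step: expected counts per class, then weighted proportions
            weights = post * counts[:, None]
            gamma = weights.sum(axis=0) / counts.sum()
            rho = (weights.T @ patterns) / weights.sum(axis=0)[:, None]
        return gamma, rho

    # All 2^4 = 16 response patterns for four dichotomous items, with
    # hypothetical frequencies used only for illustration.
    patterns = np.array([[(i >> b) & 1 for b in range(4)] for i in range(16)])
    counts = np.array([40, 12, 11, 5, 10, 4, 6, 9, 13, 6, 7, 8, 5, 9, 11, 44], float)

    gamma_hat, rho_hat = em_two_class(patterns, counts)
    print("gamma:", np.round(gamma_hat, 3))
    print("rho:  ", np.round(rho_hat, 3))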
2. METHODS

To investigate the issues presented in the first section, we performed a simulation study. The purpose of the simulation was to generate data based on latent class models with known parameters while systematically varying sparseness and other factors. We estimated the parameters in the data we generated and examined whether the factors we varied affected parameter estimation.
The following independent variables were selected as likely to affect the accuracy of estimation:

1. Number of items. We used four-, six-, and eight-item models. Because dichotomous items were used, this resulted in data sets with 16, 64, and 256 response patterns, respectively.

2. Size of ρ. We reasoned that the size of ρ might be an important factor because, all else being equal, latent class models with extreme values (close to zero or one) for the ρ parameters generate sparser data matrices than latent class models with less extreme ρ parameters. This factor had two levels: ρ = .65 and ρ = .9.

3. Sparseness. This was operationalized as N/k, that is, the number of subjects divided by the number of cells (in this case, response patterns). Four levels of N/k were crossed with the other factors in the study: 1, 2, 4, and 16, with N/k of 1 being a level that many researchers would feel is too low and N/k of 16 being a level that most researchers would agree is "safe."

Thus the design as discussed here involves the following fully crossed factors: four levels of N/k, three levels of number of items, and two levels of size of ρ, or 24 cells. To investigate extreme sparseness, four more cells were run with N/k = .5. One cell had six items, ρ = .65; one had six items, ρ = .9; one had eight items, ρ = .65; and one had eight items, ρ = .9. We did not cross N/k = .5 fully with number of items, because the four-item conditions would have involved only eight subjects. Table 1 shows the design, including the number of artificial "subjects" in each data set, computed by multiplying N/k times the number of response patterns. As Table 1 shows, the design was set up so that there would be multiple cells with the same N in order to make it possible to investigate the effects of adding items while holding N constant.

TABLE 1
Design of Simulation and Numbers of Subjects Used to Generate Data
                                N/k
N of items         .5       1       2       4       16
ρ = .65
  4                 *      16      32      64      256
  6                32      64     128     256     1024
  8               128     256     512    1024     4096
ρ = .90
  4                 *      16      32      64      256
  6                32      64     128     256     1024
  8               128     256     512    1024     4096

*Cell not included in design.
One thousand random data sets were generated for each cell. Parameters of the known true model were estimated in each data set using the EM algorithm (Dempster et al., 1977). All of the data sets were generated using models with two latent classes, so one γ parameter was estimated in each data set. In the four-item data sets, all eight ρ parameters were freely estimated. To keep the total number of parameters estimated constant across data sets, constraints were imposed on the ρ parameters in the six- and eight-item conditions. In the six-item models, the ρ parameters associated with the fifth and sixth items were constrained to be equal to those for the first and second items, respectively. In the eight-item models, the ρ parameters associated with the fifth through eighth items were constrained to be equal to those associated with the first through fourth items, respectively. Thus each data set involved estimation of one γ parameter and eight ρ parameters.
2.1. Data Generation

There is a finite set of possible response patterns that can be generated for a given set of items, and each latent class model produces a corresponding vector of predicted response pattern probabilities. We used a total of six vectors associated with the three numbers of items and the two measurement strengths. A single vector was used to generate data for each of the four levels of N/k. The data were generated using a uniform random number generator written in FORTRAN. Data were generated for a single subject by randomly selecting a number from zero to one from the uniform distribution. This number was then compared with the cumulative response pattern probability vector, which determined a particular subject's response pattern.

A mastery model was used to generate the data, in which one of the latent classes in each model was a "master" latent class, where the probability of passing the item was large for each item, and the other was a "nonmaster" latent class, where the probability of passing was small for each item. Of course, in individual data sets the parameter estimates did not always come out this way. In a few instances, the two latent classes did not seem distinct at all, so that it was difficult or impossible to denote one class as a mastery class and one as a nonmastery class. An example is a solution where the probability of passing each item is estimated as .9 for one latent class and .6 for the other. Under these circumstances, it cannot be determined whether a ρ parameter is conditional on membership in the mastery latent class or on membership in the nonmastery latent class. Because the true values for the ρ parameters that are conditional on the mastery latent class are different from those conditional on the nonmastery latent class, when this occurs it is impossible to compare the parameter estimates to their true values, and bias and MSE cannot be assessed. We examined each replication to identify instances of this indeterminacy. Table 2 shows the number of these solutions in each cell.
TABLE 2
Number of Indeterminate Solutions Out of 1000 Replications Per Cell

                                N/k
N of items         .5       1       2       4       16
ρ = .65
  4                 *      87      59      35       0
  6                34      19       7       1       0
  8                 1       0       0       0       0
ρ = .90
  4                 *       0       0       0       0
  6                 0       0       0       0       0
  8                 0       0       0       0       0
*Cell not included in design.

Most of the indeterminate solutions occurred when ρ = .65 and there were four items, with the largest number (87) occurring when N/k = 1, which was the sparsest matrix included for four items. There were no such solutions when ρ = .9, and none when N/k = 16. There was only one instance of indeterminacy when there were eight items, occurring when N/k = .5 and ρ = .65. Data sets yielding indeterminate solutions were removed from the simulation and new random data sets were generated until a total of one thousand suitable solutions was obtained for each cell. The remaining results are based on this amended data set.
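The generation scheme itself is simple to sketch (a minimal illustrative version; the two-class mastery parameters and the sample size are hypothetical stand-ins, not the exact values used in the study): compute the response-pattern probabilities implied by the model, form their cumulative vector, and locate one uniform random number per simulated subject in that vector.

    import numpy as np

    rng = np.random.default_rng(123)

    # Hypothetical two-class mastery model for four dichotomous items
    gamma = np.array([0.5, 0.5])                  # latent class proportions
    rho = np.array([[0.9, 0.9, 0.9, 0.9],         # P(pass item | master class)
                    [0.1, 0.1, 0.1, 0.1]])        # P(pass item | nonmaster class)

    # Probability of each of the 2^4 = 16 response patterns under the model
    patterns = np.array([[(i >> b) & 1 for b in range(4)] for i in range(16)])
    p_pattern = np.zeros(16)
    for c in range(2):
        p_pattern += gamma[c] * np.prod(rho[c] ** patterns *
                                        (1 - rho[c]) ** (1 - patterns), axis=1)

    cum = np.cumsum(p_pattern)                    # cumulative probability vector

    # One uniform draw per subject selects the first pattern whose cumulative
    # probability exceeds the draw.
    N = 64                                        # e.g., N/k = 4 with k = 16 patterns
    u = rng.uniform(size=N)
    pattern_index = np.minimum(np.searchsorted(cum, u), 15)
    observed_counts = np.bincount(pattern_index, minlength=16)
    print(observed_counts)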
2.2. Parameter Estimates

We evaluate the parameter estimates in terms of bias and MSE. Table 3 shows bias and MSE for the γ parameter, and Table 4 shows these quantities for one of the ρ parameters. The pattern of results was virtually identical for all eight ρ parameters, so only one set of results is shown here. In the population from which these data were sampled, the latent class parameter equals .50 and the ρ parameter equals either .65 or .90.

There is some bias present in the sample-based estimates of both types of parameters, particularly when ρ = .65. The overall amount of bias is quite small for the latent class parameter; in no cell is it larger than .026. There is considerably more bias in the estimation of the ρ parameters, where bias is more than .03 in six cells. For both parameters, bias virtually disappears when ρ = .90, even for conditions where N/k is as low as .5. Bias generally decreases as N/k increases and as the number of items increases, although there is some slight nonmonotonicity in the bias associated with the latent class parameter. Because certain groups of these parameters sum to one and the parameters are bounded by zero and one, positive bias in the estimation of one parameter in a set implies negative bias somewhere else in the set, and vice versa. For example, the ρ's associated with a
TABLE 3
Bias in Estimation of the γ Parameter

TABLE 4
Bias in Estimation of the ρ Parameter
particular item and a particular latent class must sum to one across response categories. This means that although the bias in Table 4 is positive overall, if we examined the complement of this parameter, we would find that the bias is negative overall.

Although there can be a substantial amount of bias when N/k is small, holding N/k constant and increasing the number of items generally tends to decrease the bias. (Note that to maintain a constant N/k, the overall N must be increased when more cells are added.) This pattern is not completely consistent for the latent class parameter. Essentially the same pattern holds for the MSE. The MSE is much smaller overall for both the latent class parameter and the ρ parameter when ρ = .90 than it is when ρ = .65. The MSE generally decreases as N/k increases and as the number of items increases, although there are some inconsistencies evident in the MSEs for the latent class parameter. Even with N/k = .5, the MSE is small when eight items are used.

Given a particular N, is there an increase in bias or MSE when items are added to a latent class model? Tables 5 and 6 show that as the number of items
TABLE 5
Bias in Estimation of the γ Parameter as a Function of N

                                N of items
N of subjects          4               6               8
ρ = .65
  32             .026 (.041)     .016 (.047)
  64             .008 (.051)     .024 (.054)
  256            .003 (.058)    -.001 (.029)    -.001 (.001)
  1024                           .002 (.007)    -.000 (.000)
ρ = .90
  32             .000 (.009)    -.002 (.015)
  64             .001 (.005)    -.000 (.003)
  256           -.000 (.001)    -.000 (.008)    -.001 (.001)
  1024                           .003 (.004)    -.000 (.000)

Note. Mean squared errors are given in parentheses.
TABLE 6
Bias in Estimation of p Parameter As a Function of N N of items 4 N of subjects 32 64
p = .65
256
.053 (.050) .043 (.039) .O38 (.020)
.040 (.052) .031 (.035) .010 (.009) .002 (.OO2)
-.002 (.008) .001
.003 (.006) .002
1024
32 64 p = .90
256
6
(.003)
(.003)
-.001 (.001)
-.000 (.001) -.001 (.000)
1024
8
.007 (.005) .002 (.001)
-.003 (.001) .000 (.000)
Note. Mean squared errors are given in parentheses.
increases for a fixed N, bias and MSE sometimes do increase slightly, at least when the overall N is 64 or less. Under conditions when N is larger than this, generally bias and MSE remain about the same or even decrease slightly as the number of items is increased.
1.3. Estimation of Standard Errors

Each cell of the simulation contains a sampling distribution of each parameter estimated. This sampling distribution is made up of the 1000 replications in each cell. The standard deviation of a parameter estimate across the 1000 replications is our definition of the true standard error of the parameter. To estimate the bias and MSE associated with estimating the standard errors, we subtracted this standard deviation from the estimate of the standard error obtained by inverting the information matrix at each replication. Table 7 contains bias and MSE for the estimation of the standard error of the latent class parameter, and Table 8 contains this information for the estimation of the standard error of the p parameter.
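These bias and MSE calculations are straightforward to reproduce. The sketch below (Python; the replication output is made up for illustration and is not the study's actual simulation output) shows the definitions used here: the "true" standard error is the standard deviation of a parameter's estimates across replications, and the bias and MSE of the information-based standard errors are computed against that value.

```python
import numpy as np

def bias_and_mse(estimates, true_value):
    """Bias and mean squared error of a Monte Carlo sampling distribution."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - true_value
    mse = np.mean((estimates - true_value) ** 2)
    return bias, mse

def se_bias_and_mse(se_estimates, estimates):
    """Compare information-based standard errors with the empirical standard error.

    As in the text, the "true" standard error is defined as the standard
    deviation of the parameter estimates across replications.
    """
    true_se = np.std(np.asarray(estimates, dtype=float), ddof=1)
    return bias_and_mse(se_estimates, true_se)

# Illustration with hypothetical numbers (not the chapter's results):
rng = np.random.default_rng(0)
gamma_hat = rng.normal(0.50, 0.05, size=1000)   # estimates of the latent class parameter
se_hat = rng.normal(0.05, 0.005, size=1000)     # information-based standard errors
print(bias_and_mse(gamma_hat, 0.50))
print(se_bias_and_mse(se_hat, gamma_hat))
```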
TABLE 7
Bias in Estimation of the Standard Error of the Latent Class Parameter

                                        N/k
N of items      .5             1              2              4              16
p = .65
  4             *              .284 (.714)    .167 (.131)    .146 (.150)    -.023 (.015)
  6             .018 (.039)    -.035 (.022)   -.066 (.011)   -.037 (.004)   -.004 (.000)
  8             -.049 (.004)   -.023 (.001)   -.005 (.000)   -.004 (.000)   .000 (.000)
p = .90
  4             *              .222 (.979)    .011 (.004)    -.003 (.000)   .001 (.000)
  6             -.002 (.000)   .000 (.000)    -.001 (.000)   .000 (.000)    -.000 (.000)
  8             -.000 (.000)   -.001 (.000)   .000 (.000)    .000 (.000)    .000 (.000)

Note. Mean squared errors are given in parentheses.
*Cell not included in design.
TABLE 8
Bias in Estimation of the Standard Error of the p Parameter

                                        N/k
N of items      .5             1              2              4              16
p = .65
  4             *              .056 (.157)    .057 (.089)    .057 (.104)    -.003 (.013)
  6             -.016 (.037)   -.016 (.026)   -.023 (.009)   -.014 (.001)   -.001 (.000)
  8             -.020 (.001)   -.009 (.000)   -.003 (.000)   -.001 (.000)   .001 (.000)
p = .90
  4             *              .086 (.366)    -.008 (.007)   -.005 (.000)   .000 (.000)
  6             -.013 (.002)   -.002 (.000)   -.002 (.000)   .000 (.000)    .000 (.000)
  8             -.002 (.000)   -.000 (.000)   .000 (.000)    .000 (.000)    .000 (.000)

Note. Mean squared errors are given in parentheses.
*Cell not included in design.
Tables 7 and 8 show that although generally the standard errors are estimated well, there is considerable bias in the estimation of the standard error for the latent class parameter when p = .65, there are four items, and N/k is low. Under these conditions, inverting the information matrix tends to produce a standard error that is larger than the standard deviation of the empirical sampling distribution, particularly for the three lowest values of N/k. The MSE is also large in these conditions, especially when N/k = 1. For both the latent class and p parameters bias tends to be positive in the four-item conditions and negative in the six- and eight-item conditions. Inverting the information matrix produces highly accurate estimates of the standard error in almost all of the p = .9 conditions. The one exception is the four-item, N/k = 1 condition where there is considerable positive bias and the MSE is unacceptably large. The overall pattern of the results is the same for the p parameters, except that there is less bias in the estimate of the standard error and the MSEs are smaller in most cases.
3. DISCUSSION

The results of this study are very encouraging for the estimation of latent class models. They suggest that highly accurate estimates of latent class parameters can be obtained even in very sparse data matrices, particularly when manifest items are closely related to the latent variable. Even when N/k was .5 or 1, bias was negligible for estimating the latent class parameter and only slight for estimating the p parameters. Both bias and MSE were smallest when p = .9, indicating that in circumstances in which the items are strongly related to the latent classes, a smaller N is needed for good estimation compared with circumstances in which the items are less strongly related to the latent classes. Usually, a researcher has only so much data to work with, that is, a fixed N, and is debating how many indicators to include. With measurement models for continuous data, there is usually no debate; more indicators are better. But with latent class and latent transition models, the researcher is faced with a nagging question: Will adding an indicator be a benefit by increasing measurement precision, or will it be a detriment by increasing sparseness? The results of our study suggest that when the p parameters are close to zero or one, adding additional items has little effect on estimation. When the p parameters are weaker, adding items generally decreases the MSE for overall Ns of 256 or greater, and may increase it slightly for smaller Ns. The effect of the addition of items on bias is less consistent, but generally small. It is important to note that in the design of this simulation, we used constraints so that when additional items were added the total number of parameters estimated remained the same. This amounts to treating some items as replications of each other; conceptually, it is similar to constraining factor loadings to be equal in a confirmatory factor analysis. With-
out such constraints, as items are added, more parameters are estimated. If the addition of items is accompanied by estimation of additional parameters, this may change the conclusions discussed in this paragraph. This study also indicates that standard errors of the parameters in latent class models can be estimated well in most circumstances by inverting the information matrix after parameter estimation has been completed by means of the EM algorithm. However, there can be a substantial positive bias in the estimate of the standard error, particularly when p is weak and N/k is small. This approach to estimation of standard errors probably should not be attempted when N/k is one or less or there are four or fewer manifest indicators. Serendipitously, this study also revealed a little about indeterminate results, that is, results for which the latent classes are not clearly distinguished. We should note that such results are not indeterminate if they reflect the true model that generated the data. However, in our study, the data generation models involved clearly distinguished latent classes. We found that, again, strong measurement parameters were very important; none of the indeterminate cases occurred when p = .9. It was also evident that more items and a greater N/k helped to prevent indeterminate solutions. It is interesting that strong measurement showed itself to be so unambiguously beneficial in this study. On the one hand, p parameters close to zero or one are analogous to large factor loadings in factor analysis, clearly indicating a close relationship between manifest indicators and a latent variable. On the other hand, all else being equal, p parameters close to zero and one are also indicative of more sparseness. Given a particular N, the least sparse data would come from subject responses spread evenly across all possible response patterns. When the p parameters are such that some responses have very high probabilities and others have very low probabilities, subject responses will tend to be clumped together in the high-probability response patterns, whereas the low-probability response patterns will be empty or nearly empty. For this reason, it might have been expected that strong measurement would tend to result in more bias or larger MSEs. This simulation shows that the sparseness caused by extreme measurement parameters is unimportant. Like all simulations, this one can only provide information about conditions that were included. We did not include any conditions where there were more than two latent classes. As mentioned previously, we controlled the number of parameters estimated rather than let the number increase as more items were added. However, in many studies, researchers will wish to estimate p parameters freely for any items they add to a model, which will result in an increase in the total number of parameters to be estimated. It would be worthwhile to investigate the effects of this increased load on estimation. Finally, we did not mix the strengths of the measurement parameters. Each data set contained measurement parameters of one strength only. Of course, in empirical research set-
tings different variables would be associated with different measurement strengths, so the effect of this should be investigated also.
ACKNOWLEDGMENTS This research was supported by National Institute on Drug Abuse Grant DA04111.
REFERENCES

Clogg, C. C., & Goodman, L. A. (1984). Latent structure analysis of a set of multidimensional contingency tables. Journal of the American Statistical Association, 79, 762-771.
Collins, L. M., Fidler, P. L., Wugalter, S. E., & Long, J. D. (1993). Goodness-of-fit testing for latent class models. Multivariate Behavioral Research, 28, 375-389.
Collins, L. M., & Wugalter, S. E. (1992). Latent class models for stage-sequential dynamic latent variables. Multivariate Behavioral Research, 27, 131-157.
Dayton, C. M., & Macready, G. B. (1976). A probabilistic model for validation of behavioral hierarchies. Psychometrika, 41, 189-204.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215-231.
Hagenaars, J. A., & Luijkx, R. (1987). LCAG: Latent class analysis models and other loglinear models with latent variables: Manual LCAG (Working Paper Series 17). Tilburg, Netherlands: Tilburg University, Department of Sociology.
Holt, J. A., & Macready, G. B. (1989). A simulation study of the difference chi-square statistic for comparing latent class models under violations of regularity conditions. Applied Psychological Measurement, 13, 221-231.
Kohlberg, L. (1969). Stage and sequence: The cognitive-developmental approach to socialization. In D. A. Goslin (Ed.), Handbook of socialization theory and research. Chicago: Rand McNally.
Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston: Houghton Mifflin.
Piaget, J. (1952). The origins of intelligence in children. New York: International Universities Press.
Rao, C. R. (1965). Linear statistical inference and its applications. New York: Wiley.
Read, T. R. C., & Cressie, N. A. C. (1988). Goodness-of-fit statistics for discrete multivariate data. New York: Springer-Verlag.
van de Pol, F., Langeheine, R., & de Jong, W. (1989). PANMARK User Manual: PANel analysis using MARKov chains. Voorburg: Netherlands Central Bureau of Statistics.
Contingency Tables and Between-Subject Variability

Thomas D. Wickens
University of California, Los Angeles Los Angeles, California
1. INTRODUCTION

An important problem in the analysis of contingency tables is how to treat data when several observations come from each subject. How should these dependent responses be combined into a single analysis? The problem is, of course, an old one; I found reference to the effects of heterogeneity of observations going back at least to Yule's textbook in 1911. The dangers of pooling dependent observations in the Pearson statistic are well known, although many researchers are not really sure what to do about it. For example, Lewis and Burke pointed the problem out in 1949 in their discussion of the use and misuse of chi-square tests, and yet Delucchi (1983), in his reexamination of the outcome of their recommendations, noted that errors are still too common. The introduction of new statistical techniques for frequency data has both helped and hurt. On the one hand, we now have both the procedures to represent and analyze the between-subject component of variability in these designs and the computer power needed to implement these procedures. On the other hand, the variety of new techniques, and the incomplete understanding many users have of them, has partially obscured the problem. For example, I have
spoken to several researchers who believe that using "log-linear models" somehow eliminates the problems of combining over subjects. However, this is not really the case. The issue was brought to my attention some time ago when a colleague asked me to look at a paper he was reviewing and to comment on the statistics. The data in question contained results from a group of subjects with several observations per subject, and thus involved a mixture of between-subject and withinsubject variability. They had been analyzed with standard log-linear techniques. The analysis bothered me, and eventually I identified the problem as a failure to treat the between-subject variability correctly. I also realized that the issue was more general than this particular instance, and that many treatments of categorical designs for researchers, my own included (Wickens, 1989), did not discuss how to analyze such data. In particular, they gave no practical recommendations for the everyday researcher. Thinking about the problem, and about what to recommend, led me to explore several approaches to this type of analysis (reported in Wickens, 1993). Here I describe, and to some degree expand upon, that work. My thoughts are influenced both by the statistical problems and by the need to make recommendations that are likely to be used.
2. ASSOCIATION VARIABILITY

To begin, remember that there are two common situations in which we are called on to test for association between two categorizations. In one, each subject produces a response that is classified in two dimensions, producing one entry in a contingency table. In the other situation, there are two or more types of subjects, each generating a response that is classified along a single dimension. Again, a two-way table results, although the underlying sampling models are different. Essentially, the difference between the two sampling situations is the same as that between the correlation model and the regression model. I will concentrate on the first situation here, but will mention the second situation again later.¹

¹There is a third possibility in which the design constrains both marginal distributions--Fisher's (1973) lady tasting tea--but it is less often used and I will not discuss it here.

Consider the following example, loosely adapted from the design that called the problem to my attention. Suppose that you are interested in determining whether there is a link between the behavior of a caregiver and that of an infant. You want to know whether a certain caregiver behavior is followed by a particular target behavior of the infant. You observe the interaction of a caregiver-infant dyad for some fixed duration. You divide this period into a se-
ries of short intervals and select certain pairs of intervals, say one pair every 5 min, for examination. In the first of any pair of two consecutive intervals, you record whether the caregiver made the antecedent behavior and in the second whether the infant made the target behavior. Aggregating over the pairs, you obtain a 2 × 2 table that describes the contingency of the two behaviors:

                 Caregiver
                 Yes    No
   Infant  Yes    42    13
           No     17    28
Each count in this table records the outcome from one pair of intervals. You can discover whether the behaviors are contingent by looking for an association between the row categorization and the column categorization. For a single dyad, these data are easy to analyze. You just test for an association in the 2 × 2 table. There are many ways to conduct this test, all of which have similar properties. I do not consider their differences here; an ordinary Pearson statistic suffices. Of course, you would not normally run the study with only one dyad. You would obtain data from several caregiver-infant pairs and want to generalize over these pairs. Table 1 shows some data, simulated from the model that I describe in the next section. These data form a three-way contingency table, the dimensions of which are the antecedent behavior, the target behavior, and the subject (classification factors A, B, and S, respectively). The goal of an analysis is either to test the hypothesis that there is no association between the behavior classifications or to estimate the magnitude of this association. Two sources of variability influence these observations, one occurring within a dyad, the other between dyads. The within-dyad source is embodied in the
TABLE 1
Simulated Data for Eight Dyads

                  Responses                  Log odds ratio
Pair     YY     YN     NY     NN          yk          sk²
1        42     17     13     28         1.634       0.190
2        44     30     12     14         0.526       0.204
3        42      9     34     15         0.698       0.222
4        29     29     23     19        -0.187       0.162
5        42     15     24     19         0.780       0.180
6        24     24     27     25        -0.076       0.157
7        38     24     15     23         0.868       0.174
8        45     16     19     20         1.064       0.183

Note. The estimated log odds ratios yk and their sampling variances sk² are obtained from Equations (1) and (2).
scatter of the responses within each 2 x 2 table; the between-dyad source is embodied in the differences among these tables. Some between-dyad scatter is expected because of within-dyad response variability, but there is more variability among the observations in Table 1 than can be attributed to such within-dyad influences. Considered in terms of these sources of variability, there are three ways that one might test a hypothesis of independence or unrelatedness. The most general approach is to find a method that explicitly uses estimates of both the betweensubject and the within-subject components of the variability. Presumably, the tests in this class should be the most accurate. However, such tests are relatively complicated and not well incorporated into the conventional statistical packages. Their complexity creates a problem in their application. Many researchers and some journal editors pay little attention to new statistical methods unless a very strong case can be made for them, preferring to fall back on more familiar procedures. Therefore it is also important to look at tests based on simpler variability structures. The more familiar types of tests are based on estimates of variability of only one type, either between-subject or within-subject. If there is much betweensubject variability, then the class of tests that estimate and use it should be the next best alternative, although they might be inferior to the tests that use both sources. Tests that use only the within-subject variation are likely to be the least accurate. Although these notions make sense, I was unclear of the size and nature of the biases and of the relative advantages of each class. The simulations I describe in this chapter address these similarities and differences. My goal is to make some practical recommendations.
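Before turning to the simulations, note that the single-dyad test mentioned above takes one line with standard software. For the caregiver-infant table shown earlier, a Pearson test could be run as in the sketch below (illustrative code, not code from the chapter).

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[42, 13],    # infant yes: caregiver yes, caregiver no
                  [17, 28]])   # infant no:  caregiver yes, caregiver no
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, p)
```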
3. THE SIMULATION PROCEDURE

To look at the characteristics of the different testing procedures, I simulated experiments with various amounts of between-subject and within-subject variability, applied a variety of test statistics to them, and looked at the biases and power of these tests. The first step in this program was to create a model for the data that included both components of variation. The model I used has between-subject and within-subject portions. The within-subject portion is conventional. The kth subject (or dyad) gives a 2 × 2 table of frequencies xijk. In a design in which each subject gives a doubly classified response, these frequencies are a multinomial sample of mk observations from a four-level probability distribution with probabilities πijk. The multinomial distribution is, of course, a standard model for categorical data analysis. The between-subject portion of the model provides the probabilities πijk and the number of observations mk. The probabilities derive from a set of parame-
ters that are tied to the processes being modeled. A 2 • 2 table of probabilities "rrij can be generated by the two marginal distributions and the association. Specifically, the marginal distributions are expressed by a pair of marginal logits ~l - log "rrll -I- 'rrl2
and
~2 = log Trll nt- 'rr21,
"1721 -Jr- "ri'22
'7r12 + ,rt'22
and the association by log odds ratio: xI - log 31"11'13"22. 'IT 12"rr21
In the example, the parameters ~l and ~2 m e a s u r e how likely the caregiver and infant are to emit the antecedent and target behaviors, respectively. They are essentially nuisance parameters in that example, although in other research their relationship is the principal focus of the work. As in many studies of association, one wants to make an inference about xl, either estimating it or testing the null hypothesis that it equals zero. The between-subject portion of the model specifies the distribution of these quantities over the population of subjects. I represented them as sampled from normal distributions, with unknown means and variances: ~l "~ ~'t/'(~L~,,
0"~1)'
~2 "-" ~ I/'(~2, 0"~2), and
11 "~ ~l]~lxn, (r2).
Properly speaking, the triplet (~l, ~2, ~) should have a trivariate distribution to allow intercorrelations, but I will not follow up on this multivariate aspect. For one thing, the values of the correlations depend closely on the particular study being modeled and I wanted to be more general; for another, I suspect that their effects are secondary to those that 1 discuss here. In a given experiment, n subjects are observed, each corresponding to an independently sampled triplet of distribution parameters. Over the experiment, there are n of these triplets (~lk, ~2k, ilk), which I have shown by adding a subscript for the subject and placing a tilde above the symbol, implying that, from the point of view of the study as a whole, they are realizations of random variables and not true parameters. For any given experiment, the mean fi of the log odds ratio is not, except accidentally, equal to the mean population association; typically,
n
The three subject parameters ~lk, ~2k, and fik determine a set of probabilities ~rij for the four events. The other portion of the model concerns the item sample sizes m k. In most of my simulations m~, was fixed, as if it were set by the experimental procedure.
In the example, this situation would occur if the dyads were sampled at regular intervals for a fixed period of time. A few simulations let mk vary. Suppose that instead of taking observations at fixed times, every instance of a particular type of caregiver behavior was classified, along with the subsequent behavior of the infant. The number of observations would then depend on the caregiver's behavior rate. I modeled this situation by assuming that instances of the behavior were generated by a Poisson process (like the multinomial, a standard sampling model) and that the rate parameter of this process varied over subjects according to a gamma distribution. This mixture model implies that mk has a negative binomial distribution (e.g., Johnson, Kotz, & Kemp, 1992). Both the mean and the variance of this distribution must be chosen. This model was the basis of the simulations. Each condition was based on values of the mean and variance for the distributions of ξ1k, ξ2k, and ηk, and on the value of n and the distribution of mk. For each of the n subjects in the "experiment," I sampled an independent random triplet (ξ1k, ξ2k, ηk) according to the relevant normal distributions and converted it to cell probabilities πijk. If mk was random, I generated its value from a negative binomial distribution, and for either the fixed or random case, I generated mk multinomially distributed responses. I then analyzed these data using a variety of test statistics, some based only on the within-subject variability, some only on a between-subject estimate of variability, and some that separated the two sources. To get stable estimates of such characteristics as the probability of rejecting a null hypothesis, I replicated the "experiment" between 2,000 and 10,000 times, depending on the particular part of the study, and calculated the observed proportion of null hypotheses rejected at the 5% level. Over the course of the study, I varied several characteristics of the experiments. Two of these characteristics concerned properties of the experiment: the fixed sample size n and the (possibly random) observation sample size mk. Two other characteristics concerned the ancillary characteristics of the subject population: the degree to which the marginal distribution of responses was skewed toward one alternative and the amount of between-subject variability. Finally, I varied the size of the association effect, so that I could investigate both Type I error rates and power. I also looked at some studies with more than two response categories. I will not describe all of the simulations and their details here, nor will I mention every statistic that I tried. Some of these are reported in Wickens (1993), others I have not reported formally but they are essentially similar to the ones I describe here.
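To make the data-generation step concrete, here is a minimal sketch of one simulated "experiment" (Python; illustrative code, not the author's original program). The triplet for each dyad is drawn from the normal distributions above and converted to cell probabilities; because the two marginal logits and the log odds ratio determine the 2 × 2 table uniquely, the code solves the resulting quadratic for π11. The default parameter values mirror the near-symmetrical condition described later in the chapter (μξ = 0.3, σξ = 0.2, n = 10, m = 100).

```python
import numpy as np

def cell_probs(xi1, xi2, eta):
    """2 x 2 cell probabilities with marginal logits xi1, xi2 and log odds ratio eta."""
    r = 1.0 / (1.0 + np.exp(-xi1))      # P(row 1): antecedent behavior occurs
    c = 1.0 / (1.0 + np.exp(-xi2))      # P(column 1): target behavior occurs
    psi = np.exp(eta)                   # odds ratio
    if abs(psi - 1.0) < 1e-12:
        p11 = r * c
    else:
        s = 1.0 + (r + c) * (psi - 1.0)
        p11 = (s - np.sqrt(s * s - 4.0 * psi * (psi - 1.0) * r * c)) / (2.0 * (psi - 1.0))
    p12, p21 = r - p11, c - p11
    p22 = 1.0 - p11 - p12 - p21
    return np.array([p11, p12, p21, p22])

def simulate_experiment(n=10, m=100, mu_xi=0.3, sd_xi=0.2,
                        mu_eta=0.0, sd_eta=1.0, seed=1):
    """One simulated experiment: an n x 4 array of (x11, x12, x21, x22) counts."""
    rng = np.random.default_rng(seed)
    tables = np.empty((n, 4), dtype=int)
    for k in range(n):
        xi1, xi2 = rng.normal(mu_xi, sd_xi, size=2)
        eta = rng.normal(mu_eta, sd_eta)
        tables[k] = rng.multinomial(m, cell_probs(xi1, xi2, eta))
    return tables

print(simulate_experiment())
```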
4. TESTS BASED ON MULTINOMIAL VARIABILITY

The most familiar class of test statistics contains tests that are based on the within-subject multinomial variability alone. The simplest procedure is to col-
lapse the data over the subjects (obtaining the marginal distribution of Table 1) and test for association in the resulting two-way table. This approach is unsatisfactory for several reasons. First, it violates the rules about the independence of observations in "chi-square tests." Second, collapsing the three-way table into its margin can artificially alter the association if, as is likely, the relative frequencies of the two behaviors are related to the subject classification--this is the effect that in its more pernicious manifestation gives Simpson's paradox. This test also shows all of the same biases as the other tests in this class, so I will not linger on it, but will turn instead to more plausible candidates.
The second test procedure is a likelihood-ratio test based on a pair of hierarchical log-linear models fitted to the three-way A by B by S table. One model expresses the conditional independence of the responses given the subject; in the bracket notation that denotes the fitted marginals it is [AS][BS], and in log-linear form it is

log μijk^CI = λ + λi^A + λj^B + λk^S + λik^AS + λjk^BS.

This model allows for relationships between the subjects and the behavior rates, but excludes any association between the two behaviors. The second model is of no interaction in the association. It adds to the conditional-independence model a parameter that expresses a constant level of association; in bracket notation it is [AS][BS][AB], and in log-linear form it is

log μijk^NI = λ + λi^A + λj^B + λk^S + λij^AB + λik^AS + λjk^BS.

These models are hierarchical, and the second perforce fits the data better. The need for an association term is tested with a likelihood-ratio test:

ΔG²B = 2 Σcells xijk log(μ̂ijk^NI / μ̂ijk^CI) = G²[AS][BS] - G²[AB][AS][BS],

where the μ̂ are fitted frequencies. This statistic is referred to a chi-square distribution with 1 df. Because this test explicitly includes the subjects as a factor in the analysis, it appears to solve the problems associated with their pooling.
The likelihood-ratio test is appropriate only when the more general model adequately fits the data, so some authorities recommend testing the no-interaction model [AB][AS][BS] first and only proceeding to the difference test when it cannot be rejected. Retention of the no-interaction model implies that there are few differences in association among the subjects. The recommendation to use this test as a gatekeeper is not very useful, however. For one thing, the goodness-of-fit test is neither easy to run nor particularly powerful when the per-subject frequencies mk are small. More seriously, it leaves the researcher stranded when homogeneity of association is rejected. In practice, I think the recommendation is ignored more often than followed.
The third test in this class approaches the problem through an estimate of
the log odds ratio. Any of several estimates can be used here. I describe one based on Woolf's average of the log odds ratios (Woolf, 1995) in some detail because I will be using an extension of it here; another one, which I do not describe, uses the Mantel-Haenszel estimate (Mantel, 1963). For a single subject, the log odds ratio TIk is estimated by the ratio
( ')(,) (xl2~+~1)(l) x21k+~ X l l k -Jr- "~
Yk = log
X 2 2 k "Jr" "~
.
(1)
The 1 added to each frequency reduces bias and prevents unseemly behavior when an observed frequency is zero (Gart & Zweifel, 1967). With multinomial sampling, the sampling variance of Yk approximately equals the sum of the reciprocals of the cell frequencies, again with the 1 adjustment: 2
s²k = 1/(x11k + ½) + 1/(x12k + ½) + 1/(x21k + ½) + 1/(x22k + ½).      (2)
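In code, the per-dyad estimate and its approximate variance are only a few lines; applied to the first dyad of Table 1 (counts 42, 17, 13, 28), the sketch below reproduces, up to rounding, the tabled values y1 = 1.634 and s1² = 0.190 (illustrative code, not the author's software).

```python
import numpy as np

def log_odds_ratio(x11, x12, x21, x22):
    """Equation (1): log odds ratio with 1/2 added to every cell."""
    return np.log(((x11 + 0.5) * (x22 + 0.5)) / ((x12 + 0.5) * (x21 + 0.5)))

def lor_variance(x11, x12, x21, x22):
    """Equation (2): approximate sampling variance, reciprocals of adjusted cells."""
    return sum(1.0 / (x + 0.5) for x in (x11, x12, x21, x22))

print(log_odds_ratio(42, 17, 13, 28))   # about 1.63
print(lor_variance(42, 17, 13, 28))     # about 0.190
```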
Table 1 gives these estimates for the individual subjects. Aggregation over subjects uses a weighted average with weights equal to the reciprocal of the multinomial sampling variability:

ȳW = (Σ wk yk) / (Σ wk)   with   wk = 1/s²k.      (3)

The sampling variance of an average that uses these weights is the reciprocal of the sum of the weights:

var(ȳW) = 1 / Σ wk.

The hypothesis that the population value corresponding to ȳW equals zero is tested by the ratio of ȳW to its standard error, or equivalently by referring the ratio

X² = ȳW² / var(ȳW)      (4)

to a chi-square distribution with 1 df. Test statistics in this class are unsatisfactory in the presence of between-subject variability. Figure 1 shows the proportion of null hypotheses rejected at the nominal 5% level as a function of the association standard deviation for experiments with n = 10 subjects, m = 100 observations per subject, and marginal distributions that were nearly symmetrical (μξ = 0.3 and σξ = 0.2, giving ap-
FIGURE 1. Type I error rates at the nominal 5% level for tests based on multinomial variability. Subject sample size n = 10, fixed response sample size mk = 100, and near-symmetrical marginal distributions. Each point is based on 10,000 simulated experiments.
proximately a 57:43 distribution in the marginal categories). If the tests were unbiased, then these lines would lie along the 5% line at the bottom of the figure. Instead, the proportion of rejected hypotheses increases with the intersubject variability. The bias is substantial when the association varies widely among subjects. One might now ask how the biases in Figure 1 depend on the sample size. In particular, conventional lore about sample sizes would suggest that large samples of subjects or items would reduce or eliminate the bias. Unfortunately, this supposition is wrong, as Figure 2 shows for the statistic ΔG²B (the other statistics are similar). The bias of the test increases (not decreases) with the number of observations per subject and is essentially unaffected by the number of subjects. The problems with these tests are too fundamental to be fixed by larger samples.
FIGURE 2. Type I error rates at the nominal 5% level for the statistic ΔG²B as a function of subject and item sample sizes. The association standard deviation is ση = 1.0 and each point is based on 10,000 simulations.
The difficulty with this class of tests, and the source of the biases in Figures 1 and 2, is that the wrong null hypothesis is being tested. There are three null hypotheses that could be tested. The strongest of these asserts that there is no relationship at all between the behavior categories. In terms of the variability model, this hypothesis implies that every η̃k is zero. It is a compound hypothesis, as it implies that both the mean association μη and the variability σ²η are zero:

H01: μη = 0   and   σ²η = 0.

This hypothesis is equivalent to a hypothesis of the conditional independence of the two responses given by the subjects and is tested by the fit of the log-linear model [AS][BS]. A much less restrictive null hypothesis is that there is no association between categories in the sample:

H02: η̄ = 0.

This hypothesis focuses attention on the particular sample. It does not involve the population parameters. The third null hypothesis refers only to the mean association:

H03: μη = 0.

It does not preclude intersubject variability and is also less restrictive than H01. When there is no intersubject variability, that is, when σ²η = 0, these three null hypotheses are equivalent. When σ²η > 0, they differ. The statistics that incorporate only the multinomial variability test hypothesis H02. Although these statistics are satisfactory ways to test this hypothesis, when the subjects vary, as they usually do, a researcher probably wants to test the population hypothesis H03 instead of the sample hypothesis H02. The researcher's intent is to generalize to a population parameter, not to restrict conclusions to the particular sample. A test of H03 must accommodate the subject variation.
5. TESTS BASED ON BETWEEN-SUBJECT VARIABILITY

There is a handy rule for treating between-subject effects that I have often found useful. If you wish to analyze anything that varies across subjects, you first find a way to measure it for each individual, ideally to reduce each subject's data to a single number. These numbers can then be investigated using a standard statistical procedure that accommodates their variability.² In this spirit, I consider from this class statistics that test the distribution of the log odds ratios for a difference in mean from zero. All these statistics are based on the observed variability of the yk.
The prototype for a test based on an empirical estimate of variability is the t test. To apply it here, the association between categories for each subject is estimated by yk. The observed mean ȳU and variance s²U are calculated, and the hypothesis that the population mean corresponding to these observations is equal to any particular value is tested with a single-sample t test; for H03

tU = ȳU / (sU/√n).      (5)

The ordinary t test might prove unsatisfactory for a variety of reasons. One potential problem is heterogeneity of variability. The sampling variability of the log odds ratio yk depends both on the underlying distribution πijk (and through it on the subject-level parameters ξ1k, ξ2k, and η̃k) and on sample sizes mk. When these values vary from one subject to the next, so does the sampling variance of yk. One way to accommodate this heterogeneity is to use a weighted t test. Each observation is weighted by the inverse of its sampling variance, as in Equation (3), so that

ȳW = (Σ wk yk) / (Σ wk)   and   s²W = Σ wk (yk - ȳW)² / (n - 1).

The test statistic tW is calculated as in the unweighted case (Eq. 5). It is instructive to relate these t tests to the log-linear models. One can write either t statistic as an F ratio; for example, using the weighted statistic,

FW = t²W = (SSassn/1) / (SShetero/(n - 1)),

where

SSassn = Σk wk ȳW²   and   SShetero = Σk wk (yk - ȳW)².      (6)
2An example of this principle in another domain is a test of a hypothesis about the value of a contrast in a repeated-measures analysis of variance. To test the hypothesis that a contrast ~ differs from zero, one compares the sums of squares associated with the contrast to the sum of squares for the t~ • S interaction. This error term measures the variation of the individual estimates of ~k. The F statistic that results is equivalent to a single-sample t test that checks whether the population mean of t~ is zero.
158
Thomas D. Wickens
FIGURE 3. Type I error rates for tests based on between-subject variability. The conditions are the same as in Figure 1, but the ordinate is expanded.
The presence of association, comparable to SSassn, is expressed by the difference statistic ΔG²B, and the variability of association, comparable to SShetero, is expressed by the goodness-of-fit of the no-interaction model [AB][AS][BS]. Both quantities have chi-square distributions under the null hypothesis H03. Although I do not know whether these statistics have the independence necessary to construct a proper F statistic, their ratio might provide a subject-based test of association:³

FG = (ΔG²B/1) / (G²[AB][AS][BS]/(n - 1)) = (n - 1)(G²[AS][BS] - G²[AB][AS][BS]) / G²[AB][AS][BS].

This test is handy when one has the report of a log-linear analysis but no access to the original data. I call this statistic a pseudo F. As might be expected, the simulations show that these tests solve the Type I error problem. In Figure 3, the Type I error rates for all three tests lie approximately at the nominal 5% level. The ordinary t test, plotted as a solid line, does very well here. Moreover, these error rates are not substantially modulated by the parameters of the simulations. A better picture of the differences among these statistics is obtained from their power functions. When the marginal distributions are approximately symmetrical and the sample sizes are moderate, there is very little difference among them. Figure 4 shows power functions when m = 100 and n = 20 and the marginal distributions are nearly symmetrical. The panels differ in the degree of between-subject variability. The three tests have almost identical power functions unless the association variability is very large. When the subjects are very heterogeneous, the weighted t test is least satisfactory, with a power function that is displaced away from zero.

³This test is certainly inappropriate for tables larger than 2 × 2. In such tables, ΔG²B and G²[AB][AS][BS] may be influenced by different components of the effect, and thus may be uncomparable.
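Because the pseudo F uses only the two reported fit statistics, it can be computed from a published log-linear analysis without the raw data. A minimal helper follows (illustrative; the G² values shown are hypothetical, and the ratio is referred, as its form suggests, to an F distribution with 1 and n - 1 degrees of freedom).

```python
from scipy import stats

def pseudo_f(g2_ci, g2_ni, n_subjects):
    """Pseudo F from the fit of the conditional-independence model [AS][BS]
    (g2_ci) and the no-interaction model [AB][AS][BS] (g2_ni)."""
    f = (n_subjects - 1) * (g2_ci - g2_ni) / g2_ni
    p = stats.f.sf(f, 1, n_subjects - 1)
    return f, p

# Hypothetical report: G2([AS][BS]) = 31.4, G2([AB][AS][BS]) = 9.8, n = 8 dyads.
print(pseudo_f(31.4, 9.8, 8))
```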
FIGURE 4. Power of tests based on between-subject variability. Subject sample size n = 20, fixed response sample size mk = 100, and nearly symmetrical marginal distributions. Curves are based on 5000 simulated experiments and have been smoothed.
When the association is highly variable, there are some subjects for which one cell is empty or nearly so. These cells may be at the root of the weighted t problem. Evidence for this contention is obtained from the power functions in Figure 5, for which a highly asymmetrical set of marginal frequencies was used (μξ = 1.5, σξ = 0.2; marginal categories distributed roughly as 82:18). These marginals also create cells in which πijk is tiny and small frequencies are likely. The unbalanced frequencies in this extreme case exaggerate the differences among the statistics. Again, the weighted t function is displaced along the abscissa. These effects probably result from the way the procedures treat cells with small observed frequencies, as they vanish when the item sample sizes mk are large. In general, these curves demonstrate the robustness of both the unweighted t and the pseudo F statistics to violations of the large-sample assumptions. I think the important point here is that the standard t does as well as any of the other statistics. It is possible that variation in the size of the multinomial samples mk might make the weighted t test necessary. This possibility was my main motivation for running the simulations with a negative binomial distribution of mk. I tried
FIGURE 5. Power of tests based on between-subject variability with ση = 0.5. Subject sample size n = 20, fixed response sample size mk = 100, and asymmetrical marginal distributions. Curves are based on 5000 simulated experiments and have been smoothed.
distributions with both small and large amounts of variability relative to the mean. However, these changes rarely affected the ordering of the statistics. Figure 6 shows the power functions for experiments in which the variance of the sample sizes was 5 times their mean μm. Even here, the standard t is more satisfactory than the weighted t, whose power functions are still displaced. I have two conclusions to draw about the class of tests based on between-subject variability. First, when the samples are large and there are no very improbable cells, all of the statistics are satisfactory and roughly equivalent. Sec-
FIGURE 6. Power for tests based on between-subject variability with a negative binomial distribution of response sample sizes. Subject sample size n = 25, association standard deviation ση = 1.0, and near-symmetrical marginal distributions. The variance of the distribution of mk is 5 times the mean. Curves are based on 5000 simulated experiments and have been smoothed.
ond, the tests are not equally robust to unlikely cells. The ordinary t test is better than some alternatives and no worse than any other statistic. Somewhat to my surprise, it appears to be the most accurate procedure.
6. PROCEDURES WITH TWO TYPES OF VARIABILITY

Of the procedures examined so far, the t test appears to work best. This test is based on an estimate of the between-subject variability only. The question remains whether better results could be obtained from tests that separately estimate the between-subject and the within-subject components of variability. There is a substantial literature on this problem, dispersed over several domains. The fact that real categorical data may be overdispersed relative to the multinomial has been recognized for many years (e.g., Cochran, 1943). Some of this work concerns essentially the same problem that I am discussing here, that is, the accumulation of separate instances (responses) from a sampled population of larger units (subjects or dyads). Another line of work derives from the theory of survey sampling, in which the heterogeneity of geographical areas or respondent subpopulations introduces the extra variation. A third body of work involves the meta-analysis of categorical experiments, particularly in the medical domain, where experiments that differ in their conditions and size of effect are combined. In these studies, individual subjects play the role that I have assigned to responses, and the experiments correspond to my subjects. Most (although not all) of these methods begin with a sampling model similar to the one I have described. Many are derived from the Generalized Linear Model framework (McCullagh & Nelder, 1989). Unfortunately, almost all of these models suffer from some degree of mathematical intractability. Except for a procedure that merely adjusts the sample sizes downward to accommodate the variability (Rao & Scott, 1992), it is difficult to estimate the excess variance component. Either one must apply a computationally intensive procedure such as the EM algorithm or numerical minimization, or use estimators that may be suboptimal. Most of these procedures require some special purpose programming. I have not looked at all of the candidate tests here, as many of them were too complex and time consuming to be used in the type of simulation work I have described. I did, however, look at two procedures, one from the meta-analysis literature (DerSimonian & Laird, 1986) and the other a random-effect model (Miller & Landis, 1991). Both procedures obtain their estimates of the between-subject parameters by looking at the excess variability over the multinomial variability. This strategy is at the core of most of these tests; for example, it is used in the Generalized Linear Model approach. As an example of these approaches, I will describe the DerSimonian-Laird procedure, which illustrates what is going on more transparently than the other statistics.
This test is almost like the test of Woolf's statistic (Eq. 4), but it includes an adjustment for the between-subject variability in the weights. One begins by writing the variance of the log odds ratio yk as the sum of two parts, a between-subject part σ²η and a multinomial within-subject part s²k:

σ²yk = σ²η + s²k.

The within-subject component s²k is the multinomial sampling variance, as estimated by Equation (2). The between-subject component is estimated from the observed variability. First, calculate the weighted mean association ȳW (Eq. 3) and the weighted sums of squares of deviations about it (Eq. 6) using weights based on the multinomial variability:

SShetero = Σk wk (yk - ȳW)²   with   wk = 1/s²k.

When the association is homogeneous over the subjects, SShetero is distributed as a χ² with n - 1 df and has an expected value of n - 1; but when there is between-subject variability, its expected value exceeds n - 1. The size of the between-subject component is estimated from this excess. After some algebra, a method-of-moments estimator of σ²η is obtained:

σ̂²η = max[0, SShetero - (n - 1)] / (Σ wk - Σ w²k / Σ wk).      (7)

The difference between the observed SShetero and its expectation in the absence of between-subject differences is truncated at zero to avoid negative variance estimates. The balance of the test parallels the test of the Woolf statistic. First, new weights are created that include both sources of variability:

w*k = 1/σ̂²yk = 1/(σ̂²η + s²k).      (8)

Next, a pooled estimate of the association and its variance is found using these weights:

ȳ*W = (Σ w*k yk) / (Σ w*k)   with   var(ȳ*W) = 1 / Σ w*k.

Finally, the ratio of ȳ*W to its variance gives a test statistic for the hypothesis of no association:

X²W = (ȳ*W)² / var(ȳ*W).
Similar to the test for the Woolf statistic, this ratio is referred to a chi-square distribution with one degree of freedom. The adjustment of the weights makes the test more conservative. When between-subject variance is present, it increases var(ȳ*W) relative to the uncorrected var(ȳW) and reduces X²W accordingly. The simulations based on these statistics make three points. First, the Type I error rates are imperfect. There is some negative bias when the association variability is small and some positive bias when it is large, although these biases are small compared with those of the multinomial-based tests and improve with increased mk. These biases show both in the Type I error functions (Fig. 7, plotted on the same scale as Fig. 3 and including the t test for comparison) and in the power functions (Fig. 8). I suspect that these biases are the result of the truncation at zero used in the method-of-moments estimators. Second, the power functions are displaced when an improbable cell is created by a high marginal bias (Fig. 8, right). The simpler adjusted Woolf statistic is worse off than the more complex Miller-Landis test, although both are affected in the most extreme case. At their best, these tests are comparable in power to the between-subject tests such as the t test. Apparently, as testing procedures, these statistics, which are based on two sources of variability, do not improve on the t test, which does not separate between-subject and within-subject components. Although the new tests provide separate estimates of the between-subject variance components, they do not give more powerful tests.
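The adjusted procedure is also compact in code. The sketch below (illustrative, not the authors' implementation) follows Equations (7) and (8) and the chi-square test, using the yk and sk² values of Table 1.

```python
import numpy as np
from scipy import stats

def dersimonian_laird_test(y, s2):
    """Random-effects (DerSimonian-Laird) test of zero mean log odds ratio."""
    y, s2 = np.asarray(y, float), np.asarray(s2, float)
    w = 1.0 / s2
    y_w = np.sum(w * y) / np.sum(w)
    ss_hetero = np.sum(w * (y - y_w) ** 2)
    n = len(y)
    # Equation (7): method-of-moments estimate of the between-subject variance.
    tau2 = max(0.0, (ss_hetero - (n - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
    # Equation (8): weights that include both variance components.
    w_star = 1.0 / (tau2 + s2)
    y_star = np.sum(w_star * y) / np.sum(w_star)
    x2 = y_star ** 2 * np.sum(w_star)        # equivalently y_star**2 / var(y_star)
    p = stats.chi2.sf(x2, df=1)
    return tau2, y_star, x2, p

y = [1.634, 0.526, 0.698, -0.187, 0.780, -0.076, 0.868, 1.064]
s2 = [0.190, 0.204, 0.222, 0.162, 0.180, 0.157, 0.174, 0.183]
print(dersimonian_laird_test(y, s2))
```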
7. DISCUSSION

These results are a little disconcerting at first. Why do the procedures that accommodate two sources of variability work no better than those that use only a
FIGURE 7. Type I error rates for tests using both between-subject and within-subject variability. Subject sample size n = 10, fixed response sample size mk= 100, and near-symmetrical marginal distributions. Each point is based on 10,000 simulated experiments.
FIGURE 8. Power of tests based on both between-subject and within-subject variability. Subject sample size n = 10 and fixed response sample size m k = 100. Curves are based on 5000 simulated experiments and have been smoothed.
between-subject source? The answer lies in how "between-subject" variability is defined and estimated in the two types of tests. A test that only takes account of a "between-subject" component measures it directly from the observed variability. This estimate encompasses both differences among subjects and any sampling fluctuation intrinsic to the subject. The tests that involve both components of the variability attempt to separate these two sources. In these tests, the between-subject portion refers only to the variability that exceeds an estimate of the within-subject part. Generally, the two variance components must be combined to test for between-group effects. These steps are clear in the DerSimonian-Laird model: the total variability σ²yk is partitioned into σ²η and s²k, and this partition is the basis for the estimate of σ²η in Equation (7). The two terms are recombined when the weights are calculated in Equation (8). The other procedures operate in similar, if sometimes less obvious, ways. In retrospect, it is not surprising that they look like the between-subject tests. Much the same kind of thing happens in other domains in which both between-subject and within-subject variability can be measured. Consider the analysis of variance. In a mixed design, say one with a groups of n subjects and m observations per subject, one can estimate distinct components of variability σ²S and σ²R, the first deriving from between-subject sources and the second from within-subject (or repetition) sources. The test of a between-group effect uses
an error term that combines both sources--the expected mean square for the error term is

E(MSS/A) = m σ²S + σ²R.

Although this error term contains two theoretical parts, it is estimated by a mean square that uses only differences among the subjects, not those among the repeated scores. To calculate it, you need only the data aggregated within a subject:

MSS/A = m Σk (ȳi·k - ȳi··)²,
where yijk is the jth score of the kth subject in group i and dots indicate averages. The model contains both between-subject and within-subject terms, but the test itself is calculated from between-subject information only. It helps here to remember the difference between the sampling models for frequencies and those for continuous scores. Both the binomial and the Poisson sampling models for an observation depend on a single parameter, so that their means and variances are functionally related. Although this relationship lets one construct tests in situations in which no independent error estimate is available, it also prevents the model from representing arbitrary variability. The addition of an independent variance parameter, however formulated, puts them in a category analogous to that of the models more typically used for normally distributed scores. As I keep being reminded, the t test and the analysis of variance are robust members of this class. The principle I have applied here to test the association against between-subject variability can be used with other hypotheses. Consider two examples. Many experiments with categorical responses use independent groups of subjects. When one observation is obtained from each subject, a two-way table results, say with the subject classification specified by the row and response by the column. From a contingency table perspective, the hypothesis of homogeneous response distributions over the subject groups is tested in this table by the same test that would be used to test the independence of a double classification. When repeated observations are present, things are a little different. Although their specific forms are somewhat different, three null hypotheses can be written. The hypothesis H01 asserts that the response probabilities are homogeneous, both between and within the populations from which the groups were drawn; the hypothesis H02 asserts the absence of differences in mean response probabilities for the particular samples that were tested, but allows for individual differences; and the hypothesis H03 asserts that the mean response probabilities are identical across the populations. Typically, a researcher is only interested in H03. The tests described in the earlier sections of this paper cannot be applied directly, as the lack of matched subjects prevents association coefficients such
as yk from being calculated. The natural approach is to calculate a performance measure for each subject and compare the groups on this measure, using something like a two-sample t test or analysis of variance. This is the approach that would be taken by most researchers. To be concordant with the log-linear structure of most frequency analyses, one might choose to make the comparison using logits. For data in which xijk refers to the number of responses in category j made by the kth subject in group i, calculate the estimated logits

y1k = log[(x11k + ½)/(x12k + ½)]   and   y2k = log[(x21k + ½)/(x22k + ½)],
then compare the two sets of numbers. I have not done elaborate simulations of this situation, but I believe the same principles that apply to the association tests also apply here. In some other work in which I compared parametric and contingency table approaches to categorical data, I was impressed by the stability and power of the t test, even when the number of observations from each subject is very small.
The second example returns to a two-classification situation exemplified by my dyad example. Suppose that you wish to compare the marginal frequencies in the two tables--in the previous example, to look at whether the behavior rates for the caregivers and the infants differ. The null hypothesis here, expressed in logarithmic form, is that the two marginal logits ξ1 and ξ2 are identical. With one observation per subject, the test is one of marginal homogeneity, which in a 2 × 2 table is McNemar's test of correlated proportions. When repeated observations from each subject are available, the marginal logits are estimated by

z1k = log[(x11k + x12k + ½)/(x21k + x22k + ½)]   and   z2k = log[(x11k + x21k + ½)/(x12k + x22k + ½)].

These values are compared with a dependent-scores t test or some comparable test. Equivalently, the difference between z1k and z2k is a single quantity that expresses the difference in marginal rates for that subject:

dk = log{[(x11k + x12k + ½)(x12k + x22k + ½)] / [(x21k + x22k + ½)(x11k + x21k + ½)]}.
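As a sketch of this second example (illustrative code, not the chapter's), dk can be computed for each dyad of Table 1 and its mean tested against zero with an ordinary one-sample t test, in line with the between-subject strategy recommended above.

```python
import numpy as np
from scipy import stats

def marginal_logit_difference(x11, x12, x21, x22):
    """d_k: difference between the two marginal logits, with the 1/2 corrections."""
    z1 = np.log((x11 + x12 + 0.5) / (x21 + x22 + 0.5))   # e.g., caregiver rate
    z2 = np.log((x11 + x21 + 0.5) / (x12 + x22 + 0.5))   # e.g., infant rate
    return z1 - z2

# (x11, x12, x21, x22) = (YY, YN, NY, NN) counts for the eight dyads of Table 1.
tables = [(42, 17, 13, 28), (44, 30, 12, 14), (42, 9, 34, 15), (29, 29, 23, 19),
          (42, 15, 24, 19), (24, 24, 27, 25), (38, 24, 15, 23), (45, 16, 19, 20)]
d = np.array([marginal_logit_difference(*t) for t in tables])
print(stats.ttest_1samp(d, 0.0))
```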
This measure is tested against zero using a procedure based on estimates of the between-subject variability. I do not want to give the impression that I think that t tests and analyses of variance solve every problem involving between-subject variation. The tests that separate the components of variability give information that is unavailable when the variability is pooled. They may also be more appropriate in situations in which the data are sparse and in which the statistics I have used here, such as yk, are variable, impossible to calculate, or depend on such semiarbitrary things as the ½ correction. Potentially, one of the techniques that I have not investigated, such as the computationally less convenient likelihood- or quasilikelihood-based procedures, may avoid some of the bias that appeared in my simulations. However, I do not really expect these methods to improve the power of the tests. In one form or another, they base their between-subject variability estimate on the observed between-subject variability, just as do the simpler tests. As I mentioned at the start, I believe that it is important to be able to provide recommendations to others that are feasible, and if not optimal, at least near to it. Contingency data can be particularly confusing. I think that by clarifying which null hypothesis should be tested and identifying a satisfactory way to test it, a good practical solution to the problem of between-subject variability in these designs is achieved.
ACKNOWLEDGMENTS This research was supported by a University of California, Los Angeles, University Research Grant.
Assessing Reliability of Categorical Measurements Using Latent Class Models
Clifford C. Clogg¹
Pennsylvania State University University Park, Pennsylvania
Wendy D. Manning
Department of Sociology Bowling Green State University Bowling Green, Ohio
1. INTRODUCTION

To assess the reliability of continuous or quantitative measurements of one or more underlying variables, standard methods based on "average" correlations or model-based assessments using factor analysis are available (for a compact survey, see Carmines & Zeller, 1979). With the exception of Bartholomew and Schuessler (1991), who developed reliability indices for use with latent trait models, there is little explicit attention to reliability and its measurement for categorical measures.

¹Clifford C. Clogg passed away shortly after the completion of this manuscript. I am honored to have had the opportunity to collaborate with such a wonderful colleague. W.D.M.
Reliability is most often conceived of in terms of errors in measurement, so any measurement-error model is automatically a model for assessing reliability. Latent class models provide a natural framework for assessing reliability of categorical measures (cf. Schwartz, 1986). In this chapter, we explore various definitions of reliability as they might be operationalized with standard latent class models. Clogg (1995) surveys the latent class model, including various areas of application, and provides an exhaustive list of references. For the most part, the technical underpinning of this chapter can be found in Clogg (1995) or in references cited there, and for this reason we do not include an exhaustive list of references here. The fundamental paper on modern latent class analysis is Goodman (1974). Except for slight changes in notation consistent with recommendations given in Clogg (1995), the model formulation put forth in Goodman's fundamental work is the one used here.

Reliability of categorical measurements is examined in this chapter in the context of an empirical study of the measurement properties of various indicators of social support. Social support refers to the provision of or the receipt of aid from family members, close friends, parents, grandparents, and so on. There are many possible types of social support. A rigid empiricist view would regard each type of support as a social type, not as an indicator of something more fundamental. Such a view, however, prohibits the assessment of reliability unless an independent data source with "true" or at least more exact measurements is available which would allow comparison of true measures with the other measures whose reliability is in question. The view adopted here is that (a) each type of support represents an indicator of a broad pattern or structure of social support; (b) each specific indicator is an instance of a more abstract social condition or support structure and gives clues about that structure; and (c) the items as a set measure the structure, or the items measure an underlying variable that defines the structure, given assumptions. On the basis of this orientation, latent structure methods can be used to characterize the pattern of support that can be inferred from either unrestricted (and nonparametric) views of the structure or restricted (and more "parametric") views of the structure.

The data source is the popular National Survey of Families and Households (NSFH; Sweet, Bumpass, & Call, 1988). The NSFH is a nationally representative sample of approximately 13,000 individuals in the United States. The indicators of social support contained in these data have been analyzed in several publications (Hogan, Eggebeen, & Clogg, 1993; Manning & Hogan, 1993). In this chapter, several of the indicators used in these sources are examined to determine reliability of the measures and related aspects of the data. Textbook accounts refer to reliability, in the abstract, as average correlations in some sample that generalize in an obvious way to some population, without regard for the context or the group to which these correlations refer. We examine reliability in several
senses, including reliability inferred from group comparisons that are relevant for inferring measurement properties of the indicators and for the social structures or patterns associated with these measures in particular contexts or in particular groups.
2. THE LATENT CLASS MODEL: A NONPARAMETRIC METHOD OF ASSESSING RELIABILITY

We first give the general form of the latent class model to show that the parameters in this model can be used to infer reliability and other aspects of the measurements. The exposition is nontechnical. We focus mostly on what is assumed and what is not assumed in using this model to make inferences about reliability. We also give measures of reliability suited for different questions that arise in the assessment of reliability outside the world of correlations.

Suppose that we have three dichotomous indicators of some underlying characteristic, variable, or trait. Let these be denoted as A, B, and C; and let the levels of those variables be denoted by subscripts i, j, and k, respectively. Level 1 might refer to receiving and level 2 to not receiving support. The items might refer to the receipt of aid or support of various kinds, such as childcare, advice, or material benefits. The probability of cell (i,j,k) in the observable cross-classification of A, B, and C is denoted π_ABC(ijk).

Now suppose that each item measures an underlying variable X, with possibly different levels of reliability. How reliability ought to be measured or indexed is the main objective of this chapter, but the most significant feature of the latent class model is that X is modeled nonparametrically so that reliability can be assessed without bringing in extra assumptions. That is, we do not assume that X is continuous or even quantitative; rather, we assume only that X has a specific number of categories or "latent classes." Let T denote the number of latent classes. If X could be observed, the "data" could be arranged in a four-way contingency table cross-classifying A, B, C, and X. Let t denote the level of the latent variable X, and let π_ABCX(ijkt) denote the probability of cell (i,j,k,t) in this indirectly observed contingency table. The latent class model first assumes that

    π_ABC(ijk) = Σ_{t=1}^{T} π_ABCX(ijkt).                                        (1)
In words, what we observe is obtained by collapsing over the levels of the hidden variable. The number T of latent classes can be viewed as a parameter. This integer number reflects the number of latent classes or groups in the population, up to the limits imposed by identifiability. As a practical matter, we can always find an integer value of T for which the relationship in Equation (1) must hold.
That is, by letting the value of T denote a parameter, the preceding assumption is not really an assumption at all, but rather an algebraic property of decomposing observed or observable contingency tables into a certain number of indirectly observed contingency tables, where this number is determined by the number of variables, the "size" of the contingency table, and other features of the data structure. The expression in Equation (1) is thus not a restrictive aspect of the model as such. Of course, by specifying a particular value for T (T = 2, T = 3, etc.), the model imposes an assumption, and this assumption can be tested. But nothing has been assumed about the latent variable X. If the true underlying variable is discrete with T classes, then the T-class model is exactly suited for this situation. If the true underlying variable is metric (or continuous or even normally distributed), then X in this model represents a nonparametric or discretized version of this latent variable.

The distribution of the modeled latent variable can be represented by the so-called latent class proportions, π_X(t) = Pr(X = t), with Σ_{t=1}^{T} π_X(t) = 1 and π_X(t) > 0 for each t. Note that the case in which π_X(t*) = 0 for some t* need not be considered; if the t*th class is void (has zero probability), then the model has T - 1 latent classes, not T latent classes. The T-class latent structure refers to T nonvoid latent classes.

For item A, let π_{A|X=t}(i) = Pr(A = i | X = t) denote the conditional probability that item A takes on level i given that latent variable X takes on level t, for i = 1, 2 and t = 1, ..., T. Let π_{B|X=t}(j) and π_{C|X=t}(k) denote similarly defined conditional probabilities for the other items. The latent class model is now completed by the expression, or the model,

    π_ABCX(ijkt) = π_X(t) π_{A|X=t}(i) π_{B|X=t}(j) π_{C|X=t}(k),                 (2)

which is posited for all cells (i, j, k, t). This part of the model expresses a nontrivial assumption that the observed items are conditionally independent given the latent variable X. This is the so-called assumption of local independence that defines virtually all latent structure models (Lazarsfeld & Henry, 1968). Much has been written about this assumption. There are many models that relax this assumption to some extent, but for ordinary assessments of reliability, we think that this assumption serves as a principle of measurement. We take this assumption as a given; virtually all model-based assessments of reliability begin with this or a similar assumption.

Note that reliability of measurement can be defined as functions of the conditional probability parameters. For example, suppose that level i = 1 of item A corresponds to level t = 1 of X, that level i = 2 of item A corresponds to level t = 2 of X, as with observed receiving and "latent" receiving of aid. Then item level-specific reliability is defined completely, and nonparametrically, in terms
of the conditional probabilities for item A. That is, level i = 1 of item A perfectly measures level t = 1 of X if π_{A|X=1}(1) = 1; values less than unity reflect some degree of unreliability. Level i = 2 of item A perfectly measures level t = 2 of X if π_{A|X=2}(2) = 1. Item-specific reliability can be defined completely, and nonparametrically, in terms of the set of conditional probabilities for the given item. We would say, for example, that item A is a perfectly reliable indicator of X if π_{A|X=1}(1) = π_{A|X=2}(2) = 1, so that item reliability in this case is a composite of the two relevant parameters for item-level reliability. (Perfect item reliability is the same as perfect item-level reliability at each level of the item.)

In the preceding form, reliability can be viewed as the degree to which X predicts the level of a given observed item. We can also ask how well X is predicted by a given item, which gives us another concept of reliability. (Note that for correlation-based approaches, prediction of X from an item and prediction of an item by X are equivalent and hence are not distinguished from each other. This is because the correlation between two variables is symmetric.) Using the definition of conditional probability, we can define reliability in terms of prediction of X as follows. Instead of the conditional probabilities in Equation (2), use instead the reverse conditionals,

    π_{X|A=i}(t) = π_X(t) π_{A|X=t}(i) / π_A(i),                                  (3)
where π_A(i) = Pr(A = i) is the marginal probability that A takes on level i. (Note that this expression is equivalent to Pr(X = t) Pr(A = i | X = t) / Pr(A = i) = Pr(A = i, X = t) / Pr(A = i).) Item level-specific and item-specific reliability can be defined in these terms. Finally, a margin-free measure of item-level reliability can be defined in terms of the (partial or conditional) odds ratio between X and A,

    O_AX = π_{A|X=1}(1) π_{A|X=2}(2) / [π_{A|X=1}(2) π_{A|X=2}(1)].               (4)
Generalizations of this function can be applied when the variables involved are polytomous rather than dichotomous. It is easily verified that the same result is obtained when the reverse conditionals are used. A correlation-type measure based on these values is

    Q_AX = (O_AX - 1) / (O_AX + 1),                                               (5)
which is just Yule's Q transform of the odds ratio as given in elementary texts. Note that all quantities can also be defined for items B and C, or for however many items are included in the model.

We can generalize the concept of reliability-as-predictability further. Rather than focus on item level-specific reliability or item-level reliability, we can instead focus on item-set reliability. For a set S of the original items, say S = {A, B}, the following probability can be calculated: π_{X|S}(t) = Pr(X = t | S = s), for example, Pr(X = 1 | A = i, B = j). For items in a given set S, the item-set
reliability can be operationalized in terms of these quantities. For the special case in which S corresponds to a single item, item-set reliability is already defined. In cases in which S is the complete set of items, the reliability of the entire set of items as measures of X can be measured in terms of

    π_{X|ABC}(t) = π_ABCX(ijkt) / π_ABC(ijk),                                     (6)
where the right-hand quantities already appear in the two fundamental equations of latent class analysis, Equations (1) and (2). Convenient summary indices of reliability, or predictability, can be defined in terms of these equations or models.

To summarize, the nonparametric latent class model assumes (a) that the observed contingency table is obtained by collapsing over the T levels of the unobserved variable and that (b) the observed association among items is explained by the latent variable. That is, the assumption of local independence is used to define the latent variable or the latent structure. Given this model, various definitions of reliability of measurements are possible, including item level-specific reliability conceived as the prediction of observed items from the latent variable or as the prediction of the latent variable from the observed items, and various definitions of item-specific reliability, even including a correlation-type measure based on Yule's Q transform. Finally, reliability of sets of items or even of the entire set of items can be defined easily in terms of probabilistic prediction. These various measures of reliability are all direct functions of the parameters in the latent class model; indeed, in some accounts, these are viewed as the "parameters" of the model. We next illustrate these in a concrete case.
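To make Equations (1), (2), and (6) concrete, the following sketch (written in Python; it is an added illustration, not part of the original text) fits the two-class model by a bare-bones EM routine and then reports the latent class proportions and the odds-ratio and Yule's Q measures of Equations (4) and (5). The starting values, the fixed number of iterations, the 0/1 coding of levels, and the table of counts are all hypothetical simplifications.

    def fit_two_class_model(counts, n_iter=500):
        # counts[(i, j, k)]: observed frequency for cell (A=i, B=j, C=k), with i, j, k in {0, 1}
        # p[t] = pi_X(t); cond[m][t][level] = pi_{item m | X = t}(level) for items m = A, B, C
        p = [0.5, 0.5]
        cond = [[[0.6, 0.4], [0.4, 0.6]] for _ in range(3)]   # crude, asymmetric start values
        cells = list(counts)
        n = sum(counts.values())
        post = {}
        for _ in range(n_iter):
            # E-step: posterior Pr(X = t | cell), as in Equation (6)
            for c in cells:
                joint = [p[t] * cond[0][t][c[0]] * cond[1][t][c[1]] * cond[2][t][c[2]]
                         for t in (0, 1)]
                total = sum(joint)
                post[c] = [joint[t] / total for t in (0, 1)]
            # M-step: update the latent class proportions and conditional probabilities
            class_total = [sum(counts[c] * post[c][t] for c in cells) for t in (0, 1)]
            p = [class_total[t] / n for t in (0, 1)]
            for m in range(3):
                for t in (0, 1):
                    for level in (0, 1):
                        num = sum(counts[c] * post[c][t] for c in cells if c[m] == level)
                        cond[m][t][level] = num / class_total[t]
        return p, cond, post

    # hypothetical 2 x 2 x 2 table of counts for three dichotomous items
    counts = {(0, 0, 0): 150, (0, 0, 1): 40, (0, 1, 0): 35, (0, 1, 1): 30,
              (1, 0, 0): 45, (1, 0, 1): 60, (1, 1, 0): 55, (1, 1, 1): 120}

    p, cond, post = fit_two_class_model(counts)
    print("latent class proportions:", [round(x, 3) for x in p])
    # margin-free reliability of item A: odds ratio and Yule's Q, Equations (4) and (5)
    o_ax = (cond[0][0][0] * cond[0][1][1]) / (cond[0][0][1] * cond[0][1][0])
    print("O_AX =", round(o_ax, 2), " Q_AX =", round((o_ax - 1) / (o_ax + 1), 2))

Which class ends up labeled "1" depends on the start values, so in practice the classes are identified by inspecting the fitted conditional probabilities.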
3. RELIABILITY OF DICHOTOMOUS MEASUREMENTS IN A PROTOTYPICAL CASE

The simplest possible case to consider is a 2³ table that cross-classifies three dichotomous indicators. In this case, the two-class latent structure is exactly identified (Goodman, 1974). Table 1 gives a three-way cross-classification of three items from the NSFH. The items pertain to receipt of support from parents. The items are A (receive instrumental help), B (receive advice), and C (receive childcare), with level 1 = did not receive aid and level 2 = received aid. The time referent is "in the past month." The group is white, non-Hispanic mothers less than 30 years of age, with n = 682. Thus, the items cross-classified in this table have specific group, time, and source (i.e., parental) referents, and these have to be kept in mind when assessing reliability.

We also provide the expected frequencies under the model of mutual independence, the log-linear hypothesis that fits the univariate marginals {(A), (B), (C)}. This model is equivalent to the latent class model with T = 1 "latent"
TABLE 1 Cross-Classification of Three Indicators of Social Support

    Cell (C, B, A)    Observed frequency    Expected frequency    Pr(X = 1)    Pr(X = 2)    X
    (1, 1, 1)                193                  104.3               .97          .03      1
    (1, 1, 2)                 56                   91.1               .70          .30      1
    (1, 2, 1)                 49                   76.7               .76          .26      1
    (1, 2, 2)                 41                   67.0               .19          .81      2
    (2, 1, 1)                 65                  105.5               .62          .38      1
    (2, 1, 2)                 79                   92.1               .11          .89      2
    (2, 2, 1)                 57                   77.6               .14          .86      2
    (2, 2, 2)                142                   67.8               .01          .99      2

Note. See text for definition of items. The expected frequencies are for the model of mutual independence, the "one-class" latent structure. The expected frequencies for the two-class model are the same as the observed frequencies. The columns labeled Pr(X = t) correspond to the conditional probabilities in Equation (6) (sample estimates) for the two-class model. X is the predicted latent distribution.
classes. The variables are strongly associated with each other; the Pearson chi-square statistic is X² = 213.28, and the likelihood-ratio chi-square statistic is L² = 186.72 on 4 df. We see that the independence model underpredicts the extremes in the table, that is, when mothers receive all types of aid or do not receive any type of aid. Note the "skew" in the observed joint distribution that might be important to take into account. In this case, a model using only two-way interactions (or pairwise correlations) would be sufficient to explain the association. The log-linear hypothesis that fits all two-way marginal tables gives X² = .97, L² = .97 with 1 df.² This standard model does not, however, lend itself easily to an analysis of reliability simply because it does not incorporate a well-specified relation of observed items to the unobserved variable that the items were designed to measure. Two other columns in this table give estimated prediction probabilities suited for the assessment of the reliability of the entire set of items; by using these, we obtain the predicted latent distribution for the two-class model given in the last column of the table.

The two-class latent structure is "saturated" with respect to any 2 × 2 × 2 contingency table. This means that the latent class model, with T = 2 latent classes, is consistent with any such distribution, or with any model or assumed distribution for the data. If the "true" variable measured by the items is in fact

²Adding just one variable to the set analyzed produces a data structure that cannot be described in terms of two-way marginals, so the sufficiency of pairwise correlations is contradicted by the data in this case once other items are included. We have chosen just three of the available items for expository purposes.
continuous, with known or unknown distribution, the two-class model still provides a summary of the measurement properties of the items. That is, there is no additional statistical information in the data that cannot be obtained from an assumption that the true variable is dichotomous rather than continuous. Because the two-class model is saturated, the chi-square values are identically zero and the model has zero degrees of freedom.

We examine reliability of the measurements A, B, and C using the operational definitions provided in the previous section. What is the total reliability of the measures? This question is answered by considering the conditional probabilities in Equation (6), which can be called the posterior distribution of X given the items. (This is item-set reliability in which all items are used to define the set.) The maximum likelihood estimates of these probabilities appear in the columns headed Pr(X = t) in Table 1. For members of cell (1, 1, 1), with observed frequency f_111 = 193, the model predicts that 97% are in latent class 1 and 3% are in latent class 2. That is, 193 × .97 = 187.2 is the estimated expected number of cases in this cell that belong to the first latent class, and 193 - 187.2 = 5.8 is the expected number in the second latent class. By using these quantities, we can next define the cell-specific reliability, for example, as the maximum of the two probabilities, or perhaps the odds of the two classes. Continuing in this fashion leads to an expected number correctly predicted equal to 597.3, or 87.6% of the total sample. In other words, the items as a set can be said to be almost 90% reliable. An alternative measure of this reliability is the lambda measure of association between X and the observed items, λ = .74. (See Clogg, 1995, for the definition of lambda in the latent class setting.) This measure is consistent with the assignment rule that created the predicted distribution in Table 1.

The parameter values for the model as expressed in Equation (2) appear in Table 2. We see that level 1 of each item (not receiving aid) is associated with level 1 of X and that level 2 of each item (receiving aid) is associated with level 2 of X, so that level 1 of X can be characterized as the latent receiving class and
TABLE 2 Estimated Parameter Values for the Two-Class Model Applied to the Data in Table 1, Including Measures of Reliability

    Item         π_{item|X=1}    π_{item|X=2}    O_{item,X}    Q_{item,X}    π_{X=1|item}    π_{X=2|item}
    A (i = 1)        .83             .26            13.6           .86            .75             .25
    A (i = 2)        .17             .74                                          .28             .72
    B (j = 1)        .83             .33             9.9           .82            .70             .30
    B (j = 2)        .17             .67                                          .41             .59
    C (k = 1)        .82             .19            19.6           .90            .80             .20
    C (k = 2)        .18             .81                                          .20             .80
    X                .48             .52
level 2 of X can be characterized as the latent not-receiving class. The estimated latent distribution is π_X(1) = .485, π_X(2) = .515.

The item level-specific measures of reliability, defined in terms of predictability of an item from X, are the estimated conditional probabilities reported in this table. That is, for item A, X = 1 predicts A = 1 with 83% reliability, and X = 2 predicts A = 2 with 74% reliability. Level 1 of X is an equally good predictor of level 1 of A and B (reliability is 83%), and level 2 of X best predicts level 2 of C (reliability is 81%). The item level-specific measures of reliability, defined in terms of prediction of X from the items, require the use of Equation (3). These measures are given in the last two columns of Table 2. Taking into account the marginal distributions of the items and the latent variable (see Eq. 3), we see that item C is the best predictor of X for each level, with item level-specific reliabilities of 80% for each level. Finally, the most reliable indicator for variable reliability is also item C, with an odds ratio of 19.6, corresponding to a Yule's Q of .90.

The temptation to produce an overall index of reliability from these measures is strong, and therefore we next propose overall indices that should prove to be meaningful. To measure the average item level-specific reliability, with reliability defined in terms of prediction of an item from X, we merely take the relevant average of the conditional probabilities π_{item|X}. For level 1 of the items, this average is .83, and for level 2 it is .74. This means that the X variable is a better predictor of level 1 of the items on average. To measure the average item level-specific reliability, with reliability defined in terms of prediction of X from an item, we merely take the relevant average of the conditional probabilities π_{X|item}. For level 1 of X, this average is .75, and for level 2 of X it is .70. Thus, with this concept of reliability, level 1 of X is more reliably predicted than level 2. Averages of the odds ratios (or their logarithms) as well as averages of the Q transforms of these can serve as average item-level reliabilities; the average Q is .86, for example.

Averages of the relevant prediction probabilities can be used to define average cell-specific reliability. For the prediction of level X = 1, for example, the average of the π_{X|ABC}(1) for cells where X = 1 is predicted is .76. For the prediction of level X = 2, the average of these posterior probabilities for cells where X = 2 is predicted is .89. Thus, cell-specific reliability as predictability of X is higher for the prediction of X = 2 than for the prediction of X = 1. The overall predictability indices given earlier (lambda and the percentage correctly predicted) summarize reliability of the prediction of the variable X.

As this analysis demonstrates, reliability can be viewed in many ways, as item level-specific, variable-specific, cell-specific, or item set-specific aspects of the same problem. Whether these various facets of reliability are viewed as satisfactory or not depends on the purpose of the analysis, the sample chosen, the specific indicators used, and so on. By carefully examining what the latent class model says about measurement, we see that a rich portrait of reliability
can be obtained. We hasten to add that this prototypical example, with the simplest relevant case in which latent class models can be used, indicates how the definitions of reliability put forward here can be used in general settings with many items, including polytomous items.

We conclude this section by examining whether item or item-level reliabilities differ across items. Restricted latent class models can be used for this purpose. Consider the model to which the following condition is applied:

    π_{A|X=1}(1) = π_{B|X=1}(1) = π_{C|X=1}(1).                                   (7)

This condition says that the item-level reliability (with X viewed as a predictor of the item) is constant across items for level 1 of the items. The model with this constraint produces L² = X² = .06 on 2 df, so this condition is consistent with the data. The common value of this reliability is, under the model, .83. Next, consider the model with the following condition applied:

    π_{A|X=2}(1) = π_{B|X=2}(1) = π_{C|X=2}(1).                                   (8)

This condition says that item-level reliability (with X viewed as a predictor of the item) is constant across items for level 2 of the items. The model with this constraint applied gives L² = 9.79, X² = 9.87 on 2 df. Thus this condition is not consistent with the data, so this type of item-level reliability is not constant across items. The model with both of the preceding constraints applied gives L² = X² = 14.61 on 4 df. Given these results, the main source of lack of fit of this model is nonconstant reliability in the prediction of the second (not-receiving) level of the items. Various hypotheses about variability in reliabilities can be considered using constraints of this general kind (see Clogg, 1995, and references cited there).
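A small sketch of how such restricted models can be checked in practice (Python; the helper names are our own illustration, not code from the chapter): expected frequencies are formed from fitted parameters through Equations (1) and (2), fit is summarized by L² = 2 Σ O ln(O/E), and a constraint such as Equation (7) is imposed in the M-step by pooling the three items' weighted level counts so that they share a single conditional probability.

    import math

    def expected_frequencies(n, p, cond, cells):
        # expected count for each cell under Equations (1)-(2): n * pi_ABC(ijk)
        return {c: n * sum(p[t] * cond[0][t][c[0]] * cond[1][t][c[1]] * cond[2][t][c[2]]
                           for t in range(len(p)))
                for c in cells}

    def l_squared(observed, expected):
        # L^2 = 2 * sum O ln(O / E); the Pearson X^2 would instead sum (O - E)^2 / E
        return 2 * sum(o * math.log(o / expected[c]) for c, o in observed.items() if o > 0)

    def pooled_level_probability(weighted_counts, class_total):
        # M-step update shared by items A, B, and C under a constraint like Equation (7):
        # pool the three items' weighted counts for the constrained level and divide by
        # three times the class total, then assign the common value to all three items
        return sum(weighted_counts) / (3 * class_total)

Because the unconstrained two-class model is saturated here (L² = 0), the L² of the constrained fit is itself the test statistic for the constraint, which is the comparison reported in the text.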
4. ASSESSMENT OF RELIABILITY BY GROUP OR BY TIME

Important checks on reliability of measurement can be made by examining group differences in parameters of the latent class model. The groups might represent observations (on different individuals) at two or more points in time, which permits one kind of temporal assessment of reliability common in educational testing, or the groups might represent gender, age, or other factors, as is common in cross-sectional surveys.

For the example used in the previous section, it is natural to consider two relevant social groups for which subsequent analyses using the measures are key predictors. The sample was divided into married and not married, because the provision of social support is expected to vary by this grouping. These two groups are perhaps most relevant for the consideration of differentials in support, and it is therefore natural to ask whether the items measure equally in the
two groups. The three-way table for each group corresponding to Table 1 appears in Table 3. We use this example to illustrate how multiple-group latent class models can be used to extend the study of reliability using virtually the same concepts of reliability as in the previous example. The predicted latent distribution under this model (two-class model for married and unmarried women) also appears in the table. Now, compare Table 1 and Table 3. Under the fitted model, we see that cell-specific predictions of X differ between the two groups for cell (2, 1, 1), with unmarried women in this cell predicted to be in the second latent class and married women in this cell predicted to be in the first latent class.

We can estimate the two-class model separately for each group, producing analyses that are exactly analogous to the single-group case in the previous section. After some exploratory fitting of models, we were led to select the model for which all conditional probabilities were constrained to be homogeneous across groups. The model with these constraints fits the data remarkably well, with L² = 6.71, X² = 6.70 on 6 df. The estimated latent distribution (π_X(t) for each group) was .56, .44 for married mothers, and .39, .61 for unmarried mothers. In other words, the two latent distributions are quite different, with married women much more likely to receive aid from parents than unmarried women (56% vs. 39%). The model with homogeneous (across-group) reliabilities for all levels and all items is consistent with the data, however, so that the only statistically relevant group difference is in the latent distributions. Such a finding is painfully difficult to obtain in many cases, but in this case we can say that the indicators measure similarly, and with equal reliability, in both of these relevant groups. The conditional probabilities in Table 4 are nearly the same as those reported earlier (Table 2) for the combined groups, and as a result, virtually the same
TABLE 3 Cross-Classification of Three Indicators of Support: Married Versus Unmarried Women

                       Married women           Unmarried women
    Cell (C, B, A)    Frequency     X          Frequency     X
    (1, 1, 1)             88        1              105       1
    (1, 1, 2)             16        1               40       1
    (1, 2, 1)             13        1               36       1
    (1, 2, 2)             12        2               29       2
    (2, 1, 1)             21        1               44       2
    (2, 1, 2)             22        2               57       2
    (2, 2, 1)             16        2               41       2
    (2, 2, 2)             42        2              100       2

Note. The estimated percentage correctly allocated into the predicted latent distribution over both groups is 86.3%; lambda = .77.
TABLE 4 Estimated Parameter Values for a Multiple-Group Model Applied to the Data in Table 3

    Item         π_{item|X=1}    π_{item|X=2}    O_{item,X}    Q_{item,X}
    A (i = 1)        .85             .28            14.4           .87
    A (i = 2)        .15             .72
    B (j = 1)        .85             .35            10.5           .83
    B (j = 2)        .15             .65
    C (k = 1)        .84             .21            19.4           .90
    C (k = 2)        .16             .78

Note. The quantities apply to both groups, that is, to both married and unmarried women, because the model constrained these parameter values to be homogeneous across the groups.
conclusions are reached about reliability values, regardless of definition. (For the analysis of item-level reliability viewed as predictability of X, the inferences are somewhat different because of the different latent distributions in the two groups and because of the different item marginals in the two groups.) We conclude that these items measure with equal reliability in the two groups, apart from some sampling fluctuation that is consistent with the model used. To save space, other reliability indices will not be reported.

The analyses throughout these examples were restricted to reliability assessment at one point in time and at one point in the life course: early motherhood. The recently released second wave of NSFH data will permit analyses of social support structures over time. For just two points in time, we can illustrate an approach that might be used as follows. Suppose that the initial measurements are denoted (A1, B1, C1) and that the second-wave measurements are denoted (A2, B2, C2). A natural model to consider in this case would posit a latent variable X1 for the first-wave measurements and a latent variable X2 for the second-wave measurements. The concepts, measures, and statistical methods that can be used for this case are virtually the same as those presented in this chapter. The ideal situation would be one in which the X1-(A1, B1, C1) reliabilities were high and equal to the X2-(A2, B2, C2) reliabilities. Standard methods summarized in Clogg (1995) can be used to operationalize such a model, and the reliability indices described in this chapter could be defined easily for this case. If such a measurement model were consistent with the data, then the main question would be how X1 differed from X2. For example, the change at the latent level could be attributed to developmental change. If more than two waves of measurement are available, then more dynamic models ought to be used (van de Pol & Langeheine, 1990). But even for the broader class of latent Markov models covered in van de Pol and Langeheine, the measures of reliability presented here can be used to advantage.
5. CONCLUSION

Model-based assessment of reliability is an important aspect of social measurement. For categorical measures, it would be inappropriate to assess reliability with correlation-based measures for the same reason that linear models are not generally appropriate for categorical data (Agresti, 1990). The nonparametric approach provided by the ordinary latent class model seems well suited for the measurement of reliability of categorical measures, and the parameters in this model provide the necessary components for several alternative measures or indices of reliability. For polytomous items, the same general procedures can be used, and for T-class models (with T > 2), level-specific reliability has to be defined carefully. To our knowledge, existing computer programs for latent class analysis (see Clogg, 1995) do not automatically provide indices of reliability of the sort provided here, with the exception of the overall summaries of prediction of X.

Reliability is always a relative concept, and its measurement should always take account of a standard for comparison suited for the particular measures under consideration. For the example analyzed previously, most researchers would say that the indices of reliability provide evidence that high or very high reliability has been obtained. For the group comparisons involving married and unmarried women, these inferences were strengthened because the items were equally reliable (in terms of most indices) for both groups. It will be important to examine other group comparisons that are relevant for the problem at hand (e.g., by racial group or over time) to bolster confidence in the measurements. It might also be important to examine whether similar measures have the same measurement properties when other social contexts are considered (such as the giving of aid, or receipt of aid from grandparents). These are all modeling questions, however, and not simply exercises in producing reliability indices.

We hope that the definitions of reliability put forward here and illustrated with simple examples will help to bridge the gap between reliability assessment of continuous variables and reliability assessment of categorical variables.
ACKNOWLEDGMENTS

This work was supported in part by Grant SBR-9310101 from the National Science Foundation (C. C. C.) and in part by the Population Research Institute, Pennsylvania State University, with core support from National Institute of Child Health and Human Development Grant 1-HD28263-01 and NIA Grant T32A000-208. The authors are indebted to Alexander von Eye for helpful comments.
REFERENCES

Agresti, A. (1990). Categorical data analysis. New York: Wiley.
Bartholomew, D. J., & Schuessler, K. F. (1991). Reliability of attitude scores based on a latent trait model. In P. V. Marsden (Ed.), Sociological methodology 1991 (Vol. 21, pp. 97-124). New York: Basil Blackwell.
Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. Beverly Hills, CA: Sage.
Clogg, C. C. (1995). Latent class models. In G. Arminger, C. C. Clogg, & M. E. Sobel (Eds.), Handbook of statistical modeling for the social and behavioral sciences (pp. 311-359). New York: Plenum.
Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215-231.
Hogan, D. P., Eggebeen, D. J., & Clogg, C. C. (1993). The structure of intergenerational exchanges in American families. American Journal of Sociology, 99, 1428-1458.
Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston: Houghton Mifflin.
Manning, W. D., & Hogan, D. P. (1993). Patterns of family formation and support networks. Paper presented at the annual meetings of the Gerontological Society of America, New Orleans, LA.
Schwartz, J. E. (1986). A general reliability model for categorical data, applied to Guttman scales and current status data. In N. B. Tuma (Ed.), Sociological methodology 1986 (pp. 79-119). San Francisco: Jossey-Bass.
Sweet, J., Bumpass, L., & Call, V. (1988). The design and content of the National Survey of Families and Households (NSFH Working Paper No. 1). Madison: University of Wisconsin.
van de Pol, F. J. R., & Langeheine, R. (1990). Mixed Markov latent class models. In C. C. Clogg (Ed.), Sociological methodology 1990 (pp. 213-247). Oxford: Basil Blackwell.
Partitioning Chi-Square: Something Old, Something New, Something Borrowed, but Nothing BLUE (Just ML)
David Rindskopf
City University of New York New York, New York
1. INTRODUCTION

Reading the typical statistics textbook gives the impression that the analysis of a two-way contingency table generally consists of (a) calculating a statistic that has a chi-square distribution if the rows and columns are independent, and (b) either rejecting or not rejecting the null hypothesis of independence. If the null hypothesis is rejected, the researcher is left to conclude that "there's something going on in my table, but I don't know what." On the other hand, the overall test of independence might not be rejected, even though some real effect is
being masked by many small effects. Furthermore, a researcher might have a theory that predicts the existence of specific relationships, much like preplanned comparisons in the analysis of variance. Although most researchers are acquainted with the idea of contrasts in the analysis of variance, they are unaware that similar techniques can be used with categorical data.

The simplest such technique, called partitioning chi-square, can be used to test a wide variety of preplanned and post hoc hypotheses about the structure of categorical data. The overall chi-square statistic is divided into components, each of which tests a specific hypothesis. Partitioning chi-square has been used at least since 1930 (Fisher, 1930) and appeared in quite a few books on categorical data analysis (e.g., Everitt, 1977; Maxwell, 1961; Reynolds, 1977) until recently. The failure of the technique to attain wide usage is probably the result of three factors: (a) partitioning the Pearson fit statistic, calculated automatically by most computer programs, involves complicated formulas that have no intuitive basis; (b) in many expositions of partitioning, the technique was applied mechanically, without considering comparisons that would be natural in the context of a particular piece of research; and (c) the development of log-linear models for dealing with complex contingency tables has drawn attention away from the simpler two-way table. (A pleasant exception to these generalizations is Wickens, 1989.)

The modern use of partitioning chi-square uses the likelihood-ratio statistic, which partitions exactly without the use of special formulas. It is simple to understand, easy to do and to interpret, and encourages the testing of focused hypotheses of substantive interest. (Those acquainted with log-linear models will recognize that the partitioning of chi-square discussed in this chapter is different from the partitioning to compare nested log-linear models.) This chapter first presents partitioning in the context of examples; these include partitioning in two-way tables, partitioning in multiway tables, and partitioning of the equiprobability model in assessing change. Then, general schemes for partitioning are described, and finally, potential weaknesses are discussed.
2. PARTITIONING INDEPENDENCE MODELS

2.1. Partitioning Two-Way Tables: Breast Cancer Data

A simple example using a two-way table will show how partitioning chi-square can allow researchers to address the questions they consider important, rather than being limited to the usual global hypothesis tests. Consider the cross-tabulation shown in Table 1, adapted from Goleman (1985), which shows how well breast cancer patients with various psychological attitudes survive 10 years after treatment.
TABLE 1 Ten-Year Survival of Breast Cancer in Patients with Various Psychological Attitudes

                      Response
    Attitude       Alive    Dead
    Denial             5       5
    Fighting           7       3
    Stoic              8      24
    Helpless           1       4

    LR = 7.95; P = 8.01

Note. Adapted from Goleman (1985). LR is the likelihood-ratio goodness-of-fit statistic; P is the Pearson goodness-of-fit statistic. Each test has 3 df.
Almost every researcher would know to do a test of independence for the data in this table; the familiar Pearson chi-square statistic is 8.01 with 3 df, and the likelihood-ratio (LR) statistic is 7.95. Those not familiar with the likelihood-ratio test can think of it as a measure similar to Pearson's for deciding whether or not the model of independence can be rejected. The likelihood-ratio statistic, like the more familiar Pearson statistic, compares the observed frequencies in a table with the corresponding frequencies that would be expected if a hypothesized model were true. In this case, the hypothesized model is one of independence of row and column classifications. The formula for the likelihood-ratio statistic is

    LR = 2 Σ_i O_i ln(O_i / E_i),

where a single subscript i is used to indicate individual cells in a table, and Σ_i indicates summation over all of the cells in the table. As an aid to intuition, imagine what would happen if a model fit perfectly so that O_i = E_i for each cell. Then, the ratio O_i / E_i = 1 for each cell; and because the logarithm of one is zero, it follows that LR = 0. For more information on the likelihood-ratio statistic, see Reynolds (1977), Everitt (1977), or Wickens (1989).

For the data in Table 1, p < .05 for both the Pearson and LR statistic, so there is a relationship between the two variables: attitude is related to survival. From the traditional point of view, that is that; there is nothing else to say. The issue of where the relationship lies is generally not considered. Suppose, however, that in this example the researchers had a theory that active responses to cancer, such as fighting and denial, are beneficial, compared with passive responses such as stoicism and helplessness. Furthermore, suppose that they were not sure whether patients with different active modes of response
186
David Rindskopf
differ in survival rate, or whether patients with different passive modes have different survival rates. The theory immediately suggests that instead of a single overall test of independence, three tests should be done. The first should test whether fighters and deniers differ; the second, whether stoics and helpless patients differ; and the third, whether the active responders differ from the passive responders. Each of these tests is displayed in Table 2, along with both Pearson and LR chi-square tests. Each of the three tests has 1 df, and the results are as hypothesized: fighters and deniers do not differ in survival rate, nor do stoics and helpless differ, but those with active modes of responding survive better than those with passive modes. Note that the likelihood-ratio statistics for the three tests of the specific hypotheses sum to the value of the test of the overall hypothesis of independence. That is, the overall chi-square has been partitioned into three components, each of which tests a specific hypothesis about comparisons among the groups. (The Pearson test statistics, although not partitioning exactly, are still valid tests of the same hypotheses.) Because the test of the overall hypothesis of independence was barely rejected at the .05 level, one could ask what would happen had it not been rejected? Using the usual method of analysis, this would have been the end, but partitioning chi-square according to preplanned comparisons makes the overall test irrelevant. Just as in analysis of variance, the overall test might not be sig-
TABLE2 Partition of Chi-Square for Attitude and Cancer Survival Data Response Attitude
Denial (D) Fighting (F) Stoic (S) Helpless (H) D+F S+H
Alive
Dead
5 5 7 3 LR = .84; P = .83 8 24 1 4 LR = .06; P = .06 12 8 8 28 LR = 7.05; P = 7.10
Note: LR is the likelihood-ratio goodness-of-fit statistic, and P is the Pearson goodness-of-fit statistic, for the 2 X 2 table that precedes them. Each test has 1 df.
9. Partitioning Chi-Square
187
nificant, and yet one or more specific contrasts might be significant. Testing specific rather than general hypotheses increases statistical power, which is especially advantageous with small sample sizes.
2.2. Partitioning Multiway Tables: Aspirin and Stroke Data The Canadian Cooperative Study Group (1978) studied the effectiveness of aspirin and sulfinpyrazone in prevention of stroke. The four-way cross-tabulation of sex, aspirin (yes/no), sulfinpyrazone (yes/no), and stroke (yes/no) is presented in Table 3. Because stroke is considered an outcome variable, the table is presented as if it were an 8 x 2 table, with the rows representing all combinations of sex, aspirin, and sulfinpyrazone. A test of independence in this table produces a LR chi-square of 15.77 with 7 df (p = .027), so there is a relationship between group and stroke. To pinpoint the nature of the relationship of sex, aspirin, and sulfinpyrazone to stroke, we partition the overall chi-square by successively testing each of the three effects. First, within Sex x Aspirin categories, the relationship between sulfinpyrazone and stroke is tested. As shown in part (a) of Table 4, there is no relationship between sulfinpyrazone and stroke for any combination of sex and aspirin, so sulfinpyrazone was ineffective in reducing stroke in this trial. Next, we collapse over sulfinpyrazone and test whether aspirin is related to stroke within each level of sex. The second section (b) of Table 4 shows that aspirin is related to stroke for males, but not for females; inspection of the frequencies shows that aspirin reduces the in-
TABLE3 Effectiveness of Aspirin and Sulfinpyrazone in Preventing Stroke: Observed Frequencies by Sex Stroke Sex
Aspirin
Sulfa
Yes
No
P(Stroke)
Male
No
No Yes No Yes No Yes No Yes
22 34 17 12 8 4 9 8
69 81 81 90 40 37 37 36
.24 .30 .17 .12 .17 .10 .20 .18
Yes Female
No Yes
Note: Adapted from Canadian Cooperative Study Group (1978).
188
David Rindskopf TABLE4 Partition of Chi-Square for Aspirin, Sulfinpyrazone, and Stroke Data (a) Test of independence of sulfa and stroke, within each of the four Sex x Aspirin groups Group
LR
df
p
Male, no aspirin Male, aspirin Female, no aspirin Female, aspirin
.75 1.26 .92 .03
1 1 1 1
.39 .26 .34 .87
Sum
2.96
4
(b) Test of independence of aspirin and stroke, separately for each sex LR
df
p
Male Female
10.01 .97
1 1
.002 .33
Sum
10.98
2
Sex
(c) Test of independence of sex and stroke LR 1.82
df 1
.18
Note. LR is the likelihood-ratio goodness-of-fit statistic.
cidence of stroke in males. Last we collapse over levels of aspirin, and as shown in the third section (c) of Table 4, we find that sex is independent of stroke. Another partitioning of chi-square for these data gives additional insight. First, split the original table into two subtables; the first subtable consisting of the first two rows (males who did not take aspirin), and the second subtable consisting of the last six rows. For the first subtable, the model of independence is not rejected (LR = .75, df = 1); similarly, in the second subtable, independence is not rejected (LR = 3.39, df = 5). Finally, we test the table consisting of the row totals of each of the two subtables to see whether the incidence of stroke is different for males who do not take aspirin than for the remaining groups (all females, and males who took aspirin). Independence is rejected (LR = 11.63, df = 1); examining the proportions of who had a stroke, we see that males without aspirin had a higher rate of stroke than females or males who took aspirin. The sum of the LR statistics for these three tests of independence is 15.77, and the sum of the degrees of freedom is 7, which corresponds to the test of independence in the 8 x 2 table.
9. Partitioning Chi-Square
189
This example shows two ways in which partitioning chi-square can be used in multiway contingency tables: (a) a nested partition that follows the explicit design of the study, and (b) as special comparisons to establish homogeneity of sets of subgroups of people. Without partitioning, one typical approach would be to do the overall test of independence in the 8 X 2 table as was shown previously for the partitioning. But this approach merely shows that there is a relationship between group and stroke. Another approach would be to use log-linear or logit models on the four-way table. This method would show that sulfinpyrazone was ineffective and that there was an interaction of sex and aspirin effects on stroke (in the usual analysis of variance terms). However, the nature of the interaction would not be illuminated by the usual log-linear or logit model approach; with partitioning chi-square, we can describe the nature of the interaction.
2.3. Illuminating Interactions in Hierarchical Log-Linear Models: U.C. Berkeley Admissions Data The data in Table 5 are from the University of California at Berkeley, which was investigating possible sex bias in admissions to graduate school. Data from six major areas were presented and discussed in Freedman, Pisani, and Purves (1978), and so the general results are well known. For the purposes of this chapter, they illustrate the shortcomings of log-linear models (as they are usually applied) and the advantage of partitioning chi-square.
TABLE5 GraduateAdmissions Data from University of California, Berkeley Major area
Gender
A
M F
B
C D E F Note:
M
F M F M F M F M F
% Admitted
62 82 63 68 37 34 33 35 28 24 6 7
Adapted from Freedman, Pisani, and Purves (1978).
190
David Rindskopf TABLE 6 Test of Independence of Sex and Admission by Major Area for Berkeley Admissions Data Major
LR
df
p
A B C D E F
19.26 .26 .75 .30 .99 .03
1 1 1 1 1 1
.0001 .61 .39 .59 .32 .87
Sum
21.59
6
.0014
Note. LR is the likelihood-ratio goodness-of-fit statistic.
Table 5 presents the percentage of students admitted, by sex and major area. A log-linear model of the Major x Sex x Admission table shows that no unsaturated model fits the data. In other words, there is a Sex x Admissions relationship that is not the same for each major, so the description of the relationships among variables is not simplified by the log-linear approach. According to most sources about log-linear models, there is nothing more to do. But a simple visual examination of the percentage of students admitted indicates that most likely Major Area A is the only one in which males differ from females. The partitioning in Table 6 confirms this: Sex and admission are independent in Major Areas B through F, but not in Major Area A, in which males are admitted at a lower rate than females. The sum of the LR statistics for testing independence within each major area is the LR statistic for testing the log-linear model [MS] [MA]; that is, sex is independent of admissions, given major area. The partitioning locates the reason for failure of this model to fit the data.
3. ANALYZING CHANGE AND STABILITY 3.1. Hypothesis Tests for One Group Partitioning chi-square is also useful in the analysis of change and stability of responses over time. As an example, we use data reanalyzed by Marascuilo and Serlin (1979). In this study, students in the ninth grade were asked whether they agreed with the statement "The most important qualities of a husband are determination and ambition." The question was asked of them again when they were in the 12th grade. Table 7 shows the cross-tabulation of Time 1 by Time 2 responses for whites and blacks combined into one group.
9. Partitioning Chi-Square
191
TABLE 7 Cross-Tabulation for a Group of Teenagers of Their Opinion at Two Times Time 2 Time 1
Disagree
Agree
Agree Disagree
238 422
315 142
Note: Teenagers were asked whether they agree that "the most important qualities of a husband are determination and ambition." Adapted from Marascuilo and Serlin (1979).
The usual analysis of such a table is McNemar's test, which is a test of whether the frequency in cell 10 is the same as in cell 01; that is, whether change is equal in both directions (equivalently, it is a test of marginal homogeneity). But that is only one hypothesis that might be tested; there are two others that, combined with the first, will partition the chi-square test of equiprobability in the table. The three tests are displayed in Table 8.
TABLE 8 Partition of Equiprobability Model for Change Data Frequencies 10
O1
238
142
00
11
422
315
10 + 01
00 + 11
380
737 Total
LR
df
24.517
1
15.590
1
116.126
1
156.233
3
Note: In the context of the Marascuilo and Serlin data set, 0 means "disagree," 1 means "agree"; in each pair, the first number is the response at Time l, the second number is the response at Time 2. LR is the likelihood-ratio goodness-of-fit statistic.
192
OavidRindskopf
The first test is equivalent to McNemar's test, except that the likelihood-ratio statistic is used rather than the Pearson statistic. The test is significant, indicating change across time; in this case, fewer people agreed with the statement at Time 2. The second test compares the two categories of people who did not change, and answers the question of whether there were more stable people who agreed with the statement than who disagreed with it. The test statistic indicates that the null hypothesis of equality is rejected; more of the teenagers with stable responses disagreed with the statement than agreed with it. Finally, the data are collapsed and we test whether there is the same number of changers as stable responders. This hypothesis is also rejected; more people responded the same at both times than changed. The sum of the fit statistics for testing these three hypothesis (and their degrees of freedom) is equal to the statistic for testing equiprobability in the original four-cell table: 156.233 with 3 df
3.2. Hypothesis Tests for Multiple Groups With multiple groups, partitioning chi-square can be extended in an obvious way. Consider the full data, displayed in Table 9, for members of five ethnic groups answering the question about important qualities of a husband at two time points. The data form a 5 X 2 x 2 table, but the table can also be interpreted as a 5 (Ethnic Group) X 4 (Response Pattern) table, which better suits the purposes here. Testing for independence in the 5 X 4 table results in an LR chi-square of 98.732 with 12 df showing that ethnic group is related to response pattern. The partition of chi-square takes place for both rows (ethnic group) and columns (response pattern) of the 5 x 4 table. The partition of columns follows the scheme discussed in the previous section: compare the two response patterns in which change occurred; compare the two response patterns in which no
TABLE 9 Cross-Tabulation for a Group of Teenagers of Their Opinion at Two Times by Ethnic Group

                        Time 1:   1      1      0      0
                        Time 2:   1      0      1      0
Whites                           243    208    112    381
Asians                            60     50     22     68
Blacks                            72     30     30     41
Hispanics                         62     29     19     25
Native Americans                  86     28     47     39

Note: Teenagers were asked whether they agree that "the most important qualities of a husband are determination and ambition." Adapted from Marascuilo and Serlin (1979).
The partition of rows could be done in an a priori manner if hypotheses had been made about similarities and differences among ethnic groups; however, in this case, the partitioning of ethnic groups was done in a post hoc manner. Because of the post hoc nature of these tests, it might be appropriate to consider a way to control for the Type I error level, which is discussed in detail later. One possibility is an analogy to the Scheffe test in analysis of variance: use the critical value for the overall test of independence when evaluating the follow-up tests. Table 10 shows the partition of the independence chi-square for those who changed. The overall test of independence produces an LR chi-square of 24.239 with 4 df. We first compare Whites to Asians; they do not differ (LR = .523, df = 1). Next, we compare Blacks and Hispanics, who do not differ (LR = 1.171, df = 1). We then combine the Blacks and Hispanics and compare
TABLE 10 Partition of Chi-Square Test of Independence for People Who Changed from Time 1 to Time 2

                                                  10     01    P(10)      LR      df
(a) All ethnic groups (overall test of independence)
  Whites                                         208    112     .65
  Asians                                          50     22     .69
  Blacks                                          30     30     .50
  Hispanics                                       29     19     .60
  Native Americans                                28     47     .37     24.239     4
(b) Whites vs. Asians
  Whites                                         208    112     .65
  Asians                                          50     22     .69       .523     1
(c) Blacks vs. Hispanics
  Blacks                                          30     30     .50
  Hispanics                                       29     19     .60      1.171     1
(d) Blacks (B) and Hispanics (H) vs. Native Americans (NA)
  B + H                                           59     49     .55
  NA                                              28     47     .37      5.351     1
(e) Whites (W) and Asians (A) vs. Blacks (B), Hispanics (H), and Native Americans (NA)
  W + A                                          258    134     .66
  B + H + NA                                      87     96     .48     17.193     1

Note: 10 means a change from 1 at Time 1 to 0 at Time 2; 01 means the opposite; P(10) is the proportion of changers in that row who changed from 1 to 0; LR is the likelihood-ratio goodness-of-fit statistic; df is degrees of freedom.
them with Native Americans; there is a difference here (LR = 5.351, df = 1) if the usual critical value of chi-square with 1 df (3.84 at the .05 level of significance) is used, but not if the critical value with 4 df (9.49), as in a Scheffe-like test, is used. Finally, we compare the Whites and Asians with the Blacks, Hispanics, and Native Americans and find that there is a difference (LR = 17.193, df = 1): The Whites and Asians were more likely to change toward disagreement than the other groups. The sum of the four statistics, each of which has 1 df, equals the statistic for testing independence in the 5 × 2 table, showing that the chi-square statistic for the overall test has indeed been partitioned. In Table 11, we present the partitioning for the people who did not change from Time 1 to Time 2. The test of independence of ethnic group and response is significant: LR = 73.357 with 4 df. The second panel of Table 11 shows the test of independence for Blacks, Hispanics, and Native Americans; because the LR = 1.389 with 2 df, these three groups do not differ.
TABLE 11 Partition of Chi-Square Test of Independence for People Who Did Not Change from Time 1 to Time 2

                                                  11     00    P(11)      LR      df
(a) All ethnic groups (overall test of independence)
  Whites                                         243    381     .39
  Asians                                          60     68     .47
  Blacks                                          72     41     .64
  Hispanics                                       62     25     .71
  Native Americans                                86     39     .69     73.357     4
(b) Blacks vs. Hispanics vs. Native Americans
  Blacks                                          72     41     .64
  Hispanics                                       62     25     .71
  Native Americans                                86     39     .69      1.389     2
  (Sum)                                          220    105
(c) Whites vs. Asians
  Whites                                         243    381     .39
  Asians                                          60     68     .47      2.747     1
  (Sum)                                          303    449
(d) Blacks (B), Hispanics (H), and Native Americans (NA) vs. Whites (W) and Asians (A)
  B + H + NA                                     220    105     .68
  W + A                                          303    449     .40     69.221     1

Note: 11 indicates individuals who agreed with the statement at both times; 00 indicates individuals who disagreed with the statement at both times. LR is the likelihood-ratio goodness-of-fit statistic.
The third panel compares Whites and Asians; LR = 2.747 with 1 df, so Whites and Asians do not differ. Finally, the fourth panel compares the combined Blacks, Hispanics, and Native Americans to the combined White and Asian students: the LR = 69.221 with 1 df. This contrast obviously explains all of the group differences: Blacks, Hispanics, and Native Americans who gave the same response at both times tended to agree at both times, whereas Whites and Asians were more likely to disagree at both times. (Again, the sum of the LR chi-squares and degrees of freedom add up to the total LR chi-square and degrees of freedom testing independence of ethnic group and response among nonchangers.) Finally, we can compare the changers with the nonchangers to see whether ethnic groups differ in the proportion who are changers. The LR statistic for testing independence is 1.136 with 4 df; the ethnic groups do not differ in the ratio of changers to nonchangers. As a check on the overall calculations, the LR chi-squares can be summed over the last three tables to get 73.357 + 24.239 + 1.136 = 98.732, which is the LR for testing independence of ethnic group and response in the original 5 × 4 table. Summing the degrees of freedom likewise gives 4 + 4 + 4 = 12.
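The bookkeeping behind these checks is easy to automate. The sketch below (Python, standard library only; the function name is only illustrative) computes the likelihood-ratio test of independence for the full 5 × 4 table of Table 9 and for the three collapsed subtables just described, so the additivity of the partition can be verified directly.

```python
# Verify that the three collapsed tables partition the overall 5 x 4 LR statistic.
# Counts are those of Table 9 (columns in the order 11, 10, 01, 00).
from math import log

table9 = {
    "Whites":           [243, 208, 112, 381],
    "Asians":           [ 60,  50,  22,  68],
    "Blacks":           [ 72,  30,  30,  41],
    "Hispanics":        [ 62,  29,  19,  25],
    "Native Americans": [ 86,  28,  47,  39],
}

def lr_independence(rows):
    """Likelihood-ratio (G-squared) test of independence for a list of row vectors."""
    row_tot = [sum(r) for r in rows]
    col_tot = [sum(c) for c in zip(*rows)]
    n = sum(row_tot)
    g2 = 0.0
    for i, row in enumerate(rows):
        for j, obs in enumerate(row):
            if obs > 0:
                g2 += obs * log(obs * n / (row_tot[i] * col_tot[j]))
    return 2 * g2

rows = list(table9.values())
overall     = lr_independence(rows)                                        # 12 df
changers    = lr_independence([[r[1], r[2]] for r in rows])                 # 4 df
stayers     = lr_independence([[r[0], r[3]] for r in rows])                 # 4 df
chg_vs_stay = lr_independence([[r[1] + r[2], r[0] + r[3]] for r in rows])   # 4 df

print(f"overall 5 x 4 test:        {overall:.3f}")
print(f"sum of the three subtables: {changers + stayers + chg_vs_stay:.3f}")
```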
4. HOW TO PARTITION CHI-SQUARE

A complete partition of chi-square will, at the last stage, result in the analysis of as many 2 × 2 tables as there are degrees of freedom for testing independence (or other statistic that is partitioned) in the original table. For a 5 × 4 table, for example, there are 4 × 3 = 12 degrees of freedom for testing independence, and as many one degree of freedom tests in a complete partition. In many cases, one need not do a complete partition, because only those hypotheses that make theoretical sense will be tested. There are well-known systematic ways for partitioning chi-square in the literature, but most do not take into account the nature of the investigator's hypotheses about the structure of the data. Two approaches are available that will reflect specific research hypotheses, and in the end, they both produce the same partition, so the choice between them will be based on which one feels most comfortable for the analyst. The two methods are called joining and splitting to reflect what happens to the rows or columns of the original table as the method is implemented.
4.1. Joining

Joining is illustrated schematically in Figure 1. The procedure will be illustrated for rows of the table; following (or instead of) this, partitioning of columns can also be performed.
FIGURE 1. The joining technique for partitioning chi-square.

First, extract any two rows of the table (that are thought to have the same row proportions) and test the table consisting of these two rows for independence. Next, replace the two rows in the original table with one row consisting of their sum. Repeat this procedure until only one row remains. For each (two-row) subtable extracted, perform the same procedure on the columns to obtain a complete partition of chi-square. During the process of combining rows or sets of rows, one or more statistical tests may indicate lack of independence. The rows (or sets of rows) can still be combined, although the comparisons that follow will involve a nonhomogeneous set of people; that is, groups that differ in their row proportions will have been combined. Such a row will then represent a (weighted) average of the groups representing the constituent rows. Whether the use of that average provides a meaningful comparison with the remaining groups will usually be determined within the context of a particular data set; if not, the rows should not be combined and the partitioning will not be reduced completely to one degree of freedom comparisons. If a point is reached where all remaining rows (and columns) differ significantly, then the analyst has found the important parts of the data structure; in such a case, a reduction to one degree of freedom contrasts would not add useful information. The joining procedure has two common patterns, each of which corresponds to a common coding method for contrasts in analysis of variance; these are il-
FIGURE 2. Two joining techniques: "piling on" and nesting.
lustrated in Figure 2. The first might be called "piling on," which corresponds to Helmert contrasts. In this procedure, two groups are compared and combined; these are compared to and then combined with a third, and so on. In the figure, rows 1 and 2 are extracted and tested for independence first. Then, rows 1 and 2 are summed and compared with row 3. Finally, the sum of rows 1, 2, and 3 is compared with row 4. An application of this method might involve the comparison of three groups administered drugs and a fourth group given a placebo. Suppose that two of the drugs were closely related chemically; call them Drug 1A and Drug 1B, and that the third drug, called Drug 2, was not related to either 1A or 1B. One sensible analysis would first compare Drug 1A to Drug 1B, then combine them to compare drugs of type 1 to Drug 2, and finally combine all subjects given drugs to compare them with subjects not given drugs. The second common joining method is used for a nested or hierarchical structure in which the rows are divided into sets of pairs that are compared and combined. The resulting combined rows are then paired, compared, and combined, and so on. This is illustrated in the bottom part of Figure 2. First, rows 1 and 2 are compared and combined, as are rows 3 and 4, then 5 and 6, and finally 7
FIGURE 3. The splitting technique for partitioning chi-square.

and 8. Next, row (1 + 2) is compared with row (3 + 4), and row (5 + 6) is compared with (7 + 8), where the notation refers to sums of rows in the original table. Finally, row (1 + 2 + 3 + 4) is compared with row (5 + 6 + 7 + 8). This method was used for the first partitioning analysis of the data on aspirin and stroke in which rows were successively collapsed over levels of the variables sulfinpyrazone, aspirin, and sex. Combinations of these methods might also be used (as was done for the data on attitudes toward the role of husbands for five ethnic groups in a previous section).
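As an illustration of joining in code, the following sketch (Python, standard library only; the helper names are illustrative) re-creates the row partition of Table 10: two pairs of groups are compared, the rows are joined, and the joined rows are then compared, exactly as described earlier.

```python
# Re-create the joining partition of the changers (Table 10).
# Rows are [count of 10, count of 01] taken from Table 9.
from math import log

def lr_independence(rows):
    row_tot = [sum(r) for r in rows]
    col_tot = [sum(c) for c in zip(*rows)]
    n = sum(row_tot)
    return 2 * sum(obs * log(obs * n / (row_tot[i] * col_tot[j]))
                   for i, row in enumerate(rows)
                   for j, obs in enumerate(row) if obs > 0)

def joined(*rows):
    """Replace a set of rows by their element-wise sum (the 'joining' step)."""
    return [sum(col) for col in zip(*rows)]

whites, asians = [208, 112], [50, 22]
blacks, hispanics, native_am = [30, 30], [29, 19], [28, 47]

steps = [
    ("Whites vs. Asians",                       [whites, asians]),
    ("Blacks vs. Hispanics",                    [blacks, hispanics]),
    ("Blacks+Hispanics vs. Native Americans",   [joined(blacks, hispanics), native_am]),
    ("Whites+Asians vs. the other three groups",
     [joined(whites, asians), joined(blacks, hispanics, native_am)]),
]

total = 0.0
for name, rows in steps:
    lr = lr_independence(rows)
    total += lr
    print(f"{name}: LR = {lr:.3f}, df = 1")
print(f"sum of 1-df statistics = {total:.3f}   (overall 4-df test: 24.239)")
```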
4.2. Splitting

The splitting algorithm, illustrated in Figure 3, begins by dividing the original table into two subtables. In addition, a table is created of the row sums of each of the two subtables. The chi-square for the original table is now partitioned into three parts, one for each of the subtables and one for the table of row totals. The test in each subtable tests homogeneity of the rows within that subtable, and the test of the row totals tests the difference between the subtables. This procedure may be repeated for each of the two original subtables and, finally, columns may be split. The splitting method was used in the analysis of data on attitude and cancer survival in which the original table was split into two parts, within each of which homogeneity held. The remaining hypothesis test was of the sum of the rows in the two subtables created in the first step,
showing that the first two rows differed from the second two rows. This method was also used for the second partitioning of the aspirin and stroke data. Of course, splitting may be used on the rows and joining on the columns, or vice versa.
5. DISCUSSION

5.1. Advantages

Partitioning chi-square has several obvious advantages. First and foremost, it allows researchers to test specific hypotheses of interest rather than more general null hypotheses. One could say that it is a context-sensitive statistical technique, because it can be implemented in a way that reflects content area concerns and should not be done mechanically. The emphasis on testing specific hypotheses can add statistical power, because in some cases an overall test may not be significant, hiding one or more significant effects. The first example, of cancer survival data, came close to fitting this description: the overall test of independence was barely rejected even though there was a large difference detected in the partitioning. Second, the technique of partitioning is as close to foolproof as can be; no great knowledge of mathematics or statistics is necessary. Finally, partitioning can be done by anyone with access to a statistical analysis program, as every general statistical program will produce a chi-square statistic for a contingency table. Some programs may provide Pearson rather than likelihood-ratio statistics, but this is no problem; the total chi-square will not partition exactly, but all of the hypothesis tests are still valid.
5.2. Cautions and Problems

5.2.1. Power

One area for caution is that the statistical power is not the same for all hypotheses being tested. As rows (or columns) are collapsed, the sample size (and power) for a test gets larger; as tables are split, the sample size gets smaller. Effects of the same size might not be detected as significant in a small part of the table, but could be as larger segments of the table get involved in a comparison. Researchers may want to calculate effect size estimates, including standard measures of association, in various subtables for exploratory purposes.

5.2.2. Post hoc tests

A second area for caution that was mentioned earlier is that in doing post hoc tests, the researcher might want to control for the level of Type I error. (I do not consider a priori hypothesis tests to be problematic, but others might.) The simplest technique would be to use, in an analogy with
the Scheffe method in analysis of variance, an adjusted critical value. For partitioning chi-square, this would be the critical value for the overall test in the full table. For a partition of independence in an R × C table, the degrees of freedom are (R - 1)(C - 1), which would determine the critical value for all hypothesis tests. Like the Scheffe test, the use of this critical value would guarantee that if the overall test is not significant, no post hoc test would be significant either. This procedure is very conservative, as is the Scheffe test. If only rows or only columns are partitioned, then the degrees of freedom for the critical value can be chosen consistent with that partitioning (see Marascuilo & Levin, 1983, for a readable discussion of the use of the Scheffe procedure with categorical data). The Bonferroni procedure is widely used in other contexts to control the overall Type I error level and can be less conservative than the Scheffe test if only a small number of hypotheses are tested. To implement the Bonferroni procedure as a post hoc test, one must know the total number of hypotheses that might have been tested. Then calculate a new alpha value, dividing the desired overall alpha level (usually .05) by the potential number of hypotheses tested, and reject the null hypothesis only if the outcome would be rejected at the adjusted alpha level. All of these methods are discussed by Santner and Duffy (1989), based on research by Goodman (1964, 1965; see also Goodman, 1969). Informal techniques for judging significance of partitioned tests are also available. A complete partition of chi-square will ultimately produce (R - 1)(C - 1) hypothesis tests, each with 1 df. The square root of a chi-square statistic with 1 df is a standard normal deviate. A half-normal plot of these deviates, as illustrated in Fienberg (1980), may help in judging which elements of the partition are likely to represent real effects and which are not. (Those familiar with factor analysis will see a similarity to using a scree plot of eigenvalues to judge the number of factors in a set of variables.)
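A hedged sketch of these two adjustments follows (Python, with SciPy assumed only for the chi-square quantile function). It computes the Scheffe-like and Bonferroni critical values for the 5 × 4 example, along with the 4-df value used earlier when only the rows of the changers' subtable were partitioned.

```python
# Adjusted critical values for post hoc partitioned tests.
from scipy.stats import chi2

R, C, alpha = 5, 4, 0.05
df_total = (R - 1) * (C - 1)          # 12 df for the full 5 x 4 table

# Scheffe-like rule: judge every 1-df follow-up test against the critical
# value of the overall (R-1)(C-1)-df test.
scheffe_crit = chi2.ppf(1 - alpha, df_total)

# Bonferroni: divide alpha by the number of hypotheses that might be tested
# (taken here as the (R-1)(C-1) one-df contrasts of a complete partition).
bonferroni_crit = chi2.ppf(1 - alpha / df_total, 1)

print(f"usual 1-df critical value:        {chi2.ppf(1 - alpha, 1):.2f}")   # 3.84
print(f"rows-only (4-df) critical value:  {chi2.ppf(1 - alpha, 4):.2f}")   # 9.49, as in the text
print(f"Scheffe-like (12-df) value:       {scheffe_crit:.2f}")
print(f"Bonferroni-adjusted 1-df value:   {bonferroni_crit:.2f}")
```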
5.3. Relationship to Nonstandard Log-Linear Models

Researchers acquainted with advanced statistical methods will realize that partitioning is a simple-minded way of doing what could also be accomplished within a generalized linear model framework. The hypotheses tested here do not fit in the framework of the usual hierarchical log-linear models, but are what have been called nonstandard log-linear models (Rindskopf, 1990). Nonstandard log-linear models can be used to test every kind of hypothesis that can be tested using partitioning, and more. Why, then, should partitioning of chi-square be used? Because partitioning is simple, any researcher who knows how to test independence in contingency tables can do a partitioning of chi-square. The hypotheses being tested are obvious, and inspection of row or column proportions will give the researcher a good idea about whether the hypothesis tests have
been done correctly. Using nonstandard log-linear models, on the other hand, involves setting up a model matrix by coding variables to test the desired hypotheses. This is sometimes tricky, and even experienced researchers cannot always correctly interpret parameters when variables have been coded in nonstandard ways. Comparing the power of different contrasts is simple when using nonstandard models: The standard errors will vary with the nature of the comparison; larger standard errors indicate lower power.

In conclusion, partitioning chi-square is a simple technique that allows researchers to test hypotheses specified by their theories. Partitioning can be done by anyone with access to a general statistics package, and because of its simplicity it is more difficult to misuse than nonstandard log-linear models. Caution is needed because of possible increases in Type I error rates when post hoc tests are conducted, and because different stages of a partitioning can have different powers. Partitioning has been unjustly neglected in the recent literature because most earlier expositions either used complex formulas to make the Pearson statistic additive or used mechanical partitioning schemes that did not reflect the important scientific hypotheses researchers wished to test.
ACKNOWLEDGMENTS

The author thanks Howard Ehrlichman, Bengt Muthen, Laurie Hopp Rindskopf, Alex von Eye, David Andrich, and Bob Newcomb, Kim Romney, and their colleagues and students at the University of California, Irvine, for helpful comments and suggestions.
REFERENCES

Canadian Cooperative Study Group. (1978). A randomized trial of aspirin and sulfinpyrazone in threatened stroke. New England Journal of Medicine, 299, 53-59.
Everitt, B. S. (1977). The analysis of contingency tables. London: Chapman & Hall.
Fienberg, S. E. (1980). The analysis of cross-classified categorical data. Cambridge, MA: MIT Press.
Fisher, R. A. (1930). Statistical methods for research workers (3rd ed.). Edinburgh: Oliver & Boyd.
Freedman, D., Pisani, R., & Purves, R. (1978). Statistics. New York: Norton.
Goleman, D. (1985, October 22). Strong emotional response to disease may bolster patient's immune system. The New York Times, pp. C1, C3.
Goodman, L. A. (1964). Simultaneous confidence limits for cross-product ratios in contingency tables. Journal of the Royal Statistical Society, Series B, 26, 86-102.
Goodman, L. A. (1965). On simultaneous confidence intervals for multinomial proportions. Technometrics, 7, 247-254.
Goodman, L. A. (1969). How to ransack social mobility tables and other kinds of cross-classification tables. American Journal of Sociology, 75, 1-40.
Marascuilo, L. A., & Levin, J. R. (1983). Multivariate statistics in the social sciences: A researcher's guide. Monterey, CA: Brooks/Cole.
Marascuilo, L. A., & Serlin, R. C. (1979). Tests and contrasts for comparing change parameters for a multiple sample McNemar data model. British Journal of Mathematical and Statistical Psychology, 32, 105-112.
Maxwell, A. E. (1961). Analysing qualitative data. London: Methuen.
Reynolds, H. T. (1977). The analysis of cross-classifications. New York: Free Press.
Rindskopf, D. (1990). Nonstandard loglinear models. Psychological Bulletin, 108, 150-162.
Santner, T. J., & Duffy, D. E. (1989). The statistical analysis of discrete data. New York: Springer-Verlag.
Wickens, T. D. (1989). Multiway contingency tables analysis for the social sciences. Hillsdale, NJ: Erlbaum.
Nonstandard Log-Linear Models for Measuring Change in Categorical Variables
Alexander von Eye
Michigan State University
East Lansing, Michigan
Christiane Spiel
University of Vienna Vienna, Austria
1. INTRODUCTION

Many statistical tests are special cases of more general statistical models. When teaching statistics, it is often seen as a didactical plus if tests are introduced via both the "classical" formulas and within the framework of statistical models. This chapter proposes recasting statistical tests of axial symmetry and quasi-symmetry in terms of nonstandard log-linear models. First, three equivalent forms of the well-known Bowker test are presented. The first form is the test statistic originally proposed by Bowker (1948); the other two are log-linear
models. Second, quasi-symmetry is recast in terms of nonstandard log-linear models.
2. BOWKER'S TEST

Known since 1948, Bowker's test allows researchers to assess axial symmetry in a square cross-tabulation. The test was originally proposed as a generalization of McNemar's (1947) chi-square test, which assesses axial symmetry in 2 × 2 tables. The axial symmetry concept implies that for cell frequencies $F_{ij}^{AB}$ in a square cross-tabulation,

$F_{ij}^{AB} = F_{ji}^{AB}$, for $i > j$,    (1)
holds, where superscript A denotes the row variable and B denotes the column variable. If changes from one category to another are symmetric, the marginal distributions stay the same. Typically, researchers apply McNemar's and Bowker's tests when a categorical variable is observed twice (e.g., Sands, Terry, & Meredith, 1989). The null hypothesis states that changes from one category to another are random in nature. Textbooks illustrate the tests using examples from many areas, such as from pharmacology, when subjects are asked twice about the effects of a drug, or when effects of drugs are compared with effects of placebos in randomized repeated measures designs (cf. Bortz, Lienert, & Boehnke, 1990). In developmental research, for example, the Bowker test is used in research concerning change and stability of intellectual functioning (Sands et al., 1989). For the following description of the McNemar and the Bowker tests, consider a categorical variable with k categories (k ≥ 2). This variable is observed twice on the same individuals. The following test statistic, known as the McNemar test if k = 2, and as the Bowker test if k > 2, is approximately distributed as $X^2$ with $\binom{k}{2}$ degrees of freedom:

$X^2 = \sum_{i=1}^{k} \sum_{j=1}^{k} \frac{(f_{ij} - f_{ji})^2}{f_{ij} + f_{ji}}$, for $i > j$,    (2)
where $i, j = 1, \ldots, k$, and $f_{ij}$ is the observed frequency in cell ij. The following example (adapted from Bortz et al., 1990) illustrates the application of Equation (2). A sample of N = 100 subjects took vitamin pills over the course of 2 weeks. Twice during the experiment, subjects indicated their well-being by using the three categories positive, so-so, and negative. Table 1 displays the 3 × 3 cross-tabulation of the two observations of the well-being variable.
TABLE 1 Cross-Tabulation of Two Reports About Effects of Vitamin Pill and Placebo

                                   Responses at second observation
Responses at first observation   Positive   So-so   Negative    Sums
Positive                            14         7        9         30
So-so                                5        26       19         50
Negative                             1         7       12         20
Sums                                20        40       40      N = 100
Inserting the frequencies from Table 1 into Equation (2) yields

$X^2 = \frac{(5 - 7)^2}{5 + 7} + \frac{(1 - 9)^2}{1 + 9} + \frac{(7 - 19)^2}{7 + 19} = 12.27.$

This value has, for $df = \binom{3}{2} = 3$, a tail probability of p = .0065. Thus, the null hypothesis of axial symmetry must be rejected.
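For readers who prefer to verify Equation (2) computationally, a small sketch follows (Python, with SciPy assumed only for the tail probability; the function name is illustrative), applied to the frequencies of Table 1.

```python
# Bowker/McNemar test of axial symmetry for a k x k table (Equation (2)).
from math import comb
from scipy.stats import chi2

def bowker(table):
    """Return (statistic, df, tail probability) for a square table of counts."""
    k = len(table)
    stat = sum((table[i][j] - table[j][i]) ** 2 / (table[i][j] + table[j][i])
               for i in range(k) for j in range(i)
               if table[i][j] + table[j][i] > 0)
    df = comb(k, 2)
    return stat, df, chi2.sf(stat, df)

# Table 1: rows = first observation, columns = second observation
# (order: positive, so-so, negative).
table1 = [[14, 7, 9],
          [5, 26, 19],
          [1, 7, 12]]
print(bowker(table1))   # approximately (12.27, 3, .0065)
```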
There have been attempts to improve and generalize the preceding formulation. Examples include Krauth's (1973) proposal of an exact test, Meredith and Sands' (1987) embedding of Bowker's test in the framework of latent trait theory, the proposal of Zwick, Neuhoff, Marascuilo, and Levin (1982) of using simultaneous multiple comparisons instead of Bowker's test (cf. Havránek & Lienert, 1986), and reformulation of axial symmetry in terms of log-linear models (Bishop, Fienberg, & Holland, 1975; cf. Wickens, 1989). Benefits of reformulating such tests as the Bowker test in terms of log-linear models include embedding it in a more general framework and the possibility of parameter interpretation (e.g., Meredith & Sands, 1987). The following section summarizes Bishop et al.'s (1975) formulation of axial symmetry.
3. LOG-LINEAR MODELS FOR AXIAL SYMMETRY

The log-linear model that allows testing of axial symmetry can be given as follows:

$\log F_{ij}^{AB} = \lambda_0 + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}$,    (3)

with side constraints

$\lambda_i^A = \lambda_j^B$, for $i = j$,    (4)

and

$\lambda_{ij}^{AB} = \lambda_{ji}^{AB}$, for $i > j$.    (5)

The side constraints (4) and (5) result in estimated expected cell frequencies

$\hat{F}_{ij}^{AB} = \frac{f_{ij}^{AB} + f_{ji}^{AB}}{2}$, for $i \neq j$,    (6)

and

$\hat{F}_{ii}^{AB} = f_{ii}^{AB}$.    (7)

The following section shows how this model formulation can be recast in terms of a nonstandard log-linear model.
4. AXIAL SYMMETRY IN TERMS OF A NONSTANDARD LOG-LINEAR MODEL

The saturated log-linear model for an I × J table can be expressed as

$\log F_{ij}^{AB} = \lambda_0 + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}$,    (8)

where $\lambda_0$ is the "grand mean" parameter; $\lambda_i^A$ are the parameters for the main effect of the row variable, A; $\lambda_j^B$ are the parameters for the main effect of the column variable, B; and $\lambda_{ij}^{AB}$ are the parameters for the A × B interaction. Equation (8) is a special case of

$\log F = X\lambda$,    (9)

where F is an array of frequencies, X is a design matrix, and λ is a parameter vector. One of the main benefits from nonstandard log-linear modeling is that the researcher can specify constraints and contrasts in the design matrix, X (Clogg, Eliason, & Grego, 1990; Evers & Namboodiri, 1978; Rindskopf, 1990; von Eye, Brandtstädter, & Rovine, 1993). Here, we translate the constraints of the axial symmetry model into vectors for X. The specification in Equation (3), together with side constraints (4) and (5), requires a design matrix with two sets of vectors:

1. Vectors that exclude the frequencies in the main diagonal from the estimation process (structural frequencies).
2. Vectors that specify what pairs of cells are assumed to contain the same frequencies.

(For alternative ways of specifying symmetry models using design matrices, see Clogg et al., 1990.) When all pairs of a table are to be tested, the design matrix contains, in addition to the constant vector, k - 1 vectors specifying structural frequencies.
Thus, the degrees of freedom for the model of axial symmetry are $df = \binom{k}{2}$.
The following sections illustrate the design matrix approach in two examples. The first example uses Table 1 again. This table contains data that contradict the model of axial symmetry. The log-linear main effect model, here equivalent to Pearson's $X^2$ test of independence, yields a test statistic of $X^2 = 22.225$, which, for df = 4, has a tail probability of p = .0002. The design matrix given in Table 2 was used to specify the model of axial symmetry. The first two columns after the cell indices in Table 2 contain the vectors needed for the frequencies in the main diagonal to meet side constraint (4). Because of these vectors, the frequencies in the main diagonal cells are estimated as observed. The next three vectors specify one pair of cells each to meet side constraint (5). Specifically, the following three null hypotheses are put forth: vector 3 posits that the frequencies in cells 12 and 21 are, statistically, the same; vector 4 posits that the frequencies in cells 13 and 31 are, statistically, the same; and vector 5 posits that the frequencies in cells 23 and 32 are, statistically, the same. For the data in Table 1, this model yields a Pearson $X^2 = 12.272$, which is identical to the result from applying Equation (2). Application of (3) through (7) also yields the same results. Thus, the model does not hold and parameters cannot be interpreted. To illustrate parameter interpretation, the second example
TABLE 2 Design Matrix for Model of Axial Symmetry in 3 × 3 Cross-Tabulation

                    Vectors
Cell index    1    2    3    4    5
11            1    0    0    0    0
12            0    0    1    0    0
13            0    0    0    1    0
21            0    0    1    0    0
22            0    1    0    0    0
23            0    0    0    0    1
31            0    0    0    1    0
32            0    0    0    0    1
33            0    0    0    0    0
presents a case in which the symmetry model fits. Consider a sample of N = 89 children who were asked about their preferred vacations. All children were in elementary school. The children had spent their last vacations (Time 1) at the beach (B), at amusement parks (A), or in the mountains (M). When their families planned for their vacations the following year (Time 2), children were asked where they would like to spend these vacations. Alternatives included going to the same type of place or switching to one of the other places. Table 3 displays the cross-tabulation of preferences. The log-linear main effect model of the data in Table 3 indicates that the preferences at the two occasions are not independent (Pearson $X^2 = 75.59$, df = 4, p < .01). The strong diagonal suggests that most children stay with the places they used to go. The symmetry model asks whether those children who switch to another place do this in some systematic fashion. The design matrix given in Table 2 applies again. The symmetry model provides a good fit. The Pearson $X^2 = 5.436$ has, for df = 3, a tail probability of p = .1425. Thus there is no need to reject this model, and parameters (typically not estimated for the Bowker or the McNemar tests) can be interpreted. All three parameters that correspond to the symmetry model are statistically significant. The first parameter is $\hat\lambda_1/se_1 = -3.485$; for the second, we calculate $\hat\lambda_2/se_2 = -4.326$; and for the third, $\hat\lambda_3/se_3 = -4.411$. Thus, each of these vectors accounts for a statistically significant portion of the variability in Table 3. Substantively, these parameters suggest that shifts from beach vacations to amusement parks are as likely as shifts from amusement parks to beach vacations. Shifts from beach vacations to mountain vacations are as likely as inverse shifts. Shifts from amusement parks to mountain vacations are as likely as inverse shifts. Application of Bowker's test yields
$X^2 = \frac{(10 - 3)^2}{10 + 3} + \frac{(2 - 4)^2}{2 + 4} + \frac{(1 - 3)^2}{1 + 3} = 5.436$,
TABLE 3 Cross-Tabulation of Children's Preferences at Two Occasions

                        Vacations at Time 2
Vacations at Time 1     B      A      M     Sums
B                      25     10      2      37
A                       3     19      1      23
M                       4      3     22      29
Sums                   32     32     25    N = 89
a value that is identical with the Pearson $X^2$ for the nonstandard log-linear version of the axial symmetry model.
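As an illustration of how such a nonstandard model can be fit in practice, the following sketch estimates the axial symmetry model for the data in Table 3 as a Poisson regression on the design matrix of Table 2. The use of statsmodels here is an assumption for illustration only; any program that accepts a user-specified design matrix for a log-linear (Poisson) model would serve equally well.

```python
# Fit the axial-symmetry model of Table 2 to the Table 3 data as a Poisson GLM.
import numpy as np
import statsmodels.api as sm

# Observed frequencies for cells 11, 12, 13, 21, 22, 23, 31, 32, 33
# (rows = Time 1 preference B, A, M; columns = Time 2 preference B, A, M).
y = np.array([25, 10, 2, 3, 19, 1, 4, 3, 22])

# Constant vector followed by the five vectors of Table 2.
X = np.array([
    # const v1 v2 v3 v4 v5
    [1, 1, 0, 0, 0, 0],  # cell 11
    [1, 0, 0, 1, 0, 0],  # cell 12
    [1, 0, 0, 0, 1, 0],  # cell 13
    [1, 0, 0, 1, 0, 0],  # cell 21
    [1, 0, 1, 0, 0, 0],  # cell 22
    [1, 0, 0, 0, 0, 1],  # cell 23
    [1, 0, 0, 0, 1, 0],  # cell 31
    [1, 0, 0, 0, 0, 1],  # cell 32
    [1, 0, 0, 0, 0, 0],  # cell 33
])

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.fittedvalues)            # diagonal cells reproduced, off-diagonal pairs averaged
print(round(fit.pearson_chi2, 3))  # should be close to 5.436 with 3 df
```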
5. GROUP COMPARISONS

One of the major benefits from recasting a specific statistical test in terms of more general statistical models, such as the log-linear model, is that generalizations and applications in various contexts become possible. This section applies the design matrix approach of nonstandard log-linear modeling of axial symmetry to group comparisons. More specifically, we ask whether the model of axial symmetry applies to two or more groups in the same way, that is, by using the Model of Parallel Axial Symmetry in which parameters are estimated simultaneously for all groups. To illustrate the Model of Parallel Axial Symmetry, we use data that describe two groups of kindergartners. The children were observed twice, the first time 1 month after entering kindergarten and the second time 6 months later. One of the research questions concerned the popularity of individual children. Specifically, the question was asked whether there was a systematic pattern of shifts in popularity in those children who did not display stable popularity ratings. Popularity was rated by kindergarten teachers on a 3-point Likert scale, with 1 indicating low popularity and 3 indicating high popularity. Ratings described popularity in two groups of children. The first group contained N = 86 kindergartners from day care centers in Vienna, Austria (Spiel, 1994). The second group contained N = 92 children from day care centers in rural areas in Austria. Table 4 displays the Group (2; G) × Time 1 (3; T1) × Time 2 (3; T2) cross-tabulation of popularity ratings, the estimated expected cell frequencies, and the standardized residuals for the two groups of children. Table 5 presents the design matrix used for estimating the expected cell frequencies in Table 4 for the Model of Parallel Axial Symmetry. The model has 6 df; one degree of freedom is invested in each vector in Table 5 and one in the constant vector (not shown in Table 5). The design matrix in Table 5 contains vectors that make two types of propositions. First, there are vectors that guarantee that the cells in the main diagonals of the subtables are estimated as observed. These are vectors 1, 2, 3, 7, and 8. Second, there are vectors that specify the conditions of axial symmetry. These are vectors 4, 5, 6, 9, 10, and 11. Vectors 4, 5, and 6 posit that the frequencies in cell pairs 12 and 21, 13 and 31, and 23 and 32 are the same in the first group of children. Vectors 9, 10, and 11 posit the same for the second group of children. Goodness-of-fit for this model is good (Pearson $X^2 = 10.445$, df = 6, p = .107). The parameter estimates for the symmetry model are
TABLE 4 Cross-Tabulation of Popularity Ratings of Two Groups of Children over Two Observations, Evaluated Using Model of Simultaneous Axial Symmetry

Cell indexes                    Cell frequencies
G*T1*T2          Observed      Expected      Standardized residuals
111                  1            1.0                0.0
112                  3            3.5               -0.27
113                  0            0.25              -0.01
121                  4            3.5                0.27
122                 43           43.0                0.0
123                 18           11.0                2.11*
131                  0            0.0               -0.01
132                  4           11.0               -2.11*
133                 13           13.0                0.0
211                  7            7.0                0.0
212                  4            3.5                0.27
213                  0            0.5               -0.71
221                  3            3.5               -0.27
222                 36           36.0                0.0
223                  9            8.0                0.35
231                  1            0.5                0.71
232                  7            8.0               -0.35
233                 25           25.0                0.0
$\hat\lambda_4/se_4 = -4.598$, $\hat\lambda_5/se_5 = -0.171$, $\hat\lambda_6/se_6 = -2.808$, $\hat\lambda_9/se_9 = -4.589$, $\hat\lambda_{10}/se_{10} = -3.836$, and $\hat\lambda_{11}/se_{11} = -3.559$. Only the second of these parameters
is not statistically significant. Note that the second of these parameters is hard to estimate because the observed frequencies for both of the cells, 13 and 31, are zero. This may be a case for which the Delta option would be useful, where a constant, for example, 0.5, is added to each cell frequency.¹
¹Using the Delta option (Δ = 0.5) has the following consequences for the current example: the sample size is artificially increased by 8; the Pearson $X^2$ now is $X^2 = 9.507$, df = 6, p = .1470; the second parameter estimate now is $\hat\lambda_5/se_5 = 3.875$, thus suggesting that the pair of cells, 13 and 31, also accounts for a statistically significant portion of the overall variation; all other parameter estimates are very close to what they were without adding a constant of 0.5 to each cell.

6. QUASI-SYMMETRY

The model of quasi-symmetry puts constraints only on interaction parameters. Thus, the side constraints specified in Equation (4) do not apply. A quasi-
TABLE 5 Design Matrix for Model of Simultaneous Axial Symmetry in 2 × 3 × 3 Cross-Tabulation for Two Groups

                              Vectors
Cell index    1   2   3   4   5   6   7   8   9  10  11
111           1   0   0   0   0   0   0   0   0   0   0
112           0   0   0   1   0   0   0   0   0   0   0
113           0   0   0   0   1   0   0   0   0   0   0
121           0   0   0   1   0   0   0   0   0   0   0
122           0   1   0   0   0   0   0   0   0   0   0
123           0   0   0   0   0   1   0   0   0   0   0
131           0   0   0   0   1   0   0   0   0   0   0
132           0   0   0   0   0   1   0   0   0   0   0
133           0   0   1   0   0   0   0   0   0   0   0
211           0   0   0   0   0   0   1   0   0   0   0
212           0   0   0   0   0   0   0   0   1   0   0
213           0   0   0   0   0   0   0   0   0   1   0
221           0   0   0   0   0   0   0   0   1   0   0
222           0   0   0   0   0   0   0   1   0   0   0
223           0   0   0   0   0   0   0   0   0   0   1
231           0   0   0   0   0   0   0   0   0   1   0
232           0   0   0   0   0   0   0   0   0   0   1
233           0   0   0   0   0   0   0   0   0   0   0
symmetry model describes data by (a) reproducing the marginal frequencies, and (b) by estimating expected cell frequencies such that
$e_{ij} + e_{ji} = f_{ij} + f_{ji}$,    (11)
where $e_{ij}$ and $e_{ji}$ denote the estimated expected cell frequencies. The following example presents a design matrix for the quasi-symmetry model of a 3 × 3 cross-tabulation. The data describe repeat restaurant visitors' choices of main dishes at two occasions. The sample includes N = 94 customers who had selected on both visits from the following dishes: prime rib (R), sole (S), or vegetarian plate (V). Table 6 contains the cross-tabulation of the choices, the observed cell frequencies, the expected cell frequencies estimated for the quasi-symmetry model, and the standardized residuals. The design matrix for this analysis appears in Table 7. The main effect model for the data in Table 6 suggests a lack of independence of the first and second visit choices ($X^2 = 15.863$; df = 4; p = .0032). The quasi-symmetry model asks whether selection of dishes changes in symmetrical fashion without placing constraints on each pair of cells. The likelihood-ratio $X^2$ suggests that this model describes the data adequately ($X^2 = 2.966$; df
TABLE 6 Quasi-Symmetry of Meal Choices Made by Repeat Restaurant Customers

                       Frequencies
Cell indexes    Observed    Expected    Standardized residuals
RR                 25         25.00             0.0
RS                 14         12.30              .48
RV                  3          4.70             -.78
SR                 11         12.70             -.48
SS                 17         17.00             0.0
SV                  6          4.30              .82
VR                  7          5.30              .74
VS                  3          4.70             -.78
VV                  8          8.00             0.0
= 1; p = .085). The test statistics for the three parameters estimated for the quasi-symmetry model are $\hat\lambda_5/se_5 = -1.967$, $\hat\lambda_6/se_6 = -2.769$, and $\hat\lambda_7/se_7 = -2.404$, thus suggesting that for each pair of cells the condition specified in Equation (11) accounts for a substantial amount of variation in the table. The expected cell frequencies in Table 6 show that the design matrix given in Table 7 does indeed lead to estimated expected cell frequencies that meet condition (11). Specifically, both the observed and the expected cell frequencies for cell pair RS-SR add up to 25, both the observed and the expected cell frequencies
TABLE 7 Design Matrix for Quasi-Symmetry Model for 3 × 3 Cross-Classification in Table 6

              Main effect,          Main effect,
Cell index    first occasion        second occasion        Symmetry
RR              1     0               1     0             0    0    0
RS              1     0               0     1             1    0    0
RV              1     0              -1    -1             0    1    0
SR              0     1               1     0             1    0    0
SS              0     1               0     1             0    0    0
SV              0     1              -1    -1             0    0    1
VR             -1    -1               1     0             0    1    0
VS             -1    -1               0     1             0    0    1
VV             -1    -1              -1    -1             0    0    0
for cell pair RV-VR add up to 10, and the observed and the expected cell frequencies for cell pair SV-VS add up to 9.
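Condition (11) can be checked directly from the entries of Table 6; the short sketch below (Python, standard library only; the dictionaries simply transcribe the table) does so for the three symmetric cell pairs.

```python
# Check condition (11): e_ij + e_ji = f_ij + f_ji for each symmetric pair in Table 6.
observed = {"RS": 14, "SR": 11, "RV": 3, "VR": 7, "SV": 6, "VS": 3}
expected = {"RS": 12.3, "SR": 12.7, "RV": 4.7, "VR": 5.3, "SV": 4.3, "VS": 4.7}

for a, b in [("RS", "SR"), ("RV", "VR"), ("SV", "VS")]:
    print(a, b,
          observed[a] + observed[b],              # f_ij + f_ji
          round(expected[a] + expected[b], 1))    # e_ij + e_ji
# Both sums agree pair by pair: 25, 10, and 9.
```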
7. DISCUSSION

This chapter presented nonstandard log-linear models for two well-known tests and log-linear models for axial symmetry and quasi-symmetry. First, three equivalent forms of Bowker's symmetry test were considered. The first form, originally proposed by Bowker (1948), generates a test statistic that is approximately distributed as chi-squared. The second form is a log-linear model with side constraints that result in a formula for estimation of model fit that is the same as the one proposed by Bowker. The third form equivalently recasts the log-linear model as a nonstandard model that allows researchers to express model specifications in terms of coding vectors of a design matrix.

Recasting statistical tests equivalently in terms of more general statistical models results in a number of benefits. For example, the new form may enable the user to understand the characteristics of a test. For the Bowker test, the design matrix shows, for instance, that main effects (marginal frequencies) are not considered, and that the change patterns are evaluated regardless of the size of the frequencies in the main diagonal (diagonal cells are blanked out). For the quasi-symmetry model, the design matrix shows that main effects are considered. In addition, the design matrix approach allows researchers to analyze data without having to create three-dimensional tables (see Bishop et al., 1975, pp. 289ff.; for an illustration, see Upton, 1978, pp. 120ff., or Wickens, 1989, pp. 260ff.), and it allows instructors to introduce axial and quasi-symmetry models in a unified approach.

Yet, there are benefits beyond the presentation of equivalent forms and beyond didactical advances. For instance, log-linear model parameters and residuals can be interpreted. Thus, researchers can identify those pairs of cells for which symmetry holds and those for which symmetry is violated. In addition, other options of log-linear modeling can be used in tandem with the original test form. Examples of these options include consideration of the ordinal nature of variables (Agresti, 1984) and the incorporation of symmetry testing in multigroup comparisons. This chapter presented the Model of Parallel Axial Symmetry as one example of how to incorporate symmetry testing in multigroup comparisons. The model proposes that axial symmetry holds across two (or more) groups of subjects. Differences in group size can be considered, and so can assumptions that constrain axial symmetry to specific pairs of cells. Additional extensions include models for more than two observation points.
ACKNOWLEDGMENTS

Parts of this chapter were written while Alexander von Eye was Visiting Professor at the University of Vienna, Austria. The support of the University is gratefully acknowledged. The authors are also indebted to Clifford C. Clogg, G. A. Lienert, and Michael J. Rovine for helpful comments on earlier versions of this chapter. Parts of Alexander von Eye's work on this chapter were supported by NIA Grant 5T32 AG00110-07.
REFERENCES

Agresti, A. (1984). Analysis of ordinal categorical data. New York: Wiley.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press.
Bortz, J., Lienert, G. A., & Boehnke, K. (1990). Verteilungsfreie Methoden in der Biostatistik [Distribution-free methods for biostatistics]. Berlin: Springer-Verlag.
Bowker, A. H. (1948). A test for symmetry in contingency tables. Journal of the American Statistical Association, 43, 572-574.
Clogg, C. C., Eliason, S. R., & Grego, J. M. (1990). Models for the analysis of change in discrete variables. In A. von Eye (Ed.), Statistical methods in longitudinal research: Vol. 2. Time series and categorical longitudinal data (pp. 409-441). San Diego, CA: Academic Press.
Evers, M., & Namboodiri, N. K. (1978). On the design matrix strategy in the analysis of categorical data. In K. F. Schuessler (Ed.), Sociological methodology (pp. 86-111). San Francisco: Jossey-Bass.
Havránek, T., & Lienert, G. A. (1986). Pre-post treatment evaluation by symmetry testing in square contingency tables. Biometrical Journal, 28, 927-935.
Krauth, J. (1973). Nichtparametrische Ansätze zur Auswertung von Verlaufskurven [Non-parametric approaches to analyzing time series]. Biometrical Journal, 15, 557-566.
McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12, 153-157.
Meredith, W., & Sands, L. P. (1987). A note on latent trait theory and Bowker's test. Psychometrika, 52, 269-271.
Rindskopf, D. (1990). Nonstandard log-linear models. Psychological Bulletin, 108, 150-162.
Sands, L. P., Terry, H., & Meredith, W. (1989). Change and stability in adult intellectual functioning assessed by Wechsler item responses. Psychology and Aging, 4, 79-87.
Spiel, C. (1994). Risks to development in infancy and childhood. Manuscript submitted for publication.
Upton, G. J. G. (1978). The analysis of cross-tabulated data. Chichester: Wiley.
von Eye, A., Brandtstädter, J., & Rovine, M. J. (1993). Models for prediction analysis. Journal of Mathematical Sociology, 18, 65-80.
Wickens, T. (1989). Multiway contingency tables analysis for the social sciences. Hillsdale, NJ: Erlbaum.
Zwick, R., Neuhoff, V., Marascuilo, L. A., & Levin, J. R. (1982). Statistical tests for correlated proportions: Some extensions. Psychological Bulletin, 92, 258-271.
Application of the Multigraph Representation of Hierarchical Log-linear Models

H. J. Khamis
Wright State University
Dayton, Ohio
1. INTRODUCTION

In developmental research, as indeed in other forms of research, it has not been uncommon in recent decades to confront studies in which large quantities of data are amassed. In the categorical case, this leads to large contingency tables that are not necessarily sparse. This is especially true given that standard rules for the adequacy of asymptotic approximations, such as the minimum expected cell size should be at least 5, are too conservative (see Fienberg, 1979). More appropriate rules of thumb are that the minimum expected cell size should be one or more, or that the total sample size should be at least 4 or 5 times the number of cells (Fienberg, 1979). Although the analytical techniques and software used for analyzing the structures of association among variables in such tables are well known, the
techniques for interpreting and using the complex models (e.g., loglinear models) associated with the tables have not kept up. In particular, it is often of interest to identify the conditional independencies resulting from a given contingency table, and from these, the collapsibility conditions for the table. Although this is not difficult when just a few variables are involved, for complex models involving five or more variables the task can be quite cumbersome because there is no coherent, efficient methodology for these analyses. A very useful technique for analyzing and interpreting hierarchical loglinear models in a graphical way was introduced by Darroch, Lauritzen, and Speed (1980). Although it does not seem to be in widespread use, it is included in some more recent categorical data analysis textbooks, such as Wickens (1989) and Christensen (1990). The usefulness of the approach by Darroch et al. (1980) is principally due to the simple graphical characterization of models that can be understood purely in terms of conditional independence relationships. In this chapter, I introduce an alternative approach to that of Darroch et al. (1980) that uses the generator multigraph. The multigraph approach has several strategic advantages over the first-order interaction graph used by Darroch et al. (1980). The focus of this chapter, however, is on how to use the multigraph approach in maximum likelihood estimation and in identifying conditional independencies in hierarchical loglinear models. All theoretical details (theorems and proofs) have been left out (they are contained in McKee and Khamis, 1996, and are available from the authors upon request). The next section establishes the notation necessary for the application of the multigraph approach.
2. NOTATION AND REVIEW

I have assumed that the reader is familiar with the technique of log-linear model analysis of multidimensional contingency tables, such as that presented in Bishop, Fienberg, and Holland (1975), Wickens (1989), or Agresti (1990). It is also helpful to know the rudimentary principles of mathematical graphs; a knowledge of graphical models would be useful but is not essential for understanding this chapter (for a review of the literature concerning graphical models, see Khamis and McKee, 1996). Attention will be confined to those models of discrete data that are most practically useful, namely, the hierarchical loglinear models (HLLMs). These models are uniquely characterized by their generating class or minimal sufficient configuration which establishes the correspondence between the λ-terms (using the notation of Agresti, 1990) in the model and the minimal sufficient statistics. Consider the following model of conditional independence in the three-dimensional table,

$\log m_{ijk} = \lambda + \lambda_i^1 + \lambda_j^2 + \lambda_k^3 + \lambda_{ij}^{12} + \lambda_{ik}^{13}$,    (1)
where $m_{ijk}$ denotes the expected cell frequency for the ith row, jth column, and kth layer, and the parameters on the right side of Equation (1) represent certain contrasts of logarithms of $m_{ijk}$. The generating class for this model is denoted by [12][13] and corresponds to the inclusion-maximal sets of indices in the model (called the generators of the model). For the I × J × K table with $x_{ijk}$ denoting the observed cell frequency for the ith row, jth column, and kth layer, the minimal sufficient statistics for the parameters of this model then are $\{x_{ij+}\}$, i = 1, 2, ..., I; j = 1, 2, ..., J; and $\{x_{i+k}\}$, i = 1, 2, ..., I; k = 1, 2, ..., K, where the "+" in the subscript corresponds to summation over the index replaced. This model corresponds to conditional independence of Factors 2 and 3 given Factor 1. Using Goodman's (1970) notation, it can be written as [2 ⊗ 3 | 1]. Decomposable models (also called models of Markov type, multiplicative models, or direct models) are those HLLMs for which the cell probability (or, equivalently, expected frequency) can be factored according to the indices in the generators of the model. For instance, in the preceding example, $m_{ijk} = m_{ij+} m_{i+k} / m_{i++}$, and this allows for an explicit solution to the maximum likelihood estimation problem. In fact, for this model, the maximum likelihood estimator for the expected cell frequency $m_{ijk}$ is $x_{ij+} \cdot x_{i+k} / x_{i++}$. Because models with closed-form maximum likelihood estimators have closed-form expressions for asymptotic variance (Lee, 1977), the importance of decomposable models can be seen in theoretical and methodological research, for example, in the study of large, sparse contingency tables (see, e.g., Fienberg, 1979; Koehler, 1986).
3. THE GENERATOR MULTIGRAPH

The generator multigraph, or simply multigraph, is introduced as a graphical technique to analyze and interpret HLLMs. In the multigraph, the vertex set is the set of generators of the model, and two vertices are joined by edges that are equal in number to the number of indices shared by the two vertices. The multigraph M1 for the model in Equation (1) ([12][13]) is given in Figure 1. Note that the vertices for the multigraph consist of the two generators of the model, [12] and [13], and because {1, 2} ∩ {1, 3} = {1}, there is a single edge joining the two vertices.
FIGURE 1. Generator multigraph M1 for [12][13].
FIGURE 2. Generator multigraph M2 for [135][245][345].
The multigraph M2 for the generating class [135][245][345], corresponding to a five-dimensional table, is given in Figure 2. Here, there are two double edges ({1, 3, 5} ∩ {3, 4, 5} = {3, 5} and {2, 4, 5} ∩ {3, 4, 5} = {4, 5}) and one single edge ({1, 3, 5} ∩ {2, 4, 5} = {5}).
3.1. Maximum Spanning Trees

A fundamental concept for this examination of multigraphs is the standard graph-theoretic notion of a maximum spanning tree T of a multigraph M: a tree, or equivalently, a connected graph with no circuits (or closed loops), which includes each vertex of M such that the sum of all of the edges is maximum. Maximum spanning trees always exist and can be found by using, for example, Kruskal's algorithm (Kruskal, 1956). Kruskal's algorithm simply calls for the successive selection of multiedges with maximum multiplicity so that no circuits are formed and such that all vertices are included. Each maximum spanning tree T of M consists of a family of sets of factor indices called the branches of the tree. For the multigraph M1 in Figure 1, the maximum spanning tree is trivially the edge (branch) joining the two vertices, and it is denoted by T1 = {1}, namely the set containing the factor index corresponding to that edge. For the generating class [135][245][345] with multigraph M2 in Figure 2, the maximum spanning tree is T2 = {{3, 5}, {4, 5}}, with branches {3, 5} and {4, 5}. For the nondecomposable model, [12][23][34][14], with multigraph M3 given in Figure 3, there are four distinct possible maximum spanning trees, each of the form T3 = {{i}, {j}, {k}}. Maximum spanning trees will be used in the next section
FIGURE 3. Generator multigraph M3 of [12][23][34][14].
to provide a remarkably easy method for identifying decomposable models, for factoring the joint distribution of such models in terms of their generators, and for identifying conditional independencies in these models.
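A small computational sketch of Kruskal's algorithm applied to the generator multigraph is given below (Python, standard library only; the function is illustrative and not part of the original presentation). Generators are entered as sets of factor indices, multiedges are weighted by the number of shared indices, and the branches of a maximum spanning tree are returned.

```python
# Kruskal's algorithm on the generator multigraph: repeatedly take the
# multiedge of greatest multiplicity that creates no circuit.
def max_spanning_tree(generators):
    """Return the branches of a maximum spanning tree of the generator multigraph."""
    gens = [frozenset(g) for g in generators]
    # All multiedges between pairs of generators, weighted by shared indices.
    edges = sorted(
        ((len(a & b), a, b)
         for i, a in enumerate(gens) for b in gens[i + 1:]
         if a & b),
        key=lambda e: e[0], reverse=True)

    parent = {g: g for g in gens}        # union-find structure over the vertices

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    branches = []
    for _, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:             # adding this multiedge creates no circuit
            parent[root_a] = root_b
            branches.append(a & b)       # a branch is the set of shared indices
    return branches

# [135][245][345] from Figure 2: the branches are {3, 5} and {4, 5}, i.e., T2.
print(max_spanning_tree([{1, 3, 5}, {2, 4, 5}, {3, 4, 5}]))
```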
3.2. Edge Cutsets

Another fundamental concept used in working with the multigraph is the edge cutset. An edge cutset of a multigraph M is an inclusion-minimal set of multiedges whose removal disconnects M. For the model [12][13] with multigraph M1 given in Figure 1, there is a single edge cutset that disconnects the two vertices, and it is trivially the minimum number of edges that does so. We denote this edge cutset by {1}, the factor index associated with the edge whose removal disconnects M1. For the multigraph M2 given in Figure 2, there are three edge cutsets and each disconnects a single vertex: (a) the edge cutset {4, 5} (corresponding to the single edge {5} and the double edge {4, 5}) disconnects the vertex 245, (b) the edge cutset {3, 5} disconnects the vertex 135, and (c) the edge cutset {3, 4, 5} disconnects the vertex 345. For the nondecomposable model [12][23][34][14] with multigraph M3 given in Figure 3, there is a total of six edge cutsets: there are four edge cutsets that each disconnects a single vertex (these edge cutsets are {1, 2}, {2, 3}, {3, 4}, and {1, 4}); there is an edge cutset corresponding to the two horizontal edges (namely, {2, 4}); and there is an edge cutset corresponding to the two vertical edges (namely, {1, 3}). So, for example, removal of the two edges associated with indices 1 and 2 in Figure 3, corresponding to the preceding set {1, 2}, would disconnect the vertex 12 from the rest of the multigraph, and this is the minimum number of edges that will do so. One convenient way of keeping track of edge cutsets is to draw dotted lines that disconnect the graph. Those edges that the dotted lines intersect are contained in an edge cutset, as illustrated in Figure 4 for the multigraph M3. Section 2.2 of Gibbons (1985) contains a standard mechanical procedure for finding all edge cutsets. This relatively efficient procedure will be important in the next section for identifying conditional independencies in nondecomposable models.
4. MAXIMUM LIKELIHOOD ESTIMATION AND FUNDAMENTAL CONDITIONAL INDEPENDENCIES

4.1. Maximum Likelihood Estimation

McKee and Khamis (1996) show that a HLLM is decomposable if and only if the number of indices added over the branches in any maximum spanning tree
FIGURE 4. Identification of edge cutsets in M3.
of the multigraph subtracted from the number of indices added over the vertices of the multigraph is equal to the dimensionality of the table; that is,

$d = \sum_{S \in V(T)} |S| - \sum_{S \in B(T)} |S|$,    (2)
where d is the number of categorical variables in the contingency table, T is any maximum spanning tree of the multigraph, and V(T) and B(T) are the set of vertices and set of branches, respectively, of T. For example, in Figure 1, d = 3, V(T) = {{1, 2}, {1, 3}}, and B(T) = {1}. Therefore the formula in (2) becomes 3 = (2 + 2) - 1; because this equality is true, the model [12][13] is decomposable (as was shown in section 2). For the generating class [135][245][345] with multigraph M2 given in Figure 2, d = 5, V(T) = {{1, 3, 5}, {2, 4, 5}, {3, 4, 5}}, B(T) = {{3, 5}, {4, 5}}, and the formula in (2) becomes 5 = (3 + 3 + 3) - (2 + 2), so that [135][245][345] is decomposable. In Figure 3, 4 ≠ 8 - 3; therefore [12][23][34][14] is nondecomposable. For decomposable models, the multigraph can be used directly to factor the joint distribution of the contingency table in terms of the generators of the model. In particular, let M be the multigraph of a decomposable generating class, and let T be any maximum spanning tree with set V(T) of vertices and set B(T) of branches. Then, the joint distribution for the associated contingency table is
$P[v_1, v_2, \ldots, v_d] = \frac{\prod_{S \in V(T)} P[v: v \in S]}{\prod_{S \in B(T)} P[v: v \in S]}$,    (3)

where $P[v_1, v_2, \ldots, v_d]$ represents the probability associated with level $v_1$ of the first factor, level $v_2$ of the second factor, ..., and level $v_d$ of the dth factor; $P[v: v \in S]$ denotes the marginal probability indexed on those indices contained in S (and summing over all other indices).
Consider the generating class [12][13] with multigraph M1 given in Figure 1. Because V(T) = {{1, 2}, {1, 3}} and B(T) = {1}, from Equation (3) we get, using simpler notation, $p_{ijk} = p_{ij+} p_{i+k} / p_{i++}$; note that the terms in the numerator are indexed on factors corresponding to V(T), namely Factors 1 and 2 ($p_{ij+}$) and Factors 1 and 3 ($p_{i+k}$), and the term in the denominator is indexed on the factor corresponding to B(T), namely Factor 1 ($p_{i++}$). This formula agrees with the one given in section 2 for this model. Consider the model [135][245][345] with multigraph M2 given in Figure 2. Here, V(T) = {{1, 3, 5}, {2, 4, 5}, {3, 4, 5}} and B(T) = {{3, 5}, {4, 5}}, so that Equation (3) gives

$p_{ijklm} = \frac{p_{i+k+m} \, p_{+j+lm} \, p_{++klm}}{p_{++k+m} \, p_{+++lm}}$.
4.2. Fundamental Conditional Independencies Consider a partition of the d factors in a contingency table, for example, C1, C2, .... C k, S, where 2 --< k --< d - 1. McKee and Khamis (1996) show that each generating class uniquely determines a set of fundamental conditional independencies (FCIs), each of the form [C 1 | C 2 | | Ckl S] with k -> 2 such that all other conditional independencies can be deduced from them by replacing S with S' such that S C S', replacing each C i with C i' such that C i' C C i, subject to (C~' U C 2 ' [,.J . . . [,.J C k ' ) ("] S ' -- Q~, and forming appropriate conjunctions. For example, if Factors 1 and 2 are independent of Factor 3 conditional on Factor 4, then (a) Factor 1 is independent of Factor 3 conditional on Factors 2 and 4, and (b) Factor 2 is independent of Factor 3 conditional on Factors 1 and 4. Notationally, [1, 2 | 314] ~ [1 | 312, 4] N [2 | 311, 4]. The FCIs are determined from the multigraph as follows. For a given multigraph M and set of factors S, construct the multigraph M/S by removing each factor of S from each generator (vertex in the multigraph) and removing each edge corresponding to that factor. For decomposable models, S is chosen to be a branch of any maximum spanning tree of M, and for nondecomposable models, S is chosen to be the factors corresponding to an edge cutset of M. Then, the FCI corresponds to the mutual independence of the sets of factors in the disconnected components of M/S conditional on S. A few examples should make ~ e technique clear. Consider the multigraph M 1 given in Figure 1. Select S to be the branch of the maximum spanning tree T~, that is, S - { 1 }. Then, M1]S is constructed by removing the index 1 from each vertex in the multigraph and removing the edge corresponding to 1. The resulting multigraph is given in Figure 5. The disconnected components correspond to Factors 2 and 3, so Factors 2 and 3 are independent given Factor 1, as indicated in section 2. Consider the decomposable model [135][245][345] with maximum spanning tree T2 - { {3, 5 }, {4, 5 } } and multigraph M 2 given in Figure 2. The multigraphs
222
H. ,!. Khamis
2
3
[2 | 311] FIGURE 5. The multigraph M~/Sand corresponding FCI for S = {1}.
M2/S for S = {3, 5} and S = {4, 5} are given in Figure 6, along with the FCI derived from each. For the nondecomposable model [12][23][34][14] with multigraph M 3 given in Figure 3, I have chosen S to be the set of factors associated with an edge cutset. The six edge cutsets for this multigraph are {1, 2}, {2, 3}, {3, 4}, {1, 4 }, {2, 4 }, and { 1, 3 } (see section 3), so there are six possible sets S; however, only the latter two edge cutsets yield an FCI, as the others do not produce a multigraph M3/S with more than one component (see Fig. 7). More details concerning how to work with the generator multigraph and additional examples are given in McKee and Khamis (1996).
5. EXAMPLES In the following examples, instead of using numbers, as was done in the preceding discussion, capital letters are used to represent factors. In this way, identification of the factors will be easier. Edwards and Kreiner (1983) analyzed a set of data in the form of a five-way contingency table from an investigation conducted at the Institute for Social Research, Copenhagen, collected during 1978 and 1979. A sample of 1592 em-
a
b 1
24
13
2
4 S
[1
=
3
{3,5}
2,413,5 ]
S
=
[2 |
{4,5}
]
FIGURE6. The multigraphs MJS and corresponding FCIs for (a) S = {3, 5} and (b) S = {4, 5 }.
223
11. Multigraph Representation a
1
3
2"'
1
3
4
S
=
{2,4}
S
[1 ~ 312,4]
-2
4
.....
=
{I,3}
[2 (~ 4 1 1 , 3 ]
FIGURE 7. The multigraphs M3]S and corresponding FCIs for (a) S = {2, 4} and (b) S = {1, 3 }.
ployed men, 18 to 67 years old, were asked whether in the preceding year they had done any work which before they would have paid a craftsman to do. The variables included in the study are as follows. Variable Age category Response Mode of residence Employment Type of residence
Symbol A R M E T
Levels <30, 31-45, 46-67 Yes, no Rent, own Skilled, unskilled, other Apartment,house
There are three models of interest that Edwards and Kreiner (1983) consider to have a good fit to the data: (a) [ARME][AMET], (b) [ARME][AMT], and (c) [AME][RME][AMT]. I will analyze each of these models using the multigraph approach. The multigraphs for these models are given in Figure 8. The m a x i m u m spanning tree in each case is (a) T = {A, M, E}, (b) T = {A, M}, and (c) T = {ME, AM}. It is easily checked, by the formula in Equation (2), that each of these models is decomposable" (a) 5 = (4 + 4) - 3; (b) 5 = (4 + 3) - 2; and (c) 5 = (3 + 3 + 3) - (2 + 2). The m a x i m u m likeli-
AME ARME'
_AMET
ARME:
........
,'"
cRME
AMT AMT
FIGURE 8. The multigraphs for (a) [ARMEI[AMET], (b) [ARME][AMT], (c) [AME][RME] [AMT].
224
a
/4. J. Khamis
R
b
T
S = {A,M, E}
[R |
RE
T
S = {A,M}
]
[R,E
|
]
FIGURE 9. Multigraphs M/S and conditional independence interpretation for (a) [ARME][AMET] and (b) [ARME][AMT].
hood estimator for Pijklm c a n be obtained from the formula in Equation (3) if so desired. Let us consider the conditional independence interpretations of these models. For the first two models, the multigraphs M/S, where S is selected to be the branch corresponding to T in each case, are given in Figure 9 along with the conditional independence interpretation. For the third model, (c) [AME] [RME][AMT], the maximum spanning tree has two branches: T - { {M, E}, {A, M} }. Thus there are two choices of S, leading to two FCIs (see Figure 10). The interpretations of these three models are, respectively, (a) the type of residence is conditionally independent of response, (b) the type of residence is conditionally independent of response and employment category, and (c) the response is conditionally independent of type of residence and age category, and the type of residence is conditionally independent of response and employment category. Note that these three models are nested: [AME][RME][AMT] is a special case of [ARME][AMT], and [ARME][AMT] is a special case of [ARME] [AMET]. Also, recall that any conditional independence that exists for a given
A
R
E
AT S = [A,T
{M,E} ~ R J M , E]
-RE
T S = {A,M} [R,E
|
]
FIGURE 10. Multigraphs M/S for the model [AME][RME][AMT] and for the two branches S = {M, E} and S = {A, M}.
11. Multigraph Representation
225
model will also be true of a special case (or submodel) of that model. In particular, the previous statements satisfy this relationship: the statement in (c) implies that in (b), and the statement in (b) implies that in (a). With this in mind, one can balance the practical considerations of the interpretation with the statistical aspects of the associated models, as follows. Suppose that all five variables in this data set are treated as response variables (multinomial sampling design), and that we confine our attention to conditional independencies. Then, because the model [ARME, AMT] fits the data well (p = .2561), we do not reject the assertion that residence type is conditionally independent of survey response and employment category. The model [AME, RME, AMT] fits the data adequately at the 5% level of significance, but not at the 10% level (p = .0931). In addition to the conditional independence, mentioned above, this latter model corresponds to conditional independence between survey response and age category. The implication is that conditional independence between survey response and age category is not as strongly supported by the data (in fact, it is rejected at the 10% level of significance) as the conditional independence between residence type and both survey response and employment category. In terms of the survey response, one might then conclude that whether or not one had done work in the past year that previously one would have paid a craftsman to do is (a) conditionally independent of the type of residence (apartment or house) and (b) (marginally) conditionally independent of age category. A second example involves the Dayton Area Drug Survey, a survey conducted in 1992 by the Wright State University School of Medicine and the United Health Services of the Dayton area (a United Way agency) of all Dayton area school children in grades 6 through 12. One part of the study, which focused on 2276 nonurban seniors (grade 12), asked whether they had ever used marijuana, alcohol, or cigarettes. The variables are as follows. Variable Marijuana Alcohol Cigarettes Race Sex
Symbol M A C R S
Levels No, yes No, yes No, yes Other, white Female, male
The actual data are listed below lexicographically, with the levels of marijuana changing the fastest, the levels of alcohol changing the second fastest, .... and the levels of Sex changing the slowest: 12 17
0 0
19 18
2 1
1 8
0 1
23 19
23 30
117 133
1 1
218 201
13 28
17 17
1 1
268 228
405 453
A model that fits these data quite well is [ACR][MCS][MAC], with p = .5233 (Pearson chi-squared statistic). The generator multigraph for this generating class
226
H. J. Khamis
CR
MAC
. . . . .
"
M C S
FIGURE 11. Multigraphfor [ACR][MCS][MAC].
is given in Figure 11. The maximum spanning tree is T = {{A, C}, {M, C} }. This model is easily seen to be decomposable because the number of indices in the vertices is 3 + 3 + 3 = 9, the number of indices in the branches is 2 + 2 - 4, and 9 - 4 = 5 is the number of factors in the contingency table. It is helpful to know that the model is decomposable for interpretive purposes because the FCIs are obtained so much more easily in decomposable than in nondecomposable models. In particular, they are based on the multigraphs M/S which are obtained directly from the branches and vertices of any maximum spanning tree for the generator multigraph. For example, by selecting S1 = {A, C } and S2 = {M, C } (the branches of the maximum spanning tree obtained from Fig. 11), one gets the multigraphs M/S1 and M/S2 given in Figure 12 along with the associated FCI interpretation. The conclusions from this model concerning conditional independencies are that race is conditionally independent of both marijuana use and sex, and sex is conditionally independent of alcohol use (note that conditional independence between sex and race appear in both FCIs). Certainly, the conditional independence between race and sex is reasonable, given the multinomial sampling design. Using Goodman's forward selection procedure (see, e.g., Bishop et al., 1975), there is only one term that is identified as being statistically significant at the 5% level given the model [ACR][MCS][MAC], and hence consideration should be given to adding it to the model. The conditional likelihood ratio test statistic for the first-order interaction h As is
a
b R
M
MS
A"
J
J
AR
S
S = {A, C}
S = {M, C}
[R ~ M , SIA, C ]
[S | A , R I M , C ]
,(a)
(b)
FIGURE 12. Multigraphsfor (a) M/S1 and (b) M / S 2 with Sl = {A, C} and S 2 model [ACR, MCS, MAC].
--
{M, C} in the
11. Multigraph Representation
227
R
MAC-
.. -
"MCS
FIGURE 13. Multigraphof [ASI[ACR][MCSI[MAC].
G2([AS][ACRI[MCSI[MAC] I [ACRI[MCS][MAC]) = 4.09, with 1 df (p = .0432). The resulting model, [AS][ACR][MCS][MAC], fits the data well (p = .7563; Pearson chi-squared statistic). In this simple case, it is easily seen that the effect of including h As destroys the conditional independence between alcohol use and sex in the previous model. However, for purposes of illustration, let us go through the exercise of obtaining the conditional independence interpretation for the more complex model, [AS][ACR][MCS] [MAC]. The multigraph for this model is given in Figure 13; this is a nondecomposable model, as 5 4= 11 - 5. Consequently, we need to identify all edge cutsets in the multigraph. In Figure 14, dotted lines are drawn that separate vertices in the multigraph of Figure 13; recall that for a given dotted line, the edge cutset contains the indices associated with those edges that intersect the dotted line. In Figure 14, the dotted lines are numbered and the associated edge cutsets are listed as follows: 1. {A, C}; 2. {M, C, S}; 3. {A, M, C}; 4. {A, S}; 5. {A, S, C}; and 6. {A, M, C, S }. Now, in each case Si, i = 1, 2 . . . . . 6, is selected to
9
I
\ AS
i t
,
t I
/
~
' '
.ACR
4-.f3
"
I --5
MAC
'
j !
js '
i
16
:
\%
MCS
2
-
FIGURE 14. Multigraph of [AS][ACR][MCS][MAC] with dotted lines that identify the edge cutsets.
228
H. J. Khamis a
M
b
.
.
.
.
s~ =
.
.
.
.
S
MS
{A,C}
[R ~ M , S I A , C ]
FIGURE 15.
R
S3 =
{A,M,C}
[R ~ S I A , M , C ]
R M
S5 =
M
{A,S,C}
[R ~ M ] A , S , C ]
Multigraphs M / S i for edge cutsets (a) Sl = {A, C}, (b) S3 = {A, M, C}, and (c)
S5 = { A , S , C } .
be the given edge cutset and the corresponding multigraph M / S i is constructed. The only edge cutsets that lead to more than one component for M/Si are the ones for i = 1, 3, and 5; they are shown in Figure 15. Note that the latter two conditional independencies are redundant; that is, they are implied by the first conditional independence, [R | M, S[A, C], thereby making [R | M, S IA, C] the only FCI for the model [AS][ACR][MCS][MAC]. In this model, race is conditionally independent of sex and marijuana use; we no longer have conditional independence between sex and alcohol use. Because the disproportionate exposure to alcohol use between the sexes among high school seniors is of interest to the investigators of this study, and because of its statistical significance, the first-order interaction h As is retained in the model for further investigation.
6. SUMMARY The generator multigraph was introduced as a graphical method for representing hierarchical loglinear models. The multigraph has the following useful properties. 9 The multigraph is typically smaller (i.e., has fewer vertices) than the interaction graph (Darroch et al., 1980), especially for contingency tables with many factors and few generators. 9 There is a one-to-one correspondence between the generating class and the multigraph representation. 9 Use of the multigraph leads to a remarkably simple method of factoring the joint distribution of the contingency table for decomposable models; that is, the factors in the numerator and denominator correspond directly to the vertices and branches, respectively, of any maximum spanning tree in the multigraph. 9 The multigraph can be used in a mechanical procedure for obtaining all conditional independencies in the model. For decomposable models, the procedure is especially simple, as all fundamental conditional independencies can be obtained directly from the vertices and branches of any maximum spanning tree
11. Multigraph Representation
229
in the multigraph. For nondecomposable models, the fundamental conditional independencies are derived from the edge cutsets of the multigraph. Multigraph representations provide a useful and versatile technique for the study and interpretation of hierarchical loglinear models. In this article, I have focused on maximum likelihood estimation and derivation of FCIs. For purposes of interpreting large, complex models in terms of conditional independencies, the multigraph provides an essential tool: a mechanical, relatively efficient method of deriving all possible conditional independencies in the model. Such a capability has thus far been unavailable. The method discussed here is applicable to all HLLMs. Although decomposable models have important advantages for statistical methodologists (see section 2), their most important advantage for researchers in developmental processes is the ease with which conditional independencies can be identified--and this facilitates interpretation of the model. One can anticipate the usefulness of the multigraph in the study of such topics as model selection techniques, collapsibility, latent variable models, and the analysis and interpretation of recursive, logit, nongraphical, and nonhierarchical loglinear models.
REFERENCES Agresti, A. (1990). Categorical data analysis. New York: Wiley. Bishop, Y. M. M., Fienberg, S. E., & Holland, E W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press. Christensen, R. (1990). Log-linear models. New York: Springer-Verlag. Darroch, J. N., Lauritzen, S. L., & Speed, T. E (1980). Markov fields and log-linear interaction models for contingency tables. Annals of Statistics, 8, 522-539. Edwards, D., & Kreiner, S. (1983). The analysis of contingency tables by graphical models. Biometrika, 70, 553-565. Fienberg, S. E. (1979). T h e u s e of chi-squared statistics for categorical data problems. Journal of the Royal Statistical Society, Series B, 41, 54-64. Gibbons, A. (1985). Algorithmic graph theory. Cambridge, UK: Cambridge University Press. Goodman, L. A. (1970). The multivariate analysis of qualitative data: Interaction among multiple classifications. Journal of the American Statistical Association, 65, 226-256. Khamis, H. J., & McKee, T. A. (1996). Chordal graph models of contingency tables. Mathematics and Computer Modelling. Koehler, K. J. (1986). Goodness-of-fit tests for loglinear models in sparse contingency tables. Journal of the American Statistical Association, 81, 483-493. Kruskal, J. B. (1956). On the shortest spanning sub-tree and the travelling salesman problem. Proceedings of the American Mathematical Society, 7, 48-50. Lee, S. K. (1977). On the asymptotic variance of O-terms in loglinear models of multidimensional contingency tables. Journal of the American Statistical Association, 72, 412-419. McKee, T. A., & Khamis, H. J. (1996). Multigraph representations of hierarchical loglinear models, Journal of Statistical Planning and Inference, in press. Wickens, T. D. (1989). Multiway contingency tables analysis for the social sciences. Hillsdale, NJ: Erlbaum.
This Page Intentionally Left Blank
9
9
PART~ 9
9
9
9
9
9
9
Applications
This Page Intentionally Left Blank
Correlation and Categorization under a Matching Hypothesis 9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
M/chae/ J. Rov/ne
Pennsylvania State University University Park, Pennsylvania
Alexander von Eye Michigan State University East Lansing, Michigan
1. INTRODUCTION Standard texts often describe the correlation coefficient through its defining equation, usually in a form such as N
r(X, Y) = i=l
.
(1)
(N - 1)SxSy The correlation is often interpreted as the slope of a standardized regression line or in terms of the amount of variance two variables share (Glass & Hopkins, 1984; McNemar, 1962). The correlation can also be defined both in terms of
Categorical
Variables
in Develo pme ntal
Research."
Methods
of Analysis
Copyright 9 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.
233
234
Michael J. Rovine and Alexander von Eye
the points around a bivariate regression and in geometric terms (Marks, 1982). In all of these forms, it is fairly easy to interpret when at an extreme (e.g., - 1 , 0, or 1). However, for values somewhere in the middle, the correlation is more difficult to interpret. Consider a correlation of .30 (the standard in social science). What does that mean? One could square the coefficient and say that two variables share 9% of their variance, but what does that mean? Being able to give some meaning to the correlation is important. In the social sciences, correlations are used to summarize the results of studies. Although most correlational studies are uncontrolled, investigators will often make statements such as "people who experience more conflict at home tend to have more problems at work." This very often elicits the follow-up question: Who are these people and how many of them are affected? This generates the more basic question: If you have a correlation coefficient of r, is there any way to tell how many cases you are talking about? Statisticians know that such correlations can be the result of many different patterns of data, and that these results can be tenuous due to reasons such as leverage (Rousseauw & Leroy, 1987). This can cause a problem particularly when there is a tendency to use such results as the basis for policy decisions. In the spirit of such decisions, we think that investigators should make some effort to consider how many cases they are talking about when they interpret a correlation coefficient. Because it represents the average cross-product between two standardized variables, the correlation coefficient cannot really be pinned down to an exact statement regarding the number of people affected. In certain situations, however, such statements can be made. We feel that investigators can use these situations as a benchmark to help them interpret their results. The criterion most often used to allow statements such as the preceding one to be made is the statistical significance of the correlation between the variables in question. For the experimenter who is interested in the effect of a treatment on an outcome, a significance level or effect size is often sufficient. For the interventionist, a significant correlation between the intervention protocol and the outcome may be interesting, but a follow-up question regarding exactly how many cases are affected by the intervention cannot be easily answered by the correlation coefficient.
2. AN INTERESTING PLOT Given a sample of any size, many different patterns of data for two variables can result in a correlation coefficient of the same size. Because a correlation can be substantially changed by changing the value of even a single data point (Rousseauw & Leroy, 1987), it can be very difficult to interpret. This is espe-
12. Correlation and Categorization under a Matching Hypothesis
235
cially so when an investigator is interested in determining exactly who is contributing to the correlation. Rovine and von Eye (1989) looked at characteristics of correlation coefficients and imagined a specific situation: What if a sample on which two standardized interval level variables were collected could be divided into two subgroups. For each member of Group 1 the values of the two variables matched perfectly (the correlation was 1). For the second subgroup, the correlation was zero. To realize this, we simulated the following data. First, we started out simulating 100 cases with two random variables that were uncorrelated. Second, we computed the actual correlation in the 100 cases (which was close to 0 but not identical to 0 in the sample). Third, we took the first case of the simulated data, changed the value of the second variable to match the value of the first variable, and then recomputed the correlation. Fourth, we took the first two cases and changed the values of the second variable to match the first variable, recomputed the correlation, and so on. A typical result appears in Figure 1. The relationship between the number of matches and the value of the correlation is essentially a straight line. Provided that we start out with two uncorrelated variables and add a number of matches to the data, the correlation coefficient is essentially counting the number of matches in the data set. The result may appear surprising, but it is a direct result of the definition of the Pearson correlation coefficient. As we will show, one can simply manipulate its defining equation to arrive at this result. As it turns out, a similar conclusion
FIGURE 1. Plot of number of exact matches versus the correlation coefficient.
236
Michael J. Rovine and Alexander von Eye
has been suggested by work done with the binomial effect size display described by Rosenthal and Rubin (1982).
3. THE BINOMIAL EFFECT SIZE DISPLAY Rosenthal and Rubin (1982) made a case for an alternative interpretation of the correlation coefficient by defining the binomial effect size display (BESD). They argued that there were situations in which a shared variance interpretation of the coefficient was not giving the correlation its due. Consider Figure 2 which imagines a treatment/control study with a dichotomous outcome (either good or bad). The BESD for this example can be interpreted in the following way: suppose that the relative number of good and bad outcomes for the control group represents what would happen in the absence of a treatment; then the relative number of good and bad outcomes for the experimental group can be used to determine the number of people expected to improve with the treatment. For this data, only 20% (10/50) would be expected to have a good outcome without the treatment. On the other hand, 80% (40/50) are expected to have a good outcome with the treatment. The difference in these expectations, 60% (or .60), is equal to the size of the correlation coefficient. The coefficient is, then, functioning as a count of the number of people improving (or the percentage of people expected to improve) as a result of the treat-
Outcome Good
Bad
Treatment
40
I
10
control
10
I
40
R=.60
Total=lO0
FIGURE 2. The binomial effect size display. If the control group represents what would happen with no treatment, the treatment group could show the number of people expected to improve with the treatment. With the treatment, 30 people would be expected to improve. This represents 60% or .60 of those in the treatment group. That is the size of the correlation.
12. Correlation and Categorization under a Matching Hypothesis
237
ment. This kind of interpretation is appealing to intervention researchers, who may be quite interested in the number of individuals affected by an intervention treatment. With dichotomous treatment and outcome variables, the correlation coefficient can indicate the increase in the proportion of individuals expected to have positive outcomes. Rosenthal and Rubin extended this result to a situation with a continuous outcome; however, one rarely sees the BESD reported in the literature. The question remains whether a similar interpretation could be made for two interval-level variables.
4. AN ORGANIZING PRINCIPLE FOR INTERVAL-LEVEL VARIABLES To come up with an interpretation, we suggest an organizing principle. First, consider two interval variables: One variable can be interpreted like a treatment in that the higher a person is located in the distribution, the more of the "treatment" he or she has received; the other variable is an outcome in that the higher a person is located in that distribution, the better the person's outcome. Then, we might argue that an effective treatment would tend to line up individuals on the outcome in roughly the same rank order that they would appear in the treatment. Consider, for example, whether watching television affects one's behavior. If watching television can be considered a treatment, then the amount of television watched can be considered in some sense the level of the treatment applied. A correlation between the amount watched and some measure of the targeted behavior then tests the effectiveness of the treatment.
5. DEFINITION OF THE MATCHING HYPOTHESIS We determined the straight line relationship between the number of matches and the correlation under the condition that the sample could be divided into two subsamples: one in which the match was perfect and the other in which there was no relationship between the two variables. But because a perfect match would certainly be considered unrealistic, we redefined the match in terms of an interval. This redefinition appears in Figure 3. The definition of a match used up to this point is very specific. For the value on a second variable to match the value on the first, the numbers have to be equal. This requires both equal scaling and the unrealistic (for real data) requirement that the values on two variables are exactly the same. To relax this definition, we redefined a match as in Figure 3. We first locate variable X in its distribution. Variable Y matches variable X when the value of Y is in the same location in its distribution ___ some interval defined as a. In Figure 3, we first
238
Michael J. Rovineand Alexander von Eye 45
.lllmll jlml tmltllltlltttl
19 MHQ
iIg a
52 54 56
87 Self-efficacy
FIGURE 3. Definition of a match for two interval-level variables. If the value on the target variable falls within a specified range, a match has occurred. The center of the target range is defined by the equation Center = [(MHQ- MHQ)/MHQstaev]SELF-EFFstaev + SELF-EFF and the range is range = center +_ a.
locate the center of that target range by first locating where X would appear in the Y distribution and then building the interval. With this definition, one would still expect a relationship between the number of matches and the size of the correlation. However, allowing a match that is not quite as exact would intuitively suggest that the relationship might not be quite so strong. To indicate empirically the degree to which the correlation is still accounting for the number of matches, we present a data example.
6. A DATA EXAMPLE The example represented in the figure shows variables taken from the Penn State Nursing Home Project. One measure, The Penn State Mental Health Caregiving Questionnaire (MHQ; Spore, Smyer, & Cohn, 1991) assessed nursing aides' knowledge of appropriate caregiving practices. As the intervention in the study was geared toward increasing nursing aides' knowledge about caregiving practices, this variable was treated as a proxy for the intervention itself. Because we hoped to increase the nursing aides' confidence in their ability to provide good care, we chose another measure, the Self-Efficacy Survey Instrument (SESI; Chambers, 1992) as a reasonable outcome variable. After an initial wave of data collection in which both of these measures were assessed, an intervention was carried out. Then, a second wave of data was collected. The correlation (and the
12. Correlation and Categorization under a Matching Hypothesis
239
number of matches) between MHQ and SESI prior to the intervention was considered a base line against which to measure the effectiveness of the intervention. Assuming that the intervention was effective, the correlation would be expected to increase; at the same time, a kind of reshuffling should occur in which the location of individuals in the respective distributions of the two variables would tend to be more of a match. The implicit idea of a match is that the movement in the distribution of the second variable will fall within a target range determined by the location of the individual in the distribution of the first variable. For repeated data with a replicated outcome (or a replicated "treatment" and outcome), the matching hypothesis should work, in that a reshuffling of the outcome (the movement of the individual within the outcome distribution) would be reflected in the higher correlation and in the greater number of matches according to the preceding criterion. For the base-line relationship between MHQ and SESI, we calculated the correlation between the two variables. We also determined the number of matches. Matches were counted by determining how many of the outcome variables' values fell within a targeted range (center _+ a), where the Center expresses the location of the treatment variable in the metric of the outcome variable distribution; +_a is the width of the target interval. The number of matches for the Time 1 data appear in Table 1. The correlation was .12. The table shows the number of matches using values of a from .50 (a range of 1) to 2.50 (a range of 5). For a relatively wide range of values of a, the correlation is a pretty good estimator of the number of matches. Not wishing to bank on the effectiveness of our training program, we simulated the Time 2 data adding 18 matches to the data. The matches were added by locating the "treatment' s" corresponding location in the outcome distribution and adding a stochastic term. The results appear in Table 2. Descriptive statistics appear in Table 3. The first thing to notice is that the difference in correla-
TABLE 1 Table of Results for Time 1 Data A
Matches
Mismatches
Percentage
.50 .75 1.00 1.25 1.50 1.75 2.00 2.25 2.50
5 7 10 13 13 14 16 17 19
98 96 93 90 90 89 87 86 84
.05 .07 .10 13 13 14 16 17 18
Note.
Correlation between MHQ and SESI: .12.
240
Michael J. Rovine and Alexander von Eye
TABLE2 Table of Results for Time 2 Data A
Matches
Mismatches
Percentage
.50 .75 1.00 1.25 1.50 1.75 2.00 2.25 2.50
12 15 21 27 27 29 31 34 36
91 88 82 76 76 74 72 69 67
.12 .15 .20 .26 .26 .28 .30 .33 .35
Note. Correlation between MHQ and SESI: .29.
tions is almost equal to the difference in the number of matches. As a approaches the value of the multiplier of the stochastic term, the percentage of matches approaches the value of the correlation. Just as a comparison, we also computed the BESD separately at each occasion after splitting the variables at the median. Even this relatively crude categorization is close to the correlation (see Fig. 4). The correlation coefficient provides a good estimate of the number of matches at both Time 1 and Time 2. The number of matches is believed to be a way of determining what percentage of individuals in an intervention group are affected by the treatment. The difference between the two correlations is essentially the difference in the number of matches for the two occasions. It is interesting that a test of the difference of the two correlations would show that they are not
TABLE3 Descriptive Results Time 1 SESI Mean 66.117 SD 8.443 Median 66.000 Correlation BESD Correlation difference • 100 = 17 BESD difference • 100 = 12
.12 .16
Time 2 MHQ
SESI
MHQ
12.533 3.223 13.000
69.806 8.803 70.955
12.945 3.186 13.252 .29 .28
Note. Time 2 data was simulated by creating 18 additional matches in the Time 1 data and adding a stochastic term along with a small overall linear increase. SD, standard deviation.
12. Correlation and Categorization under a Matching Hypothesis Time 1
Self-Efficacy High Low
High
Time 2
59
MHQ
241
Self-Efficacy High
Low
High
33
18
51
Low
19
33
52
MHQ Low
44
BESD for Time 1 = [36159]-[20144] = .16
BESD for Time 2 = [33151]-[19152] = .28
FIGURE 4. Binomial effect size display for Time 1 and Time 2 data. The BESDs represent the increase of positive outcomes in the High group (the "treatment condition") when contrasted with the number of positive outcomes in the Low group ("the control condition").
significantly different, which is the case despite the fact that they accurately account for the change in the data. This suggests that if matching is a reasonable criterion for determining the effectiveness of a treatment, the correlation coefficient gives an indication of what percentage of the sample could be considered to have been affected by the treatment. As we suggested before, we expect the correlation to act as a count of matches as a direct result of the definition of the Pearson correlation coefficient.
7. CORRELATION AS A COUNT OF MATCHES
Assume that one variable, X, is used to predict another variable, Y. W h e n the correlation is O, the value of one variable tells us nothing about the other. W h e n the correlation is 1 or - 1, the value of one variable perfectly describes the other. Consider a data set that can be divided into two groups. For one group, there is no relationship between the two variables; for the other, we can successfully predict an individual's value on Y from X. We define successful prediction as occurring when there is a one-to-one mapping of X and Y values. We temporarily define a match as occurring when X and Y have identical values. For the sake of simplicity, assume that X and Y have been standardized to Zx and zr, with mean = 0 and standard deviation = 1. The correlation for the z-scores can be expressed as
N ZxiZYi r(zx, zr) = -= N - 1
(2)
242
Michael J. Rovine and Alexander von Eye
Assume that the cases have been reordered so that the first k cases represent matches and the next N - k cases are unrelated (r = 0). The formula can be expressed as k
r=
ZXi Z Yi
~ .= N - 1
N
+
Zgi Z Yi
~ N-1 i=k+l
(3)
Multiplying the first term by k/k and the second term by ( N - k ( N - k - 1) yields
1)/
k ~ Zx~Zri N - k - 1 ~ zx, zr, t ~ . k i=1 N - 1 N-k-li=k+lN-1
r = -
(4)
The constants can be rearranged as follows:
r=
k N-1
(5)
~ Zx~2 + N - k - 1 ~N ZxiZYi k N 1 N k-1 i=1 i=k+l
The first term is k / ( N - 1) times approximately the variance of a standardized variable. The second term is the correlation between two unrelated variables. So, for the situation described previously, k
r=
N-1 k N-1
(1) + 0
(6)
.
(7)
Thus the result approximately equals the proportion of matches in the data set. The preceding result is approximate, in part, because the mean of the subset is not necessarily equal to the mean originally used to standardize the two variables. One can get a more accurate value for the correlation by reexpressing the variance of the matched variable in Equation (5) in raw score form. Thus, r = ~ i--1
{[(xi - xk) - d]/sx~] }2 k
(8)
'
where X = X k + d; X is the mean of the complete group, and X k is the mean of the size k subgroup. This can be expanded to
r=
N-1
i=1
k
- 2
k d2
N-li=
1
k
t k
k
SXk2 N-- 1" (9)
The middle term includes the sum of deviations around a mean and is equal to 0. Equation (9) then becomes
12. Correlation and Categorization under a Matching Hypothesis
k
r=
O-t
N-
1
?. =
k N-1
d2
k
SXk2 N -
(lO)
1
( .2)
This yields
1+
~
243
(11)
"
As d approaches 0, the correlation approaches k / ( N -
1).
8. CORRELATIONAS A COUNT OF HOW MANY FALL WITHIN A SET RANGE The perfect match is not often realistic. Describing a match more in terms of an interval conforms more to the kinds of statements social scientists tend to make. Thus now we reconsider the correlation for the case in which a match is described as an individual's score on variable Y falling within a specific interval A of variable X. Assuming standardized scores, the equation for the correlation between X and Y becomes
~ Zx~(Zx,-k- ai) r(zx, zr) =
+
N - 1 i=1
N
ZxiZYi
~
N-I'
i=k+l
(12)
where l ail <- A. Once again, the second term has an expected value of O. Then, multiplying by k/k, Zx~(Zx~ + ai) N-1
r = ~ i=1
_
~
Zx~(Zx~-t- ai)
i=1
k
k N-1
(~3) (14)
Expanding yields 2
Xi N-1
i=1
k
+
k N-1
~
Zxiai
i=1
k
(15)
and k N-1
t-
k N-1
r(zx, a) =
k N-1
[ 1 + r(zx, a)].
(16)
Equation (16) is approximate in that the variance of the k-subsample on zx may not be exactly 1. Adding the correction given by Equation (11) gives
244
Michael J. Rovine and Alexander von Eye
r =
[ d2 [ .2 1
N-
and k = r(N -
1
1)/
Sxk2
1
+ r ( z x , a)
Sxk2
1 ]
+ r ( z x , a)
(17)
.
(18)
The actual correlation will differ from the proportion of matches by (a) how much the mean of the k-subsample differs from the mean of the whole sample, and (b) by the correlation within the matching interval between the z x variable and the a i. In all, the correlation should still provide a good indication of how many matches occurred. Rovine and von Eye (1993) discuss the relationship between A and the size of the correlation coefficient in greater detail. This also suggests the intuitively reasonable result that as we increase the band in which we are willing to accept two values as matching, we decrease the size of the expected correlation. We would expect to have to increase the size of the band as, for example, the reliability of the target variable decreases. Decreased reliability would increase the variance of the target variable. This would decrease the correlation between the two variables. With more variability in the target, we would have to increase the value of a to see a match.
9. DATA SIMULMION To illustrate the effect of increasing the band width, we again simulated some data. As before, we started with two uncorrelated standardized variables (N = 100). We previously created matches (CORR1) by setting Y equal to X (a = 0). This time, we generated matches by creating a Y variable that was equal to X, plus a stochastic term that was smaller than some specified value a. Once again we ran 100 trials matching the first case on the first trial, the first two on the second trial, and so on. For the second run (CORR2), we selected a = .5; for a third run (CORR3), we picked a = 1. Considering that the variables were standardized, this represents a fairly liberal target range. The correlations resulting from the simulations appear in Table 4. As expected, the correlations for all three runs (a = 0, .5, and 1) increase linearly as the number of matches increases. Also as expected, larger values of a yield smaller correlations as the number of matches increases. A graph of the three sets of correlations against the number of matches appears in Figure 5. The three graphs are essentially linear, with the lines fanning out for different values of a. The two lines where a -- 0 fall somewhat below the exact match line. As a gets larger, the slope of the line decreases, although only slightly, with the greater discrepancies among coefficients occurring at the higher end of the scale. This is as expected from Equation (20). Even when a match is de-
245
12. Correlation and Categorization under a Matching Hypothesis
TABLE 4 Correlations Between X and Y with N Matches N
0 1 2 3 4
5 6 7 8 9 10 11 12 13 14 15
16 17 18 19 20 21 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
CORRl A=O 00 01 02 02 02 03 04 04 03 03 04 05 06 07 08 08 15 15 22 22 21 21 67 67 68 68 69 71 74 75 75 76 77 80 82 82 82 85 85 85 85 85 86 87 87
CORR2 CORR3 A=.5 A = l
00 00 02 02 02 03 04 03 03 03 04 04 06 06 07 07 14 14 21 21 21 21 65 65 66 66 68 70 72 73 73 74 75 77 79 79 80 82 82 82 82 82 83 84 84
00 01
03 04 03 05 06 05 05 05 06 07 10 09 11 11
19 19 25 25 25 24 60 61 62 62 63 65 68 68 68 69 71 73 75 75 75 77 77 77 78 78 78 79 79
N
CORRl A=O
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 89 90 91 92 93 94 95 96 97 98 99 100
21 21 22 25 26 26 26 28 29 29 29 29 37 37 37 41 42 43 49 49 49 49 87 87 87 87 87 89 90 90 90 98 99 100
CORRZ CORR3 Az.5 A = l
21 21 21 24 25 25 26 27 28 29 29 29 36 36 36 40 41 42 48 48 48 48 84 84 84 84
86 86 86 87 87 97 96 96
24 24 24 27 28 28 28 29 30 30 30 29 36 36 36 39 41 41 47 46 46 46 79 79 79 19 80 80 81 81 81 87 89 89
N
CORRl A=O
44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
50 50 50 50 50 51 51 54 56 56 57 58 58 58 59 61 61 62 63 64 64 65
CORRZ CORR3 A=.5 A = l
49 49 49 49 49 50 50 53 55 55 56 56 56 57 57 59 59 60 62 62 62 63
47 47 47 47 47 48 48 51 52 53 53 54 54 54 54 56 56 57 58 58 58 59
Continues.
246 TABLE 4
Michael J. Rovine and Alexander von Eye
Continued Correlations
Match CORR1 CORR2 CORR3
Match
CORR1
CORR2
CORR3
1.000 .993 .992 .991
1.000 .999 .999
1.000 .999
1.000
fined as falling within a relatively wide range of the X variable, the correlation coefficient is still well approximated by the number of matches in the data set.
10. BUILDING UNCERTAINTIES FROM ROUNDING ERROR INTO THE INTERPRETATIONOF A CORRELATION Consider a scale consisting of 10 Likert-type items. If each item has 5 points, assume the rounding error of each item is +_.5. The 10-item scale, then, has a
FIGURE 5. of A.
Plot of number of matches versus the correlation coefficient for three different values
12. Correlation and Categorization under a Matching Hypothesis
! i
1
I i
I i
! i
l n
2
3
4
5
247
I
1.5
FIGURE 6. Relationship between two variables with rounding error. If each item has an uncertainty of _+0.5, then a 10-item scale has an uncertainty of _+5.
total rounding error of _+5. When correlating two of these scales, it would be unreasonable to expect an exact match for an individual even if the true scores were identical. A more realistic picture of the situation is shown in Figure 6. The calculated errors as a result of rounding could be used to determine the expected decrease in the size of the correlation. For any scale, the variability caused by rounding represents an easily identifiable source of measurement error.
11. DISCUSSION Provided that the matching hypothesis makes sense, the correlation coefficient impressively seems to account for the number of matches that occur between a pair of variables. This result has a number of implications for the way in which researchers interpret the coefficient. When the coefficient represents the relationship between a proxy to something like the degree of a treatment and an outcome, the correlation coefficient could have an interpretation similar to a binomial effect size display for two interval-level variables. Considering either a variable value-to-interval relationship or an interval-tointerval relationship as the basis for a correlation suggests that under a matching hypothesis, categorizing a pair of interval-level variables to explore their association may not be such a bad idea. The idea of a degree of treatment as an organizing factor (something that changes the relative rank order of individuals within a distribution) may represent a useful way for intervention researchers to conceptualize the result of certain natural interventions (e.g., television watching). This is especially the case when one considers how accurately the correlation can assess these relative changes using a matching criterion.
248
Michael J. Rovine and Alexander von Eye
It appears that investigators may do well to go beyond the idea of shared variance when interpreting the correlation coefficient. If the matching hypothesis is tenable, it can be used to give a more exact interpretation of what is indicated by the correlation coefficient.
ACKNOWLEDGMENTS This research was supported in part by NIMH MH43373. The authors thank David Andrich and an anonymous reviewer for comments on a previous draft of this chapter.
REFERENCES Chambers, N. (1992). An examination of the role of self-efficacy in nurse assistants. Unpublished manuscript, Pennsylvania State University, University Park. Glass, G. V., & Hopkins, K. D. (1984). Statistical methods in education and psychology. Engelwood Cliffs, NJ: Prentice Hall. Marks, E. (1982). A note on the geometric interpretation of the correlation coefficient. Journal of Educational Statistics, 7(3), 233-237. McNemar, Q. (1962). Psychological statistics (3rd ed.). New York: Wiley. Rosenthal, R., & Rubin, D. B. (1982). A simple general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74(2), 166-169. Rousseauw, E J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York: Wiley. Rovine, M., &von Eye, A. (1989). A reinterpretation of the correlation coefficient: Correlation as a count of matches. In L. Malone & K. Berk (Eds.), Computer science and statistics: Proceedings of the 21 st Symposium of the Interface (pp. 287-291). Alexandria, VA: American Statistical Association. Rovine, M., & von Eye, A. (1993). What a correlation coefficient may mean under certain circumstances. Unpublished manuscript, Pennsylvania State University, University Park. Spore, D., Smyer, M. A., & Cohn, M. D. (1991). Assessing nursing assistants' knowledge of behavioral approaches to mental health problems. The Gerontologist, 31(3), 309-317.
Residualized Categorical Phenotypes and Behavioral Genetic Modeling Scott L. Hershberger
University of Kansas Lawrence, Kansas 66045
1. THE PROBLEM This chapter describes a method for the modeling of multicategoried dependent variables using maximum likelihood (ML) estimation. In this method, the assumption of multivariate normality is not, as is almost always the case for categorical variables, compromised. The domain to which this method will be applied in this chapter is behavioral genetics. Behavioral genetics is concerned with the variance decomposition of a phenotype (trait) into various genetic and environmental effects. Many of the phenotypes studied by behavioral geneticists are polychotomous, and most frequently, dichotomous. For example, an individual is either categorized as schizophrenic or not schizophrenic. At a minimum, the analysis of the either-or states represented by the categorical variable has always been troubled by ascertainment bias, or the nonrandom sampling of
Categorical Variables in Developmental Research: Methods of Analysis Copyright 9 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.
249
250
Scott L. Hershberger
affected individuals and their relatives. Nonrandom sampling leads to biases in the covariances calculated between relatives from the observed sample (N. G. Martin & Wilson, 1982; Neale, Eaves, Kendler, & Hewitt, 1989). With the relatively recent interest in analyzing categorical phenotypes through structural equation methodology, additional difficulties have been introduced. Structural equation modeling easily seduces researchers into treating categorical variables as continuous. Yet, estimation methods in structural equation methodology are traditionally predicated under an assumption of multivariate normality (e.g., Bollen, 1989), hardly an appropriate description of the distribution of an observed, dichotomous variable. As will be described, estimation methods not developed under the assumption of multivariate normality, such as weighted least-squares (WLS), present their own difficulties. The purpose of this chapter is to describe a simple method of transforming the non-normal distribution of a categorical phenotype into a more normal, continuous distribution. Such a method should allow the more appropriate application of normal theory-based estimation methods (e.g., ML), while avoiding the difficulties of non-normal theory-based estimation methods (e.g., WLS).
2. WEIGHTED LEAST-SQUARESESTIMATION The covariance structure of the observed variables is frequently analyzed using Pearson product-moment correlations or covariances. One justification for treating observed categorical variables (y) as if they were continuous rests on the oftentimes implicit assumption that a latent continuous response variable (y*) underlies each y. For a dichotomous y: Y=
<,]
1, if y* > "r '
where "r is a threshold value on the latent continuous variable beyond which the individual is placed in the affirmative category. Several consequences follow from this implicit assumption. First, consider the usual measurement model for the p x 1 vector of y*: y* = )ky'l] -I-- E. In general, y 4: y*, or y4: Xy'l] -~- E Thus, the measurement model for the latent continuous y* does not hold for the observed categorical y. Additionally, the distribution of y* almost certainly differs from that of y; even if y* is multinomial, the distribution of dichotomous y will still be very different. Moreover, E*, the population covariance matrix, will
13. Residualized Categorical Phenotype
251
generally not equal 2s the observed covariance matrix. Thus, if ~(0) represents the covariance structure hypothesis, ~* = ~(0) but E 4: ~(0). If S is a consistent estimator of E and S* is a consistent estimator of ~*, parameters based on S will be inconsistent estimators of the parameters in 0. More simply phrased, the sample values of the parameters will not equal their true population counterparts. One may question how robust the results of normal-theory structural equation modeling are when categorical variables are treated as if they are continuous. Pearson correlation coefficients between categorized variables are generally less than the corresponding continuous variables (Bollen & Barb, 1981; Henry, 1982; J. A. Martin, 1982; Olsson, Drasgow, & Dorans, 1982; Wylie, 1976). Boomsma (1982) and Babakus, Ferguson, and J6reskog (1987) found that with ML estimation, when the observed variables were highly skewed, asymptotic chi-square goodness-of-fit values were inflated. Muthen and Kaplan (1985) also found that extreme kurtoses inflated the value of chi-square. Anderson and Amemiya (1988) and Amemiya and Anderson (1990) report, however, that chisquare was relatively robust against departures from normality if the latent variables were independent. Unfortunately, procedures for determining the independence of latent variables are not available. The number of categories of the variable does not seem to affect chi-square, but does induce correlated errors (Johnson & Creech, 1983) and inaccurate asymptotic standard errors (Boomsma, 1983) when the number of categories is very small. Given the severe distortions that may occur when categorical variables are analyzed under the assumption of multivariate normality, WLS (J6reskog & S6rbom, 1988) or asymptotic distribution-free methods (Bentler, 1989) have been developed that do not require the assumption of multivariate normality. Specifically, the recommended procedure (J6reskog & S6rbom, 1988) for analyzing ordered, observed categorical variables is, first, to invoke a threshold model, as previously described, and compute the tetrachoric or polychoric correlations between the variables, using as an estimate of the threshold "r Ti
i = 1,2 . . . . . k=l- ~
c-
1,
'
where +-1(,) is the inverse of the standardized normal distribution function, N k is the number of cases in the kth category, and c is the total number of categories for y. With a dichotomous variable, N k reduces to N, and only one threshold is computed. Actually, behavioral geneticists have used the threshold model for a number of years, under the term "liability model" (Falconer, 1981), to avoid the biases inherent in Pearson correlations computed between categorical variables. The tetrachoric (or polychoric, if there are more than two categories) are analyzed with a WLS (distribution-free) fitting function, thereby avoiding the
252
Scott L. Hershberger
normal-theory based ML or generalized least-squares fitting functions, and obtaining asymptotically correct chi-square values and standard errors. What at first appears to be a panacea for the analysis of categorical variables is compromised by several difficulties. WLS fitting functions computed on more than several variables are quite complex computationally, requiring the fourthorder moments of the measured variables. When sample size is small or model degrees of freedom are large, WLS performs inaccurately as an estimator, producing biased chi-square values (Chou, Bentler, & Satorra, 1991). In particular, Hu, Bentler, and Kano (1992) found distribution-free estimation to perform "spectacularly badly" in every case of their Monte Carlo study (rejecting true models 93 to 99.9% of the time) when sample size was below 5000. More fundamentally, the phenotype is analyzed as if a continuous normal distribution exists for the phenotype in the population. For some phenotypes, such as schizophrenia, this is a reasonable assumption to make; for others, such as gender (chromosomally defined), it is not. Few behavioral genetic studies consist of sample sizes one-tenth the magnitude of 5000. Thus the problem seems to be the identification of a method of estimation appropriate to categorical data, without resorting to a distributionfree method such as WLS. A satisfactory solution would entail the use of normalbased methods of estimation, given their well-known desirable asymptotic properties. The answer may lie with the exploitation of a procedure that is vital to behavioral genetic analysis: twin studies. Twin correlations are typically corrected for age and gender effects. Because identical and same-sex fraternal twins share the same age and gender, age and gender, if significantly related to the phenotype, will inflate the twin correlations. McGue and Bouchard (1984) found that twin correlations were overestimated consistently without the correction, and, in a majority of cases, attenuated the true magnitude of the genetic effect. Note what occurs if a polychotomous variable is residualized for the effects of the two variables: the polychotomous variable itself becomes continuous. The continuous nature of the residual distribution results from fitting a regression line to the N number of horizontal lines in the regression plane, N equaling the number of categories of the variable. For instance, in the case of a dichotomous variable scored 0, 1, if the regression line is not parallel to the two horizontal lines y -- 0 and y = 1, a continuous residual distribution is created. With the subsequent application of a nonlinear transformation to the residuals, the distribution of the previously categorical variable may very well approximate a normal distribution, allowing greater freedom for the use of normal-theory methods of estimation. In this study, behavioral genetic model-fitting was performed on 14 medication variables, each indicating whether an individual was presently taking the medication. Two-category variables were selected rather than multiple-category variables because of the greater frequency of the former in behavioral genetics and for greater simplicity of presentation. The method is equally applicable to
13. ResidualizedCategorical Phenotype
253
polychotomous variables. The correlations for four twins groups served as the data: monozygotic (identical) twins reared together (MZT) and apart (MZA), and dizygotic (fraternal) twins reared together (DZT) and apart (DZA). Structural equation modeling was performed under two conditions. In the first condition, a WLS analysis was performed on the variables, directly incorporating the covariates of age and gender into the model. In the second condition, ML estimation was performed on the nonlinearly transformed age and gender residualized variables. Two questions are of interest: how well does the nonlinearly transformed age and gender residualized variable approximate normality, and how comparable are the WLS and ML solutions?
3. PROPORTIONAL EFFECTS GENOTYPE-ENVIRONMENT CORRELATION MODEL

In this section, the model fit in this study is described in some detail. The behavioral genetic model is here referred to as a proportional effects genotype-environment correlation (CovGE) model. The parameter of greatest interest in this model is CovGE, representing the linear association between genotypes and environments. CovGE, with the exception of its "passive" form (e.g., Neale & Cardon, 1992), is rarely estimated. The primary reason for the restriction in the specification of CovGE lies with the predominant use of twin data, data for which identification problems occur when specifying CovGE. To see why it is difficult to identify CovGE in twin data, first define the phenotypic score of an individual (P_1) as

P_1 = E_{n1} + G_{a1} + E_{s1},
where E_n = nonshared environmental effects, G_a = additive genetic effects, and E_s = rearing environmental effects. If each of the terms in the preceding equation is a deviation score, then the variance of P_1, V_{P1}, can be expressed as

V_{P1} = V_{En1} + V_{Ga1} + V_{Es1} + Cov_{Ga1,Es1} + Cov_{Es1,Ga1}
or, because Cov_{Ga1,Es1} = Cov_{Es1,Ga1},

V_{P1} = V_{En1} + V_{Ga1} + V_{Es1} + 2Cov_{Ga1,Es1}.
According to these equations, the variance in a phenotype can be partitioned into sources attributable to nonshared environmental effects, additive genetic effects, rearing environmental effects, and the correlation between additive genetic and rearing environmental effects. Two terms are not present in these equations. First, it is assumed genotype-environment interaction is not important for the phenotype. Second,
no term representing the correlation between nonshared environmental effects and additive genetic effects has been included. Based on the definition of nonshared environmental effects as those environmental influences on a person that are uncorrelated with the environmental influences of another person, a correlation between nonshared environmental effects and additive genetic effects cannot be stipulated. Such a correlation would incorrectly imply a correlation between the nonshared environmental effects of two genetically related individuals. For univariate models of CovGE, only rearing environmental effects are allowed to correlate with genetic effects, whether the genetic effects are additive or nonadditive. The term Cov_{Ga1,Es1} refers to the correlation of genes and environments within one person. If the correlation between additive genetic effects and rearing environmental effects is specified for the covariance between relatives, the form of the correlation changes. For example, the covariance between MZTs can be specified as

Cov_{MZT1,MZT2} = Cov_{Ga1,Ga2} + Cov_{Es1,Es2} + Cov_{Ga1,Es2} + Cov_{Ga2,Es1},
or, assuming Cov_{Ga1,Es2} = Cov_{Ga2,Es1},

Cov_{MZT} = V_{Ga12} + V_{Es12} + 2Cov_{Ga1,Es2}.
In contrast to the correlation between additive genetic effects and rearing environmental effects (Cov_{Ga1,Es1}) that contributes to the sample phenotypic variance of a trait, Cov_{Ga1,Es2} refers to the correlation between one twin's genotype and the other twin's environment. This difference may be expressed as the difference between a "within-persons CovGE" and a "between-persons CovGE." The covariances between DZTs, MZAs, and DZAs can then be specified as

Cov_{DZT} = .5V_{Ga12} + V_{Es12} + 2Cov_{Ga1,Es2}
Cov_{MZA} = V_{Ga12} + 2Cov_{Ga1,Es2}
Cov_{DZA} = .5V_{Ga12} + 2Cov_{Ga1,Es2}.
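As a small numeric illustration of these equations, the expected phenotypic variance and the four expected twin covariances can be computed directly. The component values below are made up for the example; they are not estimates from this chapter.

# Made-up values, for illustration only, plugged into the equations above.
v_en1, v_ga1, v_es1 = 0.30, 0.40, 0.20   # nonshared env., additive genetic, rearing env.
cov_within = 0.05                        # Cov(Ga1, Es1): within-persons CovGE
v_ga12, v_es12 = 0.40, 0.20              # cross-twin genetic / rearing-environment terms
cov_between = 0.05                       # Cov(Ga1, Es2): between-persons CovGE

v_p1    = v_en1 + v_ga1 + v_es1 + 2 * cov_within
cov_mzt = v_ga12 + v_es12 + 2 * cov_between
cov_dzt = 0.5 * v_ga12 + v_es12 + 2 * cov_between
cov_mza = v_ga12 + 2 * cov_between
cov_dza = 0.5 * v_ga12 + 2 * cov_between
print(v_p1, cov_mzt, cov_dzt, cov_mza, cov_dza)   # 1.0 0.7 0.5 0.5 0.3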
Now, each of the four terms in this series of five equations (one equation for the variance of the phenotype, four for the twin covariances) is identifiable. But, as Thomas (1975), Jensen (1976), Goldberger (1977), and Jaspers and de Leeuw (1980) have pointed out, to equate Cov_{Ga1,Es2} between reared together and reared apart twins is misguided. Jensen (1976) conceives of Cov_{Ga1,Es2} as arising from two mechanisms: one by which the rearing environment reacts in a particular way to an individual based on the genotype (which both reared together and reared apart twins will share) and one by which the rearing environment is imposed on the individual regardless of the individual's genotype (which reared together twins will share, but not reared apart twins). In the latter case, an important element of the between-persons CovGE will be missing for reared apart twins.
Now, one could simply correct the false equivalence of Cov_{Ga1,Es2} across the four twin groups by defining separate CovGE terms for the reared apart and reared together twins, but herein lies the identification problem: such a model, with five unknowns and five equations, is underidentified. Indeed, it was this identification problem that led Thomas (1975), Jensen (1976), and others to use an approach to model-fitting that depended on the arbitrary designation of some parameter values as permissible and others as not. Further contributing to the difficulty is Goldberger's (1977) legitimate remark that it is questionable even to specify the same CovGE term for MZTs and DZTs: a greater between-persons CovGE would be expected for persons of greater genetic relatedness.

The solution to the identification problem in twin data taken in this study is as follows. First, a critical element, the imposed rearing environment, common to reared together twins, is missing from reared apart twins. This alone will diminish Cov_{Ga1,Es2}. But in addition, if reared apart twins have been placed into truly uncorrelated environments, then similarity in opportunities for the selection of environments, and similarity in evoking reactions from the environment, will also be attenuated. Further contributing to the lack of similarity of environmental selection and evocation is the imperfect freedom adults have in choosing their environments. Thus, there is no great compromise in accepting Cov_{Ga1,Es2} as zero for reared apart twins. Again, if Cov_{Ga1,Es2} consists of two elements, one derived through imposition and the other through selection, and if the former is entirely missing from, and the latter greatly diminished in its contribution to, the covariance between reared apart twins, Cov_{Ga1,Es2} for reared apart twins must be minute, if extant at all.

For MZTs and DZTs, the problem is one of defining nonequivalent CovGE terms for the two groups. For MZTs, one simply equates Cov_{Ga1,Es2} to Cov_{Ga1,Es1}, the within-persons CovGE: one would expect, for genetically identical individuals, the between-persons correlation to equal the within-persons correlation. For DZTs, if the difference between the magnitudes of Cov_{Ga1,Es1} and Cov_{Ga1,Es2} is caused by imperfectly correlated genotypes (the imposition part of CovGE is the same as for MZTs, but the selection/evocation part varies as a function of genotypic similarity), then Cov_{Ga1,Es2} = 1/2 Cov_{Ga1,Es1}. Now, the univariate twin model of genotype-environment correlation can be defined as

V_P = V_{En1} + V_{Ga1} + V_{Es1} + 2Cov_{Ga1,Es1}
Cov_{MZT} = V_{Ga1} + V_{Es1} + 2Cov_{Ga1,Es2}
Cov_{DZT} = .5V_{Ga1} + V_{Es1} + Cov_{Ga1,Es2}
Cov_{MZA} = V_{Ga1}
Cov_{DZA} = .5V_{Ga1}.

This univariate model, which I label a "proportional effects" model after the proportional relationship between the MZT and DZT CovGEs, is therefore
identified. When nonadditive genetic effects (G_d) are incorporated in the model instead of additive genetic effects, the relationship between the MZT CovGE term and the DZT CovGE term, respectively, will be

Cov_{Gd1,Es2} = 1/4 Cov_{Gd1,Es1}.
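A quick way to see that the additive proportional effects model is identified is to solve its five equations symbolically. The sketch below is an illustration under assumptions rather than the chapter's own procedure: it generates "observed" moments from hypothetical parameter values (chosen only for the example) and recovers them uniquely with sympy.

# A sketch of the identification argument: five model equations, four unknowns,
# solved symbolically from moments implied by hypothetical parameter values.
import sympy as sp

v_en, v_ga, v_es, c = sp.symbols('V_En V_Ga V_Es CovGE', real=True)

# Hypothetical true values used only to create the observed moments.
true = {v_en: sp.Rational(3, 10), v_ga: sp.Rational(2, 5),
        v_es: sp.Rational(1, 5), c: sp.Rational(1, 20)}

model = {
    'V_P':     v_en + v_ga + v_es + 2 * c,
    'Cov_MZT': v_ga + v_es + 2 * c,                   # between-persons CovGE = within-persons CovGE
    'Cov_DZT': sp.Rational(1, 2) * v_ga + v_es + c,   # between-persons CovGE halved for DZTs
    'Cov_MZA': v_ga,
    'Cov_DZA': sp.Rational(1, 2) * v_ga,
}
observed = {name: expr.subs(true) for name, expr in model.items()}

equations = [sp.Eq(expr, observed[name]) for name, expr in model.items()]
solution = sp.solve(equations, [v_en, v_ga, v_es, c], dict=True)
print(solution)   # a single solution, equal to the hypothetical true values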
In this study, it is not possible to have both additive and nonadditive genetic effects in the model simultaneously; thus, after testing both genetic parameters, the parameter of greater significance is retained for the four-parameter model. Figure 1 presents a path diagram of the proportional effects CovGE model. The model presented in Figure 1 assumes that the twin correlations have already been corrected for the effects of age and gender. Rather than partial the covariates from the twin correlations before fitting the model, the covariates may be directly incorporated into the model, as shown in Figure 2. The variance of the phenotype, under additive genetic effects, is then

V_P = V_{En1} + V_{Ga1} + V_{Es1} + 2Cov_{Ga1,Es1} + age^2 + gender^2.
In this chapter, the parameter of primary interest is CovGE and results are reported in terms of CovGE.
4. METHODS

4.1. Subjects

The data for this study were obtained from the Swedish Adoption/Twin Study of Aging (SATSA). The twins were identified from the Swedish Twin Registry, which includes nearly 25,000 pairs of same-sex twins born in Sweden between
FIGURE 1. Proportional effects genotype-environment correlation (CovGE) model; α = within-persons CovGE, β = between-persons CovGE.
FIGURE 2. Proportional effects genotype-environment correlation (CovGE) model, including the covariates of age and gender; α = within-persons CovGE, β = between-persons CovGE.
1886 and 1958 (Cederlöf & Lorich, 1978). The twin registry was completed during two periods, the early 1960s and the early 1970s. As part of the questionnaire sent to the twins during compilation of the registry, information was requested concerning the age at which the twins were separated. Twins were judged to be reared apart if they were separated before 10 years of age. For the selection of the SATSA twin sample, the twins reared apart were matched to the twins reared together on the basis of age, gender, and county of birth. The average age of separation for the twins reared apart was 2.8 years; 47.7% were separated during their first year of life, 63.9% were separated by their second birthday, and 81.8% were separated by their fifth birthday (Pedersen et al., 1984). The average age of the twins was 58.6 years at the time of testing, ranging from 26 years to 87 years. Forty percent of the sample was male, and 60% was female. Initially, zygosity status was based on reported physical similarity across a number of traits. During a later health interview, approximately 50% of the pairs who were tested earlier also had their zygosity status determined by comparison of serum proteins and red cell enzymes from blood assays. Zygosity was changed for 8% of the sample of twin pairs on the basis of these assay results (Pedersen et al., 1991).
4.2. Measures

As part of an extensive battery of inventories, respondents were asked whether they had taken any one of 26 different medications during the last month. Because of an extremely low endorsement rate for 12 of the medications, the responses to 14 of the 26 medications were retained for this study. These 14 medications included blood pressure medications, cough medicines, diuretics, heart medications, herbal medications, iron supplements, laxatives, nitroglycerin tablets, nonprescription pain killers, prescription pain killers, skin medications, stomach medications, tranquilizers, and vitamins. Responses were scored 0, did not take the medication, or 1, did take the medication.
4.3. Procedure

The following steps were performed to analyze the data for this study. First, coefficients of skewness and kurtosis (D'Agostino, 1986) were calculated for each of the variables using PRELIS 2 (Jöreskog & Sörbom, 1993b). These variables were then residualized for the effects of age and gender, separately by twin group, by regressing the medication variable onto the covariates of age and gender using logistic regression, a regression technique appropriate for categorical dependent variables (Hosmer & Lemeshow, 1989). The residuals from the regression analyses were saved, and coefficients of skewness and kurtosis were computed for the residualized variables. A series of nonlinear transformations was then applied to the residualized variables: raising the variable to the second power, raising the variable to the third power, taking the square root, taking the log to the base 10, taking the reciprocal, taking the square root of the reciprocal, and computing the logit transformation (i.e., (1/2)ln[p/(1 - p)], where p is the value of the residualized variable). These transformations were selected for their typicality in the behavioral/social sciences. The transformation that maximized the value of the Shapiro-Wilk test of normality was selected, and coefficients of skewness and kurtosis were again computed. Twin correlations were computed using PRELIS 2 for the model-fitting analyses, which were conducted with LISREL 8 (Jöreskog & Sörbom, 1993a). For the WLS estimation, the tetrachoric correlation between the twins was calculated, as well as the tetrachoric and polyserial correlations between the medication, age, and gender variables. For the ML estimation, Pearson product-moment correlations were calculated for the twins, the twin data already having been corrected for the effects of age and gender.
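As an illustration of the transformation-selection step, the sketch below applies each candidate transformation and keeps the one with the largest Shapiro-Wilk statistic. It is a sketch under assumptions, not the PRELIS/LISREL workflow actually used: the positive shift and the rescaling used for the logit are devices of the sketch, since the chapter does not say how domain restrictions (negative residuals, zeros) were handled.

# Apply candidate nonlinear transformations and keep the one whose distribution
# best approximates normality according to the Shapiro-Wilk W statistic.
import numpy as np
from scipy import stats

def candidate_transforms(x):
    eps = 1e-6
    z = x - x.min() + eps                                  # shift so roots, logs, reciprocals are defined
    p = np.clip((x - x.min()) / (x.max() - x.min()), eps, 1 - eps)  # rescale to (0, 1) for the logit
    return {
        'square':          x ** 2,
        'cube':            x ** 3,
        'square root':     np.sqrt(z),
        'log10':           np.log10(z),
        'reciprocal':      1.0 / z,
        'reciprocal sqrt': 1.0 / np.sqrt(z),
        'logit':           0.5 * np.log(p / (1 - p)),
    }

def best_transform(x):
    scored = {name: stats.shapiro(t)[0] for name, t in candidate_transforms(x).items()}
    winner = max(scored, key=scored.get)
    return winner, scored[winner]

rng = np.random.default_rng(1)
residuals = rng.beta(0.5, 4.0, size=1400) - 0.1            # skewed stand-in for residuals
print(best_transform(residuals))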
5. RESULTS

Table 1 shows the skewness and kurtosis coefficients computed for each of the medication variables before residualization, after residualization, and after nonlinear transformation of the residualized variables.
TABLE 1  Skewness and Kurtosis of Measures Before and After Residualization and Transformation

                               Before residualization   After residualization    After transformation
Measure                        Skewness    Kurtosis     Skewness    Kurtosis     Skewness    Kurtosis     Transformation   N
Blood pressure medication      1.58***     .50***       1.43***     .38**        .19***      -.73***      Log10            1400
Cough medicine                 2.34***     3.49***      2.33***     3.47***      -1.41***    1.30***      Reciprocal       1394
Diuretic                       2.62***     4.86***      2.43***     4.43***      .28***      -.18         Log10            1402
Heart medication               3.21***     8.28***      2.98***     7.50***      .17***      -.12         Log10            1398
Herbal medication              3.39***     9.46***      3.26***     8.96***      4.60***     4.60***      Log10            1394
Iron                           3.84***     12.72***     3.81***     12.60***     -1.42***    3.07***      Reciprocal       1394
Laxative                       3.43***     9.74***      3.26***     9.08***      .74***      .72***       Log10            1392
Nitroglycerin                  3.65***     11.32***     3.45***     10.48***     .37***      .37***       Log10            1396
Nonprescription pain killer    1.50        .26*         1.43***     .25*         -.16**      -.73***      Reciprocal       1382
Prescription pain killer       1.98***     1.92***      1.94***     1.86***      -.77***     .04          Reciprocal       1390
Skin medication                2.02***     2.06***      2.00***     2.03***      -4.64***    -.21         Reciprocal       1392
Stomach medication             2.85***     6.12***      2.82***     6.03***      -1.47***    2.10***      Reciprocal       1398
Tranquilizer                   3.35***     9.21***      3.27***     8.91***      -.10        -.07         Reciprocal       1398
Vitamin                        1.51***     .26*         1.43***     .21*         -.32***     -1.05***     Reciprocal       1398

Note. Seven of the eight reciprocal transformations were reciprocal square roots (square root of the reciprocal).
*p < .05. **p < .01. ***p < .001.
Before residualization, the skewness and kurtosis coefficients of nearly all of the variables were large and significant. After removing the effects of age and gender, nearly all of the variables still showed significant skewness and kurtosis, but the absolute values of the two coefficients decreased notably. Finally, application of a nonlinear transformation to each of the variables resulted in a dramatic decrease in the magnitude of skewness and kurtosis; although many (but fewer) coefficients are significant, the significance can in part be attributed to the large sample size. Therefore, the successive procedures of residualizing the dichotomous variables and then nonlinearly transforming their distributions have shifted the distributions substantially toward normality. The distribution of each variable before residualization, after residualization, and after nonlinear transformation of the residualized variable is shown in Figure 3. Most striking are the differences between the "residualized distribution" and the "transformed, residualized distribution." Although the residualization process introduces continuity where none existed before in the original distribution, the shifting of this continuity toward a normal distribution occurs more clearly after nonlinear transformation.

As shown in Table 2, three correlations were computed for each of the four twin types for each medication variable: the tetrachoric correlation following the removal of the age and gender effects (TE); the Pearson product-moment correlation following the removal of age and gender effects (PM Before); and the Pearson product-moment correlation following the nonlinear transformation of the age- and gender-corrected variable (PM After). Comparing the three types of correlations, it is obvious that the former two (TE and PM Before) are closer in magnitude to each other than either is to the PM After correlation. For most of the variables, the nonlinearly transformed data resulted in quite substantial twin correlations. Whether this substantial increase alters the latent variable structure of the observed variables (in particular CovGE) can be determined by the behavioral genetic model-fitting.

Tables 3 and 4 permit a comparison of the results of applying WLS estimation to the twin correlations, allowing the covariates of age and gender to enter the model directly, with the results of ML estimation on the nonlinearly transformed, residualized twin correlations. Differences occur in both the significance and the nature of CovGE across the two estimation methods. CovGE was significant for 6 variables under WLS and for 13 variables under ML; each of the 6 variables found significant under WLS was among the 13 found to be significant under ML. Note the smaller standard errors obtained with ML. In addition, 5 of the variables under WLS provided evidence of significant nonadditive genetic effects for CovGE; all 14 of the variables under ML provided evidence of only additive genetic effects for CovGE. The detection of additivity for CovGE, as opposed to the detection of nonadditivity, is more reasonable,
FIGURE 3. Frequency distributions for the medication variables.
TABLE 2  Tetrachoric (TE) and Pearson Product-Moment (PM) Correlations Before and After Transformation

Blood pressure medication     MZT    DZT    MZA    DZA
  TE                          .79    .41    .73    .33
  PM Before                   .67    .35    .56    .38
  PM After                    .90    .66    .76    .66
  N                           159    214     96    231

Cough medicine                MZT    DZT    MZA    DZA
  TE                          .27    .04    .30    .12
  PM Before                   .14    .04    .15    .05
  PM After                    .43    .54    .22    .05
  N                           157    213     95    232

Diuretic                      MZT    DZT    MZA    DZA
  TE                          .65    .40    .47    .22
  PM Before                   .50    .31    .42    .29
  PM After                    .86    .69    .85    .75
  N                           158    213     97    233

Heart medication              MZT    DZT    MZA    DZA
  TE                          .21    .16    .65    .24
  PM Before                   .35    .17    .39    .27
  PM After                    .90    .78    .87    .78
  N                           157    214     96    232

Herbal medication             MZT    DZT    MZA    DZA
  TE                          .65    .64    .36    .22
  PM Before                   .43    .36    .15    .15
  PM After                    .82    .81    .42    .67
  N                           156    213     97    231

Iron                          MZT    DZT    MZA    DZA
  TE                          .55    .25    .92    .18
  PM Before                   .26    .08    .67    .10
  PM After                    .45    .35    .87    .65
  N                           157    211     96    233

Laxative                      MZT    DZT    MZA    DZA
  TE                          .25    .50    .92    .18
  PM Before                   .10    .33    .51    .15
  PM After                    .43    .73    .80    .86
  N                           156    212     96    233

Nitroglycerin                 MZT    DZT    MZA    DZA
  TE                          .25    .20    .54    -.03
  PM Before                   .29    .19    .28    .10
  PM After                    .91    .83    .87    .63
  N                           158    213     96    231

Nonprescription pain killer   MZT    DZT    MZA    DZA
  TE                          .28    .25    .49    .34
  PM Before                   .29    .17    .43    .30
  PM After                    .54    .34    .81    .60
  N                           156    212     97    226

Prescription pain killer      MZT    DZT    MZA    DZA
  TE                          .60    .17    .40    .29
  PM Before                   .41    .14    .21    .18
  PM After                    .65    .62    .36    .43
  N                           158    211     97    229

Skin medication               MZT    DZT    MZA    DZA
  TE                          .44    -.01   .52    .03
  PM Before                   .29    .02    .28    .04
  PM After                    .70    .27    .62    .30
  N                           157    212     94    233

Stomach medication            MZT    DZT    MZA    DZA
  TE                          .25    .15    .22    .03
  PM Before                   .20    .08    .12    .02
  PM After                    .64    .39    .73    .20
  N                           157    214     96    232

Tranquilizer                  MZT    DZT    MZA    DZA
  TE                          .38    .42    .36    .62
  PM Before                   .17    .19    .21    .39
  PM After                    .70    .60    .82    .92
  N                           158    213     96    232

Vitamin                       MZT    DZT    MZA    DZA
  TE                          .44    .36    .45    .17
  PM Before                   .29    .29    .37    .18
  PM After                    .42    .58    .70    .49
  N                           157    214     96    232
TABLE 3  Weighted Least-Squares Parameter Estimates of Genotype-Environment Correlation (CovGE)

Medication (row order): Blood pressure, Cough medicine, Diuretic, Heart medication, Herbal medication, Iron, Laxative, Nitroglycerin, Nonprescription pain killer, Prescription pain killer, Skin medication, Stomach medicine, Tranquilizer, Vitamin
COVGaEs
SE
1.00 -.49 -.57
3.26 28.82 1.02
-.29
.12
- 2.41"
-.91
.06
-15.17***
-.81
.45
.53 -.83 .38
1.82 .06 .25
z-value
COVGdEs
SE
z-value
.31 -.02 - 1.02 .66
.08
8.25***
.45
.04
11.25***
.53
.10
5.30***
.03 1.00
.62 12.47
-1.80 .05 .08
.29 -13.83*** 1.52
700 697 701 699 697 697 697 698 691 695 696 699 699 699
*p<.05. **p<.01. ***p<.001.
TABLE 4  Maximum Likelihood Parameter Estimates of Genotype-Environment Correlation (CovGE)

Medication                      CovGaEs    SE       z-value       N
Blood pressure                    -.27     .03      -9.00***      700
Cough medicine                     .74     .16       4.63***      697
Diuretic                          -.39     .03     -13.00***      701
Heart medication                  -.42     .02     -21.00***      699
Herbal medication                 -.42     .03     -14.00***      697
Iron                               .67     .06      11.17***      697
Laxative                          -.78     .04     -19.50***      697
Nitroglycerin                     -.44     .02     -22.00***      698
Nonprescription pain killer        .65     .11       5.91***      691
Prescription pain killer          -.39     .08      -4.88***      695
Skin medication                    .42    1.24        .34         696
Stomach medication                 .29     .12       2.42*        699
Tranquilizer                      -.51     .04     -12.75***      699
Vitamin                            .72     .05      14.40***      699

Note. No CovGdEs (nonadditive) estimates are reported; additive genetic effects were retained for all 14 medications.
*p < .05. **p < .01. ***p < .001.
given the priority of additive effects over nonadditive effects for many phenotypes. Therefore, in terms of significance and reasonableness, the ML solution using residualized, continuous phenotypes is more desirable than the WLS solution using categorical phenotypes.
6. CONCLUSIONS

At this point, it is important to emphasize the purpose of this study. The purpose was not to compare the relative value of WLS and ML estimation for the detection of CovGE underlying the medication variables; little dispute exists as to the greater theoretical justification for the use of WLS with dichotomous data. Rather, the goal was to examine whether a reasonable solution could be obtained by using ML estimation after transforming the variable's distribution to approximate normality as closely as possible. It appears that the ML solution works as well as, and in fact better than, the WLS solution for the detection of CovGE.

A critical issue to consider is the interpretation of the normalized residuals analyzed by ML. This issue is important because the method described in this chapter may be misperceived as a statistical trick, resulting in a variable that has been robbed of interpretability. The correlations between the twins' medication usages, corrected for the effects of age and gender, refer to the usage of these medications without regard to age or gender. Correction, or residualization, in this chapter retains its usual meaning. If "ageless" or "genderless" effects are meaningful to the research, then the residualization process is viable. If not, then, as in any other statistical context, covariates are to be avoided. The practice of removing age and gender effects from phenotypes in behavioral genetic analyses is commonplace, if not oftentimes required. At the same time, one can also exploit the necessity of residualization by taking advantage of normal theory-based methods of estimation.

There are, of course, limitations to the results reported in this chapter and to the adaptability of the residualization method to domains other than behavioral genetics. First, the population value of CovGE for each of the medication variables is unknown; thus there is no sure way of assessing which solution, ML or WLS, is the better one. What is known is that if CovGE is significant, ML was more powerful than WLS in detecting its presence. Conversely, if CovGE is largely nonsignificant, then the results of WLS are the more valid. Another point to be made concerns the very high twin correlations, computed from the nonlinearly transformed data, reported in Table 2. Indeed, some MZT correlations (for example, the correlation of .91 for nitroglycerin) even exceed the MZT correlations typically found for the marker behavioral genetic variables of height and weight. Greater credibility could possibly be attached to correlations of more reasonable magnitude. Furthermore, only one dependent
variable was analyzed at a time; thus, the specific question of multivariate normality was not addressed. What is required is a simulation study in which multivariate data are created with a known CovGE structure. These data may then be categorized, and their residualization by one or more covariates performed, before conducting a series of ML analyses. These data would also be analyzed in their original continuous form by WLS, incorporating the covariates into the model. The WLS analyses should also be done without the presence of the covariates to ensure the retrievability of the correct CovGE values. The extent to which the ML estimates of CovGE match their true values would indicate the value of residualization. One would also want to compare the ML solution using the residualized phenotype with an ML solution using the categorized phenotype (the covariates incorporated directly in the model) to assess whether the continuity created through residualization confers any benefit over analyzing the categorical phenotype with ML.

The appropriateness of using normal theory-based estimation methods relies on more than multivariate normality; also required is a "large" sample size. Ironically, however, WLS estimation has appeared more vulnerable to the effects of sample size than ML (e.g., Hu et al., 1992). Without a very large sample, the use of ML over WLS may be preferred even with categorical variables, particularly if there is reason to assume independence of the latent variables in the population. In addition, not every study will have covariates defined a priori or will be able to identify potentially important covariates. If such is the case, it would be highly inappropriate to use a variable without any theoretical association to the dependent variable as a covariate purely to induce continuity in a categorical variable. The covariates selected should also be highly reliable, on the order of variables such as age and gender. The spurious results often obtained from using unreliable covariates are well documented (e.g., Huitema, 1980).

In conclusion, the results of the study suggest that the procedure of residualizing and nonlinearly transforming a variable that was originally categorical, followed by ML estimation, is reasonable. This is particularly true if sample size is small or moderate or, more generally, if one suspects the data will not meet the assumptions underlying WLS estimation.
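As a rough sketch of the kind of simulation called for above, one might proceed as follows. All of the design choices here, from the effect sizes and the dichotomization threshold to the use of plain Pearson correlations on the residuals, are assumptions of the sketch rather than a description of this chapter's analyses.

# Simulate twin pairs with a known latent correlation plus age/gender effects,
# dichotomize, residualize the dichotomous scores, and compare recovered
# correlations with the known latent value.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n_pairs, latent_r = 1000, 0.60

latent = rng.multivariate_normal([0, 0], [[1, latent_r], [latent_r, 1]], n_pairs)
age = rng.uniform(26, 87, n_pairs)
gender = rng.integers(0, 2, n_pairs)
liability = latent + (0.02 * age + 0.4 * gender)[:, None]

observed = (liability > np.quantile(liability, 0.8)).astype(int)   # dichotomize

# Residualize each twin's dichotomous score on age and gender.
X = sm.add_constant(np.column_stack([age, gender]))
resid = np.column_stack(
    [observed[:, j] - sm.Logit(observed[:, j], X).fit(disp=0).predict(X) for j in range(2)]
)

print('true latent r:', latent_r)
print('Pearson r of dichotomized scores:', np.corrcoef(observed.T)[0, 1].round(3))
print('Pearson r of residualized scores:', np.corrcoef(resid.T)[0, 1].round(3))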
ACKNOWLEDGMENTS

The Swedish Adoption/Twin Study of Aging (SATSA) is an ongoing study conducted at the Department of Epidemiology of the Institute for Environmental Medicine, Karolinska Institute, Stockholm, Sweden, in collaboration with the Center for Developmental and Health Genetics at The Pennsylvania State University. SATSA is supported in part by grants from the National Institute on Aging (AG-04563) and from the John D. and Catherine T. MacArthur Foundation Research Network on Successful Aging.
REFERENCES

Amemiya, Y., & Anderson, T. W. (1990). Asymptotic chi-square tests for a large class of factor analysis models. Annals of Statistics, 18, 1453-1463.
Anderson, T. W., & Amemiya, Y. (1988). The asymptotic normal distribution of estimators in factor analysis under general conditions. Annals of Statistics, 16, 759-771.
Babakus, E., Ferguson, C. E., & Jöreskog, K. G. (1987). The sensitivity of confirmatory maximum likelihood factor analysis to violations of measurement scales and distributional assumptions. Journal of Marketing Research, 24, 222-228.
Bentler, P. M. (1989). EQS structural equations program manual. Los Angeles: BMDP Statistical Software.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Bollen, K. A., & Barb, K. (1981). Pearson's R and coarsely categorized measures. American Sociological Review, 46, 232-239.
Boomsma, A. (1982). The robustness of LISREL against small sample sizes in factor analysis models. In K. G. Jöreskog & H. Wold (Eds.), Systems under indirect observation (Part I, pp. 149-173). Amsterdam: North-Holland.
Boomsma, A. (1983). On the robustness of LISREL (maximum likelihood estimation) against small sample size and nonnormality. Unpublished doctoral dissertation, University of Groningen, The Netherlands.
Cederlöf, R., & Lorich, U. (1978). The Swedish twin registry. In W. E. Nance, G. Allen, & P. Parisi (Eds.), Twin research: Part C. Biology and epidemiology (pp. 189-195). New York: Alan R. Liss.
Chou, C. P., Bentler, P. M., & Satorra, A. (1991). Scaled test statistics and robust standard errors for nonnormal data in covariance structure analysis: A Monte Carlo study. British Journal of Mathematical and Statistical Psychology, 44, 347-357.
D'Agostino, R. B. (1986). Tests for the normal distribution. In R. B. D'Agostino & M. A. Stephens (Eds.), Goodness-of-fit techniques (pp. 367-419). New York: Dekker.
Falconer, D. S. (1981). Introduction to quantitative genetics. New York: Wiley.
Goldberger, A. S. (1977). On Thomas's model for kinship correlations. Psychological Bulletin, 84, 1239-1244.
Henry, F. (1982). Multivariate analysis and ordinal data. American Sociological Review, 47, 299-307.
Hosmer, D. W., & Lemeshow, S. (1989). Applied logistic regression. New York: Wiley.
Hu, L., Bentler, P. M., & Kano, Y. (1992). Can test statistics in covariance structure analysis be trusted? Psychological Bulletin, 112, 351-362.
Huitema, B. E. (1980). The analysis of covariance and its alternatives. New York: Wiley.
Jaspers, J. M. F., & de Leeuw, J. A. (1980). Genetic-environmental covariation in human behaviour genetics. In L. J. T. van der Kamp, W. F. Langerak, & D. N. M. de Gruijter (Eds.), Psychometrics for educational debates (pp. 37-72). New York: Wiley.
Jensen, A. R. (1976). The problem of genotype-environment correlation in the estimation of heritability from monozygotic and dizygotic twins. Acta Geneticae Medicae et Gemellologiae, 25, 86-99.
Johnson, D. R., & Creech, J. C. (1983). Ordinal measures in multiple indicator models: A simulation study of categorization error. American Sociological Review, 48, 398-407.
Jöreskog, K. G., & Sörbom, D. (1988). LISREL 7: A guide to program and applications. Chicago: SPSS.
Jöreskog, K. G., & Sörbom, D. (1993a). LISREL 8 user's reference guide. Chicago: Scientific Software.
Jöreskog, K. G., & Sörbom, D. (1993b). PRELIS 2 user's reference guide. Chicago: Scientific Software.
Martin, J. A. (1982). Application of structural modeling with latent variables to adolescent drug use: A reply to Huba, Wingard, and Bentler. Journal of Personality and Social Psychology, 43, 598-603.
Martin, N. G., & Wilson, S. R. (1982). Bias in the estimation of heritability from truncated samples of twins. Behavior Genetics, 12, 467-472.
McGue, M., & Bouchard, T. J. (1984). Adjustment of twin data for the effects of age and sex. Behavior Genetics, 14, 325-343.
Muthén, B., & Kaplan, D. (1985). A comparison of some methodologies for the factor analysis of nonnormal Likert variables. British Journal of Mathematical and Statistical Psychology, 38, 171-189.
Neale, M. C., & Cardon, L. R. (1992). Methodology for genetic studies of twins and families. Boston: Kluwer.
Neale, M. C., Eaves, L. J., Kendler, K. S., & Hewitt, J. K. (1989). Bias in correlations from selected samples of relatives: The effects of soft selection. Behavior Genetics, 19, 163-169.
Olsson, U., Drasgow, F., & Dorans, N. J. (1982). The polyserial correlation coefficient. Psychometrika, 47, 337-347.
Pedersen, N. L., Floderus-Myrhead, B., Friberg, L., McClearn, G. E., & Plomin, R. (1984). Swedish early separated twins: Identification and characterization. Acta Geneticae Medicae et Gemellologiae, 33, 243-250.
Pedersen, N. L., McClearn, G. E., Plomin, R., Nesselroade, J. R., Berg, S., & de Faire, U. (1991). The Swedish Adoption/Twin Study of Aging: An update. Acta Geneticae Medicae et Gemellologiae, 40, 7-20.
Thomas, H. (1975). A model for interpreting genetic correlations with estimates of parameters. Psychological Bulletin, 82, 711-719.
Wylie, P. B. (1976). Effects of coarse grouping and skewed marginal distribution on the Pearson product moment correlation coefficient. Educational and Psychological Measurement, 36, 1-7.
Index
A Agresti, A., 181, 182, 213, 214, 216, 229 Aitkin, M., 41, 52 Alexander, R. A., 89, 104 Alwin, D., 37, 54 Amemiya, Y., 251,273 Andersen, E. B., 9, 10, 13, 17, 19, 34, 66, 73
Anderson, H. H., 82, 104 Anderson, T. W., 251,273
Andrich, D., 9, 13, 17-18, 27, 34 Anomalous variance, 88, 90 experimental evidence, 97, 103 Anomaly CPM versus CTM, 29, 31 identification by measurement, 6, 7 longitudinal analysis, 48 threshold order, 22 Ansatz, 108 Arminger, G., 43, 56, 58, 59, 60, 61, 63, 65, 70, 72, 73, 74
Note: Numbers in italics refer to pages on which complete author references are listed.
275
276
Index
Association variability, 147-152, 161-165; see also Between-subject variability; Withinsubject variability Autoregressive structure, 40 Axial symmetry Bowker test, 203-205, 213 reformulations, 205-210 group comparisons, 209-210
Babakus, E., 251,273 Balance condition, 118 Barb, K., 25 l, 273 Bartholomew, D. J., 169, 182 Been, E, 82, 105 Behavior continuous and discrete, comparison, 77-78, 102, 103 equilibrium, catastrophe theory, 79 Behavioral genetics, 249-250, 251 Behavioral variable, 115 Bentler, E M., 86, 93, 94, 103, 104, 251,252,
Binomial effect size display, 236-237, 240 Biomodality, 86-87, 88, 103 Bishop, Y. M. M., 205, 213, 214, 216, 226, 229
Bock, R. D., 9, 22, 34, 37, 40, 41, 42, 45, 52, 53, 72, 73 Boehnke, K., 204, 214 Bollen, K. A., 250, 25 I, 273 Bonferroni procedure, Type I error, 200 Boomsma, A., 251,273 Bortz, J., 204, 214 Bouchard, T. J., 252, 274 Bowker, A. H., 203, 213, 214 Bowker's test, axial symmetry, 203-205, 213 reformulations, 205-210 Brainerd, C. J., 80, 83, 104 Branch, maximum spanning tree, 218-219 Brandtst~idter, J., 206, 214 Breckinridge Church, R., 103, 105 Bruner, J. S., 81, 82, 104 Bumpass, L., 170, 182 Burke, C. J., 147, 167 Butterfield, E. C., 91, 104 Byrne, D. G., 38, 53
273
Berg, S., 274 BESD, see Binomial effect size display Between-subject variability, 147-150 examples, 165-167 log odds ratio, 151, 156-157, 162 null hypothesis selection, 167 simulation, 150-152 t test, 157-161, 163, 167 within-subject variability and, 161-165 Bhargava, A., 67, 68, 73 Bias association variability, 155-156, 163 latent class analysis, 136, 139-144 Bifurcation analysis, 108 Bifurcation set, cusp model, 85, 86 Binary longitudinal response conventional model, 39-41, 52 example, 49, 51 weaknesses, 42-43 examples, 37-39, 46-52 general model, 43-44, 52 example, 48-52 Monte Carlo study, 46-48, 52 parameter identification, 44-46 LISCOMP program, 44, 47 nuisance versus essential, 38-39
Call, V., 170, 182 Canonical form, see Catastrophe theory Cardon, L. R., 253, 274 Carmines, R., 169, 182 Castrigiano, D. E L., 79, 104, 114, 115, 129 Catastrophe flags, 86-89 experimental evidence, 89-103 self-organization, 109 Catastrophe germ, 113 Catastrophe theory, 79-80 applicability, 110 basic tenet, 111, 119 canonical forms Morse, 112, 113, 129 Morse-Thom distinction, 121-124 Thorn, 112-114, 115 conservation, 102-103 cusp model, 79-80, 84-86 experiments, 89-101 issues, 80-84 coordinate transformations, 111-112 discrete versus continuous behavior, 77-78, 102, 103
Index discrete stochastic systems, 125-126 Markov chains and processes, 126-128 main result, 109 metrical deterministic systems, 110, 111-115 metrical stochastic systems, 110, 115-118 Cobb approach, 118-120, 128 Zeeman approach, 121, 122-125, 128 potentials, s e e Potential theory Cederl6f, R., 257, 2 7 3 Cell-specific reliability, 176, 177 Censored variable, panel data, 56, 59, 60, 61 Chambers, N., 238, 2 4 8 Change, in variables, s e e Axial symmetry; Longitudinal analysis; Quasi-symmetry Chen, N. H., 118, 129 Chi-square partition advantages, 199 cautions, 199-200 versus log-linear models, 184, 189-190, 200-201 methods joining algorithm, 195-198 splitting algorithm, 198-199 multiway tables, 187-189 responses over time, 190-195 two-way tables, 184-187 Chi-square statistic, 183-184; see a l s o McNemar' s test association variability, 163 general longitudinal model, 48 latent variables, 251 Cholesky factor, binary responses, 41 Chou, C. P., 252, 2 7 3 Christensen, R., 216, 2 2 9 Christofferson, A., 43, 53 Clark algorithm, binary responses, 41 Clogg, C. C., 19, 34, 134, 146, 170, 176, 178, 180, 181, 182, 206, 2 1 4 Cobb, L., 87, 89, 104, 110, 118-120, 124, 128, 129
Cochran, W. G., 161, 167 Cognitive capacity, 84, 85; s e e a l s o Cusp model Cognitive development conservation, see Conservation development dynamics, 108 Cohn, M. D., 238, 2 4 8 Cohort data, s e e Panel data Collapse, categories, 19, 29-30 Collins, L. M., 134, 135, 136, 146 Computer program, s e e Software Concatenation, 7, 17
277
Conditional independence hierarchical log-linear models, 216-217, 221-222, 228-229 reliability of measurement, 172 within-subject variability, 153 Connected graph, 218 Conservation development, s e e a l s o Catastrophe flags; Catastrophe theory experiments, 89-90 cross-sectional, 93-94, 99-101 instrument, 90-93 longitudinal, 94-99 Flavell-Wohlwill model, 82-83, 86 Piaget model, 81, 82 Siegler approach, 81-82, 90-93, 97-98 Context-sensitive technique, 199 Contingency table, see a l s o Between-subject variability; Chi-square partition; Withinsubject variability association variability, 147-150 large, 215-216 reliability of measurement, 171-172 sparseness, 215 hierarchical log-linear models, 217 latent class analysis, 135-136, 137, 144145 Continuity, categorical variable latent variable assumption, 250 residualization, 258-260 Continuous behavior, and discrete behavior, comparison, 77-78, 102, 103 Control variable, 115 Coordinate transformation, 111-112 Correlation, see Genotype-environment correlation model Correlation coefficient, see Pearson correlation coefficient Countersuggestion, 101 CovGE model, see Genotype-environment correlation model CPM, s e e Cumulative probability model Creech, J. C., 251,273 Cressie, N. A. C., 135, 146 Criteria, s e e Measurement criteria; Statistical criteria Critical slowing down, 88, 103 CTM, see Cumulative threshold model Cumulative probability model, 22-23, 33 versus cumulative threshold model, 26-27 example, 28-31 scientific modeling, 33-34
278
Index
Cumulative probability model (continued) falsifiable hypothesis, 25-26, 27 fundamental measurement, 24-25 invariance of location, 23-24 Cumulative threshold model, 9-10, 32-33 construction, 17-19 versus cumulative probability model, 26-27 example, 28-31 scientific modeling, 33-34 falsifiable hypothesis, 17-22, 27 fundamental measurement, 13-17 invariance of location, 10-13 Cuneo, D. O., 82, 104 Cusp catastrophe, 121 Cusp model, 79-80, 84-86 catastrophe flags, 86-89 experimental evidence, 89-101 statistical fit, 89
D'Agostino, R. B., 258, 273 Darroch, J. N., 216, 228, 229 Davidon-Fletcher-Powell algorithm, 64 Day, D. L., 89, 104 Dayton, C. M., 136, 146 Decomposable HLLM, 217, 219-221 de Faire, U., 274 de Jong, W., 134, 146 de Leeuw, J. A., 254, 273 Delucchi, K. L., 147, 167 Dempster, A. P., 136, 138, 146 DerSimonian, R., 161, 167 DerSimonian-Laird procedure, association variability, 161-163, 164 Desarbo, W. S., 89, 104 DeShon, R. E, 89, 104 Design matrix axial symmetry, 206-209 quasi-symmetry, 211-213 Deterministic system, see Metrical deterministic system Dichotomous variable, see also Binary longitudinal response catastrophe theory, 97, 102 panel data heterogeneity, 65 model, 56-58 Diggle, P. J., 41, 43, 53
Discontinuous change, 77-78, 102-103, 108 Discontinuous development, see Conservation development Discrete behavior, and continuous behavior, comparison, 77-78, 102, 103 Discrete stochastic system, 125-128 Divergence, conservation development, 87-88, 103 linear response, 88, 103 Doob, J. L., 127, 130 Dorans, N. J., 251,274 Drasgow, E, 251,274 Duffy, D. E., 200, 202 Duncan, O. D., 5, 9, 27, 34 Duncan-Jones, E, 38, 53 Dynamical model, 107-108, 124; see also Catastrophe theory
Eaves, L. J., 250, 274 Edge cutset, multigraph, 219 Edwards, A. L., 9, 34 Edwards, D., 222, 223, 229 Eggebeen, D. J., 170, 182 Elashoff, R. M., 37, 41, 53 Eliason, S. R., 206, 214 EM algorithm, 136, 138, 145 association variability, 161 Endogenous variable, panel data, 58 Entities, as individuals, 9 Epigenetic development, 81, 108, 114; see also Conservation development Epstein, D., 46, 53 Equilibrium behavior, catastrophe theory, 79 Error term estimation problem, 67 exogenous, 56, 61-64 panel data, dichotomous model, 58 Estimation latent class analysis, 136, 142-144, 145 least squares, 44, 46; see also Weighted least-squares estimation maximum likelihood, see Maximum likelihood estimation panel data, see Panel data threshold, 20, 25-26 Everitt, B. S., 87, 104, 184, 185, 201 Evers, M., 206, 214
Index Examples with data aspirin and stroke, 187-189 breast cancer survival, 184-187 caregiving practices, 238-241 children's preferences, 208-209 conservation development cross-sectional studies, 93-94, 99-101 longitudinal study, 94-99 customer choices, 211-213 disturbed dreams, 28-31 drug use, 225-228 employment and residence, 222-225 IFO business test, 68-71 neuroticism, change/stability in, 48-52 popularity ratings, 209-210 problem behavior, decline in, 46-48 serological readings, 31-32 sex bias, 189-190 simulations association variability, 150-152 correlation coefficient, 244-246 latent class analysis, 137-146 social support indicators, 174-180 teenagers' opinions, 190-195 twin studies, 256-272 vitamin treatment effects, 204-205, 207 Exogeneity, error term, 56, 61-64 Explanatory variable, panel data, 57, 65 Extended model algorithm, see EM algorithm
Factor loading, versus latent class analysis, 134 Falconer, D. S., 25 I, 273 Falsifiable hypothesis, 6-7 cumulative probability model, 25-26, 27 cumulative threshold model, 17-22, 27 FCI, see Fundamental conditional independencies Ferguson, C. E., 251,273 Ferretti, R. E, 91, 104 Fidler, P. L., 135, 146 Fienberg, S. E., 200, 201, 205, 214, 215, 217, 229
Fisher, R. A., 31-32, 34, 167, 184, 201 Fisher scoring, 136 Fitzmaurice, G. M., 41, 53 Flaig, G., 55, 73 Flavell, J. H., 82, 86, 88, 103, 104
27g
Flavell-Wohlwill model, 82-83, 86 Floderus-Myrhead, B., 274 Florens-Zmirou, D., 125 F ratio, between-subject variability, 157-159 Freedman, D., 189, 201 Friberg, L., 274 Fundamental conditional independencies, 221222, 228-229 Fundamental measurement, 7 cumulative probability model, 24-25 cumulative threshold model, 13-17
Gardiner, C. W., 116, 118, 130 Gart, J. J., 154, 167 GAUSS facilities, panel parameter estimation, 64 Gauss-Hermite quadrature, 41 Gemcat, 89 Generalized Linear Model, 161 General threshold models, panel data, 56, 59-65 Generating class, HLLM, 216-217, 228 Generator multigraph, 216, 217-219 benefits, 228-229 examples, 222-228 fundamental conditional independencies, 221-222, 228-229 versus interaction graph, 216, 228 maximum likelihood estimation, 219-221, 229 Genotype-environment correlation model identification, 253-256 residualization, 258-272 German Socio-Economic Panel, 55, 56 Gibbons, A., 219, 229 Gibbons, R. D., 37, 40, 41, 42, 45, 53, 72, 73 Gibbs sampler algorithm, 41, 72 Gilmore, R., 79, 88, 104, 109, 110, 111, 112, 113, 114, 128, 130 Glass, G. V., 233, 248 Goldberger, A. S., 254, 255, 273 Goldin Meadow, S., 103, 105 Goldschmid, M. L., 93, 94, 104 Goleman, D., 184, 201 Goodman, L. A., 27, 29, 34, 134, 146, 170, 174, 182, 200, 201, 217, 229 Goodman's forward selection procedure, 226
280
Index
Graded response, versus traditional measurement, 4, 7, 33 Graded response model, see Cumulative probability model; Cumulative threshold model; Measurement criteria; Statistical criteria Gradient system, 108-110; see also Potential theory Graphic methods, see Generator multigraph; Interaction graph Grego, J. M., 206, 214 Growth modeling, longitudinal analysis, see Longitudinal analysis GSOEE see German Socio-Economic Panel Guastello, S. J., 89, 104 Guttman, L., 5, 17, 18, 34
Habit persistence, panel data, 58, 62 Hagenaars, J. A., 134, 146 Haken, H., 108, 130 Hamerle, A., 57, 58, 72, 73 Hamiltonian system, 110 Hand, D. J., 87, 104 Hanges, P. J., 89, 104 Hartelman, E, 121, 125, 130 Havr~nek, T., 205, 214 Hayes, S. A., 79, 104, 114, 115, 129 Heckman, J. J., 56-59, 66, 74 Heckman model, panel data, 56-59 Hedeker, D. R., 41, 53 Helbing, D., 129, 130 Helmert contrasts, 197 Henderson, A. S., 38, 48, 53 Henry, F., 251,273 Henry, N. W., 134, 146, 172, 182 Herbert, G. R., 89, 104 Hessian matrix, 109, 112 Heterogeneity, panel data estimation problem, 67 general threshold models, 59, 61-62, 65 Hewitt, J. K., 250, 274 Hierarchical log-linear model, 216-217 decomposability fundamental conditional independence, 221-222, 228-229 ML estimation, 217, 219-221,229 examples, 222-228 within-subject variability, 153
HLLM, see Hierarchical log-linear model Hogan, D. E, 170, 182 Holland, E W., 205, 214, 216, 229 Holt, J. A., 135, 146 Hopkins, K. D., 233, 248 Hosmer, D. W., 258, 273 Hsiao, C., 58, 74 Hu, L., 252, 272, 273 Huitema, B. E., 272, 273 Huseyin, K., 110, 130 Hypothesis category order, 6-7, 17-22 falsifiable, see Falsifiable hypothesis matching, see Pearson correlation coefficient Hysteresis, 87, 88 experimental evidence, 99-101, 103
IFO Institute, 56, 68-71 Inaccessibility, conservation development, 87, 88, 103 Individuals, as entities, 9 Inhelder, B., 78, 105 Initial state, panel data, 57, 65-68 Instrument versus outcome, 13 as stimulus, 9 Interaction graph, versus generator multigraph, 216, 228 Interval variable, correlation coefficient, 237, 247 Invariance, conservation development, 80 Invariance of location, 7-8 cumulative probability model, 23-24 cumulative threshold model, 10-13 Item level-specific reliability, 172-173, 177178, 180 Item-set reliability, 173-174, 176 Item-specific reliability, 173 Ito formula, catastrophe theory, 117, 119, 120, 124
Jackson, E. A., 110, 114, 130 Jansen, E G. W., 13, 34 Jaspers, J. M. E, 254, 273
Index Jedidi, K., 89, 104 Jenkins, E., 95, 105 Jensen, A. R., 254, 255, 273 Johnson, D. R., 251,273 Johnson, N. L., 152, 167 Joining algorithm, chi-square partition, 195-198 J6reskog, K. G., 37, 53, 251,258, 273
Kano, Y., 252, 273 Kaplan, D., 251,274 Karim, M. R., 41, 54 Keane, M. E, 61, 74 Kemeny, J. G., 126, 127, 130 Kemp, A. W., 152, 167 Kendler, K. S., 250, 274 Kerkman, D. D., 82, 104 Khamis, H. J., 216, 219, 221, 222, 229 Khoo, S. T., 46, 53 Kiefer, J., 66, 74 Knapp, A. W., 126, 130 Koehler, K. J., 217, 229 Kohlberg, L., 133, 146 Koppstein, E, 118, 129 Kotz, S., 152, 167 Krauth, J., 205, 214 Kreiner, S., 222, 223, 229 Kruskal, J. B., 218, 229 Kruskal's algorithm, maximum spanning trees, 218 Kuhn, T. S., 3, 5-6, 31, 34, 35 Kurtosis, residualization, 258-260 Kfisters, U., 56, 59, 64, 74
Laird, N., 38, 54, 161, 164, 167; see also DerSimonian-Laird procedure Laird, N. L., 136, 146 Laird, N. M., 37, 41, 53 Lambda measure, association, 176, 177 Landis, J. R., 161, 168; see also Miller-Landis test Langeheine, R., 134, 146, 180, 182 Latent class analysis, 133-135 local independence, 172 reliability, see Reliability
281
simulation study, 137-139 results, bias and MSE, 139-145 results, standard errors, 142-144 social support, 170, 174-180 sparseness of data cells, 135-136, 137, 144145 standard error estimation, 136, 142-144, 145 Latent trait model, 83 Latent transition model, 134-135, 144-145 Latent variable continuity assumption, 250, 251 conventional longitudinal model, 39 general longitudinal model, 44, 46 Lauritzen, S. L., 216, 229 Lazarsfeld, E E, 134, 146, 172, 182 Least-squares estimation general longitudinal model, 44, 46 weighted, see Weighted least-squares estimation Lee, S. K., 217, 229 Lemeshow, S., 258, 273 Leroy, A. M., 234, 248 Leunbach, G., 27, 35 Levin, J. R., 200, 201, 205, 214 Lewis, D., 147, 167 Liability model, 251 Liang, K. Y., 41, 53 Licht, G., 55, 73 Lieberman, 41, 53 Lienert, G. A., 204, 205, 214 Likelihood estimation, see Maximum likelihood estimation Likelihood ratio between-subject variability, 157-158, 167 chi-square partition, 184, 185-195 quasi-symmetry, 211-212 within-subject variability, 153 Likert scale panel data, 56, 60 rounding error, 246-247 LISCOMP program, 44, 47 Liu, G., 45, 53 Local independence, 172 Location, invariance of, see Invariance of location Logit model, versus chi-square partition, 189 Log-linear model axial symmetry, 205-206, 213 between-subject variability, 148, 157
282
/.dex
Log-linear model (continued) versus chi-square partition, 184, 189-190, 200-201 hierarchical, see Hierarchical log-linear model nonstandard, see Nonstandard log-linear model Log odds ratio between-subject variability, 151, 156-157, 162 within-subject variability, 154-156, 162 Long, J. D., 135, 146 Longford, N. T., 41, 53 Longitudinal analysis, see also Axial symmetry; Quasi-symmetry binary response, see Binary longitudinal response catastrophe flags, see Catastrophe flags chi-square partition, 190-192 conservation development, 94-99 latent class analysis, 134-135, 180 panel data, see Panel data Lorich, U., 257, 273 LR, see Likelihood ratio Luijkx, R., 134, 146
M
Macready, G. B., 135, 136, 146 Manning, W., 170, 182 Mantel, N., 154, 167 Mantel-Haenszel estimate, log odds ratio, 154 Marascuilo, L. A., 190, 200, 201, 202, 205, 214
Markov chain, potentials, 126-128 Markov process, potentials, 127-128 Markov type HLLM model, 217 Marks, E., 234, 248 Martin, J. A., 251,274 Martin, N. G., 250, 274 Matching hypothesis, see Pearson correlation coefficient Maximum likelihood estimation, 61-62, 249; see also EM algorithm exogeneity, 62-65 hierarchical log-linear models, 217, 219221,229 initial state alternatives, 65-68 versus weighted least-squares estimation, 250, 260, 270-272
Maximum spanning tree, 218-219, 228 Maxwell, A. E., 28, 31, 35, 184, 202 Maxwell convention, hysteresis, 99 McArdle, J. J., 46, 53 McClearn, G. E., 274 McCullagh, E, 8, 9, 13, 24, 27, 28, 35, 161, 168
McGue, M., 252, 274 McKee, T. A., 216, 219, 221,222, 229 McKelvey, R. D., 60, 71, 74 McNemar, Q., 204, 214, 233, 248 McNemar's test axial symmetry, 204, 208 versus chi-square partition, 191-192 McShane, J., 86, 104 Mean squared error, latent class analysis, 136, 139-145 Measurement, see also Fundamental measurement function of, 3, 5-7 without instrument, 4 model selection, 4-5 physical sciences, 7, 14-17 prototype, 3, 4, 14 reexpression, 15-16 role of theory, 5-7, 33-34 examples, 30, 32, 185-186 Measurement criteria, 4-5, 8, 32-34; see also Falsifiable hypothesis; Fundamental measurement; Invariance of location cumulative probability model, 22-27, 26-27 cumulative threshold model, 9-22 versus statistical criteria, 13 Measurement error, general longitudinal model, 45 Measurement parameter, 134-135 Measure-to-function shift, 122-125, 127 Mecosa program, 63, 64, 68, 71, 72 Meinhardt, H., 129, 130 Mental Health Caregiving Questionnaire, 238, 239 Meredith, W., 46, 53, 204, 205, 214 Metrical deterministic system, 1I0, 111-115 Metrically classified variable, 59-60 Metrical stochastic system, 110, 115-125 Microgenetic method, 95 Miller, M. E., 161, 168 Miller-Landis test, association variability, 163 Minimal sufficient configuration, HLLM, 216 ML estimation, see Maximum likelihood estimation
/ndex Modeling, and measurement, 4-7; see also Measurement criteria Model of Parallel Axial Symmetry, 209-210, 213 Molenaar, P. C. M., 83, 84, 86, 87, 89, 97, 102, 103, 104, 105, 108, 109, 115, 121,
283
Normality, by residualization, 258-260, 271272 NSFH, see National Survey of Families and Households
130
Monte Carlo method latent class analysis, 136 longitudinal analysis, 46-48, 52 panel data, estimation problem, 66, 67 Morrison, D. L., 86, 104 Morse canonical form, 112, 113, 129 versus Thorn form, 121-124 MSE, see Mean squared error Multigraph, generator, see Generator multigraph Multinomial within-subject variability, see Within-subject variability Multiple indicators conventional longitudinal model, 42, 52 general longitudinal model, 43-44, 45, 52 Multivariate normality, 249-251 residualization, 258-260, 271-272 Muth6n, B., 37, 38, 43-44, 45, 46, 48, 52, 53, 54, 56, 72, 74, 251,274 Muth6n, L., 52, 53
Namboodiri, N. K., 206, 214 National Survey of Families and Households, 170 Neale, M. C., 250, 253, 274 Nelder, J. A., 28, 35, 161, 168 Nelson, F. D., 60, 74 Nelson Goff, G., 46, 53 Nesselroade, J. R., 274 Neuhoff, V., 205, 214 Neyman, J., 66, 74 Nicolis, G., 108, 130 Noack, A., 27, 35 Nonlinear system, 107-108; see also Catastrophe theory Nonlinear transformation, 260, 268-269, 271272 Nonstandard log-linear model, 213 axial symmetry, 206-209 versus chi-square partition, 200-201 quasi-symmetry, 210-213
Odom, R. D., 82, 104 Oliva, T. A., 89, 104 Olsson, U., 251,274 Ordered categorical data, 56, 59, 60, 65; see also Falsifiable hypothesis
Panel data estimation method, 61- 62 initial states, 57, 65-68 strict exogeneity, 62-64 examples, 55-56, 68-71 general threshold models, 56, 59-65 Heckman model, 56-59 Parallel Axial Symmetry, 209-210, 213 Parametric structure, 13-14; see also Fundamental measurement Partition, see Chi-square partition; Invariance of location Pascual-Leone, J., 82, 83, 104 Pearson chi-square statistic, 185, 186, 201 axial symmetry, 207-209 Pearson correlation coefficient, 233-234; see also Genotype-environment correlation model binomial effect size display, 236-237, 240 degree of treatment principle, 237, 247 matching hypothesis, 237-238, 240-241, 247-248 derivation, exact match, 241-243 derivation, interval of matches, 243-244 example, 238-241,244-246 rounding error, 246-247 subgroup manipulation, 234-236 nonlinear transformation, 260, 268-269 Pedersen, N. L., 257 Peill, E. J., 81, 105 Perceptual misleadingness, 84, 85; see also Cusp model Peregoy, P. L., 89, 105 Perry, M., 103, 105
284
Index
Phenotype analysis, see Behavioral genetics Piaget, J., 78, 81, 82, 86, 94, 101, 102, 103, 105, 108, 133, 146 Pisani, R., 189, 201 Plomin, R., 274 Poisson distribution, cumulative threshold model, 15-17 Polychoric correlation coefficient, 63, 64, 251 Popper, K., 6, 35 Poston, T., 79, 105, 114, 130 Potential theory, 108-109 canonical forms, 112-114, 115, 121-124, 129 Cobb approach, 118-120, 128 coordinate transformations, 111-112 integration with catastrophe theory, 128-129 Markov chains and processes, 126-128 Zeeman approach, 121, 122-125, 128 Preece, E F. W., 99, 103, 105 Prigogine, I., 108, 130 Probit model, panel data analysis, 55-72 Probit regression, binary longitudinal data, 40, 42, 43 Proportion effects model, see Genotypeenvironment correlation model Pseudo F statistic, 158-159 Purves, R., 189, 201
Q transform, see Yule's Q transform Quantitative response, and qualitative response, comparison, 78 Quasi-symmetry, 210-213
Ramsay, J. O., 3, 7, 35 Random effects modeling association variability, 161 binary responses, 39 panel data, 59 Random walk, 116 Rao, C. R., 136, 146 Rao, J. N. K., 161, 168 Rasch, G., 5, 9, 13, 17, 33, 35 Rasch models catastrophe theory, 80, 83 CTM, see Cumulative threshold model
Read, T. R. C., 135, 146 Regression versus catastrophe theory, 79-80 versus latent class analysis, 134 panel data, general threshold models, 63-64 Reliability latent class analysis, 169-172, 181 cell-specific, 176, 177 across groups, 178-180 item level-specific, 172-173, 177-178, 180 item-set, 173-174, 176 item-specific, 173 social support example, 174-180 across time, 180 as predictability, 173-174, 177, 180 Rendtel, U., 55, 74 Repeated measurement, see Panel data Residualized categorical variable, 258-260 ML versus WLS results, 260, 270-272 nonlinear transformation, 260, 271-272 Residuals conventional longitudinal model, 39-40, 41, 42, 52 general longitudinal model, 44, 45, 52 Resistance to countersuggestion, 101 Response pattern, 134 Revuz, D., 116, 130 Reynolds, H. T., 184, 185, 202 Rindskopf, D., 200, 202, 206, 214 Ronning, G., 56, 57, 58, 61, 70, 72, 73 Rosenthal, R., 236, 237, 248 Rosett, R. N., 60, 74 Roskam, E. E., 13, 34 Rotnitzky, A. G., 41, 53 Rousseeuw, E J., 234, 248 Rovine, M., 206, 214, 235, 244, 248 Rubin, D. B., 136, 146, 236, 237, 248 Runkle, D. E., 61, 74 Rutter, C. M., 37, 41, 53
Saltus, psychometric model, 83
Samejima, F., 9, 35
Sample size, see also Sparseness
    between-subject variability, 160-161
    latent class analysis, 135-136
    longitudinal analysis, 48, 52
    maximum likelihood estimation, 272
    panel data, 64
    weighted least-squares estimation, 252, 272
    within-subject variability, 155-156
Sands, L. P., 204, 205, 214
Santner, T. J., 200, 202
Sargan, G. D., 67, 68, 73
Satorra, A., 44, 53, 252, 273
Saunders, P. T., 79, 105
Sayer, A. G., 46, 54
Scheffe test, chi-square partition, 193-194, 200
Schepers, A., 56, 59, 63, 74
Schuessler, K. F., 169, 182
Schupp, J., 55, 74
Schwartz, J. E., 170, 182
Science
    modeling framework, 5-7, 33-34
    physical measurement, 14-17
Scott, A. J., 161, 168
Scott, E., 66, 74
Self-Efficacy Survey Instrument, 238-239
Self-organization, 108-110, 114; see also Catastrophe theory
Serlin, R. C., 190, 202
Shapiro-Wilk test, normality, 258
Shihadeh, E. S., 19, 34
Siegler, R. S., 81-82, 90-93, 95, 97-98, 105
Sijtsma, K., 82, 105
Simulations, see Examples with data
Simultaneous equations, panel data, 56, 62, 71
Singularities, catastrophe theory, 79
Size, categories
    fundamental measurement, 16
    versus traditional measurement, 7
Skewness, residualization, 258-260
Smyer, M. A., 238, 248
Snell, J. L., 126, 130
Sobel, M., 60, 74
Social science, model selection, 6
Social support, latent class analysis, 170, 174-180
Software
    chi-square partition, 199
    LISCOMP, 44, 47
    Mecosa, 63, 64, 68, 71, 72
    reliability indices, 181
Sörbom, D., 37, 53, 251, 258, 273
Sparseness, 215
    HLLM, 217
    latent class analysis, 135-136, 137, 144-145
Speed, T. P., 216, 229
Spiel, C., 209, 214
Splitting algorithm, chi-square partition, 198-199
Spore, D., 238, 248
Standard error, latent class analysis, 136, 142-144, 145
Stationary probability density, 117-120, 125, 129
Statistical criteria, 8-9
    cumulative probability model, 24, 26, 27
    cumulative threshold model, 27
    versus measurement criteria, 13
Steiner, V., 55, 73
Stenbeck, M., 9, 34
Stewart, I., 79, 89, 105, 114, 130
Stewart, M. B., 60, 74
Stimulus, as instrument, 9
Stiratelli, R., 38, 41, 54
Stochastic differential equation, 116-118
Stochastic processes, see Potential theory
Stochastic system, see Discrete stochastic system; Metrical stochastic system
Structural equation modeling, 43, 250, 251
Sudden jump
    conservation development, 86, 88, 90, 95, 102
    self-organization, 108
Summers, G., 37, 54
Sussmann, H. J., 102, 105
Sweet, J., 170, 182
Ta'eed, L. K., 89, 105
Ta'eed, O., 89, 105
Taylor series, potential theory, 112, 113
Terry, H., 204, 214
Tetrachoric correlation, residualization, 260, 268-269
Theory, role in measurement, 5-7, 33-34
    examples, 30, 32, 185-186
Thom, R., 108, 110, 114, 130
Thomas, H., 254, 255, 274
Thom canonical form, 112-114, 115
    versus Morse form, 121-124
Threshold, see also Cumulative threshold model; Falsifiable hypothesis; Invariance of location
    behavioral genetics, 251
    CPM versus CTM example, 28-31
    distances, 19
    equal discriminations, 19, 20, 22
    estimates, 20, 25-26
    general model, panel data, 56, 59-65
    location versus distribution, 13
    order, 20, 22, 25-26, 31-32
    partition of continuum, 10
Thurstone, L. L., 5, 9, 22, 33, 34, 35
Thurstone CPM, see Cumulative probability model
Tisak, J., 46, 53
Tobin, J., 60, 74
Transition, see also Latent class analysis
    conservation development, 85, 86, 95, 98, 102-103
    epigenetic processes, 114-115
Tree, see Maximum spanning tree
True state dependence, panel data, 57
t test, between-subject variability, 157-161, 163, 167
Twin study data, 252-253, 256-258
    CovGE model, 253-256
    residualization, 258-272
Upton, G. J. G., 213, 214
van der Maas, H. L. J., 84, 86, 89, 90, 97, 102, 103, 105, 109, 114, 121, 130
van de Pol, F., 134, 146, 180, 182
Van Kampen, N. G., 127, 128, 130
van Maanen, L., 82, 105
Variability, see Association variability; Between-subject variability; Within-subject variability
von Eye, A., 206, 214, 235, 244, 248
Wagner, G., 55, 74
Walma van der Molen, J., 97, 105
Ware, J. H., 37, 38, 53, 54
Wedderburn, R. W. M., 28, 35
Weighted least-squares estimation, 250-253
    disadvantages, 252
    versus maximum likelihood estimation, 250, 260, 270-272
Weighted t function, 158-160
Wheaton, B., 37, 54
Wickens, T., 148, 152, 168, 184, 185, 202, 205, 213, 214, 216, 229
Wiener process, random walk, 116
Willett, J. B., 46, 54
Wilson, M., 80, 83, 105
Wilson, S. R., 250, 274
Within-subject variability, 149-150, 152-153
    between-subject variability and, 161-165
    log odds ratio, 154-156, 162
    null hypothesis selection, 156
    simulation, 150-152
WLS estimation, see Weighted least-squares estimation
Wohlwill, J. F., 82, 86, 88, 103, 104, 107-108, 130; see also Flavell-Wohlwill model
Wolfowitz, J., 66, 74
Woolf, B., 154, 168
Woolf's statistic, association variability, 154, 162-163
Wright, B. D., 5, 7, 35
Wright, J. C., 82, 104
Wright, J. E., 89, 105
Wugalter, S. E., 134, 135, 146
Wylie, E. B., 251, 274
Yor, M., 116, 130
Yule, G. U., 147, 168
Yule's Q transform, 173, 174, 177
Zacks, S., 87, 89, 104, 110, 118-119, 129
Zahler, R. S., 102, 105
Zavoina, W., 60, 71, 74
Zeeman, E. C., 77-78, 79, 84, 105, 121, 122-125, 127, 128, 130
Zeger, S., 41, 53, 54
Zeller, R., 169, 182
Zweifel, J. R., 154, 167
Zwick, R., 205, 214
Zygosity, twin study data, 257